Journal of Jilin University (Engineering and Technology Edition) ›› 2023, Vol. 53 ›› Issue (9): 2620-2625. doi: 10.13229/j.cnki.jdxbgxb.20220434

• Computer Science and Technology •

Dynamic outlier detection algorithm for network large data set based on classification and regression trees decision tree

Li-fang FU1, Zhuo CHEN2, Chang-lin AO2

  1. College of Science, Northeast Agricultural University, Harbin 150030, China
  2. College of Engineering, Northeast Agricultural University, Harbin 150030, China
  • Received: 2022-04-18 Online: 2023-09-01 Published: 2023-10-09
  • Contact: Chang-lin AO E-mail: fulifang7895@163.com; chenweiliang7895@163.com
  • About the author: Li-fang FU (1974-), female, associate professor, Ph.D. Research interests: agricultural online public opinion, big data analysis. E-mail: fulifang7895@163.com
  • Funding:
    National Natural Science Foundation of China (71874026)

Abstract:

Big data sets contain massive amounts of data, and once the data scale grows beyond a certain point, the efficiency of outlier detection becomes limited. To address this problem, a dynamic outlier detection algorithm for large network data sets based on the classification and regression trees (CART) decision tree is proposed. First, criteria for abnormal data in large data sets are defined, variance is used to measure the degree of data dispersion, a support vector machine is used to build an association rule matrix of abnormal data samples, the range of abnormal data in the large data set is determined, and a dynamic grid partitioning strategy is used to reduce the computational cost of outlier detection. Then, the CART decision tree method applies Boolean tests at the branch nodes, treats all data to be detected uniformly as continuous data, sorts the training data set in ascending order, computes the maximum information gain of the data, and prunes the decision tree until no non-leaf node can be replaced, which yields the dynamic outlier detection results. Simulation results show that the proposed algorithm achieves high outlier detection accuracy and short detection time, offers a significant computational advantage, and can support the reliable application of large data sets.

Key words: classification and regression trees (CART) decision tree, large data sets, outlier detection, data preprocessing, grid partitioning, Gini coefficient
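
The split-selection step described in the abstract (ascending sort, a Boolean test at each branch node, choosing the split with the highest gain) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `gini` and `best_split`, the use of Gini impurity decrease as the gain measure (suggested only by the "Gini coefficient" keyword), the midpoint candidate thresholds, and the sample values are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of class labels (0.0 for a pure or empty set)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(values: List[float], labels: List[int]) -> Optional[Tuple[float, float]]:
    """Return (threshold, gini_decrease) for the best test `value <= threshold`.

    Candidate thresholds are midpoints between consecutive distinct sorted
    values. Returns None when there are fewer than two distinct values.
    """
    pairs = sorted(zip(values, labels))      # ascending order by feature value
    ys = [y for _, y in pairs]
    n = len(pairs)
    parent = gini(ys)
    best: Optional[Tuple[float, float]] = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue                         # equal values cannot be separated
        left, right = ys[:i], ys[i:]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted             # impurity decrease of this split
        if best is None or gain > best[1]:
            best = ((lo + hi) / 2.0, gain)
    return best


if __name__ == "__main__":
    # Hypothetical one-dimensional sample: label 1 marks the outliers.
    values = [1.0, 2.0, 1.5, 0.5, 9.0, 10.0]
    labels = [0, 0, 0, 0, 1, 1]
    threshold, gain = best_split(values, labels)
    print(threshold, round(gain, 3))         # prints: 5.5 0.444
```

A full CART tree would apply `best_split` recursively to each child subset and then prune; that part is omitted here because the page does not describe the pruning criterion in detail.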

CLC Number:

  • TP393

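Table 1 Analysis criteria for abnormal data in large data sets

No.  Type          Potential manifestation                              Analysis result
1    Data failure  Parsing task cannot be completed                     No filtering needed
                   Value outside the valid range                        No filtering needed
2    Data jump     Data changes by a large amount                       No filtering needed
                   Data changes and then immediately returns to normal  No filtering needed
                   Data changes by a very small amount                  Filtering needed
3    Other         Data state is stable                                 Filtering needed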

Table 2 Experimental sample data

Sample  Data volume (×10^4 records)  Sample size (MB)
A       30                           894
B       40                           1056
C       50                           1719
D       60                           2493

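Fig. 1 Comparison of outlier detection accuracy

Fig. 2 Comparison of speedup ratio results for the three methods

Fig. 3 Comparison of scalability results for the three methods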
