Journal of Jilin University (Engineering and Technology Edition), 2023, Vol. 53, Issue 9: 2620-2625. doi: 10.13229/j.cnki.jdxbgxb.20220434
Li-fang FU, Zhuo CHEN, Chang-lin AO
Abstract:
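Large data sets contain massive amounts of data, and once the data scale grows beyond a certain point the efficiency of outlier detection becomes limited. To address this problem, a dynamic outlier detection algorithm for large network data sets is proposed, based on a classification and regression tree (CART) decision tree. First, a criterion for abnormal data in the large data set is defined: variance is used to measure the degree of data dispersion, a support vector machine is used to build an association rule matrix of abnormal data samples, the range of abnormal data in the data set is determined, and a dynamic grid partitioning strategy is applied to reduce the computational cost of outlier detection. Then, the CART decision tree method applies a Boolean test at each branch node: all data to be tested are treated as continuous, the training data set is sorted in ascending order, the maximum information gain is computed, and the decision tree is pruned until no non-leaf node can be replaced, which yields the dynamic outlier detection result. Simulation results show that the proposed algorithm achieves high outlier detection accuracy with short detection time, giving it a clear computational advantage and supporting the reliable use of large data sets.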
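The branch-node step described in the abstract (treat a feature as continuous, sort it in ascending order, and pick the Boolean test with the highest information gain) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `entropy` and `best_boolean_split`, the toy data, and the choice of midpoints between adjacent values as candidate thresholds are all assumptions made for illustration, and the grid partitioning, support vector machine, and pruning steps are omitted.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_boolean_split(values, labels):
    """Return (threshold, gain) for the best test `value <= threshold`.

    Hypothetical helper: sorts the samples in ascending order, tries the
    midpoint between each pair of adjacent distinct values as a threshold,
    and keeps the one with the highest information gain.
    """
    pairs = sorted(zip(values, labels))  # ascending order by feature value
    n = len(pairs)
    base = entropy([y for _, y in pairs])
    best_threshold, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        gain = (base
                - (len(left) / n) * entropy(left)
                - (len(right) / n) * entropy(right))
        if gain > best_gain:
            best_threshold, best_gain = (lo + hi) / 2, gain
    return best_threshold, best_gain


# Toy data (assumed): four ordinary readings and two obvious outliers.
values = [1.0, 1.2, 0.9, 1.1, 9.5, 10.2]
labels = ["normal", "normal", "normal", "normal", "outlier", "outlier"]
threshold, gain = best_boolean_split(values, labels)
print(threshold, gain)
```

On this toy data the selected threshold falls between the largest normal reading (1.2) and the smallest outlier (9.5), so the single Boolean test separates the two classes completely. A full tree would apply the same search recursively at each branch node and then prune.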