Journal of Jilin University (Engineering and Technology Edition), 2023, Vol. 53, Issue (10): 2917-2922. doi: 10.13229/j.cnki.jdxbgxb.20220689

• Computer Science and Technology •

Detection method of abnormal data in cube based on spectral clustering

Shi-jun SONG1, Min FAN2

  1. School of Transportation and Logistics, Southwest Jiaotong University, Chengdu 610031, China
  2. School of Civil Engineering, Southwest Jiaotong University, Chengdu 610031, China
  • Received: 2022-06-12 Online: 2023-10-01 Published: 2023-12-13
  • Contact: Min FAN E-mail: songshijun20220@yeah.net; fanmin@swjtu.edu.cn
  • About the author: Shi-jun SONG (1981-), female, assistant research fellow, doctoral candidate. Research interests: logistics engineering, intelligent computing, information security. E-mail: songshijun20220@yeah.net
  • Funding:
    National Key Research and Development Program of China (2023YFB2304204)

Abstract:

Existing methods for detecting abnormal data in multidimensional data sets (cubes) do not reduce the dimensionality of the data, which leads to low detection accuracy, a high false detection rate, and long detection times. To address this, a method for detecting abnormal data in multidimensional data sets based on spectral clustering is proposed. First, the data in the multidimensional data set are clustered by means of the Laplacian matrix to obtain a preliminary classification. Second, the locally linear embedding (LLE) algorithm is applied to the classified data to reduce their dimensionality, so that the high-dimensional data set is expressed by eigenvectors and redundant information is removed. Finally, the processed data set is input into a support vector machine model, and abnormal data are detected according to the computed regression estimates. Experimental results show that the proposed algorithm achieves higher accuracy, a lower false detection rate, and a shorter detection time when detecting abnormal data in multidimensional data sets.

Key words: Laplacian matrix, spectral clustering, data dimensionality reduction, multidimensional data set (cube), support vector machine algorithm
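
The first stage named in the abstract, clustering through the Laplacian matrix, can be illustrated with a minimal sketch. It assumes a Gaussian affinity, the symmetric normalized Laplacian, and a two-way split on the sign of the Fiedler vector. None of these choices come from the paper, whose exact construction is not given on this page, and the names `spectral_bipartition` and `sigma` are hypothetical. The LLE and support vector machine stages are not shown.

```python
import numpy as np


def spectral_bipartition(X, sigma=1.0):
    """Split the rows of X into two clusters using the graph Laplacian.

    Illustrative sketch only: the affinity, the Laplacian variant, and the
    sign-based split are assumptions, not the paper's stated construction.
    """
    # Pairwise squared Euclidean distances between all samples.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian (RBF) affinity matrix; sigma is an assumed bandwidth.
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(X)) - (d_inv_sqrt[:, None] * W) * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order, so column 1 belongs to
    # the second-smallest eigenvalue (the Fiedler vector).
    _, eigvecs = np.linalg.eigh(L)
    # Rescale by D^{-1/2} so the entries are roughly constant per cluster.
    fiedler = eigvecs[:, 1] * d_inv_sqrt
    # The sign of each entry assigns the sample to one of two clusters.
    return (fiedler > 0).astype(int)


# Toy data (not from the paper): two Gaussian blobs of 20 points each.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 2)),
    rng.normal(4.0, 0.5, size=(20, 2)),
])
print(spectral_bipartition(X_demo, sigma=1.5))
```

Which blob receives label 0 or 1 depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful. For more than two clusters, a common extension is to run k-means on the first k eigenvectors instead of thresholding a single one.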

CLC Number:

  • TP393

Fig. 1

Abnormal data detection process

Table 1

Statistics of the databases

Dataset name         Number of attributes    Number of records
Isolet               6                       737
Multiple Features    10                      18 282
KDDCUP1999           15                      25 021

Fig. 2

Detection accuracy of the three methods

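Table 2

False detection rates of the three methods

Method           Parameter                   Isolet    Multiple Features    KDDCUP1999
Proposed         False detections (count)    1         0                    2
                 False detection rate (%)    2         0                    4
Reference [3]    False detections (count)    5         8                    10
                 False detection rate (%)    8         12                   20
Reference [4]    False detections (count)    10        12                   18
                 False detection rate (%)    20        22                   32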
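
As a rough illustration of how a false detection rate of this kind can be computed, the sketch below counts normal records that a detector flags as abnormal and divides by the number of normal records. This definition is an assumption: the page does not state the denominator behind the rates in the table. The function name `false_detection_rate` and the toy labels are hypothetical and are not the paper's data.

```python
def false_detection_rate(true_labels, predicted_labels):
    """Percentage of normal records (label 0) wrongly flagged abnormal (label 1).

    Assumed definition for illustration; the paper's own formula is not
    given on this page.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    normal_total = sum(1 for t in true_labels if t == 0)
    if normal_total == 0:
        raise ValueError("no normal records to evaluate")
    # A false detection is a normal record that the detector marked abnormal.
    false_alarms = sum(
        1 for t, p in zip(true_labels, predicted_labels) if t == 0 and p == 1
    )
    return 100.0 * false_alarms / normal_total


# Toy labels: 50 normal records followed by 5 abnormal ones.
true_demo = [0] * 50 + [1] * 5
# The detector wrongly flags one normal record and catches all abnormal ones.
pred_demo = [0] * 49 + [1] + [1] * 5
print(false_detection_rate(true_demo, pred_demo))  # 2.0
```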

Fig. 3

Detection time of the three methods

1 Zhao Chen-xiao, Xue Hui-feng, Wang Lei, et al. Water consumption abnormal data detection method based on isolation forest[J]. Journal of China Institute of Water Resources and Hydropower Research, 2020, 18(1): 31-39.
2 Li Chen, Wang Bu-hong, Tian Ji-wei, et al. Anomaly detection method for UAV sensor data based on LSTM-OCSVM[J]. Journal of Chinese Computer Systems, 2021, 42(4): 700-705.
3 Wu Jin-e, Wang Ruo-yu, Duan Qian-qian, et al. Collective data anomaly detection based on reverse k-nearest neighbor filtering[J]. Journal of Shanghai Jiaotong University, 2021, 55(5): 598-606.
4 Qiu Kai, Jiang Ying. A service running data anomaly detection method based on weighted LOF and context judgment in cloud environment[J]. Computer Engineering and Science, 2020, 42(6): 951-958.
5 Qiu Yuan, Chang Xiang-mao, Qiu Qian, et al. Stream data anomaly detection method based on long short-term memory network and sliding window[J]. Journal of Computer Applications, 2020, 40(5): 1335-1339.
6 Wang Qiu-ping, Ding Cheng, Wang Xiao-feng. A hybrid data clustering algorithm based on improved krill herd algorithm and KHM clustering[J]. Control and Decision, 2020, 35(10): 2449-2458.
7 Shi Xian-feng, Liu Xue-jun, Zhang Li. PUseqClust: a clustering analysis method for RNA-Seq data[J]. Journal of Software, 2019, 30(9): 2857-2868.
8 Qian Xiao-dong, Luo Yan-fu. Incomplete data clustering algorithm based on mutual information attributes ranking[J]. Information and Control, 2019, 48(1): 80-87.
9 Liu Ying, Zhang Yan-bang. Application of Laplacian matrix in clustering[J]. Journal of Tianjin University of Science & Technology, 2019, 34(3): 76-80.
10 Wei Shi-chao, Li Xin, Zhang Yi-chi, et al. Dimension reduction and visualization of mixed-type data based on E-t-SNE[J]. Computer Engineering and Applications, 2020, 56(6): 66-72.
11 Guo Fang-fang, Lv Hong-wu, Ren Wei-lin, et al. Reduction algorithm based on supervised discriminant projection for network security data[J]. Journal on Communications, 2021, 42(6): 84-93.
12 Li Shi-bo, Lin Hui, Ge Miao. Hyperspectral dimensionality reduction and classification of the east Dongting lake wetland vegetation[J]. Journal of Central South University of Forestry & Technology, 2019, 39(11): 36-41.
13 Liu Li, Li Yong, Cao Yi-jia, et al. Transient rotor angle stability prediction method based on SVM and LSTM network[J]. Electric Power Automation Equipment, 2020, 40(2): 129-139.
14 Lei Qing-zhu, Qin Yong-song. Empirical likelihood for nonparametric regression functions under strong mixing samples[J]. Acta Mathematicae Applicatae Sinica, 2019, 42(2): 179-196.
15 Zhao Hai-yan, Liu Kun, Wang Ting-mei, et al. Simulation of abnormal information acquisition for network text implication relationship recognition[J]. Computer Simulation, 2020, 37(8): 256-260.