Journal of Jilin University (Engineering and Technology Edition), 2022, Vol. 52, Issue (6): 1422-1427. doi: 10.13229/j.cnki.jdxbgxb20210511

Outlier mining algorithm for high dimensional categorical data streams based on spectral clustering

Yao-long KANG1, Li-lu FENG2, Jing-an ZHANG3, Fu CHEN4

  1. School of Computer and Network Engineering, Shanxi Datong University, Datong 037009, China
  2. School of Education Science and Technology, Shanxi Datong University, Datong 037009, China
  3. Computer Network Center, Shanxi Datong University, Datong 037009, China
  4. School of Mathematics and Statistics, Shanxi Datong University, Datong 037009, China
  • Received: 2021-06-07  Online: 2022-06-01  Published: 2022-06-02

Abstract:

To detect abnormal data in a data stream promptly and reduce potential threats to the network, an outlier mining algorithm for high-dimensional categorical data streams based on spectral clustering is proposed. The ordered, high-speed and high-dimensional characteristics of data streams are analyzed, and the main sources of outliers are explored. Using an attribute weight quantization method that introduces information entropy, strongly correlated data streams are merged, and the dimensionality of the data stream is then reduced to limit interference. In the spectral clustering step, the key scale parameters are set, the distance between the sample and the target is calculated from the affinity matrix, the clustering is transformed into an undirected graph partitioning problem, the feature matrix is obtained, and the significant outlier features are extracted. Using a distance-based mining method, data blocks are added to the data stream, the probability distribution between two adjacent data blocks is evaluated, a sliding window is set, and the distance between each data point and the sliding window is computed and compared with a preset threshold. Data identified as outliers are added to the outlier set, which completes the mining.

The simulation results show that, for data streams of different sizes and of different dimensions, the execution time of the algorithm stays within 42 s and 40 s, respectively. The algorithm scales well with both the size and the dimensionality of the data stream, and the mined outliers are consistent with the actual situation.
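The final mining step described in the abstract (entropy-based attribute weights, a sliding window, and a distance threshold) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the normalized Shannon-entropy weighting, the weighted mismatch distance, and the choice to keep flagged records out of the window are all assumptions made for illustration, and the spectral clustering stage is omitted.

```python
import math
from collections import Counter, deque


def entropy_weights(records):
    """Assumed weighting: Shannon entropy of each categorical attribute,
    normalized so that the weights sum to 1."""
    n_attrs = len(records[0])
    total = len(records)
    entropies = []
    for j in range(n_attrs):
        counts = Counter(r[j] for r in records)
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    s = sum(entropies)
    if s == 0:
        # Every attribute is constant: fall back to uniform weights.
        return [1.0 / n_attrs] * n_attrs
    return [h / s for h in entropies]


def weighted_mismatch(a, b, weights):
    """Weighted Hamming distance between two categorical records."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


def mine_outliers(stream, window_size, threshold, weights):
    """Return indices of records whose mean distance to the current
    sliding window exceeds the threshold."""
    window = deque(maxlen=window_size)
    outliers = []
    for idx, rec in enumerate(stream):
        if len(window) == window_size:
            mean_d = sum(weighted_mismatch(rec, w, weights) for w in window) / window_size
            if mean_d > threshold:
                outliers.append(idx)
                continue  # assumed policy: flagged records do not enter the window
        window.append(rec)
    return outliers


# Hypothetical toy stream of (protocol, port, status) records.
stream = [("tcp", "80", "ok")] * 6 + [("udp", "9999", "fail")] + [("tcp", "80", "ok")] * 2
weights = entropy_weights(stream)
print(mine_outliers(stream, window_size=4, threshold=0.5, weights=weights))
```

In this sketch the window is a fixed-length `deque`, so old records drop out automatically as new ones arrive. The threshold and window size are free parameters here, whereas the paper sets them as part of its method.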

Key words: computer application, spectral clustering algorithm, high dimensional category attribute, data flow, outlier mining, sliding window

CLC Number: 

  • TP274

Table 1

Number of outliers in different data blocks

Data block    Outliers at outlier degree 0.02    Outliers at outlier degree 0.04
1             100                                200
2             80                                 160
3             110                                185
4             95                                 210
5             105                                205
6             110                                170
7             70                                 160

Fig.1

Number of outliers mined when the degree of outlier is 0.02

Fig.2

Number of outliers mined when the degree of outlier is 0.04

Fig.3

Scalability test for data stream size

Fig.4

Scalability test for data stream dimension
