Journal of Jilin University (Engineering and Technology Edition), 2026, Vol. 56, Issue (2): 509-515. doi: 10.13229/j.cnki.jdxbgxb.20241184


Unbalanced drift big data stream classification algorithm based on spectral clustering undersampling

Yao-long KANG1, Li-lu FENG2, Jing-an ZHANG3

  1. School of Computer and Network Engineering, Shanxi Datong University, Datong 037009, China
  2. School of Education Science and Technology, Shanxi Datong University, Datong 037009, China
  3. Network Information Center, Shanxi Datong University, Datong 037009, China
  • Received: 2024-11-05  Online: 2026-02-01  Published: 2026-03-17

Abstract:

In imbalanced data classification, majority class samples dominate in number, and their distribution exerts a significant "pulling" effect on clustering results. Minority class samples, being few, have relatively indistinct features within the dataset as a whole, which leads to drift in the data stream and degrades its classification performance. To address this issue, an imbalanced drift big data stream classification algorithm based on spectral clustering undersampling is studied. Undersampling is used to reduce the redundant majority class data in the imbalanced drift big data stream, balancing the amounts of majority class and minority class data and alleviating the data drift caused by the clustering "pull". Core points of the balanced big data stream are then selected to form a core point set, which is clustered with a spectral clustering algorithm. Classification of the imbalanced drift big data stream is carried out based on the clustering structure obtained from spectral clustering and the selected core points. Experimental results show that the algorithm balances imbalanced drift big data streams, reducing the average imbalance degree after processing to 1.024, close to the balanced state. It also selects and effectively groups core points for big data streams with different attributes, which supports the subsequent effective application of such streams.
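The abstract describes the pipeline only at a high level. The Python sketch below, assuming NumPy is available, shows one plausible reading of it for a single window of the stream: balance the classes, pick core points, cluster the core points spectrally, and label new samples through the nearest core point. The function names, the random undersampling rule, the k-nearest-neighbour density criterion for core points, the Gaussian affinity width `sigma`, and the nearest-core-point labelling are all illustrative assumptions rather than the authors' published procedure, and drift across windows is not modelled.

```python
import numpy as np

def undersample(X, y, rng):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    labels, counts = np.unique(y, return_counts=True)
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == lab), size=counts.min(), replace=False)
        for lab in labels
    ])
    return X[keep], y[keep]

def select_core_points(X, y, k=5, frac=0.5):
    """Per class, keep the densest `frac` of points (smallest mean k-NN distance)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    knn_dist = np.sort(dist, axis=1)[:, 1:k + 1].mean(axis=1)  # column 0 is the point itself
    core = []
    for lab in np.unique(y):
        members = np.flatnonzero(y == lab)
        n_core = max(1, int(frac * len(members)))
        core.extend(members[np.argsort(knn_dist[members])[:n_core]])
    core = np.array(core)
    return X[core], y[core]

def spectral_clusters(X, n_clusters=2, sigma=1.0, n_iter=50):
    """Normalized spectral clustering: Gaussian affinity, Laplacian eigenvectors, k-means."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(W.sum(axis=1), 1e-12))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues come back in ascending order
    U = vecs[:, :n_clusters]
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    # k-means on the embedded rows, seeded with farthest-point initialization
    centers = [U[0]]
    for _ in range(1, n_clusters):
        gap = np.min([np.linalg.norm(U - c, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(gap)])
    centers = np.array(centers)
    for _ in range(n_iter):
        assign = np.argmin(
            np.linalg.norm(U[:, None, :] - centers[None, :, :], axis=2), axis=1)
        centers = np.array([
            U[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(n_clusters)
        ])
    return assign

def classify(X_train, y_train, X_new, rng):
    """Balance, pick core points, cluster them, then label new rows by nearest core point."""
    Xb, yb = undersample(X_train, y_train, rng)
    Xc, yc = select_core_points(Xb, yb)
    clusters = spectral_clusters(Xc)
    cluster_label = {}
    for c in np.unique(clusters):
        values, counts = np.unique(yc[clusters == c], return_counts=True)
        cluster_label[int(c)] = values[np.argmax(counts)]  # majority label in the cluster
    nearest = np.argmin(
        np.linalg.norm(X_new[:, None, :] - Xc[None, :, :], axis=2), axis=1)
    return np.array([cluster_label[int(clusters[i])] for i in nearest])

# Toy window: 60 majority rows around (0, 0) and 20 minority rows around (5, 5)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (60, 2)), rng.normal(5.0, 0.5, (20, 2))])
y = np.array([0] * 60 + [1] * 20)
pred = classify(X, y, np.array([[0.1, -0.2], [5.2, 4.9]]), rng)
```

Clustering only the core point set keeps the affinity matrix small, which is the practical reason to select core points before the eigendecomposition; the dense pairwise distance computations here would need to be replaced for genuinely large streams.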

Key words: spectral clustering, undersampling, imbalance, drift big data stream, core point set, group division

CLC Number: TP391

Fig. 1 Core point selection and grouping effect

Table 1 Initial state of the experimental big data streams

Big data stream   Attributes   Majority class data volume   Minority class data volume   Imbalance degree
pima              10           510                          277                          1.84
car               7            1200                         373                          3.22
sonar             25           122                          95                           1.28
yeast             10           452                          418                          1.08
ionosphere        36           214                          115                          1.86
glass             11           75                           16                           4.69
breast-w          30           504                          280                          1.80
letter            17           778                          755                          1.03
seg               124          2162                         463                          4.67
wdbc              31           346                          201                          1.72
haberman          5            214                          115                          1.86
artificial        28           5674                         813                          6.98
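The imbalance degree in Table 1 appears to be the ratio of majority class to minority class data volume (for pima, 510/277 ≈ 1.84). The short Python check below assumes that definition, which is inferred from the table values rather than stated on this page; `imbalance_degree` is a hypothetical helper name.

```python
def imbalance_degree(majority: int, minority: int) -> float:
    """Assumed definition: majority class count divided by minority class count."""
    return majority / minority

# (majority count, minority count, reported imbalance degree) copied from Table 1
table1 = {
    "pima": (510, 277, 1.84),
    "car": (1200, 373, 3.22),
    "sonar": (122, 95, 1.28),
    "yeast": (452, 418, 1.08),
    "ionosphere": (214, 115, 1.86),
    "glass": (75, 16, 4.69),
    "breast-w": (504, 280, 1.80),
    "letter": (778, 755, 1.03),
    "seg": (2162, 463, 4.67),
    "wdbc": (346, 201, 1.72),
    "haberman": (214, 115, 1.86),
    "artificial": (5674, 813, 6.98),
}

# Largest gap between the recomputed ratio and the reported value
max_gap = max(
    abs(imbalance_degree(maj, mino) - reported)
    for maj, mino, reported in table1.values()
)
print(f"largest deviation from the reported values: {max_gap:.4f}")
```

Every row agrees with the reported value to within two-decimal rounding, which supports the assumed ratio definition.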

Fig. 2 Partial data drift of the experimental big data streams

Table 2 Results of undersampling and balancing the big data streams with the proposed algorithm

Big data stream   Attributes   Majority class data volume   Minority class data volume   Imbalance degree
pima              10           298                          277                          1.076
car               7            377                          373                          1.011
sonar             25           100                          95                           1.053
yeast             10           419                          418                          1.002
ionosphere        36           117                          115                          1.017
glass             11           17                           16                           1.063
breast-w          30           282                          280                          1.007
letter            17           757                          755                          1.003
seg               124          466                          463                          1.006
wdbc              31           203                          201                          1.010
haberman          5            118                          115                          1.026
artificial        28           815                          813                          1.002
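As a quick cross-check of the balancing effect, the Python snippet below averages the imbalance degree columns copied from Table 1 and Table 2. The variable names are illustrative, and a plain arithmetic mean over the twelve streams is an assumption about how the reported average was formed.

```python
from statistics import mean

# Imbalance degree per stream, in table order (pima ... artificial)
before = [1.84, 3.22, 1.28, 1.08, 1.86, 4.69, 1.80, 1.03, 4.67, 1.72, 1.86, 6.98]  # Table 1
after = [1.076, 1.011, 1.053, 1.002, 1.017, 1.063,
         1.007, 1.003, 1.006, 1.010, 1.026, 1.002]                                  # Table 2

mean_before = mean(before)
mean_after = mean(after)
print(f"mean imbalance degree before: {mean_before:.3f}, after: {mean_after:.3f}")
```

With these values the mean falls from about 2.67 to about 1.02, in line with the average of 1.024 quoted in the abstract.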

Fig. 3 Core point selection and grouping results for each experimental big data stream after balancing

Fig. 4 Classification results of the big data streams using the proposed algorithm
