Journal of Jilin University (Engineering and Technology Edition) ›› 2022, Vol. 52 ›› Issue (6): 1434-1441. doi: 10.13229/j.cnki.jdxbgxb20210098


Incremental k-means clustering algorithm based on multi-sphere splitting

Fu-heng QU1(),Chao-yue QIAN1,Yong YANG1,2(),Yang LU1,Jian-fei SONG1,Ya-ting HU3   

  1. College of Computer Science and Technology, Changchun University of Science and Technology, Changchun 130022, China
    2. College of Education, Changchun Normal University, Changchun 130032, China
    3. College of Information Technology, Jilin Agricultural University, Changchun 130118, China
  • Received: 2021-01-27 Online: 2022-06-01 Published: 2022-06-02
  • Contact: Yong YANG E-mail:qufuheng@163.com;yy@mail.ccsfu.edu.cn

Abstract:

To address the problem that the Ball k-means (BKM) algorithm is sensitive to the initial centers and its solution accuracy is low, an incremental k-means clustering algorithm based on multi-sphere splitting is proposed. The algorithm exploits the fact that BKM must record the clustering radius of each ball: starting from a given number of initial cluster centers, it generates multiple incremental cluster centers at once according to a fixed step until k initial centers are obtained. It avoids computing distances between all n s-dimensional data points, reducing the cost of traditional incremental cluster-center selection from O(n²s) to O(k log k) (k < n) without increasing the space complexity. Comparisons with k-means, BKM, IK-means-+, and three typical incremental clustering algorithms on eight UCI datasets show that the proposed algorithm reduces the initialization sensitivity of BKM and improves solution accuracy by 37.15%~66.92%; it is the only one of the compared incremental algorithms that improves the efficiency of k-means, and its efficiency advantage becomes more pronounced as the number of clusters increases.
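The incremental scheme described above can be sketched in a few lines: grow from a small set of centers toward k by repeatedly splitting the "widest" balls. This is a hedged illustrative sketch only; the splitting criterion (largest cluster radius), the step size, and the choice of new center (the farthest in-cluster point) are assumptions for illustration, not the paper's exact rules.

```python
import numpy as np

def lloyd(X, centers, iters=20):
    """Plain Lloyd's k-means refinement starting from the given centers."""
    for _ in range(iters):
        # distances of every point to every center -> nearest-center labels
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(centers)):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, labels

def multi_ball_split_kmeans(X, k, k0=2, step=2, seed=0):
    """Grow from k0 initial centers to k by splitting the largest balls."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k0, replace=False)].astype(float)
    while len(centers) < k:
        centers, labels = lloyd(X, centers)
        # ball radius = distance from each center to its farthest member
        radii = np.array([
            np.linalg.norm(X[labels == j] - centers[j], axis=1).max()
            if np.any(labels == j) else 0.0
            for j in range(len(centers))
        ])
        # split up to `step` widest balls at once (the "fixed step")
        n_new = min(step, k - len(centers))
        new_centers = []
        for j in radii.argsort()[::-1][:n_new]:
            pts = X[labels == j]
            far = pts[np.linalg.norm(pts - centers[j], axis=1).argmax()]
            new_centers.append(far)
        centers = np.vstack([centers, new_centers])
    return lloyd(X, centers)
```

Because only the recorded per-ball radii are ranked when choosing which clusters to split, the selection step avoids scanning all n points per candidate center, which is the source of the complexity reduction claimed in the abstract.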

Key words: pattern recognition, k-means, incremental clustering, ball k-means, multi-ball splitting

CLC Number: 

  TP391

Fig.1

Schematic diagram of multi-sphere splitting incremental clustering

Table 1

Characteristics of eight UCI datasets

Dataset | Instances | Dimensions
iris | 150 | 4
wine | 178 | 13
Breast Cancer (BC) | 569 | 30
Wine Quality (WQ) | 4898 | 12
Abalone | 4177 | 8
Page block | 5473 | 10
Condition Based Maintenance (CO) | 11934 | 16
EEG Eye State (EE) | 14980 | 14

Table 2

Function value E results of k-means and each algorithm on the UCI datasets

k | Dataset | BKM | FGKM(E) | IK-+(E) | MS-MGKM(E) | Proposed(E)
10 | iris | 30.14 | 11.38 | 12.47 | 14.33 | 8.17
10 | wine | 3.89×10⁵ | 43.45 | 40.78 | 43.49 | 24.85
10 | BC | 1.02×10⁷ | 14.54 | 4.57 | 17.55 | 0.67
10 | WQ | 1.07×10⁶ | 2.29 | 1.80 | 3.10 | -2.48
10 | Abalone | 2.57×10³ | 26.55 | 27.16 | 33.10 | 6.05
10 | Page block | 1.43×10¹⁰ | 67.89 | 67.42 | 68.14 | 61.46
10 | CO | 1.16×10¹¹ | 98.98 | 98.82 | 98.97 | 98.53
10 | EE | 3.37×10¹¹ | 99.99 | 99.99 | 99.99 | 99.97
10 | Average | — | 45.63 | 44.13 | 47.33 | 37.15
20 | iris | 18.12 | 17.17 | 15.93 | 20.87 | 9.66
20 | wine | 1.57×10⁵ | 54.45 | 51.64 | 56.16 | 46.24
20 | BC | 8.16×10⁶ | 55.01 | 56.00 | 58.02 | 52.28
20 | WQ | 6.33×10⁵ | 7.05 | 6.61 | 8.67 | 3.82
20 | Abalone | 1.33×10³ | 49.43 | 48.37 | 51.59 | 32.55
20 | Page block | 1.34×10¹⁰ | 87.25 | 85.45 | 87.29 | 81.54
20 | CO | 1.96×10¹⁰ | 99.58 | 99.56 | 99.58 | 99.38
20 | EE | 3.37×10¹¹ | 99.99 | 99.99 | 99.99 | 99.99
20 | Average | — | 58.74 | 57.94 | 60.27 | 53.18
50 | WQ | 3.47×10⁵ | 19.13 | 17.81 | 20.42 | 13.32
50 | Abalone | 476.69 | 66.86 | 66.00 | 67.64 | 59.51
50 | Page block | 1.31×10¹⁰ | 96.72 | 93.62 | 96.76 | 92.30
50 | CO | 9.29×10⁷ | 76.51 | 75.57 | 76.44 | 69.49
50 | EE | 3.13×10¹¹ | 99.99 | 99.99 | 99.99 | 99.99
50 | Average | — | 71.84 | 70.60 | 72.25 | 66.92

Table 3

Time results I of k-means and each algorithm on the UCI datasets

k | Dataset | KM/ms | FGKM(I) | IK-+(I) | MS-MGKM(I) | BKM(I) | FGBKM(I) | Proposed(I)
10 | BC | 7.18 | -1085.93 | -17.83 | -9326.18 | 69.64 | -1074.37 | 79.80
10 | WQ | 41.24 | -6346.65 | -34.29 | -101722.50 | 47.67 | -6443.31 | 33.45
10 | Abalone | 8.72 | -13585.78 | -21.10 | -38195.87 | 39.22 | -14061.93 | 35.62
10 | Page block | 53.64 | -6522.30 | 63.53 | -15357.49 | 18.49 | -6735.46 | 66.63
10 | CO | 33.76 | -46953.32 | -45.56 | -94534.68 | 62.91 | -48530.63 | 21.56
10 | EE | 176.86 | -14315.02 | -55.94 | -2798.45 | -12.97 | -14563.35 | -66.37
10 | Average | — | -14801.50 | -18.53 | -43655.86 | 37.49 | -15234.84 | 28.45
20 | BC | 8.72 | -1924.08 | -42.89 | -16188.99 | 85.55 | -1849.31 | 71.64
20 | WQ | 90.64 | -5780.41 | -67.92 | -69578.51 | 70.37 | -5769.33 | 66.26
20 | Abalone | 20.58 | -12317.88 | -22.45 | -48183.77 | 72.69 | -12559.48 | 59.03
20 | Page block | 316.32 | -2204.25 | 83.83 | -9885.65 | 69.34 | -2271.93 | 86.71
20 | CO | 101.48 | -33320.77 | -128.58 | -44973.71 | 83.64 | -33805.20 | 70.20
20 | EE | 608.42 | -8757.01 | -73.31 | -2294.40 | 47.58 | -8800.83 | 54.12
20 | Average | — | -10717.40 | -41.89 | -31850.84 | 71.53 | -10842.68 | 67.99
50 | WQ | 141.62 | -9313.36 | -177.39 | -77101.10 | 84.56 | -9011.35 | 76.63
50 | Abalone | 84.98 | -7819.04 | -7.63 | -39362.23 | 89.74 | -7803.91 | 82.39
50 | Page block | 864.60 | -1951.95 | 74.88 | -7893.41 | 85.03 | -1972.65 | 89.60
50 | CO | 380.10 | -23082.16 | -142.18 | -19747.88 | 93.43 | -23085.37 | 85.11
50 | EE | 1469.48 | -9462.43 | -58.02 | -3079.38 | 65.85 | -9267.26 | 75.45
50 | Average | — | -10391.06 | -62.07 | -29436.80 | 81.69 | -10330.54 | 79.53

Fig.2

Function value and time result of each algorithm on all data sets when k=20

1 金晓民,张丽萍. 基于最小生成树的多层次k-means聚类算法及其在数据挖掘中的应用[J]. 吉林大学学报: 理学版,2018,56(5): 1187-1192.
Jin Xiao-min, Zhang Li-ping. Multi-level k-means clustering algorithm based on minimum spanning tree and its application in data mining[J]. Journal of Jilin University(Science Edition),2018, 56(5): 1187-1192.
2 张杰,卓灵,朱韵攸. 一种k-means聚类算法的改进与应用[J]. 电子技术应用,2015,41(1): 125-128, 131.
Zhang Jie, Zhuo Ling, Zhu Yun-xiao. Improvement and application of a k-means clustering algorithm[J]. Application of Electronic Technology, 2015, 41(1): 125-128, 131.
3 刘仲民,李战明,李博皓,等. 基于稀疏矩阵的谱聚类图像分割算法[J]. 吉林大学学报: 工学版,2017,47(4): 1308-1313.
Liu Zhong-min, Li Zhan-ming, Li Bo-hao, et al. Spectral clustering image segmentation based on sparse matrix[J]. Journal of Jilin University(Engineering and Technology Edition), 2017, 47(4): 1308-1313.
4 Elkan C. Using the triangle inequality to accelerate k-means[C]∥Proceedings of the 20th International Conference on Machine Learning,La Jolla, California, 2003: 147-153.
5 Hamerly G. Making k-means even faster[C]∥Proceedings of the 2010 SIAM International Conference on Data Mining, Columbus, Ohio, USA, 2010: 130-140.
6 Drake J, Hamerly G. Accelerated k-means with adaptive distance bounds[C]∥5th NIPS Workshop on Optimization for Machine Learning, Lake Tahoe, California, America, 2012.
7 Newling J, Fleuret F. Fast k-means with accurate bounds[C]∥ICML'16: Proceedings of the 33rd International Conference on International Conference on Machine Learning, New York, America, 2016: 936-944.
8 Ding Y F, Zhao Y, Shen X P, et al. Yinyang k-means: a drop-in replacement of the classic k-means with consistent speedup[C]∥Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015: 579-587.
9 Xia S Y, Peng D W, Meng D Y, et al. Ball k-means: fast adaptive clustering with no bounds[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(1): 87-99.
10 Likas A, Vlassis N, Verbeek J J. The global k-means clustering algorithm[J]. Pattern Recognition, 2003, 36(2): 451-461.
11 Bagirov A M. Modified global k-means algorithm for minimum sum-of-squares clustering problems[J]. Pattern Recognition, 2008, 41(10): 3192-3199.
12 Bagirov A M, Ugon J, Webb D. Fast modified global k-means algorithm for incremental cluster construction[J]. Pattern Recognition, 2011, 44(4): 866-876.
13 Ordin B, Bagirov A M. A heuristic algorithm for solving the minimum sum-of-squares clustering problems[J]. Journal of Global Optimization, 2015, 61(2): 341-361.
14 Xie J, Jiang S, Xie W, et al. An efficient global k-means clustering algorithm[J]. Journal of Computers, 2011, 6(2): 271-279.
15 Yang X, Li Y, Sun Y, et al. Fast and robust rbf neural network based on global k-means clustering with adaptive selection radius for sound source angle estimation[J]. IEEE Transactions on Antennas and Propagation, 2018, 66(6): 3097-3107.
16 于剑,程乾生. 模糊聚类方法中的最佳聚类数的搜索范围[J]. 中国科学,2002,32(2): 275-280.
Yu Jian, Cheng Qian-sheng. The search range of optimal cluster number in fuzzy clustering method[J]. Science in China, 2002, 32(2): 275-280.
17 Hartigan J A, Wong M A. Algorithm AS 136: a k-means clustering algorithm[J]. Applied Statistics, 1979, 28(1): 100-108.
18 Ismkhan H. Ik-means-+: an iterative clustering algorithm based on an enhanced version of the k-means[J]. Pattern Recognition, 2018, 79: 402-413.