Journal of Jilin University (Engineering and Technology Edition), 2023, Vol. 53, Issue 10: 2909-2916. doi: 10.13229/j.cnki.jdxbgxb.20220601


Parallel density clustering algorithm based on MapReduce and optimized cuckoo algorithm

Yi-min MAO, Sen-qing GU

  1. College of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China
  • Received: 2022-05-18 Online: 2023-10-01 Published: 2023-12-13

Abstract:

In parallel density clustering, the boundary points of clusters with different densities are assigned ambiguously and the data contain noise, which degrades clustering performance and leaves the clustering results susceptible to local optima. To address these problems, a parallel density clustering algorithm based on MapReduce and an optimized cuckoo algorithm, MCS-KDBSCAN (MapReduce-based parallel maximization cuckoo search K-means DBSCAN), is proposed. First, the algorithm introduces the KDBSCAN (K-means DBSCAN) strategy, which builds on the ideas of nearest neighbors and reverse nearest neighbors in k-means. By computing the influence space of each data point, it redefines the cluster expansion condition of the DBSCAN algorithm, which avoids the ambiguous assignment of boundary points between clusters of different densities. Second, drawing on the nearest neighbor idea in KDBSCAN density clustering, an iterative noise point processing strategy is proposed to reduce the impact of noise points on clustering performance. Third, MCS (maximization cuckoo search), an optimized version of the traditional cuckoo algorithm, is proposed. By decaying the weight of the nest discovery probability as the number of search iterations increases, it speeds up convergence and addresses the influence of local optima on the clustering results. Finally, combined with MapReduce, the parallel density clustering strategy MCS-KDBSCAN is proposed. Parallelizing the density clustering computation reduces the communication overhead of transmitting local optimal solutions in the parallel clustering algorithm and improves its performance. Experiments show that the proposed MCS-KDBSCAN parallel density clustering algorithm is superior in both clustering accuracy and running time.
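The influence space mentioned in the abstract can be illustrated with a minimal sketch. It assumes the definition commonly used in the ISDBSCAN literature, where the influence space of a point is the intersection of its k nearest neighbors and its reverse k nearest neighbors. The abstract does not state the paper's exact definition, so this is an illustration under that assumption and not the authors' implementation. The function names `k_nearest_neighbors` and `influence_space` are hypothetical.

```python
from math import dist


def k_nearest_neighbors(points, k):
    """For each point index, return the set of indices of its k nearest neighbors."""
    neighbors = []
    for i, p in enumerate(points):
        # Rank all other points by Euclidean distance; ties are broken by index.
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (dist(p, points[j]), j),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def influence_space(points, k):
    """Assumed definition: IS_k(p) = kNN(p) intersected with reverse-kNN(p)."""
    knn = k_nearest_neighbors(points, k)
    # Reverse kNN of i: every j that counts i among its own k nearest neighbors.
    rnn = [set() for _ in points]
    for j, nbrs in enumerate(knn):
        for i in nbrs:
            rnn[i].add(j)
    return [knn[i] & rnn[i] for i in range(len(points))]


if __name__ == "__main__":
    # Three mutually close points and one distant point.
    sample = [(0, 0), (0, 1), (1, 0), (10, 10)]
    print(influence_space(sample, k=2))
```

In this sample, the distant point has neighbors of its own but is nobody's nearest neighbor, so its influence space is empty. A density-based method built on this idea could treat such a point as a noise or boundary candidate.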

Key words: density clustering, optimized cuckoo algorithm, density-based spatial clustering of applications with noise, MapReduce, noise resistance

CLC Number: 

  • TP311.13

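Table 1

Experimental results of each model under normalized mutual information

| Noise ratio | Ari1-NMI | Ari1-V-measure | Ari1-ARI | Ari2-NMI | Ari2-V-measure | Ari2-ARI | Ari3-NMI | Ari3-V-measure | Ari3-ARI |
|---|---|---|---|---|---|---|---|---|---|
| Original data | 0.837 | 0.825 | 0.872 | 0.885 | 0.886 | 0.922 | 0.492 | 0.486 | 0.453 |
| 5% | 0.792 | 0.780 | 0.856 | 0.831 | 0.833 | 0.877 | 0.458 | 0.452 | 0.429 |
| 10% | 0.745 | 0.733 | 0.814 | 0.792 | 0.796 | 0.829 | 0.437 | 0.421 | 0.398 |
| 15% | 0.704 | 0.697 | 0.758 | 0.754 | 0.750 | 0.784 | 0.391 | 0.387 | 0.373 |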

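Table 2

Experimental results of each model under the V-measure evaluation index

| Algorithm | Image | Letter | Wireless | Act-Rec | Avila | Optdigits | SuperConduct | Digitis |
|---|---|---|---|---|---|---|---|---|
| RNNDBSCAN | 0.487 | 0.480 | 0.506 | 0.624 | 0.231 | 0.638 | 0.364 | 0.710 |
| ISDBSCAN | 0.614 | 0.552 | 0.458 | 0.557 | 0.281 | 0.612 | 0.382 | 0.653 |
| ISBDBSCAN | 0.208 | 0.511 | 0.380 | 0.584 | 0.245 | 0.433 | 0.341 | 0.291 |
| CS-KDBSCAN | 0.603 | 0.554 | 0.607 | 0.630 | 0.287 | 0.779 | 0.374 | 0.774 |
| MCS-KDBSCAN | 0.647 | 0.584 | 0.623 | 0.651 | 0.318 | 0.814 | 0.397 | 0.829 |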

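Table 3

Experimental results of each model under the adjusted Rand index (ARI) evaluation index

| Algorithm | Image | Letter | Wireless | Act-Rec | Avila | Optdigits | SuperConduct | Digitis |
|---|---|---|---|---|---|---|---|---|
| RNNDBSCAN | 0.412 | 0.043 | 0.325 | 0.404 | 0.049 | 0.237 | 0.165 | 0.381 |
| ISDBSCAN | 0.214 | 0.051 | 0.152 | 0.356 | 0.057 | 0.242 | 0.042 | 0.302 |
| ISBDBSCAN | 0.447 | 0.037 | 0.164 | 0.342 | 0.041 | 0.089 | 0.039 | 0.081 |
| CS-KDBSCAN | 0.401 | 0.065 | 0.423 | 0.449 | 0.066 | 0.575 | 0.074 | 0.650 |
| MCS-KDBSCAN | 0.468 | 0.085 | 0.511 | 0.507 | 0.088 | 0.682 | 0.143 | 0.683 |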
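
For readers unfamiliar with the metric in Table 3, the following is a minimal sketch of how an adjusted Rand index can be computed from two label assignments using the standard pair-counting formula. It is not the authors' evaluation code, and the function name `adjusted_rand_index` is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat clusterings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n < 2:
        # No pairs to compare; treat the partitions as trivially identical.
        return 1.0

    # Contingency counts: how many items share each (true, predicted) label pair.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2        # agreement of a perfect match
    if max_index == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

The index is 1 when the two partitions agree up to a renaming of labels and is close to 0 for agreement at chance level, which is why the low values on datasets such as Letter and Avila indicate weak agreement with the reference labels for every compared model.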

Table 4

Running time and clustering efficiency of each model

| Algorithm | Image | Letter | Wireless | Act-Rec | Avila | Optdigits | SuperConduct | Digitis |
|---|---|---|---|---|---|---|---|---|
| RNNDBSCAN | 0.603 | 39.669 | 0.507 | 11.675 | 11.725 | 1.872 | 77.688 | 0.635 |
| ISDBSCAN | 0.569 | 67.326 | 0.494 | 12.990 | 11.660 | 1.819 | 73.775 | 0.603 |
| ISBDBSCAN | 21.923 | 360.998 | 9.312 | 152.934 | 197.217 | 146.420 | 1114.506 | 47.780 |
| CS-KDBSCAN | 0.534 | 39.142 | 0.471 | 11.823 | 11.376 | 1.943 | 78.192 | 0.678 |
| MCS-KDBSCAN | 0.518 | 37.411 | 0.418 | 10.941 | 10.104 | 1.739 | 74.538 | 0.612 |