吉林大学学报(工学版) ›› 2023, Vol. 53 ›› Issue (10): 2909-2916.doi: 10.13229/j.cnki.jdxbgxb.20220601

• 计算机科学与技术 • 上一篇    下一篇

基于MapReduce与优化布谷鸟算法的并行密度聚类算法

毛伊敏(),顾森晴   

  1. 江西理工大学 信息工程学院,江西 赣州 341000
  • 收稿日期:2022-05-18 出版日期:2023-10-01 发布日期:2023-12-13
  • 作者简介:毛伊敏(1970-),女,教授,博士.研究方向:大数据,数据挖掘,生物技术,地理信息.E-mail:maoyimin2022@yeah.net
  • 基金资助:
    国家重点研发计划项目(2018YFC1504705);国家自然科学基金项目(41562019)

Parallel density clustering algorithm based on MapReduce and optimized cuckoo algorithm

Yi-min MAO(),Sen-qing GU   

  1. College of Information Engineering,Jiangxi University of Science and Technology,Ganzhou 341000,China
  • Received:2022-05-18 Online:2023-10-01 Published:2023-12-13

摘要:

针对并行化密度聚类的过程中,不同密度聚类簇边界点划分模糊,并且存在数据噪声,从而影响聚类性能,使聚类结果受制于局部最优影响的问题,提出一种基于MapReduce与优化布谷鸟算法的并行密度聚类算法。首先,该算法结合K-means中的近邻与逆近邻思路的策略KDBSCAN(K-means DBSCAN),通过计算各数据点的影响空间,以此重新定义基于密度的聚类(Density-based spatial dutering of apptications with noise,DBSCAN)算法中聚类簇的拓展条件,避免了不同密度聚类簇边界点划分模糊的问题;其次,结合KDBSCAN密度聚类中的近邻思想提出了一种可行的迭代性噪声点处理策略,减轻数据中噪声点对于聚类算法性能的影响;再次,提出基于传统布谷鸟算法的优化改进策略MCS(Majorization cuckoo search),通过衰减发现巢穴概率的权重,随着迭代搜寻次数的增加提升算法收敛速度,解决了聚类结果受制于局部最优的问题;最后,结合MapReduce提出了并行密度聚类策略MCS-KDBSCAN,通过并行化密度聚类算法运算,减轻了并行聚类算法局部最优解传递的通信负担,提升了算法性能。实验证明,提出的MCS-KDBSCAN并行化密度聚类算法在聚类精度、聚类运行时间等方面均较优。

关键词: 密度聚类, 优化布谷鸟算法, 基于密度的聚类算法, MapReduce, 抗噪能力

Abstract:

In the process of parallel density clustering, the boundary points of clusters with different densities are divided fuzzy and there is data noise, which affects the clustering performance and makes the clustering results subject to the influence of local optimization. Therefore, a parallel density clustering algorithm MCS-KDBSCAN (maprule based parallel maximization cuckoo search K-means DBSCAN) based on MapReduce and optimized cuckoo algorithm is proposed. Firstly, the algorithm combines the strategy KDBSCAN(K-means DBSCAN), which is based on the idea of nearest neighbor and inverse nearest neighbor in k-means. By calculating the influence space of each data point, the expansion conditions of clustering clusters in DBSCAN algorithm are redefined to avoid the problem of fuzzy boundary points of clustering clusters with different densities; Then, combined with the nearest neighbor idea in KDBSCAN density clustering, a feasible iterative noise point processing strategy is proposed to reduce the impact of noise points in data on the performance of clustering algorithm; Secondly, the optimization and improvement strategy MCS (maximization cuckoo search) based on the traditional cuckoo algorithm is proposed. By attenuating the weight of the probability of finding nests, with the increase of the number of iterative searches, the convergence speed of the algorithm is improved, and the influence of local optimization on the clustering results is solved; Finally, combined with MapReduce, a parallel density clustering strategy MCS-KDBSCAN is proposed. By parallelizing the operation of density clustering algorithm, the communication burden of local optimal solution transmission of parallel clustering algorithm is reduced and the performance of the algorithm is improved. Experiments show that the proposed mcs-kdbscan parallel density clustering algorithm is superior in clustering accuracy and clustering running time.

Key words: density clustering, optimization cuckoo algorithm, density-based spatial dutering of apptications with noise, MapReduce, resist noise ability

中图分类号: 

  • TP311.13

表1

各模型在标准化互信息下的实验结果"

噪声比例Ari1-NMIAri1-V-measureAri1-ARIAri2-NMIAri2-V-measureAri2-ARIAri3-NMIAri3-V-measureAri3-ARI
来源数据0.8370.8250.8720.8850.8860.9220.4920.4860.453
5%0.7920.7800.8560.8310.8330.8770.4580.4520.429
10%0.7450.7330.8140.7920.7960.8290.4370.4210.398
15%0.7040.6970.7580.7540.7500.7840.3910.3870.373

表2

各模型在V-measure评价指标下的实验结果"

算法模型ImageLetterWirelessAct-RecAvilaOptdigitsSuperConductDigitis
RNNDBSCAN0.4870.4800.5060.6240.2310.6380.3640.710
ISDBSCAN0.6140.5520.4580.5570.2810.6120.3820.653
ISBDBSCAN0.2080.5110.3800.5840.2450.4330.3410.291
CS-KDBSCAN0.6030.5540.6070.6300.2870.7790.3740.774
MCS-KDBSCAN0.6470.5840.6230.6510.3180.8140.3970.829

表3

各模型在调整兰德指数评价指标下的实验结果"

算法模型ImageLetterWirelessAct-RecAvilaOptdigitsSuperConductDigitis
RNNDBSCAN0.4120.0430.3250.4040.0490.2370.1650.381
ISDBSCAN0.2140.0510.1520.3560.0570.2420.0420.302
ISBDBSCAN0.4470.0370.1640.3420.0410.0890.0390.081
CS-KDBSCAN0.4010.0650.4230.4490.0660.5750.0740.650
MCS-KDBSCAN0.4680.0850.5110.5070.0880.6820.1430.683

表4

不同模型的运行时间与聚类效率的统计结果"

算法模型ImageLetterWirelessAct-RecAvilaOptdigitsSuperConductDigitis
RNNDBSCAN0.60339.6690.50711.67511.7251.87277.6880.635
ISDBSCAN0.56967.3260.49412.99011.6601.81973.7750.603
ISBDBSCAN21.923360.9989.312152.934197.217146.4201114.50647.780
CS-KDBSCAN0.53439.1420.47111.82311.3761.94378.1920.678
MCS-KDBSCAN0.51837.4110.41810.94110.1041.73974.5380.612
1 王岩, 彭涛, 韩佳育, 等. 一种基于密度的分布式聚类方法[J]. 软件学报, 2017, 28(11): 2836-2850.
Wang Yan, Peng Tao, Han Jia-yu, et al. Density-based distributed clustering method[J]. Journal of Software, 2017, 28(11): 2836-2850.
2 Liu P, Zhou D, Wu N. VDBSCAN: varied density based spatial clustering of applications with noise[C]∥ International Conference on Service Systems and Service Management, Chengdu, China, 2007: 1-4.
3 Chinta S, Sivaram A, Rengaswamy R. Prediction error-based clustering approach for multiple-model learning using statistical testing[J]. Engineering Applications of Artificial Intelligence, 2019, 77(1): 125-135.
4 Ros F, Guillaume S. A hierarchical clustering algorithm and an improvement of the single linkage criterion to deal with noise[J]. Expert Systems with Applications, 2019, 128(1): 96-108.
5 Brown D, Japa A, Shi Y. A fast density-grid based clustering method[C]∥ IEEE 9th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, USA, 2019: 48-54.
6 Manogaran G, Vijayakumar V, Varatharajan R, et al. Machine learning based big data processing framework for cancer diagnosis using hidden Markov model and GM clustering[J]. Wireless Personal Communications, 2018, 102(3): 2099-2116.
7 Kriegel H P, Kröger P, Sander J, et al. Density‐based clustering[J]. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2011, 1(3): 231-240.
8 陈敏, 高学东, 栾绍峻, 等. 基于密度的并行聚类算法[J]. 计算机工程, 2010, 36(11): 8-10.
Chen Min, Gao Xue-dong, Luan Shao-jun, et al. Parallel clustering algorithm based on density[J]. Computer Engineering, 2010, 36(11): 8-10.
9 Hu Wei-hua. Research on parallel data stream clustering algorithm based on grid and density[C]∥ International Conference on Computer Science & Mechanical Automation, Hangzhou, China, 2015.
[1] 康苏明,张叶娥. 基于Hadoop的跨社交网络局部时序链路预测算法[J]. 吉林大学学报(工学版), 2022, 52(3): 626-632.
[2] 魏晓辉, 李翔, 李洪亮, 李聪, 庄园, 于洪梅. 支持大规模流数据处理的弹性在线MapReduce模型及拓扑协议[J]. 吉林大学学报(工学版), 2016, 46(4): 1222-1231.
[3] 陈涛, 邓辉舫, 刘靖. 基于密度聚类和多示例学习的图像分类方法[J]. 吉林大学学报(工学版), 2014, 44(4): 1126-1134.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!