Journal of Jilin University (Engineering and Technology Edition) ›› 2023, Vol. 53 ›› Issue (10): 2909-2916. doi: 10.13229/j.cnki.jdxbgxb.20220601
Abstract:
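Parallel density-based clustering suffers from ambiguous assignment of boundary points between clusters of different densities and from noise in the data, both of which degrade clustering performance and leave the result prone to local optima. To address these problems, a parallel density clustering algorithm based on MapReduce and an improved cuckoo search algorithm is proposed. First, the algorithm introduces KDBSCAN (K-means DBSCAN), a strategy that draws on the nearest-neighbor and reverse-nearest-neighbor ideas in K-means: it computes the influence space of each data point and uses it to redefine the cluster expansion condition of the density-based spatial clustering of applications with noise (DBSCAN) algorithm, which avoids the ambiguous assignment of boundary points between clusters of different densities. Second, building on the nearest-neighbor idea in KDBSCAN, an iterative noise-point handling strategy is proposed to reduce the effect of noise points on clustering performance. Third, MCS (Majorization cuckoo search), an improvement on the traditional cuckoo search algorithm, is proposed; it decays the weight of the nest-discovery probability so that convergence speeds up as the number of search iterations grows, which addresses the problem of clustering results being trapped in local optima. Finally, a MapReduce-based parallel density clustering strategy, MCS-KDBSCAN, is proposed; by parallelizing the density clustering computation, it reduces the communication burden of passing locally optimal solutions in parallel clustering and improves the algorithm's performance. Experiments show that the proposed MCS-KDBSCAN parallel density clustering algorithm performs comparatively well in clustering accuracy, clustering running time, and related measures.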
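The abstract does not give formulas, so the following is only a minimal sketch of two ideas it names, under stated assumptions. The influence space of a point is assumed here to be the intersection of its k nearest neighbors and its reverse k nearest neighbors, a common definition that the abstract does not confirm. The decay of the nest-discovery probability is assumed to be linear, and the function names and the parameters `pa_start` and `pa_end` are hypothetical, not taken from the paper.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k nearest neighbours of points[i] (Euclidean, excluding i)."""
    others = (j for j in range(len(points)) if j != i)
    order = sorted(others, key=lambda j: math.dist(points[i], points[j]))
    return set(order[:k])


def influence_space(points, k):
    """Assumed definition: kNN(p) intersected with the reverse kNN of p.

    A neighbour j is kept for point i only if i is also among j's k nearest
    neighbours, so isolated points end up with a small or empty set.
    """
    knn = [k_nearest(points, i, k) for i in range(len(points))]
    return [{j for j in knn[i] if i in knn[j]} for i in range(len(points))]


def decayed_discovery_probability(iteration, max_iterations,
                                  pa_start=0.5, pa_end=0.05):
    """Assumed linear decay of the cuckoo-search nest-discovery probability.

    pa_start and pa_end are illustrative values, not the paper's settings.
    """
    fraction = iteration / max_iterations
    return pa_start - (pa_start - pa_end) * fraction


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10)]
    # The far-away point (10, 10) is nobody's near neighbour, so its
    # influence space is empty under the assumed definition.
    print(influence_space(data, k=2))
    print(decayed_discovery_probability(50, 100))
```

Under this assumed definition, a point with an empty or very small influence space is a natural candidate for noise handling, while a large influence space suggests the point can expand a cluster. The actual expansion condition and decay schedule are those defined in the full paper.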