Density peak clustering algorithm for uncertain data based on JS divergence

doi:10.13229/j.cnki.jdxbgxb.20221161

Abstract:

Traditional density-based clustering algorithms for uncertain data are sensitive to their parameters and produce poor results on uncertain data sets with complex manifold structure. To address these defects, a new density peak clustering algorithm for uncertain data based on JS divergence (UDPC-JS) is proposed. First, the algorithm removes noise points using the uncertain natural neighborhood density factor, which is defined in terms of uncertain natural neighbors. Second, it computes the local density of each uncertain data object by combining uncertain natural neighbors with JS divergence, finds the initial cluster centers of the uncertain data set using the idea of representative points, and defines a distance between the initial cluster centers based on JS divergence and a graph. Third, it builds a decision graph over the initial cluster centers from this local density and this newly defined distance, and selects the final cluster centers from the decision graph. Finally, each unassigned uncertain data object is assigned to the cluster that contains its initial cluster center. Experimental results show that the algorithm achieves better clustering quality and accuracy than the comparison algorithms, with a larger advantage on uncertain data sets with complex manifold structure.

Key words: uncertain data, uncertain natural neighbors, JS divergence, density peak, clustering
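The abstract gives no formulas, so the following is only a minimal sketch of the two generic building blocks it names: JS divergence between uncertain objects and the density and distance values plotted in a density peak decision graph. It assumes each uncertain object is a discrete probability distribution (a normalized histogram), uses the square root of JS divergence as the distance, and computes the classic density peak quantities rather than the natural-neighbor density or graph-based distance of UDPC-JS. The function names, the toy `objects`, and the cutoff `d_c` are illustrative assumptions, not taken from the paper.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence of two discrete distributions (range [0, 1] in bits)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture; m_i > 0 wherever p_i or q_i > 0
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def density_peak_scores(dist, d_c):
    """Classic density peak quantities from a symmetric distance matrix (not the UDPC-JS variants)."""
    n = len(dist)
    # rho: Gaussian-kernel local density with cutoff distance d_c.
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    # delta: distance to the nearest object of strictly higher density;
    # an object with no denser object gets its largest distance instead.
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta


# Hypothetical uncertain objects, each a histogram over three outcomes.
objects = [
    [0.70, 0.20, 0.10],
    [0.60, 0.30, 0.10],
    [0.65, 0.25, 0.10],
    [0.10, 0.20, 0.70],
    [0.10, 0.30, 0.60],
]

# Assumption: the square root of JS divergence (a metric) serves as the distance.
dist = [[math.sqrt(max(0.0, js_divergence(p, q))) for q in objects] for p in objects]
rho, delta = density_peak_scores(dist, d_c=0.2)

# Decision-graph heuristic: objects with large rho * delta are center candidates.
gamma = [r * d for r, d in zip(rho, delta)]
centers = sorted(range(len(objects)), key=lambda i: gamma[i], reverse=True)[:2]
print(sorted(centers))
```

In this toy example the two candidates fall in different groups of similar histograms, which is the pattern a decision graph is meant to expose: a center has both high density and a large distance to any denser object.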

CLC number:

TP311

Song LI, Xiao-nan LIU, Juan LIU. Peak clustering algorithm for uncertain data density based on JS divergence[J]. Journal of Jilin University (Engineering and Technology Edition), 2024, 54(7): 2038-2048.

Dataset	Instances	Dimensionality	Clusters
Iris	150	4	3
Wine	178	13	3
Seeds	210	7	3
Segmentation	2310	19	7
Haberman	306	3	2
Heart	270	13	2

Dataset	Algorithm	F-Measure	Silhouette
Iris	UDPC-JS	0.9011	0.8346
	FDBSCAN	0.8522	0.7612
	UK-medoids	0.8651	0.7703
	FOPTICS	0.8375	0.6953
	FDBSCAN-KL	0.8864	0.8162
	NBC-JS	0.8933	0.8269
Wine	UDPC-JS	0.7846	0.8818
	FDBSCAN	0.6451	0.7059
	UK-medoids	0.6732	0.7187
	FOPTICS	0.7045	0.6042
	FDBSCAN-KL	0.7753	0.7354
	NBC-JS	0.7465	0.7245
Seeds	UDPC-JS	0.9055	0.7887
	FDBSCAN	0.7042	0.7567
	UK-medoids	0.7011	0.6432
	FOPTICS	0.8231	0.7442
	FDBSCAN-KL	0.8156	0.7465
	NBC-JS	0.8842	0.7885
Segmentation	UDPC-JS	0.4798	0.3458
	FDBSCAN	0.2058	0.2060
	UK-medoids	0.4401	0.3264
	FOPTICS	0.2275	0.2780
	FDBSCAN-KL	0.4014	0.3155
	NBC-JS	0.3221	0.2998
Haberman	UDPC-JS	0.7844	0.6828
	FDBSCAN	0.6643	0.6132
	UK-medoids	0.6540	0.6279
	FOPTICS	0.7553	0.6478
	FDBSCAN-KL	0.7901	0.6210
	NBC-JS	0.7462	0.6732
Heart	UDPC-JS	0.7995	0.7887
	FDBSCAN	0.6654	0.7321
	UK-medoids	0.7612	0.7421
	FOPTICS	0.7095	0.6653
	FDBSCAN-KL	0.7590	0.6043
	NBC-JS	0.6632	0.7963
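
The comparison table reports F-Measure and Silhouette, but this page does not state how either was computed. As an illustration only, the sketch below uses one common class-weighted clustering F-measure and the standard silhouette coefficient over a precomputed distance matrix. The function names and toy data are assumptions, and the paper's exact definitions may differ.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_labels):
    """Class-weighted F-measure: each true class is matched to its best-scoring cluster."""
    n = len(true_labels)
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_labels)
    overlap = Counter(zip(true_labels, cluster_labels))
    total = 0.0
    for c, n_c in class_sizes.items():
        best = 0.0
        for k, n_k in cluster_sizes.items():
            n_ck = overlap[(c, k)]
            if n_ck:
                precision, recall = n_ck / n_k, n_ck / n_c
                best = max(best, 2 * precision * recall / (precision + recall))
        total += n_c / n * best
    return total


def silhouette(dist, labels):
    """Mean silhouette coefficient from a distance matrix; needs at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the object's own cluster.
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: four points on a line, with one ground-truth class per pair.
points = [0.0, 1.0, 10.0, 11.0]
truth = ["A", "A", "B", "B"]
found = [0, 0, 1, 1]
dist = [[abs(p - q) for q in points] for p in points]

print(clustering_f_measure(truth, found))
print(silhouette(dist, found))
```

For uncertain data the `dist` matrix would hold distances between distributions (for example, distances derived from JS divergence) instead of absolute differences between points.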