Multi-source Information Data Integration Algorithm Based on K-Medoids Clustering Algorithm

Journal of Jilin University Science Edition, 2023, Vol. 61, Issue (3): 665-670.

ZHU Peng, GUO Yanguang

Department of Computer Technology and Information Management, Inner Mongolia Agricultural University, Baotou 014109, Inner Mongolia Autonomous Region, China

Received: 2022-03-28 Online: 2023-05-26 Published: 2023-05-26
Corresponding author: GUO Yanguang E-mail: guoyanguang@imau.edu.cn

Abstract

To address the difficulty of integrating multi-source information data, which arises because the similarity between source domains is low and hard to determine, we propose an integration method based on the K-medoids clustering algorithm. First, the clustering of multi-source data is treated as a transfer learning process: the initial sample weights are determined, the learning characteristics of the weights and the expected loss of the training samples are recorded at each iteration, and these characteristic parameters are then used to decide whether a data item belongs to the source domain or the target domain. Second, the clustering step of the integration algorithm is recast as a diversified domain-value labeling problem. Once the data have clustering characteristics, the weight factors between the data to be integrated in the source domain and in the target domain are calculated separately, the amount of interactive information between the two is determined from the coverage characteristics of the weight factors, and data with a high amount of information are integrated to ensure the success rate of integration. Simulation results show that the proposed algorithm achieves efficient integration both on stable datasets with few items and on disordered, larger, and more heterogeneous datasets, with few secondary integrations and low overall cost.

Key words: K-medoids clustering algorithm, multi-source data, source domain, target domain, amount of interactive information
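For readers unfamiliar with the clustering step named in the title, the following is a minimal sketch of generic K-medoids clustering (the alternating assign/update variant). It is not the authors' implementation: the abstract does not specify the sample-weight updates, the domain labeling, or the interactive-information criterion, so none of these are reproduced here. The function names `k_medoids`, `euclidean`, and `_assign`, the Euclidean distance, and the random initialization are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance (an assumed choice; any dissimilarity would do)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _assign(points: List[Point], medoids: List[int],
            dist: Callable[[Point, Point], float]) -> List[int]:
    """Label each point with the position of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist(p, points[medoids[c]]))
        for p in points
    ]


def k_medoids(points: List[Point], k: int,
              dist: Callable[[Point, Point], float] = euclidean,
              max_iter: int = 100, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Return (medoid_indices, labels) for a generic K-medoids clustering."""
    rng = random.Random(seed)
    # Hypothetical initialization: k distinct points chosen at random.
    medoids = rng.sample(range(len(points)), k)
    for _ in range(max_iter):
        labels = _assign(points, medoids, dist)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # An empty cluster keeps its previous medoid.
                new_medoids.append(medoids[c])
                continue
            # The new medoid is the member with the smallest total
            # distance to all other members of the same cluster.
            best = min(
                members,
                key=lambda i: sum(dist(points[i], points[j]) for j in members),
            )
            new_medoids.append(best)
        if new_medoids == medoids:
            break  # medoids stopped changing, so the clustering is stable
        medoids = new_medoids
    # Recompute labels so they always match the returned medoids.
    return medoids, _assign(points, medoids, dist)
```

Calling `k_medoids(points, k=2)` on a list of numeric tuples returns the indices of the chosen medoids and one cluster label per point. Unlike K-means, each cluster center is an actual data item, which is why the method needs only pairwise dissimilarities and suits heterogeneous records.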

CLC number: TP393.09

ZHU Peng, GUO Yanguang. Multi-source Information Data Integration Algorithm Based on K-Medoids Clustering Algorithm[J]. Journal of Jilin University Science Edition, 2023, 61(3): 665-670.
