Journal of Jilin University (Information Science Edition) ›› 2026, Vol. 44 ›› Issue (3): 625-631.


Processing Method of Multisource Data Integration Based on Robust Subspace Clustering Algorithm

JIANG Mingze, LI Wei, DONG Dan

  1. College of Information Engineering, Liaoning University of Traditional Chinese Medicine, Shenyang 110847, China
  • Received: 2024-07-09 Online: 2026-06-02 Published: 2026-06-02
  • About the author: JIANG Mingze (1995- ), male, from Shenyang, lecturer at Liaoning University of Traditional Chinese Medicine, master's degree, mainly engaged in research on mathematics and data science, (Tel) 86-15712482747 (E-mail) mzjiang_lnzy@163.com.
  • Supported by:
    Liaoning Province Science and Technology Innovation R&D Fund Project, 2023 (SGTYHT/15-WT-244)

Abstract: In multi-source data integration, data may lie in different subspaces and be highly imbalanced. To improve the efficiency of data analysis, a multi-source data integration processing method based on a robust subspace clustering algorithm is proposed. First, an improved data balancing algorithm calculates the number of samples to draw from the largest class and the average number of samples per class, and the synthetic minority oversampling technique (SMOTE) is used to obtain a relatively balanced subset, which addresses the imbalanced data distribution. Then, using a Dice coefficient similarity measure, the cosine similarity of the multi-source data is calculated; evaluating the similarity between data from different sources addresses data heterogeneity and redundancy. Finally, after a self-representation affinity graph is built to reveal the intrinsic correlations in the data, the robust subspace clustering algorithm identifies the feature subspaces of the different data. A robustness mechanism is introduced to resist the influence of noise and redundant features, the membership degree of the data is calculated, and data integration processing is performed according to the membership degree. The experimental results show that the method can integrate multi-source data, improve data analysis efficiency, and ensure data consistency and reliability.
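The oversampling step mentioned in the abstract can be illustrated with a minimal sketch of plain SMOTE-style interpolation. This is not the authors' improved balancing algorithm: how they derive the per-class sample counts is not given here, so the number of synthetic points (`n_new`), the neighbour count (`k`), the seed, and the function name `smote_oversample` are all assumptions made for illustration.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (plain SMOTE idea).

    n_new, k and seed are illustrative parameters, not values from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical toy data: four minority-class samples in two dimensions.
minority_samples = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_oversample(minority_samples, n_new=6)
```

Because each synthetic point is a convex combination of two existing minority samples, it stays inside the region already occupied by that class, which is why this family of methods is used to rebalance data without simply duplicating records.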

Key words: robust subspace clustering algorithm, multi-source data, cosine similarity, data integration processing, high-dimensional feature space
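The abstract names both a Dice coefficient measure and cosine similarity without stating how the two are combined. As a rough illustration only, the sketch below computes both on real-valued feature vectors, assuming the common continuous form of the Dice coefficient, 2(x·y)/(|x|² + |y|²). The function names and the example vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between x and y (0.0 if either vector is zero)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def dice_similarity(x, y):
    """Continuous Dice coefficient: 2(x.y) / (|x|^2 + |y|^2)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y)
    return 2.0 * dot / denom if denom else 0.0


# Hypothetical records from two sources that differ only in scale.
record_a = [1.0, 2.0]
record_b = [2.0, 4.0]
cos_ab = cosine_similarity(record_a, record_b)
dice_ab = dice_similarity(record_a, record_b)
```

The two measures agree on identical and on orthogonal vectors but differ on scaled copies: cosine similarity ignores magnitude, while this form of the Dice coefficient penalises a difference in norm. That distinction matters when sources report the same quantity in different units.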

CLC Number:

  • TP391