Journal of Jilin University (Information Science Edition) ›› 2026, Vol. 44 ›› Issue (3): 625-631.

Processing Method of Multisource Data Integration Based on Robust Subspace Clustering Algorithm

JIANG Mingze, LI Wei, DONG Dan   

  1. College of Information Engineering, Liaoning University of Traditional Chinese Medicine, Shenyang 110847, China
  • Received: 2024-07-09 Online: 2026-06-02 Published: 2026-06-02

Abstract: In multi-source data integration, data may lie in different subspaces and may be highly imbalanced. To improve the efficiency of data analysis, a multi-source data integration processing method based on a robust subspace clustering algorithm is proposed. First, an improved data balancing algorithm calculates the maximum and average number of samples per class, and the synthetic minority oversampling technique is used to obtain a relatively balanced subset, addressing the problem of imbalanced data distribution. Then, the cosine similarity of the multi-source data is calculated using the Dice coefficient similarity measure; evaluating the similarity between data from different sources addresses the problems of data heterogeneity and redundancy. Finally, after establishing a self-representation of the data and constructing affinity graphs to reveal its inherent correlations, the robust subspace clustering algorithm is used to identify the feature subspaces of the different data. A robustness mechanism that resists the influence of noise and redundant features is introduced, the membership degree of the data is calculated, and data integration processing is performed based on the membership degree. The experimental results show that the method achieves integrated processing of multi-source data, improves data analysis efficiency, and ensures data consistency and reliability.
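
The two similarity measures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how the Dice coefficient and cosine similarity are combined, so the two are shown separately here, and the function names `cosine_similarity` and `dice_coefficient` as well as the zero-vector and empty-set conventions are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_u * norm_v)


def dice_coefficient(a, b):
    """Dice coefficient 2|A n B| / (|A| + |B|) of two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return 2.0 * len(set_a & set_b) / (len(set_a) + len(set_b))
```

For example, `cosine_similarity` could compare the numeric feature vectors of two records from different sources, while `dice_coefficient` could compare their sets of attribute names or tokens; both return values in a range where 1 indicates the closest match.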

Key words: Robust subspace clustering algorithm, multiple source data, cosine similarity, data integration processing, high dimensional feature space

CLC Number: 

  • TP391