
• Articles •

Data Cleaning Algorithm Based on Unsupervised Learning

SUN Tie-mina,YU Jieb,SHANG Chengc,TIAN Da-xinc,ZHANG Li-huaa
  

  1. a. Department of Science and Technology; b. College of Communication Engineering; c. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Received:2008-01-09 Revised:1900-01-01 Online:2008-11-20 Published:2008-11-20

Abstract: To address the problem of similar and duplicate records in a data warehouse, a data cleaning algorithm based on unsupervised learning is proposed. The learning method is based on the Hebbian postulate; its main idea is that the similarity level determines the reward and penalty rates. To overcome the dead-cluster problem, a new cluster is created when no existing cluster is similar to a pattern. After learning, the algorithm checks for wrong clusters: if one is found, it is deleted and merged into the cluster most similar to it, which makes the clustering result more accurate. In the experiments, the learning algorithm is applied to a clustering task to test its capability, and the results show that it clusters accurately.
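
The steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the cosine similarity measure, the threshold, the reward and penalty rates, the choice of the runner-up cluster as the one to penalize, and the rule that a "wrong" cluster is one with fewer than min_size members are all assumptions made for illustration. The function names learn_centers and merge_small_clusters are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def learn_centers(patterns, threshold=0.9, reward=0.1, penalty=0.05, epochs=1):
    """Competitive learning sketch; all parameters are assumed, not from the paper."""
    centers = []
    for _ in range(epochs):
        for x in patterns:
            sims = [cosine(x, c) for c in centers]
            order = sorted(range(len(centers)), key=lambda k: sims[k], reverse=True)
            # No sufficiently similar cluster: start a new one, so no cluster
            # is left "dead" trying to cover patterns far away from it.
            if not order or sims[order[0]] < threshold:
                centers.append(list(x))
                continue
            # Reward: move the winner toward x, scaled by its similarity.
            win = order[0]
            step = reward * sims[win]
            centers[win] = [c + step * (a - c) for c, a in zip(centers[win], x)]
            # Penalty (assumed form): push the runner-up away from x,
            # scaled by its own similarity.
            if len(order) > 1:
                rival = order[1]
                step = penalty * max(sims[rival], 0.0)
                centers[rival] = [c - step * (a - c) for c, a in zip(centers[rival], x)]
    return centers


def merge_small_clusters(patterns, centers, min_size=2):
    """Delete clusters with fewer than min_size members (assumed criterion for
    a 'wrong' cluster) and merge each into the most similar kept cluster."""
    labels = [max(range(len(centers)), key=lambda k: cosine(x, centers[k]))
              for x in patterns]
    keep = [k for k in range(len(centers)) if labels.count(k) >= min_size]
    if not keep:
        return centers, labels
    remap = {}
    for k in range(len(centers)):
        if k in keep:
            remap[k] = keep.index(k)
        else:
            remap[k] = max(range(len(keep)),
                           key=lambda j: cosine(centers[k], centers[keep[j]]))
    return [centers[k] for k in keep], [remap[label] for label in labels]
```

To apply this to records, each record would first be encoded as a numeric vector (for example, character n-gram counts); records that end up with the same label are candidates for being similar or duplicate entries. The encoding is likewise an assumption, since the abstract does not describe one.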

Key words: data extraction, data transformation, data cleaning, data loading, data warehouse

CLC Number: 

  • TP311