
• Articles •

Data Cleaning Algorithm Based on Unsupervised Learning

SUN Tie-min (a), YU Jie (b), SHANG Cheng (c), TIAN Da-xin (c), ZHANG Li-hua (a)

  1. a. Department of Science and Technology; b. College of Communication Engineering; c. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Received: 2008-01-09 Revised: 1900-01-01 Online: 2008-11-20 Published: 2008-11-20
  • Corresponding author: SUN Tie-min

Abstract: To address the problem of similar and duplicate records in a data warehouse, a data cleaning algorithm based on unsupervised learning is proposed. The algorithm uses an adaptive learning method based on the Hebbian postulate, in which the degree of similarity determines the reward and penalty rates. During learning, a new cluster is created whenever no existing cluster is similar to the input pattern, which avoids the dead-neuron (dead-cluster) problem. After learning, the clusters are analyzed to detect wrong ones; each wrong cluster is deleted and merged into the cluster most similar to it, which makes the clustering more accurate. Experiments in which the algorithm is applied to a clustering task show that it performs entity identification accurately.

Key words: data warehouse, data extraction, data transformation, data cleaning, data loading

CLC Number:

  • TP311
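
The abstract gives only an outline of the learning procedure, so the following is a minimal sketch of one way the three steps could fit together, not the authors' implementation. It assumes records have already been turned into numeric feature vectors. The similarity measure (1 / (1 + Euclidean distance)), the thresholds, the reward and penalty rates, and the rule that a cluster is "wrong" when it attracts fewer than `min_members` records are all hypothetical choices made for illustration.

```python
import math


def similarity(x, w):
    """Assumed similarity in (0, 1]: 1 / (1 + Euclidean distance)."""
    return 1.0 / (1.0 + math.dist(x, w))


def learn(patterns, new_threshold=0.3, reward=0.5, penalty=0.05):
    """One online pass: reward the winner, penalize the runner-up, grow on demand."""
    centers = []
    for x in patterns:
        sims = [similarity(x, c) for c in centers]
        # No existing cluster is similar enough: create a new cluster at x.
        if not sims or max(sims) < new_threshold:
            centers.append(list(x))
            continue
        order = sorted(range(len(centers)), key=lambda i: -sims[i])
        win = order[0]
        # Reward: move the winner toward x, scaled by its similarity to x.
        a = reward * sims[win]
        centers[win] = [c + a * (xi - c) for c, xi in zip(centers[win], x)]
        if len(order) > 1:
            # Penalty: push the runner-up away from x, scaled by its similarity.
            rival = order[1]
            b = penalty * sims[rival]
            centers[rival] = [c - b * (xi - c) for c, xi in zip(centers[rival], x)]
    return centers


def assign(patterns, centers):
    """Label each pattern with the index of its most similar cluster."""
    return [
        max(range(len(centers)), key=lambda i: similarity(x, centers[i]))
        for x in patterns
    ]


def prune(patterns, centers, min_members=2):
    """Drop clusters that attract too few patterns (assumed 'wrong' criterion).

    Members of a dropped cluster fall to the most similar surviving cluster
    the next time assign() is called, which plays the role of the merge step.
    """
    counts = [0] * len(centers)
    for label in assign(patterns, centers):
        counts[label] += 1
    kept = [c for c, n in zip(centers, counts) if n >= min_members]
    return kept or centers  # never return an empty set of clusters


# Hypothetical feature vectors: two groups of near-duplicates and one stray record.
patterns = [(0, 0), (0.2, 0), (0, 0.2), (10, 10), (10.2, 10), (10, 10.2), (3, 0)]
centers = learn(patterns)
kept = prune(patterns, centers)
labels = assign(patterns, kept)
print(len(centers), len(kept), labels)
```

With these made-up inputs the stray record (3, 0) first spawns its own cluster, which is then dropped, and the record falls into the first group. For real duplicate detection the numeric vectors and distance would likely be replaced by a field-level string similarity, which the abstract does not specify.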