
Journal of Jilin University (Information Science Edition) ›› 2021, Vol. 39 ›› Issue (3): 348-356.


Improved SNM Chinese Semantic Duplicate Record Detection Algorithm

YUAN Man¹, MU Yonghao¹, WANG Guiyou², YU Zaifu¹

  1. School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318, China; 2. Information Center, Zhaodong Branch of the Tenth Oil Production Plant of Daqing, Heilongjiang Province, Daqing 163000, China
  • Received: 2020-11-16 Online: 2021-05-24 Published: 2021-05-25

Abstract: To detect duplicate records in Chinese data, we propose a duplicate record detection algorithm based on the sorted-neighborhood method (SNM) that integrates the extended version of the synonym word forest with Chinese word segmentation. Word similarity is calculated using the extended synonym word forest and the Jaccard algorithm. Sentences are segmented with Chinese word segmentation in Python, and sentence similarity is calculated with an optimized cosine similarity algorithm. The improved algorithm can effectively detect duplicate records in both fields and sentences written in Chinese. Experiments on a test data set of students at a counseling institution show that the recall of the new algorithm is much higher than that of the traditional SNM algorithm.
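
The general technique named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the synonym-forest word similarity and the paper's optimization of cosine similarity are not reproduced here, and the function names, the sort key, the window size, the threshold, and the sample records are all assumptions chosen for illustration. Records are assumed to be already segmented into tokens.

```python
from collections import Counter
from math import sqrt


def jaccard(a, b):
    """Jaccard similarity of two token collections, treated as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity of two token lists as term-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def snm_duplicates(records, key, similarity, window=3, threshold=0.8):
    """Sorted-neighborhood pass: sort by key, then compare each record
    only with the previous (window - 1) records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical pre-segmented records (token lists); a real pipeline would
# obtain these from a Chinese word segmentation tool.
records = [
    ["大庆", "东北", "石油", "大学"],
    ["吉林", "大学"],
    ["东北", "石油", "大学", "大庆"],
]

# Assumed sort key: tokens sorted and joined, so word order does not matter.
pairs = snm_duplicates(records, key=lambda r: "".join(sorted(r)), similarity=cosine)
print(pairs)  # [(0, 2)]
```

Records 0 and 2 contain the same tokens in a different order, so they sort next to each other and are flagged as a duplicate pair. Because SNM compares only records inside the sliding window, detection quality depends heavily on the choice of sort key and window size.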

Key words: similar duplicate records, sorted-neighborhood method (SNM) algorithm, Chinese word segmentation

CLC Number: TP311