Improved SNM Chinese Semantic Duplicate Record Detection Algorithm

Abstract

To detect duplicate records in Chinese data, we propose a duplicate record detection algorithm based on the Sorted-Neighborhood Method (SNM) that integrates the extended version of the synonym word forest with Chinese word segmentation. Word similarity is calculated using the extended synonym word forest and the Jaccard algorithm. Sentences are segmented with a Chinese word segmentation tool in Python, and an optimized cosine similarity algorithm then calculates sentence similarity. The improved algorithm can effectively detect duplicate records at both the field level and the sentence level in Chinese records. Experiments on a test data set of students at a counseling institution show that the recall of the new algorithm is much higher than that of the traditional SNM algorithm.
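The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The toy `SYNONYM_CODES` table, the function names, the window size, and the thresholds are all hypothetical, and the synonym-mapping step stands in for the paper's "optimized" cosine similarity, whose details the abstract does not give. Records are assumed to be already segmented into tokens. In practice a segmenter such as `jieba.lcut` could produce them, though the abstract does not name the tool.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy table: word -> set of synonym-category codes.
# A real system would load the extended synonym word forest instead.
SYNONYM_CODES = {
    "电脑": {"Bo01"},
    "计算机": {"Bo01"},
    "学生": {"Al03"},
    "学员": {"Al03"},
}


def word_similarity(a, b):
    """Jaccard similarity of the synonym-code sets of two words."""
    if a == b:
        return 1.0
    codes_a = SYNONYM_CODES.get(a, set())
    codes_b = SYNONYM_CODES.get(b, set())
    union = codes_a | codes_b
    return len(codes_a & codes_b) / len(union) if union else 0.0


def cosine(counts_a, counts_b):
    """Standard cosine similarity between two token-count vectors."""
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = sqrt(sum(v * v for v in counts_a.values()))
    norm_b = sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sentence_similarity(tokens_a, tokens_b, word_threshold=0.8):
    """Cosine similarity after mapping synonyms in b onto a's vocabulary.

    Assumption: this mapping is one plausible way to make cosine
    similarity synonym-aware; the paper's actual optimization may differ.
    """
    mapped_b = []
    for tb in tokens_b:
        match = next(
            (ta for ta in tokens_a if word_similarity(ta, tb) >= word_threshold),
            tb,  # keep the original token when no synonym is found
        )
        mapped_b.append(match)
    return cosine(Counter(tokens_a), Counter(mapped_b))


def snm_duplicates(records, key, window=3, threshold=0.8):
    """Sorted-neighborhood pass: sort by key, compare within a sliding window.

    records: list of token lists; returns a set of (i, j) index pairs, i < j.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Each record is compared with the next (window - 1) records only.
        for j in order[pos + 1 : pos + window]:
            if sentence_similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs
```

Calling `snm_duplicates(records, key=lambda r: "".join(r))` on a list of token lists returns the index pairs judged to be duplicates. As with any sorted-neighborhood approach, the sort key and window size determine which pairs are ever compared.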

CLC Number: TP311

YUAN Man, MU Yonghao, WANG Guiyou, YU Zaifu. Improved SNM Chinese Semantic Duplicate Record Detection Algorithm [J]. Journal of Jilin University (Information Science Edition), 2021, 39(3): 348-356.

