Chinese Text Clustering Algorithm Based on Semantic Cluster

Abstract: Chinese text clustering is influenced by semantic, grammatical and contextual factors. When text is quantified with the traditional vector space model, the text vectors are treated as independent of one another and the semantic relations between words are ignored, which degrades the results of cluster analysis. To address this problem, we propose a Chinese text clustering algorithm based on semantic clusters. The algorithm is built on the principles of word co-occurrence and semantic relevance. First, the term frequency-inverse document frequency (TF-IDF) method is used to obtain the weights of feature words, and the collocation vectors of the feature words are used to construct semantic clusters. Second, using the weights of the feature words and their collocation words, the feature words are spatially transformed to the semantic cluster centers, yielding document vectors that embed semantic information. Finally, the document vectors are used for K-means cluster analysis. The experimental results show that this vectorization method effectively improves how closely the text vectors approximate text semantics, and improves the accuracy and recall of the text clustering results.
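The three steps summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no formulas, so the smoothed IDF, the use of document-level co-occurrence counts as a stand-in for collocation vectors, the row-normalized spreading of weights as the "transformation to the semantic cluster center", and all function names (`tfidf_matrix`, `collocation_matrix`, `embed`, `kmeans`) are assumptions made for illustration. Documents are assumed to be already segmented into words.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """TF-IDF weights for pre-segmented documents (lists of tokens)."""
    vocab = sorted({w for d in docs for w in d})
    df = Counter(w for d in docs for w in set(d))
    n = len(docs)
    rows = []
    for d in docs:
        tf = Counter(d)
        # Assumption: smoothed idf = log(n / df) + 1, not the paper's formula.
        rows.append([tf[w] / len(d) * (math.log(n / df[w]) + 1.0) for w in vocab])
    return vocab, rows

def collocation_matrix(docs, vocab):
    """Row-normalized document-level co-occurrence counts.

    Assumption: this stands in for the paper's collocation vectors.
    Row i describes the hypothetical semantic cluster around word i.
    """
    index = {w: i for i, w in enumerate(vocab)}
    m = len(vocab)
    cooc = [[0.0] * m for _ in range(m)]
    for d in docs:
        ids = [index[w] for w in set(d)]
        for i in ids:
            for j in ids:
                cooc[i][j] += 1.0  # i == j counts the word's own documents
    normalized = []
    for row in cooc:
        total = sum(row)  # > 0 because every vocabulary word occurs somewhere
        normalized.append([c / total for c in row])
    return normalized

def embed(rows, sim):
    """Spread each word's TF-IDF weight over its collocates (row times sim)."""
    m = len(sim)
    return [[sum(r[i] * sim[i][j] for i in range(m)) for j in range(m)]
            for r in rows]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's K-means with deterministic farthest-first initialization."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centers))
        centers.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(v, centers[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy corpus (made up for illustration): two sports and two finance documents.
docs = [
    ["足球", "比赛", "球员"],
    ["篮球", "比赛", "球员"],
    ["股票", "市场", "投资"],
    ["基金", "市场", "投资"],
]
vocab, weights = tfidf_matrix(docs)
doc_vectors = embed(weights, collocation_matrix(docs, vocab))
labels = kmeans(doc_vectors, k=2)
print(labels)
```

In this toy corpus the first document never contains 篮球, yet after the embedding step its vector carries weight on that word because 篮球 co-occurs with 比赛 and 球员. This is the sense in which the document vector is no longer independent of semantically related words, and it is why the two sports documents end up in one cluster and the two finance documents in the other.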

Key words: vector, feature word, semantic cluster, semantic embedding, cluster analysis

CLC Number: TP391

QI Xiangming, SUN Xujiao. Chinese Text Clustering Algorithm Based on Semantic Cluster [J]. Journal of Jilin University Science Edition, 2019, 57(5): 1193-1199.

