An Improved TF-IDF Text Clustering Method

Journal of Jilin University (Science Edition), 2021, Vol. 59, Issue 5: 1199-1204.

ZHANG Lei¹, JIANG Yu², SUN Li¹

1. Division of Development and Strategic Planning, Jilin University, Changchun 130012, China;
2. College of Computer Science and Technology, Jilin University, Changchun 130012, China

Received: 2020-11-10 Online: 2021-09-26 Published: 2021-09-26
Corresponding author: ZHANG Lei E-mail: zhlei@jlu.edu.cn

Abstract

Abstract: To address the shortcomings of the traditional term frequency-inverse document frequency (TF-IDF) algorithm when classifying text with specific attributes, in particular its low accuracy when a word carries a special meaning within a particular category, we propose an improved TF-IDF text clustering algorithm. Comparative experiments were carried out on papers published by scientific research institutions in Jilin Province from 2015 to 2019. The improved TF-IDF algorithm and the traditional TF-IDF algorithm were each used to compute keyword frequencies in the papers, the results were clustered with the K-means++ algorithm, and a random forest algorithm was then used to evaluate the accuracy of each clustering. The experimental results show that the improved TF-IDF algorithm increases classification accuracy.

Key words: term frequency-inverse document frequency (TF-IDF), hybrid clustering, interdisciplinary, essential science indicators (ESI) literature

CLC number: TP181

ZHANG Lei, JIANG Yu, SUN Li. An Improved TF-IDF Text Clustering Method[J]. Journal of Jilin University Science Edition, 2021, 59(5): 1199-1204.
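The abstract does not state the modified weighting, so the following is only a minimal sketch of the traditional TF-IDF baseline that the paper compares against, applied to per-paper keyword lists. The function names `tf_idf` and `cosine` and the keyword lists are hypothetical illustrations, not the authors' code or data. The resulting weight vectors are the kind of input that a K-means++ clustering step would consume.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using classic TF-IDF.

    tf(t, d) = occurrences of t in d / number of terms in d
    idf(t)   = log(N / df(t)), with df(t) the number of documents containing t
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical keyword lists (assumption: one list per paper, not paper data).
papers = [
    ["graphene", "catalysis", "model"],
    ["graphene", "electrode", "model"],
    ["clustering", "text", "model"],
    ["clustering", "classification", "text"],
]
vectors = tf_idf(papers)

# Keywords of the first paper, from most to least distinctive.
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.4f}")
```

In this toy input, "model" appears in three of the four keyword lists and therefore receives a low weight everywhere, regardless of whether it carries a special meaning in one field. That is the kind of case the abstract identifies as a weakness of the traditional scheme.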
