Journal of Jilin University (Science Edition) ›› 2021, Vol. 59 ›› Issue (5): 1199-1204.


An Improved TF-IDF Text Clustering Method

ZHANG Lei1, JIANG Yu2, SUN Li1

  1. Division of Development and Strategic Planning, Jilin University, Changchun 130012, China
  2. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Received: 2020-11-10  Online: 2021-09-26  Published: 2021-09-26
  • Corresponding author: ZHANG Lei  E-mail: zhlei@jlu.edu.cn

Abstract: The traditional term frequency-inverse document frequency (TF-IDF) algorithm performs poorly when classifying texts with specific attributes, and its accuracy is especially low when a word carries a special meaning within a particular category. To address this, we propose an improved TF-IDF text clustering algorithm. Comparative experiments were carried out on papers published by scientific research institutions in Jilin Province from 2015 to 2019. The improved and the traditional TF-IDF algorithms were each used to calculate the frequency of keywords in the papers, the results were then clustered with the K-means++ algorithm, and finally a random forest algorithm was used to evaluate the accuracy of each clustering. The experimental results show that the improved TF-IDF algorithm improves classification accuracy.

Key words: term frequency-inverse document frequency (TF-IDF), hybrid clustering, interdisciplinary, Essential Science Indicators (ESI) literature
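
For reference, the traditional TF-IDF baseline named in the abstract can be sketched in a few lines of Python. This illustrates only the classic weighting, not the improved algorithm proposed in the paper, whose modification is not described on this page. The function name `tf_idf`, the unsmoothed natural-log IDF, and the sample keyword lists are assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classic TF-IDF over documents given as lists of keywords.

    Assumed variant: tf = count / document length,
    idf = ln(N / df), with no smoothing.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical keyword lists standing in for paper keywords.
docs = [
    ["graphene", "catalysis", "graphene"],
    ["catalysis", "polymer"],
    ["genome", "polymer", "sequencing"],
]

for doc_weights in tf_idf(docs):
    print(doc_weights)
```

A keyword that appears in every document receives a weight of zero under this variant, because its inverse document frequency is ln(1) = 0. The resulting weight vectors would then be the input to the clustering step (K-means++ in the paper).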

CLC Number: 

  • TP181