Journal of Jilin University Science Edition ›› 2021, Vol. 59 ›› Issue (5): 1199-1204.

An Improved TF-IDF Text Clustering Method

ZHANG Lei1, JIANG Yu2, SUN Li1   

  1. Division of Development and Strategic Planning, Jilin University, Changchun 130012, China;
    2. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Received: 2020-11-10 Online: 2021-09-26 Published: 2021-09-26

Abstract: To address the shortcomings of the traditional term frequency-inverse document frequency (TF-IDF) algorithm when classifying texts with specific attributes, in particular its low accuracy for words that carry a specific meaning within a specific category, we propose an improved TF-IDF text clustering algorithm. Comparative experiments were carried out on papers published by scientific research institutions in Jilin Province from 2015 to 2019. The improved TF-IDF algorithm and the traditional TF-IDF algorithm were each used to calculate the frequency of keywords in the papers, and the K-means++ method was then used for clustering. Finally, a random forest algorithm was used to evaluate the accuracy of the clustering. The experimental results show that the improved TF-IDF algorithm improves classification accuracy.
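
For orientation, the traditional TF-IDF baseline mentioned in the abstract can be sketched as follows. This is a minimal illustration of the common textbook weighting (relative term frequency multiplied by the logarithm of inverse document frequency), not the improved algorithm proposed in the paper, whose formula is not given in the abstract. The function name `tf_idf` and the toy keyword lists are assumptions made for this example only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (classic TF-IDF).

    Assumed weighting: (count / document length) * ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical keyword lists standing in for three papers.
docs = [
    ["clustering", "text", "tfidf"],
    ["clustering", "kmeans"],
    ["forest", "random", "clustering"],
]
weights = tf_idf(docs)
```

In this toy corpus the term `clustering` occurs in every document, so its inverse document frequency is ln(3/3) = 0 and its weight is zero in all three documents, while a term that occurs in only one document keeps a positive weight.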

Key words: term frequency-inverse document frequency (TF-IDF), hybrid clustering, interdisciplinary, essential science indicators (ESI) literature

CLC Number: TP181