Journal of Jilin University (Information Science Edition) ›› 2014, Vol. 32 ›› Issue (1): 88-94.

Theme Feature Extraction of Chinese Webpage Based on Vector Space Model

DAI Kuan (a), ZHAO Hui (a), HAN Dong (b), SONG Tian-yong (a)

  1. a. College of Computer Science and Engineering; b. College of Software Vocational Technology, Changchun University of Technology, Changchun 130012, China
  • Received: 2013-08-22 Online: 2014-01-24 Published: 2014-04-03

Abstract:

To address the imprecision of theme feature extraction from Chinese webpages, a feature extraction algorithm for Chinese webpage themes is studied. Theme feature extraction is the basis on which a topic-focused web crawler calculates webpage relevance. Considering the two-class classification of theme webpages, we improve TF-IDF (Term Frequency-Inverse Document Frequency), the commonly used method for weighting text feature items. By combining the semi-structured characteristics of webpages with the position information of each feature, we present a new linear method for calculating feature item weights. This method can effectively improve both the recall rate and the precision rate for theme webpages.
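
The abstract does not give the weighting formula itself, so the following is only a minimal sketch of the general idea: a TF-IDF weight in which each term occurrence is scaled linearly by the position it appears in, followed by cosine similarity in the vector space model. The position names, the values in `POSITION_WEIGHTS`, the add-one smoothing of the IDF, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter

# Hypothetical position weights: the abstract does not state the actual values.
POSITION_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_term_frequency(page):
    """Return normalized, position-weighted term frequencies for one page.

    `page` maps a position name (e.g. "title") to a list of tokens.
    """
    counts = Counter()
    for position, tokens in page.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for token in tokens:
            counts[token] += weight  # linear contribution of each occurrence
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: value / total for term, value in counts.items()}


def inverse_document_frequency(pages):
    """Return an add-one smoothed IDF for every term in the collection."""
    n = len(pages)
    document_frequency = Counter()
    for page in pages:
        document_frequency.update({t for tokens in page.values() for t in tokens})
    return {
        term: math.log((n + 1) / (df + 1)) + 1.0
        for term, df in document_frequency.items()
    }


def feature_vector(page, idf):
    """Combine weighted TF and IDF into a sparse feature vector."""
    tf = weighted_term_frequency(page)
    return {term: value * idf.get(term, 0.0) for term, value in tf.items()}


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection of already-segmented pages (tokens are placeholders).
pages = [
    {"title": ["football", "league"], "body": ["football", "match", "score"]},
    {"title": ["stock", "market"], "body": ["market", "price", "score"]},
]
idf = inverse_document_frequency(pages)
vectors = [feature_vector(page, idf) for page in pages]
topic = feature_vector({"title": ["football"], "body": ["match"]}, idf)

for index, vector in enumerate(vectors):
    print(index, cosine_similarity(topic, vector))
```

In this sketch a term in the title counts three times as much as the same term in the body, so title terms dominate the page vector. A real system would segment the Chinese text into words first and would tune the position weights on labelled pages.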

Key words: term frequency-inverse document frequency (TF-IDF), vector space model, feature, correlation calculation, information gain

CLC Number: 

  • TP391