J4 ›› 2012, Vol. 30 ›› Issue (5): 544-.

• Article

Improved Feature Selection Method

GUO Xiao-dong1, JIANG Yu-ming2, FEI Fei3

  1. Changchun Engineering Consulting Service Center, Changchun 130042, China; 2. School of Physics and Engineering, Sun Yat-sen University, Zhongshan 510006, China; 3. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Online: 2012-09-28 Published: 2012-11-01
  • About the author: GUO Xiao-dong (b. 1967), female, from Changchun, senior engineer at the Changchun Engineering Consulting Service Center, mainly engaged in research on engineering computation, (Tel) 86-13578786326 (E-mail) zz8837829@163.com.

Abstract:

The traditional mutual information feature selection method is strongly affected by marginal probabilities: a rare word may receive a higher score than a common word, so the method tends to select low-frequency terms. To address this shortcoming, we analyze several traditional feature extraction methods and introduce two parameters, dispersion and average term frequency, which tie the mutual information method to the term frequency of each feature and make classification based on mutual information more accurate. Experimental results show that the proposed method yields better classification results.

Key words: text classification, feature selection, mutual information

CLC Number:

  • TP37
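
The low-frequency bias described in the abstract can be illustrated with a small sketch. It assumes the pointwise form of mutual information commonly used for text feature selection, MI(t, c) = log(P(t, c) / (P(t) P(c))), which this page does not state explicitly. The corpus counts and the function name `pointwise_mi` are hypothetical. The sketch shows only the baseline behaviour; it does not implement the dispersion and average term frequency correction, whose formulas are not given here.

```python
import math


def pointwise_mi(n_tc, n_t, n_c, n):
    """Estimate MI(t, c) = log(P(t, c) / (P(t) * P(c))) from document counts.

    n_tc: documents of class c that contain term t
    n_t:  documents that contain term t
    n_c:  documents of class c
    n:    total number of documents
    """
    return math.log((n_tc * n) / (n_t * n_c))


# Hypothetical corpus: 1000 documents, 100 of them in class c.
N, N_C = 1000, 100

# Rare term: occurs in only 2 documents, both in class c.
rare = pointwise_mi(n_tc=2, n_t=2, n_c=N_C, n=N)

# Common term: occurs in 120 documents, 90 of them in class c.
common = pointwise_mi(n_tc=90, n_t=120, n_c=N_C, n=N)

print(f"rare term:   {rare:.3f}")    # 2.303
print(f"common term: {common:.3f}")  # 2.015
```

In this toy setting the rare term outscores the common one even though it covers 2 of the 100 class documents while the common term covers 90. The small marginal P(t) in the denominator inflates the rare term's score, which is the kind of effect a frequency-aware weighting is meant to counteract.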