中文文本分类相关算法的研究与实现

J4 ›› 2009, Vol. 47 ›› Issue (4): 790-794.

徐沛娟, 李雄飞, 惠玥, 张桂林

吉林大学计算机科学与技术学院, 长春 130012

收稿日期:2009-01-14 出版日期:2009-07-26 发布日期:2009-08-24
通讯作者: 李雄飞 E-mail:lxf@jlu.edu.cn

Research and Implementation of Related Algorithm of Chinese Text Categorization

XU Pei-Juan, LI Xiong-Fei, HUI Yue, ZHANG Gui-Lin

College of Computer Science and Technology, Jilin University, Changchun 130012, China

Received:2009-01-14 Online:2009-07-26 Published:2009-08-24
Contact: LI Xiong-Fei E-mail:lxf@jlu.edu.cn

摘要/Abstract

摘要：

通过对分词歧义处理情况的分析, 提出一种基于上下文的双向扫描分词算法, 对分词词典进行改进, 将词组短语的固定搭配引入词典中. 讨论了特征项的选择及权重的设定, 并引进χ²统计量参与项的权值计算, 解决了目前通用TF-IDF加权法的不足, 同时提出了项打分分类算法, 提高了特征项对于文本分类的有效性.
实验结果表明, 改进后的权重计算方法性能更优越.

关键词: 文本分类；上下文双向扫描；向量空间模型；权重；特征选择

Abstract:

Based on an analysis of how ambiguity in Chinese word segmentation is handled, this paper proposes a context-based bidirectional scanning word segmentation algorithm. To improve the segmentation dictionary, fixed collocations of phrases are added to it. The paper then discusses feature selection and the setting of term weights, and introduces the χ² statistic into the term weight calculation to address the shortcomings of the commonly used TF-IDF weighting scheme. It also proposes a term-scoring classification algorithm that makes feature terms more effective for text categorization. Experimental results show that the improved weighting method performs better.

Key words: text categorization; context bidirectional scan; vector space model; weighting scheme; feature selection

CLC number: TP391.1

徐沛娟, 李雄飞, 惠玥, 张桂林. 中文文本分类相关算法的研究与实现[J]. J4, 2009, 47(4): 790-794.

XU Pei-Juan, LI Xiong-Fei, HUI Yue, ZHANG Gui-Lin. Research and Implementation of Related Algorithm of Chinese Text Categorization[J]. J4, 2009, 47(4): 790-794.
