Parameter Optimization of Text Feature Vector of Naive Bayesian Classifier

Journal of Jilin University Science Edition ›› 2019, Vol. 57 ›› Issue (06): 1479-1485.

FANG Qiulian, WANG Peijin, SUI Yang, ZHENG Hanying, LV Chunyue, WANG Yantong

School of Mathematics and Statistics, Central South University, Changsha 410083, China

Received: 2019-05-07 Online: 2019-11-26 Published: 2019-11-21
Contact: FANG Qiulian E-mail: qiulianfang@csu.edu.cn

Abstract

Abstract: A naive Bayes algorithm was used to build an automatic Chinese text classifier, and the selection of the relevant parameters was studied in order to classify Chinese text efficiently. First, in the model training stage, an N-gram model was used to extract feature vectors from the training data set. Second, the naive Bayes algorithm was used to build the text classifier. Finally, in the model testing stage, the term frequency-inverse document frequency (TF-IDF) algorithm was used to extract feature vectors from the test samples in order to improve classification accuracy. The case analysis shows that, when extracting feature vectors from the training set, the 2-gram and 4-gram models give the best feature extraction; when choosing the feature vector length, a length of 25 000 yields the largest gain in classification accuracy while maintaining high accuracy; and when choosing the part of speech of the feature terms, accuracy is highest when both verbs and nouns are selected and lowest when only verbs are selected.

Key words: naive Bayes classifier, feature selection, TF-IDF algorithm, N-gram model
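The three stages named in the abstract (N-gram feature extraction, naive Bayes training, TF-IDF weighting of test samples) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy corpus, the Laplace smoothing, the smoothed IDF formula, and the choice to weight each test n-gram's log-likelihood by its TF-IDF value are all assumptions made for illustration. Part-of-speech filtering is omitted because it would require a Chinese tagger.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def select_vocabulary(docs, n=2, size=25000):
    """Keep the `size` most frequent n-grams seen in the training documents."""
    counts = Counter(g for d in docs for g in char_ngrams(d, n))
    return [g for g, _ in counts.most_common(size)]


def train_naive_bayes(docs, labels, vocab, n=2):
    """Multinomial naive Bayes with Laplace (add-one) smoothing (assumed)."""
    vocab_set = set(vocab)
    class_docs = Counter(labels)
    gram_counts = defaultdict(Counter)
    for d, y in zip(docs, labels):
        gram_counts[y].update(g for g in char_ngrams(d, n) if g in vocab_set)
    log_prior = {c: math.log(k / len(docs)) for c, k in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(gram_counts[c].values()) + len(vocab)
        log_like[c] = {g: math.log((gram_counts[c][g] + 1) / total) for g in vocab}
    return log_prior, log_like


def idf_weights(docs, vocab, n=2):
    """Smoothed inverse document frequency for every vocabulary n-gram."""
    vocab_set = set(vocab)
    df = Counter()
    for d in docs:
        df.update(set(char_ngrams(d, n)) & vocab_set)
    return {g: math.log((1 + len(docs)) / (1 + df[g])) + 1.0 for g in vocab}


def classify(text, model, idf, n=2):
    """Score each class; test n-grams are weighted by TF-IDF (an assumption)."""
    log_prior, log_like = model
    tf = Counter(g for g in char_ngrams(text, n) if g in idf)
    scores = {
        c: log_prior[c] + sum(cnt * idf[g] * log_like[c][g] for g, cnt in tf.items())
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus, not data from the paper.
train_docs = ["股票市场上涨", "基金股票投资", "足球比赛进球", "篮球比赛得分"]
train_labels = ["finance", "finance", "sports", "sports"]

vocab = select_vocabulary(train_docs, n=2, size=25000)
model = train_naive_bayes(train_docs, train_labels, vocab, n=2)
idf = idf_weights(train_docs, vocab, n=2)

print(classify("股票投资", model, idf))
print(classify("足球得分", model, idf))
```

The `n` and `size` arguments correspond to the two parameters the abstract tunes (N-gram order and feature vector length); on a real corpus they would be varied and compared on held-out accuracy.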

CLC number: TP391.1

FANG Qiulian, WANG Peijin, SUI Yang, ZHENG Hanying, LV Chunyue, WANG Yantong. Parameter Optimization of Text Feature Vector of Naive Bayesian Classifier[J]. Journal of Jilin University Science Edition, 2019, 57(06): 1479-1485.
