Journal of Jilin University (Information Science Edition) ›› 2021, Vol. 39 ›› Issue (6): 751-757.

Classification Method of Microblog Violence Text Based on Improved TFIDF-Logistic Regression

LIU Sixin a, GAO Jun b, TIAN Yilong b, WEI Yunli b, LI Xurui b, WU Jing b

  1. a. College of Automotive Engineering; b. College of Computer Science and Technology, Jilin University, Changchun 130022, China
  • Received: 2021-04-14 Online: 2021-12-01 Published: 2021-12-02
  • Corresponding author: WU Jing (born 1973), female, from Wuhan, lecturer at Jilin University, Ph.D., working mainly on artificial intelligence and intelligent learning algorithms. (Tel) 86-13009002589 (E-mail) wujing@jlu.edu.cn
  • About the author: LIU Sixin (born 2000), male, from Shehong, Sichuan, undergraduate student at Jilin University, working mainly on machine learning. (Tel) 86-18699327350 (E-mail) 1213297436@qq.com
  • Supported by:
    Jilin Province College Students' Innovation Training Fund Project (202010183566)

Abstract: To address the automatic identification and detection of violent speech on Weibo, and following a review of domestic and international research on violent text recognition, a dataset is built from a microblog corpus and cleaned, and an improved TFIDF (Term Frequency-Inverse Document Frequency) text vectorization method is proposed. Vectors produced by the traditional method and by the improved method are each used as input to a logistic regression model, yielding two logistic regression classifiers for violent text, one per vectorization method. The two models are evaluated and compared. Experimental results show that the improved method achieves an AUC of 0.969 and an accuracy of 0.970, which are 14.4% and 15.5% higher, respectively, than those of the traditional method.

Key words: internet violence, microblog text, text vectorization, text classification, machine learning
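
The abstract does not say how the improved weighting is computed, so the sketch below covers only the traditional baseline it is compared against: plain TF-IDF vectors fed to a logistic regression classifier and scored by accuracy and AUC. Everything in it is an assumption made for illustration, not the authors' implementation: the toy English tokens stand in for segmented and cleaned Weibo text, the function names are hypothetical, the training settings are arbitrary, and the model is scored on its own training data only to keep the example short.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Plain TF-IDF: tf = count / doc length, idf = log(N / document frequency)."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[t] / len(doc) * idf[t] for t in vocab])
    return vocab, vectors

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def train_logreg(X, y, lr=1.0, epochs=500):
    """Logistic regression fitted by stochastic gradient descent on log loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of log loss w.r.t. z
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy corpus: 1 = violent, 0 = non-violent (already tokenized).
docs = [
    ["i", "will", "hurt", "you"],
    ["you", "are", "worthless", "trash"],
    ["nice", "weather", "today"],
    ["thank", "you", "for", "the", "help"],
]
labels = [1, 1, 0, 0]

vocab, X = tfidf_vectors(docs)
w, b = train_logreg(X, labels)
scores = [predict_proba(w, b, x) for x in X]
preds = [1 if s >= 0.5 else 0 for s in scores]
print("accuracy:", accuracy(labels, preds), "AUC:", auc(labels, scores))
```

In a real comparison of the kind the abstract reports, the same classifier and the same held-out test split would be used for both vectorization methods, so that any difference in AUC or accuracy can be attributed to the term weighting alone.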

CLC Number:

  • TP3