Journal of Jilin University (Engineering and Technology Edition) ›› 2022, Vol. 52 ›› Issue (8): 1889-1895. doi: 10.13229/j.cnki.jdxbgxb20210167

• Computer Science and Technology •

Unbalanced text classification method based on deep learning

Xiao-ying LI1, Ming YANG1, Rui QUAN2, Bao-hua TAN3

  1. School of Industrial Design, Hubei University of Technology, Wuhan 430068, China
  2. Hubei Key Laboratory for High-Efficiency Utilization of Solar Energy and Operation Control of Energy Storage System, Hubei University of Technology, Wuhan 430068, China
  3. School of Science, Hubei University of Technology, Wuhan 430068, China
  • Received: 2021-03-05 Online: 2022-08-01 Published: 2022-08-12
  • Corresponding author: Bao-hua TAN E-mail: yangyang1994121@163.com; tan_bh@126.com
  • About the author: Xiao-ying LI (1973-), female, professor, master's degree. Research interests: interaction and information product design, service design, applied ergonomic design. E-mail: yangyang1994121@163.com
  • Funding:
    General Program of the National Natural Science Foundation of China (51977061)

Abstract:

In unbalanced text classification, results are biased toward the majority classes and neglect the minority classes, which degrades classification performance. This paper studies an unbalanced text classification method based on deep learning. Features are selected with the class distinguishing ability (DA) method, which sets the scoring criterion to the minimum difference in document-probability relevance, so that the selected features are evenly distributed across the majority and minority classes and the feature set is better balanced. The selected feature subset is fed to a deep belief network composed of multiple restricted Boltzmann machines (RBMs). Each RBM is pre-trained to obtain the best probability distribution over the training samples, and its weights are determined with the contrastive divergence algorithm. Once the RBM parameters are set, the RBMs are trained iteratively with a greedy algorithm until all texts have been classified. Experimental results show that the method classifies unbalanced text effectively, with a classification accuracy above 99.5%.
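The abstract gives the DA scoring criterion only in words, so the following is a minimal sketch under stated assumptions rather than the authors' formula. It assumes the "document probability" of a term in a class is the fraction of that class's documents containing the term, and that a term's score for a class is the smallest margin by which that probability exceeds the probability in every other class. The function names `doc_probabilities`, `da_scores`, and `select_balanced` are hypothetical, and at least two classes are required.

```python
import numpy as np

def doc_probabilities(X, y):
    """probs[c, t] = fraction of documents of class c that contain term t."""
    classes = np.unique(y)
    present = X > 0
    probs = np.vstack([present[y == c].mean(axis=0) for c in classes])
    return classes, probs

def da_scores(probs):
    """Assumed DA-style score: scores[c, t] is the minimum difference between
    the document probability of term t in class c and in each other class."""
    n_classes, n_terms = probs.shape
    scores = np.empty((n_classes, n_terms))
    for c in range(n_classes):
        others = np.delete(probs, c, axis=0)
        scores[c] = (probs[c] - others).min(axis=0)
    return scores

def select_balanced(scores, per_class):
    """Pick the same number of top-scoring terms for every class, so minority
    classes contribute as many features as majority classes."""
    selected = []
    for c in range(scores.shape[0]):
        order = np.argsort(-scores[c], kind="stable")
        selected.extend(order[:per_class].tolist())
    return sorted(set(selected))

# Toy corpus: 4 majority-class documents, 2 minority-class documents, 4 terms.
X = np.array([
    [1, 0, 1, 1],
    [1, 0, 1, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
])
y = np.array([0, 0, 0, 0, 1, 1])

classes, probs = doc_probabilities(X, y)
scores = da_scores(probs)
selected = select_balanced(scores, per_class=1)
print(selected)
```

Because the score is built from per-class proportions rather than raw counts, a term that marks the small class can rank as highly as one that marks the large class, which is the balancing effect the abstract describes.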

Key words: deep learning, imbalance, text, classification method, deep belief network, document probability, pre-training, contrastive divergence algorithm
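To make the pre-training step in the abstract concrete, here is a minimal sketch of a restricted Boltzmann machine updated with one-step contrastive divergence (CD-1) and stacked greedily, layer by layer. It is not the authors' implementation: the class `RBM`, the function `pretrain_dbn`, the layer sizes, the learning rate, and the epoch count are all assumptions chosen for illustration, and the supervised classification stage on top of the stack is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Binary RBM with visible bias b, hidden bias c, and weight matrix W."""

    def __init__(self, n_visible, n_hidden, rng):
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.b = np.zeros(n_visible)
        self.c = np.zeros(n_hidden)
        self.rng = rng

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.c)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b)

    def cd1_step(self, v0, lr=0.1):
        """One CD-1 update on a batch; returns the reconstruction error."""
        h0 = self.hidden_probs(v0)
        # Sample binary hidden states, then reconstruct the visible layer.
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        # Positive phase minus negative phase, averaged over the batch.
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b += lr * (v0 - v1).mean(axis=0)
        self.c += lr * (h0 - h1).mean(axis=0)
        return float(np.mean((v0 - v1) ** 2))

def pretrain_dbn(X, layer_sizes, epochs=50, seed=0):
    """Greedy layer-wise pre-training: train one RBM, then feed its hidden
    activations to the next RBM as input."""
    rng = np.random.default_rng(seed)
    rbms, data = [], X
    for n_hidden in layer_sizes:
        rbm = RBM(data.shape[1], n_hidden, rng)
        for _ in range(epochs):
            rbm.cd1_step(data)
        rbms.append(rbm)
        data = rbm.hidden_probs(data)
    return rbms, data

# Toy input: 20 documents described by 6 binary (selected) features.
X = (np.random.default_rng(1).random((20, 6)) < 0.5).astype(float)
rbms, features = pretrain_dbn(X, layer_sizes=[4, 3], epochs=5)
print(features.shape)
```

The top-layer activations returned by `pretrain_dbn` would normally be passed to a supervised classifier or fine-tuned with labels; which of these the paper does is not stated in the abstract.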

CLC Number:

  • TP391

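Fig. 1 Comparison of convergence curves of the three methods

Fig. 2 Comparison of classification results of the three methods

Fig. 3 Comparison of classification accuracy

Fig. 4 Comparison of F1 scores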
