Journal of Jilin University (Information Science Edition), 2021, Vol. 39, Issue (5): 553-561.


CBLGA and CBLCA Hybrid Models for Long and Short Text Classification

WANG Deqiang, WU Jun, WANG Liping

  1. Department of Mechanical Engineering, Tsinghua University, Beijing 100084, China
  • Received: 2021-05-04  Online: 2021-10-01  Published: 2021-10-01
  • About the authors: WANG Deqiang (1997—), male, born in Dezhou, Shandong; master's student at Tsinghua University; main research interests: natural language processing and data mining; (Tel) 86-18813188721, (E-mail) 15504661292@163.com. WU Jun (1978—), male, born in Guangshan, Henan; associate professor, Ph. D., and doctoral supervisor at Tsinghua University; main research interests: intelligent manufacturing and the design and control of advanced manufacturing equipment; (Tel) 86-10-62772633, (E-mail) jhwu@mail.tsinghua.edu.cn. WANG Liping (1967—), male, born in Liaoyuan, Jilin; professor, Ph. D., and doctoral supervisor at Tsinghua University; main research interests: advanced manufacturing equipment and its control; (Tel) 86-10-62772633, (E-mail) lpwang@mail.tsinghua.edu.cn.
  • Supported by:
    National Key Research and Development Program of China (2018YFB1703502)

CBLGA and CBLCA Hybrid Models for Long and Short Text Classification

WANG Deqiang, WU Jun, WANG Liping   

  1. Department of Mechanical Engineering, Tsinghua University, Beijing 100084, China
  • Received: 2021-05-04  Online: 2021-10-01  Published: 2021-10-01

Abstract: To improve both the accuracy and the efficiency of text classification, an attention-based CNN-BiLSTM/BiGRU hybrid text classification model (CBLGA) is constructed. First, CNN (Convolutional Neural Networks) branches with different convolution window sizes are run in parallel to extract several kinds of local features simultaneously; the data are then fed into a parallel combination of BiLSTM and BiGRU, which extracts global features closely related to the context of the whole text; finally, the features produced by the two branches are fused and an attention mechanism is introduced. An attention-based CNN-BiLSTM/CNN hybrid text classification model (CBLCA) is also constructed. Its key feature is that the CNN output is split into two parts: one part is fed into the BiLSTM network, while the other is fused directly with the output of the BiLSTM network, so the local features of the character sequence extracted by the CNN are preserved and the global features extracted by the BiLSTM network are still exploited. Experiments show that both the CBLGA model and the CBLCA model achieve effective improvements in accuracy and efficiency. Finally, a classification pipeline is established that applies the appropriate preprocessing and subsequent classification steps to texts of different lengths, so that the model improves both the accuracy and the efficiency of text classification for long as well as short texts.

Key words: CBLGA model; CBLCA hybrid model; attention mechanism; hybrid model; text classification
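To make the CBLGA architecture described in the abstract more concrete (parallel CNN branches with different window sizes, a parallel BiLSTM/BiGRU stage, and attention-weighted fusion), a minimal sketch in PyTorch is given below. This is an illustrative reconstruction, not the authors' implementation: the class name CBLGA, all layer sizes, the additive attention form, and the final classifier are assumptions.

```python
# Minimal, hypothetical sketch of a CBLGA-style model (assumed hyper-parameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBLGA(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, n_filters=64,
                 window_sizes=(2, 3, 4), hidden_dim=64, n_classes=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Parallel 1-D convolutions with different window (kernel) sizes
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, n_filters, k, padding=k // 2) for k in window_sizes])
        conv_dim = n_filters * len(window_sizes)
        # BiLSTM and BiGRU applied in parallel to the concatenated local features
        self.bilstm = nn.LSTM(conv_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.bigru = nn.GRU(conv_dim, hidden_dim, bidirectional=True, batch_first=True)
        # Additive attention over the fused time steps, then a linear classifier
        self.attn = nn.Linear(4 * hidden_dim, 1)
        self.fc = nn.Linear(4 * hidden_dim, n_classes)

    def forward(self, x):                       # x: (batch, seq_len) token ids
        e = self.embedding(x).transpose(1, 2)   # (batch, embed_dim, seq_len)
        # Multi-window local features, truncated to a common length and concatenated
        local = torch.cat([F.relu(conv(e))[..., :x.size(1)] for conv in self.convs], dim=1)
        local = local.transpose(1, 2)           # (batch, seq_len, conv_dim)
        h_lstm, _ = self.bilstm(local)          # global features, (batch, seq_len, 2*hidden)
        h_gru, _ = self.bigru(local)            # global features, (batch, seq_len, 2*hidden)
        fused = torch.cat([h_lstm, h_gru], dim=-1)          # fuse the two branches
        weights = torch.softmax(self.attn(fused), dim=1)    # attention weights over time
        context = (weights * fused).sum(dim=1)              # weighted sum of time steps
        return self.fc(context)                 # class logits
```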

Abstract: With the development of information technology, large volumes of text need to be classified in many industries. To improve the accuracy and the efficiency of classification at the same time, a CNN-BiLSTM/BiGRU hybrid text classification model based on the attention mechanism (CBLGA) is proposed, in which parallel CNNs (Convolutional Neural Networks) with different window sizes extract a variety of local text features, and the data are then fed into a parallel BiLSTM/BiGRU model. The BiLSTM/BiGRU combination is used to extract global features related to the context of the whole text; finally, the features of the two branches are fused and the attention mechanism is introduced. Secondly, another CNN-BiLSTM/CNN hybrid text classification model based on the attention mechanism (CBLCA) is proposed. Its key feature is that the CNN output is divided into two parts: one part is fed into the BiLSTM network, and the other is fused with the output of the BiLSTM network, which retains the local text features extracted by the CNN as well as the global text features extracted by the BiLSTM. Experiments show that the CBLGA and CBLCA models achieve effective improvements in accuracy and efficiency. Finally, a set of preprocessing and classification procedures for texts of different lengths is established, so that the model improves the accuracy and efficiency of text classification for both long and short texts.

Key words: CBLGA model, CBLCA model, attention mechanism, hybrid model, text classification
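The CBLCA variant described above, splitting the CNN output so that one part feeds the BiLSTM while the other part is fused directly with the BiLSTM output, could be sketched in the same style. Again, this is only an assumed illustration: the even channel split, the attention layer, and all hyper-parameters are guesses rather than details taken from the paper.

```python
# Minimal, hypothetical sketch of a CBLCA-style model (assumed hyper-parameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBLCA(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, n_filters=128,
                 kernel_size=3, hidden_dim=64, n_classes=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size, padding=kernel_size // 2)
        # Half of the CNN channels go through the BiLSTM, the other half bypass it
        self.bilstm = nn.LSTM(n_filters // 2, hidden_dim,
                              bidirectional=True, batch_first=True)
        fused_dim = 2 * hidden_dim + n_filters // 2
        self.attn = nn.Linear(fused_dim, 1)
        self.fc = nn.Linear(fused_dim, n_classes)

    def forward(self, x):                                 # x: (batch, seq_len) token ids
        e = self.embedding(x).transpose(1, 2)             # (batch, embed_dim, seq_len)
        c = F.relu(self.conv(e)).transpose(1, 2)          # (batch, seq_len, n_filters)
        c_rnn, c_skip = torch.chunk(c, 2, dim=-1)         # split the CNN output in two
        h, _ = self.bilstm(c_rnn)                         # global features from one half
        fused = torch.cat([h, c_skip], dim=-1)            # keep local + global features
        weights = torch.softmax(self.attn(fused), dim=1)  # attention over time steps
        context = (weights * fused).sum(dim=1)
        return self.fc(context)                           # class logits
```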

CLC number: 

  • TP391