吉林大学学报(理学版) (Journal of Jilin University (Science Edition)), 2023, Vol. 61, Issue 5: 1147-1158.

Review Text Clustering Algorithm Based on Improved DEC

CHEN Kejia1, XIA Ruidong1, LIN Hongxi2   

  1. School of Economics and Management, Fuzhou University, Fuzhou 350108, China;
    2. School of Business, Putian University, Putian 351100, Fujian Province, China
  • Received: 2022-09-15  Online: 2023-09-26  Published: 2023-09-26
  • Corresponding author: LIN Hongxi  E-mail: ptulhx@163.com

Abstract: In the original deep embedding clustering (DEC) algorithm, the initial number of clusters and the initial cluster centers produced by the clustering layer are highly random, which degrades the performance of the algorithm. To address this problem, we proposed a review text clustering algorithm based on improved DEC for unsupervised clustering of unlabeled e-commerce review data. Firstly, a BERT-LDA vectorized representation of the dataset was obtained by fusing sentence embedding vectors with topic distribution vectors. Secondly, the DEC algorithm was improved: an autoencoder was used for dimensionality reduction, and a clustering layer was stacked after the encoder, where the number of clusters was selected based on topic coherence and the topic feature vectors were used as custom cluster centers; the encoder and the clustering layer were then trained jointly to improve clustering accuracy. Finally, a visualization tool was used to display the clustering results intuitively. To verify the effectiveness of the proposed algorithm, it and six comparison algorithms were trained for unsupervised clustering on an unlabeled product review dataset. The results show that the proposed algorithm achieves the best results, a silhouette coefficient of 0.213 5 and a Calinski-Harabasz (CH) index of 2 958.18, indicating that it can effectively handle e-commerce review data and reflect users' attention to the products.
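
The pipeline summarized in the abstract can be illustrated with a short, self-contained Python sketch. It is not the authors' implementation: the sentence encoder, the toy corpus, the use of PCA as a stand-in for the paper's autoencoder, and the topic-probability-weighted initialization of the cluster centers are assumptions made for brevity; the full method instead trains the encoder and the cluster centers jointly by minimizing the KL divergence between the soft assignment Q and the target distribution P.

# Minimal sketch of the BERT-LDA + DEC-style clustering pipeline (illustrative only).
# Assumptions: sentence-transformers encoder, gensim LDA, PCA in place of the
# paper's autoencoder, topic-weighted means as the custom cluster centers.
import numpy as np
from sentence_transformers import SentenceTransformer
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, calinski_harabasz_score


def bert_lda_features(texts, tokens, num_topics):
    """Concatenate BERT sentence embeddings with LDA topic distributions."""
    bert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed encoder
    sent_vecs = np.asarray(bert.encode(texts))                  # (n, d_bert)

    dictionary = Dictionary(tokens)
    corpus = [dictionary.doc2bow(doc) for doc in tokens]
    lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics, random_state=0)

    topic_vecs = np.zeros((len(corpus), num_topics))            # (n, K)
    for i, bow in enumerate(corpus):
        for t, p in lda.get_document_topics(bow, minimum_probability=0.0):
            topic_vecs[i, t] = p

    # Topic coherence (c_v); in the paper the cluster count is the K that maximizes it.
    coherence = CoherenceModel(model=lda, texts=tokens, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    return np.hstack([sent_vecs, topic_vecs]), topic_vecs, coherence


def dec_soft_assign(z, centers, alpha=1.0):
    """DEC soft assignment Q: Student's t similarity between points and centers."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)


def target_distribution(q):
    """Sharpened target distribution P used in DEC's KL(P || Q) objective."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)


# --- toy usage --------------------------------------------------------------
texts = ["screen is bright and sharp", "battery drains too fast",
         "fast shipping and well packaged", "battery life is disappointing"]
tokens = [t.split() for t in texts]      # real reviews would be tokenized properly
K = 2                                    # choose by comparing coherence over several K

X, topic_vecs, coh = bert_lda_features(texts, tokens, num_topics=K)

# Dimensionality reduction; the paper trains an autoencoder at this step.
z = PCA(n_components=2, random_state=0).fit_transform(X)

# Custom cluster centers from topic structure (topic-probability-weighted means)
# instead of the random/k-means initialization of the original DEC.
w = topic_vecs / (topic_vecs.sum(axis=0, keepdims=True) + 1e-12)
centers = w.T @ z

q = dec_soft_assign(z, centers)
p = target_distribution(q)               # the full method trains on KL(P || Q)
labels = q.argmax(axis=1)

if len(np.unique(labels)) > 1:           # a tiny toy corpus may collapse to one cluster
    print("silhouette:", silhouette_score(z, labels))
    print("Calinski-Harabasz:", calinski_harabasz_score(z, labels))

In the full algorithm, Q and P are recomputed as training proceeds and KL(P || Q) is back-propagated through both the encoder and the cluster centers; together with the topic-based initialization, this is what reduces the sensitivity to random initial clusters and centers described in the abstract.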

Key words: BERT model, LDA model, deep embedding clustering, autoencoder, clustering

CLC number: TP391.1