Journal of Jilin University Science Edition ›› 2023, Vol. 61 ›› Issue (5): 1147-1158.


Review Text Clustering Algorithm Based on Improved DEC

CHEN Kejia1, XIA Ruidong1, LIN Hongxi2   

  1. School of Economics and Management, Fuzhou University, Fuzhou 350108, China;
  2. School of Business, Putian University, Putian 351100, Fujian Province, China
  • Received: 2022-09-15 Online: 2023-09-26 Published: 2023-09-26

Abstract: In the original deep embedding clustering (DEC) algorithm, the initial number of clusters and the cluster centers derived from the clustering layer are highly random, which impairs the algorithm's effectiveness. To address this problem, we propose a review text clustering algorithm based on improved DEC for the unsupervised clustering of e-commerce review data without category labels. Firstly, we obtain a BERT-LDA vectorized representation of the dataset that combines the sentence embedding vector with the topic distribution vector. Secondly, we improve the DEC algorithm: the autoencoder reduces the dimensionality, and a clustering layer is stacked after the encoder, with the number of clusters in the clustering layer selected according to topic coherence. Meanwhile, the topic feature vectors are used as custom cluster centers, and the encoder and clustering layer are then trained jointly to improve clustering accuracy. Finally, a visualization tool is used to display the clustering results. To verify the effectiveness of the proposed algorithm, unsupervised clustering training was conducted on an unlabeled product review dataset using the proposed algorithm and six comparison algorithms. The results show that the proposed algorithm achieves the best silhouette coefficient (0.2135) and Calinski-Harabasz (CH) index (2958.18), indicating that it can effectively handle e-commerce review data and reflect users' attention to the products.
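
The clustering layer mentioned in the abstract can be illustrated with a minimal NumPy sketch of the standard DEC formulation: a Student's t soft assignment of latent points to cluster centers, a sharpened target distribution, and the KL divergence that joint training minimizes. This is not the authors' implementation. The function names, the toy latent codes `z`, and the hand-picked `centers` are assumptions made for illustration; in the paper the centers come from topic feature vectors, and how those are mapped into the latent space is not stated in the abstract.

```python
import numpy as np

def soft_assign(z, centers, alpha=1.0):
    """Student's t soft assignment q[i, j] of latent point i to cluster j."""
    # Squared Euclidean distance between every latent point and every center.
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize each row so the assignments of one point sum to 1.
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target p[i, j]: square q, divide by soft cluster frequency."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_loss(p, q):
    """KL(P || Q), the clustering loss minimized during joint training."""
    return float(np.sum(p * np.log(p / q)))

# Toy latent codes (stand-ins for encoder outputs) forming two groups.
z = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
# Hypothetical custom initial centers, chosen by hand for this example.
centers = np.array([[0.0, 0.0], [5.0, 5.0]])

q = soft_assign(z, centers)
p = target_distribution(q)
loss = kl_loss(p, q)
labels = q.argmax(axis=1)
```

In a full training loop the loss gradient would update both the encoder weights and the centers. Supplying informed initial centers, rather than random ones, is the point at which the improved algorithm departs from the original DEC.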

Key words: BERT model, LDA model, deep embedding clustering, autoencoder, clustering

CLC Number: TP391.1