吉林大学学报(理学版) (Journal of Jilin University (Science Edition)), 2023, Vol. 61, Issue 5: 1147-1158.

Review Text Clustering Algorithm Based on Improved DEC

CHEN Kejia1, XIA Ruidong1, LIN Hongxi2   

  1. School of Economics and Management, Fuzhou University, Fuzhou 350108, China;
    2. School of Business, Putian University, Putian 351100, Fujian Province, China
  • Received: 2022-09-15  Online: 2023-09-26  Published: 2023-09-26
  • Corresponding author: LIN Hongxi  E-mail: ptulhx@163.com

Abstract: In the original deep embedding clustering (DEC) algorithm, the initial number of clusters and the initial cluster centers produced by the clustering layer are highly random, which degrades the performance of the algorithm. To address this problem, we proposed a review text clustering algorithm based on improved DEC for unsupervised clustering of unlabeled e-commerce review data. Firstly, a BERT-LDA vectorized representation of the dataset was obtained by fusing sentence embedding vectors with topic distribution vectors. Secondly, the DEC algorithm was improved: an autoencoder was used for dimensionality reduction, and a clustering layer was stacked after the encoder, where the number of clusters was selected based on topic coherence and the topic feature vectors were used as custom cluster centers; the encoder and the clustering layer were then trained jointly to improve clustering accuracy. Finally, a visualization tool was used to display the clustering results intuitively. To verify the effectiveness of the proposed algorithm, it and six comparison algorithms were trained for unsupervised clustering on an unlabeled product review dataset. The results show that the proposed algorithm achieves the best results, a silhouette coefficient of 0.213 5 and a Calinski-Harabasz (CH) index of 2 958.18, indicating that it can effectively handle e-commerce review data and reflect users' attention to the products.
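
The pipeline summarized in the abstract can be illustrated with a short, self-contained Python sketch. It is not the authors' implementation: the sentence encoder, the toy corpus, the use of PCA as a stand-in for the paper's autoencoder, and the topic-probability-weighted initialization of the cluster centers are assumptions made for brevity; the full method instead trains the encoder and the cluster centers jointly by minimizing the KL divergence between the soft assignment Q and the target distribution P.

# Minimal sketch of the BERT-LDA + DEC-style clustering pipeline (illustrative only).
# Assumptions: sentence-transformers encoder, gensim LDA, PCA in place of the
# paper's autoencoder, topic-weighted means as the custom cluster centers.
import numpy as np
from sentence_transformers import SentenceTransformer
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, calinski_harabasz_score


def bert_lda_features(texts, tokens, num_topics):
    """Concatenate BERT sentence embeddings with LDA topic distributions."""
    bert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed encoder
    sent_vecs = np.asarray(bert.encode(texts))                  # (n, d_bert)

    dictionary = Dictionary(tokens)
    corpus = [dictionary.doc2bow(doc) for doc in tokens]
    lda = LdaModel(corpus, id2word=dictionary, num_topics=num_topics, random_state=0)

    topic_vecs = np.zeros((len(corpus), num_topics))            # (n, K)
    for i, bow in enumerate(corpus):
        for t, p in lda.get_document_topics(bow, minimum_probability=0.0):
            topic_vecs[i, t] = p

    # Topic coherence (c_v); in the paper the cluster count is the K that maximizes it.
    coherence = CoherenceModel(model=lda, texts=tokens, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    return np.hstack([sent_vecs, topic_vecs]), topic_vecs, coherence


def dec_soft_assign(z, centers, alpha=1.0):
    """DEC soft assignment Q: Student's t similarity between points and centers."""
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)


def target_distribution(q):
    """Sharpened target distribution P used in DEC's KL(P || Q) objective."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)


# --- toy usage --------------------------------------------------------------
texts = ["screen is bright and sharp", "battery drains too fast",
         "fast shipping and well packaged", "battery life is disappointing"]
tokens = [t.split() for t in texts]      # real reviews would be tokenized properly
K = 2                                    # choose by comparing coherence over several K

X, topic_vecs, coh = bert_lda_features(texts, tokens, num_topics=K)

# Dimensionality reduction; the paper trains an autoencoder at this step.
z = PCA(n_components=2, random_state=0).fit_transform(X)

# Custom cluster centers from topic structure (topic-probability-weighted means)
# instead of the random/k-means initialization of the original DEC.
w = topic_vecs / (topic_vecs.sum(axis=0, keepdims=True) + 1e-12)
centers = w.T @ z

q = dec_soft_assign(z, centers)
p = target_distribution(q)               # the full method trains on KL(P || Q)
labels = q.argmax(axis=1)

if len(np.unique(labels)) > 1:           # a tiny toy corpus may collapse to one cluster
    print("silhouette:", silhouette_score(z, labels))
    print("Calinski-Harabasz:", calinski_harabasz_score(z, labels))

In the full algorithm, Q and P are recomputed as training proceeds and KL(P || Q) is back-propagated through both the encoder and the cluster centers; together with the topic-based initialization, this is what reduces the sensitivity to random initial clusters and centers described in the abstract.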

Key words: BERT model, LDA model, deep embedding clustering, autoencoder, clustering

CLC number: TP391.1