Journal of Jilin University (Science Edition) ›› 2023, Vol. 61 ›› Issue (3): 631-640.

  • Corresponding author: ZHOU Zhongyu E-mail: 1294460548@qq.com

Feature Selection and Text Clustering Algorithm Based on Binary Mayfly Optimization

GAO Xincheng1, ZHOU Zhongyu2, WANG Lili2, SHAO Guoming2, ZHANG Qiang2   

  1. Modern Education Technique Center, Northeast Petroleum University, Daqing 163318, Heilongjiang Province, China;
  2. School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318, Heilongjiang Province, China
  • Received: 2022-03-07 Online: 2023-05-26 Published: 2023-05-26

Abstract: To address the low clustering accuracy caused by redundant text features, we propose a feature selection and text clustering algorithm based on binary mayfly optimization. First, we improve the position update, mating, and mutation strategies of the traditional mayfly algorithm. Second, we combine the improved algorithm with a feature selection model that selects text features using inverse document frequency as the objective function. Finally, the K-means++ algorithm clusters the texts on the new feature subset to obtain the optimal text clustering result. Experiments on multiple datasets show that the proposed algorithm effectively reduces the feature dimensionality and improves the efficiency of text clustering.

Key words: binary mayfly algorithm, text clustering, convergence rate, feature selection

CLC number:

  • TP393
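
The abstract names the ingredients of the method (a binary swarm search over feature subsets, with inverse document frequency as the objective) but does not give the update equations. The Python sketch below is therefore only an illustration under stated assumptions: the fitness (mean IDF of the selected terms), the sigmoid transfer function, the coefficients 0.7 and 1.5, and the function names are all hypothetical, and the update is a generic binary swarm step rather than the authors' improved mayfly position update, mating, and mutation operators.

```python
import math
import random


def inverse_document_frequency(docs):
    """Return {term: log(N / df)} for a list of tokenized documents."""
    n_docs = len(docs)
    doc_freq = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def fitness(mask, idf_values):
    """Assumed objective: mean IDF of the selected features (0 if none)."""
    chosen = [v for bit, v in zip(mask, idf_values) if bit]
    return sum(chosen) / len(chosen) if chosen else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def binary_swarm_select(idf_values, n_agents=10, n_iter=30, seed=0):
    """Generic binary swarm search; NOT the paper's improved mayfly operators."""
    rng = random.Random(seed)
    dim = len(idf_values)
    positions = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_agents)]
    velocities = [[0.0] * dim for _ in range(n_agents)]
    personal_best = [p[:] for p in positions]
    global_best = max(positions, key=lambda m: fitness(m, idf_values))[:]
    for _ in range(n_iter):
        for i in range(n_agents):
            for d in range(dim):
                # Pull the velocity toward the personal and global best bits.
                v = (0.7 * velocities[i][d]
                     + 1.5 * rng.random() * (personal_best[i][d] - positions[i][d])
                     + 1.5 * rng.random() * (global_best[d] - positions[i][d]))
                velocities[i][d] = max(-6.0, min(6.0, v))  # clamp for stability
                # Sigmoid transfer: velocity becomes the probability of bit = 1.
                positions[i][d] = 1 if rng.random() < sigmoid(velocities[i][d]) else 0
            if fitness(positions[i], idf_values) > fitness(personal_best[i], idf_values):
                personal_best[i] = positions[i][:]
            if fitness(personal_best[i], idf_values) > fitness(global_best, idf_values):
                global_best = personal_best[i][:]
    return global_best


# Toy corpus (made up for illustration only).
docs = [["oil", "well", "data"], ["oil", "text", "data"], ["oil", "cluster", "data"]]
idf = inverse_document_frequency(docs)
terms = sorted(idf)
mask = binary_swarm_select([idf[t] for t in terms])
selected_terms = [t for t, bit in zip(terms, mask) if bit]
```

The documents restricted to `selected_terms` would then be vectorized and passed to a K-means++ clustering step, as the abstract describes. A mean-IDF objective on its own favors very small subsets of rare terms, so a practical objective would likely need an additional term that the abstract does not specify.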