Feature Selection and Text Clustering Algorithm Based on Binary Mayfly Optimization

Journal of Jilin University (Science Edition), 2023, Vol. 61, Issue (3): 631-640.

GAO Xincheng¹, ZHOU Zhongyu², WANG Lili², SHAO Guoming², ZHANG Qiang²

1. Modern Education Technique Center, Northeast Petroleum University, Daqing 163318, Heilongjiang Province, China;
2. School of Computer and Information Technology, Northeast Petroleum University, Daqing 163318, Heilongjiang Province, China

Received: 2022-03-07 Online: 2023-05-26 Published: 2023-05-26
Corresponding author: ZHOU Zhongyu, E-mail: 1294460548@qq.com

Abstract

Abstract: To address the low clustering accuracy caused by redundant text features, we propose a feature selection and text clustering algorithm based on binary mayfly optimization. First, we improve the position update, mating, and mutation strategies of the traditional mayfly algorithm. Second, we combine the improved algorithm with a feature selection model and select text features using inverse document frequency as the objective function. Finally, we cluster the texts on the new feature subset with the K-means++ algorithm to obtain the optimal text clustering result. Experiments on multiple datasets show that the proposed algorithm effectively reduces the feature dimensionality and improves text clustering efficiency.

Key words: binary mayfly algorithm, text clustering, convergence rate, feature selection
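
The pipeline summarized in the abstract (score terms by inverse document frequency, keep a binary-selected feature subset, then cluster with K-means++) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the improved binary mayfly search itself is not reproduced, and a hand-written `mask` stands in for one candidate solution of that search. All function names, the toy documents, and the TF-IDF-style weighting are hypothetical choices made for the example.

```python
import math
import random

def idf_scores(docs):
    """Inverse document frequency log(N / df) for every term in tokenized docs."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n / count) for term, count in df.items()}

def select_features(vocab, mask):
    """Keep the terms whose mask bit is 1 (the mask is one binary candidate solution)."""
    return [term for term, bit in zip(vocab, mask) if bit == 1]

def vectorize(docs, features, idf):
    """Assumed weighting: term count times IDF, restricted to the selected features."""
    return [[doc.count(t) * idf[t] for t in features] for doc in docs]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """K-means++ seeding: each new center is drawn with probability proportional
    to its squared distance from the nearest center chosen so far.
    Assumes the data contains at least k distinct points."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def kmeans(points, k, iters=20, seed=0):
    """Lloyd iterations started from K-means++ seeds; returns one label per point."""
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave a center unchanged if its cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy corpus (hypothetical): two petroleum documents and two text-mining documents.
docs = [
    ["oil", "well", "drill"],
    ["oil", "drill", "rig"],
    ["text", "cluster", "word"],
    ["text", "word", "topic"],
]
idf = idf_scores(docs)
vocab = sorted(idf)
# Hand-written mask standing in for the optimizer's output:
# keeps "drill", "oil", "text", "word" and drops the other four terms.
mask = [0, 1, 1, 0, 1, 0, 0, 1]
features = select_features(vocab, mask)
labels = kmeans(vectorize(docs, features, idf), k=2)
print(features, labels)
```

In this sketch the mask halves the vocabulary from eight terms to four while the two topics remain separable, which is the kind of dimensionality reduction the abstract refers to. How the paper scores a mask with inverse document frequency is not specified in the abstract, so no fitness function is shown here.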

CLC number: TP393

GAO Xincheng, ZHOU Zhongyu, WANG Lili, SHAO Guoming, ZHANG Qiang. Feature Selection and Text Clustering Algorithm Based on Binary Mayfly Optimization[J]. Journal of Jilin University (Science Edition), 2023, 61(3): 631-640.
