Spam Message Recognition Based on Self-Clustering and Self-Learning Algorithm

Journal of Jilin University (Information Science Edition) ›› 2021, Vol. 39 ›› Issue (5): 583-588.

LI Gen¹, WANG Kefeng¹, BEN Weiguo¹, SONG Wei¹, LIU Hongru², XU Yijin²

1. Jilin Province Branch, China United Network Communications Group Company Limited, Changchun 130021, China; 2. College of Computer Science and Technology, Jilin University, Changchun 130012, China

Received: 2021-03-30 Online: 2021-10-01 Published: 2021-10-01
About the author: LI Gen (born 1982), male, from Changchun, Ph.D., senior engineer at the Jilin Province Branch of China United Network Communications Group Company Limited; research interests: communication technology and artificial intelligence. (Tel) 86-18643100688 (E-mail) lig27@chinaunicom.cn.
Supported by:
Science and Technology Development Plan of Jilin Province (20190201023JC)

Abstract

Abstract: Spam senders continually modify message content to evade filtering systems, which lowers recognition accuracy. To address this problem, a recognition method based on a self-clustering and self-learning algorithm is proposed. First, a spam relation chain is built using the minimum edit distance, and the MeanShift algorithm is applied to the chain to cluster it, realizing the self-clustering function. Next, the core of each cluster is computed, each sample is assigned a weight according to its distance from the core, and the classifier is trained on the weighted samples. When the classifier recognizes a new spam sample, the sample is assigned to a cluster, the core and sample weights of that cluster are recomputed, and the classifier is updated; repeating this process realizes the self-learning function. Experimental results show that the new method improves recognition accuracy by about 2.51% to 5.14% and maintains this accuracy over a long period.

Key words: edit distance, clustering, self-learning, spam message
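
The pipeline summarized in the abstract (edit distance between messages, a per-cluster core, distance-based sample weights, and an update step when a new spam sample arrives) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so the following choices are assumptions made only for illustration: the core is taken to be the cluster medoid, the weight is `1 / (1 + distance to core)`, and the function names `cluster_core`, `sample_weights`, and `add_to_cluster` are hypothetical. The MeanShift clustering step and the classifier itself are omitted; a cluster is assumed to be already formed.

```python
from typing import Dict, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def cluster_core(cluster: List[str]) -> str:
    """Assumed core: the medoid, i.e. the member with the smallest total distance to all members."""
    return min(cluster, key=lambda s: sum(edit_distance(s, t) for t in cluster))


def sample_weights(cluster: List[str]) -> Dict[str, float]:
    """Assumed weighting: 1 / (1 + distance to core), so the core itself gets weight 1.0."""
    core = cluster_core(cluster)
    return {s: 1.0 / (1.0 + edit_distance(s, core)) for s in cluster}


def add_to_cluster(cluster: List[str], new_spam: str) -> Dict[str, float]:
    """Self-learning step (sketch): add the new sample, then recompute core and weights."""
    cluster.append(new_spam)
    return sample_weights(cluster)


if __name__ == "__main__":
    cluster = ["win a prize", "win a pr1ze", "w1n a pr1ze"]
    print(cluster_core(cluster))
    print(sample_weights(cluster))
    print(add_to_cluster(cluster, "w1n a pr1ze!!"))
```

In a full system the recomputed weights would be passed to the classifier as per-sample weights when it is retrained; how the paper does this is not stated in the abstract.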

CLC Number:

TP393

LI Gen, WANG Kefeng, BEN Weiguo, SONG Wei, LIU Hongru, XU Yijin. Spam Message Recognition Based on Self-Clustering and Self-Learning Algorithm[J]. Journal of Jilin University (Information Science Edition), 2021, 39(5): 583-588.
