Journal of Jilin University (Information Science Edition) ›› 2021, Vol. 39 ›› Issue (5): 583-588.


Spam Message Recognition Based on Self-Clustering and Self-Learning Algorithm

LI Gen 1, WANG Kefeng 1, BEN Weiguo 1, SONG Wei 1, LIU Hongru 2, XU Yijin 2

  1. Jilin Province Branch, China United Network Communications Group Company Limited, Changchun 130021, China; 2. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Received: 2021-03-30 Online: 2021-10-01 Published: 2021-10-01
  • About the author: LI Gen (1982- ), male, from Changchun, senior engineer at the Jilin Province Branch of China United Network Communications Group Company Limited, Ph.D., mainly engaged in research on communication technology and artificial intelligence technology, (Tel) 86-18643100688 (E-mail) lig27@chinaunicom.cn.
  • Supported by:
    Science and Technology Development Plan Project of Jilin Province (20190201023JC)

Abstract: Spam message senders continually modify the content of their messages to deceive filtering systems, which lowers recognition accuracy. To address this problem, a recognition method based on a self-clustering and self-learning algorithm is proposed. First, a spam relation chain is built using the minimum edit distance, and the MeanShift algorithm is applied to the chain to cluster it, which realizes the self-clustering function. Second, the core of each cluster is computed, the weight of each sample is determined by its distance from the cluster core, and a classifier is trained on the weighted samples. When a new spam sample is recognized by the classifier, it is assigned to a cluster, the core and sample weights of that cluster are recomputed, and the classifier is updated; repeating this process realizes the self-learning function. Experimental results show that the new method improves recognition accuracy by about 2.51% to 5.14% and maintains this accuracy over a long period.

Key words: edit distance, clustering algorithm, self-learning, spam message

CLC Number:

  • TP393
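
The abstract's steps of computing a cluster core, weighting samples by their distance to it, and recomputing both when a new spam sample arrives can be illustrated with a small sketch. This is not the authors' implementation: the MeanShift clustering step is not reproduced (clusters are supplied by hand), and the choice of the medoid as the cluster core, the weighting formula `1 / (1 + distance)`, and all function names (`edit_distance`, `cluster_core`, `sample_weights`, `absorb`) are assumptions made for illustration only.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cluster_core(cluster: List[str]) -> str:
    """Assumed core: the medoid, the member closest in total to all others."""
    return min(cluster, key=lambda s: sum(edit_distance(s, t) for t in cluster))


def sample_weights(cluster: List[str], core: str) -> List[float]:
    """Assumed weighting: samples nearer the core get weights nearer 1."""
    return [1.0 / (1.0 + edit_distance(s, core)) for s in cluster]


def absorb(clusters: List[List[str]], new_spam: str) -> int:
    """Add a newly recognized spam to the cluster with the nearest core."""
    cores = [cluster_core(c) for c in clusters]
    idx = min(range(len(clusters)),
              key=lambda i: edit_distance(new_spam, cores[i]))
    clusters[idx].append(new_spam)
    return idx


# Hand-made clusters standing in for the output of the clustering step.
clusters = [
    ["win a free prize now", "win a free pr1ze now", "w1n a free prize now"],
    ["cheap loans call today", "cheap l0ans call today"],
]

# One self-learning round: absorb a new variant, then refresh core and weights.
idx = absorb(clusters, "win a free prize n0w")
core = cluster_core(clusters[idx])
weights = sample_weights(clusters[idx], core)
print(idx, core, weights)
```

In a fuller pipeline the refreshed weights would be passed as per-sample weights when the classifier is retrained, so that variants far from the cluster core influence the decision boundary less than typical members.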