Journal of Jilin University (Science Edition)

• Computer Science

Network News Topics Discovery Based on Improved Single-Pass Algorithm

SUN Hongguang1,2, GAO Xing3, SUN Tieli1,2, YANG Fengqin1, PENG Yang1, FENG Guozhong1

  1. School of Information Science and Technology, Northeast Normal University, Changchun 130117, China; 2. Key Lab of Intelligent Information Processing of Jilin Universities, Changchun 130117, China; 3. Liberation Army Daily, Beijing 100832, China
  • Received: 2016-10-24 Online: 2018-01-26 Published: 2018-01-24
  • Contact: SUN Tieli E-mail: suntl@nenu.edu.cn

Abstract: Using an improved Single-Pass incremental text clustering algorithm, we organize news at the granularity of topics in order to discover topics in network news. The method accounts for the dynamic and temporal nature of news. In computing feature-term weights, it considers both the position of a term (in the headline or in the body text) and the term's incremental document frequency. In computing similarity, it adds a time factor, and during clustering it dynamically updates the centroid vector of each topic. On a test data set of news and related corpora built with a topic-based Web crawler, the experimental results show that, compared with the traditional algorithm, the improved algorithm reduces the cost and the false-detection (fallout) rate by 0.34% and 1.57%, respectively, which verifies its effectiveness and accuracy.

Key words: text clustering, Single-Pass algorithm, topic discovery

CLC number:

  • TP311.5
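
The abstract names three ingredients on top of plain Single-Pass clustering: position-aware term weights, a time factor in the similarity, and a dynamically updated topic centroid. The Python sketch below illustrates how these could fit together; it is not the authors' implementation. The title boost of 2.0, the exponential time decay, the similarity threshold of 0.3, and all function and parameter names are assumptions made for illustration, since this page does not give the paper's formulas. The incremental document frequency term is left out for brevity.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Topic:
    centroid: dict   # term -> mean weight over the topic's documents
    last_day: float  # arrival time (in days) of the newest member
    size: int = 1


def weigh(title, body, title_boost=2.0):
    """Term weights: body counts plus an assumed bonus for title terms."""
    weights = Counter(body)
    for term in title:
        weights[term] += title_boost
    return dict(weights)


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_pass(docs, threshold=0.3, decay=0.1):
    """Cluster docs, given as (day, title_terms, body_terms), in arrival order."""
    topics, labels = [], []
    for day, title, body in docs:
        vec = weigh(title, body)
        best, best_sim = None, 0.0
        for i, topic in enumerate(topics):
            # Assumed time factor: similarity decays with the gap in days.
            gap = abs(day - topic.last_day)
            sim = cosine(vec, topic.centroid) * math.exp(-decay * gap)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is not None and best_sim >= threshold:
            # Join the closest topic and update its centroid as a running mean.
            topic = topics[best]
            n = topic.size
            for term in set(topic.centroid) | set(vec):
                old = topic.centroid.get(term, 0.0)
                topic.centroid[term] = (old * n + vec.get(term, 0.0)) / (n + 1)
            topic.size = n + 1
            topic.last_day = max(topic.last_day, day)
            labels.append(best)
        else:
            # No topic is similar enough: this document seeds a new topic.
            topics.append(Topic(centroid=vec, last_day=day))
            labels.append(len(topics) - 1)
    return labels, topics
```

Each document is compared once against the existing topics and is never revisited, which is what makes the procedure single-pass and incremental. With the assumed decay, two textually identical stories that arrive far apart in time fall into different topics, which is the intended effect of a time factor.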