Journal of Jilin University (Information Science Edition) ›› 2016, Vol. 34 ›› Issue (4): 543-549.

• Articles

Improved Suffix-Tree-Based Clustering Algorithm for Web Search Results

DONG Yaze a, LI Wanlong b, LI Hang b, ZHENG Shanhong b

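  1. Changchun University of Technology: a. School of Application Technology; b. School of Computer Science & Engineering, Changchun 130012
  • Received: 2015-12-17 Online: 2016-07-25 Published: 2017-01-16
  • Corresponding author: LI Wanlong (b. 1963), male, from Changchun; professor at Changchun University of Technology; research interests: software engineering and intelligent computing. Tel: 86-431-85118311; E-mail: lwl@ccut.edu.cn
  • About the author: DONG Yaze (b. 1982), female, from Dehui, Jilin; lecturer at Changchun University of Technology; research interest: intelligent computing. Tel: 86-431-85197565; E-mail: yaze_dong@163.com
  • Funding:
    Natural Science Foundation of Jilin Province (20130101060JC); "Twelfth Five-Year Plan" Science and Technology Research Fund of the Education Department of Jilin Province (2014125; 2014131)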

Improved Algorithm of Web Retrieve Results Clustering Based on Suffix Tree

DONG Yaze a, LI Wanlong b, LI Hang b, ZHENG Shanhong b   

  1. a. School of Application Technology; b. School of Computer Science & Engineering,
    Changchun University of Technology, Changchun 130012, China
  • Received: 2015-12-17 Online: 2016-07-25 Published: 2017-01-16

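Abstract: To improve the accuracy and precision of Web search, an improved suffix-tree-based algorithm for clustering search results is proposed, building on the basic model of the suffix tree clustering algorithm. The vector space model is combined with suffix tree clustering to improve the merging of base clusters. The number of texts corresponding to a base cluster node, the number of words in the phrase, the phrase weight, and whether the phrase contains the query terms are combined as the selection criteria for cluster labels, which improves the rationality and readability of the labels. Experiments using the text classification corpus from the Sogou corpus as the data source show that the method improves the accuracy of the clustering results to a certain extent.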

Key words: Web retrieval results, suffix tree, vector space model, text clustering

Abstract: Improving the accuracy and precision of Web search engines is a pressing problem. Building on the basic model of the suffix tree clustering algorithm, an improved suffix-tree-based algorithm for clustering search results is proposed, in which the vector space model is combined with suffix tree clustering to improve the merging of base clusters. In addition, the number of texts corresponding to a base cluster node, the number of words in the phrase, the phrase weight, and whether the phrase contains the query terms are combined as the selection criteria for cluster labels, which improves the rationality and readability of the labels. The method is tested on the text classification corpus from the Sogou corpus. The experimental results show that it improves the accuracy of the clustering results to a certain extent.

Key words: suffix tree, text clustering, Web retrieval results, vector space model

CLC Number: 

  • TP39
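
The abstract names two ideas without giving formulas: merging base clusters with the help of the vector space model, and choosing cluster labels from four criteria (number of texts at the base cluster node, number of words in the phrase, phrase weight, and presence of the query terms). The Python sketch below is only one plausible reading of those ideas. The `BaseCluster` type, the cosine threshold of 0.5, the term-frequency centroids, and the linear weighting in `label_score` are all assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseCluster:
    """Hypothetical container for one suffix-tree node (a base cluster)."""
    phrase: tuple      # words of the phrase shared by the documents
    docs: frozenset    # ids of the documents that contain the phrase


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(cluster: BaseCluster, doc_vectors: dict) -> Counter:
    """Sum the term vectors of all documents in the base cluster."""
    total = Counter()
    for doc_id in cluster.docs:
        total.update(doc_vectors[doc_id])
    return total


def should_merge(x: BaseCluster, y: BaseCluster, doc_vectors: dict,
                 threshold: float = 0.5) -> bool:
    """Assumed merge rule: merge when the centroids are similar enough."""
    sim = cosine(centroid(x, doc_vectors), centroid(y, doc_vectors))
    return sim >= threshold


def label_score(cluster: BaseCluster, phrase_weight: float, query_terms: set,
                weights=(1.0, 1.0, 1.0, 1.0)) -> float:
    """Assumed linear combination of the four label criteria in the abstract."""
    has_query = any(word in query_terms for word in cluster.phrase)
    return (weights[0] * len(cluster.docs)        # texts at the node
            + weights[1] * len(cluster.phrase)    # words in the phrase
            + weights[2] * phrase_weight          # phrase weight
            + weights[3] * (1.0 if has_query else 0.0))


# Toy data: three documents as term-frequency vectors.
doc_vectors = {
    1: Counter({"suffix": 2, "tree": 1}),
    2: Counter({"suffix": 1, "tree": 2}),
    3: Counter({"football": 3}),
}
a = BaseCluster(("suffix", "tree"), frozenset({1, 2}))
b = BaseCluster(("tree",), frozenset({2}))
c = BaseCluster(("football",), frozenset({3}))
```

With this toy data, `should_merge(a, b, doc_vectors)` holds because the two centroids point in nearly the same direction, while `a` and `c` share no terms and stay separate. A candidate label would then be chosen by ranking the phrases of a merged cluster by `label_score`; the equal weights used here are placeholders.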