• Computer Science •

Realization of Focused Crawler Based on Page Segmentation

LI Xiaoya, HE Fengling, ZUO Wanli

  1. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Received: 2006-11-17 Online: 2007-11-26 Published: 2007-11-26
  • Contact: ZUO Wanli

Abstract: General-purpose search engines currently return too many results, many of them only weakly related to the topic of interest. To address this, this paper proposes a method for implementing a focused crawler based on page segmentation, in which each web page is divided into blocks, and presents Crawler1, a prototype focused crawler built with this method. Experimental results indicate that Crawler1 performs well: more than 55% of the pages it crawled were relevant to the topic.

Key words: topic-specific search, focused crawling, relevance analysis, page segmentation

CLC Number: 

  • TP311