
• Computer Science

The Role of Table Information in Focused Crawling

HUANG Fengyun, WANG Hui, ZUO Wanli

  1. College of Computer Science and Technology, Jilin University, Changchun 130012
  • Received: 2006-07-06 Revised: 1900-01-01 Published: 2007-05-26 Released: 2007-05-26
  • Corresponding author: WANG Hui

Importance of Text about Table Elements Used in Focused Crawling

HUANG Fengyun, WANG Hui, ZUO Wanli   

  1. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Received:2006-07-06 Revised:1900-01-01 Online:2007-05-26 Published:2007-05-26
  • Contact: WANG Hui

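Abstract: Using a method that computes the similarity between vectors, we verify through experimental analysis the importance of table information in focused crawling. The results show that, compared with the whole Web page, the user-relevant information that tables can provide accounts for more than 80% of the page's total information, so this finding can be fully exploited for Web page parsing in focused crawling. Discarding all elements other than tables and titles improved the efficiency of the crawler.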

Key words: focused crawling, link, TFIDF, similarity

Abstract: In this paper, experiments are conducted to analyze and verify the importance of the table elements in a Web page. Compared with the whole Web page, table elements provide a large share (more than 80%) of the information that is relevant to users' information needs. This conclusion can be utilized to parse Web pages in the domain of focused crawling. After discarding all elements except tables and headers, the efficiency of a focused crawler can be improved substantially.

Key words: focused crawling, URL, TFIDF, similarity
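
The abstract names TFIDF and vector similarity but gives no formulas, so the following is only a minimal sketch of how such a comparison might look, not the authors' implementation. It assumes cosine similarity over TF-IDF vectors with a smoothed IDF term; the class `TextSplitter`, the helper functions, the topic string, and `SAMPLE_HTML` are all hypothetical names and data introduced for illustration.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser


class TextSplitter(HTMLParser):
    """Collect every text node, and separately the text nested inside <table>."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # > 0 while the parser is inside a <table>
        self.page_text = []
        self.table_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        self.page_text.append(data)
        if self.table_depth > 0:
            self.table_text.append(data)


def split_page(html):
    """Return (whole-page text, table-only text) for an HTML string."""
    parser = TextSplitter()
    parser.feed(html)
    parser.close()
    return " ".join(parser.page_text), " ".join(parser.table_text)


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per token list.

    Assumption: IDF is smoothed as log((1 + N) / (1 + df)) + 1 so that terms
    occurring in every document keep a non-zero weight.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented sample page: navigation text outside the table, topical text inside.
SAMPLE_HTML = """
<html><body>
<p>Home | About | Contact</p>
<table><tr><td>focused crawler</td><td>topic relevance</td></tr></table>
</body></html>
"""

page_text, table_text = split_page(SAMPLE_HTML)
topic = "focused crawler topic"  # hypothetical user topic
topic_vec, table_vec, page_vec = tfidf_vectors(
    [tokenize(topic), tokenize(table_text), tokenize(page_text)]
)
sim_table = cosine(topic_vec, table_vec)
sim_page = cosine(topic_vec, page_vec)
print(f"topic vs. table text: {sim_table:.3f}")
print(f"topic vs. whole page: {sim_page:.3f}")
```

In this toy page the table text scores higher against the topic than the whole page does, because the navigation words dilute the page vector. A real evaluation would need a document corpus large enough for the IDF weights to be meaningful.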

CLC number:

  • TP31