J4 ›› 2010, Vol. 48 ›› Issue (03): 421-426.

• Computer Science

Dynamic Web Information Extraction Based on Sequence Alignment

ZHAO Gang1, GUO Dongwei2, LI Dan3

  1. College of Network Education, Jilin University, Changchun 130022, China; 2. College of Computer Science and Technology, Jilin University, Changchun 130012, China; 3. School of Continuing Education, Dalian University of Technology, Dalian 116011, Liaoning Province, China
  • Received: 2009-02-17 Online: 2010-05-26 Published: 2010-05-19
  • Contact: GUO Dongwei E-mail: guodw@jlu.edu.cn

Abstract:

The "common framework" of Deep Web pages is defined as the information that is unrelated to the core content of a page and shared by pages from the same source. Based on this definition, a common framework detection phase is added to the information extraction algorithm, and sequence alignment is used to extract the common framework. Compared with the raw page data, the data fields that remain after the common framework is removed are better suited to template extraction. On collections of data-intensive Web pages from real-world websites, the effects of different sequence alignment parameter values, and of the common framework detection phase itself, on data volume and extraction accuracy were tested and compared. The experimental results indicate that the approach is effective.

Key words: Web information extraction; sequence alignment; common framework detection
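The abstract does not say which alignment algorithm, tokenization, or scoring the authors use, so the following is only an illustrative sketch of the general idea, not the paper's method. It assumes a Needleman-Wunsch global alignment over tag and text tokens of two pages from the same source; tokens that align as exact matches are treated as the common framework, and the rest as candidate data fields. The function names (`tokenize`, `align`, `split_page`) and the default `match`, `mismatch`, and `gap` scores are hypothetical choices made for this example.

```python
import re


def tokenize(html):
    """Split HTML into tag tokens and text tokens (a deliberately crude tokenizer)."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def align(a, b, match=2, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two token lists.

    Returns the index pairs (i, j) on the optimal path where a[i] == b[j].
    The scoring values are illustrative assumptions, not the paper's settings.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell, collecting exact matches only.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            if a[i - 1] == b[j - 1]:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def split_page(tokens_a, tokens_b, **params):
    """Separate the shared framework from the page-specific data tokens."""
    pairs = align(tokens_a, tokens_b, **params)
    matched_a = {i for i, _ in pairs}
    matched_b = {j for _, j in pairs}
    common = [tokens_a[i] for i, _ in pairs]
    data_a = [t for i, t in enumerate(tokens_a) if i not in matched_a]
    data_b = [t for j, t in enumerate(tokens_b) if j not in matched_b]
    return common, data_a, data_b


# Two made-up pages that share a header and footer but differ in their records.
page_a = "<html><body><div>Site header</div><p>Book A</p><p>12.50</p><div>Footer</div></body></html>"
page_b = "<html><body><div>Site header</div><p>Book B</p><p>8.00</p><div>Footer</div></body></html>"

common, data_a, data_b = split_page(tokenize(page_a), tokenize(page_b))
print(data_a)  # ['Book A', '12.50']
print(data_b)  # ['Book B', '8.00']
```

Changing `match`, `mismatch`, and `gap` shifts how readily the alignment tolerates differing tokens, which is the kind of parameter sensitivity the abstract says was evaluated. A real system would likely align more than two pages and use a proper HTML parser instead of a regular expression.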

CLC number:

  • TP391.1