J4 ›› 2010, Vol. 48 ›› Issue (03): 421-426.

Dynamic Web Information Extraction Based on Sequence Alignment

ZHAO Gang1, GUO Dongwei2, LI Dan3   

  1. College of Network Education, Jilin University, Changchun 130022, China; 2. College of Computer Science and Technology, Jilin University, Changchun 130012, China; 3. School of Continuing Education, Dalian University of Technology, Dalian 116011, Liaoning Province, China
  • Received: 2009-02-17 Online: 2010-05-26 Published: 2010-05-19
  • Contact: GUO Dongwei E-mail: guodw@jlu.edu.cn

Abstract:

The “common framework” of a set of Web pages is defined as the information that is unrelated to the kernel contents of the pages and shared by Web pages from the same source. Sequence alignment was adopted in the information extraction algorithm to detect this common framework. After the common framework is eliminated from the Web pages, the remaining data fields are more suitable for information extraction. On data-intensive Web pages from real-world websites, two effects were tested and evaluated: the effect of the alignment parameter values on the extraction results, and the effect of the common framework detection phase on reducing data quantity and increasing extraction accuracy. The experimental results demonstrate the validity of this approach.
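
The idea of detecting a common framework by sequence alignment can be illustrated with a minimal sketch. The abstract does not give the tokenization, the scoring scheme, or how more than two pages are handled, so everything below is an assumption made for illustration: a crude regex tokenizer, a pairwise Needleman-Wunsch global alignment, and hypothetical parameter values `match=2`, `mismatch=-1`, `gap=-1` standing in for the alignment parameters mentioned above. Tokens that align exactly across two pages from the same source are treated as the common framework and removed.

```python
import re


def tokenize(html):
    """Split a page into tag tokens and text tokens (deliberately crude)."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", html) if t.strip()]


def align(a, b, match=2, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two token lists.

    Returns the index pairs (i, j) for which a[i] and b[j] are aligned
    and identical. The scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner, collecting exact matches.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            if a[i - 1] == b[j - 1]:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


def strip_common_framework(page_a, page_b):
    """Remove tokens shared by both pages; return what is left of each page."""
    a, b = tokenize(page_a), tokenize(page_b)
    pairs = align(a, b)
    common_a = {i for i, _ in pairs}
    common_b = {j for _, j in pairs}
    rest_a = [t for k, t in enumerate(a) if k not in common_a]
    rest_b = [t for k, t in enumerate(b) if k not in common_b]
    return rest_a, rest_b


# Two made-up pages that share navigation and layout but differ in content.
page_a = "<html><body><div class='nav'>Home</div><h1>Red Bike</h1><p>$120</p></body></html>"
page_b = "<html><body><div class='nav'>Home</div><h1>Blue Kettle</h1><p>$35</p></body></html>"
rest_a, rest_b = strip_common_framework(page_a, page_b)
print(rest_a)  # ['Red Bike', '$120']
print(rest_b)  # ['Blue Kettle', '$35']
```

In this toy case the shared tags and the navigation text are discarded, leaving only the page-specific fields. Changing the assumed `gap` and `mismatch` values changes which tokens are aligned, which is the kind of parameter sensitivity the experiments evaluate.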

Key words: Web information extraction, sequence alignment, common framework detection

CLC Number: 

  • TP391.1