J4 ›› 2009, Vol. 27 ›› Issue (04): 396-.


Information Extraction Based on Features and HMM

JI Xiang, LIU Hua-xiao, WU Fen-fen, LIU Lei

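  1. College of Computer Science and Technology, Jilin University, Changchun 130012
  • Online: 2009-07-20  Published: 2009-08-27
  • Corresponding author: JI Xiang (b. 1984), male, from Jixi, Heilongjiang; master's student at Jilin University; works mainly on semantics-based web services. E-mail: jixiang_2113@126.com
  • About the authors: JI Xiang (b. 1984), male, from Jixi, Heilongjiang; master's student at Jilin University; works mainly on semantics-based web services; Tel: 86-13578920655; E-mail: jixiang_2113@126.com. LIU Lei (b. 1960), male, from Changchun; professor and doctoral supervisor at Jilin University; works mainly on semantics-based web services; Tel: 86-13086867820; E-mail: liulei@mail.jlu.edu.cn
  • Funding:

    Supported by the China Higher Education Doctoral Research Fund (20060183044)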

Information Extraction Based on Character Extraction and HMM

JI Xiang, LIU Hua-xiao, WU Fen-fen, LIU Lei

  1. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Online: 2009-07-20  Published: 2009-08-27

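Abstract:

 To address the problem that both recall and precision are low in information extraction, an improved HMM (hidden Markov model) is proposed that uses a new text-chunking technique. States with distinguishing features are first extracted using the semantic and structural features of the text; on that basis, the remaining states, which have no such features, are extracted with the improved HMM. The method was tested on the headers of 100 computer science papers provided by the data search engine research group at Carnegie Mellon University. The results show that, compared with word-based and traditional HMM methods, recall and precision reach 91.99% and 94.79%, respectively.

Key words: text block, feature extraction, machine learning, HMM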
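
The two-stage idea in the abstract (settle the text blocks whose features give the state away, then let the HMM decide the rest) can be illustrated with a small Python sketch. This is not the authors' model: the state names, the regular-expression rules, the "short"/"long" observation symbols, and every probability below are assumptions invented for illustration. The sketch shows only one plausible way to combine the stages, namely Viterbi decoding in which the pinned blocks are restricted to their pinned state.

```python
import math
import re

STATES = ["title", "author", "affiliation", "email", "abstract"]

# Stage 1 (hypothetical rules): pin blocks whose features give the state away.
FEATURE_RULES = [
    ("email", re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")),
    ("affiliation", re.compile(r"\b(University|College|Institute|Department)\b")),
]

# Toy HMM parameters, invented for this sketch (not estimated from data).
START_P = {"title": 0.9, "author": 0.1}
TRANS_P = {
    "title": {"title": 0.2, "author": 0.8},
    "author": {"author": 0.3, "affiliation": 0.6, "email": 0.1},
    "affiliation": {"affiliation": 0.3, "email": 0.5, "abstract": 0.2},
    "email": {"email": 0.1, "author": 0.2, "abstract": 0.7},
    "abstract": {"abstract": 1.0},
}
EMIT_P = {s: {"short": 0.9, "long": 0.1} for s in STATES}
EMIT_P["abstract"] = {"short": 0.1, "long": 0.9}


def pin_by_features(blocks):
    """Return {block index: state} for blocks matched by a feature rule."""
    pinned = {}
    for i, text in enumerate(blocks):
        for state, pattern in FEATURE_RULES:
            if pattern.search(text):
                pinned[i] = state
                break
    return pinned


def observe(block):
    """Map a text block to a coarse observation symbol by word count."""
    return "long" if len(block.split()) > 8 else "short"


def _log(p):
    return math.log(p) if p > 0 else float("-inf")


def decode(obs, pinned):
    """Viterbi decoding in which pinned blocks may only take their pinned state."""
    if not obs:
        return []

    def candidates(t):
        return [pinned[t]] if t in pinned else STATES

    score = [{s: _log(START_P.get(s, 0.0)) + _log(EMIT_P[s][obs[0]])
              for s in candidates(0)}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in candidates(t):
            # Best predecessor among the states still possible at block t - 1.
            prev = max(score[t - 1],
                       key=lambda r: score[t - 1][r] + _log(TRANS_P[r].get(s, 0.0)))
            score[t][s] = (score[t - 1][prev] + _log(TRANS_P[prev].get(s, 0.0))
                           + _log(EMIT_P[s][obs[t]]))
            back[t][s] = prev
    last = max(score[-1], key=score[-1].get)
    if score[-1][last] == float("-inf"):
        raise ValueError("no state sequence is consistent with the pinned states")
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        last = back[t][last]
        path.append(last)
    return path[::-1]


def label_header(blocks):
    """Stage 1 pins feature-bearing blocks; stage 2 decodes the remaining ones."""
    return decode([observe(b) for b in blocks], pin_by_features(blocks))
```

Calling `label_header` on a list of header lines returns one state per block. Pinning narrows the search: a block matched by a rule contributes a single candidate state, so the HMM only has to choose among alternatives for the unmatched blocks.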

Abstract:

An improved HMM (hidden Markov model) is proposed for text information extraction. It uses the semantic and structural characteristics of the text to determine the states that have distinguishing characteristics, and then extracts the remaining states, which have no such characteristics, with the improved HMM. This addresses the problem of low recall and precision in information extraction. We tested the method on the headers of 100 computer science papers from the data provided by the search-engine research group at CMU (Carnegie Mellon University), USA. The results show that both recall and precision are improved compared with existing methods based on words and the traditional HMM. The recall and precision are 91.99% and 94.79%, respectively.

Key words: text block, character extraction, machine learning, hidden Markov models (HMM)
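
For readers less familiar with the two reported metrics, the following Python sketch shows how recall and precision are commonly computed for field extraction. The abstract does not say whether the authors scored per token, per block, or per field, so the exact-match, per-field scoring here is an assumption, and the function name and sample records are hypothetical.

```python
def precision_recall(predicted, gold):
    """Score extracted fields against a reference annotation.

    Both arguments are sets of (document id, field name, extracted text)
    tuples; a prediction counts as correct only on an exact match.
    """
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical sample: four reference fields, three predictions, two correct.
gold = {
    (1, "title", "Information Extraction with Hidden Markov Models"),
    (1, "author", "A. Author"),
    (1, "author", "B. Author"),
    (1, "email", "a.author@example.edu"),
}
predicted = {
    (1, "title", "Information Extraction with Hidden Markov Models"),
    (1, "author", "A. Author, B. Author"),  # wrong boundary, counts as an error
    (1, "email", "a.author@example.edu"),
}
precision, recall = precision_recall(predicted, gold)
```

Precision is the share of extracted fields that are correct and recall is the share of reference fields that were found, so a system can score well on one while doing poorly on the other.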
