J4 ›› 2009, Vol. 27 ›› Issue (04): 396-.


Information Extraction Based on Character Extraction and HMM

JI Xiang,LIU Hua-xiao,WU Fen-fen,LIU Lei   

  1. College of Computer Science and Technology,Jilin University,Changchun 130012,China
  • Online:2009-07-20 Published:2009-08-27

Abstract:

An improved hidden Markov model (HMM) is proposed for text information extraction. The semantic and structural features of the text are first used to determine the states that exhibit such features; the remaining states, which have no distinguishing features, are then extracted with the improved HMM. This addresses the problem of low recall and precision in information extraction. The method was tested on 100 headers of computer science papers from the data provided by the search-engine research group at CMU (Carnegie Mellon University), USA. The results show that both recall and precision improve over existing methods based on words and the traditional HMM, reaching 91.99% and 94.79%, respectively.
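The two-stage idea in the abstract (fix the states that features can identify, then let the HMM label the rest) can be illustrated with a constrained Viterbi decoder. This is only a minimal sketch and not the authors' implementation: the state names, the observation symbols, all probabilities, and the `fixed` argument are hypothetical, and skipping the emission term at pinned positions is a simplifying assumption.

```python
import math

NEG_INF = float("-inf")


def _log(p):
    """Log of a probability, mapping 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def constrained_viterbi(obs, states, start_p, trans_p, emit_p, fixed=None):
    """Viterbi decoding in which positions listed in `fixed` are pinned.

    `fixed` maps a position index to the state chosen for it by some
    feature rule (hypothetical here). Pinned positions skip the emission
    term, which is an assumption of this sketch.
    """
    fixed = fixed or {}

    def emission(t, s):
        if t in fixed:
            return 0.0 if fixed[t] == s else NEG_INF
        return _log(emit_p[s].get(obs[t], 0.0))

    # score[t][s]: best log-probability of a path ending in state s at t
    score = [{s: _log(start_p[s]) + emission(0, s) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda r: score[t - 1][r] + _log(trans_p[r][s]))
            score[t][s] = score[t - 1][prev] + _log(trans_p[prev][s]) + emission(t, s)
            back[t][s] = prev

    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: score[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Toy header model; every number below is made up for illustration.
states = ["title", "author", "email"]
start_p = {"title": 0.8, "author": 0.1, "email": 0.1}
trans_p = {
    "title": {"title": 0.6, "author": 0.3, "email": 0.1},
    "author": {"title": 0.05, "author": 0.55, "email": 0.4},
    "email": {"title": 0.05, "author": 0.15, "email": 0.8},
}
emit_p = {
    "title": {"word": 0.8, "name": 0.2},
    "author": {"word": 0.2, "name": 0.8},
    "email": {"word": 0.5, "name": 0.5},
}
obs = ["word", "word", "name", "addr"]
# Suppose a structural rule (e.g. the token contains "@") pins position 3.
path = constrained_viterbi(obs, states, start_p, trans_p, emit_p, fixed={3: "email"})
print(path)
```

Pinning a position removes every competing state there, so the decoder only has to resolve the positions that no feature rule covers.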

Key words: text block, character extraction, machine learning, hidden Markov models (HMM)
