• Computer Science

Chinese Chunk Analysis Based on CRF

徐中一, 胡谦, 刘磊   

  1. College of Computer Science and Technology, Jilin University, Changchun 130012
  • Received: 2006-06-29 Revised: 1900-01-01 Online: 2007-05-26 Published: 2007-05-26
  • Corresponding author: 刘磊

Chinese Text Chunking Based on CRF

XU Zhongyi, HU Qian, LIU Lei   

  1. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Received: 2006-06-29 Revised: 1900-01-01 Online: 2007-05-26 Published: 2007-05-26
  • Contact: LIU Lei

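Abstract: A method based on the conditional random fields (CRF) model is proposed for Chinese text chunking. The method recasts chunking as assigning a chunk tag to each word, builds a model from the tagged training corpus using conditional random fields, and then predicts the chunk tag of each word in the test corpus. Tested on the Peking University Chinese Treebank, the method achieves F1 = 85.5%, higher than the hidden Markov model and the maximum entropy Markov model. The experimental results show that conditional random fields are effective for Chinese chunk identification and avoid both strict independence assumptions and the label bias problem.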

Key words: chunking, conditional random fields, feature function
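
The reformulation described in the abstract, assigning one chunk tag to each word, can be illustrated with a small sketch. The B-/I-/O tag scheme, the chunk labels (NP, VP), the example sentence, and the function name `chunks_to_tags` are all assumptions made for illustration; the abstract does not specify the tag set used in the paper.

```python
def chunks_to_tags(chunks):
    """Flatten (label, words) chunks into per-word (word, tag) pairs.

    Assumed scheme: the first word of a chunk gets "B-<label>", later words
    get "I-<label>", and words outside any chunk (label None) get "O".
    """
    tagged = []
    for label, words in chunks:
        for i, word in enumerate(words):
            if label is None:
                tag = "O"
            elif i == 0:
                tag = "B-" + label
            else:
                tag = "I-" + label
            tagged.append((word, tag))
    return tagged


# Hypothetical pre-segmented sentence with hand-assigned chunks.
sentence = [
    ("NP", ["我们"]),
    ("VP", ["分析"]),
    ("NP", ["中文", "文本"]),
    (None, ["。"]),
]
tagged = chunks_to_tags(sentence)
```

Once every word carries a tag, chunking becomes a sequence labeling problem, which is the form a CRF (or an HMM or MEMM baseline) is trained on.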

Abstract: A method based on the conditional random fields (CRF) model is introduced for Chinese text chunking. Chunking is transformed into labeling each word with a chunk tag, and a model is built from the tagged corpus using conditional random fields so as to predict the chunk tag of each word. An F1 score of 85.5% is achieved on the evaluation dataset of the Chinese treebank of Peking University, which is clearly better than the scores of the hidden Markov model and the maximum entropy Markov model. Experimental results show that the conditional random fields model is effective for Chinese text chunking and avoids the strict independence hypothesis and the label bias problem.

Key words: chunking, conditional random fields, feature function
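
The F1 score quoted in the abstract is the harmonic mean of chunk precision and recall. The following is a minimal sketch of how such a score can be computed, assuming B-/I-/O tags and exact-match scoring of chunk spans. The paper's actual evaluation procedure is not described in the abstract, and `extract_chunks` and `chunk_f1` are hypothetical names.

```python
def extract_chunks(tags):
    """Return the set of (start, end, label) spans in a B-/I-/O tag sequence.

    Assumption: an "I-" tag that does not continue the current chunk is
    treated leniently as the start of a new chunk.
    """
    chunks, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing chunk
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            if label is not None:
                chunks.add((start, i, label))
            if tag == "O":
                start, label = None, None
            else:
                start, label = i, tag[2:]
    return chunks


def chunk_f1(gold_tags, pred_tags):
    """F1 over chunks; a chunk is correct only if span and label both match."""
    gold, pred = extract_chunks(gold_tags), extract_chunks(pred_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative tags only, not data from the paper.
gold = ["B-NP", "B-VP", "B-NP", "I-NP", "O"]
pred = ["B-NP", "B-VP", "B-NP", "B-NP", "O"]
score = chunk_f1(gold, pred)
```

In this toy case the prediction splits the final two-word NP into two chunks, so 2 of 4 predicted chunks and 2 of 3 gold chunks match, giving precision 1/2, recall 2/3, and F1 = 4/7.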

CLC number: 

  • TP391