Journal of Jilin University (Engineering and Technology Edition), 2014, Vol. 44, Issue 6: 1843-1848. doi: 10.13229/j.cnki.jdxbgxb201406047


Named entity recognition in Chinese medical records based on cascaded conditional random field

YAN Yang1, 2, WEN Dun-wei3, WANG Yun-ji1, WANG Ke1

  1. College of Communication Engineering, Jilin University, Changchun 130012, China;
  2. College of Computer Science and Engineering, Changchun Normal University of Technology, Changchun 130032, China;
  3. School of Computing and Information Systems, Athabasca University, Athabasca, Alberta T9S3A3, Canada
  • Received: 2013-08-12 Online: 2014-11-01 Published: 2014-11-01
  • Corresponding author: WANG Ke (born 1955), male, professor, doctoral supervisor. Research interests: pattern recognition and image processing. E-mail: wangke@jlu.edu.cn
  • About the author: YAN Yang (born 1981), female, Ph.D. candidate. Research interests: pattern recognition and image processing. E-mail:
  • Supported by:
    Science and Technology Development Program of Jilin Province (201201112)

Abstract: A new method for named entity recognition in Chinese medical records based on cascaded Conditional Random Fields (CRFs) is proposed. The first CRF layer identifies the basic named entities in a record: basic body parts or tissues and basic disease names. Its output is passed to the second CRF layer, which recognizes nested named entities of two classes, disease names and clinical symptoms. A new combination feature, composed of part-of-speech features and named entity features, is defined; together with character features, word boundary features, and context features, it forms the feature set of the second layer. In open tests run with CRF++, the proposed method achieves an F-score 3% higher than a cascaded CRF without the combination feature and 7% higher than a single-layer CRF, a significant improvement in overall performance.

Key words: information processing, conditional random field, cascaded conditional random field, Chinese medical records, named entity recognition

CLC number: TP391
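
The second-layer feature set described in the abstract can be illustrated with a minimal sketch. It assumes character-level tokens and writes one sentence in the whitespace-separated column format that CRF++ reads, with the label in the last column. The tag names (`B-BODY`, `B-DIS`, `I-DIS`), the B/I/E/S word-boundary scheme, the part-of-speech tags, the `pos|entity` encoding of the combination feature, and the toy sentence are all hypothetical choices made for illustration; they are not taken from the paper.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    char: str      # the character itself (character feature)
    boundary: str  # word-boundary tag from segmentation: B, I, E or S (assumed scheme)
    pos: str       # part-of-speech tag of the word containing the character
    entity: str    # tag predicted by the first-layer CRF ("O" if none)


def combination_feature(tok: Token) -> str:
    """Join POS and first-layer entity tag into one symbol (assumed encoding)."""
    return f"{tok.pos}|{tok.entity}"


def to_crfpp_rows(sentence: List[Token], labels: List[str]) -> str:
    """Render one sentence in CRF++ column format; the label is the last column."""
    if len(sentence) != len(labels):
        raise ValueError("one label per token is required")
    rows = [
        "\t".join([t.char, t.boundary, t.pos, t.entity, combination_feature(t), y])
        for t, y in zip(sentence, labels)
    ]
    # CRF++ separates sentences with an empty line.
    return "\n".join(rows) + "\n\n"


# Hypothetical toy sentence: "左肺感染" (left lung infection).
# The first layer is assumed to have tagged "肺" as a body part.
sentence = [
    Token("左", "S", "f", "O"),
    Token("肺", "S", "n", "B-BODY"),
    Token("感", "B", "v", "O"),
    Token("染", "E", "v", "O"),
]
labels = ["B-DIS", "I-DIS", "I-DIS", "I-DIS"]  # second-layer targets (hypothetical)

block = to_crfpp_rows(sentence, labels)
print(block, end="")
```

Context features would normally not be stored as extra columns. In CRF++ they are expressed in the template file, where a macro such as `%x[-1,0]` refers to column 0 of the previous token, so a window over the columns above could supply the context features the abstract mentions.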