Journal of Jilin University (Engineering and Technology Edition), 2026, Vol. 56, Issue 1: 231-238. doi: 10.13229/j.cnki.jdxbgxb.20241174

• Computer Science and Technology •

Chinese named entity recognition algorithm with soft attention mask embedding

Xiu-hui WANG1, Yong-bo XU2

  1. School of Computer and Network Engineering, Shanxi Datong University, Datong 037009, China
  2. College of Artificial Intelligence, Henan University, Zhengzhou 450046, China
  • Received: 2024-10-10  Online: 2026-01-01  Published: 2026-02-03
  • About the author: WANG Xiu-hui (1981-), female, associate professor, master's degree. Research interest: machine learning. E-mail: zenlw@163.com
  • Funding: National Natural Science Foundation of China (12301138)

Abstract:

The semantics of Chinese words are often ambiguous: the same word can carry different meanings in different contexts, and different words and phrases contribute unequally to named entity recognition. Chinese text also contains features that are only weakly related to the task, and without weighting or masking these features reduce the model's recognition accuracy. To address this, a Chinese named entity recognition (CNER) algorithm with soft attention mask embedding is proposed. First, a multi-level CNER model is built. In its word vector representation layer, jieba segments the Chinese text passed from the input layer, and Word2Vec produces a vector for each word, forming a word vector sequence. Second, the BiLSTM layer applies bidirectional long short-term memory processing to this sequence to obtain, for each word vector, a feature vector that fuses the preceding and following context. Third, a soft attention mask module is embedded after the BiLSTM layer; its soft attention mechanism weights and masks the feature vectors output by the BiLSTM layer, emphasizing features that contribute strongly to entity recognition and removing or suppressing unimportant ones to improve recognition accuracy. Finally, the conditional random field (CRF) layer labels and decodes the feature vectors processed by the soft attention mask module to obtain the optimal entity label sequence, which is the Chinese named entity recognition result. Experimental results show that the algorithm recognizes Chinese named entities accurately and performs well in terms of entity label annotation coverage and F1 score.

Key words: Chinese naming, soft attention, entity recognition, mask operation, Word2Vec, BiLSTM model
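
The weighting and masking step described in the abstract can be illustrated with a small sketch. The abstract does not give the module's equations, so everything here is an assumption made for illustration: the additive scoring form, the sigmoid gate used as the soft mask, and the names `soft_attention_mask`, `w_att`, `v_att`, and `w_mask`. The feature vectors stand in for BiLSTM outputs and the weights are random rather than trained.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Sigmoid written via tanh so large |x| cannot overflow."""
    return 0.5 * (1.0 + math.tanh(x / 2.0))


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def soft_attention_mask(features, w_att, v_att, w_mask):
    """Weight and softly mask a sequence of feature vectors.

    features: T vectors of size d (stand-ins for BiLSTM outputs).
    Returns the per-position attention weights and the reweighted vectors.
    """
    # One scalar relevance score per position (assumed additive form).
    scores = [
        sum(v * math.tanh(a) for v, a in zip(v_att, matvec(w_att, h)))
        for h in features
    ]
    weights = softmax(scores)  # attention weights sum to 1 over positions

    output = []
    for alpha, h in zip(weights, features):
        # Soft mask: a gate in (0, 1) per feature dimension suppresses
        # a feature without removing it outright.
        gate = [sigmoid(g) for g in matvec(w_mask, h)]
        output.append([alpha * g * x for g, x in zip(gate, h)])
    return weights, output


random.seed(0)
T, d = 4, 3  # toy sequence length and feature size


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


features = rand_matrix(T, d)
w_att, w_mask = rand_matrix(d, d), rand_matrix(d, d)
v_att = [random.uniform(-1, 1) for _ in range(d)]

weights, masked = soft_attention_mask(features, w_att, v_att, w_mask)
print([round(w, 3) for w in weights])
```

Because both the attention weight and the gate lie strictly between 0 and 1, every output component is a scaled-down copy of its input, which is what distinguishes a soft mask from a hard 0/1 mask.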

CLC number: TP391.1

Fig. 1 CNER model design

Fig. 2 Structure of the CBOW model

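Table 1 Main experimental parameters

Parameter                         Value
Word vector dimension             200
CBOW window size                  7
LSTM cell hidden layer            60
Number of hidden-layer neurons    256
Learning rate                     0.1
Dropout                           0.5
Batch size                        32
Maximum number of iterations      60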
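
For reproduction purposes, the parameters listed above could be collected in a single immutable configuration object. This is only a sketch: the class name `ExperimentConfig` and the field names are hypothetical, chosen to mirror the table rows literally, and the table does not say how each value maps onto a particular framework's arguments.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Values copied from the experimental parameter table; names are illustrative."""

    word_vector_dim: int = 200
    cbow_window_size: int = 7
    lstm_cell_hidden_layer: int = 60
    hidden_layer_neurons: int = 256
    learning_rate: float = 0.1
    dropout: float = 0.5
    batch_size: int = 32
    max_iterations: int = 60


config = ExperimentConfig()
for name, value in asdict(config).items():
    print(f"{name} = {value}")
```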

Fig. 3 Chinese medical case data collection platform

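Table 2 Distribution of entities in the Chinese medical cases

Entity                  Training set    Test set
Person name (Name)      967             290
Symptom (Symptom)       3,684           1,105
Body part (Body)        613             184
Examination (Test)      2,846           854
Drug (Drug)             1,057           317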

Table 3 BIO tagging rules

Tag    Description
B      Beginning of an entity
I      Inside of an entity
E      End of an entity
O      Non-entity
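
The tags above can be converted back into entity spans once a tag sequence has been decoded. The sketch below assumes each tag is written as a prefix plus an entity type, for example `B-Symptom`; that tag format, the function name `extract_entities`, and the sample sentence are illustrative assumptions rather than details taken from the paper.

```python
def extract_entities(tokens, tags):
    """Collect (text, type) pairs from a sequence tagged with B/I/E/O prefixes."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity being built, if any, and reset the buffer.
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.append(("".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            flush()
            current_tokens, current_type = [token], label
        elif prefix in ("I", "E") and current_type == label:
            current_tokens.append(token)
            if prefix == "E":
                flush()
        else:
            # "O", or an I/E tag that does not continue the open entity.
            flush()
    flush()
    return entities


# Hypothetical example sentence, tagged by hand for illustration.
tokens = list("患者头痛服用阿司匹林")
tags = [
    "O", "O",
    "B-Symptom", "E-Symptom",
    "O", "O",
    "B-Drug", "I-Drug", "I-Drug", "E-Drug",
]
print(extract_entities(tokens, tags))
# [('头痛', 'Symptom'), ('阿司匹林', 'Drug')]
```

An I or E tag that does not follow a matching B tag is dropped here; a stricter evaluator might instead count it as an error.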

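Fig. 4 Entity recognition results for Chinese medical cases

Fig. 5 Coverage of entity label annotation

Fig. 6 Ablation test of the proposed method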
