Journal of Jilin University (Engineering and Technology Edition) ›› 2023, Vol. 53 ›› Issue (12): 3529-3535. doi: 10.13229/j.cnki.jdxbgxb.20221137

• Computer Science and Technology •

Fast recognition method of short text named entities considering feature sparsity

Yue-kun MA1,2,3,4, Yi-feng HAO1

  1. College of Artificial Intelligence, North China University of Science and Technology, Tangshan 063210, China
  2. Hebei Provincial Key Laboratory of Industrial Intelligent Perception, North China University of Science and Technology, Tangshan 063210, China
  3. School of Computer & Communication Engineering, University of Science & Technology Beijing, Beijing 100083, China
  4. Beijing Key Laboratory of Knowledge Engineering for Materials Science, University of Science & Technology Beijing, Beijing 100083, China
  • Received: 2022-10-12 Online: 2023-12-01 Published: 2024-01-12
  • About the author: Yue-kun MA (1976-), female, professor, master's degree. Research interest: natural language processing. E-mail: mayuekun@163.com
  • Funding:
    Fundamental Research Funds for the Central Universities (FRF-DF-20-04); Hebei Province "333" Talent Project (A201803083)

Abstract:

First, appropriate features are selected by filtering punctuation marks and are assembled into vectors; the text is segmented so that groups of two or more words that carry a specific meaning are kept together, and each word is labeled with its part of speech to identify the corresponding word class. Second, a latent Dirichlet allocation (LDA) model relates topics to documents, making each document's topic explicit and reducing the sparsity of the data features. Third, the proposed bidirectional long short-term memory conditional random field (BR-BiLSTM-CRF) model detects the boundaries of named entities with a bidirectional LSTM and combines them with the entity types output by the chain conditional random field layer; with lexical and part-of-speech features added, entity boundaries are detected over the whole text sequence. Finally, the network parameters are corrected using cross-entropy and gradient descent until the error no longer exceeds a specified value, which completes named entity recognition. Experimental results show that the proposed method is fast and accurate and performs well overall; it makes the parts of speech in a text more explicit for machine processing and improves the accuracy and efficiency of named entity recognition.

Key words: natural language processing, feature sparsity, short text naming, fast recognition of short text entities, text preprocessing, feature weight
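The final step of the abstract (correct the parameters with cross-entropy and gradient descent until the error no longer exceeds a specified value) can be illustrated with a minimal sketch. A one-feature logistic model stands in for the network here; the model, the toy data, and the names `train_until_tolerance`, `lr`, `tol`, and `max_steps` are all assumptions for illustration, not the authors' implementation.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_until_tolerance(xs, ys, lr=0.5, tol=0.05, max_steps=10_000):
    """Gradient descent on the mean cross-entropy of a logistic model.

    Stops as soon as the loss is <= tol (the "specified value"), or after
    max_steps updates. Returns (w, b, last_evaluated_loss, steps_used).
    """
    w, b = 0.0, 0.0
    loss = float("inf")
    for step in range(1, max_steps + 1):
        probs = [sigmoid(w * x + b) for x in xs]
        # Mean binary cross-entropy between predictions and labels.
        loss = -sum(
            y * math.log(p) + (1 - y) * math.log(1 - p)
            for p, y in zip(probs, ys)
        ) / len(xs)
        if loss <= tol:
            return w, b, loss, step
        # Gradients of the mean cross-entropy with respect to w and b.
        grad_w = sum((p - y) * x for p, x, y in zip(probs, xs, ys)) / len(xs)
        grad_b = sum(p - y for p, y in zip(probs, ys)) / len(xs)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, loss, max_steps


# Hypothetical, linearly separable toy data.
w, b, loss, steps = train_until_tolerance([-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1])
print(f"stopped after {steps} steps with loss {loss:.4f}")
```

In a real tagger the same stopping rule would wrap the optimizer step of the full network; the threshold and learning rate would need to be tuned for the data at hand.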

CLC Number: 

  • TP391.1

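Table 1  Punctuation filtering table

| Name | Style | Name | Style |
| --- | --- | --- | --- |
| Unit symbols | ¥$℃¢£ | Exclamation mark | !! |
| Percent and per mille signs | %、‰ | Left brackets | (({[【《< |
| Dash | —— | Right brackets | ))}]】》> |
| Ellipsis | …... | Left quotation marks | “‘ |
| Colon | | Right quotation marks | ”’ |
| Enumeration comma | | Period | |
| Semicolon | ;; | Question mark | ?? |
| Comma | ,, | | |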

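Table 2  Part-of-speech abbreviations

| Function word | POS abbreviation | Content word | POS abbreviation |
| --- | --- | --- | --- |
| Preposition | p | Verb | v |
| Adverb | d | Noun | n |
| Auxiliary word | u | Adjective | a |
| Interjection | e | Measure word | q |
| Modal particle | o | Numeral | m |
| Conjunction | c | Pronoun | r |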
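As a small illustration of how the abbreviations in Table 2 might be used after tagging, the sketch below maps each tag to its word class and keeps only content words. The tag sets follow Table 2; the helper names `word_class` and `content_words` and the tagged example sentence are hypothetical.

```python
# Tag sets as listed in Table 2.
FUNCTION_TAGS = {
    "p": "preposition", "d": "adverb", "u": "auxiliary word",
    "e": "interjection", "o": "modal particle", "c": "conjunction",
}
CONTENT_TAGS = {
    "v": "verb", "n": "noun", "a": "adjective",
    "q": "measure word", "m": "numeral", "r": "pronoun",
}


def word_class(tag: str) -> str:
    """Classify a POS abbreviation as 'content', 'function', or 'unknown'."""
    if tag in CONTENT_TAGS:
        return "content"
    if tag in FUNCTION_TAGS:
        return "function"
    return "unknown"


def content_words(tagged):
    """Keep only the words whose tag is a content-word tag."""
    return [word for word, tag in tagged if word_class(tag) == "content"]


# Hypothetical tagged sentence: (word, POS abbreviation) pairs.
tagged = [("我", "r"), ("在", "p"), ("北京", "n"), ("工作", "v")]
print(content_words(tagged))
```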

Figure 1  LDA probabilistic model
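For orientation, the standard (smoothed) LDA generative process is written out below. The symbols $\alpha$, $\beta$, $\theta_d$, $\varphi_k$, $z_{d,n}$, and $w_{d,n}$ are the conventional ones and are assumed here; the notation used in the figure itself is not available on this page and may differ.

```latex
\begin{aligned}
\varphi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1,\dots,K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1,\dots,D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d), & n &= 1,\dots,N_d \\
w_{d,n} \mid z_{d,n}, \varphi &\sim \mathrm{Multinomial}(\varphi_{z_{d,n}}) \\[4pt]
p(w, z, \theta, \varphi \mid \alpha, \beta)
  &= \prod_{k=1}^{K} p(\varphi_k \mid \beta)
     \prod_{d=1}^{D} p(\theta_d \mid \alpha)
     \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \varphi_{z_{d,n}})
\end{aligned}
```

A short text contributes only a few observed words $w_{d,n}$, so its word-count vector is very sparse, whereas its topic distribution $\theta_d$ is a dense vector of length $K$. This is presumably the sense in which the abstract describes LDA as reducing feature sparsity.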

Figure 2  BR-BiLSTM-CRF model
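In BiLSTM-CRF taggers, the CRF layer usually picks the best tag sequence by Viterbi decoding over per-token emission scores and tag-to-tag transition scores. The sketch below shows that standard decoding step only; it is not the authors' BR-BiLSTM-CRF code, and the tag set (O, B, I), the score values, and the function name `viterbi` are assumptions for illustration. The emission scores stand in for what a bidirectional LSTM would output.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    n_tags = len(transitions)
    score = list(emissions[0])   # best score of any path ending in each tag
    backptr = []                 # backptr[t-1][j]: best previous tag for j at t
    for emit in emissions[1:]:
        prev_best, new_score = [], []
        for j in range(n_tags):
            i_best = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            prev_best.append(i_best)
            new_score.append(score[i_best] + transitions[i_best][j] + emit[j])
        backptr.append(prev_best)
        score = new_score
    # Follow the back-pointers from the best final tag to recover the path.
    last = max(range(n_tags), key=lambda j: score[j])
    path = [last]
    for prev_best in reversed(backptr):
        path.append(prev_best[path[-1]])
    path.reverse()
    return path, score[last]


TAGS = ["O", "B", "I"]
# Hypothetical scores: the O -> I transition is heavily penalized, because an
# entity cannot continue (I) without having begun (B).
transitions = [[0.0, 0.0, -10.0],
               [0.0, 0.0, 1.0],
               [0.0, 0.0, 1.0]]
emissions = [[2.0, 1.5, 0.0],
             [0.0, 0.0, 3.0],
             [2.0, 0.0, 0.5]]
path, best = viterbi(emissions, transitions)
print([TAGS[i] for i in path], best)
```

Choosing the best tag per token independently would give O, I, O here, which is not a valid entity span; decoding over the whole sequence yields B, I, O instead, which illustrates why sequence-level decoding helps boundary detection.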

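Table 3  Number of samples in the experimental datasets

| Name | Number of samples (person names, place names, organization names, other) |
| --- | --- |
| MSRA dataset | 15218754791017 |
| OntoNotes dataset | 104210843281108 |
| Iris dataset | 80448111931054 |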

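Figure 3  Comparison of the three methods on the MSRA dataset

Figure 4  Comparison of the three methods on the OntoNotes dataset

Figure 5  Comparison of the three methods on the Iris dataset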
