Journal of Jilin University (Information Science Edition) ›› 2023, Vol. 41 ›› Issue (4): 608-620.

面向高中化学试题的命名实体识别

 张璐1, 马子睿2, 王岳3, 马翠玲4

  1. 北方民族大学 计算机科学与工程学院, 银川 750021; 2. 宁夏大学 信息工程学院, 银川 750021; 3. 吉林大学 计算机科学与技术学院, 长春 130012; 4. 石嘴山市第三中学, 宁夏 石嘴山 753000
  • 收稿日期:2022-08-25 出版日期:2023-08-16 发布日期:2023-08-17
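  • Corresponding author: MA Zirui (born 1977), male, from Yinchuan, associate professor at Ningxia University, working mainly on intelligent information processing. Tel: 86-13895076591; E-mail: mzr@nxu.edu.cn.
  • About the author: ZHANG Lu (born 1996), female, from Shizuishan, Ningxia, master's student at North Minzu University, working mainly on natural language processing. Tel: 86-18810926506; E-mail: 892440475@qq.com.
  • Supported by:
    National Educational Information Technology Research Project Fund (186430001)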

Named Entity Recognition for High School Chemistry Exam Papers

 ZHANG Lu 1, MA Zirui 2, WANG Yue 3, MA Cuiling 4

  1. School of Computer Science and Engineering, North Minzu University, Yinchuan 750021, China; 2. School of Information Engineering, Ningxia University, Yinchuan 750021, China; 3. College of Computer Science and Technology, Jilin University, Changchun 130012, China; 4. Shizuishan No. 3 Middle School, Shizuishan 753000, China
  • Received: 2022-08-25 Online: 2023-08-16 Published: 2023-08-17

摘要: 中文化学命名实体结构没有严格的构词规律可循, 识别实体中包含字母、数字、特殊符号等多种形式, 传统字向量模型无法有效区分化学术语中存在的嵌套实体和歧义实体。为此, 将高中化学试题资源的命名实体划分为物质、性质、量值、实验四大类, 并构建化学学科实体词汇表辅助人工标注。通过 ALBERT 预训练模型提取文本特征并生成动态字向量, 结合 BILSTM-CRF (Bidirectional Long Short-Term Memory with Conditional Random Field) 模型对高中化学试题文本进行命名实体识别。实验结果表明, 该模型的精确率、召回率和 F1 值分别达到了 95.24%、95.26%、95.25%。

关键词: 命名实体识别, ALBERT 预训练模型, 双向长短期记忆网络, 条件随机场, 化学资源文本 

Abstract: Chinese chemical named entities follow no strict word-formation rules, and the entities to be recognized mix letters, digits, special symbols, and other forms, so traditional character-vector models cannot effectively distinguish the nested and ambiguous entities found in chemical terms. To address this, the named entities in high school chemistry exam resources are divided into four categories (substances, properties, quantities, and experiments), and a vocabulary of chemistry entities is constructed to assist manual labeling. The ALBERT pre-training model is then used to extract text features and generate dynamic character vectors, which are combined with a BILSTM-CRF (Bidirectional Long Short-Term Memory with Conditional Random Field) model to perform named entity recognition on the text of high school chemistry questions. Experimental results show that the precision, recall, and F1 score of the proposed model reach 95.24%, 95.26%, and 95.25%, respectively.

Key words: named entity recognition, A Lite BERT (ALBERT) pre-training model, bidirectional long short-term memory network, conditional random field (CRF), chemical resources text
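
As a rough illustration of how the precision, recall, and F1 figures in the abstract could be computed, the sketch below scores character-level BIO tag sequences by exact entity match. It is not the authors' evaluation code. The BIO scheme, the label names `SUB` (substance) and `PRO` (property), the exact-match criterion, and the sample sentence are all assumptions made for this example; the abstract states only the four entity categories and the three reported scores.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, label)


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect (start, end, label) entity spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def precision_recall_f1(gold: List[List[str]], pred: List[List[str]]):
    """Entity-level scores: a prediction counts only if span and label both match."""
    gold_spans, pred_spans = set(), set()
    for idx, (g, p) in enumerate(zip(gold, pred)):
        gold_spans |= {(idx, *s) for s in extract_spans(g)}
        pred_spans |= {(idx, *s) for s in extract_spans(p)}
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical sentence "氯化钠溶液呈中性", tagged one character at a time.
# Gold marks the full entity 氯化钠溶液; the prediction stops at the nested 氯化钠.
gold = [["B-SUB", "I-SUB", "I-SUB", "I-SUB", "I-SUB", "O", "B-PRO", "I-PRO"]]
pred = [["B-SUB", "I-SUB", "I-SUB", "O", "O", "O", "B-PRO", "I-PRO"]]
print(precision_recall_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

In this toy case only the property entity matches exactly, so one of two predicted spans and one of two gold spans is correct. The scores reported in the abstract are consistent with the harmonic-mean definition of F1 used here: 2 × 95.24 × 95.26 / (95.24 + 95.26) ≈ 95.25.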

CLC Number:

  • TP391.1