Journal of Jilin University (Information Science Edition), 2025, Vol. 43, Issue (4): 755-762.


Chinese Word Segmentation Method for Specialized Domains Based on XLBMC

REN Weijian1a,1b, ZHANG Yidong1a, REN Lu2, ZHANG Yongfeng3, SUN Qinjiang4

  1a. College of Electrical and Information Engineering; 1b. Heilongjiang Provincial Key Laboratory of Networking and Intelligent Control, Northeast Petroleum University, Daqing 163318, China; 2. Marine Engineering Technology Center, Offshore Oil Engineering Company Limited, Tianjin 300450, China; 3. No.2 Oil Production Plant, Daqing Oilfield Company Limited, Daqing 163414, China; 4. Tianjin Branch, China National Offshore Oil Corporation, Tianjin 300450, China
  • Received: 2024-05-16  Online: 2025-08-15  Published: 2025-08-14
  • About the author: REN Weijian (1963—), female, born in Tailai, Heilongjiang, China; professor and doctoral supervisor at Northeast Petroleum University; her research focuses on fault diagnosis of oil and gas gathering and transportation processes. (Tel) 86-15765988699; (E-mail) renwj@126.com
  • Supported by:
    National Natural Science Foundation of China (61933007; 61873058); General Program of the Natural Science Foundation of Hebei Province (D2022107001)



Abstract: General-purpose segmentation methods achieve low accuracy on Chinese word segmentation in specialized domains because of cross-domain data distribution mismatch and the large number of out-of-vocabulary domain-specific words. To address this problem, a specialized-domain word segmentation method based on XLBMC (XLNet-BiGRU-Multi-head Self-attention-Conditional Random Field) is proposed. First, dynamic word vectors containing contextual semantic information are generated by an improved XLNet pre-trained model, enabling the model to better exploit boundary features and semantic knowledge. The resulting word vectors are then fed into a BiGRU for feature extraction, yielding a hidden-state representation of each character. On top of the BiGRU encoding, a sparsified MHSA (Multi-head Self-Attention) mechanism is introduced to re-weight the representation of each character, improving the model's prediction accuracy on fine-grained sequences with strong long-term dependencies under a restricted memory budget. Finally, a CRF (Conditional Random Field) decodes the dependencies between neighboring tags and outputs the optimal segmentation sequence. Segmentation experiments are conducted on a self-constructed control engineering corpus. The results show that the proposed segmentation model achieves an accuracy of 94.27%, a recall of 93.24%, and an F1 value of 95.52%, demonstrating its reliability for Chinese word segmentation tasks in specialized domains.
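To illustrate how the four stages described in the abstract fit together, the following is a minimal PyTorch sketch of an XLNet-BiGRU-attention-CRF tagger for BMES-style character labelling. It is an illustration rather than the authors' implementation: the checkpoint name hfl/chinese-xlnet-base, the third-party pytorch-crf package, a standard dense nn.MultiheadAttention in place of the paper's sparse variant, and all layer sizes are assumptions.

import torch.nn as nn
from transformers import XLNetModel   # pip install transformers
from torchcrf import CRF              # pip install pytorch-crf


class XLBMCTagger(nn.Module):
    """Illustrative XLNet-BiGRU-attention-CRF tagger for BMES character labels."""

    def __init__(self, num_tags=4, gru_hidden=256, num_heads=8,
                 xlnet_name="hfl/chinese-xlnet-base"):   # assumed checkpoint
        super().__init__()
        # 1) Dynamic, context-dependent character embeddings from XLNet.
        self.xlnet = XLNetModel.from_pretrained(xlnet_name)
        d_model = self.xlnet.config.d_model
        # 2) BiGRU extracts sequential features in both directions.
        self.bigru = nn.GRU(d_model, gru_hidden, batch_first=True,
                            bidirectional=True)
        # 3) Multi-head self-attention re-weights every position; the paper's
        #    sparse variant is replaced here by PyTorch's dense implementation.
        self.attn = nn.MultiheadAttention(2 * gru_hidden, num_heads,
                                          batch_first=True)
        # 4) Emission scores over the BMES tags, decoded by a linear-chain CRF.
        self.emit = nn.Linear(2 * gru_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def _emissions(self, input_ids, attention_mask):
        h = self.xlnet(input_ids=input_ids,
                       attention_mask=attention_mask).last_hidden_state
        h, _ = self.bigru(h)
        pad = attention_mask == 0                    # True at padded positions
        h, _ = self.attn(h, h, h, key_padding_mask=pad)
        return self.emit(h)

    def loss(self, input_ids, attention_mask, tags):
        # pytorch-crf returns the log-likelihood; negate it for a training loss.
        emissions = self._emissions(input_ids, attention_mask)
        return -self.crf(emissions, tags, mask=attention_mask.bool())

    def decode(self, input_ids, attention_mask):
        # Viterbi decoding yields the best BMES tag sequence for each sentence.
        emissions = self._emissions(input_ids, attention_mask)
        return self.crf.decode(emissions, mask=attention_mask.bool())

In practice XLNet's SentencePiece tokenization would also have to be aligned to individual Chinese characters so that every character receives exactly one BMES tag; that alignment, the sparse attention mechanism, and the training hyperparameters would need to follow the paper itself.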

Key words: Chinese word segmentation, domain specific, XLNet, multi-head self-attention, control engineering

CLC number: TP391