Chinese Segmentation Method for Specialized Domains Based on XLBMC

Journal of Jilin University (Information Science Edition), 2025, Vol. 43, Issue (4): 755-762.

REN Weijian^1a,1b, ZHANG Yidong^1a, REN Lu², ZHANG Yongfeng³, SUN Qinjiang⁴

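1. Northeast Petroleum University, a. College of Electrical and Information Engineering; b. Heilongjiang Provincial Key Laboratory of Networking and Intelligent Control, Daqing 163318, Heilongjiang, China; 2. Marine Engineering Technology Center, Offshore Oil Engineering Company Limited, Tianjin 300450, China; 3. No. 2 Oil Production Plant, Daqing Oilfield Company Limited, Daqing 163414, Heilongjiang, China; 4. Tianjin Branch, China National Offshore Oil Corporation, Tianjin 300450, China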

Received: 2024-05-16 Online: 2025-08-15 Published: 2025-08-14
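About the author: REN Weijian (born 1963), female, native of Tailai, Heilongjiang; professor and doctoral supervisor at Northeast Petroleum University; her research focuses on fault diagnosis in oil and gas gathering and transportation processes. (Tel) 86-15765988699 (E-mail) renwj@126.com.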
Funding:
National Natural Science Foundation of China (61933007; 61873058); General Program of the Natural Science Foundation of Hebei Province (D2022107001)

Chinese Segmentation Method for Specialized Domains Based on XLBMC

REN Weijian^1a,1b, ZHANG Yidong^1a, REN Lu², ZHANG Yongfeng³, SUN Qinjiang⁴

1a. College of Electrical and Information Engineering; 1b. Heilongjiang Provincial Key Laboratory of Networking and Intelligent Control, Northeast Petroleum University, Daqing 163318, China; 2. Marine Engineering Technology Center, Offshore Oil Engineering Company Limited, Tianjin 300450, China; 3. No. 2 Oil Production Plant, Daqing Oilfield Company Limited, Daqing 163414, China; 4. Tianjin Branch, China National Offshore Oil Corporation, Tianjin 300450, China

Received: 2024-05-16 Online: 2025-08-15 Published: 2025-08-14

Abstract

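Abstract: General-purpose segmentation methods achieve low accuracy on Chinese word segmentation in specialized domains, because the data distribution differs across domains and many domain terms are out of vocabulary. To address this, a specialized-domain segmentation method based on XLBMC (XLNet-BiGRU-Multi-head Self-attention-Conditional Random Field) is proposed. First, an improved XLNet pre-trained model generates dynamic word vectors that carry contextual semantic information, so that the model makes better use of boundary features and semantic knowledge. The word vectors are then fed into a BiGRU for feature extraction, yielding a hidden-state representation for each character. On top of the BiGRU encoding, a sparse multi-head self-attention mechanism (Multi-head Self-attention) produces a weighted representation of each character, improving prediction accuracy on fine-grained time series with strong long-term dependencies under a limited memory budget. Finally, a CRF (Conditional Random Field) decodes the dependencies between adjacent labels and outputs the best segmentation sequence. Segmentation experiments on a self-built control engineering corpus show that the model achieves an accuracy of 94.27%, a recall of 93.24%, and an F1 score of 95.52%, demonstrating its reliability for Chinese word segmentation in specialized domains.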

Keywords: Chinese word segmentation, specialized domain, XLNet pre-trained model, multi-head attention mechanism, control engineering
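The last stage of the pipeline, in which a CRF decodes the dependencies between adjacent labels, can be illustrated with a minimal Viterbi sketch. Everything specific here is an assumption made for illustration and does not come from the paper: the BMES tag set, the hand-set transition scores (0 for legal tag pairs, negative infinity for illegal ones), and the made-up emission scores that stand in for the output of the upstream XLNet, BiGRU, and attention layers. A trained CRF would learn its transition scores from data.

```python
from typing import Dict, List

# Assumed tagging scheme (not stated in the source):
# B = begin, M = middle, E = end of a multi-character word, S = single-character word.
TAGS = ["B", "M", "E", "S"]
NEG_INF = float("-inf")

# Hypothetical transition scores: legal tag bigrams score 0, illegal ones -inf.
LEGAL = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
         ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}
TRANS = {(a, b): (0.0 if (a, b) in LEGAL else NEG_INF) for a in TAGS for b in TAGS}


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring legal tag sequence for per-character scores."""
    if not emissions:
        return []
    # A sentence can only start with B or S.
    score = {t: (emissions[0][t] if t in ("B", "S") else NEG_INF) for t in TAGS}
    backpointers = []
    for emit in emissions[1:]:
        new_score, pointer = {}, {}
        for cur in TAGS:
            best_prev = max(TAGS, key=lambda p: score[p] + TRANS[(p, cur)])
            new_score[cur] = score[best_prev] + TRANS[(best_prev, cur)] + emit[cur]
            pointer[cur] = best_prev
        score = new_score
        backpointers.append(pointer)
    # A sentence can only end with E or S.
    path = [max(("E", "S"), key=lambda t: score[t])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


def tags_to_words(chars: str, tags: List[str]) -> List[str]:
    """Cut the character string after every E or S tag."""
    words, buffer = [], ""
    for ch, tag in zip(chars, tags):
        buffer += ch
        if tag in ("E", "S"):
            words.append(buffer)
            buffer = ""
    if buffer:  # keep a trailing unfinished word rather than dropping it
        words.append(buffer)
    return words


# Made-up emission scores for the four characters of "控制系统" ("control system").
EXAMPLE_EMISSIONS = [
    {"B": 2.0, "M": 0.1, "E": 0.0, "S": 1.0},
    {"B": 0.2, "M": 1.5, "E": 1.4, "S": 0.1},
    {"B": 1.8, "M": 0.3, "E": 0.2, "S": 0.5},
    {"B": 0.1, "M": 0.2, "E": 2.0, "S": 0.4},
]

best_tags = viterbi(EXAMPLE_EMISSIONS)
print(best_tags)                             # ['B', 'E', 'B', 'E']
print(tags_to_words("控制系统", best_tags))  # ['控制', '系统']
```

Picking the best tag for each character independently would give B, M, B, E for these scores, which contains the illegal pair M followed by B. Decoding the sequence jointly avoids that, which is the reason for placing a CRF after the encoder.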

Abstract: To address the low accuracy of general-purpose methods on Chinese word segmentation in professional domains, which is caused by the mismatch of cross-domain data distributions and the large number of out-of-vocabulary professional terms, a professional-domain word segmentation method based on XLBMC (XLNet-BiGRU-Multi-head Self-attention-Conditional Random Field) is proposed. First, dynamic word vectors containing contextual semantic information are generated by an improved XLNet pre-trained model, enabling the model to make better use of boundary features and semantic knowledge. The word vectors are then input into a BiGRU for feature extraction to obtain the hidden-state representation of each character. On top of the BiGRU encoding, a sparsified MHSA (Multi-head Self-Attention) mechanism is introduced to weight the representation of each character, which improves the prediction accuracy of the model on fine-grained time series with strong long-term dependencies under a restricted memory budget. Finally, the CRF (Conditional Random Field) decodes the dependencies between neighboring tags and outputs the optimal segmentation sequence. Segmentation experiments are conducted on a self-constructed control engineering corpus. The results show that the accuracy of the proposed segmentation model is 94.27%, the recall is 93.24%, and the F1 value is 95.52%, which demonstrates the reliability of the model for Chinese word segmentation in the professional domain.

Key words: Chinese word segmentation, domain specific, XLNet, multi-head self-attention, control engineering
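The abstract reports accuracy, recall, and F1 for the segmentation output. A common way to score Chinese word segmentation, sketched below, compares the character spans of predicted words against those of the gold words; F1 is the harmonic mean of precision and recall. This is a sketch under assumptions: it treats the reported accuracy as word-level precision, and the function names and the toy sentence are hypothetical. The source does not describe the evaluation script actually used.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmented sentence to the set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level scores: a predicted word counts only if both boundaries match."""
    gold_spans, pred_spans = word_spans(gold), word_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


# Toy example (hypothetical): the prediction wrongly splits "稳定性" into two words.
gold = ["控制", "系统", "的", "稳定性"]
pred = ["控制", "系统", "的", "稳定", "性"]
p, r, f1 = precision_recall_f1(gold, pred)
print(f"precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
# precision=0.6000 recall=0.7500 f1=0.6667
```

Matching on spans rather than on word strings keeps repeated words from being counted as correct when they appear at the wrong position.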

CLC number: TP391

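REN Weijian, ZHANG Yidong, REN Lu, ZHANG Yongfeng, SUN Qinjiang. Chinese Segmentation Method for Specialized Domains Based on XLBMC[J]. Journal of Jilin University (Information Science Edition), 2025, 43(4): 755-762.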

REN Weijian, ZHANG Yidong, REN Lu, ZHANG Yongfeng, SUN Qinjiang. Chinese Segmentation Method for Specialized Domains Based on XLBMC[J]. Journal of Jilin University (Information Science Edition), 2025, 43(4): 755-762.
