Journal of Jilin University (Information Science Edition) ›› 2025, Vol. 43 ›› Issue (4): 755-762.

Chinese Segmentation Method for Specialized Domains Based on XLBMC

REN Weijian1a,1b, ZHANG Yidong1a, REN Lu2, ZHANG Yongfeng3, SUN Qinjiang4   

  1. 1a. College of Electrical and Information Engineering; 1b. Heilongjiang Provincial Key Laboratory of Networking and Intelligent Control, Northeast Petroleum University, Daqing 163318, China; 2. Marine Engineering Technology Center, Offshore Oil Engineering Company Limited, Tianjin 300450, China; 3. No.2 Oil Production Plant, Daqing Oilfield Company Limited, Daqing 163414, China; 4. Tianjin Branch, China National Offshore Oil Corporation, Tianjin 300450, China
  • Received: 2024-05-16 Online: 2025-08-15 Published: 2025-08-14

Abstract: Chinese word segmentation in professional domains suffers from low accuracy because data distributions differ across domains and many professional terms are out-of-vocabulary (unregistered) words. To address this, a professional-domain word segmentation method based on XLBMC (XLNet-BiGRU-Multi-head Self-Attention-Conditional Random Field) is proposed. First, dynamic word vectors containing contextual semantic information are generated by an improved XLNet pre-trained model, enabling the model to make better use of boundary and semantic knowledge. The word vectors are then fed into a BiGRU for feature extraction to obtain the hidden-state representation of each character. On top of the BiGRU encoding, a sparsified MHSA (Multi-head Self-Attention) mechanism is introduced to weight the representation of each character, which improves the model's prediction accuracy on fine-grained time series with strong long-term dependencies under a restricted memory budget. Finally, a CRF (Conditional Random Field) layer decodes the dependencies between neighboring tags and outputs the optimal segmentation sequence. Segmentation experiments are conducted on a self-constructed control engineering corpus. The results show that the proposed segmentation model achieves an accuracy of 94.27%, a recall of 93.24%, and an F1 value of 95.52%, demonstrating the reliability of the model for Chinese word segmentation in the professional domain.
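
The final decoding step described in the abstract can be illustrated with a minimal sketch. It assumes a BMES tagging scheme (Begin, Middle, End, Single), hard 0/-inf transition constraints in place of learned CRF transition scores, and hand-picked per-character scores standing in for the output of the XLNet, BiGRU, and MHSA layers. None of these details are stated in the abstract; the names `viterbi`, `tags_to_words`, and `emissions` are hypothetical and the numbers are illustrative only.

```python
# Illustrative sketch only: the BMES scheme, the hard transition constraints,
# and every score below are assumptions, not details taken from the paper.
TAGS = ["B", "M", "E", "S"]
NEG = float("-inf")

# Transitions between neighbouring tags: 0.0 if allowed, -inf if forbidden.
ALLOWED = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
           ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}
TRANS = {(a, b): (0.0 if (a, b) in ALLOWED else NEG) for a in TAGS for b in TAGS}


def viterbi(emissions):
    """Return the highest-scoring valid tag sequence for per-character scores."""
    # A sentence can only start with B or S.
    score = {t: (emissions[0][t] if t in ("B", "S") else NEG) for t in TAGS}
    back = []
    for emit in emissions[1:]:
        prev, score, ptr = score, {}, {}
        for t in TAGS:
            best = max(TAGS, key=lambda p: prev[p] + TRANS[(p, t)])
            score[t] = prev[best] + TRANS[(best, t)] + emit[t]
            ptr[t] = best
        back.append(ptr)
    # A sentence can only end with E or S.
    last = max(("E", "S"), key=lambda t: score[t])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def tags_to_words(chars, tags):
    """Cut the character sequence after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


def row(b, m, e, s):
    """Build one character's tag-score dictionary."""
    return {"B": b, "M": m, "E": e, "S": s}


chars = "控制系统稳定"
# Hypothetical scores; a per-character argmax would pick the invalid pair E -> M
# at the second and third characters, which the transition constraints rule out.
emissions = [row(2.0, 0.1, 0.1, 0.5), row(0.2, 1.0, 1.2, 0.1),
             row(0.5, 1.0, 0.2, 0.1), row(0.1, 0.3, 2.0, 0.2),
             row(2.0, 0.1, 0.2, 0.4), row(0.1, 0.2, 2.0, 0.3)]

tags = viterbi(emissions)
words = tags_to_words(chars, tags)
print(tags, words)
```

In this toy input, decoding over the whole sequence yields a valid tag path even though the locally best tags conflict, which is the role the abstract assigns to modeling dependencies between neighboring tags.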

Key words: Chinese word segmentation, domain specific, XLNet, multi-head self-attention, control engineering

CLC Number: TP391