Journal of Jilin University (Science Edition) ›› 2025, Vol. 63 ›› Issue (2): 551-558.


Cross-Language Text Similarity Model Based on Alternating Language Data Reconstruction Method

WANG Yi1, WANG Kunning2, LIU Ming2

  1. School of Foreign Languages, Changchun University of Technology, Changchun 130012, China; 2. School of Mathematics and Statistics, Changchun University of Technology, Changchun 130012, China
  • Received:2024-02-28 Online:2025-03-26 Published:2025-03-26
  • Corresponding author: LIU Ming E-mail: liuming@ccut.edu.cn

Cross-Language Text Similarity Model Based on Alternating Language Data Reconstruction Method

WANG Yi1, WANG Kunning2, LIU Ming2   

  1. School of Foreign Languages, Changchun University of Technology, Changchun 130012, China; 2. School of Mathematics and Statistics, Changchun University of Technology, Changchun 130012, China
  • Received:2024-02-28 Online:2025-03-26 Published:2025-03-26

Abstract: To address the problem that existing multilingual models use multilingual datasets inefficiently during pre-training, which leads to insufficient cross-language contextual learning ability and, in turn, language bias, we propose a cross-language text similarity model based on an alternating language data reconstruction method. The method symmetrically swaps Chinese and English words in a parallel corpus to form reconstructed pre-training text pairs, and uses these text pairs to perform targeted, data-reconstruction-based pre-training and fine-tuning of the multilingual large model mBERT (BERT-based-multilingual). To verify the feasibility of the model, experiments were conducted on the United Nations parallel corpus dataset. The results show that the similarity precision of the model outperforms that of mBERT and two other baseline models; the model can further improve the accuracy of cross-language information retrieval and reduce the research cost of multilingual natural language processing tasks.
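
The data reconstruction step can be pictured with a small sketch. Assuming a word-aligned Chinese-English sentence pair (the alignment source, replacement ratio, and alternation rule below are illustrative assumptions, not details fixed by the abstract), aligned words are swapped symmetrically to yield two code-mixed pre-training texts:

# A minimal sketch of alternating language data reconstruction: given a
# word-aligned Chinese-English parallel sentence pair, aligned words are swapped
# symmetrically to form two code-mixed ("reconstructed") pre-training texts.
def reconstruct_pair(zh_tokens, en_tokens, alignment, step=2):
    """Swap every `step`-th aligned word pair between the two sentences."""
    mixed_zh, mixed_en = list(zh_tokens), list(en_tokens)
    for k, (i, j) in enumerate(alignment):
        if k % step == 0:  # alternate which aligned positions are exchanged
            mixed_zh[i], mixed_en[j] = en_tokens[j], zh_tokens[i]
    return " ".join(mixed_zh), " ".join(mixed_en)

zh = ["联合国", "支持", "多边", "合作"]
en = ["The", "UN", "supports", "multilateral", "cooperation"]
align = [(0, 1), (1, 2), (2, 3), (3, 4)]  # toy word alignment
print(reconstruct_pair(zh, en, align))
# -> ('UN 支持 multilateral 合作', 'The 联合国 supports 多边 cooperation')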

Key words: mBERT model, text similarity, multilingual pre-trained model, large model fine-tuning

Abstract: Aiming at the problem that existing multilingual models were inefficient in utilising multilingual datasets during pre-training, which led to insufficient cross-language contextual learning ability and thus language bias, we proposed a cross-language text similarity model based on an alternating language data reconstruction method. This method formed reconstructed pre-training text pairs by symmetrically replacing Chinese and English words in a parallel corpus, and used these text pairs to perform targeted, data-reconstruction-based pre-training and fine-tuning of the multilingual large model mBERT (BERT-based-multilingual). To verify the feasibility of the model, experiments were conducted on the United Nations parallel corpus dataset. The experimental results show that the similarity precision of the proposed model outperforms that of mBERT and the other two baseline models; the model can not only further improve the accuracy of cross-language information retrieval, but also reduce the research cost of multilingual natural language processing tasks.
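
For intuition, the similarity scoring side can be sketched with the public bert-base-multilingual-cased checkpoint from Hugging Face Transformers. The targeted pre-training and fine-tuning on reconstructed pairs described above are not reproduced here, and mean pooling with cosine similarity is an assumed scoring head rather than the paper's exact setup:

import torch
from transformers import AutoModel, AutoTokenizer

# Base multilingual BERT checkpoint; the paper further pre-trains and fine-tunes it.
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def embed(text):
    # Mean-pool the last hidden states over non-padding tokens.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)    # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

zh = "联合国支持多边合作。"
en = "The United Nations supports multilateral cooperation."
score = torch.cosine_similarity(embed(zh), embed(en)).item()
print(f"cross-language similarity: {score:.4f}")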

Key words: mBERT model, text similarity, multilingual pre-trained model, large model fine-tuning

CLC number: 

  • TP391.1