Journal of Jilin University Science Edition ›› 2025, Vol. 63 ›› Issue (2): 551-558.


Cross-Language Text Similarity Model Based on Alternating Language Data Reconstruction Method

WANG Yi1, WANG Kunning2, LIU Ming2   

  1. School of Foreign Languages, Changchun University of Technology, Changchun 130012, China; 2. School of Mathematics and Statistics, Changchun University of Technology, Changchun 130012, China
  • Received: 2024-02-28 Online: 2025-03-26 Published: 2025-03-26

Abstract: To address the problem that existing multilingual models use multilingual datasets inefficiently during pre-training, which leads to insufficient cross-language contextual learning ability and hence language bias, we proposed a cross-language text similarity model based on an alternating language data reconstruction method. The method forms reconstructed pre-training text pairs by symmetrically replacing Chinese and English words in a parallel corpus, and uses these text pairs to perform targeted, reconstruction-based pre-training and fine-tuning of the multilingual large model mBERT (BERT-base-multilingual). To verify the feasibility of the model, experiments were conducted on the United Nations Parallel Corpus. The results show that the similarity-checking accuracy of the proposed model outperforms that of mBERT and two other baseline models. The method can not only further improve the accuracy of cross-language information retrieval, but also reduce the research cost of multilingual natural language processing tasks.
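The core reconstruction step described above (symmetrically replacing aligned Chinese and English words to build code-switched pre-training pairs) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the alignment format (index pairs), the `swap_ratio` parameter, and the function name `reconstruct_pair` are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of the alternating-language data reconstruction step:
# given a word-aligned Chinese-English sentence pair, symmetrically swap a
# fraction of the aligned words so each sentence contains words from the
# other language, yielding mixed-language text pairs for pre-training.
import random

def reconstruct_pair(zh_tokens, en_tokens, alignment, swap_ratio=0.3, seed=0):
    """Symmetrically swap aligned word pairs between the two sentences.

    alignment: list of (zh_index, en_index) pairs linking translation words.
    Returns two mixed-language token lists (zh-with-en, en-with-zh).
    """
    rng = random.Random(seed)  # fixed seed for reproducible reconstruction
    zh_mixed, en_mixed = list(zh_tokens), list(en_tokens)
    for zi, ei in alignment:
        if rng.random() < swap_ratio:
            # Symmetric replacement: each side receives the other's word.
            zh_mixed[zi], en_mixed[ei] = en_tokens[ei], zh_tokens[zi]
    return zh_mixed, en_mixed
```

For example, with `swap_ratio=1.0` every aligned pair is exchanged, so `reconstruct_pair(["我", "喜欢", "猫"], ["I", "like", "cats"], [(0, 0), (1, 1), (2, 2)], swap_ratio=1.0)` returns the two sentences with all words swapped; smaller ratios produce partially code-switched pairs.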

Key words: mBERT model, text similarity, multilingual pre-trained model, large model fine-tuning

CLC Number: TP391.1