Journal of Jilin University Science Edition ›› 2025, Vol. 63 ›› Issue (2): 551-558.


Cross-Language Text Similarity Model Based on Alternating Language Data Reconstruction Method

WANG Yi1, WANG Kunning2, LIU Ming2   

  1. School of Foreign Languages, Changchun University of Technology, Changchun 130012, China; 2. School of Mathematics and Statistics, Changchun University of Technology, Changchun 130012, China
  • Received: 2024-02-28 Online: 2025-03-26 Published: 2025-03-26

Abstract: To address the problem that existing multilingual models use multilingual datasets inefficiently during pre-training, which leads to insufficient cross-language contextual learning ability and hence language bias, we proposed a cross-language text similarity model based on an alternating language data reconstruction method. The method forms reconstructed pre-training text pairs by symmetrically replacing Chinese and English words in a parallel corpus, and uses these text pairs to perform targeted, reconstruction-based pre-training and fine-tuning of the multilingual large model mBERT (BERT-base-multilingual). To verify the feasibility of the model, experiments were conducted on the United Nations Parallel Corpus. The results show that the similarity-checking accuracy of the proposed model outperforms that of mBERT and two other baseline models. The method can not only further improve the accuracy of cross-language information retrieval, but also reduce the research cost of multilingual natural language processing tasks.
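The core reconstruction step described above (symmetrically replacing aligned Chinese and English words to build code-switched pre-training pairs) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the alignment format (index pairs), the `swap_ratio` parameter, and the function name `reconstruct_pair` are all assumptions introduced here for clarity.

```python
# Hypothetical sketch of the alternating-language data reconstruction step:
# given a word-aligned Chinese-English sentence pair, symmetrically swap a
# fraction of the aligned words so each sentence contains words from the
# other language, yielding mixed-language text pairs for pre-training.
import random

def reconstruct_pair(zh_tokens, en_tokens, alignment, swap_ratio=0.3, seed=0):
    """Symmetrically swap aligned word pairs between the two sentences.

    alignment: list of (zh_index, en_index) pairs linking translation words.
    Returns two mixed-language token lists (zh-with-en, en-with-zh).
    """
    rng = random.Random(seed)  # fixed seed for reproducible reconstruction
    zh_mixed, en_mixed = list(zh_tokens), list(en_tokens)
    for zi, ei in alignment:
        if rng.random() < swap_ratio:
            # Symmetric replacement: each side receives the other's word.
            zh_mixed[zi], en_mixed[ei] = en_tokens[ei], zh_tokens[zi]
    return zh_mixed, en_mixed
```

For example, with `swap_ratio=1.0` every aligned pair is exchanged, so `reconstruct_pair(["我", "喜欢", "猫"], ["I", "like", "cats"], [(0, 0), (1, 1), (2, 2)], swap_ratio=1.0)` returns the two sentences with all words swapped; smaller ratios produce partially code-switched pairs.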

Key words: mBERT model, text similarity, multilingual pre-trained model, large model fine-tuning

CLC Number: TP391.1