Journal of Jilin University (Information Science Edition) ›› 2024, Vol. 42 ›› Issue (4): 747-753.


Research on Source Code Plagiarism Detection Based on Pre-Trained Transformer Language Model

QIAN Lianghong1 , WANG Fude2,3, SUN Xiaohai2   

  1. Data Science Department, Yepdata Software Technology Company Limited, Shanghai 200233, China; 2. Technical Department, Jilin Haicheng Technology Company Limited, Changchun 130119, China; 3. Smart Agriculture Research Institute of Jilin Agricultural University, Changchun 130118, China
  • Received: 2024-01-23  Online: 2024-07-22  Published: 2024-07-22

Abstract: To address source code plagiarism detection and the limitations of existing methods, which require large amounts of training data and are restricted to specific programming languages, we propose a detection method based on pre-trained Transformer language models, combined with word embedding, similarity, and classification models. The proposed method supports multiple programming languages and achieves good detection performance without requiring any training samples labeled as plagiarism. Experimental results show that it achieves state-of-the-art detection performance on multiple public datasets. In addition, for scenarios where only a few labeled plagiarism samples are available, we propose a complementary method that incorporates a supervised classification model to further improve detection performance. The approach is well suited to plagiarism detection scenarios where training data is scarce, computational resources are limited, and programming languages are diverse.
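The pipeline the abstract describes (embed each source file, compare pairs by similarity, flag pairs above a threshold as plagiarism) can be illustrated with a minimal sketch. Note the assumptions: the `embed` function below is a stand-in bag-of-tokens vectorizer, whereas the paper's method derives dense vectors from a pre-trained Transformer language model; the function names and the 0.8 threshold are hypothetical choices for illustration only, not values from the paper.

```python
import math
import re
from collections import Counter


def embed(code: str) -> Counter:
    # Stand-in embedding: a sparse bag of identifier/keyword tokens.
    # The paper's method instead obtains dense vectors from a
    # pre-trained Transformer language model; this token-count
    # vector only mimics that interface for the sketch.
    tokens = re.findall(r"[A-Za-z_]\w*", code)
    return Counter(t.lower() for t in tokens)


def cosine_similarity(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_plagiarism(code_a: str, code_b: str, threshold: float = 0.8) -> bool:
    # Flag a pair as plagiarism when embedding similarity exceeds a
    # threshold; no plagiarism-labeled training samples are needed,
    # matching the unsupervised setting described in the abstract.
    return cosine_similarity(embed(code_a), embed(code_b)) >= threshold
```

Because the comparison operates on embeddings rather than language-specific parse trees, the same pipeline applies unchanged across programming languages; swapping the stand-in `embed` for a Transformer encoder is the only language-model-specific step.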

Key words: source code plagiarism detection, Transformer model, pre-trained model, machine learning, deep learning

CLC Number: TP181