Journal of Jilin University (Information Science Edition) ›› 2024, Vol. 42 ›› Issue (4): 747-753.


Research on Source Code Plagiarism Detection Based on Pre-Trained Transformer Language Model

QIAN Lianghong1, WANG Fude2,3, SUN Xiaohai2

  1. Data Science Department, Yepdata Software Technology Company Limited, Shanghai 200233, China; 2. Technical Department, Jilin Haicheng Technology Company Limited, Changchun 130119, China; 3. Smart Agriculture Research Institute of Jilin Agricultural University, Changchun 130118, China
  • Received: 2024-01-23 Online: 2024-07-22 Published: 2024-07-22
  • About the authors: QIAN Lianghong (born 1989), male, from Nanjing, data scientist at Yepdata Software Technology Company Limited, mainly engaged in artificial intelligence research, (Tel) 86-15900530128, (E-mail) 31311853@qq.com; corresponding author: WANG Fude (born 1990), male, from Da'an, Jilin, engineer at Jilin Haicheng Technology Company Limited, mainly engaged in research on computer education technology, (Tel) 86-18243044666, (E-mail) 562324919@qq.com.
  • Supported by:

    Industrialization Cultivation Fund Project of the Education Department of Jilin Province (JJKH20240274CY)

Abstract: Existing source code plagiarism detection methods require large amounts of training data and are restricted to specific programming languages. To address these limitations, we propose a source code plagiarism detection method based on pre-trained Transformer language models that combines word embeddings, similarity computation, and classification models. The method supports multiple programming languages and achieves good detection performance without any training samples labeled as plagiarism. Experimental results show that it achieves state-of-the-art detection performance on multiple public datasets, with F1 values close to [value missing in the source]. For scenarios where a small number of labeled plagiarism samples can be obtained, we also propose a variant that adds a supervised classification model to further improve detection performance. The method is broadly applicable to source code plagiarism detection scenarios where training data is scarce, computational resources are limited, and programming languages are diverse.

Key words: source code plagiarism detection, Transformer model, pre-trained model, machine learning, deep learning
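
The label-free pipeline summarized in the abstract (embed each program, compute a similarity score, then decide) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `embed` is a hypothetical stand-in that hashes tokens into a count vector so the example runs offline, where a real system would return the pooled output of a pre-trained Transformer encoder. The names `embed`, `cosine_similarity`, `is_plagiarism`, the dimension `DIM`, and the threshold `0.9` are all assumptions made for this sketch.

```python
import hashlib
import math
import re

DIM = 256  # size of the stand-in embedding (arbitrary, assumed)


def embed(source):
    """Hypothetical stand-in for a pre-trained Transformer encoder.

    Hashes every token into a fixed-size count vector. A real system
    would return the language model's pooled embedding instead.
    """
    vec = [0.0] * DIM
    # Identifiers, integers, and single punctuation characters are tokens;
    # whitespace is ignored.
    for token in re.findall(r"[A-Za-z_]\w*|\d+|\S", source):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    return vec


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_plagiarism(code_a, code_b, threshold=0.9):
    """Flag a pair when embedding similarity reaches the assumed threshold."""
    return cosine_similarity(embed(code_a), embed(code_b)) >= threshold


if __name__ == "__main__":
    original = "total = 0\nfor i in range(10):\n    total += i\n"
    reformatted = "total=0\nfor i in range( 10 ):\n    total+=i\n"
    print(cosine_similarity(embed(original), embed(reformatted)))
    print(is_plagiarism(original, reformatted))
```

No labeled plagiarism pairs are needed for this decision rule. When a few labeled pairs are available, the similarity score could instead be passed as a feature to a supervised classifier, in the spirit of the variant the abstract mentions.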

CLC number:

  • TP181