Research on Source Code Plagiarism Detection Based on Pre-Trained Transformer Language Model

Journal of Jilin University (Information Science Edition), 2024, Vol. 42, Issue (4): 747-753.

QIAN Lianghong¹, WANG Fude²,³, SUN Xiaohai²

1. Data Science Department, Yepdata Software Technology Company Limited, Shanghai 200233, China; 2. Technical Department, Jilin Haicheng Technology Company Limited, Changchun 130119, China; 3. Smart Agriculture Research Institute of Jilin Agricultural University, Changchun 130118, China

Received: 2024-01-23 Online: 2024-07-22 Published: 2024-07-22

About the authors: QIAN Lianghong (born 1989), male, from Nanjing, data scientist at Yepdata Software Technology Company Limited, working mainly on artificial intelligence, (Tel) 86-15900530128, (E-mail) 31311853@qq.com; corresponding author: WANG Fude (born 1990), male, from Da'an, Jilin, engineer at Jilin Haicheng Technology Company Limited, working mainly on computer-based educational technology, (Tel) 86-18243044666, (E-mail) 562324919@qq.com.

Funding: Industrialization Cultivation Fund Project of the Education Department of Jilin Province (JJKH20240274CY)

Abstract

To address source code plagiarism detection, and the shortcomings of existing methods, which need large amounts of training data and are tied to specific languages, we propose a detection method based on pre-trained Transformer language models that combines word embeddings, similarity computation, and classification models. The method supports multiple programming languages and achieves good detection performance without any training samples labeled as plagiarism. Experimental results show that it achieves state-of-the-art detection performance on multiple public datasets, with an F1 score close to [value missing in the source]. In addition, for scenarios where a small number of labeled plagiarism samples can be obtained, we also propose a variant that adds a supervised classification model to further improve detection performance. The method is broadly applicable to source code plagiarism detection scenarios where training data is scarce, computational resources are limited, and programming languages are diverse.

Keywords: source code plagiarism detection, Transformer model, pre-trained model, machine learning, deep learning
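
The label-free pipeline described in the abstract (embed each program, compute a similarity score, flag high-scoring pairs) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not name the similarity measure, so cosine similarity is used; `embed` is a crude token-count stand-in, not a pre-trained Transformer; and the function names, the sample programs, and the 0.4 threshold are hypothetical.

```python
import math
import re
from collections import Counter


def embed(source: str) -> Counter:
    """Stand-in embedding: a sparse token-count vector.

    ASSUMPTION: a real pipeline would return a dense vector produced by a
    pre-trained Transformer code model here; this is only a placeholder.
    """
    return Counter(re.findall(r"[A-Za-z_]\w*|\d+|\S", source))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse vectors (0.0 if either is empty)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_suspicious(src_a: str, src_b: str, threshold: float = 0.4) -> bool:
    """Flag a pair whose similarity reaches the (hypothetical) threshold."""
    return cosine(embed(src_a), embed(src_b)) >= threshold


# Hypothetical sample submissions: an original, a copy with renamed
# variables, and an unrelated program.
ORIGINAL = """def total(xs):
    s = 0
    for x in xs:
        s += x
    return s"""

RENAMED = """def total(values):
    acc = 0
    for v in values:
        acc += v
    return acc"""

UNRELATED = 'print("hello, world")'

if __name__ == "__main__":
    for name, other in [("renamed copy", RENAMED), ("unrelated", UNRELATED)]:
        score = cosine(embed(ORIGINAL), embed(other))
        print(f"{name}: similarity={score:.2f}, "
              f"flagged={is_suspicious(ORIGINAL, other)}")
```

With this stand-in, the renamed copy scores higher against the original than the unrelated program does, and only the copy is flagged. A token-count vector is sensitive to identifier renaming in a way a learned embedding is meant to reduce, and the threshold would normally be chosen on held-out data rather than fixed by hand.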

CLC number: TP181

QIAN Lianghong, WANG Fude, SUN Xiaohai. Research on Source Code Plagiarism Detection Based on Pre-Trained Transformer Language Model[J]. Journal of Jilin University (Information Science Edition), 2024, 42(4): 747-753.
