Journal of Jilin University (Engineering and Technology Edition), 2015, Vol. 45, Issue (4): 1260-1265. doi: 10.13229/j.cnki.jdxbgxb201504034

Plagiarism detection in student programs based on frequent closed sequence mining

WANG Ke-chao1, 2, WANG Tian-tian2, SU Xiao-hong2, MA Pei-jun2

  1. School of Software, Harbin University, Harbin 150086, China;
  2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
  • Received: 2014-01-03 Online: 2015-07-01 Published: 2015-07-01
  • Corresponding author: WANG Tian-tian (1980-), female, associate professor. Research interests: program analysis, software debugging. E-mail: sweetwtt@126.com
  • About the author: WANG Ke-chao (1980-), male, lecturer, Ph.D. candidate. Research interests: program analysis, software debugging. E-mail: erickcwang@126.com
  • Supported by:
    National Natural Science Foundation of China (61202092, 61173021); Specialized Research Fund for the Doctoral Program of Higher Education (20112302120052); Harbin Special Fund for Scientific and Technological Innovation Talents (RC2013QN010001); Key Planning Project of the Heilongjiang Association of Higher Education for the 12th Five-Year Plan (HGJXHB1110957); Young Academic Backbone Program for General Universities of Heilongjiang Province (1254G037)

Abstract: Plagiarism among student programs is common and undermines the credibility of assessment, while manual detection places a heavy burden on teachers. To address this problem, a plagiarism detection model is proposed. First, student programs are converted into token sequences by lexical analysis, and the token sequences are hashed into numeric sequences. Next, frequent closed sequences are mined with the BIDE algorithm. On this basis, similar code fragments are identified and the similarity between programs is calculated to decide whether the programs are plagiarized. Experimental results show that, compared with the widely used plagiarism detection tool MOSS, the proposed method improves detection accuracy: it not only gives accurate statistics on similar programs but also displays the plagiarized code fragments in a fairly intuitive way.
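
The pipeline outlined in the abstract (tokenize, encode as numbers, mine closed frequent sequences, score similarity) can be illustrated with a small Python sketch. This is not the authors' implementation: the keyword set, the dictionary-based integer encoding (standing in for the hashing step), the brute-force miner over contiguous token runs (a much simpler substitute for BIDE, which mines gapped subsequences with search-space pruning), and the coverage-based similarity score are all assumptions made here for illustration.

```python
import re
from typing import Dict, List, Tuple

Pattern = Tuple[int, ...]

# Hypothetical keyword subset; a real lexer would cover the whole language.
KEYWORDS = {"int", "for", "while", "if", "else", "return"}
LEXEME_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source: str) -> List[str]:
    """Crude lexer: identifiers become ID and numbers become NUM, so renaming is ignored."""
    tokens = []
    for lexeme in LEXEME_RE.findall(source):
        if lexeme in KEYWORDS:
            tokens.append(lexeme)
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")
        elif lexeme[0].isdigit():
            tokens.append("NUM")
        else:
            tokens.append(lexeme)
    return tokens


def encode(tokens: List[str], table: Dict[str, int]) -> List[int]:
    """Map each token kind to an integer (assumed stand-in for the hashing step)."""
    return [table.setdefault(tok, len(table)) for tok in tokens]


def mine_closed(db: List[List[int]], min_support: int) -> Dict[Pattern, int]:
    """Brute-force closed frequent *contiguous* patterns; NOT the BIDE algorithm."""
    support: Dict[Pattern, int] = {}
    for seq in db:
        seen = {tuple(seq[i:j])
                for i in range(len(seq))
                for j in range(i + 1, len(seq) + 1)}
        for pat in seen:  # support = number of sequences containing the pattern
            support[pat] = support.get(pat, 0) + 1
    frequent = {p: s for p, s in support.items() if s >= min_support}
    not_closed = set()
    for pat, sup in frequent.items():
        if len(pat) > 1:
            # A pattern is not closed if a one-item extension has equal support.
            for sub in (pat[:-1], pat[1:]):
                if frequent.get(sub) == sup:
                    not_closed.add(sub)
    return {p: s for p, s in frequent.items() if p not in not_closed}


def coverage(seq: List[int], patterns: Dict[Pattern, int], min_len: int) -> float:
    """Fraction of token positions covered by shared patterns of at least min_len."""
    covered = [False] * len(seq)
    for pat in patterns:
        n = len(pat)
        if n < min_len:
            continue
        for i in range(len(seq) - n + 1):
            if tuple(seq[i:i + n]) == pat:
                covered[i:i + n] = [True] * n
    return sum(covered) / len(seq) if seq else 0.0


def similarity(src_a: str, src_b: str, min_len: int = 4) -> float:
    """Assumed score: mean coverage of the two programs by their shared patterns."""
    table: Dict[str, int] = {}
    a = encode(tokenize(src_a), table)
    b = encode(tokenize(src_b), table)
    closed = mine_closed([a, b], min_support=2)
    return (coverage(a, closed, min_len) + coverage(b, closed, min_len)) / 2


a_src = "int s = 0; for (i = 0; i < n; i++) s += a[i];"
b_src = "int total = 0; for (k = 0; k < m; k++) total += v[k];"
print(similarity(a_src, b_src))  # identical token streams after renaming
```

Because `a_src` and `b_src` differ only in identifier names, their token sequences are identical and the score is 1.0. The covered positions found by `coverage` also mark which fragments are shared, which is the kind of information a tool could use to display matching code.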

Key words: computer software, plagiarism detection, frequent closed sequence mining, similarity, similar code

CLC Number: TP311

[1] Shawky D M, Ali A F. An approach for assessing similarity metrics used in metric-based clone detection techniques[C]∥The 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), Chengdu,2010: 580-584.
[2] Brixtel R, Fontaine M, Lesner B, et al. Language-independent clone detection applied to plagiarism detection[C]∥The 10th IEEE Working Conference on Source Code Analysis and Manipulation (SCAM),Timisoara,2010: 77-86.
[3] Dang Y, Ge S, Huang R, et al. Code clone detection experience at Microsoft[C]∥Proceedings of the 5th International Workshop on Software Clones, ACM, 2011: 63-64.
[4] Zibran M F, Roy C K. IDE-based real-time focused search for near-miss clones[C]∥Proceedings of the 27th Annual ACM Symposium on Applied Computing, ACM, 2012: 1235-1242.
[5] Higo Y, Kamiya T, Kusumoto S, et al. Method and implementation for investigating code clones in a software system[J]. Information and Software Technology, 2007, 49(9): 985-998.
[6] Deng Ai-ping. Study on similarity measurement of program code[J]. Computer Engineering and Design, 2008, 29(17): 4636-4638. (in Chinese)
[7] Gu Ping, Zhang Feng, Zhou Hai-tao. Method of program source code similarity measurement[J]. Computer Engineering, 2012, 38(6): 37-39. (in Chinese)
[8] Zhang Li-ping, Liu Dong-sheng, Li Yan-chen, et al. AST-based code plagiarism detection method[J]. Application Research of Computers, 2011, 28(12): 4616-4620. (in Chinese)
[9] Schleimer S, Wilkerson D S, Aiken A. Winnowing: local algorithms for document fingerprinting[C]∥Proceedings of the ACM SIGMOD International Conference on Management of Data, ACM, 2003: 76-85.
[10] Wang J, Han J. BIDE: efficient mining of frequent closed sequences[C]∥IEEE 20th International Conference on Data Engineering, 2004: 79-90.