New approach of cross-project defect prediction based on multi-source data

doi:10.13229/j.cnki.jdxbgxb201606037

Journal of Jilin University (Engineering and Technology Edition), 2016, Vol. 46, Issue (6): 2034-2041.

LI Yong^{1, 2}, HUANG Zhi-qiu¹, WANG Yong¹, FANG Bing-wu¹

1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China;
2. Key Laboratory of Network Information Security and Public Opinion Analysis, Xinjiang Normal University, Urumqi 830054, China

Received: 2015-11-30 Online: 2016-11-20 Published: 2016-11-20
Corresponding author: HUANG Zhi-qiu (1965-), male, professor, doctoral supervisor. Research interest: software engineering. E-mail: zqhuang@nuaa.edu.cn
About the author: LI Yong (1983-), male, Ph.D. candidate. Research interest: empirical software engineering. E-mail: liyong@live.com
Funding:
National Natural Science Foundation of China (61562087, 61272083); Jiangsu Province Graduate Research and Innovation Program for Regular Higher Education Institutions (CXLX13_160); Fundamental Research Funds for the Central Universities

Abstract

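Summary: Cross-project (CP) software defect prediction can address two problems of traditional within-project (WP) prediction: the need for accumulated historical data from the target project, and the high cost of labeling defects. To address the low prediction performance and poor practicality of existing CP methods, a cross-project software defect prediction method based on multi-source data is proposed. First, multiple source projects whose characteristics are similar to those of the target project are obtained as candidates; then, the software modules of the candidate projects guide the selection of training data; finally, the prediction model is built with the Naive Bayes algorithm. Experiments on real software defect data show that the method outperforms the traditional WP method and can replace the WP method in software engineering practice.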

Abstract: Software defect prediction is important for optimizing quality assurance activities. Within-project defect prediction (WPDP) can produce high-quality results but requires historical data from the project, which is often unavailable in practice. Cross-project defect prediction (CPDP) can overcome this drawback of WPDP. However, existing research suggests that CPDP is particularly challenging and often yields poor performance, and very few studies have investigated practical guidelines for selecting suitable training data for CPDP from multi-source project data. A novel multi-source, data-driven approach to CPDP is proposed. First, a hierarchical filter strategy based on the characteristics of both projects and modules is developed to select training data. Then, the Naive Bayes algorithm is used to build the prediction model. Experimental results on 14 open-source projects show that the proposed approach significantly improves CPDP performance and can compete with WPDP.

Keywords: computer software, cross-project defect prediction, multi-source project data, hierarchical data selection, Naive Bayes algorithm
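
The workflow outlined in the abstract (project-level filtering, then module-level filtering, then a Naive Bayes model) can be sketched roughly as follows. The abstract does not state the similarity measures or thresholds, so the mean/standard-deviation project profile, the Euclidean nearest-neighbor module filter, the parameters `k` and `n`, the synthetic data, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def project_profile(X):
    """Summarize a project by the mean and std of each metric (assumed profile)."""
    return np.concatenate([X.mean(axis=0), X.std(axis=0)])

def select_projects(target_X, sources, k):
    """Level 1: keep the k source projects whose profiles are closest to the target's."""
    t = project_profile(target_X)
    dist = {name: np.linalg.norm(project_profile(X) - t)
            for name, (X, _) in sources.items()}
    return sorted(dist, key=dist.get)[:k]

def select_modules(target_X, cand_X, cand_y, n):
    """Level 2: for each target module, keep its n nearest candidate modules."""
    d = np.linalg.norm(target_X[:, None, :] - cand_X[None, :, :], axis=2)
    idx = np.unique(np.argsort(d, axis=1)[:, :n])  # union over all target modules
    return cand_X[idx], cand_y[idx]

class GaussianNB:
    """Minimal Gaussian Naive Bayes classifier (one Gaussian per class and metric)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        # Small constant avoids division by zero for constant metrics.
        self.var_ = np.array([X[y == c].var(axis=0) for c in self.classes_]) + 1e-9
        self.log_prior_ = np.log(np.array([(y == c).mean() for c in self.classes_]))
        return self

    def predict(self, X):
        diff = X[:, None, :] - self.mu_[None, :, :]
        log_like = -0.5 * (np.log(2 * np.pi * self.var_)[None]
                           + diff ** 2 / self.var_[None]).sum(axis=2)
        return self.classes_[np.argmax(log_like + self.log_prior_, axis=1)]

def make_project(rng, shift, n=60):
    """Synthetic project with 2 metrics; defective modules (label 1) score higher."""
    y = (rng.random(n) < 0.3).astype(int)
    X = rng.normal(loc=shift + 2.0 * y[:, None], scale=1.0, size=(n, 2))
    return X, y

rng = np.random.default_rng(0)
sources = {
    "near_a": make_project(rng, 0.0),
    "near_b": make_project(rng, 0.3),
    "far_c": make_project(rng, 8.0),   # deliberately unlike the target
}
target_X, target_y = make_project(rng, 0.1)  # labels used only for evaluation

chosen = select_projects(target_X, sources, k=2)
cand_X = np.vstack([sources[p][0] for p in chosen])
cand_y = np.concatenate([sources[p][1] for p in chosen])
train_X, train_y = select_modules(target_X, cand_X, cand_y, n=5)

pred = GaussianNB().fit(train_X, train_y).predict(target_X)
print("candidate projects:", chosen)
print("training modules kept:", len(train_y), "of", len(cand_y))
print("accuracy on target:", (pred == target_y).mean())
```

In this sketch the target project's labels are never used for training, which mirrors the cross-project setting: only source-project modules that survive both filters reach the classifier.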

CLC number: TP311.5

LI Yong, HUANG Zhi-qiu, WANG Yong, FANG Bing-wu. New approach of cross-project defect prediction based on multi-source data[J]. Journal of Jilin University (Engineering and Technology Edition), 2016, 46(6): 2034-2041.

