Software Defect Prediction Using Ensemble Learning
Naidu Sudheer1*, Maram Sai Nivedh Kumar1, and Smt. G. Mamatha2
1,2 Department of Computer Science and Engineering, Chaitanya Bharathi Institute of Technology,
Hyderabad, Telangana.
sudheerchowdary676@gmail.com, sainivedh.maram@gmail.com,
gmamatha_cse@cbit.ac.in
Abstract. Finding software defects is a crucial step in the software development life cycle. A
software defect is a flaw or shortcoming in a work product that prevents it from meeting
requirements or specifications and necessitates repair or replacement. Early defect detection
helps an organization avoid losses of time and money. Numerous algorithms have been proposed
to predict software defects, but existing models still have drawbacks. This work presents a
prediction system that combines ensemble learning with a hybrid feature selection technique to
predict defects in software modules and improve the efficiency of defect detection.
Keywords: Software Defect, Ensemble Learning, Software Development Life Cycle (SDLC)
1 Introduction
The process of detecting flaws in developed software modules is known as software defect
prediction. It plays a vital role in the software development process because it reduces both the cost of
repairing defects and the time needed to identify them. Identifying defects in software modules is a difficult
task in software development. By applying machine learning techniques to software defect prediction, we can
more easily determine whether a module is defective, which saves time and improves the development process.
However, the process faces potential problems such as imbalanced data. To handle this problem, the data can
be oversampled or undersampled; random resampling, the Synthetic Minority Oversampling Technique
(SMOTE), or ensemble techniques can be used to overcome the class imbalance problem.
Accordingly, this paper proposes combined sampling, hybrid feature selection, and an
ensemble model to predict software defects. Combined sampling balances the data samples using
resampling techniques. Hybrid feature selection chooses the appropriate features from the dataset and improves
the accuracy of the model. Ensemble learning trains multiple models and combines their outputs; we use it to
improve the accuracy of the model.
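As a minimal illustration of the idea (not the paper's full model), training multiple models and combining their outputs can be sketched with scikit-learn's VotingClassifier; the synthetic dataset and model choices here are purely illustrative assumptions.

```python
# Illustrative sketch: two base models whose predictions are combined
# by soft voting (averaging predicted probabilities).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a defect dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average the base models' class probabilities
)
ensemble.fit(X_train, y_train)
print(f"ensemble accuracy: {ensemble.score(X_test, y_test):.2f}")
```

The combined model typically matches or exceeds the weaker of its base models, which is the motivation for using ensembles here.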
2 Related Work
Software defect prediction technology designs software metrics related to defects by examining
software code, the software development process, and other factors. The relationship between
software metrics and software defects is then established using historical defect data.
Artificial neural networks, Bayesian networks, SVMs, dictionary learning, association rules, naïve
Bayes, tree-based techniques, evolutionary algorithms, and other machine learning-based methods
have all been used to forecast software defects. However, these techniques overlook the defect
dataset's high dimensionality and uneven class distribution, which have a significant impact on
classification performance [6].
3 Software Requirements
Python is used in almost every field for a wide variety of applications and projects. It supports several
programming paradigms, including procedural, functional, and object-oriented styles. Python gives
programmers flexibility and capabilities that improve their efficiency and the quality of their code, and its
vast ecosystem of libraries handles much of the heavy lifting. The machine learning libraries used in this
work are NumPy, Pandas, Scikit-Learn, and mlxtend, with ensemble learning used to build the model.
4 Proposed Method
Considering the limitations of previous models, we propose an ensemble model to address these challenges
and improve the accuracy of software defect prediction.
Steps involved in the proposed system:
1. The dataset is loaded into the system and sent to the SMOTE process.
2. First, we perform SMOTE oversampling on the dataset to balance it.
3. After the SMOTE process, we select the best-correlated features from the existing feature set
using hybrid feature selection.
4. After selecting features, we split the dataset in an 80:20 ratio.
5. We train the ensemble model using the training data.
6. First, we pass the training data to the base models: a random forest and a logistic regression model.
7. We then pass the outputs of these base models to a linear regression meta-model and train it.
8. After training, we use the test data to evaluate the ensemble model.
Fig. 1 Data flow diagram
4.1 Data Resampling
In this proposed solution we oversample the data because there are very few defective samples;
undersampling would discard data and could reduce the performance of the model. We therefore oversample,
increasing the number of samples in the dataset, and then divide the dataset in an 80:20 ratio for training
and testing the model.
4.2 Hybrid Feature Selection
After resampling the data, we select the important features, those correlated with the output variable
(defect), using hybrid feature selection. We use the chi-square test together with sequential backward
selection (SBS). First, we perform a chi-square test on the features of the dataset and select the
top-ranked ones; the selected subset is then passed to SBS, which retains only the features that are
required.
In statistics, the chi-square test is used to determine whether two occurrences are independent. Using
the chi-square test, we find the correlated features and send that subset to sequential backward selection.
In SBS, we remove from the subset the features that are not related to the output variable.
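A hedged sketch of this two-stage hybrid selection is below. Here scikit-learn's SequentialFeatureSelector stands in for SBS (the paper's mlxtend implementation works analogously), and the feature counts (10, then 5) are illustrative choices, not the paper's settings.

```python
# Stage 1: chi-square filter; Stage 2: sequential backward selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                       chi2)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=500, n_features=21, random_state=42)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative features

# Stage 1: keep the 10 features most associated with the defect label.
chi_filter = SelectKBest(chi2, k=10)
X_chi = chi_filter.fit_transform(X, y)

# Stage 2: SBS greedily removes features from that subset down to 5,
# dropping whichever feature hurts cross-validated accuracy the least.
sbs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,
                                direction="backward")
X_selected = sbs.fit_transform(X_chi, y)
print(X_selected.shape)
```

The cheap chi-square filter shrinks the search space so that the more expensive wrapper (SBS) only has to evaluate a small candidate set.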
4.3 Base Models
Random forest is a machine learning algorithm that builds multiple randomized decision trees and
aggregates their votes to perform the classification task. Logistic regression is a machine learning
algorithm used for classification problems; it forecasts the outcome of a categorical dependent
variable, producing a probability value between 0 and 1.
4.4 Meta-Model
We use a linear regression model as the meta-model to combine the outputs of the base models. We
combine the outputs of the two base models and train the meta-model on them, with the defect label as
the target variable during training. After that, we use these models together to predict defects.
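A minimal manual stacking sketch following this description is shown below: random forest and logistic regression as base models, and a linear regression meta-model trained on their outputs. The synthetic data and the 0.5 decision threshold on the meta-model's continuous output are assumptions for illustration.

```python
# Manual stacking: base-model probabilities become meta-model features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train the two base models on the training data.
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def base_outputs(X):
    # Each base model's predicted defect probability is one meta-feature.
    return np.column_stack([rf.predict_proba(X)[:, 1],
                            lr.predict_proba(X)[:, 1]])

# The meta-model's target is the same defect label the base models use.
meta = LinearRegression().fit(base_outputs(X_train), y_train)

# Final prediction: threshold the meta-model's continuous output at 0.5.
y_pred = (meta.predict(base_outputs(X_test)) >= 0.5).astype(int)
print(f"accuracy: {(y_pred == y_test).mean():.2f}")
```

In practice the meta-model is often trained on out-of-fold base predictions rather than the same training data, to avoid leaking the base models' training fit into the meta-model.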
5 Results
To further improve performance, we use the ensemble model together with hybrid feature selection. To
train the proposed algorithm we used the CM1 dataset, which contains 449 non-defective samples and 49
defective samples. We increased the number of samples using the SMOTE technique, after which the
dataset contains 898 samples: 449 defective and 449 non-defective. We use a confusion matrix to measure
the performance of the model, computing accuracy, precision, recall, and F-measure. We observe an
accuracy of 93%.
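The four metrics can be derived from the confusion matrix cells as sketched below; the label arrays are illustrative, not the CM1 results.

```python
# Deriving accuracy, precision, recall, and F-measure from a
# confusion matrix with scikit-learn.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])  # illustrative labels
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1, 0])  # illustrative predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f_measure)  # 0.8 0.8 0.8 0.8
```

Because SMOTE balances the classes before training, accuracy is meaningful here, but precision and recall remain the more informative metrics for defect prediction.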
References
2. Z. Sun, Q. Song and X. Zhu, "Using Coding-Based Ensemble Learning to Improve Software Defect
Prediction," in IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and
Reviews), vol. 42, no. 6, pp. 1806-1817, Nov. 2012, doi: 10.1109/TSMCC.2012.2226152.
3. L. Gong, S. Jiang and L. Jiang, "Tackling Class Imbalance Problem in Software Defect Prediction
Through Cluster-Based Over-Sampling With Filtering," in IEEE Access, vol. 7, pp. 145725-
145737, 2019, doi: 10.1109/ACCESS.2019.2945858.
4. J. Zheng, X. Wang, D. Wei, B. Chen and Y. Shao, "A Novel Imbalanced Ensemble Learning in
Software Defect Predication," in IEEE Access, vol. 9, pp. 86855-86868, 2021, doi:
10.1109/ACCESS.2021.3072682.
5. S. Huda et al., "A Framework for Software Defect Prediction and Metric Selection," in IEEE
Access, vol. 6, pp. 2844-2858, 2018, doi: 10.1109/ACCESS.2017.2785445.
7. S. Huda et al., "An Ensemble Oversampling Model for Class Imbalance Problem in Software
Defect Prediction," in IEEE Access, vol. 6, pp. 24184-24195, 2018, doi:
10.1109/ACCESS.2018.2817572.