
Software Defect Prediction Using Ensemble Learning
Naidu Sudheer1*, Maram Sai Nivedh Kumar1, and Smt. G. Mamatha2
1,2 Department of Computer Science and Engineering, Chaitanya Bharathi Institute of Technology,
Hyderabad, Telangana.

sudheerchowdary676@gmail.com, sainivedh.maram@gmail.com,
gmamatha_cse@cbit.ac.in

Abstract. Finding software bugs is a crucial step in the software development life cycle. A
software defect is a flaw or deficiency in a work product that prevents it from meeting its
requirements or specifications and necessitates repair or replacement. Early defect detection helps
an organization avoid losses of time and money. Numerous algorithms have been proposed for
predicting software defects, but existing models still have drawbacks. This work presents a
prediction system that combines ensemble learning with a hybrid feature selection technique to
predict defects in software modules and improve the efficiency of defect detection.

Keywords: Software Defect, Ensemble Learning, Software Development Life Cycle (SDLC)

1 Introduction
The process of detecting flaws in developed software modules is known as software defect
prediction. It plays a vital role in the software development process because it reduces both the cost of repairing
defects and the time needed to identify them. Identifying defects in software modules is a difficult task, so
applying machine learning techniques to software defect prediction makes it easier to determine whether a module
is defective. This saves time and improves the development process. However, the process faces some potential
problems, such as imbalanced data. Possible solutions include oversampling or undersampling the data; random
resampling, the synthetic minority oversampling technique (SMOTE), or ensemble techniques can be used to
overcome the class imbalance problem.

Accordingly, this paper proposes combined sampling, hybrid feature selection, and an
ensemble model to predict software defects. Combined sampling means balancing the data samples using
resampling techniques. Hybrid feature selection selects the appropriate features from the dataset and improves
the accuracy of the model. Ensemble learning is the process of training multiple models and combining their
outputs; we use it to improve the accuracy of the model.

2 Related Work

Numerous classification techniques, such as tree-based approaches, analogy-based approaches,
neural networks, Bayes methods, etc., have been used to predict software defects directly;
nevertheless, these techniques plainly ignore the issue of class imbalance. Because most
components are not defective, classifiers typically fail to identify the minority of defective
components when forecasting the occurrence of software defects. [2]
Since the 1970s, software defect prediction has been one of the most important research areas in
software engineering. Owing to the explosive growth of machine learning, a variety of machine
learning techniques have recently been used to enhance the performance of software defect
prediction. Defective modules are also much less common in software projects than non-defective
modules, which creates a class imbalance problem that has a significant negative impact on
classifier performance. [3]

Software defect prediction technology designs software metrics related to defects by examining
software code, the software development process, and other factors; the relationship between these
metrics and software defects is then established using historical defect data. Artificial neural
networks, Bayesian networks, SVMs, dictionary learning, association rules, naïve Bayes, tree-based
techniques, evolutionary algorithms, and other machine learning methods have all been used to
predict software defects. However, these techniques overlook the high dimensionality and uneven
class distribution of the defect data set, both of which have a significant impact on classification
performance. [6]

3 Software Requirements
Python is used in almost every field for a wide variety of applications and projects. It supports several
programming paradigms, including procedural, functional, and object-oriented styles. Python gives programmers
flexibility and capabilities that improve their efficiency and the quality of their code, and it has a vast ecosystem
of libraries that handle much of the heavy lifting. The libraries used for the machine learning methods in this
work include NumPy, Pandas, Scikit-Learn, and mlxtend; ensemble learning is used to build the model.

4 Proposed Method
Considering the limitations of previous models, we propose an ensemble model to address these challenges and
to improve the accuracy of software defect prediction.
Steps involved in the proposed system (a minimal sketch of loading the dataset follows Fig. 1):
1. The dataset is loaded into the system and sent to the SMOTE process.
2. SMOTE oversampling is applied to balance the dataset.
3. After the SMOTE process, the best-correlated features are selected from the existing feature set
using hybrid feature selection.
4. After feature selection, the dataset is split into training and test sets in an 80:20 ratio.
5. The ensemble model is trained on the training data.
6. The training data is first passed to the base models, a random forest and a logistic regression model.
7. The outputs of these base models are then passed to a linear regression model, which is trained on them.
8. After training, the test data is used to evaluate the ensemble model.
Fig. 1 Data flow diagram
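
As a concrete starting point, the following is a minimal sketch, not the authors' exact code, of step 1:
loading the CM1 dataset into the system. The file name cm1.csv and the label column name "defects" are
assumptions about how the data is stored, not details given in the paper.

# Minimal sketch of step 1: loading the CM1 dataset.
# "cm1.csv" and the "defects" column name are assumptions.
import pandas as pd

data = pd.read_csv("cm1.csv")            # one row per software module
X = data.drop(columns=["defects"])       # static code metrics (features)
y = data["defects"].astype(int)          # 1 = defective, 0 = non-defective

print(y.value_counts())                  # CM1: 449 non-defective, 49 defective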

4.1 Synthetic minority oversampling technique (SMOTE)


SMOTE is used to handle data that is not balanced. Data is said to be imbalanced when the observed frequencies
of a categorical variable vary significantly among its possible values; in general, there are many observations of
one kind and few of another. Both undersampling and oversampling are common remedies for this problem.
Undersampling is the most straightforward way to counteract class imbalance: a number of data points of the
over-represented class are simply discarded. Its drawback is that a lot of valuable data that would be helpful for
training the model is lost.

In the proposed solution we oversample the data because there are very few defective samples, and
undersampling would further reduce the amount of data and degrade the performance of the model. We therefore
oversample to increase the number of samples in the dataset and then divide the dataset 80:20 for training and
testing the model.
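
The following is an illustrative sketch of this oversampling step, continuing from the loading sketch in
Section 4. It assumes the imbalanced-learn package, which is not named in the paper, is used as the SMOTE
implementation.

# Illustrative sketch of the SMOTE oversampling step (assumes imbalanced-learn).
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

print("Before SMOTE:", Counter(y))            # e.g. {0: 449, 1: 49} for CM1
print("After SMOTE: ", Counter(y_resampled))  # balanced, e.g. {0: 449, 1: 449}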
4.2 Hybrid Feature Selection

After resampling the data, we select the important features that are correlated with the output variable (defect)
using hybrid feature selection. We use the chi-square test together with sequential backward selection (SBS) to
choose the most correlated features from the dataset. First, a chi-square test is performed on the features of the
dataset and the top features are selected; these selected features are then passed to sequential backward selection,
which retains only the features that are required.

In statistics, the chi-square test is used to determine whether two occurrences are independent. By using the
chi-square test we find the correlated features and send that subset of features to sequential backward selection.
In SBS we remove from the subset the features that are not related to the output variable.
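
A sketch of this two-stage selection, continuing from the SMOTE sketch above, could use Scikit-Learn's
chi-square filter followed by mlxtend's sequential feature selector run backwards. The numbers of features
kept at each stage (15 and then 10) are illustrative assumptions, not values reported in the paper.

# Sketch of hybrid feature selection: chi-square filter, then SBS via mlxtend.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

# Stage 1: rank features with the chi-square test and keep the top ones
# (chi2 requires non-negative feature values, which holds for code metrics).
chi2_selector = SelectKBest(chi2, k=15)          # k=15 is an assumed value
X_chi2 = chi2_selector.fit_transform(X_resampled, y_resampled)

# Stage 2: sequential backward selection drops features from that subset that
# do not help the classifier (forward=False gives backward elimination).
sbs = SFS(LogisticRegression(max_iter=1000),
          k_features=10, forward=False, floating=False,   # k_features=10 assumed
          scoring="accuracy", cv=5)
X_selected = sbs.fit_transform(X_chi2, y_resampled)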

4.3 Ensemble model


Ensemble learning is the process of combining more than one model, using another machine learning model to
merge their predictions and increase overall performance. We use ensemble learning to develop the model for
software defect prediction. The base learners are random forest and logistic regression, and a linear regression
model is used to combine the base learners' outputs.

Random forest is a machine learning algorithm that builds multiple random decision trees and uses them to
perform the classification task. Logistic regression is a machine learning algorithm for solving classification
problems; it forecasts the outcome of a dependent categorical variable, so the output is a probabilistic value
between 0 and 1.

We use a linear regression model to combine the outputs of the base models: the outputs of the two base models
are combined and used to train the meta-model, with the defect label as the target variable. After training, these
models are used together to predict defects.
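
A minimal sketch of this stacking scheme, continuing from the feature selection sketch, is given below. The
hyperparameters, the placement of the 80:20 split, and the 0.5 decision threshold on the meta-model's output
are our assumptions rather than details stated in the paper.

# Minimal stacking sketch: RF + logistic regression bases, linear regression meta-model.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression

# Step 4 of the proposed system: 80:20 split of the balanced, reduced dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X_selected, y_resampled, test_size=0.2, random_state=42)

# Train the base learners on the training split.
rf = RandomForestClassifier(n_estimators=100, random_state=42)
lr = LogisticRegression(max_iter=1000)
rf.fit(X_train, y_train)
lr.fit(X_train, y_train)

# The base models' defect probabilities become the inputs of the meta-model.
meta_train = np.column_stack([rf.predict_proba(X_train)[:, 1],
                              lr.predict_proba(X_train)[:, 1]])
meta_model = LinearRegression().fit(meta_train, y_train)

# Predict on the test split and threshold the regression output at 0.5 (assumed).
meta_test = np.column_stack([rf.predict_proba(X_test)[:, 1],
                             lr.predict_proba(X_test)[:, 1]])
y_pred = (meta_model.predict(meta_test) >= 0.5).astype(int)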

5 Results and Evaluation

Fig. 2. Class distribution of the dataset, shown as bar graphs, before and after SMOTE


Fig. 3. Confusion matrices of the different algorithms

To further improve the performance of the model we use the ensemble model and hybrid feature selection, which
improved the performance of the model. To train the proposed algorithm we used the CM1 dataset, which contains
449 non-defective samples and 49 defective samples. We increased the number of data samples using the SMOTE
sampling technique; after that, we have 898 samples, of which 449 are defective and 449 are non-defective. We
use the confusion matrix to measure the performance of the model and compute accuracy, precision, recall, and
F-measure from it. We observe an accuracy of 93% for the model.
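
These metrics can be computed from the confusion matrix with Scikit-Learn, as in the sketch below applied to
the predictions from the Section 4.3 sketch; the 93% accuracy is the paper's reported figure, not an output
guaranteed by this illustrative code.

# Confusion matrix and metrics for the predictions from the stacking sketch.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

print(confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F-measure:", f1_score(y_test, y_pred))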

6 Conclusion and Future Scope


In this report, we propose an approach to software defect prediction consisting of the SMOTE technique, hybrid
feature selection, and an ensemble model. The prediction performance of the ensemble learning algorithm is
better than that of single-classifier algorithms, and it has the advantage of higher precision and recall.
Oversampling the data helps train the model accurately, and hybrid feature selection helps select the most
important features from the set of available features. In future research, further consideration will be given to
combining different sampling and feature selection methods to improve prediction performance. We also
recommend combining the output of these models with models trained on semantic information extracted from
the code. For improved performance, we also advise using real-time data and a variety of datasets to train the
model.
References
1. Q. Song, Y. Guo and M. Shepperd, "A Comprehensive Investigation of the Role of Imbalanced
Learning for Software Defect Prediction," in IEEE Transactions on Software Engineering, vol. 45,
no. 12, pp. 1253-1269, 1 Dec. 2019, doi: 10.1109/TSE.2018.2836442.

2. Z. Sun, Q. Song and X. Zhu, "Using Coding-Based Ensemble Learning to Improve Software Defect
Prediction," in IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and
Reviews), vol. 42, no. 6, pp. 1806-1817, Nov. 2012, doi: 10.1109/TSMCC.2012.2226152.

3. L. Gong, S. Jiang and L. Jiang, "Tackling Class Imbalance Problem in Software Defect Prediction
Through Cluster-Based Over-Sampling With Filtering," in IEEE Access, vol. 7, pp. 145725-
145737, 2019, doi: 10.1109/ACCESS.2019.2945858.

4. J. Zheng, X. Wang, D. Wei, B. Chen and Y. Shao, "A Novel Imbalanced Ensemble Learning in
Software Defect Predication," in IEEE Access, vol. 9, pp. 86855-86868, 2021, doi:
10.1109/ACCESS.2021.3072682.

5. S. Huda et al., "A Framework for Software Defect Prediction and Metric Selection," in IEEE
Access, vol. 6, pp. 2844-2858, 2018, doi: 10.1109/ACCESS.2017.2785445.

6. H. He et al., "Ensemble MultiBoost Based on RIPPER Classifier for Prediction of Imbalanced
Software Defect Data," in IEEE Access, vol. 7, pp. 110333-110343, 2019, doi:
10.1109/ACCESS.2019.2934128.

7. S. Huda et al., "An Ensemble Oversampling Model for Class Imbalance Problem in Software
Defect Prediction," in IEEE Access, vol. 6, pp. 24184-24195, 2018, doi:
10.1109/ACCESS.2018.2817572.
