Software Defect Prediction Using Ensemble Learning
Naidu Sudheer1*, Maram Sai Nivedh Kumar1, and Smt. G. Mamatha2
1,2 Department of Computer Science and Engineering, Chaitanya Bharathi Institute of Technology,
Hyderabad, Telangana.
sudheerchowdary676@gmail.com, sainivedh.maram@gmail.com,
gmamatha_cse@cbit.ac.in
Abstract. Finding software defects is a crucial step in the software development life cycle. A
software defect is a flaw or shortcoming in a work product that prevents it from meeting
requirements or specifications and necessitates repair or replacement. Early defect detection
helps an organization avoid losses of time and money. Numerous algorithms have been proposed
to predict software defects, but existing models still have drawbacks. This work presents a
prediction system that combines ensemble learning with a hybrid feature selection technique to
predict defects in software modules and improve the efficiency of defect detection.
Keywords: Software Defect, Ensemble Learning, Software Development Life Cycle (SDLC)
1 Introduction
The process of detecting flaws in developed software modules is known as software defect
prediction. It plays a vital role in the software development process because it reduces both the cost of
repairing defects and the time needed to identify them. Identifying defects in software modules is a difficult
task in software development. By applying machine learning techniques to software defect prediction, we can
more easily determine whether a module is defective, which saves time and improves the development process.
However, the process faces potential problems such as imbalanced data. To handle this problem, the data can
be oversampled or undersampled; random resampling, the Synthetic Minority Oversampling Technique
(SMOTE), or ensemble techniques can be used to overcome the class imbalance problem.
Accordingly, this paper proposes combined sampling, hybrid feature selection, and an
ensemble model to predict software defects. Combined sampling balances the data samples using
resampling techniques. Hybrid feature selection chooses the appropriate features from the dataset and improves
the accuracy of the model. Ensemble learning trains multiple models and combines their outputs; we use it to
improve the accuracy of the model.
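As a minimal illustration of the idea (not the paper's full model), training multiple models and combining their outputs can be sketched with scikit-learn's VotingClassifier; the synthetic dataset and model choices here are purely illustrative assumptions.

```python
# Illustrative sketch: two base models whose predictions are combined
# by soft voting (averaging predicted probabilities).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a defect dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

ensemble = VotingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",  # average the base models' class probabilities
)
ensemble.fit(X_train, y_train)
print(f"ensemble accuracy: {ensemble.score(X_test, y_test):.2f}")
```

The combined model typically matches or exceeds the weaker of its base models, which is the motivation for using ensembles here.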
2 Related Work
Software defect prediction technology designs software metrics related to defects by examining
software code, the software development process, and other factors. The relationship between
software metrics and software defects is then established using historical defect data.
Artificial neural networks, Bayesian networks, SVMs, dictionary learning, association rules, naïve
Bayes, tree-based techniques, evolutionary algorithms, and other machine learning-based methods
have all been used to forecast software defects. However, these techniques overlook the defect
dataset's high dimensionality and uneven class distribution, which have a significant impact on
classification performance [6].
3 Software Requirements
Python is used in almost every field for a wide variety of applications and projects. It supports several
programming paradigms, including procedural, functional, and object-oriented styles. Python gives
programmers flexibility and capabilities that improve their efficiency and the quality of their code, and its
vast ecosystem of libraries handles much of the heavy lifting. The machine learning libraries used in this
work are NumPy, Pandas, Scikit-Learn, and mlxtend, with ensemble learning used to build the model.
4 Proposed Method
Considering the limitations of previous models, we propose an ensemble model to address these challenges
and improve the accuracy of software defect prediction.
Steps involved in the proposed system:
1. The dataset is loaded into the system and sent to the SMOTE process.
2. First, we perform SMOTE oversampling on the dataset to balance it.
3. After the SMOTE process, we select the best-correlated features from the existing feature set
using hybrid feature selection.
4. After selecting features, we split the dataset in an 80:20 ratio.
5. We train the ensemble model using the training data.
6. First, we pass the training data to the base models: a random forest and a logistic regression model.
7. We then pass the outputs of these base models to a linear regression meta-model and train it.
8. After training, we use the test data to evaluate the ensemble model.
Fig. 1 Data flow diagram
4.1 Data Resampling
In this proposed solution we oversample the data because there are very few defective samples;
undersampling would discard data and could reduce the performance of the model. We therefore oversample,
increasing the number of samples in the dataset, and then divide the dataset in an 80:20 ratio for training
and testing the model.
4.2 Hybrid Feature Selection
After resampling the data, we select the important features, those correlated with the output variable
(defect), using hybrid feature selection. We use the chi-square test together with sequential backward
selection (SBS). First, we perform a chi-square test on the features of the dataset and select the
top-ranked ones; the selected subset is then passed to SBS, which retains only the features that are
required.
In statistics, the chi-square test is used to determine whether two occurrences are independent. Using
the chi-square test, we find the correlated features and send that subset to sequential backward selection.
In SBS, we remove from the subset the features that are not related to the output variable.
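A hedged sketch of this two-stage hybrid selection is below. Here scikit-learn's SequentialFeatureSelector stands in for SBS (the paper's mlxtend implementation works analogously), and the feature counts (10, then 5) are illustrative choices, not the paper's settings.

```python
# Stage 1: chi-square filter; Stage 2: sequential backward selection.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                       chi2)
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=500, n_features=21, random_state=42)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative features

# Stage 1: keep the 10 features most associated with the defect label.
chi_filter = SelectKBest(chi2, k=10)
X_chi = chi_filter.fit_transform(X, y)

# Stage 2: SBS greedily removes features from that subset down to 5,
# dropping whichever feature hurts cross-validated accuracy the least.
sbs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5,
                                direction="backward")
X_selected = sbs.fit_transform(X_chi, y)
print(X_selected.shape)
```

The cheap chi-square filter shrinks the search space so that the more expensive wrapper (SBS) only has to evaluate a small candidate set.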
4.3 Base Models
Random forest is a machine learning algorithm that builds multiple randomized decision trees and
aggregates their votes to perform the classification task. Logistic regression is a machine learning
algorithm used for classification problems; it forecasts the outcome of a categorical dependent
variable, producing a probability value between 0 and 1.
4.4 Meta-Model
We use a linear regression model as the meta-model to combine the outputs of the base models. We
combine the outputs of the two base models and train the meta-model on them, with the defect label as
the target variable during training. After that, we use these models together to predict defects.
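A minimal manual stacking sketch following this description is shown below: random forest and logistic regression as base models, and a linear regression meta-model trained on their outputs. The synthetic data and the 0.5 decision threshold on the meta-model's continuous output are assumptions for illustration.

```python
# Manual stacking: base-model probabilities become meta-model features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train the two base models on the training data.
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

def base_outputs(X):
    # Each base model's predicted defect probability is one meta-feature.
    return np.column_stack([rf.predict_proba(X)[:, 1],
                            lr.predict_proba(X)[:, 1]])

# The meta-model's target is the same defect label the base models use.
meta = LinearRegression().fit(base_outputs(X_train), y_train)

# Final prediction: threshold the meta-model's continuous output at 0.5.
y_pred = (meta.predict(base_outputs(X_test)) >= 0.5).astype(int)
print(f"accuracy: {(y_pred == y_test).mean():.2f}")
```

In practice the meta-model is often trained on out-of-fold base predictions rather than the same training data, to avoid leaking the base models' training fit into the meta-model.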
5 Results
To further improve performance, we use the ensemble model together with hybrid feature selection. To
train the proposed algorithm we used the CM1 dataset, which contains 449 non-defective samples and 49
defective samples. We increased the number of samples using the SMOTE technique, after which the
dataset contains 898 samples: 449 defective and 449 non-defective. We use a confusion matrix to measure
the performance of the model, computing accuracy, precision, recall, and F-measure. We observe an
accuracy of 93%.
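The four metrics can be derived from the confusion matrix cells as sketched below; the label arrays are illustrative, not the CM1 results.

```python
# Deriving accuracy, precision, recall, and F-measure from a
# confusion matrix with scikit-learn.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])  # illustrative labels
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1, 0])  # illustrative predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f_measure = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f_measure)  # 0.8 0.8 0.8 0.8
```

Because SMOTE balances the classes before training, accuracy is meaningful here, but precision and recall remain the more informative metrics for defect prediction.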
References
2. Z. Sun, Q. Song and X. Zhu, "Using Coding-Based Ensemble Learning to Improve Software Defect
Prediction," in IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and
Reviews), vol. 42, no. 6, pp. 1806-1817, Nov. 2012, doi: 10.1109/TSMCC.2012.2226152.
3. L. Gong, S. Jiang and L. Jiang, "Tackling Class Imbalance Problem in Software Defect Prediction
Through Cluster-Based Over-Sampling With Filtering," in IEEE Access, vol. 7, pp. 145725-
145737, 2019, doi: 10.1109/ACCESS.2019.2945858.
4. J. Zheng, X. Wang, D. Wei, B. Chen and Y. Shao, "A Novel Imbalanced Ensemble Learning in
Software Defect Predication," in IEEE Access, vol. 9, pp. 86855-86868, 2021, doi:
10.1109/ACCESS.2021.3072682.
5. S. Huda et al., "A Framework for Software Defect Prediction and Metric Selection," in IEEE
Access, vol. 6, pp. 2844-2858, 2018, doi: 10.1109/ACCESS.2017.2785445.
7. S. Huda et al., "An Ensemble Oversampling Model for Class Imbalance Problem in Software
Defect Prediction," in IEEE Access, vol. 6, pp. 24184-24195, 2018, doi:
10.1109/ACCESS.2018.2817572.