SDP Edited1.edited
Iftikhar Ahmad
Department of Computer Science, University of Agriculture, Faisalabad
[email protected]
Abstract
Software defect prediction can directly affect software quality and has gained significant attention in recent years. Defective modules degrade software quality and increase cost, delivery time, and maintenance effort. The demand for complex business applications requires error-free, high-quality application frameworks. Unfortunately, most developed software applications contain defects, which cause stakeholder dissatisfaction and system failures, and such failures are unacceptable for critical business applications. It is therefore vital to better recognize and quantify the association between software defects and failures so that defects can be predicted effectively. Numerous studies have attempted to forecast these defects in order to reduce failures and improve software. This paper presents a review of software defects and the prediction approaches used for software quality and software development, based on 148 papers on software defect prediction. According to this survey, the use of open-source datasets and machine learning techniques has increased significantly since 2010, and researchers continue to use open-source datasets and machine learning to build better defect prediction models. Building a bug prediction model remains a difficult and challenging task, and the many techniques proposed for it are summarized in this study, together with the strengths and weaknesses of the machine learning algorithms (MLA).

Keywords: failure, defect prediction, bug, software defect, machine learning, algorithm

1. Introduction
A defect is a deficiency or flaw in a software process or product caused by an error, fault, or failure. The standard terminology defines an "error" as a human activity that produces an incorrect result, and a "defect" as a wrong decision that leads to an erroneous outcome for a solution to the problem[1]. To avoid such failures in a software product, defect prediction techniques (DPT) are applied at each phase of the software development life cycle[2]. Identification and prediction of software defects is a significant issue in the software development community[3]. Without appropriate techniques for detecting defects in the software, a product may be delivered in an unsatisfactory condition. Software quality and reliability are the most important attributes of software, and defect prediction provides an important measure of both[4]. Software quality means producing predictable results and delivering output within defined time and cost restrictions[5]. The quality of a software product cannot be achieved merely by finding and eliminating defects at testing or delivery time; defects enter at every stage of software development and should be removed at each phase rather than only in the testing phase[6]. Complexity and dependencies in a software product increase development time and effort and become a source of software defects and faults. Identifying these faults is hard: software developers devote about 50% of their time to finding them, and according to a Cambridge University report the cost of finding and fixing bugs amounts to $312 billion per year for the global economy[7].
Traditionally, white-box testing techniques are applied to find defective software modules, but finding and removing defects in this way is difficult and expensive when the project is of enormous size and complexity and has interdependent modules. Finding and fixing a defect in an early phase requires less time and is less expensive than doing so after delivery of the project[8]. NASA spent $125 million on a spacecraft that was then lost in space because of a small bug in data conversion[9]. These facts motivate developers to build high-quality software products that are free of defects and produce results within defined constraints. It is possible to reduce the effects of defects, but it is not possible to remove every defect from a software product, so the main concern of software developers is to produce products that contain as few faults as possible[10].
Many techniques are available for predicting buggy modules; the most widely used are machine-learning techniques (MLT). These MLT use different computing techniques, different metrics, and previous (historical) data to predict buggy modules[11]. Different MLT are used for defect prediction, such as clustering, regression, artificial neural networks (ANN), Bayesian belief networks, K-means clustering, association rules, hybrid selection approaches, and genetic programming[12].
The research questions and their motivation, which are discussed in this literature review, are listed in Table 1.

Table 1: Research questions
No. | Research Question | Motivation
1 | Which types of datasets are most often used for defect prediction? | Find whether public or private datasets are more useful.
2 | Which datasets are used for software fault prediction? | Find the popular datasets that are used for SDP.
3 | Which types of methods are most important? | Identify the methods that receive the most focus.
4 | Which types of machine learning algorithms are important? | Identify the ML algorithms that are used most often.
5 | Which common metrics are used in SDP? | Find the best prediction metrics.
6 | Which metrics are seen to be a noteworthy indicator for SDP? | Show the metrics that are seen to be pertinent for SDP.
7 | Which method supplies higher accuracy? | Prioritize models.
8 | Which journal is best for SDP? | Show the important journals for SDP.
9 | What are the distinctive performance measures used for SDP? | Distinguish the performance measures used to assess how well a prediction model works.
10 | Which statistical tests are used for SDP? | Recognize the statistical tests used to quantify the performance of SDP models.

2. Related Work
Size and complexity metrics have been used for defect prediction; using these metrics it was concluded that one KLOC contains approximately 23 defects[13]. A fuzzy-logic information system has been used to predict defects at an early phase: the proposed model is a fuzzy inference system that uses metrics data and different error data at the function level, with five input variables in one input layer and an output layer that predicts whether the input data are defective or not[14]. After studying different software metrics and their relationship with software defect prediction, researchers have built metrics that are appropriate for software defect prediction (SDP)[15]. Using previous data, a model has been built that predicts defects and estimates how much effort is needed to develop a module[16]. A large analysis compared the performance of four machine learning classifiers, Naïve Bayes, Support Vector Machine, R Part, and random forest, on the same defect data[12]. Combining the features of different metrics is also a well-known approach for SDP[17]. Feature selection with sampled data is also a good approach for comparing results before and after feature selection[18]; if cost and time are ignored and the focus is on accuracy, feature selection techniques provide the best and most appropriate results[19]. To improve the fault prediction process, a metaheuristic technique has been used that combines the features of different algorithms with the concept of bagging to predict defects more efficiently[20].
One study examined the possibility of SDP in the automotive industry and found, through correlation and component analysis, that the key assumption needed for defect prediction based on software code metrics is not satisfied, i.e., that there is overlap between the fault portions of the target and training data[21]. A survey of software engineering applications for SDP explored almost 90 papers on fault prediction published between 1990 and 2009, covering both ML techniques and statistical methods; the review found that a large portion of the models were based on ML techniques, which is useful for researchers who want to learn from defect forecasting techniques[22]. Researchers investigating the use of developer data in prediction models report conflicting results; some state that developer data does not improve predictive performance[23], and classification techniques have a constrained effect on SDP[24]. Kim et al. proposed an approach to predict defective modules using software change history information[25]. A Naïve Bayes model has been built that predicts the number of defects and the defect density in the next release of a software system; the results showed that this fault prediction model accurately predicted the defects in the software product[23]. Challagulla et al. expanded on this work by analyzing a larger set of ML methods for SDP[26]. Zhou et al. used linear regression to distinguish differing degrees of software defects[27].
3. Defect Prediction Techniques
Two methods are used for finding defects: an SDP model and manual defect finding and removal. The probability of finding software defects using an SDP model is greater than that of manual defect finding: approximately 71 percent of all defects are found using the prediction-model approach[28], while approximately 61 percent of the total defects are found by manual testing[29]. Comparing the results of both methods, the defect prediction model gives better results.

3.1. Software Metrics Techniques
Several levels of software metrics are used for SDP: method level, quantitative level, file level, class level, component level, and process level.

Fig 1: Software metrics.

Four method-level metrics are used for the measurement of software modules: the lines of code metric (LOC), cyclomatic complexity v(g), design complexity iv(g), and the ev(g) metric for essential complexity[30]. The Halstead method-level metrics are listed in Table 2.

Table 2: Halstead method-level metrics[31].
Metric | Description
N | Total number of operators and operands
V | Volume
L | Program length
D | Difficulty
I | Intelligence
E | Effort
B | Bugs
T | Time
lOCode | Line count of code
lOComment | Count of comment lines
Unique_Op | Number of unique operators
Unique_Opnd | Number of unique operands
Branch_Count | Number of branches
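To illustrate how such method-level metrics can be collected in practice, the sketch below uses the Python radon library to compute lines of code and cyclomatic complexity for a source file. The library choice and the file name are illustrative assumptions of this review, not tools used by the cited studies.

```python
# Minimal sketch: extracting method-level metrics (LOC, cyclomatic complexity)
# with the Python 'radon' library. The target file name is hypothetical.
from radon.raw import analyze          # raw metrics: loc, lloc, comments, ...
from radon.complexity import cc_visit  # cyclomatic complexity per function

with open("defects_demo.py") as fh:    # hypothetical module to be measured
    source = fh.read()

raw = analyze(source)
print(f"LOC={raw.loc}, logical LOC={raw.lloc}, comment lines={raw.comments}")

# One record per function/method: name and cyclomatic complexity v(g).
for block in cc_visit(source):
    print(f"{block.name}: v(g)={block.complexity}")
```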
The quantitative-level approaches are used to measure the quantitative performance of a system, such as disk space, CPU usage, and average transactions. The class-level metrics are mostly used with object-oriented techniques; the most widely used class-level metrics are the Chidamber and Kemerer (CK) metrics presented by [32]. CK metrics used in object-oriented techniques include "number of children class metrics (NOCC), message passing coupling (MPC), number of methods (NOM), weighted methods per class (WMPC), response for class (RFC), depth of inheritance tree (DIT), lack of cohesion of methods (LCOM)". Another set of class-level metrics, presented by [33], is the metrics for object-oriented design (MOOD). Component-level metrics are also important; component-level metrics widely used in defect prediction include complexity metrics, customization metrics, reusability metrics, coupling and cohesion metrics, and Java component metrics[34].

3.2. ML Techniques
In this section, some important and frequently used MLA are discussed. No single machine learning (ML) algorithm gives good accuracy on every dataset; the nature of the dataset is the most important factor in the choice of ML algorithm. The most frequently used ML algorithms are discussed below.

3.2.1. Naïve Bayes (NB)
The Naïve Bayes classifier is one of the machine learning algorithms most widely used in SDP. This classifier gives the best results when the features of the dataset are independent. Although the assumption of feature independence is often unrealistic, Naïve Bayes still remains among the best ML algorithms. Naïve Bayes is also known as a probabilistic ML classifier because it categorizes input data on the basis of mathematical probability. The algorithm provides appropriate results in two situations: when all the features are independent of each other, and when all the features are almost functionally dependent on each other[35][36].
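A minimal sketch of this approach is given below, using scikit-learn's GaussianNB; the dataset file module_metrics.csv and its "defective" label column are hypothetical placeholders for a numeric module-level metrics dataset such as those described above.

```python
# Minimal sketch: Naive Bayes defect prediction with scikit-learn.
# The CSV file and its 'defective' label column are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

data = pd.read_csv("module_metrics.csv")       # hypothetical metrics dataset
X = data.drop(columns=["defective"])           # numeric metric features
y = data["defective"]                          # 1 = defective, 0 = clean

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

model = GaussianNB()                           # probabilistic classifier
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```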
3.2.2. Decision Tree (DT)
A decision tree is also a good ML approach. It constructs a tree on the basis of different categorical attributes: each internal node of the tree represents an attribute, and each leaf node is a class label. First, it identifies the most important attribute and places it at the root; it then splits the training data into subsets so that the data in each subset are relevant to that attribute. This process continues until a final leaf node is reached in every branch of the tree. Researchers prefer decision trees because they are simple and less complex than other ML algorithms. The algorithm is also called a rule-based algorithm because each path from the root to a leaf represents a rule[37]. The complexity of a DT is measured by the number of leaf nodes, the number of parent nodes, and the depth of the tree[38].
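The following sketch illustrates this approach with scikit-learn; the train and test splits are assumed to be prepared as in the Naïve Bayes sketch above, and the rule-like root-to-leaf paths can be printed directly.

```python
# Minimal sketch: decision-tree defect prediction with scikit-learn.
# X_train/X_test/y_train/y_test are assumed to be prepared as in the
# Naive Bayes sketch (hypothetical module-metrics dataset).
from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(max_depth=4, random_state=42)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))

# Each root-to-leaf path is a readable rule, as described above.
print(export_text(tree, feature_names=list(X_train.columns)))
```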
3.2.3. Linear Classifier (LC)
The word "linear" is associated with two terms: linear regression and linear classifier. Linear regression predicts a value, whereas an LC is used to predict a class. An LC analyzes the attributes of items and predicts which classes the items belong to. The LC works like a support vector machine: it analyzes data using classification and regression techniques and makes the gap between classes as large as possible so that it can easily predict the class of an item. It builds a hyperplane positioned where the distance to the data items is largest, so that the items can be categorized[39]. The purpose of building the hyperplane is to separate the training data into different classes so that class prediction becomes possible.
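A sketch of such a maximum-margin linear classifier, using scikit-learn's LinearSVC on the same assumed splits, is shown below; the regularization setting is illustrative only.

```python
# Minimal sketch: a linear classifier (linear SVM) that separates defective
# from non-defective modules with a maximum-margin hyperplane.
# X_train/X_test/y_train/y_test are assumed as in the earlier sketches.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```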
3.2.4. Particle Swarm Optimization (PSO)
PSO is an evolutionary technique similar to a genetic algorithm. The main advantage of PSO is that it is easier to implement than a genetic algorithm. PSO is widely used in many research areas, such as fuzzy control systems, function optimization, and ANN training. The algorithm consists of two steps: the first comprises velocity matching with the nearest neighbour, and the next is called craziness. At each iteration the nearest neighbours are identified and assigned to the velocities X and Y; because of this behaviour it retains some similarity to the genetic algorithm[40].
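The sketch below shows a generic PSO loop minimizing a simple objective. In SDP, PSO is typically applied to tune model or feature-selection parameters, so the sphere objective here is only an illustrative stand-in; all parameter values are assumptions of this review.

```python
# Minimal sketch: particle swarm optimization (PSO) minimizing a toy objective.
import numpy as np

def sphere(x):
    return float(np.sum(x ** 2))            # stand-in objective to minimize

rng = np.random.default_rng(0)
n_particles, dim, iters = 20, 5, 100
w, c1, c2 = 0.7, 1.5, 1.5                    # inertia, cognitive, social weights

pos = rng.uniform(-5, 5, (n_particles, dim))  # particle positions
vel = np.zeros((n_particles, dim))            # particle velocities
pbest = pos.copy()                            # personal best positions
pbest_val = np.array([sphere(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()      # global best position

for _ in range(iters):
    r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    vals = np.array([sphere(p) for p in pos])
    improved = vals < pbest_val               # update personal bests
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()  # update global best

print("best value found:", sphere(gbest))
```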
3.2.5. Artificial Neural Network (ANN)
An ANN is also an important ML algorithm for SDP. It works like a human brain: it consists of a group of artificial neurons that are interconnected and work at the same time to solve a specific problem. There are two main parts of an ANN: vertices and edges. Edges connect the vertices to each other for the sharing of information. The ANN takes inputs from the input layer and assigns a different weight to each edge; these weights are assigned on the basis of learning experience and affect the output values of the variables. The final output value is compared with a threshold value and the result is predicted[16].
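A minimal sketch of a small feed-forward network for defect prediction, using scikit-learn's MLPClassifier on the same assumed splits, is given below; the layer sizes are illustrative only.

```python
# Minimal sketch: a small feed-forward neural network for defect prediction.
# X_train/X_test/y_train/y_test are assumed as in the earlier sketches.
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=1000, random_state=42),
)
ann.fit(X_train, y_train)
print("test accuracy:", ann.score(X_test, y_test))
```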
4. Results
One hundred and forty-eight papers, published between 2010 and 2019, were reviewed systematically in this paper. The research questions are discussed below.

4.1. Datasets (RQ1)
Some researchers use their own (private) datasets that are not available to other researchers. This is not a good approach, because it creates many difficulties when comparing the results of such studies, since the datasets are not the same; every researcher working on fault prediction faces this problem. Repositories have therefore been created to store public datasets, such as the University of California Irvine (UCI) repository, which holds NASA datasets that are publicly available to every researcher in Attribute-Relation File Format (ARFF) and can be used directly in the Weka tool. According to the nature of the datasets, we divide the papers into four categories: private, public, partial, and other or unknown datasets. Private or confidential datasets are those that are not available to everyone. Public datasets are available to everyone, such as the NASA datasets in the UCI repository. If the dataset information is not given in a paper, the dataset is classed as unknown. Datasets that are not stored in a public repository but are publicly available somewhere are called partial datasets. In this paper 148 papers were reviewed and categorized according to their datasets, and the result was compared with the previous research of Catal et al.; the use of public datasets has increased significantly compared with that earlier survey[41].

Fig 2: Datasets distribution table.
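As an illustration of working with such public datasets, the sketch below loads an ARFF file with SciPy and pandas. The file name kc1.arff and the label column name are assumptions of this review, since attribute names vary across datasets.

```python
# Minimal sketch: loading a public defect dataset stored in ARFF format,
# the format mentioned above for the NASA/UCI datasets.
import pandas as pd
from scipy.io import arff

raw, meta = arff.loadarff("kc1.arff")     # hypothetical local ARFF file
df = pd.DataFrame(raw)

# Nominal attributes are read as bytes; decode the defect label column.
# The label name varies between datasets, so 'defects' is an assumption.
df["defects"] = df["defects"].str.decode("utf-8")
print(df.shape)
print(df["defects"].value_counts())
```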
4.2. Recently used datasets for SDP (RQ2)
In this section we recap the different datasets used for defect prediction. Researchers used publicly available datasets such as NASA, Promise, Apache, Eclipse, and student-developed projects to build software defect prediction models. Out of 106 papers, 42 use NASA datasets. Legacy systems, industrial software systems, and other data sources that are not publicly available form another category. The most frequently used datasets for SDP are the NASA datasets[42].

4.3. Types of important methods (RQ3)
Four types of methods are widely used for software fault prediction: ML, statistics, statistics plus expert opinion, and ML combined with statistics. Papers are labelled according to these methods; for example, if a paper uses both statistics and machine learning, we mark it as a statistics-and-ML paper. In this paper 148 papers were reviewed and their methodologies extracted, and the result was compared with the previous research of Catal et al. According to this survey, the use of MLT, ML with statistics, and statistics with expert opinion has increased significantly since 2010, while the use of purely statistical techniques for software defect prediction has decreased, because they involve heavy calculation and a greater error rate than MLT. Figure 3 shows the comparison of percentages before and after 2010 [41].

Looking at the comparison table, LC supplies greater accuracy than the other algorithms on most of the datasets. However, LC cannot supply greater accuracy on every dataset, which shows that the nature of the dataset is the most important factor for defect prediction accuracy.
Table 4 shows the standard deviation of the error for each algorithm discussed above; the lowest standard deviation value for each algorithm is marked in bold.

Table 4: Standard deviation of error.
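The sketch below illustrates the kind of computation behind such a standard-deviation-of-error summary: repeated cross-validation of one classifier, reporting the mean and standard deviation of its error. The data X, y are assumed as in the earlier sketches, and the classifier choice is illustrative.

```python
# Minimal sketch: mean and standard deviation of cross-validated error
# for one classifier on an assumed module-metrics dataset (X, y).
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

scores = cross_val_score(GaussianNB(), X, y, cv=10)   # 10-fold accuracy
errors = 1.0 - scores                                  # per-fold error
print(f"mean error = {errors.mean():.3f}, std dev = {errors.std():.3f}")
```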