An Enhanced Bayesian Decision Tree Model For Defect Detection On Complex SDLC Defect Data
Volume: 3 Issue: 11
ISSN: 2321-8169
6360 - 6365
________________________________________________________________________________________________________
Dr. N. Geethanjili
Research Scholar,
Dept. of Computer Science & Technology,
Sri Krishnadevaraya University,
Ananthapuram, India

Associate Professor,
Dept. of Computer Science & Technology,
Sri Krishnadevaraya University,
Ananthapuram, India
Abstract— In this paper, we explore a multi-defect prediction model on complex metric data using a hybrid Bayesian network. Traditional software metrics are used to estimate the effect of defects for decision making. Extensive study has been carried out to find defect patterns using one or two software phase metrics. However, the effectiveness of traditional models is influenced by redundant and irrelevant features. Also, as the number of software metrics increases, the relationship between the new metrics and the traditional metrics becomes too complex for decision making. In this proposed work, a preprocessing-based hybrid Bayesian network is implemented to handle a large number of metrics for multi-defect decision patterns. Experimental results show that the proposed model has higher precision and a lower false positive rate than traditional Bayesian models.

Keywords— Hybrid Model, Bayesian Model, Decision Patterns, Defect Data.
__________________________________________________*****_________________________________________________
I. INTRODUCTION
A defect is a flaw in a software program which can cause it to fail to perform its functions. Defect prediction provides an optimized way to find vulnerabilities in the SDLC phases which occur due to manual or automated errors. As the dependency on software programs increases, software quality is becoming more and more essential in the present era. Software defects such as failures and faults may affect the quality of software, which leads to customer dissatisfaction. Due to increasing software constraints and modular complexity, it is difficult to produce a quality end product. Defects in software may cause loss of money and time, so it is necessary to predict bugs in advance for successful quality products and decision makers. As a result, the bug reports present in various bug tracking frameworks contain detailed information about the bugs along with their severity level [1-3].

Generally, faulty constraints that cause incorrect outputs are represented as software bugs. These constraints can be defined as a set of features which can be used to find the bugs. These features influence the effectiveness of the bug prediction model. Various types of classification and feature selection models have been applied for software defect detection, including decision trees, multiple regression, neural networks, SVM and naive Bayes. However, these models fail to select the relevant defect features for an appropriate classifier. The performance of software defect detection also decreases due to noise and the large number of defect features [4][5].
The basic limitations of these traditional models are:
1) Unable to find new patterns for dynamic features.
2) Fail to load metric data with a large number of instances.
3) The requirement specification of the project may be wrong due to missing features or values.

A Bayesian network factorizes the joint probability of the metric variables V1, ..., Vn as

Prob(V1, ..., Vn) = ∏(i=1..n) Prob(Vi | Vi+1, ..., Vn)
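The chain-rule factorization above can be made concrete with a toy network. A minimal Python sketch, where the metric variables (complexity, churn, defect) and all conditional probabilities are made-up values purely for illustration:

```python
# Joint probability of a small metric network via the chain rule:
# Prob(V1, ..., Vn) = product over i of Prob(Vi | its predecessors).

# P(complexity) -- hypothetical prior
p_complexity = {"high": 0.3, "low": 0.7}
# P(churn | complexity) -- hypothetical conditional table
p_churn = {("high", "high"): 0.6, ("high", "low"): 0.4,
           ("low", "high"): 0.2, ("low", "low"): 0.8}
# P(defect | churn) -- hypothetical conditional table
p_defect = {("high", True): 0.5, ("high", False): 0.5,
            ("low", True): 0.1, ("low", False): 0.9}

def joint(complexity, churn, defect):
    """Chain rule: P(c, ch, d) = P(c) * P(ch | c) * P(d | ch)."""
    return (p_complexity[complexity]
            * p_churn[(complexity, churn)]
            * p_defect[(churn, defect)])

# P(high complexity, high churn, defect) = 0.3 * 0.6 * 0.5
p = joint("high", "high", True)
```

Because each conditional table sums to one, the joint probabilities over all assignments also sum to one, which is what makes the factorized representation a valid distribution.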
The rest of the paper is organized as follows. The related work on different defect prediction models and feature selection models for software defects is discussed in Section II. In Section III, we propose a new filter-based hybrid Bayesian network model for defect prediction. In Section IV, experimental results are evaluated on different software defect datasets, and finally, Section V presents the conclusion and future scope.

II. Related Work

[1][2] formulated defect prediction models to find the stochastic process in terms of defect variables and to find the interval between the variable rates. They used a non-homogeneous Poisson process to formulate the number of defects found during the defect dependency test. For each defect, the Poisson process P(t) gives the probability of finding k defects by time t, expressed in terms of the Poisson distribution with mean m(t) as

Prob{P(t) = k} = m(t)^k · e^(−m(t)) / k!

The exponential model is used to find the defect distribution in the testing phase of the SDLC, especially the regression testing and integration testing phases. The basic assumption is that a defect can occur at any stage of the testing phase, and the failure mode is the best indication of software reliability:

F(t) = k(1 − e^(−λt))

Naive Bayes is a very effective classification technique to predict the existence of defects based on training samples. A naive Bayes model treats bug prediction as a binary classification problem, i.e., it trains and predicts by analyzing historical metric data. If the attribute types in the metric data are mixed, then it is difficult to predict defects due to missing values or uncertain data.

Regression techniques aim to predict the quantity and density of software defects. Classification techniques aim to determine whether a software module (which can be a package, code, file, or the like) has a higher defect risk or not. Classification usually learns from data in earlier versions of the same project, or from similar data of other projects, to establish a classification model. The model is then used to forecast software defects in the projects which need to be predicted. By doing this, we can make a more reasonable allocation of resources and time, improving the efficiency and quality of software development [9-12].

Dynamic analysis techniques can be categorized into three independent layers. The first layer is a systematic testing layer, which executes target programs within policies; these policies aim to reach error states effectively. The second layer is an information extraction layer: information on the internal behavior of the target programs is extracted to be used for program correctness checking. At the third layer, monitors generate an abstract model of the target program from the extracted information and then verify the abstract model to detect possible errors in the program. Dynamic analysis techniques inherently share the limitations of testing. Dynamic analysis cannot support complete analysis of target programs, since it uses only monitored partial behavior of the target programs. The other limitation is that dynamic analysis techniques are difficult to apply unless target programs are complete: they require executable environments and test cases [7-9].

In [3], the importance of different software metrics within a prediction model was studied. They implemented correlations and metric occurrences in the bug prediction model using different algorithms, and the number of bugs in each metric was computed. [4] implemented object-oriented metrics to measure object-oriented software quality. It was found that models built on coupling and complexity are more precise and accurate than models built on other metrics. [5] designed a model that describes the prediction of 90 releases in open source projects and other academic projects using a clustering algorithm. They implemented similarity cluster measures to group the metrics in the design and implementation phases, and statistical tests were used to validate the cluster in each group of metrics. [6] implemented principal component analysis to reduce the simple multi-collinear complexity to un-correlated measures of orthogonal complexity.

[6] proposed a model to predict bugs and their levels with high, medium and low severity faults, and found that high severity faults are predicted less accurately than by the traditional models at different severities. Others used the KNN method to judge the defect rate in software status and events, attempting to estimate software defect rates using statistical techniques. As data mining techniques become more mature and widely used, analyzing and mining the hidden information in software development repositories has become a hot research topic. The usual data mining techniques applied in this domain include association rules, classification and prediction, and clustering. Classification here means building a defect prediction model by learning from already existing defect data to predict the defects in future versions of the software. [8] used this method to improve the efficiency and quality of software development. Other research aims to predict the status and the number of software defects. Current software defect prediction mainly uses software metrics to predict the amount and distribution of software defects. The research method of software defect classification and prediction is based on the program properties of historical software versions, building different prediction models to forecast defects.

Main objectives of this paper:
1) Remove noise in the hybrid dataset using correlation-based normalization.
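The Poisson formulation surveyed above is straightforward to evaluate directly. A minimal sketch, where the mean value m(t) = 4 expected defects is an arbitrary illustrative choice:

```python
import math

def poisson_defect_prob(m_t: float, k: int) -> float:
    """Probability of observing exactly k defects by time t,
    given the mean value function m(t): m(t)^k * exp(-m(t)) / k!."""
    return m_t ** k * math.exp(-m_t) / math.factorial(k)

# With an expected m(t) = 4 defects, the chance of finding exactly 2:
p = poisson_defect_prob(4.0, 2)
print(round(p, 4))  # 0.1465
```

Summing the probability over all k recovers 1, so the mean value function alone fully determines the defect-count distribution at time t.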
III. Proposed Model

In this model, multi-phase metric data is given as input to the proposed model for preprocessing. In this framework, as shown in Fig 1, input software metric data with a large number of attributes and values is given to the filtering technique. The filtering algorithm handles missing data and normalized correlation computations for data transformation. After the data transformation, the filtered output data is used by the hybrid Bayesian based ranking model to predict and rank the features for pattern mining. Each pattern in the hybrid model is evaluated using F-measure, FP, TP and accuracy for performance evaluation. Finally, decision patterns relevant to the set of metrics are evaluated for defect prediction.

Fig 1: Hybrid Metric Data → Data Preprocess Algorithm → Apply Proposed Model → Pattern Evaluation

Algorithm-1: Hybrid Data Preprocessing

If (I[j] == null & M[i+1] == null)
Then
    I[j] = (Mean(M[i]) + S.D(M[i])) / (2 * Max{M[i], M[i-1]});
End if
End for
End for
For each pair of metrics M[i] and M[i+1]
    Compute normalization as
    NM[i] = Normalize(M[i]);
    NM[i+1] = Normalize(M[i+1]);
    NML = addList(NM[i]);
    NML = addList(NM[i+1]);
done
done
Sort normalized metrics list NML in ascending order.
If (PC > thres)
Then
    D = addMetric(NML[i], NML[i+1], PC);
End if
Done

Algorithm 1 describes the hybrid preprocessing algorithm on the hybrid metric dataset for noise removal and data transformation. The algorithm reads the input data and checks each instance for missing values. If an instance value is missing, then it is replaced using equation (1) or equation (2). After replacing the missing instances, each pair of metrics is normalized to remove the uncertainty. Afterwards, the predictive correlation between the two metrics is computed and checked against the user-defined threshold.
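The preprocessing steps above can be sketched in Python. The metric values, the min-max choice for Normalize, and the 0.5 threshold are assumptions for illustration; the imputation follows the replacement rule (mean + standard deviation over twice the pairwise maximum), and Pearson correlation stands in for the predictive correlation PC:

```python
import statistics

def impute(metric, prev_metric):
    """Replace missing entries with (mean + stdev) / (2 * max) over the
    two metric columns, following the replacement rule of Algorithm 1."""
    present = [v for v in metric if v is not None]
    prev_present = [v for v in prev_metric if v is not None]
    fill = ((statistics.mean(present) + statistics.stdev(present))
            / (2 * max(max(present), max(prev_present))))
    return [fill if v is None else v for v in metric]

def normalize(metric):
    """Min-max normalization to [0, 1] (one common choice of Normalize)."""
    lo, hi = min(metric), max(metric)
    return [(v - lo) / (hi - lo) for v in metric]

def pearson(x, y):
    """Pearson correlation between two normalized metric columns."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

loc = [120.0, 450.0, None, 300.0]   # hypothetical metric M[i] with a gap
churn = [10.0, 40.0, 25.0, 28.0]    # hypothetical metric M[i+1]
loc = impute(loc, churn)
pc = pearson(normalize(loc), normalize(churn))
if pc > 0.5:                        # user-defined threshold `thres`
    selected = ("LOC", "churn", pc) # addMetric(NM[i], NM[i+1], PC)
```

On this toy pair the correlation clears the threshold, so the metric pair would be added to the filtered set D handed to the ranking model.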
Algorithm-2: Hybrid Bayesian Ranking Based Pattern Miner (HBRBPM)

Step 3: if ( )
Then
    Create a node with (max{ 1, 2}) as root
Else
    Compute the predictive correlation and gain between the other metrics.
End if
Step 4: Repeat steps 2 and 3 until all metrics are processed.
Step 5: Validate the test using F-measure and t-test.
Step 6: Extract rules from the tree.
Step 7: Display results.

Results
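Step 3's selection of a root node by maximum gain can be sketched as follows. The discretized metric columns and bug labels are hypothetical, and information gain is used here as one common choice for the gain measure named in the algorithm:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    """Information gain of splitting the labels on a discrete feature."""
    total = entropy(labels)
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        total -= len(subset) / len(labels) * entropy(subset)
    return total

# Hypothetical discretized metric columns against a bug label.
metrics = {
    "CBO_high": [1, 1, 0, 0, 1, 0],
    "DIT_high": [0, 1, 0, 1, 0, 1],
}
bug = [1, 1, 0, 0, 1, 0]

# Step 3: the metric with maximum gain becomes the root of the tree.
root = max(metrics, key=lambda name: info_gain(metrics[name], bug))
print(root)  # CBO_high
```

CBO_high matches the bug label exactly on this toy data, so its gain equals the full label entropy and it wins the root position; the same comparison would then recurse over the remaining metrics.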
IJRITCC | November 2015, Available @ http://www.ijritcc.org

Sample Data (Data 2 and Data 3):
lines_removed < 271.2 AND lines_removed < 135.6 -> lines_added <= 404.4
filetype != documentation AND lines_added < 606.5999999999999 -> lines_removed <= 135.6
RFC <= 862.0 -> Bug-count != false
NOC <= 38.0 AND NPM <= 214.0 -> CBO <= 125.0
NOC <= 38.0 -> Bug-count != false
RFC >= 0.0 -> NOC <= 38.0
RFC >= 0.0 -> WMC <= 351.0
CBO <= 125.0 AND NPM <= 214.0 -> WMC <= 351.0
LOC <= 5317.0 AND DIT >= 0.0 -> RFC >= 0.0
NOC <= 38.0 -> WMC <= 351.0
NOC <= 38.0 AND DIT >= 0.0 -> RFC <= 862.0
DIT >= 0.0 AND NPM <= 214.0 -> WMC <= 351.0
Bug-count != false -> CBO <= 125.0
WMC <= 351.0 -> RFC >= 0.0
DIT >= 0.0 AND RFC <= 862.0 -> Bug-count != false
LOC <= 5317.0 AND NPM <= 214.0 -> WMC <= 351.0
LOC <= 5317.0 -> CBO <= 125.0
RFC >= 0.0 -> NPM <= 214.0
Bug-count != false -> WMC <= 351.0
NOC <= 38.0 AND RFC <= 862.0 -> NPM <= 214.0
NOC <= 38.0 AND Bug-count != false -> DIT >= 0.0
NOC <= 38.0 AND NPM <= 214.0 AND DIT >= 0.0 -> RFC <= 862.0
RFC <= 862.0 -> CBO <= 125.0
DIT >= 0.0 -> LOC <= 5317.0
NOC <= 38.0 -> LCOM >= 0.0
NPM <= 214.0 AND WMC <= 351.0 -> RFC <= 862.0
WMC <= 351.0 AND LCOM >= 0.0 -> RFC >= 0.0
CBO <= 125.0 AND RFC >= 0.0 -> Bug-count != false
RFC >= 0.0 AND NPM <= 214.0 -> CBO <= 125.0
LCOM >= 0.0 AND NPM <= 214.0 -> Bug-count != false
CBO <= 125.0 AND NPM <= 214.0 AND DIT >= 0.0 -> RFC <= 862.0
WMC <= 351.0 AND RFC >= 0.0 -> Bug-count != false
LOC <= 5317.0 AND DIT >= 0.0 -> RFC <= 862.0

Proposed Experimental Results:

lines_removed <= 542.4 -> filetype != documentation
lines_removed < 678.0 -> filetype != documentation
lines_added <= 1011.0 AND lines_removed < 135.6 -> external != 1
lines_removed <= 135.6 AND lines_removed <= 678.0 -> lines_added <= 404.4
lines_added <= 404.4 -> filetype != documentation
lines_added <= 404.4 AND lines_added <= 606.5999999999999 -> external != 1
filetype != documentation AND lines_added < 606.5999999999999 -> lines_removed < 135.6
lines_removed <= 678.0 AND lines_removed <= 135.6 -> filetype != i18n
lines_added <= 808.8 AND lines_added <= 606.5999999999999 -> external != 1
lines_removed <= 678.0 AND lines_removed <= 135.6 AND external != 1 AND lines_added <= 202.2 -> filetype != documentation
lines_removed < 678.0 AND lines_removed <= 271.2 -> lines_added <= 1011.0
lines_removed < 135.6 AND lines_removed < 678.0 -> lines_added <= 404.4
lines_removed < 135.6 -> external != 1
lines_added <= 202.2 AND external != 1 AND lines_removed < 271.2 -> filetype != images
lines_removed < 135.6 -> filetype != images
filetype != i18n -> external != 1
lines_removed < 542.4 AND lines_removed < 271.2 -> lines_added < 606.5999999999999
lines_added < 1011.0 AND lines_removed <= 135.6 -> external != 1
lines_added < 1011.0 -> lines_removed < 135.6
lines_added <= 404.4 -> filetype != images
lines_removed <= 678.0 AND lines_removed <= 135.6 AND lines_removed < 271.2 -> filetype != images
lines_added < 1011.0 AND external != 1 AND lines_removed < 271.2 -> filetype != images
filetype != documentation -> lines_removed <= 678.0
lines_added < 1011.0 -> filetype != documentation
filetype != images AND filetype != unknown -> lines_added <= 202.2
lines_added <= 1011.0 AND lines_added < 808.8 -> filetype != images
lines_added < 202.2 AND lines_removed <= 135.6 AND lines_removed < 271.2 -> filetype != images

Performance Measures:
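The pattern evaluation measures used throughout (F-measure, FP rate, TP rate and accuracy) can be computed directly from confusion-matrix counts; the counts below are hypothetical, chosen only to illustrate the formulas:

```python
def pattern_measures(tp, fp, tn, fn):
    """Precision, FP rate, F-measure and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                # TP rate
    fp_rate = fp / (fp + tn)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, fp_rate, f_measure, accuracy

# Hypothetical counts for one mined decision pattern:
precision, fp_rate, f_measure, accuracy = pattern_measures(tp=90, fp=10,
                                                           tn=80, fn=20)
print(round(precision, 2), round(fp_rate, 2))  # 0.9 0.11
```

High precision together with a low FP rate is exactly the pair of criteria the abstract claims for the proposed model, which is why both are reported per pattern.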
Table 1: Uncertain data preprocessing

Datasize   MissingValues   FilterTime (secs)
#500       12              16
#1000      16              18
#1500      19              21
#2000      25              28
#5000      28              34

Table 2: Bug prediction accuracy comparison

Datasize   KNN     Regression Based Bug prediction   Predictive Bug detection   HBRBPM
#500       0.798   0.85                              0.867                      0.9524
#1000      0.87    0.91                              0.814                      0.939
#1500      0.85    0.827                             0.845                      0.917
#2000      0.867   0.819                             0.842                      0.9713
#5000      0.891   0.84                              0.86                       0.96

V. CONCLUSION