Software Bug Detection using Data Mining

Article in International Journal of Computer Applications · April 2015


DOI: 10.5120/20228-2513



International Journal of Computer Applications (0975 – 8887)
Volume 115 – No. 15, April 2015

Software Bug Detection using Data Mining


Dhyan Chandra Yadav
Research Scholar, Shri Venkateshwara University, Gajraula, Amroha (U.P.)

Saurabh Pal
Head, MCA Dept., VBS Purvanchal University, Jaunpur (U.P.)

ABSTRACT
Software problems appear in a wide variety of applications and environments. Some of them arise during software project development and are known as software defects. Among these, the software bug is a major problem that arises in the coding implementation, and project development teams have not obtained satisfactory results in dealing with it. Software bugs are recorded in problem reports, and a software engineer cannot easily detect such a defect from the reports alone; with the help of data mining classification, however, software engineers can classify software bugs easily. This paper classifies and detects software bugs using the J48, ID3 and Naïve Bayes data mining algorithms. A comparison of these algorithms in terms of accuracy and time taken to build the model is also presented.

General Terms
Data Mining, Classification algorithms, Software Bug, Weka Tool.

Keywords
Classification: ID3, J48 and Naïve Bayes; Software BUG; WEKA.

1. INTRODUCTION
Humphrey [1] introduced the notion of a software bug. It is a major problem in coding implementation, because without correct code we cannot obtain correct results. Software engineering teams keep bug reports in which such bugs are recorded. In the absence of the required software quality, the customer is not satisfied. With the help of a software tracker, a software engineer can easily detect an error as a software defect and identify its type.

The software bug report is known as a problem report; with the help of data mining, bugs can be detected and analyzed easily.

1.1 Data Mining
Tiwari and Chaudhary [2] describe data mining as a process in computer science for extracting relationships and patterns from data and collecting information that supports decision making, as needed in the software development field. Data mining makes it easy to extract information from problem reports; with this information we can detect software defects and improve software quality.

1.2 Classification
Tiwari and Chaudhary [2] describe classification as dividing data samples into target classes. Classification uses a training set in which data of the same class share a common class label. Several types of bug occur in software project development: SW-bug, document bug, duplicate bug and mistaken bug. In the training set, these bug classes share a common class label for the data objects, known as software defect.

1.3 Decision Tree
Tiwari and Chaudhary [2] describe the decision tree as a classifier that starts from a root node, which generates further branches as nodes. Each node holds its own information about the common attributes of the data at class level. For example:

Fig 1: Checking the defect level of software

1. If software > 1, the root node extends to another branch or internal node (not a leaf node), which shows class (2).

2. If software < 1, the root node shows class (1).

3. If software defect > 1, the defect is classified as a bug at class (3) and no further node is extended; otherwise it falls in class (2).

1.4 ID3
J. Ross Quinlan [3] introduced the Iterative Dichotomiser 3 (ID3) algorithm. It starts with the training sample at the root and partitions it on an attribute, linking each attribute value to the corresponding sub-sample; each extracted node acts as a child node. At every node, all instances are checked to see whether they fall under the same class; if they do, the node represents that class. The classifier is built from top to bottom and minimizes the information entropy measure.

C4.5, presented by Hunt and Quinlan, is the successor of ID3 and is implemented as J48 in Weka. In a J48 decision tree, leaf nodes represent class labels.

Fig 2: J48 classification of attributes by decision tree
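The entropy measure that ID3 minimizes at each split can be illustrated with a short sketch. The function names and the toy "severity" split below are assumptions made for illustration and are not taken from the study's data; only the class counts (6 bug, 55 non-bug) follow the sample size given in Table 1.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    total = len(labels)
    # Group the class labels by the value each row has for the attribute.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Class distribution of the data set: 6 bug (1) and 55 non-bug (0) instances.
labels = [1] * 6 + [0] * 55

# Hypothetical attribute values, invented for illustration only:
# every bug is "serious" and every non-bug is "normal".
rows = [{"severity": "serious"}] * 6 + [{"severity": "normal"}] * 55

base = entropy(labels)
gain = information_gain(rows, labels, "severity")
print(f"entropy before split: {base:.3f} bits")
print(f"information gain of 'severity': {gain:.3f} bits")
```

ID3 would choose the attribute with the largest gain as the test at the current node. In this invented split the two classes are separated perfectly, so the gain equals the entropy before the split.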


The decision tree built by J48 has the following properties:

1. It is a flow-chart-like tree structure in which an internal node denotes a test on an attribute.

2. A branch represents an outcome of the test.

3. Decision tree generation consists of two phases.

1.5 Bayes Rule
Tiwari and Chaudhary [2] describe the Bayes rule in terms of an event and its supporting evidence. Two cases arise:

1. P(H), the probability that the event occurs before any evidence is considered.

2. P(H|E), the probability that the event occurs given the supporting evidence.

Let H be the event SW-bug and E be the evidence software defect. Then we have

P(SW-bug | software defect) = P(software defect | SW-bug) * P(SW-bug) / P(software defect)

The Naïve Bayes classification algorithm is mainly used for high-dimensional input. As the example above shows, we can predict the outcome of an event by observing some evidence. Generally, it is better to have more than one piece of evidence to support the prediction of an event.

2. RELATED WORK
Shepperd, Schofield and Kitchenham [4] discussed the need for cost estimation in management and software development organizations, presented the idea of prediction, and discussed methods for estimation.

Alsmadi and Magel [5] discussed how data mining helps in assessing the quality, cost and complexity of a new software project, and built a link between data mining and software engineering.

Boehm, Clark, Horowitz, Madachy, Selby and Westland [6] discussed that some software companies suffer from accuracy problems that depend on their data sets; after prediction, a software company can form new ideas for specifying the project cost and schedule and for determining the staff timetable.

Pal and Pal [7] conducted a study on student performance by selecting 200 students from a BCA course. By means of ID3, C4.5 and Bagging, they found that SSG, HSG, Focc, Fqual and FAIn were highly correlated with student academic performance.

Ribu [8] discussed the need for analyzing open source code projects by prediction, and for estimating object-oriented software projects with the use case model.

Nagwani and Verma [9] discussed the prediction of software defects (bugs), the duration of similar bugs and the bug average across all software summaries by means of data mining, and also discussed software bugs.

Hassan [10] discussed that complex data sources (audio, video, text, etc.) need more buffer space for processing, and that a buffer of general size and length does not support them.

Chaurasia and Pal [11, 12] conducted studies on the prediction of heart attack risk levels from a heart disease database with data mining techniques such as Naïve Bayes, J48 decision tree and Bagging approaches, and CART, ID3 and Decision Table. The outcome shows that the bagging technique is more accurate than Bayesian classification and J48.

Li and Reformat [13] discussed that software configuration management is a system that includes documents, software code, status accounting, design models and defect tracking, and also includes revision data.

Elkan [14] discussed that the COCOMO model produces accurate cost estimates; cost estimation has many aspects because project development involves many variables, so COCOMO measures in terms of effort and metrics.

Chang and Chu [15] discussed discovering patterns in large databases, and the relations between their variables, by means of association rules in data mining.

Kotsiantis and Kanellopoulos [16] discussed high-severity defects in software project development, and also discussed how patterns support prediction and how association rules reduce the number of passes over the database.

Pannurat, N. Kerdprasop and K. Kerdprasop [17] discussed that association rules reveal relationships within large data sets, such as the huge volume of software project terms and cost records, and are helpful in the project development process.

Fayyad, Piatetsky-Shapiro, Smyth and Uthurusamy [18] discussed that classification creates a relationship, or map, between a data item and predefined classes.

Pal [19] conducted a study on the student dropout rate by selecting 1650 students from different branches of an engineering college. The study found that the student's dropout rate in the engineering exam, high school grade, senior secondary exam grade, family annual income and mother's occupation were highly correlated with student academic performance.

Shtern and Tzerpos [20] discussed that in cluster analysis similar objects are placed in the same cluster, and attributes are sorted into groups so that the variation between clusters is maximized relative to the variation within clusters.

Runeson and Nyholm [21] discussed that code duplication is a language-independent problem. The same issue appears again and again in different problem reports during software development, and such duplication is handled using natural language processing with data mining.

Vishal and Gurpreet [22] discussed that data mining analyzes information and searches for hidden information in the text produced in software project development.

Lovedeep and Arti [23] discussed that data mining provides a specific platform for software engineering on which many tasks run easily with the best quality, reducing cost and high-profile problems.

Yadav and Pal [24] conducted a study using classification trees to predict student academic performance from the students' gender, admission type, previous school marks, medium of teaching, location of living, accommodation type, father's qualification, mother's qualification, father's occupation, mother's occupation, family annual income and so on. They achieved around 62.22%, 62.22% and 67.77% overall prediction accuracy using the ID3, CART and C4.5 decision tree algorithms respectively.

The present study proposes classification to obtain better rules and to decrease the error rate as much as possible. Several approaches are used for SW-bug detection from a software defect source, using different data mining techniques (ID3, J48, Naïve Bayes) and the Weka tool. The aim is to obtain more accurate results by classifying all observations of bugs.
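As a small worked example of the Bayes rule from Section 1.5, on which the Naïve Bayes classifier used in this study rests, the sketch below evaluates P(SW-bug | software defect). The function name and all three input probabilities are hypothetical, chosen only to show the arithmetic; they are not estimates from the study's data.

```python
def posterior(prior_h, likelihood_e_given_h, prob_e):
    """Bayes rule: P(H|E) = P(E|H) * P(H) / P(E)."""
    if prob_e <= 0:
        raise ValueError("P(E) must be positive")
    return likelihood_e_given_h * prior_h / prob_e


# Hypothetical probabilities, invented for illustration only.
p_sw_bug = 0.10               # P(SW-bug)
p_defect_given_sw_bug = 0.90  # P(software defect | SW-bug)
p_defect = 0.30               # P(software defect)

p_sw_bug_given_defect = posterior(p_sw_bug, p_defect_given_sw_bug, p_defect)
print(f"P(SW-bug | software defect) = {p_sw_bug_given_defect:.2f}")
```

With these made-up inputs, observing the evidence raises the probability of an SW-bug from the prior of 0.10 to a posterior of about 0.30.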


3. METHODOLOGY

3.1 Data Preparation

Table 1. Variables used in the computational technique

PROPERTY       DESCRIPTION
Source         Name of a project or department in MASC that raises the PR.
Bug Type       (SW-BUG) The bug is from the software code implementation.
Sample Size    61 in total: 6 BUG and 55 NON-BUG. The software bug-tracking
               system GNATS (a tracking system by GNU) is set up on the
               MASC intranet.

Dependent Variables
Bug (1)        BUG accepted
Non-Bug (0)    BUG not accepted

PROPERTY       VALUE                      DESCRIPTION
Severity       {1=Normal, 0=Serious}      Describes the severity of the PR
Priority       {0=Not, 1=High,            Describes the duration the
               2=Medium, 3=Low}           schedule permits
Time to Fix    {0=within two days,        Time taken to resolve the PR
               1=within one week,
               2=within two weeks,
               3=within three weeks,
               4=within four weeks,
               5=within five weeks}
Class          {0=SW-BUG,                 Category of bug classes
               1=DOC-BUG,
               2=Change request,
               3=Support,
               4=Mistake,
               5=Duplicate}

Software errors are recorded in problem reports, and all problem reports are grouped into two categories: recoverable and unrecoverable. In the recoverable group, an error is easily recovered automatically by the software. A software bug-tracking system, GNATS (a tracking system by GNU), is set up on the MASC intranet to collect and maintain all problem reports from every department of MASC. SW-bug is an input value for the class field; an SW-bug comes from the code implementation. To classify SW-bugs, several standard data mining tasks need to be carried out: data preprocessing, clustering, classification and association. The database is designed in MS Excel, MS Word 2010 and a database management system to store the collected data. The data is arranged according to the required format and structure and is converted to .csv (comma delimited) format for processing in Weka, where it describes a list of instances sharing a set of attributes.

3.2 Data Selection and Transformation
The variables are passed automatically to the computational technique to identify whether an instance is an SW-bug or a non-bug. The kappa statistic is a metric that compares the observed accuracy with the expected accuracy. It is used to evaluate a single classifier from its confusion matrix.

Fig 3: Instances classified by ID3 algorithm

Fig 4: Instances classified by J48 algorithm

Fig 5: Instances classified by Naïve Bayes algorithm
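The comparison of observed and expected accuracy behind the kappa statistic in Section 3.2 can be sketched as follows, using the standard Cohen's kappa formula (observed minus expected agreement, divided by one minus expected agreement). The function name is an assumption; the first matrix is the J48 confusion matrix reported in Table 4, while the second is a hypothetical matrix invented to show a case with misclassifications.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square confusion matrix.

    Rows are actual classes and columns are predicted classes.
    """
    total = sum(sum(row) for row in matrix)
    size = len(matrix)
    # Observed accuracy: share of instances on the diagonal.
    observed = sum(matrix[i][i] for i in range(size)) / total
    # Expected accuracy: chance agreement from row and column totals.
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(size)
    ) / (total * total)
    if expected == 1:
        raise ValueError("kappa is undefined when expected accuracy is 1")
    return (observed - expected) / (1 - expected)


# J48 confusion matrix from Table 4 (a = zero, b = one).
j48 = [[55, 0], [0, 6]]

# Hypothetical matrix with some errors, invented for illustration only.
with_errors = [[50, 5], [2, 4]]

kappa_j48 = cohen_kappa(j48)
kappa_errors = cohen_kappa(with_errors)
print(f"kappa (J48, Table 4): {kappa_j48:.2f}")
print(f"kappa (hypothetical): {kappa_errors:.2f}")
```

For the J48 matrix every instance lies on the diagonal, so the statistic equals 1, in line with the kappa value listed for J48 in Table 2.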


In general, positive = identified and negative = rejected. Therefore:

True positive = correctly identified.
False positive = incorrectly identified.
True negative = correctly rejected.
False negative = incorrectly rejected.

A confusion matrix is a specific table layout that allows visualization of the performance of an algorithm. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.

3.3 Implementation of Data Mining
This paper presents an approach to classifying SW-bugs: it designs, implements and evaluates a series of pattern classifiers and compares their performance on an online SW-bug data set. The classifiers are used to confirm whether a report is a bug. The paper uses the J48, ID3 and Naïve Bayes algorithms to improve the prediction accuracy. These techniques are of considerable use in identifying software bugs in very large data sets.

3.4 Result and Discussion
The proposed techniques, decision tree and Naïve Bayes, are included in the Weka tool. Data mining tools are software components, and the proposed tool, Weka, supports several data mining tasks. The techniques applied in this paper are decision trees (J48, ID3) and Naïve Bayes, because they are powerful classification algorithms.

From Table 2 it is clear that the kappa statistic has the same value for the J48, ID3 and NB algorithms. Naïve Bayes gives more error than the J48 and ID3 algorithms. ID3 and J48 take 0.2 sec to complete, whereas NB takes 0 seconds.

Table 2. Evaluation on test split

Detailed                            J48          ID3           Naïve Bayes
Correctly classified instances      61 (100%)    58 (95.08%)   61 (100%)
Incorrectly classified instances    0 (0%)       0 (0%)        0 (0%)
Kappa statistic                     1            1             1
Mean absolute error                 0            0             0.00
Root mean squared error             0            0             0.00
Relative absolute error             0%           0%            0.27%
Root relative squared error         0%           0%            0.84%
Time taken (seconds)                0.02         0.02          0
Total number of instances           61           61            61

Table 3. Detailed Accuracy by Class

              NAÏVE BAYES        ID3                J48
DETAILED      BUG    NON-BUG     BUG    NON-BUG     BUG    NON-BUG
TP RATE       1      1           1      1           1      1
FP RATE       0      0           0      0           0      0
PRECISION     1      1           1      1           1      1
RECALL        1      1           1      1           1      1
F-MEASURE     1      1           1      1           1      1
ROC           1      1           1      0.75        1      1

From Table 3 it is clear that J48 gives more correctly classified instances than the ID3 and NB algorithms. In binary classification, precision (the positive predictive value) is the fraction of retrieved instances that are relevant, while recall is the fraction of relevant instances that are retrieved. Both precision and recall are therefore based on an understanding and measure of relevance; Table 3 shows that recall has the same value for all three algorithms.

Table 4. Confusion Matrix

Confusion Matrix (J48)      Confusion Matrix (ID3)      Confusion Matrix (Naïve Bayes)
 a  b  <-- classified as     a  b  <-- classified as     a  b  <-- classified as
55  0 | a = zero            55  0 | a = zero            55  0 | a = zero
 0  6 | b = one              0  3 | b = one              0  6 | b = one

In Table 4, each column of a matrix represents the instances in a predicted class, while each row represents the instances in an actual class. The J48 confusion matrix shows 55 instances of class zero correctly classified and 0 misclassified, and 6 instances of class one correctly classified and 0 misclassified. The ID3 confusion matrix shows 55 correctly classified and 0 misclassified for class zero, and 3 correctly classified and 0 misclassified for class one. For NB, 55 are correctly classified and 0 misclassified for class zero, and 6 are correctly classified and 0 misclassified for class one. The analysis shows that J48 correctly classifies all 61 instances without error, a better value than the ID3 and NB algorithms.
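The per-class figures in Table 3 follow from the confusion matrices in Table 4 through the true/false positive and negative counts defined in Section 3.2. The sketch below is one way to derive them; the function name and the dictionary keys are assumptions, and the matrices are copied from Table 4.

```python
def per_class_metrics(matrix, positive):
    """TP rate (recall), FP rate, precision and F-measure for one class.

    Rows of the matrix are actual classes, columns are predicted classes;
    `positive` is the index of the class treated as positive.
    """
    size = len(matrix)
    tp = matrix[positive][positive]
    fn = sum(matrix[positive]) - tp
    fp = sum(matrix[i][positive] for i in range(size)) - tp
    tn = sum(sum(row) for row in matrix) - tp - fn - fp
    recall = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "tp_rate": recall,
        "fp_rate": fp_rate,
        "precision": precision,
        "f_measure": f_measure,
    }


# Confusion matrices from Table 4 (a = zero, b = one).
j48 = [[55, 0], [0, 6]]
id3 = [[55, 0], [0, 3]]

for index, name in enumerate(["zero", "one"]):
    print("J48", name, per_class_metrics(j48, index))

# ID3 places only 58 of the 61 instances in its matrix, all on the diagonal.
id3_correct = sum(id3[i][i] for i in range(len(id3)))
id3_share = id3_correct / 61
print(f"ID3 correctly classified: {id3_correct} of 61 ({id3_share:.2%})")
```

Because every instance in the J48 matrix lies on the diagonal, each J48 measure equals its ideal value (rates of 1 with an FP rate of 0), as listed in Table 3. The ID3 matrix sums to 58 rather than 61, which accounts for the 95.08% in Table 2 even though no instance is recorded as incorrectly classified.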


4. CONCLUSION
In Weka, all data are treated as instances with attributes, which makes analysis and evaluation easier. The results are partitioned into several sub-items. In the first part, correctly classified instances are reported as both a number and a percentage, whereas the kappa statistic, mean absolute error and root mean squared error are reported as numeric values only. For ID3 and J48 the time taken to build the model is 0.2 seconds, and the test mode is 10-fold cross-validation. Weka compares all required parameters on the given instances with the classifiers' respective accuracy and prediction rates. Table 2 clearly shows that J48 achieves the highest accuracy, 100% without error; Naïve Bayes also classifies 100% correctly but with some error, and ID3 classifies 95% correctly. J48 is therefore the most accurate of the three algorithms.

5. REFERENCES
[1] Humphrey W. S., “A Discipline for Software Engineering”, Reading, MA, Addison-Wesley, 1995.

[2] Sunita Tiwari and Neha Chaudhary, “Data Mining and Warehousing”, Dhanpat Rai and Co. (P) Ltd., First Edition, 2010.

[3] J. R. Quinlan, “C4.5: Programs for Machine Learning”, Morgan Kaufmann, San Francisco, 1993.

[4] M. Shepperd, C. Schofield and B. Kitchenham, “Effort estimation using analogy”, in Proceedings of the 18th International Conference on Software Engineering, pp. 170-178, Berlin, Germany, 1996.

[5] Alsmadi and Magel, “Open source evolution analysis”, in Proceedings of the 22nd IEEE International Conference on Software Maintenance (ICSM’06), Philadelphia, PA, USA, 2006.

[6] Boehm, Clark, Horowitz, Madachy, Selby and Westland, “Cost models for future software life cycle processes: COCOMO 2.0”, in Annals of Software Engineering, special volume on software process and product measurement, J. D. Arthur and S. M. Henry, Eds., vol. 1, pp. 45-60, J. C. Baltzer AG, Science Publishers, Amsterdam, The Netherlands, 1995.

[7] Pal A. K. and Pal S., “Analysis and Mining of Educational Data for Predicting the Performance of Students”, International Journal of Electronics Communication and Computer Engineering (IJECCE), Vol. 4, Issue 5, pp. 1560-1565, ISSN: 2278-4209, 2013.

[8] Ribu K., “Estimating object-oriented software projects with use cases”, M.S. thesis, University of Oslo, Department of Informatics, 2001.

[9] Nagwani N. and Verma S., “Prediction data mining model for software bug estimation using average weighted similarity”, in Proceedings of the Advance Computing Conference (IACC), 2010.

[10] Hassan, “The road ahead for mining software repositories”, in Proceedings of the Future of Software Maintenance at the 24th IEEE International Conference on Software Maintenance, 2008.

[11] Chaurasia V. and Pal S., “Data Mining Approach to Detect Heart Diseases”, International Journal of Advanced Computer Science and Information Technology (IJACSIT), Vol. 2, No. 4, 2013, pp. 56-66.

[12] Chaurasia V. and Pal S., “Early Prediction of Heart Diseases Using Data Mining Techniques”, Carib.j.SciTech, Vol. 1, pp. 208-217, 2013.

[13] Li and Reformat, “A practical method for the software fault prediction”, in Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI), 2007.

[14] Elkan C., “The foundations of cost-sensitive learning”, in Proceedings of the 17th International Conference on Machine Learning, 2001.

[15] Chang and Chu, “Software defect prediction using intertransaction association rule mining”, 2009.

[16] Kotsiantis and Kanellopoulos, “Association rules mining: A recent overview”, GESTS International Transactions on Computer Science and Engineering, 2006.

[17] Pannurat, Kerdprasop and Kerdprasop, “Database reverse engineering based on association rule mining”, IJCSI International Journal of Computer Science Issues, 2010.

[18] Fayyad, Piatetsky-Shapiro, Smyth and Uthurusamy, “Advances in Knowledge Discovery and Data Mining”, AAAI Press, 1996.

[19] Pal S., “Mining Educational Data to Reduce Dropout Rates of Engineering Students”, I.J. Information Engineering and Electronic Business (IJIEEB), Vol. 4, No. 2, 2012, pp. 1-7.

[20] Shtern and Tzerpos, “Clustering methodologies for software engineering”, review article, Advances in Software Engineering, 2012.

[21] Runeson and Nyholm, “Detection of duplicate defect reports using natural language processing”, in Proceedings of the 29th International Conference on Software Engineering, 2007.

[22] Vishal and Gurpreet, “A survey of text mining techniques and applications”, Journal of Emerging Technologies in Web Intelligence, 2009.

[23] Lovedeep and Varinder Kaur Arti, “Application of data mining techniques in software engineering”, International Journal of Electrical, Electronics and Computer Systems (IJEECS), Volume 2, Issue 5, 6, 2014.

[24] Yadav S. K. and Pal S., “Data Mining: A Prediction for Performance Improvement of Engineering Students using Classification”, World of Computer Science and Information Technology (WCSIT), 2(2), 51-56, 2012.
