
2011 15th European Conference on Software Maintenance and Reengineering

Comparing Mining Algorithms for Predicting the Severity of a Reported Bug

Ahmed Lamkanfi∗, Serge Demeyer∗, Quinten David Soetens∗, Tim Verdonck‡


∗ LORE - Lab On Reengineering — University of Antwerp, Belgium
‡ Department of Mathematics - Katholieke Universiteit Leuven, Belgium

Abstract—A critical item of a bug report is the so-called "severity", i.e. the impact the bug has on the successful execution of the software system. Consequently, tool support for the person reporting the bug in the form of a recommender or verification system is desirable. In previous work we made a first step towards such a tool: we demonstrated that text mining can predict the severity of a given bug report with a reasonable accuracy given a training set of sufficient size. In this paper we report on a follow-up study where we compare four well-known text mining algorithms (namely, Naïve Bayes, Naïve Bayes Multinomial, K-Nearest Neighbor and Support Vector Machines) with respect to accuracy and training set size. We discovered that for the cases under investigation (two open source systems: Eclipse and GNOME) Naïve Bayes Multinomial performs superior compared to the other proposed algorithms.

I. INTRODUCTION

During bug triaging, a software development team must decide how soon bugs need to be fixed, using categories like (P1) as soon as possible; (P2) before the next product release; (P3) may be postponed; (P4) bugs never to be fixed. This so-called priority assigned to a reported bug represents how urgent it is from a business perspective that the bug gets fixed. A malfunctioning feature used by many users, for instance, might be more urgent to fix than a system crash on an obscure platform only used by a tiny fraction of the user base. In addition to the priority, a software development team also keeps track of the so-called severity: the impact the bug has on the successful execution of the software system. While the priority of a bug is a relative assessment depending on the other reported bugs and the time until the next release, severity is an absolute classification. Ideally, different persons reporting the same bug should assign it the same severity. Consequently, software projects typically have clear guidelines on how to assign a severity to a bug. High severities typically represent fatal errors and crashes, whereas low severities mostly represent cosmetic issues. Depending on the project, several intermediate categories exist as well.

Despite their differing objectives, the severity is a critical factor in deciding the priority of a bug. And because the number of reported bugs is usually quite high¹, tool support to aid a development team in verifying the severity of a bug is desirable. Since bug reports typically come with textual descriptions, text mining algorithms are likely candidates for providing such support. Text mining techniques have been previously applied on these descriptions of bug reports to automate the bug triaging process [1, 2, 3] and to detect duplicate bug reports [4, 5]. Our hypothesis is that frequently used terms to describe bugs, like "crash" or "failure", serve as good indicators for the severity of a bug. Using a text mining approach, we envision a tool that — after a certain "training period" — provides a second opinion to be used by the development team for verification purposes.

In previous work, we demonstrated that given sufficient bug reports to train a classifier, we are able to predict the severity of unseen bug reports with a reasonable accuracy [6]. In this paper, we go one step further by investigating three additional research questions.

1) RQ1: What classification algorithm should we use when predicting bug report severity? We aim to compare the accuracy of the predictions made by the different algorithms; the algorithm with the best prediction capabilities is the best fit. We also compare the time needed to make the predictions, although at this stage this is less important, as the tools we used were not tuned for optimal performance.

2) RQ2: How many bug reports are necessary when training a classifier in order to have reliable predictions? Before a classifier can make predictions, it is necessary to sufficiently train the classifier so that it learns the underlying properties of bug reports.

3) RQ3: What can we deduce from the resulting classification algorithms? Once the different classification algorithms have been trained on the given data sets, we inspect the resulting parameters to see whether there are some generic properties that hold for different algorithms.

The paper itself is structured as follows. First, Section II provides the necessary background on the bug triaging process and an introduction to text mining. The experimental setup of our study is then presented in Section III, followed by the evaluation method in Section IV. Subsequently, we evaluate the experimental setup in Section V. After that, Section VI lists those issues that may pose a risk to the validity of our results, followed by Section VII discussing the related work of other researchers. Finally, Section VIII summarizes the results and points out future work.

¹ A software project like Eclipse received over 2,764 bug reports over a period of 3 months (between 01/10/2009 and 01/01/2010); GNOME received 3,263 reports over the same period.

II. BACKGROUND

In this section, we provide insight into how bugs are reported and managed within a software project. Then, we provide an introduction to data mining in general and how we can use text mining in the context of bug reports.

A. Bug reporting and triaging

A software bug is what software engineers commonly use to describe the occurrence of a fault in a software system. A fault is then defined as a mistake which causes the software to behave differently from its specifications [7]. Nowadays, users of software systems are encouraged to report the bugs they encounter, using bug tracking systems such as Jira [www.atlassian.com/software/jira] or Bugzilla [www.bugzilla.org]. Subsequently, the developers are able to make an effort to resolve these issues in future releases of their applications.

Bug reports exchange information between the users of a software project experiencing bugs and the developers correcting these faults. Such a report includes a one-line summary of the observed malfunction and a longer, more profound description, which may for instance include a stack trace. Typically the reporter also adds information about the particular product and component of the faulty software system: e.g., in the GNOME project, "Mailer" and "Calendar" are components of the product "Evolution", which is an application integrating mail, address-book and calendaring functionality for the users of GNOME.

Researchers have examined bug reports closely, looking for the typical characteristics of "good" reports, i.e., the ones providing sufficient information for the developers to be considered useful, which would in turn lead to an earlier fix of the reported bugs [8]. In this study, they concluded that developers consider information like "stack traces" and "steps to reproduce" most useful. Since this is fairly technical information to provide, there is unfortunately little knowledge on whether users submitting bug reports are capable of doing so. Nevertheless, we can make some educated assumptions. Users of technical software such as Eclipse and GNOME typically have more knowledge about software development, hence they are more likely to provide the necessary technical detail. Also, a user base which is heavily attached to the software system is more likely to help the developers by writing detailed bug reports.

B. Data mining and classification

Developers are overwhelmed with bug reports, so tool support to aid the development team with verifying the severity of a bug is desired. Data mining refers to extracting or "mining" knowledge from large amounts of data [9]. Through an analysis of a large amount of data — which can be either automatic or semi-automatic — data mining intends to assist humans in their decisions when solving a particular question. In this study, we use data mining techniques to assist both users and developers in respectively assigning and assessing the severity of reported bugs.

Document classification is widely studied in the data mining field. Classification or categorization is the process of automatically assigning a predefined category to a previously unseen document according to its topic [10]. For example, the popular news site Google News [news.google.com] uses a classification-based approach to order online news reports according to topics like entertainment, sports and others. Formally, a classifier is a function

f : Document → {c_1, ..., c_q}

mapping a document (e.g., a bug report) to a certain category in {c_1, ..., c_q} (e.g., {non-severe, severe}).

Many classification algorithms have been studied in the data mining field, each of which behaves differently in the same situation according to its specific characteristics. In this study, we compare four well-known classification algorithms (namely, Naïve Bayes, Naïve Bayes Multinomial, K-Nearest Neighbor, Support Vector Machines) on our problem to find out which particular algorithm is best suited for classifying bug reports in either a severe or a non-severe category.

C. Preprocessing of bug reports

Since each term occurring in the documents is considered an additional dimension, textual data can be of a very high dimensionality. Introducing preprocessing steps partly overcomes this problem by reducing the number of considered terms. An example of the effects of these preprocessing steps is shown in Table I.

Table I
EFFECTS OF THE PREPROCESSING STEPS

Original description        Evolution crashes trying to open calendar
After stop-words removal    Evolution crashes open calendar
After stemming              evolut crash open calendar

The typical preprocessing steps are the following:

• Tokenization: The process of tokenization consists of dividing a large textual string into a set of tokens, where a single token corresponds to a single term. This step also includes filtering out all meaningless symbols, such as punctuation marks, because these symbols do not contribute to the classification task. Also, all capitalized characters are replaced by their lower-cased equivalents.

• Stop-words removal: Human languages commonly make use of constructive terms like conjunctions, adverbs, prepositions and other language structures to build up sentences. Terms like "the", "in" and "that", also known as stop-words, do not carry much specific information in the context of a bug report. Moreover, these terms appear frequently in the descriptions of the bug reports and thus increase the dimensionality of the data, which in turn could decrease the accuracy of classification algorithms. This is sometimes also referred to as the curse of dimensionality. Therefore, all stop-words are removed from the set of tokens based on a list of known stop-words.

• Stemming: The stemming step aims at reducing each term appearing in the descriptions to its basic form. Each single term can be expressed in different forms but still carry the same specific information. For example, the terms "computerized", "computerize" and "computation" all share the same morphological base: "computer". A stemming algorithm like the Porter stemmer [11] transforms each term to its basic form.
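As an illustration, the sketch below applies the three steps to the summary from Table I. It is written in Python for readability; the experiments in this paper were run with WEKA, and both the small stop-word list and the NLTK implementation of the Porter stemmer are merely assumed stand-ins for the resources actually used.

    import re
    from nltk.stem import PorterStemmer  # Porter's algorithm [11]; NLTK is an assumed implementation

    # Illustrative stop-word list; real setups use a much longer standard list.
    STOP_WORDS = {"the", "in", "that", "a", "an", "to", "of", "and", "is", "trying"}

    def preprocess(summary):
        """Tokenize, lower-case, remove stop-words and stem a bug-report summary."""
        # Tokenization: split on anything that is not a letter and lower-case the tokens.
        tokens = [t.lower() for t in re.split(r"[^A-Za-z]+", summary) if t]
        # Stop-words removal: drop terms that carry little specific information.
        tokens = [t for t in tokens if t not in STOP_WORDS]
        # Stemming: reduce every remaining term to its morphological base.
        stemmer = PorterStemmer()
        return [stemmer.stem(t) for t in tokens]

    print(preprocess("Evolution crashes trying to open calendar"))
    # ['evolut', 'crash', 'open', 'calendar'], as in Table I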

III. EXPERIMENTAL SETUP

In this section, we first provide a step-by-step, in-depth clarification of the setup necessary to investigate the research questions. Then, we present the classification algorithms we will be using throughout this study. Since each of these algorithms requires different information to work on, we also discuss the various document indexing mechanisms necessary for each algorithm. Afterwards, we propose and motivate the cases we selected for the purpose of this study.

A. General approach

In this study, we rely on the assumption that reporters use potentially significant terms in their descriptions of bugs distinguishing non-severe from severe bugs. For example, when a reporter explicitly states that the application crashes all the time or that there is a typo in the application, then we assume that we are dealing with respectively a severe and a non-severe bug.

The bug reports we studied originated from Bugzilla bug tracking systems, where the severity varies between trivial, minor, normal, major, critical and blocker. There exist clear guidelines on how to assign the severity of a bug. Bugzilla also allows users to request features using the reporting mechanism, in the form of a report with severity "enhancement"; these are requests for new features. Such reports are not considered in this study since they technically do not represent real bug reports. In our approach, we treat the severities trivial and minor as non-severe, while reports with severity major, critical or blocker are considered severe bugs. Herraiz et al. proposed a similar grouping of severities [12]. In our case, the normal severity is deliberately not taken into account. First of all because these reports represent the grey zone, hence they might "confuse" the classifier. But more importantly, because in the cases we investigated this normal severity was the default option for selecting the severity when reporting a bug, and we suspected that many reporters just did not bother to consciously assess the bug severity. Manual sampling of bug reports confirmed this suspicion.

Of course, a prediction must be based on problem-domain specific assumptions. In this case, the predicted severity of a new report is based on properties observed in previous ones. Therefore, we use a classifier which learns the specific properties of bug reports from a history of reports where the severity of each report is known in advance. The classifier can then be deployed to predict the severity of a previously unseen bug report. The provided history of bug reports is also known as the training set. A separate evaluation set of reports is used to evaluate the accuracy of the classifier.

In Figure 1, we see the various steps in our approach. Each of these steps is further discussed in what follows.

Figure 1. Various steps of our approach: from the bug database, the reports of a component are (1) extracted and (2) preprocessed, (3) a classifier is trained on them, and (4) the severity (severe or non-severe) of a new report is predicted.

(1) Extract and organize bug reports: The terms used to describe bugs are most likely to be specific to the part of the system they are reporting about. Bug reports are typically organized according to the affected component and the corresponding product. In our previous work, we showed that classifiers trained with bug reports of one single component generally make more accurate predictions than classifiers trained without this distinction [6]. Therefore, the first step of our approach selects bug reports of a certain component from the collection of all available bug reports. From these bug reports, we extract the severity and the short description of the problem.

(2) Preprocessing the bug reports: To assure the optimal accuracy of the text mining algorithm, we apply the standard preprocessing steps for textual data (tokenization, stop-words removal and stemming) on the short descriptions.

(3) Training of the predictor: As is common in a classification experiment, we train our classifier using a set of bug reports of which the severity is known in advance (a training set). It is in this particular step that our classifier learns the specific properties of the bug reports.

(4) Predicting the severity: The classifier calculates estimates of the probability that a bug belongs to a certain class. Using these estimates, it then selects the class with the highest probability as its prediction.
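For concreteness, the following sketch strings steps (1) to (4) together. The experiments in this paper were carried out with WEKA; the sketch instead uses scikit-learn's CountVectorizer and MultinomialNB as stand-ins, reuses the preprocess() function sketched in Section II, and assumes a hypothetical load_reports() helper that returns objects with description and severity fields.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    SEVERE = {"major", "critical", "blocker"}
    NON_SEVERE = {"trivial", "minor"}          # "normal" is deliberately left out

    def label(severity):
        """Map a Bugzilla severity onto the two categories used in this study."""
        return "severe" if severity in SEVERE else "non-severe"

    # (1) Extract: bug reports of a single component (hypothetical helper).
    reports = load_reports(product="Evolution", component="Mailer")
    reports = [r for r in reports if r.severity in SEVERE | NON_SEVERE]

    # (2) Preprocess the short descriptions and build term-count vectors.
    texts = [" ".join(preprocess(r.description)) for r in reports]
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(texts)

    # (3) Train the classifier on reports whose severity is known in advance.
    y = [label(r.severity) for r in reports]
    classifier = MultinomialNB().fit(X, y)

    # (4) Predict: the class with the highest estimated probability wins.
    new = vectorizer.transform([" ".join(preprocess("evolution hangs when opening the imap inbox"))])
    print(dict(zip(classifier.classes_, classifier.predict_proba(new)[0])))
    print(classifier.predict(new)[0])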

In this study, we performed our experiments using the popular WEKA [http://www.cs.waikato.ac.nz/ml/weka/] tool, an open-source tool implementing many mining algorithms.

B. Classifiers

In the data mining community, a variety of classification algorithms have been proposed. In this study, we apply a series of these algorithms to our problem with the intention of finding out which one is best suited for classifying bug reports in either a severe or a non-severe category. Next, we briefly discuss the underlying principles of each algorithm we use throughout this study.

(1) Naïve Bayes: The Naïve Bayes classifier has found its way into many applications due to its simple principle yet powerful accuracy [13]. Bayesian classifiers are based on a statistical principle: the presence or absence of a word in a textual document determines the outcome of the prediction. In other words, each processed term is assigned a probability that it belongs to a certain category. This probability is calculated from the occurrences of the term in the training documents, where the categories are already known. When all these probabilities are calculated, a new document can be classified according to the sum of the probabilities for each category of each term occurring within the document. However, this classifier does not take the number of occurrences into account, which is a potentially useful additional source of information. Such classifiers are called "naïve" because the algorithm assumes that all terms occur independently from each other, which is often obviously false [10].

(2) Naïve Bayes Multinomial: This classifier is similar to the previous one, but now the category is not solely determined by the presence or absence of a term, but also by the number of occurrences of the terms. In general, this classifier performs better than the original one, especially when the total number of distinct terms is large.

(3) 1-Nearest Neighbor: A K-Nearest Neighbor classifier compares a new document to all other documents in the training set. The outcome of the prediction is then based on the prominent category within the K most similar documents from the training set (the so-called neighbors). However, the question remains: how do we compare documents and rate their similarity? Suppose the total number of distinct terms occurring in all documents is n. Then each document can be represented as a vector of n terms. Mathematically, each document is represented as a single vector in an n-dimensional space. In this space, we are able to calculate a distance measure (e.g., the Euclidean distance) between two vectors. This distance is then subsequently regarded as a similarity measure between two documents. In this study, we use a 1-Nearest Neighbor classifier, where a newly reported bug is categorized according to the severity of the most similar report from the training set.

(4) Support Vector Machines: This classifier represents the documents in a higher dimensional space, similar to the K-Nearest Neighbor classifier. Within this space, the algorithm tries to separate the documents according to their category. In other words, the algorithm searches for hyperplanes between the document vectors separating the different categories in an optimal way. Based on the subspaces separated by the hyperplanes, a new bug report is assigned a severity according to the subspace it belongs to. Both theoretical and empirical evidence show that Support Vector Machines are very well suited for text categorization [14]. Many variations exist when using Support Vector Machines. In this study, we use a Radial Basis Function (RBF) kernel with the parameters cost and gamma set to 100.0 and 0.001 respectively.

C. Document representation

As mentioned previously, each text document is represented using a vector of n terms. This means that when we have, for example, m documents in our collection, we actually have m vectors of length n. In other words, an m × n matrix represents all documents. This is also known as the Vector-Space model. Now, the value of each term in a document vector remains to be determined. This can be done in several ways, as we demonstrate below:

(1) Binary representation: In this representation, the value of a term within a document varies between {0,1}. The value 0 simply denotes the fact that a particular term is absent in the current document, while the value 1 denotes its presence. This representation especially fits the Naïve Bayes classifier, since it only depends on the absence or presence of a term.

(2) Term Count representation: Similar to the binary representation, but now we count the number of occurrences as the value for a term within the document. We use this representation in combination with the Naïve Bayes Multinomial classifier.

(3) Term Frequency - Inverse Document Frequency representation: The Term Frequency denotes the number of occurrences, which we also normalize to prevent a bias towards longer documents. Suppose we have a set of documents D = {d_1, ..., d_m} where document d_j is a set of terms d_j = {t_1, ..., t_k}. We can then define the Term Frequency for term t_i in document d_j as follows:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

where n_{i,j} refers to the number of occurrences of the term t_i in document d_j and Σ_k n_{k,j} denotes the total number of terms in that particular document.

Besides the Term Frequency, we also have the Inverse Document Frequency, which represents the "scaling" factor, or the importance, of a particular term. If a term appears in many documents, its importance will subsequently decrease. The Inverse Document Frequency for term t_i is defined as follows:

idf_i = log( |D| / |{d ∈ D : t_i ∈ d}| )

where |D| refers to the total number of documents and |{d ∈ D : t_i ∈ d}| denotes the number of documents containing term t_i.

The so-called tf-idf value for each term t_i in document d_j is now calculated as follows:

tf-idf_{i,j} = tf_{i,j} × idf_i

We use this representation in combination with the 1-Nearest Neighbor and Support Vector Machines classifiers.
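The three definitions translate directly into code. The short sketch below computes tf-idf weights for a toy collection of preprocessed summaries in plain Python; it mirrors the formulas above rather than the document-indexing machinery of any particular tool, and the example documents are invented.

    import math
    from collections import Counter

    def tf(term, document):
        """Term Frequency: occurrences of term in the document, normalized by its length."""
        counts = Counter(document)
        return counts[term] / sum(counts.values())

    def idf(term, documents):
        """Inverse Document Frequency: log of |D| over the number of documents containing term."""
        containing = sum(1 for d in documents if term in d)
        return math.log(len(documents) / containing)

    def tf_idf(term, document, documents):
        return tf(term, document) * idf(term, documents)

    # Toy collection of three preprocessed bug-report summaries.
    docs = [
        ["evolut", "crash", "open", "calendar"],
        ["crash", "start", "evolut"],
        ["typo", "label", "prefer", "dialog"],
    ]
    print(tf_idf("crash", docs[0], docs))     # "crash" occurs in two of the three documents
    print(tf_idf("calendar", docs[0], docs))  # "calendar" is rarer, so its idf is higher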

D. Case selection

Throughout this study, we will be using bug reports from two major open-source projects to evaluate our experiment: Eclipse and GNOME. Both projects use Bugzilla as their bug tracking system.

Eclipse: [http://bugs.eclipse.org/bugs] Eclipse is an open-source integrated development environment widely used in both open-source and industrial settings. The bug database contains over 200,000 bug reports submitted in the period 2001-2010. Eclipse is a technical application used by developers themselves, so we expect the bug reports to be quite detailed and "good" (as defined by Bettenburg et al. [8]).

GNOME: [http://bugzilla.gnome.org] GNOME is an open-source desktop environment developed for Unix-based operating systems. In this case we have over 450,000 reported bugs available, submitted in the period 1998-2009. GNOME was selected primarily because it was part of the MSR 2010 mining challenge [http://msr.uwaterloo.ca/msr2010/challenge/]. As such, the community agreed that this is a worthwhile case to investigate. Moreover, results we obtained here might be compared against results obtained by other researchers.

The components we selected for both cases in this study are presented in Table II, along with the total number of non-severe and severe reported bugs.

Table II
BASIC NUMBERS ABOUT THE SELECTED COMPONENTS FOR RESPECTIVELY ECLIPSE AND GNOME

Product      Name             Non-severe bugs   Severe bugs
Eclipse      SWT              696               3218
Eclipse      User Interface   1485              3351
JDT          User Interface   1470              1554
Eclipse      Debug            327               485
CDT          Debug            60                205
GEF          Draw2D           36                83
Evolution    Mailer           2537              7291
Evolution    Calendar         619               2661
GNOME        Panel            332               1297
Metacity     General          331               293
GStreamer    Core             93                352
Nautilus     CD-Burner        73                355

IV. EVALUATION

After we have trained a classifier, we would like to estimate how accurately the classifier will predict future bug reports. In this section, we discuss the different steps necessary to properly evaluate the classifiers.

A. Training and testing

In this study, we apply the widely used K-Fold Cross-Validation approach. This approach first splits up the collection of bug reports into disjoint training and testing sets, then trains the classifier using the training set. Finally, the classifier is executed on the evaluation set and accuracy results are gathered. These steps are executed K times, hence the name K-Fold Cross-Validation. For example, in the case of 10-fold cross-validation, the complete set of available bug reports is first split randomly into 10 subsets. These subsets are split in a stratified manner, meaning that the distribution of the severities in the subsets respects the distribution of the severities in the complete set of bug reports. Then, the classifier is trained using only 9 of the subsets and executed on the remaining subset, where the accuracy metrics are calculated. This process is repeated until each subset has been used for evaluation purposes. Finally, the accuracy metrics of each step are averaged to obtain the final evaluation.

B. Accuracy metrics

When considering a prediction of our classifier, we can have four possible outcomes. For instance, a severe bug is predicted correctly as a severe or incorrectly as a non-severe bug. For the prediction of a non-severe bug, it is the other way around. We summarize these correct and faulty predictions using a single matrix, also known as the confusion matrix, as presented in Table III. This matrix provides a basis for the calculation of many accuracy metrics.

Table III
CONFUSION MATRIX USED FOR MANY ACCURACY METRICS

                              Correct severity
                              non-severe             severe
Predicted    non-severe       tp: true positives     fp: false positives
severity     severe           fn: false negatives    tn: true negatives

The general way of calculating the accuracy is by calculating the percentage of bug reports from the evaluation set that are correctly classified. Similarly, precision and recall are widely used as evaluation measures.
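As a small worked example, the sketch below derives these measures from the four cells of Table III; the counts are invented and, following the layout of the table, "non-severe" plays the role of the positive class.

    # Invented cell counts for the confusion matrix of Table III.
    tp = 80   # non-severe reports correctly predicted as non-severe
    fp = 15   # severe reports incorrectly predicted as non-severe
    fn = 20   # non-severe reports incorrectly predicted as severe
    tn = 85   # severe reports correctly predicted as severe

    accuracy = (tp + tn) / (tp + fp + fn + tn)   # fraction of all predictions that are correct
    precision = tp / (tp + fp)                   # how many predicted non-severe reports really are non-severe
    recall = tp / (tp + fn)                      # how many truly non-severe reports were recognized

    print(f"accuracy={accuracy:.2f}  precision={precision:.2f}  recall={recall:.2f}")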

However, these measures are not fit when dealing with data that has an unbalanced category distribution, because of the dominating effect of the major category [15]. Furthermore, most classifiers also produce probability estimates of their classifications. These estimates also contain interesting evaluation information, but unfortunately they are ignored when using the standard accuracy, precision and recall approaches [15].

We opted to use the Receiver Operating Characteristic, or simply ROC, graph as an evaluation method, as this is a better way for not only evaluating classifier accuracy, but also allows for an easier comparison of different classification algorithms [16]. Additionally, this approach does take probability estimates into account. In this graph, the rate of true positives (TPR) is compared against the rate of false positives (FPR) in a two-dimensional coordinate system, where:

TPR = tp / (total number of positives) = tp / (tp + fn)

FPR = fp / (total number of negatives) = fp / (fp + tn)

A ROC curve close to the diagonal of the graph indicates random guesses made by the classifier. In order to optimize the accuracy of a classifier, we aim for classifiers with a ROC curve as close as possible to the coordinate (0,1) in the graph, i.e., the upper-left corner. For example, in Figure 2 we see the ROC curves of three different classifiers. We can observe that Classifier 1 demonstrates random behavior. We also notice that Classifier 2 performs better than random predictions, but not as good as Classifier 3.

Figure 2. Example of ROC curves: three classifiers plotted as true positive rate against false positive rate, with Classifier 1 close to the diagonal, Classifier 2 in between, and Classifier 3 closest to the upper-left corner.

Comparing curves visually can be a cumbersome activity, especially when the curves are close together. Therefore, the area beneath the ROC curve is calculated, which serves as a single number expressing the accuracy. If the Area Under Curve (AUC) is close to 0.5 then the classifier is practically random, whereas a number close to 1.0 means that the classifier makes practically perfect predictions. This number allows more rational discussions when comparing the accuracy of different classifiers.
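To make the construction of such a curve concrete, the sketch below sweeps a decision threshold over the probability estimates of a classifier, computes the (FPR, TPR) points with the formulas above, and integrates the curve with the trapezoidal rule to obtain the AUC. The labels and scores are invented; 1 stands for a severe report and the scores are the estimated probabilities of that class.

    def roc_points(labels, scores):
        """(FPR, TPR) points obtained by sweeping a threshold over the scores."""
        positives = sum(labels)
        negatives = len(labels) - positives
        points = [(0.0, 0.0)]
        for threshold in sorted(set(scores), reverse=True):
            tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
            fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
            points.append((fp / negatives, tp / positives))
        return points

    def area_under_curve(points):
        """Area under the ROC curve, computed with the trapezoidal rule."""
        return sum((x2 - x1) * (y1 + y2) / 2
                   for (x1, y1), (x2, y2) in zip(points, points[1:]))

    labels = [1, 1, 0, 1, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
    print(area_under_curve(roc_points(labels, scores)))
    # 0.75 for this toy ranking; 1.0 would be perfect, 0.5 practically random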

Figure 3. ROC curves of predictors based on different algorithms (NB, NB Multinomial, 1-NN and SVM), plotting true positive rate against false positive rate: (a) Eclipse/SWT, (b) Evolution/Mailer.

V. RESULTS AND DISCUSSIONS

We explore different classification algorithms and compare the resulting accuracies with each other. This gives us the opportunity to select the most accurate classifier. Furthermore, we investigate what underlying properties the most accurate classifier has learned from the bug reports. This will give us a better understanding of the key indicators of both non-severe and severe ranked bug reports. Finally, we examine how many bug reports we actually need for training in order to obtain a classifier with a reasonable accuracy.

A. What classification algorithm should we use?

In this study, we investigated different classification algorithms: Naïve Bayes, Naïve Bayes Multinomial, Support Vector Machines and Nearest-Neighbor classification. Each classifier is based on different principles, which results in varying accuracies. Therefore, we compare the levels of accuracy of the algorithms.

In Figure 3 (a) and (b) we see the ROC curves of each algorithm, denoting its accuracy for respectively an Eclipse and a GNOME case. Remember, the nearer a curve is to the upper-left side of the graph, the more accurate the predictions are. From Figure 3, we notice a winner in both cases: the Naïve Bayes Multinomial classifier. At the same time, we also observe that the Support Vector Machines classifier is nearly as accurate as the Naïve Bayes Multinomial classifier. Furthermore, we notice that the accuracy decreases in the case of the standard Naïve Bayes classifier. Lastly, we see that the 1-Nearest Neighbor based approach tends to be the least accurate classifier.

Table IV
AREA UNDER CURVE RESULTS FROM THE DIFFERENT COMPONENTS

Product / Component     NB     NB Mult.   1-NN   SVM
Eclipse / SWT           0.74   0.83       0.62   0.76
JDT / UI                0.69   0.75       0.63   0.71
Eclipse / UI            0.70   0.80       0.61   0.79
Eclipse / Debug         0.72   0.76       0.67   0.73
GEF / Draw2D            0.59   0.55       0.51   0.48
CDT / Debug             0.68   0.70       0.52   0.69
Evolution / Mailer      0.84   0.89       0.73   0.87
Evolution / Calendar    0.86   0.90       0.78   0.86
GNOME / Panel           0.89   0.90       0.78   0.86
Metacity / General      0.72   0.76       0.69   0.71
GStreamer / Core        0.74   0.76       0.65   0.73
Nautilus / CDBurner     0.93   0.93       0.81   0.91

The same conclusions can also be drawn from the other selected cases, based on an analysis of the Area Under Curve measures in Table IV. From these results, we indeed notice that the Naïve Bayes Multinomial classifier is the most accurate in all but one single case: the GEF / Draw2D case. In this case, the number of available bug reports for training is rather low and thus we are dealing with an insufficiently trained classifier, naturally resulting in a poor accuracy. In this table we can also see that the classifier based on Support Vector Machines has an accuracy nearly as good as the Naïve Bayes Multinomial classifier. Furthermore, we see that the standard Naïve Bayes and 1-Nearest Neighbor classifiers are a less accurate approach.

Table V shows some rough numbers about the execution times of the whole 10-Fold Cross-Validation experiment. Here, we see that Naïve Bayes Multinomial outperforms the other classifiers in terms of speed. The standard Naïve Bayes and Support Vector Machines tend to take the most time.

Table V
SECONDS NEEDED TO EXECUTE THE WHOLE EXPERIMENT

Product / Component   NB    NB Mult.   1-NN   SVM
Eclipse / UI          73    1          10     47
GNOME / Mailer        161   2          27     232

The Naïve Bayes Multinomial classifier has the best accuracy (as measured by ROC) and is also the fastest when classifying the severity of reported bugs.

B. How many bug reports do we need for training?

Besides the accuracy of the different classifiers, the number of bug reports in the training set also plays an important role in our evaluation. Since the number of bug reports available across the components in a project can be rather low, we aim for the minimal amount of bug reports for training without losing too much accuracy. Therefore, we investigate the effect of the size of the training set on the overall accuracy of the classifiers using a so-called learning curve. A learning curve is a graphical representation where the x-axis represents the number of training instances and the y-axis denotes the accuracy of the classifier expressed using the Area Under Curve measure. The learning curves of the classifiers are partly shown in Figure 4; the right part of each graph is omitted since the curve remains rather stable there. Furthermore, we are especially interested in the left part of the curves.

In the case of the JDT - User Interface component, we observe that for all classifiers the AUC measure stabilizes when the training set contains approximately 400 bug reports (this means about 200 reports of each severity), except in the case of the classifier based on Support Vector Machines. The same holds for the Evolution - Mailer component, besides the fact that we need more bug reports here: approximately 700 reports. This higher number can be explained by the fact that the distribution of the severities in the training sets respects the distribution of the global set of bug reports. To clarify, this component contains almost three times more severe bugs than non-severe ones, and thus when we have 700 bug reports, approximately 250 are non-severe while 450 are severe. Therefore, we assume that we need a minimal number of around 250 bug reports of each severity when we aim to have a reliably stable classifier, though we will need to investigate more cases to find out whether this assumption holds. This is particularly the case for the Naïve Bayes and Naïve Bayes Multinomial classifiers. Furthermore, we notice that more available bug reports for training results in more accurate classifiers.

The Naïve Bayes and Naïve Bayes Multinomial classifiers are able to achieve stable accuracy the fastest. Also, we need about 250 bug reports of each severity when we train our classifier.
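A learning curve like the ones in Figure 4 can be approximated with a few lines of code. The sketch below holds out a fixed stratified evaluation set, trains a Multinomial Naïve Bayes classifier on increasingly large portions of the remaining reports, and records the AUC for each size. It uses scikit-learn as a stand-in for the WEKA setup of our experiments; X is assumed to be a document-term matrix like the one built in the earlier sketch and y a list of 0/1 severity labels (1 for severe).

    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB

    def learning_curve_points(X, y, sizes, seed=0):
        """(training size, AUC) pairs for a Multinomial Naive Bayes classifier."""
        X_train, X_eval, y_train, y_eval = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=seed)
        points = []
        for n in sizes:
            model = MultinomialNB().fit(X_train[:n], y_train[:n])
            scores = model.predict_proba(X_eval)[:, 1]   # estimated probability of class 1 (severe)
            points.append((n, roc_auc_score(y_eval, scores)))
        return points

    # e.g. learning_curve_points(X, y, sizes=range(100, 1501, 100))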

Figure 4. Learning curves (Area Under Curve against the number of training instances) for NB, NB Multinomial, 1-NN and SVM: (a) learning curve of JDT/UI, (b) learning curve of Evolution/Mailer.

C. What can we learn from the classification algorithm?

A classifier extracts properties from the bug reports in the training phase. These properties are usually in the form of probability values for each term, estimating the probability that the term appears in a non-severe or severe reported bug. These estimates can give us more understanding about the specific choice of words reporters use when expressing severe or non-severe problems. In Table VI, we reveal these terms, which we extracted from the resulting Naïve Bayes Multinomial classifier of two cases.

Table VI
TOP MOST SIGNIFICANT TERMS OF EACH SEVERITY
Eclipse JDT UI
  Non-severe: quick, fix, dialog, type, gener, code, set, javadoc, wizard, mnemon, messag, prefer, import, method, manipul, button, warn, page, miss, wrong, extract, label, add, quickfix, pref, constant, enabl, icon, paramet, constructor
  Severe: npe, java, file, package, open, junit, eclips, editor, folder, problem, project, cannot, delete, rename, error, view, search, fail, intern, broken, run, explore, cause, perspect, jdt, classpath, hang, resourc, save, crash

Evolution Mailer
  Non-severe: message, not, button, change, dialog, display, doesnt, header, list, search, select, show, signature, text, unread, view, window, load, pad, ad, content, startup, subscribe, another, encrypt, import, press, print, sometimes
  Severe: crash, evolut, mail, email, imap, click, evo, inbox, mailer, open, read, server, start, hang, bodi, cancel, onli, appli, junk, make, prefer, tree, user, automat, mode, sourc, sigsegv, warn, segment

When we have a look at Table VI, we notice that some terms conform to our expectations: "crash", "hang", "npe" (null pointer exception), "fail" and the like are good indicators of severe bugs. This is less obvious when we investigate the typical non-severe terms. This can be explained by the origins of severe bugs, which are typically easier to describe using specific terms. For example, the application crashes or there is a memory issue. These situations are easily described using specific, powerful terms like "crash" or "npe". This is less obvious in the case of non-severe indicators, since they typically describe cosmetic issues; in this case, reporters use less common terms to describe the nature of the problem. Furthermore, we also notice from Table VI that the terms tend to vary for each component and thus are component-specific indicators of the severity. This suggests that our approach, where we train classifiers on a component basis, is sound.

Each component tends to have its own particular way of describing severe and non-severe bugs. Thus, terms which are good indicators of the severity are usually component-specific.
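Such indicator terms can be read off a trained model directly. The sketch below ranks terms by the difference of their per-class log-probabilities in a trained scikit-learn MultinomialNB, reusing the vectorizer and classifier of the earlier sketches as a stand-in for the WEKA models behind Table VI.

    import numpy as np

    def top_terms(classifier, vectorizer, k=10):
        """Terms whose estimated log-probability differs most between the two classes."""
        terms = np.array(vectorizer.get_feature_names_out())
        # feature_log_prob_ holds one row of per-term log-probabilities per class;
        # classifier.classes_ gives the row order (here assumed ["non-severe", "severe"]).
        log_ratio = classifier.feature_log_prob_[1] - classifier.feature_log_prob_[0]
        ranking = np.argsort(log_ratio)
        return {
            "non-severe": list(terms[ranking[:k]]),      # terms pulling towards the first class
            "severe": list(terms[ranking[-k:][::-1]]),   # terms pulling towards the second class
        }

    # e.g. top_terms(classifier, vectorizer, k=30)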

VI. THREATS TO VALIDITY

In this section we identify factors that may jeopardize the validity of our results and the actions we took to reduce or alleviate the risk. Consistent with the guidelines for case study research (see [17, 18]), we organize them in four categories.

Construct Validity: We have trained our classifier per component, assuming that the special terminology used per component will result in a better prediction. However, bug reporters have confirmed that providing the "component" field in a bug report is notoriously difficult [8], hence we risk that the users interpreted these categories in different ways than intended. We alleviated this risk by selecting those components with a significant number of bug reports.

Internal Validity: Our approach relies heavily on the presence of a causal relationship between the contents of the fields in the bug report and the severity of the bug. There is empirical evidence that this causal relationship indeed holds (see for instance [19]). Nevertheless, software developers and bug reporters confirmed that other fields in the bug report are more important, which may be a confounding factor [8].

External Validity: In this study, we focused on the bug reports of two software projects: Eclipse and GNOME. Like in any other empirical study, the results obtained from our presented approach are therefore not guaranteed to hold for other software projects. However, we selected the cases to represent worthwhile points in the universe of software projects, representing sufficiently different characteristics to warrant comparison. For instance, Eclipse was selected because its user base consists mostly of developers, hence it is likely to have "good" bug reports.

The bug reports used in our approach are extracted from cases that use Bugzilla as their bug tracking system. Other bug tracking systems like Jira and CollabNet also exist. However, since they potentially use other representations for bug reports, it may be possible that the approach must be adapted to the context of other bug tracking systems.

Reliability: Since we use the bug reports submitted by the community for both training and evaluation purposes, it is not guaranteed that the severities of these reports are entered correctly. Users fill in the reports and the severity according to their understanding and experience, which does not necessarily correspond with the guidelines. We explicitly omitted the bugs with severity "normal", since this category corresponds to the default severity when submitting a bug and is thus likely to be unreliable.

The tools we used to process the data might contain errors. We implemented our approach using the widely used open-source data mining tool WEKA [http://www.cs.waikato.ac.nz/ml/weka/]. Hence we believe this risk to be acceptable.

VII. RELATED WORK

At the moment, we are only aware of one other study on the automatic prediction of the severity of reported bugs. Menzies et al. predict the severity based on a rule learning technique which also uses the textual descriptions of reported bugs [20]. The approach was applied on five projects supplied by NASA's Independent Verification and Validation Facility. In this case study, the authors have shown that it is feasible to predict the severity of bug reports using a text mining technique, even for a more fine-grained categorization than we do (the paper distinguishes between 5 severity levels, of which 4 were included in the paper). Since they were forced to use smaller training sets than us (the data set sizes ranged from 1 to 617 bug reports per severity), they also reported precision and recall values that varied a lot (precision between 0.08 and 0.91; recall between 0.59 and 1.00). This suggests that the training sets indeed must be sufficiently large to arrive at stable results.

Antoniol et al. also used text mining techniques on the descriptions of reported bugs, to predict whether a report is either a real bug or a feature request [21]. They used techniques like decision trees, logistic regression and also a Naïve Bayesian classifier for this purpose. The performance of this approach on three cases (Mozilla, Eclipse and JBoss) indicated that reports can be predicted to be a bug or an enhancement with between 77% and 82% correct decisions.

Other current research concerning bug characterization and prediction mainly applies text mining techniques on the descriptions of bug reports. This work can be divided into two groups: automatically assigning newly reported bugs to an appropriate developer based on his or her expertise, and detecting duplicate bug reports.

A. Automatic bug assignment

Machine learning techniques are used to predict the most appropriate developer for resolving a new incoming bug report. This way, bug triagers are assisted in their task. Cubranic et al. trained a Naïve Bayes classifier with the history of the developers who solved the bugs as the category and the corresponding descriptions of the bug reports as the data [3]. This classifier is subsequently used to predict the most appropriate developer for a newly reported bug. Over 30% of the incoming bug reports of the Eclipse project are assigned to a correct developer using this approach.

Anvik et al. continued investigating the topic of the previous work and performed new experiments in the context of automatic bug assignment. The new experiments introduced more extensive preprocessing on the data and additional classification algorithms, like Support Vector Machines. In this study, they obtained an overall classification accuracy of 57% and 64% for the Eclipse and Firefox projects respectively [1].

B. Duplicate bug report detection

Since the community behind a project is in some cases very large, it is possible for multiple users to report the same bug into the bug tracking system. This leads to multiple bug reports describing the same bug. These "duplicate" bug reports result in more triaging work. Runeson et al. used text similarity techniques to help automate the detection of duplicate bug reports by comparing the similarities between bug reports [4]. In this instance, the description was used to calculate the similarity between bug reports. Using this approach, over 40% of the duplicate bug reports are correctly detected.

Wang et al. consider not only the actual bug reports, but also include "execution information" of a program, such as execution traces [5]. This additional information reflects the situation that led to the bug and can therefore reveal buggy runs. Adding structured and unambiguous information to the bug reports and comparing it to others improves the overall performance of the duplicate bug report detection technique.

VIII. CONCLUSIONS AND FUTURE WORK

A critical item of a bug report is the so-called "severity", and consequently tool support for the person reporting the

bug in the form of a recommender or verification system is desirable. This paper compares four well-known document classification algorithms (namely, Naïve Bayes, Naïve Bayes Multinomial, K-Nearest Neighbor, Support Vector Machines) to find out which particular algorithm is best suited for classifying bug reports in either a "severe" or a "non-severe" category. We found out that for the cases under investigation (two open source systems: Eclipse and GNOME), Naïve Bayes Multinomial is the classifier with the best accuracy as measured by the Receiver Operating Characteristic. Moreover, Naïve Bayes Multinomial is also the fastest of the four and it requires the smallest training set. Therefore we conclude that Naïve Bayes Multinomial is best suited for the purpose of classifying bug reports.

We could also deduce from the resulting classifiers that the terms indicating the severity of a bug report are component dependent. This supports our approach of training classifiers on a component basis.

This study is relevant because it enables us to implement a more automated and more efficient bug triaging process. It can also contribute to the current research regarding bug triaging. We see trends concentrating on automating the triaging process, where this current research can be combined with our approach with the intention of improving the overall reliability of a more automated triaging process.

Future work is aimed at including additional sources of data to support our predictions. Information from the (longer) description will be more thoroughly preprocessed so that it can be used for the predictions. Also, we will investigate other cases, where fewer bug reports get submitted but where the bug reports get reviewed consciously.

ACKNOWLEDGMENTS

This work has been carried out in the context of a Ph.D. grant of the Institute for the Promotion of Innovation through Science and Technology in Flanders (IWT-Vlaanderen). Additional sponsoring by (i) the Interuniversity Attraction Poles Programme - Belgian State - Belgian Science Policy, project MoVES; (ii) the Research Foundation - Flanders (FWO) sponsoring a sabbatical leave of Prof. Serge Demeyer.

REFERENCES

[1] J. Anvik, L. Hiew, and G. C. Murphy, "Who should fix this bug?" in Proceedings of the 28th International Conference on Software Engineering, 2006.
[2] J. Gaeul, K. Sunghun, and T. Zimmermann, "Improving bug triage with bug tossing graphs," in Proceedings of the European Software Engineering Conference 2009. ACM, 2009, pp. 111-120.
[3] D. Cubranic and G. C. Murphy, "Automatic bug triage using text categorization," in Proceedings of the Sixteenth International Conference on Software Engineering & Knowledge Engineering, June 2004, pp. 92-97.
[4] P. Runeson, M. Alexandersson, and O. Nyholm, "Detection of duplicate defect reports using natural language processing," in Proceedings of the 29th International Conference on Software Engineering, 2007.
[5] X. Wang, L. Zhang, T. Xie, J. Anvik, and J. Sun, "An approach to detecting duplicate bug reports using natural language and execution information," in Proceedings of the 30th International Conference on Software Engineering, 2008.
[6] A. Lamkanfi, S. Demeyer, E. Giger, and B. Goethals, "Predicting the severity of a reported bug," in Mining Software Repositories, 2010, pp. 1-10.
[7] R. Patton, Software Testing (2nd Edition). Sams, 2005.
[8] N. Bettenburg, S. Just, A. Schröter, C. Weiss, R. Premraj, and T. Zimmermann, "What makes a good bug report?" in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 2008, pp. 308-318.
[9] J. Han, Data Mining: Concepts and Techniques. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2005.
[10] R. Feldman and J. Sanger, The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Cambridge University Press, December 2006.
[11] M. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.
[12] I. Herraiz, D. German, J. Gonzalez-Barahona, and G. Robles, "Towards a simplification of the bug report form in Eclipse," in 5th International Working Conference on Mining Software Repositories, May 2008.
[13] I. Rish, "An empirical study of the naïve Bayes classifier," in Workshop on Empirical Methods in AI, 2001.
[14] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," Universität Dortmund, LS VIII-Report, Tech. Rep. 23, 1997.
[15] C. G. Weng and J. Poon, "A new evaluation measure for imbalanced datasets," in Seventh Australasian Data Mining Conference (AusDM 2008), ser. CRPIT, J. F. Roddick, J. Li, P. Christen, and P. J. Kennedy, Eds., vol. 87. Glenelg, South Australia: ACS, 2008, pp. 27-32.
[16] C. Ling, J. Huang, and H. Zhang, "AUC: A better measure than accuracy in comparing learning algorithms," in Advances in Artificial Intelligence, ser. Lecture Notes in Computer Science, Y. Xiang and B. Chaib-draa, Eds. Springer Berlin / Heidelberg, 2003, vol. 2671, pp. 991-991.
[17] R. K. Yin, Case Study Research: Design and Methods, 3rd edition. Sage Publications, 2002.
[18] P. Runeson and M. Höst, "Guidelines for conducting and reporting case study research in software engineering," Empirical Software Engineering, 2009.
[19] A. J. Ko, B. A. Myers, and D. H. Chau, "A linguistic analysis of how people describe software problems," in VLHCC '06: Proceedings of the Visual Languages and Human-Centric Computing, 2006, pp. 127-134.
[20] T. Menzies and A. Marcus, "Automated severity assessment of software defect reports," in IEEE International Conference on Software Maintenance, 2008, pp. 346-355.
[21] G. Antoniol, K. Ayari, M. Di Penta, F. Khomh, and Y.-G. Guéhéneuc, "Is it a bug or an enhancement?: a text-based approach to classify change requests," in CASCON '08: Proceedings of the Conference of the Center for Advanced Studies on Collaborative Research. ACM, 2008, pp. 304-318.
