CNN-Based Automatic Prioritization of Bug Reports Transaction Paper
Abstract—Software systems often receive a large number of bug reports. Triagers read through such reports and assign different priorities to different reports so that important and urgent bugs can be fixed on time. However, manual prioritization is tedious and time-consuming. To this end, in this article, we propose a convolutional neural network (CNN) based automatic approach to predict the multiclass priority for bug reports. First, we apply natural language processing (NLP) techniques to preprocess the textual information of bug reports and convert the textual information into vectors based on the syntactic and semantic relationships of words within each bug report. Second, we perform software engineering domain-specific emotion analysis on bug reports and compute the emotion value for each of them using a software engineering domain repository. Finally, we train a CNN-based classifier that generates a suggested priority based on its input, i.e., vectorized textual information and emotion values. To the best of our knowledge, it is the first CNN-based approach to bug report prioritization. We evaluate the proposed approach on open-source projects. Results of our cross-project evaluation suggest that the proposed approach significantly outperforms the state-of-the-art approaches and improves the average F1-score by more than 24%.

Index Terms—Bug reports, deep learning, prioritization, reliability.

Manuscript received February 11, 2019; revised July 23, 2019 and October 9, 2019; accepted December 1, 2019. The work was supported by the National Natural Science Foundation of China under Grant 61690205 and Grant 61772071. Associate Editor: B. Xu. (Corresponding author: Hui Liu.)
The authors are with the School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China (e-mail: qasimumer667@hotmail.com; [email protected]; [email protected]).
Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TR.2019.2959624
0018-9529 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

I. INTRODUCTION

SOFTWARE systems are often released with defects because of inadequate testing and system complexity [1]. Developers want feedback from users to resolve the defects that users experienced while using released systems. They employ issue reporting systems to collect feedback from users. Bugzilla [2], JIRA [3], and GitHub [4] are the most popular issue reporting systems. Users utilize such systems to report defects and track their progress. The utilization of issue tracking systems is standard practice in software development and maintenance [5] and helps developers to resolve reported defects. The resolution of reported defects has become an essential, expensive, and critical task in software maintenance due to the exponential growth of defects in complex software systems.

A bug report contains information that can be helpful in debugging and explains how exactly the product crashed. A typical bug report includes predefined fields (e.g., product, priority, and severity), attachments (e.g., screenshots of the bug for developers), textual information (e.g., summary and description), and comments from the users and developers. Reporters provide all the required information while reporting bugs.

Software systems often receive a large number of bug reports [6]. Triagers read through such reports and assign different priorities to different reports so that important and urgent bugs can be fixed on time. Developers often do not fix bugs for years due to various constraints, e.g., time and availability of developers [7]. Therefore, triagers need to prioritize bug reports adequately so that developers can fix the ranked bugs in sequence. Different bug tracking systems have different priority levels for a reported bug. Therefore, the prioritization of bug reports is actually a multiclass classification problem. For example, in Bugzilla, the priority of a bug report can be defined from p1 to p5, where p1 is the highest priority and p5 is the lowest priority. Prioritizing bug reports is often a manual and time-consuming process. After a user reports a new bug through the bug tracking system, a triager is responsible for first examining the reported bug. Based on the examination, the triager decides its priority. The manual process of assigning priority increases the resolution time of the bug report [8]. To this end, some automated approaches have been proposed to suggest the priority of bug reports [5], [9]–[11]. However, the performance of such approaches leaves room for significant improvement.

The machine learning and deep learning classifiers used for text classification have their own limitations for bug prioritization. For example, the support vector machine (SVM), reported by Umer et al. [9] as the best machine learning algorithm for this task, requires feature modeling effort. Long short-term memory (LSTM) does not extract position-invariant features (features from the text that are similar in semantics but different in structure) because its output is position dependent [12]. Consequently, LSTM does not identify patterns (features) like "hate a lot" from the text [13]. In contrast to SVM and LSTM, a convolutional neural network (CNN) not only extracts patterns and position-invariant features independently but also eliminates the feature modeling effort for bug prioritization [14]. Notably, such pattern and position-invariant features are effective in emotion analysis. However, tuning the hyperparameter settings to avoid overfitting is essential for a CNN to reach optimal performance, which requires a deep understanding of both the CNN and the problem to be resolved by it.

To this end, in this article, we propose a CNN-based automatic multiclass (p1–p5) prioritization for bug reports (cPur). Notably, we are the first to apply CNN to bug report prioritization. We apply natural language processing (NLP) techniques to preprocess textual information of bug reports. From the
preprocessed bug reports, we perform emotion analysis because hidden emotions may influence the priority of bug reports [9]. Users are rarely impassive when they encounter tiresome bugs. Consequently, the bug reports written by such users may contain evident emotion that may reflect how urgently users want the bugs to be fixed. To classify the emotion of bug reports, we calculate the emotion of each bug report. Although emotion analysis has been leveraged for the prediction of bug priority [9], the proposed approach differs from existing work in that we compute emotions using a distributional semantic model [15] and train it on a software engineering dataset. In contrast, Umer et al. [9] leverage a generic emotion repository, SentiWordNet. A distributional semantic model [15] has been shown to be more effective than emotion repositories, and training it on a software engineering specific dataset may significantly improve the accuracy of emotion computation for bug reports. Based on the preprocessed textual information of bug reports, we construct a vector for each bug report with a word2vector model. We pass the constructed vector and the emotion of each bug report as input to a CNN-based classifier that predicts the priority. For the multiclass priority prediction, we train a CNN-based classifier. Finally, we evaluate the proposed approach on open-source projects. The results of the cross-project evaluation suggest that the proposed approach is accurate. On average, it improves the average F1-score upon state-of-the-art approaches by more than 24% (detailed results are provided in Section IV-E). Note that the reason for choosing the state-of-the-art approaches [1], [5], [9] as baselines is provided in Section IV-A.

This article makes the following contributions.
1) An automated CNN-based approach to suggest the priority of bug reports. To the best of our knowledge, it is the first CNN-based prioritization approach for bug reports.
2) Evaluation results of the proposed approach on the history data suggest that the proposed CNN-based approach is accurate in priority suggestion of bug reports and outperforms the state-of-the-art approaches.

The rest of this article is organized as follows. Section II discusses the related work. Section III defines the proposed approach in detail. Section IV describes the evaluation process of the proposed approach and its results. Section V explains the threats. Section VI concludes this article.

projects Mozilla, Eclipse, and GNOME. They prioritized the bug reports according to their mean time.

Tian et al. [18] proposed a novel approach leveraging information retrieval, in particular a BM25-based document similarity function, that automatically predicts the severity of bug reports. The proposed approach automatically analyzes bug reports and focuses on predicting fine-grained severity labels, namely the different severity labels of Bugzilla including blocker, critical, major, minor, and trivial. Results suggest that fine-grained severity prediction outperforms the state-of-the-art study and brings significant improvement.

Tian et al. [5] proposed an automated classification approach (DRONE) for priority prediction of bug reports. They employed linear regression (LR) for priority classification and achieved an average F1-score of up to 29%.

Alenezi and Banitaan [8] adopted naive Bayes, decision tree, and random forest to perform the priority prediction. They used two feature sets, i.e., 1) based on TF weighted words of bug reports, and 2) based on the classification of bug report attributes. Evaluation results suggest that the second feature set performed better than the first feature set, and that random forests and decision trees outperform naive Bayes.

Tian et al. [1] predicted the priority of bug reports using the nearest neighbor approach to identify fine-grained bug report labels. They applied the proposed approach to a larger collection of bug reports consisting of more than 65 000 Bugzilla reports.

Tian et al. [19] used three open-source software systems (OpenOffice, Mozilla, and Eclipse) and found that around 51% of the duplicate bug reports have inconsistent human-assigned severity labels even though they refer to the same software problem. Results suggest that current automated approaches perform well and their agreement with human-assigned severity labels varies from 77% to 86%.

Choudhary [20] recently developed a model for priority prediction using an SVM that assigns priorities to Firefox crash reports in the Mozilla Socorro server based on the frequency and entropy of the crashes.

In conclusion, researchers have proposed a number of machine learning approaches to predict the priority of bug reports. Our proposed approach differs from the existing approaches in that we are the first to apply a CNN to the prioritization of bug reports.
which some of them are related to bug handling, e.g., bug localization, bug report summarization, bug triage, and duplicate bug detection [24]. The successful applications of deep learning techniques to such software engineering tasks inspire us to apply deep learning techniques, e.g., CNN, to the prioritization of bug reports.

C. Emotion Analysis

Emotion analysis has achieved excellent results in semantic parsing, search query retrieval [25], sentence modeling [26], traditional NLP tasks [27], and sentiment analysis [23]. Some state-of-the-art classification approaches based on emotion analysis are discussed in the following.

Ouyang et al. [23] used a recurrent neural network to classify Italian Twitter messages for predicting emotions. The work is built upon a deep learning approach. They leveraged large amounts of weakly labeled data to train a two-layer CNN. To train their network, they applied a form of multitask training. Their work participated in the EvalItalia-2016 competition and outperformed all other approaches to the sentiment analysis task.

Umer et al. [9] recently proposed an emotion-based automatic approach (eApp) for priority classification. They employed emotion analysis for priority classification and used the SVM (a machine learning algorithm) for the prioritization of bug reports. Results suggest that the proposed approach outperforms the state-of-the-art approach and improves F1-score by more than 24%. Our proposed approach also employs emotion analysis for priority classification but differs from their approach in that we exploit a distributional semantic model to compute the emotions of bug reports as an essential step for priority classification.

Other studies related to bug reports include the detection of bug report duplication [28]–[30], the recommendation of an appropriate developer for a new bug report [31]–[33], and the prediction of bug fixing time [34]–[36].

III. APPROACH

A. Overview

An overview of the CNN-based prioritization for bug reports (cPur) is presented in Fig. 1. The proposed approach recommends a priority level for each bug report as follows.
1) First, we collect the history data of bug reports of open-source projects as training data.
2) Second, we apply NLP techniques to the bug reports for preprocessing.
3) Third, we perform emotion analysis on bug reports and compute the emotion of each bug report.
4) Fourth, we create a vector for each bug report by using its preprocessed words.
5) Finally, we train a CNN-based classifier for the priority prediction. We pass the generated vector and the emotion of each bug report as input to the classifier, which predicts the priority of bug reports.

We introduce each of the key steps of the proposed approach in the following sections.

B. Illustrating Example

We use the following example to illustrate how the proposed approach prioritizes bug reports. It is an Android bug report (81613) collected from Google Issue Tracker [37]. It was created on December 3, 2014, and closed on December 3, 2014.
1) Product = "Android Studio" is the name of the product that is affected by the bug.
2) Textual Information = "First run wizard delete SDK if androidsdk.repo and androidsdk.dir point to the same dir" explains the bug. It may contain information on how to reproduce the bug.
3) Priority = "p1" is the priority of the example bug report. This field could be left blank when the bug is reported; a triager may then assign the priority to the bug report.

We present the details on how the proposed approach works for the illustrating example in the following sections.

C. Problem Definition

A bug report r from a set of bug reports R can be formalized as

$$r = \langle t, p \rangle \tag{1}$$

where t is the textual information of r and p is the priority assigned to r. For the illustrating example presented in Section III-B, we have

$$r_e = \langle t_e, p_e \rangle \tag{2}$$

where $t_e$ = "First run wizard delete SDK if androidsdk.repo and androidsdk.dir point to the same dir," and $p_e$ = p1.

The proposed approach suggests the priority of the new bug report as either p1, p2, p3, p4, or p5, where p1 is the highest
D. Preprocessing

Bug reports contain irrelevant and unwanted text, e.g., punctuation. Feeding irrelevant text to the classification algorithms is an overhead because it increases the processing time and the memory used for processing. Therefore, we perform preprocessing to increase the performance of the proposed approach and to make it cost effective. NLP techniques are often used for the preprocessing of bug reports, including tokenization, stop-word removal, negation handling, spell correction, modifier word recognition, word inflection, and lemmatization. We employ the following preprocessing steps to clean the textual information of bug reports.
1) Tokenization: The text of bug reports often contains words and special characters, e.g., spaces and punctuation marks. Tokenization removes the special characters and decomposes the text into words (tokens).
2) Stop-word removal: Textual documents often contain words that are used to make sentences meaningful but do not have meaning individually. Such words are known as stop-words. We remove such words from the words extracted in tokenization. Note that bug reports contain some programming-related words; however, we only remove the English language stop-words from the text.
3) Spell correction: Users type the unstructured fields while reporting bugs, e.g., the textual information (summary and description), which may contain spelling mistakes. Therefore, we apply an automated way to correct spelling mistakes.
4) Negation and modifiers: The usage of a negation or modifier with an English word changes its impact in the sentence. For example, good and not good are opposites. Similarly, good and very good have different intensities. We apply negation and word modifier recognition during the calculation of the emotion of each bug report, as mentioned in Section III-E.
5) Word inflection and lemmatization: Word inflection converts words into their singular form. For example, inflection converts the word errors into error. Lemmatization, in turn, converts comparative and superlative words into their base words. For example, lemmatization converts the word crashed into crash. We apply both word inflection and lemmatization to the extracted words and finally fold them into lowercase.

To perform preprocessing, we utilize the Python Natural Language Toolkit (NLTK) [38] and the Python TextBlob Library [39].

$$ws = \langle w_1, w_2, \ldots, w_n \rangle \tag{5}$$

where $w_1, w_2, \ldots, w_n$ are the words (tokens) from the textual description of r after preprocessing.

For the illustrating example presented in Section III-B, the second column of Table I presents the preprocessing results of the example bug report $r_e$. After preprocessing, we have

$$r_e = \langle \text{wizard}, \text{delete}, \text{sdk}, \ldots, \text{dir}, p_1 \rangle \tag{6}$$

where wizard, delete, sdk, ..., dir are the preprocessed words from $r_e$.

E. Emotion Analysis

Users are rarely impassive when they encounter tiresome bugs. As a result, the bug reports written by such users may contain evident emotion. For example, the bug report 5083: "Breakpoint not hit" has negative emotion due to the word hit, whereas the bug report 8423: "Thank you, that was really helpful. I want them to resize based on the length of the data they're showing." has positive emotion due to the words Thank and helpful. To classify whether the emotion of the reporter in a bug report is positive or negative, we calculate the emotion of each bug report. There are many repositories for the emotion analysis of text documents, e.g., SentiWordNet [40]. However, to the best of our knowledge, SentiStrengthSE [41], SentiCR [42], Senti4SD [15], and EmoTxt [43] are the repositories used for emotion analysis of software engineering text. We choose Senti4SD for emotion analysis because it is a commonly used repository and outperforms SentiStrength, SentiStrengthSE, and SentiCR for text classification in the software engineering domain [15]. We input each bug report to the distributional semantic model [15] to compute its emotion. The distributional semantic model represents words as mathematical points in a high-dimensional vector space. It relies on the distributional hypothesis, which holds that semantically similar words appear in the same contexts [15]. It returns the emotion of a given bug report based on its emotion words, negation, and modifiers. We store the computed emotions with the corresponding bug reports. After emotion analysis, a bug report can be represented as

$$r = \langle e, w_1, w_2, \ldots, w_n, p \rangle \tag{7}$$
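The preprocessing steps of Section III-D and the report representation of (7) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the paper relies on NLTK and TextBlob, whereas the small `STOP_WORDS` set, the naive `singular` helper, and the `BugReport` class below are hypothetical stand-ins chosen so that the example needs only the standard library. Spell correction and lemmatization are omitted, and the emotion label is copied from the value the paper reports for the example rather than computed with Senti4SD.

```python
import re
from dataclasses import dataclass
from typing import List

# Illustrative subset of English stop-words (assumption; a real run would
# use a full list such as the one shipped with NLTK). Negations such as
# "not" are kept because they matter for emotion analysis.
STOP_WORDS = {"a", "an", "and", "the", "if", "to", "of", "in", "is", "that", "was"}


def singular(token: str) -> str:
    """Toy word inflection: strip a trailing plural 's' (errors -> error)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text: str) -> List[str]:
    """Tokenize, fold to lowercase, drop stop-words, apply toy inflection."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # drops punctuation
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [singular(t) for t in tokens]


@dataclass
class BugReport:
    """Representation r = <e, w_1, ..., w_n, p> from (7)."""
    emotion: str      # e, e.g. "positive" or "negative"
    words: List[str]  # w_1 ... w_n after preprocessing
    priority: str     # p, one of p1 ... p5


t_e = ("First run wizard delete SDK if androidsdk.repo and "
       "androidsdk.dir point to the same dir")
# Emotion value taken from the paper's example, not computed here.
r_e = BugReport(emotion="positive", words=preprocess(t_e), priority="p1")
print(r_e)
```

Because the stop-word list and inflection rule are simplified, the resulting token list is not expected to match Table I exactly; the sketch only shows the order of the steps and the shape of the resulting record.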
Senti4SD and we have

$$r_e = \langle \text{positive}, \text{wizard}, \text{delete}, \text{sdk}, \ldots, \text{dir}, p_1 \rangle \tag{8}$$

where positive is the calculated emotion of the example bug report.

F. Word2Vector Modeling

In this step of the automatic prioritization approach of bug reports, we construct a vector for each bug report. We pass the preprocessed words $w_1, w_2, \ldots, w_n$ from (7) to a skip-gram-based word2vector model [44]. It is an efficient method for learning continuous word representations (high-quality distributed vectors) based on a neural network with a single hidden layer. The model captures a large number of precise syntactic and semantic word relationships and returns a k-dimensional vector.

For the motivating example presented in Section III-B, the preprocessed words wizard, delete, sdk, ..., dir are passed to the skip-gram model to convert them into vectors. For example, wizard is represented as [0.102, −4.31, −0.003, ...].

G. Priority Prediction Model

The proposed approach exploits the CNN to relate a vector and an emotion with the priority of a bug report. We use a CNN to predict the priority of bug reports for the following reasons. First, the CNN uses the vector concatenation method to concatenate incoming inputs into one long input vector. Consequently, the CNN can handle long-term dependencies better than a recurrent neural network. Second, convolving filters of different sizes with the long input vector enables the CNN to encode short-term and long-term dependencies by small and large filter sizes, respectively. Third, by using different filter sizes, the CNN does not suffer from the exploding gradient problem of a recurrent neural network [45].

1) Overview: The overview of the proposed model is shown in Fig. 3. Given a bug report r with its emotion e (computed in Section III-E) and k-dimensional vector x (constructed in Section III-F), a bug report of maximal length n (to compute n, we first find the preprocessed bug reports with maximum length and apply padding to the remaining bug reports) can be represented as x (i.e., the input of the CNN in Fig. 3), which is calculated as

$$x = \langle e, u_1, u_2, \ldots, u_n \rangle \tag{9}$$

$$x = \langle v_1, v_2, v_3, \ldots, v_n \rangle \tag{10}$$

where $v_i$ is the vector representation of e and $w_i$.

2) Filter Operation: The CNN applies a filter $w \in \mathbb{R}^{dk}$ to a window of d words to generate a new feature. For instance, a new feature $c_i$ is generated from a window of words $v_{i:i+d-1}$ that can be formalized as

$$c_i = f(w \cdot v_{i:i+d-1} + b) \tag{11}$$

where b represents a bias term that belongs to $\mathbb{R}$, and f is a hyperbolic tangent nonlinear function. This filter generates a feature map using each window of the features $\langle v_{1:d}, v_{2:d+1}, \ldots, v_{n-d+1:n} \rangle$. A generated feature map c that
belongs to $\mathbb{R}^{n-d+1}$ can be formalized as

$$c = \langle c_1, c_2, \ldots, c_{n-d+1} \rangle. \tag{12}$$

For the motivating example presented in Section III-B, we first feed the inputs (emotion and preprocessed text) into an embedding layer that converts them into numerical vectors. Second, we pass the numerical vectors into a CNN with dropout = 0.2, because we observe that this setting results in the minimum loss in the training phase when the dropout varies from 0.0 to 0.5 (with a step size of 0.1). Notably, we include dropout to prevent overfitting. A fully connected softmax layer involves most of the parameters. Thus, neurons create dependencies between each other that restrict the individual power of each neuron, leading to overfitting of the training data. Consequently, overfitting decreases the performance of the model. Finally, we use three layers of the CNN. Notably, we use three convolutional layers because each convolution generates tensors of different shapes due to multiple filters. We create a layer for each of the tensors to iterate through them and merge the results into one big feature vector. We forward the output of the CNN to a flatten layer that converts the numerical vectors into a one-dimensional vector. We apply filters of different lengths (3, 4, and 5) to each n × k vector and generate new features.

3) Pooling Operation: A max-over-time pooling operation is applied to the feature map to get the maximum value ĉ from (12). The pooling operation helps to find the most important features from its feature map, i.e., the features having the highest value.

The proposed model applies multiple filters of different sizes and extracts one feature from each filter. Such features construct the penultimate layer and are forwarded to a fully connected softmax layer. The output of the softmax layer is the probability distribution over the priority levels.

IV. EVALUATION

In this section, the performance of the proposed approach is evaluated on the bug reports of four open-source Eclipse projects.

A. Research Questions

The evaluation investigates the following research questions.
1) RQ1: Does cPur outperform the state-of-the-art approaches in priority prediction of bug reports?
2) RQ2: How do different inputs (text and emotion) influence the performance of cPur?
3) RQ3: How does preprocessing influence the performance of cPur?
4) RQ4: How does the length of filters influence the performance of cPur?
5) RQ5: How does the training size influence the performance of cPur?
6) RQ6: Does the convolutional neural network outperform other classification algorithms (a traditional machine learning algorithm and LSTM) in predicting the priority of bug reports?

The first research question (RQ1) examines the performance improvement of cPur against the state-of-the-art approaches. To this end, we select three approaches, eApp (proposed by Umer et al. [9]), DRONE (proposed by Tian et al. [5]), and DRONE∗ (proposed by Tian et al. [1]), for the comparison because of the following reasons. First, eApp, DRONE, and DRONE∗ were designed for automatic prioritization of bug reports, as our approach is. Second, they were recently proposed and represent the state of the art.

The second research question (RQ2) investigates the influence of the given inputs. We provide two inputs (the preprocessed text of bug reports and their emotion analysis results) to the CNN-based approach for the priority prediction of bug reports. We want to know to what extent each input affects the performance.

The third research question (RQ3) investigates the impact of the preprocessing on the performance of cPur. Most textual datasets are not clean, i.e., they may contain punctuation. Therefore, we perform preprocessing (mentioned in Section III-D) to clean the given dataset.

The fourth research question (RQ4) investigates the impact of the length of filters on the effectiveness of the proposed approach.

The fifth research question (RQ5) investigates the relationship between the training size and the effectiveness of the proposed approach.

The sixth research question (RQ6) compares the selected classification algorithm (CNN) with alternatives. We choose SVM because Umer et al. [9] recently reported it as the best machine learning algorithm for priority prediction, and we select LSTM because it has proved effective in NLP [46].

B. Dataset

We exploit the dataset created by Tian et al. [5] and reused by Tian et al. [1] and Umer et al. [9]. They investigated the bug repository of Eclipse, which is an open-source integrated development platform. They collected the bug reports submitted from October 2001 to December 2007 from Bugzilla [47]. Notably, they only collected the defect reports and ignored the enhancement reports. The resulting dataset includes the bug reports of four open-source projects: Java development tools (JDT), Eclipse's C/C++ Development Tooling (CDT), Plug-in Development Environment (PDE), and Platform. Their summary attribute describes the reported bugs, whereas their priority attribute indicates their importance and urgency. The total number of bug reports in the dataset is 80 000, of which 25%, 28%, 16%, and 31% belong to CDT, JDT, PDE, and Platform, respectively. This dataset is also used by Umer et al. and Tian et al. to evaluate their approaches, which are selected in this article as the baseline approaches.

C. Process

We evaluate the proposed approach as follows. First, we reuse the bug reports R of four open-source projects from Bugzilla and apply NLP techniques to preprocess them, as mentioned in Section III-D. Second, we carry out a cross-project validation on R. We partition the dataset R into four sets based on the project, denoted as $S_i$ (i = 1...4). For the ith cross validation, we consider all bug reports except for those in $S_i$ as a training dataset
and treat the bug reports in $S_i$ as a testing dataset. For the ith cross validation, the evaluation process is as follows.
1) First, we select all training reports (TR) from the training dataset, which is the union of all sets but $S_i$, calculated as

$$TR_i = \bigcup_{j \in [1,4] \,\wedge\, j \neq i} S_j. \tag{13}$$

2) Second, we train an LSTM with data from TR.
3) Third, we train a CNN with data from TR.
4) Fourth, we train the machine learning algorithms (SVM [9] and LR [1], [5]) with data from TR.
5) Fifth, for each bug report in $S_i$, we predict its priority using the trained LSTM, CNN, LR, and SVM and compare the prediction with its real priority.
6) Finally, we compute the evaluation metrics for each algorithm to compare their performances.

D. Metrics

Given the bug reports R, the performance of the proposed approach is evaluated by calculating the priority-specific precision P, recall R, and F1-score F1 as

$$P = \frac{TP}{TP + FP} \tag{14}$$

$$R = \frac{TP}{TP + FN} \tag{15}$$

$$F1 = \frac{2 * P * R}{P + R} \tag{16}$$

where P, R, and F1 are, respectively, the precision, recall, and F1-score of the approaches for priority prediction of the bug reports whose actual priority is $P_i$. TP is the number of bug reports that are correctly predicted as $P_i$, FP is the number of bug reports that are falsely predicted as $P_i$, and FN is the number of bug reports that are not predicted as $P_i$ although they are actually $P_i$.

Our multiclass classification problem has five priority classes (levels) as labels. Therefore, we also perform macroanalysis and microanalysis over all priority levels C, which are commonly used to evaluate the performance of multiclass classification [9], [48]. Macroanalysis combines the precision and recall of the priority levels by averaging their values, i.e., it normalizes the sum of the per-level values by the number of priority levels. Microanalysis follows a similar idea but computes precision and recall from the sums of the true positive, true negative, false positive, and false negative values over all priority levels. In contrast to macroanalysis, microanalysis takes the frequency of each priority level into consideration. We calculate macroprecision $P_{mac}$, macrorecall $R_{mac}$, macro F1-score $F1_{mac}$, microprecision $P_{mic}$, microrecall $R_{mic}$, and micro F1-score $F1_{mic}$ as follows:

$$P_{mac} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FP_i} \tag{17}$$

$$R_{mac} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FN_i} \tag{18}$$

$$F1_{mac} = \frac{2 * P_{mac} * R_{mac}}{P_{mac} + R_{mac}} \tag{19}$$

$$P_{mic} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)} \tag{20}$$

$$R_{mic} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)} \tag{21}$$

$$F1_{mic} = \frac{2 * P_{mic} * R_{mic}}{P_{mic} + R_{mic}}. \tag{22}$$

We also compute the Hamming-loss error. It measures the average fraction of (bug report, priority level) pairs whose relevance is falsely predicted [49]. It normalizes the loss over the total number of priority levels and the total number of bug reports using the priority prediction error (an incorrect priority level is predicted) and the missing error (a relevant priority level is not predicted). The Hamming-loss error HE can be formalized as

$$HE = \frac{1}{|N| \cdot |P|} \sum_{i=1}^{|N|} \sum_{j=1}^{|P|} \operatorname{xor}(y_{i,j}, z_{i,j}) \tag{23}$$

where N is the number of bug reports, P is the number of priority levels, $y_{i,j}$ denotes the true priority levels, and $z_{i,j}$ denotes the predicted priority levels.

E. RQ1: Comparison Against the State-of-the-Art Approaches

To answer the research question RQ1, we compare cPur against the state-of-the-art approaches (eApp, DRONE∗, and DRONE) in priority prediction of bug reports. To this end, we perform microanalysis and macroanalysis to find out the performance improvement of cPur against each class. We also conduct priority-level and project-level comparisons to evaluate the performance improvement of cPur for each priority and each project, respectively.

1) Comparison on Microanalysis and Macroanalysis: Evaluation results of microanalysis and macroanalysis are presented in Table II. The first column presents the approaches. Columns 2–4 and 5–7 present the performance in microanalysis and macroanalysis, respectively. The last column presents the error. The rows of the table present the performance results of cPur, eApp, DRONE∗, and DRONE, respectively.

From Table II, we make the following observations.
1) cPur outperforms DRONE∗ and DRONE in both macroanalysis and microanalysis. It indicates that cPur improves the F1-score not only for all priority levels (as a whole system) but also for each priority level (individually). However, we also notice that cPur outperforms eApp in macroanalysis only, and its precision in microanalysis is slightly lower than that of eApp.
2) The performance improvement of cPur upon eApp in F1-score for macroanalysis and microanalysis are 22.94%
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
TABLE II
PERFORMANCE ON MICRO AND MACRO LEVELS
TABLE III
PERFORMANCE ON PRIORITY LEVEL
TABLE IV
PERFORMANCE ON PROJECT LEVEL
= (56.99% − 46.36%)/46.36% and 5.77% = (48.98% − 46.31%)/46.31%, respectively. Moreover, the performance improvements of cPur upon DRONE∗ in F1-score for macroanalysis and microanalysis are 25.45% = (56.99% − 45.43%)/45.43% and 12.65% = (48.98% − 43.48%)/43.48%, respectively. Similarly, the performance improvements of cPur upon DRONE in F1-score for macroanalysis and microanalysis are 41.55% = (56.99% − 40.26%)/40.26% and 22.07% = (48.98% − 40.13%)/40.13%, respectively.

3) cPur reduces the error relative to eApp, DRONE∗, and DRONE by 2.5% = (0.4397 − 0.4291)/0.4291, 7.81% = (0.4626 − 0.4291)/0.4291, and 11.6% = (0.4790 − 0.4291)/0.4291, respectively.

2) Comparison on Priority Level and Project Level: Evaluation results of each priority and each project for the involved approaches are presented in Tables III and IV, respectively. Notably, we apply tenfold cross validation and cross-project evaluation techniques to produce the priority-level and project-level results, respectively. Therefore, the reproduced results of the state-of-the-art approaches may differ from the originally reported results (using n-fold cross validation).

In Table III, the first column presents the approaches. Columns 2–4, 5–7, 8–10, 11–13, and 14–16 present the performance for priorities $p_1$, $p_2$, $p_3$, $p_4$, and $p_5$, respectively. The rows of the table present the performance results of cPur, eApp, DRONE∗, and DRONE, respectively.

In Table IV, the first column presents the approaches. Columns 2–4, 5–7, 8–10, and 11–13 present the performance for projects CDT, JDT, PDE, and Platform, respectively. The rows of the table present the performance results of cPur, eApp, DRONE∗, and DRONE, respectively.

From Tables III and IV, we make the following observations.

1) On each priority level, cPur outperforms eApp, DRONE∗, and DRONE. The improvement of cPur upon eApp in F1-score varies from 4.30% = (73.81% − 70.77%)/70.77% to 24.34% = (46.18% − 37.14%)/37.14%. Moreover, the improvement of cPur upon DRONE∗ in F1-score varies from 5.46% = (73.81% − 69.99%)/69.99% to 45.13% = (46.18% − 31.82%)/31.82%. Similarly, the improvement of cPur upon DRONE in F1-score varies from 7.25% = (73.81% − 68.82%)/68.82% to 64.04% = (46.18% − 28.15%)/28.15%.

2) On each project level, cPur outperforms eApp, DRONE∗, and DRONE. The improvement of cPur upon eApp in F1-score varies from 27.10% = (62.07% − 48.84%)/48.84% to 62.79% = (73.47% − 45.13%)/45.13%. Moreover, the improvement of cPur upon DRONE∗ in F1-score varies from 31.24% = (62.07% − 45.32%)/45.32% to 78.80% = (73.47% − 41.09%)/41.09%. Similarly, the improvement of cPur upon DRONE in F1-score varies from 44.79% = (62.07% − 42.87%)/42.87% to 88.75% = (73.47% − 38.92%)/38.92%.

To validate the significance of the differences among cPur, eApp, DRONE∗, and DRONE, we employ one-way analysis of variance (ANOVA). ANOVA determines whether there are any statistically significant differences between the means of independent (unrelated) groups [50], where the unit of analysis in ANOVA is a project. ANOVA is employed because all approaches are applied to the same projects. It can validate whether the only difference (a single factor, i.e., the different approaches) leads to the difference in performance. We compute ANOVA in Excel with its default settings and do not apply any adjustment. Notably, ANOVA on F1-score is conducted independently, where the unit
TABLE VI
INFLUENCE OF DIFFERENT INPUT
TABLE VII
INFLUENCE OF PREPROCESSING
TABLE VIII
INFLUENCE OF MACHINE LEARNING TECHNIQUES
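As an illustrative sketch (not the authors' implementation), the evaluation metrics of Section D and the relative-improvement arithmetic used in the RQ1 observations can be computed as follows. The function names and the three-level toy labels are hypothetical. Computing the macro F1-score as the harmonic mean of the macroprecision and macrorecall is an assumption, and the Hamming loss assumes one-hot encoded priority levels compared with exclusive-or.

```python
def per_class_counts(y_true, y_pred, labels):
    """Return {label: (TP, FP, FN)} for single-label multiclass predictions."""
    counts = {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def safe_div(num, den):
    """Divide, returning 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall, as in Eq. (16)."""
    return safe_div(2 * p * r, p + r)


def macro_micro(y_true, y_pred, labels):
    counts = per_class_counts(y_true, y_pred, labels)
    # Macro: average the per-class precision and recall, Eqs. (17) and (18).
    p_mac = sum(safe_div(tp, tp + fp) for tp, fp, _ in counts.values()) / len(labels)
    r_mac = sum(safe_div(tp, tp + fn) for tp, _, fn in counts.values()) / len(labels)
    # Micro: pool the counts of all classes before dividing.
    tp_sum = sum(tp for tp, _, _ in counts.values())
    fp_sum = sum(fp for _, fp, _ in counts.values())
    fn_sum = sum(fn for _, _, fn in counts.values())
    p_mic = safe_div(tp_sum, tp_sum + fp_sum)
    r_mic = safe_div(tp_sum, tp_sum + fn_sum)
    return {
        "P_mac": p_mac, "R_mac": r_mac,
        "F1_mac": f1_score(p_mac, r_mac),  # assumption: harmonic mean of macro P and R
        "P_mic": p_mic, "R_mic": r_mic,
        "F1_mic": f1_score(p_mic, r_mic),
    }


def hamming_loss(y_true, y_pred, labels):
    """Fraction of (report, level) cells whose one-hot indicators differ."""
    mismatches = 0
    for t, p in zip(y_true, y_pred):
        for c in labels:
            mismatches += (t == c) != (p == c)  # exclusive-or of the two indicators
    return mismatches / (len(y_true) * len(labels))


def relative_improvement(new, old):
    """Relative improvement (new - old) / old, as reported in the observations."""
    return (new - old) / old


# Toy example with hypothetical labels and predictions.
labels = ["P1", "P2", "P3"]
y_true = ["P1", "P1", "P2", "P2", "P3", "P3"]
y_pred = ["P1", "P2", "P2", "P2", "P3", "P1"]
metrics = macro_micro(y_true, y_pred, labels)
loss = hamming_loss(y_true, y_pred, labels)
gain = relative_improvement(56.99, 46.36)  # cPur versus eApp, values from the text
```

In single-label classification the micro-averaged precision and recall coincide, because every misclassified report counts once as a false positive for the predicted level and once as a false negative for the actual level.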
through such bug reports and manually correct or assign the priority of each bug report. Manual prioritization of bug reports requires expertise and resources (e.g., time and professionals). To this end, in this article, we proposed a CNN-based automatic approach for multiclass priority prediction of bug reports. The proposed approach not only applied a deep learning model but also employed natural language processing techniques and emotion analysis on the given dataset for the priority prediction of bug reports. The proposed approach automated the priority assignment process and saved the time and effort required of developers. We performed a cross-project evaluation on the history data of four open-source projects of Eclipse. The evaluation results suggested that the proposed approach outperformed the state-of-the-art approaches.

The broader impact of our article is to show that the textual information of bug reports could be a rich source of information for prioritizing them so that they are resolved on time. We expect our results to encourage future research on the prioritization of bug reports. We would like to investigate the rationale behind the proposed approach in the future. One of the drawbacks of deep learning neural networks is that it is challenging, if not impossible, to explain why deep learning based approaches, e.g., the one proposed in this article, work or do not work. Opening the “black box” of deep neural networks would help to better understand how a deep learning model learns. Deep learning is called a black box as it is nonparameterized. Although the choice of hyperparameters of deep learning models, such as the number of layers, the activation function, and the learning rate, as well as the predictor importance, is known, it is still unclear how machines learn and deduce conclusions. In the future, we would like to exploit advanced techniques in neural networks to uncover the rationale behind the phenomenon.

We would also like to investigate a domain-specific prioritization of bug reports by including more bug reports from different domains, e.g., information systems. A domain-specific prioritization of bug reports will affirm the generalizability of the proposed approach.

ACKNOWLEDGMENT

The authors would like to thank the Associate Editor and the anonymous reviewers for their insightful comments and constructive suggestions.

REFERENCES

[1] Y. Tian, D. Lo, X. Xia, and C. Sun, “Automated prediction of bug report priority using multi-factor analysis,” Empirical Softw. Eng., vol. 20, pp. 1354–1383, Oct. 2015.
[2] Bugzilla, 2018. [Online]. Available: https://www.bugzilla.org/
[3] Jira, 2002. [Online]. Available: https://www.atlassian.com/software/jira/
[4] Github, 2008. [Online]. Available: https://github.com/features/
[5] Y. Tian, D. Lo, and C. Sun, “Drone: Predicting priority of reported bugs by multi-factor analysis,” in Proc. IEEE Int. Conf. Softw. Maintenance, Washington, DC, USA, 2013, pp. 200–209.
[6] J. Anvik, L. Hiew, and G. C. Murphy, “Coping with an open bug repository,” in Proc. OOPSLA Workshop Eclipse Technol. eXchange, New York, NY, USA, 2005, pp. 35–39.
[7] X. Xia, D. Lo, M. Wen, E. Shihab, and B. Zhou, “An empirical study of bug report field reassignment,” in Proc. Softw. Evol. Week—IEEE Conf. Softw. Maintenance, Reeng., Reverse Eng., Feb. 2014, pp. 174–183.
[8] M. Alenezi and S. Banitaan, “Bug reports prioritization: Which features and classifier to use?” in Proc. 12th Int. Conf. Mach. Learn. Appl., vol. 2, Washington, DC, USA, 2013, pp. 112–116.
[9] Q. Umer, H. Liu, and Y. Sultan, “Emotion based automated priority prediction for bug reports,” IEEE Access, vol. 6, pp. 35743–35752, 2018.
[10] J. Kanwal and O. Maqbool, “Bug prioritization to facilitate bug report triage,” J. Comput. Sci. Technol., vol. 27, pp. 397–412, Mar. 2012.
[11] L. Yu, W.-T. Tsai, W. Zhao, and F. Wu, “Predicting defect priority based on neural networks,” in Advanced Data Mining and Applications, L. Cao, J. Zhong, and Y. Feng, Eds., Berlin, Germany: Springer, 2010, pp. 356–367.
[12] B. Wang, “Disconnected recurrent neural networks for text categorization,” in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics (Volume 1: Long Papers), Melbourne, VIC, Australia, Jul. 2018, pp. 2311–2320.
[13] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: A search space odyssey,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2222–2232, Oct. 2017.
[14] W. Y. Ramay, Q. Umer, X. C. Yin, C. Zhu, and I. Illahi, “Deep neural network-based severity prediction of bug reports,” IEEE Access, vol. 7, pp. 46846–46857, 2019.
[15] F. Calefato, F. Lanubile, F. Maiorano, and N. Novielli, “Sentiment polarity detection for software development,” Empirical Softw. Eng., vol. 23, pp. 1352–1382, Jun. 2018.
[16] A. Lamkanfi, S. Demeyer, E. Giger, and B. Goethals, “Predicting the severity of a reported bug,” in Proc. 7th IEEE Work. Conf. Mining Softw. Repositories, May 2010, pp. 1–10.
[17] W. Abdelmoez, M. Kholief, and F. M. Elsalmy, “Bug fix-time prediction model using naïve Bayes classifier,” in Proc. 22nd Int. Conf. Comput. Theory Appl., Oct. 2012, pp. 167–172.
[18] Y. Tian, D. Lo, and C. Sun, “Information retrieval based nearest neighbor classification for fine-grained bug severity prediction,” in Proc. 19th Work. Conf. Reverse Eng., Oct. 2012, pp. 215–224.
[19] Y. Tian, N. Ali, D. Lo, and A. E. Hassan, “On the unreliability of bug severity data,” Empirical Softw. Eng., vol. 21, pp. 2298–2323, Dec. 2016.
[20] P. Choudhary, “Neural network based bug priority prediction model using text classification techniques,” Int. J. Adv. Res. Comput. Sci., vol. 8, no. 5, pp. 1315–1319, 2017.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. 25th Int. Conf. Neural Inf. Process. Syst., 2012, vol. 1, pp. 1097–1105.
[22] A. Graves, A. Mohamed, and G. E. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2013, pp. 6645–6649.
[23] X. Ouyang, P. Zhou, C. H. Li, and L. Liu, “Sentiment analysis using convolutional neural network,” in Proc. IEEE Int. Conf. Comput. Inf. Technol.; Ubiquitous Comput. Commun.; Dependable, Autonomic Secure Comput.; Pervasive Intell. Comput., Oct. 2015, pp. 2359–2364.
[24] X. Li, H. Jiang, Z. Ren, G. Li, and J. Zhang, “Deep learning in software engineering,” 2018.
[25] W.-T. Yih, K. Toutanova, J. C. Platt, and C. Meek, “Learning discriminative projections for text similarity measures,” in Proc. 15th Conf. Comput. Natural Lang. Learn., Stroudsburg, PA, USA, Jul. 2011, pp. 247–256.
[26] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural network for modelling sentences,” in Proc. 52nd Annu. Meeting Assoc. Computational Linguistics, 2014, pp. 655–665.
[27] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, “Natural language processing (almost) from scratch,” J. Mach. Learn. Res., vol. 12, pp. 2493–2537, Nov. 2011.
[28] X. Wang, L. Zhang, T. Xie, J. Anvik, and J. Sun, “An approach to detecting duplicate bug reports using natural language and execution information,” in Proc. 30th Int. Conf. Softw. Eng., 2008, pp. 461–470.
[29] C. Sun, D. Lo, X. Wang, J. Jiang, and S. Khoo, “A discriminative model approach for accurate duplicate bug report retrieval,” in Proc. ACM/IEEE 32nd Int. Conf. Softw. Eng., vol. 1, May 2010, pp. 45–54.
[30] N. Jalbert and W. Weimer, “Automated duplicate detection for bug tracking systems,” in Proc. IEEE Int. Conf. Dependable Syst. Netw. FTCS DCC, Jun. 2008, pp. 52–61.
[31] G. Canfora and L. Cerulo, “Supporting change request assignment in open source development,” in Proc. ACM Symp. Appl. Comput., 2006, pp. 1767–1772.
[32] G. Jeong, S. Kim, and T. Zimmermann, “Improving bug triage with bug tossing graphs,” in Proc. 7th Joint Meeting Eur. Softw. Eng. Conf. ACM SIGSOFT Symp. Found. Softw. Eng., Jan. 2009, pp. 111–120.
[33] J. Xuan, H. Jiang, H. Zhang, and Z. Ren, “Developer recommendation on bug commenting: A ranking approach for the developer crowd,” Sci. China Inf. Sci., vol. 60, Apr. 2017, Art. no. 072105.
[34] C. Weiss, R. Premraj, T. Zimmermann, and A. Zeller, “How long will it take to fix this bug?,” in Proc. 4th Int. Workshop Mining Softw. Repositories, 2007, pp. 1–8.
[35] S. Akbarinasaji, B. Caglayan, and A. Bener, “Predicting bug-fixing time: A replication study using an open source software project,” J. Syst. Softw., vol. 136, pp. 173–186, 2018.
[36] P. Bhattacharya and I. Neamtiu, “Bug-fix time prediction models: Can we do better?,” in Proc. 8th Work. Conf. Mining Softw. Repositories, 2011, pp. 207–210.
[37] Google-Issue-Tracker, [Online]. Available: https://issuetracker.google.com/
[38] E. Loper and S. Bird, “NLTK: The natural language toolkit,” in Proc. ACL-02 Workshop Effective Tools Methodologies Teaching Natural Lang. Process. Comput. Linguistics, 2002, vol. 1, 2006, pp. 63–70.
[39] TextBlob, 2013. [Online]. Available: https://textblob.readthedocs.io/en/dev/
[40] J. Uddin, R. Ghazali, M. Mat Deris, R. Naseem, and H. Shah, “A survey on bug prioritization,” Artif. Intell. Rev., vol. 47, pp. 145–180, Apr. 2016.
[41] M. R. Islam and M. F. Zibran, “Sentistrength-SE: Exploiting domain specificity for improved sentiment analysis in software engineering text,” J. Syst. Softw., vol. 145, pp. 125–146, 2018.
[42] T. Ahmed, A. Bosu, A. Iqbal, and S. Rahimi, “Senticr: A customized sentiment analysis tool for code review interactions,” in Proc. 32nd IEEE/ACM Int. Conf. Automated Softw. Eng., 2017, pp. 106–111.
[43] F. Calefato, F. Lanubile, and N. Novielli, “Emotxt: A toolkit for emotion recognition from text,” in Proc. 7th Int. Conf. Affect. Comput. Intell. Interact. Workshops Demos, 2017, pp. 79–80.
[44] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. 26th Int. Conf. Neural Inf. Process. Syst., 2013, pp. 3111–3119.
[45] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” 2012.
[46] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing [review article],” IEEE Comput. Intell. Mag., vol. 13, no. 3, pp. 55–75, Aug. 2018.
[47] Bugzilla, 2018. [Online]. Available: https://bugs.eclipse.org/bugs/
[48] I. Safonov, I. Gartseev, M. Pikhletsky, O. Tishutin, and M. Bailey, “An approach for model assessment for activity recognition,” Pattern Recognit. Image Anal., vol. 25, pp. 263–269, Apr. 2015.
[49] R. E. Schapire and Y. Singer, “Boostexter: A boosting-based system for text categorization,” Mach. Learn., vol. 39, pp. 135–168, May 2000.
[50] E. T. Berkman and S. P. Reise, A Conceptual Guide to Statistics Using SPSS. Thousand Oaks, CA, USA: Sage, 2011.
[51] T. Menzies and A. Marcus, “Automated severity assessment of software defect reports,” in Proc. IEEE Int. Conf. Softw. Maintenance, Sep. 2008, pp. 346–355.

Qasim Umer received the B.S. degree in computer science from Punjab University, Lahore, Pakistan, in 2006, the M.S. degree in net distributed system development from the University of Hull, Hull, U.K., in 2009, and the second M.S. degree in computer science from the University of Hull, in 2012. He is currently working toward the Ph.D. degree in computer science with the Beijing Institute of Technology, Beijing, China.
He is particularly interested in machine learning, data mining, and software maintenance.

Hui Liu received the B.S. degree in control science from Shandong University, Jinan, China, in 2001, the M.S. degree in computer science from Shanghai University, Shanghai, China, in 2004, and the Ph.D. degree in computer science from Peking University, Beijing, China, in 2008.
He is currently a Professor with the School of Computer Science and Technology, Beijing Institute of Technology, Beijing. He was a Visiting Research Fellow with the Centre for Research on Evolution, Search and Testing (CREST), University College London, London, U.K. He served on the program committees and organizing committees of prestigious conferences, such as International Conference on Software Maintenance and Evolution, RE, International Centre for the Study of Radicalisation, and COMPSAC. He is particularly interested in software refactoring, AI-based software engineering, and software quality. He is also interested in developing practical tools to assist software engineers.

Inam Illahi received a graduate degree in arts from the University of Sargodha, Sargodha, Pakistan, in 2007. He received the M.S. degree in software engineering from the Chalmers University of Technology, Gothenburg, Sweden, in 2010. He is currently working toward the Ph.D. degree in software engineering with the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China.
He is particularly interested in software maintenance, crowdsourcing, and machine learning.