
This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON RELIABILITY

CNN-Based Automatic Prioritization of Bug Reports

Qasim Umer, Hui Liu, and Inam Illahi

Abstract—Software systems often receive a large number of bug reports. Triagers read through such reports and assign different priorities to different reports so that important and urgent bugs can be fixed on time. However, manual prioritization is tedious and time-consuming. To this end, in this article, we propose a convolutional neural network (CNN) based automatic approach to predict the multiclass priority for bug reports. First, we apply natural language processing (NLP) techniques to preprocess textual information of bug reports and convert the textual information into vectors based on the syntactic and semantic relationship of words within each bug report. Second, we perform software engineering domain specific emotion analysis on bug reports and compute the emotion value for each of them using a software engineering domain repository. Finally, we train a CNN-based classifier that generates a suggested priority based on its input, i.e., vectored textual information and emotion values. To the best of our knowledge, it is the first CNN-based approach to bug report prioritization. We evaluate the proposed approach on open-source projects. Results of our cross-project evaluation suggest that the proposed approach significantly outperforms the state-of-the-art approaches and improves the average F1-score by more than 24%.

Index Terms—Bug reports, deep learning, prioritization, reliability.

Manuscript received February 11, 2019; revised July 23, 2019 and October 9, 2019; accepted December 1, 2019. The work was supported by the National Natural Science Foundation of China under Grant 61690205 and Grant 61772071. Associate Editor: B. Xu. (Corresponding author: Hui Liu.)
The authors are with the School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China (e-mail: qasimumer667@hotmail.com; [email protected]; [email protected]).
Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TR.2019.2959624

I. INTRODUCTION

SOFTWARE systems are often released with defects because of inadequate testing and system complexity [1]. Developers want feedback from users to resolve the defects that users experienced while using released systems. They employ issue reporting systems to collect feedback from users. Bugzilla [2], JIRA [3], and GitHub [4] are the most popular issue reporting systems. Users utilize such systems to report defects and track their progress. The utilization of issue tracking systems is standard practice in software development and maintenance [5] that helps developers to resolve reported defects. The resolution of reported defects has become an essential, expensive, and critical task in software maintenance due to the exponential growth of defects in complex software systems.

A bug report contains information that can be helpful in debugging and explains how exactly the product crashed. A typical bug report includes predefined fields (e.g., product, priority, and severity), attachments (e.g., screenshots of the bug for developers), textual information (e.g., summary and description), and comments from the users and developers. Reporters provide all the required information while reporting bugs.

Software systems often receive a large number of bug reports [6]. Triagers read through such reports and assign different priorities to different reports so that important and urgent bugs can be fixed on time. Developers often do not fix bugs for years due to various constraints, e.g., time and availability of developers [7]. Therefore, triagers need to prioritize bug reports adequately so that developers can fix the ranked bugs in sequence. Different bug tracking systems have different priority levels for a reported bug. Therefore, the prioritization of bug reports is actually a multiclass classification problem. For example, in Bugzilla, the priority of a bug report can be defined from p1 to p5, where p1 is the highest priority and p5 is the lowest priority. Prioritizing bug reports is often a manual and time-consuming process. After a user reports a new bug through the bug tracking system, a triager is responsible for first examining the reported bug. Based on the examination, the triager decides its priority. The manual process of assigning priority increases the resolution time of the bug report [8]. To this end, some automated approaches have been proposed to suggest the priority of bug reports [5], [9]–[11]. However, the performance of such approaches deserves further significant improvement.

The machine learning and deep learning classifiers used for text classification have their own limitations for bug prioritization. For example, the support vector machine (SVM), the best machine learning algorithm reported by Umer et al. [9], requires feature modeling efforts. Long short-term memory (LSTM) does not extract position-invariant features (features from the text that are similar in semantics but different in structure) as its output is position dependent [12]. Consequently, LSTM does not identify patterns (features) like "hate a lot" in the text [13]. In contrast to SVM and LSTM, a convolutional neural network (CNN) not only extracts patterns and position-invariant features independently but also eliminates the feature modeling efforts for bug prioritization [14]. Notably, such pattern and position-invariant features are effective in emotion analysis. However, tuning hyperparameter settings to avoid the overfitting problem is essential for a CNN to perform optimally, which requires a deep understanding of both the CNN and the problem to be resolved by it.

To this end, in this article, we propose a CNN-based automatic multiclass (p1–p5) prioritization for bug reports (cPur). Notably, we are the first to exploit CNN for bug report prioritization. We apply natural language processing (NLP) techniques to preprocess textual information of bug reports. From the

0018-9529 © 2019 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

preprocessed bug reports, we perform emotion analysis because hidden emotions may influence the priority of bug reports [9]. Users are rarely impassive when encountering tiresome bugs. Consequently, the bug reports specified by such users may contain evident emotion that may reflect how urgently users want the bugs to be fixed. To classify the emotion of bug reports, we calculate the emotion of each bug report. Although emotion analysis has been leveraged for the prediction of bug priority [9], the proposed approach differs from existing work in that we compute emotions using a distributional semantic model [15] and train it on a software engineering dataset. In contrast, Umer et al. [9] leverage a generic emotion repository, SentiWordNet. A distributional semantic model [15] has been proved more effective than emotion repositories, and training it on a software engineering specific dataset may significantly improve the accuracy of emotion computation for bug reports. Based on the preprocessed textual information of bug reports, we construct a vector for each bug report with a word2vector model. We pass the constructed vector and the emotion of each bug report as input to a CNN-based classifier that predicts the priority. For the multiclass priority prediction, we train a CNN-based classifier. Finally, we evaluate the proposed approach on open-source projects. The results of the cross-project evaluation suggest that the proposed approach is accurate. On average, it improves the average F1-score upon state-of-the-art approaches by more than 24% (detailed results are provided in Section IV-E). Note that the reason to choose state-of-the-art approaches [1], [5], [9] as baselines is provided in Section IV-A.

This article makes the following contributions.
1) An automated CNN-based approach to suggest the priority of bug reports. To the best of our knowledge, it is the first CNN-based prioritization approach for bug reports.
2) Evaluation results of the proposed approach on the history data suggest that the proposed CNN-based approach is accurate in priority suggestion of bug reports and outperforms the state-of-the-art approaches.

The rest of this article is organized as follows. Section II discusses the related work. Section III defines the proposed approach in detail. Section IV describes the evaluation process of the proposed approach and its results. Section V explains the threats. Section VI concludes this article.

II. RELATED WORK

A. Machine Learning Based Severity Identification/Prioritization of Bug Reports

Lamkanfi et al. [16] investigated whether the severity of a reported bug can be accurately predicted by analyzing its textual description using text mining algorithms. They applied a naive Bayes algorithm on the history data collected from the open-source community (Mozilla, Eclipse, and GNOME). They reported that both precision and recall vary between 0.65 and 0.75 with Mozilla and Eclipse and between 0.70 and 0.85 with GNOME.

Abdelmoez et al. [17] proposed an approach that uses a naive Bayes classifier to predict the priority of bug reports. They used the data of four systems taken from three large open-source projects: Mozilla, Eclipse, and GNOME. They prioritized the bug reports according to their mean time.

Tian et al. [18] proposed a novel approach leveraging information retrieval, in particular a BM25-based document similarity function, that automatically predicts the severity of bug reports. The proposed approach automatically analyzes bug reports and focuses on predicting fine-grained severity labels, namely the different severity labels of Bugzilla including blocker, critical, major, minor, and trivial. Results suggest that fine-grained severity prediction outperforms the state-of-the-art study and brings significant improvement.

Tian et al. [5] proposed an automated classification approach (DRONE) for priority prediction of bug reports. They employed linear regression (LR) for priority classification and achieved an average F1-score of up to 29%.

Alenezi and Banitaan [8] adopted naive Bayes, decision tree, and random forest to execute the priority prediction. They used two feature sets: 1) one based on TF-weighted words of bug reports, and 2) one based on the classification of bug report attributes. Evaluation results suggest that the second feature set performed better than the first feature set, and that random forests and decision trees outperform naive Bayes.

Tian et al. [1] predicted the priority of bug reports using the nearest neighbor approach to identify fine-grained bug report labels. They applied the proposed approach to a larger collection of bug reports consisting of more than 65 000 Bugzilla reports.

Tian et al. [19] used three open-source software systems (OpenOffice, Mozilla, and Eclipse) and found that around 51% of the duplicate bug reports have inconsistent human-assigned severity labels even though they refer to the same software problem. Results suggest that current automated approaches perform well and their agreement with human-assigned severity labels varies from 77% to 86%.

Choudhary [20] recently developed a model for priority prediction using an SVM that assigns priorities to Firefox crash reports in the Mozilla Socorro server based on the frequency and entropy of the crashes.

In conclusion, researchers have proposed a number of machine learning approaches to predict the priority of bug reports. Our proposed approach differs from the existing approaches in that we are the first to apply CNN-based prioritization of bug reports.

B. Deep Learning

Deep learning is an emerging machine learning technique that enables a machine to analyze complex and abstract data features through hierarchical neural networks. Lately, deep learning has been getting a lot of attention and attaining results that are sometimes better than those of humans. It has achieved significant results in the fields of computer vision [21], speech recognition [22], and sentiment analysis [23].

Deep learning is also increasingly prevalent in the field of software engineering and plays an important role in software engineering tasks, e.g., development, testing, and maintenance [24]. According to Li et al., different studies employed deep learning techniques for the maintenance of software in

Fig. 1. Overview of the proposed approach.
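
The flow summarized in Fig. 1 (preprocess the text, compute the emotion, build word vectors, classify into p1 to p5) can be sketched as a small pipeline. This is a minimal sketch only: `BugReport`, `suggest_priority`, and the four callables are hypothetical names introduced here for illustration, and the toy stand-ins passed in at the bottom are placeholders, not the NLP, Senti4SD, word2vector, or CNN components used in the article.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

PRIORITIES = ("p1", "p2", "p3", "p4", "p5")  # p1 is the highest, p5 the lowest


@dataclass
class BugReport:
    text: str                       # textual information t
    priority: Optional[str] = None  # assigned priority p (may be left blank)


def suggest_priority(
    report: BugReport,
    preprocess: Callable[[str], List[str]],
    emotion_of: Callable[[Sequence[str]], str],
    vectorize: Callable[[Sequence[str]], List[List[float]]],
    classify: Callable[[str, List[List[float]]], str],
) -> str:
    """Chain the steps of the overview: words -> emotion + vectors -> priority."""
    words = preprocess(report.text)
    emotion = emotion_of(words)
    vectors = vectorize(words)
    priority = classify(emotion, vectors)
    if priority not in PRIORITIES:
        raise ValueError(f"unexpected priority label: {priority!r}")
    return priority


# Toy stand-ins (assumptions for illustration only).
example = BugReport(
    "First run wizard delete SDK if androidsdk.repo and "
    "androidsdk.dir point to the same dir"
)
suggested = suggest_priority(
    example,
    preprocess=lambda text: text.lower().split(),
    emotion_of=lambda words: "negative" if {"hate", "crash"} & set(words) else "positive",
    vectorize=lambda words: [[float(len(w))] for w in words],
    classify=lambda emotion, vectors: "p1" if len(vectors) > 10 else "p3",
)
print(suggested)
```

Passing the steps in as callables keeps the sketch independent of any particular library; each placeholder can be swapped for a real component without changing the data flow.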

which some of them are related to bug handling, e.g., bug localization, bug report summarization, bug triage, and duplicate bug detection [24]. The successful applications of deep learning techniques to such software engineering tasks inspire us to apply deep learning techniques, e.g., CNN, to prioritization of bug reports.

C. Emotion Analysis

Emotion analysis has achieved excellent results in semantic parsing, search query retrieval [25], sentence modeling [26], traditional NLP tasks [27], and sentiment analysis [23]. Some state-of-the-art classification approaches based on emotion analysis are discussed in the following.

Ouyang et al. [23] used a recurrent neural network to classify Italian Twitter messages for predicting emotions. The work is built upon a deep learning approach. They leveraged large amounts of weakly labeled data to train a two-layer CNN. To train their network, they applied a form of multitask training. Their work participated in the EvalItalia-2016 competition and outperformed all other approaches to the sentiment analysis task.

Umer et al. [9] recently proposed an emotion-based automatic approach (eApp) for priority classification. They employed emotion analysis for priority classification and used the SVM (a machine learning algorithm) for the prioritization of bug reports. Results suggest that the proposed approach outperforms the state-of-the-art approach and improves F1-score by more than 24%. Our proposed approach also employs emotion analysis for priority classification but differs from their approach in that we exploit a distributional semantic model to compute emotions of bug reports as an essential step for priority classification.

Other studies related to bug reports include the detection of bug report duplication [28]–[30], the recommendation of an appropriate developer for a new bug report [31]–[33], and the prediction of bug fixing time [34]–[36].

III. APPROACH

A. Overview

An overview of the CNN-based prioritization for bug reports (cPur) is presented in Fig. 1. The proposed approach recommends a priority level for each bug report as follows.
1) First, we collect the history data of bug reports of open-source projects as training data.
2) Second, we apply NLP techniques to the bug reports for preprocessing.
3) Third, we perform emotion analysis on bug reports and compute the emotion of each bug report.
4) Fourth, we create a vector for each bug report by using its preprocessed words.
5) Finally, we train a CNN-based classifier for the priority prediction. We pass the generated vector and the emotion of each bug report as input to the classifier, which predicts the priority of bug reports.

We introduce each of the key steps of the proposed approach in the following sections.

B. Illustrating Example

We use the following example to illustrate how the proposed approach prioritizes bug reports. It is an Android bug report (81613) collected from Google Issue Tracker [37]. It was created on December 3, 2014, and closed on December 3, 2014.
1) Product = “Android Studio” is the name of the product that is affected by the bug.
2) Textual Information = “First run wizard delete SDK if androidsdk.repo and androidsdk.dir point to the same dir” explains the bug. It may contain information on bug regeneration.
3) Priority = “p1” is the priority of the example bug report, which could be left blank when the bug is reported. A triager may then assign the priority to the bug report.

We present the details on how the proposed approach works for the illustrating example in the following section.

C. Problem Definition

A bug report r from a set of bug reports R can be formalized as

$$r = \langle t, p \rangle \tag{1}$$

where t is the textual information of r and p is the priority assigned to r. For the illustrating example presented in Section III-B, we have

$$r_e = \langle t_e, p_e \rangle \tag{2}$$

where $t_e$ = “First run wizard delete SDK if androidsdk.repo and androidsdk.dir point to the same dir,” and $p_e$ = p1.

The proposed approach suggests the priority of the new bug report as either p1, p2, p3, p4, or p5, where p1 is the highest

priority and p5 is the lowest priority. Consequently, the automatic prioritization of a new bug report r could be defined as a mapping f

$$f : r \rightarrow c, \quad c \in \{p_1, p_2, p_3, p_4, p_5\}, \; r \in R \tag{3}$$

where c is a suggested priority from the priority set (p1, p2, p3, p4, p5).

D. Preprocessing

Bug reports contain irrelevant and unwanted text, e.g., punctuation. Feeding irrelevant text to the classification algorithms is an overhead as it increases the processing time and utilizes more memory for processing. Therefore, we perform preprocessing to increase the performance of the proposed approach and to make it cost effective. NLP techniques are often used for the preprocessing of bug reports, including tokenization, stop-word removal, negation handling, spell correction, modifier word recognition, word inflection, and lemmatization. We employ the following preprocessing steps to clean the textual information of bug reports.
1) Tokenization: The text of bug reports often contains words and special characters, e.g., spaces and punctuation marks. Tokenization removes the special characters and decomposes the text into words (tokens).
2) Stop-word removal: Textual documents often contain words that are used to make sentences meaningful but do not have meaning individually. Such words are known as stop-words. We remove such words from the words extracted in tokenization. Note that bug reports contain some programming-related words; however, we only remove the English language stop-words from the text.
3) Spell correction: Users type the unstructured fields while reporting bugs, e.g., the textual information (summary and description), which may contain spelling mistakes. Therefore, we apply an automated way to correct spelling mistakes.
4) Negation and modifiers: The usage of a negation or modifier with an English word changes its impact in the sentence. For example, good and not good are opposite to each other. Similarly, good and very good have different intensities. We apply negation and word modifier recognition during the calculation of the emotion of each bug report, as mentioned in Section III-E.
5) Word inflection and lemmatization: Word inflection converts the words into their singular form. For example, inflection converts the word errors into error. Lemmatization, in turn, converts comparative and superlative words into their base words. For example, lemmatization converts the word crashed into crash. We apply both word inflection and lemmatization on the extracted words and finally fold them into lowercase.

To perform preprocessing, we utilize the Python Natural Language Toolkit (NLTK) [38] and the Python TextBlob Library [39].

After preprocessing, a bug report r can be represented as

$$r = \langle ws, p \rangle \tag{4}$$

$$ws = \langle w_1, w_2, \ldots, w_n \rangle \tag{5}$$

where $w_1, w_2, \ldots, w_n$ are the words (tokens) from the textual description of r after preprocessing.

For the illustrating example presented in Section III-B, the second column of Table I presents the preprocessing results of the example bug report $r_e$. After preprocessing, we have

$$r_e = \langle \text{wizard}, \text{delete}, \text{sdk}, \ldots, \text{dir}, p_1 \rangle \tag{6}$$

where wizard, delete, sdk, ..., dir are the preprocessed words from $r_e$.

TABLE I
EXAMPLE OF PREPROCESSING AND EMOTION ANALYSIS

E. Emotion Analysis

Users are rarely impassive when encountering tiresome bugs. As a result, the bug reports specified by such users may contain evident emotion. For example, the bug report 5083, “Breakpoint not hit,” has negative emotion due to the word hit, whereas the bug report 8423, “Thank you, that was really helpful. I want them to resize based on the length of the data they’re showing,” has positive emotion due to the words Thank and helpful. To classify whether the emotion of the reporter in a bug report is positive or negative, we calculate the emotion of each bug report. There are many repositories for the emotion analysis of text documents, e.g., SentiWordNet [40]. However, to the best of our knowledge, SentiStrengthSE [41], SentiCR [42], Senti4SD [15], and EmoTxt [43] are the repositories used for emotion analysis of software engineering text. We choose Senti4SD for emotion analysis because it is a commonly used repository and outperforms SentiStrength, SentiStrengthSE, and SentiCR for text classification in the software engineering domain [15]. We input each bug report to the distributional semantic model [15] to compute its emotion. The distributional semantic model creates a mathematical point in a high-dimensional vector space to represent words. It depends on the distributional hypothesis, which holds that semantically similar words belong to the same context [15]. It returns the emotion of a given bug report based on its emotion words, negation, and modifiers. We store the computed emotions with the corresponding bug reports. After emotion analysis, a bug report can be represented as

$$r = \langle e, w_1, w_2, \ldots, w_n, p \rangle \tag{7}$$

where e is the emotion of r.

For the motivating example presented in Section III-B, we input the preprocessed text of the example bug report to the

Fig. 2. Overview of preprocessing.
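
The preprocessing steps outlined in Fig. 2 can be approximated with the standard library alone. The sketch below is illustrative and makes several assumptions: the stop-word set is a tiny hand-picked sample rather than a full English list, `singularize` is a deliberately naive plural rule standing in for word inflection, and spell correction and lemmatization are omitted. The article itself relies on NLTK and TextBlob for these steps; `preprocess`, `singularize`, and `STOP_WORDS` are names introduced here.

```python
import re
from typing import List

# Small illustrative sample, NOT a complete English stop-word list.
STOP_WORDS = {
    "a", "an", "and", "the", "if", "to", "of", "in", "on",
    "is", "it", "that", "this", "for", "with",
}


def singularize(word: str) -> str:
    """Naive inflection: drop a trailing 's' (e.g., errors -> error)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text: str) -> List[str]:
    """Tokenize, lowercase, remove stop-words, and fold simple plurals."""
    tokens = re.findall(r"[A-Za-z]+", text)   # drops spaces and punctuation
    words = [t.lower() for t in tokens]
    words = [w for w in words if w not in STOP_WORDS]
    return [singularize(w) for w in words]


summary = ("First run wizard delete SDK if androidsdk.repo and "
           "androidsdk.dir point to the same dir")
print(preprocess(summary))
```

Because the regular expression keeps only alphabetic runs, identifiers such as androidsdk.repo are split at the dot; a production tokenizer would need a rule for programming-related tokens.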

Fig. 3. Overview of the priority prediction model.
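
The core operations of the model in Fig. 3 (a filter slid over windows of d word vectors with a hyperbolic tangent nonlinearity, max-over-time pooling of the resulting feature map, and a softmax over priority levels) can be illustrated in plain Python. This is a minimal sketch under stated assumptions: the function names are introduced here, the weights and input vectors are hand-picked toy values, and no training, dropout, or embedding layer is included.

```python
import math
from typing import List, Sequence


def feature_map(vectors: Sequence[Sequence[float]], filt: Sequence[float],
                bias: float, d: int) -> List[float]:
    """Apply one filter of d*k weights to each window of d word vectors."""
    n = len(vectors)
    fmap = []
    for i in range(n - d + 1):
        # Concatenate the d vectors of the window into one flat list.
        window = [x for v in vectors[i:i + d] for x in v]
        activation = sum(w * x for w, x in zip(filt, window)) + bias
        fmap.append(math.tanh(activation))
    return fmap


def max_over_time(fmap: Sequence[float]) -> float:
    """Keep the single largest feature produced by a filter."""
    return max(fmap)


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn scores into a probability distribution over priority levels."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy input: n = 4 word vectors of dimension k = 2, one filter with d = 2.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
toy_filter = [1.0, 0.0, 0.0, 1.0]
fmap = feature_map(toy_vectors, toy_filter, bias=0.0, d=2)
pooled = max_over_time(fmap)
probs = softmax([pooled, 0.0, 0.0, 0.0, 0.0])  # five priority levels
print(len(fmap), pooled, sum(probs))
```

With n = 4 and d = 2 the feature map has n − d + 1 = 3 entries, and pooling keeps only the largest one, so each filter contributes exactly one feature regardless of the length of the bug report.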

Senti4SD and we have

$$r_e = \langle \text{positive}, \text{wizard}, \text{delete}, \text{sdk}, \ldots, \text{dir}, p_1 \rangle \tag{8}$$

where positive is the calculated emotion of the example bug report.

F. Word2Vector Modeling

In this step of the automatic prioritization approach of bug reports, we construct a vector for each bug report. We pass the preprocessed words $w_1, w_2, \ldots, w_n$ from (7) to a skip-gram-based word2vector model [44]. It is an efficient method for learning continuous word representation (a high-quality distributed vector) based on a single hidden layer neural network. The model captures a large number of precise syntactic and semantic word relationships and returns a k-dimensional vector.

For the motivating example presented in Section III-B, the preprocessed words wizard, delete, sdk, ..., dir are passed to the skip-gram model to convert them into a vector. For example, wizard is presented as [0.102, −4.31, −0.003, ...].

G. Priority Prediction Model

The proposed approach exploits the CNN to relate a vector and an emotion with the priority of a bug report. We use a CNN to predict the priority of bug reports for the following reasons. First, the CNN uses the vector concatenation method to concatenate incoming inputs into one long input vector. Consequently, the CNN can handle long-term dependencies better than a recurrent neural network. Second, convolving differently sized filters with the long input vector enables the CNN to encode short-term and long-term dependencies with small and large filter sizes, respectively. Third, by using different filter sizes, the CNN does not suffer from the exploding gradient problem of a recurrent neural network [45].

1) Overview: The overview of the proposed model is shown in Fig. 3. Given a bug report r with its emotion e (computed in Section III-E) and k-dimensional vector x (constructed in Section III-F), a bug report of maximal length n (to compute n, we first find the preprocessed bug reports with maximum length and apply padding on the remaining bug reports) can be represented as x (i.e., the input of the CNN in Fig. 3), which is calculated as

$$x = \langle e, u_1, u_2, \ldots, u_n \rangle \tag{9}$$

$$x = \langle v_1, v_2, v_3, \ldots, v_n \rangle \tag{10}$$

where $v_i$ is the vector representation of e and $w_i$.

2) Filter Operation: The CNN applies a filter $w \in \mathbb{R}^{dk}$ to a window of d words to generate a new feature. For instance, a new feature $c_i$ is generated from a window of words $v_{i:i+d-1}$ that can be formalized as

$$c_i = f(w \cdot v_{i:i+d-1} + b) \tag{11}$$

where b represents a bias term that belongs to $\mathbb{R}$, and f is a hyperbolic tangent nonlinear function. This filter generates a feature map using each window of the features $\langle v_{1:d}, v_{2:d+1}, \ldots, v_{n-d+1:n} \rangle$. A generated feature map c that

belongs to $\mathbb{R}^{n-d+1}$ can be formalized as

$$c = \langle c_1, c_2, \ldots, c_{n-d+1} \rangle. \tag{12}$$

For the motivating example presented in Section III-B, we first feed inputs (emotion and preprocessed text) into an embedding layer that converts them into numerical vectors. Second, we pass the numerical vectors into a CNN with dropout = 0.2, because we observe that this setting results in the minimum loss in the training phase when the dropout varies from 0.0 to 0.5 (where the step size is 0.1). Notably, we include dropout to prevent overfitting. A fully connected softmax layer involves most of the parameters. Thus, neurons create dependencies between each other that restrict the individual power of each neuron, leading to overfitting of the training data. Consequently, overfitting decreases the performance of the model. Finally, we use three layers of the CNN. Notably, we use three convolutional layers because each convolution generates tensors of different shapes due to multiple filters. We create a layer for each of the tensors to iterate through them and merge the results into one big feature vector. We forward the output of the CNN to a flatten layer that converts the numerical vectors into a one-dimensional vector. We apply filters of different lengths (3, 4, and 5) on each n × k vector and generate new features.

3) Pooling Operation: A max-over-time pooling operation is applied on the feature map to get the maximum value ĉ from (12). The pooling operation helps to find the most important features from its feature map, i.e., the features having the highest value.

The proposed model applies multiple filters of different sizes and extracts one feature from each filter. Such features construct the penultimate layer and are forwarded to a fully connected softmax layer. The output of the softmax layer is the probability distribution over the priority levels.

IV. EVALUATION

In this section, the performance of the proposed approach is evaluated on the bug reports of four open-source Eclipse projects.

A. Research Questions

The evaluation investigates the following research questions.
1) RQ1: Does cPur outperform the state-of-the-art approaches in the prediction of bug reports?
2) RQ2: How do different inputs (text and emotion) influence the performance of cPur?
3) RQ3: How does preprocessing influence the performance of cPur?
4) RQ4: How does the length of filters influence the performance of cPur?
5) RQ5: How does the training size influence the performance of cPur?
6) RQ6: Does the convolutional neural network outperform other classification algorithms (traditional machine learning algorithms and LSTM) in predicting the priority of bug reports?

The first research question (RQ1) examines the performance improvement of cPur against the state-of-the-art approaches. To this end, we select three approaches, eApp (proposed by Umer et al. [9]), DRONE (proposed by Tian et al. [5]), and DRONE∗ (proposed by Tian et al. [1]), for the comparison because of the following reasons. First, eApp, DRONE, and DRONE∗ were designed for automatic prioritization of bug reports, as our approach is. Second, they were recently proposed and represent the state of the art.

The second research question (RQ2) investigates the influence of the given inputs. We provide two inputs (the preprocessed text of bug reports and their emotion analysis results) to the CNN-based approach for the priority prediction of bug reports. We want to know to what extent each of them affects the performance.

The third research question (RQ3) investigates the impact of the preprocessing on the performance of cPur. Most textual datasets are not clean, i.e., they may contain punctuation. Therefore, we perform preprocessing (mentioned in Section III-D) to clean the given dataset.

The fourth research question (RQ4) investigates the impact of the length of filters on the effectiveness of the proposed approach.

The fifth research question (RQ5) investigates the relationship between the training size and the effect of the proposed approach.

The sixth research question (RQ6) compares the selected classification algorithm (CNN) with alternatives. We choose SVM because Umer et al. [9] recently declared it the best machine learning algorithm for priority prediction, whereas we select LSTM because it has proven effective in NLP [46].

B. Dataset

We exploit the dataset created by Tian et al. [5] and reused by Tian et al. [1] and Umer et al. [9]. They investigated the bug repository of Eclipse, which is an open-source integrated development platform. They collected the bug reports submitted from October 2001 to December 2007 from Bugzilla [47]. Notably, they only collected the defect reports and ignored the enhancement reports. The resulting dataset includes the bug reports of four open-source projects: Java development tools (JDT), Eclipse’s C/C++ Development Tooling (CDT), Plug-in Development Environment (PDE), and Platform. Their summary attribute defines the reported bugs, whereas their priority attribute indicates their importance and urgency. The total number of bug reports in the dataset is 80 000, of which 25%, 28%, 16%, and 31% belong to CDT, JDT, PDE, and Platform, respectively. This dataset is also used by Umer et al. and Tian et al. to evaluate their approaches, which are selected in this article as baseline approaches.

C. Process

We evaluate the proposed approach as follows. First, we reuse the bug reports R of four open-source projects from Bugzilla and apply NLP techniques to preprocess them, as mentioned in Section III-D. Second, we carry out a cross-project validation on R. We partition the dataset R into four sets based on the project, denoted as $S_i$ (i = 1...4). For the ith cross validation, we consider all bug reports except for those in $S_i$ as a training dataset
and treat the bug reports in $S_i$ as a testing dataset. For the ith cross validation, the evaluation process is as follows.
1) First, we select all training reports (TR) from the training dataset, which is the union of all sets except $S_i$, calculated as
$$TR_i = \bigcup_{j \in [1,4] \,\wedge\, j \neq i} S_j. \tag{13}$$
2) Second, we train an LSTM with data from TR.
3) Third, we train a CNN with data from TR.
4) Fourth, we train the machine learning algorithms (SVM [9] and LR [1], [5]) with data from TR.
5) Fifth, for each report in $S_i$, we predict its priority using the trained LSTM, CNN, LR, and SVM and compare the predicted priority with its real priority.
6) Finally, we compute the evaluation metrics for each algorithm to compare their performances.

D. Metrics

Given the bug reports R, the performance of the proposed approach is evaluated by calculating the priority-specific precision P, recall R, and F1-score F1 as
$$P = \frac{TP}{TP + FP} \tag{14}$$
$$R = \frac{TP}{TP + FN} \tag{15}$$
$$F1 = \frac{2 \cdot P \cdot R}{P + R} \tag{16}$$
where P, R, and F1 are, respectively, precision, recall, and F1-score of the approaches for priority prediction of the reports in R whose actual priority is $P_i$. TP is the number of reports that are truly predicted as $P_i$, FP is the number of reports that are falsely predicted as $P_i$, and FN is the number of reports that are not predicted as $P_i$ although they are actually $P_i$.

Our multiclass classification problem has five priority classes (levels) as labels. Therefore, we also perform macroanalysis and microanalysis for all priority levels C, which are commonly used to evaluate the performance of multiclass classification [9], [48]. Macroanalysis combines precision and recall of the multiclass priority levels by averaging their values: it normalizes the sum of the precision of each priority level by the number of priority levels. Microanalysis follows a similar idea but computes precision and recall from the sums of the true positive, true negative, false positive, and false negative values of all priority levels. In contrast to macroanalysis, microanalysis takes the frequency of each priority level into consideration. We calculate macroprecision $P_{mac}$, macrorecall $R_{mac}$, macro F1-score $F1_{mac}$, microprecision $P_{mic}$, microrecall $R_{mic}$, and micro F1-score $F1_{mic}$ as follows:
$$P_{mac} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FP_i} \tag{17}$$
$$R_{mac} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{TP_i}{TP_i + FN_i} \tag{18}$$
$$F1_{mac} = \frac{1}{|C|} \sum_{i=1}^{|C|} \frac{2 \cdot P_{mac} \cdot R_{mac}}{P_{mac} + R_{mac}} \tag{19}$$
$$P_{mic} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FP_i)} \tag{20}$$
$$R_{mic} = \frac{\sum_{i=1}^{|C|} TP_i}{\sum_{i=1}^{|C|} (TP_i + FN_i)} \tag{21}$$
$$F1_{mic} = \frac{2 \cdot P_{mic} \cdot R_{mic}}{P_{mic} + R_{mic}}. \tag{22}$$

We also compute the Hamming-loss error. It measures the average fraction of report-to-priority-level relevance labels that are falsely predicted [49]. It normalizes the loss over the total number of priority levels and the total number of bug reports using the priority prediction error (an incorrect priority level is predicted) and the missing error (a relevant priority level is not predicted). The Hamming-loss error HE can be formalized as
$$HE = \frac{1}{|N| \cdot |P|} \sum_{i=1}^{|N|} \sum_{j=1}^{|P|} \operatorname{xor}(y_{i,j}, z_{i,j}) \tag{23}$$
where $|N|$ is the number of bug reports, $|P|$ is the number of priority levels, $y_{i,j}$ is the true priority level, and $z_{i,j}$ is the predicted priority level.

E. RQ1: Comparison Against the State-of-the-Art Approaches

To answer the research question RQ1, we compare cPur against the state-of-the-art approaches (eApp, DRONE∗, and DRONE) in priority prediction of bug reports. To this end, we perform microanalysis and macroanalysis to find out the performance improvement of cPur against each class. We also conduct the priority-level and project-level comparison to evaluate the performance improvement of cPur for each priority and each project, respectively.

1) Comparison on Microanalysis and Macroanalysis: Evaluation results of microanalysis and macroanalysis are presented in Table II. The first column presents the approaches. Columns 2–4 and 5–7 present the performance in microanalysis and macroanalysis, respectively. The last column presents the error. The rows of the table present the performance results of cPur, eApp, DRONE∗, and DRONE, respectively.

From Table II, we make the following observations.
1) cPur outperforms DRONE∗ and DRONE in both macroanalysis and microanalysis. It indicates that cPur improves the F1-score not only for all priority levels (as a whole system) but also for each priority level (individually). However, we also notice that cPur outperforms eApp in macroanalysis only, and its precision in microanalysis is slightly lower than that of eApp.
2) The performance improvements of cPur upon eApp in F1-score for macroanalysis and microanalysis are 22.94%
= (56.99%–46.36%)/46.36% and 5.77% = (48.98%–46.31%)/46.31%, respectively. Moreover, the performance improvements of cPur upon DRONE∗ in F1-score for macroanalysis and microanalysis are 25.45% = (56.99%–45.43%)/45.43% and 12.65% = (48.98%–43.48%)/43.48%, respectively. Similarly, the performance improvements of cPur upon DRONE in F1-score for macroanalysis and microanalysis are 41.55% = (56.99%–40.26%)/40.26% and 22.07% = (48.98%–40.13%)/40.13%, respectively.
3) cPur reduces the error upon eApp, DRONE∗, and DRONE by 2.5% = (0.4397–0.4291)/0.4291, 7.81% = (0.4626–0.4291)/0.4291, and 11.6% = (0.4790–0.4291)/0.4291, respectively.

TABLE II
PERFORMANCE ON MICRO AND MACRO LEVELS

2) Comparison on Priority Level and Project Level: Evaluation results of each priority and each project for the involved approaches are presented in Tables III and IV, respectively. Notably, we apply tenfold cross validation and cross-project evaluation techniques to produce the priority-level and project-level results, respectively. Therefore, the reproduced results of the state-of-the-art approaches may differ from the originally reported results (using n-cross validation).

In Table III, the first column presents the approaches. Columns 2–4, 5–7, 8–10, 11–13, and 14–16 present the performance of priority p1, p2, p3, p4, and p5, respectively. The rows of the table present the performance results of cPur, eApp, DRONE∗, and DRONE, respectively.

TABLE III
PERFORMANCE ON PRIORITY LEVEL

In Table IV, the first column presents the approaches. Columns 2–4, 5–7, 8–10, and 11–13 present the performance of projects CDT, JDT, PDE, and Platform, respectively. The rows of the table present the performance results of cPur, eApp, DRONE∗, and DRONE, respectively.

TABLE IV
PERFORMANCE ON PROJECT LEVEL

From Tables III and IV, we make the following observations.
1) On each priority level, cPur outperforms eApp, DRONE∗, and DRONE. The improvement of cPur upon eApp in F1-score varies from 4.30% = (73.81%–70.77%)/70.77% to 24.34% = (46.18%–37.14%)/37.14%. Moreover, the improvement of cPur upon DRONE∗ in F1-score varies from 5.46% = (73.81%–69.99%)/69.99% to 45.13% = (46.18%–31.82%)/31.82%. Similarly, the improvement of cPur upon DRONE in F1-score varies from 7.25% = (73.81%–68.82%)/68.82% to 64.04% = (46.18%–28.15%)/28.15%.
2) On each project level, cPur outperforms eApp, DRONE∗, and DRONE. The improvement of cPur upon eApp in F1-score varies from 27.10% = (62.07%–48.84%)/48.84% to 62.79% = (73.47%–45.13%)/45.13%. Moreover, the improvement of cPur upon DRONE∗ in F1-score varies from 31.24% = (62.07%–45.32%)/45.32% to 78.80% = (73.47%–41.09%)/41.09%. Similarly, the improvement of cPur upon DRONE in F1-score varies from 44.79% = (62.07%–42.87%)/42.87% to 88.75% = (73.47%–38.92%)/38.92%.

To validate the significant difference among cPur, eApp, DRONE∗, and DRONE, we employ one-way analysis of variance (ANOVA). ANOVA determines whether there are any statistically significant differences between the means of independent (unrelated) groups [50], where the unit of analysis in ANOVA is a project. ANOVA is employed because all approaches are applied to the same projects. It may validate whether the only difference (single factor, i.e., different approaches) leads to the difference in performance. We compute ANOVA in Excel with its default settings and do not involve any adjustment. Notably, ANOVA on F1-score is conducted independently, where the unit
of analysis is a project. Table V describes the results of the ANOVA analysis, which show that F > Fcrit and p-value < (alpha = 0.05) are true for F1-score, where F = 10.82, Fcrit = 4.26, and p-value = 0.004. It suggests that the factor (using different approaches) makes a significant difference in F1-score.

TABLE V
ANOVA ANALYSIS ON F1-SCORE

Moreover, we perform the Wilcoxon test (using Stata software built-in settings) to calculate the difference between approaches and analyze these differences. The results show that p-value < (alpha = 0.05) is true for F1-score, where p-value = 0.02.

Furthermore, we quantify the effect size to check the difference between approaches by employing Cohen’s d, where d >= 0.2, d >= 0.5, and d >= 0.8 represent a small, medium, and large difference, respectively. The result (d = 0.75) suggests that the difference between approaches is medium.

Finally, we compute the time cost of the preprocessing, emotion analysis, Word2Vector modeling, training, and testing processes to investigate the efficiency of cPur. The results suggest that cPur is efficient. The average time cost of the preprocessing, emotion analysis, and Word2Vector modeling is 2.01 min, 5.67 min, and 2.98 min, respectively. Notably, parts-of-speech tagging significantly increases the time cost of emotion analysis. Moreover, the training time cost of cPur (4.23 min) is higher than that of DRONE∗ (3.52 min), DRONE (3.52 min), and eApp (3.98 min). However, using the trained models, cPur takes 1.06 min in priority prediction, which is equal to DRONE∗ and DRONE, and faster than eApp by 0.03 min. On average, the training process for CDT, JDT, PDE, and Platform requires 0.8 min, 1.2 min, 1.33 min, and 0.9 min, respectively. The testing times of CDT, JDT, PDE, and Platform are 0.22 min, 0.26 min, 0.38 min, and 0.21 min, respectively.

Based on the preceding analysis, we conclude that cPur achieves a significant improvement upon the state-of-the-art approaches.

F. RQ2: Influence of Different Inputs

To answer the research question RQ2, we perform a project-level comparison with and without different inputs (textual features of bug reports and their emotions).

Evaluation results of the proposed approach by enabling and disabling different inputs are presented in Table VI. The first column presents the input settings. Columns 2–4, 5–7, 8–10, and 11–13 present the performance of projects CDT, JDT, PDE, and Platform, respectively. The performance of cPur upon different settings is presented in the rows of the table. Fig. 4 also visualizes the performance difference of the proposed approach upon different settings.

From Table VI and Fig. 4, we make the following observations.
1) Disabling the emotion value from the input significantly reduces the recall and F1-score of the proposed approach. The decrease in recall varies from 3.45% = (60.52%–58.43%)/60.52% to 7.66% = (72.34%–66.80%)/72.34%. The decrease in F1-score varies from 1.52% = (57.67%–56.80%)/56.80% to 6.42% = (73.47%–69.03%)/69.03%.
2) Disabling textual features from the input also significantly reduces the recall and F1-score of the proposed approach. The decrease in recall varies from 7.63% = (60.52%–55.90%)/60.52% to 13.66% = (72.34%–62.46%)/72.34%. The decrease in F1-score varies from 6.32% = (62.07%–58.38%)/58.38% to 10.22% = (73.47%–66.66%)/66.66%.
3) Disabling either the emotion value or textual features from the input decreases the precision in most cases (on three out of four subject applications). However, it results in a slight increase in precision on Platform. The increase is 0.53% = 60.45%–59.92% (disabling emotion) and 1.16% = 61.08%–59.92% (disabling textual features), respectively. Due to the poor interpretability of deep neural networks, we have not yet fully understood the rationale for the increase of precision on Platform caused by disabling the emotion value or textual features. Overall, disabling either the emotion value or textual features from the input has little and inconsistent influence on the precision of the proposed approach.

To investigate the existence of emotion and why emotion features work in the prioritization of bug reports, we randomly select 200 sample reports (50% positive and 50% negative) and manually check the effect of emotion features on the proposed approach. The manual checking is accomplished by five software engineering professionals. Two of them are Ph.D. scholars (conducting research in software engineering and maintenance) and three of them are software developers with rich experience in handling bug reports. They classify the emotions of the reports independently and then discuss together to share their experiences on why emotion features work in the prioritization of bug reports. Results suggest that 86% of the selected reports having low priority are negative. They agree that being rude when writing a bug report can affect the cohesion of the participants (users or developers). It also affects the prioritization and resolution of bug reports. For example, we observe that negative bug reports (e.g., “What is the best way to kill a critical process” or “I am missing a parenthesis but I don’t know where”) have lower priority not because of negative words, e.g., kill and missing, but because they are not constructive (i.e., they post questions but do not present constructive suggestions). In contrast, positive bug reports (e.g., “Styled Text printing should implement ‘print to file’”) have higher priority because they often present constructive suggestions and, thus, are likely to be resolved. Another observation is that adverbs, e.g., very/too, increase the intensity of the positive/negative emotions and affect the priority for all priorities. Consequently, a respectful environment is feasible and an incentive for new participants.

Moreover, we employ (one-way) ANOVA with the same settings (mentioned in RQ1) to validate the significant
difference of cPur with and without emotion. The results of the ANOVA analysis show that F > Fcrit and p-value < (alpha = 0.05) are not true for F1-score, where F = 0.39, Fcrit = 5.98, and p-value = 0.55. We note that in all cases disabling emotion results in a significant reduction in the F1-score. However, the ANOVA analysis suggests that disabling emotion does not significantly influence the F1-score. One possible reason is that the F1-score varies significantly from project to project. For example, it is 57.67% on project Platform whereas it increases dramatically to 73.47% on project PDE. As a result, the variation within groups is even more significant than that between groups, and the ANOVA analysis suggests that there is no significant difference between groups.

Furthermore, we quantify the effect size to check the difference of cPur with and without emotion by employing Cohen’s d. The result (d = 0.44) suggests that the difference of cPur with and without emotion is small.

Finally, we compute the Pearson correlation coefficient (r) to measure the strength of the relationship between emotion/textual features and priority. Notably, we use the priority prediction of the proposed approach without emotion and without textual features to compute r. The results (r = 0.405 and r = 0.731, respectively) suggest a medium correlation between emotion and priority, whereas the correlation between textual features and priority is large.

Based on the preceding analysis, we conclude that both textual features and emotion are significantly important for the proposed approach.

TABLE VI
INFLUENCE OF DIFFERENT INPUT

Fig. 4. Influence of different inputs.

G. RQ3: Influence of Preprocessing

The textual information of bug reports contains noisy data (e.g., stop-words and punctuation), which is irrelevant and meaningless (as mentioned in Section III-D). Therefore, passing such information to the machine learning algorithms is an overhead. To this end, applying preprocessing may help in performance improvement and computation cost reduction.

To answer the research question RQ3, we compare the performance results of the proposed approach with and without preprocessing. Evaluation results by enabling and disabling the preprocessing are presented in Table VII. The first column presents the preprocessing input settings. Columns 2–4, 5–7, 8–10, and 11–13 present the performance of projects CDT, JDT, PDE, and Platform, respectively. The performance of cPur with different settings is presented in the rows of the table. The last row of the table presents the improvement of cPur upon different input settings for preprocessing. Fig. 5 also visualizes the performance difference of the proposed approach upon different preprocessing input settings.

From Table VII and Fig. 5, we make the following observations.
1) The proposed approach with the preprocessing step achieves significant improvement in performance. The improvement in precision and recall varies from 1.94% = (67.20%–65.92%)/65.92% to 6.17% = (74.63%–70.29%)/70.29%, and from 5.73% = (60.52%–57.24%)/57.24% to 15.08% = (60.44%–52.52%)/52.52%, respectively.
2) Disabling preprocessing from the input significantly reduces the F1-score of the proposed approach. The reduction varies from 4.17% = (62.07%–59.48%)/62.07% to 9.03% = (73.47%–66.83%)/73.47%.

Moreover, cPur and both state-of-the-art approaches (eApp and DRONE) adopt a preprocessing step to condense the raw text of bug reports. Notably, all approaches use different NLP layers and packages. To compare the results of different layers and packages, we conduct an experiment on the textual data of 200 randomly selected bug reports. All approaches have two common preprocessing steps (tokenization and stop-word removal) that are in the same sequence. We observe that the common preprocessing steps (tokenization and stop-word removal) produce similar results for cPur, eApp, and DRONE. Both baseline approaches use the Porter stemming algorithm in their third/final step (stemming and lemmatization). However, cPur uses the Lancaster stemming algorithm for lemmatization and includes some additional steps as mentioned in Section III-D. We observe that the results of all approaches differ beyond the first two steps of preprocessing. The reason for this difference is the selection of the preprocessing tool (e.g., Python NLTK or Stanford Parser) and the parameter settings of the different preprocessing steps (e.g., the use of different stemming algorithms). For example, the output of the word crying with the Porter stemming algorithm is cri, which has no meaning in any emotion analysis repository, whereas the output of the word crying with the Lancaster stemming algorithm is cry, which has negative emotion in emotion analysis. In conclusion, the selection of preprocessing tools and parameter settings varies from
problem to problem as each tool has its own advantages and limitations [38], [51].

Based on the preceding analysis, we conclude that the preprocessing step is significantly important for the proposed approach.

TABLE VII
INFLUENCE OF PREPROCESSING

Fig. 5. Influence of preprocessing.

H. RQ4: Influence of Different Lengths of Filters

To answer the research question RQ4, we conduct an additional experiment to choose the filter size of the CNN model. It analyzes how large the n of the n-grams should be when a filter is applied to a window of d words to generate a new feature. To this end, we first fix all other hyperparameters except the filter window size to check the effect of the filter length. Then, we train and test the proposed model with the default Adam optimizer for ten epochs on the given dataset.

Results suggest that the average F1-scores of the proposed approach with filter sizes 2, 3, 4, and 5 are 63.09%, 62.95%, 62.81%, and 62.05%, respectively. The CNN with filter windows of size 2 outperforms the other filter window sizes. It suggests that the meaning of words in a sentence is important and significant in computing the emotion of the sentence. It also explains why bag-of-words (emotion words) can have a strong performance in bug prioritization. In addition, the performance of the CNN model decreases when the filter size increases. Notably, there is a significant decrease in performance when the n-gram size reaches 5. Finally, we combine filters with different window sizes for additional performance improvements and decide to choose the filter sizes of 2, 3, and 4 with 100 feature maps, where the average F1-score is 64.21%.

I. RQ5: Influence of Different Sizes of Training Dataset

To answer the research question RQ5, we conduct an experiment to find the impact of the training size on the proposed approach. To this end, we train the proposed model with five different training datasets. We use the first 80%, 70%, 60%, 50%, and 40% of the reports for each training and test the trained models with the last 20% of the reports. Evaluation results of the proposed approach against each training/testing sample are presented in Table IX.

In Table IX, the first column presents the training sample size. Columns 2–4 present the performance of the proposed approach. The rows of the table present the performance results of the proposed approach against different training sample sizes, respectively.

Results suggest that the average F1-scores of the proposed approach with training sizes of the first 80%, 70%, 60%, 50%, and 40% of the reports are 74.95%, 70.24%, 66.52%, 56.19%, and 53.40%, respectively. The CNN with training size 80% outperforms the other training sizes. We observe that an increase in the training size improves the performance of the proposed approach. Although the precision of the proposed approach decreases significantly when the training size is less than 60% (48 000 reports out of 80 000 reports), the recall does not decrease significantly. Consequently, the results of the proposed approach are acceptable for both precision and recall when the training size is greater than or equal to 60% of the given dataset.

J. RQ6: Comparison Against Classification Algorithms

To answer the research question RQ6, we apply SVM and LSTM (as mentioned in Section IV-A) and compare their performances with the proposed approach. Note that we use the same preprocessing and emotion features (mentioned in Sections III-E and III-F) to compare these classification algorithms. Moreover, we use SVM in the same way as the existing approach (linear SVM with default settings), whereas the LSTM model (shown in Fig. 6) contains an embedding layer, an LSTM layer (dropout = 0.2 and recurrent_dropout = 0.2), and a dense layer (activation = sigmoid). We use binary_crossentropy as the loss function for LSTM.

Fig. 6. Overview of LSTM model.

Evaluation results of the classification algorithms are presented in Table VIII. The first column presents the approaches. Columns 2–4, 5–7, 8–10, and 11–13 present the performance of projects CDT, JDT, PDE, and Platform, respectively. Rows 2–4 present the performance of CNN, SVM, and LSTM, respectively. Fig. 7 also visualizes the performance difference in machine learning algorithms.
TABLE VIII
INFLUENCE OF MACHINE LEARNING TECHNIQUES

Fig. 7. Influence of machine learning techniques.

TABLE IX
INFLUENCE OF TRAINING SIZE

From Table VIII and Fig. 7, we make the following observations.
1) The proposed approach outperforms both machine learning and deep learning algorithms. It achieves the best precision, recall, and F1-score on each of the projects.
2) The proposed approach without preprocessing also outperforms both the preprocessing-disabled machine learning and deep learning algorithms. The performance improvement of the proposed approach without preprocessing upon SVM (preprocessing enabled) in F1-score varies from 19.77% = (58.46%–48.81%)/48.81% to 48.09% = (66.83%–45.13%)/45.13%. Moreover, the performance improvement of the proposed approach without preprocessing upon LSTM (preprocessing enabled) in F1-score varies from 32.19% = (53.32%–40.34%)/40.34% to 110.52% = (58.46%–27.77%)/27.77%.
3) The performance of LSTM is significantly lower than that of the proposed approach. The reason for the decrease in performance of LSTM against CNN is that CNN is good at extracting local/position-invariant features and works well with long input text [12]. In contrast, LSTM performs well with short and sequential input text. Notably, our input text is long and does not require sequential processing. Thus, CNN works better than LSTM in our case. Another reason for this improvement is the usage of word2vector modeling. It captures a large number of precise syntactic and semantic word relationships that could be n-gram words (verbs having an adverb, e.g., very good), and assigns a value to each word based on their semantic relations as mentioned in Section III-F.
4) The performance of SVM is significantly lower than that of the proposed approach. The reason for the decrease in performance of SVM against cPur is that SVM works poorly with variable-high input dimensions, whereas CNN works quite well with variable-high input dimensions. Another reason is that CNN does not need the feature modeling efforts (required for machine learning algorithms), which are tedious and time-consuming.

Based on the preceding analysis, we conclude that the proposed CNN classifier outperforms other machine learning and deep learning classifiers in predicting the priority of bug reports.

V. THREATS

A. Threats to Validity

A threat to construct validity is the suitability of the selected evaluation metrics. Precision, recall, and F1-score are the standard and most adopted metrics [1], [5], [51]. Therefore, we select these metrics for the evaluation of the classification algorithms.

Another threat to construct validity is the usage of Senti4SD for emotion analysis. There are many other libraries for this purpose; however, we select it due to its performance on software engineering text. Other repositories for emotion analysis may decrease the performance of the proposed approach.

A threat to internal validity is the implementation of machine learning and deep learning algorithms. To mitigate the threat, we double-check the implementation and results. However, there could be some unseen errors.

A threat to external validity is the generalizability of our results. We evaluate the proposed approach only on the four open-source projects of Eclipse. The inclusion of bug reports from other projects may decrease the performance of the proposed approach.

Another threat to external validity is the input of the hyperparameters of the deep learning approaches. We trained the deep learning algorithms on a small number of bug reports, although they usually require a large training set. They also have a number of hyperparameters to be adjusted. The adjustment of such parameters may influence performance.

VI. CONCLUSION

Bug reports are often submitted either with an incorrect priority level or without defining a priority level. Developers read
through such bug reports and manually correct or assign the priority of each bug report. Manual prioritization of bug reports requires expertise and resources (e.g., time and professionals). To this end, in this article, we proposed a CNN-based automatic approach for multiclass priority prediction of bug reports. The proposed approach not only applied a deep learning model but also employed natural language techniques and emotion analysis on the given dataset for the priority prediction of bug reports. The proposed approach automated the priority assignment process and saved the required time and efforts of developers. We performed the cross-project evaluation on the history data of the four open-source projects of Eclipse. The evaluation results suggested that the proposed approach outperformed the state-of-the-art approaches.

The broader impact of our article is to show that the textual information of the bug reports could be a rich source of information to prioritize them for their resolution on time. We expect our results to encourage future research on the prioritization of bug reports.

We would like to investigate the rationale behind the proposed approach in future. One of the drawbacks of deep neural networks is that it is challenging, if not impossible, to explain why deep learning based approaches, e.g., the one proposed in this article, work or do not work. Opening the “black box” of deep neural networks is needed to understand better how a deep learning model learns. Deep learning is called a black box as it is nonparameterized. Although the choice of hyperparameters of deep learning models, such as the number of layers, the activation function, and the learning rate, as well as the predictor importance is known, it is still unclear how machines learn and deduce conclusions. In future, we would like to exploit advanced techniques in neural networks to uncover the rationale behind the phenomenon.

We would also like to investigate a domain-specific prioritization of bug reports by including more bug reports from different domains, e.g., information systems. A domain-specific prioritization of bug reports will affirm the generalizability of the proposed approach.

[8] M. Alenezi and S. Banitaan, “Bug reports prioritization: Which features and classifier to use?” in Proc. 12th Int. Conf. Mach. Learn. Appl., vol. 2, Washington, DC, USA, 2013, pp. 112–116.
[9] Q. Umer, H. Liu, and Y. Sultan, “Emotion based automated priority prediction for bug reports,” IEEE Access, vol. 6, pp. 35743–35752, 2018.
[10] J. Kanwal and O. Maqbool, “Bug prioritization to facilitate bug report triage,” J. Comput. Sci. Technol., vol. 27, pp. 397–412, Mar. 2012.
[11] L. Yu, W.-T. Tsai, W. Zhao, and F. Wu, “Predicting defect priority based on neural networks,” in Advanced Data Mining and Applications, L. Cao, J. Zhong, and Y. Feng, Eds., Berlin, Germany: Springer, 2010, pp. 356–367.
[12] B. Wang, “Disconnected recurrent neural networks for text categorization,” in Proc. 56th Annu. Meeting Assoc. Comput. Linguistics (Volume 1: Long Papers), Melbourne, VIC, Australia, Jul. 2018, pp. 2311–2320.
[13] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “LSTM: A search space odyssey,” IEEE Trans. Neural Netw. Learn. Syst., vol. 28, no. 10, pp. 2222–2232, Oct. 2017.
[14] W. Y. Ramay, Q. Umer, X. C. Yin, C. Zhu, and I. Illahi, “Deep neural network-based severity prediction of bug reports,” IEEE Access, vol. 7, pp. 46846–46857, 2019.
[15] F. Calefato, F. Lanubile, F. Maiorano, and N. Novielli, “Sentiment polarity detection for software development,” Empirical Softw. Eng., vol. 23, pp. 1352–1382, Jun. 2018.
[16] A. Lamkanfi, S. Demeyer, E. Giger, and B. Goethals, “Predicting the severity of a reported bug,” in Proc. 7th IEEE Work. Conf. Mining Softw. Repositories, May 2010, pp. 1–10.
[17] W. Abdelmoez, M. Kholief, and F. M. Elsalmy, “Bug fix-time prediction model using naïve Bayes classifier,” in Proc. 22nd Int. Conf. Comput. Theory Appl., Oct. 2012, pp. 167–172.
[18] Y. Tian, D. Lo, and C. Sun, “Information retrieval based nearest neighbor classification for fine-grained bug severity prediction,” in Proc. 19th Work. Conf. Reverse Eng., Oct. 2012, pp. 215–224.
[19] Y. Tian, N. Ali, D. Lo, and A. E. Hassan, “On the unreliability of bug severity data,” Empirical Softw. Eng., vol. 21, pp. 2298–2323, Dec. 2016.
[20] P. Choudhary, “Neural network based bug priority prediction model using text classification techniques,” Int. J. Adv. Res. Comput. Sci., vol. 8, no. 5, pp. 1315–1319, 2017.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. 25th Int. Conf. Neural Inf. Process. Syst., 2012, vol. 1, pp. 1097–1105.
[22] A. Graves, A. Mohamed, and G. E. Hinton, “Speech recognition with deep recurrent neural networks,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2013, pp. 6645–6649.
[23] X. Ouyang, P. Zhou, C. H. Li, and L. Liu, “Sentiment analysis using convolutional neural network,” in Proc. IEEE Int. Conf. Comput. Inf. Technol.; Ubiquitous Comput. Commun.; Dependable, Autonomic Secure Comput.; Pervasive Intell. Comput., Oct. 2015, pp. 2359–2364.
[24] X. Li, H. Jiang, Z. Ren, G. Li, and J. Zhang, “Deep learning in software engineering,” 2018.
[25] W.-T. Yih, K. Toutanova, J. C. Platt, and C. Meek, “Learning discriminative projections for text similarity measures,” in Proc. 15th Conf. Comput.
ACKNOWLEDGMENT Natural Lang. Learn., Stroudsburg, PA, USA, Jul. 2011, pp. 247–256.
[26] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, “A convolutional neural
The authors would like to thank the Associate Editor and network for modelling sentences,” in Proc. 52nd Annu. Meeting Assoc.
the anonymous reviewers for their insightful comments and Computational Linguistics, 2014, pp. 655–665.
[27] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and
constructive suggestions. P. Kuksa, “Natural language processing (almost) from scratch,” J. Mach.
Learn. Res., vol. 12, pp. 2493–2537, Nov. 2011.
[28] X. Wang, L. Zhang, T. Xie, J. Anvik, and J. Sun, “An approach to detecting
REFERENCES duplicate bug reports using natural language and execution information,”
in Proc. 30th Int. Conf. Softw. Eng., 2008, pp. 461–470.
[1] Y. Tian, D. Lo, X. Xia, and C. Sun, “Automated prediction of bug re- [29] C. Sun, D. Lo, X. Wang, J. Jiang, and S. Khoo, “A discriminative model
port priority using multi-factor analysis,” Empirical Softw. Eng., vol. 20, approach for accurate duplicate bug report retrieval,” in Proc. ACM/IEEE
pp. 1354–1383, Oct. 2015. 32nd Int. Conf. Softw. Eng., vol. 1, May 2010, pp. 45–54.
[2] Bugzilla, 2018. [Online]. Available: https://www.bugzilla.org/ [30] N. Jalbert and W. Weimer, “Automated duplicate detection for bug tracking
[3] Jira, 2002. [Online]. Available: https://www.atlassian.com/software/jira/ systems,” in Proc. IEEE Int. Conf. Dependable Syst. Netw. FTCS DCC,
[4] Github, 2008. [Online]. Available: https://github.com/features/ Jun. 2008, pp. 52–61.
[5] Y. Tian, D. Lo, and C. Sun, “Drone: Predicting priority of reported bugs [31] G. Canfora and L. Cerulo, “Supporting change request assignment in open
by multi-factor analysis,” in Proc. IEEE Int. Conf. Softw. Maintenance, source development,” in Proc. ACM Symp. Appl. Comput., 2006, pp. 1767–
Washington, DC, USA, 2013, pp. 200–209. 1772.
[6] J. Anvik, L. Hiew, and G. C. Murphy, “Coping with an open bug reposi- [32] G. Jeong, S. Kim, and T. Zimmermann, “Improving bug triage with bug
tory,” in Proc. OOPSLA Workshop Eclipse Technol. eXchange, New York, tossing graphs,” in Proc. 7th Joint Meeting Eur. Softw. Eng. Conf. ACM
NY, USA, 2005, pp. 35–39. SIGSOFT Symp. Found. Softw. Eng., Jan. 2009, pp. 111–120.
[7] X. Xia, D. Lo, M. Wen, E. Shihab, and B. Zhou, “An empirical study of [33] J. Xuan, H. Jiang, H. Zhang, and Z. Ren, “Developer recommendation
bug report field reassignment,” in Proc. Softw. Evol. Week—IEEE Conf. on bug commenting: A ranking approach for the developer crowd,” Sci.
Softw. Maintenance, Reeng., Reverse Eng. , Feb. 2014, pp. 174–183. China Inf. Sci., vol. 60, Apr. 2017, Art. no. 072105.

[34] C. Weiss, R. Premraj, T. Zimmermann, and A. Zeller, “How long will it take to fix this bug?,” in Proc. 4th Int. Workshop Mining Softw. Repositories, 2007, pp. 1–8.
[35] S. Akbarinasaji, B. Caglayan, and A. Bener, “Predicting bug-fixing time: A replication study using an open source software project,” J. Syst. Softw., vol. 136, pp. 173–186, 2018.
[36] P. Bhattacharya and I. Neamtiu, “Bug-fix time prediction models: Can we do better?,” in Proc. 8th Work. Conf. Mining Softw. Repositories, 2011, pp. 207–210.
[37] Google-Issue-Tracker, [Online]. Available: https://issuetracker.google.com/
[38] E. Loper and S. Bird, “NLTK: The natural language toolkit,” in Proc. ACL-02 Workshop Effective Tools Methodologies Teaching Natural Lang. Process. Comput. Linguistics, 2002, vol. 1, 2006, pp. 63–70.
[39] TextBlob, 2013. [Online]. Available: https://textblob.readthedocs.io/en/dev/
[40] J. Uddin, R. Ghazali, M. Mat Deris, R. Naseem, and H. Shah, “A survey on bug prioritization,” Artif. Intell. Rev., vol. 47, pp. 145–180, Apr. 2016.
[41] M. R. Islam and M. F. Zibran, “Sentistrength-SE: Exploiting domain specificity for improved sentiment analysis in software engineering text,” J. Syst. Softw., vol. 145, pp. 125–146, 2018.
[42] T. Ahmed, A. Bosu, A. Iqbal, and S. Rahimi, “Senticr: A customized sentiment analysis tool for code review interactions,” in Proc. 32nd IEEE/ACM Int. Conf. Automated Softw. Eng., 2017, pp. 106–111.
[43] F. Calefato, F. Lanubile, and N. Novielli, “Emotxt: A toolkit for emotion recognition from text,” in Proc. 7th Int. Conf. Affect. Comput. Intell. Interact. Workshops Demos, 2017, pp. 79–80.
[44] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Proc. 26th Int. Conf. Neural Inf. Process. Syst., 2013, pp. 3111–3119.
[45] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” 2012.
[46] T. Young, D. Hazarika, S. Poria, and E. Cambria, “Recent trends in deep learning based natural language processing [review article],” IEEE Comput. Intell. Mag., vol. 13, no. 3, pp. 55–75, Aug. 2018.
[47] Bugzilla, 2018. [Online]. Available: https://bugs.eclipse.org/bugs/
[48] I. Safonov, I. Gartseev, M. Pikhletsky, O. Tishutin, and M. Bailey, “An approach for model assessment for activity recognition,” Pattern Recognit. Image Anal., vol. 25, pp. 263–269, Apr. 2015.
[49] R. E. Schapire and Y. Singer, “Boostexter: A boosting-based system for text categorization,” Mach. Learn., vol. 39, pp. 135–168, May 2000.
[50] E. T. Berkman and S. P. Reise, A Conceptual Guide to Statistics Using SPSS. Thousand Oaks, CA, USA: Sage, 2011.
[51] T. Menzies and A. Marcus, “Automated severity assessment of software defect reports,” in Proc. IEEE Int. Conf. Softw. Maintenance, Sep. 2008, pp. 346–355.

Qasim Umer received the B.S. degree in computer science from Punjab University, Lahore, Pakistan, in 2006, the M.S. degree in net distributed system development from the University of Hull, Hull, U.K., in 2009, and the second M.S. degree in computer science from the University of Hull, in 2012. He is currently working toward the Ph.D. degree in computer science with the Beijing Institute of Technology, Beijing, China.
He is particularly interested in machine learning, data mining, and software maintenance.

Hui Liu received the B.S. degree in control science from Shandong University, Jinan, China, in 2001, the M.S. degree in computer science from Shanghai University, Shanghai, China, in 2004, and the Ph.D. degree in computer science from Peking University, Beijing, China, in 2008.
He is currently a Professor with the School of Computer Science and Technology, Beijing Institute of Technology, Beijing. He was a Visiting Research Fellow with the Centre for Research on Evolution, Search and Testing (CREST), University College London, London, U.K. He served on the program committees and organizing committees of prestigious conferences, such as International Conference on Software Maintenance and Evolution, RE, International Centre for the Study of Radicalisation, and COMPSAC. He is particularly interested in software refactoring, AI-based software engineering, and software quality. He is also interested in developing practical tools to assist software engineers.

Inam Illahi received a graduate degree in arts from the University of Sargodha, Sargodha, Pakistan, in 2007. He received the M.S. degree in software engineering from the Chalmers University of Technology, Gothenburg, Sweden, in 2010. He is currently working toward the Ph.D. degree in software engineering with the School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China.
He is particularly interested in software maintenance, crowdsourcing, and machine learning.