Word Embedding Comparison

Keywords: Bug assignment; Bug report; Word embedding; Deep learning; Classification

Dataset link: https://github.com/AI4BA/dl4ba

Bug assignment, or bug triage, focuses on identifying the appropriate developers to repair newly discovered bugs, thereby managing them more effectively. Several deep learning-based approaches have been proposed for automated bug assignment. These approaches view automated bug assignment as a text classification task: the textual description of a bug report is used as the input and the potential fixers are regarded as the output labels. Such approaches typically depend on the classification performance of natural language processing and machine learning techniques, and various word embedding and deep learning models continue to emerge. The effectiveness of these approaches depends on the deep learning model chosen for classification and the word embedding model used for representing bug reports. However, prior research has not empirically evaluated the impact of different word embedding and deep learning models on automated bug assignment. In this paper, we conduct an empirical study to analyze the performance variations among 35 deep learning-based automated bug assignment approaches. These approaches are based on five word embedding techniques, i.e., Word2Vec, GloVe, NextBug, ELMo, and BERT, and seven text classification models, i.e., TextCNN, LSTM, Bi-LSTM, LSTM with attention, Bi-LSTM with attention, MLP, and Naive Bayes. We evaluated these combinations across three benchmark datasets, namely Eclipse JDT, GCC, and Firefox, and their merger, i.e., a cross-project dataset. Our main observations are: (1) Bi-LSTM with attention and Bi-LSTM using ELMo are significantly superior to the other deep learning models on bug assignment tasks in terms of top-k (k = 1, 5, 10) accuracy and MRR; (2) both the summary and description of bug reports are useful for bug assignment, but the description is more useful than the summary; (3) the training corpus for word embedding models has a significant impact on the performance of deep learning-based bug assignment methods. Our results show the importance of tuning the different components (e.g. word embedding model, classification model, and textual input) of deep learning-based automated bug assignment methods and provide important insights for practitioners and researchers.

https://doi.org/10.1016/j.jss.2024.111961
Received 17 August 2023; Received in revised form 10 November 2023; Accepted 3 January 2024; Available online 6 January 2024
0164-1212/© 2024 Elsevier Inc. All rights reserved.
is a bug and the type of this bug based on the content of the issue report. (3) The triager assigns the bug to the appropriate fixer. (4) When the assignee cannot fix the bug, the bug is put back into the bug-tracking system; meanwhile, the status of the bug report is changed to ''New'', waiting for the next round of assignment. (5) When the bug is successfully fixed, the bug is closed. As seen in this workflow, bug triage is tedious and time-consuming work for triagers (Jahanshahi and Cevik, 2022). In particular, as software continually evolves, the daily number of bug reports submitted to these systems rises greatly. For example, the ASP.NET Core project received approximately 115.8 bug reports per month on average from May to October 2023, and the GCC Bugzilla received 85 reports in a recent week, from October 3 to October 10, 2023. Faced with such a substantial volume of bug reports, triagers must make great efforts to understand them and assign them to the appropriate developers for bug resolution. For complex open-source software products in particular, bug assignment is even more challenging because the number of involved bug fixers is large and their skills differ.

To address the above challenge, researchers have proposed two main types of automated bug assignment approaches, i.e., information retrieval-based methods (Yang et al., 2014; Hu et al., 2018; Matter et al., 2009; Sajedi-Badashian and Stroulia, 2020; Xia et al., 2017; Alazzam et al., 2020) and text classification-based methods (Murphy and Cubranic, 2004; Xuan et al., 2015; Dedík and Rossi, 2016; Ahsan et al., 2009; Naguib et al., 2013; Sbih and Akour, 2018; Sarkar et al., 2019; Sawarkar et al., 2019; Lee et al., 2017; Mani et al., 2019; Guo et al., 2020). Bug assignment methods based on information retrieval recommend bug fixers according to the similarity between historical bug reports and newly arrived bug reports, and the corresponding relationship between historical bug reports and fixers. Unlike information retrieval-based methods, text classification-based methods extract textual features from bug reports and build classifiers to predict the labels, i.e., the actual fixers of bugs. Text classification-based bug assignment methods consist of two stages. One is to generate word vector representations using various techniques such as term frequency-inverse document frequency (TF-IDF) (Xuan et al., 2015; Dedík and Rossi, 2016; Ahsan et al., 2009; Sarkar et al., 2019) and word to vector (Word2Vec) (Lee et al., 2017; Mani et al., 2019; Guo et al., 2020). The other is to predict bug fixers with different classification models such as Naive Bayes (NB) (Xuan et al., 2015; Murphy and Cubranic, 2004), Support Vector Machines (SVM) (Dedík and Rossi, 2016; Ahsan et al., 2009), and ensemble learning methods (Sbih and Akour, 2018).

Recently, various deep neural networks have been successfully applied to many software engineering tasks such as code comment generation (Hu et al., 2020), vulnerability detection (Chakraborty et al., 2022), and software defect prediction (Giray et al., 2023). Moreover, they have achieved better performance than traditional machine learning-based bug assignment methods (Lee et al., 2017; Lee and Seo, 2019). For example, both Lee et al. (2017) and Guo et al. (2020) utilized Word2Vec to generate textual representations of bug reports and employed convolutional neural networks (CNN) for text classification to predict fixers. Xi et al. (2018) proposed a sequence-to-sequence model for bug assignment, in which Recurrent Neural Networks (RNN) and Gated Recurrent Units (GRU) are used for feature extraction and model prediction, respectively. Choquette-Choo et al. (2019) used dual-output Deep Neural Networks (DNN) for bug assignment. Bug assignment based on deep learning has made remarkable progress, making bug assignment more precise and cost-effective, and ultimately enhancing the efficiency of software maintenance and reducing overall bug resolution costs.

In previous studies (Lee et al., 2017; Mani et al., 2019; Guo et al., 2020), Word2Vec (Mikolov et al., 2013) was predominantly used to generate text vector representations, and CNN (Zhang and Wallace, 2017) was commonly employed as the classifier for bug assignment. Existing studies have demonstrated the effectiveness of representation learning and deep learning in bug assignment. It is worth noting that promising and recently popular representation learning methods, like global vectors (GloVe), embeddings from language models (ELMo), and bidirectional encoder representations from Transformers (BERT), have rarely been utilized for bug assignment. Likewise, the deep learning classification models widely used in natural language processing, such as the multi-layer perceptron (MLP) (Taud and Mas, 2018) and long short-term memory (LSTM) (Graves, 2012), have rarely been applied to bug assignment. Bug assignment methods based on different word embedding and deep learning classification models could yield different performance. Nevertheless, no previous study has empirically investigated the effects of methods using various representation learning and deep learning classification models on automated bug assignment tasks. This restricts further development of the bug assignment task, especially in picking the relatively optimal deep learning models for addressing it.

In this context, we designed and conducted an empirical assessment of different word embedding and deep learning models for bug assignment. Three word-level embedding models, i.e., Word2Vec, GloVe, and NextBug (Du et al., 2022), a skip-gram model fine-tuned on a bug-specialized domain corpus, and two sequence-based representation learning models, i.e., ELMo and BERT, were empirically assessed. Seven popular deep learning classifiers, including TextCNN (Kim, 2014), LSTM (Graves, 2012), bidirectional LSTM (Bi-LSTM), LSTM with attention, Bi-LSTM with attention, MLP (Taud and Mas, 2018), and Naive Bayes (NB) (Xu et al., 2017), were also empirically evaluated. To our knowledge, we are the first to empirically investigate the effects of models based on different word embedding techniques and deep learning classification models on automated bug assignment. The experiments were designed and conducted on three widely used bug assignment datasets, namely Eclipse JDT, GCC, and Firefox, and a cross-project dataset. The following three research questions are set to better understand the effects of three key components, i.e., the learning models, the bug report elements, and the source of the training corpus, on bug assignment based on text classification.

RQ1: Are there differences in the performance of various deep learning models for automated bug assignment? We consider four representation learning models for generating word embeddings: word-level models with local sensitivity (e.g. GloVe and Word2Vec) and sentence-level models with global sensitivity (e.g. BERT and ELMo). Moreover, seven deep learning classification models, including TextCNN, LSTM, Bi-LSTM, LSTM with attention, Bi-LSTM with attention, MLP, and Naive Bayes, are used for predicting bug fixers. The extracted features could vary with the models, resulting in different predictive results. Therefore, we set up this research question to investigate which deep learning models are more suitable for bug assignment tasks.

RQ2: Is the description useful for bug assignment? What is the optimal weight between the summary and description? The textual information of bug reports typically consists of a summary and a description. Compared with the description, the summary is more concise and contains semantic connections that are more closely related to a bug. In contrast, the description provides more details related to a bug. In view of the above differences, we proposed this research question to investigate their contributions to bug assignment tasks.

RQ3: To what extent does the training corpus influence the performance of the representation learning models for bug assignment tasks? Most representation learning models are trained on a general-domain corpus. Models trained on a general corpus tend to have stronger generalization capabilities. However, models trained on a bug-specialized domain corpus (e.g. NextBug) may be more suitable for bug-related tasks. Therefore, we proposed this research question to investigate whether there are differences in the performance of models trained on a general corpus and on a bug-specialized domain corpus in bug assignment tasks.
The whole contributions of this paper are listed as follows.

• We comprehensively investigated the effects of different word embedding and deep learning models on bug assignment. The differences among all investigated models were qualitatively and quantitatively assessed.
• We designed nine strategies representing different weights between the summary and description of the bug report. The nine strategies were empirically evaluated to study the relative importance of the summary and description.
• We conducted an empirical evaluation of the impacts of training corpora from general and bug-specialized domains on bug assignment tasks.

The results and source code related to this study are available at https://github.com/AI4BA/dl4ba.

The rest of this paper is organized as follows. Section 2 summarizes the related work. The experimental design and the analysis of results are presented in Sections 3 and 4, respectively. We discuss the threats to validity in Section 5. Finally, the paper concludes with future work in Section 6.

2. Related work

Previous studies on automatic bug assignment can be roughly categorized into three types based on the techniques used: information retrieval-based, text classification-based, and hybrid methods. The first aims to assign bugs based on the similarity between newly arrived bug reports and historical bug reports. The second regards bug assignment as a classification issue; in other words, a classification model is trained for bug assignment using the text features extracted from the bug reports, with the assignees as labels. The last mainly combines multiple techniques, including IR-based methods, learning-based methods, and tossing graphs, for automated bug assignment.

2.1. Information retrieval-based bug assignment methods

Information retrieval-based bug assignment methods are based on the observation that the fixers who previously solved similar bugs are regarded as the most suitable candidates for newly arrived bug reports (Zhang et al., 2015). Most of these methods search the repository for the bug reports most similar to the newly arrived report; the bug fixers corresponding to the retrieved bug reports are recommended as the appropriate fixers. For instance, Yang et al. (2014) extracted topics from historical bug reports based on Latent Dirichlet Allocation (LDA). According to the extracted topics, the topics a newly arrived bug report belongs to can be determined, and the historical bug reports having the same features and topics as the newly arrived report can be retrieved. The fixers corresponding to the retrieved bug reports are regarded as the fixers of the newly arrived bug report. Hu et al. (2018) proposed a document embedding model-based method for improving the performance of Yang et al.'s method (Yang et al., 2014). Similarly, Matter et al. (2009) presented an expertise model of developers based on their contributed source code. According to the cosine similarity between the vocabulary of bug reports and the vocabulary in source code contributions, the newly arrived bug report can be assigned to developers.

Sajedi-Badashian and Stroulia (2020) presented a vocabulary- and time-aware bug assignment method. The vocabulary consists of programming keywords in bug reports belonging to Stack Overflow tags. They argued that developers' recent expertise is more important than past expertise. Furthermore, they considered the importance of the keywords and the time of usage to improve the TF-IDF metric. Xia et al. (2017) proposed a bug assignment method in which they utilized a multi-feature topic model (MTM) to conduct topic modeling based on product and component information from historical bug reports. They employed TopicMiner to obtain the topic distribution of newly arrived bug reports, enabling the retrieval of similar reports. Alazzam et al. (2020) proposed a feature augmentation approach that leverages graph partitioning based on neighborhood overlap to augment features. By considering the similarity between the summaries of bug reports and each term cluster, the approach concatenates terms from strong clusters into the feature vectors of the summaries. This approach enhances bug assignment methods based on machine learning.

2.2. Text classification-based bug assignment methods

Bug assignment methods based on text classification can be further classified into traditional machine learning-based methods and deep learning-based methods.

2.2.1. Traditional machine learning-based methods
Murphy and Cubranic (2004) first treated the problem of bug assignment as a case of text classification. The NB algorithm was employed to predict the developer based on the bug's description. Similarly, Xuan et al. (2015) used TF-IDF to generate word vector representations for the summaries and descriptions of bug reports, and utilized the NB classifier for bug assignment. Compared with the study of Xuan et al. (2015), Dedík and Rossi (2016) extracted more features from bug reports, including summaries, descriptions, keywords, and stack traces. They combined TF-IDF with SVM for automated bug triaging in an industrial context. Ahsan et al. (2009) extracted textual information such as summaries and descriptions from bug reports and parsed the names of developers as labels. They represented the textual information using TF-IDF and applied feature selection and latent semantic indexing methods to reduce the dimension of the term-document matrix. Six machine learning classifiers were compared; the best one is based on SVM and latent semantic indexing. Naguib et al. (2013) utilized developers' historical activities from the bug tracking system to create an activity profile for each developer describing her/his roles, expertise, and level of involvement in the project. Subsequently, LDA was utilized to generate the topic feature vectors of bug reports, and SVM was employed to determine the appropriate fixer.

Sbih and Akour (2018) proposed a model for bug assignment. The model was evaluated using ensemble learning methods such as bagging, boosting, and decorating, along with classifiers including Bayes Net, Naive Bayes, Decision Table, Random Tree, and J48. Sarkar et al. (2019) considered three types of attributes, i.e., textual, categorical, and log attributes. Several traditional classifiers such as K-Nearest Neighbor (KNN) and Logistic Regression (LR) were empirically evaluated. Sawarkar et al. (2019) utilized the Bag-of-Words (BOW) model to extract feature vectors from textual information. They employed multi-label classification algorithms, including Random Forest, SVM, and J48, to predict the fixers for the given bug reports.

2.2.2. Deep learning-based methods
Lee et al. (2017) utilized the Word2Vec representation learning model to learn representations of summaries and descriptions in bug reports. A CNN was used to extract text features and predict bug fixers. Similarly, Mani et al. (2019) also used Word2Vec and introduced an attention mechanism to learn sentence-level semantic features for bug assignment; they proposed a bug assignment method based on a bidirectional recursive neural network. Guo et al. (2020) considered developer activity in addition to summaries and descriptions and proposed a bug assignment method based on Word2Vec and CNN. Xi et al. (2018) utilized an RNN to generate textual features from the text in bug reports and to model the dependencies among developers based on tossing sequences; after that, a sequence of candidate fixers was generated. Choquette-Choo et al. (2019) employed dual-output Deep Neural Networks (DNN) to perform bug assignment. They utilized the model to predict the team category, which in turn helps in predicting the appropriate fixers. Existing studies predominantly use Word2Vec and CNN to generate word embeddings and predict the right fixers,
Fig. 1. Overview of automated bug assignment based on representation learning models and deep learning classifiers.
utilized the NLTK toolkit (Bird et al., 2009) to remove stopwords and converted all uppercase letters to lowercase. The above operations aim at cleaning the text and eliminating irrelevant information, making it more suitable for further training and classification.
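A minimal sketch of this pre-processing step is given below, assuming NLTK's English stopword list and 'punkt' tokenizer models have been downloaded; the function name and the exact regular expression for stripping URLs are illustrative choices, not necessarily the exact pipeline used in the paper.

```python
import re
from nltk.corpus import stopwords        # requires: nltk.download('stopwords')
from nltk.tokenize import word_tokenize  # requires: nltk.download('punkt')

STOPWORDS = set(stopwords.words('english'))

def preprocess(text: str) -> list:
    """Clean a bug report: drop URLs, lowercase, tokenize, remove stopwords."""
    text = re.sub(r'https?://\S+', ' ', text)   # strip URLs (illustrative pattern)
    tokens = word_tokenize(text.lower())        # lowercase + tokenize
    return [t for t in tokens if t.isalpha() and t not in STOPWORDS]

print(preprocess("NullPointerException when opening https://example.org in the JDT editor"))
# -> ['nullpointerexception', 'opening', 'jdt', 'editor']
```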
3.2.2. Word embedding matrix generation
After pre-processing, the next step is the tokenization of the text. The tokens are then fed into the representation learning models. For the word-level embedding models, an additional operation is to generate a vocabulary dictionary corresponding to the text and form word vectors using Word2Vec, GloVe, or NextBug. The vocabulary dictionary is a collection of all unique words in the training corpus, each of which has a unique index. Building the vocabulary dictionary involves the following steps. The first step is to count word frequencies, i.e., traversing the entire corpus to count the occurrence frequency of each word. Then, the words are sorted in descending order of frequency, i.e., placing the most common words at the front of the vocabulary dictionary. The last step is to assign a unique index to each word, thus generating a vocabulary dictionary that maps words to IDs. The vocabulary dictionary is used to map the words in a text to an ID sequence, which is then mapped to a high-dimensional vector space through Word2Vec, GloVe, or NextBug. Since the text could contain words that were not seen by the above models, the word embedding matrix is initialized randomly following a normal distribution before the word-level embedding models are applied. Subsequently, the text is traversed to generate the corresponding text vectors according to the vocabulary dictionary, as in the sketch below.
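The following sketch illustrates the vocabulary construction, the ID mapping, and the randomly initialized embedding matrix overwritten with pre-trained vectors where available; the lookup interface (e.g. a gensim KeyedVectors object) and the standard deviation of the random initialization are assumptions made for illustration.

```python
from collections import Counter
import numpy as np

def build_vocab(corpus):
    """Map words to IDs in descending frequency order (ID 0 reserved for padding)."""
    freq = Counter(tok for doc in corpus for tok in doc)
    return {w: i + 1 for i, (w, _) in enumerate(freq.most_common())}

def texts_to_ids(corpus, vocab):
    """Turn each tokenized report into its ID sequence."""
    return [[vocab[t] for t in doc if t in vocab] for doc in corpus]

def embedding_matrix(vocab, pretrained, dim=300, seed=42):
    """Rows are indexed by word ID: random normal initialization first,
    then overwrite the rows of words covered by the pre-trained model."""
    rng = np.random.default_rng(seed)
    mat = rng.normal(0.0, 0.1, size=(len(vocab) + 1, dim))
    for word, idx in vocab.items():
        if word in pretrained:           # e.g. a gensim KeyedVectors object
            mat[idx] = pretrained[word]
    return mat

corpus = [["crash", "editor", "jdt"], ["crash", "compiler"]]
vocab = build_vocab(corpus)
print(vocab)                         # {'crash': 1, 'editor': 2, 'jdt': 3, 'compiler': 4}
print(texts_to_ids(corpus, vocab))   # [[1, 2, 3], [1, 4]]
```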
Different from the word-level embedding models, the sequence-based models (e.g. ELMo and BERT) can accept the token list as their input and generate the corresponding text vectors directly, without an additional operation such as building a vocabulary, as required by the word-level embedding models.

Note that for Word2Vec and GloVe we use 300-dimensional word embeddings. As for NextBug, it is a pre-trained binary file with 500-dimensional word embeddings. For BERT and ELMo, we used the BERT-BASE model with 2 layers, a hidden size of 768, 12 attention heads, and 118M parameters; thus, the dimension of the word embeddings generated by BERT is 768. Due to the limit of the experimental budget, the maximum sequence length for BERT is 512. Note that we did not fine-tune BERT, for two main reasons. One is that the limited experimental budget cannot provide sufficient resources for fine-tuning BERT. The other is that the comparison of the studied models is fairer without fine-tuning, as not all studied models support a fine-tuning mode. On the other hand, ELMo generates word embeddings with a dimension of 1024. To preserve semantic information with as little loss or distortion as possible, the vectors are not truncated or padded.

Pre-trained models used: Word2Vec: https://github.com/mmihaltz/word2vec-GoogleNews-vectors; GloVe: https://nlp.stanford.edu/projects/glove/; NextBug: https://github.com/xiaotingdu/DeepSIM; BERT: https://github.com/google-research/BERT/; ELMo: https://s3-us-west-2.amazonaws.com/allennlp/models/elmo
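As an illustration of the sequence-based path, the sketch below extracts one contextual vector per token with the HuggingFace transformers library; this library and the bert-base-uncased checkpoint are assumptions made for illustration, not necessarily the exact tooling used in the paper.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bert_embed(text: str) -> torch.Tensor:
    """Return one 768-dimensional contextual vector per token,
    truncating to a maximum sequence length of 512."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = model(**inputs)
    return out.last_hidden_state.squeeze(0)   # shape: (seq_len, 768)

vecs = bert_embed("JDT editor crashes with a NullPointerException on startup")
print(vecs.shape)   # e.g. torch.Size([14, 768])
```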
3.2.3. Bug fixer prediction
Once the word embedding matrix is generated by a certain representation learning model, it can be fed into the deep learning classifiers, including TextCNN, LSTM, Bi-LSTM, LSTM with attention, Bi-LSTM with attention, MLP, and Naive Bayes, for bug fixer prediction.

3.3. Classification model

3.3.1. TextCNN (Text Convolutional Neural Network)
TextCNN is a text classification model based on CNN. It consists of convolutional layers, pooling layers, and fully connected layers. As shown in Fig. 1, the model takes the word embedding matrix as an input and performs convolution operations with convolutional kernels of different sizes to extract feature maps from different regions. Based on the study of Zhang and Wallace (2017), we selected convolutional kernels of sizes [3, 4, 5] for the convolution operation. That work investigated the impact of the number of convolutional kernels on multiple datasets such as the Sentence Polarity dataset (MR), the Stanford Sentiment Treebank (SST-1), the Subjectivity dataset (Subj), and Question Classification (TREC). Given the similarity between the bug assignment problem and question classification, we opted for the same number of convolutional kernels as for TREC, i.e., 512.

After convolution, the feature maps have a shape of (N, L − [3, 4, 5] + 1, 512). These feature maps are then subjected to max-pooling to reduce their dimension while preserving important features. The pooled vectors are concatenated, flattened, and passed through fully connected layers to compute the probability scores of each bug report with respect to each developer, thereby completing the classification task.
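A minimal PyTorch sketch of this architecture, with kernel sizes [3, 4, 5] and 512 filters per size as stated above, is shown below; PyTorch itself and details such as the absence of dropout are assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Convolutions over the word embedding matrix with kernel sizes 3/4/5,
    max-pooling per filter, then a fully connected layer scoring each developer."""
    def __init__(self, embed_dim: int, num_developers: int,
                 kernel_sizes=(3, 4, 5), num_filters=512):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, num_filters, k) for k in kernel_sizes)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_developers)

    def forward(self, x):                   # x: (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)               # -> (batch, embed_dim, seq_len)
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(pooled, dim=1))   # (batch, num_developers)

scores = TextCNN(300, num_developers=50)(torch.randn(8, 120, 300))
print(scores.shape)   # torch.Size([8, 50])
```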
3.3.2. LSTM (Long Short-Term Memory)
LSTM, as a variant of RNNs, can process sequential data. Compared with the traditional RNN model, LSTM can not only better capture long-term dependencies within a sequence through its gate mechanism, but also effectively address the issues of vanishing and exploding gradients when dealing with long sequences. Therefore, we also employed the LSTM model for predicting bug fixers. During the training of the LSTM model, the loss between the model's predictions and the ground truth is computed. The loss is back-propagated to update the weights and biases in the network. This process is repeated until the model converges and achieves optimal performance. Then, the outputs of the last hidden layers are fed into a linear layer for computing the probability scores.

Increasing the size of the hidden layers can enhance the network's learning ability, enabling it to better adapt to complex patterns and data. However, more complex networks require more computational resources and longer training time. Likewise, an excessive hidden layer size may lead to over-fitting and make training more difficult. Therefore, it is necessary to choose an appropriate hidden layer size to achieve a balance between accuracy and efficiency. A prior study (Nowak et al., 2017) illustrates that LSTM can achieve better performance on text classification tasks when the hidden layer size is equal to 128. Considering that the bug assignment task is highly similar to the text classification task, we set the same hidden layer size, i.e., 128.

3.3.3. Bi-LSTM (Bi-directional Long Short-Term Memory)
Bi-LSTM is another variant of RNNs that incorporates memory cells and a gate mechanism. Similar to LSTM, Bi-LSTM excels in modeling sequential data. As shown in Fig. 1, the Bi-LSTM consists of two LSTM layers. One is responsible for processing the input sequence in the forward direction, while the other works in the backward direction. The final output is generated by concatenating the outputs of the two layers. Subsequently, the output of the hidden layers is passed through a linear layer, which computes the probability scores.

Compared with LSTM, Bi-LSTM is capable of capturing both the preceding and succeeding context information in the input sequence. By utilizing the forward and backward LSTM layers, it can simultaneously consider the information before and after the current time step, providing a more comprehensive contextual understanding. The hidden layer size of Bi-LSTM is the same as that of LSTM, i.e., 128.
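The sketch below illustrates the Bi-LSTM classifier with an attention layer over the hidden states (hidden size 128, as above); the paper does not specify the exact attention formulation, so the linear scoring scheme here is an assumption.

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    """Bi-LSTM over the embedded report; attention weights pool the hidden
    states into one context vector, which a linear layer turns into scores."""
    def __init__(self, embed_dim: int, num_developers: int, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)        # one score per time step
        self.fc = nn.Linear(2 * hidden, num_developers)

    def forward(self, x):                           # x: (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)                         # (batch, seq_len, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)      # (batch, seq_len, 1)
        context = (w * h).sum(dim=1)                # weighted sum of states
        return self.fc(context)                     # (batch, num_developers)

print(BiLSTMAttention(300, 50)(torch.randn(8, 120, 300)).shape)  # [8, 50]
```

Dropping the attention layer and taking the last hidden state instead yields the plain Bi-LSTM variant evaluated in this study.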
3.3.4. MLP (Multi-layer Perceptron)
MLP is a feed-forward neural network model, also known as a fully connected neural network. It consists of multiple layers, i.e., an input layer, multiple hidden layers, and an output layer. The input layer is responsible for receiving the word embedding matrix. The MLP utilized in this paper comprises two hidden layers, each of which consists of multiple neurons. These neurons compute the weighted sums of the outputs from the previous layer, and non-linearity is introduced by the ReLU activation function. The final results are then output through the output layer. During the model training process, the parameters of the MLP are updated using back-propagation to minimize the loss. Similar to LSTM, the hidden layer size of the MLP is also 128.
Frank and Bouckaert, 2006), spam filtering (Metsis et al., 2006; Rusland et al., 2017), and sentiment analysis (Tan et al., 2009; Wongkar and Angdresey, 2019). This algorithm is built upon the principles of Bayesian probability theory. When dealing with classification problems, Naive Bayes computes the conditional probabilities of the different labels for a given sample and selects the label with the highest probability as the classification result. In this paper, after obtaining the word embedding matrix, the model is fitted using the feature vectors from the training data, as in the sketch below.

Hyper-parameter settings:
Epochs: 12
Batch size: 32
Learning rate: 0.001
Filter size: 3, 4, 5
Nums of filters: 512
Size of hidden layers: 128
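A sketch of fitting Naive Bayes on embedding features with scikit-learn follows; the choice of GaussianNB and of collapsing each report's token vectors into one fixed-length feature vector (here, a stand-in random matrix) are assumptions made for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
# Stand-in data: one 300-dimensional pooled embedding per bug report,
# labeled with the index of the developer who fixed it.
X_train = rng.normal(size=(1000, 300))
y_train = rng.integers(0, 50, size=1000)

nb = GaussianNB()
nb.fit(X_train, y_train)                       # fit on training feature vectors
probs = nb.predict_proba(rng.normal(size=(5, 300)))
ranked = np.argsort(-probs, axis=1)[:, :10]    # top-10 candidate fixers per report
print(ranked.shape)                            # (5, 10)
```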
Table 2
The relation between δ and effect size.
Value                        Effect size
|δ| < 0.147                  negligible
|δ| ≥ 0.147 and |δ| < 0.33   small
|δ| ≥ 0.33 and |δ| < 0.474   medium
|δ| ≥ 0.474                  large

Table 3
Corresponding relation of the studied models and the flags in Fig. 7, Fig. 8, Fig. 9, and Fig. 10.
No.  Model              No.  Model
0    Bi-LSTM-A+ELMo     14   Bi-LSTM+W2V
1    Bi-LSTM+ELMo       15   LSTM-A+ELMo
2    Bi-LSTM-A+BERT     16   MLP+W2V
3    Bi-LSTM+BERT       17   TextCNN+GloVe
4    Bi-LSTM-A+GloVe    18   LSTM-A+GloVe
5    Bi-LSTM+GloVe      19   LSTM+ELMo
6    MLP+GloVe          20   LSTM+BERT
7    LSTM-A+BERT        21   LSTM-A+W2V
8    LSTM+W2V           22   TextCNN+BERT
9    LSTM+GloVe         23   MLP+ELMo
10   MLP+BERT           24   NB+BERT
11   Bi-LSTM-A+W2V      25   NB+ELMo
12   TextCNN+W2V        26   NB+GloVe
13   TextCNN+ELMo       27   NB+W2V
difference between the top-k accuracy values generated by the two models. When the p-value is less than 0.05, we can reject the null hypothesis, indicating a significant difference between the two models.

Additionally, to quantify the magnitude of the differences between two different methods, we employ a non-parametric effect size measure, i.e., Cliff's Delta (Cliff, 1993). It can be calculated by the following formula:

δ = (U − D) / (n1 ∗ n2),    (3)

where U and D represent the number of times that the data in the first group is greater than that in the second group and the number of times that the data in the second group is greater than that in the first group, respectively, and n1 and n2 denote the total numbers of samples in the two control groups. The value of δ ranges from −1.0 to +1.0. The extreme value 0.0 indicates that the values in the two control groups are identical, whereas ±1.0 indicates that all values in one group are larger than those in the other group. The absolute value of δ indicates the magnitude of the effect size (Romano et al., 2006). The relation between δ and effect size is shown in Table 2.
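Both statistical procedures can be reproduced as in the sketch below: scipy's wilcoxon implements the paired signed-rank test, Cliff's δ is computed directly from Eq. (3), and the threshold function mirrors Table 2. The fold-level accuracy values are hypothetical.

```python
import numpy as np
from scipy.stats import wilcoxon

def cliffs_delta(a, b):
    """Eq. (3): (U - D) / (n1 * n2) over all cross-group pairs."""
    a, b = np.asarray(a), np.asarray(b)
    diff = a[:, None] - b[None, :]
    u, d = (diff > 0).sum(), (diff < 0).sum()
    return (u - d) / (len(a) * len(b))

def effect_size(delta):
    """Interpretation thresholds from Table 2."""
    m = abs(delta)
    return ("negligible" if m < 0.147 else "small" if m < 0.33
            else "medium" if m < 0.474 else "large")

# Hypothetical top-1 accuracy of two models over ten folds:
acc_model_1 = [0.61, 0.64, 0.60, 0.66, 0.63, 0.65, 0.62, 0.64, 0.61, 0.66]
acc_model_2 = [0.52, 0.55, 0.50, 0.58, 0.54, 0.56, 0.51, 0.57, 0.53, 0.55]
stat, p = wilcoxon(acc_model_1, acc_model_2)   # paired signed-rank test
d = cliffs_delta(acc_model_1, acc_model_2)
print(p < 0.05, effect_size(d))                # True large
```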
4. Experiment result and analysis

In this section, we report the top-k (k = 1, 5, 10) accuracy and MRR of each model. According to the experiment results, we answer the three proposed research questions.

4.1. RQ1: Are there differences in the performance of various deep learning models for the bug assignment problem?

Motivation: Word2Vec and GloVe operate at the word level and generate word embeddings based on tokens. Different from Word2Vec, GloVe incorporates a global co-occurrence matrix and can thus better capture global information. In contrast with the above two models, ELMo and BERT are both sequence-based models, enabling them to capture sentence-level semantic information. Compared to the word embedding models, the sequence-based models can capture more contextual information. Similarly, the classification models, including TextCNN, LSTM, Bi-LSTM, LSTM with attention, Bi-LSTM with attention, MLP, and Naive Bayes, differ from each other. This means that they could have varied performance on bug assignment. This drives us to set up this research question for assessing the differences in their performance.

Method: The text after pre-processing is used to generate word embedding matrices using the representation learning models. The word embedding matrices serve as inputs to the different deep learning classifiers for generating a recommendation list of bug fixers as predictions. The performance of the different models is evaluated using top-k accuracy and MRR, which can be computed as in the sketch below.
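The following is a minimal sketch of the two evaluation metrics over ranked recommendation lists; the developer names and rankings are hypothetical, and in practice the rankings would come from a classifier's probability scores.

```python
import numpy as np

def top_k_accuracy(rankings, truth, k):
    """Fraction of reports whose true fixer appears in the first k recommendations."""
    return float(np.mean([t in r[:k] for r, t in zip(rankings, truth)]))

def mrr(rankings, truth):
    """Mean reciprocal rank of the true fixer (0 if not recommended at all)."""
    recip = [1.0 / (r.index(t) + 1) if t in r else 0.0
             for r, t in zip(rankings, truth)]
    return float(np.mean(recip))

# Hypothetical ranked recommendation lists and true fixers:
rankings = [["dev3", "dev1", "dev7"], ["dev2", "dev5", "dev4"]]
truth = ["dev1", "dev2"]
print(top_k_accuracy(rankings, truth, k=1))  # 0.5
print(mrr(rankings, truth))                  # (1/2 + 1/1) / 2 = 0.75
```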
The results of the 28 deep learning models based on four representation learning models and seven classifiers are shown in Fig. 3, Fig. 4, Fig. 5, and Fig. 6, where the x-axis of each subgraph represents the various models, whereas the y-axis denotes the top-k accuracy or MRR. To clearly illustrate the results, the violin charts with the same classification model are placed in the same group, and the violin charts with different classification models are split by a vertical line. Each violin chart represents the distribution of the top-k (k = 1, 5, 10) accuracy and MRR. Wider sections of a violin chart indicate a greater concentration of data in that area. The upper and lower bounds of a violin chart represent the maximum and minimum values of the top-k accuracy (k = 1, 5, 10) or MRR, and the central short line represents the mean.

Findings: Bi-LSTM-A+ELMo and Bi-LSTM+ELMo are significantly superior to other deep learning models on bug assignment tasks in terms of the top-k accuracy and MRR metrics. Moreover, the two models have large or medium effect sizes over the other models in most cases. Particularly, compared with the methods using Word2Vec and TextCNN (Lee et al., 2017; Zaidi et al., 2020), the methods using the above combinations have over 30% improvement in top-k (k = 1, 5, 10) accuracy across all datasets.

From Fig. 3, Fig. 4, Fig. 5, and Fig. 6, it can be seen that the top-k accuracy and MRR vary among the models. The top-k and MRR values differ for the same representation learning model used with different classification models; likewise, the same classification model used with different representation learning models yields different results. In other words, both the representation learning and the classification models can influence the accuracy of bug assignment. Bi-LSTM-A using ELMo and Bi-LSTM using ELMo always have higher mean values than the other models across all datasets with respect to the top-k accuracy and MRR. Compared with the top-k and MRR values on the three single-project datasets, the values on the cross-project dataset are lower. A likely explanation is that the data distributions of the three datasets are different: the project with the most samples plays a critical role in the prediction of the cross-project dataset, so the results on the cross-project dataset tend toward those of the project with the most samples. Furthermore, we conducted the Wilcoxon signed-rank test and the Cliff's Delta test. The results are shown in Fig. 7, Fig. 8, Fig. 9, and Fig. 10, where each cell in the left and right subplots shows the p-value and the δ value, respectively.

To enhance the readability of the result graphs, different values are differentiated with different colors. In the left subplots, each cell contains a p-value. A gray cell represents that the p-value is less than 0.05; in this case, we can reject the null hypothesis, i.e., there is a significant difference between the two models. Contrarily, a white cell means that the p-value is greater than or equal to 0.05, i.e., we cannot reject the null hypothesis. In the right subplots, the color represents the magnitude of the effect size, i.e., green for negligible, yellow for small, red for medium, and blue for large. The flags from 0 to 27 denote the 28 models; the detailed indicators are shown in Table 3.

From Fig. 7, Fig. 8, Fig. 9, and Fig. 10, it can be seen that Bi-LSTM-A using ELMo and Bi-LSTM using ELMo are significantly superior to the other models in most cases across all datasets in terms of top-1 accuracy and MRR. Moreover, they have large or medium effect sizes over all other models. The above two models are also significantly better than the other models over GCC, Firefox, and the cross-project dataset with respect to the top-5 and top-10 accuracy. Likewise, they also have large or medium effect sizes over the other models.
Fig. 3. The Top-k (k = 1, 5, 10) accuracy and MRR over Eclipse JDT.

This shows that both Bi-LSTM-A+ELMo and Bi-LSTM+ELMo are more suitable for solving the bug assignment task than the other investigated models. In particular, compared with existing text classification-based bug assignment methods (Lee et al., 2017; Zaidi et al., 2020) using Word2Vec and TextCNN, Bi-LSTM-A+ELMo and Bi-LSTM+ELMo have over 30% improvement in top-k (k = 1, 5, 10) accuracy across all subjects.

Despite Bi-LSTM-A+ELMo using the attention mechanism, there is no significant difference between it and Bi-LSTM+ELMo. A likely reason is that the attention mechanism may attend to features that are not closely relevant to bug fixer prediction, limited by the quality of bug reports. Compared with other classification models, Bi-LSTM and Bi-LSTM with attention can capture more important features including
Fig. 6. The Top-k (k = 1, 5, 10) accuracy and MRR over a cross-project dataset.
Fig. 7. The statistical and effect size results between all investigated models on Eclipse JDT.
Fig. 8. The statistical and effect size results between all investigated models on GCC.
Fig. 9. The statistical and effect size results between all investigated models on Firefox.

Fig. 10. The statistical and effect size results between all investigated models on a cross-project dataset.
Fig. 11. The effect of the training text composed of different proportions of the summary and description on Bi-LSTM-A+ELMo and Bi-LSTM+ELMo.

4.3. RQ3: To what extent does the training corpus influence the accuracy of the representation learning models for bug assignment tasks?
Fig. 12. The top-k (k = 1, 5, 10) accuracy and MRR of ELMo and NextBug with Bi-LSTM and Bi-LSTM with attention over the Eclipse JDT, GCC, Firefox, and cross-project datasets.
Table 4
Corresponding relation of the studied models and the flags in Fig. 13.
No.  Model
0    Bi-LSTM-A+NextBug
1    Bi-LSTM+NextBug
2    Bi-LSTM-A+ELMo
3    Bi-LSTM+ELMo

From Fig. 13, it can be seen that the models using NextBug are significantly better than the models with the same classifiers but using ELMo with respect to top-1 accuracy in most cases. Moreover, the former have large or medium effect sizes over the latter. With the exception of the top-1 accuracy, there is no significant difference between NextBug and ELMo in most cases, and the former only has a negligible or small effect size over the latter. In comparison with the other metrics, the top-1 metric is more useful. In this sense, compared with a general-domain training corpus, a domain-specific training corpus related to the target task can significantly improve the accuracy of bug assignment. Although Bi-LSTM-A+NextBug uses the attention mechanism, there is no significant difference between Bi-LSTM-A+NextBug and Bi-LSTM+NextBug in terms of top-k (k = 1, 5, 10) accuracy and MRR; moreover, the former only has a negligible or small effect size over the latter.
Fig. 13. The statistical and effect size results between NextBug and ELMo on Eclipse JDT, GCC, Firefox, and cross-project datasets.
5. Threats to validity

Despite the careful experiment design, it is important to acknowledge the potential threats to the validity of this study. These threats can be summarized as follows.

5.1. Internal validity

The random split of the datasets can cause fluctuations in the performance of a model, leading to different results. This is because when the training set contains a class with few samples, the model cannot learn the features corresponding to that class well, resulting in inaccurate predictions. To alleviate the impact of random dataset partitioning, we employed ten-fold cross-validation, which involves training and validating the model on all data subsets in a rotating way, as sketched below.
bug assignment based on deep learning, but also provides practice
involves training and validating the model on all data subsets in a
guidance for picking the optimal deep learning technique for addressing
rotating way.
bug assignment tasks. Additionally, it helps identify the boundaries of
Another internal validity arises from the influence of parameters
problem-solving abilities for different deep learning models.
in all studied models. The performance of different models can vary
All experimental subjects come from the same bug tracking system,
depending on the chosen parameters. To ensure a fair and objective
i.e., Bugzilla. In addition, there are several popular bug tracking systems
comparison, on the one hand, we conducted parameter experiments to
(e.g. JIRA and BugNet). The structure of bug reports from different
obtain the optimal parameters. On the other hand, we also referred to
bug-tracking systems could be different. For future work, we will
the related works for selecting appropriate parameters.
conduct a large-scale empirical study on more projects from other
bug-tracking systems. The fixed bug reports can reflect the skills or
5.2. External validity
expertise of the developers. We are planning to measure the similarities
of developers based on the similarities of their fixed historical bug
Deep learning-based bug assignment methods depend on bug re- reports. Contrastive learning will be employed to improve the bug
ports. Limited by the expertise of bug submitter, the quality of bug assignment accuracy based on the measured similarity information. The
reports can vary significantly. Low-quality bug reports can cause the Transformer model has achieved promising results in natural language
performance degeneration of a model. To mitigate the impact of bug processing tasks. Moreover, most studies only utilize bug reports. Un-
report quality, we conducted the empirical study using three widely like bug reports, the fixed code can directly reflect the expertise of bug
used datasets in the field of bug assignment, which are extracted from fixers. Therefore, we will extract more features from source code and
four popular and mature open-source software projects, and a cross- bug reports via the Transformer model to improve the performance of
project dataset. We will conduct an empirical study based on more bug assignment further.
open-source and closed-source projects to address this threat in future
work. CRediT authorship contribution statement
Additionally, the data collected from the three projects may contain
noise. Noisy data can affect the performance of the bug assignment Rongcun Wang: Conceptualization, Methodology, Experimental
method. Consequently, we preprocessed the data by removing missing design, Writing – original draft, Supervision. Xingyu Ji: Writing –
data, URLs, non-English bug reports, and stop words to mitigate the original draft, Experimental implementation, Data analysis, Data val-
effect of noisy data. idation, Data visualization. Senlei Xu: Experimental implementation,
Data analysis, Data validation. Yuan Tian: Investigation, Writing –
5.3. Construct validity review & editing. Shujuan Jiang: Supervision, Writing – review &
editing. Rubing Huang: Writing – review & editing.
The construct validity may be affected by the use of the metric, i.e.,
top-k accuracy. As previous studies, we also employed this metric to Declaration of competing interest
evaluate the impacts of different representation learning models and
deep learning classifiers on bug assignment. It may not to be enough The authors declare that they have no known competing finan-
provide a comprehensive evaluation of the performance of the studied cial interests or personal relationships that could have appeared to
models. We will seek other appropriate metrics for the comprehensive influence the work reported in this paper.
evaluation of all studied models to alleviate this threat. Since top-k ac-
curacy only focuses on the top-k recommended bug fixers and does not Data availability
report the specific position of the true assignee in the recommendation
list. The MRR was also employed to address this limitation. https://github.com/AI4BA/dl4ba.
6. Conclusions Acknowledgments
Deep learning-based bug assignment methods have achieved The authors would like to thank the anonymous reviewers for their
promising performance. Several deep learning models have been in- valuable comments and helpful suggestions. This work is partially
creasingly proposed. No previous studies empirically evaluate the ef- supported by the National Natural Science Foundation of China under
fects of different deep learning models on bug assignment. In this grant NO. 61673384, No. 61872167 and No. 61502205, partially sup-
context, we conducted an empirical study of evaluating the impacts of ported by the Science and Technology Development Fund of Macau,
35 deep learning models based on five representation models, namely Macau SAR under grant 0046/2021/A and 0021/2023/R1A1, and
Word2Vec, GloVe, NextBug, BERT, and ELMo, and seven classifica- partially supported by a Faculty Research Grant of Macau University
tion models, i.e., LSTM, TextCNN, LSTM with attention mechanism, of Science and Technology under grant FRG-22-103-FIE.
References

Ahsan, S.N., Ferzund, J., Wotawa, F., 2009. Automatic software bug triage system (BTS) based on latent semantic indexing and support vector machine. In: Proceedings of the 4th International Conference on Software Engineering Advances. pp. 216–221.
Alazzam, I., Aleroud, A., Al Latifah, Z., Karabatis, G., 2020. Automatic bug triage in software systems using graph neighborhood relations for feature augmentation. IEEE Trans. Comput. Soc. Syst. 7 (5), 1288–1303.
Anvik, J., Hiew, L., Murphy, G.C., 2006. Who should fix this bug? In: Proceedings of the 28th International Conference on Software Engineering. ICSE '06, pp. 361–370.
Aung, T.W.W., Wan, Y., Huo, H., Sui, Y., 2022. Multi-triage: A multi-task learning framework for bug triage. J. Syst. Softw. 184, 111133.
Beltagy, I., Lo, K., Cohan, A., 2019. SciBERT: A pretrained language model for scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing. EMNLP-IJCNLP, pp. 3615–3620.
Bhattacharya, P., Neamtiu, I., Shelton, C.R., 2012. Automated, highly-accurate, bug assignment using machine learning and tossing graphs. J. Syst. Softw. 85 (10), 2275–2292.
Bird, S., Klein, E., Loper, E., 2009. Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, Inc.
Chakraborty, S., Krishna, R., Ding, Y., Ray, B., 2022. Deep learning based vulnerability detection: Are we there yet? IEEE Trans. Softw. Eng. 48 (9), 3280–3296.
Choquette-Choo, C.A., Sheldon, D., Proppe, J., Alphonso-Gibbs, J., Gupta, H., 2019. A multi-label, dual-output deep neural network for automated bug triaging. In: Proceedings of the 18th IEEE International Conference on Machine Learning and Applications. ICMLA, pp. 937–944.
Cliff, N., 1993. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychol. Bull. 114 (3), 494.
Dai, W., Xue, G.-R., Yang, Q., Yu, Y., 2007. Transferring Naive Bayes classifiers for text classification. In: AAAI, Vol. 7. pp. 540–545.
Dedík, V., Rossi, B., 2016. Automated bug triaging in an industrial context. In: Proceedings of the 42nd Euromicro Conference on Software Engineering and Advanced Applications. SEAA, pp. 363–367.
Du, X., Zheng, Z., Xiao, G., Zhou, Z., Trivedi, K.S., 2022. DeepSIM: Deep semantic information-based automatic mandelbug classification. IEEE Trans. Reliab. 71 (4), 1540–1554.
Frank, E., Bouckaert, R.R., 2006. Naive Bayes for text classification with unbalanced classes. In: Knowledge Discovery in Databases: PKDD 2006: 10th European Conference on Principles and Practice of Knowledge Discovery in Databases, Berlin, Germany, September 18-22, 2006, Proceedings 10. Springer, pp. 503–510.
Giray, G., Bennin, K.E., Köksal, Ö., Babur, Ö., Tekinerdogan, B., 2023. On the use of deep learning in software defect prediction. J. Syst. Softw. 195, 111537.
Graves, A., 2012. Long short-term memory. In: Supervised Sequence Labelling with Recurrent Neural Networks. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 37–45.
Guo, S., Zhang, X., Yang, X., Chen, R., Guo, C., Li, H., Li, T., 2020. Developer activity motivated bug triaging: Via convolutional neural network. Neural Process. Lett. 51 (3), 2589–2606.
Hu, D., Chen, M., Wang, T., Chang, J., Yin, G., Yu, Y., Zhang, Y., 2018. Recommending similar bug reports: A novel approach using document embedding model. In: Proceedings of the 25th Asia-Pacific Software Engineering Conference. APSEC, pp. 725–726.
Hu, X., Li, G., Xia, X., Lo, D., Jin, Z., 2020. Deep code comment generation with hybrid lexical and syntactical information. Empir. Softw. Eng. 25 (3), 2179–2217.
Jahanshahi, H., Cevik, M., 2022. S-DABT: Schedule and dependency-aware bug triage in open-source bug tracking systems. Inf. Softw. Technol. 151, 107025.
Jahanshahi, H., Chhabra, K., Cevik, M., Basar, A., 2021. DABT: A dependency-aware bug triaging method. In: International Conference on Evaluation and Assessment in Software Engineering. pp. 221–230.
Jeong, G., Kim, S., Zimmermann, T., 2009. Improving bug triage with bug tossing graphs. In: Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering. pp. 111–120.
Kim, Y., 2014. Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. EMNLP, Doha, Qatar, pp. 1746–1751.
Ko, A.J., Myers, B.A., Chau, D.H., 2006. A linguistic analysis of how people describe software problems. In: Visual Languages and Human-Centric Computing. VL/HCC'06, IEEE, pp. 127–134.
Lamkanfi, A., Demeyer, S., Giger, E., Goethals, B., 2010. Predicting the severity of a reported bug. In: 2010 7th IEEE Working Conference on Mining Software Repositories. MSR 2010, IEEE, pp. 1–10.
Lee, S.-R., Heo, M.-J., Lee, C.-G., Kim, M., Jeong, G., 2017. Applying deep learning based automatic bug triager to industrial projects. In: Proceedings of the 11th Joint Meeting on Foundations of Software Engineering. pp. 926–931.
Lee, D.-G., Seo, Y.-S., 2019. Systematic review of bug report processing techniques to improve software management performance. J. Inf. Process. Syst. 15 (4), 967–985.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J., 2020. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), 1234–1240.
Mani, S., Sankaran, A., Aralikatte, R., 2019. Deeptriage: Exploring the effectiveness of deep learning for bug triaging. In: Proceedings of the ACM India Joint International Conference on Data Science and Management of Data. pp. 171–179.
Matter, D., Kuhn, A., Nierstrasz, O., 2009. Assigning bug reports using a vocabulary-based expertise model of developers. In: Proceedings of the 6th International Working Conference on Mining Software Repositories. pp. 131–140.
Metsis, V., Androutsopoulos, I., Paliouras, G., 2006. Spam filtering with naive Bayes - which Naive Bayes? In: CEAS, Vol. 17. Mountain View, CA, pp. 28–69.
Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Murphy, G., Cubranic, D., 2004. Automatic bug triage using text categorization. In: Proceedings of the 6th International Conference on Software Engineering & Knowledge Engineering. Citeseer, pp. 1–6.
Naguib, H., Narayan, N., Brügge, B., Helal, D., 2013. Bug report assignee recommendation using activity profiles. In: Proceedings of the 10th Working Conference on Mining Software Repositories. MSR, pp. 22–30.
Nowak, J., Taspinar, A., Scherer, R., 2017. LSTM recurrent neural networks for short text and sentiment classification. In: Artificial Intelligence and Soft Computing. Springer International Publishing, Cham, pp. 553–562.
Romano, J., Kromrey, J.D., Coraggio, J., Skowronek, J., 2006. Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen's d for evaluating group differences on the NSSE and other surveys. In: The Annual Meeting of the Florida Association of Institutional Research. pp. 1–31.
Rusland, N.F., Wahid, N., Kasim, S., Hafit, H., 2017. Analysis of Naïve Bayes algorithm for email spam filtering across multiple datasets. In: IOP Conference Series: Materials Science and Engineering, vol. 226, (no. 1), IOP Publishing, 012091.
Sajedi-Badashian, A., Stroulia, E., 2020. Vocabulary and time based bug-assignment: A recommender system for open-source projects. Softw. - Pract. Exp. 50 (8), 1539–1564.
Sarkar, A., Rigby, P.C., Bartalos, B., 2019. Improving bug triaging with high confidence predictions at Ericsson. In: 2019 IEEE International Conference on Software Maintenance and Evolution. ICSME, pp. 81–91.
Sawarkar, R., Nagwani, N.K., Kumar, S., 2019. Predicting available expert developer for newly reported bugs using machine learning algorithms. In: Proceedings of the 5th International Conference for Convergence in Technology. I2CT, pp. 1–4.
Sbih, A., Akour, M., 2018. Towards efficient ensemble method for bug triaging. J. Mult.-Valued Logic Soft Comput. 31, 567–590.
Sun, X., Yang, H., Xia, X., Li, B., 2017. Enhancing developer recommendation with supplementary information via mining historical commits. J. Syst. Softw. 134, 355–368.
Tan, S., Cheng, X., Wang, Y., Xu, H., 2009. Adapting Naive Bayes to domain adaptation for sentiment analysis. In: Advances in Information Retrieval: 31st European Conference on IR Research, ECIR 2009, Toulouse, France, April 6-9, 2009, Proceedings 31. Springer, pp. 337–349.
Taud, H., Mas, J., 2018. Multilayer perceptron (MLP). In: Camacho Olmedo, M.T., Paegelow, M., Mas, J.-F., Escobar, F. (Eds.), Geomatic Approaches for Modeling Land Change Scenarios. Springer International Publishing, Cham, pp. 451–455.
Von der Mosel, J., Trautsch, A., Herbold, S., 2022. On the validity of pre-trained transformers for natural language processing in the software engineering domain. IEEE Trans. Softw. Eng. 49 (4), 1487–1507.
Voorhees, E.M., et al., 1999. The TREC-8 question answering track report. In: TREC, Vol. 99. pp. 77–82.
Wilcoxon, F., 1946. Individual comparisons by ranking methods. Biometrics 1 (6), 80–83.
Wongkar, M., Angdresey, A., 2019. Sentiment analysis using Naive Bayes algorithm of the data crawler: Twitter. In: 2019 Fourth International Conference on Informatics and Computing. ICIC, IEEE, pp. 1–5.
Wu, W., Zhang, W., Yang, Y., Wang, Q., 2011. DREX: Developer recommendation with K-nearest-neighbor search and expertise ranking. In: Proceedings of the 18th Asia-Pacific Software Engineering Conference. pp. 389–396.
Xi, S., Yao, Y., Xiao, X., Xu, F., Lu, J., 2018. An effective approach for routing the bug reports to the right fixers. In: Proceedings of the 10th Asia-Pacific Symposium on Internetware. Internetware '18.
Xia, X., Lo, D., Ding, Y., Al-Kofahi, J., Nguyen, T., Wang, X., 2017. Improving automated bug triaging with specialized topic model. IEEE Trans. Softw. Eng. 43 (3), 272–297.
Xia, X., Lo, D., Wang, X., Zhou, B., 2015. Dual analysis for recommending developers to resolve bugs. J. Softw.: Evol. Process 27 (3), 195–220.
Xu, S., Li, Y., Wang, Z., 2017. Bayesian multinomial naïve Bayes classifier to text classification. In: Advanced Multimedia and Ubiquitous Engineering: MUE/FutureTech 2017 11. Springer, pp. 347–352.
Xuan, J., Jiang, H., Hu, Y., Ren, Z., Zou, W., Luo, Z., Wu, X., 2015. Towards effective bug triage with software data reduction techniques. IEEE Trans. Knowl. Data Eng. 27 (1), 264–280.
Yadav, A., Singh, S.K., 2020. A novel and improved developer rank algorithm for bug assignment. Int. J. Intell. Syst. Technol. Appl. 19 (1), 78–101.
Yang, G., Zhang, T., Lee, B., 2014. Towards semi-automatic bug triage and severity prediction based on topic model and multi-feature of bug reports. In: Proceedings of the 38th Annual Computer Software and Applications Conference. pp. 97–106.
Yin, Y., Dong, X., Xu, T., 2018. Rapid and efficient bug assignment using ELM for IOT software. IEEE Access 6, 52713–52724.
Zaidi, S.F.A., Awan, F.M., Lee, M., Woo, H., Lee, C.-G., 2020. Applying convolutional neural networks with different word representation techniques to recommend bug fixers. IEEE Access 8, 213729–213747.
Zhang, T., Chen, J., Jiang, H., Luo, X., Xia, X., 2017. Bug report enrichment with application of automated fixer recommendation. In: 2017 IEEE/ACM 25th International Conference on Program Comprehension. ICPC, pp. 230–240.
Zhang, Y., Wallace, B.C., 2017. A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification. In: Proceedings of the 8th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 253–263.
Zhang, J., Wang, X., Hao, D., Xie, B., Zhang, L., Mei, H., 2015. A survey on bug-report analysis. Sci. China Inf. Sci. 58 (2), 1–24.
Zhou, Y., Tong, Y., Gu, R., Gall, H., 2016. Combining text mining and data mining for bug report classification. J. Softw.: Evol. Process 28 (3), 150–176.