The Journal of Systems and Software 210 (2024) 111961

Contents lists available at ScienceDirect

The Journal of Systems & Software

journal homepage: www.elsevier.com/locate/jss

An empirical assessment of different word embedding and deep learning models for bug assignment✩

Rongcun Wang a,∗, Xingyu Ji a, Senlei Xu a, Yuan Tian b, Shujuan Jiang a, Rubing Huang c

a School of Computer Science and Technology, China University of Mining and Technology, No. 1 Daxue Road, Xuzhou, 221116, Jiangsu, China
b School of Computing, Queen’s University, Kingston, Canada
c School of Computer Science and Engineering, Macau University of Science and Technology, Taipa, Macao Special Administrative Region of China

ARTICLE INFO

Dataset link: https://github.com/AI4BA/dl4ba

Keywords:
Bug assignment
Bug report
Word embedding
Deep learning
Classification

ABSTRACT

Bug assignment, or bug triage, focuses on identifying the appropriate developers to repair newly discovered bugs, thereby managing them more effectively. Several deep learning-based approaches have been proposed for automated bug assignment. These approaches view automated bug assignment as a text classification task: the textual description of a bug report is utilized as the input and the potential fixers are regarded as the output labels. Such approaches typically depend on the classification performance of natural language processing and machine learning techniques. Various word embedding and deep learning models have emerged continuously. The effectiveness of those approaches depends on the chosen deep learning model, used for classification, and the word embedding model, used for representing bug reports. However, prior research does not empirically evaluate the impacts of various word embedding and deep learning models for automated bug assignment. In this paper, we conduct an empirical study to analyze the performance variations among 35 deep learning-based automated bug assignment approaches. These approaches are based on five word embedding techniques, i.e., Word2Vec, GloVe, NextBug, ELMo, and BERT, and seven text classification models, i.e., TextCNN, LSTM, Bi-LSTM, LSTM with attention, Bi-LSTM with attention, MLP, and Naive Bayes. We evaluated these combinations across three benchmark datasets, namely Eclipse JDT, GCC, and Firefox, and their merger, i.e., a cross-project dataset. Our main observations are: (1) Bi-LSTM with attention and Bi-LSTM using ELMo are significantly superior to other deep learning models on bug assignment tasks in terms of top-k (k = 1, 5, 10) accuracy and MRR; (2) Both the summary and description of bug reports are useful for bug assignment, but the description is more useful than the summary; (3) The training corpus for word embedding models has a significant impact on the performance of deep learning-based bug assignment methods. Our results show the importance of tuning different components (e.g. word embedding model, classification model, and textual input) in deep learning-based automated bug assignment methods and provide important insights for practitioners and researchers.

1. Introduction

Bugs are inevitable during software development and maintenance, posing significant challenges to large-scale open-source software projects (Xuan et al., 2015). To shorten the software development cycle and reduce maintenance costs, large-scale open-source software projects leverage bug tracking systems (e.g. JIRA,1 Bugzilla,2 and BugNet3) for highly efficient bug management such as submission, fixing, and verification. These systems store bugs in the form of bug reports, which contain detailed information that could help developers reproduce and fix the bugs, such as summaries, descriptions, and relevant code. The workflows of these bug tracking systems are similar (Jahanshahi et al., 2021): (1) When a developer encounters an error, she/he writes the corresponding issue report and submits it to the bug tracking system. (2) The triager determines whether the issue

✩ Editor: Aldeida Aleti.

∗ Corresponding author.
E-mail address: [email protected] (R. Wang).
1 https://www.atlassian.com/software/jira/bug-tracking
2 https://www.bugzilla.org/
3 https://bugnetproject.com/
4 https://github.com/dotnet/aspnetcore
5 https://gcc.gnu.org/bugzilla

https://doi.org/10.1016/j.jss.2024.111961
Received 17 August 2023; Received in revised form 10 November 2023; Accepted 3 January 2024
Available online 6 January 2024
0164-1212/© 2024 Elsevier Inc. All rights reserved.
R. Wang et al. The Journal of Systems & Software 210 (2024) 111961

is a bug and the type of this bug based on the content of the issue report. (3) The triager assigns the bug to the appropriate fixer. (4) When the assignee cannot fix the bug, this bug is put back into the bug-tracking system. Meanwhile, the status of the bug report is changed to “New”, waiting for the next round of assignment. (5) When the bug is successfully fixed, the bug is closed. As seen in this workflow, triage is tedious and time-consuming work for triagers (Jahanshahi and Cevik, 2022). Particularly, as software continually evolves, the daily number of bug reports submitted to these systems greatly rises. For example, for the ASP.NET Core project,4 approximately 115.8 bug reports were filed per month on average from May to October 2023. The GCC Bugzilla5 received 85 reports in one recent week, from October 3 to October 10, 2023. Faced with such a substantial volume of bug reports, triagers need to make great efforts to understand them and assign them to the appropriate developers for bug resolution. Particularly, for complex open-source software products, bug assignment is more challenging because the number of involved bug fixers is large and they have different skills.

To address the above challenge, researchers have proposed two main types of automated bug assignment approaches, i.e., information retrieval-based (Yang et al., 2014; Hu et al., 2018; Matter et al., 2009; Sajedi-Badashian and Stroulia, 2020; Xia et al., 2017; Alazzam et al., 2020) and text classification-based bug assignment methods (Murphy and Cubranic, 2004; Xuan et al., 2015; Dedík and Rossi, 2016; Ahsan et al., 2009; Naguib et al., 2013; Sbih and Akour, 2018; Sarkar et al., 2019; Sawarkar et al., 2019; Lee et al., 2017; Mani et al., 2019; Guo et al., 2020). Bug assignment methods based on information retrieval recommend bug fixers according to the similarity between historical bug reports and newly-arrived bug reports, and the corresponding relationship between historical bug reports and fixers. Unlike information retrieval-based bug assignment methods, bug assignment methods based on text classification extract textual features from bug reports and build classifiers to predict the labels, i.e., the actual fixers of bugs. The text classification-based bug assignment methods consist of two-stage operations. One is to generate word vector representations by various techniques such as the term frequency-inverse document frequency (TF-IDF) (Xuan et al., 2015; Dedík and Rossi, 2016; Ahsan et al., 2009; Sarkar et al., 2019) and the word to vector (Word2Vec) (Lee et al., 2017; Mani et al., 2019; Guo et al., 2020). The other is to predict bug fixers by different classification models such as Naive Bayes (NB) (Xuan et al., 2015; Murphy and Cubranic, 2004), Support Vector Machines (SVM) (Dedík and Rossi, 2016; Ahsan et al., 2009), and ensemble learning methods (Sbih and Akour, 2018).

Recently, various deep neural networks have been successfully applied in many tasks related to software engineering such as code comment generation (Hu et al., 2020), vulnerability detection (Chakraborty et al., 2022), and software defect prediction (Giray et al., 2023). Moreover, they have achieved better performance compared with traditional machine learning-based bug assignment methods (Lee et al., 2017; Lee and Seo, 2019). For example, both Lee et al. (2017) and Guo et al. (2020) utilized Word2Vec to generate textual representations of bug reports. They employed convolutional neural networks (CNN) for text classification to predict fixers. Xi et al. (2018) proposed a sequence-to-sequence model for bug assignment. Recurrent Neural Networks (RNN) and Gated Recurrent Units (GRU) are used for feature extraction and model prediction, respectively. Choquette-Choo et al. (2019) used Dual-Output Deep Neural Networks (DNN) for bug assignment. Bug assignment based on deep learning has made remarkable progress. This makes bug assignment more precise and cost-effective, ultimately enhancing the efficiency of software maintenance and reducing overall bug resolution costs.

In previous studies (Lee et al., 2017; Mani et al., 2019; Guo et al., 2020), Word2Vec (Mikolov et al., 2013) was predominantly used to generate text vector representations, and CNN (Zhang and Wallace, 2017) was commonly employed as the classifier for bug assignment. Existing studies have demonstrated the effectiveness of representation learning and deep learning in bug assignment. It is worth noting that the promising and recently popular representation learning methods, like global vectors (GloVe), embeddings from language models (ELMo), and bidirectional encoder representations from Transformers (BERT), have been rarely utilized for bug assignment. Likewise, the deep learning classification models widely used in natural language processing such as multi-layer perceptron (MLP) (Taud and Mas, 2018) and long short-term memory (LSTM) (Graves, 2012) have been rarely applied for bug assignment. The bug assignment methods based on different word embedding and deep learning classification models could result in different performances. Nevertheless, no previous studies empirically investigated the effects of the methods using various representation learning and deep learning classification models on automated bug assignment tasks. This restricts further development of the bug assignment task, especially in picking the relatively optimal deep learning models for addressing this task.

In this context, we designed and conducted an empirical assessment of different word embedding and deep learning models for bug assignment. Three word-level embedding models, i.e., Word2Vec, GloVe, and a fine-tuned skip-gram model based on the bug-specialized domain called NextBug (Du et al., 2022), and two sequence-based representation learning models, i.e., ELMo and BERT, were empirically assessed. Seven popular deep learning classifiers including TextCNN (Kim, 2014), LSTM (Graves, 2012), bidirectional LSTM (Bi-LSTM), LSTM with attention, Bi-LSTM with attention, MLP (Taud and Mas, 2018), and Naive Bayes (NB) (Xu et al., 2017), were also empirically evaluated. To our knowledge, we are the first to empirically investigate the effects of the models based on different word embedding techniques and deep learning classification models on automated bug assignment. The experiments were designed and conducted on three widely-used bug assignment datasets, namely Eclipse JDT, GCC, and Firefox, and a cross-project dataset. The following three research questions are set to better understand the effects of three key components, i.e., learning models, bug report elements, and source of training corpus on bug assignment based on text classification.

RQ1: Are there differences in the performance of various deep learning models for automated bug assignment? We consider four representation learning models: word-level models with local sensitivity (e.g. GloVe and Word2Vec) and sentence-level models with global sensitivity (e.g. BERT and ELMo) for generating word embeddings. Moreover, seven deep learning classification models including TextCNN, LSTM, Bi-LSTM, LSTM with attention, Bi-LSTM with attention, MLP, and Naive Bayes are used for predicting bug fixers. The extracted features could vary with the models, resulting in different predictive results. Therefore, we set up this research question to investigate which deep learning models are more suitable for bug assignment tasks.

RQ2: Is the description useful for bug assignment? What is the optimal weight between the summary and description? The textual information of bug reports typically consists of a summary and description. Compared with the description, the summary is more concise and contains semantic connections that are more closely related to a bug. In contrast, the description provides more details related to a bug. In view of the above differences, we proposed this research question to investigate their contributions to bug assignment tasks.

RQ3: To what extent does the training corpus influence the performance of the representation learning models for bug assignment tasks? Most representation learning models are trained based on a general domain corpus. The models trained based on a general corpus tend to have stronger generalization capabilities. However, the models trained based on a bug-specialized domain corpus (e.g. NextBug) may be more suitable for bug-related tasks. Therefore, we proposed this research question to investigate whether there are differences in the performance of the models trained on a general corpus and a bug-specialized domain corpus in bug assignment tasks.
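To make the two-stage text classification formulation described in this introduction more concrete, the following is a minimal, self-contained sketch: stage one turns each bug report into a TF-IDF vector, and stage two ranks candidate fixers with a multinomial Naive Bayes classifier. It is an illustration only and is not the pipeline evaluated in this study. The toy reports, the fixer names, and the helper names (tokenize, fit_idf, tfidf, train_nb, rank_fixers) are all hypothetical, and whitespace tokenization with smoothed IDF and Laplace smoothing are assumptions made for brevity.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real pre-processing.
    return text.lower().split()


def fit_idf(docs):
    """Stage 1a: smoothed inverse document frequency per training term."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    """Stage 1b: sparse TF-IDF vector; terms unseen in training are dropped."""
    tf = Counter(tokenize(doc))
    return {t: c * idf[t] for t, c in tf.items() if t in idf}


def train_nb(docs, fixers, idf):
    """Stage 2 (training): accumulate TF-IDF mass per (fixer, term)."""
    weights = defaultdict(lambda: defaultdict(float))
    prior = Counter(fixers)
    for doc, fixer in zip(docs, fixers):
        for term, w in tfidf(doc, idf).items():
            weights[fixer][term] += w
    return weights, prior


def rank_fixers(doc, idf, model, k=3):
    """Stage 2 (prediction): return the top-k fixers by Naive Bayes log score."""
    weights, prior = model
    total_docs = sum(prior.values())
    vec = tfidf(doc, idf)
    scores = {}
    for fixer, n_docs in prior.items():
        # Laplace smoothing over the training vocabulary.
        denom = sum(weights[fixer].values()) + len(idf)
        score = math.log(n_docs / total_docs)
        for term, w in vec.items():
            score += w * math.log((weights[fixer].get(term, 0.0) + 1.0) / denom)
        scores[fixer] = score
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy history: four fixed reports and the developer who fixed each.
reports = [
    "compiler crash on template instantiation",
    "crash in optimizer pass on loop unrolling",
    "toolbar button rendering glitch",
    "tab rendering glitch on dark theme",
]
fixers = ["alice", "alice", "bob", "bob"]

idf = fit_idf(reports)
model = train_nb(reports, fixers, idf)
print(rank_fixers("rendering glitch in toolbar", idf, model, k=2))
```

Returning a ranked list rather than a single label mirrors how bug assignment is usually evaluated, i.e., by checking whether the actual fixer appears among the top-k recommendations.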


The main contributions of this paper are listed as follows.

• We comprehensively investigated the effects of different word embedding and deep learning models on bug assignment. The differences among all investigated models were qualitatively and quantitatively assessed.
• We designed nine strategies representing different weights between the summary and description of the bug report. The nine strategies were empirically evaluated to study the relative importance of the summary and description.
• We conducted an empirical evaluation of the impacts of training corpus from general and bug-specialized domains on bug assignment tasks.

The results and source code related to this study are available at https://github.com/AI4BA/dl4ba.

The rest of this paper is organized as follows. Section 2 summarizes the related work. The experimental design and analysis of results are presented in Sections 3 and 4, respectively. We discuss the threats to validity in Section 5. Finally, the paper concludes with the future work in Section 6.

2. Related work

Previous studies on automatic bug assignment can be roughly categorized into three types based on the used techniques: information retrieval-based, text classification-based, and hybrid methods. The first aims to assign bugs based on the similarity between the newly arrived bug reports and historical bug reports. The second regards bug assignment as a classification issue. In other words, a classification model is trained for bug assignment using the text features extracted from the bug reports with the assignees as labels. The last mainly combines multiple techniques including IR-based methods, learning-based methods, and tossing graphs for automated bug assignment.

2.1. Information retrieval-based bug assignment methods

Information retrieval-based bug assignment methods are based on the observation that the fixers who previously solved similar bugs are regarded as the most suitable candidates for newly arrived bug reports (Zhang et al., 2015). Most of the methods search for bug reports in the repository that are the most similar to the newly arrived report. The bug fixers corresponding to the retrieved bug reports are recommended as the appropriate fixers. For instance, Yang et al. (2014) extracted topics from historical bug reports based on Latent Dirichlet Allocation (LDA). According to the extracted topics, the topics to which a newly arrived bug report belongs can be determined. The historical bug reports having the same features and topics as the newly arrived report can be retrieved. The fixers corresponding to the retrieved bug reports are regarded as the fixers of the newly arrived bug report. Hu et al. (2018) proposed a document embedding model-based method for improving the performance of Yang et al.’s method (Yang et al., 2014). Similarly, Matter et al. (2009) presented an expertise model of developers based on their contributed source code. According to the cosine similarity between the vocabulary of bug reports and the vocabulary in source code contributions, the newly arrived bug report can be assigned to developers.

Sajedi-Badashian and Stroulia (2020) presented a vocabulary and time-aware bug assignment method. The vocabulary consists of programming keywords in bug reports belonging to Stack Overflow tags. They argued developers’ recent expertise is more important than past expertise. Furthermore, they considered the importance of the keywords and time of usage to improve the TF-IDF metric. Xia et al. (2017) proposed a bug assignment method, in which they utilized a multi-feature topic model (MTM) to conduct topic modeling based on product and component information from historical bug reports. They employed TopicMiner to obtain the topic distribution of newly arrived bug reports, enabling the retrieval of similar reports. Alazzam et al. (2020) proposed a feature augmentation approach that leverages graph partitioning based on neighborhood overlap to augment features. By considering the similarity between summaries of bug reports and each term cluster, the approach concatenates terms from strong clusters into the feature vectors of the summaries. This approach enhances bug assignment methods based on machine learning.

2.2. Text classification-based bug assignment methods

Bug assignment methods based on text classification can be further classified into traditional machine learning-based methods and deep learning-based methods.

2.2.1. Traditional machine learning-based methods

Murphy and Cubranic (2004) first treated the problem of bug assignment as a case of text classification. The NB algorithm was employed to predict the developer based on the bug’s description. Similarly, Xuan et al. (2015) used TF-IDF to generate word vector representations for the summaries and descriptions of bug reports, and utilized the NB classifier for bug assignment. Compared with the study (Xuan et al., 2015), Dedík and Rossi (2016) extracted more features from bug reports including summaries, descriptions, keywords, and stack traces. They combined TF-IDF with SVM for automated bug triaging in an industrial context. Ahsan et al. (2009) extracted textual information such as summaries and descriptions from bug reports and parsed the names of developers as labels. They represented the textual information using TF-IDF and applied feature selection and latent semantic indexing methods to reduce the dimension of a term document matrix. Six machine learning classifiers were compared. The best one is based on SVM and latent semantic indexing. Naguib et al. (2013) utilized developers’ historical activities from the bug tracking system to create an activity profile for each developer describing her/his roles, expertise, and level of involvement in the project. Subsequently, LDA was utilized to generate the topic feature vectors of bug reports. SVM was employed to determine the appropriate fixer.

Sbih and Akour (2018) proposed a model for bug assignment. The model was evaluated using ensemble learning methods such as bagging, boosting, and decorating, along with classifiers including Bayes Net, Naive Bayes, Decision Table, Random Tree, and J48. Sarkar et al. (2019) considered three types of attributes, i.e., textual, categorical, and log attributes. Several traditional classifiers such as K-Nearest Neighbor (KNN) and Logistic Regression (LR) were empirically evaluated. Sawarkar et al. (2019) utilized the Bag-of-Words (BOW) model to extract feature vectors from textual information. They employed multi-label classification algorithms, including Random Forest, SVM, and J48, to predict the fixers for the given bug reports.

2.2.2. Deep learning-based methods

Lee et al. (2017) utilized the Word2Vec representation learning model to learn representations of summaries and descriptions in bug reports. The CNN was used to extract text features and predict bug fixers. Similarly, Mani et al. (2019) also used Word2Vec and introduced an attention mechanism to learn sentence-level semantic features for bug assignment. They proposed a bug assignment method based on a Bidirectional Recurrent Neural Network. Guo et al. (2020) considered developer activity in addition to summaries and descriptions and proposed a bug assignment method based on Word2Vec and CNN. Xi et al. (2018) utilized the RNN to generate textual features from the text in bug reports and model the dependencies among developers depending on tossing sequences. After that, a sequence of candidate fixers was generated. Choquette-Choo et al. (2019) employed the Dual-Output Deep Neural Networks (DNN) to perform bug assignment. They utilized the model to predict the team category, which, in turn, helps in predicting the appropriate fixers. Existing studies predominantly use Word2Vec and CNN to generate word embeddings and predict the right fixers,


respectively. Although recent representation learning models such as GloVe, ELMo, and BERT, and deep learning classifiers like LSTM and Bi-LSTM are popular in the field of natural language processing, they have been rarely applied for addressing bug assignment. Moreover, the word embeddings vary with the embedding models because they use different algorithms. Likewise, different classifiers could generate different classification results. This drives us to conduct this study. Different from the above studies, we empirically investigated many more word embedding and deep learning classification models for automated bug assignment.

2.3. Hybrid methods

Bhattacharya et al. (2012) investigated the impact of bug tossing graphs and machine learning classifiers on the prediction of bug assignment. They utilized a probabilistic graph-based model to represent bug reports and investigated its impact on bug assignment tasks across Naive Bayes, Bayesian Networks, C4.5, and SVM classifiers. Xia et al. (2015) proposed a composite method called DevRec for developer recommendation to resolve bugs using Euclidean distance metric, multi-label k-nearest neighbor (ML-KNN), and topic modeling. Wu et al. (2011) proposed a developer recommendation method with K-Nearest-Neighbor Search and Expert Rating (DREX) for bug assignment. This method utilizes KNN and expert rating to assign bugs and evaluates the model using precision and social network metrics.

Furthermore, there are other methods for automated bug assignment. For example, the approach (Jeong et al., 2009) trained a tossing graph model, which effectively reduces the tossing length of the assignment process and improves overall accuracy. Yadav and Singh (2020) constructed a developer recommendation list based on individual developers’ expertise and proficiency in bug fixing, thereby optimizing the assignment process. Sun et al. (2017) proposed a method for developer recommendation based on data mining. The commits relevant to the issue requests were extracted by mining the software repository. The cosine similarity between historical commits and new bug reports is calculated to decide which historical commits are related to the changed source code. The contributors corresponding to those determined commits are given scores by collaborative topic modeling. According to the scores, the appropriate contributors are recommended as the fixers of the new issue. To address the problem of limited informative contents of bug reports, Zhang et al. (2017) proposed a bug report enrichment method to improve the performance of automated fixer recommendations. Different from the above studies, Aung et al. (2022) used code snippets in addition to bug reports and proposed a transformer-based model for bug triage.

3. Experimental design

This section describes the experiments in detail including the used datasets, data pre-processing, technical framework, evaluation metrics, and statistical analysis methods. All experiments were conducted on Ubuntu 20.04 equipped with an Intel(R) Xeon(R) Platinum 8338C CPU @ 2.60 GHz, 80 GB RAM, and an RTX 3090 24 GB video card.

3.1. Datasets

The experiments were conducted on three widely used projects, i.e., Eclipse JDT,6 GCC,7 and Firefox.8 There are two primary reasons for the choice of the Eclipse JDT and Firefox datasets. Firstly, Lee and Seo (2019) enumerated 18 relevant papers, out of which 16 studies used the data from Eclipse JDT and Firefox. Secondly, both projects adopt Bugzilla as their bug tracking system, ensuring the consistency of bug report structures. The Eclipse JDT dataset focusing on Java Development Tooling (JDT) comprises 842 developers and 1,465 bug reports spanning from October 20, 2013 to October 20, 2016. The Firefox dataset contains 13,668 bug reports contributed by 69 developers during the period from August 1, 2013, to July 31, 2016.

Additionally, we also incorporated the GCC dataset commonly used in previous studies (Zhang et al., 2017; Anvik et al., 2006; Xia et al., 2015; Yin et al., 2018). The GCC dataset (Yin et al., 2018) we used contains 2103 bug reports involving 82 developers. The bug reports from the three datasets contain essential information such as bug ID, assignee, summary, description, and status. To ensure the reliability of the experiments, we only used bug reports with the status ‘RESOLVED’ or ‘FIXED’. The severity levels of those bug reports are ‘P1 Critical’ or ‘P2 Normal’. We also conducted a cross-project validation, i.e., the three datasets are merged into one dataset, to comprehensively evaluate the performance of the studied models in bug assignment.

3.2. Technical framework

The process of bug assignment based on text classification mainly involves three steps: data pre-processing, word embedding matrix generation, and bug fixer prediction. The overall technical framework is illustrated in Fig. 1. The bug reports are first pre-processed. Then, the summary and description of each bug report are input to one of the studied representation learning models to obtain embedding vectors. Embedding vectors are subsequently fed to one of the studied deep learning classifiers. Finally, the correlation probabilities between each bug report and all fixers are computed via a fully connected layer to obtain the recommended top-k fixers.

Fig. 1. Overview of automated bug assignment based on representation learning models and deep learning classifiers.

3.2.1. Data pre-processing

The bug report consists of various attributes, such as bug ID, component, bug status, bug assignee, summary, and description. Like previous studies (Lee et al., 2017; Mani et al., 2019; Guo et al., 2020), we also focus on the summary and description attributes. The extracted bug assignees are considered the labels for a supervised classifier. To pre-process the bug report data, we removed digits, punctuation marks, URLs, spaces, and non-ASCII characters from the text. Additionally, we

6 https://bugs.eclipse.org/bugs
7 https://gcc.gnu.org/bugzilla
8 https://bugzilla.mozilla.org


utilized the NLTK toolkit (Bird et al., 2009) to remove stopwords and converted all uppercase letters to lowercase. The above operations aim at cleaning the text and eliminating irrelevant information, making it more suitable for further training and classification.

3.2.2. Word embedding matrix generating

After pre-processing, the next step is the tokenization of the text. The tokens are then fed into the representation learning models. For the word-level embedding models, an additional operation is to generate a vocabulary dictionary corresponding to the text and form word vectors using Word2Vec, GloVe, or NextBug. The vocabulary dictionary is a collection of all unique words in the training corpus, each of which has a unique index. Building the vocabulary dictionary involves the following steps. The first step is to count word frequencies, i.e., traversing the entire corpus to count the occurrence frequency of each word. Then, the words are sorted in descending order of frequency, i.e., placing the most common words at the front of the vocabulary dictionary. The last step is to assign a unique index to each word, thus generating a vocabulary dictionary by mapping words to IDs. The vocabulary dictionary is used to map words in the text to ID sequences, which are then mapped to a high-dimensional vector space through Word2Vec, GloVe, or NextBug. Since the text could contain words that are not covered by the above pre-trained models, the word embedding matrix is initialized randomly following a normal distribution before using the word-level embedding models. Subsequently, the text is traversed to generate the corresponding text vectors according to the vocabulary dictionary. Different from the word-level embedding models, the sequence-based models (e.g. ELMo and BERT) can accept the token list as their inputs for generating the corresponding text vectors directly, without an additional operation such as generating a vocabulary as in the case of word-level embedding models.

Note that for Word2Vec9 and GloVe10 we use 300-dimensional word embeddings. As for NextBug,11 it is a pre-trained binary file with 500-dimensional word embeddings. For BERT and ELMo, we used the BERT-BASE12 model with 2 layers, 768 hidden layers, 12 multi-heads, and 118M parameters. Thus, the dimension of word embeddings generated by BERT is 768. Due to the limit of the experimental budget, the maximum sequence length for BERT is 512. Note that we did not fine-tune BERT. There are two main reasons. One is that the limited experimental budget cannot provide sufficient resources for fine-tuning BERT. The other is to compare the studied models more fairly, as not all studied models support the fine-tuning mode. On the other hand, ELMo13 generates word embeddings with a dimension of 1024. To

different sizes for extracting feature maps from different regions. Based on the study (Zhang and Wallace, 2017), we selected convolutional kernels of size [3, 4, 5] for convolution operation. The work (Zhang and Wallace, 2017) investigated the impact of the number of convolutional kernels on multiple datasets such as the Sentence Polarity dataset (MR), Stanford Sentiment Treebank (SST-1), Subjectivity dataset (Subj), and Question Classification (TREC). Given the similarity between the bug assignment problem and question classification, we opted for the same number of convolutional kernels as TREC, i.e., 512.

After convolution, the feature maps have a shape of (N, L-[3, 4, 5]+1, 512). These feature maps are then subjected to max-pooling to reduce dimension while preserving important features. The pooled vectors are concatenated, flattened, and passed through fully connected layers to compute the probability scores for each bug report with respect to each developer, thereby completing the classification task.

3.3.2. LSTM (Long Short-Term Memory)

LSTM, as a variant of RNNs, can process sequential data. Compared with the traditional RNN model, LSTM can not only better capture long-term dependencies within a sequence through its gate mechanism, but also effectively address the issues of vanishing and exploding gradients when dealing with long sequences. Therefore, we also employed the LSTM model for predicting bug fixers. During the training of the LSTM model, the loss between the model’s predictions and the ground truth is computed. The loss is back-propagated to update the weights and biases in the network. This process is repeated until the model converges and achieves optimal performance. Then, the outputs of the last hidden layers are fed into a linear layer for computing the probability scores.

Increasing the size of hidden layers can enhance the network’s learning ability, enabling it to better adapt to complex patterns and data. However, more complex networks require more computational resources and longer training time. In the same way, an excessive size of hidden layers may lead to over-fitting and make training more difficult. Therefore, it is necessary to choose an appropriate size of hidden layers for achieving a balance between accuracy and efficiency. A prior study (Nowak et al., 2017) illustrates that LSTM can get better performance for the text classification task when the hidden layer size is equal to 128. Considering the bug assignment task is highly similar to the text classification task, we set the same hidden layer size, i.e., 128.

3.3.3. Bi-LSTM (Bi-directional Long Short-Term Memory)

Bi-LSTM is another variant of RNNs that incorporates memory cells and a gate mechanism. Similar to LSTM, Bi-LSTM excels in modeling sequential data. As shown in Fig. 1, the Bi-LSTM consists of two LSTM layers. One is responsible for processing the input sequence in the
preserve semantic information without loss or distortion as possible as forward direction, while the other is used in the backward direction.
we can, the vectors are not truncated or padded. The final output is generated by concatenating the outputs of the two
layers. Subsequently, the output of the hidden layers is passed through
a linear layer, which performs the computation of probability scores.
3.2.3. Bug fixer predicting
Compared with LSTM, Bi-LSTM is capable of capturing both the
Once the word embedding matrix is generated by a certain repre-
preceding and succeeding context information in the input sequence.
sentation learning model, it can be fed into the deep learning classifiers,
By utilizing the forward and backward LSTM layers, it can simultane-
including TextCNN, LSTM, Bi-LSTM, LSTM with attention, Bi-LSTM ously consider the information before and after the current time step,
with attention, MLP, and Naive Bayes for bug fixers prediction. providing a more comprehensively contextual understanding. The size
of hidden layers of Bi-LSTM is the same as that of LSTM, i.e., 128.
3.3. Classification model
3.3.4. MLP (Multi-layer Perceptron)
3.3.1. TextCNN (Text Convolutional Neural Network) MLP is a feed-forward neural network model, also known as a fully
TextCNN is a text classification model based on CNN. It consists connected neural network. It consists of multiple layers, i.e., an input
of convolutional layers, pooling layers, and fully connected layers. As layer, multiple hidden layers, and an output layer. The input layer is
shown in Fig. 1, the model takes the word embedding matrix as an input responsible for receiving the word embedding matrix. The MLP utilized
and performs convolution operations with convolutional kernels of in this paper comprises two hidden layers, each of which consists of
multiple neurons. These neurons are used for computing the weighted
sums of the outputs from the previous layer. The linear factors are
9
https://github.com/mmihaltz/word2vec-GoogleNews-vectors eliminated by the ReLU activation function. The final results are then
10
https://nlp.stanford.edu/projects/glove/ outputted through the output layer. During the model training process,
11
https://github.com/xiaotingdu/DeepSIM the parameters of the MLP are updated using back-propagation to
12
https://github.com/google-research/BERT/ minimize the loss. Similar to LSTM, the hidden layer size of MLP is
13
https://s3-us-west-2.amazonaws.com/allennlp/models/elmo also 128.
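The MLP forward pass described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mean pooling of the embedding matrix into one vector, the random weights, and the toy embedding size and developer count are all assumptions made for illustration; only the two hidden layers of size 128 and the ReLU activation come from the text.

```python
import math
import random

def relu(values):
    return [max(0.0, v) for v in values]

def softmax(values):
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weights, bias):
    # Each neuron computes a weighted sum of the previous layer's outputs.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(n_in, n_out, rng):
    # Hypothetical initialisation; a trained model would supply these weights.
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def mean_pool(embedding_matrix):
    # Assumption: collapse the (tokens x dim) matrix into one vector.
    n_tokens = len(embedding_matrix)
    return [sum(column) / n_tokens for column in zip(*embedding_matrix)]

def mlp_forward(embedding_matrix, layers):
    x = mean_pool(embedding_matrix)
    for weights, bias in layers[:-1]:      # hidden layers with ReLU
        x = relu(linear(x, weights, bias))
    weights, bias = layers[-1]             # output layer
    return softmax(linear(x, weights, bias))

rng = random.Random(0)
EMBED_DIM, HIDDEN, N_DEVELOPERS = 8, 128, 5   # toy sizes except HIDDEN
layers = [init_layer(EMBED_DIM, HIDDEN, rng),
          init_layer(HIDDEN, HIDDEN, rng),
          init_layer(HIDDEN, N_DEVELOPERS, rng)]
report = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(6)]
scores = mlp_forward(report, layers)
ranked = sorted(range(N_DEVELOPERS), key=lambda d: scores[d], reverse=True)
```

Here `scores` holds one probability per candidate developer, and `ranked` orders developer indices from most to least likely, which is the form of recommendation list that the evaluation metrics operate on.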


Table 1
The key parameters for training deep learning models.

Parameter                Value
Epochs                   12
Batch size               32
Learning rate            0.001
Filter size              3, 4, 5
Number of filters        512
Size of hidden layers    128

3.3.5. Naive Bayes

Naive Bayes is a statistical classification algorithm that has been widely used in fields such as text classification (Dai et al., 2007; Frank and Bouckaert, 2006), spam filtering (Metsis et al., 2006; Rusland et al., 2017), and sentiment analysis (Tan et al., 2009; Wongkar and Angdresey, 2019). The algorithm is built upon the principles of Bayesian probability theory. When dealing with classification problems, Naive Bayes computes the conditional probabilities of the different labels for a given sample and selects the label with the highest probability as the classification result. In this paper, after obtaining the word embedding matrix, the model is fitted using the feature vectors from the training data.
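The decision rule above (compute the conditional probability of each label, pick the largest) can be sketched as follows. The paper does not state which Naive Bayes variant was used; this sketch assumes a Gaussian likelihood because embedding features are continuous, and all function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(features, labels, var_floor=1e-6):
    """Estimate a log-prior plus per-feature mean and variance for each label."""
    by_label = defaultdict(list)
    for x, y in zip(features, labels):
        by_label[y].append(x)
    model = {}
    for label, rows in by_label.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor avoids division by zero for constant features
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(labels)), means, variances)
    return model

def log_scores(model, x):
    """Unnormalised log-posterior of every label for feature vector x."""
    scores = {}
    for label, (log_prior, means, variances) in model.items():
        log_lik = sum(-0.5 * math.log(2 * math.pi * var)
                      - (xi - m) ** 2 / (2 * var)
                      for xi, m, var in zip(x, means, variances))
        scores[label] = log_prior + log_lik
    return scores

def predict(model, x):
    scores = log_scores(model, x)
    return max(scores, key=scores.get)   # label with the highest probability

# Toy two-dimensional "embeddings" for two hypothetical developers.
train_x = [[0.9, 0.1], [1.1, -0.1], [-1.0, 0.2], [-0.8, 0.0]]
train_y = ["alice", "alice", "bob", "bob"]
nb_model = fit_gaussian_nb(train_x, train_y)
```

Sorting the labels by their score in `log_scores` would yield the ranked list of fixers that top-k accuracy and MRR require.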

3.3.6. Attention mechanism

The attention mechanism allocates different attention weights to different parts of the input sequence, thereby capturing key, contextually relevant information more effectively. It plays an important role in processing sequential data.

For both LSTM and Bi-LSTM using the attention mechanism, two linear layers are applied to calculate attention scores based on the outputs of the LSTM or Bi-LSTM. The linear layers learn the importance of different positions based on the input features. The attention scores are then normalized using the softmax function to obtain attention weights. Next, the attention weights are multiplied element-wise with the outputs of the LSTM or Bi-LSTM and summed to obtain the final text vectors. This weighted sum allows the model to focus on multiple important parts of the sequence, regardless of the sequence length. By introducing the attention mechanism, the model can automatically learn to concentrate on the parts of the input sequence that are important for a given task, thereby improving its performance.

3.4. Evaluation metric

The model's output is the likelihood of each bug report being associated with every potential fixer, and a ranked list of recommended bug fixers is generated by sorting this likelihood in descending order. The top-k accuracy metric measures how often the list of fixers recommended by a bug assignment method includes the actual bug fixer. In contrast to traditional metrics such as accuracy, precision, recall, and F1 score, metrics such as top-k accuracy and top-k precision focus on the performance of the top-k bug fixers rather than the overall performance. Assessing the overall performance is less meaningful for bug assignment because the problem primarily concerns the recommendation of the top-k fixers, especially the top-1 fixer. Furthermore, even if the overall performance is high, the model is considered unsuccessful if the quality of the top-k recommended bug fixers is low. Compared with top-k accuracy, top-k precision measures how many fixers in the top-k recommended list are truly relevant. In a top-k recommendation list, there can only be 0 or 1 truly relevant fixer, so calculating top-k precision is not meaningful. For the same reason, top-k recall and F1 score are not meaningful either. Finally, following previous studies (Aung et al., 2022; Yang et al., 2014; Jeong et al., 2009; Xia et al., 2015), we employed top-k accuracy to evaluate the impacts of various deep learning models on bug assignment tasks. It is defined as follows:

$$\text{Top-}k = \frac{\sum_{i=1}^{N} \mathit{predict}_i(\mathit{truth}, \text{top-}k\ \mathit{list})}{N}, \tag{1}$$

where $N$, $\mathit{truth}$, and $\text{top-}k\ \mathit{list}$ represent the number of bug reports, the real bug fixer, and the recommendation list containing $k$ bug fixers generated by a bug assignment method, respectively. $\mathit{predict}_i(\mathit{truth}, \text{top-}k\ \mathit{list})$ denotes the recommended result of the $i$th bug report. It is equal to 1 if $\mathit{truth}$ belongs to $\text{top-}k\ \mathit{list}$, and 0 otherwise.

Top-k accuracy does not consider the specific positions of the recommended bug fixers. To address this limitation, we also employed mean reciprocal rank (MRR) (Voorhees et al., 1999). MRR is the mean of the reciprocals of the ranks of the true bug fixer in the recommendation lists. It is defined as follows:

$$\mathit{MRR} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathit{rank}_i}, \tag{2}$$

where $N$ and $\mathit{rank}_i$ represent the number of bug reports and the rank of the true fixer in the $i$th bug report's recommendation list, respectively.

3.5. Model training and testing

The parameters used for training the deep learning models are shown in Table 1.

The training and testing process is shown in Fig. 2. Ten-fold cross-validation, a commonly used cross-validation method, was employed to evaluate the performance of a model. The original dataset is divided at random into 10 equal-sized subsets, of which 9 are used as training data and the remaining one is used as testing data. In each iteration, the dataset is fed into one of the studied representation learning models to generate embedding vectors for both the training and testing datasets. Subsequently, one of the studied deep learning classifiers is trained on the training dataset, and training is stopped once the model converges. The trained model is then tested on the test dataset, and the top-k accuracy and MRR values are calculated. This process is repeated 10 times so that each subset is used once as the testing set. The mean values of top-k accuracy and MRR are computed to evaluate the performance of the models. By repeatedly training and evaluating a model on different subsets of the dataset, we can understand its performance more comprehensively. This approach mitigates the potential bias or variability caused by a single data partition. By averaging the results from multiple folds, we obtain a more robust and reliable assessment of the performance of the models.

Fig. 2. Model training and testing process.

3.6. Statistical analysis

The Wilcoxon signed-rank test (Wilcoxon, 1946), a non-parametric hypothesis test, is used for qualitatively analyzing whether there is a significant difference between two bug assignment methods at a confidence level of 95%. Compared to parametric tests such as the t-test, the Wilcoxon signed-rank test does not assume that the test data follow a certain distribution (e.g., a normal distribution). This method is also robust to outliers, as it operates on rankings rather than values. A null hypothesis is formulated, i.e., there is no significant
difference between the top-k accuracy values generated by the two models. When the p-value is less than 0.05, we can reject the null hypothesis, indicating a significant difference between the two models.

Additionally, to quantify the magnitude of the difference between two methods, we employ a non-parametric effect size measure, i.e., Cliff's Delta (Cliff, 1993). It is calculated by the following formula:

$$\delta = \frac{U - D}{n_1 n_2}, \tag{3}$$

where $U$ is the number of times that a value in the first group is greater than a value in the second group, $D$ is the number of times that a value in the second group is greater than a value in the first group, and $n_1$ and $n_2$ denote the numbers of samples in the two groups, respectively. The value of $\delta$ ranges from −1.0 to +1.0. A value of 0.0 indicates that neither group tends to have larger values than the other ($U = D$), whereas ±1.0 indicates that all values in one group are larger than those in the other group. The absolute value of $\delta$ indicates the magnitude of the effect size (Romano et al., 2006). The relation between $\delta$ and effect size is shown in Table 2.

Table 2
The relation between δ and effect size.

Value                    Effect size
|δ| < 0.147              negligible
0.147 ≤ |δ| < 0.33       small
0.33 ≤ |δ| < 0.474       medium
|δ| ≥ 0.474              large

4. Experiment result and analysis

In this section, we report the top-k (k = 1, 5, 10) accuracy and MRR of each model. Based on the experimental results, we answer the proposed four research questions.

4.1. RQ1: Are there differences in the performance of various deep learning models for the bug assignment problem?

Motivation: Word2Vec and GloVe operate at the word level and generate word embeddings based on tokens. Unlike Word2Vec, GloVe incorporates a global co-occurrence matrix and can therefore better capture global information. In contrast with these two models, ELMo and BERT are both sequence-based models, which enables them to capture sentence-level semantic information. Compared to the word embedding models, sequence-based models can capture more contextual information. Similarly, the classification models, including TextCNN, LSTM, Bi-LSTM, LSTM with attention, Bi-LSTM with attention, MLP, and Naïve Bayes, differ from each other. This means that their performance on bug assignment could vary, which motivates this research question on assessing the difference in their performance.

Method: The pre-processed text is used to generate word embedding matrices with the representation learning models. The word embedding matrices serve as inputs to the different classifiers, which generate a recommendation list of bug fixers as predictions. The performance of the different models is evaluated using top-k accuracy and MRR. The results of the 28 models, built from four representation learning models and seven classifiers, are shown in Fig. 3, Fig. 4, Fig. 5, and Fig. 6, where the $x$-axis of each subgraph represents the models and the $y$-axis denotes the top-k accuracy or MRR. To illustrate the results clearly, the violin charts with the same classification model are placed in the same group, and groups with different classification models are separated by a vertical line. Each violin chart represents the distribution of the top-k (k = 1, 5, 10) accuracy or MRR. Wider sections of a violin chart indicate a greater concentration of data in that area. The upper and lower bounds of a violin chart represent the maximum and minimum values of top-k accuracy (k = 1, 5, 10) or MRR. The central short line of a violin chart represents the mean of top-k accuracy (k = 1, 5, 10) or MRR.

Fig. 3. The Top-k (k = 1, 5, 10) accuracy and MRR over Eclipse JDT.

Findings: Bi-LSTM-A+ELMo and Bi-LSTM+ELMo are significantly superior to other deep learning models on bug assignment tasks in terms of top-k accuracy and MRR. Moreover, the two models have large or medium effect sizes over other models in most cases. In particular, compared with the methods using Word2Vec and TextCNN (Lee et al., 2017; Zaidi et al., 2020), the methods using the above combinations achieve over 30% improvement in top-k (k = 1, 5, 10) accuracy across all datasets.

From Fig. 3, Fig. 4, Fig. 5, and Fig. 6, it can be seen that the top-k accuracy and MRR vary among the models. The top-k and MRR values differ when the same representation learning model is combined with different classification models. Likewise, they differ when the same classification model is combined with different representation learning models. In other words, both the representation learning model and the classification model influence the accuracy of bug assignment. Bi-LSTM-A using ELMo and Bi-LSTM using ELMo always have higher mean values than other models across all datasets with respect to top-k accuracy and MRR. Compared with the top-k and MRR values on the three single-project datasets, the values on the cross-project dataset are lower. A likely explanation is that the data distributions of the three datasets are different. The project with the most samples plays a critical role in the prediction on the cross-project dataset, so the results on the cross-project dataset tend towards those on the project with the most samples. Furthermore, we conducted the Wilcoxon signed-rank test and the Cliff's Delta test. The results are shown in Fig. 7, Fig. 8, Fig. 9, and Fig. 10, where each cell in the left and right subplots shows the $p$-value and the $\delta$ value, respectively.

To enhance the readability of the result graphs, different values are distinguished by different colors. In the left subplots, each cell contains a p-value. A gray cell indicates that the p-value is less than 0.05. In this case, we can reject the null hypothesis, i.e., there is a significant difference between the two models. Conversely, a white cell means that the p-value is greater than or equal to 0.05, i.e., we cannot reject the null hypothesis. In the right subplots, color represents the magnitude of the effect size, i.e., green for negligible, yellow for small, red for medium, and blue for large. The flags from 0 to 27 denote the 28 models. The mapping is shown in Table 3.

Table 3
Corresponding relation of the studied models and the flags in Fig. 7, Fig. 8, Fig. 9, and Fig. 10.

No.   Model               No.   Model
0     Bi-LSTM-A+ELMo      14    Bi-LSTM+W2V
1     Bi-LSTM+ELMo        15    LSTM-A+ELMo
2     Bi-LSTM-A+BERT      16    MLP+W2V
3     Bi-LSTM+BERT        17    TextCNN+GloVe
4     Bi-LSTM-A+GloVe     18    LSTM-A+GloVe
5     Bi-LSTM+GloVe       19    LSTM+ELMo
6     MLP+GloVe           20    LSTM+BERT
7     LSTM-A+BERT         21    LSTM-A+W2V
8     LSTM+W2V            22    TextCNN+BERT
9     LSTM+GloVe          23    MLP+ELMo
10    MLP+BERT            24    NB+BERT
11    Bi-LSTM-A+W2V       25    NB+ELMo
12    TextCNN+W2V         26    NB+GloVe
13    TextCNN+ELMo        27    NB+W2V

From Fig. 7, Fig. 8, Fig. 9, and Fig. 10, it can be seen that Bi-LSTM-A using ELMo and Bi-LSTM using ELMo are significantly superior to other models in most cases across all datasets in terms of top-1 accuracy and MRR. Moreover, they have large or medium effect sizes over all other models. The two models are also significantly better than other models on GCC, Firefox, and the cross-project dataset with respect to top-5 and top-10 accuracy. Likewise, they have large or medium effect sizes over other models. This shows that both
Bi-LSTM-A+ELMo and Bi-LSTM+ELMo are more suitable for solving the bug assignment task than the other investigated models. In particular, compared with existing text classification-based bug assignment methods (Lee et al., 2017; Zaidi et al., 2020) using Word2Vec and TextCNN, Bi-LSTM-A+ELMo and Bi-LSTM+ELMo achieve over 30% improvement in top-k (k = 1, 5, 10) accuracy across all subjects.

Fig. 4. The Top-k (k = 1, 5, 10) accuracy and MRR over GCC.

Fig. 5. The Top-k (k = 1, 5, 10) accuracy and MRR over Firefox.

Fig. 6. The Top-k (k = 1, 5, 10) accuracy and MRR over a cross-project dataset.

Fig. 7. The statistical and effect size results between all investigated models on Eclipse JDT.

Fig. 8. The statistical and effect size results between all investigated models on GCC.

Fig. 9. The statistical and effect size results between all investigated models on Firefox.

Fig. 10. The statistical and effect size results between all investigated models on a cross-project dataset.

Although Bi-LSTM-A+ELMo uses the attention mechanism, there is no significant difference between it and Bi-LSTM+ELMo. The likely reason is that, limited by the quality of bug reports, the attention mechanism may attend to features that are not closely relevant to the prediction of bug fixers. Compared with other classification models, Bi-LSTM and Bi-LSTM with attention can capture more important features including
the preceding and succeeding context information from bug reports. This could be why they achieve better performance.

4.2. RQ2: Is the description useful for bug assignment? What is the optimal weight between the summary and description?

Motivation: Zhou et al. (2016) claimed that the bug report's description is of little value due to its lengthy and scattered nature. Therefore, they only utilized the summary because it is concise and exhibits strong semantic connections between words (Ko et al., 2006; Lamkanfi et al., 2010). In comparison with the summary, however, the description contains additional valuable information related to the bug. Several existing studies (Matter et al., 2009; Zhou et al., 2016) used only the description or only the summary. Although most studies (Murphy and Cubranic, 2004; Bhattacharya et al., 2012; Jeong et al., 2009) considered both the summary and the description, they treated the two equally. The summary and description play different roles in bug assignment, so it could be inappropriate to treat them equally. This prompted us to explore RQ2.

Method: We designed nine strategies that represent the summaries and descriptions with different weights. SUM and DESC represent the use of only the summary and only the description, respectively. SUM+DESC denotes the use of both the summary and the description. SUM+DESC*$n$ represents the use of the summary and $n$ replications of the description. Conversely, SUM*$n$+DESC expresses the use of the description and $n$ replications of the summary. Using the nine strategies, we conducted an empirical investigation of their influence on bug assignment. Given that Bi-LSTM-A+ELMo and Bi-LSTM+ELMo outperform other models in terms of top-k accuracy, we focused only on the effects of the nine strategies on these two models. We excluded Eclipse JDT and Firefox, as their bug reports contain only descriptions without summaries. The results are shown in Fig. 11, where the $x$-axis represents the strategies and the $y$-axis represents the top-k accuracy or MRR values.

Fig. 11. The effect of the training text composed of different proportions of the summary and description on Bi-LSTM-A+ELMo and Bi-LSTM+ELMo.

Findings: Both the summary and description of the bug report are useful for bug assignment, but the description is more valuable than the summary. The optimal weight between them is 1:2.

From Fig. 11, it can be seen that the SUM+DESC strategy has higher top-k (k = 1, 5, 10) accuracy and MRR values than SUM and DESC. This means that both the summary and description of the bug report are useful for bug assignment. The SUM+DESC strategy provides more comprehensive and richer information to the model, thereby enhancing the top-k accuracy and MRR values. Furthermore, the SUM+DESC*2 strategy always has higher top-k (k = 1, 5, 10) accuracy and MRR values than the other eight strategies. In other words, the optimal proportion between the summary and the description is 1:2. This means that the summary and description should not be given the same weight.

Additionally, we found an interesting phenomenon: as the number of description copies increases beyond two, the top-k accuracy and MRR values do not rise, but fall. There are two likely reasons. One is that as the number of description copies increases, the features extracted from the summary become less effective. The other is that additional description copies generate more redundant training data, leading to the accumulation of noisy data. This could make the model more susceptible to noise, resulting in a decrease in top-k accuracy and MRR. Moreover, additional description copies can improve the model's top-k accuracy on the training set while lowering its performance on the test set. Therefore, increasing the number of description copies may cause the model to over-fit, ultimately reducing its generalization ability.

4.3. RQ3: To what extent does the training corpus influence the accuracy of the representation learning model for bug assignment tasks?

Motivation: Various pre-trained models trained on specific domains, such as SciBERT (Beltagy et al., 2019) and BioBERT (Lee et al., 2020), have emerged since the introduction of generative pre-training (GPT). SciBERT was trained on over one million documents, 82% of which originated from the broad biomedical domain and the rest from the computer science domain. Similarly, BioBERT is a pre-trained model specifically trained on biomedical domain data. A prior study (Von der Mosel et al., 2022) has demonstrated that a domain-specific pre-trained model can lead to performance improvement compared to a model trained on a general domain corpus. However, the impact of the difference in training corpus between the two domains on the improvements is relatively small. This raises the question of whether such a conclusion still holds in the context of bug assignment. Furthermore, if the conclusion is still valid, to what extent does the training corpus influence the accuracy of representation learning models? This drives us to set up RQ3.

Method: To answer RQ3, we selected a pre-trained model, i.e., NextBug (Du et al., 2022), which was trained on a dataset containing four hundred thousand bug reports. It is the first model trained on a bug domain corpus. The bug assignment task aims at extracting information from bug reports and predicting bug fixers. In this sense, the NextBug model is suitable for bug assignment tasks. Since Bi-LSTM+ELMo and Bi-LSTM-A+ELMo are better than other models, we only compared the performance of ELMo and NextBug using Bi-LSTM and Bi-LSTM-A on the four datasets. The experimental results are shown in Fig. 12, where the $x$-axis and $y$-axis of each violin chart represent a deep learning model and the top-k accuracy or MRR, respectively.

Findings: The training corpus of the representation learning model has a significant impact on bug assignment tasks. The model using the specialized-domain corpus has large or medium effect sizes over the model using the general-domain corpus in bug assignment with respect to top-1 accuracy.

From Fig. 12, it can be seen that the classification models using NextBug have higher median top-k and MRR values than the same classification models using ELMo on the four datasets. The top-k and MRR values on the cross-project dataset are closer to those on Firefox because that project contains many more samples than the other two projects. To qualitatively analyze the difference between the two models, we conducted a Wilcoxon signed-rank test. The Cliff's Delta test was also performed to quantify the magnitude of the difference. The results are shown in Fig. 13, where the corresponding relation between the flags from 0 to 3 and the models is shown in Table 4.
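The effect-size analysis referred to above follows Eq. (3) and the thresholds in Table 2, and can be sketched in a few lines. The function names and the per-fold scores below are hypothetical and serve only as an illustration; they are not results from the study.

```python
def cliffs_delta(first, second):
    """Cliff's Delta: (U - D) / (n1 * n2), as in Eq. (3)."""
    u = sum(1 for a in first for b in second if a > b)  # first group larger
    d = sum(1 for a in first for b in second if a < b)  # second group larger
    return (u - d) / (len(first) * len(second))

def effect_size_label(delta):
    """Map |delta| to the categories of Table 2."""
    magnitude = abs(delta)
    if magnitude < 0.147:
        return "negligible"
    if magnitude < 0.33:
        return "small"
    if magnitude < 0.474:
        return "medium"
    return "large"

# Hypothetical top-1 accuracy values of two models over five folds.
fold_scores_a = [0.62, 0.64, 0.61, 0.66, 0.63]
fold_scores_b = [0.58, 0.60, 0.59, 0.62, 0.57]

delta = cliffs_delta(fold_scores_a, fold_scores_b)
label = effect_size_label(delta)
```

With these toy values, 23 of the 25 pairs favour the first model, one favours the second, and one is tied, so `delta` is (23 − 1) / 25 = 0.88, which falls in the "large" category. Ties count towards neither $U$ nor $D$.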


Fig. 12. The top-k (k = 1, 5, 10) accuracy and MRR of ELMo and NextBug with Bi-LSTM and Bi-LSTM with attention over Eclipse JDT, GCC, Firefox, and cross-project datasets.

Table 4
Corresponding relation of the studied models and the flags in Fig. 13.

No.   Model
0     Bi-LSTM-A+NextBug
1     Bi-LSTM+NextBug
2     Bi-LSTM-A+ELMo
3     Bi-LSTM+ELMo

From Fig. 13, it can be seen that the classification models using NextBug are significantly better than the same classification models using ELMo with respect to top-1 accuracy in most cases. Moreover, the former have large or medium effect sizes over the latter. With the exception of top-1 accuracy, there is no significant difference between NextBug and ELMo in most cases, and the former only has a negligible or small effect size over the latter. In comparison with the other metrics, the top-1 metric is more useful. In this sense, compared with a general domain training corpus, a domain-specific training corpus related to the task at hand can significantly improve the accuracy of bug assignment. Although Bi-LSTM-A+NextBug uses the attention mechanism, there is no significant difference between Bi-LSTM-A+NextBug and Bi-LSTM+NextBug in terms of top-k (k = 1, 5, 10) accuracy and MRR. Moreover, the former only has a negligible or small effect size over the latter.
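For reference, the two metrics compared throughout this section, top-k accuracy (Eq. (1)) and MRR (Eq. (2)), can be computed as in the sketch below. The function names and the toy recommendation lists are hypothetical. Treating a true fixer that is missing from the list as contributing 0 to MRR is an assumption; the paper does not discuss that case.

```python
def top_k_accuracy(ranked_lists, true_fixers, k):
    """Fraction of bug reports whose true fixer is among the top k (Eq. (1))."""
    hits = sum(1 for ranked, truth in zip(ranked_lists, true_fixers)
               if truth in ranked[:k])
    return hits / len(true_fixers)

def mean_reciprocal_rank(ranked_lists, true_fixers):
    """Mean of 1 / rank of the true fixer over all bug reports (Eq. (2))."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, true_fixers):
        if truth in ranked:  # assumption: an absent fixer contributes 0
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(true_fixers)

# Hypothetical ranked recommendations for four bug reports.
ranked_lists = [
    ["ann", "bob", "cho"],
    ["bob", "cho", "ann"],
    ["cho", "ann", "bob"],
    ["ann", "cho", "bob"],
]
true_fixers = ["ann", "cho", "bob", "cho"]

top1 = top_k_accuracy(ranked_lists, true_fixers, 1)
top2 = top_k_accuracy(ranked_lists, true_fixers, 2)
mrr = mean_reciprocal_rank(ranked_lists, true_fixers)
```

In this toy example the true fixers sit at ranks 1, 2, 3, and 2, so top-1 accuracy is 0.25, top-2 accuracy is 0.75, and MRR is (1 + 1/2 + 1/3 + 1/2) / 4 ≈ 0.583. The example also shows why MRR complements top-k accuracy: two models with the same top-3 accuracy can differ in MRR depending on where the true fixer sits in the list.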


Fig. 13. The statistical and effect size results between NextBug and ELMo on Eclipse JDT, GCC, Firefox, and cross-project datasets.


5. Threats to validity

Despite careful experiment design, it is important to acknowledge the potential threats to the validity of this study. These threats can be summarized as follows.

5.1. Internal validity

The random split of the datasets can cause fluctuations in the performance of a model, leading to different results. This is because when the training set contains a class with few samples, the model cannot learn the features corresponding to that class well, resulting in inaccurate predictions. To alleviate the impact of random dataset partitioning, we employed ten-fold cross-validation, which involves training and validating the model on all data subsets in a rotating way.

Another threat to internal validity arises from the influence of parameters in the studied models. The performance of different models can vary depending on the chosen parameters. To ensure a fair and objective comparison, we conducted parameter experiments to obtain the optimal parameters, and we also referred to related works when selecting appropriate parameters.

5.2. External validity

Deep learning-based bug assignment methods depend on bug reports. Limited by the expertise of bug submitters, the quality of bug reports can vary significantly. Low-quality bug reports can degrade the performance of a model. To mitigate the impact of bug report quality, we conducted the empirical study using three widely used datasets in the field of bug assignment, which are extracted from three popular and mature open-source software projects, and a cross-project dataset. We will conduct an empirical study based on more open-source and closed-source projects to address this threat in future work.

Additionally, the data collected from the three projects may contain noise. Noisy data can affect the performance of the bug assignment method. Consequently, we preprocessed the data by removing missing data, URLs, non-English bug reports, and stop words to mitigate the effect of noisy data.

5.3. Construct validity

The construct validity may be affected by the use of the metric, i.e., top-k accuracy. Following previous studies, we also employed this metric to evaluate the impacts of different representation learning models and [...]

[...] Bi-LSTM, Bi-LSTM with attention mechanism, MLP, and NB on bug assignment. Three commonly used datasets, i.e., Eclipse JDT, GCC, and Firefox, together with a cross-project dataset, were used for experimental comparisons.

According to the experimental results, the following three findings can be drawn: (1) Bi-LSTM-A+ELMo and Bi-LSTM+ELMo are significantly superior to other deep learning models on bug assignment tasks in terms of top-k (k = 1, 5, 10) accuracy and MRR; (2) Both the summary and description of bug reports are useful for bug assignment, but the description should be given higher weight than the summary; (3) The choice of training corpus for representation learning models has a significant impact on the bug assignment task. The model that utilizes a bug-specific training corpus has large or medium effect sizes over the model using a general domain corpus with respect to top-1 accuracy. This study not only aids in understanding the critical components of bug assignment based on deep learning, but also provides practical guidance for picking the optimal deep learning technique for bug assignment tasks. Additionally, it helps identify the boundaries of the problem-solving abilities of different deep learning models.

All experimental subjects come from the same bug tracking system, i.e., Bugzilla. However, there are several other popular bug tracking systems (e.g., JIRA and BugNet), and the structure of bug reports may differ across bug tracking systems. In future work, we will conduct a large-scale empirical study on more projects from other bug tracking systems. The fixed bug reports can reflect the skills or expertise of the developers. We plan to measure the similarities of developers based on the similarities of their historical fixed bug reports. Contrastive learning will be employed to improve the bug assignment accuracy based on the measured similarity information. The Transformer model has achieved promising results in natural language processing tasks. Moreover, most studies only utilize bug reports. Unlike bug reports, the fixed code can directly reflect the expertise of bug fixers. Therefore, we will extract more features from source code and bug reports via the Transformer model to further improve the performance of bug assignment.

CRediT authorship contribution statement

Rongcun Wang: Conceptualization, Methodology, Experimental design, Writing – original draft, Supervision. Xingyu Ji: Writing – original draft, Experimental implementation, Data analysis, Data validation, Data visualization. Senlei Xu: Experimental implementation, Data analysis, Data validation. Yuan Tian: Investigation, Writing – review & editing. Shujuan Jiang: Supervision, Writing – review & editing. Rubing Huang: Writing – review & editing.

Declaration of competing interest
deep learning classifiers on bug assignment. It may not to be enough The authors declare that they have no known competing finan-
provide a comprehensive evaluation of the performance of the studied cial interests or personal relationships that could have appeared to
models. We will seek other appropriate metrics for the comprehensive influence the work reported in this paper.
evaluation of all studied models to alleviate this threat. Since top-k ac-
curacy only focuses on the top-k recommended bug fixers and does not Data availability
report the specific position of the true assignee in the recommendation
list. The MRR was also employed to address this limitation. https://github.com/AI4BA/dl4ba.

6. Conclusions Acknowledgments

Deep learning-based bug assignment methods have achieved The authors would like to thank the anonymous reviewers for their
promising performance. Several deep learning models have been in- valuable comments and helpful suggestions. This work is partially
creasingly proposed. No previous studies empirically evaluate the ef- supported by the National Natural Science Foundation of China under
fects of different deep learning models on bug assignment. In this grant NO. 61673384, No. 61872167 and No. 61502205, partially sup-
context, we conducted an empirical study of evaluating the impacts of ported by the Science and Technology Development Fund of Macau,
35 deep learning models based on five representation models, namely Macau SAR under grant 0046/2021/A and 0021/2023/R1A1, and
Word2Vec, GloVe, NextBug, BERT, and ELMo, and seven classifica- partially supported by a Faculty Research Grant of Macau University
tion models, i.e., LSTM, TextCNN, LSTM with attention mechanism, of Science and Technology under grant FRG-22-103-FIE.
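
As a supplementary illustration of the two metrics contrasted in Section 5.3, the following minimal Python sketch computes top-k accuracy and MRR over ranked developer recommendations. It is not taken from the authors' implementation: the function names, the developer names, and the toy recommendation lists are all hypothetical, and a true assignee missing from a list is assumed to contribute a reciprocal rank of zero.

```python
from typing import Sequence


def top_k_accuracy(ranked: Sequence[Sequence[str]], truth: Sequence[str], k: int) -> float:
    """Fraction of bug reports whose true assignee is among the first k recommendations."""
    hits = sum(1 for recs, dev in zip(ranked, truth) if dev in recs[:k])
    return hits / len(truth)


def mean_reciprocal_rank(ranked: Sequence[Sequence[str]], truth: Sequence[str]) -> float:
    """Average of 1/rank of the true assignee; assumed 0 when the assignee is not listed."""
    total = 0.0
    for recs, dev in zip(ranked, truth):
        if dev in recs:
            total += 1.0 / (list(recs).index(dev) + 1)  # ranks are 1-based
    return total / len(truth)


# Hypothetical ranked recommendations for three bug reports (best candidate first).
ranked = [
    ["alice", "bob", "carol"],
    ["bob", "carol", "alice"],
    ["carol", "alice", "bob"],
]
# Hypothetical true assignees; "dave" never appears in the third list.
truth = ["alice", "carol", "dave"]

top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
mrr = mean_reciprocal_rank(ranked, truth)
print(f"top-1={top1:.3f} top-2={top2:.3f} MRR={mrr:.3f}")
```

In this toy data the second report counts as a hit for top-2 accuracy whether the true assignee is ranked first or second, whereas its reciprocal rank of 1/2 records the exact position. This is the positional information that top-k accuracy alone does not report.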


References Mani, S., Sankaran, A., Aralikatte, R., 2019. Deeptriage: Exploring the effectiveness of
deep learning for bug triaging. In: Proceedings of the ACM India Joint International
Ahsan, S.N., Ferzund, J., Wotawa, F., 2009. Automatic software bug triage system (BTS) Conference on Data Science and Management of Data. pp. 171–179.
based on latent semantic indexing and support vector machine. In: Proceedings of Matter, D., Kuhn, A., Nierstrasz, O., 2009. Assigning bug reports using a vocabulary-
the 4th International Conference on Software Engineering Advances. pp. 216–221. based expertise model of developers. In: Proceedings of the 6th International
Alazzam, I., Aleroud, A., Al Latifah, Z., Karabatis, G., 2020. Automatic bug triage Working Conference on Mining Software Repositories. pp. 131–140.
in software systems using graph neighborhood relations for feature augmentation. Metsis, V., Androutsopoulos, I., Paliouras, G., 2006. Spam filtering with naive
IEEE Trans. Comput. Soc. Syst. 7 (5), 1288–1303. Bayes-which Naive Bayes? In: CEAS, Vol. 17. Mountain View, CA, pp. 28–69.
Anvik, J., Hiew, L., Murphy, G.C., 2006. Who should fix this bug? In: Proceedings of Mikolov, T., Chen, K., Corrado, G., Dean, J., 2013. Efficient estimation of word
the 28th International Conference on Software Engineering. ICSE ’06, pp. 361–370. representations in vector space. arXiv preprint arXiv:1301.3781.
Aung, T.W.W., Wan, Y., Huo, H., Sui, Y., 2022. Multi-triage: A multi-task learning Murphy, G., Cubranic, D., 2004. Automatic bug triage using text categorization.
framework for bug triage. J. Syst. Softw. 184, 111133. In: Proceedings of the 6th International Conference on Software Engineering &
Beltagy, I., Lo, K., Cohan, A., 2019. SciBERT: A pretrained language model for Knowledge Engineering. Citeseer, pp. 1–6.
scientific text. In: Proceedings of the 2019 Conference on Empirical Methods in Naguib, H., Narayan, N., Brügge, B., Helal, D., 2013. Bug report assignee recommen-
Natural Language Processing and the 9th International Joint Conference on Natural dation using activity profiles. In: Proceedings of the 10th Working Conference on
Language Processing. EMNLP-IJCNLP, pp. 3615–3620. Mining Software Repositories. MSR, pp. 22–30.
Bhattacharya, P., Neamtiu, I., Shelton, C.R., 2012. Automated, highly-accurate, bug
Nowak, J., Taspinar, A., Scherer, R., 2017. LSTM recurrent neural networks for short
assignment using machine learning and tossing graphs. J. Syst. Softw. 85 (10),
text and sentiment classification. In: Artificial Intelligence and Soft Computing.
2275–2292.
Springer International Publishing, Cham, pp. 553–562.
Bird, S., Klein, E., Loper, E., 2009. Natural Language Processing with Python: Analyzing
Romano, J., Kromrey, J.D., Coraggio, J., Skowronek, J., 2006. Appropriate statistics
Text with the Natural Language Toolkit. O’Reilly Media, Inc..
Chakraborty, S., Krishna, R., Ding, Y., Ray, B., 2022. Deep learning based vulnerability for ordinal level data: Should we really be using t-test and Cohen’sd for evaluating
detection: Are we there yet? IEEE Trans. Softw. Eng. 48 (9), 3280–3296. group differences on the NSSE and other surveys. In: The Annual Meeting of the
Choquette-Choo, C.A., Sheldon, D., Proppe, J., Alphonso-Gibbs, J., Gupta, H., 2019. Florida Association of Institutional Research. pp. 1–31.
A multi-label, dual-output deep neural network for automated bug triaging. In: Rusland, N.F., Wahid, N., Kasim, S., Hafit, H., 2017. Analysis of Naïve Bayes algorithm
Proceedings of the 18th IEEE International Conference on Machine Learning and for email spam filtering across multiple datasets. In: IOP Conference Series:
Applications. ICMLA, pp. 937–944. Materials Science and Engineering, vol. 226, (no. 1), IOP Publishing, 012091.
Cliff, N., 1993. Dominance statistics: Ordinal analyses to answer ordinal questions. Sajedi-Badashian, A., Stroulia, E., 2020. Vocabulary and time based bug-assignment:
Psychol. Bull. 114 (3), 494. A recommender system for open-source projects. Softw. - Pract. Exp. 50 (8),
Dai, W., Xue, G.-R., Yang, Q., Yu, Y., 2007. Transferring Naive Bayes classifiers for text 1539–1564.
classification. In: AAAI, Vol. 7. pp. 540–545. Sarkar, A., Rigby, P.C., Bartalos, B., 2019. Improving bug triaging with high confidence
Dedík, V., Rossi, B., 2016. Automated bug triaging in an industrial context. In: predictions at ericsson. In: 2019 IEEE International Conference on Software
Proceedings of the 42th Euromicro Conference on Software Engineering and Maintenance and Evolution. ICSME, pp. 81–91.
Advanced Applications. SEAA, pp. 363–367. Sawarkar, R., Nagwani, N.K., Kumar, S., 2019. Predicting available expert developer
Du, X., Zheng, Z., Xiao, G., Zhou, Z., Trivedi, K.S., 2022. DeepSIM: Deep semantic for newly reported bugs using machine learning algorithms. In: Proceedings of the
information-based automatic mandelbug classification. IEEE Trans. Reliab. 71 (4), 5th International Conference for Convergence in Technology. I2CT, pp. 1–4.
1540–1554. Sbih, A., Akour, M., 2018. Towards efficient ensemble method for bug triaging. J.
Frank, E., Bouckaert, R.R., 2006. Naive Bayes for text classification with unbalanced Mult.-Valued Logic Soft Comput. 31, 567–590.
classes. In: Knowledge Discovery in Databases: PKDD 2006: 10th European Con- Sun, X., Yang, H., Xia, X., Li, B., 2017. Enhancing developer recommendation with
ference on Principles and Practice of Knowledge Discovery in Databases Berlin, supplementary information via mining historical commits. J. Syst. Softw. 134,
Germany, September 18-22, 2006 Proceedings 10. Springer, pp. 503–510. 355–368.
Giray, G., Bennin, K.E., Köksal, Ö., Babur, Ö., Tekinerdogan, B., 2023. On the use of
Tan, S., Cheng, X., Wang, Y., Xu, H., 2009. Adapting Naive Bayes to domain
deep learning in software defect prediction. J. Syst. Softw. 195, 111537.
adaptation for sentiment analysis. In: Advances in Information Retrieval: 31th
Graves, A., 2012. Long short-term memory. In: Supervised Sequence Labelling with
European Conference on IR Research, ECIR 2009, Toulouse, France, April 6-9, 2009.
Recurrent Neural Networks. Springer Berlin Heidelberg, Berlin, Heidelberg, pp.
Proceedings 31. Springer, pp. 337–349.
37–45.
Taud, H., Mas, J., 2018. Multilayer perceptron (MLP). In: Camacho Olmedo, M.T.,
Guo, S., Zhang, X., Yang, X., Chen, R., Guo, C., Li, H., Li, T., 2020. Developer activity
Paegelow, M., Mas, J.-F., Escobar, F. (Eds.), Geomatic Approaches for Modeling
motivated bug triaging: Via convolutional neural network. Neural Process. Lett. 51
Land Change Scenarios. Springer International Publishing, Cham, pp. 451–455.
(3), 2589–2606.
Hu, D., Chen, M., Wang, T., Chang, J., Yin, G., Yu, Y., Zhang, Y., 2018. Recommending Von der Mosel, J., Trautsch, A., Herbold, S., 2022. On the validity of pre-trained
similar bug reports: A novel approach using document embedding model. In: transformers for natural language processing in the software engineering domain.
Proceedings of the 25th Asia-Pacific Software Engineering Conference. APSEC, pp. IEEE Trans. Softw. Eng. 49 (4), 1487–1507.
725–726. Voorhees, E.M., et al., 1999. The trec-8 question answering track report. In: Trec, Vol.
Hu, X., Li, G., Xia, X., Lo, D., Jin, Z., 2020. Deep code comment generation with hybrid 99. pp. 77–82.
lexical and syntactical information. Empir. Softw. Eng. 25 (3), 2179–2217. Wilcoxon, F., 1946. Individual comparisons by ranking methods. Biometrics 1 (6),
Jahanshahi, H., Cevik, M., 2022. S-DABT: Schedule and dependency-aware bug triage 80–83.
in open-source bug tracking systems. Inf. Softw. Technol. 151, 107025. Wongkar, M., Angdresey, A., 2019. Sentiment analysis using Naive Bayes algorithm of
Jahanshahi, H., Chhabra, K., Cevik, M., Basar, A., 2021. DABT: A dependency-aware the data crawler: Twitter. In: 2019 Fourth International Conference on Informatics
bug triaging method. In: International Conference on Evaluation and Assessment and Computing. ICIC, IEEE, pp. 1–5.
in Software Engineering. pp. 221–230. Wu, W., Zhang, W., Yang, Y., Wang, Q., 2011. DREX: Developer recommendation
Jeong, G., Kim, S., Zimmermann, T., 2009. Improving bug triage with bug tossing with K-nearest-neighbor search and expertise ranking. In: Proceedings of the 18th
graphs. In: Proceedings of the 7th Joint Meeting of the European Software Asia-Pacific Software Engineering Conference. pp. 389–396.
Engineering Conference and the ACM SIGSOFT Symposium on the Foundations Xi, S., Yao, Y., Xiao, X., Xu, F., Lu, J., 2018. An effective approach for routing the bug
of Software Engineering. pp. 111–120. reports to the right fixers. In: Proceedings of the 10th Asia-Pacific Symposium on
Kim, Y., 2014. Convolutional neural networks for sentence classification. In: Proceed- Internetware. Internetware ’18.
ings of the 2014 Conference on Empirical Methods in Natural Language Processing.
Xia, X., Lo, D., Ding, Y., Al-Kofahi, J., Nguyen, T., Wang, X., 2017. Improving automated
EMNLP, Doha, Qatar, pp. 1746–1751.
bug triaging with specialized topic model. IEEE Trans. Softw. Eng. 43 (3), 272–297.
Ko, A.J., Myers, B.A., Chau, D.H., 2006. A linguistic analysis of how people de-
Xia, X., Lo, D., Wang, X., Zhou, B., 2015. Dual analysis for recommending developers
scribe software problems. In: Visual Languages and Human-Centric Computing.
to resolve bugs. J. Softw.: Evol. Process 27 (3), 195–220.
VL/HCC’06, IEEE, pp. 127–134.
Lamkanfi, A., Demeyer, S., Giger, E., Goethals, B., 2010. Predicting the severity Xu, S., Li, Y., Wang, Z., 2017. Bayesian multinomial naïve Bayes classifier to text clas-
of a reported bug. In: 2010 7th IEEE Working Conference on Mining Software sification. In: Advanced Multimedia and Ubiquitous Engineering: MUE/FutureTech
Repositories. MSR 2010, IEEE, pp. 1–10. 2017 11. Springer, pp. 347–352.
Lee, S.-R., Heo, M.-J., Lee, C.-G., Kim, M., Jeong, G., 2017. Applying deep learning Xuan, J., Jiang, H., Hu, Y., Ren, Z., Zou, W., Luo, Z., Wu, X., 2015. Towards effective
based automatic bug triager to industrial projects. In: Proceedings of the 11th Joint bug triage with software data reduction techniques. IEEE Trans. Knowl. Data Eng.
Meeting on Foundations of Software Engineering. pp. 926–931. 27 (1), 264–280.
Lee, D.-G., Seo, Y.-S., 2019. Systematic review of bug report processing techniques to Yadav, A., Singh, S.K., 2020. A novel and improved developer rank algorithm for bug
improve software management performance. J. Inf. Process. Syst. 15 (4), 967–985. assignment. Int. J. Intell. Syst. Technol. Appl. 19 (1), 78–101.
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J., 2020. BioBERT: A Yang, G., Zhang, T., Lee, B., 2014. Towards semi-automatic bug triage and severity
pre-trained biomedical language representation model for biomedical text mining. prediction based on topic model and multi-feature of bug reports. In: Proceedings
Bioinformatics 36 (4), 1234–1240. of the 38th Annual Computer Software and Applications Conference. pp. 97–106.

18
R. Wang et al. The Journal of Systems & Software 210 (2024) 111961

Yin, Y., Dong, X., Xu, T., 2018. Rapid and efficient bug assignment using ELM for IOT Zhang, Y., Wallace, B.C., 2017. A sensitivity analysis of (and practitioners’ guide to)
software. IEEE Access 6, 52713–52724. convolutional neural networks for sentence classification. In: Proceedings of the 8th
Zaidi, S.F.A., Awan, F.M., Lee, M., Woo, H., Lee, C.-G., 2020. Applying convolutional International Joint Conference on Natural Language Processing (Volume 1: Long
neural networks with different word representation techniques to recommend bug Papers). pp. 253–263.
fixers. IEEE Access 8, 213729–213747. Zhang, J., Wang, X., Hao, D., Xie, B., Zhang, L., Mei, H., 2015. A survey on bug-report
Zhang, T., Chen, J., Jiang, H., Luo, X., Xia, X., 2017. Bug report enrichment analysis. Sci. China Inf. Sci. 58 (2), 1–24.
with application of automated fixer recommendation. In: 2017 IEEE/ACM 25th Zhou, Y., Tong, Y., Gu, R., Gall, H., 2016. Combining text mining and data mining for
International Conference on Program Comprehension. ICPC, pp. 230–240. bug report classification. J. Softw.: Evol. Process 28 (3), 150–176.