
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 50, NO. 10, OCTOBER 2024

RLocator: Reinforcement Learning for Bug Localization

Partha Chakraborty, Student Member, IEEE, Mahmoud Alfadel, Member, IEEE, and Meiyappan Nagappan

Abstract—Software developers spend a significant portion of their time fixing bugs in their projects. To streamline this process, bug localization approaches have been proposed to identify the source code files that are likely responsible for a particular bug. Prior work proposed several similarity-based machine-learning techniques for bug localization. Despite significant advances in these techniques, they do not directly optimize the evaluation measures. We argue that directly optimizing evaluation measures can positively contribute to the performance of bug localization approaches. Therefore, in this paper, we utilize Reinforcement Learning (RL) techniques to directly optimize the ranking metrics. We propose RLocator, a Reinforcement Learning-based bug localization approach. We formulate RLocator using a Markov Decision Process (MDP) to optimize the evaluation measures directly. We present the technique and experimentally evaluate it based on a benchmark dataset of 8,316 bug reports from six highly popular Apache projects. The results of our evaluation reveal that RLocator achieves a Mean Reciprocal Rank (MRR) of 0.62, a Mean Average Precision (MAP) of 0.59, and a Top 1 score of 0.46. We compare RLocator with three state-of-the-art bug localization tools: FLIM, BugLocator, and BL-GAN. Our evaluation reveals that RLocator outperforms these approaches by a substantial margin, with improvements of 38.3% in MAP, 36.73% in MRR, and 23.68% in the Top K metric. These findings highlight that directly optimizing evaluation measures considerably contributes to performance improvement on the bug localization problem.

Index Terms—Reinforcement learning, bug localization, deep learning.

Manuscript received 10 January 2024; revised 25 July 2024; accepted 19 August 2024. Date of publication 30 August 2024; date of current version 17 October 2024. Recommended for acceptance by D. Lo. (Corresponding author: Partha Chakraborty.) Partha Chakraborty and Meiyappan Nagappan are with the David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: [email protected]; [email protected]). Mahmoud Alfadel is with the Department of Computer Science, University of Calgary, Calgary, AB T2N 1N4, Canada (e-mail: [email protected]). Digital Object Identifier 10.1109/TSE.2024.3452595

I. INTRODUCTION

Software bugs are an inevitable part of software development. Developers spend one-third of their time debugging and fixing bugs [1]. After a bug report/issue has been filed, the project team identifies the source code files that need to be inspected and modified to address the issue. However, manually locating the files responsible for a bug is expensive (in terms of time and resources), especially when there are a lot of files and bug reports. Moreover, the number of bugs reported is often higher than the number of available developers [2]. Consequently, fix times and maintenance costs rise, and the customer satisfaction rate decreases [3].

Bug localization refers to identifying the source code files where a particular bug originated. Given a bug report, bug localization approaches utilize the textual information in the bug report and the project source code files to shortlist the potentially buggy files. Prior work has proposed various Information Retrieval-based Bug Localization (IRBL) approaches to help developers speed up the debugging process (e.g., DeepLocator [4], CAST [5], KGBugLocator [6], BL-GAN [7]).

One common theme among these approaches is that they follow a similarity-based approach to localize bugs. Such techniques measure the similarity between bug reports and the source code files. For estimating similarity, they use various methods such as cosine distance [8], Deep Neural Networks (DNN) [9], and Convolutional Neural Networks (CNN) [5]. Then, they rank the source code files based on their similarity score. In the training phase of these approaches, the model learns to optimize the similarity metrics. In contrast, in the testing phase, the model is tested with ranking metrics (e.g., Mean Reciprocal Rank (MRR) or Mean Average Precision (MAP)).

While most of these approaches showed promising performance, they optimize a metric that only indirectly represents the performance metrics. Prior studies [10], [11], [12], [13] found that direct optimization of evaluation measures substantially contributes to performance improvement in ranking problems. Direct optimization is also efficient compared to optimizing indirect metrics [13]. Hence, we argue that it is challenging for the solutions proposed by prior studies to sense how a wrong prediction would affect the performance evaluation measures [10]. In other words, if we use the retrieval metrics (e.g., MAP) in the training phase, the model will learn how each prediction will impact the evaluation metrics. A wrong prediction will change the rank of the source code file and ultimately impact the evaluation metrics.

Reinforcement Learning (RL) is a sub-category of machine learning methods where labeled data is not required. In RL, the model is not trained to predict a specific value. Instead, the model is given a signal about a right or wrong choice in training [14]. Based on the signal, the model updates its decision.



This allows RL to use evaluation measures such as MRR and MAP in the training phase and directly optimize the evaluation metrics. Moreover, because MRR/MAP is used as a signal instead of a label, the problem of overfitting will be less prevalent. The Markov Decision Process (MDP) is a foundational element of RL. MDP is a mathematical framework that allows the formalization of discrete-time decision-making problems [15]. Real-world problems often need to be formalized as an MDP to apply RL.

In this paper, we present RLocator, an RL technique for localizing software bugs in source code files. We formulate RLocator as an MDP. In each step of the MDP, we use MRR and MAP as signals to guide the model toward the optimal choice. We evaluate RLocator on a benchmark dataset of six Apache projects and find that, compared with existing state-of-the-art bug localization techniques, RLocator achieves substantial performance improvement. While pinpointing the exact reasons for RL's superior performance over other supervised techniques can be challenging, RL learns more generalizable approaches, especially in dynamic and complex environments. In comparison to supervised learning, it learns approaches that are more adaptable to a variety of situations [16], [17], which is a form of generalization. Additionally, RL demonstrates proficiency in scenarios where the optimal solution is not clearly defined, showcasing its versatility across various tasks and domains [14]. These factors can contribute to the superior performance of RLocator.

The main contributions of our work are as follows:
• We present RLocator, an RL-based software bug localization approach. The key technical novelty of RLocator is using RL for bug localization, which includes formulating the bug localization process into an MDP.
• We provide an experimental evaluation of RLocator with 8,316 bug reports from six Apache projects. When RLocator can localize, it achieves an MRR of 0.49-0.62, a MAP of 0.47-0.59, and a Top 1 of 0.38-0.46 across all studied projects. Additionally, we compare RLocator's performance with state-of-the-art bug localization methods. RLocator outperforms FLIM [18] by 38.3% in MAP, 36.73% in MRR, and 23.68% in Top K. Furthermore, RLocator exceeds BugLocator [3] by 56.86% in MAP, 41.51% in MRR, and 26.32% in Top K. In terms of Top K, RLocator shows improved performance over BL-GAN [7], with gains ranging from 3.33% to 55.26%. The performance gains for MAP and MRR are 40.74% and 32.2%, respectively.

II. MOTIVATION

Reinforcement Learning (RL) stands out for its ability to learn from feedback, a characteristic that empowers models to self-correct based on the outcomes of their actions. This feature finds widespread application, exemplified by platforms like Spotify, an audio streaming service using RL to learn user preferences [19]. The model evolves and adapts by presenting music selections and refining recommendations through user interactions. The versatility of RL extends beyond entertainment; various companies [20] and domains [21] leverage its capacity for iterative learning and adjustment.

The proficiency required for bug localization is often acquired through experience, with seasoned developers exhibiting a faster bug-finding aptitude than their less experienced counterparts [22]. Recognizing the significance of experience in bug localization, we propose the integration of reinforcement learning into this domain. By employing RL, a model can present developers with sets of source code files as possible causes for a bug and learn from their feedback to enhance its skill in localizing bugs in the software. In contrast to conventional machine learning approaches, which rely solely on labeled data and lack easy adaptability, reinforcement learning presents two distinct advantages: firstly, the ability to learn from developer feedback, and secondly, the elimination of the requirement for labeled data in real-world scenarios. Therefore, our research aims to incorporate reinforcement learning into bug localization, leveraging its capacity to adapt and enhance performance through iterative feedback.

III. BACKGROUND

In this section, we describe terms related to the bug localization problem, which we use throughout our study. Also, we present an overview of reinforcement learning.

A. Bug Localization System

A typical bug localization system utilizes several sources of data, e.g., bug reports, stack traces, and logs, to identify the responsible source code files. One particular challenge of the system is that the bug report contains natural language, whereas source code files are written in a programming language.

Typically, bug localization systems identify whether a bug report relates to a source code file. To do so, the system extracts features from both the bug report and the source code files. Previous studies used techniques such as N-gram [23], [24] and Word2Vec [25], [26] to extract features (embeddings) from bug reports and source code files. Other studies (e.g., Devlin et al. [27]) introduced the transformer-based model BERT, which has achieved higher performance than all the previous techniques. One of the reasons transformer-based models perform better at extracting textual features is that the transformer uses multi-head attention, which can utilize long context while generating embeddings. Previous studies have proposed a multi-modal BERT model [28] for programming languages, which can extract features from both bug reports and source code files.

A bug report mainly contains information related to unexpected behavior and how to reproduce it. It mainly includes a bug ID, title, textual description of the bug, and version of the codebase where the bug exists. The bug report may have an example of code, stack trace, or logs. A bug localization system retrieves all the source code files from a source code repository at that particular version. For example, assume we have 100 source code files in a repository in a specific version. After retrieving 100 files from that version, the system will estimate the relevance between the bug report and each of the 100 files. The relevance can be measured in several ways.


For example, a naive system can check how many words of the bug report exist in each source code file. A sophisticated system can compare embeddings using cosine distance [29]. After relevance estimation, the system ranks the files based on their relevance score. The ranked list of files is the final output of a bug localization system that developers will use.

B. Reinforcement Learning

In Reinforcement Learning (RL), the agent interacts with the environment through observation. Formally, an observation is called a "State," S. In each state, at time t, St, the agent takes action A based on its understanding of the state. Then, the environment provides feedback/reward R and transfers the agent into a new state St+1. The agent's strategy to determine the best action, which will eventually lead to the highest cumulative reward, is referred to as the policy [14], [30].

The cumulative reward (until the goal/end) that an agent can get if it takes a specific action in a certain state is called the Q value. The function that is used to estimate the Q value is often referred to as the Q function or value function.

In RL, an agent starts its journey from a starting state and then goes forward by picking the appropriate action. The journey ends in a pre-defined end state. The journey from the start to the end state is referred to as an episode.

From a high level, we can divide the state-of-the-art RL algorithms into two classes. The first is the model-free algorithms, where the agent has no prior knowledge about the environment. The agent learns about the environment by interacting with it. The other type is the model-based algorithms. In a model-based algorithm, the agent uses the reward prediction from a model instead of interacting with the environment. The bug localization task is quite similar to the model-free setting, as we cannot predict/identify the buggy files without checking the bug report and source code files (without interacting with the environment). Thus, we use model-free RL algorithms in this study. Two popular variants of model-free RL algorithms are:
• Value Optimization: The agent tries to learn the Q value function in value optimization approaches. The agent keeps the Q value function in memory and updates it gradually. It consults the Q value function in a particular state and picks the action that will give the highest value (reward). An example of the value optimization-based approach is the Deep Q Network (DQN) [14].
• Policy Optimization: In the policy optimization approach, the agent tries to learn the mapping between the state and the action that will result in the highest reward. The agent will pick the action based on the mapping in a particular state. An example of the policy optimization-based approach is Advantage Actor-Critic (A2C) [14], [31].

A2C is a policy-based algorithm where the agent learns an optimized policy to solve a problem. In Actor-Critic, the actor model picks the action. The future return (reward) of an action is estimated using the critic model. The actor model uses the critic model to pick the best action in any state. Advantage Actor-Critic subtracts a base value from the return in any timestep. A2C with entropy adds the entropy of the probability of the possible actions to the loss of the actor model. As a result, in the gradient descent step, the model tries to maximize the entropy of the learned policy. Maximization of entropy ensures that the agent assigns almost equal probability to actions with a similar return.
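To make the role of the entropy term concrete, the following is a minimal sketch of an A2C-style loss with entropy regularization. The tensor names and the entropy coefficient are illustrative assumptions, not values from this paper:

```python
import torch

def a2c_loss(log_probs, values, returns, entropy, entropy_coef=0.01):
    # Advantage: observed return minus the critic's baseline estimate.
    advantage = returns - values
    # Actor loss: raise the log-probability of actions with positive advantage.
    actor_loss = -(log_probs * advantage.detach()).mean()
    # Critic loss: regress the value estimates toward the observed returns.
    critic_loss = advantage.pow(2).mean()
    # Subtracting the entropy term rewards more uniform action distributions,
    # which encourages exploration (the "A2C with entropy" variant above).
    return actor_loss + 0.5 * critic_loss - entropy_coef * entropy.mean()
```

Maximizing entropy in this way keeps the policy from collapsing onto a single action when several actions yield similar returns.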

IV. RLOCATOR: REINFORCEMENT LEARNING FOR BUG LOCALIZATION

In this section, we discuss the steps we follow to use RLocator. First, we explain (in Section IV-A) the pre-processing step required for using RLocator. Then, we explain (in Section IV-B) the formulation steps of our design of RLocator. We present the overview of our approach in Fig. 1.

A. Pre-Process

Before using the bug reports and source code files to train the RL model, they undergo a series of pre-processing steps. The steps are described in this section.

Input: The inputs to our bug localization tool are bug reports and source code files associated with a particular version of a project repository. Software projects maintain a repository for their bugs or issues (e.g., Jira, GitHub, Bugzilla). The first component, the bug report, can be retrieved from those issue repositories. We use the bug report to obtain the second component (i.e., source code) by identifying the project version affected by the bug. Typically, each bug report is associated with a version or commit SHA of the project repository. After identifying the buggy version, we collect all source code files from the specific version of the code repository. In the training phase, we compile bug reports and source code files into a dataset for subsequent usage. Our dataset contains a set of bug reports where each bug report has its own set of source code files. In real-world usage, RLocator directly accesses bug reports and source code files from the repository. In Fig. 1, we illustrate the input stage where we get the bug report and source code files from the dataset.

Shortlisting source code files: The number of source code files in different versions of the repository can be different. All of the source code files can be potentially responsible for a bug. In this step, we identify K source code files as candidates for each bug. We limited the candidates to K as we cannot pass a variable number of source code files to the RL model. Moreover, given that RLocator primarily learns from developers' feedback, its usage can prove challenging for a developer with many candidate source code files. To illustrate the issue, consider a repository with 700 files. RLocator presents files to the developer one by one for relevance verification. This sequential approach significantly prolongs the time taken to find a relevant file, resulting in a waste of developers' time. Consequently, it is crucial to limit the number of files shown to developers by providing a shortlisted set for assessment.

To identify the K most relevant files, we use ElasticSearch (ES). ES is a search engine based on the Lucene search engine project. It is a distributed, open-source search and analytics engine for all data types, including text. It analyzes and indexes words/tokens for the textual match and uses BM25 to rank the files matching the query.

Fig. 1. Bug localization as Markov decision process.

We use the ES index to identify the topmost k source code files related to a bug report. Following the study by Liu et al. [32] (who used ES in the context of code search), we build an ES index using the source code files and then query the index using the bug report as the query. Then, we pick the first k files with the highest textual similarities with the bug report. We want to note that the goal of bug localization is to get the relevant files ranked as close to the 1st rank as possible. Hence, metrics like MAP and MRR can measure the performance of bug localization techniques. While one can argue for relying on ES alone to rank the relevant files, we find that the MAP and MRR of using ES are poor. Our RL-based technique learns from feedback and aims to rerank the output from ES to get higher MAP and MRR scores. In Fig. 1, we illustrate the candidate refinement step, where we query ElasticSearch using the bug report and use the outputs to refine the candidate source code files.
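A minimal sketch of this shortlisting step with the official Python ES client is shown below. The index and field names are illustrative assumptions; ES scores `match` queries with BM25 by default:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def shortlist(bug_report_text: str, k: int = 31) -> list[str]:
    # Query the index of source files with the raw bug report text;
    # ES ranks the matching documents by BM25 score.
    resp = es.search(
        index="source-files",  # hypothetical index of one document per file
        query={"match": {"content": bug_report_text}},
        size=k,
    )
    return [hit["_source"]["path"] for hit in resp["hits"]["hits"]]
```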
Filtration of bug report and source code files: One limitation of ES is that it sometimes returns irrelevant files among the top k most relevant source code files. When there are no relevant files in the first k files, it hinders RLocator training using developer feedback and introduces noise. Therefore, we use an XGBoost-based binary classifier [33] to identify cases where ES may return no relevant files in the top k files. The rationale for using XGBoost is twofold: (1) to optimize developer time by not presenting irrelevant files and (2) to filter out noise during training. ES-based filtering is not used because its similarity values are not normalized, and cosine similarity is inapplicable to text data. We provide the XGBoost model with the bug report and the top k files retrieved by ES to determine if any are relevant. If the XGBoost model predicts no relevant files in the set, we exclude those bug reports and their associated files. Each bug report is associated with its unique set of source code files, so filtering one does not impact others.

To build the model, we study the most important features associated with the prediction task. We consult the related literature on the field of information retrieval [35], [36], [38] and bug report classification [34] for feature selection. The list of computed features is presented in Table I. For our dataset, we calculate the selected features and train the model using 10-fold cross-validation.

The results show that our classifier model has a precision of 0.78, a recall of 0.93, and an F1-score of 0.85. Additionally, the model is able to correctly classify 91% of the dataset (i.e., cases where there will be relevant source code files in the top k files returned by ES).

TABLE I
DESCRIPTION AND RATIONALE OF THE SELECTED FEATURES

• Bug Report Length: Length of the bug report. Rationale: Fan et al. [34] found that it is hard to localize bugs using a short bug report. A short bug report will contain little information about the bug; thus it will be hard for ElasticSearch to retrieve the source code files responsible for the bug.
• Source Code Length: Median length of the source code files associated with a particular bug (we calculate the string length of each file after removing code comments). Rationale: Prior studies [35], [36] found that calculating textual similarity is challenging for long texts. The length of the source code may contribute to the performance drop of ElasticSearch.
• Stacktrace: Availability of a stack trace in the bug report. Rationale: Schroter et al. [37] found that stack traces in bug reports can help the debugging process as they may contain useful information. The availability of stack traces may improve the performance of ElasticSearch.
• Similarity: Ratio of similar tokens between a bug report and source code files. Rationale: Similarity indicates the amount of helpful information in the bug report. We calculate the similarity based on the equation presented in Section VI.
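As an illustration, the filter can be sketched as a standard XGBoost binary classifier over the Table I features. The feature values and hyper-parameters below are toy assumptions, not details from the paper:

```python
import numpy as np
import xgboost as xgb

# One row per bug report: [report length, median source file length,
# stack trace present, token-similarity ratio] (the Table I features).
X = np.array([[420, 3500, 1, 0.42],
              [ 35, 9000, 0, 0.08],
              [210, 1200, 1, 0.31]])
y = np.array([1, 0, 1])  # 1 = ES's top k contains at least one relevant file

clf = xgb.XGBClassifier(n_estimators=100, max_depth=4, eval_metric="logloss")
clf.fit(X, y)

# Bug reports predicted 0 are excluded before RL training.
print(clf.predict(X))
```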
After filtration, we pass each bug report and its source code files to RLocator. In Fig. 1, we have depicted the operational procedure of RLocator. The workflow commences with a curated dataset containing bug reports and source code files. Subsequently, we index the source code files into the ES index. From ES, we obtain bug reports and the shortlisted K source code files linked to those bug reports. Following this shortlisting, we employ the XGBoost model to predict the presence of a relevant file within the top K files. If at least one relevant file exists, we proceed to the next step by passing on the bug reports and filtered source code files.

B. Formulation of RLocator

In the previous step, we pre-processed the dataset for training the reinforcement learning model. We shortlist the k most relevant files for each bug report. After that, we identify the bug reports for which there will be no relevant files in the top k files and filter out those bug reports. Finally, we pass the top k relevant files to RLocator. In this section, we explain how RLocator employs Reinforcement Learning for its bug localization. Our approach is grounded in the belief that each bug report contains specific indicators, such as terms and keywords, aiding developers in pinpointing problematic source code files. For example, in Java code, we can get a nested exception (an exception that leads to another exception, and another; an example is available in the online Appendix [39]). A developer can identify the root exception and which method call (or any other code block) caused it. After that, they can go to the implementation of that method in the source code files. The process indicates that developers can identify and use the important information from a bug report. Following prior studies [40], [41], we formulate RLocator as a Markov Decision Process (MDP) by dividing the ranking problem into a sequence of decision-making steps. A general RL model can be represented by a tuple (S, A, τ, R, π), which is composed of states, actions, transition, reward, and policy, respectively. Fig. 1 shows an overview of our RLocator approach. Next, we describe the formulation steps of each component of RLocator.

States: S is the set of states. The RL model moves from one state to another until it reaches the end state. To form the states of the MDP, we apply the following steps:

Input: The input of our MDP comprises a bug report and the top K relevant source code files from a project repository. We use CodeBERT [28], a transformer-based model, to convert text into embeddings, representing the text in a multi-dimensional space. CodeBERT is chosen for its ability to handle long contexts, making it suitable for long source code files where methods may be declared far from their usage. Unlike Word2Vec, which generates static embeddings for words, CodeBERT generates dynamic embeddings for sequences, capturing context during inference. This is crucial in source code files where variable use depends on scope.

CodeBERT, trained on natural language and programming language pairs, handles both programming and natural languages. Its self-attention mechanism assesses the significance of individual terms, helping link bug reports to relevant source code files. For example, in a Java nested exception, developers can identify the main exception and pinpoint the responsible code block. RLocator relies on this self-attention mechanism to identify and leverage these informative cues effectively.

In our approach, as shown in Fig. 1, the embedding model processes bug reports and source code files, generating embeddings for the source code files F1, F2, ..., Fk, and the bug report R.
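A minimal sketch of producing such embeddings with the publicly available CodeBERT checkpoint follows. The [CLS]-pooling and truncation choices here are our assumptions rather than details specified in the paper, and the file name is hypothetical:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

def embed(text: str) -> torch.Tensor:
    # CodeBERT accepts at most 512 tokens; we keep the [CLS] vector
    # as one fixed-size representation of the whole sequence.
    enc = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state[:, 0, :]  # shape: (1, 768)

r = embed("NullPointerException when restoring editor state")  # bug report R
f1 = embed(open("Editor.java").read())                         # source file F1
e1 = torch.cat([f1, r], dim=-1)  # concatenated embedding E1 (next step)
```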
Concatenation: After we obtain the embeddings for the source code files and the bug report, we concatenate them. Prior studies [42], [43] suggest that combining distinct sets of features through concatenation and processing them with a linear layer enables effective interaction among the features. Furthermore, feature interaction is fundamental in determining similarity [44], [45]. Thus, with the goal of calculating the similarity between a bug report and a source code file pair, we concatenate their embeddings. Given our example in Fig. 1, we concatenate the embedding of F1, F2, ..., and Fk with the embedding of bug report R independently. This step leads us to obtain the corresponding concatenated embeddings E1, E2, ..., and Ek, as shown in Fig. 1.

Note that each state of the MDP comprises two lists: a candidate list and a ranked list. The candidate list contains the concatenated list of embeddings. As shown in our example in Fig. 1, the candidate list contains E1, E2, ..., and Ek. In the candidate list, source code embeddings (code file and bug report embeddings concatenated together) are ranked randomly. The other list is the ranked list of source code files based on their relevance to the bug report R. Initially (at State1), the candidate list is full, and the ranked list is empty. In each state transition, the model moves one embedding from the candidate list to the ranked list based on its probability of being responsible for the bug. In the final state, the ranked list will be full, and the candidate list will be empty. We describe the process of selecting and ranking a file in detail in the next step.
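The two-list state and its transitions can be written down directly; a compact sketch with illustrative list contents:

```python
# Each MDP state is a pair of lists over the concatenated embeddings E_i.
candidates = ["E1", "E2", "E3", "E4"]  # unranked, initially full
ranked = []                            # initially empty

def step(action: int) -> None:
    # Taking an action moves one embedding from the candidate list to the
    # ranked list; its rank is the timestep at which it was picked.
    ranked.append(candidates.pop(action))

step(1)            # pick E2 first, so E2 receives rank 1
print(ranked)      # ['E2']
print(candidates)  # ['E1', 'E3', 'E4']
```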
Actions: We define actions in our MDP as selecting a file from the candidate list and moving it to the ranked list. Suppose at timestep t the RL model picks the embedding E1; then the rank of that particular file will be t. In Fig. 1, at timestamp 1, the model picks the concatenated embedding of file F2. Thus, the rank of F2 will be 1. As in each timestamp we move one file from the candidate list to the ranked list, the total number of files will be equal to the number of states and the number of actions. For identifying the potentially best action at any timestamp t, we use a deep learning (DL) model (indicated as Ranking Model in Fig. 1), which is composed of a Convolutional Neural Network (CNN) followed by a Long Short-Term Memory (LSTM) [46]. Following [47], [48], [49], we use the CNN to establish the connection between source code files and bug reports and extract relevant features. As mentioned earlier, developers acquire the ability to recognize cues and subsequently employ them to establish the association between source code files and bug reports. The CNN facilitates the second stage of bug localization, which involves extracting important features. The input of the CNN is the concatenated embedding of the bug report and each source code file, and the output of the CNN is the extracted features from the combined embedding of bug reports and source code files. The features are later used to calculate relevance.

On the other hand, the LSTM [50] is intended to make the model aware of a restriction, which we call state awareness. That is, in each timestamp, the model is allowed to pick the potentially best embedding that has not been picked yet, i.e., if a file is selected at State_i, it cannot be selected again in a later state (i.e., State_{i+j}, j ≥ 1). The LSTM retains the state and aids the RL agent in choosing a subsequent action that does not conflict with prior actions. Thus, following previous studies [50], [51], we use an LSTM to make the model aware of previous actions. The LSTM takes a set of feature vectors as input and outputs the id of the source code file most suitable for the current state.

Transition: τ(S, A) is a function τ: S × A → S which maps a state s_t into a new state s_{t+1} in response to the selected action a_t. Choosing an action a_t means removing a file from the candidate list and placing it in the ranked list.

Reward: A reward is a value provided to the RL agent as feedback on its action. We refer to a reward received from one action as a return. The RL technique signals the agent about the appropriate action in each step through the reward function, which can be modeled using the retrieval metrics. Thus, the RL agent can learn to optimize the retrieval metrics through the reward function. We consider two important factors in the ranking evaluation: the position of the relevant files and the distance between relevant files in the ranked list of embeddings. We incorporate both factors in designing the reward function shown below.

$$\mathcal{R}(S, A) = \frac{M \cdot \mathit{file\_relevance}}{\log_2(t+1) \cdot \mathit{distance}(S)}, \quad \text{if } A \text{ is an action that has not been selected before} \tag{1}$$

$$\mathcal{R}(S, A) = -\log_2(t+1), \quad \text{otherwise} \tag{2}$$

$$\mathit{distance}(S) = \text{Avg.}(\text{distance between currently picked subsequent relevant files}) \tag{3}$$

In Equations 1 and 2, t is the timestamp, S is the state, and A is the action. Mean Reciprocal Rank (MRR) measures the average reciprocal rank of all the relevant files. In Equation 1, file_relevance / log_2(t+1) represents the MRR. The use of a logarithmic function in the equation is motivated by previous studies [52], [53], which found that it leads to a stable loss. When the relevant files are ranked higher, the average precision tends to be higher. To encourage the reinforcement learning system to rank relevant files higher, we introduce a punishment mechanism if there is a greater distance between two relevant files. By imposing this punishment on the agent, we incentivize it to prioritize relevant files in higher ranks, which in turn contributes to the Mean Average Precision (MAP).

We illustrate the reward functions with an example below. Presume that the process reaches state S6 and the currently picked concatenated embeddings are E1, E2, E3, E4, E5, E6, with relevancy to the bug report of 0, 0, 1, 0, 1, 1. This means that the embeddings (or files) ranked in the 3rd, 5th, and 6th positions are relevant to the bug report. The positions of the relevant files are 3, 5, 6, and the distances between them are 1 and 0. Hence, distance(S6) = Avg.(1, 0) = 0.5. If the agent picks a new relevant file, we reward the agent M times the reciprocal rank of the file divided by the distance between the already picked related files. In our example, the last picked file E6's relevancy is 1. Thus, we have the following values for Equation 1: distance(S6) = 0.5; log_2(6 + 1) = 2.8074; file_relevance = 1. Note that M is a hyper-parameter. We find that a value of three for M results in the highest reward for our RL model. We identify the best value for M by experimenting with different values (1, 3, 6, and 9). Fig. 2 shows the resulting reward-episode graph using different values of M. Hence, given M = 3, the value of the reward function will be R(S, A) = (3 · 1) / (2.8074 · 0.5) = 2.14. The reward can vary between M and ~0. A higher value of the reward function indicates a better action of the model. Finally, in the case of optimal ranking, distance(S) would be zero. We handle this case by assigning a value of 1 to distance(S). Even though we use MRR and MAP as optimization goals, we do not require labeled data. Instead, RLocator learns from developers' feedback.
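Equations (1)-(3) translate directly into code. A minimal sketch (with the M = 3 setting reported above) that reproduces the worked example R(S6, A) = 3 · 1 / (2.8074 · 0.5) ≈ 2.14; the gap-counting convention is inferred from the example:

```python
import math

M = 3  # hyper-parameter; the value found to work best above

def distance(relevance):
    # Average gap between consecutive relevant files picked so far (Eq. 3);
    # 1 is substituted when the ranking is optimal (distance would be 0).
    ranks = [i + 1 for i, rel in enumerate(relevance) if rel]
    gaps = [b - a - 1 for a, b in zip(ranks, ranks[1:])]
    d = sum(gaps) / len(gaps) if gaps else 0.0
    return d if d > 0 else 1.0

def reward(t, file_relevance, relevance_so_far, repeated=False):
    if repeated:  # Eq. 2: penalize re-selecting an already picked file
        return -math.log2(t + 1)
    return (M * file_relevance) / (math.log2(t + 1) * distance(relevance_so_far))

# Worked example from the text: picks at t = 6 with relevancy 0,0,1,0,1,1.
print(round(reward(6, 1, [0, 0, 1, 0, 1, 1]), 2))  # -> 2.14
```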

RLocator presents a limited set of files to the developer, seeking their feedback. If a developer deems a specific file as relevant, they can click on it. This click feedback signifies the file's relevance within the set. RLocator leverages this input in Equation 1 and learns the process of bug localization. Incorporating developers' feedback may cause some inconvenience. However, all machine learning models are prone to data drift [54], [55], [56], where initial training data no longer matches current data, leading to declining performance. RLocator addresses this by continuously updating its learning based on developers' feedback.

We limit the number of actions to 31 in RLocator. Since the number of states equals the number of actions, we also limit the number of states to 31. The prediction space of a reinforcement learning agent cannot be variable, and the number of source code files is variable. Thus, we must fix the number of actions, k, to a manageable number that fits in memory. We use an Nvidia V100 16 GB GPU and found that with more than 31 actions, training scripts fail due to out-of-memory errors. Therefore, we set K = 31 to keep the state size under control. As mentioned in Section IV-A, we select the top 31 relevant source code files from ES and pass them to RLocator, ensuring the number of states and files remains the same.

Fig. 2. Effect of M in the reward-episode graph.

C. Developers' Workflow

Fig. 3 illustrates the interaction flow of developers using RLocator, which is represented as a central black box in the diagram. Details of RLocator are presented in Fig. 1. The process begins with RLocator receiving two primary inputs: a bug report and all the source code files, labeled as 1 and 2, respectively, in the figure. After processing these inputs, RLocator outputs a ranked list of 31 source code files, indicated by step #4. Developers then review this list to identify and select files that may contain bugs; an example is shown when a developer selects File3, noted as step #5. This selection serves as feedback to RLocator, marked by step #6, aiding in refining its bug localization strategy. The feedback from developers is expressed as a binary value: files that developers open are marked with a 1, and all other files are marked with a 0. Additionally, the system can also indicate its inability to localize a bug for a given report, as shown by step #3. This ongoing loop allows RLocator to stay updated with changes in techniques and patterns, enhancing its bug localization performance.

Fig. 3. Developer interaction flow.
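The feedback signal is simply a binary vector over the presented files; for instance (with hypothetical file names):

```python
presented = ["Editor.java", "PartStack.java", "SaveHandler.java"]
opened = {"SaveHandler.java"}  # files the developer clicked/opened

# 1 for every file the developer opened, 0 otherwise; this vector is the
# file_relevance signal fed back into the reward of Equation (1).
feedback = [1 if f in opened else 0 for f in presented]
print(feedback)  # [0, 0, 1]
```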
bug localization performance.

a limited set of files to the developer, seeking their feedback. If


V. DATASET AND EVALUATION MEASURES
a developer deems a specific file as relevant, they can click on
it. This click feedback signifies the file’s relevance within the In this section, we discuss the dataset used to train and eval-
set. RLocator leverages this input at Equation 1 and learns the uate our model (Section V-A). Then, we present the evaluation
process of bug localization. Incorporating developers’ feedback metrics we use for evaluation (Section V-B).
may cause some inconvenience. However, all machine learning
models are prone to data drift [54], [55], [56], where initial A. Dataset
training data no longer matches current data, leading to de-
clining performance. RLocator addresses this by continuously In our experiment, we evaluate our approach on six real-
updating its learning based on developers’ feedback. world open-source projects [57], which are commonly used
We limit the number of actions to 31 in RLocator. Since the benchmark datasets in bug localization studies [6], [18]. Prior
number of states equals the number of actions, we also limit the work has shown that this dataset has the lowest number of
number of states to 31. The prediction space of a reinforcement false positives and negatives compared to other datasets [3],
learning agent cannot be variable, and the number of source [58], [59], [60], [61]. Following previous studies, we train
code files is variable. Thus, we must fix the number of ac- our RLocator model separately for each of the six Apache
tions, k, to a manageable number that fits in memory. We use projects (AspectJ, Birt, Eclipse Platform UI, JDT, SWT, Tom-
an Nvidia V100 16 GB GPU and found that with more than cat). Table II shows descriptive statistics on the datasets.
31 actions, training scripts fail due to out-of-memory errors. The dataset contains metadata such as bug ID, description,
Therefore, we set K = 31 to keep the state size under control. report timestamp, commit SHA of the fixing commit and buggy
As mentioned in Section IV-A, we select the top 31 relevant source code file paths. Each bug report is associated with a com-
source code files from ES and pass them to RLocator, ensuring mit SHA/version, and we use a multiple version set matching
the number of states and files remains the same. approach to exclusively utilize the source code files linked to
each specific report [58]. This approach closely resembles the
bug localization process done by developers and reduces noise
C. Developers’ Workflow
in the dataset, improving tool performance.
Fig. 3 illustrates the interaction flow of developers using We identify the version containing the bug from the commit
RLocator, which is represented as a central black box in the di- SHA and collect all relevant source code files from that version,
agram. Details of RLocator are presented in Fig. 1. The process excluding the bug-fixing code. This ensures our bug localization
begins with RLocator receiving two primary inputs: a bug report system closely mimics real-world scenarios.
and all the source code files, labeled as 1 and 2, respectively, For training and testing, we use 91% of the data, sorting
in the figure. After processing these inputs, RLocator outputs a the dataset by the date of bug reports and splitting it 60:40


Unlike previous studies [62] that used a 60:20:20 split, we repurpose the validation data for testing to shorten the training duration.
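For illustration, this chronological split can be sketched with pandas; the file and column names below are assumptions, not from the paper:

```python
import pandas as pd

df = pd.read_csv("bug_reports.csv")      # hypothetical dataset export
df = df.sort_values("report_timestamp")  # oldest bug reports first

cut = int(len(df) * 0.6)                 # 60:40 chronological split
train, test = df.iloc[:cut], df.iloc[cut:]
```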
B. Evaluation Measures

The dataset proposed by Ye et al. [57] provides ground truth associated with each bug report. The ground truth contains the path of the file in the project repository that has been modified to fix a particular bug. To evaluate RLocator's performance, we use the ground truth and analyze the experimental results based on three criteria, which are widely adopted in bug localization studies [3], [5], [6], [7], [8].

• Mean Reciprocal Rank (MRR): To identify the average rank of the relevant file in the retrieved file set, we adopt the Mean Reciprocal Rank. MRR is the average reciprocal rank of the source code files for all the bug reports. We present the equation for calculating MRR below, where A is the set of bug reports.

$$MRR = \frac{1}{|A|} \sum_{A} \frac{1}{\text{least rank of the relevant files}} \tag{4}$$

Suppose we have two bug reports, report1 and report2. For each bug report, the bug localization model will rank six files. For report1 the ground truth of the retrieved files is [0, 0, 1, 0, 1, 0], and for report2 the ground truth of the retrieved files is [1, 0, 0, 0, 0, 1]. In this case, the least rank of relevant files is 3 and 1, respectively, for report1 and report2. Now, MRR = (1/2)(1/3 + 1/1) = 0.67.

• Mean Average Precision (MAP): To consider the case where a bug is associated with multiple source code files, we adopt Mean Average Precision. It provides a measure of the quality of the retrieval [3], [63]. MRR considers only the best rank of relevant files; on the contrary, MAP considers the ranks of all the relevant files in the retrieved file list. Thus, MAP is more descriptive and unbiased than MRR. Precision indicates how noisy the retrieval is. If we calculate the precision on the first two retrieved files, we get precision@2. For calculating average precision, we compute precision@1, precision@2, ..., precision@k, and then average the precision at the relevant points. After calculating the average precision for each bug report, we find the mean of the average precision to calculate the MAP.

$$MAP = \frac{1}{|A|} \sum_{A} \mathit{AvgPrecision}(Report_i) \tag{5}$$

We show the MAP calculation for the previous example of two bug reports. The average precision for report1 and report2 will be 0.37 and 0.67. So, MAP = (1/2)(0.37 + 0.67) = 0.52.

• Top K: For a fair comparison with prior studies [48], [49] and to present a straightforward understanding of performance, we calculate Top K. Top K measures the overall ranking performance of the bug localization model. It indicates the percentage of bug reports for which at least one buggy source file appears among the top K positions in the ranked list generated by the bug localization tool. Following previous studies (e.g., [48], [49]), we consider three values of K: 1, 5, and 10.
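The two metrics can be computed directly from the per-report ground-truth vectors; the sketch below reproduces the two-report example used in this section:

```python
def mrr(ground_truths):
    # Reciprocal of the best (lowest) rank of a relevant file, averaged (Eq. 4).
    best = [gt.index(1) + 1 for gt in ground_truths]
    return sum(1 / r for r in best) / len(ground_truths)

def mean_ap(ground_truths):
    # Precision at each relevant position, averaged per report, then
    # averaged over all reports (Eq. 5).
    aps = []
    for gt in ground_truths:
        hits, precisions = 0, []
        for i, rel in enumerate(gt, start=1):
            if rel:
                hits += 1
                precisions.append(hits / i)
        aps.append(sum(precisions) / len(precisions))
    return sum(aps) / len(aps)

reports = [[0, 0, 1, 0, 1, 0], [1, 0, 0, 0, 0, 1]]
print(round(mrr(reports), 2))      # (1/3 + 1/1) / 2 = 0.67
print(round(mean_ap(reports), 2))  # ~ (0.37 + 0.67) / 2 = 0.52
```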
VI. RLOCATOR PERFORMANCE

We evaluate RLocator on the hold-out dataset using the metrics described in Section V-B. As there has been no RL-based bug localization tool, we compare RLocator with three state-of-the-art bug localization tools: BugLocator, FLIM, and BL-GAN. A short description of the approaches is presented below.
• BugLocator [3]: an IR-based tool that utilizes a vector space model to identify the potentially responsible source code files by estimating the similarity between the source code file and the bug report.
• FLIM [18]: a deep-learning-based model that utilizes a large language model like CodeBERT.
• BL-GAN [7]: uses a generative adversarial strategy to train an attention-based transformer model.

We use the original implementations to assess the performance of BugLocator [3] and FLIM [18]. Additionally, we fine-tune a CodeBERT [28] model as a baseline to demonstrate the benefits of using reinforcement learning. For tools like CAST [5], KGBugLocator [6], and BL-GAN [7], which lack replication packages, we refer to their respective studies. These studies show that KGBugLocator outperforms CAST, and BL-GAN outperforms KGBugLocator. Consequently, we replicate BL-GAN based on its study descriptions.

Regarding FBL-BERT [8], a recent technique, we do not compare it with RLocator. This is because FBL-BERT performs bug localization at the changeset level, and applying it to our file-level dataset would disadvantage FBL-BERT, as it is designed for shorter documents. Therefore, comparing it with RLocator would be unfair.

Furthermore, other studies, such as DeepLoc [62], bjXnet [64], CAST [5], KGBugLocator [6], and Cheng et al. [65], also propose deep learning-based approaches but do not provide replication packages. Although these studies evaluate similar projects, the lack of available code or pre-trained models prevents further comparison. However, to ensure comprehensive information, we include a table in our online appendix [39] displaying their performance alongside RLocator.

A. Retrieval Performance

We use k = 31 relevant files in RLocator, allowing us to rerank files for 91% of the bug reports. Table III shows RLocator's performance on 91% and 100% of the data. RLocator is not designed for 100% of the data, as it cannot rerank files if no relevant files are in the top k files. For such cases, we estimate performance assuming zero contribution, providing a lower bound for RLocator's effectiveness. This conservative approach ensures we do not overestimate the technique's effectiveness. Table III showcases that RLocator achieves better performance than BugLocator and FLIM in both MRR and MAP across all studied projects when using the 91% data.

TABLE III
RLOCATOR PERFORMANCE (EACH CELL: 91% / 100% OF THE DATA)

Project | Model | Top 1 | Top 5 | Top 10 | MAP | MRR
AspectJ | RLocator | 0.46 / 0.40 | 0.69 / 0.63 | 0.75 / 0.70 | 0.56 / 0.46 | 0.59 / 0.50
AspectJ | BugLocator | 0.36 / 0.28 | 0.50 / 0.45 | 0.56 / 0.51 | 0.33 / 0.31 | 0.49 / 0.48
AspectJ | FLIM | 0.51 / 0.36 | 0.65 / 0.60 | 0.72 / 0.67 | 0.41 / 0.35 | 0.47 / 0.45
AspectJ | CodeBERT | 0.40 / 0.35 | 0.59 / 0.55 | 0.65 / 0.61 | 0.49 / 0.39 | 0.51 / 0.44
AspectJ | BL-GAN | 0.41 / 0.38 | 0.60 / 0.55 | 0.71 / 0.65 | 0.33 / 0.31 | 0.42 / 0.39
Birt | RLocator | 0.65 / 0.25 | 0.46 / 0.41 | 0.53 / 0.48 | 0.47 / 0.38 | 0.49 / 0.41
Birt | BugLocator | 0.61 / 0.15 | 0.27 / 0.21 | 0.34 / 0.29 | 0.30 / 0.30 | 0.39 / 0.38
Birt | FLIM | 0.49 / 0.18 | 0.39 / 0.34 | 0.47 / 0.42 | 0.29 / 0.25 | 0.31 / 0.28
Birt | CodeBERT | 0.33 / 0.22 | 0.39 / 0.35 | 0.46 / 0.43 | 0.41 / 0.33 | 0.42 / 0.35
Birt | BL-GAN | 0.17 / 0.16 | 0.33 / 0.30 | 0.46 / 0.42 | 0.32 / 0.29 | 0.40 / 0.37
Eclipse Platform UI | RLocator | 0.45 / 0.37 | 0.69 / 0.63 | 0.78 / 0.73 | 0.54 / 0.42 | 0.59 / 0.50
Eclipse Platform UI | BugLocator | 0.45 / 0.33 | 0.54 / 0.49 | 0.63 / 0.58 | 0.29 / 0.30 | 0.38 / 0.35
Eclipse Platform UI | FLIM | 0.48 / 0.41 | 0.72 / 0.67 | 0.80 / 0.75 | 0.51 / 0.48 | 0.52 / 0.53
Eclipse Platform UI | CodeBERT | 0.39 / 0.32 | 0.60 / 0.55 | 0.68 / 0.62 | 0.47 / 0.36 | 0.52 / 0.44
Eclipse Platform UI | BL-GAN | 0.34 / 0.31 | 0.53 / 0.49 | 0.66 / 0.61 | 0.32 / 0.30 | 0.40 / 0.36
JDT | RLocator | 0.44 / 0.33 | 0.67 / 0.61 | 0.78 / 0.75 | 0.51 / 0.44 | 0.53 / 0.45
JDT | BugLocator | 0.34 / 0.21 | 0.51 / 0.45 | 0.60 / 0.55 | 0.22 / 0.20 | 0.31 / 0.28
JDT | FLIM | 0.40 / 0.35 | 0.65 / 0.60 | 0.82 / 0.77 | 0.42 / 0.41 | 0.51 / 0.49
JDT | CodeBERT | 0.38 / 0.29 | 0.59 / 0.54 | 0.68 / 0.66 | 0.44 / 0.38 | 0.46 / 0.39
JDT | BL-GAN | 0.30 / 0.27 | 0.53 / 0.48 | 0.64 / 0.59 | 0.35 / 0.32 | 0.44 / 0.41
SWT | RLocator | 0.40 / 0.30 | 0.57 / 0.51 | 0.63 / 0.58 | 0.48 / 0.42 | 0.51 / 0.44
SWT | BugLocator | 0.37 / 0.25 | 0.50 / 0.45 | 0.56 / 0.51 | 0.42 / 0.40 | 0.46 / 0.43
SWT | FLIM | 0.51 / 0.37 | 0.70 / 0.65 | 0.83 / 0.78 | 0.43 / 0.43 | 0.48 / 0.50
SWT | CodeBERT | 0.34 / 0.27 | 0.50 / 0.45 | 0.54 / 0.51 | 0.42 / 0.37 | 0.45 / 0.39
SWT | BL-GAN | 0.31 / 0.29 | 0.53 / 0.48 | 0.60 / 0.55 | 0.37 / 0.34 | 0.44 / 0.40
Tomcat | RLocator | 0.46 / 0.39 | 0.61 / 0.55 | 0.73 / 0.68 | 0.59 / 0.47 | 0.62 / 0.51
Tomcat | BugLocator | 0.40 / 0.29 | 0.43 / 0.38 | 0.55 / 0.50 | 0.31 / 0.27 | 0.37 / 0.35
Tomcat | FLIM | 0.51 / 0.42 | 0.70 / 0.65 | 0.76 / 0.71 | 0.52 / 0.47 | 0.59 / 0.60
Tomcat | CodeBERT | 0.39 / 0.34 | 0.53 / 0.49 | 0.62 / 0.60 | 0.51 / 0.41 | 0.53 / 0.44
Tomcat | BL-GAN | 0.38 / 0.35 | 0.61 / 0.55 | 0.65 / 0.61 | 0.43 / 0.40 | 0.55 / 0.50

On the 91% data, RLocator outperforms FLIM by 5.56-38.3% in MAP and 3.77-36.73% in MRR. Regarding Top K, the performance improvement is up to 23.68%, 15.22%, and 11.32% in terms of Top 1, Top 5, and Top 10, respectively. Compared to BugLocator, RLocator achieves performance improvements of 12.5-56.86% and 9.8-41.51% in terms of MAP and MRR, respectively. Regarding Top K, the performance improvement is up to 26.32%, 41.3%, and 35.85% in terms of Top 1, Top 5, and Top 10, respectively. The results also indicate that RLocator consistently outperforms BL-GAN in the 91% setting across all metrics. Specifically, in the Top K measurements, RLocator's performance exceeded that of BL-GAN, with improvements ranging from 3.33% to 55.26%. Additionally, RLocator achieved performance gains of 40.74% in MAP and 32.2% in MRR.

Compared to the CodeBERT model trained as a classifier (CodeBERT), RLocator achieves better performance across all the metrics. The CodeBERT model achieves consistently lower performance across all the metrics, with drops of up to 17.65%, 15.63%, 17.95%, 17.14%, and 16.67% for Top 1, Top 5, Top 10, MAP, and MRR, respectively.

When we consider 100% of the data, RLocator has better MAP results than FLIM in three out of the six projects (AspectJ, Birt, and JDT) by 6.82-34.21%, is equal to FLIM in one project (Tomcat), and is worse than FLIM in two projects (Eclipse Platform UI and SWT) by 2-14%.


The underperformance of RLocator in the Eclipse Platform UI and SWT projects can be linked to the poor and inconsistent quality of bug reports, which creates a significant lexical gap between the reports and the source code. By applying the IMaChecker [66] approach, we discovered that AspectJ reports are of the highest quality, whereas those for Eclipse Platform UI, SWT, and Tomcat rank among the lowest. For a detailed analysis of bug report quality, please refer to the online Appendix [39]. In terms of MRR, RLocator is better than FLIM in two projects (AspectJ and Birt) by 10-31.71% and worse than FLIM in the remaining four projects (Eclipse Platform UI, JDT, SWT, and Tomcat) by 6-18%. In terms of Top K, RLocator ranks 4.29-12.5% more bugs in the top 10 positions than FLIM in two projects. On the other hand, in the rest of the four projects, FLIM ranks more bugs in the top 10 positions, with margins between 2.74-34.48%. When comparing RLocator with BugLocator on the 100% data along MAP, we find that RLocator is better in five of the six projects and similar in just the Tomcat project. With respect to MRR, RLocator is better than BugLocator in all six projects. In terms of Top K, RLocator ranks more bugs than BugLocator in the top 10 positions, where the improvement ranges between 12.07-39.58%. The results demonstrate that RLocator consistently surpasses BL-GAN in all metrics in the 100% setting. Specifically, in the Top K metric, RLocator's performance was better than BL-GAN's, with improvements ranging from 3.33% to 36%. The performance enhancements for RLocator are 32% in MAP and 1.96% in MRR.

It is important to note that MAP provides a more balanced view than MRR and Top K, since it accounts for all the files that are related to a bug report and not just one file. Additionally, in our technique, we optimize to give more accurate results for most of the bug reports rather than less accurate results on average for all the bug reports. Thus, by looking at the MAP data for the 91% setting, we can see that RLocator performs better than the state-of-the-art techniques in all projects. Even if we consider 100% of the data, RLocator is still better than other techniques in the majority of the projects. Only with 100% of the data and when using MRR as the evaluation metric does RLocator not perform better than the state-of-the-art in most projects.

RLocator performs the worst in the Birt project, with a performance drop of 10.47% in MAP, 11.71% in MRR, and 41.42% in the top 10 compared to its average on 91% of the data. Despite this, RLocator outperforms FLIM by 38.3% in MAP, 36.73% in MRR, and 11.32% in the top 10. It also surpasses BugLocator by 36.17% in MAP, 20.41% in MRR, and 35.85% in the top 10. Factors like bug report quality, the amount of information in the bug report, and source code length may contribute to the performance drop. We measure helpful information in bug reports using a similarity metric, which calculates the ratio of similar tokens between source code files and bug reports, indicating the potential usefulness of the report for bug localization. The metric is defined in Equation 6.

$$\mathit{Similarity} = \frac{|\text{Bug Report Tokens} \cap \text{File Tokens}|}{\#\text{ of unique tokens in the bug report}} \tag{6}$$
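Equation (6) in code; tokenization by lowercased whitespace splitting here is a simplifying assumption:

```python
def similarity(bug_report: str, file_text: str) -> float:
    # Ratio of unique bug report tokens that also occur in the file (Eq. 6).
    report_tokens = set(bug_report.lower().split())
    file_tokens = set(file_text.lower().split())
    return len(report_tokens & file_tokens) / len(report_tokens)

# Toy example: 2 of the 4 unique report tokens appear in the file text.
print(similarity("editor crashes on save", "the editor will save state"))  # 0.5
```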
Fig. 4. Feature importance of classifier model.

The median similarity scores for the Birt, Eclipse Platform UI, and SWT projects are 0.29, 0.30, and 0.33, respectively, making them the lowest among the six projects. This observation suggests that the lower quality of bug reports (reflected in their similarity to source files) may contribute to the decreased performance of RLocator in these projects.

To effectively use RLocator in real-world scenarios, we employ an XGBoost model (Section IV-A) to filter out bug reports where relevant files do not appear in the top K (= 31) files. We then compute the importance of the features listed in Table I using XGBoost's built-in module. The importance score indicates each feature's contribution to the model, with higher values signifying greater importance. Fig. 4 shows that similarity is the most crucial feature, followed by source code length and bug report length. These findings highlight the importance of similarity in text-based search systems and suggest that high-quality bug reports can influence localization performance.

B. Entropy Ablation Analysis: Impact of Entropy on RLocator Performance

We conduct an ablation study to gain insights into the significance of each component of RLocator. The two main components of RLocator are the ES-based shortlisting step and the reinforcement learning step.

In RL, we have used the A2C with entropy algorithm. Entropy refers to the unpredictability of an agent's actions. Low entropy indicates a predictable policy, while high entropy represents a more random and robust policy. An agent in RL will tend to repeat actions that previously resulted in positive rewards while learning the policy. The agent may become stuck in a local optimum due to exploiting learned actions instead of exploring new ones and finding a higher global optimum. This is where entropy comes in useful: we can use entropy to encourage exploration and avoid getting stuck in local optima [67]. Because of this, entropy in RL has become very popular in the design of RL approaches such as A2C [68]. In our proposed model (Section VI-A), we use A2C with entropy to train RLocator, aiming to rank relevant files closer to each other. As entropy is part of the reward, the gradient descent process will try to maximize the entropy. Entropy will increase if the model identifies different actions as the best in the same state. However, those actions must select a relevant file; otherwise, the reward will be decreased.

The observed performance of RLocator in achieving higher MAP can be attributed to two factors: 1) the way we design our reward function, given that we define a function that aims to encourage higher MAP; and 2) the inclusion of entropy, as entropy regularization is assumed to enable the model to achieve higher MAP.

Hence, to provide a better understanding of our model, we measure the performance of three different steps of our model separately: the ES-based shortlisting step, the A2C-based RL model (without entropy), and the A2C with entropy model. Due to resource (time and GPU) limitations, we limit our evaluation to half of the total projects in our dataset, i.e., AspectJ, Birt, and Eclipse Platform UI. We observe a similar trend in those three projects. Thus, we believe our results will follow a similar trend in the remaining projects.

TABLE IV
RLOCATOR PERFORMANCE WITH AND WITHOUT ENTROPY FOR A2C

    Project              Model              Top 1  Top 5  Top 10  MAP   MRR
    AspectJ              ES                 0.15   0.20   0.28    0.23  0.27
                         A2C                0.27   0.39   0.48    0.40  0.52
                         A2C with Entropy   0.46   0.69   0.75    0.56  0.59
    Birt                 ES                 0.10   0.14   0.17    0.18  0.23
                         A2C                0.21   0.30   0.43    0.31  0.42
                         A2C with Entropy   0.38   0.46   0.53    0.47  0.49
    Eclipse Platform UI  ES                 0.09   0.15   0.19    0.25  0.31
                         A2C                0.25   0.38   0.51    0.39  0.51
                         A2C with Entropy   0.45   0.69   0.78    0.54  0.59

Table IV presents the performance of the three choices (i.e., ES, A2C only, and A2C with Entropy). Table IV shows that ES achieves the baseline performance, which is 53–61% lower than the A2C with entropy model in terms of MAP and 47–54% lower in terms of MRR. We also find that the MRR and MAP of the models without entropy are lower than those of the A2C with entropy models. Table IV shows that in terms of MAP, the performance of the A2C with entropy models is higher than that of the A2C models by a range of 27.78–34.04%. In MRR and the top 10, the A2C with entropy model achieves higher performance by a range of 11.86–13.56% and 18.87–36%, respectively. Such results indicate that entropy can substantially contribute to the model performance regarding MAP, MRR, and Top K. Moreover, this shows that the use of entropy encourages the RL agent to explore possible alternate policies [69]; thus, it has a higher chance of finding a better policy for solving the problem in a given environment than the A2C model.
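For reference, the measures compared in Table IV can be computed per bug report as in the sketch below; this is a simplified illustration of the standard definitions, not the evaluation scripts used in our experiments. MAP, MRR, and Top K are then obtained by averaging the per-report scores over all bug reports.

    def top_k_hit(ranked, relevant, k):
        # 1 if any relevant file appears in the first k positions, else 0.
        return int(any(f in relevant for f in ranked[:k]))

    def reciprocal_rank(ranked, relevant):
        # 1 / rank of the first relevant file; 0 if none is retrieved.
        for rank, f in enumerate(ranked, start=1):
            if f in relevant:
                return 1.0 / rank
        return 0.0

    def average_precision(ranked, relevant):
        # Mean of precision@i over every position i that holds a relevant file.
        hits, precisions = 0, []
        for rank, f in enumerate(ranked, start=1):
            if f in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(relevant) if relevant else 0.0

    ranked = ["A.java", "B.java", "C.java", "D.java"]   # model output
    relevant = {"B.java", "D.java"}                     # ground-truth buggy files
    print(reciprocal_rank(ranked, relevant))    # 0.5: first hit at rank 2
    print(average_precision(ranked, relevant))  # (1/2 + 2/4) / 2 = 0.5
    print(top_k_hit(ranked, relevant, 1))       # 0: no relevant file at rank 1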
VII. RELATED WORK

The work most related to our study falls into studies on bug localization techniques. In the following, we discuss the related work and reflect on how it compares with ours.

A plethora of work studied how developers localize bugs [70], [71], [72], [73]. For example, Böhme et al. [70] studied how developers debug. They found that the most popular technique for localizing a bug is forward reasoning, where developers go through each computational step of a failing test case to identify the location. Zimmermann et al. [73] studied the characteristics of a good bug report and found that test cases and stack traces are among the most important criteria that make a good bug report. While these studies focused on developers' manual localization of bugs, our approach examines the automation of the bug localization process for developers.

Several studies offered test case coverage-based solutions for bug localization [74], [75], [76]. Vancsics et al. [74] proposed a count-based spectrum instead of a hit-based spectrum in the Spectrum-Based Fault Localization (SBFL) tool. GRACE [75] proposed gated graph neural network-based representation learning to improve the SBFL technique. However, these studies mainly utilize test cases to localize bugs, whereas our approach (RLocator) focuses on the bug report for bug localization.

There were several efforts to assess the impact of query reformulation on improving the performance of existing bug localization tools [77], [78]. For example, Rahman et al. [77] found that instead of using the full bug report as a full-text query, a reformulated query with some additional expansion performs better. BugLocator [3] used a revised Vector Space Model (rVSM) to estimate the textual similarity between bug reports and source code files.

A few studies incorporated the information of program structures such as the Program Dependence Graph (PDG), Data Flow Graph (DFG) [79], and Abstract Syntax Tree (AST) for learning source code representation [4], [5], [62], [64], [79]. For example, CAST [5] used the AST of the source code to extract semantic information and then used Word2Vec to project the source code and the bug report into the same embedding space. They used a CNN model that measures the similarity between a bug report and source code, and the model ranks the files based on the calculated similarity. Hyloc [9] combined the techniques of IR-based bug localization with deep learning. It concatenated the TF-IDF vector of the source code with repository and file-level metadata.

Other studies applied several deep learning-based approaches for bug localization [6], [64], [80]. DEMOB [80] used attention on ELMo [81] embeddings, whereas KGBugLocator [6] used attention on graph embeddings. BL-GAN [7] offered a generative adversarial network (GAN)-based solution for bug localization. GAN is often seen as a methodology closely related to reinforcement learning, but it diverges from the typical use of the Markov decision process (MDP), a fundamental aspect of reinforcement learning [14], [82]. Additionally, BL-GAN has limitations in actively learning bug localization from developers' real-time actions. In contrast, we have incorporated developers' feedback directly into the reward function, allowing RLocator to learn from developers' actions.


Xie et al. [83] employed GAN to create failing test cases, addressing the data imbalance issue within fault localization methods. White et al. [84] utilized reinforcement learning for fault localization in distributed networks. Nonetheless, their approach involves the reinforcement learning agent learning how to interact with the network, which is different from our approach, in which the agent learns to localize bugs from developers' activity. Rezapour et al. [85] provided a thorough exploration of reinforcement learning's application in fault localization within power systems. However, the approaches they discuss differ significantly from ours. In the context of power systems, the reinforcement learning agent can directly observe physical quantities in the environment (e.g., current, voltage). Moreover, the agents are allowed to probe the environment by changing the current or voltage. In contrast, bug localization involves more abstract components in the environment (e.g., interacting code blocks), and the agent is not allowed to change or execute the code. Other studies focused on associating commits with bug reports [8], [86]. For example, FBL-BERT [8] used CodeBERT embeddings to estimate the similarity between source code files and the changesets of a commit, and ranked suspicious commits based on that similarity. FLIM [18] also used CodeBERT embeddings to estimate similarity; however, FLIM works on function-level bug localization.

Our approach, RLocator, uses deep reinforcement learning for bug localization, differing from previous similarity-based methods. By formulating the problem as a Markov Decision Process (MDP), we directly optimize the evaluation measures. Testing on a dataset of 8,316 bug reports from six popular Apache projects, our results show significant performance improvement.

VIII. THREATS TO VALIDITY

RLocator has a number of limitations as well. We identify them below and discuss how we address them.

Internal Validity. One limitation of our approach is that we are unable to utilize 9% of our dataset due to the limitation of text-based search. One may point out that we exclude the bug reports on which we do not perform well. However, the XGBoost model in our approach identifies these reports automatically, and we would rather not localize the source code files for these bug reports than localize them incorrectly. Hence, developers need to rely on their manual analysis only for that 9%. Moreover, as a measure of full transparency, we estimate the lower bound of RLocator performance on 100% of the data and show that the difference is negligible.
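As a rough illustration of this filtering step, the sketch below trains a binary XGBoost classifier on the three features examined earlier in this section (similarity, source code length, and bug report length). The toy data, feature set, and hyperparameters are assumptions for illustration and do not reflect RLocator's exact configuration.

    import numpy as np
    from xgboost import XGBClassifier

    # Toy training data: one row per bug report with the three features
    # (similarity score, source code length, bug report length).
    # Label 1 marks reports that the text-based search could localize.
    X = np.array([[0.62, 1200, 85],
                  [0.05,  400, 12],
                  [0.48, 3000, 60],
                  [0.02,  150,  8]])
    y = np.array([1, 0, 1, 0])

    clf = XGBClassifier(n_estimators=50, max_depth=3)
    clf.fit(X, y)

    # At inference time, skip localization for reports predicted unlocalizable,
    # deferring those (roughly 9% in our data) to manual analysis.
    new_report = np.array([[0.03, 500, 10]])
    if clf.predict(new_report)[0] == 0:
        print("Defer to manual analysis")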
External Validity. The primary concern for the external validity of the RLocator evaluation stems from its limitation to a small number of bugs in six varied, real-world open-source projects, potentially impacting its broad applicability. However, those projects are from different domains and were used by prior studies [4], [5], [6], [9], [80], [87]. Furthermore, the A2C without entropy model was only evaluated on three projects because of the substantial resources required for training, which took about four days on an Nvidia V100 16GB GPU. The uniform outcomes across these projects indicate that similar results could be expected in the remaining projects. Additionally, due to the absence of a replication package, we replicated BL-GAN based on its description in the original study, which may lead to slight performance deviations. Nevertheless, after experimenting with various hyperparameters, we selected a set that achieves comparable performance to that reported in the original study.

Construct Validity. Finally, our evaluation measures might be one threat to construct validity. The evaluation measures may not completely reflect real-world situations. This threat is mitigated by the fact that the evaluation measures we use are well known [3], [8], [18], [57] and are the best available for measuring and comparing the performance of information retrieval-based bug localization tools.

IX. CONCLUSION

In this paper, we propose RLocator, a reinforcement learning (RL)-based technique to rank the source code files in which a bug may reside, given the bug report. The key contribution of our study is the formulation of the bug localization problem as a Markov Decision Process (MDP), which helps us optimize the evaluation measures directly. We evaluate RLocator on 8,316 bug reports and find that it performs better than the state-of-the-art techniques when using MAP as an evaluation measure. On the 91% bug report dataset, RLocator outperforms prior tools in all projects in terms of both MAP and MRR. When using 100% of the data, RLocator outperforms all prior approaches in four of the six projects using MAP and in two of the six projects using MRR. RLocator can be used along with other bug localization approaches to improve performance. Our results show that RL is a promising avenue for future exploration when it comes to advancing state-of-the-art techniques for bug localization. Future research can explore the application of advanced reinforcement learning algorithms in bug localization. Additionally, researchers can investigate how training on larger datasets impacts the performance of tools in low-similarity contexts.

DATA AVAILABILITY STATEMENT

To foster future research in the field, we make a replication package comprising our dataset and code publicly available [39].

REFERENCES

[1] T. D. LaToza and B. A. Myers, “Developers ask reachability questions,” in Proc. 32nd ACM/IEEE Int. Conf. Softw. Eng. (ICSE), New York, NY, USA: ACM, 2010, pp. 185–194.
[2] J. Anvik, L. Hiew, and G. C. Murphy, “Coping with an open bug repository,” in Proc. OOPSLA Workshop Eclipse Technol. eXchange—Eclipse, New York, NY, USA: ACM, 2005, pp. 35–39.
[3] J. Zhou, H. Zhang, and D. Lo, “Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports,” in Proc. 34th Int. Conf. Softw. Eng. (ICSE), Piscataway, NJ, USA: IEEE Press, Jun. 2012, pp. 14–24.
[4] Y. Xiao, J. Keung, K. E. Bennin, and Q. Mi, “Machine translation-based bug localization technique for bridging lexical gap,” Inf. Softw. Technol., vol. 99, pp. 58–61, Jul. 2018.

[5] H. Liang, L. Sun, M. Wang, and Y. Yang, “Deep learning with customized abstract syntax tree for bug localization,” IEEE Access, vol. 7, pp. 116309–116320, 2019.
[6] J. Zhang, R. Xie, W. Ye, Y. Zhang, and S. Zhang, “Exploiting code knowledge graph for bug localization via bi-directional attention,” in Proc. 28th Int. Conf. Program Comprehension, New York, NY, USA: ACM, Jul. 2020, pp. 219–229.
[7] Z. Zhu, H. Tong, Y. Wang, and Y. Li, “BL-GAN: Semi-supervised bug localization via generative adversarial network,” IEEE Trans. Knowl. Data Eng., vol. 35, no. 11, pp. 11112–11125, Nov. 2023.
[8] A. Ciborowska and K. Damevski, “Fast changeset-based bug localization with BERT,” in Proc. 44th Int. Conf. Softw. Eng., New York, NY, USA: ACM, May 2022, pp. 946–957.
[9] A. N. Lam, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen, “Combining deep learning with information retrieval to localize buggy files for bug reports (N),” in Proc. 30th IEEE/ACM Int. Conf. Automated Softw. Eng. (ASE), Piscataway, NJ, USA: IEEE Press, Nov. 2015, pp. 476–481.
[10] Z. Wei, J. Xu, Y. Lan, J. Guo, and X. Cheng, “Reinforcement learning to rank with Markov decision process,” in Proc. 40th Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, New York, NY, USA: ACM, Aug. 2017, pp. 945–948.
[11] O. Alejo, J. M. Fernandez-Luna, J. F. Huete, and R. Perez-Vazquez, “Direct optimization of evaluation measures in learning to rank using particle swarm,” in Proc. Workshops Database Expert Syst. Appl., Piscataway, NJ, USA: IEEE Press, Aug. 2010, pp. 42–46.
[12] J. Xu, L. Xia, Y. Lan, J. Guo, and X. Cheng, “Directly optimize diversity evaluation measures,” ACM Trans. Intell. Syst. Technol., vol. 8, no. 3, pp. 1–26, Jan. 2017.
[13] Y. Yue, T. Finley, F. Radlinski, and T. Joachims, “A support vector method for optimizing average precision,” in Proc. 30th Annu. Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, New York, NY, USA: ACM, Jul. 2007, pp. 271–278.
[14] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book, 2018.
[15] F. Garcia and E. Rachelson, “Markov decision processes,” in Markov Decision Processes in Artificial Intelligence. Hoboken, NJ, USA: Wiley, Mar. 2013, pp. 1–38.
[16] J. Fan, C. Xiao, and Y. Gdi, “Rethinking what makes reinforcement learning different from supervised learning,” Jun. 2021, arXiv:2106.06232.
[17] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” Int. J. Robot. Res., vol. 32, no. 11, pp. 1238–1274, Aug. 2013.
[18] H. Liang, D. Hang, and X. Li, “Modeling function-level interactions for file-level bug localization,” Empirical Softw. Eng., vol. 27, no. 7, pp. 186–212, Oct. 2022.
[19] L. Maystre, D. Russo, and Y. Zhao, “Optimizing audio recommendations for the long-term: A reinforcement learning perspective,” Feb. 2023, arXiv:2302.03561.
[20] M. Chen, A. Beutel, P. Covington, S. Jain, F. Belletti, and E. Chi, “Top-K off-policy correction for a REINFORCE recommender system,” in Proc. 12th ACM Int. Conf. Web Search Data Mining (WSDM ’19), New York, NY, USA: Association for Computing Machinery, 2019, pp. 456–464.
[21] C. Yu, J. Liu, S. Nemati, and G. Yin, “Reinforcement learning in healthcare: A survey,” ACM Comput. Surv., vol. 55, no. 1, pp. 1–36, Nov. 2021.
[22] E. Winter et al., “How do developers really feel about bug fixing? Directions for automatic program repair,” IEEE Trans. Softw. Eng., vol. 49, no. 4, pp. 1823–1841, Apr. 2023.
[23] S. Wang, T. Liu, and L. Tan, “Automatically learning semantic features for defect prediction,” in Proc. 38th Int. Conf. Softw. Eng., Piscataway, NJ, USA: ACM, May 2016, pp. 297–308.
[24] N. Miryeganeh, S. Hashtroudi, and H. Hemmati, “GloBug: Using global data in fault localization,” J. Syst. Softw., vol. 177, Jul. 2021, Art. no. 110961.
[25] Y. Kim, M. Kim, and E. Lee, “Feature combination to alleviate hubness problem of source code representation for bug localization,” in Proc. 27th Asia-Pacific Softw. Eng. Conf. (APSEC), Piscataway, NJ, USA: IEEE Press, Dec. 2020, pp. 511–512.
[26] L. Chen, Z. Tang, and G. H. Yang, “Balancing reinforcement learning training experiences in interactive information retrieval,” in Proc. 43rd Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, New York, NY, USA: ACM, Jul. 2020, pp. 1525–1528.
[27] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, Human Lang. Technol., Minneapolis, Minnesota: Assoc. Comput. Linguistics, Jun. 2019, pp. 4171–4186.
[28] Z. Feng et al., “CodeBERT: A pre-trained model for programming and natural languages,” in Findings Assoc. Comput. Linguistics (EMNLP), Nov. 2020, pp. 1536–1547.
[29] J. Wang and Y. Dong, “Measurement of text similarity: A survey,” Information, vol. 11, no. 9, p. 421, Aug. 2020.
[30] S. Fujimoto, D. Meger, and D. Precup, “Off-policy deep reinforcement learning without exploration,” in Proc. 36th Int. Conf. Mach. Learn., vol. 97, K. Chaudhuri and R. Salakhutdinov, Eds., PMLR, Jun. 2019, pp. 2052–2062.
[31] T. T. Nguyen and V. J. Reddi, “Deep reinforcement learning for cyber security,” IEEE Trans. Neural Netw. Learn. Syst., vol. 34, no. 8, pp. 3779–3795, Aug. 2023.
[32] C. Liu, X. Xia, D. Lo, Z. Liu, A. E. Hassan, and S. Li, “CodeMatcher: Searching code based on sequential semantics of important query words,” ACM Trans. Softw. Eng. Methodol., vol. 31, no. 1, pp. 1–37, Jan. 2022.
[33] T. Chen and C. Guestrin, “XGBoost,” in Proc. 22nd ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, New York, NY, USA: ACM, Aug. 2016, pp. 785–794.
[34] F. Fang, J. Wu, Y. Li, X. Ye, W. Aljedaani, and M. W. Mkaouer, “On the classification of bug reports to improve bug localization,” Soft Comput., vol. 25, no. 11, pp. 7307–7323, Mar. 2021.
[35] Y. Lv and C. Zhai, “When documents are very long, BM25 fails!” in Proc. 34th Int. ACM SIGIR Conf. Res. Develop. Inf. (SIGIR), New York, NY, USA: ACM, 2011, pp. 1103–1104.
[36] D. D. Lewis, “Naive (Bayes) at forty: The independence assumption in information retrieval,” in Proc. Mach. Learn. (ECML-98), Berlin, Heidelberg: Springer Berlin Heidelberg, 1998, pp. 4–15.
[37] A. Schröter, N. Bettenburg, and R. Premraj, “Do stack traces help developers fix bugs?” in Proc. 7th IEEE Work. Conf. Mining Softw. Repositories (MSR), Piscataway, NJ, USA: IEEE Press, May 2010, pp. 118–121.
[38] Y. Lv and C. Zhai, “Lower-bounding term frequency normalization,” in Proc. 20th ACM Int. Conf. Inf. Knowl. Manage. (CIKM), New York, NY, USA: ACM, 2011, pp. 7–16.
[39] “RLocator: Reinforcement learning for bug localization.” Accessed: May 23, 2024. [Online]. Available: https://zenodo.org/record/7591879
[40] M. Bagherzadeh, N. Kahani, and L. Briand, “Reinforcement learning for test case prioritization,” IEEE Trans. Softw. Eng., vol. 48, no. 8, pp. 2836–2856, Aug. 2022.
[41] Y. Wan et al., “Improving automatic source code summarization via deep reinforcement learning,” in Proc. 33rd ACM/IEEE Int. Conf. Automated Softw. Eng., New York, NY, USA: ACM, Sep. 2018, pp. 397–407.
[42] X. He, L. Liao, H. Zhang, L. Nie, X. Hu, and T.-S. Chua, “Neural collaborative filtering,” in Proc. 26th Int. Conf. World Wide Web, Int. World Wide Web Conf. Steering Committee, Apr. 2017, pp. 173–182.
[43] H. Zhang, Y. Yang, H. Luan, S. Yang, and T.-S. Chua, “Start from scratch,” in Proc. 22nd ACM Int. Conf. Multimedia, New York, NY, USA: ACM, Nov. 2014, pp. 187–196.
[44] R. Zhu, X. Tu, and J. X. Huang, “Deep learning on information retrieval and its applications,” in Deep Learning for Data Analytics, Washington, USA: Elsevier, 2020, pp. 125–153.
[45] O. Khattab and M. Zaharia, “ColBERT,” in Proc. 43rd Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, New York, NY, USA: ACM, Jul. 2020, pp. 39–48.
[46] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Comput., vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[47] L. Pang, Y. Lan, J. Guo, J. Xu, J. Xu, and X. Cheng, “DeepRank,” in Proc. ACM Conf. Inf. Knowl. Manage., New York, NY, USA: ACM, Nov. 2017, pp. 257–266.
[48] X. Huo, F. Thung, M. Li, D. Lo, and S.-T. Shi, “Deep transfer bug localization,” IEEE Trans. Softw. Eng., vol. 47, no. 7, pp. 1368–1380, Jul. 2021.
[49] X. Huo, M. Li, and Z.-H. Zhou, “Learning unified features from natural and programming languages for locating buggy source code,” in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2016, pp. 1606–1612.
[50] M. J. Hausknecht and P. Stone, “Deep recurrent Q-learning for partially observable MDPs,” 2015, arXiv:1507.06527.
[51] I. Bello, H. Pham, Q. V. Le, M. Norouzi, and S. Bengio, “Neural combinatorial optimization with reinforcement learning,” 2016, arXiv:1611.09940.
[52] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proc. 35th Int. Conf. Mach. Learn., J. Dy and A. Krause, Eds., vol. 80, PMLR, Jul. 2018, pp. 1861–1870.


[53] C. Wang, C. Xu, X. Yao, and D. Tao, “Evolutionary generative adversarial networks,” IEEE Trans. Evol. Comput., vol. 23, no. 6, pp. 921–934, Dec. 2019.
[54] E. Rabinovich, M. Vetzler, S. Ackerman, and A. Anaby Tavor, “Reliable and interpretable drift detection in streams of short texts,” in Proc. 61st Annu. Meeting Assoc. Comput. Linguistics (Volume 5: Industry Track), Assoc. Comput. Linguistics, 2023.
[55] M. R. Islam and M. F. Zibran, “What changes in where?: An empirical study of bug-fixing change patterns,” ACM SIGAPP Appl. Comput. Rev., vol. 20, no. 4, pp. 18–34, Jan. 2021.
[56] W. Aljedaani and Y. Javed, Bug Reports Evolution in Open Source Systems. Springer Int. Publishing, 2018, pp. 63–73.
[57] X. Ye, R. Bunescu, and C. Liu, “Learning to rank relevant files for bug reports using domain knowledge,” in Proc. 22nd ACM SIGSOFT Int. Symp. Found. Softw. Eng. (FSE), New York, NY, USA: ACM, 2014.
[58] J. Lee, D. Kim, T. F. Bissyandé, W. Jung, and Y. L. Traon, “Bench4bl: Reproducibility study on the performance of IR-based bug localization,” in Proc. 27th ACM SIGSOFT Int. Symp. Softw. Testing Anal., New York, NY, USA: ACM, Jul. 2018, pp. 61–72.
[59] B. Dit, M. Revelle, M. Gethers, and D. Poshyvanyk, “Feature location in source code: A taxonomy and survey,” J. Softw., Evol. Process, vol. 25, no. 1, pp. 53–95, Nov. 2011.
[60] L. Moreno et al., “Query-based configuration of text retrieval solutions for software engineering tasks,” in Proc. 2015 10th Joint Meeting Found. Softw. Eng., New York, NY, USA: ACM, Aug. 2015.
[61] B. Sisman and A. C. Kak, “Assisting code search with automatic query reformulation for bug localization,” in Proc. 10th Work. Conf. Mining Softw. Repositories (MSR), Piscataway, NJ, USA: IEEE Press, May 2013, pp. 309–318.
[62] Y. Xiao, J. Keung, K. E. Bennin, and Q. Mi, “Improving bug localization with word embedding and enhanced convolutional neural networks,” Inf. Softw. Technol., vol. 105, pp. 17–29, Jan. 2019.
[63] M. N. Schwarz and A. Flammer, “Text structure and title—Effects on comprehension and recall,” J. Verbal Learn. Verbal Behav., vol. 20, no. 1, pp. 61–66, Feb. 1981.
[64] J. Han, C. Huang, S. Sun, Z. Liu, and J. Liu, “bjXnet: An improved bug localization model based on code property graph and attention mechanism,” Automated Softw. Eng., vol. 30, no. 1, Mar. 2023.
[65] S. Cheng, X. Yan, and A. A. Khan, “A similarity integration method based information retrieval and word embedding in bug localization,” in Proc. 20th IEEE Int. Conf. Softw. Qual., Rel. Secur. (QRS), Piscataway, NJ, USA: IEEE Press, Dec. 2020, pp. 180–187.
[66] M. Soltani, F. Hermans, and T. Bäck, “The significance of bug report elements,” Empirical Softw. Eng., vol. 25, no. 6, pp. 5255–5294, Sep. 2020.
[67] Z. Ahmed, N. Le Roux, M. Norouzi, and D. Schuurmans, “Understanding the impact of entropy on policy optimization,” in Proc. 36th Int. Conf. Mach. Learn., K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97, PMLR, Jun. 2019, pp. 151–160.
[68] S. Jang and H.-I. Kim, “Entropy-aware model initialization for effective exploration in deep reinforcement learning,” Sensors, vol. 22, no. 15, Aug. 2022, Art. no. 5845.
[69] V. Mnih et al., “Asynchronous methods for deep reinforcement learning,” in Proc. 33rd Int. Conf. Mach. Learn., M. F. Balcan and K. Q. Weinberger, Eds., vol. 48, New York, NY, USA: PMLR, Jun. 2016, pp. 1928–1937.
[70] M. Böhme, E. O. Soremekun, S. Chattopadhyay, E. Ugherughe, and A. Zeller, “Where is the bug and how is it fixed? An experiment with practitioners,” in Proc. 11th Joint Meeting Found. Softw. Eng., New York, NY, USA: ACM, Aug. 2017.
[71] T. D. Sasso, A. Mocci, and M. Lanza, “What makes a satisficing bug report?” in Proc. IEEE Int. Conf. Softw. Qual., Rel. Secur. (QRS), Piscataway, NJ, USA: IEEE Press, Aug. 2016, pp. 164–174.
[72] N. Bettenburg, S. Just, A. Schröter, C. Weiss, R. Premraj, and T. Zimmermann, “What makes a good bug report?” in Proc. 16th ACM SIGSOFT Int. Symp. Found. Softw. Eng., New York, NY, USA: ACM, Nov. 2008.
[73] T. Zimmermann, R. Premraj, N. Bettenburg, S. Just, A. Schroter, and C. Weiss, “What makes a good bug report?” IEEE Trans. Softw. Eng., vol. 36, no. 5, pp. 618–643, Sep. 2010.
[74] B. Vancsics, F. Horváth, A. Szatmári, and Á. Beszédes, “Fault localization using function call frequencies,” J. Syst. Softw., vol. 193, Nov. 2022, Art. no. 111429.
[75] Y. Lou et al., “Boosting coverage-based fault localization via graph-based representation learning,” in Proc. 29th ACM Joint Meeting Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., New York, NY, USA: ACM, Aug. 2021.
[76] Y. Kim, S. Mun, S. Yoo, and M. Kim, “Precise learn-to-rank fault localization using dynamic and static features of target programs,” ACM Trans. Softw. Eng. Methodol., vol. 28, no. 4, pp. 1–34, Oct. 2019.
[77] M. M. Rahman, F. Khomh, S. Yeasmin, and C. K. Roy, “The forgotten role of search queries in IR-based bug localization: An empirical study,” Empirical Softw. Eng., vol. 26, no. 6, Aug. 2021.
[78] J. M. Florez, O. Chaparro, C. Treude, and A. Marcus, “Combining query reduction and expansion for text-retrieval-based bug localization,” in Proc. IEEE Int. Conf. Softw. Anal., Evol. Reeng. (SANER), Piscataway, NJ, USA: IEEE Press, Mar. 2021, pp. 166–176.
[79] Y. Li, S. Wang, T. N. Nguyen, and S. V. Nguyen, “Improving bug detection via context-based code representation learning and attention-based neural networks,” Proc. ACM Program. Lang., vol. 3, no. OOPSLA, pp. 1–30, Oct. 2019.
[80] Z. Zhu, Y. Li, Y. Wang, Y. Wang, and H. Tong, “A deep multimodal model for bug localization,” Data Mining Knowl. Discovery, Apr. 2021.
[81] M. E. Peters et al., “Deep contextualized word representations,” CoRR, 2018, arXiv:1802.05365.
[82] R. S. Sutton, D. Precup, and S. Singh, “Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning,” Artif. Intell., vol. 112, nos. 1–2, pp. 181–211, Aug. 1999.
[83] H. Xie, Y. Lei, M. Yan, Y. Yu, X. Xia, and X. Mao, “A universal data augmentation approach for fault localization,” in Proc. 44th Int. Conf. Softw. Eng., New York, NY, USA: ACM, May 2022, pp. 48–60.
[84] T. White and B. Pagurek, “Distributed fault location in networks using learning mobile agents,” in Approaches to Intelligence Agents. Berlin Heidelberg: Springer Berlin Heidelberg, 1999, pp. 182–196.
[85] H. Rezapour, S. Jamali, and A. Bahmanyar, “Review on artificial intelligence-based fault location methods in power distribution networks,” Energies, vol. 16, no. 12, Jun. 2023, Art. no. 4636.
[86] C. Ni, W. Wang, K. Yang, X. Xia, K. Liu, and D. Lo, “The best of both worlds: Integrating semantic features with expert features for defect prediction and localization,” in Proc. 30th ACM Joint Eur. Softw. Eng. Conf. Symp. Found. Softw. Eng., New York, NY, USA: ACM, Nov. 2022.
[87] B. Wang, L. Xu, M. Yan, C. Liu, and L. Liu, “Multi-dimension convolutional neural network for bug localization,” IEEE Trans. Services Comput., vol. 15, no. 3, pp. 1649–1663, May/Jun. 2022.

Partha Chakraborty (Student Member, IEEE) is currently working toward the Ph.D. degree with David R. Cheriton School of Computer Science, University of Waterloo, Canada. His research interests include bug localization, vulnerability detection, and the use of machine learning techniques in software engineering. For more information, see https://parthac.me/.

Mahmoud Alfadel (Member, IEEE) is an Assistant Professor with the Department of Computer Science, University of Calgary. His research interests include mining software repositories, software ecosystems, open-source security, and release engineering.

Meiyappan Nagappan is an Associate Professor with David R. Cheriton School of Computer Science, University of Waterloo. He has worked on empirical software engineering to address software development concerns and currently researches the impact of large language models on software development.
