
Pre-training Code Representation with Semantic Flow Graph for Effective Bug Localization

Yali Du
Shandong University, China
[email protected]

Zhongxing Yu∗
Shandong University, China
[email protected]
ABSTRACT
Enlightened by the big success of pre-training in natural language processing, pre-trained models for programming languages have been widely used to promote code intelligence in recent years. In particular, BERT has been used for bug localization tasks and impressive results have been obtained. However, these BERT-based bug localization techniques suffer from two issues. First, the pre-trained BERT model on source code does not adequately capture the deep semantics of program code. Second, the overall bug localization models neglect the necessity of large-scale negative samples in contrastive learning for representations of changesets and ignore the lexical similarity between bug reports and changesets during similarity estimation. We address these two issues by 1) proposing a novel directed, multiple-label code graph representation named Semantic Flow Graph (SFG), which compactly and adequately captures code semantics, 2) designing and training SemanticCodeBERT based on SFG, and 3) designing a novel Hierarchical Momentum Contrastive Bug Localization technique (HMCBL). Evaluation results show that our method achieves state-of-the-art performance in bug localization.

CCS CONCEPTS
• Software and its engineering → Software testing and debugging; Maintaining software.

KEYWORDS
bug localization, semantic flow graph, type, computation role, pre-trained model, contrastive learning

ACM Reference Format:
Yali Du and Zhongxing Yu. 2023. Pre-training Code Representation with Semantic Flow Graph for Effective Bug Localization. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '23), December 3–9, 2023, San Francisco, CA, USA. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3611643.3616338

∗ Zhongxing Yu is the corresponding author.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ESEC/FSE '23, December 3–9, 2023, San Francisco, CA, USA
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-8-4007-0327-0/23/12. $15.00
https://doi.org/10.1145/3611643.3616338

1 INTRODUCTION
While modern software engineering recognizes a broad range of methods (e.g., model checking, symbolic execution, type checking) for helping ensure that the software meets the specification of its desirable behavior, the software (even deployed ones) is still unfortunately plagued with heterogeneous bugs for reasons such as programming errors made by developers and immature development processes. The process of resolving the resultant bugs, termed debugging, is an indispensable yet frustrating activity that can easily account for a significant part of software development and maintenance costs [76]. To tackle the ever-growing high costs involved in debugging, a variety of automatic techniques have been proposed as debugging aids for developers over the past decades [88]. In particular, numerous methods have been developed to facilitate fault localization, which aims to identify the exact locations of program bugs and is one of the most expensive, tedious, and time-consuming activities in debugging [80, 85].

The literature on fault localization is rich and abundant with methods stemming from ideas that originate from several different disciplines, notably including statistical analysis [37, 48, 84, 86], program transformation [60], and information retrieval [68, 71]. Among them, information retrieval-based methods typically proceed by establishing the relevance between bug reports and related software artifacts on the ground of information retrieval techniques, and this category of methods is appealing as it is amenable to the mainstream development practice which features continuous integration (CI), versioning with Git, and collaboration within platforms like GitHub [75]. In line with existing literature, information retrieval-based fault localization is hereafter simply referred to as bug localization.

The matched software artifact at the early phase of bug localization research focuses on code elements such as classes and methods [41, 72], but recent years have witnessed a growing interest in changesets [18, 68, 79, 81]. The key advantage of changesets is that they contain simultaneously changed parts of the code that are related, facilitating bug fixing. With regard to information retrieval techniques, the major shift is that the dominating techniques have changed from the Vector Space Model (VSM) to deep learning techniques, both for code elements and changesets. To precisely locate the bug, bug localization techniques essentially need to accurately relate the natural language used to describe the bug (in the bug report) and the identifier naming practices adopted by developers (in the software artifacts). However, it is quite common that there exists a significant lexical gap between them, and consequently, the retrieval quality of bug localization techniques is not always satisfactory [96]. To overcome the issue, bug localization techniques necessarily need to go beyond exact term matching and establish the semantic relatedness between bug reports and software artifacts.
Given that deep learning architectures are capable of leveraging contextual information and have achieved impressive progress in natural language processing, a number of bug localization techniques based on neural networks have been proposed in recent years [17, 32, 44, 55, 57, 64, 83, 94, 95]. In particular, the state-of-the-art transformer-based architecture BERT [21] (bidirectional encoder representation from the transformer) has been widely employed [17, 47]. Based on the naturalness hypothesis, which states that "software corpora have similar statistical properties to natural language corpora" [29], these BERT-based techniques first pre-train a BERT model on a massive corpus of source code using certain pre-training tasks such as masked language modeling, and then fine-tune the trained BERT model for the bug localization task. Experimental evaluations have shown that reasonable accuracy improvements can be obtained by these BERT-based techniques.

Despite the progress made, one drawback of these BERT-based techniques is that the pre-trained BERT model on source code does not adequately capture the deep semantics of program code. Unlike natural language, the programming language has a formal structure, which provides important code semantics that is unambiguous in general [2]. However, the existing pre-trained BERT models either totally ignore the code structure by treating the code snippet as a sequence of tokens, same as natural language, or consider only the shallow structure of the code by using graph code representations such as the data flow graph [25]. Consequently, the formal code structure has not been fully exploited, resulting in an under-optimal BERT model. To overcome this issue, we in this paper present a novel code graph representation termed Semantic Flow Graph (SFG), which compactly and adequately captures code semantics. SFG is a directed, multiple-label graph that captures not only the data flow and control flow between program elements but also the type of program element and the specific role that a certain program element plays in computation. On the ground of SFG, we further propose SemanticCodeBERT, a pre-training model with BERT-like architecture to learn code representation that considers deep code structure. SemanticCodeBERT features novel pre-training tasks besides the ordinary masked language modeling task.

In addition, the overall models of existing BERT-based bug localization techniques ignore several points which are beneficial for further improving performance. First, the batch size is typically limited to save model space because of the huge scale of BERT parameters, and the number of negative samples coupled to batch size is thus limited. A variety of existing methods [9, 11, 13–16, 22, 28, 39, 45, 65, 66, 74, 78, 82, 89] emphasize the necessity of large-scale negative samples in contrastive representation learning. In the bug localization context, this implies the importance of considering large-scale negative sample interactions for representation learning of bug reports and changesets. Nevertheless, existing techniques like Ciborowska et al. [17] only select one irrelevant changeset in training as the negative sample for a bug report, which causes inefficient mining of negative samples and poor representation of the programming language. To alleviate this issue, we propose to use a memory bank [82] to store rich changesets obtained from different batches for later contrast. In particular, due to the constant parameter update by back-propagation, we utilize the momentum contrastive method [28] to account for the inconsistency of negative vectors obtained by different models (in different mini-batches). Second, existing BERT-based bug localization techniques only account for the semantic-level similarity between bug reports and changesets, totally ignoring the lexical similarity (e.g., the same identifier), which is also of vital importance for retrieval if it exists. To alleviate this issue, we propose to use a hierarchical contrastive loss to leverage similarities at different levels. On the whole, we design a novel Hierarchical Momentum Contrastive Bug Localization (HMCBL) technique to address the two limitations.

We implement the analyzer for obtaining SFG for Java code and use the Java corpus (including 450,000 functions) of the CodeSearchNet dataset [33] to pre-train SemanticCodeBERT. On top of SemanticCodeBERT, we apply the hierarchical momentum contrastive method to facilitate the retrieval of bug-inducing changesets given a bug report on the widely used dataset established in [79], which includes six Java projects. Results show that we achieve state-of-the-art performance on bug localization. Ablation studies justify that the newly designed SFG improves the BERT model and that the new bug localization architecture is better than the existing ones.

Our contributions can be summarized as follows:
• We present a novel directed, multiple-label code graph representation termed Semantic Flow Graph (SFG), which compactly and adequately captures code semantics.
• We employ SFG to train SemanticCodeBERT, which can be applied to obtain code representations for various code-related downstream tasks.
• We design a novel Hierarchical Momentum Contrastive Bug Localization technique (HMCBL), which overcomes two important issues of existing techniques.
• We conduct a large-scale experimental evaluation, and the results show that our method outperforms state-of-the-art techniques in bug localization performance.

2 RELATED WORKS
This section reviews work closely related to this paper. Bug localization techniques proceed by making a query about the relevance between bug reports and related software artifacts on top of information retrieval techniques. The investigated software artifacts can be majorly divided into two categories: code elements such as classes and methods [31, 41, 42, 46, 50–53, 58, 62, 67, 71–73, 90, 94] and changesets [18, 68, 79, 81]. Given that changesets contain simultaneously changed parts of the code that are related and can thus facilitate bug fixing, the use of changesets is gradually dominating. With regard to information retrieval techniques, the Vector Space Model (VSM) is widely used for its simplicity and effectiveness, especially in the early phase of bug localization research. For instance, BugLocator [91] makes use of the revised Vector Space Model (rVSM) to establish the textual similarity between the bug report and the source code and then ranks all source code files based on the calculated similarity. For another example, Locus [79] represents one of the earliest works on changeset-based bug localization, and it proceeds by matching bug reports to hunks.

As VSM basically performs exact term matching, its effectiveness will be compromised in the common case where there exists a significant lexical gap between the descriptions in the bug report and the naming practices adopted by developers in the software artifacts.
To overcome this issue, bug localization techniques essentially need to establish the semantic relatedness between bug reports and software artifacts. Given the impressive progress in leveraging contextual information by deep learning architectures in natural language processing, deep neural networks have been widely used by researchers to learn representations for bug localization in recent years [32, 44, 57, 93]. For instance, Huo et al. [32] present the Deep Transfer Bug Localization task and propose TRANP-CNN, the first solution for the cold-start problem, which combines cross-project transfer learning and convolutional neural networks for file-level bug localization. Zhu et al. [93] focus on transferring knowledge (while filtering out irrelevant noise) from the source project to the target project, and propose COOBA to leverage adversarial transfer learning for cross-project bug localization. Murali et al. [57] propose Bug2Commit, an unsupervised model leveraging multiple dimensions of data associated with bug reports and commits.

In particular, enlightened by the impressive achievements made by BERT in natural language processing, BERT has been used for bug localization tasks. Lin et al. [47] study the tradeoffs between different BERT architectures for the purpose of changeset retrieval. Based on the ColBERT model developed by Khattab et al. [40], Ciborowska et al. [17] propose the FBL-BERT model for changeset-based bug localization. Evaluation results show that FBL-BERT can speed up the retrieval, and several design decisions have also been explored, including granularities of input changesets and the utilization of special tokens for capturing changesets' semantic representation. While impressive retrieval results for changesets have been achieved, the ColBERT model used by FBL-BERT does not adequately capture the deep semantics of program code, and the overall models of FBL-BERT suffer from two important limitations as described in Section 1 (Introduction).

Furthermore, inspired by the success of pre-training models in natural language processing, a number of pre-trained models for programming languages have been proposed to promote the development of code representation (which is vital for a variety of code-based tasks in the field of SE). For instance, CodeBERT is a pre-trained model proposed by Feng et al. [23], which provides generic representations for natural and programming language downstream applications. GraphCodeBERT [25] imports structural information to enhance the code representation by adding the data flow graph as an auxiliary of the input tokens, and improves the performance of code representation compared to CodeBERT. Kanade et al. [38] propose CuBERT, which is pre-trained on a massive Python source corpus with two pre-training tasks of Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). Buratti et al. [10] propose C-BERT, a transformer-based language model that is pre-trained on a C language corpus for code analysis tasks. Xue et al. [35] propose TreeBERT, which proposes a hybrid objective for ASTs to learn syntactic and semantic knowledge with tree-masked language modeling (TMLM) and node order prediction (NOP) pre-training tasks. More recently, UniXcoder [24] is proposed to leverage cross-modal information like the Abstract Syntax Tree and comments written in natural language to enhance code representation. While these pre-trained models on source code have made progress towards code representation, one drawback of them is that they do not adequately capture the deep semantics of program code, as they either treat code snippets as token sequences or consider only shallow code structure by using graph code representations such as the data flow graph. Hence, we give a novel code graph representation termed Semantic Flow Graph (SFG) to more compactly and adequately capture code semantics in this paper. On top of SFG, we further design and train SemanticCodeBERT with novel pre-training tasks.

3 SEMANTIC FLOW GRAPH
This section introduces the Semantic Flow Graph (SFG), a novel code graph representation designed for compactly representing deep code semantics. On top of the naturalness hypothesis "Software is a form of human communication, software corpora have similar statistical properties to natural language corpora, and these properties can be exploited to build better software engineering tools" [29], recent years have witnessed many innovations in using machine learning (particularly deep learning) techniques to help make software more reliable and maintainable. To achieve successful learning, one important ingredient lies in a suitable code representation. The representation, on the one hand, should capture enough code semantics, and on the other hand, should be learnable across code written by different developers or even different programming languages [2]. There are majorly three categories of code representation within the literature: token-based ways that represent code as a sequence of tokens [1, 19, 26, 27], syntactic-based ways that represent code as trees [4–6, 30, 56, 63, 87], and semantic-based ways that represent code as graph structures [3, 7, 8, 12, 20, 25, 43, 92]. For token-based representation, while its simplicity facilitates learning, the representation ignores the structural nature of code and thus captures quite limited semantics. For syntactic-based representation, although the tree representation can in principle contain rich semantic information, its learnability is unfortunately confined, as the tree typically has an unusually deep hierarchy and, in general, significant refinement efforts on the raw tree representation are required to enable successful learning in practice. Semantic-based representation aims to encode semantics in a way that facilitates learning, and a variety of graphs have been employed for code model learning, including for example the data flow graph [12, 25], control flow graph [20], program dependence graph [8], and contextual data flow graph [7].

While these graph-based representations have facilitated the learning of code semantics embodied in data dependency and control dependency, certain other code semantics are overlooked. In particular, the information of what kinds of program elements are related by data dependency or control dependency, and through which operations they are related, is neglected. We argue that this information is crucial for accurately learning code semantics. For instance, given a code snippet "a = m (b, c)" where a, b, and c are Boolean, Integer, and User-defined type variables respectively, and m is a certain function call, there will be two data flow edges b → a and c → a considering the data flow graph, and the code snippet will read "the values of two variables have flown into another variable". Under this circumstance, as the corresponding data flow graph coincides, the meaning of the code snippet has no difference from a variety of other code snippets such as "a = (b && m (c))" where a, b, and c are Boolean, Boolean, and arbitrary type variables respectively, and m is a certain function call that returns a boolean value.
Figure 1: An example of the semantic flow graph.

But if the additional information of what kinds of program elements and which operations are involved is taken into account, the code snippet "a = m (b, c)" will read "the value of an Integer type variable and the value of a User-defined type variable have flown into another Boolean type variable through a function call", which is more precise. To compactly integrate these two pieces of information into graphs, we design a novel directed, multiple-label code graph representation termed Semantic Flow Graph (SFG).

Definition 3.1. (Semantic Flow Graph). The Semantic Flow Graph (SFG) for a code snippet is a tuple <N, E, T, R>, where N is a set of nodes, E is a set of directed edges between nodes in N, and T and R are mappings from nodes to their types and their roles in computation respectively.

A number of points deserve comment. First, the node set N can be further divided into node sets N_V and N_C, which contain nodes corresponding to variables and control instructions in the code respectively. While a variable has a one-to-one mapping with a certain node from N_V, there may be one or multiple nodes in N_C for a certain control instruction. Essentially, if a control instruction has an associated condition and n different branches (i.e., straight-line code blocks) to go depending on the condition evaluation result, there will be a node in N_C for the condition, a node in N_C for the convergence of the different branches, and n different nodes in N_C for the n branches respectively.

Second, a directed edge n_a → n_b (n_a ∈ N, n_b ∈ N) in E can be of 3 kinds. The first kind E_D represents a data flow between two variables if n_a ∈ N_V ∧ n_b ∈ N_V holds, the second kind E_C embodies the control flow between two straight-line basic blocks if n_a ∈ N_C ∧ n_b ∈ N_C holds, and finally the third kind E_S denotes the natural sequential computation flow inside or between basic blocks in case n_a ∈ N_V ∧ n_b ∈ N_C or n_a ∈ N_C ∧ n_b ∈ N_V holds. In particular, the edge set E is established as follows:
(1) Establish E_D among nodes from set N_V according to intra-block and inter-block data dependencies between variables.
(2) Establish E_C among nodes from set N_C according to the specific control flow of the control instruction.
(3) Establish E_S following these rules: there will be an edge n_a → n_b (i) if n_b ∈ N_C is for the control instruction condition and n_a ∈ N_V is for a certain variable involved in the condition; (ii) if n_a ∈ N_C is for a control instruction branch and n_b ∈ N_V is for the left-most variable of the first statement inside the branch; (iii) if n_a ∈ N_V is for the left-most variable of the last statement inside a control instruction branch and n_b ∈ N_C is for the control instruction convergence; (iv) if n_a ∈ N_C is for the control instruction convergence and n_b ∈ N_V is for the left-most variable of the first statement inside the basic block directly following the control instruction.

Third, mapping T maps each node in N to its type, encoding the needed information of "what kinds of program elements are related". For each node in N_V, T maps it to the corresponding type of the variable. For each node in N_C, T maps the node to the specific part of the control instruction it refers to. Taking the control instruction If-Then-Else as an example, T maps the associated 4 nodes in N_C for it to the types IfCondition, IfThen, IfElse, and IfCONVERGE respectively.

Finally, mapping R maps each node in N_V to its role in the computation, encoding the needed information of "through which operations program elements are related". Basically, R considers the associated operation and control structure for the variable to determine its computation role. From an implementation perspective, for each node in N_V, R checks the direct parent of the corresponding variable in the abstract syntax tree (AST) and the positional relationship between it and the direct parent to establish the role. For instance, given a code snippet "a = b" where a and b are variables, R maps the roles of a and b to Assigned and Assignment respectively. For another example, given a code snippet "a.m(b)" where a and b are variables, and m is a certain function call, R maps the roles of a and b to InvocationTarget and InvocationArgument respectively. For nodes in N_C, we do not consider their roles as they are implicit in their types.
Note that it is difficult to simply augment classical graph program representations with type and computation role information. Existing representations like the program dependence graph typically work at the statement granularity (i.e., each graph node represents a statement), making it hard to encode detailed type and computation role information for the multiple program elements in a statement. The proposed SFG works at a finer granularity, with two kinds of nodes that have a one-to-one mapping with program variables and a one-to-one (or many-to-one) mapping with program control ingredients respectively. This kind of node representation is proposed for two reasons. On the one hand, it is convenient to analyze the types and computation roles of variables (through how they are connected with other program elements) and program control ingredients. On the other hand, data flow and control flow information are established respectively by analyzing variable uses and program control ingredients. With SFG (built on such a node representation), data flow and control flow can be encoded through the edges between nodes, and the type and computation role information can be encoded through node labels. SFG does not have nodes for additional program elements (like invocations etc.), thus it is compact but contains adequate semantic information.

Overall, SFG is a directed, multiple-label graph that captures not only the data flow and control flow between program elements, but also the type of each program element and the specific role that a certain program element plays in computation. Moreover, SFG represents this information in a compact way, facilitating learning across programs.

Example 3.1. Figure 1 gives an example of a Semantic Flow Graph for a simple method.

Implementation: We fully implement an analyzer to obtain the Semantic Flow Graph (SFG) for a Java method on top of Spoon [61], which is an open-source library to analyze, rewrite, transform, and transpile Java source code. Our analyzer supports modern Java versions up to Java 16. For nodes in N_V, the analyzer considers different kinds of primitive types and common JDK types, and a special type named user-defined type. In total, the analyzer considers 20 types for nodes in N_V. For nodes in N_C, the analyzer takes all the control instruction kinds (up to Java 16) into account and considers 35 types in total. With regard to roles, the analyzer considers 43 different roles in total for nodes in N_V.
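To make the definition above more concrete, the following is a minimal, hand-built sketch of an SFG for the running example "a = m(b, c)" from this section, expressed with plain Python containers. The container layout (nodes, labeled edges, and the T and R mappings stored as dictionaries) is an assumption made purely for illustration; the paper's analyzer is implemented on top of Spoon for Java and defines the full type and role vocabularies shown in Figure 2.

```python
# Illustrative only: a hand-built SFG for "a = m(b, c)" (a: Boolean, b: Integer,
# c: user-defined type). Not the paper's implementation.
from dataclasses import dataclass, field

@dataclass
class SFG:
    nodes: list = field(default_factory=list)   # node identifiers
    edges: list = field(default_factory=list)   # (src, dst, kind), kind in {"D", "C", "S"}
    types: dict = field(default_factory=dict)   # mapping T: node -> type label
    roles: dict = field(default_factory=dict)   # mapping R: variable node -> role label

sfg = SFG()
# Variable nodes (N_V), one per variable in the snippet.
sfg.nodes += ["a", "b", "c"]
sfg.types.update({"a": "Boolean", "b": "Integer", "c": "UserDefined"})
# Roles reflect how each variable participates in the computation.
sfg.roles.update({"a": "Assigned", "b": "InvocationArgument", "c": "InvocationArgument"})
# Data-flow edges (E_D): the values of b and c flow into a through the call to m.
sfg.edges += [("b", "a", "D"), ("c", "a", "D")]

print(sfg.edges)   # [('b', 'a', 'D'), ('c', 'a', 'D')]
```

With the type and role labels attached, the two otherwise identical data-flow graphs for "a = m(b, c)" and "a = (b && m(c))" become distinguishable, which is exactly the information SFG is designed to preserve.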
4 SEMANTICCODEBERT
In this section, we first describe the architecture of SemanticCodeBERT (shown in Figure 3), then the graph-guided masked attention based on the semantic flow graph, and finally the pre-training tasks. Overall, the SemanticCodeBERT network architecture adapts the architecture of GraphCodeBERT to the proposed novel SFG program representation, and SemanticCodeBERT also features pre-training tasks tailored to the SFG representation.

4.1 Model Architecture
SemanticCodeBERT follows BERT (Bidirectional Encoder Representation from Transformers, Devlin et al. [21]) as the backbone.

Comment Input Sequence: We import comments as a supplement for the model to understand the semantic information of the program code. [CLS] is the special classification token at the beginning of the comment sequence W.

Source Code Input Sequence: We cleanse the source code, remove erroneous characters, and add the special token [SEP] at the end of the source code sequence and of the whole input sequence. To represent the start-of-code, we import a pre-appended token [C] to split the comment and the source code. The source code sequence is represented as S.

Node Input Sequence: With the procedure discussed in Section 3, we generate a semantic flow graph (SFG) for each code snippet. At the beginning of the node list N, a pre-appended token [N] is added to represent the start-of-node.

Type Input Sequence: To answer the question of "what kinds of program elements are related", we have identified 55 possible types for code elements. T = {t_1, ..., t_55} represents the set of all 55 possible types, and [T] is pre-appended as the start-of-type. The complete list of types is shown in Figure 2.

Role Input Sequence: To answer the question of "through which operations program elements are related", we have defined 43 roles to mark the role of each program element in the computation, taking into account the associated operation and control structure. R = {r_1, ..., r_43} is the set of all 43 possible roles, and the pre-appended token [R] represents the start-of-role. The complete list of roles is shown in Figure 2.

Figure 2: All defined types and roles.

As intuitively shown in Figure 3, we concatenate the comment, source code, nodes, types, and roles as the input sequence:

X = Concat[[CLS], W, [C], S, [SEP], [N], N, [T], T, [R], R, [SEP]].   (1)
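As a small illustration of Equation (1), the sketch below assembles the five segments with the special tokens as literal strings. The literal token strings and the use of plain Python lists are assumptions for exposition; in practice each segment would be mapped to vocabulary ids by the SemanticCodeBERT tokenizer before being fed to the Transformer.

```python
# A minimal sketch of building the input X from Equation (1); illustrative only.
def build_input(comment_tokens, code_tokens, node_tokens, type_labels, role_labels):
    """Concatenate comment, code, SFG nodes, types, and roles with separator tokens."""
    return (
        ["[CLS]"] + comment_tokens
        + ["[C]"] + code_tokens + ["[SEP]"]
        + ["[N]"] + node_tokens
        + ["[T]"] + type_labels
        + ["[R]"] + role_labels
        + ["[SEP]"]
    )

x = build_input(
    comment_tokens=["returns", "the", "result"],
    code_tokens=["a", "=", "m", "(", "b", ",", "c", ")"],
    node_tokens=["a", "b", "c"],
    type_labels=["Boolean", "Integer", "UserDefined"],
    role_labels=["Assigned", "InvocationArgument", "InvocationArgument"],
)
print(x[:6])  # ['[CLS]', 'returns', 'the', 'result', '[C]', 'a']
```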
Figure 3: SemanticCodeBERT takes the comment, source code, nodes of the SFG, types, and roles as the input, and is pre-trained by standard masked language modeling [21], node alignment (marked with red lines), graph prediction (marked with green lines), type prediction (marked with blue lines), and role prediction (marked with purple lines).

4.2 Masked Attention
We resort to the graph-guided masked attention function described in [25] to filter irrelevant signals in the Transformer.
• The set E^1 indicates the alignment relation between S and N, where (s_i, n_j)/(n_j, s_i) ∈ E^1 if the node n_j is identified from the source code token s_i.
• The set E^2 indicates the dependency relation in N, where (n_i, n_j) ∈ E^2 if there is a direct edge from the node n_i to the node n_j.
• The set E^3 incorporates the type information of the nodes, where (n_i, t_j) ∈ E^3 if the type of the node n_i is t_j.
• The set E^4 incorporates the role information of the nodes, where (n_i, r_j) ∈ E^4 if the role of the node n_i is r_j.
The masked attention matrix is formulated as M:

M_{ij} = \begin{cases} 0 & \text{if } x_i \in \{[CLS], [SEP]\}, \text{ or } w_i, s_j \in W \cup S, \text{ or } (s_i, n_j)/(n_j, s_i) \in E^1, \\ & \text{or } (n_i, n_j) \in E^2, \text{ or } (n_i, t_j) \in E^3, \text{ or } (n_i, r_j) \in E^4; \\ -\infty & \text{otherwise.} \end{cases}   (2)

Specifically, the masked attention function blocks the transmission between unrelated tokens by setting the attention score to an infinitely negative value.
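To make the role of Equation (2) concrete, the following is a hedged sketch of how such a graph-guided mask could be materialized over the input positions. The index layout and the representation of the relation sets E^1–E^4 as collections of index pairs are assumptions made for illustration; the actual model builds this mask over the tokenized sequence X from Equation (1).

```python
# Illustrative sketch of the graph-guided attention mask of Equation (2), using NumPy.
import numpy as np

def build_mask(seq_len, text_positions, special_positions, relation_pairs):
    """text_positions: indices of comment/code tokens (W and S);
    special_positions: indices of [CLS]/[SEP] tokens;
    relation_pairs: (i, j) index pairs drawn from E^1, E^2, E^3, and E^4
    (bidirectional relations should be listed in both directions)."""
    mask = np.full((seq_len, seq_len), -np.inf)
    # Rows for [CLS]/[SEP] attend everywhere.
    mask[list(special_positions), :] = 0.0
    # Comment and source-code tokens attend to each other freely.
    for i in text_positions:
        for j in text_positions:
            mask[i, j] = 0.0
    # Node/type/role positions attend only along explicit SFG relations.
    for i, j in relation_pairs:
        mask[i, j] = 0.0
    return mask  # added to the attention scores before the softmax

m = build_mask(6, text_positions=[1, 2], special_positions=[0, 5],
               relation_pairs=[(2, 3), (3, 2), (3, 4)])
```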
4.3 Pre-Training Tasks
The pre-training tasks of SemanticCodeBERT are described in this section. Besides the masked language modeling, node alignment, and edge prediction pre-training tasks proposed by Guo et al. [25], we define two novel pre-training tasks: type prediction and role prediction. These two novel pre-training tasks represent the first attempt to leverage the attribute information of nodes for learning code representation.

Masked Language Modeling: The masked language modeling pre-training task is proposed by Devlin et al. [21]. We replace 15% of the source code tokens, with [MASK] 80% of the time, a random token 10% of the time, or the token itself 10% of the time. The comment context contributes to inferring the masked code tokens [25].

Node Alignment: The motivation of node alignment is to align the representations of the source code and the nodes of the semantic flow graph [25]. We randomly mask 20% of the edges between the source code and nodes, and then predict where the nodes are identified from (i.e., predict these masked edges E^1_mask). As shown in Figure 3, the model should distinguish that n_2 comes from s_6 and n_13 comes from s_33. We formulate the loss function as Equation (3), where the candidate pairs are drawn from S × N: δ(e_ij) is one if (s_i, n_j) ∈ E^1, and zero otherwise, and p_{e_ij} is the probability of the edge from the i-th code token to the j-th node, calculated by a dot product followed by a sigmoid function using the representations of s_i and n_j output by SemanticCodeBERT.

\mathcal{L}_{NA} = -\sum_{e_{ij} \in E^1_{mask}} [\delta(e_{ij}) \log p_{e_{ij}} + (1 - \delta(e_{ij})) \log(1 - p_{e_{ij}})].   (3)

Edge Prediction: The motivation of edge prediction is to encourage the model to learn structural relationships from semantic flow graphs for better program code representation. Like node alignment, we randomly mask 20% of the edges between nodes in the mask matrix, encouraging the model to predict these masked edges E^2_mask (e.g., the edges (n_3, n_2) and (n_12, n_11)). We formulate the loss function as Equation (4), where the candidate pairs are drawn from N × N: δ(e_ij) is one if (n_i, n_j) ∈ E^2, and zero otherwise, and p_{e_ij} is the probability of the edge from the i-th node to the j-th node.

\mathcal{L}_{GP} = -\sum_{e_{ij} \in E^2_{mask}} [\delta(e_{ij}) \log p_{e_{ij}} + (1 - \delta(e_{ij})) \log(1 - p_{e_{ij}})].   (4)

Type Prediction: The motivation of type prediction is to guide the model to comprehend the types (e.g., "int", "double", "IfCondition") of nodes for better program code representation. We pre-append the full set of types T to the input nodes. The candidate pairs are drawn from N × T: if the type of node n_i is t_j (i.e., (n_i, t_j) ∈ E^3), δ(e_ij) is one, otherwise it is zero. We randomly mask 20% of the edges between nodes and types and formulate the loss function as Equation (5), where E^3_mask are the masked edges and p_{e_ij} is the probability of the edge from the i-th node to the j-th type.

\mathcal{L}_{TP} = -\sum_{e_{ij} \in E^3_{mask}} [\delta(e_{ij}) \log p_{e_{ij}} + (1 - \delta(e_{ij})) \log(1 - p_{e_{ij}})].   (5)
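All four edge-oriented tasks in Equations (3)–(6) share the same shape: an edge probability obtained from a dot product followed by a sigmoid, and a binary cross-entropy over the masked candidate pairs. The PyTorch sketch below illustrates that shared pattern; the tensor shapes and the way candidates are sampled are assumptions for illustration, not the paper's actual pre-training code.

```python
# Hedged sketch of the masked-edge prediction losses (Equations 3-6).
import torch

def masked_edge_loss(src_repr, dst_repr, candidate_pairs, true_edges):
    """src_repr: (S, d) representations (e.g., code tokens or nodes);
    dst_repr: (T, d) representations (e.g., nodes, types, or roles);
    candidate_pairs: (i, j) index pairs whose edges were masked;
    true_edges: set of (i, j) pairs that are real edges/labels in the SFG."""
    loss = 0.0
    for i, j in candidate_pairs:
        p = torch.sigmoid(torch.dot(src_repr[i], dst_repr[j]))   # p_{e_ij}
        delta = 1.0 if (i, j) in true_edges else 0.0              # delta(e_ij)
        loss = loss - (delta * torch.log(p) + (1 - delta) * torch.log(1 - p))
    return loss

src = torch.randn(4, 8, requires_grad=True)
dst = torch.randn(3, 8, requires_grad=True)
loss = masked_edge_loss(src, dst, candidate_pairs=[(0, 1), (2, 2)], true_edges={(0, 1)})
loss.backward()
```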
Role Prediction: "Role" indicates the computation role of the node in the semantic flow graph (e.g., "InvocationArgument", "Assigned", "Assignment"). Role prediction can feed the model with a more informative signal to understand the correlation among different nodes. We pre-append the full set of roles R to the input nodes. The candidate pairs are drawn from N × R: if the role of node n_i is r_j (i.e., (n_i, r_j) ∈ E^4), δ(e_ij) is one, otherwise it is zero. We randomly mask 20% of the edges between nodes and roles and formulate the loss function as Equation (6), where E^4_mask are the masked edges and p_{e_ij} is the probability of the edge from the i-th node to the j-th role.

\mathcal{L}_{RP} = -\sum_{e_{ij} \in E^4_{mask}} [\delta(e_{ij}) \log p_{e_{ij}} + (1 - \delta(e_{ij})) \log(1 - p_{e_{ij}})].   (6)

5 CHANGESET-BASED BUG LOCALIZATION
In this section, we illustrate the utilization of SemanticCodeBERT for bug localization with changesets. The proposed bug localization model is shown in Figure 4. The model aims to address the two important limitations (as described in Section 1) of the overall models of existing BERT-based bug localization techniques.

Figure 4: An overview of the Hierarchical Momentum Contrastive Bug Localization technique (HMCBL).

5.1 Problem Definition
Given a set Q = {q_1, q_2, ..., q_M} of M bug reports, the bug localization task aims to discover the most relevant changesets from K = {k_1, k_2, ..., k_N}, a set including N changesets. More specifically, for a bug report q ∈ Q, a bug-inducing changeset p ∈ K and a not bug-inducing changeset n ∈ K are selected to form a triplet (q, p, n). All bug-inducing changesets and not bug-inducing changesets are non-overlapping. The goal of the learned similarity function s is to provide a high value for s(q, p) (between the anchor q and the positive sample p) and a low value for s(q, n) (between the anchor q and the negative sample n). Section 5.2 focuses on producing accurate representations of bug reports and changesets, and Section 5.3 describes the estimation of similarities and the loss function for training the model.

5.2 Representation Learning
The proposed model consists of three parts: an encoder network, a projector network, and a momentum update mechanism with a memory bank that stores rich representations of changesets.

Encoder Network: As mentioned before, bug reports consist of natural language descriptions and project changesets consist of programming language code. Hence, we introduce BERT [21] as the backbone to encode the bug report as q_feature, and SemanticCodeBERT as the backbone to encode the relevant changeset as p_feature and the irrelevant changeset as n_feature.

q_{feature} = BERT(q_{tok}),  p_{feature} = SemanticCodeBERT(p_{tok}),  n_{feature} = SemanticCodeBERT(n_{tok}),   (7)

where BERT and SemanticCodeBERT denote the trainable parameters of BERT and SemanticCodeBERT, q_tok, p_tok, and n_tok are the input tokens obtained by the tokenizers, and q_feature ∈ R^d, p_feature ∈ R^d, and n_feature ∈ R^d are the refined vectors (d is the dimension of the mapped spaces).

Projector Network: After the feature vectors are extracted, we use a multi-layer perceptron neural network as a projector to compress the vectors of bug reports and changesets into a compact shared embedding space. We replace Dropout with Batch Normalization for regularization, which can be trained with saturating nonlinearities and is more tolerant to increased training rates [34].

q_{model} = W_b^2 norm(\phi(W_b^1 q_{feature})),  p_{model} = W_c^2 norm(\phi(W_c^1 p_{feature})),  n_{model} = W_c^2 norm(\phi(W_c^1 n_{feature})),   (8)

where q_model ∈ R^{d'}, p_model ∈ R^{d'}, and n_model ∈ R^{d'} are the projected vectors (d' is the dimension of the projector output), W_b^· and W_c^· are the trainable weight matrices, norm(·) denotes batch normalization [34], and φ(·) is the leaky_relu function [54].

Momentum Update Mechanism with Memory Bank: As mentioned in Section 1, it is important to consider large-scale negative samples in contrastive learning for representations of changesets. To account for this, we use a memory bank [82] to store rich changesets obtained from different batches for later contrast. In particular, we build the key model for the encoder and projector networks of changesets based on the momentum contrastive learning mechanism proposed by He et al. [28]. The parameters of the query model θ_q are updated by back-propagation, while the parameters of the key model θ_k are momentum updated as follows:

\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q,   (9)

where m ∈ [0, 1) is a pre-defined momentum coefficient, which is set to 0.999 in our experiment. As shown in the previous study [28], a relatively large momentum works much better than a smaller value, suggesting that a slowly evolving key model is core to making use of the memory bank. For each mini-batch, we use average pooling, enqueue the latest negative samples into the memory bank, and dequeue the oldest negative samples.
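The sketch below illustrates the momentum update of Equation (9) together with a fixed-size memory bank implemented as a FIFO queue of changeset keys. Names such as query_model, key_model, and MemoryBank are illustrative assumptions under the mechanism described above, not the paper's code.

```python
# Hedged PyTorch sketch of Equation (9) and the memory-bank queue.
import torch

@torch.no_grad()
def momentum_update(query_model, key_model, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q  (Equation 9)."""
    for theta_q, theta_k in zip(query_model.parameters(), key_model.parameters()):
        theta_k.data.mul_(m).add_(theta_q.data, alpha=1.0 - m)

class MemoryBank:
    """FIFO queue of K key vectors: enqueue the newest batch, dequeue the oldest."""
    def __init__(self, size, dim):
        self.queue = torch.randn(size, dim)
        self.ptr = 0

    @torch.no_grad()
    def enqueue(self, keys):                       # keys: (batch, dim) from the key model
        n = keys.shape[0]
        idx = (self.ptr + torch.arange(n)) % self.queue.shape[0]
        self.queue[idx] = keys
        self.ptr = (self.ptr + n) % self.queue.shape[0]
```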

5.3 Similarity Estimation
As mentioned before, the lexical similarity between bug reports and program changesets (such as the same application programming interfaces) is also crucial for retrieval, besides semantic similarity. In this paper, we use a hierarchical contrastive loss to leverage the lower feature-level similarity, the higher model-level similarity, and the broader bank-level similarity for matching the bug report with relevant changesets. We get the positive feature-level similarity s_{f+} by calculating the cosine similarity between q_feature and p_feature, and the negative feature-level similarity s_{f-} by calculating the cosine similarity between q_feature and n_feature; the positive model-level similarity s_{m+} by calculating the cosine similarity between q_model and p_model, and the negative model-level similarity s_{m-} by calculating the cosine similarity between q_model and n_model. Specifically, we calculate the positive bank-level similarity s_{b+} as the cosine similarity between q_query and p_key, and the negative bank-level similarity s_i^{b-} (i ∈ {1, 2, ..., K}) as the cosine similarity between q_query and the i-th negative sample n_i^{key} of the memory bank (K is the size of the memory bank).
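The sketch below shows how the three similarity levels defined above can be combined into one training objective, as formalized in Equations (10)–(13) that follow. The helper names, the assumption that anchors are single vectors, and the use of torch.nn.functional.cosine_similarity are illustrative choices, not the paper's implementation.

```python
# Hedged sketch of the hierarchical InfoNCE objective (Equations 10-13).
import torch
import torch.nn.functional as F

def info_nce(pos_sim, neg_sims, gamma=0.07):
    """-log( exp(s+/g) / (exp(s+/g) + sum exp(s-/g)) ) with the positive at index 0."""
    logits = torch.cat([pos_sim.view(1), neg_sims.view(-1)]) / gamma
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

def hierarchical_loss(q_feat, p_feat, n_feat, q_model, p_model, n_model,
                      q_query, p_key, bank_keys, alphas=(1.0, 1.0, 1.0)):
    cos = lambda a, b: F.cosine_similarity(a, b, dim=-1)
    l_f = info_nce(cos(q_feat, p_feat), cos(q_feat, n_feat))        # feature level
    l_m = info_nce(cos(q_model, p_model), cos(q_model, n_model))    # model level
    l_b = info_nce(cos(q_query, p_key),                              # bank level vs. memory bank
                   cos(q_query.expand_as(bank_keys), bank_keys))
    a_f, a_m, a_b = alphas
    return a_f * l_f + a_m * l_m + a_b * l_b
```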
We adopt InfoNCE [59], a form of contrastive loss function, as our objective function for contrastive matching. The feature-level contrastive loss is formulated as follows:

\mathcal{L}_f = -\log \frac{\exp(s_{f+}/\gamma)}{\exp(s_{f+}/\gamma) + \exp(s_{f-}/\gamma)}.   (10)

The model-level contrastive loss is formulated as follows:

\mathcal{L}_m = -\log \frac{\exp(s_{m+}/\gamma)}{\exp(s_{m+}/\gamma) + \exp(s_{m-}/\gamma)}.   (11)

The bank-level contrastive loss is formulated as follows:

\mathcal{L}_b = -\log \frac{\exp(s_{b+}/\gamma)}{\exp(s_{b+}/\gamma) + \sum_{i=1}^{K} \exp(s_i^{b-}/\gamma)},   (12)

where K is the size of the memory bank and γ is a temperature hyper-parameter that is set to 0.07 in our experiment. Thus, the overall objective function is L:

\mathcal{L} = \alpha_f \mathcal{L}_f + \alpha_m \mathcal{L}_m + \alpha_b \mathcal{L}_b,   (13)

where α_f, α_m, and α_b are three hyper-parameters to balance the feature-level, model-level, and bank-level contrasts.

5.4 Offline Indexing and Retrieval
After fine-tuning the model on a project-specific dataset, we resort to the offline indexing and retrieval methods proposed by Ciborowska et al. [17]. All encoded changesets are stored in an IVFPQ (Inverted File with Product Quantization) index. The IVFPQ index is implemented using the Faiss library [36], which uses the k-means algorithm to partition the embedding space into a pre-configured number of partitions and assigns each embedding to its nearest cluster. In the retrieval process, the query bug report is first located to the nearest partition centroid, and then the nearest instances within the partition are discovered. For each query bug report, we can identify the N' most similar changesets across all N changesets stored in the IVFPQ index. Therefore, we only re-rank the top-N' subset as the candidate changesets to produce the final ranking.
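The following is a hedged sketch of IVFPQ indexing and retrieval with the Faiss library as used in the workflow above. The parameter values (embedding dimension, number of partitions, product-quantization settings, nprobe, and top-N') are illustrative assumptions, not the paper's configuration.

```python
# Illustrative sketch of offline IVFPQ indexing and nearest-neighbor retrieval with Faiss.
import faiss
import numpy as np

d = 768                                             # embedding dimension (assumed)
changeset_vecs = np.random.rand(10000, d).astype("float32")
query_vecs = np.random.rand(5, d).astype("float32")

quantizer = faiss.IndexFlatL2(d)                    # coarse quantizer over partition centroids
index = faiss.IndexIVFPQ(quantizer, d, 256, 64, 8)  # 256 partitions, 64 sub-vectors, 8 bits each
index.train(changeset_vecs)                         # k-means partitioning of the embedding space
index.add(changeset_vecs)                           # store all encoded changesets offline

index.nprobe = 8                                    # partitions visited per query
distances, ids = index.search(query_vecs, 100)      # top-N' candidate changesets per bug report
# The top-N' candidates (ids) are then re-ranked to produce the final ranking.
```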
6 EXPERIMENTAL EVALUATION
6.1 Dataset
SemanticCodeBERT is trained using all of the Java corpus in CodeSearchNet [33], and we provide the weights and the guidance to fine-tune the pre-trained model for downstream tasks. To evaluate our bug localization technique, we use the dataset separated by Ciborowska et al. [17] from the manually validated dataset by Wen et al. [79]. The dataset includes six software projects, termed AspectJ, JDT, PDE, SWT, Tomcat, and ZXing, as shown in Table 1. To explore the impact of the granularity of changeset data, the bug-inducing changeset is further divided into file-level and hunk-level code changes. Thus, one bug report can have multiple pairs with files or hunks from the original inducing changes. In total, we consider three different granularities: commits, files, and hunks.

Table 1: Six projects used for evaluation. The last three columns count the changesets at commit, file, and hunk granularity.

Dataset   Bugs   Commits   Files    Hunks
AspectJ   200    2,939     14,030   23,446
JDT       94     13,860    58,619   150,630
PDE       60     9,419     42,303   100,373
SWT       90     10,206    25,666   69,833
Tomcat    193    10,034    30,866   72,134
ZXing     20     843       2,846    6,165

6.2 Evaluation Metrics
A set of metrics commonly used to evaluate the performance of information retrieval systems is applied to evaluate the performance of the different models.

Precision@K (P@K): P@K evaluates how many of the top-K changesets in a ranking are relevant to the bug report; it is equal to the number of relevant changesets |Rel_{B_i}| located in the top-K positions of the ranking, averaged across B bug reports:

P@K = \frac{1}{|B|} \sum_{i=1}^{|B|} \frac{|Rel_{B_i}|}{K}.   (14)

Mean Average Precision (MAP): MAP quantifies the ability of a model to locate all changesets relevant to a bug report. MAP is calculated as the mean of AvgP (average precision) over B bug reports:

AvgP = \sum_{j=1}^{M} \frac{P@j \times pos(j)}{N},   (15)

MAP = \frac{1}{|B|} \sum_{i=1}^{|B|} AvgP_{B_i},   (16)

where j is the rank, M is the number of retrieved changesets, pos(j) denotes whether the j-th changeset is relevant to the bug report, N is the total number of changesets relevant to the bug report, P@j is the precision over the top-j positions of the ranking of this retrieval, and B_i is the i-th bug report.

Mean Reciprocal Rank (MRR): MRR quantifies the ability of a model to locate the first relevant changeset for a bug report, and is calculated as the average of reciprocal ranks across B bug reports, where 1stRank_{B_i} is the rank of the first relevant changeset in the ranking for the i-th bug report:

MRR = \frac{1}{|B|} \sum_{i=1}^{|B|} \frac{1}{1stRank_{B_i}}.   (17)

6.3 Experimental Setup
Configurations of Pre-Training Tasks: SemanticCodeBERT is pre-trained on an NVIDIA Tesla A100 with 128GB RAM on an Ubuntu system. The Adam optimizer [49] is used to update model parameters with batch size 80 and learning rate 1E-04. To accelerate the training process, the parameters of GraphCodeBERT [25] are used to initialize the pre-training model. The model is trained with 600K batches and costs about 156 hours.

Configurations of Bug Localization: The first half of each project's pairs of bug reports and bug-inducing changesets, ordered by bug opening date, is selected as the training dataset, and the remaining half is left as the test dataset. The experiments are implemented with GPU support. The Adam optimizer [49] is used to update model parameters with learning rate 3E-05. All bug reports and changesets are truncated or padded to their respective length limits. According to experimental verification, we set the trade-off hyper-parameters α_f, α_m, and α_b as 1, 1, and 1, respectively.

Changeset Encoding Strategies: Changesets are time-ordered sequences recording the software's evolution over time. We build upon the three changeset encoding strategies (D-encoding, ARC-encoding, and ARC_L-encoding) proposed by Ciborowska et al. [17] to encode changesets. D-encoding does not utilize specific characteristics of changeset lines. ARC-encoding divides the lines into three groups marked with three unique tokens. ARC_L-encoding instead does not group the lines and maintains the ordering of lines within a changeset. These three strategies are based on the output of the git diff command, which divides changeset lines into three kinds: added lines, removed lines, and unchanged lines. All code sequences are preprocessed by filtering intrusive characters (e.g., docstrings, comments) from the original code tokens.

6.4 Retrieval Performance
We compare the performance of our proposed model with a traditional bug localization tool, the state-of-the-art changeset-based bug localization approach, and two recent state-of-the-art pre-trained models within the HMCBL framework.
• BLUiR [71]: A structured IR-based fault localization tool, which builds ASTs to extract the program constructs of each source code file and utilizes Okapi BM25 [69] to calculate the similarity between the bug report and the candidate changesets.
• FBL-BERT [17]: The state-of-the-art approach for automatically retrieving bug-inducing changesets given a bug report, which uses the popular BERT model to more accurately match the semantics in the bug report text with the bug-inducing changesets.
• GraphCodeBERT [25]: A pre-trained model that considers data flow to better encode the relations between variables.
• UniXcoder [24]: A unified cross-modal pre-trained model, which leverages cross-modal information like the Abstract Syntax Tree and comments to enhance code representation.
For BLUiR, we fully follow the original technical description in [71] (as no open-source implementation is available) to get the results for the evaluation metrics. For FBL-BERT, we use the experimental results provided in [17]. For GraphCodeBERT and UniXcoder, we get the results by replacing the pre-trained model SemanticCodeBERT within the HMCBL framework with GraphCodeBERT and UniXcoder respectively (keeping other configurations the same). Table 2 shows the retrieval performance of the different models with different changeset encoding strategies (i.e., D-, ARC-, and ARC_L-encoding) and three granularities (i.e., Commits-, Files-, and Hunks-level) on the six projects. Limited by space, the best result of the three encoding strategies is shown for each configuration.
Table 2: Retrieval performance of different models. For each project and technique, the three groups of five columns report MRR, MAP, P@1, P@3, and P@5 at the Commits-, Files-, and Hunks-level granularity, respectively.

Projects / Technique | Commits-: MRR MAP P@1 P@3 P@5 | Files-: MRR MAP P@1 P@3 P@5 | Hunks-: MRR MAP P@1 P@3 P@5
BLUiR 0.077 0.016 0.071 0.024 0.014 0.073 0.023 0.000 0.024 0.014 0.056 0.035 0.000 0.071 0.086
FBL-BERT 0.155 0.061 0.100 0.133 0.120 0.212 0.163 0.100 0.133 0.220 0.328 0.210 0.200 0.233 0.240
ZXing GraphCodeBERT 0.189 0.118 0.143 0.143 0.118 0.280 0.155 0.214 0.143 0.214 0.346 0.118 0.225 0.111 0.067
UniXcoder 0.354 0.167 0.414 0.171 0.120 0.359 0.143 0.333 0.224 0.200 0.331 0.164 0.214 0.261 0.282
Ours 0.439 0.226 0.429 0.250 0.225 0.421 0.185 0.357 0.226 0.271 0.422 0.212 0.333 0.444 0.400
BLUiR 0.009 0.001 0.000 0.000 0.000 0.018 0.003 0.000 0.008 0.005 0.024 0.005 0.000 0.008 0.010
FBL-BERT 0.103 0.013 0.067 0.033 0.027 0.260 0.079 0.167 0.128 0.151 0.288 0.093 0.200 0.144 0.127
PDE GraphCodeBERT 0.180 0.042 0.142 0.087 0.058 0.264 0.094 0.167 0.129 0.148 0.284 0.074 0.206 0.124 0.129
UniXcoder 0.178 0.029 0.095 0.063 0.072 0.267 0.090 0.167 0.135 0.129 0.289 0.102 0.212 0.144 0.129
Ours 0.248 0.045 0.190 0.103 0.076 0.274 0.095 0.214 0.137 0.160 0.294 0.134 0.286 0.182 0.160
BLUiR 0.016 0.013 0.007 0.014 0.015 0.098 0.065 0.028 0.076 0.108 0.086 0.048 0.007 0.017 0.159
FBL-BERT 0.107 0.061 0.058 0.080 0.083 0.176 0.085 0.154 0.095 0.097 0.183 0.093 0.173 0.111 0.099
AspectJ GraphCodeBERT 0.172 0.065 0.167 0.065 0.060 0.178 0.071 0.167 0.065 0.060 0.188 0.086 0.167 0.120 0.116
UniXcoder 0.270 0.148 0.245 0.160 0.158 0.209 0.119 0.167 0.140 0.152 0.250 0.134 0.250 0.150 0.138
Ours 0.309 0.169 0.278 0.198 0.196 0.272 0.148 0.250 0.157 0.146 0.262 0.143 0.250 0.161 0.163
BLUiR 0.019 0.001 0.015 0.005 0.003 0.027 0.003 0.000 0.010 0.012 0.033 0.005 0.000 0.005 0.009
FBL-BERT 0.118 0.016 0.064 0.043 0.030 0.403 0.060 0.319 0.184 0.128 0.429 0.062 0.319 0.195 0.167
JDT GraphCodeBERT 0.125 0.022 0.061 0.035 0.030 0.423 0.058 0.308 0.179 0.118 0.385 0.041 0.231 0.179 0.118
UniXcoder 0.182 0.018 0.182 0.061 0.038 0.434 0.062 0.379 0.166 0.131 0.364 0.045 0.288 0.182 0.123
Ours 0.306 0.026 0.288 0.096 0.064 0.489 0.080 0.462 0.195 0.167 0.443 0.088 0.322 0.206 0.167
BLUiR 0.005 0.001 0.000 0.000 0.000 0.020 0.003 0.016 0.005 0.006 0.014 0.001 0.000 0.000 0.013
FBL-BERT 0.067 0.015 0.023 0.027 0.026 0.555 0.131 0.535 0.233 0.173 0.526 0.131 0.488 0.217 0.164
SWT GraphCodeBERT 0.105 0.018 0.048 0.026 0.022 0.535 0.137 0.525 0.220 0.175 0.536 0.132 0.516 0.220 0.159
UniXcoder 0.129 0.035 0.107 0.106 0.063 0.548 0.149 0.524 0.233 0.183 0.535 0.143 0.535 0.205 0.179
Ours 0.283 0.085 0.159 0.177 0.170 0.560 0.153 0.540 0.249 0.192 0.540 0.147 0.540 0.228 0.179
BLUiR 0.007 0.002 0.000 0.002 0.002 0.014 0.003 0.000 0.010 0.007 0.014 0.005 0.000 0.012 0.013
FBL-BERT 0.141 0.055 0.062 0.077 0.088 0.463 0.114 0.381 0.222 0.183 0.482 0.129 0.412 0.216 0.182
Tomcat GraphCodeBERT 0.253 0.062 0.188 0.104 0.084 0.287 0.067 0.271 0.104 0.080 0.395 0.118 0.363 0.216 0.211
UniXcoder 0.328 0.057 0.338 0.120 0.084 0.364 0.065 0.353 0.125 0.085 0.396 0.097 0.378 0.139 0.118
Ours 0.386 0.073 0.360 0.135 0.107 0.487 0.122 0.406 0.247 0.232 0.484 0.132 0.423 0.225 0.211
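For reference, the sketch below shows how the P@K, MAP, and MRR values reported in Table 2 are defined in Section 6.2. The encoding of a retrieval result as a ranked list of changeset ids plus a set of relevant ids per bug report is an assumption made for illustration.

```python
# Illustrative computation of P@K, average precision, and MRR for one or more bug reports.
def precision_at_k(ranked, relevant, k):
    return len([c for c in ranked[:k] if c in relevant]) / k

def average_precision(ranked, relevant):
    hits, score = 0, 0.0
    for j, c in enumerate(ranked, start=1):
        if c in relevant:
            hits += 1
            score += hits / j          # P@j accumulated at each relevant position
    return score / len(relevant) if relevant else 0.0

def mean_reciprocal_rank(rankings, relevants):
    total = 0.0
    for ranked, relevant in zip(rankings, relevants):
        rank = next((j for j, c in enumerate(ranked, start=1) if c in relevant), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(rankings)

# Example: one bug report, four changesets ranked by similarity, two of them relevant.
ranked = ["c3", "c7", "c1", "c9"]
relevant = {"c7", "c9"}
print(precision_at_k(ranked, relevant, 3), average_precision(ranked, relevant))
```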

Table 3: Ablation study of pre-training tasks of Semantic- 6.5 Ablation Study


CodeBERT with Semantic Flow Graph (SFG).
To evaluate the design choices in the proposed model, we con-
duct several ablation studies. To begin with, as shown in Table 3,
Dataset Pre-training Tasks MRR MAP P@1 P@3 P@5 we analyze the contributions of node alignment, edge prediction,
-w/ 0.189 0.118 0.143 0.143 0.118 type prediction, and role prediction pre-training tasks on the six
ZXing -w/ N.& E. 0.372 0.102 0.333 0.111 0.067
-w/ N.& E.& T.& R. 0.439 0.226 0.429 0.250 0.225
projects with commits granularity. N., E., T., and R. denote the Node
Alignment, Edge Prediction, Type Prediction, and Role Prediction
-w/ 0.180 0.042 0.142 0.087 0.058
PDE -w/ N.& E. 0.219 0.032 0.143 0.076 0.072 pre-training tasks, respectively. With all of these pre-training tasks,
-w/ N.& E.& T.& R. 0.248 0.045 0.190 0.103 0.076 we train SemanticCodeBert according to the proposed new code
-w/ 0.172 0.065 0.167 0.065 0.060 representation SFG. According to the results, after adding Type and
AspectJ -w/ N.& E. 0.289 0.158 0.250 0.184 0.170 Role Prediction pre-training tasks, the obtained performance has
-w/ N.& E.& T.& R. 0.309 0.169 0.278 0.198 0.196 universally improved. This result suggests that leveraging the node
-w/ 0.125 0.022 0.061 0.035 0.030 attributes (type and role) is vital to learn code representation.
JDT -w/ N.& E. 0.139 0.021 0.095 0.044 0.048
-w/ N.& E.& T.& R. 0.306 0.026 0.288 0.096 0.064
Furthermore, we evaluate the effectiveness of the Hierarchical Momentum Contrastive Bug Localization (HMCBL) technique on the six projects at the commit granularity. As illustrated in Table 4, for -w/o HMCBL, the memory bank and the hierarchical contrastive loss that leverages similarities at different levels are removed, and only the representation obtained by the encoder is used to calculate similarity.

To demonstrate the generality of the technique, it is evaluated with different pre-training models as the encoder of the changeset, including BERT, GraphCodeBERT, and SemanticCodeBERT. It is observed that overall much better performance is obtained with hierarchical momentum contrastive learning, which provides large-scale negative samples for representation learning and increases retrieval accuracy. For instance, compared with BERT -w/o HMCBL, which is exactly FBL-BERT, BERT -w/ HMCBL improves the MRR scores of more than 80% of the projects by 15.48% to 66.67%. This indicates that the hierarchical momentum contrastive bug localization technique can serve as a general and effective framework on top of different advanced pre-training models.
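For illustration, the following simplified PyTorch sketch contrasts the two settings compared in Table 4. It shows only a single-level InfoNCE-style loss with a queue of negatives and a MoCo-style momentum update; the hierarchical similarities and the lexical-similarity component of HMCBL, as well as all hyper-parameters, are omitted, so it should be read as an assumption-laden outline rather than the actual implementation.

import torch
import torch.nn.functional as F

def score_without_hmcbl(report_repr, changeset_repr):
    # "-w/o HMCBL": rank candidate changesets by the plain similarity of the two encodings.
    return F.cosine_similarity(report_repr, changeset_repr, dim=-1)

def info_nce_with_queue(query, positive_key, queue, temperature=0.07):
    # "-w/ HMCBL" (simplified): the bug-report query is contrasted against its positive
    # changeset and a large queue of negative changeset keys (the memory bank).
    query = F.normalize(query, dim=-1)                # (batch, dim)
    positive_key = F.normalize(positive_key, dim=-1)  # (batch, dim)
    queue = F.normalize(queue, dim=-1)                # (queue_size, dim)
    pos_logits = (query * positive_key).sum(dim=-1, keepdim=True)  # (batch, 1)
    neg_logits = query @ queue.t()                                 # (batch, queue_size)
    logits = torch.cat([pos_logits, neg_logits], dim=1) / temperature
    labels = torch.zeros(query.size(0), dtype=torch.long)          # positive sits at index 0
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(key_encoder, query_encoder, m=0.999):
    # The key encoder that fills the queue slowly tracks the query encoder (MoCo-style).
    for k, q in zip(key_encoder.parameters(), query_encoder.parameters()):
        k.data.mul_(m).add_(q.data, alpha=1 - m)

In the -w/o HMCBL rows, only the plain similarity scoring is available, which is why the large set of negatives held in the memory bank cannot influence the learned representations.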


Table 4: Ablation study of the Hierarchical Momentum Contrastive Bug Localization (HMCBL) technique, where GCBERT and SCBERT are short for GraphCodeBERT and SemanticCodeBERT.

Technique Dataset MRR MAP P@1 P@3 P@5
BERT -w/o HMCBL (FBL-BERT) ZXing 0.155 0.061 0.100 0.133 0.120
BERT -w/o HMCBL (FBL-BERT) PDE 0.103 0.013 0.067 0.033 0.027
BERT -w/o HMCBL (FBL-BERT) AspectJ 0.107 0.061 0.058 0.080 0.083
BERT -w/o HMCBL (FBL-BERT) JDT 0.118 0.016 0.064 0.043 0.030
BERT -w/o HMCBL (FBL-BERT) SWT 0.067 0.015 0.023 0.027 0.026
BERT -w/o HMCBL (FBL-BERT) Tomcat 0.141 0.055 0.062 0.077 0.088
GCBERT -w/o HMCBL ZXing 0.162 0.106 0.143 0.095 0.086
GCBERT -w/o HMCBL PDE 0.167 0.018 0.119 0.071 0.045
GCBERT -w/o HMCBL AspectJ 0.123 0.067 0.076 0.073 0.084
GCBERT -w/o HMCBL JDT 0.120 0.022 0.061 0.035 0.036
GCBERT -w/o HMCBL SWT 0.090 0.019 0.048 0.021 0.022
GCBERT -w/o HMCBL Tomcat 0.151 0.035 0.059 0.064 0.063
SCBERT -w/o HMCBL ZXing 0.222 0.112 0.143 0.190 0.150
SCBERT -w/o HMCBL PDE 0.230 0.049 0.142 0.095 0.069
SCBERT -w/o HMCBL AspectJ 0.271 0.148 0.250 0.161 0.165
SCBERT -w/o HMCBL JDT 0.217 0.051 0.136 0.111 0.091
SCBERT -w/o HMCBL SWT 0.250 0.062 0.095 0.167 0.185
SCBERT -w/o HMCBL Tomcat 0.285 0.053 0.265 0.092 0.069
BERT -w/ HMCBL ZXing 0.179 0.040 0.143 0.095 0.061
BERT -w/ HMCBL PDE 0.156 0.032 0.119 0.063 0.051
BERT -w/ HMCBL AspectJ 0.162 0.097 0.118 0.141 0.149
BERT -w/ HMCBL JDT 0.128 0.017 0.030 0.070 0.100
BERT -w/ HMCBL SWT 0.082 0.013 0.048 0.024 0.021
BERT -w/ HMCBL Tomcat 0.235 0.055 0.169 0.098 0.096
GCBERT -w/ HMCBL ZXing 0.189 0.118 0.143 0.143 0.118
GCBERT -w/ HMCBL PDE 0.180 0.042 0.142 0.087 0.058
GCBERT -w/ HMCBL AspectJ 0.172 0.065 0.167 0.065 0.060
GCBERT -w/ HMCBL JDT 0.125 0.022 0.061 0.035 0.030
GCBERT -w/ HMCBL SWT 0.105 0.018 0.048 0.026 0.022
GCBERT -w/ HMCBL Tomcat 0.253 0.062 0.188 0.104 0.084
SCBERT -w/ HMCBL ZXing 0.439 0.226 0.429 0.250 0.225
SCBERT -w/ HMCBL PDE 0.248 0.045 0.190 0.103 0.076
SCBERT -w/ HMCBL AspectJ 0.309 0.169 0.278 0.198 0.196
SCBERT -w/ HMCBL JDT 0.306 0.026 0.288 0.096 0.064
SCBERT -w/ HMCBL SWT 0.283 0.085 0.159 0.177 0.170
SCBERT -w/ HMCBL Tomcat 0.386 0.073 0.360 0.135 0.107

6.6 Threats to Validity
Our results should be interpreted with several threats to validity in mind. As bug-inducing changes are identified using the SZZ algorithm [70], one threat to the internal validity of the results is that possible noise introduced by SZZ may make the mapping between bug reports and bug-inducing changesets imprecise. However, the dataset used in the study has been validated manually [79], so this threat is minimized. Another threat to internal validity is that the dataset may contain tangled changes [77]. While we do believe tangled changes can affect our results, the dataset has been widely used for changeset-based bug localization studies [17, 79], and removing tangled changes completely is extraordinarily difficult.

With regard to threats to external validity, one potential issue is that the evaluation is conducted on a limited number of bugs from several open-source projects. However, these projects feature various purposes and development styles. Also, the dataset can be considered the de-facto evaluation target for changeset-based bug localization studies, and prior studies have widely used it [17, 79].

7 CONCLUSION
In this paper, we aim to advance state-of-the-art BERT-based bug localization techniques, which currently suffer from two issues: the pre-trained BERT models on source code are not robust enough to capture code semantics, and the overall bug localization models neglect the necessity of large-scale negative samples in contrastive learning and ignore the lexical similarity between bug reports and changesets. To address these two issues, we 1) propose a novel directed, multiple-label Semantic Flow Graph (SFG), which compactly and adequately captures code semantics, 2) design and train SemanticCodeBERT on the basis of SFG, and 3) design a novel Hierarchical Momentum Contrastive Bug Localization technique (HMCBL). Evaluation results confirm that our method achieves state-of-the-art performance.

8 DATA AVAILABILITY
Our replication package (including code, model, etc.) is publicly available at https://github.com/duyali2000/SemanticFlowGraph.

ACKNOWLEDGMENTS
We are very grateful to the anonymous ESEC/FSE reviewers for their valuable feedback on this work. This work was partially supported by the National Natural Science Foundation of China (Grant No. 62102233), the Shandong Province Overseas Outstanding Youth Fund (Grant No. 2022HWYQ-043), and the Qilu Young Scholar Program of Shandong University.

REFERENCES
[1] Miltiadis Allamanis, Earl T. Barr, Christian Bird, and Charles Sutton. 2014. Learning Natural Coding Conventions. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. Association for Computing Machinery. https://doi.org/10.1145/2635868.2635883
[2] Miltiadis Allamanis, Earl T. Barr, Premkumar Devanbu, and Charles Sutton. 2018. A Survey of Machine Learning for Big Code and Naturalness. ACM Comput. Surv. (2018). https://doi.org/10.1145/3212695
[3] Miltiadis Allamanis, Marc Brockschmidt, and Mahmoud Khademi. 2018. Learning to Represent Programs with Graphs. In International Conference on Learning Representations.
[4] Miltiadis Allamanis and Charles Sutton. 2014. Mining Idioms from Source Code. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. Association for Computing Machinery. https://doi.org/10.1145/2635868.2635901
[5] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2018. A General Path-Based Representation for Predicting Program Properties. SIGPLAN Not. (2018). https://doi.org/10.1145/3192366.3192412
[6] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. Code2vec: Learning Distributed Representations of Code. Proc. ACM Program. Lang. POPL (2019). https://doi.org/10.1145/3290353
[7] Tal Ben-Nun, Alice Shoshana Jakobovits, and Torsten Hoefler. 2018. Neural Code Comprehension: A Learnable Representation of Code Semantics. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. Curran Associates Inc.


[8] Benjamin Bichsel, Veselin Raychev, Petar Tsankov, and Martin Vechev. 2016. Statistical Deobfuscation of Android Applications. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. Association for Computing Machinery. https://doi.org/10.1145/2976749.2978422
[9] Nghi D. Q. Bui, Yijun Yu, and Lingxiao Jiang. 2021. Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations. In The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event. ACM.
[10] Luca Buratti, Saurabh Pujar, Mihaela Bornea, Scott McCarley, Yunhui Zheng, Gaetano Rossiello, Alessandro Morari, Jim Laredo, Veronika Thost, Yufan Zhuang, et al. 2020. Exploring software naturalness through neural language models. arXiv preprint arXiv:2006.12641 (2020).
[11] Xianshuai Cao, Yuliang Shi, Jihu Wang, Han Yu, Xinjun Wang, and Zhongmin Yan. 2022. Cross-modal Knowledge Graph Contrastive Learning for Machine Learning Method Recommendation. In MM '22: The 30th ACM International Conference on Multimedia. ACM.
[12] Kwonsoo Chae, Hakjoo Oh, Kihong Heo, and Hongseok Yang. 2017. Automatically Generating Features for Learning Program Analysis Heuristics for C-like Languages. Proc. ACM Program. Lang. (2017). https://doi.org/10.1145/3133925
[13] Qibin Chen, Jeremy Lacomis, Edward J. Schwartz, Graham Neubig, Bogdan Vasilescu, and Claire Le Goues. 2022. VarCLR: Variable Semantic Representation Pre-training via Contrastive Learning. In 44th IEEE/ACM International Conference on Software Engineering. ACM.
[14] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning. PMLR.
[15] Xinlei Chen, Haoqi Fan, Ross B. Girshick, and Kaiming He. 2020. Improved Baselines with Momentum Contrastive Learning. CoRR (2020).
[16] Hyunsoo Cho, Jinseok Seol, and Sang-goo Lee. 2021. Masked Contrastive Learning for Anomaly Detection. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, Virtual Event / Montreal. ijcai.org.
[17] Agnieszka Ciborowska and Kostadin Damevski. 2022. Fast changeset-based bug localization with BERT. In 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE). https://doi.org/10.1145/3510003.3510042
[18] Agnieszka Ciborowska, Michael J Decker, and Kostadin Damevski. 2022. Online Adaptable Bug Localization for Rapidly Evolving Software. arXiv preprint arXiv:2203.03544 (2022).
[19] Chris Cummins, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. 2017. End-to-End Deep Learning of Optimization Heuristics. In 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). https://doi.org/10.1109/PACT.2017.24
[20] Yaniv David, Uri Alon, and Eran Yahav. 2020. Neural reverse engineering of stripped binaries using augmented control flow graphs. Proceedings of the ACM on Programming Languages (2020). https://doi.org/10.1145/3428293
[21] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. https://doi.org/10.18653/V1/N19-1423
[22] Yali Du, Yinwei Wei, Wei Ji, Fan Liu, Xin Luo, and Liqiang Nie. 2023. Multi-queue Momentum Contrast for Microvideo-Product Retrieval. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, WSDM 2023, Singapore, 27 February 2023 - 3 March 2023. ACM. https://doi.org/10.1145/3539597.3570405
[23] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. Codebert: A pre-trained model for programming and natural languages. (2020). https://doi.org/10.18653/V1/2020.FINDINGS-EMNLP.139
[24] Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. [n. d.]. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. ([n. d.]).
[25] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. Graphcodebert: Pre-training code representations with data flow. (2021).
[26] Jin Guo, Jinghui Cheng, and Jane Cleland-Huang. 2017. Semantically Enhanced Software Traceability Using Deep Learning Techniques. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). https://doi.org/10.1109/ICSE.2017.9
[27] Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. DeepFix: Fixing Common C Language Errors by Deep Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. AAAI Press. https://doi.org/10.1609/AAAI.V31I1.10742
[28] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. https://doi.org/10.1109/CVPR42600.2020.00975
[29] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel, and Premkumar Devanbu. 2012. On the Naturalness of Software. In Proceedings of the 34th International Conference on Software Engineering. https://doi.org/10.1145/2902362
[30] Xing Hu, Yuhan Wei, Ge Li, and Zhi Jin. 2017. CodeSum: Translate Program Language to Natural Language.
[31] Xuan Huo, Ming Li, and Zhi-Hua Zhou. 2020. Control Flow Graph Embedding Based on Multi-Instance Decomposition for Bug Localization. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7-12, 2020. AAAI Press.
[32] Xuan Huo, Ferdian Thung, Ming Li, David Lo, and Shu-Ting Shi. 2019. Deep transfer bug localization. IEEE Transactions on Software Engineering (2019). https://doi.org/10.1109/TSE.2019.2920771
[33] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436 (2019).
[34] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International conference on machine learning.
[35] Xue Jiang, Zhuoran Zheng, Chen Lyu, Liang Li, and Lei Lyu. 2021. Treebert: A tree-based pre-trained model for programming language. In Uncertainty in Artificial Intelligence. PMLR.
[36] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2019. Billion-scale similarity search with gpus. IEEE Transactions on Big Data (2019). https://doi.org/10.1109/TBDATA.2019.2921572
[37] James A Jones and Mary Jean Harrold. 2005. Empirical evaluation of the tarantula automatic fault-localization technique. In Proceedings of the 20th IEEE/ACM international Conference on Automated software engineering. https://doi.org/10.1145/1101908.1101949
[38] Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi. 2019. Pre-trained contextual embedding of source code. (2019).
[39] Minguk Kang and Jaesik Park. 2020. ContraGAN: Contrastive Learning for Conditional Image Generation. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020.
[40] Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. ACM. https://doi.org/10.1145/3397271.3401075
[41] Dongsun Kim, Yida Tao, Sunghun Kim, and Andreas Zeller. 2013. Where should we fix this bug? a two-phase recommendation model. IEEE Transactions on Software Engineering (2013). https://doi.org/10.1109/TSE.2013.24
[42] Kisub Kim, Sankalp Ghatpande, Kui Liu, Anil Koyuncu, Dongsun Kim, Tegawendé F Bissyandé, Jacques Klein, and Yves Le Traon. 2022. DigBug—Pre/post-processing operator selection for accurate bug localization. Journal of Systems and Software (2022).
[43] Ted Kremenek, Andrew Y. Ng, and Dawson Engler. 2007. A Factor Graph Model for Software Bug Finding. In Proceedings of the 20th International Joint Conference on Artifical Intelligence. Morgan Kaufmann Publishers Inc.
[44] An Ngoc Lam, Anh Tuan Nguyen, Hoan Anh Nguyen, and Tien N Nguyen. 2017. Bug localization with combination of deep learning and information retrieval. In 2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC). https://doi.org/10.1109/ICPC.2017.24
[45] Hao Lan, Li Chen, and Baochun Li. 2021. Accelerated Device Placement Optimization with Contrastive Learning. In ICPP 2021: 50th International Conference on Parallel Processing. ACM.
[46] Zhengliang Li, Zhiwei Jiang, Xiang Chen, Kaibo Cao, and Qing Gu. 2021. Laprob: a label propagation-based software bug localization method. Information and Software Technology (2021). https://doi.org/10.1016/J.INFSOF.2020.106410
[47] Jinfeng Lin, Yalin Liu, Qingkai Zeng, Meng Jiang, and Jane Cleland-Huang. 2021. Traceability Transformed: Generating More Accurate Links with Pre-Trained BERT Models. In Proceedings of the 43rd International Conference on Software Engineering. https://doi.org/10.1109/ICSE43902.2021.00040
[48] Chao Liu, Xifeng Yan, Long Fei, Jiawei Han, and Samuel P Midkiff. 2005. SOBER: statistical model-based bug localization. ACM SIGSOFT Software Engineering Notes (2005). https://doi.org/10.1145/1081706.1081753
[49] Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In 7th International Conference on Learning Representations. OpenReview.net.
[50] Yiling Lou, Qihao Zhu, Jinhao Dong, Xia Li, Zeyu Sun, Dan Hao, Lu Zhang, and Lingming Zhang. 2021. Boosting coverage-based fault localization via graph-based representation learning. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
[51] Yi-Fan Ma, Yali Du, and Ming Li. 2023. Capturing the Long-Distance Dependency in the Control Flow Graph via Structural-Guided Attention for Bug Localization. In Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI 2023, 19th-25th August 2023, Macao, SAR, China. ijcai.org. https://doi.org/10.24963/IJCAI.2023/249
[52] Yi-Fan Ma and Ming Li. 2022. The flowing nature matters: feature learning from the control flow graph of source code for bug localization. Mach. Learn. (2022). https://doi.org/10.1007/S10994-021-06078-4


[53] Yi-Fan Ma and Ming Li. 2022. Learning from the Multi-Level Abstraction of the Control Flow Graph via Alternating Propagation for Bug Localization. In IEEE International Conference on Data Mining. IEEE.
[54] Andrew L Maas, Awni Y Hannun, Andrew Y Ng, et al. 2013. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml.
[55] Ginika Mahajan and Neha Chaudhary. 2022. Design and development of novel hybrid optimization-based convolutional neural network for software bug localization. Soft Computing (2022). https://doi.org/10.1007/S00500-022-07341-Z
[56] Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. 2016. Convolutional Neural Networks over Tree Structures for Programming Language Processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. AAAI Press. https://doi.org/10.1609/AAAI.V30I1.10139
[57] Vijayaraghavan Murali, Lee Gross, Rebecca Qian, and Satish Chandra. 2021. Industry-scale IR-based bug localization: a perspective from Facebook. In 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice. https://doi.org/10.1109/ICSE-SEIP52600.2021.00028
[58] Chao Ni, Wei Wang, Kaiwen Yang, Xin Xia, Kui Liu, and David Lo. 2022. The best of both worlds: integrating semantic features with expert features for defect prediction and localization. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
[59] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
[60] Mike Papadakis and Yves Le Traon. 2015. Metallaxis-FL: mutation-based fault localization. Software Testing, Verification and Reliability (2015). https://doi.org/10.1002/STVR.1509
[61] Renaud Pawlak, Martin Monperrus, Nicolas Petitprez, Carlos Noguera, and Lionel Seinturier. 2015. Spoon: A Library for Implementing Analyses and Transformations of Java Source Code. Software: Practice and Experience (2015).
[62] Michael Pradel, Vijayaraghavan Murali, Rebecca Qian, Mateusz Machalica, Erik Meijer, and Satish Chandra. 2020. Scaffle: bug localization on millions of files. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis.
[63] Michael Pradel and Koushik Sen. 2018. DeepBugs: A Learning Approach to Name-Based Bug Detection. Proc. ACM Program. Lang. (2018). https://doi.org/10.1145/3276517
[64] Binhang Qi, Hailong Sun, Wei Yuan, Hongyu Zhang, and Xiangxin Meng. 2022. DreamLoc: A Deep Relevance Matching-Based Framework for bug Localization. IEEE Transactions on Reliability (2022). https://doi.org/10.1109/TR.2021.3104728
[65] Shibo Qi, Rize Jin, and Joon-Young Paik. 2022. eMoCo: Sentence Representation Learning With Enhanced Momentum Contrast. In Proceedings of the 5th International Conference on Computer Science and Software Engineering.
[66] Yiyue Qian, Yiming Zhang, Qianlong Wen, Yanfang Ye, and Chuxu Zhang. 2022. Rep2Vec: Repository Embedding via Heterogeneous Graph Adversarial Contrastive Learning. In KDD '22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM.
[67] Fangcheng Qiu, Meng Yan, Xin Xia, Xinyu Wang, Yuanrui Fan, Ahmed E Hassan, and David Lo. 2020. JITO: a tool for just-in-time defect identification and localization. In Proceedings of the 28th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering.
[68] Mohammad Masudur Rahman and Chanchal K Roy. 2018. Improving IR-based bug localization with context-aware query reformulation. In Proceedings of the 2018 26th ACM joint meeting on European software engineering conference and symposium on the foundations of software engineering. https://doi.org/10.1145/3236024.3236065
[69] Stephen E. Robertson, Steve Walker, and Micheline Hancock-Beaulieu. 2000. Experimentation as a way of life: Okapi at TREC. Inf. Process. Manag. (2000).
[70] Giovanni Rosa, Luca Pascarella, Simone Scalabrino, Rosalia Tufano, Gabriele Bavota, Michele Lanza, and Rocco Oliveto. 2021. Evaluating SZZ Implementations Through a Developer-Informed Oracle. In Proceedings of the 43rd International Conference on Software Engineering.
[71] Ripon K Saha, Matthew Lease, Sarfraz Khurshid, and Dewayne E Perry. 2013. Improving bug localization using structured information retrieval. In 2013 28th IEEE/ACM International Conference on Automated Software Engineering (ASE). https://doi.org/10.1109/ASE.2013.6693093
[72] Shivkumar Shivaji, E James Whitehead, Ram Akella, and Sunghun Kim. 2012. Reducing features to improve code change-based bug prediction. IEEE Transactions on Software Engineering (2012). https://doi.org/10.1109/TSE.2012.43
[73] Marius Smytzek and Andreas Zeller. 2022. SFLKit: a workbench for statistical fault localization. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
[74] Chenning Tao, Qi Zhan, Xing Hu, and Xin Xia. 2022. C4: contrastive cross-language code clone detection. In Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension. ACM.
[75] Simon Urli, Zhongxing Yu, Lionel Seinturier, and Martin Monperrus. 2018. How to design a program repair bot? insights from the repairnator project. In 2018 IEEE/ACM 40th international conference on software engineering: software engineering in practice Track (ICSE-SEIP). https://doi.org/10.1145/3183519.3183540
[76] Iris Vessey. 1985. Expertise in debugging computer programs: A process analysis. International Journal of Man-Machine Studies (1985). https://doi.org/10.1016/S0020-7373(85)80054-7
[77] Min Wang, Zeqi Lin, Yanzhen Zou, and Bing Xie. 2020. CoRA: Decomposing and Describing Tangled Code Changes for Reviewer. In Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering.
[78] Xuanrun Wang, Kanglin Yin, Qianyu Ouyang, Xidao Wen, Shenglin Zhang, Wenchi Zhang, Li Cao, Jiuxue Han, Xing Jin, and Dan Pei. 2022. Identifying Erroneous Software Changes through Self-Supervised Contrastive Learning on Time Series Data. In IEEE 33rd International Symposium on Software Reliability Engineering, ISSRE 2022, Charlotte, NC, USA, October 31 - Nov. 3, 2022. IEEE.
[79] Ming Wen, Rongxin Wu, and Shing-Chi Cheung. 2016. Locus: Locating Bugs from Software Changes. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. Association for Computing Machinery. https://doi.org/10.1145/2970276.2970359
[80] W Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. 2016. A survey on software fault localization. IEEE Transactions on Software Engineering (2016). https://doi.org/10.1109/TSE.2016.2521368
[81] Rongxin Wu, Ming Wen, Shing-Chi Cheung, and Hongyu Zhang. 2018. Changelocator: locate crash-inducing changes based on crash reports. Empirical Software Engineering (2018). https://doi.org/10.1007/S10664-017-9567-4
[82] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. 2018. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE conference on computer vision and pattern recognition. https://doi.org/10.1109/CVPR.2018.00393
[83] Huan Xie, Yan Lei, Meng Yan, Yue Yu, Xin Xia, and Xiaoguang Mao. 2022. A Universal Data Augmentation Approach for Fault Localization. In Proceedings of the 44th International Conference on Software Engineering. Association for Computing Machinery. https://doi.org/10.1145/3510003.3510136
[84] Zhongxing Yu, Chenggang Bai, and Kai-Yuan Cai. 2013. Mutation-Oriented Test Data Augmentation for GUI Software Fault Localization. Inf. Softw. Technol. (2013). https://doi.org/10.1016/j.infsof.2013.07.004
[85] Zhongxing Yu, Chenggang Bai, and Kai-Yuan Cai. 2015. Does the Failing Test Execute a Single or Multiple Faults? An Approach to Classifying Failing Tests. In Proceedings of the 37th International Conference on Software Engineering - Volume 1. IEEE Press.
[86] Zhongxing Yu, Hai Hu, Chenggang Bai, Kai-Yuan Cai, and W. Eric Wong. 2011. GUI Software Fault Localization Using N-gram Analysis. In 2011 IEEE 13th International Symposium on High-Assurance Systems Engineering. https://doi.org/10.1109/HASE.2011.29
[87] Zhongxing Yu, Matias Martinez, Zimin Chen, Tegawendé F. Bissyandé, and Martin Monperrus. 2023. Learning the Relation Between Code Features and Code Transforms With Structured Prediction. IEEE Transactions on Software Engineering (2023). https://doi.org/10.1109/TSE.2023.3275380
[88] Zhongxing Yu, Matias Martinez, Benjamin Danglot, Thomas Durieux, and Martin Monperrus. 2019. Alleviating Patch Overfitting with Automatic Test Generation: A Study of Feasibility and Effectiveness for the Nopol Repair System. Empirical Softw. Engg. (2019). https://doi.org/10.1007/s10664-018-9619-4
[89] Chenxi Zhang, Xin Peng, Tong Zhou, Chaofeng Sha, Zhenghui Yan, Yiru Chen, and Hong Yang. 2022. TraceCRL: contrastive representation learning for microservice trace analysis. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM.
[90] Jinglei Zhang, Rui Xie, Wei Ye, Yuhan Zhang, and Shikun Zhang. 2020. Exploiting code knowledge graph for bug localization via bi-directional attention. In Proceedings of the 28th International Conference on Program Comprehension.
[91] Jian Zhou, Hongyu Zhang, and David Lo. 2012. Where Should the Bugs Be Fixed? - More Accurate Information Retrieval-Based Bug Localization Based on Bug Reports. In Proceedings of the 34th International Conference on Software Engineering. IEEE Press.
[92] Yaqin Zhou, Shangqing Liu, Jingkai Siow, Xiaoning Du, and Yang Liu. 2019. Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks. Curran Associates Inc.
[93] Ziye Zhu, Yun Li, Hanghang Tong, and Yu Wang. 2020. Cooba: Cross-project bug localization via adversarial transfer learning. In IJCAI. https://doi.org/10.24963/IJCAI.2020/493
[94] Ziye Zhu, Yun Li, Yu Wang, Yaojing Wang, and Hanghang Tong. 2021. A deep multimodal model for bug localization. Data Mining and Knowledge Discovery (2021).
[95] Ziye Zhu, Hanghang Tong, Yu Wang, and Yun Li. 2022. Enhancing bug localization with bug report decomposition and code hierarchical network. Knowledge-Based Systems (2022). https://doi.org/10.1016/J.KNOSYS.2022.108741
[96] Weiqin Zou, David Lo, Zhenyu Chen, Xin Xia, Yang Feng, and Baowen Xu. 2020. How Practitioners Perceive Automated Bug Report Management Techniques. IEEE Transactions on Software Engineering (2020). https://doi.org/10.1109/TSE.2018.2870414

Received 2023-02-02; accepted 2023-07-27
