0% found this document useful (0 votes)
3 views13 pages

Fault Localization To Detect Co-Change Fixing Locations

The document presents FixLocator, a deep learning-based fault localization approach designed to identify co-change fixing locations in automated program repair. It utilizes dual-task learning with two models, MethFL and StmtFL, to improve the detection of multiple interdependent faulty statements that need to be fixed together. Empirical results demonstrate that FixLocator significantly outperforms existing fault localization methods, enhancing the effectiveness of automated program repair tools.

Uploaded by

yu pei
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views13 pages

Fault Localization To Detect Co-Change Fixing Locations

The document presents FixLocator, a deep learning-based fault localization approach designed to identify co-change fixing locations in automated program repair. It utilizes dual-task learning with two models, MethFL and StmtFL, to improve the detection of multiple interdependent faulty statements that need to be fixed together. Empirical results demonstrate that FixLocator significantly outperforms existing fault localization methods, enhancing the effectiveness of automated program repair tools.

Uploaded by

yu pei
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

Fault Localization to Detect Co-change Fixing Locations

Yi Li Shaohua Wang∗ Tien N. Nguyen


New Jersey Institute of Technology New Jersey Institute of Technology University of Texas at Dallas
New Jersey, USA New Jersey, USA Texas, USA
[email protected] [email protected] [email protected]

ABSTRACT 1 INTRODUCTION
Fault Localization (FL) is a precursor step to most Automated Pro- To assist developers in the bug-detecting and fixing process, several
gram Repair (APR) approaches, which fix the faulty statements approaches have been proposed for Automated Program Repair
identified by the FL tools. We present FixLocator, a Deep Learn- (APR) [21]. A common usage of an APR tool is that one needs to
ing (DL)-based fault localization approach supporting the detection use a fault localization (FL) tool [41] to locate the faulty statements
of faulty statements in one or multiple methods that need to be that must be fixed, and then uses an APR tool to generate the fixing
modified accordingly in the same fix. Let us call them co-change changes for those detected statements. The input of an FL model
(CC) fixing locations for a fault. We treat this FL problem as dual- is the execution of a test suite, in which some of the test cases
task learning with two models. The method-level FL model, MethFL, are passing or failing ones. Specifically, the key input is the code
learns the methods to be fixed together. The statement-level FL coverage matrix in which the rows and columns correspond to the
model, StmtFL, learns the statements to be co-fixed. Correct learning statements and test cases, respectively. Each cell is assigned with
in one model can benefit the other and vice versa. Thus, we simul- the value of 1 if the statement is executed in the respective test
taneously train them with soft-sharing the models’ parameters via case, and with the value of 0, otherwise. An FL model uses such
cross-stitch units to enable the propagation of the impact of MethFL information to identify the list of suspicious lines of code that are
and StmtFL onto each other. Moreover, we explore a novel feature for ranked based on their associated suspiciousness scores [41]. In recent
FL: the co-changed statements. We also use Graph-based Convolu- advanced FL, several approaches also support fault localization at
tion Network to integrate different types of program dependencies. method level to locate faulty methods [22, 24].
Our empirical results show that FixLocator relatively improves The FL approaches can be broadly divided into the following cate-
over the state-of-the-art statement-level FL baselines by locating gories: spectrum-based fault localization (SBFL) [6, 17, 18], mutation-
26.5%–155.6% more CC fixing statements. To evaluate its usefulness based fault localization (MBFL) [31, 34, 35], and machine learning
in APR, we used FixLocator in combination with the state-of-the- (ML) and deep learning (DL) fault localization [22, 24]. For SBFL
art APR tools. The results show that FixLocator+DEAR (the origi- approaches, the key idea is that a line covered more in the fail-
nal FL in DEAR replaced by FixLocator) and FixLocator+CURE ing test cases than in the passing ones is more suspicious than a
improve relatively over the original DEAR and Ochiai+CURE by line executed more in the passing ones. To improve SBFL, MBFL
10.5% and 42.9% in terms of the number of fixed bugs. approaches [31, 34, 35] enhance the code coverage matrix by mod-
ifying a statement with mutation operators, and collecting code
CCS CONCEPTS coverage when executing the mutated programs with the test cases.
• Software and its engineering → Software testing and debug- The MBFL approaches apply suspiciousness score formulas in the
ging. same manner as in SBFL approaches on the matrix for each origi-
nal statement and its mutated code. Finally, ML and DL-based FL
KEYWORDS approaches explore the code coverage matrix and apply different
neural network models for fault localization.
Fault Localization; Deep Learning; Co-Change Fixing Locations
Despite their successes, the state-of-the-art FL approaches are
ACM Reference Format: still limited in locating all dependent fixing locations that need to
Yi Li, Shaohua Wang, and Tien N. Nguyen. 2022. Fault Localization to Detect be repaired at the same time in the same fix. In practice, there are
Co-change Fixing Locations. In Proceedings of the 30th ACM Joint European many bugs that require dependent changes in the same fix to multiple
Software Engineering Conference and Symposium on the Foundations of Soft-
lines of code in one or multiple hunks of the same or different methods
ware Engineering (ESEC/FSE ’22), November 14–18, 2022, Singapore, Singapore.
for the program to pass the test cases. For those bugs, applying the
ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3540250.3549137
fixing change to individual statements once at a time will not make
∗ Corresponding Author the program pass the test case after the change to one statement.
This capability to detect the fixing locations of the co-changes in a
Permission to make digital or hard copies of all or part of this work for personal or fix for a bug (let us call them Co-change (CC) Fixing Locations) is
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
crucial for an APR tool. Such capability will enable an APR tool to
on the first page. Copyrights for components of this work owned by others than ACM make the correct and complete changes to fix a bug.
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, The state-of-the-art FL approaches do not satisfy that require-
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from [email protected]. ment. From the ranked list of suspicious statements returned from
ESEC/FSE ’22, November 14–18, 2022, Singapore, Singapore an existing FL model, a naive approach to detect CC fixing locations
© 2022 Association for Computing Machinery. would be to take the top 𝑘 statements in that list and to consider
ACM ISBN 978-1-4503-9413-0/22/11. . . $15.00
https://doi.org/10.1145/3540250.3549137 them as to be fixed together. This solution might be ineffective

659
ESEC/FSE ’22, November 14–18, 2022, Singapore, Singapore Yi Li, Shaohua Wang, and Tien N. Nguyen

because the mechanisms used in the state-of-the-art FL approaches


1 public void toSource(final CodeBuilder cb, int inputSeqNum, Node root) {
have never considered the co-change nature of those fixes. Our 2 ...
3 - String code = toSource(root, sourceMap);
empirical evaluation also confirmed that (Section 8.1). 4 + String code = toSource(root, sourceMap, inputSeqNum == 0);
Detecting all the CC fixing locations at multiple statements in 5 if (!code.isEmpty()) {
6 cb.append(code);
potentially multiple methods is challenging. A naive solution would 7 } ...
be detecting the potential methods that need to be fixed together 8 }
9 //--------------------------------------------------------------------------
and then detecting potential statements that need to be changed 10 @Override
together in each of those methods. However, doing so will create 11 String toSource(Node n) {
12 initCompilerOptionsIfTesting();
a confounding effect from the inaccuracy of the detection of the 13 - return toSource(n, null);
14 + return toSource(n, null, true);
co-fixed methods to that of the co-fixed statements. 15 }
We propose FixLocator, a fault localization approach to de- 16 //--------------------------------------------------------------------------
17 - private String toSource(Node n, SourceMap sourceMap)
rive the co-change fixing locations in the same fix for a fault (i.e, 18 + private String toSource(Node n, SourceMap sourceMap, boolean firstOutput)
multiple faulty statements in possible multiple faulty methods). To 19 ......
20 builder.setSourceMapDetailLevel(options.sourceMapDetailLevel);
avoid the confounding effect in that naive solution, we treat this 21 - builder.setTagAsStrict(
problem as dual-task learning with two dedicated models. First, 22 + builder.setTagAsStrict(firstOutput &&
23 options.getLanguageOut(a) == LanguageMode.ECMASCRIPT5_STRICT);
the method-level FL model (MethFL) learns the methods that need to 24 builder.setLineLengthThreshold(options.lineLengthThreshold);
25 ......
be modified in the same fix. Second, the statement-level FL model 26 }
(StmtFL) learns the co-fixed statements in the same or different meth-
ods. The intuition is that they are closely related, which we refer Figure 1: Co-Change Fixing Locations for a Fault
to as duality. Correct learning for a model can benefit the other and
vice versa. If two statements in two methods are fixed together for
a bug, those methods are also co-fixed. If two methods are co-fixed, with FixLocator for a variant, DEAR𝐹𝑖𝑥𝐿 . Our result shows that
some of their statements are also co-fixed. Exploring this duality DEAR𝐹𝑖𝑥𝐿 and FixLocator+CURE improve relatively DEAR and
can provide useful constraints to detect CC fixing locations for a Ochiai+CURE by 10.5% and 42.9% in terms of numbers of fixed bugs.
bug. Thus, instead of cascading the two models MethFL and StmtFL, Through our ablation analysis on the impact of different features
we train them simultaneously with the soft-sharing of the mod- and modules of FixLocator, we showed that all designed features/-
els’ parameters to exploit this duality. Specifically, we leverage the modules have contributed to its high performance. Specifically, the
cross-stitch units [30] to connect MethFL and StmtFL. In a cross-stitch proposed dual-task learning significantly improves the statement-
unit, the sharing of representations between MethFL and StmtFL is level FL by up to 12.8% in terms of Hit-1. The designed feature
modeled by learning a linear combination of the input features of co-change relations among methods and statements has also
from the two models. The cross-stitch units enable the propagation positively contributed to FixLocator’s high accuracy level.
of the impact of MethFL and StmtFL on each other. The contributions of this paper are listed as follows:
In addition to the new solution in dual-task learning, we utilize (1) FixLocator: Advancing DL-based Fault Localization
a novel feature for this CC fixing location problem: co-changed to derive the co-change fixing locations (multiple faulty
statements, which have never been exploited in FL. The rationale statements) in the same fix for a bug. We treat that problem
is that the co-changed statements in the past might become the as dual-task learning to propagate the impact between
statements that will be fixed together in the future. Finally, since the the method-level and statement-level FL.
co-fixed statements are often interdependent, we use Graph-based (2) Novel graph-based representation learning with GCN
Convolution Network (GCN) [26] to integrate different types of and novel type of features in co-changed statements for
program dependencies among statements, e.g., data and control FL enable dual-task learning to derive CC fixing locations.
dependencies, execution traces, stack traces, etc. We also encode (3) Extensive empirical evaluation. We evaluated FixLoca-
test coverage and co-changed/co-fixed statements in the graph. The tor against the recent DL-based FL models to show its accu-
GCN model learns and predicts the bugginess of the statements. racy and usefulness in APR. Our data/tool are available [3].
We conducted several experiments to evaluate FixLocator on
Defects4J-v2.0 [1]. Our empirical results show that FixLocator 2 MOTIVATING EXAMPLE
improves the baselines, CNN-FL [50], DeepFL [22], DeepRL4FL [24],
and DEAR’s FL [25] by 16.6%, 16.9%, 9.9%, and 20.6% respectively, 2.1 Example and Observations
in terms of Hit-1 (i.e., the percentage of bugs in which the predicted Let us start with a real-world example. Figure 1 shows a bug fix in
set overlaps with the oracle set for at least one faulty statement), the Defects4J dataset that require multiple interdependent changes
and by 33.6%, 40.3%, 26.5%, and 57.5% in terms of Hit-2 (i.e., the to multiple statements in different methods. The bug occurred
percentage of bugs in which the number of overlapping statements when the method call to setTagAsStrict did not consider the first
between the predicted and oracle sets is ≥2), 43.9%, 46.4%, 28.1%, output in its arguments. Therefore, for fixing, a developer adds
and 51.9% in terms of Hit-3, respectively. FixLocator also improves a new argument in the method toSource at line 18, and uses that
those baselines by 32.0%, 38.8%, 20.8%, and 46.1% in terms of Hit-All argument in the method call setTagAsStrict (firstOutput,...) at line
(i.e., the predicted set exactly matches with the oracle set for a bug). 22. Because the method toSource at line 17 was changed, the two
To evaluate its usefulness in APR, we combined it with the APR callers at line 3 of the method toSource (line 1) and at line 13 of the
tools, DEAR [25] and CURE [15]. We replaced DEAR’s FL module method toSource (line 11) need to be changed accordingly.

660
Fault Localization to Detect Co-change Fixing Locations ESEC/FSE ’22, November 14–18, 2022, Singapore, Singapore

2.1.1 Observation 1 [Co-change Fixing Locations]. In this example, we design an approach that treats this FL problem of detecting
the changes to fix this bug involve multiple faulty statements that dependent CC fixing locations as dual-task learning between the
are dependent on one another. Fixing only one of the faulty state- method-level and statement-level FL. First, the method-level FL
ments will not make the program pass the failing test(s). Fixing model (MethFL) aims to learn the methods that need to be modified
individual statements once at a time in the ranked list returned in the same fix. Second, the statement-level FL model (StmtFL) aims
from an existing FL tool will also not make the program pass the to learn the co-fixing statements regardless of whether they are in
tests. For an APR model to work, an FL tool needs to point out all of the same or different methods.
those faulty statements to be changed in the same fix. For example, Intuitively, MethFL and StmtFL are related to each other, in which
all four faulty statements at lines 17, 21, 3, and 13 need to modified the results of one model can help the other. We refer to this relation
accordingly in the same fix to fix the bug in Figure 1. as duality, which can provide some useful constraints for FixLo-
cator to learn dependent CC fixing locations. We conjecture that
2.1.2 Observation 2 [Multiple Faulty Methods]. As seen, this bug the joint training of the two models can improve the performance
requires an APR tool to make changes to multiple statements in of both models, when we leverage the constraints of this duality
three different methods in the same fix: toSource(...) at lines 17 in term of shared representations. For example, if two statements
and 21, toSource(...) at line 3, and toSource (...) at line 13. Thus, it in two different methods 𝑚 1 and 𝑚 2 were observed to be changed
is important for an FL tool to connect and identify these multiple in the same fix, then it should help the model learn that 𝑚 1 and
faulty statements in potentially different methods. 𝑚 2 were also changed together to fix the bug. If two methods were
Traditional FL approaches [49, 52] using program analysis (PA), observed to be fixed together, then some of their statements were
e.g., execution flow analysis, are restricted to specific PA techniques, changed in the same fix as well. In our model, we jointly train MethFL
thus, not general to locate all types of CC fixing locations. Spectrum- and StmtFL with the soft-sharing of the models’ parameters to exploit
based [6, 16], mutation-based [31, 34, 35]), statistic-based [27], and their relation. Specifically, we use a mechanism, called cross-stitch
machine learning (ML)-based FL approaches [22, 24] could implic- unit [30], to learn a linear combination of the input features from
itly learn the program dependencies for FL purpose. However, de- those two models to enable the propagation of the impact of MethFL
spite their successes, the non-PA FL approaches do not support the and StmtFL on each other. We also add an attention mechanism in
detection of multiple locations that need to be changed in the same fix the two models to help emphasize on the key features.
for a bug, i.e., Co-Change (CC) Fixing Locations. The spectrum-based
and ML-based FL models return a ranked list of suspicious state- 2.2.2 Key Idea 2 [Co-change Representation Learning in Fault Lo-
ments according to the corresponding suspiciousness scores. In this calization]. In detecting CC fixing locations, in addition to a new
example, the lines 13, 17, 21, and the other lines (e.g., 12, 20 and dual-task learning model in key idea 1, we use a new feature: co-
24) are executed in the same passing or failing test cases, thus as- change information among statements/methods, which has never
signed with the same scores by spectrum- and mutation-based FL explored in prior fault localization research. The rationale is that
approaches. A user would not be informed on what lines need to be the co-changed statements/methods in the past might become the
fixed together. Those non-PA, especially ML-based FL approaches, statements/methods that will be fixed together for a bug in the fu-
do not have a mechanism to detect CC fixing locations. ture. We also encode the co-fixed statements/methods in the same
In this work, we aim to advance the level of deep learning (DL)- fixes. The co-changed/co-fixed statements/methods in the same
based FL approaches to detect CC fixing statements. However, it is commit are used to train the models.
not trivial. A solution of assuming the top-𝑘 suspicious statements
2.2.3 Key Idea 3 [Graph Modeling for Dependencies among State-
from a FL tool as CC fixing locations does not work because even
ments/Methods]. The statements/methods that need to be fixed
being the most suspicious, those statements might not need to be
together are interdependent via several dependencies. Thus, we
changed in the same fix. In this example, all of the above lines with
use Graph-based Convolution Network (GCN) [26] to model differ-
the same suspiciousness scores would confuse a fixer.
ent types of dependencies among statements/methods, e.g., data
Moreover, another naive solution would be to use a method-
and control dependencies in a program dependence graph (PDG),
level FL tool to detect multiple faulty methods first and then use a
execution traces, stack traces, etc. We encode the co-change/co-fix
statement-level FL tool to detect the statements within each faulty
relations into the graph representations with different types of edges
method. As we will show in Section 7, the inaccuracy of the first
representing different relations. The GCN model enables nodes’ and
phase of detecting faulty methods will have a confounding effect
edges’ attributes and learns to classify the nodes as buggy or not.
on the overall performance in detecting CC fixing statements.
3 FIXLOCATOR: APPROACH OVERVIEW
2.2 Key Ideas
We propose FixLocator, an FL approach to locate all the CC fixing
3.1 Training Process
locations (i.e., faulty statements) that need to be changed in the Figure 2 summarizes the training process. The input of training
same fix for a bug. In designing FixLocator, we have the following includes the passing and failing test cases, and the source code un-
key ideas in both new model and new features: der study. The output includes the trained method-level FL model
(detecting co-fixed methods) and the trained statement-level FL
2.2.1 Key Idea 1 [Dual-Task Learning for Fault Localization]. To model (detecting co-fixed statements). The training process has
avoid the confounding effect in a naive solution of detecting faulty three main steps: 1) Feature Extraction, 2) Graph-based Feature Rep-
methods first and then detecting faulty statements in those methods, resentation Learning, and 3) Dual-Task Learning Fault Localization.

661
ESEC/FSE ’22, November 14–18, 2022, Singapore, Singapore Yi Li, Shaohua Wang, and Tien N. Nguyen

Figure 2: FixLocator: Training Process

3.1.1 Step 1. Feature Extraction. (Section 4). We aim to extract the embeddings for all sub-tokens as we consider a method or state-
the important features for FL from the test coverage and source code ment as a sentence in each case. We then use Gated Recurrent Unit
including co-changes. The features are extracted from two levels: (GRU) [14] to produce the vector for the entire sequence.
statements and methods. At each level, we extract the important For the structure of a method or statement, the representation is a
attributes of statements/methods, as well as the crucial relations (sub)tree in the AST. For this, we first use GloVe [36] to produce the
among them. We use graphs to model those attributes and relations. embeddings for all the nodes in the sub-tree, considering the entire
For a method 𝑚, we collect as its attributes 1) method content: method or statement as a sentence in each case. After obtaining
the sequences of the sub-tokens of its code tokens (excluding sepa- the sub-tree where the nodes are replaced by their GloVe’s vectors,
rators and special tokens), and 2) method structure: the Abstract we use TreeCaps [12], which captures well the tree structure, to
Syntax Tree (AST) of the method. For the relations among methods, produce the embedding for the entire sub-tree.
we extract the relations involving in the following: For the code coverage representation, we directly use the two
1) Execution flow (the calling relation, i.e., 𝑚 calls 𝑛), vectors for coverage and passing/failing and concatenate them
2) Stack trace after a crash, i.e., the order relation among the to produce the embedding. The embedding for the most similar
methods in the stack trace (the dynamic information in execution buggy method is computed in the same manner as explained with
and stack traces have been showed to be useful in FL [22, 24]), GloVe and TreeCaps. Finally, the embeddings for the attributes
3) Co-change relation in the project history (two methods that of the nodes are used in the fully connected layers to produce
were changed in the same commit are considered to have the co- the embedding for each node in the feature graph at the method
change relation), level. Similarly, we obtain the feature graph at the statement level in
4) Co-fixing relation among the methods (two methods that were which each node is the resulting vector of the fully connected layers.
fixed for the same bug are considered to have the co-fixing relation),
5) Similarity: we also extract the similar methods in the project 3.1.3 Step 3. Dual-Task Learning Fault Localization. After the fea-
that have been buggy before in the project history. We keep only ture representation learning step, we obtain two feature graphs at
the most similar method for each method. the method and statement levels, in which a node in either graph
For a statement 𝑠, we extract both static and dynamic informa- is a vector representation. The two graphs are used as the input for
tion. First, for static data, we extract the AST subtree that corre- dual-task learning. For dual-task learning, we use two Graph-based
sponds to 𝑠 to represent its structure. We also extract the list of Convolution Network (GCN) models [19] for the method-level FL
variables in 𝑠 together with their types, forming a sequence of names model (MethFL) and the statement-level FL model (StmtFL) to learn the
and types, e.g., “name String price int ...". Second, for dynamic data, CC fixing methods and CC fixing statements, respectively. During
we encode the test coverage matrix for 𝑠 into the feature vectors. training, the two feature graphs at the method and statement levels
At both method and statement levels, we use graphs to represent are used as the inputs of MethFL and StmtFL. The two GCN models
the methods and statements, and their relations. Let us call them play the role of binary classifiers for the bugginess for the nodes
the method-level and statement-level feature graphs. (i.e., methods/statements). We train the two models simultaneously
with soft-sharing of parameters. Details will be given in Section 6.
3.1.2 Step 2. Graph-based Feature Representation Learning. This
step is aimed to learn the vector representations (i.e., embeddings) 3.2 Predicting Process
for the nodes in the feature graphs from step 1. The input includes The input of the prediction process (Figure 3) includes the test cases
the method-level and statement-level feature graphs. The output and the source code in the project. The steps 1–2 of the process is the
includes the embeddings for the nodes in the method-/statement- same as in training. In step 3, the feature graph 𝑔𝑀 at the statement
level feature graphs. The graph structures for both feature graphs level built from the source code is used as the input of the trained
are un-changed after this step. StmtFL model, which predicts the labels of the nodes in that graph.
For the content of a method or statement, we use the embedding The labels indicate the bugginess of the corresponding statements
techniques accordingly to feature representations (Section 5). For in the source code, which represent the CC fixing statements. If one
the method’s content and a list of variables in a statement, the repre- aims to predict the faulty methods, the trained MethFL model can be
sentation is a sequence of sub-tokens. We use GloVe [36] to produce used on the feature graph to produce the CC fixing methods.

662
Fault Localization to Detect Co-change Fixing Locations ESEC/FSE ’22, November 14–18, 2022, Singapore, Singapore

Figure 3: FixLocator: Prediction Process


Figure 5: Stmt-level Feature Extraction 𝑔𝑀 for 𝑀1 in Figure 4

3) Co-change/co-fixing relation: Such a relation exists between


two methods that were changed/fixed in a commit. Such an edge is
made into two one-directional edges (e.g., 𝑀5 ⇆ 𝑀6 in Figure 4).

4.2 Statement-Level Feature Extraction 𝑔𝑀


For each statement, we extract the following attributes.
1) Code coverage: we run the test cases and collect code cov-
erage information. For each statement 𝑠, we use a vector 𝐶 =
<𝑐 1, 𝑐 2, ..., 𝑐 𝐾 > (𝐾 is the number of test cases) to encode code cover-
Figure 4: Method-level Feature Extraction 𝐺 𝑀 for 𝑀1 age in which 𝑐𝑖 = 1 if the test 𝑡𝑖 covers 𝑠, and 𝑐𝑖 = 0 otherwise. We
use another vector 𝑅 = <𝑟 1, 𝑟 2, ..., 𝑟 𝐾 > to encode the passing/failing
of a test case in which 𝑟𝑖 = 1 if the test case 𝑡𝑖 is a passing one and
4 FEATURE EXTRACTION 𝑟𝑖 = 0 otherwise. 𝑅 is common for all the statements. We concate-
4.1 Method-Level Feature Extraction 𝐺 𝑀 nate 𝐶 and 𝑅 for each statement to obtain the code coverage feature
vector 𝑉𝐶𝑜𝑣 = <𝑐 1, 𝑐 2, ..., 𝑐 𝐾 , 𝑟 1, 𝑟 2, ..., 𝑟 𝐾 >. We used DeepRL4FL’s
Figure 4 illustrates the key attributes and relations that we collect. test ordering algorithm [24] as the ordering of test cases is useful in
For each method 𝑀1 , we extract the following attributes: FL. For the different numbers of test cases across files, we perform
1) The method’s content: we remove special characters and sepa- zero padding to make the vectors have the same length.
rators in the method’s interface and body, and use naming conven- 2) AST structure: we extract the sub-tree in the AST that corre-
tion to break each code token into the sub-tokens. For example, in sponds to the current statement.
Figure 4, the node 𝑀1 represents the method computeGeometricalPro- 3) List of variables: We break the names into sub-tokens. In Fig-
perties in Figure 5. For the content for 𝑀1 , the extracted sequence
ure 5, the sequence for the variable list is [tree, BSPTree, Euclidean2D,...].
of sub-tokens is protected, void, compute, Geometrical, Properties, etc. We encode the following types of relations among statements:
2) The method’s structure: the corresponding parser is used to 1) Program dependence graph (PDG): as suggested in [24], the
build the AST of the method (e.g., JDT [4] for Java code). relations among statements in an PDG are important in FL, thus,
3) Most similar faulty method: we keep the most similar faulty we integrate them into the feature graph. In Figure 5, the blue edges
method 𝑀𝑏 with 𝑀1 . Note that we keep 𝑀𝑏 as an attribute of 𝑀1 , represent the relations in the PDG for the given code. The statement
rather than representing 𝑀𝑏 in the feature graph. The rationale at line 4 has a control/data dependency with the one at line 5, which
is that 𝑀𝑏 might be in the past and might not be present in the connects to the ones at lines 7–8, and to the ones at lines 10–11.
current version of the project. Two methods are similar when they 2) Execution flow in an execution trace: if two statements are
have similar sequences (measured by the cosine similarity) of the executed consecutively in an execution trace, we will connect them
sub-tokens (represented by the GloVe embeddings [36]). For 𝑀𝑏 , together. In Figure 5, we have the execution flow 𝑆 5 → 𝑆 7 , 𝑆 7 → 𝑆 8 .
we build its AST and keep it as an attribute for 𝑀1 . 3) Co-change/co-fixing relation: we maintain the co-change/co-
We encode as the edges three types of relations: fixing relations among statements. In Figure 5, 𝑆 4 and 𝑆 5 have been
1) Calling relation in a stack trace: we encode into the feature changed in a commit, thus, two co-change edges connect them.
graph the calling relations in a stack trace of a failed test case as we
ran it. In Figure 4, a blue edge connects 𝑀𝑖 to 𝑀 𝑗 for that relation.
5 FEATURE REPRESENTATION LEARNING
Since the stack trace can be long, from the failing/crash point, we
collect only part of the stack trace with 𝑛 levels of depth from that The goal of this step is to learn to build the vector representations
point. Following a prior work [44], in our experiment, 𝑛=10. for the nodes in the feature graphs at the method and statement
2) Calling relations in an execution trace: Similar to the stack levels. At either level, the input includes the attributes of either a
trace, an execution trace needs to be encoded in the feature graph. method or a statement as in Figures 4 and 5. The output is each
It can be very long from the failing/crash point. Thus, we keep the feature graph in which the nodes are replaced by their embeddings.
methods with only 𝑚 levels of length in calling relations from that
point. In our experiment, we use 𝑚=10. Figure 4 illustrates a few 5.1 Method-Level Representation Learning
calling relations (in green color) in execution traces. Figure 6 shows how we build the vectors for a method’s attributes.

663
ESEC/FSE ’22, November 14–18, 2022, Singapore, Singapore Yi Li, Shaohua Wang, and Tien N. Nguyen

Figure 6: Method-level Feature Representation Learning

Figure 8: Dual-Task Learning Fault Localization

5.3 Feature Representation Learning


After computing the three embeddings for three attributes of each
method, we use three fully connected layers to standardize each
vector’s length to a chosen value 𝑙. Similarly, we use three fully con-
nected layers for the three embeddings for each statement. Then,
for a method or a statement, we concatenate the three output vec-
Figure 7: Statement-level Feature Representation Learning
tors from the fully connected layer to produce the vector 𝑉𝑀 for
the method and the other three vectors for 𝑉𝑆 for the statement
1) The method’s content: the method’s content is represented by with the length of (𝑙 × 3).
the sequence 𝑆𝑒𝑞𝑐 of the sub-tokens built from the code tokens in After all, for a method 𝑀, we have the method-level graph 𝐺 𝑀
the interface and the body of the method. To vectorize each sub- and the statement-level graph 𝑔𝑀 with its statements. The nodes in
token in 𝑆𝑒𝑞𝑐 , we use a word embedding model, called GloVe [36], 𝐺 𝑀 (Figure 4) now are the vectors computed for methods, and the
and treat each method as a sentence. After this vectorization, for nodes in 𝑔𝑀 are the vectors 𝑉𝑆 for the statements in 𝑀 (Figure 5).
the method, we obtain the sequence <𝑣 1, 𝑣 2, ..., 𝑣𝑛 > of the vectors
of the sub-tokens in 𝑆𝑒𝑞𝑐 . We then apply a sequential model on 6 DUAL-TASK LEARNING FOR FAULT
the sequence <𝑣 1, 𝑣 2, ..., 𝑣𝑛 > to learn the “summarized” vector 𝑉𝑀𝐶 LOCALIZATION
that represents the method’s content. Specifically, we use Gated Figures 8 and 9 illustrate our dual-task learning for fault localization.
Recurrent Unit (GRU) [13], a type of RNN layer that is efficient in In the training dataset, for each bug 𝐵, to ensure the matching of a
learning and capturing the information in a sequence. method and its corresponding statements, we build for each faulty
2) The method’s structure: we first treat the method as the se- method 𝑀 the pairs (𝐺 𝑀 , 𝑔𝑀 ): 1) 𝐺 𝑀 , the method-level graph (Fig-
quence of tokens and use GloVe to build the embeddings for all ure 4 with nodes replaced by vectors); and 2) 𝑔𝑀 , the statement-level
the tokens as in 1). We then replace every node in the AST of graph (Figure 5) containing all the statements belonging to 𝑀. To
the method with the GloVe’s vector of the corresponding token of ensure the co-fixing connections among the buggy methods for the
the node (Figure 6). From the tree of vectors, we use a tree-based same bug 𝐵, we model the co-fixed methods of 𝑀 via co-fixed rela-
model, called TreeCaps [12], to capture its structure to produce the tions in 𝐺 𝑀 (Figure 4). At the output layer, we label those methods
“summarized” vector 𝑉𝐴𝑆𝑇 representing the method’s structure. as faulty/co-fixing. The co-fixed statements within 𝑔𝑀 for the bug
3) Most similar faulty method: for a method, we process the most 𝐵 are also labeled as faulty/co-fixing. The non-buggy methods or
similar buggy method 𝑀𝑏 in the same way as the method’s structure statements are labeled as non-faulty. The pairs (𝐺 𝑀 , 𝑔𝑀 ) are used as
via GloVe and TreeCaps to learn the vector 𝑉𝑀𝑆𝐵𝑀 for 𝑀𝑏 . the input of this dual-task learning model (Figure 8). We process all
Finally, for each method 𝑀1 , we obtain 𝑉𝑀𝐶 , 𝑉𝐴𝑆𝑇 , and 𝑉𝑀𝑆𝐵𝑀 . the faulty methods 𝑀 for each bug 𝐵, and non-buggy methods.
In prediction, for each method 𝑀 ∗ in the project, we build the
5.2 Statement-Level Representation Learning pair (𝐺 𝑀 ∗ , 𝑔𝑀 ∗ ) and feed it to the trained dual-task model. In the
Figure 7 shows how we build the vectors for a statement’s attributes. output graphs, each node (for a method or a statement) will be
1) Code Coverage: we directly use the vector 𝑉𝐶𝑜𝑣 = <𝑐 1, 𝑐 2, ..., classified as either faulty/co-fixing or non-faulty. The nodes with
𝑐 𝐾 , 𝑟 1, 𝑟 2, ..., 𝑟 𝐾 > computed in Section 4.2 for the next computation. faulty/co-fixing labels in 𝑔𝑀 ∗ are the co-fixing statements for the
2) The statement’s structure: we process the AST subtree repre- bug. Let us explain our dual-task learning in details.
senting the statement’s structure in the same manner (via GloVe
and TreeCaps) as for the method’s structure to produce 𝑉𝑠𝑢𝑏𝑡𝑟𝑒𝑒 . 6.1 Graph Convolutional Network (GCN) for FL
3) List of variables: as with the method’s content, we run GloVe First, FixLocator has two GCN models [19], each for FL at the
on the sequence of sub-tokens to produce a sequence of vectors method and statement levels. GCN processes the attributes of the
and use GRU to produce the summarized vector 𝑉𝑣𝑎𝑟 for the list. nodes (vectors) and their edges (relations) in feature graphs. Each
Finally, for each statement 𝑆, we obtain 𝑉𝐶𝑂𝑉 , 𝑉𝑠𝑢𝑏𝑡𝑟𝑒𝑒 , and 𝑉𝑣𝑎𝑟 . GCN model has 𝑛 − 1 pairs of a graph convolution layer (Conv) and

664
Fault Localization to Detect Co-change Fixing Locations ESEC/FSE ’22, November 14–18, 2022, Singapore, Singapore

𝛼 is the trainable weight matrix; 𝑋𝑀 𝑖+1 and 𝑋 𝑖+1 are the inputs for
𝑆
𝑡ℎ
the (𝑖 + 1) layers of the GCNs at the method and statement levels.
𝑋𝑀𝑖+1 and 𝑋 𝑖+1 contain the information learned from both MethFL
𝑆
and StmtFL, which helps achieve the main goal for dual-task learning
to enhance the performance of fault localization at both levels.
In general, 𝛼s can be set. If 𝛼 𝑀𝑆 and 𝛼𝑆𝑀 are set to zeros, the
layers are made to be task-specific. The 𝛼 values model linear com-
binations of feature maps. Their initialization in the range [0,1] is
important for stable learning, as it ensures that values in the output
activation map (after cross-stitch unit) are of the same order of
magnitude as the input values before linear combination [30].
Figure 9: Dual-Task Learning via Cross-stitch Unit 𝑖 and 𝐻 𝑖 are different, we need to adjust the
If the sizes of the 𝐻𝑀 𝑆
sizes of the matrices. From Formula 3, we have:
𝑖+1 𝑖
𝑋𝑀 = 𝛼 𝑀𝑀 𝐻𝑀 + 𝛼 𝑀𝑆 𝐻𝑆𝑖 (4)
a rectified linear unit (ReLU). They are aimed to consume and learn
the characteristic features in the input feature graphs. The last pair 𝑋𝑆𝑖+1 = 𝑖
𝛼𝑆𝑀 𝐻𝑀 + 𝛼𝑆𝑆 𝐻𝑆𝑖 (5)
of each GCN model is a pair of a graph convolution layer (Conv) We resize 𝐻𝑠𝑖 in Formula 4 and 𝑖 in Formula
resize 𝐻𝑚 5 if needed.
and a softmax layer (SoftMax). The SoftMax layer plays the role of the We use the bilinear interpolation technique [39] in image processing
classifier to determine whether a node for a method or a statement for resizing. We pad zeros to the matrix to make the aspect ratio
is labeled as faulty/co-fixing or non-faulty. 1:1. If the size needs to be reduced, we do the center crop on the
matrix to match the required size.
6.2 Dual-Task Learning with Cross-stitch Units FixLocator also has a trainable threshold for SoftMax to classify
In a regular GCN model, those above pairs of Conv and ReLU are if a node corresponding to a method or a statement is faulty or not.
connected to one another. However, to achieve dual-task learning
between method-level and statement-level FL (methFL and stmtFL), we 7 EMPIRICAL EVALUATION
apply a cross-stitch unit [30] to connect the two GCN models. The 7.1 Research Questions
sharing of representations between methFL and stmtFL is modeled by
For evaluation, we seek to answer the following research questions:
learning a linear combination of the input features in both feature
graphs 𝐺 𝑀 and 𝑔𝑀 . At each of the ReLU layer of each GCN model RQ1. Comparison with State-of-the Art Deep Learning (DL)-
(Figure 9), we aim to learn such a linear combination of the output based Approaches. How well does FixLocator perform compared
from the graph convolution layers (Conv) of methFL and stmtFL. with the state-of-the-art DL-based fault localization approaches?
The top sub-network in Figure 8 gets direct supervision from RQ2. Impact Analysis of Dual-Task Learning. How does the
methFL and indirect supervision (through cross-stitch units) from dual-task learning scheme affect FixLocator’s performance?
stmtFL. Cross-stitch units regularize methFL and stmtFL by learning and RQ3. Sensitivity Analysis. How do various factors affect the
enforcing shared representations by combining feature maps [30]. overall performance of FixLocator?
Formulation. For each pair of the GCN model, the outputs of the RQ4. Evaluation on Python Projects. How does FixLocator
ReLU layer, called the hidden states, are computed as follows: perform on Python code?
1 1 RQ5. Extrinsic Evaluation on Usefulness. How much does
𝐴ˆ = 𝐷 ′− 2 𝐴 ′𝐷 ′ − (1) FixLocator help an APR tool improve its bug-fixing?
2

𝐻 𝑖 = Δ(𝐴𝑋
ˆ 𝑖𝑊 𝑖 ) (2) 7.2 Experimental Methodology
Where 𝐴 ′ is the adjacency matrix of each feature graph; 𝐷 ′
is the 7.2.1 Dataset. We use a benchmark dataset Defects4J V2.0.0 [1]
degree matrix; 𝑊 𝑖 is the weight matrix for layer 𝑖; 𝑋 𝑖 is the input with 835 bugs from 17 Java projects. For each bug in a project 𝑃,
for layer 𝑖; 𝐻 𝑖 is the hidden state of layer 𝑖 and the output from the Defects4J has the faulty and fixed versions of the project. The faulty
ReLU layer; and Δ is the activation function ReLU. In a regular GCN,
and fixed versions contain the corresponding test suite relevant to
𝐻 𝑖 is the input of the next layer of GCN (i.e., the input of Conv). the bug. With the Diff comparison between faulty and fixed versions
In Figures 8 and 9, a cross-stitch unit is inserted between the ReLU of a project, we can identify the faulty statements. Specifically,
layer of the previous pair and the Conv layer of the next one. The for a bug in 𝑃, Defects4J has a separate copy of 𝑃 but with only
input of the cross-stitch unit includes the outputs of the two ReLU the corresponding test suite revealing the bug. For example, 𝑃1 , a
layers: 𝐻𝑀 𝑖 and 𝐻 𝑖 (i.e., the hidden states of those layers in methFL version of 𝑃, passes a test suite 𝑇1 . Later, a bug 𝐵 1 in 𝑃 1 is identified.
𝑆 After debugging, 𝑃1 has an evolved test suite 𝑇2 detecting the bug. In
and stmtFL). We aim to learn the linear combination of both inputs
of the cross-stitch unit, which is parameterized using the weights 𝛼. this case, Defects4J has a separate copy of the buggy 𝑃1 with a single
Thus, the output of the cross-stitch unit is computed as: bug, together with the test suite 𝑇2 . Similarly, for bug 𝐵 2 , Defects4J
 𝑖+1    𝑖  has a copy of 𝑃2 together with 𝑇3 (evolving from 𝑇2 ), and so on. We
𝑋𝑀 𝛼 𝑀𝑀 𝛼 𝑀𝑆 𝐻𝑀 do not use the whole T of all test suites for training/testing. For
= (3)
𝑋𝑆𝑖+1 𝛼𝑆𝑀 𝛼𝑆𝑆 𝐻𝑆𝑖 within-project setting, we test one bug 𝐵𝑖 with test suite 𝑇 (𝑖+1) by

665
ESEC/FSE ’22, November 14–18, 2022, Singapore, Singapore Yi Li, Shaohua Wang, and Tien N. Nguyen

training on all other bugs in 𝑃. We conducted all the experiments (2) Hit-All is the number of bugs in which the predicted set
on a server with 16 core CPU and a single Nvidia A100 GPU. covers the correct set in the oracle for a bug.
In Defects4J-v2.0, regarding the statistics on the number of bug- (3) Hit-N@Top-𝐾 is the number of bugs that the predicted list
gy/fixed statements for a bug, there are 199 bugs with one bug- of the top-𝐾 statements contains at least 𝑁 faulty statements. This
gy/fixed statement, 142 bugs with two, 90 bugs with three, 78 metric is used when we compare the approaches in ranking.
bugs with four, 43 bugs with five, and 283 bugs with >5 buggy RQ2. Impact Analysis of Dual-Task Learning Model.
statements. Regarding the statistics on the number of buggy/fixed Baselines. To study the impact of dual-task learning, we built two
methods/hunks for a bug, there are 199 bugs with one-method/one- variants of FixLocator: (1) Statement-only model: the method-level
statement, 105 bugs with one-method/multi-statements, 142 bugs FL model (methFL) is removed from FixLocator and only statement-
with multi-methods/one-statement for each method, 61 bugs with level FL (stmtFL) is kept for training. (2) Cascading model: in this
multi-methods/multi-statements for each method, and 357 bugs variant, dual-task learning is removed, and we cascade the output
with multiple methods, each has one or multiple buggy statements. of methFL directly to the input of stmtFL.
Thus, there are 665 (out of 864 bugs) with CC fixing statements.
Procedures. The statement-only model has only the statement-
7.2.2 Experimental Setup and Procedures. level fault localization. We ran it on all methods in the project to
RQ1. Comparison with DL-based FL Approaches. find the faulty statements. We use the same training strategy and
Baselines. Our tool aims to output a set of CC fixing statements parameter tuning as in RQ1. We use Hit-N for evaluation.
for a bug. However, the existing Deep Learning-based FL approaches RQ3. Sensitivity Analysis. We conduct ablation analysis to
can produce only the ranked lists of suspicious statements with evaluate the impact of different factors on the performance: every
scores. Thus, we chose as baselines the most recent, state-of-the- node feature, co-change relation, and the depth limit on the stack
art, DL-based, statement-level FL approaches: (1) CNN-FL [50]; (2) trace and the execution trace. Specifically, we set FixLocator as the
DeepFL [22]; and (3) DeepRL4FL [24]; then, we use the predicted, complete model, and each time we built a variant by removing one
ranked list of statements as the output set. For the comparison in key factor, and compared the results. Except for the removed factor,
ranking with those ranking baselines, we convert our tool’s result we keep the same setting as in other experiments.
into a ranked list by ranking the statements in the predicted set RQ4. Evaluation on Python Projects. To evaluate FixLocator
by the classification scores (i.e., before deriving the final set). We on different programming languages, we ran it on the Python bench-
also compare with the CC fixing-statement detection module in mark BugsInPy [2, 40] with 441 bugs from 17 different projects.
DEAR [25], a multi-method/multi-statement APR tool. RQ5. Extrinsic Evaluation. To evaluate usefulness, we replaced
Procedures. We use the leave-one-out setting as in prior work [22, the original CC fixing-location module in DEAR [25] with FixLoca-
23] (i.e., testing on one bug and training on all other bugs). We tor to build a variant of DEAR, namely DEAR𝐹𝑖𝑥𝐿 . We also added
also consider the order of the bugs in the same project via the FixLocator and Ochiai FL [7] to CURE [15] to build two variants:
revision numbers. Specifically, for each buggy version 𝐵 of project CURE𝐹𝑖𝑥𝐿 (FixLocator + CURE) and CURE𝑂𝑐ℎ𝑖 (Ochiai+CURE).
𝑃 in Defects4J, all buggy versions from the other projects are first
included in the training data. Besides, we separate all the buggy 8 EMPIRICAL RESULTS
versions of the project 𝑃 into two groups: 1) one buggy version as
the test data for model prediction, and 2) all the buggy versions of 8.1 RQ1. Comparison Results with
the same project 𝑃 that have occurred before the buggy version 𝐵 State-of-the-Art DL-based FL Approaches
are also included in the training data. If the latter group is empty, Table 1 shows how well FixLocator’s coverage is on the actual correct
only the buggy versions from the other projects are used for training CC fixing statements (recall). The result is w.r.t. the bugs in the oracle
to predict for the current buggy version in 𝑃. with different numbers 𝐾 of CC fixing statements: 𝐾= #𝐶𝐶-𝑆𝑡𝑚𝑡𝑠 = 1,
We tune all models using autoML [5] to find the best param- 2, 3, 4, 5, and 5+. For example, in the oracle, there are 90 bugs with 3
eter setting. We directly follow the baseline studies to select the faulty statements. FixLocator’s predicted set correctly contains all
parameters that need to be tuned in the baselines. We tuned our 3 buggy statements for 21 bugs (Hit-All), 2 of them for 25 bugs, and
model with the following key hyper-parameters to obtain the best 1 faulty statement for 51 bugs. As seen, regardless of 𝑁 , FixLocator
performance: (1) Epoch size (i.e., 100, 200, 300); (2) Batch size (i.e., performs better in any Hit-𝑁 over the baselines for all 𝐾s. Note
64, 128, 256); (3) Learning rate (i.e., 0.001, 0.003, 0.005, 0.010); (4) that Hit-All = Hit-𝑁 when 𝑁 (#overlaps) = 𝐾(#CC-Stmts).
Vector length of word representation and its output (i.e., 150, 200, Table 2 shows the summary of the comparison results in which
250, 300); (5) The output channels of convolutional layer (16, 32, we sum all the corresponding Hit-𝑁 values across different numbers
64,128); (6) The number of convolutional layers (3, 5, 7, 9). 𝐾 of CC fixing statements in Table 1. As seen, FixLocator can
DeepFL was proposed for the method-level FL. For comparison, improve CNN-FL, DeepFL, DeepRL4FL, and DEAR by 16.6%, 16.9%,
following a prior study [24], we use only DeepFL’s spectrum-based 9.9%, and 20.6%, respectively, in terms of Hit-1 (i.e., the predicted
and mutation-based features applicable to detect faulty statements. set contains at least one faulty statement). It also improves over
Evaluation Metrics. We use the following metrics for evaluation: those baselines by 33.6%, 40.3%, 26.5%, and 57.5% in terms of Hit-2,
(1) Hit-N measures the number of bugs that the predicted set 43.9%, 46.4%, 28.1%, and 51.9% in terms of Hit-3, 100%, 155.6%, 64.5%,
contains at least 𝑁 faulty statements (i.e., the predicted and oracle and 142.1% in terms of Hit-4. Note: Any Hit-𝑁 reflects the cases of
sets for a bug overlap at least 𝑁 statements regardless of the sizes of multiple CC statements. For example, Hit-1 might include the bugs
both sets). Both precision and recall can be computed from Hit-N. with more than one buggy/fixed statement. Importantly, our tool

666
Fault Localization to Detect Co-change Fixing Locations ESEC/FSE ’22, November 14–18, 2022, Singapore, Singapore

Table 1: RQ1. Detailed Comparison w.r.t. Faults with Different Table 3: RQ1. Detailed Comparison w.r.t. Faults with Different
# of CC Fixing Statements in an Oracle Set (Recall) # of CC Fixing Statements in a Predicted Set (Precision)
#CC-Stmts Metrics CNN-FL DeepFL DeepRL4FL DEAR Fix- #Stmts in Metrics CNN-FL DeepFL DeepRL4FL DEAR Fix-
in Oracle Locator Predicted Set Locator
1 (199 bugs) Hit-1 78 76 84 74 93 1 (203 bugs) Hit-1 83 79 87 75 (183) 99
Hit-1 67 64 70 65 75 Hit-1 75 72 78 71 (172) 83
2 (142 bugs) 2 (165 bugs)
Hit-2 33 30 34 28 41 Hit-2 36 34 39 34 (172) 45
Hit-1 46 44 47 42 51 Hit-1 52 46 48 41 (129) 55
3 (90 bugs) Hit-2 21 20 23 20 25 3 (120 bugs) Hit-2 24 22 26 19 (129) 27
Hit-3 11 10 13 12 21 Hit-3 12 11 14 10 (129) 23
Hit-1 41 42 42 40 45 Hit-1 47 49 46 33 (78) 51
Hit-2 22 19 21 20 24 Hit-2 24 21 22 14 (78) 26
4 (78 bugs) 4 (96 bugs)
Hit-3 9 7 8 5 12 Hit-3 11 9 10 5 (78) 14
Hit-4 3 2 4 2 9 Hit-4 5 3 6 1 (78) 11
Hit-1 15 14 16 13 18 Hit-1 17 16 17 12 (55) 19
Hit-2 9 8 9 7 12 Hit-2 10 10 11 7 (55) 14
5 (43 bugs) Hit-3 6 5 6 5 7 5 (73 bugs) Hit-3 8 6 7 4 (55) 9
Hit-4 3 2 3 2 3 Hit-4 3 3 4 1 (55) 5
Hit-5 1 1 1 0 1 Hit-5 2 1 2 0 (55) 2
Hit-1 85 91 93 87 105 Hit-1 58 69 76 68(218) 80
Hit-2 40 42 45 41 65 Hit-2 31 32 34 32 (218) 55
Hit-3 31 34 37 32 42 Hit-3 26 30 33 24 (218) 36
5+ (283 bugs) 5+ (178 bugs)
Hit-4 17 14 21 15 34 Hit-4 15 12 18 16 (218) 30
Hit-5 4 3 5 2 8 Hit-5 3 3 4 5 (218) 7
Hit-5+ 1 2 3 1 3 Hit-5+ 1 2 3 2 (218) 3

Table 2: RQ1. Comparison Results with DL-based FL Models Table 4: RQ1. Comparison with Baselines w.r.t. Ranking

Metrics CNN-FL DeepFL DeepRL4FL DEAR FixLocator Hit-N@Top-5 Hit-N@Top-10


Hit-1 332 331 352 321 387 N= 1 2 3 4 5 1 2 3 4 5 5+
Hit-2 125 119 132 106 167 CNN-FL 533 311 133 33 4 578 386 166 42 10 81
Hit-3 57 56 64 54 82 DeepFL 525 298 131 35 6 563 364 156 42 10 83
Hit-4 23 18 28 19 46 DeepRL4FL 586 339 159 32 9 623 407 186 48 13 92
Hit-5 5 4 6 2 9 DEAR 501 274 119 25 3 544 341 142 36 7 71
Hit-5+ 1 2 3 1 3 FixLocator 633 420 195 46 11 690 470 217 51 13 94
Hit-All 127 121 139 115 168
Table 5: Overlapping Analysis Results for Hit-1
FixLocator
produced the exact-match sets for 168/864 bugs (19.5%), relatively Unique-Baseline Overlap Unique-FixLocator
improving over the baselines 32%, 38.8%, 20.8%, and 46.1% in Hit-All. CNN-FL 48 284 103
It performs well in Hit-All when the number of CC statements 𝐾=1- DeepFL 54 277 110
4. However, producing the exact-matched sets for all statements DeepRL4FL 61 291 96
when 𝐾 ≥ 5 is still challenging for all the models. DEAR 35 286 101
Table 3 shows the comparison on how precise the results are in a
predicted set. For example, when the number of the CC statements
in a predicted set is 𝐾 ′ =3, there are 23 bugs in which all of those We also performed the analysis on the overlapping between
3 faulty statements are correct (there might be other statements the results of FixLocator and each baseline. As seen in Table 5,
missing). There are 27 bugs in which two of the 3 predicted, faulty FixLocator can detect at least one correct faulty statement in 103
statements are correct. There are 55 bugs in which only one of the bugs that CNN-FL missed, while CNN-FL can do so only in 48 bugs
3 predicted, faulty statements are correct. As seen, regardless of 𝑁 , that FixLocator missed. Both FixLocator and CNN-FL can do so
FixLocator is more precise than the baselines for all 𝐾 ′ s. in the same 284 bugs. In brief, FixLocator can detect at least one
Table 4 shows the comparison as ranking is considered (Hit- correct buggy statement in more “unique” bugs than any baseline.
N@Top-𝐾). As seen, in the ranking setting, FixLocator locates
more CC fixing statements than any baseline. For example, FixLo-
8.2 RQ2. Impact Analysis Results on Dual-Task
cator improves the best baseline DeepRL4RL by 23.9% in Hit- Learning
2@Top-5, 22.6% in Hit-3@Top-5, 43.8% in Hit-4@Top-5, and 22.2% Table 6 shows that FixLocator has better performance in detecting
in Hit-5@Top-5, respectively. The same trend is for Hit-N@Top-10. CC fixing statements than the two variants (statement-only and
We did not compare with the spectrum-/mutation-based FL mod- cascading models). This result shows that the dual-task learning
els since DeepRL4FL [24] was shown to outperform them. helps improve FL over the cascading model (methFL → stmtFL).

667
ESEC/FSE ’22, November 14–18, 2022, Singapore, Singapore Yi Li, Shaohua Wang, and Tien N. Nguyen

Table 6: RQ2. Impact Analysis of Dual-Task Learning


1 public UnivariateRealPointValuePair optimize(final FUNC f, GoalType goal,
Variant Hit-1 Hit-2 Hit-3 Hit-4 Hit-5 Hit-5+ double min, double max) throws FunctionEvaluationException {
2 - return optimize(f, goal, min, max, 0);
Stmt-only 304 111 51 11 3 2 3 + return optimize(f, goal, min, max, min + 0.5 * (max - min));
Cascading 343 125 61 19 3 3 4 }
5 public UnivariateRealPointValuePair optimize(final FUNC f, GoalType goal,
FixLocator 387 167 82 46 9 3 double min, double max, double startValue) throws Func...Exception {
6 ...
7 try {
Table 7: RQ3. Sensitivity Analysis of Method- and Statement- 8 - final double bound1 = (i == 0) ? min : min + generator.nextDouble()...;
9 - final double bound2 = (i == 0) ? max : min + generator.nextDouble()...;
Level Features. ML: Method-level; SL: Statement-level 10 - optima[i] = optimizer.optimize(f, goal, FastMath.min(bound1, bound2),...;
11 + final double s = (i == 0) ? startValue : min + generator.nextDouble()...;
12 + optima[i] = optimizer.optimize(f, goal, min, max, s); ...
Hit-N
Model Variant 13 }
1 2 3 4 5 5+
w/o Method Content 366 158 78 39 9 3 Figure 10: An Illustrating Example
w/o Method Structure 357 155 80 40 8 3
ML
w/o Similar Buggy Method 361 157 79 44 9 3
w/o ML Co-change Rel. 355 152 77 40 8 3
Table 9: Ranking of CC Fixing Locations for Figure 10
w/o Code Coverage 348 151 75 38 7 2 LOC CNN-FL DeepFL DeepRL4FL DEAR FixLocator
w/o AST Subtree 354 153 77 41 8 3 Line 2 1 22 2 27 ⋆ (no rank)
SL
w/o Variables 373 162 78 42 9 3 Line 8 24 3 6 12 ⋆ (no rank)
w/o SL Co-change Relation 351 150 76 39 7 2 Line 9 25 4 7 13 ⋆ (no rank)
FixLocator 387 167 82 46 9 3 Line 10 50+ 13 16 39 ⋆ (no rank)

Table 8: RQ3. Sensitivity Analysis (Depth of Traces)

Depth  Hit-1  Hit-2  Hit-3  Hit-4  Hit-5  Hit-5+
5      371    162    74     42     8      3
10     387    167    82     46     9      3
15     368    158    71     39     7      3

Table 10: RQ4. BugsInPy (Python Projects) versus Defects4J (Java Projects). P% = |Located Bugs|/|Total Bugs in Datasets|

         BugsInPy (Python projects)    Defects4J (Java projects)
Metrics  P%      Cases                 P%      Cases
Hit-1    43.8%   193                   46.3%   387
Hit-2    16.3%   72                    20.0%   167
Hit-3    10.2%   45                    9.8%    82
Hit-4    3.4%    15                    5.5%    46
Hit-5    0.7%    3                     1.1%    9
Hit-5+   0%      0                     0.4%    3

8.3 RQ3. Sensitivity Analysis Results


8.3.1 Impact of the Method-Level (ML) Features and ML Co-change Relation. Among all the method-level features/attributes of FixLocator, the feature of co-change relations among methods has the largest impact. Specifically, without the co-change feature among methods, Hit-1 is decreased by 8.3%. Moreover, the method structure feature, represented as an AST, has the second largest impact: without it, Hit-1 is decreased by 7.8%.

Among the two method-level features with the least impact, the method content feature has less impact than the similar-buggy-method feature. This shows that the bugginess nature of a method and of similar methods has more impact than the tokens of the method itself.

8.3.2 Impact of the Statement-Level (SL) Features and SL Co-change Relation. Among all the statement-level features, Code Coverage has the largest impact: without it, Hit-1 is decreased by 10.1%. The co-change relations among statements have the second largest impact among all SL features/attributes: without them, Hit-1 is decreased by 9.3%.

8.3.3 Impact of the Depth Level of Stack Trace. As seen in Table 8, FixLocator achieves its best performance when depth=10. The cases with depth=5 or depth=15 bring too few or too many irrelevant methods into the analysis, adding more noise to the model. Thus, we chose depth=10 for our experiments.

8.3.4 Illustrating Example. Table 9 displays the rankings from the models for Figure 10. FixLocator correctly produces all 4 CC fixing statements in its predicted set (lines 2, 8, 9, and 10 in two methods). The statement-only model detects only line 2 as faulty; it completely misses lines 8–10 of the optimize method. In contrast, the cascading model detects lines 8–10; however, its MethFL considers the first method (optimize(...) at line 1) as non-faulty, so, due to cascading, it does not detect the buggy line 2. The baselines CNN-FL, DeepFL, DeepRL4FL, and DEAR detect only 1, 2, 1, and 0 faulty statements (bold cells) in their top-4 resulting lists, respectively. In brief, the baselines are not designed to detect CC fixing locations, so their top-𝐾 lists are not correct.

8.4 RQ4. Evaluation on Python Projects

As seen in Table 10, FixLocator can localize 193 faulty statements with Hit-1. This shows that the performance on the Python projects is consistent with that on the Java projects. Specifically, at the statement level, the percentages of the total Python and Java bugs that can be localized are similar, e.g., 43.8% vs. 46.3% with Hit-1.


8.5 RQ5. Extrinsic Evaluation: Usefulness in Automated Program Repair

In Table 11, with its better CC fixing locations, FixLocator helps DEAR𝐹𝑖𝑥𝐿 relatively improve over DEAR in auto-fixing 10.5% more bugs (11 bugs) across all bug types. Moreover, CURE𝐹𝑖𝑥𝐿 (FixLocator + CURE) can fix 42.9% relatively more bugs (36 bugs) than CURE𝑂𝑐ℎ𝑖 (Ochiai + CURE). Especially, CURE𝐹𝑖𝑥𝐿 fixed 41 more bugs with multiple statements or multiple methods. The 5 bugs with single buggy statements that CURE𝐹𝑖𝑥𝐿 missed are due to FixLocator incorrectly producing more than one fixing location. Table 12 shows that FixLocator helps DEAR and CURE improve both correct and plausible patches (passing all the tests) across all projects.

Table 11: RQ5. Usefulness in APR (running on Defects4J)

Bug Types   1-Meth/1-Stmt  1-Meth/M-Stmts  M-Meths/1-Stmt  M-Meths/M-Stmts  M-Meths/Mix-Stmts  Total
DEAR        64             12              24              2                3                  105
DEAR𝐹𝑖𝑥𝐿    69             13              26              2                4                  116
CURE𝑂𝑐ℎ𝑖    84             0               0               0                0                  84
CURE𝐹𝑖𝑥𝐿    79             11              26              1                3                  120

Table 12: RQ5. Detailed Results on Usefulness in APR

Projects in Defects4J  DEAR     DEAR𝐹𝑖𝑥𝐿  CURE𝑂𝑐ℎ𝑖  CURE𝐹𝑖𝑥𝐿
Chart                  8/16     9/18      6/13      9/19
Cli                    5/11     7/14      4/11      7/16
Closure                7/11     8/13      6/10      9/14
Codec                  1/4      1/4       1/4       1/4
Collections            0/0      0/0       0/0       0/0
Compress               7/15     7/16      5/12      8/17
Csv                    2/5      2/6       1/4       2/5
Gson                   1/3      1/3       1/2       1/3
JacksonCore            4/9      5/10      3/7       6/9
JacksonDatabind        16/27    17/29     13/25     16/28
JacksonXml             1/1      1/1       1/1       1/1
Jsoup                  13/21    14/22     10/17     16/23
JxPath                 8/14     9/15      6/13      10/17
Lang                   8/15     9/17      9/16      11/15
Math                   20/33    20/31     16/25     20/35
Mockito                1/2      1/2       1/1       1/2
Time                   3/6      3/6       1/3       3/6
Total                  105/193  116/207   84/164    120/214
X/Y: the numbers of correct and plausible patches. Dataset: Defects4J.

8.6 Further Analysis

8.6.1 Running Time. As seen in Table 13, except for DeepFL (which uses a basic neural network), the approaches have similar training and prediction times. Importantly, prediction time is just a few seconds, making FixLocator suitable for interactive use.

Table 13: Running Time

Models           CNN-FL     DeepFL    DeepRL4FL  DEAR       FixLocator
Training Time    4 hours    5 mins    7 hours    21 hours   6 hours
Prediction Time  2 seconds  1 second  4 seconds  9 seconds  2 seconds

8.6.2 Limitations. First, our tool does not detect well the sets with 5+ CC fixing statements, since it does not learn well from such large co-changes. Second, it does not work in locating a fault that requires only adding statements to fix (nor do any of the baselines). Third, if the faulty statements/methods occur far from the crash method in the execution traces, it is not effective. Finally, it has no mechanism to integrate program analysis to expand the detected faulty statements with the statements that have dependencies on them.

8.6.3 Threats to Validity. (1) We evaluated FixLocator on Java and Python; our modules are general for any language. (2) We compared the models only on two datasets that have test cases. (3) For comparison, we use only DeepFL's features applicable to statement-level FL, although it works at the method level; the other baselines work directly at the statement level. (4) Of the 501 bugs in BugsInPy, the third-party tool cannot process 60 of them. (5) We focus on CC fixing statements, instead of methods, due to the bug-fixing purpose.

9 RELATED WORK

Several types of techniques have been proposed to locate faulty statements/methods. However, none of the existing FL approaches detect CC fixing locations. A related work is DEAR [25], which uses a combination of BERT and data flows to locate CC statements. The Hercules APR tool [37] can detect multiple buggy hunks of code, but only the buggy hunks with similar statements (replicated fixes), while our tool detects general CC fixing locations. In comparison, FixLocator and Hercules detect 26 and 15 multi-hunk bugs, respectively, among the 395 bugs in Defects4J-v1.2 [37].

Spectrum-based Fault Localization (SBFL) [6, 8, 16, 27, 29, 33, 43, 46] and Mutation-based Fault Localization (MBFL) [11, 31, 32, 35, 47, 48] have been proposed for statement-level FL. Their key limitations are that they cannot differentiate statements with the same scores and lack effective mutators to catch complex faults. Among the learning-based FL models, learning-to-rank FL approaches [9, 23, 38, 45] aim to locate faulty methods. Statistical FL has been combined with causal inference for statement-level FL [20]. None of those models locate CC fixing statements.

Machine learning has also been used for FL. Early neural-network-based FL [10, 42, 51, 53] mainly uses test coverage data. A limitation is that it cannot distinguish the elements accidentally executed by failed tests from the actual faulty elements [23]. Deep-learning-based approaches, GRACE [28], DeepFL [22], CNN-FL [50], and DeepRL4FL [24], achieve better results. GRACE [28] proposes a new graph representation for a method and learns to rank the faulty methods. In contrast, FixLocator aims to locate the multiple CC fixing statements in a fix for a fault. DeepFL and DeepRL4FL can outperform the learning-based and early neural-network FL techniques, such as MULTRIC [45], TraPT [23], and Fluccs [38]. In our empirical evaluation, we showed that FixLocator can outperform those baselines under study in detecting CC fixing statements.

10 CONCLUSION

We present FixLocator, a novel DL-based FL approach that aims to locate co-change fixing locations within one or multiple methods. The key ideas of FixLocator include (1) a new dual-task learning model of method- and statement-level fault localization to detect CC fixing locations; (2) a novel graph-based representation learning with co-change relations among methods and statements; and (3) a novel co-change feature for methods/statements. Our empirical results show that FixLocator relatively improves over the state-of-the-art FL baselines by locating 26.5% to 155.6% more CC fixing statements, and helps APR tools improve their bug-fixing accuracy.

ACKNOWLEDGMENTS

This work was supported in part by the US National Science Foundation (NSF) grants CNS-2120386, CCF-1723215, CCF-1723432, TWC-1723198, and the US National Security Agency (NSA) grant NCAE-C-002-2021 on Cybersecurity Research Innovation.


REFERENCES
[1] 2019. The Defects4J Data Set. https://github.com/rjust/defects4j
[2] 2020. The BugsInPy Data Set. https://github.com/soarsmu/BugsInPy
[3] 2021. FixLocator. https://github.com/fixlocatorresearch/fixlocatorresearch
[4] 2021. JDT. https://www.eclipse.org/jdt/core/tools/jdtcoretools/index.php
[5] 2021. The NNI autoML tool. https://github.com/microsoft/nni
[6] Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. 2006. An evaluation of similarity coefficients for software fault localization. In 2006 12th Pacific Rim International Symposium on Dependable Computing (PRDC'06). IEEE, 39–46.
[7] Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. 2006. An evaluation of similarity coefficients for software fault localization. In 2006 12th Pacific Rim International Symposium on Dependable Computing (PRDC'06). IEEE, 39–46.
[8] Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. 2007. On the accuracy of spectrum-based fault localization. In Testing: Academic and Industrial Conference Practice and Research Techniques - MUTATION (TAICPART-MUTATION 2007). IEEE, 89–98.
[9] Tien-Duy B Le, David Lo, Claire Le Goues, and Lars Grunske. 2016. A learning-to-rank based fault localization approach using likely invariants. In Proceedings of the 25th International Symposium on Software Testing and Analysis (ISSTA'16). ACM, 177–188.
[10] Lionel C Briand, Yvan Labiche, and Xuetao Liu. 2007. Using machine learning to support debugging with tarantula. In The 18th IEEE International Symposium on Software Reliability (ISSRE'07). IEEE, 137–146.
[11] Timothy Alan Budd. 1981. Mutation Analysis of Program Test Data. (1981).
[12] Nghi DQ Bui, Yijun Yu, and Lingxiao Jiang. 2021. TreeCaps: Tree-Based Capsule Networks for Source Code Processing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 30–38.
[13] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[14] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. CoRR abs/1406.1078 (2014). arXiv:1406.1078 http://arxiv.org/abs/1406.1078
[15] Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. CURE: Code-Aware Neural Machine Translation for Automatic Program Repair. In Proceedings of the 43rd International Conference on Software Engineering. 1161–1173. https://doi.org/10.1109/ICSE43902.2021.00107
[16] James A Jones and Mary Jean Harrold. 2005. Empirical evaluation of the tarantula automatic fault-localization technique. In Proceedings of the 20th IEEE/ACM International Conference on Automated Software Engineering (ASE'05). ACM, 273–282.
[17] J. A. Jones, M. J. Harrold, and J. Stasko. 2002. Visualization of test information to assist fault localization. In Proceedings of the 24th International Conference on Software Engineering (ICSE'02). 467–477. https://doi.org/10.1145/581396.581397
[18] Fabian Keller, Lars Grunske, Simon Heiden, Antonio Filieri, Andre van Hoorn, and David Lo. 2017. A critical evaluation of spectrum-based fault localization techniques on a large-scale software system. In IEEE International Conference on Software Quality, Reliability and Security (QRS'17). IEEE, 114–125.
[19] Thomas N Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016).
[20] Yiğit Küçük, Tim AD Henderson, and Andy Podgurski. 2021. Improving fault localization by integrating value and predicate based causal inference techniques. In 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). IEEE, 649–660.
[21] Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. 2012. A Systematic Study of Automated Program Repair: Fixing 55 out of 105 Bugs for $8 Each. In Proceedings of the 34th International Conference on Software Engineering (ICSE '12). IEEE Press, 3–13.
[22] Xia Li, Wei Li, Yuqun Zhang, and Lingming Zhang. 2019. DeepFL: integrating multiple fault diagnosis dimensions for deep fault localization. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. ACM, 169–180.
[23] Xia Li and Lingming Zhang. 2017. Transforming programs and tests in tandem for fault localization. Proceedings of the ACM on Programming Languages 1, OOPSLA (2017), 1–30.
[24] Yi Li, Shaohua Wang, and Tien N. Nguyen. 2021. Fault Localization with Code Coverage Representation Learning. In Proceedings of the 43rd International Conference on Software Engineering (ICSE'21). IEEE.
[25] Yi Li, Shaohua Wang, and Tien N. Nguyen. 2022. DEAR: A Novel Deep Learning-based Approach for Automated Program Repair. In Proceedings of the 44th International Conference on Software Engineering (ICSE'22). ACM Press.
[26] Ziyao Li, Liang Zhang, and Guojie Song. 2019. GCN-LASE: Towards adequately incorporating link attributes in graph convolutional networks. arXiv preprint arXiv:1902.09817 (2019).
[27] Ben Liblit, Mayur Naik, Alice X. Zheng, Alex Aiken, and Michael I. Jordan. 2005. Scalable Statistical Bug Isolation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation (Chicago, IL, USA) (PLDI '05). Association for Computing Machinery, New York, NY, USA, 15–26. https://doi.org/10.1145/1065010.1065014
[28] Yiling Lou, Qihao Zhu, Jinhao Dong, Xia Li, Zeyu Sun, Dan Hao, Lu Zhang, and Lingming Zhang. 2021. Boosting coverage-based fault localization via graph-based representation learning. In Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 664–676.
[29] Lucia Lucia, David Lo, Lingxiao Jiang, Ferdian Thung, and Aditya Budi. 2014. Extended comprehensive study of association measures for fault localization. Journal of Software: Evolution and Process 26, 2 (2014), 172–219.
[30] Ishan Misra, Abhinav Shrivastava, Abhinav Gupta, and Martial Hebert. 2016. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3994–4003.
[31] S. Moon, Y. Kim, M. Kim, and S. Yoo. 2014. Ask the Mutants: Mutating Faulty Programs for Fault Localization. In IEEE International Conference on Software Testing, Verification and Validation. 153–162. https://doi.org/10.1109/ICST.2014.28
[32] Vincenzo Musco, Martin Monperrus, and Philippe Preux. 2017. A large-scale study of call graph-based impact prediction using mutation testing. Software Quality Journal 25, 3 (2017), 921–950.
[33] Lee Naish, Hua Jie Lee, and Kotagiri Ramamohanarao. 2011. A model for spectra-based software diagnosis. ACM Transactions on Software Engineering and Methodology (TOSEM) 20, 3 (2011), 11.
[34] Mike Papadakis and Yves Le Traon. 2012. Using mutants to locate "unknown" faults. In IEEE International Conference on Software Testing, Verification and Validation. IEEE, 691–700.
[35] Mike Papadakis and Yves Le Traon. 2015. Metallaxis-FL: mutation-based fault localization. Software Testing, Verification and Reliability 25, 5-7 (2015), 605–628.
[36] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global Vectors for Word Representation. In Empirical Methods in Natural Language Processing (EMNLP). 1532–1543. http://www.aclweb.org/anthology/D14-1162
[37] Seemanta Saha, Ripon K. Saha, and Mukul R. Prasad. 2019. Harnessing Evolution for Multi-Hunk Program Repair. IEEE Press. https://doi.org/10.1109/ICSE.2019.00020
[38] Jeongju Sohn and Shin Yoo. 2017. Fluccs: Using code and change metrics to improve fault localization. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis. 273–283.
[39] 2022. Bilinear Interpolation. https://en.wikipedia.org/wiki/Bilinear_interpolation
[40] Ratnadira Widyasari, Sheng Qin Sim, Camellia Lok, Haodi Qi, Jack Phan, Qijin Tay, Constance Tan, Fiona Wee, Jodie Ethelda Tan, Yuheng Yieh, et al. 2020. BugsInPy: a database of existing bugs in Python programs to enable controlled testing and debugging studies. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1556–1560.
[41] W. Eric Wong, Ruizhi Gao, Yihao Li, Rui Abreu, and Franz Wotawa. 2016. A Survey on Software Fault Localization. IEEE Trans. Softw. Eng. 42, 8 (Aug. 2016), 707–740. https://doi.org/10.1109/TSE.2016.2521368
[42] W Eric Wong and Yu Qi. 2009. BP neural network-based effective fault localization. International Journal of Software Engineering and Knowledge Engineering 19, 04 (2009), 573–597.
[43] W Eric Wong, Yu Qi, Lei Zhao, and Kai-Yuan Cai. 2007. Effective fault localization using code coverage. In 31st Annual International Computer Software and Applications Conference (COMPSAC 2007), Vol. 1. IEEE, 449–456.
[44] Rongxin Wu, Hongyu Zhang, Shing-Chi Cheung, and Sunghun Kim. 2014. CrashLocator: Locating crashing faults based on crash stacks. In Proceedings of the 2014 International Symposium on Software Testing and Analysis. 204–214.
[45] Jifeng Xuan and Martin Monperrus. 2014. Learning to combine multiple ranking metrics for fault localization. In IEEE International Conference on Software Maintenance and Evolution (ICSME'14). IEEE, 191–200.
[46] Lingming Zhang, Miryung Kim, and Sarfraz Khurshid. 2011. Localizing failure-inducing program edits based on spectrum information. In Proceedings of the 27th IEEE International Conference on Software Maintenance (ICSM'11). IEEE, 23–32.
[47] Lingming Zhang, Tao Xie, Lu Zhang, Nikolai Tillmann, Jonathan De Halleux, and Hong Mei. 2010. Test generation via dynamic symbolic execution for mutation testing. In IEEE International Conference on Software Maintenance (ICSM'10). IEEE, 1–10.
[48] Lingming Zhang, Lu Zhang, and Sarfraz Khurshid. 2013. Injecting Mechanical Faults to Localize Developer Faults for Evolving Software. In Proceedings of the 2013 ACM SIGPLAN International Conference on Object Oriented Programming Systems Languages and Applications (Indianapolis, Indiana, USA) (OOPSLA '13). Association for Computing Machinery, New York, NY, USA, 765–784. https://doi.org/10.1145/2509136.2509551
[49] Zhenyu Zhang, Wing Kwong Chan, TH Tse, Bo Jiang, and Xinming Wang. 2009. Capturing propagation of infected program states. In Proceedings of the 7th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering. 43–52.
[50] Zhuo Zhang, Yan Lei, Xiaoguang Mao, and Panpan Li. 2019. CNN-FL: An effective approach for localizing faults using convolutional neural networks. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 445–455.
[51] Zhuo Zhang, Yan Lei, Qingping Tan, Xiaoguang Mao, Ping Zeng, and Xi Chang. 2017. Deep Learning-Based Fault Localization with Contextual Information. IEICE Transactions on Information and Systems 100, 12 (2017), 3027–3031.
[52] Lei Zhao, Lina Wang, Zuoting Xiong, and Dongming Gao. 2010. Execution-aware fault localization based on the control flow analysis. In International Conference on Information Computing and Applications. Springer, 158–165.
[53] Wei Zheng, Desheng Hu, and Jing Wang. 2016. Fault localization analysis based on deep neural network. Mathematical Problems in Engineering 2016 (2016). https://doi.org/10.1155/2016/1820454
