Fault Localization To Detect Co-Change Fixing Locations

Yi Li, Shaohua Wang, and Tien N. Nguyen

ABSTRACT
Fault Localization (FL) is a precursor step to most Automated Program Repair (APR) approaches, which fix the faulty statements identified by the FL tools. We present FixLocator, a Deep Learning (DL)-based fault localization approach supporting the detection of faulty statements in one or multiple methods that need to be modified accordingly in the same fix. Let us call them co-change (CC) fixing locations for a fault. We treat this FL problem as dual-task learning with two models. The method-level FL model, MethFL, learns the methods to be fixed together. The statement-level FL model, StmtFL, learns the statements to be co-fixed. Correct learning in one model can benefit the other and vice versa. Thus, we simultaneously train them with soft-sharing of the models' parameters via cross-stitch units to enable the propagation of the impact of MethFL and StmtFL onto each other. Moreover, we explore a novel feature for FL: the co-changed statements. We also use a Graph-based Convolution Network to integrate different types of program dependencies. Our empirical results show that FixLocator relatively improves over the state-of-the-art statement-level FL baselines by locating 26.5%–155.6% more CC fixing statements. To evaluate its usefulness in APR, we used FixLocator in combination with state-of-the-art APR tools. The results show that FixLocator+DEAR (the original FL in DEAR replaced by FixLocator) and FixLocator+CURE improve relatively over the original DEAR and Ochiai+CURE by 10.5% and 42.9% in terms of the number of fixed bugs.

CCS CONCEPTS
• Software and its engineering → Software testing and debugging.

KEYWORDS
Fault Localization; Deep Learning; Co-Change Fixing Locations

ACM Reference Format:
Yi Li, Shaohua Wang, and Tien N. Nguyen. 2022. Fault Localization to Detect Co-change Fixing Locations. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '22), November 14–18, 2022, Singapore, Singapore. ACM, New York, NY, USA, 13 pages. https://doi.org/10.1145/3540250.3549137

1 INTRODUCTION
To assist developers in the bug-detecting and fixing process, several approaches have been proposed for Automated Program Repair (APR) [21]. A common usage of an APR tool is that one first uses a fault localization (FL) tool [41] to locate the faulty statements that must be fixed, and then uses the APR tool to generate the fixing changes for those detected statements. The input of an FL model is the execution of a test suite, in which some of the test cases pass and some fail. Specifically, the key input is the code coverage matrix, in which the rows and columns correspond to the statements and test cases, respectively. Each cell is assigned the value of 1 if the statement is executed in the respective test case, and the value of 0 otherwise. An FL model uses such information to identify the list of suspicious lines of code, ranked by their associated suspiciousness scores [41]. In recent advanced FL, several approaches also support fault localization at the method level to locate faulty methods [22, 24].

The FL approaches can be broadly divided into the following categories: spectrum-based fault localization (SBFL) [6, 17, 18], mutation-based fault localization (MBFL) [31, 34, 35], and machine learning (ML) and deep learning (DL) fault localization [22, 24]. For SBFL approaches, the key idea is that a line covered more in the failing test cases than in the passing ones is more suspicious than a line executed more in the passing ones. To improve SBFL, MBFL approaches [31, 34, 35] enhance the code coverage matrix by modifying a statement with mutation operators and collecting the code coverage when executing the mutated programs with the test cases. The MBFL approaches apply suspiciousness-score formulas in the same manner as the SBFL approaches, on the matrix for each original statement and its mutated code. Finally, ML and DL-based FL approaches explore the code coverage matrix and apply different neural network models for fault localization.

Despite their successes, the state-of-the-art FL approaches are still limited in locating all dependent fixing locations that need to be repaired at the same time in the same fix. In practice, many bugs require dependent changes in the same fix to multiple lines of code in one or multiple hunks of the same or different methods for the program to pass the test cases. For those bugs, applying the fixing change to individual statements one at a time will not make the program pass the tests after the change to one statement. The capability to detect the fixing locations of the co-changes in a fix for a bug (let us call them Co-change (CC) Fixing Locations) is crucial for an APR tool. Such a capability will enable an APR tool to make the correct and complete changes to fix a bug.

The state-of-the-art FL approaches do not satisfy that requirement. From the ranked list of suspicious statements returned by an existing FL model, a naive approach to detect CC fixing locations would be to take the top k statements in that list and consider them as to be fixed together. This solution might be ineffective.
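As a concrete reference point for the coverage-matrix input and suspiciousness scoring described above, the following is a minimal sketch (not the implementation of any of the cited tools) that scores a hypothetical four-statement program with the Ochiai formula, one standard SBFL formula; the coverage matrix and test outcomes are invented for illustration:

```python
import numpy as np

# Hypothetical coverage matrix: rows = statements, columns = test cases.
# coverage[i, j] = 1 if statement i is executed by test j, else 0.
coverage = np.array([
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 1],
    [1, 1, 1],
])
failing = np.array([False, True, True])  # per-test outcome: True = failing

def ochiai(row: np.ndarray) -> float:
    """Ochiai suspiciousness of one statement from its coverage row."""
    executed = row == 1
    ef = np.sum(executed & failing)    # executed by failing tests
    ep = np.sum(executed & ~failing)   # executed by passing tests
    nf = np.sum(~executed & failing)   # not executed by failing tests
    denom = np.sqrt((ef + nf) * (ef + ep))
    return float(ef / denom) if denom > 0 else 0.0

scores = [ochiai(row) for row in coverage]
ranking = np.argsort(scores)[::-1]  # most suspicious statements first
```

Note that such a scheme produces a ranked list only: statements that share a score are ordered arbitrarily, and nothing in the ranking groups the statements that must be fixed together, which is the gap this work addresses.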
2.1.1 Observation 1 [Co-change Fixing Locations]. In this example, the changes to fix this bug involve multiple faulty statements that are dependent on one another. Fixing only one of the faulty statements will not make the program pass the failing test(s). Fixing individual statements one at a time in the ranked list returned from an existing FL tool will also not make the program pass the tests. For an APR model to work, an FL tool needs to point out all of those faulty statements to be changed in the same fix. For example, all four faulty statements at lines 17, 21, 3, and 13 need to be modified accordingly in the same fix to fix the bug in Figure 1.

2.1.2 Observation 2 [Multiple Faulty Methods]. As seen, this bug requires an APR tool to make changes to multiple statements in three different methods in the same fix: toSource(...) at lines 17 and 21, toSource(...) at line 3, and toSource(...) at line 13. Thus, it is important for an FL tool to connect and identify these multiple faulty statements in potentially different methods.

Traditional FL approaches [49, 52] using program analysis (PA), e.g., execution-flow analysis, are restricted to specific PA techniques and are thus not general enough to locate all types of CC fixing locations. Spectrum-based [6, 16], mutation-based [31, 34, 35], statistics-based [27], and machine learning (ML)-based FL approaches [22, 24] could implicitly learn the program dependencies for the FL purpose. However, despite their successes, the non-PA FL approaches do not support the detection of multiple locations that need to be changed in the same fix for a bug, i.e., Co-Change (CC) Fixing Locations. The spectrum-based and ML-based FL models return a ranked list of suspicious statements according to the corresponding suspiciousness scores. In this example, the lines 13, 17, and 21, and the other lines (e.g., 12, 20, and 24) are executed in the same passing or failing test cases and are thus assigned the same scores by the spectrum- and mutation-based FL approaches. A user would not be informed of what lines need to be fixed together. The non-PA, especially ML-based, FL approaches do not have a mechanism to detect CC fixing locations.

In this work, we aim to advance the level of deep learning (DL)-based FL approaches to detect CC fixing statements. However, this is not trivial. A solution that assumes the top-k suspicious statements from an FL tool to be the CC fixing locations does not work because, even being the most suspicious, those statements might not need to be changed in the same fix. In this example, all of the above lines with the same suspiciousness scores would confuse a fixer.

Moreover, another naive solution would be to use a method-level FL tool to detect multiple faulty methods first and then use a statement-level FL tool to detect the statements within each faulty method. As we will show in Section 7, the inaccuracy of the first phase of detecting faulty methods has a confounding effect on the overall performance in detecting CC fixing statements.

2.2 Key Ideas
We propose FixLocator, an FL approach to locate all the CC fixing locations (i.e., faulty statements) that need to be changed in the same fix for a bug. In designing FixLocator, we have the following key ideas in both a new model and new features:

2.2.1 Key Idea 1 [Dual-Task Learning for Fault Localization]. To avoid the confounding effect in the naive solution of detecting faulty methods first and then detecting faulty statements in those methods, we design an approach that treats this FL problem of detecting dependent CC fixing locations as dual-task learning between the method-level and statement-level FL. First, the method-level FL model (MethFL) aims to learn the methods that need to be modified in the same fix. Second, the statement-level FL model (StmtFL) aims to learn the co-fixing statements regardless of whether they are in the same or different methods.

Intuitively, MethFL and StmtFL are related to each other, in that the results of one model can help the other. We refer to this relation as duality, which can provide useful constraints for FixLocator to learn dependent CC fixing locations. We conjecture that the joint training of the two models can improve the performance of both models when we leverage the constraints of this duality in terms of shared representations. For example, if two statements in two different methods m_1 and m_2 were observed to be changed in the same fix, then that should help the model learn that m_1 and m_2 were also changed together to fix the bug. If two methods were observed to be fixed together, then some of their statements were changed in the same fix as well. In our model, we jointly train MethFL and StmtFL with soft-sharing of the models' parameters to exploit their relation. Specifically, we use a mechanism called a cross-stitch unit [30] to learn a linear combination of the input features from the two models to enable the propagation of the impact of MethFL and StmtFL on each other. We also add an attention mechanism in the two models to help emphasize the key features.

2.2.2 Key Idea 2 [Co-change Representation Learning in Fault Localization]. In detecting CC fixing locations, in addition to the new dual-task learning model in Key Idea 1, we use a new feature: co-change information among statements/methods, which has never been explored in prior fault localization research. The rationale is that the statements/methods that co-changed in the past might become the statements/methods that will be fixed together for a bug in the future. We also encode the statements/methods co-fixed in the same fixes. The co-changed/co-fixed statements/methods in the same commit are used to train the models.
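As a rough illustration of mining this feature, the sketch below counts how often pairs of methods changed in the same commit; the commit data here is hypothetical, and the actual feature extraction is described in Section 4:

```python
from collections import Counter
from itertools import combinations

# Hypothetical version history: the set of methods changed in each commit.
commit_history = [
    {"Foo.toSource", "Bar.format"},
    {"Foo.toSource", "Bar.format", "Baz.print"},
    {"Baz.print"},
]

# Count each unordered pair of methods modified in the same commit.
co_change = Counter()
for changed_methods in commit_history:
    for pair in combinations(sorted(changed_methods), 2):
        co_change[pair] += 1

# Pairs that co-changed at least once become co-change relations between
# methods, usable as a training signal for which methods may be co-fixed.
co_change_relations = {pair for pair, count in co_change.items() if count >= 1}
```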
2.2.3 Key Idea 3 [Graph Modeling for Dependencies among Statements/Methods]. The statements/methods that need to be fixed together are interdependent via several dependencies. Thus, we use a Graph-based Convolution Network (GCN) [26] to model different types of dependencies among statements/methods, e.g., the data and control dependencies in a program dependence graph (PDG), execution traces, stack traces, etc. We encode the co-change/co-fix relations into the graph representations with different types of edges representing different relations. The GCN model enables nodes' and edges' attributes and learns to classify the nodes as buggy or not.

3 FIXLOCATOR: APPROACH OVERVIEW

3.1 Training Process
Figure 2 summarizes the training process. The input of training includes the passing and failing test cases, and the source code under study. The output includes the trained method-level FL model (detecting co-fixed methods) and the trained statement-level FL model (detecting co-fixed statements). The training process has three main steps: 1) Feature Extraction, 2) Graph-based Feature Representation Learning, and 3) Dual-Task Learning Fault Localization.
3.1.1 Step 1. Feature Extraction (Section 4). We aim to extract the important features for FL from the test coverage and the source code, including co-changes. The features are extracted at two levels: statements and methods. At each level, we extract the important attributes of the statements/methods, as well as the crucial relations among them. We use graphs to model those attributes and relations.

For a method m, we collect as its attributes 1) method content: the sequences of the sub-tokens of its code tokens (excluding separators and special tokens), and 2) method structure: the Abstract Syntax Tree (AST) of the method. For the relations among methods, we extract the following relations (a minimal graph-assembly sketch follows this list):
1) Execution flow (the calling relation, i.e., m calls n),
2) Stack trace after a crash, i.e., the order relation among the methods in the stack trace (the dynamic information in execution and stack traces has been shown to be useful in FL [22, 24]),
3) Co-change relation in the project history (two methods that were changed in the same commit are considered to have the co-change relation),
4) Co-fixing relation among the methods (two methods that were fixed for the same bug are considered to have the co-fixing relation),
5) Similarity: we also extract the similar methods in the project that have been buggy before in the project history. We keep only the most similar method for each method.
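A minimal sketch of assembling these relations into a method-level feature graph, assuming networkx as the graph library and hypothetical method names (the paper does not prescribe a particular library):

```python
import networkx as nx

# Each method is a node carrying its attributes; each of the five
# relations becomes a typed edge. Names and attributes are illustrative.
graph = nx.MultiDiGraph()
graph.add_node("Foo.toSource", sub_tokens=["to", "source"], ast=None)
graph.add_node("Bar.format", sub_tokens=["format"], ast=None)

graph.add_edge("Foo.toSource", "Bar.format", etype="execution_flow")  # 1) m calls n
graph.add_edge("Foo.toSource", "Bar.format", etype="stack_trace")     # 2) order in the stack trace
graph.add_edge("Foo.toSource", "Bar.format", etype="co_change")       # 3) changed in the same commit
graph.add_edge("Foo.toSource", "Bar.format", etype="co_fixing")       # 4) fixed for the same bug
graph.add_edge("Foo.toSource", "Bar.format", etype="similarity")      # 5) most similar buggy method
```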
For a statement s, we extract both static and dynamic information. First, for the static data, we extract the AST subtree that corresponds to s to represent its structure. We also extract the list of variables in s together with their types, forming a sequence of names and types, e.g., "name String price int ...". Second, for the dynamic data, we encode the test coverage matrix for s into the feature vectors.

At both the method and statement levels, we use graphs to represent the methods and statements and their relations. Let us call them the method-level and statement-level feature graphs.

3.1.2 Step 2. Graph-based Feature Representation Learning. This step aims to learn the vector representations (i.e., embeddings) for the nodes in the feature graphs from Step 1. The input includes the method-level and statement-level feature graphs. The output includes the embeddings for the nodes in the method-/statement-level feature graphs. The graph structures of both feature graphs are unchanged after this step.

For the content of a method or statement, we use the embedding techniques appropriate to the feature representations (Section 5). For a method's content and the list of variables in a statement, the representation is a sequence of sub-tokens. We use GloVe [36] to produce the embeddings for all sub-tokens, as we consider a method or statement as a sentence in each case. We then use a Gated Recurrent Unit (GRU) [14] to produce the vector for the entire sequence.

For the structure of a method or statement, the representation is a (sub)tree in the AST. For this, we first use GloVe [36] to produce the embeddings for all the nodes in the sub-tree, considering the entire method or statement as a sentence in each case. After obtaining the sub-tree where the nodes are replaced by their GloVe vectors, we use TreeCaps [12], which captures well the tree structure, to produce the embedding for the entire sub-tree.

For the code coverage representation, we directly use the two vectors for coverage and passing/failing and concatenate them to produce the embedding. The embedding for the most similar buggy method is computed in the same manner as explained with GloVe and TreeCaps. Finally, the embeddings for the attributes of the nodes are fed to fully connected layers to produce the embedding for each node in the feature graph at the method level. Similarly, we obtain the feature graph at the statement level, in which each node is the resulting vector of the fully connected layers.
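As a sketch of the content embedding just described, assuming PyTorch (which the paper does not prescribe); the dimensions are illustrative, not the tuned values from Section 7, and random tensors stand in for the GloVe vectors:

```python
import torch
import torch.nn as nn

EMBED_DIM, HIDDEN_DIM = 200, 200

# GRU that summarizes a sequence of sub-token vectors into one vector.
gru = nn.GRU(input_size=EMBED_DIM, hidden_size=HIDDEN_DIM, batch_first=True)

# One statement as a sequence of 5 sub-token embeddings (stand-ins for
# the GloVe vectors): shape (batch, sequence_length, embedding_dim).
subtoken_vectors = torch.randn(1, 5, EMBED_DIM)
_, last_hidden = gru(subtoken_vectors)
content_embedding = last_hidden.squeeze(0)  # final hidden state: (1, HIDDEN_DIM)
```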
3.1.3 Step 3. Dual-Task Learning Fault Localization. After the feature representation learning step, we obtain two feature graphs at the method and statement levels, in which a node in either graph is a vector representation. The two graphs are used as the input for dual-task learning. For dual-task learning, we use two Graph-based Convolution Network (GCN) models [19] for the method-level FL model (MethFL) and the statement-level FL model (StmtFL) to learn the CC fixing methods and CC fixing statements, respectively. During training, the two feature graphs at the method and statement levels are used as the inputs of MethFL and StmtFL. The two GCN models play the role of binary classifiers for the bugginess of the nodes (i.e., methods/statements). We train the two models simultaneously with soft-sharing of parameters. Details will be given in Section 6.

3.2 Predicting Process
The input of the prediction process (Figure 3) includes the test cases and the source code in the project. Steps 1–2 of the process are the same as in training. In Step 3, the feature graph G_S at the statement level built from the source code is used as the input of the trained StmtFL model, which predicts the labels of the nodes in that graph. The labels indicate the bugginess of the corresponding statements in the source code, which represent the CC fixing statements. If one aims to predict the faulty methods, the trained MethFL model can be used on the method-level feature graph to produce the CC fixing methods.
a rectified linear unit (ReLU). They are aimed to consume and learn the characteristic features in the input feature graphs. The last pair of each GCN model is a pair of a graph convolution layer (Conv) and a softmax layer (SoftMax). The SoftMax layer plays the role of the classifier to determine whether a node for a method or a statement is labeled as faulty/co-fixing or non-faulty.

6.2 Dual-Task Learning with Cross-stitch Units
In a regular GCN model, those pairs of Conv and ReLU are connected to one another. However, to achieve dual-task learning between the method-level and statement-level FL models (MethFL and StmtFL), we apply a cross-stitch unit [30] to connect the two GCN models. The sharing of representations between MethFL and StmtFL is modeled by learning a linear combination of the input features in both feature graphs G_M and G_S. At each ReLU layer of each GCN model (Figure 9), we aim to learn such a linear combination of the output from the graph convolution layers (Conv) of MethFL and StmtFL. The top sub-network in Figure 8 gets direct supervision from MethFL and indirect supervision (through the cross-stitch units) from StmtFL. Cross-stitch units regularize MethFL and StmtFL by learning and enforcing shared representations by combining feature maps [30].

Formulation. For each pair in a GCN model, the outputs of the ReLU layer, called the hidden states, are computed as follows:

\hat{A} = D'^{-\frac{1}{2}} A' D'^{-\frac{1}{2}}    (1)

H^i = \Delta(\hat{A} X^i W^i)    (2)

where A' is the adjacency matrix of each feature graph; D' is the degree matrix; W^i is the weight matrix for layer i; X^i is the input for layer i; H^i is the hidden state of layer i and the output from the ReLU layer; and \Delta is the ReLU activation function. In a regular GCN, H^i is the input of the next layer of the GCN (i.e., the input of Conv).
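A minimal PyTorch sketch of Formulas 1 and 2 (the self-loop convention for A' and all dimensions are illustrative assumptions):

```python
import torch

def gcn_layer(a_prime: torch.Tensor, x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """One graph convolution: H = ReLU(A_hat @ X @ W), per Formulas 1-2."""
    degrees = a_prime.sum(dim=1)                   # diagonal entries of D'
    d_inv_sqrt = torch.diag(degrees.clamp(min=1e-12).pow(-0.5))
    a_hat = d_inv_sqrt @ a_prime @ d_inv_sqrt      # Formula 1
    return torch.relu(a_hat @ x @ w)               # Formula 2

# Toy usage: 3 nodes with self-loops, 4-dim node features, 8-dim output.
a_prime = torch.tensor([[1., 1., 0.],
                        [1., 1., 1.],
                        [0., 1., 1.]])
h = gcn_layer(a_prime, torch.randn(3, 4), torch.randn(4, 8))
```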
In Figures 8 and 9, a cross-stitch unit is inserted between the ReLU layer of the previous pair and the Conv layer of the next one. The input of the cross-stitch unit includes the outputs of the two ReLU layers: H_M^i and H_S^i (i.e., the hidden states of those layers in MethFL and StmtFL). We aim to learn the linear combination of both inputs of the cross-stitch unit, which is parameterized by the weights \alpha. Thus, the output of the cross-stitch unit is computed as:

\begin{bmatrix} X_M^{i+1} \\ X_S^{i+1} \end{bmatrix} = \begin{bmatrix} \alpha_{MM} & \alpha_{MS} \\ \alpha_{SM} & \alpha_{SS} \end{bmatrix} \begin{bmatrix} H_M^i \\ H_S^i \end{bmatrix}    (3)

\alpha is the trainable weight matrix; X_M^{i+1} and X_S^{i+1} are the inputs for the (i+1)-th layers of the GCNs at the method and statement levels. X_M^{i+1} and X_S^{i+1} contain the information learned from both MethFL and StmtFL, which helps achieve the main goal of dual-task learning: enhancing the performance of fault localization at both levels.

In general, the \alpha values can be set as desired. If \alpha_{MS} and \alpha_{SM} are set to zeros, the layers are made task-specific. The \alpha values model linear combinations of feature maps. Their initialization in the range [0, 1] is important for stable learning, as it ensures that the values in the output activation map (after the cross-stitch unit) are of the same order of magnitude as the input values before the linear combination [30].

Figure 9: Dual-Task Learning via Cross-stitch Unit

If the sizes of H_M^i and H_S^i are different, we need to adjust the sizes of the matrices. From Formula 3, we have:

X_M^{i+1} = \alpha_{MM} H_M^i + \alpha_{MS} H_S^i    (4)

X_S^{i+1} = \alpha_{SM} H_M^i + \alpha_{SS} H_S^i    (5)

We resize H_S^i in Formula 4 and H_M^i in Formula 5 if needed. We use the bilinear interpolation technique [39] from image processing for resizing. We pad zeros to the matrix to make the aspect ratio 1:1. If the size needs to be reduced, we perform a center crop on the matrix to match the required size.

FixLocator also has a trainable threshold for SoftMax to classify whether a node corresponding to a method or a statement is faulty or not.
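A minimal PyTorch sketch of a cross-stitch unit per Formula 3, assuming equal hidden-state sizes (Formulas 4-5 handle the resizing case); the initialization values in [0, 1] follow the stability argument above:

```python
import torch
import torch.nn as nn

class CrossStitchUnit(nn.Module):
    """Mixes the hidden states of MethFL and StmtFL (Formula 3)."""

    def __init__(self) -> None:
        super().__init__()
        # alpha = [[a_MM, a_MS], [a_SM, a_SS]], trainable, initialized in [0, 1].
        self.alpha = nn.Parameter(torch.tensor([[0.9, 0.1],
                                                [0.1, 0.9]]))

    def forward(self, h_meth: torch.Tensor, h_stmt: torch.Tensor):
        x_meth = self.alpha[0, 0] * h_meth + self.alpha[0, 1] * h_stmt
        x_stmt = self.alpha[1, 0] * h_meth + self.alpha[1, 1] * h_stmt
        return x_meth, x_stmt  # inputs of the next Conv layers

unit = CrossStitchUnit()
x_m, x_s = unit(torch.randn(10, 32), torch.randn(10, 32))
```

Setting alpha_MS = alpha_SM = 0 in this sketch recovers two task-specific GCNs, matching the discussion above.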
7 EMPIRICAL EVALUATION

7.1 Research Questions
For the evaluation, we seek to answer the following research questions:
RQ1. Comparison with State-of-the-Art Deep Learning (DL)-based Approaches. How well does FixLocator perform compared with the state-of-the-art DL-based fault localization approaches?
RQ2. Impact Analysis of Dual-Task Learning. How does the dual-task learning scheme affect FixLocator's performance?
RQ3. Sensitivity Analysis. How do various factors affect the overall performance of FixLocator?
RQ4. Evaluation on Python Projects. How does FixLocator perform on Python code?
RQ5. Extrinsic Evaluation on Usefulness. How much does FixLocator help an APR tool improve its bug-fixing?

7.2 Experimental Methodology
7.2.1 Dataset. We use the benchmark dataset Defects4J V2.0.0 [1] with 835 bugs from 17 Java projects. For each bug in a project P, Defects4J has the faulty and fixed versions of the project. The faulty and fixed versions contain the corresponding test suite relevant to the bug. With a diff comparison between the faulty and fixed versions of a project, we can identify the faulty statements. Specifically, for a bug in P, Defects4J has a separate copy of P but with only the corresponding test suite revealing the bug. For example, P_1, a version of P, passes a test suite T_1. Later, a bug B_1 in P_1 is identified. After debugging, P_1 has an evolved test suite T_2 detecting the bug. In this case, Defects4J has a separate copy of the buggy P_1 with a single bug, together with the test suite T_2. Similarly, for bug B_2, Defects4J has a copy of P_2 together with T_3 (evolving from T_2), and so on. We do not use the whole set T of all test suites for training/testing.
For the within-project setting, we test one bug B_i with test suite T_{i+1} by training on all other bugs in P. We conducted all the experiments on a server with a 16-core CPU and a single Nvidia A100 GPU.

In Defects4J-v2.0, regarding the statistics on the number of buggy/fixed statements for a bug, there are 199 bugs with one buggy/fixed statement, 142 bugs with two, 90 bugs with three, 78 bugs with four, 43 bugs with five, and 283 bugs with more than five buggy statements. Regarding the statistics on the number of buggy/fixed methods/hunks for a bug, there are 199 bugs with one method/one statement, 105 bugs with one method/multiple statements, 142 bugs with multiple methods/one statement for each method, 61 bugs with multiple methods/multiple statements for each method, and 357 bugs with multiple methods, each having one or multiple buggy statements. Thus, there are 665 bugs (out of 864) with CC fixing statements.

7.2.2 Experimental Setup and Procedures.
RQ1. Comparison with DL-based FL Approaches.
Baselines. Our tool aims to output a set of CC fixing statements for a bug. However, the existing Deep Learning-based FL approaches can produce only ranked lists of suspicious statements with scores. Thus, we chose as baselines the most recent, state-of-the-art, DL-based, statement-level FL approaches: (1) CNN-FL [50]; (2) DeepFL [22]; and (3) DeepRL4FL [24]; we then use the predicted, ranked list of statements as the output set. For the comparison in ranking with those ranking baselines, we convert our tool's result into a ranked list by ranking the statements in the predicted set by their classification scores (i.e., before deriving the final set). We also compare with the CC fixing-statement detection module in DEAR [25], a multi-method/multi-statement APR tool.
Procedures. We use the leave-one-out setting as in prior work [22, 23] (i.e., testing on one bug and training on all other bugs). We also consider the order of the bugs in the same project via the revision numbers. Specifically, for each buggy version B of a project P in Defects4J, all buggy versions from the other projects are first included in the training data. Besides, we separate all the buggy versions of the project P into two groups: 1) one buggy version as the test data for model prediction, and 2) all the buggy versions of the same project P that occurred before the buggy version B, which are also included in the training data. If the latter group is empty, only the buggy versions from the other projects are used for training to predict for the current buggy version in P.
We tune all models using autoML [5] to find the best parameter setting. We directly follow the baseline studies to select the parameters that need to be tuned in the baselines. We tuned our model with the following key hyper-parameters to obtain the best performance: (1) epoch size (i.e., 100, 200, 300); (2) batch size (i.e., 64, 128, 256); (3) learning rate (i.e., 0.001, 0.003, 0.005, 0.010); (4) vector length of the word representation and its output (i.e., 150, 200, 250, 300); (5) the output channels of the convolutional layer (16, 32, 64, 128); and (6) the number of convolutional layers (3, 5, 7, 9).
DeepFL was proposed for method-level FL. For comparison, following a prior study [24], we use only DeepFL's spectrum-based and mutation-based features applicable to detecting faulty statements.
Evaluation Metrics. We use the following metrics for evaluation:
(1) Hit-N measures the number of bugs for which the predicted set contains at least N faulty statements (i.e., the predicted and oracle sets for a bug overlap in at least N statements, regardless of the sizes of both sets). Both precision and recall can be computed from Hit-N.
(2) Hit-All is the number of bugs for which the predicted set covers the correct set in the oracle for a bug.
(3) Hit-N@Top-K is the number of bugs for which the predicted list of the top-K statements contains at least N faulty statements. This metric is used when we compare the approaches in ranking.
RQ2. Impact Analysis of the Dual-Task Learning Model.
Baselines. To study the impact of dual-task learning, we built two variants of FixLocator: (1) Statement-only model: the method-level FL model (MethFL) is removed from FixLocator and only the statement-level FL model (StmtFL) is kept for training. (2) Cascading model: in this variant, dual-task learning is removed, and we cascade the output of MethFL directly to the input of StmtFL.
Procedures. The statement-only model has only statement-level fault localization. We ran it on all methods in the project to find the faulty statements. We use the same training strategy and parameter tuning as in RQ1. We use Hit-N for evaluation.
RQ3. Sensitivity Analysis. We conduct an ablation analysis to evaluate the impact of different factors on the performance: each node feature, the co-change relation, and the depth limit on the stack trace and the execution trace. Specifically, we set FixLocator as the complete model, and each time we built a variant by removing one key factor and compared the results. Except for the removed factor, we keep the same setting as in the other experiments.
RQ4. Evaluation on Python Projects. To evaluate FixLocator on different programming languages, we ran it on the Python benchmark BugsInPy [2, 40] with 441 bugs from 17 different projects.
RQ5. Extrinsic Evaluation. To evaluate usefulness, we replaced the original CC fixing-location module in DEAR [25] with FixLocator to build a variant of DEAR, namely DEAR_FixL. We also added FixLocator and the Ochiai FL [7] to CURE [15] to build two variants: CURE_FixL (FixLocator+CURE) and CURE_Ochi (Ochiai+CURE).

8 EMPIRICAL RESULTS

8.1 RQ1. Comparison Results with State-of-the-Art DL-based FL Approaches
Table 1 shows how well FixLocator covers the actual correct CC fixing statements (recall). The result is w.r.t. the bugs in the oracle with different numbers K of CC fixing statements: K = #CC-Stmts = 1, 2, 3, 4, 5, and 5+. For example, in the oracle, there are 90 bugs with 3 faulty statements. FixLocator's predicted set correctly contains all 3 buggy statements for 21 bugs (Hit-All), 2 of them for 25 bugs, and 1 faulty statement for 51 bugs. As seen, regardless of N, FixLocator performs better in any Hit-N over the baselines for all Ks. Note that Hit-All = Hit-N when N (#overlaps) = K (#CC-Stmts).
Table 2 shows the summary of the comparison results, in which we sum all the corresponding Hit-N values across the different numbers K of CC fixing statements in Table 1. As seen, FixLocator improves over CNN-FL, DeepFL, DeepRL4FL, and DEAR by 16.6%, 16.9%, 9.9%, and 20.6%, respectively, in terms of Hit-1 (i.e., the predicted set contains at least one faulty statement). It also improves over those baselines by 33.6%, 40.3%, 26.5%, and 57.5% in terms of Hit-2; 43.9%, 46.4%, 28.1%, and 51.9% in terms of Hit-3; and 100%, 155.6%, 64.5%, and 142.1% in terms of Hit-4. Note that any Hit-N reflects the cases of multiple CC statements; for example, Hit-1 might include the bugs with more than one buggy/fixed statement.
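The Hit metrics defined above can be stated precisely in a few lines; the oracle/predicted sets below are hypothetical:

```python
# Each entry pairs a bug's oracle set of CC fixing statements with the
# model's predicted set (statement ids are hypothetical).
results = [
    ({"s3", "s13", "s17", "s21"}, {"s3", "s17", "s21"}),
    ({"s5"}, {"s5", "s9"}),
]

def hit_n(results, n):
    # Bugs whose predicted and oracle sets overlap in at least n statements.
    return sum(1 for oracle, predicted in results if len(oracle & predicted) >= n)

def hit_all(results):
    # Bugs whose predicted set covers the entire oracle set.
    return sum(1 for oracle, predicted in results if oracle <= predicted)

print(hit_n(results, 1), hit_n(results, 3), hit_all(results))  # 2 1 1
```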
Table 1: RQ1. Detailed Comparison w.r.t. Faults with Different # of CC Fixing Statements in an Oracle Set (Recall)

#CC-Stmts in Oracle | Metrics | CNN-FL | DeepFL | DeepRL4FL | DEAR | FixLocator
1 (199 bugs) | Hit-1 | 78 | 76 | 84 | 74 | 93
2 (142 bugs) | Hit-1 | 67 | 64 | 70 | 65 | 75
2 (142 bugs) | Hit-2 | 33 | 30 | 34 | 28 | 41
3 (90 bugs) | Hit-1 | 46 | 44 | 47 | 42 | 51
3 (90 bugs) | Hit-2 | 21 | 20 | 23 | 20 | 25
3 (90 bugs) | Hit-3 | 11 | 10 | 13 | 12 | 21
4 (78 bugs) | Hit-1 | 41 | 42 | 42 | 40 | 45
4 (78 bugs) | Hit-2 | 22 | 19 | 21 | 20 | 24
4 (78 bugs) | Hit-3 | 9 | 7 | 8 | 5 | 12
4 (78 bugs) | Hit-4 | 3 | 2 | 4 | 2 | 9
5 (43 bugs) | Hit-1 | 15 | 14 | 16 | 13 | 18
5 (43 bugs) | Hit-2 | 9 | 8 | 9 | 7 | 12
5 (43 bugs) | Hit-3 | 6 | 5 | 6 | 5 | 7
5 (43 bugs) | Hit-4 | 3 | 2 | 3 | 2 | 3
5 (43 bugs) | Hit-5 | 1 | 1 | 1 | 0 | 1
5+ (283 bugs) | Hit-1 | 85 | 91 | 93 | 87 | 105
5+ (283 bugs) | Hit-2 | 40 | 42 | 45 | 41 | 65
5+ (283 bugs) | Hit-3 | 31 | 34 | 37 | 32 | 42
5+ (283 bugs) | Hit-4 | 17 | 14 | 21 | 15 | 34
5+ (283 bugs) | Hit-5 | 4 | 3 | 5 | 2 | 8
5+ (283 bugs) | Hit-5+ | 1 | 2 | 3 | 1 | 3

Table 3: RQ1. Detailed Comparison w.r.t. Faults with Different # of CC Fixing Statements in a Predicted Set (Precision)

#Stmts in Predicted Set | Metrics | CNN-FL | DeepFL | DeepRL4FL | DEAR | FixLocator
1 (203 bugs) | Hit-1 | 83 | 79 | 87 | 75 (183) | 99
2 (165 bugs) | Hit-1 | 75 | 72 | 78 | 71 (172) | 83
2 (165 bugs) | Hit-2 | 36 | 34 | 39 | 34 (172) | 45
3 (120 bugs) | Hit-1 | 52 | 46 | 48 | 41 (129) | 55
3 (120 bugs) | Hit-2 | 24 | 22 | 26 | 19 (129) | 27
3 (120 bugs) | Hit-3 | 12 | 11 | 14 | 10 (129) | 23
4 (96 bugs) | Hit-1 | 47 | 49 | 46 | 33 (78) | 51
4 (96 bugs) | Hit-2 | 24 | 21 | 22 | 14 (78) | 26
4 (96 bugs) | Hit-3 | 11 | 9 | 10 | 5 (78) | 14
4 (96 bugs) | Hit-4 | 5 | 3 | 6 | 1 (78) | 11
5 (73 bugs) | Hit-1 | 17 | 16 | 17 | 12 (55) | 19
5 (73 bugs) | Hit-2 | 10 | 10 | 11 | 7 (55) | 14
5 (73 bugs) | Hit-3 | 8 | 6 | 7 | 4 (55) | 9
5 (73 bugs) | Hit-4 | 3 | 3 | 4 | 1 (55) | 5
5 (73 bugs) | Hit-5 | 2 | 1 | 2 | 0 (55) | 2
5+ (178 bugs) | Hit-1 | 58 | 69 | 76 | 68 (218) | 80
5+ (178 bugs) | Hit-2 | 31 | 32 | 34 | 32 (218) | 55
5+ (178 bugs) | Hit-3 | 26 | 30 | 33 | 24 (218) | 36
5+ (178 bugs) | Hit-4 | 15 | 12 | 18 | 16 (218) | 30
5+ (178 bugs) | Hit-5 | 3 | 3 | 4 | 5 (218) | 7
5+ (178 bugs) | Hit-5+ | 1 | 2 | 3 | 2 (218) | 3
Table 2: RQ1. Comparison Results with DL-based FL Models

Table 4: RQ1. Comparison with Baselines w.r.t. Ranking
Table 8: RQ3. Sensitivity Analysis (Depth of Traces)

Depth | Hit-1 | Hit-2 | Hit-3 | Hit-4 | Hit-5 | Hit-5+
5 | 371 | 162 | 74 | 42 | 8 | 3
10 | 387 | 167 | 82 | 46 | 9 | 3
15 | 368 | 158 | 71 | 39 | 7 | 3

Table 10: RQ4. BugsInPy (Python Projects) versus Defects4J (Java Projects). P% = |Located Bugs| / |Total Bugs in Datasets|

Metrics | BugsInPy P% | BugsInPy Cases | Defects4J P% | Defects4J Cases
Hit-1 | 43.8% | 193 | 46.3% | 387
Hit-2 | 16.3% | 72 | 20.0% | 167
Hit-3 | 10.2% | 45 | 9.8% | 82
Hit-4 | 3.4% | 15 | 5.5% | 46
Hit-5 | 0.7% | 3 | 1.1% | 9
Hit-5+ | 0% | 0 | 0.4% | 3

Moreover, without the impact of the method-level FL model (MethFL), the performance decreases significantly, indicating MethFL's contribution.
Table 11: RQ5. Usefulness in APR (running on Defects4J)

Bug Types | 1-Meth 1-Stmt | 1-Meth M-Stmt | M-Meths 1-Stmt | M-Meths M-Stmts | M-Meths Mix-Stmts | Total
DEAR | 64 | 12 | 24 | 2 | 3 | 105
DEAR_FixL | 69 | 13 | 26 | 2 | 4 | 116
CURE_Ochi | 84 | 0 | 0 | 0 | 0 | 84
CURE_FixL | 79 | 11 | 26 | 1 | 3 | 120

Table 12: RQ5. Detailed Results on Usefulness in APR (X/Y: the numbers of correct and plausible patches; dataset: Defects4J)

Projects in Defects4J | DEAR | DEAR_FixL | CURE_Ochi | CURE_FixL
Chart | 8/16 | 9/18 | 6/13 | 9/19
Cli | 5/11 | 7/14 | 4/11 | 7/16
Closure | 7/11 | 8/13 | 6/10 | 9/14
Codec | 1/4 | 1/4 | 1/4 | 1/4
Collections | 0/0 | 0/0 | 0/0 | 0/0
Compress | 7/15 | 7/16 | 5/12 | 8/17
Csv | 2/5 | 2/6 | 1/4 | 2/5
Gson | 1/3 | 1/3 | 1/2 | 1/3
JacksonCore | 4/9 | 5/10 | 3/7 | 6/9
JacksonDatabind | 16/27 | 17/29 | 13/25 | 16/28
JacksonXml | 1/1 | 1/1 | 1/1 | 1/1
Jsoup | 13/21 | 14/22 | 10/17 | 16/23
JxPath | 8/14 | 9/15 | 6/13 | 10/17
Lang | 8/15 | 9/17 | 9/16 | 11/15
Math | 20/33 | 20/31 | 16/25 | 20/35
Mockito | 1/2 | 1/2 | 1/1 | 1/2
Time | 3/6 | 3/6 | 1/3 | 3/6
Total | 105/193 | 116/207 | 84/164 | 120/214

Table 13: Running Time

Models | CNN-FL | DeepFL | DeepRL4FL | DEAR | FixLocator
Training Time | 4 hours | 5 mins | 7 hours | 21 hours | 6 hours
Prediction Time | 2 seconds | 1 second | 4 seconds | 9 seconds | 2 seconds

As shown in Table 11, DEAR_FixL (FixLocator+DEAR) can fix 10.5% relatively more bugs (11 bugs) than the original DEAR across all bug types. Moreover, CURE_FixL (FixLocator+CURE) can fix 42.9% relatively more bugs (36 bugs) than CURE_Ochi (Ochiai+CURE). Especially, CURE_FixL fixed 41 more bugs with multiple statements or multiple methods. The 5 bugs with single buggy statements that CURE_FixL missed are due to FixLocator incorrectly producing more than one fixing location. Table 12 shows that FixLocator helps DEAR and CURE improve both correct and plausible patches (patches passing all the tests) across all projects.

For the comparison, we use only DeepFL's features applicable to statement-level FL although it works at the method level. Other baselines work directly at the statement level. (4) Of the 501 bugs in BugsInPy, the third-party tool cannot process 60 of them. (5) We focus on CC fixing statements, instead of methods, due to the bug-fixing purpose.

9 RELATED WORK
Several types of techniques have been proposed to locate faulty statements/methods. However, none of the existing FL approaches detects CC fixing locations. A related work is DEAR [25], which uses a combination of BERT and data flows to locate CC statements. The Hercules APR tool [37] can detect multiple buggy hunks of code. However, it can detect only the buggy hunks with similar statements (replicated fixes), while our tool detects general CC fixing locations. In comparison, FixLocator and Hercules detect 26 and 15 multi-hunk bugs, respectively, among the 395 bugs in Defects4J-v1.2 [37].
Spectrum-based Fault Localization (SBFL) [6, 8, 16, 27, 29, 33, 43, 46] and Mutation-based Fault Localization (MBFL) [11, 31, 32, 35, 47, 48] have been proposed for statement-level FL. Their key limitations are that they cannot differentiate the statements with the same scores and do not have effective mutators to catch complex faults. Among the learning-based FL models, learning-to-rank FL approaches [9, 23, 38, 45] aim to locate faulty methods. Statistical FL has been combined with causal inference for statement-level FL [20]. None of those models locates CC fixing statements.
Machine learning has also been used for FL. Early neural network-based FL approaches [10, 42, 51, 53] mainly use test coverage data. A limitation is that they cannot distinguish the elements accidentally executed by failed tests from the actual faulty elements [23]. Deep learning-based approaches, GRACE [28], DeepFL [22], CNN-FL [50], and DeepRL4FL [24], achieve better results. GRACE [28] proposes a new graph representation for a method and learns to rank the faulty methods. In contrast, FixLocator aims to locate the multiple CC fixing statements in a fix for a fault. DeepFL and DeepRL4FL can outperform the learning-based and early neural-network FL techniques, such as MULTRIC [45], TrapT [23], and Fluccs [38]. In our empirical evaluation, we showed that FixLocator outperforms those baselines under study in detecting CC fixing statements.
[50] Zhuo Zhang, Yan Lei, Xiaoguang Mao, and Panpan Li. 2019. CNN-FL: An effective approach for localizing faults using convolutional neural networks. In 2019 IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 445–455.
[51] Zhuo Zhang, Yan Lei, Qingping Tan, Xiaoguang Mao, Ping Zeng, and Xi Chang. 2017. Deep Learning-Based Fault Localization with Contextual Information. IEICE Transactions on Information and Systems 100, 12 (2017), 3027–3031.
[52] Lei Zhao, Lina Wang, Zuoting Xiong, and Dongming Gao. 2010. Execution-aware fault localization based on the control flow analysis. In International Conference on Information Computing and Applications. Springer, 158–165.
[53] Wei Zheng, Desheng Hu, and Jing Wang. 2016. Fault localization analysis based on deep neural network. Mathematical Problems in Engineering 2016 (2016). https://doi.org/10.1155/2016/1820454