0% found this document useful (0 votes)

50 views69 pages

A Survey of Learning-Based Automated Program Repair

Uploaded by

18767122875

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

50 views69 pages

A Survey of Learning-Based Automated Program Repair

Uploaded by

18767122875

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 69

1

A Survey of Learning-based Automated Program Repair

QUANJUN ZHANG, State Key Laboratory for Novel Software Technology, Nanjing University, China
CHUNRONG FANG∗ , State Key Laboratory for Novel Software Technology, Nanjing University, China
YUXIANG MA, State Key Laboratory for Novel Software Technology, Nanjing University, China
arXiv:2301.03270v3 [cs.SE] 1 Nov 2023

WEISONG SUN, State Key Laboratory for Novel Software Technology, Nanjing University, China
ZHENYU CHEN∗ , State Key Laboratory for Novel Software Technology, Nanjing University, China
Automated program repair (APR) aims to fix software bugs automatically and plays a crucial role in software
development and maintenance. With the recent advances in deep learning (DL), an increasing number of APR
techniques have been proposed to leverage neural networks to learn bug-fixing patterns from massive open-
source code repositories. Such learning-based techniques usually treat APR as a neural machine translation
(NMT) task, where buggy code snippets (i.e., source language) are translated into fixed code snippets (i.e.,
target language) automatically. Benefiting from the powerful capability of DL to learn hidden relationships
from previous bug-fixing datasets, learning-based APR techniques have achieved remarkable performance.
In this paper, we provide a systematic survey to summarize the current state-of-the-art research in the
learning-based APR community. We illustrate the general workflow of learning-based APR techniques and
detail the crucial components, including fault localization, patch generation, patch ranking, patch validation,
and patch correctness phases. We then discuss the widely adopted datasets and evaluation metrics and
outline existing empirical studies. We discuss several critical aspects of learning-based APR techniques,
such as repair domains, industrial deployment, and the open science issue. We highlight several practical
guidelines on applying DL techniques for future APR studies, such as exploring explainable patch generation
and utilizing code features. Overall, our paper can help researchers gain a comprehensive understanding
about the achievements of the existing learning-based APR techniques and promote the practical application
of these techniques. Our artifacts are publicly available at the repository: https://github.com/iSEngLab/
AwesomeLearningAPR.
CCS Concepts: • Software and its engineering → Software testing and debugging; Software testing and
debugging.
Additional Key Words and Phrases: Automatic Program Repair, Deep Learning, Neural Machine Translation,
AI and Software Engineering
ACM Reference Format:
Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen. 2023. A Survey of Learning-
based Automated Program Repair. ACM Trans. Softw. Eng. Methodol. 0, 0, Article 1 ( 2023), 69 pages. https:
//doi.org/10.1145/nnnnnnn.nnnnnnn
∗ Chunrong Fang and Zhenyu Chen are the corresponding authors.

Authors’ addresses: Quanjun Zhang, [email protected], State Key Laboratory for Novel Software Technology,
Nanjing University, Nanjing, Jiangsu, China, 210093; Chunrong Fang, [email protected], State Key Laboratory for
Novel Software Technology, Nanjing University, Nanjing, Jiangsu, China, 210093; Yuxiang Ma, [email protected].
edu.cn, State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu, China, 210093; Weisong
Sun, [email protected], State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing,
Jiangsu, China, 210093; Zhenyu Chen, [email protected], State Key Laboratory for Novel Software Technology, Nanjing
University, Nanjing, Jiangsu, China, 210093.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2022 Association for Computing Machinery.
1049-331X/2023/0-ART1 $15.00
https://doi.org/10.1145/nnnnnnn.nnnnnnn

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:2 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

1 INTRODUCTION
Modern software systems continuously evolve with inevitable bugs due to the deprecation of
old features, adding of new functionalities, and refactoring of system architecture [194]. These
inevitable bugs have been widely recognized as notoriously costly and destructive, such as costing
billions of dollars annually across the world [20, 202]. The recorded quantity of bugs is increased
at a tremendous speed due to the increasing scale and complexity of software systems [53]. It is
an extremely time-consuming and error-prone task for developers to fix detected bugs manually
in the software development and maintenance process. For example, previous reports show that
software debugging accounts for over 50% of the cost in software development [21]. Considering
the promising future in relieving manual repair efforts, automated program repair (APR), which
aims to automatically fix software bugs without human intervention, has been a very active area in
academia and industry.
As a promising research area, APR has been extensively investigated in the literature and has made
substantial progress on the number of correctly-fixed bugs [135]. A living APR review [136] reports
that a growing number of papers get published each year with various exquisitely implemented APR
tools being released. Over the past decade, researchers have proposed a variety of APR techniques
to generate patches [108] [13] [195], including heuristic-based, constraint-based and pattern-based.
Among these traditional techniques, pattern-based APR employs pre-defined repair patterns to
transform buggy code snippets into correct ones and has been widely recognized as state-of-the-
art [107, 208, 209]. However, existing pattern-based techniques mainly rely on manually designed
repair templates, which require massive effort and professional knowledge to craft in practice.
Besides, these templates are usually designed for specific types of bugs (e.g., null pointer exception)
and thus are challenging to apply to unseen bugs, limiting the repair effectiveness.
Recently, inspired by the advance of deep learning (DL), a variety of learning-based APR tech-
niques have been proposed to learn the bug-fixing patterns automatically from large corpora of
source code [184]. Compared with traditional APR techniques, learning-based techniques can be
applied to a wider range of scenarios (e.g., multi-languages [209] and multiple multi-hunks [28])
with pairs of the buggy and corresponding fixed code snippets. For example, CIRCLE [228] is able to
generate patches across multiple programming languages with multilingual training datasets. These
learning-based techniques handle the program repair problem as a neural machine translation
(NMT) task [73, 115, 208, 209, 228], which translates a code sequence from a source language (i.e.,
buggy code snippets) into a target language (i.e., correct code snippets). Existing NMT repair models
are typically built on the top of the encoder-decoder architecture [187]. The encoder extracts the
hidden status of buggy code snippets with the necessary context, and the decoder takes the encoder’s
hidden status and generates the correct code snippets [70, 98, 111]. Thanks to the powerful ability
of DL to learn hidden and intricate relationships from massive code corpora, learning-based APR
techniques have achieved remarkable performance in the last couple of years.
The impressive progress of learning-based APR has shown the substantial benefits of exploiting
DL for APR and further revealed its promising future in follow-up research. However, a mass
of existing studies from different organizations (e.g., academia and industry) and communities
(e.g., software engineering and artificial intelligence) make it difficult for interested researchers to
understand state-of-the-art and improve upon them. More importantly, compared with traditional
techniques, learning-based techniques heavily rely on the quality of code corpora and model
architectures, posing several challenges (e.g., code representation and patch ranking) in developing
mature NMT repair models. For example, most learning-based techniques adopt different training
datasets, and there exist various strategies available to process the code snippets (e.g., the code
context, abstraction, and tokenization). Besides, researchers design different code representations

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:3

(e.g., sequence, tree, and graph) to extract code features, which require corresponding encoder-
decoder architectures (e.g., RNN, LSTM, and transformer) to learn the transformation patterns.
Furthermore, execution-based (e.g., plausible and correct patches) and match-based (e.g., accuracy
and BLUE) metrics are adopted in different studies. Such multitudinous design choices hinder
developers from conducting follow-up research on the learning-based APR direction.
In this paper, we summarize existing work and provide a retrospection of the learning-based APR
field after years of development. Community researchers can have a thorough understanding of the
advantages and limitations of the existing learning-based APR techniques. We illustrate the typical
workflow of learning-based APR and discuss different detailed techniques that appeared in the
papers we collected. Based on our analysis, we point out the current challenges and suggest possible
future directions for learning-based APR research. Overall, our work provides a comprehensive
review of the current progress of the learning-based APR community, enabling researchers to
obtain an overview of this thriving field and make progress toward advanced practices.
Contributions. To sum up, the main contributions of this paper are as follows:
• Survey Methodology. We conduct a detailed analysis of 112 relevant studies that used DL
techniques in terms of publication trends, distribution of publication venues and languages.
• Learning-based APR. We describe the typical framework of leveraging advances in DL tech-
niques to repair software bugs and discuss the key factors, including fault localization, data
pre-processing, patch generation, patch ranking, patch validation and patch correctness.
• Dataset and Metric. We perform a comprehensive analysis of the critical factors that impact
the performance of DL models in APR, including 53 collected datasets and evaluation metrics
in two categories.
• Empirical studies. We detail existing empirical studies performed to better understand the
process of learning-based APR and facilitate future studies.
• Some Discussions. We discuss some other crucial areas (e.g., security vulnerability and syntax
error) where learning-based APR techniques are applied, as well as certain known industrial
deployments. We demonstrate the trend of employing pre-trained models on APR recently.
We list the available learning-based tools and reveal the essential open science problem.
• Outlook and challenges. We pinpoint open research challenges of using DL in APR and provide
several practical guidelines on applying DL for future learning-based APR studies.
Comparison with Existing Surveys. Gazzola et al. [53] present a survey to organize the
repair techniques published up to January 2017. Monperrus et al. [135] present a bibliography of
behavioral and state repair papers. Unlike existing surveys mainly covering traditional techniques,
our work focuses on the learning-based APR, particularly the integration of DL techniques in the
repair phase (e.g., patch generation and correctness), repair domains (e.g., vulnerability and syntax
errors), and challenges. Besides, our survey summarizes the existing studies until Nov 2022.
Paper Organization. The remainder of this paper is organized as follows. Section 2 presents
the research methodology about how we collect relevant papers from several databases following
specific keywords. Section 3 introduces some common concepts encountered in the learning-based
APR field. Section 4 presents the typical workflow of learning-based APR and discusses the vital
components of the workflow in detail, as well as some representative approaches across different
repair domains. Section 5 focuses on pre-trained model-based APR, which is the recent hot topic
in the learning-based APR community. Section 6 extends the discussion on the empirical evalua-
tion, including common datasets, standard evaluation metrics, and existing empirical studies of
learning-based APR techniques. Section 7 details some discussions, including industrial deploy-
ments, traditional APR equipped with learning-based techniques, and the crucial open science
problem. Section 8 provides some practical guidelines. Section 9 draws the conclusions.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:4 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

Google Scholar Automated Search

ACM Digital Library
IEEE Digital Library
filter by year
342 papers

Group2 filter by pages

Group1
(remove duplications)
repair related keywords DL related keywords
283 papers
program repair; bug fix; … deep; learning; machine; …
filter irrelevant papers
87 papers

Snowballing
discussion and selection
add missed citations
112 papers

Figure 1. General workflow of the paper collection

Availability. All artifacts of this study are available in the following public repository:
https://github.com/iSEngLab/AwesomeLearningAPR

2 SURVEY METHODOLOGY
In this section, we present details of our systematic literature review methodology following
Petersen et al. [153] and Kitchenham et al. [82].
Search Process. For this survey, we select papers by mainly searching the Google Scholar repository,
ACM Digital Library, and IEEE Explorer Digital Library at the end of November 2022. Following
existing DL for SE surveys [193, 220], we divide the search keywords used for searching papers into
two groups: (1) an APR-related group containing some commonly used keywords related to program
repair; and (2) a DL-related group containing some keywords related to deep learning or machine
learning. Considering a significant amount of relevant papers from both SE and AI communities,
following Zhang et al. [230], we first try to collect some papers from the community-driven website1
and the living review of APR by Monperrus [136], and then conclude some frequent words in the
titles of these papers. The search strategy can capture the most relevant studies while achieving
better efficiency than a purely manual search. Finally, we identify a search string including several
DL-related terms frequently appearing in APR papers that make use of DL techniques, listed as
follows.
(“program repair” OR “software repair” OR “automatic repair” OR “code repair” OR “bug repair”
OR “bug fix” OR “code fix” OR “automatic fix” OR “patch generation” OR “fix generation” OR
“code transformation” OR “code edit” OR “fix error”) AND (“neural” OR “machine” OR “deep” OR
“learning” OR “transformer/transformers” OR “model/models” OR “transfer” OR “supervised”)
Study selection. Once the potentially relevant studies based on our search strategy are collected,
we perform a filtering and deduplication phase to exclude papers not aligned with the study goals.
We first attempt to filter out the papers before 2016, considering that Long et al. [111] propose the
first learning-based APR study in 2016. We then filter out any paper less than 7 pages and duplicated
papers, resulting in 283 papers in total. We then scrutinize the remaining papers manually to decide
1 http://program-repair.org/bibliography.html

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:5

whether they are relevant to the learning-based APR field. We obtained 87 papers at last. To ensure
that the collected papers are as comprehensive as possible, we further perform the common practice
snowballing to manually include other relevant papers that are missed in our search process [200].
In particular, we look at every reference within the collected papers and determine if any of those
references are relevant to our study. For example, the title of SampleFix [60] does not contain any
keywords we mention above in the two groups, but it is an APR approach targeting syntax errors,
so we include it in our survey. We manually analyzed all these cited papers by scanning the papers
and finally collected 112 papers in our survey. The general workflow of how we collected papers is
shown in Figure 1.

5%
13%
50
47

35
44%
Number of papers

25
25 18%
20

15 13 13

10
6
5 4
3

0 20%
2016 2017 2018 2019 2020 2021 2022
Year Java C Python JavaScript C++

Figure 2. Collected learning-based APR papers from Figure 3. Paper distribution on programming lan-
2016 to 2022 guages

Trend Observation. Figure 2 shows the collected papers from 2016 to 2022. It is found that the
number of learning-based APR papers has increased rapidly since 2020, indicating that more re-
searchers are considering DL as a promising solution to fixing software bugs. One reason behind this
phenomenon is that traditional APR techniques have reached a plateau [115, 218] and researchers
hope to find a brand-new way to address the problem. Another non-negligible reason is that DL has
proved its potential in various tasks, including natural language translation, which is similar to bug
fixing to some extent. Figure 3 presents an overview of the programming languages targeted by
learning-based APR techniques in our survey. We can find Java occupies a large proportion, which
is understandable as Java is widely adopted in modern software systems nowadays and the most
targeted language in existing mature datasets (e.g., Defects4J [76]). We also find that the collected
papers cover a wide range of programming languages (i.e., Java, JavaScript, Python, C, and C++).
For example, there exist several papers [115, 228] involving multiple programming language repair.
The probable reason may be that learning-based APR techniques usually regard APR as an NMT
problem, independent of programming languages.

3 BACKGROUND AND CONCEPTS

In this section, we will introduce some background information and common concepts in the
learning-based APR field.

3.1 Automated Program Repair

The primary objective of APR techniques is to identify and fix software bugs without human
intervention. In the software development and maintenance process, after a designed functionality
is implemented, developers usually write some test suites (e.g., Junit test cases) to check the
functionality. If there exist test suites that make the functionality fail, developers adopt the failing

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:6 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

Localization Phase Repair Phase

fault localization suspicious code repair strategy generated patch test suite

Verification Phase

buggy program
correct patch plausible patch developer

deployment overfitting patch

Figure 4. Overview of APR

test suites to analyze the symptoms and the root cause of the bug, and attempt to fix the bug
by making some changes to suspicious code elements. More generally, according to Nilizadehet
al. [144], we can give the following definition.

Definition 3.1. ✍ APR: Given a buggy program 𝑃, the corresponding specification 𝑆 that 𝑃 does
not satisfy, the transformation operators 𝑂 and the allowed maximum edit distance 𝜖, APR can
be formalized as a function 𝐴𝑃𝑅(𝑃, 𝑆, 𝑂, 𝜖). 𝑃𝑇 is the set of its all possible program variants by
enumerating all operators 𝑂 on 𝑃. The problem of APR is to find a program variant 𝑃 ′ (𝑃 ′ ∈ 𝑃𝑇 )
that satisfies 𝑆 and the changes satisfies 𝜖 (𝑑𝑖𝑠𝑡𝑎𝑛𝑐𝑒 (𝑃, 𝑃 ′ ) ≤ 𝜖).

The specification 𝑆 denotes a relation between inputs and outputs and most APR techniques
usually adopt a test suite as a specification. In other words, APR aims to find a minimal change to 𝑃
that passes all available test suites. The maximum edit distance 𝜖 limits the range of changes based
on the competent programmer hypothesis [147], which assumes that experienced programmers are
capable of writing almost correct programs and most bugs can be fixed by small changes. For
example, if 𝜖 is set to 0, 𝐴𝑃𝑅(𝑃, 𝑆, 𝑂, 0) becomes a program validation problem that aims to identify
if 𝑃 satisfys 𝑆. On the contrary, if 𝜖 is set to ∞, 𝐴𝑃𝑅(𝑃, 𝑆, 𝑂, 𝜖) becomes a program synthesizing
problem that aims to synthesize a program to satisfy 𝑆.
The typical workflow of APR techniques is illustrated in Figure 4, which is usually composed
of three parts: (1) off-the-shelf fault localization techniques are applied to outline the buggy code
snippets [1] [9]; (2) these snippets are modified based on a set of transformation rules or patterns to
generate new various program variants (i.e., candidate patches); (3) the original test suite is adopted
as the oracle to verify all candidate patches. Specifically, a candidate patch passing the original
test suite is called a plausible patch. A plausible patch, which is also semantically equivalent to the
developer patch, denotes a correct patch.
However, such specifications (i.e., test suites) are inherently incomplete as programs have infinite
domains. It is fundamentally challenging to ensure the correctness of the plausible patches (i.e.,
overfitting issue) due to the weak test suites in practice. Existing studies have demonstrated that
manually identifying the overfitting patches is time-consuming and may harm the debugging

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:7

performance of developers [170, 177]. The overfitting issue is a critical challenge in both traditional
and learning-based APR techniques. We will discuss the issue in Section 4.7.

3.1.1 Patch Generation Techniques. In the literature, numerous traditional APR techniques have
been proposed to generate patches from different aspects, which can be categorized into three
classes. We list them as follows.
• Heuristic-based repair techniques. These techniques usually apply heuristic strategies (e.g.,
genetic algorithm) to build search space from previous patches and generate valid patches by
exploring the search space [93, 123, 229]. For example, SimFix [70] builds an abstract search
space from existing patches and a concrete search space from similar code snippets in the
buggy project. SimFix then utilizes the intersection of the above two search spaces to search
the final patch by basic heuristics (e.g., syntactic distance).
• Constraint-based repair techniques. These techniques usually focus on a single conditional
expression and employ advanced constraint-solving or synthesis techniques to synthesize
candidate patches [44, 124, 129]. For example, Nopol [215] relies on an SMT solver to solve
the condition synthesis problem after identifying potential locations of patches by angelic
fault localization and collecting test execution traces of the program. Besides, Cardumen [124]
synthesizes candidate patches at the level of expressions with its mined templates from the
program under repair to replace the buggy expression.
• Pattern-based repair techniques. These techniques usually design certain repair templates by
manually analyzing specific software bugs and generating patches by applying such templates
to buggy code snippets [85, 106, 107]. For example, TBar [107] revisits the effectiveness of
pattern-based APR techniques by systematically summarizing a variety of repair patterns
from the literature.
In addition to the above traditional APR techniques, researchers attempt to fix software bugs
enriched by DL techniques due to the large-scale open-source source code repositories [184, 242].
Such learning-based techniques have demonstrated promising results and are getting growing
attention recently, which is the focus of our work (introduced in Section 3.2).

3.2 Neural Machine Translation

Sequence-to-sequence (Seq2Seq) is an advanced DL framework widely used in some NLP tasks (e.g.,
machine translation [74] and text summarization [138]). A Seq2Seq model usually consists of two
components (i.e., an encoder and a decoder) to learn mappings between two sequences. Inspired by
the success of Seq2Seq models in text generation tasks, program repair can be formulated as an
NMT task. The learning-based APR problem is formally defined as follows:

Definition 3.2. ✍ Learning-based APR: Given a buggy code snippet 𝑋𝑖 = [𝑥 1, . . . , 𝑥𝑛 ] with

𝑛 code tokens and a fixed code snippet 𝑌𝑖 = [𝑦1, . . . , 𝑦𝑚 ] with 𝑚 code tokens, the problem of
program repair is formalized to maximize the conditional probability (i.e., the likelihood of 𝑌
Î
being the correct fix): 𝑃 (𝑌 | 𝑋 ) = 𝑚
𝑖=1 𝑃 (𝑦𝑖 | 𝑦 1, . . . , 𝑦𝑖 −1 ; 𝑥 1, . . . , 𝑥𝑛 ).

In other words, the objective of an NMT repair model is to learn the mapping between a buggy
code snippet 𝑋 and a fixed code snippet 𝑌 . Then the parameters of the model are updated by using
the training dataset, so as to optimize the mapping (i.e., maximizing the conditional probability 𝑃).
In the literature, recurrent neural network architecture (RNN) is widely used in existing learning-
based APR techniques [27, 58, 183, 184]. Besides, researchers use long short-term memory (LSTM)
architecture to capture the long-distance dependencies among code sequences [23, 130]. Recently,

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:8 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

Figure 5. Detailed workflow of Learning-based APR

as a variant of the Seq2Seq model, Transformer [187] has been considered the state-of-the-art NMT
repair architecture due to the self-attention mechanism [28, 31, 51].

4 LEARNING-BASED APR
In this section, we will discuss the workflow of learning-based APR tools and introduce some
popular learning-based APR techniques with several examples.

4.1 Overall Workflow

Figure 5 illustrates the typical framework of existing learning-based APR techniques. The framework
can be generally divided into six phases: fault localization, data pre-processing, input encoding, output
decoding, patch ranking, patch validation, and patch correctness assessment. We now discuss the
phases in detail as follows.
① In the fault localization phase, a given buggy program is taken as the input and a list of
suspicious code elements (e.g., statements or methods) is returned [204], detailed in Section 4.2.
② In the data pre-processing phase, a given software buggy code snippet (e.g., buggy state-
ment) is taken as the input and the processed code tokens are returned. According to existing
learning-based APR studies [28, 31], there generally exist three potential ways to pre-process
the buggy code: code context, abstraction, and tokenization. First, code context information
refers to other correlated non-buggy lines within the buggy program [139]. Previous work
has demonstrated that NMT-based repair models reveal diverse code changes to fix bugs
under different contexts [27]. Second, code abstraction renames some special words (e.g.,
string and number literals) to a pool of predefined tokens, which has been proven to be an
effective method in reducing the vocabulary size [184]. Third, code tokenization splits source
code into words or subwords, which are then converted to ids through a look-up table [51].
These pre-processing methods are detailed in Section 4.3.
③ In the patch generation phase, the processed code tokens are first fed into a word embed-
ding stack to generate representation vectors, which can capture the semantic meaning of
code tokens and their position within a buggy code. Then an encoder stack is implemented

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:9

to derive the encoder’s hidden state, which is further passed into a decoder stack. Similar
to the encoder stack, a decoder stack is implemented to take the hidden states provided by
the encoder stack and previously generated tokens as inputs, and returns the probability
distribution of the vocabulary. There exist two training paradigms to learn bug-fixing patterns
automatically, i.e., unsupervised learning [32, 184] and self-supervised learning [223, 226],
detailed in Section 4.4.
④ In the patch ranking phase, after the NMT-based repair model is well-trained, a rank
strategy (e.g., beam search) is leveraged to prioritize the candidate patches as prediction
results based on the probability distribution of the vocabulary [170]. Particularly, beam
search [7, 27, 228] is a common practice to select several most high-scoring candidate patches
by iteratively ranking top-𝑘 probable tokens based on their estimated likelihood scores,
detailed in Section 4.5.
⑤ In the patch validation phase, the generated candidate patches are then verified by the
available program specification, such as functional test suites or static analysis tools [14],
detailed in Section 4.6.
⑥ In the patch correctness assessment phase, the plausible patches (i.e., passing the ex-
isting specification) are assessed to predict their correctness (i.e., whether the plausible are
overfitting) [195], which are finally manually checked by developers for deployment in the
software pipeline, detailed in Section 4.7.

4.2 Fault Localization

Fault localization aims to diagnose buggy program elements (e.g., statements and methods) without
human intervention and has been extensively studied to facilitate the program repair process [204].
As a crucial start in the learning-based APR pipeline, fault localization provides the repair model
with information about where a software bug is and directly influences the performance of the
repair model. For example, the repair accuracy under normal fault localization is usually lower
than the circumstance under perfect fault localization.
In the literature, fault localization techniques often leverage various static analysis or dynamic
execution information to compute suspiciousness scores (i.e., probability of being faulty) for each
program element. Program elements are then ranked in descending order of their suspiciousness
scores, based on which APR techniques can further be applied. Researchers have proposed a variety
of fault localization techniques, such as spectrum-based [152, 233], mutation-based [96, 148],
slicing-based [12, 120] and learning-based [97, 112] techniques. Among them, spectrum-based fault
localization (SBFL) has been extensively utilized as a general mechanism to localize the statements
that are likely to be faulty in the APR literature.

4.2.1 Localization Techniques. Similar to traditional APR techniques, some learning-based APR
techniques rely on existing SBFL fault localization approaches to localize the revealed bug. For
example, DLFix [98] adopts Ochiai algorithm to identify a buggy line and extracts all AST nodes
(including intermediate ones) related to that buggy line as a replaced subtree for patch generation.
Recoder [242] also assumes the faulty location of a bug is unknown to APR tools and uses Ochiai al-
gorithm with GZoltar [162], which is widely used in existing APR tools, such as RewardRepair [227]
and AlphaRepair [209]. Such SBFL techniques exploit runtime information to recognize the program
elements that are likely to be faulty when the buggy program is executed by the available test suite.
The crucial insight is that (1) the program elements executed by more failing test suites and fewer
passing test suites are likely to be faulty; and (2) the program elements executed by more passing
test suites and fewer failing suites are likely to be correct. In particular, SBFL produces a list of

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:10 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

program elements ranked according to their likelihood of being faulty based on the analysis of the
program entities covered by passing and failing tests (e.g., Ochiai and Tarantula [105]).
However, Liu et al. [105] have demonstrated that the fault localization techniques may introduce
a significant bias in the evaluation of APR techniques. The vast majority of learning-based APR
techniques consider repairing software bugs under perfect-based fault localization techniques.
Perfect-based fault localization techniques assume that the genuine localization of the bug is
known. Thus, perfect-based fault localization can provide a fair assessment of APR techniques
and the assessment is independent of the localization techniques. For example, CoCoNut [115]
manually checks the bug-fixing pairs in Defects4J benchmark and extracts the changed statements
as inputs to the repair model. Subsequently, recent learning-based APR techniques adopt the same
or similar processing method to conduct perfect localization, such as CIRCLE [228], CURE [73],
SelfAPR [226] and AlphaRepair [209].
Besides, there exist some techniques attempting to perform fault localization on their own. For
example, DeepFix [58] proposes an end-to-end approach in which the network reports a ranked list
of potentially erroneous lines with a beam search mechanism. Similarly, Prophet [111] designs a
fault localization algorithm to return a ranked list of program candidate lines to modify by analyzing
dynamic execution traces of the test suite. Szalontai et al. [172] first localize the nonidiomatic
code snippets by LSTM networks and predict the nonidiomatic pattern by a feed-forward neural
network, which is fixed by a high-quality alternative. Recently, Meng et al. [130] build a novel fault
localization technique based on deep semantic features and transferred knowledge, which is further
fed to a fix template prioritization model and a template-based APR technique TBar [107].

4.2.2 Localization Granularity. APR techniques consider program elements of different granulari-
ties, thus determining the scope of the fault localization. In other words, APR and fault localization
usually work at the same granularity level. For example, if APR techniques focus on repairing
buggy statements (or methods), the fault localization also works at the level of program statements
(or methods). In the literature, a majority of fault localization techniques adopted in learning-based
APR techniques usually record the line of a buggy code snippet [73, 98, 99, 115, 228, 242]. There
also exists little work considering other granularity. For example, Tufano et al. [184] adopt the
NMT-based repair model to learn the translation from buggy to fixed code at the method level.

✎ Summary ▶ As a preceding step in the learning-based APR workflow, fault localization

has a significant impact on the performance of the patch generation, which cannot generate
a correct patch with a wrong suspicious code element. Most learning-based APR techniques
follow the common practice in the traditional APR field to generate patches by integrating
spectrum-based fault localization techniques. There are also some repair techniques that are
starting to use perfect localization (i.e., the ground-truth buggy code element), to avoid the
noise introduced by the off-the-shelf fault localization techniques. Besides, thanks to the code
comprehension capabilities of DL models, some learning-based APR techniques can generate
patches with coarse-grained fault localization, e.g., only the buggy method is provided. ◀

4.3 Data Pre-processing

Data pre-processing phase aims to analyze and parse the identified buggy code snippets, which are
then passed into neural networks for training and inference. In the data pre-processing phase, a
given software buggy code snippet (e.g., a buggy function) is taken as the input and the processed
code tokens are returned. According to existing learning-based repair studies [28, 31], the data
pre-processing phase generally consists of three parts: code context, code abstraction and code
tokenization.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:11

4.3.1 Code Context. Code context generally refers to other correct statements around the buggy
lines. In the manual repair scenario, the context of the buggy code plays a significant role in
understanding faulty behaviors and reasoning about potential repairs. Developers usually identify
the buggy lines, and then analyze how they interact with the rest of the method’s execution, and
observe the context (e.g., variables and other methods) in order to come up with the possible repair
and pick several tokens from the context to generate the fixed line [83]. In learning-based APR, the
NMT model mimics this process by extracting the code context and the buggy line into a certain
code representation to preserve the necessary context that allows the model to predict the possible
fixes.
Existing learning-based APR techniques typically consider the surrounding source code relevant
to the buggy statement as context. These techniques typically employ context in various ways,
such as extracting code near the buggy statement within the buggy method, class, and even file.
On the one hand, a broad context contains plenty of essential fix ingredients, while such a large
vocabulary size introduces noise that negatively affects the repair performance of the NMT model
due to the tricky long-term dependency problem in NMT models [27]. In particular, long-term
dependency refers to the situation that the meaning of a token depends on another token that is
far apart from it in a code snippet [187]. As a result, NMT repair models often struggle to capture
long-term dependencies when dealing with tokens that appear over long code snippets [228]. On
the other hand, a narrow context contains too little information to capture the proper semantics of
the buggy statement and leads to incorrect patches generation due to a lack of necessary vocabulary.
There seems to be a trade-off relationship between vocabulary size and context size. Our survey
concludes the code context of existing learning-based APR studies into four granularities: context-
free, line-level context, method-level context, and class-level context.

• Context-free. This granularity refers to the scenario where NMT repair modes only take
buggy statements without any additional code snippets as inputs [40, 63, 125]. For example,
Mashhadi et al. [125] consider single statement bugs from the ManySStuBs4J dataset and
extract the buggy statement as a source side and the fixed statement as a target side from
bug-fixing commits. Ding et al. [40] provide NMT models with a single program line that
contains a buggy statement. However, previous work demonstrates that fixing nearly 90%
of bugs requires new vocabulary relative to the buggy code. Therefore, NMT repair models
suffer from capturing enough information from the buggy statements alone.
• Statement-level context. This granularity refers to the scenario where NMT repair models
take the buggy statements and several statements that the buggy code and some surrounding
correct statements as inputs [15, 31]. For example, TFix [15] extracts the two neighboring
statements of the buggy code as the code context. Chi et al. [31] extract statement-level
code changes by the “git diff” command and employ data-flow dependencies to capture more
critical information around the context.
• Method-level context. This granularity refers to the scenario where NMT modes take
the whole method to which the buggy statements belong as inputs [115, 184, 228]. It is the
most commonly used type of context in literature as it often contains enough information
for repairing the bug, such as the type of variables and the function of this method. For
example, Tufano et al. [183] focus on the method-level context since (1) the functionality to be
fixed is usually implemented in program methods; (2) the methods provide neural networks
with meaningful abundant context information, such as literals and variables. Similarly,
CoCoNuT [23] extracts the entire method of the buggy code as context, which is encoded as
a separate input.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:12 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

int GetMaxCommonDivisor(int m, int n){ int GetMaxCommonDivisor(int m, int n){

int r; int r;
while (n!=0){ while (n!=0){
r=m%n; r=m%n;
m=n; m=n;
n=r; n=r;
} }
return n; return m;
} }

raw(a)buggy code
raw buggy code raw fixed
(b) raw fixedcode
code

TYPE_1 METHOD_1(TYPE_1 VAR_1, TYPE_1 VAR_2){ TYPE_1 METHOD_1(TYPE_1 VAR_1, TYPE_1 VAR_2){
TYPE_1 VAR_3; TYPE_1 VAR_3;
while (VAR_2!=NUMBER_1){ while (VAR_2!=NUMBER_1){
VAR_3=VAR_1%VAR_2; VAR_3=VAR_1%VAR_2;
VAR_1=VAR_2; VAR_1=VAR_2;
VAR_2=VAR_3; VAR_2=VAR_3;
} }
return VAR_2; return VAR_1;
} }

abstracted buggy
(c) abstracted buggycode
code abstracted
(d) fixedcode
abstracted fixed code

Figure 6. A simple example of code abstraction

• Class-level context. This granularity refers to the scenario where NMT repair models
take the class to which the buggy statements belong to as inputs. It is a relatively broad
context, while it can provide the model with rich information. For example, SequenceR [27]
considers the class-level context and conducts abstract buggy context from the buggy class,
which captures the most important context around the buggy source code and reduces the
complexity of the input sequence to 1,000 tokens. Hoppity [39] takes the whole buggy file as
the context with a length limit of 500 nodes in the AST.

4.3.2 Code Abstraction. Code abstraction aims to limit the number of words the NMT models need
to process by renaming raw words (e.g., function names and string literals) to a set of predefined
tokens. Previous work demonstrates that it is challenging for NMT models to learn bug-fixing
transformation patterns due to the huge vocabulary of source code [184]. In particular, NMT models
usually employ a beam-search decoding strategy to output repair candidates by a probability
distribution over all words. The search space can be extremely large with many possible words in
the source code, resulting in inefficient patch generation.
In our survey, a considerable number of learning-based papers we collect employ the abstracted
source code to tackle this problem. Such abstraction operation means the original source code
is not directly fed into the NMT model. Benefiting from the abstracted code, we can (1) reduce
the size of vocabulary significantly and the frequency of specific tokens; (2) filter out irrelevant
information and improve the efficiency of the NMT model. Generally, the natural elements (e.g.,
identifiers and literal) in the source code are renamed, while the core semantic information (e.g.,
idioms) should be preserved. For example, Tufano et al. [184] propose the first code abstraction
approach in the learning-based APR field by (1 ) adopting a lexer to tokenize the raw source code
as a stream of tokens based on Another Tool for Language Recognition (ANTLR) [151]; (2) passing
the stream of tokens into a parser to identify the role of each identifier and literals (e.g., whether
it represents a variable, method, or type name); (3) replacing each identifier and literal with a
unique ID to generate the abstracted source code. Besides, they extract the idioms (i.e., tokens that
appear many times) and keep their original textual tokens in the abstraction process because such

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:13

idioms contain beneficial semantic information. The typical code abstraction example is presented
in Figure 6. Similarly, CoCoNut [115] and CURE [73] only abstract string and number literals
except for the frequent numbers (e.g., 0 and 1). DLFix [98] adopts a novel abstraction strategy to
alpha-rename the variables, so as to learn the fix between methods with similar scenarios while
having different variable names. DLFix also keeps the type of the variable to avoid accidental
clashing names and maintains a mapping table to recover the actual names. Recoder [242] abstracts
infrequent identifiers with placeholders to make the neural network learns to generate placeholders
for these identifiers.
Although a variety of learning-based APR techniques adopt the code abstraction strategy (such
as Tufano et al. [184]) to limit the vocabulary size and make the transformer concentrate on learning
common patterns from different code changes, we still find some repair techniques prefer raw
source code [228, 242] because it contains semantic information. For example, developers may
name one function as SetHeightValue to indicate that this function can set the value of height as
they want. If this name is abstracted directly as func_1, the critical semantic information would be
missed, resulting in suboptimal repair training. Thus, instead of renaming rare identifiers through
a custom abstraction process, SequenceR [27] utilizes the copy mechanism to generate candidate
patches with a large set of tokens. During programming, developers are not restricted by a set
vocabulary (e.g., English) when defining names for variables or methods, resulting in an extremely
large vocabulary with many rare tokens. The copy mechanism seeks to copy some rare input tokens
to the output and is effective in reducing the required vocabulary size [27]. Besides, Chen et al. [28]
adopt the raw source code as they think abstracted code may hide valuable information about the
variable that can be learned by word embedding. A strategy that is similar to Chen et al. [28] is
also implemented in other learning-based APR techniques, such as in CODIT [23], CIRCLE [228]
and TFix [15].
4.3.3 Code Tokenization. Code tokenization aims to split source code into a stream of tokens,
which are then converted to ids through a look-up table2 . These id numbers are in turn used
by the repair models for further processing and training. A simple tokenization approach can be
conducted by dividing the source code into individual characters. The core concept of this char-level
tokenization is that although the source code has many different words, it has a limited number of
characters. This approach is straightforward and leads to an exceeding small vocabulary. However,
it leads to a relatively long tokenized sequence with the splitting of each world into all characters.
More importantly, it is pretty difficult for repair models to meaningful input representations as
characters alone do not have semantic meaning. Generally, there exist two main granularities of
code tokenizers used in existing learning-based APR techniques: word-level tokenization [51] and
subword-level tokenization [31].
The word-level tokenization means that a sentence is divided according to its words (e.g., space-
separated), which is widely used in NLP tasks [158]. However, different from natural language (e.g.,
English dictionary), words (e.g., variable and method names) in programming languages can be
created by developers arbitrarily. As a result, there may exist some rare words not available in the
vocabulary (i.e., the out-of-vocabulary problem), resulting in unknown tokens in patch generation.
To address this issue, VRepair [28] employs a word-level tokenization to tokenize C source code and
the copy mechanism to deal with the out-of-vocabulary problem. Similarly, CoCoNut [115] designs
a code-aware space-separated tokenization algorithm that is specific to programming languages
by (1) separating operators from variables as they might not be space-separated; (2) considering
underscores, camel letters, and numbers as separators as many words are composed of multiple
words without separation (e.g., SetHeightValue); (3) introducing a new token <CAMEL> to mark
2 https://huggingface.co/Salesforce/codet5-base/blob/main/vocab.json

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:14 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

where the camel case split occurs to regenerate source code from the list of tokens generated
correctly.
The subword-level tokenization splits rare tokens into multiple subwords instead of directly
adding full tokens into the vocabulary. Besides, the frequent words should not be split into smaller
subwords. This kind of granularity can reduce the vocabulary size significantly and is widely
used in the learning-based APR field. Technically, there exist several subword-level tokenization
techniques, such as byte-pair encoding (BPE), byte-level byte-pair encoding (BBPE) [168] and
SentencePiece [87], listed as follows.
(1) BPE tokenizer generally needs to be trained upon a given dataset by (1) leveraging a pre-
tokenizer to splits the dataset into words by space-separated tokenization; (2) creating a
set of unique words and counting the frequency of each word in the dataset; (3) building a
base vocabulary with all symbols that occur in the set of unique words and learning merge
rules to form a new symbol from two symbols of the base vocabulary; (4) repeating the
above process until the vocabulary is reduced to a reasonable size, which is a pre-defined
hyperparameter, before training the tokenizer. For example, VulRepair [51] employs a BPE
algorithm to train a subword tokenizer on eight different programming languages (i.e., Ruby,
JavaScript, Go, Python, Java, PHP, C, C#) [197] and is suitable for tokenizing source code. In
the learning-based APR literature, a majority of repair studies adopt BPE as the tokenization
technique, such as CURE [73], CoCoNut [115], SeqTrans [31]. The results have demonstrated
the effectiveness of BPE in reducing vocabulary size and mitigating the OOV problem by
extracting the most frequent subwords and merging the most frequent byte pair iteratively.
(2) BBPE refines BPE by employing bytes as the base vocabulary, ensuring that every base
character is included with a proper vocabulary size. For example, AlphaRepair [209] builds a
BBPE-based tokenizer to reduce the vocabulary size by breaking uncommon long words into
meaningful subwords.
(3) SentencePiece contains the space in the base vocabulary and utilizes the existing BPE al-
gorithm (e.g., BPE) to create the desired vocabulary by regarding the source code as a raw
input stream. In the literature, before entering source code into the neural network, sev-
eral learning-based APR techniques use SentencePiece to divide words into a sequence of
subtokens, such as SelfAPR [226], RewardRepair [227] and CIRCLE [228].
✎ Summary ▶ Data preprocessing is responsible for processing the code snippets into a
suitable format and feeding it to the NMT repair models for training. Different learning-based
APR techniques employ diverse data pre-processing methods, learning to complex experimental
settings in the literature. For example, code abstraction involves raw code or abstracted code;
code context involves context-free, statement-level, method-level and class-level context; code
tokenization involves BPE, BBPE and SentencePiece tokenizers. On the one hand, these different
configurations may introduce bias to the evaluation of existing learning-based APR techniques.
On the other hand, the optimal combination of these configurations requires further exploration,
and it is also important to consider their interactions with other factors, such as the model
architectures and the types of software bugs being fixed. ◀

4.4 Patch Generation

In the learning-based APR context, to apply NMT repair models to high-level programming lan-
guages, the code snippets need to be converted to embedding vectors. Then an NMT repair model
is built on top of the encoder-decoder architecture [187] to learn the repair patterns automatically.
Finally, the mapping from buggy code to fixed code is optimized by updating the parameters of the
designed model. Thus, it is crucial to determine (1) how to represent the source code (with which

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:15

format) as input for word embedding, referred to as code representation; and (2) how to design the
specific architecture (with which neural network) as encoder-decoder for repair transformation
learning, referred to as model architecture.
In the literature, various strategies have been proposed to represent the source code as the input
for NMT repair models, which can be categorized into three classes: sequence-based, tree-based and
graph-based representation.
4.4.1 Sequence-based Generation. These techniques divide the textual source code as a sequence
of tokens and treat APR as a token-to-token translation task based on a Seq2Sep model.
Code Representation. Considering the buggy lines and the context, there generally exist four
different ways to sequence the textual code tokens.
(1) Raw representation.
Similar to NMT, which translates a sentence from one source language (e.g., English) to another
target language (e.g., Chinese), most sequence-based techniques directly feed the model with
the buggy code snippet [184]. For example, Tufano et al. [184] extract the buggy method
and train an NMT model for method-to-method translation. The size of this code snippet
depends on the choice of the buggy code and code context. However, the raw representation
is unaware of the difference between the buggy code and the code context, as these two parts
are sent into the encoder together. As a result, the transformation rules may be applied in
some correct lines, limiting the repair performance.
(2) Context representation.
The context representation splits the buggy code and the code context, then feeds them
into two encoders separately. Under this circumstance, the model is aware of the difference
between buggy code and the corresponding context. For example, Lutellier et al. [73, 115]
attempt to encode these two parts separately and then merge the encoding vectors. However,
it is challenging to merge the two separated encoding vectors and eliminate the semantic
gaps between the two encoders.
(3) Prompt representation.
The prompt representation refers to a text-in-text-out input format and can effectively
concatenate different input components with some prefixed prompt [158]. The prefixed
prompt is a piece of tokens inserted in the input, so that the original task can be formulated
as a language modeling task. For example, Yuan et al. [228] employs manually designed
prompt template to convert buggy code and corresponding context into a unified fill-in-the-
blank format. In particular, they employ “Buggy line:” and “Context:” to denote the buggy
code and code context, and then employ “The fixed code is:” to guide the NMT model to
generate candidate patches according to the previous input. This mechanism has been proven
effective in bridging the gap between pre-trained tasks and the downstream task, facilitating
fine-tuning pre-trained models for APR.
(4) Mask representation.
The mask representation replaces the buggy code with mask tokens and queries NMT models
to fill the masks with the correct code lines. This mechanism views the APR problem as a
cloze task and usually adopts the pre-trained model as the query model in the learning-based
APR. For example, Xia et al. [209] transform the original buggy code into a comment and
generate multiple mask lines with templates. The input is represented by comment buggy
code, context before buggy code, mask lines and context after buggy code. In particular, the
buggy code is masked randomly from one token to the whole line, and researchers expect to
generate every possible patch for different situations within a limited candidate patch size.
Compared with the above three representation strategies, the mask representation can adopt

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:16 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

pre-trained models to predict randomly masked tokens to perform cloze-style APR without
any additional training on the bug-fixing dataset.

Model Architecture. Sequence-based techniques usually treat the source code as a sequence of
tokens and adopt existing sequence-to-sequence architectures in the NLP field instead of designing
new network architectures. For example, CoCoNut [115] adopts two fully convolutional (FConv)
encoders to represent the buggy lines and the context separately. One common encoder architecture
is long short-term memory (LSTM), and it resolves the long-term dependency problem of the RNN
module by introducing the gate mechanism and ensures that short-term memory is not neglected.
For example, SequenceR [27] is based on an LSTM encoder-decoder architecture with copy mech-
anism. As a powerful kind of DL architecture, the transformer can model global dependencies
between input and output effectively thanks to the attention mechanism and has been adopted in
existing APR studies, such as Bug-Transformer [221], SeqTrans [31] and VRepair [28].
Recently, the usage of pre-trained models has gradually attracted the attention of researchers in
the learning-based APR community. Such models are first pre-trained by self-supervised training on
a large-scale unlabeled corpus (e.g., CodeSearchNet [69]), and then transferred to benefit multiple
downstream tasks by fine-tuning on a limited labeled corpus. For example, Mashhadi et al. [125]
employ CodeBERT, a bimodal pre-trained language model for both natural and programming lan-
guages, to fix Java single-line bugs by fine-tuning on the ManySStuBs4J small and large datasets [92].
CURE [73] applies a pre-trained GPT model to further revise an NMT-based APR architecture (i.e.,
CoCoNut). CIRCLE [228] proposes a T5-based program repair framework equipped with continual
learning ability across multiple languages. We will discuss the application of pre-trained models in
Section 5.

4.4.2 Tree-based Generation. Sequence-based APR techniques usually adopt Seq2Seq models for
patch generation. However, these techniques ignore code structure information because they are
designed for NLP, which is significantly different from programming language with strict syntactic
and grammatical rules. The generated patches of these techniques may suffer from syntax errors
that cause compilers to fail. As a result, researchers recently propose various tree-based generation
techniques by considering the syntactic structure of source code. These techniques treat the APR
problem as a tree transformation learning task.
Code Representation. A common solution is to parse the source code into an AST and adopt a
tree-aware model to perform patch generation, i.e., structure-aware representation. For example,
given a bug-fixing method pair 𝑀𝑏 and 𝑀 𝑓 representing the buggy and fixed method, DLFix [98]
first extracts a buggy AST for 𝑀𝑏 (i.e., 𝑇𝑏 ), a fixed AST for 𝑀 𝑓 (i.e., 𝑇𝑓 ), a buggy sub-AST (i.e, 𝑇𝑏𝑠 ) and
a fixed sub-AST (i.e., 𝑇𝑓𝑠 ) between 𝑇𝑏 and 𝑇𝑓 . DLFix then adopts an existing summarization model
to encode 𝑇𝑏𝑠 as a single node 𝑆𝑏𝑠 . Finally, the buggy method 𝑀𝑏 can be represented as a context
tree by replacing 𝑇𝑏𝑠 in 𝑇𝑏 with 𝑆𝑏𝑠 and a sub-changed tree 𝑇𝑏𝑠 . The fixed method 𝑀 𝑓 is represented
in a similar way.
As tree-based representation contains the structure information, which cannot be directly de-
ployed to sequence-based neural models. Thus, an additional code representation strategy is utilized
to parse the tree representation as a sequential traverse sequence, i.e., sequential-traverse repre-
sentation. For example, Tang et al. [176] parse the source code into AST representation, which is
further translated into a sequence of rules. The sequence of rules can be processed by the vanilla
transformer [187] while capturing the grammar and syntax information. Similarly, CODIT [23]
first represents code snippets as AST by (1) identifying the edited AST nodes (i.e., the inserting,
deleting, and updating); (2) selecting the minimal subtree of each AST and (3) collecting the edit
context by including the nodes that connect the root of the method to the root of the changed tree.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:17

CODIT then employs a tree-based model to learn the structural changes in the form of a sequence
of grammar, which is finally used to predict the fixed code sequence with a standard Seq2Seq model.
Model Architecture. Most NMT-based APR models treat patch generation as a machine translation
from buggy code to a fixed one. However, such models could not capture the information of code
structures and suffer from handling the context of the code. Tree-based encoders consider the
structure features of source code, such as AST. For example, DLFix [98] represents source code
as ASTs and employs tree-based RNN models to encode the context tree and sub-changed trees.
Besides, Devlin et al. [38] encode the AST with a sequential bidirectional LSTM by enumerating a
depth-first traversal of the nodes.

4.4.3 Graph-based Generation. These techniques transform source code into graph representations
with contextual information and frame the APR problem in terms of learning a sequence of graph
transformations based on graph-based models. Instead of directly manipulating the source code, such
graph-based APR techniques aim to learn a sequence of transformations on the graph representation
that would correspond to a corrected version of the original code.
Code Representation. To capture the neighbor relations between AST nodes, Recoder [242] treats
AST as a directional graph where the nodes denote AST nodes and the edges denote the relationship
between each node and its children and left sibling. Besides, Xu et al. [214] consider the context
structure by data and control dependencies captured by a data dependence graph (i.e., DDG) and a
control dependence graph (i.e., CDG).
Model Architecture. Existing graph-based APR techniques usually design graph neural networks
and their variants to capture graph representation and perform patch generation. For example,
Hoppity [39] adopts a gated graph neural network (GGNN) to treat the AST as a graph, where a
candidate patch is generated by a sequence of predictions, including the position of graph nodes
and corresponding graph edits. Besides, Xu et al. [214] design a graph neural network (GNN) for
obtaining a graph representation by first converting DDG and CDG into two graph representations
and then fusing them.

✎ Summary ▶ As the most crucial phase in the repair workflow, a majority of existing
learning-based APR techniques focus on patch generation. These patch generation techniques
typically can be divided into two parts: code representation and the corresponding model
architecture. The key research question lies in how to appropriately represent code snippets
and determine the model architecture that can effectively learn the transformation relationship
between buggy code and correct code. Inspired by NLP, incipient repair techniques usually
represent the source code as a sequence of tokens, and transform an APR problem into an
NMT task on top of a sequence-to-sequence model. The follow-up techniques represent the
source code as a tree or graph representation and adopt tree-aware models (e.g., tree-LSTM)
or graph-aware models (e.g., GGNN) to perform patch generation. The literature does not
demonstrate which code representation or model architecture exhibits the best performance.
An in-depth controlled experiment can be conducted to investigate the performance between
different code representations and the corresponding model architectures. ◀

4.5 Patch Ranking

The patch generation is a search process for the maximum in the combinatorial space. Given the
max output length l and the size of vocabulary V, the total number of candidate patches that the
decoder can generate reaches 𝑉 𝑙 , all of which it is impossible to validate in practice. Developers
may spend a considerable amount of effort to assess the correctness of the generated candidate
patches manually. In such a scenario, only inspecting fewer repair candidates (e.g., Top-1 and Top-5)

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:18 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

that have a high probability of being correct is more practical and reduces the valuable manual
effort. As a result, a patch ranking strategy is crucial to ensure the inference efficiency of the model
and relieve the burden of patch validation.
Beam search is an effective heuristic search algorithm to rank the outputs in previous NMT
applications [197] and is the most common patch ranking strategy in learning-based APR studies,
such as CIRCLE [228], SelfAPR [226], RewardRepair [227] and Recoder [242]. In particular, at each
iteration, the beam search algorithm selects the 𝑘 most probable tokens (corresponding to beam
size 𝑘) and ranks them according to the likelihood estimation score of the next 𝑑 prediction steps
(corresponding to search depth 𝑑). The iteration repeats until a stopping condition is met, such as
reaching a certain sequence length or all sequences ending with an end-of-sequence token. Finally,
the top 𝑘 high-scoring candidate patches are generated and ranked for further validation in the
next procedure of the overall learning-based APR workflow. Beam search provides a great trade-off
between repair accuracy versus inference cost via its flexible choice of beam size.
However, the vanilla beam search considers only the log probability to generate the next token
while ignoring the code-related information, such as variables. Thus, it may generate high-score
patches with unknown variables, leading to uncompilable candidate patches. In addition to directly
applying the existing beam search strategy, researchers design some novel strategies to filter out
low-probability patches. For example, CURE [73] designs a code-aware beam search strategy to
generate more compilable and correct patches based on valid-identifier check and length control
components. The code-aware strategy first performs static analysis to identify all valid tokens
used for sequence generation and then prompts beam search to generate sequences of a similar
length to the buggy line. DLFix [98] first derives the possible candidate patches by program analysis
filtering and ranks the list of possible patches by a CNN-based binary classification model. The
classifier adopts a Word2Vec model as the encoder stack at the char level, followed by a CNN stack
as the learning stack (containing a Convolutional layer, pooling, and fully connected layers), and a
softmax function as the classification stack. Then DLFix ranks the given list of patches based on
their possibilities of being a correct patch. Further, DEAR [99] applies a set of filters to verify the
program semantics and ranks the candidate patches in the same manner as DLFix does.
In addition to the widely-used beam search and their variants, there are also some self-designed
patch ranking methods as a component in the patch generation. As early as 2016, Long et al. [111]
propose Prophet, which trains a ranking model to assign a high probability to correct patches based
on designed features (detailed in Section 7.2). Recently, AlphaRepair [209] designs a patch ranking
strategy based on a masked language model. In particular, given a candidate patch, AlphaRepair
calculates its priority score by (1) extracting all generated tokens; (2) masking out only one of the
tokens; (3) querying CodeBERT to obtain the conditional probability of that token; (4) repeating
the same process for all other previous mask tokens; and (5) computing the joint score which is an
average of the individual token probabilities.

✎ Summary ▶ Patch ranking seeks to prioritize candidate patches with a higher probability
of being correct in the search space. As a greedy strategy, beam search is widely adopted in
existing learning-based APR techniques to keep 𝑘 optimal tokens at every iteration according
to the likelihood estimation score. Besides, some advanced patch ranking strategies (e.g., a
code-aware beam search strategy to consider valid identifiers) are proposed to identify high-
probability while low-quality patches, such as uncompilable candidate patches. Overall, a
majority of existing learning-based APR techniques follow the vanilla beam search strategy
and the literature fails to see systematic research to delve into the impact of patch ranking
strategies on repair performance. As a guideline for future work, after summarizing existing

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:19

patch ranking works, we recommend that a reasonable patch ranking technique needs to
consider three aspects: ①effectiveness, i.e., it should have a sufficiently large search space to
encompass the correct patches; ②efficiency, i.e., it should have a fast retrieval speed to find the
correct patch in a reasonable amount of time; and, ③priority, i.e., it should prioritize the patch
that is more likely to be correct higher based on additional code information, such as code
syntactic and semantic features. ◀

4.6 Patch Validation

Patch validation takes a ranked list of candidate patches generated by NMT models as the input
and returns the plausible patches for deployment, which is a crucial phase in the learning-based
APR pipeline. However, developers may spend a considerable amount of effort to inspect the
candidate patches manually. Thus, researchers usually recompile the buggy program with the
generated patch to check if it can pass the available test suite. In such a scenario, hundreds or even
thousands of candidate patches can be filtered automatically (e.g., 1000 candidate patches per bug
in CIRCLE [228]), which may benefit its adoption in practice.
Similar to traditional APR techniques, most learning-based techniques adopt a test-based vali-
dation strategy (i.e., executing available test suites against each candidate patch) to assess patch
correctness [73, 98, 99, 115, 228, 242]. For example, CIRCLE [228] filters out the candidate patches
that do not compile or do not pass available test suites. There generally exist two criteria for the
validation process: (1) the passing test suites that make the buggy program pass should still pass
on the patched program; and (2) the fault-triggering test suites that fail on the buggy program
should pass on the patched program. All candidate patches are checked until a plausible patch (i.e.,
a patch passing all test suites) is found. Finally, CIRCLE stops the validation process and reports the
plausible patch for manual investigation. Similar test-based validation strategies are also employed
by other learning-based APR appraoches, such as Recoder [242], CoCoNut [115] and CURE [73].
However, it can be extremely time-consuming to compile a large number of candidate patches
and repeat all test executions to identify plausible patches. For example. CURE [73] generates
10,000 candidate patches per bug and validates the top 5,000 ones considering the overhead time.
Similarly, AlphaRepair [209] returns at most 5,000 candidate patches for each bug. To reduce the
validation cost, some learning-based APR techniques return an acceptable amount of candidate
patches. For example, RewardRepair configures the beam size as 200 and outputs the 200 best
patches per bug. Similarly, SelfAPR adopts a beam search size of 50 and Recoder generates 100 valid
candidate patches for validation. Besides, similar to traditional APR techniques [70, 107], there
exist several learning-based ones limiting maximum time for validation. For example, DEAR [99]
and DLFix [98] set a 5-hour running-time limit for patch generation and validation.
In addition to the above strategies in patch validation, the learning-based APR community benefits
from some optimizations to speed up the dynamic execution. For example, AlphaRepair [209] adopts
the UniAPR [25] tool to validate the candidate patches on-the-fly. For example, Inspired by the
PraPR work [54], Chen et al. [25] present UniAPR as the first unified on-the-fly patch validation
framework to speed up APR techniques for JVM-based languages at both the source and byte-code
levels. They leverage the JVM HotSwap mechanism and Java Agent technology to implement
this framework. Besides, they apply the JVM resetting technique based on the ASM byte-code
manipulation framework. Since previous work shows that on-the-fly patch validation can be
imprecise, they reset the JVM state right after each patch execution to address such an issue.
Orthogonal to UniAPR, Bento et al. [14] introduce SeAPR, the first self-boosted patch validation
tool. Based on the idea that patches similar to earlier high-quality/low-quality patches should
be promoted/degraded, they leverage the patch execution information on its similarity with the

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:20 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

executed patches to update each patch’s priority score. The evaluation shows that SeAPR can
substantially speed up the studied APR techniques and its performance is stable under different
formulae for computing patch priority. Besides, the literature has seen the emergence of several
patch validation studies. For example, as early as 2012, Qi et al. [155] propose WAutoRepair, a
repair system that combines Genprog with a recompilation technique called weak recompilation to
reduce time cost and make program repair more efficient. WAutoRepair views a program as a set of
components and for each candidate patch, only one component is modified. After that, the changed
component is compiled to a shared library to reduce the time cost. In 2013, inspired by regression
test prioritization, Qi et al. [156] propose TrpAutoRepairto prioritize test case execution based on the
faults information in the repair process. Although these works have achieved commendable results,
most of them have all been applied to traditional APR techniques, e.g., GenProg [93]. However,
considering that the patch validation phase is designed to compile and execute the candidate patch,
which is independent of the specific patch generation techniques, such patch validation techniques
have the potential to be extended to learning-based repair techniques in the future.

✎ Summary ▶ Dynamic execution is a common practice to automatically validate the

code’s compilability and functional correctness of programs in the SE community. However,
it is time-consuming to compile and execute such programs, especially in the field of APR
where thousands of candidate patches and a mass of functional test cases are involved. Most
learning-based APR techniques rely on a test-based validation strategy to identify plausible
patches, which is a standard step in both traditional and learning-based APR communities.
Besides, recently there have been some advanced techniques proposed specifically to validate
these candidate patches more quickly, such as JVM HotSwap. Currently, there is no distinct
differentiation in patch validation research between the traditional and learning-based APR
communities. The possible reason lies in that, the patch validation phase aims to identify high-
quality patches that pass the available test cases. This objective can be achieved by directly
executing the test cases, without concerning whether the patches come from traditional or
learning-based APR techniques. We encourage more research on patch validation that is specific
to learning-based APR techniques, discussed in Section 8. ◀

4.7 Patch Correctness

Patch correctness is an additional phase for developers to further filter out overfitting patches after
patch validation, so as to improve the quality of returned patches. As discussed in Section 4.6, a
majority of existing learning-based APR techniques usually leverage the developer-written test
suites as the program specification to assess the correctness of the generated patches. However,
the test suite is an incomplete specification as it only describes a part of the program’s behavioral
space. As a result, it is fundamentally difficult to achieve high precision for returned patches due
to the incomplete program specification [90]. The plausible patch passing the available test suites
may not generalize to other potential test suites, leading to a long-standing challenge of APR (i.e.,
the overfitting issue) [90, 234].
For example, Qi et al. [157] have demonstrated that a majority of the overfitting patches generated
by previous APR approaches (e.g., GenProg [93]) for 105 C language bugs are equivalent to a single
modification that deletes the buggy functionality and does not actually fix the detected bugs. Under
the circumstances, it takes enormous time and effort to manually filter out the overfitting patches,
even resulting in a negative debugging performance [177, 238]. Different from some traditional
APR techniques that guide the repair process to generate patches with a high probability of being
correct, DL techniques lead to an end-to-end repair mechanism and the patches are generated in a
black-box manner. The overfitting issue in learning-based APR is more significant and severe [198].

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:21

In the literature, researchers have proposed a mass of automated patch correctness assessment
(APCA) techniques to identify whether a plausible patch is indeed correct or overfitting [179].
Xiong et al. [212] propose PATCH-SIM to identify correct patches based on the similarity of test
case execution traces on the buggy and patched programs. PATCH-SIM has been acknowledged
as a foundational work in this field [179], providing crucial guidance for the conception and
development of follow-up works [217, 224]. There are usually two types of traditional APCA
techniques based on the employed patch features: static and dynamic [195]. The former focuses
on the transformation patterns or the static syntactic similarity (e.g., Anti-pattern [174]), while
the latter relies on the dynamic execution outcomes by additional test suites from automated test
generation tools (e.g., PATCH-SIM [212]). Recently, inspired by large-scale patch benchmarks being
released, some learning-based APCA techniques have been proposed to predict patch correctness
with the assistance of DL models [178, 179, 224]. In general, such learning-based APCA techniques
extract the code features (e.g., static representation or dynamic execution traces) and build a
classifier model to directly perform patch correctness prediction. We view patch correctness as an
essential component of the learning-based APR pipeline and focus on such APCA techniques that
employ DL models.

Table 1. A summary and comparison of learning-based APCA studies

Year Approach Language Feature Dataset Repository

https://github.com/AAI-
2020 Csuvik et al. [34] Java Code Representation QuixBugs [103] USZ/APR-patch-
correctness-IBF2020
Defects4J [76],Bears [117],
https://github.com/TruX-
2020 Tian et al. [179] Java Code Representation Bugs.jar [165], ManySStubBs4J [78],
DTF/DL4PatchCorrectness
QuixBugs [103],RepairThemAll [43]
https://github.com/AAI-
2021 Csuvik et al. [35] JavaScript Code Representation BugsJS [59] USZ/JS-patch-exploration-
APR2021
https://github.
Defects4J [76],
2021 ODS [224] Java Engineered Feature com/SophieHYe/
Bears [117],Bugs.jar [165]
ODSExperiment
Defects4J [76],ManySStuBs4J [78], https://github.com/
2021 CACHE [102] Java Code Representation
RepairThemAll [43] Ringbo/Cache
Defects4J [76],Bears [117],
Code Representation https://github.com/
2022 Tian et al. [180] Java Bugs.jar [165],ManySStubBs4J [78],
Engineered Feature HaoyeTianCoder/Panther
QuixBugs [103],RepairThemAll [43]
https://github.com/
2022 BATS [178] Java Test Specification Defects4J [76]
HaoyeTianCoder/BATS
https://github.com/
Defects4J [76],
2022 QUATRAIN [181] Java Bug Report Trustworthy-Software/
Bears [117],Bugs.jar [165]
Quatrain
Textual similarity
https://github.com/ali-
2022 Shibboleth [55] Java Execution Trace Defects4J [76]
ghanbari/shibboleth
Code coverage
https://github.com/
2022 Crex [216] C Execution Trace CodeFlaws [173]
1993ryan/crex

Table 1 presents a summary of existing learning-based techniques to predict patch correctness

automatically in the literature. The first and second columns list the APCA technique and the time
of publication. The third column lists the targeted programming languages. The fourth column lists
the features adopted by the APCA technique. The remaining columns list the employed datasets
and the public repositories. Now, we discuss and summarize these individual approaches as follows.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:22 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

As early as 2020, inspirited by the similarity-based strategy in state-of-the-art traditional APCA

techniques (e.g., the behavior similarity of execution traces in PATCH-SIM [212]), Csuvik et al. [34]
present the first study to explore the nature of similarity-based approach based on static code
representation for patch correctness assessment. They leverage embedding models (i.e., Doc2vec
and BERT) to calculate the similarity between the buggy and patched code snippets and classify
patch correctness by a pre-defined similarity threshold. The experimental results on the QuixBugs
dataset show that the proposed approach successfully filters out 45% (16/35) of the incorrect
patches. In 2021, Csuvik et al. [35] further adapt the similarity-based method based on static code
representation to JavaScript with quantitative and qualitative analysis in depth.
However, the study of Csuvik et al. [34] is preliminary and small-scale, only with a single
BERT model on 40 one-line bugs. Tian et al. [179] further conduct a more large-scale empirical
study to investigate the feasibility of code representation learning to encode the properties of
patch correctness. They consider different representation learning techniques (i.e., Doc2Vec, BERT,
code2vec, and CC2Vec) to get embedding vectors for code changes, including pre-trained models
and the retraining of models. They also investigate the discriminative power of learned features in
a classification training pipeline (i.e., Decision tree, Logistic regression, and Naive Bayes) for patch
correctness. Overall, this work demonstrates the promising future of representation learning in
patch correctness assessment, and is valuable fo follow-up works [102, 180]. Later, Tian et al. [180]
further extend their previous work [179] by examining the effectiveness of code representation,
engineered features, and their combination for predicting patch correctness. They first introduce
Leopard, a classification training pipeline to investigate the discriminative power of learned em-
beddings by training machine learning classifiers to predict correct patches. The experimental
results demonstrate the potential of Leopard to reason about patch correctness based on represen-
tation learning models and supervised learning algorithms, e.g., BERT associated with XGBoost
on 2,147 labeled patches achieves an AUC value of about 0.803, outperforming state-of-the-art
APCA techniques PATCH-SIM. They then introduce Panther, an upgraded version of Leopard to
explore the combination of the learned embeddings and the engineered features to improve the
performance of identifying patch correctness with more accurate classification. Panther is proven
to outperform Leopard with higher scores in terms of AUC, +Recall and -Recall by combining deep
learned embeddings and engineered features.
In 2021, different from Tian et al. [179] focusing on code representation, Ye et al. [224] propose
ODS, a learning-based approach to identify overfitting patches based on hand-crafted features
and supervised learning. ODS first defines and extracts a set of 202 static code features from the
AST to represent a candidate patch. ODS then adopts gradient boosting with the captured code
features and patch correctness labels to train a classifier for patch correctness classification. They
conduct on three benchmarks (i.e., Defects4J, Bugs.jar and Bears) and the results show that ODS
achieves an accuracy of 71.9% in detecting overfitting patches from 26 projects, and outperforms
other state-of-the-art techniques, e.g., PATCH-SIM [212].
Despite promising, Tian et al. [179] only focus on the buggy and patched statements while
ignoring the surrounding context information. In 2021, Lin et al. [102] propose CACHE, a context-
aware code change embedding technique for the patch correctness task. CACHE leverages context
information of unchanged code and captures the code structure information with the AST path
technique. CACHE then trains a deep learning-based classifier to predict the correctness of the
patch based on several pre-defined heuristics. The experimental results on three benchmarks
(i.e., Defects4J [76], ManySStuBs4J [78] and RepairThemAll [43]) show that CACHE achieves
significantly better performance than both previous representation learning techniques (i.e., Tian et
al. [179]) and traditional APCA techniques (e.g., dynamic-based PATCH-SIM [212] and static-based
Anti-patterns [174]).

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:23

In 2022, unlike previous studies [179, 224] only considering buggy and patched code snippets,
Tian et al. [178] introduce BATS, an unsupervised learning-based approach to predict patch correct-
ness based on failing test specifications. BATS first constructs a search space of historical patches
with failing test cases. Given a plausible patch, BATS identifies similar failing test cases in the
search space. BATS then calculates the similarity of historical patches and the plausible patch
based on the failing test cases. The plausible patch is predicted as correct if the similarity score is
larger than a predefined threshold; otherwise, it is predicted as incorrect. After collecting plausible
patches from 32 APR tools to construct a large dataset, they evaluate the performance of BATS on
Defects4J benchmarks with some standard classification metrics (e.g., recall). BATS outperforms
existing techniques in identifying correct patches and filtering out incorrect patches.
Besides, Tian et al. [181] attempt to formulate the patch correctness assessment problem as a
question-answering problem, which can assess the semantic correlation between a bug report
(question) and a patch description (answer). They introduce QUATRAIN, a supervised learning
approach that exploits a deep NLP model to predict patch correctness based on the relatedness
of a bug report with a patch description. QUATRAIN first mines bug reports for bug datasets
automatically and generates patch descriptions by existing commit message generation models.
QUATRAIN then leverages an NLP model to capture the semantic correlation between bug reports
and patch descriptions. They evaluate QUATRAIN on a large dataset of 9135 patches from three
Java datasets (i.e., Defects4j, Bugs.jar, and Bears). The results demonstrate that QUATRAIN achieves
comparable or better performance against other state-of-the-art dynamic and static techniques,
such as PATCH-SIM [212] and BATS [178]. Besides, QUATRAIN is proven practical in learning the
relationship between bug reports and code change descriptions for the patch prediction task.
Different from most existing studies focusing on Java programs, Yan et al. [216] propose Crex to
predict patch correctness in C programs based on execution semantics. They first leverage transfer
learning to extract semantics from micro-traces in buggy C code on the function level. They then
perform semantic similarity computation to denote patch correctness. They evaluate Crex on a set
of 212 patches generated by the CoCoNut APR tool on CodeFlaws programs. The experimental
results indicate that Crex can achieve high precision and recall in predicting patch correctness.
At the same time, considering that previous studies [179, 224] training a patch prediction classifier
with static features (e.g., code representation or hand-crafted features), Ghanbari et al. [55] propose
Shibboleth, a hybrid learning-based technique by considering static and dynamic measures from
both production and test code to assess the correctness of the patches. Shibboleth measures
the impact of the patches by static syntactic feature (i.e., token-level textual similarity), dynamic
semantic feature (i.e., execution traces similarity) on production code, and code coverage on test code
(i.e., branch coverage of the passing test cases). Shibboleth then assesses the correctness of patches
via both ranking (i.e., prioritizing the patches that are more likely to be correct before the ones that
are less likely to be correct) and classification (i.e., categorizing patches into two classes of likely
correct and likely incorrect) modes. The experimental results show that Shibboleth outperforms
existing patch ranking (e.g., an Ochiai-based sterategy [204]) and classification techniques, such as
ODS [224] and PATCH-SIM [212].

✎ Summary ▶ The overfitting issue has become a key focus in the field of APR, which has led
to the emergence and rapid development of recent APCA techniques. DL techniques have been
gradually used to predict the correctness of patches by learning features from historical corpora.
Compared to traditional dynamic and static APCA, learning-based APCA has shown impressive
performance in prediction accuracy and recall. We provide a summary of the existing learning-
based APCA techniques in Table 1. In the literature, most existing APCA techniques employ

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:24 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

a two-component pipeline, i.e., the feature extractor and the classifier. The former extracts
the features from the source code of patches, e.g., hand-crafted features, static representation
features and dynamic execution features, while the latter trains a classifier to perform binary
prediction, e.g., Random Forest and Decision tree. Despite increasing research efforts being put
into this phase and encouraging progress being made, the problem of patch overfitting still
hinders the application and deployment of repair techniques in practice. Therefore, the APR
community needs more advanced APCA techniques to improve the correctness of returned
patches, e.g., patch-aware feature extraction and more powerful pre-trained models. ◀

4.8 State-of-the-Arts
In the learning-based APR field, semantic error (i.e., test-triggering error) has attracted considerable
attention from researchers, which is the most general application of repair techniques discussed in
the previous sections. A living review [136] summarizes and categorizes existing APR techniques
into different repair scenarios during software development, including static errors, concurrency
errors, etc. In this section, we attempt to summarize existing representative learning-based APR
techniques across different scenarios where learning-based APR is most applied.
Table 2 presents a summary of existing representative learning-based APR techniques. The first
and second columns list the investigated repair techniques and the time when these techniques are
presented. The third and fourth columns list the targeted bug types and programming languages.
The fifth column lists the adopted fault localization technique. The sixth, seventh, and eighth
columns list the detailed data pre-processing methods, i.e., code context, code abstraction and code
tokenization. The ninth and tenth columns list the detailed code representation and employed
models. The last column lists the employed patch ranking strategy. In the following, we discuss
these learning-based APR techniques according to the repair scenarios.
4.8.1 Semantic error repair. Semantic errors usually refer to any case where the actual program
behavior is not expected by developers, and can be detected by functional test cases. Considering
that the vast majority of existing learning-based studies are concentrated in this field of semantic
error, we group them based on the form of code representation. In the following, we discuss and
summarize existing individual learning-based APR techniques that focus on semantic bugs in detail.
❶ Sequence-based Approaches.
As early as 2019, Tufano et al. [183] conduct this first attempt to investigate the ability of NMT
models to learn code changes during pull requests. They first mine pull requests from three large
Gerrit repositories and extract the method pairs before and after the pull requests, where each pair
serves as an example of a meaningful change. They then map the identifiers and literals in the
source code to specific IDs (i.e., code abstraction) to reduce the vocabulary size. Finally, they train
NMT models to translate the method before the pull request into the one after the pull request, to
emulate the actual code change. The experimental results show that NMT models can generate the
same patches for 36% pull requests Overall, this study demonstrates the potential of NMT models
in learning a wide variety of meaningful code changes, especially refactorings and bug-fixing
activities. Further, Tufano et al. [184] perform an empirical study to investigate the potential of
NMT models in generating bug-fixing patches in the wild, which is discussed in Section 6.3.
At the same time, Chen et al. [27] propose SequenceR, an end-to-end approach based on sequence-
to-sequence learning. They combine LSTM encoder-decoder architecture with a copy mechanism
to address the problem of a large vocabulary. First, they apply state-of-the-art fault localization
techniques to identify the buggy method and the suspicious buggy lines. Then, they perform a
novel buggy context abstraction process that intelligently organizes the fault localization data
into a suitable representation for the deep learning model. Finally, SequenceR generates multiple

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
Table 2. A summary and comparison of representative learning-based APR approaches

Year Technique Type Language Localization Abstraction Context Tokenization Representation Model Ranking
2016 Bhatia et al. [18] Syntax Python Perfect No Method word token RNN N.A.
2017 Deepfix [58] Syntax C SD No Method N.A. token GRU N.A.
2017 Wang et al. [191] Semantic C N.A. No Method N.A. token RNN N.A.
2017 VuRLE [116] Vulnerability Java SD Yes Statement N.A. graph N.A. N.A.
2018 Harer et al. [62] Vulnerability C,C++ N.A. No Method N.A. token GAN N.A.
2018 TRACER [6] Syntax C SD Yes Method N.A. token RNN beam search
2018 Santos et al. [167] Syntax Java SD Yes Method N.A. token LSTM patch re-ranking
2018 Bhatia et al. [17] Syntax Python N.A. No Method N.A. token RNN patch re-ranking
2018 Sarfgen [192] Syntax C N.A. No Method N.A. tree N.A. patch filtering & re-ranking
2019 SequenceR [27] Semantic Java Perfect Yes Class word token LSTM beam search
2019 Codit [23] Syntax Java Perfect Yes Method N.A. tree Tree-LSTM beam search
2019 Tufano et al. [183] Semantic Java N.A. Yes Method word token RNN beam search
2019 Tufano et al. [184] Semantic Java Perfect perfect Method word token RNN RNN
2019 Chen et al. [27] Semantic Java N.A. No Class N.A. token RNN N.A.
2019 DeepDelta [132] Syntax Java Perfect Yes Method N.A. tree LSTM beam search
2019 RLAssitst [57] Syntax C SD No Method N.A. token LSTM N.A.
2020 CoCoNut [115] Semantic Java,C,Python,JS Perfect Yes Method word token FConv beam search
A Survey of Learning-based Automated Program Repair

2020 DLFix [98] Semantic Java SBFL Yes Method word tree Tree-LSTM patch filtering & re-ranking
2020 DrRepair [222] Syntax C,C++ SD No Method N.A. graph LSTM N.A.
2020 Hoppity [39] Semantic JS SD No Statement N.A. graph LSTM beam search
2020 Yang et al. [219] Syntax C SD N.A. Method subword token SeqGAN patch re-ranking
2020 GGF [205] Syntax C SD No Method N.A. token,graph GGNN N.A.
2021 CURE [73] Semantic Java Perfect No Method subword token GPT code-aware beam search
2021 Recoder [242] Syntax Java SBFL,Perfect No Method word graph Tree-LSTM beam search
2021 TFix [15] Semantic JS Perfect No Statement subword token T5 beam search
2021 GrasP [175] Semantic Java Perfect No Method word graph RNN,GNN beam search
2021 SampleFix [60] Syntax C SD No Method N.A. token LSTM beam search
2022 CIRCLE [228] Semantic Java,C,JS,Python Perfect No Method subword token T5 beam search
2022 DEAR [99] Semantic Java SBFL Yes Statement word tree Tree-LSTM patch filtering & re-ranking
2022 Graphix [142] Semantic Java Perfect Yes Method N.A. graph,tree Tree-LSTM N.A.
2022 SelfAPR [226] Semantic Java Perfect No Method subword token Transformer beam search
2022 VRepair [28] Vulnerability C Perfect No Method word token Transformer beam search
2022 SeqTrans [31] Vulnerability Java Perfect Yes Statement subword token Transformer beam search
2022 AlphaRepair [209] Semantic Java,Python Perfect No Class subword token CodeBERT CodeBERT re-ranking
2022 VulRepair [51] Vulnerability C Perfect No Method subword token T5 beam search
2022 Bug-Transformer [221] Semantic Java Perfect Yes Method subword token Transformer beam search
2022 SPVF [241] Vulnerability C++,C,Python Perfect No Method N.A. tree Transformer beam search,patch filtering
2022 SYNSHINE [4] Syntax Java SD Yes Class subword token Transformer N.A.
2022 MMAPR [231] Semantic,Syntax Python Perfect No Class subword token Codex N.A.
2022 RING [75] Syntax Python,JS,C SD No Method subword token Codex patch re-ranking
2022 RewardRepair [227] Semantic Java SBFL,Perfect No Statement subword token Transformer beam search
2021 BIFI [223] Syntax Python,C N.A. No Method N.A. token,graph LSTM beam search

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:25
1:26 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

patches for the buggy code. Although their approach can only be applied to single-line buggy code,
this model outperforms the APR tool of Tufano et al. on Defects4J benchmarks. Moreover, they
prove that the copy mechanism can improve the accuracy of generated patches.
However, Tufano et al. [184] and SequenceR [27] represent both the buggy line and its context as
one input for NMT models, making it difficult to extract long-term relations between code tokens. In
2020, Lutellier et al. [115] propose CoCoNut, a context-aware NMT approach that separately inputs
the buggy line and method context. In particular, CoCoNut applies CNN (i.e., FConv architecture) in
the context-aware NMT architecture, which is able to better extract hierarchical features of source
code compared with LSTM used in Tufano et al. [184] and SequenceR [27]. Besides, CoCoNuT
trains multiple NMT models to capture the diversity of bug fixes with ensemble learning. CoCoNut
is evaluated on six well-known benchmarks across four programming languages, i.e., Defects4J of
Java, QuixBugs of Java, CodeFlaws of C, ManyBugs of C, QuixBugs of Python and BugAID of JS.
The experimental results show that CoCoNut is capable of fixing 509 bugs on the six benchmarks,
309 of which have not been fixed by previous APR tools, such as DLFix, Prophet and TBar. At
the same time, different CoConut only considering patch generation, Yang et al. [219] propose
a sequence-basd technique for both fault localization and patch generation. They first employ a
CNN-based autoencoder to rank suspicious buggy code by extracting various information from bug
reports and the program source code. They then convert the program source code into multiple
lines with tokens and apply the SeqGAN model to generate the candidate patches.
In 2021, on top of CoCoNut, Jiang et al. [73] propose CURE, an NMT-based APR technique to
fix Java bugs. Compared with CoCoNut, the novelty of CURE mainly coms from three aspects.
First, to better learn developer-like source code, CURE pre-trains a programming language model
on a large corpus and combines it with the CoCoNut context-aware architecture. Second, CURE
designs a code-aware beam search strategy to avoid uncompilable patches during patch generation.
Third, to address the OOV problem, CURE introduces a new sub-word tokenization technique to
tokenize compound and rare words. The experimental results demonstrate that CURE is able to fix
57 Defects4J bugs and 26 QuixBugs bugs, outperforming existing learning-based APR approaches
under different beam search sizes, such as SequenceR and CoCoNut.
In 2022, unlike Tufano et al. [184] without considering semantic and lexical scope information
of code tokens, Yao et al. [221] propose Bug-Transformer, a transformer-based APR technique to
fix buggy code snippets. It is equipped with a token pair encoding (TPE) algorithm and a rename
mechanism to preserve crucial information. First, Bug-Transformer designs a TPE algorithm to
reduce vocabulary size by compressing code structure while preserving structural and semantic
information. Second, Bug-Transformer employs a rename mechanism to abstract code tokens (i.e.,
identifiers and literals) with consideration of their semantic and lexical scope knowledge. Third,
Bug-Transformer trains a transformer-based model to learn the structural and semantic information
of code snippets and predicts patches automatically. The experimental results on BPF [184] datasets
show that Bug-Transformer outperforms baseline models, e.g., Tufano et al. [184].
Existing learning-based APR techniques are usually limited by the generation of lots of low-
quality (e.g., non-compilable) patches, due to the employed static loss function based on token
similarity. In 2022, Ye et al. [227] introduce RewardRepair based on a mixed loss function that
considers program compilation and test execution information. In particular, RewardRepair defines
a discriminator to discriminate good patches from low-quality ones based on dynamic execution
feedback, rather than static token similarity between the generated patch and the human-written
ground truth patch. The discriminator computes a reward value to gauge the patch quality, and
this value is subsequently utilized to update the weights of the patch generation model during
the backpropagation process. A higher reward indicates a higher quality of generated patch that
is compilable and passes the test cases, while a lower reward suggests potentially unsatisfactory

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:27

patch quality, such as a non-compilable patch. Thanks to the compilation and test execution results
during training, RewardRepair is able to fix 207 bugs on four benchmarks, i.e., Defects4J-v1.2,
Defects4J-v2.0, Bugs.jar and QuixBugs, 121 of which are not repaired by previous approaches, e.g.,
DLFix, CoCoNut and CURE. More importantly, RewardRepair achieves a compilable rate of up to
45.3% among Top-30 candidate patches, an improvement over the 39% by CURE, demonstrating the
potential to generate high-quality patches.
Besides, previous learning-based APR approaches are dominantly founded on supervised training
with massive open-source code repositories, resulting in a lack of project-specific knowledge. In
parallel with RewardRepair, Ye et al. [226] also propose, SelfAPR, a self-supervised training approach
with test execution diagnostics based on a transformer neural network. SelfAPR consists of two
well-designed components, i.e., training sample generator and neural network optimization. The
first part generates perturbed programs with a perturbing model and tests it to capture compile
errors and execute failures information. The second part is fed with the previous information and
outputs 𝑛 best candidate patches with beam search. The experimental results show that SelfAPR
is capable of repairing 65bugs from Defects4J-v.12 and 45 bugs from Defects4J-v2.0, 10 of which
have never been repaired by the previous supervised neural repair models, such as CoCoNut [115]
and CURE [73]. More importantly, SelfAPR highlights the potential and power of self-supervised
training and project-specific knowledge in the learning-based APR community.
❷ Tree-based Approaches.
As early as 2020, Chakraborty et al. [23] propose a tree-based neural network, CODIT to learn
code changes by encoding code structures from the wild and generate patches for software bugs.
CODIT transforms the correct (or buggy) code snippet into the parse tree and generates the deleted
(or added) subtree. CODIT then predicts the structural changes using a tree-based translation model
among the subtrees and employs token names to concrete the structure using a token generation
model. The former tree-based model takes the previous code tree structure and generates a new
tree with the maximum likelihood, while the latter token generation model takes tokens and
types of tokens in the code and generates new tokens with the help of LSTM. To evaluate CODIT,
Chakraborty et al. [23] construct a real-world bug-fixing benchmark from 48 open-source projects
and also employ two well-known benchmarks, Pull-Request-Data [27] and Defects4J [76]. The
experimental results on these three benchmarks illustrate CODIT outperforms existing sequence-
based models (e.g., SequenceR [27]), highlighting the potential of the tree-based models in APR.
Despite the tree structure being considered, CODIT mainly employs a sequence-to-sequence
NMT model to learn code changes from ASTs, which can still be regarded as an NMT task. In 2021,
Li et al. [98] propose DLFix, a two-layer tree-based APR model to learn code transformations on the
AST level. In particular, DLFix first employs a tree-based RNN model to learn the contexts of bug
fixes, which is passed to another tree-based RNN model to learn the bug-fixing code transformations.
Besides, a CNN-based classification model is built to re-rank possible patches. The experimental
results on three benchmarks (i.e., Defects4J, Bugs.jar and BigFix) demonstrate that DLFix is able
to outperform previous learning-based APR approaches (e.g., Tufano et al. [184] and achieve
comparable performance against pattern-based APR approaches (e.g., Tbar [107]). Overall, DLFix
demonstrates that it is promising and valuable to treat the APR problem as a code transformation
learning task over the tree structure rather than an NMT task over code tokens.
In 2022, considering that DLFix is able to only fix individual statements at a time, Li et al. [99]
propose DEAR, a learning-based approach for multi-hunk multi-statement fixes. On top of DLFix,
DEAR is designed with three key contributions. First, DEAR introduces an FL technique to acquire
multi-hunks that need to be fixed together based on traditional SBFL and data flow analysis. Second,
DEAR develops a compositional approach to generate multi-hunk, multi-statement patches by a
divideand-conquer strategy to learn each subtree transformation in ASTs. Third, DEAR improves

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:28 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

the mode architecture of DLFix by designing a two-tier tree-based LSTM with an attention layer
learn proper code transformations for fixing multiple statements. The experimental results on three
benchmarks (i.e., Defects4J, BigFix, and CPatMiner) demonstrate that DEAR fixes 164 more bugs
than DLFix, 61 of which are multi-hunk/multi-statement bugs.
❸ Graph-based Approaches.
As early as 2020, Dinella et al. [39] introduce HOPPITY, an end-to-end graph transformation
learning-based approach for detecting and fixing bugs in JS programs. HOPPITY first represents
the buggy program as a graph representation by paring source code into an AST and connecting
the leaf nodes. HOPPITY then performs graph transformation to generate patches by making a
sequence of predictions including the position of bug nodes and corresponding graph edits. The
experimental results on a self-constructed benchmark show that HOPPITY outperforms existing
repair approaches (e.g., SequenceR [27]) with or without the perfect FL results.
In parallel with HOPPITY, Yasunaga et al. [222] propose DrRepair to repair C/C++ bugs based
on a program feedback graph. They parse the buggy source code into a joint graph representation
with diagnostic feedback that captures the semantic structure. The graph representation takes all
identifiers in the source code and any symbols in the diagnostic feedback as nodes, and connects
the same symbols as edges. They then design a GNN model for learning the graph representation.
Besides, they apply a self-supervised learning paradigm that can generate extra patches by cor-
rupting unlabeled programs. They also discover that pre-training on unlabeled programs improves
accuracy. The experimental results on DeepFix and SPoC datasets demonstrate that DrRepair
outperforms three compared APR approaches, i.e., DeepFix [58], RLAssist [57] and SampleFix [60].
Inspired by HOPPITY, Nguyen et al. [142] propose GRAPHIX, a graph edit model that is pre-
trained with deleted sub-tree reconstruction for program repair. On top of HOPPITY, GRAPHIX
enhances the encoder with multiple graph heads to capture diverse aspects of hierarchical code
structures. Besides, GRAPHIX introduces a novel pre-training task (i.e., deleted sub-tree recon-
struction) tolearn implicit program structures from unlabeled data. Finally, GRAPHIX is trained
with both abstracted and concrete code to learn both structural and semantic code patterns. The
experimental results GRAPHIX is evaluated on the Java benchmark from Tufano et al. [184] and
it turns out that GRAPHIX is as competitive as large pre-trained models (e.g., PLBART [2] and
CodeT5 [197]) and outperforms the previous learning-based APR approaches (e.g., HOPPITY [39]
and Tufano et al. [184]).
In 2021, Zhu et al. [242] propose Recoder, a syntax-guided edit decoder that uses a novel
provider/decider architecture based on an AST-based graph. Recoder takes a buggy statement and
its context as input and generates edits as output by (1) embedding the buggy statement and its
context by a code reader; (2) embedding the partial AST of the edits by an AST reader; (3) embed-
ding a path from the root node to a non-terminal node by a tree path reader; and (4) producing a
probability of each choice for expanding the non-terminal node based on previous embeddings. In
particular, Recoder treats an AST as a directional graph, with its nodes representing AST nodes
and its edges connecting a node to its children and its immediate left sibling. The AST-based graph
is then embedded in the form of an adjacency matrix using a Graph Neural Network (GNN) layer.
The authors evaluate Recoder on four widely adopted Java benchmarks: Defects4J v1.2 with 395
bugs, Defects4J v2.0 with 420 bugs, QuixBugs with 40 bugs, and IntroClassJava with 297 bugs. The
results show that Recoder correctly repairs 53 bugs on Defects4J v1.2, 11 bugs more than TBar [107]
and 19 bugs more than SimFix [70]. Besides, Recoder correctly repairs 19 bugs on Defects4J v2.0, 35
bugs on IntroClassJava and 17 bugs QuixBugs, respectively. More importantly, Recoder is the first
learning-based APR technique that outperforms existing traditional techniques (e.g., TBar [107]
and SimFix [70]) on these four Java benchmarks.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:29

Meanwhile, in 2021, Tang et al. [175] propose Grasp, an end-to-end graph-to-sequence learning-
based approach for repairing buggy Java programs. Grasp represents the source code as a graph to
retain structural information and applies a graph-to-sequence model to capture information from
the graph, overcoming the problem of missing information. The experimental results on Defects4J
show that GrasP is able to generate compilable patches for 75 bugs, 34 of which are correct. Besides,
GrasP achieves better performance than the baseline approach SequenceR with two more correct
patches and 11 more plausible patches.
In 2022, Xu et al. [214] introduce M3V, a new multi-modal multi-view graph-based context
embedding approach to predict repair operators for buggy Java code. Different from previous
studies performing patch generation and validation, M3V focuses on repair operator selection.
M3V first applies a GNN with multi-view graph-based context structure embedding to capture
data and control dependencies. M3V also employs a tree-LSTM model with tree-based context
signature embedding for capturing high-level semantics. The evaluation experiment is conducted
on 20 open-source Java projects with two common types of bugs: null pointer exceptions and index
out of bounds. The results show that M3V is effective in predicting repair operators, achieving
71.45%∼75.60% accuracy on both types of bugs, highlighting the future of context embedding in
APR.

4.8.2 Syntax error repair. Most existing learning-based APR techniques usually expect that the
programs under repair are syntactically correct and these techniques are not applicable for syntax
errors. Novice programmers are more likely to make syntax errors (e.g., replacing a “∗” with an
“𝑥”) that make compilers fail. Previous studies have indicated the long-term challenge from a wide
range of syntax mistakes, consuming a lot of time for novices and their instructors. Recently, the
release of high-quality novice error data and the emergence of trustworthy deep learning models
have raised the possibility of designing and training DL models to fix syntax errors automatically.
Now, we list and summarize the recent learning-based APR studies that focus on syntax errors
as follows.
As early as 2017, Gupta et al. [58] propose a sequence-based approach, DeepFix to fix common
programming errors. DeepFix is regarded as the first end-to-end solution using a sequence-to-
sequence model for localizing and fixing errors. In particular, DeepFix applies an RNN-based
encoder-decoder with gated recurrent units (GRUs) to serve as the Seq2Seq model. Beside, DeepFix
attempts to fix multiple errors iteratively by repairing one bug each time and using an oracle to
decide whether to accept the patch or not. The evaluation experiment is conducted on 6971 C
erroneous programs written by students for 93 different programming tasks in an introductory
programming course. More importantly, as the pioneering end-to-end sequence-based approach in
this field, DeepFix demonstrates the potential of Seq2Seq models in fixing syntax errors and serves
as a catalyst for follow-up works, detailed in the following paragraphs.
In 2018, different from DeepFix focusing on C program, Santos et al. [167] propose to leverage
language models for repairing syntax errors in Java programs. They compare n-gram with LSTM
models trained on a large corpus of Java projects from GitHub about localizing bugs and repairing
them. Besides, their methodology does not rely on buggy code from the same domain as the
training data. Evaluation results show that their improved LSTM configuration outperforms n-
gram considerably. Thus, this tool can localize and suggest corrections for syntax errors, and it is
especially useful to novice programmers.
Meanwhile, Bhatia et al. [17] propose a Neuro-symbolic approach to repair programs committed
by students based on neural networks with constraint-based reasoning. They first apply an RNN
to repair syntax errors and then formalize the problem of syntax corrections in programs as a
token sequence prediction problem. They further leverage the constraint-based technique to find

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:30 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

minimal repairs for the semantic correctness of syntactically-fixed programs. This approach is
then evaluated on a Python dataset and results show that their approach outperforms the n-gram
baseline model, demonstrating the potential of RNNs with constraint-based reasoning in repair
syntax errors.
In 2019, unlike DeepFix targeting syntax errors in C from small student programs, Mesbah et
al. [132] propose DeepDelta to repair the most costly classes of build-time compilation failures in
Java programs from real-world developer-written programs. They perform a large-scale study of
compilation errors and collect a large dataset from logs in Google. They further classify different
compilation errors and target repairing these errors following specific patterns learned from the AST
diff files in the dataset. DeepDelta is then evaluated on two most prevalent and costly classes of Java
compilation errors: missing symbols and mismatched method signatures. The experimental results
demonstrate that DeepDelta generates over half of the correct patches for unseen compilation
errors.
Meanwhile, different from DeepFix employing fully supervised learning, Gupta et al. [57] propose
RLAssist, a deep reinforcement learning-based technique to address the problem of syntactic error
repair in student programs. RLAssist is able to learn syntactic error repair directly from raw source
code through self-exploration,i.e., without any supervision. They leverage reinforcement learning
and train the model using Asynchronous Advantage Actor-Critic (A3C) [133]. A3C uses multiple
asynchronous parallel actor-learner threads to update a shared model, stabilizing the learning
process by reducing the correlation of an agent’s experience. After they evaluate RLAssist on the C
benchmark from DeepFix [58], results show that this model outperforms the APR tool DeepFix [58]
without using any labeled data for training, showing the potential to help novice programmers.
In 2020, unlike most existing techniques that use Seq2Seq models, Wu et al. [205] propose GGF,
a graph-based eep supervised learning model to localize and fix syntax errors. They first parse
the erroneous code into ASTs. Since the parser may crash in the parsing process due to syntax
errors, they create a so-called sub-AST and build the graph based on it. To tackle the problem
of isolated points and some error edges in the generated graph, they treat the code snippet as a
mixture of token sequences and graphs. Thus, GGF utilizes a mixture of the GRU and the GGNN as
the encoder module and a token replacement mechanism as the decoder module. The evaluation
shows that GGF is able to fix 50.12% of the erroneous code, outperforming DeepFix [58]. Besides,
the ablation study proves that the architecture used in GGF is quite helpful for the programming
language syntax error correction task.
However, most of the existing APR techniques employ supervised learning to train repair models
with labeled bug-fixing datasets, and their performance may be limited by the quantity and quality
of labeled data. In 2021, Yasunaga et al. [223] propose an unsupervised learning-based approach,
Break-It-Fix-It (BIFI) to fix syntax errors. BIFI first uses a fixer to generate patched code for buggy
code and uses a critic to check the patched code. BIFI then trains a breaker with real-world bug-
fixing code pairs to generate more realistic buggy code. Different from previous approaches, BIFI is
capable of turning raw unlabeled data into usable paired data with the help of a critic, which is
then used to train the fixer continuously. The experimental results on both Python and C language
benchmarks show that BIFI outperforms the previous repair approach DeepFix [58].
At the same time, considering that previous approaches (e.g., DeepFix [58]) usually ignore the
true intend of the programmer during the patch generation process, Hajipour et al. [60] propose
SampleFix, an efficient method to fix common programming errors by learning the distribution over
potential patches. To encourage the model to generate diverse fixes even with a limited number of
samples, they propose a novel regularizer that aims to increase the distance between the two closest
candidate fixes. They prove that this approach is capable of generating multiple diverse fixes with
different functionalities for 65% of repaired programs. After evaluating the approach on real-world

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:31

datasets, they show that this approach outperforms previous APR tools such as DeepFix [58] and
RLAssist [57].

4.8.3 Security vulnerability repair. Software vulnerability generally refers to the security flaws in
the concrete implementation of hardware, software, or protocols. Malicious attackers can exploit
unresolved security vulnerabilities to get access to the system without authorization or even
paralyze the system. Such vulnerabilities open a range of threats to cyber security, resulting in
severe economic damage and fatal consequences. For example, the Log4Shell vulnerability (CVE-
2021-44228) from Apache Log4j library3 allows attackers to run arbitrary code on any affected
system4 and is widely recognized as the most severe vulnerability in the last decade (e.g., 93% of
the cloud enterprise environment are vulnerable to Log4Shell 5 ). Nowadays, the number of exposed
security vulnerabilities recorded by the National Vulnerability Database (NVD)6 has been increasing
at a striking speed, affecting millions of software systems annually.
However, it is incredibly time-consuming and labor-intensive for security experts to repair such
security vulnerabilities manually due to the strikingly increasing number of detected vulnerabilities
and the complexity of modern software systems [52, 240]. For example, previous studies report
that the average time for repairing severe vulnerabilities is 256 days7 and the life spans of 50% of
vulnerabilities even exceed 438 days [95]. It is incredibly time-critical to patch reported security
vulnerabilities, as a belated vulnerability repair could expose software systems to attack [100, 104],
posing enormous risks to millions of users around the globe and costing billions of dollars in
financial losses [86]. Given the potentially disastrous effect when software vulnerabilities are
exploited, a mass of learning-based studies has recently been conducted on automated software
vulnerability repair [28, 51].
We list and summarize the recent learning-based vulnerability repair studies in detail as follows.
As early as 2017, Ma et al. [116] introduce a learning-based vulnerability repair tool, VuRLE, to
automatically detect and repair vulnerabilities in Java programs. In the learning phase, it generates
templates by analyzing edits from repair examples. First, it extracts edit blocks by performing AST
diff. Then, it compares each edit block with the other edit blocks, and produces groups of similar edit
blocks. Finally, for each edit group, VuRLE generates a repair template for each pair of edit blocks
that are adjacent to each other. In the repairing phase, VuRLE detects and repairs vulnerabilities
by selecting the most appropriate template. It applies repair templates in order of their matching
score until it detects no redundant code. Evaluation results on real-world vulnerabilities show that
VuRLE successfully repaired 101 vulnerabilities, achieving an accuracy of 55.19%.
In 2018, to get rid of the previous work’s dependence on labeled datasets, Harer et al. [62] propose
a GAN-based approach to automatically repair security vulnerabilities based on adversarial learning
without requiring labeled code samples. They first apply an NMT model as the generator and
employ two novel generator loss functions instead of the traditional negative likelihood loss. They
then design a discriminator to distinguish the output generated by the NMT model and oracle
output. This approach can be used in the absence of paired bug-fixing datasets, thus reducing the
requirements of datasets. The authors evaluate the proposed approach on the SATE IV dataset and
prove the promising results in fixing vulnerabilities compared with the original Seq2Seq model.

3 https://logging.apache.org/log4j/2.x/
4 https://www.ftc.gov/policy/advocacy-research/tech-at-ftc/2022/01/ftc-warns-companies-remediate-log4j-security-

vulnerability
5 https://www.wiz.io/blog/10-days-later-enterprises\-halfway-through-patching-log4shell
6 https://www.nist.gov/
7 https://www.securitymagazine.com/articles/95929-average-time-to-fix-severe-vulnerabilities-is-256-days

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:32 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

Besides, the proposed approach is proven to be applicable to other repair tasks, such as grammatical
error correction.
In 2022, based on the transformer and transfer learning, Chen et al. [28] propose VRepair, a
learning-based approach to repair security vulnerabilities . VRepair is first trained on a large bug-
fixing dataset and is then transferred to a relatively small vulnerability-fixing dataset. VRepair uses
a transformer neural network model to generate potential patches that are likely to be correct based
on the training data. The results show that VRepair trained on a bug-fixing dataset already fix some
vulnerabilities. Besides, they demonstrate the knowledge learned from the program repair task can
be transferred to the vulnerability repair task. In particular, VRepair with the transfer learning
gains a better repair performance than that only trained on a vulnerability-fixing or bug-fixing
dataset.
Different from VRepair focusing on C code, Chi et al. [31] propose SeqTrans, a learning-based
approach to provide suggestions for automatically repairing Java vulnerability. SeqTrans first uses
Gumtree to search for differences between different commits and then traverses the whole AST
to label the variables. SeqTrans then traverses up the leaf nodes, localizes the statement with
vulnerability and generates code change pairs, which is fed into the NMT model. As SeqTrans
requires a massive amount of training data, SeqTrans is first trained on a bug-fixing dataset (i.e.,
source domain) and fine-tuned on a vulnerability-fixing dataset (i.e., target domain). SeqTrans is
proven to achieve better repair accuracy than existing techniques (e.g., SequenceR) and performs
very well in certain kinds of vulnerabilities (e.g., CWE-287).
However, previous approaches [28, 31] usually only consider source code while ignoring the
valuable vulnerability characteristics. Zhou et al. [241] propose an attention-based approach SFVP
for automatically fixing vulnerabilities by capturing the security property. SPVF first extracts
the security properties from NL descriptions of the vulnerabilities (e.g., CWE category). SPVF
then designs the pointer generator network to combine the AST representation and the security
properties. The authors evaluate SPVF on two public C/C++ and Python vulnerability-fixing datasets
and results show that it outperforms existing vulnerability repair technique SeqTrans [31].

4.8.4 Programming error repair. With the emergence of programming competition websites (e.g.,
LeetCode), developers frequently submit solutions, resulting in a vast amount of source code. A
portion of solutions contain flaws that prevent developers from solving the programming challenges
successfully. These programming flaws are usually simple types of errors, e.g., fail to compile and
execute due to syntax errors, or pass the corresponding test cases due to semantic errors. In the
following, we discuss and summarize existing individual learning-based APR techniques that focus
on programming errors in detail.
As early as 2016, since previous works fail to parse ASTs for student programs with syntax errors,
Bhatia et al. [18] present a technique to apply RNN for repairing syntax errors in student programs.
They first train the model with syntactically correct programs. Then, they query the trained model
with student submissions with syntax errors and feed the model with the prefix token sequence.
Finally, the model would predict suffix tokens and repair the syntax error. Evaluation on a dataset
obtained from a MOOC course shows that this approach outperforms the baseline models (e.g.,
RNN and LSTM with different configurations) and can provide automated feedback on syntax
errors for students.
In 2017, considering previous approaches focusing on static program representation, Wang et
al. [191] present dynamic program embeddings that learn from runtime execution traces to predict
error patterns that students would make in their online programming submissions. They define
three program embedding models: 1) variable trace model to obtain a sequence of variables;
2) state trace model to embed each program state as a numerical vector and feed all program

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:33

state embeddings as a sequence to another RNN encoder; 3) dependency enforcement model to

combine the advantages of the previous two approaches. They conduct experiments to prove htat
dynamic embeddings overcome critical problems with syntax-based program representations and
significantly outperform the syntactic program embeddings based on token sequences and AST.
At the same time, on top of dynamic program embeddings, Wang et al. [192] further propose
Sarfgen, a high-level data-driven framework to fix student-submitted programs for introductory
programming exercises. They develop novel program embeddings and the associated distance
metric to efficiently and precisely identify similar programs and compute program alignment. They
also conduct an extensive evaluation of Sarfgen on thousands of student submissions on 17 different
programming exercises from the Microsoft DEV204-.1x edX course and the Microsoft CodeHunt
platform. Results show that Sarfgen is effective and it improves existing systems automation,
capability, and scalability.
In 2018, with the aim of offering more informative error messages that would aid programmers
in easier diagnoses, Ahmed et al. [6] introduce TRACER to generate targeted repairs for novice
programmers in C programming courses. They leverage buggy student programs in Prutor and
conduct experiments on single-line and multi-line bugs. TRACER first localizes the buggy line, then
abstracts the program, and finally converts it into fixed code. Evaluation on the dataset collected
from IIT-K shows that TRACER achieves high accuracy up to 68% and outperforms DeepFix [58],
proving to be a student-friendliness repair tool.
In 2022, different from previous works using a basic transformer, Zhang et al. [231] propose
MMAPR, a pre-trained model-based repair approach to repair both semantic and syntactic bugs
in Python programming assignments. MMAPR applies a large language model trained on code
(i.e., Codex) for introductory Python programming assignments. In particular, MMAPR leverages
multimodal prompts, iterative querying, test-case-based few-shot selection, and program chunking
to repair bugs in students’ committed programs. The experimental results on 18 assignments
demonstrate MMAPR is able to outperform a transformer-based syntax repair tool BIFI [223], and
a re-factoring-based semantics repair tool Refactory [65].
In addition to the above-mentioned studies, the community has also seen an increasing number
of pre-trained model-based APR studies. Considering the growing popularity and potential impact
of pre-trained models on the learning-based APR community, we deserve a separate section to
thoroughly discuss such stat-of-the-arts for a more comprehensive insight, detailed in Section 5.

✎ Summary ▶ Overall, learning-based APR is generally applicable to different types of bugs

thanks to the end-to-end black-box NMT training, including semantic bugs, syntax bugs, and
security vulnerabilities. On top of encoder-decoder architecture, the APR problem can be
abstracted as an MMT task, which translates a buggy code snippet into a correct one. Thus,
researchers can train NMT models to learn code transformations automatically from a mass
of bug-fixing code pairs without considering the specific bug types. For example, CodeT5 is
fine-tuned to repair semantic bugs [208], syntax bugs [15], and security vulnerabilities [51],
respectively. Despite its high scalability, the learning-based APR is currently mostly focused on
the aforementioned typical domains. Future work can be conducted to explore the performance
of the learning-based APR in more repair scenarios, e.g., API misuse, detailed in Section 8. ◀

5 PRE-TRAINED MODEL-BASED REPAIR

Pre-trained models have significantly improved performance across a wide range of natural language
processing (NLP) and code-related tasks, such as machine translation, defect detection and code
classification [56, 113]. Typically, the models are pre-trained to derive generic vector representation
by self-supervised training on a large-scale unlabeled corpus and then are transferred to benefit

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:34 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

multiple downstream tasks by fine-tuning on a limited labeled corpus [37]. The application of
existing pre-trained models to program repair is usually divided into two categories: universal and
specific pre-trained model-based APR techniques. The former aims to propose universal pre-trained
models for multiple code-related tasks (including program repair), while the latter only focuses on
program repair by designing a novel APR technique based on pre-trained models.

5.1 Universal Pre-trained Model-based APR Techniques

Existing pre-trained models generally adopt the encoder-decoder transformer architecture, which
can be classified into three types: encoder-only, decoder-only, and encoder-decoder models. Encoder-
only models (e.g., CodeBERT [49]) usually pre-train a bidirectional transformer where tokens can
attend to each other. Encoder-only models are good at understanding tasks (e.g., code search),
but their bidirectionality nature requires an additional decoder for generation tasks. Decoder-
only models (e.g., CodeGPT [22]) are pre-trained using unidirectional language modeling that
only allows tokens to attend to the previous tokens and themselves to predict the next token.
Decoder-only models are good at auto-regressive tasks like code completion, but the unidirectional
framework is sub-optimal for understanding tasks. Encoder-decoder models (e.g., CodeT5 [158])
often make use of denoising pre-training objectives that corrupt the source input and require
the decoder to recover them. Compared to encoder-only and decoder-only models that favor
understanding and auto-regressive tasks, encoder-decoder models can support generation tasks
like code summarization.
Inspired by the success of pre-trained models in NLP, many recent attempts have been adopted
to boost numerous code-related tasks (such as program repair) with pre-trained models (e.g.,
CodeBERT) [49, 56]. In the context of APR, an encoder stack takes a sequence of code tokens as
input to map a buggy code 𝑋𝑖 = [𝑥 1, . . . , 𝑥𝑛 ] into a fixed-length intermediate hidden state, while
the decoder stack takes the hidden state vector as an input to generate the output sequence of
tokens 𝑌𝑖 = [𝑦1, . . . , 𝑦𝑛 ]. Researchers treat the APR problem as a generation task, and consider
encoder-decoder or encoder-only (with an additional decoder) pre-trained models, which are usually
evaluated by the BFP dataset from Tufano et al. [184].
We summarize existing pre-trained models involving the program repair task as follows.
The most commonly used model type is the T5-like encoder-decoder architecture, which can
naturally support the program repair task as a code generation problem. For example, in 2021,
Wang et al. [197] present a pre-trained encoder-decoder model (i.e., CodeT5) that considers the code
token type information based on T5 architecture. CodeT5 employs a unified framework to support
code understanding (e.g., clone detection) and generation tasks (e.g., program repair) and allows for
multi-task learning. The most crucial feature of CodeT5 is that the code semantics of identifiers are
taken into consideration. Assigned by developers, identifiers often convey rich code semantics and
thus a novel identifier-aware objective is added to the training of CodeT5.
In 2022, Mastropaolo et al. [126] propose a pre-trained text-to-text transfer transformer (T5) to
address four code-related tasks, namely automatic bug fixing, injection of code mutants, generation
of assert statements in test methods, and code summarization. They apply BFP small and BFP
medium datasets to train and evaluate the bug-fixing task, and then compare other learning-
based APR tools on the same benchmark. Moreover, they have done single-task fine-tuning and
multi-task fine-tuning to fully evaluate the function of the pre-trained T5 model. Although multi-
task fine-tuning does not improve the result of code-related tasks, single-task fine-tuning does
prove that this model outperforms other tools (e.g., Tufano et al. [184]) on the same benchmarks.
Besides, Niu et al. [145] propose a Seq2Seq pre-trained model (i.e., SPT ) by three code-specific
tasks (code-AST prediction, masked sequence to sequence and method name generation) and
fine-tune on the generation tasks (i.e., code summarization, code completion, program repair and

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:35

code translation) and classification task (i.e., code search). Results show that SPT outperforms
CodeBERT [49], GraphCodeBERT [56] and Tufano et al. [184] on the bug-fixing datasets BFP-small
and BFP-medium. Unlike previous general-purpose pre-trained models considering various tasks,
Zhang et al. [232] propose CoditT5, a pre-trained language model only for code-related edit tasks.
CoditT5 is pre-trained on both program languages and natural language comments. Zhang et
al. fine-tune it for three down-streaming tasks: comment updating, bug fixing, and automatic
code review. For bug-fixing, they fine-tune it with Java datasets BFP small and BFP medium. The
evaluation shows that CoditT5 outperforms other APR tools like CodeT5 and PLBART on three
down-streaming tasks.
There also exist some pre-trained models with BERT-like encoder-only architecture, which
usually with an additional decoder ti support program repair. For example, in 2020, Feng et al. [49]
present a bimodal pre-trained model (i.e., CodeBERT ) for natural language and programming
language with a transformer-based architecture. CodeBERT utilizes two pre-training objectives
(i.e., masked language modeling and replaced token detection) to support both code search and
code documentation generation tasks. To support program repair task, Lu et al. [113] leverage
CodeBERT as the encoder, which is connected with a randomly initialized decoder. Besides, Guo et
al. [56] present the first structure-aware pre-trained model (i.e., GraphCodeBERT ) that learns code
representation from source code and data flow. Unlike existing models focusing on syntactic-
level information (e.g., AST), GraphCodeBERT takes semantic-level information of code (e.g., data
flow) for pre-training with a transformer-based architecture. The results on BFP datasets [184]
demonstrate the advantage of leveraging code structure information to repair software bugs.

5.2 Specific Pre-trained Model-based APR Techniques

In addition to those above-mentioned typical pre-trained models that involve program repair,
researchers have adopted pre-trained models to design novel APR techniques. Table 7 presents
existing pre-trained model-based APR techniques. We summarize existing APR studies that employ
pre-trained models as follows.

Table 3. A summary and comparison of existing pre-trained model-based APR techniques

Year Study Type Model Language

2021 TFix [15] Syntax T5 JS
2022 CIRCLE [228] Semantic T5 Java,C,JS,Python
2022 VulRepair [51] Vulnerability CodeT5 C
2022 AlphaRepair [209] Semantic CodeBERT Java,Python
2022 SYNSHINE [4] Syntax RoBERTa Java

In 2021, inspired by the pre-trained T5 model [158] that converts all text-based language problems
into a text-to-text format in the NLP field, Berabi et al. [15] formulates the problem of fixing coding
errors as a text-to-text prediction task and propose TFix, a T5-based approach to fix syntax errors.
They fine-tune a pre-trained T5 model to generate JavaScript fixes on datasets extracted from GitHub
by themselves. By feeding the model with line context and fine-tuning it according to various
error types, they obtain multiple fine-tuned T5 models. The evaluation shows that TFix is able to
generate 67% of correct patches, significantly outperforming SequenceR [27] and CoCoNut [115].
In 2022, to address the issue of previous works not performing well on large programs, Ahmed et
al. [4] propose SynShine, a learning-based approach to fix syntax errors in Java programs by
innovatively using the diagnostics from a compiler and exploiting the ability to pre-train model.
SynShine first applies a three-stage syntax repair workflow, i.e., BlockFix for recovering block

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:36 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

structure, LineFix for fixing line errors, and UnkFix for recovering unknown tokens. SynShine then
leverages RoBERTa-based pre-training and information from compiler errors to generate fixes using
multi-label classification. The experimental results on the Blackbox dataset show that SynShine
outperforms previous repair approaches, e.g., DeepFix [58] and SequenceR [27] on different token
ranges. Importantly, they have also integrated SynShine with the VSCode IDE for public usage,
showing the practical value in a real-world development environment.
Considering that previous NMT-based repair approaches fail to consider DL descriptions about
the code context, Chakraborty et al. [24] present MODIT, a multi-modal pre-trained model-based
approach, to automatically generate fixes for buggy code. They leverage three modalities of infor-
mation during training: edit location, edit code context, and commit messages (i.e., natural language
guidance from the developer). They then employ the pre-trained PLBART model as the as the
starting point to train MODIT. The experimental results show that MODIT generates 29.99% correct
patches for the BFP-small dataset [184], outperforming CodeBERT by 15.12%, GraphCodeBERT by
16.82% and CodeGPT 5.49%. Similarly, 23.02% of patches generated by MODIT on the BFP-medium
dataset are correct and the improvement against the three pre-trained models reaches 34.38%,
25.72%, and 30.50%, respectively.
Existing learning-based APR techniques can only generate patches for a single programming
language and most of them are developed offline. In 2022, Yuan et al. [228] propose CIRCLE, a T5-
based APR technique targeting multiple programming languages with continual learning. CIRCLE
first employs a pre-trained model as a repair skeleton, then designs a prompt template to bridge the
gap between pre-trained tasks and program repair. To further strengthen the continual learning
ability, CIRCLE applies a difficulty-based rehearsal method to achieve lifelong learning without
access to the entire historical data and an elastic regularization to resolve catastrophic forgetting.
Finally, to perform the multi-lingual repair, CIRCLE designs a simple but effective re-repairing
mechanism to eliminate incorrectly generated patches caused by multiple programming languages.
The experimental results on five benchmarks across four programming languages (i.e., C, JAVA,
JavaScript, and Python) show that CIRCLE is able to achieve outperform various previous learning-
based APR approaches, such as CoCoNut [115], DLFix [98] and CURE [73]. More importantly, the
results demonstrate the potential of CIRCLE in repairing multiple programming language bugs
with a single repair model in the continual learning setting.
Different from previous learning-based APR approaches (e.g., CIRCLE [228]) that heavily rely on
large numbers of high-quality bug-fixing code pairs, in 2022, Xia et al. [209] introduce AlphaRepair
as a cloze-style APR tool to directly query a pre-trained model for generating patches. They apply
the newly pre-trained CodeBERT as an example under zero-shot learning settings. They try to mask
the buggy line in the source code with different templates or strategies and feed the whole source
code into the model with the buggy line as a “comment". Then with a large number of patches this
model generated, they propose probabilistic patch ranking to determine top-𝑘 plausible patches.
After evaluating this technique on both Java and Python benchmarks, it outperforms other APR
tools (e.g., Recoder [242], DLFix [98] and TBar [107]) and proves that a pre-trained model with no
fine-tuning is feasible.
Unlike VRepair employing a basic transformer, Fu et al. [51] propose VulRepair, a T5-based auto-
mated vulnerability repair technique based on subword tokenization and pre-training components.
They compare VulRepair with two competitive baseline approaches, VRepair and CodeBERT on a
C benchmark CVEFixes. Besides, they analyze the impact of adopted components (i.e., tokenization
and pre-training) and conduct an ablation study to investigate the contribution of each component.
The results show that VulRepair outperforms the previous repair technique VRepair [28] and it is
capable of repairing the Top-10 most dangerous CWEs.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:37

In parallel with these newly proposed approaches equipped with pre-trained models, the com-
munity has also seen some studies that empirically explore the actual performance of pre-trained
models in different repair scenarios. We will discuss these empirical studies in Section 6.3.
✎ Summary ▶ Overall, pre-trained models have significantly influenced a large amount
of code-related fields in the SE community, especially program repair. At the current stage,
existing techniques using pre-trained models for program repair are usually divided into three
categories. First, when some pre-trained models are built, they are fine-tuned and evaluated by
some downstream tasks, including program repair. The evaluation experiments are usually
conducted by these authors of the pre-trained models and reported in their original papers using
the BPF dataset from Tufano et al. [183]. The typical pre-trained models involve CodeT5 [197],
T5Learning [127] and SPT [145]. Second, researchers have proposed some novel pre-trained
model-based APR techniques. The first typical one is the fine-tuning scenario, e.g., CIRCLE [228]
is proposed to fine-tune the pre-trained T5 model with continual learning. The second typical
one is the zero-shot scenario, e.g., AlphaRepair [209] is proposed to use CodeBERT to generate
correct code under a cloze-style way. Third, there exists an increasing number of empirical
studies to evaluate the ability of pre-trained models in program repair. These empirical studies
encompass different pre-trained models [208], bug types [67] and programming languages [80].
In the future, pre-trained models can further deeply influence various steps of the program
repair workflow, such as patch correctness assessment, detailed in Section 8. ◀

6 EMPIRICAL EVALUATION
In this section, we introduce existing widely adopted datasets in the learning-based APR field and
discuss common evaluation metrics for evaluating repair performance.

6.1 Dataset
Different from previous APR techniques conducted in a traditional pipeline (e.g., generating patches
by heuristic strategies), the process of learning-based APR techniques is two-fold (1) a training pro-
cess with supervised learning on large labeled datasets (e.g., CoCoNut [115]); and (2) an evaluation
process on a small set of labeled datasets (e.g., Defects4J [76]). Benefiting from a large amount of
research effort in the learning-based APR community, there are several existing benchmarks to
evaluate NMT techniques for automatically repairing bugs. Now we discuss the widely adopted
datasets in the literature.
Defects4J [76] is the most widely adopted benchmark in learning-based APR studies, which
contains 395 known and reproducible real-world bugs from six open-source Java projects. To
facilitate reproducible studies, each bug contains a buggy version and a fixed version, as well as
a corresponding test suite that triggers that bug. Defects v2.0 provides 420 additional real-world
bugs from 17 Java projects, which is adopted by some recent studies [209, 242]. QuixBugs [103] is a
multi-lingual parallel bug-fixing dataset in Python and Java used in [209, 228]. QuixBugs contains
40 small classic algorithms with one bug on a single line, along with the test suite. Bugs.jar [165]
contains 1,158 real bugs from 8 large open-source Java projects, each of which has a fault-revealing
test suite. ManyBugs [92] contains 185 real-world bugs from 9 open-source C projects and each
bug has a corresponding developer patch and test suite. IntroClass [92] consists of 998 bugs in six
small student-written programming assignments for C language. Due to a well-defined test suite,
these datasets are effective in evaluating the correctness of generated patches by dynamic program
behavior.
8 Thelink provided in the original paper has expired. We find and provide a new link on Bitbucket, which is also maintained
by the authors.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:38 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

Table 4. Detailed information on collected datasets in existing learning-based APR studies.

ID Name Language #Bugs Test Suite Training Testing Techniques URL

1 Bears [117] Java 251 ✔ ✔ ✔ [94, 227] Link
[24, 31, 42, 66,
2 BFP medium [184] Java 65454 ✘ ✔ ✔ 142, 175, 176, Link
184, 221]
[24, 31, 42, 66,
3 BFP small [184] Java 58350 ✘ ✔ ✔ 142, 175, 176, Link
184, 221]
4 BigFix [98] Java 1.824 M ✘ ✔ ✔ [98, 99] Link
5 Bugs2Fix [113] Java 92849 ✘ ✔ ✔ [27, 33] Link
6 Bugs.jar [165] Java 1158 ✔ ✔ ✔ [98, 179, 227] Link
7 Code-Change-Data [23] Java 44372 ✘ ✔ ✔ [23] Link
8 CodeXGlue [113] Java 122 K ✘ ✘ ✔ [33] Link
9 CodRep [29] Java 58069 ✘ ✔ ✔ [27, 227] Link
10 CPatMiner [141] Java 44 K ✘ ✔ ✔ [99] Link
11 DeepRepair [203] Java 374 ✘ ✔ ✘ [203] Link
[23, 26, 94, 99,
12 Defects4J [76] Java 835 ✔ ✔ ✔ 114, 119, 175, 178, Link
179, 190, 226]
13 Function-SStuBs4J [143] Java 21047 ✘ ✔ ✔ [143] Link
14 IntroClassJava [45] Java 998 ✔ ✔ ✔ [26, 242] Link
15 Java-med [8] Java 7454 ✘ ✔ ✘ [77] Link
16 ManySStuBs4J large [78] Java 63923 ✘ ✔ ✔ [125] Link
17 ManySStuBs4J small [78] Java 10231 ✘ ✔ ✔ [125, 179] Link
18 MegaDiff [137] Java 663029 ✘ ✔ ✘ [28] Link
19 Ponta et al. [154] Java 624 ✘ ✔ ✔ [31] Link
20 Pull-Request-Data [183] Java 10666 ✘ ✔ ✔ [23, 183] Link
21 Ratchet [63] Java 35 K ✘ ✔ ✔ [63] Link
22 Recoder [242] Java 103585 ✘ ✔ ✘ [242] Link
23 TRANSFER [130] Java 408091 ✘ ✔ ✘ [130] Link
24 Deepdelta [132] Java 4.8 M ✘ ✔ ✔ [132] N.A.
25 Rahman et al. [159] C 2482 ✘ ✔ ✔ [159] N.A.
26 Big-Vul [47] C 3745 ✘ ✔ ✔ [28] [241] Link
27 Code4Bench [118] C 25 K ✔ ✔ ✔ [185] Link
28 Wang et al. [182] C 195 K ✔ ✔ ✔ [191] N.A.
29 CVEFixes [16] C 8482 ✘ ✔ ✔ [51, 192] Link
[57, 58, 60, 75,
30 DeepFix [58] C 6971 ✔ ✔ ✔ Link8
222, 223]
31 ManyBugs [92] C 185 ✔ ✔ ✔ [115, 190, 228] Link
32 Prophet [111] C 69 ✔ ✔ ✔ [111, 114] Link
33 Prutor [36] C 6971 ✔ ✔ ✔ [134, 219] Link
[114, 115, 190,
34 BugAID [61] JS 105133 ✘ ✔ ✔ Link
228]
35 BugsJS [59] JS 453 ✔ ✔ ✔ [89] Link
36 HOPPITY [39] JS 363 K ✘ ✔ ✔ [39] Link
37 KATANA [169] JS 114 K ✘ ✔ ✔ [169] Link
38 REPTORY [139] JS 407 K ✘ ✔ ✔ [139] Link
39 TFix [15] JS 100 K ✘ ✔ ✔ [15] Link
40 ETH Py150 [160] Python 150 K ✘ ✔ ✔ [64, 163, 186] Link
41 GitHub-Python [223] Python 3M ✘ ✔ ✔ [223] Link
42 Szalontai et al. [172] Python 13 K ✘ ✔ ✔ [172] N.A.
43 PyPIBugs [7] Python 2374 ✘ ✔ ✔ [7, 163] Link
44 SSB-9M [164] Python 9M ✘ ✔ ✘ [163] Link
45 VUDENC [199] Python 10 K ✘ ✔ ✔ [241] Link
46 Macer [30] Python 286 ✔ ✘ ✔ [231] Link
47 SPoC [88] C++ 18356 ✔ ✔ ✔ [222] Link
[26, 28, 41, 73,
48 QuixBugs [103] Java,Python 40 ✔ ✔ ✔ 114, 115, 179, 190, Link
209, 228, 242]
49 DeepDebug [41] Java,Python 523 ✘ ✔ ✔ [41, 42] N.A.
[73, 115, 227,
50 CoCoNut [115] C,Java,JS,Python 24 M ✔ ✔ ✘ Link
228]
51 CodeFlaws [173] C,Python 3902 ✔ ✔ ✔ [19, 115, 216] Link
52 ENCORE [114] Java,JS,Python,C++ 9.2 M ✘ ✔ ✘ [114] N.A.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:39

However, NMT-based APR techniques employ neural network techniques to learn the bug-fixing
patterns from the training dataset. The training of a reliable NMT repair model is hindered by the
scarcity of high-quality test datasets, which require extensive manual effort to produce. To make
experiment results more persuasive, lots of large-scale datasets have been curated recently. Such
datasets contain bug-fixing code pairs for the model to learn how to transform a buggy code into
the expected fixed code. In particular, researchers usually mine open-source projects from code
platforms (e.g., GitHub) and extract the commits by fixing-related keywords. Then unqualified
commits are filtered out by pre-defined rules (e.g., non-code changes). For example, Tufano et
al. [184] extract the bug-fixing commits between March 2011 and October 2017 on GitHub and
release two BFP datasets for small (i.e., 0∼50 tokens) and medium (i.e., 50∼100 tokens) methods,
consisting of 58k (58,350) and 65k (65,455) bug-fixing samples, respectively. Recoder [242] releases
a dataset of 103,585 bug-fixing pairs by crawling Java projects on GitHub between March 2011
and March 2018. Further, CoCoNut [115] provides five datasets across four languages (i.e., Java,
Python, C and JavaScript) by extracting commits from GitHub projects, resulting in more than
twenty million bug-fixing pairs.
Table 4 presents the description of all involved datasets in our survey. The first two columns list
the dataset name and the third column lists the programming languages the dataset covers. The
fourth column lists the number of bugs the dataset contains. The fifth column indicates whether
the dataset has corresponding test suites. The sixth and seventh columns indicate whether the
dataset is used in the training and evaluation process. The last column lists some learning-based
studies employing the dataset.
Among the collected datasets in our survey, we find that training datasets usually only contain
bug-fixing pairs for NMT model training, while evaluation datasets may additionally contain some
test suites to validate the correctness of generated patches. For example, existing studies [115, 228]
generally adopt some datasets like Defects4J as the evaluation datasets while adopting other
datasets like CoCoNut as the training datasets. Besides, we find some studies [183, 184] adopt the
same dataset for training and evaluation without executing test suites. For example, Tufano et
al. [184] split BFP dataset into training and evaluation parts and evaluate the repair performance
by match-based metric.
Table 4 also presents the programming languages of all datasets. It can be found that the collected
datasets mainly involve five languages (i.e., Java, JavaScript, Python, C and C++). Among them,
similar to traditional APR, Java is the most targeted language in the learning-based APR techniques.
Besides, researchers conduct lots of datasets in other languages (e.g., Python), indicating that
learning-based APR techniques begin to consider more languages in practice. For Java, researchers
prefer the traditionally dominated Defects4J dataset and the recently-released BFP dataset. For
other programming languages, researchers have different choices for datasets due in part to the
lack of publicly-accepted datasets. We also find that some recent datasets involve multi-languages,
such as CoCoNut [115] and QuixBugs [103, 225], while the traditional APR techniques mainly
focus on Java language [43]. The possible reasons lie in that (1) traditional techniques are widely
conducted on the same benchmark Defects4J while some additional datasets have been released
along with the application of DL; (2) traditional techniques may rely on language-specific features
to generate patches, which is challenging to apply to other languages (e.g., PraPR adopting JVM
bytecode [54]), while learning-based techniques treat APR as an NMT task similar to NLP, which is
independent of specific programming languages.

✎ Summary ▶ Within the expansive arena of learning-based APR, datasets play a pivotal role
in shaping the trajectory of research advancements. Different from traditional APR techniques,

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:40 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

which often leverage heuristic strategies for patch generation, learning-based APR techniques
are distinctly split into a two-phase methodology: a supervised training on large-scale labeled
datasets and a subsequent evaluation on smaller, selected datasets. While the traditional APR
realm has seen an inclination towards Java-centric datasets like Defects4J, the infusion of DL
into the sector has broadened horizons. One typical trend is the construction of large-scale
training datasets, e.g., the BPF dataset [183]. The other typical trend is the application of
multiple programming languages, e.g., the CoCoNut dataset [115]. However, we observe that
while ample datasets exist for training—mainly comprising bug-fixing pairs, evaluation datasets
often carry the added component of test suites to ascertain patch correctness. In summation,
as learning-based APR continues to evolve, it is imperative for the community to prioritize the
curation of comprehensive, high-quality datasets that cater to both training and evaluation. ◀

6.2 Metric
Evaluation metrics play a crucial role in the development and growth of learning-based APR
techniques as they serve as the standard to quantitatively define how good an NMT repair model is.
In this section, we discuss the common evaluation metrics in the learning-based APR community.
6.2.1 Execution-based Metrics. In general, learning-based APR techniques predict some candidate
patches with high probability as the outputs. The generated patches are evaluated by executing
available test suites to determine whether to report them to the developers for deployment. We list
the standard metrics as follows.
(1) Compilable Patch. Such a candidate patch makes the patched buggy program compile
successfully.
(2) Plausible Patch. Such a compilable patch fixes the buggy functionality without harming
existing functionality (i.e., passing all available test suites).
(3) Correct Patch. Such a plausible patch is semantically or syntactically equivalent to the
developer patch (i.e., generalizing the potential test suite).
6.2.2 Match-based Metrics. Although widely used in the learning-based APR literature, it is time-
consuming to evaluate generated patches on dynamic execution for all available test suites. Besides,
test suites may not always be available in large-scale evaluation datasets. More recently, an increas-
ing number of studies evaluate the performance by code token matching between the generated
patch and the ground truth (i.e., developer-written patches), listed as follows.
(1) Accuracy. Accuracy measures the percentage of candidate patches in which the sequence
predicted by the model equals the ground truth. As learning-based APR techniques usu-
ally employ a beam-search strategy, the beam-search strategy reports the 𝑘 sequences (i.e.,
sequence of terms representing the fixed code) with the highest probability. Researchers
consider these 𝑘 final sequences as candidate patches for a given buggy code snippet. Then
Accuracy@K value is defined as follows.

1{𝑚𝑎𝑡𝑐ℎ( 𝑗
Í𝑛 Í𝑘
𝑖=1 𝑗=1 𝑐 𝑖 )}
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦@𝐾 = (1)
𝑛
where 1 denotes whether 𝐶𝑖 contains a predicted repair sequence equal to the ground truth
repair sequence. The sequence accuracy is 1 if any predicted sequence among the 𝑘 outputs
matches the ground truth sequence, and it is 0 otherwise.
(2) BLEU. BLEU (Bilingual Evaluation Understudy) [149] score measures how similar the
predicted candidate patch and the ground truth are. Given a size 𝑛, BLEU splits the candidate
patch and ground truth into n-grams and determines how many n-grams of the candidate

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:41

patch appear in the reference patch. The BLEU score ranges between 0 (the sequences are
completely different) and 1 (the sequences are identical).
Compared with execution-based metrics, accuracy and BLUE evaluate the candidate patch by
matching the tokens of the candidate patch and ground truth without dynamic execution. These
two metrics can be employed to evaluate the performance of a mass of candidate patches in a limited
time and thus have been commonly adopted in the learning-based APR community [183, 184, 228].
However, accuracy and BLUE are initially designed in NLP tasks and may be improper to evaluate
the program repair task due to the differences between natural language and programming language.
For example, accuracy refers to the perfect prediction, which ignores that different code snippets may
have the same semantic logic. Besides, BLEU is originally designed for natural language sentences
by token-level matching, neglecting important syntactic and semantic features of codes. To address
the above concerns, recently researchers adopt a variant of BLEU (i.e., CodeBLEU [161]) to evaluate
the performance of learning-based APR techniques [113]. Compared with BLEU, CodeBLEU further
considers the weighted n-gram match, the syntactic AST match, and the semantic data-flow match.
In particular, the n-gram match assigns different weights for different n-grams, the syntactic match
considers the AST information in the evaluation score by matching the sub-trees, and the semantic
match employs a data-flow structure to measure semantic similarity.
✎ Summary ▶ Overall, within the realm of learning-based APR, evaluation metrics are of
paramount importance in guiding the evolution of repair models. On the one hand, similar to
traditional APR, learning-based APR the APR domain has been inclined towards execution-
based metrics, such as plausible patches, which are derived from the field of SE. On the other
hand, unlike traditional APR, are increasingly biased towards match-based metrics, such as
BLEU, which are derived from the field of NLP. The possible reason behind this trend is
the lack of test cases in the evaluation datasets, such as the BFP dataset [183]. Despite their
convenience, these NLP-inspired metrics are not without their pitfalls. For example, Accuracy
focuses narrowly on perfect predictions, and traditional BLEU might overlook the intricate
semantics of source code. To sum up, as learning-based APR continues its upward trajectory, the
spotlight is increasingly on the development and adoption of nuanced, code-centric evaluation
metrics (such as CodeBLUE) that mirror the complexities of the programming domain. ◀

6.3 Empirical Study

Despite an emerging research area, a variety of learning-based APR techniques have been proposed
and continuously achieved promising results in terms of the number of fixed bugs in the litera-
ture [115, 228]. In addition to developing new repair techniques that address technical challenges,
the learning-based APR research field is benefiting from several empirical studies. These empirical
studies systematically explore the impact of different components (e.g., code representation), pro-
viding insights into future learning-based APR work. We summarize existing empirical studies in
Table 5 and discuss them in detail as follows.
Traditional APR techniques are usually restrained by a relatively limited set of manually crafted
repair patterns [107]. Inspired by the potential of advanced DL techniques, which have shown
impressive performance in tackling several SE tasks [220], in 2019, Tufano et al. [184] conduct the
first systematic empirical study to investigate the capability of utilizing NMT models to fix software
bugs from open-source bug-fixing commits. First, they mine the bug-fixing commits by message
patterns from projects in GitHub repositories and filter out the low-quality commits by specific
rules. Second, they identify the list of edit actions performed between the buggy and fixed files using
the GumTree [46] and extract bug-fixing method pairs with at least one edit action. Third, they
design a code abstraction strategy to reduce vocabulary size by only keeping frequent identifiers

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:42 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

Table 5. A summary and comparison of empirical studies in learning-based APR

Year Study Scope Language Description

the first empirical study to assess the feasibility of using NMT
2019 Tufano et al. [184] Program Repair Java
techniques for learning bug-fixing patches.
investigate how language translation models perform in APR,
2020 Ding et al. [40] Program Repair Java
specifically on the concept of “patching as translation”.
investigate the performance of CodeBERT in fixing bugs from
2021 Mashhadi et al. [125] Program Repair Java
the ManySStuBs4J banchmark.
investigate the performance of the pre-trained model in fixing
2022 Kolak et al. [84] Program Repair Java
bugs from the QuixBugs benchmark.
investigate the impact of code representations in APR with
2022 Namavar et al. [139] Code Representation JavaScript
21 models and 14 code representation methods.
the first extensive study on directly applying nine pre-trained
2022 Xia et al. [208] Program Repair Java,Python,C
models for APR across three programming languages.
an extensive study of learning-based patch correctness as-
2022 Wang et al. [198] Patch Correctness Java
sessment techniques on the Defects4J dataset.
investigate the performance of the pre-trained model in fixing
2022 Kim et al. [80] Kotlin Repair Kotlin
defects in the Samsung Kotlin projects.
a preliminary study to investigate the performance of pre-
2022 Huang et al. [67] Vulnerability Repair C/C++
trained models in repairing security vulnerabilities.

and literals. Finally, they construct two datasets (i.e., BFP-small and BFP-medium) and train NMT
models to translate the buggy method into the corresponding correct method. The experimental
results show that NMT models are able to fix a considerable number of buggy methods in 9%–50%
of the cases. More importantly, this study highlights the future of NMT for APR, providing a solid
empirical foundation for follow-up studies in the learning-based APR community.
In 2020, Ding et al. [40] empirically investigate to what extent program repair is like machine
translation. They reveal that there exist essential differences between Seq2Seq models and transla-
tion models in terms of task design and architectural design. The translation model is inappropriate
for program repair due to the lack of vocabulary and immediate context. Besides, the translation
model usually keeps up most tokens from the bug code while replacing only a small number, which
is not ideal for program repair. Finally, they implement an edit-based model by adapting the Seq2Seq
models used for translation to generate edits rather than raw tokens, which leads to promising
improvement.
In 2021, with the rise of pre-trained models in the SE domain, Mashhadi et al. [125] conduct a pre-
liminary to apply CodeBERT to Java simple bugs. They fine-tune and evaluate it on ManySStuBs4J
datasets and find it is capable of generating patches in a short time. Their approach gets rid of the
limitation of token length and vocabulary problems, thus this model is more efficient and effective.
This model can generate patches for different types of bugs and outperform simple Seq2Seq models
in terms of the accuracy of generated patches. Similarly, Kolak et al. [84] propose to apply large
pre-trained language models to generate patches for one-line bugs in Java and Python programs.
They consider pre-trained models with a wide range of sizes (e.g., GPT-2 with 160M, 0.4B, and 2.7B
parameters and CodeX 12B parameters) for evaluation and comparison. After evaluating these
models on the QuixBugs benchmark, they discover that larger language models tend to generate
more predictable patches and thus are more promising in guiding patch selection in APR work.
In 2022, focusing on code representation, Namavar et al. [139] conduct a systematic study to
understand the effect of different code representation ways on learning-based APR performance. In
particular, they implement REPTORY as a tool for controlled experiments to assess the accuracy of
different code representations (e.g., AST variants) and the functionality of four different embeddings
(e.g., Word2Vec). They conduct 21 experiments with different models to evaluate their automatic
patchability and perceived usefulness as well as accuracy. The results reveal that mixed code

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:43

representation with Golve embedding outperforms other settings. Moreover, they find that bug
type affects the accuracy of different code representations.
At the same time, Xia et al. [208] present the first extensive evaluation of large programming
language models (PLMs) for program repair. They select nine state-of-art pre-trained PLMs with
different types (i.e., infilling and generative models) and parameter sizes (i.e., ranging from 125M
to 20B). They design three different repair settings for PLMs (i.e., complete function generation,
correct code infilling, and single line generation). They then conduct experiments on 5 datasets
across 3 different languages to compare different PLMs in the number of bugs fixed, generation
speed and compilation rate. They also compare the performance of PLMs against existing APR
techniques (e.g., Recoder [242] and CURE [73]) and results demonstrate the promising future of
directly adopting PLMs for APR.
Considering most existing APCA techniques evaluated on limited datasets, Wang et al. [198]
conduct an extensive empirical study of patch correctness on Java programs. First, they collect
a large-scale real-world dataset for patch correctness, containing 1,988 patches generated by the
recent PraPR APR tool [54]. Then they revisit state-of-the-art APCA techniques on the new dataset,
including static-based (e.g., Anti-patterns: [174]), dynamic-based (e.g., PATCH-SIM [212]), and
learning-based (e.g., ODS [224]). Results show that learning-based APCA techniques tend to
suffer from the dataset overfitting issue [198]. For example, the embedding-based techniques [179]
underperform on patches sourced from subjects outside the training set, thereby highlighting the
need for cross-dataset evaluation in future learning-based APCA research. Besides, the performance
of dynamic techniques significantly drops when encountering patches with more complicated
changes.
Different from previous empirical studies [184, 208] focusing on semantic bugs triggered by
test cases, Kim et al. [80] conduct an empirical study to investigate the performance of existing
learning-based APR techniques in fixing defects detected by a static analysis tool. They employ
the pre-trained TFix [15] model as the representative APR technique to fix defects from industrial
Samsung Kotlin projects. The experimental results demonstrate the original TFix model can fix 94
out of 1,961 defects. They also find that a fine-tuned TFix model using the defect-fixing dataset can
fix 289 more defects than the original TFix model. Besides, the TFix model with additional transfers
performed using the bug-fixing dataset fixes 211 more defects than the model transferred using only
defect-fixing knowledge. More importantly, as the first work to apply TFix to an industrial software
project, this empirical study demonstrates the potential of transfer learning when applying existing
learning-based APR techniques to industrial software.
Meanwhile, to explore the real-world performance of pre-trained models for vulnerability repair,
Huang et al. [67] conduct a preliminary stucy to apply large pre-trained models for vulnerability
repair. They compare the performance of CodeBERT and GraphCodeBERT on a C/C++ vulnerability
dataset with five CWE types. They discover that GraphCodeBERT with a data flow graph is signifi-
cantly better than CodeBERT without documenting code dependencies. They also demonstrate
that such pre-trained models outperform learning-based APR techniques (e.g., CoCoNut [115] and
DLFix [98]) and more data-dependent features (e.g., data flow and control flow) will help to repair
more complex vulnerabilities.

✎ Summary ▶ As the APR research community embraces an influx of learning-based APR

approaches, there is a parallel rise in empirical studies aimed at scrutinizing the progression
and subtleties of these techniques. These empirical studies explore the actual performance of
existing approaches from different aspects, such as the impact of code representation, the ability
of pre-trained models, and the potential in repairing vulnerabilities. However, considering that

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:44 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

there exist different repair phases, and each process can introduce various specific techniques,
the community urgently needs more and deeper empirical studies to illuminate the landscape
of learning-based APR. For example, future work can empirically explore whether mature
dynamic program execution techniques from other domains (e.g., mutation testing and fuzzing)
can be used to accelerate the patch validation, detailed in Section 8. ◀

7 APPLICATION AND DISCUSSION

In this section, we will discuss and summarize several crucial aspects of the learning-based APR
community.

7.1 Industrial Deployment

As a promising field, APR has been extensively studied in academia and even has drawn growing
attention from industry [10]. For example, Marginean et al. [121] present SapFix, the first end-to-
end deployment of industrial APR in Meta. SapFix is implemented in a continuous integration
environment and deployed into six production systems with tens of millions of code lines. Similar
industrial practices can also be found in other companies, such as Fujitsu [166], Bloomberg [81] and
Alibaba [239]. In addition to the above-mentioned traditional deployment, the industry recently
explored the feasibility of deploying learning-based APR tools. For example, GitHub launches a
product Copilot9 , which can provide code suggestions (e.g., fixing bugs) for more than a dozen
programming languages. Copilot is deployed in multiple IDEs, such as VS Code, Visual Studio,
Neovim, and JetBrains. Besides, Microsoft recently released a new tool Jigsaw10 to fix bugs in
machine-written software.
Now, we summarize the existing learning-based APR techniques and industrial deployment from
enterprises.
As early as in 2019, Bader et al. [10] present Getafix, the first industrially-deployed automated
bug-fixing tool for Java programs. To be fast enough to suggest fixes in time, this model produces a
ranked list of fix candidates based entirely on past fixes and on the context in which a fix is applied.
Besides, it leverages the hierarchical clustering technique for discovering repetitive fix patterns.
Moreover, They apply a statistical ranking technique to enable the model to predict human-like
fixes among the top few suggestions. An evaluation with a large dataset containing six types of
common bugs and their experience of deploying Getafix within Facebook shows that the approach
accurately predicts human-like fixes for various bugs, reducing the time developers have to spend
on fixing recurring kinds of bugs.
In 2020, Hellendoorn et al. [64] from Google conduct experiments for two different model
architectures that leverage both local and global information. They propose sandwich models that
apply different message-passing techniques and GREAT models that add extra information to a
transformer. Both architectures achieve high results and outperform both RNN and transformer
architectures, proving that a hybrid model with global information and incorporating structural
bias helps improve accuracy.
In 2021, Baudry et al. [11] present R-HERO, a novel software repair robot to automatically repair
bugs on the single platform GitHub/Travis CI. R-HERO contains six main blocks: a) Continuous
integration, b) Fault localization, c) Patch generation, d) Compilation & Test execution, e) Overfitting
prevention, and f) Pull-request creation. It receives and analyzes the events from a continuous
integration (CI) system. R-HERO leverages continual learning to acquire bug-fixing strategies from

9 https://github.com/features/copilot
10 https://www.microsoft.com/en-us/research/blog/jigsaw-fixes-bugs-in-machine-written-software/

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:45

the platform mentioned above. It shows that developers and bots can cooperate fruitfully to produce
high-quality, reliable software systems.
Different from previous works with supervised learning, Allamanis et al [7] from Microsoft
propose BUGLAB to detect and repair software bugs automatically by self-supervised learning.
Similar to BIFI [223], BUGLAB employs a detector model to repair bugs and a selector model to
generate buggy code snippets as the training data for the detector. The authors create a dataset
PYPIBUGS of 2374 real-world bugs from the PyPI packages. The results show that BUGLAB can fix
a number of software bugs and detect some previously unknown bugs in open-source software.
In parallel to BUGLAB, Tang et al. [176] from Microsoft introduce a grammar-guided end-to-end
approach to generate patches, which treats APR as the transformation of grammar rules. They
apply structure-aware modules and design three different types of strategies for grammar-based
inference algorithms. They also leverage two encoders and enhance the model with a new tree-
based self-attention. The experimental results on BFP datasets [184] demonstrate that the proposed
technique outperforms previous RNN-based techniques (e.g., Tufano et al. [184]).
Considering the raise of pre-trained models, Drain et al. [42] from Microsoft introduce DeepDebug,
a span-masking pre-trained encoder decoder transformer as a tool to fix Java methods. The model is
pre-trained from BART which is pre-trained in English. They conduct three pre-training experiments
to verify the feasibility of the model and test it on the Java benchmarks from Tufano et al. [184].
Results show that DeepDebug outperforms existing APR tools (e.g., CodeBERT [49] and Tufano et
al. [184]), and adding syntax embeddings along with the standard positional embeddings helps
improve the model.
In 2022, similar to DeepDebug, Hu et al. [66] from AWS AI propose NSEdit to generate patches
for Java programs. Given only the buggy code, NSEdit uses the pre-trained CodeBERT as the
encoder and CodeGPT as the decoder to address the Seq2Seq NMT problem. Moreover, it uses a
pointer network to select content-based edit locations. They apply beam search and design a novel
technique to fine-tune the reranker to re-rank the top-k patches for the buggy code. The results on
BFP benchmarks [184] indicate that NSEdit outperforms CodeBERT [49] and the ablation study
demonstrates the effectiveness of each component of the model.
Meanwhile, Wang et al. [190] from Ping An Technology propose CPR, short for causal program
repair, as a tool to utilize data augmentation strategy for input perturbations. This model can
generate patches for Java, Python, JavaScript, and C based on causally related input-output tokens.
Besides, it can offer explanations by transforming code into explainable graphs on various Seq2Seq
models in APR. They conduct experiments on four programming languages and prove that APR
models can be utilized as causal inference tools.

✎ Summary ▶ The APR domain has witnessed an unprecedented surge in industrial adoption.
With giants like Meta, Fujitsu, Bloomberg, and Alibaba exploring and harnessing its potential,
learning-based APR has undoubtedly established its foothold in real-world applications. Em-
phasis has notably shifted to learning-based APR tools, as exhibited by GitHub’s Copilot and
Microsoft’s Jigsaw, which underscore the blend of machine learning with traditional program-
ming paradigms. Noteworthy contributions emerge from global tech titans including Microsoft,
Google, and AWS AI. From tools like Getafix, R-HERO, and BUGLAB, which emphasize speed,
collaboration, and self-supervised learning respectively, to models like DeepDebug and NSEdit
that push the envelope of program repair using state-of-the-art machine learning architectures,
industry-affiliated research has been at the forefront. As the APR community moves forward,
the collaboration between academia and industry in the APR domain is poised to shape the next
generation of repair tools and methodologies. The trend demonstrates the desire to harness

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:46 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

advanced DL techniques to address recurrent software bugs, thereby alleviating the developers’
workload in the industry. ◀

7.2 DL for Traditional APR

In addition to the increasing number of end-to-end learning-based APR techniques, there has been
growing interest in leveraging these learning technologies to improve and refine the capabilities
of traditional APR techniques. These studies usually treat machine learning as a component to
address the inherent limitation in the original APR workflow. Table 6 presents existing studies that
attempt to boost traditional APR techniques by utilizing deep learning or machine learning. The
first and second columns list the summarized studies and the years. The third column denotes the
traditional APR techniques targeted by these summarized studies. The remaining two columns list
the targeted languages and a brief description.

Table 6. A summary and comparison of APR studies combining traditional repair techniques and machine
learning techniques

Year Approach Base Language Description

Training a ranking model to assign a high probability
2016 Prophet [111] SPR C
to correct patches based on designed features.
Inferring which predicates should be used with a
2017 ACS [213] N.A. Java
given variable.
Learning to rank the repair ingredients based on code
2019 DeepRepair [203] Astor Java
similarity with representation learning.
Employing a machine learning model to rank candi-
date patches based on their static (e.g., the number of
2022 LIANA [26] RESTORE Java
variables) and dynamic (the number of passing tests)
features.
Training a BiLSTM-based multi-classifier model to
2022 TRANSFER [130] TBar Java predict which fix template should be tried to repair
one suspicious statement.
Employing a Seq2Seq model to generate patches as
2022 ARJANMT [94] ARJA Java potential fix ingredients that are manipulated by a
multi-objective evolutionary search algorithm.
Training a machine-learning model to predict the
2022 SituRepair [185] N.A. C types of bugs based on static features and apply modi-
fications to the faulty program according to the types.

As early as 2016, Long et al. [111] propose Prophet, a patch-generation system for repairing bugs.
It uses dynamic analysis on the given test suite to get the program points for the patch to modify.
Then, the SPR [110] is used to generate search space. With a trained probabilistic model, Prophet
ranks the candidate patches, which are validated by executing the test suites. They collect eight
projects from GitHub and get 777 patches to train their model and test it on a benchmark [91]. The
result shows that Prophet can generate patches correctly with the learned knowledge compared
with previous patch generation systems. From the perspective of community development, while
Prophet may not be an end-to-end NMT-based patch generation approach like CoCoNut [115], its
pioneering integration of machine learning into the repair process offers invaluable insights for
subsequent research endeavors.
In 2017, Xiong et al. [213] introduce ACS, which aims to generate precise conditions at faulty
statements. During the condition synthesis process, ACS selects what variables should be used
in the conditional expression and decides what predicate should be performed on the variables.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:47

The predicates are mined from existing projects, and sorted based on their frequencies in contexts
similar to the target condition. The results on Defects4J show that ACS is the first APR approach
that achieves a precision higher than 70% (the precision of previous approaches is below 40%). ACS
employs a learning component to infer which predicates should be used with the current variable.
Although the learning component (just counting the frequencies in a corpus of source code) is
very simple, it is still learning. Thus, we regard ACS as one of the earliest learning-based APR
techniques.
At the same time, Long et al. [109] present a new system, Genesis, that processes human patches
to automatically infer code transforms for automatic patch generation. They first extract transforms
from the training set to obtain a pair containing a program before a change and a program after
a change. For each transformation, they create a template that defines the AST changes. They
then collect templates to create AST template forests which contain template variables to match
any appropriate AST subtrees. Given a set of training pairs, Genesis will select from the inference
search space to obtain potential transforms. They design an algorithm to reach a trade-off between
search space coverage and tractability. Finally, from these transforms they obtain a set of candidate
patches. They then evaluate Genesis on a dataset collected from GitHub Java programs covering null
pointer (NP), out-of-bounds (OOB), and class cast (CC) bugs. Results show that Genesis outperforms
another patch generation technique PAR [79] that leverages manually defined templates.
In 2019, White et al. [203] propose DeepRepair to intelligently select repair ingredients via
deep learning code similarities. In particular, DeepRepair is implemented on top of Astor [123], a
traditional heuristic-based APR approach, and consists of three phases, i.e., language recognition,
machine learning, and program repair. First, the language recognition phase processes the source
code to create ASTs and maps the literal tokens to their respective type. Second, the machine
learning phase trains a neural network language model from the file-level corpus to representations
for each term, and then trains an encoder to encode arbitrary streams of embeddings. Third, the
program repair phase leverages the trained encoder to query and transform code snippets for
patch generation. In this step, DeepRepair sorts the repair ingredients based on code similarity and
applies repair operators (“addition of statement” and “replacement of statement”) to repair the code
snippet. The experimental results on Defects4J demonstrate that DeepRepair achieves comparable
performance against jGenProg [122] in terms of the number of plausible patches with a faster
discovery speed of compilable ingredients. More importantly, as the first approach to expand the fix
space by transforming ingredients, DeepRepair generates some patches that cannot be generated
by jGenProg, highlighting the differences between the nature of DeepRepair and jGenProg.
In 2022, Chen et al. [26] propose a search-based technique called LIANA, which is based on
a designed learning-to-rank prioritization mode. It is based on the idea of repeatedly updating a
statistical model online based on the intermediate validation results of an ongoing program repair
process. The model is first trained offline and updated repeatedly after the generating progress
starts. The most up-to-date model is used to generate fixes and prioritize those that are more likely
to include the correct ingredients.
To improve the template-based APR, Wang et al. [130] propose TRANSFER, a fault localization
and program repair approach with deep semantic features and transferred knowledge which is
obtained by a combination of spectrum-based and mutation-based localization techniques. They
build a fault localization and program repair dataset respectively and employ existing fix templates
designed by TBar. They also design 11 binary classifications to identify whether one of the 11 bug
types they define exists in a statement and a multi-classification to determine which fix template
this statement should apply. The binary classification, consisting of one embedding layer, one RNN
layer, one max pooling layer, and one dense layer, is fed with spectrum-based, mutation-based, and

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:48 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

semantic features and outputs the probability of containing specific bugs. Although this approach
is only tested on Java, it is proven to outperform many state-of-art approaches.
Similarly, to improve the search-based APR, Li et al. [94] design a novel framework called
ARJANMT to leverage both redundancy assumption and Seq2Seq learning of correct patches to
generate fixes for Java methods using NSGA-II algorithm. This framework combines both ARJA and
SequenceR into a unified framework. After evaluating ARJANMT on two Java benchmarks, results
show that it benefits from search-based and NMT-based techniques and outperforms existing APR
techniques (e.g., CoCoNut [115], DLFix [98] and CURE [73]).
To address multiple bugs, Valueian et al. [185] propose SituRepair for repairing multiple bugs in
C programs based on pre-defined repair patterns. It applies a machine learning model to predict the
buggy type and localization of the buggy code and then repairs them with situational modifications
accordingly. SituRepair is evaluated on a C benchmark Code4Bench and it successfully repairs
3,848 multiple-fault programs, outperforming Genprog [93].

✎ Summary ▶ Although a mass of research effort has been devoted to end-to-end patch gen-
eration, the literature has also seen some orthogonal works utilizing DL to enhance traditional
APR techniques. Different from most learning-based APR techniques that design an NMT-based
patch generation model from scratch, these techniques can leverage mature traditional APR
techniques and employ DL to improve specific components, such as the selection of repair
templates [130]. Future work can be conducted to address certain limitations of traditional
APR techniques, such as the donor code retrieval issue [107] using pre-trained models. ◀

7.3 Open Science

Recent years have witnessed an increasing application of DL in traditional SE problems and tasks.
In particular, software bug is a growing quality concern for modern software, and accordingly,
APR has become an actively studied topic in the SE community. According to our survey, various
learning-based APR techniques have been introduced in the last five years (discussed in Section 2).
DL brings a new repair paradigm (i.e., training and repairing) for the APR problem with promising
results. However, due to the nature of DL, learning-based APR techniques face some concerns
in reproducibility, which is quite different from traditional APR techniques. For example, it may
require a large number of machine resources for researchers to reproduce the NMT model’s work.
The cost is unaffordable for most researchers from academia. Besides, there exists randomness in
the neural network training process, which hinders the reproduction results.
Such challenges posed by DL motivate us to further understand the potential issues with open
science in the learning-based APR area, so as to advance existing techniques by taking advantage
of the general merits of open science. Open science advocates that researchers make their artifacts
(e.g., raw data, dataset, scripts, related models, or any results produced in their work) available
to all levels of researchers [128], so knowledge can be shared without boundaries [146]. While a
mass of DL techniques are proposed to fix software bugs automatically, more support is needed
to investigate the critical open science problem. In particular, we investigate to what extent the
collected papers make their artifacts publicly available and in what way they provide the relevant
information.
Table 7 shows the tool availability results of the investigated papers. For each paper we collect,
we check whether an accessible link for its tool or data is provided in the main text or footnotes of
the paper. We only present the studies that provide the link of publicly available data or tools due
to limited space, listed in the first column. We then investigate the following five dimensions in
characterizing the availability of each paper:

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
Table 7. Results on tool availability

Tool Language Dataset Hosting Site Link Accessibility SA DA TA URL

Wang et al. [191] C CodeHunt GitHub valid ✔ ✘ ✘ https://github.com/keowang/dynamic-program-embedding
Tufano et al. [183] Java Pull-Request-Data Google valid ✔ ✔ ✔ https://sites.google.com/view/learning-codechanges
Tufano et al. [184] Java BFP-small, BFP-medium Google valid ✔ ✔ ✘ https://sites.google.com/view/learning-fixes
RLAssitst [57] C DeepFix bitbucket valid ✔ ✔ ✔ https://bitbucket.org/iiscseal/rlassist
Defects4J, QuixBugs, ManyBugs,
CoCoNut [115] Java,C,Python,JS GitHub valid ✔ ✔ ✘ https://github.com/lin-tan/CoCoNut-Artifact
CodeFlaw, BugAID
https://github.com/ICSE-2019-AUTOFIX/ICSE-2019-
DLFix [98] Java Defects4J, Bugs.jar, BigFix GitHub valid ✔ ✔ ✘
AUTOFIX
Hellendoorn et al. [64] Python ETH Py150 GitHub valid ✔ ✔ ✔ https://github.com/VHellendoorn/ICLR20-Great
DrRepair [222] C,C++ DeepFix, SPoC GitHub valid ✔ ✔ ✔ https://github.com/michiyasunaga/DrRepair
Learn2Fix [19] Python CodeFlaw GitHub valid ✔ ✔ ✔ https://github.com/mboehme/learn2fix
Tian et al. [179] Java Defects4J, QuixBugs GitHub valid ✔ ✔ ✔ https://github.com/TruX-DTF/DL4PatchCorrectness
BIFI [223] Python,C GitHub-Python, DeepFix GitHub valid ✔ ✔ ✔ https://github.com/michiyasunaga/bifi
Recoder [242] Java Defects4J, QuixBugs, IntroClassJava GitHub valid ✔ ✔ ✘ https://github.com/pkuzqh/Recoder
SequenceR [27] Java Defects4J, Bugs2Fix, CodRep GitHub valid ✔ ✔ ✔ https://github.com/kth/SequenceR
TFix [15] JS TFix GitHub valid ✔ ✔ ✔ https://github.com/eth-sri/TFix
https://github.com/microsoft/neurips21-self-supervised-
BugLab [7] Python PyPIBug GitHub valid ✔ ✔ ✘
bug-detection-and-repair
Reptory [139] JS REPTORY GitHub valid ✔ ✔ ✘ https://github.com/annon-reptory/reptory
RewardRepair [227] Java Defects4J, QuixBugs, Bugs.jar GitHub valid ✔ ✔ ✔ https://github.com/SophieHYe/RewardRepair
https://github.com/EhsanMashhadi/MSR2021-
CodeBERT [125] Java ManySStuBs4J-small, ManySStuBs4J-large GitHub valid ✔ ✔ ✘
ProgramRepair
A Survey of Learning-based Automated Program Repair

https://github.com/repairnator/open-science-repairnator/
R-HERO [11] Not Known GitHub valid ✘ ✔ ✘
tree/master/data/2020-r-hero
Ahmed et al. [3] Java Ahmed zenodo valid ✔ ✔ ✘ https://doi.org/10.5281/zenodo.3374019
ODS [224] Java Defects4J, Bugs.jar, Bears GitHub valid ✔ ✔ ✔ https://github.com/SophieHYe/ODSExperiment
CIRCLE [228] Java,C,JS,Python Defects4J, QuixBugs, ManyBugs, BugAID GitHub valid ✘ ✘ ✘ https://github.com/2022CIRCLE/CIRCLE
TRANSFER [130] Java Defects4J GitHub valid ✔ ✔ ✔ https://github.com/mxx1219/TRANSFER
https://github.com/AutomatedProgramRepair-2021/dear-
DEAR [99] Java Defects4J, CPatMiner, BigFix GitHub valid ✔ ✔ ✔
auto-fix
Cornor et al. [33] Java CodeXGlue GitHub valid ✔ ✔ ✘ https://github.com/WM-SEMERU/hephaestus
BATS [178] Java Defects4J GitHub valid ✔ ✔ ✔ https://github.com/HaoyeTianCoder/BATS
https://github.com/antonio-mastropaolo/
T5 [126] Java BFP-small, BFP-medium GitHub valid ✔ ✔ ✔
TransferLearning4Code
CompDefect [143] Java Function-SStuBs4J zenodo valid ✔ ✔ ✘ https://zenodo.org/record/5353354#.Y4CVdhRByUl
VRepair [28] C Big-Vul, CVEfixes GitHub valid ✔ ✔ ✔ https://github.com/SteveKommrusch/VRepair
SeqTrans [31] Java BFP-small, BFP-medium, Ponta GitHub valid ✔ ✔ ✔ https://github.com/chijianlei/SeqTrans
VulRepair [51] C CVEfixes GitHub valid ✔ ✔ ✔ https://github.com/awsm-research/VulRepair
Crex [216] C CodeFlaw GitHub valid ✔ ✔ ✔ https://github.com/1993ryan/crex
RealiT [163] Python PyPIBug GitHub valid ✔ ✘ ✔ https://github.com/cedricrupb/nbfbaselines
GPT-2 [89] JS BugsJS GitHub valid ✔ ✔ ✔ https://github.com/RGAI-USZ/APR22-JS-GPT
CoditT5 [232] Java BFP-small, BFP-medium GitHub valid ✔ ✔ ✔ https://github.com/EngineeringSoftware/CoditT5
SYNSHINE [4] Java BlackBox zenodo valid ✔ ✔ ✔ https://zenodo.org/record/4572390#.Y4CY8xRByUk
Verifix [5] C ITSP GitHub valid ✔ ✔ ✔ https://github.com/zhiyufan/Verifix
Cache [102] Java wang et al. [195], Tian et al. [179], ManySStuBs4J GitHub valid ✔ ✔ ✘ https://github.com/Ringbo/Cache
Wang et al. [198] Java Wang et al. [198] GitHub valid ✔ ✔ ✔ https://github.com/anonymous0903/patch_correctness
Quatrain [181] Java Defects4J, Bugs.jar, Bears GitHub valid ✔ ✔ ✔ https://github.com/Trustworthy-Software/Quatrain
Shibboleth [55] Java Defects4J GitHub valid ✔ ✔ ✔ https://github.com/ali-ghanbari/shibboleth
Tian et al. [180] Java Tian et al. [180] GitHub valid ✔ ✔ ✔ https://github.com/HaoyeTianCoder/Panther
https://iclr2018anon.github.io/semantic_code_repair/index.
SSC [38] Python Devlin et al. [38] GitHub valid ✘ ✘ ✘
html

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:49

Huang et al. [68] Java,C,C++ Juliet Test Suite GitHub valid ✔ ✔ ✔ https://github.com/shan-huang-1993/PLC-Pyramid
1:50 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

• Hosting Site. This information indicates which hosting site the available artifact is uploaded
to for public access (e.g., GitHub or Google), if the artifact link is presented in the paper. The
detailed information is listed in the third column.
• Link Accessibility. This information indicates whether the provided link is accessible, such
that we can download the artifacts. The detailed information is listed in the fourth column.
• Source Code Available (SA). This information indicates whether the source code (e.g.,
training and evaluation scripts) is available in the artifacts. The detailed information is listed
in the fifth column.
• Dataset Available (DA). This information indicates whether the dataset (e.g., raw data
and training data) is available in the artifacts. The detailed information is listed in the sixth
column.
• Trained Model Available (TA). This information indicates whether the trained model (e.g.,
raw data and training data) is available in the artifacts. The detailed information is listed in
the seventh column.
We also list the programming languages targeted by the tools in the second column and list the
accessible URL links in the last column. After carefully checking the collected papers, we find that
only a few of the papers have made their source code available to the public. For convenient public
access, a majority of papers upload their works to GitHub. The possible reason is that GitHub
is the most popular platform to host open-source code publicly. Meanwhile, we find that several
papers fail to provide the source code, dataset, or already trained model [176, 228]. The possible
reasons may be (1) the artifacts need to be refactored or reorganized for public availableness; (2)
the artifacts are used for further studies; and (3) the artifacts are lost due to some accidents. We also
find while the artifacts are available, some studies cannot be reproduced because (1) the missing of
default hyperparameters11 ; (2) the complexity of environment settings for training12 ; and (3) the
insufficiency of documentation to reproduce the experiments13 .

✎ Summary ▶ Overall, compared with traditional APR, the need for high-quality artifacts in
learning-based APR is even more vital for replication and future research. On the one hand, the
learning-based APR usually involves abundant training time and expensive equipment (e.g.,
GPUs) to train a repair model, and thus it is much harder to reproduce existing works. On the
other hand, some learning-based APR models require complex environment settings (e.g., the
best hyperparameters and the random seed) and some authors may fail to provide high-quality
code. In contrast, traditional APR results are typically more straightforward and deterministic to
reproduce when provided with open-source code and data. Therefore, we hope that researchers
in the learning-based APR community can provide high-quality open-source code and detailed
instructions to construct a unified repair framework for convenient reproduction. ◀

7.4 The Latest Advancements

While the scope of this survey encompasses literature up to Nov 2022, it is worth noting that
there have been significant advancements in the field of learning-based APR during 2023. For
example, notable papers presented at prominent conferences such as ICSE and ASE have introduced
innovative learning-based APR approaches, especially many of which leverage pre-trained models
to achieve remarkable results. Although a comprehensive review and analysis of these recent
works are beyond the scope of this paper, we acknowledge their contributions to the field and

11 https://github.com/lin-tan/CoCoNut-Artifact/issues/11
12 https://github.com/pkuzqh/Recoder/issues/11
13 https://github.com/ICSE-2019-AUTOFIX/ICSE-2019-AUTOFIX/issues/5

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:51

recognize them as pivotal developments that will shape future research in APR. In the following,
we summarize some recent studies for a timely understanding of the latest advancements.
In line with Section 4.3, code context provides necessary information for repair models to
generate correct patches and plays a vital role in the learning-based APR workflow. However,
existing approaches mainly extract code in close proximity to the buggy statement within the
enclosing file, class, or method, without any analysis to find actual relations with the bug. Sintaha et
al. [169] propose a learning-based APR approach Katana, which employs a program slicing-based
approach to analyze code context in program repair. Particularly, Katana designs a dual slicing
strategy to analyze statements that have a control or data dependency on the buggy statement.
In line with Section 4.4, Zhu et al. [243] further propose Tare built upon their previous graph-
based APR approach Recoder, a type-aware model for program repair to learn the typing rules.
Compared with Recoder, Tare replaces the grammar in Recoder with a T-Grammar that integrates
the type information into a standard grammar, and replaces the neural components of Recoder
encoding ASTs with neural components encoding T-Graphs, which is a heterogeneous graph with
attributes. Besides, Jiang et al. [72] propose KNOD, a learning-based APR approach based on a three-
stage tree decoder and a domain-rule distillation. The first tree decoder directly generates ASTs
of patched code according to the inherent tree structure and The second domain-rule distillation
leverages syntactic and semantic rules and teacher-student distributions to explicitly inject the
domain knowledge into the decoding procedure during both the training and inference phases.
In line with Section 4.6, Xiao et al. [210, 211] systematically investigate whether existing muta-
tion testing acceleration techniques are suitable for general-purpose patch validation. They then
introduce ExpressAPR, a patch validation framework by designing two adaption strategies, i.e.,
execution scheduling and interception-based instrumentation. The experimental results on four
previous APR approaches (including the learning-based one Rcoder) demonstrate that ExpressAPR
is able to reduce patch validation time significantly.
In line with Section 4.8, some domain approaches are proposed to address the repair problem for
various bug types. For example, So et al. [171] propose SmartFix, a learning-based technique for
repairing vulnerable smart contracts. SmartFix employs statistical models to intelligently guide
the repair procedure, so as to prioritize candidate patches that are helpful in finding desired
safe contracts. At the same time, Fan et al. [48] systematically investigate whether existing APR
techniques (e.g., Recoder [242]) can fix the incorrect solutions produced by pre-trained models in
LeetCode contests. Besides, First et al. [50] propose Baldur, an automated whole-proof generation
and repair approach on top of a large pre-trained model. Baldur first generates whole formal
proofs by a proof generation model trained on natural language text and code and fine-tuned on
proofs. Baldur then combines this proof generation model with a fine-tuned repair model to repair
incorrectly generated proofs, further increasing proving power.
In line with Section 5, there exist some recent approaches proposed to explore how to transfer
domain bug-fixing knowledge into the pre-trained model-based patch generation process. The
first example is RAP-Gen [196], a retrieval-augmented program repair approach on top of a pre-
trained CodeT5 model. RAP-Gen retrieves a relevant bug-fixing pair from an external codebase to
augment the buggy input for the CodeT5 patch generator. The second example is FitRepair [207], a
CodeT5-based APR approach that incorporates domain-specific knowledge with the insights of
the plastic surgery hypothesis. FitRepair designs two domain-specific fine-tuning strategies and
one prompting strategy to leverage the hypothesis from the buggy projects. The third example is
Repilot [201], which helps pre-trained models generate more valid patches through a completion
engine. Repilot employs the interaction between a pre-trained model and a completion engine to
generate candidate patches by first pruning away infeasible tokens suggested by the pre-trained
model and then completing the token based on the suggestions provided by the completion engine.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:52 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

In line with Section 6.3, some empirical studies further explore the actual performance of learning-
based APR from different aspects. For example, Jiang et al. [71] empirically evaluate the fixing
capabilities of pre-trained models with and without fine-tuning for the APR task, involving ten
pre-trained models and four benchmarks. Zhang et al. [235] conduct an extensive empirical study
to investigate how pre-trained models are applied to vulnerability repair in the workflow (i.e., data
pre-processing, model training and repair inference) and further propose an enhanced approach
with bug-fixing transfer learning, involving more than 100 variants of fine-tuned models. Similarly,
Wu et al. [206] conduct an extensive study to evaluate the fixing capabilities of five pre-trained
models and four learning-based APR approaches on real-world Java vulnerabilities.
In line with Section 7.2, there exist some approaches proposed to combine traditional APR and
recent learning-based APR. For example, Zhang et al. [236] propose GAMMA, a template-based
program repair approach on top of the advance of fix patterns from traditional template-based
APR and mask prediction from pre-trained models. Similarly, Meng et al. [131] propose TENURE,
a novel template-and-learning-based program repair approach by combining the template-based
and NMT-based methods. Importantly, both GAMMA and TENURE preliminarily demonstrate
the prospect of combining the advances of traditional APR and DL models. At the same time,
Parasaram et al. [150] propose RETE, which aims to navigate the search space of patches by
learning project-independent information about the program namespace. RETE first employs repair
patterns to generate candidate patches and prioritize patches by learning rich semantic information
about the project namespace.

✎ Summary ▶ Overall, these latest research findings further demonstrate the timeliness and
comprehensiveness of our survey. Importantly, the most apparent trend is the increasing use of
pre-trained models, including enhanced pre-trained model-based approaches, empirical studies
on diverse bug types, and the combination with traditional APR. Besides, there exist some
studies focusing on optimizing other components of the repair process, such as code context
and patch validation acceleration. ◀

8 IMPLICATION AND GUIDELINES

Our study reveals the following important practical guidelines for future learning-based APR.
I&G❶: Multifarious Code Representation. As discussed in Section 4.4, inspired by the advance
of neural machine translation in NLP, early learning-based APR work usually treats source code
as a sequence of code tokens. The follow-up work has begun to consider complex code features,
such as code edit [242], AST [98], and data flow graph [142]. For example, CIRCLE [228] which
treats the APR as a simple machine translation task on code sequences, and Recoder [242] which
is equipped with a syntax-guided edit decoder, are able to fix 64 and 65 real-world software bugs
from the Defects4J benchmark, respectively. Such observation indicates that there does not always
exist a specific code representation to demonstrate good performance, e.g., a simple sequence
representation can also yield excellent results. It is difficult to directly investigate the advantages
and disadvantages of different code representations because each code representation comes with
differentiated configurations, such as model architectures.
We recommend that future work can be conducted in the following three directions. First, it
is crucial to conduct a systematic study to explore the impact of different code representations
under various configurations, e.g., model architectures. Second, we find there exists a mass of code
representation ways in existing learning-based APR techniques, future work needs to design optimal
code representation based on specific scenarios. For example, researchers can design optimal code
representations based on specific programming languages, types of bugs and benchmarks. Third,

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:53

considering that most existing repair work statically extracts the buggy and contextual features,
it is promising to incorporate the static code representation features (e.g., AST) and dynamic
execution feedback (e.g., test results). In this way, the NMT model and the repair process can be
more deeply integrated to fit the APR scenario. Fourth, with the rise of pre-trained models, the
community has seen the usage of prompt-based representation of feeding inputs to pre-trained
models to facilitate the repair task, e.g., CIRCLE [228]. However, research about prompt-based
representation in the repair domain is still in its early stages, mainly focusing on fine-tuning [228].
In the future, researchers can draw from other code-related fields [140, 235] to further deepen the
understanding of how the knowledge of pre-trained models can be stimulated to support repair
tasks with appropriate prompt representation.
I&G❷: Patch Validation Acceleration. As discussed in Section 4.6, dynamic execution is the
common practice to validate candidate patches in the APR community. Although some techniques
have been proposed to speed up patch validation [14, 25], it is time-consuming to dynamically
execute all candidate patches against each test case. Besides, existing patch validation studies in
learning-based APR are general to both traditional and learning-based APR communities.
We recommend that future research can be conducted from three aspects. First, it is promising to
extensively investigate the differences between patches generated by traditional and learning-based
APR techniques, based on which more advanced patch validation techniques can be designed
that are targeted at learning-based APR techniques. Second, predictive patch validation can be
conducted on top of the code semantic understanding capability of DL techniques, i.e., predictive
patch validation. For example, automatically learning patched code features and predicting whether
a patch is passed by previous failing test cases without dynamic execution is promising. Third, we
notice that other fields also suffer from the problem of dynamic program execution overhead, such
as mutation testing (both mutants and patches are considered variants of a program). Therefore,
some advanced techniques from these similar fields can also be migrated into patch validation. For
example, Wang et al. [188, 189] detect equivalencies in mutant execution and execute one for each
equivalence class, which is general and applicable to patch validation.
I&G❸: Training Dataset Construction. As discussed in Section 6.1, in contrast to traditional
APR techniques, learning-based techniques heavily rely on the quality of the training dataset. A
majority of existing techniques mine bug-fixing pairs from open-source code repositories (e.g.,
GitHub) and build their own datasets. However, the training dataset is usually collected by auto-
mated tools (e.g., extracting commit by fix-related keywords) and then inspected by some filtering
rules (e.g., more than five Java files) [242], which means the quality of the training dataset can
be variant. Many training datasets contain noise (e.g., CoCoNut contains a number of duplicated
samples) that may reduce the performance of the model. Besides, the number of training samples
in different techniques varies greatly (e.g., 3,241,966 in CoCoNut [115] and 2,000 in DLFix [98]).
These concerns may introduce bias when comparing and analyzing learning-based techniques.
We recommend approaching future work in two parts. First, a unified standard for training
datasets should be built to reduce the burden on researchers when they propose a novel learning-
based APR technique. Second, with a standardized training dataset, researchers can uniformly
evaluate the performance of different repair models across various settings, such as code represen-
tations, model architectures, and training hyperparameters.
I&G❹: Practical Evaluation Metrics. As discussed in Section 6.2, when evaluating repair
performance, dynamic execution-based metrics (e.g., plausible patches) are the common practice in
the APR community. However, such metrics may suffer from some drawbacks. First, they need to
execute all available functional test suites against each patched software program, consuming a
significant amount of execution time. Section 4.6 lists some candidate patch validation acceleration
techniques to mitigate this issue. Second, due to the overfitting problem, developers are required to

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:54 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

further perform a manual inspection to assess the correctness of plausible patches, which demands
a substantial amount of human resources and is prone to errors. The overfitting problem leads
to the development of some patch correctness assessment techniques in Section 4.6. Third, the
dynamic execution heavily relies on well-constructed datasets including the corresponding fault-
triggering test cases. However, such test cases are often unavailable in practical scenarios, making
it challenging to rely solely on such metrics. We encourage further work to explore more practical
metrics to evaluate the repair performance of NMT models. For example, it is interesting to design
a hybrid metric by combining dynamic execution and static match.
We find an increasing number of recent learning-based APR techniques [51, 184] rely on static
match-based metrics (e.g., Accuracy and BLUE) to perform evaluation (mentioned in Section 6.2).
However, such match-based metrics are usually derived from the NLP domain (e.g., neural machine
translation) and fail to consider that a program’s functionality can be implemented in various ways,
such as different algorithms, data structures, or data flows. In the future, the community needs
large-scale empirical work to validate whether the match-based metrics can accurately reflect the
repair capability of NMT APR models. Besides, the two types of evaluation metrics (i.e., dynamic
execution vs. static match) are orthogonal and have their own advantages and disadvantages. We
suggest that the relationships between the recent static match-based and the classical dynamic test
execution-based metrics need to be studied in the future.
I&G❺: Exploring Patch Overfitting Issue. Similar to traditional APR techniques, learning-
based techniques usually adopt available test suites to filter incorrect candidate patches. However,
the test suite is an incomplete specification under the program behavioral space. The plausible
patches passing the existing test suite may not satisfy the expected outputs of potential test suites,
leading to a long challenge in APR (i.e., the overfitting issue). Considering the learning-based
APR is an end-to-end repair paradigm (in a black-box manner), which is different from traditional
techniques adopting test suites to guide the repair process, the overfitting issue in learning-based
APR is more significant and severe. Recently, researchers have adopted DL techniques (e.g., code
embedding [102, 179]) to predict the correctness of plausible patches, which is a promising direction
to address overfitting problems.
We recommend that future work can be conducted from three aspects. The first recommendation
lies in the process of patch generation. It is possible to design advanced code-aware NMT models
that incorporate more code information (e.g., code structure information or dynamic execution
information) to generate high-quality code snippets. The second recommendation is the process of
patch correctness. Investigating how to better utilize DL techniques to differentiate between correct
patches and overfitting patches is worth exploring. For example, we can incorporate contrastive
learning into existing learning-based patch correctness assessment approaches, as contrastive
learning is shown to be effective in distinguishing positive samples (i.e., correct patches) and
negative samples (i.e., overfitting patches). The third recommendation is the repair paradigm.
Previous work [107, 208] has shown that fix templates can generate higher-quality code snippets
with high precision. We believe that combining DL techniques with fix patterns as a novel repair
paradigm can address this issue in previous learning-based APR techniques.
I&G❻: Unified Localization and Repair workflow. As discussed in Section 4.2, similar to
traditional APR techniques, existing learning-based techniques usually consider fault localization
as an additional step in the repair process and adopt off-the-shelf fault localization tools (e.g., SBFL)
to identify suspicious code element, which is the input of NMT repair models. In the literature,
these two tasks (i.e., fault localization and patch generation) are developing in their own respective
fields so far and little work has explored their potential relationship. Recently, Ni et al. [143]
propose CompDefect to handle defect prediction and repair simultaneously. The powerful capacity

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:55

of DL to learn the semantic information of source code for fault localization [97, 112] and program
repair [228, 242] makes it possible to combine the two tasks.
We suggest that future works focus on a unified repair process interactively incorporating fault
localization and patch generation. The fault localization results can be improved with the feedback
from the patch generation, while the updated localization results can assist in generating patches
more effectively. Different from previous studies that treat the two tasks as separate, the unified
repair facilitates interaction between the two tasks, enabling feedback-driven improvements in
both localization and repair performance iteratively.
I&G❼: Combination with Traditional APR Techniques. As discussed in Section 4.4, existing
DL techniques are usually adopted as a patch generator in the learning-based APR workflow, which
takes the buggy code snippets as inputs and returns a ranked list of candidate patches. Despite
remarkable progress, such learning-based APR techniques need to generate correct code snippets
from scratch and are developed separately from traditional APR techniques. Previous work [242]
has demonstrated that learning-based APR is complementary to traditional repair techniques in
terms of fixed bugs.
Future work can be conducted in two aspects. First, it is interesting to design a predictive
APR technique to predict the optimal traditional or learning-based APR technique for a given
buggy project based on the program analysis. Second, it is flexible to integrate DL techniques into
traditional APR techniques as a component instead of developing a brand-new end-to-end patch
generator. For example, a state-of-the-art template-based APR tool TBar retrieves relevant donor
code from the local buggy file and may fail to generate correct patches with inappropriate donor
code with the correct fix pattern. Researchers can boost existing template-based APR techniques
(e.g., TBar) via pre-trained models, which contain generic knowledge pre-trained with millions of
code snippets from open-source projects, and provide a variety of donor code to fix different bugs.
I&G❽: Exploring Domain Repair Techniques. As discussed in Section 4.4, a majority of
learning-based APR techniques focus on semantic bugs, which have been investigated intensively
in the literature. Section 4.4 also summarizes a number of existing repair techniques considering
other types of bugs, such as security vulnerabilities and programming assignments. However, these
studies only account for a small proportion of existing techniques, and the types of investigated
bugs are also very limited.
We recommend that future work can be carried out from two perspectives. First, it is promising
to design more domain-specific learning-based APR techniques in repairing other diverse scenarios,
e.g., test repair, concurrency program repair, and API misuse repair. Second, we find the community
usually treats fixing these types of bugs as separate tasks. SequenceR [27] has demonstrated that
NMT-based models only trained on a limited bug-fixing corpus can already fix notable vulnerabilities.
These results indicate that bug fixing and vulnerability repair both aiming to fix errors in the source
code have a high degree of similarity, and the knowledge learned from bug fixing can be well
transferred to vulnerability repair. Such observation motivates that some bugs with different
types are very similar in both code patterns and repair workflow. Thus, future researchers are
recommended to explore their potential relationship and investigate whether these bugs can benefit
each other. Besides, it is promising to conduct some empirical studies to migrate existing mature
learning-based APR techniques to other scenarios, such as automated vulnerability repair.
I&G❾: Explainable Patch Generation. As discussed in Section 4.4, existing learning-based
APR techniques usually perform an end-to-end patch generation in a black-box manner, i.e.,
automatically transforming the buggy code snippets into correct ones on top of an NMT model.
The developers are unaware of why NMT models predict such results, thus unsure about the
reliability of these generated patches, hindering the adoption of repair NMT models in practice.
In the literature, a majority of studies focus on improving repair accuracy, while minor focus on

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:56 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

improving the explainability of such NMT models. In the future, advanced explainable techniques
can be considered to make the predictions of NMT repair models more practical, explainable, and
actionable.
We suggest that future work should concentrate on two aspects to support the understanding of
NMT models for program repair: the attention mechanism and input perturbation mechanisms.
As a white-box method, the attention mechanism generates explanations by assigning weights to
different parts of the input, thus indicating an attribution of importance for the prediction. On the
other hand, the input perturbation mechanism is a black-box method to modify the input data and
observe variations in the model’s output, helping to understand which parts of the input the model
deems most crucial.
I&G❿: Pre-trained Model-based APR Research. As discussed in Section 5, an increasing
number of APR studies are focusing on employing pre-trained language models to generate patches.
We have already seen pre-trained models being successfully applied to the APR domain with
promising results [208, 237]. In the future, pre-trained models will still be the main trend for follow-
up research, and there is still a lot of room for further improvement. We stress the importance of
conducting more research into pre-trained models to deepen our understanding of the existing
challenges in developing APR techniques. We describe the relevant topics in the following.

(1) Patch Correctness via Pre-trained Models. Recently, the research for generating patches on top of
pre-trained models is developing rapidly. However, patch correctness, as an important research
direction in the APR community, has not benefited much from these pre-trained models. For
example, Tian et al. [179] simply regard BERT as an embedding representation approach without
investigating the benefits of the pre-training component itself. We believe that future work
can be conducted to employ the rich programming knowledge contained in pre-trained models
to identify the relationship between correct patches and overfitting patches. For example, it
is promising to employ the pre-trained model as a component in existing patch validation
techniques. Besides, researchers can directly treat the patch correctness assessment as a code
classification task, and fine-tune off-the-shelf pre-trained models on patch-specific datasets.
(2) Repair-oriented Pre-trained Model We have seen an increasing number of pre-trained models in
the APR field. In the literature, the majority of these pre-trained models are designed with a
general-purpose pre-training approach to facilitate a variety of downstream tasks. However,
considering the distinct difference between these downstream tasks, the universal pre-trained
model may hinder the effectiveness of program repair. For example, these models usually
focus on code-related tasks to encode a given code snippet, such as code search and code
summarization. Specifically, the designed pre-training tasks (e.g., masked language modeling)
typically deal with a code snippet as the input, and the key challenge is to capture the syntactic
and semantic information of the code snippet. However, APR deals with two code snippets and
the key challenge is to understand the code change patterns in bug-fixing pairs. The learned
knowledge in existing pre-trained models is generally related to the syntactic and semantic
information of code snippets, which can hardly be exploited to encode bug-fixing pairs. Thus,
employing existing pre-trained models for APR will inevitably lead to inconsistent inputs
and objectives between pre-training and fine-tuning. It is sub-optimal to fine-tune existing
pre-trained code models for APR due to the natural differences between pre-training objectives
and APR. We recommend future work to explore domain-specific models for APR. For example,
the researcher can propose a repair-oriented pre-trained model, which takes two code snippets
as inputs to learn the domain knowledge about code change patterns with bug-fixing specific
pre-training objectives.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:57

(3) Trade-off between Effectiveness and Model Size. In the literature, recent learning-based APR
techniques tend to employ the growing size of models, achieving better performance. Xia et
al. [208] have demonstrated that larger models usually repair a greater number of software bugs,
highlighting the promising future of pre-trained models for APR. However, such large models
are difficult to deploy in the development workflow. Besides, with the release of ever-larger
models, there may exist a barrier in the trade-off between effectiveness and model size. In fact,
most existing pre-trained models in the APR literature (e.g., CIRCLE [228], AlphaRepair [209]
and VulRepair [51]) usually treat source code as natural language (i.e., code sequence), which
cannot capture the code structure features. In the future, investigating how to bring in code
features and program analysis (e.g., data flow or control flow) in pre-training may be a flexible
strategy instead of employing a larger mode size.
(4) Practical Pre-trained Repair model. As discussed in Section 5, an increasing number of learning-
based APR techniques attempt to generate candidate patches by large pre-trained language
models. Although remarkable progress is obtained, such repair models contain millions or even
billions of parameters. For example, CodeBERT has 125 million parameters and 476 MB model
size in total. It is significant to deploy these models in modern IDEs to assist developers during
software development and maintenance. However, these repair models consume huge device
resources and run slowly in the development workflow (e.g., IDEs), limiting their application
in practice. In the future, it is necessary to reduce the size of these repair models to deploy in
real-world scenarios while maintaining comparable prediction accuracy, such as model pruning
and knowledge distillation.
(5) Pre-trained Model-based Repair Chatbot. At the current stage, the goal of most learning-based
APR techniques is to automatically generate patches that pass available test cases without
human intervention, similar to traditional APR techniques. However, there are some long-term
challenges in deploying these APR techniques directly into the development process, such as
the low recall of repaired bugs and the low precision of correct patches [101]. Recently, the
natural language understanding capabilities of large pre-trained models (e.g., ChatGPT) have
provided a new direction, i.e., conversation-driven repair. Specifically, we can employ the large
pre-trained model as a repair chatbot, which can converse with developers just like a human to
provide potential fix suggestions. In such a human-machine conversation process, developers
can tell the repair chatbot useful debugging information, such as suspicious code statements and
bug reports. More importantly, the patches generated by the repair chatbot can be validated by
developers and external devices (e.g., static analysis tools and compilers), and then the feedback
(e.g., dynamic execution information) can be provided to the chatbot for further optimization.

9 CONCLUSION
APR techniques address the long-standing challenge of fixing software bugs automatically, and
alleviate manual debugging effort significantly, which promotes software testing, validation, and
debugging practices. In the last couple of years, learning-based APR techniques have achieved
promising results, demonstrating the substantial potential of using DL techniques for APR.
In this paper, we provide a comprehensive survey of existing learning-based APR techniques. We
describe the typical learning-based repair framework, involving fault localization, data pre-processing,
patch generation, patch ranking, validation and correctness components. We summarize how ex-
isting learning-based techniques design strategies for these crucial components. We discuss the
metrics, datasets and empirical studies in the learning-based APR community. Finally, we point out
several challenges (such as overfitting issues) and provide possible directions for future study.

ACKNOWLEDGMENTS

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:58 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

The authors would like to thank the anonymous reviewers for their insightful comments. This work
is supported partially by the National Natural Science Foundation of China (61932012, 62141215,
62372228), CCF-Huawei Populus Grove Fund (CCF-HuaweiSE202304, CCF-HuaweiSY202306), and
Science, Technology and Innovation Commission of Shenzhen Municipality (CJGJZD20200617103001003).

REFERENCES
[1] Rui Abreu, Peter Zoeteweij, and Arjan JC Van Gemund. 2007. On the Accuracy of Spectrum-based Fault Localization. In
Testing: Academic and Industrial Conference Practice and Research Techniques-MUTATION (TAICPART-MUTATION’07).
IEEE, 89–98.
[2] Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program
Understanding and Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies. 2655–2668.
[3] Toufique Ahmed, Premkumar Devanbu, and Vincent J Hellendoorn. 2021. Learning Lenient Parsing & Typing Via
Indirect Supervision. Empirical Software Engineering (EMSE) 26, 2 (2021), 1–31.
[4] Toufique Ahmed, Noah Rose Ledesma, and Premkumar Devanbu. 2022. Synshine: Improved Fixing of Syntax Errors.
IEEE Transactions on Software Engineering (TSE) (2022).
[5] Umair Z Ahmed, Zhiyu Fan, Jooyong Yi, Omar I Al-Bataineh, and Abhik Roychoudhury. 2022. Verifix: Verified Repair
of Programming Assignments. ACM Transactions on Software Engineering and Methodology (TOSEM) (2022).
[6] Umair Z Ahmed, Pawan Kumar, Amey Karkare, Purushottam Kar, and Sumit Gulwani. 2018. Compilation Error
Repair: For the Student Programs, from the Student Programs. In Proceedings of the 40th International Conference on
Software Engineering: Software Engineering Education and Training (ICSE-SEET’18). 78–87.
[7] Miltiadis Allamanis, Henry Jackson-Flux, and Marc Brockschmidt. 2021. Self-supervised Bug Detection and Repair.
Advances in Neural Information Processing Systems (NeurIPS’21) 34, 27865–27876.
[8] Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. Code2vec: Learning Distributed Representations of
Code. Proceedings of the ACM on Programming Languages (POPL’19) 3, POPL (2019), 1–29.
[9] Nathaniel Ayewah, William Pugh, David Hovemeyer, J David Morgenthaler, and John Penix. 2008. Using Static
Analysis to Find Bugs. IEEE Software 25, 5 (2008), 22–29.
[10] Johannes Bader, Andrew Scott, Michael Pradel, and Satish Chandra. 2019. Getafix: Learning to Fix Bugs Automatically.
Proceedings of the ACM on Programming Languages (OOPSLA’19) 3, OOPSLA (2019), 1–27.
[11] Benoit Baudry, Zimin Chen, Khashayar Etemadi, Han Fu, Davide Ginelli, Steve Kommrusch, Matias Martinez, Martin
Monperrus, Javier Ron, He Ye, et al. 2021. A Software-repair Robot Based on Continual Learning. IEEE Software 38, 4
(2021), 28–35.
[12] Nazanin Bayati Chaleshtari and Saeed Parsa. 2020. Smbfl: Slice-based Cost Reduction of Mutation-based Fault
Localization. Empirical Software Engineering (EMSE) 25, 5 (2020), 4282–4314.
[13] Samuel Benton, Xia Li, Yiling Lou, and Lingming Zhang. 2020. On the Effectiveness of Unified Debugging: An
Extensive Study on 16 Program Repair Systems. In 2020 35th IEEE/ACM International Conference on Automated
Software Engineering (ASE’20). IEEE, 907–918.
[14] Samuel Benton, Yuntong Xie, Lan Lu, Mengshi Zhang, Xia Li, and Lingming Zhang. 2022. Towards Boosting Patch
Execution On-the-fly. In Proceedings of the 44th International Conference on Software Engineering (ICSE’22). 2165–2176.
[15] Berkay Berabi, Jingxuan He, Veselin Raychev, and Martin Vechev. 2021. Tfix: Learning to Fix Coding Errors with a
Text-to-text Transformer. In International Conference on Machine Learning (ICML’21). PMLR, 780–791.
[16] Guru Bhandari, Amara Naseer, and Leon Moonen. 2021. Cvefixes: Automated Collection of Vulnerabilities and Their
Fixes from Open-source Software. In Proceedings of the 17th International Conference on Predictive Models and Data
Analytics in Software Engineering (PROMISE’21). 30–39.
[17] Sahil Bhatia, Pushmeet Kohli, and Rishabh Singh. 2018. Neuro-symbolic Program Corrector for Introductory
Programming Assignments. In 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE’18). IEEE,
60–70.
[18] Sahil Bhatia and Rishabh Singh. 2016. Automated Correction for Syntax Errors in Programming Assignments using
Recurrent Neural Networks. arXiv preprint arXiv:1603.06129 (2016).
[19] Marcel Böhme, Charaka Geethal, and Van-Thuan Pham. 2020. Human-in-the-loop Automatic Program Repair. In 2020
IEEE 13th International Conference on Software Testing, Validation and Verification (ICST’20). IEEE, 274–285.
[20] CO Boulder. 2019. University of Cambridge Study: Failure to Adopt Reverse Debugging Costs Global Economy $41
Billion Annually.
[21] Tom Britton, Lisa Jeng, Graham Carver, and Paul Cheak. 2013. Reversible Debugging Software “quantify the Time
and Cost Saved Using Reversible Debuggers”. (2013).

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:59

[22] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan,
Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models Are Few-shot Learners. In Proceedings of
the Advances in Neural Information Processing Systems (NeurIPS’20), Vol. 33. 1877–1901.
[23] Saikat Chakraborty, Yangruibo Ding, Miltiadis Allamanis, and Baishakhi Ray. 2022. Codit: Code Editing with Tree-
based Neural Models. IEEE Transactions on Software Engineering (TSE) 48, 4 (2022), 1385–1399. https://doi.org/10.
1109/TSE.2020.3020502
[24] Saikat Chakraborty and Baishakhi Ray. 2021. On Multi-modal Learning of Editing Source Code. In 2021 36th IEEE/ACM
International Conference on Automated Software Engineering (ASE’21). IEEE, 443–455.
[25] Lingchao Chen, Yicheng Ouyang, and Lingming Zhang. 2021. Fast and Precise On-the-fly Patch Validation for All. In
2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE’21). IEEE, 1123–1134.
[26] Liushan Chen, Yu Pei, Minxue Pan, Tian Zhang, Qixin Wang, and Carlo Alberto Furia. 2022. Program Repair with
Repeated Learning. IEEE Transactions on Software Engineering (TSE) (2022).
[27] Zimin Chen, Steve Kommrusch, Michele Tufano, Louis-Noël Pouchet, Denys Poshyvanyk, and Martin Monperrus.
2019. Sequencer: Sequence-to-sequence Learning for End-to-end Program Repair. IEEE Transactions on Software
Engineering (TSE) 47, 9 (2019), 1943–1959.
[28] Zimin Chen, Steve James Kommrusch, and Martin Monperrus. 2022. Neural Transfer Learning for Repairing Security
Vulnerabilities in C Code. IEEE Transactions on Software Engineering (TSE) (2022).
[29] Zimin Chen and Martin Monperrus. 2018. The Codrep Machine Learning on Source Code Competition. arXiv preprint
arXiv:1807.03200 (2018).
[30] Darshak Chhatbar, Umair Z Ahmed, and Purushottam Kar. 2020. Macer: A Modular Framework for Accelerated
Compilation Error Repair. In Artificial Intelligence in Education: 21st International Conference, AIED 2020, Ifrane,
Morocco, July 6–10, 2020, Proceedings, Part I. Springer, 106–117.
[31] Jianlei Chi, Yu Qu, Ting Liu, Qinghua Zheng, and Heng Yin. 2022. Seqtrans: Automatic Vulnerability Fix Via Sequence
to Sequence Learning. IEEE Transactions on Software Engineering (TSE) (2022).
[32] Matteo Ciniselli, Nathan Cooper, Luca Pascarella, Antonio Mastropaolo, Emad Aghajani, Denys Poshyvanyk, Mas-
similiano Di Penta, and Gabriele Bavota. 2021. An Empirical Study on the Usage of Transformer Models for Code
Completion. IEEE Transactions on Software Engineering (TSE) 1, 1 (2021), 1–1.
[33] Aidan Connor, Aaron Harris, Nathan Cooper, and Denys Poshyvanyk. 2022. Can We Automatically Fix Bugs by
Learning Edit Operations?. In 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering
(SANER’22). IEEE, 782–792.
[34] Viktor Csuvik, Dániel Horváth, Ferenc Horváth, and László Vidács. 2020. Utilizing Source Code Embeddings to
Identify Correct Patches. In Proceedings of the 2nd IEEE International Workshop on Intelligent Bug Fixing (IBF’20).
18–25.
[35] Viktor Csuvik, Dániel Horváth, Márk Lajkó, and László Vidács. 2021. Exploring Plausible Patches using Source Code
Embeddings in Javascript. In 2021 IEEE/ACM International Workshop on Automated Program Repair (APR’22). IEEE,
11–18.
[36] Rajdeep Das, Umair Z Ahmed, Amey Karkare, and Sumit Gulwani. 2016. Prutor: A system for Tutoring CS1 and
Collecting Student Programs for Analysis. arXiv preprint arXiv:1608.03828 (2016).
[37] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of Deep Bidirectional
Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19). Association for
Computational Linguistics, 4171–4186.
[38] Jacob Devlin, Jonathan Uesato, Rishabh Singh, and Pushmeet Kohli. 2017. Semantic Code Repair Using Neuro-symbolic
Transformation Networks. arXiv preprint arXiv:1710.11054 (2017).
[39] Elizabeth Dinella, Hanjun Dai, Ziyang Li, Mayur Naik, Le Song, and Ke Wang. 2020. Hoppity: Learning Graph
Transformations to Detect and Fix Bugs in Programs. In International Conference on Learning Representations (ICLR).
[40] Yangruibo Ding, Baishakhi Ray, Premkumar Devanbu, and Vincent J Hellendoorn. 2020. Patching As Translation: The
Data and the Metaphor. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE’20).
IEEE, 275–286.
[41] Dawn Drain, Colin B Clement, Guillermo Serrato, and Neel Sundaresan. 2021. Deepdebug: Fixing Python Bugs Using
Stack Traces, Backtranslation, and Code Skeletons. arXiv preprint arXiv:2105.09352 (2021).
[42] Dawn Drain, Chen Wu, Alexey Svyatkovskiy, and Neel Sundaresan. 2021. Generating Bug-fixes Using Pretrained
Transformers. In Proceedings of the 5th ACM SIGPLAN International Symposium on Machine Programming (MAPS’21).
1–8.
[43] Thomas Durieux, Fernanda Madeiral, Matias Martinez, and Rui Abreu. 2019. Empirical Review of Java Program
Repair Tools: A Large-scale Experiment on 2,141 Bugs and 23,551 Repair Attempts. In Proceedings of the 27th ACM
Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:60 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

(ESEC/FSE’19). 302–313.
[44] Thomas Durieux and Martin Monperrus. 2016. Dynamoth: Dynamic Code Synthesis for Automatic Program Repair.
In Proceedings of the 11th International Workshop on Automation of Software Test (AST’16). 85–91.
[45] Thomas Durieux and Martin Monperrus. 2016. IntroClassJava: A Benchmark of 297 Small and Buggy Java Programs.
Technical Report hal-01272126. Universite Lille 1.
[46] Jean-Rémy Falleri, Floréal Morandat, Xavier Blanc, Matias Martinez, and Martin Monperrus. 2014. Fine-grained
and Accurate Source Code Differencing. In Proceedings of the 29th ACM/IEEE International Conference on Automated
Software Engineering. 313–324.
[47] Jiahao Fan, Yi Li, Shaohua Wang, and Tien N Nguyen. 2020. A C/c++ Code Vulnerability Dataset with Code Changes
and Cve Summaries. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR’20).
508–512.
[48] Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan. 2023. Automated repair of programs
from large language models. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE,
1469–1481.
[49] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu,
Daxin Jiang, et al. 2020. Codebert: A Pre-trained Model for Programming and Natural Languages. In Findings of the
Association for Computational Linguistics (EMNLP’20). 1536–1547.
[50] Emily First, Markus N Rabe, Talia Ringer, and Yuriy Brun. 2023. Baldur: whole-proof generation and repair with
large language models. arXiv preprint arXiv:2303.04910 (2023).
[51] Michael Fu, Chakkrit Tantithamthavorn, Trung Le, Van Nguyen, and Phung Dinh. 2022. Vulrepair: A T5-based
Automated Software Vulnerability Repair. In the ACM Joint European Software Engineering Conference and Symposium
on the Foundations of Software Engineering (ESEC/FSE’22).
[52] Xiang Gao, Bo Wang, Gregory J Duck, Ruyi Ji, Yingfei Xiong, and Abhik Roychoudhury. 2021. Beyond Tests: Program
Vulnerability Repair Via Crash Constraint Extraction. ACM Transactions on Software Engineering and Methodology
(TOSEM) 30, 2 (2021), 1–27.
[53] Luca Gazzola, Daniela Micucci, and Leonardo Mariani. 2019. Automatic Software Repair: A Survey. IEEE Transactions
on Software Engineering (TSE) 45, 1 (2019), 34–67.
[54] Ali Ghanbari, Samuel Benton, and Lingming Zhang. 2019. Practical Program Repair Via Bytecode Mutation. In
Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA’19). 19–30.
[55] Ali Ghanbari and Andrian Marcus. 2022. Patch Correctness Assessment in Automated Program Repair Based on the
Impact of Patches on Production and Test Code. , 654–665 pages.
[56] Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy,
Shengyu Fu, et al. 2021. Graphcodebert: Pre-training Code Representations with Data Flow. In Proceedings of the 9th
International Conference on Learning Representations (ICLR’21). 1–18.
[57] Rahul Gupta, Aditya Kanade, and Shirish Shevade. 2019. Deep Reinforcement Learning for Syntactic Error Repair in
Student Programs. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI’19), Vol. 33. 930–937.
[58] Rahul Gupta, Soham Pal, Aditya Kanade, and Shirish Shevade. 2017. Deepfix: Fixing Common C Language Errors by
Deep Learning. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI’17). 1345–1351.
[59] Péter Gyimesi, Béla Vancsics, Andrea Stocco, Davood Mazinanian, Arpád Beszédes, Rudolf Ferenc, and Ali Mesbah.
2019. Bugsjs: A Benchmark of Javascript Bugs. In 2019 12th IEEE Conference on Software Testing, Validation and
Verification (ICST). IEEE, 90–101.
[60] Hossein Hajipour, Apratim Bhattacharyya, Cristian-Alexandru Staicu, and Mario Fritz. 2021. Samplefix: Learning to
Generate Functionally Diverse Fixes. In Joint European Conference on Machine Learning and Knowledge Discovery in
Databases ( ECML’21). Springer, 119–133.
[61] Quinn Hanam, Fernando S de M Brito, and Ali Mesbah. 2016. Discovering Bug Patterns in Javascript. In Proceedings
of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering. 144–156.
[62] Jacob Harer, Onur Ozdemir, Tomo Lazovich, Christopher Reale, Rebecca Russell, Louis Kim, et al. 2018. Learning to
Repair Software Vulnerabilities with Generative Adversarial Networks. Advances in Neural Information Processing
Systems (NeurIPS’18) 31.
[63] Hideaki Hata, Emad Shihab, and Graham Neubig. 2018. Learning to Generate Corrective Patches Using Neural
Machine Translation. arXiv preprint arXiv:1812.07170 (2018).
[64] Vincent J Hellendoorn, Charles Sutton, Rishabh Singh, Petros Maniatis, and David Bieber. 2019. Global Relational
Models of Source Code. In International Conference on Learning Representations (ICLR’19).
[65] Yang Hu, Umair Z Ahmed, Sergey Mechtaev, Ben Leong, and Abhik Roychoudhury. 2019. Re-factoring based Program
Repair Applied to Programming Assignments. In 2019 34th IEEE/ACM International Conference on Automated Software
Engineering (ASE). IEEE, 388–398.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:61

[66] Yaojie Hu, Xingjian Shi, Qiang Zhou, and Lee Pike. 2022. Fix Bugs with Transformer through a Neural-symbolic Edit
Grammar. arXiv preprint arXiv:2204.06643 (2022).
[67] Kai Huang, Su Yang, Hongyu Sun, Chengyi Sun, Xuejun Li, and Yuqing Zhang. 2022. Repairing Security Vulnerabili-
ties Using Pre-trained Programming Language Models. In 2022 52nd Annual IEEE/IFIP International Conference on
Dependable Systems and Networks Workshops (DSN-W’22). IEEE, 111–116.
[68] Shan Huang, Xiao Zhou, and Sang Chin. 2021. Application of Seq2seq Models on Code Correction. Frontiers in
artificial intelligence (FRAI) 4 (2021), 590215.
[69] Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. Codesearchnet
challenge: Evaluating the State of Semantic Code Search. arXiv preprint arXiv:1909.09436 (2019).
[70] Jiajun Jiang, Yingfei Xiong, Hongyu Zhang, Qing Gao, and Xiangqun Chen. 2018. Shaping Program Repair Space
with Existing Patches and Similar Code. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software
Testing and Analysis (ISSTA’18). 298–309.
[71] Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. 2023. Impact of code language models on automated program
repair. In Proceedings of the 45th International Conference on Software Engineering (ICSE’23). 1430–1442.
[72] Nan Jiang, Thibaud Lutellier, Yiling Lou, Lin Tan, Dan Goldwasser, and Xiangyu Zhang. 2023. KNOD: Domain
Knowledge Distilled Tree Decoder for Automated Program Repair. In 2023 IEEE/ACM 45th International Conference on
Software Engineering (ICSE). IEEE, 1251–1263.
[73] Nan Jiang, Thibaud Lutellier, and Lin Tan. 2021. Cure: Code-aware Neural Machine Translation for Automatic
Program Repair. In Proceedings of the 43rd IEEE/ACM International Conference on Software Engineering (ICSE’21).
1161–1173.
[74] Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda
Viégas, Martin Wattenberg, Greg Corrado, et al. 2017. Google’s Multilingual Neural Machine Translation System:
Enabling Zero-shot Translation. Transactions of the Association for Computational Linguistics (TACL) 5 (2017), 339–351.
[75] Harshit Joshi, José Cambronero, Sumit Gulwani, Vu Le, Ivan Radicek, and Gust Verbruggen. 2022. Repair Is Nearly
Generation: Multilingual Program Repair with Llms. arXiv preprint arXiv:2208.11640 (2022).
[76] René Just, Darioush Jalali, and Michael D Ernst. 2014. Defects4j: A Database of Existing Faults to Enable Controlled
Testing Studies for Java Programs. In Proceedings of the 23rd International Symposium on Software Testing and Analysis
(ISSTA’14). 437–440.
[77] Sungmin Kang and Shin Yoo. 2022. Language Models Can Prioritize Patches for Practical Program Patching. In
Proceedings of the Third International Workshop on Automated Program Repair (APR’22). 8–15.
[78] Rafael-Michael Karampatsis and Charles Sutton. 2020. How Often Do Single-statement Bugs Occur? The Manysstubs4j
Dataset. In Proceedings of the 17th International Conference on Mining Software Repositories (MSR’20). 573–577.
[79] Dongsun Kim, Jaechang Nam, Jaewoo Song, and Sunghun Kim. 2013. Automatic patch generation learned from
human-written patches. In 2013 35th International Conference on Software Engineering (ICSE). IEEE, 802–811.
[80] Misoo Kim, Youngkyoung Kim, Hohyeon Jeong, Jinseok Heo, Sungoh Kim, Hyunhee Chung, and Eunseok Lee. 2022.
An Empirical Study of Deep Transfer Learning-based Program Repair for Kotlin Projects. In Proceedings of the 30th
ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering.
1441–1452.
[81] Serkan Kirbas, Etienne Windels, Olayori McBello, Kevin Kells, Matthew Pagano, Rafal Szalanski, Vesna Nowack,
Emily Rowan Winter, Steve Counsell, David Bowes, et al. 2021. On the Introduction of Automatic Program Repair in
Bloomberg. IEEE Software 38, 4 (2021), 43–51.
[82] Barbara Ann Kitchenham and Stuart Charters. 2007. Guidelines for performing Systematic Literature Reviews in
Software Engineering. Technical Report EBSE 2007-001. Keele University and Durham University Joint Report. 1–65
pages.
[83] Amy J Ko, Brad A Myers, Michael J Coblenz, and Htet Htet Aung. 2006. An Exploratory Study of How Developers
Seek, Relate, and Collect Relevant Information during Software Maintenance Tasks. IEEE Transactions on Software
Engineering (TSE) 32, 12 (2006), 971–987.
[84] Sophia D Kolak, Ruben Martins, Claire Le Goues, and Vincent Josua Hellendoorn. 2022. Patch Generation with
Language Models: Feasibility and Scaling Behavior. In International Conference on Learning Representations Deep
Learning for Code Workshop (ICLR-DL4C’22).
[85] Anil Koyuncu, Kui Liu, Tegawendé F Bissyandé, Dongsun Kim, Jacques Klein, Martin Monperrus, and Yves Le Traon.
2020. Fixminer: Mining Relevant Fix Patterns for Automated Program Repair. Empirical Software Engineering (EMSE)
25, 3 (2020), 1980–2024.
[86] Nir Kshetri. 2006. The Simple Economics of Cybercrimes. IEEE Security & Privacy (S&P’06) 4, 1 (2006), 33–39.
[87] Taku Kudo and John Richardson. 2018. Sentencepiece: A Simple and Language Independent Subword Tokenizer
and Detokenizer for Neural Text Processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural
Language Processing: System Demonstrations (EMNLP’18). 66–71.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:62 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

[88] Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy S Liang. 2019. Spoc:
Search-based Pseudocode to Code. Advances in Neural Information Processing Systems 32 (2019).
[89] Márk Lajkó, Viktor Csuvik, and László Vidács. 2022. Towards Javascript Program Repair with Generative Pre-trained
Transformer (gpt-2). In 2022 IEEE/ACM International Workshop on Automated Program Repair (APR’22). IEEE, 61–68.
[90] Xuan-Bach D Le, Lingfeng Bao, David Lo, Xin Xia, Shanping Li, and Corina Pasareanu. 2019. On Reliability of
Patch Correctness Assessment. In Proceedings of the 41st IEEE/ACM International Conference on Software Engineering
(ICSE’19). IEEE, 524–535.
[91] Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. 2012. A Systematic Study of Automated
Program Repair: Fixing 55 Out of 105 Bugs for $8 Each. In 2012 34th International Conference on Software Engineering
(ICSE’12). 3–13. https://doi.org/10.1109/ICSE.2012.6227211
[92] Claire Le Goues, Neal Holtschulte, Edward K Smith, Yuriy Brun, Premkumar Devanbu, Stephanie Forrest, and Westley
Weimer. 2015. The Manybugs and Introclass Benchmarks for Automated Repair of C Programs. IEEE Transactions on
Software Engineering (TSE) 41, 12 (2015), 1236–1256.
[93] Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, and Westley Weimer. 2012. Genprog: A Generic Method for
Automatic Software Repair. IEEE Transactions on Software Engineering (TSE) 38, 01 (2012), 54–72.
[94] Dongcheng Li, W Eric Wong, Mingyong Jian, Yi Geng, and Matthew Chau. 2022. Improving Search-based Automatic
Program Repair with Neural Machine Translation. IEEE Access 10 (2022), 51167–51175.
[95] Frank Li and Vern Paxson. 2017. A Large-scale Empirical Study of Security Patches. In Proceedings of the 2017 ACM
SIGSAC Conference on Computer and Communications Security (CCS’17). 2201–2215.
[96] Xia Li and Lingming Zhang. 2017. Transforming Programs and Tests in Tandem for Fault Localization. Proceedings of
the ACM on Programming Languages (OOPSLA’17) 1, OOPSLA (2017), 1–30.
[97] Yi Li, Shaohua Wang, and Tien Nguyen. 2021. Fault Localization with Code Coverage Representation Learning. In
2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE’21). IEEE, 661–673.
[98] Yi Li, Shaohua Wang, and Tien N Nguyen. 2020. Dlfix: Context-based Code Transformation Learning for Automated
Program Repair. In Proceedings of the 42nd ACM/IEEE International Conference on Software Engineering (ICSE’20).
602–614.
[99] Yi Li, Shaohua Wang, and Tien N. Nguyen. 2022. Dear: A Novel Deep Learning-based Approach for Automated
Program Repair. In Proceedings of the 44th International Conference on Software Engineering (ICSE’22). 511–523.
[100] Zhen Li, Deqing Zou, Shouhuai Xu, Hai Jin, Yawei Zhu, and Zhaoxuan Chen. 2021. Sysevr: A Framework for Using
Deep Learning to Detect Software Vulnerabilities. IEEE Transactions on Dependable and Secure Computing (TDSC)
(2021).
[101] Jingjing Liang, Ruyi Ji, Jiajun Jiang, Shurui Zhou, Yiling Lou, Yingfei Xiong, and Gang Huang. 2021. Interactive Patch
Filtering as Debugging Aid. In 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME).
IEEE, 239–250.
[102] Bo Lin, Shangwen Wang, Ming Wen, and Xiaoguang Mao. 2022. Context-aware Code Change Embedding for Better
Patch Correctness Assessment. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 3 (2022),
1–29.
[103] Derrick Lin, James Koppel, Angela Chen, and Armando Solar-Lezama. 2017. Quixbugs: A Multi-lingual Program Repair
Benchmark Set Based on the Quixey Challenge. In Proceedings Companion of the 2017 ACM SIGPLAN International
Conference on Systems, Programming, Languages, and Applications: Software for Humanity (SPLASH Companion’17).
55–56.
[104] Bingchang Liu, Guozhu Meng, Wei Zou, Qi Gong, Feng Li, Min Lin, Dandan Sun, Wei Huo, and Chao Zhang. 2020. A
Large-scale Empirical Study on Vulnerability Distribution within Projects and the Lessons Learned. In 2020 IEEE/ACM
42nd International Conference on Software Engineering (ICSE’20). IEEE, 1547–1559.
[105] Kui Liu, Anil Koyuncu, Tegawendé F Bissyandé, Dongsun Kim, Jacques Klein, and Yves Le Traon. 2019. You Cannot
Fix What You Cannot Find! An Investigation of Fault Localization Bias in Benchmarking Automated Program Repair
Systems. In Proceedings of the 12th IEEE Conference on Software Testing, Validation and Verification (ICST’19). 102–113.
[106] Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F Bissyandé. 2019. Avatar: Fixing Semantic Bugs with Fix
Patterns of Static Analysis Violations. In Proceedings of the 26th IEEE International Conference on Software Analysis,
Evolution and Reengineering (SANER’19). 1–12.
[107] Kui Liu, Anil Koyuncu, Dongsun Kim, and Tegawendé F Bissyandé. 2019. Tbar: Revisiting Template-based Automated
Program Repair. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis
(ISSTA’19). 31–42.
[108] Kui Liu, Shangwen Wang, Anil Koyuncu, Kisub Kim, Tegawendé F Bissyandé, Dongsun Kim, Peng Wu, Jacques
Klein, Xiaoguang Mao, and Yves Le Traon. 2020. On the Efficiency of Test Suite Based Program Repair: A Systematic
Assessment of 16 Automated Repair Systems for Java Programs. In Proceedings of the 42nd ACM/IEEE International
Conference on Software Engineering (ICSE’20). 615–627.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:63

[109] Fan Long, Peter Amidon, and Martin Rinard. 2017. Automatic inference of code transforms for patch generation. In
Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. 727–739.
[110] Fan Long and Martin Rinard. 2015. Staged Program Repair with Condition Synthesis. In Proceedings of the 2015 10th
Joint Meeting on Foundations of Software Engineering. 166–178.
[111] Fan Long and Martin Rinard. 2016. Automatic Patch Generation by Learning Correct Code. In Proceedings of the 43rd
Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL’16). 298–312.
[112] Yiling Lou, Qihao Zhu, Jinhao Dong, Xia Li, Zeyu Sun, Dan Hao, Lu Zhang, and Lingming Zhang. 2021. Boosting
Coverage-based Fault Localization Via Graph-based Representation Learning. In Proceedings of the 29th ACM Joint
Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
(ESEC/FSE’21). 664–676.
[113] Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin Clement, Dawn Drain,
Daxin Jiang, Duyu Tang, et al. 2021. Codexglue: A Machine Learning Benchmark Dataset for Code Understanding
and Generation. arXiv preprint arXiv:2102.04664 (2021).
[114] Thibaud Lutellier, Lawrence Pang, Viet Hung Pham, Moshi Wei, and Lin Tan. 2019. Encore: Ensemble Learning Using
Convolution Neural Machine Translation for Automatic Program Repair. arXiv preprint arXiv:1906.08691 (2019).
[115] Thibaud Lutellier, Hung Viet Pham, Lawrence Pang, Yitong Li, Moshi Wei, and Lin Tan. 2020. Coconut: Combining
Context-aware Neural Translation Models Using Ensemble for Program Repair. In Proceedings of the 29th ACM
SIGSOFT International Symposium on Software Testing and Analysis (ISSTA’20). 101–114.
[116] Siqi Ma, Ferdian Thung, David Lo, Cong Sun, and Robert H Deng. 2017. Vurle: Automatic Vulnerability Detection and
Repair by Learning from Examples. In European Symposium on Research in Computer Security (ESORICS’17). Springer,
229–246.
[117] Fernanda Madeiral, Simon Urli, Marcelo Maia, and Martin Monperrus. 2019. Bears: An Extensible Java Bug Benchmark
for Automatic Program Repair Studies. In Proceedings of the 26th IEEE International Conference on Software Analysis,
Evolution and Reengineering (SANER’19). 468–478.
[118] Amirabbas Majd, Mojtaba Vahidi-Asl, Alireza Khalilian, Ahmad Baraani-Dastjerdi, and Bahman Zamani. 2019.
Code4Bench: A Multidimensional Benchmark of Codeforces Data for Different Program Analysis techniques. Journal
of Computer Languages 53 (2019), 38–52.
[119] T MAMATHA, B RAMA SUBBA REDDY, and C SHOBA BINDU. 2022. Oapr-homl’1: Optimal Automated Program
Repair Approach Based on Hybrid Improved Grasshopper Optimization and Opposition Learning Based Artificial
Neural Network. International Journal of Computer Science & Network Security (IJCSDS) 22, 4 (2022), 261–273.
[120] Xiaoguang Mao, Yan Lei, Ziying Dai, Yuhua Qi, and Chengsong Wang. 2014. Slice-based Statistical Fault Localization.
Journal of Systems and Software (JSS) 89 (2014), 51–62.
[121] Alexandru Marginean, Johannes Bader, Satish Chandra, Mark Harman, Yue Jia, Ke Mao, Alexander Mols, and Andrew
Scott. 2019. Sapfix: Automated End-to-end Repair at Scale. In 2019 IEEE/ACM 41st International Conference on Software
Engineering: Software Engineering in Practice (ICSE-SEIP’19). IEEE, 269–278.
[122] Matias Martinez, Thomas Durieux, Romain Sommerard, Jifeng Xuan, and Martin Monperrus. 2017. Automatic repair
of real bugs in java: A large-scale experiment on the defects4j dataset. Empirical Software Engineering 22 (2017),
1936–1964.
[123] Matias Martinez and Martin Monperrus. 2016. Astor: A Program Repair Library for Java. In Proceedings of the 25th
International Symposium on Software Testing and Analysis (ISSTA’16). 441–444.
[124] Matias Martinez and Martin Monperrus. 2018. Ultra-large Repair Search Space with Automatically Mined Templates:
The Cardumen Mode of Astor. In Proceedings of the International Symposium on Search Based Software Engineering
(SSBSE’18). Springer, 65–86.
[125] Ehsan Mashhadi and Hadi Hemmati. 2021. Applying Codebert for Automated Program Repair of Java Simple Bugs.
In Proceedings Companion of the 18th IEEE/ACM International Conference on Mining Software Repositories (MSR’21).
505–509.
[126] Antonio Mastropaolo, Nathan Cooper, David Nader Palacio, Simone Scalabrino, Denys Poshyvanyk, Rocco Oliveto,
and Gabriele Bavota. 2022. Using Transfer Learning for Code-related Tasks. IEEE Transactions on Software Engineering
(TSE) (2022).
[127] Antonio Mastropaolo, Simone Scalabrino, Nathan Cooper, David Nader Palacio, Denys Poshyvanyk, Rocco Oliveto,
and Gabriele Bavota. 2021. Studying the Usage of Text-to-text Transfer Transformer to Support Code-related Tasks.
In Proceedings of the 43rd IEEE/ACM International Conference on Software Engineering (ICSE’21). 336–347.
[128] Paola Masuzzo and Lennart Martens. 2017. Do You Speak Open Science? Resources and Tips to Learn the Language.
Technical Report. PeerJ Preprints.
[129] Sergey Mechtaev, Jooyong Yi, and Abhik Roychoudhury. 2016. Angelix: Scalable Multiline Program Patch Synthesis
Via Symbolic Analysis. In Proceedings of the 38th International Conference on Software Engineering (ICSE’16). 691–701.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:64 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

[130] Xiangxin Meng, Xu Wang, Hongyu Zhang, Hailong Sun, and Xudong Liu. 2022. Improving Fault Localization and
Program Repair with Deep Semantic Features and Transferred Knowledge. In Proceedings of the 44th IEEE/ACM
International Conference on Software Engineering (ICSE’22). 1169–1180.
[131] Xiangxin Meng, Xu Wang, Hongyu Zhang, Hailong Sun, Xudong Liu, and Chunming Hu. 2023. Template-based Neural
Program Repair. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1456–1468.
[132] Ali Mesbah, Andrew Rice, Emily Johnston, Nick Glorioso, and Edward Aftandilian. 2019. Deepdelta: Learning to
Repair Compilation Errors. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering
Conference and Symposium on the Foundations of Software Engineering (ESE/FSE’19). 925–936.
[133] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver,
and Koray Kavukcuoglu. 2016. Asynchronous Methods for Deep Reinforcement Learning. In International Conference
on Machine Learning (ICML). PMLR, 1928–1937.
[134] Venkatesh Theru Mohan. 2019. Automatic Repair and Type Binding of Undeclared Variables Using Neural Networks.
Ph. D. Dissertation. Iowa State University.
[135] Martin Monperrus. 2018. Automatic Software Repair: A Bibliography. ACM Computing Surveys (CSUR) 51, 1 (2018),
1–24.
[136] Martin Monperrus. 2022. The Living Review on Automated Program Repair. (2022).
[137] Martin Monperrus, Matias Martinez, He Ye, Fernanda Madeiral, Thomas Durieux, and Zhongxing Yu. 2021. Megadiff:
A Dataset of 600k Java Source Code Changes Categorized by Diff Size. arXiv preprint arXiv:2108.04631 (2021).
[138] Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive Text
Summarization Using Sequence-to-sequence Rnns and Beyond. In Proceedings of The 20th SIGNLL Conference on
Computational Natural Language Learning (CoNLL’16). 280–290.
[139] Marjane Namavar, Noor Nashid, and Ali Mesbah. 2022. A Controlled Experiment of Different Code Representations
for Learning-based Program Repair. Empirical Software Engineering (EMSE) 27, 7 (2022), 1–39.
[140] Noor Nashid, Mifta Sintaha, and Ali Mesbah. 2023. Retrieval-based prompt selection for code-related few-shot
learning. In Proceedings of the 45th International Conference on Software Engineering (ICSE’23).
[141] Hoan Anh Nguyen, Tien N Nguyen, Danny Dig, Son Nguyen, Hieu Tran, and Michael Hilton. 2019. Graph-based
Mining of In-the-wild, Fine-grained, Semantic Code Change Patterns. In 2019 IEEE/ACM 41st International Conference
on Software Engineering (ICSE). IEEE, 819–830.
[142] Thanh V Nguyen and Srinivasan H Sengamedu. 2021. Graphix: A Pre-trained Graph Edit Model for Automated
Program Repair. (2021).
[143] Chao Ni, Kaiwen Yang, Xin Xia, David Lo, Xiang Chen, and Xiaohu Yang. 2022. Defect Identification, Categorization,
and Repair: Better Together. arXiv preprint arXiv:2204.04856 (2022).
[144] Amirfarhad Nilizadeh and Gary T Leavens. 2022. Be realistic: Automated program repair is a combination of
undecidable problems. In Proceedings of the Third International Workshop on Automated Program Repair. 31–32.
[145] Changan Niu, Chuanyi Li, Vincent Ng, Jidong Ge, Liguo Huang, and Bin Luo. 2022. Spt-code: Sequence-to-sequence
Pre-training for Learning the Representation of Source Code. In Proceedings of the 44th International Conference on
Software Engineering (ICSE’22). 2006–2018.
[146] Yu Nong, Rainy Sharma, Abdelwahab Hamou-Lhadj, Xiapu Luo, and Haipeng Cai. 2022. Open Science in Software
Engineering: A Study on Deep Learning-based Vulnerability Detection. IEEE Transactions on Software Engineering
(TSE) (2022).
[147] A Jefferson Offutt and Stephen D Lee. 1994. An Empirical Evaluation of Weak Mutation. IEEE Transactions on Software
Engineering (TSE) 20, 5 (1994), 337–344.
[148] Mike Papadakis and Yves Le Traon. 2015. Metallaxis-fl: Mutation-based Fault Localization. Software Testing, Verification
and Reliability (STVR) 25, 5-7 (2015), 605–628.
[149] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: A Method for Automatic Evaluation
of Machine Translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics
(ACL’02). 311–318.
[150] Nikhil Parasaram, Earl T Barr, and Sergey Mechtaev. 2023. Rete: Learning Namespace Representation for Program
Repair. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1264–1276.
[151] Terence Parr and Kathleen Fisher. 2011. Ll (*) the Foundation of the Antlr Parser Generator. In Proceedings of the 32nd
ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’11). 425–436.
[152] Spencer Pearson, José Campos, René Just, Gordon Fraser, Rui Abreu, Michael D Ernst, Deric Pang, and Benjamin
Keller. 2017. Evaluating and Improving Fault Localization. In 2017 IEEE/ACM 39th International Conference on Software
Engineering (ICSE’17). IEEE, 609–620.
[153] Kai Petersen, Sairam Vakkalanka, and Ludwik Kuzniarz. 2015. Guidelines for Conducting Systematic Mapping Studies
in Software Engineering: An Update. Information and Software Technology (IST) 64 (2015), 1–18.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:65

[154] Serena Elisa Ponta, Henrik Plate, Antonino Sabetta, Michele Bezzi, and Cédric Dangremont. 2019. A manually-curated
dataset of fixes to vulnerabilities of open-source software. In 2019 IEEE/ACM 16th International Conference on Mining
Software Repositories (MSR). IEEE, 383–387.
[155] Yuhua Qi, Xiaoguang Mao, and Yan Lei. 2012. Making Automatic Repair for Large-scale Programs More Efficient
using Weak Recompilation. In 2012 28th IEEE International Conference on Software Maintenance (ICSM). IEEE, 254–263.
[156] Yuhua Qi, Xiaoguang Mao, and Yan Lei. 2013. Efficient automated program repair through fault-recorded testing
prioritization. In 2013 IEEE International Conference on Software Maintenance. IEEE, 180–189.
[157] Zichao Qi, Fan Long, Sara Achour, and Martin Rinard. 2015. An Analysis of Patch Plausibility and Correctness for
Generate-and-validate Patch Generation Systems. In Proceedings of the 2015 International Symposium on Software
Testing and Analysis (ISSTA’15). 24–36.
[158] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and
Peter J Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-text Transformer. Journal of
Machine Learning Research (JMLR) 21 (2020), 1–67.
[159] Md Mostafizer Rahman, Yutaka Watanobe, and Keita Nakamura. 2021. A Bidirectional Lstm Language Model for
Code Evaluation and Repair. Symmetry (SYM) 13, 2 (2021), 247.
[160] Veselin Raychev, Pavol Bielik, and Martin Vechev. 2016. Probabilistic Model for Code with Decision Trees. ACM
SIGPLAN Notices 51, 10 (2016), 731–747.
[161] Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco,
and Shuai Ma. 2020. Codebleu: A Method for Automatic Evaluation of Code Synthesis. arXiv preprint arXiv:2009.10297
(2020).
[162] André Riboira and Rui Abreu. 2010. The Gzoltar Project: A Graphical Debugger Interface. In International Academic
and Industrial Conference on Practice and Research Techniques (TAIC-PART’10). Springer, 215–218.
[163] Cedric Richter and Heike Wehrheim. 2022. Can We Learn from Developer Mistakes? Learning to Localize and Repair
Real Bugs from Real Bug Fixes. arXiv preprint arXiv:2207.00301 (2022).
[164] Cedric Richter and Heike Wehrheim. 2022. TSSB-3M: Mining Single Statement Bugs at Massive Scale. In Proceedings
of the 19th International Conference on Mining Software Repositories. 418–422.
[165] Ripon K Saha, Yingjun Lyu, Wing Lam, Hiroaki Yoshida, and Mukul R Prasad. 2018. Bugs. Jar: A Large-scale, Diverse
Dataset of Real-world Java Bugs. In Proceedings of the 15th International Conference on Mining Software Repositories
(MSR’18). 10–13.
[166] Ripon K Saha, Yingjun Lyu, Hiroaki Yoshida, and Mukul R Prasad. 2017. Elixir: Effective Object-oriented Program
Repair. In 2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE’17). IEEE, 648–659.
[167] Eddie Antonio Santos, Joshua Charles Campbell, Dhvani Patel, Abram Hindle, and José Nelson Amaral. 2018. Syntax
and Sensibility: Using Language Models to Detect and Correct Syntax Errors. In 2018 IEEE 25th International Conference
on Software Analysis, Evolution and Reengineering (SANER’18). IEEE, 311–322.
[168] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword
Units. In 54th Annual Meeting of the Association for Computational Linguistics (ACL’16). Association for Computational
Linguistics (ACL), 1715–1725.
[169] Mifta Sintaha, Noor Nashid, and Ali Mesbah. 2023. Katana: Dual Slicing Based Context for Learning Bug Fixes. ACM
Transactions on Software Engineering and Methodology 32, 4 (2023), 1–27.
[170] Edward K Smith, Earl T Barr, Claire Le Goues, and Yuriy Brun. 2015. Is the Cure Worse Than the Disease? Overfitting
in Automated Program Repair. In Proceedings of the 10th Joint Meeting of the European Software Engineering Conference
and ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC/FSE’15). 532–543.
[171] Sunbeom So and Hakjoo Oh. 2023. SmartFix: Fixing Vulnerable Smart Contracts by Accelerating Generate-and-Verify
Repair using Statistical Models. (2023).
[172] Balázs Szalontai, András Vadász, Zsolt Richárd Borsi, Teréz A Várkonyi, Balázs Pintér, and Tibor Gregorics. 2021.
Detecting and Fixing Nonidiomatic Snippets in Python Source Code with Deep Learning. In Proceedings of SAI
Intelligent Systems Conference (ISC’21). Springer, 129–147.
[173] Shin Hwei Tan, Jooyong Yi, Sergey Mechtaev, Abhik Roychoudhury, et al. 2017. Codeflaws: A Programming
Competition Benchmark for Evaluating Automated Program Repair Tools. In 2017 IEEE/ACM 39th International
Conference on Software Engineering Companion (ICSE-C). IEEE, 180–182.
[174] Shin Hwei Tan, Hiroaki Yoshida, Mukul R Prasad, and Abhik Roychoudhury. 2016. Anti-patterns in Search-based
Program Repair. In Proceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software
Engineering (FSE’16). 727–738.
[175] Ben Tang, Bin Li, Lili Bo, Xiaoxue Wu, Sicong Cao, and Xiaobing Sun. 2021. Grasp: Graph-to-sequence Learning for
Automated Program Repair. In 2021 IEEE 21st International Conference on Software Quality, Reliability and Security
(QRS’21). IEEE, 819–828.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:66 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

[176] Yu Tang, Long Zhou, Ambrosio Blanco, Shujie Liu, Furu Wei, Ming Zhou, and Muyun Yang. 2021. Grammar-based
Patches Generation for Automated Program Repair. In Findings of the Association for Computational Linguistics:
ACL-IJCNLP 2021. 1300–1305.
[177] Yida Tao, Jindae Kim, Sunghun Kim, and Chang Xu. 2014. Automatically Generated Patches As Debugging Aids:
A Human Study. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software
Engineering (FSE’14). 64–74.
[178] Haoye Tian, Yinghua Li, Weiguo Pian, Abdoul Kader Kabore, Kui Liu, Andrew Habib, Jacques Klein, and Tegawendé F
Bissyandé. 2022. Predicting Patch Correctness Based on the Similarity of Failing Test Cases. ACM Transactions on
Software Engineering and Methodology (TOSEM) 31, 4 (2022), 1–30.
[179] Haoye Tian, Kui Liu, Abdoul Kader Kaboré, Anil Koyuncu, Li Li, Jacques Klein, and Tegawendé F Bissyandé. 2020. Eval-
uating Representation Learning of Code Changes for Predicting Patch Correctness in Program Repair. In Proceedings
of the 35th IEEE/ACM International Conference on Automated Software Engineering (ASE’20). 981–992.
[180] Haoye Tian, Kui Liu, Yinghua Li, Abdoul Kader Kaboré, Anil Koyuncu, Andrew Habib, Li Li, Junhao Wen, Jacques
Klein, and Tegawendé F Bissyandé. 2022. The Best of Both Worlds: Combining Learned Embeddings with Engineered
Features for Accurate Prediction of Correct Patches. ACM Transactions on Software Engineering and Methodology
(TOSEM) 1, 1 (2022), 1–1.
[181] Haoye Tian, Xunzhu Tang, Andrew Habib, Shangwen Wang, Kui Liu, Xin Xia, Jacques Klein, and Tegawendé F
Bissyandé. 2022. Is This Change the Answer to That Problem? Correlating Descriptions of Bug and Code Changes
for Evaluating Patch Correctness. In 2022 37th IEEE/ACM International Conference on Automated Software Engineering
(ASE’22). IEEE.
[182] Nikolai Tillmann, Jonathan De Halleux, Tao Xie, and Judith Bishop. 2014. Code Hunt: Gamifying Teaching and
Learning of Computer Science at Scale. In Proceedings of the first ACM Conference on Learning@ Scale Conference.
221–222.
[183] Michele Tufano, Jevgenija Pantiuchina, Cody Watson, Gabriele Bavota, and Denys Poshyvanyk. 2019. On Learning
Meaningful Code Changes Via Neural Machine Translation. In 2019 IEEE/ACM 41st International Conference on
Software Engineering (ICSE’19). IEEE, 25–36.
[184] Michele Tufano, Cody Watson, Gabriele Bavota, Massimiliano Di Penta, Martin White, and Denys Poshyvanyk. 2019.
An Empirical Study on Learning Bug-fixing Patches in the Wild Via Neural Machine Translation. ACM Transactions
on Software Engineering and Methodology (TOSEM) 28, 4 (2019), 1–29.
[185] Meysam Valueian, Mojtaba Vahidi-Asl, and Alireza Khalilian. 2022. Siturepair: Incorporating Machine-learning Fault
Class Prediction to Inform Situational Multiple Fault Automatic Program Repair. International Journal of Critical
Infrastructure Protection (IJCIP) 37 (2022), 100527.
[186] Marko Vasic, Aditya Kanade, Petros Maniatis, David Bieber, and Rishabh Singh. 2019. Neural Program Repair by
Jointly Learning to Localize and Repair. arXiv preprint arXiv:1904.01720 (2019).
[187] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. 2017. Attention Is All You Need. In Advances in Neural Information Processing Systems (NeurIPS’17).
5998–6008.
[188] Bo Wang, Sirui Lu, Yingfei Xiong, and Feng Liu. 2021. Faster Mutation Analysis with Fewer Processes and Smaller
Overheads. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 381–393.
[189] Bo Wang, Yingfei Xiong, Yangqingwei Shi, Lu Zhang, and Dan Hao. 2017. Faster Mutation Analysis via Equivalence
Modulo States. In Proceedings of the 26th ACM SIGSOFT International Symposium on Software Testing and Analysis.
295–306.
[190] Jianzong Wang, Shijing Si, Zhitao Zhu, Xiaoyang Qu, Zhenhou Hong, and Jing Xiao. 2022. Leveraging Causal
Inference for Explainable Automatic Program Repair. arXiv preprint arXiv:2205.13342 (2022).
[191] Ke Wang, Rishabh Singh, and Zhendong Su. 2018. Dynamic Neural Program Embeddings for Program Repair. In
International Conference on Learning Representations (ICLR’18).
[192] Ke Wang, Rishabh Singh, and Zhendong Su. 2018. Search, Align, and Repair: Data-driven Feedback Generation for
Introductory Programming Exercises. In Proceedings of the 39th Acm Sigplan Conference on Programming Language
Design and Implementation (PLDI’18). 481–495.
[193] Simin Wang, Liguo Huang, Amiao Gao, Jidong Ge, Tengfei Zhang, Haitao Feng, Ishna Satyarth, Ming Li, He Zhang,
and Vincent Ng. 2022. Machine/deep Learning for Software Engineering: A Systematic Literature Review. IEEE
Transactions on Software Engineering (TSE) (2022).
[194] Song Wang, Jaechang Nam, and Lin Tan. 2017. Qtep: Quality-aware Test Case Prioritization. In Proceedings of the
2017 11th Joint Meeting on Foundations of Software Engineering (FSE’17). 523–534.
[195] Shangwen Wang, Ming Wen, Bo Lin, Hongjun Wu, Yihao Qin, Deqing Zou, Xiaoguang Mao, and Hai Jin. 2020.
Automated Patch Correctness Assessment: How Far Are We?. In Proceedings of the 35th IEEE/ACM International
Conference on Automated Software Engineering (ASE’20). 968–980.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:67

[196] Weishi Wang, Yue Wang, Shafiq Joty, and Steven CH Hoi. 2023. RAP-Gen: Retrieval-Augmented Patch Generation
with CodeT5 for Automatic Program Repair. arXiv preprint arXiv:2309.06057 (2023).
[197] Yue Wang, Weishi Wang, Shafiq Joty, and Steven CH Hoi. 2021. Codet5: Identifier-aware Unified Pre-trained Encoder-
decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods
in Natural Language Processing (EMNLP’21). 8696–8708.
[198] Yuehan Wang, Jun Yang, Yiling Lou, Ming Wen, and Lingming Zhang. 2022. Attention: Not Just Another Dataset for
Patch-correctness Checking. arXiv preprint arXiv:2207.06590 (2022).
[199] Laura Wartschinski, Yannic Noller, Thomas Vogel, Timo Kehrer, and Lars Grunske. 2022. VUDENC: vulnerability
detection with deep learning on a natural codebase for Python. Information and Software Technology 144 (2022),
106809.
[200] Cody Watson, Nathan Cooper, David Nader Palacio, Kevin Moran, and Denys Poshyvanyk. 2022. A systematic
literature review on the use of deep learning in software engineering research. ACM Transactions on Software
Engineering and Methodology (TOSEM) 31, 2 (2022), 1–58.
[201] Yuxiang Wei, Chunqiu Steven Xia, and Lingming Zhang. 2023. Copiloting the Copilots: Fusing Large Language
Models with Completion Engines for Automated Program Repair. arXiv preprint arXiv:2309.00608 (2023).
[202] Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, and Andreas Zeller. 2007. How Long Will It Take to Fix This
Bug?. In Fourth International Workshop on Mining Software Repositories (MSR’07). IEEE, 1–1.
[203] Martin White, Michele Tufano, Matias Martinez, Martin Monperrus, and Denys Poshyvanyk. 2019. Sorting and
Transforming Program Repair Ingredients Via Deep Learning Code Similarities. In Proceedings of the 26th IEEE
International Conference on Software Analysis, Evolution and Reengineering (SANER’19). 479–490.
[204] W. E. Wong, R. Gao, Y. Li, R. Abreu, and F. Wotawa. 2016. A Survey on Software Fault Localization. IEEE Transactions
on Software Engineering (TSE) 42, 8 (Aug. 2016), 707–740.
[205] Liwei Wu, Fei Li, Youhua Wu, and Tao Zheng. 2020. Ggf: A Graph-based Method for Programming Language Syntax
Error Correction. In Proceedings of the 28th International Conference on Program Comprehension. 139–148.
[206] Yi Wu, Nan Jiang, Hung Viet Pham, Thibaud Lutellier, Jordan Davis, Lin Tan, Petr Babkin, and Sameena Shah. 2023.
How Effective Are Neural Networks for Fixing Security Vulnerabilities. In Proceedings of the 32nd ACM SIGSOFT
International Symposium on Software Testing and Analysis. 1282–1294.
[207] Chunqiu Steven Xia, Yifeng Ding, and Lingming Zhang. 2023. Revisiting the Plastic Surgery Hypothesis via Large
Language Models. arXiv preprint arXiv:2303.10494 (2023).
[208] Chunqiu Steven Xia, Yuxiang Wei, and Lingming Zhang. 2022. Practical Program Repair in the Era of Large Pre-trained
Language Models. arXiv preprint arXiv:2210.14179 (2022).
[209] Chunqiu Steven Xia and Lingming Zhang. 2022. Less Training, More Repairing Please: Revisiting Automated Program
Repair Via Zero-shot Learning. In Proceedings of the 30th ACM Joint European Software Engineering Conference and
Symposium on the Foundations of Software Engineering (ESEC/FSE’22). 959–971.
[210] Yuan-An Xiao, Chenyang Yang, Bo Wang, and Yingfei Xiong. 2023. Accelerating Patch Validation for Program Repair
with Interception-Based Execution Scheduling. arXiv preprint arXiv:2305.03955 (2023).
[211] Yuan-An Xiao, Chenyang Yang, Bo Wang, and Yingfei Xiong. 2023. ExpressAPR: Efficient Patch Validation for Java
Automated Program Repair Systems. In Proceedings of the 38th IEEE/ACM International Conference on Automated
Software Engineering. 1–4.
[212] Yingfei Xiong, Xinyuan Liu, Muhan Zeng, Lu Zhang, and Gang Huang. 2018. Identifying Patch Correctness in
Test-based Program Repair. In Proceedings of the 40th IEEE/ACM International Conference on Software Engineering
(ICSE’18). 789–799.
[213] Yingfei Xiong, Jie Wang, Runfa Yan, Jiachen Zhang, Shi Han, Gang Huang, and Lu Zhang. 2017. Precise Condition
Synthesis for Program Repair. In Proceedings of the 39th IEEE/ACM International Conference on Software Engineering
(ICSE’17). IEEE, 416–426.
[214] Xuezheng Xu, Xudong Wang, and Jingling Xue. 2022. M3v: Multi-modal Multi-view Context Embedding for Repair
Operator Prediction. In 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO’22). IEEE,
266–277.
[215] Jifeng Xuan, Matias Martinez, Favio Demarco, Maxime Clement, Sebastian Lamelas Marcote, Thomas Durieux, Daniel
Le Berre, and Martin Monperrus. 2016. Nopol: Automatic Repair of Conditional Statement Bugs in Java Programs.
IEEE Transactions on Software Engineering (TSE) 43, 1 (2016), 34–55.
[216] Dapeng Yan, Kui Liu, Yuqing Niu, Li Li, Zhe Liu, Zhiming Liu, Jacques Klein, and Tegawendé F Bissyandé. 2022. Crex:
Predicting Patch Correctness in Automated Repair of C Programs through Transfer Learning of Execution Semantics.
Information and Software Technology (IST’22) 152 (2022), 107043.
[217] Bo Yang and Jinqiu Yang. 2020. Exploring the Differences between Plausible and Correct Patches at Fine-grained
Level. In Proceedings of the 2nd IEEE International Workshop on Intelligent Bug Fixing (IBF’20). IEEE, 1–8.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
1:68 Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen

[218] Deheng Yang, Xiaoguang Mao, Liqian Chen, Xuezheng Xu, Yan Lei, David Lo, and Jiayu He. 2022. TransplantFix:
Graph Differencing-based Code Transplantation for Automated Program Repair. In 37th IEEE/ACM International
Conference on Automated Software Engineering (ASE’22). 1–13.
[219] Geunseok Yang, Kyeongsic Min, and Byungjeong Lee. 2020. Applying Deep Learning Algorithm to Automatic Bug
Localization and Repair. In Proceedings of the 35th Annual Acm Symposium on Applied Computing (SAC’20). 1634–1641.
[220] Yanming Yang, Xin Xia, David Lo, and John Grundy. 2022. A Survey on Deep Learning for Software Engineering.
ACM Computing Surveys (CSUR) 54, 10s (2022), 1–73.
[221] Jie Yao, Bingbing Rao, Weiwei Xing, and Liqiang Wang. 2022. Bug-transformer: Automated Program Repair Using
Attention-based Deep Neural Network. Journal of Circuits, Systems and Computers (JCSC) (2022), 2250210.
[222] Michihiro Yasunaga and Percy Liang. 2020. Graph-based, Self-supervised Program Repair from Diagnostic Feedback.
In International Conference on Machine Learning (ICML’20). PMLR, 10799–10808.
[223] Michihiro Yasunaga and Percy Liang. 2021. Break-it-fix-it: Unsupervised Learning for Program Repair. In International
Conference on Machine Learning (ICML’21). PMLR, 11941–11952.
[224] He Ye, Jian Gu, Matias Martinez, Thomas Durieux, and Martin Monperrus. 2022. Automated Classification of
Overfitting Patches with Statically Extracted Code Features. IEEE Transactions on Software Engineering (TSE) 48, 8
(2022), 2920–2938.
[225] He Ye, Matias Martinez, Thomas Durieux, and Martin Monperrus. 2019. A Comprehensive Study of Automatic
Program Repair on the QuixBugs Benchmark. In 2019 IEEE 1st International Workshop on Intelligent Bug Fixing (IBF’19).
IEEE, 1–10.
[226] He Ye, Matias Martinez, Xiapu Luo, Tao Zhang, and Martin Monperrus. 2022. Selfapr: Self-supervised Program Repair
with Test Execution Diagnostics. In 2022 37th IEEE/ACM International Conference on Automated Software Engineering
(ASE’22). IEEE.
[227] He Ye, Matias Martinez, and Martin Monperrus. 2022. Neural Program Repair with Execution-based Backpropagation.
In Proceedings of the 44th IEEE/ACM International Conference on Software Engineering (ICSE’22). 1506–1518.
[228] Wei Yuan, Quanjun Zhang, Tieke He, Chunrong Fang, Nguyen Quoc Viet Hung, Xiaodong Hao, and Hongzhi Yin.
2022. CIRCLE: Continual repair across programming languages. In Proceedings of the 31st ACM SIGSOFT International
Symposium on Software Testing and Analysis. 678–690.
[229] Yuan Yuan and Wolfgang Banzhaf. 2018. Arja: Automated Repair of Java Programs Via Multi-objective Genetic
Programming. IEEE Transactions on Software Engineering (TSE) 46, 10 (2018), 1040–1067.
[230] He Zhang, Muhammad Ali Babar, and Paolo Tell. 2011. Identifying Relevant Studies in Software Engineering.
Information and Software Technology (IST) 53, 6 (2011), 625–637.
[231] Jialu Zhang, José Cambronero, Sumit Gulwani, Vu Le, Ruzica Piskac, Gustavo Soares, and Gust Verbruggen. 2022.
Repairing Bugs in Python Assignments Using Large Language Models. arXiv preprint arXiv:2209.14876 (2022).
[232] Jiyang Zhang, Sheena Panthaplackel, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. 2022. Coditt5: Pretraining for
Source Code and Natural Language Editing. In 2022 37th IEEE/ACM International Conference on Automated Software
Engineering (ASE’22). IEEE.
[233] Mengshi Zhang, Yaoxian Li, Xia Li, Lingchao Chen, Yuqun Zhang, Lingming Zhang, and Sarfraz Khurshid. 2019.
An Empirical Study of Boosting Spectrum-based Fault Localization Via Pagerank. IEEE Transactions on Software
Engineering (TSE) 47, 6 (2019), 1089–1113.
[234] Quanjun Zhang, Chunrong Fang, Weisong Sun, Yan Liu, Tieke He, Xiaodong Hao, and Zhenyu Chen. 2023. Boosting
Automated Patch Correctness Prediction via Pre-trained Language Model. arXiv preprint arXiv:2301.12453 (2023).
[235] Quanjun Zhang, Chunrong Fang, Bowen Yu, Weisong Sun, Tongke Zhang, and Zhenyu Chen. 2023. Pre-Trained
Model-Based Automated Software Vulnerability Repair: How Far are We? IEEE Transactions on Dependable and Secure
Computing (2023).
[236] Quanjun Zhang, Chunrong Fang, Tongke Zhang, Weisong Sun, Bowen Yu, and Zhenyu Chen. 2023. GAMMA:
Revisiting Template-based Automated Program Repair via Mask Prediction. In Proceedings of the 38th IEEE/ACM
International Conference on Automated Software Engineering. 1–13.
[237] Quanjun Zhang, Tongke Zhang, Juan Zhai, Chunrong Fang, Bowen Yu, Weisong Sun, and Zhenyu Chen. 2023. A
Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated
Program Repair. arXiv preprint arXiv:2310.08879 (2023).
[238] Quanjun Zhang, Yuan Zhao, Weisong Sun, Chunrong Fang, Ziyuan Wang, and Lingming Zhang. 2022. Program
Repair: Automated Vs. Manual. arXiv preprint arXiv:2203.05166 (2022).
[239] Xindong Zhang, Chenguang Zhu, Yi Li, Jianmei Guo, Lihua Liu, and Haobo Gu. 2020. Precfix: Large-scale Patch
Recommendation by Mining Defect-patch Pairs. In Proceedings of the ACM/IEEE 42nd International Conference on
Software Engineering: Software Engineering in Practice (ICSE-SEIP’20). 41–50.
[240] Yuntong Zhang, Xiang Gao, Gregory J. Duck, and Abhik Roychoudhury. 2022. Program Vulnerability Repair Via
Inductive Inference. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.
A Survey of Learning-based Automated Program Repair 1:69

(ISSTA’22). 691–702.
[241] Zhou Zhou, Lili Bo, Xiaoxue Wu, Xiaobing Sun, Tao Zhang, Bin Li, Jiale Zhang, and Sicong Cao. 2022. Spvf: Security
Property Assisted Vulnerability Fixing Via Attention-based Models. Empirical Software Engineering (EMSE) 27, 7
(2022), 1–28.
[242] Qihao Zhu, Zeyu Sun, Yuan-an Xiao, Wenjie Zhang, Kang Yuan, Yingfei Xiong, and Lu Zhang. 2021. A Syntax-
guided Edit Decoder for Neural Program Repair. In Proceedings of the 29th ACM Joint Meeting on European Software
Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE’21). 341–353.
[243] Qihao Zhu, Zeyu Sun, Wenjie Zhang, Yingfei Xiong, and Lu Zhang. 2023. Tare: Type-Aware Neural Program Repair.
In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1443–1455.

ACM Trans. Softw. Eng. Methodol., Vol. 0, No. 0, Article 1. Publication date: 2023.

Beautifully Deranged - Nova Black
No ratings yet
Beautifully Deranged - Nova Black
177 pages
Psionics Augmented - Compilation 2
100% (5)
Psionics Augmented - Compilation 2
88 pages
Jacques Alain Miller Marginalia
100% (2)
Jacques Alain Miller Marginalia
22 pages
Learning Software Engineering
From Everand
Learning Software Engineering
IT Campus Academy
No ratings yet
Practical C++ Backend Programming
From Everand
Practical C++ Backend Programming
Justin Barbara
No ratings yet
Accelerated Computing with HIP
From Everand
Accelerated Computing with HIP
Yifan Sun
4.5/5 (2)
Application Observability with Elastic: Real-time metrics, logs, errors, traces, root cause analysis, and anomaly detection
From Everand
Application Observability with Elastic: Real-time metrics, logs, errors, traces, root cause analysis, and anomaly detection
Navin Sabharwal
No ratings yet
A Systematic Literature Review On Large Language Models
No ratings yet
A Systematic Literature Review On Large Language Models
39 pages
Towards Javascript Program Repair With Generative Pre-Trained Transformer (Gpt-2)
No ratings yet
Towards Javascript Program Repair With Generative Pre-Trained Transformer (Gpt-2)
7 pages
Shaping Program Repair Space With Existing Patches and Similar Code
No ratings yet
Shaping Program Repair Space With Existing Patches and Similar Code
12 pages
APPT Boosting Automated Patch Correctness Prediction Via Fine-Tuning Pre-Trained Models
No ratings yet
APPT Boosting Automated Patch Correctness Prediction Via Fine-Tuning Pre-Trained Models
21 pages
CURE: Code-Aware Neural Machine Translation For Automatic Program Repair
No ratings yet
CURE: Code-Aware Neural Machine Translation For Automatic Program Repair
13 pages
KNOD
No ratings yet
KNOD
13 pages
Repairing Bugs in Python Assignments Using Large Language Models
No ratings yet
Repairing Bugs in Python Assignments Using Large Language Models
10 pages
Repairing Bugs in Python Assignments Using Large Language Models
No ratings yet
Repairing Bugs in Python Assignments Using Large Language Models
12 pages
A Deep Dive Into Large Language Models For Automated Bug Localization and Repair
No ratings yet
A Deep Dive Into Large Language Models For Automated Bug Localization and Repair
23 pages
Repairagent: An Autonomous, Llm-Based Agent For Program Repair
No ratings yet
Repairagent: An Autonomous, Llm-Based Agent For Program Repair
14 pages
Learning Advanced Programming
From Everand
Learning Advanced Programming
IT Campus Academy
No ratings yet
Review 4 Repair
No ratings yet
Review 4 Repair
17 pages
Invalidator Automated Patch Correctness Assessment Via Semantic and Syntactic Reasoning
No ratings yet
Invalidator Automated Patch Correctness Assessment Via Semantic and Syntactic Reasoning
20 pages
Aspect-Oriented Programming in Practice: Definitive Reference for Developers and Engineers
From Everand
Aspect-Oriented Programming in Practice: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Machine Learning Mastery for Engineers
From Everand
Machine Learning Mastery for Engineers
Abdellatif Sadeq
No ratings yet
Algorithms Made Simple: Understanding the Building Blocks of Software
From Everand
Algorithms Made Simple: Understanding the Building Blocks of Software
William E. Clark
No ratings yet
Practical C++ Backend Programming: Crafting Databases, APIs, and Web Servers for High-Performance Backend
From Everand
Practical C++ Backend Programming: Crafting Databases, APIs, and Web Servers for High-Performance Backend
Justin Barbara
No ratings yet
Software Design And Development in your pocket
From Everand
Software Design And Development in your pocket
David Chen
5/5 (1)
Dataflow and Reactive Programming Systems
From Everand
Dataflow and Reactive Programming Systems
Matt Carkci
No ratings yet
Ranorex Automation Engineering: Definitive Reference for Developers and Engineers
From Everand
Ranorex Automation Engineering: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Basics of Programming: A Comprehensive Guide for Beginners: Essential Coputer Skills, #1
From Everand
Basics of Programming: A Comprehensive Guide for Beginners: Essential Coputer Skills, #1
DG. Junior
No ratings yet
Program Vulnerability Repair Via Inductive Inference
No ratings yet
Program Vulnerability Repair Via Inductive Inference
12 pages
Automated Patch Assessment For Program Repair at Scale
No ratings yet
Automated Patch Assessment For Program Repair at Scale
15 pages
Software Engineering New Approach (Traditional and Agile Methodologies)
From Everand
Software Engineering New Approach (Traditional and Agile Methodologies)
Ramisetty Rajeswara Rao
No ratings yet
Mastering C# Concurrency
From Everand
Mastering C# Concurrency
Agafonov Eugene
2/5 (2)
Explainable Automated Program Repair Final Paper
No ratings yet
Explainable Automated Program Repair Final Paper
5 pages
C++ OOP Made Simple: A Practical Guide with Examples
From Everand
C++ OOP Made Simple: A Practical Guide with Examples
William E. Clark
No ratings yet
Defect Prediction in Software Development & Maintainence
From Everand
Defect Prediction in Software Development & Maintainence
Rudra Kumar
No ratings yet
AI for Everyone: An Intermediate Guide to Artificial Intelligence
From Everand
AI for Everyone: An Intermediate Guide to Artificial Intelligence
Nova Clarke
No ratings yet
Java™ Programming: A Complete Project Lifecycle Guide
From Everand
Java™ Programming: A Complete Project Lifecycle Guide
Nitin Shreyakar
No ratings yet
Automated Network Technology: The Changing Boundaries of Expert Systems
From Everand
Automated Network Technology: The Changing Boundaries of Expert Systems
Carl P. Catalano Ph.D.
No ratings yet
Lexicon of Programming Terminology: Lexicon of Tech and Business, #17
From Everand
Lexicon of Programming Terminology: Lexicon of Tech and Business, #17
Mustafa Al-Dori
5/5 (1)
DeepBugs TR Nov2017 PDF
No ratings yet
DeepBugs TR Nov2017 PDF
15 pages
Software Testing Interview Questions You'll Most Likely Be Asked
From Everand
Software Testing Interview Questions You'll Most Likely Be Asked
Vibrant Publishers
No ratings yet
Comprehensive Guide to Appium Automation: Definitive Reference for Developers and Engineers
From Everand
Comprehensive Guide to Appium Automation: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Programming Best Practices for New Developers: A Practical Guide with Examples
From Everand
Programming Best Practices for New Developers: A Practical Guide with Examples
William E. Clark
No ratings yet
Software Reuse: Methods, Models, Costs, Second Edition
From Everand
Software Reuse: Methods, Models, Costs, Second Edition
Ronald J. Leach
No ratings yet
Professional Test Driven Development with C#: Developing Real World Applications with TDD
From Everand
Professional Test Driven Development with C#: Developing Real World Applications with TDD
James Bender
No ratings yet
Mastering the Craft: Unleashing the Art of Software Engineering
From Everand
Mastering the Craft: Unleashing the Art of Software Engineering
Kiran Nagesh
No ratings yet
Neutralino.js Essentials: Definitive Reference for Developers and Engineers
From Everand
Neutralino.js Essentials: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Debugging Like a Pro: A Practical Guide with Examples
From Everand
Debugging Like a Pro: A Practical Guide with Examples
William E. Clark
No ratings yet
On Hardware Security Bug Code Fixes by Prompting Large Language Models
No ratings yet
On Hardware Security Bug Code Fixes by Prompting Large Language Models
15 pages
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
From Everand
Hugging Face Transformers Essentials: From Fine-Tuning to Deployment
Robert Johnson
No ratings yet
Improving Bug Detection Via Context-Based Code Rep
No ratings yet
Improving Bug Detection Via Context-Based Code Rep
30 pages
Automated Programming and Program Repair
No ratings yet
Automated Programming and Program Repair
19 pages
Mining Software Repair Models
No ratings yet
Mining Software Repair Models
16 pages
Code, Bytes, Algorithms, And Innovation: Software & Engineering
From Everand
Code, Bytes, Algorithms, And Innovation: Software & Engineering
Tobi Makinde
No ratings yet
Building Desktop Applications with Electron: Definitive Reference for Developers and Engineers
From Everand
Building Desktop Applications with Electron: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Handbook of Cloud Computing: Basic to Advance research on the concepts and design of Cloud Computing
From Everand
Handbook of Cloud Computing: Basic to Advance research on the concepts and design of Cloud Computing
Dr. Anand Nayyar
No ratings yet
Effective Cucumber Automation: Definitive Reference for Developers and Engineers
From Everand
Effective Cucumber Automation: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Active Machine Learning with Python: Refine and elevate data quality over quantity with active learning
From Everand
Active Machine Learning with Python: Refine and elevate data quality over quantity with active learning
Margaux Masson-Forsythe
No ratings yet
CODING FOR ABSOLUTE BEGINNERS: How to Keep Your Data Safe from Hackers by Mastering the Basic Functions of Python, Java, and C++ (2022 Guide for Newbies)
From Everand
CODING FOR ABSOLUTE BEGINNERS: How to Keep Your Data Safe from Hackers by Mastering the Basic Functions of Python, Java, and C++ (2022 Guide for Newbies)
Eric Vargas
No ratings yet
Software Defect Prediction PPR
No ratings yet
Software Defect Prediction PPR
11 pages
Practical C++ Machine Learning: Hands-on strategies for developing simple machine learning models using C++ data structures and libraries
From Everand
Practical C++ Machine Learning: Hands-on strategies for developing simple machine learning models using C++ data structures and libraries
Anais Sutherland
No ratings yet
Performance Optimization Made Simple: A Practical Guide to Programming
From Everand
Performance Optimization Made Simple: A Practical Guide to Programming
William E. Clark
No ratings yet
Changing Humanities and Smart Application of Digital Technologies
From Everand
Changing Humanities and Smart Application of Digital Technologies
PublishDrive
No ratings yet
Acute GlomeruloNephritis - AGN
No ratings yet
Acute GlomeruloNephritis - AGN
36 pages
The Gender Issue in Language and Society - Trandafir Dorinel-Laurentiu
No ratings yet
The Gender Issue in Language and Society - Trandafir Dorinel-Laurentiu
44 pages
Philippine National Police: Id Application Form (PNP Dependent)
No ratings yet
Philippine National Police: Id Application Form (PNP Dependent)
1 page
English Year 4 - Paper 2
No ratings yet
English Year 4 - Paper 2
9 pages
Maxillary Incisor Based Objectives in Present Day o 2022 Seminars in Orthodo
No ratings yet
Maxillary Incisor Based Objectives in Present Day o 2022 Seminars in Orthodo
13 pages
Saep 394
No ratings yet
Saep 394
9 pages
Assignment 1
No ratings yet
Assignment 1
3 pages
We2 Test Unit6
100% (2)
We2 Test Unit6
2 pages
Fransiskus Daud Try Surya A Bahasa Inggris PTK PPG DALJAB 2
No ratings yet
Fransiskus Daud Try Surya A Bahasa Inggris PTK PPG DALJAB 2
47 pages
Lab 1 Cadence Tutorial
No ratings yet
Lab 1 Cadence Tutorial
36 pages
Rickshaw Bank: Management Case
No ratings yet
Rickshaw Bank: Management Case
15 pages
Step2: Introduction To Designing of Robots (Part2) : College of Engineering Roorkee
No ratings yet
Step2: Introduction To Designing of Robots (Part2) : College of Engineering Roorkee
6 pages
Conjunction: A Conjunction Joins Words, Phrases and Sentences Together
No ratings yet
Conjunction: A Conjunction Joins Words, Phrases and Sentences Together
1 page
Math10 Q3 SLM Module 8
No ratings yet
Math10 Q3 SLM Module 8
12 pages
Nice To Meet You
No ratings yet
Nice To Meet You
2 pages
Alternative Proposal 20160912 - Mtentu (Rev.2)
No ratings yet
Alternative Proposal 20160912 - Mtentu (Rev.2)
17 pages
Meenakari Origin and History
No ratings yet
Meenakari Origin and History
5 pages
Resolution-Creation of Sorsogon Provincial Office
No ratings yet
Resolution-Creation of Sorsogon Provincial Office
2 pages
02-20-2017 Stepped Pressure Cycle - A New Approach To Lorenz Cycle
No ratings yet
02-20-2017 Stepped Pressure Cycle - A New Approach To Lorenz Cycle
12 pages
Commissioning Program SOP Template
No ratings yet
Commissioning Program SOP Template
24 pages
Paradigm Shift Taylor Peterson
No ratings yet
Paradigm Shift Taylor Peterson
7 pages
Hauwam Muhammed - Updated CV
No ratings yet
Hauwam Muhammed - Updated CV
4 pages
1019 1024 1
No ratings yet
1019 1024 1
6 pages
Performance Evaluation of Maize Hybrids
No ratings yet
Performance Evaluation of Maize Hybrids
6 pages
Forging Temperature
No ratings yet
Forging Temperature
91 pages
3 Art Therapy Techniques To Deal With Anxiety PDF
No ratings yet
3 Art Therapy Techniques To Deal With Anxiety PDF
3 pages
The Role of Quantitative Techniques in Business and Management
No ratings yet
The Role of Quantitative Techniques in Business and Management
3 pages

A Survey of Learning-Based Automated Program Repair

Uploaded by

A Survey of Learning-Based Automated Program Repair

Uploaded by

1

A Survey of Learning-based Automated Program Repair

Google Scholar Automated Search

Group2 filter by pages

Figure 1. General workflow of the paper collection

3 BACKGROUND AND CONCEPTS

3.1 Automated Program Repair

Localization Phase Repair Phase

deployment overfitting patch

Figure 4. Overview of APR

3.2 Neural Machine Translation

Definition 3.2. ✍ Learning-based APR: Given a buggy code snippet 𝑋𝑖 = [𝑥 1, . . . , 𝑥𝑛 ] with

Figure 5. Detailed workflow of Learning-based APR

4.1 Overall Workflow

4.2 Fault Localization

✎ Summary ▶ As a preceding step in the learning-based APR workflow, fault localization

4.3 Data Pre-processing

int GetMaxCommonDivisor(int m, int n){ int GetMaxCommonDivisor(int m, int n){

Figure 6. A simple example of code abstraction

4.4 Patch Generation

4.5 Patch Ranking

4.6 Patch Validation

✎ Summary ▶ Dynamic execution is a common practice to automatically validate the

4.7 Patch Correctness

Table 1. A summary and comparison of learning-based APCA studies

Year Approach Language Feature Dataset Repository

Table 1 presents a summary of existing learning-based techniques to predict patch correctness

As early as 2020, inspirited by the similarity-based strategy in state-of-the-art traditional APCA

state embeddings as a sequence to another RNN encoder; 3) dependency enforcement model to

✎ Summary ▶ Overall, learning-based APR is generally applicable to different types of bugs

5 PRE-TRAINED MODEL-BASED REPAIR

5.1 Universal Pre-trained Model-based APR Techniques

5.2 Specific Pre-trained Model-based APR Techniques

Table 3. A summary and comparison of existing pre-trained model-based APR techniques

Year Study Type Model Language

Table 4. Detailed information on collected datasets in existing learning-based APR studies.

ID Name Language #Bugs Test Suite Training Testing Techniques URL

6.3 Empirical Study

Table 5. A summary and comparison of empirical studies in learning-based APR

Year Study Scope Language Description

✎ Summary ▶ As the APR research community embraces an influx of learning-based APR

7 APPLICATION AND DISCUSSION

7.1 Industrial Deployment

7.2 DL for Traditional APR

Year Approach Base Language Description

7.3 Open Science

Tool Language Dataset Hosting Site Link Accessibility SA DA TA URL

7.4 The Latest Advancements

8 IMPLICATION AND GUIDELINES

You might also like