Automated Programming and Program Repair
Abstract
The Dagstuhl Seminar 24431 on “Automated Programming and Program Repair” brought together
33 researchers from academia and industry to explore the intersection of automated code generation
and program repair. Over five days (October 21–25, 2024), participants discussed advances in large
language models (LLMs) for code generation, the role of automated program repair in improving
generated code, and challenges in deploying these technologies in real-world software development.
The seminar featured over 20 talks and three panel discussions on topics such as benchmarks for
LLM-generated code, trust in automated programming, and the broader applications of LLMs
beyond coding assistance. Key outcomes included identifying critical challenges in benchmarking, evaluation criteria, and developer adoption of automated repair techniques, as well as fostering future collaborations and actionable research directions in the field.
Seminar October 20–25, 2024 – https://www.dagstuhl.de/24431
2012 ACM Subject Classification Software and its engineering → Automatic programming;
Software and its engineering → Software testing and debugging
Keywords and phrases Auto-coding, Large Language Models, Automated Program Repair,
Program Synthesis, Trustworthy Software
Digital Object Identifier 10.4230/DagRep.14.10.39
1 Executive Summary
Shin Hwei Tan (Concordia University – Montreal, CA)
License Creative Commons BY 4.0 International license
© Shin Hwei Tan
Automated tools that generate and improve code promise to fundamentally change software
development. For example, there is a recent trend towards automated code generation from
large language models, as evidenced by the capabilities of Codex/Copilot, ChatGPT, and
GPT-4. These models, along with other techniques such as search-based and semantic analysis-based approaches, have the potential to automate significant parts of today’s software development
process. In particular, there are promising techniques for automated programming and
automated program repair. Automated programming refers to techniques that suggest
newly written code, e.g., in the form of code completion tools. The capabilities of such
tools have increased from moderately successful single-token predictions just a few years
ago to predicting entire functions with relatively high accuracy. Techniques for automated
programming include large language models that predict code based on natural language
specifications of the intended behavior.
On October 23, 2024, several talks took place in the morning, followed by an excursion to Mettlach and Villa Borg after lunch. On October 24, 2024, a few inspiring talks took place, followed by a panel discussion on “Obstacles for deploying program repair techniques”.
Overall, the seminar received very positive feedback from the participants, both in person and in writing (via email). Notably, one participant emailed one of the organizers saying that “It was my best Dagstuhl Seminar last October, and I really appreciate your organizing of the seminar once again”, indicating that the seminar left a strong impression compared to other Dagstuhl Seminars the participants have attended. Several participants also complimented Dagstuhl on the diversity of the social events (e.g., the excursion, the treetop walk, and the sauna) and on the babysitting services provided for participants attending the seminar with young children.
In terms of collaborations, a few actionable topics were discussed. An opinion piece titled “AI Software Engineer: Programming with Trust” is now available.1 Another potential collaboration is a critical review of benchmarks crafted by AI communities (e.g., SWE-Bench). Meanwhile, AutoCodeRover (presented by one of the organizers at Dagstuhl), an NUS spinoff, was officially acquired on February 19, 2025, by SonarSource, a leader in code quality via its static analysis solutions.
The seminar focused on the following key themes:
Topics at the intersection of automated programming and automated program repair,
analyzing progress in both fields.
Understanding common mistakes in automatically generated code.
Discussing the theme of “Trusted Automated Programming”, which focuses on:
How automatically generated code can be made more trustworthy.
How to generate evidence that improvements to auto-generated code maintain trustworthiness.
How to decide, based on such evidence, when to incorporate automatically generated
code into an existing software project with a stable code-base.
Important challenges in automated program repair and automated programming in
general.
Using large language models (LLMs) beyond just coding assistance.
Obstacles in deploying program repair techniques in real-world settings.
1 https://arxiv.org/abs/2502.13767
2 Table of Contents
Executive Summary
Shin Hwei Tan
Overview of Talks
Automatic Semantic Augmentation of Language Model Prompts
Earl T. Barr
RepairAgent: An Autonomous, LLM-Based Agent for Program Repair
Islem Bouzenia, Premkumar T. Devanbu, and Michael Pradel
Ya’ll are APRing the wrong thing
Yuriy Brun
Reflections on Automated Programming / Input Repair: Can LLMs help?
Cristian Cadar
Fault Localization (LLM + Heuristics): initial results
Celso G. Camilo-Junior
Reinventing ourselves
Satish Chandra
Automated Software A11Y Repair
Chunyang Chen
Enhancing Fault Localization with LLM Agents and Self-Reflection
Tse-Hsun Chen
Code from LLMs: Use, modify, or discard?
Premkumar T. Devanbu
Automated Scientific Debugging with LLMs
Sungmin Kang
Proactive Debugging
Dongsun Kim
Energy Consumption of Automated Program Repair
Matías Martínez
Fact Selection Problem in LLM Based Program Repair
Nikhil Parasaram
ChangeGuard: Validating Code Changes via Pairwise Learning-Guided Execution
Michael Pradel
Testing, Testing, 1-2-3: Test generation with LLMs
Nikitha Rao
Assured Automatic Programming via Large Language Models
Abhik Roychoudhury, Andreea Costea
AutoCodeRover: Autonomous Program Improvement
Abhik Roychoudhury
RepairBench: Leaderboard of Frontier Models for Program Repair
André Silva
Panel discussions
Benchmarks for LLM Code Generation
Satish Chandra, Premkumar T. Devanbu, Martin Monperrus, Gustavo Soares, and Lin Tan
LLM-beyond just coding-assistance
Ahmed E. Hassan, Lin Tan, and Shin Hwei Tan
Obstacles for deploying program repair techniques
Fernanda Madeiral, Premkumar T. Devanbu, and Gustavo Soares
Participants
3 Overview of Talks
3.1 Automatic Semantic Augmentation of Language Model Prompts
Earl T. Barr (University College London, GB)
License Creative Commons BY 4.0 International license
© Earl T. Barr
Joint work of Toufique Ahmed, Kunal Suresh Pai, Premkumar T. Devanbu, Earl T. Barr
Main reference Toufique Ahmed, Kunal Suresh Pai, Premkumar T. Devanbu, Earl T. Barr: “Automatic Semantic
Augmentation of Language Model Prompts (for Code Summarization)”, in Proc. of the 46th
IEEE/ACM International Conference on Software Engineering, ICSE 2024, Lisbon, Portugal, April
14-20, 2024, pp. 220:1–220:13, ACM, 2024.
URL https://doi.org/10.1145/3597503.3639183
Researchers are still learning how to best “program” LLMs via prompt engineering. We start
with the intuition that developers tend to consciously and unconsciously collect semantic facts from code while working. Most are shallow, simple facts arising from a quick read.
One might assume that LLMs are implicitly capable of performing simple “code analysis”
and extracting such information: but are they, really? If not, could explicitly adding
this information help? Our goal here is to investigate this question and evaluate whether explicitly augmenting an LLM’s prompt with automatically extracted semantic facts actually helps.
We find that adding semantic facts to the prompt actually does help! This approach improves
performance on the code summarization and completion tasks in several different settings
suggested by prior work, including for three different Large Language Models. In most cases,
we see improvements, as measured by a range of commonly-used metrics.
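As a concrete illustration of this idea, the following minimal sketch augments a code-summarization prompt with a few shallow semantic facts extracted by static analysis; the choice of facts and the prompt wording are illustrative assumptions, not the setup evaluated in the paper.

```python
# Minimal sketch: augment a code-summarization prompt with shallow semantic facts
# extracted by static analysis. The chosen facts and prompt wording are illustrative.
import ast

def extract_facts(source: str) -> list[str]:
    """Collect a few shallow semantic facts about the first function in `source`."""
    tree = ast.parse(source)
    func = next(n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef))
    params = [a.arg for a in func.args.args]
    calls = sorted({n.func.id for n in ast.walk(func)
                    if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)})
    has_return = any(isinstance(n, ast.Return) for n in ast.walk(func))
    return [
        f"The function is named '{func.name}' and takes parameters {params}.",
        f"It calls: {calls if calls else 'no other functions'}.",
        "It returns a value." if has_return else "It does not return a value.",
    ]

def build_prompt(source: str) -> str:
    facts = "\n".join(f"- {fact}" for fact in extract_facts(source))
    return ("Summarize the following function in one sentence.\n"
            "Facts derived by static analysis:\n"
            f"{facts}\n"
            "Code:\n"
            f"{source}")

if __name__ == "__main__":
    print(build_prompt("def mean(xs):\n    return sum(xs) / len(xs)\n"))
```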
RepairAgent: An Autonomous, LLM-Based Agent for Program Repair
Islem Bouzenia, Premkumar T. Devanbu, and Michael Pradel
Automated program repair has emerged as a powerful technique to mitigate the impact of
software bugs on system reliability and user experience. This paper introduces RepairAgent,
the first work to address the program repair challenge through an autonomous agent based
on a large language model (LLM). Unlike existing deep learning-based approaches, which
prompt a model with a fixed prompt or in a fixed feedback loop, our work treats the LLM as
an agent capable of autonomously planning and executing actions to fix bugs by invoking
suitable tools. RepairAgent freely interleaves gathering information about the bug, gathering
repair ingredients, and validating fixes, while deciding which tools to invoke based on the
gathered information and feedback from previous fix attempts. Key contributions that enable
RepairAgent include a set of tools that are useful for program repair, a dynamically updated
prompt format that allows the LLM to interact with these tools, and a finite state machine
that guides the agent in invoking the tools. Our evaluation on the popular Defects4J dataset
demonstrates RepairAgent’s effectiveness in autonomously repairing 164 bugs, including
39 bugs not fixed by prior techniques. Interacting with the LLM imposes an average cost
of 270,000 tokens per bug, which, under the current pricing of OpenAI’s GPT-3.5 model,
translates to 14 cents per bug. To the best of our knowledge, this work is the first to present
an autonomous, LLM-based agent for program repair, paving the way for future agent-based
techniques in software engineering.
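As a rough illustration of the finite-state-machine-guided tool invocation described above, the following simplified sketch restricts which tool an LLM may invoke next and feeds tool output back into the agent's context; the states, tool set, and LLM stub are illustrative assumptions, not RepairAgent's actual design.

```python
# Minimal sketch of an FSM-guided repair agent: the FSM restricts which tool the
# LLM may invoke next, and tool output is fed back into the agent's context.
FSM = {
    "start":            ["read_code"],
    "read_code":        ["run_failing_test"],
    "run_failing_test": ["draft_patch"],
    "draft_patch":      ["apply_patch"],
    "apply_patch":      ["run_tests"],
    "run_tests":        ["done", "draft_patch"],   # retry with a new patch if tests fail
}

def query_llm(state, allowed, context):
    """Stand-in for an LLM call that picks the next tool from the allowed set."""
    if "done" in allowed:
        return "done" if "all tests pass" in context else "draft_patch"
    return allowed[0]

def repair(tools, max_steps=20):
    state, context = "start", ""
    for _ in range(max_steps):
        action = query_llm(state, FSM[state], context)
        if action == "done":
            return True
        context += f"\n[{action}] {tools[action]()}"   # tool output feeds the next step
        state = action
    return False

if __name__ == "__main__":
    dummy_tools = {
        "read_code": lambda: "def add(a, b): return a - b",
        "run_failing_test": lambda: "assert add(1, 2) == 3 failed",
        "draft_patch": lambda: "replace '-' with '+'",
        "apply_patch": lambda: "patch applied",
        "run_tests": lambda: "all tests pass",
    }
    print("repaired:", repair(dummy_tools))
```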
Ya’ll are APRing the wrong thing
Yuriy Brun
Automated program repair (APR) has advanced program synthesis technology but is limited
by the problem of weak specifications. The domain of proof synthesis for formal verification
has many of the same challenges as APR but has access to a strong oracle – the theorem
prover – to determine proof correctness. This oracle helps overcome the overfitting problem
for proof synthesis. We have developed several tools for synthesizing proofs from scratch using
natural language processing, including TacTok, Passport, Diva, Baldur, and Cobblestone.
References
1 First, Emily and Brun, Yuriy and Guha, Arjun, TacTok: Semantics-aware proof synthesis, OOPSLA 2020
2 Sanchez-Stern, Alex and First, Emily and Zhou, Timothy and Kaufman, Zhanna and Brun, Yuriy and Ringer, Talia, Passport: Improving automated formal verification using identifiers, TOPLAS 2023
3 Sanchez-Stern, Alex and Varghese, Abhishek and Kaufman, Zhanna and Zhang, Dylan and Ringer, Talia and Brun, Yuriy, QEDCartographer: Automating formal verification using reward-free reinforcement learning, ICSE 2025, to appear
4 First, Emily and Brun, Yuriy, Diversity-driven automated formal verification, ICSE 2022
5 First, Emily and Rabe, Markus N. and Ringer, Talia and Brun, Yuriy, Baldur: Whole-proof generation and repair with large language models, FSE 2023
6 Brun, Yuriy, Demo for Proofster: Automated Formal Verification, https://www.youtube.com/watch?v=xQAi66lRfwI
Fault Localization (LLM + Heuristics): initial results
Celso G. Camilo-Junior
Fault localization is an important part of the software debugging process: it requires substantial effort and investment, and it directly impacts the quality of the produced code.
With the emergence of LLMs and their potential, there is an opportunity to apply them to fault localization as well. However, discovering the best way to integrate known techniques with LLMs is a challenge.
This presentation therefore shows some initial results on when the use of LLMs for fault localization, utilizing a few facts, yields reasonable results, and which scenarios still call for more complex methods. Moreover, it presents an architecture that aims to enrich the necessary inputs, such as better problem descriptions, so that these tools can perform at their best.
The initial results show that, for some simpler problems, using a generic LLM (GPT-3.5 or Llama 3) with a good human-written or artificial description yields reasonable results. However, in more complex scenarios, the combination of an LLM with spectrum-based heuristics shows more promising outcomes.
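As an illustration of combining an LLM with a spectrum-based heuristic, the following minimal sketch ranks program locations by a weighted mix of Ochiai suspiciousness and an LLM-provided relevance score; the weighting scheme and the llm_score values are illustrative assumptions, not the integration used in the talk.

```python
# Minimal sketch: combine spectrum-based (Ochiai) suspiciousness with an LLM score.
import math

def ochiai(failed_cov: int, passed_cov: int, total_failed: int) -> float:
    """Spectrum-based suspiciousness of one program element."""
    denom = math.sqrt(total_failed * (failed_cov + passed_cov))
    return failed_cov / denom if denom else 0.0

def rank_locations(coverage, total_failed, llm_score, alpha=0.5):
    """coverage: {location: (failed_cov, passed_cov)}; llm_score: {location: score in [0, 1]}."""
    combined = {
        loc: alpha * ochiai(f, p, total_failed) + (1 - alpha) * llm_score.get(loc, 0.0)
        for loc, (f, p) in coverage.items()
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    cov = {"foo.py:12": (3, 1), "foo.py:40": (1, 5), "bar.py:7": (0, 6)}
    llm = {"foo.py:12": 0.8, "foo.py:40": 0.3}   # hypothetical LLM relevance scores
    print(rank_locations(cov, total_failed=3, llm_score=llm))
```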
Reinventing ourselves
Satish Chandra
In this talk, I will cover a few related themes. First, I will share my views on how modern LLMs are upending the field of software engineering with capabilities that only a few years ago required creative solutions. Next, I’ll describe how at Google we have been working on weaving AI capabilities into developer workflows, how we collect developer data, and how we prioritize our work (based on a recent blog post: https://research.google/blog/ai-in-software-engineering-at-google-progress-and-the-path-ahead/). I will then change tracks a bit and discuss the importance of evals for software engineering tasks; the more real the better. Finally, I will segue into our preliminary investigation into understanding our own bug database at Google, and how it may compare in distribution to SWE-bench. I’ll share our preliminary experience with agentic ways of resolving these internal bugs.
Code from LLMs: Use, modify, or discard?
Premkumar T. Devanbu
Code generated by language models is often flawed. And yet, LLMs can produce a lot of useful code. How is a developer to decide whether a given generated fragment of code should be used or not? We consider this to be a joint human-AI decision problem, where it is vitally important for the AI (LLM) to reveal to the human some trustworthy indication of the expected likelihood of the code being correct.
References
1 Spiess, Claudio and Gros, David and Pai, Kunal Suresh and Pradel, Michael and Rabin,
Md Rafiqul Islam and Alipour, Amin and Jha, Susmit and Devanbu, Prem and Ahmed,
Toufique, Calibration and correctness of language models for code, ICSE 2025, to appear
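As a rough illustration of the kind of indication discussed above, the following sketch turns per-token log-probabilities into a simple confidence signal for triaging generated code; the thresholds, the decision rule, and the assumed availability of token log-probs are illustrative assumptions, not the calibrated measures studied in the referenced paper.

```python
# Minimal sketch: geometric-mean token probability as a confidence signal.
import math

def confidence(token_logprobs: list[float]) -> float:
    """exp of the average log-probability, i.e., the geometric-mean token probability."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def triage(token_logprobs, accept=0.9, review=0.6):
    """Map a confidence value onto a use/modify/discard-style decision."""
    c = confidence(token_logprobs)
    if c >= accept:
        return "use"
    return "review" if c >= review else "discard"

if __name__ == "__main__":
    # Hypothetical per-token log-probs returned by a model for a generated snippet.
    print(triage([-0.02, -0.05, -0.01, -0.30]))
```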
Automated Scientific Debugging with LLMs
Sungmin Kang
Existing work on the application of automated program repair (APR) techniques in industry suggests that providing explanations for automatically generated patches could help developer adoption of such tools. We first define criteria for satisfying explanations of patches, then propose the Automated Scientific Debugging (AutoSD) technique, inspired by debugging frameworks used by human developers, to uncover facts about the debugging scenario and integrate these facts with hypotheses about what is causing the bug. We present empirical results demonstrating that AutoSD achieves repair performance comparable to other techniques, and that its explanations could help developers assess the correctness of the patches it generates.
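A minimal sketch of the hypothesis-and-experiment loop behind this style of debugging is shown below; the hypothesis generator is a hard-coded stand-in for an LLM and the bug is a toy example, so AutoSD's actual prompts and debugger integration are not modeled here.

```python
# Minimal sketch of a scientific-debugging loop: propose a hypothesis, run an
# experiment that could refute it, and emit a patch plus an explanation trail.
def mean(xs):
    return sum(xs) / len(xs)                      # buggy: crashes on empty input

def propose_hypothesis(observation: str, history: list) -> dict:
    """Stand-in for an LLM that turns an observation into a testable hypothesis."""
    return {
        "hypothesis": "len(xs) can be 0, causing a ZeroDivisionError",
        "experiment": lambda: mean([]),           # run the code under the suspected condition
        "patch": "return sum(xs) / len(xs) if xs else 0.0",
    }

def debug(observation: str, max_rounds: int = 3):
    history = []
    for _ in range(max_rounds):
        h = propose_hypothesis(observation, history)
        try:
            h["experiment"]()
            verdict = "rejected"                  # no failure observed
        except Exception as e:
            verdict = f"confirmed ({type(e).__name__})"
        history.append((h["hypothesis"], verdict))
        if verdict.startswith("confirmed"):
            return h["patch"], history            # patch plus an explanation trail
    return None, history

if __name__ == "__main__":
    patch, trail = debug("mean() crashes on some inputs")
    print(trail)
    print("suggested patch:", patch)
```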
Proactive Debugging
Dongsun Kim
Automated program repair (APR) techniques have successfully fixed program bugs, but they still follow the classical debugging process, which is not very different from manual debugging. Although the techniques are fully or partially automated, we have to wait for a bug report or symptoms, reproduce the bug, localize it, and fix it. However, it might not be necessary to follow this classical, reactive process. What if we could proactively change the source code first to fix a bug?
This talk presents a series of our initial works on proactive debugging applied to specific bug types, such as memory leaks in single-page web applications and blocking-call bugs in reactive applications. In particular, the talk demonstrates the effectiveness of pattern-based patch generation as the first step of proactive debugging. To show the effectiveness of proactive debugging, the work conducts a series of experiments, whose results show that the proactive process significantly helps developers address bugs in real software systems. The talk also illustrates further challenges in improving proactive debugging, such as fully automated evidence collection and patch generation techniques.
Energy Consumption of Automated Program Repair
Matías Martínez
Automated program repair (APR) aims to automate the process of repairing software bugs in order to reduce the cost of maintaining software programs. Moreover, the success (measured by the accuracy metric) of APR approaches has increased in recent years. However, no previous work has considered the energy impact of repairing bugs automatically using APR. The field of green software research aims to measure the energy consumption required to develop, maintain, and use software products. This work combines, for the first time, the APR and green software research fields. Our main goal is to define the foundation for measuring the energy consumption of the APR activity. We measure the energy consumption of ten traditional program repair tools for Java and ten Large Language Models (LLMs) fine-tuned on source code, trying to repair real bugs from Defects4J, a set of real buggy programs. The initial results from this experiment show the trade-off between energy consumption and the ability to correctly repair bugs: some APR tools are capable of achieving higher accuracy while spending less energy than other tools.
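As a rough illustration of how the energy of one repair run can be measured on Linux, the following sketch reads the Intel RAPL counter before and after a command; the sysfs path, the example command, and ignoring counter wrap-around are simplifying assumptions (reading the counter may also require elevated permissions), and this is not the measurement setup used in the study.

```python
# Minimal sketch: energy of one repair run via the Linux RAPL package counter.
import subprocess
import time

RAPL = "/sys/class/powercap/intel-rapl:0/energy_uj"   # package-level counter (Linux/Intel)

def read_energy_uj() -> int:
    with open(RAPL) as f:
        return int(f.read().strip())

def measure(cmd: list[str]) -> tuple[float, float]:
    """Return (energy in joules, wall-clock seconds) for one command."""
    e0, t0 = read_energy_uj(), time.time()
    subprocess.run(cmd, check=False)
    e1, t1 = read_energy_uj(), time.time()
    return (e1 - e0) / 1e6, t1 - t0

if __name__ == "__main__":
    # Hypothetical repair-tool invocation for one Defects4J bug.
    joules, secs = measure(["java", "-jar", "apr-tool.jar", "--bug", "Chart-1"])
    print(f"{joules:.1f} J over {secs:.1f} s")
```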
Fact Selection Problem in LLM Based Program Repair
Nikhil Parasaram
Recent research shows that including bug-related facts (categories of information), such as stack traces and GitHub issues, enhances LLMs’ bug-fixing abilities. To determine the optimal facts and their quantity, we studied over 19K prompts with various fact combinations to repair 314 bugs from the BugsInPy benchmark. Each fact, from syntactic details to unexplored semantic information like angelic values, proved beneficial in fixing specific bugs. Notably, effectiveness is non-monotonic: too many facts reduce performance. We defined the fact selection problem of finding the optimal facts for each task and developed MANIPLE, a model that selects facts tailored to a given bug, outperforming existing methods and repairing 17% more bugs than the best alternative.
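A minimal sketch of the underlying selection step appears below: rather than including every available fact, pick a small, bug-specific subset under a token budget. The lexical relevance score is a toy heuristic standing in for a learned model such as MANIPLE, and the budget and facts are illustrative.

```python
# Minimal sketch: greedily select the most bug-relevant facts under a token budget.
def score_fact(fact: str, bug_report: str) -> float:
    """Toy relevance score: lexical overlap with the bug report."""
    fact_words, report_words = set(fact.lower().split()), set(bug_report.lower().split())
    return len(fact_words & report_words) / (len(fact_words) or 1)

def select_facts(facts: dict[str, str], bug_report: str, budget_tokens: int = 300):
    """Pick the highest-scoring facts that fit in the budget (crude token estimate)."""
    chosen, used = [], 0
    for name, text in sorted(facts.items(),
                             key=lambda kv: score_fact(kv[1], bug_report), reverse=True):
        cost = len(text.split())
        if used + cost <= budget_tokens:
            chosen.append((name, text))
            used += cost
    return chosen

if __name__ == "__main__":
    facts = {
        "stack_trace": "ZeroDivisionError in stats.py line 42 inside mean",
        "github_issue": "mean crashes when the input list is empty",
        "angelic_value": "replacing the return value with 0.0 makes the failing test pass",
    }
    print(select_facts(facts, "Crash: ZeroDivisionError when list is empty"))
```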
ChangeGuard: Validating Code Changes via Pairwise Learning-Guided Execution
Michael Pradel
Code changes are an integral part of the software development process. Many code changes
are meant to improve the code without changing its functional behavior, e.g., refactorings and
performance improvements. Unfortunately, validating whether a code change preserves the
behavior is non-trivial, particularly when the code change is performed deep inside a complex
project. This talk presents ChangeGuard, an approach that uses learning-guided execution
to compare the runtime behavior of the original and the modified version of a function. The approach is enabled by the
novel concept of pairwise learning-guided execution and by a set of techniques that improve
the robustness and coverage of the state-of-the-art learning-guided execution technique. Our
evaluation applies ChangeGuard to a dataset of 224 manually annotated code changes from
popular Python open-source projects and to three datasets of code changes obtained by
applying automated code transformations. Our results show that the approach identifies
semantics-changing code changes with a precision of 77.1% and a recall of 69.5%, and that
it detects unexpected behavioral changes introduced by automatic code refactoring tools.
In contrast, the existing regression tests of the analyzed projects miss the vast majority of
semantics-changing code changes, with a recall of only 7.6%. We envision our approach being
useful for detecting unintended behavioral changes early in the development process and for
improving the quality of automated code transformations.
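A minimal sketch of the pairwise idea, running both versions of a function on the same inputs and flagging any divergence, is shown below; ChangeGuard additionally relies on learning-guided execution to run code fragments deep inside a project, which this sketch does not model, and the example functions are illustrative.

```python
# Minimal sketch: detect semantics-changing edits by comparing old vs. new behavior.
def run(func, args):
    try:
        return ("ok", func(*args))
    except Exception as e:
        return ("error", type(e).__name__)

def semantics_changed(old, new, input_sets) -> bool:
    return any(run(old, args) != run(new, args) for args in input_sets)

# Example: a "refactoring" that silently changes behavior on negative numbers.
def old_abs_sum(xs):
    return sum(abs(x) for x in xs)

def new_abs_sum(xs):
    return abs(sum(xs))                           # not equivalent!

if __name__ == "__main__":
    inputs = [([1, 2, 3],), ([-1, 2],), ([],)]
    print(semantics_changed(old_abs_sum, new_abs_sum, inputs))  # True
```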
Testing, Testing, 1-2-3: Test generation with LLMs
Nikitha Rao
LLMs can generate code that is highly similar to that written by humans. However, current
models are trained to generate each file separately, as is standard practice in natural language
processing. They thus fail to generate meaningful tests. My work leverages software artifacts
like code and natural language specifications to make LLM-based test generation more
reliable and useful. This includes (1) CAT-LM, an LLM trained to explicitly consider the
mapping between code and test files to improve the quality of tests generated. This not
only ensures code correctness, but also optimizes for other software testing metrics such as
pass/compile rate and code coverage. (2) DiffSpec, a prompt chaining based framework that
leverages artifacts like specification documents, bug reports, and code implementations for
generating differential tests using LLMs. We evaluated DiffSpec on eBPF and Wasm, and it generated over 500 differentiating tests that found real bugs.
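A minimal sketch of the differential-testing step follows: feed the same input to two implementations and report divergences. The two implementations here are toy stand-ins rather than eBPF/Wasm engines, and in DiffSpec the inputs would come from LLM-generated tests.

```python
# Minimal sketch: differential testing of two implementations of the same spec.
def impl_a(program: str) -> int:
    return sum(ord(c) for c in program) % 251

def impl_b(program: str) -> int:
    return sum(ord(c) for c in program.strip()) % 251   # subtle difference: strips whitespace

def differential_test(inputs):
    """Return the inputs on which the two implementations disagree."""
    return [p for p in inputs if impl_a(p) != impl_b(p)]

if __name__ == "__main__":
    candidate_inputs = ["add 1 2", "  add 1 2", "nop", "halt \n"]
    print("divergence-inducing inputs:", differential_test(candidate_inputs))
```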
Assured Automatic Programming via Large Language Models
Abhik Roychoudhury, Andreea Costea
With the advent of LLMs in automatic programming, interest in trusted automatic programming via LLMs has increased. Unfortunately, it is difficult to give any guarantees about code generated by LLMs, partly because a detailed specification of the intended behavior is usually not available. In this talk we show how to alleviate this lack of functionality specifications by aligning code automatically generated by LLMs, formal specifications automatically obtained from natural language using LLMs, and tests. The conformance between generated programs, generated specifications, and tests does not provide absolute guarantees, but it enhances trust. Establishing such conformance also helps us uncover the likely intended program behavior.
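A minimal sketch of the conformance idea is given below, with an executable Python predicate standing in for a generated formal specification; the implementation, specification, and tests are toy artifacts, and agreement among them raises trust without proving correctness.

```python
# Minimal sketch: cross-check a generated implementation, a generated spec, and tests.
def generated_impl(xs):
    return sorted(xs)

def generated_spec(inp, out) -> bool:
    """Postcondition: output is a permutation of the input and is ordered."""
    return sorted(inp) == sorted(out) and all(a <= b for a, b in zip(out, out[1:]))

TESTS = [([3, 1, 2], [1, 2, 3]), ([], []), ([5, 5], [5, 5])]

def conformance() -> bool:
    impl_vs_tests = all(generated_impl(i) == o for i, o in TESTS)
    impl_vs_spec = all(generated_spec(i, generated_impl(i)) for i, _ in TESTS)
    spec_vs_tests = all(generated_spec(i, o) for i, o in TESTS)
    return impl_vs_tests and impl_vs_spec and spec_vs_tests

if __name__ == "__main__":
    print("all three artifacts agree:", conformance())
```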
AutoCodeRover: Autonomous Program Improvement
Abhik Roychoudhury
Researchers have made significant progress in automating the software development process in the past decades. Recent progress in Large Language Models (LLMs) has significantly impacted the development process, where developers can use LLM-based programming assistants to achieve automated coding. Nevertheless, software engineering involves the process
of program improvement apart from coding, specifically to enable software maintenance
(e.g. bug fixing) and software evolution (e.g. feature additions). We propose an automated
approach for solving GitHub issues to autonomously achieve program improvement. In our
approach called AutoCodeRover, LLMs are combined with sophisticated code search capabilities, ultimately leading to a program modification or patch. In contrast to recent LLM agent
approaches from AI researchers and practitioners, our outlook is more software engineering
oriented. We work on a program representation (abstract syntax tree) as opposed to viewing
a software project as a mere collection of files. Our code search exploits the program structure
in the form of classes/methods to enhance the LLM’s understanding of the issue’s root cause, and to effectively retrieve context via iterative search. The use of spectrum-based fault localization with tests further sharpens the context, as long as a test suite is available. Experiments
on SWE-bench-lite (300 real-life GitHub issues) show increased efficacy in solving GitHub
issues (19% on SWE-bench-lite), which is higher than the efficacy of the recently reported
SWE-agent. In addition, AutoCodeRover achieved this efficacy with significantly lower cost
(on average, $0.43 USD), compared to other baselines. We posit that our workflow enables
autonomous software engineering, where, in the future, auto-generated code from LLMs can be
autonomously improved.
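A minimal sketch of structure-aware context retrieval in this spirit is shown below (this is not AutoCodeRover's actual search API): it scans a project's Python ASTs for classes and functions whose names occur in the issue text and returns their source as candidate context for the LLM.

```python
# Minimal sketch: AST-based code search keyed on identifiers mentioned in an issue.
import ast
import re
from pathlib import Path

def search_code(project_root: str, keywords: set[str]):
    hits = []
    for path in Path(project_root).rglob("*.py"):
        source = path.read_text(encoding="utf-8", errors="ignore")
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue
        for node in ast.walk(tree):
            if isinstance(node, (ast.ClassDef, ast.FunctionDef)) and node.name in keywords:
                hits.append((str(path), node.name, ast.get_source_segment(source, node)))
    return hits

if __name__ == "__main__":
    issue = "TypeError in Matrix.transpose when the matrix is empty"   # hypothetical issue
    keywords = set(re.findall(r"[A-Za-z_]\w*", issue))                 # crude keyword extraction
    for path, name, snippet in search_code(".", keywords):
        print(f"{path}: {name}\n{snippet}\n")
```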
RepairBench: Leaderboard of Frontier Models for Program Repair
André Silva
AI-driven program repair uses AI models to repair buggy software by producing patches. Rapid advancements in AI surely impact the state-of-the-art performance of program repair.
Yet, grasping this progress requires frequent and standardized evaluations. We propose
RepairBench, a novel leaderboard for AI-driven program repair. The key characteristics of
RepairBench are: 1) it is execution-based: all patches are compiled and executed against a
test suite, 2) it assesses frontier models in a frequent and standardized way. RepairBench
leverages two high-quality benchmarks, Defects4J and GitBug-Java, to evaluate frontier
models against real-world program repair tasks. We publicly release the evaluation framework
of RepairBench. We will update the leaderboard as new frontier models are released.
In this lightning talk, I will discuss some of our recent efforts on debugging and fixing fairness issues in machine learning, and share thoughts about possibly reusing some of these techniques for improving the trustworthiness of automated code repair.
Recent techniques use deep learning, including Large Language Models, for automatic programming, including automated program repair. An important question is whether adding more data to train deep learning models or adding domain knowledge to the models is the more promising and effective direction for improving automatic programming. I will discuss existing studies and techniques that answer this question positively or negatively:
References
1 Xie, Danning and Zhang, Zhuo and Jiang, Nan and Xu, Xiangzhe and Tan, Lin and Zhang, Xiangyu, ReSym: Harnessing LLMs to recover variable and data structure symbols from stripped binaries, CCS 2024
2 Nan Jiang and Chengxiao Wang and Kevin Liu and Xiangzhe Xu and Lin Tan and Xiangyu Zhang and Petr Babkin, Nova: Generative Language Models for Assembly Code with Hierarchical Attention and Contrastive Learning, arXiv
3 Shanchao Liang and Yiran Hu and Nan Jiang and Lin Tan, Can Language Models Replace Programmers? REPOCOD Says ‘Not Yet’, arXiv
4 Jiang, Nan and Liu, Kevin and Lutellier, Thibaud and Tan, Lin, Impact of Code Language
Models on Automated Program Repair, ICSE 2023
5 Wu, Yi and Jiang, Nan and Pham, Hung Viet and Lutellier, Thibaud and Davis, Jordan
and Tan, Lin and Babkin, Petr and Shah, Sameena, How Effective are Neural Networks for
Fixing Security Vulnerabilities?, ISSTA 2023
Programming requires awareness of programming language knowledge, such as the syntax, the type system, and the semantics. Such knowledge is not easily learnable from end-to-end training. In this talk I will introduce a series of works that integrate programming language knowledge into neural models.
APR has been actively researched for the last 15 years, introducing various approaches such as template-based, learning-based, and semantics-based repair. In particular, the repairability of APR tools has improved significantly over the years. For example,
jGenProg, introduced in 2017, could correctly fix only 2% of the 224 bugs in the Defects4J
benchmark, while the latest tool, SRepair, can correctly fix 45% of the 695 bugs in the
extended benchmark. This is a remarkable achievement in the field of APR.
While research on repairability should continue, there is another important aspect of
APR that has received relatively less attention: APR efficiency. More recent APR tools using
template-based or learning-based approaches employ a simplistic method for patch-space exploration. They simply enumerate the patch candidates in a predefined order based
on the suspiciousness scores of the program locations to which the patch candidates are
applied. Such an approach is suboptimal because it does not consider the runtime information
obtained during the repair process.
In this talk, I present two recent works of ours that improve APR efficiency. We show
how the APR scheduling algorithm can be viewed as a multi-armed bandit problem, and
how this viewpoint can be used to improve APR efficiency.
The slides for this talk are available at the following link: https://www.jooyongyi.com/slides/Seminar/Dagstuhl/2024/APR_from_fuzzing_perspective.html
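As a rough illustration of the multi-armed-bandit viewpoint, the following sketch schedules patch validation with UCB1, where each arm is a group of patch candidates (e.g., one suspicious location) and the reward is validation feedback; the grouping, reward signal, and exploration constant are illustrative assumptions, not the exact formulation from the talk.

```python
# Minimal sketch: UCB1-based scheduling of patch-candidate groups ("arms").
import math
import random

def ucb1_schedule(arms, budget: int, explore: float = 2.0):
    """arms: {name: callable returning a reward in [0, 1] per validated candidate}."""
    counts = {a: 0 for a in arms}
    rewards = {a: 0.0 for a in arms}
    history = []
    for t in range(1, budget + 1):
        untried = [a for a in arms if counts[a] == 0]
        if untried:
            choice = untried[0]
        else:
            choice = max(arms, key=lambda a: rewards[a] / counts[a]
                         + math.sqrt(explore * math.log(t) / counts[a]))
        r = arms[choice]()                      # validate one candidate from this group
        counts[choice] += 1
        rewards[choice] += r
        history.append((choice, r))
    return history

if __name__ == "__main__":
    random.seed(0)
    # Hypothetical arms: one per suspicious location, with different fix likelihoods.
    arms = {
        "loc_A_templates": lambda: random.choices([1.0, 0.2], weights=[0.3, 0.7])[0],
        "loc_B_templates": lambda: random.choices([1.0, 0.2], weights=[0.05, 0.95])[0],
    }
    for choice, reward in ucb1_schedule(arms, budget=10):
        print(choice, reward)
```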
In recent years, Large Language Models (LLMs), such as GPT-4 and Claude-3.5, have shown
impressive performance in various downstream applications. In this talk, I will discuss the
potential impact of modern LLMs on automated programming, including both program
repair and synthesis (e.g., AlphaRepair, ChatRepair, and Agentless). In addition, I will
talk about our recent work on rigorous testing/benchmarking of LLMs in automated
programming (e.g., EvalPlus and SWE-bench Lite-S).
4 Panel discussions
4.1 Benchmarks for LLM Code Generation
Satish Chandra (Google – Mountain View, US), Premkumar T. Devanbu (University of
California – Davis, US), Martin Monperrus (KTH Royal Institute of Technology – Stockholm,
SE), Gustavo Soares (Microsoft Corporation – Redmond, US), Lin Tan (Purdue University –
West Lafayette, US)
License Creative Commons BY 4.0 International license
© Satish Chandra, Premkumar T. Devanbu, Martin Monperrus, Gustavo Soares, and Lin Tan
The panel discussion (organized in a fishbowl format) was led by an academic researcher (Martin Monperrus) and an industry researcher (Satish Chandra). Existing evaluation benchmarks (e.g., HumanEval) seem rather simple and do not reflect reality. Recent ones try to address this problem by focusing on more realistic tasks (SWE-bench, LiveCodeBench, Long Code Arena). However, there is still a lack of benchmarks specifically for code maintenance tasks (e.g., refactoring, repo-level changes, code reviewing, code understanding and reasoning, and performance improvement). An example of such a benchmark is CRQBench, a benchmark of code reasoning questions. The panel also mentioned a possible collaboration on a “Critical Review on SWE-Bench”. Another possible solution is to have a benchmark track in SE conferences to encourage benchmark curation. The discussion also centered on several challenges in benchmarks for LLM code generation, including curating good evaluation benchmarks, designing evaluation criteria, and maintaining and evolving evaluation sets.
4.2 LLM-beyond just coding-assistance
Ahmed E. Hassan, Lin Tan, and Shin Hwei Tan
The panel discussion (organized in a fishbowl format) was led by Ahmed E. Hassan. The discussion focused on the future of automated program repair (i.e., APR.Next) and on the challenges and opportunities in moving from Software Engineering 1.0 to Software Engineering 3.0, where systems are built around AI agents. Despite the advancement of AI-based techniques, traditional APR techniques still play important roles in: (1) domain-specific tasks such as Android repair and fixing accessibility issues, and (2) improving LLM-generated code by fixing bugs introduced by LLMs using traditional approaches. Other tasks that are worthwhile to explore include (1) research related to binary repair, and (2) code translation (e.g., C to Rust, or translation from COBOL to modern languages).
4.3 Obstacles for deploying program repair techniques
Fernanda Madeiral, Premkumar T. Devanbu, and Gustavo Soares
The panel discussion (organized in a fishbowl format) was led by an academic researcher (Fernanda Madeiral) and an industry researcher (Gustavo Soares) and centered around the obstacles to deploying automated program repair techniques. Challenges mentioned in the discussion include: (1) answering the questions of when (inner loop or outer loop), how (autonomous agent or human-in-the-loop), and what (realistic benchmarks); (2) the need to train developers in how to use the techniques; and (3) the need to write tests that reproduce the bug (e.g., the challenge in vulnerability repair lies in writing a test for the vulnerability). To address these obstacles, recent work has focused on leveraging the confidence level of an LLM (e.g., intrinsic probabilities, i.e., the probability of a token given the previous tokens, or per-token log-probs from the model).
Participants
Earl T. Barr (University College London, GB)
Islem Bouzenia (Universität Stuttgart, DE)
Yuriy Brun (University of Massachusetts Amherst, US)
Cristian Cadar (Imperial College London, GB)
Celso G. Camilo-Junior (Federal University of Goiás, BR)
Satish Chandra (Google – Mountain View, US)
Chunyang Chen (TU München – Heilbronn, DE)
Tse-Hsun Chen (Concordia University – Montreal, CA)
Zimin Chen (Deutsche Telekom – Bonn, DE)
Andreea Costea (National University of Singapore, SG)
Premkumar T. Devanbu (University of California – Davis, US)
Alexander Frömmgen (Google – München, DE)
Ahmed E. Hassan (Queen’s University – Kingston, CA)
Sungmin Kang (KAIST – Daejeon, KR)
Dongsun Kim (Kyungpook National University, KR)
Claire Le Goues (Carnegie Mellon University – Pittsburgh, US)
Yiling Lou (Fudan University – Shanghai, CN)
Fernanda Madeiral (VU Amsterdam, NL)
Matías Martínez (UPC Barcelona Tech, ES)
Martin Monperrus (KTH Royal Institute of Technology – Stockholm, SE)
Nikhil Parasaram (University College London, GB)
Michael Pradel (Universität Stuttgart, DE)
Nikitha Rao (Carnegie Mellon University – Pittsburgh, US)
Abhik Roychoudhury (National University of Singapore, SG)
André Silva (KTH Royal Institute of Technology – Stockholm, SE)
Gustavo Soares (Microsoft Corporation – Redmond, US)
Gang (Gary) Tan (Pennsylvania State University – University Park, US)
Lin Tan (Purdue University – West Lafayette, US)
Shin Hwei Tan (Concordia University – Montreal, CA)
Yingfei Xiong (Peking University, CN)
Jinqiu Yang (Concordia University – Montreal, CA)
He Ye (Carnegie Mellon University – Pittsburgh, US)
Jooyong Yi (Ulsan National Institute of Science and Technology, KR)
Jie Zhang (King’s College London, GB)
Lingming Zhang (University of Illinois – Urbana-Champaign, US)