
Fully Autonomous Programming with Large Language Models

Vadim Liventsev∗† (TU Eindhoven & Philips Research, The Netherlands), [email protected]
Anastasiia Grishina∗ (Simula & University of Oslo, Norway), [email protected]
Aki Härmä (Philips Research, Eindhoven, The Netherlands), [email protected]
Leon Moonen (Simula & BI Norwegian Business School, Oslo, Norway), [email protected]

ABSTRACT
Current approaches to program synthesis with Large Language Models (LLMs) exhibit a “near miss syndrome”: they tend to generate programs that semantically resemble the correct answer (as measured by text similarity metrics or human evaluation), but achieve a low or even zero accuracy as measured by unit tests due to small imperfections, such as the wrong input or output format. This calls for an approach known as Synthesize, Execute, Debug (SED), whereby a draft of the solution is generated first, followed by a program repair phase addressing the failed tests. To effectively apply this approach to instruction-driven LLMs, one needs to determine which prompts perform best as instructions for LLMs, as well as strike a balance between repairing unsuccessful programs and replacing them with newly generated ones. We explore these trade-offs empirically, comparing replace-focused, repair-focused, and hybrid debug strategies, as well as different template-based and model-based prompt-generation techniques. We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation. The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.

CCS CONCEPTS
• Software and its engineering → Software design engineering; • Computing methodologies → Neural networks; Model development and analysis; Search methodologies.

KEYWORDS
automatic programming, large language models, program repair

ACM Reference Format:
Vadim Liventsev, Anastasiia Grishina, Aki Härmä, and Leon Moonen. 2023. Fully Autonomous Programming with Large Language Models. In Genetic and Evolutionary Computation Conference (GECCO ’23), July 15–19, 2023, Lisbon, Portugal. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3583131.3590481

∗ Both authors contributed equally to this research.
† Corresponding author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
GECCO ’23, July 15–19, 2023, Lisbon, Portugal
© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0119-1/23/07.
https://doi.org/10.1145/3583131.3590481

1 INTRODUCTION
Automatic programming has been an important goal of the Artificial Intelligence field almost since its inception [1], promising to reduce the workload of software developers by automatically solving some of the tasks they face. More recently, program synthesis has emerged as an interpretable alternative [2] to black-box machine learning methods that lets human experts understand, validate and edit the algorithms generated by artificial intelligence. In addition to the scientific benefits of such knowledge, it extends the benefits of machine learning to domains such as embedded systems, where it is technically challenging [3], or healthcare, where it is avoided for safety reasons [4, 5].

The predominant methodology in automatic programming has shifted from deductive programming [6, 7] to genetic and evolutionary methods [8] to, more recently, large autoregressive language models trained on corpora of source code due to their remarkable capability for zero-shot generalization [9]. However, even state-of-the-art models fine-tuned on a specific class of programming tasks still require a costly filtering step where the LLM outputs that do not compile or pass tests are discarded [10]. These outputs tend to be superficially similar to correct solutions [11] despite failing to produce the expected output, a phenomenon known as “near miss syndrome” or “last mile problem” [12].

Given these challenges, research in machine learning on source code [13] tends to focus on restricted domain-specific languages [14–16] or automating specific parts¹ of the software development process [19, 20] such as code search [21], code translation [22], detection of issues [23, 24], improvement [25] and repair [26], rather than fully autonomous programming in a programming language popular with human developers [27]. However, two recent innovations potentially make the latter task tractable.

One is Synthesize, Execute, Debug [28], a framework that attempts to bridge the “last mile” gap by introducing program repair into the program synthesis algorithm. A programming task is specified using both a natural language description and a set of input/output (I/O) pairs demonstrating what output is expected of the program, thereby combining the text-to-code [29] and programming-by-example [30, 31] paradigms typical for competitive programming [32]. Synthesize, Execute, Debug creates a first draft program using a generative model, compiles and executes it with given input examples. This is followed by a program repair step to fix the identified errors.

Another relevant innovation is instruction-driven large language models [33]. Instruction-driven models use human feedback in their training process and admit two inputs: a source text (or code) and a textual command instructing the model to edit the source in a particular way, i.e., “summarize” or “translate to Python”.

¹ Similarly to autonomous driving [17, 18].

Figure 1: Framework for the LLM-based Synthesize, Execute, Instruct, Debug, and Rank approach. (SYNTHESIZE turns the task description and a code template into N draft candidates; EXECUTE runs each candidate against the validation set of I/O pairs; failing pairs and stderr feed INSTRUCT, whose instructions drive DEBUG to produce N new candidates per program; RANK keeps the top programs, and the final program is scored on an unseen test set of I/O pairs.)

These models have been shown to be highly successful in automatic program repair [34]. However, given the free-form nature of these instructions², how one should engineer instructions that maximize repair performance is an open question.

Section 2 presents a framework that adapts Synthesize, Execute, Debug to instruction-driven Large Language Models for solving programming tasks in an autonomous fashion. We discuss related work in Section 3 and introduce experiments to establish optimal search and prompting strategies for this framework in Section 4. Finally, we demonstrate in Section 5 that our framework outperforms conventional automatic programming techniques, such as genetic programming and naive application of large language models that generate one solution per problem without updating it iteratively.

2 METHODOLOGY
The proposed framework, Synthesize, Execute, Instruct, Debug and Rank, or SEIDR,³ is summarized in figure 1. To solve a programming task defined as a text description and a collection of I/O examples, we split the I/O examples into prompt and validation sets and use the prompt set in a large language model to SYNTHESIZE a population of candidate solutions. We EXECUTE the solutions, test them against the validation set, and generate a text description of the identified problems, which is used to INSTRUCT a large language model to produce repaired candidate solutions, similar to the way a human developer DEBUGs a program. We RANK the candidates by correctness measured by matching I/O pairs, discard the worst candidates, and repeat until a fully correct solution is found.

2.1 Ingredients
SEIDR makes use of two instruction-driven large language models for source code: a synthesis model 𝑝synth(input, instr) and a debugging model 𝑝debug(input, instr), as well as, optionally, a large natural language model 𝑝text(input) that can be used for writing instructions for the code model. Each model is a highly parameterised probability distribution over the space of (input, instruction)-tuples with parameters estimated on a large, diverse (i.e., non-task-specific) corpus. This stochastic nature of language models is an important prerequisite for SEIDR, since it lets us sample batches of diverse candidate solutions from 𝑝synth(input, instr), 𝑝debug(input, instr), and 𝑝text(input). We have chosen state-of-the-art transformer models [36] for 𝑝synth(input, instr), 𝑝debug(input, instr), and 𝑝text(input) in our experiments, as described in Section 4.5. In general, SEIDR requires a sequence-to-sequence generative model for these blocks.

2.2 Synthesize
The framework starts with the SYNTHESIZE block, which is responsible for generating initial draft solutions to programming tasks to be repaired in the later stages of SEIDR. We start with a basic template for a chosen programming language that contains a number of standard library imports and an empty main function, or this language's equivalent thereof, see figure 2. We populate this template with a comment indicating a text description of the task at hand and several I/O examples from the prompt training set. We design the templates with guidelines by the authors of the language model [37] and prior work [38] in mind. We then sample 𝑁 programs from 𝑝synth(input, instr), setting input to the populated template and instruction to the problem description. We use temperature sampling with a monotonically increasing temperature schedule where the 𝑖-th program is sampled with temperature 𝑡ᵢ ≈ 𝑖/𝑁 (approximate equality enables efficient implementation by means of batching). Thus, the sampling procedure for the first programs approximates deterministic maximum likelihood estimation. Ultimately, this approach ensures that samples are diverse, but always contain the likeliest programs.

² Throughout this paper we avoid other definitions of instruction, such as an individual operation in code, to prevent ambiguity.
³ seiðr also refers to a type of Norse magic [35] pertaining to predicting and controlling the future, which we deem thematically appropriate.

Figure 2: Anatomy of SYNTHESIZE templates. (Each template, shown for Python and C++, consists of a standard preamble with useful imports, the task description and I/O examples as a comment, and an empty “main” block.)
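The following is an illustrative sketch of this temperature schedule, not the authors' implementation. It assumes a hypothetical `generate_batch` callable that wraps whichever LLM serves as 𝑝synth and accepts an input, an instruction, a temperature, and a sample count.

```python
# Illustrative sketch only: sample N drafts with a monotonically increasing
# temperature schedule t_i ~= i/N, grouped into batches for efficiency.
def synthesize_drafts(generate_batch, template, task_description, n=10, batch_size=5):
    drafts = []
    for start in range(0, n, batch_size):
        size = min(batch_size, n - start)
        # One temperature per batch coarsely approximates t_i = i/N; with batch_size=1
        # the schedule is exact and the first samples are near-greedy (temperature ~ 0).
        temperature = (start + size) / n
        drafts.extend(generate_batch(input=template,
                                     instruction=task_description,
                                     temperature=temperature,
                                     num_samples=size))
    return drafts
```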

Figure 3: INSTRUCT blocks with and without an LLM. (Both variants turn a failing validation I/O pair or a non-empty stderr into a debugging instruction. INSTRUCTstatic fills a fixed template, e.g., “Make sure that {Ival} -> {Oval}” or “Fix {stderr}”; INSTRUCTLLM first asks a text-completion LLM to summarize the bug from a description of the code behavior, e.g., “The code must return Oval instead of O for Ival. The error is...”, and then issues “Fix {bug summary}”.)


2.3 Execute
In the EXECUTE block, the programs are compiled (if necessary) and launched using the standard tools for the programming language. The program is run once for every I/O pair in the validation set. Its stdin stream receives all the input lines in a given input pair, and its stdout and stderr streams are captured and saved. We then measure the score of the program, defined as accuracy over output lines, with 𝑂 being the expected output and 𝑛 = max{|𝑂|, |stdout|}:

\[ \mathrm{score}(O, \mathrm{stdout}) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\big[\mathrm{stdout}_i = O_i\big] \]

unless stderr is non-empty during compilation or execution, which is considered to indicate failure and is assigned a score of 0.
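As a minimal sketch of this scoring rule (not the authors' implementation), assuming outputs are compared line by line as plain strings:

```python
# Line-level accuracy of stdout against the expected output; any stderr output scores 0.
def score(expected_output: str, stdout: str, stderr: str) -> float:
    if stderr:  # compilation or runtime error counts as total failure
        return 0.0
    expected_lines = expected_output.splitlines()
    actual_lines = stdout.splitlines()
    n = max(len(expected_lines), len(actual_lines))
    if n == 0:
        return 1.0  # assumption: empty expected and actual output counts as a match
    matches = sum(1 for e, a in zip(expected_lines, actual_lines) if e == a)
    return matches / n  # extra or missing lines count as mismatches via the max-length n
```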
2.4 Instruct
The goal of the INSTRUCT block is to provide an instruction that summarizes a bug in the program candidate for 𝑝debug(input, instr). The resulting instruction with the bug summary should indicate what requirement is violated and instruct the LLM to edit the candidate program so that the candidate meets the violated requirements. In SEIDR, we generate instructions using template engines. In general, template engines replace placeholders in files or strings with input values and return a formatted string. With template engines, we can create a number of templates that will be adapted dynamically based on the results of program candidate execution.

We consider two different designs of the instruction generation block: INSTRUCTstatic and INSTRUCTLLM, shown in figure 3. Both types of blocks use failing I/O pairs from the validation set and the stderr output of the candidate execution. In both blocks, if stderr is not empty, i.e., execution errors occur before getting the output to compare it with the expected output, the stderr-based template engine generates an instruction to fix the error mentioned in stderr. However, the blocks differ in the way they transform failing I/O pairs to generate instructions in case stderr is empty.

INSTRUCTstatic uses a fixed input template and substitutes placeholders for input and output with the corresponding strings of the first failing test case. We show the resulting instruction for an exemplar template in figure 3. By contrast, INSTRUCTLLM uses the failing I/O pair in the LLM for text completion, thereby prompting the text LLM to produce the bug summary. An exemplar output of the code behavior template engine in figure 3 describes that the code returns output O instead of expected output Oval for the failing test case with input string Ival. The LLM is then prompted to auto-complete this description of program behavior with the bug summary. The bug description is passed further to the next template engine and used as the debugging instruction, such as “Fix {bug summary}”.

2.5 Debug
The DEBUG block iterates over all programs in the population and uses the instruction written by INSTRUCT based on the results of EXECUTE to sample from 𝑝debug(input, instr) 𝑁 times to repair every candidate, setting input to the candidate solution and instruction to the output of INSTRUCT. The population of candidates is then replaced with the output of DEBUG.

2.6 Rank
Finally, the RANK block implements what is known in genetic programming as parent selection [39]. It ranks all programs in the candidate population by their score calculated in EXECUTE, keeps the top 𝑊 programs, and removes the rest from the population.
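The sketch below illustrates the two INSTRUCT designs under stated assumptions; it is not the paper's template engine. Placeholder names such as `i_val` are valid-identifier renamings of {Ival}, {Oval}, and {Op}, and `summarize_bug` stands in for the text-completion model 𝑝text.

```python
# Hedged illustration of INSTRUCTstatic and INSTRUCTLLM.
STATIC_TEMPLATE = "Make sure that {i_val} -> {o_val}"          # cf. instruction S0 / figure 3
BUG_PROMPT = ("The code must return {o_val} instead of {o_p} for input {i_val}. "
              "The error is that")                              # completed by p_text

def instruct_static(i_val, o_val, stderr=""):
    if stderr:                                 # execution error: ask to fix what stderr reports
        return f"Fix {stderr}"
    return STATIC_TEMPLATE.format(i_val=i_val, o_val=o_val)

def instruct_llm(summarize_bug, i_val, o_val, o_p, stderr=""):
    if stderr:
        return f"Fix {stderr}"
    bug = summarize_bug(BUG_PROMPT.format(i_val=i_val, o_val=o_val, o_p=o_p))
    return f"Fix {bug}"                        # debugging instruction "Fix {bug summary}"
```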
2.7 Meaning of Hyperparameters
After evaluating a given candidate solution in EXECUTE, SEIDR supports two approaches to addressing the candidate's flaws:
• Replace the candidate with another sample from the current population.
• Use INSTRUCT and DEBUG to repair the candidate.
We refer to this problem as the repair-replace trade-off, by analogy with production economics [40].

How does the choice of hyperparameters 𝑁 and 𝑊 influence the flow of SEIDR? 𝑁 and 𝑊 act as upper bounds on the replace option by limiting the size of the population. In the edge cases, 𝑁 = 𝑊 = 1 corresponds to a repair-only process, while 𝑁 = 𝑊 = ∞ corresponds to replace-only, see figure 4.

Observe that a mutation-only genetic algorithm with population size 𝑊, such as SEIDR, is equivalent to local beam search with beam width 𝑊 on an 𝑁-ary tree [41, Section 4.1.4]. This corresponds to a known property of local beam search: it degenerates into depth-first search when 𝑊 = 1, whereas setting 𝑊 = ∞ yields breadth-first search. Hence, we refer to 𝑁 as tree arity and 𝑊 as beam width.

Figure 4: Repair-replace trade-off as a tree search problem. (Candidate programs form an 𝑁-ary tree rooted at the initial template: repair-only search with beam width 1 is depth-first, replace-only search is breadth-first, and intermediate settings correspond to local beam search with beam width × tree arity candidates per generation. Nodes are marked by outcome: every I/O pair as expected; at least one I/O pair as expected; compiles and runs, but all I/O pairs wrong; or stderr non-empty at compile- or runtime.)
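To make the tree-search view concrete, here is a schematic sketch of the overall loop, not the authors' code: `synthesize`, `repair`, and `evaluate` stand in for the SYNTHESIZE, INSTRUCT+DEBUG, and EXECUTE blocks, and ∞ is approximated in practice by the overall candidate budget (as done in Section 4.2).

```python
# Schematic SEIDR loop as local beam search on an N-ary tree (illustration only).
# synthesize(k) returns k drafts, repair(p, k) returns k debugged variants of p,
# and evaluate(p) returns the validation score in [0, 1] from the EXECUTE block.
def seidr_search(synthesize, repair, evaluate, tree_arity, beam_width, max_programs=1000):
    population = synthesize(tree_arity)                        # 0th generation of drafts
    generated = len(population)
    while population:
        ranked = sorted(population, key=evaluate, reverse=True)  # RANK
        if evaluate(ranked[0]) == 1.0:                         # all validation I/O pairs pass
            return ranked[0]
        if generated >= max_programs:
            return ranked[0]                                   # best effort within the budget
        beam = ranked[:beam_width]                             # keep the top W candidates
        population = [child for parent in beam for child in repair(parent, tree_arity)]
        generated += len(population)
    return None
```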

3 RELATED WORK
The use of large language models for a program repair step within a program synthesis pipeline has been studied by Joshi et al. [42] and Gupta et al. [28], while a specific case of instruction-driven LLMs has been explored by Fan et al. [34]. The latter authors also compare instruction-driven LLMs to other Automated Program Repair (APR) strategies, and conclude that LLMs solve the task effectively. However, they do not consider the entire Synthesize, Execute, Debug framework, as we do in this work. By contrast, LLMs are usually set up to generate a fix from the first debug attempt and do not update failing patches iteratively.

Prompt generation for LLMs pre-trained on code is investigated in the prompt engineering literature on a number of tasks, such as source code repair, type inference, code synthesis, and autocompletion [43–45]. Studies on program repair automation used prompts that contain code only, code with docstrings, and code with bug hints with the Codex model to test the repair capabilities of the LLM on the QuixBugs benchmark [46, 47]. The studies reported that bug localization hints were not helpful, whereas providing buggy code and the task summary was the most effective. ChatGPT was tested on QuixBugs in addition to Codex models as well [48]. Kuznia et al. [49] proposed to summarize task descriptions of competitive programming and interview programming tasks. Following guidance from these studies, we include I/O pairs and task descriptions as docstrings in addition to the function signature in our prompts.

4 EXPERIMENTAL SETUP
To explore the capabilities of SEIDR, we test the framework on a benchmark of problems for code competitions with different types of instructions for program candidate debugging, varied search strategies, and two languages, Python and C++. Our experiments use the Program Synthesis Benchmark 2 (PSB2) for problem descriptions and tests to evaluate the proposed framework [50]. We compare the performance of programs synthesized with SEIDR to the PushGP genetic programming system with down-sampled lexicase selection [51]. During our empirical evaluation of SEIDR performance, we address the following research questions:

RQ1. Repair-replace trade-off exploration: What is the impact of using different tree search strategies in the autonomous programming setting? We experiment with four different tree arities in the tree search and study their impact on the number of resolved problems as well as the speed of obtaining solutions.

RQ2. Prompt engineering: What is the effect of using LLM-produced bug summaries compared to static instructions on the repair performance of automatically synthesized code? We test six static debug instructions that describe bug behavior based on violated requirements and five auto-generated debug prompts.

4.1 Data
PSB2 is a benchmark suite of 25 problems for program synthesis that resemble small real-world tasks. PSB2 was developed as a more realistic and challenging version of PSB1 [52], the latter consisting of textbook problems and widely used in genetic programming [53]. The problems require different data structures and control flows to be used for effective solutions and are taken from sources such as competitive programming platforms and educational courses. The problems have descriptions in English, as well as 1 million (M) tests for training and 1M testing-stage tests, including edge or corner cases that test the resulting program on complicated inputs. The tests are provided as I/O pairs and are distributed together with the problem descriptions as a PyPI package.⁴

In PSB1 [52], the training set consists of the edge test cases and is augmented by random test cases if the number of edge tests is not enough. The test set is formed by random test cases. This terminology is preserved in PSB2. However, we do not have a training or fine-tuning phase in our experiments, because the models are not made available for further training. Instead, we validate the framework with an existing pre-trained LLM for code and text as its parts. Therefore, we only have the validation and test phases. We will refer to training test cases in the PSB terminology as validation test cases in this study.

4.2 Repair-replace Trade-off Settings
As described in Section 2.7, the choice of beam width 𝑊 and tree arity 𝑁 defines the repair-replace trade-off, where higher 𝑊 and 𝑁 prioritize repair over replace. We evaluate four options for these hyperparameters, as shown in table 1.

Table 1: Tree search hyperparameters.

experiment        1     2     3     4
beam width 𝑊      1    10   100     ∞
tree arity 𝑁      1    10   100     ∞

Because we aim to compare tree search parameters, we fix one default debugging instruction and use the INSTRUCTstatic block. Moreover, we set the upper limit for the total number of generated program candidates to 1000 to limit the experimentation time. Although some solutions may not be found within the hard limit, we assume⁵ that 1000 program candidates form a sufficiently large search space for our experiments. 𝑊 = 𝑁 = ∞ is achieved in implementation by setting 𝑊 and 𝑁 equal to the upper limit on the number of candidates, i.e., 1000. This setting ensures that a second generation of programs does not exist.

⁴ https://pypi.org/project/psb2/
⁵ This assumption is later confirmed in Section 5.1.

4.3 Prompting Strategies
The prompt for the LLM model 𝑝debug(input, instr) consists of the input for editing (the candidate program generated so far) and a debug instruction to repair the candidate. We test SEIDR on 11 debug instructions to explore whether the use of the LLM for text completion 𝑝text(input) benefits the performance of our framework, as well as what effect different prompt phrases have on the debug process. We compare debug instructions that use neutral phrases with those that use more confident language and mimic experienced software developers, as well as shorter and longer instructions with different amounts of detail about code behavior. To alleviate the effect of beam width and tree arity, we set 𝑁 = 𝑊 = 1 and test the repair-only tree search strategy shown in figure 4. This strategy is used to gradually improve one program candidate throughout the search, with no competing programs in the same generation.

The debug instructions are formulated as templates. The instructions describe the violated requirements in terms of the wrong output in a failing I/O test, or summarize the bug to capture issues in code logic. We present debug instructions using the template engine format: the brackets { } denote that the placeholder in the brackets will be replaced with the value generated during execution; {Ival} and {Oval} stand for the values of the failing I/O pair from the validation set. As shown in figure 3, the instruction to fix execution errors, which abort the program before the resulting output is obtained, is built from the stderr lines: Fix {stderr}. Static debug instructions that do not use the LLM for bug summarization are as follows:

S0 Make sure that {Ival} -> {Oval};
S1 Make sure the code returns {Oval} for input {Ival};
S2 Ensure that input {Ival} yields output {Oval};
S3 Modify code to get {Oval} from {Ival};
S4 Code must correspond instructions in comments and {Ival} must yield {Oval};
S5 See comments in code and return {Oval} for input {Ival}.

The instruction S0 is the default instruction for the tree arity experiments. It has an intuitive symbolic notation (->) instead of the word “return” or “yield”. In instructions S1–S3, we experiment with verbs and the order of output and input. Alternatively, in debug instructions S4–S5, we prompt the model to consider the task description in the docstring in addition to providing the details of the failing I/O pair. Overall, instructions S0–S5 indicate the requirements to be met, but do not describe the current program’s behavior.

The second set of instructions uses the LLM for text completion 𝑝text(input). The instructions are designed so that the LLM is prompted to complete a sentence that should describe an error. In addition to validation I/O pairs, the following notation is used: {Op} denotes the program candidate output for input {Ival}, and {task} is a placeholder for a problem description in English. Note that we do not include the incorrect output {Op} of a generated candidate program in debug instructions S0–S5, because it is recommended to avoid asking the model what not to do.⁶ We denote the text completion LLM’s output as {bug}, which should constitute the bug summary. Input templates to use the LLM for bug description, followed by debugging instruction templates (after “→”), are as follows:

M6 The code should solve the following problem: {task}. The code must return {Oval} for input {Ival} but it returns {Op}. Obviously, the error is that... → Fix {bug};
M7 The code should solve the following problem: {task}. The code must return {Oval} for input {Ival} but it returns {Op}. The error is that... → Fix {bug};
M8 Problem description: {task}. The code must return {Oval} for input {Ival}, but it returns {Op}. It is clear the error is that... → Fix {bug};
M9 There is clearly a bug in code, because the code returns {Op} for input {Ival} but output {Oval} is expected. The bug is that... → Fix {bug};
M10 There is clearly a bug in code, because the code returns {Op} for input {Ival} but output {Oval} is expected. The bug is that... → Fix {bug} and modify the code to return {Oval} for input {Ival}.

Note that the text completion LLM does not use program candidates in its input, but only the template inputs M6–M10 before the arrow.

Input M6 for the text completion LLM is used to evaluate the effect of the “confidence” sentiment on the bug summaries and the debugging process. It is identical to input M7, except for the word “obviously”, which should reflect the confidence of the comment. Inputs M7 and M8 can be compared in the way the problem description is introduced, i.e., as a separate sentence similar to a spoken situation in prompt M7, or as a short title in M8.

Input templates M9 and M10 for the text completion LLM are identical, but the instruction templates are different. The text completion inputs start with a “confidently” phrased statement that a bug is present in the code. We include both the LLM output {bug} and a description of the failing validation test case in debug instruction M10. Therefore, instructions M6–M9 rely mainly on the LLM output to summarize the bug, whereas instruction M10 also provides information about the expected output.

4.4 Performance Indicators
In our experiments, we compare the number of fully solved problems obtained using SEIDR with different values of hyperparameters. For a more detailed analysis of results, we use test pass rate (TPR) and Excess Programs Generated (EPG). TPR reflects the percentage of fully passed test cases based on the exact match of program output and test output. The TPR metric is used for the final evaluation of generated programs and does not reflect partial passing of an I/O test, as opposed to score in the RANK block.

The DEBUG and EXECUTE blocks generate a number of programs that are replaced or repaired during the search for the solution program. The number of programs generated before the first occurrence of a program that passes all validation test cases is referred to as EPG. EPG is indicative of the computational cost of solving a problem in terms of LLM inferences and program compilations and executions.

⁶ https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
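As a small sketch of how these two indicators could be computed, assuming a list of per-test outputs and a list of per-candidate validation scores (not the paper's evaluation code):

```python
# TPR: fraction of test cases whose program output exactly matches the expected output.
def test_pass_rate(actual_outputs, expected_outputs):
    passed = sum(1 for got, want in zip(actual_outputs, expected_outputs) if got == want)
    return passed / len(expected_outputs)

# EPG: number of candidates generated before the first one that passes all validation
# tests; returns the total count if no candidate fully passes.
def excess_programs_generated(validation_scores):
    for epg, score in enumerate(validation_scores):
        if score == 1.0:
            return epg
    return len(validation_scores)
```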

Figure 5: Number of solved PSB2 problems depending on the tree arity in tree search for the fixed prompt type S0. (Bars compare SEIDR in Python and C++ with the PushGP baseline for tree arities 1, 10, 100, and ∞.)

Figure 6: Distribution of the number of generated programs during each problem-solving attempt in the experiments with different tree arities where a problem solution is found. (a) 0 ≤ EPG ≤ 10 with step 1; (b) 0 ≤ EPG ≤ 1000 with step 100. (Histograms of EPG frequencies for Python and C++.)

4.5 Implementation Details
We use GPT-3 models pre-trained on code⁷ and text⁸ as LLMs in our framework. Specifically, we use Codex-edit (code-davinci-edit-001) as the LLM for draft programs 𝑝synth and the LLM for debugging 𝑝debug, and GPT-3 (text-davinci-003) for bug summarization via text completion with 𝑝text. We ensure that the program candidates generated from the same parent program are different from each other by changing the temperature parameter of Codex-edit.

We use 2000 I/O pairs from the test split of PSB2 to evaluate the candidate program that has passed all the validation test cases during debugging. Due to repetitive calls to the EXECUTE block, we have to resolve the speed-of-testing versus precision trade-off while choosing the number of validation test pairs. We resolve the trade-off by fixing the validation set size at 100.

In the experiments with tree arity values, we set the limit to generate a maximum of 1000 program candidates during the search for the candidate that passes all validation tests. If we reach 1000 candidates and none of them passes all validation tests, we report the test pass rate for the last generated candidate. In the experiments with prompts, we set the limit of maximum generated programs to 5, because we search for the prompt that yields the fastest solution, to exclude long searches and comply with the request rate limits.

5 RESULTS AND DISCUSSION

5.1 RQ1. Repair-replace Trade-off Exploration
We compare the number of solved problems in the experiments with tree arity of 1, 10, 100, and ∞ and fixed debug instruction S0 in Python and C++ in figure 5. The results of SEIDR are compared to the baseline performance of PushGP on the PSB2 benchmark, which solves 17 out of 25 problems. Note that the experiments with 𝑁 = 1 and 𝑁 = ∞ can be considered as ablation studies, where the replace option and the repair option are turned off, respectively.

The results highlight the benefit of compromise strategies with tree arity of 10 and 100 over repair-only (𝑁 = 1) and replace-only (𝑁 = ∞) strategies. The repair-only scheme is outperformed by all other strategies. We explain the poor performance of the repair-only strategy by the fact that the search space is under-explored. Specifically, the replace scenario ensures that the LLM for debugging, represented by Codex-edit in our experiments, generates different updates of program candidates using variable temperature. The probability of finding a better fix is higher when more alternatives are generated to update the draft program at 𝑁 > 1 compared to 𝑁 = 1. The search strategy with 𝑁 = 10 yields the best results: it performs on par with PushGP for C++ and outperforms the baseline during Python program synthesis by +2 problems, resulting in a total of 19 programs that pass all test cases. The results imply that generating a moderate number of programs in parallel during the DEBUG step works better than the policies in which more updates are generated for each program (100 or 1000) or only one program is updated iteratively.

We present the analysis of solution speed for all four arities and the fixed default debug instruction in figure 6. In detail, we show the distribution of EPG values in all experiments to explore how many candidate updates are generated before the solution is found. We zoom in to the cases with solutions found within the first 10 program candidates in figure 6a and show the EPG distribution with a step of 100 candidates in figure 6b.

Out of 100 experiments for each language, in 21–24% of runs in Python and C++ the draft program is already the solution (EPG = 0). For 19–32% of experiments, the solution is found after discarding 5 candidates. Around half of the experiments do not generate more than 100 programs. However, 5 problems are solved with more than 500 generated programs in Python and 1 problem in C++ (with 𝑁 = 10). The results imply that the first steps in the update of the draft program are crucial for solving the problem. The chances of solving the problem in the later stages of the search, such as after 100 programs have been generated, are low. This confirms our initial assumption in Section 4.2 that 1000 programs are sufficient.

Answer to RQ1. SEIDR outperforms the PushGP baseline on PSB2 in Python and performs on par with it in C++ experiments with tree arity of 10. Search strategies with tree arity larger than one benefit from the replace possibility of the SEIDR framework as a consequence of using variable temperature for Codex-edit. The repair component is also crucial for the framework, because the replace-only search policy (with tree arity of ∞) performs worse than the policies that alternate between replace and repair during program update (with tree arity of 10 or 100).

⁷ https://platform.openai.com/docs/guides/code/editing-code
⁸ https://platform.openai.com/docs/models/gpt-3

Figure 7: Number of solved PSB2 problems depending on the instruction choice for the fixed tree arity of 1. (Bars compare Python and C++ for debug prompt types S0–S5 and M6–M10.)

5.2 RQ2. Prompt Engineering
We report the number of solved problems for different static and GPT-assisted debug instructions in figure 7. Because the debug instructions are parts of prompts for LLMs and the program candidate format does not change, we will use the term prompt during the analysis of experiment results with different instructions. Overall, the performance of the framework is robust to the debug prompt choice, both with LLM-generated and static templates. The number of solved problems differs for Python and C++ in our experiments.

For C++, all debug prompts except S2 result in the same or higher performance than the instruction S0, which is used in the repair-replace trade-off experiments. The debug instruction S2 contains the verbs “yield” and “ensure”, which are probably rarely used in code documentation. The best debug instruction for C++ is the LLM-assisted template M6 containing the word “obviously”, which should indicate the confidence of the author of the bug summary whom GPT-3 should mimic during autocompletion.

Python programs do not show the same effect during experiments with different prompts. The overall performance drops in comparison with using the prompt S0. By limiting the total number of generated programs from 1000 to 5 in the current set of experiments, we lose 2 problem solutions in Python with S0. The prompt that results in the best performance in C++ for the EPG limit of 5 corresponds to the worst performance in Python. This result can occur due to the small tree arity and low variability of debugging updates of the initial draft. Another reason is that the GPT-3 summary of bugs may not point to logical errors. The model for text autocompletion frequently outputs bug summaries that state that “the code is not accepting the input correctly.” Note that such a bug summary appears in other debug prompts, too.

To analyze the effect of using different prompts on a problem level, we present a heatmap of EPG for all 25 problems in figure 8. We add the values of test pass rate in numbers or signs and show EPG in color. Empty cells denote that the search halts due to OpenAI exceptions, such as APIError.⁹ In addition, if the framework halts before reaching the maximum number of program attempts (light-blue cells with a “-”), it is due to the input length limit of the underlying LLM 𝑝debug, i.e., the generated code is too long and does not fit as input to the LLM. Some problems are solved with all prompts, while other problems are solved with only a subset of prompts, solved partially, or not solved at all. A number of problems are solved with all or the majority of prompts in both languages, such as basement, fizz-buzz, paired-digits, and twitter. Other problems pass all tests in only one of the languages, such as luhn, vector-distance, fuel-cost, or substitution-cipher. Most of the solved problems are generated as the first draft or within 1–2 debug steps. However, some problems pass 90% of test cases at the fifth step, such as substitution-cipher in Python with prompts S4 and M8, or shopping-list in C++ with prompts S0, S1, S5 and M7. These runs are likely to be updated with fully correct programs in the following several steps, according to the results in section 5.1, but the experiments are stopped for the fairness of inter-prompt comparison. Alternatively, conducting the prompt engineering experiment with 1000 max programs would have shown which prompts are beneficial for solving the problems in the long run, which can be interesting for future work.

The most interesting cases concern the problems that are solved only with LLM bug summaries or only with static prompts. For example, the gcd problem is solved only with prompts M6–M10 in C++ and is not solved with any of S0–S5. A similar result is obtained for spin-words and coin-sums in C++. In Python, we observe only the cases where solutions are obtained with static prompts and are not obtained with GPT-assisted prompts, e.g., for find-pair and camel-case. In addition, several prompts work well from both the S and M categories, as for gcd in Python.

Answer to RQ2. Program synthesis in C++ with SEIDR achieves better performance in the repair-only setting with both GPT-assisted prompts that summarize bugs in code and static templates which describe failing I/O cases. The best-performing C++ instruction is obtained with GPT-3 for text completion and contains the word “obviously”. Results differ for PSB2 solutions in Python: the static prompt template S0 results in the best performance. Overall, SEIDR performance is stable with different debugging prompts submitted to Codex-edit.

5.3 Threats to Validity
External threats to validity concern SEIDR performance on different benchmarks and the use of other language models than the tested ones. Specifically, PSB2 contains competitive programming tasks which require smaller functions to be generated than production-scale software. We plan to extend our experiments in future work to explore the generalizability of results to other benchmarks.

Internal threats relate to the implementation. We use PSB2, which has corner-case tests in the training set and regular cases in the test set. To ensure a fair comparison with other studies on PSB2, we evaluate and report results on the provided test set of PSB2, which risks that the synthesized programs do not pass some of the training cases. Large models for code editing and text completion used in this study are nondeterministic, which impacts results. Due to prohibitive model inference costs, each experiment was only run once. However, our temperature sampling procedure described in section 2.2 reduces this stochasticity significantly, especially for low-EPG results. As with other language models, Codex is a black-box model and may generate malicious code [54]. The Codex model was pre-trained on an unbalanced dataset across programming languages [9]. Thus, the results can be skewed towards high performance in popular programming languages.

⁹ https://platform.openai.com/docs/guides/error-codes/python-library-error-types

Figure 8: Number of excess programs generated (in color) and test pass rate (as numbers) depending on the type of debug
prompt. Higher EPG values are shown in darker shades than low EPG. We denote solved problems with “+” (test pass rate = 1),
unsolved problems with “-” (test pass rate = 0), and show the test pass rate for partially solved problems.

6 CONCLUSION
In this study, we propose the SEIDR framework to solve the challenge of fully autonomous programming. We augment the program synthesis procedure, based on large language models that generate code from templates and textual instructions, with a repair block. The repair block consists of a tree search across the program candidates generated by a large language model for code. The LLM used for code repair takes imperfect program candidates and instructions for their improvement as prompts. The instructions are obtained both from static templates with failing test case descriptions and from templates with bug summaries auto-generated by a text completion language model. We explore 11 prompting strategies and the repair-replace trade-off of updating the draft program.

Contributions: We test SEIDR with Codex-edit as the model for draft program synthesis and debugging in Python and C++ on the PSB2 benchmark. In our experiments, SEIDR outperforms the PushGP baseline and achieves the state-of-the-art result with 19 solved problems out of 25. It requires under 1000 program executions to solve them, in stark contrast to billions¹⁰ of executions in PushGP, making it feasible in areas with costly testing, such as robotics. Investigation of the repair-replace trade-off shows that SEIDR with tree arity of 10 outperforms both the replace-only strategy and the repair-only approach. Our prompt engineering study shows that bug summaries generated with “confidence indicators”, such as “obviously”, improve the performance of SEIDR during C++ code synthesis. Overall, our framework shows low performance variability with different prompts, which indicates its robustness.

Future work: To study the generalizability of the SEIDR framework, we plan to expand the experiments to the competitive programming dataset of AlphaCode [10] and QuixBugs [46], as well as experimenting with ranking strategies, such as lexicase selection.

DATA AVAILABILITY
The code and results are made available via Zenodo.¹¹ Note that OpenAI discontinued the Codex API on March 23, 2023, and suggests using the GPT-3.5-Turbo API instead.

ACKNOWLEDGMENTS
The work presented in this paper was supported by the European Commission through Horizon 2020 grant 812882, and by the Research Council of Norway through the secureIT project (#288787). The empirical evaluation made use of the Experimental Infrastructure for Exploration of Exascale Computing (eX3), supported by the Research Council of Norway through grant #270053.

¹⁰ A problem is considered “solved” by PushGP if at least 1 of 100 runs, each with a limit of 60 million programs, was successful.
¹¹ https://doi.org/10.5281/zenodo.7837282

REFERENCES
[1] Z. Manna and R. J. Waldinger. “Toward Automatic Program Synthesis.” In: Communications of the ACM 14.3 (1971), pp. 151–165. doi: 10/bhgh4z.
[2] O. Bastani, J. P. Inala, and A. Solar-Lezama. “Interpretable, Verifiable, and Robust Reinforcement Learning via Program Synthesis.” In: xxAI - Beyond Explainable AI: International Workshop, Held in Conjunction with ICML 2020, July 18, 2020, Vienna, Austria, Revised and Extended Papers. Ed. by A. Holzinger, R. Goebel, R. Fong, T. Moon, K.-R. Müller, and W. Samek. Lecture Notes in Computer Science. Springer, 2022, pp. 207–228. doi: 10/j5zn.
[3] S. Dhar, J. Guo, J. Liu, S. Tripathi, U. Kurup, and M. Shah. “A Survey of On-Device Machine Learning: An Algorithms and Learning Theory Perspective.” In: ACM Transactions on Internet of Things 2.3 (2021), pp. 1–49. doi: 10/gnxq57.
[4] T. M. Connolly, M. Soflano, and P. Papadopoulos. “Systematic Literature Review: XAI and Clinical Decision Support.” In: Diverse Perspectives and State-of-the-Art Approaches to the Utilization of Data-Driven Clinical Decision Support Systems. IGI Global, 2023, pp. 161–188. doi: 10/j5zp.
[5] Y. Jia, J. McDermid, T. Lawton, and I. Habli. “The Role of Explainability in Assuring Safety of Machine Learning in Healthcare.” In: IEEE Transactions on Emerging Topics in Computing 10.4 (Oct. 2022), pp. 1746–1760. doi: 10/grrn36.
[6] Z. Manna and R. Waldinger. “Fundamentals of Deductive Program Synthesis.” In: IEEE Transactions on Software Engineering 18.8 (1992), pp. 674–704. doi: 10/ddbdft.
[7] R. Alur et al. “Syntax-Guided Synthesis.” In: Dependable Software Systems Engineering (2015), pp. 1–25. doi: 10/grb6m9.
[8] M. T. Ahvanooey, Q. Li, M. Wu, and S. Wang. “A Survey of Genetic Programming and Its Applications.” In: KSII Transactions on Internet and Information Systems (TIIS) 13.4 (2019), pp. 1765–1794.
[9] M. Chen et al. Evaluating Large Language Models Trained on Code. July 2021. arXiv: 2107.03374.
[10] Y. Li et al. “Competition-Level Code Generation with AlphaCode.” In: Science 378.6624 (Dec. 2022), pp. 1092–1097. doi: 10/grggxf.
[11] S. Ren et al. CodeBLEU: A Method for Automatic Evaluation of Code Synthesis. Sept. 2020. arXiv: 2009.10297.
[12] R. Bavishi, H. Joshi, J. Cambronero, A. Fariha, S. Gulwani, V. Le, I. Radicek, and A. Tiwari. “Neurosymbolic Repair for Low-Code Formula Languages.” In: OOPSLA. Dec. 2022.
[13] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton. “A Survey of Machine Learning for Big Code and Naturalness.” In: ACM Computing Surveys 51.4 (July 2018), 81:1–81:37. doi: 10/gffkhm.
[14] X. Chen, D. Song, and Y. Tian. “Latent Execution for Neural Program Synthesis Beyond Domain-Specific Languages.” In: Advances in Neural Information Processing Systems. Vol. 34. Curran Associates, Inc., 2021, pp. 22196–22208.
[15] O. Polozov and S. Gulwani. “FlashMeta: A Framework for Inductive Program Synthesis.” In: OOPSLA 2015. New York, NY, USA: Association for Computing Machinery, Oct. 2015, pp. 107–126. doi: 10/grb6mm.
[16] V. Liventsev, A. Härmä, and M. Petković. BF++: A Language for General-Purpose Program Synthesis. Jan. 2021. arXiv: 2101.09571.
[17] S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu. “A Survey of Deep Learning Techniques for Autonomous Driving.” In: Journal of Field Robotics 37.3 (2020), pp. 362–386. doi: 10/gg9ztb.
[18] M. Marcano, S. Díaz, J. Pérez, and E. Irigoyen. “A Review of Shared Control for Automated Vehicles: Theory and Applications.” In: IEEE Transactions on Human-Machine Systems 50.6 (2020), pp. 475–491. doi: 10/gj8zjj.
[19] S. Lu et al. “CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation.” In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks. Dec. 2021.
[20] C. Niu, C. Li, V. Ng, and B. Luo. CrossCodeBench: Benchmarking Cross-Task Generalization of Source Code Models. Feb. 2023. arXiv: 2302.04030.
[21] H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. June 2020. arXiv: 1909.09436.
[22] B. Roziere, M.-A. Lachaux, G. Lample, and L. Chanussot. “Unsupervised Translation of Programming Languages.” In: 34th Conference on Neural Information Processing Systems. Vancouver, Canada, 2020.
[23] E. Fernandes, J. Oliveira, G. Vale, T. Paiva, and E. Figueiredo. “A Review-Based Comparative Study of Bad Smell Detection Tools.” In: Proceedings of the 20th International Conference on Evaluation and Assessment in Software Engineering. EASE ’16. New York, NY, USA: Association for Computing Machinery, June 2016, pp. 1–12. doi: 10/ggfv7h.
[24] S. Chakraborty, R. Krishna, Y. Ding, and B. Ray. “Deep Learning Based Vulnerability Detection: Are We There Yet?” In: IEEE Transactions on Software Engineering 48.9 (Sept. 2022), pp. 3280–3296. doi: 10/gk52qr.
[25] J. Petke, S. O. Haraldsson, M. Harman, W. B. Langdon, D. R. White, and J. R. Woodward. “Genetic Improvement of Software: A Comprehensive Survey.” In: IEEE Transactions on Evolutionary Computation 22.3 (June 2018), pp. 415–432. doi: 10/gdrjb7.
[26] C. Le Goues, M. Pradel, and A. Roychoudhury. “Automated Program Repair.” In: Communications of the ACM 62.12 (Nov. 2019), pp. 56–65. doi: 10/gkgf29.
[27] TIOBE Index. https://www.tiobe.com/tiobe-index/.
[28] K. Gupta, P. E. Christensen, X. Chen, and D. Song. “Synthesize, Execute and Debug: Learning to Repair for Neural Program Synthesis.” In: Advances in Neural Information Processing Systems. Vol. 33. Curran Associates, Inc., 2020, pp. 17685–17695.
[29] S. Iyer, I. Konstas, A. Cheung, and L. Zettlemoyer. Mapping Language to Code in Programmatic Context. Aug. 2018. arXiv: 1808.09588.
[30] D. C. Halbert. “Programming by Example.” PhD thesis. University of California, Berkeley, 1984.
[31] S. Gulwani. Programming by Examples (and Its Applications in Data Wrangling). Tech. rep. Redmond, WA, USA: Microsoft Corporation, 2016, p. 22.
[32] M. Zavershynskyi, A. Skidanov, and I. Polosukhin. NAPS: Natural Program Synthesis Dataset. 2018. arXiv: 1807.0316.
[33] L. Ouyang et al. Training Language Models to Follow Instructions with Human Feedback. Mar. 2022. arXiv: 2203.02155.
[34] Z. Fan, X. Gao, M. Mirchev, A. Roychoudhury, and S. H. Tan. Automated Repair of Programs from Large Language Models. Jan. 2023. arXiv: 2205.10583.
[35] J. Blain. Nine Worlds of Seid-magic: Ecstasy and Neo-shamanism in North European Paganism. Routledge, 2002.
[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. “Attention Is All You Need.” In: International Conference on Neural Information Processing Systems (NeurIPS). Ed. by I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett. Curran Associates, Inc., 2017, pp. 5998–6008.
[37] OpenAI API. https://platform.openai.com.
[38] S. de Bruin, V. Liventsev, and M. Petković. Autoencoders as Tools for Program Synthesis. Sept. 2021. arXiv: 2108.07129.
[39] J. R. Koza. Genetic Programming II. Vol. 17. MIT Press, 1994.
[40] N. Jack and F. Van der Duyn Schouten. “Optimal Repair–Replace Strategies for a Warranted Product.” In: International Journal of Production Economics 67.1 (Aug. 2000), pp. 95–100. doi: 10/cfzj7f.
[41] S. J. Russell. Artificial Intelligence: A Modern Approach. Pearson Education, Inc., 2010.
[42] H. Joshi, J. Cambronero, S. Gulwani, V. Le, I. Radicek, and G. Verbruggen. Repair Is Nearly Generation: Multilingual Program Repair with LLMs. Sept. 2022. arXiv: 2208.11640.
[43] D. Shrivastava, H. Larochelle, and D. Tarlow. Repository-Level Prompt Generation for Large Language Models of Code. Oct. 2022. arXiv: 2206.12839.
[44] Q. Huang, Z. Yuan, Z. Xing, X. Xu, L. Zhu, and Q. Lu. Prompt-Tuned Code Language Model as a Neural Knowledge Base for Type Inference in Statically-Typed Partial Code. Aug. 2022. arXiv: 2208.05361.
[45] B. Ahmad, S. Thakur, B. Tan, R. Karri, and H. Pearce. Fixing Hardware Security Bugs with Large Language Models. Feb. 2023. arXiv: 2302.01215.
[46] D. Lin, J. Koppel, A. Chen, and A. Solar-Lezama. “QuixBugs: A Multi-Lingual Program Repair Benchmark Set Based on the Quixey Challenge.” In: Companion of the SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity. Vancouver, BC, Canada: ACM, Oct. 2017, pp. 55–56. doi: 10/gf8nmn.
[47] J. A. Prenner, H. Babii, and R. Robbes. “Can OpenAI’s Codex Fix Bugs?: An Evaluation on QuixBugs.” In: 2022 IEEE/ACM International Workshop on Automated Program Repair (APR). May 2022, pp. 69–75. doi: 10/grpcnx.
[48] D. Sobania, M. Briesch, C. Hanna, and J. Petke. An Analysis of the Automatic Bug Fixing Performance of ChatGPT. Jan. 2023. arXiv: 2301.08653.
[49] K. Kuznia, S. Mishra, M. Parmar, and C. Baral. Less Is More: Summary of Long Instructions Is Better for Program Synthesis. Oct. 2022. arXiv: 2203.08597.
[50] T. Helmuth and P. Kelly. “Applying Genetic Programming to PSB2: The Next Generation Program Synthesis Benchmark Suite.” In: Genetic Programming and Evolvable Machines 23.3 (Sept. 2022), pp. 375–404. doi: 10/gq5gjq.
[51] T. Helmuth and L. Spector. “Problem-Solving Benefits of Down-Sampled Lexicase Selection.” In: Artificial Life 27.3–4 (Mar. 2022), pp. 183–203. doi: 10/grrnj7.
[52] T. Helmuth and L. Spector. “General Program Synthesis Benchmark Suite.” In: Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation. Madrid, Spain: ACM, July 2015, pp. 1039–1046. doi: 10/ghsn5b.
[53] D. Sobania, M. Briesch, and F. Rothlauf. “Choose Your Programming Copilot: A Comparison of the Program Synthesis Performance of Github Copilot and Genetic Programming.” In: Proceedings of the Genetic and Evolutionary Computation Conference. Boston, Massachusetts: ACM, July 2022, pp. 1019–1027. doi: 10/gq5gjp.
[54] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri. Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. Dec. 2021. arXiv: 2108.09293.
