Fully Autonomous Programming With Large Language Models
ABSTRACT
Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome": they tend to generate programs that semantically resemble the correct answer (as measured by text similarity metrics or human evaluation), but achieve a low or even zero accuracy as measured by unit tests due to small imperfections, such as the wrong input or output format. This calls for an approach known as Synthesize, Execute, Debug (SED), whereby a draft of the solution is generated first, followed by a program repair phase addressing the failed tests. To effectively apply this approach to instruction-driven LLMs, one needs to determine which prompts perform best as instructions for LLMs, as well as strike a balance between repairing unsuccessful programs and replacing them with newly generated ones. We explore these trade-offs empirically, comparing replace-focused, repair-focused, and hybrid debug strategies, as well as different template-based and model-based prompt-generation techniques. We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation. The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.

CCS CONCEPTS
• Software and its engineering → Software design engineering; • Computing methodologies → Neural networks; Model development and analysis; Search methodologies.

KEYWORDS
automatic programming, large language models, program repair

ACM Reference Format:
Vadim Liventsev, Anastasiia Grishina, Aki Härmä, and Leon Moonen. 2023. Fully Autonomous Programming with Large Language Models. In Genetic and Evolutionary Computation Conference (GECCO '23), July 15–19, 2023, Lisbon, Portugal. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3583131.3590481

∗ Both authors contributed equally to this research.
† Corresponding author.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
GECCO '23, July 15–19, 2023, Lisbon, Portugal
© 2023 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0119-1/23/07.
https://doi.org/10.1145/3583131.3590481

1 INTRODUCTION
Automatic programming has been an important goal of the Artificial Intelligence field almost since its inception [1], promising to reduce the workload of software developers by automatically solving some of the tasks they face. More recently, program synthesis has emerged as an interpretable alternative [2] to black-box machine learning methods that lets human experts understand, validate and edit the algorithms generated by artificial intelligence. In addition to the scientific benefits of such knowledge, it extends the benefits of machine learning to domains, such as embedded systems where it is technically challenging [3] or healthcare where it is avoided for safety reasons [4, 5].

The predominant methodology in automatic programming has shifted from deductive programming [6, 7] to genetic and evolutionary methods [8] to, more recently, large autoregressive language models trained on corpora of source code due to their remarkable capability for zero-shot generalization [9]. However, even state-of-the-art models fine-tuned on a specific class of programming tasks still require a costly filtering step where the LLM outputs that do not compile or pass tests are discarded [10]. These outputs tend to be superficially similar to correct solutions [11] despite failing to produce the expected output, a phenomenon known as "near miss syndrome" or "last mile problem" [12].

Given these challenges, research in machine learning on source code [13] tends to focus on restricted domain-specific languages [14–16] or automating specific parts¹ of the software development process [19, 20] such as code search [21], code translation [22], detection of issues [23, 24], improvement [25] and repair [26] rather than fully autonomous programming in a programming language popular with human developers [27]. However, two recent innovations potentially make the latter task tractable.

¹ Similarly to autonomous driving [17, 18].

One is Synthesize, Execute, Debug [28], a framework that attempts to bridge the "last mile" gap by introducing program repair into the program synthesis algorithm. A programming task is specified using both a natural language description and a set of input/output (I/O) pairs demonstrating what output is expected of the program, thereby combining text to code [29] and programming by example [30, 31] paradigms typical for competitive programming [32]. Synthesize, Execute, Debug creates a first draft program using a generative model, compiles and executes it with given input examples. This is followed by a program repair step to fix the identified errors.

Another relevant innovation is instruction-driven large language models [33]. Instruction-driven models use human feedback in their training process and admit two inputs: a source text (or code) and a textual command instructing the model to edit the source in a particular way, i.e., "summarize" or "translate to Python".
Figure 1: Framework for LLM-based Synthesize, Execute, Instruct, Debug, and Rank approach.
These models have been shown to be highly successful in automatic program repair [34]. However, given the free-form nature of these instructions², how one should engineer instructions that maximize repair performance is an open question.

² Throughout this paper we avoid other definitions of instruction, such as an individual operation in code, to prevent ambiguity.

Section 2 presents a framework that adapts Synthesize, Execute, Debug to instruction-driven Large Language Models for solving programming tasks in an autonomous fashion. We discuss related work in Section 3, introduce experiments to establish optimal search and prompting strategies for this framework in Section 4. Finally, we demonstrate in Section 5 that our framework outperforms conventional automatic programming techniques, such as genetic programming and naive application of large language models that generate one solution per problem without updating it iteratively.

2 METHODOLOGY
The proposed framework, Synthesize, Execute, Instruct, Debug and Rank, or SEIDR,³ is summarized in figure 1. To solve a programming task defined as a text description and a collection of I/O examples, we split the I/O examples into prompt and validation sets and use the prompt set in a large language model to SYNTHESIZE a population of candidate solutions. We EXECUTE the solutions, test them against the validation set, and generate a text description of the identified problems used to INSTRUCT a large language model to produce repaired candidate solutions, similar to the way a human developer DEBUGs a program. We RANK the candidates by correctness measured by matching I/O pairs, discard the worst candidates, and repeat until a fully correct solution is found.

³ seiðr also refers to a type of Norse magic [35] pertaining to predicting and controlling the future, which we deem thematically appropriate.
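To make the loop summarized in figure 1 concrete, the following minimal sketch outlines one possible shape of SEIDR's control flow. The five blocks are passed in as callables and are hypothetical stand-ins for the components described in the following subsections, not the authors' actual implementation.

# Minimal sketch of the SEIDR control flow. The five blocks are passed in as
# callables; they are hypothetical stand-ins, not the authors' code.
def seidr(task, prompt_io, validation_io,
          synthesize, execute, instruct, debug, rank,
          beam_width=10, tree_arity=10, max_programs=1000):
    # SYNTHESIZE: draft an initial population of candidates from the task text
    # and the prompt-set I/O examples.
    candidates = synthesize(task, prompt_io, n=beam_width)
    generated = len(candidates)
    while True:
        # EXECUTE: run each candidate against the validation I/O pairs.
        results = [execute(c, validation_io) for c in candidates]
        # RANK: order candidates by the share of matching I/O pairs, best first.
        ranked = rank(candidates, results)
        best, best_result = ranked[0]
        if best_result.all_passed or generated >= max_programs:
            return best  # fully correct solution found, or budget exhausted
        next_generation = []
        for program, result in ranked[:beam_width]:  # discard the worst candidates
            # INSTRUCT: turn the failed validation cases into a textual repair instruction.
            instruction = instruct(task, result)
            # DEBUG: ask the code LLM for `tree_arity` repaired variants of the candidate.
            next_generation += debug(program, instruction, n=tree_arity)
            generated += tree_arity
        candidates = next_generation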
2.1 Ingredients
SEIDR makes use of two instruction-driven large language models for source code: a synthesis model p_synth(input, instr) and a debugging model p_debug(input, instr), as well as, optionally, a large natural language model p_text(input) that can be used for writing instructions for the code model. Each model is a highly parameterised probability distribution over the space of (input, instruction)-tuples with parameters estimated on a large diverse (i.e., non-task-specific) corpus. This stochastic nature of language models is an important prerequisite for SEIDR, since it lets us sample batches of diverse candidate solutions from p_synth(input, instr), p_debug(input, instr), and p_text(input). We have chosen state-of-the-art transformer models [36] for p_synth(input, instr), p_debug(input, instr), and p_text(input) in our experiments, as described in Section 4.5. In general, SEIDR requires a sequence-to-sequence generative model for these blocks.

2.2 Synthesize
The framework starts with the SYNTHESIZE block, which is responsible for generating initial draft solutions to programming tasks to be repaired in the later stages of SEIDR. We start with a basic template for a chosen programming language that contains a number of standard library imports and an empty main function or this language's equivalent thereof, see figure 2. We populate this template with a comment indicating a text description of the task at hand and several I/O examples from the prompt training set. We design the templates with guidelines by the authors of the language model [37] and prior work [38] in mind. We then sample N programs from p_synth(input, instr), setting input to the populated template and instruction to the problem description. We use temperature sampling with a monotonically increasing temperature schedule where the i-th program is sampled with temperature t_i ≈ i/N (approximate equality enables efficient implementation by means of batching). Thus, the sampling procedure for the first programs approximates deterministic maximum likelihood estimation. Ultimately, this approach ensures that samples are diverse, but always contain the likeliest programs.

Figure 2: Anatomy of SYNTHESIZE templates. [The figure shows parallel Python and C++ templates, each consisting of a standard preamble with useful imports, a comment block with the task description and I/O examples (the fizz-buzz task is shown), and an empty "main" block.]
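As an illustration of the template anatomy sketched in figure 2 and of the temperature schedule above, the following Python sketch populates a template and builds the schedule t_i ≈ i/N. The template wording and helper names are reconstructions for illustration, not the exact artifacts used in the experiments.

# Sketch of a populated Python SYNTHESIZE template and the increasing
# temperature schedule t_i ≈ i/N (illustrative reconstruction).
PYTHON_TEMPLATE = '''import os
import sys
import numpy as np          # standard preamble with useful imports

"""
{task_description}
For example,
input:
{example_input}
output:
{example_output}
"""

if __name__ == '__main__':  # empty "main" block for the LLM to fill in
    pass
'''

def populate_template(task_description, io_example):
    """Fill the template with the task text and one I/O example from the prompt set."""
    example_input, example_output = io_example
    return PYTHON_TEMPLATE.format(task_description=task_description,
                                  example_input=example_input,
                                  example_output=example_output)

def temperature_schedule(n):
    """Temperatures for the N drafts: the first is near-greedy, later ones increasingly diverse."""
    return [i / n for i in range(n)]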
… lexicase selection [51]. During our empirical evaluation of SEIDR performance, we address the following research questions:

RQ1. Repair-replace trade-off exploration: What is the impact of using different tree search strategies in the autonomous programming setting? We experiment with four different tree arities in the tree search and study their impact on the number of resolved problems as well as the speed of obtaining solutions.

… search space for our experiments. W = N = ∞ is achieved in implementation by setting W and N equal to the upper limit on the number of candidates, i.e., 1000. This setting ensures that a second generation of programs does not exist.

[Figures 3 and 4: the INSTRUCT_static and INSTRUCT_LLM variants of the INSTRUCT block (figure 3), and the tree search strategies compared in the experiments — repair-only (depth first), tree arity equal to the beam width, and tree arity = ∞ — expanding the initial template over the 0th–3rd generations of candidate solutions (figure 4).]

4.3 Prompting Strategies
The prompt for the LLM model p_debug(input, instr) consists of the input for editing — the candidate program generated so far — and a debug instruction to repair the candidate. We test SEIDR on 11 debug instructions to explore whether the use of the LLM for text completion p_text(input) benefits the performance of our framework, as well as what effect different prompt phrases have on the debug process. We compare debug instructions that use neutral phrases with those that use more confident language and mimic experienced software developers, as well as shorter and longer instructions with different amounts of details about code behavior. To alleviate the effect of beam width and tree arity, we set N = W = 1 and test the repair-only tree search strategy shown in figure 4. This strategy is used to gradually improve one program candidate throughout the search with no competing programs in the same generation.

The debug instructions are formulated as templates. The instructions describe the violated requirements in terms of the wrong output in a failing I/O test or summarize the bug to capture issues in code logic. We present debug instructions using the template engine format: the brackets { } denote that the placeholder in the brackets will be replaced with the value generated during execution; {Ival} and {Oval} stand for the values of a failing I/O pair from the validation set. As shown in figure 3, the instruction to fix execution errors, which abort the program before the resulting output is obtained, is built from the stderr lines: Fix {stderr}. Static debug instructions that do not use the LLM for bug summarization are as follows:

S0 Make sure that {Ival} -> {Oval};
S1 Make sure the code returns {Oval} for input {Ival};
S2 Ensure that input {Ival} yields output {Oval};
S3 Modify code to get {Oval} from {Ival};
S4 Code must correspond instructions in comments and {Ival} must yield {Oval};
S5 See comments in code and return {Oval} for input {Ival}.

The instruction S0 is the default instruction for the tree arity experiments. It has an intuitive symbolic notation (->) instead of the word "return" or "yield". In instructions S1–S3, we experiment with verbs and the order of output and input. Alternatively, in debug instructions S4–S5, we prompt the model to consider the task description in the docstring in addition to providing the details of the failing I/O pair. Overall, instructions S0–S5 indicate the requirements to be met, but do not describe the current program's behavior.

The second set of instructions uses the LLM for text completion p_text(input). The instructions are designed so that the LLM is prompted to complete the sentence that should describe an error. In addition to validation I/O pairs, the following notation is used: {Op} denotes the program candidate output for input {Ival}, and {task} is a placeholder for a problem description in English. Note that we do not include the incorrect output O_p of a generated candidate program in debug instructions S0–S5, because it is recommended to avoid asking the model what not to do.⁶ We denote the text completion LLM's output as {bug}, which should constitute the bug summary. Input templates to use the LLM for bug description, followed by debugging instruction templates (after "→"), are as follows:

M6 The code should solve the following problem: {task}. The code must return {Oval} for input {Ival} but it returns {Op}. Obviously, the error is that...
→ Fix {bug};
M7 The code should solve the following problem: {task}. The code must return {Oval} for input {Ival} but it returns {Op}. The error is that...
→ Fix {bug};
M8 Problem description: {task}. The code must return {Oval} for input {Ival}, but it returns {Op}. It is clear the error is that...
→ Fix {bug};
M9 There is clearly a bug in code, because the code returns {Op} for input {Ival} but output {Oval} is expected. The bug is that...
→ Fix {bug};
M10 There is clearly a bug in code, because the code returns {Op} for input {Ival} but output {Oval} is expected. The bug is that...
→ Fix {bug} and modify the code to return {Oval} for input {Ival}.

Note that the text completion LLM does not use program candidates in its input, but only the template inputs M6–M10 before the arrow.

Input M6 for the text completion LLM is used to evaluate the effect of the "confidence" sentiment on the bug summaries and debugging process. It is identical to input M7, except for the word "obviously", which should reflect the confidence of the comment's author. Inputs M7 and M8 can be compared in the way the problem description is introduced, i.e., as a separate sentence similar to a spoken situation in prompt M7 or as a short title in M8.

Input templates M9 and M10 for the text completion LLM are identical, but the instruction templates are different. Text completion inputs start with a "confidently" phrased statement that a bug is present in the code. We include both the LLM output {bug} and a description of the failing validation test case in debug instruction M10. Therefore, instructions M6–M9 rely mainly on the LLM output to summarize the bug, whereas instruction M10 also provides information about the expected output.

⁶ https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
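Since the templates above are plain format strings, assembling a debug instruction amounts to filling placeholders. The sketch below shows this for S1 and M7; summarize_bug is a hypothetical stand-in for the GPT-3 text-completion call that produces {bug}.

# Assembling debug instructions from the S and M templates (sketch).
STATIC_S1 = "Make sure the code returns {Oval} for input {Ival}"
LLM_M7_INPUT = ("The code should solve the following problem: {task}. "
                "The code must return {Oval} for input {Ival} but it returns {Op}. "
                "The error is that...")
LLM_FIX = "Fix {bug}"

def static_instruction(i_val, o_val):
    # Static template: describes only the violated requirement.
    return STATIC_S1.format(Ival=i_val, Oval=o_val)

def llm_instruction(task, i_val, o_val, o_p, summarize_bug):
    # summarize_bug stands in for the text-completion LLM that continues the
    # sentence and returns the {bug} summary.
    completion_input = LLM_M7_INPUT.format(task=task, Ival=i_val, Oval=o_val, Op=o_p)
    bug = summarize_bug(completion_input)
    return LLM_FIX.format(bug=bug)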
4.4 Performance Indicators
In our experiments, we compare the number of fully solved problems obtained using SEIDR with different values of hyperparameters. For a more detailed analysis of results, we use test pass rate (TPR) and Excess Programs Generated (EPG). TPR reflects the percentage of fully passed test cases based on the exact match of program output and test output. The TPR metric is used for the final evaluation of generated programs and does not reflect partial passing of an I/O test, as opposed to the score in the RANK block.

The DEBUG and EXECUTE blocks generate a number of programs that are replaced or repaired during the search for the solution program. The number of programs generated before the first occurrence of the program that passes all validation test cases is referred to as EPG. EPG is indicative of the computational cost of solving a problem, distributed in terms of LLM inferences and program compilations and executions.
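Under these definitions, both indicators are straightforward to compute; a minimal sketch follows (helper names are illustrative, not taken from the released code).

# Test pass rate (TPR) and Excess Programs Generated (EPG), as defined above (sketch).
def test_pass_rate(program_outputs, expected_outputs):
    """Fraction of test cases whose output exactly matches the expected output."""
    passed = sum(out == exp for out, exp in zip(program_outputs, expected_outputs))
    return passed / len(expected_outputs)

def excess_programs_generated(pass_flags):
    """Number of candidates generated before the first one that passes all validation tests.

    pass_flags[i] is True if the i-th generated candidate passes every validation case.
    """
    for i, passed in enumerate(pass_flags):
        if passed:
            return i
    return len(pass_flags)  # no candidate passed within the budget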
Figure 5: Number of solved PSB2 problems depending on the tree arity in tree search for the fixed prompt type S0. [Bar chart over tree arities 1, 10, 100, and ∞ for Python and C++, with the PushGP baseline shown for reference.]

Figure 6: Distribution of the number of generated programs during each problem-solving attempt in the experiments with different tree arities where a problem solution is found. [Frequency histograms of EPG for Python and C++: (a) 0 ≤ EPG ≤ 10 with step 1; (b) 0 ≤ EPG ≤ 1000 with step 100.]

4.5 Implementation Details
We use GPT-3 models pre-trained on code⁷ and text⁸ as LLMs in our framework. Specifically, we use Codex-edit (code-davinci-edit-001) as the LLM for draft programs p_synth and the LLM for debugging p_debug, and GPT-3 (text-davinci-003) for bug summarization via text completion with p_text. We ensure that the program candidates generated from the same parent program are different from each other by changing the temperature parameter of Codex-edit.

⁷ https://platform.openai.com/docs/guides/code/editing-code
⁸ https://platform.openai.com/docs/models/gpt-3

We use 2000 I/O pairs from the test split of PSB2 to evaluate the candidate program that has passed all the validation test cases during debugging. Due to repetitive calls to the EXECUTE block, we have to resolve the speed of testing versus precision trade-off while choosing the number of validation test pairs. We resolve the trade-off by fixing the validation set size at 100.

In the experiments with tree arity values, we set the limit to generate a maximum of 1000 program candidates during the search for the candidate that passes all validation tests. If we reach 1000 candidates and none of them passes all validation tests, we report the test pass rate for the last generated candidate. In the experiments with prompts, we set the limit of maximum generated programs to 5, because we search for the prompt that yields the fastest solution, to exclude long searches and comply with the request rate limits.
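For concreteness, a single DEBUG call in this setup might look roughly like the sketch below, written against the legacy (pre-1.0) openai Python client and the Edits endpoint used by code-davinci-edit-001. It illustrates the interface rather than reproducing the authors' code, and the endpoint has since been discontinued (see Data Availability).

# Rough sketch of a DEBUG call to Codex-edit via the legacy OpenAI Edits API
# (pre-1.0 `openai` client; the endpoint is now discontinued).
import openai

def debug_candidates(candidate_code, debug_instruction, n, temperature):
    # Ask Codex-edit for n repaired variants of the current candidate.
    response = openai.Edit.create(
        model="code-davinci-edit-001",   # Codex-edit, used for p_synth and p_debug
        input=candidate_code,            # the current program candidate
        instruction=debug_instruction,   # e.g. a filled S0-S5 or M6-M10 template
        n=n,                             # tree arity: number of repaired variants
        temperature=temperature,         # varied to keep sibling variants different
    )
    return [choice["text"] for choice in response["choices"]]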
5 RESULTS AND DISCUSSION

5.1 RQ1. Repair-replace Trade-off Exploration
We compare the number of solved problems in the experiments with tree arity of 1, 10, 100, and ∞ and fixed debug instruction S0 in Python and C++ in figure 5. The results of SEIDR are compared to the baseline performance of PushGP on the PSB2 benchmark, which solves 17 out of 25 problems. Note that the experiments with N = 1 and N = ∞ can be considered as ablation studies, where the replace option and the repair option is turned off, respectively.

The results highlight the benefit of compromise strategies with tree arity of 10 and 100 over the repair-only (N = 1) and replace-only (N = ∞) strategies. The repair-only scheme is outperformed by the other strategies. We explain the poor performance of the repair-only strategy by the fact that the search space is under-explored. Specifically, the replace scenario ensures that the LLM for debugging, represented by Codex-edit in our experiments, generates different updates of program candidates using variable temperature. The probability of finding a better fix is higher when more alternatives are generated to update the draft program at N > 1 compared to N = 1. The search strategy with N = 10 yields the best results: it performs on par with PushGP for C++ and outperforms the baseline during Python program synthesis by +2 problems, resulting in a total of 19 problems that pass all test cases. The results imply that generating a moderate number of programs in parallel during the DEBUG step works better than the policies in which more updates are generated for each program (100 or 1000) or only one program is updated iteratively.

We present an analysis of the solution speed for all four arities and the fixed default debug instruction in figure 6. In detail, we show the distribution of EPG values in all experiments to explore how many candidate updates are generated before the solution is found. We zoom in on the cases with solutions found within the first 10 program candidates in figure 6a and show the EPG distribution with a step of 100 candidates in figure 6b.

Out of 100 experiments for each language, in 21–24% of runs in Python and C++, the draft program is already the solution (EPG = 0). For 19–32% of experiments, the solution is found after discarding 5 candidates. Around half of the experiments do not generate more than 100 programs. However, 5 problems are solved with more than 500 generated programs in Python, and 1 problem in C++ (with N = 10). The results imply that the first steps in the update of the draft program are crucial for solving the problem. The chances of solving the problem in the later stages of the search, such as after 100 programs have been generated, are low. This confirms our initial assumption in Section 4.2 that 1000 programs are sufficient.

Answer to RQ1. SEIDR outperforms the PushGP baseline on PSB2 in Python and performs on par with it in C++ experiments with tree arity of 10. Search strategies with tree arity larger than one benefit from the replace possibility of the SEIDR framework as a consequence of using variable temperature for Codex-edit. The repair component is also crucial for the framework because the replace-only search policy (with tree arity of ∞) performs worse than the policies that alternate between replace and repair during program update (with tree arity of 10 or 100).
Figure 7: Number of solved PSB2 problems depending on the instruction choice for the fixed tree arity of 1. [Bar chart over debug prompt types S0–S5 and M6–M10 for Python and C++.]

5.2 RQ2. Prompt Engineering
We report the number of solved problems for different static and GPT-assisted debug instructions in figure 7. Because debug instructions are parts of prompts for LLMs and the program candidate format does not change, we will use the term prompt during the analysis of experiment results with different instructions. Overall, the performance of the framework is robust to the debug prompt choice, both with LLM-generated and static templates. The number of solved problems differs for Python and C++ in our experiments. For C++, all debug prompts except S2 result in the same or higher performance than the instruction S0, which is used in the repair-replace trade-off experiments. The debug instruction S2 contains the verbs "yield" and "ensure", which are probably rarely used in code documentation. The best debug instruction for C++ is the LLM-assisted template M6 containing the word "obviously", which should indicate the confidence of the author of the bug summary whom GPT-3 should mimic during autocompletion.

Python programs do not show the same effect during experiments with different prompts. The overall performance drops in comparison with using the prompt S0. By limiting the total number of generated programs from 1000 to 5 in the current set of experiments, we lose 2 problem solutions in Python with S0. The prompt that results in the best performance in C++ for the EPG limit of 5 corresponds to the worst performance in Python. This result can occur due to the small tree arity and low variability of debugging updates of the initial draft. Another reason is that the GPT-3 summary of bugs may not point to logical errors. The model for text autocompletion frequently outputs bug summaries that mention that "the code is not accepting the input correctly." Note that such bug summaries appear in other debug prompts, too.

To analyze the effect of using different prompts on a problem level, we present a heatmap of EPG for all 25 problems in figure 8. We add the values of test pass rate as numbers or signs and show EPG in color. Empty cells denote that the search halts due to OpenAI exceptions, such as APIError.⁹ In addition, if the framework halts before the maximum number of program attempts (light-blue cells with a "-"), it is due to the input length limit of the underlying LLM p_debug, i.e., the generated code is too long and does not fit as input to the LLM. Some problems are solved with all prompts, while other problems are solved with only a subset of prompts, solved partially, or not solved at all. A number of problems are solved with all or the majority of prompts in both languages, such as basement, fizz-buzz, paired-digits, and twitter. Other problems pass all tests in only one of the languages, such as luhn, vector-distance, fuel-cost, or substitution-cipher. Most of the solved problems are generated as the first draft or within 1–2 debug steps. However, some problems pass 90% of test cases at the fifth step, such as substitution-cipher in Python with prompts S4 and M8 or shopping-list in C++ with prompts S0, S1, S5 and M7. These runs are likely to be updated with fully correct programs in the following several steps, according to the results in section 5.1, but the experiments are stopped for the fairness of inter-prompt comparison. Alternatively, conducting the prompt engineering experiment with 1000 max programs would have shown which prompts are beneficial for solving the problems in the long run and can be interesting for future work.

⁹ https://platform.openai.com/docs/guides/error-codes/python-library-error-types

The most interesting cases concern the problems that are solved only with LLM bug summaries or only with static prompts. For example, the gcd problem is solved only with prompts M6–M10 in C++ and is not solved with any of S0–S5. A similar result is obtained for spin-words and coin-sums in C++. In Python, we observe only the cases where solutions are obtained with static prompts and are not obtained with GPT-assisted prompts, e.g., for find-pair and camel-case. In addition, several prompts work well from both the S and M categories, as for gcd in Python.

Answer to RQ2. Program synthesis in C++ with SEIDR achieves better performance in the repair-only setting with both GPT-assisted prompts that summarize bugs in code and static templates which describe failing I/O cases. The best-performing C++ instruction is obtained with GPT-3 for text completion and contains the word "obviously". Results differ for PSB2 solutions in Python: the static prompt template S0 results in the best performance. Overall, SEIDR performance is stable with different debugging prompts submitted to Codex-edit.

5.3 Threats to Validity
External threats to validity concern SEIDR performance on different benchmarks and the use of other language models than the tested ones. Specifically, PSB2 contains competitive programming tasks which require smaller functions to be generated than production-scale software. We plan to extend our experiments in future work to explore the generalizability of results to other benchmarks.

Internal threats relate to the implementation. We use PSB2, which has corner-case tests in the training set and regular cases in the test set. To ensure a fair comparison with other studies on PSB2, we evaluate and report results on the provided test set of PSB2, which risks that the synthesized programs do not pass some of the training cases. Large models for code editing and text completion used in this study are nondeterministic, which impacts results. Due to prohibitive model inference costs, each experiment was only run once. However, our temperature sampling procedure described in section 2.2 reduces this stochasticity significantly, especially for low-EPG results. As with other language models, Codex is a black-box model and may generate malicious code [54]. The Codex model was pre-trained on an unbalanced dataset across programming languages [9]. Thus, the results can be skewed towards high performance in popular programming languages.
Figure 8: Number of excess programs generated (in color) and test pass rate (as numbers) depending on the type of debug
prompt. Higher EPG values are shown in darker shades than low EPG. We denote solved problems with “+” (test pass rate = 1),
unsolved problems with “-” (test pass rate = 0), and show the test pass rate for partially solved problems.
6 CONCLUSION
In this study, we propose the SEIDR framework to solve the challenge of fully autonomous programming. We augment the program synthesis procedure, based on large language models that generate code from templates and textual instructions, with a repair block. The repair block consists of a tree search across the program candidates generated by a large language model for code. The LLM used for code repair takes imperfect program candidates and instructions for their improvement as prompts. The instructions are obtained from both static templates with failing test case descriptions and templates with bug summaries auto-generated by a text completion language model. We explore 11 prompting strategies and the repair-replace trade-off of updating the draft program.

Contributions: We test SEIDR with Codex-edit as the model for draft program synthesis and debugging in Python and C++ on the PSB2 benchmark. In our experiments, SEIDR outperforms the PushGP baseline and achieves the state-of-the-art result with 19 solved problems out of 25. It requires under 1000 program executions to solve them, in stark contrast to billions¹⁰ of executions in PushGP, making it feasible in areas with costly testing, such as robotics. Investigation of the repair-replace trade-off shows that SEIDR with tree arity of 10 outperforms both the replace-only strategy and the repair-only approach. Our prompt engineering study shows that bug summaries generated with "confidence indicators", such as "obviously", improve the performance of SEIDR during C++ code synthesis. Overall, our framework shows low performance variability with different prompts, which indicates its robustness.

¹⁰ A problem is considered "solved" by PushGP if at least 1 of 100 runs, each with a limit of 60 million programs, was successful.

Future work: To study the generalizability of the SEIDR framework, we plan to expand the experiments to the competitive programming dataset of AlphaCode [10] and QuixBugs [46], as well as to experiment with ranking strategies, such as lexicase selection.

DATA AVAILABILITY
The code and results are made available via Zenodo.¹¹ Note that OpenAI discontinued the Codex API on March 23, 2023, and suggests using the GPT-3.5-Turbo API instead.

¹¹ https://doi.org/10.5281/zenodo.7837282

ACKNOWLEDGMENTS
The work presented in this paper was supported by the European Commission through Horizon 2020 grant 812882, and by the Research Council of Norway through the secureIT project (#288787). The empirical evaluation made use of the Experimental Infrastructure for Exploration of Exascale Computing (eX3), supported by the Research Council of Norway through grant #270053.
[40] N. Jack and F. Van der Duyn Schouten. "Optimal Repair–Replace Strategies for a Warranted Product." In: International Journal of Production Economics 67.1 (Aug. 2000), pp. 95–100. doi: 10/cfzj7f.
[41] S. J. Russell. Artificial Intelligence: A Modern Approach. Pearson Education, Inc., 2010.
[42] H. Joshi, J. Cambronero, S. Gulwani, V. Le, I. Radicek, and G. Verbruggen. Repair Is Nearly Generation: Multilingual Program Repair with LLMs. Sept. 2022. arXiv: 2208.11640.
[43] D. Shrivastava, H. Larochelle, and D. Tarlow. Repository-Level Prompt Generation for Large Language Models of Code. Oct. 2022. arXiv: 2206.12839.
[44] Q. Huang, Z. Yuan, Z. Xing, X. Xu, L. Zhu, and Q. Lu. Prompt-Tuned Code Language Model as a Neural Knowledge Base for Type Inference in Statically-Typed Partial Code. Aug. 2022. arXiv: 2208.05361.
[45] B. Ahmad, S. Thakur, B. Tan, R. Karri, and H. Pearce. Fixing Hardware Security Bugs with Large Language Models. Feb. 2023. arXiv: 2302.01215.
[46] D. Lin, J. Koppel, A. Chen, and A. Solar-Lezama. "QuixBugs: A Multi-Lingual Program Repair Benchmark Set Based on the Quixey Challenge." In: Companion of the SIGPLAN International Conference on Systems, Programming, Languages, and Applications: Software for Humanity. Vancouver, BC, Canada: ACM, Oct. 2017, pp. 55–56. doi: 10/gf8nmn.
[47] J. A. Prenner, H. Babii, and R. Robbes. "Can OpenAI's Codex Fix Bugs?: An Evaluation on QuixBugs." In: 2022 IEEE/ACM International Workshop on Automated Program Repair (APR). May 2022, pp. 69–75. doi: 10/grpcnx.
[48] D. Sobania, M. Briesch, C. Hanna, and J. Petke. An Analysis of the Automatic Bug Fixing Performance of ChatGPT. Jan. 2023. arXiv: 2301.08653.
[49] K. Kuznia, S. Mishra, M. Parmar, and C. Baral. Less Is More: Summary of Long Instructions Is Better for Program Synthesis. Oct. 2022. arXiv: 2203.08597.
[50] T. Helmuth and P. Kelly. "Applying Genetic Programming to PSB2: The Next Generation Program Synthesis Benchmark Suite." In: Genetic Programming and Evolvable Machines 23.3 (Sept. 2022), pp. 375–404. doi: 10/gq5gjq.
[51] T. Helmuth and L. Spector. "Problem-Solving Benefits of Down-Sampled Lexicase Selection." In: Artificial Life 27.3–4 (Mar. 2022), pp. 183–203. doi: 10/grrnj7.
[52] T. Helmuth and L. Spector. "General Program Synthesis Benchmark Suite." In: Proceedings of the 2015 Annual Conference on Genetic and Evolutionary Computation. Madrid, Spain: ACM, July 2015, pp. 1039–1046. doi: 10/ghsn5b.
[53] D. Sobania, M. Briesch, and F. Rothlauf. "Choose Your Programming Copilot: A Comparison of the Program Synthesis Performance of Github Copilot and Genetic Programming." In: Proceedings of the Genetic and Evolutionary Computation Conference. Boston, Massachusetts: ACM, July 2022, pp. 1019–1027. doi: 10/gq5gjp.
[54] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri. Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. Dec. 2021. arXiv: 2108.09293.