the prompt, LLMs first output a preliminary and then generate code based on the preliminary. We illustrate guided code generation in Section 2 and describe the details in Section 3.3.

Challenge 2: Code Implementation. After understanding the requirement, implementing the source code in a programming language is challenging. It requires LLMs to master the related grammar, algorithms, and libraries. Even for human developers, it is difficult to write an exactly correct program from scratch.

Novelty 2: Example Retrieval. To solve the above challenge, we propose example retrieval. It is inspired by human developers' code reuse. In real-world scenarios, given a new requirement, developers often search for similar programs. They learn programming skills (e.g., APIs) or directly reuse relevant content from similar programs [31].

Specifically, we use a retriever to search for programs with similar requirements (e.g., the Top-20). Because the maximum input length of LLMs is limited (e.g., 1024 tokens), the number of examples in a prompt is also limited, e.g., three examples. Thus, we further design a selector to pick a set of programs from the retrieved results as examples. The selector filters out redundant programs and keeps informative examples. The examples are then inserted into prompts and teach LLMs how to implement code. We illustrate example retrieval in Section 2 and describe the details in Section 3.2.

In conclusion, given a requirement, AceCoder generates a program in three steps:
• Example retrieval. It uses a retriever and a selector to find similar programs as examples, i.e., <requirement, code> pairs.
• Prompt construction. It uses an analyzer to convert retrieved examples into <requirement, preliminary, code> triples. Then, it concatenates the triple examples with the input requirement to construct a prompt.
• Code generation. It feeds the prompt into LLMs. By learning from the examples, LLMs first output an intermediate preliminary and then generate code for the input requirement.

We apply AceCoder to three representative LLMs, i.e., CodeGeeX [1], CodeGen [32], and InCoder [15]. We conduct extensive experiments on three popular code generation benchmarks, i.e., MBPP (Python) [9], MBJP (Java) [8], and MBJSP (JavaScript) [8]. We employ Pass@𝑘 (𝑘 = 1, 3, 5) to measure the performance of different approaches. We obtain the following findings from the experimental results. (1) AceCoder significantly outperforms existing prompting techniques. In terms of Pass@1, AceCoder outperforms the SOTA baseline - few-shot prompting - by up to 56.4% in MBPP, 70.7% in MBJP, and 88.4% in MBJSP. The improvements prove the superiority of AceCoder in code generation. (2) AceCoder substantially outperforms retrieval-based models. In terms of Pass@1, AceCoder outperforms the SOTA retrieval-based baseline by up to 13.1% in MBPP, 23.44% in MBJP, and 15.8% in MBJSP. (3) AceCoder is effective in LLMs of different sizes. We apply AceCoder to three LLMs, which scale from 6B to 13B. In terms of Pass@1, AceCoder improves CodeGeeX-13B by up to 88.4%, CodeGen-6B by up to 65.5%, and InCoder-6B by up to 57.5%. (4) Human evaluation shows that human developers prefer programs generated by AceCoder. Results show that AceCoder outperforms the SOTA baseline in multiple aspects, including correctness, code smell, and maintainability. (5) We explore the contributions of different modules and discuss different designs for AceCoder. Results show that all three modules are necessary and that our designs for the three modules are superior to multiple alternatives.

We summarize our contributions in this paper as follows.
• We propose a novel prompting technique named AceCoder for improving the performance of LLMs in code generation.
• AceCoder contains two novel techniques (i.e., guided code generation and example retrieval) to alleviate two challenges (i.e., requirement understanding and code implementation) in code generation, respectively.
• We apply AceCoder to three LLMs and conduct extensive experiments on three public benchmarks. Qualitative and quantitative experiments show that AceCoder significantly outperforms the SOTA baselines (e.g., chain-of-thought prompting and few-shot prompting).

Figure 1: A motivating example of guided code generation.
(a) Input requirement:
# Write a python function to remove first and last occurrence of a given character from the string
(b) Unit tests:
(1) ("hello world", "x") -> "hello world"
(2) ("hello world", "l") -> "helo word"
(c) Few-shot Prompting:
def remove_Occ(s, ch):
    return s.replace(ch, '', 1).replace(ch, '', -1)
Evaluation: (1) pass (2) fail
(d) AceCoder:
# ("test", "x") -> "test"
# ("test", "e") -> "tst"
# ("test", "t") -> "es"
def remove_Occ(s, ch):
    first_idx, last_idx = -1, -1
    for i in range(len(s)):
        if s[i] == ch:
            if first_idx == -1:
                first_idx = i
            last_idx = i
    # case 1: ch not in s
    if first_idx == -1 and last_idx == -1:
        return s
    # case 2: ch appears once in s
    elif first_idx == last_idx:
        return s[:first_idx] + s[first_idx+1:]
    # case 3: ch appears multiple times in s
    else:
        return s[:first_idx] + s[first_idx+1:last_idx] + s[last_idx+1:]
Evaluation: (1)(2) pass

2 MOTIVATING EXAMPLES
In this section, we explain our motivations with some real cases.
Requirement Understanding → Guided Code Generation. Figure 1 (a) and (b) show a requirement from a real-world benchmark [9] and its unit tests for evaluation, respectively. We select Codex as the base model. Figure 1 (c) shows a program generated by few-shot prompting. The program fails, as it ignores some essential scenarios in the requirement, such as ch appearing multiple times in s. It shows that comprehensively understanding the requirement is crucial to writing correct programs.
Thus, we propose guided code generation, which asks LLMs to first analyze the requirement and then generate code. Figure 1 (d) shows a program generated by AceCoder. We consider test cases as the intermediate preliminary. We can see that the generated test cases cover multiple scenarios, e.g., boundary inputs ("test", "e"). They further clarify the requirement and benefit the following code implementation. Based on the test cases, AceCoder generates a correct program, which considers three scenarios and gives a solution for each. The example shows that our guided code generation can help LLMs analyze requirements and improve the logical correctness of code.

Figure 2: A motivating example of example retrieval.
(a) Input requirement:
# Write a function to find sequences of lowercase letters joined with an underscore.
(b) Unit tests:
(1) ('a_b_c') -> True
(2) ('a_ c_') -> False
(c) Few-shot Prompting:
def text_lowercase_underscore(text):
    words = text.split()
    for word in words:
        if word.islower() and '_' in word:
            return True
    return False
Evaluation: (1) pass (2) fail
(d) Retrieved Programs:
Program-1: find sequences of literals in a string.
def find_literals(text, pattern):
    match = re.search(pattern, text)
    (more lines...)
Programs-2&3: re.search(…)
Program-4: split a string at lowercase letters.
def split_upperstring(text):
    return re.findall("[a-z][^a-z]*", text)
(e) AceCoder:
def text_lowercase_underscore(text):
    import re
    patterns = '^[a-z]+_[a-z]+$'
    if re.search(patterns, text):
        return True
    else:
        return False

Code Implementation → Example Retrieval. After understanding the input requirement, implementing the code is challenging. It requires LLMs to use various algorithms and libraries. Figure 2 (a) and (b) show a requirement from a real-world benchmark [9] and its unit tests for evaluation, respectively. We select Codex as the base model. Figure 2 (c) shows a program generated by few-shot prompting. The program contains a wrong condition statement (highlighted in yellow in the figure). This is because the model does not know how to judge whether a string contains lowercase letters joined with an underscore.

To alleviate the above problem, we propose example retrieval. Our motivation is that human developers often search for similar programs and learn programming skills from them. Figure 2 (d) shows some retrieved programs based on the similarity of requirements. The retrieval metric is the BM25 score, and we sort the results in descending order of BM25 score. We can see that the retrieved programs contain lots of relevant content (e.g., re.search), which benefits code implementation. Thus, we design a retriever to search for similar programs as examples in prompts. We expect LLMs to learn from similar programs how to implement new programs.

Since the maximum input length of LLMs is usually limited (e.g., 1024 tokens), the number of examples in a prompt is limited. Thus, we need to further select a set of programs from the retrieved results as examples. A straightforward idea is to pick the top similar programs as examples. However, as the programs are retrieved independently, we find that the retrieved results may contain redundant programs. In Figure 2 (d), Program-1, Program-2, and Program-3 are redundant, as all of them provide the API re.search, which teaches how to search for a pattern in text. Program-4 contains a relevant regular expression, which tells how to design a pattern. Suppose the number of examples is 2. The examples will then contain redundant programs (i.e., Program-1&2) and miss the more informative Program-4.

Thus, we design a selector for selecting examples, which can filter out redundant programs in the retrieved results. Suppose the number of examples is 2. In Figure 2 (d), our selector will select Program-1 and Program-4 as examples. Figure 2 (e) shows a program generated by AceCoder. It successfully learns how to write regular expressions from Program-4 and how to use re.search to find patterns from Program-1.

3 METHODOLOGY
In this section, we propose a novel prompting technique for code generation, named AceCoder. In the subsections, we first present an overview of AceCoder and then describe its details.

3.1 An Overview
Code generation aims to generate the source code 𝑦 based on a natural language requirement 𝑥. AceCoder leverages large language models (LLMs) to generate programs via prompting. Figure 3 shows an overview of AceCoder during inference. Given an input requirement 𝑥_test, AceCoder generates code in three steps.
• Example Retrieval. It uses a retriever and a selector to select 𝑘 similar <requirement, code> pairs ({𝑥_𝑖, 𝑦_𝑖}_{𝑖=1}^𝑘) from a retrieval corpus as examples.
• Prompt Construction. It employs an analyzer to convert the examples into <requirement, preliminary, code> triples ({𝑥_𝑖, 𝑎_𝑖, 𝑦_𝑖}_{𝑖=1}^𝑘). A preliminary is a software artifact for clarifying the requirement, such as test cases. The examples are concatenated with the input requirement to construct a prompt.
• Code Generation. The prompt is fed into LLMs. By learning from the examples, LLMs first output an intermediate preliminary and then generate the code.
Here, 𝑥_𝑖, 𝑦_𝑖, and 𝑎_𝑖 denote the requirement, the code, and the preliminary in the 𝑖-th example, respectively.

3.2 Example Retrieval
As shown in Figure 3, the first step has two goals: (i) retrieve similar programs and (ii) select a few examples from the retrieved programs. We design a retriever and a selector to achieve these goals, respectively. The details of the two modules are as follows.

3.2.1 Retriever. Similar programs often have similar natural language requirements [17, 25]. Therefore, we take the input requirement
Figure 3: An overview of AceCoder. Given a requirement, it selects examples from similar programs and constructs a prompt.
LLMs first output an intermediate preliminary and then generate the source code. 𝑥, 𝑦, and 𝑡 denote requirements, programs,
and intermediate preliminaries, respectively.
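To make the three steps in Figure 3 concrete, the sketch below assembles a guided-code-generation prompt from selected examples. It is only an illustration: the comment markers, delimiters, and helper names (render_example, build_prompt) are our assumptions, not the exact prompt format used by AceCoder.

# A minimal sketch of prompt construction for guided code generation (assumed format).
# Each example is a <requirement, preliminary, code> triple; here the preliminary is a
# block of test cases written as comments, as in Figure 1 (d).

def render_example(requirement, test_cases, code):
    # Render one triple: requirement, then test cases (the preliminary), then the code.
    preliminary = "\n".join("# " + t for t in test_cases)
    return "# Requirement: " + requirement + "\n" + preliminary + "\n" + code + "\n"

def build_prompt(examples, new_requirement):
    # Concatenate k triple examples, then append the new requirement.
    # Given this prompt, the LLM is expected to first emit test cases and then the code.
    parts = [render_example(r, t, c) for (r, t, c) in examples]
    parts.append("# Requirement: " + new_requirement + "\n")
    return "\n".join(parts)

examples = [(
    "Write a python function to remove first and last occurrence of a given character from the string",
    ['("hello world", "x") -> "hello world"', '("hello world", "l") -> "helo word"'],
    "def remove_Occ(s, ch):\n    ...",
)]
print(build_prompt(examples, "Write a function to find sequences of lowercase letters joined with an underscore."))

In AceCoder itself, the examples come from the retriever and selector (Section 3.2) and their preliminaries are produced by the analyzer; the sketch only shows how the pieces are concatenated into one prompt.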
Table 2: The results of AceCoder and prompting baselines on three datasets. The values in parentheses are the relative improvements compared to the SOTA baseline - few-shot prompting. For each dataset (MBPP | MBJP | MBJSP), the three values per group are Pass@1, Pass@3, and Pass@5.

CodeGeeX-13B
  Zero-shot prompting: 5.20 13.80 19.40 | 4.46 11.97 18.26 | 0.20 0.20 0.41
  CoT prompting: 12.60 23.40 30.20 | 14.40 28.19 33.67 | 11.35 21.10 25.96
  Few-shot prompting: 20.40 30.60 36.00 | 16.63 26.17 34.48 | 11.16 19.88 25.56
  AceCoder: 26.74 (↑ 31.1%) 36.43 (↑ 19%) 41.13 (↑ 14.2%) | 28.38 (↑ 70.7%) 36.79 (↑ 40.6%) 41.54 (↑ 20.5%) | 21.03 (↑ 88.4%) 31.44 (↑ 58.2%) 36.04 (↑ 41%)
CodeGen-6B
  Zero-shot prompting: 10.40 19.40 24.40 | 14.81 25.76 31.44 | 8.72 19.67 22.92
  CoT prompting: 13.00 21.00 26.00 | 13.59 25.35 31.24 | 11.56 20.08 24.54
  Few-shot prompting: 14.60 24.00 30.20 | 18.25 30.02 34.68 | 9.94 19.88 23.12
  AceCoder: 22.83 (↑ 56.4%) 34.58 (↑ 44.1%) 40.16 (↑ 33%) | 22.45 (↑ 23%) 34.27 (↑ 14.2%) 40.96 (↑ 18.1%) | 16.45 (↑ 65.5%) 27.31 (↑ 37.4%) 32.16 (↑ 39.1%)
InCoder-6B
  Zero-shot prompting: 4.20 11.40 16.20 | 2.23 5.88 9.13 | 3.65 5.88 8.11
  CoT prompting: 3.99 10.65 15.31 | 1.83 4.46 7.10 | 1.22 2.03 4.67
  Few-shot prompting: 12.80 22.80 28.20 | 10.95 23.53 26.17 | 12.78 22.52 27.79
  AceCoder: 20.16 (↑ 57.5%) 31.44 (↑ 37.9%) 34.10 (↑ 20.92%) | 16.37 (↑ 49.5%) 29.89 (↑ 27%) 34.74 (↑ 32.7%) | 15.97 (↑ 25%) 27.13 (↑ 20.5%) 30.65 (↑ 10.3%)
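Table 2 reports Pass@𝑘 (𝑘 = 1, 3, 5). For reference, the widely used unbiased estimator of Pass@k from Chen et al. [14] is sketched below; the number of samples n drawn per problem in these experiments is not stated in this excerpt, so treat this as a reference implementation rather than the paper's exact evaluation script.

import math

def pass_at_k(n, c, k):
    # Unbiased estimator of pass@k (Chen et al. [14]):
    # n = number of generated samples per problem, c = number of samples that pass all tests.
    # pass@k = 1 - C(n - c, k) / C(n, k), averaged over all problems.
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: with 10 samples and 2 correct ones, estimate pass@1, pass@3, and pass@5.
print([round(pass_at_k(10, 2, k), 4) for k in (1, 3, 5)])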
Code Generation. Following previous studies [14, 15, 32], we use nucleus sampling [19] to decode programs from LLMs. The temperature is 0.8 and the top-𝑝 is 0.95. The maximum generated lengths are 400, 500, and 500 tokens, respectively. The sampling settings of the baselines are the same as those of AceCoder.

Answer to RQ1: AceCoder outperforms existing prompting techniques on three benchmarks. In terms of Pass@1, AceCoder outperforms the SOTA baseline by up to 56.4% in MBPP, 70.7% in MBJP, and 88.4% in MBJSP. Besides, AceCoder is effective in LLMs of different sizes. It improves CodeGeeX-13B by up to 88.4%, CodeGen-6B by up to 65.5%, and InCoder-6B by up to 57.5%. The significant improvements prove the effectiveness of AceCoder in code generation.
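The sketch below shows the decoding setup described above (nucleus sampling with temperature 0.8 and top-p 0.95), written against the Hugging Face transformers API. The model name is a placeholder for one of the base LLMs and the prompt string stands in for a prompt built as in Section 3.1; the paper's actual inference code is not shown here.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Salesforce/codegen-6B-mono"  # placeholder for one of the base LLMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

prompt = "# Write a python function to remove first and last occurrence of a given character from the string\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,        # nucleus sampling
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=400,    # maximum generated length (400 or 500 in the setup above)
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(completion)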
Table 3: The comparison of retrieval-based baselines and AceCoder. The values in parentheses are relative improvements compared to the SOTA baseline - Jigsaw. For each dataset (MBPP | MBJP | MBJSP), the three values per group are Pass@1, Pass@3, and Pass@5.

REDCODER: 3.37 6.21 9.74 | 4.46 7.51 9.94 | 4.87 10.34 12.78
Jigsaw: 23.65 33.97 37.78 | 22.99 33.26 36.95 | 18.16 28.79 34.08
AceCoder: 26.74 (↑ 13.1%) 36.43 (↑ 7.2%) 41.13 (↑ 8.9%) | 28.38 (↑ 23.44%) 36.79 (↑ 10.61%) 41.54 (↑ 12.42%) | 21.03 (↑ 15.8%) 31.44 (↑ 9.2%) 36.04 (↑ 5.8%)
Table 4: The results of the ablation study. The values in parentheses are relative improvements compared to few-shot prompting. ✓ and ✗ indicate whether the retriever, selector, and analyzer are used. For each dataset (MBPP | MBJP | MBJSP), the three values per group are Pass@1, Pass@3, and Pass@5 (%).

Retriever ✗, Selector ✗, Analyzer ✗ (few-shot prompting): 20.40 30.60 36.00 | 16.63 26.17 34.48 | 11.16 19.88 25.56
Retriever ✓, Selector ✗, Analyzer ✗: 24.00 (↑ 17.6%) 34.60 (↑ 13.1%) 38.20 (↑ 6.1%) | 23.35 (↑ 40.4%) 33.67 (↑ 28.7%) 37.22 (↑ 7.9%) | 18.66 (↑ 67.2%) 29.18 (↑ 46.8%) 34.89 (↑ 36.5%)
Retriever ✓, Selector ✓, Analyzer ✗: 24.89 (↑ 22%) 35.02 (↑ 14.4%) 39.14 (↑ 8.7%) | 25.03 (↑ 50.5%) 34.47 (↑ 31.7%) 39.24 (↑ 13.8%) | 19.73 (↑ 76.8%) 30.16 (↑ 51.7%) 35.34 (↑ 38.3%)
Retriever ✓, Selector ✓, Analyzer ✓ (AceCoder): 26.74 (↑ 31.1%) 36.43 (↑ 19%) 41.13 (↑ 14.2%) | 28.38 (↑ 70.7%) 36.79 (↑ 40.6%) 41.54 (↑ 20.5%) | 21.03 (↑ 88.4%) 31.44 (↑ 58.2%) 36.04 (↑ 41%)
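The selector rows in Table 4 reflect filtering redundant programs out of the retrieved list before prompt construction. The greedy sketch below is our own illustration of that idea, not the paper's Algorithm 1: it walks the retrieved list in ranked order and skips any candidate whose requirement overlaps too heavily with an already-selected example; the Jaccard measure and the 0.7 threshold are assumptions.

def token_overlap(req_a, req_b):
    # Jaccard overlap between the token sets of two requirements.
    ta, tb = set(req_a.lower().split()), set(req_b.lower().split())
    return len(ta & tb) / max(1, len(ta | tb))

def select_examples(retrieved, k=3, redundancy_threshold=0.7):
    # retrieved: list of (requirement, code) pairs sorted by retrieval score (best first).
    # Greedily keep informative examples and skip near-duplicates of already-kept ones.
    selected = []
    for requirement, code in retrieved:
        if any(token_overlap(requirement, kept_req) >= redundancy_threshold
               for kept_req, _ in selected):
            continue  # too similar to an example we already kept
        selected.append((requirement, code))
        if len(selected) == k:
            break
    return selected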
Table 5: The results of human evaluation. The values in parentheses are the relative improvements compared to the SOTA baseline - few-shot prompting.

Approach | Correctness | Code smell | Maintainability
Zero-shot prompting | 0.3167 | 1.1033 | 1.2749
CoT prompting | 0.6671 | 1.1405 | 1.4479
Few-shot prompting | 0.9769 | 1.2148 | 1.5420
AceCoder | 1.5802 (↑ 61.8%) | 1.6241 (↑ 33.7%) | 1.7544 (↑ 13.8%)

settings are reliable. We manually evaluate programs in three aspects:
• Correctness (whether the program satisfies the given requirement). 0 points: the program is totally inconsistent with the requirement. 1 point: the program is implemented, but misses some details. 2 points: the program is correctly implemented.
• Code Smell (whether the program contains bad code smells). 0 points: there are better solutions in terms of performance, or there is a serious code smell. 1 point: some details are not in place, and there is a code smell of low severity. 2 points: no obviously better code in terms of performance exists; if applicable, resources are released accordingly; no obvious code smell.
• Maintainability (whether the implementation is standardized and has good readability). 0 points: the program does not follow a consistent specification, or there are many meaningless variable names, or there is repeated and redundant code. 1 point: the program implementation meets certain specifications, but some variable names could be further refined. 2 points: the program implementation is relatively standardized, the variable naming is basically semantically straightforward, and the readability is good.
We explain the above aspects to evaluators through some examples. After discussing with the evaluators, we set the score of each aspect to an integer ranging from 0 to 2 (from bad to good). For AceCoder and the baselines, we select a fixed base model (i.e., CodeGen-2B) and collect 200 generated programs per approach. Finally, we obtain 1,000 programs for evaluation. We invite 10 developers with 3-5 years of development experience to evaluate the generated programs in the form of a questionnaire. The 1,000 code snippets are divided into 5 groups, with each questionnaire containing one group. The programs are randomly shuffled and anonymously reviewed by the evaluators. Each group is evaluated by two evaluators, and the final score is the average of the two evaluators' scores. Evaluators are allowed to search the Internet for unfamiliar concepts.
Results. The results of the human evaluation are shown in Table 5. The values in parentheses are the relative improvements compared to the SOTA baseline - few-shot prompting.
Analyses. AceCoder is better than all baselines in all three aspects. Specifically, AceCoder outperforms the SOTA baseline - few-shot prompting - by 61.8% in correctness, 33.7% in code smell, and 13.8% in maintainability. The improvements show that AceCoder has better usability and is promising in practical applications. Besides, all the p-values are substantially smaller than 0.05, which shows the improvements are statistically significant.

Answer to RQ3: Human evaluation shows that human developers prefer programs generated by AceCoder. It outperforms the SOTA baseline by 61.8% in correctness, 33.7% in code smell, and 13.8% in maintainability.

RQ4: What are the contributions of different modules in AceCoder?
Setup. AceCoder contains three modules, i.e., a retriever, a selector, and an analyzer. This RQ is designed to analyze the contributions of the three modules to the performance. We select CodeGeeX as the base model and conduct an ablation study by gradually adding the three modules.
Results. The results are shown in Table 4. ✓ and ✗ represent adding and removing the corresponding modules, respectively. Without the three modules, the base model uses few-shot prompting to generate code. After adding the retriever, the base model selects the top-𝑘 similar programs as examples and directly generates code. After adding the selector, the base model selects 𝑘 examples from the similar programs and then generates code. After further introducing the analyzer, the base model uses AceCoder to generate code.
Analyses. All modules are necessary for AceCoder to perform the best. After adding the retriever, the performance of the base model is improved. In terms of Pass@1, the retriever brings a 17.6% improvement in MBPP, a 40.4% improvement in MBJP, and a 67.2% improvement in MBJSP. This validates our motivation that retrieved programs contain lots of useful information that benefits code generation. After adding the selector, the performance of the base model is further improved. It shows that our selector can effectively filter out redundant programs in the retrieved results and improve the quality of examples. After further introducing the analyzer, the base
model achieves better results. In terms of Pass@1, the base model is improved by 31.1% in MBPP, 70.7% in MBJP, and 88.4% in MBJSP. This proves the effectiveness of guided code generation in analyzing requirements.

Answer to RQ4: The three modules are all essential for the performance of AceCoder. The performance of CodeGeeX on three benchmarks is substantially improved by gradually adding the three modules.

RQ5: What are the better designs for the three modules in AceCoder?
Setup. As stated in Section 3.1, AceCoder contains three modules, i.e., a retriever, a selector, and an analyzer. In this RQ, we explore different designs for the three modules and validate the superiority of our designs. We select CodeGeeX as the base model. The evaluation settings are as follows.
(1) A retriever takes the input requirement as a query and searches for similar programs from a retrieval corpus. We design two choices for the retriever:
• Dense retriever. It uses a neural encoder to convert requirements into vector representations. Then, it retrieves similar programs based on the similarity of the vector representations. In experiments, we use an off-the-shelf natural language representation model [39] as the encoder.
• Sparse retriever (AceCoder). As stated in Section 3.2, it uses the BM25 score as the retrieval metric. The BM25 score measures the lexical-level similarity of two requirements.
(2) A selector aims to score similar programs and filter out redundant programs. For the score function in the selector (line 8 of Algorithm 1), we design two choices:

Table 6: The performance of AceCoder with different designs. "w/" is the abbreviation of "with". For each dataset (MBPP | MBJP | MBJSP), the three values per group are Pass@1, Pass@3, and Pass@5.

AceCoder: 26.74 36.43 41.13 | 28.38 36.79 41.54 | 21.03 31.44 36.04
w/ Dense retriever: 26.63 36.42 41.10 | 28.16 36.55 41.32 | 20.88 31.27 35.94
w/ BLEU selector: 25.61 35.71 40.74 | 27.86 35.91 40.77 | 20.15 30.42 35.47
w/ API analyzer: 25.10 35.24 40.38 | 26.44 35.16 40.12 | 19.86 30.23 35.41
w/ signature analyzer: 26.14 35.96 40.89 | 27.35 36.11 40.98 | 20.58 30.89 35.86

Compared to AceCoder, AceCoder with the dense retriever has a slight drop in performance. It indicates that code generation prefers lexically similar programs, which contain lots of reusable content. Similar findings can be found in code completion work [29]. Besides, the dense retriever has higher complexity and is hard to apply to a large-scale retrieval corpus. (2) The BLEU selector prefers shorter examples and is suboptimal. Compared to AceCoder, AceCoder with the BLEU selector has an obvious decrease in accuracy. We inspect some failed samples and find that the BLEU selector prefers shorter examples. This is because BLEU measures the 𝑛-gram precision of the similar requirements: the shorter the similar requirement, the higher the BLEU. This leads the selector to select short programs as examples and to ignore some informative but long examples. (3) Test cases are more suitable as the preliminary than APIs and method signatures. We carefully inspect some cases. First, many requirements in the benchmarks do not require APIs or only involve a few trivial APIs (e.g., range, split, and len). As a result, generated APIs bring limited benefits to code generation. Second, by generating method signatures, LLMs are asked to think about the input-output format, which benefits code generation. But method signatures miss other necessary details, such as edge cases. AceCoder considers test cases as the preliminary. Test cases are common in code files. Thus, it is feasible for LLMs trained with extensive code data to generate plausible test cases. With the guidance of test cases, LLMs can comprehensively understand requirements and determine related details (e.g., input-output formats, boundary inputs, outliers), thus generating more correct programs.

Answer to RQ5: We explore four alternative designs for AceCoder and compare them to our designs. Results on three benchmarks show the superiority of our designs.
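For reference, the sketch below scores candidate requirements against the input requirement with BM25, the metric used by the sparse retriever. The whitespace tokenizer and the parameters k1 = 1.2 and b = 0.75 are our assumptions; the paper's own implementation (e.g., via an off-the-shelf search library) may differ.

import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.2, b=0.75):
    # Score every candidate requirement in `corpus` against `query` with Okapi BM25.
    docs = [doc.lower().split() for doc in corpus]
    avgdl = sum(len(d) for d in docs) / len(docs)
    n_docs = len(docs)
    df = Counter(term for d in docs for term in set(d))  # document frequency per term
    query_terms = set(query.lower().split())
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores  # higher score = more lexically similar; retrieve the top-ranked programs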
different from CoT prompting and is more promising than CoT prompting in code generation.

6.2 AceCoder vs. Rank Techniques
Some recent studies [13, 20] propose rank techniques to improve the performance of LLMs on code generation. Given a requirement, they first sample many programs from LLMs and then use test cases or neural networks to rerank the sampled programs.
In this paper, we do not directly compare our approach to rank techniques. The reason is that AceCoder and rank techniques have different focuses and are complementary. Our work is a new prompting technique that improves the accuracy of LLMs in code generation. Rank techniques do not care about the LLMs themselves and aim to select the best one among the LLMs' multiple outputs. In practice, users can use AceCoder to generate many programs and then use rank techniques to pick a final output. Thus, we omit them in the experiments.

6.3 Threats to Validity
There are two main threats to the validity of our work.
The generalizability of experimental results. To mitigate this threat, we carefully select the experimental datasets, metrics, and baselines. Following previous studies [8, 13], we pick three representative code generation benchmarks. They are collected from real-world software projects and cover three popular programming languages (i.e., Python, Java, and JavaScript). For evaluation metrics, we select a widely used metric - Pass@𝑘 (𝑘 = 1, 3, 5). Pass@𝑘 is an execution-based metric that utilizes test cases to check the correctness of programs. We select existing prompting techniques and retrieval-based models as comparison baselines. We pick three representative LLMs as base models [1, 14, 15, 32], which scale from 6B to 13B. We apply AceCoder and the baselines to the base models and evaluate their performance on three datasets using Pass@𝑘. To ensure fairness, we run each approach three times and report the average results.
The impact of retrieved programs. The retrieved programs are important elements in AceCoder. Intuitively, when retrieved programs are less relevant to input requirements, the performance of our approach may suffer. To address this threat, we have two thoughts. (1) A large-scale study on 13.2 million real code files found that the proportion of reused code is up to 80% [31]. Thus, we believe that it is quite possible to retrieve similar programs in real development scenarios. (2) Even if retrieved programs are less relevant to input requirements, AceCoder degrades to few-shot prompting at worst. In most cases, AceCoder is superior to few-shot prompting.

7 RELATED WORK
Large Language Models (LLMs) for Code Generation are large-scale neural networks pre-trained on a large corpus of natural language and programming language. With the development of LLM research, current code LLMs can be divided into two categories: standard language models and instruction-tuned models.
Standard language models are pre-trained on the raw corpus with the next-token prediction objective. They can continually complete the given context, which makes them useful in tasks like code completion and code generation. With the success of the GPT series [11, 37, 38] in NLP, OpenAI adapted a similar idea to the domain of source code and fine-tuned GPT models on code to produce the closed-source Codex [14]. There are multiple open-source attempts to replicate its success, e.g., CodeParrot [3], CodeGen [32], CodeGeeX [1], InCoder [15], StarCoder [26], and CodeT5+ [45].
Instruction-tuned models are models fine-tuned using instruction tuning [48]. Instruction tuning helps models follow users' instructions. OpenAI's ChatGPT [33] is trained with Reinforcement Learning from Human Feedback (RLHF) [34], making it capable of both natural language tasks and programming tasks. Due to its enormous influence and closed-sourceness, many researchers try to create open-source ChatGPT alternatives using instruction tuning and its variants. Alpaca [41] is LLaMA [42] fine-tuned using self-instruct [44] and ChatGPT feedback. Code Alpaca [12] is LLaMA fine-tuned using self-instruct and ChatGPT feedback with more programming-focused instructions. WizardCoder [30] is StarCoder [26] fine-tuned using Evol-Instruct [50] and ChatGPT feedback with Code Alpaca's dataset as the seed dataset. InstructCodeT5+ [45] is CodeT5+ [45] fine-tuned on Code Alpaca's dataset.
Prompting Techniques. LLMs are too large to fine-tune, so researchers need new ways to adapt LLMs to downstream tasks. Prompting techniques are a popular approach that leverages LLMs to generate code by inputting a special prompt. Early on, researchers proposed zero-shot prompting and few-shot prompting. Zero-shot prompting concatenates a task instruction (e.g., please generate a program based on the requirement) and a requirement together to make the prompt. Based on zero-shot prompting, few-shot prompting further adds several ⟨requirement, code⟩ pairs to the prompt, so that LLMs can learn code generation from the given examples. Chain-of-Thought (CoT) prompting [49] is a recently proposed prompting technique. CoT asks LLMs to first generate CoTs (i.e., intermediate natural language reasoning steps) and then output the final code. It allows LLMs to first design a solving process that leads to the code. CoT has achieved SOTA results in natural language generation and
sparked lots of follow-up research, such as self-consistency prompting [43] and least-to-most prompting [52]. But these prompting techniques are designed for natural language generation and bring only slight improvements in code generation.

8 CONCLUSION AND FUTURE WORK
We propose a new prompting technique named AceCoder to improve the performance of LLMs on code generation. AceCoder designs two novel techniques (i.e., guided code generation and example retrieval) to help LLMs understand requirements and implement programs. Guided code generation asks LLMs to output an intermediate preliminary (e.g., test cases) before generating programs. The preliminary helps LLMs understand requirements and guides the subsequent code generation. Example retrieval selects similar programs as examples, which provide many reusable elements for program implementation. We apply AceCoder to three LLMs and conduct experiments on three benchmarks. Results show that AceCoder significantly outperforms the SOTA baselines.
In the future, we will explore how to improve the usability of LLMs in code generation, for example, how to teach LLMs to use unseen frameworks without re-training.

REFERENCES
[1] 2022. CodeGeeX. https://models.aminer.cn/codegeex/zh-CN.
[2] 2022. CodeGeeX. https://models.aminer.cn/codegeex/blog/index.html.
[3] 2022. CodeParrot. https://huggingface.co/codeparrot/codeparrot.
[4] 2022. GitHub. https://github.com/.
[5] 2022. Lucene. https://lucene.apache.org/.
[6] 2022. tree-sitter. https://tree-sitter.github.io/tree-sitter/.
[7] Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 2655–2668. https://doi.org/10.18653/v1/2021.naacl-main.211
[8] Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. 2022. Multi-lingual Evaluation of Code Generation Models. arXiv preprint arXiv:2210.14868 (2022).
[9] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021).
[10] Kent Beck. 2003. Test-driven development: by example. Addison-Wesley Professional.
[11] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
[12] Sahil Chaudhary. 2023. Code Alpaca: An Instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca.
[13] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. CodeT: Code generation with generated tests. arXiv preprint arXiv:2207.10397 (2022).
[14] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[15] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. InCoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999 (2022).
[16] Yiyang Hao, Ge Li, Yongqiang Liu, Xiaowei Miao, He Zong, Siyuan Jiang, Yang Liu, and He Wei. 2022. AixBench: A Code Generation Benchmark Dataset. arXiv preprint arXiv:2206.13179 (2022).
[17] Tatsunori B Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S Liang. 2018. A retrieve-and-edit framework for predicting structured outputs. Advances in Neural Information Processing Systems 31 (2018).
[18] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[19] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text Degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=rygGQyrFvH
[20] Jeevana Priya Inala, Chenglong Wang, Mei Yang, Andrés Codas, Mark Encarnación, Shuvendu K. Lahiri, Madanlal Musuvathi, and Jianfeng Gao. 2022. Fault-Aware Neural Code Rankers. In NeurIPS. http://papers.nips.cc/paper_files/paper/2022/hash/5762c579d09811b7639be2389b3d07be-Abstract-Conference.html
[21] Naman Jain, Skanda Vaidyanath, Arun Iyer, Nagarajan Natarajan, Suresh Parthasarathy, Sriram Rajamani, and Rahul Sharma. 2022. Jigsaw: Large language models meet program synthesis. In Proceedings of the 44th International Conference on Software Engineering. 1219–1231.
[22] Jia Li, Ge Li, Yongmin Li, and Zhi Jin. 2023. Enabling Programming Thinking in Large Language Models Toward Code Generation. CoRR abs/2305.06599 (2023). https://doi.org/10.48550/arXiv.2305.06599 arXiv:2305.06599
[23] Jia Li, Ge Li, Zhuo Li, Zhi Jin, Xing Hu, Kechi Zhang, and Zhiyi Fu. 2023. CodeEditor: Learning to Edit Source Code with Pre-Trained Models. ACM Trans. Softw. Eng. Methodol. (May 2023). https://doi.org/10.1145/3597207 Just Accepted.
[24] Jia Li, Yongmin Li, Ge Li, Xing Hu, Xin Xia, and Zhi Jin. 2021. EditSum: A retrieve-and-edit framework for source code summarization. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 155–166.
[25] Jia Li, Yongmin Li, Ge Li, Zhi Jin, Yiyang Hao, and Xing Hu. 2023. SkCoder: A Sketch-based Approach for Automatic Code Generation. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 2124–2135. https://doi.org/10.1109/ICSE48619.2023.00179
[26] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[27] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with AlphaCode. Science 378, 6624 (2022), 1092–1097.
[28] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain. 74–81.
[29] Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svyatkovskiy. 2022. ReACC: A Retrieval-Augmented Code Completion Framework. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, 6227–6240. https://doi.org/10.18653/v1/2022.acl-long.431
[30] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv preprint arXiv:2306.08568 (2023).
[31] Audris Mockus. 2007. Large-scale code reuse in open source software. In First International Workshop on Emerging Trends in FLOSS Research and Development (FLOSS'07: ICSE Workshops 2007). IEEE, 7–7.
[32] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. A conversational paradigm for program synthesis. arXiv preprint arXiv:2203.13474 (2022).
[33] OpenAI. 2022. ChatGPT. https://openai.com/blog/chatgpt.
[34] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[35] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
[36] Md. Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Retrieval Augmented Code Generation and Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, 2719–2734. https://doi.org/10.18653/v1/2021.findings-emnlp.232
[37] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
[38] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[39] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, 3980–3990. https://doi.org/10.18653/v1/D19-1410
[40] Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389.
[41] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
[42] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023).
[43] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=1PL1NIMMrw
[44] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 13484–13508. https://aclanthology.org/2023.acl-long.754
[45] Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023. CodeT5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922 (2023).
[46] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 8696–8708. https://doi.org/10.18653/v1/2021.emnlp-main.685
[47] Bolin Wei, Yongmin Li, Ge Li, Xin Xia, and Zhi Jin. 2020. Retrieve and refine: exemplar-based neural comment generation. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 349–360.
[48] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned Language Models are Zero-Shot Learners. In International Conference on Learning Representations. https://openreview.net/forum?id=gEZrGCozdqR
[49] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems.
[50] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023).
[51] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning. PMLR, 12697–12706.
[52] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. 2023. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=WZH7099tgfM