AceCoder: Utilizing Existing Code to Enhance Code Generation

Jia Li ♂, Yunfei Zhao, Yongmin Li, Ge Li, and Zhi Jin
Peking University, Beijing, China
[email protected]

ABSTRACT

Large Language Models (LLMs) have shown great success in code generation. LLMs take as input a prompt and output the code. A key question is how to make prompts (i.e., prompting techniques). Existing prompting techniques are designed for natural language generation and have low accuracy in code generation.

In this paper, we propose a new prompting technique named AceCoder. Our motivation is that code generation meets two unique challenges (i.e., requirement understanding and code implementation). AceCoder contains two novel mechanisms (i.e., guided code generation and example retrieval) to solve these challenges. (1) Guided code generation asks LLMs first to analyze requirements and output an intermediate preliminary (e.g., test cases). The preliminary is used to clarify requirements and tell LLMs "what to write". (2) Example retrieval selects similar programs as examples in prompts, which provide lots of relevant content (e.g., algorithms, APIs) and teach LLMs "how to write". We apply AceCoder to three LLMs (e.g., Codex) and evaluate it on three public benchmarks using Pass@k. Results show that AceCoder can significantly improve the performance of LLMs on code generation. (1) In terms of Pass@1, AceCoder outperforms the state-of-the-art baseline by up to 56.4% in MBPP, 70.7% in MBJP, and 88.4% in MBJSP. (2) AceCoder is effective in LLMs with different sizes (i.e., 6B to 13B) and different languages (i.e., Python, Java, and JavaScript). (3) Human evaluation shows that human developers prefer programs from AceCoder.

ACM Reference Format:
Jia Li ♂, Yunfei Zhao, Yongmin Li, Ge Li, and Zhi Jin. 2023. AceCoder: Utilizing Existing Code to Enhance Code Generation. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

Code generation aims to automatically generate the source code based on a natural language requirement [22, 23, 25]. Recently, Large Language Models (LLMs) have achieved state-of-the-art (SOTA) results on code generation [1, 14, 15, 27, 32]. LLMs do not require fine-tuning and take a prompt as input. A prompt consists of several examples (e.g., <requirement, code> pairs) and a new requirement. LLMs learn code generation from the examples and analogously generate code for the new requirement.

The performance of LLMs strongly relies on the prompt surface [51]. Thus, how to design prompts (i.e., prompting techniques) is a popular topic. Existing prompting techniques (e.g., few-shot prompting [11] and chain-of-thought prompting [49]) are designed for natural language generation and have low accuracy in code generation. For example, Codex with few-shot prompting only achieves 37.2% Pass@1 on a real-world benchmark - HumanEval [14]. Thus, it is necessary to explore more advanced prompting techniques for code generation.

In this paper, we propose a novel prompting technique specialized in code generation, named AceCoder. It significantly improves the performance of LLMs in code generation. Our motivation is that code generation aims to build a mapping from natural language requirements to source code. There are two unique challenges in this mapping, i.e., requirement understanding and code implementation. AceCoder proposes two novel mechanisms to alleviate these two challenges. The details of AceCoder are as follows.

Challenge 1: Requirement Understanding. Understanding requirements is the starting point of code generation. In real-world programming problems, the requirement may be a brief purpose without specific details. For example, a requirement from a real-world benchmark - MBPP [9] - is "write a function to check if the triangle is isosceles or not". Before writing code, we need to analyze the requirement and determine specific details, e.g., input-output formats and possible exceptions.

Novelty 1: Guided Code Generation. To alleviate this challenge, we propose guided code generation. Our motivation is that human developers often use software artifacts to assist in analyzing requirements. For example, in test-driven development [10], developers clarify requirements by designing test cases. It forces developers to think about details of requirements, e.g., input-output formats and boundary values. These test cases exactly define the requirement and tell developers what to write.

To implement the above process, we design a special prompt consisting of triple examples (i.e., <requirement, preliminary, code>). A preliminary is a specific software artifact (e.g., test cases, APIs) for clarifying the requirement. Given a new requirement, based on the prompt, LLMs first output a preliminary and then generate code based on the preliminary. We illustrate guided code generation in Section 2 and describe the details in Section 3.3.

Challenge 2: Code Implementation. After understanding the requirement, how to implement the source code using a programming language is challenging. It requires LLMs to master related grammar, algorithms, and libraries. Even for human developers, it is difficult to write an exactly correct program from scratch.

Novelty 2: Example Retrieval. To solve the above challenge, we propose example retrieval. It is inspired by human developers' code reuse. In real-world scenarios, given a new requirement, developers often search for similar programs. They learn programming skills (e.g., APIs) or directly reuse relevant content from similar programs [31].

Specifically, we use a retriever to search for programs with similar requirements (e.g., top-20). Considering that the maximum input length of LLMs is limited (e.g., 1024 tokens), the number of examples in a prompt is also limited, such as three examples. Thus, we further design a selector to select a set of programs from the retrieved results as examples. The selector filters out redundant programs and picks informative examples. Then, the examples are inserted into prompts and teach LLMs how to implement code. We illustrate example retrieval in Section 2 and describe the details in Section 3.2.

In conclusion, given a requirement, AceCoder generates a program in three steps:
• Example retrieval. It uses a retriever and a selector to find similar programs as examples, i.e., <requirement, code> pairs.
• Prompt construction. It uses an analyzer to convert retrieved examples into <requirement, preliminary, code> triples. Then, it concatenates the triple examples with the input requirement to construct a prompt.
• Code generation. It feeds the prompt into LLMs. By learning from the examples, LLMs first output an intermediate preliminary and then generate code for the input requirement.

We apply AceCoder to three representative LLMs, i.e., CodeGeeX [1], CodeGen [32], and InCoder [15]. We conduct extensive experiments on three popular code generation benchmarks, i.e., MBPP (Python) [9], MBJP (Java) [8], and MBJSP (JavaScript) [8]. We employ Pass@k (k = 1, 3, 5) to measure the performance of different approaches. We obtain the following findings from the experimental results. (1) AceCoder significantly outperforms existing prompting techniques. In terms of Pass@1, AceCoder outperforms the SOTA baseline - few-shot prompting - by up to 56.4% in MBPP, 70.7% in MBJP, and 88.4% in MBJSP. The improvements prove the superiority of AceCoder in code generation. (2) AceCoder substantially outperforms retrieval-based models. In terms of Pass@1, AceCoder outperforms the SOTA retrieval-based baseline by up to 13.1% in MBPP, 23.44% in MBJP, and 15.8% in MBJSP. (3) AceCoder is effective in LLMs of different sizes. We apply AceCoder to three LLMs, which scale from 6B to 13B. In terms of Pass@1, AceCoder improves CodeGeeX-13B by up to 88.4%, CodeGen-6B by up to 65.5%, and InCoder-6B by up to 57.5%. (4) Human evaluation shows that human developers prefer programs generated by AceCoder. Results show that AceCoder outperforms the SOTA baseline in multiple aspects, including correctness, code smell, and maintainability. (5) We explore the contributions of different modules and discuss different designs for AceCoder. Results show that all three modules are necessary and that our designs for the three modules are superior to multiple alternatives.

We summarize our contributions in this paper as follows.
• We propose a novel prompting technique named AceCoder for improving the performance of LLMs in code generation.
• AceCoder contains two novel techniques (i.e., guided code generation and example retrieval) to alleviate two challenges (i.e., requirement understanding and code implementation) in code generation, respectively.
• We apply AceCoder to three LLMs and conduct extensive experiments on three public benchmarks. Qualitative and quantitative experiments show that AceCoder significantly outperforms the SOTA baselines (e.g., chain-of-thought prompting, few-shot prompting).

(a) Input requirement:
# Write a python function to remove first and last
# occurrence of a given character from the string
(b) Unit tests:
(1) ("hello world", "x") -> "hello world"
(2) ("hello world", "l") -> "helo word"
(c) Few-shot Prompting:
def remove_Occ(s, ch):
    return s.replace(ch, '', 1).replace(ch, '', -1)
Evaluation: (1) pass (2) fail
(d) AceCoder:
# ("test", "x") -> "test"
# ("test", "e") -> "tst"
# ("test", "t") -> "es"
def remove_Occ(s, ch):
    first_idx, last_idx = -1, -1
    for i in range(len(s)):
        if s[i] == ch:
            if first_idx == -1:
                first_idx = i
            last_idx = i
    # case 1: ch not in s
    if first_idx == -1 and last_idx == -1:
        return s
    # case 2: ch appears once in s
    elif first_idx == last_idx:
        return s[:first_idx] + s[first_idx+1:]
    # case 3: ch appears multiple times in s
    else:
        return s[:first_idx] + \
            s[first_idx+1:last_idx] + s[last_idx+1:]
Evaluation: (1)(2) pass

Figure 1: A motivating example of guided code generation.

2 MOTIVATING EXAMPLES

In this section, we explain our motivations with some real cases.

Requirement Understanding → Guided Code Generation. Figure 1 (a) and (b) show a requirement from a real-world benchmark [9] and its unit tests for evaluation, respectively. We select Codex as the base model. Figure 1 (c) shows a program generated by few-shot prompting. The program fails, as it ignores some essential scenarios in the requirement, such as ch appearing multiple times in s. It shows that comprehensively understanding the requirement is crucial to write correct programs.

Thus, we propose guided code generation, which asks LLMs to first analyze the requirement and then generate code. Figure 1 (d) shows a program generated by AceCoder. We consider test cases as the intermediate preliminary. We can see that the generated test cases cover multiple scenarios, e.g., the boundary input ("test", "e"). They further clarify the requirement and benefit the following code implementation. Based on the test cases, AceCoder generates a correct program, which considers three scenarios and gives solutions for each of them. The example shows that our guided code generation can help LLMs analyze requirements and improve the logical correctness of code.

(a) Input requirement:
# Write a function to find sequences of lowercase
# letters joined with an underscore.
(b) Unit tests:
(1) ('a_b_c') -> True
(2) ('a_ c_') -> False
(c) Few-shot Prompting:
def text_lowercase_underscore(text):
    words = text.split()
    for word in words:
        if word.islower() and '_' in word:
            return True
    return False
Evaluation: (1) pass (2) fail
(d) Retrieved Programs:
Program-1: find sequences of literals in a string.
def find_literals(text, pattern):
    match = re.search(pattern, text)
    (more lines...)
Programs-2&3: re.search(…)
Program-4: split a string at lowercase letters.
def split_upperstring(text):
    return re.findall("[a-z][^a-z]*", text)
(e) AceCoder:
def text_lowercase_underscore(text):
    import re
    patterns = '^[a-z]+_[a-z]+$'
    if re.search(patterns, text):
        return True
    else:
        return False

Figure 2: A motivating example of example retrieval.

Code Implementation → Example Retrieval. After understanding the input requirement, how to implement the code is challenging. It requires LLMs to use various algorithms or libraries. Figure 2 (a) and (b) show a requirement from a real-world benchmark [9] and its unit tests for evaluation, respectively. We select Codex as the base model. Figure 2 (c) shows a program generated by few-shot prompting. The program contains a wrong condition statement. This is because the model does not know how to judge a string containing lowercase letters joined with an underscore.

To alleviate the above problem, we propose example retrieval. Our motivation is that human developers often search for similar programs and learn programming skills from them. Figure 2 (d) shows some retrieved programs based on the similarity of requirements. The retrieval metric is the BM25 score, and we sort the results in descending order of BM25 score. We can see that the retrieved programs contain lots of relevant content (e.g., re.search), which benefits code implementation. Thus, we design a retriever to search for similar programs as examples in prompts. We expect that LLMs can learn from similar programs how to implement new programs.

Since the maximum input length of LLMs is usually limited (e.g., 1024 tokens), the number of examples in a prompt is limited. Thus, we need to further select a set of programs from the retrieved results as examples. A straightforward idea is to pick the top similar programs as examples. However, as the programs are retrieved independently, we find that retrieved results may contain redundant programs. In Figure 2 (d), Program-1, Program-2, and Program-3 are redundant, as all of them provide an API, re.search, which teaches how to search for a pattern in text. Program-4 contains a relevant regular expression, which tells how to design a pattern. Suppose the number of examples is 2. The examples would then contain redundant programs (i.e., Program-1&2) and miss the more informative Program-4.

Thus, we design a selector for selecting examples, which can filter out redundant programs in retrieved results. Suppose the number of examples is 2. In Figure 2 (d), our selector will select Program-1 and Program-4 as examples. Figure 2 (e) shows a program generated by AceCoder. It successfully learns how to write regular expressions from Program-4 and learns how to use re.search to find patterns from Program-1.

3 METHODOLOGY

In this section, we propose a novel prompting technique for code generation, named AceCoder. In the subsections, we first present an overview of AceCoder and then describe its details.

3.1 An Overview

Code generation aims to generate the source code 𝑦 based on a natural language requirement 𝑥. AceCoder leverages large language models (LLMs) to generate programs via prompting. Figure 3 shows an overview of AceCoder during inference. Given an input requirement 𝑥_test, AceCoder generates code in three steps.
• Example Retrieval. It uses a retriever and a selector to select 𝑘 similar <requirement, code> pairs ({𝑥_𝑖, 𝑦_𝑖}, 𝑖 = 1..𝑘) from a retrieval corpus as examples.
• Prompt Construction. It employs an analyzer to convert the examples into <requirement, preliminary, code> triples ({𝑥_𝑖, 𝑎_𝑖, 𝑦_𝑖}, 𝑖 = 1..𝑘). A preliminary is a software artifact for clarifying the requirement, such as test cases. The examples are concatenated with the input requirement to construct a prompt.
• Code Generation. The prompt is fed into LLMs. By learning from the examples, LLMs first output an intermediate preliminary and then generate the code.
Here 𝑥_𝑖, 𝑦_𝑖, and 𝑎_𝑖 denote the requirement, the code, and the preliminary in the 𝑖-th example, respectively.

3.2 Example Retrieval

As shown in Figure 3, the first step has two goals: (i) retrieve similar programs and (ii) select a few examples from the retrieved programs. We design a retriever and a selector to achieve these goals, respectively. The details of the two modules are as follows.

[Figure 3 diagram: (a) Example Retrieval: a retriever and a selector pick examples {𝑥_𝑖, 𝑦_𝑖} from the retrieval corpus for the input requirement 𝑥_test. (b) Prompt Construction: an analyzer adds a preliminary (e.g., test cases, API calls) to each example and builds the prompt. (c) Code Generation: the large language model takes the prompt and outputs a preliminary 𝑡, then the source code.]

Figure 3: An overview of AceCoder. Given a requirement, it selects examples from similar programs and constructs a prompt.
LLMs first output an intermediate preliminary and then generate the source code. 𝑥, 𝑦, and 𝑡 denote requirements, programs,
and intermediate preliminaries, respectively.

3.2.1 Retriever. Similar programs often have similar natural language requirements [17, 25]. Therefore, we take the input requirement as a query to search for similar requirements in a retrieval corpus. Then, we extract the corresponding programs as similar programs.

Specifically, we leverage an open-source search engine named Lucene [5] to build our retriever and use the training data as the retrieval corpus. We employ the BM25 score [40] as the retrieval metric, which is widely used in previous studies [24, 47]. The BM25 score is a bag-of-words retrieval function and is used to estimate the lexical-level similarity of two sentences. The more similar two sentences are, the higher the BM25 score. In this paper, the retriever outputs the top-𝑚 similar programs based on the BM25 score.

The reason for choosing BM25+Lucene is that they achieve good retrieval accuracy with low complexity. Considering that the retrieval corpus is often large-scale, a lightweight retriever is closer to practical applications. In Section 5-RQ5, we also explore other designs for the retriever and compare them to our design.
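The paper's retriever is built on Lucene with BM25; as a rough illustration of the same retrieval step, the sketch below uses the rank_bm25 Python package as a stand-in. The corpus layout, tokenization, and function names are assumptions for illustration, not the authors' implementation.

# Sketch of the BM25-based retriever (illustrative; the paper builds it with Lucene [5]).
# Assumes `corpus` is a list of (requirement, code) pairs taken from the training data.
from rank_bm25 import BM25Okapi

def build_retriever(corpus):
    # Tokenize each stored requirement with simple whitespace splitting.
    tokenized = [req.lower().split() for req, _code in corpus]
    return BM25Okapi(tokenized)

def retrieve_top_m(bm25, corpus, query_requirement, m=20):
    # Score every stored requirement against the query and keep the top-m programs.
    scores = bm25.get_scores(query_requirement.lower().split())
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return [corpus[i] for i in ranked[:m]]

In the setting of Section 4.5, 𝑚 would be 20 and the corpus would be the training split of each benchmark.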
Input requirement:
# Write a function to find sequences of lowercase
# letters joined with an underscore in a string.
Similar program-1:
# find sequences of literals in a string.
def find_literals(text, pattern):
    re.search(…)
Similar program-2:
# find sequences of an a followed by zero or more b's.
def text_match(text):
    re.search(…)
Similar program-3:
# find sequences of numbers containing a decreasing trend or not.
def decreasing_trend(nums):
    re.search(…)
Similar program-4:
# split a text at lowercase letters.
def split_upperstring(text):
    return re.findall("[a-z][^a-z]*", text)

Figure 4: A requirement and its similar programs.

Algorithm 1 The algorithm of our selector.
Inputs: input requirement 𝑥_test; similar programs {(𝑥_𝑖, 𝑦_𝑖)}, 𝑖 = 1..𝑚; the number of examples 𝑘 (𝑘 ≤ 𝑚); decay factor 𝜆.
Outputs: selected examples 𝑇 = {(𝑥_𝑖, 𝑦_𝑖)}, 𝑖 = 1..𝑘.
1:  𝑇 ← empty ordered list
2:  𝑆 ← Extract_ngrams_with_count(𝑥_test)
3:  for 𝑖 in {1, · · · , 𝑚} do
4:      𝑄[𝑖] ← Extract_ngrams_with_count(𝑥_𝑖)
5:  end for
6:  while len(𝑇) < 𝑘 do
7:      for 𝑖 in {1, · · · , 𝑚} do
8:          Score[𝑖] ← Ngram_overlap_score(𝑆, 𝑄[𝑖])
9:      end for
10:     𝑗 ← argmax(Score)
11:     𝑇.append((𝑥_𝑗, 𝑦_𝑗))
12:     matched_ngrams ← 𝑆 ∩ 𝑄[𝑗]
13:     𝑄[𝑗] ← ∅
14:     for ngram ∈ matched_ngrams do
15:         𝑆[ngram] ×= 𝜆
16:     end for
17: end while
18: return 𝑇

3.2.2 Selector. We can obtain the top-𝑚 similar programs from the retriever. However, the maximum input length of LLMs (e.g., 1024 tokens) and the inference budget are often limited. As a result, the number of examples (i.e., 𝑘) in a prompt is also limited (e.g., three examples). It is necessary to further select 𝑘 programs from the retrieved results as examples.

A straightforward idea is to pick the top-𝑘 similar programs as examples. However, as the programs are scored independently, we find that retrieved results may contain redundant programs. Figure 4 shows a requirement and its similar programs, ranked by the BM25 score. We can see that the top-3 programs are redundant, as all of them use an API (i.e., re.search) to find sequences of a specific pattern. Program-4 contains a relevant regex expression. However, as Program-4 has fewer overlapping 𝑛-grams with the input requirement, it has a relatively small BM25 score. Obviously, directly selecting the top-𝑘 (e.g., top-3) retrieved programs is unreasonable, as it introduces redundant programs and ignores the more informative Program-4.

In this paper, we design a selector, which can filter out redundant programs in the retrieved results. The algorithm of the selector is shown in Algorithm 1. We first extract all 𝑛-grams of the input requirement and all similar requirements (lines 2-5). In this paper, 𝑛 is set to 4 by default.
Then, we calculate a recall-based ROUGE-𝑛 score between the input requirement and each similar requirement using the following equations (lines 7-9):

R_n = \frac{\sum_{n\text{-gram} \in S \cap Q} S(n\text{-gram})}{\sum_{n\text{-gram} \in S} S(n\text{-gram})}    (1)

Score = \exp\left(\frac{1}{N} \sum_{n=1}^{N} \log R_n\right)    (2)

where S and Q denote the 𝑛-gram counts of the input requirement and the similar requirement, respectively, and N is the maximum 𝑛-gram order (4 by default).

We take the similar requirement with the maximum score and add its corresponding program to the examples (lines 10-11). Then, the matched 𝑛-grams between that similar requirement and the input requirement are decayed by a factor 𝜆. This process (lines 6-17) is repeated until the number of examples reaches the upper bound. The motivation for the decay is to filter out redundant programs, i.e., programs with the same matched 𝑛-grams. For example, in Figure 4, we first add Program-1 to the examples and then decay its matched 𝑛-grams (e.g., "find sequences of"). Subsequent programs with the same matched 𝑛-grams (i.e., Program-2 and Program-3) are considered redundant and will be ignored. Program-4 contains new matched 𝑛-grams (e.g., "lowercase letters") and probably contains new information. Thus, Program-4 will obtain a higher score and is added to the examples.

By the above process, our selector filters out redundant programs and selects 𝑘 similar programs as examples. In practice, 𝑚 and 𝑘 are small numbers, such as 𝑚 = 50 and 𝑘 = 3. Thus, the time complexity of our selector is acceptable.
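Algorithm 1 together with Equations (1) and (2) can be sketched directly in Python. The snippet below is a minimal illustration of the selector: the function names are mine, the default of four 𝑛-gram orders follows the text above, and the decay value of 0.5 is an arbitrary placeholder since 𝜆 is not fixed in this excerpt.

# Sketch of Algorithm 1. `candidates` are the (requirement, code) pairs from the retriever.
from collections import Counter
import math

def ngram_counts(text, n):
    toks = text.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def rouge_n_score(S, Q, max_n=4):
    # Recall-based ROUGE-n, combined over n = 1..max_n as in Equations (1)-(2).
    logs = []
    for n in range(1, max_n + 1):
        overlap = sum(cnt for gram, cnt in S[n].items() if gram in Q[n])
        total = sum(S[n].values())
        r_n = overlap / total if total else 0.0
        logs.append(math.log(r_n) if r_n > 0 else float("-inf"))
    return math.exp(sum(logs) / max_n)

def select_examples(x_test, candidates, k=3, decay=0.5, max_n=4):
    S = {n: ngram_counts(x_test, n) for n in range(1, max_n + 1)}
    Q = [{n: ngram_counts(req, n) for n in range(1, max_n + 1)} for req, _code in candidates]
    selected, chosen = [], set()
    while len(selected) < min(k, len(candidates)):
        scores = [rouge_n_score(S, Q[i], max_n) if i not in chosen else -1.0
                  for i in range(len(candidates))]
        j = max(range(len(scores)), key=scores.__getitem__)
        chosen.add(j)
        selected.append(candidates[j])
        for n in range(1, max_n + 1):
            # Decay the counts of n-grams already covered by the chosen example (line 15).
            for gram in set(S[n]) & set(Q[j][n]):
                S[n][gram] *= decay
            Q[j][n] = Counter()  # drop the chosen candidate from future rounds (line 13)
    return selected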
3.3 Prompt Construction

The goal of this step is to construct a prompt. As stated in Section 1, our guided code generation expects that LLMs first output an intermediate preliminary and then generate the final code. To achieve this goal, we design a special prompt consisting of triple examples (i.e., <requirement, preliminary, code>).

Specifically, we first use an analyzer to introduce preliminaries {𝑡_𝑖} into the selected examples {𝑥_𝑖, 𝑦_𝑖}, obtaining triple examples {𝑥_𝑖, 𝑡_𝑖, 𝑦_𝑖}. A preliminary is a software artifact for clarifying requirements. Inspired by test-driven development [10], this paper considers test cases as the preliminary by default. We also explore other choices (e.g., APIs, method signatures) in our experiments (Section 5-RQ5). Then, we concatenate these triple examples with the input requirement to construct a prompt.

Figure 5 (a) shows an example of our prompt. The prompt begins with several examples and ends with a new requirement. [requirement], [test case], and [source code] are special tags that mark the different parts of a triple.

We assume that test cases of examples are available. We think this assumption is acceptable, for two reasons. First, there are many public code generation datasets containing test cases, e.g., MBPP [9] (474 samples), APPS [18] (5,000 samples), and CodeContest [27] (13,328 samples). We can extract training data from these datasets and construct a retrieval corpus. Second, test-driven software development is popular in real-world scenarios. We can mine software repositories from open-source communities (e.g., GitHub [4]) and extract code snippets equipped with test cases.

(a) Prompt:
[requirement]
# find the last occurrence of a character
# in a string.
[test case]
("little", 't') ==> 3
("assert", 's') ==> 2
[source code]
def last_occurence_char(string, char):
    . . .
(more examples . . .)
[requirement]
# remove first and last occurrence of a
# given character from the string.

(b) Output of a large language model:
(None, 'y') ==> None
("Python", 'x') ==> 'Python'
("machine", 'e') ==> 'machin'
[source code]
def remove_Occ(s, ch):
    if s == None or ch == None:
        return None
    string = s
    for i in range(len(string)):
        if string[i] == ch:
            (more lines . . .)
    return string

Figure 5: Examples of our prompt and an LLM's output.
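For concreteness, the prompt layout of Figure 5 (a) can be assembled as below. This is a sketch of the tag-based format, not the authors' exact template; the spacing and comment markers are assumptions.

# Sketch of the prompt format in Figure 5 (a).
# `triples` are the <requirement, test cases, code> examples produced by the analyzer.
def build_prompt(triples, new_requirement):
    parts = []
    for req, test_cases, code in triples:
        parts.append("[requirement]\n# " + req)
        parts.append("[test case]\n" + "\n".join(test_cases))
        parts.append("[source code]\n" + code)
    # The prompt ends with the new requirement; the LLM is expected to continue
    # with its own test cases, a [source code] tag, and the code (Figure 5 (b)).
    parts.append("[requirement]\n# " + new_requirement)
    return "\n".join(parts)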
3.4 Code Generation

In this step, we leverage an LLM to generate code based on the prompt. Following previous studies [1, 14, 15, 32], we view the LLM as a black-box generator and use it to complete the prompt. By learning from the examples in the prompt, LLMs first output a preliminary (e.g., test cases) and then generate the code based on the preliminary and the input requirement.

Figure 5 (b) shows an output of an LLM, CodeGeeX [1]. We can see that CodeGeeX first generates some test cases and then implements a Python function. The test cases provide lots of valuable information (e.g., input-output formats, invalid inputs) and guide the subsequent code generation.
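Given the output format in Figure 5 (b), the completion can be split into the preliminary and the code by locating the [source code] tag. The sketch below is an assumption about this post-processing step; the paper does not spell it out.

# Sketch: split an LLM completion into (preliminary, code) using the special tags.
def split_completion(completion):
    # If the model keeps generating further examples, cut at the next [requirement] tag.
    completion = completion.split("[requirement]")[0]
    if "[source code]" in completion:
        preliminary, code = completion.split("[source code]", 1)
    else:
        preliminary, code = "", completion  # fallback: treat the whole output as code
    return preliminary.strip(), code.strip()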

4 STUDY DESIGN

To assess AceCoder, we perform a large-scale study to answer six research questions. In this section, we describe the details of our study, including datasets, evaluation metrics, baselines, and base large language models (LLMs).

4.1 Research Questions

Our study aims to answer the following research questions (RQs).

RQ1: How does AceCoder perform compared to existing prompting techniques? This RQ aims to validate that AceCoder has higher accuracy than existing prompting techniques in code generation. We apply AceCoder and baselines to three LLMs and measure their accuracy on three code generation benchmarks. The evaluation metric is Pass@k.

RQ2: How does AceCoder perform compared to retrieval-based models? AceCoder retrieves similar programs as examples in prompts. Some existing studies [21, 36] also introduce information retrieval to augment code generation. In this RQ, we compare AceCoder to these retrieval-based models. The evaluation metric is Pass@k.

RQ3: Do human developers prefer code generated by AceCoder? The ultimate goal of code generation is to assist human developers in writing code. In this RQ, we hire 10 developers (including industry employees and academic researchers) to manually review the code generated by AceCoder and the baselines. We measure the quality of code in three aspects, including correctness, code smell, and maintainability.

RQ4: What are the contributions of different modules in AceCoder? AceCoder contains three modules, i.e., a retriever, a selector, and an analyzer. This RQ is designed to analyze the contributions of the three modules to the performance. We select a base model, gradually introduce the three modules, and observe the fluctuations in accuracy.

RQ5: What are the better designs for the three modules? This RQ aims to validate the superiority of our designs for the three modules in AceCoder. Specifically, we explore multiple designs for the three modules and compare them to our designs.

4.2 Evaluation Datasets and Metrics

4.2.1 Datasets. We conduct experiments on three public code generation benchmarks, including MBPP in Python, MBJP in Java, and MBJSP in JavaScript. The statistics of the datasets are shown in Table 1. The details of the datasets are described as follows.
• MBPP [9] contains 974 real-world programming problems that are constructed by crowd-sourcing. Each problem contains a natural language requirement, a single Python function, and three test cases.
• MBJP [8] and MBJSP [8] both contain 966 crowd-sourced programming problems in Java and JavaScript, respectively. Each problem consists of a natural language requirement, an individual function, and three test cases.

Table 1: Statistics of the datasets in our experiments.
Statistics                    MBPP      MBJP      MBJSP
Language                      Python    Java      JavaScript
# Train                       384       383       383
# Dev                         90        90        90
# Test                        500       493       493
Avg. tokens in requirement    16.50     16.71     16.53
Avg. tokens in code           92.68     247.79    100.75

4.2.2 Metrics. Following previous code generation studies [1, 14, 15, 32], we employ Pass@k as our evaluation metric. Specifically, we generate k programs for each requirement. A requirement is considered solved if any generated program passes all test cases. We compute the percentage of solved requirements over all requirements as Pass@k. In this paper, k is set to 1, 3, and 5.

We notice that previous studies [17, 46] also use some match-based metrics (e.g., BLEU [35]). These metrics are initially designed for natural language generation and are poor at measuring the functionality of programs [14]. Thus, we omit them in experiments.
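Under this definition, Pass@k reduces to a simple count over execution results, as in the sketch below; the data layout is illustrative.

# Sketch of Pass@k as defined above: k samples per requirement, a requirement is
# solved if any sample passes all of its test cases.
def pass_at_k(results):
    # results: dict mapping a requirement id to a list of k booleans
    # (True if that generated program passed all test cases).
    solved = sum(1 for samples in results.values() if any(samples))
    return 100.0 * solved / len(results)

# Example: two requirements, k = 3 samples each -> Pass@3 = 50.0
print(pass_at_k({"req-1": [False, True, False], "req-2": [False, False, False]}))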
4.3 Comparison Baselines

This paper proposes a new prompting technique for code generation. Thus, we select three existing prompting techniques as baselines.
• Zero-shot prompting [14, 32] directly feeds the input requirement into LLMs. Then, it extracts the code from the LLMs' outputs.
• Few-shot prompting [14] randomly selects several <requirement, code> pairs as examples and constructs a prompt, which is fed into an LLM. Then, it extracts the code from the LLMs' outputs.
• Chain-of-Thought (CoT) prompting [49] is a variant of few-shot prompting. CoT prompting asks LLMs first to generate a series of intermediate natural language reasoning steps and then output the code.

AceCoder retrieves similar programs to assist LLMs in generating code. Some studies also introduce information retrieval to augment code generation. We compare AceCoder to these retrieval-based models.
• REDCODER [36] retrieves similar programs and fine-tunes a pre-trained model, PLBART [7], to generate code based on the requirement and similar programs.
• Jigsaw [21] searches for similar programs from API documentation and inserts them into the prompts.

4.4 Base Large Language Models

We select three open-source LLMs as base models. The details of the base models are as follows.
• CodeGeeX [1] is a multilingual LLM for source code with 13 billion parameters. CodeGeeX is pre-trained with a large corpus covering more than 20 programming languages (e.g., Python, Java, JavaScript). We download the model weights from the official website [2] and run CodeGeeX according to the official instructions.
• CodeGen [32] is a family of LLMs for source code that are pre-trained with extensive natural language and code data. We select CodeGen-Multi-6.1B (CodeGen-6B) as a base model.
• InCoder [15] is a multilingual LLM for code generation. It is pre-trained with 216 GB of code data. We use a version with 6.7 billion parameters (InCoder-6B) as a base model.

The reason why we do not choose the GPT series of models (e.g., ChatGPT [33]) as base models is that they are closed-source. Although we can access GPT models through OpenAI's APIs, these models are likely to be updated dynamically, affecting the fairness and reproducibility of experiments. Thus, we leave them to future work.

4.5 Implementation Details

Example Retrieval. For each dataset, the retrieval corpus is its training data. We exclude the ground truths from the outputs of our retriever. We first retrieve the top-20 similar programs and then use the selector to select three examples. To ensure fairness, the number of examples in AceCoder and the baselines is the same.

Prompt Construction. In the experimental datasets, the retrieval corpus (i.e., the training data) has been equipped with test cases by the data collectors [8, 9]. Thus, the analyzer utilizes pre-defined rules to extract test cases and transform retrieved programs into <requirement, test cases, code> triples.

Table 2: The results of AceCoder and prompting baselines on three datasets. The values in parentheses are the relative improvements compared to the SOTA baseline, few-shot prompting. Columns per benchmark: Pass@1, Pass@3, Pass@5.

CodeGeeX-13B:
  Zero-shot prompting   MBPP: 5.20, 13.80, 19.40    MBJP: 4.46, 11.97, 18.26    MBJSP: 0.20, 0.20, 0.41
  CoT prompting         MBPP: 12.60, 23.40, 30.20   MBJP: 14.40, 28.19, 33.67   MBJSP: 11.35, 21.10, 25.96
  Few-shot prompting    MBPP: 20.40, 30.60, 36.00   MBJP: 16.63, 26.17, 34.48   MBJSP: 11.16, 19.88, 25.56
  AceCoder              MBPP: 26.74 (↑31.1%), 36.43 (↑19%), 41.13 (↑14.2%)   MBJP: 28.38 (↑70.7%), 36.79 (↑40.6%), 41.54 (↑20.5%)   MBJSP: 21.03 (↑88.4%), 31.44 (↑58.2%), 36.04 (↑41%)

CodeGen-6B:
  Zero-shot prompting   MBPP: 10.40, 19.40, 24.40   MBJP: 14.81, 25.76, 31.44   MBJSP: 8.72, 19.67, 22.92
  CoT prompting         MBPP: 13.00, 21.00, 26.00   MBJP: 13.59, 25.35, 31.24   MBJSP: 11.56, 20.08, 24.54
  Few-shot prompting    MBPP: 14.60, 24.00, 30.20   MBJP: 18.25, 30.02, 34.68   MBJSP: 9.94, 19.88, 23.12
  AceCoder              MBPP: 22.83 (↑56.4%), 34.58 (↑44.1%), 40.16 (↑33%)   MBJP: 22.45 (↑23%), 34.27 (↑14.2%), 40.96 (↑18.1%)   MBJSP: 16.45 (↑65.5%), 27.31 (↑37.4%), 32.16 (↑39.1%)

InCoder-6B:
  Zero-shot prompting   MBPP: 4.20, 11.40, 16.20    MBJP: 2.23, 5.88, 9.13      MBJSP: 3.65, 5.88, 8.11
  CoT prompting         MBPP: 3.99, 10.65, 15.31    MBJP: 1.83, 4.46, 7.10      MBJSP: 1.22, 2.03, 4.67
  Few-shot prompting    MBPP: 12.80, 22.80, 28.20   MBJP: 10.95, 23.53, 26.17   MBJSP: 12.78, 22.52, 27.79
  AceCoder              MBPP: 20.16 (↑57.5%), 31.44 (↑37.9%), 34.10 (↑20.92%)   MBJP: 16.37 (↑49.5%), 29.89 (↑27%), 34.74 (↑32.7%)   MBJSP: 15.97 (↑25%), 27.13 (↑20.5%), 30.65 (↑10.3%)

Code Generation. Following previous studies [14, 15, 32], we use nucleus sampling [19] to decode programs from LLMs. The temperature is 0.8 and the top-𝑝 is 0.95. The maximum generated lengths are 400, 500, and 500 tokens, respectively. The sampling settings of the baselines are the same as those of AceCoder.
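As a concrete illustration of these decoding settings, the sketch below samples k completions with Hugging Face transformers using temperature 0.8 and top-p 0.95. The CodeGen-6B checkpoint name and the 400-token budget are placeholders taken from the settings above; the authors' actual scripts may differ.

# Sketch of nucleus-sampling decoding (temperature 0.8, top-p 0.95).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Salesforce/codegen-6B-multi"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

def sample_programs(prompt, k=5, max_new_tokens=400):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
        max_new_tokens=max_new_tokens,
        num_return_sequences=k,
        pad_token_id=tokenizer.eos_token_id,
    )
    prompt_len = inputs["input_ids"].shape[1]
    return [tokenizer.decode(seq[prompt_len:], skip_special_tokens=True) for seq in outputs]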
5 RESULTS AND ANALYSES

In the first research question, we evaluate the performance of AceCoder with respect to existing prompting techniques.

RQ1: How does AceCoder perform compared to existing prompting techniques?

Setup. We apply AceCoder and three prompting baselines to three base models (Section 4.4). Then, we use Pass@k to measure their performance on three benchmarks (Section 4.2).

Results. The results on the three benchmarks are shown in Table 2. The values in parentheses are relative improvements compared to the SOTA baseline, few-shot prompting.

Analyses. (1) AceCoder performs better than the baselines on three benchmarks. Compared to the SOTA baseline, few-shot prompting, in terms of Pass@1, AceCoder outperforms it by up to 56.4% in MBPP, 70.7% in MBJP, and 88.4% in MBJSP. Pass@1 is a very strict metric and is difficult to improve. The significant improvements prove the superiority of AceCoder in code generation. We attribute the improvements to our novel techniques, i.e., example retrieval and guided code generation. The retrieved examples contain many relevant code elements teaching LLMs "how to write". Guided code generation asks LLMs to analyze requirements, which tells LLMs "what to write". (2) AceCoder is effective in LLMs with different sizes and different programming languages. Compared to few-shot prompting, in terms of Pass@1, AceCoder improves CodeGeeX-13B by up to 88.4%, CodeGen-6B by up to 65.5%, and InCoder-6B by up to 57.5%. In particular, we find that an LLM with AceCoder can even outperform larger LLMs. For example, on MBJSP, InCoder-6B with AceCoder outperforms CodeGeeX-13B with few-shot prompting. It proves the potential of AceCoder. Besides, AceCoder is language-agnostic and is effective in multilingual code generation (i.e., Python, Java, and JavaScript).

Answer to RQ1: AceCoder outperforms existing prompting techniques on three benchmarks. In terms of Pass@1, AceCoder outperforms the SOTA baseline by up to 56.4% in MBPP, 70.7% in MBJP, and 88.4% in MBJSP. Besides, AceCoder is effective in LLMs with different sizes. It improves CodeGeeX-13B by up to 88.4%, CodeGen-6B by up to 65.5%, and InCoder-6B by up to 57.5%. The significant improvements prove the effectiveness of AceCoder in code generation.

RQ2: How does AceCoder perform compared to retrieval-based models?

Setup. In this RQ, we compare AceCoder to two retrieval-based baselines, REDCODER [36] and Jigsaw [21]. The baselines and AceCoder use the same retrieval corpus. Because REDCODER requires fine-tuning, we follow the official instructions and use the training data to train REDCODER.

Results. The results on the three benchmarks are shown in Table 3. The values in parentheses are relative improvements compared to the SOTA baseline, Jigsaw.

Analyses. AceCoder outperforms the retrieval-based baselines on three benchmarks. Compared to the SOTA baseline, Jigsaw, in terms of Pass@1, AceCoder outperforms it by up to 13.1% in MBPP, 23.44% in MBJP, and 15.8% in MBJSP. Jigsaw also retrieves similar programs for making prompts. The improvements show the effectiveness of our selector and analyzer. The selector filters out redundant similar programs and further improves the quality of examples. The analyzer constrains LLMs to first analyze requirements and then generate code. Besides, we notice that REDCODER has poor accuracy on the three benchmarks. This is because the training data is limited, and fine-tuning easily leads to overfitting. It validates our motivation that introducing similar programs by prompting is a more suitable approach for LLMs.

Answer to RQ2: AceCoder outperforms retrieval-based baselines. Specifically, it outperforms the SOTA baseline, Jigsaw, by up to 13.1% in MBPP, 23.44% in MBJP, and 15.8% in MBJSP.

RQ3: Do human developers prefer code generated by AceCoder?

Setup. The ultimate goal of code generation is to assist human developers in writing code. Thus, we conduct a human evaluation to measure programs generated by AceCoder and the baselines. We follow the settings of human evaluation in previous studies [16, 25].

Table 3: The comparison of retrieval-based baselines and AceCoder. The values in parentheses are relative improvements compared to the SOTA baseline, Jigsaw. Columns per benchmark: Pass@1, Pass@3, Pass@5.

  REDCODER   MBPP: 3.37, 6.21, 9.74      MBJP: 4.46, 7.51, 9.94      MBJSP: 4.87, 10.34, 12.78
  Jigsaw     MBPP: 23.65, 33.97, 37.78   MBJP: 22.99, 33.26, 36.95   MBJSP: 18.16, 28.79, 34.08
  AceCoder   MBPP: 26.74 (↑13.1%), 36.43 (↑7.2%), 41.13 (↑8.9%)   MBJP: 28.38 (↑23.44%), 36.79 (↑10.61%), 41.54 (↑12.42%)   MBJSP: 21.03 (↑15.8%), 31.44 (↑9.2%), 36.04 (↑5.8%)

Table 4: The results of the ablation study. The values in parentheses are relative improvements compared to few-shot prompting. ✓ / ✗ indicate whether the retriever, selector, and analyzer are added. Columns per benchmark: Pass@1, Pass@3, Pass@5.

  Retriever ✗  Selector ✗  Analyzer ✗   MBPP: 20.40, 30.60, 36.00   MBJP: 16.63, 26.17, 34.48   MBJSP: 11.16, 19.88, 25.56
  Retriever ✓  Selector ✗  Analyzer ✗   MBPP: 24.00 (↑17.6%), 34.60 (↑13.1%), 38.20 (↑6.1%)   MBJP: 23.35 (↑40.4%), 33.67 (↑28.7%), 37.22 (↑7.9%)   MBJSP: 18.66 (↑67.2%), 29.18 (↑46.8%), 34.89 (↑36.5%)
  Retriever ✓  Selector ✓  Analyzer ✗   MBPP: 24.89 (↑22%), 35.02 (↑14.4%), 39.14 (↑8.7%)   MBJP: 25.03 (↑50.5%), 34.47 (↑31.7%), 39.24 (↑13.8%)   MBJSP: 19.73 (↑76.8%), 30.16 (↑51.7%), 35.34 (↑38.3%)
  Retriever ✓  Selector ✓  Analyzer ✓   MBPP: 26.74 (↑31.1%), 36.43 (↑19%), 41.13 (↑14.2%)   MBJP: 28.38 (↑70.7%), 36.79 (↑40.6%), 41.54 (↑20.5%)   MBJSP: 21.03 (↑88.4%), 31.44 (↑58.2%), 36.04 (↑41%)

We have carefully checked the evaluation settings and think our settings are reliable. We manually evaluate programs in three aspects:
• Correctness (whether the program satisfies the given requirement). 0 points: the program is totally inconsistent with the requirement. 1 point: the program is implemented, but misses some details. 2 points: the program is correctly implemented.
• Code Smell (whether the program contains bad code smells). 0 points: there are better solutions in terms of performance, or there is serious code smell. 1 point: some details are not in place; there is code smell of low severity. 2 points: no obviously better code in terms of performance exists; if applicable, resources are released accordingly; no obvious code smell.
• Maintainability (whether the implementation is standardized and has good readability). 0 points: the program does not follow a consistent specification, or there are many meaningless variable names, or there is repetition and redundant code. 1 point: the program implementation meets certain specifications, but some variable names could be further refined. 2 points: the program implementation is relatively standardized, the variable naming is basically semantically straightforward, and the readability is good.

We explain the above aspects to the evaluators through examples. After discussing with the evaluators, we set the score of each aspect to an integer ranging from 0 to 2 (from bad to good). For AceCoder and the baselines, we select a fixed base model (i.e., CodeGen-2B) and collect 200 generated programs per approach. Finally, we obtain 1,000 programs for evaluation. We invite 10 developers with 3-5 years of development experience to evaluate the generated programs in the form of a questionnaire. The 1,000 code snippets are divided into 5 groups, with each questionnaire containing one group. The programs are randomly shuffled and anonymously reviewed by evaluators. Each group is evaluated by two evaluators, and the final score is the average of the two evaluators' scores. Evaluators are allowed to search the Internet for unfamiliar concepts.

Results. The results of the human evaluation are shown in Table 5. The values in parentheses are the relative improvements compared to the SOTA baseline, few-shot prompting.

Table 5: The results of human evaluation. The values in parentheses are the relative improvements compared to the SOTA baseline, few-shot prompting.
  Approach              Correctness        Code smell         Maintainability
  Zero-shot prompting   0.3167             1.1033             1.2749
  CoT prompting         0.6671             1.1405             1.4479
  Few-shot prompting    0.9769             1.2148             1.5420
  AceCoder              1.5802 (↑ 61.8%)   1.6241 (↑ 33.7%)   1.7544 (↑ 13.8%)

Analyses. AceCoder is better than all baselines in all three aspects. Specifically, AceCoder outperforms the SOTA baseline, few-shot prompting, by 61.8% in correctness, 33.7% in code smell, and 13.8% in maintainability. The improvements show that AceCoder has better usability and is promising in practical applications. Besides, all the p-values are substantially smaller than 0.05, which shows the improvements are statistically significant.

Answer to RQ3: Human evaluation shows that human developers prefer programs generated by AceCoder. It outperforms the SOTA baseline by 61.8% in correctness, 33.7% in code smell, and 13.8% in maintainability.

RQ4: What are the contributions of different modules in AceCoder?

Setup. AceCoder contains three modules, i.e., a retriever, a selector, and an analyzer. This RQ is designed to analyze the contributions of the three modules to the performance. We select CodeGeeX as the base model and conduct an ablation study by gradually adding the three modules.

Results. The results are shown in Table 4. ✓ and ✗ represent adding and removing the corresponding modules, respectively. Without the three modules, the base model uses few-shot prompting to generate code. After adding a retriever, the base model selects the top-𝑘 similar programs as examples and directly generates code. After adding a selector, the base model selects 𝑘 examples from similar programs and then generates code. After further introducing an analyzer, the base model uses AceCoder to generate code.

Analyses. All modules are necessary for AceCoder to perform the best. After adding a retriever, the performance of the base model is improved. In terms of Pass@1, the retriever brings a 17.6% improvement in MBPP, a 40.4% improvement in MBJP, and a 67.2% improvement in MBJSP. It validates our motivation that retrieved programs contain lots of useful information that benefits code generation. After adding a selector, the performance of the base model is further improved. It shows that our selector can effectively filter out redundant programs in retrieved results and improve the quality of examples.

After further introducing an analyzer, the base model achieves better results. In terms of Pass@1, the base model is improved by 31.1% in MBPP, 70.7% in MBJP, and 88.4% in MBJSP. It proves the effectiveness of guided code generation in analyzing requirements.

Answer to RQ4: The three modules are essential for the performance of AceCoder. The performance of CodeGeeX on three benchmarks is substantially improved by gradually adding the three modules.

RQ5: What are the better designs for the three modules in AceCoder?

Setup. As stated in Section 3.1, AceCoder contains three modules, i.e., a retriever, a selector, and an analyzer. In this RQ, we explore different designs for the three modules and validate the superiority of our designs. We select CodeGeeX as the base model. The evaluation settings are as follows.

(1) A retriever takes the input requirement as a query and searches for similar programs from a retrieval corpus. We design two choices for the retriever:
• Dense retriever. It uses a neural encoder to convert requirements into vector representations. Then, it retrieves similar programs based on the similarity of the vector representations. In experiments, we use an off-the-shelf natural language representation model [39] as the encoder.
• Sparse retriever (AceCoder). As stated in Section 3.2, it uses the BM25 score as the retrieval metric. The BM25 score measures the lexical-level similarity of two requirements.

(2) A selector aims to score similar programs and filter out redundant programs. For the score function in the selector (line 8 of Algorithm 1), we design two choices:
• BLEU [35]. It extracts overlapping 𝑛-grams between the input requirement and the similar requirement. Then, it computes the precision of 𝑛-grams in the similar requirement.
• ROUGE-N [28] (AceCoder). It extracts overlapping 𝑛-grams between the input requirement and the similar requirement. Then, it computes the recall of 𝑛-grams in the input requirement.

(3) An analyzer introduces preliminaries into examples. A preliminary is a special software artifact that benefits requirement understanding. For the preliminary, we design three choices:
• API sequence. APIs are important elements in code and reflect the functionality of the code. Pre-designing APIs helps LLMs think about how to solve requirements. We use a program analysis tool [6] to extract APIs from examples and view the API sequence as a preliminary (e.g., open, numpy.array, write).
• Method signature. It contains input-output parameters and their types, which clearly indicate the inputs and outputs of requirements. Thus, we consider the method signature as a preliminary (e.g., def floor_Min(A: int, B: int, N: int)).
• Test cases (AceCoder). Test cases exactly define the requirement, including the input-output format, edge cases, and functionality. We consider several test cases as the preliminary, such as ("Python", "o") -> 1; ("little", "t") -> 2.

Results and Analyses. The results are shown in Table 6. "w/" is the abbreviation of "with". (1) A dense retriever is comparable to our retriever but has lower efficiency. In Table 6, compared to AceCoder, AceCoder with the dense retriever has a slight drop in performance. It indicates that code generation prefers lexically similar programs, which contain lots of reusable content. Similar findings can be found in code completion work [29]. Besides, the dense retriever has higher complexity and is hard to apply to a large-scale retrieval corpus. (2) The BLEU selector prefers shorter examples and is suboptimal. Compared to AceCoder, AceCoder with the BLEU selector has an obvious decrease in accuracy. We inspect some failed samples and find that the BLEU selector prefers shorter examples. This is because BLEU is the precision of 𝑛-grams in the similar requirement: the shorter the similar requirement, the higher the BLEU. This leads the selector to select short programs as examples and ignore some informative but long examples. (3) Test cases are more suitable as the preliminary than APIs and method signatures. We carefully inspected some cases. First, many requirements in the benchmarks do not require APIs or only involve a few trivial APIs (e.g., range, split, and len). As a result, generated APIs bring limited benefits to code generation. Second, by generating method signatures, LLMs are asked to think about the input-output format, which benefits code generation. But method signatures miss other necessary details, such as edge cases. AceCoder considers test cases as the preliminary. Test cases are common in code files. Thus, it is feasible for LLMs trained with extensive code data to generate plausible test cases. With the guidance of test cases, LLMs can comprehensively understand requirements and determine related details (e.g., input-output formats, boundary inputs, outliers), thus generating more correct programs.

Answer to RQ5: We explore four other designs for AceCoder and compare them to our designs. Results on three benchmarks show the superiority of our designs.

6 DISCUSSION

6.1 AceCoder vs. CoT prompting

Our guided code generation is similar to Chain-of-Thought (CoT) prompting. Both approaches ask LLMs to first generate an intermediate result and then output the final code. The intermediate result in CoT prompting is a series of natural language steps describing how to write code step by step. In contrast, AceCoder leverages software artifacts (e.g., test cases) as the intermediate result.

We argue that our guided code generation is superior to CoT in code generation. Table 2 shows the comparison results between AceCoder and CoT prompting. CoT prompting achieves slight improvements over few-shot prompting and is even worse than zero-shot prompting in some cases. We inspect some failed samples and summarize the main reason. We find that CoTs describe how to write code in a series of steps almost at the same level as code. The LLMs for source code are mainly pre-trained with code data and are relatively weak in natural language generation. The generated CoTs often contain ambiguities or errors and negatively affect the subsequent code generation. Similar findings can be found in the original paper of CoT prompting [49]. Compared to CoT prompting, AceCoder uses a software artifact (i.e., test cases) as the intermediate preliminary. Compared to natural language, test cases are more suitable for clarifying requirements and contain fewer ambiguities. Besides, test cases are common in real-world code files, and LLMs have the ability to generate plausible test cases. Thus, AceCoder is different from CoT prompting and is more promising than CoT prompting in code generation.

Table 6: The performance of AceCoder with different designs. “w/” is the abbreviation of with.
MBPP MBJP MBJSP
Approach
Pass@1 Pass@3 Pass@5 Pass@1 Pass@3 Pass@5 Pass@1 Pass@3 Pass@5
AceCoder 26.74 36.43 41.13 28.38 36.79 41.54 21.03 31.44 36.04
w/ Dense retriever 26.63 36.42 41.10 28.16 36.55 41.32 20.88 31.27 35.94
w/ BLEU selector 25.61 35.71 40.74 27.86 35.91 40.77 20.15 30.42 35.47
w/ API analyzer 25.10 35.24 40.38 26.44 35.16 40.12 19.86 30.23 35.41
w/ signature analyzer 26.14 35.96 40.89 27.35 36.11 40.98 20.58 30.89 35.86

different from CoT prompting and is more promising than CoT 7 RELATED WORK
prompting in code generation. Large Language Models (LLMs) for Code Generation are large-
scale neural networks pre-trained on a large corpus of natural lan-
guage and programming language. With the development of LLM
6.2 AceCoder vs. Rank Techniques research, current Code LLMs can be divided into two categories:
Some recent studies [13, 20] propose rank techniques to improve standard language models and instruction-tuned models.
the performance of LLMs on code generation. Given a requirement, Standard Language models are pre-trained on the raw corpus
they first sample many programs from LLMs and then use test cases with the next-token prediction. They can continually complete
or neural networks to rerank sampled programs. the given context, which makes them useful in tasks like code
In this paper, we do not directly compare our approach to rank completion and code generation. With the success of GPT series
techniques. The reason is that AceCoder and rank techniques [11, 37, 38] in NLP, OpenAI adapts similar idea into the domain of
have different focuses and they are complementary. Our work is source code, and fine-tunes GPT models on code to produce closed-
a new prompting technique that improves the accuracy of LLMs source Codex [14]. There are multiple open-source attempts to
in code generation. Rank techniques do not care about LLMs and replicate its success, e.g., CodeParrot [3], CodeGen [32], CodeGeeX
aim to select the best one from LLMs’ multiple outputs. In practice, [1], InCoder [15], StarCoder [26] and CodeT5+ [45].
users can use AceCoder to generate many programs and then Instruction-tuned models are models fine-tuned using instruc-
use rank techniques to pick a final output. Thus, we omit them in tion tuning [48]. Instruction tuning helps models to follow users’
experiments. instructions. OpenAI’s ChatGPT [33] is trained by Reinforcement
Learning with Human Feedback (RLHF) [34], making it capable
of both natural language tasks and programming tasks. Due to its
6.3 Threats to Validity

There are two main threats to the validity of our work.

The generalizability of experimental results. To mitigate this threat, we carefully select the experimental datasets, metrics, and baselines. Following previous studies [8, 13], we pick three representative code generation benchmarks. They are collected from real-world software projects and cover three popular programming languages (i.e., Python, Java, and JavaScript). For the evaluation metric, we select a widely used metric, Pass@k (k = 1, 3, 5). Pass@k is an execution-based metric that uses test cases to check the correctness of generated programs. We select existing prompting techniques and retrieval-based models as comparison baselines. We pick three representative LLMs as base models [1, 14, 15, 32], which scale from 6B to 13B parameters. We apply AceCoder and the baselines to the base models and evaluate their performance on the three datasets using Pass@k. To ensure fairness, we run each approach three times and report the average results.
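As background on the metric, Pass@k is commonly computed with the unbiased estimator introduced in [14]: sample n >= k programs per problem, count the c correct ones, and estimate 1 - C(n-c, k) / C(n, k). A minimal sketch, assuming the per-problem counts n and c are already available:

# Unbiased Pass@k estimator from [14] (sketch; assumes n and c are known per problem).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n: number of generated samples, c: number of correct samples, k: evaluation budget."""
    if n - c < k:  # every size-k subset contains at least one correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Benchmark-level Pass@k is the mean over problems, e.g.:
# score = sum(pass_at_k(n, c, k) for n, c in per_problem_counts) / len(per_problem_counts)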
The impact of retrieved programs. The retrieved programs are important elements in AceCoder. Intuitively, when the retrieved programs are less relevant to the input requirement, the performance of our approach may suffer. To address this threat, we offer two observations. (1) A large-scale study of 13.2 million real code files found that the proportion of reused code is up to 80% [31]. Thus, we believe it is quite possible to retrieve similar programs in real development scenarios. (2) Even if the retrieved programs are less relevant to the input requirement, AceCoder degrades to few-shot prompting at worst. In most cases, AceCoder is superior to few-shot prompting.
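To illustrate what retrieving similar programs can look like in the simplest case, the sketch below ranks a hypothetical codebase of (requirement, code) pairs by lexical overlap with the new requirement. It is an illustration under our own simplifying assumptions, not a description of AceCoder's actual retrieval component.

# Illustrative requirement-based retrieval by token overlap (hypothetical `codebase`).
def tokenize(text: str) -> set:
    return set(text.lower().split())

def retrieve_examples(requirement: str, codebase, top_k: int = 2):
    """Return the top_k (requirement, code) pairs whose requirements best match the query."""
    query = tokenize(requirement)
    def score(entry):
        req_tokens = tokenize(entry[0])
        return len(query & req_tokens) / (len(query | req_tokens) or 1)  # Jaccard similarity
    return sorted(codebase, key=score, reverse=True)[:top_k]

# Usage sketch: examples = retrieve_examples("sort a list of tuples by the second item", codebase)
# The retrieved <requirement, code> pairs are then placed into the prompt as examples.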
Existing LLMs can be divided into two categories: standard language models and instruction-tuned models. Standard language models are pre-trained on raw corpora with the next-token prediction objective. They can continually complete a given context, which makes them useful in tasks like code completion and code generation. With the success of the GPT series [11, 37, 38] in NLP, OpenAI adapts a similar idea to the domain of source code and fine-tunes GPT models on code to produce the closed-source Codex [14]. There are multiple open-source attempts to replicate its success, e.g., CodeParrot [3], CodeGen [32], CodeGeeX [1], InCoder [15], StarCoder [26], and CodeT5+ [45].

Instruction-tuned models are fine-tuned with instruction tuning [48], which helps models follow users' instructions. OpenAI's ChatGPT [33] is trained with Reinforcement Learning from Human Feedback (RLHF) [34], making it capable of both natural language tasks and programming tasks. Because ChatGPT is highly influential but closed-source, many researchers try to create open-source alternatives using instruction tuning and its variants. Alpaca [41] is LLaMA [42] fine-tuned using self-instruct [44] and ChatGPT feedback. Code Alpaca [12] is LLaMA fine-tuned using self-instruct and ChatGPT feedback with more programming-focused instructions. WizardCoder [30] is StarCoder [26] fine-tuned using Evol-Instruct [50] and ChatGPT feedback with Code Alpaca's dataset as the seed dataset. InstructCodeT5+ [45] is CodeT5+ [45] fine-tuned on Code Alpaca's dataset.

Prompting Techniques. LLMs are often too large to fine-tune, so researchers need new ways to adapt them to downstream tasks. Prompting techniques are a popular approach that leverages LLMs to generate code by feeding them a specially constructed prompt. Early work proposed zero-shot prompting and few-shot prompting. Zero-shot prompting concatenates a task instruction (e.g., please generate a program based on the requirement) and a requirement to form the prompt. Building on zero-shot prompting, few-shot prompting further adds several ⟨requirement, code⟩ pairs to the prompt, so that LLMs can learn code generation from the given examples. Chain-of-Thought (CoT) prompting [49] is a recently proposed prompting technique. CoT asks LLMs first to generate CoTs (i.e., intermediate natural language reasoning steps) and then output the final code. It allows LLMs to first design a solving process that leads to the code. CoT has achieved SOTA results in natural language generation and has sparked lots of follow-up research, such as self-consistency prompting [43] and least-to-most prompting [52]. However, these prompting techniques are designed for natural language generation and bring only slight improvements in code generation.
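To ground these terms, the snippet below sketches how zero-shot, few-shot, and CoT-style prompts could be assembled as plain strings before being sent to an LLM. The instruction wording and example formatting are illustrative assumptions, not prompts taken from any of the cited papers.

# Illustrative prompt builders for the prompting styles discussed above.
def zero_shot_prompt(requirement: str) -> str:
    return f"Please generate a program based on the requirement.\nRequirement: {requirement}\nCode:"

def few_shot_prompt(requirement: str, examples) -> str:
    """examples: list of (requirement, code) pairs shown before the new requirement."""
    shots = "\n\n".join(f"Requirement: {req}\nCode:\n{code}" for req, code in examples)
    return f"{shots}\n\nRequirement: {requirement}\nCode:"

def cot_prompt(requirement: str) -> str:
    # CoT-style: ask for intermediate reasoning steps before the final code.
    return (f"Requirement: {requirement}\n"
            "Let's think step by step: write the solving process first, then output the final code.")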
8 CONCLUSION AND FUTURE WORK

We propose a new prompting technique named AceCoder to improve the performance of LLMs on code generation. AceCoder designs two novel techniques (i.e., guided code generation and example retrieval) to help LLMs understand requirements and implement programs. Guided code generation asks LLMs to output an intermediate preliminary (e.g., test cases) before generating programs. The preliminary helps LLMs understand requirements and guides the subsequent code generation. Example retrieval selects similar programs as examples, which provide many reusable elements for program implementation. We apply AceCoder to three LLMs and conduct experiments on three benchmarks. Results show that AceCoder significantly outperforms the SOTA baselines.

In the future, we will explore how to improve the usability of LLMs in code generation, for example, how to teach LLMs to use unseen frameworks without re-training.
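As a rough, non-authoritative illustration of how the two mechanisms could be wired together, the sketch below first asks a model for a preliminary (test cases) and then builds the final prompt from retrieved examples plus that preliminary. The helper llm_complete is a hypothetical placeholder for an LLM call, retrieve_examples reuses the retrieval sketch from Section 6.3, and the prompt wording is not AceCoder's actual template.

# Rough sketch of guided code generation combined with example retrieval (hypothetical helpers).
def guided_generation(requirement: str, codebase, llm_complete):
    """llm_complete(prompt) -> str is assumed to call an LLM and return its completion."""
    # Step 1: ask for a preliminary (e.g., test cases) that clarifies the requirement.
    preliminary = llm_complete(
        f"Requirement: {requirement}\nWrite several test cases for this requirement:"
    )
    # Step 2: retrieve similar <requirement, code> pairs as in-context examples.
    examples = retrieve_examples(requirement, codebase, top_k=2)
    shots = "\n\n".join(f"Requirement: {req}\nCode:\n{code}" for req, code in examples)
    # Step 3: generate the final program conditioned on the examples and the preliminary.
    prompt = (f"{shots}\n\nRequirement: {requirement}\n"
              f"Test cases:\n{preliminary}\nCode:")
    return llm_complete(prompt)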
REFERENCES
[1] 2022. CodeGeeX. https://models.aminer.cn/codegeex/zh-CN.
[2] 2022. CodeGeeX. https://models.aminer.cn/codegeex/blog/index.html.
[3] 2022. CodeParrot. https://huggingface.co/codeparrot/codeparrot.
[4] 2022. GitHub. https://github.com/.
[5] 2022. Lucene. https://lucene.apache.org/.
[6] 2022. tree-sitter. https://tree-sitter.github.io/tree-sitter/.
[7] Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Unified Pre-training for Program Understanding and Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Online, 2655–2668. https://doi.org/10.18653/v1/2021.naacl-main.211
[8] Ben Athiwaratkun, Sanjay Krishna Gouda, Zijian Wang, Xiaopeng Li, Yuchen Tian, Ming Tan, Wasi Uddin Ahmad, Shiqi Wang, Qing Sun, Mingyue Shang, et al. 2022. Multi-lingual Evaluation of Code Generation Models. arXiv preprint arXiv:2210.14868 (2022).
[9] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. 2021. Program synthesis with large language models. arXiv preprint arXiv:2108.07732 (2021).
[10] Kent Beck. 2003. Test-driven development: by example. Addison-Wesley Professional.
[11] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
[12] Sahil Chaudhary. 2023. Code Alpaca: An Instruction-following LLaMA model for code generation. https://github.com/sahil280114/codealpaca.
[13] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. 2022. CodeT: Code generation with generated tests. arXiv preprint arXiv:2207.10397 (2022).
[14] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 (2021).
[15] Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih, Luke Zettlemoyer, and Mike Lewis. 2022. InCoder: A generative model for code infilling and synthesis. arXiv preprint arXiv:2204.05999 (2022).
[16] Yiyang Hao, Ge Li, Yongqiang Liu, Xiaowei Miao, He Zong, Siyuan Jiang, Yang Liu, and He Wei. 2022. AixBench: A Code Generation Benchmark Dataset. arXiv preprint arXiv:2206.13179 (2022).
[17] Tatsunori B Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S Liang. 2018. A retrieve-and-edit framework for predicting structured outputs. Advances in Neural Information Processing Systems 31 (2018).
[18] Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring Coding Challenge Competence With APPS. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
[19] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The Curious Case of Neural Text Degeneration. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net. https://openreview.net/forum?id=rygGQyrFvH
[20] Jeevana Priya Inala, Chenglong Wang, Mei Yang, Andrés Codas, Mark Encarnación, Shuvendu K. Lahiri, Madanlal Musuvathi, and Jianfeng Gao. 2022. Fault-Aware Neural Code Rankers. In NeurIPS. http://papers.nips.cc/paper_files/paper/2022/hash/5762c579d09811b7639be2389b3d07be-Abstract-Conference.html
[21] Naman Jain, Skanda Vaidyanath, Arun Iyer, Nagarajan Natarajan, Suresh Parthasarathy, Sriram Rajamani, and Rahul Sharma. 2022. Jigsaw: Large language models meet program synthesis. In Proceedings of the 44th International Conference on Software Engineering. 1219–1231.
[22] Jia Li, Ge Li, Yongmin Li, and Zhi Jin. 2023. Enabling Programming Thinking in Large Language Models Toward Code Generation. CoRR abs/2305.06599 (2023). https://doi.org/10.48550/arXiv.2305.06599
[23] Jia Li, Ge Li, Zhuo Li, Zhi Jin, Xing Hu, Kechi Zhang, and Zhiyi Fu. 2023. CodeEditor: Learning to Edit Source Code with Pre-Trained Models. ACM Trans. Softw. Eng. Methodol. (May 2023). https://doi.org/10.1145/3597207 Just Accepted.
[24] Jia Li, Yongmin Li, Ge Li, Xing Hu, Xin Xia, and Zhi Jin. 2021. EditSum: A retrieve-and-edit framework for source code summarization. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 155–166.
[25] Jia Li, Yongmin Li, Ge Li, Zhi Jin, Yiyang Hao, and Xing Hu. 2023. SkCoder: A Sketch-based Approach for Automatic Code Generation. In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023. IEEE, 2124–2135. https://doi.org/10.1109/ICSE48619.2023.00179
[26] Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, et al. 2023. StarCoder: may the source be with you! arXiv preprint arXiv:2305.06161 (2023).
[27] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. 2022. Competition-level code generation with AlphaCode. Science 378, 6624 (2022), 1092–1097.
[28] Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, Barcelona, Spain. 74–81.
[29] Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svyatkovskiy. 2022. ReACC: A Retrieval-Augmented Code Completion Framework. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (Eds.). Association for Computational Linguistics, 6227–6240. https://doi.org/10.18653/v1/2022.acl-long.431
[30] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv preprint arXiv:2306.08568 (2023).
[31] Audris Mockus. 2007. Large-scale code reuse in open source software. In First International Workshop on Emerging Trends in FLOSS Research and Development (FLOSS'07: ICSE Workshops 2007). IEEE, 7–7.
[32] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. 2022. A conversational paradigm for program synthesis. arXiv preprint arXiv:2203.13474 (2022).
[33] OpenAI. 2022. ChatGPT. https://openai.com/blog/chatgpt.
[34] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744.
[35] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. 311–318.
[36] Md. Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Retrieval Augmented Code Generation and Summarization. In Findings of the Association for Computational Linguistics: EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 16-20 November, 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Eds.). Association for Computational Linguistics, 2719–2734. https://doi.org/10.18653/v1/2021.findings-emnlp.232
[37] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
[38] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
[39] Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (Eds.). Association for Computational Linguistics, 3980–3990. https://doi.org/10.18653/v1/D19-1410
[40] Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389.
[41] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
[42] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971 (2023).
[43] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=1PL1NIMMrw
[44] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 13484–13508. https://aclanthology.org/2023.acl-long.754
[45] Yue Wang, Hung Le, Akhilesh Deepak Gotmare, Nghi DQ Bui, Junnan Li, and Steven CH Hoi. 2023. CodeT5+: Open code large language models for code understanding and generation. arXiv preprint arXiv:2305.07922 (2023).
[46] Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 8696–8708. https://doi.org/10.18653/v1/2021.emnlp-main.685
[47] Bolin Wei, Yongmin Li, Ge Li, Xin Xia, and Zhi Jin. 2020. Retrieve and refine: exemplar-based neural comment generation. In 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 349–360.
[48] Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned Language Models are Zero-Shot Learners. In International Conference on Learning Representations. https://openreview.net/forum?id=gEZrGCozdqR
[49] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems.
[50] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244 (2023).
[51] Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning. PMLR, 12697–12706.
[52] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. 2023. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net. https://openreview.net/pdf?id=WZH7099tgfM