DOCPROMPTING: GENERATING CODE BY RETRIEVING THE DOCS
ABSTRACT
1 INTRODUCTION
We address the task of natural language to code generation (NL→code): generating a code snippet,
written in a general-purpose programming language such as Python or Bash, given a natural language
intent. This task has seen sharply growing popularity recently due to the emergence of large language
models trained on vast amounts of natural language and code (Chen et al., 2021; Xu et al., 2022;
Fried et al., 2022). NL→code models facilitate programming for both professional and inexperienced
programmers, by allowing programmers to write code by only expressing their higher-level intent.
Many existing code generation models either learn directly from input-output pairs provided as
training data (Allamanis et al., 2015; Yin and Neubig, 2017; Iyer et al., 2018; Brockschmidt et al.,
2019; Xu et al., 2020; Alon et al., 2020; Wang et al., 2021), or learn the mapping between input
and output implicitly from naturally occurring corpora of intertwined natural language and code
(Austin et al., 2021; Nijkamp et al., 2022). Nevertheless, all these works assume that all libraries
and function calls were seen in the training data; and that at test time, the trained model will need to
generate only seen libraries and function calls. However, new functions and libraries are introduced
all the time, and even a seen function call can have unseen arguments. Thus, these existing models
inherently cannot generalize to generate such unseen usages.
In contrast to these existing models, human programmers frequently refer to manuals and docu-
mentation when writing code (Nykaza et al., 2002; Lethbridge et al., 2003). This allows humans
to easily use functions and libraries they have never seen nor used before. Inspired by this ability,
¹ Data and code are available at https://github.com/shuyanzhou/docprompting.
Figure 1: DocPrompting: given an NL intent n, the retriever retrieves a set of relevant documentation {d1, d2, d3} from a documentation pool D. Then, the generator generates the code c based on the NL and retrieved docs. DocPrompting allows the model to generalize to previously unseen usages by reading those docs. Italic blue highlights the shared tokens between NL and docs; bold shows shared tokens between docs and the code snippet.
we propose DocPrompting: a code generation approach that learns to retrieve code documentation
before generating the code. An overview of our approach is illustrated in Figure 1: First, a document
retriever uses the NL intent n to retrieve relevant code documentation { d1 , d2 , d3 } from a documen-
tation pool D . Then, a code generator uses these docs in its prompt to generate the corresponding
code c . The documentation pool serves as an external data store that can be updated frequently
with new contents (e.g., documentation of newly released libraries), without re-training any model
component. This way, DocPrompting can leverage newly added documentation, and it can generate
code containing unseen and unused functions and libraries. DocPrompting is general and applicable
to any programming language and underlying base architecture. To the best of our knowledge, this is
the first demonstration of leveraging documentation in models of code explicitly and effectively.
We demonstrate the effectiveness of DocPrompting on two NL→code benchmarks and tasks, across
two programming languages, and using several base models: GPT-Neo (Black et al., 2021), T5 (Raffel
et al., 2020), CodeT5 (Wang et al., 2021), Fusion-in-Decoder (Izacard and Grave, 2021), and Codex
(Chen et al., 2021). Further, we experiment with both sparse retrievers such as BM25 (Robertson and
Jones, 1976) and dense retrieval models such as SimCSE (Gao et al., 2021). Finally, we introduce
two new benchmarks for retrieval-based code generation: (a) in Bash, we curate a new benchmark
by crawling the tldr repository, and constructing the training/development/test splits without
overlapping commands; (b) in Python, we re-split the popular CoNaLa benchmark (Yin et al., 2018)
by making every test example contain at least one Python function that is not seen in the training data.
Models that use DocPrompting consistently outperform their base models that generate code solely
based on the NL intents. Using DocPrompting improves strong base models such as CodeT5 by
2.85% in pass@1 (52% relative gain) and 4.39% in pass@10 (30% relative gain) in execution-based
evaluation in CoNaLa; on the new tldr dataset, DocPrompting improves CodeT5 and GPT-Neo-
1.3B by up to 6.9% absolute exact match. We release our new benchmarks, including annotations
of oracle documents for each example and pools of documentation, to serve as a test-bed for future
retrieval-based code generation models.
2 DOCPROMPTING
Our underlying assumption is that code documentation is the most exhaustive yet succinct resource for most libraries and programming languages (Roehm et al., 2012), and that documentation allows models to effectively generalize to unseen libraries and functions (Forward and Lethbridge, 2002). We follow
the retrieve-then-generate paradigm (Lewis et al., 2020; Guu et al., 2020), focusing on retrieving
documentation. In this section, we describe the general approach of DocPrompting; in §3 and §6.2,
we elaborate and experiment with practical implementations of DocPrompting.
Formulation Given NL intent n, our goal is to generate a corresponding code snippet c written in
some programming language (PL) such as Python. We assume that a model has access to a collection
of code documentation D. Each document di ∈ D describes the usage of a library, a function, or an
argument in that PL. The construction of D is flexible: it can either be a comprehensive set of all
available libraries and functions in a PL, or a customized subset for the scope of a specific project.
Although a model may use the entire collection of documents D, only a few documents in D are
relevant for any particular intent. Further, it is usually computationally infeasible to directly condition
on the entire, unbounded, collection of documents while making predictions. Thus, we first let the
model select a subset of documents Dn = {d1, d2, ..., dk} ⊆ D that are potentially relevant given n,
and refer to this subset while generating c.
Overall, we decompose the probability of generating c into the probability of choosing a particular
subset of documents P (Dn ∣ D, n), and the probability of generating the code conditioned on the
intent and the selected documents P(c ∣ Dn, n); finally, we marginalize over all Dn ⊆ D:
P(c \mid D, n) = \sum_{D_n \subseteq D} P(c \mid D_n, n) \cdot P(D_n \mid D, n) \qquad (1)
assuming that c is independent of D given Dn (that is, c ⊥ D ∣ Dn). Since enumerating all possible
subsets Dn is computationally infeasible, we follow the common practice and approximate the
marginalization over Dn in Equation (1) by taking the most probable subset of retrieved documents
Dˆn , and then conditioning the prediction of c on these most likely documents:
\hat{D}_n := \operatorname{argmax}_{D_n \subseteq D} P(D_n \mid D, n) \qquad\quad P(c \mid D, n) \approx P(c \mid \hat{D}_n, n) \cdot P(\hat{D}_n \mid D, n) \qquad (2)
Equation 2 implies that DocPrompting relies on two main components: A retriever R retrieves
relevant documents Dˆn given the intent n; and a generator G generates the code snippet c conditioned
on the retrieved documents Dˆn and the intent n, which compose a new prompt. Specifically, R
computes a similarity score s(di, n) between an intent n and every document di ∈ D. Thus, the subset D̂n ⊆ D contains the top-k documents with the highest similarity scores: D̂n = top-k_{di ∈ D}(s(di, n)).
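For concreteness, the following Python sketch shows this two-stage pipeline end to end; the encode, similarity, and generate callables are hypothetical placeholders standing in for a retriever encoder, a scoring function, and a code generator, not the exact implementation used in the paper.

def docprompting_generate(intent, doc_pool, encode, similarity, generate, k=10):
    """Sketch of Equation 2: pick the top-k docs D-hat_n for the intent n, then
    condition the generator on the retrieved docs and the intent."""
    query = encode(intent)
    # s(d_i, n): score every document in the pool against the intent.
    scored = [(similarity(query, encode(doc)), doc) for doc in doc_pool]
    # D-hat_n: the k documents with the highest similarity scores.
    top_docs = [doc for _, doc in sorted(scored, key=lambda x: x[0], reverse=True)[:k]]
    # The new prompt is composed of the retrieved docs followed by the intent.
    prompt = "\n".join(top_docs) + "\n# intent: " + intent + "\n"
    return generate(prompt)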
An overview of our approach is illustrated in Figure 1: given the intent Generate HTML with python
syntax highlighting for “print(’reading docs’)”, the retriever R retrieves three relevant documents:
d1 describes the syntax highlighting library pygments, d2 describes the class PythonLexer, and
d3 describes the HtmlFormatter class. Given these docs and the intent, the generator G generates
the code snippet c, which uses PythonLexer and HtmlFormatter from the pygments library.
We experiment with two main types of retrievers: sparse retrievers and dense retrievers. As our sparse retriever, we use Elasticsearch (https://github.com/elastic/elasticsearch) with the standard BM25 scoring (Robertson and Jones, 1976). This retriever represents documents using sparse features that rely on word frequencies, such as BM25 and TF-IDF.
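For a rough illustration, such a sparse retriever can be approximated with the rank_bm25 package (an assumption of this sketch; the experiments use Elasticsearch's built-in BM25) over a toy documentation pool:

from rank_bm25 import BM25Okapi

# Toy documentation pool; in practice it would hold e.g. all paragraphs of the Bash
# manuals or all Python function docs.
doc_pool = [
    "highlight(code, lexer, formatter): highlight the code with the given lexer and formatter",
    "class PythonLexer: a lexer for Python source code",
    "class HtmlFormatter: format tokens as HTML",
]
bm25 = BM25Okapi([doc.lower().split() for doc in doc_pool])

intent = "Generate HTML with python syntax highlighting"
top_docs = bm25.get_top_n(intent.lower().split(), doc_pool, n=2)
print(top_docs)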
As our dense retriever, we follow prior work (Chen et al., 2020; Karpukhin et al., 2020; Gao et al.,
2021): given a triplet (n, c, Dn∗), where Dn∗ are the oracle docs for n, each d+i ∈ Dn∗ and n form a positive pair (n, d+i), while each d−j ∉ Dn∗ and n form a negative pair (n, d−j). We train the retriever in a contrastive fashion, where the similarity score of a positive pair is maximized while that of the in-batch negative pairs is minimized. For a pair (n, d+i), the loss function is defined as:
\mathcal{L}_r = -\log \frac{\exp(\mathrm{sim}(h_n, h_{d_i^+}))}{\exp(\mathrm{sim}(h_n, h_{d_i^+})) + \sum_{d_j^- \in B \setminus D_n^*} \exp(\mathrm{sim}(h_n, h_{d_j^-}))} \qquad (3)
where hx is the representation of x computed by a neural encoder, and B is the set of positive docs of the other
examples in the batch. We define sim(hx , hy ) as the cosine similarity between hx and hy .
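A minimal PyTorch sketch of this in-batch contrastive objective (Equation 3); the batch layout (one oracle doc per intent, with the other rows acting as negatives) and the temperature are illustrative assumptions rather than the exact training configuration:

import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(h_intents, h_pos_docs, temperature=0.05):
    """h_intents, h_pos_docs: (B, D) encoder outputs; row i of h_pos_docs is the
    oracle doc of intent i, and the remaining rows act as in-batch negatives."""
    h_n = F.normalize(h_intents, dim=-1)
    h_d = F.normalize(h_pos_docs, dim=-1)
    # Cosine similarity sim(h_n, h_d) for every intent/doc pair in the batch: (B, B).
    logits = (h_n @ h_d.T) / temperature
    # The positive doc of intent i sits on the diagonal; cross-entropy against the
    # diagonal targets maximizes the positive similarity and minimizes the negatives.
    targets = torch.arange(h_n.size(0), device=h_n.device)
    return F.cross_entropy(logits, targets)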
We use all (ni , d+i ) in the training set as our supervised training dataset. Additionally, we use all
sentences in the documentation pool for weak supervision: Following Chen et al. (2020) and Gao et al.
(2021), representations of the same sentence encoded with different dropout masks are treated as a positive pair. Instead of using either supervised or weakly supervised training alone, as in Gao et al. (2021), we
simply mix the two resulting supervision signals, and examples are randomly distributed into batches.
This mixture of tasks not only facilitates the learning process (§6.2), but also reduces the engineering
effort required to store and reload models for separate supervised and unsupervised training phases.
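The mixing itself can be as simple as the sketch below, where supervised (intent, oracle doc) pairs and weakly supervised (sentence, sentence) pairs are pooled and shuffled into shared batches; the two encoder forward passes with different dropout masks then turn each duplicated sentence into a distinct positive view (a SimCSE-style assumption about the training loop):

import random

def build_mixed_training_pairs(supervised_pairs, doc_pool_sentences, seed=0):
    """supervised_pairs: list of (intent, oracle_doc) tuples.
    doc_pool_sentences: all sentences from the documentation pool (weak supervision)."""
    weak_pairs = [(sentence, sentence) for sentence in doc_pool_sentences]
    mixed = list(supervised_pairs) + weak_pairs
    # Examples from both supervision signals are randomly distributed into batches.
    random.Random(seed).shuffle(mixed)
    return mixed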
We initialize the retriever encoder with either the best model of Gao et al. (2021) or the encoder of
CodeT5-base (Wang et al., 2021). Additional training details are provided in Appendix C.
4 EXPERIMENTAL SETUP
We evaluate DocPrompting on two NL→code tasks: shell scripting (§4.1), in which we generate
complex shell commands given an intent, and Python programming (§4.2), where we generate
answers in Python for NL questions. In this section, we first introduce a newly curated benchmark
tldr; we then describe our re-split of the popular CoNaLa benchmark (Yin et al., 2018). For each
benchmark, we provide a global documentation pool D that is shared for all examples and oracle
documents Dn∗ which we use to train the retriever. We release our newly curated benchmarks to serve
as a test-bed for future retrieval-based code generation models.
4.1 TLDR
We curate a new benchmark by crawling the tldr repository and constructing training/development/test splits without overlapping commands. We further curated the oracle documents Dn∗ for each example using simple string
matching. An example from tldr is shown in Figure 2. To the best of our knowledge, this is the
first work to leverage tldr as an NL→code benchmark. Detailed statistics and additional details
are provided in Appendix A. In tldr, each NL intent results in a single Bash command with a combination of argument flags. We therefore retrieve in two stages: we first retrieve entire Bash manuals, take the top-ranked manual, and then retrieve the top-10 paragraphs from that manual.
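A rough sketch of this two-stage retrieval; the score function stands in for any retriever score (e.g., BM25), and the data layout is an assumption for illustration:

def retrieve_tldr_docs(intent, manuals, score, k=10):
    """manuals: dict mapping a command name to the list of paragraphs of its manual.
    Stage 1 ranks whole manuals; stage 2 ranks paragraphs within the top manual."""
    best_cmd = max(manuals, key=lambda cmd: score(intent, " ".join(manuals[cmd])))
    ranked_paragraphs = sorted(manuals[best_cmd], key=lambda p: score(intent, p), reverse=True)
    return best_cmd, ranked_paragraphs[:k]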
Evaluation metrics We measure: (a) command name accuracy (CMD Acc) – whether the command name (e.g., cat) is an exact match; (b) exact match (EM) – exact match between the reference and the generation; (c) token-level F1; and (d) character-level BLEU (charBLEU; Lin et al., 2018; Shi et al., 2022). In all metrics, we disregard user-specific variable names in the references and the models' outputs. For example, "mycli -u [user] -h [host] [database]" is evaluated as "mycli -u $1 -h $2 $3".
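This normalization can be implemented with a short regular expression that rewrites every bracketed, user-specific slot into a positional variable; the snippet below is our illustration of the rule, not the benchmark's exact evaluation script:

import re

def normalize_placeholders(command):
    """Replace user-specific slots such as [user] or [host] with $1, $2, ..."""
    counter = 0
    def to_positional(_match):
        nonlocal counter
        counter += 1
        return f"${counter}"
    return re.sub(r"\[[^\]]+\]", to_positional, command)

assert normalize_placeholders("mycli -u [user] -h [host] [database]") == "mycli -u $1 -h $2 $3"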
4.2 CONALA
CoNaLa (Yin et al., 2018) is a popular benchmark for NL→Python generation. NL intents are
StackOverflow questions, and code snippets are their answers. Both intents and code snippets are
rewritten by human annotators. We re-split the dataset to test models’ generalization to unseen Python
functions. In our re-split, we verified that every example in the development or the test set uses at least
one Python function (e.g., plt.plot) that was not seen in the training data. In addition, we make
sure that the examples from the same StackOverflow posts are in the same set to prevent leakage.
This re-split results in 2,135/201/543 examples in the training/development/test sets, respectively.
The CoNaLa documentation pool D contains 35,763 documents, each describing a single function,
from all Python libraries available on DevDocs (https://devdocs.io). These include built-in
libraries and other popular libraries such as numpy. We constructed the oracle docs Dn∗ for each
example by matching all function names in the target code c with docs. More details are in Appendix B.
Evaluation metrics We follow Yin et al. (2018) and measure BLEU-4. Since we focus on general-
ization to unseen functions, we additionally report function name recall (recall) and unseen function
recall (recallunseen ), which measures recall among function calls that do not appear in the training
set. Finally, following Chen et al. (2021); Austin et al. (2021), we used the manually written unit
tests from Wang et al. (2022) for 100 examples from CoNaLa’s test set and measure pass@k. We
followed Chen et al. (2021) and performed nucleus sampling (Holtzman et al., 2019) with p = 0.95.
For each k, we searched for the best temperature for each model from {0.2, 0.4, 0.6, 0.8, 1.0}. On
average, each example has 2.03 tests. Because the concatenation of multiple Python docs often exceeded the length limit of GPT-Neo, we experimented on this dataset with FiD, which allows longer inputs.
Additional details are provided in Appendix B.
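Pass@k is typically computed with the unbiased estimator of Chen et al. (2021): with n sampled programs per problem, c of which pass the problem's unit tests, the estimator is 1 − C(n−c, k)/C(n, k). A short sketch:

import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator of Chen et al. (2021), computed as a stable running product."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# e.g., 200 samples for one problem, 30 of which pass its tests:
print(round(pass_at_k(200, 30, 10), 4))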
5 RESULTS
In all following results, all models with DocPrompting use the top-10 retrieved docs from the best
retriever on that dataset (Table 4). Every baseline uses the exact same setup as its “+DocPrompting”
version, except for not using the documentation.
Results for tldr are shown in Table 1. DocPrompting consistently improves the base models. For example, compared to the vanilla T5, T5+DocPrompting achieves more than twice the accuracy in predicting the command name, a gain of more than 16 charBLEU points on the entire prediction, and an almost 9% absolute exact match gain. In the few-shot learning setting with Codex, DocPrompting
brings gains of 6.7 charBLEU points, and consistent improvement across all metrics over the baseline
that observes only NL-code pairs in its prompt. These results show that retrieving documentation
also benefits strong models such as Codex, and with only few examples in the context.
Code generation with oracle command names In realistic settings, a human programmer may
know the command name they need to use (e.g., awk), but not know the exact usage and flags. In
fact, better understanding of the usage of known commands is the purpose of Unix man pages and the
tldr project. We conducted an oracle experiment where we provided T5 (which was the strongest model using DocPrompting) and Codex with the oracle command name (e.g., awk). This oracle information is provided to both the baseline and the model that uses DocPrompting. The results are shown on the bottom part of Table 1. When the oracle command is given, DocPrompting further improves over the base models. For example, when providing Codex with the ground truth command name, DocPrompting improves its exact match from 22.44% to 32.43%.

Table 1: Results on shell scripting, using a BM25 retriever with top-10 retrieved docs, on the test set of tldr. For the "oracle command name" experiments, we selected the best model of each type.

Table 2: Comparison to approaches that retrieve examples (Parvez et al., 2021; Pasupat et al., 2021).

Model                          CMD Acc (%)   EM (%)   Token F1   charBLEU
GPT-Neo-125M  +ExPrompting           6.68      0.32      20.49      11.15
              +DocPrompting         25.32      3.56      31.23      24.43
GPT-Neo-1.3B  +ExPrompting          14.01      2.8       30.07      22.11
              +DocPrompting         27.59      9.05      37.24      30.57
Should we retrieve documentation or examples? All existing retrieval-based models of code
retrieve NL-code pairs or code snippets, rather than documentation. To simulate this scenario, we
followed Parvez et al. (2021) and Pasupat et al. (2021) to retrieve NL-code pairs from the training
set of tldr, and refer to this baseline as ExPrompting. We finetuned the best retriever RoBERTa
and two generators, and retrieved the top-30 NL-code pairs for every example. As shown in Table 2,
retrieving documentation (DocPrompting) provides much higher gains than retrieving examples
(ExPrompting). Theoretically, adding examples of unseen commands can help ExPrompting
generalize to them as well. However, new libraries and functions may not yet have available examples on the web, while documentation typically becomes available as soon as the library is released.
Table 3 shows the results on CoNaLa. CodeT5+DocPrompting yields a 1.65 BLEU improvement over the state-of-the-art baseline that was initialized with CodeT5.⁴ When measuring the recall of the generated function names, the benefit of DocPrompting is especially pronounced for unseen functions (recallunseen): for example, DocPrompting achieves 18.30, compared to only 9.03 for the base CodeT5, on unseen functions. Additionally, DocPrompting improves the in-context learning setting with Codex.
⁴ In a separate experiment on the original split of CoNaLa, this baseline achieved a BLEU score of 39.12, which outperforms the previous state-of-the-art (Beau and Crabbé, 2022) by 4.92 BLEU points.
Table 3: Results on CoNaLa, using a CodeT5 retriever with top-10 retrieved docs. Function recall
(Recall) measures how many functions in the reference code are correctly predicted, and unseen
function recall (Recallunseen ) only considers the subset held out from the training data.
Figure 3: Pass@k of CodeT5 with and without DocPrompting on 100 CoNaLa examples.
Figure 4: Using documentation significantly increases the n-gram overlap recall between the input and the output, in tldr and CoNaLa.
We hypothesize that the minor gain is mainly due to potential data leakage in Codex's pretraining, which violates the split between seen and unseen functions. Another reason is that a strong generator such as Codex may require an equally strong retriever. We find that Codex can achieve even higher results with an oracle retriever, which shows the potential for further improvement by improving the retrievers.
Finally, CodeT5 performs better than T5, with and without using DocPrompting. This emphasizes
the importance of using code-specific pretrained models.
Execution-based evaluation The results are shown in Figure 3. Using DocPrompting consistently outperforms the baseline CodeT5 for all values of k. For example, DocPrompting yields a 2.85% improvement on pass@1 and a 4.45% improvement on pass@5, which are realistic numbers of completions that can be suggested in an IDE. When k = 200, DocPrompting widens the gap to 8.38%. These results demonstrate that DocPrompting not only improves the quality of the generated code in its surface form, but also increases its functional correctness. Additional details and
results are provided in Appendix G.
6 ANALYSIS
6.1 WHY DOES READING THE DOCUMENTATION HELP GENERATING MORE ACCURATE CODE?
We believe that one of the major reasons is that documentation eases the mapping between NL
intents and code, since the documentation contains both NL descriptions and function signatures.
We calculated the n-gram overlap between the NL intents and their corresponding code snippets (NL↔code), and the overlap between the NL intents together with their top-10 retrieved documents and their code snippets ((NL+docs)↔code). As shown in Figure 4, adding documentation significantly increases the overlap across n-grams, increasing, for example, the unigram overlap from 12% to 24% in tldr. That is, one of the reasons that retrieving documentation helps generating accurate code is that documentation bridges the gap between the "intent terminology" and the "code terminology".

Table 4: Retrieval performance of multiple models on the dev set of tldr (top) and CoNaLa (bottom). RoBERTa is the best model taken from Gao et al. (2021), and CodeT5 is the encoder of CodeT5-base (Wang et al., 2021). Models with the subscript "off-shelf" are the off-the-shelf models, and the other models were finetuned with the objective in Equation 3. The last column is the best model (RoBERTa for tldr and CodeT5 for CoNaLa) trained without the weak supervision corpus.
6.2 RETRIEVER ANALYSIS
We compared different configurations of the retriever to gather more insights into effective DocPrompting. Table 4 shows a comparison between different retrievers and their setups. First, the
performance of BM25 varies among datasets: in tldr, BM25 matches the recall of trained dense retrievers; however, in CoNaLa, BM25 achieves a recall@10 of only 9.73%, while strong dense retrievers such as the encoder of CodeT5 achieve a recall@10 of 55.81%. We hypothesize that this difference
between datasets stems from the ways these datasets were created: tldr intents were written based
on existing Bash commands and manuals; while CoNaLa examples were mined from StackOverflow
posts, where users ask questions with limited or no context. Thus, NL intents in CoNaLa require
a better semantic alignment with the documents, and thus benefit from dense retrievers. The gap
resulting from different data curation processes was also observed by Rodriguez and Boyd-Graber
(2021) in open-domain question answering (QA).
Second, retrievers that were pretrained on the target programming language are generally stronger.
For example in CoNaLa, CodeT5 which was pretrained on Python, is both a better off-the-shelf
retriever and a better finetuned-retriever than RoBERTa, which was pretrained mainly on text. In
contrast, tldr is based on Bash, which neither CodeT5 nor RoBERTa were explicitly pretrained on.
Thus, tldr benefits mostly from BM25 and RoBERTa rather than CodeT5 as retrievers.
Finally, training the retriever using weak supervision on the documentation pool (Section 3.1)
dramatically improves the retriever. The recall of the best retrievers of each dataset without this
corpus is shown in the last column of Table 4 (“Best w/o weak sup.”). On CoNaLa, removing
this corpus results in severe performance degradation. One possible explanation is that this weak
supervision helps the retriever perform domain adaptation more effectively.
We examine the models' outputs and show two representative examples in Table 5. In the first example, Image.open was not seen in the training set, and the baseline CodeT5 incorrectly predicts os.open. In contrast, using DocPrompting allows the model to retrieve the docs and to correctly predict Image.open. In the second example, df.to_csv was not seen in training, and the baseline CodeT5 fails to predict it correctly. In contrast, DocPrompting does predict most of the df.to_csv call correctly, thanks to the retrieved docs. Nevertheless, DocPrompting generates an incorrect argument skiprows=1, instead of header=False. The reason is that along with the retrieved documentation of df.to_csv, the retriever also retrieved the documentation of df.read_csv, which has a skiprows argument. That is, the generator uses an argument of df.read_csv with the function df.to_csv. Further improving the retrievers and the generators, and post-filtering based on the validity of argument names, may mitigate such mistakes.
7 RELATED WORK
Code generation The most common practice in NL→code generation is training a model on a dataset
of NL-code pairs (Allamanis et al., 2015; Yin and Neubig, 2017; Rabinovich et al., 2017; Iyer et al.,
2018). Nevertheless, all these works assume that their training corpus covers all required libraries and
functions, and their models are inherently incapable of generating libraries and functions that were not
seen in the training data. On the contrary, DocPrompting allows models to generate calls to unseen
functions by retrieving these functions' documentation and reading it at test time. Hayati et al.
(2018); Parvez et al. (2021); Hashimoto et al. (2018) and Lu et al. (2017) learn to retrieve examples at
test time; Pasupat et al. (2021) also considered settings where the test data has a distribution shift from
the training data. However, when new libraries are released they often come with documentation,
and thus we assume that documentation for new libraries is much more likely to be available than
concrete natural-language-intent and code-snippet pairs (n, c) that already use these libraries. The
models of Shrivastava et al. (2022) and Wu et al. (2021) retrieve code snippets from relevant files in the
same project; contrarily, when predicting new libraries and functions that are external to the user’s
project, documentation is the source that is the most likely to be available.
Retrieval augmented generation The paradigm of retrieve-then-generate has gained popularity in
the field of open-domain question answering (Guu et al., 2020; Lewis et al., 2020; Karpukhin et al.,
2020), where the answer to an open-domain question exists in only a few documents out of a much
larger pool. Although DocPrompting takes a similar approach, documentation retrieval in code
generation is even more valuable, since code libraries are updated constantly, and new libraries are
introduced daily. Thus, DocPrompting allows updating the documentation pool frequently with new
contents, without re-training any model components.
Documentation conditioned generation The model of Zhong et al. (2019) reads documents to understand environment dynamics in a grid-world game, and Branavan et al. (2011) control situated agents in a game (Civilization II) by reading the game's manual. However, all their models were
tailored to specific games; in contrast, DocPrompting is general and is applicable for a variety of
programming languages and datasets.
8 CONCLUSION
We propose DocPrompting, a simple and effective approach for code generation by retrieving the
relevant documentation. DocPrompting consistently improves NL→code models in two tasks, in
two PLs, and across multiple strong base models. DocPrompting improves strong base models such
as CodeT5 by 2.85% in pass@1 (52% relative gain) in execution-based evaluation on the popular
Python CoNaLa benchmark; on a new Bash dataset tldr, DocPrompting improves CodeT5 and
GPT-Neo-1.3B by up to 6.9% exact match, and Codex by 6.78 charBLEU points.
These results open a promising direction for NL→code generation. We believe that our results can be
further improved using more clever encoding of the structured nature of long documents, and using
joint training of the retriever and the generator, which hopefully will avoid cascading errors. Further,
we believe that the principles and the methods presented in this paper are applicable to additional
code-related tasks, and other documentation-like resources such as tutorials and blog posts. To these
ends, we make all our code, data, and models publicly available.
9 ACKNOWLEDGEMENT
We thank the anonymous reviewers for their useful comments and suggestions. This work is
supported by a gift from Amazon AI and a contract from the Air Force Research Laboratory
under agreement number FA8750-19-2-0200. The U.S. Government is authorized to reproduce and
distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The
views and conclusions contained herein are those of the authors and should not be interpreted as
necessarily representing the official policies or endorsements, either expressed or implied, of the Air
Force Research Laboratory or the U.S. Government.
REFERENCES
Miltiadis Allamanis, Daniel Tarlow, Andrew D. Gordon, and Yi Wei. Bimodal modelling of source code and
natural language. In Francis R. Bach and David M. Blei, editors, Proceedings of the 32nd International
Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop
and Conference Proceedings, pages 2123–2132. JMLR.org, 2015. URL http://proceedings.mlr.
press/v37/allamanis15.html.
Uri Alon, Roy Sadaka, Omer Levy, and Eran Yahav. Structural language models of code. In International
conference on machine learning, pages 245–256. PMLR, 2020.
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang,
Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. ArXiv preprint,
abs/2108.07732, 2021. URL https://arxiv.org/abs/2108.07732.
Nathanaël Beau and Benoît Crabbé. The impact of lexical and grammatical processing on generating code
from natural language. ArXiv preprint, abs/2202.13972, 2022. URL https://arxiv.org/abs/2202.
13972.
Sid Black, Leo Gao, Phil Wang, Connor Leahy, and Stella Biderman. GPT-Neo: Large Scale Autoregres-
sive Language Modeling with Mesh-Tensorflow, 2021. URL https://doi.org/10.5281/zenodo.
5297715. If you use this software, please cite it using these metadata.
S.R.K. Branavan, David Silver, and Regina Barzilay. Learning to win by reading manuals in a Monte-Carlo
framework. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:
Human Language Technologies, pages 268–277, Portland, Oregon, USA, 2011. Association for Computational
Linguistics. URL https://aclanthology.org/P11-1028.
Marc Brockschmidt, Miltiadis Allamanis, Alexander L. Gaunt, and Oleksandr Polozov. Generative code
modeling with graphs. In International Conference on Learning Representations, 2019. URL https:
//openreview.net/forum?id=Bke4KsA5FX.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura
Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. ArXiv
preprint, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive
learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning,
ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research,
pages 1597–1607. PMLR, 2020. URL http://proceedings.mlr.press/v119/chen20j.html.
Andrew Forward and Timothy C Lethbridge. The relevance of software documentation, tools and technologies:
a survey. In Proceedings of the 2002 ACM symposium on Document engineering, pages 26–33, 2002.
Daniel Fried, Armen Aghajanyan, Jessy Lin, Sida Wang, Eric Wallace, Freda Shi, Ruiqi Zhong, Wen-tau Yih,
Luke Zettlemoyer, and Mike Lewis. Incoder: A generative model for code infilling and synthesis. ArXiv
preprint, abs/2204.05999, 2022. URL https://arxiv.org/abs/2204.05999.
Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings.
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages
6894–6910, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics.
doi: 10.18653/v1/2021.emnlp-main.552. URL https://aclanthology.org/2021.emnlp-main.
552.
Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. Accelerating
large-scale inference with anisotropic vector quantization. In International Conference on Machine Learning,
2020. URL https://arxiv.org/abs/1908.10396.
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented
language model pre-training. ArXiv preprint, abs/2002.08909, 2020. URL https://arxiv.org/abs/
2002.08909.
Tatsunori B Hashimoto, Kelvin Guu, Yonatan Oren, and Percy S Liang. A retrieve-and-edit framework for
predicting structured outputs. Advances in Neural Information Processing Systems, 31, 2018.
Shirley Anugrah Hayati, Raphael Olivier, Pravalika Avvaru, Pengcheng Yin, Anthony Tomasic, and Graham
Neubig. Retrieval-based neural code generation. In Proceedings of the 2018 Conference on Empirical Methods
in Natural Language Processing, pages 925–930, Brussels, Belgium, 2018. Association for Computational
Linguistics. doi: 10.18653/v1/D18-1111. URL https://aclanthology.org/D18-1111.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration.
arXiv preprint arXiv:1904.09751, 2019.
Srinivasan Iyer, Ioannis Konstas, Alvin Cheung, and Luke Zettlemoyer. Mapping language to code in pro-
grammatic context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language
Processing, pages 1643–1652, Brussels, Belgium, 2018. Association for Computational Linguistics. doi:
10.18653/v1/D18-1192. URL https://aclanthology.org/D18-1192.
Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain ques-
tion answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Compu-
tational Linguistics: Main Volume, pages 874–880, Online, 2021. Association for Computational Linguistics.
doi: 10.18653/v1/2021.eacl-main.74. URL https://aclanthology.org/2021.eacl-main.74.
Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions
on Big Data, 7(3):535–547, 2019.
Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen,
and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the
2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781,
Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.550. URL
https://aclanthology.org/2020.emnlp-main.550.
Timothy C Lethbridge, Janice Singer, and Andrew Forward. How software engineers use documentation: The
state of the practice. IEEE software, 20(6):35–39, 2003.
Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal,
Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-
augmented generation for knowledge-intensive NLP tasks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia
Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing
Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Decem-
ber 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/
6b493230205f780e1bc26945df7481e5-Abstract.html.
Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D. Ernst. NL2Bash: A corpus and semantic
parser for natural language interface to the linux operating system. In Proceedings of the Eleventh International
Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018. European Language
Resources Association (ELRA). URL https://aclanthology.org/L18-1491.
Yanxin Lu, Swarat Chaudhuri, Chris Jermaine, and David Melski. Data-driven program completion. arXiv
preprint arXiv:1705.09042, 2017.
Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming
Xiong. A conversational paradigm for program synthesis. arXiv preprint, 2022.
Janet Nykaza, Rhonda Messinger, Fran Boehme, Cherie L Norman, Matthew Mace, and Manuel Gordon. What
programmers really want: results of a needs assessment for sdk documentation. In Proceedings of the 20th
annual international conference on Computer documentation, pages 133–141, 2002.
Md Rizwan Parvez, Wasi Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Retrieval aug-
mented code generation and summarization. In Findings of the Association for Computational Linguistics:
EMNLP 2021, pages 2719–2734, Punta Cana, Dominican Republic, 2021. Association for Computational
Linguistics. doi: 10.18653/v1/2021.findings-emnlp.232. URL https://aclanthology.org/2021.
findings-emnlp.232.
Panupong Pasupat, Yuan Zhang, and Kelvin Guu. Controllable semantic parsing via retrieval augmentation.
In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages
7683–7698, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics.
doi: 10.18653/v1/2021.emnlp-main.607. URL https://aclanthology.org/2021.emnlp-main.
607.
Maxim Rabinovich, Mitchell Stern, and Dan Klein. Abstract syntax networks for code generation and semantic
parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume
1: Long Papers), pages 1139–1149, Vancouver, Canada, 2017. Association for Computational Linguistics.
doi: 10.18653/v1/P17-1105. URL https://aclanthology.org/P17-1105.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei
Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of
Machine Learning Research, 21:1–67, 2020.
Stephen E Robertson and K Sparck Jones. Relevance weighting of search terms. Journal of the American Society
for Information science, 27(3):129–146, 1976.
Pedro Rodriguez and Jordan Boyd-Graber. Evaluation paradigms in question answering. In Proceedings of
the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9630–9642, Online and
Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.
emnlp-main.758. URL https://aclanthology.org/2021.emnlp-main.758.
Tobias Roehm, Rebecca Tiarks, Rainer Koschke, and Walid Maalej. How do professional developers comprehend
software? In 2012 34th International Conference on Software Engineering (ICSE), pages 255–265. IEEE,
2012.
Freda Shi, Daniel Fried, Marjan Ghazvininejad, Luke Zettlemoyer, and Sida I. Wang. Natural language to code
translation with execution, 2022. URL https://arxiv.org/abs/2204.11454.
Disha Shrivastava, Hugo Larochelle, and Daniel Tarlow. Repository-level prompt generation for large language
models of code. In ICML 2022 Workshop on Knowledge Retrieval and Language Models.
Yue Wang, Weishi Wang, Shafiq Joty, and Steven C.H. Hoi. CodeT5: Identifier-aware unified pre-trained
encoder-decoder models for code understanding and generation. In Proceedings of the 2021 Conference on
Empirical Methods in Natural Language Processing, pages 8696–8708, Online and Punta Cana, Dominican
Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.685. URL
https://aclanthology.org/2021.emnlp-main.685.
Zhiruo Wang, Shuyan Zhou, Daniel Fried, and Graham Neubig. Execution-based evaluation for open-domain
code generation. arXiv preprint arXiv:2212.10481, 2022.
Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. In
International Conference on Learning Representations, 2021.
Frank F. Xu, Zhengbao Jiang, Pengcheng Yin, Bogdan Vasilescu, and Graham Neubig. Incorporating external
knowledge through pre-training for natural language to code generation. In Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics, pages 6045–6052, Online, 2020. Association for
Computational Linguistics. doi: 10.18653/v1/2020.acl-main.538. URL https://aclanthology.org/
2020.acl-main.538.
Frank F Xu, Uri Alon, Graham Neubig, and Vincent J Hellendoorn. A systematic evaluation of large lan-
guage models of code. ArXiv preprint, abs/2202.13169, 2022. URL https://arxiv.org/abs/2202.
13169.
Pengcheng Yin and Graham Neubig. A syntactic neural model for general-purpose code generation. In
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long
Papers), pages 440–450, Vancouver, Canada, 2017. Association for Computational Linguistics. doi: 10.
18653/v1/P17-1041. URL https://aclanthology.org/P17-1041.
Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. Learning to mine aligned
code and natural language pairs from stack overflow. In 2018 IEEE/ACM 15th international conference on
mining software repositories (MSR), pages 476–486. IEEE, 2018.
Victor Zhong, Tim Rocktäschel, and Edward Grefenstette. Rtfm: Generalising to novel environment dynamics
via reading. ArXiv preprint, abs/1910.08210, 2019. URL https://arxiv.org/abs/1910.08210.
A TLDR
Documentation pool D We take the Bash manuals of the 1,897 Bash commands in tldr to construct a documentation pool. We search for each command name on manned.org⁶, a website that archives Unix manual pages (the same as the Unix 'man <command>' command), and then extract the text contents from the returned manual page. We further break each manual into multiple paragraphs by line breaks, so that each paragraph describes a single concept such as a command functionality or a flag usage. We make this decision because each manual contains a large volume of content, which is too long to fit the length limit of a neural model and too noisy, distracting the model with irrelevant information. This results in 400k individual entries in the pool in total.
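A minimal sketch of this paragraph splitting, assuming the manual has already been fetched and converted to plain text (fetching and cleaning are omitted, and splitting on blank lines approximates the line-break rule described above):

def split_manual_into_paragraphs(manual_text, min_chars=20):
    """Split a man page into paragraphs on blank lines, so that each entry covers a
    single concept (the command summary or one flag), and drop near-empty fragments."""
    paragraphs = [p.strip() for p in manual_text.split("\n\n")]
    return [p for p in paragraphs if len(p) >= min_chars]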
Oracle manual Dn∗ We find the ground-truth documentation for each (n, c) pair through command-name and flag-matching heuristics. For instance, given a code snippet toilet 'input text' -f 'font filename', we constrain our search to the documentation from the toilet manual page and select the paragraphs that start with the -f flag as oracle paragraphs. Along with the first paragraph, which commonly summarizes a command, these paragraphs form Dn∗.
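The matching heuristic can be sketched roughly as follows; the exact rules used to build the released oracle annotations may differ:

def oracle_paragraphs(command, paragraphs):
    """command: a Bash snippet such as "toilet 'input text' -f 'font_filename'".
    paragraphs: the paragraphs of that command's manual page.
    Keep the first (summary) paragraph plus every paragraph documenting a flag in the command."""
    flags = {token for token in command.split() if token.startswith("-")}
    oracle = [paragraphs[0]]  # the first paragraph commonly summarizes the command
    oracle += [p for p in paragraphs[1:] if any(p.lstrip().startswith(flag) for flag in flags)]
    return oracle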
Evaluation metrics We use four evaluation metrics to measure the quality of the generated code: (a) com-
mand name accuracy (CMD Acc) – measures whether the command name (e.g., cat) is predicted correctly;
(b) token-level F1 – converts the reference code and the generated code to bag of words and measures the
token-level precision, recall, and F1 overlap; (c) exact match (EM) – measures the exact match between the
reference and the generation; and (d) character-level BLEU (charBLEU; Lin et al., 2018; Shi et al., 2022).
For token-level F1, exact match, and charBLEU, we disregard all user-specific variables in the references and the system outputs. For example, "mycli -u [user] -h [host] [database]" is converted into "mycli -u $1 -h $2 $3". This is mainly because the variables are not instantiated in tldr and the
style of the placeholder varies among contributors. For example, some contributors might write [user] as
[username] or [your name]. Therefore, measuring the surface form of user-specific variable names is
less meaningful.
B RE-SPLITTING CONALA
NL→Python pairs We adapt the popular CoNaLa benchmark and re-split the dataset to test the generaliza-
tion scenario. This re-split makes every example in the development and the test set have at least one Python
function (e.g., plt.plot) that was not seen in the training data. There are 2,135, 201, and 543 examples in the training, development, and test sets, respectively. We follow the original work (Yin et al., 2018) and evaluate the system outputs with BLEU-4. Since we focus on the generalization setting, we additionally report unseen
function accuracy, which measures the percentage of correctly predicted held-out functions that do not appear in
the training set.
⁵ e.g., https://github.com/tldr-pages/tldr/blob/main/pages/linux/toilet.md
⁶ https://manned.org
Human-annotated unit tests Following Chen et al. (2021) and Austin et al. (2021), we conduct execution-
based evaluation on CoNaLa to measure the functional correctness of the generated code. We randomly selected
100 examples from the test set and manually annotated unit tests for each example. For example, we wrote tests such as assert gen_code("abcds", 2) == 4 and assert gen_code("abde", 2) == -1 to verify whether the function gen_code could perform "find the index of the substring 's' in string 'str' starting from index 2". Each example was annotated by a single annotator. The annotation was done by two authors of
the paper who program with Python daily. On average, we annotate 2.03 unit tests for each example.
Documentation pool D Our documentation pool contains 35,763 manuals. These functions are from all Python libraries that are available on DevDocs⁷. These libraries contain the Python built-in library and popular libraries like numpy and pandas. The documentation on DevDocs is curated and further transformed and indexed to allow quick searching of APIs. We extract each API signature and the corresponding documentation in every library, remove any content in the documentation that is not text, and segment the documentation into multiple paragraphs based on the <p> HTML tags. The documentation pool then contains pairs of an API signature and a single paragraph of the corresponding documentation. Although the documentation pool does not comprehensively cover all Python libraries and functions, we find that it has a high coverage rate on the CoNaLa dataset. This choice reflects the flexibility of our approach with respect to the characteristics of a target scenario.
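A sketch of this extraction step with BeautifulSoup (an assumed dependency here; the actual preprocessing pipeline may differ):

from bs4 import BeautifulSoup

def extract_doc_entries(api_signature, html):
    """Pair an API signature with each text paragraph of its documentation page,
    segmenting on <p> tags and dropping non-text content."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for p in soup.find_all("p"):
        text = p.get_text(" ", strip=True)
        if text:
            entries.append((api_signature, text))
    return entries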
Oracle manual Dn∗ To find the oracle documents Dn∗ for a given NL intent from the original (n, c) example, we first index the function names with their absolute paths (e.g., plot is indexed as matplotlib.pyplot.plot) with Elasticsearch. We then query the search engine with a cleaned version of c in which variable names are removed. The top-5 functions after de-duplication are treated as the oracle manuals Dn∗.
Natural language and code associations during pretraining Despite our efforts, it is possible that
some of the held-out functions in the test set were seen to associate with NL contexts (e.g., comments) during
the pretraining of a retriever and a generator. Since the generators were initialized from the same checkpoint in
both the baselines and the DocPrompting models, such a possible association is expected to equally help both
models. In the retriever, such a possible association did not cause the retriever to see the exact NL intents together
with the corresponding documentation, and thus the matching between NL← →doc was not leaked. However, it
is possible that there had been semantically similar intents seen along with the code snippets of the held-out
functions. Nevertheless, such co-occurrence is “indirect” and “unsupervised”.
D GENERATOR TRAINING
We train our single-source generators for 20 epochs with a learning rate of 4e-5. We train our FiD-based generators for 10,000 steps. The doc length is set to 200; any further content is truncated. We follow Izacard and Grave (2021) and set the learning rate to 5e-5 with 2,000 warmup steps and linear learning rate decay. The batch size is set to 8. The best model is selected based on the token-level F1 score on the development set for tldr and the BLEU score for CoNaLa. Training takes 8 hours on a single A6000 GPU.
E CODEX PROMPTS
For the baseline, we prompt Codex with three NL-code pairs and append the test query to the end. An example on tldr is shown at the top of Table 7. At the bottom, we list the prompt with DocPrompting, where documentation is provided as well. In the oracle command name setting, we prepend the command name before each NL
intent for the baseline prompt. For the DocPrompting prompt, we replace the potential docs with the retrieved docs from the oracle manual.
⁷ https://devdocs.io

Figure 5: The recall@k (%) and the corresponding BLEU score obtained by using these top-k docs on the CoNaLa dataset (using CodeT5).
F ADDITIONAL ANALYSIS
Parameter efficiency As shown in Table 1, under a given parameter budget, we find that DocPrompting mostly benefits from parallel encoding (FiD). For example, the parallel-encoding T5+DocPrompting (220M parameters) significantly outperforms the 125M-parameter joint-encoding Neo-125M+DocPrompting. Only scaling Neo+DocPrompting up to 1.3B parameters manages to match the 220M-parameter T5+DocPrompting. A possible explanation is that although the base Neo-1.3B (without DocPrompting) generally performs better than the base T5 (without DocPrompting), parallel encoding allows the model to utilize the retrieved documents better, since documents are encoded independently on the encoder side.
The impact of the number of documents Figure 5 shows recall@k and the BLEU score as a function of k, the number of retrieved documents. Increasing k consistently yields a higher recall; however, as more irrelevant documents are retrieved, the generator cannot effectively distinguish them from the relevant ones, and the overall performance remains similar. For example, CodeT5 achieves the highest BLEU score using 5 ≤ k ≤ 10. In contrast, when the generator is provided with the oracle docs only, its BLEU score reaches 49.04 (Table 3). This suggests that both precision and recall of the docs are important, and the benefit of using larger values of k in open-domain QA (Izacard and Grave, 2021) does not necessarily hold in code generation.
Full n-gram overlap Table 8 shows that using documentation significantly increases the n-gram overlap recall between the input and the output, in tldr and CoNaLa. Since we used BM25 to retrieve docs in tldr, the NL↔Retrieved docs overlap is high by construction. In CoNaLa, the NL↔Retrieved docs unigram overlap is high as well, but since we used a dense retriever, the general n-gram overlap does not have to be high for DocPrompting to work well.
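For reference, the overlap statistic itself can be computed as below (our illustrative implementation: the fraction of the code's n-grams that also appear on the input side):

def ngram_overlap_recall(source, code, n):
    """Fraction of the code snippet's n-grams that also appear in the source text
    (the NL intent alone, or the intent concatenated with the retrieved docs)."""
    def ngrams(text):
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    code_ngrams = ngrams(code)
    if not code_ngrams:
        return 0.0
    return len(code_ngrams & ngrams(source)) / len(code_ngrams)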
Retrieval latency Although retrieving docs results in additional test-time computation, the increase in
latency is not prohibitive. First, encoding the input for the retrieval step “costs” a single forward pass through
the retriever’s encoder, which is significantly less expensive than generation (which requires multiple time steps
of the decoder). All the documentation in the retrieval pool can be encoded in advance, and finding the top-k results can be performed quickly using libraries such as FAISS (Johnson et al., 2019) on GPU or ScaNN (Guo et al., 2020) on CPU. The cost of this top-k search is sub-linear in the size of the document pool. Second,
the additional input to the generator results in an increased memory consumption, but only a small increase
in latency since the tokens of a given input can be encoded in parallel. If this difference is crucial in practical
settings, we can decrease the number of retrieved documents. Figure 5 shows that retrieving as few as five docs
may be sufficient in many cases.
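For concreteness, a pre-encoded documentation pool can be served with an inner-product FAISS index roughly as follows (a sketch assuming L2-normalized float32 embeddings, so inner product equals cosine similarity; an approximate index would make the search sub-linear):

import numpy as np
import faiss

def build_index(doc_embeddings):
    """doc_embeddings: (num_docs, dim) float32 array, L2-normalized in advance."""
    index = faiss.IndexFlatIP(doc_embeddings.shape[1])
    index.add(doc_embeddings)
    return index

def retrieve_top_k(index, query_embedding, k=10):
    # A single encoder forward pass produces query_embedding; the search itself is cheap
    # compared to autoregressive decoding.
    scores, doc_ids = index.search(query_embedding.reshape(1, -1).astype(np.float32), k)
    return scores[0], doc_ids[0]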
# display information without including the login, jcpu and pcpu columns
w --short
# END
Potential document 0: w displays information about the users currently on the machine, and their
processes. The header shows, in this order ...
Potential document 1: -s, --short Use the short format. Don't print the login time, JCPU or PCPU
times.
# display information without including the login, jcpu and pcpu columns
w --short
# END
Potential document 0: Sort CSV files. Like the Unix “sort” command, but for tabular data
Potential document 1: usage: csvsort [-h] [-d DELIMITER] [-t] [-q QUOTECHAR] [-u 0,1,2,3] [-b]
[-p ESCAPECHAR] ...
Potential document 2: optional arguments: -h, --help show this help message and exit -n, --names
Display column names and indices from the input CSV and exit. -c COLUMNS ...
...
Table 7: Top: baseline Codex prompt with three NL-code pairs and a test intent. Bottom: DocPrompt-
ing prompt for Codex. In each in-context learning example, the oracle docs, the NL intent and the
corresponding bash command are provided. We use up to five oracle docs for these examples. For
a test example, the top-5 paragraphs from the retriever are provided together with the NL intent. The
documents’ contents were omitted (“...”) to save space.
Table 8: n-gram overlap between different contents (%). Using documentation significantly increases
the n-gram overlap recall between the input and the output, in tldr and CoNaLa.
tldr                          1    2    3
NL↔Code                      12    0    0
(NL+retrieved docs)↔Code     24    2    0
NL↔Retrieved docs            39    8    3

CoNaLa                        1    2    3    4    5
NL↔Code                      30   14   11    9    7
(NL+retrieved docs)↔Code     91   52   28   16   11
NL↔Retrieved docs            72   14    3    1    1
Figure 6: Pass@k on 100 examples on the test set with different temperatures (panels for temperature = 0.2, 0.4, 0.6, 0.8, 1.0; CodeT5 vs. +DocPrompting).
tldr
Model                              CMD Acc (%)   EM (%)   Token F1   charBLEU
Codex (3-shots)  -                       39.01    14.55      44.89      33.93
                 +DocPrompting           36.10    13.97      42.55      32.93
With the oracle command name
                 -                           -    20.22      59.22      38.14
                 +DocPrompting               -    33.15      68.59      44.76

CoNaLa
                                    BLEU    Recall
-                                  48.39     43.35
+DocPrompting                      47.21     44.70
+DocPrompting, oracle docs         54.67     59.68
I EXAMPLES
I.1 TLDR
Examples on tldr are shown in Table 10. In the top three cases, the baseline T5 could not generate the correct Bash command, while T5+DocPrompting retrieves the correct Bash documentation and uses the correct arguments. In the bottom two cases, T5 also predicts the Bash command correctly. However, in the puppet case, it misses the concrete usage for executing the code. In the elixir case, it hallucinates the argument flags --example and --arguments, which elixir does not support at all. In contrast, T5+DocPrompting grounds its generation in the retrieved docs and therefore generates more accurate code.
I.2 CONALA
More examples on CoNaLa are shown in Table 11.
⁸ https://commoncrawl.org/the-data/
Table 10: Examples of predictions from tldr, of the base T5 compared to T5+DocPrompting. In
tldr, the bash commands in the test set are completely different from those in the training set.
play a midi file in a specific key (0 = c major/a minor, 1 = f major/d minor, +1 = g major/e minor, etc.)
Ground truth: timidity --force-keysig=’-flats|+sharps’ ’path/to/file.mid’
T5: midiplay ’key’
T5+DocPrompting: timidity --force-keysig ’f|f|f’ ’path/to/file.midi’
clean up all temporary tex files in the current directory
Ground truth: latexmk -c
T5: tex clean
T5+DocPrompting: latexmk -c
specify the hash algorithm for the index file (experimental)
Ground truth: git show-index --object-format=’sha1|sha256’ ’path/to/file’
T5: hash --experimental ’path/to/index.hashes’
T5+DocPrompting: git show-index --object-format= ’path/to/hash-algorithm’
’path/to/experimental’
execute puppet code
Ground truth: puppet apply --execute ’code’
T5: puppet
T5+DocPrompting: puppet apply --execute ’path/to/file.puppet’
evaluate elixir code by passing it as an argument
Ground truth: elixir -e ’code’
T5: elixir --example --arguments ’path/to/file.elixir’
T5+DocPrompting: elixir -e ’path/to/file.elixir’
Table 11: Examples of predictions from CoNaLa, of the base CodeT5 compared to
CodeT5+DocPrompting. Unseen functions are underscored.