0% found this document useful (0 votes)

16 views14 pages

Towards Neural Synthesis For SMT-Assisted Proof-Oriented Programming

The document discusses the development of a dataset called FS TAR DATA S ET, which contains 600K lines of F⋆ programs and proofs aimed at facilitating AI-assisted proof-oriented programming. It highlights the potential of using AI to automate the synthesis of proofs and programs, revealing that smaller fine-tuned language models can perform comparably to larger models at a lower computational cost. The paper also outlines the contributions made, including the dataset, a benchmark problem for synthesis, and techniques for evaluating neural synthesis methods.

Uploaded by

6rz4jnhhq8

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

16 views14 pages

Towards Neural Synthesis For SMT-Assisted Proof-Oriented Programming

Uploaded by

6rz4jnhhq8

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 14

Towards Neural Synthesis for

SMT-Assisted Proof-Oriented Programming

Saikat Chakraborty§ , Gabriel Ebner§ , Siddharth Bhat†∗ , Sarah Fakhoury§ ,
Sakina Fatima‡∗ , Shuvendu Lahiri§ , Nikhil Swamy§
§
Microsoft Research, Redmond, WA, USA
†
University of Cambridge, Cambridge, UK, ‡ University of Ottawa, Ottawa, ON, Canada
{saikatc, gabrielebner, sfakhoury, shuvendu, nswamy}@microsoft.com
†
[email protected], ‡ [email protected]

Abstract—Proof-oriented programs mix computational content languages often require a high-level of human expertise—AI
arXiv:2405.01787v3 [cs.PL] 4 Sep 2024

with proofs of program correctness. However, the human effort assistance could help make them easier to use.
involved in programming and proving is still substantial, despite
the use of Satisfiability Modulo Theories (SMT) solvers to Recognizing the potential dual benefit (i.e., trustworthy AI
automate proofs in languages such as F⋆ . programming & easier program proof) many researchers have
Seeking to spur research on using AI to automate the construc- begun investigating using AI to synthesize proofs. However,
tion of proof-oriented programs, we curate a dataset of 600K most of the prior work has focused on using AI with tactic-
lines of open-source F⋆ programs and proofs, including software based proof assistants, such as Coq, Lean, and Isabelle [2]–
used in production systems ranging from Windows and Linux, to
Python and Firefox. Our dataset includes around 32K top-level
[5], including projects like CoqGym [6], which builds models
F⋆ definitions, each representing a type-directed program and based on hundreds of Coq projects containing more than
proof synthesis problem—producing a definition given a formal 70,000 tactic scripts, and LeanDojo [7], which uses math-
specification expressed as an F⋆ type. We provide a program- lib [8], a large corpus of formalized mathematics in Lean.
fragment checker that queries F⋆ to check the correctness of AI automation for proof-oriented programming languages like
candidate solutions. We also report on an extended version of
our dataset containing a total of 940K lines of programs and
F⋆ [9], Dafny [10], Viper [11], Verus [12] and others has
proofs, with a total of 54k top-level F⋆ definitions. We believe received comparatively less attention. The prior work [13]–
this is the largest corpus of SMT-assisted program proofs coupled [18] has been limited by the availability of data, focusing
with a reproducible program-fragment checker. instead on small, hand-crafted problem sets numbering in
Grounded in this dataset, we investigate the use of AI to the few hundreds. This is unfortunate, since proof-oriented
synthesize programs and their proofs in F⋆ , with promising
languages may be close to the limit of symbolic automation
results. Our main finding in that the performance of fine-tuned
smaller language models (such as Phi-2 or StarCoder) compare provided by SMT solvers and AI automation could open
favorably with large language models (such as GPT-4), at a the door to further automation that has remained out of
much lower computational cost. We also identify various type- reach. Additionally, to achieve the promise of trustworthy
based retrieval augmentation techniques and find that they boost AI programming, we believe it is essential to develop AI
performance significantly. With detailed error analysis and case
for program and proof synthesis rather that only on mostly
studies, we identify potential strengths and weaknesses of models
and techniques and suggest directions for future improvements. mathematical tactic-based proofs.
Index Terms—Proof Oriented Programming, AI for Proofs, Towards this end, our paper makes the following three major
Trustworthy AI programming contributions:
1. A new dataset of programs and proofs: Aiming to
I. I NTRODUCTION
spur research in AI-assisted proof-oriented programming, our
The recent excitement around AI-assisted programming has first main contribution is FS TAR DATA S ET, a dataset of F⋆
been tempered by concerns around the trustworthiness of AI- programs and proofs extracted from 2060 source files, repre-
generated code [1]. Languages that offer static guarantees senting about 600K lines of source code, drawn from 8 open
can help reduce some of these concerns, e.g., having AI source projects. The dataset provides around 32K top-level F⋆
generate safe Rust code rather than C eliminates the risk definitions, coupled with tooling that allows each definition to
of AI-introduced memory safety issues. Taking this line of be checked in isolation. We believe this is the largest dataset
thinking to its limit, using AI to generate code in proof- of SMT-assisted proof-oriented programs and we envision for
oriented languages which allow programs to be augmented it to be a live, evolving data set, with new projects added to
with specification and proofs of correctness could eliminate it over time. Indeed, currently FS TAR DATA S ET has grown to
trust in AI-generated code, so long as the specification can be include 4 additional projects reaching 940K lines of source
audited to match a human’s intent. Conversely, proof-oriented code (see Appendix A, and B for details and initial results).
Although we currently focus on F⋆ , we hope for the dataset to
∗
Work done as interns at Microsoft Research also grow to include data from other proof-oriented languages.
We describe the dataset in detail in §III. provides further discussion and analysis. We release FS TAR -
2. A benchmark problem: Grounded in this data, we de- DATA S ET in https://huggingface.co/datasets/
sign a straightforward benchmark: given a type as a formal microsoft/FStarDataSet. Details of the project, source
specification, synthesize a program that has that type. Each code and models can be found in http://fstar-lang.
of the 32K definitions in FS TAR DATA S ET yields its own org/popai
synthesis problem, where the type of the definition is the
“goal” type, and a technique is expected to synthesize a
definition that the F⋆ compiler attests is goal-type-correct. In
F⋆ , types are rich enough to capture a variety of specifications, II. BACKGROUND
ranging from simple types as in other functional programming
languages, to dependently typed specifications that capture
We start by providing some general background on F⋆ ,
functional correctness properties of a program, i.e., types
adapted from its online manual. F⋆ is a dependently typed
can represent arbitrary propositions. Dually, programs in F⋆
programming language and proof assistant. It encourages
contain computational content (e.g., logic for reversing a
proof-oriented programming, a paradigm in which one co-
list), but they can also be interpreted as proofs. As such,
designs programs and proofs which attest various aspects of
our benchmark can be seen as an instance of a type- or
a program’s correctness, including its input/output behavior
specification-directed program & proof synthesis problem. We
together with its side effects, if any.
present a simple and objective taxonomy to classify benchmark
instances, distinguishing simply typed, dependently typed, and The core design philosophy of F⋆ is that the type of a
fully specified proofs, corresponding roughly to the precision program is a specification of its runtime behavior. Many terms
of specifications. can have the same type and the same term can have many
types, e.g., e : int states that the term or program fragment
3. Designing and evaluating neural synthesis techniques:
e reduces to an integer; and e : nat states that e reduces
We apply a (by now) standard regime of prompting large
to a non-negative integer, where nat = x:int{x ≥ 0} is a
language models (LLMs) to generate solutions, backed by
refinement of type int. When proving a program e correct,
retrieval augmentation and fine-tuning techniques specific to
one specifies the properties one is interested in as a type t
our setting. In particular, we construct a prompt by augmenting
and then tries to convince F⋆ that e has type t, i.e., deriving
the goal type with related types and definitions from the
e : t.
program context, based on various embedding models. In §V,
we evaluate the performance of off-the-shelf large language Many dependently typed languages have the same core
models, including GPT-3.5 [19] and GPT-4 [20], as well design, however F⋆ is distinctive in that in addition to
as fine-tuned smaller models include Phi2-2.7B [21], Orca2- several built-in type-checking algorithms, it uses the Z3
7B [22], and StarCoder-15B [23], with the following main SMT solver [26] to try to automatically prove various obli-
takeaways. gations that arise during type-checking. For example, un-
• Fine tuned smaller models can match or outperform large der the hypothesis that x : even, proving that x + 2 : even,
language models (§V-A). where even = x:int{x % 2 == 0}, results in a query to Z3
• Different classes of problems are solved with varying of the form ∀(x:int). x % 2 == 0 =⇒ (x + 2) % 2 == 0. While
degrees of success, with common error modes differing in general the queries F⋆ presents to Z3 are undecidable,
between pretrained and fine-tuned models (§V-B). in practice Z3 successfully automates a large number of the
• Leveraging the contexts as well as retrieval augmentation queries it receives. However, when Z3 proof search fails, the
significantly boosts the quality of results (§V-C). programmer must write additional lemmas and other proofs
hints to decompose the proof into smaller pieces that can be
Based on our results, we are optimistic about for the
checked by F⋆ and Z3.
future of AI-assisted proof-oriented programming. Researchers
building verified systems in proof-oriented languages have The concrete syntax of F⋆ is based on other languages in
reported writing around 3-5 lines of proof for each line of the ML family, including OCaml and F#. Shown below is
verified code [24], [25], a considerable overhead, despite a recursive implementation of Quick Sort, together with its
strong symbolic automation from SMT solvers. For SMT- specification and proof of correctness. The type of sort states
assisted proof-oriented programs, we provide the first, sub- that for any total order f on elements of type α, given an
stantial empirical evidence that LLMs trained on corpora like input list l:list α, sort is a total function (i.e., it always
FS TAR DATA S ET, and prompted with retrieval augmentation, terminates) returning a list m:list α that is sorted according
can automate somewhere between a third and a half of proofs. to f and where m is a permutation of l. The predicates like
That said, our approach focuses on synthesizing program total_order, sorted, is_permutation etc. are other aux-
and proof fragments given their specifications: finding the iliary definitions in scope. The implementation of sort mixes
right modular decomposition to prove a program correct, the computational behavior (i.e., partitioning the list based
with the right specifications and auxiliary lemmas, is not on a pivot, recursively sorting the partitions, combining and
yet within the scope of the techniques we explore. §VI returning them) with proof steps that attest to the correctness
of the code with respect to its specification.1 • EverParse: A parser generator for binary formats [28], used
let rec sort (f:total_order_t α) (l:list α)
in various large scale systems, e.g., the Windows kernel [29]
: Tot (m:list α{ sorted f m ∧ is_permutation l m }) • HACL*: A library of verified cryptographic algorithms [25],
(decreases (length l))
= match l with
including ValeCrypt [30], a library of verified assembly
| [] → [] code, as well as EverCrypt, a cryptographic provider [31],
| pivot :: tl →
let hi, lo = partition (f pivot) tl in
including code deployed in Linux, Firefox, and Python.
let res = append (sort f lo) (pivot :: sort f hi) in • miTLS-F*: A partially verified reference implementation of
(* <proof> *)
partition_mem_permutation (f pivot) tl;
the TLS protocol [32].
append_count lo hi; append_count hi lo; • EverQuic-Crypto: A verified implementation of header and
sorted_concat f (sort f lo) (sort f hi) pivot;
append_count (sort f lo) (sort f hi);
packet protection for the QUIC protocol [33].
permutation_app_lemma pivot tl (sort f lo) (sort f hi); • Merkle-tree: A verified, incremental Merkle tree, designed
(* </proof> *)
res
for use in Azure CCF, a confidential computing system [31].
• Steel: A concurrent separation logic library, with proofs of
For example, the annotation decreases (length l) data structures and concurrency primitives [34].
indicates that the recursion terminates because the length of
The dataset will be available publicly as an open source
the list input decreases at each recursive call. Additionally, the
repository referencing the other projects as sub-modules, in-
source lines delimited by <proof> comment tags are calls to
cluding a given version of F⋆ itself. All the projects are built
F⋆ lemmas, auxiliary definitions that prove certain properties.
with the same version of F⋆ and the Z3 SMT solver, resulting
For instance, the auxiliary lemma append_count is shown
in a single archive with all the 2,060 source files and a build
below. Its type states that every call to append_count l m x
artifact for each of them, i.e., F⋆ ’s .checked files. Each
guarantees the postcondition described in the ensures
checked file is accompanied by record of metadata about its
clause; or, equivalently, the type claims the universal property
contents: for each top-level element (e.g.,, the definition of a
∀l m x. count x (append l m) = count x l + count x m.
function, or a type, or a type signature), the metadata records
The proof of this property is by induction on the list l,
its dependences, the compiler settings used to verify them, etc.
expressed as a recursive function.
Reproducibility & evolution. We aim to strike a balance
let rec append_count (l m:list α) (x:α)
: Lemma (ensures (count x (append l m) = count x l + count x m)) between reproducibility of the dataset, while also encouraging
= match l with the data set to grow and change along with projects it
| [] → ()
| hd :: tl → append_count tl m x references. Referencing the precise commit hashes of all the
referenced projects as sub-modules allows any version of the
Program and proof fragments like sort, append_count dataset to be fully reproducible. The results reported in this
etc. each constitute a top-level definition in an F⋆ program. paper focus on version 1 of the dataset, a snapshot from
Each definition yields a type-directed synthesis problem, i.e., November 2023, provided as an anonymous supplement.
given a goal type such as the first two lines of append_count
shown above, can an LLM generate a type-correct definition. A type-checking harness. To enable using the data to val-
To help the LLM succeed, we aim to augment the goal type idate program-proof synthesis attempts, we provide scripts
with related information, e.g., the definitions of symbols such that enable launching F⋆ and initializing it so that it is
as count, append etc. We evaluate the performance of various capable of checking each top-level definition independently.
retrieval augmentation strategies and LLMs on this task. Once initialized in this way, the F⋆ process can be repeatedly
queried using an interactive JSON-based API to type-check
III. A C ORPUS OF P ROOF -O RIENTED P ROGRAMS program fragments—in F⋆ , type-checking amounts to program
verification. However, replaying the F⋆ type-checker on top-
FS TAR DATA S ET is an archive of source code, build arti- level definitions in the dataset is not perfect, e.g., due to
facts, and metadata assembled from eight (8) different F⋆ - sensitivity to small changes in the search heuristics used by
based open source projects on GitHub, summarized below, SMT solvers. We use the type-checking harness to identify
with a focus on libraries that provide secure low-level software those top-level definitions whose proofs can be checked in
and the tools that enable their proofs. isolation, finding that 90% of them meet this criterion. In all
• FStar: The F⋆ compiler itself, including its standard libraries our experiments, we focus on this re-checkable fragment of
and examples. the dataset.
• Karamel: A transpiler from a subset of F⋆ called Low* to During the development of the harness, we faced similar
C, including libraries to work with a model of C types and challenges as reported by LeanDojo [7, A.2], mainly: (a)
control structures, e.g., for- and while-loops [27]. ensuring names are resolved correctly by opening the right
namespace in the right order, (b) ensuring that definitions
1 Many F⋆ examples, including the ones shown here, are similar to program to be checked cannot refer to their original version in the
proofs in languages like Dafny or Verus. However, F⋆ also allows other styles library, that they cannot use their original definition implicitly
of higher order and dependently typed definitions that are not expressible in
other SMT-assisted languages. We refer the reader to the F⋆ manual for a full via Z3, (c) verifying that solutions do not use escape hatches
discussion of similarities and differences. such as admit () or assume that are intended for interactive
TABLE I: Summary statistics of the FS TAR DATA S ET. F* Example Training
Related Example Retrieval Data
// Type:
Test val appendList (l1:list a) (l2:list a)
Metric Train Valid
Intra-project Cross-project : (res: list a{
length res = length l1 + length l2})
Number of Definitions 22779 1541 5965 1769 // Context:

Number of Projects 6 6 6 2 open FStar.List.Tot.Base // Opened Modules Premise

open FStar.Nat.Tot.Base // Opened Modules // Related Examples:
Number of Files 1216 72 306 126 Selection
module N = FStar.Nat.Tot.Base // Instantiated val appendSet: ...
...
Avg. num of lines 8.66 13.63 11.40 7.45 val length: #a:Type -> l:list a -> nat let subtractList l1 l2 = ... // Selected Premises:
let rec length #a (l: list a): nat =
Avg. num of tokens 92.16 157.26 124.32 60.32 match l with index
... add
length
# Simply Typed 6736 434 1248 149 // Configuration: head
# Dependently Typed 12047 764 3111 1431 tail
"smtencoding_valid_intro": true, ...
# Proofs 3996 343 1606 189 ...
"z3rlimit": 100,
"z3seed": 0,
Prompt
... Preparation
// Other Metadata Prompt / Input
1. Ideal Premises The type is:
2. Defn location: project, sha, file, line val appendList (l1: list a) (l2: list a) ...
use, and (d) that all files can be loaded in a consistent 3. Dependencies and Effects
... The following files are opened and instantiated:
state even though the bare dataset has clashing file names open FStar.List.Tot.Base
...
module N = FStar.Nat.Tot.Base
and conflicting options. Challenges (b) and (c) are hard to
Following definitions are in the context:
notice during development because they only result in false LM
val length: ...
...
let rec index ...
positives, i.e., the checker incorrectly claiming that a solution // Synthesized Definition: Synthesis & You may use some of all of the following premises:
is correct. The biggest development effort was related to let rec appendList l1 l2 = Verification index add length head tail
match l1, l2 with ...
Following definition are related and might be helpful
efficiently supporting (b): we added a feature to F⋆ that allows val appendSet: ...
let appendSet l1 l2 = ...

the client to partially load a compiled .checked file until we Instructions:

1. Synthesize a definition that satisfies the type
reach the definition to be checked; this ensures that exactly the Verifier 2. Start the definition with `let rec appendList`.
...

same definitions and modules are in scope that were accessible

Fig. 1: Overview of our experimental approach for synthesizing
to the original definition. After deduplication, removal of auto- F⋆ definitions.
generated code, data type definitions, and all proofs that are
fully automated by SMT, the dataset contains 32,054 top-level
definitions. other general-purpose programming languages. Synthesiz-
Partitions of the dataset. We split FS TAR DATA S ET into ing definitions in this class is non-trivial—many symbolic
four sets: training, validation, intra-project test, and cross- synthesis techniques focus exclusively on simple types,
project test. Each source file is in exactly one of these sets. while also excluding type polymorphism or non-dependent
The first three sets are taken exclusively from the F⋆ , Karamel, refinements [35]. Nevertheless, types in this class may admit
EverParse, HACL*, Merkle-tree, and Steel projects, while many simple solutions. For example, given a goal type of
the “cross-project test” (we refer as “cross-project” hereafter) nat → nat, a solution such as λ _ → 42 would succeed.
exclusively contains definitions from EverQuic-Crypto and • Proof definitions have types that represent propositions, e.g.,
miTLS-F*. The training set is closed under the dependence the type of append_count shown in §II. Such types are so
relation, i.e., a file in the training set also has all its depen- precise that F⋆ ’s logic allows proving that all well-typed
dencies in the same set. As such, the training set contains terms of a given proposition type are semantically equal.
files close to the roots of the dependence graphs for each As such, this is the most challenging class, since there is at
project. In contrast, files in the intra-project test set (refers to most one semantically unique solution and as synthesizing
as “intra-project” hereafter) may depend on files from other a solution amounts to proving a theorem in one go.
sets. Partitioning according to the dependence relation ensures • Dependently typed definitions describe all cases whose types
that our trained models have not been contaminated with any are not in the other two classes. Dependent types can encode
information that depends on the test set. Files in cross-project complex specifications, e.g., the type of sort shown in
set are from projects whose files are not in any of the other §II is a detailed functional correctness property. Therefore
sets. This allows us to do a generalization evaluation across definitions in this class in general have a higher theoretical
projects of the trained models, while again minimizing the complexity than the simply typed class. However there
potential for dataset contamination. can still be multiple semantically different solutions to a
A Taxonomy. We classify the definitions in the dataset dependently typed goal. For example, an implementation of
by their type, providing three classes corresponding loosely merge sort can be given the same type as sort.
to the theoretical complexity of synthesizing a type-correct Table I shows some statistics of FS TAR DATA S ET version
definition. 1, the basis of all experiments and analyses in this paper.
• Simply typed definitions have simple types such as
IV. N EURAL P ROGRAM & P ROOF S YNTHESIS FOR F⋆
int → int. Types that include refinements but no
dependences are also considered simply typed, e.g., Figure 1 shows a high level overview of the program and
nat → nat. Definitions that are type polymorphic, e.g., proof synthesis technique we develop in this paper. The top-
f:(α→ β) → list α→ list β , are also considered simply left of the figure shows an instance of a top-level definition
typed. Types in this class could be expressed in many from FS TAR DATA S ET, containing the dependent type ti of a
function appendList, other metadata, and, importantly, infor- type, the file context, selected premises, retrieved related
mation about its “file context”, i.e., all the top-level elements examples, together with some fixed instructions for the model.
preceding appendList in the file in which it appears. To We reserve a fixed number of tokens in the prompt for the
synthesize definitions for ti , first (at top right) we retrieve file context, premises, and related examples. The file context
related examples from the training data and select premises contains opened and instantiated modules in the file together
(i.e., ingredient definitions) that are likely to be used in the with other definitions in the file up to the point of the goal
definition body. At the center of the figure, combining ti with type, preferring definitions that are closer to the goal to fit
its file context, related examples, selected premises, we create in the fixed token budget for the file context. Likewise, for
a prompt pi . We use the prompt and the ground truth definition related examples and premises, we rank them according to
from the training data (DT ) to train models. Finally, at bottom descending similarity values, including as many as fit into their
left, for testing and evaluation, we present the same kind of respective token budgets. We organize different components
prompt to our finetuned LLMs, and evaluate the correctness within a natural language prompt, i.e., each component of
of definitions with the type-checker harness. Note, we do the prompt is preceded by a brief description of what the
not feedback type-checking results to the LLM for repair, an component is. We represent each of the related example with
enhancement we are considering for the future. its type and definition. For each of the selected premises, we
Related Examples Retrieval. Inspired by successes of pro- only use the short name of that premise in the prompt. While
viding language models with related examples (colloquially the type and definition of a premise contain more important
known as Retrieval Augmented Generation, RAG) [36]–[38], information about it, we experimentally observe that using
we provide related examples as input to the model for better other information actually hurt the ultimate performance. Our
generalization. To find related examples, we calculate the conjecture is that since the number of allowed tokens for the
similarities between the input type and the types of training prompt is limited, including additional information with each
examples. Our intuition is that if two examples’ types are premise allows fewer premises to be included in the prompt,
very close, there will be similarities between their definitions. which leads to important premises being excluded from the
We compute the similarity between types in their embedding prompt altogether. Finally, the prompt ends with a set of
space. Concretely, we embed the types into a fixed dimen- instructions including the prefix of the definition. Figure 1
sional vector space using an LLM-based embedding model, shows an example prompt at bottom right. We experiment
text-embedding-ada-002 [39], and compute the cosine on the impact of different components (context, retrieved
similarity between ti and tx ∈ DT . We use the tx and dx (i.e., examples, and selected premises) in §V-C.
definition of tx from DT ) as related examples. Training. We train different language models of varying
Premise Selection. Following Yang et al. [7], we also sizes to synthesize definitions from prompts. In particular, we
provide the models with other definitions (premises) which are trained Phi-2, Orca-2, and StarCoder with 2.7B, 7B, and 15.5B
likely to be used in the body of the definitions. We fine-tune parameters, respectively. The input to the models is the prompt
an embedding model based on all-MiniLM-L6-v2 [40]. constructed as described above and the target response is set
During inference, given a type as input the fine-tuned premise to be the definition from the dataset itself. We use standard
selection model produces ranked list of premises based on their practices for training, including low rank optimization [41],
likelihood of being used in the definition. We also explore fine- parameter efficient model tuning [42], and quantized model
tuning alternative embedding models in Section VI-D. training, enabling us to deploy the models with GPUs with
To fine-tune an embedding model e(·), we take for every smaller memory size, for both both training and inference. We
definition g in the training set the definitions p1 , . . . , pn train Phi-2 model for 10K steps, Orca-2 model for 5K steps,
whose name occurs in the definition’s value but not in and StarCoder for 2K steps, requiring roughly similar training
its type and then randomly sample q1 , . . . , qn definitions time. The training objective is to maximize the probability of
whose name occurs in neither the value nor the type but the definition given the prompt, which is realized by Cross
are in scope at the definition’s location. As training objec- Entropy Loss. We retain the model checkpoint which resulted
1
Pn T 2
tive we use the mean-square error 2n i=1 (e(g) e(qi )) +
in the least loss in the validation dataset.
2
(e(g)T e(pi ) − 1) , which aims to align the goal g to the Synthesis & Verification. Having the prompt designed, we
ground-truth premises p1 , . . . , p1 , and making the embedding synthesize the definition for a given type using the fine-tuned
of the goal orthogonal to the negative premises q1 , . . . , qn . For (also pretrained) LMs. While generating, we use temperature
evaluation, the premises are then ranked by cosine distance to based sampling. For each example, we generate k definitions
the goal. Note that, “Premise Selection” allowed us to fine- (Sk ) using the LMs. We evaluate each of these generated
tune model as we already know the ground truth premises from definitions using the type-checking harness. In the case of a
the dataset. In contrast, there is no straightforward notion of verification (type check) failure, the checker return the error
ground truth for related example, hence we use embeddings code with error message. We define the evaluation metric,
from an LLM that supposedly understands its input. verify@k as the percentage of examples in the evaluation data
Prompt Preparation. We prepare the input to the LLM for which there is at least one verified definition in the k
(colloquially known as a prompt), by combining the “goal” generated solutions by the LM.
TABLE II: Comparison between Prompting different models and GPT-4 GPT-4
finetuning for synthesizing F⋆ definitions. Combined Prompted∗ 219 83
shows the result when we combine the synthesized definitions
from all prompted experiments (5), Combined Finetuned† corre-
GPT-3.5 11 13
0 4 Phi-2 GPT-3.5 5
2 6 Phi-2
0 19 4 35

3
0 0 2 43 33

16
sponds to the combined results from finetuned experiments (3), 69 0 25
3 0 20 6 33 21
and Comb. Prompted & Finetuned‡ shows the results across all 3 5
0 116
the experiments (8). 0 1 0 00 8 27 23 29
6 13
verify@k (intra-project) verify@k (cross-project) 47 20 2 3 42 29 5 2
Strategy Model
k=1 k=5 k = 10 k=1 k=5 k = 10 17 6 3 66 1919
20 10 71 12
GPT-3.5 13.51 25.20 29.81 6.67 14.41 18.54
StarCoder Orca-7B StarCoder Orca-7B
GPT-4 19.41 31.90 36.38 13.00 23.57 28.49
Prompted Phi-2 (2.7B) 0.30 1.26 2.20 0.11 0.62 1.75 (a) Prompted Models (b) Finetuned Models
Orca-2 (7B) 2.63 6.39 8.55 0.45 2.20 3.84
Fig. 2: Venn Diagram showing intersection of examples from
StarCoder (15.5B) 1.89 7.14 12.27 0.73 3.79 6.90
Cross-Project examples solved by different models. The GPT-*
Combined Prompted∗ 26.34 40.02 45.47 16.56 29.96 36.18 performance in Fig. (b) are from prompting respective models.
Phi-2 (2.7B) 17.28 27.75 31.10 8.59 16.62 20.97
Finetuned Orca-2 (7B) 14.90 26.02 30.61 7.74 14.92 18.37
StarCoder (15.5B) 27.95 39.50 43.98 16.73 27.64 32.90
Combined Finetuned† 32.86 44.17 48.52 20.97 32.28 37.20 examples than on cross-project examples. When constructing
Comb. Prompted & Finetuned‡ 38.76 50.49 55.34 27.13 40.14 45.56 prompts, we only retrieve related examples from the training
∗ : verify@k interprets as verify@5k. † : verify@k interprets as verify@3k, data split, which specifically does not include any examples
and ‡ : verify@k interprets as verify@8k. from the cross-project set—as such, the cross-project examples
benefit less from retrieval augmentation.
V. E MPIRICAL E VALUATIONS The performance of prompted smaller models (Phi-2, Orca-
2, and StarCoder) is significantly worse. Among these models,
A. Performance of Different Models we do observe a positive correlation between the model
In this section, we evaluate the ability of language models to size and performance, with the 15.5B parameter StarCoder
synthesize F⋆ programs and proofs, including of state-of-the- performing better than Orca-2 (7B parameters), which in turn
art LLMs such as GPT-3.5 and GPT-4, which are available is better than Phi-2 (2.7B parameters). Their relative perfor-
for inference through APIs, as well as fine-tuned small and mance may possibly also be attributed to their pre-training
medium-sized language models deployed on local machines, data. Unlike Phi-2 and Orca-2, StarCoder’s pre-training data
posing the following research question: contains OCaml code from GitHub; the syntactic similarity
RQ1. Can language models effectively synthesize F⋆ definitions given
between OCaml and F⋆ could also contribute to the relatively
the specification? higher success of prompted StarCoder.
When we fine-tune these models, their performance signif-
Experimental Setup. We evaluate all models using the same icantly improves. For example, fine tuning Phi-2 improves its
prompt, as described in §IV. We divide the experiments into verify@10 score by more than a factor of 10. In fact, fine-
two solution strategies: (i) prompting and (ii) fine-tuning. tuned Phi-2 slightly outperforms GPT-3.5, while fine-tuned
Prompting involves giving a pre-trained language model a task StarCoder also outperforms GPT-4. The combined verify@nk
and assessing whether or not it is able to solve it. Fine-tuning, results suggest that one could use one or more cheaper, fine-
on the other hand, involves further refinement of a model tuned models for synthesis at first, falling back to the larger
parameters based on a task-specific dataset. For prompting models only when the smaller models fail.
LLMs, we use the OpenAI Python API. For prompting smaller Fine-tuning equips a model with project-specific code id-
language models, namely Phi-2, Orca-2, and StarCoder, we ioms to improve performance. However, we argue that fine-
fine-tune these models (from the HuggingFace library [43]) as tuning also yields benefits transferable across projects. Fig-
described in §IV and use the fine-tuned model checkpoints for ure 2 shows the intersection of successes in the cross-project
inference. We set the same token limits for all models: 500 examples between different models in both prompted and
tokens for context, 400 for related examples, 300 for selected fine-tuned settings. As Figure 2a shows, 451 examples are
premises, and 500 for generated definitions. correctly solved by GPT-3.5 and GPT-4 exclusively, which
Results. Table II shows the results of different strategies could not be solved by prompting any other models; other
across different models. We focus on the verify@k metric models exclusively solved only 53 examples. In contrast, after
defined in §IV, but also report a combined verify@nk metric fine-tuning (Figure 2b), prompted GPT models could only
which considers an example solved if the instance was solved exclusively solve 137 problems, the rest 314 are solved by at
by any of n models under the verify@k metric. least one fine-tuned model. On the other hand, 165 examples
Both GPT-3.5 and GPT-4 solve a significant fraction of are exclusively solved by the smaller models that neither GPT-
problems in the verify@10 metric, though GPT-4 is better. 4 nor GPT-3.5 can solve. We believe this demonstrates the
Their success is likely due to F⋆ code being part of their effectiveness of fine-tuning beyond specific projects they are
training data. Both models also perform better on intra-project trained on.
70 70
Model (Solid = Prompted, Hollow = Finetuned) Model (Solid = Prompted, Hollow = Finetuned)

61.46
60 GPT-3.5 Phi-2 StarCoder 60 GPT-3.5 Phi-2 StarCoder

55.45
54.01
GPT-4 Orca-2 GPT-4 Orca-2
Percentage of Examples Verified

Percentage of Examples Verified

52.40
50.72

50.34
50 50

44.30
41.61
41.21

38.93
35.74
40 40

34.20

33.56

32.01
27.67
26.94

26.90
26.52

25.93
25.72
30 30

23.41
23.32

22.22
20.80

20.06
19.12

17.89

16.84
16.78
20 20
14.42

13.76
11.64

11.64
10.07
10.06
8.65

7.97

7.41
10 10

5.80
4.03
3.80

3.56
2.56

2.37

2.12
1.96

1.47

1.06
0 Simply Typed Dependently Typed Proof 0 Simply Typed Dependently Typed Proof
(a) Intra Project evaluation (b) Cross Project evaluation
Fig. 3: Verify@10 across different types of examples for different models.

Result 1: Language models are helpful in automating Errors (Solid = Prompted, Hollow = Finetuned)
the synthesis of definitions and proofs in F⋆ . Fine-tuning Syntax Error Identifier not found Semantic Error
100.0%
smaller models can match or outperform the performance of
large language models. Despite being significantly smaller,
75.0%

Percentage of Errors
the fine-tuned Phi-2 (2.7B) model slightly outperforms GPT-
3.5 by up to 5%. In addition, StarCoder (15.5B) outperforms
the most advanced GPT-4 by up to 21%. Further, our 50.0%

evaluation shows the generalizability of these fine-tuned

models on unseen projects. 25.0%

0.0% GPT-3.5 GPT-4 Phi-2 Orca-2 StarCoder

B. Understanding models’ successes and failures
Fig. 4: Different errors spawned from the generated definitions
In this section we focus on the following research question: by different models, presented as percentage of all errors.
RQ2. What are the models’ success rate across different problem types
and what kinds of errors do they produce?
categories is also systematically better than on cross-project
Experimental Setup. From §III, recall that we divide the examples. In all cases, fine-tuned StarCoder performs the best.
problems in our dataset into three categories, i.e., (i) simply We also examined specific successes from fine-tuned Star-
typed definitions, (ii) dependently typed definitions, and (iii) Coder to give a qualitative sense of the kinds of problems it
proofs. We report on the performance of models in each of is able to solve. We give two representative examples.
these categories. We also analyze the types of errors reported Our first example is dependently typed, for
by the type-checking harness on synthesized code. We divide FStar.OrdSet.where, a function that filters and ordered set s
the errors into three broad categories (i) Syntax errors: where to those elements that satisfy a condition c. The specification
the F⋆ could not parse the generated code, (ii) Identifier not provided in the prompt is very detailed and includes functional
found errors: where the definition was parsed, but one or more correctness. The model is able to synthesize the type-correct
identifiers could not be resolved, and (iii) Semantic errors: definition shown, a fairly typical functional program. To
the definition is syntactically valid and has no unresolved put this in perspective, compared to AI automation for
identifiers, but the type-checker could not verify the code tactic-based proofs, our work exploits an LLMs ability to
against the goal type. generate programs, and relies on F⋆ ’s dependent types and
Results. Figure 3 shows the accuracy of different models SMT-based automation to certify synthesis results.
in different classes of problems under the verify@10 met- let rec where #a #f (s:ordset a f) (c:condition a)
ric. Across all configurations, models are most successful at : Pure (ordset a f)
(requires ⊤)
simply typed definitions. For dependently typed definitions, (ensures λ (z:ordset a f) →
we observe slightly lower performance across all models (as_list #a z == FStar.List.Tot.Base.filter c
(as_list s)) ∧ (∀ x. mem x z = (mem x s && c x)) ∧
compared to simply typed definitions and performance on (if size z > 0 && size s > 0 then
proof definitions is lower still. This is in keeping with our f (head s) (head z) else true))
= match s with
expectations: as described in §III, our taxonomy is meant to | [] → empty
roughly capture the theoretical complexity of the problems | x::q →
let z = where q c in
and the model performance appears to validate this view. As if c x then insert’ x z else z
in §V-A, performance on intra-project examples across all
TABLE III: Impact of different sources of information in the
Our next example is from EverQuic-Crypto, a cross-project
finetuned Phi-2 model.
proof problem. The lemma is about the property of a binary
formatted header, using functions from the EverParse library. Experiment Auxiliary Information verify@10
The preceding file context for this example contains a similar Name Context RE Premise Intra Cross
proof and the related examples includes an example from the Phi2full ✓ ✓ ✓ 31.10 20.97
EverParse library. In this case, the model has successfully Phi2-ctx ✗ ✓ ✓ 21.89 10.97
adapted a related proof from the file context. Indeed, program Phi2-re ✓ ✗ ✓ 25.92 21.82
Phi2-pre ✓ ✓ ✗ 31.15 20.35
proofs often contain many similar elements, tedious for a
Phi2ideal ✓ ✓ Ideal 35.84 23.74
human to write, but models are adept at identifying common
patterns and adapting them accordingly. RE = Related example definition from training set

let serialize_header_is_retry
(short_dcid_len: short_dcid_len_t) Result 2: Model performance follows our taxonomy, with
(h: header’ short_dcid_len) simply typed problems being most commonly solved, then
: Lemma (
let s = LP.serialize (serialize_header short_dcid_len) h in dependently typed, and finally proofs. Models are able to
Seq.length s > 0 ∧ generate type-correct functional programs with complex
(is_retry h ⇐⇒
(LPB.get_bitfield (U8.v (Seq.index s 0)) 7 8 == 1 ∧ dependent types, and to generate proofs by adapting similar
LPB.get_bitfield (U8.v (Seq.index s 0)) 4 6 == 3) examples. For most of the pre-trained models, the major
)
) = serialize_header_eq short_dcid_len h; source of errors are syntax errors, while after fine-tuning,
let tg = first_byte_of_header short_dcid_len h in “Identifier not found” consist of the majority of errors,
let x = LPB.synth_bitsum’_recip first_byte tg in
LP.serialize_u8_spec x; followed by semantic errors.
let s = LP.serialize (serialize_header short_dcid_len) h in
assert (Seq.index s 0 == x);
assert (is_retry h ⇐⇒ ( C. Impact of components of the prompt
LPB.get_bitfield (U8.v (Seq.index s 0)) 7 8 == 1 ∧
LPB.get_bitfield (U8.v (Seq.index s 0)) 4 6 == 3 Recall from §IV, a prompt contains the (1) local file context;
))
(2) related examples retrieved from the training data; and,
Turning our attention to errors, Figure 4 plots three different (3) selected premises by the premise selection model. In
class of errors made by models as a percentage of the total this section, we evaluate the impact of each of these three
erroneous solutions they generated. Interestingly, for GPT-3.5, components on model performance.
the largest error class is “Syntax error”, whereas for GPT- RQ3. How do the different prompt components impact the effective-
4 it is “Identifier not found”—GPT-4 seems to be better at ness of language models?
F⋆ syntax than GPT-3.5. For the smaller models, when we
prompt them, most of the errors are syntax errors. This is not Experimental Setup. We take as a baseline for our expe-
surprising, since none of them have been trained on F⋆ . When rience the performance a fine-tuned Phi-2 model (Phi2full )
we fine-tune the models, the largest error class is “Identifier not that has access to all three prompt components. We fine-
found”, indicating that models often hallucinate identifiers. For tune three other versions of Phi-2, each time dropping one
example, the following definition is generated by StarCoder: of three three prompt components: for the Phi2-pre model,
we drop the selected premises from the prompt; Phi2-ctx ,
let eqList_ok (#a: Type) (d: deq a) : drops file context information; and Phi2-re , drops the related
Lemma (decides_eq #(list a) (Raw.eqList d.raw.eq)) = examples. In addition, since we know the ideal premises used
let open FStar.Classical in
let lemma_eq (xs ys: list a) in the ground truth definition, we fine-tune another version of
: Lemma (Raw.eqList d.raw.eq xs ys) = Phi2full , where instead of the selected premised, we used the
FStar.List.Tot.lemma_eq_intro (
Raw.eqList d.raw.eq actual premises (Phi2ideal ). Phi2ideal is not realistic, but it
) xs ys; () in
Classical.forall_intro (lemma_eq)
serves as a roof-line for premise selection. We evaluate these
variants on verify@10 as well on the numbers of errors each
While seemingly syntactically correct, this definition uses of these models produce. We also experiment with different
the symbol lemma_eq_intro, which is not in scope. Adapt- ways to combine and present these prompt components to the
ing recent techniques to guide models based on lightweight model. Note that, since these experiments entail a large number
static analyses (e.g., based on the identifiers in scope) is a of fine-tuning runs, we chose to focus on the Phi-2 model, as
promising direction for the future [44], [45]. it is the most cost-effective. We believe our findings should
The last, but not the least, category of error is Semantic generalize to other models.
error, which is a collection of many different errors. Notable Results. Table III summarizes our results. The baseline
among these include Type Error: a value or identifier Phi2full model performance is as in Table II. When we remove
with incompatible type is used and Z3 Solver Error: the related examples from the prompt (Phi2-re ) and fine-
Z3 cannot prove the SMT query. Program repair techniques, tune a model, for intra-project, the performance drops by ∼5
both search-based [46] and LLM-assisted [2], may help reduce percentage points. Interestingly, without the related examples,
some of these errors, another direction for the future. the performance in cross-project examples increases slightly.
Since we extract the related examples from other projects in Model (Solid = Prompted, Hollow = Finetuned)
GPT-3.5 GPT-4 Phi-2 Orca-2 StarCoder
the training set, and there is little to no helpful examples in 100

Ground Truth and Verified Solutions

Normalized Distance Between
the Phi2full model for the cross-project examples. Thus we
80
conjecture, the related examples from the same project helps
the model most. 60

On the other hand, in Phi2-ctx , we observe the performance 40

dropping significantly from the Phi2full model for both intra- 20
project and cross-project examples. Such a drop is not sur-
0
prising, since even for a human user, the file context has the Simply Typed Dependently Typed Proof
most relevant information about the target definition, e.g., in
Fig. 5: Normalized Levenshtein distance between verified solu-
the FStar.OrdSet.where example shown in §V-B, the local tions and ground truth across different classes.
file context defines the type ordset a f as an ordered list,
crucial information for synthesizing a solution. TABLE IV: Performance of different state-of-the-art LLMs with
Finally, when we remove the selected premises (Phi2-pre ), increasing model capacity for 100 randomly sampled examples.
the performance remains very close to the Phi2full model. This
Model Max number of tokens verify@10
is surprising, since prior work [7] demonstrates that premise
2,048 29
selection improves the performance of proof synthesis. To GPT-3.5
16,000 35
investigate further, we fine-tuned Phi2ideal , where instead of
2,048 37
using premise selection model, we use the ideal premises. GPT-4
16,000 50
The results reveal that, with ideal premises, the performance StarCoder (finetuned) 2,048 45
does improve significantly from Phi2full model to 35.84%
in intra-project and 23.74%in cross-project. Additionally, the
number of “Identifier not found” errors with Phi2ideal reduces
B. Impact of maximum tokens on model performances
significantly to 22149 from 30108 such errors in Phi2full .
As such, Phi2ideal suggests that a premise selection model Throughout the paper, for a fair comparison, we restrict
that better approximates the ideal premises can provide a the maximum number of tokens for different models to 2048
significant improvement—see §VI for more discussion. tokens. However, some LLMs accommodate much larger
Result 3: File context and related examples are very im- prompts, e.g., gpt-3.5-turbo-16K with 16K. While ex-
portant sources of information for synthesizing definitions in perimenting on these larger context models are expensive,
F⋆ . Without the context, the performance of a model drops we evaluate GPT-3.5 and GPT-4 models allowing maximum
down by up to 29.6% and up to 16.55% without the related 10K tokens for context, 3K for related examples, and 2K for
examples. An ideal premise selection roof line suggests selected premises, and 1K maximum generation length. To
that it may also provide significant improvements, however keep the cost of experiment tractable, we randomly sample 100
our current premise selection model does not significantly problems for evaluation. As Table IV shows, we observe that,
impact performance. with increased capacity, both GPT-3.5 and GPT-4 improves
by a large margin compared smaller capacities. While GPT-
4 with 16K token capacity achieves 50% verify@10, fine-
VI. F URTHER A NALYSIS & D ISCUSSION tuned StarCoder achieves 45% with only 2K tokens. Further
This section takes a closer look at some of the results from experiments with GPT-4 on the entire data set would be more
§V. We report on additional experiments and explore potential definitive, though the cost is prohibitive.
for future improvements.
C. Definition ingredients in the prompt and its impact
A. Syntactic Evaluation of Model Generated Definitions
Following our observation that that majority of the errors are
In addition to the semantic verify@k metric of the prior due to hallucinated identifiers, we investigated how much each
section, we also evaluate how close a model’s generated of the prompt components help the model generate correct
solutions are to the ground truth solution (g). We calculate identifiers. We calculated the percentage of overlap between
the normalized edit (Levenshtein) distance between tokenized the identifiers used in the ground truth and different prompt
g and solution v. For every problem which has a non- components and investigate whether this correlates with model
empty set of verified candidate solution (V S), we define the performance. Figure 6 shows that when there is a higher
distance as minv∈V S lev dist(g, v). The lev dist calculates overlap between an information modality and the ground truth
the edit distance between g and v and normalized it w.r.t. identifiers, the models are more likely to generate verified
the number of tokens in g. Figure 5 shows the distribution definitions (with statistical significance). While in theory, it
of distances. Prompted LLMs such as GPT-3.5 and GPT-4 is impossible to know the name of the identifiers that will be
generate solutions that are further away from the ground truth used in a definition a priori, the analysis suggests that better
compared to fine-tuned models. retrieval augmentation could boost performance.
100 Verified
In this version, every input modalities are surrounded by tags
Between Target definition and
% of Overlapping Identifiers such as <context> and </context>, surrounding the file
80 Failed
context <related> and </related> around the related
60 examples. In another version, we create a completion style
where all the prompt components are concatenated with a sep-
40 arator token <|end_of_text|>, the default padding token
of Phi-2, without any description of each of the components.
20
Natural language prompts are slightly better that structured
0 prompts, while the completion-style prompt performs sub-
Selected Premises Context Related Examples stantially worse. We conjecture that since the Phi-2 model is
pvalue << 0.05 pvalue << 0.05 pvalue << 0.05
Cohen's d : 0.28 Cohen's d : 0.34 Cohen's d : 0.84 pre-trained mostly on natural language, wrapping the prompt
components with natural language help during fine-tuning; the
Fig. 6: Overlap between identifiers in ground truth and identifiers
in each prompt component
structured tag formats also help the model distinguish the
components. Such a distinction is lost in the completion style
TABLE V: Evaluation of alternative models for the premise prompt.
selection. An asterisk (*) indicates a fine-tuned model.
VII. T HREATS & L IMITATIONS
Intra-Project Cross-Project
Model
MAP NDCG MAP NDCG Data contamination. The training data for GPT-3.5 and
Pythia 70M 0.15 0.37 0.14 0.37 GPT-4 is not publicly known, but it is very likely that it
Pythia 160M 0.15 0.38 0.13 0.37 intersects the intra-project and cross-project test sets because
all-MiniLM-L6-v2 0.15 0.38 0.16 0.40
OpenAI Ada 0.19 0.42 0.19 0.43 those are taken from repositories publicly hosted on GitHub.
Pythia 70M* 0.33 0.55 0.15 0.40
On the other hand, despite maintaining file level separation
Pythia 160M* 0.34 0.56 0.13 0.38 between train and test sets, we observe that there are a small
all-MiniLM-L6-v2* 0.32 0.54 0.18 0.43 number of clones (i.e., examples with different names, yet
same definitions) between train and test set. In particular, we
identified 343 such clones in intra-project test set and 7 in
D. Evaluating Different Premise Selection Models. the cross-project test set. While these solutions are clones, the
We compared four embedding models for premise selection, context they are in are different. Hence, the prompt to the
before and after fine-tuning: the small 22M parameter model language model (see Section IV for prompt preparation) are
all-MiniLM-L6-v2 from sentence transformers [40], the different for these. While the related examples retrieval method
text-embedding-ada-002 model [39] from OpenAI, as are in general capable of finding these clones, we argue that
well as two generic transformer models from the Pythia [47] such a setup is not unrealistic for real development, where
family with 70M and 160M parameters. Comparing the Mean developer often reuse code written elsewhere.
Average Precision (MAP) and Normalized Cumulative Dis- Hints about the definition problem statement. The synthe-
counted Gain (NDCG) (Table V), without fine-tuning the Ada sis problem in this paper requires the model to synthesize a
model achieves roughly 25% higher performance. Surpris- definition for a given type. However, there are some implicit
ingly, the off-the-shelf Pythia models are competitive with hints about the definition that are present in the types, and
all-MiniLM-L6-v2 even though they are not trained on context. For instance, consider the example sort in §II,
any contrastive objective. After fine-tuning, all of the mod- the type contains an effect decreases (length l), which
els have comparable performance irrespective of model size. implicitly tells the model that there will be an induction on
However, the training does not generalize well and fine-tuning the length of thel. While these are hints potentially helping
only barely improves performance on the cross-project set. the model, we argue that these are the hints that a developer
writing F⋆ code may already know.
E. Alternative Prompt Preparation
Preciseness of specifications. We consider a problem to be
TABLE VI: Different Input Formatting. solved if the solution type-checks. For problems in the proof
class, all solutions are equivalent, so no further inspection
Input Type verify@10 is necessary for a successful proof. However, in other cases,
Intra-Project Cross-Project
specifications may only be partial and, although type-correct,
NL Prompt 31.10 20.97 a solution may still be subject to inspection to confirm if it
Structured 30.76 19.45
Completion 10.54 12.21 matches a user’s under-specified intent.
Non-deterministic nature of LLMs. Large Language Mod-
In addition to presenting a natural language prompt (NL els, especially those for which we do not have access to
prompt) (see §IV for details), we investigated alternative the model weights (e.g., GPT-3.5/GPT-4) often go through
prompting strategies. In particular, we created a structured for- regular updates. Hence, it it very difficult, if not impossible, to
mat format similar to the format proposed by Gupta et al. [48]. reproduce some of the results from those LLMs. In contrast,
the fine-tuned model, being smaller in size, will be accessible IX. C ONCLUSION
to those seeking to reproduce our results. Aiming to enhance both the trustworthiness of AI-generated
code and to ease program proof, we investigate AI-based
VIII. R ELATED W ORK program and proof synthesis backed by a program verifier.
While prior work has predominantly focused on tactic-based
In the past 15 years, an increasing number of software
proof, we investigate AI automation for F⋆ , a proof-oriented
systems have been proven correct, both using interactive the-
language with SMT-based automation. To fuel future research
orem provers like Coq and Isabelle/HOL, as well using SMT-
in this direction, we introduce the FS TAR DATA S ET, a public
assisted proof-oriented languages like F⋆ and Dafny. Software
benchmark dataset of F⋆ programs and proofs. We expect
proofs require expertise and can be quite verbose. A common
FS TAR DATA S ET to evolve and grow: indeed in recent weeks,
metric used to evaluate proof effort is a proof:code ratio, i.e.,
after our experiments, it has grown to include four more
the number of lines of proof for each line of executable code.
projects, reaching 940K lines of F⋆ code and proofs.
This ratio is typically higher in interactive proof assistants
On a type-based program and proof synthesis task, we
than in SMT-assisted proof-oriented languages. For example,
evaluate several state-of-the-art LLMs such as GPT-3.5 and
sel4 in Isabelle/HOL [49] incurs a proof:code ratio of 20:1
GPT-4, fine-tune smaller language models such as StarCoder.
despite the use of automated theorem provers integrated with
Our findings suggest that LLMs, when trained on appropriate
Isabelle/HOL, e.g., Sledgehammer [50]. Meanwhile, Ironclad
datasets and prompted effectively with necessary information
in Dafny [24] reports a ratio of 5:1, and EverCrypt in F⋆ [31]
through retrieval, can automate a significant portion programs
reports a ratio of 3:1, through the use of SMT-solvers and pro-
and proof. However, beyond boosting synthesis performance,
grams that mix computational content and proofs. Improved
many interesting problems remain, notably around specifi-
automation through the use of AI in both settings could help
cation formulation and modular decomposition of proofs,
lower these overheads.
exciting areas for further exploration.
In the context of interactive theorem provers, machine-
Overall, with this paper, we contribute to the ongoing
learning has been used to improve premise selection in the-
discourse on trustworthy AI programming and lay a foundation
orem provers [51], [52]. Predictive models based on existing
for advancements in AI-assisted proof-oriented programming.
proofs to guide proof search have been used in TacticToe [53],
TacTok [54]. GPT-f [55] uses expert iteration and HTPS [3] R EFERENCES
online reinforcement learning to improve the model by self-
[1] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep
learning based on previous proof attempts. Baldur [2] uses an at the keyboard? assessing the security of github copilot’s code contri-
LLM-based synthesis and fine-tuned repair model to complete butions,” in 2022 IEEE Symposium on Security and Privacy (SP), 2022,
full proofs for Isabelle/HOL theorems. LeanDojo [7] and pp. 754–768.
[2] E. First, M. Rabe, T. Ringer, and Y. Brun, “Baldur: Whole-proof
LEGO-Prover [56] integrate retrieval augmentation to find generation and repair with large language models,” in Proceedings of
relevant existing theorems and proofs from the library. Draft- the 31st ACM Joint European Software Engineering Conference and
sketch-proof [57] introduced the use of informal sketches as Symposium on the Foundations of Software Engineering, 2023, pp.
1229–1241.
an intermediate step, which has also been used in [58], [59]. [3] G. Lample, T. Lacroix, M.-A. Lachaux, A. Rodriguez, A. Hayat,
In the context of SMT-assisted program verifiers, recent T. Lavril, G. Ebner, and X. Martinet, “Hypertree proof search for neural
theorem proving,” Advances in Neural Information Processing Systems,
approaches leverage LLMs for synthesizing loop invariants vol. 35, pp. 26 337–26 349, 2022.
[13]–[16]. Closest to our work, Rakib et al. [18] leverage [4] A. Thakur, Y. Wen, and S. Chaudhuri, “A language-agent approach to
LLMs along with retrieval augmentation to generate both formal theorem-proving,” arXiv preprint arXiv:2310.04353, 2023.
[5] A. Sanchez-Stern, Y. Alhessi, L. Saul, and S. Lerner, “Generating
functional specification and code with proofs from natural correctness proofs with neural networks,” in Proceedings of the 4th
language in Dafny. However, the specifications generated ACM SIGPLAN International Workshop on Machine Learning and
may be too weak or incorrect and requires a human to Programming Languages, 2020, pp. 1–10.
[6] K. Yang and J. Deng, “Learning to prove theorems via interacting with
audit them; this makes it subjective and difficult to treat proof assistants,” in International Conference on Machine Learning.
as a benchmark problem set unlike FS TAR DATA S ET where PMLR, 2019, pp. 6984–6994.
specifications are drawn from ground truth. In addition, prior [7] K. Yang, A. Swope, A. Gu, R. Chalamala, P. Song, S. Yu, S. Godil,
R. J. Prenger, and A. Anandkumar, “Leandojo: Theorem proving with
works on automating proof-oriented programming [18], [60] retrieval-augmented language models,” Advances in Neural Information
are evaluated on much smaller benchmarks with less than Processing Systems, vol. 36, 2024.
200 mostly introductory algorithmic tasks. In contrast with [8] T. mathlib Community, “The lean mathematical library,” in Proceedings
of the 9th ACM SIGPLAN International Conference on Certified
FS TAR DATA S ET we provide a benchmark with more than Programs and Proofs, ser. CPP 2020. New York, NY, USA:
32K reproducible F⋆ programs and proofs that form part of Association for Computing Machinery, 2020, p. 367–381. [Online].
production software deployed in real-world. Available: https://doi.org/10.1145/3372885.3373824
[9] N. Swamy, C. Hritcu, C. Keller, A. Rastogi, A. Delignat-Lavaud,
Additionally, many works use LLMs for general-purpose S. Forest, K. Bhargavan, C. Fournet, P.-Y. Strub, M. Kohlweiss, J.-K.
programming tasks [61]–[64]. Recent approaches [65], [66] Zinzindohoué, and S. Zanella-Béguelin, “Dependent types and multi-
even attempt at formalizing functional specifications for lan- monadic effects in F*,” in 43rd ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages (POPL). ACM, Jan. 2016,
guages such as python—without access to static verifiers, these pp. 256–270. [Online]. Available: https://www.fstar-lang.org/papers/
functional specifications often are used as runtime assertions. mumon/
[10] K. R. M. Leino, “Dafny: An automatic program verifier for functional [29] N. Swamy, T. Ramananandro, A. Rastogi, I. Spiridonova, H. Ni,
correctness,” in Logic for Programming, Artificial Intelligence, and D. Malloy, J. Vazquez, M. Tang, O. Cardona, and A. Gupta,
Reasoning: 16th International Conference, LPAR-16, Dakar, Senegal, “Hardening attack surfaces with formally proven binary format
April 25–May 1, 2010, Revised Selected Papers 16. Springer, 2010, parsers,” in Proceedings of the 43rd ACM SIGPLAN International
pp. 348–370. Conference on Programming Language Design and Implementation
[11] P. Müller, M. Schwerhoff, and A. J. Summers, “Viper: (PLDI ’22), June 13–17, 2022, San Diego, CA, USA, 2022. [Online].
A verification infrastructure for permission-based reasoning,” in Available: https://www.fstar-lang.org/papers/EverParse3D.pdf
Verification, Model Checking, and Abstract Interpretation (VMCAI), [30] A. Fromherz, N. Giannarakis, C. Hawblitzel, B. Parno,
ser. LNCS, B. Jobstmann and K. R. M. Leino, Eds., vol. A. Rastogi, and N. Swamy, “A verified, efficient embedding
9583. Springer-Verlag, 2016, pp. 41–62. [Online]. Available: of a verifiable assembly language,” PACMPL, no. POPL, 2019.
https://doi.org/10.1007/978-3-662-49122-5 2 [Online]. Available: https://github.com/project-everest/project-everest.
[12] A. Lattuada, T. Hance, C. Cho, M. Brun, I. Subasinghe, Y. Zhou, github.io/raw/master/assets/vale-popl.pdf
J. Howell, B. Parno, and C. Hawblitzel, “Verus: Verifying rust [31] J. Protzenko, B. Parno, A. Fromherz, C. Hawblitzel, M. Polubelova,
programs using linear ghost types,” Proc. ACM Program. Lang., K. Bhargavan, B. Beurdouche, J. Choi, A. Delignat-Lavaud, C. Fournet,
vol. 7, no. OOPSLA1, apr 2023. [Online]. Available: https: N. Kulatova, T. Ramananandro, A. Rastogi, N. Swamy, C. M. Win-
//doi.org/10.1145/3586037 tersteiger, and S. Zanella-Beguelin, “Evercrypt: A fast, verified, cross-
[13] A. Kamath, A. Senthilnathan, S. Chakraborty, P. Deligiannis, S. K. platform cryptographic provider,” in 2020 IEEE Symposium on Security
Lahiri, A. Lal, A. Rastogi, S. Roy, and R. Sharma, “Finding in- and Privacy (SP), 2020, pp. 983–1002.
ductive loop invariants using large language models,” arXiv preprint [32] K. Bhargavan, A. Delignat-Lavaud, C. Fournet, M. Kohlweiss, J. Pan,
arXiv:2311.07948, 2023. J. Protzenko, A. Rastogi, N. Swamy, S. Zanella Béguelin, and J. K.
[14] S. Chakraborty, S. K. Lahiri, S. Fakhoury, M. Musuvathi, A. Lal, Zinzindohoue, “Implementing and proving the TLS 1.3 record layer,”
A. Rastogi, A. Senthilnathan, R. Sharma, and N. Swamy, “Ranking IEEE Security & Privacy, 2017.
llm-generated loop invariants for program verification,” arXiv preprint [33] A. Delignat-Lavaud, C. Fournet, B. Parno, J. Protzenko, T. Ramananan-
arXiv:2310.09342, 2023. dro, J. Bosamiya, J. Lallemand, I. Rakotonirina, and Y. Zhou, “A security
[15] K. Pei, D. Bieber, K. Shi, C. Sutton, and P. Yin, “Can large language model and fully verified implementation for the ietf quic record layer,”
models reason about program invariants?” 2023. in 2021 IEEE Symposium on Security and Privacy (SP), 2021, pp. 1162–
[16] C. Liu, X. Wu, Y. Feng, Q. Cao, and J. Yan, “Towards general loop 1178.
invariant generation via coordinating symbolic execution and large [34] A. Fromherz, A. Rastogi, N. Swamy, S. Gibson, G. Martı́nez,
language models,” arXiv preprint arXiv:2311.10483, 2023. D. Merigoux, and T. Ramananandro, “Steel: Proof-oriented
[17] C. Sun, Y. Sheng, O. Padon, and C. Barrett, “Clover: Closed-loop programming in a dependently typed concurrent separation
verifiable code generation,” arXiv preprint arXiv:2310.17807, 2023. logic,” in 25th ACM SIGPLAN International Conference on
Functional Programming (ICFP), Aug. 2021. [Online]. Available:
[18] M. Rakib Hossain Misu, C. V. Lopes, I. Ma, and J. Noble, “Towards ai-
https://www.fstar-lang.org/papers/steel/
assisted synthesis of verified dafny methods,” arXiv e-prints, pp. arXiv–
[35] P.-M. Osera and S. Zdancewic, “Type-and-example-directed program
2402, 2024.
synthesis,” SIGPLAN Not., vol. 50, no. 6, p. 619–630, jun 2015.
[19] R. OpenAI, “Gpt-4 technical report. arxiv 2303.08774,” View in Article,
[Online]. Available: https://doi.org/10.1145/2813885.2738007
vol. 2, 2023.
[36] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang,
[20] OpenAI, “Gpt-4 technical report,” 2023. J. Callan, and G. Neubig, “Active retrieval augmented generation,” arXiv
[21] S. Gunasekar, Y. Zhang, J. Aneja, C. C. T. Mendes, A. Del Giorno, preprint arXiv:2305.06983, 2023.
S. Gopi, M. Javaheripi, P. Kauffmann, G. de Rosa, O. Saarikivi et al., [37] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal,
“Textbooks are all you need,” arXiv preprint arXiv:2306.11644, 2023. H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-
[22] A. Mitra, L. Del Corro, S. Mahajan, A. Codas, C. Simoes, S. Agarwal, augmented generation for knowledge-intensive nlp tasks,” Advances in
X. Chen, A. Razdaibiedina, E. Jones, K. Aggarwal et al., “Orca Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
2: Teaching small language models how to reason,” arXiv preprint [38] M. R. Parvez, W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W.
arXiv:2311.11045, 2023. Chang, “Retrieval augmented code generation and summarization,”
[23] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, arXiv preprint arXiv:2108.11601, 2021.
M. Marone, C. Akiki, J. Li, J. Chim et al., “Starcoder: may the source [39] A. Neelakantan, T. Xu, R. Puri, A. Radford, J. M. Han, J. Tworek,
be with you!” arXiv preprint arXiv:2305.06161, 2023. Q. Yuan, N. Tezak, J. W. Kim, C. Hallacy, J. Heidecke, P. Shyam,
[24] C. Hawblitzel, J. Howell, J. R. Lorch, A. Narayan, B. Parno, D. Zhang, B. Power, T. E. Nekoul, G. Sastry, G. Krueger, D. Schnurr, F. P.
and B. Zill, “Ironclad apps: End-to-End security via automated Full- Such, K. Hsu, M. Thompson, T. Khan, T. Sherbakov, J. Jang,
System verification,” in 11th USENIX Symposium on Operating Systems P. Welinder, and L. Weng, “Text and code embeddings by contrastive
Design and Implementation (OSDI 14). Broomfield, CO: USENIX pre-training,” CoRR, vol. abs/2201.10005, 2022. [Online]. Available:
Association, Oct. 2014, pp. 165–181. [Online]. Available: https://www. https://arxiv.org/abs/2201.10005
usenix.org/conference/osdi14/technical-sessions/presentation/hawblitzel [40] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings
[25] J.-K. Zinzindohoué, K. Bhargavan, J. Protzenko, and B. Beurdouche, using siamese bert-networks,” in Proceedings of the 2019 Conference
“HACL*: A verified modern cryptographic library,” in ACM Conference on Empirical Methods in Natural Language Processing. Association
on Computer and Communications Security. ACM, 2017, pp. 1789– for Computational Linguistics, 11 2019. [Online]. Available: https:
1806. [Online]. Available: http://eprint.iacr.org/2017/536 //arxiv.org/abs/1908.10084
[26] L. De Moura and N. Bjørner, “Z3: An efficient smt solver,” in Tools [41] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang,
and Algorithms for the Construction and Analysis of Systems: 14th and W. Chen, “Lora: Low-rank adaptation of large language models,”
International Conference, TACAS 2008, Held as Part of the Joint arXiv preprint arXiv:2106.09685, 2021.
European Conferences on Theory and Practice of Software, ETAPS [42] Y. Mao, L. Mathias, R. Hou, A. Almahairi, H. Ma, J. Han, W.-t. Yih,
2008, Budapest, Hungary, March 29-April 6, 2008. Proceedings 14. and M. Khabsa, “Unipelt: A unified framework for parameter-efficient
Springer, 2008, pp. 337–340. language model tuning,” arXiv preprint arXiv:2110.07577, 2021.
[27] J. Protzenko, J.-K. Zinzindohoué, A. Rastogi, T. Ramananandro, [43] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi,
P. Wang, S. Zanella-Béguelin, A. Delignat-Lavaud, C. Hritcu, P. Cistac, T. Rault, R. Louf, M. Funtowicz et al., “Huggingface’s trans-
K. Bhargavan, C. Fournet, and N. Swamy, “Verified low-level formers: State-of-the-art natural language processing,” arXiv preprint
programming embedded in F*,” PACMPL, vol. 1, no. ICFP, pp. 17:1– arXiv:1910.03771, 2019.
17:29, Sep. 2017. [Online]. Available: http://arxiv.org/abs/1703.00053 [44] L. A. Agrawal, A. Kanade, N. Goyal, S. Lahiri, and S. Rajamani,
[28] T. Ramananandro, A. Delignat-Lavaud, C. Fournet, N. Swamy, T. Cha- “Monitor-guided decoding of code lms with static analysis of repository
jed, N. Kobeissi, and J. Protzenko, “Everparse: Verified secure zero- context,” Advances in Neural Information Processing Systems, vol. 36,
copy parsers for authenticated message formats,” in Proceedings of the 2024.
28th USENIX Conference on Security Symposium, ser. SEC’19. USA: [45] Y. Wei, C. S. Xia, and L. Zhang, “Copiloting the copilots: Fusing
USENIX Association, 2019, p. 1465–1482. large language models with completion engines for automated program
repair,” in Proceedings of the 31st ACM Joint European Software of the 30th ACM Joint European Software Engineering Conference and
Engineering Conference and Symposium on the Foundations of Software Symposium on the Foundations of Software Engineering, 2022, pp. 18–
Engineering, 2023, pp. 172–184. 30.
[46] C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer, “Genprog: A [63] S. K. Lahiri, A. Naik, G. Sakkas, P. Choudhury, C. von Veh,
generic method for automatic software repair,” Ieee transactions on M. Musuvathi, J. P. Inala, C. Wang, and J. Gao, “Interactive
software engineering, vol. 38, no. 1, pp. 54–72, 2011. code generation via test-driven user-intent formalization,” CoRR, vol.
[47] S. Biderman, H. Schoelkopf, Q. G. Anthony, H. Bradley, K. O’Brien, abs/2208.05950, 2022. [Online]. Available: https://doi.org/10.48550/
E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff et al., arXiv.2208.05950
“Pythia: A suite for analyzing large language models across training and [64] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan,
scaling,” in International Conference on Machine Learning. PMLR, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large
2023, pp. 2397–2430. language models trained on code,” arXiv preprint arXiv:2107.03374,
[48] P. Gupta, A. Khare, Y. Bajpai, S. Chakraborty, S. Gulwani, A. Kanade, 2021.
A. Radhakrishna, G. Soares, and A. Tiwari, “Grace: Language models [65] M. Endres, S. Fakhoury, S. Chakraborty, and S. K. Lahiri, “Formalizing
meet code edits,” ser. ESEC/FSE 2023. New York, NY, USA: natural language intent into program specifications via large language
Association for Computing Machinery, 2023, p. 1483–1495. [Online]. models,” CoRR, vol. abs/2310.01831, 2023. [Online]. Available:
Available: https://doi.org/10.1145/3611643.3616253 https://doi.org/10.48550/arXiv.2310.01831
[49] G. Klein, K. Elphinstone, G. Heiser, J. Andronick, D. Cock, P. Derrin, [66] D. Key, W.-D. Li, and K. Ellis, “I speak, you verify: Toward trustworthy
D. Elkaduwe, K. Engelhardt, R. Kolanski, M. Norrish, T. Sewell, neural program synthesis,” 2022.
H. Tuch, and S. Winwood, “sel4: formal verification of an os kernel,” [67] J. R. Lorch, Y. Chen, M. Kapritsos, B. Parno, S. Qadeer, U. Sharma,
in Proceedings of the ACM SIGOPS 22nd Symposium on Operating J. R. Wilcox, and X. Zhao, “Armada: low-effort verification of
Systems Principles, ser. SOSP ’09. New York, NY, USA: Association high-performance concurrent programs,” in Proceedings of the 41st
for Computing Machinery, 2009, p. 207–220. [Online]. Available: ACM SIGPLAN Conference on Programming Language Design and
https://doi.org/10.1145/1629575.1629596 Implementation, ser. PLDI 2020. New York, NY, USA: Association
[50] J. C. Blanchette, S. Böhme, and L. C. Paulson, “Extending for Computing Machinery, 2020, p. 197–210. [Online]. Available:
sledgehammer with SMT solvers,” J. Autom. Reason., vol. 51, https://doi.org/10.1145/3385412.3385971
no. 1, pp. 109–128, 2013. [Online]. Available: https://doi.org/10.1007/ [68] A. Arasu, T. Ramananandro, A. Rastogi, N. Swamy, A. Fromherz,
s10817-013-9278-5 K. Hietala, B. Parno, and R. Ramamurthy, “Fastver2: A provably
[51] D. Kühlwein, J. C. Blanchette, C. Kaliszyk, and J. Urban, correct monitor for concurrent, key-value stores,” in Proceedings of the
“Mash: Machine learning for sledgehammer,” in Interactive Theorem 12th ACM SIGPLAN International Conference on Certified Programs
Proving - 4th International Conference, ITP 2013, Rennes, France, and Proofs, ser. CPP 2023. New York, NY, USA: Association
July 22-26, 2013. Proceedings, ser. Lecture Notes in Computer for Computing Machinery, 2023, p. 30–46. [Online]. Available:
Science, S. Blazy, C. Paulin-Mohring, and D. Pichardie, Eds., https://doi.org/10.1145/3573105.3575687
vol. 7998. Springer, 2013, pp. 35–50. [Online]. Available: https: [69] Z. Tao, A. Rastogi, N. Gupta, K. Vaswani, and A. V. Thakur,
//doi.org/10.1007/978-3-642-39634-2 6 “DICE*: A formally verified implementation of DICE measured
[52] M. Mikula, S. Antoniak, S. Tworkowski, A. Q. Jiang, J. P. Zhou, boot,” in 30th USENIX Security Symposium (USENIX Security 21).
C. Szegedy, L. Kucinski, P. Milos, and Y. Wu, “Magnushammer: USENIX Association, Aug. 2021, pp. 1091–1107. [Online]. Available:
A transformer-based approach to premise selection,” CoRR, vol. https://www.usenix.org/conference/usenixsecurity21/presentation/tao
abs/2303.04488, 2023. [Online]. Available: https://doi.org/10.48550/ [70] S. Ho, J. Protzenko, A. Bichhawat, and K. Bhargavan, “Noise: A library
arXiv.2303.04488 of verified high-performance secure channel protocol implementations,”
[53] T. Gauthier, C. Kaliszyk, J. Urban, R. Kumar, and M. Norrish, in 2022 IEEE Symposium on Security and Privacy (SP), 2022, pp. 107–
“Tactictoe: Learning to prove with tactics,” J. Autom. Reason., 124.
vol. 65, no. 2, pp. 257–286, 2021. [Online]. Available: https:
//doi.org/10.1007/s10817-020-09580-x A PPENDIX
[54] E. First, Y. Brun, and A. Guha, “Tactok: Semantics-aware proof synthe- A. FS TAR DATA S ET v2
sis,” Proceedings of the ACM on Programming Languages, vol. 4, no.
OOPSLA, pp. 1–31, 2020. Since the initial version of this paper, we have extended our
[55] J. M. Han, J. Rute, Y. Wu, E. W. Ayers, and S. Polu, dataset to a larger FS TAR DATA S ET version 2. As mentioned
“Proof artifact co-training for theorem proving with language
models,” CoRR, vol. abs/2102.06203, 2021. [Online]. Available: previously, we envision FS TAR DATA S ET to be a live, evolving
https://arxiv.org/abs/2102.06203 dataset drawn from active F⋆ projects. As such, while version
[56] H. Xin, H. Wang, C. Zheng, L. Li, Z. Liu, Q. Cao, Y. Huang, J. Xiong, 2 primarily includes data from four additional F⋆ projects not
H. Shi, E. Xie et al., “Lego-prover: Neural theorem proving with
growing libraries,” arXiv preprint arXiv:2310.00656, 2023. included in the original version, it also contains data from the
[57] A. Q. Jiang, S. Welleck, J. P. Zhou, T. Lacroix, J. Liu, W. Li, original 8 projects, both in the form of new definitions that
M. Jamnik, G. Lample, and Y. Wu, “Draft, sketch, and prove: were previously not present, as well as changes to some of
Guiding formal theorem provers with informal proofs,” in The Eleventh
International Conference on Learning Representations, ICLR 2023, the original definitions. The four new projects are as follows:
Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. [Online]. • Starmada: a framework for doing proofs by stepwise
Available: https://openreview.net/pdf?id=SMa9EAovKMC refinement for concurrent programs in a weak mem-
[58] S. Welleck, J. Liu, X. Lu, H. Hajishirzi, and Y. Choi, “Naturalprover:
Grounded mathematical proof generation with language models,” Ad- ory model. Starmada is an experimental version of Ar-
vances in Neural Information Processing Systems, vol. 35, pp. 4913– mada [67] implemented in F⋆ , relying on various ad-
4927, 2022. vanced features of F⋆ ’s dependent type system for more
[59] Y. Huang, X. Lin, Z. Liu, Q. Cao, H. Xin, H. Wang, Z. Li, L. Song,
and X. Liang, “Mustard: Mastering uniform synthesis of theorem and generic and abstract proofs.
proof data,” arXiv preprint arXiv:2402.08957, 2024. • Zeta: a high performance, concurrent monitor for stateful
[60] J. Yao, Z. Zhou, W. Chen, and W. Cui, “Leveraging large lan- services proven correct in F⋆ and its Steel concurrent
guage models for automated proof synthesis in rust,” arXiv preprint
arXiv:2311.03739, 2023. separation logic [68].
[61] W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Unified • Dice-star: a verified implementation of the DICE mea-
pre-training for program understanding and generation,” arXiv preprint sured boot protocol for embedded devices [69].
arXiv:2103.06333, 2021.
[62] S. Chakraborty, T. Ahmed, Y. Ding, P. T. Devanbu, and B. Ray, “Natgen: • Noise-star: a verified compiler for implementations of
generative pre-training by “naturalizing” source code,” in Proceedings Noise protocols, a family of key-exchange protocols [70].
TABLE VIII: Synthesis results for FS TAR DATA S ET v2
In total, FS TAR DATA S ET v2 contains 54,404 definitions
drawn from 12 projects, representing around 940,000 lines of intra-projects (verify@k) cross-project (V@)
Model
F⋆ code and proofs. To avoid cross-contamination, we made k=1 k=5 k = 10 k=1 k=5 k = 10
sure that the F* files that belonged to training dataset in GPT-3.5 18.85 32.33 36.59 20.53 35.87 41.63
V1 also remained in the training set on V2. The V2 dataset GPT-4 xx.xx xx.xx xx.xx xx.xx xx.xx xx.xx
is available in https://huggingface.co/datasets/ Phi-2 24.13 35.06 39.16 24.33 36.22 40.59
microsoft/FStarDataSet-V2. Orca-2 18.91 31.76 37.08 15.89 29.07 34.42
StarCoder 36.85 50.93 56.07 38.58 53.51 58.13
TABLE VII: Summary statistics of the FS TAR DATA S ET v2.
100
Metric Train Valid
Test Model (Solid = Prompted, Hollow = Finetuned)
Intra-project Cross-project GPT-3.5 Phi-2 Orca-2 StarCoder

Percentage of Examples Verified

73.05
Number of Definitions 30910 1955 5097 16442
Number of Projects 6 4 6 2
Number of Files 1669 125 328 1179 60

50.46

49.94
48.44
47.85

46.41
Avg. num of lines 4.14 3.77 3.99 3.93

37.20
40

33.47
Avg. num of tokens 45.34 44.90 45.56 30.30

31.89
31.53

29.63
28.80
# Simply Typed 10072 834 1536 7282
# Dependently Typed 15091 778 2363 6067 20
# Proofs 5747 343 1198 3093
0 Simply Typed Dependently Typed Proof
Table VII reports aggregate statistics for V2 dataset.
(a) Intra Project evaluation
100
B. Synthesis Results on FS TAR DATA S ET v2 Model (Solid = Prompted, Hollow = Finetuned)
GPT-3.5 Phi-2 Orca-2 StarCoder

Percentage of Examples Verified

Improving the RAG. In §V-A we reported that all models 80

73.36
performed better on the intra-project examples as compared to

54.93
54.70
60
the cross-project examples. We indicated that a cause of this

49.58
46.33
might be the manner in which related examples are retrieved,

39.06
40

32.87

29.80
quoting:

28.06

27.97

25.70
24.56
Both models also perform better on intra-project 20
examples than on cross-project examples. When
constructing prompts, we only retrieve related exam- 0 Simply Typed Dependently Typed Proof
ples from the training data split, which specifically (b) Cross Project evaluation
does not include any examples from the cross-project
set—as such, the cross-project examples benefit less Fig. 7: Verify@10 across different types of examples for different
models in the V2 dataset.
from retrieval augmentation.
We have revised the manner in which related examples
are retrieved. Rather than restricting related examples to only Our results make it clear that fine-tuned models are not over-
be fetched from the training data, for a given synthesis fitted to a particular version of the dataset. Our prior results ex-
problem, we search for examples from all files that do not tend well to the new dataset, and indeed with our new retrieval
depend on the file of the current problem—this dependence augmentation strategy, the difference in performance between
information is available in metadata present in the dataset. the intra-project and cross-project classes is not significant.
We employ the same embedding-based retrieval augmentation Perhaps surprisingly, the performance across the board on the
technique described in §IV, ranking examples according to cross-project class is better than in the intra-project class: this
their similarity to the goal type and include as many related is explained by noting that the cross-project class contains a
examples as can fit in the token window, in descending order larger proportion of simply typed definitions (44.3%) that the
of similarity score. intra-project class (33.5%). In addition, cross-project examples
Results on V2 with Improved RAG. Table VIII presents an are shorter than that of Intra-Project examples.
overview of our results on the FS TAR DATA S ET-V2 from the
fine-tuned model trained on the V1 training set. In addition,
Figure 7 shows performance of different models across differ-
ent types of problems. We have yet to run all the configurations
reported in Table II on the full v2 dataset. We do not report
results of the non-fine-tuned versions of Phi-2, Orca-2, and
StarCoder on v2, as the performance of these models without
fine-tuning on the v1 dataset was not competitive. More
notably, we have not yet completed a full run of GPT-4 on
the v2 dataset, as a full run is very resource intensive.