Towards Neural Synthesis For SMT-Assisted Proof-Oriented Programming
Towards Neural Synthesis For SMT-Assisted Proof-Oriented Programming
Abstract—Proof-oriented programs mix computational content languages often require a high-level of human expertise—AI
arXiv:2405.01787v3 [cs.PL] 4 Sep 2024
with proofs of program correctness. However, the human effort assistance could help make them easier to use.
involved in programming and proving is still substantial, despite
the use of Satisfiability Modulo Theories (SMT) solvers to Recognizing the potential dual benefit (i.e., trustworthy AI
automate proofs in languages such as F⋆ . programming & easier program proof) many researchers have
Seeking to spur research on using AI to automate the construc- begun investigating using AI to synthesize proofs. However,
tion of proof-oriented programs, we curate a dataset of 600K most of the prior work has focused on using AI with tactic-
lines of open-source F⋆ programs and proofs, including software based proof assistants, such as Coq, Lean, and Isabelle [2]–
used in production systems ranging from Windows and Linux, to
Python and Firefox. Our dataset includes around 32K top-level
[5], including projects like CoqGym [6], which builds models
F⋆ definitions, each representing a type-directed program and based on hundreds of Coq projects containing more than
proof synthesis problem—producing a definition given a formal 70,000 tactic scripts, and LeanDojo [7], which uses math-
specification expressed as an F⋆ type. We provide a program- lib [8], a large corpus of formalized mathematics in Lean.
fragment checker that queries F⋆ to check the correctness of AI automation for proof-oriented programming languages like
candidate solutions. We also report on an extended version of
our dataset containing a total of 940K lines of programs and
F⋆ [9], Dafny [10], Viper [11], Verus [12] and others has
proofs, with a total of 54k top-level F⋆ definitions. We believe received comparatively less attention. The prior work [13]–
this is the largest corpus of SMT-assisted program proofs coupled [18] has been limited by the availability of data, focusing
with a reproducible program-fragment checker. instead on small, hand-crafted problem sets numbering in
Grounded in this dataset, we investigate the use of AI to the few hundreds. This is unfortunate, since proof-oriented
synthesize programs and their proofs in F⋆ , with promising
languages may be close to the limit of symbolic automation
results. Our main finding in that the performance of fine-tuned
smaller language models (such as Phi-2 or StarCoder) compare provided by SMT solvers and AI automation could open
favorably with large language models (such as GPT-4), at a the door to further automation that has remained out of
much lower computational cost. We also identify various type- reach. Additionally, to achieve the promise of trustworthy
based retrieval augmentation techniques and find that they boost AI programming, we believe it is essential to develop AI
performance significantly. With detailed error analysis and case
for program and proof synthesis rather that only on mostly
studies, we identify potential strengths and weaknesses of models
and techniques and suggest directions for future improvements. mathematical tactic-based proofs.
Index Terms—Proof Oriented Programming, AI for Proofs, Towards this end, our paper makes the following three major
Trustworthy AI programming contributions:
1. A new dataset of programs and proofs: Aiming to
I. I NTRODUCTION
spur research in AI-assisted proof-oriented programming, our
The recent excitement around AI-assisted programming has first main contribution is FS TAR DATA S ET, a dataset of F⋆
been tempered by concerns around the trustworthiness of AI- programs and proofs extracted from 2060 source files, repre-
generated code [1]. Languages that offer static guarantees senting about 600K lines of source code, drawn from 8 open
can help reduce some of these concerns, e.g., having AI source projects. The dataset provides around 32K top-level F⋆
generate safe Rust code rather than C eliminates the risk definitions, coupled with tooling that allows each definition to
of AI-introduced memory safety issues. Taking this line of be checked in isolation. We believe this is the largest dataset
thinking to its limit, using AI to generate code in proof- of SMT-assisted proof-oriented programs and we envision for
oriented languages which allow programs to be augmented it to be a live, evolving data set, with new projects added to
with specification and proofs of correctness could eliminate it over time. Indeed, currently FS TAR DATA S ET has grown to
trust in AI-generated code, so long as the specification can be include 4 additional projects reaching 940K lines of source
audited to match a human’s intent. Conversely, proof-oriented code (see Appendix A, and B for details and initial results).
Although we currently focus on F⋆ , we hope for the dataset to
∗
Work done as interns at Microsoft Research also grow to include data from other proof-oriented languages.
We describe the dataset in detail in §III. provides further discussion and analysis. We release FS TAR -
2. A benchmark problem: Grounded in this data, we de- DATA S ET in https://huggingface.co/datasets/
sign a straightforward benchmark: given a type as a formal microsoft/FStarDataSet. Details of the project, source
specification, synthesize a program that has that type. Each code and models can be found in http://fstar-lang.
of the 32K definitions in FS TAR DATA S ET yields its own org/popai
synthesis problem, where the type of the definition is the
“goal” type, and a technique is expected to synthesize a
definition that the F⋆ compiler attests is goal-type-correct. In
F⋆ , types are rich enough to capture a variety of specifications, II. BACKGROUND
ranging from simple types as in other functional programming
languages, to dependently typed specifications that capture
We start by providing some general background on F⋆ ,
functional correctness properties of a program, i.e., types
adapted from its online manual. F⋆ is a dependently typed
can represent arbitrary propositions. Dually, programs in F⋆
programming language and proof assistant. It encourages
contain computational content (e.g., logic for reversing a
proof-oriented programming, a paradigm in which one co-
list), but they can also be interpreted as proofs. As such,
designs programs and proofs which attest various aspects of
our benchmark can be seen as an instance of a type- or
a program’s correctness, including its input/output behavior
specification-directed program & proof synthesis problem. We
together with its side effects, if any.
present a simple and objective taxonomy to classify benchmark
instances, distinguishing simply typed, dependently typed, and The core design philosophy of F⋆ is that the type of a
fully specified proofs, corresponding roughly to the precision program is a specification of its runtime behavior. Many terms
of specifications. can have the same type and the same term can have many
types, e.g., e : int states that the term or program fragment
3. Designing and evaluating neural synthesis techniques:
e reduces to an integer; and e : nat states that e reduces
We apply a (by now) standard regime of prompting large
to a non-negative integer, where nat = x:int{x ≥ 0} is a
language models (LLMs) to generate solutions, backed by
refinement of type int. When proving a program e correct,
retrieval augmentation and fine-tuning techniques specific to
one specifies the properties one is interested in as a type t
our setting. In particular, we construct a prompt by augmenting
and then tries to convince F⋆ that e has type t, i.e., deriving
the goal type with related types and definitions from the
e : t.
program context, based on various embedding models. In §V,
we evaluate the performance of off-the-shelf large language Many dependently typed languages have the same core
models, including GPT-3.5 [19] and GPT-4 [20], as well design, however F⋆ is distinctive in that in addition to
as fine-tuned smaller models include Phi2-2.7B [21], Orca2- several built-in type-checking algorithms, it uses the Z3
7B [22], and StarCoder-15B [23], with the following main SMT solver [26] to try to automatically prove various obli-
takeaways. gations that arise during type-checking. For example, un-
• Fine tuned smaller models can match or outperform large der the hypothesis that x : even, proving that x + 2 : even,
language models (§V-A). where even = x:int{x % 2 == 0}, results in a query to Z3
• Different classes of problems are solved with varying of the form ∀(x:int). x % 2 == 0 =⇒ (x + 2) % 2 == 0. While
degrees of success, with common error modes differing in general the queries F⋆ presents to Z3 are undecidable,
between pretrained and fine-tuned models (§V-B). in practice Z3 successfully automates a large number of the
• Leveraging the contexts as well as retrieval augmentation queries it receives. However, when Z3 proof search fails, the
significantly boosts the quality of results (§V-C). programmer must write additional lemmas and other proofs
hints to decompose the proof into smaller pieces that can be
Based on our results, we are optimistic about for the
checked by F⋆ and Z3.
future of AI-assisted proof-oriented programming. Researchers
building verified systems in proof-oriented languages have The concrete syntax of F⋆ is based on other languages in
reported writing around 3-5 lines of proof for each line of the ML family, including OCaml and F#. Shown below is
verified code [24], [25], a considerable overhead, despite a recursive implementation of Quick Sort, together with its
strong symbolic automation from SMT solvers. For SMT- specification and proof of correctness. The type of sort states
assisted proof-oriented programs, we provide the first, sub- that for any total order f on elements of type α, given an
stantial empirical evidence that LLMs trained on corpora like input list l:list α, sort is a total function (i.e., it always
FS TAR DATA S ET, and prompted with retrieval augmentation, terminates) returning a list m:list α that is sorted according
can automate somewhere between a third and a half of proofs. to f and where m is a permutation of l. The predicates like
That said, our approach focuses on synthesizing program total_order, sorted, is_permutation etc. are other aux-
and proof fragments given their specifications: finding the iliary definitions in scope. The implementation of sort mixes
right modular decomposition to prove a program correct, the computational behavior (i.e., partitioning the list based
with the right specifications and auxiliary lemmas, is not on a pivot, recursively sorting the partitions, combining and
yet within the scope of the techniques we explore. §VI returning them) with proof steps that attest to the correctness
of the code with respect to its specification.1 • EverParse: A parser generator for binary formats [28], used
let rec sort (f:total_order_t α) (l:list α)
in various large scale systems, e.g., the Windows kernel [29]
: Tot (m:list α{ sorted f m ∧ is_permutation l m }) • HACL*: A library of verified cryptographic algorithms [25],
(decreases (length l))
= match l with
including ValeCrypt [30], a library of verified assembly
| [] → [] code, as well as EverCrypt, a cryptographic provider [31],
| pivot :: tl →
let hi, lo = partition (f pivot) tl in
including code deployed in Linux, Firefox, and Python.
let res = append (sort f lo) (pivot :: sort f hi) in • miTLS-F*: A partially verified reference implementation of
(* <proof> *)
partition_mem_permutation (f pivot) tl;
the TLS protocol [32].
append_count lo hi; append_count hi lo; • EverQuic-Crypto: A verified implementation of header and
sorted_concat f (sort f lo) (sort f hi) pivot;
append_count (sort f lo) (sort f hi);
packet protection for the QUIC protocol [33].
permutation_app_lemma pivot tl (sort f lo) (sort f hi); • Merkle-tree: A verified, incremental Merkle tree, designed
(* </proof> *)
res
for use in Azure CCF, a confidential computing system [31].
• Steel: A concurrent separation logic library, with proofs of
For example, the annotation decreases (length l) data structures and concurrency primitives [34].
indicates that the recursion terminates because the length of
The dataset will be available publicly as an open source
the list input decreases at each recursive call. Additionally, the
repository referencing the other projects as sub-modules, in-
source lines delimited by <proof> comment tags are calls to
cluding a given version of F⋆ itself. All the projects are built
F⋆ lemmas, auxiliary definitions that prove certain properties.
with the same version of F⋆ and the Z3 SMT solver, resulting
For instance, the auxiliary lemma append_count is shown
in a single archive with all the 2,060 source files and a build
below. Its type states that every call to append_count l m x
artifact for each of them, i.e., F⋆ ’s .checked files. Each
guarantees the postcondition described in the ensures
checked file is accompanied by record of metadata about its
clause; or, equivalently, the type claims the universal property
contents: for each top-level element (e.g.,, the definition of a
∀l m x. count x (append l m) = count x l + count x m.
function, or a type, or a type signature), the metadata records
The proof of this property is by induction on the list l,
its dependences, the compiler settings used to verify them, etc.
expressed as a recursive function.
Reproducibility & evolution. We aim to strike a balance
let rec append_count (l m:list α) (x:α)
: Lemma (ensures (count x (append l m) = count x l + count x m)) between reproducibility of the dataset, while also encouraging
= match l with the data set to grow and change along with projects it
| [] → ()
| hd :: tl → append_count tl m x references. Referencing the precise commit hashes of all the
referenced projects as sub-modules allows any version of the
Program and proof fragments like sort, append_count dataset to be fully reproducible. The results reported in this
etc. each constitute a top-level definition in an F⋆ program. paper focus on version 1 of the dataset, a snapshot from
Each definition yields a type-directed synthesis problem, i.e., November 2023, provided as an anonymous supplement.
given a goal type such as the first two lines of append_count
shown above, can an LLM generate a type-correct definition. A type-checking harness. To enable using the data to val-
To help the LLM succeed, we aim to augment the goal type idate program-proof synthesis attempts, we provide scripts
with related information, e.g., the definitions of symbols such that enable launching F⋆ and initializing it so that it is
as count, append etc. We evaluate the performance of various capable of checking each top-level definition independently.
retrieval augmentation strategies and LLMs on this task. Once initialized in this way, the F⋆ process can be repeatedly
queried using an interactive JSON-based API to type-check
III. A C ORPUS OF P ROOF -O RIENTED P ROGRAMS program fragments—in F⋆ , type-checking amounts to program
verification. However, replaying the F⋆ type-checker on top-
FS TAR DATA S ET is an archive of source code, build arti- level definitions in the dataset is not perfect, e.g., due to
facts, and metadata assembled from eight (8) different F⋆ - sensitivity to small changes in the search heuristics used by
based open source projects on GitHub, summarized below, SMT solvers. We use the type-checking harness to identify
with a focus on libraries that provide secure low-level software those top-level definitions whose proofs can be checked in
and the tools that enable their proofs. isolation, finding that 90% of them meet this criterion. In all
• FStar: The F⋆ compiler itself, including its standard libraries our experiments, we focus on this re-checkable fragment of
and examples. the dataset.
• Karamel: A transpiler from a subset of F⋆ called Low* to During the development of the harness, we faced similar
C, including libraries to work with a model of C types and challenges as reported by LeanDojo [7, A.2], mainly: (a)
control structures, e.g., for- and while-loops [27]. ensuring names are resolved correctly by opening the right
namespace in the right order, (b) ensuring that definitions
1 Many F⋆ examples, including the ones shown here, are similar to program to be checked cannot refer to their original version in the
proofs in languages like Dafny or Verus. However, F⋆ also allows other styles library, that they cannot use their original definition implicitly
of higher order and dependently typed definitions that are not expressible in
other SMT-assisted languages. We refer the reader to the F⋆ manual for a full via Z3, (c) verifying that solutions do not use escape hatches
discussion of similarities and differences. such as admit () or assume that are intended for interactive
TABLE I: Summary statistics of the FS TAR DATA S ET. F* Example Training
Related Example Retrieval Data
// Type:
Test val appendList (l1:list a) (l2:list a)
Metric Train Valid
Intra-project Cross-project : (res: list a{
length res = length l1 + length l2})
Number of Definitions 22779 1541 5965 1769 // Context:
3
0 0 2 43 33
16
sponds to the combined results from finetuned experiments (3), 69 0 25
3 0 20 6 33 21
and Comb. Prompted & Finetuned‡ shows the results across all 3 5
0 116
the experiments (8). 0 1 0 00 8 27 23 29
6 13
verify@k (intra-project) verify@k (cross-project) 47 20 2 3 42 29 5 2
Strategy Model
k=1 k=5 k = 10 k=1 k=5 k = 10 17 6 3 66 1919
20 10 71 12
GPT-3.5 13.51 25.20 29.81 6.67 14.41 18.54
StarCoder Orca-7B StarCoder Orca-7B
GPT-4 19.41 31.90 36.38 13.00 23.57 28.49
Prompted Phi-2 (2.7B) 0.30 1.26 2.20 0.11 0.62 1.75 (a) Prompted Models (b) Finetuned Models
Orca-2 (7B) 2.63 6.39 8.55 0.45 2.20 3.84
Fig. 2: Venn Diagram showing intersection of examples from
StarCoder (15.5B) 1.89 7.14 12.27 0.73 3.79 6.90
Cross-Project examples solved by different models. The GPT-*
Combined Prompted∗ 26.34 40.02 45.47 16.56 29.96 36.18 performance in Fig. (b) are from prompting respective models.
Phi-2 (2.7B) 17.28 27.75 31.10 8.59 16.62 20.97
Finetuned Orca-2 (7B) 14.90 26.02 30.61 7.74 14.92 18.37
StarCoder (15.5B) 27.95 39.50 43.98 16.73 27.64 32.90
Combined Finetuned† 32.86 44.17 48.52 20.97 32.28 37.20 examples than on cross-project examples. When constructing
Comb. Prompted & Finetuned‡ 38.76 50.49 55.34 27.13 40.14 45.56 prompts, we only retrieve related examples from the training
∗ : verify@k interprets as verify@5k. † : verify@k interprets as verify@3k, data split, which specifically does not include any examples
and ‡ : verify@k interprets as verify@8k. from the cross-project set—as such, the cross-project examples
benefit less from retrieval augmentation.
V. E MPIRICAL E VALUATIONS The performance of prompted smaller models (Phi-2, Orca-
2, and StarCoder) is significantly worse. Among these models,
A. Performance of Different Models we do observe a positive correlation between the model
In this section, we evaluate the ability of language models to size and performance, with the 15.5B parameter StarCoder
synthesize F⋆ programs and proofs, including of state-of-the- performing better than Orca-2 (7B parameters), which in turn
art LLMs such as GPT-3.5 and GPT-4, which are available is better than Phi-2 (2.7B parameters). Their relative perfor-
for inference through APIs, as well as fine-tuned small and mance may possibly also be attributed to their pre-training
medium-sized language models deployed on local machines, data. Unlike Phi-2 and Orca-2, StarCoder’s pre-training data
posing the following research question: contains OCaml code from GitHub; the syntactic similarity
RQ1. Can language models effectively synthesize F⋆ definitions given
between OCaml and F⋆ could also contribute to the relatively
the specification? higher success of prompted StarCoder.
When we fine-tune these models, their performance signif-
Experimental Setup. We evaluate all models using the same icantly improves. For example, fine tuning Phi-2 improves its
prompt, as described in §IV. We divide the experiments into verify@10 score by more than a factor of 10. In fact, fine-
two solution strategies: (i) prompting and (ii) fine-tuning. tuned Phi-2 slightly outperforms GPT-3.5, while fine-tuned
Prompting involves giving a pre-trained language model a task StarCoder also outperforms GPT-4. The combined verify@nk
and assessing whether or not it is able to solve it. Fine-tuning, results suggest that one could use one or more cheaper, fine-
on the other hand, involves further refinement of a model tuned models for synthesis at first, falling back to the larger
parameters based on a task-specific dataset. For prompting models only when the smaller models fail.
LLMs, we use the OpenAI Python API. For prompting smaller Fine-tuning equips a model with project-specific code id-
language models, namely Phi-2, Orca-2, and StarCoder, we ioms to improve performance. However, we argue that fine-
fine-tune these models (from the HuggingFace library [43]) as tuning also yields benefits transferable across projects. Fig-
described in §IV and use the fine-tuned model checkpoints for ure 2 shows the intersection of successes in the cross-project
inference. We set the same token limits for all models: 500 examples between different models in both prompted and
tokens for context, 400 for related examples, 300 for selected fine-tuned settings. As Figure 2a shows, 451 examples are
premises, and 500 for generated definitions. correctly solved by GPT-3.5 and GPT-4 exclusively, which
Results. Table II shows the results of different strategies could not be solved by prompting any other models; other
across different models. We focus on the verify@k metric models exclusively solved only 53 examples. In contrast, after
defined in §IV, but also report a combined verify@nk metric fine-tuning (Figure 2b), prompted GPT models could only
which considers an example solved if the instance was solved exclusively solve 137 problems, the rest 314 are solved by at
by any of n models under the verify@k metric. least one fine-tuned model. On the other hand, 165 examples
Both GPT-3.5 and GPT-4 solve a significant fraction of are exclusively solved by the smaller models that neither GPT-
problems in the verify@10 metric, though GPT-4 is better. 4 nor GPT-3.5 can solve. We believe this demonstrates the
Their success is likely due to F⋆ code being part of their effectiveness of fine-tuning beyond specific projects they are
training data. Both models also perform better on intra-project trained on.
70 70
Model (Solid = Prompted, Hollow = Finetuned) Model (Solid = Prompted, Hollow = Finetuned)
61.46
60 GPT-3.5 Phi-2 StarCoder 60 GPT-3.5 Phi-2 StarCoder
55.45
54.01
GPT-4 Orca-2 GPT-4 Orca-2
Percentage of Examples Verified
50.34
50 50
44.30
41.61
41.21
38.93
35.74
40 40
34.20
33.56
32.01
27.67
26.94
26.90
26.52
25.93
25.72
30 30
23.41
23.32
22.22
20.80
20.06
19.12
17.89
16.84
16.78
20 20
14.42
13.76
11.64
11.64
10.07
10.06
8.65
7.97
7.41
10 10
5.80
4.03
3.80
3.56
2.56
2.37
2.12
1.96
1.47
1.06
0 Simply Typed Dependently Typed Proof 0 Simply Typed Dependently Typed Proof
(a) Intra Project evaluation (b) Cross Project evaluation
Fig. 3: Verify@10 across different types of examples for different models.
Result 1: Language models are helpful in automating Errors (Solid = Prompted, Hollow = Finetuned)
the synthesis of definitions and proofs in F⋆ . Fine-tuning Syntax Error Identifier not found Semantic Error
100.0%
smaller models can match or outperform the performance of
large language models. Despite being significantly smaller,
75.0%
Percentage of Errors
the fine-tuned Phi-2 (2.7B) model slightly outperforms GPT-
3.5 by up to 5%. In addition, StarCoder (15.5B) outperforms
the most advanced GPT-4 by up to 21%. Further, our 50.0%
let serialize_header_is_retry
(short_dcid_len: short_dcid_len_t) Result 2: Model performance follows our taxonomy, with
(h: header’ short_dcid_len) simply typed problems being most commonly solved, then
: Lemma (
let s = LP.serialize (serialize_header short_dcid_len) h in dependently typed, and finally proofs. Models are able to
Seq.length s > 0 ∧ generate type-correct functional programs with complex
(is_retry h ⇐⇒
(LPB.get_bitfield (U8.v (Seq.index s 0)) 7 8 == 1 ∧ dependent types, and to generate proofs by adapting similar
LPB.get_bitfield (U8.v (Seq.index s 0)) 4 6 == 3) examples. For most of the pre-trained models, the major
)
) = serialize_header_eq short_dcid_len h; source of errors are syntax errors, while after fine-tuning,
let tg = first_byte_of_header short_dcid_len h in “Identifier not found” consist of the majority of errors,
let x = LPB.synth_bitsum’_recip first_byte tg in
LP.serialize_u8_spec x; followed by semantic errors.
let s = LP.serialize (serialize_header short_dcid_len) h in
assert (Seq.index s 0 == x);
assert (is_retry h ⇐⇒ ( C. Impact of components of the prompt
LPB.get_bitfield (U8.v (Seq.index s 0)) 7 8 == 1 ∧
LPB.get_bitfield (U8.v (Seq.index s 0)) 4 6 == 3 Recall from §IV, a prompt contains the (1) local file context;
))
(2) related examples retrieved from the training data; and,
Turning our attention to errors, Figure 4 plots three different (3) selected premises by the premise selection model. In
class of errors made by models as a percentage of the total this section, we evaluate the impact of each of these three
erroneous solutions they generated. Interestingly, for GPT-3.5, components on model performance.
the largest error class is “Syntax error”, whereas for GPT- RQ3. How do the different prompt components impact the effective-
4 it is “Identifier not found”—GPT-4 seems to be better at ness of language models?
F⋆ syntax than GPT-3.5. For the smaller models, when we
prompt them, most of the errors are syntax errors. This is not Experimental Setup. We take as a baseline for our expe-
surprising, since none of them have been trained on F⋆ . When rience the performance a fine-tuned Phi-2 model (Phi2full )
we fine-tune the models, the largest error class is “Identifier not that has access to all three prompt components. We fine-
found”, indicating that models often hallucinate identifiers. For tune three other versions of Phi-2, each time dropping one
example, the following definition is generated by StarCoder: of three three prompt components: for the Phi2-pre model,
we drop the selected premises from the prompt; Phi2-ctx ,
let eqList_ok (#a: Type) (d: deq a) : drops file context information; and Phi2-re , drops the related
Lemma (decides_eq #(list a) (Raw.eqList d.raw.eq)) = examples. In addition, since we know the ideal premises used
let open FStar.Classical in
let lemma_eq (xs ys: list a) in the ground truth definition, we fine-tune another version of
: Lemma (Raw.eqList d.raw.eq xs ys) = Phi2full , where instead of the selected premised, we used the
FStar.List.Tot.lemma_eq_intro (
Raw.eqList d.raw.eq actual premises (Phi2ideal ). Phi2ideal is not realistic, but it
) xs ys; () in
Classical.forall_intro (lemma_eq)
serves as a roof-line for premise selection. We evaluate these
variants on verify@10 as well on the numbers of errors each
While seemingly syntactically correct, this definition uses of these models produce. We also experiment with different
the symbol lemma_eq_intro, which is not in scope. Adapt- ways to combine and present these prompt components to the
ing recent techniques to guide models based on lightweight model. Note that, since these experiments entail a large number
static analyses (e.g., based on the identifiers in scope) is a of fine-tuning runs, we chose to focus on the Phi-2 model, as
promising direction for the future [44], [45]. it is the most cost-effective. We believe our findings should
The last, but not the least, category of error is Semantic generalize to other models.
error, which is a collection of many different errors. Notable Results. Table III summarizes our results. The baseline
among these include Type Error: a value or identifier Phi2full model performance is as in Table II. When we remove
with incompatible type is used and Z3 Solver Error: the related examples from the prompt (Phi2-re ) and fine-
Z3 cannot prove the SMT query. Program repair techniques, tune a model, for intra-project, the performance drops by ∼5
both search-based [46] and LLM-assisted [2], may help reduce percentage points. Interestingly, without the related examples,
some of these errors, another direction for the future. the performance in cross-project examples increases slightly.
Since we extract the related examples from other projects in Model (Solid = Prompted, Hollow = Finetuned)
GPT-3.5 GPT-4 Phi-2 Orca-2 StarCoder
the training set, and there is little to no helpful examples in 100
73.05
Number of Definitions 30910 1955 5097 16442
Number of Projects 6 4 6 2
Number of Files 1669 125 328 1179 60
50.46
49.94
48.44
47.85
46.41
Avg. num of lines 4.14 3.77 3.99 3.93
37.20
40
33.47
Avg. num of tokens 45.34 44.90 45.56 30.30
31.89
31.53
29.63
28.80
# Simply Typed 10072 834 1536 7282
# Dependently Typed 15091 778 2363 6067 20
# Proofs 5747 343 1198 3093
0 Simply Typed Dependently Typed Proof
Table VII reports aggregate statistics for V2 dataset.
(a) Intra Project evaluation
100
B. Synthesis Results on FS TAR DATA S ET v2 Model (Solid = Prompted, Hollow = Finetuned)
GPT-3.5 Phi-2 Orca-2 StarCoder
73.36
performed better on the intra-project examples as compared to
54.93
54.70
60
the cross-project examples. We indicated that a cause of this
49.58
46.33
might be the manner in which related examples are retrieved,
39.06
40
32.87
29.80
quoting:
28.06
27.97
25.70
24.56
Both models also perform better on intra-project 20
examples than on cross-project examples. When
constructing prompts, we only retrieve related exam- 0 Simply Typed Dependently Typed Proof
ples from the training data split, which specifically (b) Cross Project evaluation
does not include any examples from the cross-project
set—as such, the cross-project examples benefit less Fig. 7: Verify@10 across different types of examples for different
models in the V2 dataset.
from retrieval augmentation.
We have revised the manner in which related examples
are retrieved. Rather than restricting related examples to only Our results make it clear that fine-tuned models are not over-
be fetched from the training data, for a given synthesis fitted to a particular version of the dataset. Our prior results ex-
problem, we search for examples from all files that do not tend well to the new dataset, and indeed with our new retrieval
depend on the file of the current problem—this dependence augmentation strategy, the difference in performance between
information is available in metadata present in the dataset. the intra-project and cross-project classes is not significant.
We employ the same embedding-based retrieval augmentation Perhaps surprisingly, the performance across the board on the
technique described in §IV, ranking examples according to cross-project class is better than in the intra-project class: this
their similarity to the goal type and include as many related is explained by noting that the cross-project class contains a
examples as can fit in the token window, in descending order larger proportion of simply typed definitions (44.3%) that the
of similarity score. intra-project class (33.5%). In addition, cross-project examples
Results on V2 with Improved RAG. Table VIII presents an are shorter than that of Intra-Project examples.
overview of our results on the FS TAR DATA S ET-V2 from the
fine-tuned model trained on the V1 training set. In addition,
Figure 7 shows performance of different models across differ-
ent types of problems. We have yet to run all the configurations
reported in Table II on the full v2 dataset. We do not report
results of the non-fine-tuned versions of Phi-2, Orca-2, and
StarCoder on v2, as the performance of these models without
fine-tuning on the v1 dataset was not competitive. More
notably, we have not yet completed a full run of GPT-4 on
the v2 dataset, as a full run is very resource intensive.