
Diffusion On Syntax Trees For Program Synthesis

Shreyas Kapur
University of California, Berkeley
[email protected]

Erik Jenner
University of California, Berkeley
[email protected]

Stuart Russell
University of California, Berkeley
[email protected]

arXiv:2405.20519v1 [cs.AI] 30 May 2024

Abstract

Large language models generate code one token at a time. Their autoregressive
generation process lacks the feedback of observing the program’s output. Training
LLMs to suggest edits directly can be challenging due to the scarcity of rich
edit data. To address these problems, we propose neural diffusion models that
operate on syntax trees of any context-free grammar. Similar to image diffusion
models, our method also inverts “noise” applied to syntax trees. Rather than
generating code sequentially, we iteratively edit it while preserving syntactic
validity, which makes it easy to combine this neural model with search. We
apply our approach to inverse graphics tasks, where our model learns to convert
images into programs that produce those images. Combined with search, our
model is able to write graphics programs, see the execution result, and debug them
to meet the required specifications. We additionally show how our system can
write graphics programs for hand-drawn sketches. Video results can be found at
https://tree-diffusion.github.io.

1 Introduction

Large language models (LLMs) have made remarkable progress in code generation, but their
autoregressive nature presents a fundamental challenge: they generate code token by token, without
access to the program’s runtime output from the previously generated tokens. This makes it difficult
to correct errors, as the model lacks the feedback loop of seeing the program’s output and adjusting
accordingly. While LLMs can be trained to suggest edits to existing code [6, 42, 17], acquiring
sufficient training data for this task is difficult.
In this paper, we introduce a new approach to program synthesis using neural diffusion models that
operate directly on syntax trees. Diffusion models have previously been used to great success in
image generation [14, 22, 31]. By leveraging diffusion, we let the model learn to iteratively refine
programs while ensuring syntactic validity. Crucially, our approach allows the model to observe the
program’s output at each step, effectively enabling a debugging process.
In the spirit of systems like AlphaZero [29], the iterative nature of diffusion naturally lends itself
to search-based program synthesis. By training a value model alongside our diffusion model, we
can guide the denoising process toward programs that are likely to achieve the desired output. This
allows us to efficiently explore the program space, making more informed decisions at each step of
the generation process.
We implement our approach for inverse graphics tasks, where we posit domain-specific languages for
drawing images. Inverse graphics tasks are naturally suitable for our approach since small changes in

Preprint. Under review.


Figure 1: Examples of programs recovered by our system. The top row shows a hand-drawn sketch of
an icon (left), the recovered program (middle), and the compilation of the recovered program (right).
The top two rows are for the constructive solid geometry language (CSG2D-Sketch). The last row
is an example output from our TinySVG environment that learns to invert hierarchical programs of
shapes and colors. Video examples can be found at https://tree-diffusion.github.io.

the code produce semantically meaningful changes in the rendered image. For example, a misplaced
shape on the image can be easily seen and fixed in program space.
Our main contributions for this work are (a) a novel approach to program synthesis using diffusion on
syntax trees and (b) an implementation of our approach for inverse graphics tasks that significantly
outperforms previous methods.

2 Background & Related Work


Neural program synthesis Neural program synthesis is a prominent area of research, in which
neural networks generate programs from input-output examples. Early work, such as Parisotto et al.
[23], demonstrated the feasibility of this approach. While modern language models can be directly
applied to program synthesis, combining neural networks with search strategies often yields better
results and guarantees. In this paradigm, the neural network guides the search process by providing
proposal distributions or scoring candidate programs. Examples of such hybrid methods include
Balog et al. [2], Ellis et al. [12], and Devlin et al. [9]. A key difference from our work is that these
methods construct programs incrementally, exploring a vast space of partial programs. Our approach,
in contrast, focuses on editing programs, allowing us to both grow programs from scratch and make
corrections based on the program execution.

Neural diffusion Neural diffusion models, a class of generative models, have demonstrated
impressive results for modeling high-dimensional data, such as images [14, 22, 31]. A neural diffusion
model takes samples from the data distribution (e.g. real-world images), incrementally corrupts the
data by adding noise, and trains a neural network to incrementally remove the noise. To generate new
samples, we can start with random noise and iteratively apply the neural network to denoise the input.

Diffusion for discrete data Recent work extends diffusion to discrete and structured data like
graphs [35], with applications in areas such as molecule design [15, 27, 8]. Notably, Lou et al. [20]
proposed a discrete diffusion model using a novel score-matching objective for language modeling.
Another promising line of work for generative modeling on structured data is generative flow networks
(GFlowNets) [3], where neural models construct structured data one atom at a time.

Figure 2: An overview of our method. Analogously to adding noise in image diffusion, we randomly
make small mutations to the syntax trees of programs. We then train a conditional neural model to
invert these small mutations. In the above example, we operate in a domain-specific language (DSL)
for creating 2D graphics using a constructive solid geometry language. The leftmost panel (z0 ) shows
the target image (bottom) alongside its program as a syntax tree (top). The y value of the circle gets
mutated from 16 to 10 in the second panel, making the black circle "jump" a little higher. Between z1
and z2 , we see that we can mutate the Subtract (−) node to a Circle node, effectively deleting it.

Diffusion for code generation Singh et al. [30] use a diffusion model for code generation. However,
their approach is to first embed text into a continuous latent space, train a continuous diffusion model
on that space, and then unembed at the end. This means that intermediate stages of the latent
representation are not trained to correspond to actual code; the embedded tokens only snap to their
nearest token embeddings during the last few denoising steps.
Direct code editing using neural models has also been explored. Chakraborty et al. [6] use a graph
neural network for code editing, trained on a dataset of real-world code patches. Similarly, Zhang
et al. [42] train a language model to edit code by modifying or inserting [MASK] tokens or deleting
existing tokens. They further fine-tune their model on real-world comments and patches. Unlike these
methods, our approach avoids the need for extensive code edit datasets and inherently guarantees
syntactic validity through our pretraining task.

Program synthesis for inverse graphics We are inspired by previous work by Sharma et al.
[28], Ellis et al. [10, 11], which also uses the CSG2D language. Sharma et al. [28] propose a
convolutional encoder and a recurrent model to go from images to programs. Ellis et al. [11] propose
a method to provide a neural model with the intermediate program execution output in a read–eval–
print loop (REPL). Unlike our method, the ability to execute partial graphics programs is a key
requirement for their work. Our system operates on complete programs and does not require a custom
partial compiler. As mentioned in their work, their policies are also brittle. Once the policy proposes
an object, it cannot undo that proposal. Hence, these systems require a large number of particles in a
Sequential Monte-Carlo (SMC) sampler to make the system less brittle to mistakes.

3 Method
The main idea behind our method is to develop a form of denoising diffusion for syntax trees,
analogous to denoising diffusion models for images.
Consider the example task from Ellis et al. [11] of generating a constructive solid geometry (CSG2D)
program from an image. In CSG2D, we can combine simple primitives like circles and quadrilaterals
using boolean operations like addition and subtraction to create more complex shapes, with the
context-free grammar (CFG),
S → S + S | S − S | Circle^r_{x,y} | Quad^{w,h}_{x,y,θ}.
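For illustration, a simplified CSG2D grammar of this kind might be written with the Lark parsing library (which our implementation also builds on); this sketch is not the exact grammar of Appendix B, and the token definitions are illustrative:

from lark import Lark

# A simplified sketch of the CSG2D grammar in Lark syntax (illustrative; see
# Appendix B for the grammars actually used). Rules are lowercase, terminals
# uppercase; every successful parse yields a syntax tree we can later mutate.
CSG2D_GRAMMAR = r"""
    s: "(" OP s s ")"                                       -> binop
     | "(" "Circle" NUMBER NUMBER NUMBER ")"                -> circle
     | "(" "Quad" NUMBER NUMBER NUMBER NUMBER NUMBER ")"    -> quad
    OP: "+" | "-"
    %import common.INT -> NUMBER
    %import common.WS
    %ignore WS
"""

parser = Lark(CSG2D_GRAMMAR, start="s")
tree = parser.parse("(- (Circle 4 8 8) (Quad 3 3 6 6 45))")
print(tree.pretty())  # nested binop/circle/quad nodes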

In Figure 2, z0 is our target program, and x0 is the rendered version of z0. Our task is to invert x0
to recover z0. Our noising process first randomly mutates y=16 to y=10. It then replaces the whole −
sub-tree, which contains two shapes, with a new sub-tree containing just one shape. Conditioned on
the image x0, and starting at z3, x3, we would like to train a neural network to reverse this noising
process and recover z0.
In the following sections, we will first describe how “noise” is added to syntax trees. Then, we will
detail how we train a neural network to reverse this noise. Finally, we will describe how we use this
neural network for search.

3.1 Sampling Small Mutations

Let zt be a program at time t. Let pN (zt+1 | zt) be the distribution over programs zt+1 obtained by
randomly mutating program zt. We want the mutations under pN to (1) be small and (2) produce
syntactically valid zt+1's.
To this end, we turn to the rich computer security literature on grammar-based fuzzing [41, 13, 32, 36].
To ensure the mutations are small, we first define a function σ(z) that gives us the “size” of program z.
For all our experiments, we define a set of terminals in our CFG to be primitives. As an example, the
primitives in our CSG2D language are {Quad, Circle}. In that language, we use σ(z) = σprimitive (z),
which counts the number of primitives. Other generic options for σ(z) could be the depth, number of
nodes, etc.
We then follow Luke [21] and Zeller et al. [41] to randomly sample programs from our CFG under
exact constraints, σmin < σ(z) ≤ σmax . We call this function ConstrainedSample(σmin , σmax ).
Setting a small value for σmax allows us to sample small programs randomly. We set σmax = σsmall
when generating small mutations.
To mutate a given program z, we first generate a set of candidate nodes in its tree under some σsmall ,
C = {n ∈ SyntaxTree(z) | σ(n) ≤ σsmall }.
Then, we uniformly sample a mutation node from this set,
m ∼ Uniform[C].
Since we have access to the full syntax tree and the CFG, we know which production rule produced
m, and can thus ensure syntactically valid mutations. For example, if m were a number, we know to
replace it with a number. If m were a general subexpression, we know we can replace it with any
general subexpression. Therefore, we sample m′ , which is m’s replacement as,
m′ ∼ ConstrainedSample(ProductionRule(m), σsmall ).
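For illustration, this mutation procedure can be sketched as follows; sigma, production_rule, constrained_sample, and the subtrees/replace methods are assumed stand-ins for the grammar utilities described above, not our exact implementation:

import random

SIGMA_SMALL = 2  # maximum primitive count of a replaced subtree

def mutate(z, sigma, production_rule, constrained_sample):
    """Apply one small, syntactically valid mutation to syntax tree z.

    sigma(n) counts the primitives in a subtree, production_rule(n) returns
    the CFG rule that produced node n, constrained_sample(rule, size) draws a
    fresh subtree from that rule with at most `size` primitives, and
    z.replace(m, m_new) returns a copy of z with subtree m swapped out.
    """
    # Candidate nodes: C = {n in SyntaxTree(z) | sigma(n) <= sigma_small}
    candidates = [n for n in z.subtrees() if sigma(n) <= SIGMA_SMALL]
    m = random.choice(candidates)                        # m ~ Uniform[C]

    # Resample from the same production rule, so validity is guaranteed.
    m_new = constrained_sample(production_rule(m), SIGMA_SMALL)
    return z.replace(m, m_new)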

3.2 Policy

3.2.1 Forward Process


We cast the program synthesis problem as an inference problem. Let p(x|z) be our observation model,
where x can be any kind of observation. For example, we will later use images x produced by our
program, but x could also be an execution trace, a version of the program compiled to bytecode, or
simply a syntactic property. Our task is to invert this observation model, i.e. produce a program z
given some observation x.
We first take some program z0, either from a dataset, D = {z^(0), z^(1), . . .}, or in our case, a randomly
sampled program from our CFG. We sample z0's such that σ(z0) ≤ σmax. We then add noise to z0
for s ∼ Uniform[1, smax] steps, where smax is a hyper-parameter, using
zt+1 ∼ pN (zt+1 | zt).
We then train a conditional neural network that models the distribution,
qϕ (zt−1 |zt , xt ; x0 ),
where ϕ are the parameters of the neural network, zt is the current program, xt is the current output
of the program, and x0 is the target output we are solving for.
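A sketch of how training tuples for this model could be assembled under the forward process; sample_program, mutate, and render are assumed stand-ins for the grammar sampler, the mutation routine of Section 3.1, and the program compiler:

import random

S_MAX = 5  # maximum number of noise steps per training example

def make_training_example(sample_program, mutate, render):
    """Sample one (corrupted program, current image, target image) tuple."""
    z0 = sample_program()          # clean program with sigma(z0) <= sigma_max
    x0 = render(z0)                # target observation the model conditions on

    zt = z0
    for _ in range(random.randint(1, S_MAX)):
        zt = mutate(zt)            # z_{t+1} ~ p_N(. | z_t)
    xt = render(zt)                # current (corrupted) observation

    # The network q_phi(z_{t-1} | z_t, x_t; x_0) is trained to undo this noise.
    return zt, xt, x0, z0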

3.2.2 Reverse Mutation Paths

Since we have access to the ground-truth mutations, we can generate targets to train a neural
network by simply reversing the sampled trajectory through the forward process Markov-Chain,
z0 → z1 → . . . At first glance, this may seem like a reasonable choice. However, training to simply
invert the last mutation can create a much noisier training signal for the neural network.
Consider the case where, within a much larger syntax tree, a color was mutated as,

Red → Blue → Green.

The color in our target image, x0 , is Red, while the color in our mutated image, x2 , is Green. If we
naively teach the model to invert the above Markov chain, we are training the network to turn the
Green to a Blue, even though we could have directly trained the network to go from Green to a Red.
Therefore, to create a better training signal, we compute an edit path between the target tree and the
mutated tree. We use a tree edit path algorithm loosely based on the tree edit distance introduced by
Pawlik and Augsten [25, 24]. The general tree edit distance problem allows for the insertion, deletion,
and replacement of any node. In contrast, our trees can only be edited under an action space that
permits small mutations. For two trees, zA and zB, we linearly compare their syntax structure.
Changes that are already ≤ σsmall are added to our mutation list. For changes that are > σsmall,
we find the first mutation that reduces the distance between the two trees. Therefore, for any two
programs, zA and zB, we can compute the first step of the mutation path in O(|zA| + |zB|) time.
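In code, the supervision target for a corrupted sample can then be read off from the first step of this edit path rather than from the last mutation that was applied; a short sketch, where tree_edit_first_step is an assumed stand-in for the path algorithm detailed in Appendix E:

def make_edit_target(zt, z0, tree_edit_first_step):
    """Supervision target: the first small edit that moves z_t toward z_0.

    tree_edit_first_step(zt, z0) is assumed to return the list of small
    mutations produced by the linear-time comparison described above, e.g.
    turning Green directly back into Red instead of retracing
    Green -> Blue -> Red.
    """
    edits = tree_edit_first_step(zt, z0)
    return edits[0] if edits else None  # None: z_t already matches z_0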

3.3 Value Network & Search

We additionally train a value network, vϕ (xA , xB ), which takes as input two rendered images, xA
and xB , and predicts the edit distance between the underlying programs that generated those images.
Since we have already computed edit paths between trees during training, we have direct access to
the ground-truth program edit distance for any pair of rendered images, allowing us to train this value
network in a supervised manner.
Using our policy, qϕ(zt−1 | zt, xt; x0), and our value network, vϕ(x_t^A, x_t^B), we can perform beam search
for a given target image, x0, and a randomly initialized program zt. At each iteration, we maintain a
collection of nodes in our search tree with the most promising values and only expand those nodes.
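A sketch of the resulting beam search loop; policy_propose and value are assumed wrappers around samples from qϕ and predictions from vϕ, render is the program compiler, and images are assumed to be NumPy arrays:

def beam_search(z_init, x_target, render, policy_propose, value,
                beam_size=64, max_expansions=5000):
    """Search for a program whose rendering matches x_target.

    policy_propose(z, x, x_target) samples candidate edited programs from
    q_phi, and value(x, x_target) predicts the remaining edit distance from
    v_phi (lower is better). Both are assumed interfaces.
    """
    beam = [(value(render(z_init), x_target), z_init)]
    expansions = 0
    while expansions < max_expansions:
        candidates = []
        for _, z in beam:
            for z_next in policy_propose(z, render(z), x_target):
                x_next = render(z_next)
                expansions += 1
                if (x_next == x_target).all():   # exact match found
                    return z_next
                candidates.append((value(x_next, x_target), z_next))
        # Keep only the most promising nodes for the next iteration.
        beam = sorted(candidates, key=lambda c: c[0])[:beam_size]
    return min(beam, key=lambda c: c[0])[1]      # best effort within budget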

3.4 Architecture

Figure 3 shows an overview of our neural architecture. We use a vision-language model described
by Tsimpoukelli et al. [33] as our denoising model, qϕ (zt−1 |zt , xt ; x0 ). We use an off-the-shelf
implementation [38] of NF-ResNet-26 as our image encoder, which is a normalizer-free convolutional
architecture proposed by Brock et al. [4] to avoid test time instabilities with Batch-Norm [40]. We
implement a custom tokenizer, using the terminals of our CFG as tokens. The rest of the edit model
is a small decoder-only transformer [34, 26].
We add two additional types of tokens: an <EDIT> token, which serves as a start-of-sentence token
for the model; and <POS x> tokens, which allow the model to reference positions within its context.
Given a current image, a target image, and a current tokenized program, we train this transformer
model to predict the edit position and the replacement text autoregressively. While making predictions,
the decoding is constrained under the grammar. We mask out the prediction logits to only include edit
positions that represent nodes in the syntax tree, and only produce replacements that are syntactically
valid for the selected edit position.
We set σsmall = 2, which means the network is only allowed to produce edits with fewer than two
primitives. For training data, we sample an infinite stream of random expressions from the CFG.
We choose a random number of noise steps, s ∈ [1, 5], to produce a mutated expression. For some
percentage of the examples, ρ, we instead sample a completely random new expression as our mutated
expression. For each environment we tested, we trained for 3 days on a single Nvidia A6000 GPU.
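The grammar-constrained decoding step amounts to masking logits before sampling; a minimal PyTorch sketch, where valid_token_ids is assumed to be computed from the CFG given the selected edit position and the partial replacement:

import torch

def constrained_sample(logits, valid_token_ids):
    """Sample the next token, allowing only grammar-valid continuations.

    logits has shape (vocab_size,); valid_token_ids is the set of token ids
    the CFG permits at this decoding step (an assumed helper computes it).
    """
    mask = torch.full_like(logits, float("-inf"))
    mask[list(valid_token_ids)] = 0.0
    probs = torch.softmax(logits + mask, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()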

Figure 3: We train qϕ(zt−1 | zt, xt; x0) as a decoder-only vision-language transformer following
Tsimpoukelli et al. [33]. We use an NF-ResNet as the image encoder, which is a normalizer-free
convolutional architecture proposed by Brock et al. [4]. The image encoder encodes the current
image, xt, and the target image, x0. The current program is tokenized according to the vocabulary
of our context-free grammar. The decoder first predicts an edit location in the current program, and
then the tokens that the contents of that location should be replaced with. We constrain the
autoregressive decoding with our context-free grammar by masking out all but the valid token logits.

4 Experiments

4.1 Environments

We conduct experiments on four domain-specific graphics languages, with complete grammar
specifications provided in Appendix B.

CSG2D A 2D constructive solid geometry language where primitive shapes are added and subtracted
to create more complex forms, as explored in our baseline methods [11, 28]. We also create
CSG2D-Sketch, which has an added observation model that simulates hand-drawn sketches using
the algorithm from Wood et al. [39].

TinySVG A language featuring primitive shapes with color, along with Arrange commands for
horizontal and vertical alignment, and Move commands for shape offsetting. Figure 1 portrays
an example program. Unlike the compositional nature of CSG2D, TinySVG is hierarchical: sub-
expressions can be combined into compound objects for high-level manipulation. We also create
Rainbow, a simplified version of TinySVG without Move commands, which we use for ablation studies
due to its reduced computational demands.
We implemented these languages using the Lark [19] and Iceberg [16] Python libraries, with our
tree-diffusion implementation designed to be generic and adaptable to any context-free grammar and
observation model.

4.2 Baselines

We use two prior works, Ellis et al. [11] and CSGNet [28] as baseline methods.

CSGNet Sharma et al. [28] employed a convolutional and recurrent neural network to generate
program statements from an input image. For a fair comparison, we re-implemented CSGNet using the
same vision-language transformer architecture as our method, representing the modern autoregressive
approach to code generation. We use rejection sampling, repeatedly generating programs until a
match is found.

[Figure 4 plots omitted: two panels, "Performance on the CSG2D Environment" and "Performance on the TinySVG Environment"; y-axis: Fraction of Problems Solved; x-axis: Number of Nodes Expanded; legend: Tree Diffusion Search (Ours), Tree Diffusion Rollouts (Ours), REPL Flow, CSGNet.]

Figure 4: Performance of our approach in comparison to baseline methods in CSG2D and TinySVG
languages. We give the methods n = 256 images from the test set and measure the number of nodes
expanded to find a solution. The auto-regressive baseline was queried with rejection sampling. Our
policy outperforms previous methods, and our policy combined with search helps boost performance
further. Error bars show standard deviation across 5 random seeds.

REPL Flow Ellis et al. [11] proposed a method to build programs one primitive at a time until
all primitives have been placed. They also give a policy network access to a REPL, i.e., the ability
to execute code and see outputs. Notably, the current image the policy sees is rendered from the current
partial program, which requires a custom partial compiler. This is straightforward for CSG2D since
it is a compositional language. We simply render the shapes placed so far. For TinySVG, it is not
immediately obvious how this partial compiler should be written. This is because the rendering
happens bottom-up. Primitives get arranged, and those arrangements get arranged again (see Figure 1).
Therefore, we only use this baseline method with CSG2D. Due to its similarities with Generative Flow
Networks [3], we refer to our modified method as “REPL Flow”.

Test tasks For TinySVG we used a held-out test set of randomly generated expressions and their
images. For the CSG2D task, we noticed that all methods were at ceiling performance on an in-
distribution held-out test set. In Ellis et al. [11], the authors created a harder test set with more objects.
However, simply adding more objects in an environment like CSG2D resulted in simpler final scenes,
since sampling a large object that subtracts a large part of the scene becomes more likely. Instead, to
generate a hard test set, we filtered for images at the 95th percentile or more on incompressibility
with the LZ4 [7, 37] compression algorithm.
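For illustration, such a filter can be sketched with the lz4 Python bindings (the exact thresholding used for the paper may differ):

import numpy as np
import lz4.frame

def incompressibility(image):
    """Ratio of LZ4-compressed size to raw size; higher = less compressible."""
    raw = np.ascontiguousarray(image).tobytes()
    return len(lz4.frame.compress(raw)) / len(raw)

def build_hard_test_set(candidates, keep_percentile=95):
    """Keep only scenes at or above the given incompressibility percentile.

    candidates is assumed to be a list of (image, program) pairs.
    """
    scores = np.array([incompressibility(img) for img, _ in candidates])
    threshold = np.percentile(scores, keep_percentile)
    return [cand for cand, s in zip(candidates, scores) if s >= threshold]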

Evaluation In CSG2D, we accepted a predicted program as matching the specification if it achieved
an intersection-over-union (IoU) of 0.99 or more. In TinySVG, we accepted an image if 99% of the
pixels were within 0.005 ≈ 1/256 of the target.
All methods were trained with supervised learning and were not fine-tuned with reinforcement
learning. All methods used the grammar-based constrained decoding method described in Section 3.4,
which ensured syntactically correct outputs. While testing, we measured performance based on the
number of compilations needed for a method to complete the task.
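For reference, the acceptance checks described above can be sketched in NumPy as follows (assuming binary masks for CSG2D and images scaled to [0, 1] for TinySVG):

import numpy as np

def csg2d_match(pred_mask, target_mask, iou_threshold=0.99):
    """Accept a CSG2D prediction if its IoU with the target is high enough."""
    pred, target = pred_mask.astype(bool), target_mask.astype(bool)
    iou = np.logical_and(pred, target).sum() / np.logical_or(pred, target).sum()
    return iou >= iou_threshold

def tinysvg_match(pred_img, target_img, tol=0.005, frac=0.99):
    """Accept a TinySVG prediction if 99% of pixels are within ~1/256."""
    close = np.abs(pred_img - target_img) <= tol
    return close.mean() >= frac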
Figure 4 shows the performance of our method compared to the baseline methods. In both the CSG2D
and TinySVG environments, our tree diffusion policy rollouts significantly outperform the policies of
previous methods. Our policy combined with beam search further improves performance, solving
problems with fewer calls to the renderer than all other methods. Figure 6 shows successful qualitative
examples of our system alongside outputs of baseline methods. We note that our system can fix
smaller issues that other methods miss. Figure 7 shows some examples of recovered programs from
sketches in the CSG2D-Sketch language, showing how the observation model does not necessarily
need to be a deterministic rendering; it can also consist of stochastic hand-drawn images.

[Figure 5 plots omitted: (a) "Ablations on Tree Diffusion" comparing Control, No Reverse Path, No Noising, and No Current Image; (b) "Effects of Expression Initialization" comparing ρ = 0.0, 0.2, 0.5, 0.8, 1.0; y-axis: Fraction of Problems Solved; x-axis: Number of Nodes Expanded.]
Figure 5: Effects of changing several design decisions of our system. We train smaller models
on the Rainbow environment. We give the model n = 256 test problems to solve. In (a), for No
Reverse Path, we train the model without computing an explicit reverse path, only using the last
step of the noising process as targets. For No Current Image, we train a model that does not get to
see the compiled output image of the program it is editing. For No Noising, instead of using our
noising process, we generate two random expressions and use the path between them as targets. In
(b) we further examine the effect of the training mixture between forward diffusion (ρ = 0.0) and pure random
initialization (ρ = 1.0). Error bars show standard deviation across 5 random seeds.

4.3 Ablations

To understand the impact of our design decisions, we performed ablation studies on the simplified
Rainbow environment using a smaller transformer model.
First, we examined the effect of removing the current image (no REPL) from the policy network’s
input. As shown in Figure 5(a), this drastically hindered performance, confirming the importance of a
REPL-like interface observed by Ellis et al. [11].
Next, we investigated the necessity of our reverse mutation path algorithm. While training on the
last mutation step alone provides a valid path, it introduces noise by potentially targeting suboptimal
intermediate states. Figure 5(a) demonstrates that utilizing the reverse mutation path significantly
improves performance, particularly in finding solutions with fewer steps. However, both methods
eventually reach similar performance levels, suggesting that a noisy path, while less efficient, can
still lead to a solution.
Finally, we explored whether the incremental noise process is crucial, given our tree edit path
algorithm. Couldn’t we directly sample two random expressions, calculate the path, and train the
network to imitate it? We varied the training data composition between pure forward diffusion
(ρ = 0.0) and pure random initialization (ρ = 1.0) as shown in Figure 5(b). We found that a small
proportion (ρ = 0.2) of pure random initializations combined with forward diffusion yielded the
best results. This suggests that forward diffusion provides a richer training distribution around target
points, while random initialization teaches the model to navigate the program space more broadly.
The emphasis on fine-grained edits from forward diffusion proves beneficial for achieving exact pixel
matches in our evaluations.

5 Conclusion

In this work, we proposed a neural diffusion model that operates on syntax trees for program synthesis.
We implemented our approach for inverse graphics tasks, where our task is to find programs that
would render a given image. Unlike previous work, our model can construct programs, view their
output, and in turn edit these programs, allowing it to fix its mistakes in a feedback loop. We
quantitatively showed how our approach outperforms our baselines at these inverse graphics tasks.
We further studied the effects of key design decisions via ablation experiments.

Figure 6: Qualitative examples of our method and baselines on two inverse graphics languages,
CSG2D (top two rows) and TinySVG (bottom two rows). The leftmost column shows the ground-truth
rendered programs from our test set. The next columns show rendered programs from various
methods. Our methods are able to finely adjust and match the ground-truth programs more closely.

Figure 7: Examples of programs recovered for input sketches in the CSG2D-Sketch language. The
input sketches come from our observation model that simulates hand-drawn sketches (top row). The
rendered output programs (bottom row) are able to match the input sketches by adding and subtracting
basic shapes. Video results for these sketches can be found at https://tree-diffusion.github.io.

Limitations There are several significant limitations to this work. First, we operate on expressions
with no variable binding, loops, strings, continuous parameters, etc. While we think our approach
can be extended to support these, doing so needs more work and careful design. Current large language
models can write complicated programs in many domains, while we focus on a very narrow task.
Additionally, our approach might just be particularly suited to inverse graphics, where small
mutations make informative changes in the image output.

Future Work In the future, we hope to be able to leverage large-scale internet data on programs to
train our system, making small mutations to their syntax tree and learning to invert them. We would
also like to study this approach in domains other than inverse graphics. Additionally, we would like
to extend this approach to work with both the discrete syntax structure and continuous floating-point
constants.

Impact Given the narrow scope of the implementation, we don't think there is a direct societal
impact, other than to inform future research directions in machine-assisted programming. We hope
future directions of this work, specifically in inverse graphics, help artists, engineering CAD modelers,
and programmers with a tool to quickly convert ideas into precise programs for downstream use.

Acknowledgments and Disclosure of Funding


We would like to thank Kathy Jang, David Wu, Cam Allen, Sam Toyer, Eli Bronstein, Koushik Sen,
and Pieter Abbeel for discussions, feedback, and technical support. Shreyas was supported in part
by the AI2050 program at Schmidt Futures (Grant G-22-63471). Erik was supported by fellowships
from the Future of Life Institute and Open Philanthropy.

References
[1] Jason Ansel, Edward Yang, Horace He, Natalia Gimelshein, Animesh Jain, Michael Voznesensky, Bin
Bao, Peter Bell, David Berard, Evgeni Burovski, Geeta Chauhan, Anjali Chourdia, Will Constable, Alban
Desmaison, Zachary DeVito, Elias Ellison, Will Feng, Jiong Gong, Michael Gschwind, Brian Hirsh,
Sherlock Huang, Kshiteej Kalambarkar, Laurent Kirsch, Michael Lazos, Mario Lezcano, Yanbo Liang,
Jason Liang, Yinghai Lu, CK Luk, Bert Maher, Yunjie Pan, Christian Puhrsch, Matthias Reso, Mark
Saroufim, Marcos Yukio Siraichi, Helen Suk, Michael Suo, Phil Tillet, Eikan Wang, Xiaodong Wang,
William Wen, Shunting Zhang, Xu Zhao, Keren Zhou, Richard Zou, Ajit Mathews, Gregory Chanan, Peng
Wu, and Soumith Chintala. PyTorch 2: Faster Machine Learning Through Dynamic Python Bytecode
Transformation and Graph Compilation. In 29th ACM International Conference on Architectural Support
for Programming Languages and Operating Systems, Volume 2 (ASPLOS ’24). ACM, April 2024. doi:
10.1145/3620665.3640366. URL https://pytorch.org/assets/pytorch2-2.pdf.

[2] Matej Balog, Alexander L Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. Deepcoder:
Learning to write programs. arXiv preprint arXiv:1611.01989, 2016.

[3] Yoshua Bengio, Salem Lahlou, Tristan Deleu, Edward J Hu, Mo Tiwari, and Emmanuel Bengio. Gflownet
foundations. Journal of Machine Learning Research, 24(210):1–55, 2023.

[4] Andy Brock, Soham De, Samuel L Smith, and Karen Simonyan. High-performance large-scale image
recognition without normalization. In International Conference on Machine Learning, pages 1059–1071.
PMLR, 2021.

[5] Edwin Catmull and Raphael Rom. A class of local interpolating splines. In Computer aided geometric
design, pages 317–326. Elsevier, 1974.

[6] Saikat Chakraborty, Yangruibo Ding, Miltiadis Allamanis, and Baishakhi Ray. Codit: Code editing with
tree-based neural models. IEEE Transactions on Software Engineering, 48(4):1385–1399, 2020.

[7] Yann Collet et al. Lz4: Extremely fast compression algorithm. code.google.com, 2013.

[8] Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, and Tommi Jaakkola. Diffdock: Diffusion
steps, twists, and turns for molecular docking. arXiv preprint arXiv:2210.01776, 2022.

[9] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel rahman Mohamed, and Pushmeet
Kohli. RobustFill: Neural program learning under noisy I/O. In Proceedings of the 34th International
Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages
990–998. PMLR, 06–11 Aug 2017.

[10] Kevin Ellis, Daniel Ritchie, Armando Solar-Lezama, and Josh Tenenbaum. Learning to infer graphics
programs from hand-drawn images. Advances in neural information processing systems, 31, 2018.

[11] Kevin Ellis, Maxwell Nye, Yewen Pu, Felix Sosa, Josh Tenenbaum, and Armando Solar-Lezama. Write,
execute, assess: Program synthesis with a repl. Advances in Neural Information Processing Systems, 32,
2019.

[12] Kevin Ellis, Catherine Wong, Maxwell Nye, Mathias Sablé-Meyer, Lucas Morales, Luke Hewitt, Luc
Cary, Armando Solar-Lezama, and Joshua B Tenenbaum. Dreamcoder: Bootstrapping inductive program
synthesis with wake-sleep library learning. In Proceedings of the 42nd acm sigplan international conference
on programming language design and implementation, pages 835–850, 2021.

[13] Patrice Godefroid, Adam Kiezun, and Michael Y Levin. Grammar-based whitebox fuzzing. In Proceedings
of the 29th ACM SIGPLAN conference on programming language design and implementation, pages
206–215, 2008.

[14] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural
information processing systems, 33:6840–6851, 2020.

[15] Emiel Hoogeboom, Vıctor Garcia Satorras, Clément Vignac, and Max Welling. Equivariant diffusion for
molecule generation in 3d. In International conference on machine learning, pages 8867–8887. PMLR,
2022.

[16] IceBerg Contributors. IceBerg – Compositional Graphics and Diagramming. github, July 2023. URL
https://github.com/revalo/iceberg.

[17] Matthew Jin, Syed Shahriar, Michele Tufano, Xin Shi, Shuai Lu, Neel Sundaresan, and Alexey Svy-
atkovskiy. Inferfix: End-to-end program repair with llms. In Proceedings of the 31st ACM Joint European
Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages
1646–1656, 2023.

[18] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.

[19] Lark Contributors. Lark - a parsing toolkit for Python. github, August 2014. URL
https://github.com/lark-parser/lark.

[20] Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion language modeling by estimating the
ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023.

[21] Sean Luke. Two fast tree-creation algorithms for genetic programming. IEEE Transactions on Evolutionary
Computation, 4(3):274–283, 2000.

[22] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya
Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided
diffusion models. arXiv preprint arXiv:2112.10741, 2021.

[23] Emilio Parisotto, Abdel rahman Mohamed, Rishabh Singh, Lihong Li, Dengyong Zhou, and Pushmeet
Kohli. Neuro-symbolic program synthesis, 2016.

[24] Mateusz Pawlik and Nikolaus Augsten. Efficient computation of the tree edit distance. ACM Transactions
on Database Systems (TODS), 40(1):1–40, 2015.

[25] Mateusz Pawlik and Nikolaus Augsten. Tree edit distance: Robust and memory-efficient. Information
Systems, 56:157–173, 2016.

[26] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language
models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[27] Arne Schneuing, Yuanqi Du, Charles Harris, Arian Jamasb, Ilia Igashov, Weitao Du, Tom Blundell, Pietro
Lió, Carla Gomes, Max Welling, et al. Structure-based drug design with equivariant diffusion models.
arXiv preprint arXiv:2210.13695, 2022.

[28] Gopal Sharma, Rishabh Goyal, Difan Liu, Evangelos Kalogerakis, and Subhransu Maji. Csgnet: Neural
shape parser for constructive solid geometry. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 5515–5523, 2018.

[29] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez,
Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. A general reinforcement learning
algorithm that masters chess, shogi, and go through self-play. Science, 362(6419):1140–1144, 2018.

[30] Mukul Singh, José Cambronero, Sumit Gulwani, Vu Le, Carina Negreanu, and Gust Verbruggen. Codefu-
sion: A pre-trained diffusion model for code generation, 2023.

[31] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben
Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint
arXiv:2011.13456, 2020.

[32] Prashast Srivastava and Mathias Payer. Gramatron: Effective grammar-aware fuzzing. In Proceedings of
the 30th acm sigsoft international symposium on software testing and analysis, pages 244–256, 2021.

[33] Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal
few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:
200–212, 2021.

[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems,
30, 2017.
[35] Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard.
Digress: Discrete denoising diffusion for graph generation. arXiv preprint arXiv:2209.14734, 2022.
[36] Junjie Wang, Bihuan Chen, Lei Wei, and Yang Liu. Superion: Grammar-aware greybox fuzzing. In 2019
IEEE/ACM 41st International Conference on Software Engineering (ICSE), pages 724–735. IEEE, 2019.
[37] Terry A. Welch. A technique for high-performance data compression. Computer, 17(06):8–19, 1984.
[38] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models,
2019.
[39] Jo Wood, Petra Isenberg, Tobias Isenberg, Jason Dykes, Nadia Boukhelifa, and Aidan Slingsby. Sketchy
rendering for information visualization. IEEE transactions on visualization and computer graphics, 18
(12):2749–2758, 2012.
[40] David Xing Wu, Chulhee Yun, and Suvrit Sra. On the training instability of shuffling sgd with batch
normalization. In International Conference on Machine Learning, pages 37787–37845. PMLR, 2023.
[41] Andreas Zeller, Rahul Gopinath, Marcel Böhme, Gordon Fraser, and Christian Holler. Efficient grammar
fuzzing. In The Fuzzing Book. CISPA Helmholtz Center for Information Security, 2023. URL
https://www.fuzzingbook.org/html/GrammarFuzzer.html. Retrieved 2023-11-11 18:18:06+01:00.
[42] Jiyang Zhang, Sheena Panthaplackel, Pengyu Nie, Junyi Jessy Li, and Milos Gligoric. Coditt5: Pretraining
for source code and natural language editing. In Proceedings of the 37th IEEE/ACM International
Conference on Automated Software Engineering, pages 1–12, 2022.

Appendix

A Mutation Algorithm

Figure 8: An example expression from CSG2D represented as a tree to help illustrate the mutation
algorithm. The green nodes are candidate nodes with primitives count σ(z) ≤ 2. Our mutation
algorithm only mutates these nodes.

Here we provide additional details on how we sample small mutations for tree diffusion. We will first
repeat the algorithm mentioned in Section 3 in more detail.
Our goal is to take some syntax tree and apply a small random mutation. The only type of mutation
we consider is a replacement mutation. We first collect a set of candidate nodes that we are allowed
to replace. If we select a node too high up in the tree, we end up replacing a very large part of the tree.
To make sure we only change a small part of the tree we only select nodes with ≤ σsmall primitives.
In Figure 8, if we set σsmall = 2, we get all the green nodes. We sample a node, m, uniformly from
this green set. We know the production rule for m from the CFG. For instance, if we selected node
15, the only replacements allowed are + or −. If we selected node 46, we can only replace it with
an angle. If we selected node 11, we can replace it with any subexpression. When we sample a
replacement, we ensure that the replacement is ≤ σsmall and that it is different from m. Here we show
4 random mutation steps on a small expression:

(+ (+ (+ (Circle A D 4) (Quad F E 4 6 K)) (Quad 3 E C 2 M)) (Circle C 2 1))
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ --> (Circle 0 8 A)
(+ (+ (Circle 0 8 A) (Quad 3 E C 2 M)) (Circle C 2 1))
   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ --> (Quad 1 0 A 3 H)
(+ (Quad 1 0 A 3 H) (Circle C 2 1))
                            ^ --> 4
(+ (Quad 1 0 A 3 H) (Circle 4 2 1))
                            ^ --> 8
(+ (Quad 1 0 A 3 H) (Circle 8 2 1))

Figure 9: Examples of images drawn with the (a) CSG2D and (b) TinySVG languages.

During our experiments, we realized that this style of random mutation biases expressions to get
longer on average, since there are many more leaves than parents of leaves. This made the network
better at going from very long expressions to target expressions, but not very good at editing shorter
expressions into longer ones. This also made our model’s context window run out frequently when
expressions got too long. To make the mutation length effects more uniform, we add a slight
modification to the algorithm mentioned above and in Section 3.
For each of the candidate nodes, we find the set of production rules for the candidates. We then select
a random production rule, r, and then select a node from the candidates with the production rule r, as
follows,

C = {n ∈ SyntaxTree(z) | σ(n) ≤ σsmall }


R = {ProductionRule(n) | n ∈ C}
r ∼ Uniform[R]
M = {n ∈ C | ProductionRule(n) = r}
m ∼ Uniform[M ]

For CSG2D, this approach empirically made mutations that shortened expressions 30.8% of the time,
kept length equal 49.2% of the time, and lengthened expressions 20.0% of the time (n = 10,000).
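A sketch of this rule-balanced candidate selection; sigma, production_rule, and the subtrees method are the same assumed helpers as in the mutation sketch of Section 3.1:

import random
from collections import defaultdict

def sample_mutation_node(tree, sigma, production_rule, sigma_small=2):
    """Pick a node to mutate, first by production rule, then uniformly.

    Sampling the rule first keeps leaf-heavy trees from dominating the draw,
    which empirically balances how often mutations grow vs. shrink programs.
    """
    candidates = [n for n in tree.subtrees() if sigma(n) <= sigma_small]
    by_rule = defaultdict(list)
    for node in candidates:
        by_rule[production_rule(node)].append(node)
    rule = random.choice(list(by_rule))          # r ~ Uniform[R]
    return random.choice(by_rule[rule])          # m ~ Uniform[M]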

B Context-Free Grammars

Here we provide the exact context-free grammars of the languages used in this work.

Figure 10: Examples of the same scene being called multiple times by our sketch observation model.

B.1 CSG2D

s: binop | circle | quad

binop: (op s s)
op: + | -

number: [0 to 15]
angle: [0 to 315]

// (Circle radius x y)
circle: (Circle r=number x=number y=number)

// (Quad x y w h angle)
quad: (Quad x=number y=number w=number h=number theta=angle)

B.2 TinySVG

s: arrange | rect | ellipse | move

direction: v | h
color: red | green | blue | yellow | purple | orange | black | white | none
number: [0 - 9]
sign: + | -

rect: (Rectangle w=number h=number fill=color stroke=color border=number)

ellipse: (Ellipse w=number h=number fill=color stroke=color border=number)

// (Arrange direction left right gap)
arrange: (Arrange direction s s gap=number)

move: (Move s dx=sign number dy=sign number)

C Sketch Simulation

As mentioned in the main text, we implement the CSG2D-Sketch environment, which is the same as
CSG2D with a hand-drawn sketch observation model. We do this primarily to show how this sort of
generative model can be applied to a real-world task, and that observations do not need to
be deterministic. Our sketch algorithm can be found in our codebase and is based on the approach
described in Wood et al. [39].

Figure 11: Examples of thresholding scene images using the LZ4 compression algorithm. Panel (a) shows
our test set; panel (b) shows our training distribution.
Our compiler uses Iceberg [16] and Google’s 2D Skia library to perform boolean operations on
primitive paths. The resulting path consists of line and cubic bézier commands. We post-process
these commands to generate sketches. For each command, we first add Gaussian noise to all points
stated in the command. For each line, we randomly pick points near the 50% and 75% marks of the
line, add Gaussian noise, and fit a Catmull-Rom spline [5]. For all curves, we sample random points
at uniform intervals and fit Catmull-Rom splines. We have a special condition for circles, where
we randomize the start and end points to create the effect of the pen lifting off.
Additionally, we randomize the stroke thickness.
Figure 10 shows the same program rendered multiple times using our randomized sketch simulator.
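For illustration, the line-jittering step can be sketched as follows; this is a simplified stand-in for the codebase's implementation, using a uniform Catmull-Rom spline, and the parameter values are illustrative:

import numpy as np

def catmull_rom(p0, p1, p2, p3, t):
    """Point on a uniform Catmull-Rom segment between p1 and p2 at t in [0, 1]."""
    t2, t3 = t * t, t * t * t
    return 0.5 * ((2 * p1) + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t3)

def sketchify_line(start, end, jitter=1.5, samples=20, rng=None):
    """Turn a straight line into a wobbly, hand-drawn-looking polyline."""
    rng = rng if rng is not None else np.random.default_rng()
    start, end = np.asarray(start, float), np.asarray(end, float)
    # Control points near the 50% and 75% marks, plus jittered endpoints.
    controls = [start + rng.normal(0, jitter, 2),
                start + 0.50 * (end - start) + rng.normal(0, jitter, 2),
                start + 0.75 * (end - start) + rng.normal(0, jitter, 2),
                end + rng.normal(0, jitter, 2)]
    pts = [controls[0]] + controls + [controls[-1]]  # pad ends for the spline
    out = []
    for i in range(len(pts) - 3):
        for t in np.linspace(0, 1, samples, endpoint=False):
            out.append(catmull_rom(pts[i], pts[i + 1], pts[i + 2], pts[i + 3], t))
    out.append(pts[-2])
    return np.array(out)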

D Complexity Filtering
As mentioned in Section 4, while testing our method alongside baseline methods, we reached ceiling
performance for all our methods. Ellis et al. [11] got around this by creating a "hard" test set by
sampling more objects. For us, when we increased the number of objects to increase complexity, we
saw that it increased the probability that a large object would be sampled and subtract from the whole
scene, resulting in simpler scenes. This is shown by Figure 11(b), which is our training distribution.
Even though we sample a large number of objects, the scenes don’t look visually interesting. When
we studied the implementation details of Ellis et al. [11], we noticed that during random generation
of expressions, they ensured that each shape did not change more than 60% or less than 10% of the
pixels in the scene. Instead of modifying our tree sampling method, we chose to rejection
sample based on the compressibility of the final rendered image.

E Tree Path Algorithm


Algorithm 1 shows the high-level pseudocode for how we find the first step of mutations to transform
tree A into tree B. We linearly walk down both trees until we find a node that is different. If the
target node is small, i.e., its σ(z) ≤ σsmall , then we can simply mutate the source to the target. If
the target node is larger, we sample a random small expression with the correct production rule, and
compute the path from this small expression to the target. This gives us the first step to convert the
source node to the target node. Repeatedly using Algorithm 1 gives us the full path to convert one
expression to another. We note that this path is not necessarily the optimal path, but a valid path that
is less noisy than the path we would get by simply chasing the last random mutation.

Algorithm 1 TreeDiff: Find the first set of mutations to turn one tree to another.
Require: treeA: source tree, treeB: target tree, max_primitives: maximum primitives
Ensure: List of mutations to transform treeA into treeB
1: if NodeEq(treeA, treeB) then
2: mutations ← []
3: for each (childA, childB) in zip(treeA.children, treeB.children) do
4: mutations ← mutations + TreeDiff(childA, childB, max_primitives)
5: end for
6: return mutations
7: else
8: if treeA.primitive_count ≤ max_primitives and treeB.primitive_count ≤
max_primitives then
9: return [Mutation(treeA.start_pos, treeA.end_pos, treeB.expression)]
10: else
11: new_expression ← GenerateNewExpression(treeA.production_rule,
max_primitives)
12: tightening_diffs ← TreeDiff(new_expression, treeB, max_primitives)
13: new_expression ← ApplyAllMutations(new_expression, tightening_diffs)
14: return [Mutation(treeA.start_pos, treeA.end_pos, new_expression)]
15: end if
16: end if

Figure 12: A conceptual illustration of why we need tree path-finding. The red path represents the
naive target for the neural network. The green path represents the path-finding algorithm’s target.

Figure 12 conceptually shows why computing this tree path might be necessary. The circle represents
the space of programs. Consider a starting program z0 . Each of the black arrows represents a random
mutation that kicks the program to a slightly different program, so z0 → z1 , then z2 → z3 . . .. If we
provide the neural network the supervised target to go from z5 to z4 , we are teaching the network to
take an inefficient path to z0 . The green path is the direct path from z5 → z0 .

F Implementation Details
We implement our architecture in PyTorch [1]. For our image encoder we use the NF-ResNet26 [4]
implementation from the open-sourced library by Wightman [38]. Images are of size 128 × 128 × 1
for CSG2D and 128 × 128 × 3 for TinySVG. We pass the current and target images as a stack of image
planes into the image encoder. Additionally, we provide the absolute difference between current and
target image as additional planes.
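A sketch of this input stacking (illustrative, not the released code):

import torch

def encoder_input(current_img, target_img):
    """Stack current, target, and |current - target| planes along channels.

    Both images are float tensors of shape (C, 128, 128), with C = 1 for
    CSG2D and C = 3 for TinySVG; the encoder then sees 3C input planes.
    """
    diff = (current_img - target_img).abs()
    return torch.cat([current_img, target_img, diff], dim=0)

# Example (CSG2D-sized inputs):
x = encoder_input(torch.rand(1, 128, 128), torch.rand(1, 128, 128))
print(x.shape)  # torch.Size([3, 128, 128])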

Our decoder-only transformer [34, 26] uses 8 layers, 16 heads, with an embedding size of 256.
We use batch size 32 and optimize with Adam [18] with a learning rate of 3 × 10−4 . The image
embeddings are of the same size as the transformer embeddings. We use 4 prefix tokens for the image
embeddings. We used a maximum context size of 512 tokens. For both environments, we sampled
expressions with at most 8 primitives. Our method and all baseline methods used this architecture.
We did not do any hyperparameter sweeps or tuning.
For the autoregressive (CSGNet) baseline, we trained the model to output ground-truth programs
from target images, and provided a blank current image. For tree diffusion methods, we initialized
the search and rollouts using the output of the autoregressive model, which counted as a single node
expansion. For our re-implementation of Ellis et al. [11], we flattened the CSG2D tree into shapes
being added from left to right. We then randomly sampled a position in this shape array, compiled the
output up until the sampled position, and trained the model to output the next shape using constrained
grammar decoding.
This is a departure from the pointer network architecture in their work. We think that the lack of prior
shaping, the departure from a graphics-specific pointer network, and not using reinforcement learning
to fine-tune lead to the performance difference between their results and our re-implementation. We
note that our method does not require any of these additional features, and thus the comparison is
fairer. For tree diffusion search, we used a beam size of 64, with a maximum node expansion budget
of 5000 nodes.

