Supplementary Information for:
Illuminating protein space
with a programmable generative model
John B Ingraham, Max Baranov, Zak Costello, Karl W Barber, Wujie Wang,
Ahmed Ismail, Vincent Frappier, Dana M Lord, Christopher Ng-Thow-Hing,
Erik R Van Vlack, Shan Tie, Vincent Xue, Sarah C Cowles, Alan Leung,
João V Rodrigues, Claudio L Morales-Perez, Alex M Ayoub, Robin Green,
Katherine Puentes, Frank Oplinger, Nishant V Panwar, Fritz Obermeyer,
Adam R Root, Andrew L Beam, Frank J Poelwijk, Gevorg Grigoryan
Supplementary Information
Table of Contents
A Overview
  A.1 Model
  A.2 Conditioners
  A.3 Wet lab experiments
D Polymer-Structured Diffusions
  D.1 Diffusion processes predictably affect molecular distances
  D.2 Covariance model #1: Ideal Chain
  D.3 Covariance model #2: Rg-confined Globular Polymer
  D.4 Alternative covariance model: Residue Gas
G Chroma Architecture
  G.1 Graph neural networks for protein structure
  G.2 ChromaBackbone
  G.3 ChromaDesign
  G.4 Related Work
H Training
  H.1 Dataset
  H.2 Optimization
I Sampling
  I.1 Sampling backbones
  I.2 Sampling sequences
Q Programmability: Symmetry
  Q.1 Motivation
  Q.2 Symmetry breaking in sampling
  Q.3 Symmetric transformation as a conditioner
  Q.4 Practical implementation with scaling and composition
  Q.5 Additional symmetric samples
R Programmability: Shape
  R.1 Motivation
  R.2 Approach
S Programmability: Classification
  S.1 Motivation
  S.2 Model inputs
  S.3 Featurization
  S.4 Architecture
  S.5 Labels and loss functions
  S.6 Training
  S.7 Hyperparameters
List of Figures
1 Low temperature sampling with Hybrid Langevin SDE
2 Low temperature sampling analysis, proteins
3 Polymer-structured diffusions for proteins
4 Random graph sampling for random graph neural networks
5 Equivariant structure updates from inter-residue geometries
6 Anisotropic confidence models for predicted inter-residue geometries
7 Chroma architecture
8 Randomized autoregression orders with varying spatial clustering
9 Random single-chain samples
10 Random complex samples
11 Structural metrics analysis of unconditional samples
12 Novelty analysis of unconditional samples
13 The protein space of unconditional samples
14 Refolding analysis of unconditional samples
15 Sequence recovery evaluation
16 Refolding analysis for substructure conditioning
17 Refolding analysis for symmetry conditioning
18 Refolding analysis for shape conditioning
19 Refolding analysis for class conditioning
20 Refolding analysis for natural language conditioning
List of Tables
1 Notation
2 Hyperparameters for the backbone network
3 Hyperparameters for the design network
4 Hyperparameters for sampling
5 Structural metrics for backbones
6 Conditioners
7 Design protocol
8 Split-GFP control sequences
Symbol | Definition
$N$ | number of atoms or residues
$x_t \in \mathbb{R}^{N\times 3}$ | coordinates sampled at time $t$
$x_t^M \in \mathbb{R}^{|M|\times 3}$ | motif-sliced coordinates based on index set $M \subset [[1, N]]$
$x_t^{(i)} \in \mathbb{R}^{3}$ | the $i$th coordinate in $x_t$
$\mathcal{G} = (V, E)$ | a graph composed of sets of vertices and edges
$D_{ij}$ | Euclidean distance between $i$ and $j$, $\|x^{(i)} - x^{(j)}\|_2$
$z \in \mathbb{R}^{N\times 3}$ | whitened noise; $z_i$ is the individual noise component
$\Sigma = RR^\top$ | covariance matrix for the polymer-structured prior, $[Rz]_{ik} = \sum_j [R]_{ij} z_{jk}$
$T = (O, t)$ | Euclidean transformation with rotation $O$ and translation $t$
$\alpha_t$ | time-dependent mean scaling in the forward diffusion
$\sigma_t$ | time-dependent variance scaling in the forward diffusion
$h_t$ | time-dependent generalized drift coefficient
$g_t$ | time-dependent generalized diffusion coefficient
$\beta_t$ | time-dependent noise schedule for the variance-preserving process
$\lambda_t$ | time-dependent inverse temperature
$\psi$ | Langevin equilibration rate in the Hybrid SDE
$T$ | number of integration time steps
$\hat{x}_\theta$ | denoising network in Cartesian space
$\hat{z}_\theta$ | denoising network in the whitened space
$\nabla_x \log p_t(x)$ | score estimator network
$dw$, $d\bar{w}$ | forward Brownian noise, reverse Brownian noise
A Overview
A.1 Model
Factorization Chroma is a joint generative model for the all-atom structures of protein complexes given a set of chain lengths. To model this complex set of dependencies across mixed continuous and discrete degrees of freedom, we factorize the joint distribution into parameterized components as
$$p_\theta(x, s, \chi) = p_\theta(x)\, p_\theta(s, \chi \mid x),$$
where $x \in \mathbb{R}^{4N\times 3}$ represents the backbone heavy-atom coordinates (i.e., N, C$\alpha$, C, and O), $s \in [[20]]^N$ represents the discrete sequences over all residues, $\chi \in (-\pi, \pi]^{4N}$ represents the side-chain torsional angles, $\theta$ represents model parameters, and $N$ is the total number of residues in the complex.¹ We parameterize these component distributions in terms of two neural networks: a backbone network which uses diffusion modeling to estimate $\log p_\theta(x)$ and a design network which uses discretely supported distributions to estimate $\log p_\theta(\chi, s \mid x)$. Both networks are based on a graph neural network architecture which takes SE(3)-invariant features as inputs and outputs SE(3)-invariant scalars and SE(3)-equivariant coordinates as needed (Supplementary Appendix G).

¹ We drop explicit dependence on chain lengths throughout this work to simplify notation.
Diffusion process Our diffusion modeling approach builds upon standard methods with extensions for correlated diffusion processes (Supplementary Appendix B). Briefly, we define a forward noising process which destroys structure in data as
$$x_t \sim p(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \alpha_t x_0,\ (1 - \alpha_t^2)\, RR^\top\right),$$
where $RR^\top$ is the covariance matrix of the diffusion process and $\alpha_t$ is a signal-erasing schedule that decays monotonically from 1 to 0 as the time $t$ goes from 0 to 1. This can also be effectively simulated by the Ornstein–Uhlenbeck process
$$dx = -\tfrac{1}{2}\beta_t\, x\, dt + \sqrt{\beta_t}\, R\, dw,$$
where $-\tfrac{1}{2}\beta_t = \frac{d\log\alpha_t}{dt}$ is a time-dependent noising schedule. We design the covariance matrix $RR^\top$ to respect the distance statistics of natural proteins, including local chain constraints and global density constraints based on the scaling law $R_g \approx 2\, N^{0.4}$ (Supplementary Appendix D). We also show how to generalize this framework for arbitrary Gaussian noising schedules in Supplementary Appendix B.
Diffusion objective Given this forward process, we train a neural network $\hat{x}_\theta(x_t, t)$ to predict the optimally denoised structure by optimizing a bound on the likelihood of the noise-free structure $x_0$,
$$\log p(x_0) \;\ge\; -\tfrac{1}{2}\log\det\!\left(2\pi e\, RR^\top\right) \;-\; \tfrac{1}{2}\,\mathbb{E}_{p(x_t|x_0)\,p(t)}\!\left[\mathrm{SNR}_t'\left(\frac{N}{1 + \mathrm{SNR}_t} - \left\|R^{-1}\!\left(\hat{x}_\theta(x_t, t) - x_0\right)\right\|_2^2\right)\right],$$
together with auxiliary training objectives which emphasize accurate denoising of specific substructural features (Supplementary Appendix B). Both the likelihood and auxiliary objectives are weighted in terms of various functions of the time-dependent signal-to-noise ratio, $\mathrm{SNR}_t = \frac{\alpha_t^2}{1 - \alpha_t^2}$. We parameterize the optimal denoiser $\hat{x}_\theta(x_t, t)$ in terms of a graph neural network with random long-range connectivity for global context (Supplementary Appendix E) which predicts denoised structures via a weighted consensus of inter-residue geometry predictions (Supplementary Appendix F).
Sampling backbones We sample backbones by integrating a Hybrid Langevin SDE (Supplementary Appendix C), in which $\lambda_t$ and $\lambda_0$ are inverse temperature parameters, $\psi$ sets the rate of Langevin equilibration per unit time, $d\bar{w}$ is a reverse-time Wiener process, and $\nabla_x \log p_t(x; \theta)$ is the time-dependent score function, which can be expressed as an affine transform of the optimal denoiser. These Langevin-enriched dynamics allow us to adjust the time-dependent distribution to account for perturbations such as external conditioning cues or lower sampling temperatures which bias towards high-likelihood states (Supplementary Appendix C).
Sampling side-chains For the design network, we train a graph neural network to predict discrete sequence states via either conditional Potts models or conditional language models, and to predict side-chain conformations via an autoregressive decomposition and an empirical histogram parameterization binned at 10° resolution. Sampling is performed via a combination of penalized Markov Chain Monte Carlo and/or ancestral sampling (Supplementary Appendix I).
Dataset We constructed a dataset of 28,819 protein complex structures from the Protein Data Bank as of March 20, 2022 (Supplementary Appendix H). These complexes were filtered for X-ray crystal structures at resolution ≤ 2.6 Å and were then redundancy-reduced via general sequence clustering at 50% identity, followed by re-enrichment of 1,726 highly variable antibody systems with clustering at 10% sequence identity. We split these data into 80%/10%/10% training/validation/test components based on a graph-based annotation overlap reduction procedure.
Training We trained two configurations of the backbone network on 8 V100 GPUs for approximately 1.6 and 1.8 million training steps with target batch sizes of approximately 32,000 residues per step, with each model having approximately 19 million parameters (Supplementary Table 2). To test the influence of different components of our framework, we also carried out an ablation study of 7 different model configurations, each trained with 8 V100 GPUs and similar batch sizing for approximately 500,000 steps (Supplementary Appendix L, Supplementary Figure 22). Additionally, we trained two configurations of the design network on 1 or 8 V100 GPUs, with each model having approximately 4 or 14 million parameters, respectively, based on the inclusion of side chain and autoregressive decoding layers.
A.2 Conditioners
To make protein design with Chroma programmable, we introduce a Conditioners framework that
allows for automatic conditional sampling under arbitrary composition of protein specifications.
These specifications can come in the form of restraints, which bias the distribution of states, and constraints, which directly restrict the domain of the underlying sampling process (Supplementary Appendix M). We accomplish this by formulating conditional protein design as sampling under composable transformations which can affect both the energy and/or the state variables. Briefly, we can express the time-dependent (unnormalized) posterior log-likelihood of structures $x$ given conditions $y$ in terms of a transformed state and energy, where $f(\tilde{x}_t, U_0; t)$ is a function that (optionally) transforms the state and $U_f(\tilde{x}_t, U_0; t)$ is a function that transforms the energy. These transformation functions are composable and, by leveraging automatic differentiation, we can derive universal conditional samplers from any gradient-based MCMC algorithm such as Langevin dynamics (Supplementary Appendix M.2).
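To make the composition idea concrete, here is a minimal PyTorch sketch of such an interface. The class and method names are hypothetical illustrations, not the released Chroma API; the point is only that each conditioner may transform the state and/or accumulate an energy term, and that automatic differentiation then supplies gradients for any gradient-based sampler.

```python
import torch

class Conditioner(torch.nn.Module):
    """Base class: optionally transform the state x and add to the energy U."""
    def forward(self, x: torch.Tensor, U: torch.Tensor, t: float):
        return x, U

class ComposedConditioner(Conditioner):
    def __init__(self, conditioners):
        super().__init__()
        self.conditioners = torch.nn.ModuleList(conditioners)

    def forward(self, x, U, t):
        # Composition is just sequential application of the transformations.
        for c in self.conditioners:
            x, U = c(x, U, t)
        return x, U

class DistanceRestraint(Conditioner):
    """Example restraint: harmonic penalty on one inter-residue distance (illustrative only)."""
    def __init__(self, i, j, target, weight=1.0):
        super().__init__()
        self.i, self.j, self.target, self.weight = i, j, target, weight

    def forward(self, x, U, t):
        d = torch.linalg.norm(x[self.i] - x[self.j])
        return x, U + self.weight * (d - self.target) ** 2

# Gradients of the composed energy with respect to x (via autograd) can then be added to the
# score term inside Langevin-type samplers.
```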
We implement and evaluate several conditioners within this framework capturing a variety of po-
tential protein design criteria including:
• substructure constraints (Supplementary Appendix N)
• substructure distance restraints (Supplementary Appendix O)
• substructure motif restraints (Supplementary Appendix P)
• symmetries across chains (Supplementary Appendix Q)
• arbitrary volumetric shapes (Supplementary Appendix R)
• neural network classifiers (Supplementary Appendix S)
• natural language prompts (Supplementary Appendix T)
Whitening transformations and linear generative models One classical approach for removing nuisance correlations in the data is to apply a "whitening transformation", i.e., an affine linear transformation $z = \Sigma^{-\frac{1}{2}}(x - \mu)$ that decorrelates all factors of variation by subtracting the empirical mean $\mu$ and multiplying by a square root of the inverse covariance matrix, $R^{-1} = \Sigma^{-\frac{1}{2}}$.

Whitening data can also be related to fitting the data to a Gaussian model $x = F(z) = Rz + b$, where the whitened factors $z$ are standard normally distributed as $z \sim \mathcal{N}(0, I)$ [56]. The density in the whitened space can be related to the density in the transformed space by the change of variables formula as
$$\log p(x) = \log p_z\!\left(F^{-1}(x)\right) - \log\left|\det \frac{dF}{dx}\right| = \log p_z\!\left(R^{-1}(x - b)\right) - \log\left|\det R\right| = \log \mathcal{N}\!\left(R^{-1}(x - b);\, 0, I\right) - \log\left|\det R\right| = \log \mathcal{N}\!\left(x;\, b, RR^\top\right).$$
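The identity above is easy to verify numerically; the following small numpy sketch (illustrative only) evaluates the Gaussian log-density through the whitened coordinates.

```python
import numpy as np

def gaussian_logpdf_via_whitening(x, b, R):
    """log N(x; b, R R^T) computed as log N(R^{-1}(x - b); 0, I) - log|det R|."""
    z = np.linalg.solve(R, x - b)                          # whitened coordinates
    log_pz = -0.5 * (z @ z + len(z) * np.log(2 * np.pi))   # standard normal log-density
    _, logdet_R = np.linalg.slogdet(R)
    return log_pz - logdet_R

# Small self-check with an arbitrary valid covariance square root.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
R = np.linalg.cholesky(A @ A.T + 5 * np.eye(5))
b, x = rng.standard_normal(5), rng.standard_normal(5)
print(gaussian_logpdf_via_whitening(x, b, R))
```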
From uncorrelated diffusion to correlated diffusion If we have a linear Gaussian prior for our data $p(x) = \mathcal{N}(x;\, b, RR^\top)$ which can be sampled as $x = Rz + b$ with $z \sim \mathcal{N}(0, I)$, then an uncorrelated diffusion process on the whitened coordinates $z_t \sim p_t(z \mid z_0)$ will induce a correlated diffusion process on the original coordinates $x_t \sim p_t(x \mid x_0)$. For the typical case of a Gaussian diffusion process with time-dependent noise, the uncorrelated diffusion $z_t \sim \mathcal{N}(\alpha_t z_0,\, \sigma_t^2 I)$, with time-dependent signal scaling $\alpha_t$ and noise scaling $\sigma_t$, will induce correlated Gaussian noise $x_t \sim \mathcal{N}(\alpha_t x_0,\, \sigma_t^2\, RR^\top)$.

We can also see this relationship infinitesimally in time [55]. If we express the uncorrelated diffusion process in terms of a stochastic differential equation (SDE),
$$dz = h_t\, z\, dt + g_t\, dw,$$
then the same process expressed in the original coordinates $x = Rz$ (taking $b = 0$ for simplicity) is
$$dx = h_t\, x\, dt + g_t\, R\, dw.$$
Linking instantaneous and integrated parameters The above results may be seen via ODEs which govern the time-evolution of the means and variances of the stochastic process. Following [61], for an SDE of the form
$$dx = f(x_t, t)\, dt + G(x_t, t)\, dw,$$
where $f(x_t, t)$ and $G(x_t, t)$ are the drift and diffusion terms, we can express the time-dependent mean and covariance as
$$\frac{d}{dt}\mu_t = \mathbb{E}\left[f(x_t, t)\right],$$
$$\frac{d}{dt}\Sigma_t = \mathbb{E}\left[f(x_t, t)(x_t - \mu_t)^\top\right] + \mathbb{E}\left[(x_t - \mu_t)\, f(x_t, t)^\top\right] + \mathbb{E}\left[G(x_t, t)\, G^\top(x_t, t)\right].$$
4 This can be justified by Ito’s lemma.
In our framework, the noise kernel is $\mathcal{N}(\alpha_t x_0, \sigma_t^2 \Sigma)$ and we seek an OU process with drift term $f(x_t, t) = h_t x_t$ and correlated diffusion term $G_t = g_t R$. If we define $\Sigma = RR^\top$ and $\Sigma_t = \sigma_t^2 \Sigma$, we have
$$\frac{d}{dt}\Sigma_t = \mathbb{E}\left[h_t x_t (x_t - \mu_t)^\top\right] + \mathbb{E}\left[(x_t - \mu_t)\, h_t x_t^\top\right] + \mathbb{E}\left[G(x_t, t)\, G^\top(x_t, t)\right] = 2 h_t \left(\mathbb{E}\left[x_t x_t^\top\right] - \mu_t \mu_t^\top\right) + g_t^2 \Sigma = 2 h_t \Sigma_t + g_t^2 \Sigma$$
$$\implies\quad g_t^2 \Sigma = \frac{d}{dt}\Sigma_t - 2 h_t \Sigma_t,$$
where the last step comes from the identity
$$\Sigma_t = \mathbb{E}\left[(x_t - \mu_t)(x_t - \mu_t)^\top\right] = \mathbb{E}\left[x_t x_t^\top - x_t \mu_t^\top - \mu_t x_t^\top + \mu_t \mu_t^\top\right] = \mathbb{E}\left[x_t x_t^\top\right] - \mu_t \mu_t^\top.$$
Now substituting our noise kernel $p(x_t \mid x_0) = \mathcal{N}(\alpha_t x_0, \sigma_t^2 \Sigma)$, we have $h_t$ in terms of $\alpha_t$ as
$$\frac{d}{dt}\mu_t = \mathbb{E}\left[f(x_t, t)\right] = h_t\, \mu_t = \alpha_t'\, x_0 \quad\implies\quad h_t = \frac{\alpha_t'}{\alpha_t} = \frac{d \log \alpha_t}{dt},$$
and $g_t$ in terms of $\sigma_t$ as
$$g_t^2 \Sigma = \Sigma_t' - 2 h_t \Sigma_t = \Sigma_t' - \frac{2\alpha_t'}{\alpha_t}\Sigma_t = 2\sigma_t \sigma_t' \Sigma - \frac{2\alpha_t'}{\alpha_t}\sigma_t^2 \Sigma \quad\implies\quad g_t = \sqrt{2\sigma_t^2\left(\frac{\sigma_t'}{\sigma_t} - \frac{\alpha_t'}{\alpha_t}\right)} = \sqrt{-\sigma_t^2\, \frac{d \log \mathrm{SNR}_t}{dt}}.$$
This establishes the correspondence between the time evolution of the noise kernel and the underlying diffusion SDE for a general Gaussian noising schedule.
Variance Preserving schedules Throughout this work, we focus on noise that is distributionally matched and use the variance-preserving process [55], which enforces a balance between the creation of noise and the destruction of signal as $\sigma_t^2 = 1 - \alpha_t^2$. Setting $h_t = \frac{\alpha_t'}{\alpha_t} \triangleq -\frac{1}{2}\beta_t$, we have
$$g_t = \sqrt{2\sigma_t^2\left(\frac{\sigma_t'}{\sigma_t} - \frac{\alpha_t'}{\alpha_t}\right)} = \sqrt{-2\left(\alpha_t^2 + \sigma_t^2\right)\frac{\alpha_t'}{\alpha_t}} = \sqrt{\beta_t}.$$
Optimal Transport schedules The Optimal Transport (OT) schedule proposed in [58] is another type of one-parameter noising process that interpolates between a prior $p(x_1)$ and the data distribution $p(x_0)$. It induces 'straight paths' in the corresponding flow ODE. Given the noising parameters $\sigma_t = t$ and $\alpha_t = 1 - t$, we can convert the flow ODE to an SDE,
$$dx_t = -\frac{1}{1 - t}\, x_t\, dt + \sqrt{\frac{2t}{1 - t}}\, R\, dw_t,$$
where $h_t = -\frac{1}{1 - t}$ and $g_t = \sqrt{\frac{2t}{1 - t}}$. The SDE trajectories sample $p(x_t \mid x_0) = \mathcal{N}(\alpha_t x_0, \sigma_t^2 \Sigma)$. In practice, following [58], one can integrate over $0 < t < 1 - \varepsilon$ to ensure numerical stability, where $\varepsilon$ is a small number.
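The identities above map an integrated schedule $(\alpha_t, \sigma_t)$ to instantaneous SDE coefficients $(h_t, g_t)$. The following numpy sketch evaluates both cases; the cosine-style $\alpha_t$ used in the example is an illustrative assumption, not the schedule used for training.

```python
import numpy as np

def vp_coefficients(alpha_t, dalpha_dt):
    """Variance-preserving: sigma_t^2 = 1 - alpha_t^2, h_t = -beta_t/2, g_t = sqrt(beta_t)."""
    beta_t = -2.0 * dalpha_dt / alpha_t
    return -0.5 * beta_t, np.sqrt(beta_t)

def ot_coefficients(t):
    """Optimal-transport interpolation: alpha_t = 1 - t, sigma_t = t."""
    return -1.0 / (1.0 - t), np.sqrt(2.0 * t / (1.0 - t))

# Example: evaluate the VP coefficients for a cosine-like alpha schedule at t = 0.3.
t, eps = 0.3, 1e-6
alpha = lambda s: np.cos(0.5 * np.pi * s)
dalpha = (alpha(t + eps) - alpha(t - eps)) / (2 * eps)      # finite-difference derivative
print(vp_coefficients(alpha(t), dalpha), ot_coefficients(t))
```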
ELBO We train our diffusion models by optimizing a bound on the log marginal likelihood of data together with optional auxiliary losses. As shown in Information-Theoretic Diffusion models [64] and building on Variational Diffusion models [62], we can express a lower bound on the log-likelihood of data in terms of a weighted average of mean-square error across diffusion time as
$$\log p(z_0) = -\frac{N}{2}\log(2\pi e) + \frac{1}{2}\int_0^\infty \left[\frac{N}{1 + \mathrm{SNR}} - \mathrm{mmse}(z_0, \mathrm{SNR})\right] d\mathrm{SNR}$$
$$= -\frac{N}{2}\log(2\pi e) - \frac{1}{2}\int_0^1 \left[\frac{N}{1 + \mathrm{SNR}_t} - \mathrm{mmse}(z_0, \mathrm{SNR}_t)\right] \mathrm{SNR}_t'\, dt$$
$$\ge -\frac{N}{2}\log(2\pi e) - \frac{1}{2}\int_0^1 \left[\frac{N}{1 + \mathrm{SNR}_t} - \mathbb{E}_{p(z_t|z_0)}\left\|\hat{z}(z_t, t) - z_0\right\|_2^2\right] \mathrm{SNR}_t'\, dt,$$
where $N$ is the dimensionality of $z_0$, the Signal-to-Noise Ratio (SNR) is defined as $\mathrm{SNR}_t = \frac{\alpha_t^2}{\sigma_t^2}$ with $\sigma_t^2 = 1 - \alpha_t^2$ for VP diffusions, and mmse is the minimum achievable mean square error under the forward noising model as a function of the SNR. We can then apply the change of variables formula to transform this bound as
$$\log p(x_0) = \log p(z_0) - \log\det R$$
$$\ge -\frac{N}{2}\log(2\pi e) - \log\det R - \frac{1}{2}\int_0^1 \left[\frac{N}{1 + \mathrm{SNR}_t} - \mathbb{E}_{p(x_t|x_0)}\left\|R^{-1}\left(\hat{x}(x_t, t) - x_0\right)\right\|_2^2\right] \mathrm{SNR}_t'\, dt$$
$$= \underbrace{-\frac{1}{2}\log\det\left(2\pi e\, RR^\top\right)}_{\text{Entropy of the Gaussian prior}} \;\underbrace{-\;\frac{1}{2}\,\mathbb{E}_{p(x_t|x_0)p(t)}\left[\mathrm{SNR}_t'\left(\frac{N}{1 + \mathrm{SNR}_t} - \left\|R^{-1}\left(\hat{x}(x_t, t) - x_0\right)\right\|_2^2\right)\right]}_{\text{Deviation from Gaussianity (Bound)}} \;\triangleq\; \mathcal{L}(x; \theta).$$
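As a sketch of how the second ("deviation from Gaussianity") term is estimated in practice for a single Monte Carlo draw of $(x_t, t)$, the snippet below measures the denoiser's error in whitened coordinates and applies the SNR-derivative weighting; `denoiser` is a placeholder for the trained network and the entropy constant is omitted.

```python
import numpy as np

def whitened_elbo_term(x0, R, t, alpha_t, snr_t, dsnr_dt, denoiser):
    """Single-sample estimate of the SNR-weighted whitened reconstruction term."""
    sigma_t = np.sqrt(1.0 - alpha_t**2)
    z = np.random.randn(*x0.shape)
    xt = alpha_t * x0 + sigma_t * (R @ z)                  # forward noising draw
    err = np.linalg.solve(R, denoiser(xt, t) - x0)         # R^{-1} (x_hat - x_0)
    N = x0.size
    return -0.5 * dsnr_dt * (N / (1.0 + snr_t) - np.sum(err**2))
```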
ELBO-weighted unwhitened MSE While the information content of the structures is measured by an SNR-weighted average of mean square error in whitened space, we also consider a similarly weighted objective measuring errors in $x$-space as
$$\mathcal{L}_x(x_0; \theta) = -\mathbb{E}_{x_t \sim p(x_t|x_0),\; t \sim \mathrm{Unif}(0,1)}\left[\omega^{-2}\, \mathrm{SNR}_t'\, \left\|\hat{x}_\theta(x_t, t) - x_0\right\|_2^2\right], \tag{1}$$
where we set the scale factor $\omega$ to give $x$ units of nanometers. We found this regularization to be important because in practice we care about absolute errors in $x$ space, i.e. absolute spatial errors, at least as much as we care about errors in $z$ space, which will correspond under our covariance models (Supplementary Appendix D) to relative local geometries. These objectives share the same minima, i.e. they will be minimized by the posterior-optimal denoiser under the diffusion process, but for an approximately trained parametric model with limited capacity they will trade off which statistics of the data are emphasized in reconstruction.
Substructure MSE and Perceptually-motivated metrics As has often been emphasized in the
literature in generative models of images, not all bits are equally important to perception or, more
generally, sample utility. For example, it takes the same number of bits to encode the average color
of an image as it does to encode the color of one single pixel, but mis-estimation of the average
color will generally be much more noticeable to humans.
As a result of this, many diffusion models eschew training purely on likelihood-based metrics,
for example using flat weightings of the denoising loss across diffusion time which implicitly
emphasize the importance of low-frequency statistics [55]. Other generative models have used
domain-specific metrics such as FAPE for proteins [66] as the denoising objective for diffusion
training [67].
Here we consider auxiliary training objectives for protein backbone diffusion models which em-
phasize some conventionally important aspects of structural similarity. Since diffusion models
trained to optimality will learn the posterior mean denoising function, which minimizes mean
squared error of reconstruction from the forward process, we consider only squared-error objec-
tives.
Substructure Aligned Squared Error Minimizing ELBO-weighted mean squared error trains a diffusion model to learn all statistics of the data at all length scales, but for proteins we know that some substructural statistics may be stronger and more important to estimate correctly than others. For example, proteins often exhibit substructures, such as secondary structural elements or domains connected by more flexible linkers. We can encourage the denoiser to prioritize these substructural statistics of the data by optimizing the mean squared error under optimal superposition as
$$D_{\text{substructure}}(x, x') = \sum_{M_i \in \{M_i\}} \min_{T \in SE(3)} \left\|x^{M_i} - T \circ x'^{M_i}\right\|_2^2,$$
where $\{M_i\}$ is a set of substructures and the inner optimization problem can be solved via optimal superposition with a Kabsch or quaternion-based method [68, 69]. We consider the following substructures for measuring aligned squared error:
• Global structure. $M = [[1, N]]$. In this case, the substructure-aligned MSE will simply be a rescaling of the squared optimal RMSD after superposition.
• Fragment structure. $M_i = \{i - m, \ldots, i + m\}$. We consider fragments of radius $m = 7$ residues centered around each residue $i$.
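A minimal numpy sketch of the aligned squared error for one substructure is given below (illustrative, not the training code): the prediction is superposed onto the target with the Kabsch algorithm and the residual squared error is returned. The fragment variant would sum this quantity over windows of radius 7 residues.

```python
import numpy as np

def aligned_squared_error(x_true: np.ndarray, x_pred: np.ndarray) -> float:
    """Squared error after optimal rigid superposition; x_true, x_pred: (M, 3)."""
    xc = x_true - x_true.mean(axis=0)
    yc = x_pred - x_pred.mean(axis=0)
    U, _, Vt = np.linalg.svd(yc.T @ xc)
    d = np.sign(np.linalg.det(U @ Vt))                 # guard against reflections
    O = U @ np.diag([1.0, 1.0, d]) @ Vt                # optimal rotation (Kabsch)
    return float(np.sum((xc - yc @ O) ** 2))
```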
Distance Squared Error Many aspects of protein geometry are driven by specific packing and steric interactions that depend more strongly on interatomic distances and less strongly on relative orientations. We consider a loss measuring the squared error of proteins when represented by the distance matrices of their Cα atoms, i.e. the squared error between $D_{ij}(x)$ and $D_{ij}(x')$ summed over residue pairs.
Normalizing Auxiliary Losses Across Time and Schedules All of the aforementioned losses can be used as denoising losses by minimizing $\mathbb{E}_{p(x_0, x_t, t)}\left[D(\hat{x}(x_t, t), x_0)\right]$, but (i) an unweighted average will be dominated by loss values at high $t$ and (ii) values of these losses will be incomparable if the noise schedule of the diffusion is changed, complicating evaluation. To address both of these issues, we propose (i) to normalize the losses with an approximate estimate of the time-dependent error magnitude and (ii) to reweight the average with respect to time $t$ as an average with respect to a schedule-invariant statistic via importance weights.

One intuitive schedule-invariant statistic is the Signal to Signal-plus-Noise Ratio $\mathrm{SSNR}_t \triangleq \frac{\alpha_t^2}{\alpha_t^2 + \sigma_t^2} = \frac{\mathrm{SNR}_t}{\mathrm{SNR}_t + 1}$. For Variance-Preserving diffusion, this value simplifies to $\mathrm{SSNR}_t = \alpha_t^2 \in [0, 1]$. Since $t$ is uniformly distributed on $(0, 1)$ and $\mathrm{SSNR}_t$ goes from 1 to 0, we can interpret $\mathrm{SSNR}_{1-t}$ as a CDF and compute $p(\mathrm{SSNR}_t) = \frac{d}{dt}\mathrm{SSNR}_t^{-1}(\mathrm{SSNR}_t)$. We can then compute importance weights as $\frac{1}{p(\mathrm{SSNR}_t)}$ and combine these with normalization to yield normalized denoising training losses as
$$\mathcal{L}_D(x_0; \theta) = \mathbb{E}_{x_t, x_t' \sim p(x_t|x_0),\; t \sim \mathrm{Unif}(0,1)}\left[\frac{1}{p(\mathrm{SSNR}_t)}\, \frac{D(\hat{x}(x_t, t);\, x_0)}{D(x_t';\, x_0)}\right].$$
Transform Squared Error Our proposed method for parameterizing predicted structure in terms of predicted inter-residue geometries (Supplementary Appendix F) leverages predicted inter-residue transforms $T_{ij}$ between every pair of residues on the graph. When training on the ELBO, these predicted inter-residue transforms are only indirectly supervised by backpropagation, but we can also directly supervise their values towards the true denoised inter-residue geometries to potentially stabilize and accelerate learning. This is not dissimilar from auxiliary prediction of inter-residue distances as done in end-to-end structure prediction methods such as AlphaFold [66]. Training these quantities directly can be useful because (i) they are SE(3) invariant and typically lower-variance targets than raw coordinates and (ii) they are aligned with the overall denoising objective in the sense that perfect inter-residue geometry prediction will yield a perfectly denoised structure (assuming sufficient equilibration time of the backbone solver).
We score the agreement between the predicted $\hat{T}_\theta^{ij}(x_t)$ and actual $T_{ij}(x_0)$ inter-residue geometries as the sum of squared errors in the predicted translation vectors and rotation matrices, i.e.
$$\mathcal{L}_{\text{transform}}(x_0; \theta) = \sum_{ij \in \mathcal{G}(x)} \left\|t_{ij}(x_0) - \hat{t}_\theta^{ij}(x_t)\right\|_2^2 + \left\|R_{ij}(x_0) - \hat{R}_\theta^{ij}(x_t)\right\|_2^2.$$
We can similarly express this in terms of the score function of the transformed coordinate system.

We can sample from the diffusion model by sampling the "prior" ($t = 1$ distribution) and then integrating the reverse-time SDE backwards from $t = 1$ to $t = 0$. We can rewrite the above SDE directly in terms of our optimal denoising network $\hat{x}_\theta(x, t)$ (trained as described above) by leveraging the relationship [55, 62] that
$$\nabla_x \log p_t(x) = \left(\sigma_t^2\, RR^\top\right)^{-1}\left(\alpha_t\, \hat{x}_\theta(x, t) - x\right).$$
This yields the reverse-time SDE for VP diffusions in terms of the optimal denoising network $\hat{x}_\theta(x, t)$ as
$$dx = \left(-\frac{1}{2}x - RR^\top \frac{(RR^\top)^{-1}}{1 - \alpha_t^2}\left(\alpha_t\, \hat{x}_\theta(x, t) - x\right)\right)\beta_t\, dt + \sqrt{\beta_t}\, R\, d\bar{w}$$
$$= \left(-\frac{1}{2}x - \frac{\alpha_t\, \hat{x}_\theta(x, t) - x}{1 - \alpha_t^2}\right)\beta_t\, dt + \sqrt{\beta_t}\, R\, d\bar{w}$$
$$= \left(\frac{-\alpha_t\, \hat{x}_\theta(x, t) + x - \frac{1}{2}x\left(1 - \alpha_t^2\right)}{1 - \alpha_t^2}\right)\beta_t\, dt + \sqrt{\beta_t}\, R\, d\bar{w}$$
$$= \left(\frac{\alpha_t^2 + 1}{2\left(1 - \alpha_t^2\right)}\, x - \frac{\alpha_t}{1 - \alpha_t^2}\, \hat{x}_\theta(x, t)\right)\beta_t\, dt + \sqrt{\beta_t}\, R\, d\bar{w}.$$
Probability Flow ODE We can alternatively sample deterministically by integrating the probability flow ODE, which shares the same time-dependent marginals as the SDE:
$$\frac{dx}{dt} = h_t\, x - \frac{1}{2} g_t^2\, RR^\top\, \nabla_x \log p_t(x).$$
The ODE formulation of sampling is especially important because it enables reformulating the model as a Continuous Normalizing Flow [72, 73], which can admit efficient and exact likelihood calculations using the adjoint method [73].
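For concreteness, here is a minimal Euler–Maruyama sketch of reverse-time VP sampling written directly in terms of the denoiser via the score identity above. The `denoiser`, `alpha`, and `beta` callables are placeholders for the trained network and noise schedule; this is an illustration of the update rule, not the production sampler.

```python
import numpy as np

def sample_backbone(denoiser, R, alpha, beta, N, T=500):
    """Integrate the reverse-time VP SDE from t = 1 to t = 0 with T Euler-Maruyama steps."""
    x = R @ np.random.randn(N, 3)                       # draw from the Gaussian prior at t = 1
    dt = 1.0 / T
    Sigma = R @ R.T
    for step in range(T, 0, -1):
        t = step / T
        a, b = alpha(t), beta(t)
        # Score from the denoiser: Sigma^{-1} (alpha_t x_hat - x) / (1 - alpha_t^2)
        score = np.linalg.solve(Sigma, a * denoiser(x, t) - x) / (1.0 - a**2)
        drift = -0.5 * b * x - b * Sigma @ score        # reverse-time drift
        x = x - drift * dt + np.sqrt(b * dt) * (R @ np.random.randn(N, 3))
    return x
```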
Bayesian posterior ODE for conditional sampling In the context of our covariance model and conditional constraints, the Probability Flow ODE for sampling from the posterior is
$$\frac{dx}{dt} = h_t\, x - \frac{1}{2} g_t^2\, RR^\top \left(\nabla_x \log p_t(x) + \nabla_x \log p_t(y \mid x)\right).$$
For the variance-preserving process, the ODE reduces to
$$\frac{dx}{dt} = -\frac{\beta_t}{2}\left(x + RR^\top\left(\nabla_x \log p_t(x) + \nabla_x \log p_t(y \mid x)\right)\right) = \frac{\beta_t}{2}\,\frac{\alpha_t^2}{1 - \alpha_t^2}\left(x - \frac{\hat{x}_\theta(x, t)}{\alpha_t}\right) - \frac{\beta_t}{2}\, RR^\top\, \nabla_x \log p_t(y \mid x).$$
Low temperature and diffusion models The issue of trading diversity for sample quality in diffusion models has been discussed previously, with some authors reporting that simple modifications like upscaling the score function and/or downscaling the noise were ineffective [80]. Instead, classifier guidance and classifier-free guidance have been widely adopted as critical components of contemporary text-to-image diffusion models such as Imagen and DALL-E 2 [81–83].
Equilibrium versus Non-Equilibrium Sampling Here we offer an explanation for why these previous attempts at low-temperature sampling did not work and produce a novel algorithm for low-temperature sampling from diffusion models. We make two key observations, explained in the next two sections:
1. Upscaling the score function of the reverse SDE is insufficient to properly re-weight populations in a temperature-perturbed distribution.
2. Annealed Langevin dynamics can sample from low temperature distributions if given sufficient equilibration time.
To see the first point, consider a one-dimensional Gaussian example in which the data are distributed with mean $\mu_{\text{data}}$ and variance $\sigma_{\text{data}}^2$ and the diffusion prior has variance $\sigma_{\text{prior}}^2$. The time-dependent score function is then
$$s_t \triangleq \nabla_x \log p_t(x) = \frac{\alpha_t\, \mu_{\text{data}} - x}{\alpha_t^2 \sigma_{\text{data}}^2 + \left(1 - \alpha_t^2\right)\sigma_{\text{prior}}^2}.$$
Now, suppose we wish to modify the definition of the time-dependent score function so that, instead of transitioning to the original data distribution, it transitions to the perturbed data distribution, i.e. so that it transitions to $\frac{1}{Z_{\lambda_0}}\, p_0(x)^{\lambda_0}$. For a Gaussian, this operation will simply multiply the precision (or equivalently, divide the covariance) by the factor $\lambda_0$. The perturbed score function will therefore be
$$s_t^{\text{perturb}} = \frac{\alpha_t\, \mu_{\text{data}} - x}{\alpha_t^2 \sigma_{\text{data}}^2 / \lambda_0 + \left(1 - \alpha_t^2\right)\sigma_{\text{prior}}^2}.$$
Based on this, we can express the perturbed score function as a time-dependent rescaling of the original score function, $s_t^{\text{perturb}} = \lambda_t\, s_t$, with scaling based on the ratio of the time-dependent inverse variances. Therefore, to achieve a particular inverse temperature $\lambda_0$ for the data distribution, we should rescale the learned score function by the time-dependent factor
$$\lambda_t = \frac{\left(1 - \alpha_t^2\right)\sigma_{\text{prior}}^2 + \alpha_t^2 \sigma_{\text{data}}^2}{\left(1 - \alpha_t^2\right)\sigma_{\text{prior}}^2 + \alpha_t^2 \sigma_{\text{data}}^2 / \lambda_0} \approx \frac{\lambda_0}{\alpha_t^2 + \left(1 - \alpha_t^2\right)\lambda_0},$$
where the approximation holds when $\sigma_{\text{data}} \approx \sigma_{\text{prior}}$.
Supplementary Figure 1: The Hybrid Langevin SDE can sample from temperature-perturbed distributions. The marginal densities of the diffusion process $p_t(x)$ (top left) gradually transform between a toy 1D data distribution at time $t = 0$ and a standard normal distribution at time $t = T$. Reweighting the distribution by inverse temperature $\lambda_0$ as $\frac{1}{Z_{\lambda_0}} p_t(x)^{\lambda_0}$ (left column, bottom two rows) will both concentrate and reweight the population distributions. The annealed versions of the reverse-time SDE and Probability Flow ODEs (middle columns) can concentrate towards local optima but do not correctly reweight the relative population occupancies. Adding in Langevin dynamics with the Hybrid Langevin SDE (right column) increases the rate of equilibration to the time-dependent marginals and, when combined with low-temperature rescaling, successfully reweights the populations (bottom right).
Temperature-adjusted reverse time SDE We can modify the reverse-time SDE by simply rescaling the score function with the above time-dependent temperature factor $\lambda_t$.
Temperature adjusted probability flow ODE Similarly, for the Probability Flow ODE we can rescale as
$$\frac{dx}{dt} = h_t\, x - \frac{1}{2} g_t^2\, \lambda_t\, RR^\top\, \nabla_x \log p_t(x).$$
For the variance-preserving process, this reduces to
$$\frac{dx}{dt} = -\frac{\beta_t}{2}\left(x + \lambda_t\, RR^\top\, \nabla_x \log p_t(x)\right) = \frac{\beta_t}{2}\left(\frac{\alpha_t^2 + \lambda_t - 1}{1 - \alpha_t^2}\, x - \frac{\lambda_t\, \alpha_t}{1 - \alpha_t^2}\, \hat{x}_\theta(x, t)\right).$$
Rescaling does not reweight We derived the above rescaling rationale by considering a unimodal Gaussian, which has the simple property that the score of the perturbed diffusion can be expressed as a rescaling of the learned diffusion. This will not be true in general, and indeed we find that the above dynamics do drive towards local maxima but do not reweight populations based on their relative probability (Supplementary Figure 1), as true low-temperature sampling does. To address this, we next introduce an equilibration process that can be arbitrarily mixed in with the non-equilibrium reverse dynamics. Concurrent with this work, [84] identified this problem as well and proposed several potential solutions based on MCMC.
To enable this equilibration, we consider Langevin dynamics that target the current time-dependent marginal at inverse temperature $\lambda_0$,
$$dx = -\frac{\beta_t \psi}{2}\, \lambda_0\, RR^\top\, \nabla_x \log p_t(x)\, dt + \sqrt{\beta_t \psi}\, R\, d\bar{w},$$
where $\psi$ is an "equilibration rate" scaling the amount of Langevin dynamics per unit time. As $\psi \to \infty$ the system will instantly equilibrate over time (requiring an infinite number of sampling steps), constantly adjusting to the changing score function. In practice, we can think about how to set these parameters by considering a single Euler–Maruyama integration step in reverse time with step size $\frac{1}{T}$, where $T$ is the total number of steps:
$$x_{t - \frac{1}{T}} \leftarrow x_t + \frac{\beta_t \psi}{2T}\, \lambda_0\, RR^\top\, \nabla_{x_t} \log p_t(x_t) + \sqrt{\frac{\beta_t \psi}{T}}\, R\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$
which is precisely preconditioned Langevin dynamics with step size $\frac{\beta_t \psi}{T}$. For a sufficiently small interval $(t - dt, t)$ we can keep the target density approximately fixed while increasing $T$ to perform an arbitrarily large number of Langevin dynamics steps, which will asymptotically equilibrate to the current density $\log p_t(x)$.
where we highlight the terms that, when set to unity, recover the standard reverse-time SDE.
Representative samples using this modified SDE are shown in Supplementary Figure 2. Without the low temperature modification, this idea is very reminiscent of the Predictor-Corrector sampler proposed by [55], where those authors explicitly alternated between reverse-time diffusion and Langevin dynamics while we fuse them into a single SDE.
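The two ingredients of low-temperature sampling (the rescaling factor $\lambda_t$ and the preconditioned Langevin "corrector" step) are summarized in this small numpy sketch; the score here is a placeholder for the learned score function, and the $\lambda_t$ expression assumes $\sigma_{\text{data}} \approx \sigma_{\text{prior}}$ as above.

```python
import numpy as np

def lambda_t(alpha_t, lambda_0):
    """Time-dependent inverse-temperature rescaling of the learned score."""
    return lambda_0 / (alpha_t**2 + (1.0 - alpha_t**2) * lambda_0)

def langevin_step(x, score, R, beta_t, psi, lambda_0, T):
    """One preconditioned Langevin update targeting p_t(x)^{lambda_0} (Euler-Maruyama)."""
    eps = beta_t * psi / T                                   # effective step size
    preconditioned = (R @ R.T) @ score                       # R R^T * grad log p_t(x)
    return x + 0.5 * eps * lambda_0 * preconditioned + np.sqrt(eps) * (R @ np.random.randn(*x.shape))
```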
Equilibration is not free Generally speaking, as we increase the amount of Langevin equilibration with $\psi$, we will need to simultaneously increase the resolution of our SDE solution to maintain the same level of accuracy. However, we found that even a modest amount of equilibration was sufficient to considerably improve sample quality in practice with $\psi \in [1, 8]$. With a larger $\psi$, a smaller time step is needed to ensure the accuracy of SDE integration.
Even more equilibration Lastly, while the Hybrid Langevin-Reverse Time SDE can do an arbitrarily large amount of Langevin dynamics per time interval, which would equilibrate asymptotically in principle, these dynamics will still mix inefficiently between basins of attraction in the energy landscape when $0 < t \ll 1$. We suspect that ideas from variable-temperature sampling methods, such as simulated tempering [86] or parallel tempering [87], would be useful in this context and would amount to deriving an augmented SDE system with auxiliary variables for the temperature and/or copies of the system at different time points in the diffusion. Additionally, momentum-aware approaches such as those based on Hamiltonian Monte Carlo [84] may help increase equilibration rates and thus enable better satisfaction of conditioning criteria with fewer objective function evaluations.
D Polymer-Structured Diffusions
Most prior applications of diffusion models to images and molecules leveraged uncorrelated diffusion in which data are gradually transformed by isotropic Gaussian noise. We found this approach to be non-ideal for protein structure applications for two reasons. First, noised samples break simple chain and density constraints that almost all structures satisfy, such as basic size scaling laws of the form $R_g \propto N^\nu$, where the scaling exponent is approximately $\nu \approx 0.4$ [88, 89]. These mismatches between the data distribution and the noising process force the model to allocate capacity and training time towards re-learning basic and well-understood constraints. Second, when high-noise samples are highly "out-of-distribution" from the data distribution, this can limit the performance of efficient domain-specific neural architectures for molecular systems, such as sparsely-connected graph neural networks. To this end, we introduce multivariate Gaussian distributions for protein structures that (i) are SO(3) invariant, (ii) enforce protein chain and radius of gyration statistics, and (iii) can be computed in linear time. Throughout this section, we will introduce covariance models for protein polymers (which can be thought of as an un-whitening transform $R$, see Appendix B) with parameters that can be fit offline from training the diffusion model. We provide an overview figure that illustrates the different Gaussian distributions presented in this section, their corresponding diffusion processes, and the respective distance statistics which they capture in Supplementary Figure 3.
[Supplementary Figure 3 panels: Residue Gas, Ideal Chain, Globular (Monomer), Globular (Complex).]
The squared distance is a quadratic form, so diffusion processes will simply linearly interpolate to the behavior of the prior as
$$\mathbb{E}_{p(x_t|x_0)}\left[D_{ij}^2(x_t)\right] = \alpha_t^2\, D_{ij}^2(x_0) + \left(1 - \alpha_t^2\right)\mathbb{E}_{p_{\text{prior}}(x)}\left[D_{ij}^2(x)\right],$$
and the squared radius of gyration will similarly evolve under the diffusion as
$$\mathbb{E}_{p(x_t|x_0)}\left[R_g^2(x_t)\right] = \alpha_t^2\, R_g^2(x_0) + \left(1 - \alpha_t^2\right)\mathbb{E}_{p_{\text{prior}}(x)}\left[R_g^2(x)\right].$$
Punchline Because variance-preserving diffusion models will perform simple linear interpolations between the average squared distances and $R_g$ of the data distribution and of the prior, we should focus on covariance structures that empirically match these properties as closely as possible. The two primary constraints are the chain constraint, i.e., that $D_{i,i+1}(x_t)$ should always be small and match the data distribution, and the density constraint of how $R_g^2(x_t)$ should behave as a function of protein length and typical packing statistics.
Noise process We index our amount of noise with a diffusion time $t \in [0, 1]$. Given a denoised structure $x_0$, a level of noise $t$, and a noise schedule $\alpha_t$, we sample perturbed structures from a multivariate Gaussian distribution $p(x_t \mid x_0) = \mathcal{N}(\alpha_t x_0, (1 - \alpha_t^2)\Sigma)$ as
$$x_t = \alpha_t\, x_0 + \sqrt{1 - \alpha_t^2}\; R z, \qquad z \sim \mathcal{N}(0, I),$$
where the covariance matrix enforcing our chain constraint $\Sigma = RR^\top$ can be expressed in terms of its square root $R$, which is defined below.
Key to our framework is a matrix $R$ whose various products, inverse-products, and transpose-products with vectors can be computed in linear time. We define the matrix $R$ in terms of its product with a vector $f(z) = Rz$ as
$$f(z)_i = \tilde{x}_i + \delta\, \tilde{x}_1 - \sum_k \frac{\tilde{x}_k}{N}, \qquad \text{where } \tilde{x}_i = a \sum_{k=1}^{i} z_k,$$
and its inverse as
$$f^{-1}(x)_i = \frac{\tilde{x}_i - \tilde{x}_{i-1}}{a}, \qquad \text{where } \tilde{x}_i = x_i - x_1 + \frac{1}{\delta}\sum_k \frac{x_k}{N}.$$
This definition of $R$ induces the following inverse covariance matrix on the noise, which possesses the special structure
$$\Sigma^{-1} = (RR^\top)^{-1} = \frac{1}{a^2}\begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix} + \frac{1}{(N a \delta)^2}\, \mathbf{1}\mathbf{1}^\top.$$
The parameter $a$ sets the length scale of the chain and the parameter $\delta$ sets the allowed amount of translational noise about the origin. This latter parameter is important for training on complexes, where each chain may not have a center of mass at 0.
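Because $f$ is a cumulative sum followed by a (partial) centering, both $Rz$ and $R^{-1}x$ can be computed in linear time without ever forming the matrix. The sketch below (illustrative numpy, not the released code) implements both directions and checks that they invert each other.

```python
import numpy as np

def chain_R_forward(z, a, delta):
    """f(z) = R z: cumulative-sum chain plus delta-scaled translational freedom."""
    x_tilde = a * np.cumsum(z, axis=0)
    return x_tilde + delta * x_tilde[0] - x_tilde.mean(axis=0)

def chain_R_inverse(x, a, delta):
    """f^{-1}(x) = R^{-1} x: undo the centering, then take first differences."""
    x_tilde = x - x[0] + x.mean(axis=0) / delta
    return np.diff(x_tilde, axis=0, prepend=np.zeros((1, x.shape[1]))) / a

# Round-trip check on random inputs.
z = np.random.randn(50, 3)
x = chain_R_forward(z, a=1.5, delta=0.1)
print(np.allclose(chain_R_inverse(x, a=1.5, delta=0.1), z))  # True
```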
Under this prior, the displacement between residues $i$ and $j$ has coordinates distributed as
$$r_{ij} \sim \mathcal{N}\!\left(0,\; a^2 |i - j|\right).$$
The expected squared norm for a multivariate normal distribution (MVN) with spherical covariance is $\|\mu\|_2^2 + k\sigma^2$, where $k$ is the dimensionality, so we have
$$\mathbb{E}_{p(x_t|x_0)}\left[D_{ij}^2(x_t)\right] = \alpha_t^2\, D_{ij}^2(x_0) + \left(1 - \alpha_t^2\right) 3a^2 |i - j|.$$
When $t = 0$, the expected squared distances are those of the data distribution, while when $t = 1$, they are those of an ideal Gaussian chain.
To compute the expected radius of gyration, we can use the identity that it is simply half of the root mean square of inter-residue distances:
$$\mathbb{E}_{p_{\text{prior}}}\left[R_g^2(x)\right] = \frac{1}{2N^2}\sum_{i,j}\mathbb{E}_{p_{\text{prior}}}\left[\left\|x^{(j)} - x^{(i)}\right\|_2^2\right] = \frac{1}{2N^2}\sum_{i,j} 3a^2 |i - j| = \frac{3a^2}{N^2}\sum_{i=1}^{N}\sum_{j=i}^{N}(j - i) = 3a^2\, \frac{N}{6}\, \frac{N^2 - 1}{N^2}.$$
Therefore, we can also view the mean behavior of the diffusion as linearly interpolating the squared radius of gyration as
$$\mathbb{E}_{p(x_t|x_0)}\left[R_g^2(x_t)\right] = \alpha_t^2\, R_g^{(0)\,2} + \left(1 - \alpha_t^2\right) 3a^2\, \frac{N}{6}\, \frac{N^2 - 1}{N^2}.$$
When $\alpha_t \to 0$ and $N \gg 1$, the term $\frac{N^2 - 1}{N^2} \approx 1$ and we recover the well-known scaling for an ideal chain, $\mathbb{E}\left[R_g^2(x_t)\right] = \frac{N l^2}{6}$, where the segment length is $l = \sqrt{3}\, a$.
$$x_i = a z_i + b\, x_{i-1} = a \sum_{j=2}^{i} b^{i-j} z_j + b^{i-1} x_1.$$
Here, the parameter $a$ is a global scale parameter setting the "segment length" of the polymer and $b$ is a "decay" parameter which sets the memory of the chain to fluctuations. Informally, at each step along the chain, we move a fraction $1 - b$ of the way towards the origin and then step in a random direction with step scale $a$. We recover a spherical Gaussian when $b = 0$ and the ideal Gaussian chain when $b = 1$.
This system can also be written in matrix form as $x = Rz$ with
$$R = a\begin{pmatrix} v b^0 & & & & \\ v b^1 & b^0 & & & \\ v b^2 & b^1 & b^0 & & \\ \vdots & & \ddots & \ddots & \\ v b^{N-1} & \cdots & b^2 & b^1 & b^0 \end{pmatrix},$$
where $a v = \sqrt{\operatorname{Var}(x_1)}$. We can solve for the equilibrium value of $v$ via the condition $\operatorname{Var}(x_1) = a^2 v^2 = \operatorname{Var}(x_i) = \operatorname{Var}(x_{i-1})$. The solution is $v = \frac{1}{\sqrt{1 - b^2}}$.
To compute the expected radius of gyration, we will use the identity $R_g^2(x) = \frac{1}{2N^2}\sum_{i,j} D_{ij}^2(x)$, which we can compute via the variance of the residual between $x_i$ and $x_j$. Assuming $j > i$, we have
$$\frac{x_j - x_i}{a} = \sum_{k=i+1}^{j} b^{j-k} z_k + \sum_{k=2}^{i}\left(b^{j-k} - b^{i-k}\right) z_k + \frac{b^{j-1} - b^{i-1}}{\sqrt{1 - b^2}}\, z_1,$$
and computing the variance of each term and summing over residue pairs gives
$$\frac{1}{a^2}\,\mathbb{E}\left[R_g^2(x)\right] = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=i}^{N}\frac{1}{a^2}\,\mathbb{E}\left[D_{ij}^2(x)\right] = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=i}^{N}\frac{2\left(1 - b^{j-i}\right)}{1 - b^2} = \frac{2b^{N+1} - b^2 N(N+1) + 2b\left(N^2 - 1\right) - N(N-1)}{(b-1)^3 (b+1) N^2} \approx \left(\frac{6b}{N} + 1 - b^2\right)^{-1} = \frac{N}{6b + N\left(1 - b^2\right)},$$
where the approximation holds for $b \in (0, 1)$ and $N \gg 1$. The approximation in the penultimate step works quite well in practice and becomes more accurate with growing $N$, which we can verify with the limit
$$\forall\, b \in (0, 1): \quad \lim_{N \to \infty}\; \frac{2b^{N+1} - b^2 N(N+1) + 2b\left(N^2 - 1\right) - N(N-1)}{(b-1)^3 (b+1) N^2}\left(\frac{6b}{N} + 1 - b^2\right) = 1.$$
Limiting Behaviors We can verify that this result reproduces the expected limiting behavior of an ideal unfolded chain when $b \to 1$,
$$\lim_{b \to 1}\frac{1}{a^2}\,\mathbb{E}\left[R_g^2(x)\right] = \frac{N}{6},$$
and of a standard normal distribution when $b \to 0$,
$$\lim_{b \to 0}\frac{1}{a^2}\,\mathbb{E}\left[R_g^2(x)\right] = 1.$$
$R_g^2$ Scaling To finish up, we can add back in our global scaling factor $a$ to give
$$\mathbb{E}_{x \sim p_{\text{prior}}(x)}\left[R_g^2(x)\right] \approx \frac{N a^2}{6b + N\left(1 - b^2\right)}.$$
To match the empirical scaling law $R_g = r N^\nu$, we set
$$\left(r N^\nu\right)^2 = \frac{N a^2}{6b + N\left(1 - b^2\right)}.$$
We also allow partial mean-centering of the chain with a parameter $\xi$ as
$$x_i = \tilde{x}_i - \xi \sum_k \frac{\tilde{x}_k}{N},$$
where $\log\det R_{\text{center}}$ follows from the matrix determinant lemma. Thus,
$$\det R = \frac{a^N (1 - \xi)}{\sqrt{1 - b^2}},$$
and the precision matrix of the uncentered process is
$$\Sigma^{-1}_{\text{uncentered}} = \frac{1}{a^2}\, R_{\text{sum}}^{-\top} R_{\text{init}}^{-\top} R_{\text{init}}^{-1} R_{\text{sum}}^{-1} = \frac{1}{a^2}\left(I - b P^\top\right) D \left(I - b P\right) = \frac{1}{a^2}\left(D - b\left(P^\top D + D^\top P\right) + b^2 P^\top D P\right) = \frac{1}{a^2}\left(D - b\left(P^\top + P\right) + b^2 P^\top P\right) = \frac{1}{a^2}\begin{pmatrix} 1 & -b & & & & \\ -b & 1 + b^2 & -b & & & \\ & -b & 1 + b^2 & -b & & \\ & & \ddots & \ddots & \ddots & \\ & & & -b & 1 + b^2 & -b \\ & & & & -b & 1 \end{pmatrix},$$
where the penultimate line follows from the behavior of the shift operator $P$.
We can identify within this precision matrix a linear combination of two well-known precision matrices, the precision of Brownian motion (i.e. the chain Laplacian matrix) and the precision of a spherical Gaussian (i.e. an identity matrix), along with some nuisance boundary conditions:
$$\Sigma^{-1} = \frac{1}{a^2}\left[\, b \begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix} + (1 - b)^2 I + \begin{pmatrix} b(1-b) & & & \\ & 0 & & \\ & & \ddots & \\ & & & b(1-b) \end{pmatrix} \right].$$
This provides another simple characterization of our globular covariance model as the result of a combination of 'chain springs' holding the polymer together locally along with 'burial springs' pulling the chain towards the origin. This simple energetic structure has been leveraged in prior biophysical 'toy models' of hydrophobic collapse in proteins [90].
Whereas previous frame-based approaches model the remaining N, C, O atoms as locked to the Cα carbon with variable rotation and ideal geometry, we can simply model these atoms as normally distributed around Cα with a fixed standard deviation $\sigma_{\text{intra}}$. At full noise levels this will induce an isotropic distribution over implied frame orientations while keeping these atoms close to the parent Cα, and as such it can be considered an off-ideality relaxation of frame diffusion models [67] or an all-backbone-atom extension of i.i.d. Cα diffusion models [91].
This sequential Gaussian dependency structure within residues implies that all coordinates are jointly Gaussian, with square root of the covariance matrix
$$R_{\text{gas}} = \begin{pmatrix} R_{\text{residue}} & & & \\ & R_{\text{residue}} & & \\ & & \ddots & \\ & & & R_{\text{residue}} \end{pmatrix}.$$
In our experiments we set the intra-residue standard deviation to $\sigma_{\text{intra}} = 1$ and the residue standard deviation to $\sigma_{C\alpha} = 10$. As can be seen in Supplementary Figure 3, this covariance implies trajectories that are extremely similar to frame-based diffusion [67], but with the added benefit that we can treat non-ideal bond stretch and angle fluctuations. We do lose the guarantee of fixed internal ideal geometries, but this only requires learning the equivalent of $\sim 6$ additional numbers.
It is a long-standing principle in the simulation of physical systems that more distant interactions can be modeled more coarsely for the same level of accuracy. For example, in cosmological simulations, one can approximate the gravitational forces acting on a star in a distant galaxy by treating that galaxy as a point at its center of mass.
So far, most relational machine learning systems [94] for protein structure have tended to process information in a manner that is either based on local connectivity (e.g. k-Nearest Neighbors or cutoff graphs) [95] or on all-vs-all connectivity [66, 67, 91]. The former approach is natural for highly spatially localized tasks such as structure-conditioned sequence design and the characterization of residue environments, but it is less clear if local graph-based methods can effectively reason over global structure in a way that is possible with fully connected graph neural networks, such as Transformers [96]. Here we ask if there might be reasonable ways to add long-range reasoning while preserving sub-quadratic scaling simply by random graph construction.
Related work Our method evokes similarity to approaches that have been used to scale Transformers to large documents by combining a mixture of local and deterministically [97] or randomly sampled long-range context [98]. Distance-dependent density of context has also been explored in multiresolution attention for Vision Transformers [99] and in dilated convolutional neural networks [100].
For the system sizes considered in this work (complexes containing up to 4,000 residues⁶), it was sufficient to simply set the number of edges per node to a constant $k = 60$, which means that the graph and associated computation will scale within this bounded size as $O(N)$. This is a considerable improvement over previous approaches for global learning on protein structure, such as methods based on fully connected graph neural networks [91], which scale as $O(N^2)$, or Evoformer-based approaches [66], which scale as $O(N^3)$. These sparse graphs also combine favorably with our method for synthesizing updated protein structures from predicted inter-residue geometries (Supplementary Appendix F).
⁶ In some of our symmetry examples we find that models still generalize well to systems larger than those they were trained on.
Predicting structure as predicting constraints In principle, protein structures arise from a balance of competing intra- and inter-molecular forces. In that sense, protein structure may be regarded as the solution to a constraint satisfaction problem with many competing potential interactions across multiple length scales. It is therefore natural to think about protein structure prediction as a so-called "Structured Prediction" problem [107], in which predictions are cast as the low-energy configurations of a learned potential function. Structured Prediction formulations of tasks often learn in a data-efficient manner because it can be simpler to characterize the constraints in a system than the outcomes of those constraints. This perspective can be leveraged for molecular geometries via differentiable optimization or differentiable molecular dynamics [106, 108, 109], but these approaches are often unstable and can be cumbersome to integrate as part of a larger learning system.
[Supplementary Figure 5 schematic: predicted inter-residue distances, directions, and orientations are iteratively combined into a consensus structure.]
Here we instead take a simpler approach based on convex optimization. We show how predicting pairwise inter-residue geometries as pairwise rigid transformations with potentially anisotropic uncertainty models induces a convex optimization problem which can be solved by a simple iteration that quickly drives towards a global consensus configuration. Throughout this section, we will build on the widely adopted approach of representing the rigid orientations of residues in proteins via coordinate reference frames [66, 67, 106].
The key idea of our update is that we ask the network to predict a set of inter-residue geometries $T_{ij}$ together with confidences $w_{ij}$ (which will initially be simple but can be extended to anisotropic uncertainty), and we then attempt to either fully or approximately solve for the consensus structure that best satisfies this set of pairwise predictions. We visualize the method in Supplementary Figure 5.
Converting from backbones to transforms We represent the rigid pose of a residue as an absolute translation and rotation in space, $T_i \triangleq (O_i, t_i)$. We can compute these residue poses by building an orthonormal basis from three backbone coordinates at a residue $i$, i.e. from the set of atoms $\{x_i^{N}, x_i^{C\alpha}, x_i^{C}\}$. To do this, we define the vectors $v_1 = x_i^{N} - x_i^{C\alpha}$ and $v_2 = x_i^{C} - x_i^{C\alpha}$, and then build an orthonormal basis as
$$u_1 = \frac{v_1}{\|v_1\|}, \quad u_2 = \frac{v_2}{\|v_2\|}, \qquad n_1 = u_1, \quad n_2 = \frac{n_1 \times u_2}{\|n_1 \times u_2\|}, \quad n_3 = \frac{n_1 \times n_2}{\|n_1 \times n_2\|},$$
which gives the final transform as
$$T_i = \left([n_1, n_2, n_3]^\top,\; x_i^{C\alpha}\right).$$
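A direct numpy transcription of this construction is given below (a sketch for illustration):

```python
import numpy as np

def residue_frame(x_N, x_CA, x_C):
    """Build (O_i, t_i) from the N, CA, C atom coordinates of one residue."""
    v1 = x_N - x_CA
    v2 = x_C - x_CA
    n1 = v1 / np.linalg.norm(v1)
    u2 = v2 / np.linalg.norm(v2)
    n2 = np.cross(n1, u2)
    n2 /= np.linalg.norm(n2)
    n3 = np.cross(n1, n2)
    n3 /= np.linalg.norm(n3)
    O = np.stack([n1, n2, n3])       # rows are the basis vectors, i.e. [n1, n2, n3]^T
    return O, x_CA
```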
We note that pose representations are SE(3) equivariant but are not invertible unless one forces coordinates to adopt ideal geometries, as is the choice in many structure prediction and diffusion methods [66, 67, 110, 111]. Many backbone geometries with differing internal bond lengths and angles will give rise to the same transform $T_i$ (though it is also true that many structures are not resolved at a resolution sufficient to meaningfully distinguish these degrees of freedom). Nevertheless, we can retain the benefits of both coarse transformation frames for prediction and fine all-atom granularity via a hierarchical decomposition in which we predict coarse residue-transform-based inter-residue geometries along with sub-frame deviations from ideality, which can in turn be composed (equivariantly) to yield the final structure.
Convex problem How can we define a consensus structure given a set of predictions of inter-residue geometries, some of which may agree and some of which may disagree? This problem is naturally formulated as an optimization problem. Given a collection of pairwise inter-residue geometry predictions and confidences $\{T_{ij}, w_{ij}\}_{ij \in E}$, we score a candidate structure $\{T_i\}_{i=1}^{N}$ via a weighted loss $U$ that measures the agreement between the current pose of each residue $T_i$ and the predicted pose of the residue given each neighbor $T_j$ and the predicted geometry $T_{ji}$ as
$$U\left(\{T_i\}; \{w_{ij}, T_{ij}\}\right) = \sum_{i,j} w_{ij}\left\|T_i - T_j \circ T_{ji}\right\|^2 = \sum_{i,j} w_{ij}\left\|O_i - O_j O_{ji}\right\|^2 + w_{ij}\left\|t_i - \left(O_j t_{ji} + t_j\right)\right\|^2.$$
Note that we define a norm on the discrepancy between two Euclidean transforms $T_a, T_b$ as $\|T_a - T_b\|^2 \triangleq \|O_a - O_b\|^2 + \|t_a - t_b\|^2$. We wish to optimize each local pose $T_i$ with its neighbors fixed as
$$T_i^\star \leftarrow \arg\min_{T_i}\; U\left(\{T_i\}; \{w_{ij}, T_{ij}\}\right).$$
This problem of finding the local "consensus pose" $T_i^\star$ for a residue given its neighborhood is a convex optimization problem, the solution to which can be realized analytically as a weighted average with projection,
$$T_i^\star = \left(\mathrm{Proj}_{SO(3)}\!\left(\sum_j p_{ij}\, O_j O_{ji}\right),\; \sum_j p_{ij}\left(O_j t_{ji} + t_j\right)\right), \qquad p_{ij} = \frac{w_{ij}}{\sum_j w_{ij}},$$
and the projection operator may be implemented via SVD as in the Kabsch algorithm [68] for optimal RMSD superposition. If we iterate this update multiple times for all positions in parallel, we obtain a parallel coordinate descent algorithm which can rapidly equilibrate towards a global consensus (Supplementary Figure 5).
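A minimal numpy sketch of one such parallel update is shown below. The data-structure choices (dictionaries keyed by residue pairs) are illustrative only; the essential operations are the confidence-weighted averaging and the SVD projection back onto SO(3).

```python
import numpy as np

def project_SO3(M):
    """Project a 3x3 matrix onto the nearest rotation (Kabsch-style SVD projection)."""
    U, _, Vt = np.linalg.svd(M)
    d = np.sign(np.linalg.det(U @ Vt))
    return U @ np.diag([1.0, 1.0, d]) @ Vt

def consensus_update(O, t, T_pred, w, neighbors):
    """One parallel consensus step.

    O: (N,3,3) rotations, t: (N,3) translations, neighbors[i]: neighbor indices of i,
    T_pred[(j,i)] = (O_ji, t_ji) predicted pose of i in the frame of j, w[(i,j)]: confidence.
    """
    O_new, t_new = O.copy(), t.copy()
    for i, nbrs in enumerate(neighbors):
        p = np.array([w[(i, j)] for j in nbrs])
        p = p / p.sum()                                   # p_ij = w_ij / sum_j w_ij
        O_avg = sum(p_ij * (O[j] @ T_pred[(j, i)][0]) for p_ij, j in zip(p, nbrs))
        t_avg = sum(p_ij * (O[j] @ T_pred[(j, i)][1] + t[j]) for p_ij, j in zip(p, nbrs))
        O_new[i], t_new[i] = project_SO3(O_avg), t_avg
    return O_new, t_new
```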
Two-parameter uncertainty models The above iteration leverages an isotropic uncertainty model in which the error model for the translational component is spherically symmetric and coupled to the uncertainty in the rotational component of the transform. We may also consider anisotropic uncertainty models where these confidences are decoupled. In the first of these, we decouple the weight $w_{ij}$ into separate factors for the translational and rotational components of uncertainty, $w^{t}_{ij}$ and $w^{\angle}_{ij}$, respectively. The overall error model being optimized is then
$$U\left(\{T_i\}; \{w_{ij}, T_{ij}\}\right) = \sum_{i,j} w^{\angle}_{ij}\left\|O_i - O_j O_{ji}\right\|^2 + w^{t}_{ij}\left\|t_i - \left(O_j t_{ji} + t_j\right)\right\|^2.$$
This makes intuitive sense when the network possesses high confidence about the relative position of another residue but not its relative orientation, and the problem may still be solved analytically by weighted averaging with projection.
$$\left(0,\; x_i^{\mathrm{ATOM}}\right) = T_i \circ \left(0,\; t_i^{\mathrm{ATOM}}\right).$$
We schematize this combined method in Algorithm 2. These predictions will be equivariant because they are right-composed with the parent residue poses, which are equivariant because they are built from relative, equivariant projection from the initial geometry $x_t$.
where $\tilde{x}_\theta(x_t, t)$ is the output from the inter-residue consensus and the time-dependent 'gate' $\eta_t$ was set in two ways:
• Output Scaling A. Set $\eta_t$ to scale as $\sqrt{1 - \mathrm{SSNR}_t}$ with a learnable offset by parameterizing it as $\eta_t = S\!\left(S^{-1}(\mathrm{SSNR}_t) + u_t^\theta\right)$, where $S(\cdot)$ is the sigmoid function and $u_t^\theta$ is parameterized by a small MLP.
• Output Scaling B. Set $\eta_t$ to scale as $\sqrt{1 - \mathrm{SSNR}_t}$ with a learnable offset by parameterizing it as $\eta_t = 1 - \left(1 - S\!\left(S^{-1}(\mathrm{SSNR}_t) + u_t^\theta\right)\right)\mathbb{I}\!\left(\mathrm{SSNR}_t > \mathrm{CUTOFF}\right)$, where $S(\cdot)$ is the sigmoid function, $u_t^\theta$ is parameterized by a small MLP, and $\mathrm{CUTOFF} = 0.99$. This is similar to the previous scaling but is almost always disabled except for the highest values of the signal-to-noise ratio.
G Chroma Architecture
Chroma builds a joint distribution over the sequence and all-atom structure of protein complexes via the factorization
$$p_\theta(x, s, \chi) = p_\theta(x)\, p_\theta(s, \chi \mid x).$$
We model these likelihoods with two networks: a backbone network trained as a diffusion model to model $p(x)$ and a design network which models sequence and side-chain conformations conditioned on backbone structure. Both networks are based on a common graph neural network architecture, and we visualize the overall system in Supplementary Figure 7. We list important hyperparameters for the backbone network in Supplementary Table 2 and for the design network in Supplementary Table 3. We design sequences by extending the framework of [95] and factorizing joint rotamer states autoregressively in space, and then locally autoregressively per side-chain $\chi$ angle within a residue as done in [113]. For the sequence decoder, we explore both autoregressive decoders of sequence (pictured in Supplementary Figure 7) and conditional random field decoding of sequence, which was also explored in concurrent work [114].
Supplementary Figure 7: Chroma is composed of graph neural networks for backbone denoising and sidechain design.
Featurization We represent protein structure as an attributed graph with node and edge embeddings computed as SE(3)-invariant features of the input backbone. For the node features we encode local geometry via bond lengths and the backbone dihedral angles lifted to the unit circle via paired sin and cos featurization. We encode the inter-residue geometries between each pair of nodes $(i, j)$ with the following edge features:
• Inter-atomic distances: distances between all atoms at residues $i$ and $j$, i.e. the $8 \times 8$ distance matrix, lifted into a radial basis via $f_i(D_{ab}) \triangleq e^{-(D_{ab} - \mu_i)^2 / \sigma_i^2}$ for $1 \le i \le 20$, with centers $\mu_i$ spaced linearly on $[0, 20]$ and $\sigma_i = 1$.
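The radial-basis lifting of distances is a one-line operation; the sketch below (illustrative numpy, with the default parameters taken from the description above) shows the intended featurization.

```python
import numpy as np

def rbf_features(D, num_bases=20, d_min=0.0, d_max=20.0, sigma=1.0):
    """Lift distances (any array shape) into Gaussian radial basis features."""
    mu = np.linspace(d_min, d_max, num_bases)
    return np.exp(-((D[..., None] - mu) ** 2) / sigma**2)

print(rbf_features(np.array([[3.8, 10.2]])).shape)   # (1, 2, 20)
```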
p_i ← Concatenate(ñ_i, m_i)
n_i ← n_i + NodeUpdateMLP(p_i)
end for
for each i ∈ [N] do
    for each j ∈ N(i) do
        p_ij ← Concatenate(n_i, n_j, ẽ_ij)
        e_ij ← e_ij + EdgeUpdateMLP(p_ij)
    end for
end for
return n_i, e_ij ▷ Updated node and edge embeddings
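A compact, dense PyTorch sketch of one such node/edge update round is given below; it is illustrative only, assuming fixed k-NN neighbor indices and simple mean aggregation, and is not the exact ChromaBackbone layer.

```python
import torch
import torch.nn as nn

class GraphUpdateLayer(nn.Module):
    """Illustrative sketch of one node/edge update round from the pseudocode
    above; shapes and MLP widths are assumptions."""

    def __init__(self, node_dim: int, edge_dim: int, hidden: int = 128):
        super().__init__()
        self.message_mlp = nn.Sequential(
            nn.Linear(2 * node_dim + edge_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, node_dim))
        self.node_mlp = nn.Sequential(
            nn.Linear(2 * node_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, node_dim))
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * node_dim + edge_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, edge_dim))

    def forward(self, nodes, edges, neighbors):
        # nodes: (N, node_dim); edges: (N, K, edge_dim); neighbors: (N, K) long
        nbr_nodes = nodes[neighbors]                          # (N, K, node_dim)
        self_nodes = nodes.unsqueeze(1).expand_as(nbr_nodes)  # (N, K, node_dim)
        # Aggregate messages over neighbors, then residually update nodes.
        messages = self.message_mlp(torch.cat([self_nodes, nbr_nodes, edges], -1))
        nodes = nodes + self.node_mlp(torch.cat([nodes, messages.mean(1)], -1))
        # Residually update each edge from its (updated) endpoint embeddings.
        nbr_nodes = nodes[neighbors]
        self_nodes = nodes.unsqueeze(1).expand_as(nbr_nodes)
        edges = edges + self.edge_mlp(torch.cat([self_nodes, nbr_nodes, edges], -1))
        return nodes, edges
```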
Equivariance Because the input features are SE(3) invariant and the update layer (see section
F for details) is SE(3) equivariant, the ChromaBackbone network is SE(3) equivariant and the
ChromaDesign network is SE(3) invariant.
G.2 ChromaBackbone
The backbone network parameterizes an estimate of the optimal denoiser x̂θ (xt ,t) and combines
a graph neural network described in the previous section with the inter-residue consensus layer
described in Appendix F. We trained two major versions used throughout this work (aside from
the ablation study), with hyperparameters described in Supplementary Table 2.
G.3 ChromaDesign
The design network parameterizes the conditional distribution of sequence and χ angles given
structure pθ (s, χ|x) by combining the graph neural network encoder described in the previous
section with sequence and side-chain decoding layers. To enable robust sequence prediction and
the potential for use as a conditioner, we train ChromaDesign with diffusion augmentation, i.e.
we predict sequence and chi angles given a noisy structure xt and a time t as pθ (s, χ|xt ,t). We
consider both a Potts decoder architecture which admits compact and fast constrained sampling
with conditioning or auxiliary objectives, as well as an autoregressive decoder architecture for
capturing higher-order dependencies in the sequence and modeling sidechain conformations given
sequence and structure.
erative frameworks which model all-atom protein structure with mixed diffusions over backbone,
sequence, and side-chain degrees of freedom [67, 120]. Furthermore, we are beginning to see ex-
perimental validation of diffusion-based models for structure and/or sequence [119, 121] and for
partially joint sequence-structure models that combine a language model prior with deterministic
structure prediction [122].
One common theme of generative models for proteins thus far has been dense reasoning in which,
to generate complex molecular systems like proteins or protein complexes, learning frameworks
must reason over all possible pairs of interactions in a system. While these approaches will, by con-
struction, always be able to perform as well as sparsely-connected approaches, Chroma provides
evidence that simpler frameworks based entirely on sparse reasoning and knowledge of domain
structure can be sufficient to build a complete joint model for complex multi-molecular systems
such as protein complexes. We anticipate that this sufficiency argument may be important for two
reasons: Firstly, subquadratic scaling O(N log N) of algorithms has been a foundational paradigm
for modeling the physical world from molecular [123] to cosmological systems [93]. Second, and
perhaps more speculatively, it may be argued that, given multiple algorithms with similar perfor-
mance, simpler and more computationally efficient algorithms are more likely to be robust and to
generalize [124].
Potts Decoder In the Potts formulation of the ChromaDesign network, we factorize the condi-
tional distribution of sequence as a conditional Potts model, a type of conditional random field
[56], with likelihood
$$p_\theta(s \mid x) = \frac{1}{Z(x, \theta)} \exp\left(-\sum_i h_i(s_i; x) - \sum_{i<j} J_{ij}(s_i, s_j; x)\right),$$
where the conditional fields $h_i(s_i; x)$ and conditional couplings $J_{ij}(s_i, s_j; x)$ are parameterized by the node and edge embeddings of the graph neural network, respectively. We train the Potts decoder with diffusion augmentation to predict sequence given a noisy structure $x_t$ and a time $t$
Supplementary Figure 8: Randomized autoregression orders with spatial smoothing vary the typical spatial context for sequence modeling. Uniformly random autoregression orders (left) are spatially uncorrelated and as a result induce highly disordered contexts which are unlike the conditionals used during substructure design tasks. Uniformly random orderings can be transformed into spatially coherent orderings by applying tunable spatial smoothing to the original ordering values, followed by ARGSORT. We apply spatial smoothing by local neighborhood averaging on a k-NN graph. Intermediate strengths of spatial smoothing produce locally coherent orderings (middle), while strong smoothing produces crystallization-like, coherent traversals of the entire structure (right). We uniformly sample µsmooth ∼ U(0, 1) at training time.
as
$$p_\theta(s \mid x_t, t) = \frac{1}{Z(x_t, t, \theta)} \exp\left(-\sum_i h_i(s_i; x_t, t) - \sum_{i<j} J_{ij}(s_i, s_j; x_t, t)\right).$$
Advantages of the Potts decoders include that they admit fast global optimization even when combined with conditioning constraints or co-objectives via simulated annealing or gradient-based samplers [125], and that they have been highly validated experimentally as sufficient generative models for generating diverse and functional samples when trained on protein families. A disadvantage is that they are limited to modeling second-order effects and require many more iterations of Monte Carlo sampling than one-shot ancestral sampling of autoregressive models.
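As an illustration of the sampling route mentioned above, the sketch below runs Gibbs sampling with a simulated-annealing temperature schedule over a dense Potts parameterization; it is a minimal sketch under assumed tensor layouts (h of shape N×A, J of shape N×N×A×A) and is not the production sampler.

```python
import numpy as np

def sample_potts(h, J, n_sweeps=200, T_start=2.0, T_end=0.1, rng=None):
    """Gibbs sampling with an annealed temperature schedule for a Potts model
    with fields h (N, A) and couplings J (N, N, A, A); a minimal sketch."""
    rng = np.random.default_rng() if rng is None else rng
    N, A = h.shape
    seq = rng.integers(A, size=N)
    for T in np.geomspace(T_start, T_end, n_sweeps):
        for i in rng.permutation(N):
            # Conditional energy of every amino-acid choice at position i.
            energy = h[i].copy()
            for j in range(N):
                if j != i:
                    energy += J[i, j, :, seq[j]]
            logits = -energy / T
            p = np.exp(logits - logits.max())
            p /= p.sum()
            seq[i] = rng.choice(A, p=p)
    return seq
```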
Autoregressive Decoder We build on the theme of using graph neural networks with autore-
gressive decoders for sequence design [106, 115, 116] and factorize the conditional distribution of
sequence given structure autoregressively as
$$p_\theta(s \mid x) = \prod_i p_\theta\!\left(s_{\pi_i} \mid s_{\pi_{i-1}}, \ldots, s_{\pi_1}, x\right),$$
where π is a permutation specifying a decoding order for the sequence. We sample random
traversals with a randomly sampled amount of spatial correlation, as shown in Supplementary
Figure 8, that may better align with conditionals encountered at design time and enable more
spatially structured decompositions that mix more effectively in causally-masked message passing.
We train the autoregressive decoder with diffusion augmentation to predict sequence given a noisy
structure xt and a time t as
$$p_\theta(s \mid x_t, t) = \prod_i p_\theta\!\left(s_{\pi_i} \mid s_{\pi_{i-1}}, \ldots, s_{\pi_1}, x_t, t\right).$$
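A minimal NumPy sketch of generating such spatially smoothed decoding orders is shown below; the specific smoothing update (iterated k-NN neighborhood averaging with a mixing weight) is an assumption consistent with the description in Supplementary Figure 8.

```python
import numpy as np

def smoothed_decoding_order(coords, k=16, mu_smooth=0.5, n_iters=10, rng=None):
    """Draw i.i.d. values per residue, repeatedly average them over a k-NN
    graph with mixing weight mu_smooth, and ARGSORT the result; a sketch of
    the randomized-but-spatially-correlated orderings in Supplementary Figure 8."""
    rng = np.random.default_rng() if rng is None else rng
    N = coords.shape[0]
    # Pairwise distances and k nearest neighbors per residue (C-alpha coordinates).
    D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]
    values = rng.random(N)
    for _ in range(n_iters):
        values = (1 - mu_smooth) * values + mu_smooth * values[nbrs].mean(axis=1)
    return np.argsort(values)   # permutation pi used for decoding

# Example: sample a fresh, randomly smoothed order for a 200-residue chain.
coords = np.cumsum(np.random.randn(200, 3), axis=0)
pi = smoothed_decoding_order(coords, mu_smooth=np.random.rand())
```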
Sidechain Decoding We model the conditional distribution of side chain conformations given sequence and backbone structure by modeling the χ angles with an autoregressive decomposition
as
$$p_\theta(\chi \mid s, x) = \prod_i p_\theta\!\left(\chi_{\pi_i} \mid \chi_{\pi_{i-1}}, \ldots, \chi_{\pi_1}, s, x\right),$$
where the conditional joint distributions $p_\theta(\chi_{\pi_i} \mid \chi_{\pi_{i-1}}, \ldots, \chi_{\pi_1}, s, x)$ at each residue locally factorize into up to four discrete, sequential decisions as in [113]. We model these with empirical histograms for each angular degree of freedom binned into 36 bins, i.e. with 10° angular resolution. During sampling, we convert the discrete binned probability masses into linearly interpolated probability densities, giving a distribution over angles that is fully supported on the hyper-torus.
We train the sidechain decoder with diffusion augmentation to predict χ angles from a sequence $s$, a noisy structure $x_t$, and a time $t$ as
$$p_\theta(\chi \mid s, x_t, t) = \prod_i p_\theta\!\left(\chi_{\pi_i} \mid \chi_{\pi_{i-1}}, \ldots, \chi_{\pi_1}, s, x_t, t\right).$$
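The following sketch shows one way the binned masses could be converted to a periodic, linearly interpolated density; the interpolation details are assumptions, and only the 36-bin, 10° setup is taken from the text.

```python
import numpy as np

def interpolated_angle_density(bin_probs, angles):
    """Evaluate a periodic, piecewise-linear density from binned probability
    masses over [-pi, pi); a sketch of the conversion described above."""
    n_bins = bin_probs.shape[0]                    # e.g. 36 bins of 10 degrees
    width = 2 * np.pi / n_bins
    centers = (np.arange(n_bins) + 0.5) * width - np.pi
    density = bin_probs / width                    # mass -> piecewise-constant density
    # Signed offset to every bin center, with wraparound at +/- pi.
    delta = (angles[..., None] - centers + np.pi) % (2 * np.pi) - np.pi
    idx = np.argmin(np.abs(delta), axis=-1)        # nearest bin center
    frac = np.take_along_axis(delta, idx[..., None], -1)[..., 0] / width
    nxt = (idx + np.sign(frac).astype(int)) % n_bins
    w = np.abs(frac)
    # Linear interpolation between the two nearest bin-center densities.
    return (1 - w) * density[idx] + w * density[nxt]

# Example: density of chi = 60 degrees under a uniform toy histogram.
probs = np.full(36, 1 / 36)
val = interpolated_angle_density(probs, np.array([np.pi / 3]))
```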
H Training
H.1 Dataset
Processing We constructed our training dataset from a filtered version of the Protein Data Bank
[126] queried on 2022-03-20. We filtered for non-membrane X-ray protein structures with a resolu-
tion of 2.6 Å or less and reduced redundancy by clustering homologous sequences with USEARCH
[127] at 50% sequence identity and selecting one sequence per cluster. Additionally, because anti-
body folds are highly diverse in both sequence and structure and highly relevant to biotherapeutic development, we enriched our redundancy-reduced set with 1726 non-redundant antibodies that were
clustered at a 90% sequence identity cutoff. This yielded 28,819 complex structures which were
transformed into their biological assemblies by favouring assembly ID where the authors and soft-
ware agreed, followed by authors and finally by software only. Missing side-chain atoms were
added with pyRosetta [128].
Splitting We split the data set into 80%/10%/10% train, validation, and test splits by min-
imizing the sequence similarity overlap using entries of PFAM family ID, PFAM clan ID [129],
UniProt ID [130] and MMSEQ2 cluster ID at a 30% threshold [131]. To accomplish this, we
construct a similarity graph in which each PDB entry is represented by a node connected to other
entries that share at least one identical annotation. Connected sub-graphs are identified and bro-
ken apart by iteratively deleting the most central annotations until there are 50 or fewer connected
nodes. Using this procedure, we increased the fraction of test annotations with no representation in
the training set (versus a random split) from 0.1% to 9% for Pfam clan, from 10% to 59% for Pfam
family, from 50% to 82% for MMSEQ30 cluster, and from 70% to 89% for Uniprot ID.
H.2 Optimization
Backbone network We trained ChromaBackbone v1 on 8 Tesla V100-SXM2-16GB using the
Adam optimizer [132] to optimize a sum of the regularized ELBO loss (Supplementary Appendix
B) and an unweighted sum of the losses described in (Supplementary Appendix B.4). We linearly
annealed the learning rate from 0 to $2 \times 10^{-4}$ over the first 10,000 steps and trained for a total of
1,796,493 steps. Due to the linear scaling memory footprint of our model, we dynamically pack
complexes into minibatches to approach a target number of residues per batch which was 4,000
residues per GPU and thus 32,000 residues per step. We estimated the final model parameters
with an exponential moving average (EMA) of per-step parameter values with a decay factor of
0.999 [133]. We trained ChromaBackbone v0 similarly but without EMA estimation, and we refer
to checkpoints from specific epochs of training as ChromaBackbone v0.XXXX where XXXX is the
epoch number.
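A greedy sketch of the residue-budgeted packing described above is shown below; the actual packing policy used for training is not specified, so this is illustrative only.

```python
import random

def pack_minibatches(complex_sizes, target_residues=4000, shuffle=True):
    """Greedily pack complexes (given as residue counts) into minibatches that
    approach a per-GPU residue budget; a sketch, not the exact training loader."""
    indices = list(range(len(complex_sizes)))
    if shuffle:
        random.shuffle(indices)
    batches, current, current_size = [], [], 0
    for idx in indices:
        size = complex_sizes[idx]
        if current and current_size + size > target_residues:
            batches.append(current)
            current, current_size = [], 0
        current.append(idx)
        current_size += size
    if current:
        batches.append(current)
    return batches

# Example: batches of complexes approaching 4,000 residues each (per GPU).
sizes = [random.randint(100, 2000) for _ in range(100)]
batches = pack_minibatches(sizes)
```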
Design network We trained ChromaDesign Potts and ChromaDesign Multi with the same
framework as the backbone networks but a few specific modifications: We trained ChromaDesign
Potts in a time-invariant manner on uncorrupted samples x0 to optimize a pairwise composite
log-likelihood approximation of the Potts log-likelihood [134], averaged to nats per residue. We
trained ChromaDesign Multi in a time-aware manner on samples xt from the diffusion process.
As a training objective we used the sum of the pairwise composite log likelihood loss for the Potts
decoder (residue-averaged) along with the average per residue log likelihood losses for the three
other decoder ‘heads’: the autoregressive sequence decoder, the marginal sequence decoder (which
independently predicts each residue identity si from structure xt ), and the autoregressive side chain
predictor.
I Sampling
I.1 Sampling backbones
We sampled proteins from Chroma by first generating backbone structures and then designing
sequences conditioned on the backbone. Unless otherwise specified, we generated structures by
integrating the reverse diffusion SDE (appendix Q.2) with λ0 = 10. For constrained conditioners
(e.g. substructure, symmetry), we opt for annealed Langevin dynamics (see appendix M.2 for more
details).
Robust design The ChromaDesign Multi network is trained with diffusion augmentation such
that it can predict sequence given structures as though the structures arose from the diffusion en-
semble at time t. This time conditioning serves as a kind of amortized tunable amount of data
augmentation during design. Throughout this work, we sample sequences with t = 0, but note that
t > 0 is useful for increasing sequence design robustness.
where $\hat{S}$ is the threshold entropy value and $\Delta_{S_i < \hat{S}}$ is an indicator variable of whether $S_i$ falls below $\hat{S}$. Here, we used $w = 30$ and as $\hat{S}$ we chose the 5th percentile of 30-residue local window
entropies in PDB sequences (∼ 2.32 nats). All three restraints effectively restricted the sampling of
sequences from Potts models to regions of expected sequence complexity for native-like sequences,
with the last two having the advantage of not introducing potentially undesired global inter-residue
correlations.
[Supplementary Figure 11 panels: TERM decomposition into self, pair, triple, and full motifs (panel e, top), with closest-match RMSD distributions (example thresholds RMSD < 0.58 Å and RMSD < 0.82 Å) and length axes spanning 200–1000 residues.]
Supplementary Figure 11: Unconditional backbone samples reproduce both low and high or-
der structural statistics of natural proteins. a and c, Distributions of structural properties (in a)
and length-dependent scaling of contact order [136] and radius of gyration (in c) computed on a set
of 50,000 single-chain samples from the unconditional ChromaBackbone v0 at inverse tempera-
ture λ0 = 10, compared to the corresponding metrics for PDB structures. b and d, Distributions
of structural properties (in b) and length-dependent scaling of contact order and radius of gyration
(in d) computed on a set of 50,000 single-chain samples from the unconditional ChromaBackbone
v1 at inverse temperature λ0 = 10 and a set of 500 single-chain samples from the unconditional
ChromaBackbone v1 at inverse temperature λ0 = 1, compared to the corresponding metrics for
PDB structures. Box plots in a and b show medians and inter-quartile ranges. e, The distribution of
closest-match RMSD for TERMs of increasing order originating from native or Chroma-generated
backbones (at inverse temperature λ0 of 1 or 10, as indicated).
Metric | Description | Normalization
Secondary structure content (SS_i) | Fraction of residues mapping into Helix, Strand, or Coil secondary-structure classes | none
Mean Residue Contact (C_mean) | Average number of contacts per residue | none
Long-range Residue Contact (C_long) | Average number of long-range contacts per residue; contacts are long-range if they involve residues separated in sequence by 24 or more positions | none
Contact Order (CO) | Average sequence distance between contacting residues normalized by the total length of the protein; higher contact orders generally indicate longer folding times | CO / N^(-0.3) [141]
Radius of Gyration (R_g) | Root mean square distance of the structure's atomic coordinates from its center of mass | R_g / N^(0.4) [89]
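For reference, a NumPy sketch of the two length-normalized metrics from the table; the contact definition (8 Å Cα cutoff and minimum sequence separation) is an assumption, and only the N^(-0.3) and N^(0.4) normalizations are taken from the table.

```python
import numpy as np

def radius_of_gyration(coords):
    """Root-mean-square distance of coordinates from their center of mass."""
    centered = coords - coords.mean(axis=0)
    return np.sqrt((centered ** 2).sum(axis=1).mean())

def contact_order(ca_coords, cutoff=8.0, min_separation=1):
    """Average sequence separation of contacting residue pairs divided by chain
    length; cutoff and separation threshold here are illustrative assumptions."""
    N = ca_coords.shape[0]
    D = np.linalg.norm(ca_coords[:, None] - ca_coords[None, :], axis=-1)
    i, j = np.triu_indices(N, k=min_separation)
    contacts = D[i, j] < cutoff
    if not contacts.any():
        return 0.0
    return (j - i)[contacts].mean() / N

# Length-normalized versions as listed in the table: CO / N**-0.3 and Rg / N**0.4.
coords = np.cumsum(np.random.randn(150, 3) * 2.0, axis=0)
N = coords.shape[0]
co_norm = contact_order(coords) / N ** -0.3
rg_norm = radius_of_gyration(coords) / N ** 0.4
```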
structure space [142]. These tertiary motifs, or TERMs, consist of a central residue, its backbone-
contiguous neighbors, neighboring residues capable of contacting the central residue, and their
backbone-contiguous neighbors [142, 143]. Depending on how many contacting residues are
combined into the motif, TERMs can be distinguished as self, pair, triple, or higher-order, corre-
sponding to having zero, one, two, or more contacting neighbors (Supplementary Figure 11e, top),
respectively. To compare the local geometry of Chroma-generated backbones with that of native
structures, we randomly sub-sampled self, pair, triple, and full TERMs (i.e., TERMs containing
all contacting residues for a given central residue) within Chroma backbones and identified the
closest neighbor (by backbone RMSD) to each within the “search database”—i.e., the training set
used for Chroma. We performed a similar analysis on a set of native proteins not contained within
the search database–i.e., the test set used for Chroma. Although the test and training sets had been
[Supplementary Figure 13 panels a–b: UMAP co-embedding (axes UMAP 1 / UMAP 2) of Chroma samples and PDB test-set structures colored by CATH novelty, with insets highlighting Helix, Strand, Big, and PDB populations and twelve numbered high-novelty examples.]
Supplementary Figure 13: Unconditional backbone samples span natural protein space while
also frequently demonstrating high novelty. a, We co-embedded ≈ 50,000 samples from ChromaBackbone v1 along with a small set of ≈ 500 samples from our PDB test set us-
ing UMAP [137] on 31 global fold descriptors derived from knot theory [138, 139]. We visualize
in the largest embedding plot all of these points colored by our length-adjusted CATH novelty met-
ric, which estimates the normalized number of CATH domains needed to achieve a greedy cover of at least 80% of residues at TM > 0.5. We use this score because it continues to grade the novelty
of longer proteins which almost all have a PDB nearest-neighbor TM < 0.5. On average Chroma
has a CATH novelty score of 2.7 and PDB has a CATH novelty score of 1.9. The four embedding
insets (left) demonstrate the specific distributions of properties of interest by highlighting popula-
tions of structures that are mainly helices, strands, large (> 500 residues), or from the PDB test
set. b, We highlight twelve proteins from across the embedding space with a high novelty score
(with embedding locations numbered).
split by chain-level sequence homology, we took further care to exclude any apparent homologs of native TERMs from consideration as matches. To this end, we compared the local 31-amino-acid sequence windows around each TERM segment and its corresponding match, and any pairing reaching 60% or more sequence identity was not allowed to participate in a match.
Supplementary Figure 11e shows the distribution of closest-neighbor RMSDs for TERMs de-
rived from both native and Chroma-sampled backbones that were generated at inverse temper-
atures λ0 = 10 and λ0 = 1. The distributions of nearest-neighbor RMSD were very close for
low-temperature samples from Chroma and native proteins, indicating that Chroma geometries
are valid and likely to be as designable as native proteins, including complex motifs with four
or five disjoint fragments (see Supplementary Figure 11e, bottom panel). Because native amino-
acid choices are driven by these local geometries [144], and adherence to TERM statistics has
been previously shown to correlate with structural model accuracy and success in de-novo design
[143, 144], this argues for the general designability of Chroma-generated backbones in a model-
independent manner. Notably, the samples from Chroma at its natural temperature (i.e., λ0 = 1)
still utilize precedented low-order TERMs, while their geometries do begin to depart from native
for higher-order motifs.
Supplementary Figure 14: ChromaBackbone v0 and v1 refolding TM-scores across length, sec-
ondary structure, and novelty. TM-scores between Chroma-generated structures and structures
predicted for the corresponding designed sequence using AlphaFold, ESMFold, and OmegaFold,
across length, helical content, and novelty. A maximum of 2000 points per model and bin is
shown. Due to hardware limitations, ESMFold predictions were restricted to proteins shorter than
848 amino acids and OmegaFold predictions were restricted to proteins shorter than 618 amino
acids.
Supplementary Figure 15: ChromaDesign and ProteinMPNN have comparable sequence re-
covery on natural proteins. We plot the median and interquartile ranges of per-protein sequence
recoveries when evaluated on the Chroma test set (left) and an intersection of the Chroma and
ProteinMPNN test sets (right).
regions. We compared performance to ProteinMPNN [116] using the 002 checkpoint at a temper-
ature of 0.01, as well as the 020 checkpoint at a temperature of 0.1. Considering that a substantial
portion of Chroma’s test set was incorporated into ProteinMPNN’s training set, performance was
assessed on the overlapping entries of both test sets. A summary of the performances is shown
in Supplementary Figure 15. Chroma designs and ProteinMPNN 002 exhibited comparable per-
formance across all regions and subsets, while ProteinMPNN 020 tended to have lower sequence
recovery. Neither of the complexity penalty methods appeared to have a meaningful impact on the
performance of ChromaDesign.
[Figure panels: refolding success (AlphaFold, ESMFold, OmegaFold) versus % residues masked for substructure-conditioned designs (e.g. PDB 6qaz); symmetry-conditioned assembly experiments: protocol a over symmetry groups C2, C3, C4, D2, D3, D4, T, O, I (36,000 sequences) and protocol b over O and I (10,000 sequences), evaluated with AlphaFold-Multimer.]
0.8, when restricting our analysis to the task of masking out 60% of the template backbone, we see
non-zero hit rate on half of the sampled PDBs for each of the three structure prediction methods.
We see that refolding becomes less likely as more of the template is masked (and hence more of
the monomer backbone is infilled).
setting might result from excluding interface interactions from neighboring chains, this does not
necessarily imply that the designed proteins will not assemble. Based on the results obtained from
the aforementioned protocol, we observed a considerable number of successfully refolded designs
across the selected symmetry groups and sequence lengths (see Supplementary Figure 17). The
probability of success in refolding, defined as a TM-score greater than 0.5, was found to be higher
for assemblies with a smaller number of subunits and shorter chain lengths. We have included
selected refolded structures in Supplementary Figure 17.
Furthermore, we conducted a separate set of refolding validation experiments that focused specif-
ically on assemblies with O and I symmetries, generating 500 backbones instead of 50. We ob-
served a notable number of successful trimer refoldings. However, how trimer refolding correlates
with assembly formation success rate requires further investigation.
a 36 shapes (Latin alphabet and numerals) c AlphaFold d Chroma samples ESMFold models
1.00
x2 methods
x3 sizes {500, 750,1000} 0.75
TM
0.50
21600 sequences
0.75
TM
0.50
0.25
b OmegaFold
1.00
0.75
TM
0.50
0.25
0.00
0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
under ProClass) were selected for design and refolding. Sequence design was performed 100 times for each backbone, and each of the resulting 30,000 sequences was folded by three folding models: AlphaFold, OmegaFold, and ESMFold. To evaluate whether refolding was successful for each model, a TM score was calculated against the generated backbone. If that TM score was greater than 0.5, it was considered a successful refolding event. Overall success of a backbone was evaluated by choosing the best TM score out of the 100 designs.
Choosing hyperparameters that allow for successful optimization of the backbones requires tuning. Two key hyperparameters are the guidance scale and the max norm, and both need to be tuned to achieve high-quality samples. The guidance scale rescales the gradient of the conditioner during sampling, while the max norm provides a maximum gradient norm above which the gradient is clipped. If the guidance scale is too low, the sample looks like an unconditioned sample. If the guidance scale is too high, it breaks local backbone bond-length constraints and the sampled protein can explode. Max norm choices that are too low result in gradient clipping in a way that prevents optimization. If the max norm is chosen to be too high, random outlier gradients can cause the sampling trajectory to fail, as occasionally the gradients explode and destroy the sample. This random gradient explosion does not occur for all conditional sampling problems, and so it is evaluated on a case-by-case basis.
The conditional parameters depend on various other sampling hyperparameters, so must be de-
termined for each sampling problem separately. The best choice for guidance scale tends to vary
based on inverse temperature, Langevin factor, and the number of steps. Practically, the guidance
[Supplementary Figure 19 panels: three conditioned folds (CATH 2.40.155 beta barrel, 2.60.40 IG fold, and the Rossmann fold), with 100 sequence designs per backbone for 30,000 sequences total; columns show a PDB exemplar of each fold and a conditional sample.]
Supplementary Figure 19: Class-conditioned samples can refold in silico. a, Conditional gener-
ation protocol diagram. Three canonical folds were chosen to conditionally design the beta barrel,
Rossmann fold, and IG fold. 2000 conditional samples were generated for each fold. The best
100 of each fold were selected for downstream refolding analysis. There is close agreement in TM
score for all folding algorithms for these samples. b, Each backbone was designed 100 times and
refolded under each folding model. Almost all of the structures refold with a TM score greater
than 0.5 in the best of 100 sequence designs. In the bottom plot, conditioned backbones have a range of probabilities of being the correct fold. In general, conditioning on CAT class requires many samples before high-quality examples are generated; some are easier to optimize than others. c, A
selection of the best examples for each fold in conditional design. The middle column illustrates
an example of the same class from the PDB for reference. The right column is an exemplar protein
generated from Chroma. In white is the refolded structure, in rainbow is the sampled backbone.
scale and max norm are found by a small sampling hyperparameter search. A small number of
seed-controlled samples are run at different choices of guidance scale and max norm (e.g. 0.1,
1, 10, 100). Then the best performing values are chosen for a production run. We chose max norm
to be 10.0, and use a guidance scale of 0.1 for the refolding experiments described in Supplemen-
tary Figure 19(a).
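A minimal sketch of these two knobs as they might be applied to a conditioner gradient during sampling is shown below; the exact placement of the operations in the sampler is an assumption, and only the example values (guidance scale 0.1, max norm 10.0) are taken from the text.

```python
import numpy as np

def conditioner_gradient(grad, guidance_scale=0.1, max_norm=10.0):
    """Rescale a conditioner gradient by guidance_scale and clip its global norm
    at max_norm; a sketch of the two hyperparameters discussed above."""
    grad = guidance_scale * grad
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# Example values from the refolding experiments: guidance scale 0.1, max norm 10.0.
raw_grad = np.random.randn(500, 4, 3)      # e.g. gradient w.r.t. backbone atoms
step_grad = conditioner_gradient(raw_grad, guidance_scale=0.1, max_norm=10.0)
```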
Refolding successes were observed across all three conditioned folds. Refolding had high agree-
ment across models, as seen in Supplementary Figure 19 (left bottom). Further, in Supplementary Figure 19 (middle), about 40% of the designs meet the threshold for refolding success. For a design to be considered successful, it also has to have a high p(fold). Qualitatively, what counts as an acceptable cutoff can vary; however, the best samples tend to be close to 1. In Supplementary Figure 19 (b
bottom), the top 100 backbones are seen to vary substantially in the best optimization performance
achieved. Some CAT annotations are very difficult to optimize, whereas others are relatively easy
and good samples can be found quickly. In all three cases, structures that refolded and matched the
desired fold were found. These examples can be seen in Supplementary Figure 19(c).
[Supplementary Figure 20 panels: 4 captions × 3 guidance scales = 12 caption tasks, 50 samples each (600 backbones, 6,000 sequences), refolded with AlphaFold, ESMFold, and OmegaFold; canonical caption examples include "Crystal structure of Fab" (single chain), "Crystal structure of SH2 domain", and "De novo designed Rossmann fold protein", with TM-scores of 0.74, 0.88, 0.54, and 0.94 and perplexities of 1.67, 1.92, 2.43, and 4.07 across the shown examples.]
Supplementary Figure 21: The agreement of predicted structures with designs (TM-score) is
correlated to model confidence (pLDDT). We evaluated ESMFold, AlphaFold, and OmegaFold
models on the 35,000 unconditional samples generated in the ablation study, which represent model
behaviors and biases across several different configurations. We see across these data that struc-
ture predictions with high correspondence between Chroma models and refolded predictions are
also generally higher-confidence predictions, suggesting a general self-consistency between the sequence-structure relationships being modeled across these different systems.
We find examples of designability for structures conditioned on each caption, although the suc-
cess rate varies considerably. The single chain structures refold with larger TM scores to their
Chroma predictions than complexes (antibody example). Nevertheless, for all captions we ob-
serve instances where our design protocol is successful, as measured by refolding with a TM score
above 0.5. We also show some comparisons of Chroma and successfully refolded structures in the
right panel of Supplementary Figure 20, alongside canonical examples of each caption from the
PDB.
Figure 22. These ablations were evaluated through the lenses of likelihood and sample quality to
holistically evaluate their effects on model performance.
Model component: Covariance. We consider two covariance models for defining the diffusion
process, which are visualized in Supplementary Figure 3 and described in Appendix D:
• Covariance variant: ResidueGas. In this model, the coordinates of each Cα are indepen-
dently and identically normally distributed with standard deviation 10Å (along each x, y, z
dimension). The other coordinates of the N, C, and O atoms within the residue are then dis-
tributed normally around Cα with 1Å conditional standard deviation. This can be considered
an off-frame relaxation of frame diffusion models [67] or an all-backbone-atom extension of IID Cα diffusion models [91].
• Covariance variant: Globular. This covariance model captures spatial proximity con-
straints in the form of correlations within atoms in a residue and between residues in a chain,
while also respecting global length-dependent Rg scaling effects that arise from polymer
collapse. This version includes Complex Rg scaling.
Model component: Graph We consider two kinds of graph structure, which are visualized in
Supplementary Figure 4 and described in Appendix E:
• Graph variant: k-NN. This used a graph topology based on the 60 nearest neighbors in the
current structure.
• Graph variant: Random Graph. This used a hybrid graph topology for which 20 of the
edges are the nearest neighbors in the current structure and 40 of the edges are sampled
according to the inverse cubic attachment model.
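The sketch below illustrates one way such a hybrid topology could be constructed; the assumption that the "inverse cubic attachment" probability decays as the cube of the current spatial distance is ours, based only on the description above.

```python
import numpy as np

def hybrid_graph(coords, k_nn=20, k_random=40, rng=None):
    """Per residue, take k_nn nearest neighbors plus k_random extra edges drawn
    with probability proportional to an assumed inverse-cubic function of the
    current distance; an illustrative sketch of the hybrid topology."""
    rng = np.random.default_rng() if rng is None else rng
    N = coords.shape[0]
    assert N > k_nn + k_random + 1, "sketch assumes enough residues for both edge sets"
    D = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    neighbors = []
    for i in range(N):
        order = np.argsort(D[i])
        knn = order[1:k_nn + 1]                       # nearest neighbors (skip self)
        remaining = order[k_nn + 1:]
        weights = 1.0 / (D[i, remaining] ** 3 + 1e-6)
        weights /= weights.sum()
        extra = rng.choice(remaining, size=k_random, replace=False, p=weights)
        neighbors.append(np.concatenate([knn, extra]))
    return np.stack(neighbors)                        # (N, k_nn + k_random)
```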
Model component: Output. We consider three kinds of output parameterization varying from
the consensus update visualized in Supplementary Figure 5 and described in Appendix F:
• Output variant: PairFrameA. This uses the inter-residue geometry parameterization with
three equilibration steps and one uncertainty parameter per i, j that is coupled to both transla-
tion and rotation. The predicted transforms Ti j are parameterized as linear projections from
the final edge embeddings, and the coordinates are post-processed with time-dependent scal-
ing method A.
• Output variant: PairFrameB. This uses the inter-residue geometry parameterization with
ten equilibration steps and two uncertainty parameters per i, j, one for translation and one
for rotation. The predicted transforms Ti j are parameterized as residual updates to the trans-
forms of the current structure based on the final edge embeddings and the coordinates are
post-processed with time-dependent scaling method B.
[Supplementary Figure 22 panels: ELBO training curves for the ablated configurations (Globular/Random/LocalFrame, Globular/Random/PairFrameB, v1 = Globular/Random/PairFrameB + AuxLoss, ResidueGas/Random/PairFrameA, ResidueGas/kNN/PairFrameA), grouped by output layers, loss functions (x-space vs. chain-whitened), and graph topologies (Random Graph vs. k-NN); sample-quality evaluation over 7 model configurations × 5 consecutive checkpoints (epochs 1100–1104) × 1,000 samples of 100–500 residues (35,000 backbones, 35,000 designed sequences), scored by AlphaFold refolding at TM > 0.5, showing large inter-checkpoint secondary-structure fluctuations and refolding driven especially by α-helicity (binned by β-strand fraction).]
Supplementary Figure 22: Ablation study demonstrates utility of novel model components
as measured by likelihood and sample quality. We trained seven models composing different
configurations of proposed components and baselines, modifying the covariance model (Supple-
mentary Appendix B), graph neural network topology (Supplementary Appendix E), atomic output
parameterization (Supplementary Appendix F), and losses (Supplementary Appendix B) (top left).
We indicate the two configurations corresponding to ChromaBackbone v0 and ChromaBackbone
v1, where v0 has one additional change of using the globular monomer version of the globular
covariance scaling. Training for ∼500,000 steps on 8 V100 GPUs with a batch size of ∼32,000
residues per step suggests that there is little generalization gap between the training and validation
sets (top middle, windowed averaged training curves across 100 epochs). From the perspective of
likelihood (top right), globular covariance is favorable to residue gas covariance (Supplementary
Appendix D), inter-residue geometry prediction layer is favorable to local frame updates if tuned
appropriately (Supplementary Appendix F), and auxiliary losses incur a cost to ELBO (Supple-
mentary Appendix B). When we applied these trained models to generate unconditional samples
(bottom left), we observed large fluctuations in secondary structure composition between adjacent
checkpoints (bottom, middle left). When aggregating across these checkpoints, we observed that
refolding by AlphaFold was highly dependent on the fraction of α-helices in the sampled structure
(bottom, middle right). In spite of this, the refolding rate of samples based on a model with ran-
dom graph topologies was higher than those of a model based on k-NN topologies (Supplementary
Appendix E) and losses weighted in x-space induced better refolding than losses weighted only in
chain-whitened space (bottom right).
• Output variant: LocalFrame. This uses a local frame-transform update to the coordinates that is parameterized based on the final node embeddings. The coordinates are post-processed with time-dependent scaling.
Training For each of the model configurations in the ablation study, we trained on batch sizes of
32,000 residues by leveraging data parallelism across 8 V100 GPUs for ∼ 500, 000 steps, which
is approximately ∼ 1500 epochs and ∼ 28 days of wall clock time. Models were trained with
the Adam optimizer [132] and a learning rate of $2 \times 10^{-4}$ with an initial linear warm-up phase
of 10,000 steps. After each epoch, we evaluated one-sample estimates of ELBO and other losses
across the full training and evaluation sets.
Second, we see similar but modest improvements in likelihood across the three different output
layer parameterizations, where PairFrameB is favorable to LocalFrame which is favorable to
PairFrameA. Thus we see evidence suggestive of favorable performance for our inter-residue
geometry prediction over purely local prediction, though this can depend on tuning and is poten-
tially confounded by the fact that PairFrameB also changes the output scaling at the same time.
Optimizing the output layer will likely warrant further investigation.
Finally, we observe that adding auxiliary non-ELBO losses to otherwise purely ELBO-based train-
ing reduces ELBO performance.
Sample quality analysis. To evaluate each of the model configurations from the point of view
of sample quality, we performed a large scale sample-and-refolding analysis. For each of the
seven model configurations, we took five checkpoints from consecutive epochs around epoch 1100,
sampled 1000 backbones per checkpoint with lengths uniformly distributed between 100 and 500
amino acids. We note that this epoch corresponds to ∼ 360, 000 training steps, which is approx-
imately one quarter of the total training time of the ChromaBackbone v0.4998 that was used in
our broader refolding experiments. We expect that the total refolding rates reported in this section
may be generally lower than our production model.
We observe large epoch-to-epoch fluctuations in secondary structure biases of the samples (Sup-
plementary Figure 22, bottom center left). This is reminiscent of behaviors previously observed
in other diffusion models [133], in which a batch of images may be all tinted one color, then an-
other, even when the underlying denoising function is only changing slightly. These macroscopic
fluctuations arising from microscopic changes may be intuitively understood as a tendency of the
sampling process to amplify small per-time-step discrepancies. This phenomenon has previously
been addressed by exponential moving averaging (EMA) of the checkpoints [62, 133], and we
anticipate this is a worthwhile direction for future work.
Nevertheless, when we aggregate across checkpoints, we observe a few trends. All of the models
trained with denoising losses that measure squared error in Cartesian space, which includes both
the auxiliary loss models and the residue gas models, tend to have higher refolding rates than the models which were trained only with chain-whitened losses. This aligns with classical intuition on proteins in the sense that chain-whitened coordinates emphasize local geometries in proteins while Cartesian coordinates much more directly measure the absolute positioning of coordinates in
space that underlie contacts and interatomic distances. We also observe that the random graph neu-
ral networks have considerably higher folding rates than a purely k-NN based model, and that the
best performing model overall combined our new diffusion and output parameterizations together
with several new auxiliary loss functions. Thus, as has been a common lesson in the diffusion mod-
eling literature, non-likelihood based losses or denoising weightings can be important to driving
sample quality measures [55].
Low-temperature sampling remains essential All of the model configurations in this ablation
study can generate samples which successfully refold and, in that sense, none of these changes
qualitatively break model performance. We emphasize that the same cannot be said about low-
temperature sampling, as all of these experiments were sampled with λ = 10. As shown in Sup-
plementary Figure 2, low temperature sampling is important to generate high likelihood samples
This formulation can treat arbitrary combinations of conditions if we model the joint event y as
factorizing into independent sub-events y1 , . . . , yM . Then we have the posterior score
$$\nabla_x \log p_t(x \mid y_1, \ldots, y_M) = \nabla_x \log p_t(x) + \sum_{i=1}^{M} \nabla_x \log p_t(y_i \mid x).$$
These posterior scores can directly substitute the usual score function in the posterior SDE and
ODE described in Appendix B.
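A minimal PyTorch sketch of composing the posterior score from an unconditional score function and a list of classifier log-likelihood terms, with the classifier gradients obtained by automatic differentiation; all names are hypothetical and the toy functions are for illustration only.

```python
import torch

def composed_posterior_score(x, prior_score_fn, classifier_log_probs):
    """Sum the unconditional score with the gradient of each classifier term
    log p_t(y_i | x), computed by autograd; a sketch of the composition above."""
    score = prior_score_fn(x)
    for log_prob_fn in classifier_log_probs:
        x_req = x.detach().requires_grad_(True)
        log_p = log_prob_fn(x_req)
        score = score + torch.autograd.grad(log_p.sum(), x_req)[0]
    return score

# Toy example: a Gaussian prior score plus one quadratic "classifier" term.
prior = lambda x: -x
classifiers = [lambda x: -0.5 * ((x - 1.0) ** 2).sum(-1)]
x = torch.zeros(10, 3)
s = composed_posterior_score(x, prior, classifiers)
```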
Joint programmable sampling of sequence and structure While we focus on classifier con-
ditioning of backbone structures throughout this work, it is also straightforward to extend the
above picture to include joint gradient-based sequence and structure sampling by leveraging new
discrete sampling methods based on locally gradient-adjusted MCMC proposals [125, 152]. It
is an important distinction that joint sequence-structure sampling at inference time does not re-
quire joint sequence-structure diffusion at training time; all we require for joint sampling is ac-
cess to a time-dependent joint likelihood pt (x, s). Our current Chroma model satisfies this as
pt (xt , st ) = pt (xt )pt (s|xt ), which may be leveraged as
$$\nabla_{x,s} \log p_t(x, s \mid y_1, \ldots, y_M) = \nabla_{x,s} \log p_t(x, s) + \sum_{i=1}^{M} \nabla_{x,s} \log p_t(y_i \mid x, s).$$
Design specifications as energy functions The Bayesian picture, as well as classical protein
design approaches [151], formulate protein design problems in terms of energy functions which
express the (unnormalized) negative log-posterior probability density of a protein system given a
set of conditions. We can similarly cast posterior diffusions in terms of a time-dependent total
Supplementary Figure 23: Conditioners parameterize protein design problems, facilitate au-
tomatic sampling algorithms, and are composable. (Left) Conditioners are functions which
map an unconstrained system consisting of an initial state x̃t and energy U0 = 0 to a transformed
state xt = f (x̃t ,U0 ;t) and an updated energy U f (x̃t ,U0 ;t). Gradient-based sampling with respect
to unconstrained x̃t on the Conditioner-adjusted Diffusion energy (left) will induce constrained
dynamics on xt . Many kinds of restraints and constraints can be realized in this framework (right),
and because of matched input-output types, simple Conditioners can be composed into complex
Conditioners to jointly satisfy multiple design objectives within a complex protein design problem.
energy as
where the gradient of the total energy with respect to x will yield the negative posterior score
function7 .
Constraints via linear transformations How can we encode constraints such as symmetry and
substructure? Many constraints, including these, can be enforced via affine transformation func-
tions of the form f (x̃) = Ax̃ + b which map points in unconstrained Euclidean space x̃ ∈ RN to
points in a constrained space f (x̃) ∈ Ω ⊆ RM . We can then run Langevin dynamics (Supple-
mentary Appendix B) with the gradient of constrained energy U( f (x̃t ); y,t) with respect to the
unconstrained coordinate x̃t as
$$d\tilde{x} = -\frac{g_t^2 \psi}{2} \lambda_t R R^{\intercal}\, \nabla_{\tilde{x}} U(f(\tilde{x}_t); y, t)\, dt + g_t \sqrt{\psi}\, R\, d\bar{w}.$$
7 We may choose to stop gradient flow through the denoiser model, which saves compute cost and recovers the
behavior of the score functions from training time. This will lead to a non-conservative vector field (as is standard
practice for diffusion models), but allowing gradients to flow through the denoiser restores energy conservation [153].
$$
\begin{aligned}
dx &= A\, d\tilde{x} \\
&= A\left(-\frac{\beta_t \psi}{2} \lambda_t R R^{\intercal}\, \nabla_{\tilde{x}} U(f(\tilde{x}_t); y, t)\, dt + \sqrt{\beta_t \psi}\, R\, d\bar{w}\right) \\
&= -\frac{\beta_t \psi}{2} \lambda_t\, A R R^{\intercal}\, \nabla_{\tilde{x}} U(f(\tilde{x}_t); y, t)\, dt + \sqrt{\beta_t \psi}\, A R\, d\bar{w} \\
&= -\frac{\beta_t \psi}{2} \lambda_t\, A R R^{\intercal} A^{\intercal}\, \nabla_{x} U(x_t; y, t)\, dt + \sqrt{\beta_t \psi}\, A R\, d\bar{w},
\end{aligned}
\tag{3}
$$
which is precisely Langevin dynamics with a modified mass matrix $(A R R^{\intercal} A^{\intercal})^{-1}$ [154, 155], which will sample from the constrained domain $\Omega$.
Nonlinear constraints: Exact sampling Many constraint sets cannot be expressed as the images
of affine transformations [156]. One such example relevant to protein design is box constraints,
where some subsets of atoms may be confined to contiguous finite regions of space. To enforce
these constraints while still sampling from the intended energy function, we can simply design
a nonlinear function f that implements the constraint and then adjust the total energy for sam-
pling with the log-volume adjustment factor given by the multivariate change of variables formula, $\log\left|\det \frac{\partial f}{\partial \tilde{x}}\right|$. This works so long as $f$ is continuously differentiable and bijective onto the constrained space and the constrained space has the same dimension as the domain of $f$. It is further
possible to extend this to also consider non-dimension-preserving transforms, e.g. with certain
embedded Riemannian manifolds, for which we refer the reader to [157].
This transformed MCMC approach may be useful even when the nonlinear transformation function
is fully unconstrained, for example, if it is a learned normalizing flow model of a particular class
of structures of interest, in which case it will induce a dynamics similar to latent diffusion models
[75].
Nonlinear constraints: Beyond If we are willing to sacrifice exact sampling from the true en-
ergy function, we may also discard the log-determinant adjustment and absorb the bias induced by
running Langevin dynamics in a transformed space. These dynamics will still be exactly confined
to the range of f , but may potentially be biased by change-of-volume effects as well as non-
bijectivity. However, this opens up a large number of possibilities which are simple to implement
by the user, as they only require a differentiable function f that implements the desired constraints
which need not have an inverse and which can be differentiated by automatic differentiation. We
have found this latter paradigm useful, as one can quickly realize more complex functionalities, from restricting the sampling of subsets of a system to rigid-body motions, to satisfying complex constraints such as optimal transport via differentiable inner optimization, and beyond.
8 The first step can be justified by Ito’s lemma.
M.3 Conditioners
The previously described restraints and constraints for Langevin dynamics share a common form
of implementation: they modify the system coordinates x and/or the total energy U. This suggests
a natural “building block” for a protein programming framework: transformation functions which
input and output system states (x,U).
We define a conditioner as a function F : RN × R → Ω ⊆ RM × R which maps state-energy pairs
in unconstrained input space RN × R to potentially constrained state-energy pairs in Ω ⊆ RM × R.
For ease of notation, we further refer to Conditioners component-wise F = ( f ,U f ) in terms of a
state update function f : RN × R → Ω f ⊆ RM and an energy update function U f : RN × R →
ΩU ⊆ R.
where the gradient ∇x̃U(x̃t ;U f , f ,t) for sampling is computed with respect to the unconstrained
coordinates x̃t . These gradients and dynamics can be computed efficiently even for complex com-
posed conditioners by leveraging modern automatic differentiation frameworks, as shown in Sup-
plementary Figure 23.
• Automated Sampling. Any gradient-based sampling algorithm may be used in concert with
the Conditioner-adjusted energy and an annealing schedule on the diffusion time t.
9 Composition of blocks will require that their inputs and outputs can be shape-compatible, just as in the case of
composing differentiable blocks in neural networks. For example, two substructure constraints by definition must be
expressed in a way that can be jointly realized with one set of protein chain lengths.
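The following is a minimal sketch of this interface and of composing conditioners; the names and the two example conditioners (a constraint-type symmetry tiling and a restraint-type energy penalty) are illustrative only, not the library's API.

```python
from typing import Callable, Tuple
import torch

# A conditioner maps a (state, energy, time) triple to a transformed
# (state, energy) pair; a sketch of the building block described above.
Conditioner = Callable[[torch.Tensor, torch.Tensor, float],
                       Tuple[torch.Tensor, torch.Tensor]]

def compose(*conditioners: Conditioner) -> Conditioner:
    """Chain conditioners so that the output state/energy of one becomes the
    input of the next (their shapes must be compatible, as noted in the text)."""
    def composed(x, U, t):
        for c in conditioners:
            x, U = c(x, U, t)
        return x, U
    return composed

def symmetry_conditioner(rotations: torch.Tensor) -> Conditioner:
    """Example constraint-type conditioner: tile an asymmetric unit (N, 3) by a
    set of rotation matrices (|G|, 3, 3); the energy is passed through unchanged."""
    def apply(x, U, t):
        return torch.einsum('gab,nb->gna', rotations, x).reshape(-1, 3), U
    return apply

def restraint_conditioner(energy_fn) -> Conditioner:
    """Example restraint-type conditioner: leave coordinates unchanged and add a
    differentiable penalty to the running energy."""
    def apply(x, U, t):
        return x, U + energy_fn(x, t)
    return apply
```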
Conditioners for sequence and structure As noted in the previous section, the Conditioner
framework is also straightforwardly applied to joint sampling of sequence and structure, where we
define the joint energy
$$U(x_t; y, t) = \underbrace{\tfrac{1}{2}\left\lVert \sigma_t^{-1} R^{-1}\left(f(\tilde{x}_t, U_0; t) - \alpha_t \hat{x}_t(x_t, t)\right)\right\rVert_2^2}_{\text{Diffusion Energy}} \; \underbrace{-\, \log p\!\left(f_s(\tilde{s}_t) \mid f_x(\tilde{x}_t), t\right)}_{\text{Sequence Likelihood}} \; + \; \underbrace{U_f(\tilde{x}_t, \tilde{s}_t, U_0, t)}_{\text{Conditioner Energy}},$$
where gradient and dynamics are computed in unconstrained space x̃t , s̃t and we can use approaches
such as Discrete Langevin sampling [125, 152] to sample from sequence space while leveraging
gradients for building locally-informed proposals. Sequence and structure gradients can be com-
puted in one pass via automatic differentiation frameworks.
Thus, we can perform joint sequence and structure sampling conditioned on a target objective with-
out needing to train a joint diffusion on sequence and structure at the same time; all we require is a
valid joint posterior for sequence and structure conditioned on function which may be realized, for
example, with a conditional language model for sequence given structure together with a diffusion
model for the backbone structure joint marginal.
composed energy functions was explored in [84], which also presents useful tools for negation and
other primitive composition operations.
Our framework introduces a versatile conditioning mechanism that accommodates additional modal-
ities such as natural languages [75] and 3D densities [159], allowing users to designate either
parametric or non-parametric classifiers as restraint energies. Moreover, it facilitates a compos-
able structure for sampling within constrained domains and manifolds, bearing a resemblance to
MCMC methods for structured spaces. In our methodology, we initiate sampling from an un-
constrained space, followed by mapping these samples onto the corresponding constrained space.
This mapping procedure echoes the ideas of Mirror Langevin Dynamics [160, 161], where the
constrained transformation operates as a "mirror map". For sampling linear subspaces, our ap-
proach recovers preconditioned Langevin Dynamics [155], achieved via mass matrix conditioning
[154].
Supplementary Figure 24: The globular covariance model admits analytic conditioning. (Left)
Heatmaps illustrating comparison of unconditional (top) globular covariance matrix RR⊺ and con-
ditioned (bottom) covariance matrix R̄R̄⊺ . (Middle) X-coordinate plotted against residue index of
samples drawn from unconditional (top) and conditional (bottom) prior. (Right) Initial samples
X0 and noised samples drawn from p(X1 |X0 ) for the unconditional (top) and conditional (bottom)
priors. Conditioned-on structural residues are drawn in gray and correspond to the same residues
that are conditioned in the covariance matrix and line plot.
states of a system are sampled by simulating the dynamics of an auxiliary system with a modified
mass matrix. If the mass matrix is chosen appropriately, the original system’s configuration space
can be sampled more efficiently.
The method works by integrating a modified Annealed Langevin Dynamics SDE (see appendix C)
backwards in time to sample from p0 (x|x1 ), where the dynamics are modified using a mass matrix
that assigns higher mass to particles closer (in chain distance) to known coordinates and assigning
infinite mass to known atoms. Samples drawn using this method satisfy y with probability 1.
Let $S, M \subset \{1, \ldots, N\}$ denote the atoms comprising the unknown scaffold and known motif, respectively, throughout this section.
[Figure panels: substructure-conditioned samples with 20%, 40%, 60%, and 80% of the template masked for PDB entries 1a8q, 2e0q, 3bdi, 5o0t, 5sv5, 5xb0, 6bde, and 6qaz.]
error introduced by the replacement method arises from imputing noised motifs that are highly
unlikely given the corresponding noised scaffold.
N.2 Approach
It is known that for $x \sim \mathcal{N}(\mu, \Sigma)$, if we partition the coordinates as above into the subsets $M$, $S$ and follow the Gaussian conditioning formula to write
$$x = \begin{pmatrix} x^S \\ x^M \end{pmatrix}, \quad \mu = \begin{pmatrix} \mu^S \\ \mu^M \end{pmatrix}, \quad \Sigma = \begin{pmatrix} \Sigma_{SS} & \Sigma_{SM} \\ \Sigma_{MS} & \Sigma_{MM} \end{pmatrix},$$
such that $(x^S \mid x^M = a) \sim \mathcal{N}(\bar{\mu}, \bar{\Sigma})$ where
$$\bar{\mu} = \mu^S + \Sigma_{SM} \Sigma_{MM}^{-1}\left(a - \mu^M\right)$$
and
$$\bar{\Sigma} = \Sigma_{SS} - \Sigma_{SM} \Sigma_{MM}^{-1} \Sigma_{MS},$$
where inverse matrices are understood to denote pseudo-inverses. We also compute the Cholesky factorization $\bar{R}\bar{R}^{\intercal} = \bar{\Sigma}$. To draw an approximate conditional sample from $p(x_0^S \mid x_0^M = a)$ we proceed as follows: we sample $x_1^S \sim \mathcal{N}(\bar{\mu}, \bar{\Sigma})$ from the conditional prior, set $x_1^M = a$, and integrate a modified Annealed Langevin Dynamics SDE (see section C.2)
$$dx = -\frac{g_t^2 \psi}{2} \bar{R}\bar{R}^{\intercal}\, \nabla_x \log p_t(x)\, \lambda_0\, dt + g_t \sqrt{\psi}\, \bar{R}\, d\bar{w}$$
backwards in time, where the matrices R̄, R̄⊺ are broadcast to the correct size with the conditioned
on rows and columns filled by zeroes. Supplementary Figure 24 illustrates R̄, R̄⊺ as well as samples
from a conditional prior.
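A NumPy sketch of this conditioning computation over a flattened coordinate vector is shown below; the jitter term and toy covariance are for illustration only.

```python
import numpy as np

def clamp_gaussian_prior(mu, Sigma, motif_idx, motif_coords):
    """Given a prior N(mu, Sigma), clamp the motif entries to motif_coords and
    return the conditional mean, covariance, and its Cholesky factor R_bar;
    a sketch of the Gaussian conditioning formulas above."""
    n = mu.shape[0]
    S = np.setdiff1d(np.arange(n), motif_idx)          # scaffold (unknown) indices
    Sigma_SS = Sigma[np.ix_(S, S)]
    Sigma_SM = Sigma[np.ix_(S, motif_idx)]
    Sigma_MM = Sigma[np.ix_(motif_idx, motif_idx)]
    Sigma_MM_inv = np.linalg.pinv(Sigma_MM)            # pseudo-inverse, as in the text
    mu_bar = mu[S] + Sigma_SM @ Sigma_MM_inv @ (motif_coords - mu[motif_idx])
    Sigma_bar = Sigma_SS - Sigma_SM @ Sigma_MM_inv @ Sigma_SM.T
    R_bar = np.linalg.cholesky(Sigma_bar + 1e-8 * np.eye(S.size))
    return S, mu_bar, Sigma_bar, R_bar

# Example with a toy chain-like covariance standing in for R R^T.
n = 50
A = np.tril(np.ones((n, n)))
Sigma = A @ A.T / n
mu = np.zeros(n)
motif_idx = np.arange(10, 20)
S, mu_bar, Sigma_bar, R_bar = clamp_gaussian_prior(mu, Sigma, motif_idx, np.ones(10))
sample = mu_bar + R_bar @ np.random.randn(S.size)      # draw from the conditional prior
```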
Additionally, we have found it helpful to incorporate a reconstruction-guidance based gradient
term as in [162]. We have found that, while this can introduce some instability to the sampling, it
improves sample quality. To do so, in our conditioner formulation we define
$$U_f(\tilde{x}_t, U, t) = U + \left\lVert \hat{x}_\theta(x_t, t)^M - x_t^M \right\rVert_2^2,$$
where
$$x_t = f(\tilde{x}_t) = A\tilde{x}_t + b = \bar{R}R^{-1}\tilde{x}_t + \bar{\mu}.$$
See section M.2 for a derivation that under this f , evolving x̃ according to the unmodified An-
nealed Langevin SDE induces dynamics on xt equivalent to the mass-modified dynamics presented
above.
O.2 Approach
The Bayesian approach to diffusion conditioning would be to build an estimate of the time-dependent likelihood $p_t(y \mid x_t)$ to classify noisy inputs. In the case of a contact classifier, we can build an analytic approximation for $p_t(D_0^{ij} < c \mid x_t)$ as follows. First, we choose a prior $p(x_0)$ that captures distance statistics in the PDB, giving rise to a tractable posterior denoising distribution $p(x_0 \mid x_t)$. With a Gaussian prior for $x_0$, for which we can use our globular covariance model, we arrive at a Gaussian posterior $p(x_0 \mid x_t)$ and can further model the posterior distances $p(D_0^{ij} \mid x_t)$ with a non-central chi-squared distribution. This allows us to compute the desired $p(D_0^{ij} < c \mid x_t)$ using the CDF of the non-central chi-squared distribution.
First, we can build a Gaussian approximation of a prior for protein chains p(x0 ) with our globular
covariance model (Appendix D.3) as
Then, according to our forward process, we have a forward transition kernel for the likelihood as
$$p(x_t \mid x_0) \sim \mathcal{N}\!\left(\alpha_t x_0,\, \sigma_t^2 R R^{\intercal}\right).$$
We can now apply Bayes' Theorem as
$$x_0 = \alpha_t x_t + \sigma_t R z,$$
From the Rg scaling analysis of the globular covariance model (Supplementary Appendix D.3) we
have
The squared inter-atomic distance is a squared 2-norm of this residual, which will therefore follow a non-central chi-squared distribution with 3 degrees of freedom as
$$\frac{\left\lVert x_0^j - x_0^i \right\rVert^2}{\sigma_t^2 \sigma_{ij}^2} = \frac{(D_0^{ij})^2}{\sigma_t^2 \sigma_{ij}^2} \sim \mathrm{NonCentralChiSquared}\!\left(\frac{\alpha_t^2 (d_t^{ij})^2}{\sigma_t^2 \sigma_{ij}^2},\; k = 3\right).$$
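A small SciPy sketch of the resulting contact-probability computation; the exact scaling of the noncentrality term is an assumption, as are the example parameter values.

```python
import numpy as np
from scipy.stats import ncx2

def contact_probability(d_t, alpha_t, sigma_t, sigma_ij, cutoff=8.0):
    """Probability that the denoised inter-residue distance D_0^ij falls below
    `cutoff`, via the CDF of a non-central chi-squared distribution with 3
    degrees of freedom; a sketch of the analytic contact classifier above."""
    scale = (sigma_t * sigma_ij) ** 2
    noncentrality = (alpha_t * d_t) ** 2 / scale
    return ncx2.cdf(cutoff ** 2 / scale, df=3, nc=noncentrality)

# Example: a pair currently ~12 A apart, midway through the diffusion.
p_contact = contact_probability(d_t=12.0, alpha_t=0.7, sigma_t=0.7, sigma_ij=5.0)
```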
Supplementary Figure 26: Motifs can occur in entirely unrelated structural contexts. a, An
example motif composed of three disjoint segments. b, PDB entry 3NXQ harbors the motif with
a backbone RMSD of 0.45 Å. c, PDB entry 3OBW harbors the motif with a backbone RMSD of
0.64 Å.
P.2 Approach
To determine whether a pre-specified motif is present within a given structure S is simple: one can, for example, find the substructure of S with the lowest optimal-superposition root-mean-squared deviation (RMSD) to the motif and ask whether this RMSD value is below a desired cutoff; this can be done using previously published algorithms [163, 164]. To enable conditional generation
based on the presence of a motif then, we employ a form of reconstruction guidance based on the
best RMSD to the motif in the present de-noised structure. Specifically, at time t we define the
best-match RMSD to the target motif with coordinates xM as
$$\rho(x_t) = \min_{\pi \in \Pi}\; \min_{T \in SE(3)} \frac{\left\lVert x^{M} - T \circ \hat{x}_\theta(x_t, t)^{M_\pi} \right\rVert_2}{\sqrt{|M_\pi|}}, \tag{4}$$
where the outer minimization is over the combinatorial space Π of alignment permutations π of
motif disjoint segments onto the current structure xt and the inner minimization is over the optimal
superposition of the motif given a specific alignment π . The actual calculation is done using a
branch-and-bound search similar to the one defined in Zhou et al. [163] rather than an explicit
minimization over permutations. With this, we then modify the energy within our conditioner
formulation (see section M) as
$$U_f(x_t, U, t) = U + \eta \log\left(1 + e^{\zeta\left[\rho(x_t) - \rho_{\max}\right]}\right),$$
where ρmax is the threshold RMSD below which we desire to find the motif in the final generated
structure, and η and ζ are parameters (we used η = 50 and ζ = 4 in this work). With this mod-
ification, auto-differentiation of the resulting energy to obtain the score function creates gradients
that pull the system towards containing the motif in question. Note that the location of the motif
within the generated structure needs not be specified, as equation 4 optimizes over all possible
alignments at each step of the reverse diffusion. On the other hand, it is also easy to introduce
constraints to the possible matching alignments, such as either relative constraints on the mapping
of individual segments of the motif (e.g., first and second segments must be separated by anywhere
between 3 and 20 residues) or absolute constraints on the location of the motif (e.g., first segment
must match in the first 100 residues of the generated structure). This can be easily accommodated
by modifying the parameters of the search in equation 4 as shown previously [163].
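This restraint is a scaled softplus of the RMSD excess; a one-line PyTorch sketch is shown below, with η = 50 and ζ = 4 as used in this work. The computation of the differentiable best-match RMSD ρ(x_t) itself is not shown.

```python
import torch
import torch.nn.functional as F

def motif_restraint_energy(rmsd, rmsd_max=1.0, eta=50.0, zeta=4.0):
    """Softplus-shaped restraint penalizing best-match RMSD values above
    rmsd_max; `rmsd` stands for the differentiable rho(x_t) of equation (4)."""
    return eta * F.softplus(zeta * (rmsd - rmsd_max))

# Example: the penalty is small below rmsd_max and grows roughly linearly above it.
rho = torch.tensor([0.5, 1.0, 2.0, 4.0])
energy = motif_restraint_energy(rho)
```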
Q Programmability: Symmetry
Q.1 Motivation
The functions of many proteins are often realized through self-assembly into large higher-order
structures. One of the most powerful and widely employed tools for this in nature is symmetric
assembly, observed in everything from large membrane pores that gate transfer of materials in and
out of cells to icosahedral viral capsids which can encapsulate an entire genetic payload [165].
Similarly, incorporating symmetry into computational design of proteins holds great promise for
building large functional complexes [166]. To realize this potential within our diffusion frame-
work, we propose a method to directly constrain sampling to any prescribed discrete Euclidean
symmetries.
Related work Incorporating group equivariance in machine learning has been an important topic
in the machine learning community [167]. Incorporating symmetries is critical in molecular sim-
ulations [168, 169]. In this work, we propose a method for incorporating symmetry for point
set sampling with applications in the generation of large-scale protein complexes with arbitrary
discrete symmetry groups.
To generate symmetric protein complexes, we want to sample structures x ∈ R^{M×N×3} that are built
from M = |G| identical single-chain subunits, each in R^{N×3}, where N is the number of residues per
subunit. The SDE solving process produces final samples via
x_0 = \mathrm{sdesolve}(x_T).
For sample generation to respect the symmetries of an arbitrary group G, the SDE/ODE dynamics
need to be G-invariant up to a permutation of subunits. Letting · denote the symmetry operations
(rotation, reflection, and translation) acting on point sets in R^3, we define the sampling procedure
sdesolve : R^{M×N×3} → R^{|G|×N×3}, with x_0 = sdesolve(x_T) being the desired samples. The
sampling procedure should satisfy the G-invariance condition
\mathrm{sdesolve}(g \cdot x_T) = g \cdot \mathrm{sdesolve}(x_T) = \sigma(g) \cdot \mathrm{sdesolve}(x_T) \quad \text{for all } g \in G,
where g indicates a group element in G; we impose an arbitrary order on the elements of G, and our
method is equivariant to permutations of the subunits. σ(g) is the induced permutation operation that
satisfies the relation gG = σ(g)G, as computed from the group multiplication table (also called the
Cayley table).
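For concreteness, the small sketch below shows how the induced permutation σ(g) can be read off a group's multiplication table; the Cayley table of C3 written out here is only an illustrative example.

```python
import numpy as np

# Sketch: the induced subunit permutation sigma(g) from a group multiplication (Cayley) table.
# cayley[i, j] holds the index of the product g_i * g_j under a fixed ordering of the elements of G.
def induced_permutation(g_index, cayley):
    # Left-multiplication by g permutes the elements of G: g * g_j = g_{cayley[g_index, j]},
    # so row g_index of the Cayley table is exactly sigma(g) with g G = sigma(g) G.
    return cayley[g_index]

# Example: the cyclic group C3 with elements (e, r, r^2).
cayley_c3 = np.array([[0, 1, 2],
                      [1, 2, 0],
                      [2, 0, 1]])
print(induced_permutation(1, cayley_c3))  # -> [1 2 0], i.e. r sends (e, r, r^2) to (r, r^2, e)
```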
The first equality in appendix Q.2 is trivially satisfied if f (·) or the underlying gradient update is
E(3) equivariant, as G consists of only orthogonal transformations and translations. However, the
second equality is generally not satisfied. For molecular simulations where Hamiltonian dynamics
is used, the second equality can be satisfied if (i) the energy function is E(3) invariant, and (ii)
the initial x_T and dx_T/dt are symmetric, i.e., g \cdot x_T = \sigma(g)\, x_T and g\, \frac{dx_T}{dt} = \sigma(g)\, \frac{dx_T}{dt}. At each successive
time step, x_t then automatically satisfies the prescribed G-symmetry. This approach confines both the
position and momentum update to ensure that the sampled configurations remain symmetric.
However, this is not the case for SDE/ODE sampling in our framework. We list three origins of
symmetry-breaking error when the condition in appendix Q.2 is relied upon: (i) the denoising network
uses distances as features and is therefore automatically E(3) equivariant; however, because protein
feature graphs are generated probabilistically, each subunit protein has a different geometric graph
despite the protein's overall symmetry. (ii) Our polymer-structured noise is randomly sampled from
N(x_T; μ, Σ), so each subunit protein receives different chain noise. (iii) The sampling procedure
requires solving an ODE/SDE, which is vulnerable to accumulated integration error. Integration error
can induce unwanted geometric drifts such as rotation and translation [173], and can be a substantial
symmetry-breaking force.
where n is the index of the elements of the group, m is the index of atoms in AU, and i, j are
Euclidean indices. The associated diffusion energy transformation is
U_f(x_t) = \frac{1}{2|G|} \left\lVert \sigma_t^{-1} R^{-1} \big( f(\tilde{x}_t, U_0; t) - \alpha_t\, \hat{x}_t(x_t, t) \big) \right\rVert_2^2 = -\frac{1}{|G|} \log p_t(x_t).
The energy is averaged over |G| to account for the diffusion energy of an individual AU with M atoms.
We can compute the Jacobian of the transformation f : R^{M×3} → R^{N×M×3} as
\frac{d f(\tilde{x}_t)}{d \tilde{x}_t} = G \quad\Rightarrow\quad \frac{d [f(\tilde{x}_t)]_{m,n,i}}{d [\tilde{x}_t]_{n',j}} = G_{m,i,j}\, \delta_{n,n'}.
To derive the transformed dynamics, we inspect one solver step for the reverse Langevin dynamics
(identical analysis can be done for reverse diffusion), which is
\tilde{x}_{t+dt} = \tilde{x}_t - \frac{1}{2} R R^{\top} \left( \frac{d f(\tilde{x}_t)}{d \tilde{x}_t} \right)^{\!\top} \frac{d U_f(x_t)}{d x_t}\, dt + R\, d\bar{w},
and show the induced gradient transform with its associated indexed representation
\frac{d U_f(x_t)}{d \tilde{x}_t} = \left( \frac{d f(\tilde{x}_t)}{d \tilde{x}_t} \right)^{\!\top} \frac{d U_f(x_t)}{d x_t} = G^{\top} \frac{d U_f(x_t)}{d x_t},
\left[ \frac{d U_f(x_t)}{d \tilde{x}_t} \right]_{n,j} = \sum_{m} \sum_{i} G_{m,i,j} \left[ \frac{d U_f(x_t)}{d x_t} \right]_{m,n,i}.
Observe that in the gradient transformation the summation runs over the index i, in contrast with
the index j used in the forward transformation, reflecting the transposition [·]^⊤ between i and j.
For orthogonal transformations, this transpose is equivalent to the inverse of the individual rotation
matrices in G. This method inherently pulls the gradients back to the AU. The transformed gradient
can be computed conveniently with auto-differentiation, specifically as a vector-Jacobian product.
Furthermore, the gradients accumulated in the AU are averaged over the number of chains in the
tessellated domain by dividing the gradient by |G|.
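A minimal PyTorch sketch of this pull-back is shown below. It assumes, for illustration only, that the group acts by pure rotations stored as a (|G|, 3, 3) tensor and uses autograd to realize the vector-Jacobian product; it is not the actual Chroma implementation.

```python
import torch

# Sketch of the gradient pull-back to the asymmetric unit (AU) via a vector-Jacobian product.
# `group_rotations` is a (|G|, 3, 3) tensor of rotation matrices and `energy_fn` is any scalar
# diffusion/conditioner energy defined on the full, tessellated assembly.
def au_gradient(x_au, group_rotations, energy_fn):
    x_au = x_au.detach().requires_grad_(True)
    # Tessellate the AU: out[g, n, :] = R_g @ x_au[n, :]  ->  shape (|G|, N, 3).
    x_full = torch.einsum("gij,nj->gni", group_rotations, x_au)
    energy = energy_fn(x_full) / group_rotations.shape[0]      # average the energy over |G| chains
    (grad_au,) = torch.autograd.grad(energy, x_au)             # autograd applies G^T, folding gradients to the AU
    return grad_au
```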
We then analyze the transformed solver step with the pull-back gradient transform and get
f(\tilde{x}_t + d\tilde{x}_t) = f\!\left( \tilde{x}_t - \frac{1}{2} R R^{\top} \left( \frac{d f(\tilde{x}_t)}{d \tilde{x}_t} \right)^{\!\top} \frac{1}{|G|} \frac{d U(x_t)}{d x_t}\, dt + R\, d\bar{w} \right)
= \underbrace{G}_{\text{symmetrize}} \underbrace{\left( \tilde{x}_t - \frac{1}{2} R R^{\top} G^{-1} \frac{1}{|G|} \frac{d U(x_t)}{d x_t}\, dt + R\, d\bar{w} \right)}_{\text{folding to AU}}.
The constrained transformation has a nice interpretation: the solver step first folds the infinitesimal
change back, followed by symmetrization. Note that this method is equivariant to permutations
of group elements in G because the gradients are pulled back to AU and tessellated following the
order of group elements in G.
Another option for pulling back the gradients is to perform a "broadcasting" operation from a single
AU (indexed with u) of x:
f(\tilde{x}_t + d\tilde{x}_t) = f\!\left( \tilde{x}_t - \frac{1}{2} R R^{\top} [G]_u^{-1} \left[ \frac{d U(x_t)}{d x_t} \right]_u dt + R\, d\bar{w} \right).
This is also a valid gradient transformation that ensures G-invariance. It is an example of the
constrained transformations in eq. (3), and in practice we apply the temperature adjustment described
in Supplementary Appendix C.
[Diagram: composed conditioner blocks (PriorEnergy, Symmetrize, Subsample, +Rg Restraint); see Supplementary Figure 27.]
Subsampling. The subsampling transformation retains only a subset of the symmetry-related chains, with Jacobian
\frac{d f(x_t)}{d \tilde{x}_t} = S \in [0, 1]^{MN \times KN},
where S is the chain selection matrix and K < M is the number of chains selected during sampling.
Rg energy restraints. The conditioner formalism provides the flexibility to seamlessly incorporate
restraint energies during energy updates. To ensure good contact and packing, we can integrate an
Rg penalty through a harmonic or flat-bottom potential, which serves to maintain both the inter-chain
distances and the Asymmetric Unit (AU) radius of gyration within a reasonable range via a restraint
energy U_{Rg}.
The proposed samplers can also be combined with other conditioners (substructure, natural lan-
guage, shape, etc.) to realize symmetric assembly design with controllable functions.
Composed transformation. Putting this together, the composed transformation is
x = \mathrm{subsample}(\mathrm{symmetrize}(\tilde{x})),
U_f(x, U, t) = U + U_{Rg}(\tilde{x}) + U_{Rg}\big(\mathrm{subsample}(\mathrm{symmetrize}(\tilde{x}))\big).
We include a schematic of the composed conditioner blocks in Supplementary Figure 27. In practice,
this is implemented as a composition of conditioner functions.
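A schematic sketch of this composition in PyTorch is given below; the rotation-only group action and the `rg_energy` callable (a hypothetical flat-bottom radius-of-gyration restraint) are assumptions for illustration.

```python
import torch

# Sketch of the composed conditioner: tessellate the AU by the symmetry group, then keep only a
# subset of chains; Rg restraints are applied both to the AU and to the subsampled assembly.
# `rg_energy` is a hypothetical flat-bottom radius-of-gyration restraint.
def composed_conditioner(x_au, group_rotations, chain_idx, base_energy, rg_energy):
    def symmetrize(x):                  # (N, 3) -> (|G|, N, 3)
        return torch.einsum("gij,nj->gni", group_rotations, x)

    def subsample(x_full):              # keep K of the |G| chains
        return x_full[chain_idx]

    x = subsample(symmetrize(x_au))
    energy = base_energy + rg_energy(x_au) + rg_energy(x)
    return x, energy
```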
[Figure panels (additional symmetric samples): groups C2, C3, C4, C6, C10, D2, D3, D4, D8, D12, T, I, screw symmetry, and composed symmetries [C3, C3], [C3, C4], [C3, C5], [C2, C4], [C2, C4], [C2, C5], [C2, C6].]
R Programmability: Shape
R.1 Motivation
Proteins often realize particular functions through particular shapes, so the ability to sample
proteins subject to generic shape constraints is an important capability.
Pores allow molecules to pass through biological membranes via a doughnut shape, scaffolding
proteins spatially organize molecular events across the cell with precise spacing and interlocking
assemblies, and receptors on the surfaces of cells interact with the surrounding world through pre-
cise geometries. Here, we aim to explore and test generalized tools for conditioning on volumetric
shape specifications within the diffusion framework.

Supplementary Figure 29: Examples of poor packing in sampled symmetric complexes. Un-
derpacking or overpacking can occur occasionally, and may be partially addressed by density re-
straints.
R.2 Approach
Our shape conditioning approach is based on optimal transport [174], which provides tools for
identifying correspondences and defining similarities between objects, such as the atoms in a pro-
tein backbone and a point cloud sampled from a target shape. We leverage two tools in particu-
lar: (i) the Wasserstein distance [174], which measures point cloud correspondences in Euclidean
space and (ii) the Gromov-Wasserstein distance, which can measure the correspondences between
objects in different domains by comparing their intra-domain distances or dissimilarities. Because
Gromov-Wasserstein distance leverages relational comparisons, it can measure correspondences
between unaligned objects of different structure and dimensionality such as a skeleton graph and a
3D surface [175] or unsupervised word embeddings in two different languages [176].
Bounding degeneracy We initially experimented with restraints based purely on the Wasser-
stein distance and a target point cloud, which can be estimated with the Sinkhorn algorithm [174],
but found that the huge degeneracy in potential volume-filling conformations would often lead to
jammed or high-contact-order solutions when using a modest amount of MCMC sampling. While
long-run Langevin sampling or similar approaches could allow gentle annealing into a satisfactory
configuration in principle, we sought to accelerate convergence by breaking this degeneracy with
a very coarse "space-filling plan" for how the fold should map into the target point cloud, which
the prior can then realize with a specific protein backbone.
To construct this plan, we (i) computed an idealized intra-chain distance matrix with entries
D_ideal(|i − j|) = 7.21 × |i − j|^{0.32}, (ii) computed the distance matrix for our target shape, and
(iii) solved for the Gromov-Wasserstein optimal transport given these two distance matrices [174],
yielding a coupling matrix K^{GromovWasserstein} with dimensionality N_atoms × N_points. This coupling
map sums to unity and captures the correspondence between each point in the protein and in the
shape. We use a small amount of entropy regularization to solve the optimal transport problem [174].
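A small sketch of how such a fold plan can be computed with the POT package is shown below; the use of ot.gromov.entropic_gromov_wasserstein, uniform marginals, and the epsilon value are our assumptions for illustration, and the plan construction used in this work may differ in its details.

```python
import numpy as np
import ot  # POT: Python Optimal Transport

# Sketch of a coarse "space-filling plan": couple residues to target shape points with entropic
# Gromov-Wasserstein optimal transport. Uniform marginals and epsilon are illustrative assumptions.
def fold_plan(n_residues, shape_points, epsilon=1e-2):
    idx = np.arange(n_residues)
    # Idealized intra-chain distances as a function of sequence separation |i - j|.
    d_chain = 7.21 * np.abs(idx[:, None] - idx[None, :]) ** 0.32
    # Pairwise distances within the target point cloud.
    d_shape = np.linalg.norm(shape_points[:, None, :] - shape_points[None, :, :], axis=-1)
    p = np.full(n_residues, 1.0 / n_residues)
    q = np.full(len(shape_points), 1.0 / len(shape_points))
    # Entropy-regularized Gromov-Wasserstein coupling (its entries sum to one).
    return ot.gromov.entropic_gromov_wasserstein(d_chain, d_shape, p, q,
                                                 loss_fun="square_loss", epsilon=epsilon)
```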
Optimal Transport loss In the inner loop of sampling, we can combine the Gromov-Wasserstein
coupling with simple Wasserstein couplings as a form of regularization towards our fold “plan”.
Our final loss is then
\mathrm{ShapeLoss}(x, r) = \sum_{i,j} \left( K^{GW}_{ij} + K^{W}_{ij}(x, r) \right) \lVert x_i - r_j \rVert,
where we compute the Wasserstein optimal couplings K^{W}_{ij} with the Sinkhorn algorithm [174]. This
yields a fast, differentiable loss that can be used directly for sampling.
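A differentiable sketch of this loss in PyTorch is given below (a log-domain Sinkhorn solver with uniform marginals; the regularization strength and iteration count are illustrative assumptions, not the settings used in this work).

```python
import torch

def sinkhorn_coupling(cost, reg=0.05, n_iters=50):
    """Entropic OT coupling for uniform marginals via log-domain Sinkhorn (differentiable)."""
    n, m = cost.shape
    log_a = -torch.log(torch.tensor(float(n), device=cost.device))
    log_b = -torch.log(torch.tensor(float(m), device=cost.device))
    f = torch.zeros(n, device=cost.device)
    g = torch.zeros(m, device=cost.device)
    for _ in range(n_iters):
        f = reg * log_a - reg * torch.logsumexp((g[None, :] - cost) / reg, dim=1)
        g = reg * log_b - reg * torch.logsumexp((f[:, None] - cost) / reg, dim=0)
    return torch.exp((f[:, None] + g[None, :] - cost) / reg)

def shape_loss(x, r, k_gw, reg=0.05):
    """ShapeLoss(x, r): distances weighted by the fixed GW fold plan plus a Wasserstein coupling.

    x: (N, 3) current backbone points; r: (P, 3) target point cloud; k_gw: (N, P) fold plan.
    """
    dists = torch.cdist(x, r)                    # pairwise ||x_i - r_j||
    k_w = sinkhorn_coupling(dists, reg=reg)      # Wasserstein coupling K^W(x, r)
    return ((k_gw + k_w) * dists).sum()
```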
Time-dependent scaling We weight the ShapeLoss(x, r) term with the scaling factor
w_t^{(\mathrm{shape})} = \mathrm{Clamp}\!\left( \sqrt{\mathrm{SNR}_t},\; [0.001, 3.0] \right),
and add the gradient of the weighted term directly during sampling. The weighted objective is then
\mathrm{ShapeLoss}_t(x, r) = \mathrm{Clamp}\!\left( \sqrt{\mathrm{SNR}_t},\; [0.001, 3.0] \right) \sum_{i,j} \left( K^{GW}_{ij} + K^{W}_{ij}(x, r) \right) \lVert x_i - r_j \rVert.
Scaling point clouds to protein sizes The Wasserstein and Gromov-Wasserstein losses are sen-
sitive to the point cloud length scales, but our shapes will not in general be correctly sized to the
protein that we wish to design them with. Of the methods that we explored to deal with this, two
that demonstrated some success were
• Fixed volume scaling. We estimate an approximate volume of our point cloud via a
hard-sphere probe with radius set to the typical nearest-neighbor distance, correcting for sphere
overlaps with second-order inclusion-exclusion formulas. We then resize the point cloud ge-
ometry to match the ideal protein geometry scaling of approximately 128 Å³ per residue, and
then adjust by a manually tuned factor (in practice anywhere from 0.3 to 1.0).
• Autoscaling. We use the fixed scaling approach for the GW distance calculation but also make
our loss scale-invariant during optimization by computing the loss with a version of the
current structure that has been rescaled to have the same radius of gyration as the target
point cloud.
Generating point clouds for characters We rendered Latin letters and Arabic numerals in the
Liberation Sans font, extruded these 2D images into 3D volumes, and then sampled isotropic point
clouds from these volumes. The shape logo in Fig. 1C (silhouette of the "Stanford bunny") was
created using data from The Stanford 3D Scanning Repository (http://graphics.stanford.edu/data/3Dscanrep/).
[Diagram: ProClass classifier prediction network operating on noisy backbone coordinates of shape [B, N, 4, 3].]
S Programmability: Classification
S.1 Motivation
Protein databases provide rich, structured descriptions of proteins, classifying them by various
aspects of their sequence, structure, and function. We can use any of these structured descriptors
to generate proteins with structurally and semantically useful features. Some of these descriptors,
particularly ones that correspond with protein function, may induce diffuse and complex structural
changes that resist simple description.
To this end, we explore using a multi-property protein classifier as a conditioner for generation,
attempting to provide the ability to directly design proteins with desired categorical descriptions.
We see this as an initial step towards programmability of protein function.
S.3 Featurization
We encoded the diffusion time with a random Fourier featurization (e.g., see [177]). When pro-
viding a sequence, we encoded it with a learnable embedding layer of amino acid identity. Finally,
we featurized the backbone coordinates into 2-mer and chain-based distances and orientations, as
described in Supplementary Appendix G, and passed the sum of these components to the neural
network.
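As an illustration, a minimal random Fourier time featurization in PyTorch might look like the sketch below; the embedding dimension and frequency scale are assumptions, not the values used in this work.

```python
import torch
import torch.nn as nn

class RandomFourierTimeEmbedding(nn.Module):
    """Random Fourier featurization of the diffusion time t in [0, 1] (illustrative sketch)."""

    def __init__(self, dim=512, scale=16.0):
        super().__init__()
        # Fixed random frequencies; stored as a buffer so they are saved with the model but not trained.
        self.register_buffer("freqs", torch.randn(dim // 2) * scale)

    def forward(self, t):  # t: (B,)
        angles = 2.0 * torch.pi * t[:, None] * self.freqs[None, :]
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (B, dim)
```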
S.4 Architecture
The encoder is a message-passing neural network. We formed the graph by taking K=20 nearest
neighbors and sampling additional neighbors from a distribution according to a random exponential
method.
We passed node and edge embeddings to each layer, with each node being updated by a scaled
sum of messages passed from its neighbors. We obtained the message passed from node i to node
j by stacking the embeddings at node i, those at node j, and the edge embedding, and passing these
to a multi-layer perceptron (1 hidden layer). We updated edges similarly. In each layer, we also
applied layer normalization (along the channel dimension) and dropout (dropout probability = 0.1).
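A minimal sketch of one such message-passing layer is given below; the sizes follow the hyperparameters in appendix S.7, while the mean aggregation and residual update are assumptions for illustration rather than the exact Chroma implementation.

```python
import torch
import torch.nn as nn

class MessageLayer(nn.Module):
    """One message-passing layer of the kind described above (illustrative sketch)."""

    def __init__(self, node_dim=512, edge_dim=192, hidden=256, dropout=0.1):
        super().__init__()
        self.msg_mlp = nn.Sequential(
            nn.Linear(2 * node_dim + edge_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, node_dim))
        self.norm = nn.LayerNorm(node_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, h, e, nbr_idx):
        # h: (N, node_dim) node embeddings; e: (N, K, edge_dim) edge embeddings;
        # nbr_idx: (N, K) indices of the K neighbors of each node.
        h_j = h[nbr_idx]                                   # neighbor embeddings, (N, K, node_dim)
        h_i = h[:, None, :].expand_as(h_j)                 # central-node embeddings broadcast to (N, K, node_dim)
        msgs = self.msg_mlp(torch.cat([h_i, h_j, e], dim=-1))
        h_new = h + self.dropout(msgs.mean(dim=1))         # scaled sum of incoming messages
        return self.norm(h_new)
```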
After processing by the MPNN, we passed node embeddings to a different classification head for
each label. For each head corresponding to a chain-level label, we pooled residues from each
chain using an attentional pooling layer. We then passed the resulting embeddings to an MLP with
1 hidden layer to output logits for each label.
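For chain-level labels, the attentional pooling head can be sketched as below; the attention parameterization and output size are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ChainPoolHead(nn.Module):
    """Attentional pooling of residue embeddings within one chain, followed by an MLP head (sketch)."""

    def __init__(self, node_dim=512, hidden=256, n_classes=10):
        super().__init__()
        self.attn = nn.Linear(node_dim, 1)
        self.mlp = nn.Sequential(nn.Linear(node_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_classes))

    def forward(self, h_chain):                       # h_chain: (L, node_dim) residues of one chain
        w = torch.softmax(self.attn(h_chain), dim=0)  # (L, 1) attention weights over residues
        pooled = (w * h_chain).sum(dim=0)             # (node_dim,) pooled chain embedding
        return self.mlp(pooled)                       # logits for this chain-level label
```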
S.6 Training
We trained the model for 300 epochs with an Adam optimizer [132] with default momentum set-
tings (betas=(0.9,0.999)). We linearly annealed the learning rate from 0 up to 0.0001 over the first
10,000 steps and then kept it constant. During training, we first sampled a time stamp 0 < t < 1
uniformly, then sampled noise from the globular covariance distribution, injected this noise into
the backbone coordinates, and fed the result to the model. Next, we predicted labels, computed
losses, and updated parameters with the Adam optimizer.
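One training step can be summarized by the sketch below; `noise_coords`, `classifier`, and `label_loss` are hypothetical stand-ins for the globular-covariance noising, the graph classifier, and the per-label losses, none of which are reproduced here.

```python
import torch

# Minimal sketch of one ProClass training step (illustrative; the callables are hypothetical stand-ins).
def train_step(classifier, optimizer, backbone, sequence, labels, noise_coords, label_loss):
    t = torch.rand(())                       # sample a timestamp uniformly in (0, 1)
    x_t = noise_coords(backbone, t)          # inject noise drawn from the globular covariance at time t
    logits = classifier(x_t, sequence, t)    # predict all labels from the noised backbone
    loss = label_loss(logits, labels)        # combine per-label losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                         # Adam update
    return loss.item()
```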
S.7 Hyperparameters
Our classification model has 4 layers, with node feature dimension 512 and edge feature dimension
192. Our node update MLP has hidden dimension 256 with 2 hidden layers, and our edge update
MLP has hidden dimension 128 with 2 hidden layers.
[Diagram: ProCap caption prediction network. Noisy coordinates [B, N, 4, 3], the diffusion time index [B], and optional sequences [B, N] are encoded by a GNN into backbone embeddings [B, N, H], linearly projected to conditioning pseudotokens [B, R, D], and passed together with task tokens [B] and tokenized caption embeddings [B, L, D] to the language model, which outputs caption log-probabilities [B, L, V].]
Supplementary Figure 31: ProCap model architecture. ProCap connects a pretrained graph neu-
ral network encoder to an autoregressive language model trained on a large data corpus including
scientific documents. We use the 125M parameter GPT-Neo as the language model, with internal
dimension D = 768. Conditioning is achieved with pseudotokens generated from encodings of pro-
tein complex 3D backbone coordinates (batch size B, number of residues N, embedding dimension
H) and a task token indicating whether a caption describes the whole complex or a single chain.
The R relevant pseudotokens for each caption, consisting of the chain/structure residue tokens and
the task token, are passed to the language model along with the caption. When used in the forward
mode, ProCap can describe the protein backbone by outputting the probabilities of each word in
the language model’s vocabulary of size V for each of the L tokens of a caption. When used in
conjunction with the prior model, it can be used for text-to-protein backbone synthesis. In training,
ProCap uses a masked cross entropy loss applied only to the caption logits.
To help the model distinguish between PDB and UniProt prediction tasks, we prepend the encodings
of entire structures with an embedding vector of a newly defined PDB marker token. We normalize
the components of all structure vectors such that each one has zero mean and unit variance.
In summary, the ProCap architecture consists of a pre-trained GNN model for structure embedding
and a pre-trained language model for caption embedding, with a learnable linear layer to interface
between the two and a learnable language model head to convert the raw language model outputs
to token probabilities.
T.5 Performance
In order to test ProCap as a generative model, we draw high-quality conditional and correspond-
ing unconditional low-temperature samples from the model. To that end, we employ a structural
denoising approach in a similar fashion to the method described in [55]. Specifically, we use the
hybrid reverse diffusion SDE of Appendix C to evolve noisy random sample structures drawn from
the diffusion model prior, with gradients of the ProCap loss with respect to structure added to the
gradients of the structure diffusion model. When the size of the ProCap gradients is too small
relative to those from the prior model, there is little appreciable difference between a caption-
conditioned sample and an unconditional sample drawn from the same seed. We thus scale the
ProCap gradients by a guidance scale of up to 100 and find that the resulting samples are better con-
ditioned, analogously to previous work on classifier guidance [80]. At even larger guidance scales,
the coherence of the samples breaks down as the base model’s gradients are overwhelmed.
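A sketch of this guidance step is shown below; `prior_score` and `procap_loss` are hypothetical stand-ins for the diffusion model score and the ProCap caption cross-entropy, and the default scale is only an example value.

```python
import torch

# Sketch of caption-guided denoising: the gradient of the ProCap caption loss is scaled and added
# to the unconditional score, analogous to classifier guidance [80]. The callables are hypothetical.
def guided_score(x_t, t, caption, prior_score, procap_loss, guidance_scale=100.0):
    x_t = x_t.detach().requires_grad_(True)
    loss = procap_loss(x_t, t, caption)                  # caption cross-entropy given the noisy structure
    (grad,) = torch.autograd.grad(loss, x_t)             # d(caption loss)/d(structure)
    return prior_score(x_t, t) - guidance_scale * grad   # steer samples toward lower caption loss
```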
We present examples of our generated samples in the main text. To evaluate ProCap model per-
formance, we measure the improvement in caption loss during the SDE evolution between the
unconditioned and conditioned samples. As an independent check, we also examine the gain in the
TM-score between our sample (conditioned over unconditioned) and a target PDB structure which
[Plots: (left) ProCap loss difference (unconditioned − conditioned) vs. diffusion timestep; (right) TM-score vs. diffusion timestep, for the captions "Crystal structure of Fab" and "Crystal structure of SH2 domain".]
Supplementary Figure 32: ProCap evaluation metrics show effect of natural language condi-
tioning compared to unconditioned samples from the same noised seed structure. (Left) The
caption model cross-entropy loss as a function of diffusion timestep, for two sample trajectories
with and without the use of caption gradients. (Right) The TM-score between sampled structures
and example structures from the PDB corresponding to the captions used for conditioning.
exemplifies the caption being used for conditioning. Finally, we analyze the generated structures
visually for structural coherence. Qualitatively, starting from the same noisy random structure, the
diffusion model yields denoised structures which demonstrate desirable characteristics including
secondary structure elements, both with and without guidance from the caption model.
The caption loss and TM-score metrics for example sampling trajectories are shown in Supplementary Figure 32.
Both are initially quite noisy, and the conditioned and unconditioned samples are equally likely at
high t to have lower ProCap loss and/or better alignment with the target structure. However, over
the course of the reverse diffusion, the effect of the conditioning is demonstrated. It is particularly
notable that the TM-score is relatively stable at low t, indicating a regime where the SDE evolution
is fine-tuning structural details rather than making large-scale changes. In addition, we see that the
impact of classifier guidance can vary widely, possibly owing to the intricate balance between the
gradients over the diffusion trajectory. It remains challenging to robustly generate samples with
natural language conditioning in a systematic fashion; nevertheless, our results serve as a proof of
concept of guided diffusion using text input.
As a final check of the ProCap model, we ask whether samples generated guided by natural lan-
guage suggestive of a particular CATH topology are seen as being representative of that topology,
as measured by the model of appendix S. In Supplementary Figure 33, we compare the ProCap per-
plexity and ProClass probability of an immunoglobulin fold (CATH 2.60.40) for backbones gener-
ated using the caption “Crystal structure of Fab”. We see a strong correlation between the negative
log probability of the relevant topology and the ProCap loss, suggesting that ProCap shows signs
of understanding the meaning of natural language captions at the level of CATH topologies.
Supplementary Figure 33: ProCap perplexity shows correlation with ProClass loss. From a
group of samples generated with classifier guidance from ProCap using an antibody-related cap-
tion, we plot the resulting perplexity of each backbone against its probability of an immunoglobulin
fold (CATH 2.60.40). We estimate the fold probability of a backbone using the classification model
described in appendix S, after the backbone is generated. Successful refolding can take place re-
gardless of perplexity, as described further in appendix K.5.
U Experimental Validation
U.1 Protein design
Four sets of designs were generated for experimental validation: two unconditional sets (Uncon-
ditional I and II) and two sets conditioned on CATH class or topology (Conditioned I and II,
respectively). The full protocol for each of these involved generating a set of Chroma back-
bones (either unconditionally or conditioned), designing sequences for each backbone (10 per
backbone for Unconditional I and 1 per backbone for the rest), and sub-selecting a smaller set of
designs to be experimentally characterized (see Supplementary Table 7 for details). Importantly,
no sub-selection based on refolding or structural energy calculations was performed. Further,
the protocols were run in an automated fashion with no manual intervention or selection of de-
signs. All experimentally addressed protein sequences are included in Zenodo dataset https://doi.org/10.5281/zenodo.8285063.
SOC at 37◦ C, 230 rpm. Recovered cells were inoculated in 5 mL TB + 100 ng/µL carbenicillin and
grown for an additional 16 h. Plasmid populations were then isolated by miniprep (Macherey-Nagel
740588.50). For the fluorescence stability experiment shown in Supplementary Figure 38d, cells
recovered after FACS and regrown overnight were then subjected to a second round of identical
experimentation in which split-GFP components were induced and cells were re-examined on the
BD FACSAria.
2-5030).
After incubation, resin was loaded on gravity column and allowed to flow through, then washed
with 2 x 10 CV Strep-Tactin XT wash buffer W (IBA Lifesciences 2-1003) and eluted with 2x 1 CV
Strep-Tactin XT elution buffer BXT (IBA Lifesciences 2-1042). Elution fractions were pooled and
incubated with TEV (Sigma Aldrich T4455) at a 1:100 v/v concentration ratio to protein, overnight
at room temperature.
The sample was then buffer exchanged back into Strep-Tactin XT wash buffer W using a Zeba
desalting column (Thermo Scientific 89893) and incubated with 5 mL Strep-Tactin XT resin and
1 mL cOmplete His-Tag Purification Resin (Millipore Sigma 5893801001) for 1 hr at 4◦ C before
flowing through a gravity column to remove TEV and uncleaved protein.
Samples were then concentrated to a volume of approximately 5 mL and purified via size exclusion
chromatography (SEC) on a HiLoad 16/600 Superdex 75 Column (Cytiva GE28-9893-33) into a
final buffer of 20 mM Tris pH 7.5 100 mM NaCl. Fractions were collected, purity was assessed by
SDS-PAGE, and appropriate fractions were pooled.
For mammalian-based protein expression of SEM 011, a gBlock encoding the protein was intro-
duced into a plasmid for transient transfection via Golden Gate Assembly under the CMV promoter
with an N-terminal signal peptide, based on vector pcDNA3.4 (ThermoFisher, A14697). The pro-
tein also had a C-terminal TEV cleavage site and Strep-tag identical to the configuration used
for bacterial expression described above. 100 mL of Expi293F cells (ThermoFisher A14635) in
Expi293 expression medium was transfected with the construct containing C-terminal Strep tag
following the manufacturer’s guidelines. Cells were transfected on Day 0 at a density of 3×10⁶ viable
cells/mL with 100 µg of plasmid DNA and placed in a shaker at 37 °C, 8% CO₂. At 24 h post-
transfection, cells were fed with transfection enhancers and returned to shaker for expression until
day 5.
Expression supernatant was harvested at 70% cell viability by centrifugation at 4,000 x g for 30
min. Supernatant was clarified further through a 0.22 µm filter for immediate purification. 2
mL of Strep-Tactin XT 4Flow High-capacity resin (IBA Lifesciences 2-5030) was added to the
supernatant and placed on a roller for 24 h at 4◦ C for batch binding.
After incubation, resin was loaded on gravity column and allowed to flow through, then washed
with 7.5 CV Strep-Tactin XT wash buffer W (IBA Lifesciences 2-1003) and eluted with 2 x 2.5 CV
Strep-Tactin XT elution buffer BXT (IBA Lifesciences 2-1042). Eluted protein was concentrated
to 2.5 mL using Amicon Ultra-15 3 kDa spin concentrators (Millipore UFC900324) followed by
buffer exchange into PBS pH 7.4 using PD-10 desalting columns packed with Sephadex G-25 resin
(Cytiva 17085101). Desalted protein was incubated with TEV protease (Sigma Aldrich
T4455) at a 1:100 v/v concentration ratio to protein, overnight at 4◦ C.
The sample was then incubated with 1 mL Strep-Tactin XT resin and 1 mL cOmplete His-Tag
Purification Resin (Millipore Sigma 5893801001) for 1 h at 4◦ C before flowing through a gravity
column to remove TEV and uncleaved protein.
Cleaved protein was then concentrated to a volume of approximately 1 mL and purified via size
exclusion chromatography (SEC) on a 10/300 Superdex 75 Increase Column (Cytiva 29148721)
into a final buffer of 20 mM Tris pH 7.5 100 mM NaCl. Fractions were collected, purity was
assessed by SDS-PAGE, and appropriate fractions were pooled.
PhaserMR [189] and was fully refined using phenix.refine [190] to a resolution of 1.1 Å. Data
collection and refinement statistics are listed in Extended Data Table 2.
Diffraction quality crystals of UNC 239 were obtained by hanging drop diffusion at 4 °C by mixing
750 nL protein (27 mg/ml in 20 mM Tris pH 8, 100 mM NaCl) with 750 nL reservoir solution (0.2
M ammonium acetate, 0.1 M Tris pH 8.5, 25% w/v polyethylene glycol) over 500 µL reservoir.
The drop containing crystals was mixed with ethylene glycol to 20% before flash freezing. X-ray
diffraction data was collected at 100 K at a wavelength of 0.97624 Å at the PETRA3 synchrotron
on the P13 beamline [186]. The data were processed using DIALS [187] and Aimless [188] in
the P2₁2₁2₁ space group with 2 molecules in the ASU. The structure was phased using the
Chroma-generated model with PhaserMR [189] and fully refined using phenix.refine [190] to
a resolution of 2.36 Å. Data collection and refinement statistics are listed in Extended Data Table
2.
[Scatter panels: each in silico score vs. sequence length (top; per-panel Pearson correlations include −0.27, −0.28, −0.14, −0.14, −0.14, 0.14, 0.03, 0.06, 0.07, 0.01) and length covariate-adjusted score residuals vs. split-GFP score (bottom).]
Supplementary Figure 34: In silico scores compared to Unconditional I split-GFP and se-
quence length. Top) Scatter plot of each score compared to design length. Bottom) Score residuals
after LOWESS smoothing compared to split-GFP values. The Pearson correlation is written in each
plot; the LOWESS fit is shown in black.
[Plot: partial Spearman correlation (y-axis, approximately −0.4 to 0.3) for Unconditional I and Unconditional II across scores: ChromaDesign PurePotts, ChromaDesign Multi Potts, ChromaDesign Multi AR, Chroma v0 ELBO, Chroma v1 ELBO, ESMFold TM-score, OmegaFold TM-score, AlphaFold TM-score, AlphaFold pLDDT, and FoldSeek training-set TM-score.]
Supplementary Figure 35: In silico score partial Spearman correlations to split-GFP, control-
ling for sequence length. The horizontal bar is the median partial Spearman correlation of each
score, and the vertical bar its 95% confidence interval.
Supplementary Figure 36: Unconditional protein designs. 172 unconditional Chroma proteins
(UNC 001 through UNC 172) constructed for experimental validation between 100 and 450 amino
acids in length.
[Plot panels: split-GFP bin scores (solubility) vs. rank-ordered proteins and DSC thermograms (Cp (kJ/mol/K) vs. temperature (°C)) for designs conditioned on secondary structure (legend: mixed α/β, α-conditioned, β-conditioned/β-barrel); highlighted panels include SEM_038 and SEM_011 (mixed α/β, Tm1 = 64.4 °C).]
Supplementary Figure 37: Secondary structure conditional designs. a, 42 proteins were de-
signed based on secondary structure composition (SEM 001 through SEM 042). b, Split-GFP
rank-ordered bin scores for designed proteins conditioned on secondary structure content. Individ-
ual data points for two independent biological replicates shown. c, Differential scanning calorime-
try data for one protein from each of various secondary structure design classes. Split-GFP data
shown for reference.
[Figure panels: a, split-GFP protein solubility reporter assay schematic (soluble protein restores GFP fluorescence; insoluble protein gives no fluorescence); b, experimental workflow: DNA library design, pooled ligation, pooled transformation, protein expression, FACS, NGS/analysis; c, FACS plots (SSC vs. GFP fluorescence (AU)) with GPCR negative control, DHFR positive control, and sorting gates Bin 0 to Bin 3; d, cell-count histograms of regrown, sorted populations.]
Supplementary Figure 38: Split-GFP protein solubility assay. a, Schematic of split-GFP reporter
concept. Co-expression of soluble protein fused to the GFP11 tag with the GFP1-10 protein results
in restoration of GFP fluorescence. b, Split-GFP experimental workflow to determine protein
solubility. FACS = fluorescence-activated cell sorting; NGS = next-generation sequencing. c,
FACS gating strategy informed by positive and negative control cells. Chroma library was sorted
into 4 different gates based on GFP fluorescence for subsequent sequencing enrichment analysis.
d, Flow cytometry of sorted cell populations that were regrown and split-GFP components were
re-induced to evaluate signal stability within sorted populations.
Supplementary Figure 39: Soluble protein expression confirmation via western blot. The top 20
and bottom 20 hits from the split-GFP solubility screen on proteins UNC 001 through UNC 172
were reformatted to contain a C-terminal Strep-tag. Protein expression from E. coli lysates was
detected by western blot using Streptactin or anti-Strep-tag antibody. Representative blots shown
from two independent biological experiments. Lane designations: L = ladder; C = control protein
(same on each blot).
[Figure panels: a, b, rank-ordered split-GFP bin scores (negative control indicated); c, agreement between split-GFP bin score replicates (R² = 0.79); d, anti-Strep-tag-HRP and Streptactin-HRP western blots (12% Bis-Tris SDS-PAGE run in MES) with the lane assignments summarized below.]

Anti-Strep-tag-HRP western blot:
Lane  PRO-ID   MW (kDa)  Observed
1     UNC_239  32.6      Yes
2     UNC_174  21.2      Yes
3     UNC_264  33.6      Yes
4     UNC_258  41.7      Yes
5     UNC_194  25.5      Yes
6     UNC_227  17.2      Yes
7     UNC_244  34.6      Yes
8     UNC_262  25.1      Yes
9     UNC_267  35.0      Yes
10    UNC_185  20.3      No

Streptactin-HRP western blot:
Lane  PRO-ID   MW (kDa)  Observed
1     UNC_239  32.6      Yes
2     UNC_174  21.2      Yes
3     UNC_264  33.6      Yes
4     UNC_258  41.7      Yes
5     UNC_194  25.5      Yes
6     UNC_227  17.2      Yes
7     UNC_244  34.6      No
8     UNC_262  25.1      Yes
9     UNC_267  35.0      No
10    UNC_185  20.3      No
[Plot panels: DSC thermograms (Cp (kJ/mol/K) vs. temperature (°C)) alongside split-GFP bin scores (solubility) vs. rank-ordered proteins for unconditional designs: UNC_018 (no fit, Tm incalculable), UNC_079, UNC_096 (Tm1 = 55.4 °C), UNC_159 (Tm2 = 85.9 °C), UNC_163 (Tm1 = 76.3 °C), UNC_194 (Tm1 = 58.1 °C), and UNC_239 (Tm1 = 55.3 °C, Tm2 = 67.6 °C).]
References
54. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. Deep unsupervised learn-
ing using nonequilibrium thermodynamics in International Conference on Machine Learn-
ing (2015), 2256–2265.
55. Song, Y. et al. Score-Based Generative Modeling through Stochastic Differential Equations
in International Conference on Learning Representations (2021). https://openreview.net/forum?id=PxTIG12RRHS.
56. Murphy, K. P. Machine learning: a probabilistic perspective (MIT press, 2012).
57. Ho, J., Jain, A. & Abbeel, P. Denoising diffusion probabilistic models. Advances in Neural
Information Processing Systems 33, 6840–6851 (2020).
58. Lipman, Y., Chen, R. T., Ben-Hamu, H., Nickel, M. & Le, M. Flow Matching for Generative
Modeling in The Eleventh International Conference on Learning Representations (2022).
59. Liu, X., Gong, C. & Liu, Q. Flow straight and fast: Learning to generate and transfer data
with rectified flow. arXiv preprint arXiv:2209.03003 (2022).
60. Albergo, M. S., Boffi, N. M. & Vanden-Eijnden, E. Stochastic interpolants: A unifying
framework for flows and diffusions. arXiv preprint arXiv:2303.08797 (2023).
61. Särkkä, S. & Solin, A. Applied stochastic differential equations (Cambridge University
Press, 2019).
62. Kingma, D., Salimans, T., Poole, B. & Ho, J. Variational diffusion models. Advances in
neural information processing systems 34, 21696–21707 (2021).
63. Nichol, A. Q. & Dhariwal, P. Improved denoising diffusion probabilistic models in Interna-
tional Conference on Machine Learning (2021), 8162–8171.
64. Kong, X., Brekelmans, R. & Ver Steeg, G. Information-Theoretic Diffusion in The Eleventh
International Conference on Learning Representations (2022).
65. Karras, T., Aittala, M., Aila, T. & Laine, S. Elucidating the design space of diffusion-based
generative models. arXiv preprint arXiv:2206.00364 (2022).
66. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596,
583–589 (2021).
67. Anand, N. & Achim, T. Protein Structure and Sequence Generation with Equivariant De-
noising Diffusion Probabilistic Models. arXiv preprint arXiv:2205.15019 (2022).
68. Kabsch, W. A solution for the best rotation to relate two sets of vectors. Acta Crystallo-
graphica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography
32, 922–923 (1976).
69. Coutsias, E. A., Seok, C. & Dill, K. A. Using quaternions to calculate RMSD. Journal of
computational chemistry 25, 1849–1857 (2004).
70. Anderson, B. D. Reverse-time diffusion equation models. Stochastic Processes and their
Applications 12, 313–326 (1982).
71. Maoutsa, D., Reich, S. & Opper, M. Interacting particle solutions of Fokker–Planck equa-
tions through gradient–log–density estimation. Entropy 22, 802 (2020).
72. Chen, R. T., Rubanova, Y., Bettencourt, J. & Duvenaud, D. K. Neural ordinary differential
equations. Advances in neural information processing systems 31 (2018).
73. Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I. & Duvenaud, D. FFJORD: Free-
Form Continuous Dynamics for Scalable Reversible Generative Models in International
Conference on Learning Representations (2018).
74. Jing, B., Corso, G., Berlinghieri, R. & Jaakkola, T. Subspace diffusion generative models in
European Conference on Computer Vision (2022), 274–289.
75. Rombach, R., Blattmann, A., Lorenz, D., Esser, P. & Ommer, B. High-resolution image syn-
thesis with latent diffusion models in Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition (2022), 10684–10695.
76. Kingma, D. P. & Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. Ad-
vances in neural information processing systems 31 (2018).
77. Holtzman, A., Buys, J., Du, L., Forbes, M. & Choi, Y. The Curious Case of Neural Text
Degeneration in International Conference on Learning Representations (2020). https://openreview.net/forum?id=rygGQyrFvH.
78. Kool, W., Van Hoof, H. & Welling, M. Stochastic beams and where to find them: The
gumbel-top-k trick for sampling sequences without replacement in International Confer-
ence on Machine Learning (2019), 3499–3508.
79. MacKay, D. J. Information theory, inference and learning algorithms (Cambridge university
press, 2003).
80. Dhariwal, P. & Nichol, A. Diffusion models beat gans on image synthesis. Advances in
Neural Information Processing Systems 34, 8780–8794 (2021).
81. Ho, J. & Salimans, T. Classifier-Free Diffusion Guidance in NeurIPS 2021 Workshop on
Deep Generative Models and Downstream Applications (2021). https://openreview.net/forum?id=qw8AKxfYbI.
82. Saharia, C. et al. Photorealistic text-to-image diffusion models with deep language under-
standing. Advances in Neural Information Processing Systems 35, 36479–36494 (2022).
83. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C. & Chen, M. Hierarchical text-conditional
image generation with clip latents. arXiv preprint arXiv:2204.06125 (2022).
84. Du, Y. et al. Reduce, reuse, recycle: Compositional generation with energy-based diffusion
models and mcmc in International Conference on Machine Learning (2023), 8489–8510.
85. Song, Y. & Ermon, S. Generative modeling by estimating gradients of the data distribution.
Advances in Neural Information Processing Systems 32 (2019).
86. Marinari, E. & Parisi, G. Simulated tempering: a new Monte Carlo scheme. EPL (Euro-
physics Letters) 19, 451 (1992).
87. Hansmann, U. H. Parallel tempering algorithm for conformational studies of biological
molecules. Chemical Physics Letters 281, 140–150 (1997).
88. Hong, L. & Lei, J. Scaling law for the radius of gyration of proteins and its dependence on
hydrophobicity. Journal of Polymer Science Part B: Polymer Physics 47, 207–214. https://onlinelibrary.wiley.com/doi/abs/10.1002/polb.21634 (2009).
89. Tanner, J. J. Empirical power laws for the radii of gyration of protein oligomers. Acta Crys-
tallographica Section D: Structural Biology 72, 1119–1129 (2016).
90. Perunov, N. & England, J. L. Quantitative theory of hydrophobic effect as a driving force of
protein structure. Protein Science 23, 387–399 (2014).
91. Trippe, B. L. et al. Diffusion Probabilistic Modeling of Protein Backbones in 3D for the
motif-scaffolding problem in The Eleventh International Conference on Learning Represen-
tations (2022).
92. Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing
for quantum chemistry in International conference on machine learning (2017), 1263–1272.
93. Barnes, J. & Hut, P. A hierarchical O(N log N) force-calculation algorithm. Nature 324,
446–449 (1986).
94. Battaglia, P. W. et al. Relational inductive biases, deep learning, and graph networks. CoRR
abs/1806.01261. http://arxiv.org/abs/1806.01261 (2018).
95. Ingraham, J., Garg, V., Barzilay, R. & Jaakkola, T. Generative models for graph-based pro-
tein design. Advances in neural information processing systems 32 (2019).
96. Vaswani, A. et al. Attention is all you need. Advances in neural information processing
systems 30 (2017).
97. Child, R., Gray, S., Radford, A. & Sutskever, I. Generating long sequences with sparse
transformers. arXiv preprint arXiv:1904.10509 (2019).
98. Zaheer, M. et al. Big bird: Transformers for longer sequences. Advances in Neural Informa-
tion Processing Systems 33, 17283–17297 (2020).
99. Yang, J. et al. Focal Attention for Long-Range Interactions in Vision Transformers in Ad-
vances in Neural Information Processing Systems 34 (Curran Associates, Inc., 2021), 30008–
30022. https://proceedings.neurips.cc/paper/2021/file/fc1a36821b02abbd2503fd949bfc91
Paper.pdf.
100. Van den Oord, A. et al. WaveNet: A Generative Model for Raw Audio in 9th ISCA Speech
Synthesis Workshop (2016), 125–125.
101. AlQuraishi, M. End-to-end differentiable learning of protein structure. Cell systems 8, 292–
301 (2019).
102. Wu, K. E. et al. Protein structure generation via folding diffusion. arXiv preprint arXiv:2209.15611
(2022).
103. Anand, N. & Huang, P. Generative modeling for protein structures. Advances in neural
information processing systems 31 (2018).
104. Senior, A. W. et al. Improved protein structure prediction using potentials from deep learn-
ing. Nature 577, 706–710 (2020).
105. Marks, D. S., Hopf, T. A. & Sander, C. Protein structure prediction from sequence variation.
Nature biotechnology 30, 1072–1080 (2012).
106. Ingraham, J., Riesselman, A., Sander, C. & Marks, D. Learning protein structure with a
differentiable simulator in International Conference on Learning Representations (2018).
107. Belanger, D. & McCallum, A. Structured prediction energy networks in International Con-
ference on Machine Learning (2016), 983–992.
108. Schoenholz, S. & Cubuk, E. D. Jax md: a framework for differentiable physics. Advances
in Neural Information Processing Systems 33, 11428–11441 (2020).
109. Wang, W., Axelrod, S. & Gómez-Bombarelli, R. Differentiable Molecular Simulations for
Control and Learning in ICLR 2020 Workshop on Integration of Deep Neural Models and
Differential Equations (2020).
110. Lin, Z. et al. Language models of protein sequences at the scale of evolution enable accurate
structure prediction. bioRxiv. eprint: https://www.biorxiv.org/content/early/2022/07/21/2022.07.20.500902.full.pdf. https://www.biorxiv.org/content/early/2022/07/21/2022.07.20.500902 (2022).
111. Wu, R. et al. High-resolution de novo structure prediction from primary sequence. bioRxiv.
eprint: https://www.biorxiv.org/content/early/2022/07/22/2022.07.21.500999.full.pdf. https://www.biorxiv.org/content/early/2022/07/22/2022.07.21.500999 (2022).
112. Murphy, K. P. Conjugate Bayesian analysis of the Gaussian distribution. def 1, 16 (2007).
113. Anand, N. et al. Protein sequence design with a learned potential. Nature communications
13, 1–11 (2022).
114. Li, A. J., Sundar, V., Grigoryan, G. & Keating, A. E. TERMinator: A Neural Framework for
Structure-Based Protein Design using Tertiary Repeating Motifs. arXiv preprint arXiv:2204.13048
(2022).
115. Hsu, C. et al. Learning inverse folding from millions of predicted structures in International
Conference on Machine Learning (2022), 8946–8970.
116. Dauparas, J. et al. Robust deep learning–based protein sequence design using Protein-
MPNN. Science 378, 49–56 (2022).
117. Lin, Y. & AlQuraishi, M. Generating novel, designable, and diverse protein structures by
equivariantly diffusing oriented residue clouds. arXiv preprint arXiv:2301.12485 (2023).
118. Yim, J. et al. SE (3) diffusion model with application to protein backbone generation. arXiv
preprint arXiv:2302.02277 (2023).
119. Watson, J. L. et al. De novo design of protein structure and function with RFdiffusion.
Nature, 1–3 (2023).
120. Chu, A. E., Cheng, L., El Nesr, G., Xu, M. & Huang, P.-S. An all-atom protein generative
model. bioRxiv, 2023–05 (2023).
121. Lisanza, S. L. et al. Joint generation of protein sequence and structure with RoseTTAFold
sequence space diffusion. bioRxiv, 2023–05 (2023).
122. Verkuil, R. et al. Language models generalize beyond natural proteins. bioRxiv, 2022–12
(2022).
123. Wells, B. A. & Chaffee, A. L. Ewald summation for molecular simulations. Journal of
chemical theory and computation 11, 3684–3695 (2015).
124. Solomonoff, R. J. A formal theory of inductive inference. Part I. Information and control 7,
1–22 (1964).
125. Grathwohl, W., Swersky, K., Hashemi, M., Duvenaud, D. & Maddison, C. Oops i took a
gradient: Scalable sampling for discrete distributions in International Conference on Ma-
chine Learning (2021), 3831–3841.
126. Berman, H. M. The Protein Data Bank. Nucleic Acids Research 28, 235–242. https://doi.org/10.1093/nar/28.1.235 (Jan. 2000).
127. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics
26, 2460–2461. ISSN: 1367-4803. eprint: https://academic.oup.com/bioinformatics/article-pdf/26/19/2460/16896486/btq461.pdf. https://doi.org/10.1093/bioinformatics/btq461 (Aug. 2010).
128. Chaudhury, S., Lyskov, S. & Gray, J. J. PyRosetta: a script-based interface for implementing
molecular modeling algorithms using Rosetta. Bioinformatics 26, 689–691 (2010).
129. Mistry, J. et al. Pfam: The protein families database in 2021. Nucleic acids research 49,
D412–D419 (2021).
130. Bateman, A. et al. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids
Research 49, D480–D489. https://doi.org/10.1093/nar/gkaa1100 (Nov. 2020).
131. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the
analysis of massive data sets. Nature biotechnology 35, 1026–1028 (2017).
132. Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
(2014).
133. Song, Y. & Ermon, S. Improved techniques for training score-based generative models.
Advances in neural information processing systems 33, 12438–12448 (2020).
134. Zhang, H. et al. Predicting protein inter-residue contacts using composite likelihood maxi-
mization and deep learning. BMC bioinformatics 20, 1–11 (2019).
135. Wootton, J. C. & Federhen, S. Statistics of local complexity in amino acid sequences and
sequence databases. Computers & Chemistry 17, 149–163. https://doi.org/10.1016/0097-8485(93)85006-x (June 1993).
136. Plaxco, K. W., Simons, K. T. & Baker, D. Contact order, transition state placement and the
refolding rates of single domain proteins. Journal of Molecular
Biology 277, 985–994. https://doi.org/10.1006/jmbi.1998.1645 (Apr. 1998).
137. McInnes, L., Healy, J., Saul, N. & Großberger, L. UMAP: Uniform Manifold Approxima-
tion and Projection. Journal of Open Source Software 3, 861. https://doi.org/10.21105/joss.00861 (2018).
138. Røgen, P. & Fain, B. Automatic classification of protein structure by using Gauss integrals.
Proceedings of the National Academy of Sciences 100, 119–124 (2003).
139. Harder, T., Borg, M., Boomsma, W., Røgen, P. & Hamelryck, T. Fast large-scale clustering
of protein structures using Gauss integrals. Bioinformatics 28, 510–515 (2012).
140. Frishman, D. & Argos, P. Knowledge-based protein secondary structure assignment. Pro-
teins 23, 566–579 (1995).
141. Ivankov, D. et al. Contact order revisited: influence of protein size on the folding rate. Pro-
tein Sci 12, 2057–2062 (2003).
142. Mackenzie, C. O., Zhou, J. & Grigoryan, G. Tertiary alphabet for the observable protein
structural universe. Proceedings of the National Academy of Sciences 113 (Nov. 2016).
143. Zheng, F., Zhang, J. & Grigoryan, G. Tertiary Structural Propensities Reveal Fundamental
Sequence/Structure Relationships. Structure 23, 961–971. https://doi.org/10.1016/
j.str.2015.03.015 (May 2015).
144. Zhou, J., Panaitiu, A. E. & Grigoryan, G. A general-purpose protein design framework
based on mining sequence–structure relationships in known protein structures. Proceedings
of the National Academy of Sciences 117, 1059–1068 (2019).
145. Sillitoe, I. et al. CATH: increased structural coverage of functional space. Nucleic acids
research 49, D266–D273 (2021).
146. Van Kempen, M. et al. Fast and accurate protein structure search with Foldseek. Nature
Biotechnology, 1–4 (2023).
147. Borg, M. et al. A probabilistic approach to protein structure prediction: PHAISTOS in
CASP9. LASR2009-Statistical tools for challenges in bioinformatics, 65–70 (2009).
148. Zhang, Y. & Skolnick, J. TM-align: a protein structure alignment algorithm based on the
TM-score. Nucleic acids research 33, 2302–2309 (2005).
149. Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. BioRxiv, 2021–10
(2021).
150. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language
model. Science 379, 1123–1130 (2023).
151. Simons, K. T., Kooperberg, C., Huang, E. & Baker, D. Assembly of protein tertiary struc-
tures from fragments with similar local sequences using simulated annealing and Bayesian
scoring functions. J. Mol. Biol. 268, 209–225 (Apr. 1997).
152. Rhodes, B. & Gutmann, M. U. Enhanced gradient-based MCMC in discrete spaces. Trans-
actions on Machine Learning Research (2022).
153. Salimans, T. & Ho, J. Should EBMs model the energy or the score? in Energy Based Models
Workshop-ICLR 2021 (2021).
154. Bennett, C. H. Mass tensor molecular dynamics. Journal of Computational Physics 19, 267–
279. ISSN: 0021-9991. https://www.sciencedirect.com/science/article/pii/0021999175900777 (1975).
155. Li, C., Chen, C., Carlson, D. & Carin, L. Preconditioned stochastic gradient Langevin dy-
namics for deep neural networks in Proceedings of the AAAI conference on artificial intel-
ligence 30 (2016).
156. Team, S. D. et al. Stan modeling language users guide and reference manual. Technical
report (2016).
157. Gemici, M. C., Rezende, D. & Mohamed, S. Normalizing Flows on Riemannian Manifolds
2016. arXiv: 1611.02304 [stat.ML].
158. Hie, B. et al. A high-level programming language for generative protein design. bioRxiv,
2022–12 (2022).
159. Poole, B., Jain, A., Barron, J. T. & Mildenhall, B. Dreamfusion: Text-to-3d using 2d diffu-
sion. arXiv preprint arXiv:2209.14988 (2022).
160. Hsieh, Y.-P., Kavis, A., Rolland, P. & Cevher, V. Mirrored langevin dynamics. Advances in
Neural Information Processing Systems 31 (2018).
161. Liu, G.-H., Chen, T., Theodorou, E. A. & Tao, M. Mirror Diffusion Models for Constrained
and Watermarked Generation. stat 1050, 2 (2023).
162. Ho, J. et al. Video Diffusion Models 2022. arXiv: 2204.03458 [cs.CV].
163. Zhou, J. & Grigoryan, G. Rapid search for tertiary fragments reveals protein sequence-
structure relationships. Protein Sci. 24, 508–524 (Apr. 2015).
164. Zhou, J. & Grigoryan, G. A C++ library for protein sub-structure search. bioRxiv preprint
2020.04.26.062612 (2020).
165. Goodsell, D. S. & Olson, A. J. Structural symmetry and protein function. Annual review of
biophysics and biomolecular structure 29, 105 (2000).
166. Hsia, Y. et al. Design of a hyperstable 60-subunit protein icosahedron. Nature 535, 136–139
(2016).
167. Cohen, T. & Welling, M. Group equivariant convolutional networks in International con-
ference on machine learning (2016), 2990–2999.
168. Cox, S. & White, A. D. Symmetric Molecular Dynamics. arXiv preprint arXiv:2204.01114
(2022).
169. Zabrodsky, H., Peleg, S. & Avnir, D. Continuous symmetry measures. Journal of the Amer-
ican Chemical Society 114, 7843–7851 (1992).
170. McWeeny, R. Symmetry: An introduction to group theory and its applications (Courier Cor-
poration, 2002).
171. Vincent, A. Molecular symmetry and group theory: a programmed introduction to chemical
applications (John Wiley & Sons, 2013).
172. Zee, A. Group theory in a nutshell for physicists (Princeton University Press, 2016).
173. Harvey, S. C., Tan, R. K.-Z. & Cheatham III, T. E. The flying ice cube: velocity rescaling
in molecular dynamics leads to violation of energy equipartition. Journal of computational
chemistry 19, 726–740 (1998).
174. Peyré, G., Cuturi, M., et al. Computational optimal transport: With applications to data
science. Foundations and Trends® in Machine Learning 11, 355–607 (2019).
175. Solomon, J., Peyré, G., Kim, V. G. & Sra, S. Entropic metric alignment for correspondence
problems. ACM Transactions on Graphics (ToG) 35, 1–13 (2016).
176. Alvarez-Melis, D. & Jaakkola, T. S. Gromov-Wasserstein Alignment of Word Embedding
Spaces in EMNLP (2018).
177. Tancik, M. et al. Fourier features let networks learn high frequency functions in low di-
mensional domains. Advances in Neural Information Processing Systems 33, 7537–7547
(2020).
178. Consortium, T. U. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids
Research 49, D480–D489. https://doi.org/10.1093/nar/gkaa1100 (Nov. 2020).
179. Black, S., Gao, L., Wang, P., Leahy, C. & Biderman, S. GPT-Neo: Large Scale Autoregres-
sive Language Modeling with Mesh-Tensorflow version 1.0. Mar. 2021. https://doi.org/10.5281/zenodo.5297715.
180. Gao, L. et al. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. CoRR
abs/2101.00027. arXiv: 2101.00027. https://arxiv.org/abs/2101.00027 (2021).
181. Lester, B., Al-Rfou, R. & Constant, N. The Power of Scale for Parameter-Efficient Prompt
Tuning in Proceedings of the 2021 Conference on Empirical Methods in Natural Language
Processing (2021), 3045–3059.
182. Sawyer, N. et al. Designed phosphoprotein recognition in Escherichia coli. ACS chemical
biology 9, 2502–2507 (2014).
183. Shen, W., Le, S., Li, Y. & Hu, F. SeqKit: a cross-platform and ultrafast toolkit for FASTA/Q
file manipulation. PloS one 11, e0163962 (2016).
184. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–
3100 (2018).
185. Micsonai, A. et al. BeStSel: webserver for secondary structure and fold prediction for pro-
tein CD spectroscopy. Nucleic Acids Research 50, W90–W98. ISSN: 0305-1048. eprint:
https://academic.oup.com/nar/article-pdf/50/W1/W90/44378197/gkac345.pdf. https://doi.org/10.1093/nar/gkac345 (May 2022).
186. Cianci, M. et al. P13, the EMBL macromolecular crystallography beamline at the low-
emittance PETRA III ring for high-and low-energy phasing with variable beam focusing.
Journal of synchrotron radiation 24, 323–332 (2017).
187. Kabsch, W. xds. Acta Crystallographica Section D: Biological Crystallography 66, 125–
132 (2010).
188. Evans, P. R. & Murshudov, G. N. How good are my data and what is the resolution? Acta
Crystallographica Section D: Biological Crystallography 69, 1204–1214 (2013).
189. McCoy, A. J. et al. Phaser crystallographic software. Journal of applied crystallography 40,
658–674 (2007).
190. Adams, P. D. et al. PHENIX: a comprehensive Python-based system for macromolecular
structure solution. Acta Crystallographica Section D: Biological Crystallography 66, 213–
221 (2010).
191. Vallat, R. Pingouin: statistics in Python. Journal of Open Source Software 3, 1026. https://doi.org/10.21105/joss.01026 (2018).