Score-Based Generative Modeling in Latent Space
Abstract
Score-based generative models (SGMs) have recently demonstrated impressive
results in terms of both sample quality and distribution coverage. However, they
are usually applied directly in data space and often require thousands of network
evaluations for sampling. Here, we propose the Latent Score-based Generative
Model (LSGM), a novel approach that trains SGMs in a latent space, relying on the
variational autoencoder framework. Moving from data to latent space allows us to
train more expressive generative models, apply SGMs to non-continuous data, and
learn smoother SGMs in a smaller space, resulting in fewer network evaluations
and faster sampling. To enable training LSGMs end-to-end in a scalable and stable
manner, we (i) introduce a new score-matching objective suitable to the LSGM
setting, (ii) propose a novel parameterization of the score function that allows SGM
to focus on the mismatch of the target distribution with respect to a simple Normal
one, and (iii) analytically derive multiple techniques for variance reduction of the
training objective. LSGM obtains a state-of-the-art FID score of 2.10 on CIFAR-10,
outperforming all existing generative results on this dataset. On CelebA-HQ-256,
LSGM is on a par with previous SGMs in sample quality while outperforming them
in sampling time by two orders of magnitude. In modeling binary images, LSGM
achieves state-of-the-art likelihood on the binarized OMNIGLOT dataset. Our
project page and code can be found at https://nvlabs.github.io/LSGM.
1 Introduction
The long-standing goal of likelihood-based generative learning is to faithfully learn a data distribution,
while also generating high-quality samples. Achieving these two goals simultaneously is a tremendous
challenge, which has led to the development of a plethora of different generative models. Recently,
score-based generative models (SGMs) demonstrated astonishing results in terms of both high sample
quality and likelihood [1, 2]. These models define a forward diffusion process that maps data to noise
by gradually perturbing the input data. Generation corresponds to a reverse process that synthesizes
novel data via iterative denoising, starting from random noise. The problem then reduces to learning
the score function—the gradient of the log-density—of the perturbed data [3]. In a seminal work,
Song et al. [2] show how this modeling approach is described with a stochastic differential equation
(SDE) framework which can be converted to maximum likelihood training [4]. Variants of SGMs
have been applied to images [1, 2, 5, 6], audio [7, 8, 9, 10], graphs [11] and point clouds [12, 13].
Albeit high quality, sampling from SGMs is computationally expensive. This is because generation
amounts to solving a complex SDE, or equivalently ordinary differential equation (ODE) (denoted as
the probability flow ODE in [2]), that maps a simple base distribution to the complex data distribution.
The resulting differential equations are typically complex and solving them accurately requires
numerical integration with very small step sizes, which results in thousands of neural network
evaluations [1, 2, 6]. Furthermore, generation complexity is uniquely defined by the underlying data
distribution and the forward SDE for data perturbation, implying that synthesis speed cannot be
∗ Equal contribution.
2 Background
Here, we review continuous-time score-based generative models (see [2] for an in-depth discussion).
Consider a forward diffusion process $\{z_t\}_{t=0}^{t=1}$ for a continuous time variable $t \in [0, 1]$, where $z_0$ is the
starting variable and zt its perturbation at time t. The diffusion process is defined by an Itô SDE:
\[
\mathrm{d}z = f(t)\, z\, \mathrm{d}t + g(t)\, \mathrm{d}w \tag{1}
\]
where f : R → R and g : R → R are scalar drift and diffusion coefficients, respectively, and w is
the standard Wiener process. $f(t)$ and $g(t)$ can be designed such that $z_1 \sim \mathcal{N}(z_1; 0, I)$ follows a Normal distribution at the end of the diffusion process.^2 Song et al. [2] show that the SDE in Eq. 1
can be converted to a generative model by first sampling from $z_1 \sim \mathcal{N}(z_1; 0, I)$ and then running the reverse-time SDE $\mathrm{d}z = [f(t)z - g(t)^2 \nabla_z \log q_t(z)]\, \mathrm{d}t + g(t)\, \mathrm{d}\bar{w}$, where $\bar{w}$ is a reverse-time standard Wiener process and $\mathrm{d}t$ is an infinitesimal negative time step. The reverse SDE requires knowledge of $\nabla_{z_t}\log q_t(z_t)$, the score function of the marginal distribution under the forward diffusion at time $t$.
One approach for estimating it is via the score matching objective^3:
\[
\min_\theta \; \mathbb{E}_{t\sim\mathcal{U}[0,1]}\Big[\lambda(t)\, \mathbb{E}_{q(z_0)} \mathbb{E}_{q(z_t|z_0)}\big[\|\nabla_{z_t}\log q(z_t) - \nabla_{z_t}\log p_\theta(z_t)\|_2^2\big]\Big] \tag{2}
\]
that trains the parametric score function $\nabla_{z_t}\log p_\theta(z_t)$ at time $t \sim \mathcal{U}[0, 1]$ for a given weighting
coefficient λ(t). q(z0 ) is the z0 -generating distribution and q(zt |z0 ) is the diffusion kernel, which is
available in closed form for certain f (t) and g(t). Since ∇zt log q(zt ) is not analytically available,
Song et al. [2] rely on denoising score matching [22] that converts the objective in Eq. 2 to:
\[
\min_\theta \; \mathbb{E}_{t\sim\mathcal{U}[0,1]}\Big[\lambda(t)\, \mathbb{E}_{q(z_0)} \mathbb{E}_{q(z_t|z_0)}\big[\|\nabla_{z_t}\log q(z_t|z_0) - \nabla_{z_t}\log p_\theta(z_t)\|_2^2\big]\Big] + C \tag{3}
\]
Vincent [22] shows that $C = \mathbb{E}_{t\sim\mathcal{U}[0,1]}\big[\lambda(t)\, \mathbb{E}_{q(z_0)} \mathbb{E}_{q(z_t|z_0)}[\|\nabla_{z_t}\log q(z_t)\|_2^2 - \|\nabla_{z_t}\log q(z_t|z_0)\|_2^2]\big]$ is
independent of θ, making the minimizations in Eq. 3 and Eq. 2 equivalent. Song et al. [4] show
that for $\lambda(t) = g(t)^2/2$, the minimizations correspond to approximate maximum likelihood training based on an upper bound on the Kullback-Leibler (KL) divergence between the target distribution and the
distribution defined by the reverse-time generative SDE with the learnt score function. In particular,
the objective of Eq. 2 can then be written:
" #
g(t)2 h i
Eq(z0 ) Eq(zt |z0 ) ||∇zt log q(zt ) − ∇zt log pθ (zt )||22
KL q(z0 )||pθ (z0 ) ≤ Et∼U [0,1] (4)
2
which can again be transformed into denoising score matching (Eq. 3) following Vincent [22].
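As a concrete illustration of the denoising score matching objective in Eq. 3, the sketch below implements one Monte Carlo training step for a VPSDE-style diffusion; the linear β(t) values and the interface of score_model are illustrative assumptions rather than the exact setup used in the paper.

```python
import torch

# Assumed linear VPSDE schedule: beta(t) = beta0 + (beta1 - beta0) * t.
BETA0, BETA1 = 0.1, 20.0

def vpsde_marginal(t):
    """Mean scale and variance of q(z_t | z_0) for the VPSDE with B(t) = int_0^t beta(s) ds."""
    B = BETA0 * t + 0.5 * (BETA1 - BETA0) * t ** 2
    return torch.exp(-0.5 * B), 1.0 - torch.exp(-B)

def dsm_loss(score_model, z0):
    """One Monte Carlo estimate of the denoising score matching objective (Eq. 3)
    with likelihood weighting lambda(t) = g(t)^2 / 2 = beta(t) / 2."""
    t = torch.rand(z0.shape[0], device=z0.device)              # t ~ U[0, 1]
    scale, var = vpsde_marginal(t)
    shape = (-1,) + (1,) * (z0.dim() - 1)
    mean = scale.view(shape) * z0
    std = var.sqrt().view(shape)
    eps = torch.randn_like(z0)
    zt = mean + std * eps                                       # z_t ~ q(z_t | z_0)
    target = -eps / std                                         # grad_{z_t} log q(z_t | z_0)
    score = score_model(zt, t)                                  # estimate of grad_{z_t} log p_theta(z_t)
    sq_err = ((score - target) ** 2).flatten(1).sum(dim=1)
    lam = 0.5 * (BETA0 + (BETA1 - BETA0) * t)                   # g(t)^2 = beta(t) for the VPSDE
    return (lam * sq_err).mean()
```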
following a VAE approach [14, 15], where qφ (z0 |x) approximates the true posterior p(z0 |x).
In this paper, we use Eq. 6 with the KL divergence decomposed into its entropy and cross entropy terms. The reconstruction and entropy terms are estimated easily for any explicit encoder as long as the reparameterization trick is available [14]. The challenging part in training LSGM is the cross entropy term, which involves the SGM prior. We motivate and present our expression for the
cross-entropy term in Sec. 3.1, the parameterization of the SGM prior in Sec. 3.2, different weighting
mechanisms for the training objective in Sec. 3.3, and variance reduction techniques in Sec. 3.4.
3.1 The Cross Entropy Term
One may ask why we do not train LSGM with Eq. 5 and rely on the KL in Eq. 4. Directly using the
KL expression in Eq. 4 is not possible, as it involves the marginal score ∇zt log q(zt ), which is
unavailable analytically for common non-Normal distributions q(z0 ) such as Normalizing flows.
2 Other distributions at t = 1 are possible; for instance, see the "variance-exploding" SDE in [2]. In this paper, however, we use only SDEs converging towards $\mathcal{N}(z_1; 0, I)$ at t = 1.
3 We omit the t-subscript of the diffused distributions $q_t$ in all score functions of the form $\nabla_{z_t}\log q_t(z_t)$.
Transforming into denoising score matching does not help either, since in that case the problematic
∇zt log q(zt ) term appears in the C term (see Eq. 3). In contrast to previous works [2, 22], we cannot
simply drop C, since it is, in fact, not constant but depends on q(zt ), which is trainable in our setup.
To circumvent this problem, we instead decompose the KL in Eq. 5 and rather work directly with the
cross entropy between the encoder distribution q(z0 |x) and the SGM prior p(z0 ). We show:
Theorem 1. Given two distributions q(z0 |x) and p(z0 ), defined in the continuous space RD , denote
the marginal distributions of diffused samples under the SDE in Eq. 1 at time t with q(zt |x) and
p(zt ). Assuming mild smoothness conditions on log q(zt |x) and log p(zt ), the cross entropy is:
" #
g(t)2 h i D
CE(q(z0 |x)||p(z0 )) = Et∼U [0,1] Eq(zt ,z0 |x) ||∇zt log q(zt |z0 )−∇zt log p(zt )||22 + log 2πeσ02 ,
2 2
with q(zt , z0 |x) = q(zt |z0 )q(z0 |x) and a Normal transition kernel q(zt |z0 ) = N (zt ; µt (z0 ), σt2 I),
where µt and σt2 are obtained from f (t) and g(t) for a fixed initial variance σ02 at t = 0.
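As a concrete special case (a sketch of the standard VPSDE result, stated here only for illustration; the generic expressions are derived in App. A), for $f(t) = -\frac{1}{2}\beta(t)$ and $g(t) = \sqrt{\beta(t)}$ the transition kernel takes the form
\[
q(z_t|z_0) = \mathcal{N}\big(z_t;\, \mu_t(z_0),\, \sigma_t^2 I\big), \quad \mu_t(z_0) = z_0 \exp\Big(-\tfrac{1}{2}\int_0^t \beta(s)\, \mathrm{d}s\Big), \quad \sigma_t^2 = 1 - \big(1 - \sigma_0^2\big)\exp\Big(-\int_0^t \beta(s)\, \mathrm{d}s\Big),
\]
so that $q(z_t|z_0)$ has variance $\sigma_0^2$ at $t = 0$ and approaches $\mathcal{N}(0, I)$ as the integrated noise grows.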
A proof with generic expressions for µt and σt2 as well as an intuitive interpretation are in App. A.
Importantly, unlike for the KL objective of Eq. 4, no problematic terms depending on the marginal
score ∇zt log q(zt |x) arise. This allows us to use this denoising score matching objective for the
cross entropy term in Theorem 1 not only for optimizing p(z0 ) (which is commonly done in the
score matching literature), but also for the q(z0 |x) encoding distribution. It can be used even
with complex q(z0 |x) distributions, defined, for example, in a hierarchical fashion [20, 21] or via
Normalizing flows [23, 24]. Our novel analysis shows that, for diffusion SDEs following Eq. 1, only
the cross entropy can be expressed purely with ∇zt log q(zt |z0 ). Neither KL nor entropy in [4] can
be expressed without the problematic term ∇zt log q(zt |x) (details in the Appendix).
Note that in Theorem 1, the term ∇zt log p(zt ) in the score matching expression corresponds to the
score that originates from diffusing an initial p(z0 ) distribution. In practice, we use the expression to
learn an SGM prior pθ (z0 ), which models ∇zt log p(zt ) by a neural network. With the learnt score
∇zt log pθ (zt ) (here we explicitly indicate the parameters θ to clarify that this is the learnt model), the
actual SGM prior is defined via the generative reverse-time SDE (or, alternatively, a closely-connected
ODE, see Sec. 2 and App. D), which generally defines its own, separate marginal distribution pθ (z0 )
at t = 0. Importantly, the learnt, approximate score ∇zt log pθ (zt ) is not necessarily the same as one
would obtain when diffusing pθ (z0 ). Hence, when considering the learnt score ∇zt log pθ (zt ), the
score matching expression in our Theorem only corresponds to an upper bound on the cross entropy
between q(z0 |x) and pθ (z0 ) defined by the generative reverse-time SDE. This is discussed in detail
in concurrent works [4, 25]. Hence, from the perspective of the learnt SGM prior, we are training
with an upper bound on the cross entropy (similar to the bound on the KL in Eq. 4), which can also be
considered as the continuous version of the discretized variational objective derived by Ho et al. [1].
3.2 Mixing Normal and Neural Score Functions
In VAEs [14], p(z0 ) is often chosen as a standard Normal N (z0 ; 0, I). For recent hierarchical
VAEs [20, 21], using the reparameterization trick, the prior can be converted to N (z0 ; 0, I) (App. E).
Considering a single dimensional latent space, we can assume that the prior at time $t$ is in the form of a geometric mixture $p(z_t) \propto \mathcal{N}(z_t; 0, 1)^{1-\alpha}\, p'_\theta(z_t)^{\alpha}$, where $p'_\theta(z_t)$ is a trainable SGM prior and $\alpha \in [0, 1]$ is a learnable scalar mixing coefficient. Formulating the prior this way has crucial advantages: (i) We can pretrain LSGM's autoencoder networks assuming $\alpha=0$, which corresponds to training the VAE with a standard Normal prior. This pretraining step brings the distribution of latent variables close to $\mathcal{N}(z_0; 0, 1)$, allowing the SGM prior to learn a much simpler distribution in the following end-to-end training stage. (ii) The score function for this mixture is of the form $\nabla_{z_t}\log p(z_t) = -(1-\alpha)z_t + \alpha\nabla_{z_t}\log p'_\theta(z_t)$. When the score function is dominated by the linear term, we expect that the reverse SDE can be solved faster, as its drift is dominated by this linear term.
For our multivariate latent space, we obtain diffused samples at time $t$ by sampling $z_t \sim q(z_t|z_0)$ with $z_t = \mu_t(z_0) + \sigma_t \epsilon$, where $\epsilon \sim \mathcal{N}(\epsilon; 0, I)$. Since we have $\nabla_{z_t}\log q(z_t|z_0) = -\epsilon/\sigma_t$, similar to [1], we parameterize the score function by $\nabla_{z_t}\log p(z_t) := -\epsilon_\theta(z_t, t)/\sigma_t$, where $\epsilon_\theta(z_t, t) := \sigma_t(1-\alpha) \odot z_t + \alpha \odot \epsilon'_\theta(z_t, t)$ is defined by our mixed score parameterization that is applied elementwise to the components of the score. With this, we simplify the cross entropy expression to:
\[
\mathrm{CE}\big(q_\phi(z_0|x)\,\|\,p_\theta(z_0)\big) = \mathbb{E}_{t\sim\mathcal{U}[0,1]}\Big[\frac{w(t)}{2}\, \mathbb{E}_{q_\phi(z_t,z_0|x),\epsilon}\big[\|\epsilon - \epsilon_\theta(z_t, t)\|_2^2\big]\Big] + \frac{D}{2}\log\big(2\pi e \sigma_0^2\big), \tag{7}
\]
where $w(t) = g(t)^2/\sigma_t^2$ is a time-dependent weighting scalar.
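A minimal sketch of the mixed score parameterization and the resulting Monte Carlo estimate of Eq. 7; the network interface, tensor shapes, and the per-channel trainable α are illustrative assumptions.

```python
import math
import torch
import torch.nn as nn

class MixedScore(nn.Module):
    """eps_theta(z_t, t) = sigma_t * (1 - alpha) * z_t + alpha * eps'_theta(z_t, t),
    applied elementwise; alpha in (0, 1) is trainable (one value per channel here)."""
    def __init__(self, neural_eps, num_channels):
        super().__init__()
        self.neural_eps = neural_eps                              # eps'_theta: the SGM prior network
        self.logit_alpha = nn.Parameter(torch.full((1, num_channels, 1, 1), -3.0))

    def forward(self, zt, t, sigma_t):
        alpha = torch.sigmoid(self.logit_alpha)
        return sigma_t * (1.0 - alpha) * zt + alpha * self.neural_eps(zt, t)

def cross_entropy_term(mixed_score, z0, t, mu_t, sigma_t, w_t, sigma0_sq):
    """Estimate of Eq. 7 for a batch of encoder samples z0, given per-sample
    mu_t(z0) and sigma_t (broadcastable to z0) and weighting w(t) = g(t)^2 / sigma_t^2."""
    D = z0[0].numel()
    eps = torch.randn_like(z0)
    zt = mu_t + sigma_t * eps                                     # z_t ~ q(z_t | z_0)
    err = ((eps - mixed_score(zt, t, sigma_t)) ** 2).flatten(1).sum(dim=1)
    const = 0.5 * D * math.log(2.0 * math.pi * math.e * sigma0_sq)
    return (0.5 * w_t * err).mean() + const
```

Initializing logit_alpha to a negative value keeps α small at the start, so the pretrained Normal component dominates early in end-to-end training.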
where Eq. 8 trains the VAE encoder and decoder parameters {φ, ψ} using the variational bound
L(x, φ, θ, ψ) from Eq. 6. Eq. 9 trains the prior with one of the three weighting mechanisms. Since
the SGM prior participates in the objective only in the cross entropy term, we only consider this term
when training the prior. Efficient algorithms for training with the objectives are presented in App. G.
3.4 Variance Reduction
The objectives in Eqs. 8 and 9 involve sampling of the time variable $t$, which has high variance [26]. We introduce several techniques for reducing this variance for all three objective weightings. We focus on the "variance preserving" SDEs (VPSDEs) [2, 1, 27], defined by $\mathrm{d}z = -\frac{1}{2}\beta(t)z\,\mathrm{d}t + \sqrt{\beta(t)}\,\mathrm{d}w$, where $\beta(t) = \beta_0 + (\beta_1 - \beta_0)t$ linearly interpolates in $[\beta_0, \beta_1]$ (other SDEs are discussed in App. B).
We denote the marginal distribution of latent variables by $q(z_0) := \mathbb{E}_{p_{\mathrm{data}}(x)}[q(z_0|x)]$. Here, we derive variance reduction techniques for $\mathrm{CE}(q(z_0)\|p(z_0))$, assuming that both $q(z_0)$ and $p(z_0)$ are $\mathcal{N}(z_0; 0, I)$. This is a reasonable simplification for our analysis because pretraining our LSGM model
with a N (z0 ; 0, I) prior brings q(z0 ) close to N (z0 ; 0, I) and our SGM prior is often dominated by
the fixed Normal mixture component. We empirically observe that the variance reduction techniques
developed with this assumption still work well when q(z0 ) and p(z0 ) are not exactly N (z0 ; 0, I).
Variance reduction for likelihood weighting: In App. B, for $q(z_0) = p(z_0) = \mathcal{N}(z_0; 0, I)$, we show that $\mathrm{CE}(q(z_0)\|p(z_0))$ is given by $\frac{D}{2}\mathbb{E}_{t\sim\mathcal{U}[0,1]}\big[\mathrm{d}\log\sigma_t^2/\mathrm{d}t\big] + \text{const}$. We consider two approaches:
(1) Geometric VPSDE: To reduce the variance when sampling $t$ uniformly, we can design the SDE such that $\mathrm{d}\log\sigma_t^2/\mathrm{d}t$ is constant for $t \in [0, 1]$. We show in App. B that $\beta(t) = \log(\sigma_{\max}^2/\sigma_{\min}^2)\,\frac{\sigma_t^2}{1-\sigma_t^2}$ with geometric variance $\sigma_t^2 = \sigma_{\min}^2(\sigma_{\max}^2/\sigma_{\min}^2)^t$ satisfies this condition. We call a VPSDE with this $\beta(t)$ a geometric VPSDE. $\sigma_{\min}^2$ and $\sigma_{\max}^2$ are the hyperparameters of the SDE, with $0 < \sigma_{\min}^2 < \sigma_{\max}^2 < 1$.
Although our geometric VPSDE has a geometric variance progression similar to the “variance
exploding” SDE (VESDE) [2], it still enjoys the “variance preserving” property of the VPSDE. In
App. B, we show that the VESDE does not come with a reduced variance for t-sampling by default.
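A small numerical sketch of the geometric VPSDE schedule described above; the values of σ²_min and σ²_max are placeholders.

```python
import numpy as np

SIGMA2_MIN, SIGMA2_MAX = 3e-5, 0.999   # placeholder hyperparameters, 0 < sigma2_min < sigma2_max < 1

def sigma2_geom(t):
    """Geometric variance sigma_t^2 = sigma2_min * (sigma2_max / sigma2_min)**t."""
    return SIGMA2_MIN * (SIGMA2_MAX / SIGMA2_MIN) ** t

def beta_geom(t):
    """beta(t) = log(sigma2_max / sigma2_min) * sigma_t^2 / (1 - sigma_t^2)."""
    s2 = sigma2_geom(t)
    return np.log(SIGMA2_MAX / SIGMA2_MIN) * s2 / (1.0 - s2)

# d log(sigma_t^2) / dt is constant, so uniform t-sampling already minimizes the variance
# of the likelihood-weighted objective under the Normal assumption above.
t = np.linspace(0.0, 1.0, 11)
dlog = np.gradient(np.log(sigma2_geom(t)), t)
assert np.allclose(dlog, np.log(SIGMA2_MAX / SIGMA2_MIN))
```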
(2) Importance sampling (IS): We can keep $\beta(t)$ and $\sigma_t^2$ unchanged for the original linear VPSDE, and instead use IS to minimize the variance. The theory of IS shows that the proposal $r(t) \propto \mathrm{d}\log\sigma_t^2/\mathrm{d}t$ has minimum variance [28]. In App. B, we show that we can sample from $r(t)$ using inverse transform sampling, $t = \mathrm{var}^{-1}\big((\sigma_1^2)^{\rho}(\sigma_0^2)^{1-\rho}\big)$, where $\mathrm{var}^{-1}$ is the inverse of $\sigma_t^2$ and $\rho \sim \mathcal{U}[0, 1]$. This variance reduction technique is available for any VPSDE with arbitrary $\beta(t)$.
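The inverse-transform sampling for the linear VPSDE can be sketched as below; β₀, β₁, and σ₀² are placeholder values, and var(t) follows the standard VPSDE kernel with initial variance σ₀².

```python
import numpy as np

BETA0, BETA1 = 0.1, 20.0      # placeholder linear-VPSDE parameters
SIGMA2_0 = 1e-4               # placeholder initial variance sigma_0^2

def var(t):
    """sigma_t^2 = 1 - (1 - sigma_0^2) * exp(-B(t)) with B(t) = beta0*t + 0.5*(beta1 - beta0)*t^2."""
    B = BETA0 * t + 0.5 * (BETA1 - BETA0) * t ** 2
    return 1.0 - (1.0 - SIGMA2_0) * np.exp(-B)

def var_inv(v):
    """Invert sigma_t^2 = v by solving the quadratic B(t) = log((1 - sigma_0^2) / (1 - v)) for t."""
    B = np.log((1.0 - SIGMA2_0) / (1.0 - v))
    a, b = 0.5 * (BETA1 - BETA0), BETA0
    return (-b + np.sqrt(b ** 2 + 4.0 * a * B)) / (2.0 * a)

def sample_t_is(n, rng=np.random.default_rng(0)):
    """t = var_inv((sigma_1^2)**rho * (sigma_0^2)**(1 - rho)) with rho ~ U[0, 1],
    i.e. samples from r(t) proportional to d log(sigma_t^2) / dt."""
    rho = rng.uniform(size=n)
    return var_inv(var(1.0) ** rho * SIGMA2_0 ** (1.0 - rho))

# Sanity check: var and var_inv are inverses of each other.
v = np.array([2e-4, 1e-2, 0.5, 0.99])
assert np.allclose(var(var_inv(v)), v)
```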
In Fig. 2, we train a small LSGM on CIFAR-10 with $w_{\mathrm{ll}}$ weighting using (i) the original VPSDE with uniform $t$ sampling, (ii) the same SDE but with our IS for $t$, and (iii) the proposed geometric VPSDE. Note how both (ii) and (iii) significantly reduce the variance and allow us to monitor the progress of the training objective. In this case, (i) has difficulty minimizing the objective due to the high variance. In App. B, we show how IS proposals can be formed for other SDEs, including the VESDE and Sub-VPSDE from [2].
4 Minimizing $\mathcal{L}(x, \phi, \theta, \psi)$ w.r.t. $\phi$ is equivalent to minimizing $\mathrm{KL}\big(q(z_0|x)\|p(z_0|x)\big)$ w.r.t. $q(z_0|x)$.
Variance reduction for unweighted and reweighted objectives: Analogous IS distributions can be derived for the objectives weighted with $w_{\mathrm{un}}$ and with $w_{\mathrm{re}}$. In App. B, we show that the optimal distribution is of the form $r(t) \propto \mathrm{d}\sigma_t^2/\mathrm{d}t$, which is sampled by $t = \mathrm{var}^{-1}\big((1-\rho)\sigma_0^2 + \rho\sigma_1^2\big)$ with $\rho \sim \mathcal{U}[0, 1]$. In Fig. 3, we visualize the IS distributions for the three weighting mechanisms for the linear VPSDE with the original $[\beta_0, \beta_1]$ parameters from [2]. $r(t)$ for the likelihood weighting is more tilted towards $t = 0$ due to the $1/\sigma_t^2$ term in $w_{\mathrm{ll}}$.
When using differently weighted objectives for training, we can either sample separate $t$ with different IS distributions for each objective, or use IS for the SGM objective (Eq. 9) and reweight the samples according to the likelihood objective for encoder training (Eq. 8). See App. G for details.
Figure 2: Variance reduction (training objective vs. epochs for the linear VPSDE with uniform $t$ sampling, the linear VPSDE with IS, and the geometric VPSDE).
Figure 3: IS distributions $r(t)$ for the weighted (likelihood), unweighted, and reweighted objectives.
4 Related Work
Our work builds on score-matching [29, 30, 31, 32, 33, 34, 35, 36, 37], specifically denoising score
matching [22], which makes our work related to recent generative models using denoising score
matching- and denoising diffusion-based objectives [3, 38, 1, 2, 6]. Among those, [1, 6] use a
discretized diffusion process with many noise scales, building on [27], while Song et al. [2] introduce
the continuous time framework using SDEs. Experimentally, these works focus on image modeling
and, contrary to us, work directly in pixel space. Various works recently tried to address the slow
sampling of these types of models and further improve output quality. [39] add an adversarial
objective, [5] introduce non-Markovian diffusion processes that allow trading off synthesis speed,
quality, and sample diversity, [40] learn a sequence of conditional energy-based models for denoising,
[41] distill the iterative sampling process into single shot synthesis, and [42] learn an adaptive noise
schedule, which is adjusted during synthesis to accelerate sampling. Further, [26] propose empirical
variance reduction techniques for discretized diffusions and introduce a new, heuristically motivated,
noise schedule. In contrast, our proposed noise schedule and our variance reduction techniques are
analytically derived and directly tailored to our learning setting in the continuous time setup.
Recently, [11] presented a method to generate graphs using score-based models, relaxing the entries
of adjacency matrices to continuous values. LSGM would allow modeling graph data more naturally
using encoders and decoders tailored to graphs [43, 44, 45, 46].
Since our model can be considered a VAE [14, 15] with score-based prior, it is related to approaches
that improve VAE priors. For example, Normalizing flows and hierarchical distributions [23, 24, 47,
48, 20, 21], as well as energy-based models [49, 50, 51, 52, 53] have been proposed as VAE priors.
Furthermore, classifiers [54, 55, 56], adversarial methods [57], and other techniques [58, 59] have
been used to define prior distributions implicitly. In two-stage training, a separate generative model
is trained in latent space as a new prior after training the VAE itself [60, 61, 62, 63, 64, 10]. Our
work also bears a resemblance to recent methods on improving the sampling quality in generative
adversarial networks using gradient flows in the latent space [65, 66, 67, 68], with the main difference
that these prior works use a discriminator to update the latent variables, whereas we train an SGM.
Concurrent works: [10] proposed to learn a denoising diffusion model in the latent space of a VAE
for symbolic music generation. This work does not introduce an end-to-end training framework of
the combined VAE and denoising diffusion model and instead trains them in two separate stages. In
contrast, concurrently with us [69] proposed an end-to-end training approach, and [70] combines
contrastive learning with diffusion models in the latent space of VAEs for controllable generation.
However, [10, 69, 70] consider the discretized diffusion objective [1], while we build on the continu-
ous time framework. Also, these models are not equipped with the mixed score parameterization and
variance reduction techniques, which we found crucial for the successful training of SGM priors.
Additionally, [71, 4, 25] concurrently with us proposed likelihood-based training of SGMs in data
space^5. [4] developed a bound on the data likelihood in Theorem 3 of their second version,
using a denoising score matching objective, closely related to our cross entropy expression. However,
our cross entropy expression is much simpler as we show how several terms can be marginalized
out analytically for the diffusion SDEs employed by us (see our proof in App. A). The same
marginalization can be applied to Theorem 3 in [4] when the drift coefficient takes a special affine
form (i.e., f (z, t) = f (t)z). Moreover, [25] discusses the likelihood-based training of SGMs from
a fundamental perspective and shows how several score matching objectives become a variational
bound on the data likelihood. [71] introduced a notion of signal-to-noise ratio (SNR) that results in a
noise-invariant parameterization of time that depends only on the initial and final noise. Interestingly,
our importance sampling distribution in Sec. 3.4 has a similar noise-invariant parameterization of
time via $t = \mathrm{var}^{-1}\big((\sigma_1^2)^{\rho}(\sigma_0^2)^{1-\rho}\big)$, which also depends only on the initial and final diffusion process
variances. We additionally show that this time parameterization results in the optimal minimum-
variance objective, if the distribution of latent variables follows a standard Normal distribution.
Finally, [72] proposed a modified time parameterization that allows modeling unbounded data scores.
5 Experiments
Here, we examine the efficacy of LSGM in learning generative models for images.
Implementation details: We implement LSGM using the NVAE [20] architecture as VAE backbone
and NCSN++ [2] as SGM backbone. NVAE has a hierarchical latent structure. The diffusion process
input z0 is constructed by concatenating the latent variables from all groups in the channel dimension.
For NVAEs with multiple spatial resolutions in latent groups, we only feed the smallest resolution
groups to the SGM prior and assume that the remaining groups have a standard Normal distribution.
Sampling: To generate samples from LSGM at test time, we use a black-box ODE solver [73] to
sample from the prior. Prior samples are then passed to the decoder to generate samples in data space.
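As an illustration of ODE-based prior sampling, the sketch below integrates the probability flow ODE of a VPSDE prior with a generic black-box solver (scipy's solve_ivp stands in for the solver of [73]; the score model, schedule, and shapes are assumptions, not our exact implementation):

```python
import torch
from scipy.integrate import solve_ivp

BETA0, BETA1 = 0.1, 20.0   # assumed linear VPSDE schedule of the prior

def sample_prior(score_model, shape, rtol=1e-3, atol=1e-3, t_end=1e-3):
    """Integrate the probability flow ODE dz/dt = -0.5 * beta(t) * (z + score(z, t))
    from t = 1 (standard Normal) down to t ~ 0 with a black-box ODE solver."""
    z1 = torch.randn(shape)

    def ode_func(t, z_flat):
        z = torch.from_numpy(z_flat).float().reshape(shape)
        t_vec = torch.full((shape[0],), float(t))
        with torch.no_grad():
            score = score_model(z, t_vec)                 # estimate of grad_z log p_t(z)
        beta = BETA0 + (BETA1 - BETA0) * float(t)
        return (-0.5 * beta * (z + score)).numpy().ravel()

    sol = solve_ivp(ode_func, t_span=(1.0, t_end), y0=z1.numpy().ravel(),
                    method="RK45", rtol=rtol, atol=atol)
    z0 = torch.from_numpy(sol.y[:, -1]).float().reshape(shape)
    return z0, sol.nfev   # z0 is passed through the VAE decoder; nfev counts solver evaluations
```

The solver's function-evaluation count corresponds to the NFE metric discussed in Sec. 5.1.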
Evaluation: We measure NELBO, an upper bound on negative log-likelihood (NLL), using Eq. 6.
For estimating log p(z0 ), we rely on the probability flow ODE [2], which provides an unbiased but
stochastic estimation of log p(z0 ). This stochasticity prevents us from performing an importance
weighted estimation of NLL [74] (see App. F for details). For measuring sample quality, Fréchet
inception distance (FID) [75] is evaluated with 50K samples. Implementation details in App. G.
5.1 Main Results
Unconditional color image generation: Here, we present our main results for unconditional image
generation on CIFAR-10 [89] (Tab. 2) and CelebA-HQ-256 (5-bit quantized) [88] (Tab. 3). For
CIFAR-10, we train 3 different models: LSGM (FID) and LSGM (balanced) both use the VPSDE
with linear β(t) and wun -weighting for the SGM prior in Eq. 9, while performing IS as derived in
Sec. 3.4. They only differ in how the backbone VAE is trained. LSGM (NLL) is a model that is
trained with our novel geometric VPSDE, using wll -weighting in the prior objective (further details
in App. G). When set up for high image quality, LSGM achieves a new state-of-the-art FID of
2.10. When tuned towards NLL, we achieve a NELBO of 2.87, which is significantly better than
previous score-based models. Only autoregressive models, which come with very slow synthesis, and
VDVAE [21] reach similar or higher likelihoods, but they usually have much poorer image quality.
For CelebA-HQ-256, we observe that when LSGM is trained with different SDE types and weighting
mechanisms, it often obtains similar NELBOs, potentially due to applying the SGM prior only to small
latent variable groups and using Normal priors at the larger groups. With wre -weighting and linear
VPSDE, LSGM obtains the state-of-the-art FID score of 7.22 on a par with the original SGM [2].
For both datasets, we also report results for the VAE backbone used in our LSGM. Although this
baseline achieves competitive NLL, its sample quality is behind our LSGM and the original SGM.
Modeling binarized images: Next, we examine LSGM on dynamically binarized MNIST [93] and
OMNIGLOT [74]. We apply LSGM to binary images using a decoder with pixel-wise independent
Bernoulli distributions. For these datasets, we report both NELBO and NLL in nats in Tab. 4 and
Tab. 5. On OMNIGLOT, LSGM achieves state-of-the-art likelihood of ≤87.79 nat, outperforming
previous models including VAEs with autoregressive decoders, and even when comparing its NELBO
5 We build on the V1 version of [4], which was substantially updated after the NeurIPS submission deadline.
Table 2: Generative performance on CIFAR-10.
Category    Method                  NLL↓    FID↓
Ours        LSGM (FID)              ≤3.43   2.10
Ours        LSGM (NLL)              ≤2.87   6.89
Ours        LSGM (balanced)         ≤2.95   2.17
Ours        VAE Backbone            2.96    43.18
VAEs        VDVAE [21]              2.87    -
VAEs        NVAE [20]               2.91    23.49
VAEs        VAEBM [76]              -       12.19
VAEs        NCP-VAE [56]            -       24.08
VAEs        BIVA [48]               3.08    -
VAEs        DC-VAE [77]             -       17.90
Score       NCSN [3]                -       25.32
Score       Rec. Likelihood [40]    3.18    9.36
Score       DSM-ALS [39]            3.65    -
Score       DDPM [1]                3.75    3.17
Score       Improved DDPM [26]      2.94    11.47
Score       SDE (DDPM++) [2]        2.99    2.92
Score       SDE (NCSN++) [2]        -       2.20
Flows       VFlow [19]              2.98    -
Flows       ANF [18]                3.05    -
Aut. Reg.   DistAug [78]            2.53    42.90
Aut. Reg.   Sp. Transformers [79]   2.80    -

Table 3: Generative results on CelebA-HQ-256.
Category      Method           NLL↓    FID↓
Ours          LSGM             ≤0.70   7.22
Ours          VAE Backbone     0.70    30.87
VAEs          NVAE [20]        0.70    29.76
VAEs          VAEBM [76]       -       20.38
VAEs          NCP-VAE [56]     -       24.79
VAEs          DC-VAE [77]      -       15.80
Score         Score SDE [2]    -       7.23
Flows         GLOW [85]        1.03    68.93
Aut. Reg.     SPN [86]         0.61    -
Adv. / GANs   LAE [87]         -       19.21
Adv. / GANs   VQ-GAN [64]      -       10.70
Adv. / GANs   PGGAN [88]       -       8.03

Figure 4: FID scores and NFEs for different ODE solver error tolerances (see Sec. 5.1).
Table 4: Dynamically binarized OMNIGLOT results.
Category    Method             NELBO↓   NLL↓
Ours        LSGM               87.79    ≤87.79
VAEs        NVAE [20]          93.92    90.75
VAEs        BIVA [48]          93.54    91.34
VAEs        DVAE++ [51]        -        92.38
VAEs        Ladder VAE [90]    -        102.11
VAEs        VLVAE [47]         -        89.83
Aut. Reg.   VampPrior [59]     -        89.76
Aut. Reg.   PixelVAE++ [91]    -        88.29

Table 5: Dynamically binarized MNIST results.
Category    Method             NELBO↓   NLL↓
Ours        LSGM               78.47    ≤78.47
VAEs        NVAE [20]          79.56    78.01
VAEs        BIVA [48]          80.06    78.41
VAEs        IAF-VAE [24]       80.80    79.10
VAEs        DVAE++ [51]        -        78.49
Aut. Reg.   PixelVAE++ [91]    -        78.00
Aut. Reg.   VampPrior [59]     -        78.45
Aut. Reg.   MAE [92]           -        77.98
against importance weighted estimation of NLL for other methods. On MNIST, LSGM outperforms
previous VAEs in NELBO, reaching a NELBO 1.09 nat lower than the state-of-the-art NVAE.
Qualitative results: We visualize qualitative results for all datasets in Fig. 5. On the complex
multimodal CIFAR-10 dataset, LSGM generates sharp and high-quality images. On CelebA-HQ-256,
LSGM generates diverse samples from different ethnicities and age groups with varying head poses and
facial expressions. On MNIST and OMNIGLOT, the generated characters are sharp and high-contrast.
Sampling time: We compare LSGM against the original SGM [2] trained on the CelebA-HQ-256
dataset in terms of sampling time and number of function evaluations (NFEs) of the ODE solver. Song
et al. [2] propose two main sampling techniques including predictor-corrector (PC) and probability
flow ODE. PC sampling involves 4000 NFEs and takes 44.6 min. on a Titan V for a batch of 16
images. It yields a 7.23 FID score (see Tab. 3). ODE-based sampling from the SGM takes 3.91 min. with 335 NFEs, but it obtains a poor FID score of 128.13 with $10^{-5}$ as the ODE solver error tolerance^6.
In stark contrast, ODE-based sampling from our LSGM takes 0.07 min. with an average of 23 NFEs, yielding a 7.22 FID score. LSGM is 637× and 56× faster than the original SGM's [2] PC and ODE
6 We use the VESDE checkpoint at https://github.com/yang-song/score_sde_pytorch. Song et al. [2] report that ODE-based sampling yields worse FID scores for their models (see D.4 in [2]). The problem is more severe for VESDEs. Unfortunately, at submission time only a VESDE model was released.
Figure 5 (c): OMNIGLOT samples.
Table 6: Ablations on SDEs, objectives, weighting mechanisms, and variance reduction. Details in App. G. Each row is a combination of SGM-objective weighting, t-sampling for the SGM objective (Eq. 9), and t-sampling for the q-objective (Eq. 8); "rew." denotes reweighting the SGM-objective samples.
Weighting   t-sampling (SGM-obj.)   t-sampling (q-obj.)   Geom. VPSDE FID↓ / NELBO↓   VPSDE FID↓ / NELBO↓
wll         U[0,1]                  rew.                  10.18 / 2.96                6.15 / 2.97
wll         rll(t)                  rew.                  n/a / n/a                   8.00 / 2.97
wun         U[0,1]                  rew.                  NaN / NaN                   NaN / NaN
wun         U[0,1]                  rll(t)                NaN / NaN                   NaN / NaN
wun         run(t)                  rew.                  n/a / n/a                   5.39 / 2.98
wun         run(t)                  rll(t)                n/a / n/a                   5.39 / 2.98
wre         U[0,1]                  rew.                  22.21 / 3.04                NaN / NaN
wre         U[0,1]                  rll(t)                NaN / NaN                   4.99 / 2.99
wre         rre(t)                  rew.                  7.29 / 2.99                 15.12 / 3.03
wre         rre(t)                  rll(t)                7.18 / 2.99                 6.19 / 2.99
sampling, respectively. In Fig. 4, we visualize FID scores and NFEs for different ODE solver error
tolerances. Our LSGM achieves low FID scores for relatively large error tolerances.
We identify three main reasons for this significantly faster sampling from LSGM: (i) The SGM prior
in our LSGM models latent variables with 32×32 spatial dim., whereas the original SGM [2] directly
models 256×256 images. The larger spatial dimensions require a deeper network to achieve a large
receptive field. (ii) Inspecting the SGM prior in our model suggests that the score function is heavily
dominated by the linear term at the end of training, as the mixing coefficients α are all < 0.02. This
makes our SGM prior smooth and numerically faster to solve. (iii) Since SGM is formed in the latent
space in our model, errors from solving the ODE can be corrected to some degree using the VAE
decoder, while in the original SGM [2] errors directly translate to artifacts in pixel space.
SDEs, objective weighting mechanisms and variance reduction. In Tab. 6, we analyze the differ-
ent weighting mechanisms and variance reduction techniques and compare the geometric VPSDE
with the regular VPSDE with linear β(t) [1, 2]. In the table, SGM-obj.-weighting denotes the weight-
ing mechanism used when training the SGM prior (via Eq. 9). t-sampling (SGM-obj.) indicates the
sampling approach for t, where rll (t), run (t) and rre (t) denote the IS distributions for the weighted
(likelihood), the unweighted, and the reweighted objective, respectively. For training the VAE encoder
qφ (z0 |x) (last term in Eq. 8), we either sample a separate batch t with importance sampling following
rll (t) (only necessary when the SGM prior is not trained with wll itself), or we reweight the samples
drawn for training the prior according to the likelihood objective (denoted by rew.). n/a indicates
fields that do not apply: The geometric VPSDE has optimal variance for the weighted (likelihood)
objective already with uniform sampling; there is no additional IS distribution. Also, we did not
derive IS distributions for the geometric VPSDE for wun . NaN indicates experiments that failed due
to training instabilities. Previous works [20, 21] have reported instabilities in training large VAEs. We find that our method inherits similar instabilities from VAEs; however, importance sampling often stabilizes the training of our LSGM. As expected, we obtain the best NELBOs (red) when training with
the weighted, maximum likelihood objective (wll ). Importantly, our new geometric VPSDE achieves
the best NELBO. Furthermore, the best FIDs (blue) are obtained either by unweighted (wun ) or
reweighted (wre ) SGM prior training, with only slightly worse NELBOs. These experiments were run
on the CIFAR10 dataset, using a smaller model than for our main results above (details in App. G).
End-to-end training. We proposed to train LSGM end-to-end, in contrast to [10]. Using a similar
setup as above we compare end-to-end training of LSGM during the second stage with freezing the
VAE encoder and decoder and only training the SGM prior in latent space during the second stage.
When training the model end-to-end, we achieve an FID of 5.19 and NELBO of 2.98; when freezing
the VAE networks during the second stage, we only get an FID of 9.00 and NELBO of 3.03. These
results clearly motivate our end-to-end training strategy.
Mixing Normal and neural score functions. We generally found training LSGM without our
proposed “mixed score” formulation (Sec. 3.2) to be unstable during end-to-end training, highlighting
its importance. To quantify the contribution of the mixed score parametrization for a stable model,
we train a small LSGM with only one latent variable group. In this case, without the mixed score,
we reached an FID of 34.71 and NELBO of 3.39; with it, we got an FID of 7.60 and NELBO of
3.29. Without the inductive bias provided by the mixed score, learning that the marginal distribution
is close to a Normal one for large t purely from samples can be very hard in the high-dimensional
latent space, where our diffusion is run. Furthermore, due to our importance sampling schemes, we
tend to oversample small, rather than large t. However, synthesizing high-quality images requires an
accurate score function estimate for all t. On the other hand, the log-likelihood of samples is highly
sensitive to local image statistics and primarily determined at small t. It is plausible that we are still
able to learn a reasonable estimate of the score function for these small t even without the mixed
score formulation. That may explain why log-likelihood suffers much less than sample quality, as
estimated by FID, when we remove the mixed score parameterization.
Additional experiments and model samples are presented in App. H.
6 Conclusions
We proposed the Latent Score-based Generative Model, a novel framework for end-to-end training of
score-based generative models in the latent space of a variational autoencoder. Moving from data
to latent space allows us to form more expressive generative models, model non-continuous data,
and reduce sampling time using smoother SGMs. To enable training latent SGMs, we made three
core contributions: (i) we derived a simple expression for the cross entropy term in the variational
objective, (ii) we parameterized the SGM prior by mixing Normal and neural score functions, and
(iii) we proposed several techniques for variance reduction in the estimation of the training objective.
Experimental results show that latent SGMs outperform recent pixel-space SGMs in terms of both
data likelihood and sample quality, and they can also be applied to binary datasets. In large image
generation, LSGM generates data several orders of magnitude faster than recent SGMs. Nevertheless,
LSGM’s synthesis speed does not yet permit sampling at interactive rates, and our implementation of
LSGM is currently limited to image generation. Therefore, future work includes further accelerating
sampling, applying LSGMs to other data types, and designing efficient networks for LSGMs.
7 Broader Impact
Generating high-quality samples while fully covering the data distribution has been a long-standing
challenge in generative learning. A solution to this problem will likely help reduce biases in generative
models and lead to improving overall representation of minorities in the data distribution. SGMs
are perhaps one of the first deep models that excel at both sample quality and distribution coverage.
However, the high computational cost of sampling limits their widespread use. Our proposed LSGM
reduces the sampling complexity of SGMs by a large margin and improves their expressivity further.
Thus, in the long term, it can enable the usage of SGMs in practical applications.
Here, LSGM is examined on the image generation task which has potential benefits and risks
discussed in [94, 95]. However, LSGM can be considered a generic framework that extends SGMs to
non-continuous data types. In principle LSGM could be used to model, for example, language [96,
97], music [98, 10], or molecules [99, 100]. Furthermore, like other deep generative models, it
can potentially be used also for non-generative tasks such as semi-supervised and representation
learning [101, 102, 103]. This makes the long-term social impacts of LSGM dependent on the
downstream applications.
Funding Statement
All authors were funded by NVIDIA through full-time employment.
References
[1] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. arXiv:2006.11239,
2020.
[2] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole.
Score-based generative modeling through stochastic differential equations. In International Conference
on Learning Representations, 2021.
[3] Yang Song and Stefano Ermon. Generative Modeling by Estimating Gradients of the Data Distribution.
In Advances in Neural Information Processing Systems 32, 2019.
[4] Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based
diffusion models. arXiv e-prints, 2021.
[5] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International
Conference on Learning Representations, 2021.
[6] Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. arXiv preprint
arXiv:2105.05233, 2021.
[7] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. WaveGrad:
Estimating Gradients for Waveform Generation. arXiv:2009.00713, 2020.
[8] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A Versatile Diffusion
Model for Audio Synthesis. arXiv:2009.09761, 2020.
[9] Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, and Nam Soo Kim. Diff-tts: A
denoising diffusion model for text-to-speech. arXiv:2104.01409, 2021.
[10] Gautam Mittal, Jesse Engel, Curtis Hawthorne, and Ian Simon. Symbolic music generation with diffusion
models. arXiv preprint arXiv:2103.16091, 2021.
[11] Chenhao Niu, Yang Song, Jiaming Song, Shengjia Zhao, Aditya Grover, and Stefano Ermon. Permutation
invariant graph generation via score-based generative modeling. In The 23rd International Conference on
Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy],
2020.
[12] Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, Zekun Hao, Serge J. Belongie, Noah Snavely, and
Bharath Hariharan. Learning gradient fields for shape generation. In Computer Vision - ECCV 2020 -
16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part III, volume 12348 of
Lecture Notes in Computer Science, pages 364–381. Springer, 2020.
[13] Shitong Luo and Wei Hu. Diffusion probabilistic models for 3d point cloud generation. arXiv:2103.01458,
2021.
[14] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In The International Conference
on Learning Representations (ICLR), 2014.
[15] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approx-
imate inference in deep generative models. In International Conference on Machine Learning, pages
1278–1286, 2014.
[16] Emilien Dupont, Arnaud Doucet, and Yee Whye Teh. Augmented neural ODEs. In Advances in Neural
Information Processing Systems, pages 3140–3150, 2019.
[17] Zhifeng Kong and Kamalika Chaudhuri. The expressive power of a class of normalizing flow models.
arXiv preprint arXiv:2006.00392, 2020.
[18] Chin-Wei Huang, Laurent Dinh, and Aaron Courville. Augmented normalizing flows: Bridging the gap
between generative flows and latent variable models. arXiv preprint arXiv:2002.07101, 2020.
[19] Jianfei Chen, Cheng Lu, Biqi Chenli, Jun Zhu, and Tian Tian. Vflow: More expressive generative flows
with variational data augmentation. arXiv preprint arXiv:2002.09741, 2020.
[20] Arash Vahdat and Jan Kautz. NVAE: A Deep Hierarchical Variational Autoencoder. arXiv:2007.03898,
2020.
[21] Rewon Child. Very deep VAEs generalize autoregressive models and can outperform them on images. In
International Conference on Learning Representations, 2021.
[22] Pascal Vincent. A Connection between Score Matching and Denoising Autoencoders. Neural Computa-
tion, 23(7):1661–1674, 2011.
[23] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv
preprint arXiv:1505.05770, 2015.
[24] Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling.
Improved variational inference with inverse autoregressive flow. In Advances in Neural Information
Processing Systems, pages 4743–4751, 2016.
[25] Chin-Wei Huang, Jae Hyun Lim, and Aaron Courville. A variational perspective on diffusion-based
generative models and score matching. arXiv preprint arXiv:2106.02808, 2021.
[26] Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. arXiv preprint
arXiv:2102.09672, 2021.
[27] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised
Learning Using Nonequilibrium Thermodynamics. In Proceedings of the 32nd International Conference
on International Conference on Machine Learning - Volume 37, ICML’15, page 2256–2265. JMLR.org,
2015.
[28] Art B. Owen. Monte Carlo theory, methods and examples. 2013.
[29] Aapo Hyvärinen. Estimation of Non-Normalized Statistical Models by Score Matching. J. Mach. Learn.
Res., 6:695–709, December 2005.
[30] Siwei Lyu. Interpretation and Generalization of Score Matching. In Proceedings of the Twenty-Fifth
Conference on Uncertainty in Artificial Intelligence, UAI ’09, page 359–366, Arlington, Virginia, USA,
2009. AUAI Press.
[31] Durk P Kingma and Yann L. Cun. Regularized estimation of image statistics by Score Matching. In
Advances in Neural Information Processing Systems 23, 2010.
[32] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized Denoising Auto-Encoders
as Generative Models. In Proceedings of the 26th International Conference on Neural Information
Processing Systems, 2013.
[33] Krzysztof J. Geras and Charles A. Sutton. Scheduled denoising autoencoders. In 3rd International
Conference on Learning Representations, ICLR, 2015.
[34] Saeed Saremi, Arash Mehrjou, Bernhard Schölkopf, and Aapo Hyvärinen. Deep Energy Estimator
Networks. arXiv:1805.08306, 2018.
[35] Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced Score Matching: A Scalable Approach to
Density and Score Estimation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial
Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019, 2019.
[36] Zengyi Li, Yubei Chen, and Friedrich T. Sommer. Learning Energy-Based Models in High-Dimensional
Spaces with Multi-scale Denoising Score Matching. arXiv:1910.07762, 2019.
[37] Tianyu Pang, Kun Xu, Chongxuan Li, Yang Song, Stefano Ermon, and Jun Zhu. Efficient Learning of
Generative Models via Finite-Difference Score Matching. arXiv:2007.03317, 2020.
[38] Yang Song and Stefano Ermon. Improved Techniques for Training Score-Based Generative Models.
arXiv:2006.09011, 2020.
[39] Alexia Jolicoeur-Martineau, Rémi Piché-Taillefer, Ioannis Mitliagkas, and Remi Tachet des Combes.
Adversarial score matching and improved sampling for image generation. In International Conference on
Learning Representations, 2021.
[40] Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, and Diederik P Kingma. Learning energy-based
models by diffusion recovery likelihood. In International Conference on Learning Representations, 2021.
[41] Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved
sampling speed. arXiv preprint arXiv:2101.02388, 2021.
[42] Robin San-Roman, Eliya Nachmani, and Lior Wolf. Noise estimation for generative diffusion models.
arXiv preprint arXiv:2104.02600, 2021.
[43] Martin Simonovsky and Nikos Komodakis. Graphvae: Towards generation of small graphs using
variational autoencoders. arXiv:1802.03480, 2018.
[44] Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Junction tree variational autoencoder for molecular
graph generation. In Proceedings of the 35th International Conference on Machine Learning, 2018.
[45] Aditya Grover, Aaron Zweig, and Stefano Ermon. Graphite: Iterative generative modeling of graphs. In
Proceedings of the 36th International Conference on Machine Learning, 2019.
[46] Renjie Liao, Yujia Li, Yang Song, Shenlong Wang, Will Hamilton, David K Duvenaud, Raquel Urtasun,
and Richard Zemel. Efficient graph generation with graph recurrent attention networks. In Advances in
Neural Information Processing Systems, 2019.
[47] Xi Chen, Diederik P Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever,
and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.
[48] Lars Maaløe, Marco Fraccaro, Valentin Liévin, and Ole Winther. BIVA: A very deep hierarchy of
latent variables for generative modeling. In Advances in neural information processing systems, pages
6548–6558, 2019.
[49] Jason Tyler Rolfe. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.
[50] Arash Vahdat, Evgeny Andriyash, and William G Macready. DVAE#: Discrete variational autoencoders
with relaxed Boltzmann priors. In Neural Information Processing Systems, 2018.
[51] Arash Vahdat, William G. Macready, Zhengbing Bian, Amir Khoshaman, and Evgeny Andriyash.
DVAE++: Discrete variational autoencoders with overlapping transformations. In International Confer-
ence on Machine Learning (ICML), 2018.
[52] Arash Vahdat, Evgeny Andriyash, and William G Macready. Undirected graphical models as approximate
posteriors. In International Conference on Machine Learning (ICML), 2020.
[53] Bo Pang, Tian Han, Erik Nijkamp, Song-Chun Zhu, and Ying Nian Wu. Learning latent space energy-
based prior model. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances
in Neural Information Processing Systems, volume 33, pages 21994–22008. Curran Associates, Inc.,
2020.
[54] Jesse Engel, Matthew Hoffman, and Adam Roberts. Latent constraints: Learning to generate conditionally
from unconditional generative models. In International Conference on Learning Representations, 2018.
[55] Matthias Bauer and Andriy Mnih. Resampled priors for variational autoencoders. In Kamalika Chaudhuri
and Masashi Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial
Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 66–75.
PMLR, 16–18 Apr 2019.
[56] Jyoti Aneja, Alexander Schwing, Jan Kautz, and Arash Vahdat. NCP-VAE: Variational autoencoders with
noise contrastive priors. arXiv preprint arXiv:2010.02917, 2020.
[57] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial
autoencoders, 2016.
[58] Hiroshi Takahashi, Tomoharu Iwata, Yuki Yamanaka, Masanori Yamada, and Satoshi Yagi. Variational
autoencoder with implicit optimal priors. Proceedings of the AAAI Conference on Artificial Intelligence,
33(01):5066–5073, Jul. 2019.
[59] Jakub Tomczak and Max Welling. Vae with a vampprior. In International Conference on Artificial
Intelligence and Statistics, pages 1214–1223, 2018.
[60] Bin Dai and David Wipf. Diagnosing and enhancing VAE models. In International Conference on
Learning Representations, 2019.
[61] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning.
arXiv preprint arXiv:1711.00937, 2018.
[62] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2.
In Advances in Neural Information Processing Systems, pages 14837–14847, 2019.
[63] Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael Black, and Bernhard Scholkopf. From
variational to deterministic autoencoders. In International Conference on Learning Representations,
2020.
[64] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image
synthesis. arXiv preprint arXiv:2012.09841, 2020.
[65] Abdul Fatir Ansari, Ming Liang Ang, and Harold Soh. Refining deep generative models via discriminator
gradient flow. In International Conference on Learning Representations, 2021.
[66] Tong Che, Ruixiang ZHANG, Jascha Sohl-Dickstein, Hugo Larochelle, Liam Paull, Yuan Cao, and
Yoshua Bengio. Your gan is secretly an energy-based model and you should use discriminator driven
latent sampling. In Advances in Neural Information Processing Systems, 2020.
[67] Akinori Tanaka. Discriminator optimal transport. In Advances in Neural Information Processing Systems,
2019.
[68] Weili Nie, Arash Vahdat, and Anima Anandkumar. Controllable and compositional generation with
latent-space energy-based models. In Neural Information Processing Systems (NeurIPS), 2021.
[69] Antoine Wehenkel and Gilles Louppe. Diffusion priors in variational autoencoders. In ICML Workshop
on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, 2021.
[70] Abhishek Sinha, Jiaming Song, Chenlin Meng, and Stefano Ermon. D2c: Diffusion-denoising models for
few-shot conditional generation. arXiv preprint arXiv:2106.06819, 2021.
[71] Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. arXiv
preprint arXiv:2107.00630, 2021.
[72] Dongjun Kim, Seungjae Shin, Kyungwoo Song, Wanmo Kang, and Il-Chul Moon. Score matching model
for unbounded data score. arXiv preprint arXiv:2106.05527, 2021.
[73] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential
equations. Advances in Neural Information Processing Systems, 2018.
[74] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint
arXiv:1509.00519, 2015.
[75] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans
trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural
information processing systems, pages 6626–6637, 2017.
[76] Zhisheng Xiao, Karsten Kreis, Jan Kautz, and Arash Vahdat. VAEBM: A symbiosis between variational
autoencoders and energy-based models. In International Conference on Learning Representations, 2021.
[77] Gaurav Parmar, Dacheng Li, Kwonjoon Lee, and Zhuowen Tu. Dual contradistinctive generative
autoencoder. arXiv preprint arXiv:2011.10063, 2020.
[78] Heewoo Jun, Rewon Child, Mark Chen, John Schulman, Aditya Ramesh, Alec Radford, and Ilya Sutskever.
Distribution augmentation for generative modeling. In International Conference on Machine Learning,
pages 5006–5019. PMLR, 2020.
[79] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse
transformers. arXiv preprint arXiv:1904.10509, 2019.
[80] Ali Razavi, Aäron van den Oord, Ben Poole, and Oriol Vinyals. Preventing posterior collapse with
delta-vaes. In The International Conference on Learning Representations (ICLR), 2019.
[81] XI Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An improved autoregres-
sive generative model. In International Conference on Machine Learning, 2018.
[82] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving
the pixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint
arXiv:1701.05517, 2017.
[83] Xinyu Gong, Shiyu Chang, Yifan Jiang, and Zhangyang Wang. Autogan: Neural architecture search for
generative adversarial networks. In The IEEE International Conference on Computer Vision (ICCV), Oct
2019.
[84] Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training
generative adversarial networks with limited data. In NeurIPS, 2020.
[85] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In
S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances
in Neural Information Processing Systems 31, pages 10236–10245, 2018.
[86] Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and
multidimensional upscaling. In International Conference on Learning Representations, 2019.
[87] Stanislav Pidhorskyi, Donald A Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
(CVPR), 2020.
[88] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved
quality, stability, and variation. In International Conference on Learning Representations, 2018.
[89] Alex Krizhevsky et al. Learning multiple layers of features from tiny images, 2009.
[90] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder
variational autoencoders. In Advances in neural information processing systems, pages 3738–3746, 2016.
[91] Hossein Sadeghi, Evgeny Andriyash, Walter Vinci, Lorenzo Buffoni, and Mohammad H Amin. Pixel-
vae++: Improved pixelvae with discrete prior. arXiv preprint arXiv:1908.09948, 2019.
[92] Xuezhe Ma, Chunting Zhou, and Eduard Hovy. MAE: Mutual posterior-divergence regularization for
variational autoencoders. In The International Conference on Learning Representations (ICLR), 2019.
[93] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[94] J. Bailey. The tools of generative art, from flash to neural networks. Art in America, 2020.
[95] Cristian Vaccari and Andrew Chadwick. Deepfakes and disinformation: Exploring the impact
of synthetic political video on deception, uncertainty, and trust in news. Social Media+ Society,
6(1):2056305120903408, 2020.
[96] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Jozefowicz, and Samy Bengio.
Generating sentences from a continuous space. In Proceedings of The 20th SIGNLL Conference on
Computational Natural Language Learning, pages 10–21, 2016.
[97] Chunyuan Li, Xiang Gao, Yuan Li, Xiujun Li, Baolin Peng, Yizhe Zhang, and Jianfeng Gao. Optimus:
Organizing sentences via pre-trained modeling of a latent space. In EMNLP, 2020.
[98] Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever.
Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020.
[99] Benjamin Sanchez-Lengeling and Alán Aspuru-Guzik. Inverse molecular design using machine learning:
Generative models for matter engineering. Science, 361(6400):360–365, 2018.
[100] Zaccary Alperstein, Artem Cherkasov, and Jason Tyler Rolfe. All smiles variational autoencoder. arXiv
preprint arXiv:1905.13343, 2019.
[101] Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised
learning with deep generative models. In Advances in Neural Information Processing Systems, pages
3581–3589, 2014.
[102] Augustus Odena. Semi-supervised learning with generative adversarial networks. arXiv preprint
arXiv:1606.01583, 2016.
[103] Daiqing Li, Junlin Yang, Karsten Kreis, Antonio Torralba, and Sanja Fidler. Semantic segmentation with
generative models: Semi-supervised learning and strong out-of-domain generalization. In Conference on
Computer Vision and Pattern Recognition (CVPR), 2021.
[104] Simo Särkkä and Arno Solin. Applied Stochastic Differential Equations. Institute of Mathematical
Statistics Textbooks. Cambridge University Press, United Kingdom, 2019.
[105] Brian D.O. Anderson. Reverse-time diffusion equation models. Stochastic Process. Appl., 12(3):313–326,
1982.
[106] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. Ffjord:
Free-form continuous dynamics for scalable reversible generative models. International Conference on
Learning Representations, 2019.
[107] Richard Zhang. Making convolutional networks shift-invariant again. In ICML, 2019.
[108] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
[109] J. R. Dormand and P. J. Prince. A family of embedded Runge–Kutta formulae. Journal of Computational
and Applied Mathematics, 6(1):19–26, 1980.
Appendix
1 Introduction
2 Background
4 Related Work
5 Experiments
5.1 Main Results
5.2 Ablation Studies
6 Conclusions
7 Broader Impact
B Variance Reduction
B.1 Generic Mixed Score Parameterization for Non-Variance Preserving SDEs
B.2 Variance Reduction of Cross Entropy with Importance Sampling for Generic SDEs
B.3 VPSDE
B.3.1 Variance Reduction for Likelihood Weighting (Geometric VPSDE)
B.3.2 Variance Reduction for Likelihood Weighting (Importance Sampling)
B.3.3 Variance Reduction for Unweighted Objective
B.3.4 Variance Reduction for Reweighted Objective
B.4 VESDE
B.4.1 Variance Reduction for Likelihood Weighting
B.4.2 Variance Reduction for Unweighted Objective
B.4.3 Variance Reduction for Reweighted Objective
B.5 Sub-VPSDE
F Bias in Importance Weighted Estimation of Log-Likelihood
H Additional Experiments
H.1 Additional Samples
H.2 MNIST: Small VAE Experiment
H.3 CIFAR-10: Neural Network Evaluations during Sampling
H.4 CIFAR-10: Sub-VPSDE vs. VPSDE
H.5 CelebA-HQ-256: Different ODE Solver Error Tolerances
H.6 CelebA-HQ-256: Ancestral Sampling
H.7 CelebA-HQ-256: Sampling from VAE Backbone vs. LSGM
H.8 Evolution Samples on the ODE and SDE Reverse Generative Process
A Proof for Theorem 1
Without loss of generality, we state the theorem in general form without conditioning on x.
Theorem 1. Given two distributions q(z_0) and p(z_0) defined in the continuous space R^D, denote the marginal distributions of diffused samples under the SDE dz = f(t) z dt + g(t) dw at time t ∈ [0, 1] with q(z_t) and p(z_t). Assuming that log q(z_t) and log p(z_t) are smooth with at most polynomial growth at z_t → ±∞, and also assuming that f(t) and g(t) are chosen such that q(z_1) = p(z_1) at t = 1, the cross entropy is given by:
$$
\mathrm{CE}(q(z_0)\|p(z_0)) = \mathbb{E}_{t\sim\mathcal{U}[0,1]}\left[\frac{g(t)^2}{2}\,\mathbb{E}_{q(z_t,z_0)}\Big[\|\nabla_{z_t}\log q(z_t|z_0)-\nabla_{z_t}\log p(z_t)\|_2^2\Big]\right] + \frac{D}{2}\log\big(2\pi e\sigma_0^2\big),
$$
with q(z_t, z_0) = q(z_t|z_0) q(z_0) and a Normal transition kernel q(z_t|z_0) = N(z_t; μ_t(z_0), σ_t² I), where μ_t and σ_t² are obtained from f(t) and g(t) for a fixed initial variance σ_0² at t = 0.
Theorem 1 amounts to estimating the cross entropy between q(z_0) and p(z_0) with denoising score matching and can be understood intuitively in the context of LSGM: we draw samples from a potentially complex encoding distribution q(z_0), add Gaussian noise with small initial variance σ_0² to obtain a well-defined initial distribution, and then smoothly perturb the sampled encodings using a diffusion process, while learning a denoising model, the SGM prior. Note that from the perspective of the learnt SGM prior, which is defined by the separate reverse-time generative SDE with the learnt score function model (see Sec. 2), the expression in our theorem becomes an upper bound (see discussion in Sec. 3.1).
Proof. The first part of our proof follows a proof strategy similar to the one used by Song et al. [4]. We start the proof with a diffusion process in the more generic form
$$
dz = f(z, t)\,dt + g(t)\,dw.
$$
The time-evolution of the probability densities q(z_t) and p(z_t) under this SDE is described by the Fokker-Planck equation [104] (note that we follow the same notation as in the main paper: we omit the t-subscript of the diffused distributions q_t, indicating the time dependence at the variable, i.e., q(z_t) ≡ q_t(z_t)):
$$
\frac{\partial q(z_t)}{\partial t} = \nabla_{z_t}\cdot\Big[\tfrac{1}{2}g^2(t)\,q(z_t)\,\nabla_{z_t}\log q(z_t) - f(z, t)\,q(z_t)\Big] = \nabla_{z_t}\cdot\big[h_q(z_t, t)\,q(z_t)\big] \tag{10}
$$
with
$$
h_q(z_t, t) := \tfrac{1}{2}g^2(t)\,\nabla_{z_t}\log q(z_t) - f(z, t), \tag{11}
$$
and analogously for p(z_t) with h_p(z_t, t). With these definitions, the cross entropy can be written as
$$
\mathrm{CE}(q(z_0)\|p(z_0)) = \mathrm{CE}(q(z_1)\|p(z_1)) + \int_1^0 \frac{\partial}{\partial t}\,\mathrm{CE}(q(z_t)\|p(z_t))\,dt = H\big(q(z_1)\big) - \int_0^1 \frac{\partial}{\partial t}\,\mathrm{CE}(q(z_t)\|p(z_t))\,dt,
$$
since q(z_1) = p(z_1), as assumed in the Theorem (in practice, the used SDEs are designed such that q(z_1) = p(z_1)).
Furthermore, we have
$$
\begin{aligned}
\frac{\partial}{\partial t}\,\mathrm{CE}(q(z_t)\|p(z_t)) &= -\int \Big[\frac{\partial q(z_t)}{\partial t}\,\log p(z_t) + \frac{q(z_t)}{p(z_t)}\,\frac{\partial p(z_t)}{\partial t}\Big]\,dz_t \\
&\overset{(i)}{=} -\int \Big[\nabla_{z_t}\cdot\big(h_q(z_t,t)\,q(z_t)\big)\,\log p(z_t) + \nabla_{z_t}\cdot\big(h_p(z_t,t)\,p(z_t)\big)\,\frac{q(z_t)}{p(z_t)}\Big]\,dz_t \\
&\overset{(ii)}{=} \int \Big[h_q(z_t,t)^\top q(z_t)\,\nabla_{z_t}\log p(z_t) + h_p(z_t,t)^\top p(z_t)\,\nabla_{z_t}\frac{q(z_t)}{p(z_t)}\Big]\,dz_t \\
&\overset{(iii)}{=} \int q(z_t)\Big[h_q(z_t,t)^\top\nabla_{z_t}\log p(z_t) + h_p(z_t,t)^\top\big(\nabla_{z_t}\log q(z_t) - \nabla_{z_t}\log p(z_t)\big)\Big]\,dz_t \\
&\overset{(iv)}{=} \mathbb{E}_{q(z_t)}\Big[g^2(t)\,\nabla_{z_t}\log q(z_t)^\top\nabla_{z_t}\log p(z_t) - \tfrac{1}{2}g^2(t)\,\|\nabla_{z_t}\log p(z_t)\|_2^2 - f(z,t)^\top\nabla_{z_t}\log q(z_t)\Big],
\end{aligned}
$$
where (i) inserts the Fokker-Planck equations for q(z_t) and p(z_t), respectively. Furthermore, (ii) is integration by parts, assuming similar limiting behavior of q(z_t) and p(z_t) at z_t → ±∞ as in Song et al. [4]. Specifically, we know that q(z_t) and p(z_t) must decay towards zero at z_t → ±∞ to be normalized. Furthermore, we assumed log q(z_t) and log p(z_t) to have at most polynomial growth (or decay, when looking at it from the other direction) at z_t → ±∞, which implies exponentially fast decay of q(z_t) and p(z_t). Also, ∇_{z_t} log q(z_t) and ∇_{z_t} log p(z_t) grow or decay at most polynomially, too, since the gradient of a polynomial is still a polynomial. Hence, one can work out that all terms to be evaluated at z_t → ±∞ after integration by parts vanish. Finally, (iii) uses the log-derivative trick and some rearrangements, and (iv) is obtained by inserting h_q and h_p.
Hence, we obtain
$$
\mathrm{CE}(q(z_0)\|p(z_0)) = H\big(q(z_1)\big) + \int_0^1 \mathbb{E}_{q(z_t)}\Big[\tfrac{1}{2}g^2(t)\,\|\nabla_{z_t}\log p(z_t)\|_2^2 + f(z_t,t)^\top\nabla_{z_t}\log q(z_t) - g^2(t)\,\nabla_{z_t}\log q(z_t)^\top\nabla_{z_t}\log p(z_t)\Big]\,dt,
$$
which we can interpret as a general score matching-based expression for calculating the cross entropy,
analogous to the expressions for the Kullback-Leibler divergence and entropy derived by Song et
al. [4].
However, as discussed in the main paper, dealing with the marginal score ∇zt log q(zt ) is problematic
for complex “input” distributions q(z0 ). Hence, we further transform the cross entropy expression
into a denoising score matching-based expression:
$$
\begin{aligned}
\mathrm{CE}(q(z_0)\|p(z_0)) &= H\big(q(z_1)\big) + \int_0^1 \mathbb{E}_{q(z_t)}\Big[\tfrac{1}{2}g^2(t)\|\nabla_{z_t}\log p(z_t)\|_2^2 + f(z_t,t)^\top\nabla_{z_t}\log q(z_t) - g^2(t)\,\nabla_{z_t}\log q(z_t)^\top\nabla_{z_t}\log p(z_t)\Big]\,dt \\
&\overset{(i)}{=} \frac{1}{2}\int_0^1 g(t)^2\,\mathbb{E}_{q(z_0,z_t)}\Big[-2\nabla_{z_t}\log q(z_t|z_0)^\top\nabla_{z_t}\log p(z_t) + \|\nabla_{z_t}\log p(z_t)\|_2^2\Big]\,dt \\
&\qquad + \frac{1}{2}\int_0^1 \mathbb{E}_{q(z_0,z_t)}\Big[2 f(z,t)^\top\nabla_{z_t}\log q(z_t|z_0)\Big]\,dt + H\big(q(z_1)\big) \\
&\overset{(ii)}{=} \frac{1}{2}\int_0^1 g(t)^2\,\mathbb{E}_{q(z_0,z_t)}\Big[\|\nabla_{z_t}\log q(z_t|z_0)\|_2^2 - 2\nabla_{z_t}\log q(z_t|z_0)^\top\nabla_{z_t}\log p(z_t) + \|\nabla_{z_t}\log p(z_t)\|_2^2\Big]\,dt \\
&\qquad + \frac{1}{2}\int_0^1 \mathbb{E}_{q(z_0,z_t)}\Big[2 f(z,t)^\top\nabla_{z_t}\log q(z_t|z_0) - g(t)^2\|\nabla_{z_t}\log q(z_t|z_0)\|_2^2\Big]\,dt + H\big(q(z_1)\big) \\
&\overset{(iii)}{=} \frac{1}{2}\int_0^1 g(t)^2\,\mathbb{E}_{q(z_0,z_t)}\Big[\|\nabla_{z_t}\log q(z_t|z_0) - \nabla_{z_t}\log p(z_t)\|_2^2\Big]\,dt \\
&\qquad + \underbrace{\frac{1}{2}\int_0^1 \mathbb{E}_{q(z_0,z_t)}\Big[\big(2 f(z,t) - g(t)^2\nabla_{z_t}\log q(z_t|z_0)\big)^\top\nabla_{z_t}\log q(z_t|z_0)\Big]\,dt}_{\text{(I): model-independent term}} + H\big(q(z_1)\big),
\end{aligned}
$$
with q(z_0, z_t) = q(z_t|z_0) q(z_0), and where in (i) we have used the following identity from Vincent [22]:
$$
\mathbb{E}_{q(z_t)}\big[\nabla_{z_t}\log q(z_t)\big] = \mathbb{E}_{q(z_t)}\,\mathbb{E}_{q(z_0|z_t)}\big[\nabla_{z_t}\log q(z_t|z_0)\big] = \mathbb{E}_{q(z_0)q(z_t|z_0)}\big[\nabla_{z_t}\log q(z_t|z_0)\big].
$$
In (ii), we have added and subtracted g(t)²||∇_{z_t} log q(z_t|z_0)||₂², and in (iii) we rearrange the terms into denoising score matching. In the following, we show that the term marked by (I) depends only on the diffusion parameters and does not depend on q(z_0) when f(z, t) takes a special affine (linear) form f(z, t) := f(t) z, which is often used for training SGMs and which we assume in our Theorem.
Note that for linear f(z, t) := f(t) z, we can derive the mean and variance (there are no “off-diagonal” covariance terms here, since all dimensions undergo diffusion independently) of the distribution q(z_t|z_0) at any time t in closed form, essentially solving the Fokker-Planck equation for this special case analytically. In that case, if the initial distribution at t = 0 is Normal, then the distribution stays Normal, and the mean and variance completely describe the distribution, i.e., q(z_t|z_0) = N(z_t; μ_t(z_0), σ_t² I). The mean and variance are given by the following differential equations and their solutions [104]:
$$
\frac{d\mu}{dt} = f(t)\,\mu \;\;\Rightarrow\;\; \mu_t = z_0\, e^{\int_0^t f(s)\,ds} \tag{12}
$$
$$
\frac{d\sigma^2}{dt} = 2 f(t)\,\sigma^2 + g^2(t) \;\;\Rightarrow\;\; \sigma_t^2 = \frac{1}{\tilde F(t)}\left(\int_0^t \tilde F(s)\, g^2(s)\,ds + \sigma_0^2\right), \qquad \tilde F(t) := e^{-2\int_0^t f(s)\,ds} \tag{13}
$$
Here, z0 denotes the mean of the distribution at t = 0 and σ02 the component-wise variance at
t = 0. After transforming into the denoising score matching expression above, what we are doing
is essentially drawing samples z0 from the potentially complex q(z0 ), then placing simple Normal
distributions with variance σ02 at those samples, and then letting those distributions evolve according
to the SDE. σ02 acts as a hyperparameter of the model.
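For concreteness, the closed-form transition kernel in Eqs. 12-13 can be evaluated directly once f(t) and g(t) are fixed. The following is a minimal numerical sketch for the VPSDE with linear β(t), so that f(t) = −β(t)/2 and g(t)² = β(t); the helper name and default values are illustrative and not taken from the released LSGM code.

```python
# Minimal sketch of Eqs. (12)-(13) for the affine case f(z, t) = f(t) z, specialized to
# the VPSDE with linear beta(t) = beta0 + (beta1 - beta0) * t, i.e., f(t) = -beta(t)/2 and
# g(t)^2 = beta(t). Names and defaults are illustrative.
import numpy as np

def vpsde_mean_std(z0, t, beta0=0.1, beta1=20.0, sigma0_sq=0.0):
    """Mean and std of q(z_t | z_0) for the linear-beta VPSDE.

    Eq. (12): mu_t = z0 * exp(-0.5 * int_0^t beta(s) ds).
    Eq. (13): sigma_t^2 = 1 - (1 - sigma0_sq) * exp(-int_0^t beta(s) ds).
    """
    int_beta = beta0 * t + 0.5 * (beta1 - beta0) * t ** 2   # int_0^t beta(s) ds
    mean = z0 * np.exp(-0.5 * int_beta)
    var = 1.0 - (1.0 - sigma0_sq) * np.exp(-int_beta)
    return mean, np.sqrt(var)

# Example: perturb a latent sample to time t = 0.5 in reparameterized form z_t = mu_t + sigma_t * eps.
z0 = np.random.randn(16)                      # stand-in for an encoder sample z_0
mean, std = vpsde_mean_std(z0, t=0.5)
zt = mean + std * np.random.randn(*z0.shape)
```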
In this case, i.e., when the distribution q(z_t|z_0) is Normal at all t, we can represent samples z_t from the intermediate distributions in reparameterized form z_t = μ_t(z_0) + σ_t ε with ε ∼ N(ε; 0, I). We also know that ∇_{z_t} log q(z_t|z_0) = −ε/σ_t. With this, we can write down (I) as:
$$
\begin{aligned}
\text{(I)} &= \frac{1}{2}\,\mathbb{E}_{q(z_0),\epsilon}\left[\int_0^1 \Big(2 f(t)\big(\mu_t(z_0) + \sigma_t\epsilon\big) + g(t)^2\,\frac{\epsilon}{\sigma_t}\Big)^\top\Big(-\frac{\epsilon}{\sigma_t}\Big)\,dt\right] &\text{(14)} \\
&= \int_0^1\Big[-\frac{f(t)}{\sigma_t}\underbrace{\mathbb{E}_{q(z_0),\epsilon}\big[\mu_t(z_0)^\top\epsilon\big]}_{=0} - \frac{2 f(t)\sigma_t^2 + g(t)^2}{2\sigma_t^2}\underbrace{\mathbb{E}_\epsilon\big[\epsilon^\top\epsilon\big]}_{=D}\Big]\,dt &\text{(15)} \\
&= -\frac{D}{2}\int_0^1 \frac{2 f(t)\sigma_t^2 + g(t)^2}{\sigma_t^2}\,dt &\text{(16)} \\
&= -\frac{D}{2}\int_{\sigma_0^2}^{\sigma_1^2} \frac{1}{\sigma_t^2}\,d\sigma_t^2 = \frac{D}{2}\big(\log\sigma_0^2 - \log\sigma_1^2\big), &\text{(17)}
\end{aligned}
$$
where we have used Eq. 13 (i.e., dσ_t² = (2 f(t) σ_t² + g(t)²) dt).
Furthermore, since q(z_1) → N(z_1; 0, σ_1² I) at t = 1, its entropy is H(q(z_1)) = (D/2) log(2πe σ_1²). With this, we get the following simple expression for the cross entropy:
$$
\mathrm{CE}(q(z_0)\|p(z_0)) = \frac{1}{2}\int_0^1 g(t)^2\,\mathbb{E}_{q(z_0,z_t)}\Big[\|\nabla_{z_t}\log q(z_t|z_0) - \nabla_{z_t}\log p(z_t)\|_2^2\Big]\,dt + \frac{D}{2}\log\big(2\pi e\sigma_0^2\big).
$$
Expressing the integral as an expectation over t ∼ U[0, 1] completes the proof:
$$
\mathrm{CE}(q(z_0)\|p(z_0)) = \mathbb{E}_{t\sim\mathcal{U}[0,1]}\left[\frac{g(t)^2}{2}\,\mathbb{E}_{q(z_t,z_0)}\Big[\|\nabla_{z_t}\log q(z_t|z_0) - \nabla_{z_t}\log p(z_t)\|_2^2\Big]\right] + \frac{D}{2}\log\big(2\pi e\sigma_0^2\big).
$$
The expression in Theorem 1 measures the cross entropy between q and p at t = 0. However, one
should consider practical implications of the choice of initial variance σ02 when estimating the cross
entropy between two distributions using our expression, as we discuss below.
Consider two arbitrary distributions q′(z) and p′(z). If the forward diffusion process has a non-zero initial variance (i.e., σ_0² > 0), the actual distributions q and p at t = 0 in the score matching expression are defined by q(z_0) := ∫ q′(z) N(z_0; z, σ_0² I) dz and p(z_0) := ∫ p′(z) N(z_0; z, σ_0² I) dz, which correspond to convolving q′(z) and p′(z) each with a Normal distribution with variance σ_0² I. In this case, q′(z) and p′(z) are not identical to q(z_0) and p(z_0), respectively, in general. However, we can approximate q′(z) and p′(z) using q(z_0) and p(z_0), respectively, when σ_0² is small. That is why our expression in Theorem 1, which measures CE(q(z_0)||p(z_0)), can be considered an approximation of CE(q′(z)||p′(z)) when σ_0² takes a small positive value. Note that in practice, our σ_0² is indeed generally very small (see Tab. 7).
On the other hand, when σ_0² = 0 (e.g., when using the VPSDE from Song et al. [2]), we know that q′(z) and p′(z) are identical to q(z_0) and p(z_0). However, in this case, the initial distribution at t = 0 is essentially an infinitely sharp Normal and we cannot evaluate the integral over the full interval t ∈ [0, 1]. Hence, we limit its range to t ∈ [ε, 1], where ε is another hyperparameter. In this case, we can approximate the cross entropy CE(q′(z)||p′(z)) using:
$$
\mathrm{CE}(q(z_0)\|p(z_0)) \approx \frac{1}{2}\int_\epsilon^1 g(t)^2\,\mathbb{E}_{q(z_0,z_t)}\Big[\|\nabla_{z_t}\log q(z_t|z_0) - \nabla_{z_t}\log p(z_t)\|_2^2\Big]\,dt + \frac{D}{2}\log\big(2\pi e\sigma_\epsilon^2\big)
= \mathbb{E}_{t\sim\mathcal{U}[\epsilon,1]}\left[\frac{g(t)^2}{2}\,\mathbb{E}_{q(z_t,z_0)}\Big[\|\nabla_{z_t}\log q(z_t|z_0) - \nabla_{z_t}\log p(z_t)\|_2^2\Big]\right] + \frac{D}{2}\log\big(2\pi e\sigma_\epsilon^2\big).
$$
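The cross entropy expression above can be estimated by straightforward Monte Carlo sampling over t, z_0, and ε. Below is a minimal sketch for the linear-β VPSDE with σ_0² = 0 and cutoff ε; the standard Normal score used in the example is only a stand-in for a learnt prior score model, and all names are our own.

```python
# Minimal Monte Carlo sketch of the cutoff cross-entropy estimate above for the
# linear-beta VPSDE with sigma_0^2 = 0. "score_fn" is a hypothetical stand-in for the
# learnt prior score grad_{z_t} log p(z_t); here we plug in the exact standard Normal score.
import numpy as np

def ce_dsm_estimate(z0_samples, score_fn, n_t=1024, eps=1e-2, beta0=0.1, beta1=20.0):
    D = z0_samples.shape[1]
    t = np.random.uniform(eps, 1.0, size=(n_t, 1))
    z0 = z0_samples[np.random.randint(0, z0_samples.shape[0], size=n_t)]
    int_beta = beta0 * t + 0.5 * (beta1 - beta0) * t ** 2
    sigma2 = 1.0 - np.exp(-int_beta)                 # sigma_t^2 for sigma_0^2 = 0
    mean = z0 * np.exp(-0.5 * int_beta)
    epsn = np.random.randn(n_t, D)
    zt = mean + np.sqrt(sigma2) * epsn
    g2 = beta0 + (beta1 - beta0) * t                 # g(t)^2 = beta(t)
    score_q = -epsn / np.sqrt(sigma2)                # grad_{z_t} log q(z_t | z_0)
    diff = score_q - score_fn(zt, t)
    sigma_eps2 = 1.0 - np.exp(-(beta0 * eps + 0.5 * (beta1 - beta0) * eps ** 2))
    const = 0.5 * D * np.log(2 * np.pi * np.e * sigma_eps2)
    return np.mean(0.5 * g2[:, 0] * np.sum(diff ** 2, axis=1)) + const

standard_normal_score = lambda zt, t: -zt            # stand-in for a trained SGM prior
ce = ce_dsm_estimate(np.random.randn(4096, 16), standard_normal_score)
```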
B Variance Reduction
The variance of the cross entropy in a mini-batch update depends on the variance of CE(q(z_0)||p(z_0)), where q(z_0) := E_{p_data(x)}[q(z_0|x)] is the aggregate posterior (i.e., the distribution of latent variables) and p_data is the data distribution. This is because, for training, we use a mini-batch estimation of E_{p_data(x)}[L(x, φ, θ, ψ)]. For the cross entropy term in L(x, φ, θ, ψ), we have E_{p_data(x)}[CE(q(z_0|x)||p(z_0))] = CE(q(z_0)||p(z_0)).
In order to study the variance of the training objective, we derive CE(q(z_0)||p(z_0)) analytically, assuming that both q(z_0) and p(z_0) are equal to N(z_0; 0, I). This is a reasonable simplification for our analysis
because pretraining our LSGM model with a N (z0 ; 0, I) prior brings q(z0 ) close to N (z0 ; 0, I)
and our SGM prior is often dominated by the fixed Normal mixture component. Nevertheless, we
empirically observe that the variance reduction techniques developed with this simplification still
work well when q(z0 ) and p(z0 ) are not exactly N (z0 ; 0, I).
In this section, we start with presenting the mixed score parameterization for generic SDEs in
App. B.1. Then, we discuss variance reduction with importance sampling for these generic SDEs in
App. B.2. Finally, in App. B.3 and App. B.4, we focus on variance reduction of the VPSDEs and
VESDEs, respectively, and we briefly discuss the Sub-VPSDE [2] in App. B.5.
B.1 Generic Mixed Score Parameterization for Non-Variance Preserving SDEs
The mixed score parameterization uses the score that is obtained when dealing with Normal input data and just predicts an additional residual score. In the main text, we assume that the variance of the standard Normal data stays the same throughout the diffusion process, which is the case for VPSDEs. However, the way Normal data diffuses generally depends on the underlying SDE, and generic SDEs behave differently than the regular VPSDE in that regard.
Consider generic forward SDEs of the form
$$
dz = f(t)\, z\,dt + g(t)\,dw. \tag{18}
$$
If our data distribution is standard Normal, i.e., z_0 ∼ N(z_0; 0, I), using Eq. 13 we have
$$
\mathring\sigma_t^2 := \frac{1}{\tilde F(t)}\left(\int_0^t \tilde F(s)\,g^2(s)\,ds + 1\right) = \frac{\tilde\sigma_t^2 + 1}{\tilde F(t)} \tag{19}
$$
with the definition σ̃_t² := ∫_0^t F̃(s) g²(s) ds. Hence, the score function at time t is ∇_{z_t} log p(z_t) = −z_t/σ̊_t². Using the geometric mixture p(z_t) ∝ N(z_t; 0, σ̊_t² I)^{1−α} p′_θ(z_t)^α, we can generally define our mixed score parameterization as
$$
\epsilon_\theta(z_t, t) := (1-\alpha)\,\frac{\sigma_t}{\mathring\sigma_t^2}\, z_t + \alpha\, \epsilon_\theta'(z_t, t). \tag{20}
$$
In the case of VPSDEs, we have σ̊_t² = 1, which corresponds to the mixed score introduced in the main text.
Remark: It is worth noting that both σ̊_t² and σ_t² are solutions to the same differential equation in Eq. 13 with different initial conditions. It is easy to see that σ̊_t² − σ_t² = (1 − σ_0²) F̃(t)^{−1}.
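A possible implementation of the generalized mixed score in Eq. 20 simply combines the analytic Normal term σ_t/σ̊_t² z_t with a neural ε-prediction network. The PyTorch sketch below is illustrative only: the network stand-in, the per-channel parameterization of α, and all names are assumptions rather than the released implementation.

```python
# Minimal PyTorch sketch of the mixed score parameterization in Eq. (20).
# "neural_eps_model" stands in for an NCSN++-style network; names are illustrative.
import torch
import torch.nn as nn

class MixedScore(nn.Module):
    def __init__(self, neural_eps_model, n_channels):
        super().__init__()
        self.eps_model = neural_eps_model
        # One mixing logit per channel; sigmoid keeps alpha in (0, 1).
        self.alpha_logit = nn.Parameter(torch.zeros(1, n_channels, 1, 1))

    def forward(self, zt, t, sigma_t, sigma_ring_t_sq):
        """eps_theta(z_t, t) = (1 - alpha) * sigma_t / ring_sigma_t^2 * z_t + alpha * eps'_theta(z_t, t)."""
        alpha = torch.sigmoid(self.alpha_logit)
        normal_part = (sigma_t / sigma_ring_t_sq) * zt      # analytic part for Normal data
        return (1.0 - alpha) * normal_part + alpha * self.eps_model(zt, t)

# Usage with a trivial stand-in network; for a VPSDE, sigma_ring_t_sq = 1.
net = lambda zt, t: torch.zeros_like(zt)
mixed = MixedScore(net, n_channels=8)
zt = torch.randn(4, 8, 16, 16)
eps = mixed(zt, t=torch.full((4, 1, 1, 1), 0.5),
            sigma_t=torch.tensor(0.7), sigma_ring_t_sq=torch.tensor(1.0))
```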
B.2 Variance Reduction of Cross Entropy with Importance Sampling for Generic SDEs
Let’s consider the cross entropy expression for p(z_0) = N(z_0; 0, I) and q(z_0) = N(z_0; 0, (1 − σ_0²) I), where we have scaled down the variance of q(z_0) to (1 − σ_0²) to accommodate the fact that the diffusion process with initial variance σ_0² applies a perturbation with variance σ_0² in its initial step (hence, the marginal distribution at t = 0 is N(z_0; 0, I) and we know that the optimal score is ε_θ(z_t, t) = (σ_t/σ̊_t²) z_t, i.e., the Normal component).
The cross entropy CE(q(z_0)||p(z_0)) with the optimal score ε_θ(z_t, t) = (σ_t/σ̊_t²) z_t is:
$$
\begin{aligned}
\mathrm{CE} - \text{const.} &= \frac{1}{2}\int_\epsilon^1 \frac{g^2(t)}{\sigma_t^2}\,\mathbb{E}_{z_0,\epsilon}\Big[\|\epsilon - \epsilon_\theta(z_t,t)\|_2^2\Big]\,dt &\text{(21)}\\
&= \frac{1}{2}\int_\epsilon^1 \frac{g^2(t)}{\sigma_t^2}\,\mathbb{E}_{z_0,\epsilon}\Big[\big\|\epsilon - \tfrac{\sigma_t}{\mathring\sigma_t^2} z_t\big\|_2^2\Big]\,dt &\text{(22)}\\
&= \frac{1}{2}\int_\epsilon^1 \frac{g^2(t)}{\sigma_t^2}\,\mathbb{E}_{z_0,\epsilon}\Big[\big\|\epsilon - \tfrac{\sigma_t}{\mathring\sigma_t^2}\big(\tilde F(t)^{-\frac12} z_0 + \sigma_t\epsilon\big)\big\|_2^2\Big]\,dt &\text{(23)}\\
&= \frac{1}{2}\int_\epsilon^1 \frac{g^2(t)}{\sigma_t^2}\,\mathbb{E}_{z_0,\epsilon}\Big[\big\|\tfrac{\mathring\sigma_t^2 - \sigma_t^2}{\mathring\sigma_t^2}\,\epsilon - \tfrac{\sigma_t}{\mathring\sigma_t^2}\tilde F(t)^{-\frac12} z_0\big\|_2^2\Big]\,dt &\text{(24)}\\
&= \frac{1}{2}\int_\epsilon^1 \frac{g^2(t)}{\sigma_t^2}\left(\frac{(\mathring\sigma_t^2-\sigma_t^2)^2}{(\mathring\sigma_t^2)^2}\,\mathbb{E}_\epsilon\big[\|\epsilon\|_2^2\big] + \frac{\sigma_t^2}{(\mathring\sigma_t^2)^2}\tilde F(t)^{-1}\,\mathbb{E}_{z_0}\big[\|z_0\|_2^2\big]\right)dt &\text{(25)}\\
&= \frac{D}{2}\int_\epsilon^1 \frac{g^2(t)}{\sigma_t^2}\left(\frac{(\mathring\sigma_t^2-\sigma_t^2)^2}{(\mathring\sigma_t^2)^2} + \frac{\sigma_t^2}{(\mathring\sigma_t^2)^2}\tilde F(t)^{-1}\,(1-\sigma_0^2)\right)dt &\text{(26)}\\
&= \frac{D}{2}\int_\epsilon^1 \frac{g^2(t)}{\sigma_t^2}\left(\frac{(\mathring\sigma_t^2-\sigma_t^2)^2}{(\mathring\sigma_t^2)^2} + \frac{\sigma_t^2\,(\mathring\sigma_t^2-\sigma_t^2)}{(\mathring\sigma_t^2)^2}\right)dt &\text{(27)}\\
&= \frac{D}{2}\int_\epsilon^1 \frac{g^2(t)}{\sigma_t^2}\,dt - \frac{D}{2}\int_\epsilon^1 \frac{g^2(t)}{\mathring\sigma_t^2}\,dt &\text{(28)}\\
&= \frac{D}{2}\int_\epsilon^1 \frac{\frac{d}{dt}\sigma_t^2 - 2f(t)\sigma_t^2}{\sigma_t^2}\,dt - \frac{D}{2}\int_\epsilon^1 \frac{\frac{d}{dt}\mathring\sigma_t^2 - 2f(t)\mathring\sigma_t^2}{\mathring\sigma_t^2}\,dt &\text{(29)}\\
&= \frac{D}{2}\int_\epsilon^1 \frac{\frac{d}{dt}\sigma_t^2}{\sigma_t^2}\,dt - \frac{D}{2}\int_\epsilon^1 \frac{\frac{d}{dt}\mathring\sigma_t^2}{\mathring\sigma_t^2}\,dt &\text{(30)}\\
&= D\,\frac{1-\epsilon}{2}\,\mathbb{E}_{t\sim\mathcal{U}[\epsilon,1]}\left[\frac{d}{dt}\log\frac{\sigma_t^2}{\mathring\sigma_t^2}\right] &\text{(31)}\\
&= D\,\frac{1-\epsilon}{2}\,\mathbb{E}_{t\sim\mathcal{U}[\epsilon,1]}\left[\frac{d}{dt}\log\frac{\tilde\sigma_t^2+\sigma_0^2}{\tilde\sigma_t^2+1}\right], &\text{(32)}
\end{aligned}
$$
where in Eq. 23 we have used z_t = F̃(t)^{−1/2} z_0 + σ_t ε. In Eq. 25, we have used the fact that z_0 and ε are independent. In Eq. 27, we have used the identity σ̊_t² − σ_t² = (1 − σ_0²) F̃(t)^{−1}. In Eq. 29, we have used g²(t) = dσ_t²/dt − 2 f(t) σ_t² from Eq. 13.
Therefore, the IW distribution with minimum variance for CE(q(z_0)||p(z_0)) is
$$
r(t) \propto \frac{d}{dt}\log\frac{\tilde\sigma_t^2+\sigma_0^2}{\tilde\sigma_t^2+1} \tag{33}
$$
with normalization constant
$$
\tilde R = \log\left(\frac{\tilde\sigma_1^2+\sigma_0^2}{\tilde\sigma_1^2+1}\cdot\frac{\tilde\sigma_\epsilon^2+1}{\tilde\sigma_\epsilon^2+\sigma_0^2}\right) \tag{34}
$$
and CDF
$$
R(t) = \frac{1}{\tilde R}\log\left(\frac{\tilde\sigma_t^2+\sigma_0^2}{\tilde\sigma_t^2+1}\cdot\frac{\tilde\sigma_\epsilon^2+1}{\tilde\sigma_\epsilon^2+\sigma_0^2}\right). \tag{35}
$$
Hence, the inverse CDF is
$$
t = \big(\tilde\sigma^2\big)^{\mathrm{inv}}\!\left(\frac{\sigma_0^2 - \left(\frac{\tilde\sigma_\epsilon^2+\sigma_0^2}{\tilde\sigma_\epsilon^2+1}\right)^{1-\rho}\left(\frac{\tilde\sigma_1^2+\sigma_0^2}{\tilde\sigma_1^2+1}\right)^{\rho}}{\left(\frac{\tilde\sigma_\epsilon^2+\sigma_0^2}{\tilde\sigma_\epsilon^2+1}\right)^{1-\rho}\left(\frac{\tilde\sigma_1^2+\sigma_0^2}{\tilde\sigma_1^2+1}\right)^{\rho} - 1}\right), \tag{36}
$$
where (σ̃²)^inv denotes the inverse of the mapping t ↦ σ̃_t². Finally, the cross entropy objective with importance weighting becomes
$$
\begin{aligned}
\frac{1}{2}\int_\epsilon^1 \frac{g^2(t)}{\sigma_t^2}\,\mathbb{E}_{z_0,\epsilon}\Big[\|\epsilon-\epsilon_\theta(z_t,t)\|_2^2\Big]\,dt &= \mathbb{E}_{t\sim r(t)}\left[\frac{\tilde R}{2}\,\frac{1+\tilde\sigma_t^2}{1-\sigma_0^2}\,\mathbb{E}_{z_0,\epsilon}\Big[\|\epsilon-\epsilon_\theta(z_t,t)\|_2^2\Big]\right] &\text{(37)}\\
&= \frac{1}{2}\log\left(\frac{\tilde\sigma_1^2+\sigma_0^2}{\tilde\sigma_1^2+1}\cdot\frac{\tilde\sigma_\epsilon^2+1}{\tilde\sigma_\epsilon^2+\sigma_0^2}\right)\mathbb{E}_{t\sim r(t)}\left[\frac{1+\tilde\sigma_t^2}{1-\sigma_0^2}\,\mathbb{E}_{z_0,\epsilon}\Big[\|\epsilon-\epsilon_\theta(z_t,t)\|_2^2\Big]\right] &\text{(38)}
\end{aligned}
$$
The idea here is to write everything as a function of σ̃_t² = ∫_0^t F̃(s) g²(s) ds. We see that σ̃_t² is monotonically increasing for any g(t) and f(t); hence, it always has an inverse and inverse transform sampling is, in principle, always possible. However, we should pick g(t) and f(t) such that σ̃_t² and its inverse are also analytically tractable to avoid dealing with numerical methods.
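When σ̃_t² or its inverse is not analytically tractable, the inverse CDF in Eq. 36 can still be approximated numerically by tabulating the CDF in Eq. 35 on a grid. The following is a minimal numerical sketch under an illustrative choice of σ̃_t²; in practice we pick SDEs for which this inversion is analytic, as noted above. All helper names are ours.

```python
# Numerical inverse transform sampling for the proposal r(t) in Eq. (33): tabulate the
# CDF from Eq. (35) on a grid and invert it by interpolation. Names are illustrative.
import numpy as np

def sample_t_importance(sigma_tilde_sq, sigma0_sq, n_samples, eps=1e-5, grid=10_000):
    """sigma_tilde_sq: callable t -> sigma_tilde_t^2 (monotonically increasing)."""
    t_grid = np.linspace(eps, 1.0, grid)
    s = sigma_tilde_sq(t_grid)
    log_ratio = np.log((s + sigma0_sq) / (s + 1.0))
    cdf = (log_ratio - log_ratio[0]) / (log_ratio[-1] - log_ratio[0])   # Eq. (35), normalized by Eq. (34)
    rho = np.random.uniform(size=n_samples)
    return np.interp(rho, cdf, t_grid)                                   # numerical inverse CDF

# Example with an illustrative monotone choice of sigma_tilde_t^2 (geometric growth):
sigma_min_sq, sigma_max_sq = 1e-4, 0.999
sig = lambda t: sigma_min_sq * ((sigma_max_sq / sigma_min_sq) ** t - 1.0)
t_samples = sample_t_importance(sig, sigma0_sq=sigma_min_sq, n_samples=4096)
```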
B.3 VPSDE
Using Eq. 13, we can find an expression for β(t) that generates a geometric noise schedule σ_t² = σ_min² (σ_max²/σ_min²)^t:
$$
\beta(t) = \frac{1}{1-\sigma_t^2}\frac{d\sigma_t^2}{dt} = \frac{\sigma_t^2}{1-\sigma_t^2}\log\!\left(\frac{\sigma_{\mathrm{max}}^2}{\sigma_{\mathrm{min}}^2}\right) = \frac{\sigma_{\mathrm{min}}^2\left(\frac{\sigma_{\mathrm{max}}^2}{\sigma_{\mathrm{min}}^2}\right)^t}{1-\sigma_{\mathrm{min}}^2\left(\frac{\sigma_{\mathrm{max}}^2}{\sigma_{\mathrm{min}}^2}\right)^t}\log\!\left(\frac{\sigma_{\mathrm{max}}^2}{\sigma_{\mathrm{min}}^2}\right) \tag{43}
$$
We call a VPSDE with β(t) defined as above a geometric VPSDE. For small σ_min² and σ_max² close to 1, all inputs diffuse closely towards the standard Normal prior at t = 1. In that regard, notice that our geometric VPSDE is well-behaved with positive β(t) only within the relevant interval t ∈ [0, 1] and for 0 < σ_min² < σ_max² < 1. These conditions also imply σ_t² < 1 for all t ∈ [0, 1]. This is expected for any VPSDE: we can approach unit variance arbitrarily closely but not reach it exactly.
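As a small illustration, the geometric VPSDE's variance schedule and β(t) from Eq. 43 can be implemented in a few lines; the helper names are ours, and the default values follow the hyperparameters in Tab. 7.

```python
# Minimal sketch of the geometric VPSDE from Eq. (43): sigma_t^2 follows a geometric
# schedule, and beta(t) is chosen so the variance-preserving SDE realizes it.
import numpy as np

def geometric_vpsde_sigma_sq(t, sigma_min_sq=3e-5, sigma_max_sq=0.999):
    return sigma_min_sq * (sigma_max_sq / sigma_min_sq) ** t

def geometric_vpsde_beta(t, sigma_min_sq=3e-5, sigma_max_sq=0.999):
    sigma_sq = geometric_vpsde_sigma_sq(t, sigma_min_sq, sigma_max_sq)
    return sigma_sq / (1.0 - sigma_sq) * np.log(sigma_max_sq / sigma_min_sq)

# beta(t) stays positive on [0, 1] and sigma_1^2 approaches (but never reaches) 1.
t = np.linspace(0.0, 1.0, 5)
print(geometric_vpsde_beta(t), geometric_vpsde_sigma_sq(t))
```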
Importantly, our geometric VPSDE is different from the “variance-exploding” SDE (VESDE) proposed by Song et al. [5] (also see App. C). The VESDE leverages an SDE in which the variance grows in an almost unbounded way, while the mean of the input distribution stays constant. Because of this, the hyperparameters of the VESDE must be chosen carefully in a data-dependent manner [38], which can be problematic in our case (see discussion in App. B.4). Furthermore, Song et al. also found that the VESDE does not perform well when used with probability flow-based sampling [2]. In contrast, our geometric VPSDE combines the variance-preserving behavior of the VPSDE (i.e., standard Normal input data remains standard Normal throughout the diffusion process; all individual inputs diffuse towards the standard Normal prior) with the geometric growth of the variance in the diffusion process, which was first used in the VESDE.
Finally, for the geometric VPSDE we also have that ∂/∂t CE(q(z_t)||p(z_t)) = const. for Normal input data. Hence, data is encoded “as continuously as possible” throughout the diffusion process. This is in line with the arguments made by Song et al. in [38]. We hypothesize that this is particularly beneficial towards learning models with strong likelihood or NELBO performance. Indeed, in our experiments we observe the geometric VPSDE to perform best on this metric.
B.3.3 Variance Reduction for Unweighted Objective
For the unweighted objective, importance sampling uses the proposal distribution r(t) ∝ 1 − σ_t². Recall that in the VPSDE with linear β(t) = β_0 + (β_1 − β_0) t, we have
$$
1-\sigma_t^2 = (1-\sigma_0^2)\,e^{-\int_0^t\beta(s)\,ds} = (1-\sigma_0^2)\,e^{-\beta_0 t - (\beta_1-\beta_0)\frac{t^2}{2}}. \tag{51}
$$
Hence, the normalization constant of r(t) is
$$
\begin{aligned}
\tilde R &= \int_\epsilon^1 (1-\sigma_0^2)\,e^{-\beta_0 t-(\beta_1-\beta_0)\frac{t^2}{2}}\,dt &\text{(52)}\\
&= \underbrace{(1-\sigma_0^2)\,e^{\frac{\beta_0^2}{2(\beta_1-\beta_0)}}\sqrt{\frac{\pi}{2(\beta_1-\beta_0)}}}_{:=A_{\tilde R}}\left[\mathrm{erf}\!\left(\sqrt{\tfrac{\beta_1-\beta_0}{2}} + \tfrac{\beta_0}{\sqrt{2(\beta_1-\beta_0)}}\right) - \mathrm{erf}\!\left(\sqrt{\tfrac{\beta_1-\beta_0}{2}}\,\epsilon + \tfrac{\beta_0}{\sqrt{2(\beta_1-\beta_0)}}\right)\right]. &\text{(53)}
\end{aligned}
$$
Similarly, we can write the CDF of r(t) as
$$
R(t) = \frac{A_{\tilde R}}{\tilde R}\left[\mathrm{erf}\!\left(\sqrt{\tfrac{\beta_1-\beta_0}{2}}\,t + \tfrac{\beta_0}{\sqrt{2(\beta_1-\beta_0)}}\right) - \mathrm{erf}\!\left(\sqrt{\tfrac{\beta_1-\beta_0}{2}}\,\epsilon + \tfrac{\beta_0}{\sqrt{2(\beta_1-\beta_0)}}\right)\right]. \tag{54}
$$
B.3.4 Variance Reduction for Reweighted Objective
For the reweighted objective, importance sampling uses the proposal distribution r(t) ∝ dσ_t²/dt = β(t)(1 − σ_t²). In this case, we have the following proposal r(t), its CDF R(t), and inverse CDF R^{−1}(ρ):
$$
r(t) = \frac{\beta(t)\,(1-\sigma_t^2)}{\sigma_1^2-\sigma_\epsilon^2}, \qquad R(t) = \frac{\sigma_t^2-\sigma_\epsilon^2}{\sigma_1^2-\sigma_\epsilon^2}, \qquad t = R^{-1}(\rho) = \mathrm{var}^{-1}\big((1-\rho)\,\sigma_\epsilon^2 + \rho\,\sigma_1^2\big), \tag{58}
$$
where var^{−1} denotes the inverse of the mapping t ↦ σ_t². Note that usually σ_ε² ≈ 0 and σ_1² ⪅ 1. In that case, the inverse CDF can be thought of as R^{−1}(ρ) ≈ var^{−1}(ρ).
Importance Weighted Objective:
$$
\frac{1}{2}\int_\epsilon^1 \beta(t)\,\mathbb{E}_{z_0,\epsilon}\Big[\|\epsilon-\epsilon_\theta(z_t,t)\|_2^2\Big]\,dt = \mathbb{E}_{t\sim r(t)}\left[\frac{\sigma_1^2-\sigma_\epsilon^2}{2\,(1-\sigma_t^2)}\,\mathbb{E}_{z_0,\epsilon}\Big[\|\epsilon-\epsilon_\theta(z_t,t)\|_2^2\Big]\right] \tag{59}
$$
Remark: It is worth noting that the derivation of the importance sampling distribution for the
reweighted objective does not make any assumption on the form of β(t). Thus, the IS distribution
can be formed for any VPSDE when training with the reweighted objective, including the original
VPSDE with linear β(t) and also our new geometric VPSDE.
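For the linear-β VPSDE, the inverse variance map var^{-1} in Eq. 58 reduces to solving a quadratic in t, so sampling from the reweighted proposal is cheap. A minimal sketch (with our own helper names) is given below.

```python
# Inverse CDF of Eq. (58) for the linear-beta VPSDE: since
# sigma_t^2 = 1 - (1 - sigma_0^2) * exp(-(beta0*t + 0.5*(beta1-beta0)*t^2)),
# var^{-1} amounts to solving a quadratic in t. Names are illustrative.
import numpy as np

def var_inverse(v, beta0=0.1, beta1=20.0, sigma0_sq=0.0):
    """Return t such that sigma_t^2 = v for the VPSDE with linear beta(t)."""
    c = -np.log((1.0 - v) / (1.0 - sigma0_sq))      # = beta0*t + 0.5*(beta1-beta0)*t^2
    a = 0.5 * (beta1 - beta0)
    return (-beta0 + np.sqrt(beta0 ** 2 + 4.0 * a * c)) / (2.0 * a)

def sample_t_reweighted(n_samples, eps=1e-2, beta0=0.1, beta1=20.0, sigma0_sq=0.0):
    sigma_sq = lambda t: 1.0 - (1.0 - sigma0_sq) * np.exp(-(beta0 * t + 0.5 * (beta1 - beta0) * t ** 2))
    rho = np.random.uniform(size=n_samples)
    target = (1.0 - rho) * sigma_sq(eps) + rho * sigma_sq(1.0)   # Eq. (58)
    return var_inverse(target, beta0, beta1, sigma0_sq)

t_samples = sample_t_reweighted(8)
```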
B.4 VESDE
Solving the Fokker-Planck equation for the input distribution N(μ_0, σ_0² I) results in
$$
\mu_t = \mu_0; \qquad \sigma_t^2 = \sigma_0^2 - \sigma_{\mathrm{min}}^2 + \sigma_{\mathrm{min}}^2\left(\frac{\sigma_{\mathrm{max}}^2}{\sigma_{\mathrm{min}}^2}\right)^t \tag{62}
$$
Typical values for σ_min² and σ_max² are σ_min² = 0.01² and σ_max² = 50² (CIFAR-10). Usually, we use σ_min² = σ_0².
Note that when the input data is distributed as z_0 ∼ N(z_0; 0, I), the variance at time t in the VESDE is given by:
$$
\mathring\sigma_t^2 = 1 - \sigma_{\mathrm{min}}^2 + \sigma_{\mathrm{min}}^2\left(\frac{\sigma_{\mathrm{max}}^2}{\sigma_{\mathrm{min}}^2}\right)^t \tag{63}
$$
Note that σ_max² is typically very large and chosen empirically based on the scale of the data [38]. However, this is tricky in our case, as the role of the data is played by the latent space encodings, which themselves are changing during training. We did briefly experiment with the VESDE and calculated σ_max² as suggested in [38] using the encodings after the VAE pre-training stage. However, these experiments were not successful and we suffered from significant training instabilities, even with variance reduction techniques. Therefore, we did not further explore this direction. Nevertheless, our proposed variance reduction techniques via importance sampling can also be derived for the VESDE. Hence, for completeness, they are shown below.
Since the term inside the expectation is not constant in t, the VESDE does not result in an objective with naturally minimal variance, as opposed to our proposed geometric VPSDE.
We derive an importance sampling scheme with a proposal distribution
$$
r(t) \propto \frac{1}{\sigma_t^2}\frac{d\sigma_t^2}{dt} - \frac{1}{\mathring\sigma_t^2}\frac{d\mathring\sigma_t^2}{dt} = \log\!\left(\frac{\sigma_{\mathrm{max}}^2}{\sigma_{\mathrm{min}}^2}\right)\left(1 - \frac{\sigma_{\mathrm{min}}^2\left(\frac{\sigma_{\mathrm{max}}^2}{\sigma_{\mathrm{min}}^2}\right)^t}{1-\sigma_{\mathrm{min}}^2+\sigma_{\mathrm{min}}^2\left(\frac{\sigma_{\mathrm{max}}^2}{\sigma_{\mathrm{min}}^2}\right)^t}\right) \tag{66}
$$
Note that the quantity above is always positive, as σ_min²(σ_max²/σ_min²)^t / (1 − σ_min² + σ_min²(σ_max²/σ_min²)^t) ≤ 1 for σ_min² < 1. In this case, the normalization constant of r(t) is R̃ = log(σ̊_ε² σ_max² / (σ_ε² σ̊_1²)) and the CDF is:
$$
R(t) = \frac{1}{\tilde R}\Big[\log\sigma_t^2 - \log\sigma_\epsilon^2 + \log\mathring\sigma_\epsilon^2 - \log\mathring\sigma_t^2\Big] = \frac{1}{\tilde R}\log\!\left(\frac{\mathring\sigma_\epsilon^2\,\sigma_t^2}{\mathring\sigma_t^2\,\sigma_\epsilon^2}\right) \tag{67}
$$
In contrast to the VESDE, the geometric VPSDE combines the geometric progression in diffusion variance directly with minimal variance in the objective by design. Furthermore, it is simpler to set up, because we can always choose σ_max² ∼ 1 for the geometric VPSDE and do not have to use a data-specific σ_max² as proposed by [38].
Thus, the optimal proposal for the reweighted objective and the inverse CDF are:
$$
r(t) \propto \frac{1}{\mathring\sigma_t^2}\frac{d\mathring\sigma_t^2}{dt} \;\Rightarrow\; r(t) = \frac{1}{\log\!\left(\frac{\mathring\sigma_1^2}{\mathring\sigma_\epsilon^2}\right)}\frac{1}{\mathring\sigma_t^2}\frac{d\mathring\sigma_t^2}{dt} \;\Rightarrow\; R(t) = \frac{\log\!\left(\frac{\mathring\sigma_t^2}{\mathring\sigma_\epsilon^2}\right)}{\log\!\left(\frac{\mathring\sigma_1^2}{\mathring\sigma_\epsilon^2}\right)} \;\Rightarrow\; t = \mathring{\mathrm{var}}^{-1}\!\left(\big(\mathring\sigma_\epsilon^2\big)^{1-\rho}\big(\mathring\sigma_1^2\big)^{\rho}\right) \tag{73}
$$
So, the reweighted objective with importance sampling is:
$$
\frac{1}{2}\int_\epsilon^1 g^2(t)\,\mathbb{E}_{\mu_0,\epsilon}\Big[\|\epsilon-\epsilon_\theta(z_t,t)\|_2^2\Big]\,dt = \mathbb{E}_{t\sim r(t)}\left[\frac{1}{2}\log\!\left(\frac{\mathring\sigma_1^2}{\mathring\sigma_\epsilon^2}\right)\mathring\sigma_t^2\;\mathbb{E}_{\mu_0,\epsilon}\Big[\|\epsilon-\epsilon_\theta(z_t,t)\|_2^2\Big]\right] \tag{74}
$$
Note that in practice, we can safely set ε = 0, as the initial variance σ_0² is non-zero in the VESDE.
B.5 Sub-VPSDE
In both the VESDE and the geometric VPSDE, the initial variance σ_0² is denoted by σ_min² > 0. These diffusion processes start from a slightly perturbed version of the data at t = 0. In the VESDE, σ_max² is by definition large (as the name variance-exploding SDE suggests) and it is set based on the scale of the data [38]. In contrast, σ_max² in the geometric VPSDE does not depend on the scale of the data and it is set to σ_max² ≈ 1. In the VPSDE, the initial variance is denoted by the hyperparameter σ_0². In contrast to the VESDE and geometric VPSDE, we often set the initial variance to zero in the VPSDE, meaning that the diffusion process models the data distribution exactly at t = 0. However, using the VPSDE with σ_0² = 0 comes at the cost of not being able to sample t in the full interval [0, 1] during training and also prevents us from solving the probability flow ODE all the way to zero during sampling [2].
In LSGM, to sample from our SGM prior in latent space and to estimate NELBOs, we follow Song
et al. [2] and build on the connection between SDEs and ODEs. We use black-box ODE solvers to
solve the probability flow ODE. Here, we briefly recap this approach.
All SDEs used in this paper can be written in the general form
$$
dz = f(z, t)\,dt + g(t)\,dw.
$$
The reverse of this diffusion process is also a diffusion process running backwards in time [105, 2], defined by
$$
dz = \big[f(z, t) - g^2(t)\,\nabla_{z_t}\log q(z_t)\big]\,dt + g(t)\,d\bar w,
$$
where dw̄ denotes a standard Wiener process going backwards in time, dt now represents a negative infinitesimal time increment, and ∇_{z_t} log q(z_t) is the score function of the diffusion process distribution at time t. Interestingly, Song et al. have shown that there is a corresponding ODE that generates the same marginal probability distributions q(z_t) when acting upon the same prior distribution q(z_1). It is given by
$$
dz = \left[f(z, t) - \frac{g^2(t)}{2}\,\nabla_{z_t}\log q(z_t)\right]dt
$$
and usually called the probability flow ODE. This connects score-based generative models using
diffusion processes to continuous Normalizing flows, which are based on ODEs [73, 106]. Note that
in practice ∇zt log q(zt ) is approximated by a learnt model. Therefore, the generative distributions
defined by the ODE and SDE above are formally not exactly equivalent when inserting this learnt
model for the score function expression. Nevertheless, they often achieve quite similar performance
in practice [2]. This aspect is discussed in detail in concurrent work by Song et al. [4].
We can use the above ODE for efficient sampling of the model via black-box ODE solvers. Specifi-
cally, we can draw samples from the standard Normal prior distribution at t = 1 and then solve this
ODE towards t = 0. In fact, this is how we perform sampling from the latent SGM prior in our paper.
Similarly, we can also use this ODE to calculate the probability of samples under this generative
process using the instantaneous change of variables formula (see [73, 106] for details). We rely on
this for calculating the probability of latent space samples under the score-based prior in LSGM.
Note that this involves calculating the trace of the Jacobian of the ODE function. This is usually
approximated via Hutchinson’s trace estimator, which is unbiased but has a certain variance (also see
discussion in Sec. F).
This approach is applicable similarly for all diffusion processes and SDEs considered in this paper.
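As a minimal illustration of this sampling procedure, the probability flow ODE for a VPSDE prior can be integrated with any black-box RK45 solver; the sketch below uses scipy for simplicity, with a stand-in score function, and is not the paper's actual sampling code.

```python
# Minimal sketch of sampling via the probability flow ODE with a black-box RK45 solver.
# "score_fn" is a hypothetical stand-in for the learnt prior score.
import numpy as np
from scipy.integrate import solve_ivp

def probability_flow_sample(score_fn, dim, beta0=0.1, beta1=20.0, t_end=1e-6, rtol=1e-5, atol=1e-5):
    beta = lambda t: beta0 + (beta1 - beta0) * t
    def ode_func(t, z):
        # dz/dt = f(z, t) - 0.5 * g(t)^2 * score(z, t), with f(z, t) = -0.5 * beta(t) * z for the VPSDE.
        return -0.5 * beta(t) * z - 0.5 * beta(t) * score_fn(z, t)
    z1 = np.random.randn(dim)                       # sample from the prior at t = 1
    sol = solve_ivp(ode_func, (1.0, t_end), z1, method="RK45", rtol=rtol, atol=atol)
    return sol.y[:, -1]                             # latent sample near t = 0, to be decoded by the VAE

z0 = probability_flow_sample(lambda z, t: -z, dim=16)   # standard Normal score as a stand-in
```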
Converting a VAE with a hierarchical prior to a standard Normal prior can be done using a simple change of variables. Consider a VAE with hierarchical encoder q(z|x) = ∏_l q(z_l|z_{<l}, x) and hierarchical prior p(z) = ∏_l p(z_l|z_{<l}), where z = {z_l}_{l=1}^L represents all latent variables and:
$$
p(z_l|z_{<l}) = \mathcal N\big(z_l;\,\mu_l(z_{<l}),\,\sigma_l^2(z_{<l})\,I\big) \tag{78}
$$
In NVAE [20], the prior has the same hierarchical form as in Eq. 78. However, the authors observe that the residual parameterization of the encoder often improves the generative performance. In this parameterization, with a small modification, the encoder is defined by:
$$
q(z_l|z_{<l}, x) = \mathcal N\big(z_l;\,\mu_l(z_{<l}) + \sigma_l(z_{<l})\,\Delta\mu_l'(z_{<l}, x),\;\sigma_l^2(z_{<l})\,\Delta\sigma_l'^2(z_{<l}, x)\,I\big), \tag{83}
$$
where the encoder is tasked to predict the residual parameters Δμ′_l(z_{<l}, x) and Δσ′²_l(z_{<l}, x). Using the same reparameterization as above (ε_l = (z_l − μ_l(z_{<l}))/σ_l(z_{<l})), we have the equivalent VAE in the form:
where H is the Hessian matrix of the log Σ exp function at w. Note that the gradient ∇_w log Σ_j exp(w_j) is the softmax function, with components e^{w_i}/Σ_j e^{w_j}, and trace(H) = Σ_i (e^{w_i}/Σ_j e^{w_j})(1 − e^{w_i}/Σ_j e^{w_j}) ≤ 1. Thus, we have:
$$
\mathbb{E}_\epsilon\Big[\log\sum_i \exp(w_i + \sigma\epsilon_i)\Big] \lesssim \log\sum_i \exp(w_i) + \sigma^2 \tag{88}
$$
The VAE backbone for all LSGM models is NVAE [20]7 , one of the best-performing VAEs in
the literature. It has a hierarchical latent space with group-wise autoregressive latent variable
dependencies and it leverages residual neural networks (for architecture details see [20]). It uses
depth-wise separable convolutions in the decoder. Although both the approximate posterior and the
prior are hierarchical in its original version, we can reparametrize the prior and write it as a product
of independent Normal distributions (see Sec. E).
The VAE’s most important hyperparameters include the number of latent variable groups and their
spatial resolution, the channel depth of the latent variables, the number of residual cells per group,
and the number of channels in the convolutions in the residual cells. Furthermore, when training the
VAE during the first stage we are using KL annealing and KL balancing, as described in [20]. For
some models, we complete KL annealing during the pre-training stage, while for other models we
found it beneficial to anneal only up to a KL-weight βKL < 1.0 in the ELBO during the first stage and
complete KL annealing during the main end-to-end LSGM training stage. This provides additional
flexibility in learning an expressive distribution in latent space during the second training stage, as
it prevents more latent variables from becoming inactive while the prior is being trained gradually.
However, when using a very large backbone VAE together with an SGM objective that does not
correspond to maximum likelihood training, i.e. wun - or wre -weighting, we empirically observe that
this approach can also hurt NLL, while slightly improving FID (see CIFAR10 (best FID) model).
Note that the VAE Backbone performance for CIFAR10 reported in Tab. 2 in the main paper
corresponds to the 20-group backbone VAE (trained to full KL-weight βKL = 1.0) from the CIFAR10
(balanced) LSGM model (see hyperparameter Tab. 7).
Image Decoders: Since SGMs [2] assume that the data is continuous, they rely on uniform de-
quantization when measuring data likelihood. However, in LSGM, we rely on decoders designed
specifically for images with discrete intensity values. On color images, we use mixtures of discretized
logistics [82], and on binary images, we use Bernoulli distributions. These decoder distributions are
both available from the NVAE implementation.
Our denoising networks for the latent SGM prior are based on the NCSN++ architecture from Song et
al. [2], adapted such that the model ingests and predicts tensors according to the VAE’s latent variable
dimensions. We vary hyperparameters such as the number of residual cells per spatial resolution level and the number of channels in convolutions. Note that all our models use 0.2 dropout in the SGM prior. Some of our models use upsampling and downsampling operations with anti-aliasing based on Finite Impulse Response (FIR) [107], following Song et al. [2].
7 https://github.com/NVlabs/NVAE (NVIDIA Source Code License)
NVAE has a hierarchical latent structure. For small image datasets, including CIFAR-10, MNIST, and OMNIGLOT, all the latent variables have the same spatial dimensions. Thus, the diffusion process input z_0 is constructed by concatenating the latent variables from all groups in the channel dimension, as sketched below. Our NVAE backbone on the CelebA-HQ-256 dataset comes with latent variable groups at multiple spatial resolutions. In this case, we only feed the smallest-resolution groups to the SGM prior and assume that the remaining groups have a standard Normal distribution.
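A minimal sketch of this channel-wise concatenation (with illustrative group counts and shapes) is:

```python
# Concatenate hierarchical latents with equal spatial dimensions along the channel
# dimension to form the diffusion input z_0, and split back when decoding.
import torch

groups = [torch.randn(4, 20, 16, 16) for _ in range(3)]   # e.g., 3 latent groups of 20 channels each
z0 = torch.cat(groups, dim=1)                              # SGM prior input: (4, 60, 16, 16)
recovered = torch.split(z0, [g.shape[1] for g in groups], dim=1)
```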
To optimize our models, we are mostly following the previous literature. The VAE’s encoder and
decoder networks are trained using an Adamax optimizer [108], following NVAE [20]. In the
second stage, the whole model is trained with an Adam optimizer [108] and we perform learning
rate annealing for the VAE network optimization, while we keep the learning rate constant when
optimizing the SGM prior parameters. At test time, we use an exponential moving average (EMA)
of the parameters of the SGM prior with 0.9999 EMA decay rate, following [1, 2]. Note that, when
using the VPSDE with linear β(t), we are also generally following [1, 2] and use β0 = 0.1 and
β1 = 20.0. We did not observe any benefits in using the EMA parameters for the VAE networks.
For evaluation, we are drawing samples and calculating log-likelihoods using the probability flow ODE, leveraging black-box ODE solvers, following [73, 106, 2]. Similar to [2], we are using an RK45 ODE solver [109], based on scipy, through the torchdiffeq interface. Integration cutoffs close to zero and ODE solver error tolerances used for evaluation are indicated in Tab. 7 (for example, for the VPSDE with linear β(t) we usually use σ_0² = 0 and therefore σ_t² goes to 0 at t = 0, hence preventing us from integrating the probability flow ODE all the way to exactly t = 0; this was handled similarly by Song et al. [2]).
Following the conventions established by previous work [88, 3, 1, 5], when evaluating our main
models we compute FID at frequent intervals during training and report FID and NLL at the minimum
observed FID.
Vahdat and Kautz in NVAE [20] observe that setting the batch normalization (BN) layers to train mode
during sampling (i.e., using batch statistics for normalization instead of moving average statistics)
improves sample quality. We similarly observe that setting BN layers to train mode improves sample
quality by about 1 FID score on the CelebA-HQ-256 dataset, but it does not affect performance on
the CIFAR-10 dataset. In contrast to NVAE, we do not change the temperature of the prior during
sampling, as we observe that it hurts generation quality.
Here we provide additional details and discussions about the ablation experiments performed in the
paper.
Table 7: Hyperparameters for our main models. We use the same notations and abbreviations as in Tab. 6 in the main paper.

Hyperparameter | CIFAR10 (best FID) | CIFAR10 (balanced) | CIFAR10 (best NLL) | CelebA-HQ-256 (best quantitative) | CelebA-HQ-256 (best qualitative) | OMNIGLOT | MNIST
VAE Backbone
# normalizing flows | 0 | 0 | 2 | 2 | 2 | 0 | 0
# latent variable scales | 1 | 1 | 1 | 3 | 2 | 1 | 1
# groups in each scale | 20 | 20 | 4 | 8 | 10 | 3 | 2
spatial dims. of z in each scale | 16^2 | 16^2 | 16^2 | 128^2, 64^2, 32^2 | 128^2, 64^2 | 16^2 | 8^2
# channels in z | 9 | 9 | 45 | 20 | 20 | 20 | 20
# initial channels in enc. | 128 | 128 | 256 | 64 | 64 | 64 | 64
# residual cells per group | 2 | 2 | 3 | 2 | 2 | 3 | 1
NVAE's spectral reg. λ | 10^-2 | 10^-2 | 10^-2 | 3×10^-2 | 3×10^-2 | 10^-2 | 10^-2
Training (VAE pre-training)
# epochs | 400 | 600 | 400 | 200 | 200 | 200 | 200
learning rate VAE | 10^-2 | 10^-2 | 10^-2 | 10^-2 | 10^-2 | 10^-2 | 10^-2
batch size per GPU | 32 | 32 | 64 | 4 | 4 | 64 | 100
# GPUs | 8 | 8 | 4 | 16 | 16 | 2 | 2
KL annealing to | βKL=0.7 | βKL=1.0 | βKL=0.7 | βKL=1.0 | βKL=1.0 | βKL=1.0 | βKL=0.7
Latent SGM Prior
# number of scales | 3 | 3 | 3 | 4 | 5 | 3 | 2
# residual cells per scale | 8 | 8 | 8 | 8 | 8 | 8 | 8
# conv. channels at each scale | [512]×3 | [512]×3 | [512]×3 | 256, [512]×3 | [320]×2, [640]×3 | [256]×3 | [256]×2
use FIR [107] | yes | yes | yes | yes | yes | no | no
Training (Main LSGM training)
# epochs | 1875 | 1875 | 1875 | 1000 | 2000 | 1500 | 800
learning rate VAE | 10^-4 | 10^-4 | 10^-4 | 10^-4 | - | 10^-4 | 10^-4
learning rate SGM prior | 10^-4 | 10^-4 | 10^-4 | 10^-4 | 10^-4 | 3×10^-4 | 3×10^-4
batch size per GPU | 16 | 16 | 16 | 4 | 8 | 32 | 32
# GPUs | 16 | 16 | 16 | 16 | 16 | 4 | 4
KL annealing | continued | no | continued | no | no | continued | continued
SDE | VPSDE | VPSDE | Geo. VPSDE | VPSDE | VPSDE | VPSDE | VPSDE
σ0^2 (= σmin^2 for Geo. VPSDE) | 0.0 | 0.0 | 3×10^-5 | 0.0 | 0.0 | 0.0 | 0.0
σmax^2 (only for Geo. VPSDE) | - | - | 0.999 | - | - | - | -
t-sampling cutoff during training | 0.01 | 0.01 | 0.0 | 0.01 | 0.01 | 0.01 | 0.01
SGM prior weighting mechanism | wun | wun | wll | wre | wre | wll | wll
t-sampling approach (SGM-obj.) | run(t) | run(t) | U[0,1] | rre(t) | rre(t) | rll(t) | rll(t)
t-sampling approach (q-obj.) | rew. | rew. | rew. | rll(t) | - | rew. | rew.
Evaluation
ODE solver integration cutoff | 10^-6 | 10^-6 | 10^-6 | 10^-5 | 10^-5 | 10^-5 | 10^-5
ODE solver error tolerance | 10^-5 | 10^-5 | 10^-5 | 10^-5 | 10^-5 | 10^-5 | 10^-5
As discussed in the main paper, the results of this ablation study overall validate that importance
sampling is important to stabilize training, that the wll -weighting mechanism as well as our novel
geometric VPSDE are well suited for training towards strong likelihood, and that the wun - and
wre -weighting mechanisms tend to produce better FIDs. Although these trends generally hold, it is
noteworthy that not all results translate perfectly to our large models that we used to produce our
main results. For instance, the setting with wre -weighting and no importance sampling for the SGM
objective, which produced the best FID in Tab. 6 (main paper), is generally unstable for our bigger
models, in line with our observation that IS is usually necessary to stabilize training. The stable
training run for this setting in Tab. 6 can be considered an outlier.
Furthermore, for CIFAR10 we obtained our very best FID results using the VPSDE, wun -weighting,
IS, and sample reweighting for the q-objective, while for the slightly smaller models used for the
results in Tab. 6, there is no difference between using sample reweighting and drawing a separate
batch t with rll (t) for training q for this case (see Tab. 6 main paper, VPSDE, wun , run (t) fields). Also,
CelebA-HQ-256 behaves slightly differently for the large models, in that the VPSDE with wre-weighting and sampling a separate batch t with rll(t) for q-training performed best by a small margin (see hyperparameter Tab. 7).
The model used for the results on the ablation study regarding end-to-end training vs. fully separate
VAE and SGM prior training is the same one as used for the ablation study on SDEs, objective
weighting mechanisms and variance reduction above, evaluated in a similar way. For this experiment,
we used the VPSDE, wun -objective weighting, IS for t with run (t) when training the SGM prior,
and we did draw a second batch t with rll (t) for training q (only relevant for the end-to-end training
setup).
G.5.3 Ablation: Mixing Normal and Neural Score Functions
The model used for the ablation study on mixing Normal and neural score functions is again similar
to the one used for the other ablations with the exception that the underlying VAE has only a single
latent variable group, which makes it much smaller and removes all hierarchical dependencies
between latent variables. We tried training multiple models with larger backbone VAEs, but they were generally unstable when trained without our mixed score parametrization, which only highlights its importance. As for the previous ablation, for this experiment we used the VPSDE, wun-objective
weighting, IS for t with run (t) when training the SGM prior, and we did draw a second batch t with
rll (t) for training q.
To unambiguously clarify how we train our LSGMs, we summarized the training procedures in three
different algorithms for different situations:
1. Likelihood training with IS. In this case, the SGM prior and the encoder share the same weighted likelihood objective and do not need to be updated separately.
2. Un/Reweighted training with separate IS of t. Here, the SGM prior is trained with t drawn from the IS distribution tailored to the un- or reweighted objective, while the score-based cross entropy term for encoder training is computed with a separate batch of t drawn from an IS distribution suited to the weighted (maximum likelihood) objective (see Algorithm 2).
3. Un/Reweighted training with IS of t for the SGM-objective and reweighting for the q-objective. What this means is that when training the encoder with the score-based cross entropy term (last term in Eq. 8 from the main paper), we are using an importance sampling distribution that was actually tailored to un- or reweighted training for the SGM objective (Eq. 9 from the main paper) and therefore isn't optimal for the weighted (maximum likelihood) objective necessary for encoder training. However, if we nevertheless use the same importance sampling distribution, we do not need to draw a second batch of t for encoder training. In practice, this boils down to different (re-)weighting factors in the cross entropy term (see Algorithm 3).
Regarding the efficiency of approaches (2) and (3), we observe that (3) generally consumes more memory than (2), but it can be faster due to the shared computation for the denoising step. Due to memory limitations, we use (2) on large image datasets. Note that the choice between (2) and (3) may affect generative performance, as we empirically observed in our experiments.
Algorithm 2 Un/Reweighted training with separate IS of t
Input: data x, parameters {θ, φ, ψ}
  Draw z_0 ∼ q_φ(z_0|x) using the encoder.
  ▷ VAE encoder and decoder loss, computed with the same t sample
  Calculate the cross entropy CE(q_φ(z_0|x)||p_θ(z_0)) ≈ (1/r_un/re(t)) · (w_ll(t)/2) · L_DSM.
  Calculate the objective L(x, φ, ψ) = − log p_ψ(x|z_0) + log q_φ(z_0|x) + CE(q_φ(z_0|x)||p_θ(z_0)).
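The cross entropy line above can be written generically as an importance-weighted denoising loss. The following PyTorch-style sketch only illustrates the computation pattern; the encoder, diffusion, weighting, and IS helpers are hypothetical placeholders rather than functions from the released code.

```python
# Generic sketch of the importance-sampled, reweighted cross entropy term: draw t ~ r(t),
# compute L_DSM = ||eps - eps_theta(z_t, t)||^2 once, and apply the factor w(t) / (2 r(t)).
import torch

def cross_entropy_term(x, encoder, eps_theta, sample_t_r, r_density, w_ll, diffuse):
    z0, log_q = encoder(x)                          # z_0 ~ q(z_0 | x) and log q(z_0 | x)
    t = sample_t_r(z0.shape[0])                     # t ~ r(t), shape (batch,)
    zt, eps = diffuse(z0, t)                        # z_t = mu_t(z_0) + sigma_t * eps
    l_dsm = ((eps - eps_theta(zt, t)) ** 2).flatten(1).sum(dim=1)
    weight = w_ll(t) / (2.0 * r_density(t))         # reweighting factor w_ll(t) / (2 r(t))
    return (weight * l_dsm).mean(), log_q

# Toy stand-ins with a simplistic diffusion, just to exercise the function.
def toy_diffuse(z0, t):
    eps = torch.randn_like(z0)
    return (1 - t.view(-1, 1)).sqrt() * z0 + t.view(-1, 1).sqrt() * eps, eps

encoder = lambda x: (x, torch.zeros(x.shape[0]))
eps_theta = lambda zt, t: torch.zeros_like(zt)
sample_t = lambda b: torch.rand(b).clamp(min=0.01)
ones = lambda t: torch.ones_like(t)
ce, log_q = cross_entropy_term(torch.randn(8, 4), encoder, eps_theta, sample_t, ones, ones, toy_diffuse)
```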
In total, the research project consumed ≈350,000 GPU hours, which translates to an electricity consumption of approximately 50 MWh. We used an in-house GPU cluster of NVIDIA V100 GPUs.
H Additional Experiments
H.1 Additional Samples
In this section, we provide additional samples generated by our models for CIFAR-10 in Fig. 7 and CelebA-HQ-256 in Fig. 8.
Table 8: Experiment with a small VAE architecture on dynamically binarized MNIST.

Method | NELBO ↓ (nats)
Small VAE [24] | 84.08±0.10
Small VAE + inverse autoregressive flow [24] | 80.80±0.07
Our small VAE | 83.85
Our LSGM w/ small VAE | 79.23
H.2 MNIST: Small VAE Experiment
Here, we examine our LSGM on a small VAE architecture. We specifically follow [24] and build
a small VAE in the NVAE codebase. In particular, the model does not have hierarchical latent
variables, but only a single latent variable group with a total of 64 latent variables. Encoder and
decoder consist of small ResNets with 6 residual cells in total (every two cells there is a down- or
up-sampling operation, so we have 3 blocks with 2 residual cells per block). The experiments are
done on dynamically binarized MNIST. As we can see in Table 8, our implementation of the VAE
obtains a similar test NELBO as [24]. However, our LSGM improves the NELBO by almost 4.6 nats.
This simple experiment shows that we can even obtain good generative performance with our LSGM
using small VAE architectures.
H.3 CIFAR-10: Neural Network Evaluations during Sampling
In Tab. 9, we report the number of neural network evaluations performed by the ODE solver during sampling from our CIFAR-10 models. The ODE solver error tolerance is 10^-5 and the time integration cutoff is 10^-6. CIFAR-10 is a highly diverse and more multimodal dataset compared to CelebA-HQ-256. Because of that, the learnt latent SGM prior is more complex, requiring more function evaluations.
H.4 CIFAR-10: Sub-VPSDE vs. VPSDE
In App. B.5, we discussed how variance reduction techniques derived based on the VPSDE can also help reduce the variance of the sample-based estimate of the training objective when using the Sub-VPSDE in the latent space SGM. Here, we perform a quantitative comparison between the
VPSDE and the Sub-VPSDE, following the same experimental setup and using the same models as
for the ablation study on SDEs, objective weighting mechanisms, and variance reduction (experiment
details in App. G.5.1). The results are reported in Tab. 10. We find that the VPSDE generally
performs slightly better in FID, while we observed little difference in NELBO in these experiments.
Importantly, the Sub-VPSDE also did not outperform our novel geometric VPSDE in NELBO. We
also see that the combination of Sub-VPSDE with wre -weighting performs poorly. Consequently, we
did not explore the Sub-VPSDE further in our main experiments.
H.5 CelebA-HQ-256: Different ODE Solver Error Tolerances
In Fig. 9, we visualize CelebA-HQ-256 samples from our LSGM model for varying ODE solver error tolerances.
Table 10: Comparing the VPSDE and Sub-VPSDE in LSGM. For detailed explanations
of abbreviations in the table, see Tab. 6 in main paper. Note that importance sampling
distributions are generally based on derivations with the VPSDE, even when using the
Sub-VPSDE, as discussed in App. B.5.
Figure 7: Additional uncurated samples generated by LSGM on the CIFAR-10 dataset (best FID
model). Sampling in the latent space is done using the probability flow ODE.
Figure 8: Additional uncurated samples generated by LSGM on the CelebA-HQ-256 dataset. Sam-
pling in the latent space is done using the probability flow ODE.
H.6 CelebA-HQ-256: Ancestral Sampling
For our experiments in this paper, we use the probability flow ODE to sample from the model. However, on CelebA-HQ-256, we observe that ancestral sampling [2, 1, 27] from the prior, instead of solving the probability flow ODE, often generates much higher quality samples. That said, the FID score is slightly worse for this approach. In Fig. 10, Fig. 11, and Fig. 12, we visualize samples
generated with different numbers of steps in ancestral sampling.
H.7 CelebA-HQ-256: Sampling from VAE Backbone vs. LSGM
For the quantitative results on the CelebA-HQ-256 dataset in the main text, we use an LSGM with
spatial dimension of 32×32 for the latent variables in the SGM prior. However, for the qualitative
results we used an LSGM with the prior spatial dimension of 64×64. The 32×32 dimensional model
achieves a better FID score compared to the 64×64 dimensional model (FID 7.22 vs. 8.53) and
sampling from it is much faster (2.7 sec. vs. 39.9 sec.). However, the visual quality of the samples
is slightly worse. In this section, we visualize samples generated by the 32×32 dimensional model
as well as the VAE backbone for this model. In this experiment, the VAE backbone is fully trained.
Samples from our VAE backbone are visualized in Fig. 13 and for our 32×32 dimensional LSGM in
Fig. 14.
H.8 Evolution Samples on the ODE and SDE Reverse Generative Process
In Fig. 15, we visualize the evolution of the latent variables under both the reverse generative SDE and
also the probability flow ODE. We are decoding the intermediate latent samples along the reverse-time
generative process via the decoder to pixel space.
Figure 9, panel (a): CelebA-HQ-256 samples with ODE solver error tolerance 10^-2.
Figure 10: Uncurated samples generated by LSGM on the CelebA-HQ-256 dataset using 200-step
ancestral sampling for the prior.
Figure 11: Uncurated samples generated by LSGM on the CelebA-HQ-256 dataset using 1000-step
ancestral sampling for the prior.
Figure 12: Additional uncurated samples generated by LSGM on the CelebA-HQ-256 dataset using
1000-step ancestral sampling.
Figure 13: Uncurated samples generated by our VAE backbone without changing the temperature of
the prior. The poor quality of the samples from the VAE backbone is partially due to the large spatial
dimensions of the latent space in which long-range correlations are not encoded well.
Figure 14: Uncurated samples generated by LSGM with the SGM prior applied to the latent variables
of 32×32 spatial dimensions, on the CelebA-HQ-256 dataset. Sampling in the latent space is done
using the probability flow ODE.
Figure 15: We visualize the evolution of the latent variables under both the reverse gener-
ative SDE (a-b) and also the probability flow ODE (c-d). Specifically, we feed latent vari-
ables from different stages along the generative denoising diffusion process to the decoder to
map them back to image space. The 13 different images in each row correspond to the times
t = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01, 10−5 ] along the reverse denoising diffu-
sion process. The evolution of the images is noticeably different from diffusion models that are run
directly in pixel space (see, for example, Fig. 1 in [2]).