Cascaded Diffusion Models For High Fidelity Image Generation
[email protected]
Mohammad Norouzi [email protected]
Tim Salimans [email protected]
Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043
Abstract
We show that cascaded diffusion models are capable of generating high fidelity images on
the class-conditional ImageNet generation benchmark, without any assistance from auxiliary
image classifiers to boost sample quality. A cascaded diffusion model comprises a pipeline
of multiple diffusion models that generate images of increasing resolution, beginning with a
standard diffusion model at the lowest resolution, followed by one or more super-resolution
diffusion models that successively upsample the image and add higher resolution details.
We find that the sample quality of a cascading pipeline relies crucially on conditioning
augmentation, our proposed method of data augmentation of the lower resolution condi-
tioning inputs to the super-resolution models. Our experiments show that conditioning
augmentation prevents compounding error during sampling in a cascaded model, helping
us to train cascading pipelines achieving FID scores of 1.48 at 64×64, 3.52 at 128×128
and 4.88 at 256×256 resolutions, outperforming BigGAN-deep, and classification accuracy
scores of 63.02% (top-1) and 84.06% (top-5) at 256×256, outperforming VQ-VAE-2.
Keywords: generative models, diffusion models, score matching, iterative refinement,
super-resolution
Figure 1: A cascaded diffusion model comprising a base model and two super-resolution models, generating images at 32×32, 64×64, and 256×256 resolution for class ID 213 ("Irish Setter").
∗. Equal contribution
©2021 Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, Tim Salimans.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/.
1. Introduction
Diffusion models (Sohl-Dickstein et al., 2015) have recently been shown to be capable of
synthesizing high quality images and audio (Chen et al., 2021; Ho et al., 2020; Kong et al.,
2021; Song and Ermon, 2020): an application of machine learning that has long been
dominated by other classes of generative models such as autoregressive models, GANs, VAEs,
and flows (Brock et al., 2019; Dinh et al., 2017; Goodfellow et al., 2014; Ho et al., 2019;
Kingma and Dhariwal, 2018; Kingma and Welling, 2014; Razavi et al., 2019; van den Oord
et al., 2016a,b, 2017). Most previous work on diffusion models demonstrating high quality
samples has focused on data sets of modest size, or data with strong conditioning signals.
Our goal is to improve the sample quality of diffusion models on large high-fidelity data sets
for which no strong conditioning information is available. To showcase the capabilities of
the original diffusion formalism, we focus on simple, straightforward techniques to improve
the sample quality of diffusion models; for example, we avoid using extra image classifiers to
boost sample quality metrics (Dhariwal and Nichol, 2021; Razavi et al., 2019).
Our key contribution is the use of cascades to improve the sample quality of diffusion
models on class-conditional ImageNet. Here, cascading refers to a simple technique to model
high resolution data by learning a pipeline of separately trained models at multiple resolutions;
a base model generates low resolution samples, followed by super-resolution models that
upsample low resolution samples into high resolution samples. Sampling from a cascading
pipeline occurs sequentially, first sampling from the low resolution base model, followed by
sampling from super-resolution models in order of increasing resolution. While any type of
generative model could be used in a cascading pipeline (e.g., Menick and Kalchbrenner, 2019;
Razavi et al., 2019), here we restrict ourselves to diffusion models. Cascading has been shown
in recent prior work to improve the sample quality of diffusion models (Saharia et al., 2021;
Nichol and Dhariwal, 2021); our work here concerns the improvement of diffusion cascading
pipelines to attain the best possible sample quality.
The simplest and most effective technique we found to improve cascading diffusion
pipelines is to apply strong data augmentation to the conditioning input of each super-
resolution model. We refer to this technique as conditioning augmentation. In our experiments,
conditioning augmentation is crucial for our cascading pipelines to generate high quality
samples at the highest resolution. With this approach we attain FID scores on class-
conditional ImageNet generation that are better than BigGAN-deep (Brock et al., 2019) at any
truncation value, and classification accuracy scores that are better than VQ-VAE-2 (Razavi
et al., 2019). We empirically find that conditioning augmentation is effective because it
alleviates compounding error in cascading pipelines due to train-test mismatch, sometimes
referred to as exposure bias in the sequence modeling literature (Bengio et al., 2015; Ranzato
et al., 2016).
The key contributions of this paper are as follows:
• We show that our Cascaded Diffusion Models (CDM) yield high fidelity samples
superior to BigGAN-deep (Brock et al., 2019) and VQ-VAE-2 (Razavi et al., 2019)
in terms of FID score (Heusel et al., 2017) and classification accuracy score (Ravuri
and Vinyals, 2019), the latter by a large margin. We achieve these results with pure
generative models that are not combined with any classifier.
Section 2 reviews recent work on diffusion models. Section 3 describes the most effective
types of conditioning augmentation that we found for class-conditional ImageNet generation.
Section 4 contains our sample quality results, ablations, and experiments on additional
datasets. Appendix A contains extra samples and Appendix B contains details on hyperpa-
rameters and architectures. High resolution figures and additional supplementary material
can be found at https://cascaded-diffusion.github.io/.
2. Background
We begin with background on diffusion models, their extension to conditional generation,
and their associated neural network architectures.
2.1 Diffusion Models

A diffusion model (Sohl-Dickstein et al., 2015; Ho et al., 2020) is specified by a forward process that gradually corrupts data $x_0 \sim q(x_0)$ with Gaussian noise according to a variance schedule $\beta_1, \ldots, \beta_T$,

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t I\big),$$

and by a learned reverse process that removes the noise:
$$p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t), \qquad p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\big),$$

where $p(x_T) = \mathcal{N}(x_T; 0, I)$. The model is trained by optimizing a variational bound (ELBO) on the negative log likelihood,

$$-\log p_\theta(x_0) \le L_\theta(x_0) = \mathbb{E}_q\bigg[L_T(x_0) + \sum_{t>1} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\big) - \log p_\theta(x_0 \mid x_1)\bigg], \tag{1}$$

where $L_T(x_0) = D_{\mathrm{KL}}(q(x_T \mid x_0) \,\|\, p(x_T))$. The forward process posteriors $q(x_{t-1} \mid x_t, x_0)$ and
marginals q(xt |x0 ) are Gaussian, and the KL divergences in the ELBO can be calculated in
closed form. Thus it is possible to train the diffusion model by taking stochastic gradient
steps on random terms of Eq. (1). As previously suggested (Ho et al., 2020; Nichol and
Dhariwal, 2021), we use the reverse process parameterizations
$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\Big), \qquad \Sigma^{ii}_\theta(x_t, t) = \exp\big(\log\tilde{\beta}_t + (\log\beta_t - \log\tilde{\beta}_t)\,v^i_\theta(x_t, t)\big),$$

where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, and $\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$ is the forward process posterior variance. The network $\epsilon_\theta$ is trained to predict the noise added by the forward process using the unweighted loss

$$L_{\mathrm{simple}}(x_0) = \mathbb{E}_{t,\epsilon}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\big)\big\|^2\Big],$$
which is a weighted form of the ELBO that resembles denoising score matching over multiple
noise scales (Ho et al., 2020; Song and Ermon, 2019). For the case of learned Σθ , we employ
a hybrid loss (Nichol and Dhariwal, 2021), implemented using the expression

$$L_{\mathrm{hybrid}} = L_{\mathrm{simple}} + \lambda L_{\mathrm{vb}},$$

where $L_{\mathrm{vb}} = \mathbb{E}_{x_0}[L_\theta(x_0)]$ and a stop-gradient is applied to the $\epsilon_\theta$ term inside $L_\theta$. Optimizing
this hybrid loss has the effect of simultaneously learning µθ using Lsimple and learning Σθ
using the ELBO.
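To make the objective concrete, the following is a minimal NumPy sketch of one stochastic-gradient term of the ε-prediction loss above; the toy `eps_model` stand-in and the linear β schedule are illustrative assumptions, not the U-Net or cosine schedule used in our experiments.

```python
import numpy as np

# Illustrative linear beta schedule; our models use a cosine schedule (Appendix B).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)               # alpha_bar_t for t = 1..T

def eps_model(x_t, t):
    # Stand-in for the U-Net noise predictor eps_theta(x_t, t).
    return np.zeros_like(x_t)

def l_simple(x0):
    """One Monte Carlo estimate of the 'simple' unweighted loss on a batch x0."""
    b = x0.shape[0]
    t = np.random.randint(1, T + 1, size=b)        # uniform random timestep per example
    a_bar = alpha_bars[t - 1].reshape(b, 1, 1, 1)
    eps = np.random.randn(*x0.shape)               # forward process noise
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2) # ||eps - eps_theta(x_t, t)||^2

x0 = np.random.randn(8, 32, 32, 3)                 # dummy image batch
print(l_simple(x0))
```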
2.2 Conditional Diffusion Models

In the conditional setting, the data $x_0$ comes paired with a conditioning signal $c$, such as a class label or a low resolution image, and our goal is to learn a conditional model $p_\theta(x_0 \mid c)$. To do so, we modify the diffusion model to include $c$ as input to the reverse process:
$$p_\theta(x_{0:T} \mid c) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, c), \qquad p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\big(x_{t-1};\, \mu_\theta(x_t, t, c),\, \Sigma_\theta(x_t, t, c)\big),$$

and the ELBO becomes

$$L_\theta(x_0 \mid c) = \mathbb{E}_q\bigg[L_T(x_0) + \sum_{t>1} D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t, c)\big) - \log p_\theta(x_0 \mid x_1, c)\bigg].$$
The data and conditioning signal (x0 , c) are sampled jointly from the data distribution,
now called q(x0 , c), and the forward process q(x1:T |x0 ) remains unchanged. The only
modification that needs to be made is to inject c as an extra input to the neural network function approximators: instead of µθ(xt, t) we now have µθ(xt, t, c), and likewise for Σθ. The particular architectural choices for injecting these extra inputs depend on the type of conditioning c, as described next.
2.3 Architectures
The current best architectures for image diffusion models are U-Nets (Ronneberger et al., 2015;
Salimans et al., 2017), which are a natural choice to map corrupted data xt to reverse process
parameters (µθ , Σθ ) that have the same spatial dimensions as xt . Scalar conditioning, such
as a class label or a diffusion timestep t, is provided by adding embeddings into intermediate
layers of the network (Ho et al., 2020). Lower resolution image conditioning is provided
by channelwise concatenation of the low resolution image, processed by bilinear or bicubic
upsampling to the desired resolution, with the reverse process input xt , as in the SR3 (Saharia
et al., 2021) and Improved DDPM (Nichol and Dhariwal, 2021) models. See Fig. 3 for an
illustration of the SR3-based architecture that we use in this work.
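As a rough illustration of how this conditioning enters the network, the sketch below upsamples the low resolution image z and concatenates it channelwise with xt before the first U-Net layer; nearest-neighbor upsampling is used here only as a dependency-free stand-in for the bilinear or bicubic interpolation described above, and the function names are our own.

```python
import numpy as np

def upsample_nearest(z, factor):
    # Nearest-neighbor upsampling of an NHWC batch; a stand-in for bilinear/bicubic.
    return z.repeat(factor, axis=1).repeat(factor, axis=2)

def super_res_input(x_t, z):
    """Build the super-resolution U-Net input: concat(x_t, upsample(z)) along channels."""
    factor = x_t.shape[1] // z.shape[1]
    z_up = upsample_nearest(z, factor)
    return np.concatenate([x_t, z_up], axis=-1)

x_t = np.random.randn(4, 64, 64, 3)    # noisy high resolution input
z = np.random.randn(4, 32, 32, 3)      # low resolution conditioning image
print(super_res_input(x_t, z).shape)   # (4, 64, 64, 6)
```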
Figure 3: The U-Net architecture used in each model of a CDM pipeline. The first model is a
class-conditional diffusion model that receives the noisy image xt and the class label y as input. (The class label y and timestep t are injected into each block as an
embedding, not depicted here). The remaining models in the pipeline are class-conditional
super-resolution models that receive xt , y, and an additional upsampled low-resolution
image z as input. The downsampling/upsampling blocks adjust the image input resolution
N × N by a factor of 2 through each of the K blocks. The channel count at each block
is specified using channel multipliers M1 , M2 , ..., MK , and the upsampling pass has
concatenation skip connections to the downsampling pass.
Figure 4: Detailed CDM pipeline for generation of class conditional 256×256 images. The first
model is a class-conditional diffusion model, and it is followed by a sequence of two
class-conditional super-resolution diffusion models. Each model has a U-Net architecture
as depicted in Fig. 3.
3. Conditioning Augmentation in Cascaded Diffusion Models

The most effective technique we found to improve the sample quality of cascading pipelines
is to train each super-resolution model using data augmentation on its low resolution input.
We refer to this general technique as conditioning augmentation. At a high level, for some
super-resolution model pθ (x0 |z) from a low resolution image z to a high resolution image x0 ,
conditioning augmentation refers to applying some form of data augmentation to z. This
augmentation can take any form, but what we found most effective at low resolutions is
adding Gaussian noise (forward process noise), and for high resolutions, randomly applying
Gaussian blur to z. In some cases, we found it more practical to train super-resolution
models amortized over the strength of conditioning augmentation and pick the best strength
in a post-training hyperparameter search for optimal sample quality. Details on conditioning
augmentation and its realization during training and sampling are given in the following
sections.
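As a minimal sketch of the two forms of conditioning augmentation just described, the snippet below corrupts the conditioning input z with forward process noise or random Gaussian blur during super-resolution training; the noise level, blur width, and probability of applying blur are illustrative choices, not the tuned settings from our experiments.

```python
import numpy as np

def gaussian_noise_augment(z, alpha_bar_s):
    """Forward process noise augmentation: z_s ~ q(z_s | z_0) at augmentation level s."""
    eps = np.random.randn(*z.shape)
    return np.sqrt(alpha_bar_s) * z + np.sqrt(1.0 - alpha_bar_s) * eps

def gaussian_blur_augment(z, sigma, p_apply=0.5):
    """Randomly apply a separable Gaussian blur to an NHWC batch with probability p_apply."""
    if np.random.rand() > p_apply:
        return z
    radius = int(3 * sigma) + 1
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    out = z.copy()
    for axis in (1, 2):  # blur over height and width separately
        out = np.apply_along_axis(lambda v: np.convolve(v, kernel, mode="same"), axis, out)
    return out

z = np.random.randn(4, 64, 64, 3)                      # low resolution conditioning batch
z_noise = gaussian_noise_augment(z, alpha_bar_s=0.9)   # noise augmentation (low resolutions)
z_blur = gaussian_blur_augment(z, sigma=0.6)           # blur augmentation (high resolutions)
```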
A two-stage cascading pipeline defines the high resolution marginal $p_\theta(x_0) = \int p_\theta(x_0 \mid z_0)\, p_\theta(z_0)\, dz_0$, where $p_\theta(z_0)$ is the low resolution base model and $p_\theta(x_0 \mid z_0)$ is the super-resolution model. (For simplicity, we assume that the low resolution and super-resolution models both use the same number of timesteps $T$.) Truncated conditioning augmentation refers to truncating the low resolution reverse process so that it stops at timestep $s > 0$ instead of $0$; i.e.,
$$p^s_\theta(x_0) = \int p_\theta(x_0 \mid z_s)\, p_\theta(z_s)\, dz_s = \int p_\theta(x_0 \mid z_s)\, p_\theta(z_{s:T})\, dz_{s:T}. \tag{2}$$

The base model is now $p_\theta(z_s) = \int p_\theta(z_{s:T})\, dz_{s+1:T}$, and the super-resolution model is now $p_\theta(x_0 \mid z_s)$, trained over a forward process $q(x_{1:T} \mid x_0)\, q(z_{1:T} \mid z_0)$
which runs forward processes independently on a low and high resolution pair. The ELBO is
" #
X
s
− log pθ (x0 ) ≤ Eq LT (z0 ) + DKL (q(zt−1 |zt , z0 ) k pθ (zt−1 |zt )) − log pθ (x0 |zs ) ,
t>s
where $L_T(z_0) = D_{\mathrm{KL}}(q(z_T \mid z_0) \,\|\, p(z_T))$. Note that the sum over $t$ is truncated at $s$, and the decoder $p_\theta(x_0 \mid z_s)$ is the super-resolution model conditioned on $z_s$. The decoder itself has an ELBO of the form $-\log p_\theta(x_0 \mid z_s) \le L_\theta(x_0 \mid z_s)$, where
" #
X
Lθ (x0 |zs ) = Eq LT (x0 ) + DKL (q(xt−1 |xt , x0 ) k pθ (xt−1 |xt , zs )) − log pθ (x0 |x1 , zs ) .
t>1
It is apparent that optimizing Eq. (3) trains the low and high resolution models separately.
For a fixed value of s, the low resolution process is trained up to the truncation timestep s,
and the super-resolution model is trained on a conditioning signal corrupted using the low
resolution forward process stopped at timestep s.
In practice, since we pursue sample quality as our main objective, we do not use these
ELBO expressions directly when training models with learnable reverse process variances.
Rather, we train on the “simple” unweighted loss or the hybrid loss described in Section 2; the particular loss used for each model is treated as a hyperparameter and reported in Appendix B.
We would like to search over multiple values of s to select for the best sample quality. To
make this search practical, we avoid retraining models by amortizing a single super-resolution
model over uniform random s at training time. Because each possible truncation time
corresponds to a distinct super-resolution task, the super-resolution model for µθ and Σθ
must take zs as input along with s, and this can be accomplished using a single network
with an extra time embedding input for s. We leave the low resolution model training
unchanged, because the standard diffusion training procedure already trains with random s.
The complete training procedure for a two-stage cascading pipeline is listed in Algorithm 1.
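Below is a minimal sketch of this two-stage training procedure, with the super-resolution model amortized over a uniformly random truncation time s; `base_train_step` and `sr_train_step` are hypothetical placeholders for the actual diffusion training updates, and the linear schedule is illustrative.

```python
import numpy as np

# Illustrative linear schedule; the actual models use a cosine schedule (Appendix B).
T = 4000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)      # alpha_bar_t for t = 1..T

def forward_noise(z0, s):
    """Corrupt z0 to z_s ~ q(z_s | z_0); s = 0 returns z0 unchanged."""
    if s == 0:
        return z0
    a_bar = alpha_bars[s - 1]
    return np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * np.random.randn(*z0.shape)

def train_cascade_step(x0, z0, base_train_step, sr_train_step):
    """One step of two-stage training with truncated conditioning augmentation.

    x0: high resolution batch; z0: the corresponding downsampled low resolution batch.
    """
    # The low resolution base model is trained as a standard diffusion model on z0.
    base_train_step(z0)
    # The super-resolution model is amortized over a random truncation time s:
    # its conditioning input is z_s ~ q(z_s | z_0), and s is fed in as a time embedding.
    s = np.random.randint(0, T + 1)
    sr_train_step(x0, forward_noise(z0, s), s)

# Example usage with no-op training callbacks and naive 2x downsampling:
x0 = np.random.randn(8, 64, 64, 3)
z0 = x0[:, ::2, ::2, :]
train_cascade_step(x0, z0, lambda z: None, lambda x, z, s: None)
```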
Non-truncated conditioning augmentation instead runs the low resolution reverse process all the way to $z_0$ and corrupts its output with the forward process, feeding $z'_s \sim q(z_s \mid z_0)$ into the super-resolution model. Its practical advantage appears when searching over $s$ at sampling time: with non-truncated augmentation, we need to store the low resolution samples just once, since sampling $z'_s \sim q(z_s \mid z_0)$ is computationally inexpensive. These sampling procedures are listed in Algorithm 2.
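A sketch of these two sampling procedures is given below, assuming hypothetical helpers `base_reverse_sample` (which runs the low resolution reverse process from timestep T down to a given stop timestep) and `sr_sample` (which runs the super-resolution model conditioned on the augmented low resolution image and on s).

```python
import numpy as np

def sample_truncated(s, base_reverse_sample, sr_sample):
    """Truncated conditioning augmentation: stop the low resolution reverse process at s."""
    z_s = base_reverse_sample(stop_at=s)     # z_s ~ p_theta(z_s)
    return sr_sample(z_s, s)                 # x_0 ~ p_theta(x_0 | z_s)

def sample_non_truncated(s, base_reverse_sample, sr_sample, alpha_bars):
    """Non-truncated conditioning augmentation: run the full low resolution reverse
    process to z_0, then re-corrupt it with the forward process before super-resolving."""
    z0 = base_reverse_sample(stop_at=0)      # z_0 ~ p_theta(z_0); can be stored and reused
    if s == 0:
        return sr_sample(z0, 0)
    a_bar = alpha_bars[s - 1]
    z_s = np.sqrt(a_bar) * z0 + np.sqrt(1.0 - a_bar) * np.random.randn(*z0.shape)
    return sr_sample(z_s, s)
```

When searching over s, the non-truncated variant reuses the same stored z0 samples for every candidate s, since the re-noising and super-resolution steps are cheap.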
Figure 5: Classwise Synthetic 256×256 ImageNet images. Each row represents a specific
ImageNet class. Classes from top to bottom - Flamingo (130), White Wolf (270),
Tiger (292), Monarch Butterfly (323), Zebra (340) and Dome (538).
Figure 6: Classwise Synthetic 256×256 ImageNet images. Each row represents a specific
ImageNet class. Classes from top to bottom - Greenhouse (580), Model T (661),
Streetcar (829), Comic Book (917), Crossword Puzzle (918), Cheeseburger (933).
4. Experiments
We designed experiments to improve the sample quality metrics of cascaded diffusion models
on class-conditional ImageNet generation. Our cascading pipelines consist of class-conditional
diffusion models at all resolutions, so class information is injected at all resolutions: see Fig. 4.
Our final ImageNet results are described in Section 4.1.
To give insight into our cascading pipelines, we begin with improvements on a baseline
non-cascaded model at the 64×64 resolution (Section 4.2), then we show that cascading up
to 64×64 improves upon our best non-cascaded 64×64 model, but only in conjunction with
conditioning augmentation. We also show that truncated and non-truncated conditioning
augmentation perform equally well (Section 4.3), and we study random Gaussian blur augmen-
tation to train super-resolution models to resolutions of 128×128 and 256×256 (Section 4.4).
Finally, we verify that conditioning augmentation is also effective on the LSUN dataset (Yu
et al., 2015) and therefore is not specific to ImageNet (Section 4.5).
We cropped and resized the ImageNet dataset (Russakovsky et al., 2015) in the same
manner as BigGAN (Brock et al., 2019). We report Inception scores using the standard
practice of generating 50k samples and calculating the mean and standard deviation over 10
splits (Salimans et al., 2016). Generally, throughout our experiments, we selected models and
performed early stopping based on FID score calculated over 10k samples, but all reported
FID scores are calculated over 50k samples for comparison with other work (Heusel et al.,
2017). The FID scores we used for model selection and reporting model performance are
calculated against training set statistics according to common practice, but since this can be
seen as overfitting on the performance metric, we additionally report model performance
using FID scores calculated against validation set statistics. We also report results on
Classification Accuracy Score (CAS), proposed by Ravuri and Vinyals (2019) in light of their findings that non-GAN models may score poorly on FID and IS despite generating visually appealing samples, and that FID and IS are not correlated (and are sometimes anti-correlated) with performance on downstream tasks.
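For reference, CAS is computed by training an image classifier purely on generated samples and measuring its accuracy on the real validation set; the sketch below is schematic, with `generator`, `train_classifier`, and `accuracy` as hypothetical stand-ins for the generative model sampler, the classifier training routine, and top-1 evaluation.

```python
def classification_accuracy_score(generator, train_classifier, accuracy,
                                  val_images, val_labels,
                                  num_classes=1000, num_per_class=1300):
    """Schematic CAS: train a classifier only on generated data, test on real data."""
    synth_images, synth_labels = [], []
    for y in range(num_classes):
        # num_per_class is illustrative, chosen to roughly match the real training set size.
        synth_images.extend(generator(class_id=y, num_samples=num_per_class))
        synth_labels.extend([y] * num_per_class)
    classifier = train_classifier(synth_images, synth_labels)
    return accuracy(classifier, val_images, val_labels)   # top-1 accuracy on real data
```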
classes by BigGAN-deep and VQ-VAE-2 respectively. We also show samples from classes
with the best and worst accuracy scores in Appendix Figures 11 and 12.
Our cascading pipelines are structured as a 32×32 base model, a 32×32→64×64 super-
resolution model, followed by 64×64→128×128 or 64×64→256×256 super-resolution models.
Models at 32×32 and 64×64 resolutions use 4000 diffusion timesteps and architectures similar
to DDPM (Ho et al., 2020) and Improved DDPM (Nichol and Dhariwal, 2021). Models
at 128×128 and 256×256 resolutions use 100 sampling steps, determined by post-training
hyperparameter search (Section 4.4), and they use architectures similar to SR3 (Saharia
et al., 2021). All base resolution and super-resolution models are conditioned on class labels.
See Appendix B for details.
Figure 7: Classwise Classification Accuracy Score comparison between real data (blue) and
generated data (red) at the 256×256 resolution. Accompanies Table 1b.
To set a strong baseline for class-conditional ImageNet generation at the 64×64 resolution,
we reproduced and improved upon a 4000 timestep non-cascaded 64×64 class-conditional
diffusion model from Improved DDPM (Nichol and Dhariwal, 2021). Our reimplementation
used dropout and was trained longer than reported by Nichol and Dhariwal; we found that
adding dropout generally slowed down convergence of FID and Inception scores, but improved
their best values over the course of a longer training period. We further improved the training
set FID score and Inception score by adding noise to the trained model’s samples using the
forward process to the 2000 timestep point, then restarting the reverse process from that
point. See Table 2a for the resulting sample quality metrics.
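A sketch of this noise-and-restart procedure is given below, assuming a hypothetical `reverse_sample_from` helper that runs the learned reverse process from a given timestep down to 0; the linear schedule is illustrative (the 64×64 model itself uses 4000 cosine-schedule timesteps).

```python
import numpy as np

T = 4000
betas = np.linspace(1e-4, 0.02, T)        # illustrative; not the cosine schedule we use
alpha_bars = np.cumprod(1.0 - betas)

def renoise_and_restart(x0_sample, t_restart, reverse_sample_from):
    """Re-noise a finished sample up to t_restart with the forward process, then
    restart the reverse process from that point."""
    a_bar = alpha_bars[t_restart - 1]
    x_t = np.sqrt(a_bar) * x0_sample + np.sqrt(1.0 - a_bar) * np.random.randn(*x0_sample.shape)
    return reverse_sample_from(x_t, t_restart)   # hypothetical reverse-process sampler

# e.g. improved_sample = renoise_and_restart(sample, t_restart=2000, reverse_sample_from=sampler)
```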
Model                                          FID vs train   FID vs validation   IS

32×32 resolution
  CDM (ours)                                   1.11           1.99                26.01 ± 0.59

64×64 resolution
  BigGAN-deep, by (Dhariwal and Nichol, 2021)  4.06
  Improved DDPM (Nichol and Dhariwal, 2021)    2.92
  ADM (Dhariwal and Nichol, 2021)              2.07
  CDM (ours)                                   1.48           2.48                67.95 ± 1.97

128×128 resolution
  BigGAN-deep (Brock et al., 2019)             5.7                                124.5
  BigGAN-deep, max IS (Brock et al., 2019)     25                                 253
  LOGAN (Wu et al., 2019)                      3.36                               148.2
  ADM (Dhariwal and Nichol, 2021)              5.91
  CDM (ours)                                   3.52           3.76                128.80 ± 2.51

256×256 resolution
  BigGAN-deep (Brock et al., 2019)             6.9                                171.4
  BigGAN-deep, max IS (Brock et al., 2019)     27                                 317
  VQ-VAE-2 (Razavi et al., 2019)               31.11
  Improved DDPM (Nichol and Dhariwal, 2021)    12.26
  SR3 (Saharia et al., 2021)                   11.30
  ADM (Dhariwal and Nichol, 2021)              10.94                              100.98
  ADM+upsampling (Dhariwal and Nichol, 2021)   7.49                               127.49
  CDM (ours)                                   4.88           4.63                158.71 ± 2.26
(a) Class-conditional ImageNet sample quality results for classifier guidance-free methods
Table 1: Main results. Numbers are bolded only when at least two are available for compari-
son. CAS for real data and other models are from Ravuri and Vinyals (2019).
Model                                        FID vs train   FID vs validation   IS
Improved DDPM (Nichol and Dhariwal, 2021)    2.92
Our reimplementation                         2.44           2.91                49.81 ± 0.65
+ more sampling steps                        2.35           2.91                52.72 ± 1.15

Conditioning                                 FID vs train   FID vs validation   IS
No cascading                                 2.35           2.91                52.72 ± 1.15
16×16→64×64 cascading
  s = 0                                      6.02           5.84                35.59 ± 1.19
  s = 101                                    3.41           3.67                44.72 ± 1.12
  s = 1001                                   2.13           2.79                54.47 ± 1.05
Without random left-right flips, our best 32×32 resolution FID score is 1.25 at 300k training steps, while with random flips it is 1.11 at 700k training steps. The 32×32→64×64 super-resolution model is now
amortized over the truncation time s by providing s as an extra time embedding input to
the network (Section 2), allowing us to perform a more fine grained search over s without
retraining the model.
Table 3a displays the resulting sample quality scores for both truncated and non-truncated
augmentation. The sample quality metrics improve and then degrade non-monotonically
as the truncation time is increased. This indicates that moderate amounts of conditioning
augmentation are beneficial to sample quality of the cascading pipeline, but too much
conditioning augmentation causes the super-resolution model to behave as a non-conditioned
model unable to benefit from cascading. For comparison, Table 3b shows sample quality when
the super-resolution model is conditioned on ground truth data instead of generated data.
Here, sample quality monotonically degrades as truncation time is increased. Conditioning
augmentation is therefore useful precisely when conditioning on generated samples, so as a
technique it is uniquely suited to cascading pipelines.
Based on these findings on non-monotonicity of sample quality with respect to truncation
time, we conclude that conditioning augmentation works because it alleviates compounding
error from a train-test mismatch for the super-resolution model. This occurs when low-
resolution model samples are out of distribution compared to the ground truth data on
which the super-resolution model is trained. A sufficient amount of Gaussian conditioning
augmentation prevents the super-resolution model from attempting to upsample erroneous,
out-of-distribution details in the low resolution generated samples. In contrast, sample
quality degrades monotonically with respect to truncation time when conditioning the
super-resolution model on ground truth data, because there is no such train-test mismatch.
Table 3a additionally shows that truncated and non-truncated conditioning augmentation
are approximately equally effective at improving sample quality of the cascading pipeline,
albeit at different values of the truncation time parameter. Thus we generally recommend
non-truncated augmentation due to its practical benefits described in Section 3.3.
Conditioning                               FID vs train   FID vs validation   IS

No conditioning augmentation (baseline)
  s = 0                                    1.71           2.46                61.34 ± 1.58

Truncated conditioning augmentation
  s = 251                                  1.50           2.44                66.76 ± 1.76
  s = 501                                  1.48           2.48                67.95 ± 1.97
  s = 751                                  1.48           2.51                68.48 ± 1.77
  s = 1001                                 1.49           2.51                67.95 ± 1.51
  s = 1251                                 1.51           2.54                67.20 ± 1.94
  s = 1501                                 1.54           2.56                67.09 ± 1.67

Non-truncated conditioning augmentation
  s = 251                                  1.58           2.50                66.21 ± 1.51
  s = 501                                  1.53           2.51                67.59 ± 1.85
  s = 751                                  1.48           2.47                67.48 ± 1.31
  s = 1001                                 1.49           2.48                66.51 ± 1.59
  s = 1251                                 1.48           2.46                66.28 ± 1.49
  s = 1501                                 1.50           2.47                65.59 ± 0.86

Conditioning                               FID vs train   FID vs validation   IS

Ground truth training data
  s = 0                                    0.76           1.76                74.84 ± 1.43
  s = 251                                  0.87           1.85                71.79 ± 0.89
  s = 501                                  0.92           1.91                70.68 ± 1.26
  s = 751                                  0.95           1.94                69.93 ± 1.40
  s = 1001                                 0.98           1.97                69.03 ± 1.26
  s = 1251                                 1.03           1.99                67.92 ± 1.65
  s = 1501                                 1.11           2.04                66.7 ± 1.21

Ground truth validation data
  s = 0                                    1.20           0.59                64.33 ± 1.24
  s = 251                                  1.27           0.96                63.17 ± 1.19
  s = 501                                  1.32           1.17                62.65 ± 0.76
  s = 751                                  1.38           1.32                62.21 ± 0.94
  s = 1001                                 1.42           1.44                61.53 ± 1.39
  s = 1251                                 1.47           1.54                60.58 ± 0.93
  s = 1501                                 1.53           1.64                60.02 ± 0.84
Table 3: 64×64 ImageNet sample quality: large scale experiment comparing truncated
and non-truncated conditioning augmentation for 32×32→64×64 cascading, using
amortized truncation time conditioning.
Figure 10: FID on 256×256 images vs inference steps in 64×64 → 256×256 super-resolution.
Blur σ                   FID vs train   FID vs validation   IS
σ = 0 (no blur)          7.26           6.42                134.53 ± 2.97
σ ∼ U(0.4, 0.6)          6.18           5.57                142.71 ± 2.83
σ ∼ U(0.4, 0.8)          6.90           6.31                136.57 ± 4.34
σ ∼ U(0.4, 1.0)          6.35           5.76                141.40 ± 4.34

Model                    FID vs train   FID vs validation   IS
Baseline                 6.18           5.57                142.71 ± 2.83
+ Class Conditioning     5.75           5.27                152.17 ± 2.29
+ Large Batch Training   5.00           4.71                157.84 ± 2.60
+ Flip LR                4.88           4.63                158.71 ± 2.26
Although the class label should in principle be inferable from the low resolution input, we find it interesting that class conditioning still gives a huge boost to upsampling performance at high resolutions, even when the low resolution inputs at 64×64 can be sufficiently informative. We also found that increasing the training batch size from 256 to 1024 further improved performance by a significant margin, and we obtain additional marginal improvements by training the super-resolution model on randomly flipped data.
Since the sampling cost increases quadratically with the target image resolution, we
attempt to minimize the number of denoising iterations for our 64×64 → 256×256 and
64×64 → 128×128 super-resolution models. To this end, we train these super-resolution
models with continuous noise conditioning, like Saharia et al. (2021) and Chen et al. (2021),
and tune the noise schedule for a given number of steps during inference. This tuning is
relatively inexpensive as we do not need to retrain the models. We report all results using
100 inference steps for these models. Figure 10 shows FID vs number of inference steps
for our 64×64 → 256×256 model. The FID score deteriorates marginally even when using
just 4 inference steps. Interestingly, we do not observe any concrete improvement in FID by
increasing the number of inference steps from 100 to 1000.
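As a sketch of what this post-training tuning looks like, the snippet below builds a shortened inference schedule of noise levels for a model conditioned on continuous noise levels and grid-searches over the number of steps; the log-spaced schedule and the candidate grid are illustrative assumptions, and `sample_with_schedule` and `fid` stand in for the trained sampler and the FID evaluation.

```python
import numpy as np

def make_inference_schedule(num_steps, alpha_bar_min=1e-4):
    """A shortened inference schedule: alpha_bar values from 1 (clean) down to near 0
    (pure noise), log-spaced; other spacings can be tried since no retraining is needed."""
    return np.exp(np.linspace(0.0, np.log(alpha_bar_min), num_steps + 1))

def tune_num_steps(candidates, sample_with_schedule, fid):
    """Pick the step count whose samples achieve the best FID."""
    scores = {n: fid(sample_with_schedule(make_inference_schedule(n))) for n in candidates}
    return min(scores, key=scores.get), scores

# e.g. best, scores = tune_num_steps([4, 8, 16, 25, 100, 1000], sampler_fn, fid_fn)
```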
5. Related Work
One way to formulate cascaded diffusion models is to modify the original diffusion formalism
of a forward process q(x0:T) at a single resolution so that the transition q(xt|xt−1) performs
downsampling at certain intermediate timesteps, for example at t ∈ S := {T /4, 2T /4, 3T /4}.
The reverse process would then be required to perform upsampling at those timesteps, similar
to our cascaded models here. However, there is no guarantee that the reverse transitions
at the timesteps in S are conditional Gaussian, unlike the guarantee for reverse transitions
at other timesteps for sufficiently slow diffusion. By contrast, our cascaded diffusion model
construction dedicates entire conditional diffusion models for these upsampling steps, so it is
specified more flexibly.
Conditioning    FID vs train   FID vs validation

LSUN Bedroom
  s = 0         2.30           40.68
  s = 251       2.06           40.47
  s = 501       2.08           40.44
  s = 751       2.14           40.45
  s = 1001      2.18           40.53
  s = 1251      2.24           40.58
  s = 1501      2.28           40.58

LSUN Church
  s = 0         3.29           42.21
  s = 251       2.97           42.14
  s = 501       2.93           42.17
  s = 751       2.89           42.20
  s = 1001      2.86           42.26
  s = 1251      2.83           42.28
  s = 1501      2.84           42.31
Recent interest in diffusion models (Sohl-Dickstein et al., 2015) started with work con-
necting diffusion models to denoising score matching over multiple noise scales (Ho et al.,
2020; Song and Ermon, 2019). There have been a number of improvements and alternatives
proposed to the diffusion framework, for example generalization to continuous time (Song
et al., 2021b), deterministic sampling (Song et al., 2021a), adversarial training (Jolicoeur-
Martineau et al., 2021), and others (Gao et al., 2021). For simplicity, we base our models on
DDPM (Ho et al., 2020) with modifications from Improved DDPM (Nichol and Dhariwal,
2021) to stay close to the original diffusion framework.
Cascading pipelines have been investigated in work on VQ-VAEs (van den Oord et al.,
2016c; Razavi et al., 2019) and autoregressive models (Menick and Kalchbrenner, 2019).
Cascading pipelines have also been investigated for diffusion models, such as SR3 (Saharia
et al., 2021), Improved DDPM (Nichol and Dhariwal, 2021), and concurrently in ADM (Dhari-
wal and Nichol, 2021). Our work here focuses on improving cascaded diffusion models for
ImageNet generation and is distinguished by the extensive study on conditioning augmenta-
tion and deeper cascading pipelines. Our conditioning augmentation work also resembles
scheduled sampling in autoregressive sequence generation (Bengio et al., 2015), where noise
is used to alleviate the mismatch between train and inference conditions.
Concurrent work (Dhariwal and Nichol, 2021) showed that diffusion models are capable
of generating high quality ImageNet samples using an improved architecture, named ADM,
and a classifier guidance technique in which a class-conditional diffusion model sampler is
modified to simultaneously take gradient steps to maximize the score of an extra trained image
classifier. By contrast, our work focuses solely on improving sample quality by cascading, so
we avoid introducing extra model elements such as the image classifier. We are interested in
avoiding classifier guidance because the FID and Inception score sample quality metrics that
we use to evaluate our models are themselves computed on activations of an image classifier
trained on ImageNet, and therefore classifier guidance runs the risk of cheating these metrics.
Avoiding classifier guidance comes at the expense of using thousands of diffusion timesteps
in our low resolution models, where ADM uses hundreds. ADM with classifier guidance
outperforms our models in terms of FID and Inception scores, while our models outperform
ADM without classifier guidance as reported by Dhariwal and Nichol. Our work is a showcase
of the effectiveness of cascading alone in a pure generative model, and since classifier guidance
and cascading complement each other as techniques to improve sample quality and can be
applied together, we expect classifier guidance would improve our results too.
6. Conclusion
We have shown that cascaded diffusion models are capable of outperforming state-of-the-art
generative models on the ImageNet class-conditional generation benchmark when paired
with conditioning augmentation, our technique of introducing data augmentation into the
conditioning information of super-resolution models. Our models outperform BigGAN-deep
and VQ-VAE-2 as measured by FID score and classification accuracy score. We found that
conditioning augmentation helps sample quality because it combats compounding error in
cascading pipelines due to train-test mismatch in super-resolution models, and we proposed
practical methods to train and test models amortized over varying levels of conditioning
augmentation.
Although our work could have negative impact in the form of malicious uses of image generation, it also has the potential to improve beneficial downstream applications
such as data compression while advancing the state of knowledge in fundamental machine
learning problems. We see our results as a conceptual study of the image synthesis capabilities
of diffusion models in their original form with minimal extra techniques, and we hope our
work serves as inspiration for future advances in the capabilities of diffusion models.
Acknowledgments
We thank Jascha Sohl-Dickstein, Douglas Eck and the Google Brain team for feedback,
research discussions and technical assistance.
Appendix A. Samples
Figure 11: Samples from classes with best relative classification accuracy score. Each row represents
a specific ImageNet class. Classes from top to bottom - Tiger Cat (282), Gong (577),
Coffee Mug (504), Squirrel Monkey (382), Miniature Schnauzer (196) and Corn (987).
Figure 12: Samples from classes with worst relative classification accuracy score. Each row represents
a specific ImageNet class. Classes from top to bottom - Letter Opener (623), Plate (923),
Overskirt (689), Tobacco Shop (860), Black-and-tan Coonhound (165) and Bathtub
(435).
Figure 13: Samples from LSUN 128×128: bedroom subset (first six rows) and church subset (last six rows).
Appendix B. Hyperparameters
B.1 ImageNet
Here we give the hyperparameters of the models in our ImageNet cascading pipelines. Each
model in the pipeline is described by its diffusion process, its neural network architecture, and
its training hyperparameters. Architecture hyperparameters, such as the base channel count
and the list of channel multipliers per resolution, refer to hyperparameters of the U-Net in
DDPM and related models (Ho et al., 2020; Nichol and Dhariwal, 2021; Saharia et al., 2021;
Salimans et al., 2017). The cosine noise schedule and the hybrid loss method of learning re-
verse process variances are from Improved DDPM (Nichol and Dhariwal, 2021). Some models
are conditioned on ᾱt for post-training sampler tuning (Chen et al., 2021; Saharia et al., 2021).
32×32 base model

• Architecture
  – Base channels: 256
  – Channel multipliers: 1, 2, 3, 4
  – Residual blocks per resolution: 6
  – Attention resolutions: 8, 16
  – Attention heads: 4

• Diffusion
  – Timesteps: 4000
  – Noise schedule: cosine
  – Learned variances: yes
  – Loss: hybrid

• Training
  – Optimizer: Adam
  – Batch size: 2048
  – Learning rate: 1e-4
  – Steps: 700000
  – Dropout: 0.1
  – EMA: 0.9999
  – Hardware: 256 TPU-v3 cores
32×32→64×64 super-resolution
• Architecture
  – Base channels: 256
  – Channel multipliers: 1, 2, 3, 4
  – Residual blocks per resolution: 5
  – Attention resolutions: 8, 16
  – Attention heads: 4

• Diffusion
  – Timesteps: 4000
  – Noise schedule: cosine
  – Learned variances: yes
  – Loss: hybrid

• Training
  – Optimizer: Adam
  – Batch size: 2048
  – Learning rate: 1e-4
  – Steps: 400000
  – Dropout: 0.1
  – EMA: 0.9999
  – Hardware: 256 TPU-v3 cores
64×64→128×128 super-resolution
64×64→256×256 super-resolution
B.2 LSUN
Here we give the hyperparameters of our LSUN Bedroom and Church pipelines. We used
the same hyperparameters for both datasets.
64×64 base model
64×64→128×128 super-resolution
References
S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction
with recurrent neural networks. Advances in Neural Information Processing Systems, 2015.
A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural
image synthesis. In International Conference on Learning Representations, 2019.
P. Dhariwal and A. Nichol. Diffusion models beat GANs on image synthesis. arXiv preprint
arXiv:2105.05233, 2021.
L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using Real NVP. International Conference on Learning Representations, 2017.
D. P. Kingma and P. Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In
Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
J. Menick and N. Kalchbrenner. Generating high fidelity images with subscale pixel networks
and multidimensional upscaling. In International Conference on Learning Representations,
2019.
M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent
neural networks. International Conference on Learning Representations, 2016.
S. Ravuri and O. Vinyals. Classification Accuracy Score for Conditional Generative Models.
In Advances in Neural Information Processing Systems, volume 32, 2019.
A. Razavi, A. van den Oord, and O. Vinyals. Generating diverse high-fidelity images with
VQ-VAE-2. In Advances in Neural Information Processing Systems, pages 14837–14847,
2019.
Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution.
In Advances in Neural Information Processing Systems, pages 11895–11907, 2019.
Y. Song and S. Ermon. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 2020.
A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu. Pixel recurrent neural networks.
International Conference on Machine Learning, 2016b.
A. van den Oord, O. Vinyals, and K. Kavukcuoglu. Neural discrete representation learning.
Advances in Neural Information Processing Systems, 2017.
F. Yu, Y. Zhang, S. Song, A. Seff, and J. Xiao. LSUN: Construction of a large-scale image
dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365,
2015.