Diffusion Models
• Recall for a vector x = (x1, …, xn): ||x||² = Σ i=1…n xi²
Natural training strategy
• Recall that during inference we will create xt from xt+1. So let's try to mimic that during training.
• Recall that for training data xt = xt-1 + εt. So xt-1 = xt − εt
    minθ Ex0,t[ ||fθ(xt, t) − xt-1||² ] = minθ (1/MT) Σ x0 Σ t=1…T ||fθ(xt, t) − xt-1||²
Repeat:
• Randomly select a clean image x0. Randomly select t ∈ {1, 2, …, T}. Grab the corresponding xt and xt-1 from our data set of MT images.
• θ ← θ − η ∇θ ||fθ(xt, t) − xt-1||²
• Note that instead of learning to denoise all at once, the model learns to make small corrections at each step. These are much easier to learn (see the sketch below).
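A minimal PyTorch-style sketch of this training loop, assuming the MT noisy images are stored as per-image chains x0, …, xT and that f_theta(x_t, t) predicts x_{t-1}; all names here are illustrative, not the paper's code.

```python
import torch

def train_step(f_theta, optimizer, chains, T):
    """One SGD step of the 'predict x_{t-1} from x_t' strategy.

    chains: tensor of shape (M, T+1, C, H, W) holding x_0, ..., x_T
            for each of the M training images (an assumed layout).
    """
    M = chains.shape[0]
    m = torch.randint(0, M, (1,)).item()        # random clean image x_0
    t = torch.randint(1, T + 1, (1,)).item()    # random step t in {1, ..., T}
    x_t, x_prev = chains[m, t], chains[m, t - 1]

    pred = f_theta(x_t.unsqueeze(0), torch.tensor([t]))  # estimate of x_{t-1}
    loss = ((pred - x_prev.unsqueeze(0)) ** 2).mean()    # squared-error loss

    optimizer.zero_grad()
    loss.backward()        # theta <- theta - lr * gradient, applied by the optimizer
    optimizer.step()
    return loss.item()
```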
In practice, estimate noise instead of image
• In the original algorithm, the network estimates the noise εt instead of estimating xt-1 given xt
• Recall that in the training set εt = xt − xt-1, t = 1, 2, …, T
• Training goal: estimate εt from xt without knowing xt-1
• Use a neural network ε̂ = εθ(xt, t) to estimate the noise εt
• MSE: Ex0,t[ ||ε̂ − εt||² ] = Ex0,t[ ||εθ(xt, t) − (xt − xt-1)||² ]
• In practice the noise is scaled at each step: xt = √(1 − βt) · xt-1 + √βt · εt,
  where βt is small, between 10⁻⁵ and 10⁻¹. In the paper, β1 = 10⁻⁴ and βT = 0.02
• This makes the signal relatively weak as t gets larger, thereby getting closer to white noise at t = T
• Equivalently, we can write:
    xt = √ᾱt · x0 + √(1 − ᾱt) · ε,   ε ~ N(0, I)
  where αt := 1 − βt and ᾱt := Π s=1…t αs
• Note that ᾱt → 0 as t → ∞, so xt → ε ~ N(0, I) as t → ∞
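In code, the closed form above lets us jump straight from x0 to xt without simulating the chain step by step. A hedged sketch of the resulting noise-prediction loss, using the linear β schedule quoted above (β1 = 10⁻⁴ to βT = 0.02, T = 1000); eps_model is an assumed noise-prediction network εθ(xt, t):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)         # beta_1, ..., beta_T (linear schedule)
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)     # alpha_bar_t = prod_{s<=t} alpha_s

def noise_prediction_loss(eps_model, x0):
    """MSE between predicted and true noise, using the closed form for x_t."""
    B = x0.shape[0]
    t = torch.randint(0, T, (B,))                        # random step per example
    eps = torch.randn_like(x0)                           # eps ~ N(0, I)
    a_bar = alpha_bars[t].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps   # x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
    return ((eps_model(x_t, t) - eps) ** 2).mean()
```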
Finally, arrive at the paper’s training and inference algorithms
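Training is essentially the noise-prediction loss sketched above. For the inference side, here is a hedged sketch of the reverse (sampling) loop with σt² = βt, one of the choices in the paper; it reuses the schedule tensors and the eps_model assumed in the previous sketch.

```python
import torch

@torch.no_grad()
def sample(eps_model, shape, betas, alphas, alpha_bars):
    """Generate samples by running the reverse (denoising) chain."""
    T = betas.shape[0]
    x = torch.randn(shape)                               # x_T ~ N(0, I)
    for t in reversed(range(T)):                         # t = T-1, ..., 0 (0-indexed steps)
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps_hat = eps_model(x, t_batch)                  # predicted noise eps_theta(x_t, t)
        coef = (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt()
        mean = (x - coef * eps_hat) / alphas[t].sqrt()   # estimated mean of x_{t-1}
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)  # add sigma_t * z, sigma_t^2 = beta_t
        else:
            x = mean                                     # no noise added at the final step
    return x
```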
U-Net
• Encoder: downsamples the data; this downsampling mirrors what is done in traditional CNN architectures.
• The receptive field is increased at each block: as the resolution decreases, each convolution sees a larger portion of the original image.
• Decoder: upsamples the data, enables precise localization, and allows for obtaining an output that is the same size as the input.
• Skip connections ("copy and crop") to every upsampling layer from the corresponding downsampling layer retain information that is lost during encoding.
[Fig. 1 from Ronneberger et al., 2015: U-net architecture (example for 32x32 pixels in the lowest resolution). Each blue box corresponds to a multi-channel feature map; the number of channels is denoted on top of the box and the x-y size at its lower left edge. Operations include max pool 2x2, up-conv 2x2, conv 1x1, and copy and crop.]
U-Net | Architecture
Each block in the encoder (downsampling block) consists of a ResNet block, self-
attention, and a downsampling convolution block. The ResNet block consists of two
(or more) sets of group normalization, activation, and a convolution layer. The
decoder has an identical structure, upsampling the representations by mirroring the
downsampling of the encoder. The SiLU activation function is defined as SiLU(x) = x *
sigmoid(x). Figure from the DDPM paper.
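A minimal sketch of one such ResNet block, assuming PyTorch; channel counts, the number of normalization groups, and where the timestep embedding is injected vary across implementations, so treat these as illustrative choices.

```python
import torch
from torch import nn

class ResNetBlock(nn.Module):
    """GroupNorm -> SiLU -> Conv, twice, plus a residual connection."""
    def __init__(self, in_ch, out_ch, groups=8):
        super().__init__()
        self.block1 = nn.Sequential(
            nn.GroupNorm(groups, in_ch), nn.SiLU(),        # SiLU(x) = x * sigmoid(x)
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        )
        self.block2 = nn.Sequential(
            nn.GroupNorm(groups, out_ch), nn.SiLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        )
        # 1x1 conv so the skip path matches the output channel count
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if in_ch != out_ch else nn.Identity()

    def forward(self, x):
        h = self.block2(self.block1(x))
        return h + self.skip(x)                            # residual connection
```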
U-Net | Why condition on timestep?
In the forward diffusion process, the noise level depends entirely on t:
• At early steps (small t), noise is small
• At later steps (large t), xt is almost pure noise
So, at different t, the denoising task is different:
• At t = 1: you’re removing a little bit of noise
• At t = 900: you’re trying to generate structure from nearly complete noise
The model must know t to generalize across time: instead of training 1000 separate
denoisers, we share parameters and pass t as an embedding, so one model handles all
steps.
U-Net | Conditioning on Timestep
The position embedding is computed as follows:
    emb(t) = [ sin(t / 10000^(2i/d)), cos(t / 10000^(2i/d)) ],   i = 0, …, d/2 − 1
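A small sketch of this embedding, assuming PyTorch; batching and dtype details are illustrative.

```python
import torch

def timestep_embedding(t, d):
    """Sinusoidal embedding of integer timesteps t; output shape (len(t), d)."""
    i = torch.arange(d // 2, dtype=torch.float32)       # i = 0, ..., d/2 - 1
    freqs = 10000.0 ** (-2.0 * i / d)                   # 1 / 10000^(2i/d)
    angles = t.float()[:, None] * freqs[None, :]        # t / 10000^(2i/d)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
```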
Cross Attention
• Additional inputs can be fed to the
neural network as conditions in
several ways.
• One example is shown here where
text is embedded using a CLIP
model, projected with a linear layer,
and cross-attention is computed
between the resulting embedding
and the representations from the
ResNet block.
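A hedged sketch of that conditioning path, assuming the CLIP text embeddings are already computed and using PyTorch's built-in multi-head attention; the projection sizes and module names are illustrative.

```python
import torch
from torch import nn

class TextCrossAttention(nn.Module):
    """Image features attend to projected text embeddings (queries come from the image)."""
    def __init__(self, img_ch, txt_dim, heads=8):
        super().__init__()
        self.proj = nn.Linear(txt_dim, img_ch)             # project CLIP embedding to img_ch
        self.attn = nn.MultiheadAttention(img_ch, heads, batch_first=True)

    def forward(self, feat, text_emb):
        # feat: (B, C, H, W) from a ResNet block; text_emb: (B, L, txt_dim) from CLIP
        B, C, H, W = feat.shape
        q = feat.flatten(2).transpose(1, 2)                # (B, H*W, C) image tokens as queries
        kv = self.proj(text_emb)                           # (B, L, C) text tokens as keys/values
        out, _ = self.attn(q, kv, kv)                      # cross-attention
        return out.transpose(1, 2).reshape(B, C, H, W)     # back to feature-map shape
```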
Classifier-Free Guidance
Training
• Let the condition be y (e.g. text embedding)
• During training, randomly replace y with null (e.g. an embedding of zeros) some
percentage of the time.
• The model learns both:
• ϵθ(xt, t, y) (conditional)
• ϵθ(xt, t, ∅) (unconditional)
Inference
• At each step, mix the two predictions:
ϵguided = (1 + w) ⋅ ϵθ(xt, t, y) − w ⋅ ϵθ(xt, t, ∅), where w is the guidance scale (e.g. 1.5–3.0)
• Higher w means stronger guidance (more faithful but possibly less diverse)
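A small sketch of the guided prediction at one sampling step, assuming the model signature eps_model(x_t, t, cond) and an all-zeros null embedding (both assumptions); the mixing formula is the one above.

```python
import torch

def guided_eps(eps_model, x_t, t, text_emb, w=2.0):
    """Classifier-free guidance: mix conditional and unconditional noise predictions."""
    eps_cond = eps_model(x_t, t, text_emb)                      # eps_theta(x_t, t, y)
    eps_uncond = eps_model(x_t, t, torch.zeros_like(text_emb))  # eps_theta(x_t, t, null)
    return (1 + w) * eps_cond - w * eps_uncond                  # w is the guidance scale
```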
Latent Diffusion Models
• Diffusion on latent representations of the images, which tend to be
much smaller in size than the images themselves
[Figure 3 from the latent diffusion paper (Rombach et al., 2022): the image x is mapped by an encoder E to a latent, the diffusion process and denoising U-Net operate in latent space, and a decoder D maps back to pixel space (D(E(x))); conditioning inputs such as text or semantic maps enter the denoising U-Net via cross-attention, concatenation, or a switch, with skip connections inside the U-Net.]
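A very rough sketch of the idea, assuming a pretrained autoencoder with hypothetical encode/decode methods and reusing the sample function sketched earlier; this illustrates "diffuse in latent space" rather than any particular LDM codebase.

```python
import torch

@torch.no_grad()
def generate_with_ldm(autoencoder, eps_model, betas, alphas, alpha_bars,
                      latent_shape=(1, 4, 32, 32)):
    """Run the reverse diffusion chain in latent space, then decode to pixels."""
    z0 = sample(eps_model, latent_shape, betas, alphas, alpha_bars)  # 'sample' from the earlier sketch
    return autoencoder.decode(z0)          # hypothetical decoder D: latent -> image

# Training works the same way, except images are first mapped to latents with a
# hypothetical autoencoder.encode(x), and the noise-prediction loss is computed
# on those latents instead of on raw pixels.
```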
Diffusion Transformer (DiT)
• The U-Net backbone can be replaced by any neural network
architecture
• Another common architecture used in diffusion models is the
Transformer
[Figure: DiT architecture. Left: the noised latent (32 x 32 x 4) is patchified and embedded into input tokens, conditioned on timestep t and label y, and processed by N DiT blocks followed by layer norm and a linear-and-reshape layer that outputs the predicted noise and Σ. Right: three DiT block variants — adaLN-Zero (scale and shift from the conditioning), cross-attention, and in-context conditioning (conditioning tokens concatenated on the sequence dimension) — each built from layer norm, multi-head self-attention, and a pointwise feedforward layer.]
Figure 3 (Peebles et al., 2023). The Diffusion Transformer (DiT) architecture. Left: We train conditional latent DiT models. The input latent is decomposed into patches and processed by several DiT blocks. Right: Details of the DiT blocks. Variants of standard transformer blocks incorporate conditioning via adaptive layer norm, cross-attention, and extra input tokens. Adaptive layer norm works best.
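A small sketch of the "decomposed into patches" step, assuming a 32 x 32 x 4 latent and a patch size of 2 (illustrative values); DiT then applies a learned linear embedding to these tokens before the transformer blocks.

```python
import torch
from torch import nn

def patchify(z, patch=2):
    """Turn a (B, C, H, W) latent into a (B, N, patch*patch*C) token sequence."""
    B, C, H, W = z.shape
    z = z.reshape(B, C, H // patch, patch, W // patch, patch)
    z = z.permute(0, 2, 4, 3, 5, 1)                       # (B, H/p, W/p, p, p, C)
    return z.reshape(B, (H // patch) * (W // patch), patch * patch * C)

z = torch.randn(1, 4, 32, 32)                             # noised latent
tokens = patchify(z)                                      # (1, 256, 16) input tokens
embed = nn.Linear(16, 768)                                # illustrative hidden size
x = embed(tokens)                                         # ready for the DiT blocks
```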