Diffusion Models

Diffusion models generate images by iteratively refining a noisy image, starting from white noise and using a trained neural network to predict denoised images at each step. The training process involves creating noisy versions of images from a dataset and minimizing the mean-squared error between predicted and actual denoised images. The architecture often utilizes U-Net, which includes downsampling and upsampling layers with skip connections to retain information, and conditions on the noise level to adapt to varying degrees of noise during the denoising process.


Diffusion models

• Train a model with a dataset of images (e.g., faces)
• The diffusion model generates images from the distribution of the training data set
• Similar functionality to a GAN, but achieved in a very different way
• We will cover the classic paper "Denoising Diffusion Probabilistic Models," Dec 2020, Ho, Jain, Abbeel, 21,000 citations
Diffusion model at inference: Iterative process
• Begin with a white-noise image xT
• From xT create xT-1; from xT-1 create xT-2, …; from x1 create x0
• Use a trained neural network εθ(xt, t) at each iteration
Discussion questions
Suppose we take a single image of a face and add some random noise covering the nose.
• Would a human be able to fix the image (that is denoise or partially
denoise the image)?
• Without any knowledge/context/training, can a computer fix the
image (denoise or partially denoise the image)?
• If we have a large dataset of many faces, how can we get the
computer to successfully denoise images?
Paper’s training and inference algorithms

Let’s try to explain these algorithms!


First discuss a simplified diffusion algorithm
• For pedagogic reasons, we will first describe a simplified diffusion-model algorithm.
• Then we will discuss the real thing.
Training data
• Start with a dataset of M images (say human faces)
• For each image x0 in the data set, create T new noisy images:
  • xt = xt-1 + εt,  t = 1, …, T
  • where εt is a vector with each component randomly sampled from N(0, I)
• So x0 is the clean image, x1 is a slightly noisy image, xT looks like white noise.
• The training data becomes M·T images. T = 1000 is a typical value.

Recall for a vector x: ||x||² = Σ_{i=1}^{n} x_i²
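A minimal sketch of this simplified noising process, assuming PyTorch; the function name and the stand-in "image" tensor are hypothetical, not from the slides:

```python
import torch

def make_noisy_chain(x0, T=1000):
    """Simplified noising: xt = xt-1 + eps_t, with eps_t ~ N(0, I).

    x0 is a clean image tensor (e.g., shape [C, H, W]).
    Returns [x0, x1, ..., xT], increasingly noisy copies of x0.
    """
    chain = [x0]
    x = x0
    for t in range(1, T + 1):
        eps = torch.randn_like(x)   # each component sampled from N(0, 1)
        x = x + eps                 # xt = xt-1 + eps_t
        chain.append(x)
    return chain

# Hypothetical usage: a stand-in 3x64x64 "image"; xT ends up looking like white noise
x0 = torch.rand(3, 64, 64)
noisy_images = make_noisy_chain(x0, T=1000)
```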
Natural training strategy
• Recall that during inference we will create xt from xt+1 . So let’s try to mimic that
during training.
• Recall for training data xt = xt-1 + εt. So xt-1 = xt - εt
  • where εt is a vector with each component randomly sampled from N(0, I)
• Can we estimate xt-1 from xt without knowing εt?
• Yes: use a neural network to estimate xt-1:  x̂t-1 = fθ(xt, t)
• Minimize MSE:  min_θ Ex0,t[ ||x̂t-1 - xt-1||² ] = min_θ Ex0,t[ ||fθ(xt, t) - xt-1||² ]

xt  →  [denoising neural network]  →  x̂t-1 = fθ(xt, t)
Possible training and inference algorithms
Training:
Recall: we want to find θ that minimizes the mean-squared denoising error:

  min_θ Ex0,t[ ||fθ(xt, t) - xt-1||² ],  i.e., minimize  Σ_{x0 in data set} Σ_{t=1}^{T} ||fθ(xt, t) - xt-1||²

Repeat:
• Randomly select a clean image x0. Randomly select t ∈ {1, 2, …, T}. Grab the corresponding xt and xt-1 from our data set of M·T images.
• θ ← θ - α ∇θ ||fθ(xt, t) - xt-1||²

Inference (after training):
1. Initialize xT ~ N(0, I)
2. For t = T, T-1, …, 1
3.   Set xt-1 = fθ(xt, t)
4. Return x0 as the generated image
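A compact sketch of this simplified training step and sampling loop, assuming PyTorch; `model`, `optimizer`, and the (xt, xt-1, t) triples are hypothetical placeholders for whatever network and data pipeline is used:

```python
import torch

def train_step(model, optimizer, x_t, x_prev, t):
    """One gradient step of the simplified objective: predict xt-1 from (xt, t)."""
    pred = model(x_t, t)                    # f_theta(x_t, t), estimate of x_{t-1}
    loss = ((pred - x_prev) ** 2).mean()    # mean-squared denoising error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                        # theta <- theta - alpha * grad
    return loss.item()

@torch.no_grad()
def sample(model, T=1000, shape=(1, 3, 64, 64)):
    """Simplified inference: start from white noise and denoise step by step."""
    x = torch.randn(shape)                  # 1. x_T ~ N(0, I)
    for t in range(T, 0, -1):               # 2. t = T, T-1, ..., 1
        x = model(x, t)                     # 3. x_{t-1} = f_theta(x_t, t)
    return x                                # 4. x_0 is the generated image
```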
Comments
• Main idea of diffusion models:
  • Predict the denoised image at each step t.
  • Use a neural network for the prediction.
  • Do this over the training set by minimizing MSE.
• After training, we have a denoising neural network, which we can use at inference, beginning with randomly generated noise.
• During inference we need to do a forward pass through the denoising neural network T times. T may be equal to 1000 or more!
• Note that we are doing regression on all pixels in the target xt-1.
• Note that instead of learning to denoise all at once, the model learns to make small corrections at each step. These are much easier to learn.
In practice, estimate noise instead of image
• In the original algorithm, they estimate the noise instead of estimating xt-1 given xt
• Recall in the training set εt = xt - xt-1,  t = 1, 2, …, T
• Training goal: estimate εt from xt without knowing xt-1
• Use a neural network ε̂t = εθ(xt, t) to estimate the noise εt
• MSE:  Ex0,t[ ||ε̂t - εt||² ] = Ex0,t[ ||εθ(xt, t) - (xt - xt-1)||² ]

Our inference algorithm becomes:
1. Initialize xT ~ N(0, I)
2. For t = T, T-1, …, 1
3.   Set xt-1 = xt - εθ(xt, t)
4. Return x0 as the generated image
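The same sketch rewritten for the noise-prediction variant (again assuming PyTorch and a hypothetical model that takes (xt, t)):

```python
import torch

def noise_prediction_loss(model, x_t, x_prev, t):
    """MSE between the predicted noise and the noise actually added at step t."""
    eps_target = x_t - x_prev               # eps_t = x_t - x_{t-1}
    eps_pred = model(x_t, t)                # eps_theta(x_t, t)
    return ((eps_pred - eps_target) ** 2).mean()

@torch.no_grad()
def sample_noise_prediction(model, T=1000, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)                  # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        x = x - model(x, t)                 # x_{t-1} = x_t - eps_theta(x_t, t)
    return x
```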
Ours versus the original

Our inference algorithm:
1. xT ~ N(0, I)
2. For t = T, T-1, …, 1
3.   Set xt-1 = xt - εθ(xt, t)
4. Return x0 as the generated image

Paper's inference algorithm: [algorithm box reproduced from the paper]

The algorithms are close but not exactly the same.
We will now explain the differences.
Towards their algorithms
• Note that in the training data xt = x0 + ε1 + ε2 + … + εt, where each εj is N(0, I)
• But ε1 + ε2 + … + εt is N(0, tI). Also, √t·ε with ε ~ N(0, I) is N(0, tI)
• So we can instead define xt = x0 + √t·ε where ε ~ N(0, I), and preserve the distribution

• So during training, we can instead use Ex0,ε[ ||εθ(x0 + √t·ε, t) - ε||² ] in place of Ex0,t[ ||εθ(xt, t) - (xt - xt-1)||² ]

Training: Ours (as on the previous slides)
Repeat:
1. Select x0 from the data set, select t, grab xt and xt-1
2. Take a gradient step on ∇θ ||εθ(xt, t) - (xt - xt-1)||²

Training: Theirs
Repeat:
1. Select x0 from the data set
2. t ~ Uniform{1, …, T}
3. ε ~ N(0, I)
4. Take a gradient step on ∇θ ||ε - εθ(x0 + √t·ε, t)||²
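A sketch of the "Training: Theirs" step above, assuming PyTorch and a hypothetical noise-prediction model:

```python
import torch

def simplified_theirs_train_step(model, optimizer, x0, T=1000):
    """One gradient step on || eps - eps_theta(x0 + sqrt(t)*eps, t) ||^2."""
    t = torch.randint(1, T + 1, (1,)).item()    # t ~ Uniform{1, ..., T}
    eps = torch.randn_like(x0)                  # eps ~ N(0, I)
    x_t = x0 + (t ** 0.5) * eps                 # same distribution as adding t unit-noise steps
    loss = ((eps - model(x_t, t)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```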
Creating noisy data
• In actuality, the noisy data is recursively created as follows:

  xt = √(1 - βt) · xt-1 + √βt · εt

  where βt is small, between 10⁻⁵ and 10⁻¹. In the paper, β1 = 10⁻⁴ and βT = 0.02
• This makes the signal relatively weak as t gets larger, thereby getting closer to white noise at t = T
• Equivalently we can write:

  xt = √ᾱt · x0 + √(1 - ᾱt) · ε,  ε ~ N(0, I)

  where αt := 1 - βt and ᾱt := ∏_{s=1}^{t} αs
• Note that ᾱt → 0 as t → ∞, so xt → ε ~ N(0, I) as t → ∞
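A sketch of this noise schedule and the closed-form noising step, using the β1 = 10⁻⁴, βT = 0.02 values quoted above with a linear schedule (PyTorch assumed; names are illustrative):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)         # beta_1 ... beta_T, linear schedule
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alphas_bar = torch.cumprod(alphas, dim=0)     # alpha_bar_t = prod_{s<=t} alpha_s

def q_sample(x0, t, eps):
    """Closed-form noising: xt = sqrt(alpha_bar_t)*x0 + sqrt(1 - alpha_bar_t)*eps.

    t is a 1-based timestep; eps ~ N(0, I) with the same shape as x0.
    """
    a_bar = alphas_bar[t - 1]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
```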
Finally, arrive at paper's training and inference algorithms

• Note that we do not actually create a training set of T noisy images for each dataset image, as discussed at the beginning of class. We instead implicitly create them with xt = √ᾱt · x0 + √(1 - ᾱt) · ε
• Also z: it increases the diversity of the generated images
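A sketch of the paper's sampling loop (Algorithm 2), reusing the betas/alphas/alphas_bar tensors from the previous snippet and taking σt = √βt, which is one of the choices discussed in the paper; the z term is the extra noise mentioned above:

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape=(1, 3, 64, 64), T=1000):
    x = torch.randn(shape)                                       # x_T ~ N(0, I)
    for t in range(T, 0, -1):
        z = torch.randn(shape) if t > 1 else torch.zeros(shape)  # no extra noise at the last step
        beta, alpha, a_bar = betas[t - 1], alphas[t - 1], alphas_bar[t - 1]
        eps_pred = model(x, t)                                   # eps_theta(x_t, t)
        mean = (x - beta / (1.0 - a_bar).sqrt() * eps_pred) / alpha.sqrt()
        x = mean + beta.sqrt() * z                               # sigma_t = sqrt(beta_t)
    return x
```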
Neural network internals
xt  →  [denoising neural network]  →  x̂t-1 = fθ(xt, t)

• fθ is a neural network and can have any architecture
• Inputs to fθ:
  • The noisy image xt
    • During training, this is obtained by randomly sampling t and noising x0 according to the noise level that corresponds to t in our noise schedule
    • During inference, it is obtained by decrementing t (the noise level) according to the noise schedule and iteratively sampling from the trained neural network
  • The noise level t, to tell the neural network what noising step we're at, since the parameters of the neural network are shared across time-steps
    • t is encoded with sinusoidal position embeddings (like in transformers)
U-Net | Architecture
• Model architecture that was initially introduced for medical image segmentation, where it improved accuracy from 46% to 77% with fewer training samples
• Encoder: downsamples the data, captures context.
  • This downsampling mirrors what is done in traditional CNN architectures.
  • The receptive field is increased at each block: as the resolution decreases, each convolution sees a larger portion of the original image.
• Decoder: upsamples the data, enables precise localization, and allows for obtaining an output that is the same size as the input.
• Skip connections to every upsampling layer from the corresponding downsampling layer retain information that is lost during encoding.

[Figure: U-Net architecture (example for 32x32 pixels in the lowest resolution); each blue box is a multi-channel feature map, with channel counts on top and x-y sizes at the lower left. Ronneberger et al., 2015]
U-Net | Architecture

• Each encoder block consists of several convolutional layers. In DDPM, the downsampling blocks halve the height and width of the image representations at each block.
• Each decoder block consists of several deconvolutional layers that mirror the encoder and upsample the representations in order to obtain an output that has the same size as the input.
• Skip connections add the output from every downsampling block to the corresponding upsampling block.

[Figure: each down block, up block, and mid block in the U-Net consists of ResNet and self-attention blocks. The down/up blocks also contain downsampling and upsampling convolution layers. The bottleneck retains the size of the input and contains an additional ResNet block (figure from ddpm).]
U-Net | Architecture

Each block in the encoder (downsampling block) consists of a ResNet block, self-attention, and a downsampling convolution block. The ResNet block consists of two (or more) sets of group normalization, activation, and a convolution layer. The decoder has an identical structure, upsampling the representations by mirroring the downsampling of the encoder. The SiLU activation function is defined as SiLU(x) = x * sigmoid(x). (Figure from ddpm.)
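A minimal sketch of such a ResNet block in PyTorch; the channel counts, group size, and the absence of the timestep-embedding injection are simplifications, not the exact DDPM implementation:

```python
import torch
from torch import nn

class ResNetBlock(nn.Module):
    """Two GroupNorm -> SiLU -> Conv stacks with a residual connection."""

    def __init__(self, in_ch, out_ch, groups=8):
        super().__init__()
        self.norm1 = nn.GroupNorm(groups, in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.norm2 = nn.GroupNorm(groups, out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        # 1x1 conv so the residual path matches the output channel count
        self.skip = nn.Conv2d(in_ch, out_ch, kernel_size=1) if in_ch != out_ch else nn.Identity()
        self.act = nn.SiLU()                     # SiLU(x) = x * sigmoid(x)

    def forward(self, x):
        h = self.conv1(self.act(self.norm1(x)))
        h = self.conv2(self.act(self.norm2(h)))
        return h + self.skip(x)
```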
U-Net | Why condition on timestep?
In the forward diffusion process, the noise level depends entirely on t:
• At early steps (small t), the noise is small
• At later steps (large t), xt is almost pure noise
So, at different t, the denoising task is different:
• At t = 1: you're removing a little bit of noise
• At t = 900: you're trying to generate structure from nearly complete noise

The model must know t to generalize across time: instead of training 1000 separate denoisers, we share parameters and pass t as an embedding, so one model handles all steps.
U-Net | Conditioning on Timestep
The position embedding is computed as follows:

  emb(t) = [ sin(t / 10000^(2i/d)), cos(t / 10000^(2i/d)) ]  for i = 0, …, d/2 - 1

where t is the timestep, d is the embedding dimension, and i is the dimension index. This is computed for every timestep t, each resulting in an embedding of size d.

Why use sin/cos at all?
1. The sinusoidal positional encoding maintains relative distance between positions.
2. Sinusoidal encodings let the model interpolate in time. It can generalize to unseen and fractional timesteps.

[Figure: timesteps are passed through a time embedding block and added to the ResNet blocks of the U-Net. The time embedding block consists of the sinusoidal positional embedding, a linear layer, followed by an activation and another linear layer (figure from ddpm).]
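A sketch of that embedding in PyTorch (the embedding dimension is assumed even; the function name is illustrative):

```python
import math
import torch

def timestep_embedding(t, dim):
    """emb(t) = [sin(t / 10000^(2i/d)), cos(t / 10000^(2i/d))] for i = 0..d/2-1."""
    half = dim // 2
    i = torch.arange(half, dtype=torch.float32)
    freqs = torch.exp(-math.log(10000.0) * 2.0 * i / dim)   # 1 / 10000^(2i/d)
    args = t * freqs
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# Example: a 128-dimensional embedding for timestep t = 500
emb = timestep_embedding(torch.tensor(500.0), 128)   # shape [128]
```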
U-Net | Conditioning on Timestep

The timestep embedding is first projected to the same size and number of channels as the intermediate outputs of the ResNet layers. Then, the output is added to the residual blocks of the UNet (figure from ddpm).
Conditioning Diffusion Models

• So far, we've seen how we can generate random images. In reality, we want to be able to control what we generate.
• Examples of conditions include: text, class labels, or reference images.
• Diffusion models can be conditioned in various ways, including:
• Architectural conditioning
• Classifier-free guidance
Architectural Conditioning
[Figure: text is encoded by a CLIP model, projected with a fully connected (FC) layer, and combined with the U-Net features via cross-attention.]
• Additional inputs can be fed to the
neural network as conditions in
several ways.
• One example is shown here where
text is embedded using a CLIP
model, projected with a linear layer,
and cross-attention is computed
between the resulting embedding
and the representations from the
ResNet block.
Classifier-Free Guidance
Training
• Let the condition be y (e.g. text embedding)
• During training, randomly replace y with null (e.g. an embedding of zeros) some
percentage of the time.
• The model learns both:
• ϵθ(xt, t, y) (conditional)
• ϵθ(xt, t, ∅) (unconditional)
Inference
• At each step, mix the two predictions:
ϵguided = (1 + w) ⋅ ϵθ(xt, t, y) − w ⋅ ϵθ(xt, t, ∅), where w is the guidance scale (e.g. 1.5–3.0)
• Higher w means stronger guidance (more faithful but possibly less diverse)
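A sketch of the guidance mixing at a single sampling step, assuming a hypothetical model that accepts cond=None as the null condition:

```python
import torch

def guided_epsilon(model, x_t, t, cond, w=2.0):
    """eps_guided = (1 + w) * eps_theta(x_t, t, y) - w * eps_theta(x_t, t, null)."""
    eps_cond = model(x_t, t, cond)       # conditional prediction
    eps_uncond = model(x_t, t, None)     # unconditional (null-condition) prediction
    return (1.0 + w) * eps_cond - w * eps_uncond
```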
Latent Diffusion Models
• Diffusion on latent representations of the images, which tend to be
much smaller in size than the images themselves
[Figure 3 from the latent diffusion paper: an encoder E maps images into a latent space and a decoder D reconstructs them as D(E(x)); the diffusion process and denoising U-Net operate in the latent (not pixel) space, and conditioning inputs such as text, semantic maps, representations, or images enter via cross-attention, concatenation, or a switch.]
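A sketch of how the earlier training and sampling snippets would be reused on latents; `encoder`, `decoder`, and `latent_model` are hypothetical placeholders for a pretrained autoencoder and a latent-space U-Net, and q_sample / ddpm_sample refer to the earlier snippets:

```python
import torch

def latent_train_step(latent_model, optimizer, encoder, x0, T=1000):
    """Noise-prediction training step run on z = E(x) instead of the image x."""
    with torch.no_grad():
        z0 = encoder(x0)                           # image -> much smaller latent
    t = torch.randint(1, T + 1, (1,)).item()
    eps = torch.randn_like(z0)
    z_t = q_sample(z0, t, eps)                     # closed-form noising in latent space
    loss = ((eps - latent_model(z_t, t)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def latent_sample(latent_model, decoder, latent_shape=(1, 4, 32, 32), T=1000):
    z = ddpm_sample(latent_model, shape=latent_shape, T=T)   # diffusion in latent space
    return decoder(z)                              # map the generated latent back to pixels
```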
Diffusion Transformer (DiT)
• The U-Net backbone can be replaced by any neural network
architecture
• Another common architecture used in diffusion models is the
Transformer
[Figure 3 from Peebles et al., 2023: the Diffusion Transformer (DiT) architecture. Left: a conditional latent DiT; the input latent (e.g., 32x32x4) is patchified into tokens and processed by N DiT blocks, with the timestep t and label y embedded as conditioning. Right: DiT block variants that incorporate conditioning via adaptive layer norm (adaLN-Zero), cross-attention, or extra in-context input tokens; adaptive layer norm works best.]
