Week 4 - Diffusion Models
Visual Generative AI Application
U-Nets and Diffusion Models
Week 4
AY 24/25
SPECIALIST DIPLOMA IN APPLIED GENERATIVE AI (SDGAI)
Objectives
• By the end of this module, learners will be able to:
• Explain diffusion models using a U-Net architecture.
• Develop an intuitive overview of the theory behind denoising diffusion.
• Highlight the design choices related to sampling (generating images when a trained denoiser is available).
• Highlight the design choices when training that denoiser.
Data → Noise
Destroying data by adding noise (the forward diffusion process)
Noise → Data
Generating samples by removing noise (the reverse diffusion process)
UNet Architecture
• U-Net is a convolutional neural network that was developed for image segmentation.
• The U-Net architecture has also been employed in diffusion models for iterative image denoising.
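To make the shape of the architecture concrete, here is a minimal sketch of a U-Net-style encoder–decoder in PyTorch. The class name `TinyUNet`, the channel widths, and the layer choices are illustrative assumptions, not the exact network used in the practical: down blocks reduce the spatial resolution, up blocks restore it, and skip connections concatenate matching feature maps from the contracting path.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net sketch: two down steps, a bottleneck, two up steps with skip connections."""
    def __init__(self, in_ch=1, base=16):
        super().__init__()
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.down2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.bottleneck = nn.Sequential(nn.Conv2d(base * 2, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.up1 = nn.ConvTranspose2d(base * 2, base * 2, 2, stride=2)  # back to down2's resolution
        self.up2 = nn.ConvTranspose2d(base * 4, base, 2, stride=2)      # after concatenating the skip
        self.out = nn.Conv2d(base * 2, in_ch, 3, padding=1)             # predict the denoised image / noise

    def forward(self, x):
        d1 = self.down1(x)                 # (B, base,   H,   W)
        d2 = self.down2(d1)                # (B, 2*base, H/2, W/2)
        b  = self.bottleneck(d2)           # (B, 2*base, H/4, W/4)
        u1 = self.up1(b)                   # (B, 2*base, H/2, W/2)
        u1 = torch.cat([u1, d2], dim=1)    # skip connection from the matching down block
        u2 = self.up2(u1)                  # (B, base,   H,   W)
        u2 = torch.cat([u2, d1], dim=1)
        return self.out(u2)

x = torch.randn(4, 1, 28, 28)              # e.g. a batch of 28x28 grayscale images (assumed size)
print(TinyUNet()(x).shape)                 # torch.Size([4, 1, 28, 28])
```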
[Figure: a 2×2 kernel of 0.25s slides over a 3×3 image of alternating 1s and 0s with stride 1. At each position the element-wise products sum to 0.5, so the window positions produce a 2×2 output in which every entry is 0.5.]
When the stride is equal to 1, we move our convolution window across the image one space at a time.
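As a quick check of the arithmetic above, the same calculation can be reproduced with PyTorch's `F.conv2d`. The 2×2 kernel of 0.25s and the 3×3 image follow the slide; everything else is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

image  = torch.tensor([[1., 0., 1.],
                       [0., 1., 0.],
                       [1., 0., 1.]]).view(1, 1, 3, 3)   # (batch, channels, H, W)
kernel = torch.full((1, 1, 2, 2), 0.25)                  # 2x2 kernel of 0.25s

# Stride 1: the window moves one pixel at a time -> 2x2 output, every entry 0.5
print(F.conv2d(image, kernel, stride=1))
```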
Upsampling on Up Block
[Figure: a 2×2 kernel of 0.25s, a 3×3 input image of alternating 1s and 0s, and the resulting upsampled output.]
• Convolution Transpose
• BatchNorm2d
• ReLU
• Conv2d
• BatchNorm2d
• ReLU
Upsampling on Up Block
The stride defines how many rows and columns we will add (see the code sketch below).
• With a stride of 2, we'll add 1 row of zeros in between each image row.
• With a stride of 3, we'll add 2 rows of zeros in between each image row.
[Figure: the 3×3 image of alternating 1s and 0s expanded by inserting zero rows and columns: a 5×5 grid for stride = 2 and a 7×7 grid for stride = 3.]
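The same upsampling effect can be produced with `nn.ConvTranspose2d`. This is a small sketch: the 0.25 kernel and the 3×3 input follow the slides, while the fixed (non-learned) weights and the exact output size are assumptions for illustration.

```python
import torch
import torch.nn as nn

image = torch.tensor([[1., 0., 1.],
                      [0., 1., 0.],
                      [1., 0., 1.]]).view(1, 1, 3, 3)

# Transposed convolution with a fixed 2x2 kernel of 0.25s and stride 2:
# conceptually, zeros are inserted between the input pixels before the kernel is applied.
up = nn.ConvTranspose2d(1, 1, kernel_size=2, stride=2, bias=False)
with torch.no_grad():
    up.weight.fill_(0.25)

print(up(image).shape)   # torch.Size([1, 1, 6, 6]) -- spatial resolution is increased
print(up(image))
```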
Upsampling on Up Block
[Figure: with stride = 2, the 2×2 kernel of 0.25s is applied to the zero-expanded 5×5 image step by step; the per-position products of 0.25 accumulate into the upsampled output.]
Denoising Process
• Draw a random image of pure white noise.
• Remove the noise slowly (say, 2% at a time) by repeatedly feeding it to a neural denoiser (a minimal sampling loop is sketched below).
• Gradually, a random clean image emerges from underneath the noise.
• The distribution of generated content is determined by the dataset that the denoiser network was trained with.
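A minimal sketch of this sampling loop. The `denoiser` model and its `(image, timestep)` signature, the number of steps, and the 2%-style blending factor are all assumptions for illustration; full DDPM sampling also adds scheduled noise back in at each step.

```python
import torch

@torch.no_grad()
def sample(denoiser, steps=50, shape=(1, 1, 28, 28)):
    """Start from pure noise and repeatedly nudge the image toward the denoiser's estimate."""
    x = torch.randn(shape)                      # draw a random image of pure white noise
    for t in reversed(range(steps)):
        t_batch = torch.full((shape[0],), t)    # the denoiser is conditioned on the timestep
        x_clean = denoiser(x, t_batch)          # network's estimate of the clean image
        x = 0.98 * x + 0.02 * x_clean           # remove a small fraction of the noise at a time
    return x
```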
Practical 4
Diffusion Models
• U-Nets
• Diffusion Models
• Optimizing Diffusion Models
• In transposed convolution, when the kernel size is larger than the stride (e.g. kernel size = 3, stride = 2), we end up using a pixel value for multiple computations, resulting in a "checkerboard" appearance.
• Even if the kernel and stride are not overlapping (e.g. kernel size = 2, stride = 2), the kernel can still produce checkerboard artifacts.
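One way to see the uneven overlap is to upsample a constant image with an all-ones kernel and count how often each output pixel is touched. This is a small sketch: the `overlap_map` helper, the kernel sizes, and the input size are illustrative assumptions.

```python
import torch
import torch.nn as nn

def overlap_map(kernel_size, stride, size=4):
    """Count how many kernel placements touch each output pixel when upsampling a constant image."""
    up = nn.ConvTranspose2d(1, 1, kernel_size, stride=stride, bias=False)
    with torch.no_grad():
        up.weight.fill_(1.0)                      # all-ones kernel isolates the overlap pattern
    ones = torch.ones(1, 1, size, size)
    return up(ones)[0, 0]

print(overlap_map(kernel_size=3, stride=2))       # uneven counts -> checkerboard-prone
print(overlap_map(kernel_size=2, stride=2))       # kernel size divisible by stride -> even coverage
```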
[Figure: feature maps of shape (N, C, H, W). Batch normalization (shown with N = 2) computes statistics for each channel across the whole batch; group normalization computes statistics within each sample over a group of channels (group size up to all channels).]
Batch normalization converts the output of each neuron across a batch into a z-score. With convolutional neural networks, a kernel is equivalent to a neuron, so if the output of each neuron creates a channel, the outputs across the channel are normalized.
In group normalization, we normalize the output of each sample image across a group of channels. The images in a batch do not influence each other.
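In PyTorch the two are essentially a one-line swap. A small sketch, where the batch size, channel count, and group count are illustrative assumptions:

```python
import torch
import torch.nn as nn

x = torch.randn(2, 16, 28, 28)                      # (N, C, H, W): batch of 2, 16 channels

bn = nn.BatchNorm2d(16)                             # statistics per channel, computed across the batch
gn = nn.GroupNorm(num_groups=4, num_channels=16)    # statistics per sample, over groups of 4 channels

print(bn(x).shape, gn(x).shape)                     # both keep the shape: torch.Size([2, 16, 28, 28])
```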
Rearranged Pooling
[Figure: max pooling reduces a 4×4 feature map to a 2×2 map by keeping only the maximum of each 2×2 window; rearranged pooling instead slices a 4×4 map (values 1–16) into strips and stacks them along the channel dimension.]
Max Pooling is an effective technique for reducing the size of our feature map, but it also drops a lot of information.
Einops provides tools to control the rearranging of our feature maps. We can cut every other column into strips and stack them along the channel dimension. Then, cut every other row into strips and stack those along the channel dimension (a code sketch follows below).
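A sketch of that rearrangement with einops. `rearrange` is the real einops function; the 4×4 example tensor and the two-step pattern strings are illustrative assumptions.

```python
import torch
from einops import rearrange

x = torch.arange(1, 17, dtype=torch.float32).view(1, 1, 4, 4)   # (B, C, H, W) with values 1..16

# Cut every other column into strips and stack them along the channel dimension,
# then do the same with every other row: (1, 1, 4, 4) -> (1, 4, 2, 2).
y = rearrange(x, 'b c h (w p2) -> b (c p2) h w', p2=2)
y = rearrange(y, 'b c (h p1) w -> b (c p1) h w', p1=2)
print(y.shape)   # torch.Size([1, 4, 2, 2]) -- nothing is discarded, unlike max pooling
```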
As binary? Large discontinuity at the boundary. Consider 0001 + 0111 = 1000: a single step in the timestep flips every bit.
[Figure: timesteps t = 0, 1, 2, … written in binary, one bit per row]
bit 3 (MSB): 0 0 0 0 0 0 0 0 1 1 …
bit 2:       0 0 0 0 1 1 1 1 0 0 …
bit 1:       0 0 1 1 0 0 1 1 0 0 …
bit 0 (LSB): 0 1 0 1 0 1 0 1 0 1 …
[Figure: unit circle. The point at angle θ has coordinates (cos θ, sin θ); the points (1, 0), (-1, 0) and (0, -1) are marked.]
[Figure: sinusoidal time embeddings compared for t = 3 and t = 15.]
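Instead of binary bits, the timestep can be mapped onto points on circles of different frequencies, so nearby timesteps get nearby embeddings. A minimal sketch of such a sinusoidal embedding; the embedding dimension and the frequency base of 10000 follow the common Transformer-style convention and are assumptions that may differ from the practical.

```python
import math
import torch

def time_embedding(t, dim=8):
    """Sinusoidal embedding: each (sin, cos) pair rotates around the unit circle at its own frequency."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)   # slow to fast frequencies
    angles = t.float()[:, None] * freqs[None, :]                        # (batch, half)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)    # (batch, dim)

print(time_embedding(torch.tensor([3, 15])))   # embeddings for t = 3 and t = 15
```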
Increase Model Depth
[Diagram: Up Block. Original stack: ConvT → Normalization → Activation Function → Convolution → Normalization → Activation Function. Deeper stack: ConvT → Normalization → Activation Function, followed by several more Convolution → Normalization → Activation Function layers.]
In general, adding more depth helps fight the checkerboard problem.
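A sketch of a deeper Up Block along these lines. Only the ConvT → Normalization → Activation → Convolution → Normalization → Activation ordering follows the diagram; the layer sizes, the number of extra convolutions, the GroupNorm group count, and ReLU as the activation are assumptions.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """ConvT -> Norm -> Activation, followed by extra Conv -> Norm -> Activation layers for depth."""
    def __init__(self, in_ch, out_ch, extra_convs=2, groups=4):
        super().__init__()
        layers = [
            nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2),
            nn.GroupNorm(groups, out_ch),
            nn.ReLU(),
        ]
        for _ in range(extra_convs):             # more depth helps fight the checkerboard problem
            layers += [
                nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
                nn.GroupNorm(groups, out_ch),
                nn.ReLU(),
            ]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)

x = torch.randn(2, 32, 14, 14)
print(UpBlock(32, 16)(x).shape)   # torch.Size([2, 16, 28, 28])
```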