Diffusion Models
Reminders
• Homework 1: Generative Models of Text
– Out: Mon, Jan 27
– Due: Mon, Feb 10 at 11:59pm
• Quiz 2:
– In-class: Mon, Feb 17
– Lectures 5-8
• Homework 2: Generative Models of Images
– Out: Mon, Feb 10
– Due: Sat, Feb 22 at 11:59pm
UNSUPERVISED LEARNING
Unsupervised Learning (Recall…)
Assumptions:
1. our data comes from some distribution p*(x0)
2. we choose a distribution pθ(x0) for which sampling x0 ~ pθ(x0) is tractable
Goal: learn θ s.t. pθ(x0) ≈ p*(x0)
Unsupervised Learning (Recall…)
Assumptions and goal: as above.
Example: autoregressive LMs
• true p*(x0) is the (human) process that produced text on the web
• choose pθ(x0) to be an autoregressive language model
  – autoregressive structure means that p(xt | x1, …, xt−1) ~ Categorical(·) and ancestral sampling is exact/efficient (see the sketch below)
• learn by finding θ ≈ argmaxθ log pθ(x0) using gradient-based updates on ∇θ log pθ(x0)
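To make the "ancestral sampling is exact/efficient" point concrete, here is a minimal sketch of ancestral sampling from an autoregressive LM; the `model` interface (prefix of token ids in, next-token logits out) and the special token ids are assumptions for illustration, not part of the lecture.

```python
import torch

def ancestral_sample(model, bos_id, eos_id, max_len=100):
    """Draw x ~ p_theta(x) one token at a time; each conditional is Categorical."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = model(torch.tensor([tokens]))[0, -1]      # logits for the next token
        probs = torch.softmax(logits, dim=-1)              # Categorical(.) parameters
        next_token = torch.multinomial(probs, 1).item()    # exact sample, no approximation
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens
```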
Unsupervised Learning (Recall…)
Assumptions and goal: as above.
Example: GANs
• true p*(x0) is the distribution over photos taken and posted to Flickr
• choose pθ(x0) to be an expressive model (e.g. noise fed into an inverted CNN) that can generate images
  – sampling is typically easy: z ~ N(0, I) and x0 = fθ(z)
• learn by finding θ ≈ argmaxθ log pθ(x0)?
  – No! Because we can't even compute log pθ(x0) or its gradient
  – Why not? Because the integral
    pθ(x0) = ∫z pθ(x0 | z) p(z) dz
    is intractable even for a simple 1-hidden-layer neural network with nonlinear activation
  – so we optimize a minimax loss instead
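For reference (not spelled out on the slide), the standard minimax objective from Goodfellow et al. (2014) that replaces maximum likelihood is shown below; the discriminator Dψ is an added symbol here, while fθ is the generator from the bullet above.

```latex
\min_{\theta} \max_{\psi} \;
\mathbb{E}_{x_0 \sim p^*(x_0)}\!\left[\log D_\psi(x_0)\right]
+ \mathbb{E}_{z \sim \mathcal{N}(0, I)}\!\left[\log\!\big(1 - D_\psi(f_\theta(z))\big)\right]
```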
Unsupervised Learning (Recall…)
Assumptions and goal: as above.
Example: VAEs / Diffusion Models
• true p*(x0) is the distribution over photos taken and posted to Flickr
• choose pθ(x0) to be an expressive model (e.g. noise fed into an inverted CNN) that can generate images
  – sampling will be easy
• learn by finding θ ≈ argmaxθ log pθ(x0)?
  – Sort of! We can't compute the gradient ∇θ log pθ(x0)
  – So we instead optimize a variational lower bound (more on that later)
Figure from Ho et al. (2020)
Latent Variable Models
• For GANs and VAEs, we assume that there are (unknown) latent variables which give rise to our observations
• The vector z contains those latent variables
• After learning a GAN or VAE, we can interpolate between images in latent z space
Figure from Radford et al. (2016)
GAN → VAE → Diffusion (in 15 minutes)
U-NET
Semantic Segmentation
• Given an image, predict a label for every pixel in the image
• Not merely a classification problem, because there are strong correlations between pixel-specific labels
Figure from https://openaccess.thecvf.com/content_ICCV_2017/papers/He_Mask_R-CNN_ICCV_2017_paper.pdf
U-Net
Contracting path
• block consists of:
  – 3x3 convolution
  – 3x3 convolution
  – ReLU
  – max-pooling with stride of 2 (downsample)
• repeat the block N times, doubling the number of channels
Expanding path
• block consists of:
  – 2x2 convolution (upsampling)
  – concatenation with contracting path features
  – 3x3 convolution
  – 3x3 convolution
  – ReLU
• repeat the block N times, halving the number of channels
(see the sketch below)
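A minimal PyTorch-style sketch of one contracting block and one expanding block following the layer list above; the padding choice (the original U-Net used unpadded convolutions plus cropping), channel handling, and class names are illustrative assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn

class ContractBlock(nn.Module):
    """3x3 conv -> 3x3 conv -> ReLU, then 2x downsampling via max-pool."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x):
        skip = self.convs(x)              # features saved for the skip connection
        return self.pool(skip), skip

class ExpandBlock(nn.Module):
    """2x2 up-convolution, concatenate skip features, then 3x3 conv -> 3x3 conv -> ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.convs = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, kernel_size=3, padding=1),  # 2*out_ch after concat
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(),
        )

    def forward(self, x, skip):
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)   # concatenation with contracting-path features
        return self.convs(x)
```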
U-Net
• Originally designed for applications to biomedical segmentation
• Key observation is that the output layer has the same dimensions as the input image (possibly with a different number of channels)
DIFFUSION MODELS
Diffusion Model
[Chain over x0, x1, …, xt, xt+1, …, xT−1, xT: the forward transitions qφ(x1 | x0), …, qφ(xt+1 | xt), …, qφ(xT | xT−1) carry a sample x0 ~ q(x0) toward xT]
Forward Process:
qφ(x0:T) = q(x0) ∏_{t=1}^{T} qφ(xt | xt−1)
Diffusion Model
[Same chain, now also showing the reverse transitions pθ(xT−1 | xT), …, pθ(xt | xt+1), …, pθ(x0 | x1), which start from xT ~ pθ(xT) and run back toward x0]
Forward Process:
qφ(x0:T) = q(x0) ∏_{t=1}^{T} qφ(xt | xt−1)
Diffusion Model
[Figure: the same chain illustrated on images, progressively noised from x0 to xT by the forward process qφ and denoised back by the reverse process pθ]
Figure from Ho et al. (2020)
Diffusion Model
[Chain as above, with forward transitions qφ and reverse transitions pθ]
Question: Which are the latent variables in a diffusion model?
Answer: x1, …, xT. Only x0 is observed; every noised version of it is latent.
Figure from Ho et al. (2020)
Denoising Diffusion Probabilistic Model (DDPM)
[Chain as above]
Forward Process:
qφ(x0:T) = q(x0) ∏_{t=1}^{T} qφ(xt | xt−1)
where q(x0) = data distribution
and qφ(xt | xt−1) ~ N(√αt xt−1, (1 − αt)I)
Diffusion Model
[Chain as above. Annotations: the forward process adds noise to the image; the learned reverse process removes noise; if we could sample from the exact reverse process we'd be done. Learning is hard, so why don't we instead just infer the exact reverse process corresponding to the forward process?]

Forward Process:
qφ(x0:T) = q(x0) ∏_{t=1}^{T} qφ(xt | xt−1)

(Exact) Reverse Process:
qφ(x0:T) = qφ(xT) ∏_{t=1}^{T} qφ(xt−1 | xt)

(Learned) Reverse Process (this is what we aim to learn):
pθ(x0:T) = pθ(xT) ∏_{t=1}^{T} pθ(xt−1 | xt)

The exact reverse process requires inference. And, even though qφ(xt | xt−1) is simple, computing qφ(xt−1 | xt) is intractable! Why? Because q(x0) might be not-so-simple:

qφ(xt−1 | xt) = [∫ qφ(x0:T) dx0:t−2,t+1:T] / [∫ qφ(x0:T) dx0:t−1,t+1:T]
Denoising Diffusion Probabilistic Model (DDPM)
[Chain as above]
Forward Process:
qφ(x0:T) = q(x0) ∏_{t=1}^{T} qφ(xt | xt−1)
where q(x0) = data distribution
and qφ(xt | xt−1) ~ N(√αt xt−1, (1 − αt)I)
Defining the Forward Process
Forward Process:
qφ(x0:T) = q(x0) ∏_{t=1}^{T} qφ(xt | xt−1)
where q(x0) = data distribution
and qφ(xt | xt−1) ~ N(√αt xt−1, (1 − αt)I)
Noise schedule:
We choose αt to follow a fixed schedule s.t. qφ(xT) ~ N(0, I), just like pθ(xT).
(see the sketch below)
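A minimal sketch of one forward-process step under these definitions. The linear β schedule endpoints are illustrative assumptions (they match common DDPM choices); any fixed schedule that drives qφ(xT) toward N(0, I) works.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)   # fixed noise schedule, beta_t = 1 - alpha_t
alphas = 1.0 - betas                    # alpha_t (0-indexed here)

def forward_step(x_prev, t):
    """Sample x_t ~ q_phi(x_t | x_{t-1}) = N(sqrt(alpha_t) x_{t-1}, (1 - alpha_t) I)."""
    eps = torch.randn_like(x_prev)
    return torch.sqrt(alphas[t]) * x_prev + torch.sqrt(1.0 - alphas[t]) * eps
```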
Gaussian (an aside)
Let X ~ N(µx, σx²) and Y ~ N(µy, σy²)
1. Sum of two (independent) Gaussians is a Gaussian
A: X + Y ~ N(µx + µy, σx² + σy²)
Diffusion Model
[Chain as above]
(Learned) Reverse Process:
pθ(x0:T) = pθ(xT) ∏_{t=1}^{T} pθ(xt−1 | xt)
Q: But if pθ(xt−1 | xt) is Gaussian, how can it learn a θ such that pθ(x0) ≈ q(x0)? Won't pθ(x0) be Gaussian too?
A: No. In fact, a diffusion model of sufficiently long timespan T can capture any smooth target distribution.
Diffusion Model Analogy
Properties of forward and exact reverse processes
Property #1:
q(xt | x0) ~ N(√ᾱt x0, (1 − ᾱt)I), where ᾱt = ∏_{s=1}^{t} αs
(see the sketch below)
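Property #1 means we can jump straight from x0 to any xt without simulating the whole chain. A minimal sketch, using the same illustrative schedule as before (redefined here so the snippet is self-contained):

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)     # illustrative schedule, as above
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{s<=t} alpha_s

def sample_xt_given_x0(x0, t):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I) in one shot."""
    eps = torch.randn_like(x0)
    return torch.sqrt(alphas_bar[t]) * x0 + torch.sqrt(1.0 - alphas_bar[t]) * eps
```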
Parameterizing the learned reverse process
Recall: pθ(xt−1 | xt) ~ N(µθ(xt, t), Σθ(xt, t))
Later we will show that given a training sample x0, we want pθ(xt−1 | xt) to be as close as possible to q(xt−1 | xt, x0). Intuitively, this makes sense: if the learned reverse process is supposed to subtract away the noise, then whenever we're working with a specific x0 it should subtract it away exactly as the exact reverse process would have.
Idea #1: Rather than learn Σθ(xt, t), just use what we know about q(xt−1 | xt, x0) ~ N(µ̃q(xt, x0), σt² I) and set Σθ(xt, t) = σt² I.
Option A: Learn a network that approximates µ̃q(xt, x0) directly from xt and t:
µθ(xt, t) = UNetθ(xt, t)
where t is treated as an extra feature in UNet.
Parameterizing the learned reverse process
Recall: pθ(xt−1 | xt) ~ N(µθ(xt, t), Σθ(xt, t))
Later we will show that given a training sample x0, we want pθ(xt−1 | xt) to be as close as possible to q(xt−1 | xt, x0). Intuitively, this makes sense: if the learned reverse process is supposed to subtract away the noise, then whenever we're working with a specific x0 it should subtract it away exactly as the exact reverse process would have.
Idea #1: Rather than learn Σθ(xt, t), just use what we know about q(xt−1 | xt, x0) ~ N(µ̃q(xt, x0), σt² I) and set Σθ(xt, t) = σt² I.
Option B: Learn a network that approximates the real x0 from only xt and t:
µθ(xt, t) = αt^(0) xθ^(0)(xt, t) + αt^(t) xt
where xθ^(0)(xt, t) = UNetθ(xt, t)
(see the sketch below)
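A minimal sketch of the Option B mean: given a network's estimate x̂0 of x0, combine it with xt using the coefficients αt^(0) and αt^(t) of q(xt−1 | xt, x0) (defined on the next slide). The function name and the convention of passing the relevant αt, ᾱt, ᾱt−1 as scalars are assumptions for illustration.

```python
import torch

def option_b_mean(x0_hat, x_t, alpha_t, alpha_bar_t, alpha_bar_prev):
    """mu_theta(x_t, t) = alpha_t^(0) * x0_hat + alpha_t^(t) * x_t,
    where x0_hat would be UNet_theta(x_t, t)'s prediction of the clean image."""
    coef_x0 = torch.sqrt(alpha_bar_prev) * (1 - alpha_t) / (1 - alpha_bar_t)   # alpha_t^(0)
    coef_xt = torch.sqrt(alpha_t) * (1 - alpha_bar_prev) / (1 - alpha_bar_t)   # alpha_t^(t)
    return coef_x0 * x0_hat + coef_xt * x_t
```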
Properties of forward and exact reverse processes

Property #1:
q(xt | x0) ~ N(√ᾱt x0, (1 − ᾱt)I), where ᾱt = ∏_{s=1}^{t} αs
⇒ we can sample xt from x0 at any timestep t efficiently in closed form
⇒ xt = √ᾱt x0 + √(1 − ᾱt) ε, where ε ~ N(0, I)

Property #2: Estimating q(xt−1 | xt) is intractable because of its dependence on q(x0). However, conditioning on x0 we can efficiently work with:
q(xt−1 | xt, x0) = N(µ̃q(xt, x0), σt² I)
where µ̃q(xt, x0) = [√ᾱt−1 (1 − αt) / (1 − ᾱt)] x0 + [√αt (1 − ᾱt−1) / (1 − ᾱt)] xt = αt^(0) x0 + αt^(t) xt
and σt² = (1 − ᾱt−1)(1 − αt) / (1 − ᾱt)

Property #3: Combining the two previous properties, we can obtain a different parameterization of µ̃q which has been shown empirically to help in learning pθ.
Rearranging xt = √ᾱt x0 + √(1 − ᾱt) ε, we have that:
x0 = (xt − √(1 − ᾱt) ε) / √ᾱt
Substituting this definition of x0 into Property #2's definition of µ̃q gives:
µ̃q(xt, x0) = αt^(0) x0 + αt^(t) xt
           = αt^(0) (xt − √(1 − ᾱt) ε) / √ᾱt + αt^(t) xt
           = (1/√αt) (xt − ((1 − αt)/√(1 − ᾱt)) ε)
(a numerical check of this identity appears below)
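Since the algebra in Property #3 is easy to get wrong, here is a small numerical check, under the same illustrative schedule as before, that the two forms of µ̃q agree when xt is constructed from x0 and ε via Property #1:

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)
alphas, alphas_bar = 1.0 - betas, torch.cumprod(1.0 - betas, dim=0)

t = 500
x0, eps = torch.randn(3, 32, 32), torch.randn(3, 32, 32)
x_t = torch.sqrt(alphas_bar[t]) * x0 + torch.sqrt(1 - alphas_bar[t]) * eps   # Property #1

# Property #2 form: mu = alpha_t^(0) * x0 + alpha_t^(t) * x_t
coef_x0 = torch.sqrt(alphas_bar[t - 1]) * (1 - alphas[t]) / (1 - alphas_bar[t])
coef_xt = torch.sqrt(alphas[t]) * (1 - alphas_bar[t - 1]) / (1 - alphas_bar[t])
mu_prop2 = coef_x0 * x0 + coef_xt * x_t

# Property #3 form: mu = (x_t - (1 - alpha_t)/sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t)
mu_prop3 = (x_t - (1 - alphas[t]) / torch.sqrt(1 - alphas_bar[t]) * eps) / torch.sqrt(alphas[t])

print(torch.allclose(mu_prop2, mu_prop3, atol=1e-5))   # expected: True
```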
Parameterizing the learned reverse process
Recall: pθ(xt−1 | xt) ~ N(µθ(xt, t), Σθ(xt, t))
Later we will show that given a training sample x0, we want pθ(xt−1 | xt) to be as close as possible to q(xt−1 | xt, x0). Intuitively, this makes sense: if the learned reverse process is supposed to subtract away the noise, then whenever we're working with a specific x0 it should subtract it away exactly as the exact reverse process would have.
Idea #1: Rather than learn Σθ(xt, t), just use what we know about q(xt−1 | xt, x0) ~ N(µ̃q(xt, x0), σt² I) and set Σθ(xt, t) = σt² I.
Option C: Learn a network that approximates the ε that gave rise to xt from x0 in the forward process, from xt and t:
µθ(xt, t) = αt^(0) xθ^(0)(xt, t) + αt^(t) xt
where xθ^(0)(xt, t) = (xt − √(1 − ᾱt) εθ(xt, t)) / √ᾱt
(see the sampling sketch below)
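Under Option C, one reverse (denoising) step can be written directly in terms of the predicted noise. A minimal sketch, assuming `eps_model(x_t, t)` is the trained εθ network and that `alphas`/`alphas_bar` follow the illustrative schedule from above:

```python
import torch

def reverse_step(eps_model, x_t, t, alphas, alphas_bar):
    """Sample x_{t-1} ~ p_theta(x_{t-1} | x_t) = N(mu_theta(x_t, t), sigma_t^2 I)."""
    eps_hat = eps_model(x_t, t)
    # Option C mean: (x_t - (1 - alpha_t)/sqrt(1 - alpha_bar_t) * eps_hat) / sqrt(alpha_t)
    mu = (x_t - (1 - alphas[t]) / torch.sqrt(1 - alphas_bar[t]) * eps_hat) / torch.sqrt(alphas[t])
    if t == 0:
        return mu                                  # no noise added at the final step
    sigma2 = (1 - alphas_bar[t - 1]) * (1 - alphas[t]) / (1 - alphas_bar[t])
    return mu + torch.sqrt(sigma2) * torch.randn_like(x_t)
```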
Learning the Reverse Process
Recall: given a training sample x0, we want pθ(xt−1 | xt) to be as close as possible to q(xt−1 | xt, x0).
Depending on which of the options for parameterization we pick, we get a different training algorithm. Option C is the best.

Algorithm 1: Training (Option C)
1: initialize θ
2: for e ∈ {1, …, E} do
3:   for x0 ∈ D do
4:     t ~ Uniform(1, …, T)
5:     ε ~ N(0, I)
6:     xt ← √ᾱt x0 + √(1 − ᾱt) ε
7:     ℓt(θ) ← ‖ε − εθ(xt, t)‖²
8:     θ ← θ − ∇θ ℓt(θ)

Option C: Learn a network that approximates the ε that gave rise to xt from x0 in the forward process, from xt and t:
µθ(xt, t) = αt^(0) xθ^(0)(xt, t) + αt^(t) xt
where xθ^(0)(xt, t) = (xt − √(1 − ᾱt) εθ(xt, t)) / √ᾱt
(a runnable sketch of Algorithm 1 follows)
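A runnable sketch of Algorithm 1 in PyTorch. The dataloader, the UNet-style `eps_model` interface (noisy image and timestep in, predicted noise out), the optimizer, and the learning rate are assumptions for illustration; the sketch also batches examples and uses Adam rather than the slide's per-example plain gradient step, but the loss is the same ‖ε − εθ(xt, t)‖² objective from line 7.

```python
import torch

def train_ddpm(eps_model, dataloader, T=1000, epochs=10, lr=1e-4, device="cpu"):
    """Option C training: regress the noise eps that produced x_t from x_0."""
    betas = torch.linspace(1e-4, 0.02, T, device=device)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    opt = torch.optim.Adam(eps_model.parameters(), lr=lr)

    for _ in range(epochs):
        for x0 in dataloader:                                         # x0: (B, C, H, W)
            x0 = x0.to(device)
            t = torch.randint(0, T, (x0.shape[0],), device=device)    # t ~ Uniform
            eps = torch.randn_like(x0)                                # eps ~ N(0, I)
            ab = alphas_bar[t].view(-1, 1, 1, 1)
            x_t = torch.sqrt(ab) * x0 + torch.sqrt(1 - ab) * eps      # closed-form q(x_t | x_0)
            loss = ((eps - eps_model(x_t, t)) ** 2).mean()            # ||eps - eps_theta||^2
            opt.zero_grad()
            loss.backward()
            opt.step()
```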