
10-423/10-623 Generative AI

Machine Learning Department


School of Computer Science
Carnegie Mellon University

Diffusion Models

Matt Gormley & Pat Virtue


Lecture 7
Feb. 5, 2025

1
Reminders
• Homework 1: Generative Models of Text
– Out: Mon, Jan 27
– Due: Mon, Feb 10 at 11:59pm
• Quiz 2:
– In-class: Mon, Feb 17
– Lectures 5-8
• Homework 2: Generative Models of Images
– Out: Mon, Feb 10
– Due: Sat, Feb 22 at 11:59pm
3
UNSUPERVISED LEARNING

4
Unsupervised Learning (Recall…)
Assumptions:
1. our data comes from some distribution p*(x0)
2. we choose a distribution pθ(x0) for which sampling x0 ~ pθ(x0) is tractable
Goal: learn θ s.t. pθ(x0) ≈ p*(x0)

5
Unsupervised Learning (Recall…)
Assumptions:
1. our data comes from some distribution p*(x0)
2. we choose a distribution pθ(x0) for which sampling x0 ~ pθ(x0) is tractable
Goal: learn θ s.t. pθ(x0) ≈ p*(x0)

Example: autoregressive LMs
• true p*(x0) is the (human) process that produced text on the web
• choose pθ(x0) to be an autoregressive language model
  – autoregressive structure means that p(xt | x1, …, xt−1) ~ Categorical(·) and ancestral sampling is exact/efficient
• learn by finding θ ≈ argmaxθ log(pθ(x0)) using gradient-based updates on ∇θ log(pθ(x0))

6
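To make ancestral sampling concrete, here is a minimal sketch of drawing one token at a time from a categorical next-token distribution. The function logits_fn and the token ids bos_id/eos_id are hypothetical stand-ins for a real autoregressive LM, not anything defined in the lecture.

import torch

def ancestral_sample(logits_fn, bos_id, eos_id, max_len=50):
    """Sample x_1, x_2, ... one token at a time from p(x_t | x_1, ..., x_{t-1})."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = logits_fn(torch.tensor(tokens))      # next-token logits from the (hypothetical) LM
        probs = torch.softmax(logits, dim=-1)         # Categorical(.) over the vocabulary
        next_tok = torch.multinomial(probs, num_samples=1).item()
        tokens.append(next_tok)
        if next_tok == eos_id:                        # stop at end-of-sequence
            break
    return tokens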
Unsupervised Learning (Recall…)
Assumptions:
1. our data comes from some distribution p*(x0)
2. we choose a distribution pθ(x0) for which sampling x0 ~ pθ(x0) is tractable
Goal: learn θ s.t. pθ(x0) ≈ p*(x0)

Example: GANs
• true p*(x0) is the distribution over photos taken and posted to Flickr
• choose pθ(x0) to be an expressive model (e.g. noise fed into an inverted CNN) that can generate images
  – sampling is typically easy: z ~ N(0, I) and x0 = fθ(z)
• learn by finding θ ≈ argmaxθ log(pθ(x0))?
  – No! Because we can't even compute log(pθ(x0)) or its gradient
  – Why not? Because the integral
      p(x0) = ∫z p(x0 | z) p(z) dz
    is intractable even for a simple 1-hidden-layer neural network with a nonlinear activation
  – so we optimize a minimax loss instead

7
Unsupervised Learning (Recall…)
Assumptions:
1. our data comes from some distribution p*(x0)
2. we choose a distribution pθ(x0) for which sampling x0 ~ pθ(x0) is tractable
Goal: learn θ s.t. pθ(x0) ≈ p*(x0)

Example: VAEs / Diffusion Models
• true p*(x0) is the distribution over photos taken and posted to Flickr
• choose pθ(x0) to be an expressive model (e.g. noise fed into an inverted CNN) that can generate images
  – sampling will be easy
• learn by finding θ ≈ argmaxθ log(pθ(x0))?
  – Sort of! We can't compute the gradient ∇θ log(pθ(x0))
  – So we instead optimize a variational lower bound (more on that later)

8
Figure from Ho et al. (2020)
Latent Variable Models
• For GANs and VAEs, we assume that there are (unknown) latent variables which give rise to our observations
• The vector z holds those latent variables
• After learning a GAN or VAE, we can interpolate between images in latent z space

9
Figure from Radford et al. (2016)
GAN → VAE → Diffusion (in 15 minutes)

12
U-NET

13
Semantic Segmentation
• Given an image, predict a label for every pixel in the image
• Not merely a classification problem, because there are strong correlations between pixel-specific labels

Figure from https://openaccess.thecvf.com/content_iccv_2015/papers/Noh_Learning_Deconvolution_Network_ICCV_2015_paper.pdf 14


Instance Segmentation
• Predict per-pixel labels as in semantic segmentation, but differentiate between different instances of the same label
• Example: if there are two people in the image, one person should be labeled person-1 and one should be labeled person-2

15
Figure from https://openaccess.thecvf.com/content_ICCV_2017/papers/He_Mask_R-CNN_ICCV_2017_paper.pdf
U-Net
Contracting path
• block consists of:
  – 3x3 convolution
  – 3x3 convolution
  – ReLU
  – max-pooling with stride of 2 (downsample)
• repeat the block N times, doubling the number of channels

Expanding path
• block consists of:
  – 2x2 convolution (upsampling)
  – concatenation with contracting path features
  – 3x3 convolution
  – 3x3 convolution
  – ReLU
• repeat the block N times, halving the number of channels
16
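To make the block structure concrete, here is a minimal PyTorch sketch of a U-Net with contracting and expanding paths and skip connections. The channel counts, depth, and the use of padded 3x3 convolutions (so skip features align without cropping) are illustrative assumptions; the original U-Net uses unpadded convolutions and crops the contracting-path features.

import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # two 3x3 convolutions with ReLU, as in the contracting/expanding blocks above
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=3, base=64, depth=3):
        super().__init__()
        self.downs = nn.ModuleList()
        self.pool = nn.MaxPool2d(2)                      # downsample with stride 2
        ch, chs = in_ch, []
        for d in range(depth):
            self.downs.append(conv_block(ch, base * 2**d))   # double the channels each block
            chs.append(base * 2**d)
            ch = base * 2**d
        self.bottleneck = conv_block(ch, ch * 2)
        self.ups, self.up_convs = nn.ModuleList(), nn.ModuleList()
        ch = ch * 2
        for d in reversed(range(depth)):
            self.ups.append(nn.ConvTranspose2d(ch, chs[d], kernel_size=2, stride=2))  # 2x2 up-conv
            self.up_convs.append(conv_block(chs[d] * 2, chs[d]))   # halve channels after concat
            ch = chs[d]
        self.head = nn.Conv2d(ch, out_ch, kernel_size=1)  # output keeps the input's spatial size

    def forward(self, x):
        # input height/width should be divisible by 2**depth
        skips = []
        for block in self.downs:
            x = block(x)
            skips.append(x)              # saved for the skip connection
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, block, skip in zip(self.ups, self.up_convs, reversed(skips)):
            x = up(x)
            x = block(torch.cat([skip, x], dim=1))   # concatenate contracting-path features
        return self.head(x)

if __name__ == "__main__":
    net = TinyUNet()
    out = net(torch.randn(1, 3, 64, 64))
    print(out.shape)   # torch.Size([1, 3, 64, 64]): same spatial dims as the input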
U-Net
• Originally designed for applications to biomedical segmentation
• Key observation is that the output layer has the same dimensions as the input image (possibly with a different number of channels)

17
DIFFUSION MODELS

18
Diffusion Model

[Graphical model: a chain over xT, xT−1, …, xt+1, xt, …, x1, x0, with x0 ∼ q(x0) and forward-process transitions qφ(x1 | x0), …, qφ(xt+1 | xt), …, qφ(xT | xT−1).]

Forward Process:
qφ(x0:T) = q(x0) ∏_{t=1}^T qφ(xt | xt−1)

19
Diffusion Model

[Graphical model: the same chain, now also labeled with the learned reverse process pθ(xT), pθ(xT−1 | xT), …, pθ(xt | xt+1), …, pθ(x0 | x1).]

Forward Process:
qφ(x0:T) = q(x0) ∏_{t=1}^T qφ(xt | xt−1)

(Learned) Reverse Process:
pθ(x0:T) = pθ(xT) ∏_{t=1}^T pθ(xt−1 | xt)

20
Diffusion Model

[Graphical model: the same chain. Annotations: the forward process qφ adds noise to the image; q(x0) is the data distribution (if we could sample from it we'd be done); the reverse process pθ removes noise, and the goal is to learn it.]

Forward Process:
qφ(x0:T) = q(x0) ∏_{t=1}^T qφ(xt | xt−1)

(Learned) Reverse Process:
pθ(x0:T) = pθ(xT) ∏_{t=1}^T pθ(xt−1 | xt)

21


Diffusion Model

[Graphical model of the chain as before, shown alongside the image-noising figure from Ho et al. (2020).]

22
Figure from Ho et al. (2020)
Diffusion Model

[Graphical model of the chain as before.]

Question: Which are the latent variables in a diffusion model?
Answer: the noised variables x1, …, xT; only x0 is observed.

23
Figure from Ho et al. (2020)
Denoising Diffusion Probabilistic Model (DDPM)

[Graphical model of the chain as before.]

Forward Process:
qφ(x0:T) = q(x0) ∏_{t=1}^T qφ(xt | xt−1)
where q(x0) = data distribution
and qφ(xt | xt−1) ∼ N(√αt xt−1, (1 − αt)I)

(Learned) Reverse Process:
pθ(x0:T) = pθ(xT) ∏_{t=1}^T pθ(xt−1 | xt)
where pθ(xT) ∼ N(0, I)
and pθ(xt−1 | xt) ∼ N(µθ(xt, t), Σθ(xt, t))

24
Diffusion Model

[Graphical model of the chain as before: the forward process adds noise to the image, the learned reverse process removes noise, and the goal is to learn it. But learning is hard; why don't we instead just infer the exact reverse process corresponding to the forward process?]

Forward Process:
qφ(x0:T) = q(x0) ∏_{t=1}^T qφ(xt | xt−1)

(Learned) Reverse Process:
pθ(x0:T) = pθ(xT) ∏_{t=1}^T pθ(xt−1 | xt)

(Exact) Reverse Process:
qφ(x0:T) = qφ(xT) ∏_{t=1}^T qφ(xt−1 | xt)

The exact reverse process requires inference. And, even though qφ(xt | xt−1) is simple, computing qφ(xt−1 | xt) is intractable! Why? Because q(x0) might be not-so-simple:

qφ(xt−1 | xt) = [ ∫ qφ(x0:T) dx0:t−2, t+1:T ] / [ ∫ qφ(x0:T) dx0:t−1, t+1:T ]

25
Defining the Forward Process

Forward Process:
qφ(x0:T) = q(x0) ∏_{t=1}^T qφ(xt | xt−1)
where q(x0) = data distribution
and qφ(xt | xt−1) ∼ N(√αt xt−1, (1 − αt)I)

Noise schedule:
We choose αt to follow a fixed schedule s.t. qφ(xT) ∼ N(0, I), just like pθ(xT).

27
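As a concrete illustration, here is a sketch of one common fixed schedule: the linear β schedule from Ho et al. (2020), with βt = 1 − αt. The specific endpoints and T are assumptions for illustration, not values given on the slide.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t, increasing linearly
alphas = 1.0 - betas                         # alpha_t
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{s<=t} alpha_s

# By the end of the schedule almost all signal is gone, so
# q(x_T | x_0) ~ N(sqrt(alpha_bar_T) x_0, (1 - alpha_bar_T) I) ≈ N(0, I),
# matching the prior p_theta(x_T).
print(alpha_bars[-1])   # ~4e-5, i.e. nearly zero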
Gaussian (an aside)
Let X ∼ N(µx, σx²) and Y ∼ N(µy, σy²) be independent.
1. The sum of two Gaussians is a Gaussian:
   X + Y ∼ N(µx + µy, σx² + σy²)
2. The difference of two Gaussians is a Gaussian:
   X − Y ∼ N(µx − µy, σx² + σy²)
3. A Gaussian with a Gaussian mean has a Gaussian conditional:
   Z ∼ N(µz = X, σz²) ⇒ P(X | Z) ∼ N(·, ·)
4. But #3 does not hold if X is passed through a nonlinear function f:
   W ∼ N(µw = f(X), σw²) ⇏ P(X | W) ∼ N(·, ·)
28
Defining the Forward Process

Forward Process:
qφ(x0:T) = q(x0) ∏_{t=1}^T qφ(xt | xt−1)
where q(x0) = data distribution
and qφ(xt | xt−1) ∼ N(√αt xt−1, (1 − αt)I)

Noise schedule:
We choose αt to follow a fixed schedule s.t. qφ(xT) ∼ N(0, I), just like pθ(xT).

Property #1:
q(xt | x0) ∼ N(√ᾱt x0, (1 − ᾱt)I), where ᾱt = ∏_{s=1}^t αs

Q: So what is qφ(xT | x0)? Note the capital T in the subscript.
A: qφ(xT | x0) ∼ N(√ᾱT x0, (1 − ᾱT)I), and the noise schedule drives ᾱT toward 0, so qφ(xT | x0) ≈ N(0, I).

30
Diffusion Model

[Graphical model of the chain as before.]

Forward Process:
qφ(x0:T) = q(x0) ∏_{t=1}^T qφ(xt | xt−1)

(Learned) Reverse Process:
pθ(x0:T) = pθ(xT) ∏_{t=1}^T pθ(xt−1 | xt)

Q: If qφ is just adding noise, how can pθ be interesting at all?
A: Because q(x0) is not just a noise distribution, and pθ must capture that interesting variability.

Q: But if pθ(xt−1 | xt) is Gaussian, how can it learn a θ such that pθ(x0) ≈ q(x0)? Won't pθ(x0) be Gaussian too?
A: No. In fact, a diffusion model of sufficiently long timespan T can capture any smooth target distribution.

32
Diffusion Model Analogy

37
Properties of forward and exact reverse processes

Property #1:
q(xt | x0) ∼ N(√ᾱt x0, (1 − ᾱt)I), where ᾱt = ∏_{s=1}^t αs

⇒ we can sample xt from x0 at any timestep t efficiently in closed form
⇒ xt = √ᾱt x0 + √(1 − ᾱt) ε where ε ∼ N(0, I)
   (this is the same reparameterization trick from VAEs)

39
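A minimal sketch of this closed-form sampling via the reparameterization trick; the linear β schedule and the tensor shapes are illustrative assumptions, not taken from the slide.

import torch

T = 1000
alphas = 1.0 - torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, alpha_bars):
    """Draw x_t ~ q(x_t | x_0) for a batch of images x0 and integer timesteps t (0-indexed)."""
    eps = torch.randn_like(x0)                               # eps ~ N(0, I)
    ab = alpha_bars[t].view(-1, *([1] * (x0.dim() - 1)))     # broadcast alpha_bar_t over image dims
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps            # x_t = sqrt(ab) x_0 + sqrt(1-ab) eps
    return xt, eps                                           # eps is reused as the regression target later

# usage: noise a batch of 8 "images" at random timesteps
x0 = torch.randn(8, 3, 32, 32)
t = torch.randint(0, T, (8,))
xt, eps = q_sample(x0, t, alpha_bars)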
Properties of forward and exact reverse processes

Property #1:
q(xt | x0) ∼ N(√ᾱt x0, (1 − ᾱt)I), where ᾱt = ∏_{s=1}^t αs
⇒ we can sample xt from x0 at any timestep t efficiently in closed form
⇒ xt = √ᾱt x0 + √(1 − ᾱt) ε where ε ∼ N(0, I)

Property #2: Estimating q(xt−1 | xt) is intractable because of its dependence on q(x0). However, conditioning on x0 we can efficiently work with:
q(xt−1 | xt, x0) = N(µ̃q(xt, x0), σt² I)
where µ̃q(xt, x0) = [√ᾱt−1 (1 − αt) / (1 − ᾱt)] x0 + [√αt (1 − ᾱt−1) / (1 − ᾱt)] xt
                 = αt^(0) x0 + αt^(t) xt   (defining αt^(0) and αt^(t) as these two coefficients)
and σt² = (1 − ᾱt−1)(1 − αt) / (1 − ᾱt)

40
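A short sketch computing the posterior mean µ̃q(xt, x0) and variance σt² for a given timestep, under the same assumed schedule as before:

import torch

T = 1000
alphas = 1.0 - torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(alphas, dim=0)

def posterior_mean_var(x0, xt, t, alphas, alpha_bars):
    """mu_tilde_q(x_t, x_0) and sigma_t^2 for a single integer timestep t >= 1."""
    a_t = alphas[t]
    ab_t = alpha_bars[t]
    ab_prev = alpha_bars[t - 1]
    coef_x0 = ab_prev.sqrt() * (1 - a_t) / (1 - ab_t)     # alpha_t^(0)
    coef_xt = a_t.sqrt() * (1 - ab_prev) / (1 - ab_t)     # alpha_t^(t)
    mean = coef_x0 * x0 + coef_xt * xt                    # mu_tilde_q(x_t, x_0)
    var = (1 - ab_prev) * (1 - a_t) / (1 - ab_t)          # sigma_t^2
    return mean, var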
Parameterizing the learned reverse process

Recall: pθ(xt−1 | xt) ∼ N(µθ(xt, t), Σθ(xt, t))

Later we will show that given a training sample x0, we want pθ(xt−1 | xt) to be as close as possible to q(xt−1 | xt, x0).

Intuitively, this makes sense: if the learned reverse process is supposed to subtract away the noise, then whenever we're working with a specific x0 it should subtract it away exactly as the exact reverse process would have.

41
Parameterizing the learned reverse process

Recall: pθ(xt−1 | xt) ∼ N(µθ(xt, t), Σθ(xt, t))

Idea #1: Rather than learn Σθ(xt, t), just use what we know about q(xt−1 | xt, x0) ∼ N(µ̃q(xt, x0), σt² I):
Σθ(xt, t) = σt² I

Idea #2: Choose µθ based on q(xt−1 | xt, x0), i.e. we want µθ(xt, t) to be close to µ̃q(xt, x0). Here are three ways we could parameterize this:

Option A: Learn a network that approximates µ̃q(xt, x0) directly from xt and t:
µθ(xt, t) = UNetθ(xt, t)
where t is treated as an extra feature in the UNet

42
Parameterizing the learned reverse process

Recall: pθ(xt−1 | xt) ∼ N(µθ(xt, t), Σθ(xt, t)), with Σθ(xt, t) = σt² I as in Idea #1.

Option B: Learn a network that approximates the real x0 from only xt and t:
µθ(xt, t) = αt^(0) xθ^(0)(xt, t) + αt^(t) xt
where xθ^(0)(xt, t) = UNetθ(xt, t)

43
Properties of forward and exact reverse processes

Property #1:
q(xt | x0) ∼ N(√ᾱt x0, (1 − ᾱt)I), where ᾱt = ∏_{s=1}^t αs
⇒ we can sample xt from x0 at any timestep t efficiently in closed form
⇒ xt = √ᾱt x0 + √(1 − ᾱt) ε where ε ∼ N(0, I)

Property #2: Estimating q(xt−1 | xt) is intractable because of its dependence on q(x0). However, conditioning on x0 we can efficiently work with:
q(xt−1 | xt, x0) = N(µ̃q(xt, x0), σt² I)
where µ̃q(xt, x0) = [√ᾱt−1 (1 − αt) / (1 − ᾱt)] x0 + [√αt (1 − ᾱt−1) / (1 − ᾱt)] xt = αt^(0) x0 + αt^(t) xt
and σt² = (1 − ᾱt−1)(1 − αt) / (1 − ᾱt)

Property #3: Combining the two previous properties, we can obtain a different parameterization of µ̃q which has been shown empirically to help in learning pθ.
Rearranging xt = √ᾱt x0 + √(1 − ᾱt) ε, we have that:
x0 = (xt − √(1 − ᾱt) ε) / √ᾱt

Substituting this definition of x0 into Property #2's definition of µ̃q gives:
µ̃q(xt, x0) = αt^(0) x0 + αt^(t) xt
           = αt^(0) (xt − √(1 − ᾱt) ε) / √ᾱt + αt^(t) xt
           = (1/√αt) [ xt − ((1 − αt)/√(1 − ᾱt)) ε ]

44
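As a sanity check of the substitution above, the following sketch verifies numerically that the ε-parameterization of µ̃q matches the x0-parameterization; the schedule, timestep, and shapes are illustrative assumptions.

import torch

torch.manual_seed(0)
T = 1000
alphas = 1.0 - torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(alphas, dim=0)

t = 500                                   # an arbitrary interior timestep
a_t, ab_t, ab_prev = alphas[t], alpha_bars[t], alpha_bars[t - 1]

x0 = torch.randn(3, 32, 32)
eps = torch.randn_like(x0)
xt = ab_t.sqrt() * x0 + (1 - ab_t).sqrt() * eps           # Property #1

# Property #2 form: in terms of x_0 and x_t
mu_x0_form = (ab_prev.sqrt() * (1 - a_t) / (1 - ab_t)) * x0 \
           + (a_t.sqrt() * (1 - ab_prev) / (1 - ab_t)) * xt

# Property #3 form: in terms of x_t and the noise eps
mu_eps_form = (xt - ((1 - a_t) / (1 - ab_t).sqrt()) * eps) / a_t.sqrt()

print(torch.allclose(mu_x0_form, mu_eps_form, atol=1e-5))  # True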
Parameterizing the learned reverse process

Recall: pθ(xt−1 | xt) ∼ N(µθ(xt, t), Σθ(xt, t)), with Σθ(xt, t) = σt² I as in Idea #1.

Option C: Learn a network that approximates the ε that gave rise to xt from x0 in the forward process, from xt and t:
µθ(xt, t) = αt^(0) xθ^(0)(xt, t) + αt^(t) xt
where xθ^(0)(xt, t) = (xt − √(1 − ᾱt) εθ(xt, t)) / √ᾱt
and εθ(xt, t) = UNetθ(xt, t)

45
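Here is a sketch of one reverse (denoising) step under Option C, combining Idea #1 (fixed variance σt²) with the ε-parameterized mean from Property #3. The network eps_model is a hypothetical stand-in for UNetθ, and the schedule is the assumed linear one.

import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def p_sample_step(eps_model, xt, t):
    """Sample x_{t-1} ~ p_theta(x_{t-1} | x_t) for a single integer timestep t."""
    a_t, ab_t = alphas[t], alpha_bars[t]
    ab_prev = alpha_bars[t - 1] if t > 0 else torch.tensor(1.0)
    eps_hat = eps_model(xt, torch.full((xt.shape[0],), t))       # predicted noise eps_theta(x_t, t)
    # mu_theta(x_t, t) via the epsilon form of mu_tilde_q
    mean = (xt - ((1 - a_t) / (1 - ab_t).sqrt()) * eps_hat) / a_t.sqrt()
    if t == 0:
        return mean                                              # no noise added at the final step
    sigma2 = (1 - ab_prev) * (1 - a_t) / (1 - ab_t)              # Idea #1: fixed variance sigma_t^2
    return mean + sigma2.sqrt() * torch.randn_like(xt)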


Parameterizing the learned reverse process

Recall: pθ(xt−1 | xt) ∼ N(µθ(xt, t), Σθ(xt, t))

Idea #1: Rather than learn Σθ(xt, t), just use what we know about q(xt−1 | xt, x0) ∼ N(µ̃q(xt, x0), σt² I):
Σθ(xt, t) = σt² I

Idea #2: Choose µθ based on q(xt−1 | xt, x0), i.e. we want µθ(xt, t) to be close to µ̃q(xt, x0). Here are three ways we could parameterize this:

Option A: Learn a network that approximates µ̃q(xt, x0) directly from xt and t:
µθ(xt, t) = UNetθ(xt, t)
where t is treated as an extra feature in the UNet

Option B: Learn a network that approximates the real x0 from only xt and t:
µθ(xt, t) = αt^(0) xθ^(0)(xt, t) + αt^(t) xt
where xθ^(0)(xt, t) = UNetθ(xt, t)

Option C: Learn a network that approximates the ε that gave rise to xt from x0 in the forward process, from xt and t:
µθ(xt, t) = αt^(0) xθ^(0)(xt, t) + αt^(t) xt
where xθ^(0)(xt, t) = (xt − √(1 − ᾱt) εθ(xt, t)) / √ᾱt
and εθ(xt, t) = UNetθ(xt, t)

46


DIFFUSION MODEL TRAINING

47
Learning the Reverse Process

Recall: given a training sample x0, we want pθ(xt−1 | xt) to be as close as possible to q(xt−1 | xt, x0).

Depending on which of the options for parameterization we pick, we get a different training algorithm. Option C is the best empirically.

Algorithm 1 Training (Option C)
1: initialize θ
2: for e ∈ {1, …, E} do
3:   for x0 ∈ D do
4:     t ∼ Uniform(1, …, T)
5:     ε ∼ N(0, I)
6:     xt ← √ᾱt x0 + √(1 − ᾱt) ε
7:     ℓt(θ) ← ‖ε − εθ(xt, t)‖²
8:     θ ← θ − ∇θ ℓt(θ)

Option C: Learn a network that approximates the ε that gave rise to xt from x0 in the forward process, from xt and t:
µθ(xt, t) = αt^(0) xθ^(0)(xt, t) + αt^(t) xt
where xθ^(0)(xt, t) = (xt − √(1 − ᾱt) εθ(xt, t)) / √ᾱt
and εθ(xt, t) = UNetθ(xt, t)

48
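A runnable sketch of Algorithm 1 in PyTorch, assuming a noise-prediction network eps_model (a stand-in for UNetθ) and a data loader of image batches. The schedule, learning rate, and minibatching are illustrative assumptions; the slide's update is a plain gradient step, reproduced here with SGD.

import torch
import torch.nn.functional as F

def train_ddpm(eps_model, loader, T=1000, epochs=10, lr=1e-4, device="cpu"):
    betas = torch.linspace(1e-4, 0.02, T, device=device)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    opt = torch.optim.SGD(eps_model.parameters(), lr=lr)
    for epoch in range(epochs):                                   # for e in {1, ..., E}
        for x0 in loader:                                         # for x0 in D
            x0 = x0.to(device)
            t = torch.randint(0, T, (x0.shape[0],), device=device)   # t ~ Uniform(1, ..., T)
            eps = torch.randn_like(x0)                                # eps ~ N(0, I)
            ab = alpha_bars[t].view(-1, 1, 1, 1)
            xt = ab.sqrt() * x0 + (1 - ab).sqrt() * eps               # forward-process sample x_t
            loss = F.mse_loss(eps_model(xt, t), eps)                  # ||eps - eps_theta(x_t, t)||^2
            opt.zero_grad()
            loss.backward()
            opt.step()                                                # gradient step on theta
    return eps_model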
