
CAP6412

Advanced Computer Vision


Mubarak Shah
[email protected]
HEC-245
Lecture-5: Diffusion Models-Part-II

1/25/2023 CAP6412 - Lecture-5: Diffusion Models Part-II


Diffusion Models in Vision: A Survey
https://arxiv.org/pdf/2209.04747.pdf

Alin Croitoru (University of Bucharest, Romania)
Vlad Hondru (University of Bucharest, Romania)
Radu Tudor Ionescu (University of Bucharest, Romania)
Mubarak Shah (University of Central Florida, US)
[email protected]  [email protected]  [email protected]  [email protected]
High-level overview
• Diffusion models are probabilistic models used for image generation
• They involve reversing the process of gradually degrading the data
• Consist of two processes:
 The forward process: data is progressively destroyed by adding noise across
multiple time steps
 The reverse process: using a neural network, noise is sequentially removed
to obtain the original data

[Figure: the forward process maps the data distribution to a standard Gaussian; the reverse process maps the standard Gaussian back to the data distribution]
High-level overview

• Three categories:

 Denoising Diffusion Probabilistic Models (DDPM)

 Noise Conditioned Score Networks (NCSN)

 Stochastic Differential Equations (SDE)


Denoising Diffusion Probabilistic Models (DDPMs)

Forward process
x_0 \sim p(x_0) \ \rightarrow\ x_1 \ \rightarrow\ \dots \ \rightarrow\ x_{T-1} \ \rightarrow\ x_T \sim \mathcal{N}(0, I)
Denoising Diffusion Probabilistic Models (DDPMs)

Reverse process

x_T \sim \mathcal{N}(0, I) \ \rightarrow\ x_{T-1} \ \rightarrow\ \dots \ \rightarrow\ x_1 \ \rightarrow\ x_0 \sim p(x_0)


Denoising Diffusion Probabilistic Models (DDPMs)

Forward process (iterative): the image is progressively replaced with noise

x_t \sim p(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\right), \qquad \beta_t \ll 1, \quad t = 1, \dots, T

x_0 \ \rightarrow\ x_1 \ \rightarrow\ \dots \ \rightarrow\ x_{T-1} \ \rightarrow\ x_T
Denoising Diffusion Probabilistic Models (DDPMs)

Forward process, ancestral sampling (one shot)

Notation:  \alpha_t = 1 - \beta_t, \qquad \hat{\beta}_t = \prod_{i=1}^{t} \alpha_i

x_t \sim p(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\hat{\beta}_t}\, x_0,\ (1 - \hat{\beta}_t)\, I\right)

x_0 \ \rightarrow\ x_1 \ \rightarrow\ \dots \ \rightarrow\ x_{T-1} \ \rightarrow\ x_T
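To make the one-shot formula concrete, here is a minimal sketch (not part of the original slides) that samples x_t directly from x_0; the linear beta schedule, T = 1000, and the dummy image tensor are assumptions for illustration.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed linear schedule beta_1, ..., beta_T
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
beta_hat = torch.cumprod(alphas, dim=0)       # beta_hat_t = prod_{i<=t} alpha_i

def forward_one_shot(x0, t):
    """Sample x_t ~ N(x_t; sqrt(beta_hat_t) * x0, (1 - beta_hat_t) * I) in one shot."""
    z = torch.randn_like(x0)                  # z ~ N(0, I)
    return beta_hat[t].sqrt() * x0 + (1.0 - beta_hat[t]).sqrt() * z

x0 = torch.randn(1, 3, 32, 32)                # dummy "image" standing in for a sample of p(x_0)
x_mid = forward_one_shot(x0, t=499)           # moderately noised
x_end = forward_one_shot(x0, t=T - 1)         # nearly indistinguishable from N(0, I)
```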
DDPMs. Training objective
Remember that:

x_T \ \rightarrow\ \dots \ \rightarrow\ x_t \ \rightarrow\ x_{t-1} \ \rightarrow\ \dots \ \rightarrow\ x_0   (reverse process)

p(x_{t-1} \mid x_t) \approx p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)

The true reverse distribution is approximated by a neural network with weights \theta.
DDPMs. Training objective
Simplification:

p(x_{t-1} \mid x_t) \approx p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)   (reverse process)

Fix the variance instead of learning it, and only predict/learn the mean \mu_\theta with a neural network (weights \theta).
DDPMs. Training objective
A UNet-like neural network receives the noisy image x_t and the time step t, and outputs \mu_\theta(x_t, t); the previous sample is then drawn as

x_{t-1} \sim \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right)
DDPMs. Training Algorithm

\min_\theta\ \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{x_0 \sim p(x_0),\ z_t \sim \mathcal{N}(0, I)} \left\| z_t - z_\theta(x_t, t) \right\|_2^2

Notation:  \hat{\beta}_t = \prod_{i=1}^{t} \alpha_i

Training algorithm:

Repeat
    x_0 \sim p(x_0)
    t \sim \mathcal{U}(\{1, \dots, T\})
    z_t \sim \mathcal{N}(0, I)
    x_t = \sqrt{\hat{\beta}_t}\, x_0 + \sqrt{1 - \hat{\beta}_t}\, z_t
    \theta = \theta - lr \cdot \nabla_\theta \mathcal{L}
Until convergence
DDPMs. Training Algorithm

\mathcal{L} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{x_0 \sim p(x_0),\ z_t \sim \mathcal{N}(0, I)} \left\| z_t - z_\theta(x_t, t) \right\|_2^2, \qquad \hat{\beta}_t = \prod_{i=1}^{t} \alpha_i

Training algorithm:

Repeat
    x_0 \sim p(x_0)                              % sample an image from our data set
    t \sim \mathcal{U}(\{1, \dots, T\})          % randomly choose a time step t of the forward process
    z_t \sim \mathcal{N}(0, I)                   % sample the noise z_t
    x_t = \sqrt{\hat{\beta}_t}\, x_0 + \sqrt{1 - \hat{\beta}_t}\, z_t   % get the noisy image
    \theta = \theta - lr \cdot \nabla_\theta \mathcal{L}                % update the neural network weights
Until convergence
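Below is a minimal, self-contained sketch of this training loop. The tiny MLP denoiser, the random stand-in dataset, and all hyperparameters are illustrative assumptions; a real implementation would use a UNet on images.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                    # assumed schedule
beta_hat = torch.cumprod(1.0 - betas, dim=0)             # beta_hat_t = prod alpha_i

# Toy noise-prediction network z_theta(x_t, t); t is appended as an extra (scaled) input.
dim = 3 * 32 * 32
z_theta = nn.Sequential(nn.Linear(dim + 1, 256), nn.ReLU(), nn.Linear(256, dim))
opt = torch.optim.Adam(z_theta.parameters(), lr=1e-4)

data = torch.randn(64, dim)                              # stand-in for samples of p(x_0)

for step in range(200):                                  # "Repeat ... until convergence"
    x0 = data[torch.randint(0, len(data), (16,))]        # x_0 ~ p(x_0)
    t = torch.randint(0, T, (16,))                       # t ~ U({1, ..., T}), 0-indexed here
    z = torch.randn_like(x0)                             # z_t ~ N(0, I)
    bh = beta_hat[t][:, None]
    x_t = bh.sqrt() * x0 + (1.0 - bh).sqrt() * z         # noisy image (one-shot forward process)
    pred = z_theta(torch.cat([x_t, t[:, None].float() / T], dim=1))
    loss = ((z - pred) ** 2).mean()                      # || z_t - z_theta(x_t, t) ||^2
    opt.zero_grad(); loss.backward(); opt.step()         # theta <- theta - lr * grad_theta(L)
```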
DDPMs. Sampling

[Figure: the network maps (x_t, t) to the noise estimate z_\theta(x_t, t)]

• Pass the current noisy image x_t along with t to the neural network

• From the resulting noise estimate, compute the mean of the Gaussian distribution


DDPMs. Sampling

[Figure: the network maps (x_t, t) to the noise estimate z_\theta(x_t, t)]

Sample the image for the next iteration:

x_{t-1} \sim \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\right), \qquad
\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \hat{\beta}_t}}\, z_\theta(x_t, t) \right)
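A sketch of one reverse step built from the mean above; the noise-prediction handle, the schedules, and the choice \sigma_t^2 = \beta_t are assumptions carried over from the training sketch, not code from the slides.

```python
import torch

def ddpm_reverse_step(x_t, t, noise_pred, betas, beta_hat):
    """Sample x_{t-1} ~ N(mu_theta(x_t, t), sigma_t^2 I), with noise_pred = z_theta(x_t, t)."""
    alpha_t = 1.0 - betas[t]
    mean = (x_t - (1.0 - alpha_t) / (1.0 - beta_hat[t]).sqrt() * noise_pred) / alpha_t.sqrt()
    if t == 0:                                       # no noise is added at the final step
        return mean
    sigma_t = betas[t].sqrt()                        # common choice: sigma_t^2 = beta_t
    return mean + sigma_t * torch.randn_like(x_t)

# Sketch of the full sampling loop (z_theta, T, betas, beta_hat as in the training sketch):
# x = torch.randn(1, 3 * 32 * 32)                   # x_T ~ N(0, I)
# for t in reversed(range(T)):
#     t_in = torch.full((1, 1), t / T)
#     x = ddpm_reverse_step(x, t, z_theta(torch.cat([x, t_in], dim=1)), betas, beta_hat)
```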
Outline

1. Motivation
2. High-level overview
3. Denoising diffusion probabilistic models
4. Noise Conditioned Score Network
5. Stochastic Differential Equations
6. Conditional Generation
7. Research directions
Score Function

• The score is the direction in which we need to change the input x so that its density becomes greater

• [Figure: scores of a mixture of two Gaussians in 2D]

Image credit: Yang Song, https://yang-song.net/blog/2021/score/


Score Function
• A second formulation of diffusion models

• Langevin dynamics method:

    • Start from a random sample

    • Apply iterative updates with the score function to modify the sample

    • The result will have a higher chance of being a sample of the true distribution p(x)
Naïve score-based model

• Score: gradient of the logarithm of the probability density with respect to the input

• Annealed Langevin dynamics

x_i = x_{i-1} + \frac{\gamma}{2}\, \nabla_x \log p(x_{i-1}) + \sqrt{\gamma} \cdot \omega_i

Step size \gamma – controls the magnitude of the update in the direction of the score

Score \nabla_x \log p(x) – estimated by the score network

Noise \omega_i – random Gaussian noise \mathcal{N}(0, I)
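As a minimal sketch of this update rule (assumptions: a 2-D standard Gaussian target whose analytic score -x stands in for the learned score network, and a fixed step size gamma):

```python
import torch

def analytic_score(x):
    """Score of N(0, I): grad_x log p(x) = -x (stand-in for a learned s_theta)."""
    return -x

def langevin_update(x, score_fn, gamma=0.01):
    """x_i = x_{i-1} + (gamma / 2) * score(x_{i-1}) + sqrt(gamma) * omega,  omega ~ N(0, I)."""
    omega = torch.randn_like(x)
    return x + 0.5 * gamma * score_fn(x) + gamma ** 0.5 * omega

x = 5.0 + torch.randn(1000, 2)        # start far from the target density
for _ in range(2000):
    x = langevin_update(x, analytic_score)
print(x.mean(dim=0), x.std(dim=0))    # drifts toward mean 0, std ~1
```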


Naïve score-based model

• The score is approximated with a neural network

• Score network is trained using score matching


\mathbb{E}_{x \sim p(x)} \left\| s_\theta(x) - \nabla_x \log p(x) \right\|_2^2

• Denoising score matching:

     Add small noise to each sample of the data:

        \hat{x} \sim \mathcal{N}(\hat{x};\ x,\ \sigma^2 \cdot I) = p_\sigma(\hat{x} \mid x)

     Objective:

        \mathbb{E}_{x \sim p(x),\ \hat{x} \sim p_\sigma(\hat{x} \mid x)} \left\| s_\theta(\hat{x}) - \nabla_{\hat{x}} \log p_\sigma(\hat{x} \mid x) \right\|_2^2

     After training:  s_\theta(\hat{x}) \approx \nabla_{\hat{x}} \log p_\sigma(\hat{x}) \approx \nabla_x \log p(x)  (for small \sigma)
Naïve score-based model. Problems

• Manifold hypothesis: real data resides on low dimensional manifolds

• The score is undefined outside these low dimensional manifolds

• Data being concentrated in certain regions results in further issues:

     Incorrect estimation of the score within the low-density regions

     Langevin dynamics never converging to the high-density regions

Image credit: Yang Song, https://yang-song.net/blog/2021/score/


Naïve score-based model. Problems

Image credit: Yang Song, https://yang-song.net/blog/2021/score/


Noise Conditioned Score Network (NCSNs)
• Solution:
     Perturb the data with random Gaussian noise at different scales
     Learn score estimates for all the noisy distributions via a single noise-conditioned score network

Image credit: Yang Song, https://yang-song.net/blog/2021/score/


Noise Conditioned Score Network (NCSNs)
• Given a sequence of Gaussian noise scales \sigma_1 < \sigma_2 < \dots < \sigma_T such that:
    o p_{\sigma_1}(x) \approx p(x), i.e. the smallest scale approximates the true data distribution
    o p_{\sigma_T}(x) \approx \mathcal{N}(0, I), i.e. the largest scale is almost equal to the standard Gaussian distribution

• And the forward process, i.e. the noise perturbation, given by:

    p_{\sigma_t}(\hat{x} \mid x) = \mathcal{N}(\hat{x};\ x,\ \sigma_t^2 \cdot I) = \frac{1}{\sigma_t \sqrt{2\pi}} \cdot \exp\!\left( \frac{-1}{2} \cdot \frac{(\hat{x} - x)^2}{\sigma_t^2} \right)

• The gradient can be written as:

    \nabla_{\hat{x}} \log p_{\sigma_t}(\hat{x} \mid x) = -\,\frac{\hat{x} - x}{\sigma_t^2}
Noise Conditioned Score Network (NCSNs)

• Training the NCSN with denoising score matching, the following objective is minimized:

\mathcal{L}_{ncsn} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{p(x)}\, \mathbb{E}_{p_{\sigma_t}(\hat{x} \mid x)} \left\| s_\theta(\hat{x}, \sigma_t) + \frac{\hat{x} - x}{\sigma_t^2} \right\|_2^2
Noise Conditioned Score Network (NCSNs)

• Training the NCSN with denoising score matching, the following objective is minimized:

\mathcal{L}_{ncsn} = \frac{1}{T} \sum_{t=1}^{T} \lambda(\sigma_t)\, \mathbb{E}_{p(x)}\, \mathbb{E}_{p_{\sigma_t}(\hat{x} \mid x)} \left\| s_\theta(\hat{x}, \sigma_t) + \frac{\hat{x} - x}{\sigma_t^2} \right\|_2^2

\lambda(\sigma_t) is a weighting function.
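A sketch of this weighted objective on 2-D toy data; the choice \lambda(\sigma) = \sigma^2, the geometric noise scales, and the small MLP score network are assumptions commonly used with NCSN, not details taken from the slide.

```python
import torch
import torch.nn as nn

sigmas = torch.logspace(0, -2, 10)                       # assumed geometric scales, from largest to smallest
# Toy score network s_theta(x_hat, sigma) for 2-D data; a real NCSN conditions a UNet on sigma.
s_theta = nn.Sequential(nn.Linear(2 + 1, 128), nn.ReLU(), nn.Linear(128, 2))

def ncsn_loss(x):
    idx = torch.randint(0, len(sigmas), (x.shape[0],))
    sigma = sigmas[idx][:, None]
    x_hat = x + sigma * torch.randn_like(x)              # x_hat ~ N(x, sigma^2 I)
    target = -(x_hat - x) / sigma ** 2                   # grad_{x_hat} log p_sigma(x_hat | x)
    pred = s_theta(torch.cat([x_hat, sigma], dim=1))     # s_theta(x_hat, sigma)
    lam = sigma ** 2                                     # lambda(sigma) = sigma^2 (assumed weighting)
    return (lam * (pred - target) ** 2).mean()           # weighted denoising score matching

x = torch.randn(128, 2)                                  # stand-in batch from p(x)
print(ncsn_loss(x))
```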
Noise Conditioned Score Network (NCSNs). Sampling
Annealed Langevin dynamics

Parameters:
    N – number of iterations of Langevin dynamics at each noise scale
    \sigma_1 < \dots < \sigma_T – noise scales
    \epsilon – update magnitude

Algorithm:

    x \sim \mathcal{N}(0, I)
    for t = T, \dots, 1 do:
        \gamma_t = \epsilon \cdot \sigma_t^2 / \sigma_1^2
        for i = 1, \dots, N do:
            z \sim \mathcal{N}(0, I)
            x = x + \frac{\gamma_t}{2}\, s_\theta(x, \sigma_t) + \sqrt{\gamma_t}\, z
    return x
Noise Conditioned Score Network (NCSNs). Sampling
Annealed Langevin dynamics

Parameters:
    N – number of iterations of Langevin dynamics at each noise scale
    \sigma_1 < \dots < \sigma_T – noise scales
    \epsilon – update magnitude

Algorithm:

    x \sim \mathcal{N}(0, I)                          % sample some standard Gaussian noise
    for t = T, \dots, 1 do:                           % start from the largest noise scale, denoted by the time step
        \gamma_t = \epsilon \cdot \sigma_t^2 / \sigma_1^2          % step size for the current scale
        for i = 1, \dots, N do:                       % for N iterations execute the Langevin dynamics updates
            z \sim \mathcal{N}(0, I)                  % get noise
            x = x + \frac{\gamma_t}{2}\, s_\theta(x, \sigma_t) + \sqrt{\gamma_t}\, z   % update, then move to the next iteration
    return x
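A runnable sketch of the algorithm above. The step-size schedule \gamma_t = \epsilon \cdot \sigma_t^2 / \sigma_{min}^2 and the analytic score used in place of s_theta are assumptions for illustration.

```python
import torch

def annealed_langevin(score_fn, sigmas, n_iters=100, eps=2e-5, shape=(1000, 2)):
    """sigmas must be ordered from the largest to the smallest noise scale."""
    x = torch.randn(shape)                                   # sample standard Gaussian noise
    for sigma in sigmas:                                     # start from the largest noise scale
        gamma = eps * sigma ** 2 / sigmas[-1] ** 2           # assumed step-size schedule
        for _ in range(n_iters):                             # N Langevin dynamics updates per scale
            z = torch.randn_like(x)                          # get noise
            x = x + 0.5 * gamma * score_fn(x, sigma) + gamma ** 0.5 * z   # update
    return x

# Analytic stand-in for s_theta(x, sigma) when p(x) = N(0, I): the perturbed density is
# p_sigma = N(0, (1 + sigma^2) I), whose score is -x / (1 + sigma^2).
score = lambda x, sigma: -x / (1.0 + sigma ** 2)
samples = annealed_langevin(score, sigmas=torch.logspace(0, -2, 10))
print(samples.std(dim=0))                                    # roughly 1, as for N(0, I)
```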
DDPM vs NCSN. Losses

DDPM:  \mathcal{L}_{ddpm} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{x_0 \sim p(x_0),\ z_t \sim \mathcal{N}(0, I)} \left\| z_\theta(x_t, t) - z_t \right\|_2^2

NCSN:  \mathcal{L}_{ncsn} = \frac{1}{T} \sum_{t=1}^{T} \lambda(\sigma_t)\, \mathbb{E}_{p(x)}\, \mathbb{E}_{p_{\sigma_t}(\hat{x} \mid x)} \left\| s_\theta(\hat{x}, \sigma_t) + \frac{\hat{x} - x}{\sigma_t^2} \right\|_2^2

• In \mathcal{L}_{ddpm}, the weighting function is missing because sample quality is better when \lambda(\cdot) is set to 1.

• We can rewrite the noise as  z = \frac{\hat{x} - x}{\sigma_t},  so the NCSN term becomes  s_\theta(\hat{x}, \sigma_t) + \frac{z}{\sigma_t}.

• So, s_\theta learns to approximate a scaled negative noise:  s_\theta(\hat{x}, \sigma_t) \approx -\frac{z}{\sigma_t}.


DDPM vs NCSN. Sampling
DDPM:  x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \hat{\beta}_t}}\, z_\theta(x_t, t) \right) + \sqrt{\beta_t} \cdot z

• Iterative updates are based on subtracting some form of noise from the noisy image.

NCSN:  x_i = x_{i-1} + \frac{\gamma_t}{2} \cdot s_\theta(x_{i-1}, \sigma_t) + \sqrt{\gamma_t} \cdot z

• This is true also for NCSN, because s_\theta(\hat{x}, \sigma_t) approximates the negative of the noise:

    \mathcal{L}_{ncsn} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{p(x)}\, \mathbb{E}_{p_{\sigma_t}(\hat{x} \mid x)} \left\| s_\theta(\hat{x}, \sigma_t) + \frac{\hat{x} - x}{\sigma_t^2} \right\|_2^2
                       = \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{p(x)}\, \mathbb{E}_{p_{\sigma_t}(\hat{x} \mid x)} \left\| s_\theta(\hat{x}, \sigma_t) - \left( -\frac{z}{\sigma_t} \right) \right\|_2^2,
    \qquad z = \frac{\hat{x} - x}{\sigma_t}

    \min_\theta \frac{1}{T} \sum_{t=1}^{T} \mathbb{E}_{x_0 \sim p(x_0),\ z_t \sim \mathcal{N}(0, I)} \left\| z_t - z_\theta(x_t, t) \right\|_2^2   (DDPM objective, for comparison)

• Therefore, the generative processes defined by NCSN and DDPM are very similar.
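A quick numerical check (under assumed toy values) of the relation used above: the analytic score of p_\sigma(\hat{x} \mid x) is exactly the scaled negative noise -z/\sigma.

```python
import torch

x = torch.randn(4, 2)                        # clean sample x
sigma = 0.5
z = torch.randn_like(x)                      # z ~ N(0, I)
x_hat = x + sigma * z                        # x_hat ~ N(x, sigma^2 I), hence z = (x_hat - x) / sigma
score = -(x_hat - x) / sigma ** 2            # grad_{x_hat} log p_sigma(x_hat | x)
print(torch.allclose(score, -z / sigma))     # True: the score equals the scaled negative noise
```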
Outline

1. Motivation
2. High-level overview
3. Denoising diffusion probabilistic models
4. Noise Conditioned Score Network
5. Stochastic Differential Equations
6. Conditional Generation
7. Research directions
Stochastic Differential Equations (SDEs)

• A generalized framework that encompasses the previous two methods

• Here, the diffusion process is continuous in time and is described by an SDE

• It works by the same principle:

     Gradually transform the data distribution p(x_0) into noise

     Reverse the process to recover the original data distribution


Stochastic Differential Equations (SDEs)

• The forward diffusion process is represented by the following SDE:

\frac{\partial x}{\partial t} = f(x, t) + \sigma(t)\, \omega_t \quad \Longleftrightarrow \quad \partial x = f(x, t)\, \partial t + \sigma(t) \cdot \partial \omega

where:
• f(x, t) – drift coefficient: gradually nullifies the data x_0
• \sigma(t) – diffusion coefficient: controls how much Gaussian noise is added
• \partial \omega – white Gaussian noise (notation for \mathcal{N}(0, \partial t))
Stochastic Differential Equations (SDEs)

• The reverse-time SDE is defined as:


\partial x = \left[ f(x, t) - \sigma(t)^2 \cdot \nabla_x \log p_t(x) \right] \partial t + \sigma(t) \cdot \partial \omega

• The training objective is similar to NCSN, but adapted for continuous time:

    \mathcal{L}^{*}_{sde} = \mathbb{E}_{t}\ \lambda(t)\ \mathbb{E}_{p(x_0)}\ \mathbb{E}_{p(x_t \mid x_0)} \left\| s_\theta(x_t, t) - \nabla_{x_t} \log p(x_t \mid x_0) \right\|_2^2

• The score function is used in the reverse-time SDE:

     A neural network is employed to estimate the score function.
     A numerical SDE solver is then used to generate samples.
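A sketch of such a numerical solver: Euler-Maruyama integration of the reverse-time SDE for the variance-exploding case f(x, t) = 0, with an analytic score standing in for s_theta; the geometric sigma(t) schedule and the toy Gaussian data are assumptions.

```python
import math
import torch

sigma_min, sigma_max, T = 0.01, 10.0, 1.0

def sigma(t):                                    # assumed geometric noise schedule sigma(t)
    return sigma_min * (sigma_max / sigma_min) ** (t / T)

def g2(t):                                       # squared diffusion coefficient: d[sigma(t)^2]/dt
    return sigma(t) ** 2 * 2.0 * math.log(sigma_max / sigma_min) / T

def score(x, t):                                 # analytic stand-in for s_theta(x, t) when p(x_0) = N(0, I)
    return -x / (1.0 + sigma(t) ** 2)            # score of N(0, (1 + sigma(t)^2) I)

def reverse_sde_sample(n_steps=1000, shape=(1000, 2)):
    dt = T / n_steps
    x = sigma_max * torch.randn(shape)           # x(T) is approximately N(0, sigma(T)^2 I)
    for i in range(n_steps, 0, -1):
        t = i * dt
        drift = -g2(t) * score(x, t)             # f(x, t) = 0 for the variance-exploding SDE
        x = x - drift * dt + g2(t) ** 0.5 * dt ** 0.5 * torch.randn_like(x)   # Euler-Maruyama step, backward in time
    return x

samples = reverse_sde_sample()
print(samples.std(dim=0))                        # roughly 1, matching the toy data distribution N(0, I)
```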
Stochastic Differential Equations (SDEs). NCSN

• The process of NCSN:


x_t \sim \mathcal{N}\!\left(x_t;\ x_{t-1},\ (\sigma_t^2 - \sigma_{t-1}^2) \cdot I\right) \ \Rightarrow\ x_t = x_{t-1} + \sqrt{\sigma_t^2 - \sigma_{t-1}^2} \cdot z_{t-1}

• We can reformulate the above expression to look like a discretization of an SDE:

    x_t - x_{t-1} = \sqrt{\frac{\sigma_t^2 - \sigma_{t-1}^2}{t - (t-1)}} \cdot z_{t-1}

• Translating the above discretization to the continuous case:

    \partial x = \sqrt{\frac{\partial [\sigma^2(t)]}{\partial t}}\ \partial \omega(t)

which matches the general forward SDE  \partial x = f(x, t)\, \partial t + \sigma(t) \cdot \partial \omega  with  f(x, t) = 0.
Stochastic Differential Equations (SDEs). DDPM

• The process of DDPM:


x_t \sim \mathcal{N}\!\left(x_t;\ \sqrt{1 - \beta_t}\, x_{t-1},\ \beta_t I\right) \ \Rightarrow\ x_t = \sqrt{1 - \beta_t}\, x_{t-1} + \sqrt{\beta_t} \cdot z_{t-1}

• If we consider a time step of size \Delta t = \frac{1}{T} instead of 1, and \beta(t) \Delta t = \beta_t:

    x_t = \sqrt{1 - \beta(t) \Delta t}\ x_{t - \Delta t} + \sqrt{\beta(t) \Delta t} \cdot z

• Using the Taylor expansion of \sqrt{1 - \beta(t) \Delta t}:

    x_t \approx \left(1 - \frac{\beta(t) \Delta t}{2}\right) x_{t - \Delta t} + \sqrt{\beta(t) \Delta t} \cdot z

    x_t \approx x_{t - \Delta t} - \frac{\beta(t) \Delta t}{2}\, x_{t - \Delta t} + \sqrt{\beta(t) \Delta t} \cdot z
    \ \Longleftrightarrow\ x_t - x_{t - \Delta t} = -\frac{\beta(t) \Delta t}{2}\, x_{t - \Delta t} + \sqrt{\beta(t) \Delta t} \cdot z

• For the continuous case, the above becomes:

    \partial x = -\frac{1}{2} \beta(t)\, x\, \partial t + \sqrt{\beta(t)}\, \partial \omega(t)
Outline

1. Motivation
2. High-level overview
3. Denoising diffusion probabilistic models
4. Noise Conditioned Score Network
5. Stochastic Differential Equations
6. Conditional Generation
7. Research directions
Conditional generation.
Diffusion models estimate the score function \nabla_{x_t} \log p_t(x_t) to sample from a distribution p(x).
Sampling from p(x \mid y) requires the score function of the conditional density, \nabla_{x_t} \log p_t(x_t \mid y); y is the condition.

Solution 1. Conditional training: train the model with an additional input y to estimate \nabla_{x_t} \log p_t(x_t \mid y):

    s_\theta(x_t, t, y) \approx \nabla_{x_t} \log p_t(x_t \mid y)
Conditional generation. Classifier Guidance
Diffusion models estimate the score function \nabla_{x_t} \log p_t(x_t) to sample from a distribution p(x).
Sampling from p(x \mid y) requires the score function of the conditional density, \nabla_{x_t} \log p_t(x_t \mid y).

Solution 2. Classifier guidance:

Bayes rule:
    p_t(x_t \mid y) = \frac{p_t(y \mid x_t) \cdot p_t(x_t)}{p_t(y)} \ \Longleftrightarrow
Logarithm:
    \log p_t(x_t \mid y) = \log p_t(y \mid x_t) + \log p_t(x_t) - \log p_t(y) \ \Longleftrightarrow
Gradient (note that \nabla_{x_t} \log p_t(y) = 0, since p_t(y) does not depend on x_t):
    \nabla_{x_t} \log p_t(x_t \mid y) = \nabla_{x_t} \log p_t(y \mid x_t) + \nabla_{x_t} \log p_t(x_t) - \nabla_{x_t} \log p_t(y) \ \Longleftrightarrow

    \nabla_{x_t} \log p_t(x_t \mid y) = \underbrace{\nabla_{x_t} \log p_t(y \mid x_t)}_{\text{classifier}} + \underbrace{\nabla_{x_t} \log p_t(x_t)}_{\text{unconditional diffusion model}}
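A sketch of how the two gradients above are combined in practice; `score_model` (the unconditional diffusion/score model) and `classifier` (a classifier trained on noisy inputs) are hypothetical handles, and the classifier gradient is obtained through autograd.

```python
import torch

def classifier_guided_score(score_model, classifier, x_t, t, y, s=1.0):
    """grad log p_t(x_t | y)  ~  s * grad log p_t(y | x_t) + grad log p_t(x_t)."""
    x_in = x_t.detach().requires_grad_(True)
    logits = classifier(x_in, t)                                      # classifier applied to the *noisy* x_t
    log_p_y = torch.log_softmax(logits, dim=-1)[torch.arange(len(y)), y].sum()
    grad_cls = torch.autograd.grad(log_p_y, x_in)[0]                  # grad_{x_t} log p_t(y | x_t)
    return s * grad_cls + score_model(x_t, t)                         # guided score (s is the guidance weight)
```

At each denoising step, this guided score (or the equivalent shift of the predicted mean) replaces the unconditional score in the sampling update.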
Conditional generation. Classifier Guidance
Solution 2. Classifier guidance, with a guidance weight s:

    \nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \nabla_{x_t} \log p_t(y \mid x_t) + \nabla_{x_t} \log p_t(x_t)

[Figure: samples generated with guidance weight s = 1 vs. s = 10]

Problems:
• Need good gradient estimates at each step of the denoising process
• Need a classifier that is robust to the noise added to the image
• Requires training the classifier on noisy data, which can be problematic


Conditional generation. Classifier-free Guidance
Solution 3. Classifier-free guidance

Start from the classifier-guided score:
    \nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \nabla_{x_t} \log p_t(y \mid x_t) + \nabla_{x_t} \log p_t(x_t)

Bayes rule:
    p_t(y \mid x_t) = \frac{p_t(x_t \mid y) \cdot p_t(y)}{p_t(x_t)}
Logarithm:
    \log p_t(y \mid x_t) = \log p_t(x_t \mid y) - \log p_t(x_t) + \log p_t(y)
Gradient (\nabla_{x_t} \log p_t(y) = 0):
    \nabla_{x_t} \log p_t(y \mid x_t) = \nabla_{x_t} \log p_t(x_t \mid y) - \nabla_{x_t} \log p_t(x_t)

Substituting into the expression above:
    \nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \left( \nabla_{x_t} \log p_t(x_t \mid y) - \nabla_{x_t} \log p_t(x_t) \right) + \nabla_{x_t} \log p_t(x_t)

    \nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \nabla_{x_t} \log p_t(x_t \mid y) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)

Both terms are learned by a single model (see the sketch below).
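A sketch of that combination, assuming a single conditional network `s_theta(x_t, t, y)` that was trained with the condition randomly dropped (here `y=None` denotes the null condition; the handles are hypothetical).

```python
def cfg_score(s_theta, x_t, t, y, s=3.0):
    """s * grad log p_t(x_t | y) + (1 - s) * grad log p_t(x_t), both from one network."""
    cond = s_theta(x_t, t, y)             # score estimate with the condition y
    uncond = s_theta(x_t, t, None)        # score estimate with the null condition
    return s * cond + (1.0 - s) * uncond  # equivalently: uncond + s * (cond - uncond)
```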


Conditional generation. Classifier-free Guidance

[Figure: a conditional model s_\theta(x_t, t, y) receives the condition y as an extra input and approximates the score of p_t(x_t \mid y)]
Conditional generation. Classifier-free Guidance

[Figure: the same model s_\theta(x_t, t, y/0) is trained both with the condition y and with a null condition 0 (condition dropped), so it approximates the scores of both p_t(x_t \mid y) and p_t(x_t \mid 0) = p_t(x_t)]
CLIP guidance
What is a CLIP model?

• An image encoder f(x) and a text encoder g(c), trained jointly with a contrastive cross-entropy loss

• The optimal value of f(x) \cdot g(c) is \log p(x \mid c) - \log p(c), up to an additive constant (the quantity substituted on the next slide)

Slide from: Denoising Diffusion-based Generative Modeling: Foundations and Applications


Karsten Kreis Ruiqi Gao Arash Vahdat

Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”, 2021.
Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021.
CLIP guidance
Replace the classifier in classifier guidance with a CLIP model.

Classifier-free form (condition y):
    \nabla_{x_t} \log p_t(x_t \mid y) = s \cdot \nabla_{x_t} \log p_t(x_t \mid y) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)

With a text caption c in place of y:
    \nabla_{x_t} \log p_t(x_t \mid c) = s \cdot \nabla_{x_t} \log p_t(x_t \mid c) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)

    \nabla_{x_t} \log p_t(x_t \mid c) = s \cdot \nabla_{x_t} \left( \log p_t(x_t \mid c) - \log p(c) \right) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)

Replacing the term in parentheses with the CLIP similarity f(x_t) \cdot g(c):
    \nabla_{x_t} \log p_t(x_t \mid c) = s \cdot \nabla_{x_t} \left( f(x_t) \cdot g(c) \right) + (1 - s) \cdot \nabla_{x_t} \log p_t(x_t)

Slide from: Denoising Diffusion-based Generative Modeling: Foundations and Applications


Karsten Kreis Ruiqi Gao Arash Vahdat

Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”, 2021.
Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021.
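A sketch of the CLIP-guided combination from the formula above; `clip_image_encoder` (a noise-aware image encoder f), `text_emb` (the caption embedding g(c)), and `score_model` are hypothetical handles, and the similarity gradient comes from autograd.

```python
import torch

def clip_guided_score(score_model, clip_image_encoder, text_emb, x_t, t, s=3.0):
    """s * grad_{x_t}[ f(x_t) . g(c) ] + (1 - s) * grad log p_t(x_t)."""
    x_in = x_t.detach().requires_grad_(True)
    image_emb = clip_image_encoder(x_in, t)                 # f(x_t), ideally trained on noisy images
    sim = (image_emb * text_emb).sum()                      # f(x_t) . g(c), summed over the batch
    grad_sim = torch.autograd.grad(sim, x_in)[0]            # grad_{x_t} of the CLIP similarity
    return s * grad_sim + (1.0 - s) * score_model(x_t, t)   # guided score with weight s
```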
Outline

1. Motivation
2. High-level overview
3. Denoising diffusion probabilistic models
4. Noise Conditioned Score Network
5. Stochastic Differential Equations
6. Conditional Generation
7. Research directions
Research directions
Unconditional image generation:
• Sampling efficiency
• Image quality
Conditional image generation:
• Text-to-image generation
Complex tasks in computer vision:
• Image editing, even based on text
• Super-resolution
• Image segmentation
• Anomaly detection in medical images
• Video generation
Thank you !
Survey: https://arxiv.org/abs/2209.04747
GitHub: https://github.com/CroitoruAlin/Diffusion-Models-in-Vision-A-Survey
