
Lecture X: Denoising Diffusion Models

My class :(

Me, teaching Diffusion Models that I knew nothing about two weeks ago!
Next few lectures: Generative models for direct image-based rendering.

[Diagram: Current Image → Encoder (Vision) → 3D intrinsic components in the latent space of generative models → edit components → Decoder (Graphics) → New image under different conditions.]

Change:
• Viewpoint
• Lighting
• Reflectance
• Background
• Attributes
• Many others…

Implicit: use a neural network (conditional generative networks), often end-to-end.
Slide Courtesy:

Denoising Diffusion-based Generative Modeling: Foundations and Applications,
CVPR 2022 tutorial, Karsten Kreis, Ruiqi Gao, Arash Vahdat

https://cvpr2022-tutorial-diffusion-models.github.io/

@karsten_kreis @RuiqiGao @ArashVahdat

1
Deep Generative Learning
Learning to generate data

[Diagram: samples from a data distribution → train → neural network → sample new data.]

2
The Landscape of Deep Generative Learning

• Autoregressive Models
• Normalizing Flows
• Variational Autoencoders
• Generative Adversarial Networks
• Energy-based Models
• Denoising Diffusion Models
7
Denoising Diffusion Models
Emerging as powerful generative models, outperforming GANs

“Diffusion Models Beat GANs on Image Synthesis”, Dhariwal & Nichol, OpenAI, 2021
“Cascaded Diffusion Models for High Fidelity Image Generation”, Ho et al., Google, 2021
8
Image Super-resolution
Successful applications

7
Saharia et al., Image Super-Resolution via Iterative Refinement, ICCV 2021
Text-to-Image Generation
DALL·E 2: “a teddy bear on a skateboard in times square”
Imagen: “A group of teddy bears in suit in a corporate office celebrating the birthday of their friend. There is a pizza cake on the desk.”

“Hierarchical Text-Conditional Image Generation with CLIP Latents”, Ramesh et al., 2022
“Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, Saharia et al., 2022
8
Text-to-Image Generation

Stable Diffusion

Stable Diffusion Applications: Twitter Mega Thread


“High-Resolution Image Synthesis with Latent Diffusion Models” Rombach et al., 2022

9
Q: What is a diffusion model?

78
Denoising Diffusion Models
Learning to generate by denoising

Denoising diffusion models consist of two processes:

• Forward diffusion process that gradually adds noise to input

• Reverse denoising process that learns to generate data by denoising

Forward diffusion process (fixed)

Data Noise

Reverse denoising process (generative)

Sohl-Dickstein et al., Deep Unsupervised Learning using Nonequilibrium Thermodynamics, ICML 2015
Ho et al., Denoising Diffusion Probabilistic Models, NeurIPS 2020
Song et al., Score-Based Generative Modeling through Stochastic Differential Equations, ICLR 2021 18
Forward Diffusion Process

The formal definition of the forward process in T steps:

Forward diffusion process (fixed)

Data Noise

x0 x1 x2 x3 x4 … xT

Sample: $x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_{t-1}$, where $\epsilon_{t-1} \sim \mathcal{N}(0, I)$

(The first term sets the mean and the second the variance, i.e. $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$, and the full forward process is $q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$.)
19
Diffusion Kernel

Data Noise

x0 x1 x2 x3 x4 … xT

Sample: $x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_{t-1}$, where $\epsilon_{t-1} \sim \mathcal{N}(0, I)$ (mean and variance as before).

You will need to prove this in your assignment.

Define $\bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s)$. Then the diffusion kernel is $q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar\alpha_t}\, x_0, (1-\bar\alpha_t)\, I)$.

For sampling: $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.

The $\beta_t$ values (i.e., the noise schedule) are designed such that $\bar\alpha_T \to 0$ and $q(x_T \mid x_0) \approx \mathcal{N}(x_T; 0, I)$. 19
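A minimal sketch of the forward step and the diffusion kernel above, in PyTorch-style code. The linear beta schedule (1e-4 to 0.02 over T = 1000 steps) and the tensor shapes are illustrative assumptions, not values given in the slides.

```python
# Hedged sketch: forward diffusion step and closed-form diffusion kernel q(x_t | x_0).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # assumed linear noise schedule beta_t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def forward_step(x_prev, t):
    """One Markov step: x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) * eps."""
    eps = torch.randn_like(x_prev)
    return torch.sqrt(1.0 - betas[t]) * x_prev + torch.sqrt(betas[t]) * eps

def diffusion_kernel_sample(x0, t):
    """Sample x_t directly from q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    eps = torch.randn_like(x0)
    xt = torch.sqrt(alpha_bars[t]) * x0 + torch.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

x0 = torch.randn(4, 3, 32, 32)               # a toy batch standing in for images
xt, _ = diffusion_kernel_sample(x0, t=500)
```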
What happens to a distribution in the forward diffusion?

So far, we discussed the diffusion kernel $q(x_t \mid x_0)$, but what about the diffused marginal $q(x_t)$?

Diffused Data Distributions


Data Noise

$q(x_t) = \int q(x_0, x_t)\, dx_0 = \int q(x_0)\, q(x_t \mid x_0)\, dx_0$
(diffused data dist. = joint dist. integrated out = input data dist. × diffusion kernel)

The diffusion kernel is a Gaussian, so the diffused marginals are Gaussian convolutions of the data distribution. q(x0) q(x1) q(x2) q(x3) … q(xT)

We can sample $x_t \sim q(x_t)$ by first sampling $x_0 \sim q(x_0)$ and then sampling $x_t \sim q(x_t \mid x_0)$ (i.e., ancestral sampling).

21
Generative Learning by Denoising

Recall that the diffusion parameters are designed such that $q(x_T) \approx \mathcal{N}(x_T; 0, I)$.

Diffused Data Distributions

Generation:

Sample $x_T \sim \mathcal{N}(x_T; 0, I)$

Iteratively sample $x_{t-1} \sim q(x_{t-1} \mid x_t)$

True Denoising Dist.

q(x0) q(x1) q(x2) q(x3) … q(xT)

q(x0|x1) q(x1|x2) q(x2|x3) q(x3|x4) q(xT-1|xT)


In general, the true denoising distribution $q(x_{t-1} \mid x_t)$ is intractable.

Can we approximate $q(x_{t-1} \mid x_t)$? Yes, we can use a Normal distribution if $\beta_t$ is small in each forward diffusion step.
22
Reverse Denoising Process

Formal definition of forward and reverse processes in T steps:

Reverse denoising process (generative)

Data Noise

x0 x1 x2 x3 x4 … xT

$p(x_T) = \mathcal{N}(x_T; 0, I)$, and $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$.

The denoising autoencoder predicts the mean of the denoised image $x_{t-1}$ given $x_t$.

Trainable network (U-Net, denoising autoencoder) 23
How do we train? (summary version)
What is the loss function? (Ho et al., NeurIPS 2020)

The U-Net autoencoder takes $x_t$ (and the timestep $t$) as input and simply predicts the noise. Training minimizes $\mathbb{E}_{t, x_0, \epsilon}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\big]$, i.e., the predicted noise pattern should match the unit-normal noise that was injected. Very similar to VAE, right?
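The loss above fits in a few lines of code. A hedged sketch, assuming an epsilon-predicting network `model(x_t, t)` and the `alpha_bars` array from the earlier forward-process sketch; both are assumptions, not code from the lecture.

```python
# Simplified DDPM training objective (Ho et al., NeurIPS 2020): predict the injected noise.
import torch
import torch.nn.functional as F

def ddpm_training_loss(model, x0, alpha_bars):
    b = x0.shape[0]
    t = torch.randint(0, len(alpha_bars), (b,), device=x0.device)    # random timestep per sample
    eps = torch.randn_like(x0)                                       # the noise to be predicted
    abar = alpha_bars.to(x0.device)[t].view(b, 1, 1, 1)
    xt = torch.sqrt(abar) * x0 + torch.sqrt(1.0 - abar) * eps        # diffuse x0 to x_t in one shot
    return F.mse_loss(model(xt, t), eps)                             # MSE between true and predicted noise
```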
Summary
Training and Sample Generation

Intuitively: during the forward process we add noise to the image. During the reverse process we predict that noise with a U-Net and subtract it from the image to denoise it (a sampling sketch follows below).

27
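For the reverse direction, a hedged sketch of DDPM ancestral sampling with the noise-prediction parameterization and the $\sigma_t^2 = \beta_t$ choice mentioned later in the slides. `model`, `betas`, `alphas`, and `alpha_bars` are assumed from the earlier sketches.

```python
# Ancestral sampling: start from pure noise and repeatedly remove the predicted noise,
# re-injecting a small amount of fresh noise at every step except the last.
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas, alphas, alpha_bars, device="cpu"):
    x = torch.randn(shape, device=device)                            # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        tt = torch.full((shape[0],), t, device=device, dtype=torch.long)
        eps = model(x, tt)                                           # predicted noise
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise                      # sigma_t^2 = beta_t choice
    return x
```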
Implementation Considerations
Network Architectures

Diffusion models often use U-Net architectures with ResNet blocks and self-attention layers to represent $\epsilon_\theta(x_t, t)$.

[Diagram: the time representation is processed by fully-connected layers before being injected into the network.]

Time representation: sinusoidal positional embeddings or random Fourier features (a sinusoidal embedding is sketched below).

Time features are fed to the residual blocks using either simple spatial addition or adaptive group normalization layers (see Dhariwal and Nichol, NeurIPS 2021).
28
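As an illustration of the first option, a minimal sketch of a sinusoidal timestep embedding. The embedding dimension and the 10000 base follow the usual Transformer convention and are assumptions, not values from the slides.

```python
# Sinusoidal positional embedding of the diffusion timestep, typically passed through
# fully-connected layers and injected into the U-Net's residual blocks.
import math
import torch

def timestep_embedding(t, dim=128):
    """Map integer timesteps t of shape [B] to features of shape [B, dim]."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```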
Diffusion Parameters
Noise Schedule

Data Noise

Above, $\beta_t$ and $\sigma_t^2$ control the variance of the forward diffusion and reverse denoising processes, respectively.

Often a linear schedule is used for $\beta_t$, and $\sigma_t^2$ is set equal to $\beta_t$: slowly increase the amount of added noise.

Kingma et al., NeurIPS 2021 introduce a new parameterization of diffusion models using the signal-to-noise ratio (SNR), and show how to learn the noise schedule by minimizing the variance of the training objective.

We can also learn $\sigma_t^2$ while training the diffusion model by minimizing the variational bound (Improved DDPM by Nichol and Dhariwal, ICML 2021) or after training the diffusion model (Analytic-DPM by Bao et al., ICLR 2022).
29
What happens to an image in the forward diffusion process?

Recall that sampling from $q(x_t \mid x_0)$ is done using $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.

[Figure: Fourier transforms of a diffused image at small and large t; the frequency spectrum flattens as t grows.]

In the forward diffusion, the high-frequency content is perturbed faster.
30
Content-Detail Tradeoff

Reverse denoising process (generative)

Data Noise

x0 x1 x2 x3 x4 … xT

At small t, the denoising model is specialized for generating the high-frequency content (i.e., low-level details); at large t, it is specialized for generating the low-frequency content (i.e., coarse content).

The weighting of the training objective for different timesteps is important!


31
Connection to VAEs

Diffusion models can be considered as a special form of hierarchical VAEs.

However, in diffusion models:

• The encoder is fixed

• The latent variables have the same dimension as the data

• The denoising model is shared across different timesteps

• The model is trained with some reweighting of the variational bound.

Vahdat and Kautz, NVAE: A Deep Hierarchical Variational Autoencoder, NeurIPS 2020
Sønderby, et al.. Ladder variational autoencoders, NeurIPS 2016. 32
Summary
Denoising Diffusion Probabilistic Models
- The diffusion process can be reversed if the variance of the Gaussian noise added at each diffusion step is small enough.

- To reverse the process we train a U-Net that takes as input the current noisy image and the timestep, and predicts the noise map.

- The training goal is to make sure that the predicted noise map at each step is unit Gaussian (note that in a VAE we also required the latent space to be unit Gaussian).

- During sampling/generation, subtract the predicted noise from the noisy image at time t to generate the image at time t-1

(with some weighting).

The devil is in the details:

• Network architectures

• Objective weighting

• Diffusion parameters (i.e., noise schedule)

See “Elucidating the Design Space of Diffusion-Based Generative Models” by Karras et al. for important design decisions.
To be presented in the class!
33
How do we obtain the ”Score Function”?
Denoising Score Matching
Implementation Details
Forward diffusion process (fixed)

q(x0) q(xT )
x0 … xt … xT

More sophisticated model parametrizations and loss


weightings are possible!

Karras et al., “Elucidating the Design Space of Diffusion-


Based Generative Models”, arXiv, 2022

To be discussed in detail in paper presentation


Advanced Techniques
Questions to address with advanced techniques

• Q1: How to accelerate the sampling process?

• Advanced forward diffusion process

• Advanced reverse process

• Hybrid models & model distillation

• Q2: How to do high-resolution (conditional) generation?

• Conditional diffusion models

• Classifier(-free) guidance

• Cascaded generation

77
Q: How to accelerate sampling process?

78
What makes a good generative model?
The generative learning trilemma
[Trilemma diagram: the three corners are Fast Sampling, Mode Coverage / Diversity, and High-Quality Samples.
Likelihood-based models (variational autoencoders & normalizing flows), GANs, and denoising diffusion models each satisfy two of the three.
Diffusion models often require 1000s of network evaluations for sampling!]

Tackling the Generative Learning Trilemma with Denoising Diffusion GANs, ICLR 2022 79
What makes a good generative model?
The generative learning trilemma

Tackle the trilemma by accelerating diffusion models

[Same trilemma diagram: Fast Sampling, Mode Coverage / Diversity, High-Quality Samples.]

Tackling the Generative Learning Trilemma with Denoising Diffusion GANs, ICLR 2022 80
How to accelerate diffusion models?
[Image credit: Ben Poole, Mohammad Norouzi]
Simple forward process slowly maps data to noise

Reverse process maps noise back to data, using the trained diffusion model

• Naïve acceleration methods, such as reducing the number of diffusion time steps in training or sampling only every k-th time step at inference, lead to an immediate drop in performance.

• We need something cleverer.

• Given a limited number of function calls, usually far fewer than 1000s, how can we improve performance? 81
Denoising diffusion implicit models (DDIM)
Non-Markovian diffusion process

Main Idea

Design a family of non-Markovian diffusion processes and corresponding reverse processes.

The process is designed such that the model can be optimized by the same surrogate
objective as the original diffusion model.

Therefore, we can take a pretrained diffusion model but have more choices of sampling procedure.

Song et al., “Denoising Diffusion Implicit Models”, ICLR 2021. 86


Denoising diffusion implicit models (DDIM)
Non-Markovian diffusion process

Define a family of forward processes that meets the above requirement:

The corresponding reverse process is

Intuitively, given noisy $x_t$ we first predict the corresponding clean image $x_0$ and then use it to obtain a sample $x_{t-1}$.

Regular diffusion model 86


Denoising diffusion implicit models (DDIM)
Non-Markovian diffusion process

The corresponding reverse process is

Intuitively, given noisy $x_t$ we first predict the corresponding clean image $x_0$ and then use it to obtain a sample $x_{t-1}$.

- Different choices of 𝜎 result in different generative processes without re-training the model.

- When 𝜎 = 0 for all t, we have a deterministic generative process, with the only randomness coming from the initial noise $x_T$ (a single DDIM step is sketched below).
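A hedged sketch of one deterministic DDIM update (the σ = 0 case): predict the clean image $x_0$ from the noise estimate, then step to $x_{t-1}$. `eps_model` and `alpha_bars` are carried over from the earlier sketches and are assumptions, not the paper's released code.

```python
# One DDIM step with sigma = 0: deterministic mapping from x_t to x_{t-1}.
import torch

@torch.no_grad()
def ddim_step(eps_model, xt, t, t_prev, alpha_bars):
    abar_t, abar_prev = alpha_bars[t], alpha_bars[t_prev]
    tt = torch.full((xt.shape[0],), t, device=xt.device, dtype=torch.long)
    eps = eps_model(xt, tt)                                                # predicted noise
    x0_pred = (xt - torch.sqrt(1.0 - abar_t) * eps) / torch.sqrt(abar_t)   # predicted clean image
    return torch.sqrt(abar_prev) * x0_pred + torch.sqrt(1.0 - abar_prev) * eps
```

In practice the sampler steps through a short subsequence of timesteps (e.g. 50 instead of 1000), which is where the speed-up comes from.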
Advanced reverse process
Approximate reverse process with more complicated distributions

Simple forward process slowly maps data to noise

Reverse process maps noise back to data, using the trained diffusion model

• Q: is the normal approximation of the reverse process still accurate when there are fewer diffusion time steps?

94
Advanced approximation of reverse process
Normal assumption in the denoising distribution holds only for small step sizes

Denoising Process with Uni-modal Normal Distribution

Data Noise

Data Noise

Requires more complicated functional approximators!


Xiao et al., “Tackling the Generative Learning Trilemma with Denoising Diffusion GANs”, ICLR 2022.
Gao et al., “Learning energy-based models by diffusion recovery likelihood”, ICLR 2021. 95
Denoising diffusion GANs
Approximating reverse process by conditional GANs

Compared to a one-shot GAN generator:

• Both generator and discriminator are


solving a much simpler problem.

• Stronger mode coverage

• Better training stability

Xiao et al., “Tackling the Generative Learning Trilemma with Denoising Diffusion GANs”, ICLR 2022. 96
Advanced modeling
Latent space modeling & model distillation

Simple forward process slowly maps data to noise

Reverse process maps noise back to data, using the trained diffusion model

• Can we do model distillation for fast sampling?

• Can we lift the diffusion model to a latent space that is faster to diffuse?

99
Progressive distillation
• Distill a deterministic DDIM sampler to the same model architecture.
• At each stage, a “student” model is learned to distill two adjacent sampling steps of the
“teacher” model to one sampling step.
• At next stage, the “student” model from previous stage will serve as the new “teacher” model.

Distillation stage
Salimans & Ho, “Progressive distillation for fast sampling of diffusion models”, ICLR 2022. 100
Latent-space diffusion models
Variational autoencoder + score-based prior
[Diagram: the encoder maps data to a latent space; forward diffusion and generative denoising run in that latent space; the decoder reconstructs the data. Variational autoencoder + denoising diffusion prior.]

Main Idea

Encoder maps the input data to an embedding space

Denoising diffusion models are applied in the latent space

Vahdat et al., “Score-based generative modeling in latent space”, NeurIPS 2021.


102
Rombach et al., “High-Resolution Image Synthesis with Latent Diffusion Models”, CVPR 2022.
Latent-space diffusion models
Variational autoencoder + score-based prior
[Same diagram as above: VAE encoder/decoder with a denoising diffusion prior in latent space.]


Advantages:

(1) The distribution of latent embeddings is close to a Normal distribution → simpler denoising, faster synthesis!

(2) Augmented latent space → more expressivity!

(3) Tailored autoencoders → more expressivity, application to any data type (graphs, text, 3D data, etc.)!
12
Q: How to do high-resolution conditional generation?

105
Impressive conditional diffusion models
Text-to-image generation
DALL·E 2: “a propaganda poster depicting a cat dressed as french emperor napoleon holding a piece of cheese”
Imagen: “A photo of a raccoon wearing an astronaut helmet, looking out of the window at night.”

Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv 2022.
106
Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022.
Impressive conditional diffusion models
Super-resolution & colorization

Super-resolution Colorization

Saharia et al., “Palette: Image-to-Image Diffusion Models”, arXiv 2021. 107


Impressive conditional diffusion models
Panorama generation
← Generated Input Generated →

108
Conditional diffusion models
Include condition as input to reverse process

Reverse process: $p_\theta(x_{0:T} \mid c) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, c)$, where $c$ is the conditioning input.

Variational
upper bound:

Incorporate conditions into U-Net

• Scalar conditioning: encode scalar as a vector embedding, simple spatial addition or


adaptive group normalization layers.

• Image conditioning: channel-wise concatenation of the conditional image.

• Text conditioning: a single vector embedding (spatial addition or adaptive group norm), or a sequence of vector embeddings (cross-attention).
109
Classifier guidance
Using the gradient of a trained classifier as guidance

Recap: What is a score function? It is the gradient of the log-density, $\nabla_{x_t} \log p(x_t)$.


Classifier guidance
Using the gradient of a trained classifier as guidance

Applying Bayes' rule to obtain the conditional score function:

$\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t)$
(unconditional score + classifier gradient)

With a guidance scale $\omega$: $\nabla_{x_t} \log p(x_t) + \omega\, \nabla_{x_t} \log p(y \mid x_t)$. A value > 1 amplifies the influence of the classifier signal.

Slide credits for guidance: https://benanne.github.io/2022/05/26/guidance.html


Classifier guidance
Using the gradient of a trained classifier as guidance

$\nabla_{x_t} \log p(x_t \mid y) \approx s_\theta(x_t, t) + \omega\, \nabla_{x_t} \log p_\phi(y \mid x_t)$  (score model + classifier gradient)

- Train an unconditional diffusion model

- Take your favorite classifier, depending on the conditioning type

- During inference / sampling, mix the gradients of the classifier with the predicted score function of the unconditional diffusion model (see the sketch below). 110
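A minimal sketch of that mixing step, assuming a score model `score_model(x_t, t)` and a noise-robust classifier `classifier(x_t, t)` returning logits; these interfaces are illustrative assumptions rather than any released API.

```python
# Classifier guidance: add the scaled gradient of log p(y | x_t) to the unconditional score.
import torch

def classifier_guided_score(score_model, classifier, xt, t, y, scale=1.0):
    score = score_model(xt, t)                                # approximates grad_x log p(x_t)
    xt = xt.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(xt, t), dim=-1)  # classifier evaluated on the noisy image
    selected = log_probs[torch.arange(len(y)), y].sum()
    grad = torch.autograd.grad(selected, xt)[0]               # grad_x log p(y | x_t)
    return score + scale * grad                               # approximates grad_x log p(x_t | y)
```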
Classifier guidance
Problems of classifier guidance

Classifier

Guidance scale: value >1 amplifies


the influence of classifier signal.

• At each step of denoising, the input to the classifier is a noisy image $x_t$. Standard classifiers are never trained on noisy images, so one needs to re-train the classifier on noisy images; existing pre-trained classifiers can't be used.

• Most of the information in the input x is not relevant to predicting y, and as a result, taking the gradient of
the classifier w.r.t. its input can yield arbitrary (and even adversarial) directions in input space.
Classifier-free guidance
Get guidance by Bayes’ rule on conditional diffusion models

We derived this in classifier guidance:

$\nabla_{x_t} \log p(y \mid x_t) = \nabla_{x_t} \log p(x_t \mid y) - \nabla_{x_t} \log p(x_t)$
(the score function for the conditional diffusion model minus the score function for the unconditional diffusion model)
Classifier-free guidance
Get guidance by Bayes’ rule on conditional diffusion models

Guided score: $(1+\omega)\, \nabla_{x_t} \log p(x_t \mid y) - \omega\, \nabla_{x_t} \log p(x_t)$
(a weighted combination of the score function for the conditional diffusion model and the score function for the unconditional diffusion model)
Classifier-free guidance
Get guidance by Bayes’ rule on conditional diffusion models

In practice:

• Train a conditional diffusion model p(x|y) with conditioning dropout: some percentage of the time, the conditioning information y is removed (10-20% tends to work well).

• The conditioning is often replaced with a special input value representing the absence of conditioning information.

• The resulting model is able to function both as a conditional model p(x|y) and as an unconditional model p(x), depending on whether the conditioning signal is provided.

• During inference / sampling, simply mix the score functions of the conditional and unconditional diffusion models based on the guidance scale (see the sketch below).
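The sketch below shows the mixing step, assuming a single model `model(x_t, t, y)` where `y=None` stands for the special "no conditioning" input; this interface is an assumption for illustration.

```python
# Classifier-free guidance at sampling time: extrapolate from the unconditional
# prediction towards the conditional one with guidance scale w.
def cfg_eps(model, xt, t, y, w=3.0):
    eps_cond = model(xt, t, y)        # conditional noise prediction
    eps_uncond = model(xt, t, None)   # unconditional prediction (dropped-out conditioning)
    return (1.0 + w) * eps_cond - w * eps_uncond
```

Note that papers write the scale either as (1+w)·cond − w·uncond or as uncond + w·(cond − uncond); the two conventions differ by a shift of 1 in w.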
Classifier-free guidance
Trade-off for sample quality and sample diversity

Non-guidance Guidance scale = 1 Guidance scale = 3

Large guidance weight (𝜔) usually leads to better individual sample quality but less sample diversity.

Ho & Salimans, “Classifier-Free Diffusion Guidance”, 2021. 113


Classifier guidance vs. classifier-free guidance

Classifier guidance:
X  Need to train a separate "noise-robust" classifier in addition to the unconditional diffusion model.
X  The gradient of the classifier w.r.t. its input can yield arbitrary values.

Classifier-free guidance:
+  Train the conditional & unconditional diffusion model jointly via conditioning drop-out.
+  All pixels in the input receive equally 'good' gradients.

Rather than constructing a generative model from a classifier, we construct a classifier from a generative model!

Most recent papers use classifier-free guidance! Very simple yet very powerful idea!
Cascaded generation
Pipeline
Super-Resolution Diffusion Models

Class Conditioned Diffusion Model

Similar cascaded / multi-resolution image generation also exists in GANs (BigGAN & StyleGAN).

Cascaded Diffusion Models outperform BigGAN in FID and IS, and VQ-VAE-2 in Classification Accuracy Score.

Ho et al., “Cascaded Diffusion Models for High Fidelity Image Generation”, 2021. 114
Noise conditioning augmentation
Reduce compounding error
Problem:

• During training, super-resolution models are trained on original low-res images from the dataset.

• During inference, however, the low-res inputs are generated by the class-conditioned diffusion model, and they have artifacts and poorer quality than the original low-res images used for training. This creates a train/inference mismatch.

Solution: Noise conditioning augmentation.

• During training, add varying amounts of Gaussian noise (or blurring by Gaussian kernel) to the low-res images.

• During inference, sweep over the optimal amount of noise added to the low-res images.

• BSR-degradation process: applies JPEG compression noise, camera sensor noise, different image interpolations for downsampling, Gaussian blur kernels and Gaussian noise in a random order to an image.

Ho et al., “Cascaded Diffusion Models for High Fidelity Image Generation”, 2021.
115
Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021.
Summary
Questions to address with advanced techniques

• Q1: How to accelerate the sampling process?

• Advanced forward diffusion process

• Advanced reverse process

• Hybrid models & model distillation

• Q2: How to do high-resolution (conditional) generation?

• Conditional diffusion models

• Classifier(-free) guidance

• Cascaded generation

116
Applications (1):
Image Synthesis, Controllable Generation,
Text-to-Image

118
GLIDE
OpenAI

• A 64x64 base model + a 64x64 → 256x256 super-resolution model.

• Tried classifier-free and CLIP guidance. Classifier-free guidance works better than CLIP guidance.

Samples generated with classifier-free guidance (256x256)

Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021. 120
CLIP guidance
What is a CLIP model?

• Trained by contrastive cross-entropy loss:

• The optimal value of is

Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”, 2021.
121
Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021.
CLIP guidance
Replace the classifier in classifier guidance with a CLIP model

• Sample with a modified score:

CLIP model

Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”, 2021.
122
Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021.
GLIDE
OpenAI

• Fine-tune the model especially for inpainting: feed randomly occluded images with an additional mask channel as
the input.

Text-conditional image inpainting examples

Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021. 123
DALL·E 2
OpenAI

1k×1k text-to-image generation.

Outperforms DALL-E (an autoregressive transformer).
Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv 2022. 124
DALL·E 2
Model components
Prior: produces CLIP image embeddings conditioned on the caption.
Decoder: produces images conditioned on CLIP image embeddings and text.

126
Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv 2022.
DALL·E 2
Model components

Why conditional on CLIP image embeddings?

CLIP image embeddings capture high-level semantic meaning.

Latents in the decoder model take care of the rest.

The bipartite latent representation enables several text-guided image manipulation tasks. 126
DALL·E 2
Model components (1/2): prior model

Prior: produces CLIP image embeddings conditioned on the caption.

• Option 1. autoregressive prior: quantize image embedding to a seq. of discrete codes and predict them autoregressively.

• Option 2. diffusion prior: model the continuous image embedding by diffusion models conditioned on caption.

Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv 2022. 127
DALL·E 2
Model components (2/2): decoder model

Decoder: produces images conditioned on CLIP image embeddings (and text).

• Cascaded diffusion models: 1 base model (64x64), 2 super-resolution models (64x64 → 256x256, 256x256 → 1024x1024).

• Largest super-resolution model is trained on patches and takes full-res inputs at inference time.

• Classifier-free guidance & noise conditioning augmentation are important.

Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv 2022. 128
DALL·E 2
Bipartite latent representations

Bipartite latent representations

• $z_i$: CLIP image embeddings

• $x_T$: inversion of the DDIM sampler (the latents in the decoder model)
Near exact
reconstruction

Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv 2022. 129
DALL·E 2
Image variations

Fix the CLIP embedding


Decode using different decoder latents

130
DALL·E 2
Image interpolation

Interpolate the image CLIP embeddings $z_i$.

Use different decoder latents $x_T$ to get different interpolation trajectories.

Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv 2022. 131
DALL·E 2
Text Diffs

Change the image CLIP embedding towards the difference of the text CLIP embeddings of two prompts.
The decoder latent is kept constant.
132
Imagen
Google Research, Brain team

Input: text; Output: 1kx1k images

• An unprecedented degree of photorealism

• SOTA automatic scores & human ratings

• A deep level of language understanding

• Extremely simple

• no latent space, no quantization

A brain riding a rocketship heading towards the moon.

Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022. 133
Imagen
Google Research, Brain team

Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022. 134
Imagen
Google Research, Brain team

Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022. 135
Imagen
Google Research, Brain team

Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022. 136
Imagen
Google Research, Brain team

A cute hand-knitted koala wearing a sweater with 'CVPR' written on it.

Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022. 137
Imagen

Key modeling components:

• Cascaded diffusion models

• Classifier-free guidance and


dynamic thresholding.

• Frozen large pretrained language


models as text encoders. (T5-XXL)

Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022. 138
Imagen

Key observations:

• Beneficial to use text conditioning for all


super-res models.

• Noise conditioning augmentation weakens the information coming from the low-res model, so the super-res models need text conditioning as an extra source of information.

• Scaling text encoder is extremely efficient.

• More important than scaling diffusion


model size.

• Human raters prefer T5-XXL as the text encoder


over CLIP encoder on DrawBench.

Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022. 138
Imagen
Dynamic thresholding

• Large classifier-free guidance weights → better text alignment, worse image quality

Better sample quality

Better text alignment

Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022. 140
Imagen
Dynamic thresholding

• Large classifier-free guidance weights → better text alignment, worse image quality

• Hypothesis : at large guidance weight, the generated images are saturated due to the very
large gradient updates during sampling

• Solution – dynamic thresholding: adjusts the pixel values of samples at each sampling step to be
within a dynamic range computed over the statistics of the current samples.

Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022. 141
Imagen
Dynamic thresholding

• Large classifier-free guidance weights → better text alignment, worse image quality

• Hypothesis: at high guidance weight, the generated images are saturated due to the very large gradient updates during sampling

• Solution – dynamic thresholding: adjusts the pixel values of samples at each sampling step to be within a dynamic range computed over the statistics of the current samples (sketched below).

Static thresholding Dynamic thresholding

Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022. 142
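A hedged sketch of the dynamic thresholding step: clip the predicted clean image to a per-sample percentile of its absolute pixel values (only when that percentile exceeds 1) and rescale back into [-1, 1]. The 99.5th-percentile default is a commonly cited setting; treat it and the tensor layout as assumptions.

```python
# Dynamic thresholding applied to the predicted clean image x0 at each sampling step.
import torch

def dynamic_threshold(x0_pred, percentile=0.995):
    b = x0_pred.shape[0]
    flat = x0_pred.reshape(b, -1).abs()
    s = torch.quantile(flat, percentile, dim=1)              # per-sample percentile of |pixel values|
    s = torch.clamp(s, min=1.0).view(b, *([1] * (x0_pred.dim() - 1)))
    clipped = torch.maximum(torch.minimum(x0_pred, s), -s)   # clip to [-s, s]
    return clipped / s                                       # rescale back into [-1, 1]
```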
Imagen
DrawBench: new benchmark for text-to-image evaluations

• A set of 200 prompts to evaluate text-to-image models across multiple dimensions.

• E.g., the ability to faithfully render different colors, numbers of objects, spatial relations, text in the scene, unusual
interactions between objects.

• Contains complex prompts, e.g, long and intricate descriptions, rare words, misspelled prompts.

Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022. 143
Imagen
DrawBench: new benchmark for text-to-image evaluations

• A set of 200 prompts to evaluate text-to-image models across multiple dimensions.

• E.g., the ability to faithfully render different colors, numbers of objects, spatial relations, text in the scene, unusual
interactions between objects.

• Contains complex prompts, e.g, long and intricate descriptions, rare words, misspelled prompts.

Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022. 144
Imagen
Evaluations

Imagen achieves SOTA automatic evaluation scores on the COCO dataset. Imagen is preferred over recent work by human raters in sample quality & image-text alignment on DrawBench.

Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022. 145
Stable Diffusion
Latest & Publicly available text-to-image generation
To be discussed in detail in paper presentation
Stable Diffusion
Latest & Publicly available text-to-image generation

HW assignment: Use stable diffusion API to


generate ‘interesting’ image from text prompt.
All submissions will be rated for top 3!
Applications (2):
Image Editing, Image-to-Image,
Super-resolution, Segmentation

151
Diffusion Autoencoders
Learning semantic meaningful latent representations in diffusion models

To be discussed in detail in paper presentation


Preechakul et al., “Diffusion Autoencoders: Toward a Meaningful and Decodable Representation”, CVPR 2022. 146
Diffusion Autoencoders
Learning semantic meaningful latent representations in diffusion models

Changing the semantic latent $z_{sem}$.

Very similar to StyleGAN-based editing: $z_{sem}$ is a latent representation analogous to the W/W+ space of StyleGAN.

Preechakul et al., “Diffusion Autoencoders: Toward a Meaningful and Decodable Representation”, CVPR 2022. 147
Diffusion Autoencoders
Learning semantic meaningful latent representations in diffusion models

Preechakul et al., “Diffusion Autoencoders: Toward a Meaningful and Decodable Representation”, CVPR 2022. 148
Super-Resolution
Super-Resolution via Repeated Refinement (SR3)

Image super-resolution can be considered as training $p(x \mid y)$, where y is a low-resolution image and x is the corresponding high-resolution image.

Train a score model for x conditioned on y using $\mathbb{E}_{t, x_0, \epsilon}\big[\|\epsilon - \epsilon_\theta(x_t, y, t)\|^2\big]$.

The conditional score is simply a U-Net with $x_t$ and y (the low-resolution image) concatenated (see the sketch below).

Saharia et al., Image Super-Resolution via Iterative Refinement, 2021 152
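A minimal sketch of that conditioning, assuming RGB data and a U-Net `unet` built to accept the concatenated channels; the bilinear upsampling choice is an assumption for illustration.

```python
# SR3-style conditioning: upsample the low-resolution image y to the target size and
# concatenate it channel-wise with x_t before feeding the U-Net.
import torch
import torch.nn.functional as F

def conditional_eps(unet, xt, y_lowres, t):
    y_up = F.interpolate(y_lowres, size=xt.shape[-2:], mode="bilinear", align_corners=False)
    return unet(torch.cat([xt, y_up], dim=1), t)   # channel-wise concatenation of x_t and y
```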


Super-Resolution
Super-Resolution via Repeated Refinement (SR3)

Saharia et al., Image Super-Resolution via Iterative Refinement, 2021 153


Image-to-Image Translation
Palette: Image-to-Image Diffusion Models

Many image-to-image translation applications can be considered as training $p(x \mid y)$, where y is the input image.

For example, for colorization, x is a colored image and y is a gray-level image.

Train a score model for x conditioned on y using $\mathbb{E}_{t, x_0, \epsilon}\big[\|\epsilon - \epsilon_\theta(x_t, y, t)\|^2\big]$.

The conditional score is simply a U-Net with $x_t$ and y concatenated.

Saharia et al., Palette: Image-to-Image Diffusion Models, 2022 154


Image-to-Image Translation
Palette: Image-to-Image Diffusion Models

Saharia et al., Palette: Image-to-Image Diffusion Models, 2022 155


Conditional Generation
Iterative Latent Variable Refinement (ILVR)
To be discussed in detail in paper presentation

A simple technique to guide the generation process of an unconditional diffusion


model using a reference image.

Given the conditioning (reference) image y the generation process is modified to


pull the samples towards the reference image.

Low-pass filter $\phi_N$: downsampling and upsampling by a factor of N (sketched below).
Choi et al., ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models, ICCV 2021 156
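A hedged sketch of the ILVR refinement: after each denoising update, swap in the low-frequency content of the (equally diffused) reference image. Implementing $\phi_N$ with bicubic down/upsampling is an assumption about the filter choice.

```python
# ILVR: x'_{t-1} = phi_N(y_{t-1}) + (I - phi_N)(x_{t-1}), where phi_N is a low-pass filter.
import torch.nn.functional as F

def low_pass(x, N):
    down = F.interpolate(x, scale_factor=1.0 / N, mode="bicubic", align_corners=False)
    return F.interpolate(down, size=x.shape[-2:], mode="bicubic", align_corners=False)

def ilvr_refine(x_prev, y_prev, N):
    """Keep the model's high frequencies, but take the low frequencies from the reference."""
    return x_prev - low_pass(x_prev, N) + low_pass(y_prev, N)
```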
Conditional Generation
Iterative Latent Variable Refinement (ILVR)

157
Choi et al., ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models, ICCV 2021
Semantic Segmentation
Label-efficient semantic segmentation with diffusion models
Can we use representation learned from diffusion models for downstream applications such as semantic segmentation?

Only train this


Denoising Diffusion component
Model (U-Net)

Baranchuk et al., Label-Efficient Semantic Segmentation with Diffusion Models, ICLR 2022 158
Semantic Segmentation
Label-efficient semantic segmentation with diffusion models
The experimental results show that the proposed method outperforms Masked Autoencoders, GAN and VAE-based models.

Baranchuk et al., Label-Efficient Semantic Segmentation with Diffusion Models, ICLR 2022 159
Image Editing
SDEdit
Goal: Given a stroke painting with color, generate a photorealistic image
Key Idea:

- The latent distributions of stroke paintings and real images do not overlap.

- But once we apply forward diffusion to them, their distributions start to overlap, since both eventually become Gaussian noise.

Use a pre-trained unconditional diffusion model for real images (see the sketch below).

Meng et al., SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, ICLR 2022 160
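A minimal sketch of the SDEdit procedure under those assumptions: diffuse the stroke/guide image only part of the way (to an intermediate time t0) and then run the ordinary reverse process from there. `reverse_step` and `alpha_bars` are assumed from the earlier sketches; t0 is a hyperparameter trading realism against faithfulness.

```python
# SDEdit: partially noise the guide image, then denoise it with a pretrained
# unconditional diffusion model to project it onto the real-image manifold.
import torch

def sdedit(guide_img, reverse_step, alpha_bars, t0=400):
    abar = alpha_bars[t0]
    x = torch.sqrt(abar) * guide_img + torch.sqrt(1.0 - abar) * torch.randn_like(guide_img)
    for t in range(t0, 0, -1):            # denoise only from t0 down to 1, not from T
        x = reverse_step(x, t)
    return x
```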
Image Editing
SDEdit

Meng et al., SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, ICLR 2022 161
Video Synthesis, Medical Imaging,
3D Generation, Discrete State Models

166
Video Generation

Samples from a text-conditioned video diffusion model, conditioned on the string fireworks.
Ho et al., “Video Diffusion Models”, arXiv, 2022
Harvey et al., “Flexible Diffusion Modeling of Long Videos”, arXiv, 2022
Yang et al., “Diffusion Probabilistic Modeling for Video Generation”, arXiv, 2022
Höppe et al., “Diffusion Models for Video Prediction and Infilling”, arXiv, 2022
167
Voleti et al., “MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation”, arXiv, 2022
Video Generation
Video Generation Tasks:
• Unconditional Generation (Generate all frames)
• Future Prediction (Generate future from past frames)
• Past Prediction (Generate past from future frames)
• Interpolation (Generate intermediate frames)

Ho et al., “Video Diffusion Models”, arXiv, 2022


Harvey et al., “Flexible Diffusion Modeling of Long Videos”, arXiv, 2022
Yang et al., “Diffusion Probabilistic Modeling for Video Generation”, arXiv, 2022
Höppe et al., “Diffusion Models for Video Prediction and Infilling”, arXiv, 2022
167
Voleti et al., “MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation”, arXiv, 2022
Video Generation
Learn one model for everything:

• Architecture as one diffusion model over all frames concatenated.


• Mask frames to be predicted; provide conditioning frames; vary
applied masking/conditioning for different tasks during training.
• Use time position encodings to encode times.

167
(image from: Harvey et al., “Flexible Diffusion Modeling of Long Videos”, arXiv, 2022)
Video Generation
Architecture Details:
• Data is 4D (image height, image width, #frames, channels)
• Option (1): 3D Convolutions. Can be computationally expensive.
• Option (2): Spatial 2D Convolutions + Attention Layers along frame axis.

Additional Advantage:

Ignoring the attention layers, the


model can be trained additionally
on pure image data!

167
(image from: Harvey et al., “Flexible Diffusion Modeling of Long Videos”, arXiv, 2022)
Video Generation
Results
Long-term video generation in a hierarchical manner (1+ hour coherent video generation possible!):
• 1. Generate future frames sparsely, conditioning on frames far back.
• 2. Interpolate the in-between frames.

Test Data:

Generated:

(video from: Harvey et al., “Flexible Diffusion Modeling of Long Videos”, arXiv, 2022,
https://plai.cs.ubc.ca/2022/05/20/flexible-diffusion-modeling-of-long-videos/)
170
Solving Inverse Problems in Medical Imaging

Forward CT or MRI imaging process (simplified):

sparse-view CT undersampled MRI

(image from: Song et al., “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models”, ICLR, 2022)

Inverse Problem:
Reconstruct original image from sparse measurements.

Song et al., “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models”, ICLR, 2022 171
Solving Inverse Problems in Medical Imaging

High-level idea: Learn Generative Diffusion Model as “prior”; then guide synthesis conditioned on sparse observations:

(image from: Song et al., “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models”, ICLR, 2022)

Outperforms even fully-supervised methods.


Song et al., “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models”, ICLR, 2022 172
Solving Inverse Problems in Medical Imaging
Lots of Literature

High-level idea: Learn Generative Diffusion Model as “prior”; then guide synthesis conditioned on sparse observations:

• Song et al., “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models”, ICLR, 2022
• Chung and Ye, “Score-based diffusion models for accelerated MRI”, Medical Image Analysis, 2022
• Chung et al., “Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems through Stochastic Contraction”, CVPR, 2022
• Peng et al., “Towards performant and reliable undersampled MR reconstruction via diffusion model sampling”, arXiv, 2022
• Xie and Li, “Measurement-conditioned Denoising Diffusion Probabilistic Model for Under-sampled Medical Image Reconstruction”, arXiv, 2022
• Luo et al, “MRI Reconstruction via Data Driven Markov Chain with Joint Uncertainty Estimation”, arXiv, 2022
• …

(Song et al., “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models”, ICLR, 2022)

Song et al., “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models”, ICLR, 2022 173
3D Shape Generation

• Point clouds as 3D shape representation can be diffused easily and intuitively


• Denoiser implemented based on modern point cloud-processing networks (PointNets & Point-VoxelCNNs)

(image from: Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021)

Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021
Luo and Hu, “Diffusion Probabilistic Models for 3D Point Cloud Generation”, CVPR, 2021 174
3D Shape Generation

• Point clouds as 3D shape representation can be diffused easily and intuitively


• Denoiser implemented based on modern point cloud-processing networks (PointNets & Point-VoxelCNNs)

(video from: Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021,
https://alexzhou907.github.io/pvd)

Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021 175
3D Shape Generation
Shape Completion

• Can train conditional shape completion diffusion model (subset of points fixed to given conditioning points):

(video from: Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021,
https://alexzhou907.github.io/pvd)

Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021 176
3D Shape Generation
Shape Completion – Multimodality

(video from: Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021,
https://alexzhou907.github.io/pvd)

Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021 177
3D Shape Generation
Shape Completion – Multimodality – On Real Data

(video from: Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021,
https://alexzhou907.github.io/pvd)

Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021 178
Towards Discrete State Diffusion Models

• So far:
Continuous diffusion and denoising processes.

Data Noise

x0 … xt … xT

Fixed forward diffusion process: $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$

Reverse generative process: $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \sigma_t^2 I)$

But what if data is discrete? Categorical?


Continuous perturbations are not possible!
(Text, Pixel-wise Segmentation Labels,
Discrete Image Encodings, etc.)

179
Discrete State Diffusion Models

• Categorical diffusion: $q(x_t \mid x_{t-1}) = \mathrm{Cat}(x_t;\, p = x_{t-1} Q_t)$

• The reverse process can be parametrized as a categorical distribution.

$x_t$: one-hot state vector
$Q_t$: transition matrix, $[Q_t]_{ij} = q(x_t = j \mid x_{t-1} = i)$

(image from: Hoogeboom et al., “Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions”, NeurIPS, 2022)

Austin et al., “Structured Denoising Diffusion Models in Discrete State-Spaces”, NeurIPS, 2021
Hoogeboom et al., “Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions”, NeurIPS, 2022 180
Discrete State Diffusion Models

Options for the forward process:

• Uniform categorical diffusion: $Q_t = (1-\beta_t)\, I + \frac{\beta_t}{K}\, \mathbf{1}\mathbf{1}^\top$ (sketched below)

• Progressive masking out of data (generation is “de-masking”)

• Transitions tailored to ordinal data (e.g., discretized Gaussian)
(image from: Austin et al., “Structured Denoising Diffusion Models in Discrete State-Spaces”, NeurIPS, 2021)

Austin et al., “Structured Denoising Diffusion Models in Discrete State-Spaces”, NeurIPS, 2021 181
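A minimal sketch of the uniform option above, for one forward step on one-hot rows; K and beta_t are illustrative values and the sampling interface is an assumption.

```python
# Uniform categorical forward kernel: Q_t = (1 - beta_t) I + (beta_t / K) 11^T, and
# q(x_t | x_{t-1}) = Cat(x_t; p = x_{t-1} Q_t).
import torch
import torch.nn.functional as F

def uniform_transition_matrix(beta_t, K):
    return (1.0 - beta_t) * torch.eye(K) + (beta_t / K) * torch.ones(K, K)

def categorical_forward_step(x_onehot, beta_t):
    """x_onehot: [N, K] one-hot rows; returns one-hot samples of x_t."""
    K = x_onehot.shape[-1]
    probs = x_onehot @ uniform_transition_matrix(beta_t, K)       # row-wise Cat parameters
    idx = torch.multinomial(probs, num_samples=1).squeeze(-1)     # sample a category per row
    return F.one_hot(idx, K).float()
```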
Discrete State Diffusion Models

(image from: Austin et al., “Structured Denoising Diffusion Models in Discrete State-Spaces”, NeurIPS, 2021)

Austin et al., “Structured Denoising Diffusion Models in Discrete State-Spaces”, NeurIPS, 2021 182
Discrete State Diffusion Models
Modeling Categorical Image Pixel Values

Progressive denoising
starting from all-
masked state.

Progressive denoising
starting from random
uniform state.
(with discretized Gaussian
denoising model)

(image from: Austin et al., “Structured Denoising Diffusion Models in Discrete State-Spaces”, NeurIPS, 2021)

Austin et al., “Structured Denoising Diffusion Models in Discrete State-Spaces”, NeurIPS, 2021 183
Discrete State Diffusion Models
Modeling Discrete Image Encodings

Encoding images into latent space with discrete tokens, and


modeling discrete token distribution

Class-conditional model samples


Iterative generation

(images from: Chang et al., “MaskGIT: Masked Generative Image Transformer”, CVPR, 2022)

Chang et al., “MaskGIT: Masked Generative Image Transformer”, CVPR, 2022


184
Esser et al., “ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis”, NeurIPS, 2021
Discrete State Diffusion Models
Modeling Pixel-wise Segmentations

(image from: Hoogeboom et al., “Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions”, NeurIPS, 2022)

Hoogeboom et al., “Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions”, NeurIPS, 2022 185
Conclusions, Open Problems and Final Remarks

188
Summary: Denoising Diffusion Probabilistic Models
“Discrete-time” Diffusion Models

We started with denoising diffusion probabilistic models:

Forward diffusion process (fixed)

Data Noise

Reverse denoising process (generative)

We showed how the denoising model can be trained by predicting the noise injected in each diffused image: $\min_\theta\, \mathbb{E}_{t, x_0, \epsilon}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\big]$

189
Summary: Advanced Techniques
Acceleration, Guidance and beyond

In the third part, we discussed several advanced topics in diffusion models.

How can we accelerate the sample generation?


[Image credit: Ben Poole, Mohammad Norouzi]

Simple forward process slowly maps data to noise

Reverse process maps noise back to data with a denoising model

How to scale up diffusion models to high-resolution (conditional) generation?

• Cascaded models

• Guided diffusion models


191
Summary: Applications

We covered many successful applications of diffusion models:

• Image generation, text-to-image generation, controllable generation

• Image editing, image-to-image translation, super-resolution, segmentation, adversarial robustness

• Discrete models, 3D generation, medical imaging, video synthesis

192
Open Problems (1)
• Diffusion models are a special form of VAEs and continuous normalizing flows

• Why do diffusion models perform so much better than these models?

• How can we improve VAEs and normalizing flows with lessons learned from diffusion models?

• Sampling from diffusion models is still slow especially for interactive applications

• The best we can currently reach is 4-10 steps. How can we build one-step samplers?

• Do we need new diffusion processes?

• Diffusion models can be considered as latent variable models, but their latent space lacks semantics

• How can we do latent-space semantic manipulations in diffusion models?

193
Open Problems (2)
• How can diffusion models help with discriminative applications?

• Representation learning (high-level vs low-level)

• Uncertainty estimation

• Joint discriminator-generator training

• What are the best network architectures for diffusion models?

• Can we go beyond existing U-Nets?

• How can we feed the time input and other conditioning?

• How can we improve the sampling efficiency using better network designs?

194
Open Problems (3)
• How can we apply diffusion models to other data types?

• 3D data (e.g., distance functions, meshes, voxels, volumetric representations), video, text, graphs, etc.

• How should we change diffusion models for these modalities?

• Compositional and controllable generation

• How can we go beyond images and generate scenes?

• How can we have more fine-grained control in generation?

• Diffusion models for X

• Can we better solve applications that were previously addressed by GANs and other generative models?

• Which applications will benefit most from diffusion models?


195
Thanks!

https://cvpr2022-tutorial-diffusion-models.github.io/

@karsten_kreis @RuiqiGao @ArashVahdat

196
