Lecture 7-8: Diffusion Models
https://cvpr2022-tutorial-diffusion-models.github.io/
Deep Generative Learning
Learning to generate data
[Figure: a model is trained on data and then sampled to generate new data.]
The Landscape of Deep Generative Learning
Normalizing Flows · Autoregressive Models · Variational Autoencoders · Denoising Diffusion Models · Generative Adversarial Networks · Energy-based Models
Denoising Diffusion Models
Emerging as powerful generative models, outperforming GANs
“Diffusion Models Beat GANs on Image Synthesis”, Dhariwal & Nichol, OpenAI, 2021
“Cascaded Diffusion Models for High Fidelity Image Generation”, Ho et al., Google, 2021
Image Super-resolution
Successful applications
Saharia et al., Image Super-Resolution via Iterative Refinement, ICCV 2021
Text-to-Image Generation
DALL·E 2: “A group of teddy bears in suit in a corporate office celebrating the birthday of their friend. There is a pizza cake on the desk.” (Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, 2022)
Imagen: “a teddy bear on a skateboard in times square” (Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, 2022)
Text-to-Image Generation
Stable Diffusion
Q: What is a diffusion model?
Denoising Diffusion Models
Learning to generate by denoising
Data → Noise
Sohl-Dickstein et al., Deep Unsupervised Learning using Nonequilibrium Thermodynamics, ICML 2015
Ho et al., Denoising Diffusion Probabilistic Models, NeurIPS 2020
Song et al., Score-Based Generative Modeling through Stochastic Differential Equations, ICLR 2021
Forward Diffusion Process
Data x0 → x1 → x2 → x3 → x4 → … → xT Noise

Sample: $x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_{t-1}$, where $\epsilon_{t-1} \sim \mathcal{N}(0, I)$; the first term sets the mean of each step and $\beta_t$ its variance.
You will need to prove this in your assignment
Define the diffusion kernel: $q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t)\, I\big)$, where $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$.
The $\beta_t$ values (i.e., the noise schedule) are designed such that $\bar\alpha_T \to 0$ and $q(x_T \mid x_0) \approx \mathcal{N}(x_T; 0, I)$.
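To make this concrete, here is a minimal sketch of the noise schedule and the diffusion kernel in PyTorch (variable names and the linear-schedule endpoints are illustrative assumptions, not taken from these slides):

```python
# Minimal sketch: linear noise schedule and the DDPM diffusion kernel.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)       # noise schedule beta_t (assumed endpoints)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, noise=None):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    if noise is None:
        noise = torch.randn_like(x0)
    abar = alpha_bars[t].view(-1, 1, 1, 1)  # broadcast over a (B, C, H, W) batch
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

def q_step(x_prev, t):
    """One chain step: x_t = sqrt(1 - beta_t) x_{t-1} + sqrt(beta_t) eps."""
    eps = torch.randn_like(x_prev)
    return (1.0 - betas[t]).sqrt() * x_prev + betas[t].sqrt() * eps
```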
What happens to a distribution in the forward diffusion?
The diffused data distribution is $q(x_t) = \int q(x_0, x_t)\, dx_0 = \int q(x_0)\, q(x_t \mid x_0)\, dx_0$: the joint distribution factorizes into the input data distribution times the diffusion kernel.
Applying the diffusion kernel is a Gaussian convolution: q(x0) → q(x1) → q(x2) → q(x3) → … → q(xT)
We can sample $x_t \sim q(x_t)$ by first sampling $x_0 \sim q(x_0)$ and then sampling $x_t \sim q(x_t \mid x_0)$ (i.e., ancestral sampling).
Generative Learning by Denoising
Generation: sample $x_T \sim \mathcal{N}(x_T; 0, I)$, then iteratively sample $x_{t-1} \sim q(x_{t-1} \mid x_t)$.
Can we approximate $q(x_{t-1} \mid x_t)$? Yes, we can use a Normal distribution if $\beta_t$ is small in each forward diffusion step.
Reverse Denoising Process
Data x0 ← x1 ← x2 ← x3 ← x4 ← … ← xT Noise

$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\big)$, where $\mu_\theta$ is a trainable network (U-Net, denoising autoencoder).
How do we train? (summary version)
What is the loss function? (Ho et al., NeurIPS 2020)
Intuitively: during the forward process we add noise to the image; during the reverse process we try to predict that noise with a U-Net and then subtract it from the image to denoise it. This gives the simple objective $\min_\theta \mathbb{E}_{x_0, \epsilon, t}\big[\| \epsilon - \epsilon_\theta(x_t, t) \|^2\big]$.
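As a sketch, one training step of this objective might look as follows (reusing `q_sample` and `T` from the schedule sketch above; `model` is an assumed noise-prediction U-Net, not a specific implementation):

```python
import torch
import torch.nn.functional as F

def training_step(model, x0):
    """One step of the simple DDPM objective: E ||eps - eps_theta(x_t, t)||^2."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)  # random timestep per image
    noise = torch.randn_like(x0)
    x_t = q_sample(x0, t, noise)                     # diffuse via the kernel
    return F.mse_loss(model(x_t, t), noise)          # predict the injected noise
```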
Implementation Considerations
Network Architectures
Diffusion models often use U-Net architectures with ResNet blocks and self-attention layers to represent $\epsilon_\theta(x_t, t)$.
Time Representation
Timesteps are encoded (e.g., with sinusoidal positional embeddings) and passed through fully-connected layers.
Time features are fed to the residual blocks using either simple spatial addition or adaptive group normalization layers (see Dhariwal and Nichol, NeurIPS 2021).
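A common way to build these time features is a sinusoidal embedding followed by an MLP; a minimal sketch (dimensions are illustrative assumptions):

```python
import math
import torch
import torch.nn as nn

def timestep_embedding(t, dim):
    """Map integer timesteps t (shape (B,)) to sin/cos features of size dim."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = t.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

# Fully-connected layers on top of the embedding; each residual block then adds
# the result spatially or uses it to predict group-norm scale/shift parameters.
time_mlp = nn.Sequential(nn.Linear(128, 512), nn.SiLU(), nn.Linear(512, 512))
```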
Diffusion Parameters
Noise Schedule
Data x0 → x1 → … → xT Noise
Above, $\beta_t$ and $\sigma_t^2$ control the variance of the forward diffusion and reverse denoising processes respectively.
Often a linear schedule is used for $\beta_t$, and $\sigma_t^2$ is set equal to $\beta_t$: slowly increase the amount of added noise.
Kingma et al. (NeurIPS 2021) introduce a new parameterization of diffusion models using the signal-to-noise ratio (SNR), and show how to learn the noise schedule by minimizing the variance of the training objective.
We can also train $\sigma_t^2$ while training the diffusion model by minimizing the variational bound (Improved DDPM by Nichol and Dhariwal, ICML 2021), or after training the diffusion model (Analytic-DPM by Bao et al., ICLR 2022).
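With the schedule sketch from earlier, the SNR parameterization is a one-liner (a sketch; `alpha_bars` as defined above):

```python
# SNR(t) = abar_t / (1 - abar_t); a valid schedule makes it decrease in t.
snr = alpha_bars / (1.0 - alpha_bars)
assert (snr[1:] <= snr[:-1]).all()  # noise level grows monotonically
```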
What happens to an image in the forward diffusion process?
[Figure: Fourier transforms of a diffused image. For small t the noise perturbs mostly high frequencies (fine details); for large t low-frequency content (global structure) is destroyed as well.]
Content-Detail Tradeoff
Data x0 → x1 → x2 → x3 → x4 → … → xT Noise
Vahdat and Kautz, NVAE: A Deep Hierarchical Variational Autoencoder, NeurIPS 2020
Sønderby et al., Ladder Variational Autoencoders, NeurIPS 2016
Summary
Denoising Diffusion Probabilistic Models
- The diffusion process can be reversed if the variance of the Gaussian noise added at each step of the diffusion is small enough.
- To reverse the process, we train a U-Net that takes the current noisy image and the timestep as input and predicts the noise map.
- The training goal is that the predicted noise map at each step is unit Gaussian (note that in VAEs we also required the latent space to be unit Gaussian).
- During sampling/generation, subtract the predicted noise from the noisy image at time t to generate the image at time t-1.
See “Elucidating the Design Space of Diffusion-Based Generative Models” by Karras et al. for important design decisions, e.g.:
• Network architectures
• Objective weighting
To be presented in the class!
How do we obtain the “Score Function”?
Denoising Score Matching
Q: How to accelerate the sampling process?
What makes a good generative model?
The generative learning trilemma
[Figure: the generative learning trilemma triangle. Corners: High-Quality Samples, Fast Sampling, Mode Coverage/Diversity. Generative Adversarial Networks (GANs) pair high quality with fast sampling; likelihood-based models (Variational Autoencoders & Normalizing Flows) pair fast sampling with mode coverage; Denoising Diffusion Models pair high quality with mode coverage, but often require 1000s of network evaluations!]
Tackling the Generative Learning Trilemma with Denoising Diffusion GANs, ICLR 2022
How to accelerate diffusion models?
[Image credit: Ben Poole, Mohammad Norouzi]
Simple forward process slowly maps data to noise
Main Idea
The process is designed such that the model can be optimized by the same surrogate objective as the original diffusion model. Therefore, one can take a pretrained diffusion model but choose among more sampling procedures.
Intuitively, given the noisy x_t we first predict the corresponding clean image x0 and then use it to obtain a sample x_{t-1}.
- A different choice of σ results in a different generative process without re-training the model.
- When σ = 0 for all t, we have a deterministic generative process, with randomness coming only from the initial noise x_T.
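A minimal sketch of this deterministic (σ = 0) update, in the style of DDIM (reusing `alpha_bars` from the earlier schedule sketch; wrap calls in `torch.no_grad()` when sampling):

```python
import torch

def ddim_step(model, x_t, t, t_prev):
    """Deterministic sigma=0 step: predict x0 from x_t, then jump to t_prev."""
    abar_t, abar_prev = alpha_bars[t], alpha_bars[t_prev]
    eps = model(x_t, torch.full((x_t.shape[0],), t, device=x_t.device))
    x0_pred = (x_t - (1 - abar_t).sqrt() * eps) / abar_t.sqrt()  # predict clean image
    return abar_prev.sqrt() * x0_pred + (1 - abar_prev).sqrt() * eps
```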
Advanced reverse process
Approximate the reverse process with more complicated distributions
• Q: is the normal approximation of the reverse process still accurate when there are fewer diffusion time steps?
Advanced approximation of reverse process
The normal assumption on the denoising distribution holds only for small steps.
Xiao et al., “Tackling the Generative Learning Trilemma with Denoising Diffusion GANs”, ICLR 2022.
Advanced modeling
Latent space modeling & model distillation
• Can we do model distillation for fast sampling?
• Can we lift the diffusion model to a latent space that is faster to diffuse?
Progressive distillation
• Distill a deterministic DDIM sampler to the same model architecture.
• At each stage, a “student” model is learned to distill two adjacent sampling steps of the “teacher” model into one sampling step (see the sketch below).
• At the next stage, the “student” model from the previous stage serves as the new “teacher” model.
[Figure: one distillation stage.]
Salimans & Ho, “Progressive distillation for fast sampling of diffusion models”, ICLR 2022.
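A rough sketch of one distillation stage under these assumptions (the paper uses a different parameterization and loss weighting; this only conveys the two-teacher-steps-to-one-student-step structure, reusing `ddim_step` from above):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student, teacher, x_t, t, t_mid, t_prev):
    """Student's single step matches two consecutive teacher DDIM steps."""
    with torch.no_grad():
        x_mid = ddim_step(teacher, x_t, t, t_mid)          # teacher step 1
        target = ddim_step(teacher, x_mid, t_mid, t_prev)  # teacher step 2
    pred = ddim_step(student, x_t, t, t_prev)              # one student step
    return F.mse_loss(pred, target)
```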
Latent-space diffusion models
Variational autoencoder + score-based prior
[Figure: the encoder maps data to a latent space, forward diffusion runs in the latent space, generative denoising reverses it, and the decoder reconstructs the data.]
Main Idea
(1) The distribution of latent embeddings is close to a Normal distribution → simpler denoising, faster synthesis!
(3) Tailored autoencoders → more expressivity, application to any data type (graphs, text, 3D data, etc.)!
Q: How to do high-resolution conditional generation?
Impressive conditional diffusion models
Text-to-image generation
DALL·E 2: “a propaganda poster depicting a cat dressed as french emperor napoleon holding a piece of cheese” (Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv 2022)
Imagen: “A photo of a raccoon wearing an astronaut helmet, looking out of the window at night.” (Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022)
Impressive conditional diffusion models
Super-resolution & colorization
Super-resolution Colorization
Conditional diffusion models
Include the condition as an input to the reverse process.
Reverse process: $p_\theta(x_{0:T} \mid c) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t, c)$
Variational upper bound: the same simple objective as before, with the denoising network conditioned on c: $\min_\theta \mathbb{E}\big[\| \epsilon - \epsilon_\theta(x_t, c, t) \|^2\big]$
• Text conditioning: a single vector embedding is injected via spatial addition or adaptive group norm; a sequence of vector embeddings is injected via cross-attention.
Classifier guidance
Using the gradient of a trained classifier as guidance
By Bayes’ rule, $\nabla_{x_t} \log p(x_t \mid y) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(y \mid x_t)$: the unconditional score plus the gradient of a classifier.
- During inference/sampling, mix the gradients of the classifier with the predicted score function of the unconditional diffusion model.
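A minimal sketch of that mixing, assuming a noise-conditional `classifier` and a guidance scale `s` (both assumptions; the score relates to the noise prediction by score = -eps / sqrt(1 - abar_t), with `alpha_bars` from the earlier schedule sketch):

```python
import torch

def guided_eps(model, classifier, x_t, t, y, s=1.0):
    """Shift the model's noise prediction by the classifier gradient."""
    eps = model(x_t, t)
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = classifier(x_in, t).log_softmax(dim=-1)
        grad = torch.autograd.grad(log_probs[range(len(y)), y].sum(), x_in)[0]
    # Adding s * grad to the score subtracts sqrt(1 - abar_t) * s * grad from eps.
    return eps - (1 - alpha_bars[t]).sqrt().view(-1, 1, 1, 1) * s * grad
```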
Classifier guidance
Problems of classifier guidance
• At each step of denoising, the input to the classifier is a noisy image x_t, but classifiers are never trained on noisy images. So one needs to re-train the classifier on noisy images; existing pre-trained classifiers can’t be used.
• Most of the information in the input x is not relevant to predicting y, and as a result, taking the gradient of
the classifier w.r.t. its input can yield arbitrary (and even adversarial) directions in input space.
Classifier-free guidance
Get guidance by Bayes’ rule on conditional diffusion models: $\nabla_{x_t} \log p(y \mid x_t) = \nabla_{x_t} \log p(x_t \mid y) - \nabla_{x_t} \log p(x_t)$ (we proved this in classifier guidance).
Large guidance weight (𝜔) usually leads to better individual sample quality but less sample diversity.
Classifier guidance → classifier-free guidance:
✗ Need to train a separate “noise-robust” classifier + an unconditional diffusion model. → ✓ Train conditional & unconditional diffusion models jointly via drop-out.
✗ Gradient of the classifier w.r.t. the input yields arbitrary values. → ✓ All pixels in the input receive equally ‘good’ gradients.
Rather than constructing a generative model from a classifier, we construct a classifier from a generative model!
Most recent papers use classifier-free guidance! Very simple yet very powerful idea!
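At sampling time, classifier-free guidance mixes the conditional and unconditional noise predictions, $\hat\epsilon = (1+\omega)\,\epsilon_\theta(x_t, c) - \omega\,\epsilon_\theta(x_t)$. A minimal sketch (`null_cond`, the learned “no condition” embedding used during drop-out training, is an assumption):

```python
def cfg_eps(model, x_t, t, cond, null_cond, w=3.0):
    """Classifier-free guidance: (1 + w) * eps_cond - w * eps_uncond."""
    eps_cond = model(x_t, t, cond)         # conditional prediction
    eps_uncond = model(x_t, t, null_cond)  # unconditional prediction
    return (1 + w) * eps_cond - w * eps_uncond
```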
Cascaded generation
Pipeline
Super-Resolution Diffusion Models
Similar cascaded/multi-resolution image generation also exists in GANs (BigGAN & StyleGAN).
Cascaded Diffusion Models outperform BigGAN in FID and IS, and VQ-VAE-2 in Classification Accuracy Score.
Ho et al., “Cascaded Diffusion Models for High Fidelity Image Generation”, 2021.
Noise conditioning augmentation
Reduce compounding error
Problem:
• During training, super-resolution models are trained on original low-res images from the dataset.
• During inference, these low-res images are generated by the class-conditioned diffusion model, and they have artifacts and poorer quality than the original low-res images used for training → train/inference mismatch.
Solution:
• During training, add varying amounts of Gaussian noise (or blurring by a Gaussian kernel) to the low-res images.
• During inference, sweep over the optimal amount of noise added to the low-res images.
• BSR-degradation process: applies JPEG compression noise, camera sensor noise, different image interpolations for downsampling, Gaussian blur kernels and Gaussian noise in a random order to an image.
Ho et al., “Cascaded Diffusion Models for High Fidelity Image Generation”, 2021.
Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021.
Summary
Questions to address with advanced techniques
• Classifier(-free) guidance
• Cascaded generation
Applications (1):
Image Synthesis, Controllable Generation,
Text-to-Image
GLIDE
OpenAI
• Tried classifier-free and CLIP guidance. Classifier-free guidance works better than CLIP guidance.
Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021.
CLIP guidance
What is a CLIP model? CLIP trains an image encoder and a text encoder jointly with a contrastive objective, so that matching image-text pairs have similar embeddings.
Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”, 2021.
Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021.
CLIP guidance
Replace the classifier in classifier guidance with a CLIP model: guide sampling with the gradient of the image-text similarity, $\nabla_{x_t}\big(f(x_t) \cdot g(c)\big)$, where f and g are the CLIP image and text encoders.
Radford et al., “Learning Transferable Visual Models From Natural Language Supervision”, 2021.
Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021.
GLIDE
OpenAI
• Fine-tune the model especially for inpainting: feed randomly occluded images with an additional mask channel as
the input.
Nichol et al., “GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models”, 2021.
DALL·E 2
OpenAI
Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv 2022.
DALL·E 2
Model components
The bipartite latent representation enables several text-guided image manipulation tasks.
DALL·E 2
Model components (1/2): prior model
• Option 1. autoregressive prior: quantize the image embedding into a sequence of discrete codes and predict them autoregressively.
• Option 2. diffusion prior: model the continuous image embedding by diffusion models conditioned on caption.
Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv 2022.
DALL·E 2
Model components (2/2): decoder model
• Cascaded diffusion models: 1 base model (64x64), 2 super-resolution models (64x64 → 256x256, 256x256 → 1024x1024).
• Largest super-resolution model is trained on patches and takes full-res inputs at inference time.
Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv 2022.
DALL·E 2
Bipartite latent representations
Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv 2022.
DALL·E 2
Image variations
DALL·E 2
Image interpolation
Ramesh et al., “Hierarchical Text-Conditional Image Generation with CLIP Latents”, arXiv 2022.
DALL·E 2
Text Diffs
Change the image CLIP embedding towards the difference of the text CLIP embeddings of two prompts.
The decoder latent is kept constant.
Imagen
Google Research, Brain team
• Extremely simple pipeline: a frozen pretrained text encoder, a 64×64 base diffusion model, and two super-resolution diffusion models (64×64 → 256×256 → 1024×1024).
Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022.
Imagen
Key observations: scaling the size of the frozen text encoder improves sample quality and image-text alignment more than scaling the diffusion U-Net.
Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022.
Imagen
Dynamic thresholding
• Large classifier-free guidance weights → better text alignment, worse image quality
• Hypothesis: at large guidance weight, the generated images are saturated due to the very large gradient updates during sampling.
• Solution (dynamic thresholding): adjust the pixel values of samples at each sampling step to be within a dynamic range computed over the statistics of the current samples (see the sketch below).
Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022.
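A minimal sketch of dynamic thresholding as described above (the percentile `p` is an assumed hyperparameter; images are taken to live in [-1, 1]):

```python
import torch

def dynamic_threshold(x0_pred, p=0.995):
    """Clip predicted x0 to a per-sample percentile s, then rescale by s."""
    s = torch.quantile(x0_pred.abs().flatten(1), p, dim=1)
    s = torch.clamp(s, min=1.0).view(-1, 1, 1, 1)  # never shrink below [-1, 1]
    return x0_pred.clamp(-s, s) / s                # clip and rescale
```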
Imagen
DrawBench: new benchmark for text-to-image evaluations
• E.g., the ability to faithfully render different colors, numbers of objects, spatial relations, text in the scene, unusual
interactions between objects.
• Contains complex prompts, e.g., long and intricate descriptions, rare words, misspelled prompts.
Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022.
Imagen
Evaluations
Imagen achieved SOTA automatic evaluation scores on the COCO dataset, and is preferred over recent work by human raters in sample quality & image-text alignment on DrawBench.
Saharia et al., “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding”, arXiv 2022.
Stable Diffusion
Latest & Publicly available text-to-image generation
To be discussed in detail in paper presentation
Diffusion Autoencoders
Learning semantic meaningful latent representations in diffusion models
Very similar to StyleGAN-based editing; z_sem is a latent representation analogous to StyleGAN’s W/W+ space.
Preechakul et al., “Diffusion Autoencoders: Toward a Meaningful and Decodable Representation”, CVPR 2022.
Super-Resolution
Super-Resolution via Repeated Refinement (SR3)
Image super-resolution can be considered as training $p_\theta(x \mid y)$, where y is a low-resolution image and x is the corresponding high-resolution image.
The conditional score is simply a U-Net with x_t and y (the low-resolution image) concatenated.
Many image-to-image translation applications can be considered as training $p_\theta(x \mid y)$, where y is the input image.
Conditional Generation
Iterative Latent Variable Refinement (ILVR)
Key idea: at each denoising step, replace the low-frequency content of the sample with that of the (noised) reference image, using a low-pass filter φ_N built from downsampling and upsampling by a factor of N (a sketch follows the citation below).
Choi et al., ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models, ICCV 2021
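A rough sketch of the refinement under these assumptions (reusing `q_sample` from the earlier kernel sketch; `phi_N` is the described low-pass filter, with illustrative interpolation choices):

```python
import torch.nn.functional as F

def phi_N(x, N):
    """Low-pass filter: downsample then upsample by a factor of N."""
    down = F.interpolate(x, scale_factor=1.0 / N, mode="bicubic")
    return F.interpolate(down, scale_factor=float(N), mode="bicubic")

def ilvr_refine(x_prev, y, t_prev, N=8):
    """Replace low-frequency content of the proposal with the reference's."""
    y_t = q_sample(y, t_prev)  # reference diffused to the same noise level
    return x_prev - phi_N(x_prev, N) + phi_N(y_t, N)
```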
Semantic Segmentation
Label-efficient semantic segmentation with diffusion models
Can we use representation learned from diffusion models for downstream applications such as semantic segmentation?
The experimental results show that the proposed method outperforms Masked Autoencoders and GAN- and VAE-based baselines.
Baranchuk et al., Label-Efficient Semantic Segmentation with Diffusion Models, ICLR 2022
Image Editing
SDEdit
Goal: given a stroke painting with color, generate a photorealistic image.
Key Idea: diffuse the input partway with the forward process, then denoise it with the reverse process; this projects the input onto the manifold of realistic images while preserving its overall structure.
Meng et al., SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations, ICLR 2022
Applications (2):
Video Synthesis, Medical Imaging,
3D Generation, Discrete State Models
Video Generation
Samples from a text-conditioned video diffusion model, conditioned on the string “fireworks”.
Ho et al., “Video Diffusion Models”, arXiv, 2022
Harvey et al., “Flexible Diffusion Modeling of Long Videos”, arXiv, 2022
Yang et al., “Diffusion Probabilistic Modeling for Video Generation”, arXiv, 2022
Höppe et al., “Diffusion Models for Video Prediction and Infilling”, arXiv, 2022
Voleti et al., “MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation”, arXiv, 2022
Video Generation
Video Generation Tasks:
• Unconditional Generation (Generate all frames)
• Future Prediction (Generate future from past frames)
• Past Prediction (Generate past from future frames)
• Interpolation (Generate intermediate frames)
(image from: Harvey et al., “Flexible Diffusion Modeling of Long Videos”, arXiv, 2022)
Video Generation
Architecture Details:
• Data is 4D (image height, image width, #frames, channels).
• Option (1): 3D convolutions. Can be computationally expensive.
• Option (2): spatial 2D convolutions + attention layers along the frame axis (a sketch follows this slide).
Additional Advantage:
(image from: Harvey et al., “Flexible Diffusion Modeling of Long Videos”, arXiv, 2022)
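A minimal sketch of option (2), with per-frame 2D convolutions and attention along the frame axis (module sizes are illustrative assumptions, not the cited papers’ exact architectures):

```python
import torch
import torch.nn as nn

class SpatialConvTemporalAttn(nn.Module):
    """One block: 2D conv per frame, then self-attention across frames."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # spatial 2D convolution applied independently to each frame
        y = self.conv(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        y = y.reshape(b, t, c, h, w)
        # attention along the frame axis, independently per pixel location
        y = y.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        y, _ = self.attn(y, y, y)
        return y.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
```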
Video Generation
Results
Long-term video generation in a hierarchical manner:
• 1. Generate future frames sparsely, conditioning on frames far back.
• 2. Interpolate the in-between frames.
1+ hour coherent video generation possible!
[Figure: test-data frames vs. generated frames.]
(video from: Harvey et al., “Flexible Diffusion Modeling of Long Videos”, arXiv, 2022,
https://plai.cs.ubc.ca/2022/05/20/flexible-diffusion-modeling-of-long-videos/)
Solving Inverse Problems in Medical Imaging
(image from: Song et al., “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models”, ICLR, 2022)
Inverse Problem:
Reconstruct original image from sparse measurements.
Song et al., “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models”, ICLR, 2022
Solving Inverse Problems in Medical Imaging
High-level idea: Learn Generative Diffusion Model as “prior”; then guide synthesis conditioned on sparse observations:
(image from: Song et al., “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models”, ICLR, 2022)
• Song et al., “Solving Inverse Problems in Medical Imaging with Score-Based Generative Models”, ICLR, 2022
• Chung and Ye, “Score-based diffusion models for accelerated MRI”, Medical Image Analysis, 2022
• Chung et al., “Come-Closer-Diffuse-Faster: Accelerating Conditional Diffusion Models for Inverse Problems through Stochastic Contraction”, CVPR, 2022
• Peng et al., “Towards performant and reliable undersampled MR reconstruction via diffusion model sampling”, arXiv, 2022
• Xie and Li, “Measurement-conditioned Denoising Diffusion Probabilistic Model for Under-sampled Medical Image Reconstruction”, arXiv, 2022
• Luo et al, “MRI Reconstruction via Data Driven Markov Chain with Joint Uncertainty Estimation”, arXiv, 2022
• …
3D Shape Generation
(image from: Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021)
Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021
Luo and Hu, “Diffusion Probabilistic Models for 3D Point Cloud Generation”, CVPR, 2021
3D Shape Generation
(video from: Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021,
https://alexzhou907.github.io/pvd)
3D Shape Generation
Shape Completion
• Can train conditional shape completion diffusion model (subset of points fixed to given conditioning points):
(video from: Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021,
https://alexzhou907.github.io/pvd)
3D Shape Generation
Shape Completion – Multimodality
(video from: Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021,
https://alexzhou907.github.io/pvd)
3D Shape Generation
Shape Completion – Multimodality – On Real Data
(video from: Zhou et al., “3D Shape Generation and Completion through Point-Voxel Diffusion”, ICCV, 2021,
https://alexzhou907.github.io/pvd)
Towards Discrete State Diffusion Models
• So far: continuous diffusion and denoising processes.
Data x0 → … → xt → … → xT Noise
Fixed forward diffusion process: $q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big)$
Discrete State Diffusion Models
(image from: Hoogeboom et al., “Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions”, NeurIPS, 2021)
Austin et al., “Structured Denoising Diffusion Models in Discrete State-Spaces”, NeurIPS, 2021
Hoogeboom et al., “Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions”, NeurIPS, 2021
Discrete State Diffusion Models
(image from: Austin et al., “Structured Denoising Diffusion Models in Discrete State-Spaces”, NeurIPS, 2021)
Austin et al., “Structured Denoising Diffusion Models in Discrete State-Spaces”, NeurIPS, 2021
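As a minimal sketch, one forward step of a discrete-state diffusion with a uniform transition matrix (in the spirit of D3PM; names and the interface are illustrative assumptions) resamples each token with probability beta_t:

```python
import torch

def q_step_discrete(x_prev, beta_t, K):
    """x_prev: integer categories in [0, K). Resample uniformly w.p. beta_t."""
    resample = torch.rand_like(x_prev, dtype=torch.float) < beta_t
    uniform = torch.randint(0, K, x_prev.shape, device=x_prev.device)
    return torch.where(resample, uniform, x_prev)
```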
Discrete State Diffusion Models
Modeling Categorical Image Pixel Values
[Figure: progressive denoising starting from an all-masked state, and progressive denoising starting from a random uniform state (with a discretized Gaussian denoising model).]
(image from: Austin et al., “Structured Denoising Diffusion Models in Discrete State-Spaces”, NeurIPS, 2021)
Austin et al., “Structured Denoising Diffusion Models in Discrete State-Spaces”, NeurIPS, 2021
Discrete State Diffusion Models
Modeling Discrete Image Encodings
(images from: Chang et al., “MaskGIT: Masked Generative Image Transformer”, CVPR, 2022)
(image from: Hoogeboom et al., “Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions”, NeurIPS, 2021)
Hoogeboom et al., “Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions”, NeurIPS, 2021
Conclusions, Open Problems and Final Remarks
Summary: Denoising Diffusion Probabilistic Models
“Discrete-time” Diffusion Models
Data x0 → x1 → … → xT Noise
We showed how the denoising model can be trained by predicting the noise injected into each diffused image: $\min_\theta \mathbb{E}_{x_0, \epsilon, t}\big[\| \epsilon - \epsilon_\theta(x_t, t) \|^2\big]$
Summary: Advanced Techniques
Acceleration, Guidance and beyond
• Cascaded models
Open Problems (1)
• Diffusion models are a special form of VAEs and continuous normalizing flows
• How can we improve VAEs and normalizing flows with lessons learned from diffusion models?
• Sampling from diffusion models is still slow especially for interactive applications
• The best we could reach so far is 4-10 steps. How can we get one-step samplers?
• Diffusion models can be considered as latent variable models, but their latent space lacks semantics
Open Problems (2)
• How can diffusion models help with discriminative applications?
• Uncertainty estimation
• How can we improve the sampling efficiency using better network designs?
Open Problems (3)
• How can we apply diffusion models to other data types?
• 3D data (e.g., distance functions, meshes, voxels, volumetric representations), video, text, graphs, etc.
• Can we better solve applications that were previously addressed by GANs and other generative models?
https://cvpr2022-tutorial-diffusion-models.github.io/