Application DPM
$x_0, x_1, x_2, \dots, x_N \approx \mathcal{N}(0, I)$
Demo Images from Song et al. Score-based generative modeling through stochastic differential equations, ICLR 2021.
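Optimal covariance expression:
$$\mu_n^*(x_n) = \frac{1}{\sqrt{\alpha_n}} \left( x_n + \beta_n \nabla \log q_n(x_n) \right),$$
$$\sigma_n^*(x_n)^2 = \frac{\bar\beta_{n-1}}{\bar\beta_n} \beta_n + \frac{\beta_n^2}{\bar\beta_n \alpha_n} \left( \mathbb{E}_{q(x_0|x_n)}[\epsilon_n^2] - \mathbb{E}_{q(x_0|x_n)}[\epsilon_n]^2 \right).$$
Slide annotations on these formulas: "Predict noise" and "constant".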
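Data $x_0$, Gaussian noise $\epsilon_n$, noisy data $x_n$.
Noise prediction network $\hat\epsilon_n(x_n)$, trained by minimizing the mean squared error: $\min_{\hat\epsilon_n} \mathbb{E} \| \hat\epsilon_n(x_n) - \epsilon_n \|_2^2$.
Squared-noise prediction network $h_n(x_n)$, trained by minimizing the mean squared error against the squared noise $\epsilon_n^2$: $\min_{h_n} \mathbb{E} \| h_n(x_n) - \epsilon_n^2 \|_2^2$.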
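Optimal covariance estimate based on the predicted squared noise, obtained by substituting $h_n(x_n)$ for $\mathbb{E}_{q(x_0|x_n)}[\epsilon_n^2]$ and $\hat\epsilon_n(x_n)$ for $\mathbb{E}_{q(x_0|x_n)}[\epsilon_n]$:
$$\hat\sigma_n(x_n)^2 = \frac{\bar\beta_{n-1}}{\bar\beta_n} \beta_n + \frac{\beta_n^2}{\bar\beta_n \alpha_n} \left( h_n(x_n) - \hat\epsilon_n(x_n)^2 \right).$$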
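A minimal numerical sketch of this plug-in estimate, per coordinate. It assumes the DDPM convention $\alpha_n = 1 - \beta_n$ and $\bar\beta_n = 1 - \prod_{i \le n} \alpha_i$, which the slide does not state. The names `bar_betas` and `sn_variance` are hypothetical, and clipping at zero is an added safeguard, not part of the formula.

```python
def bar_betas(betas):
    """Return [bar_beta_0, ..., bar_beta_N].

    Assumption (DDPM convention): alpha_n = 1 - beta_n and
    bar_beta_n = 1 - prod_{i<=n} alpha_i, with bar_beta_0 = 0.
    """
    out, prod = [0.0], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(1.0 - prod)
    return out


def sn_variance(n, betas, h_n, eps_hat_n):
    """Plug-in variance for step n (1-indexed), one coordinate.

    h_n stands for the predicted squared noise h_n(x_n),
    eps_hat_n for the predicted noise eps_hat_n(x_n).
    """
    bb = bar_betas(betas)
    beta_n = betas[n - 1]
    alpha_n = 1.0 - beta_n
    const = bb[n - 1] / bb[n] * beta_n          # data-independent term
    scale = beta_n ** 2 / (bb[n] * alpha_n)
    # h_n - eps_hat_n^2 can turn negative through network error; clip it.
    return const + scale * max(h_n - eps_hat_n ** 2, 0.0)
```

For example, `sn_variance(2, [0.1, 0.2], 1.0, 0.5)` adds the constant term to the scaled difference `1.0 - 0.5 ** 2`.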
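Optimal covariance expression:
Generally, the mean
$$\mu_n(x_n) = \frac{1}{\sqrt{\alpha_n}} \left( x_n - \beta_n \frac{1}{\sqrt{\bar\beta_n}} \hat\epsilon_n(x_n) \right)$$
is not optimal, due to approximation or optimization error in $\hat\epsilon_n(x_n)$.
The optimal solution to $\min_{\sigma_n(\cdot)^2} \mathrm{KL}(q(x_{0:N}) \,\|\, p(x_{0:N}))$ with an imperfect mean is
$$\sigma_n^*(x_n)^2 = \frac{\bar\beta_{n-1}}{\bar\beta_n} \beta_n + \frac{\beta_n^2}{\bar\beta_n \alpha_n} \mathbb{E}_{q(x_0|x_n)} \left[ (\epsilon_n - \hat\epsilon_n(x_n))^2 \right].$$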
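Data $x_0$, Gaussian noise $\epsilon_n$, noisy data $x_n$.
Noise prediction network $\hat\epsilon_n(x_n)$, trained by minimizing the mean squared error: $\min_{\hat\epsilon_n} \mathbb{E} \| \hat\epsilon_n(x_n) - \epsilon_n \|_2^2$.
Noise-residual prediction network $g_n(x_n)$, trained by minimizing the mean squared error against the squared noise residual $(\hat\epsilon_n(x_n) - \epsilon_n)^2$: $\min_{g_n} \mathbb{E} \| g_n(x_n) - (\hat\epsilon_n(x_n) - \epsilon_n)^2 \|_2^2$.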
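Optimal covariance estimate based on the predicted noise residual, obtained by substituting $g_n(x_n)$ for $\mathbb{E}_{q(x_0|x_n)}[(\epsilon_n - \hat\epsilon_n(x_n))^2]$:
$$\hat\sigma_n(x_n)^2 = \frac{\bar\beta_{n-1}}{\bar\beta_n} \beta_n + \frac{\beta_n^2}{\bar\beta_n \alpha_n} g_n(x_n).$$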
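A small sketch of why the residual-based estimate tolerates an imperfect mean: the mean squared residual equals the noise variance plus the squared bias of $\hat\epsilon_n$, so a biased prediction inflates the variance accordingly. The function names are hypothetical, $\alpha_n = 1 - \beta_n$ is assumed, and the sample list stands in for draws of $\epsilon_n$ given $x_n$.

```python
def npr_variance(beta_n, bar_beta_prev, bar_beta_n, g_n):
    """Variance estimate from a predicted squared residual g_n (one coordinate).

    Assumption: alpha_n = 1 - beta_n (DDPM convention).
    """
    alpha_n = 1.0 - beta_n
    const = bar_beta_prev / bar_beta_n * beta_n
    scale = beta_n ** 2 / (bar_beta_n * alpha_n)
    # A network output may dip below zero; clip as a safeguard.
    return const + scale * max(g_n, 0.0)


def mean_squared_residual(eps_samples, eps_hat):
    """Empirical E[(eps - eps_hat)^2], the quantity g_n is trained to predict."""
    return sum((e - eps_hat) ** 2 for e in eps_samples) / len(eps_samples)


eps_samples = [-1.0, 0.0, 1.0, 2.0]                    # toy draws, mean 0.5
exact = mean_squared_residual(eps_samples, 0.5)        # variance only
biased = mean_squared_residual(eps_samples, 0.0)       # variance + bias^2
```

Here `biased - exact` equals the squared bias `0.5 ** 2`, and feeding the larger value to `npr_variance` yields a larger variance.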
Song et al. Score-based generative modeling through stochastic differential equations, ICLR 2021.
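• $q(x_0, \dots, x_N)$ becomes
• $d\mathbf{x} = f(t)\mathbf{x}\,dt + g(t)\,d\mathbf{w} \;\leftrightarrow\; d\mathbf{x} = [f(t)\mathbf{x} - g(t)^2 \nabla \log q_t(\mathbf{x})]\,dt + g(t)\,d\bar{\mathbf{w}}$
• $p(x_0, \dots, x_N)$ becomes
• $d\mathbf{x} = [f(t)\mathbf{x} - g(t)^2 \mathbf{s}_t(\mathbf{x})]\,dt + g(t)\,d\bar{\mathbf{w}}$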
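A minimal sketch of sampling from the reverse-time SDE with Euler-Maruyama steps, in one dimension. Everything concrete here is an assumption for illustration: a constant-$\beta$ variance-preserving SDE ($f(t) = -\beta/2$, $g(t) = \sqrt{\beta}$), Gaussian data $x_0 \sim \mathcal{N}(m, s^2)$, and the exact score of $q_t$ used in place of a learned $\mathbf{s}_t$.

```python
import math
import random


def reverse_sde_sample(m=2.0, s=0.5, beta=2.0, T=3.0, steps=300, rng=None):
    """Integrate dx = [f(t) x - g(t)^2 score] dt + g(t) dw_bar from t = T down to 0.

    Toy setting (assumed): data N(m, s^2), f(t) = -beta / 2, g(t)^2 = beta,
    so q_t = N(m * exp(-beta t / 2), s^2 exp(-beta t) + 1 - exp(-beta t)).
    """
    if rng is None:
        rng = random.Random(0)
    dt = T / steps
    x = rng.gauss(0.0, 1.0)                      # x_T is approximately N(0, 1)
    for i in range(steps):
        t = T - i * dt
        decay = math.exp(-beta * t)
        mean_t = m * math.sqrt(decay)
        var_t = s * s * decay + (1.0 - decay)
        score = -(x - mean_t) / var_t            # exact grad log q_t(x)
        drift = -0.5 * beta * x - beta * score   # f(t) x - g(t)^2 score
        # Step backwards in time: subtract drift * dt, add fresh noise.
        x = x - drift * dt + math.sqrt(beta * dt) * rng.gauss(0.0, 1.0)
    return x
```

Drawing many samples with a shared `rng` should give an empirical mean near `m` and standard deviation near `s`, up to discretization and Monte Carlo error.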
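Conditional DPM:
Discrete time: $p(x_{n-1} | x_n, c) = \mathcal{N}(\mu_n(x_n | c), \Sigma_n(x_n))$, $\mu_n(x_n | c) = \frac{1}{\sqrt{\alpha_n}} \left( x_n + \beta_n s_n(x_n | c) \right)$
Continuous time: $d\mathbf{x} = [f(t)\mathbf{x} - g(t)^2 \mathbf{s}_t(\mathbf{x} | c)]\,dt + g(t)\,d\bar{\mathbf{w}}$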
$\nabla \log q_t(c|x) = \nabla \log q_t(x|c) - \nabla \log q_t(x)$ (requires an extra discriminative model)
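The identity relating the three scores follows from Bayes' rule. A short derivation, where $q(c)$ denotes the marginal of the condition (a symbol not on the slide) and all gradients are taken with respect to $x$:

```latex
\begin{aligned}
q_t(c \mid x) &= \frac{q_t(x \mid c)\, q(c)}{q_t(x)}, \\
\log q_t(c \mid x) &= \log q_t(x \mid c) + \log q(c) - \log q_t(x), \\
\nabla_x \log q_t(c \mid x) &= \nabla_x \log q_t(x \mid c) - \nabla_x \log q_t(x).
\end{aligned}
```

The term $\log q(c)$ drops out in the last line because it does not depend on $x$.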
Learn the conditional and unconditional models together
Introduce a token $\emptyset$ and use $s_t(x | \emptyset)$ to represent the unconditional case
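Conditional score-based SDE:
$d\mathbf{x} = \left[ f(t)\mathbf{x} - g(t)^2 \left( s_t(x | \emptyset) + \lambda \left( s_t(x | c) - s_t(x | \emptyset) \right) \right) \right] dt + g(t)\,d\bar{\mathbf{w}}$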
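A minimal sketch of how the guided score and the resulting reverse drift combine, with scores passed in as plain lists. The function names are hypothetical; in practice both scores would come from one network evaluated with $c$ and with $\emptyset$.

```python
def guided_score(s_uncond, s_cond, lam):
    """Return s(x|null) + lam * (s(x|c) - s(x|null)), coordinate-wise."""
    return [u + lam * (c - u) for u, c in zip(s_uncond, s_cond)]


def reverse_drift(x, f_t, g_t, s_uncond, s_cond, lam):
    """Drift of the conditional SDE: f(t) x - g(t)^2 * guided score."""
    s = guided_score(s_uncond, s_cond, lam)
    return [f_t * xi - g_t ** 2 * si for xi, si in zip(x, s)]
```

With `lam = 0` the drift uses only the unconditional score, with `lam = 1` only the conditional one, and `lam > 1` extrapolates beyond the conditional score.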
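Training:
$$\min_{s_n(\cdot)} \underbrace{\mathbb{E}_c \mathbb{E}_n \bar\beta_n \mathbb{E}_{q_n(x_n|c)} \| s_n(x_n|c) - \nabla \log q_n(x_n|c) \|^2}_{\text{conditional loss}} + \lambda \underbrace{\mathbb{E}_n \bar\beta_n \mathbb{E}_{q_n(x_n)} \| s_n(x_n|\emptyset) - \nabla \log q_n(x_n) \|^2}_{\text{unconditional loss}}$$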
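A rough Monte Carlo sketch of this joint objective for scalar data. Since the true scores are unknown, it swaps in the standard denoising score matching target $-\epsilon_n / \sqrt{\bar\beta_n}$, which changes the objective only by a constant. It also assumes the forward kernel $x_n = \sqrt{1 - \bar\beta_n}\, x_0 + \sqrt{\bar\beta_n}\, \epsilon_n$. `NULL`, `weighted_dsm_term` and `joint_loss` are hypothetical names.

```python
import math
import random

NULL = None  # stands in for the token used for the unconditional case


def weighted_dsm_term(score_fn, x0, c, bar_beta_n, rng):
    """bar_beta_n * (s_n(x_n|c) - target)^2 for one noisy sample.

    Assumption: x_n = sqrt(1 - bar_beta_n) x0 + sqrt(bar_beta_n) eps, so the
    denoising target grad log q(x_n | x0) is -eps / sqrt(bar_beta_n).
    """
    eps = rng.gauss(0.0, 1.0)
    x_n = math.sqrt(1.0 - bar_beta_n) * x0 + math.sqrt(bar_beta_n) * eps
    target = -eps / math.sqrt(bar_beta_n)
    return bar_beta_n * (score_fn(x_n, c) - target) ** 2


def joint_loss(score_fn, pairs, bar_betas, lam, rng):
    """Average of conditional loss + lam * unconditional loss over (x0, c) pairs."""
    total = 0.0
    for x0, c in pairs:
        total += weighted_dsm_term(score_fn, x0, c, rng.choice(bar_betas), rng)
        total += lam * weighted_dsm_term(score_fn, x0, NULL, rng.choice(bar_betas), rng)
    return total / len(pairs)
```

A single `score_fn(x_n, c)` serves both terms, which is what allows the conditional and unconditional models to share parameters.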
By Fan Bao, Tsinghua University
Saharia et al. Image Super-Resolution via Iterative Refinement
Application: Segmentation
Paired data $(x_0, c)$: $x_0$ is the segmentation, $c$ is the image
$s_t(x | c) = \mathrm{UNet}(F(x) + G(c), t)$
Cons:
$p(x_0 | c)$ is largely a black box
Energy design is based on intuition
CLIP provides a model to measure the similarity between images and texts:
Similarity: $\mathrm{sim}(\mathbf{x}, c) = \mathbf{f}(\mathbf{x}) \cdot \mathbf{g}(c)$
Energy: $E_t(\mathbf{x}, c) = -\mathrm{sim}(\mathbf{x}, c)$
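A toy sketch of the energy and its gradient. The linear encoder $\mathbf{f}(\mathbf{x}) = W\mathbf{x}$ is a stand-in chosen so the gradient has a closed form; the real CLIP encoders are neural networks and their gradient would come from automatic differentiation. All function names are hypothetical.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def linear_encoder(W, x):
    """Toy stand-in for the image encoder: f(x) = W x."""
    return [dot(row, x) for row in W]


def energy(f_x, g_c):
    """E(x, c) = -sim(x, c) = -f(x) . g(c)."""
    return -dot(f_x, g_c)


def energy_grad_linear(W, g_c):
    """Gradient of E with respect to x when f(x) = W x: -W^T g(c)."""
    rows, cols = len(W), len(W[0])
    return [-sum(W[i][j] * g_c[i] for i in range(rows)) for j in range(cols)]


W = [[1.0, 0.0], [0.0, 2.0]]
g_c = [1.0, 1.0]                       # toy text embedding
x = [0.5, 0.5]
grad = energy_grad_linear(W, g_c)
x_new = [xi - 0.1 * gi for xi, gi in zip(x, grad)]   # move against the gradient
```

Moving `x` against the gradient lowers the energy, which is the same as raising the similarity to `c`.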
Energy guidance
Self guidance
Sehwag et al. Generating High Fidelity Data from Low-density Regions using Diffusion Models
Figure panels: dataset; samples from the SDE of $\mathbf{s}_t(\mathbf{x} | c)$; samples from the SDE of $\mathbf{s}_t(\mathbf{x} | c) - \nabla E_t(\mathbf{x}, c)$
$d\mathbf{x} = [f(t)\mathbf{x} - g(t)^2 \mathbf{s}_t(\mathbf{x})]\,dt + g(t)\,d\bar{\mathbf{w}}, \quad x_{t_0} \sim p(x_{t_0} | c)$
No energy guidance
$c$ only influences the start distribution
Choose an early start time 𝑡0 < 𝑇
𝑝(𝑥𝑡0 |𝑐) is a Gaussian perturbation of 𝑐
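A minimal sketch of drawing the start point $x_{t_0}$ from a Gaussian perturbation of $c$. The specific kernel is an assumption: it is the forward marginal of a constant-$\beta$ variance-preserving SDE, and other choices, such as adding noise without shrinking $c$, are equally consistent with the slide. `perturb_condition` is a hypothetical name.

```python
import math
import random


def perturb_condition(c, t0, beta=2.0, rng=None):
    """Sample x_{t0} ~ N(sqrt(decay) * c, (1 - decay) I), decay = exp(-beta * t0).

    Assumption: forward SDE dx = -(beta / 2) x dt + sqrt(beta) dw applied to c.
    """
    if rng is None:
        rng = random.Random(0)
    decay = math.exp(-beta * t0)
    keep, noise = math.sqrt(decay), math.sqrt(1.0 - decay)
    return [keep * ci + noise * rng.gauss(0.0, 1.0) for ci in c]


stroke = [1.0, -2.0, 0.5]                  # toy stand-in for a stroke image c
x_start = perturb_condition(stroke, t0=0.5)
```

The unconditional reverse SDE is then run from $t_0$ down to $0$ starting at `x_start`. A smaller $t_0$ keeps the result closer to $c$; a larger one gives the model more freedom.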
Stroke to painting
Video generation
Ho et al. Video Diffusion Models