Conditional Diffusion Model
Abstract
Diffusion models have recently exhibited remarkable abilities to synthesize striking image samples since the introduction of denoising diffusion probabilistic models (DDPMs). Their key idea is to disrupt images into noise through a fixed forward process and learn its reverse process to generate samples from noise in a denoising way. For conditional DDPMs, most existing practices relate conditions only to the reverse process and fit it to the reversal of the unconditional forward process. We find that this limits condition modeling and generation to a small time window. In this paper, we propose a novel and flexible conditional diffusion model by introducing conditions into the forward process. We utilize extra latent space to allocate an exclusive diffusion trajectory for each condition.

[Figure 1: sampling from an (unconditional) DDPM vs. the mixed sampling procedure (input 0–9 when sampling conditionally); diffusion trajectories pass through early-stage, critical-stage, and late-stage segments relative to the image manifold.]
Prior-Shift
Grad-TTS (Popov et al. 2021) proposes a score-based text-to-speech generative model with the prior mean predicted by a text encoder and aligner. Specifically, it defines a forward process satisfying the following SDE:

    dX_t = \frac{1}{2}(\mu - X_t)\,\beta_t\,dt + \sqrt{\beta_t}\,dW_t ,    (14)

where µ corresponds to E(c) in our system (E(·) represents the parameterized text encoder and aligner, and c represents the input text). We show that k_t = 1 − √ᾱ_t matches a discretization of Eq. (14) (see proof in Appendix A). For the forward process, k_t increases from 0 to 1 and leads x_t to shift toward µ as t increases. For the reverse process, we have:

    d_t = \left(1 - \frac{1}{\sqrt{\alpha_t}}\right)\mu ,    (15)

where 1 − 1/√α_t < 0, because the reverse process starts from N(µ, I) and needs to eliminate the cumulative shift µ of the forward process. From the view of diffusion trajectories, Grad-TTS changes the ending point of the trajectories, so we name this shift mode Prior-Shift.

[Figure 2: 32 × 32 conditional MNIST samples for different shift modes (Prior-Shift, Data-Normalization, Quadratic-Shift) with different shift predictors. The last row visualizes the learned Eψ(·).]

Note that Grad-TTS still takes µ as an additional input to the score estimator, but we have stated that this is unnecessary: doing so yields results that are at least no worse, but it also introduces additional parameters and computation.
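As a minimal illustration (not the original implementation), the shifted forward kernel that the k_t values above plug into, x_t = √ᾱ_t x_0 + k_t E(c) + √(1 − ᾱ_t) ε with ε ∼ N(0, I), can be sampled as follows; the linear β schedule and helper names are our own assumptions:

```python
import numpy as np

def q_sample(x0, shift, t, sqrt_alpha_bar, k):
    """Sample x_t from the shifted forward kernel
    x_t = sqrt(abar_t) * x0 + k_t * E(c) + sqrt(1 - abar_t) * eps, eps ~ N(0, I).
    `shift` is E(c); `k` is the shift schedule (here Prior-Shift: k_t = 1 - sqrt(abar_t))."""
    eps = np.random.randn(*x0.shape)
    a = sqrt_alpha_bar[t]                     # a = sqrt(abar_t), so 1 - abar_t = 1 - a**2
    return a * x0 + k[t] * shift + np.sqrt(1.0 - a ** 2) * eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)            # assumed linear schedule (Ho et al. 2020)
sqrt_alpha_bar = np.sqrt(np.cumprod(1.0 - betas))
k_prior = 1.0 - sqrt_alpha_bar                # Prior-Shift: k_t = 1 - sqrt(abar_t)

x0 = np.random.rand(32, 32) * 2 - 1           # stand-in image in [-1, 1]
Ec = np.full((32, 32), 0.5)                   # stand-in shift predictor output E(c)
x_T = q_sample(x0, Ec, T - 1, sqrt_alpha_bar, k_prior)
# At t = T, k_T ~ 1 and sqrt(abar_T) ~ 0, so x_T ~ N(E(c), I): the trajectory ends at the shifted prior.
```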
Data-Normalization
PriorGrad (Lee et al. 2021) employs a forward process as follows:

    x_t = \sqrt{\bar{\alpha}_t}\,(x_0 - \mu) + \sqrt{1 - \bar{\alpha}_t}\,\epsilon ,    (16)

where ε ∼ N(0, Σ). Obviously, k_t = −√ᾱ_t satisfies Eq. (16). In the forward process, x_0 is first normalized by subtracting its corresponding prior mean µ, and a diffusion model is then trained on the normalized x_0 with prior N(0, Σ). For the reverse process, we have:

    d_1 = \mu , \quad d_{t>1} = 0 .    (17)

Intuitively, the reverse process starts from N(0, Σ) and makes no amendment until the last step, where it adds the prior mean µ back to the output (denormalization). From the view of diffusion trajectories, PriorGrad resets the starting point of the trajectories on the data manifold, so we name this shift mode Data-Normalization.
Unlike Prior-Shift, which disperses the cumulative shift over all points of the diffusion trajectories, Data-Normalization does not disentangle the trajectories, so it must feed c into the network to guide sampling. However, by carefully designing Σ, it can achieve the same precision with a simpler network and enjoy a faster convergence rate under some constraints (Lee et al. 2021). Data-Normalization is more suitable for variance-sensitive data such as audio.

Quadratic-Shift
In addition to Prior-Shift, we propose a shift mode that disentangles the diffusion trajectories of different conditions by making the concave trajectories shown in Figure 1 convex. In this case, we change neither their starting point nor their ending point; instead, E(c) becomes a middle point toward which the trajectories first progress and from which they then move away. Therefore, k_t should resemble a downward-opening quadratic function with k_1 ≈ 0 and k_T ≈ 0. Empirically, we choose k_t = √ᾱ_t (1 − √ᾱ_t). We name this shift mode Quadratic-Shift.
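As a quick sanity check on the three shift modes, the following sketch computes k_t from a linear β schedule (an assumption, since no schedule is specified here) and confirms the claimed endpoint behavior: Prior-Shift rises from ≈0 to ≈1, Data-Normalization rises from ≈−1 to ≈0, and Quadratic-Shift is ≈0 at both ends with a peak in between.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)                 # assumed linear schedule (Ho et al. 2020)
sqrt_alpha_bar = np.sqrt(np.cumprod(1.0 - betas))  # sqrt(abar_t) for t = 1..T

k_prior = 1.0 - sqrt_alpha_bar                     # Prior-Shift: k_t = 1 - sqrt(abar_t)
k_norm = -sqrt_alpha_bar                           # Data-Normalization: k_t = -sqrt(abar_t)
k_quad = sqrt_alpha_bar * (1.0 - sqrt_alpha_bar)   # Quadratic-Shift: k_t = sqrt(abar_t)(1 - sqrt(abar_t))

print(k_prior[0], k_prior[-1])                 # ~0 -> ~1: ending point moves to E(c)
print(k_norm[0], k_norm[-1])                   # ~-1 -> ~0: starting point moves by -E(c)
print(k_quad[0], k_quad[-1], k_quad.max())     # ~0 at both ends, peak 0.25 in the middle
```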
Experiments
In this section, we conduct several conditional image synthesis experiments with ShiftDDPMs. Note that we always set Σ(c) = I. Full implementation details of all experiments can be found in Appendix B.

Effectiveness of Conditional Sampling
We first verify the effectiveness of ShiftDDPMs with the three shift modes on the toy dataset MNIST (LeCun et al. 1998). We employ two fixed shift predictors (E1(·) and E2(·)) and a trainable one (Eψ(·) with parameters ψ), each mapping a one-hot vector c to a 32 × 32 matrix. Specifically, E1(·) takes 10 evenly spaced numbers over [−1, 1] and expands each number into a 32 × 32 matrix. E2(·) takes the mean of all training images of each class.
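The two fixed predictors can be written in a few lines. This is a sketch under the description above; the per-class-mean reading of E2 fills in a sentence that is truncated in the source, and the function names simply mirror the paper's notation:

```python
import numpy as np

def E1(c):
    """Fixed predictor: pick the class's value from 10 evenly spaced
    numbers over [-1, 1] and expand it into a constant 32x32 matrix."""
    values = np.linspace(-1.0, 1.0, 10)
    return np.full((32, 32), values[np.argmax(c)])

def E2(c, train_images, train_labels):
    """Fixed predictor: mean of all training images of the class.
    train_images: [N, 32, 32] array; train_labels: [N] integer labels."""
    return train_images[train_labels == np.argmax(c)].mean(axis=0)

c = np.eye(10)[3]    # one-hot vector for class 3
shift = E1(c)        # 32x32 shift used as E(c) in the forward process
```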
Model                   IS↑     FID↓    NLL↓
Unconditional
  DDPM                  9.46    3.17    ≤ 3.75
  our DDPM              9.52    3.13    ≤ 3.72
Conditional
  cond. DDPM            9.59    3.12    ≤ 3.74
  cls. DDPM             9.17    5.85    −
  Prior-Shift           9.54    3.06    ≤ 3.71
  cond. Prior-Shift     9.65    3.06    ≤ 3.70
  Data-Normalization    9.14    5.51    −
  Quadratic-Shift       9.67    3.05    ≤ 3.69
  cond. Quadratic-Shift 9.74    3.02    ≤ 3.70
We also train conditional models (cond. Prior-Shift and cond. Quadratic-Shift) by incorporating class labels into the reverse process of Prior-Shift and Quadratic-Shift, respectively.

η is a hyperparameter that we can directly control. Figure 4 and Table 2 present the conditional CIFAR-10 samples generated by the Quadratic-Shift mode and its FID with different sampling steps and η. ShiftDDIMs can still keep a competitive FID even when sampling for only 100 steps.

[Figure 4: 32 × 32 conditional CIFAR-10 samples for Quadratic-Shift (columns: S = 10 and S = 100 sampling steps; rows: η = 0.0, 0.2, 0.5, 1.0). We use fixed input and noise during sampling.]

[Figure 5: 64 × 64 conditional LFW samples for Quadratic-Shift. From left to right: ground-truth image (from the test set), generated image, and the learned Eψ(c).]

[Figure 6: 64 × 64 conditional LFW interpolations for Quadratic-Shift. We use fixed input and noise during sampling.]
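For background on η and S: in DDIM sampling (Song, Meng, and Ermon 2020), η scales the per-step noise, with η = 0 giving deterministic sampling and η = 1 recovering DDPM-like stochasticity. Below is a minimal sketch of that standard rule only; the exact ShiftDDIM update, which must also account for the shift term k_t E(c), is not shown in this excerpt.

```python
import numpy as np

def ddim_sigma(alpha_bar_t, alpha_bar_prev, eta):
    """Standard DDIM noise scale (Song et al. 2020, Eq. 16):
    eta = 0 is deterministic; eta = 1 matches DDPM-like variance."""
    return eta * np.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar_t)
                         * (1 - alpha_bar_t / alpha_bar_prev))

# With S = 100 steps instead of T = 1000, one strides the schedule:
T, S = 1000, 100
taus = np.arange(0, T, T // S)   # sub-sequence of timesteps used for fast sampling
```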
More Choices of k_t
The choice of k_t is flexible. For Prior-Shift, any schedule of k_t that increases monotonically from 0 to 1 can be applied. We have tried the following three choices of k_t: t/T, (t/T)², and 1 + sin(tπ/(2T) − π/2), and they all work well. Furthermore, k_t can also be piecewise:

    k_t = \begin{cases} 0 & t < 0.4T \\ \dfrac{t - 0.4T}{0.6T} & \text{otherwise} \end{cases}    (19)

One can also design other reasonable k_t; we leave empirical investigations of k_t as future work.
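A compact sketch of these Prior-Shift schedules (variable names are ours): each maps t ∈ {1, …, T} into [0, 1] and increases monotonically, which is the only property Prior-Shift requires.

```python
import numpy as np

T = 1000
t = np.arange(1, T + 1)

k_linear = t / T                                          # k_t = t/T
k_square = (t / T) ** 2                                   # k_t = (t/T)^2
k_sine = 1 + np.sin(t * np.pi / (2 * T) - np.pi / 2)      # rises smoothly from ~0 to 1
k_piecewise = np.where(t < 0.4 * T, 0.0,
                       (t - 0.4 * T) / (0.6 * T))         # piecewise schedule, Eq. (19)

for k in (k_linear, k_square, k_sine, k_piecewise):
    # Monotone non-decreasing and ending at 1, as Prior-Shift requires.
    assert np.all(np.diff(k) >= 0) and abs(k[-1] - 1.0) < 1e-9
```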
References
Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer.
Chen, N.; Zhang, Y.; Zen, H.; Weiss, R. J.; Norouzi, M.; and Chan, W. 2020. WaveGrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713.
Dhariwal, P.; and Nichol, A. 2021. Diffusion models beat GANs on image synthesis. arXiv preprint arXiv:2105.05233.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. Advances in Neural Information Processing Systems, 27.
Ho, J.; Jain, A.; and Abbeel, P. 2020. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239.
Huang, G. B.; Mattar, M.; Berg, T.; and Learned-Miller, E. 2008. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition.
Huang, R.; Lam, M. W.; Wang, J.; Su, D.; Yu, D.; Ren, Y.; and Zhao, Z. 2022a. FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis. arXiv preprint arXiv:2204.09934.
Huang, R.; Zhao, Z.; Liu, H.; Liu, J.; Cui, C.; and Ren, Y. 2022b. ProDiff: Progressive fast diffusion model for high-quality text-to-speech. arXiv preprint arXiv:2207.06389.
Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Krizhevsky, A.; and Hinton, G. 2009. Learning multiple layers of features from tiny images. Technical Report 0, University of Toronto, Toronto, Ontario.
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278–2324.
Lee, S.-g.; Kim, H.; Shin, C.; Tan, X.; Liu, C.; Meng, Q.; Qin, T.; Chen, W.; Yoon, S.; and Liu, T.-Y. 2021. PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Driven Adaptive Prior. arXiv preprint arXiv:2106.06406.
Liu, G.; Reda, F. A.; Shih, K. J.; Wang, T.-C.; Tao, A.; and Catanzaro, B. 2018. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), 85–100.
Liu, L.; Ren, Y.; Lin, Z.; and Zhao, Z. 2022. Pseudo Numerical Methods for Diffusion Models on Manifolds. In International Conference on Learning Representations.
Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, 3730–3738.
Nazeri, K.; Ng, E.; Joseph, T.; Qureshi, F.; and Ebrahimi, M. 2019. EdgeConnect: Structure guided image inpainting using edge prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops.
Popov, V.; Vovk, I.; Gogoryan, V.; Sadekova, T.; and Kudinov, M. 2021. Grad-TTS: A diffusion probabilistic model for text-to-speech. arXiv preprint arXiv:2105.06337.
Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; and Lee, H. 2016. Generative adversarial text to image synthesis. In International Conference on Machine Learning, 1060–1069. PMLR.
Ren, Y.; Yu, X.; Zhang, R.; Li, T. H.; Liu, S.; and Li, G. 2019. StructureFlow: Image inpainting via structure-aware appearance flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 181–190.
Rezende, D.; and Mohamed, S. 2015. Variational inference with normalizing flows. In International Conference on Machine Learning, 1530–1538. PMLR.
Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S. K. S.; Ayan, B. K.; Mahdavi, S. S.; Lopes, R. G.; et al. 2022. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv preprint arXiv:2205.11487.
Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; and Ganguli, S. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2256–2265. PMLR.
Song, J.; Meng, C.; and Ermon, S. 2020. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
Song, Y.; Sohl-Dickstein, J.; Kingma, D. P.; Kumar, A.; Ermon, S.; and Poole, B. 2020. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456.
Van Oord, A.; Kalchbrenner, N.; and Kavukcuoglu, K. 2016. Pixel recurrent neural networks. In International Conference on Machine Learning, 1747–1756. PMLR.
Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; and He, X. 2018. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1316–1324.
Yan, X.; Yang, J.; Sohn, K.; and Lee, H. 2016. Attribute2Image: Conditional image generation from visual attributes. In European Conference on Computer Vision, 776–791. Springer.
Ye, Z.; Jiang, Z.; Ren, Y.; Liu, J.; He, J.; and Zhao, Z. 2023. GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis. arXiv preprint arXiv:2301.13430.
Ye, Z.; Zhao, Z.; Ren, Y.; and Wu, F. 2022. SyntaSpeech: Syntax-aware Generative Adversarial Text-to-Speech. arXiv preprint arXiv:2204.11792.
Yu, F.; Seff, A.; Zhang, Y.; Song, S.; Funkhouser, T.; and Xiao, J. 2015. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.
Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; and Huang, T. S. 2018. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5505–5514.
Yu, J.; Lin, Z.; Yang, J.; Shen, X.; Lu, X.; and Huang, T. S. 2019. Free-form image inpainting with gated convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4471–4480.
Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; and Metaxas, D. N. 2017. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 5907–5915.
Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; and Metaxas, D. N. 2018. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8): 1947–1962.
Zhang, Z.; Zhao, Z.; and Lin, Z. 2022. Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models. In Advances in Neural Information Processing Systems.
Zhang, Z.; Zhao, Z.; Zhang, Z.; Huai, B.; and Yuan, J. 2020. Text-guided image inpainting. In Proceedings of the 28th ACM International Conference on Multimedia, 4079–4087.
Zhou, B.; Lapedriza, A.; Khosla, A.; Oliva, A.; and Torralba, A. 2017. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6): 1452–1464.
Appendix A
Derivation of our conditional forward diffusion kernels
According to the Markovian property, q(x_t | x_{t−1}, x_0, c) = q(x_t | x_{t−1}, c) for all t > 1. Therefore, we can assume that:

From (Bishop 2006) (Eqs. 2.116 and 2.117), we have that q(x_{t−1} | x_t, x_0, c) is Gaussian with

    \mathrm{Cov}\big[q(x_{t-1} \mid x_t, x_0, c)\big]
    = \left( \frac{1}{1 - \bar{\alpha}_{t-1}} \Sigma^{-1} + \frac{\alpha_t}{1 - \alpha_t} \Sigma^{-1} \right)^{-1}
    = \frac{1}{\frac{1}{1 - \bar{\alpha}_{t-1}} + \frac{\alpha_t}{1 - \alpha_t}} \, \Sigma
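As a quick numeric check (ours, not from the paper), the scalar coefficient in this covariance equals the familiar DDPM posterior variance β̃_t = (1 − ᾱ_{t−1})(1 − α_t)/(1 − ᾱ_t):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)          # alpha_bar[t] = alpha_bar[t-1] * alphas[t]

t = 500                                 # any interior timestep
coeff = 1.0 / (1.0 / (1 - alpha_bar[t - 1]) + alphas[t] / (1 - alphas[t]))
beta_tilde = (1 - alpha_bar[t - 1]) * (1 - alphas[t]) / (1 - alpha_bar[t])
assert np.isclose(coeff, beta_tilde)    # the two expressions agree exactly
```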