Vector Quantized Diffusion Model for Text-to-Image Synthesis
Figure 1. Overall framework of our method. It starts with the VQ-VAE. Then, the VQ-Diffusion models the discrete latent space by reversing a forward diffusion process that gradually corrupts the input via a fixed Markov chain.

q(x_t | x_{t-1}), e.g., randomly replace some tokens of x_{t-1}. After a fixed number of T timesteps, the forward process yields a sequence of increasingly noisy latent variables z_1, ..., z_T of the same dimensionality as z_0, and z_T becomes pure noise tokens. Starting from the noise z_T, the reverse process gradually denoises the latent variables and restores the real data x_0 by sampling from the reverse distribution q(x_{t-1} | x_t, x_0) sequentially. However, since x_0 is unknown at the inference stage, we train a transformer network to approximate the conditional transition distribution p_θ(x_{t-1} | x_t, y), which depends on the entire data distribution.

To be more specific, consider a single image token x_0^i of x_0 at location i, which takes an index that specifies an entry in the codebook, i.e., x_0^i ∈ {1, 2, ..., K}. Without introducing confusion, we omit the superscript i in the following description. We define the probabilities that x_{t-1} transits to x_t using the matrices [Q_t]_{mn} = q(x_t = m | x_{t-1} = n) ∈ R^{K×K}. Then the forward Markov diffusion process for the whole token sequence can be written as

q(x_t | x_{t-1}) = v^\top(x_t)\, Q_t\, v(x_{t-1}),    (3)

where v(x) is a one-hot column vector of length K whose only nonzero entry, equal to 1, is at position x. The categorical distribution over x_t is given by the vector Q_t v(x_{t-1}).

Importantly, due to the Markov property, one can marginalize out the intermediate steps and derive the probability of x_t at an arbitrary timestep directly from x_0 as

q(x_t | x_0) = v^\top(x_t)\, \overline{Q}_t\, v(x_0),  with  \overline{Q}_t = Q_t \cdots Q_1.    (4)

Besides, another notable characteristic is that, by conditioning on z_0, the posterior of this diffusion process is tractable, i.e.,

q(x_{t-1} | x_t, x_0) = \frac{\big(v^\top(x_t) Q_t v(x_{t-1})\big)\big(v^\top(x_{t-1}) \overline{Q}_{t-1} v(x_0)\big)}{v^\top(x_t) \overline{Q}_t v(x_0)}.    (5)

The transition matrix Q_t is crucial to the discrete diffusion model and should be carefully designed such that it is not too difficult for the reverse network to recover the signal from the noise.

Previous works [1, 26] propose to introduce a small amount of uniform noise into the categorical distribution, so that the transition matrix can be formulated as

Q_t =
\begin{bmatrix}
\alpha_t + \beta_t & \beta_t & \cdots & \beta_t \\
\beta_t & \alpha_t + \beta_t & \cdots & \beta_t \\
\vdots & \vdots & \ddots & \vdots \\
\beta_t & \beta_t & \cdots & \alpha_t + \beta_t
\end{bmatrix},    (6)

with α_t ∈ [0, 1] and β_t = (1 − α_t)/K. Each token has a probability of (α_t + β_t) of keeping its previous value at the current step and a probability of Kβ_t of being resampled uniformly over all K categories.

Nonetheless, data corruption using uniform diffusion is a somewhat aggressive process that may pose a challenge for the reverse estimation. First, as opposed to the Gaussian diffusion process for ordinal data, an image token may be replaced by an utterly uncorrelated category, which leads to an abrupt semantic change for that token. Second, the network has to take extra effort to figure out which tokens have been replaced before fixing them. In fact, due to the semantic conflict within the local context, the reverse estimation for different image tokens may form a competition and run into the dilemma of identifying the reliable tokens.

Mask-and-replace diffusion strategy. To solve the above issues of uniform diffusion, we draw inspiration from masked language modeling [11] and propose to corrupt the tokens by stochastically masking some of them so that the corrupted locations can be explicitly known by the reverse network. Specifically, we introduce an additional special token, the [MASK] token, so that each token now has (K + 1) discrete states. We define the mask diffusion as follows: each ordinary token has a probability of γ_t of being replaced by the [MASK] token and a probability of Kβ_t of being uniformly diffused, leaving a probability of α_t = 1 − Kβ_t − γ_t of staying unchanged, whereas the [MASK] token always keeps its own state. Hence, we can formulate the transition matrix Q_t ∈ R^{(K+1)×(K+1)} as

Q_t =
\begin{bmatrix}
\alpha_t + \beta_t & \beta_t & \beta_t & \cdots & 0 \\
\beta_t & \alpha_t + \beta_t & \beta_t & \cdots & 0 \\
\beta_t & \beta_t & \alpha_t + \beta_t & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\gamma_t & \gamma_t & \gamma_t & \cdots & 1
\end{bmatrix}.    (7)
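To make the mask-and-replace corruption concrete, the sketch below (not from the paper; the codebook size and schedule values are placeholder assumptions) builds the (K+1)×(K+1) matrix of Equation 7 and applies one forward step q(x_t | x_{t-1}) of Equation 3 to a toy token sequence.

```python
import numpy as np

def transition_matrix(K, alpha_t, beta_t, gamma_t):
    """Mask-and-replace transition matrix Q_t of Eq. (7).

    States 0..K-1 are codebook entries, state K is [MASK].
    Column n holds q(x_t = . | x_{t-1} = n).
    """
    Q = np.full((K + 1, K + 1), beta_t)
    np.fill_diagonal(Q, alpha_t + beta_t)
    Q[K, :K] = gamma_t      # ordinary token -> [MASK] with probability gamma_t
    Q[:K, K] = 0.0          # [MASK] never returns to an ordinary token
    Q[K, K] = 1.0           # [MASK] keeps its own state
    return Q

def forward_step(x_prev, Q, rng):
    """Sample x_t ~ q(x_t | x_{t-1}) = v^T(x_t) Q_t v(x_{t-1}) per token."""
    probs = Q[:, x_prev].T                    # one categorical distribution per token
    return np.array([rng.choice(len(p), p=p) for p in probs])

# Toy example: K = 8 codebook entries, placeholder schedule values.
rng = np.random.default_rng(0)
K, gamma_t, beta_t = 8, 0.1, 0.02
alpha_t = 1.0 - K * beta_t - gamma_t
Q = transition_matrix(K, alpha_t, beta_t, gamma_t)
assert np.allclose(Q.sum(axis=0), 1.0)        # each column is a valid distribution
x0 = rng.integers(0, K, size=16)              # a toy sequence of image tokens
x1 = forward_step(x0, Q, rng)
```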
Algorithm 1 Training of the VQ-Diffusion, given transition matrix {Q_t}, initial network parameters θ, loss weight λ, learning rate η.
1: repeat
2:     (I, s) ← sample training image-text pair
3:     x_0 ← VQVAE-Encoder(I), y ← BPE(s)
4:     t ∼ Uniform({1, ..., T})
5:     x_t ← sample from q(x_t | x_0)                          ▷ Eqn. 4 and 8
6:     L ← L_0 if t = 1, otherwise L_{t-1} + λ L_{x_0}         ▷ Eqn. 9 and 12
7:     θ ← θ − η ∇_θ L                                         ▷ Update network parameters
8: until converged

Algorithm 2 Inference of the VQ-Diffusion, given fast inference time stride ∆_t, input text s.
1: t ← T, y ← BPE(s)
2: x_t ← sample from p(x_T)                                    ▷ Eqn. 10
3: while t > 0 do
4:     x_t ← sample from p_θ(x_{t−∆_t} | x_t, y)               ▷ Eqn. 13
5:     t ← t − ∆_t
6: end while
7: return VQVAE-Decoder(x_t)
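For illustration only, a minimal Python rendering of the inference loop in Algorithm 2; `bpe_encode`, `sample_prior`, `reverse_step`, and `vqvae_decode` are hypothetical stand-ins for the text tokenizer, the prior p(x_T) of Eqn. 10, the reverse transition of Eqn. 13, and the VQ-VAE decoder.

```python
def vq_diffusion_inference(text, T, delta_t,
                           bpe_encode, sample_prior, reverse_step, vqvae_decode):
    """Sketch of Algorithm 2: fast inference with time stride delta_t."""
    y = bpe_encode(text)          # conditional text tokens
    t = T
    x_t = sample_prior()          # x_T ~ p(x_T), Eqn. 10
    while t > 0:
        # one reverse jump x_t -> x_{t - delta_t}, Eqn. 13
        x_t = reverse_step(x_t, t, max(t - delta_t, 0), y)
        t -= delta_t
    return vqvae_decode(x_t)      # map latent tokens back to an image
```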
The benefits of this mask-and-replace transition are: 1) the corrupted tokens are distinguishable to the network, which eases the reverse process; 2) compared to the mask-only approach in [1], we theoretically prove that it is necessary to include a small amount of uniform noise in addition to the token masking, otherwise we obtain a trivial posterior when x_t ≠ x_0; 3) the random token replacement forces the network to understand the context rather than only focusing on the [MASK] tokens; 4) the cumulative transition matrix \overline{Q}_t and the probability q(x_t | x_0) in Equation 4 can be computed in closed form:

\overline{Q}_t v(x_0) = \bar{\alpha}_t v(x_0) + (\bar{\gamma}_t - \bar{\beta}_t) v(K+1) + \bar{\beta}_t,    (8)

where \bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i, \bar{\gamma}_t = 1 - \prod_{i=1}^{t}(1 - \gamma_i), and \bar{\beta}_t = (1 - \bar{\alpha}_t - \bar{\gamma}_t)/K can be calculated and stored in advance. Thus, the computation cost of q(x_t | x_0) is reduced from O(tK^2) to O(K). The proof is given in the supplemental material.

4.2. Learning the reverse process

To reverse the diffusion process, we train a denoising network p_θ(x_{t-1} | x_t, y) to estimate the posterior transition distribution q(x_{t-1} | x_t, x_0). The network is trained to minimize the variational lower bound (VLB) [59]:

L_{vlb} = L_0 + L_1 + \cdots + L_{T-1} + L_T,
L_0 = -\log p_\theta(x_0 | x_1, y),
L_{t-1} = D_{KL}\big(q(x_{t-1} | x_t, x_0) \,\|\, p_\theta(x_{t-1} | x_t, y)\big),
L_T = D_{KL}\big(q(x_T | x_0) \,\|\, p(x_T)\big),    (9)

where p(x_T) is the prior distribution at timestep T. For the proposed mask-and-replace diffusion, the prior is

p(x_T) = \big[\bar{\beta}_T, \bar{\beta}_T, \cdots, \bar{\beta}_T, \bar{\gamma}_T\big]^\top.    (10)

Note that since the transition matrix Q_t is fixed during training, L_T is a constant that measures the gap between training and inference and can be ignored during training.

Reparameterization trick on discrete stage. The network parameterization affects the synthesis quality significantly. Instead of directly predicting the posterior q(x_{t-1} | x_t, x_0), recent works [1, 23, 26] find that approximating a surrogate variable, e.g., the noiseless target data q(x_0), gives better quality. In the discrete setting, we let the network predict the noiseless token distribution p_θ(x̃_0 | x_t, y) at each reverse step. We can thus compute the reverse transition distribution according to

p_\theta(x_{t-1} | x_t, y) = \sum_{\tilde{x}_0=1}^{K} q(x_{t-1} | x_t, \tilde{x}_0)\, p_\theta(\tilde{x}_0 | x_t, y).    (11)

Based on the reparameterization trick, we can introduce an auxiliary denoising objective, which encourages the network to predict the noiseless tokens x_0:

L_{x_0} = -\log p_\theta(x_0 | x_t, y).    (12)

We find that combining this loss with L_{vlb} improves the image quality.

Model architecture. We propose an encoder-decoder transformer to estimate the distribution p_θ(x̃_0 | x_t, y). As shown in Figure 1, the framework contains two parts: a text encoder and a diffusion image decoder. Our text encoder takes the text tokens y and yields a conditional feature sequence. The diffusion image decoder takes the image tokens x_t and the timestep t and outputs the noiseless token distribution p_θ(x̃_0 | x_t, y). The decoder contains several transformer blocks and a softmax layer. Each transformer block contains a full attention, a cross attention to combine text information, and a feed-forward network block. The current timestep t is injected into the network with the Adaptive Layer Normalization [2] (AdaLN) operator, i.e., AdaLN(h, t) = a_t LayerNorm(h) + b_t, where h is the intermediate activations and a_t and b_t are obtained from a linear projection of the timestep embedding.

Fast inference strategy. In the inference stage, by leveraging the reparameterization trick, we can skip some steps of the diffusion model to achieve faster inference.
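A minimal sketch (assuming a PyTorch-style implementation, not the authors' code) of the AdaLN operator used in the diffusion image decoder described above, where the scale a_t and shift b_t come from a linear projection of the timestep embedding:

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """AdaLN(h, t) = a_t * LayerNorm(h) + b_t, with (a_t, b_t) from the timestep embedding."""
    def __init__(self, dim, t_emb_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.proj = nn.Linear(t_emb_dim, 2 * dim)   # predicts scale and shift

    def forward(self, h, t_emb):
        # h: (batch, seq_len, dim), t_emb: (batch, t_emb_dim)
        a_t, b_t = self.proj(t_emb).chunk(2, dim=-1)
        return a_t.unsqueeze(1) * self.norm(h) + b_t.unsqueeze(1)
```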
(Figure: visual comparison of generated images from DM-GAN, DF-GAN, and our method for several example text prompts.)
Specifically, assuming the time stride is ∆_t, instead of sampling images along the chain x_T, x_{T-1}, x_{T-2}, ..., x_0, we sample images along the chain x_T, x_{T-∆_t}, x_{T-2∆_t}, ..., x_0 with the reverse transition distribution

p_\theta(x_{t-\Delta_t} | x_t, y) = \sum_{\tilde{x}_0=1}^{K} q(x_{t-\Delta_t} | x_t, \tilde{x}_0)\, p_\theta(\tilde{x}_0 | x_t, y).    (13)

We find that this makes the sampling more efficient while causing only little harm to quality. The whole training and inference procedures are shown in Algorithms 1 and 2.

5. Experiments

In this section, we first introduce the overall experiment setup and then present extensive results to demonstrate the superiority of our approach in text-to-image synthesis. Finally, we point out that our method is a general image synthesis framework that achieves great performance on other generation tasks, including unconditional and class-conditional image synthesis.

Datasets. To demonstrate the capability of our proposed method for text-to-image synthesis, we conduct experiments on the CUB-200 [66], Oxford-102 [40], and MSCOCO [36] datasets. The CUB-200 dataset contains 8,855 training images and 2,933 test images belonging to 200 bird species. The Oxford-102 dataset contains 8,189 images of flowers from 102 categories. Each image in the CUB-200 and Oxford-102 datasets has 10 text descriptions. The MSCOCO dataset contains 82k images for training and 40k images for testing. Each image in this dataset has five text descriptions.

To further demonstrate the scalability of our method, we also train our model on large-scale datasets, including Conceptual Captions [6, 57] and LAION-400M [55]. The Conceptual Captions dataset, which includes both the CC3M [57] and CC12M [6] datasets, contains 15M images. To balance the text and image distributions, we filter a 7M subset according to the word frequency. The LAION-400M dataset contains 400M image-text pairs. We train our model on three subsets of LAION, i.e., cartoon, icon, and human, which contain 0.9M, 1.3M, and 42M images, respectively. For each subset, we filter the data according to the text.

Training Details. Our VQ-VAE's encoder and decoder follow the setting of VQGAN [16], which leverages a GAN loss to obtain more realistic images. We directly adopt the publicly available VQGAN model trained on the OpenImages [30] dataset for all text-to-image synthesis experiments. It converts 256×256 images into 32×32 tokens. The codebook size is K = 2886 after removing useless codes. We adopt the publicly available tokenizer of the CLIP model [45] as the text encoder, yielding a conditional sequence of length 77. We fix both the image and text encoders during training.

For a fair comparison with previous text-to-image methods under similar parameter counts, we build two different diffusion image decoder settings: 1) VQ-Diffusion-S (Small), which contains 18 transformer blocks with a dimension of 192 and has 34M parameters; 2) VQ-Diffusion-B (Base), which contains 19 transformer blocks with a dimension of 1024 and has 370M parameters.

To show the scalability of our method, we also train our base model on the larger Conceptual Captions database and then fine-tune it on each database. This model is denoted as VQ-Diffusion-F.

For the default setting, we set the number of timesteps T = 100 and the loss weight λ = 0.0005. For the transition matrix, we linearly increase \bar{\gamma}_t and \bar{\beta}_t from 0 to 0.9 and 0.1, respectively. We optimize our network using AdamW [37] with β_1 = 0.9 and β_2 = 0.96. The learning rate is set to 0.00045 after 5,000 iterations of warmup. More training details are provided in the appendix.
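As a small illustration of the default transition-matrix schedule just described, the sketch below (not the released training code) precomputes the cumulative quantities of Equation 8; it interprets the 0.1 endpoint as the total uniform-replace mass K·β̄_T so that the probabilities sum to one, which is an assumption of this sketch.

```python
import numpy as np

def mask_and_replace_schedule(T=100, gamma_T=0.9, K_beta_T=0.1, K=2886):
    """Precompute cumulative alpha_bar, gamma_bar, beta_bar for t = 1..T (Eq. 8)."""
    t = np.arange(1, T + 1)
    gamma_bar = gamma_T * t / T                 # cumulative mask probability
    K_beta_bar = K_beta_T * t / T               # cumulative uniform-replace probability
    alpha_bar = 1.0 - gamma_bar - K_beta_bar    # cumulative keep probability
    beta_bar = K_beta_bar / K
    return alpha_bar, gamma_bar, beta_bar

def q_xt_given_x0(x0, t_idx, alpha_bar, gamma_bar, beta_bar, K=2886):
    """Closed-form q(x_t | x_0) over the K+1 states for each token (Eq. 8), O(K) per token."""
    probs = np.full((len(x0), K + 1), beta_bar[t_idx])
    probs[np.arange(len(x0)), x0] = alpha_bar[t_idx] + beta_bar[t_idx]
    probs[:, K] = gamma_bar[t_idx]              # probability of being masked
    return probs
```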
5.1. Comparison with state-of-the-art methods

We compare the proposed method with several state-of-the-art methods, including GAN-based methods [60, 61, 63, 67, 70, 71, 73], DALL-E [48], and CogView [13], on the MSCOCO, CUB-200, and Oxford-102 datasets. We calculate the FID [22] between 30k generated images and 30k real images, and show the results in Table 1.

Method            MSCOCO   CUB-200   Oxford-102
StackGAN [70]     74.05    51.89     55.28
StackGAN++ [71]   81.59    15.30     48.68
EFF-T2I [60]      -        11.17     16.47
SEGAN [61]        32.28    18.17     -
AttnGAN [67]      35.49    23.98     -
DM-GAN [73]       32.64    16.09     -
DF-GAN [63]       21.42    14.81     -
DAE-GAN [51]      28.12    15.19     -
DALL-E [48]       27.50    56.10     -
CogView [13]      27.10    -         -
VQ-Diffusion-S    30.17    12.97     14.95
VQ-Diffusion-B    19.75    11.94     14.88
VQ-Diffusion-F    13.86    10.32     14.10

Table 1. FID comparison of different text-to-image synthesis methods on the MSCOCO, CUB-200, and Oxford-102 datasets.

We can see that our small model, VQ-Diffusion-S, which has a similar number of parameters to previous GAN-based models, achieves strong performance on the CUB-200 and Oxford-102 datasets. Our base model, VQ-Diffusion-B, further improves the performance. Our VQ-Diffusion-F model achieves the best results and surpasses all previous methods by a large margin, even surpassing DALL-E [48] and CogView [13], which have ten times more parameters than ours, on the MSCOCO dataset.

Some visualized comparison results with DM-GAN [73] and DF-GAN [63] are shown in the figure above.
Figure 4. Ablation study on the mask rate and the truncation rate.

Model            Steps   FID     Throughput (imgs/s)
VQ-AR-S          -       18.12   0.08
VQ-Diffusion-S   25      15.46   1.25
VQ-Diffusion-S   50      13.62   0.67
VQ-Diffusion-S   100     12.97   0.37
VQ-AR-B          -       17.76   0.03
VQ-Diffusion-B   25      14.03   0.47
VQ-Diffusion-B   50      12.45   0.24
VQ-Diffusion-B   100     11.94   0.13

Table 3. Comparison between VQ-Diffusion and VQ-AR models. By changing the number of inference steps, the VQ-Diffusion model is 15 times faster than the VQ-AR model while maintaining better performance.

dataset. As shown in Figure 4, we find that it achieves the best performance when the truncation rate equals 0.86.

VQ-Diffusion vs. VQ-AR. For a fair comparison, we replace our diffusion image decoder with an autoregressive decoder with the same network structure and keep the other settings the same, including both the image and text encoders. The autoregressive models are denoted as VQ-AR-S and VQ-AR-B, corresponding to VQ-Diffusion-S and VQ-Diffusion-B. The experiment is performed on the CUB-200 dataset. As shown in Table 3, in both the -S and -B settings the VQ-Diffusion model surpasses the VQ-AR model by a large margin. Meanwhile, we evaluate the throughput of both methods on a V100 GPU with a batch size of 32. VQ-Diffusion with the fast inference strategy is 15 times faster than the VQ-AR model, with a better FID score.

5.4. Unified generation model

Our method is general and can also be applied to other image synthesis tasks, e.g., unconditional image synthesis and image synthesis conditioned on labels. To generate images from a given class label, we first remove the text encoder network and the cross-attention parts in the transformer blocks, and inject the class label through the AdaLN operator. Our network contains 24 transformer blocks with dimension 512. We train our model on the ImageNet dataset. For the VQ-VAE, we adopt the publicly available model from VQ-GAN [16] trained on the ImageNet dataset, which downsamples images from 256×256 to 16×16. For unconditional image synthesis, we train our model on the FFHQ256 dataset, which contains 70k high-quality face images. The image encoder also downsamples images to 16×16 tokens.

We assess the performance of our model in terms of FID and compare it with a variety of previously established models [3, 12, 15, 16, 39]. For a fair comparison, we calculate the FID between 50k generated images and all real images. Following [16], we can further increase the quality by only accepting images with a top 5% classification score, denoted as acc0.05. We show the quantitative results in Table 4. While some task-specialized GAN models report better FID scores, our approach provides a unified model that works well across a wide range of tasks.

6. Conclusion

In this paper, we present a novel text-to-image architecture named VQ-Diffusion. The core design is to model the VQ-VAE latent space in a non-autoregressive manner. The proposed mask-and-replace diffusion strategy avoids the accumulation of errors of the AR model. Our model has the capacity to generate more complex scenes, surpassing previous GAN-based text-to-image methods. Our method is also general and produces strong results on unconditional and conditional image generation.
Acknowledgement [13] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng,
Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao,
We thank Qiankun Liu from University of Science and Hongxia Yang, et al. Cogview: Mastering text-to-image gen-
Technology of China for his help, he provided the initial eration via transformers. arXiv preprint arXiv:2105.13290,
code and datasets. 2021. 2, 3, 7
[14] Alaaeldin El-Nouby, Shikhar Sharma, Hannes Schulz, De-
References von Hjelm, Layla El Asri, Samira Ebrahimi Kahou, Yoshua
Bengio, and Graham W Taylor. Tell, draw, and repeat: Gen-
[1] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tar- erating and modifying images based on continual linguistic
low, and Rianne van den Berg. Structured denoising dif- instruction. In Proceedings of the IEEE/CVF International
fusion models in discrete state-spaces. arXiv preprint Conference on Computer Vision, pages 10304–10312, 2019.
arXiv:2107.03006, 2021. 2, 3, 4, 5 2
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin- [15] Patrick Esser, Robin Rombach, Andreas Blattmann, and
ton. Layer normalization. arXiv preprint arXiv:1607.06450, Björn Ommer. Imagebart: Bidirectional context with multi-
2016. 5 nomial diffusion for autoregressive image synthesis. arXiv
[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large preprint arXiv:2108.08827, 2021. 1, 2, 3, 8
scale gan training for high fidelity natural image synthesis. [16] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming
arXiv preprint arXiv:1809.11096, 2018. 8 transformers for high-resolution image synthesis. In Pro-
[4] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Sub- ceedings of the IEEE/CVF Conference on Computer Vision
biah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakan- and Pattern Recognition, pages 12873–12883, 2021. 2, 3, 6,
tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 8, 12
Language models are few-shot learners. arXiv preprint [17] Lianli Gao, Daiyuan Chen, Jingkuan Song, Xing Xu, Dongx-
arXiv:2005.14165, 2020. 1, 2 iang Zhang, and Heng Tao Shen. Perceptual pyramid adver-
[5] Miriam Cha, Youngjune L Gwon, and HT Kung. Adversar- sarial networks for text-to-image synthesis. In Proceedings
ial learning of semantic relevance in text to image synthesis. of the AAAI Conference on Artificial Intelligence, volume 33,
In Proceedings of the AAAI conference on artificial intelli- pages 8312–8319, 2019. 2
gence, volume 33, pages 3272–3279, 2019. 2 [18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
[6] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu
Yoshua Bengio. Generative adversarial nets. In Advances in
Soricut. Conceptual 12m: Pushing web-scale image-text pre-
Neural Information Processing Systems, pages 2672–2680,
training to recognize long-tail visual concepts. In Proceed-
2014. 2
ings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition, pages 3558–3568, 2021. 6 [19] Shuyang Gu, Jianmin Bao, Dong Chen, and Fang Wen. Giqa:
Generated image quality assessment. In European Confer-
[7] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Hee-
ence on Computer Vision, pages 369–385. Springer, 2020.
woo Jun, David Luan, and Ilya Sutskever. Generative pre-
2
training from pixels. In International Conference on Ma-
[20] Shuyang Gu, Jianmin Bao, Dong Chen, and Fang Wen. Pri-
chine Learning, pages 1691–1703. PMLR, 2020. 2, 3
organ: Real data prior for generative adversarial nets. arXiv
[8] Jun Cheng, Fuxiang Wu, Yanling Tian, Lei Wang, and preprint arXiv:2006.16990, 2020. 2
Dapeng Tao. Rifegan: Rich feature generation for text-to- [21] Shuyang Gu, Jianmin Bao, Hao Yang, Dong Chen, Fang
image synthesis from prior knowledge. In Proceedings of Wen, and Lu Yuan. Mask-guided portrait editing with con-
the IEEE/CVF Conference on Computer Vision and Pattern ditional gans. In Proceedings of the IEEE/CVF Conference
Recognition, pages 10911–10920, 2020. 2 on Computer Vision and Pattern Recognition, pages 3436–
[9] Ayushman Dash, John Cristian Borges Gamboa, Sheraz 3445, 2019. 2
Ahmed, Marcus Liwicki, and Muhammad Zeshan Afzal. [22] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner,
Tac-gan-text conditioned auxiliary classifier generative ad- Bernhard Nessler, and Sepp Hochreiter. Gans trained by a
versarial network. arXiv preprint arXiv:1703.06412, 2017. two time-scale update rule converge to a local nash equilib-
2 rium. Advances in neural information processing systems,
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, 30, 2017. 7
and Li Fei-Fei. Imagenet: A large-scale hierarchical image [23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffu-
database. In 2009 IEEE conference on computer vision and sion probabilistic models. arXiv preprint arXiv:2006.11239,
pattern recognition, pages 248–255. Ieee, 2009. 2 2020. 1, 3, 5
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina [24] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet,
Toutanova. Bert: Pre-training of deep bidirectional Mohammad Norouzi, and Tim Salimans. Cascaded diffusion
transformers for language understanding. arXiv preprint models for high fidelity image generation. arXiv preprint
arXiv:1810.04805, 2018. 1, 2, 3, 4 arXiv:2106.15282, 2021. 3
[12] Prafulla Dhariwal and Alex Nichol. Diffusion models beat [25] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and
gans on image synthesis. arXiv preprint arXiv:2105.05233, Honglak Lee. Inferring semantic layout for hierarchical text-
2021. 1, 3, 8 to-image synthesis. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 7986– In Proceedings of the IEEE Conference on Computer Vision
7994, 2018. 2 and Pattern Recognition, pages 4467–4477, 2017. 2
[26] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick [39] Alex Nichol and Prafulla Dhariwal. Improved de-
Forré, and Max Welling. Argmax flows and multinomial dif- noising diffusion probabilistic models. arXiv preprint
fusion: Towards non-autoregressive language models. arXiv arXiv:2102.09672, 2021. 3, 8
preprint arXiv:2102.05379, 2021. 3, 4, 5 [40] Maria-Elena Nilsback and Andrew Zisserman. Automated
[27] Yupan Huang, Hongwei Xue, Bei Liu, and Yutong Lu. Uni- flower classification over a large number of classes. In 2008
fying multimodal transformer for bi-directional image and Sixth Indian Conference on Computer Vision, Graphics &
text generation. In Proceedings of the 29th ACM Interna- Image Processing, pages 722–729. IEEE, 2008. 2, 6
tional Conference on Multimedia, pages 1138–1147, 2021. [41] Aaron van den Oord, Oriol Vinyals, and Koray
2 Kavukcuoglu. Neural discrete representation learning.
[28] Tero Karras, Samuli Laine, and Timo Aila. A style-based arXiv preprint arXiv:1711.00937, 2017. 2, 3, 12
generator architecture for generative adversarial networks. [42] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz
In Proceedings of the IEEE/CVF Conference on Computer Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Im-
Vision and Pattern Recognition, pages 4401–4410, 2019. 2 age transformer. In International Conference on Machine
[29] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Learning, pages 4055–4064. PMLR, 2018. 2
Jaakko Lehtinen, and Timo Aila. Analyzing and improv- [43] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao.
ing the image quality of stylegan. In Proceedings of the Learn, imagine and create: Text-to-image generation from
IEEE/CVF Conference on Computer Vision and Pattern prior knowledge. Advances in Neural Information Process-
Recognition, pages 8110–8119, 2020. 8 ing Systems, 32:887–897, 2019. 2
[30] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami [44] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao.
Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Ui- Mirrorgan: Learning text-to-image generation by redescrip-
jlings, Stefan Popov, Andreas Veit, et al. Openimages: A tion. In Proceedings of the IEEE/CVF Conference on Com-
public dataset for large-scale multi-label and multi-class im- puter Vision and Pattern Recognition, pages 1505–1514,
age classification. Dataset available from https://github. 2019. 2
com/openimages, 2(3):18, 2017. 6, 12 [45] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry,
[31] Qicheng Lao, Mohammad Havaei, Ahmad Pesaranghader,
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn-
Francis Dutil, Lisa Di Jorio, and Thomas Fevens. Dual ad-
ing transferable visual models from natural language super-
versarial inference for text-to-image synthesis. In Proceed-
vision. arXiv preprint arXiv:2103.00020, 2021. 6, 12
ings of the IEEE/CVF International Conference on Com-
[46] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya
puter Vision, pages 7567–7576, 2019. 2
Sutskever. Improving language understanding by generative
[32] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip HS
pre-training. 2018. 1, 2
Torr. Controllable text-to-image generation. arXiv preprint
[47] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario
arXiv:1909.07083, 2019. 2
Amodei, Ilya Sutskever, et al. Language models are unsu-
[33] Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, pervised multitask learners. OpenAI blog, 1(8):9, 2019. 1,
Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-driven 2
text-to-image synthesis via adversarial training. In Proceed-
[48] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott
ings of the IEEE/CVF Conference on Computer Vision and
Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya
Pattern Recognition, pages 12174–12182, 2019. 2
Sutskever. Zero-shot text-to-image generation. arXiv
[34] Jiadong Liang, Wenjie Pei, and Feng Lu. Cpgan: Content- preprint arXiv:2102.12092, 2021. 1, 2, 3, 7
parsing generative adversarial networks for text-to-image [49] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generat-
synthesis. In European Conference on Computer Vision, ing diverse high-fidelity images with vq-vae-2. In Advances
pages 491–508. Springer, 2020. 2 in neural information processing systems, pages 14866–
[35] Junyang Lin, Rui Men, An Yang, Chang Zhou, Ming Ding, 14876, 2019. 2
Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan [50] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Lo-
Jia, et al. M6: A chinese multimodal pretrainer. arXiv geswaran, Bernt Schiele, and Honglak Lee. Generative ad-
preprint arXiv:2103.00823, 2021. 2 versarial text to image synthesis. In International Conference
[36] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, on Machine Learning, pages 1060–1069. PMLR, 2016. 2
Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence [51] Shulan Ruan, Yong Zhang, Kun Zhang, Yanbo Fan, Fan
Zitnick. Microsoft coco: Common objects in context. In Tang, Qi Liu, and Enhong Chen. Dae-gan: Dynamic aspect-
European conference on computer vision, pages 740–755. aware gan for text-to-image synthesis. In Proceedings of the
Springer, 2014. 2, 6 IEEE/CVF International Conference on Computer Vision,
[37] Ilya Loshchilov and Frank Hutter. Decoupled weight decay pages 13960–13969, 2021. 2, 7
regularization. arXiv preprint arXiv:1711.05101, 2017. 6 [52] Chitwan Saharia, Jonathan Ho, William Chan, Tim Sal-
[38] Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovit- imans, David J Fleet, and Mohammad Norouzi. Image
skiy, and Jason Yosinski. Plug & play generative networks: super-resolution via iterative refinement. arXiv preprint
Conditional iterative generation of images in latent space. arXiv:2104.07636, 2021. 3
[53] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P [67] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang,
Kingma. Pixelcnn++: Improving the pixelcnn with dis- Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-
cretized logistic mixture likelihood and other modifications. grained text to image generation with attentional generative
arXiv preprint arXiv:1701.05517, 2017. 2 adversarial networks. In Proceedings of the IEEE conference
[54] Florian Schmidt. Generalization in generation: A closer look on computer vision and pattern recognition, pages 1316–
at exposure bias. arXiv preprint arXiv:1910.00292, 2019. 3 1324, 2018. 2, 7
[55] Christoph Schuhmann, Richard Vencu, Romain Beaumont, [68] Guojun Yin, Bin Liu, Lu Sheng, Nenghai Yu, Xiaogang
Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Wang, and Jing Shao. Semantics disentangling for text-to-
Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: image generation. In Proceedings of the IEEE/CVF Con-
Open dataset of clip-filtered 400 million image-text pairs. ference on Computer Vision and Pattern Recognition, pages
arXiv preprint arXiv:2111.02114, 2021. 6 2327–2336, 2019. 2
[56] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural [69] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and
machine translation of rare words with subword units. arXiv Yinfei Yang. Cross-modal contrastive learning for text-to-
preprint arXiv:1508.07909, 2015. 3 image generation. In Proceedings of the IEEE/CVF Con-
[57] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu ference on Computer Vision and Pattern Recognition, pages
Soricut. Conceptual captions: A cleaned, hypernymed, im- 833–842, 2021. 2
age alt-text dataset for automatic image captioning. In Pro- [70] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiao-
ceedings of the 56th Annual Meeting of the Association for gang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stack-
Computational Linguistics (Volume 1: Long Papers), pages gan: Text to photo-realistic image synthesis with stacked
2556–2565, 2018. 6 generative adversarial networks. In Proceedings of the IEEE
[58] Shikhar Sharma, Dendi Suhubdy, Vincent Michalski, international conference on computer vision, pages 5907–
Samira Ebrahimi Kahou, and Yoshua Bengio. Chatpainter: 5915, 2017. 2, 7
Improving text to image generation using dialogue. arXiv [71] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiao-
preprint arXiv:1802.08216, 2018. 2 gang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stack-
[59] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, gan++: Realistic image synthesis with stacked generative ad-
and Surya Ganguli. Deep unsupervised learning using versarial networks. IEEE transactions on pattern analysis
nonequilibrium thermodynamics. In International Confer- and machine intelligence, 41(8):1947–1962, 2018. 2, 7
ence on Machine Learning, pages 2256–2265. PMLR, 2015. [72] Zizhao Zhang, Yuanpu Xie, and Lin Yang. Photographic
1, 3, 5 text-to-image synthesis with a hierarchically-nested adver-
[60] Douglas M Souza, Jônatas Wehrmann, and Duncan D Ruiz. sarial network. In Proceedings of the IEEE Conference
Efficient neural architecture for text-to-image synthesis. In on Computer Vision and Pattern Recognition, pages 6199–
2020 International Joint Conference on Neural Networks 6208, 2018. 2
(IJCNN), pages 1–8. IEEE, 2020. 2, 7 [73] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. Dm-gan:
[61] Hongchen Tan, Xiuping Liu, Xin Li, Yi Zhang, and Baocai Dynamic memory generative adversarial networks for text-
Yin. Semantics-enhanced adversarial nets for text-to-image to-image synthesis. In Proceedings of the IEEE/CVF Con-
synthesis. In Proceedings of the IEEE/CVF International ference on Computer Vision and Pattern Recognition, pages
Conference on Computer Vision, pages 10501–10510, 2019. 5802–5810, 2019. 2, 7
2, 7
[62] Hongchen Tan, Xiuping Liu, Meng Liu, Baocai Yin, and Xin
Li. Kt-gan: knowledge-transfer generative adversarial net-
work for text-to-image synthesis. IEEE Transactions on Im-
age Processing, 30:1275–1290, 2020. 2
[63] Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan
Jing, Fei Wu, and Bingkun Bao. Df-gan: Deep fusion gener-
ative adversarial networks for text-to-image synthesis. arXiv
preprint arXiv:2008.05865, 2020. 2, 7
[64] Aaron Van Oord, Nal Kalchbrenner, and Koray
Kavukcuoglu. Pixel recurrent neural networks. In In-
ternational Conference on Machine Learning, pages
1747–1756. PMLR, 2016. 2
[65] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia
Polosukhin. Attention is all you need. In Advances in neural
information processing systems, pages 5998–6008, 2017. 1
[66] Catherine Wah, Steve Branson, Peter Welinder, Pietro Per-
ona, and Serge Belongie. The caltech-ucsd birds-200-2011
dataset. 2011. 2, 6
A. Implementation details

In our experiments on text-to-image synthesis, we adopt the public VQ-VAE [41] model provided by VQGAN [16] trained on the OpenImages [30] dataset, which downsamples images from 256×256 to 32×32. We use the CLIP [45] pretrained model (ViT-B) as our text encoder, which encodes a sentence into 77 tokens. Our diffusion image decoder consists of several transformer blocks; each block contains full attention, cross attention, and a feed-forward network (FFN). Our base model contains 19 transformer blocks, and the channel dimension of each block is 1024. The FFN contains two linear layers, which expand the dimension to 4096 in the middle layer. The model contains 370M parameters. Our small model contains 18 transformer blocks with a channel dimension of 192; its FFN contains two convolution layers with kernel size 3 and a channel expansion rate of 2. The model contains 34M parameters.

For our class-conditional generation model on ImageNet, we adopt the public VQ-VAE model provided by VQGAN trained on ImageNet, which downsamples images from 256×256 to 16×16. Our model contains 24 transformer blocks; each block contains a full attention layer and an FFN. The base channel number is 512. Besides, the FFN also uses convolutions instead of linear layers, and the channel expansion rate is 4.

B. Proof of Equation 8

Mathematical induction can be used to prove Equation 8 in the paper. When t = 1, we have

\overline{Q}_1 v(x_0) =
\begin{cases}
\alpha_1 + \beta_1, & x = x_0 \\
\beta_1, & x \neq x_0 \text{ and } x \neq K+1 \\
\gamma_1, & x = K+1
\end{cases}    (14)

which clearly holds. Suppose Equation 8 holds at step t; then for step t + 1,

\overline{Q}_{t+1} v(x_0) = Q_{t+1} \overline{Q}_t v(x_0).

When x = x_0,

\overline{Q}_{t+1} v(x_0)(x)
= \bar{\beta}_t \beta_{t+1}(K-1) + (\alpha_{t+1} + \beta_{t+1})(\bar{\alpha}_t + \bar{\beta}_t)
= \bar{\beta}_t (K\beta_{t+1} + \alpha_{t+1}) + \bar{\alpha}_t(\alpha_{t+1} + \beta_{t+1})
= \big(\bar{\beta}_t(1-\gamma_{t+1}) + \bar{\alpha}_t\beta_{t+1} - \bar{\beta}_{t+1}\big) + \bar{\alpha}_{t+1} + \bar{\beta}_{t+1}
= \frac{1}{K}\big[(1-\bar{\alpha}_t-\bar{\gamma}_t)(1-\gamma_{t+1}) + K\bar{\alpha}_t\beta_{t+1} - (1-\bar{\alpha}_{t+1}-\bar{\gamma}_{t+1})\big] + \bar{\alpha}_{t+1} + \bar{\beta}_{t+1}
= \frac{1}{K}\big[(1-\bar{\gamma}_{t+1}) - \bar{\alpha}_t(1-\gamma_{t+1}-K\beta_{t+1}) - (1-\bar{\gamma}_{t+1}) + \bar{\alpha}_{t+1}\big] + \bar{\alpha}_{t+1} + \bar{\beta}_{t+1}
= \bar{\alpha}_{t+1} + \bar{\beta}_{t+1}.

When x = K + 1,

\overline{Q}_{t+1} v(x_0)(x) = \bar{\gamma}_t + (1 - \bar{\gamma}_t)\gamma_{t+1} = 1 - (1 - \bar{\gamma}_{t+1}) = \bar{\gamma}_{t+1}.

When x ≠ x_0 and x ≠ K + 1,

\overline{Q}_{t+1} v(x_0)(x)
= \bar{\beta}_t(\alpha_{t+1} + \beta_{t+1}) + \bar{\beta}_t \beta_{t+1}(K-1) + \bar{\alpha}_t \beta_{t+1}
= \bar{\beta}_t(\alpha_{t+1} + K\beta_{t+1}) + \bar{\alpha}_t \beta_{t+1}
= \frac{1 - \bar{\alpha}_t - \bar{\gamma}_t}{K}(1 - \gamma_{t+1}) + \bar{\alpha}_t \beta_{t+1}
= \frac{1}{K}(1 - \bar{\gamma}_{t+1}) + \bar{\alpha}_t\Big(\beta_{t+1} - \frac{1 - \gamma_{t+1}}{K}\Big)
= \bar{\beta}_{t+1}.

This completes the proof.

(Figures: additional qualitative text-to-image synthesis results for various text prompts.)

Figure 7. Comparison of our results with XMC-GAN; their results are taken from their paper.
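As a quick numerical sanity check of the closed form in Equation 8 (a sketch, not part of the paper), one can compare the explicit product \overline{Q}_t = Q_t \cdots Q_1 against the three-case expression; the per-step α_t, β_t, γ_t below are arbitrary placeholder values satisfying α_t = 1 − Kβ_t − γ_t.

```python
import numpy as np

def Q_step(K, alpha, beta, gamma):
    """Single-step mask-and-replace transition matrix Q_t (Eq. 7)."""
    Q = np.full((K + 1, K + 1), beta)
    np.fill_diagonal(Q, alpha + beta)
    Q[K, :K] = gamma
    Q[:K, K] = 0.0
    Q[K, K] = 1.0
    return Q

K, T = 6, 5
rng = np.random.default_rng(1)
gammas = rng.uniform(0.05, 0.15, T)
betas = rng.uniform(0.01, 0.03, T)
alphas = 1.0 - K * betas - gammas

# Explicit cumulative product Q_bar_T = Q_T ... Q_1.
Q_bar = np.eye(K + 1)
for a, b, g in zip(alphas, betas, gammas):
    Q_bar = Q_step(K, a, b, g) @ Q_bar

# Closed form of Eq. 8.
alpha_bar = np.prod(alphas)
gamma_bar = 1.0 - np.prod(1.0 - gammas)
beta_bar = (1.0 - alpha_bar - gamma_bar) / K

x0 = 2                                    # any ordinary token index
closed = np.full(K + 1, beta_bar)
closed[x0] = alpha_bar + beta_bar
closed[K] = gamma_bar

assert np.allclose(Q_bar[:, x0], closed)  # Eq. 8 matches the explicit product
```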