
Vector Quantized Diffusion Model for Text-to-Image Synthesis

Shuyang Gu¹  Dong Chen²  Jianmin Bao²  Fang Wen²
Bo Zhang²  Dongdong Chen³  Lu Yuan³  Baining Guo²

¹University of Science and Technology of China  ²Microsoft Research  ³Microsoft Cloud+AI
[email protected]  {doch,jianbao,fangwen,zhanbo,dochen,luyuan,bainguo}@microsoft.com
arXiv:2111.14822v3 [cs.CV] 3 Mar 2022

Abstract

We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation. This method is based on a vector quantized variational autoencoder (VQ-VAE) whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). We find that this latent-space method is well-suited for text-to-image generation tasks because it not only eliminates the unidirectional bias of existing methods but also allows us to incorporate a mask-and-replace diffusion strategy to avoid the accumulation of errors, which is a serious problem with existing methods. Our experiments show that the VQ-Diffusion produces significantly better text-to-image generation results when compared with conventional autoregressive (AR) models with similar numbers of parameters. Compared with previous GAN-based text-to-image methods, our VQ-Diffusion can handle more complex scenes and improve the synthesized image quality by a large margin. Finally, we show that the image generation computation in our method can be made highly efficient by reparameterization. With traditional AR methods, the text-to-image generation time increases linearly with the output image resolution and hence is quite time consuming even for normal-size images. The VQ-Diffusion allows us to achieve a better trade-off between quality and speed. Our experiments indicate that the VQ-Diffusion model with the reparameterization is fifteen times faster than traditional AR methods while achieving a better image quality. The code and models are available at https://github.com/cientgu/VQ-Diffusion.

1. Introduction

The recent success of the Transformer [11, 65] in natural language processing (NLP) has raised tremendous interest in using successful language models for computer vision tasks. The autoregressive (AR) model [4, 46, 47] is one of the most natural and popular approaches to transfer from text-to-text generation (i.e., machine translation) to text-to-image generation. Based on the AR model, the recent work DALL-E [48] has achieved impressive results for text-to-image generation.

Despite their success, existing text-to-image generation methods still have weaknesses that need to be improved. One issue is the unidirectional bias. Existing methods predict pixels or tokens in reading order, from top-left to bottom-right, based on attention to all prefix pixels/tokens and the text description. This fixed order introduces an unnatural bias in the synthesized images because important contextual information may come from any part of the image, not just from the left or above. Another issue is the accumulated prediction errors. Each step of the inference stage is performed based on previously sampled tokens; this is different from the training stage, which relies on the so-called "teacher-forcing" practice [15] and provides the ground truth for each step. This difference is important and its consequence merits careful examination. In particular, a token in the inference stage, once predicted, cannot be corrected and its errors will propagate to the subsequent tokens.

We present the vector quantized diffusion (VQ-Diffusion) model for text-to-image generation, a model that eliminates the unidirectional bias and avoids accumulated prediction errors. We start with a vector quantized variational autoencoder (VQ-VAE) and model its latent space by learning a parametric model using a conditional variant of the Denoising Diffusion Probabilistic Model (DDPM) [23, 59], which has been applied to image synthesis with compelling results [12]. We show that the latent-space model is well-suited for the task of text-to-image generation. Roughly speaking, the VQ-Diffusion model samples the data distribution by reversing a forward diffusion process that gradually corrupts the input via a fixed Markov chain. The forward process yields a sequence of increasingly noisy latent variables of the same dimensionality as the input, producing pure noise after a fixed number of timesteps. Starting from this noise result, the reverse process gradually denoises the latent variables towards the desired data distribution by learning the conditional transition distribution.
The VQ-Diffusion model eliminates the unidirectional bias. It consists of an independent text encoder and a diffusion image decoder, which performs denoising diffusion on discrete image tokens. At the beginning of the inference stage, all image tokens are either masked or random. Here the masked token serves the same function as those in mask-based generative models [11]. The denoising diffusion process gradually estimates the probability density of image tokens step by step based on the input text. In each step, the diffusion image decoder leverages the contextual information of all tokens of the entire image predicted in the previous step to estimate a new probability density distribution and uses this distribution to predict the tokens in the current step. This bidirectional attention provides global context for each token prediction and eliminates the unidirectional bias.

The VQ-Diffusion model, with its mask-and-replace diffusion strategy, also avoids the accumulation of errors. In the training stage, we do not use the "teacher-forcing" strategy. Instead, we deliberately introduce both masked tokens and random tokens and let the network learn to predict the masked tokens and modify incorrect tokens. In the inference stage, we update the density distribution of all tokens in each step and resample all tokens according to the new distribution. Thus we can modify the wrong tokens and prevent error accumulation. Compared to the conventional replace-only diffusion strategy for unconditional image generation [1], the masked tokens effectively direct the network's attention to the masked areas and thus greatly reduce the number of token combinations to be examined by the network. This mask-and-replace diffusion strategy significantly accelerates the convergence of the network.

To assess the performance of the VQ-Diffusion method, we conduct text-to-image generation experiments on a wide variety of datasets, including CUB-200 [66], Oxford-102 [40], and MSCOCO [36]. Compared with AR models with similar numbers of model parameters, our method achieves significantly better results, as measured by both image quality metrics and visual examination, and is much faster. Compared with previous GAN-based text-to-image methods [67, 70, 71, 73], our method can handle more complex scenes and the synthesized image quality is improved by a large margin. Compared with extremely large models (models with ten times more parameters than ours), including DALL-E [48] and CogView [13], our model achieves comparable or better results for specific types of images, i.e., the types of images that our model has seen during the training stage. Furthermore, our method is general and produces strong results in our experiments on both unconditional and conditional image generation with the FFHQ [28] and ImageNet [10] datasets.

The VQ-Diffusion model also provides important benefits for the inference speed. With traditional AR methods, the inference time increases linearly with the output image resolution and the image generation is quite time consuming even for normal-size images (e.g., images larger than small thumbnail images of 64 × 64 pixels). The VQ-Diffusion provides the global context for each token prediction and makes it independent of the image resolution. This allows us to provide an effective way to achieve a better tradeoff between the inference speed and the image quality by a simple reparameterization of the diffusion image decoder. Specifically, in each step, we ask the decoder to predict the original noise-free image instead of the noise-reduced image of the next denoising diffusion step. Through experiments we have found that the VQ-Diffusion method with reparameterization can be fifteen times faster than AR methods while achieving a better image quality.

2. Related Work

GAN-based Text-to-image generation. In the past few years, Generative Adversarial Networks (GANs) [18] have shown promising results on many tasks [19–21], especially text-to-image generation [5, 8, 9, 14, 17, 25, 27, 31–34, 38, 43, 44, 50, 51, 58, 60–63, 67–73]. GAN-INT-CLS [50] was the first to use a conditional GAN formulation for text-to-image generation. Based on this formulation, some approaches [34, 44, 67–71, 73] were proposed to further improve the generation quality. These models generate high fidelity images on single-domain datasets, e.g., birds [66] and flowers [40]. However, due to the inductive bias on the locality of convolutional neural networks, they struggle on complex scenes with multiple objects, such as those in the MS-COCO dataset [36].

Other works [25, 33] adopt a two-step process which first infers the semantic layout and then generates different objects, but this kind of method requires fine-grained object labels, e.g., object bounding boxes or segmentation maps.

Autoregressive Models. AR models [4, 46, 47] have shown a powerful capability of density estimation and have recently been applied to image generation [7, 16, 41, 42, 49, 53, 64]. PixelRNN [53, 64], Image Transformer [42] and ImageGPT [7] factorize the probability density of an image over raw pixels. Thus, they only generate low-resolution images, like 64 × 64, due to the unaffordable amount of computation for large images.

VQ-VAE [41, 49], VQGAN [16] and ImageBART [15] train an encoder to compress the image into a low-dimensional discrete latent space and fit the density of the hidden variables. This greatly improves the performance of image generation.

DALL-E [48], CogView [13] and M6 [35] propose AR-based text-to-image frameworks. They model the joint distribution of text and image tokens. With powerful large transformer structures and massive text-image pairs, they greatly advance the quality of text-to-image generation, but still have the weaknesses of unidirectional bias and accumulated prediction errors due to the limitation of AR models.
Denoising Diffusion Probabilistic Models. Diffusion generative models were first proposed in [59] and have recently achieved strong results on image generation [12, 23, 24, 39] and image super-resolution [52]. However, most previous works only considered continuous diffusion models on the raw image pixels. Discrete diffusion models were also first described in [59], and were then applied to text generation in Argmax Flow [26]. D3PMs [1] apply discrete diffusion to image generation. However, they also estimate the density of raw image pixels and can only generate low-resolution (e.g., 32 × 32) images.

3. Background: Learning Discrete Latent Space of Images Via VQ-VAE

Transformer architectures have shown great promise in image synthesis due to their outstanding expressivity [7, 16, 48]. In this work, we aim to leverage the transformer to learn the mapping from text to image. Since the computation cost is quadratic in the sequence length, it is computationally prohibitive to directly model raw pixels using transformers. To address this issue, recent works [16, 41] propose to represent an image by discrete image tokens with a reduced sequence length. A transformer can then be effectively trained upon this reduced context length and learn the translation from the text to the image tokens.

Formally, a vector quantized variational autoencoder (VQ-VAE) [41] is employed. The model consists of an encoder E, a decoder G and a codebook Z = {z_k}_{k=1}^{K} ∈ R^{K×d} containing a finite number of embedding vectors, where K is the size of the codebook and d is the dimension of the codes. Given an image x ∈ R^{H×W×3}, we obtain a spatial collection of image tokens z_q with the encoder z = E(x) ∈ R^{h×w×d} and a subsequent spatial-wise quantizer Q(·) which maps each spatial feature z_{ij} into its closest codebook entry z_k:

    z_q = Q(z) = argmin_{z_k ∈ Z} ||z_{ij} − z_k||_2^2  ∈ R^{h×w×d},        (1)

where h × w represents the encoded sequence length, which is usually much smaller than H × W. The image can then be faithfully reconstructed via the decoder, i.e., x̃ = G(z_q). Hence, image synthesis is equivalent to sampling image tokens from the latent distribution. Note that the image tokens are quantized latent variables in the sense that they take discrete values. The encoder E, the decoder G and the codebook Z can be trained end-to-end via the following loss function:

    L_VQVAE = ||x − x̃||_1 + ||sg[E(x)] − z_q||_2^2 + β ||sg[z_q] − E(x)||_2^2,        (2)

where sg[·] stands for the stop-gradient operation. In practice, we replace the second term of Equation 2 with an exponential moving average (EMA) [41] to update the codebook entries, which is proven to work better than directly using the loss function.
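To make the quantization step of Equation 1 concrete, here is a minimal PyTorch sketch of the nearest-codebook lookup with a straight-through gradient. The shapes, the function name and the toy codebook size are illustrative assumptions, not the authors' implementation (the paper directly adopts a pretrained VQGAN codebook).

```python
import torch

def vector_quantize(z, codebook):
    """Nearest-codebook lookup of Eq. (1).

    z:        (B, h, w, d) continuous encoder features E(x).
    codebook: (K, d) embedding vectors {z_k}.
    Returns the token indices (B, h, w) and the quantized features z_q.
    """
    B, h, w, d = z.shape
    flat = z.reshape(-1, d)                                  # (B*h*w, d)
    # Squared Euclidean distance to every codebook entry.
    dist = (flat.pow(2).sum(1, keepdim=True)
            - 2 * flat @ codebook.t()
            + codebook.pow(2).sum(1))                        # (B*h*w, K)
    tokens = dist.argmin(dim=1)                              # each token index in {0, ..., K-1}
    z_q = codebook[tokens].reshape(B, h, w, d)
    # Straight-through estimator: gradients flow to the encoder as if Q(.) were the identity.
    z_q = z + (z_q - z).detach()
    return tokens.reshape(B, h, w), z_q

# Toy usage: a 32x32 grid of 256-dim features with an assumed codebook of K = 2886 entries.
codebook = torch.randn(2886, 256)
z = torch.randn(2, 32, 32, 256, requires_grad=True)
tokens, z_q = vector_quantize(z, codebook)
print(tokens.shape, z_q.shape)   # torch.Size([2, 32, 32]) torch.Size([2, 32, 32, 256])
```

In an EMA variant, the codebook entries would be updated from the assigned encoder features rather than by the second loss term, as noted above.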
4. Vector Quantized Diffusion Model

Given the text-image pairs, we obtain the discrete image tokens x ∈ Z^N with a pretrained VQ-VAE, where N = hw represents the sequence length of tokens. Suppose the size of the VQ-VAE codebook is K; the image token x^i at location i takes the index that specifies the entry in the codebook, i.e., x^i ∈ {1, 2, ..., K}. On the other hand, the text tokens y ∈ Z^M can be obtained through BPE-encoding [56]. The overall text-to-image framework can be viewed as maximizing the conditional transition distribution q(x|y).

Previous autoregressive models, e.g., DALL-E [48] and CogView [13], sequentially predict each image token depending on the text tokens as well as the previously predicted image tokens, i.e., q(x|y) = ∏_{i=1}^{N} q(x^i | x^1, · · · , x^{i−1}, y). While achieving remarkable quality in text-to-image synthesis, there exist several limitations of autoregressive modeling. First, image tokens are predicted in a unidirectional ordering, e.g., raster scan, which neglects the structure of 2D data and restricts the expressivity for image modeling, since the prediction of a specific location should not merely attend to the context on the left or above. Second, there is a train-test discrepancy, as the training employs the ground truth whereas the inference relies on the predictions as previous tokens. The so-called "teacher-forcing" practice [15] or exposure bias [54] leads to error accumulation due to mistakes in the earlier sampling. Moreover, it requires a forward pass of the network to predict each token, which consumes an inordinate amount of time even for sampling in a latent space of low resolution (i.e., 32 × 32), making the AR model impractical for real usage.

We aim to model the VQ-VAE latent space in a non-autoregressive manner. The proposed VQ-Diffusion method maximizes the probability q(x|y) with the diffusion model [23, 59], an emerging approach that produces compelling quality on image synthesis [12]. While the majority of recent works focus on continuous diffusion models, using them for categorical distributions is much less researched [1, 26]. In this work, we propose to use a conditional variant of the discrete diffusion process for text-to-image generation. We will subsequently introduce the discrete diffusion process, inspired by masked language modeling (MLM) [11], and then discuss how to train a neural network to reverse this process.
Figure 1. Overall framework of our method. It starts with the VQ-VAE. Then, the VQ-Diffusion models the discrete latent space by reversing a forward diffusion process that gradually corrupts the input via a fixed Markov chain.

4.1. Discrete diffusion process

On a high level, the forward diffusion process gradually corrupts the image data x_0 via a fixed Markov chain q(x_t | x_{t−1}), e.g., by randomly replacing some tokens of x_{t−1}. After a fixed number of T timesteps, the forward process yields a sequence of increasingly noisy latent variables z_1, ..., z_T of the same dimensionality as z_0, and z_T becomes pure noise tokens. Starting from the noise z_T, the reverse process gradually denoises the latent variables and restores the real data x_0 by sampling from the reverse distribution q(x_{t−1} | x_t, x_0) sequentially. However, since x_0 is unknown in the inference stage, we train a transformer network to approximate the conditional transition distribution p_θ(x_{t−1} | x_t, y), which depends on the entire data distribution.

To be more specific, consider a single image token x_0^i of x_0 at location i, which takes the index that specifies the entry in the codebook, i.e., x_0^i ∈ {1, 2, ..., K}. Without introducing confusion, we omit the superscript i in the following description. We define the probabilities that x_{t−1} transits to x_t using the matrices [Q_t]_{mn} = q(x_t = m | x_{t−1} = n) ∈ R^{K×K}. Then the forward Markov diffusion process for the whole token sequence can be written as

    q(x_t | x_{t−1}) = v^T(x_t) Q_t v(x_{t−1}),        (3)

where v(x) is a one-hot column vector of length K whose entry x is 1. The categorical distribution over x_t is given by the vector Q_t v(x_{t−1}).

Importantly, due to the Markov property, one can marginalize out the intermediate steps and derive the probability of x_t at an arbitrary timestep directly from x_0 as

    q(x_t | x_0) = v^T(x_t) Q̄_t v(x_0),  with  Q̄_t = Q_t · · · Q_1.        (4)

Besides, another notable characteristic is that by conditioning on z_0, the posterior of this diffusion process is tractable, i.e.,

    q(x_{t−1} | x_t, x_0) = q(x_t | x_{t−1}, x_0) q(x_{t−1} | x_0) / q(x_t | x_0)
                          = ( v^T(x_t) Q_t v(x_{t−1}) ) ( v^T(x_{t−1}) Q̄_{t−1} v(x_0) ) / ( v^T(x_t) Q̄_t v(x_0) ).        (5)

The transition matrix Q_t is crucial to the discrete diffusion model and should be carefully designed such that it is not too difficult for the reverse network to recover the signal from the noise.

Previous works [1, 26] propose to introduce a small amount of uniform noise into the categorical distribution, so that the transition matrix can be formulated as

    Q_t = [ α_t + β_t    β_t          · · ·    β_t
            β_t          α_t + β_t    · · ·    β_t
            ⋮            ⋮            ⋱        ⋮
            β_t          β_t          · · ·    α_t + β_t ],        (6)

with α_t ∈ [0, 1] and β_t = (1 − α_t)/K. Each token has a probability of (α_t + β_t) to remain the same value at the current step, and a probability of Kβ_t to be resampled uniformly over all the K categories.

Nonetheless, data corruption using uniform diffusion is a somewhat aggressive process that may pose challenges for the reverse estimation. First, as opposed to the Gaussian diffusion process for ordinal data, an image token may be replaced by an utterly uncorrelated category, which leads to an abrupt semantic change for that token. Second, the network has to take extra effort to figure out which tokens have been replaced before fixing them. In fact, due to the semantic conflict within the local context, the reverse estimation for different image tokens may form a competition and run into the dilemma of identifying the reliable tokens.

Mask-and-replace diffusion strategy. To solve the above issues of uniform diffusion, we draw inspiration from mask language modeling [11] and propose to corrupt the tokens by stochastically masking some of them, so that the corrupted locations can be explicitly known by the reverse network. Specifically, we introduce an additional special token, the [MASK] token, so each token now has (K + 1) discrete states. We define the mask diffusion as follows: each ordinary token has a probability of γ_t to be replaced by the [MASK] token and a chance of Kβ_t to be uniformly diffused, leaving a probability of α_t = 1 − Kβ_t − γ_t to be unchanged, whereas the [MASK] token always keeps its own state. Hence, we can formulate the transition matrix Q_t ∈ R^{(K+1)×(K+1)} as

    Q_t = [ α_t + β_t    β_t          β_t          · · ·    0
            β_t          α_t + β_t    β_t          · · ·    0
            β_t          β_t          α_t + β_t    · · ·    0
            ⋮            ⋮            ⋮            ⋱        ⋮
            γ_t          γ_t          γ_t          · · ·    1 ].        (7)
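As a concrete illustration of Equation 7, the sketch below builds the (K+1)×(K+1) mask-and-replace transition matrix in PyTorch and checks that every column is a valid categorical distribution. The tiny codebook size and schedule values are assumptions made for readability.

```python
import torch

def mask_and_replace_Q(alpha_t, beta_t, gamma_t, K):
    """Transition matrix Q_t of Eq. (7), size (K+1) x (K+1).

    Column n gives q(x_t = . | x_{t-1} = n).  Index K is the [MASK] state.
    Requires alpha_t = 1 - K * beta_t - gamma_t.
    """
    Q = torch.full((K + 1, K + 1), beta_t)
    Q[torch.arange(K), torch.arange(K)] = alpha_t + beta_t  # stay on the same ordinary token
    Q[K, :K] = gamma_t                                      # ordinary token -> [MASK]
    Q[:, K] = 0.0                                           # [MASK] never leaves its state...
    Q[K, K] = 1.0                                           # ...it is an absorbing state
    return Q

K = 4                        # toy codebook size; the paper uses K = 2886
beta, gamma = 0.05, 0.1
alpha = 1.0 - K * beta - gamma
Q = mask_and_replace_Q(alpha, beta, gamma, K)
print(Q)
print(Q.sum(dim=0))          # every column sums to 1
```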
The benefits of this mask-and-replace transition are that: 1) the corrupted tokens are distinguishable to the network, which eases the reverse process; 2) compared to the mask-only approach in [1], we theoretically prove that it is necessary to include a small amount of uniform noise besides the token masking, otherwise we get a trivial posterior when x_t ≠ x_0; 3) the random token replacement forces the network to understand the context rather than only focusing on the [MASK] tokens; 4) the cumulative transition matrix Q̄_t and the probability q(x_t | x_0) in Equation 4 can be computed in closed form with

    Q̄_t v(x_0) = ᾱ_t v(x_0) + (γ̄_t − β̄_t) v(K + 1) + β̄_t,        (8)

where ᾱ_t = ∏_{i=1}^{t} α_i, γ̄_t = 1 − ∏_{i=1}^{t} (1 − γ_i), and β̄_t = (1 − ᾱ_t − γ̄_t)/K can be calculated and stored in advance. Thus, the computation cost of q(x_t | x_0) is reduced from O(tK^2) to O(K). The proof is given in the supplemental material.
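The closed form in Equation 8 means that a noisy sample x_t can be drawn directly from x_0 without multiplying transition matrices. Below is a minimal PyTorch sketch under that reading; the schedule values are placeholders and taking K as the [MASK] index is an assumption made for illustration.

```python
import torch

def q_xt_given_x0(x0, alpha_bar, gamma_bar, K):
    """Per-token categorical distribution q(x_t | x_0) from Eq. (8).

    x0: LongTensor of token indices in {0, ..., K-1}; index K plays the role of [MASK].
    Returns probabilities of shape (*x0.shape, K + 1).
    """
    beta_bar = (1.0 - alpha_bar - gamma_bar) / K
    probs = torch.full((*x0.shape, K + 1), beta_bar)     # beta_bar on every state
    probs[..., K] = gamma_bar                            # [MASK] entry: (gamma_bar - beta_bar) + beta_bar
    probs.scatter_add_(-1, x0.unsqueeze(-1),
                       torch.full((*x0.shape, 1), alpha_bar))  # extra alpha_bar on the original token
    return probs

def sample_xt(x0, alpha_bar, gamma_bar, K):
    probs = q_xt_given_x0(x0, alpha_bar, gamma_bar, K)
    return torch.multinomial(probs.reshape(-1, K + 1), 1).reshape(x0.shape)

# Toy usage: with alpha_bar = 0.3 and gamma_bar = 0.5, roughly half of the tokens become [MASK].
x0 = torch.randint(0, 2886, (2, 32 * 32))
xt = sample_xt(x0, alpha_bar=0.3, gamma_bar=0.5, K=2886)
print((xt == 2886).float().mean())
```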
4.2. Learning the reverse process

To reverse the diffusion process, we train a denoising network p_θ(x_{t−1} | x_t, y) to estimate the posterior transition distribution q(x_{t−1} | x_t, x_0). The network is trained to minimize the variational lower bound (VLB) [59]:

    L_vlb = L_0 + L_1 + · · · + L_{T−1} + L_T,
    L_0 = − log p_θ(x_0 | x_1, y),
    L_{t−1} = D_KL( q(x_{t−1} | x_t, x_0) || p_θ(x_{t−1} | x_t, y) ),
    L_T = D_KL( q(x_T | x_0) || p(x_T) ),        (9)

where p(x_T) is the prior distribution at timestep T. For the proposed mask-and-replace diffusion, the prior is

    p(x_T) = [ β̄_T, β̄_T, · · · , β̄_T, γ̄_T ]^T.        (10)

Note that since the transition matrix Q_t is fixed in the training, L_T is a constant which measures the gap between training and inference and can be ignored in the training.

Reparameterization trick on discrete stage. The network parameterization affects the synthesis quality significantly. Instead of directly predicting the posterior q(x_{t−1} | x_t, x_0), recent works [1, 23, 26] find that approximating some surrogate variables, e.g., the noiseless target data q(x_0), gives better quality. In the discrete setting, we let the network predict the noiseless token distribution p_θ(x̃_0 | x_t, y) at each reverse step. We can thus compute the reverse transition distribution according to

    p_θ(x_{t−1} | x_t, y) = Σ_{x̃_0 = 1}^{K} q(x_{t−1} | x_t, x̃_0) p_θ(x̃_0 | x_t, y).        (11)

Based on the reparameterization trick, we can introduce an auxiliary denoising objective, which encourages the network to predict the noiseless token x_0:

    L_{x_0} = − log p_θ(x_0 | x_t, y).        (12)

We find that combining this loss with L_vlb improves the image quality.
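The mixture in Equation 11 combines the tractable posterior of Equation 5 with the network's prediction of x̃_0. The following PyTorch sketch shows the computation for a single token using the transition matrices directly (the efficient implementation would use the closed form of Equation 8 instead); the toy matrices, shapes and function names are assumptions for illustration.

```python
import torch

def posterior_q(xt, Qt, Qbar_tm1, Qbar_t):
    """q(x_{t-1} | x_t, x0) for every possible x0, from Eq. (5).

    xt:        index of the current noisy token (0 ... K, where K is [MASK]).
    Qt:        (K+1, K+1) one-step matrix with Qt[m, n] = q(x_t = m | x_{t-1} = n).
    Qbar_tm1 / Qbar_t: cumulative matrices up to t-1 and t.
    Returns a (K+1, K+1) tensor P with P[k, x0] = q(x_{t-1} = k | x_t, x0).
    """
    num = Qt[xt, :].unsqueeze(1) * Qbar_tm1      # q(x_t | x_{t-1}=k) * q(x_{t-1}=k | x0)
    den = Qbar_t[xt, :].unsqueeze(0)             # q(x_t | x0)
    return num / den.clamp_min(1e-30)

def p_theta_prev(xt, p_x0, Qt, Qbar_tm1, Qbar_t):
    """Eq. (11): mix the posteriors with the predicted noiseless distribution p_theta(x0 | x_t, y).
    p_x0 has length K+1; in practice its [MASK] entry is zero because x0 contains no masks."""
    P = posterior_q(xt, Qt, Qbar_tm1, Qbar_t)    # (K+1, K+1)
    return P @ p_x0                              # distribution over x_{t-1}

# Toy check: K = 4 ordinary tokens plus [MASK] (index 4), same schedule at every step.
K, beta, gamma = 4, 0.05, 0.1
alpha = 1.0 - K * beta - gamma
Q = torch.full((K + 1, K + 1), beta)
Q[torch.arange(K), torch.arange(K)] = alpha + beta
Q[K, :K] = gamma
Q[:, K] = 0.0
Q[K, K] = 1.0
Qbar_1, Qbar_2 = Q, Q @ Q
p_x0 = torch.zeros(K + 1); p_x0[2] = 1.0         # pretend the network is certain that x0 = 2
out = p_theta_prev(xt=K, p_x0=p_x0, Qt=Q, Qbar_tm1=Qbar_1, Qbar_t=Qbar_2)
print(out, out.sum())                            # a valid distribution over x_{t-1}
```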
Model architecture. We propose an encoder-decoder transformer to estimate the distribution p_θ(x̃_0 | x_t, y). As shown in Figure 1, the framework contains two parts: a text encoder and a diffusion image decoder. Our text encoder takes the text tokens y and yields a conditional feature sequence. The diffusion image decoder takes the image tokens x_t and the timestep t and outputs the noiseless token distribution p_θ(x̃_0 | x_t, y). The decoder contains several transformer blocks and a softmax layer. Each transformer block contains a full attention, a cross attention to combine the text information, and a feed-forward network block. The current timestep t is injected into the network with the Adaptive Layer Normalization [2] (AdaLN) operator, i.e., AdaLN(h, t) = a_t LayerNorm(h) + b_t, where h is the intermediate activations and a_t, b_t are obtained from a linear projection of the timestep embedding.
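A minimal PyTorch module for the AdaLN operator described above could look as follows; the embedding dimensionality and the use of a learned embedding table for the timestep are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """AdaLN(h, t) = a_t * LayerNorm(h) + b_t, with (a_t, b_t) obtained from a
    linear projection of the timestep embedding."""
    def __init__(self, dim, num_timesteps):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.t_emb = nn.Embedding(num_timesteps, dim)   # assumed learned timestep embedding
        self.proj = nn.Linear(dim, 2 * dim)

    def forward(self, h, t):
        # h: (B, N, dim) token activations, t: (B,) integer timesteps
        a_t, b_t = self.proj(self.t_emb(t)).chunk(2, dim=-1)   # each (B, dim)
        return a_t.unsqueeze(1) * self.norm(h) + b_t.unsqueeze(1)

layer = AdaLN(dim=1024, num_timesteps=100)
h = torch.randn(2, 32 * 32, 1024)
out = layer(h, torch.tensor([10, 99]))
print(out.shape)   # torch.Size([2, 1024, 1024])
```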

Fast inference strategy. In the inference stage, by leveraging the reparameterization trick, we can skip some steps of the diffusion model to achieve faster inference. Specifically, assuming the time stride is Δ_t, instead of sampling images in the chain x_T, x_{T−1}, x_{T−2}, ..., x_0, we sample images in the chain x_T, x_{T−Δ_t}, x_{T−2Δ_t}, ..., x_0 with the reverse transition distribution

    p_θ(x_{t−Δ_t} | x_t, y) = Σ_{x̃_0 = 1}^{K} q(x_{t−Δ_t} | x_t, x̃_0) p_θ(x̃_0 | x_t, y).        (13)

We find that this makes the sampling more efficient while causing only little harm to quality. The whole training and inference procedures are shown in Algorithm 1 and Algorithm 2.

Algorithm 1 Training of the VQ-Diffusion, given transition matrices {Q_t}, initial network parameters θ, loss weight λ, learning rate η.
1: repeat
2:   (I, s) ← sample training image-text pair
3:   x_0 ← VQVAE-Encoder(I), y ← BPE(s)
4:   t ∼ Uniform({1, · · · , T})
5:   x_t ← sample from q(x_t | x_0)                     ▷ Eqn. 4 and 8
6:   L ← L_0 if t = 1, otherwise L_{t−1} + λ L_{x_0}    ▷ Eqn. 9 and 12
7:   θ ← θ − η ∇_θ L                                    ▷ Update network parameters
8: until converged
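Algorithm 1 could be realized as the following PyTorch training step. All callables passed in are hypothetical stand-ins for the corresponding components; the loss helpers are named placeholders rather than full implementations of the VLB term.

```python
import torch

def train_step(denoiser, optimizer, image, caption, T, lam,
               vqvae_encode, bpe_encode, vlb_term, aux_x0_nll, sample_q_xt):
    """One optimization step following Algorithm 1.
    vqvae_encode / bpe_encode tokenize the inputs, sample_q_xt draws x_t ~ q(x_t | x_0)
    (Eq. 8), vlb_term computes L_0 or L_{t-1} (Eq. 9), and aux_x0_nll computes
    L_{x0} = -log p_theta(x_0 | x_t, y) (Eq. 12)."""
    x0 = vqvae_encode(image)                       # (B, N) discrete image tokens
    y = bpe_encode(caption)                        # (B, M) text tokens
    t = torch.randint(1, T + 1, (x0.shape[0],), device=x0.device)
    xt = sample_q_xt(x0, t)                        # corrupt with the mask-and-replace chain
    logits_x0 = denoiser(xt, y, t)                 # p_theta(x0_tilde | x_t, y), shape (B, N, K)
    loss = vlb_term(logits_x0, x0, xt, t)          # L_0 when t == 1, else L_{t-1}
    loss = loss + lam * aux_x0_nll(logits_x0, x0)  # auxiliary denoising objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```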
Algorithm 2 Inference of the VQ-Diffusion, given fast inference time stride Δ_t and input text s.
1: t ← T, y ← BPE(s)
2: x_t ← sample from p(x_T)                             ▷ Eqn. 10
3: while t > 0 do
4:   x_t ← sample from p_θ(x_{t−Δ_t} | x_t, y)          ▷ Eqn. 13
5:   t ← t − Δ_t
6: end while
7: return VQVAE-Decoder(x_t)
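A sketch of Algorithm 2 in PyTorch, starting from the prior of Equation 10 and stepping with stride Δ_t. The helpers `p_theta_step` (which would implement Equation 13) and `vqvae_decode` are hypothetical, and the schedule values are placeholders.

```python
import torch

@torch.no_grad()
def vq_diffusion_sample(denoiser, p_theta_step, vqvae_decode, text_tokens,
                        T=100, stride=4, N=32 * 32, K=2886,
                        beta_bar_T=0.1 / 2886, gamma_bar_T=0.9):
    """Inference following Algorithm 2.
    p_theta_step(denoiser, x_t, y, t, t_prev) should return a sample of
    x_{t_prev} ~ p_theta(x_{t_prev} | x_t, y) as in Eq. (13)."""
    # Prior p(x_T) of Eq. (10): beta_bar_T on every ordinary token, gamma_bar_T on [MASK].
    prior = torch.full((K + 1,), beta_bar_T)
    prior[K] = gamma_bar_T
    x_t = torch.multinomial(prior, num_samples=N, replacement=True)  # (N,) token indices

    t = T
    while t > 0:
        t_prev = max(t - stride, 0)
        x_t = p_theta_step(denoiser, x_t, text_tokens, t, t_prev)
        t = t_prev
    return vqvae_decode(x_t)          # map the clean tokens x_0 back to an image
```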
5. Experiments

In this section, we first introduce the overall experiment setup and then present extensive results to demonstrate the superiority of our approach in text-to-image synthesis. Finally, we point out that our method is a general image synthesis framework that achieves great performance on other generation tasks, including unconditional and class-conditional image synthesis.

Datasets. To demonstrate the capability of our proposed method for text-to-image synthesis, we conduct experiments on the CUB-200 [66], Oxford-102 [40], and MSCOCO [36] datasets. The CUB-200 dataset contains 8855 training images and 2933 test images belonging to 200 bird species. The Oxford-102 dataset contains 8189 images of flowers from 102 categories. Each image in the CUB-200 and Oxford-102 datasets has 10 text descriptions. The MSCOCO dataset contains 82k images for training and 40k images for testing. Each image in this dataset has five text descriptions.

To further demonstrate the scalability of our method, we also train our model on large scale datasets, including Conceptual Captions [6, 57] and LAION-400M [55]. The Conceptual Captions dataset, including both the CC3M [57] and CC12M [6] datasets, contains 15M images. To balance the text and image distribution, we filter a 7M subset according to the word frequency. The LAION-400M dataset contains 400M image-text pairs. We train our model on three subsets from LAION, i.e., cartoon, icon, and human, which contain 0.9M, 1.3M, and 42M images, respectively. For each subset, we filter the data according to the text.

Training details. Our VQ-VAE's encoder and decoder follow the setting of VQGAN [16], which leverages a GAN loss to get more realistic images. We directly adopt the publicly available VQGAN model trained on the OpenImages [30] dataset for all text-to-image synthesis experiments. It converts 256×256 images into 32×32 tokens. The codebook size is K = 2886 after removing useless codes. We adopt the publicly available tokenizer of the CLIP model [45] as the text encoder, yielding a conditional sequence of length 77. We fix both the image and text encoders in our training.

For a fair comparison with previous text-to-image methods under similar numbers of parameters, we build two different diffusion image decoder settings: 1) VQ-Diffusion-S (Small), which contains 18 transformer blocks with a dimension of 192 and has 34M parameters; 2) VQ-Diffusion-B (Base), which contains 19 transformer blocks with a dimension of 1024 and has 370M parameters.

In order to show the scalability of our method, we also train our base model on the larger Conceptual Captions database, and then fine-tune it on each database. This model is denoted as VQ-Diffusion-F.

For the default setting, we set the number of timesteps T = 100 and the loss weight λ = 0.0005. For the transition matrix, we linearly increase γ̄_t and β̄_t from 0 to 0.9 and 0.1, respectively. We optimize our network using AdamW [37] with β_1 = 0.9 and β_2 = 0.96. The learning rate is set to 0.00045 after 5000 iterations of warmup. More training details are provided in the appendix.
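One possible reading of the schedule above is sketched below: the cumulative γ̄_t and the total uniform-replacement mass Kβ̄_t grow linearly from about 0 to 0.9 and 0.1 over T = 100 steps, and the per-step α_t, β_t, γ_t are recovered from the products that define the bars in Equation 8. Interpreting the 0.1 as K·β̄_T is an assumption made here so that the probabilities remain valid; the released schedule may differ in detail.

```python
import torch

def linear_cumulative_schedule(T=100, gamma_bar_T=0.9, k_beta_bar_T=0.1, K=2886):
    """Cumulative schedule: gamma_bar_t and K*beta_bar_t grow linearly to their final
    values; alpha_bar_t is the remaining probability mass.  Per-step values follow from
    alpha_bar_t = prod(alpha_i) and 1 - gamma_bar_t = prod(1 - gamma_i)."""
    steps = torch.arange(1, T + 1, dtype=torch.float64)
    gamma_bar = gamma_bar_T * steps / T
    k_beta_bar = k_beta_bar_T * steps / T
    alpha_bar = 1.0 - gamma_bar - k_beta_bar
    alpha_bar_prev = torch.cat([torch.ones(1, dtype=torch.float64), alpha_bar[:-1]])
    gamma_bar_prev = torch.cat([torch.zeros(1, dtype=torch.float64), gamma_bar[:-1]])
    alpha = alpha_bar / alpha_bar_prev
    gamma = 1.0 - (1.0 - gamma_bar) / (1.0 - gamma_bar_prev)
    beta = (1.0 - alpha - gamma) / K
    return alpha, beta, gamma, alpha_bar, k_beta_bar / K, gamma_bar

alpha, beta, gamma, a_bar, b_bar, g_bar = linear_cumulative_schedule()
print(float(a_bar[-1]), float(g_bar[-1]))   # ~0.0 identity mass left at t = T, 0.9 of tokens masked
```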
5.1. Comparison with state-of-the-art methods

We compare the proposed method with several state-of-the-art methods, including GAN-based methods [60, 61, 63, 67, 70, 71, 73], DALL-E [48] and CogView [13], on the MSCOCO, CUB-200 and Oxford-102 datasets. We calculate the FID [22] between 30k generated images and 30k real images, and show the results in Table 1.

    Method             MSCOCO   CUB-200   Oxford-102
    StackGAN [70]      74.05    51.89     55.28
    StackGAN++ [71]    81.59    15.30     48.68
    EFF-T2I [60]       -        11.17     16.47
    SEGAN [61]         32.28    18.17     -
    AttnGAN [67]       35.49    23.98     -
    DM-GAN [73]        32.64    16.09     -
    DF-GAN [63]        21.42    14.81     -
    DAE-GAN [51]       28.12    15.19     -
    DALL-E [48]        27.50    56.10     -
    CogView [13]       27.10    -         -
    VQ-Diffusion-S     30.17    12.97     14.95
    VQ-Diffusion-B     19.75    11.94     14.88
    VQ-Diffusion-F     13.86    10.32     14.10

Table 1. FID comparison of different text-to-image synthesis methods on the MSCOCO, CUB-200, and Oxford-102 datasets.

We can see that our small model, VQ-Diffusion-S, which has a similar parameter count to previous GAN-based models, shows strong performance on the CUB-200 and Oxford-102 datasets. Our base model, VQ-Diffusion-B, further improves the performance. Our VQ-Diffusion-F model achieves the best results and surpasses all previous methods by a large margin, even surpassing DALL-E [48] and CogView [13], which have ten times more parameters than ours, on the MSCOCO dataset.

Some visual comparison results with DM-GAN [73] and DF-GAN [63] are shown in Figure 2. Clearly, our synthesized images have more realistic fine-grained details and are more consistent with the input text.

Figure 2. Comparison with GAN-based methods (DM-GAN, DF-GAN, Ours) on the CUB-200 and MSCOCO datasets; each column shows images generated for the same text prompt.

5.2. In the wild text-to-image synthesis

To demonstrate the capability of generating in-the-wild images, we train our model on three subsets of the LAION-400M dataset, i.e., cartoon, icon and human. We provide our results in Figure 3. Though our base model is much smaller than previous works like DALL-E and CogView, it also achieves strong performance.

Figure 3. In the wild text-to-image synthesis results.

Compared with the AR method, which generates images from top-left to bottom-right, our method generates images in a global manner. This allows our method to be applied to many vision tasks, e.g., irregular mask inpainting. For this task, we do not need to re-train a new model. We simply set the tokens in the irregular region to the [MASK] token and send them to our model. This strategy supports both unconditional mask inpainting and text-conditional mask inpainting. We show these results in the appendix.
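The inpainting use described above amounts to overwriting the tokens inside the user-provided region with the [MASK] index before running the reverse process. A small sketch of that preprocessing step is shown below; the helper name, the 32×32 token grid and the [MASK] index are assumptions for illustration.

```python
import torch

def prepare_inpainting_tokens(x0_tokens, region_mask, mask_token_id):
    """Replace the tokens inside an irregular region with [MASK].

    x0_tokens:   (h, w) LongTensor of VQ-VAE tokens of the original image.
    region_mask: (h, w) BoolTensor, True where the image should be regenerated.
    """
    tokens = x0_tokens.clone()
    tokens[region_mask] = mask_token_id
    return tokens

# Toy usage on a 32x32 token grid with K = 2886 ordinary tokens ([MASK] = 2886).
tokens = torch.randint(0, 2886, (32, 32))
region = torch.zeros(32, 32, dtype=torch.bool)
region[8:24, 10:30] = True                    # an arbitrary region to repaint
start = prepare_inpainting_tokens(tokens, region, mask_token_id=2886)
# 'start' would then be denoised by the reverse process, with or without a text condition.
```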
5.3. Ablations

Number of timesteps. We investigate the number of timesteps in training and inference. As shown in Table 2, we perform this experiment on the CUB-200 dataset. We find that when the number of training steps increases from 10 to 100, the result improves; when it further increases to 200, the result seems saturated. So we set the default number of timesteps to 100 in our experiments. To demonstrate the fast inference strategy, we evaluate the generated images from 10, 25, 50, and 100 inference steps on five models with different training steps. We find that the model still maintains a good performance when dropping 3/4 of the inference steps, which saves about 3/4 of the inference time.

Mask-and-replace diffusion strategy. We explore how the mask-and-replace strategy benefits our performance on the Oxford-102 dataset. We set different final mask rates γ̄_T (denoted M in Figure 4) to investigate the effect. Both the mask-only strategy (γ̄_T = 1) and the replace-only strategy (γ̄_T = 0) are special cases of our mask-and-replace strategy. From Figure 4, we find that it achieves the best performance when M = 0.9. When M > 0.9, it may suffer from the error accumulation problem; when M < 0.9, the network may find it difficult to locate the regions that need more attention.
                       training steps
    inference steps    10       25       50       100      200
    10                 32.35    27.62    23.47    19.84    20.96
    25                 -        18.53    15.25    14.03    16.13
    50                 -        -        13.82    12.45    13.67
    100                -        -        -        11.94    12.27
    200                -        -        -        -        11.80

Table 2. Ablation study on training steps and inference steps. Each column shares the same number of training steps while each row shares the same number of inference steps.

Figure 4. Ablation study on the mask rate and the truncation rate.

Truncation. We also demonstrate that the truncation sampling strategy is extremely important for our discrete diffusion based method. It avoids the network sampling from low probability tokens. Specifically, we only keep the top r tokens of p_θ(x̃_0 | x_t, y) in the inference stage. We evaluate the results with different truncation rates r on the CUB-200 dataset. As shown in Figure 4, we find that it achieves the best performance when the truncation rate equals 0.86.
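One plausible reading of the truncation rate r is nucleus-style: keep the smallest set of highest-probability tokens whose cumulative probability reaches r, zero out the rest and renormalize. The sketch below implements that reading in PyTorch; it is an illustrative assumption, not necessarily the exact rule used in the released code.

```python
import torch

def truncate_distribution(probs, truncation_rate=0.86):
    """Keep only the highest-probability tokens of p_theta(x0_tilde | x_t, y).

    probs: (..., K) non-negative probabilities summing to 1 along the last dimension.
    """
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Drop tokens that come after the cumulative mass has already reached r.
    remove = cumulative - sorted_probs >= truncation_rate
    sorted_probs = sorted_probs.masked_fill(remove, 0.0)
    truncated = torch.zeros_like(probs).scatter_(-1, sorted_idx, sorted_probs)
    return truncated / truncated.sum(dim=-1, keepdim=True)

# Toy usage: truncate and resample the per-token predictions for a 32x32 grid.
probs = torch.softmax(torch.randn(2, 1024, 2886), dim=-1)
samples = torch.multinomial(truncate_distribution(probs).reshape(-1, 2886), 1)
```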
VQ-Diffusion vs VQ-AR. For a fair comparison, we replace our diffusion image decoder with an autoregressive decoder with the same network structure and keep the other settings the same, including both the image and text encoders. The autoregressive models are denoted as VQ-AR-S and VQ-AR-B, corresponding to VQ-Diffusion-S and VQ-Diffusion-B. The experiment is performed on the CUB-200 dataset. As shown in Table 3, in both the -S and -B settings the VQ-Diffusion model surpasses the VQ-AR model by a large margin. Meanwhile, we evaluate the throughput of both methods on a V100 GPU with a batch size of 32. The VQ-Diffusion model with the fast inference strategy is 15 times faster than the VQ-AR model while obtaining a better FID score.

    Model             steps   FID      throughput (imgs/s)
    VQ-AR-S           -       18.12    0.08
    VQ-Diffusion-S    25      15.46    1.25
    VQ-Diffusion-S    50      13.62    0.67
    VQ-Diffusion-S    100     12.97    0.37
    VQ-AR-B           -       17.76    0.03
    VQ-Diffusion-B    25      14.03    0.47
    VQ-Diffusion-B    50      12.45    0.24
    VQ-Diffusion-B    100     11.94    0.13

Table 3. Comparison between the VQ-Diffusion and VQ-AR models. By changing the number of inference steps, the VQ-Diffusion model is 15 times faster than the VQ-AR model while maintaining better performance.

5.4. Unified generation model

Our method is general and can also be applied to other image synthesis tasks, e.g., unconditional image synthesis and image synthesis conditioned on labels. To generate images from a given class label, we first remove the text encoder network and the cross attention part in the transformer blocks, and inject the class label through the AdaLN operator. Our network contains 24 transformer blocks with dimension 512. We train our model on the ImageNet dataset. For the VQ-VAE, we adopt the publicly available model from VQGAN [16] trained on the ImageNet dataset, which downsamples images from 256 × 256 to 16 × 16. For unconditional image synthesis, we train our model on the FFHQ256 dataset, which contains 70k high quality face images. The image encoder also downsamples images to 16 × 16 tokens.

We assess the performance of our model in terms of FID and compare it with a variety of previously established models [3, 12, 15, 16, 39]. For a fair comparison, we calculate the FID between 50k generated images and all real images. Following [16], we can further increase the quality by only accepting images with a top 5% classification score, denoted as acc0.05. We show the quantitative results in Table 4. While some task-specialized GAN models report better FID scores, our approach provides a unified model that works well across a wide range of tasks.

    Model                        ImageNet   FFHQ
    StyleGAN2 [29]               -          3.8
    BigGAN [3]                   7.53       12.4
    BigGAN-deep [3]              6.84       -
    IDDPM [39]                   12.3       -
    ADM-G [12]                   10.94      -
    VQGAN [16]                   15.78      9.6
    ImageBART [15]               21.19      9.57
    Ours                         11.89      6.33
    ADM-G (1.0guid) [12]         4.59       -
    VQGAN (acc0.05) [16]         5.88       -
    ImageBART (acc0.05) [15]     7.44       -
    Ours (acc0.05)               5.32       -

Table 4. FID score comparison for class-conditional synthesis on ImageNet and unconditional synthesis on the FFHQ dataset. 'guid' denotes using classifier guidance [12]; 'acc' denotes adopting an acceptance rate [16].

6. Conclusion

In this paper, we present a novel text-to-image architecture named VQ-Diffusion. The core design is to model the VQ-VAE latent space in a non-autoregressive manner. The proposed mask-and-replace diffusion strategy avoids the accumulation of errors of the AR model. Our model has the capacity to generate more complex scenes, surpassing previous GAN-based text-to-image methods. Our method is also general and produces strong results on unconditional and conditional image generation.
Acknowledgement

We thank Qiankun Liu from the University of Science and Technology of China for his help; he provided the initial code and datasets.

References

[1] Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. arXiv preprint arXiv:2107.03006, 2021.
[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[4] Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
[5] Miriam Cha, Youngjune L Gwon, and HT Kung. Adversarial learning of semantic relevance in text to image synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3272–3279, 2019.
[6] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568, 2021.
[7] Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020.
[8] Jun Cheng, Fuxiang Wu, Yanling Tian, Lei Wang, and Dapeng Tao. Rifegan: Rich feature generation for text-to-image synthesis from prior knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10911–10920, 2020.
[9] Ayushman Dash, John Cristian Borges Gamboa, Sheraz Ahmed, Marcus Liwicki, and Muhammad Zeshan Afzal. Tac-gan - text conditioned auxiliary classifier generative adversarial network. arXiv preprint arXiv:1703.06412, 2017.
[10] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[12] Prafulla Dhariwal and Alex Nichol. Diffusion models beat gans on image synthesis. arXiv preprint arXiv:2105.05233, 2021.
[13] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. arXiv preprint arXiv:2105.13290, 2021.
[14] Alaaeldin El-Nouby, Shikhar Sharma, Hannes Schulz, Devon Hjelm, Layla El Asri, Samira Ebrahimi Kahou, Yoshua Bengio, and Graham W Taylor. Tell, draw, and repeat: Generating and modifying images based on continual linguistic instruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10304–10312, 2019.
[15] Patrick Esser, Robin Rombach, Andreas Blattmann, and Björn Ommer. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis. arXiv preprint arXiv:2108.08827, 2021.
[16] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
[17] Lianli Gao, Daiyuan Chen, Jingkuan Song, Xing Xu, Dongxiang Zhang, and Heng Tao Shen. Perceptual pyramid adversarial networks for text-to-image synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8312–8319, 2019.
[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[19] Shuyang Gu, Jianmin Bao, Dong Chen, and Fang Wen. Giqa: Generated image quality assessment. In European Conference on Computer Vision, pages 369–385. Springer, 2020.
[20] Shuyang Gu, Jianmin Bao, Dong Chen, and Fang Wen. Priorgan: Real data prior for generative adversarial nets. arXiv preprint arXiv:2006.16990, 2020.
[21] Shuyang Gu, Jianmin Bao, Hao Yang, Dong Chen, Fang Wen, and Lu Yuan. Mask-guided portrait editing with conditional gans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3436–3445, 2019.
[22] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
[23] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239, 2020.
[24] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv preprint arXiv:2106.15282, 2021.
[25] Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7986–7994, 2018.
[26] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Towards non-autoregressive language models. arXiv preprint arXiv:2102.05379, 2021.
[27] Yupan Huang, Hongwei Xue, Bei Liu, and Yutong Lu. Unifying multimodal transformer for bi-directional image and text generation. In Proceedings of the 29th ACM International Conference on Multimedia, pages 1138–1147, 2021.
[28] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[29] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8110–8119, 2020.
[30] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github.com/openimages, 2(3):18, 2017.
[31] Qicheng Lao, Mohammad Havaei, Ahmad Pesaranghader, Francis Dutil, Lisa Di Jorio, and Thomas Fevens. Dual adversarial inference for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7567–7576, 2019.
[32] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, and Philip HS Torr. Controllable text-to-image generation. arXiv preprint arXiv:1909.07083, 2019.
[33] Wenbo Li, Pengchuan Zhang, Lei Zhang, Qiuyuan Huang, Xiaodong He, Siwei Lyu, and Jianfeng Gao. Object-driven text-to-image synthesis via adversarial training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12174–12182, 2019.
[34] Jiadong Liang, Wenjie Pei, and Feng Lu. Cpgan: Content-parsing generative adversarial networks for text-to-image synthesis. In European Conference on Computer Vision, pages 491–508. Springer, 2020.
[35] Junyang Lin, Rui Men, An Yang, Chang Zhou, Ming Ding, Yichang Zhang, Peng Wang, Ang Wang, Le Jiang, Xianyan Jia, et al. M6: A chinese multimodal pretrainer. arXiv preprint arXiv:2103.00823, 2021.
[36] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[37] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[38] Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4467–4477, 2017.
[39] Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. arXiv preprint arXiv:2102.09672, 2021.
[40] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008.
[41] Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. arXiv preprint arXiv:1711.00937, 2017.
[42] Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image transformer. In International Conference on Machine Learning, pages 4055–4064. PMLR, 2018.
[43] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. Learn, imagine and create: Text-to-image generation from prior knowledge. Advances in Neural Information Processing Systems, 32:887–897, 2019.
[44] Tingting Qiao, Jing Zhang, Duanqing Xu, and Dacheng Tao. Mirrorgan: Learning text-to-image generation by redescription. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1505–1514, 2019.
[45] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.
[46] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
[47] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
[48] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv preprint arXiv:2102.12092, 2021.
[49] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. In Advances in Neural Information Processing Systems, pages 14866–14876, 2019.
[50] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International Conference on Machine Learning, pages 1060–1069. PMLR, 2016.
[51] Shulan Ruan, Yong Zhang, Kun Zhang, Yanbo Fan, Fan Tang, Qi Liu, and Enhong Chen. Dae-gan: Dynamic aspect-aware gan for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13960–13969, 2021.
[52] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636, 2021.
[53] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
[54] Florian Schmidt. Generalization in generation: A closer look at exposure bias. arXiv preprint arXiv:1910.00292, 2019.
[55] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.
[56] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
[57] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
[58] Shikhar Sharma, Dendi Suhubdy, Vincent Michalski, Samira Ebrahimi Kahou, and Yoshua Bengio. Chatpainter: Improving text to image generation using dialogue. arXiv preprint arXiv:1802.08216, 2018.
[59] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
[60] Douglas M Souza, Jônatas Wehrmann, and Duncan D Ruiz. Efficient neural architecture for text-to-image synthesis. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
[61] Hongchen Tan, Xiuping Liu, Xin Li, Yi Zhang, and Baocai Yin. Semantics-enhanced adversarial nets for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10501–10510, 2019.
[62] Hongchen Tan, Xiuping Liu, Meng Liu, Baocai Yin, and Xin Li. Kt-gan: Knowledge-transfer generative adversarial network for text-to-image synthesis. IEEE Transactions on Image Processing, 30:1275–1290, 2020.
[63] Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865, 2020.
[64] Aaron Van Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International Conference on Machine Learning, pages 1747–1756. PMLR, 2016.
[65] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[66] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
[67] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018.
[68] Guojun Yin, Bin Liu, Lu Sheng, Nenghai Yu, Xiaogang Wang, and Jing Shao. Semantics disentangling for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2327–2336, 2019.
[69] Han Zhang, Jing Yu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang. Cross-modal contrastive learning for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 833–842, 2021.
[70] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.
[71] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris N Metaxas. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1947–1962, 2018.
[72] Zizhao Zhang, Yuanpu Xie, and Lin Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6199–6208, 2018.
[73] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5802–5810, 2019.
A. Implementation details

In our experiments on text-to-image synthesis, we adopt the public VQ-VAE [41] model provided by VQGAN [16], trained on the OpenImages [30] dataset, which downsamples images from 256 × 256 to 32 × 32. We use the CLIP [45] pretrained model (ViT-B) as our text encoder, which encodes a sentence into 77 tokens. Our diffusion image decoder consists of several transformer blocks; each block contains full attention, cross attention, and a feed-forward network (FFN). Our base model contains 19 transformer blocks, and the channel dimension of each block is 1024. The FFN contains two linear layers, which expand the dimension to 4096 in the middle layer. The model contains 370M parameters. Our small model contains 18 transformer blocks with a channel dimension of 192; its FFN contains two convolution layers with kernel size 3 and a channel expansion rate of 2. The model contains 34M parameters.

For our class-conditional generation model on ImageNet, we adopt the public VQ-VAE model provided by VQGAN trained on ImageNet, which downsamples images from 256 × 256 to 16 × 16. Our model contains 24 transformer blocks, and each block contains a full attention layer and an FFN. The base channel number is 512. Besides, the FFN also uses convolutions instead of linear layers, and the channel expansion rate is 4.
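To make the decoder block layout described above concrete, here is a hedged PyTorch sketch of one diffusion image decoder block (full self-attention over image tokens, cross-attention to the text features, and an FFN, with the timestep injected through AdaLN-style normalization). The exact ordering of normalization, residuals and conditioning in the released model may differ, so treat this purely as an illustration.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Illustrative VQ-Diffusion decoder block: self-attention, cross-attention to the
    text features, and a feed-forward network, each preceded by AdaLN conditioning."""
    def __init__(self, dim=1024, heads=16, ffn_mult=4, num_timesteps=100):
        super().__init__()
        self.t_emb = nn.Embedding(num_timesteps, dim)
        self.ada = nn.ModuleList([nn.Linear(dim, 2 * dim) for _ in range(3)])
        self.norms = nn.ModuleList([nn.LayerNorm(dim, elementwise_affine=False) for _ in range(3)])
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_mult * dim), nn.GELU(),
                                 nn.Linear(ffn_mult * dim, dim))

    def _ada_norm(self, h, t, i):
        a, b = self.ada[i](self.t_emb(t)).chunk(2, dim=-1)
        return a.unsqueeze(1) * self.norms[i](h) + b.unsqueeze(1)

    def forward(self, h, text_feat, t):
        # h: (B, N, dim) image-token features, text_feat: (B, M, dim), t: (B,) timesteps
        x = self._ada_norm(h, t, 0)
        h = h + self.self_attn(x, x, x, need_weights=False)[0]
        x = self._ada_norm(h, t, 1)
        h = h + self.cross_attn(x, text_feat, text_feat, need_weights=False)[0]
        h = h + self.ffn(self._ada_norm(h, t, 2))
        return h

block = DecoderBlock()
out = block(torch.randn(2, 1024, 1024), torch.randn(2, 77, 1024), torch.tensor([3, 50]))
print(out.shape)   # torch.Size([2, 1024, 1024])
```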
B. Proof of Equation 8

Mathematical induction can be used to prove Equation 8 of the paper. When t = 1, we have

    Q̄_1 v(x_0)(x) = ᾱ_1 + β̄_1   if x = x_0,
                    β̄_1          if x ≠ x_0 and x ≠ K + 1,        (14)
                    γ̄_1          if x = K + 1,

which clearly holds. Suppose Equation 8 holds at step t; then for step t + 1 we have Q̄_{t+1} v(x_0) = Q_{t+1} Q̄_t v(x_0).

When x = x_0,

    Q̄_{t+1} v(x_0)(x) = β̄_t β_{t+1} (K − 1) + (α_{t+1} + β_{t+1})(ᾱ_t + β̄_t)
                       = β̄_t (Kβ_{t+1} + α_{t+1}) + ᾱ_t (α_{t+1} + β_{t+1})
                       = (1/K) [ (1 − ᾱ_t − γ̄_t)(1 − γ_{t+1}) + Kᾱ_t β_{t+1} ] + ᾱ_{t+1}
                       = (1/K) [ (1 − γ̄_{t+1}) − ᾱ_t (1 − γ_{t+1} − Kβ_{t+1}) ] + ᾱ_{t+1}
                       = (1/K) [ (1 − γ̄_{t+1}) − ᾱ_{t+1} ] + ᾱ_{t+1}
                       = ᾱ_{t+1} + β̄_{t+1}.

When x = K + 1,

    Q̄_{t+1} v(x_0)(x) = γ̄_t + (1 − γ̄_t) γ_{t+1} = 1 − (1 − γ̄_{t+1}) = γ̄_{t+1}.

When x ≠ x_0 and x ≠ K + 1,

    Q̄_{t+1} v(x_0)(x) = β̄_t (α_{t+1} + β_{t+1}) + β̄_t β_{t+1} (K − 1) + ᾱ_t β_{t+1}
                       = β̄_t (α_{t+1} + Kβ_{t+1}) + ᾱ_t β_{t+1}
                       = ((1 − ᾱ_t − γ̄_t)/K) (1 − γ_{t+1}) + ᾱ_t β_{t+1}
                       = (1/K)(1 − γ̄_{t+1}) + ᾱ_t ( β_{t+1} − (1 − γ_{t+1})/K )
                       = β̄_{t+1}.

This completes the proof.

C. Results

In this part, we provide more visualization results. First, we compare our results with XMC-GAN in Figure 7; their results are taken directly from their paper. The irregular mask inpainting results are shown in Figure 5. We show more in-the-wild text-to-image results in Figure 6, and we provide our results on ImageNet and FFHQ in Figure 9 and Figure 8.

Figure 5. Text guided image editing by VQ-Diffusion. Each row shows an input image, an irregular mask, and the edited image generated from a text prompt (e.g., "A small pizza in the plate", "A green bird with dark head", "A man wears striped tie", "A woman wears white tshirt with a rectangular pattern on it").
Figure 6. In the wild text-to-image synthesis by VQ-Diffusion: a grid of generated images, each shown with its text prompt (e.g., "A man with beard in 1920s", "Icon of a red heart", "Sunset on the beach", "A cartoon tiger face", "A heart made of chocolate/water/wood/cookie", "A purple/green/blue/yellow bus is driving on the road").

Figure 7. Comparison of our results with XMC-GAN on MSCOCO captions; the XMC-GAN results come from their paper.

Figure 8. VQ-Diffusion results on the FFHQ1024 and FFHQ256 datasets.

Figure 9. VQ-Diffusion results of class-conditional synthesis on the ImageNet dataset (classes such as cock (c7), ostrich (c9), golden retriever (c207), ambulance (c407), matchstick (c644), teapot (c849), bubble (c971), and daisy (c985)).
