
AttnGAN: Fine-Grained Text to Image Generation

with Attentional Generative Adversarial Networks

Tao Xu∗1 , Pengchuan Zhang2 , Qiuyuan Huang2 ,


Han Zhang3 , Zhe Gan4 , Xiaolei Huang1 , Xiaodong He5
1 Lehigh University    2 Microsoft Research    3 Rutgers University    4 Duke University    5 JD AI Research
{tax313, xih206}@lehigh.edu, {penzhan, qihua, xiaohe}@microsoft.com
[email protected], [email protected], [email protected]

Abstract

In this paper, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation. With a novel attentional generative network, the AttnGAN can synthesize fine-grained details at different sub-regions of the image by paying attention to the relevant words in the natural language description. In addition, a deep attentional multimodal similarity model is proposed to compute a fine-grained image-text matching loss for training the generator. The proposed AttnGAN significantly outperforms the previous state of the art, boosting the best reported inception score by 14.14% on the CUB dataset and 170.25% on the more challenging COCO dataset. A detailed analysis is also performed by visualizing the attention layers of the AttnGAN. It shows, for the first time, that the layered attentional GAN is able to automatically select the condition at the word level for generating different parts of the image.

Figure 1. Example results of the proposed AttnGAN for the input caption "this bird is red with white and has a very short beak". The first row gives the low-to-high resolution images generated by G_0, G_1 and G_2 of the AttnGAN; the second and third rows show the top-5 most attended words by F_1^attn and F_2^attn of the AttnGAN, respectively. Here, images of G_0 and G_1 are bilinearly upsampled to have the same size as that of G_2 for better visualization.

1. Introduction
Automatically generating images according to natural language descriptions is a fundamental problem in many applications, such as art generation and computer-aided design. It also drives research progress in multimodal learning and inference across vision and language, which is one of the most active research areas in recent years [20, 18, 36, 19, 41, 4, 30, 5, 1, 31, 33, 32].

Most recently proposed text-to-image synthesis methods are based on Generative Adversarial Networks (GANs) [6]. A commonly used approach is to encode the whole text description into a global sentence vector as the condition for GAN-based image generation [20, 18, 36, 37]. Although impressive results have been presented, conditioning the GAN only on the global sentence vector lacks important fine-grained information at the word level, and prevents the generation of high quality images. This problem becomes even more severe when generating complex scenes such as those in the COCO dataset [14].
∗ This work was performed while the first author was an intern with Microsoft Research.

To address this issue, we propose an Attentional Generative Adversarial Network (AttnGAN) that allows attention-driven, multi-stage refinement for fine-grained text-to-image generation. The overall architecture of the AttnGAN is illustrated in Figure 2. The model consists of two novel components. The first component is an attentional generative network, in which an attention mechanism is developed for the generator to draw different sub-regions of the image by focusing on words that are most relevant to the sub-region being drawn (see Figure 1). More specifically, besides encoding the natural language description into a global sentence vector, each word in the sentence is also encoded into a word vector. The generative network utilizes the global sentence vector to generate a low-resolution image in the first stage. In the following stages, it uses the image vector in each sub-region to query word vectors by using an attention layer to form a word-context vector. It then combines the regional image vector and the corresponding word-context vector to form a multimodal context vector, based on which the model generates new image features in the surrounding sub-regions. This effectively yields a higher resolution picture with more details at each stage. The other component in the AttnGAN is a Deep Attentional Multimodal Similarity Model (DAMSM). With an attention mechanism, the DAMSM is able to compute the similarity between the generated image and the sentence using both the global sentence level information and the fine-grained word level information. Thus, the DAMSM provides an additional fine-grained image-text matching loss for training the generator.

The contribution of our method is threefold. (i) An Attentional Generative Adversarial Network is proposed for synthesizing images from text descriptions. Specifically, two novel components are proposed in the AttnGAN, including the attentional generative network and the DAMSM. (ii) A comprehensive study is carried out to empirically evaluate the proposed AttnGAN. Experimental results show that the AttnGAN significantly outperforms previous state-of-the-art GAN models. (iii) A detailed analysis is performed through visualizing the attention layers of the AttnGAN. For the first time, it is demonstrated that the layered conditional GAN is able to automatically attend to relevant words to form the condition for image generation. Our code is available at https://github.com/taoxugit/AttnGAN.

2. Related Work

Generating high resolution images from text descriptions, though very challenging, is important for many practical applications such as art generation and computer-aided design. Recently, great progress has been achieved in this direction with the emergence of deep generative models [12, 27, 6]. Mansimov et al. [15] built the alignDRAW model, extending the Deep Recurrent Attention Writer (DRAW) [7] to iteratively draw image patches while attending to the relevant words in the caption. Nguyen et al. [16] proposed an approximate Langevin approach to generate images from captions. Reed et al. [21] used conditional PixelCNN [27] to synthesize images from text with a multi-scale model structure. Compared with other deep generative models, Generative Adversarial Networks (GANs) [6] have shown great performance for generating sharper samples [17, 3, 23, 13, 10, 35, 24, 34, 39, 40]. Reed et al. [20] first showed that the conditional GAN was capable of synthesizing plausible images from text descriptions. Their follow-up work [18] also demonstrated that the GAN was able to generate better samples by incorporating additional conditions (e.g., object locations). Zhang et al. [36, 37] stacked several GANs for text-to-image synthesis and used different GANs to generate images of different sizes. However, all of their GANs are conditioned on the global sentence vector, missing fine-grained word level information for image generation.

The attention mechanism has recently become an integral part of sequence transduction models. It has been successfully used in modeling multi-level dependencies in image captioning [30, 38], image question answering [31] and machine translation [2]. Vaswani et al. [28] also demonstrated that machine translation models could achieve state-of-the-art results by solely using an attention model. In spite of this progress, the attention mechanism has not been explored in GANs for text-to-image synthesis yet. It is worth mentioning that alignDRAW [15] also used LAPGAN [3] to scale the image to a higher resolution. However, the GAN in their framework was only utilized as a post-processing step without attention. To our knowledge, the proposed AttnGAN for the first time develops an attention mechanism that enables GANs to generate fine-grained high quality images via multi-level (e.g., word level and sentence level) conditioning.

3. Attentional Generative Adversarial Network

As shown in Figure 2, the proposed Attentional Generative Adversarial Network (AttnGAN) has two novel components: the attentional generative network and the deep attentional multimodal similarity model. We will elaborate on each of them in the rest of this section.

3.1. Attentional Generative Network

Current GAN-based models for text-to-image generation [20, 18, 36, 37] typically encode the whole-sentence text description into a single vector as the condition for image generation, but lack fine-grained word level information. In this section, we propose a novel attention model that enables the generative network to draw different sub-regions of the image conditioned on words that are most relevant to those sub-regions.

As shown in Figure 2, the proposed attentional generative network has m generators (G_0, G_1, ..., G_{m-1}), which take the hidden states (h_0, h_1, ..., h_{m-1}) as input and generate images of small-to-large scales (\hat{x}_0, \hat{x}_1, ..., \hat{x}_{m-1}). Specifically,

    h_0 = F_0(z, F^{ca}(\bar{e})); \quad h_i = F_i(h_{i-1}, F_i^{attn}(e, h_{i-1})) \text{ for } i = 1, 2, \dots, m-1; \quad \hat{x}_i = G_i(h_i).    (1)
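To make the recursion in Eq. (1) concrete, the sketch below wires up stacked generators, attention blocks and image heads in PyTorch. This is a minimal illustration under simplifying assumptions, not the authors' implementation: layer sizes are toy values, each F_i is reduced to a convolution plus 2x upsampling, F^ca is a single linear layer, and the word features are assumed to be already projected into the image feature space (the role of U in Eq. (2) below).

```python
import torch
import torch.nn as nn

class StackedGenerator(nn.Module):
    """Toy sketch of Eq. (1): h_0 = F_0(z, F^ca(e_bar)); h_i = F_i(h_{i-1}, F_i^attn(e, h_{i-1})); x_i = G_i(h_i)."""
    def __init__(self, z_dim=100, sent_dim=256, ch=32, stages=3):
        super().__init__()
        self.ch = ch
        self.cond_aug = nn.Linear(sent_dim, sent_dim)               # stand-in for F^ca
        self.f0 = nn.Sequential(nn.Linear(z_dim + sent_dim, ch * 4 * 4), nn.ReLU())
        self.f = nn.ModuleList([                                    # F_i: joining + upsampling
            nn.Sequential(nn.Conv2d(2 * ch, ch, 3, padding=1), nn.ReLU(),
                          nn.Upsample(scale_factor=2, mode="nearest"))
            for _ in range(stages - 1)])
        self.g = nn.ModuleList([nn.Conv2d(ch, 3, 3, padding=1) for _ in range(stages)])

    def word_context(self, h, words):
        """F_i^attn: one word-context vector per sub-region (words pre-projected to ch dims)."""
        B, C, H, W = h.shape
        queries = h.flatten(2).transpose(1, 2)           # (B, N, C), N = H*W sub-regions
        beta = torch.softmax(queries @ words, dim=-1)    # (B, N, T) attention over words
        ctx = beta @ words.transpose(1, 2)               # (B, N, C) word-context vectors
        return ctx.transpose(1, 2).view(B, C, H, W)

    def forward(self, z, sent_emb, word_embs):
        h = self.f0(torch.cat([z, self.cond_aug(sent_emb)], dim=1)).view(-1, self.ch, 4, 4)
        images = [torch.tanh(self.g[0](h))]              # x_hat_0 = G_0(h_0)
        for i, f_i in enumerate(self.f, start=1):
            ctx = self.word_context(h, word_embs)        # F_i^attn(e, h_{i-1})
            h = f_i(torch.cat([h, ctx], dim=1))          # h_i = F_i(h_{i-1}, context)
            images.append(torch.tanh(self.g[i](h)))      # x_hat_i = G_i(h_i)
        return images

# Example: three stages produce 4x4, 8x8 and 16x16 toy images (the real model uses 64/128/256).
imgs = StackedGenerator()(torch.randn(2, 100), torch.randn(2, 256), torch.randn(2, 32, 12))
```

Each stage consumes the previous hidden features together with a freshly computed word-context map, which is what lets later generators refine word-specific details rather than redrawing the whole image.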

Figure 2. The architecture of the proposed AttnGAN. Each attention model automatically retrieves the conditions (i.e., the most relevant word vectors) for generating different sub-regions of the image; the DAMSM provides the fine-grained image-text matching loss for the generative network. (The diagram shows z ~ N(0, I) and the sentence feature entering F^ca and F_0; attention models F_1^attn and F_2^attn feeding F_1 and F_2; generators G_0, G_1, G_2 producing 64x64x3, 128x128x3 and 256x256x3 images for discriminators D_0, D_1, D_2; and the text encoder, image encoder, word features and local image features of the DAMSM. Building blocks: residual, FC with reshape, upsampling, joining, and 3x3 convolution.)

Here, z is a noise vector usually sampled from a standard normal distribution, \bar{e} is a global sentence vector, and e is the matrix of word vectors. F^{ca} represents the Conditioning Augmentation [36] that converts the sentence vector \bar{e} to the conditioning vector. F_i^{attn} is the proposed attention model at the i-th stage of the AttnGAN. F^{ca}, F_i^{attn}, F_i, and G_i are modeled as neural networks.

The attention model F^{attn}(e, h) has two inputs: the word features e ∈ R^{D×T} and the image features from the previous hidden layer h ∈ R^{D̂×N}. The word features are first converted into the common semantic space of the image features by adding a new perceptron layer, i.e., e' = U e, where U ∈ R^{D̂×D}. Then, a word-context vector is computed for each sub-region of the image based on its hidden features h (query). Each column of h is a feature vector of a sub-region of the image. For the j-th sub-region, its word-context vector is a dynamic representation of word vectors relevant to h_j, which is calculated by

    c_j = \sum_{i=0}^{T-1} \beta_{j,i} e'_i, \quad \text{where} \quad \beta_{j,i} = \frac{\exp(s'_{j,i})}{\sum_{k=0}^{T-1} \exp(s'_{j,k})},    (2)

where s'_{j,i} = h_j^T e'_i, and \beta_{j,i} indicates the weight the model attends to the i-th word when generating the j-th sub-region of the image. We then denote the word-context matrix for image feature set h by F^{attn}(e, h) = (c_0, c_1, ..., c_{N-1}) ∈ R^{D̂×N}. Finally, image features and the corresponding word-context features are combined to generate images at the next stage.
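The word-context computation of Eq. (2) is compact enough to be written directly; the snippet below is a notation-faithful sketch for a single example (the sizes D, D̂, T, N are illustrative).

```python
import torch

def word_context(e, h, U):
    """e: (D, T) word features; h: (D_hat, N) sub-region features; U: (D_hat, D) projection."""
    e_prime = U @ e                          # e' = U e, words mapped to the image feature space
    s = h.t() @ e_prime                      # s'_{j,i} = h_j^T e'_i, shape (N, T)
    beta = torch.softmax(s, dim=1)           # beta_{j,i}: softmax over the T words, Eq. (2)
    c = e_prime @ beta.t()                   # c_j = sum_i beta_{j,i} e'_i, shape (D_hat, N)
    return c, beta

D, D_hat, T, N = 256, 48, 12, 64             # toy dimensions
c, beta = word_context(torch.randn(D, T), torch.randn(D_hat, N), torch.randn(D_hat, D))
assert c.shape == (D_hat, N) and torch.allclose(beta.sum(dim=1), torch.ones(N))
```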
To generate realistic images with multiple levels (i.e., sentence level and word level) of conditions, the final objective function of the attentional generative network is defined as

    L = L_G + \lambda L_{DAMSM}, \quad \text{where} \quad L_G = \sum_{i=0}^{m-1} L_{G_i}.    (3)

Here, \lambda is a hyperparameter to balance the two terms of Eq. (3). The first term is the GAN loss that jointly approximates conditional and unconditional distributions [37]. At the i-th stage of the AttnGAN, the generator G_i has a corresponding discriminator D_i. The adversarial loss for G_i is defined as

    L_{G_i} = -\tfrac{1}{2} \mathbb{E}_{\hat{x}_i \sim p_{G_i}}[\log D_i(\hat{x}_i)] - \tfrac{1}{2} \mathbb{E}_{\hat{x}_i \sim p_{G_i}}[\log D_i(\hat{x}_i, \bar{e})],    (4)

where the first (unconditional) term determines whether the image is real or fake, while the second (conditional) term determines whether the image and the sentence match or not.

Alternately to the training of G_i, each discriminator D_i is trained to classify the input into the class of real or fake by minimizing the cross-entropy loss defined by

    L_{D_i} = -\tfrac{1}{2} \mathbb{E}_{x_i \sim p_{data_i}}[\log D_i(x_i)] - \tfrac{1}{2} \mathbb{E}_{\hat{x}_i \sim p_{G_i}}[\log(1 - D_i(\hat{x}_i))]    (unconditional loss)
             - \tfrac{1}{2} \mathbb{E}_{x_i \sim p_{data_i}}[\log D_i(x_i, \bar{e})] - \tfrac{1}{2} \mathbb{E}_{\hat{x}_i \sim p_{G_i}}[\log(1 - D_i(\hat{x}_i, \bar{e}))],    (conditional loss)    (5)

where x_i is from the true image distribution p_{data_i} at the i-th scale, and \hat{x}_i is from the model distribution p_{G_i} at the same scale. Discriminators of the AttnGAN are structurally disjoint, so they can be trained in parallel and each of them focuses on a single image scale.

The second term of Eq. (3), L_{DAMSM}, is a word level fine-grained image-text matching loss computed by the DAMSM, which will be elaborated in Subsection 3.2.
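The per-stage losses of Eqs. (4) and (5) translate into a few lines of PyTorch. The sketch below assumes a discriminator that returns a probability for an image alone (unconditional branch) and for an image-sentence pair (conditional branch); the tiny stand-in discriminator exists only to make the snippet self-contained and is not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class TinyDiscriminator(nn.Module):
    """Stand-in D_i with unconditional and conditional heads, returning probabilities."""
    def __init__(self, sent_dim=256):
        super().__init__()
        self.img_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 1))
        self.joint_head = nn.Linear(1 + sent_dim, 1)
    def forward(self, x, sent=None):
        logit = self.img_head(x)
        if sent is not None:                      # conditional branch D_i(x, e_bar)
            logit = self.joint_head(torch.cat([logit, sent], dim=1))
        return torch.sigmoid(logit)

def generator_loss(D_i, fake, sent, eps=1e-8):
    # Eq. (4): -1/2 E[log D_i(x_hat)] - 1/2 E[log D_i(x_hat, e_bar)]
    return -0.5 * (torch.log(D_i(fake) + eps).mean() + torch.log(D_i(fake, sent) + eps).mean())

def discriminator_loss(D_i, real, fake, sent, eps=1e-8):
    # Eq. (5): unconditional plus conditional real/fake cross-entropy terms
    uncond = torch.log(D_i(real) + eps).mean() + torch.log(1 - D_i(fake.detach()) + eps).mean()
    cond = torch.log(D_i(real, sent) + eps).mean() + torch.log(1 - D_i(fake.detach(), sent) + eps).mean()
    return -0.5 * (uncond + cond)

D0 = TinyDiscriminator()
real, fake, sent = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64), torch.randn(4, 256)
print(generator_loss(D0, fake, sent).item(), discriminator_loss(D0, real, fake, sent).item())
```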

3.2. Deep Attentional Multimodal Similarity Model

The DAMSM learns two neural networks that map sub-regions of the image and words of the sentence to a common semantic space, thus measuring the image-text similarity at the word level to compute a fine-grained loss for image generation.

The text encoder is a bi-directional Long Short-Term Memory (LSTM) [25] that extracts semantic vectors from the text description. In the bi-directional LSTM, each word corresponds to two hidden states, one for each direction. Thus, we concatenate its two hidden states to represent the semantic meaning of a word. The feature matrix of all words is indicated by e ∈ R^{D×T}. Its i-th column e_i is the feature vector for the i-th word. D is the dimension of the word vector and T is the number of words. Meanwhile, the last hidden states of the bi-directional LSTM are concatenated to form the global sentence vector, denoted by \bar{e} ∈ R^D.

The image encoder is a Convolutional Neural Network (CNN) that maps images to semantic vectors. The intermediate layers of the CNN learn local features of different sub-regions of the image, while the later layers learn global features of the image. More specifically, our image encoder is built upon the Inception-v3 model [26] pretrained on ImageNet [22]. We first rescale the input image to 299×299 pixels. Then, we extract the local feature matrix f ∈ R^{768×289} (reshaped from 768×17×17) from the "mixed 6e" layer of Inception-v3. Each column of f is the feature vector of a sub-region of the image; 768 is the dimension of the local feature vector, and 289 is the number of sub-regions in the image. Meanwhile, the global feature vector \bar{f} ∈ R^{2048} is extracted from the last average pooling layer of Inception-v3. Finally, we convert the image features to the common semantic space of text features by adding a perceptron layer:

    v = W f, \quad \bar{v} = \bar{W} \bar{f},    (6)

where v ∈ R^{D×289} and its i-th column v_i is the visual feature vector for the i-th sub-region of the image, and \bar{v} ∈ R^D is the global vector for the whole image. D is the dimension of the multimodal (i.e., image and text modalities) feature space. For efficiency, all parameters in layers built from the Inception-v3 model are fixed, and the parameters in the newly added layers are jointly learned with the rest of the network.
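The two encoders above can be sketched as follows: a bi-directional LSTM whose per-step outputs form the word matrix e and whose final hidden states are concatenated into the sentence vector ē, together with the perceptron layer of Eq. (6) applied to Inception-v3 features. For brevity, the CNN features are assumed to be precomputed (shapes 289×768 and 2048 as described above), and all hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Bi-directional LSTM: returns word features e (B, D, T) and sentence vector e_bar (B, D)."""
    def __init__(self, vocab_size=5000, emb_dim=300, D=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, D // 2, batch_first=True, bidirectional=True)
    def forward(self, tokens):                       # tokens: (B, T) word indices
        out, (h_n, _) = self.lstm(self.embed(tokens))
        e = out.transpose(1, 2)                      # both directions concatenated per word
        e_bar = torch.cat([h_n[0], h_n[1]], dim=1)   # last hidden states of the two directions
        return e, e_bar

class ImagePerceptron(nn.Module):
    """Eq. (6): v = W f for local features, v_bar = W_bar f_bar for the global feature."""
    def __init__(self, D=256):
        super().__init__()
        self.W = nn.Linear(768, D)
        self.W_bar = nn.Linear(2048, D)
    def forward(self, f_local, f_global):             # f_local: (B, 289, 768), f_global: (B, 2048)
        v = self.W(f_local).transpose(1, 2)            # (B, D, 289)
        v_bar = self.W_bar(f_global)                   # (B, D)
        return v, v_bar

e, e_bar = TextEncoder()(torch.randint(0, 5000, (2, 15)))
v, v_bar = ImagePerceptron()(torch.randn(2, 289, 768), torch.randn(2, 2048))
```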
The attention-driven image-text matching score is designed to measure the matching of an image-sentence pair based on an attention model between the image and the text.

We first calculate the similarity matrix for all possible pairs of words in the sentence and sub-regions in the image by

    s = e^T v,    (7)

where s ∈ R^{T×289} and s_{i,j} is the dot-product similarity between the i-th word of the sentence and the j-th sub-region of the image. We find that it is beneficial to normalize the similarity matrix as follows:

    \bar{s}_{i,j} = \frac{\exp(s_{i,j})}{\sum_{k=0}^{T-1} \exp(s_{k,j})}.    (8)

Then, we build an attention model to compute a region-context vector for each word (query). The region-context vector c_i is a dynamic representation of the image's sub-regions related to the i-th word of the sentence. It is computed as the weighted sum over all regional visual vectors, i.e.,

    c_i = \sum_{j=0}^{288} \alpha_j v_j, \quad \text{where} \quad \alpha_j = \frac{\exp(\gamma_1 \bar{s}_{i,j})}{\sum_{k=0}^{288} \exp(\gamma_1 \bar{s}_{i,k})}.    (9)

Here, \gamma_1 is a factor that determines how much attention is paid to features of its relevant sub-regions when computing the region-context vector for a word.

Finally, we define the relevance between the i-th word and the image using the cosine similarity between c_i and e_i, i.e., R(c_i, e_i) = (c_i^T e_i) / (||c_i|| \, ||e_i||). Inspired by the minimum classification error formulation in speech recognition (see, e.g., [11, 8]), the attention-driven image-text matching score between the entire image (Q) and the whole text description (D) is defined as

    R(Q, D) = \log \Big( \sum_{i=1}^{T-1} \exp(\gamma_2 R(c_i, e_i)) \Big)^{\frac{1}{\gamma_2}},    (10)

where \gamma_2 is a factor that determines how much to magnify the importance of the most relevant word-to-region-context pair. When \gamma_2 \to \infty, R(Q, D) approximates \max_{i=1}^{T-1} R(c_i, e_i).

The DAMSM loss is designed to learn the attention model in a semi-supervised manner, in which the only supervision is the matching between entire images and whole sentences (a sequence of words). Similar to [4, 9], for a batch of image-sentence pairs \{(Q_i, D_i)\}_{i=1}^{M}, the posterior probability of sentence D_i matching image Q_i is computed as

    P(D_i | Q_i) = \frac{\exp(\gamma_3 R(Q_i, D_i))}{\sum_{j=1}^{M} \exp(\gamma_3 R(Q_i, D_j))},    (11)

where \gamma_3 is a smoothing factor determined by experiments. In this batch of sentences, only D_i matches the image Q_i; all other M-1 sentences are treated as mismatching descriptions. Following [4, 9], we define the loss function as the negative log posterior probability that the images are matched with their corresponding text descriptions (ground truth), i.e.,

    L_1^w = -\sum_{i=1}^{M} \log P(D_i | Q_i),    (12)

where 'w' stands for "word".
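The attention-driven score of Eqs. (7)-(10) and the batch losses of Eqs. (11)-(12), together with the symmetric loss L_2^w of Eq. (13) below, can be sketched as follows. This is a simplified, unbatched reference (one image-sentence pair per call to matching_score); the γ values follow the settings reported later in this subsection, and the sum in Eq. (10) is taken over all words.

```python
import torch
import torch.nn.functional as F

def matching_score(e, v, gamma1=5.0, gamma2=5.0):
    """e: (D, T) word features; v: (D, 289) region features; returns the scalar R(Q, D)."""
    s = e.t() @ v                                    # Eq. (7): word/region similarities, (T, 289)
    s_bar = torch.softmax(s, dim=0)                  # Eq. (8): normalize over words
    alpha = torch.softmax(gamma1 * s_bar, dim=1)     # Eq. (9): attention over regions per word
    c = alpha @ v.t()                                # region-context vectors c_i, (T, D)
    rel = F.cosine_similarity(c, e.t(), dim=1)       # R(c_i, e_i) for every word
    return torch.logsumexp(gamma2 * rel, dim=0) / gamma2   # Eq. (10)

def damsm_word_loss(word_feats, region_feats, gamma3=10.0):
    """word_feats[i]: (D, T_i) for sentence D_i; region_feats[j]: (D, 289) for image Q_j."""
    M = len(word_feats)
    R = torch.stack([torch.stack([matching_score(word_feats[i], region_feats[j])
                                  for j in range(M)]) for i in range(M)])   # R[i, j] = R(Q_j, D_i)
    targets = torch.arange(M)
    L1 = F.cross_entropy(gamma3 * R.t(), targets, reduction="sum")   # Eq. (12): -sum_i log P(D_i | Q_i)
    L2 = F.cross_entropy(gamma3 * R, targets, reduction="sum")       # Eq. (13): -sum_i log P(Q_i | D_i)
    return L1 + L2

loss = damsm_word_loss([torch.randn(256, 12) for _ in range(4)],
                       [torch.randn(256, 289) for _ in range(4)])
```

Using a softmax-with-cross-entropy over the batch of scores is exactly the posterior of Eq. (11): each image has to pick out its own sentence against the other M-1 in the batch, and vice versa.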

Symmetrically, we also minimize

    L_2^w = -\sum_{i=1}^{M} \log P(Q_i | D_i),    (13)

where P(Q_i | D_i) = \frac{\exp(\gamma_3 R(Q_i, D_i))}{\sum_{j=1}^{M} \exp(\gamma_3 R(Q_j, D_i))} is the posterior probability that sentence D_i is matched with its corresponding image Q_i. If we redefine Eq. (10) by R(Q, D) = (\bar{v}^T \bar{e}) / (||\bar{v}|| \, ||\bar{e}||) and substitute it into Eqs. (11), (12) and (13), we can obtain the loss functions L_1^s and L_2^s (where 's' stands for "sentence") using the sentence vector \bar{e} and the global image vector \bar{v}.

Finally, the DAMSM loss is defined as

    L_{DAMSM} = L_1^w + L_2^w + L_1^s + L_2^s.    (14)

Based on experiments on a held-out validation set, we set the hyperparameters in this section as: \gamma_1 = 5, \gamma_2 = 5, \gamma_3 = 10 and M = 50. Our DAMSM is pretrained¹ by minimizing L_{DAMSM} using real image-text pairs. Since the size of images for pretraining the DAMSM is not limited by the size of images that can be generated, real images of size 299×299 are utilized. In addition, the pretrained text encoder in the DAMSM provides visually-discriminative word vectors learned from image-text paired data for the attentional generative network. In comparison, conventional word vectors pretrained on pure text data are often not visually discriminative; e.g., word vectors of different colors, such as red, blue, yellow, etc., are often clustered together in the vector space, due to the lack of grounding them to the actual visual signals.

¹We also finetuned the DAMSM with the whole network; however, the performance was not improved.

In sum, we propose two novel attention models, the attentional generative network and the DAMSM, which play different roles in the AttnGAN. (i) The attention mechanism in the generative network (see Eq. (2)) enables the AttnGAN to automatically select word level conditions for generating different sub-regions of the image. (ii) With an attention mechanism (see Eq. (9)), the DAMSM is able to compute the fine-grained text-image matching loss L_{DAMSM}. It is worth mentioning that L_{DAMSM} is applied only to the output of the last generator G_{m-1}, because the eventual goal of the AttnGAN is to generate large images with the last generator. We tried applying L_{DAMSM} to images of all resolutions generated by (G_0, G_1, ..., G_{m-1}); however, the performance was not improved while the computational cost was increased.

4. Experiments

Extensive experimentation is carried out to evaluate the proposed AttnGAN. We first study the important components of the AttnGAN, including the attentional generative network and the DAMSM. Then, we compare our AttnGAN with previous state-of-the-art GAN models for text-to-image synthesis [36, 37, 20, 18, 16].

Datasets. Same as previous text-to-image methods [36, 37, 20, 18], our method is evaluated on the CUB [29] and COCO [14] datasets. We preprocess the CUB dataset according to the method in [36]. Table 1 lists the statistics of the datasets.

Dataset          CUB [29]           COCO [14]
                 train     test     train     test
#samples         8,855     2,933    80k       40k
captions/image   10        10       5         5

Table 1. Statistics of datasets.

Evaluation. Following Zhang et al. [36], we use the inception score [23] as the quantitative evaluation measure. Since the inception score cannot reflect whether the generated image is well conditioned on the given text description, we propose to use R-precision, a common evaluation metric for ranking retrieval results, as a complementary evaluation metric for the text-to-image synthesis task. If there are R relevant documents for a query, we examine the top R ranked retrieval results of a system and find that r are relevant; then, by definition, the R-precision is r/R. More specifically, we conduct a retrieval experiment, i.e., we use generated images to query their corresponding text descriptions. First, the image and text encoders learned in our pretrained DAMSM are utilized to extract global feature vectors of the generated images and the given text descriptions. We then compute cosine similarities between the global image vectors and the global text vectors. Finally, we rank candidate text descriptions for each image in descending similarity and find the top r relevant descriptions for computing the R-precision. To compute the inception score and the R-precision, each model generates 30,000 images from randomly selected unseen text descriptions. The candidate text descriptions for each query image consist of one ground truth (i.e., R = 1) and 99 randomly selected mismatching descriptions.
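The R-precision protocol above (R = 1) reduces to checking whether each generated image ranks its own ground-truth caption first among 100 candidates by cosine similarity of the global vectors. A sketch, assuming those global vectors have already been extracted with the pretrained DAMSM encoders:

```python
import torch
import torch.nn.functional as F

def r_precision(image_feats, text_feats, num_mismatch=99):
    """image_feats: (Q, D) global vectors of generated images; text_feats: (Q, D) of their ground-truth captions."""
    img = F.normalize(image_feats, dim=1)
    txt = F.normalize(text_feats, dim=1)
    Q = img.size(0)
    hits = 0
    for i in range(Q):
        wrong = torch.randperm(Q - 1)[:num_mismatch]          # 99 mismatching captions
        wrong = wrong + (wrong >= i).long()                   # skip the ground-truth index i
        candidates = torch.cat([txt[i:i + 1], txt[wrong]], dim=0)
        sims = candidates @ img[i]                            # cosine similarities to the query image
        hits += int(torch.argmax(sims).item() == 0)           # R = 1: ground truth must rank first
    return hits / Q

# Example with random features for 500 query images in a 256-d joint space.
print(r_precision(torch.randn(500, 256), torch.randn(500, 256)))
```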
Besides quantitative evaluation, we also qualitatively examine the samples generated by our models. Specifically, we visualize the intermediate results with the attention learned by the attention models F^{attn}. As defined in Eq. (2), the weights \beta_{j,i} indicate which words the model attends to when generating a sub-region of the image, and \sum_{i=0}^{T-1} \beta_{j,i} = 1. We suppress the less-relevant words for an image's sub-region via

    \hat{\beta}_{j,i} = \begin{cases} \beta_{j,i}, & \text{if } \beta_{j,i} > 1/T, \\ 0, & \text{otherwise.} \end{cases}    (15)

For better visualization, we fix the word and compute its attention weights for the N different sub-regions of an image, \hat{\beta}_{0,i}, \hat{\beta}_{1,i}, ..., \hat{\beta}_{N-1,i}. We reshape the N attention weights to \sqrt{N} \times \sqrt{N} pixels, which are then upsampled with Gaussian filters to have the same size as the generated images. Limited by the length of the paper, we only visualize the top-5 most attended words (i.e., words with the top-5 highest \sum_{j=0}^{N-1} \hat{\beta}_{j,i} values) for each attention model.
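A sketch of this visualization step is given below: the β weights of one attention model are thresholded as in Eq. (15), each word's N weights are reshaped to a √N × √N map, and the map is upsampled to the image size. A fixed blur kernel stands in for the Gaussian filter, whose exact parameters are not specified in the paper.

```python
import math
import torch
import torch.nn.functional as F

def attention_maps(beta, image_size=(256, 256)):
    """beta: (N, T) attention weights for one image; returns (T, H, W) word attention maps."""
    N, T = beta.shape
    side = int(math.sqrt(N))
    beta_hat = torch.where(beta > 1.0 / T, beta, torch.zeros_like(beta))   # Eq. (15)
    maps = beta_hat.t().reshape(T, 1, side, side)                          # one sqrt(N) x sqrt(N) map per word
    maps = F.interpolate(maps, size=image_size, mode="bilinear", align_corners=False)
    kernel = torch.ones(1, 1, 5, 5) / 25.0                                 # crude stand-in for a Gaussian filter
    return F.conv2d(maps, kernel, padding=2).squeeze(1)

# Example: 64 sub-regions (8x8 grid), 12 words, upsampled to 256x256 maps.
maps = attention_maps(torch.softmax(torch.randn(64, 12), dim=1))
top5 = maps.sum(dim=(1, 2)).topk(5).indices        # indices of the top-5 most attended words
```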

4.1. Component analysis

In this section, we first quantitatively evaluate the AttnGAN and its variants. The results are shown in Table 2 and Figure 3. Our "AttnGAN1" architecture has one attention model and two generators, while the "AttnGAN2" architecture has two attention models stacked with three generators (see Figure 2). In addition, as illustrated in Figure 4, Figure 5, Figure 6, and Figure 7, we qualitatively examine the images generated by our AttnGAN.

Figure 3. Inception scores and R-precision rates by our AttnGAN and its variants (AttnGAN1 with λ = 0.1, 1, 10, 50, 100 and AttnGAN2 with λ = 50) at different epochs on CUB (top) and COCO (bottom) test sets. For the text-to-image synthesis task, R = 1.

Method                    Inception score   R-precision (%)
AttnGAN1, no attention    3.98 ± .04        10.37 ± 5.88
AttnGAN1, λ = 0.1         4.19 ± .06        16.55 ± 4.83
AttnGAN1, λ = 1           4.35 ± .05        34.96 ± 4.02
AttnGAN1, λ = 5           4.35 ± .04        58.65 ± 5.41
AttnGAN1, λ = 10          4.29 ± .05        63.87 ± 4.85
AttnGAN2, λ = 5           4.36 ± .03        67.82 ± 4.43
AttnGAN2, λ = 50 (COCO)   25.89 ± .47       85.47 ± 3.69

Table 2. The best inception score and the corresponding R-precision rate of each AttnGAN model on CUB (top six rows) and COCO (the last row) test sets. More results are in Figure 3.

The DAMSM loss. To test the proposed L_{DAMSM}, we adjust the value of λ (see Eq. (3)). As shown in Figure 3, a larger λ leads to a significantly higher R-precision rate on both CUB and COCO datasets. On the CUB dataset, when the value of λ is increased from 0.1 to 5, the inception score of the AttnGAN1 is improved from 4.19 to 4.35 and the corresponding R-precision rate is increased from 16.55% to 58.65% (see Table 2). On the COCO dataset, by increasing the value of λ from 0.1 to 50, the AttnGAN1 achieves both a high inception score and a high R-precision rate (see Figure 3). This comparison demonstrates that properly increasing the weight of L_{DAMSM} helps to generate higher quality images that are better conditioned on the given text descriptions. The reason is that the proposed fine-grained image-text matching loss L_{DAMSM} provides additional supervision (i.e., word level matching information) for training the generator. Moreover, in our experiments, we do not observe any collapsed nonsensical mode in the visualization of AttnGAN-generated images. It indicates that, with the extra supervision, the fine-grained image-text matching loss also helps to stabilize the training process of the AttnGAN. In addition, a baseline model, "AttnGAN1, no attention", with the text encoder used in [19], is trained on the CUB dataset. Without using attention, its inception score and R-precision drop to 3.98 and 10.37%, respectively, which further demonstrates the effectiveness of the proposed L_{DAMSM}.

The attentional generative network. As shown in Table 2 and Figure 3, stacking two attention models in the generative network not only generates images of a higher resolution (from 128×128 to 256×256), but also yields higher inception scores on both CUB and COCO datasets. In order to guarantee the image quality, we find the best value of λ for each dataset by increasing the value of λ until the overall inception score starts to drop on a held-out validation set. "AttnGAN1" models are built for searching for the best λ, based on which an "AttnGAN2" model is built to generate higher resolution images. Due to GPU memory constraints, we did not try the AttnGAN with three attention models. As a result, our final models for CUB and COCO are "AttnGAN2, λ=5" and "AttnGAN2, λ=50", respectively. The final λ of the COCO dataset turns out to be much larger than that of the CUB dataset, indicating that the proposed L_{DAMSM} is especially important for generating complex scenarios like those in the COCO dataset.

To better understand what has been learned by the AttnGAN, we visualize its intermediate results with attention. As shown in Figure 4, the first stage of the AttnGAN (G_0) just sketches the primitive shape and colors of objects and generates low resolution images. Since only the global sentence vectors are utilized in this stage, the generated images lack details described by exact words, e.g., the beak and eyes of a bird. Based on word vectors, the following stages (G_1 and G_2) learn to rectify defects in the results of the previous stage and add more details to generate higher-resolution images. Some sub-regions/pixels of G_1 or G_2 images can be inferred directly from images generated by the previous stage. For those sub-regions, the attention is equally allocated to all words and shown as black in the attention map (see Figure 4). For other sub-regions, which usually have semantic meaning expressed in the text description such as the attributes of objects, the attention is allocated to their most relevant words (bright regions in Figure 4). Thus, those regions are inferred from both word-context features and previous image features of those regions.

Figure 4. Intermediate results of our AttnGAN on CUB (top) and COCO (bottom) test sets, for the captions "the bird has a yellow crown and a black eyering that is round", "this bird has a green crown black primaries and a white belly", "a photo of a homemade swirly pasta with broccoli carrots and onions" and "a fruit stand display with bananas and kiwi". In each block, the first row gives 64×64 images by G_0, 128×128 images by G_1 and 256×256 images by G_2 of the AttnGAN; the second and third rows show the top-5 most attended words by F_1^attn and F_2^attn of the AttnGAN, respectively. Refer to the supplementary material for more examples.

Dataset   GAN-INT-CLS [20]   GAWWN [18]   StackGAN [36]   StackGAN-v2 [37]   PPGN [16]    Our AttnGAN
CUB       2.88 ± .04         3.62 ± .07   3.70 ± .04      3.84 ± .06         /            4.36 ± .03
COCO      7.88 ± .07         /            8.45 ± .03      /                  9.58 ± .21   25.89 ± .47

Table 3. Inception scores by state-of-the-art GAN models [20, 18, 36, 37, 16] and our AttnGAN on CUB and COCO test sets.

As shown in Figure 4, on the CUB dataset, the words the, this, bird are usually attended by the F^{attn} models for locating the object; the words describing object attributes, such as colors and parts of birds, are also attended for correcting defects and drawing details. On the COCO dataset, we have similar observations. Since there is usually more than one object in each COCO image, it is more visible that the words describing different objects are attended by different sub-regions of the image, e.g., bananas, kiwi in the bottom-right block of Figure 4. These observations demonstrate that the AttnGAN learns to understand the detailed semantic meaning expressed in the text description of an image. Another observation is that our second attention model F_2^attn is able to attend to some new words that were omitted by the first attention model F_1^attn (see Figure 4).

It demonstrates that, to provide richer information for generating higher resolution images at the latter stages of the AttnGAN, the corresponding attention models learn to recover objects and attributes omitted at previous stages.

Figure 5. Example results of our AttnGAN model trained on CUB while changing some of the most attended words in the text descriptions: "this bird has wings that are black and has a white belly", "this bird has wings that are red and has a yellow belly", "this bird has wings that are blue and has a red belly".

Figure 6. 256×256 images generated from descriptions of novel scenarios using the AttnGAN model trained on COCO: "a fluffy black cat floating on top of a lake", "a red double decker bus is floating on top of a lake", "a stop sign is floating on top of a lake", "a stop sign is flying in the blue sky". (Intermediate results are given in the supplementary material.)

Figure 7. Novel images by our AttnGAN on the CUB test set.

Generalization ability. Our experimental results above have quantitatively and qualitatively shown the generalization ability of the AttnGAN by generating images from unseen text descriptions. Here we further test how sensitive the outputs are to changes in the input sentences by changing some of the most attended words in the text descriptions. Some examples are shown in Figure 5. They illustrate that the generated images are modified according to the changes in the input sentences, showing that the model can catch subtle semantic differences in the text description. Moreover, as shown in Figure 6, our AttnGAN can generate images that reflect the semantic meaning of descriptions of novel scenarios that are not likely to happen in the real world, e.g., a stop sign floating on top of a lake. On the other hand, we also observe that the AttnGAN sometimes generates images which are sharp and detailed, but not likely realistic. As the examples in Figure 7 show, the AttnGAN creates birds with multiple heads, eyes or tails, which only exist in fairy tales. This indicates that our current method is still not perfect at capturing globally coherent structures, which leaves room for improvement. To sum up, the observations shown in Figure 5, Figure 6 and Figure 7 further demonstrate the generalization ability of the AttnGAN.

4.2. Comparison with previous methods

We compare our AttnGAN with previous state-of-the-art GAN models for text-to-image generation on the CUB and COCO test sets. As shown in Table 3, on CUB, our AttnGAN achieves a 4.36 inception score, which significantly outperforms the previous best inception score of 3.82. More impressively, our AttnGAN boosts the best reported inception score on COCO from 9.58 to 25.89, a 170.25% relative improvement. The COCO dataset is known to be much more challenging than the CUB dataset because it consists of images with more complex scenarios. Existing methods struggle to generate realistic high-resolution images on this dataset. Examples in Figure 4 and Figure 6 illustrate that our AttnGAN succeeds in generating 256×256 images for various scenarios on the COCO dataset, although those generated images are not as photo-realistic as those of the CUB dataset. The experimental results show that, compared to previous state-of-the-art approaches, the AttnGAN is more effective for generating complex scenes due to its novel attention mechanism that catches fine-grained word level and sub-region level information in text-to-image generation.

Besides StackGAN-v2 [37], the proposed attention mechanisms can also be applied to the widely used DCGAN framework [17]. On the CUB dataset, we build an AttnDCGAN and a vanilla DCGAN. While the vanilla DCGAN conditioned only on the sentence vector (without the proposed attention mechanisms) is shown to be unable to generate plausible 256×256 images, our AttnDCGAN is able to generate realistic images. The AttnDCGAN achieves a 4.12±.05 inception score and 38.45±4.26% R-precision. The vanilla DCGAN only achieves a 2.47±.01 inception score and 3.69±1.82% R-precision because of severe mode collapse. This comparison further demonstrates the effectiveness of the proposed attention mechanisms.

5. Conclusions

In this paper, an Attentional Generative Adversarial Network, named AttnGAN, is proposed for fine-grained text-to-image synthesis. We build a novel attentional generative network for the AttnGAN to generate high quality images through a multi-stage process. We present a deep attentional multimodal similarity model to compute the fine-grained image-text matching loss for training the generator of the AttnGAN. Our AttnGAN significantly outperforms state-of-the-art GAN models, boosting the best reported inception score by 14.14% on the CUB dataset and 170.25% on the more challenging COCO dataset. Extensive experimental results demonstrate the effectiveness of the proposed attention mechanisms in the AttnGAN, which is especially critical for text-to-image generation for complex scenes.

References

[1] A. Agrawal, J. Lu, S. Antol, M. Mitchell, C. L. Zitnick, D. Parikh, and D. Batra. VQA: Visual question answering. IJCV, 123(1):4–31, 2017.
[2] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
[3] E. L. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
[4] H. Fang, S. Gupta, F. N. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, C. L. Zitnick, and G. Zweig. From captions to visual concepts and back. In CVPR, 2015.
[5] Z. Gan, C. Gan, X. He, Y. Pu, K. Tran, J. Gao, L. Carin, and L. Deng. Semantic compositional networks for visual captioning. In CVPR, 2017.
[6] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
[7] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. DRAW: A recurrent neural network for image generation. In ICML, 2015.
[8] X. He, L. Deng, and W. Chou. Discriminative learning in sequential pattern recognition. IEEE Signal Processing Magazine, 25(5):14–36, 2008.
[9] P.-S. Huang, X. He, J. Gao, L. Deng, A. Acero, and L. Heck. Learning deep structured semantic models for web search using clickthrough data. In CIKM, 2013.
[10] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[11] B.-H. Juang, W. Chou, and C.-H. Lee. Minimum classification error rate methods for speech recognition. IEEE Transactions on Speech and Audio Processing, 5(3):257–265, 1997.
[12] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[13] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
[14] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[15] E. Mansimov, E. Parisotto, L. J. Ba, and R. Salakhutdinov. Generating images from captions with attention. In ICLR, 2016.
[16] A. Nguyen, J. Yosinski, Y. Bengio, A. Dosovitskiy, and J. Clune. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR, 2017.
[17] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[18] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In NIPS, 2016.
[19] S. Reed, Z. Akata, B. Schiele, and H. Lee. Learning deep representations of fine-grained visual descriptions. In CVPR, 2016.
[20] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text-to-image synthesis. In ICML, 2016.
[21] S. E. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. de Freitas. Parallel multiscale autoregressive density estimation. In ICML, 2017.
[22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[23] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In NIPS, 2016.
[24] T. Salimans, H. Zhang, A. Radford, and D. Metaxas. Improving GANs using optimal transport. In ICLR, 2018.
[25] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Trans. Signal Processing, 45(11):2673–2681, 1997.
[26] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
[27] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu. Conditional image generation with PixelCNN decoders. In NIPS, 2016.
[28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. arXiv:1706.03762, 2017.
[29] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[30] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In ICML, 2015.
[31] Z. Yang, X. He, J. Gao, L. Deng, and A. J. Smola. Stacked attention networks for image question answering. In CVPR, 2016.
[32] H. Zhang and K. Dana. Multi-style generative network for real-time transfer. arXiv:1703.06953, 2017.
[33] H. Zhang, K. Dana, J. Shi, Z. Zhang, X. Wang, A. Tyagi, and A. Agrawal. Context encoding for semantic segmentation. In CVPR, 2018.
[34] H. Zhang and V. M. Patel. Densely connected pyramid dehazing network. In CVPR, 2018.
[35] H. Zhang, V. Sindagi, and V. M. Patel. Image de-raining using a conditional generative adversarial network. arXiv:1701.05957, 2017.
[36] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
[37] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. arXiv:1710.10916, 2017.
[38] Z. Zhang, Y. Xie, F. Xing, M. McGough, and L. Yang. MDNet: A semantically and visually interpretable medical image diagnosis network. In CVPR, 2017.
[39] Z. Zhang, Y. Xie, and L. Yang. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In CVPR, 2018.
[40] Z. Zhang, L. Yang, and Y. Zheng. Translating and segmenting multimodal medical volumes with cycle- and shape-consistency generative adversarial network. In CVPR, 2018.
[41] Y. Zhu, M. Elhoseiny, B. Liu, X. Peng, and A. Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In CVPR, 2018.

