SATGAN Paper
21/05/2019
1 Abstract
Generating fine-grained, detailed images from text descriptions is a highly
challenging task in Computer Vision. Various attempts have been made so far,
but most of the resulting images lack detail and do not match the text
descriptions properly. In this paper, we propose SATGAN, a two-stage
Generative Adversarial Network with Self Attention applied to the text
descriptions. The first stage draws a primary 64x64 image, and the second
stage applies attention (on the text) and then generates a high resolution
256x256 image. We evaluated our model on the CUB-200 (2011) Birds dataset.
Our extensive experiments and comparisons with state-of-the-art models show
that using Self Attention on text descriptions together with Spectral
Normalization improves the quality of the generated images while reducing the
computational cost. The Inception Score was found to be 5.04 ± 0.37, a boost
of 15.6% on the CUB dataset. The model also achieved an FID score of 42.87.
2 Introduction
Human beings have been given the power of imagination: they can quickly
imagine a scenario based on a text description. For example, when reading a
novel, a person can actually picture the meadows and the plot unraveling.
A major recent focus of Computer Vision is to give that same power, or
something close to it, to a machine. This is where "Text to Image Generation"
comes in. It is needed in various applications such as photo editing and
computer-aided design. Text to Image Generation using Generative Adversarial
Networks (GANs)[1] has shown the most promise, and GANs built on deep
convolutional networks have been especially successful[2][3][4]. Building on
this, various state-of-the-art architectures have been proposed, which are
discussed in the Related Work section. The main problem with those
architectures, however, is that they are computationally expensive. In this
paper, we tried to solve 2 problems:
computational cost. Furthermore, 2 of its 3 Generators use the attention
architecture, which adds even more to the computational cost. In contrast, in
this paper we experimented with a vanilla GAN to produce images, a GAN with
Self Attention[6] on images, and a GAN with Self Attention on the text
descriptions. We also used Spectral Normalization[7] to normalize the layers.
At the same time, we decreased the number of Generators to 2: the 1st
Generator produces 64x64 images, and the 2nd upsamples the image to 256x256
based on the attention maps from the Self Attention architecture. A detailed
explanation of the results is given in Section 5.
3 Related Work
Mansimov et al. used a bidirectional RNN attention model along with the
conditional DRAW network in order to generate images from text
descriptions[8]. AttnGAN[5] uses attention-driven multi-stage refinement of
the text in order to generate images. At first a low resolution image is
generated using the sentence vector (the vector form of the text description).
After that, each sub-region of the image is refined using the word vectors of
the sentence based on context. In addition, a deep multi-modal similarity
model is introduced for calculating the GAN loss. Image generation using
PixelCNN[9] conditions a model built on modified PixelCNN decoders; these
conditions can be vectors, labels, tags, or latent embeddings. The difference
between PixelCNN and other architectures is that, along with generating
excellent samples, it explicitly returns probability densities. These
densities help to generate excellent samples and to transfer learning to
other categories. Based on the condition, the model can generate various
outputs. Scott Reed et al. used a deep convolutional GAN (DC-GAN)[10] to
produce finer images. They used deep convolutional and recurrent text encoders
for obtaining vector representations of the text descriptions. A
matching-aware discriminator (GAN-CLS) was also used in order to discriminate
between real and fake images as well as between real images and mismatched
text. They were also among the first to use the Inception score as a metric
for evaluating GANs. An additional condition of a real image with mismatched
text is added to the GAN. StackGAN-v2[11] builds on the architecture of
StackGAN-v1, which uses two stages of GAN: one for generating a low resolution
primary image from the text, and another for generating a high resolution
image using the low resolution primary image and the text description as
inputs. Furthermore, StackGAN-v2 uses multiple generators and discriminators
in order to generate images at multiple scales. Zizhao Zhang et al. introduced
hierarchical-nested adversarial objectives inside the networks to produce
high resolution images[12]. They also introduced a new
visual semantic similarity measure. LSTM (Long Short Term Memory) is
basically a Recurrent Neural Network (RNN). The basic difference between an
LSTM and other RNNs is that it is able to remember relationships among
vectors over long distances, which other RNNs find very hard to do; this is
an inherent property of LSTMs. Xu Ouyang presented an architecture that uses
an LSTM network to extract semantic meaning from the input text. They used
the real image as the target for multiple similar sentences and showed that
this produced better results. All of these works generate images within a
single category. Multi-Instance StackGAN[13], on the other hand, produces
multiple instances from a broader variety of categories. The model showed
promise in generating complex scene compositions consisting of multiple
objects based on the input text description. Vashisht Madhavan et al.[14]
came up with a dual-loss DCCGAN (Deep Convolutional Conditional GAN). They
encoded the captions and used them for generating images in the DCCGAN.
4 Methodology
the mean $\mu(\phi_t)$ and diagonal covariance matrix $\Sigma(\phi_t)$ are
functions of the text embedding $\phi_t$. The Kullback-Leibler divergence
(KL divergence) is also used as an additional regularization term to smooth
the learning curve of the generator:

$$D_{KL}\big(\mathcal{N}(\mu(\phi_t), \Sigma(\phi_t)) \,\|\, \mathcal{N}(0, I)\big)$$
Here, $x_1$ and $x_2$ are of shape (batch size, 3, 64, 64) and (batch size,
3, 256, 256) respectively, $F^{ca}$ represents the Conditioning
Augmentation[11], and $F^{attn}$ represents the Self Attention module.
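As an illustration, the following PyTorch sketch shows one way the Conditioning Augmentation module and its KL regularization term could be implemented; the class name, hidden sizes, and variable names are our own assumptions rather than the paper's exact code.

```python
# Minimal sketch of Conditioning Augmentation (F^ca) with the KL regularizer.
# Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class ConditioningAugmentation(nn.Module):
    def __init__(self, embed_dim=1024, cond_dim=128):
        super().__init__()
        # One linear layer predicts both the mean and the log-variance.
        self.fc = nn.Linear(embed_dim, cond_dim * 2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, text_embedding):
        stats = self.relu(self.fc(text_embedding))
        mu, logvar = stats.chunk(2, dim=1)
        # Reparameterization: c = mu + sigma * eps, with eps ~ N(0, I)
        std = torch.exp(0.5 * logvar)
        c = mu + std * torch.randn_like(std)
        # KL( N(mu, Sigma) || N(0, I) ) for a diagonal Gaussian
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return c, kl
```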
5
linearity layer, normalization layer, and non-linearity (ReLU) layer. The
vector is then upsampled 4 times to produce a (batch size, 32, 64, 64)
tensor, which is then convolved to produce (batch size, 3, 64, 64) images.
This is the 64x64 image produced by the first Generator $G_0$.
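A rough sketch of this Stage-1 head is given below; the layer names, intermediate channel counts, and upsampling style are our assumptions, chosen to be consistent with the shapes stated above.

```python
# Illustrative Stage-1 (G0) head: project the conditioned vector, upsample
# four times (4 -> 8 -> 16 -> 32 -> 64), then convolve to a 3-channel image.
import torch
import torch.nn as nn

def up_block(in_ch, out_ch):
    # Nearest-neighbour upsampling followed by a convolution (an assumption
    # consistent with Section 5's use of nearest-neighbour interpolation).
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Stage1Head(nn.Module):
    def __init__(self, cond_dim=128, base_ch=512):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(cond_dim, base_ch * 4 * 4),
            nn.BatchNorm1d(base_ch * 4 * 4),
            nn.ReLU(inplace=True),
        )
        self.upsample = nn.Sequential(
            up_block(base_ch, base_ch // 2),        # 4x4   -> 8x8
            up_block(base_ch // 2, base_ch // 4),   # 8x8   -> 16x16
            up_block(base_ch // 4, base_ch // 8),   # 16x16 -> 32x32
            up_block(base_ch // 8, 32),             # 32x32 -> 64x64, 32 channels
        )
        self.to_img = nn.Sequential(
            nn.Conv2d(32, 3, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, c):
        h = self.fc(c).view(c.size(0), -1, 4, 4)
        h = self.upsample(h)        # (batch, 32, 64, 64)
        return self.to_img(h)       # (batch, 3, 64, 64)
```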
$$y = F(x, \{W_i\}) + W_s x$$
The above equation covers the case when the input and output are not of the
same dimension, with $W_s$ projecting the shortcut connection to the required
shape[15]. In our case, the weights are obtained by applying Spectral
Normalization[7] to the layers.
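A minimal sketch of such a residual block with spectrally normalized convolutions follows; the channel counts and kernel sizes are our assumptions.

```python
# Residual block y = F(x, {W_i}) + W_s x with spectral normalization on the
# convolution weights; W_s (a 1x1 projection) is only used when the number of
# channels changes, otherwise the shortcut is the identity.
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class SNResBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.residual = nn.Sequential(
            spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            spectral_norm(nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = (
            spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=1))
            if in_ch != out_ch else nn.Identity()
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.residual(x) + self.shortcut(x))
```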
Passing through the Residual Blocks, 4 rounds of upsampling, and a final
spectrally normalized[7] convolution with a Tanh non-linearity gives the
final (batch size, 3, 256, 256) image.
Self Attention adapts the model of [19], which enables both the generator and
the discriminator to model relationships between widely separated spatial
regions. The image features $x \in \mathbb{R}^{C \times N}$ are first
transformed into two feature spaces, $f(x) = W_f x$ and $g(x) = W_g x$:

$$\beta_{j,i} = \frac{\exp(s_{ij})}{\sum_{i=1}^{N} \exp(s_{ij})}, \quad \text{where } s_{ij} = f(x_i)^{T} g(x_j)$$

Here, $W_g, W_f \in \mathbb{R}^{\bar{C} \times C}$,
$W_h \in \mathbb{R}^{\bar{C} \times C}$ and $\bar{C} = C/s$. The final output
then becomes -
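The sketch below illustrates a SAGAN-style self-attention block of this form; it follows the published formulation of [19] rather than our exact implementation, and the reduction factor and layer names are assumptions.

```python
# SAGAN-style self-attention over flattened spatial features.
# f, g, h are 1x1 convolutions (W_f, W_g, W_h); gamma is a learned scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.f = nn.Conv2d(channels, channels // reduction, kernel_size=1)  # W_f
        self.g = nn.Conv2d(channels, channels // reduction, kernel_size=1)  # W_g
        self.h = nn.Conv2d(channels, channels, kernel_size=1)               # W_h
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, height, width = x.size()
        n = height * width
        f = self.f(x).view(b, -1, n)                    # (b, c/r, N)
        g = self.g(x).view(b, -1, n)                    # (b, c/r, N)
        h = self.h(x).view(b, -1, n)                    # (b, c,   N)
        # s_ij = f(x_i)^T g(x_j); beta normalizes over i (dim=1)
        beta = F.softmax(torch.bmm(f.transpose(1, 2), g), dim=1)  # (b, N, N)
        o = torch.bmm(h, beta).view(b, c, height, width)
        # Residual connection scaled by gamma (initialized to 0)
        return self.gamma * o + x
```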
The KL Divergence loss further made the training process easier for the
generators and discriminators:

$$D_{KL}(P \,\|\, Q) = -\sum_{x \in \mathcal{X}} P(x) \log\frac{Q(x)}{P(x)}$$

$$L = L_{G_i} + L_{D_i}$$
Here, the discriminator is trained on the real image, i.e. the image from the
dataset; the fake image, i.e. the image generated by the generator; and the
wrong image, i.e. a dataset image paired with a mismatched label, so that the
discriminator learns to treat such pairs as false as well.
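A sketch of how this three-term discriminator objective could look is given below; the loss weighting, the assumption that the discriminator returns a matching probability, and the function names are ours, not the paper's.

```python
# Discriminator loss over real, fake, and mismatched (wrong-text) pairs.
# d(image, text_embedding) is assumed to return a probability in [0, 1] that
# the image is real AND matches the text.
import torch
import torch.nn.functional as F

def discriminator_loss(d, real_imgs, fake_imgs, text_emb, wrong_text_emb):
    ones = torch.ones(real_imgs.size(0), 1, device=real_imgs.device)
    zeros = torch.zeros_like(ones)

    real_score = d(real_imgs, text_emb)            # real image, matching text
    fake_score = d(fake_imgs.detach(), text_emb)   # generated image, matching text
    wrong_score = d(real_imgs, wrong_text_emb)     # real image, mismatched text

    loss_real = F.binary_cross_entropy(real_score, ones)
    loss_fake = F.binary_cross_entropy(fake_score, zeros)
    loss_wrong = F.binary_cross_entropy(wrong_score, zeros)
    # The 0.5 weighting of the two "false" terms follows GAN-CLS and is an assumption.
    return loss_real + 0.5 * (loss_fake + loss_wrong)
```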
5 Experiment
For our implementation we proceeded in several steps and explored each
configuration separately. We used cropped 64x64 images for Stage 1 training
and 256x256 images for Stage 2, with a batch size of 32. The training dataset
is the CUB-200 (2011) Birds dataset, which contains 200 bird species. The
Generator and Discriminator networks are trained using the Adam optimizer.
The learning rate was set to 0.0002 initially and decayed by half after every
100 epochs. Several training runs were attempted and modifications were made
as needed.
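A minimal sketch of this optimizer and learning-rate schedule follows; the beta values are an assumption, since only the initial rate and the decay are stated above.

```python
# Adam optimizers with an initial learning rate of 2e-4, halved every 100 epochs.
# betas=(0.5, 0.999) is a common GAN choice and an assumption here.
import torch

def build_optimizers(generator, discriminator, lr=2e-4):
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr, betas=(0.5, 0.999))
    sched_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=100, gamma=0.5)
    sched_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=100, gamma=0.5)
    return opt_g, opt_d, sched_g, sched_d

# Each scheduler's .step() is called once per epoch, after the optimizer steps.
```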
Using both scores, we evaluate our model.
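For reference, both metrics can be computed with off-the-shelf tooling; the sketch below uses torchmetrics, which is our assumption about tooling rather than the evaluation code used in this work.

```python
# Computing Inception Score and FID with torchmetrics (an assumed tool choice).
# By default both metrics expect uint8 images of shape (N, 3, H, W).
import torch
from torchmetrics.image.inception import InceptionScore
from torchmetrics.image.fid import FrechetInceptionDistance

def evaluate(real_images, generated_images):
    inception = InceptionScore()
    inception.update(generated_images)
    is_mean, is_std = inception.compute()

    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_images, real=True)
    fid.update(generated_images, real=False)
    return (is_mean.item(), is_std.item()), fid.compute().item()
```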
First, we trained our one-stage GAN with 3x3 convolutions in the generator
layers along with the default generator loss. It was evident that the
discriminator was reaching zero loss within a few epochs while the generator
was not learning anything. This is a known problem of GANs, mentioned in the
original paper[1], and there is no single established way to solve it. To
address it, the discriminator was frozen for the initial epochs and then
unfrozen, giving the generator enough time to learn the context. We used this
method throughout the later experiments. However, after unfreezing, the
discriminator still went back to a zero-loss state, which means it became
strong too quickly. The images generated by the generator were not very good.
We found an Inception Score of 4.01 ± 0.26 and an FID score of 145.98 for
this experiment.
5.2.2 One Stage (5x5 Convolution, KL Loss)
To address this problem, we reasoned that the discriminator was acting
stronger than the generator. So, instead of 3x3 convolutions, we increased
the filter size to 5 and applied 5x5 convolutions at each upsampling of the
features. Upsampling was done by nearest-neighbour interpolation. This change
made the model somewhat more stable, up to around 150 epochs, and the results
were improving. To further address the issue, we added the KL Divergence loss
to the generator, which made the GAN training converge better. We also
noticed that reducing the learning rate after a set number of epochs helps,
and thus we reduced the learning rate by half after every 100 epochs. These
changes were kept for the later experiments with our model. We found an
Inception Score of 4.05 ± 0.20 and an FID score of 74.44 for this experiment.
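As a sketch of the change described above (channel counts are assumptions), each upsampling step becomes nearest-neighbour interpolation followed by a 5x5 convolution:

```python
# Upsampling block used in this variant: nearest-neighbour interpolation
# followed by a 5x5 convolution (padding=2 preserves the upsampled size).
import torch.nn as nn

def up_block_5x5(in_ch, out_ch):
    return nn.Sequential(
        nn.Upsample(scale_factor=2, mode="nearest"),
        nn.Conv2d(in_ch, out_ch, kernel_size=5, padding=2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```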
which are not in the correct shape and color. This is because Generator 1
fails to generate the correct shape and color, and thus Generator 2 cannot
produce the correct image, since it is conditioned on the output of the first
Generator. We found an Inception Score of 4.96 ± 0.24 and an FID score of
49.33 for this experiment.
After training the vanilla two-stage GAN, we applied Self Attention to the
conditioned vector produced from the addition of the Generator 1 image and
the text embedding. After sufficient epochs of training, we noticed that the
resulting images were good, but not as good as those of the vanilla GAN. This
is because, when applying attention, the whole image is considered; the
operation therefore attends to the background as well. The attention does
well at producing good shapes and colors, but it fails to distinguish between
the bird and the background. The original use of Self Attention in GANs[6]
was for image-to-image synthesis; since our model conditions on text to
generate images, the attention mechanism fails to generate correct results.
We found an Inception Score of 3.22 ± 0.10 and an FID score of 145.75 for
this experiment.
Since we did not get satisfactory results when applying attention to the
generated image, we also experimented with applying attention to the sentence
embedding vector after the Conditioning Augmentation. After sufficient
epochs, we observed that this method produces really good images. Comparing
these images at the same epoch with those of the no-attention model, we
observed that the results improve. The main difference between the two is
that attention on the text separates the backgrounds well while generating
quality images with detail. We found an Inception Score of 5.04 ± 0.37 and an
FID score of 42.87 for this experiment.
Figure 7: Comparison of the models based on visual quality
Method                                        Inception Score   FID Score
One Stage (3x3 Convolution)                   4.01 ± 0.26       145.98
One Stage (5x5 Convolution, KL Loss)          4.05 ± 0.20       74.44
Two Stage (Vanilla)                           4.96 ± 0.24       49.33
Two Stage (Self Attention on Image)           3.22 ± 0.10       145.75
Two Stage (Self Attention on Text, SATGAN)    5.04 ± 0.37       42.87
6 Conclusion
The contribution of our work is a newly proposed architecture, SATGAN, which
reduces the number of Generators while still being able to generate high
quality images from text descriptions. From our experiments we found that
attention applied only to the text descriptions is sufficient and
computationally cost-effective for generating high quality images. Our SATGAN
outperforms the best reported state-of-the-art architectures in generating
diverse yet contextually faithful, high quality images. We believe this
experiment will open up a new line of analysis of GAN architectures and also
serve as an example for understanding which metrics are better suited for
evaluating generative models.
References
[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in
Advances in neural information processing systems, pp. 2672–2680, 2014.
[11] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N.
Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked
generative adversarial networks," in The IEEE International Conference
on Computer Vision (ICCV), Oct. 2017.
[15] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 770–778, 2016.