Development and deployment of a generative model-based framework for text to photorealistic image generation
Abstract

The task of generating photorealistic images from their textual descriptions is quite challenging. Most existing work in this domain is focused on the generation of images such as flowers or birds from their textual descriptions, especially for validating generative models based on Generative Adversarial Network (GAN) variants and for recreational purposes. However, such work is limited in the domain of photorealistic face image generation, and the results obtained have not been satisfactory. This is partly due to the absence of concrete data in this domain and the large number of highly specific features/attributes involved in face generation compared to birds or flowers. In this paper, we propose an Attention Generative Adversarial Network (AttnGAN) for fine-grained text-to-face generation that enables attention-driven multi-stage refinement by employing a Deep Attentional Multimodal Similarity Model (DAMSM). Through extensive experimentation on the CelebA dataset, we evaluated our approach using the Frechet Inception Distance (FID) score. The output files for the Face2Text dataset are also compared with those of the T2F GitHub project; according to the visual comparison, AttnGAN generated higher-quality images than T2F. Additionally, we compare our methodology with existing approaches, with a specific focus on the CelebA dataset, and demonstrate that our approach generates a better FID score, facilitating more realistic image generation. An application of such an approach can be found in criminal identification, where faces are generated from the textual description given by an eyewitness. Such a method can bring consistency and eliminate the individual biases of an artist drawing faces from the description given by the eyewitness. Finally, we discuss the deployment of the models on a Raspberry Pi to test how effective the models would be on a standalone device to facilitate portability and timely task completion.

Keywords: Text-to-image; Text-to-face; Face synthesis; GAN; AttnGAN
1. Introduction and scope

Reed et al. [1] first introduced text-to-image synthesis in 2016, and it is a fundamental and novel research area in computer vision [2]. It is similar to reverse image captioning in that it aims to create natural images from input sentences. Text-to-image synthesis, including image captioning, explores the visual semantic process of the human brain by mining the connection between text and image. Furthermore, it has enormous potential for art production, computer-aided design [3], image searching, and other areas such as image analysis of gold immunochromatographic strips [4–6].

Recently, methods for text-to-image synthesis based on Generative Adversarial Networks (GANs) [7] have been suggested. It is common to encode the entire text meaning into a global sentence vector as a prerequisite for GAN-based image production [1,8–11]. Despite its promising results, conditioning a GAN only on the global sentence vector has certain limitations and is inadequate in taking into consideration the crucial fine-grained details at the word level. This limitation is even more evident when creating complex scenes like those in the COCO dataset [12] or the CelebA dataset [13].

Text-to-face synthesis is a subdivision of text-to-image synthesis that aims to create face images from human descriptions. Text-to-face synthesis, similar to text-to-image synthesis, has two key goals: (1) to produce high-quality images and (2) to generate images that correspond to the input descriptions.
1.1. Related work

Work related to text-to-face synthesis is divided into two categories: (1) text-to-image synthesis and (2) text-to-face synthesis. Fig. 1 summarises the previous work carried out in both these categories.

Text-to-image:

Despite the fact that there are a variety of networks for text-to-image synthesis tasks, the majority of them are built on the encoder-decoder structure and conditional GAN [14]. A text encoder and an image decoder are used in this encoder-decoder system. The text encoder converts input descriptions into semantic vectors, which are then decoded into natural images by the image decoder. Text-to-image synthesis has two key goals: to create high-quality images and to create images that complement the provided descriptions. These two goals serve as the foundation for all advances in text-to-image synthesis.

Text-to-image synthesis research in its early stages primarily focused on improving the quality of the produced photographs. Reed et al. proposed the text-to-image challenge for the first time in 2016 and created two end-to-end networks based on conditional GAN to complete it [1]. Reed used a pre-trained Char-CNN-RNN network for text encoding and a DCGAN-like network [15] for image decoding to create natural images from vectors. Following that, several scholars made advancements as a result of his study [16]. Zhang et al. published one of the most influential studies on this subject, proposing a two-stage network called StackGAN [8] to solve the problem, producing high-quality photos and improving the Inception Score. Later studies [3,10,17,18] inherit this network as well.

Researchers gradually concentrated on achieving another goal: enhancing the resemblance between the input text and the produced photos, because the networks had already shown the ability to produce realistic images. Reed et al. suggested a network for generating images from a bounding box that is generated first; this approach assisted in producing more precise data on the output photographs [9]. A GAN network built on a related concept was also created by [19]. Sharma et al., on the other hand, used dialogue to aid interpretation of the description, which allows synthesising visuals that are more relevant to the input document [20].

Dong et al. suggested a method for creating new images based on the input picture and explanations, which would produce images that complement the input descriptions [16]. They also proposed Image-Text-Image (I2T2I), a new training approach that combines text-to-image and image-to-text (image captioning) synthesis to increase text-to-image synthesis accuracy [21]. Attention mechanisms have already made significant progress in text- and image-related tasks [22–25], and they are now being used in GANs for text-to-image conversion.
[3] constructed AttnGAN to create an attention mechanism that allows GANs to create fine-grained high-resolution photographs from natural language descriptions. MirrorGAN [17] is a text-to-image-to-text network suggested by Qiao et al., which uses a global-local collaborative focus paradigm. [18] proposed a visual-semantic similarity measure as an aid to measurement metrics, since there are no available criteria on how well the produced images represent the input descriptions. These findings indicate a pattern in which researchers are increasingly concentrating on improving the agreement between produced images and input sentences. With this, text-to-image synthesis can be used for scripts-to-storyboard, text-to-architecture and much more.

Image synthesis has been a popular subject in deep learning since Goodfellow suggested the GAN in 2014 [7,26]. Face synthesis is a common research area since there are two broad-scale public datasets: CelebA [13] and Face2Text [27]. Almost all state-of-the-art networks, based on either GAN or conditional GAN, demonstrate their model's dominance on face synthesis (such as PGGAN [11], DCGAN [15,28], CycleGAN [29], BigGAN [30], StyleGAN [31] and StarGAN [32], to name a few). With the advancement of these networks, the quality of the produced face images is steadily improving. Some networks now generate 1024x1024 face pictures, which is far greater than the face datasets' initial picture resolution. These models attempt to learn a mapping between a noise vector that follows the Normal distribution and real face images. However, they are unable to command the network to produce the particular face picture that they need.

Face synthesis has derived many interesting applications using conditional GANs, such as translating edges to natural face images [33], exchanging the attributes of two face images [34], generating a frontal face from a side face [35], generating a full face from the eyes' region only [36], synthesis from face attributes to sketches to natural face images [37], and face image inpainting [38]. By applying a condition vector to the synthesised face images, such networks attempt to steer them and produce face images that meet the needs of various circumstances. The tasks that use the input descriptions as the control condition are identical to text-to-face synthesis.

In the context of face generation from a textual description, one of the most relevant applications is to develop images of criminals from the textual description given by an eyewitness. In the public safety context, this task has more practical value than general text-to-image synthesis. Drawing an image of a criminal relying solely on eyewitness descriptions is a daunting process that takes technical knowledge and extensive practice. Additionally, the individual biases of the eyewitness as well as the artist may creep into the process; e.g., different artists may have a different notion of 'attractive' or 'dark skin' based on their social and ethnic background. Such biases may bring inconsistency in the images created and delay the process of finding the criminals. On the other hand, a person who is not an artist can also easily produce photorealistic faces of criminals based on eyewitness reports using a well-trained text-to-face model.

To address this problem, we employ the Attentional Generative Adversarial Network (AttnGAN) [3], which enables fine-grained text-to-image creation through attention-driven multi-stage refinement. The model is made up of two unique elements. The attentional generative network is the first part, in which the Generator uses an attention mechanism that enables it to draw different sub-regions of the image by concentrating on the words that are most relevant to the sub-region being drawn. In addition to the natural language description being encoded into a global sentence vector, each word in the sentence is encoded into a word vector. The other part of the AttnGAN is a Deep Attentional Multimodal Similarity Model (DAMSM). With the aid of an attention mechanism, the DAMSM computes the similarity between the generated image and the sentence using both global sentence-level details and fine-grained word-level information. Consequently, the DAMSM modifies the Generator's training by adding a finer-grained image-text matching loss. We consider the birds [39], flowers [40], CelebA [13] and Face2Text [27] datasets for the study and experimentation.

1.2. Contribution and novelty

(i) The study, comparison and analysis of various GAN models for photorealistic image generation from textual descriptions by experimentation on birds, flowers, and human faces datasets.
(ii) Implementation of text-to-face synthesis using AttnGAN through attention-driven multi-stage refinement for photorealistic face image generation, and optimisation of the model by employing the DAMSM loss. A model architecture is proposed based on the AttnGAN employing the DAMSM loss.
(iii) Implementation of the trained models on a standalone Raspberry Pi device to ensure more portability, usability and accessibility of such an approach.

1.3. Novelty

1. Experimentation for identifying various aspects of generative models for photorealistic face image generation. This includes the in-depth analysis of the effect of FID scores and DAMSM loss on image quality and realism.
2. Comparison and analysis of the results obtained with existing methods on the CelebA dataset.
3. Generation of distinct variations in the images as a result of semantic alterations in the input text.
4. Implementation on a standalone portable hardware system for easy application and usability.

1.4. Outline

The paper is organised as follows: Section 2 presents the methods and the background of the architectures employed in detail, followed by Section 3, which describes the experiments, evaluations, and results. Section 4 presents the discussion on the experiments and the results. The paper concludes with an overview and future scope in Section 5.

2. Methods and background

In this section, we present AttnGAN and its application to text-to-face synthesis. We begin by explaining how Generative Adversarial Networks (GANs) function. Then we describe AttnGAN and its DAMSM network, which carries out text encoding and computes the attention map. This attention map is then utilised for the task of generating images from their textual description. Further, we describe how AttnGAN is helpful for our problem statement. Finally, we describe the FID score and evaluate our model.

2.1. Generative Adversarial Networks (GAN)
GAN stands for Generative Adversarial Network, and it is a framework for learning a function or program that can produce samples that are quite similar to samples taken from a specified training distribution. GANs have become popular very recently. The general architecture of a GAN [7] consists of a Generator (G) and a Discriminator (D). Both the Generator and the Discriminator are separate neural networks. A random input (noise) is given to the Generator, and it tries to produce an image close to the actual image. The output of the Generator is then given to the Discriminator. The Discriminator tries to tell the difference between natural and synthetic training data, whereas the Generator tries to deceive the Discriminator. The Discriminator updates its weights depending on whether it predicts the image generated by the Generator as real or fake; if it predicts the image to be fake, an update of the weights takes place. The duty of the Generator here is to keep on producing images that seemingly look real, until the Discriminator predicts them as real images. So, essentially, the Generator and the Discriminator participate in a minimax game. Equations (1) and (2) from [41] describe this minimax game.

J^{(D)} = -\frac{1}{2}\,\mathbb{E}_{x \sim p_{data}}[\log D(x)] - \frac{1}{2}\,\mathbb{E}_{z}[\log(1 - D(G(z)))]   (1)

J^{(G)} = -J^{(D)}   (2)

Here, p_{data} is the probability distribution of the given data, J^{(D)} is the discriminator cost, and J^{(G)} is the generator cost. The Nash equilibrium of this game, according to Goodfellow et al. [7], is reached when the samples generated by G are indistinguishable from samples derived from the training data (provided G and D have sufficient capacity).
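In practice, the minimax game of Equations (1) and (2) is optimised by alternating gradient updates of D and G. The following PyTorch sketch is a minimal illustration only (not the training code used in this work); it assumes user-defined Generator and Discriminator modules G and D whose outputs are probabilities, and it uses the common non-saturating binary cross-entropy form of the generator loss.

    import torch
    import torch.nn.functional as F

    def gan_training_step(G, D, opt_g, opt_d, real_images, z_dim, device):
        """One alternating update of the Discriminator and the Generator."""
        batch = real_images.size(0)
        real_labels = torch.ones(batch, 1, device=device)
        fake_labels = torch.zeros(batch, 1, device=device)

        # Discriminator step: push D(x) towards 1 and D(G(z)) towards 0.
        z = torch.randn(batch, z_dim, device=device)
        fake_images = G(z).detach()                     # block gradients into G
        d_loss = F.binary_cross_entropy(D(real_images), real_labels) \
               + F.binary_cross_entropy(D(fake_images), fake_labels)
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # Generator step: try to fool D, i.e. push D(G(z)) towards 1.
        z = torch.randn(batch, z_dim, device=device)
        g_loss = F.binary_cross_entropy(D(G(z)), real_labels)
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()
        return d_loss.item(), g_loss.item()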
2.2. StackGAN

This GAN [8] is typically used for synthesising images from textual descriptions. It breaks down the text-to-image generation process into two stages, as mentioned below:

1) Stage-I GAN
Stage-I GAN focuses on drawing only rough shapes and appropriate colours from the textual description. It creates a low-resolution image by drawing the context layout from a random noise vector. It generally produces 64x64 images.

2) Stage-II GAN
Stage-II GAN is built upon the Stage-I GAN results, and so it produces high-resolution images. Low-resolution images generated by Stage-I GAN are generally devoid of realistic object parts and might have distortions of shape. The Stage-II GAN takes into consideration the text ignored in Stage-I to generate images with more natural details. It generates 256x256 images.

2.3. PGGAN

PGGAN [11] is short for Progressively Growing GAN. PGGAN is used to produce ultra-high-resolution images by increasing the number of network layers as training goes on: a model is first trained to generate 4x4 images, and layers are then added to generate 8x8, 16x16 images and so on.

The most significant difference between PGGAN and StackGAN is that the network structure of the latter is fixed, whereas in PGGANs, as the training progresses, the network structure continues to change. The most significant benefit of doing this is that most iterations are done at lower resolutions, so the training speed is faster than that of traditional GANs.

2.4. AttnGAN

AttnGAN, or Attentional Generative Adversarial Network [3], is an attention-driven architecture that enables text-to-image conversion. The architecture involves multiple stages for the generation of fine-grained images. It generates high-quality images by dividing an image into various subregions and then focusing on the specific words from the caption relevant to a particular subregion of the image.

The models that have so far been used for text-to-image conversion use the entire description and convert it to a vector which is then used for image generation. In this model, instead of the whole sentence, we focus on its constituent words to generate the various image subregions. This ensures a generated image that is visually closer to the actual image. Different words are used to produce different parts of the final image according to the sub-region that they are most relevant to. The detailed architecture of AttnGAN is shown in Fig. 2.

The text description, containing T words, is input to the text encoder. The text encoder is a bidirectional LSTM, which means that the input caption is processed by two LSTMs instead of the usual one. Essentially, this does the job of concatenating the hidden states from the forward and backward directions for all timesteps and outputs a final hidden state. This final hidden state, in the case of this architecture, is the sentence feature, represented by \bar{e} \in \mathbb{R}^{D}, where D is the working dimension for the words. Since there are T words, another matrix e with dimension e \in \mathbb{R}^{D \times T} represents the word features. In a nutshell, the sentence feature may be considered as the final hidden state, while the word features are the hidden states from all timesteps.
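As a concrete sketch of such an encoder (an illustrative simplification, not the exact encoder of [3]; the vocabulary, embedding and hidden sizes below are arbitrary assumptions), the word features e and the sentence feature \bar{e} can be obtained from a bidirectional LSTM as follows:

    import torch
    import torch.nn as nn

    class TextEncoder(nn.Module):
        """Bidirectional-LSTM text encoder: word features are the hidden states
        at every timestep; the sentence feature is the final hidden state with
        the forward and backward halves concatenated (so D = 2 * hidden_dim)."""

        def __init__(self, vocab_size=5000, embed_dim=300, hidden_dim=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim,
                                batch_first=True, bidirectional=True)

        def forward(self, captions):                      # captions: (batch, T) word indices
            emb = self.embed(captions)                    # (batch, T, embed_dim)
            outputs, (h_n, _) = self.lstm(emb)            # outputs: (batch, T, 2*hidden_dim)
            word_features = outputs.transpose(1, 2)       # e: (batch, D, T)
            sent_feature = torch.cat([h_n[0], h_n[1]], dim=1)   # e-bar: (batch, D)
            return word_features, sent_feature

    # Example: a batch of 4 captions of length T = 12
    e, e_bar = TextEncoder()(torch.randint(0, 5000, (4, 12)))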
The sentence features are passed on to F^{ca} for conditional augmentation. F^{ca} is modelled as a neural network. All the equations in this section are based on the mathematical discussion presented in [3]. The output after conditional augmentation, c, is given by:

c = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)   (3)

Since the same description can describe several images, the noise \epsilon is added here to introduce variation in the generated images.

Typically, the input to the Generator in a GAN is only a noise vector z. But since, to generate the final image, the Generator needs to be conditioned on the input description, we use a conditional GAN here. Accordingly, c and z (the noise vector) are concatenated and fed as input to the generator network.
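A minimal sketch of this conditioning-augmentation step is shown below (the layer sizes are assumptions): the module predicts a mean and a log-variance from the sentence feature and samples c with the reparameterisation trick, which is what lets the same caption yield varied images.

    import torch
    import torch.nn as nn

    class ConditioningAugmentation(nn.Module):
        """Sketch of F_ca in Eq. (3): c = mu + sigma * eps with eps ~ N(0, I)."""

        def __init__(self, sent_dim=256, cond_dim=100):
            super().__init__()
            self.fc = nn.Linear(sent_dim, cond_dim * 2)   # predicts mu and log-variance

        def forward(self, sent_feature):
            mu, logvar = self.fc(sent_feature).chunk(2, dim=1)
            eps = torch.randn_like(mu)                    # eps ~ N(0, I)
            c = mu + torch.exp(0.5 * logvar) * eps        # reparameterised sample
            return c, mu, logvar

The conditioning vector c is then concatenated with the noise vector z before entering the first generator stage.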
The architecture can be considered to have m generators. F_0 is responsible for most of the upsampling; the scale factor for upsampling is 2. F_0 does not use word-level features. The context vector at this stage, h_0, is given by:

h_0 = F_0(z, F^{ca}(\bar{e}))   (4)

The output from F_0 (i.e. h_0), along with the word features e, is taken as input by the attention network. To do this, a perceptron layer is added to transform the word features into the common semantic space of the image features. This may be represented as e' = U e, where U \in \mathbb{R}^{\hat{D} \times D}. Here, \hat{D} represents the network's internal working dimension.

Together with h_0, e' is given as input to the attention network. Thus, we get a word-context vector for every subregion. The word-context vector may be understood as a score relating all T words to all N subregions, and it is a measure of how relevant a word is to a particular region. This is how specific words are selected for generating specific regions of the final image. The word-context vector for the j-th subregion is given as:

c_j = \sum_{i=0}^{T-1} \beta_{j,i}\, e'_i, \quad \text{where } \beta_{j,i} = \frac{\exp(s'_{j,i})}{\sum_{k=0}^{T-1} \exp(s'_{j,k})} \text{ and } s'_{j,i} = h_j^{\top} e'_i   (5)
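The following PyTorch sketch (an illustration under the shape conventions above, not the reference implementation of [3]) computes these word-context vectors for all N sub-regions at once; stacking them column-wise gives the matrix F^{attn}(e, h) defined in Eq. (6) below.

    import torch

    def word_context_vectors(h, e_prime):
        """h: (batch, D_hat, N) sub-region features; e_prime: (batch, D_hat, T)
        word features already mapped by U into the common space.
        Returns (batch, D_hat, N): one word-context vector per sub-region."""
        scores = torch.bmm(h.transpose(1, 2), e_prime)        # s'_{j,i}: (batch, N, T)
        beta = torch.softmax(scores, dim=2)                   # normalise over the T words
        context = torch.bmm(e_prime, beta.transpose(1, 2))    # c_j stacked: (batch, D_hat, N)
        return context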
Fig. 2. The architecture of AttnGAN for text-to-face synthesis. Each attention model automatically retrieves the conditions (i.e., the most critical word vectors) for generating various sub-regions of the image; the DAMSM provides the fine-grained image-text matching loss for the generative network.
F^{attn}(e, h) = (c_0, c_1, \ldots, c_{N-1}) \in \mathbb{R}^{\hat{D} \times N}   (6)

For F_1 there are two inputs: h_0 from F_0 and the word-context vectors from the attention network. It consists of residual blocks, which make the network deeper, and an upsampling layer, and it uses the word-level features from F^{attn}_1. Here, F^{attn}_i is the attention model at the i-th stage of AttnGAN. The context vectors henceforth may be generalised as:

h_i = F_i\left(h_{i-1},\, F^{attn}_i(e, h_{i-1})\right) \quad \text{for } i = 1, 2, \ldots, m-1   (7)

The loss for the discriminator D_i at the i-th stage combines an unconditional term and a term conditioned on the sentence feature \bar{e}:

\mathcal{L}_{D_i} = -\frac{1}{2}\,\mathbb{E}_{x_i \sim p_{data_i}}[\log D_i(x_i)] - \frac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}[\log(1 - D_i(\hat{x}_i))] - \frac{1}{2}\,\mathbb{E}_{x_i \sim p_{data_i}}[\log D_i(x_i, \bar{e})] - \frac{1}{2}\,\mathbb{E}_{\hat{x}_i \sim p_{G_i}}[\log(1 - D_i(\hat{x}_i, \bar{e}))]   (11)

The attentional network's final objective function is given by:

\mathcal{L} = \mathcal{L}_G + \lambda\,\mathcal{L}_{DAMSM}, \quad \text{where } \mathcal{L}_G = \sum_{i=0}^{m-1} \mathcal{L}_{G_i}   (12)
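A hedged one-line sketch of how the overall objective in Eq. (12) is assembled during training is given below; the per-stage adversarial generator losses and the DAMSM matching loss are assumed to be computed by helper code elsewhere, and lam corresponds to the weight λ that is tuned experimentally in Section 3.

    def attngan_objective(stage_generator_losses, damsm_loss, lam=5.0):
        """Eq. (12): L = sum_i L_Gi + lambda * L_DAMSM."""
        return sum(stage_generator_losses) + lam * damsm_loss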
Fig. 3. Some examples of our generated images (64x64) by StackGAN Stage-I on Caltech CUB-200 and Oxford-102 datasets.
Fig. 4. Some examples of our generated images (256x256) by StackGAN Stage-II on Caltech CUB-200 and Oxford-102 flowers dataset.
Fig. 5. Some examples of our generated images by the PGGAN on the Oxford-102 flowers dataset.
We assessed the efficacy of our approach using the methods described in [3].

The Face2Text dataset [27] was utilised as an experiment to see how AttnGAN could handle generating images for a face dataset. The dataset contains 400 images, the majority of which have three captions per image. We later reduced this to two captions per image by increasing the number of words in a sentence and thereby reducing the number of captions to be tested by the Generator during training. The output files were compared with those of the T2F project on GitHub [45]. According to the visual comparison, AttnGAN generated higher-quality images than T2F (as shown in Fig. 6).

Since our results on the Face2Text dataset were visually promising, as seen from Fig. 6, we carried out further experiments with a bigger dataset, namely the CelebA dataset [13].
Fig. 6. The images on the left are generated by the AttnGAN model; the images on the right are examples generated by the StackGAN and PGGAN models from [45], trained and tested on the Face2Text dataset.
The CelebFaces Attributes Dataset (CelebA) [13] is a large-scale face attribute dataset. It has over 200K celebrity images and 40 attribute annotations for each image. The images in this collection cover a wide range of poses as well as clutter in the background, and CelebA has rich annotated information. We used 10.2k images from the CelebA dataset to train the AttnGAN model. Since the dataset lacks official captions, captions were sourced from [46]. Each image has ten captions that cover all of the image's attributes.

An attention map for each word in the input statement, as shown in Figs. 7 and 8, is generated. In the attention maps, the words that are of use while producing a particular sub-region are highlighted; in the case of text-to-face, these are the words that describe the attributes of the face. When generating images, this shows where the network concentrates for each word. When responding to certain terms, the induced attention maps essentially match the regions a human would concentrate on. The generated face images have a high level of consistency with the input sentences. However, sometimes the attention maps fail to represent the captions accurately, as shown in Fig. 9.

The DAMSM model was initially trained for each dataset until no significant changes in the sentence and word losses were observed for real image-text pairs. As a result, the image and text encoders learned how to extract global feature vectors from produced images and text descriptions. Further, as the attentional GAN model was being trained, this pre-trained DAMSM model computed the L_DAMSM loss for each iteration.
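For intuition, the sketch below shows a simplified, sentence-level-only version of the matching idea behind the DAMSM loss (the full DAMSM in [3] additionally has a word-level term driven by attention; the smoothing factor below plays the role of a hyperparameter analogous to γ3 in [3], and its value here is an arbitrary assumption): matched image-text pairs within a batch should score higher than all mismatched pairings, in both directions.

    import torch
    import torch.nn.functional as F

    def sentence_matching_loss(img_feats, sent_feats, gamma3=10.0):
        """img_feats, sent_feats: (batch, d) global features from the image
        and text encoders. Matched pairs sit on the diagonal of the
        similarity matrix and are treated as the correct 'class'."""
        img = F.normalize(img_feats, dim=1)
        txt = F.normalize(sent_feats, dim=1)
        scores = gamma3 * img @ txt.t()                     # (batch, batch) cosine similarities
        labels = torch.arange(scores.size(0), device=scores.device)
        # image-to-text and text-to-image matching, as in a symmetric retrieval loss
        return F.cross_entropy(scores, labels) + F.cross_entropy(scores.t(), labels)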
The DAMSM loss is governed by a parameter λ, and the total loss of the model is given by equation (12). To test L_DAMSM, the value of λ is tuned from λ = 0 to 5. The results obtained for the various λ values are shown in Fig. 10.

These results show that appropriately raising the L_DAMSM weight results in higher-quality images that are better conditioned on the given input descriptions. This is because an increased L_DAMSM weight provides word-level matching information, which helps train the Generator in a better way. The CelebA dataset was also trained with λ = 50, but each time the training resulted in a mode collapse, unlike the case when AttnGAN was trained on the COCO dataset by [3].

This work aims not only to generate better-quality images (having more similarity to the textual description) but also images that retain realism and are visually more realistic. As can be seen from Fig. 10, the images generated for λ = 3 are more realistic than those for λ = 5.
Fig. 7. Attention maps of a generated example of text-to-face synthesis. The image shown has the input description "woman has bushy eyebrows with a smile".
Fig. 8. Attention maps of a generated example of text-to-face synthesis. The image shown has the input description "The woman wearing earrings has smile arched eyebrow".
Fig. 9. The image shown has the input description "the attractive man has black hair". It is observed in the image that the hair attribute has not been correctly represented through its attention map.
3.3. Evaluation

The FID score [47] is used to assess the image consistency of the synthetically generated faces. Text-to-image synthesis, in general, uses the Inception Score as a metric. Standard practice is to use a pre-trained Inception-V3 network that is fine-tuned on a specific image dataset to measure the Inception Score and determine network outcomes. This is reported in [3] for the CUB dataset. However, there is no pre-trained Inception-V3 model for the face dataset. As a result, we switched to the FID score, which is another often-used metric for measuring image synthesis and can be thought of as a more powerful variant of the Inception Score (IS), as it is more robust to noise than IS.

FID is a metric for comparing the resemblance between two image datasets. It is found to associate well with human visual content judgments and is used to assess the quality of Generative Adversarial Network samples. The Fréchet distance between two Gaussians fitted to feature representations of the Inception network is used to calculate the FID score. It is also essential to test the model with a minimum of 10k images to obtain appropriate and truthful FID scores [48]. In this work, we tested on 11k images to evaluate the FID score. The best value of FID is obtained for λ = 5, as seen in Table 1. A lower FID score suggests greater image quality, but not necessarily better realism, since our images have better realism for λ = 3 (Fig. 10) but a higher FID score (Table 1).
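Concretely, the FID between the real and generated sets is the Fréchet distance between the two Gaussians fitted to their Inception features. The sketch below is a schematic NumPy/SciPy version of that computation (the features are assumed to come from a pre-trained Inception-V3 network applied to both image sets); it follows the standard formulation and is not the exact script used to produce the reported scores.

    import numpy as np
    from scipy import linalg

    def frechet_inception_distance(feats_real, feats_fake):
        """feats_real, feats_fake: (n_samples, dims) arrays of Inception features.
        FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 * (S1 @ S2)^(1/2))."""
        mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
        sigma1 = np.cov(feats_real, rowvar=False)
        sigma2 = np.cov(feats_fake, rowvar=False)
        covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
        if np.iscomplexobj(covmean):
            covmean = covmean.real              # drop tiny imaginary parts from sqrtm
        diff = mu1 - mu2
        return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))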
The stopping criterion for a GAN model is that it reaches the Nash equilibrium. But since we typically employ SGD, the losses of both the G and D models oscillate and never reach the Nash equilibrium. So one of the better methods to stop GAN training is to visually inspect the generated images and stop early if there is no visually perceived improvement in them. In this work, we applied early stopping and observed that the FID scores for early stopping (450 epochs) are better than those for late stopping (650 epochs) for both λ = 3 and λ = 5, as shown in Table 1.

Since the FID score cannot indicate whether the image produced from a caption is well conditioned on the provided text description, we use R-precision, an evaluation metric for rating retrieval performance, as an additional evaluation metric for the task of text-to-image synthesis. If there are R appropriate documents that are applicable to a query, we review the top 'R' ranked results of the method and find that 'r' of them are relevant; R-precision is therefore given by 'r/R'. We performed a retrieval experiment in which we use validation images to query the text that corresponds to them. To begin with, the global feature vectors of the output images and their text descriptions are extracted using the image and text encoders learned in DAMSM.

The next step is to compute the cosine similarity between the global image and global text vectors and, lastly, to calculate the R-precision. The candidate texts are ranked for every image in order of descending similarity, and the top r valid descriptions are selected. The model produces 11,000 photographs from randomly chosen unseen captions to calculate the R-precision. For each query picture, the candidate text descriptions consist of a single ground truth (i.e., R = 1) and 99 randomly selected descriptions that do not fit. Table 1 shows the FID scores and R-precision achieved for the different λ values.
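With R = 1, this protocol reduces to checking how often the ground-truth caption is ranked first among the 100 candidates. A compact sketch under that assumption is given below (the feature tensors are assumed to be laid out with the ground-truth caption at index 0 for every image; this layout is an illustrative convention, not the actual data format used here).

    import torch
    import torch.nn.functional as F

    def r_precision_at_1(image_feats, text_feats):
        """image_feats: (N, d) global image features; text_feats: (N, 100, d)
        global text features, with the matching caption stored at index 0.
        Returns the fraction of images whose ground-truth caption ranks first."""
        img = F.normalize(image_feats, dim=1)
        txt = F.normalize(text_feats, dim=2)
        sims = torch.einsum('nd,nkd->nk', img, txt)        # cosine similarities
        return (sims.argmax(dim=1) == 0).float().mean().item()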
Fig. 10. Output image for given captions for different λ values.
3.4. Experimental setup for a standalone device

The birds- and face-trained models were deployed on a Raspberry Pi 4 Model B (4 GB RAM). The Raspberry Pi was interfaced with the VNC Viewer app. All the dependencies, the PyTorch wheel file, the pre-trained models and the code were put onto a 16 GB MicroSD card, and the evaluation code was executed from the command window. Due to the restriction of RAM, the device generated only three images from their corresponding captions; any more input captions resulted in an 'Out of memory' error. The response time from input to output was approximately 14 to 15 s. This testing on a standalone device was done to check how optimised and efficient the models are. Figs. 11 and 13 show the input caption and output image on the VNC Viewer app, and Figs. 12 and 14 show the total time for predicting the output on the Raspberry Pi 4 Model B (4 GB RAM).
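The essence of this deployment is CPU-only inference from saved checkpoints. The sketch below is only a schematic of what such an evaluation script might look like; the checkpoint file names and the generator's call signature are assumptions for illustration, not the actual files or API of the code used here.

    import torch

    # Hypothetical checkpoint names; the real files are whatever training saved.
    text_encoder = torch.load('text_encoder.pth', map_location='cpu')
    netG = torch.load('netG_attngan.pth', map_location='cpu')
    text_encoder.eval()
    netG.eval()

    def generate(caption_ids, z_dim=100):
        """Generate images on the CPU (the Raspberry Pi has no CUDA device)."""
        with torch.no_grad():                        # no autograd state -> lower RAM use
            word_feats, sent_feat = text_encoder(caption_ids)
            noise = torch.randn(caption_ids.size(0), z_dim)
            fake_images = netG(noise, sent_feat, word_feats)   # assumed signature
        return fake_images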
Table 1
The best FID score and the corresponding R-precision rate of the AttnGAN model on the CelebA dataset. More results in Fig. 10.

Method                          FID score   R-precision (%)
600 epochs
AttnGAN2, λ = 0 (no DAMSM)      53.11       11.71 ± 0.01
AttnGAN2, λ = 1                 50.93       15.33 ± 0.01
AttnGAN2, λ = 3                 56.41       26.83 ± 0.01
AttnGAN2, λ = 5                 48.27       38.66 ± 0.01
Early stopping, 450 epochs
λ = 3, 450 epochs               55.44       27.30 ± 0.0102
λ = 5, 450 epochs               40.73       39.60 ± 0.0203
4.1. Comparison

Table 2 shows existing work done in the field of text-to-image generation with the CelebA dataset; a number of approaches and methodologies have been proposed. In [49,50], a multimodal CelebA-HQ dataset is used. The dataset consists of 30k high-resolution face images, each having a high-quality segmentation mask, sketch, and descriptive text. StyleGAN is used for face generation, with FID scores of 106.37 and 101.42, respectively. In [41], DCGAN is used for face image generation with an IS of 1.4 ± 0.7, and the limitations of using the Inception Score as an evaluation metric for face datasets are also discussed. In [51], a smaller subset of CelebA named the SCU-Text2face dataset is used; two hundred samples are used for testing, with a reported FID score of 44.49. However, the base FID paper [48] states that a minimum of 10k testing images must be used for generating valid FID scores. As opposed to this, we have used 11k testing images from the CelebA dataset and obtained a FID score of 40.73 for λ = 5.

The AttnGAN is not only capable of generating images of high resolution; it can also consider all the attributes mentioned in the input caption. Removing or replacing a certain keyword in the input drastically impacts the output image. An example of this can be seen in Figs. 15 and 16.
Fig. 11. Input caption and Output image for the birds dataset on the VNC viewer App.
Fig. 12. Total time for predicting the output on Raspberry Pi 4 Model B (4 GB RAM) for birds data.
Fig. 13. Input caption and Output image for CelebA dataset on the VNC viewer App.
Fig. 14. Total time for predicting the output on Raspberry Pi 4 Model B (4 GB RAM) for CelebA data.
By altering some of the words in the input descriptions, we can observe how responsive the output images are to alterations in the input sentences. The generated visuals change in response to alterations in the input phrases, demonstrating that the model can detect even minor semantic alterations in the written description.

4.3. Challenges

4.3.1. Bias in the dataset
Both the CelebA and Face2Text datasets are primarily focused on Caucasian ethnicity and have an over-representation of the same. They do not contain balanced samples of fair- and dark-skinned people, leading to the under-representation of these ethnic groups. This is one of the limitations of these datasets [52], and to remedy this, a balanced dataset that has an equal and unbiased representation of ethnicity and gender must be developed. This kind of unbalanced dataset gives rise to unethical outcomes of the AI models.

4.3.2. Realism vs quality
We have observed that the AttnGAN occasionally produces photos that are clear and detailed, but not necessarily realistic. The FID score serves as a metric to evaluate the quality of the generated images in relation to the ground truth corresponding to the input textual description. It correlates well with the quality of the image; however, it does not necessarily represent the realism of the images. Therefore, a lower FID score, although it is a metric of better image generation based on the textual description, does not necessarily mean more realistic images. This is owing to the λ parameter and its effect on image generation. The qualitative analysis of the generated images shown in Fig. 10 indicates that for λ = 3 the images are more real-looking than for λ = 5. However, Table 1 shows that the FID score for λ = 3 is higher than that for λ = 5. The balance between image quality and realism is an important challenge for generative models.

4.3.3. Use of better text encoding methods
Transformers are typically considered to perform better than LSTMs, as reported in the literature [25]. However, a few approaches reported in recent times suggest that transformers may not be the ultimate solution. In [53], the authors propose that, in the context of language models, convolutional models may prove competitive to Transformers when pre-trained. Also, in [54], it is suggested that replacing BERT self-attention with a linear transform such as the Fourier transform proves to be exceptionally faster in real-time GPU implementations. Also, in [55], an approach based on auto-encoders with transformers is suggested for text-to-image generation.
Table 2
Prominent work done in the field of text-to-image generation with the CelebA dataset.
Fig. 15. The figure demonstrates the effect of specific words on the output generated. In the image, the word ‘old’ significantly affects how the face of the lady is generated.
This demonstrates how AttnGAN can detect minor semantic alterations.
Fig. 16. Other examples showcasing how the words ‘attractive’ and ‘chubby’ are learned by the AttnGAN model.
In summary, there are multiple approaches to the task of text-to-image generation in general and to text encoding in particular. We experimented with the simplest method of text encoding, since the focus of this work is specifically on understanding GAN models and experimenting with them extensively for face image generation. Our contribution lies in identifying various approaches to implementing GANs for more realistic images, comparing and analysing the various evaluation and performance methods to strike a balance between realism and quality, optimisation via the DAMSM loss and, most importantly, handling the limitations posed by inconsistent human participation. Hence, we chose to implement the text encoding via an LSTM. However, we believe that replacing the LSTM with a transformer attention model will improve the performance, which will be a further extension of this work.
replacing the LSTM with the transformer attention model will learning-based images segmentation for quantitative analysis of gold
improve the performance, which will be the further extension of immunochromatographic strip, Neurocomputing 425 (2021) 173–180,
https://doi.org/10.1016/j.neucom.2020.04.001.
this work. [5] N. Zeng, Z. Wang, H. Zhang, K.-E. Kim, Y. Li, X. Liu, An Improved Particle Filter
with a Novel Hybrid Proposal Distribution for Quantitative Analysis of Gold
Immunochromatographic Strips, IEEE Transactions on Nanotechnology 18
5. Conclusion (2019) 819–829, https://doi.org/10.1109/TNANO.772910.1109/
TNANO.2019.2932271.
[6] N. Zeng et al., ‘‘Image-based quantitative analysis of gold
In general, Photorealistic image generation from its description immunochromatographic strip via cellular neural network approach,” IEEE
is constrained on its dataset. Every word of the caption has an Transactions on Medical Imaging, vol. 33, no. 5, 2014, doi: 10.1109/
impact on the quality of the output image. In the case of text-to- TMI.2014.2305394.
[7] I. J. Goodfellow et al., ‘‘Generative Adversarial Networks,” Communications of
face generation, if the dataset consists of more prior information the ACM, vol. 63, no. 11, pp. 139–144, Jun. 2014, Accessed: Jul. 14, 2021.
on a face rather than focusing on selected attributes, it certainly [Online]. Available: https://arxiv.org/abs/1406.2661v1
increases the quality of the obtained results. Therefore, in this [8] H. Zhang et al., ‘‘StackGAN: Text to Photo-Realistic Image Synthesis with
Stacked Generative Adversarial Networks,” in Proceedings of the IEEE
work, we proposed the implementation of text-to-face synthesis
International Conference on Computer Vision, 2017, vol. 2017-October. doi:
using AttnGAN. Initially, experiments were conducted on Stack- 10.1109/ICCV.2017.629.
GAN and PGGAN for the Birds and Flowers dataset. But owing to [9] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, ‘‘Learning what and
where to draw,” 2016.
the lack of performance of these architectures on more complex
[10] H. Zhang et al., ‘‘StackGAN++: Realistic Image Synthesis with Stacked
datasets and lack of focusing attention to a specific attribute, Attn- Generative Adversarial Networks,” IEEE Transactions on Pattern Analysis and
GAN was employed. AttnGAN was used on the Face2Text dataset Machine Intelligence, vol. 41, no. 8, 2019, doi: 10.1109/TPAMI.2018.2856256.
and CelebA dataset. [11] T. Karras, T. Aila, S. Laine, and J. Lehtinen, ‘‘Progressive Growing of GANs for
Improved Quality, Stability, and Variation,” 6th International Conference on
The model was first implemented on the Face2Text dataset. Fol- Learning Representations, ICLR 2018 - Conference Track Proceedings, Oct.
lowing this, we trained the model on 10.2 k images from the Cel- 2017, Accessed: Jul. 14, 2021. [Online]. Available: https://arxiv.org/abs/
ebA dataset. DAMSM loss was considered for optimisation, and 1710.10196v3
[12] T. Y. Lin et al., ‘‘Microsoft COCO: Common objects in context,” in Lecture Notes
we experimented with variouskvalues. The results obtained by in Computer Science (including subseries Lecture Notes in Artificial
our model are compared with the existing models employed on Intelligence and Lecture Notes in Bioinformatics), 2014, vol. 8693 LNCS, no.
the CelebA dataset. Our model outperforms the other approaches PART 5. doi: 10.1007/978-3-319-10602-1_48.
[13] Z. Liu, P. Luo, X. Wang, and X. Tang, ‘‘Deep Learning Face Attributes in the
in terms of using the required number of testing samples and gen- Wild.” pp. 3730–3738, 2015. Accessed: Jul. 14, 2021. [Online]. Available:
erating the lowest FID scores. We also studied and demonstrated http://personal.ie.cuhk.edu.hk/
the effect of semantic alterations on the generated images. Such [14] M. Mirza and S. Osindero, ‘‘Conditional Generative Adversarial Nets,” Nov.
2014, Accessed: Jul. 14, 2021. [Online]. Available: https://arxiv.org/abs/
images are very similar to each other. However, they have distinct
1411.1784v1
variations introduced due to semantic alterations. The effect of FID [15] A. Radford, L. Metz, and S. Chintala, ‘‘Unsupervised Representation Learning
and k values on the quality and realism of the image is analysed, with Deep Convolutional Generative Adversarial Networks,” 4th International
and an early stopping method is implemented to achieve the bal- Conference on Learning Representations, ICLR 2016 - Conference Track
Proceedings, Nov. 2015, [Online]. Available: https://arxiv.org/abs/
ance between the same. Certain challenges, specifically due to 1511.06434v2
the bias in the datasets, are also discussed. Finally, we deployed [16] H. Dong, S. Yu, C. Wu, and Y. Guo, ‘‘Semantic Image Synthesis via Adversarial
these trained models of the birds and faces dataset on a Raspberry Learning,” in Proceedings of the IEEE International Conference on Computer
Vision, 2017, vol. 2017-October. doi: 10.1109/ICCV.2017.608.
Pi to achieve real-world usability, accessibility and portability of [17] T. Qiao, J. Zhang, D. Xu, D. Tao, in: in Proceedings of the IEEE Computer Society
this framework. Deploying the model as an API has enormous pro- Conference on Computer Vision and Pattern Recognition, 2019, p. 2019-June.,
mise in the field of public safety and increased useability. Future https://doi.org/10.1109/CVPR.2019.00160.
[18] Z. Zhang, Y. Xie, and L. Yang, ‘‘Photographic Text-to-Image Synthesis with a
work may focus on capturing global coherent structures as well Hierarchically-Nested Adversarial Network,” 2018. doi: 10.1109/
as employing the attention trasnfromers model for advanced text CVPR.2018.00649.
encoding. [19] S. Hong, D. Yang, J. Choi, and H. Lee, ‘‘Inferring Semantic Layout for Hierarchical
Text-to-Image Synthesis,” 2018. doi: 10.1109/CVPR.2018.00833.
[20] S. Sharma, D. Suhubdy, V. Michalski, S. E. Kahou, and Y. Bengio, ‘‘ChatPainter:
Improving text to image generation using dialogue,” 2018.
CRediT authorship contribution statement
[21] H. Dong, J. Zhang, D. McIlwraith, and Y. Guo, ‘‘I2T2I: Learning text to image
synthesis with textual data augmentation,” in Proceedings - International
Sharad Pande: Software, Validation, Writing – original draft. Conference on Image Processing, ICIP, 2018, vol. 2017-September. doi:
10.1109/ICIP.2017.8296635.
Srishti Chouhan: Software, Writing – original draft, Visualization,
[22] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, ‘‘Stacked attention networks for
Validation. Ritesh Sonavane: Software, Writing – original draft. image question answering,” in Proceedings of the IEEE Computer Society
Rahee Walambe: Conceptualization, Methodology, Wiriting - orig- Conference on Computer Vision and Pattern Recognition, 2016, vol. 2016-
inal draft, Supervision. George Ghinea: Review and Suggestions. December. doi: 10.1109/CVPR.2016.10.
[23] K. Xu et al., ‘‘Show, Attend and Tell: Neural Image Caption Generation with
Ketan Kotecha: Conceptualization, Methodology, Supervision, Visual Attention,” in Proceedings of the 32nd International Conference on
Validation. Machine Learning, Jul. 2015, vol. 3.
[24] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, "Self-Attention Generative Adversarial Networks," in Proceedings of the 36th International Conference on Machine Learning, 2019.
[25] A. Vaswani et al., "Attention Is All You Need," Advances in Neural Information Processing Systems, 2017.
[26] L. Ye, B. Zhang, M. Yang, W. Lian, Triple-translation GAN with multi-layer sparse representation for face image synthesis, Neurocomputing 358 (2019) 294–308, https://doi.org/10.1016/j.neucom.2019.04.074.
[27] A. Gatt et al., "Face2Text: Collecting an annotated image description corpus for the generation of rich face descriptions," 2019.
[28] J. He, J. Zheng, Y. Shen, Y. Guo, H. Zhou, Facial Image Synthesis and Super-Resolution With Stacked Generative Adversarial Network, Neurocomputing 402 (2020) 359–365, https://doi.org/10.1016/j.neucom.2020.03.107.
[29] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017. doi: 10.1109/ICCV.2017.244.
[30] A. Brock, J. Donahue, and K. Simonyan, "Large Scale GAN Training for High Fidelity Natural Image Synthesis," 7th International Conference on Learning Representations, ICLR 2019, 2019.
[31] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019. doi: 10.1109/CVPR.2019.00453.
[32] Y. Choi, M. Choi, M. Kim, J. W. Ha, S. Kim, and J. Choo, "StarGAN: Unified Generative Adversarial Networks for Multi-domain Image-to-Image Translation," 2018. doi: 10.1109/CVPR.2018.00916.
[33] T. C. Wang, M. Y. Liu, J. Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs," 2018. doi: 10.1109/CVPR.2018.00917.
[34] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua, "Towards Open-Set Identity Preserving Face Synthesis," 2018. doi: 10.1109/CVPR.2018.00702.
[35] R. Huang, S. Zhang, T. Li, and R. He, "Beyond Face Rotation: Global and Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis," Proceedings of the IEEE International Conference on Computer Vision, 2017, doi: 10.1109/ICCV.2017.267.
[36] X. Chen, L. Qing, X. He, J. Su, Y. Peng, From Eyes to Face Synthesis: A New Approach for Human-Centered Smart Surveillance, IEEE Access 6 (2018) 14567–14575, https://doi.org/10.1109/ACCESS.2018.2803787.
[37] X. Di and V. M. Patel, "Face synthesis from visual attributes via sketch using conditional VAEs and GANs," arXiv preprint arXiv:1801.00077, 2017.
[38] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, T. S. Huang, "Generative Image Inpainting with Contextual Attention," 2018, https://doi.org/10.1109/CVPR.2018.00577.
[39] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "The Caltech-UCSD Birds-200-2011 Dataset," 2011.
[40] M. E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," 2008. doi: 10.1109/ICVGIP.2008.47.
[41] O. R. Nasir, S. K. Jha, M. S. Grover, Y. Yu, A. Kumar, and R. R. Shah, "Text2FaceGAN: Face generation from fine grained textual descriptions," 2019. doi: 10.1109/BigMM.2019.00-42.
[42] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the Inception Architecture for Computer Vision," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2016. doi: 10.1109/CVPR.2016.308.
[43] O. Russakovsky et al., "ImageNet Large Scale Visual Recognition Challenge," International Journal of Computer Vision, vol. 115, no. 3, 2015, doi: 10.1007/s11263-015-0816-y.
[44] C. Bodnar, "Text to Image Synthesis Using Generative Adversarial Networks," 2018, doi: 10.13140/rg.2.2.35817.39523.
[45] A. Karnewar and A. H. Ibrahim, "GitHub - akanimax/T2F: T2F: text to face generation using Deep Learning." https://github.com/akanimax/T2F (accessed Jun. 01, 2021).
[46] "GitHub - 2KangHo/AttnGAN-CelebA: Face Image Generation using AttnGAN with CelebA Dataset." https://github.com/2KangHo/AttnGAN-CelebA (accessed Jun. 01, 2021).
[47] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Advances in Neural Information Processing Systems, 2017.
[48] "GitHub - bioinf-jku/TTUR: Two time-scale update rule for training GANs." https://github.com/bioinf-jku/TTUR (accessed Jun. 01, 2021).
[49] W. Xia, Y. Yang, J.-H. Xue, and B. Wu, "TediGAN: Text-Guided Diverse Face Image Generation and Manipulation," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2256–2265. Available: https://github.com/weihaox/TediGAN.
[50] W. Xia, Y. Yang, J.-H. Xue, and B. Wu, "Towards Open-World Text-Guided Face Image Generation and Manipulation," 2021.
[51] X. Chen, L. Qing, X. He, X. Luo, and Y. Xu, "FTGAN: A fully-trained generative adversarial networks for text to face generation," arXiv preprint arXiv:1904.05729, 2019.
[52] E. M. Rudd, M. Günther, and T. E. Boult, "MOON: A mixed objective optimization network for the recognition of facial attributes," in Lecture Notes in Computer Science, 2016, vol. 9909 LNCS. doi: 10.1007/978-3-319-46454-1_2.
[53] Y. Tay et al., "Are Pre-trained Convolutions Better than Pre-trained Transformers?," arXiv preprint arXiv:2105.03322, 2021.
[54] "Google Replaces BERT Self-Attention with Fourier Transform: 92% Accuracy, 7 Times Faster on GPUs | Synced." https://syncedreview.com/2021/05/14/deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-performance-at-low-cost-19/amp/ (accessed Jul. 14, 2021).
[55] N. A. Fotedar and J. H. Wang, "Bumblebee: Text-to-Image Generation with Transformers." Available: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15709283.pdf

Sharad Pande is an undergraduate student at Symbiosis Institute of Technology. He is pursuing a Bachelor of Technology, majoring in Electronics and Telecommunication. He has a keen interest in machine learning and data science. For the past few years he has been working in the area of GANs and their use for various generative tasks.

Srishti Chouhan is an undergraduate student at Symbiosis Institute of Technology. She is pursuing a Bachelor of Technology, majoring in Electronics and Telecommunication. She has a keen interest in deep learning and data analysis. For the past few years she has been working in the area of image processing, GANs and deep learning methods for various applications.

Ritesh Sonavane is an undergraduate student at Symbiosis Institute of Technology. He is pursuing a Bachelor of Technology, majoring in Electronics and Telecommunication. He has a keen interest in robotics and the implementation of various models on hardware platforms. For the past few years he has been working on the deployment of various models on microprocessors and hardware platforms.

Rahee Walambe received her MPhil and Ph.D. degrees from Lancaster University, UK, in 2008. From 2008 to 2017, she was a research consultant with various organizations in the control and robotics domain. Since 2017, she has been working as an Associate Professor in the Department of Electronics and Telecommunications at Symbiosis Institute of Technology, Symbiosis International University, Pune, India. Her area of research is applied Deep Learning and AI in the fields of Robotics and Healthcare. She has been awarded a number of national and international research grants.
George Ghinea is a Professor in the Department of Computer Science at Brunel University London. His research activities lie at the confluence of Computer Science, Media and Psychology. He has applied his expertise in areas such as eye-tracking, telemedicine, multi-modal interaction, and ubiquitous and mobile computing. He is particularly interested in building human-centred e-systems, especially ones integrating human perceptual requirements. His work has been funded by both national and international funding bodies.

Ketan Kotecha pursued his Ph.D. and M.Tech at IIT Bombay and currently holds the positions of Head, Symbiosis Centre for Applied AI (SCAAI), Director, Symbiosis Institute of Technology, and Dean, Faculty of Engineering, Symbiosis International (Deemed University). He is an expert in AI and Deep Learning. He has published 100+ papers in a number of excellent peer-reviewed journals on various topics ranging from cutting-edge AI, education policies and teaching-learning practices to AI for all. He has published 3 patents and delivered keynote speeches at various national and international forums. He is a recipient of multiple international research grants and awards.