0% found this document useful (0 votes)

16 views

Development and deployment of a generative model-based framework for text to photorealistic image generation

Uploaded by

santiagobega

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

16 views

Development and deployment of a generative model-based framework for text to photorealistic image generation

Uploaded by

santiagobega

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

Development and deployment of a generative model-based framework

for text to photorealistic image generation

Sharad Pande a, Srishti Chouhan a, Ritesh Sonavane a, Rahee Walambe a,b, George Ghinea c,
Ketan Kotecha a,b,⇑
a
Symbiosis Institute of Technology (SIT), Symbiosis International (Deemed University), India
b
Symbiosis Centre for Applied Artificial Intelligence, Symbiosis Institute of Technology (SIT), Symbiosis International (Deemed University), India
c
Department of Computer Science, Brunel University, London, UK

a r t i c l e i n f o a b s t r a c t

The task of generating photorealistic images from their textual descriptions is quite challenging. Most
existing tasks in this domain are focused on the generation of images such as flowers or birds from their
textual description, especially for validating the generative models based on Generative Adversarial
Network (GAN) variants and for recreational purposes. However, such work is limited in the domain of
photorealistic face image generation and the results obtained have not been satisfactory. This is partly
due to the absence of concrete data in this domain and a large number of highly specific features/at-
tributes involved in face generation compared to birds or flowers. In this paper, we propose an
Keywords:
Attention Generative Adversarial Network (AttnGAN) for a fine-grained text-to-face generation that
Text-to-image
Text-to-face enables attention-driven multi-stage refinement by employing Deep Attentional Multimodal Similarity
Face synthesis Model (DAMSM). Through extensive experimentation on the CelebA dataset, we evaluated our approach
GAN using the Frechet Inception Distance (FID) score. The output files for the Face2Text Dataset are also com-
AttnGAN pare with that of the T2F Github project. According to the visual comparison, AttnGAN generated higher-
quality images than T2F. Additionally, we compare our methodology with existing approaches with a
specific focus on CelebA dataset and demonstrate that our approach generates a better FID score facilitat-
ing more realistic image generation. The application of such an approach can be found in criminal iden-
tification, where faces are generated from the textual description from an eyewitness. Such a method can
bring consistency and eliminate the individual biases of an artist drawing the faces from the description
given by the eyewitness. Finally, we discuss the deployment of the models on a Raspberry Pi to test how
effective the models would be on a standalone device to facilitate portability and timely task completion. Ó

1. Introduction and scope Recently, methods for text-to-image synthesis based on Generative
Adversarial Networks (GANs) [7] have been suggested. It is com-
Reed et al. [1] first introduced text-to-image synthesis in 2016, mon to encode the entire text meaning into a global sentence vec-
and it is a fundamental and novel research area in computer vision tor as a prerequisite for GAN-based image production [1,8–11].
[2]. It is similar to reverse image captioning in that it aims to create Despite its promising results, conditioning a GAN only on the glo-
natural images from input sentences. Text-to-image synthesis, bal sentence vector has certain limitations and is inadequate in
including image caption, explores the visual semantic process of taking into consideration the crucial fine-grained details at the
the human brain by mining the connection between text and word stage. This limitation is even more evident when creating
image. Furthermore, it has enormous potential for art production, complex scenes like those in the COCO dataset [12] or the CelebA
computer-aided design [3], image searching, and other areas such dataset [13].
as image analysis of gold immunochromatographic strip [4–6]. Text-to-face synthesis is a subdivision of text-to-image synthe-
sis that aims to create face images from human descriptions. Text-
to-face synthesis, similar to text-to-image synthesis, has two key
⇑ Corresponding author at: Symbiosis Institute of Technology (SIT), Symbiosis goals: (1) to produce high-quality images and (2) to generate
International (Deemed University), India.
images that correspond to the input descriptions.
E-mail address: [email protected] (K. Kotecha).
1.1. Related work ing that, several scholars made advancements as a result of his
study [16]. Zhang et al. published one of the most influential stud-
Text-to-face synthesis is divided into two categories: (1) text- ies on this subject, proposing a two-stage network called StackGAN
to-image synthesis and (2) text-to-face synthesis. Fig. 1 sum- [8] to solve the problem and for producing high-quality photos and
marises the previous work carried out in both these categories. improving the Inception Score. Later studies [3,10,17,18] inherit
this network as well.
Text-to-image: Researchers gradually concentrated on achieving another goal:
enhancing the resemblance between input text and produced pho-
Despite the fact that there are a variety of networks for text-to- tos because the network had already shown the ability to produce
image synthesis tasks, the majority of them are built on the realistic images. Reed et al. suggested a network for generating
encoder-decoder structure and conditional GAN [14]. Text- images from a box that was created first. This approach assisted
encoder and image-decoder are used in this encoder-decoder sys- in producing more precise data on the output photographs [9]. A
tem. The text encoder converts input definitions into semantic vec- GAN network built on a related concept was also created by [19].
tors, which are then decoded into natural images by the image Sharma et al., on the other hand, used dialogue to aid interpreta-
decoder. Text-to-image synthesis has two key goals: to create tion of the description, which allows synthesising visuals that are
high-quality images and to create images that complement the more relevant to the input document [20].
provided descriptions. These two goals serve as the foundation Dong et al. suggested a method for creating new images based
for all advances in text-to-image synthesis. on the input picture and explanations, which would produce
Text-to-image synthesis research in its early stages has primar- images that complement the input descriptions [16]. They also
ily focused on improving the quality of produced photographs. proposed Image-Text-Image (I2T2I), a new training approach that
Reed et al. proposed the text-to-image challenge for the first time combines text-to-image and image-to-text (image captioning) syn-
in 2016 and created two end-to-end networks focused on condi- thesis to increase text-to-image synthesis accuracy [21]. Attention
tional GAN to complete it [1]. Reed used a pre-trained Char-CNN- processes have already made significant progress in the text- and
RNN network for text encoding and a DCGAN-like network [15] image-related activities [22–25] and they are now being used in
for image decoding to create natural images from vectors. Follow- GANs to generate text-to-image conversions. [3] constructed Attn-

Fig. 1. Previous work in Text-to-Face synthesis.

2
GAN to create an attention mechanism that allows GANs to create tor. The other part of the AttnGAN includes a Deep Attentional
fine-grained high-resolution photographs from natural language Multimodal Similarity Model (DAMSM). With the aid of an atten-
descriptions. MirrorGAN [17] is a text-to-image-to-text network tion mechanism, the DAMSM computes the similarity between
suggested by Qiao et al., which uses a global–local collaborative the generated image and the sentence using both global
focus paradigm. [18] proposed a visual-semantic similarity mea- sentence-level details and fine-grained word-level information.
sure as an aid to measurement metrics since there are no available Consequently, the DAMSM modifies the Generator’s preparation
criteria on how the produced images represent the input descrip- by adding a finer-grained image-text matching loss. We consider
tions. These findings indicate a pattern in which researchers are birds [39], flowers [40], CelebA [13] and Face2Text [27] datasets
increasingly concentrating on improving the accuracy of produced for the study and experimentation.
images and input sentences. With this, we can use it for scripts-to-
storyboard, text-to-architecture and much more. 1.2. Contribution and novelty

Face Synthesis: In summary, the main contributions of this work are:

Image synthesis has been a popular subject in deep learning (i) The study, comparison and analysis of various GAN models
since Goodfellow suggested GAN in 2014 [7,26]. Face synthesis is for photorealistic image generation from textual descrip-
a common research area since there are two broad-scale public tions by experimentation on birds, flowers, and human faces
datasets: CelebA [13] and Face2Text [27]. Almost all state-of-the- datasets.
art networks, like networks based on GAN and networks based (ii) Implementation of the Text-To-Face synthesis using Attn-
on both conditional GAN and conditional GAN, demonstrates their GAN through attention-driven multi-stage refinement for
model’s dominance on face synthesis (such as PGGAN [11], DCGAN photorealistic face image generation and optimisation of
[15,28] CycleGAN [29], BigGAN [30], StyleGAN [31], StarGAN [32] the model by employing DAMSM loss. A model architecture
to name the few). With the advancement of these networks, the is proposed based on the AttnGAN employing the DAMSM
quality of produced face images is steadily improving. Some net- loss.
works now generate 1024 1024 face pictures, which is far greater (iii) Implementation of the trained models on a standalone Rasp-
than the face dataset’s initial picture resolution. These models berry Pi device to ensure more portability, useability and
attempt to learn a mapping between a noise vector and real face accessibility of such an approach.
images that follows the Normal distribution. However, they are
unable to command the network to produce the particular face pic- 1.3. Novelty
ture that they need.
Face synthesis has derived many interesting applications about 1. Experimentation for identifying various aspects of Generative
faces using conditional GAN, such as translating edges to natural models for photorealistic face image generation. This includes
face images [33], exchanging the attributes of two face images the in-depth analysis of the effect of FID scores and DAMSM loss
[34], generating a positive face from the side face [35], generating on image quality and realism.
a full face from the eyes’ region only [36], from face attributes to 2. Comparison and analysis of the results obtained with existing
sketches to natural face images synthesis [37], and face in space methods on CelebA dataset.
[38]. By applying a condition vector to the synthesised face images, 3. Generation of distinct variations in the images as a result of
such networks attempt to monitor them and produce face images semantic alterations in the input text.
that meet the needs of various circumstances. The activities that 4. Implementation on a standalone portable hardware system for
use the input descriptions as the control condition are identical easy application and usability.
to text to face synthesis.
In the context of face generation from the textual description,
1.4. Outline
one of the most relevant applications is to develop criminal images
from the textual description from an eyewitness. In the public
The paper is organised as: Section 2 presents the methods and
safety context, this task has more relative values than text-to-
the background of the architectures employed in detail, followed
image synthesis. Drawing an image for a criminal solely relying
by Section 3, which describes the experiments, evaluations, and
on eyewitness descriptions is a daunting process that takes techni-
results. Section 4 presents the discussion on experiments and the
cal knowledge and extensive practice. Additionally, the individual
results. The paper concludes with an overview and future scope
biases of the eyewitness as well as the artist may creep into the
in Section 5.
process. E.g., different artists may have a different notion of ‘attrac-
tive’ or ‘dark skin’ based on their social and ethnic background.
Such biases may bring inconsistency in the images created and 2. Methods and background
delay the process of finding the criminals. On the other hand, a per-
son who is not an artist can also easily produce photorealistic faces In this section, we present AttnGAN and the further application
of criminals based on eyewitness reports using a well-trained text- of that for text to face synthesis. We begin by explaining how Gen-
to-face model. erative Adversarial Networks (GAN) function. Then we describe
To address this problem, we employ the Attentional Generative AttnGAN and its DAMSM network to carry out text encoding and
Adversarial Network (AttnGAN) [3], which enables fine-grained compute the attention map. This attention map is then utilised
text-to-image creation through attention-driven multi-stage for the task of generating images from their textual description.
refinement. The model is made up of two unique elements. The Further, we describe how AttnGAN is helpful for our problem state-
attentional generative network is the first part, in which the Gen- ment. Finally, we describe the FID Score and evaluate our model.
erator generates an attention system that enables it to draw differ-
ent sub-regions of the image by concentrating on words that are 2.1. Generative Adversarial Networks (GAN)
more relevant to the sub-region that is being drawn. In addition
to the natural language summary being encoded into a global sen- GAN stands for Generative Adversarial Networks, and it is a
tence vector, each word in the sentence is encoded into a word vec- framework for learning a function or program that can produce
3
samples that are quite similar to samples taken from a specified tion of fine-grained images. It generates high-quality images by
training distribution. GANs have become popular very recently. dividing an image into various subregions and then focusing on
The general architecture of a GAN [7] consists of a Generator(G) specific words from the caption relevant to a particular subregion
and a Discriminator(D). Both the Generator and Discriminator are of the image.
separate neural networks. A random input(noise) is given to the The models that have so far been used for text-to-image conver-
Generator and it tries to produce an image close to the actual sion use the entire description and convert it to a vector which is
image. The output of the Generator is then given to the Discrimina- then used for image generation. In this model, instead of the whole
tor. The Discriminator tries to tell the difference between natural sentence, we focus on its constituent words to generate various
and synthetic training data, whereas the Generator tries to deceive image subregions. This ensures a generated image that is visually
the Discriminator. The Discriminator updates the weights depend- closer to the actual image. Different words are used to produce dif-
ing on whether it predicts the image generated by the Generator as ferent parts of the final image according to the sub-region that they
real or fake. If it predicts the image to be fake, then an update in the are most relevant to. The detailed architecture for AttnGAN is
weights takes place. The duty of the Generator here is to keep on shown in Fig. 2.
producing images that seemingly look real. It does so till the time The text description, containing T words, is input to the text
that the Discriminator predicts them as real images. So essentially, encoder. The text encoder is a bidirectional LSTM. This means that
the Generator and Discriminator participate in a minimax game. the input caption is trained on two LSTMs instead of the usual one
Equations (1) and (2) from [41] describe this minimax game. LSTM. Essentially this does the job of concatenating the hidden
states from the forward and backward direction for all timesteps
1 1
J ðDÞ ¼ ExPdata logDðxÞ Ez log ð1 DðGðzÞÞÞ ð1Þ and outputs a final hidden state. This final hidden state in the case
2 2
of this architecture is the sentence feature, represented by

J ðGÞ ¼ J ðDÞ ð2Þ e e 2 RD . HereD represents the working dimension for the words.
Since there are T words, hence another matrix ewith the dimension
Pdata is the probability distribution of given data, and J (D) is the e 2 RDT represents the word features. In a nutshell, sentence fea-
discriminator cost, and J (G) is the generator cost. The Nash tures may be considered as the final hidden state, while the word
Equilibrium of this game, according to Goodfellow et al. [7], is features are the hidden states from all timesteps.
when samples generated by G are indistinguishable from samples
The sentence features are passed on to F ca for conditional aug-
derived from training data (provided G and D have sufficient
mentation. F ca is modelled as a neural network. All the equations
capacity).
in this section are based on the mathematical discussion presented
in [3].
2.2. StackGAN
The output after Conditional Augmentation, c, is given by:

This GAN [8] is typically used for synthesising images from tex- c ¼ l þ re; e N ð0; IÞ ð3Þ
tual description. It breaks down the text-to-image generation pro-
Since the same description can describe several images, eðnoiseÞ is
cess into two stages, as mentioned below:
added here to introduce variation in the generated images.
1) Stage-I GAN
Typically, the input to the Generator in a GAN is only a noise
Stage-I GAN focuses on drawing only rough shapes and appro-
vector (z). But since to generate the final image, the Generator
priate colors from the textual definition. It creates a low-
needs to be conditioned on the input description, so here we use
resolution image by drawing the context layout from a random
Conditional GANs. Accordingly, cand z(noise vector) are concate-
noise vector. It generally produces 64x64 images.
nated and are fed as input to the generator network.
2) Stage-II GAN
The architecture can be considered to have mgenerators.
Stage-II GAN is built upon Stage-I GAN results, and so it pro-
F 0 is responsible for most of the upsampling. The scale factor for
duces high-resolution images. Low-resolution images generated
upsampling is 2. F 0 does not use word-level features. The context
by Stage-I GAN are generally devoid of realistic object parts and
vector at this stage, h0 , is given by:
might have distortions of shape. The Stage-II GAN takes into con-
sideration the text ignored in Stage-I to generate images with more

h0 ¼ F 0 z; F ca ðeÞ ð4Þ
natural details. It generates 256x256 images.
The output from F 0 ði:e:h0 Þ along with the word features (e) is
2.3. PGGAN taken as input by the attention network. To do this, a perceptron
layer is added to transform the word features into a common
PGGAN [11] is short for Progressively Growing GAN. PGGAN is semantic space of image features. This may be represented as:
used to produce ultra-high-resolution images by increasing the 0 b b represents the network’s internal
e ¼ Ue; where U 2 R D D . Here D
network layers as training goes on by first training a model to gen- working dimension.
erate 4x4 image and add layers to generate 8x8, 16x16 images and 0
Together with h0 , e is given as input to the attention network.
so on.
Thus, we get a word-context vector for every subregion.
The most significant difference between PGGAN and StackGAN
The word-context vector may be understood as a score to relate
is that the network structure of the latter is fixed. However, in
all T words to all N subregions and is a measure of how relevant a
PGGANs, as the training progresses, the network structure contin-
word would be to a particular region. This is how specific words
ues to change. The most significant benefit of doing this is that
are selected for generating specific regions of the final image. The
most iterations are done at lower resolutions, and the training th
speed is faster than traditional GANs. word context vector for the j subregion is given as:

0
XT1 exp sj;i
2.4. AttnGAN cj ¼ b e
0
; where b ¼
0 0
and sj;i ¼ hTj ei ð5Þ
i¼0 j;i i j;i PT1 0
k¼0 exp sj;k
AttnGAN or Attentional Generative Adversarial Network [3] is
an attention-driven architecture that enables text-to-image con- On doing this for each region, we get the output for the atten-
version. The architecture involves multiple stages for the generation network, which is:
4
Fig. 2. The Architecture for AttnGAN for text-to-face synthesis. Each attention model automatically retrieves the conditions (i.e., the most critical word vectors) for generating
various sub-regions of the image; the DAMSM provides the fine-grained image-text matching failure for the generative network.

b 1 1
F attn ðe; hÞ ¼ ðc0 ; c1 ; ::::; cN1 Þ 2 R D N ð6Þ LDi ¼ Exi Pdatai log ðDi ðxi Þ E b
2 2 xi PGi

For F 1 , there are two inputs, h0 from F 0 and the word-context 1 1
vector from the attention network. It consists of residual blocks, log 1 Di xbi þ Exi Pdatai log ðDi ðxi ; eÞ E b
2 2 xi PGi
which make the network deeper, and an upsampling layer. It uses
b
log 1 Di xi ; e ð11Þ
word-level features from F attn attn
1 : Here F i is the attention model at
th
i stage of AttnGAN. The context vectors henceforth may be gener- The attentional network’s final objective function is provided
alised as: by:
Xm1
hi ¼ F i hi1 ; F attn
i ðe; hi1 Þ fori ¼ 1; 2; :::; m 1; ð7Þ L ¼ LG þ kLDAMSM ; where LG ¼ LGi ð12Þ
i¼0

The second attention network, F attn

2 and F 2 have the exact func-
tionality and structure as F attn
1 and F 1 . They only differ in inputs. 2.5. Deep attentional multimodal similarity model (DAMSM) model
The number of generators can be adjusted according to the size
of the image to be produced. The more the number of generators, The paper [3] also introduces a deep attentional multimodal
the more is the size of the image. similarity model to determine a loss indicating how the generated
Each of the F generators is associated with a G generator. They fine-grained image matches its corresponding text/caption. This is
consist of convolutional blocks which bring down the number of used for training the Generator.
channels to, i.e., an RGB image. The generated image, xbi , is given DAMSM tries to check whether the generated images follow the
by: written textual description. It does so with the help of two neural
networks. Essentially, it tries to calculate the similarity between
xbi ¼ Gi ðhi Þ ð8Þ
the image and the text to generate an image with better detailing.
The overall generator loss is the summation of all themgenera- An image encoder is required to compute the DAMSM loss. It
tors present in the network this is given by takes the generated image as input. It is based on the Inception-
Xm1 v3 model [42], which has been pre-trained on ImageNet [43]. Then,
LG ¼ i¼0
LGi ð9Þ from the ‘‘mixed_6e” layer of Inception-v3, we extract the local
The discriminators placed after the generated image represent feature matrix f 2 R768289 (reshaped from 768 17 17). The fea-
whether or not the generated image justifies the input caption. ture vector of a local image area is represented by each column of f .
For this, one of its inputs is c, which is the output after conditional The local function vector has a dimension of 768, and the picture
augmentation. The other input is the generated image. has289regions. Meanwhile, the global function vector f 2 R2048 is
Losses: derived from Inception-v30 s last average pooling layer. All equa-
th
The adversarial loss for i generator, Gi , is: tions in this section are proposed in [3].
Finally, as seen in below Eq, we bring the image features to a
1 1 common space.
LGi ¼ E b log Di xbi E b log Di xbi ; e ð10Þ
2 xi PGi 2 xi PGi
t ¼ Wf ; t 2 RD289 ð13Þ
Here the unconditional loss tells whether or not the image is
real, and the conditional part tells whether the generated image
t ¼ Wf ; t ¼ RD ð14Þ
and the input description belong to the same pair.
Cross entropy loss for Discriminator, After this, the similarity matrix is computed, which is given by:
5
XM
s ¼ eT t; s 2 RT289 ð15Þ Lw
2 ¼ i¼1
logPðQ i jDi Þ; where PðQ i jDi Þ
It was found that normalising the similarity matrix gives better exp ðc3 RðQ i ; Di ÞÞ
results, so the normalised matrix is: ¼ PM ð23Þ
j¼1 exp c3 R Q j ; Di

exp si;j
si;j ¼ PT1 ð16Þ here P ðDi jQ i Þ is the posterior probability of the sentence Di being
k¼0 exp sk;j matched with the image Q i . It is given by
Here, a region context vector,ci , is calculated. This is different
exp ðc3 RðQ i ; Di ÞÞ
from the word-context vector. In the word-context vector, we were PðDi jQ i Þ ¼ PM ð24Þ
looking at all the words and estimating how relevant they would j¼1 exp c3 R Q i ; Dj
be to a particular region so that they may be used to generate that
sub-region. As opposed to that, in the regional context vector (cal-
culated for a word), we look at a single word at a time and look at 3. Experiments
all the sub-regions. This is done to understand whether that partic-
ular word was significant in the generation of a particular subre- In this section, we begin by studying the important components
gion. The region context vector is defined as follows: of StackGAN and PGGAN to understand the architecture for text-
to-image synthesis. We also look at AttnGAN’s key components,
X288 exp c1 si;j
ci ¼ a m ; where aj ¼ P288
j¼0 j j
c1 such as the attentional generative network and the DAMSM. Addi-
k¼0 exp c1 si;k tionally, we analyse why AttnGAN is a better architecture for com-
: attention scaling factor ð17Þ plex scenes by implementing a specific application, namely, text-
to-face synthesis and implementing it in Raspberry Pi.
th
The relevance between i word and the image is also calculated.
This tells us how important each word is in the generation of the 3.1. Experiments on CUB-200 and Oxford-102 flowers dataset
image:
cTi ei To understand the architecture for generating images from their
Rðci ; ei Þ ¼ ð18Þ textual descriptions, we experimented with StackGAN and PGGAN
jjci jjjjei jj
[44]. The experiments were carried out using the Caltech CUB-200
Finally, the attention-driven image-text matching score dataset [39] and Oxford-102 Flower’s dataset [40], for which the
between the image and its text description is given by: results are shown in Figs. 3, 4 and 5.
XT1 c1 Although StackGAN and PGGAN work well on datasets of birds
RðQ ; DÞ ¼ log i¼1
expð c2 R ðc i ; ei ÞÞ
2
and flowers, they cease to produce visually satisfactory results on
complex datasets like the COCO dataset or CelebA dataset. This
c2 : word to region context pair v ector ð19Þ can be observed from the results provided in previous studies [45].

A similar score is calculated on the sentence level as well. It is

given by : 3.2. Experiments on CelebA and Face2Text datasets

vTe While StackGAN and PGGAN lack performance on the COCO

RðQ ; DÞ ¼ ð20Þ
ðjjv jjjjejjÞ dataset, AttnGAN has been used on the COCO dataset [12]. This
dataset on [3] produced an Inception Score of 25.89 ± 0.47 as
The final DAMSM loss is given by:
opposed to 8.45 ± 0.03 and 9.58 ± 0.21, respectively, on the former
LDAMSM ¼ Lw w s s
1 þ L2 þ L1 þ L2 ð21Þ two architectures. Based on this, we deciphered that AttnGAN
would work well for the Faces dataset as well.
XM
Lw Initially, our methodology is evaluated using CelebA datasets.
1 ¼ i¼1
logPðDi jQ i Þ ð22Þ
We adopted and preprocessed the CelebA and Face2Text dataset

Fig. 3. Some examples of our generated images (64x64) by StackGAN Stage-I on Caltech CUB-200 and Oxford-102 datasets.

6
Fig. 4. Some examples of our generated images (256x256) by StackGAN Stage-II on Caltech CUB-200 and Oxford-102 flowers dataset.

Fig. 5. Some examples of our generated images by the PGGAN Oxford-102 flowers dataset.

using the methods described in [3] to assess the efficacy of our reducing the number of captions to be tested by the Generator dur-
approach. ing training. The output files were compared with that of the T2F
The Face2Text Dataset [27] was utilised as an experiment to see Project on Github [45]. According to the visual comparison, Attn-
how AttnGAN could handle generating images for the face dataset. GAN generated higher-quality images than T2F (as shown in Fig. 6)
The dataset contains 400 images, the majority of which have three Since our results on the Face2Text dataset were visually
captions per image. We later reduced this to 2 captions per image promising, as seen from Fig. 6, we carried out further experiments
by increasing the number of words in a sentence and thereby with a bigger dataset, namely the CelebA dataset [13].

7
Fig. 6. The AttnGAN model generates the images on the left and the right images are generated examples from StackGAN and PGGAN models from [45], trained and tested it
on the Face2Text dataset.

The Celeb Faces Attributes Dataset (CelebA) [13] is a large-scale observed for real image-text pairs. As a result, the image and text
face attribute dataset. It has over 200 K celebrity images and 40 encoders learned how to extract global feature vectors from pro-
attribute annotations for each image. The images in this collection duced images and text descriptions. Further, as the attention
cover a wide range of poses as well as the clutter in the back- GAN model was being trained, this pre-trained DAMSM model
ground. CelebA has rich annotated information. We used 10.2 k computed the LDAMSM loss for each iteration.
images from the CelebA dataset to train the AttnGAN model. Since The DAMSM loss is governed by a parameter k, and the total
the dataset lacks official captions, captions were sourced from [46]. loss of the model is given by the equation (12). To test LDAMSM
Each image has ten captions that cover all of the image’s attributes. The value of k is tuned from k = 0 to 5. The results obtained for var-
An attention map for each word in the input statement, as ious kvalues are shown in Fig. 10.
shown in Figs. 7 and 8, is generated. In attention maps, words that These results show that appropriately raising LDAMSM weight
are of use while producing a particular sub-region are highlighted. results in higher-quality images that are better trained on the given
In the case of text-to-face, these include the words that describe input descriptions. This is because increased LDAMSM provides word-
the attributes of the face. When generating images, this shows level matching information, which helps train the Generator in a
where the network would concentrate with each word. When better way. The CelebA dataset was also trained onk = 50. But each
responding to certain terms, the induced attention maps essen- time, the training resulted in a mode collapse which was unlike the
tially fit the concentrating region of the human brain. The gener- case when AttnGAN was trained on the COCO dataset by [3].
ated face images have a high level of continuity with the input This work aims not only to generate better quality (having more
sentences. However, sometimes the attention maps fail to repre- similarity to the textual description) images but also the images
sent the captions accurately, as shown in Fig. 9. that retain realism and are visually more realistic. As can be seen
The DAMSM Model was initially trained for each dataset until from Fig. 10, the images generated for k ¼ 3are more realistic than
no significant changes in the sentence and word loss were k ¼ 5:

Fig. 7. Attention Maps of the Generated example of the text-to-face synthesis. The image shown has the input description as ‘‘woman has bushy eyebrows with a smile.”

8
Fig. 8. Attention Maps of the Generated example of the text-to-face synthesis. The image shown has the input description as ‘‘The woman wearing earrings has smile arched
eyebrow”.

Fig. 9. The image shown has the input description as ‘‘the attractive man has black hair”. It is observed in the image that the hair attribute has not been correctly represented
through its attention map.

3.3. Evaluation early stopping (450 Epochs) are better than late stopping (650
Epochs) for both k = 3 and k = 5 as shown in Table1.
The FID score [47] is used to assess the image consistency of Since the FID score cannot indicate whether the produced
synthetically generated faces. Text-to-image synthesis, in general, image from the captions is well conditioned on the text description
uses Inception Score as a metric. Standard practice is to use a provided. Therefore, we use R-precision, an evaluation metric for
pre-trained Inception-V3 network that is fine-tuned on a specific rating the retrieval performances, as an additional evaluation met-
images dataset to measure Inception Score to determine network ric for the task of text-to-image synthesis. If there are R appropri-
outcomes. This is reported in [3] for the CUB dataset. However, ate documents that are applicable to a query, we review the top ‘R’
there is no pre-trained Inception-V3 model for the face dataset. ranked obtained results of the method and discover that ‘r’ is rele-
As a result, we switched to the FID score, which is another often- vant, and therefore, R-precision is given by ‘r/R’. We have per-
used metric for measuring image synthesis and can be thought of formed a retrieval experiment in which we have use validation
as a more powerful variant of the Inception Score (IS) as it is images to query the text that corresponds to them. To begin with,
more robust to noise than IS. the global feature vectors of the output images and their text
FID is a metric for comparing the resemblance between two descriptions are extracted using the image and text encoders
image datasets. It is found to associate well with human visual learned in DAMSM.
content judgments and is used to assess the quality of Generative The next step is to compute the cosine similarity between the
Adversarial Networks samples. Fréchet distance between two global image and global text vector and, lastly to calculate the R-
Gaussians that have been fitted to feature representations of the precision. The candidate texts are ranked for every image in the
Inception network, is used to calculate the FID Score. It is also order of descending similarity and select the top r valid descrip-
essential to test the model with a minimum of 10 k images to tions. The model produces 11,000 photographs from randomly
obtain appropriate and truthful FID scores [48]. In this work, we chosen unseen captions to calculate the R-precision. For each
tested on 11 k images to evaluate the FID SCORE. The best value query picture, the candidate text descriptions consist of single
of FID is obtained for k ¼ 5; as seen in Table 1. A lower FID score ground truth (i.e., R = 1) and 99 randomly selected descriptions
suggests greater image quality; however, not necessarily better that don’t fit. Table 1 shows the FID scores and R-precision
realism since our images have better realism for k ¼ 3 (Fig. 10) achieved for different k values.
but a higher FID score (Table 1).
The stop criteria for the GAN model is when it reaches the Nash 3.4. Experimental setup for a standalone device
Equilibrium. But since we typically employ the SGD, the loss of
both G and D models oscillates and never reach Nash Equilibrium. The birds and face trained model was deployed on a Raspberry
So one of the better methods to stop the GAN training is by visually Pi 4 Model B (4 GB RAM). The Raspberry Pi was interfaced with the
inspecting the generated images and early stop if there is no visu- VNC Viewer App. All the dependencies, PyTorch wheel file, pre-
ally perceived improvement in the generated images. In this work, trained models and the code were put onto a 16 GB MicroSD Card
we applied the early stopping and observed that the FID scores for and the evaluation code was executed from the command window.
9
Fig. 10. Output image for given captions for different k values.

Due to the restriction of RAM, it generated only three images from put was approximately 14 to 15 s. This testing on a standalone
their corresponding captions. Any more input captions resulted in device was done to check how optimised and efficient the models
an ’Out of memory’ error. The response time from the input to out- are. Figs. 11 and 13 show the Input caption and Output image on

10
Table 1 In [49,50], a multimodal CELEBA-HQ dataset is used. The data-
The best FID score and the corresponding R precision rate of the AttnGAN model on
set consists of 30 k high-resolution face images, each having a
CelebA dataset. More results in Fig. 10.
high-quality segmentation mask, sketch, and descriptive text. Sty-
Method FID SCORE R-precision (%) leGAN is used for face generation with a FID score reported as
600 epochs 106.37 and 101.42, respectively. In [41], DC GAN is used for face
AttnGAN2, k = 0 No DAMSAM 53:11 11:71 0:01 image generation with IS of 1.4 ± 0.7 and the limitations of using
AttnGAN2, k = 1 50:93 15:33 0:01 inception score as an evaluation metric for faces datasets are also
AttnGAN2, k = 3 56:41 26:83 0:01
AttnGAN2, k = 5 48:27 38:66 0:01
discussed. In [51], a smaller subset of CelebA named as SCU-
Text2face dataset is used. Two hundred samples are used for test-
Early stopping, 450 epochs
k = 3, 450 epochs 55.44 27.30 ± 0.0102
ing with a reported FID score of 44.49. However, the base FID paper
k = 5, 450 epochs 40.73 39.60 ± 0.0203 [48] states that a minimum of 10 k testing images must be used for
generating valid FID scores. As opposed to this, we have used 11 k
the VNC viewer App. Figs. 12 and 14 show the total time for pre- testing images from CelebA dataset and obtained a FID score of
dicting the output on Raspberry Pi 4 Model B (4 GB RAM). 40.73 fork ¼ 5

4. Discussion 4.2. Effect of semantic alteration on the images

4.1. Comparison The AttnGAN is not only capable of generating images of high
resolution. In addition, it can also consider all the attributes men-
Table 2 shows existing work done in the fields of Text-to-image tioned in the input caption. Removing or replacing a certain key-
generation with the CelebA dataset. A number of approaches and word in the input drastically impacts the output image. An
methodologies have been proposed. example of this can be seen in Figs. 15 and 16.

Fig. 11. Input caption and Output image for the birds dataset on the VNC viewer App.

Fig. 12. Total time for predicting the output on Raspberry Pi 4 Model B (4 GB RAM) for birds data.

11
Fig. 13. Input caption and Output image for CelebA dataset on the VNC viewer App.

Fig. 14. Total time for predicting the output on Raspberry Pi 4 Model B (4 GB RAM) for CelebA data.

By altering some of the words in the input descriptions, we can images in relation to the ground truth corresponding to the input
observe how responsive the output images are to alterations in the textual description. It correlates well with the quality of the image.
input sentences. It shows how the generated visuals are altered in However, it does not necessarily represent the realism of the
response to alterations in the input phrases, demonstrating that images. Therefore, the lower FID score, although is a metric of bet-
the model can detect even minor semantic alterations in the writ- ter image generation based on the textual description, does not
ten description. necessarily mean the more realistic images. This is owing to the
k parameter and its effect on image generation. The qualitative
4.3. Challenges analysis of generated images shown in Fig. 10 represents that for
k = 3 the images are more real-looking than for k = 5. However,
4.3.1. Bias in the dataset Table 1 exhibits that the FID score for k = 3 is more than that of
Both CelebA and Face2Text datasets are primarily focused on k = 5. The balance between image quality vs realism is an impor-
Caucasian ethnicity and have an over-representation of the same. tant challenge of the generative models.
However, it does not contain balance samples for fair and dark-
skinned people leading to the under-representation of these ethnic
groups. This is one of the limitations of these datasets [52] and to 4.3.3. Use of better text encoding methods
remedy this, a balanced dataset that has an equal and unbiased The transformers are typically considered to perform better
representation of ethnicity and gender must be developed. This than LSTM, as reported in the literature [25]. However, there are
kind of unbalanced dataset gives rise to unethical outcomes of a few approaches reported in recent times which suggest that
the AI models. transformers may not be the ultimate solution. In [53], authors
propose that in the context of language models, convolutional
4.3.2. Realism vs quality models may prove competitive to Transformers when pre-
We have observed that the AttnGAN occasionally produces pho- trained. Also, in [54], it is suggested that replacing BERT with a lin-
tos that are clear and detailed. However, not necessarily realistic. ear transform such as Fourier transform proves to be exceptionally
FID score serves as a metric to evaluate the quality of the generated faster in real-time GPU implementations. Also, in [55], an approach
12
Table 2
Prominent work done in the fields of Text-to-image generation with the CelebA dataset.

Dataset Approach Highest Output Image Resolution Metrics

Xia et al. [49] MULTI-MODAL CELEBA-HQ StyleGAN inversion module, 1024x1024 FID: 106.37
(has 30 k high-resolution visual-linguistic similarity Other Metrics used: LPIPS
face images, each having a learning, and instance-level (Learned Perceptual Image
high-quality segmentation optimisation. Patch Similarity), Accuracy,
mask, sketch, and Realism
descriptive text)
Xia et.al. [50] Multi-Modal CelebA-HQ (Builds on TediGAN-A [49]) 1024x1024 FID:101.42
((has 30 k high-resolution Other Metrics used: LPIPS
face images, each having a (Learned Perceptual Image
high-quality segmentation Patch Similarity), Accuracy,
mask, sketch, and Realism
descriptive text)
Nasir et al. [41] CelebA: DC-GAN with GAN-CLS loss 64x64 IS: 1.4 ± 0.7
7500 training,
2500 testing
Chen et al. [51] SCU-Text2face: 1000 FTGAN 256x256 FID: 44.49,Other Metrics
training, used: IS, FSD, FSS
200 testing
Ours CelebA: 10.2 k samples, AttnGAN 256 256 FID: 40.73 (best with k ¼ 5
early stopping. and early stopping after 450
epochs)

Fig. 15. The figure demonstrates the effect of specific words on the output generated. In the image, the word ‘old’ significantly affects how the face of the lady is generated.
This demonstrates how AttnGAN can detect minor semantic alterations.

Fig. 16. Other examples showcasing how the words ‘attractive’ and ‘chubby’ are learned by the AttnGAN model.

13
based on auto-encoders with transformers is suggested for text-to- Declaration of Competing Interest
image generation.
In summary, there are multiple approaches to the task of text- The authors declare that they have no known competing finan-
to-image generation in general and text encoding in particular. cial interests or personal relationships that could have appeared
We experimented with the simplest method of text encoding since to influence the work reported in this paper.
the focus of this work is specifically on understanding the GAN
models and experimenting with them extensively for face image References
generation. Our contribution lies in identifying various approaches
of implementations of GANs for better realistic images, comparing [1] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, H. Lee, in 33rd International
Conference on Machine Learning, 2016, 2016,.
and analysing the various evaluation and performance methods to
[2] R. Zhou, C. Jiang, Q. Xu, A survey on generative adversarial network-based text-
strike a balance between realism and quality, optimization via to-image synthesis, Neurocomputing 451 (2021) 316–336, https://doi.org/
DAMSM loss and most importantly, handling of the limitations 10.1016/j.neucom.2021.04.069.
posed by inconsistent human participation. Hence we chose to [3] T. Xu et al., ‘‘AttnGAN: Fine-Grained Text to Image Generation with Attentional
Generative Adversarial Networks,” 2018. doi: 10.1109/CVPR.2018.00143.
implement the text encodings via LSTM. However, we believe that [4] N. Zeng, H. Li, Z. Wang, W. Liu, S. Liu, F.E. Alsaadi, X. Liu, Deep-reinforcement-
replacing the LSTM with the transformer attention model will learning-based images segmentation for quantitative analysis of gold
improve the performance, which will be the further extension of immunochromatographic strip, Neurocomputing 425 (2021) 173–180,
https://doi.org/10.1016/j.neucom.2020.04.001.
this work. [5] N. Zeng, Z. Wang, H. Zhang, K.-E. Kim, Y. Li, X. Liu, An Improved Particle Filter
with a Novel Hybrid Proposal Distribution for Quantitative Analysis of Gold
Immunochromatographic Strips, IEEE Transactions on Nanotechnology 18
5. Conclusion (2019) 819–829, https://doi.org/10.1109/TNANO.772910.1109/
TNANO.2019.2932271.
[6] N. Zeng et al., ‘‘Image-based quantitative analysis of gold
In general, Photorealistic image generation from its description immunochromatographic strip via cellular neural network approach,” IEEE
is constrained on its dataset. Every word of the caption has an Transactions on Medical Imaging, vol. 33, no. 5, 2014, doi: 10.1109/
impact on the quality of the output image. In the case of text-to- TMI.2014.2305394.
[7] I. J. Goodfellow et al., ‘‘Generative Adversarial Networks,” Communications of
face generation, if the dataset consists of more prior information the ACM, vol. 63, no. 11, pp. 139–144, Jun. 2014, Accessed: Jul. 14, 2021.
on a face rather than focusing on selected attributes, it certainly [Online]. Available: https://arxiv.org/abs/1406.2661v1
increases the quality of the obtained results. Therefore, in this [8] H. Zhang et al., ‘‘StackGAN: Text to Photo-Realistic Image Synthesis with
Stacked Generative Adversarial Networks,” in Proceedings of the IEEE
work, we proposed the implementation of text-to-face synthesis
International Conference on Computer Vision, 2017, vol. 2017-October. doi:
using AttnGAN. Initially, experiments were conducted on Stack- 10.1109/ICCV.2017.629.
GAN and PGGAN for the Birds and Flowers dataset. But owing to [9] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, ‘‘Learning what and
where to draw,” 2016.
the lack of performance of these architectures on more complex
[10] H. Zhang et al., ‘‘StackGAN++: Realistic Image Synthesis with Stacked
datasets and lack of focusing attention to a specific attribute, Attn- Generative Adversarial Networks,” IEEE Transactions on Pattern Analysis and
GAN was employed. AttnGAN was used on the Face2Text dataset Machine Intelligence, vol. 41, no. 8, 2019, doi: 10.1109/TPAMI.2018.2856256.
and CelebA dataset. [11] T. Karras, T. Aila, S. Laine, and J. Lehtinen, ‘‘Progressive Growing of GANs for
Improved Quality, Stability, and Variation,” 6th International Conference on
The model was first implemented on the Face2Text dataset. Fol- Learning Representations, ICLR 2018 - Conference Track Proceedings, Oct.
lowing this, we trained the model on 10.2 k images from the Cel- 2017, Accessed: Jul. 14, 2021. [Online]. Available: https://arxiv.org/abs/
ebA dataset. DAMSM loss was considered for optimisation, and 1710.10196v3
[12] T. Y. Lin et al., ‘‘Microsoft COCO: Common objects in context,” in Lecture Notes
we experimented with variouskvalues. The results obtained by in Computer Science (including subseries Lecture Notes in Artificial
our model are compared with the existing models employed on Intelligence and Lecture Notes in Bioinformatics), 2014, vol. 8693 LNCS, no.
the CelebA dataset. Our model outperforms the other approaches PART 5. doi: 10.1007/978-3-319-10602-1_48.
[13] Z. Liu, P. Luo, X. Wang, and X. Tang, ‘‘Deep Learning Face Attributes in the
in terms of using the required number of testing samples and gen- Wild.” pp. 3730–3738, 2015. Accessed: Jul. 14, 2021. [Online]. Available:
erating the lowest FID scores. We also studied and demonstrated http://personal.ie.cuhk.edu.hk/
the effect of semantic alterations on the generated images. Such [14] M. Mirza and S. Osindero, ‘‘Conditional Generative Adversarial Nets,” Nov.
2014, Accessed: Jul. 14, 2021. [Online]. Available: https://arxiv.org/abs/
images are very similar to each other. However, they have distinct
1411.1784v1
variations introduced due to semantic alterations. The effect of FID [15] A. Radford, L. Metz, and S. Chintala, ‘‘Unsupervised Representation Learning
and k values on the quality and realism of the image is analysed, with Deep Convolutional Generative Adversarial Networks,” 4th International
and an early stopping method is implemented to achieve the bal- Conference on Learning Representations, ICLR 2016 - Conference Track
Proceedings, Nov. 2015, [Online]. Available: https://arxiv.org/abs/
ance between the same. Certain challenges, specifically due to 1511.06434v2
the bias in the datasets, are also discussed. Finally, we deployed [16] H. Dong, S. Yu, C. Wu, and Y. Guo, ‘‘Semantic Image Synthesis via Adversarial
these trained models of the birds and faces dataset on a Raspberry Learning,” in Proceedings of the IEEE International Conference on Computer
Vision, 2017, vol. 2017-October. doi: 10.1109/ICCV.2017.608.
Pi to achieve real-world usability, accessibility and portability of [17] T. Qiao, J. Zhang, D. Xu, D. Tao, in: in Proceedings of the IEEE Computer Society
this framework. Deploying the model as an API has enormous pro- Conference on Computer Vision and Pattern Recognition, 2019, p. 2019-June.,
mise in the field of public safety and increased useability. Future https://doi.org/10.1109/CVPR.2019.00160.
[18] Z. Zhang, Y. Xie, and L. Yang, ‘‘Photographic Text-to-Image Synthesis with a
work may focus on capturing global coherent structures as well Hierarchically-Nested Adversarial Network,” 2018. doi: 10.1109/
as employing the attention trasnfromers model for advanced text CVPR.2018.00649.
encoding. [19] S. Hong, D. Yang, J. Choi, and H. Lee, ‘‘Inferring Semantic Layout for Hierarchical
Text-to-Image Synthesis,” 2018. doi: 10.1109/CVPR.2018.00833.
[20] S. Sharma, D. Suhubdy, V. Michalski, S. E. Kahou, and Y. Bengio, ‘‘ChatPainter:
Improving text to image generation using dialogue,” 2018.
CRediT authorship contribution statement
[21] H. Dong, J. Zhang, D. McIlwraith, and Y. Guo, ‘‘I2T2I: Learning text to image
synthesis with textual data augmentation,” in Proceedings - International
Sharad Pande: Software, Validation, Writing – original draft. Conference on Image Processing, ICIP, 2018, vol. 2017-September. doi:
10.1109/ICIP.2017.8296635.
Srishti Chouhan: Software, Writing – original draft, Visualization,
[22] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, ‘‘Stacked attention networks for
Validation. Ritesh Sonavane: Software, Writing – original draft. image question answering,” in Proceedings of the IEEE Computer Society
Rahee Walambe: Conceptualization, Methodology, Wiriting - orig- Conference on Computer Vision and Pattern Recognition, 2016, vol. 2016-
inal draft, Supervision. George Ghinea: Review and Suggestions. December. doi: 10.1109/CVPR.2016.10.
[23] K. Xu et al., ‘‘Show, Attend and Tell: Neural Image Caption Generation with
Ketan Kotecha: Conceptualization, Methodology, Supervision, Visual Attention,” in Proceedings of the 32nd International Conference on
Validation. Machine Learning, Jul. 2015, vol. 3.

14
[24] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena, ‘‘Self-Attention Generative [52] E. M. Rudd, M. Günther, and T. E. Boult, ‘‘MOON: A mixed objective
Adversarial Networks,” in Proceedings of the 36th International Conference on optimization network for the recognition of facial attributes,” in Lecture
Machine Learning, Jul. 2019, vol. 2019-June. Notes in Computer Science (including subseries Lecture Notes in Artificial
[25] A. Vaswani et al., ‘‘Attention Is All You Need,” Advances in Neural Information Intelligence and Lecture Notes in Bioinformatics), 2016, vol. 9909 LNCS. doi:
Processing Systems, vol. 2017-December, Jun. 2017. 10.1007/978-3-319-46454-1_2.
[26] L. Ye, B. Zhang, M. Yang, W. Lian, Triple-translation GAN with multi-layer [53] Y. Tay et al., ‘‘Are Pre-trained Convolutions Better than Pre-trained
sparse representation for face image synthesis, Neurocomputing 358 (2019) Transformers?,” arXiv preprint arXiv:2105.03322, 2021.
294–308, https://doi.org/10.1016/j.neucom.2019.04.074. [54] ‘‘Google Replaces BERT Self-Attention with Fourier Transform: 92% Accuracy, 7
[27] A. Gatt et al., ‘‘Face2Text: Collecting an annotated image description corpus for Times Faster on GPUs | Synced.” https://syncedreview.com/2021/05/14/
the generation of rich face descriptions,” 2019. deepmind-podracer-tpu-based-rl-frameworks-deliver-exceptional-
[28] J. He, J. Zheng, Y. Shen, Y. Guo, H. Zhou, Facial Image Synthesis and Super- performance-at-low-cost-19/amp/ (accessed Jul. 14, 2021).
Resolution With Stacked Generative Adversarial Network, Neurocomputing [55] N. A. Fotedar and J. H. Wang, ‘‘Bumblebee: Text-to-Image Generation with
402 (2020) 359–365, https://doi.org/10.1016/j.neucom.2020.03.107. Transformers”, Accessed: Jul. 14, 2021. [Online]. Available: https://web.
[29] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, ‘‘Unpaired Image-to-Image stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/custom/15709283.
Translation Using Cycle-Consistent Adversarial Networks,” in Proceedings of pdf
the IEEE International Conference on Computer Vision, 2017, vol. 2017-
October. doi: 10.1109/ICCV.2017.244.
[30] A. Brock, J. Donahue, and K. Simonyan, ‘‘Large Scale GAN Training for High
Fidelity Natural Image Synthesis,” 7th International Conference on Learning Sharad Pande is a undergraduate student at Symbiosis
Representations, ICLR 2019, Sep. 2019. Institute of Technology. He is pursuing the Bachelors of
[31] T. Karras, S. Laine, and T. Aila, ‘‘A style-based generator architecture for Technology, majoring in Electronics and Telecommuni-
generative adversarial networks,” in Proceedings of the IEEE Computer Society cation. He has keen interest in machine learning and
Conference on Computer Vision and Pattern Recognition, 2019, vol. 2019-June. data science. For the past few years he us working in the
doi: 10.1109/CVPR.2019.00453. area of GANs and their use for various generative tasks.
[32] Y. Choi, M. Choi, M. Kim, J. W. Ha, S. Kim, and J. Choo, ‘‘StarGAN: Unified
Generative Adversarial Networks for Multi-domain Image-to-Image
Translation,” 2018. doi: 10.1109/CVPR.2018.00916.
[33] T. C. Wang, M. Y. Liu, J. Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, ‘‘High-
Resolution Image Synthesis and Semantic Manipulation with Conditional
GANs,” 2018. doi: 10.1109/CVPR.2018.00917.
[34] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua, ‘‘Towards Open-Set Identity
Preserving Face Synthesis,” 2018. doi: 10.1109/CVPR.2018.00702.
[35] R. Huang, S. Zhang, T. Li, and R. He, ‘‘Beyond Face Rotation: Global and Local
Perception GAN for Photorealistic and Identity Preserving Frontal View
Synthesis,” Proceedings of the IEEE International Conference on Computer Srishti Chouhan is a undergraduate student at Sym-
Vision, vol. 2017-October, 2017, doi: 10.1109/ICCV.2017.267. biosis Institute of Technology. She is pursuing the
[36] X. Chen, L. Qing, X. He, J. Su, Y. Peng, From Eyes to Face Synthesis: A New Bachelors of Technology, majoring in Electronics and
Approach for Human-Centered Smart Surveillance, IEEE Access 6 (2018) Telecommunication. She has keen interest in deep
14567–14575, https://doi.org/10.1109/ACCESS.2018.2803787. learning and data analysis. For the past few years she is
[37] X. Di and V. M. Patel, ‘‘Face synthesis from visual attributes via sketch using working in the area of image processing, GANS and deep
conditional vaes and gans,” arXiv preprint arXiv:1801.00077, 2017. learning methods for various applications.
[38] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, T.S. Huang, ‘‘Generative Image Inpainting
with Contextual Attention” (2018), https://doi.org/10.1109/CVPR.2018.00577.
[39] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, ‘‘The Caltech-UCSD
Birds-200-2011 Dataset,” 2011.
[40] M. E. Nilsback and A. Zisserman, ‘‘Automated flower classification over a large
number of classes,” 2008. doi: 10.1109/ICVGIP.2008.47.
[41] O. R. Nasir, S. K. Jha, M. S. Grover, Y. Yu, A. Kumar, and R. R. Shah,
‘‘Text2FaceGAN: Face generation from fine grained textual descriptions,”
2019. doi: 10.1109/BigMM.2019.00-42.
[42] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, ‘‘Rethinking the RItesh Sonavane is a undergraduate student at Sym-
Inception Architecture for Computer Vision,” in Proceedings of the IEEE biosis Institute of Technology. He is pursuing the
Computer Society Conference on Computer Vision and Pattern Recognition, Bachelors of Technology, majoring in Electronics and
2016, vol. 2016-December. doi: 10.1109/CVPR.2016.308. Telecommunication. He has keen interest in robotics
[43] O. Russakovsky et al., ‘‘ImageNet Large Scale Visual Recognition Challenge,” and implementation of various models on hardware
International Journal of Computer Vision, vol. 115, no. 3, 2015, doi: 10.1007/ platforms. For the past few years he is working on
s11263-015-0816-y.
deployment of various models on microprocessors and
[44] C. Bodnar, ‘‘Text to Image Synthesis Using Generative Adversarial Networks,”
hardware platforms.
May 2018, doi: 10.13140/rg.2.2.35817.39523.
[45] Karnewar Animesh and Ibrahim Ahmed Hani, ‘‘GitHub - akanimax/T2F: T2F:
text to face generation using Deep Learning.” https://github.com/akanimax/
T2F (accessed Jun. 01, 2021).
[46] ‘‘GitHub - 2KangHo/AttnGAN-CelebA: Face Image Generation using AttnGAN
with CelebA Dataset.” https://github.com/2KangHo/AttnGAN-CelebA
(accessed Jun. 01, 2021).
[47] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, ‘‘GANs
trained by a two time-scale update rule converge to a local Nash equilibrium,” Rahee walambe received MPhil, Ph.D. Degree from
in Advances in Neural Information Processing Systems, 2017, vol. 2017- Lancaster University, UK, in 2008. From 2008 to 2017,
December. she was a research Consultant with various organiza-
[48] ‘‘GitHub - bioinf-jku/TTUR: Two time-scale update rule for training GANs.”
tions in the control and robotics domain. Since 2017,
https://github.com/bioinf-jku/TTUR (accessed Jun. 01, 2021).
she has been working as an Associate Professor at Dept
[49] W. Xia, Y. Yang, J.-H. Xue, and B. Wu, ‘‘TediGAN: Text-Guided Diverse Face
of Electronics and Telecommunications at Symbiosis
Image Generation and Manipulation,” Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 2256-2265, Institute of Technology, Symbiosis International
Dec. 2020, [Online]. Available: https://github.com/weihaox/TediGAN. University, Pune, India. Her area of research is applied
[50] W. Xia, Y. Yang, J.-H. Xue, and B. Wu, ‘‘Towards Open-World Text-Guided Face Deep Learning and AI in the field of Robotics and
Image Generation and Manipulation,” Apr. 2021. Healthcare. She is awarded number of national and
[51] X. Chen, L. Qing, X. He, X. Luo, and Y. Xu, ‘‘FTGAN: A fully-trained generative international research grants.
adversarial networks for text to face generation,” arXiv preprint
arXiv:1904.05729, 2019.

15
George Ghinea is a Professor in the Department of Ketan kotecha pursued Ph.D.& MTech from (IIT Bom-
Computer Science at Brunel University London. My bay) and is currently holding the positions as Head,
research activities lie at the confluence of Computer Symbiosis Centre for Applied AI (SCAAI), Director,
Science, Media and Psychology. He has applied my Symbiosis Institute of Technology, Dean, Faculty of
expertise in areas such as eye-tracking, telemedicine, Engineering, Symbiosis International (Deemed Univer-
multi-modal interaction, and ubiquitous and mobile sity). He is an expert in AI and Deep Learning. He has
computing. I am particularly interested in building published 100+ widely in a number of excellent peer-
human-centred e-systems, particularly integrating reviewed journals on various topics ranging from
human perceptual requirements. His work has been cutting-edge AI, education policies, teaching-learning
funded by both national and international funding practices and AI for all. He has published 3 patents and
bodies. delivered keynote speeches at various national and
international forums. He is a recipient of multiple
international research grants and awards.

BTP Presentation On Text To Image Synthesis
100% (1)
BTP Presentation On Text To Image Synthesis
38 pages
Generating Anime Faces From Human Faces With Adversarial Networks
No ratings yet
Generating Anime Faces From Human Faces With Adversarial Networks
7 pages
A_Realistic_Image_Generation_of_Face_From_Text_Description_Using_the_Fully_Trained_Generative_Adversarial_Networks
No ratings yet
A_Realistic_Image_Generation_of_Face_From_Text_Description_Using_the_Fully_Trained_Generative_Adversarial_Networks
11 pages
Engproc 20 00016 With Cover
No ratings yet
Engproc 20 00016 With Cover
7 pages
T - F G S Gan2: EXT TO ACE Eneration With Tyle
No ratings yet
T - F G S Gan2: EXT TO ACE Eneration With Tyle
16 pages
Introduction To Recurrent Neural Network
No ratings yet
Introduction To Recurrent Neural Network
10 pages
SAW-GAN
No ratings yet
SAW-GAN
11 pages
Text-to-Image Generation Using Deep Learning
No ratings yet
Text-to-Image Generation Using Deep Learning
6 pages
Indian Institute OF Information Technology Allahabad: Text To Image Synthesis
No ratings yet
Indian Institute OF Information Technology Allahabad: Text To Image Synthesis
8 pages
Text-to-Image_Synthesis_With_Generative_Models_Methods_Datasets_Performance_Metrics_Challenges_and_Future_Direction_Basiv
No ratings yet
Text-to-Image_Synthesis_With_Generative_Models_Methods_Datasets_Performance_Metrics_Challenges_and_Future_Direction_Basiv
16 pages
Text-to-Image Synthesis With Generative Models Met
No ratings yet
Text-to-Image Synthesis With Generative Models Met
16 pages
Conference Template A4
No ratings yet
Conference Template A4
6 pages
Conference Template A114
No ratings yet
Conference Template A114
6 pages
Photographic Text-to-Image Synthesis With A Hierarchically-Nested Adversarial Network
No ratings yet
Photographic Text-to-Image Synthesis With A Hierarchically-Nested Adversarial Network
10 pages
basepaper1
No ratings yet
basepaper1
15 pages
Final All Correct
No ratings yet
Final All Correct
49 pages
Cycle-Consistent Inverse GAN For Text-to-Image Synthesis - 3474085.3475226
No ratings yet
Cycle-Consistent Inverse GAN For Text-to-Image Synthesis - 3474085.3475226
2 pages
Meta
No ratings yet
Meta
17 pages
MPAI05_FINAL DOCUMENT
No ratings yet
MPAI05_FINAL DOCUMENT
40 pages
Tao DF-GAN A Simple and Effective Baseline For Text-to-Image Synthesis CVPR 2022 Paper
No ratings yet
Tao DF-GAN A Simple and Effective Baseline For Text-to-Image Synthesis CVPR 2022 Paper
11 pages
(Arisandy Yudha Putra - 23150137) Research Interest
No ratings yet
(Arisandy Yudha Putra - 23150137) Research Interest
13 pages
Dual Adversarial Inference For Text-to-Image Synthesis
No ratings yet
Dual Adversarial Inference For Text-to-Image Synthesis
20 pages
Base Paper Batch 9 Final Updated 3
No ratings yet
Base Paper Batch 9 Final Updated 3
10 pages
Deep Learning Based Text To Image Genera
No ratings yet
Deep Learning Based Text To Image Genera
6 pages
Mirrorgan: Learning Text-To-Image Generation by Redescription
No ratings yet
Mirrorgan: Learning Text-To-Image Generation by Redescription
10 pages
Conference Template A14
No ratings yet
Conference Template A14
4 pages
Documents 5
No ratings yet
Documents 5
5 pages
Mirror Gan
No ratings yet
Mirror Gan
10 pages
Liao Text To Image Generation With Semantic-Spatial Aware GAN CVPR 2022 Paper
No ratings yet
Liao Text To Image Generation With Semantic-Spatial Aware GAN CVPR 2022 Paper
10 pages
Xu AttnGAN Fine-Grained Text CVPR 2018 Paper
No ratings yet
Xu AttnGAN Fine-Grained Text CVPR 2018 Paper
9 pages
Verisimilar Image Synthesis For Accurate Detection and Recognition of Texts in Scenes
No ratings yet
Verisimilar Image Synthesis For Accurate Detection and Recognition of Texts in Scenes
18 pages
Rishab Paper Final
No ratings yet
Rishab Paper Final
7 pages
【2022】RiFeGAN2 Rich Feature Generation for Text-To-Image Synthesis From Constrained Prior Knowledge
No ratings yet
【2022】RiFeGAN2 Rich Feature Generation for Text-To-Image Synthesis From Constrained Prior Knowledge
14 pages
Attngan PDF
No ratings yet
Attngan PDF
9 pages
1 RV
No ratings yet
1 RV
11 pages
An Adaptive Approach To Text To Image
No ratings yet
An Adaptive Approach To Text To Image
5 pages
Semantically Consistent Text To Fashion Image Synthesis With An Enhanced Attentional GAN
No ratings yet
Semantically Consistent Text To Fashion Image Synthesis With An Enhanced Attentional GAN
8 pages
Paper Math
No ratings yet
Paper Math
13 pages
Stylegan-T: Unlocking The Power of Gans For Fast Large-Scale Text-To-Image Synthesis
No ratings yet
Stylegan-T: Unlocking The Power of Gans For Fast Large-Scale Text-To-Image Synthesis
13 pages
New Microsoft Word Document (2)
No ratings yet
New Microsoft Word Document (2)
8 pages
Nataniel Ruiz Dreambooth Fine Tuning Text To Image
No ratings yet
Nataniel Ruiz Dreambooth Fine Tuning Text To Image
11 pages
DCGAN-sci-hub
No ratings yet
DCGAN-sci-hub
10 pages
Attn GAN
No ratings yet
Attn GAN
9 pages
DL M6 Tech
No ratings yet
DL M6 Tech
29 pages
BTP_6 sem_part1
No ratings yet
BTP_6 sem_part1
40 pages
Dream Booth
No ratings yet
Dream Booth
25 pages
Paper Submitted
No ratings yet
Paper Submitted
6 pages
Satgan Paper
No ratings yet
Satgan Paper
17 pages
Text To Image Synthesis Using Self
No ratings yet
Text To Image Synthesis Using Self
20 pages
Ruiz DreamBooth Fine Tuning Text-to-Image Diffusion Models For Subject-Driven Generation CVPR 2023 Paper
No ratings yet
Ruiz DreamBooth Fine Tuning Text-to-Image Diffusion Models For Subject-Driven Generation CVPR 2023 Paper
11 pages
Generating AI Text to Image A Comprehensive Guide
No ratings yet
Generating AI Text to Image A Comprehensive Guide
3 pages
Synthesizing Visual Realities Design and Implementation of A Text To Image Synthesizer Leveraging Spatial Transformer Generative Adversarial Networks
No ratings yet
Synthesizing Visual Realities Design and Implementation of A Text To Image Synthesizer Leveraging Spatial Transformer Generative Adversarial Networks
5 pages
2102.04699
No ratings yet
2102.04699
9 pages
Generative Adversarial Text To Image Synthesis
No ratings yet
Generative Adversarial Text To Image Synthesis
1 page
Yayi Final Seminar
No ratings yet
Yayi Final Seminar
19 pages
Plug and Play Diffusion Feature
No ratings yet
Plug and Play Diffusion Feature
15 pages
Frank Gabel Eml2018 Report
No ratings yet
Frank Gabel Eml2018 Report
15 pages
Sketch2face: Conditional Generative Adversarial Networks For Transforming Face Sketches Into Photorealistic Images
No ratings yet
Sketch2face: Conditional Generative Adversarial Networks For Transforming Face Sketches Into Photorealistic Images
9 pages
Cross-Caption Cycle-Consistent Text-to-Image Synthesis
No ratings yet
Cross-Caption Cycle-Consistent Text-to-Image Synthesis
9 pages
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
Aadhar Update Form
No ratings yet
Aadhar Update Form
1 page
TTD Guidelines
No ratings yet
TTD Guidelines
2 pages
01 - 07 - 2017 List of Approved Vendors PDF
No ratings yet
01 - 07 - 2017 List of Approved Vendors PDF
132 pages
TopGlove Beading
No ratings yet
TopGlove Beading
14 pages
Math Club
No ratings yet
Math Club
5 pages
NESCOM Fellowship Syllabus-Watermark
No ratings yet
NESCOM Fellowship Syllabus-Watermark
2 pages
Add Math 2012 Paper 1
No ratings yet
Add Math 2012 Paper 1
20 pages
Case Study - Elsewedy Electric - 2
No ratings yet
Case Study - Elsewedy Electric - 2
4 pages
Tata Motors Case Study
100% (3)
Tata Motors Case Study
35 pages
Artificial Intelligence Concepts Areas Techniques and Applications 1st Edition Anne Håkansson instant download
100% (1)
Artificial Intelligence Concepts Areas Techniques and Applications 1st Edition Anne Håkansson instant download
49 pages
"What Is A Concept Map?" by (Novak & Cañas, 2008)
No ratings yet
"What Is A Concept Map?" by (Novak & Cañas, 2008)
4 pages
Phi 210733
No ratings yet
Phi 210733
20 pages
P6 Math Diagnostic Test (TCH Copy)
No ratings yet
P6 Math Diagnostic Test (TCH Copy)
9 pages
alternator and regulator
No ratings yet
alternator and regulator
6 pages
Basic Sensory Attributes
No ratings yet
Basic Sensory Attributes
31 pages
Model 800+ Platinum With Alarm: Manual Automatic Transmission
No ratings yet
Model 800+ Platinum With Alarm: Manual Automatic Transmission
6 pages
Screenshot 2024-01-15 at 6.22.38 P
No ratings yet
Screenshot 2024-01-15 at 6.22.38 P
7 pages
3.4.1 Line Integral
No ratings yet
3.4.1 Line Integral
5 pages
Fluid Flow Equations PDF
No ratings yet
Fluid Flow Equations PDF
10 pages
CFD Simulation On Aerodynamic Analysis of Wind Turbine Rotor Blade Airfoils Ijariie13032
No ratings yet
CFD Simulation On Aerodynamic Analysis of Wind Turbine Rotor Blade Airfoils Ijariie13032
6 pages
Armstrong Nature of The Mind
No ratings yet
Armstrong Nature of The Mind
23 pages
620os Parts Manual
No ratings yet
620os Parts Manual
26 pages
Sbi Aaiii
No ratings yet
Sbi Aaiii
4 pages
Project KOC Interview Reliability Guide: Document Prepared by
No ratings yet
Project KOC Interview Reliability Guide: Document Prepared by
18 pages
3 3.98P
No ratings yet
3 3.98P
3 pages
Importance of Guest Feedback in The Hotel Industry
No ratings yet
Importance of Guest Feedback in The Hotel Industry
6 pages
LP04 Gen - Phys 2 Calculate The Net Electric Force On A Point Charge Exerted by A System of Point Charges - G12 Richmindale - Rev0 1 1
No ratings yet
LP04 Gen - Phys 2 Calculate The Net Electric Force On A Point Charge Exerted by A System of Point Charges - G12 Richmindale - Rev0 1 1
16 pages
Peer and Self-Assessment - TeachingEnglish - British Council
No ratings yet
Peer and Self-Assessment - TeachingEnglish - British Council
8 pages
Flowmeter Selection Guide
100% (1)
Flowmeter Selection Guide
2 pages
ASHRAE - The Fundamentals of Expansion Tanks PDF
No ratings yet
ASHRAE - The Fundamentals of Expansion Tanks PDF
7 pages