TAM GAN: Tamil Text to Naturalistic Image Synthesis Using Conventional Deep Adversarial Networks

DIVIYA M and KARMEL A, School of Computer Science and Engineering, VIT Chennai Campus,
Chennai, Tamilnadu, India

Text-to-image synthesis has advanced recently as a prospective area for improvement in computer vision
applications. Image synthesis models follow significant neural network architectures such as Generative
Adversarial Networks (GANs). The flourishing text-to-image generation approaches can nominally reflect the
meaning of the text in generated images. Still, they fall short of providing the necessary details and
eloquent object features. Intelligent systems are trained in text-to-image synthesis applications for various
languages. However, their contribution to regional languages is yet to be explored. Autoencoders prompt
the synthesis of images, but they result in blurriness, which obscures the output and the essential features of
the picture. Based on textual descriptions, the GAN model is capable of producing realistic, high-quality
images that can be used in various applications, like fashion design, photo editing, computer-aided design,
and educational platforms. The proposed method uses two-stage processing to create a language model using
a BERT model called TAM-BERT and an existing MuRIL BERT, followed by image synthesis using a GAN.
The work was conducted using the Oxford-102 dataset, and the model’s efficiency was evaluated using the
F1-Score measure.
CCS Concepts: • Computing methodologies → Information extraction;
Additional Key Words and Phrases: Computer vision, Generative Adversarial Network (GAN), BERT, MuRIL
BERT, language model, L1Norm, feature matching, latent vectors
ACM Reference format:
Diviya M and Karmel A. 2023. TAM GAN: Tamil Text to Naturalistic Image Synthesis Using Conventional
Deep Adversarial Networks. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 22, 5, Article 128 (May 2023),
18 pages.
https://doi.org/10.1145/3584019

1 INTRODUCTION
Technology is enhancing everything. Rapid technological growth paves the way for
text-to-image synthesis applications to spread their wings. The proposed work aims at providing
text-based image synthesis from Tamil textual descriptions. The work focuses on generating
images that relate to the given text from the dataset under study. Previously, many researchers
have contributed to image synthesis in other languages, including English, which has undergone

Authors’ address: Diviya M and Karmel A, School of Computer Science and Engineering, Vellore Institute of Technology,
Chennai, Tamilnadu, India; emails: [email protected], [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be
honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee. Request permissions from [email protected].
© 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
2375-4699/2023/05-ART128 $15.00
https://doi.org/10.1145/3584019

ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. 22, No. 5, Article 128. Publication date: May 2023.

extensive development. Researchers have taken on the computational challenge of creating
new human poses. Given a reference picture of a person, they create a new
image of that person in the required posture while keeping the original photo’s aesthetic qualities
intact, including its lighting and background. The authors provided a modular
generative neural network capable of synthesizing poses from training pairs of images and
stances extracted from human action movies. To accomplish this, the network dissects the scene
into layers consisting of body parts and backgrounds, then relocates and modifies the appearance
of those body parts before finally compositing the new foreground onto a background with holes
in it [1]. It would be fascinating and practical if AI could automatically synthesize realistic visuals
from text, but existing systems are far from this aim. Generic and strong recurrent neural network
topologies have been developed in recent years. To effectively combine these developments in text
and picture modeling, the researchers created a new deep architecture and Generative Adversarial
Network (GAN) formulation and showed that the proposed model can create
convincing pictures of birds and flowers based on textual descriptions. The work follows GAN-CLS
and GAN-INT-CLS as a two-stage process [2]. In the world of fashion image synthesis using
the Fashion Gen and Deep Fashion synthesis datasets, Kenan et al. proposed a novel method named
enhanced-Attentional Generative Adversarial Network. The model incorporates both real-world
and synthetic picture feature matching losses, as well as multimodal similarity learning for text
and image characteristics [3]. It uses feature-wise linear modulations for a clear understanding of
the context. Meanwhile, Krishna et al. came up with a new challenging task of generating
images based on scenic views. They proposed a cross-view image translation, which performs better than
the traditional methods.
While image synthesis is the prime motive, textual descriptions need more
attention. Handling a language that has morphology-rich features is challenging.
Here, as a good start, the text features are studied, and preprocessing is done [4]. In
the Tamil language, notable work has been performed on language feature study. Saraswathi et al.
built language models at various phases to identify the errors in syllables and words, resulting
in a better speech recognition system. At the phonetic level, speech signals were segmented by
using their acoustic characteristics [5]. For symbol-level language models, the authors
proposed a lexicon for text corpora in the Tamil language that improves accuracy [6]. Suresh et
al. introduced a bi-gram language model for online and handwritten words. The model addresses
the idiosyncrasies and ambiguities that are present in Indic scripts. The use of
(1) language models that take advantage of the peculiarities of Indic scripts and (2) well-framed
classifiers for the disambiguation of confusing symbols are two areas where this work contributes
that are rarely discussed in the online Indic word recognition literature. Each symbol in the input
word is extracted before being recognized by a main Support Vector Machine (SVM) classifier.
To further improve recognition accuracy, they use (i) a bigram language model at the symbol
or letter level and (ii) well-equipped classifiers to review and disambiguate the multiple sets of
ambiguous morphemes [7].
Realistic image synthesis using generative adversarial networks, proposed by Ian Goodfellow et al.,
has been utilized in many real-time applications. A GAN consists of a generator and a discriminator.
This model doesn’t need any other inference network, which reflects the ability of GANs to
generate notable output. Building on the work proposed by various researchers, a new GAN
was proposed [8]. Lately, convolutional neural networks (CNNs) have become popular in
the computer vision discipline for supervised learning. However, little focus has been placed on
unsupervised learning using CNNs. The prime goal of that work is to fill in the gaps between
supervised and unsupervised learning that CNNs currently have. The authors presented a new type
of CNNs termed DCGANs that meet specific architectural requirements and show a promising
platform for unsupervised learning. They demonstrated that, when trained on numerous image
datasets, a deep convolutional adversarial pair learns a hierarchy of representations from object
parts to scenes in both the generator and the discriminator. These representations are then applied
to new problems, and their generalizability as representations of images is demonstrated [9].
Following Alec et al., Mehran et al. also came up with a regularised DCGAN for representation
learning. They developed and deployed deep neural networks (DNNs) in tandem with GANs to
provide unsupervised representation learning. The suggested approach is reported to be better than
the competing approaches. As evidence, the proposed strategy not only aids feature extraction but
also speeds up the completion of learning in GANs, leading to more appropriate
feature extraction [10]. The researchers created a CookGAN model that learns image features and
results in an efficient model that can distinguish real and fake images. CookGAN is a novel
network design that up-samples images gradually while preserving fine-grained features and
simulating visual effects in causality chains. In particular, the researchers present a culinary simulator
sub-network that, over time, modifies input images of food based on how various ingredients and
preparation techniques interact. CookGAN has been shown experimentally on Recipe1M to produce
food photos with a respectable inception score. In addition, the visuals can be interpreted and
manipulated meaningfully [11]. Memory networks bring a new outlook to the existing GAN network,
which helps in generating high-resolution images using external memory. This prompted a
lot of researchers to concentrate on generating high-resolution images along with the traditional
GAN architecture [12]. Tan et al. brought a new framework to the existing GAN network, called
Knowledge-Transfer (KT). The KT mechanism solves the cross-domain problem that
exists between image and text input [13]. In the case of complex image captions, the metrics for
image quality in practice don’t meet the standards. In such a case, the researchers have brought
out a new metric named Semantic Object Accuracy that evaluates the image against the
caption that belongs to it. Even though image synthesis from text has been explored to a great
extent, drawbacks remain with pose variations, shape variations, viewpoints, and so on [14].
Wang et al. believed that instead of learning through text-image mapping, their algorithm learns
through the semantic layout, which proves to be a better model. GANs have another path of
applications as well, such as image inpainting [15]. The authors
applied a two-stage GAN model on a custom dataset to improve the performance.
On the one hand, text-based image synthesis using GANs has been considered, but on the
other hand, processing Tamil text is challenging. Abundant research has been carried out to
understand the text features. Handwritten word recognition for Tamil and Devanagari scripts was
addressed by Bharat et al. An HMM model was proposed to study the features of the words [16].
Researchers created a morphological analyzer cum generator called Thamizhi Morph for
processing text, which in turn was applied to machine translation applications. The preprocessing step
plays a vital role in text feature extraction, as Tamil is a morphologically rich language. That
study details the Finite-state Transducer (FST)-based design and Foma-based implementation
of Thamizhi Morph. It brings out the specifics of Tamil’s nominal and verbal paradigms that
informed the design choices. To efficiently characterize the inflectional morphology of the language,
the authors define a high-level meta-language [17]. Surya et al. developed a morphological analyzer named
Piripori for analyzing words using morphological rules. To understand the word form structure,
a morphological analyzer is an important tool [18]. From machine learning to deep learning,
almost every challenge in NLP has been conquered. To this day, the process of translating a foreign
language into a local one remains mysterious. Languages besides English have muddled NLP
issues. Entity Extraction, Optical Character Recognition, and Sequence Modeling Classification and
Prediction are all possible names for the issues. As more people intend to communicate in their native
tongues on social forums, it becomes increasingly vital to automate the process of categorizing
content written in languages like Tamil, Telugu, Hindi, and so on. The objective here is categorizing
the Tamil news items according to their respective subjects (Sports, Cinema, Politics). Existing
work has taken a TFIDF-of-words-as-features approach to conventional machine learning techniques.
The purpose of this research was to compare the efficiency of CNNs that were trained with
pre-trained embeddings to those that learned TFIDF features from scratch (CNN) [19]. The authors
processed the text before running the CNN by removing stop words, followed by generating an
embedding matrix and embedding vector. Another way of processing text is by using BOW and
TF-IDF, as done by Sajeetha et al. in their work. The work considered an approach
based on supervised machine learning and a hybrid approach. Many algorithms and language
models were used to convert a text file to vector forms [20]. One among them is using Conv-LSTM
for understanding text and Parts of Speech Tagging for Swahili words through syllables. The
effectiveness of a method can be best understood by employing it on the corpus in hand. Reviewing
and understanding various methodologies results in better research ideas over an area [21]. The
deep learning model gives a better focus for the work to be carried out. The models in the context
of NLP throw a spotlight on researchers’ need to have in-depth knowledge before they employ the
model. Such methods include word2vec, fastText and GloVe, RNNs, LSTM, GRUs, BRNNs, and so
on. They also focused on explaining how activation functions are to be chosen in the generative
models, followed by various optimization techniques to be followed in the generative models.
Normally, research happens with generic text corpora [22]. But the researchers brought out the
fact of handling scientific data by using the SCI-BERT model, which proves to be an improvement
for NLP tasks in scientific areas [23].
Deep analysis has been done of language models and image synthesis algorithms. Major
research contributions have been made for the English language, while applications based on other
regional languages are in the initial stages of exploration. This has motivated
taking up research in the Tamil language. Since Tamil is a historically old language
with a rich morphology, it paves the way for various applications that could be developed for the
Tamil language. The research in hand concentrates on how a language model can be proposed for
the Tamil language and, further, how an image synthesis deep learning architecture could be developed
for Tamil text. As this research is at an initial stage, many such applications and
research problems for the Tamil language can be addressed in the future. The main difficulty that exists with
regional languages is the lack of corpora and tools for processing. The main aim of the research is
to support research carried out in Tamil. It is in the budding stages and could
be enhanced further in the future.

2 PROPOSED LANGUAGE MODEL AND IMAGE SYNTHESIS ARCHITECTURE


The proposed architecture supports image synthesis for Tamil text descriptions.
The model has a novel framework that performs feature matching across the real and
synthesised images through the L1 norm, through which the generator can be fine-tuned to
synthesize fake images that are more similar to the real image features. In turn, it also handles
the loss of the GAN network. By using minibatch discrimination, the model also handles mode
collapse. The model has a two-step process in which a language model is first built to convert the
Tamil text input to embedding vectors, followed by injecting image feature vectors along with
the embedding vectors into the GAN network.

2.1 TAM-BERT
The Bidirectional Encoder Representations from Transformers (BERT) model developed by
Google created a revolution in the field of NLP. The BERT model [23] was developed as a generic
model to process textual information in multiple languages. For Indic languages like Tamil, Hindi,


Fig. 1. English to Tamil text translation using google translate.

Fig. 2. Process flow diagram of TAM–BERT language model.

and Telugu, the model was trained on publicly available datasets. The model proposed in this work is
trained on the sentences of various text files belonging to the Oxford 102 flower dataset. The Tamil
sentences for the corresponding image files were collected as a corpus and used for training. In
total, the flower dataset has 102 classes, each with 30 to 40 image files and a description
in English for those images. As Figure 1 shows, the available English
sentences were translated using Google Translate and stored as text files in the Tamil language,
amounting to about 3,000 sentences. The Tamil sentences were used as inputs to the BERT model,
which trains on each of those inputs.
The language model consists of the following layers: hidden layers, linear layers, and latent
layers. Figure 2 gives the schematic representation of TAM-BERT. The input
sentence in the Tamil language is considered as S having T words. It is represented as, where S
ranges from 1 to T words. Indic BERT is used as the base language model, which tokenizes
the Tamil input text [24]. Following tokenization, the tokens are converted into latent vectors
using the Latent Layer. Each latent layer model has a defined latent size and has hidden states. The
output of the Latent Layer Model is the latent vectors, which serve as inputs to the GAN network.
Each layer performs its own function, starting from the tokenization of the given sentences into
individual words. Once tokenization is done, the result is provided to the hidden
layer as an input, where the processing happens. Based on the tuning parameters, the next
layer, the linear layer, performs its function, and the end output is passed to the latent layer. The
output is the latent vectors of the input word sequence, represented in Table 1. The various layers that
combine to build the TAM-BERT model are all encoders, which have the capacity to be trained on
the target corpus.
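The layer sequence described above (tokenization, hidden layer, linear layer, latent layer) can be illustrated with a minimal Python sketch. It is not the TAM-BERT implementation: the whitespace tokenizer, the pseudo-random hidden states, the mean pooling, the tanh squashing, and all names and sizes (`latent_vector`, `hidden_dim`, `latent_dim`) are assumptions chosen only to show how a sentence of T tokens can become one fixed-size latent vector.

```python
import math
import random


def tokenize(sentence):
    # Assumption: whitespace splitting stands in for the Indic BERT tokenizer.
    return sentence.split()


def hidden_state(token, hidden_dim):
    # Assumption: a deterministic pseudo-random vector replaces learned weights.
    rng = random.Random(token)
    return [rng.uniform(-1.0, 1.0) for _ in range(hidden_dim)]


def linear_layer(vector, weights, bias):
    # Standard affine map: one output value per weight row.
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def latent_vector(sentence, hidden_dim=8, latent_dim=4):
    tokens = tokenize(sentence)
    states = [hidden_state(tok, hidden_dim) for tok in tokens]
    # Mean-pool the per-token hidden states into one sentence vector.
    pooled = [sum(column) / len(states) for column in zip(*states)]
    rng = random.Random(0)  # fixed seed so the toy weights are reproducible
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]
               for _ in range(latent_dim)]
    bias = [0.0] * latent_dim
    # Latent layer (assumed): squash the linear output into (-1, 1).
    return [math.tanh(x) for x in linear_layer(pooled, weights, bias)]
```

Calling `latent_vector("token_a token_b token_c")` returns a list of `latent_dim` floats. In the paper's pipeline the corresponding vectors come from the trained encoder layers rather than from seeded random numbers.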


Table 1. Latent Vector Representation for Corresponding Tamil Text Input Tokens Using TAM-BERT Model

2.2 MuRIL (Multilingual Representations for Indian Languages) BERT


Multilingual Representations for Indian Languages (MuRIL) BERT was developed by Google’s Indian
division to support the research work carried out in regional languages. The BERT model
has been extensively used in understanding and processing NLP tasks involved in the English
language. When it comes to regional languages, however, comparable work has not been achieved on a
large scale due to the morphology of the regional languages and their grammatical nature. MuRIL
BERT supports 17 Indian languages [27]. It supports various transliteration tasks and embedding
generation functions. The model, when incorporated and trained on the datasets of the proposed
work, performed well in generating latent vectors for the model. Since the prime aim of using
language models in the proposed work is to generate vectors for text representations, the
performance of the model has been evaluated using the proposed TAM GAN architecture. The vector
form of the text input is represented in Table 2.

2.3 TAM-GAN: A Generative Model for Tamil Text Image Synthesis Using TAMBERT and MuRIL BERT Language Model
The generative network plays a major role in image synthesis. The basic units of a GAN model are
a generator network and a discriminator network. Both models learn a probability
distribution: the discriminator learns a conditional probability distribution,
whereas generative models learn a joint probability distribution. A
combined model is produced by using Bayes’ rule. The GAN model has an interesting history in
its development. Belief networks, autoencoders, and Boltzmann machines are the chronological
predecessors of the GAN model. The Fully Visible Belief Network worked by employing the chain
rule of probability, but such networks fail to generate more samples in return. The next level of
improvement was made through the change of variables in non-linear ICA, but it ended up with a
constraint of similar dimensions for data and latent variables.
Autoencoders work by maximizing the log-likelihood function of the data, but the
network lacks performance when there is a gap between the lower bound and the actual density
of data. The final output of these models is of low quality. A Boltzmann machine works with an
energy function that is proportional to the probabilistic function of a defined state. They require
Monte Carlo and Markov chains, but they fail in high dimensional space. The limitations of the


Table 2. Latent Vector Representation for Corresponding Tamil Text Input Tokens Using MuRIL BERT Model

Fig. 3. Schematic representation of Tamil text to image synthesis network.

previous approaches are overcome by the GAN model, which is consistent and does not require a
Markov chain, which results in the best samples. The major function of the generator is to fool the
discriminator by generating indistinguishable samples, while the discriminator works to distinguish
fake from real data. The TAM-GAN model for the given data works in the following fashion.
The proposed system works by preprocessing the images to be trained and the text file under
study. In Figure 3, the initial process is described as a set of functions applied to text and
images. The work starts by tokenizing the given text file and creating individual tokens that serve
as inputs to the language model. The TAM-BERT language model and MuRIL BERT generate latent


Fig. 4. Input image files sample.

vectors, and they are archived. The image, in turn, is preprocessed, which involves image resizing,
normalization, cropping, and conversion to tensors. The generated output of the language
model and the image vectors are fed into the GAN network along with the loss functions. The
resultant images are synthesised along with the loss of the network.
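The four image preprocessing steps named above (resizing, normalization, cropping, conversion to tensors) can be sketched in plain Python on nested lists. This is only an illustration under stated assumptions: nearest-neighbour resizing, a centre crop, scaling of 8-bit values to [-1, 1], and a channels-first layout are choices made here, and the function names and default sizes are hypothetical rather than taken from the paper.

```python
def resize_nearest(image, size):
    # Nearest-neighbour resize of an H x W image (rows of RGB tuples) to size x size.
    height, width = len(image), len(image[0])
    return [[image[y * height // size][x * width // size] for x in range(size)]
            for y in range(size)]


def normalize(image):
    # Assumption: map 8-bit values in [0, 255] to the range [-1, 1].
    return [[tuple(c / 127.5 - 1.0 for c in pixel) for pixel in row]
            for row in image]


def center_crop(image, size):
    # Keep the central size x size window of the image.
    top = (len(image) - size) // 2
    left = (len(image[0]) - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def to_tensor(image):
    # Reorder H x W x C into C x H x W nested lists (channels first).
    return [[[pixel[c] for pixel in row] for row in image] for c in range(3)]


def preprocess(image, resize_to=4, crop_to=2):
    # Same step order as the text: resize, normalize, crop, convert to tensor.
    resized = resize_nearest(image, resize_to)
    return to_tensor(center_crop(normalize(resized), crop_to))
```

The [-1, 1] range is a common pairing with a generator whose last activation is TanH, since real and generated pixels then share the same scale. In practice an image library would replace these helper functions.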

2.4 Generator and Discriminator Model


The latent input vectors from the language model, along with an image vector of size 1,024 × 1,024,
are fed to the generator. The generator loss is also supplied to the generator, which has layers of
convolution with the ReLU activation function. The final activation function is
the TanH function. The output image vector from the generator is of size 64 × 64 × 3, and it is
fed as an input to the discriminator, which maximizes layer by layer and finally ends up with the
result of 1 × 1 × 3. The activation function is Leaky ReLU in the intermediate layers,
and the final activation function is a sigmoid function to generate the discriminatory output from
the discriminator. The losses for the generator and discriminator are a minimax and an adversarial
function. The MINIMAX function defines the ability of the generator and discriminator to play a
game against one another, each trying for the chance of winning, similar to the MINIMAX concept in
games. The generator, however, seeks to maximize the discriminator loss by reducing the
loss with respect to the generator network. Tamil input text descriptions and the corresponding
image representations are given in Figure 4.
Input Text
.
(mələɾɪn ʲɪd̪əɻgəɭ ʷoɾʉ t̪əʈʈəjjɑːnə ʷuːd̪ɑː nɪɾət̪t̪ɪl ʷʊɭɭənə)
Meaning–The petals of the flower are a flat purple color
Input Images
The latent space is useful in learning the data points that are dispersed. Points that are
compressed and similar will be closer to each other. The data points supply the necessary
information to draw conclusions on the data features under study. This, in turn, results in generating
images that follow a linear path between two data points in the latent space. From Figure 5, it
can be seen that the latent vector and image vector play a major part in the GAN network.
The training takes place over a large number of epochs in which the two networks compete
with each other.
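The "linear path between two data points in the latent space" mentioned above is ordinary linear interpolation. A minimal Python sketch follows. The function names and the idea of feeding each interpolated vector to a trained generator are illustrative assumptions, not details given in the paper.

```python
def lerp(z_a, z_b, alpha):
    # Point at fraction alpha along the straight line from z_a to z_b.
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a, z_b, steps):
    # Evenly spaced latent vectors from z_a to z_b, endpoints included.
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]
```

For example, `interpolation_path([0.0, 2.0], [4.0, 6.0], 3)` yields the two endpoints and the midpoint `[2.0, 4.0]`. Passing each vector on such a path through a generator is a common way to check that nearby latent points decode to similar images.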
The architecture of the text-to-image synthesis model using MuRIL BERT as the language model
is represented in Figure 6. The proposed work concentrates on generating word embeddings for


Fig. 5. Text-to-Image synthesis model using TAMBERT and TAM GAN.

Fig. 6. Text-to-Image synthesis model using MuRILBERT and TAM GAN.

the corresponding Tamil text using MuRIL BERT, which helps in dealing with the Tamil language.
The vectorized forms of text have been generated based on segment embedding, positional
embedding, and sentence embedding. The word vectors are given as input to TAM GAN along with
a preprocessed image vector. The Generator and Discriminator work together to synthesise the
images for the Tamil text.


ALGORITHM 1:
Input: Images x, text description of the image t, training batch size B
For n = 1 to B, do
h ← Φ(t) (Encode the text description for an image)
Z ~ Noise(0, 1)
x̄ ← G(Z, t) (Input to the Generator; x̄ is the generated image)
y ← D(x, h) (Real image, text description)
ŷ ← D(x̄, h) (Fake image, text description)
G ← G − η∇f(G) (Update Generator)
D ← D − η∇f(D) (Update Discriminator)
Prediction ← D(x, x̄)
Generate Real Image x

The representation for the GAN model is given by

\[
\min_{G} \max_{D_{Real}} L_{Real} = \mathbb{E}_{X}\left[\log D_{Real}(X)\right] + \mathbb{E}_{Z}\left[\log\left(1 - D_{Real}(G(Z))\right)\right], \tag{1}
\]

where E_X is the expectation over all real data, G(Z) is the generator output for the noise input Z,
D_Real(G(Z)) is the discriminator's estimate of the probability that a fake instance is real, and
D_Real(X) is the discriminator's estimate of the probability that a real instance is real. In
Equation (1), the min-max function over
G and D_Real implies the Generator’s nature in minimizing the Discriminator’s ability to determine
whether it is fooled. In turn, the discriminator tries to maximize its function in detecting real
images from fake images. The log D_Real(X) approaches value 1, and the log(1 − D_Real(G(Z)))
approaches negative infinity, which results in the synthesis of the original image by the discriminator.
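To make the two terms of Equation (1) concrete, the following Python sketch estimates the value function from a batch of discriminator outputs. The inputs `d_real` and `d_fake` are hypothetical lists of probabilities, standing for D_Real(X) on real images and D_Real(G(Z)) on generated images. Replacing the expectations with batch averages is an assumption made for illustration.

```python
import math


def value_function(d_real, d_fake):
    # Batch estimate of E_X[log D(X)] + E_Z[log(1 - D(G(Z)))].
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term
```

A discriminator that outputs 0.5 for every image gives 2 log 0.5, roughly -1.386. A discriminator that scores real images near 1 and fakes near 0 pushes the value toward 0, which is the direction the discriminator maximizes and the generator resists.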

2.5 Feature Matching Using L1 Norm to Overcome Mode Collapse


The generator searches for the most effective image to use to trick the discriminator. As both
networks work to defeat their rival, the “best” image is always shifting. Nevertheless,
the optimization process can become too greedy, turning the game into an endless game of cat and
mouse. When feature matching is applied, the loss function for the generator is refitted so that it
focuses on minimizing the statistically significant difference between the features of the real
images and synthesised images, measured as the L1 distance between the means of their
feature vectors. Consequently, feature matching broadens the scope of the aim, which formerly
consisted just of defeating the competition. The L1 norm computes the distance that separates the
vector coordinate from the vector space’s origin. It
is also referred to as the Manhattan norm, because it is computed based on the Manhattan distance
measured from the origin. The calculation yields a positive value for the distance. In the proposed
model, the L1 norm calculates the distance between the real image vector distribution and images of
the same batch. The smaller the distance, the greater the similarity between the images, and this leads
to mode collapse. Whenever the norm reduces, the discriminator penalizes the generator for
generating more similar images. The L1 norm is the sum of the magnitudes of the vectors in
space. Between two vectors, it is the sum of the absolute differences of
the components of the vectors. The L1 norm is represented in Equations (2) and (3):
\[
\|L1\| = \max_{1 \le j \le B} \left[ \sum_{i=1}^{B} |V_{ij}| \right], \tag{2}
\]

\[
L1_{norm} = |V_{11}| + |V_{12}|. \tag{3}
\]
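Equations (2) and (3) can be checked numerically with a short Python sketch. Equation (2) is read here as the maximum over columns j of the summed absolute entries, and Equation (3) as the L1 norm of a single row vector. Both readings, and the function names, are assumptions for illustration.

```python
def l1_vector_norm(v):
    # Sum of absolute components, as in Equation (3).
    return sum(abs(x) for x in v)


def l1_matrix_norm(V):
    # Largest column sum of absolute values, as in Equation (2).
    columns = range(len(V[0]))
    return max(sum(abs(row[j]) for row in V) for j in columns)
```

For the matrix [[1, -2], [-3, 4]] the absolute column sums are 4 and 6, so `l1_matrix_norm` returns 6, while `l1_vector_norm([3, -4])` returns 7.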


Consider x_i as the input image and x_j as the remaining images in the batch. The feature maps of
the vector distribution are represented as f(x_i), where i = 1 to B batches. The feature map is

Fig. 7. L1 Norm between real images and fake images.

multiplied by the tensor value T. The final output matrix with tensor is represented as T ∈ M_{A × B × C},
where the tensor is a three-dimensional representation and S_i ∈ M_{A × B} is the resultant matrix
T × f(x_i), followed by Output(x_i), the output of L1 norm values. With respect to the given input
embeddings and image vector, the model shows a higher loss value for the generator. But beyond
300 epochs, the model starts to synthesise images at the same rate, and the loss function tends
to be in the range of 2.5. The L1 norm and mini-batch discrimination play a major role in stabilizing
the model to avoid mode collapse. Feature matching between the input image and the generated
synthetic images using the L1 norm helps to understand the difference in images synthesised by
the generator. Finally, the loss reaches a minimal value, which is represented in Figure 7.
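The two stabilizing ideas in this section can be sketched in Python. `feature_matching_loss` takes the L1 distance between the mean feature vectors of a real batch and a generated batch, as the text describes. `minibatch_closeness` is a simplified variant of the usual minibatch-discrimination score, summing exp(-L1 distance) to the other samples in the batch. The paper does not spell out its exact formula, so that function, the exclusion of the sample itself, and all names are assumptions.

```python
import math


def mean_features(batch):
    # Component-wise mean of a batch of feature vectors.
    return [sum(column) / len(batch) for column in zip(*batch)]


def feature_matching_loss(real_features, fake_features):
    # L1 distance between the mean real and mean generated feature vectors.
    mean_real = mean_features(real_features)
    mean_fake = mean_features(fake_features)
    return sum(abs(r - f) for r, f in zip(mean_real, mean_fake))


def minibatch_closeness(rows):
    # For each sample, sum exp(-L1 distance) to every other sample in the batch.
    scores = []
    for i, row_i in enumerate(rows):
        total = 0.0
        for j, row_j in enumerate(rows):
            if i != j:
                distance = sum(abs(a - b) for a, b in zip(row_i, row_j))
                total += math.exp(-distance)
        scores.append(total)
    return scores
```

A batch of near-identical generated samples produces high closeness scores, which gives the discriminator a signal it can use to flag collapsed output. The feature matching term falls to zero only when the batch statistics of real and generated features agree.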

2.6 GAN Loss Function


The loss functions employed in the generator and discriminator are referred to as modified
MINIMAX loss and adversarial loss. The minimax loss function has two functions: one that tries to
maximize the generator value and the other that tries to minimize the generator value supplied to
the discriminator. The minimax loss function in the generator tries to minimize the inverse
probabilistic value of the discriminator function, which in turn paves the way for the generator itself
to attain lower probability values in generating fake images. In a similar way, the loss function
for the discriminator is an adversarial loss. The adversarial loss acts as a
binary classifier in demarcating between the ground truth data and the images synthesised by the

ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. 22, No. 5, Article 128. Publication date: May 2023.
128:12 Diviya M and Karmel A

Fig. 8. Synthesised images using TAMBERT and TAM GAN.

generator. Moreover, it focuses on the probability value of discriminating real images from fakes
that are supplied by the generator. The mathematical expressions are given as follows in Equa-
tions (4) and (5):
$$\text{Loss of Generator: } L(G) = \mathbb{E}_{\bar{x}} \log\bigl(1 - D(\bar{x})\bigr), \tag{4}$$

$$\text{Loss of Discriminator: } L(D) = \sum_{1}^{N} -\log D_{\mathrm{Real}}\bigl(G(Z)\bigr). \tag{5}$$
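
To make the two losses concrete, the sketch below evaluates the generator loss of Equation (4) on a list of discriminator outputs for generated images, together with a standard binary cross-entropy loss for the discriminator. The binary cross-entropy form is an assumption about what the adversarial loss denotes, since Equation (5) is stated only compactly; the function names, the `EPS` guard, and the sample probabilities are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0); an implementation detail assumed here


def generator_loss(d_fake):
    """Equation (4): mean of log(1 - D(x_bar)) over generated samples."""
    return sum(math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)


def discriminator_bce(d_real, d_fake):
    """Binary cross-entropy with real images labelled 1 and generated images 0."""
    real_term = sum(-math.log(p + EPS) for p in d_real) / len(d_real)
    fake_term = sum(-math.log(1.0 - p + EPS) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Hypothetical discriminator outputs (probability that an image is real).
d_real = [0.9, 0.8]
d_fake = [0.2, 0.1]
print(generator_loss(d_fake))             # closer to 0 when D rejects the fakes
print(discriminator_bce(d_real, d_fake))  # small when D separates the classes
```

The generator lowers its loss by pushing `D(x_bar)` towards 1 for its own samples, while the discriminator lowers its loss by scoring real images high and generated images low.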

3 RESULTS AND DISCUSSIONS


3.1 Quantitative Analysis
The performance of the model can be best understood from Figure 8, which shows the images synthesised for the given text input. During the initial epochs of the proposed work, for example at epoch 10, the synthesised image is a meaningless arrangement of pixels without any structure. Beyond 50 epochs, the model starts to generate images in which regions of color can be identified, but they still lack a recognisable form matching the input text. After 200 epochs, a blurred image corresponding to the input text is generated. Beyond 300 epochs, the model becomes stable and synthesises images that have a reasonable resolution compared with those of earlier epochs.
Image synthesis models have already been explored by various researchers for languages such as English and Chinese, but for regional languages the task remains unexplored. Since the proposed work is a base model, many more models could be proposed in the future. The loss curves are shown in Figures 9 and 10. The Discriminator loss is high during the initial epochs, at 3.58. As training proceeds, the Discriminator starts to learn the features from the latent space, and its loss fluctuates, rising and falling, before finally approaching 0.2. The Generator loss, on the other hand, starts at a peak of around 17.5. As training continues through 200 and 400 epochs, the value falls off, and it settles at a value of 2.5 beyond 500 epochs.
With the loss behaviour established, the capability of the Generator network in synthesising fake images is represented in Figure 11; the score decreases after extensive training. Similarly, the true positive rate for the real images is depicted in Figure 12 and approaches 0.9.
The scores of real and fake images across the number of epochs are shown in Figure 13.

Fig. 9. Discriminator losses of the TAM GAN model across a number of epochs.

Fig. 10. Generator loss of the TAM GAN model across a number of epochs.

In the initial epochs, the F1 score of the model is 0.8216 for real images and 0.0261 for fake images. It varies as training proceeds, and in the final phase the F1 score is 0.9707 for the real images and 0.1829 for the fake images. The F1 score combines precision and recall into a single value. The model performs well on the validation set. Moreover, it is not computationally expensive, because the word vectors are obtained during the initial stages of processing. The model produces images of reasonable resolution. The evaluation of the TAM GAN model using the latent vectors generated by MuRIL BERT achieved a slightly higher F1 score of 0.9721, which is similar to that of the previous model. The error rate is higher in the initial stages, and as the epochs go on, the synthesised images show better resolution. The synthesised images are represented in Figure 14.
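
Because the F1 score is reported separately for real and fake images, a short sketch of a per-class F1 computation may help. The labels below are toy values invented for illustration and are unrelated to the scores reported in this section; the function name `f1_for_class` is hypothetical.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # recall is the true positive rate
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy discriminator decisions (illustrative only).
y_true = ["real", "real", "real", "real", "fake", "fake"]
y_pred = ["real", "real", "real", "fake", "real", "fake"]

print(f1_for_class(y_true, y_pred, "real"))  # 0.75
print(f1_for_class(y_true, y_pred, "fake"))  # 0.5
```

The same counts also give the true positive rate for the real class, which is the recall term computed inside the function.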


Fig. 11. The score of synthetic images generated by the Generator network.

Fig. 12. True positive rate of the real images synthesised by the model.

Figure 15 shows the scores of the real and fake images for the MuRIL BERT and TAM GAN combination, which indicate that the model has achieved better performance.

3.2 Qualitative Analysis


Generative adversarial networks prove to be good synthesisers of images, and they can classify real and fake images with a good F1 score. The Chinese character generation model in [25] uses CycleGAN. It generates personalized handwritten characters in the Chinese language from printed letters, and the method can be applied to all calligraphy work. The model worked with the Chinese handwriting database CASIA-HWDB. The authors considered ResNet-6 and DenseNet-5 modules as transfer models and achieved reasonable Top-1 and Top-5 accuracy on the characters [25]. For the English language, in contrast, much work has been done. The text can be preprocessed using pre-trained English vectors such as Word2Vec, skip-thought vectors, GloVe embeddings, and so on. The GAN model proposed in [26] works with AttnGAN.

Fig. 13. The scores of Real and Fake images across a number of epochs: TAMBERT + TAMGAN.

Fig. 14. Synthesised images using MuRIL BERT and TAM GAN.

Fig. 15. The scores of Real and Fake images across a number of epochs: MuRIL BERT + TAMGAN.

It uses a text encoder and an image encoder in the GAN architecture. The text processing is done with a bidirectional LSTM. They
worked with the popular MS-COCO dataset, from which they created Train-R, Test-R, and Train-S subsets containing data for specified groups such as white, top, pillows, and table. The results are evaluated using BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores. Another work presents an end-to-end approach to text-to-image synthesis based on spatial constraints obtained by mining the location and shape information of objects. Under the supervision of the generated semantic information, the system generates multi-object fine-grained images directly rather than learning a hierarchical mapping from text to image [27].

Fig. 16. F1-Score comparison of existing and proposed GAN algorithm.

However, when applied to real generation tasks, vanilla deep neural networks tend to approximate continuous mappings rather than discontinuous mappings with discrete points. The failure of a GAN to synthesise a variety of images, referred to as mode collapse, occurs during training on datasets that contain many types. The authors of [28] present the Multi-generator Text Conditioned GAN, also known as MTC-GAN, as a potential solution to this problem. The proposed work focuses on generating images for Tamil text input using the TAM-BERT and MuRIL-BERT models along with a TAM-GAN model [29]. The resolution of the synthesised images depends on the text embedding model adopted for the morphology of the input and on the GAN model employed for synthesis. It also depends on the available resources and corpus of the particular language. The proposed model also overcomes mode collapse and reduces the loss through feature matching and minibatch discrimination, which employs the L1 norm. The comparative analysis of the existing methods and the proposed methodology is depicted in Figure 16. The model's performance on the given dataset is nearly on par with that of the existing methods, and it is also better for the Tamil language. Moreover, the performance of each algorithm depends on the language, since every language has its own morphological representation.

4 CONCLUSION
Image synthesis from textual descriptions is an interesting area of exploration. The proposed image synthesis for Tamil text plays a vital role in developing tools that make education easier. Similarly, the proposed model can be employed for image generation in any environment where regional languages play a major role. The TAM-BERT and MuRIL-BERT models, together with TAM-GAN, are an initial step in combining the Tamil language with the renowned GAN model. In the future, larger bodies of text, including literature, science, and so on, can be handled efficiently. By enhancing the model, super-resolution photorealistic images can be obtained. Moreover, GAN-based loss functions can be used alongside the existing functions to attain more accurate image generation. Auto-encoder models and diffusion models could also be added to enhance the work, which could result in better performance.

REFERENCES
[1] Guha Balakrishnan, Amy Zhao, Adrian V. Dalca, Fredo Durand, and John Guttag. 2018. Synthesizing images of humans in unseen poses. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8340–8348.
[2] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. 2016. Generative adversarial text to image synthesis. In Proceedings of the International Conference on Machine Learning. PMLR, 1060–1069.
[3] Kenan E. Ak, Joo Hwee Lim, Jo Yew Tham, and Ashraf A. Kassim. 2020. Semantically consistent text to fashion image synthesis with an enhanced attentional generative adversarial network. Pattern Recogn. Lett. 135 (2020), 22–29.
[4] Krishna Regmi and Ali Borji. 2018. Cross-view image synthesis using conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3501–3510.
[5] S. Saraswathi and T. V. Geetha. 2004. Building language models for Tamil speech recognition system. In Proceedings of the Asian Applied Computing Conference. Springer, Berlin, 161–168.
[6] Suresh Sundaram and A. G. Ramakrishnan. 2012. Language models for online handwritten Tamil word recognition. In Proceeding of the Workshop on Document Analysis and Recognition. 42–48.
[7] Suresh Sundaram and A. G. Ramakrishnan. 2015. Bigram language models and reevaluation strategy for improved recognition of online handwritten Tamil words. ACM Trans. Asian Low-Res. Lang. Info. Process. 14, 2 (2015), 1–28.
[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020. Generative adversarial networks. Commun. ACM 63, 11 (2020), 139–144.
[9] Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. Retrieved from arXiv:1511.06434.
[10] Mehran Mehralian and Babak Karasfi. 2018. RDCGAN: Unsupervised representation learning with regularized deep convolutional generative adversarial networks. In Proceedings of the 9th Conference on Artificial Intelligence and Robotics and 2nd Asia-Pacific International Symposium. IEEE, 31–38.
[11] Bin Zhu and Chong-Wah Ngo. 2020. CookGAN: Causality-based text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5519–5527.
[12] Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang. 2019. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 5802–5810.
[13] Hongchen Tan, Xiuping Liu, Meng Liu, Baocai Yin, and Xin Li. 2020. KT-GAN: Knowledge-transfer generative adversarial network for text-to-image synthesis. IEEE Trans. Image Process. 30 (2020), 1275–1290.
[14] Tobias Hinz, Stefan Heinrich, and Stefan Wermter. 2019. Semantic object accuracy for generative text-to-image synthesis. Retrieved from arXiv:1910.13321.
[15] Purva Raut, Moxa Doshi, Monil Diwan, and Karan Doshi. 2020. Face completion using generative adversarial network. In Advanced Computing Technologies and Applications. Springer, 523–531.
[16] A. Bharath and Sriganesh Madhvanath. 2011. HMM-based lexicon-driven and lexicon-free word recognition for online handwritten Indic scripts. IEEE Trans. Pattern Anal. Mach. Intell. 34, 4 (2011), 670–682.
[17] Kengatharaiyer Sarveswaran, Gihan Dias, and Miriam Butt. 2021. ThamizhiMorph: A morphological parser for the Tamil language. Mach. Transl. 35, 1 (2021), 37–70.
[18] M. Suriyah, Aarthy Anandan, Anitha Narasimhan, and Madhan Karky. 2019. Piripori: Morphological analyser for Tamil. In Proceedings of the International Conference on Artificial Intelligence, Smart Grid And Smart City Applications. Springer, Cham, 801–809.
[19] S. Ramraj, R. Arthi, Solai Murugan, and M. S. Julie. 2020. Topic categorization of Tamil News Articles using Pre-Trained Word2Vec Embeddings with Convolutional Neural Network. In Proceedings of the International Conference on Computational Intelligence for Smart Power System and Sustainable Energy (CISPSSE’20). IEEE, 1–4.
[20] Sajeetha Thavareesan and Sinnathamby Mahesan. 2019. Sentiment analysis in Tamil texts: A study on machine learning techniques and feature representation. In Proceedings of the 14th Conference on Industrial and Information Systems (ICIIS’19). IEEE, 320–325.

[21] Casper Shikali Shivachi, Refuoe Mokhosi, Zhou Shijie, and Liu Qihe. 2021. Learning syllables using Conv-LSTM model for Swahili word representation and part-of-speech Tagging. Trans. Asian Low-Res. Lang. Info. Process. 20, 4 (2021), 1–25.
[22] Touseef Iqbal and Shaima Qureshi. 2022. The survey: Text generation models in deep learning. Journal of King Saud University-Computer and Information Sciences 34, 6 (2022), 2515–2528.
[23] Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. Scibert: A pre-trained language model for scientific text. Retrieved from arXiv:1903.10676.
[24] Christophe Van Gysel, Maarten De Rijke, and Evangelos Kanoulas. 2018. Neural vector spaces for unsupervised information retrieval. ACM Trans. Info. Syst. 36, 4 (2018), 1–25.
[25] Bo Chang, Qiong Zhang, Shenyi Pan, and Lili Meng. 2018. Generating handwritten chinese characters using cyclegan. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV’18). IEEE, 199–207.
[26] Md Zakir Hossain, Ferdous Sohel, Mohd Fairuz Shiratuddin, Hamid Laga, and Mohammed Bennamoun. 2021. Text to image synthesis for improved image captioning. IEEE Access 9 (2021), 64918–64928.
[27] Min Wang, Congyan Lang, Liqian Liang, Songhe Feng, Tao Wang, and Yutong Gao. 2020. End-to-end text-to-image synthesis with spatial constraints. ACM Trans. Intell. Syst. Technol. 11, 4, Article 47 (Aug. 2020), 19 pages. https://doi.org/10.1145/3391709
[28] Min Zhang, Chunye Li, and Zhiping Zhou. 2021. Text to image synthesis using multi-generator text conditioned generative adversarial networks. Multimedia Tools Appl. 80, 5 (Feb 2021), 7789–7803. https://doi.org/10.1007/s11042-020-09965-5
[29] S. Khanuja, D. Bansal, S. Mehtani, S. Khosla, A. Dey, B. Gopalan, D. K. Margam, P. Aggarwal, R. T. Nagipogu, S. Dave, S. Gupta, S. C. Gali, V. Subramanian, and Talukdar. 2021. MuRIL: Multilingual representations for Indian languages. Retrieved from https://arxiv.org/abs/2103.10730.

Received 13 January 2022; revised 29 December 2022; accepted 4 February 2023
