
International Journal of Scientific Research in Engineering and Management (IJSREM)

Volume: 07 Issue: 08 | August - 2023 SJIF Rating: 8.176 ISSN: 2582-3930

Text-to-Image Generation using Generative AI

Anusha Bhambore, Bhagyashri, Pavithra R C, Tejashwini
AI&ML (VTU), Dayananda Sagar College of Engineering, Bangalore, India

Reshma S
Assistant Professor, AI&ML
Dayananda Sagar College of Engineering
Bangalore, India

Abstract—This survey reviews text-to-image generation using different approaches. One of the approaches identified in this study is the Cross-modal Semantic Matching Generative Adversarial Network (CSM-GAN), which is used to increase semantic consistency between text descriptions and synthesised pictures for fine-grained text-to-image creation. It includes two further modules, a Text Encoder Module and a Textual-Visual Semantic Matching Module. We further discuss Imagen, a text-to-image diffusion model with photorealism and deep language understanding, evaluated on the COCO dataset. Lastly, we discuss text-to-image synthesis that automates image generation using conditional generative models and GANs, advancing artificial intelligence and deep learning. Based on these approaches we present a review of text-to-image generation using generative AI.

Keywords— Generative AI, Diffusion model, Text-to-image, Imagen, CSM-GAN

I. INTRODUCTION

Text-to-image generation is a type of generative AI that allows computers to create images from written descriptions. To do this, a language model is trained on a large dataset of text and images. The model gains the capacity to connect written descriptions to the relevant photographs, so that when given a new written description it can create a picture that matches it. Recent advances in the field are significant: the quality of the images produced by text-to-image models has improved markedly, and they are now capable of creating images that are nearly indistinguishable from real photos. Marketing, entertainment, and advertising are just a handful of the industries that this technology has the potential to change.

In many applications, including computer-aided design, pedestrian picture editing, and text-to-image generation, this task is essential. The domain difference between texts and images, however, makes it difficult to produce aesthetically realistic images. Word-level attention techniques to enhance cross-modal semantic consistency have been presented by AttnGAN and MirrorGAN as a solution to this problem. However, the entropy loss in the latent space might produce embeddings with more intraclass spacing than interclass spacing, which can cause semantic structural ambiguity and semantic mismatch between the synthesised image and the text description. Only written descriptions from a realistic dataset are used in the text-to-image synthesis task, and a generator creates the corresponding images. This makes it challenging to train the discriminative feature detector and descriptor. To enable the generator to more effectively extract important semantics from unfamiliar text descriptions, the authors add a modal matching method to text-to-image synthesis [1].

Multimodal learning has grown in relevance in recent years, particularly in text-to-image synthesis and image-text contrastive learning. Imagen uses a transformer LM to capture the semantics of the text input, and then uses diffusion models to map the text to images. This allows for photorealistic image synthesis while also providing a deep understanding of the text input. Imagen consists of a frozen T5-XXL encoder, a 64x64 image diffusion model, and two super-resolution diffusion models, which generate 256x256 and 1024x1024 images, respectively. Classifier-free guidance is used to train and condition the diffusion models on text embedding sequences. Imagen relies on unique sampling methodologies to leverage large guidance weights without deteriorating sample quality, as demonstrated in earlier studies [2].
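As a rough illustration of the cascade just described, the following Python sketch wires a frozen text encoder to a 64x64 base diffusion model and two super-resolution stages. Every function here (encode_text, diffusion_sample, generate) is a simplified stand-in introduced for illustration, not the official Imagen implementation.

```python
import numpy as np

# Illustrative sketch of the cascaded text-to-image pipeline described above.
# Shapes and names are assumptions; real models would iteratively denoise.

def encode_text(prompt: str, dim: int = 4096) -> np.ndarray:
    """Stand-in for a frozen T5-XXL encoder: one embedding per token."""
    tokens = prompt.split()
    rng = np.random.default_rng(abs(hash(prompt)) % (2 ** 32))
    return rng.standard_normal((len(tokens), dim))

def diffusion_sample(text_emb: np.ndarray, size: int, low_res=None) -> np.ndarray:
    """Stand-in for one text-conditioned diffusion model."""
    img = np.random.rand(size, size, 3)
    if low_res is not None:  # super-resolution stages also condition on the low-res image
        scale = size // low_res.shape[0]
        img = 0.5 * img + 0.5 * np.kron(low_res, np.ones((scale, scale, 1)))
    return img

def generate(prompt: str) -> np.ndarray:
    emb = encode_text(prompt)                           # frozen language model, text only
    x64 = diffusion_sample(emb, 64)                     # 64x64 base diffusion model
    x256 = diffusion_sample(emb, 256, low_res=x64)      # first super-resolution stage
    x1024 = diffusion_sample(emb, 1024, low_res=x256)   # second super-resolution stage
    return x1024

print(generate("a red bird with a white belly").shape)  # (1024, 1024, 3)
```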


GANs are generative models that turn text into picture pixels to get better outcomes. They are employed in text-to-image synthesis, which translates word descriptions into pictures. However, due to the large number of alternative configurations, deep learning encounters difficulties in recognising single text descriptions [3].

Making interclass spacing larger than intraclass spacing can significantly increase the generalisation ability of models in classification and retrieval. The authors therefore intended to increase interclass spacing while decreasing intraclass spacing, which helps the semantic consistency and generalisation capacity of the text-to-image synthesis model, particularly for unknown text descriptions. They also added a modal matching technique to text-to-image synthesis to enable the generator to catch crucial meanings from unfamiliar text descriptions. Only written descriptions from realistic datasets are used in the text-to-image synthesis task, while their matching images are generated by a generator. The substantial amount of interfering information in synthesised images makes training the discriminative feature detector and descriptor difficult. The authors suggest a cross-modal matching task on text-to-image databases so that features can be made discriminative and resilient even on synthesised images. This modal matching approach becomes useful in leading the generator to create more semantically coherent images [1].

With a zero-shot FID-30K of 7.27, Imagen beats previous efforts such as GLIDE and DALL-E 2. It also outperforms cutting-edge COCO-trained models such as Make-A-Scene. Human raters find Imagen-produced samples to be comparable to reference images in image-text alignment on COCO captions [2].

Images are more appealing and have the ability to communicate information more immediately, making them ideal for critical activities such as presenting and learning. Deep learning, a subtype of AI, analyses data to translate languages and recognise objects by mimicking the operations of the human brain. It employs artificial neural networks with hierarchical structures such as Convolutional Neural Networks and Recurrent Neural Networks to imitate the functioning of the human brain [3].

The authors also investigate improved text feature representation, which appears to be overlooked in many current text-to-image synthesis algorithms. Text Convolutional Neural Networks (Text_CNNs) can better model semantics between neighbouring words and highlight crucial local phrase information in text descriptions. They have been used in natural language processing tasks and have demonstrated competitive performance on a variety of tasks such as sentence categorisation, machine translation, and others. In this research, they propose a feature fusion technique that integrates local visual information with Text_CNNs to capture and emphasise crucial local elements such as "red bird," "white belly," and "blue wings" that are significant in this task.

Fig. 1. Text_CNNs highlight local visual information. The Text_CNN captures and emphasises crucial local elements such as "red bird," "white belly," and "blue wings" in the final encoded feature vector, which play vital roles in this task.

The fundamental contribution of this study is a novel GAN-based text-to-image model for text-to-image synthesis, the Cross-modal Semantic Matching Generative Adversarial Network (CSM-GAN). The Textual-Visual Semantic Matching Module (TVSMM) and the Text Encoder Module (TEM) are two innovative modules in CSM-GAN. The suggested technique has been validated on two widely used benchmarks: CUB-Bird and MS-COCO [1].

The study proposes DrawBench, a novel structured suite of text prompts for text-to-image assessment that provides deeper insights through multi-dimensional text-to-image model evaluation. It also emphasises the advantages of employing large pre-trained language models as the text encoder for Imagen over multi-modal embeddings such as CLIP. The paper's key contributions include discovering that large frozen language models trained only on text data are surprisingly effective text encoders for text-to-image generation, introducing dynamic thresholding, highlighting important diffusion architecture design choices, achieving a new state-of-the-art COCO FID of 7.27, and outperforming all other work, including DALL-E 2 [2].
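The dynamic thresholding and large guidance weights mentioned above can be made concrete with a short sketch. The function below is a hedged reconstruction of the idea rather than the paper's exact procedure; the guidance weight and percentile values are illustrative assumptions.

```python
import numpy as np

# Sketch of classifier-free guidance followed by dynamic thresholding.
def guided_x0(x0_cond: np.ndarray, x0_uncond: np.ndarray,
              guidance_weight: float = 7.5, percentile: float = 99.5) -> np.ndarray:
    # Classifier-free guidance: push the prediction away from the unconditional one.
    x0 = x0_uncond + guidance_weight * (x0_cond - x0_uncond)

    # Dynamic thresholding: clip to the chosen percentile of |x0|, then rescale
    # back into [-1, 1] so large guidance weights do not saturate the image.
    s = max(float(np.percentile(np.abs(x0), percentile)), 1.0)
    return np.clip(x0, -s, s) / s

# Toy usage: random arrays stand in for the conditional/unconditional predictions.
rng = np.random.default_rng(0)
cond, uncond = rng.standard_normal((2, 64, 64, 3))
print(guided_x0(cond, uncond).min(), guided_x0(cond, uncond).max())
```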

To summarise, GAN models are extensively utilised to get better outcomes, yet there are difficulties in comprehending and processing unstructured data. Deep learning, a type of artificial intelligence, has the ability to revolutionise numerous scenarios and improve overall user experience [3].


A. Text Encoder Module (TEM)

Techniques for text-to-image synthesis often focus on altering and adding additional GAN modules. RNNs, on the other hand, have limited capacity to capture local textual components such as words and phrases; Text_CNNs are more adept at extracting these features. This paper introduces Text_CNNs for collecting and emphasising local textual features in text descriptions. Using Text_CNNs, the fundamental feature extraction method comprises embedding a word sequence into a D-dimensional feature space and extracting semantic elements of distinct n-grams using three 1-dimensional convolutional layers with varied kernel sizes. These feature maps effectively capture and highlight key local n-gram textual information.

The steps included in this model are:
Step 1: Take a word sequence and embed each word in a D-dimensional feature space as input. Following the original design, this word embedding is initialised with a pre-trained word2vec model trained on the Google News corpus.
Step 2: Capture semantic properties of various n-grams using three 1-dimensional convolutional layers with varied kernel sizes (e.g., filter size = 2, 3, 4; channels = m). An n-gram is a string of n words.
Step 3: Apply pooling layers to these three groups of feature maps to get refined semantic textual features a, b, and c.
Step 4: Concatenate feature vectors a, b, and c to create feature vector e.
Step 5: Use fully connected layers to further refine the sentence feature e1.

Text_CNNs are capable of successfully modelling local text characteristics, while RNNs are recognised to be capable of capturing dependencies in sequential data. The RNN model commonly used here is a bi-LSTM. It takes a sentence (i.e., a sequence of words) as input and outputs a sentence feature vector e2 ∈ R^D and a word feature matrix e ∈ R^(D×T), where the i-th column is the feature vector of the i-th word, D is the dimension of the word vector, and T is the number of words in the provided sentence. The composite text vector is e ∈ R^(2D), obtained by concatenating e1 (from the Text_CNN) and e2 (from the bi-LSTM). The text fusion function is then implemented as a succession of fully connected layers that generate the final fused sentence feature [1].
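A minimal sketch of the Text Encoder Module described by Steps 1-5 and the bi-LSTM branch is given below in PyTorch. The layer sizes, the mean-pooled bi-LSTM summary, and the class name are assumptions made for illustration; the original paper's exact configuration may differ.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the TEM: three n-gram convolutions plus a bi-LSTM branch.
class TextEncoderModule(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=300, channels=128, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)            # Step 1 (word2vec init in the paper)
        self.convs = nn.ModuleList([                              # Step 2: n-gram convolutions
            nn.Conv1d(emb_dim, channels, kernel_size=k) for k in (2, 3, 4)
        ])
        self.fc = nn.Linear(3 * channels, 2 * hidden)             # Step 5: sentence feature e1
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, tokens):                                    # tokens: (batch, T)
        x = self.embed(tokens)                                    # (batch, T, emb_dim)
        c = x.transpose(1, 2)                                     # Conv1d expects (batch, emb_dim, T)
        pooled = [conv(c).relu().max(dim=2).values for conv in self.convs]  # Step 3: pool each n-gram map
        e1 = self.fc(torch.cat(pooled, dim=1))                    # Steps 4-5: concatenate a, b, c then FC
        words, _ = self.bilstm(x)                                 # word feature matrix, (batch, T, 2*hidden)
        e2 = words.mean(dim=1)                                    # simple sentence summary of the bi-LSTM branch
        return torch.cat([e1, e2], dim=1), words                  # fused sentence feature and word features

enc = TextEncoderModule()
sentence, word_feats = enc(torch.randint(0, 5000, (4, 12)))       # batch of 4 sentences, 12 tokens each
print(sentence.shape, word_feats.shape)                           # torch.Size([4, 512]) torch.Size([4, 12, 256])
```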
B. Textual-Visual Semantic Matching Module (TVSMM)

AttnGAN controls the word-level attention mechanism with an entropy loss to promote semantic consistency. Entropy loss is also used by the newer MirrorGAN to match sentence semantics with the matching picture. As explained in Part I, this makes it difficult for the synthesiser to properly infer semantics. The Textual-Visual Semantic Matching Module (TVSMM) framework is therefore presented; the visual semantic embedding and the text encoder have been described above. The objective is that a matching synthesised picture and its congruent sentence should be more similar than incongruent pairings over the whole global semantic field. Because text descriptions are unknown at test time, neither AttnGAN nor MirrorGAN does well in terms of semantic generalisation. To address this issue, TVSMM is proposed as a stronger modal matching mechanism that assists the generator in reasoning about the semantics of unknown textual descriptions. TVSMM attempts to decrease intraclass distance and increase interclass distance in order to improve the diversity of synthesised pictures and the generalisability of generative models. TVSMM accepts sentence and image features as input: the sentence features are encoded using TEM, and CNN_Encoder provides the image features. To encode the picture feature, the Inception-v3 model pretrained on ImageNet is employed as the CNN_Encoder, and the global feature vector f ∈ R^2048 is taken from Inception-v3's final average pooling layer.
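A possible reconstruction of the CNN_Encoder step above is to pull the 2048-dimensional global feature from a torchvision Inception-v3 with its classification head removed. This is not the authors' code; it assumes torchvision 0.13 or newer, and the pretrained weights are downloaded on first use.

```python
import torch
import torch.nn as nn
from torchvision import models

# Replace the classifier with Identity so the forward pass returns the
# output of the final average pooling layer (2048-dimensional).
inception = models.inception_v3(weights=models.Inception_V3_Weights.DEFAULT)
inception.fc = nn.Identity()
inception.eval()   # in eval mode only the main branch output is returned

with torch.no_grad():
    batch = torch.rand(4, 3, 299, 299)   # Inception-v3 expects 299x299 inputs
    global_feats = inception(batch)
print(global_feats.shape)                # torch.Size([4, 2048])
```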
Let us denote a pair of positive sentence and image features (ē, v̄) and two negative pairs (e′, v̄) and (ē, v̄′), where e′ is the feature of a sentence that does not describe the image of v̄, and v̄′ is the feature of an image that ē does not describe. The objective function should make the similarity of the positive pair exceed that of all negative pairs. We can therefore define the ranking loss L_Rank as

L_Rank = Σ_{e′} [α + d(ē, v̄) − d(e′, v̄)]_+ + Σ_{v̄′} [α + d(ē, v̄) − d(ē, v̄′)]_+

where α is the margin, d(e, v) is the cosine distance between image feature v and sentence feature e, and e′ and v̄′ are negative samples. [x]_+ denotes max(x, 0). The hyperparameter α is set to 1.0 based on the validation set. The purpose of L_Rank in TVSMM is to pull corresponding image-text pairs closer to each other and push incompatible pairs further apart in the global semantic space, as illustrated in Figure 6 of [1]. Furthermore, Appendix A of [1] gives a more theoretical examination of why the ranking loss leads interclass spacing to be larger than intraclass spacing.
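The ranking loss above can be sketched as follows; using every other sample in the batch as a negative is our own simplification, and α = 1.0 follows the text.

```python
import torch
import torch.nn.functional as F

# Hedged sketch of L_Rank with cosine distance d and margin alpha.
def ranking_loss(sent: torch.Tensor, img: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    sent = F.normalize(sent, dim=1)                       # sentence features e, (batch, D)
    img = F.normalize(img, dim=1)                         # image features v, (batch, D)
    dist = 1.0 - sent @ img.t()                           # d(e_i, v_j): cosine distance matrix
    pos = dist.diag().unsqueeze(1)                        # d(e-bar, v-bar) for matched pairs
    off_diag = ~torch.eye(len(sent), dtype=torch.bool)    # keep only mismatched pairs
    neg_img = (alpha + pos - dist).clamp(min=0)[off_diag]       # negative images v' against e-bar
    neg_sent = (alpha + pos - dist.t()).clamp(min=0)[off_diag]  # negative sentences e' against v-bar
    return neg_img.mean() + neg_sent.mean()

# Toy usage: 8 matched sentence/image feature pairs of dimension 256.
s, v = torch.randn(8, 256), torch.randn(8, 256)
print(ranking_loss(s, v).item())
```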
Pre-training details of TVSMM: Text descriptions and photos from the real dataset form the data for the cross-modal matching task. As indicated in Section I, only textual descriptions are sourced from real data in this text-to-image synthesis task, while the images are synthesised by the GAN generator. If TVSMM is applied directly to the GAN model, the resulting pictures include a large quantity of distracting information, making it difficult to train discriminative features. In the pre-training step, the same datasets as in the T2I task (CUB-Bird and MS-COCO) must be used, and a comparable cross-modal matching task is performed. That is why the TVSMM module must first be pre-trained on the corresponding text-to-image dataset (CUB-Bird or MS-COCO).


Furthermore, AttnGAN's DAMSM incorporates an entropy-based word-level semantic matching loss; however, it must likewise be pre-trained on the text-to-image datasets. DAMSM and TVSMM are therefore trained jointly to encourage the generator to synthesise high-quality pictures. Both TVSMM and DAMSM contain the text encoder TEM, and the CNN_Encoder is an Inception-v3 model pretrained on ImageNet. The loss function in this training phase is computed on real image-text pairs [1]:

Lpre = LDAMSM + LRank
C. Diffusion model with photorealism and deep language understanding

Midjourney (version 4; MJ), Stable Diffusion (version 1.5; SD), and DALL-E 2 (DE) were used in the investigation. These three tools demonstrate the most recent breakthroughs in text-to-image creation for public consumption. Because they make it simple to combine images and written instructions, the tools have gained in favour. Midjourney and DALL-E are both available online through Midjourney and OpenAI. For Stable Diffusion, the study employed Stability AI's web-based Dream Studio interface. Each of the three frameworks (MJ, DE, SD) was used differently in the sessions, with up to two people using just one of the tools in each meeting. Each of the three image generators had enough credits to generate images for the duration of the session. When it was determined that SD only maintains prompt history in the participant's local browser history, data from two SD participants (P3, P4) was lost in S1. The laptop supplied to S2-S3 participants made it possible to examine the locally saved browsing history as they interacted with SD. Because SD does not store a complete history of prompts, the data from S2-S3 only comprises the 100 most recent prompts.

Participants made images using a range of stimuli. The created visuals are explained, the participants' prompt language is analysed, and the interview data and general comments from the sessions are then examined. In the qualitative portion, the authors analyse the insights gathered from the group interviews, investigate the participants' use of prompts to visualise their ideas, and assess the effectiveness of the image generators in assisting the design work [2].

D. GAN-CLS Algorithm

GAN is the deep learning approach employed, which consists of a generator and a discriminator. To create an image from text, the TensorFlow machine learning package is employed. The text is tokenised with NLTK, and TensorLayer builds the layers for the generator and the discriminator. Data is serialised using the Python Pickle package. A Generative Adversarial Network (GAN) is an unsupervised learning technique that uses neural networks to generate new instances. A GAN is divided into two parts: the generator, which makes bogus samples, and the discriminator, which differentiates between actual and bogus samples. Both sub-models are deep neural networks, with the generator attempting to deceive the discriminator and the discriminator correctly detecting the true samples. Training the GAN model takes a long period [2].

The GAN-CLS method is used for discriminator and generator training. The algorithm takes three input pairs: correct text with an actual picture, wrong text with a genuine image, and a false image with correct text. The dataset utilised is the Oxford-102 flower collection, which comprises 8,192 photos of various species. The project employs 8,000 photos for training and 189 images for testing, with 10 descriptions per image.

Fig. 2. Flow Chart

The flowchart depicts the process of training the model with the algorithm and the outcomes. The project also contains a Graphical User Interface (GUI) built with PySimpleGUI, which shows user inputs and makes the project more engaging and approachable [3].
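A hedged sketch of one GAN-CLS discriminator update over the three input pairs described above is given below. The cited project uses TensorFlow; the stand-in scorer here is a toy function introduced only to make the loss terms concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def discriminator(image: np.ndarray, text_emb: np.ndarray) -> float:
    """Toy stand-in for D(image, text): a real discriminator would be a conv
    net; here a random projection is simply squashed into (0, 1)."""
    feats = np.concatenate([image.ravel()[:128], text_emb[:128]])
    return 1.0 / (1.0 + np.exp(-feats.mean()))

def discriminator_loss(real_img, wrong_txt, right_txt, fake_img, eps=1e-8):
    s_real = discriminator(real_img, right_txt)    # real image + matching text   -> label 1
    s_wrong = discriminator(real_img, wrong_txt)   # real image + mismatched text -> label 0
    s_fake = discriminator(fake_img, right_txt)    # generated image + matching text -> label 0
    return -(np.log(s_real + eps)
             + 0.5 * (np.log(1 - s_wrong + eps) + np.log(1 - s_fake + eps)))

# Toy usage with random images and text embeddings.
real_img, fake_img = rng.random((2, 64, 64, 3))
right_txt, wrong_txt = rng.standard_normal((2, 1024))
print(discriminator_loss(real_img, wrong_txt, right_txt, fake_img))
```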
II. RESULT

While CSM-GAN produces fine-grained pictures with consistent colours and preserved semantic diversity, AttnGAN loses image details, causes colours to deviate from the text descriptions, and makes shapes look strange [1].

Imagen outperforms DALL-E 2 and COCO-trained models in terms of zero-shot FID scores, picture quality, and alignment. Human evaluation revealed 39.2% photorealism and 43.6% caption similarity, and Imagen beats other models in terms of accurate text and picture alignment [2]. The GAN architecture and the GAN-CLS algorithm were used to match captions to the Oxford-102 Flowers dataset, with a focus on flower morphology. The presentation of accurate images is ensured via GUI-processed user input [3].
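For reference, the Frechet Inception Distance quoted in these results can be computed from Inception features roughly as follows. This is an illustrative sketch that assumes SciPy is available; the 64-dimensional toy features stand in for real 2048-dimensional Inception-v3 activations.

```python
import numpy as np
from scipy.linalg import sqrtm

# FID between two feature sets: squared mean difference plus a covariance term.
def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean))

# Toy usage with random 64-dimensional features for 500 real and 500 generated images.
rng = np.random.default_rng(0)
print(fid(rng.standard_normal((500, 64)), rng.standard_normal((500, 64))))
```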

III. LIMITATION

First, the generated outcome is significantly influenced by the original image quality. Second, the amount of information that each word in an input sentence conveys varies according to the content of the image [3]. Text-to-image generators can be used quite successfully, especially when sophisticated features and knowledgeable users are involved; the participants in our study, however, were generally untrained and began using the tools from scratch, at least in the context of the design assignment. With the aid of more challenging instructions or tasks, the generated images may be improved and the issues observed in the experiment may be resolved [2]. It is also particularly challenging that the CUB-Bird and MS-COCO datasets are so large [1].

IV. CONCLUSION

In order to improve semantic consistency and capture local structural information using Text Convolutional Neural Networks, the research suggests a Cross-modal Semantic Matching Generative Adversarial Network (CSM-GAN) [1]. A laboratory study of 17 architecture students revealed that they used image generation in early architectural concept ideation in various ways; the design of image generators should encourage creative experimentation, and educators should emphasise appropriate usage and teach advanced usage to ensure efficient and meaningful use [2]. Using the COCO and CUB datasets, this study assesses text-to-image generation methods based on Generative Adversarial Networks, with performance highlighted by metrics such as Inception Score, Frechet Inception Distance, and R-Precision. The study can be expanded to incorporate indicators that improve performance and new domain datasets for deeper comprehension [3].

REFERENCES

[1] Hongchen Tan, Xiuping Liu, Baocai Yin, and Xin Li, "Cross-Modal Semantic Matching Generative Adversarial Networks for Text-to-Image Synthesis," IEEE Transactions on Multimedia, February 2021.
[2] Ville Paananen, Jonas Oppenlaender, and Aku Visuri, "Using Text-to-Image Generation for Architectural Design Ideation," arXiv:2304.10182 [cs.HC], 20 April 2023.
[3] Rida Malik Mubeen and Sai Annanya Sree Vedala, "Generative Adversarial Network Architectures for Text to Image Generation: A Comparative Study," IRJET, 2021.
