
Text to Image Generation Using Machine Learning

Rishab Tiwari[1], Dr. Chitra K[2]

Dayananda Sagar Academy of Technology and Management, Bangalore 560082[1]
Dayananda Sagar Academy of Technology and Management, Bangalore 560082[2]

[email protected], [email protected]

Abstract

Text-to-image synthesis is the task of automatically creating images from provided written descriptions. It contributes significantly to artificial intelligence by tackling the problem of integrating textual and visual input. One useful route to automatic image synthesis is conditional generative modeling, for which Generative Adversarial Networks (GANs) are frequently employed, and recent GAN-based developments in the field have made significant progress. The transformation of text into images is an outstanding illustration of deep learning's potential. It remains difficult, however, to build a text-to-image synthesis system that consistently creates realistic graphics satisfying predetermined criteria, and many existing algorithms struggle to produce visuals that precisely match the given text. To address this issue, we carried out a research work concentrating on a deep learning architecture based on the generative adversarial network (GAN). The aim of this work is to create a system that generates images that are semantically consistent with the input text.

Keywords

Generative Adversarial Networks, Convolutional Neural Networks, Deep Learning


Introduction

The goal of text-to-image (T2I) generation is to produce visually accurate and semantically consistent images from textual descriptions. Given its significance in numerous applications, including photo editing, art creation, and computer-aided design, this problem has attracted a great deal of focus in the deep learning community. The high dimensionality of the output space and the semantic gap between the textual and visual domains are the key reasons it also poses substantial obstacles. Once the technology is ready for commercial use, the creation of graphics from natural language holds enormous promise for a variety of future applications. Generative Adversarial Networks (GANs) are generative models capable of producing new content, and GAN models are now frequently employed in this field to obtain better outcomes. One barrier for deep learning is that a single text description can correspond to many valid image configurations; with proper training, this ambiguity can be handled. As GANs have developed, they have proven remarkably effective at a variety of image tasks, such as image synthesis, image super-resolution, data augmentation, and image-to-image translation.

Problem Statement

Text can occasionally be difficult to comprehend, and visualizing the described information can be harder still; there is also a chance that some words or phrases will be misunderstood. Information is simpler to understand and absorb when it is presented as graphics rather than text. Compared to plain text, images are typically more appealing and compelling: visual aids communicate information more effectively and immediately, and visual material can draw viewers in and hold their interest. Visual components play an important role in many tasks, including presentations, learning, and communication, and well-designed visual communication has several advantages.

Understanding Deep Learning

Deep learning is a subset of AI that processes data to carry out tasks such as language translation and object recognition by mimicking functions of the human brain. Deep learning has made great progress over the years, helped by the wealth of readily available data. Since most of this data is unstructured, it is time-consuming for people to extract the necessary information; deep learning has overcome this issue by making it possible to understand and analyze such data effectively. Deep learning uses artificial neural networks in an effort to replicate how the human brain operates. Because of their hierarchical form, these networks can process data across numerous layers. Convolutional neural networks and recurrent neural networks are two common architectures. In general, deep learning is transforming many scenarios and applications of artificial intelligence.

Literature Review

[1]. Y. Kataoka, T. Matsubara, and K. Uehara presented the paper "Image Generation using Generative Adversarial Networks and Attention Mechanism" at the 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS).

[2]. "StackGAN: Text to Photorealistic Image Synthesis with Stacked Generative Adversarial Networks" is a research paper authored by Han Zhang, Tao Xu, Hongsheng Li, and Shaoting Zhang of Rutgers University and Lehigh University, published in August 2017.

The article "StackGAN: Text to Photorealistic Image Synthesis with Stacked Generative Adversarial
Networks" concludes with presenting a novel technique for text-to-image synthesis utilizing a two-stage
GAN architecture. The model significantly advances generative picture synthesis by stacking two GANs to
produce higher-quality, more realistic pictures based on textual descriptions.

[3]. "Text to Image Synthesis Using Generative Adversarial Networks" is a study by Stian Bodnar and Jon Shapiro of The University of Manchester, released in May 2018.

The study "Text to Image Using Generative Adversarial Synthesis explores the capacity of GANs to
generate realistic images from textual descriptions, and it concludes by presenting a GAN-based method for

2
text-to-image synthesis. The study likely makes a contribution to the multimodal learning and generative
picture synthesis fields, where it is particularly difficult to combine text and image data.

[4]. "Large Scale GAN Training for High Fidelity Natural Image Synthesis" by Andrew Brock, Jeff Donahue, and Karen Simonyan was released in 2019. In this paper, the authors investigate methods for training generative adversarial networks (GANs) at massive scale.

With an emphasis on scaling GAN training to large datasets and producing high-resolution, high-fidelity images, the work demonstrates a significant improvement in GAN-based image synthesis. It has significance for numerous computer vision and graphics applications and advances generative models for realistic image synthesis.

Methodology

The deep learning method we utilized is the Generative Adversarial Network (GAN), which consists of a generator and a discriminator. For text-to-image generation we also employed TensorFlow, NumPy, NLTK, and TensorLayer. TensorFlow is essentially a machine learning library; it compiles faster than many other deep learning libraries and supports both CPU and GPU computing units. We use the Python pickle module for data serialization in our network design: by converting objects into byte streams, it lets us easily store the data in files and transfer it between different systems and applications.
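As a minimal sketch of this serialization step (the captions shown are hypothetical, not taken from our dataset), pickle round-trips a Python object through a byte stream on disk:

```python
import pickle

# Hypothetical preprocessed data: caption IDs mapped to tokenized captions.
captions = {0: ["a", "yellow", "flower", "with", "long", "petals"],
            1: ["a", "red", "rose", "in", "bloom"]}

# Serialize the object into a byte stream and store it in a file.
with open("captions.pkl", "wb") as f:
    pickle.dump(captions, f)

# Restore the identical object later, possibly on another system.
with open("captions.pkl", "rb") as f:
    restored = pickle.load(f)

assert restored == captions
```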

A Generative Adversarial Network is an unsupervised learning strategy that trains a generative model to produce fresh examples. GANs use neural networks to create new instances of data and can be applied in many domains, such as the synthesis of images and sounds. In the context of GANs, "generative" refers to learning a model that can produce fresh data, and the model is trained using neural networks [6]. The GAN has two component sections: the discriminator and the generator.

Generator:
The generator in a Generative Adversarial Network (GAN) is in charge of producing fresh instances of data, typically synthetic (fake) examples. These generated samples are then shown to the discriminator, whose job is to discern between real and fake data. The generator's goal is to produce samples that successfully fool the discriminator, making it challenging to determine whether a sample is real or artificial. This competitive procedure between generator and discriminator drives the learning and long-term development of both models.

Discriminator:
The discriminator in a Generative Adversarial Network (GAN) is in charge of separating real samples from fake samples produced by the generator. Deep neural networks are used for both the generator and the discriminator. The generator's goal is to trick the discriminator by creating fake instances that resemble real data, while the discriminator seeks to accurately recognize and categorize genuine data samples. As a result, there is competition between the generator and the discriminator.
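The following sketch illustrates this two-network structure with a minimal unconditional generator and discriminator in TensorFlow/Keras; the layer sizes and 64x64 output resolution are illustrative assumptions rather than our exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 100  # assumed noise size for this illustration

def build_generator():
    # Maps a noise vector to a synthetic 64x64 RGB image.
    z = tf.keras.Input(shape=(LATENT_DIM,))
    x = layers.Dense(4 * 4 * 256, activation="relu")(z)
    x = layers.Reshape((4, 4, 256))(x)
    for filters in (128, 64, 32):  # upsample 4x4 -> 8 -> 16 -> 32
        x = layers.Conv2DTranspose(filters, 4, strides=2,
                                   padding="same", activation="relu")(x)
    img = layers.Conv2DTranspose(3, 4, strides=2, padding="same",
                                 activation="tanh")(x)  # -> 64x64
    return tf.keras.Model(z, img)

def build_discriminator():
    # Maps a 64x64 RGB image to a single real/fake logit.
    img = tf.keras.Input(shape=(64, 64, 3))
    x = img
    for filters in (64, 128):
        x = layers.Conv2D(filters, 4, strides=2, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    logit = layers.Dense(1)(layers.Flatten()(x))  # raw logit, no sigmoid
    return tf.keras.Model(img, logit)
```

Our model additionally conditions both networks on the text embedding; the conditional variants extend these inputs rather than changing the adversarial structure.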

Proposed Methodology:

We used Conditional Generative Adversarial Networks (GANs) in combination with Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) in the training phase of our deep learning-based generative models to produce meaningful images from textual descriptions. Our dataset consisted of flower photos and the textual descriptions that accompany them.
To generate convincing visuals from text using GANs, we preprocessed the textual data and scaled the photos to a fixed dimension. We parsed the dataset's caption sentences, built a vocabulary list, and gave each caption a unique ID. The photos were loaded and appropriately scaled. This preprocessed textual and visual data then served as the foundation for our proposed model.
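A rough sketch of this preprocessing, under assumed file names and a 64x64 target size (our exact image dimensions are not specified above):

```python
from collections import Counter
import numpy as np
from PIL import Image  # Pillow

# Hypothetical raw data: image files paired with caption strings.
raw = [("flower_001.jpg", "a yellow flower with long petals"),
       ("flower_002.jpg", "a red rose in bloom")]

# Parse the caption sentences and build a vocabulary list.
counts = Counter(w for _, cap in raw for w in cap.lower().split())
vocab = ["<pad>", "<unk>"] + sorted(counts)
word_to_id = {w: i for i, w in enumerate(vocab)}

def encode_caption(caption, max_len=12):
    # Map each word to its vocabulary ID and pad to a fixed length.
    ids = [word_to_id.get(w, word_to_id["<unk>"]) for w in caption.lower().split()]
    return np.array((ids + [0] * max_len)[:max_len], dtype=np.int32)

def load_image(path, size=(64, 64)):
    # Load a photo and rescale it to a fixed dimension in [-1, 1].
    img = np.asarray(Image.open(path).convert("RGB").resize(size), np.float32)
    return img / 127.5 - 1.0  # range matches a tanh-output generator

# Each caption receives a unique ID (here, its index in the list).
captions = {i: encode_caption(cap) for i, (_, cap) in enumerate(raw)}
```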

We used RNNs to extract contextual information from the text sequences: the RNN captures the association between words at different time steps, which allows the model to comprehend the textual descriptions. We combined the RNN with a CNN to carry out the text-to-image mapping; the CNN retrieved pertinent details from the photographs without human input.

During training, an input sequence of textual descriptions was fed into the RNN, which transformed the text into 256-dimensional word embeddings. A 512-dimensional noise vector was then concatenated with these embeddings. With a gated-feedback generator of size 128 and a batch size of 64, we trained our model while feeding the generator inputs of noise and text.
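A sketch of that input pipeline with the stated sizes (256-dimensional text embedding, 512-dimensional noise, batch size 64); the GRU stands in for the gated recurrent encoder, and the vocabulary size is an assumed placeholder:

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 5000  # assumed; set this from the real vocabulary
EMBED_DIM = 256    # text embedding size stated above
NOISE_DIM = 512    # noise vector size stated above
BATCH = 64         # batch size stated above

# Text encoder: token IDs -> one 256-dimensional embedding per caption.
tokens = tf.keras.Input(shape=(None,), dtype=tf.int32)
h = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tokens)
summary = layers.GRU(EMBED_DIM)(h)  # gated RNN summarizes the word sequence
text_encoder = tf.keras.Model(tokens, summary)

caption_ids = tf.random.uniform((BATCH, 12), 0, VOCAB_SIZE, dtype=tf.int32)
text_embedding = text_encoder(caption_ids)    # shape (64, 256)
noise = tf.random.normal((BATCH, NOISE_DIM))  # shape (64, 512)

# The generator consumes the concatenation of noise and text features.
generator_input = tf.concat([noise, text_embedding], axis=1)  # shape (64, 768)
```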

The semantic data extracted from the textual description served as input to the generator model, which translated the distinguishing details into pixel-level data and produced corresponding images. The discriminator then received these generated images as input, along with matching or mismatched textual descriptions and authentic sample images from the collection.

To meet the discriminator's objectives, the model receives as training input a series of distinct pairings of images and their textual descriptions. The input pairings consist of generated images with correct text descriptions, mismatched images with text descriptions that do not describe them, and genuine photographs with correct text descriptions.
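One way to express the three pairings as a discriminator loss, assuming a discriminator d that maps an (image, text embedding) pair to a single real/fake logit (a sketch, not our exact formulation):

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(d, real_img, wrong_img, fake_img, text_emb):
    # Pairing 1: genuine photo + its correct caption -> train toward "real".
    logits_real = d([real_img, text_emb])
    # Pairing 2: mismatched photo + that caption -> train toward "fake".
    logits_wrong = d([wrong_img, text_emb])
    # Pairing 3: generated image + the caption it was made from -> "fake".
    logits_fake = d([fake_img, text_emb])

    loss_real = bce(tf.ones_like(logits_real), logits_real)
    loss_wrong = bce(tf.zeros_like(logits_wrong), logits_wrong)
    loss_fake = bce(tf.zeros_like(logits_fake), logits_fake)
    # Averaging the two "fake" terms is one common weighting choice.
    return loss_real + 0.5 * (loss_wrong + loss_fake)
```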

Real image and real text combinations enable the model to learn whether a given image and text pairing are aligned with one another; pairing an incorrect image with a true written description signals a mismatch between the image and the caption.

The discriminator is taught to distinguish between authentic and artificial images; its classification performance is initially mainly concerned with telling correct images from false ones. During training, the loss is computed to drive weight updates and provide training feedback to both the generator and discriminator models.
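A sketch of one such training iteration, reusing the discriminator_loss helper above and assuming Keras models g and d with the input signatures shown earlier:

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

@tf.function
def train_step(g, d, g_opt, d_opt, real_img, wrong_img, text_emb, noise):
    with tf.GradientTape() as g_tape, tf.GradientTape() as d_tape:
        fake_img = g([noise, text_emb], training=True)
        d_loss = discriminator_loss(d, real_img, wrong_img, fake_img, text_emb)
        # The generator improves when the discriminator scores its output,
        # paired with the matching caption, as real.
        fake_logits = d([fake_img, text_emb], training=True)
        g_loss = bce(tf.ones_like(fake_logits), fake_logits)
    # Each network is updated only from the gradients of its own loss,
    # which provides the training feedback described above.
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, d.trainable_variables),
                              d.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, g.trainable_variables),
                              g.trainable_variables))
    return g_loss, d_loss
```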

Result

The user enters a textual description into the GUI application, which processes it and displays the accompanying image: the application returns an image that matches the written description.
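A minimal sketch of such a GUI flow using tkinter; generate_image is a hypothetical stand-in for running the trained generator on the entered text:

```python
import tkinter as tk
from PIL import Image, ImageTk  # Pillow

def generate_image(text: str) -> Image.Image:
    # Hypothetical helper: in the real application this would encode the
    # text and run the trained generator; here it returns a placeholder.
    return Image.new("RGB", (64, 64), "green")

root = tk.Tk()
root.title("Text to Image")

entry = tk.Entry(root, width=50)
entry.pack()

panel = tk.Label(root)
panel.pack()

def on_generate():
    photo = ImageTk.PhotoImage(generate_image(entry.get()))
    panel.configure(image=photo)
    panel.image = photo  # keep a reference so it is not garbage-collected

tk.Button(root, text="Generate", command=on_generate).pack()
root.mainloop()
```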

Fig 1.

Fig 2.

Conclusion

We have demonstrated through experimentation that our architecture is capable of generating images that reflect the aesthetic of the given dataset. We are confident that the generated images can accurately depict the content stated in the captions and that the system can handle different styles, provided enough time is allotted for creating a styled dataset, training the model, and performing hyperparameter search. It is crucial to keep in mind, however, that creating a stylized image dataset and training the model can be computationally expensive; we therefore do not advise training a system with this design unless adequate training time is available and the time savings it offers are essential to the intended application. In many different applications and business use cases, text-based synthetic image synthesis is extremely useful. It can help machine learning models that lack sufficient image data by letting them train on artificial images created from text descriptions. Chatbots can use this capability to generate pertinent, contextual visuals that improve user interactions. Search engines and stock photo websites can also benefit from synthetic image synthesis, by visualizing search queries or filling gaps in their image archives.

References

[1]. Akanksha Singh, Sonam Anekar, Ritika Shenoy, and Prof. Sainath Patil, "Text to Image using Deep Learning," International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, Vol. 10, Issue 04, April 2021.

[2]. Y. Kataoka, T. Matsubara, and K. Uehara, "Image Generation using Generative Adversarial Networks and Attention Mechanism," in 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS), 2016.

[3]. Han Zhang, Tao Xu, Hongsheng Li, and Shaoting Zhang, "StackGAN: Text to Photorealistic Image Synthesis with Stacked Generative Adversarial Networks," Rutgers University and Lehigh University, August 2017.

[4]. Stian Bodnar and Jon Shapiro, "Text to Image Synthesis Using Generative Adversarial Networks," The University of Manchester, May 2018.

[5]. Andrew Brock, Jeff Donahue, and Karen Simonyan, "Large Scale GAN Training for High Fidelity Natural Image Synthesis," 2019.

Plagiarism Report
