
Received 26 January 2024, accepted 6 February 2024, date of publication 9 February 2024, date of current version 20 February 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3365043

Text-to-Image Synthesis With Generative Models: Methods, Datasets, Performance Metrics, Challenges, and Future Direction
SARAH K. ALHABEEB AND AMAL A. AL-SHARGABI
Department of Information Technology, College of Computer, Qassim University, Buraydah 51452, Saudi Arabia
Corresponding author: Amal A. Al-Shargabi ([email protected])
This work was supported by the Deanship of Scientific Research, Qassim University.

ABSTRACT Text-to-image synthesis, the process of turning words into images, opens up a world of creative
possibilities, and meets the growing need for engaging visual experiences in a world that is becoming
more image-based. As machine learning capabilities expanded, the area progressed from simple tools
and systems to robust deep learning models that can automatically generate realistic images from textual
inputs. Modern, large-scale text-to-image generation models have made significant progress in this direction,
producing diversified and high-quality images from text description prompts. Although several methods
exist, Generative Adversarial Networks (GANs) have long held a position of prominence. However, diffusion
models have recently emerged, with results that go well beyond those achieved by GANs. This study offers
a concise overview of text-to-image generative models by examining the existing body of literature and
providing a deeper understanding of this topic. This will be accomplished by providing a concise summary
of the development of text-to-image synthesis, previous tools and systems employed in this field, key types
of generative models, as well as an exploration of the relevant research conducted on GANs and diffusion
models. Additionally, the study provides an overview of common datasets utilized for training the text-to-
image model, compares the evaluation metrics used for evaluating the models, and addresses the challenges
encountered in the field. Finally, concluding remarks are provided to summarize the findings and implications
of the study and open issues for further research.

INDEX TERMS Deep learning, diffusion model, generative models, generative adversarial network, text-to-image synthesis.

I. INTRODUCTION
The rapid improvements made by Artificial Intelligence (AI) in a variety of applications have been remarkable. AI has shown its potential in various ways, and one of the most interesting applications is text-to-image generation. This technology uses natural language processing and computer vision to generate an image based on a given text input. The text provided serves as a set of instructions for the AI's techniques to create an image, which is then rendered in a variety of formats, such as vector graphics, 3D renders, and more. The development of a system that comprehends the relationship between vision and language and can generate visuals that correspond to textual descriptions is a significant step toward achieving an intelligence comparable to that of humans [1].

The associate editor coordinating the review of this manuscript and approving it for publication was Khursheed Aurangzeb.

In recent years, deep learning has allowed for significant progress in the realm of computer vision, allowing for new and improved applications and methods for processing images. Deep learning seeks to discover deep and hierarchical models that accurately describe probability distributions across the many types of data used in AI systems [2]. Image synthesis, the creation of new images and the alteration of old ones, is one such area. Image editing, art generation, computer-aided design, and virtual reality are just a few of the many real-world applications that make image synthesis an engaging and consequential endeavor [3]. One of the popular approaches is to guide image synthesis with a text description, which leads to text-to-image synthesis, addressed in the following section.

A. TEXT-TO-IMAGE SYNTHESIS
Text-to-image synthesis, or the generation of images from text descriptions, is a complex computer vision and machine learning problem that has seen significant progress in recent years. Automatic image generation from natural language would allow users to describe visual elements through visually rich text descriptions. Visual content, like pictures, is a better way to share and understand information because it is more accurate and easier to understand than written text [4]. Text-to-image synthesis refers to the use of computational methods to convert human-written textual descriptions (sentences or keywords) into visually equivalent representations of those descriptions (images) [3]. The best alignment of visual content matching the text used to be determined through word-to-image correlation analysis combined with supervised methods. New unsupervised methods, especially deep generative models, have emerged as a result of recent developments in deep learning. These models are able to generate reasonable visual images by employing appropriately trained neural networks [3]. Figure 1 shows the general architecture of text-to-image generation: a text prompt is fed into an image generative model, which uses the text description to generate an image.

FIGURE 1. General architecture of text-to-image generation.

B. TRADITIONAL METHODS FOR TEXT-TO-IMAGE SYNTHESIS
Early attempts at translating text into images aimed to bridge the gap between humans and machines by emphasizing the importance of natural language comprehension. Some of these systems can take a piece of text written in a natural language and transform it into a sequence of static or dynamic visual representations. Zakraoui et al. [5] conducted an analysis of several established text-to-picture systems and tools, with a focus on identifying challenges and primary issues encountered by previous iterations. This subsection presents an overview of previous text-to-image systems and tools.

The story picturing engine [6] was applied to the task of effectively complementing a narrative with appropriate visual representations. The method is made up of three separate steps: processing the story and choosing photos, figuring out how similar things are, and ranking based on reinforcement. The text-to-picture synthesis system [7] aimed to improve communication by generating visual representations based on textual input. The system followed an evolutionary process and adopted semantic role labeling as opposed to keyword extraction, incorporating the concept of picturability to assess the likelihood of identifying a suitable image that represents a given word. To produce compilations of images obtained from the Flickr platform, Word2Image [8] implemented a variety of methodologies, including semantic clustering, correlation analysis, and visual clustering.

Moreover, WordsEye [9] is a text-to-scene system that mechanically generates static, 3D scenes that are representational of the supplied content. A language analyzer and a visualiser are the two primary parts of the system. Also, a multi-modal system called CONFUCIUS [10], which works as a text-to-animation converter, can convert any sentence containing an action verb into an animation that is perfectly synced with speech. A visually assisted instant messaging technique, called Chat With Illustration (CWI) [11], automatically provides users with visual messages connected with text messages. Nevertheless, many different systems for other languages exist. In order to handle the Russian language, the Utkus [12] text-to-image synthesis system utilizes a natural language analysis module, a stage processing module, and a rendering module. Likewise, Vishit [13] is a method for visualizing processed Hindi texts. Language processing, knowledge base construction, and scene generation are its three main computational foundations. Moreover, for the Arabic language, [14] put forth a comprehensive mobile-based system that generates illustrations for Arabic narratives automatically. The suggested method is specifically designed for utilization on mobile devices, with the aim of instructing Arab children in an engaging and non-traditional manner. Also, using a technique called conceptual graph matching, Illustrate It! [15] is a multimedia mobile learning solution for the Arabic language.

C. NEW METHODS FOR TEXT-TO-IMAGE SYNTHESIS
In recent years, scientists have sought a solution to the issue of artificially generating objects. Numerous strategies and technologies have been developed to aid in the generation of new content in various domains, including text, images, audio, etc. Using deep learning approaches, generative models were developed to solve this challenge. The term "generative modeling" describes the process of making fake instances from a dataset that share properties with the original set. The use of generative models makes it possible for machine learning to function with multi-modal outputs [16]. This section demonstrates the four major types of generative models: GANs, variational autoencoders, flow-based models, and diffusion models.

1) GENERATIVE ADVERSARIAL NETWORKS
In 2014, Goodfellow et al. [2] introduced GANs, one of the well-known generative models.


From that point forward, several additional models based on the concept of GANs were developed to address the previous shortcomings. GANs can be used in many different contexts, such as to make images of people's faces, to make realistic photos, to make cartoon characters, to age people's faces, to increase resolution, to translate between images and words, and so on [4]. GANs consist of two major sub-models: the generator and the discriminator. The generator is in charge of making new fake images by taking a noise vector as an input and putting out an image as an output. On the other hand, the discriminator's job is to tell the difference between real and fake images after being trained with real data. In other words, it serves as a classification network that is capable of classifying images by returning 0 for fake and 1 for real. Therefore, the generator's goal is to create convincing fakes in order to trick the discriminator, while the discriminator's goal is to recognize the difference [1]. Training improves both the discriminator's ability to distinguish between real or fake images, and the generator's ability to produce realistic-looking images. When the discriminator can no longer tell genuine images from fraudulent ones, equilibrium has been reached.
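To make the generator-discriminator interplay described above concrete, the following is a minimal, illustrative PyTorch sketch of one adversarial training step. The tiny fully connected networks, the 64-dimensional noise vector, and the train_step helper are arbitrary assumptions for demonstration, not the setup of any particular paper discussed in this survey.

```python
# Minimal sketch of one adversarial training step (illustrative sizes only).
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())   # generator: noise -> image
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))       # discriminator: image -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_images):                       # real_images: (batch, 784)
    batch = real_images.size(0)
    noise = torch.randn(batch, 64)

    # 1) Discriminator: label real images 1 and generated images 0.
    fake_images = G(noise).detach()
    loss_d = bce(D(real_images), torch.ones(batch, 1)) + \
             bce(D(fake_images), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Generator: try to make the discriminator output 1 for generated images.
    loss_g = bce(D(G(noise)), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```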

2) VARIATIONAL AUTOENCODER (VAE)
The utilization of a variational autoencoder (VAE) [17] provides a probabilistic framework for representing an observation inside a latent space. The input is subjected to encoding, which frequently involves compressing information into a latent space of reduced dimensionality. The primary objective of autoencoders is to effectively encode and represent the given data. The objective at hand involves the identification of a low-dimensional representation for a high-dimensional input, which facilitates the reconstruction of the original input while minimizing the loss of content.

3) FLOW-BASED GENERATIVE MODEL
Flow-based models are capable of learning distinct encoders and decoders. In a manner similar to the encoding phase observed in autoencoders, a transformation is applied to the data, with its parameters determined by a neural network [18]. Nevertheless, the decoder does not consist of a novel neural network that needs to autonomously acquire the decoding process; rather, it functions in direct opposition to its counterpart. In order to achieve the invertibility of a function "f" using neural networks, multiple strategies need to be employed.

4) DIFFUSION MODELS
As a subset of deep generative models, diffusion models have recently been recognized as the cutting edge. Diffusion models have lately demonstrated significant results that have been proven to surpass GAN models [19]. They have proven successful in a number of different areas, including the difficult task of image synthesis, where GANs had previously dominated. Recently, diffusion models have become a hot topic in computer vision due to their impressive generative capabilities. The field of generative modeling has found many uses for diffusion models so far, including image generation, super-resolution, inpainting, editing, and translation between images [20]. The principles of non-equilibrium thermodynamics provide the basis for diffusion models. Before learning to rebuild desirable data examples from the noise, they generate a Markov chain of diffusion steps to gradually inject noise into data [20]. In order to learn, the diffusion model has two phases: one for forward diffusion and the other for backward diffusion. In the forward diffusion phase, Gaussian noise is progressively added to the input data at each level [21]. In the second phase, called "reverse," the model is charged with reversing the diffusion process so that the original input data can be recovered.

The architectures of generative model types are shown in Figure 2.

FIGURE 2. Types of generative models, reproduced from Weng [22].
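As a small illustration of the two phases just described, the snippet below implements the closed-form forward (noising) step of a denoising diffusion probabilistic model in NumPy. The linear beta schedule, the 1,000 steps, and the toy input are illustrative assumptions rather than the schedule of any specific paper; the reverse phase would train a network to predict the added noise.

```python
# Illustrative forward-diffusion (noising) step with an assumed linear beta schedule.
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # per-step Gaussian noise variances (assumption)
alphas_bar = np.cumprod(1.0 - betas)    # cumulative signal-retention factor

def q_sample(x0, t, rng):
    """Closed form: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((3, 32, 32))          # a toy "image"
x_noisy, eps = q_sample(x0, t=500, rng=rng)    # the reverse phase learns to predict eps from x_noisy
```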

D. RELATED SURVEYS AND STUDY CONTRIBUTION
The state-of-the-art works on GAN-based approaches were examined by Tyagi and Yadav [23], Frolov et al. [1], Zhou et al. [24], and Tan et al. [25]. On the diffusion models side, multiple works [20], [21], [26] review the progress of diffusion models across all fields, while some articles explore particular areas in depth, including audio diffusion models [27], diffusion models for video generation [28], diffusion models in vision [29], and text-to-image diffusion models [30], providing a thorough overview of the diffusion model field, as well as an in-depth look at its applications, limitations, and promising future possibilities.
However, our work distinctively integrates the latest
advancements in both GANs and diffusion models, providing
a holistic view of the field. Unlike the other surveys, which
focus primarily on GANs, our review also delves into diffu-
sion models, a cutting-edge area in text-to-image synthesis.
Additionally, our paper systematically addresses various
research questions, covering a wide array of topics from
methods and datasets to evaluation metrics and challenges,
offering a broader scope than the previous surveys.
This study focuses on the new approaches to text-to-
image synthesis, particularly generative methods, and aims
to address five primary questions:
1. RQ1: Which are the existing methods employed, and
what are their applications?
2. RQ2: What datasets are commonly used for this
purpose?
3. RQ3: What evaluation metrics are used to assess the
results?
4. RQ4: What challenges and limitations are associated
with the state-of-the-art studies?
5. RQ5: What areas remain unexplored for future research?
II. DATASETS
Datasets play a crucial role in the development and evaluation of text-to-image generative models. In the realm of text-to-image generative models, the utilization of diverse datasets is vital for achieving accurate and realistic visual outputs. This section will explore the various datasets frequently utilized in this research area. The most frequently used datasets by text-to-image synthesis models are:

A. MS COCO
Reference [31], known as the Microsoft Common Objects in Context, is a comprehensive compilation of images that is widely employed for the purpose of object detection and segmentation. The dataset comprises a collection of more than 330,000 images, with each image being accompanied by annotations for 80 object categories and 5 captions that provide descriptive information about the depicted scene. The COCO dataset is extensively utilized in the field of computer vision research and has been employed for the purposes of training and evaluating numerous cutting-edge models for object identification and segmentation.

B. CUB-200-2011
Caltech-UCSD Birds-200-2011 [32] is a popular dataset for fine-grained visual categorization. This dataset comprises 11,788 bird images from 200 subcategories. Images are divided into 5,994 training and 5,794 testing sets. Each image in the dataset has a subcategory, part location, binary attribute, and bounding box labels. Natural language descriptions supplemented these annotations to improve the CUB-200-2011 dataset. Each image received ten single-sentence descriptions.

FIGURE 3. Sample images and their captions of common text-to-image datasets. Figure reproduced from Frolov et al. [1].

C. OXFORD 102 FLOWER
Reference [33] comprises a collection of 102 distinct categories of flowers, which can be effectively employed for image classification. The selected flowers were indigenous to the United Kingdom. The number of photos in each class ranges from 40 to 258. The images demonstrate significant variations in terms of size, pose, and lighting conditions. There exist categories that exhibit significant variations within their respective boundaries, as well as numerous categories that have notable similarities.

Figure 3 shows samples of images along with their captions from the MS COCO, Oxford 102 Flower, and CUB-200-2011 datasets.

D. MULTI-MODAL CELEBA-HQ
A large-scale face image collection, Multi-Modal-CelebA-HQ [34] contains 30,000 high-resolution facial images hand-picked from the CelebA dataset by following CelebA-HQ [35]. Transparent images, sketches, descriptive text, and high-quality segmentation masks accompany each image. Algorithms for face generation and editing, text-guided picture manipulation, sketch-to-image production, and more can all benefit from being trained and tested on the data available in Multi-Modal-CelebA-HQ.


E. CELEBA-DIALOG
Another enormous visual-language face dataset with detailed labeling [36], CelebA-Dialog divides a single feature into a range of degrees that all belong to the same semantic meaning. The dataset has over 200,000 images, encompassing 10,000 distinct identities. Each image is accompanied by five detailed attributes, providing fine-grained information.

F. DEEPFASHION
Reference [37] serves as a valuable resource for training and evaluating numerous image synthesis models. It encompasses a comprehensive collection of annotations, including textual descriptions and fine-grained labels, across multiple modalities. The dataset comprises a collection of eight hundred thousand fashion images that exhibit a wide range of diversity, encompassing various accessories and positions.

G. IMAGENET
To test algorithms designed to save, retrieve, or analyze multimedia data, researchers have created a massive database called ImageNet [38], which contains high-quality images that have been manually annotated. There are more than 14 million images in the ImageNet database, all of which have been annotated using the WordNet classification system. Since 2010, the dataset has been applied as a standard for object recognition and image classification in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

H. OPENIMAGES
Reference [39] consists of around 9 million images that have been annotated with various types of data, including object bounding boxes, image-level labels, object segmentation masks, localized narratives, and visual relationships. The training dataset of version 7 has 1.9 million images and 16 million bounding boxes representing 600 different item classes, rendering it the most extensive dataset currently available with annotations for object location.

I. CC12M
Conceptual 12M [40] is one of the datasets utilized by OpenAI's DALL-E 2 for training, and it consists of 12 million text-image pairs. The dataset, built from the original CC3M dataset of 3 million text-image pairs, was used for a wide range of pre-training and end-to-end training of images.

J. LAION-5B
One of the largest publicly available image-text datasets is Large-scale AI Open Network (LAION) [41]. More than five billion text-image pairs make up LAION-5B, an AI training dataset that is 14 times larger than its predecessor, LAION-400M.

Table 1 provides a comprehensive comparison of the commonly used datasets used in computer vision and multimodal research. Each dataset is evaluated based on key attributes including domain, common task, number of images, captions per image, training and testing split, and the number of object categories.
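Because MS COCO pairs each image with five captions, a common first step when assembling a text-to-image training set is simply to read those caption annotations. The sketch below uses the pycocotools API under the assumption that the standard captions_train2017.json annotation file has already been downloaded to the local path shown; the path and the printed fields are illustrative choices.

```python
# Hedged sketch: reading MS COCO caption annotations with pycocotools.
from pycocotools.coco import COCO

coco_caps = COCO("annotations/captions_train2017.json")   # assumed local annotation path

img_ids = coco_caps.getImgIds()
first_img = coco_caps.loadImgs(img_ids[0])[0]              # metadata dict: file_name, height, width, ...
ann_ids = coco_caps.getAnnIds(imgIds=first_img["id"])
captions = [ann["caption"] for ann in coco_caps.loadAnns(ann_ids)]

print(first_img["file_name"])
for c in captions:                                         # typically five captions per image
    print("-", c)
```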

TABLE 1. Overview of commonly used datasets for text-to-image synthesis.

III. TEXT-TO-IMAGE GENERATION METHODS
This section provides an overview of relevant studies on text-to-image generative models. Due to the diversity of the generative models and the vast amount of associated literature, this study narrows its focus to the two cutting-edge types of deep learning generative models: GANs and diffusion models.

A. TEXT-TO-IMAGE GENERATION USING GANS
Since its introduction in 2014, GAN-based text-to-image synthesis has been the subject of numerous studies, leading to significant advancements in the field. Reed et al. [42], working upon the foundation laid by deep convolutional GANs [43], were the first to investigate the GAN-based text-to-image synthesis technique.

Earlier models could create images based on universal constraints like a class label or caption, but not pose or location. Therefore, the Generative Adversarial What-Where Network (GAWWN) [44] was proposed, which is a network that generates images based on directions about what to draw and where to draw it. It demonstrates the ability to generate images based on free-form text descriptions and the precise location of objects. GAWWN enables precise location management through the use of a bounding box or a collection of key points.

Stacked Generative Adversarial Networks (StackGAN) [45] established a two-stage conditioning augmentation approach to boost the diversity of synthesized images and stabilize conditional-GAN training. Using the provided text description as input, the Stage-I GAN generates low-resolution images of the initial shape and colors of the object. High-resolution (e.g., 256 × 256) images with photorealistic features are generated by the Stage-II GAN using the results from Stage-I and the descriptive text.

However, an improvement to this model was made, leading to StackGAN++ [46]. The second version of StackGAN uses generators and discriminators organized in a tree-like structure to produce images at multiple scales that fit the same scene. StackGAN++ has a more reliable training behavior by approximating multiple distributions.

For even more accurate text-to-image production, the Attentional Generative Adversarial Network (AttnGAN) [47] permits attention-driven, multi-stage refining. By focusing on important natural language terms, AttnGAN's attentional generating network allows it to synthesize fine-grained image features.

To rebuild textual descriptions from the generated images, MirrorGAN [48] presents a text-to-image-to-text architecture with three models. To guarantee global semantic coherence between textual descriptions and the corresponding produced images, it additionally suggests word sentence average embedding.

Figure 4 shows the architectures of StackGAN, StackGAN++, AttnGAN, and MirrorGAN.

FIGURE 4. The architectures of (a) StackGAN, (b) StackGAN++, (c) AttnGAN, and (d) MirrorGAN, reproduced from Tan et al. [25].

In the field of story visualization, a story-to-image-sequence generative model, StoryGAN [49], was proposed using the sequential conditional GAN framework. To improve the image resolution and uniformity of the generated sequences, it employs two discriminators, one at the story level and one at the image level, as well as a deep context encoder that dynamically tracks the story flow.

Furthermore, a multi-conditional GAN (MC-GAN) [50] coordinates both the object and the context. The main portion of MC-GAN is a synthesis block that separates object and background information during training. This block helps MC-GAN to construct a realistic object image with the appropriate background by altering the proportion of background and foreground information.

The Dynamic Memory Generative Adversarial Network (DM-GAN) [51] employs a dynamic memory module to enhance the ambiguous image contents in cases where the initial images are generated inadequately. The method can accurately generate images from the text description since a memory writing gate is created to pick the relevant text details based on the content of the initial image. In addition, a response gate is used to adaptively combine the data retrieved from the memories with the attributes of the images.

ManiGAN [52] semantically edits an image to match a provided text describing desirable attributes such as color, texture, and background, while keeping irrelevant content. ManiGAN has two major parts. The first part links visual regions with meaningful phrases for effective manipulation. The second part corrects mismatched properties and completes missing image content.

Without relying on any sort of entanglements between many generators, DeepFusion Generative Adversarial Networks (DF-GAN) [53] may produce high-resolution images directly by a single generator and discriminator. Moreover, DF-GAN's Deep text-image Fusion Block (DFBlock) allows for a more thorough and efficient fusion of text and picture information.

Tedi-GAN [34] combines text-guided image production and modification into one framework for high accessibility, variety, accuracy, and stability in facial image generation and manipulation. It can synthesize high-quality images using multi-modal GAN inversion and a huge multi-modal dataset.


Although there have been many studies on text-to-image generation in English, very few have been applied to other languages. In [54], the use of AttnGAN was proposed for generating fine-grained images based on descriptions in Bangla text. It is capable of integrating the most exact details in various subregions of the image, with a specific emphasis on the pertinent terms mentioned in the natural language description.

Furthermore, [55] uses language translation models to extend established English text-to-image generation approaches to Hindi text-to-image synthesis. Input Hindi sentences were translated to English by a transformer-based Neural Machine Translation module, whose output was supplied to a GAN-based Image Generation Module.

On the other hand, the CJE-TIG [56] cross-lingual text-to-image pre-training technique removes barriers to using GAN-based text-to-image synthesis models for any given input language. This method alters text-to-image training patterns that are linguistically specific. It uses a bilingual joint encoder in place of a text encoder, applies a discriminator to optimize the encoder, and uses novel generative models to generate content.

The difficulties of visualizing the text of a story with several characters and exemplary semantic relationships were considered in [57]. Two cutting-edge GAN-based image generation models served as inspiration for the researchers' innovative two-stage model architecture for creating images. Stage-I of the image generating process makes use of a scene graph image generation framework; Stage-II refines the output image using a StackGAN based on the object layout module and the initial output image. Extensive examination and qualitative results showed that their method could produce a high-quality graphic accurately depicting the text's key concepts.

Short Arabic stories, complete with images that capture the essence of the story and its setting, were offered using a novel approach in [58]. To lessen the need for human input, a text generation method was used in combination with a text-to-image synthesis network. Arabic stories with specialized vocabulary and images were also compiled into a corpus. Applying the approach to the generation of text-image content using various generative models yielded results that proved its value. The method has the potential for use in the classroom to facilitate the development of subject-specific narratives by educators.

A model for generating 256 × 256 realistic graphics from Arabic text descriptions was proposed in [59]. In order to generate high-quality images, a unique attention network was trained and evaluated in many stages for the proposed model. A deep multimodal similarity model for calculating the loss of matching fine-grained picture text for training the model generator was proposed. The proposed approach set a new standard for converting Arabic text to photorealistic images. On the Caltech-UCSD Birds-200-2011 (CUB) dataset, the newly proposed model produced an inception score of 3.42 ± 0.05.

Moreover, [60] proposed a robust architecture designed to produce high-resolution realistic images that match a text description written in Arabic. The authors adjusted the shape of the input data to DF-GAN by decreasing the size of the sentence vectors generated by AraBERT. Subsequently, they combined DF-GAN with AraBERT by feeding the sentence embedding vector into the generator and discriminator of DF-GAN. When compared to StackGAN++, their method produced impressive results: on the CUB dataset, it got an FID score of 55.96 and an IS score of 3.51, and on the Oxford-102 dataset, it got an FID score of 59.45 and an IS score of 3.06.

To improve upon their prior work in [60], the authors presented two additional techniques [61]. To get over the out-of-vocabulary problem, they tried a first technique that involved combining a sample text transformer with the generator and discriminator of DF-GAN. In the second method, the text transformer and training were carried over, and a learning mask predictor was integrated into the architecture to make predictions about masks, which are then utilized as parameters in affine transformations to provide a more seamless fusion between the image and the text. To further improve training stability, the DAMSM loss function was used to train the architecture. The findings proved that the latter technique was superior. Figure 5 shows samples on the CUB dataset, generated by DM-GAN, AttnGAN, StackGAN, and GAN-INT-CLS.

This study [62] outlines the use of transformer-based models (BERT, GPT-2, and T5) for text-to-image generation, an under-explored area in computer vision and NLP. It proposes specific architectures to adapt these models for creating images from text descriptions. The study, evaluating the models on challenging datasets, finds that T5 is particularly effective in generating images that are both visually appealing and semantically accurate.

Kang et al. [63] presented a groundbreaking approach to scaling up GANs for text-to-image synthesis. By introducing GigaGAN, a new GAN architecture, the study showcases the ability to generate high-resolution, high-quality images efficiently. GigaGAN demonstrates superior performance in terms of speed and image quality, marking a significant advancement in the use of GANs for large-scale, complex image synthesis tasks.

SWF-GAN, a new model introduced in [64], enhances image synthesis from textual descriptions. It uniquely uses a sentence-word fusion module and a weakly supervised mask predictor for detailed semantic mapping and accurate structure generation. The model effectively creates clear and vivid images with lower computational load, significantly outperforming baseline models in IS and FID scores.

GALIP [65] introduces a novel GAN architecture for text-to-image synthesis. This model integrates transformer-based text encoders and an advanced generator, resulting in high-quality, text-aligned image generation. The model excels in creating images from complex text descriptions, emphasizing the potential of GANs in the realm of text-to-image synthesis.


FIGURE 5. Random image samples on the CUB dataset, generated by DM-GAN, AttnGAN, StackGAN, and GAN-INT-CLS. Source: [1].

B. TEXT-TO-IMAGE GENERATION USING DIFFUSION MODELS
Unlike GAN-based approaches, which primarily work with small-scale data, autoregressive methods use large-scale data to generate text-to-image conversions, such as DALL-E [66] from OpenAI and Parti [67] from Google. Nevertheless, these approaches have significant computation costs and sequential error buildup due to their autoregressive nature [66], [67], [68], [69]. Conversely, diffusion models are highly popular for all sorts of generating applications.

To create images from text, the study [70] introduced the vector quantized diffusion (VQ-Diffusion) model. Vector quantized variational autoencoders (VQ-VAEs) form the basis of this technique, with the latent space being modeled using a conditional variant of the Denoising Diffusion Probabilistic Model (DDPM).

Using a natural language description with an ROI mask, the Blended Diffusion approach was provided in [71] for making local (region-based) adjustments to real images. The authors were successful in their mission by employing a pretrained language-image model (CLIP) to guide the modification in the direction of a given text prompt and combining it with a DDPM to generate results that looked natural.

CLIP-Forge [72] was proposed as a solution to the widespread absence of coupled text and shape data. Utilizing a two-step training approach, CLIP-Forge requires only a pre-trained image-text network like CLIP, as well as an unlabeled shape dataset. One of the advantages of this approach is that it can produce various shapes for a given text without resorting to costly inference time optimization.

In [73], the authors investigate CLIP guidance and classifier-free guidance as two separate guiding methodologies for the problem of text-conditional image synthesis. Their proposed model, GLIDE, which stands for Guided Language to Image Diffusion for Generation and Editing, was shown to be the most liked by humans in terms of caption similarity and photorealism. It also often made examples that were very photorealistic.

Specifically for conditional image synthesis, M6-UFC was presented in [74] as a universal form for unifying several multi-modal controls. To quicken inference, boost global consistency, and back up preservation controls, the authors turned to non-autoregressive generation. In addition, they developed a progressive generation process using relevance and fidelity estimators to guarantee accuracy.

Using the language-image priors retrieved from a pretrained CLIP model, this study [75] proposes a self-supervised approach called CLIP-GEN for automatic text-to-image synthesis. Here, a text-to-image generator can be taught to work with just a group of images from the broad domain that don't have labels. This will help to prevent the need to collect vast amounts of matched text-image data, which is too costly to gather.

Imagen, a method for text-to-image synthesis presented in [76], uses a single encoder for the text sequence and a set of diffusion models to generate high-resolution images. The text embeddings provided by the encoder are also a prerequisite for these models. As an added bonus, the authors presented a brand new caption set (DrawBench) for testing text-to-image conversion. The authors created Efficient U-Net, an efficient network architecture, and used it in their text-to-image generation experiments to test its efficacy. Figure 6 represents a simple visualisation of the Imagen architecture.

FIGURE 6. Overview of Imagen, reproduced from Saharia et al. [76].
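Classifier-free guidance, mentioned above in the context of GLIDE, combines a conditional and an unconditional noise prediction at every sampling step. The fragment below is only a schematic of that combination: the model call, its arguments, the null embedding, and the guidance scale are illustrative placeholders, not the API of any particular implementation.

```python
# Schematic of classifier-free guidance: the denoiser is queried twice per step and
# the two predictions are extrapolated. "model", "null_emb", and the scale are placeholders.
def guided_eps(model, x_t, t, text_emb, null_emb, guidance_scale=3.0):
    eps_cond = model(x_t, t, text_emb)      # noise prediction conditioned on the prompt
    eps_uncond = model(x_t, t, null_emb)    # unconditional noise prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```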


TABLE 2. Diffusion models-based related studies.

ERNIE-ViLG 2.0 [77] is a large-scale Chinese text-to-image diffusion model that uses fine-grained textual and visual information about important parts of the scene along with different denoising specialists at different denoising stages to improve the quality of the output images.

On the other hand, eDiff-I [78] outperforms other large-scale text-to-image diffusion models by improving text alignment while keeping inference computation cost and visual quality stable. Unlike traditional diffusion models, which rely on a single model trained to denoise the entire noise distribution, eDiff-I is instead trained on an ensemble of expert denoisers, each of which is tailored to denoising at a distinct stage of generation. The researchers claim that employing such specialized denoisers enhances the quality of synthesized output.

Frido [82] is an image-synthesizing Feature Pyramid Diffusion model that conducts multiscale coarse-to-fine denoising. To construct an output image, it first decomposes the input into vector quantized scale-dependent components. The previously mentioned stage of learning multi-scale representations can also take advantage of input conditions such as language, scene graphs, and image layout. Frido can thus be utilized for both traditional and cross-modal image synthesis.

A new method called DreamBooth was suggested in [83] as a way to tailor the results of text-to-image generation from diffusion models to the needs of users. The authors fine-tuned a pretrained text-to-image model so that it is able to associate a distinctive identifier with a subject given only a small number of images of that subject as input. Following the subject's incorporation into the model's output domain, the identifier can be used to generate completely brand-new photorealistic pictures of the subject in a variety of settings.

Furthermore, Imagic [84] shows how a single real image can be subjected to sophisticated text-guided semantic edits. While maintaining the image's original qualities, Imagic can alter the position and composition of one or more objects within it. It works on raw images without the need for image masks or any other preprocessing.

Likewise, UniTune [85] is capable of editing images with a high degree of semantic and visual fidelity to the original, given a random image and a textual edit description as input. It can be considered an art-direction tool that only requires text as input rather than more complex requirements such as masks or drawings.

DiVAE, a VQ-VAE architecture model that employs a diffusion decoder as the reconstructing component in image synthesis, was proposed by Shi et al. in [81]. They investigated how to incorporate image embedding into the diffusion model for high performance and discovered that a minor adjustment to the U-Net used in diffusion could accomplish this.

Building upon the success of its predecessor [66], DALL-E 2 [80] was launched as a follow-up version with the intention of producing more realistic images at greater resolutions by combining concepts, features, and styles. The model consists of two parts: a prior that creates a CLIP image embedding from a caption and a decoder that creates an image based on the embedding. It was demonstrated that increasing image variety through the intentional generation of representations leads to only a slight decrease in photorealism and caption similarity.

FIGURE 7. Overview of DALL-E 2, reproduced from Ramesh et al. [80].

Figure 7 represents an overview of DALL-E 2, and Figure 8 shows samples of images generated by DALL-E 2 given a detailed text prompt.

Furthermore, the advanced model DALL-E 3 [86], which was recently released, represents a significant advancement over its predecessors. Leveraging advanced diffusion models, DALL-E 3 not only excels in maintaining fidelity to textual prompts but also underscores its ability to capture intricate details, marking a substantial advancement in the realm of generative models.

Stable Diffusion is another popular text-to-image tool that was introduced in 2022, based on a previous work [79]. Stable Diffusion employs a type of diffusion model known as the latent diffusion model (LDM). The VAE, U-Net, and an optional text encoder comprise Stable Diffusion. Compared to pixel-based diffusion models, LDMs dramatically reduced the requirement for processing while achieving a new state of the art in picture inpainting and highly competitive performance on a variety of applications like unconditional image creation and super-resolution. Figure 9 shows an overview of the architecture of Stable Diffusion.

Table 2 summarizes the studies that utilized diffusion models in text-to-image generation by year, model, and dataset.
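Because Stable Diffusion is openly available, the latent diffusion pipeline described above can be tried in a few lines with Hugging Face's diffusers library. The checkpoint identifier, device, sampling settings, and prompt below are example choices only, not recommendations from the surveyed papers.

```python
# Example usage of an open Stable Diffusion checkpoint via the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")                       # assumes a CUDA-capable GPU is available

prompt = "a watercolor painting of a lighthouse at sunrise"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("lighthouse.png")
```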


FIGURE 8. Samples generated by DALL-E 2 given the prompt: "a bowl of soup that is a portal to another dimension as digital art". Source: [80].

FIGURE 9. Overview of Stable Diffusion. Source: [79].

ParaDiffusion [87] is an innovative text-to-image generation model adept at transforming detailed, long-form text into corresponding images. It stands out due to its deep semantic understanding derived from large language models, enabling it to create images that are both visually appealing and closely aligned with complex textual descriptions. The model's training is enhanced by the ParaImage Dataset, which includes extensive image-text pairs. This approach marks a significant advancement in AI-driven media, particularly in generating intricate images from elaborate text descriptions.

UPainting is an approach that was presented in [88] for automatic painting generation using deep learning. The model captures the essence of famous painters and styles, enabling the creation of new artworks that reflect the characteristics of these styles. It is a blend of art and technology, offering a new way of creating art with AI's assistance.

CLIPAG [89] explores a unique approach to text-to-image generation without relying on traditional generative models. It leverages Perceptually Aligned Gradients (PAG) in robust vision-language models, specifically an enhanced version of CLIP, to generate images directly aligned with text descriptions. This method marks a shift in text-to-image synthesis, utilizing a more streamlined and efficient process compared to conventional methods.

GLIGEN was proposed in [90] as a new method for text-to-image generation, focusing on generating linguistically coherent and visually compelling images. It emphasizes the integration of natural language understanding and image synthesis, demonstrating impressive capabilities in creating images that accurately reflect complex textual inputs.

SnapFusion [91] introduces an efficient text-to-image diffusion model optimized for mobile devices, achieving image generation in under two seconds. It addresses the computational intensity and speed limitations of existing diffusion models through an innovative network architecture and improved step distillation. The proposed UNet efficiently synthesizes high-quality images, outperforming the baseline Stable Diffusion model in terms of FID and CLIP scores.

Zhang et al. [92] introduced a method to add conditional control to image generation models, allowing for more precise and tailored image creation. The approach improves the ability to generate images that meet specific criteria or conditions, enhancing the versatility and applicability of image-generation technologies.

Moreover, Zhao et al. [93] explored advancements in text-to-image diffusion models, focusing on enhancing their capabilities to produce more realistic and varied images. The study delves into new methods and techniques to improve these models, significantly advancing the field of T2I synthesis.

The researchers in [94] focused on adapting the English Stable Diffusion model for Chinese text-to-image synthesis. They introduced a novel method for transferring the model's capabilities to the Chinese language, resulting in high-quality image generation from Chinese text prompts, significantly reducing the need for extensive training data.

AltDiffusion [95] presents a multilingual text-to-image diffusion model supporting eighteen languages, addressing the limitations of existing models that cater primarily to English. The paper details the development and effectiveness of this model in generating culturally relevant and accurate images across various languages, showcasing its potential for global use in T2I tasks.

Random image samples on the MS-COCO dataset are represented in Figure 10, generated by DALL-E, GLIDE, and DALL-E 2.

IV. EVALUATION METRICS
The majority of current metrics evaluate a model's quality by considering two main factors: the quality of the images it produces and the alignment between text and images. Fréchet Inception Distance (FID) [96] and Inception Score (IS) [97] are commonly used metrics for appraising the image quality of a model. These metrics were initially developed for traditional GAN tasks focused on assessing image quality. To evaluate text-image alignment, the R-precision [47] metric is widely employed.


FIGURE 10. Random image samples on MS-COCO, generated by DALL-E, GLIDE, and DALL-E 2. Source: [80].

For more in-depth details, we refer to [98]. Moreover, the CLIP Score [99] is used in evaluating common sense and mentioned objects, while Human Evaluation offers a comprehensive insight into multiple aspects of image generation. In the following, a detailed description of each metric is provided.

A. THE FRECHET INCEPTION DISTANCE (FID) [96]
Using the feature space of a pre-trained Inception v3 network, FID [96] determines the Fréchet distance between the natural and artificial distributions. It is computed as:

F(r, g) = ||µ_r − µ_g||² + trace(Σ_r + Σ_g − 2(Σ_r Σ_g)^(1/2))   (1)

where r and g represent, respectively, the image's real and generated features; µ_r, µ_g and Σ_r, Σ_g are the means and covariances of the real and produced features, correspondingly. The lower FID score is considered to be the more appropriate score. It describes the level of realism, accuracy, and variety in the generated distributions. Table 3 represents a comparison of FID scores obtained by GANs and diffusion models on the MS-COCO dataset and shows that diffusion models made remarkable results.

B. THE INCEPTION SCORE
Reference [97], which ignores the underlying distribution, measures the produced distribution's faithfulness and diversity. The following is the IS equation:

IS = exp(E_x D_KL(p(y|x) ∥ p(y)))   (2)

IS calculates the difference between the marginal distribution p(y) and the conditional distribution p(y|x) using the Kullback-Leibler (KL) divergence. The label y of the generated image x is predicted using a pre-trained Inception v3 network. Unlike FID, a higher IS is preferable. It implies high-quality images accurately categorized by class.
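To make Equations (1) and (2) concrete, the following NumPy/SciPy sketch computes both scores from pre-extracted Inception v3 outputs. Feature extraction itself is assumed to have been done elsewhere, and the small numerical safeguards (the eps term, discarding the imaginary residue of the matrix square root) are practical conventions rather than part of the definitions.

```python
# Illustrative computation of FID (Eq. 1) and IS (Eq. 2) from pre-extracted outputs.
import numpy as np
from scipy import linalg

def fid(real_feats, gen_feats):
    """Frechet Inception Distance between two sets of feature vectors of shape (N, D)."""
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(sigma_r @ sigma_g).real     # matrix square root; drop imaginary residue
    diff = mu_r - mu_g
    return diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)

def inception_score(probs, eps=1e-12):
    """IS from predicted class probabilities p(y|x) of shape (N, num_classes)."""
    p_y = probs.mean(0, keepdims=True)                 # marginal distribution p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(1)
    return float(np.exp(kl.mean()))
```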


TABLE 3. FID scores of GANs and diffusion models on the MS-COCO dataset.

C. THE R-PRECISION (RP)
The RP metric [47] is widely employed for assessing the consistency between text and images. RP operates on the principle of employing a generated image query based on the provided caption. Specifically, given an authentic text description and 99 randomly selected mismatched captions, an image is produced from the authentic caption. This resulting image is then utilized to query the original description from a pool of 100 candidate captions. The retrieval is deemed successful if the similarity score between it and the authentic caption is the highest. The matching score is determined using the cosine similarity between the encoding vectors of the image and the caption. A higher RP score indicates better quality, with RP being the proportion of successful retrievals.

D. CLIP SCORE
The CLIP model [99], developed by OpenAI, demonstrates the ability to evaluate the semantic similarity between a given text caption and an accompanying image. Based on this rationale, the CLIP score can serve as a quantitative measure and is formally defined as:

E[s(f(image) · g(caption))]   (3)

where the mathematical expectation is computed over the set of created images in a batch, and s represents the logarithmic scale of the CLIP logit [73]. A higher CLIP score suggests a stronger semantic relationship between the image and the text, while a lower score shows less of a connection.

E. HUMAN EVALUATIONS
Some studies used human evaluation as a qualitative measure to assess and evaluate the quality of the results. The reporting of metrics based on human evaluation is motivated by the fact that many possible applications of the models are centered upon tricking the human observer [100]. Typically, a collection of images is provided to an individual, who is tasked with evaluating their quality in terms of photorealism and alignment with associated captions.

Frolov et al. [1] proposed a set of different criteria for comparing evaluation metrics. The following is an explanation of these criteria.
• Image Quality and Diversity: The degree to which the generated image looks realistic or similar to the reference image, and the ability of the model to produce varied images based on the same text prompt.
• Text Relevance: How well the generated image corresponds to the given text prompt.
• Mentioned Objects and Object Fidelity: Whether the model correctly identifies and includes the objects mentioned in the text, and how accurately the objects in the generated image match their real-world counterparts.
• Numerical and Positional Alignment: The accuracy of any quantitative details and the positional arrangement of objects in the generated image in relation to the provided text.
• Common Sense: The presence of logical and expected elements in the generated image.
• Paraphrase Robustness: The model remains unaffected by minor modifications in the input description, such as word substitutions or rephrasings.
• Explainable: The ability to provide a clear explanation of why an image is not aligned with the input.
• Automatic: Whether the metric can be calculated automatically without human intervention.

Based on these key criteria, we provide in Table 4 a comparative analysis of the commonly used text-to-image evaluation metrics based on their performance. It is important to note that the table presented offers a simplified overview. In practice, choosing the right metric depends on the specific goals and context of the text-to-image generation task. Additionally, the effectiveness of these metrics may vary depending on the specific model and dataset used.
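As a rough illustration of the CLIP score and R-precision protocols described above, the sketch below uses the CLIP encoders from Hugging Face transformers to embed images and captions, then scores cosine similarity and ranked retrieval. The checkpoint name, the simplified one-image-vs-100-captions loop, and the omission of the logit scale s from Equation (3) are assumptions made for brevity; real evaluations batch this over many samples.

```python
# Hedged sketch of CLIP-based text-image alignment scoring (cosine similarity only;
# Eq. (3) additionally applies the CLIP logit scale s, omitted here for simplicity).
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images, texts):
    inputs = processor(text=texts, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    return (torch.nn.functional.normalize(img, dim=-1),
            torch.nn.functional.normalize(txt, dim=-1))

def clip_score(images, captions):
    img, txt = embed(images, captions)
    return (img * txt).sum(-1).mean().item()        # mean cosine similarity over the batch

def r_precision_hit(gen_image, true_caption, mismatched_captions):
    """1 if the true caption ranks first among the 100 candidates, else 0."""
    img, txt = embed([gen_image], [true_caption] + list(mismatched_captions))
    sims = (txt @ img.T).squeeze(-1)                # similarity of every caption to the image
    return int(torch.argmax(sims).item() == 0)
```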


TABLE 4. Overview of commonly used evaluation metrics for text-to-image synthesis, adapted from Frolov et al. [1].

the language. Some languages, however, require more effort sparked significant interest and discussion in the scientific
which needs to be addressed. For instance, Arabic, in contrast community. The field shows a high degree of fertility and
to English, has more complicated morphological features renewability, as seen by the recent publication of numerous
and fewer semantic and linguistic resources [5]. This is a relevant studies and an ongoing flow of new papers within a
main challenge that needs to be dealt with in text-to-image relatively short timeframe.
generation. By making generative models open-source, researchers
and developers can collaborate more effectively, which will
C. COMPUTATIONAL COMPLEXITY in turn boost innovation in the field. Researchers may utilize
The computational complexity of diffusion models poses a these publicly available models to investigate novel uses,
notable difficulty. The process of training a diffusion model enhance current AI models, and move the field forward
involves multiple iterative processes which can impose a rapidly.
significant computational burden. Therefore, The model’s To overcome the language barrier, some studies proposed
scalability may be constrained by the increased complexity multilingual [95] and cross-lingual [56] models to support
observed when working with larger datasets and higher- multiple languages within the same model. The goal of these
resolution images. Moreover, for further research in the multilingual models is to break down linguistic barriers by
field of text-to-image generative models, and despite the providing a common groundwork for the comprehension
availability of big datasets like LION-5B to the general and processing of several languages at once. This method
public, the utilization of such datasets remains challenging has the ability to dramatically improve linguistic diversity
for individuals due to the substantial hardware requirements in communication and open up access to information for
involved. everyone.
D. ETHICAL CONSIDERATIONS
It is important to consider the potential ethical issues that arise with the use of text-to-image generative models. One significant concern is the potential for misuse. With the ability to generate realistic images from text descriptions, there is a risk that these models could be used to create deceptive or misleading content, with serious consequences in areas such as fake news, fraud, or even harassment.
Another issue is the bias that can be embedded in the generated images. If the training data used to develop these models is not diverse and representative, the generated images may reflect prejudices or stereotypes present in the data.
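Such biases can be probed empirically before deployment. The sketch below outlines a minimal audit loop that generates many images for a demographically neutral prompt and tallies a sensitive attribute; generate_image and predict_attribute are hypothetical stand-ins (reduced here to runnable stubs) rather than components of any system covered in this survey.

```python
import random
from collections import Counter

def generate_image(prompt: str):
    """Hypothetical text-to-image call; returns a stand-in for an image."""
    return {"prompt": prompt}

def predict_attribute(image) -> str:
    """Hypothetical attribute classifier; a real audit would use a trained one."""
    return random.choice(["group_a", "group_b"])

def audit_prompt(prompt: str, n_samples: int = 200) -> dict:
    """Tally a sensitive attribute over many generations of a neutral prompt."""
    counts = Counter(predict_attribute(generate_image(prompt))
                     for _ in range(n_samples))
    return {group: count / n_samples for group, count in counts.items()}

# A strongly skewed distribution for a neutral prompt such as
# "a photo of a doctor" suggests the generator has absorbed a stereotype
# from its training data.
print(audit_prompt("a photo of a doctor"))
```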
VI. FUTURE DIRECTIONS
The domain of text-to-image generation is advancing continuously. The recent emergence of novel generative diffusion models, including DALL-E, Midjourney, Stable Diffusion, and others, has sparked significant interest and discussion in the scientific community. The field is highly active and continually renewing itself, as seen in the numerous relevant studies published within a relatively short timeframe and the ongoing flow of new papers.
By making generative models open source, researchers and developers can collaborate more effectively, which will in turn boost innovation in the field. Researchers may use these publicly available models to investigate novel applications, enhance current AI models, and move the field forward rapidly.
To overcome the language barrier, some studies have proposed multilingual [95] and cross-lingual [56] models that support multiple languages within the same model. The goal of these multilingual models is to break down linguistic barriers by providing a common groundwork for comprehending and processing several languages at once. This approach could dramatically improve linguistic diversity in communication and open up access to information for everyone.
Moreover, to make these technologies more widely accessible and sustainable, it will be essential to improve resource efficiency and reduce computational complexity by creating models that produce high-quality images with fewer computational resources.
Nevertheless, greater research into ethical and bias considerations is required. Ensuring fairness, removing bias, and following ethical guidelines remain critical considerations for any AI system. Possible directions for future work include developing models with greater awareness of and sensitivity to these factors.
Text-to-image generation has a wide range of applications across several domains, including but not limited to education, product design, and marketing. The technology enables the creation of visual materials, such as illustrations and infographics, that seamlessly integrate text and images. There are early expectations about which industries will be affected by the growing area of image generation; any sector that relies on visual art, such as graphic design, filmmaking, or photography, is likely to be impacted [100].


VII. CONCLUSION
The field of text-to-image synthesis has made significant progress in recent years. The development of GANs and diffusion models has paved the way for more advanced and realistic image generation from textual descriptions. These models have demonstrated an outstanding ability to generate high-quality images across a wide range of domains and datasets. This study offers a comprehensive review of the existing literature on text-to-image generative models, summarizing the historical development, popular datasets, key methods, commonly used evaluation metrics, and challenges faced in this field. Despite these challenges, the potential of text-to-image generation in expanding creative horizons and enhancing AI systems is undeniable. The ability to generate realistic and diverse images from textual inputs opens up new possibilities in various fields, including art, design, advertising, and others. Therefore, researchers and practitioners should continue to explore and refine text-to-image generative models.

ACKNOWLEDGMENT
The authors would like to thank the Deanship of Scientific Research, Qassim University, for funding the publication of this project.

REFERENCES
[1] S. Frolov, T. Hinz, F. Raue, J. Hees, and A. Dengel, "Adversarial text-to-image synthesis: A review," Neural Netw., vol. 144, pp. 187–209, Dec. 2021.
[2] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial networks," 2014, arXiv:1406.2661.
[3] J. Agnese, J. Herrera, H. Tao, and X. Zhu, "A survey and taxonomy of adversarial neural networks for text-to-image synthesis," WIREs Data Mining Knowl. Discovery, vol. 10, no. 4, Jul. 2020, Art. no. e1345.
[4] L. Jin, F. Tan, and S. Jiang, "Generative adversarial network technologies and applications in computer vision," Comput. Intell. Neurosci., vol. 2020, pp. 1–17, Aug. 2020.
[5] J. Zakraoui, M. Saleh, and J. A. Ja'am, "Text-to-picture tools, systems, and approaches: A survey," Multimedia Tools Appl., vol. 78, no. 16, pp. 22833–22859, Aug. 2019, doi: 10.1007/s11042-019-7541-4.
[6] D. Joshi, J. Z. Wang, and J. Li, "The story picturing engine—A system for automatic text illustration," ACM Trans. Multimedia Comput., Commun., Appl., vol. 2, no. 1, pp. 68–89, Feb. 2006, doi: 10.1145/1126004.1126008.
[7] X. Zhu, A. Goldberg, M. Eldawy, C. Dyer, and B. Strock, "A text-to-picture synthesis system for augmenting communication," in Proc. 22nd AAAI Conf. Artif. Intell., 2007, p. 1590.
[8] H. Li, J. Tang, G. Li, and T.-S. Chua, "Word2Image: Towards visual interpreting of words," in Proc. 16th ACM Int. Conf. Multimedia, 2008, pp. 813–816.
[9] B. Coyne and R. Sproat, "WordsEye: An automatic text-to-scene conversion system," in Proc. 28th Annu. Conf. Comput. Graph. Interact. Techn., Aug. 2001, pp. 487–496.
[10] M. E. Ma, "Confucius: An intelligent multimedia storytelling interpretation and presentation system," School Comput. Intell. Syst., Univ. Ulster, Coleraine, U.K., Tech. Rep., 2002.
[11] Y. Jiang, J. Liu, and H. Lu, "Chat with illustration," Multimedia Syst., vol. 22, no. 1, pp. 5–16, Feb. 2016, doi: 10.1007/s00530-014-0371-3.
[12] D. Ustalov, "A text-to-picture system for Russian language," in Proc. 6th Russian Young Scientists Conf. Inf. Retr., Aug. 2012, pp. 35–44.
[13] P. Jain, H. Darbari, and V. C. Bhavsar, "Vishit: A visualizer for Hindi text," in Proc. 4th Int. Conf. Commun. Syst. Netw. Technol., Apr. 2014, pp. 886–890.
[14] A. G. Karkar, J. M. Al Ja'am, S. Foufou, and A. Sleptchenko, "An e-learning mobile system to generate illustrations for Arabic text," in Proc. IEEE Global Eng. Educ. Conf., Apr. 2016, pp. 184–191.
[15] A. G. Karkar, J. M. Alja'am, and A. Mahmood, "Illustrate it! An Arabic multimedia text-to-picture m-learning system," IEEE Access, vol. 5, pp. 12777–12787, 2017.
[16] I. Goodfellow, "NIPS 2016 tutorial: Generative adversarial networks," 2017, arXiv:1701.00160.
[17] D. P. Kingma and M. Welling, "An introduction to variational autoencoders," Found. Trends Mach. Learn., vol. 12, no. 4, pp. 307–392, 2019.
[18] L. Weng. (2018). Flow-based Deep Generative Models. [Online]. Available: https://lilianweng.github.io/posts/2018-10-13-flow-models/
[19] P. Dhariwal and A. Nichol, "Diffusion models beat GANs on image synthesis," 2021, arXiv:2105.05233.
[20] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, "Diffusion models in vision: A survey," 2022, arXiv:2209.04747.
[21] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, Y. Shao, W. Zhang, B. Cui, and M.-H. Yang, "Diffusion models: A comprehensive survey of methods and applications," Comprehensive Surv. Methods Appl., vol. 1, p. 39, Sep. 2022.
[22] L. Weng. (2021). What Are Diffusion Models. [Online]. Available: https://lilianweng.github.io/posts/2021-07-11-diffusion-models/
[23] S. Tyagi and D. Yadav, "A comprehensive review on image synthesis with adversarial networks: Theory, literature, and applications," Arch. Comput. Methods Eng., vol. 29, no. 5, pp. 2685–2705, Aug. 2022.
[24] R. Zhou, C. Jiang, and Q. Xu, "A survey on generative adversarial network-based text-to-image synthesis," Neurocomputing, vol. 451, pp. 316–336, Sep. 2021.
[25] Y. X. Tan, C. P. Lee, M. Neo, K. M. Lim, J. Y. Lim, and A. Alqahtani, "Recent advances in text-to-image synthesis: Approaches, datasets and future research prospects," IEEE Access, vol. 11, pp. 88099–88115, 2023.
[26] H. Cao, C. Tan, Z. Gao, Y. Xu, G. Chen, P.-A. Heng, and S. Z. Li, "A survey on generative diffusion model," 2022, arXiv:2209.02646.
[27] C. Zhang, C. Zhang, S. Zheng, M. Zhang, M. Qamar, S.-H. Bae, and I. S. Kweon, "A survey on audio diffusion models: Text to speech synthesis and enhancement in generative AI," 2023, arXiv:2303.13336v2.
[28] R. Yang, P. Srivastava, and S. Mandt, "Diffusion probabilistic modeling for video generation," 2022, arXiv:2203.09481v5.
[29] A. Ulhaq, N. Akhtar, and G. Pogrebna, "Efficient diffusion models for vision: A survey," 2022, arXiv:2210.09292v2.
[30] C. Zhang, C. Zhang, M. Zhang, and I. So Kweon, "Text-to-image diffusion models in generative AI: A survey," 2023, arXiv:2303.07909v2.
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Computer Vision—ECCV 2014 (Lecture Notes in Computer Science). Cham, Switzerland: Springer, 2014, pp. 740–755.
[32] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, "Caltech-UCSD birds 200," California Inst. Technol., Tech. Rep. CNS-TR-2011-001, 2011.
[33] M.-E. Nilsback and A. Zisserman, "Automated flower classification over a large number of classes," in Proc. 6th Indian Conf. Comput. Vis., Graph. Image Process., Dec. 2008, pp. 722–729.
[34] W. Xia, Y. Yang, J.-H. Xue, and B. Wu, "TediGAN: Text-guided diverse face image generation and manipulation," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Dec. 2020, pp. 2256–2265.
[35] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," in Proc. 6th Int. Conf. Learn. Represent., Oct. 2018.
[36] Y. Jiang, Z. Huang, X. Pan, C. C. Loy, and Z. Liu, "Talk-to-edit: Fine-grained facial editing via dialog," 2021, arXiv:2109.04425.
[37] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, "DeepFashion: Powering robust clothes recognition and retrieval with rich annotations," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 1096–1104.
[38] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255. [Online]. Available: http://www.image-net.org
[39] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, T. Duerig, and V. Ferrari, "The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale," Int. J. Comput. Vis., vol. 128, no. 7, pp. 1956–1981, Jul. 2020.


[40] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, "Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 3557–3567.
[41] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, "LAION-5B: An open large-scale dataset for training next generation image-text models," in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022.
[42] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," 2016, arXiv:1605.05396.
[43] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," 2015, arXiv:1511.06434.
[44] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, "Learning what and where to draw," 2016, arXiv:1610.02454.
[45] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5908–5916.
[46] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, "StackGAN++: Realistic image synthesis with stacked generative adversarial networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 1947–1962, Aug. 2019. [Online]. Available: https://github.com/hanzhanggit/StackGAN-v2
[47] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1316–1324.
[48] T. Qiao, J. Zhang, D. Xu, and D. Tao, "MirrorGAN: Learning text-to-image generation by redescription," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1505–1514.
[49] Y. Li, Z. Gan, Y. Shen, J. Liu, Y. Cheng, Y. Wu, L. Carin, D. Carlson, and J. Gao, "StoryGAN: A sequential conditional GAN for story visualization," 2018, arXiv:1812.02784.
[50] H. Park, Y. Yoo, and N. Kwak, "MC-GAN: Multi-conditional generative adversarial network for image synthesis," in Proc. Brit. Mach. Vis. Conf., May 2018, p. 76.
[51] M. Zhu, P. Pan, W. Chen, and Y. Yang, "DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5795–5803.
[52] B. Li, X. Qi, T. Lukasiewicz, and P. H. S. Torr, "ManiGAN: Text-guided image manipulation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 7877–7886.
[53] M. Tao, H. Tang, F. Wu, X. Jing, B.-K. Bao, and C. Xu, "DF-GAN: A simple and effective baseline for text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 16494–16504.
[54] M. A. Haque Palash, M. A. Al Nasim, A. Dhali, and F. Afrin, "Fine-grained image generation from Bangla text description using attentional generative adversarial network," in Proc. IEEE Int. Conf. Robot., Autom., Artif.-Intell. Internet-of-Things (RAAICON), Dec. 2021, pp. 79–84. [Online]. Available: https://ieeexplore.ieee.org/document/9929536/
[55] A. S. Parihar, A. Kaushik, A. V. Choudhary, and A. K. Singh, "HTGAN: An architecture for Hindi text based image synthesis," in Proc. 5th Int. Conf. Comput., Commun. Signal Process. (ICCCSP), May 2021, pp. 273–279.
[56] H. Zhang, S. Yang, and H. Zhu, "CJE-TIG: Zero-shot cross-lingual text-to-image generation by corpora-based joint encoding," Knowl.-Based Syst., vol. 239, Mar. 2022, Art. no. 108006.
[57] J. Zakraoui, M. Saleh, S. Al-Maadeed, and J. M. Jaam, "Improving text-to-image generation with object layout guidance," Multimedia Tools Appl., vol. 80, no. 18, pp. 27423–27443, Jul. 2021, doi: 10.1007/s11042-021-11038-0.
[58] J. Zakraoui, S. A. Maadeed, M. S. A. El-Seoud, J. M. Alja'am, and M. Salah, "A generative approach to enrich Arabic story text with visual aids," in Proc. 10th Int. Conf. Softw. Inf. Eng. New York, NY, USA: Association for Computing Machinery, 2021, pp. 47–52, doi: 10.1145/3512716.3512725.
[59] S. M. Mathematics and M. Loey, "Photo realistic generation from Arabic text description based on generative adversarial networks," ACM Trans. Asian Low-Resource Lang. Inf. Process., Mar. 2022, doi: 10.1145/3490504.
[60] M. Bahani, A. El Ouaazizi, and K. Maalmi, "AraBERT and DF-GAN fusion for Arabic text-to-image generation," Array, vol. 16, Dec. 2022, Art. no. 100260. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2590005622000935
[61] M. Bahani, S. M. Ben, K. Maalmi, and A. E. Ouaazizi. (Oct. 2022). Increase the Effectiveness of the Arabic Text-to-image Generation Task. [Online]. Available: https://www.researchsquare.com/article/rs-2169841/v1
[62] M. Bahani, A. E. Ouaazizi, and K. Maalmi, "The effectiveness of T5, GPT-2, and BERT on text-to-image generation task," Pattern Recognit. Lett., vol. 173, pp. 57–63, Sep. 2023.
[63] M. Kang, J.-Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, and T. Park, "Scaling up GANs for text-to-image synthesis," 2023, arXiv:2303.05511v2.
[64] C. Liu, J. Hu, and H. Lin, "SWF-GAN: A text-to-image model based on sentence-word fusion perception," Comput. Graph., vol. 115, pp. 500–510, Oct. 2023.
[65] M. Tao, B.-K. Bao, H. Tang, and C. Xu, "GALIP: Generative adversarial CLIPs for text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 14214–14223.
[66] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, "Zero-shot text-to-image generation," 2021, arXiv:2102.12092v2.
[67] J. Yu, Y. Xu, J. Yu Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan, B. Hutchinson, W. Han, Z. Parekh, X. Li, H. Zhang, J. Baldridge, and Y. Wu, "Scaling autoregressive models for content-rich text-to-image generation," 2022, arXiv:2206.10789v1.
[68] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, and J. Tang, "CogView: Mastering text-to-image generation via transformers," in Proc. Adv. Neural Inf. Process. Syst., vol. 24, May 2021, pp. 19822–19835.
[69] M. Ding, W. Zheng, W. Hong, and J. Tang, "CogView2: Faster and better text-to-image generation via hierarchical transformers," 2022, arXiv:2204.14217v2.
[70] S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, and B. Guo, "Vector quantized diffusion model for text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Nov. 2021, pp. 10686–10696.
[71] O. Avrahami, D. Lischinski, and O. Fried, "Blended diffusion for text-driven editing of natural images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 18187–18197.
[72] A. Sanghi, H. Chu, J. G. Lambourne, Y. Wang, C.-Y. Cheng, M. Fumero, and K. R. Malekshan, "CLIP-Forge: Towards zero-shot text-to-shape generation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Oct. 2021, pp. 18582–18592.
[73] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, "GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models," 2021, arXiv:2112.10741v3.
[74] Z. Zhang, J. Ma, C. Zhou, R. Men, Z. Li, M. Ding, J. Tang, J. Zhou, and H. Yang, "M6-UFC: Unifying multi-modal controls for conditional image synthesis via non-autoregressive generative transformers," 2021, arXiv:2105.14211v4.
[75] Z. Wang, W. Liu, Q. He, X. Wu, and Z. Yi, "CLIP-GEN: Language-free training of a text-to-image generator with CLIP," 2022, arXiv:2203.00386v1.
[76] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, "Photorealistic text-to-image diffusion models with deep language understanding," 2022, arXiv:2205.11487v1.
[77] Z. Feng, Z. Zhang, X. Yu, Y. Fang, L. Li, X. Chen, Y. Lu, J. Liu, W. Yin, S. Feng, Y. Sun, L. Chen, H. Tian, H. Wu, and H. Wang, "ERNIE-ViLG 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts," 2022, arXiv:2210.15257v1.
[78] Y. Balaji, S. Nah, X. Huang, A. Vahdat, J. Song, Q. Zhang, K. Kreis, M. Aittala, T. Aila, S. Laine, B. Catanzaro, T. Karras, and M.-Y. Liu, "EDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers," 2022, arXiv:2211.01324v3.
[79] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 10674–10685.


[80] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, "Hierarchical text-conditional image generation with CLIP latents," 2022, arXiv:2204.06125.
[81] J. Shi, C. Wu, J. Liang, X. Liu, and N. Duan, "DiVAE: Photorealistic images synthesis with denoising diffusion decoder," 2022, arXiv:2206.00386v1.
[82] W.-C. Fan, Y.-C. Chen, D. Chen, Y. Cheng, L. Yuan, and Y.-C. F. Wang, "Frido: Feature pyramid diffusion for complex scene image synthesis," 2022, arXiv:2208.13753v1.
[83] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, "DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation," 2022, arXiv:2208.12242v1.
[84] B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, "Imagic: Text-based real image editing with diffusion models," 2022, arXiv:2210.09276v1.
[85] D. Valevski, M. Kalman, Y. Matias, and Y. Leviathan, "UniTune: Text-driven image editing by fine tuning an image generation model on a single image," 2022, arXiv:2210.09477v3.
[86] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, W. Manassra, P. Dhariwal, C. Chu, Y. Jiao, and A. Ramesh. (2023). Improving Image Generation With Better Captions. [Online]. Available: https://cdn.openai.com/papers/dall-e-3.pdf
[87] W. Wu, Z. Li, Y. He, M. Zheng Shou, C. Shen, L. Cheng, Y. Li, T. Gao, D. Zhang, and Z. Wang, "Paragraph-to-image generation with information-enriched diffusion model," 2023, arXiv:2311.14284.
[88] W. Li, X. Xu, X. Xiao, J. Liu, H. Yang, G. Li, Z. Wang, Z. Feng, Q. She, Y. Lyu, and H. Wu, "UPainting: Unified text-to-image diffusion generation with cross-modal guidance," 2022, arXiv:2210.16031v3.
[89] R. Ganz and M. Elad, "CLIPAG: Towards generator-free text-to-image generation," 2023, arXiv:2306.16805v2.
[90] Y. Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y. J. Lee, "GLIGEN: Open-set grounded text-to-image generation," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jan. 2023, pp. 22511–22521.
[91] Y. Li, H. Wang, Q. Jin, J. Hu, P. Chemerys, Y. Fu, Y. Wang, S. Tulyakov, and J. Ren, "SnapFusion: Text-to-image diffusion model on mobile devices within two seconds," 2023, arXiv:2306.00980v2.
[92] L. Zhang, A. Rao, and M. Agrawala, "Adding conditional control to text-to-image diffusion models," 2023, arXiv:2302.05543v2.
[93] W. Zhao, Y. Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, "Unleashing text-to-image diffusion models for visual perception," 2023, arXiv:2303.02153v1.
[94] J. Hu, X. Han, X. Yi, Y. Chen, W. Li, Z. Liu, and M. Sun, "Efficient cross-lingual transfer for Chinese stable diffusion with images as pivots," 2023, arXiv:2305.11540v1.
[95] F. Ye, G. Liu, X. Wu, and L. Wu, "AltDiffusion: A multilingual text-to-image diffusion model," 2023, arXiv:2308.09991v2.
[96] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," 2017, arXiv:1706.08500.
[97] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Proc. 30th Int. Conf. Neural Inf. Process. Syst. Red Hook, NY, USA: Curran Associates, 2016, pp. 2234–2242.
[98] A. Borji, "Pros and cons of GAN evaluation measures," Comput. Vis. Image Understand., vol. 179, pp. 41–65, Feb. 2018.
[99] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in Proc. Mach. Learn. Res., vol. 139, 2021, pp. 8748–8763.
[100] C. Akkus, L. Chu, V. Djakovic, S. Jauch-Walser, P. Koch, G. Loss, C. Marquardt, M. Moldovan, N. Sauter, M. Schneider, R. Schulte, K. Urbanczyk, J. Goschenhofer, C. Heumann, R. Hvingelby, D. Schalk, and M. Aßenmacher, "Multimodal deep learning," 2023, arXiv:2301.04856v1.
[101] P. Wang. (2022). Dall-E 2—PyTorch. Accessed: Oct. 25, 2023. [Online]. Available: https://github.com/lucidrains/DALLE2-pytorch

SARAH K. ALHABEEB received the B.Sc. degree in information technology from the Department of Information Technology, College of Computer, Qassim University, Saudi Arabia, in May 2018, where she is currently pursuing the M.Sc. degree in information technology. Her research interests include machine learning, artificial intelligence, natural language processing, and the Internet of Things.

AMAL A. AL-SHARGABI received the master's and Ph.D. degrees from Universiti Teknologi Mara (UiTM), Malaysia. She is currently an Associate Professor with the College of Computer, Qassim University. She has received a number of Qassim University research grants since 2018. Her research interests include program comprehension, empirical software engineering, and machine learning.
