Text-to-Image Synthesis With Generative Models: Methods, Datasets, Performance Metrics, Challenges, and Future Direction
ABSTRACT Text-to-image synthesis, the process of turning words into images, opens up a world of creative
possibilities, and meets the growing need for engaging visual experiences in a world that is becoming
more image-based. As machine learning capabilities expanded, the area progressed from simple tools
and systems to robust deep learning models that can automatically generate realistic images from textual
inputs. Modern, large-scale text-to-image generation models have made significant progress in this direction,
producing diversified and high-quality images from text description prompts. Although several methods
exist, Generative Adversarial Networks (GANs) have long held a position of prominence. However, diffusion
models have recently emerged, with results that clearly surpass those achieved by GANs. This study offers
a concise overview of text-to-image generative models by examining the existing body of literature and
providing a deeper understanding of this topic. This will be accomplished by providing a concise summary
of the development of text-to-image synthesis, previous tools and systems employed in this field, key types
of generative models, as well as an exploration of the relevant research conducted on GANs and diffusion
models. Additionally, the study provides an overview of common datasets utilized for training the text-to-
image model, compares the evaluation metrics used for evaluating the models, and addresses the challenges
encountered in the field. Finally, concluding remarks are provided to summarize the findings and implications
of the study and open issues for further research.
INDEX TERMS Deep learning, diffusion model, generative models, generative adversarial network, text-
to-image synthesis.
an engaging and consequential endeavor [3]. One of the popular approaches is to guide image synthesis with a text description, which leads to text-to-image synthesis, addressed in the following section.

A. TEXT-TO-IMAGE SYNTHESIS
Text-to-image synthesis, or the generation of images from text descriptions, is a complex computer vision and machine learning problem that has seen significant progress in recent years. With automatic image generation from natural language, users can specify visual elements through visually rich text descriptions. Visual content, such as pictures, is often a more effective way to share and understand information than written text, because it is more precise and easier to grasp [4]. Text-to-image synthesis refers to the use of computational methods to convert human-written textual descriptions (sentences or keywords) into visually equivalent representations of those descriptions (images) [3]. The best alignment of visual content to the text used to be determined through word-to-image correlation analysis combined with supervised synthesis methods. New unsupervised methods, especially deep generative models, have emerged as a result of recent developments in deep learning. These models are able to generate plausible visual images by employing appropriately trained neural networks [3]. Figure 1 shows the general architecture of text-to-image generation: a text prompt is fed into an image generative model, which uses the text description to generate an image.

FIGURE 1. General architecture of text-to-image generation.

The text-to-picture synthesis system [7] aimed to improve communication by generating visual representations based on textual input. The system followed an evolutionary process and adopted semantic role labeling rather than keyword extraction, incorporating the concept of picturability to assess the likelihood of finding a suitable image that represents a given word. To produce compilations of images obtained from the Flickr platform, Word2Image [8] implemented a variety of methodologies, including semantic clustering, correlation analysis, and visual clustering.

Moreover, WordsEye [9] is a text-to-scene system that automatically generates static 3D scenes representative of the supplied content. A language analyzer and a visualiser are the two primary parts of the system. Also, a multi-modal system called CONFUCIUS [10], which works as a text-to-animation converter, can convert any sentence containing an action verb into an animation that is perfectly synced with speech. A visually assisted instant messaging technique, called Chat With Illustration (CWI) [11], automatically provides users with visual messages connected with their text messages. Systems for several other languages also exist. To handle the Russian language, the Utkus [12] text-to-image synthesis system utilizes a natural language analysis module, a stage processing module, and a rendering module. Likewise, Vishit [13] is a method for visualizing processed Hindi texts; language processing, knowledge base construction, and scene generation are its three main computational foundations. Moreover, for the Arabic language, [14] put forth a comprehensive mobile-based system that automatically generates illustrations for Arabic narratives. The suggested method is specifically designed for use on mobile devices, with the aim of instructing Arab children in an engaging and non-traditional manner. Also, using a technique called conceptual graph matching, Illustrate It! [15] is a multimedia mobile learning solution for the Arabic language.
Several additional models based on the concept of GANs were developed to address the previous shortcomings. GANs can be used in many different contexts, such as generating images of people's faces, producing realistic photos, creating cartoon characters, aging faces, increasing image resolution, and translating between images and words [4]. GANs consist of two major sub-models: a generator and a discriminator. The generator is in charge of making new fake images, taking a noise vector as input and producing an image as output. The discriminator's job, on the other hand, is to tell the difference between real and fake images after being trained with real data. In other words, it serves as a classification network that classifies images by returning 0 for fake and 1 for real. Therefore, the generator's goal is to create convincing fakes in order to trick the discriminator, while the discriminator's goal is to recognize the difference [1]. Training improves both the discriminator's ability to distinguish between real and fake images and the generator's ability to produce realistic-looking images. When the discriminator can no longer tell genuine images from fraudulent ones, equilibrium has been reached.

Diffusion models, in turn, have recently demonstrated impressive generative capabilities. The field of generative modeling has found many uses for diffusion models so far, including image generation, super-resolution, inpainting, editing, and translation between images [20]. The principles of non-equilibrium thermodynamics provide the basis for diffusion models: before learning to rebuild desirable data examples from the noise, they use a Markov chain of diffusion steps to gradually inject noise into the data [20]. The diffusion model therefore learns in two phases, one for forward diffusion and the other for backward diffusion. In the forward diffusion phase, Gaussian noise is progressively added to the input data at each step [21]. In the second, "reverse" phase, the model is trained to invert the diffusion process so that the original input data can be recovered.

The architectures of the generative model types are shown in Figure 2.
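To make the two training schemes described above concrete, the following minimal PyTorch sketch (written for this overview, not taken from any surveyed system) shows a single adversarial update, in which real images are labeled 1 and generated images 0, together with the closed-form forward-noising step of a diffusion model; the generator, discriminator, optimizers, and noise schedule are assumed to be defined elsewhere.

import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, real_images, g_opt, d_opt, z_dim=100):
    # One adversarial update: the discriminator is pushed toward 1 for real
    # images and 0 for fakes; the generator tries to get its fakes scored as 1.
    n = real_images.size(0)
    device = real_images.device
    real_lbl = torch.ones(n, 1, device=device)
    fake_lbl = torch.zeros(n, 1, device=device)

    # Discriminator update (generator outputs are detached).
    fakes = generator(torch.randn(n, z_dim, device=device)).detach()
    d_loss = (F.binary_cross_entropy(discriminator(real_images), real_lbl)
              + F.binary_cross_entropy(discriminator(fakes), fake_lbl))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # Generator update: fool the discriminator into predicting "real".
    g_loss = F.binary_cross_entropy(
        discriminator(generator(torch.randn(n, z_dim, device=device))), real_lbl)
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

def forward_diffusion(x0, t, alphas_cumprod):
    # Forward (noising) phase: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * noise,
    # where a_bar_t is the cumulative product of the noise schedule up to step t.
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise

In practice, the reverse phase is learned by training a network (typically a U-Net) to predict the injected noise from the noisy input and the step index.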
has over 200,000 images, encompassing 10,000 distinct identities. Each image is accompanied by five detailed attributes, providing fine-grained information.

F. DEEPFASHION
Reference [37] serves as a valuable resource for the training and evaluation of numerous image synthesis models. It encompasses a comprehensive collection of annotations, including textual descriptions and fine-grained labels, across multiple modalities. The dataset comprises a collection of eight hundred thousand fashion images that exhibit a wide range of diversity, encompassing various accessories and poses.

G. IMAGENET
To test algorithms designed to store, retrieve, or analyze multimedia data, researchers have created a massive database called ImageNet [38], which contains high-quality images that have been manually annotated. There are more than 14 million images in the ImageNet database, all of which have been annotated using the WordNet classification system. Since 2010, the dataset has been used as a standard for object recognition and image classification in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

H. OPENIMAGES
Reference [39] consists of around 9 million images that have been annotated with various types of data, including object bounding boxes, image-level labels, object segmentation masks, localized narratives, and visual relationships. The training dataset of version 7 has 1.9 million images and 16 million bounding boxes representing 600 different object classes, rendering it the most extensive dataset currently available with annotations for object location.

I. CC12M
Conceptual 12M [40] is one of the datasets utilized by OpenAI's DALL-E 2 for training, and it consists of 12 million text-image pairs. The dataset, built from the original CC3M dataset of 3 million text-image pairs, was used for a wide range of pre-training and end-to-end training of image models.

J. LAION-5B
One of the largest publicly available image-text datasets is Large-scale AI Open Network (LAION) [41]. More than five billion text-image pairs make up LAION-5B, an AI training dataset that is 14 times larger than its predecessor, LAION-400M.

Table 1 provides a comprehensive comparison of the datasets commonly used in computer vision and multimodal research. Each dataset is evaluated based on key attributes including domain, common task, number of images, captions per image, training and testing split, and the number of object categories.

III. TEXT-TO-IMAGE GENERATION METHODS
This section provides an overview of relevant studies on text-to-image generative models. Due to the diversity of the generative models and the vast amount of associated literature, this study narrows its focus to the two cutting-edge types of deep learning generative models: GANs and diffusion models.

A. TEXT-TO-IMAGE GENERATION USING GANS
Since its introduction in 2014, GAN-based text-to-image synthesis has been the subject of numerous studies, leading to significant advancements in the field. Reed et al. [42], building upon the foundation laid by deep convolutional GANs [43], were the first to investigate the GAN-based text-to-image synthesis technique.

Earlier models could create images based on universal constraints like a class label or caption, but not pose or location. Therefore, the Generative Adversarial What-Where Network (GAWWN) [44] was proposed, a network that generates images based on directions about what to draw and where to draw it. It demonstrates the ability to generate images based on free-form text descriptions and the precise location of objects. GAWWN enables precise location control through the use of a bounding box or a collection of key points.

Stacked Generative Adversarial Networks (StackGAN) [45] established a two-stage conditioning augmentation approach to boost the diversity of synthesized images and stabilize conditional-GAN training. Using the provided text description as input, the Stage-I GAN generates low-resolution images of the initial shape and colors of the object. High-resolution (e.g., 256 × 256) images with photorealistic features are generated by the Stage-II GAN using the results from Stage-I and the descriptive text. An improvement to this model was subsequently made, leading to StackGAN++ [46]. The second version of StackGAN uses generators and discriminators organized in a tree-like structure to produce images at multiple scales that fit the same scene. StackGAN++ exhibits more stable training behavior by approximating multiple distributions.

For even more accurate text-to-image generation, the Attentional Generative Adversarial Network (AttnGAN) [47] permits attention-driven, multi-stage refinement. By focusing on important natural language terms, AttnGAN's attentional generative network allows it to synthesize fine-grained image features.

To rebuild textual descriptions from the generated images, MirrorGAN [48] presents a text-to-image-to-text architecture with three modules. To guarantee global semantic coherence between textual descriptions and the corresponding generated images, it additionally proposes a word- and sentence-level average embedding.

Figure 4 shows the architectures of StackGAN, StackGAN++, AttnGAN, and MirrorGAN.
Although there have been many studies on text-to-image generation in English, very few have been applied to other languages. In [54], the use of AttnGAN was proposed for generating fine-grained images based on descriptions in Bangla text. It is capable of integrating the most exact details in various subregions of the image, with a specific emphasis on the pertinent terms mentioned in the natural language description.

Furthermore, [55] uses language translation models to extend established English text-to-image generation approaches to Hindi text-to-image synthesis. Input Hindi sentences were translated to English by a transformer-based Neural Machine Translation module, whose output was supplied to a GAN-based image generation module.

On the other hand, the CJE-TIG [56] cross-lingual text-to-image pre-training technique removes barriers to using GAN-based text-to-image synthesis models for any given input language. This method alters text-to-image training patterns that are linguistically specific. It uses a bilingual joint encoder in place of a text encoder, applies a discriminator to optimize the encoder, and uses novel generative models to generate content.

The difficulties of visualizing the text of a story with several characters and exemplary semantic relationships were considered in [57]. Two cutting-edge GAN-based image generation models served as inspiration for the researchers' two-stage model architecture for creating images. Stage-I of the image generation process makes use of a scene graph image generation framework; Stage-II refines the output image using a StackGAN based on the object layout module and the initial output image. Extensive examination and qualitative results showed that their method could produce high-quality graphics accurately depicting the text's key concepts.

Short Arabic stories, complete with images that capture the essence of the story and its setting, were offered using a novel approach in [58]. To lessen the need for human input, a text generation method was used in combination with a text-to-image synthesis network. Arabic stories with specialized vocabulary and images were also compiled into a corpus. Applying the approach to the generation of text-image content using various generative models yielded results that proved its value. The method has the potential for use in the classroom to facilitate the development of subject-specific narratives by educators.

A model for generating 256 × 256 realistic images from Arabic text descriptions was proposed in [59]. In order to generate high-quality images, a unique attention network was trained and evaluated in multiple stages for the proposed model. A deep multimodal similarity model for computing a fine-grained image-text matching loss for training the generator was also proposed. The proposed approach set a new standard for converting Arabic text to photorealistic images. On the Caltech-UCSD Birds-200-2011 (CUB) dataset, the newly proposed model produced an inception score of 3.42 ± 0.05.

Moreover, [60] proposed a robust architecture designed to produce high-resolution realistic images that match a text description written in Arabic. The authors adjusted the shape of the input data to DF-GAN by decreasing the size of the sentence vectors generated by AraBERT. Subsequently, they combined DF-GAN with AraBERT by feeding the sentence embedding vector into the generator and discriminator of DF-GAN. When compared to StackGAN++, their method produced impressive results: on the CUB dataset, it obtained an FID score of 55.96 and an IS of 3.51, and on the Oxford-102 dataset, an FID score of 59.45 and an IS of 3.06.

To improve upon their prior work in [60], the authors presented two additional techniques [61]. To get around the out-of-vocabulary problem, they tried a first technique that involved combining a sample text transformer with the generator and discriminator of DF-GAN. In the second method, the text transformer and training were carried over, and a learning mask predictor was integrated into the architecture to predict masks, which are then utilized as parameters in affine transformations to provide a more seamless fusion between the image and the text. To further improve training stability, the DAMSM loss function was used to train the architecture. The findings showed that the latter technique was superior. Figure 5 shows samples on the CUB dataset generated by DM-GAN, AttnGAN, StackGAN, and GAN-INT-CLS.

The study in [62] focuses on using transformer-based models (BERT, GPT-2, T5) for text-to-image generation, an under-explored area in computer vision and NLP. It proposes specific architectures to adapt these models for creating images from text descriptions. The study, evaluating the models on challenging datasets, finds that T5 is particularly effective in generating images that are both visually appealing and semantically accurate.

Kang et al. [63] presented a groundbreaking approach to scaling up GANs for text-to-image synthesis. By introducing GigaGAN, a new GAN architecture, the study showcases the ability to generate high-resolution, high-quality images efficiently. GigaGAN demonstrates superior performance in terms of speed and image quality, marking a significant advancement in the use of GANs for large-scale, complex image synthesis tasks.

SWF-GAN, a new model introduced in [64], enhances image synthesis from textual descriptions. It uniquely uses a sentence-word fusion module and a weakly supervised mask predictor for detailed semantic mapping and accurate structure generation. The model effectively creates clear and vivid images with a lower computational load, significantly outperforming baseline models in IS and FID scores.

GALIP [65] introduces a novel GAN architecture for text-to-image synthesis. This model integrates transformer-based text encoders and an advanced generator, resulting in high-quality, text-aligned image generation. The model excels in creating images from complex text descriptions, emphasizing the potential of GANs in the realm of text-to-image synthesis.
ERNIE-ViLG 2.0 [77] improves a text-to-image diffusion model with a knowledge-enhanced mixture of denoising experts, with different denoising specialists at different denoising stages to improve the quality of the output images.

On the other hand, eDiff-I [78] outperforms other large-scale text-to-image diffusion models by improving text alignment while keeping inference computation cost and visual quality stable. Unlike traditional diffusion models, which rely on a single model trained to denoise the entire noise distribution, eDiff-I is instead trained as an ensemble of expert denoisers, each of which is tailored to denoising at a distinct stage of generation. The researchers claim that employing such specialized denoisers enhances the quality of the synthesized output.

Frido [82] is an image-synthesizing Feature Pyramid Diffusion model that conducts multiscale coarse-to-fine denoising. To construct an output image, it first decomposes the input into vector-quantized, scale-dependent components. The stage of learning multi-scale representations can also take advantage of input conditions such as language, scene graphs, and image layout. Frido can thus be utilized for both traditional and cross-modal image synthesis.

A new method called DreamBooth was suggested in [83] as a way to tailor the results of text-to-image generation from diffusion models to the needs of users. The authors fine-tuned a pretrained text-to-image model so that it is able to associate a distinctive identifier with a subject, given only a small number of images of that subject as input. Following the subject's incorporation into the model's output domain, the identifier can be used to generate completely new photorealistic pictures of the subject in a variety of settings.

Furthermore, Imagic [84] shows how a single real image can be subjected to sophisticated text-guided semantic edits. While maintaining the image's original qualities, Imagic can alter the position and composition of one or more objects within it. It works on raw images without the need for image masks or any other preprocessing.

Likewise, UniTune [85] is capable of editing images with a high degree of semantic and visual fidelity to the original, given an arbitrary image and a textual edit description as input. It can be considered an art-direction tool that only requires text as input rather than more complex requirements such as masks or drawings.

DiVAE, a VQ-VAE architecture model that employs a diffusion decoder as the reconstructing component in image synthesis, was proposed by Shi et al. in [81]. They investigated how to incorporate image embeddings into the diffusion model for high performance and discovered that a minor adjustment to the U-Net used in diffusion could accomplish this.

Building upon the success of its predecessor [66], DALL-E 2 [80] was launched as a follow-up version with the intention of producing more realistic images at greater resolutions by combining concepts, features, and styles. The model consists of two parts: a prior that creates a CLIP image embedding from a caption and a decoder that creates an image based on the embedding. It was demonstrated that increasing image variety through the intentional generation of representations leads to only a slight decrease in photorealism and caption similarity.

FIGURE 7. Overview of DALL-E 2, reproduced from Ramesh et al. [80].

Figure 7 represents an overview of DALL-E 2, and Figure 8 shows samples of images generated by DALL-E 2 given a detailed text prompt.

Furthermore, the recently released DALL-E 3 [86] represents a significant advancement over its predecessors. Leveraging advanced diffusion models, DALL-E 3 not only excels in maintaining fidelity to textual prompts but also captures intricate details, marking a substantial step forward in the realm of generative models.

Stable Diffusion is another popular text-to-image tool that was introduced in 2022, based on previous work [79]. Stable Diffusion employs a type of diffusion model known as the latent diffusion model (LDM). A VAE, a U-Net, and an optional text encoder comprise Stable Diffusion. Compared to pixel-based diffusion models, LDMs dramatically reduce the processing requirements while achieving new state-of-the-art image inpainting and highly competitive performance on a variety of applications such as unconditional image creation and super-resolution. Figure 9 shows an overview of the architecture of Stable Diffusion.

Table 2 summarizes the studies that utilized diffusion models in text-to-image generation by year, model, and dataset.

TABLE 2. Diffusion Models-based related studies.
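As an illustration of how a latent diffusion model such as Stable Diffusion is typically used in practice, the following minimal sketch relies on the Hugging Face diffusers library and a publicly hosted checkpoint; the model identifier, prompt, and sampling settings are illustrative assumptions rather than details taken from [79].

# Minimal text-to-image sketch with a latent diffusion model (Stable Diffusion).
# Assumes the `diffusers` and `torch` packages and a CUDA-capable GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative choice of public checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# The prompt conditions the U-Net denoiser through the text encoder; denoising
# runs in the VAE latent space, and the VAE decoder turns the final latent into pixels.
result = pipe(
    "a watercolor painting of a desert fox at sunset",
    num_inference_steps=50,   # number of reverse-diffusion (denoising) steps
    guidance_scale=7.5,       # strength of classifier-free guidance toward the prompt
)
result.images[0].save("generated.png")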
FIGURE 10. Random image samples on MS-COCO, generated by DALL-E, GLIDE, and DALL-E 2. Source: [80].
For more in-depth details, we refer to [98]. Moreover, the CLIP score [99] is used for evaluating common sense and mentioned objects, while human evaluation offers a comprehensive insight into multiple aspects of image generation. In the following, a detailed description of each metric is given.

A. THE FRECHET INCEPTION DISTANCE (FID) [96]
Using the feature space of a pre-trained Inception v3 network, FID [96] determines the Fréchet distance between the natural and artificial distributions. It is computed as:

FID(r, g) = ∥µr − µg∥² + trace(Σr + Σg − 2(Σr Σg)^(1/2))    (1)

where µr, Σr and µg, Σg denote the means and covariances of the real and generated feature distributions, respectively. A lower FID indicates a higher degree of realism, accuracy, and variety in the generated distributions. Table 3 presents a comparison of FID scores obtained by GANs and diffusion models on the MS-COCO dataset and shows that diffusion models achieved remarkable results.

TABLE 3. FID scores of GANs and diffusion models on the MS-COCO dataset.

B. THE INCEPTION SCORE
Reference [97], which ignores the underlying distribution, measures the produced distribution's faithfulness and diversity. The following is the IS equation:

I = exp(Ex DKL(p(y|x) ∥ p(y)))    (2)
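Both scores can be computed directly once Inception-v3 activations have been extracted for the real and generated images; the short sketch below (illustrative only, not taken from the surveyed papers) shows the arithmetic of Eq. (1) from feature matrices and of Eq. (2) from predicted class probabilities.

import numpy as np
from scipy import linalg

def fid(real_feats, gen_feats):
    # Eq. (1): Frechet distance between Gaussians fitted to Inception features.
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts caused by numerical error
    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2.0 * covmean))

def inception_score(probs, eps=1e-12):
    # Eq. (2): exp of the mean KL divergence between p(y|x) and the marginal p(y).
    p_y = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))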
C. THE R-PRECISION (RP)
Reference [47] introduced this metric, which is widely employed for assessing the consistency between text and images. RP operates on the principle of using the generated image as a query for the caption it was produced from. Specifically, given an authentic text description and 99 randomly selected mismatched captions, an image is produced from the authentic caption. This resulting image is then utilized to retrieve the original description from a pool of 100 candidate captions. The retrieval is deemed successful if the similarity score between the image and the authentic caption is the highest, where the matching score is determined using the cosine similarity between the encoding vectors of the image and the caption. A higher RP score indicates better quality, with RP being the proportion of successful retrievals.

D. CLIP SCORE
The CLIP model [99], developed by OpenAI, demonstrates the ability to evaluate the semantic similarity between a given text caption and an accompanying image. Based on this rationale, the CLIP score can serve as a quantitative measure and is formally defined as:

E[s(f(image) · g(caption))]    (3)

where the mathematical expectation is computed over the batch of generated images, and s represents the logarithmic scale of the CLIP logit [73]. A higher CLIP score suggests a stronger semantic relationship between the image and the text, while a lower score indicates a weaker connection.

E. HUMAN EVALUATIONS
Some studies used human evaluation as a qualitative measure to assess the quality of the results. The reporting of metrics based on human evaluation is motivated by the fact that many possible applications of the models are centered upon tricking the human observer [100]. Typically, a collection of images is provided to an individual, who is tasked with evaluating their quality in terms of photorealism and alignment with the associated captions.

Frolov et al. [1] proposed a set of criteria for comparing evaluation metrics. The following is an explanation of these criteria.
• Image Quality and Diversity: The degree to which the generated image looks realistic or similar to the reference image, and the ability of the model to produce varied images from the same text prompt.
• Text Relevance: How well the generated image corresponds to the given text prompt.
• Mentioned Objects and Object Fidelity: Whether the model correctly identifies and includes the objects mentioned in the text, and how accurately the objects in the generated image match their real-world counterparts.
• Numerical and Positional Alignment: The accuracy of any quantitative details and the positional arrangement of objects in the generated image in relation to the provided text.
• Common Sense: The presence of logical and expected elements in the generated image.
• Paraphrase Robustness: The model remains unaffected by minor modifications in the input description, such as word substitutions or rephrasings.
• Explainable: The ability to provide a clear explanation of why an image is not aligned with the input.
• Automatic: Whether the metric can be calculated automatically without human intervention.
Based on these key criteria, we provide in Table 4 a comparative analysis of the commonly used text-to-image evaluation metrics based on their performance. It is important to note that the table offers a simplified overview. In practice, choosing the right metric depends on the specific goals and context of the text-to-image generation task. Additionally, the effectiveness of these metrics may vary depending on the specific model and dataset used.
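As a minimal illustration of how the CLIP score and R-precision described above reduce to cosine similarities between embedding vectors, the sketch below assumes image and caption embeddings have already been produced by a CLIP-style encoder; the scaling factor and array shapes are illustrative assumptions.

import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_score(image_embs, caption_embs, scale=100.0):
    # Mean scaled cosine similarity between matched image/caption pairs (cf. Eq. (3)).
    sims = np.sum(normalize(image_embs) * normalize(caption_embs), axis=-1)
    return float(scale * sims.mean())

def r_precision_hit(image_emb, true_caption_emb, mismatched_caption_embs):
    # 1.0 if the true caption ranks first among the 100 candidates, else 0.0;
    # averaging this indicator over many generated images gives RP.
    candidates = np.vstack([true_caption_emb, mismatched_caption_embs])
    sims = normalize(candidates) @ normalize(image_emb)
    return float(np.argmax(sims) == 0)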
TABLE 4. Overview of commonly used evaluation metrics for text-to-image synthesis, adapted from Frolov et al. [1].

V. CHALLENGES AND LIMITATIONS
Although there has been significant progress made in the area of creating visual representations of textual descriptions, there are still some challenges and limitations, which are discussed below.

A. OPEN SOURCE
Although DALL-E is one of the competitive models, unfortunately, it has not been released for public usage. There is a copy of DALL-E 2 available in PyTorch [101], but no pre-trained model. However, the Stable Diffusion model is among the open-source models that are currently accessible. Stable Diffusion benefits from extensive community support due to its open-source nature. Consequently, it is anticipated that there will be additional advancements in this particular area in the near future.

B. LANGUAGE SUPPORT
The majority of studies in the field of text-to-image generation have been conducted on English text descriptions due to the abundance of dataset resources and the simple structure of the language. Some languages, however, require more effort, which needs to be addressed. For instance, Arabic, in contrast to English, has more complicated morphological features and fewer semantic and linguistic resources [5]. This is a main challenge that needs to be dealt with in text-to-image generation.

C. COMPUTATIONAL COMPLEXITY
The computational complexity of diffusion models poses a notable difficulty. The process of training a diffusion model involves multiple iterative processes, which can impose a significant computational burden. Therefore, the model's scalability may be constrained by the increased complexity observed when working with larger datasets and higher-resolution images. Moreover, for further research in the field of text-to-image generative models, and despite the availability of big datasets like LAION-5B to the general public, the utilization of such datasets remains challenging for individuals due to the substantial hardware requirements involved.

D. ETHICAL CONSIDERATIONS
It is important to consider the potential ethical issues that arise with the use of text-to-image generative models. One of the significant concerns is the potential for misuse of these models. With the ability to generate realistic images based on text descriptions, there is a risk that these models could be used to create deceptive or misleading content. This could have serious consequences in various areas, such as fake news, fraud, or even harassment.

Another issue is the potential bias that can be embedded in the generated images. If the training data used to develop these models is not diverse and representative, there is a possibility that the generated images may reflect prejudices or stereotypes present in the data.

VI. FUTURE DIRECTIONS
The domain of text-to-image generation is experiencing significant advancements on a regular basis. The recent emergence of novel generative diffusion models, including DALL-E, Midjourney, Stable Diffusion, and others, has sparked significant interest and discussion in the scientific community. The field shows a high degree of fertility and renewability, as seen by the recent publication of numerous relevant studies and an ongoing flow of new papers within a relatively short timeframe.

By making generative models open-source, researchers and developers can collaborate more effectively, which will in turn boost innovation in the field. Researchers may utilize these publicly available models to investigate novel uses, enhance current AI models, and move the field forward rapidly.

To overcome the language barrier, some studies proposed multilingual [95] and cross-lingual [56] models to support multiple languages within the same model. The goal of these multilingual models is to break down linguistic barriers by providing a common groundwork for the comprehension and processing of several languages at once. This approach has the potential to dramatically improve linguistic diversity in communication and open up access to information for everyone.

Moreover, to make these technologies more widely accessible and sustainable, it will be essential to improve resource efficiency and minimize computational complexity by creating models that produce high-quality images using fewer computational resources.

Nevertheless, greater research into ethical and bias considerations is required. Ensuring fairness, removing bias, and following ethical rules are still critical considerations for any AI system. Possible directions for future study in this area include developing models with increased awareness of and sensitivity to these factors.

The utilization of text-to-image generation exhibits a wide range of applications across several domains, including but not limited to education, product design, and marketing. This technology enables the creation of visual materials, such as illustrations and infographics, that seamlessly integrate text and images. There are some early assumptions about which businesses might be impacted by the growing area of image generation, which will have an impact on any sector that relies on visual art, such as graphic design, filmmaking, or photography [100].
[40] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, "Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 3557–3567.
[41] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, R. Kaczmarczyk, and J. Jitsev, "LAION-5B: An open large-scale dataset for training next generation image-text models," in Proc. Adv. Neural Inf. Process. Syst., vol. 35, 2022.
[42] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," 2016, arXiv:1605.05396.
[43] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," 2015, arXiv:1511.06434.
[44] S. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, "Learning what and where to draw," 2016, arXiv:1610.02454.
[45] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. Metaxas, "StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 5908–5916.
[46] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas, "StackGAN++: Realistic image synthesis with stacked generative adversarial networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 8, pp. 1947–1962, Aug. 2019. [Online]. Available: https://github.com/hanzhanggit/StackGAN-v2
[47] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He, "AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1316–1324.
[48] T. Qiao, J. Zhang, D. Xu, and D. Tao, "MirrorGAN: Learning text-to-image generation by redescription," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 1505–1514.
[49] Y. Li, Z. Gan, Y. Shen, J. Liu, Y. Cheng, Y. Wu, L. Carin, D. Carlson, and J. Gao, "StoryGAN: A sequential conditional GAN for story visualization," 2018, arXiv:1812.02784.
[50] H. Park, Y. Yoo, and N. Kwak, "MC-GAN: Multi-conditional generative adversarial network for image synthesis," in Proc. Brit. Mach. Vis. Conf., May 2018, p. 76.
[51] M. Zhu, P. Pan, W. Chen, and Y. Yang, "DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 5795–5803.
[52] B. Li, X. Qi, T. Lukasiewicz, and P. H. S. Torr, "ManiGAN: Text-guided image manipulation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 7877–7886.
[53] M. Tao, H. Tang, F. Wu, X. Jing, B.-K. Bao, and C. Xu, "DF-GAN: A simple and effective baseline for text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 16494–16504.
[54] M. A. Haque Palash, M. A. Al Nasim, A. Dhali, and F. Afrin, "Fine-grained image generation from Bangla text description using attentional generative adversarial network," in Proc. IEEE Int. Conf. Robot., Autom., Artif.-Intell. Internet-of-Things (RAAICON), Dec. 2021, pp. 79–84. [Online]. Available: https://ieeexplore.ieee.org/document/9929536/
[55] A. S. Parihar, A. Kaushik, A. V. Choudhary, and A. K. Singh, "HTGAN: An architecture for Hindi text based image synthesis," in Proc. 5th Int. Conf. Comput., Commun. Signal Process. (ICCCSP), May 2021, pp. 273–279.
[56] H. Zhang, S. Yang, and H. Zhu, "CJE-TIG: Zero-shot cross-lingual text-to-image generation by corpora-based joint encoding," Knowl.-Based Syst., vol. 239, Mar. 2022, Art. no. 108006.
[57] J. Zakraoui, M. Saleh, S. Al-Maadeed, and J. M. Jaam, "Improving text-to-image generation with object layout guidance," Multimedia Tools Appl., vol. 80, no. 18, pp. 27423–27443, Jul. 2021, doi: 10.1007/s11042-021-11038-0.
[58] J. Zakraoui, S. A. Maadeed, M. S. A. El-Seoud, J. M. Alja'am, and M. Salah, "A generative approach to enrich Arabic story text with visual aids," in Proc. 10th Int. Conf. Softw. Inf. Eng. New York, NY, USA: Association for Computing Machinery, 2021, pp. 47–52, doi: 10.1145/3512716.3512725.
[59] S. M. Mathematics and M. Loey, "Photo realistic generation from Arabic text description based on generative adversarial networks," ACM Trans. Asian Low-Resource Lang. Inf. Process., Mar. 2022, doi: 10.1145/3490504.
[60] M. Bahani, A. El Ouaazizi, and K. Maalmi, "AraBERT and DF-GAN fusion for Arabic text-to-image generation," Array, vol. 16, Dec. 2022, Art. no. 100260. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2590005622000935
[61] M. Bahani, S. M. Ben, K. Maalmi, and A. E. Ouaazizi. (Oct. 2022). Increase the Effectiveness of the Arabic Text-to-image Generation Task. [Online]. Available: https://www.researchsquare.com/article/rs-2169841/v1
[62] M. Bahani, A. E. Ouaazizi, and K. Maalmi, "The effectiveness of T5, GPT-2, and BERT on text-to-image generation task," Pattern Recognit. Lett., vol. 173, pp. 57–63, Sep. 2023.
[63] M. Kang, J.-Y. Zhu, R. Zhang, J. Park, E. Shechtman, S. Paris, and T. Park, "Scaling up GANs for text-to-image synthesis," 2023, arXiv:2303.05511v2.
[64] C. Liu, J. Hu, and H. Lin, "SWF-GAN: A text-to-image model based on sentence-word fusion perception," Comput. Graph., vol. 115, pp. 500–510, Oct. 2023.
[65] M. Tao, B.-K. Bao, H. Tang, and C. Xu, "GALIP: Generative adversarial CLIPs for text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 14214–14223.
[66] A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, "Zero-shot text-to-image generation," 2021, arXiv:2102.12092v2.
[67] J. Yu, Y. Xu, J. Yu Koh, T. Luong, G. Baid, Z. Wang, V. Vasudevan, A. Ku, Y. Yang, B. K. Ayan, B. Hutchinson, W. Han, Z. Parekh, X. Li, H. Zhang, J. Baldridge, and Y. Wu, "Scaling autoregressive models for content-rich text-to-image generation," 2022, arXiv:2206.10789v1.
[68] M. Ding, Z. Yang, W. Hong, W. Zheng, C. Zhou, D. Yin, J. Lin, X. Zou, Z. Shao, H. Yang, and J. Tang, "CogView: Mastering text-to-image generation via transformers," in Proc. Adv. Neural Inf. Process. Syst., vol. 24, May 2021, pp. 19822–19835.
[69] M. Ding, W. Zheng, W. Hong, and J. Tang, "CogView2: Faster and better text-to-image generation via hierarchical transformers," 2022, arXiv:2204.14217v2.
[70] S. Gu, D. Chen, J. Bao, F. Wen, B. Zhang, D. Chen, L. Yuan, and B. Guo, "Vector quantized diffusion model for text-to-image synthesis," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Nov. 2021, pp. 10686–10696.
[71] O. Avrahami, D. Lischinski, and O. Fried, "Blended diffusion for text-driven editing of natural images," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 18187–18197.
[72] A. Sanghi, H. Chu, J. G. Lambourne, Y. Wang, C.-Y. Cheng, M. Fumero, and K. R. Malekshan, "CLIP-Forge: Towards zero-shot text-to-shape generation," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Oct. 2021, pp. 18582–18592.
[73] A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen, "GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models," 2021, arXiv:2112.10741v3.
[74] Z. Zhang, J. Ma, C. Zhou, R. Men, Z. Li, M. Ding, J. Tang, J. Zhou, and H. Yang, "M6-UFC: Unifying multi-modal controls for conditional image synthesis via non-autoregressive generative transformers," 2021, arXiv:2105.14211v4.
[75] Z. Wang, W. Liu, Q. He, X. Wu, and Z. Yi, "CLIP-GEN: Language-free training of a text-to-image generator with CLIP," 2022, arXiv:2203.00386v1.
[76] C. Saharia, W. Chan, S. Saxena, L. Li, J. Whang, E. Denton, S. K. S. Ghasemipour, B. K. Ayan, S. S. Mahdavi, R. G. Lopes, T. Salimans, J. Ho, D. J. Fleet, and M. Norouzi, "Photorealistic text-to-image diffusion models with deep language understanding," 2022, arXiv:2205.11487v1.
[77] Z. Feng, Z. Zhang, X. Yu, Y. Fang, L. Li, X. Chen, Y. Lu, J. Liu, W. Yin, S. Feng, Y. Sun, L. Chen, H. Tian, H. Wu, and H. Wang, "ERNIE-ViLG 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts," 2022, arXiv:2210.15257v1.
[78] Y. Balaji, S. Nah, X. Huang, A. Vahdat, J. Song, Q. Zhang, K. Kreis, M. Aittala, T. Aila, S. Laine, B. Catanzaro, T. Karras, and M.-Y. Liu, "eDiff-I: Text-to-image diffusion models with an ensemble of expert denoisers," 2022, arXiv:2211.01324v3.
[79] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 10674–10685.
[80] A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen, "Hierarchical text-conditional image generation with CLIP latents," 2022, arXiv:2204.06125.
[81] J. Shi, C. Wu, J. Liang, X. Liu, and N. Duan, "DiVAE: Photorealistic images synthesis with denoising diffusion decoder," 2022, arXiv:2206.00386v1.
[82] W.-C. Fan, Y.-C. Chen, D. Chen, Y. Cheng, L. Yuan, and Y.-C. F. Wang, "Frido: Feature pyramid diffusion for complex scene image synthesis," 2022, arXiv:2208.13753v1.
[83] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, "DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation," 2022, arXiv:2208.12242v1.
[84] B. Kawar, S. Zada, O. Lang, O. Tov, H. Chang, T. Dekel, I. Mosseri, and M. Irani, "Imagic: Text-based real image editing with diffusion models," 2022, arXiv:2210.09276v1.
[85] D. Valevski, M. Kalman, Y. Matias, and Y. Leviathan, "UniTune: Text-driven image editing by fine tuning an image generation model on a single image," 2022, arXiv:2210.09477v3.
[86] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, W. Manassra, P. Dhariwal, C. Chu, Y. Jiao, and A. Ramesh. (2023). Improving Image Generation With Better Captions. [Online]. Available: https://cdn.openai.com/papers/dall-e-3.pdf
[87] W. Wu, Z. Li, Y. He, M. Zheng Shou, C. Shen, L. Cheng, Y. Li, T. Gao, D. Zhang, and Z. Wang, "Paragraph-to-image generation with information-enriched diffusion model," 2023, arXiv:2311.14284.
[88] W. Li, X. Xu, X. Xiao, J. Liu, H. Yang, G. Li, Z. Wang, Z. Feng, Q. She, Y. Lyu, and H. Wu, "UPainting: Unified text-to-image diffusion generation with cross-modal guidance," 2022, arXiv:2210.16031v3.
[89] R. Ganz and M. Elad, "CLIPAG: Towards generator-free text-to-image generation," 2023, arXiv:2306.16805v2.
[90] Y. Li, H. Liu, Q. Wu, F. Mu, J. Yang, J. Gao, C. Li, and Y. J. Lee, "GLIGEN: Open-set grounded text-to-image generation," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Jan. 2023, pp. 22511–22521.
[91] Y. Li, H. Wang, Q. Jin, J. Hu, P. Chemerys, Y. Fu, Y. Wang, S. Tulyakov, and J. Ren, "SnapFusion: Text-to-image diffusion model on mobile devices within two seconds," 2023, arXiv:2306.00980v2.
[92] L. Zhang, A. Rao, and M. Agrawala, "Adding conditional control to text-to-image diffusion models," 2023, arXiv:2302.05543v2.
[93] W. Zhao, Y. Rao, Z. Liu, B. Liu, J. Zhou, and J. Lu, "Unleashing text-to-image diffusion models for visual perception," 2023, arXiv:2303.02153v1.
[94] J. Hu, X. Han, X. Yi, Y. Chen, W. Li, Z. Liu, and M. Sun, "Efficient cross-lingual transfer for Chinese stable diffusion with images as pivots," 2023, arXiv:2305.11540v1.
[95] F. Ye, G. Liu, X. Wu, and L. Wu, "AltDiffusion: A multilingual text-to-image diffusion model," 2023, arXiv:2308.09991v2.
[96] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," 2017, arXiv:1706.08500.
[97] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Proc. 30th Int. Conf. Neural Inf. Process. Syst. Red Hook, NY, USA: Curran Associates, 2016, pp. 2234–2242.
[98] A. Borji, "Pros and cons of GAN evaluation measures," Comput. Vis. Image Understand., vol. 179, pp. 41–65, Feb. 2018.
[99] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, "Learning transferable visual models from natural language supervision," in Proc. Mach. Learn. Res., vol. 139, 2021, pp. 8748–8763.
[100] C. Akkus, L. Chu, V. Djakovic, S. Jauch-Walser, P. Koch, G. Loss, C. Marquardt, M. Moldovan, N. Sauter, M. Schneider, R. Schulte, K. Urbanczyk, J. Goschenhofer, C. Heumann, R. Hvingelby, D. Schalk, and M. Aßenmacher, "Multimodal deep learning," 2023, arXiv:2301.04856v1.
[101] P. Wang. (2022). Dall-E 2—PyTorch. Accessed: Oct. 25, 2023. [Online]. Available: https://github.com/lucidrains/DALLE2-pytorch

SARAH K. ALHABEEB received the B.Sc. degree in information technology from the Department of Information Technology, College of Computer, Qassim University, Saudi Arabia, in May 2018, where she is currently pursuing the M.Sc. degree in information technology. Her research interests include machine learning, artificial intelligence, natural language processing, and the Internet of Things.

AMAL A. AL-SHARGABI received the master's and Ph.D. degrees from Universiti Teknologi Mara (UiTM), Malaysia. She is currently an Associate Professor with the College of Computer, Qassim University. She has been receiving a number of Qassim University's research grants, since 2018. Her research interests include program comprehension, empirical software engineering, and machine learning.