
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 12, DECEMBER 2023

Multimodal Image Synthesis and Editing: The Generative AI Era

Fangneng Zhan, Yingchen Yu, Rongliang Wu, Jiahui Zhang, Shijian Lu, Lingjie Liu, Adam Kortylewski, Christian Theobalt, and Eric Xing, Fellow, IEEE

(Survey Paper)

Abstract—As information exists in various modalities in the real world, effective interaction and fusion among multimodal information play a key role in the creation and perception of multimodal data in computer vision and deep learning research. With superb power in modeling the interaction among multimodal information, multimodal image synthesis and editing has become a hot research topic in recent years. Instead of providing explicit guidance for network training, multimodal guidance offers intuitive and flexible means for image synthesis and editing. On the other hand, this field also faces several challenges in the alignment of multimodal features, the synthesis of high-resolution images, faithful evaluation metrics, etc. In this survey, we comprehensively contextualize the advance of recent multimodal image synthesis and editing and formulate taxonomies according to data modalities and model types. We start with an introduction to different guidance modalities in image synthesis and editing, and then describe multimodal image synthesis and editing approaches extensively according to their model types. After that, we describe benchmark datasets and evaluation metrics as well as corresponding experimental results. Finally, we provide insights about the current research challenges and possible directions for future research.

Index Terms—Multimodality, image synthesis and editing, GANs, diffusion models, autoregressive models, NeRF.

Manuscript received 24 July 2022; revised 22 July 2023; accepted 10 August 2023. Date of publication 25 August 2023; date of current version 3 November 2023. The work of Fangneng Zhan, Lingjie Liu, and Christian Theobalt was supported by the ERC Consolidator grant 4DRepLy under Grant 770784. The work of Yingchen Yu, Rongliang Wu, Jiahui Zhang, and Shijian Lu was supported in part by the ERC Consolidator grant 4DRepLy under Grant 770784 and in part by the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative. The work of Adam Kortylewski was supported by the Emmy Noether Research Group funded by the German Science Foundation (DFG) under Grant 468670075. Recommended for acceptance by X. Bai. (Corresponding author: Shijian Lu.)
Fangneng Zhan is with the Max Planck Institute for Informatics, 66123 Saarbrücken, Germany, and also with S-Lab, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]).
Yingchen Yu, Rongliang Wu, Jiahui Zhang, and Shijian Lu are with Nanyang Technological University, Singapore 639798 (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).
Lingjie Liu, Adam Kortylewski, and Christian Theobalt are with the Max Planck Institute for Informatics, 66123 Saarbrücken, Germany (e-mail: [email protected]; [email protected]; [email protected]).
Eric Xing is with Carnegie Mellon University, Pittsburgh, PA 15213 USA, and also with the Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE (e-mail: [email protected]).
A project associated with this survey is available at https://github.com/fnzhan/Generative-AI.
This article has supplementary downloadable material available at https://doi.org/10.1109/TPAMI.2023.3305243, provided by the authors.
Digital Object Identifier 10.1109/TPAMI.2023.3305243

I. INTRODUCTION

HUMANS are naturally capable of imagining a scene according to a piece of visual, text, or audio description. However, the intuitive processes are less straightforward for deep neural networks, primarily due to an inherent modality gap. This modality gap for visual perception can be boiled down to the intra-modal gap between visual clues and real images, and the cross-modal gap between non-visual clues and real images. Aiming to mimic human imagination and creativity in the real world, the tasks of Multimodal Image Synthesis and Editing (MISE) provide profound insights into how deep neural networks correlate multimodal information with image attributes.

As a trending area, image synthesis and editing aim to create realistic images or edit real images with natural textures. In the last few years, the field has witnessed very impressive progress thanks to the advance of generative AI, especially deep generative models [1], [2], [3] and neural rendering [4]. To achieve controllable generation, a popular line of research focuses on generating and editing images conditioned on certain guidance, as illustrated in Fig. 1. Typically, visual clues, such as segmentation maps and sketch maps, have been widely adopted to guide image synthesis and editing [5], [6], [7]. Beyond the intra-modal guidance of visual clues, cross-modal guidance such as texts, audios, and scene graphs provides an alternative but often more intuitive and flexible way of expressing visual concepts. However, effective retrieval and fusion of heterogeneous information from data of different modalities present a substantial challenge in multimodal image synthesis and editing.

As a pioneering effort in multimodal image synthesis, [17] shows that a recurrent variational auto-encoder could generate novel visual scenes conditioned on image captions. The research of multimodal image synthesis is then significantly advanced with the prosperity of Generative Adversarial Networks (GANs) [1], [5], [6], [18], [19], [20] and diffusion models [3], [21], [22], [23], [24]. Originating from the Conditional GANs (CGANs) [18], a bunch of GANs and diffusion models [5], [6], [10], [25] have been developed to synthesize images from various multimodal signals by incorporating the multimodal guidance to condition the generation process. This conditional generation paradigm is relatively straightforward and is widely adopted in SOTA methods to yield unprecedented generation performance [6], [10], [25], [26], [27].
Fig. 1. Illustration of multimodal image synthesis and editing. Typical guidance types include visual information (e.g., semantic maps, scene layouts, sketch maps), text prompts, audio signals, scene graphs, brain signals, and mouse tracks. The samples are from [2], [8], [9], [10], [11], [12], [13], [14], [15], [16].

On the other hand, developing conditional models requires a cumbersome training process which usually involves high computational cost. Thus, another line of research resorts to pre-trained models for MISE, which can be achieved by manipulation in the GAN latent space via inversion [28], [29], [30], [31], by applying guidance functions [21], [32] to the diffusion process, or by adapting the latent space & embedding [33], [34], [35], [36] of diffusion models. Currently, a CNN architecture is still widely adopted in GANs and diffusion models, which hinders them from supporting diverse multimodal inputs in a unified manner. On the other hand, with the prevalence of the Transformer model [37], which naturally allows various multimodal inputs, impressive improvements have been made in the generation of data of different modalities, such as language [38], image [39], and audio [40]. These recent advances fueled by Transformers suggest a possible route for autoregressive models [41] in MISE by accommodating the long-range dependency of sequences. Notably, both multimodal guidance and images can be represented in a common form of discrete tokens. For instance, texts can be naturally denoted by token sequences; audio and visual guidance including images can be represented as token sequences [42]. With such a unified discrete representation, the correlation between multimodal guidance and images can be well accommodated via Transformer-based autoregressive models, which have pushed the boundary of MISE significantly [2], [43], [44].

Most aforementioned methods work for 2D images regardless of the 3D essence of the real world. With the recent advance of neural rendering, especially Neural Radiance Fields (NeRF) [4], 3D-aware image synthesis and editing have attracted increasing attention from the community. Distinct from synthesis and editing on 2D images, 3D-aware MISE poses a bigger challenge due to the lack of multi-view data and the requirement of multi-view consistency during synthesis and editing. As a remedy, pre-trained 2D foundation models (e.g., CLIP [45] and Stable Diffusion [46]) can be employed to drive the NeRF optimization for view synthesis and editing [11], [47]. Besides, generative models like GANs and diffusion models can be combined with NeRF to train 3D-aware generative models on 2D images, where MISE can be performed by developing conditional NeRFs or inverting NeRFs [48], [49].

The contributions of this survey can be summarized in the following aspects:
• This survey covers extensive literature with regard to multimodal image synthesis and editing with a rational and structured framework.
• We provide a foundation of the different types of guidance modality underlying multimodal image synthesis and editing tasks and elaborate the specifics of the encoding approaches associated with the guidance modalities.
• We develop a taxonomy of the recent approaches according to the essential models and highlight the major strengths and weaknesses of existing models.
• This survey provides an overview of various datasets and evaluation metrics in multimodal image synthesis and editing, and critically evaluates the performance of contemporary methods.
• We summarize the open challenges in the current research and share our humble opinions on promising areas and directions for future research.

The remainder of this survey is organized as follows. Section II presents the modality foundations of MISE. Section III provides a comprehensive overview and description of MISE methods with detailed pipelines. Section IV reviews the common datasets and evaluation metrics, with experimental results of typical methods. In Section V, we discuss the main challenges and future research directions for MISE. Some social impact analysis and concluding remarks are drawn in Sections VI and VII, respectively.

II. MODALITY FOUNDATIONS
Each source or form of information can be called a modality. For example, people have the senses of touch, hearing, sight, and smell; the medium of information includes voice, video, text, etc.; data are recorded by various sensors such as radar, infrared, and accelerometers. In terms of image synthesis and editing, we group the modality guidance into visual guidance, text guidance, audio guidance, and other modality guidance. A detailed description of each modality guidance together with related processing methods will be presented in the following subsections.

A. Visual Guidance

Visual guidance has drawn widespread interest in the field of MISE due to its inherent capacity to convey spatial and structural details. Notably, it encapsulates specific image properties in pixel space, thereby offering an exceptional degree of control. This property of visual guidance facilitates interactive manipulation and precise handling during image synthesis, which can be crucial for achieving desired outcomes. As a pixel-level guidance, it can be seamlessly integrated into the image generation process, underscoring its versatility and extensive use in various image synthesis contexts. Common types of visual guidance encompass segmentation maps [5], [6], keypoints [89], [90], [91], sketches & edges & scribbles [51], [92], [93], [94], [95], [96], [97], [98], [99], and scene layouts [100], [101], [102], [103], [104], as illustrated in Fig. 1. Besides, several studies investigate image synthesis conditioned on depth maps [2], [8], normal maps [8], trace maps [105], etc. The visual guidance can be obtained by employing pre-trained models (e.g., segmentation models, depth predictors, pose predictors), applying algorithms (e.g., Canny edges, Hough lines), or relying on manual effort (e.g., manual annotation, human scribbles). By modifying the visual guidance elements, like semantic maps, we can directly repurpose image synthesis techniques for various image editing tasks [106], [107], demonstrating the versatile applicability of visual guidance in the domain of MISE.

Visual Guidance Encoding: These visual cues, represented in 2D pixel space, can be interpreted as specific types of images, thereby permitting their direct encoding via numerous image encoding strategies such as Convolutional Neural Networks (CNNs) and Transformers. As the encoded features spatially align with image features, they can be smoothly integrated into networks via naive concatenation, SPADE [6], the cross-attention mechanism [46], etc.

B. Text Guidance

Compared with visual guidance, text guidance provides a more versatile and flexible way to express and describe visual concepts. This is because text can capture a wide range of ideas and details that may not be easily communicated through other means. Text descriptions can be ambiguous and open to interpretation, which is both a challenge and an opportunity. It is a challenge because it can lead to a wide array of possible images that accurately represent the text, making it harder to predict the outcome. However, it is also an opportunity because it allows for greater creativity and diversity in the resulting images. The text-to-image synthesis task [53], [108], [109], [110] aims to produce clear, photo-realistic images with high semantic relevance to the corresponding text guidance. Notably, text and images are different types of data, which makes it difficult to learn an accurate and reliable mapping from one to the other. Techniques for integrating text guidance, such as representation learning, play a crucial role in text-guided image synthesis and editing.

Text Guidance Encoding: Learning a faithful representation from a text description is a non-trivial task. There are a number of traditional text representations, such as Word2Vec [111] and Bag-of-Words [112]. With the prevalence of deep neural networks, Recurrent Neural Networks (RNNs) [109] and LSTMs [54] are widely adopted to encode texts as features [55]. With the development of pre-trained models in the natural language processing field, several studies [113], [114] also explore performing text encoding by leveraging large-scale pre-trained language models such as BERT [115]. Remarkably, with a large number of image-text pairs for training, Contrastive Language-Image Pre-training (CLIP) [45] yields informative text embeddings by learning the alignment of images and the corresponding captions, and has been widely adopted for text encoding.

C. Audio Guidance

Unlike text and visual guidance, audio guidance provides temporal information which can be utilized for generating dynamic or sequential visual content. The relationship between audio signals and images [116], [117], [118] is often more abstract compared to text or visual guidance. For instance, audio associated with certain actions or environments may suggest but not explicitly define visual content [119]; sound can carry emotional tone and nuanced context that is not always clear in text or visual inputs. Thus, audio-guided MISE offers the interesting challenge of interpreting audio signals into visual content. This involves understanding and modeling the complex correlations between sound and visual elements, which has been explored in talking-face generation [57], [59], [60], [120], whose goal is to create realistic animations of a face speaking given an audio input.

Audio Guidance Encoding: An audio sequence can be generated from given videos, where a deep convolutional network is employed to extract features from video screenshots, followed by an LSTM [121] that generates the audio waveform of the corresponding input video [122]. Besides, an input audio segment can also be represented by a sequence of features, which can be spectrograms, fBanks, Mel-Frequency Cepstral Coefficients (MFCCs), or the hidden layer outputs of the pre-trained SoundNet model [119]. In talking-face generation [123], Action Units (AUs) [124] have also been widely adopted to convert the driving audio into coherent visual signals.
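To make the encoding steps described above concrete, the following is a minimal sketch that embeds a caption and an image into CLIP's joint space and measures their cosine similarity. It assumes the openai `clip` package with its standard `clip.load`, `clip.tokenize`, `encode_text`, and `encode_image` interfaces; the model choice, prompt, and file path are illustrative placeholders rather than part of any method surveyed here.

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pre-trained CLIP encoders

tokens = clip.tokenize(["a photo of a red sports car"]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(tokens)                   # (1, 512) text embedding
    image = preprocess(Image.open("car.jpg")).unsqueeze(0).to(device)
    image_emb = model.encode_image(image)                  # (1, 512) image embedding

# Cosine similarity in the joint space; higher means better text-image alignment.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
print((text_emb * image_emb).sum(dim=-1).item())
```

For audio guidance, a similarly hedged sketch of the feature extraction mentioned above (mel spectrograms and MFCCs) using torchaudio; the file path and parameter values are placeholders, and any recurrent or attention-based condition encoder could consume the resulting frame sequence.

```python
import torchaudio
import torchaudio.transforms as T

waveform, sample_rate = torchaudio.load("speech.wav")        # (channels, samples)

# Common audio representations used as conditioning features.
mel = T.MelSpectrogram(sample_rate=sample_rate, n_mels=80)(waveform)
mfcc = T.MFCC(sample_rate=sample_rate, n_mfcc=40)(waveform)  # (channels, 40, frames)

features = mfcc.squeeze(0).transpose(0, 1)                   # (frames, 40) time-major sequence
print(features.shape)
```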


TABLE I: Strength and weakness of different model types for MISE tasks.

D. Other Modality Guidance

Several other types of guidance have also been investigated to guide multimodal image synthesis and editing.

Scene Graph: Scene graphs represent scenes as directed graphs, where nodes are objects and edges give relationships between objects. Image generation conditioned on scene graphs allows reasoning about explicit object relationships and synthesizing faithful images with complex scene relationships. The guiding scene graph can be encoded through a graph convolution network [125] which predicts object bounding boxes to yield a scene layout. For instance, Vo et al. [126] propose to predict relation units between objects, which are converted to a visual layout via a convolutional LSTM [127].

Brain Signal: Treating brain signals as a modality to synthesize or reconstruct visual images offers an exciting way to understand brain activity and facilitate brain-computer interfaces. Recently, several studies explore generating images from functional magnetic resonance imaging (fMRI). For example, Fang et al. [128] decode shape and semantic representations from the visual cortex, and then fuse them to generate images via a GAN; Lin et al. [129] propose to map fMRI signals into the latent space of a pretrained StyleGAN to enable conditional generation; Takagi and Nishimoto [130] quantitatively interpret each component in a pretrained LDM [46] by mapping them into distinct brain regions.

Mouse Track: To achieve precise and flexible manipulation of image content, the mouse track [16] has recently emerged as a remarkable guidance in MISE. Specifically, users can select a set of 'handle points' and 'target points' within an image by simply clicking the mouse. The objective is to edit the image by steering these handle points to their respective target points. This innovative approach of mouse track guidance enables an image to be deformed with an impressive level of accuracy, and facilitates manipulation of various attributes such as pose, shape, and expression across a range of categories. The point motion can be integrated to supervise the editing via a pre-trained transformer based on optical flow [131], [132] or a shifted patch loss on the generator features [16].

III. METHODS

We broadly categorize the methods for MISE into five categories: GAN-based methods (Section III-A), autoregressive methods (Section III-C), diffusion-based methods (Section III-B), NeRF-based methods (Section III-D), and other methods (Section III-E). We briefly summarize the strengths and weaknesses of the four main method types with representative references as shown in Table I. In this section, we first discuss the GAN-based methods, which generally rely on GANs and their inversion. We then discuss the prevailing diffusion-based methods and autoregressive methods comprehensively. After that, we introduce NeRF for the challenging task of 3D-aware MISE. Later, we present several other methods for image synthesis and editing under the context of multimodal guidance. Finally, we compare and discuss the strengths and weaknesses of different generation architectures.

A. GAN-Based Methods

GAN-based methods have been widely adopted for various MISE tasks by either developing conditional GANs (Section III-A1) or leveraging pre-trained unconditional GANs (Section III-A2). For conditional GANs, the multimodal condition can be directly incorporated into the generator to guide the generation process. For pre-trained unconditional GANs, GAN inversion is usually employed to perform various MISE tasks by operating latent codes in latent spaces.

1) Conditional GANs: Conditional Generative Adversarial Networks (CGANs) [18] are extensions of the popular GAN architecture which allow for image generation with specific characteristics or attributes. The key idea behind CGANs is to condition the generation process on additional information, such as the multimodal guidance in MISE tasks. This is achieved by feeding the additional information into both the generator and discriminator networks as extra guidance. The generator then learns to generate samples that not only fool the discriminator but also match the specified conditional information. In recent years, a range of designs have significantly boosted the performance of CGANs for MISE [110].¹

¹ Please refer to [110] for a detailed review of GAN-based text-to-image generation.
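As a minimal illustration of the conditioning idea described above (feeding the guidance into both the generator and the discriminator), the sketch below concatenates a condition embedding with the noise vector and with the discriminator input. It is a toy fully-connected CGAN written for clarity, not the architecture of any particular paper; the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, noise_dim=128, cond_dim=64, img_dim=64 * 64 * 3):
        super().__init__()
        # The condition embedding is concatenated with the noise vector.
        self.net = nn.Sequential(
            nn.Linear(noise_dim + cond_dim, 512), nn.ReLU(),
            nn.Linear(512, img_dim), nn.Tanh(),
        )

    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=1))

class ConditionalDiscriminator(nn.Module):
    def __init__(self, cond_dim=64, img_dim=64 * 64 * 3):
        super().__init__()
        # The discriminator also sees the condition, so it scores (image, condition) pairs.
        self.net = nn.Sequential(
            nn.Linear(img_dim + cond_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1),
        )

    def forward(self, img, cond):
        return self.net(torch.cat([img, cond], dim=1))

# Usage: cond could be an encoded text, audio, or layout embedding.
G, D = ConditionalGenerator(), ConditionalDiscriminator()
z, cond = torch.randn(4, 128), torch.randn(4, 64)   # placeholder condition embedding
fake = G(z, cond)
score = D(fake, cond)   # provides both adversarial and condition-matching signal
```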

Condition Incorporation: To steer the generation process, it is necessary to incorporate multimodal conditions into the network effectively, as shown in Fig. 2. Generally, multimodal guidance can be uniformly encoded as 1-D features which can be concatenated with the features in the networks [18], [59], [109]. For visual guidance that is spatially aligned with the target image, the condition can be directly encoded as 2D features which provide accurate spatial guidance for generation or editing [5]. However, the encoded 2D features struggle to capture complex scene structural relationships between the guidance and real images when there exist very different views or severe deformations. Under such circumstances, an attention module can be employed to align the guidance with the target image as in [133], [134], [135]. Moreover, naively encoding the visual guidance with deep networks is suboptimal as part of the guidance information tends to be lost in normalization layers. Thus, spatially-adaptive de-normalization (SPADE) [6] is introduced to inject the guided feature effectively, which is further extended to a semantic region-adaptive normalization [136] to achieve region-wise condition incorporation. Besides, by assessing the similarity between generated images and conditions, an attentional incorporation mechanism [54], [137], [138], [139] can be employed to direct the generator's attention to particular image regions during generation, which is particularly advantageous when dealing with complex conditional information, such as texts. Notably, complex conditions can also be mapped to an intermediary representation which facilitates more faithful image generation, e.g., an audio clip can be mapped to facial landmarks [58], [120] or 3DMM parameters [140] for talking-face generation. For sequential conditions such as audios [20], [59], [120], [123], [141], [142], [143], [144], a recurrent condition incorporation mechanism is also widely adopted to account for temporal dependency such that smooth transitions can be achieved over sequential conditions.

Fig. 2. Illustration of the conditional GAN framework with different condition incorporation mechanisms.

Model Structure: Conditional generation of high-resolution images with fine details is challenging and computationally expensive for GANs. Coarse-to-fine structures [50], [53], [99], [108], [146] help address these issues by gradually refining the generated images or features from low resolutions to high resolutions. By generating coarse images or features first and then refining them, the generator network can focus on capturing the overall structure of the image before moving on to the fine details, which leads to more efficient training and higher generation quality. Not only generators but also many discriminator networks [50], [147] operate at multiple levels of resolution to efficiently differentiate high-resolution images and avoid potential overfitting. On the other hand, as a scene can be depicted with diverse linguistic expressions, generating images with consistent semantics regardless of the expression variants presents a significant challenge. Multiple pieces of research employ a siamese structure with two generation branches to facilitate the semantic alignment. With a pair of conditions for the two branches, a contrastive loss can be adopted to minimize the distance between positive pairs (two text prompts describing the same scene) and maximize the distance between negative pairs (two prompts describing different scenes) [137], [148], [149]. Besides, an intra-domain transformation loss [150] can also be employed in the siamese structure to preserve key characteristics during generation. Beyond the above structures, a cycle structure has also been explored in a series of conditional GANs to preserve key information in the generation process. Specifically, some research [55], [151], [152], [153], [154] explores passing the generated images through an inverse network to yield the conditional input, which imposes a cycle-consistency of the conditional input. The inverse network varies for different conditional inputs, e.g., image captioning models [55], [155] for text guidance, and generation networks for visual guidance.

Loss Design: Besides the inherent adversarial loss in GANs, various other loss terms have been explored to achieve high-fidelity generation or faithful conditional generation. For conditional input that is spatially aligned with the ground-truth image, it has been proved that a perceptual loss [156] is able to boost the generation quality significantly [157], by minimizing the distance of perceptual features between generated images and the ground truth. Besides, associated with the cycle structure described previously, a cycle-consistency loss [151] is duly imposed to enforce condition consistency. However, the cycle-consistency loss is too restrictive for conditional generation as it assumes a bijective relationship between two domains. Thus, some efforts [150], [158], [159] have been devoted to exploring one-way translation and bypassing the bijection constraint of cycle-consistency. With the emergence of contrastive learning, several studies explore maximizing the mutual information of positive pairs via noise contrastive estimation [160] for the preservation of contents in unpaired image generation from visual guidance [161], [162] or text-to-image generation [163]. Besides the contrastive loss, a triplet loss has also been employed to improve the condition consistency for cross-modal guidance like texts [148].

2) Inversion of Unconditional GANs: Large scale GANs [145], [164] have achieved remarkable progress in unconditional image synthesis with high resolution and high fidelity. With a pre-trained GAN model, a series of studies explore inverting a given image back into the latent space of the GAN, which is termed GAN inversion [30].² Specifically, a pre-trained GAN learns a mapping from latent codes to real images, while GAN inversion maps images back to latent codes, which is achieved by feeding the latent code into the pre-trained GAN to reconstruct the image through optimization. Typically, the reconstruction metrics are based on ℓ1, ℓ2, perceptual [156] loss, or LPIPS [165]. Certain constraints on face identity [166] or latent codes [31] could also be included during optimization. With the obtained latent codes, we can faithfully reconstruct the original image and conduct realistic image manipulation in the latent space. In terms of MISE, cross-modal image manipulation can be achieved by manipulating or generating latent codes according to the guidance from other modalities.

² Please refer to [30] for a comprehensive review of GAN inversion.
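A minimal sketch of the optimization-based GAN inversion described above: a latent code is optimized so that a frozen generator reproduces the target image under pixel (ℓ2) and LPIPS losses. The `generator` argument is a placeholder for any pre-trained model (e.g., a StyleGAN generator), and the `lpips` package is assumed for the perceptual term; identity or latent-code priors could be added as extra terms.

```python
import torch
import lpips  # perceptual similarity metric (LPIPS)

def invert(generator, target, latent_dim=512, steps=500, lr=0.05):
    """Optimize a latent code w so that generator(w) reconstructs `target`."""
    device = target.device
    percep = lpips.LPIPS(net='vgg').to(device)
    w = torch.randn(1, latent_dim, device=device, requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)

    for _ in range(steps):
        recon = generator(w)                       # frozen pre-trained generator
        loss_l2 = torch.nn.functional.mse_loss(recon, target)
        loss_lp = percep(recon, target).mean()     # perceptual (LPIPS) distance
        loss = loss_l2 + loss_lp                   # optional identity/latent priors go here
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()  # latent code for reconstruction or cross-modal editing
```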

Fig. 3. Architectures of GAN inversion for MISE, including (a) cross-modal alignment [28] and (b) cross-modal supervision [29]. Cross-modal alignment embeds both images and conditions into the latent space of a GAN (e.g., StyleGAN [145]), aiming to pull their embeddings closer; image and condition embeddings can then be mixed to perform multimodal image generation or editing. Cross-modal supervision inverts the source image into a latent code and trains a mapper network to produce residuals that are added to the latent code to yield the target code, from which a pre-trained StyleGAN generates an image assessed by the CLIP and identity losses. The figure is reproduced based on [28] and [29].

Explicit Cross-Modal Alignment: One direction of leveraging the guidance from other modalities is to map the embeddings of images and cross-modal inputs (e.g., semantic maps, texts) into a common embedding space [28], [167], as shown in Fig. 3(a). For example, TediGAN [28] trains an encoder for each modality to extract the embeddings and applies a similarity loss to map them into the latent space. Afterwards, latent manipulation (e.g., latent mixing [28]) could be performed to edit the image latent codes toward the embeddings of other modalities and achieve cross-modal image manipulation. However, mapping multimodal data into a common space is non-trivial due to the heterogeneity across different modalities, which can result in inferior and unfaithful image generation.

Implicit Cross-Modal Supervision: Instead of explicitly projecting the guidance modality into the latent space, another line of research aims to guide the synthesis or editing by defining a consistency loss between the generation results and the guiding modality. For instance, Jiang et al. [168] propose to optimize image latent codes through a pre-trained fine-grained attribute predictor, which can examine the consistency of the edited image and the text description. However, the attribute predictor is specifically designed for face editing with fine-grained attribute annotations, making it hard to generalize to other scenarios. A recently released large-scale pretrained model, Contrastive Language-Image Pre-training (CLIP) [45], has demonstrated great potential in multimodal synthesis and manipulation [29], [43]; it learns joint vision-language representations from over 400 M text-image pairs via contrastive learning. On the strength of the powerful pre-trained CLIP, Bau et al. [169] define a CLIP-based semantic consistency loss to optimize latent codes inside an inpainting region to align the recovered content with the given text. Similarly, StyleCLIP [29] and StyleMC [170] employ the cosine similarity between CLIP representations to supervise the text-guided manipulation, as illustrated in Fig. 3(b). A known issue of the standard CLIP loss is the adversarial solution [171], where the model tends to fool the CLIP classifier by adding meaningless pixel-level perturbations to the image. To this end, Liu et al. propose the AugCLIP score [171] to robustify the standard CLIP score; StyleGAN-NADA [172] presents a directional CLIP loss to align the CLIP-space directions between the source and target text-image pairs. It also directly finetunes the pretrained generative model with text conditions for domain adaptation. Moreover, Yu et al. [173] introduce a CLIP-based contrastive loss for robust optimization and counterfactual image manipulation.

B. Diffusion-Based Methods

Recently, diffusion models such as denoising diffusion probabilistic models (DDPMs) [3], [174] have achieved great successes in generative image modeling [3], [22], [23], [24]. DDPMs are a type of latent variable model that consists of a forward diffusion process and a reverse diffusion process. The forward process is a Markov chain where noise is gradually added to the data when sequentially sampling the latent variables $x_t$ for $t = 1, \ldots, T$. Each step in the forward process is a Gaussian transition $q(x_t \mid x_{t-1}) := \mathcal{N}(\sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$, where $\{\beta_t\}_{t=0}^{T}$ is a fixed or learned variance schedule. The reverse process $q(x_{t-1} \mid x_t)$ is parameterized by another Gaussian transition $p_\theta(x_{t-1} \mid x_t) := \mathcal{N}(x_{t-1}; \mu_\theta(x_t), \sigma_t^2 I)$. $\mu_\theta(x_t)$ can be decomposed into a linear combination of $x_t$ and a noise approximation model $\epsilon_\theta(x_t, t)$ that can be learned through optimization. After training $\epsilon_\theta(x, t)$, the sampling process of DDPM can be achieved by following the reverse diffusion process.

Song et al. [22] propose an alternative non-Markovian noising process that has the same forward marginals as DDPM but allows using different samplers by changing the variance of the noise. Especially, by setting the noise to 0, which yields the DDIM sampling process [22], the sampling becomes deterministic, enabling full inversion of the latent variables into the original images with significantly fewer steps [21], [22]. Notably, the latest work [21] has demonstrated even higher image synthesis quality compared to variational autoencoders (VAEs) [175], flow models [176], [177], autoregressive models [178], [179], and GANs [1], [145]. To achieve image generation and editing conditioned on provided guidance, leveraging pre-trained models [32] (via a guidance function or fine-tuning) and training conditional models from scratch [46] are both extensively studied in the literature. A downside of the guidance function method lies in the requirement of an additional guidance model, which leads to a complicated training pipeline.
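To make the forward and reverse processes above concrete, here is a minimal sketch of DDPM noising and a single reverse step under the standard parameterization $\bar{\alpha}_t = \prod_{s \le t}(1-\beta_s)$; the noise-prediction network `eps_model` is a placeholder, and the linear schedule values are illustrative only.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # fixed variance schedule {beta_t}
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)        # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def forward_noise(x0, t):
    """Sample x_t ~ q(x_t | x_0) in closed form; eps is the regression target."""
    eps = torch.randn_like(x0)
    a_bar = alpha_bars[t]
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps
    return x_t, eps

@torch.no_grad()
def reverse_step(eps_model, x_t, t):
    """One ancestral sampling step x_t -> x_{t-1} using the predicted noise epsilon_theta."""
    beta, alpha, a_bar = betas[t], alphas[t], alpha_bars[t]
    eps_hat = eps_model(x_t, t)                                        # epsilon_theta(x_t, t)
    mean = (x_t - beta / (1 - a_bar).sqrt() * eps_hat) / alpha.sqrt()  # mu_theta(x_t)
    if t == 0:
        return mean
    return mean + beta.sqrt() * torch.randn_like(x_t)                  # sigma_t^2 = beta_t variant
```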
Recently, Ho et al. [27] achieve compelling results without a separate guidance model by using a form of guidance that interpolates between predictions from a diffusion model with and without labels. GLIDE [180] compares a CLIP-guided diffusion model and a conditional diffusion model on the text-to-image synthesis task, and concludes that training a conditional diffusion model yields better generation performance.

1) Conditional Diffusion Models: To launch MISE tasks, a conditional diffusion model can be formulated by directly integrating the condition information into the denoising process. Recently, the performance of conditional diffusion models has been significantly pushed forward by a series of designs.

Condition Incorporation: As a common framework, a condition-specific encoder is usually employed to project the multimodal condition into embedding vectors, which are further incorporated into the model as shown in Fig. 4. The condition-specific encoder can be learned along with the model or directly borrowed from pre-trained models. Typically, CLIP is a common choice for text embedding, as adopted in DALL-E 2 [25]. Besides, generic large language models (e.g., T5 [182]) pre-trained on text corpora also show remarkable effectiveness at encoding text for image synthesis, as validated in Imagen [10]. With the condition embedding, diverse mechanisms can be adopted to incorporate it into diffusion models. Specifically, the condition embedding can be naively concatenated or added to the diffusion timestep embedding [21], [183]. In LDM [46], the condition embedding is mapped to the intermediate layers of the diffusion model via a cross-attention mechanism. Imagen [10] further compares mean pooling and attention pooling with the cross-attention mechanism and observes that both pooling mechanisms perform significantly worse. To fully leverage the conditional information for semantic image synthesis, Wang et al. [61] propose to incorporate visual guidance via a spatially-adaptive normalization, which improves both the quality and semantic coherence of generated images. Instead of incorporating the condition to train diffusion models from scratch, ControlNet [8] aims to incorporate the condition into a pre-trained diffusion model for controllable generation. To preserve the production-ready weights of pre-trained models for fast convergence, a 'zero convolution' is designed to incorporate the guidance, where the convolution weights are gradually learned from zeros to optimized parameters.

Fig. 4. Overall framework of the conditional diffusion model. With a certain model for latent representation, the diffusion process models the latent space by reversing a forward diffusion process conditioned on certain guidance (e.g., semantic map, depth map, and texts). The image is reproduced based on [46].

Latent Diffusion: To enable diffusion model training on limited computational resources while retaining quality and flexibility, several works explore conducting the diffusion process in learned latent spaces [46], as shown in Fig. 4. Typically, an autoencoding model can be employed to learn a latent space that is perceptually equivalent to the image space. On the other hand, the learned latent spaces may be accompanied by undesired high variance, which highlights the need for latent space regularization. As a common choice, a KL divergence can be applied to regularize the latent space towards a standard normal distribution. Alternatively, vector quantization can also be applied for regularization via a VQGAN [2] variant with an absorbed quantization layer as in [46]. Besides, VQGAN can directly learn a discrete latent space (the quantization layer is not absorbed), which can be modeled by a discrete diffusion process as in VQ-Diffusion [184]. Tang et al. [185] further improve VQ-Diffusion by introducing a high-quality inference strategy to alleviate the joint distribution issue.

Model Architecture: Ho et al. [3] introduced a U-Net architecture for diffusion models, which can incorporate the inductive bias of CNNs into the diffusion process. This U-Net architecture is further improved by a series of designs, including attention configuration [21], residual blocks for upsampling and downsampling activations [23], and adaptive group normalization [21]. Although the U-Net structure is widely adopted in SOTA diffusion models, Chahal [186] shows that a Transformer-based LDM [46] can yield comparable performance to a U-Net-based LDM [46], accompanied by a natural multimodal condition incorporation via multi-head attention. Nevertheless, such a Transformer architecture is more favored under the setting of a discrete latent space as in [184], [187]. On the other hand, instead of directly generating final images, DALL-E 2 [25] proposes a two-stage structure by producing intermediate image embeddings from text in the CLIP latent space. Then, the image embeddings are applied to condition a diffusion model to generate the final images, which allows improving the diversity of generated images [25]. Besides, some other architectures are also explored, including the compositional architecture [188] which generates an image by composing a set of diffusion models, the multi-diffusion architecture [189] which is composed of multiple diffusion processes with shared parameters or constraints, the retrieval-based diffusion model [190] which alleviates the high computational cost, etc.

2) Pre-Trained Diffusion Models: Rather than expensively re-training diffusion models, another line of research resorts to guiding the denoising process with proper supervision, or fine-tuning the model at a lower cost, as shown in Fig. 5.
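As a concrete illustration of the cross-attention condition incorporation described above (as used in LDM-style models), the sketch below lets flattened image features attend to a sequence of condition embeddings (e.g., text tokens). It is a single standalone block with illustrative dimensions, not an excerpt from any released implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Image features attend to condition embeddings (queries from image, keys/values from condition)."""
    def __init__(self, img_dim=320, cond_dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=img_dim, num_heads=heads,
                                          kdim=cond_dim, vdim=cond_dim, batch_first=True)
        self.norm = nn.LayerNorm(img_dim)

    def forward(self, img_feat, cond_emb):
        # img_feat: (B, H*W, img_dim) flattened spatial features from a U-Net block
        # cond_emb: (B, L, cond_dim) condition tokens, e.g., CLIP/T5 text embeddings
        attended, _ = self.attn(query=self.norm(img_feat), key=cond_emb, value=cond_emb)
        return img_feat + attended      # residual injection of the condition

# Usage: a 16x16 feature map conditioned on 77 text tokens.
block = CrossAttentionBlock()
out = block(torch.randn(2, 16 * 16, 320), torch.randn(2, 77, 768))
```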

Fig. 5. Typical frameworks of pre-trained diffusion models for MISE tasks, including the guidance function method and the fine-tuning method. The figure is reproduced based on [32] and [181].

Guidance Function Method: As an early exploration, Dhariwal et al. [21] augment pre-trained diffusion models with classifier guidance, which can be extended to achieve conditional generation with various guidance. Specifically, the reverse process $p(x_{t-1} \mid x_t)$ with guidance can be rewritten as $p(x_{t-1} \mid x_t, y)$, where $y$ is the provided guidance. Following the derivation in [21], the final diffusion sampling process can be rewritten as

$x_{t-1} = \mu_\theta(x_t) + \sigma_t^2 \nabla_{x_t} \log p(y \mid x_t) + \sigma_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).$   (1)

$F(x_t, y) = \log p(y \mid x_t)$ (dubbed the guidance function) indicates the consistency between $x_t$ and the guidance $y$, which can be formulated by a certain similarity metric [32] such as cosine similarity or L2 distance. As the similarity is usually computed in a feature space, a pre-trained CLIP can be adopted as the image encoder and condition encoder for text guidance, as shown in Fig. 5(a). However, the image encoder will take noisy images as input while CLIP is trained on clean images. Thus, a self-supervised fine-tuning of CLIP can be performed to force an alignment between features extracted from clean and noised images as in [32].

To control the generation consistency with the guidance, a parameter γ can be introduced to scale the guidance gradients as below:

$x_{t-1} = \mu_\theta(x_t) + \sigma_t^2 \gamma \nabla_{x_t} \log p(y \mid x_t) + \sigma_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).$   (2)

Apparently, the model will focus more on the modes of the guidance with a larger gradient scale γ. As a result, γ is positively correlated with the generation consistency (with the guidance), while it is negatively correlated with the generation diversity [21]. Besides, to achieve local guidance for image editing, a blended diffusion mechanism [191] can be employed by spatially blending the noised image with the locally guided diffusion latent at progressive noise levels.
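A minimal sketch of the scaled guidance step in (2): the gradient of a guidance function (a placeholder CLIP-style similarity `guidance_fn`) is added to the reverse mean, scaled by γ. The `mu_sigma` helper standing in for the pre-trained model's $\mu_\theta(x_t)$ and $\sigma_t$ is also hypothetical.

```python
import torch

def guided_reverse_step(mu_sigma, guidance_fn, x_t, t, y, gamma=3.0):
    """One reverse step x_t -> x_{t-1} with gradient-based guidance, following Eq. (2)."""
    mu, sigma = mu_sigma(x_t, t)                 # mu_theta(x_t), sigma_t from a frozen diffusion model

    # Gradient of the guidance function F(x_t, y) = log p(y | x_t) w.r.t. x_t.
    x_in = x_t.detach().requires_grad_(True)
    score = guidance_fn(x_in, y)                 # e.g., CLIP similarity between x_in and text y
    grad = torch.autograd.grad(score.sum(), x_in)[0]

    # Larger gamma -> stronger consistency with the guidance, lower diversity.
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mu + (sigma ** 2) * gamma * grad + sigma * noise
```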
Fine-Tuning Method: In terms of fine-tuning, MISE can be achieved by modifying the latent code or adapting the pre-trained diffusion models, as shown in Fig. 5(b). To adapt unconditional pre-trained models for text-guided editing, the input image is first converted to the latent space via the forward diffusion process. The diffusion model on the reverse path is then fine-tuned to generate images driven by the target text and the CLIP loss [33]. For pre-trained conditional models (typically conditioned on texts), similar to GAN inversion, a text latent embedding or a diffusion model can be fine-tuned to reconstruct a few images (or objects) [35], [36] faithfully. Then the obtained text embedding or fine-tuned model can be applied to generate the same object in novel contexts. However, these methods [35], [36] usually drastically change the layout of the original images. Observing that the crux of the relationship between image spatial layout and each word lies in the cross-attention layers, Prompt-to-Prompt [34] proposes to preserve some content from the original image by manipulating the cross-attention maps. Alternatively, taking advantage of the step-by-step diffusion sampling process, a model fine-tuned for image reconstruction can be utilized to provide score guidance for content and structure preservation at the early stage of the denoising process [192]. A similar approach is adopted in [181] by fine-tuning the diffusion model and optimizing the text embedding via image reconstruction, which allows preserving contents via text embedding interpolation.

C. Autoregressive Methods

Fueled by the advance of GPT [38] in natural language modeling, autoregressive models have been successfully applied to image generation [39] by treating the flattened image sequences as discrete tokens. The plausibility of the generated images demonstrates that autoregressive models are able to accommodate the spatial relationships between pixels and high-level attributes. Compared with CNNs, Transformer models naturally support various multimodal inputs in a unified manner, and a series of studies have been proposed to explore multimodal image synthesis with Transformer-based autoregressive models [2], [44], [69], [194]. Overall, the pipeline of autoregressive models for MISE consists of a vector quantization [42], [195] stage to yield a unified discrete representation and achieve data compression, and an autoregressive modeling stage which establishes the dependency between discrete tokens in a raster-scan order, as illustrated in Fig. 6.

1) Vector Quantization: Directly treating all image pixels as a sequence for autoregressive modeling with a Transformer is expensive in terms of memory consumption, as the self-attention mechanism in Transformers incurs quadratic memory cost. Thus, a compressed and discrete representation of the image is essential for autoregressive image synthesis and editing. A k-means method to cluster RGB pixel values has been adopted in [39] to reduce the input dimensionality. However, k-means clustering only reduces the dimensionality while the sequence length remains unchanged. Thus, the autoregressive model still cannot be scaled to higher resolutions, due to the quadratically increasing cost in sequence length. To this end, the Vector Quantised VAE (VQ-VAE) [42] is adopted to learn a discrete and compressed image representation.

Fig. 6. Typical framework of autoregressive methods for MISE tasks. A quantization stage is first performed to learn a discrete and compressed representation by reconstructing the original image or condition (e.g., semantic map) faithfully via VQ-GAN [2], [42], followed by an autoregressive modeling stage to capture the dependency of the discrete sequence. The image is reproduced based on [2] and [193].

VQ-VAE consists of an encoder, a feature quantizer, and a decoder. The image is fed into the encoder to learn a continuous representation, which is quantized via the feature quantizer by assigning the feature to the nearest codebook entry. Then the decoder reconstructs the original image from the quantized feature, driving the model to learn a faithful discrete image representation. As assigning a codebook entry is not differentiable, a reparameterization trick [42], [196] is usually adopted to approximate the gradient. Targeting superior discrete image representations, a series of efforts [2], [197], [198] have been devoted to improving VQ-VAE in terms of loss function, model architecture, codebook utilization, and learning regularization.

Loss Function: To achieve desirable perceptual quality for reconstructed images, an adversarial loss and a perceptual loss [156], [199], [200] (with a pre-trained VGG) can be incorporated for image reconstruction. With the extra adversarial loss and perceptual loss, the image quality is clearly improved compared with the original pixel loss, as validated in [2]. Besides a pre-trained VGG for computing the perceptual loss, a vision Transformer [201] from self-supervised learning [115], [202] is also proved to work well for calculating the perceptual loss. Besides, to emphasize reconstruction quality in certain regions, a feature-matching loss can be employed over the activations of certain pre-trained models, e.g., a face-embedding network [203], which can improve the reconstruction quality of the face region.
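The sketch below illustrates the nearest-codebook assignment and the straight-through gradient approximation mentioned above; it is a bare-bones vector quantizer with the standard VQ-VAE codebook and commitment losses, not the exact VQ-VAE/VQGAN implementation, and the codebook size and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=1024, code_dim=256, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.beta = beta

    def forward(self, z):                          # z: (B, N, code_dim) encoder features
        # Nearest codebook entry (argmin over pairwise distances).
        codes = self.codebook.weight
        d = torch.cdist(z, codes.unsqueeze(0).expand(z.size(0), -1, -1))  # (B, N, num_codes)
        idx = d.argmin(dim=-1)                     # discrete token indices
        z_q = self.codebook(idx)                   # quantized features

        # Codebook and commitment losses from VQ-VAE [42].
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())

        # Straight-through estimator: copy gradients from z_q back to z.
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss
```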
Network Architecture: A convolutional neural network is the common structure to learn the discrete image representation in VQ-VAE. Recently, Yu et al. [197] replace the convolution-based structure with a Vision Transformer (ViT) [204], which is shown to be less constrained by the inductive priors imposed by convolutions and is able to yield better computational efficiency with higher reconstruction quality. With the emergence of diffusion models, a diffusion-based decoder [205] has also been explored to learn discrete image representations with superior reconstruction quality. On the other hand, a multi-scale quantization structure is proved to promote the generation performance by including both low-level pixels and high-level tokens [206] or hierarchical latent codes [207]. To further reduce the computational costs, a residual quantization [208] can be employed to recursively quantize the image as a stacked map of discrete tokens.

Codebook Utilization: The vanilla VQ-VAE with the argmin operation (to get the nearest codebook entry) suffers from severe codebook collapse, i.e., only a few codebook entries are effectively utilized for quantization [209]. To alleviate the codebook collapse, vq-wav2vec [210] introduces Gumbel-Softmax [211] to replace argmin for quantization. The Gumbel-Softmax allows sampling discrete representations in a differentiable way through the straight-through gradient estimator [196], which boosts the codebook utilization significantly. ViT-VQGAN [197] also presents a factorized code architecture which introduces a linear projection from the encoder output to a low-dimensional latent variable space for code index lookup and boosts the codebook usage substantially.

Learning Regularization: Recent work [198] validates that the vanilla VQ-VAE does not satisfy translation equivariance during quantization, resulting in degraded performance for text-to-image generation. A simple but effective TE-VQGAN [198] is thus proposed to achieve translation equivariance by regularizing orthogonality in the codebook embeddings. To regularize the latent structure of heterogeneous domain data in conditional generation, Zhan et al. [193] design an Integrated Quantization VAE (IQ-VAE) to penalize the inter-domain discrepancy with intra-domain variations.

2) Autoregressive Modeling: Autoregressive (AR) modeling is a representative paradigm to accommodate sequence dependencies, complying with the chain rule of probability. The probability of each token in the sequence is conditioned on all previous predictions, yielding a joint distribution of sequences as the product of conditional distributions: $p(x) = \prod_{t=1}^{n} p(x_t \mid x_1, x_2, \ldots, x_{t-1}) = \prod_{t=1}^{n} p(x_t \mid x_{<t})$. During inference, each token is predicted autoregressively in a raster-scan order. Notably, a sliding-window strategy [2] can be employed to reduce the cost during inference by only utilizing the predictions within a local window. A top-k sampling strategy is adopted to randomly sample from the k most likely next tokens, which naturally enables diverse sampling results. The predicted tokens are then concatenated with the previous sequence as the condition for the prediction of the next token. This process repeats iteratively until all the tokens are sampled.

of detail. In MISE tasks, autoregressive models generate images color and density of a 3D scene with neural fields. Specifically,
pixel-by-pixel based on a conditional probability distribution a fully-connected neural network is adopted in NeRF, by taking a
that takes into account both the previously generated pixels and spatial location (x, y, z) with the corresponding viewing direction
the given conditioning information, which allows the models to (θ, φ)) as input, and the volume density with the corresponding
capture the complex dependencies to yield visually consistent emitted radiance as output. To render 2D images from the
images. In recent years, autoregressive models for MISE have implicit 3D representation, differentiable volume rendering is
been largely fueled by series of designs to be introduced below. performed with a numerical integrator [4] to approximate the
Network Architecture: Early autoregressive models for image intractable volumetric projection integral. Powered by NeRF for
generation usually adopt PixelCNN [213] which struggle in 3D scene representation, 3D-aware MISE can be achieved with
modeling long term relationships within an image due to the per-scene NeRF or generative NeRF frameworks.
limited receptive field. With the prevailing of Transformer [37], 1) Per-Scene NeRF: Consistent with the original NeRF
Transformer-based autoregressive models [214] emerge with model, a per-scene NeRF aims to optimize and represent a single
enhanced receptive field which allows sequentially predicting scene supervised by images or certain pre-trained models.
each pixel conditioned on previous prediction results. To explore Image Supervision: With paired guidance and corresponding
the limits of autoregressive text-to-image synthesis, Parti [71] view images, a NeRF can be naively trained conditioned on
scales the parameter size of Transformer up to 20B, yielding the guidance to achieve MISE. For instance, AD-NeRF [13]
consistent quality improvements in terms of image quality and achieves high-fidelity talking-head synthesis by training neural
text-image alignment. Instead of unidirectionally modeling from radiance fields on a video sequence with the audio track of
condition to image, a bi-directional architecture is also explored one target person. Instead of bridging audio inputs and video
in text-to-image synthesis [215], [216], which generates both outputs based on the intermediate representations, AD-NeRF
diverse captions and images. directly feeds the audio features into an implicit function to
Bidirectional Context: On the other hand, previous methods incorporate image context in a raster-scan order by attending only to previously generated results. This strategy is unidirectional and suffers from sequential bias, as it disregards much contextual information until the autoregression is nearly complete. It also ignores contextual information at different scales, as it processes the image at a single scale only. Grounded in the above observations, ImageBART [194] presents a coarse-to-fine approach in a unified framework that addresses the unidirectional bias of autoregressive modeling and the corresponding exposure bias. Specifically, a diffusion process is applied to successively eliminate information, yielding a hierarchy of representations that is further compressed via a multinomial diffusion process [174], [217]. By modeling the Markovian transition autoregressively while attending to the preceding hierarchical state, crucial global context can be leveraged at each individual autoregressive step. As an alternative, bidirectional Transformers are also widely explored to incorporate bidirectional context, accompanied by a Masked Visual Token Modeling (MVTM) [218] or Masked Language Modeling (MLM) [219], [220] mechanism.
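For illustration, the sketch below outlines MVTM-style iterative decoding, in which all image tokens start masked and are filled in parallel over a few refinement steps, keeping only the most confident predictions at each step. The cosine schedule and confidence-based re-masking follow the general idea of [218]; `bidirectional_transformer` is a stand-in for any bidirectional token predictor, and all sizes are illustrative assumptions.

```python
# Minimal sketch of iterative masked visual token modeling (MVTM)-style
# decoding; `bidirectional_transformer` is a hypothetical callable that
# returns per-position logits given tokens and a condition.
import math
import torch

def mvtm_decode(bidirectional_transformer, cond, seq_len=256, vocab=8192,
                mask_id=8192, steps=8, device="cpu"):
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)
    for t in range(steps):
        logits = bidirectional_transformer(tokens, cond)        # (1, L, vocab)
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)                              # per-token confidence
        still_masked = tokens.eq(mask_id)
        tokens = torch.where(still_masked, pred, tokens)        # fill masked slots
        # Cosine schedule: how many tokens remain masked after this step.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * (t + 1) / steps))
        if keep_masked > 0:
            conf = conf.masked_fill(~still_masked, float("inf"))  # never re-mask kept tokens
            remask = conf.topk(keep_masked, largest=False).indices
            tokens.scatter_(1, remask, mask_id)
    return tokens
```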
Self-Attention Mechanism: To handle languages, images, and videos across different tasks in a unified manner, NUWA [44] presents a 3D Transformer framework with a unified 3D Nearby Self-Attention (3DNA), which not only reduces the complexity of full attention but also shows superior performance. With a focus on semantic image editing at high resolution, ASSET [221] proposes to sparsify the Transformer's attention matrix at high resolutions, guided by dense attention at lower resolutions, leading to reduced computational cost.

D. NeRF-Based Methods

A neural field [222] is a field that is parameterized fully or in part by a neural network. As a special case of neural fields, Neural Radiance Fields (NeRF) [4] achieve impressive performance for novel view synthesis by parameterizing the scene as an implicit function that takes a 3D position and viewing direction as input and yields volume density and emitted radiance as output. To render 2D images from the implicit 3D representation, differentiable volume rendering is performed with a numerical integrator [4] to approximate the intractable volumetric projection integral.
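For reference, the standard quadrature used in [4] to approximate this integral along a camera ray $\mathbf{r}$ with $N$ samples can be written (following the notation of the original NeRF formulation) as

\[
\hat{C}(\mathbf{r}) \;=\; \sum_{i=1}^{N} T_i\,\bigl(1 - \exp(-\sigma_i \delta_i)\bigr)\,\mathbf{c}_i,
\qquad
T_i \;=\; \exp\!\Bigl(-\sum_{j=1}^{i-1} \sigma_j \delta_j\Bigr),
\]

where $\sigma_i$ and $\mathbf{c}_i$ are the density and color predicted at the $i$-th sample, $\delta_i$ is the distance between adjacent samples, and $T_i$ is the accumulated transmittance. Since every term is differentiable, gradients from 2D image losses can flow back to the implicit scene representation.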
Powered by NeRF for 3D scene representation, 3D-aware MISE can be achieved with per-scene NeRF or generative NeRF frameworks.

1) Per-Scene NeRF: Consistent with the original NeRF model, a per-scene NeRF aims to optimize and represent a single scene, supervised by images or certain pre-trained models.

Image Supervision: With paired guidance and corresponding view images, a NeRF can be naively trained conditioned on the guidance to achieve MISE. For instance, AD-NeRF [13] achieves high-fidelity talking-head synthesis by training neural radiance fields on a video sequence with the audio track of one target person. Instead of bridging audio inputs and video outputs through intermediate representations, AD-NeRF directly feeds the audio features into an implicit function to yield a dynamic NeRF, which is further exploited to synthesize high-fidelity talking-face videos accompanied by the audio via volume rendering. However, the paired condition-image data and multiview images are usually unavailable or costly to acquire, which hinders the broad application of this method.

Pre-Trained Model Supervision: Instead of relying on multiview images or paired data, certain pre-trained models can be adopted to optimize NeRFs from scratch, as shown in Fig. 7(a). For instance, a pre-trained CLIP model can be leveraged to achieve text-driven 3D-aware image synthesis [223] by optimizing a NeRF to render multi-view images that score highly with a target text description according to the CLIP model. A similar CLIP-based approach is also adopted in AvatarCLIP [224] to achieve zero-shot text-driven 3D avatar generation and animation. Recently, with the prosperity of diffusion models, pre-trained 2D diffusion models have shown great potential to drive the generation of high-fidelity 3D scenes for diverse text prompts, as in DreamFusion [11]. Specifically, based on probability density distillation, a 2D diffusion model can serve as a generative prior for the optimization of a randomly-initialized 3D neural field via gradient descent, such that its 2D renderings yield a high score with the target condition. Following this line of research, Magic3D [47] further proposes to optimize a textured 3D mesh model with an efficient differentiable renderer [212], [225] interacting with a pre-trained latent diffusion model. On the other hand, optimizing a NeRF with pre-trained models is an under-constrained process, which highlights the need for certain prior knowledge or regularizations. It has been shown that geometric priors, including sparsity regularization and scene bounds [223], improve the generation fidelity significantly. Besides, to mitigate the ambiguous geometry recovered from a single viewpoint, random lighting directions can be applied to shade the scene and reveal its geometric details [11]. To prevent normal vectors from improperly facing backwards from the camera, an orientation loss proposed in Ref-NeRF [226] can be employed to impose a penalty.
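The sketch below illustrates this score-distillation loop in spirit: a frozen 2D diffusion prior provides gradients for a randomly initialized 3D representation through its differentiable renderings. Here `nerf`, `render`, `diffusion_eps`, and `text_emb` are hypothetical stand-ins for a 3D neural field, a differentiable renderer, a pre-trained noise-prediction network, and a text embedding, and the noise schedule and loss weighting are simplified assumptions rather than a faithful reproduction of any published implementation.

```python
# Minimal sketch of score distillation for text-driven 3D optimization
# (DreamFusion-style in spirit; all modules are hypothetical stand-ins).
import torch

def sds_optimize(nerf, render, diffusion_eps, text_emb, optimizer, num_steps=1000):
    alphas = torch.linspace(0.999, 0.01, 1000)       # assumed noise schedule
    for _ in range(num_steps):
        cam = torch.randn(3)                         # random camera/view sample
        img = render(nerf, cam)                      # (1, 3, H, W), differentiable
        t = torch.randint(20, 980, (1,))
        a = alphas[t].view(1, 1, 1, 1)
        noise = torch.randn_like(img)
        noisy = a.sqrt() * img + (1 - a).sqrt() * noise
        with torch.no_grad():                        # frozen diffusion prior
            eps_hat = diffusion_eps(noisy, t, text_emb)
        # SDS-style update: back-propagate (eps_hat - noise) through the
        # rendering only, skipping the diffusion network's Jacobian.
        grad = (eps_hat - noise).detach()
        loss = (grad * img).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```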


Fig. 7. Frameworks of (a) per-scene NeRF and (b) generative (GAN-based) NeRF for 3D-aware MISE. The image is adapted from [11], [212].

2) Generative NeRF: Distinct from per-scene optimization NeRFs, which work for a single scene, generative NeRFs are capable of generalizing to different scenes by integrating NeRF with generative models. In a generative NeRF, a scene is specified by a latent code in the corresponding latent space. GRAF [227] is the first to introduce a GAN framework for the generative training of radiance fields by employing a multi-scale patch-based discriminator. Many efforts have recently been devoted to improving generative NeRFs, e.g., GIRAFFE [228] for introducing volume rendering at the feature level and separating object instances in a controllable way; Pi-GAN [229] for a FiLM-based conditioning scheme [230] with a SIREN architecture [231]; StyleNeRF [232] for the integration of a style-based generator to achieve high-resolution image synthesis; and EG3D [233] for incorporating an efficient triplane 3D representation. Fueled by these advancements, 3D-aware MISE can be performed following the pipeline of conditional generative NeRF or generative NeRF inversion.
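A minimal training step for such a GAN-based generative NeRF is sketched below: a latent code (plus an optional condition) defines the radiance field, rendered patches are judged by a discriminator against real image patches, and the non-saturating GAN loss is used for both updates. All modules are placeholder assumptions rather than the internals of GRAF, StyleNeRF, or EG3D.

```python
# Minimal sketch of adversarial training for a generative NeRF.
# (`generator_nerf`, `render_patch`, and `discriminator` are assumed
# placeholders for a latent-conditioned radiance field, a differentiable
# patch renderer, and a patch discriminator.)
import torch
import torch.nn.functional as F

def gan_nerf_step(generator_nerf, render_patch, discriminator,
                  real_patches, condition, g_opt, d_opt, z_dim=128):
    z = torch.randn(real_patches.size(0), z_dim)
    fake = render_patch(generator_nerf, z, condition)        # differentiable rendering
    # Discriminator update: real patches vs. rendered (fake) patches.
    d_loss = F.softplus(-discriminator(real_patches)).mean() + \
             F.softplus(discriminator(fake.detach())).mean()
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # Generator update: fool the discriminator through the renderer.
    g_loss = F.softplus(-discriminator(fake)).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```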
Conditional NeRF: In a conditional generative NeRF, a scene is specified by the combination of 3D positions and the given conditions, as shown in Fig. 7(b). The condition can be integrated to condition the NeRF following the integration strategies used in GANs or diffusion models. For instance, a pre-trained CLIP model is employed in [234] to extract conditional visual and text features to condition a NeRF. Similarly, pix2pix3D [49] encodes certain visual guidance (and a random code) to generate triplanes for scene representation, while rendering the image and a pixel-aligned label map simultaneously to enable interactive 3D cross-view editing.

NeRF Inversion: In light of recent advances in generative NeRFs for 3D-aware image synthesis, some work explores the inversion of generative NeRFs for 3D-aware MISE. As a generative NeRF (GAN-based) is accompanied by a latent space, the conditional guidance for MISE can be naively mapped into the latent space to enable conditional 3D-aware generation [235]. However, this approach struggles with image generation and editing that requires local control. Some recent work proposes to train 3D-semantic-aware generative NeRFs [48], [236] that produce spatially-aligned images and semantic masks concurrently with two branches. These aligned semantic masks can be used to perform local editing of the 3D volume via NeRF inversion. On the other hand, the inversion of a generative NeRF is challenging due to the inclusion of the camera pose. Thus, a hybrid inversion strategy [232] can be applied in practice by combining encoder-based and optimization-based inversion, where the encoder predicts a camera pose and a coarse style code which is further refined through inverse optimization. To enable flexible and faithful 3D-aware MISE, some pre-trained models like CLIP can also be introduced into NeRF inversion. For instance, to achieve 3D-aware manipulation from a text prompt, CLIP-NeRF [237] optimizes latent codes towards the targeted manipulation, driven by a CLIP-based matching loss as described in StyleCLIP [29].
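The following sketch illustrates such hybrid inversion under stated assumptions: an encoder provides an initial camera pose and latent code, which are then refined by optimization against the target image, optionally augmented with a CLIP-based text loss for text-driven editing. `gen_nerf`, `encoder`, and `clip_loss` are hypothetical stand-ins, not the API of any cited method.

```python
# Minimal sketch of hybrid inversion for a generative NeRF
# (encoder initialization followed by latent/pose optimization).
import torch

def invert_and_edit(gen_nerf, encoder, clip_loss, target_img, text=None,
                    steps=200, lr=0.01):
    with torch.no_grad():
        pose, w = encoder(target_img)             # coarse initialization
    pose = pose.clone().requires_grad_(True)
    w = w.clone().requires_grad_(True)
    opt = torch.optim.Adam([pose, w], lr=lr)
    for _ in range(steps):
        rec = gen_nerf(w, pose)                   # rendered image from latent + pose
        loss = (rec - target_img).abs().mean()    # reconstruction term
        if text is not None:                      # optional CLIP matching term
            loss = loss + clip_loss(rec, text)
        opt.zero_grad(); loss.backward(); opt.step()
    return w, pose
```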
E. Other Methods

Beyond the above-mentioned methods, there have been several endeavors dedicated to the MISE task that explore diverse research paths.

2D MISE Without Generative Models: Instead of relying on generative models, a series of alternative methods has been explored for multimodal editing of 2D images. For instance, CLVA [255] manipulates the style of a content image through text prompts by comparing contrastive pairs of content images and style instructions to achieve mutual relativeness. However, CLVA is constrained in that it requires style images accompanied by text prompts during training. Instead, CLIPstyler [256] leverages a pre-trained CLIP model to achieve text-guided style transfer by training a lightweight network that transforms a content image to follow the text condition. As an extension to video, Loeschcke et al. [257] harness the power of CLIP to stylize an object in a video according to two target texts.
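A compact sketch of this kind of CLIP-guided, text-driven stylization is given below, assuming the OpenAI CLIP package is available: a lightweight network is trained so that its output matches the target text in CLIP space while staying close to the content image. The loss weights and the simple content term are illustrative assumptions, not the exact CLIPstyler objective.

```python
# Minimal sketch of CLIP-guided text-driven stylization without a
# generative model. (`stylizer` is a hypothetical lightweight network.)
import torch
import clip  # OpenAI CLIP package, assumed installed

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
for p in clip_model.parameters():
    p.requires_grad_(False)                       # keep CLIP frozen

def stylize_step(stylizer, content_img, text, optimizer, w_content=1.0):
    tokens = clip.tokenize([text]).to(device)
    text_feat = clip_model.encode_text(tokens).float()
    out = stylizer(content_img)                   # stylized image in [0, 1]
    img_feat = clip_model.encode_image(
        torch.nn.functional.interpolate(out, size=224)).float()
    clip_loss = 1 - torch.cosine_similarity(img_feat, text_feat).mean()
    content_loss = (out - content_img).pow(2).mean()   # keep the content
    loss = clip_loss + w_content * content_loss
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```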
3D-Aware MISE Without NeRF: Beyond NeRF, there are alternative representations that can be leveraged for 3D-aware MISE. Typically, classical 3D representations such as meshes can also be employed to replace NeRF for 3D-aware MISE [258], [259]. Specifically, aiming at style transfer of 3D scenes, Mu et al. [260] propose to learn geometry-aware content features from a point cloud representation of the scene, followed by point-to-pixel adaptive attention normalization (AdaAttN) to transfer the style of a given image. Besides, a popular line of research adapts GANs for 3D-aware generation by conditioning on camera parameters [261], introducing intermediate 3D shapes [262], incorporating depth priors [263], or adopting 3D rigid-body transformations with projection [264].

F. Comparison and Discussion

All generation methods possess their own strengths and weaknesses. GAN-based methods can achieve high-fidelity image synthesis in terms of FID and Inception Score and also have fast inference speed, while GANs are notorious for unstable training and are prone to mode collapse.


Moreover, it has been shown that GANs focus more on fidelity than on capturing the diversity of the training data distribution, compared with likelihood-based models such as diffusion models and autoregressive models [21]. Besides, GANs usually adopt a CNN architecture (although Transformer structures are explored in some studies [265], [266], [267]), which makes them struggle to handle multimodal data in a unified manner and to generalize to new MISE tasks. With the wide adoption of Transformer backbones, autoregressive models can handle different MISE tasks in a unified manner. However, owing to the sequential prediction of tokens, autoregressive models suffer from slow inference speed, which is also a bottleneck for diffusion models as they require a large number of diffusion steps. Currently, autoregressive models and diffusion models are more favored in SOTA methods than GANs, especially for text-to-image synthesis.

Autoregressive models and diffusion models are likelihood-based generative models that are equipped with stationary training objectives and good training stability. The comparison of generative modeling capability between autoregressive models and diffusion models is still inconclusive. DALL-E 2 [25] shows that diffusion models are slightly better than autoregressive models in modeling the diffusion prior. However, the recent work Parti [71], which adopts an autoregressive structure, presents superior performance over the SOTA diffusion-based method (i.e., Imagen). On the other hand, the exploration of two different families of generative models may open exciting opportunities to combine the merits of the two powerful models.

Different from the above generation methods, which mainly work on 2D images and have few requirements for the training datasets, NeRF-based methods handle the 3D scene geometry and thus have relatively high requirements for training data. For example, per-scene optimization NeRFs require multiview images or a video sequence with pose annotations, while generative NeRFs require the scene geometry of the dataset to be simple. Thus, the application of NeRF to high-fidelity MISE is still quite constrained. Nevertheless, the 3D-aware modeling of the real world with NeRF opens a new door for future MISE research, broadening the horizons for potential advancements.

Besides, state-of-the-art methods tend to combine different generative models to yield superior performance. For example, Taming Transformer [2] incorporates VQ-GAN and autoregressive models to achieve high-resolution image synthesis; StyleNeRF [232] combines NeRF with a GAN to enable image synthesis with both high fidelity and 3D awareness; ImageBART [194] combines the autoregressive formulation with a multinomial diffusion process to incorporate a coarse-to-fine hierarchy of context information; X-LXMERT [268] integrates a GAN into a cross-modality representation framework to achieve text-guided image generation.

IV. EXPERIMENTAL EVALUATION


A. Datasets

Datasets are the core of image synthesis and editing tasks. To give an overall picture of the datasets in MISE, we tabulate the detailed annotation types of popular datasets in Table II. Notably, ADE20K [239], COCO-Stuff [241], and Cityscapes [243] are common benchmark datasets for semantic image synthesis; Oxford-120 Flowers [249], CUB-200 Birds [250], and COCO [240] are widely adopted in text-to-image synthesis; VoxCeleb2 [281] and Lip Reading in the Wild (LRW) [282] are usually used for benchmarking talking face generation. Please refer to the supplementary material, available online, for more details of the widely adopted datasets in different MISE tasks.
B. Evaluation Metrics

Precise evaluation metrics are of great importance in driving research progress. On the other hand, the evaluation of MISE tasks is challenging, as multiple attributes account for a fine generation result and the notion of image evaluation is often subjective. To achieve faithful evaluation, comprehensive metrics are adopted to evaluate MISE tasks from multiple aspects. Specifically, Inception Score (IS) [283] and FID [284] are general metrics for image quality evaluation, while LPIPS [165] is a common metric for evaluating image diversity. These metrics can be applied across different generation tasks. In terms of the alignment between generated images and conditions, the evaluation metrics are usually designed for specific generation tasks, e.g., mIoU and mAP for semantic image synthesis; R-precision [54], Captioning Metrics [285], and Semantic Object Accuracy (SOA) [274] for text-to-image generation; and Landmark Distance (LMD) and audio-lip synchronization (Sync) [286] for talking face generation.

As a general image quality metric, the advantage of IS is its simplicity, and it can be applied to a wide range of image generation models. However, IS has been criticized for its lack of robustness and its sensitivity to noise. It also struggles to detect overfitted generation (i.e., the model memorizes the training set) and to measure intra-domain variation (i.e., the model only produces one good sample). FID is more robust than IS and can better capture the overall quality of the generated images. However, it assumes a Gaussian distribution for image features, which is not always valid. For diversity metrics like LPIPS, the quality of the generated images is not considered, which means unrealistic generation could still lead to a good diversity score. Alignment metrics provide quantitative evaluations of generation alignment, while most of them are subject to various issues, including insensitivity to temporal or overall coherence in SOA and CPBD, dataset or pre-trained model bias in R-precision, mIoU & mAP, and audio-lip synchronization, and ambiguous alignment in Captioning Metrics. Please refer to the supplementary material, available online, for more details of the corresponding evaluation metrics. Overall, a certain evaluation metric should be applied in conjunction with other evaluation metrics for a comprehensive and faithful analysis of model performance.
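As a concrete reference, the sketch below shows how FID is typically computed once Inception features of real and generated images are available: Gaussians are fitted to both feature sets and the Fréchet distance between them is evaluated. The feature extraction itself is abstracted away; in practice the pool3 activations of a pre-trained Inception-v3 network are used.

```python
# Minimal sketch of FID from pre-extracted Inception features
# (feats_real, feats_fake: numpy arrays of shape (N, D)).
import numpy as np
from scipy import linalg

def fid_from_features(feats_real, feats_fake):
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real                   # drop tiny imaginary parts
    diff = mu_r - mu_f
    return diff @ diff + np.trace(cov_r + cov_f - 2 * covmean)
```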
C. Experimental Results

To showcase the capability and effectiveness of MISE in a tangible manner, we visualize the synthesized images conditioned on combinations of diverse guidance types, as shown in Fig. 8.

TABLE II
ANNOTATION TYPES IN POPULAR DATASETS FOR MISE

Fig. 8. Image synthesis from the combination of different types of guidance. The samples are from [8], [62], [269].

TABLE III
QUANTITATIVE COMPARISON WITH EXISTING METHODS ON SEGMENTATION-TO-IMAGE SYNTHESIS

Please refer to the supplementary material, available online, for more visualizations. Furthermore, we provide a quantitative comparison of the image synthesis performance exhibited by various models. This assessment considers distinct types of guidance, including visual, text, and audio, which will be discussed in the following sections.

1) Visual Guidance: For visual guidance, we mainly conduct the comparison on semantic image synthesis, as there are a number of methods available for benchmarking. As shown in Table III, the experimental comparison is conducted on four challenging datasets: ADE20K [239], ADE20K-outdoors [239], COCO-Stuff [241], and Cityscapes [243], following the setting of [6].


The evaluation is performed with FID, LPIPS, and mIoU. Specifically, the mIoU aims to assess the alignment between the generated image and the ground-truth segmentation via a pre-trained semantic segmentation network. Pre-trained UperNet101 [287], multi-scale DRN-D-105 [288], and DeepLabV2 [289] are adopted for Cityscapes, ADE20K & ADE20K-outdoors, and COCO-Stuff, respectively.
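For clarity, the sketch below shows this consistency check: the generated image is segmented with a pre-trained network and the resulting label map is compared with the ground truth via mean intersection-over-union. `segmenter` is a hypothetical stand-in for UperNet/DRN/DeepLab, and inputs are assumed to be integer label maps of shape (H, W).

```python
# Minimal sketch of mIoU between a predicted and a ground-truth label map.
import numpy as np

def mean_iou(pred_labels, gt_labels, num_classes):
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred_labels == c, gt_labels == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue                          # class absent in both maps
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious))

# usage sketch: miou = mean_iou(segmenter(generated_img).argmax(0), gt_map, 150)
```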
As shown in Table III, the diffusion-based method (i.e., SDM [61]) achieves superior generation quality and diversity as evaluated by FID and LPIPS, and yields comparable semantic consistency as evaluated by mIoU compared with GAN-based methods. Although the comparison may not be entirely fair as the model sizes are different, the diffusion-based method still demonstrates its powerful modeling capability for semantic image synthesis. Despite its large model size, the autoregressive method Taming [2] does not show a clear advantage over other methods. We conjecture that Taming Transformer [2] is a versatile framework for various conditional generation tasks without a specific design for semantic image synthesis, while the other methods in Table III focus mainly on the task of semantic image synthesis. Notably, autoregressive and diffusion methods inherently support diverse conditional generation results, while GAN-based methods usually require additional modules (e.g., a VAE [175]) or designs to achieve diverse generation.
TABLE IV
TEXT-TO-IMAGE GENERATION PERFORMANCE ON THE COCO DATASET

2) Text Guidance: We benchmark text-to-image generation methods on the COCO dataset, as tabulated in Table IV (the results are extracted from the relevant papers). As shown in Table IV, GAN-based, autoregressive, and diffusion-based methods can all achieve SOTA performance in terms of FID, e.g., 8.12 for the GAN-based method LAFITE [26], 7.23 for the autoregressive method Parti [71], and 7.27 for the diffusion-based method Imagen [10]. However, autoregressive and diffusion-based methods are still preferred in recent SOTA work, thanks to their stationary training objectives and good scalability [21].
TABLE V
AUDIO GUIDED IMAGE EDITING (TALKING-HEAD) PERFORMANCE ON LRW [282] AND VOXCELEB2 [281] UNDER THREE METRICS

3) Audio Guidance: In terms of audio-guided image synthesis and editing, we conduct a quantitative comparison on the task of audio-driven talking face generation, which has been widely explored in the literature. Notably, the current development of talking face generation mainly relies on GANs, while autoregressive or diffusion-based methods for talking face generation remain under-explored. The quantitative results of talking face generation on the LRW [282] and VoxCeleb2 [281] datasets are shown in Table V.

V. OPEN CHALLENGES & DISCUSSION

Though MISE has made notable progress and achieved superior performance in recent years, there exist several challenges for future exploration. In this section, we overview the typical challenges, share our humble opinions on possible solutions, and highlight future research directions.

A. Towards Large-Scale Multi-Modality Datasets

As current datasets mainly provide annotations in a single modality (e.g., visual guidance), most existing methods focus on image synthesis and editing conditioned on guidance from a single modality (e.g., text-to-image synthesis, semantic image synthesis). However, humans possess the capability of creating visual content with guidance from multiple modalities concurrently. Aiming to mimic human intelligence, multimodal inputs are expected to be fused and leveraged jointly in image generation.

Recently, Make-A-Scene [72] explores including semantic segmentation tokens in autoregressive modeling to achieve better quality in image synthesis; ControlNet [8] incorporates various visual conditions into Stable Diffusion (for text-to-image generation) to achieve controllable generation; and with MM-CelebA-HQ [28], COCO [240], and COCO-Stuff [241] as the training set, PoE-GAN [269] achieves image generation conditioned on multiple modalities, including segmentation, sketch, image, and text. However, the size of MM-CelebA-HQ [28], COCO [240], and COCO-Stuff [241] is still far from narrowing the gap with real-world distributions. Therefore, to encompass a broad range of modalities in image generation, there is a need for a large-scale dataset equipped with annotations spanning a wide spectrum of modalities, such as semantic segmentation, text descriptions, and scene graphs. One potential approach to assembling such a dataset could be utilizing pre-trained models for different tasks to generate the requisite annotations. For instance, a segmentation model could be used to create semantic maps, and a detection model could be employed to annotate bounding boxes. Additionally, synthetic data could provide another feasible alternative, given its inherent advantage of readily providing a multitude of annotations.
B. Towards Faithful Evaluation Metrics

Accurate yet faithful evaluation is of great significance for the development of MISE and is still an open problem. Leveraging pre-trained models to conduct evaluations (e.g., FID) is constrained to the pre-training datasets, which tends to introduce a discrepancy with the target datasets. User studies recruit human subjects to assess the synthesized images directly, which is, however, often resource-intensive in terms of time and cost. With the advance of multimodal pre-training, CLIP [45] is used to measure the similarity between texts and generated images, which, however, does not correlate well with human preferences. To inherit the merits of the powerful representations of pre-trained models and the human preferences captured by crowd-sourcing studies, fine-tuning pre-trained CLIP with human preference datasets [294], [295] will be a promising direction for designing MISE evaluation metrics.
C. Towards Efficient Network Architecture

With inherent support for multimodal input and powerful generative modeling, autoregressive models and diffusion models have become a new paradigm for unified MISE. However, both autoregressive models and diffusion models suffer from slow inference speed, which is more severe in high-resolution image synthesis. Some works [296], [297] explore accelerating autoregressive and diffusion models, but the experiments are constrained to toy datasets with low resolution. Recently, Song et al. [298] introduced consistency models based on diffusion processes, which allow generating high-quality samples by directly mapping noise to data, supporting fast one-step generation as well as multistep sampling that trades compute for sample quality. The sampling efficiency of this model architecture presents a compelling opportunity for the advancement of network architectures in MISE tasks.
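In spirit, a consistency model $f_\theta$ is trained so that all points on the same diffusion trajectory map to a common origin (following the general formulation of [298]):

\[
f_\theta(\mathbf{x}_t, t) \;=\; f_\theta(\mathbf{x}_{t'}, t') \;\approx\; \mathbf{x}_\epsilon
\quad \forall\, t, t' \in [\epsilon, T],
\qquad
f_\theta(\mathbf{x}_\epsilon, \epsilon) = \mathbf{x}_\epsilon ,
\]

so that a single network evaluation can map a noisy sample back to (an estimate of) its clean origin, enabling one-step sampling, while additional evaluations can be chained to refine the sample.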
D. Towards 3D Awareness

With the emergence of neural scene representation models, especially NeRF, 3D-aware image synthesis and editing has the potential to be the next breakthrough for MISE, as it models the 3D geometry of the real world. With the incorporation of generative models, generative NeRF is notably appealing for MISE as it is associated with a latent space. Current generative NeRF models (e.g., StyleNeRF, EG3D) have made it possible to model scenes with simple geometry (e.g., faces, cars) from a collection of unposed 2D images, just like the training of unconditional GANs (e.g., StyleGAN). Powered by these efforts, several 3D-aware MISE tasks have been explored, e.g., text-to-NeRF [234] and semantic-to-NeRF [235]. However, current generative NeRFs still struggle on datasets with complex geometry variation, e.g., DeepFashion [245] and ImageNet [299].

Relying solely on generative models to learn complex scene geometry from unposed 2D images is indeed intractable and challenging. A possible solution is to provide more prior knowledge of the scene, e.g., obtaining prior geometry with off-the-shelf models [300], or providing skeleton priors for generative human modeling. Notably, the power of prior knowledge has been explored in some recent studies of 3D-aware tasks [260], [300], [301]. Another possible approach is to provide more supervision, e.g., creating a large dataset with multiview annotations or geometry information. Once 3D-aware generative modeling succeeds on complex natural scenes, some interesting multimodal applications will become possible, e.g., a 3D version of DALL-E.

VI. SOCIAL IMPACTS

As related to the hot concept of AI-Generated Content (AIGC), MISE has gained considerable attention in recent years. The rapid advancements in MISE offer unprecedented generation realism and editing possibilities, which have influenced and will continue to influence our society in both positive and potentially negative ways. In this section, we discuss the correlation between MISE and AIGC, and analyze the potential social impacts of MISE.

A. Correlation With AIGC

Recently, AI-generated content has become a very hot research topic with the emergence of Stable Diffusion and ChatGPT. MISE is related to AIGC in that both involve using machine learning and deep learning to create new and novel visual content. Nevertheless, MISE is a specific application of AI that focuses on generating and editing images with specific attributes controlled by various forms of multimodal guidance. It aims to mimic the visual imaging capability of humans in the multimodal real world. As a comparison, AIGC encompasses a much broader range of creative work, including visual content, text content, audio content, etc.

B. Applications


The multi-modal image synthesis and editing techniques can be applied in artistic creation and content generation, which could widely benefit designers, photographers, and content creators [302]. Moreover, they can be democratized in everyday applications as image generation or editing tools for popular entertainment. In addition, the various conditions serving as intermediate representations for synthesis and editing greatly ease the use of these methods and improve the flexibility of user interaction. In general, the techniques greatly lower the barrier for the public and unleash their creativity in content generation and editing.

C. Misuse

On the other hand, the increasing editing capability and generation realism also offer opportunities to generate or manipulate images for malicious purposes. The misuse of synthesis and editing techniques may spread fake or nefarious information and lead to negative social impacts. To prevent potential misuse, one possible way is to develop detection techniques for automatically identifying generated images, which has been actively researched by the community [303]. Meanwhile, sufficient guardrails, labelling, and access control should be carefully considered when deploying MISE techniques to minimize the risk of misuse.

D. Environment

As deep-learning-based methods, the current multimodal generative methods inevitably require GPUs and considerable energy consumption for training and inference, which may negatively influence the environment and global climate before the large-scale use of renewable energy. One direction to soften the need for computational resources lies in the active exploration of model generalization. For example, a pre-trained model that generalizes across various datasets could greatly accelerate the training process or provide semantic knowledge for downstream tasks.

VII. CONCLUSION

This review has covered the main approaches to multimodal image synthesis and editing. Specifically, we provided an overview of different guidance modalities, including visual guidance, text guidance, audio guidance, and other modality guidance (e.g., scene graphs). In addition, we provided a detailed introduction to the main image synthesis and editing paradigms: GAN-based methods, diffusion-based methods, autoregressive methods, and NeRF-based methods. Their respective strengths and weaknesses were comprehensively discussed to inspire new paradigms that take advantage of the strengths of existing frameworks. We also conducted a comprehensive survey of datasets and evaluation metrics for MISE conditioned on different guidance modalities. Further, we tabulated and compared the performance of existing approaches on different MISE tasks. Last but not least, we provided our perspective on the current challenges and future directions related to integrating all modalities, comprehensive datasets, evaluation metrics, model architecture, and 3D awareness.

REFERENCES

[1] I. Goodfellow et al., "Generative adversarial nets," in Proc. 27th Int. Conf. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[2] P. Esser et al., "Taming transformers for high-resolution image synthesis," 2020, arXiv:2012.09841.
[3] J. Ho et al., "Denoising diffusion probabilistic models," in Proc. 34th Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 574.
[4] B. Mildenhall et al., "NeRF: Representing scenes as neural radiance fields for view synthesis," in Proc. Eur. Conf. Comput. Vis., 2020, pp. 405–421.
[5] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5967–5976.
[6] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, "Semantic image synthesis with spatially-adaptive normalization," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2332–2341.
[7] C.-H. Lee, Z. Liu, L. Wu, and P. Luo, "MaskGAN: Towards diverse and interactive facial image manipulation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5548–5557.
[8] L. Zhang and M. Agrawala, "Adding conditional control to text-to-image diffusion models," 2023, arXiv:2302.05543.
[9] J. Cheng et al., "LayoutDiffuse: Adapting foundational diffusion models for layout-to-image generation," 2023, arXiv:2302.08908.
[10] C. Saharia et al., "Photorealistic text-to-image diffusion models with deep language understanding," 2022, arXiv:2205.11487.
[11] B. Poole et al., "DreamFusion: Text-to-3D using 2D diffusion," in Proc. Int. Conf. Learn. Representations, 2023.
[12] U. Singer et al., "Make-a-video: Text-to-video generation without text-video data," 2022, arXiv:2209.14792.
[13] Y. Guo, K. Chen, S. Liang, Y.-J. Liu, H. Bao, and J. Zhang, "AD-NeRF: Audio driven neural radiance fields for talking head synthesis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 5764–5774.
[14] Y. Li et al., "PasteGAN: A semi-parametric method to generate image from scene graph," in Proc. 33rd Int. Conf. Neural Inf. Process. Syst., 2019, Art. no. 355.
[15] Y. Takagi and S. Nishimoto, "Improving visual image reconstruction from human brain activity using latent diffusion models via multiple decoded inputs," 2023, arXiv:2306.11536.
[16] X. Pan et al., "Drag your GAN: Interactive point-based manipulation on the generative image manifold," in Proc. ACM SIGGRAPH Conf., 2023, Art. no. 78.
[17] E. Mansimov et al., "Generating images from captions with attention," 2015, arXiv:1511.02793.
[18] M. Mirza and S. Osindero, "Conditional generative adversarial nets," 2014, arXiv:1411.1784.
[19] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, "Learning from simulated and unsupervised images through adversarial training," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 2242–2251.
[20] J. S. Chung et al., "You said that?," 2017, arXiv:1705.02966.
[21] P. Dhariwal and A. Nichol, "Diffusion models beat GANs on image synthesis," in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 8780–8794.
[22] J. Song et al., "Denoising diffusion implicit models," 2020, arXiv:2010.02502.
[23] Y. Song et al., "Score-based generative modeling through stochastic differential equations," 2020, arXiv:2011.13456.
[24] A. Jolicoeur-Martineau et al., "Adversarial score matching and improved sampling for image generation," 2020, arXiv:2009.05475.
[25] A. Ramesh et al., "Hierarchical text-conditional image generation with CLIP latents," 2022, arXiv:2204.06125.
[26] Y. Zhou et al., "LAFITE: Towards language-free training for text-to-image generation," 2021, arXiv:2111.13792.
[27] J. Ho and T. Salimans, "Classifier-free diffusion guidance," 2022, arXiv:2207.12598.
[28] W. Xia, Y. Yang, J.-H. Xue, and B. Wu, "TediGAN: Text-guided diverse face image generation and manipulation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 2256–2265.
[29] O. Patashnik, Z. Wu, E. Shechtman, D. Cohen-Or, and D. Lischinski, "StyleCLIP: Text-driven manipulation of StyleGAN imagery," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 2065–2074.
[30] W. Xia, Y. Zhang, Y. Yang, J.-H. Xue, B. Zhou, and M.-H. Yang, "GAN inversion: A survey," IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 3, pp. 3121–3138, Mar. 2023.
ness.


[31] J. Zhu et al., “In-domain GAN inversion for real image editing,” in Proc. [63] T. Brooks, A. Holynski, and A. A. Efros, “Instructpix2pix: Learning to
Eur. Conf. Comput. Vis., 2020, pp. 592–608. follow image editing instructions,” in Proc. IEEE Conf. Comput. Vis.
[32] X. Liu et al., “More control for free! image synthesis with semantic Pattern Recognit., 2023, pp. 18392–18402.
diffusion guidance,” 2021, arXiv:2112.05744. [64] S. Shen et al., “DiffTalk: Crafting diffusion models for generalized audio-
[33] G. Kim and J. C. Ye, “DiffusionCLIP: Text-guided image manipulation driven portraits animation,” in Proc. IEEE Conf. Comput. Vis. Pattern
using diffusion models,” 2021, arXiv:2110.02711. Recognit., 2023, pp. 1982–1991.
[34] A. Hertz et al., “Prompt-to-prompt image editing with cross attention [65] J. Tseng, R. Castellon, and K. Liu, “Edge: Editable dance generation
control,” 2022, arXiv:2208.01626. from music,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023,
[35] N. Ruiz et al., “DreamBooth: Fine tuning text-to-image diffusion models pp. 448–458.
for subject-driven generation,” 2022, arXiv:2208.12242. [66] L. Ruan et al., “Mm-diffusion: Learning multi-modal diffusion models
[36] R. Gal et al., “An image is worth one word: Personalizing text-to-image for joint audio and video generation,” in Proc. IEEE/CVF Conf. Comput.
generation using textual inversion,” 2022, arXiv:2208.01618. Vis. Pattern Recognit., 2023, pp. 10219–10228.
[37] A. Vaswani et al., “Attention is all you need,” in Proc. 31st Int. Conf. [67] L. Yang et al., “Diffusion-based scene graph to image generation with
Neural Inf. Process. Syst., 2017, pp. 6000–6010. masked contrastive pre-training,” 2022, arXiv:2211.11138.
[38] A. Radford et al., “Language models are unsupervised multitask learn- [68] S. Kim, J. Baek, J. Park, G. Kim, and S. Kim, “InstaFormer: Instance-
ers,” OpenAI Blog, vol. 1, no. 8, p. 9, 2019. aware image-to-image translation with transformer,” in Proc. IEEE Conf.
[39] M. Chen et al., “Generative pretraining from pixels,” in Proc. 37th Int. Comput. Vis. Pattern Recognit., 2022, pp. 18300–18310.
Conf. Mach. Learn., 2020, Art. no. 158. [69] M. Ding et al., “CogView: Mastering text-to-image generation via trans-
[40] P. Dhariwal et al., “Jukebox: A generative model for music,” formers,” 2021, arXiv:2105.13290.
2020, arXiv:2005.00341. [70] A. Ramesh et al., “Zero-shot text-to-image generation,” 2021,
[41] K. Gregor et al., “Deep AutoRegressive networks,” in Proc. 31st Int. arXiv:2102.12092.
Conf. Mach. Learn., 2014, pp. 1242–1250. [71] J. Yu et al., “Scaling autoregressive models for content-rich text-to-image
[42] A. v. d. Oord et al., “Neural discrete representation learning,” generation,” 2022, arXiv:2206.10789.
2017, arXiv:1711.00937. [72] O. Gafni et al., “Make-a-scene: Scene-based text-to-image generation
[43] A. Ramesh et al., “Zero-shot text-to-image generation,” in Proc. Int. Conf. with human priors,” 2022, arXiv:2203.13131.
Mach. Learn., 2021, pp. 8821–8831. [73] H. Chang et al., “Muse: Text-to-image generation via masked generative
[44] C. Wu et al., “NÜWA: Visual synthesis pre-training for neural visual transformers,” 2023, arXiv:2301.00704.
world creation,” 2021, arXiv:2111.12417. [74] Y. Lu et al., “Live speech portraits: Real-time photorealistic talking-head
[45] A. Radford et al., “Learning transferable visual models from natural animation,” ACM Trans. Graph., vol. 40, 2021, Art. no. 220.
language supervision,” 2021, arXiv:2103.00020. [75] R. Li, S. Yang, D. A. Ross, and A. Kanazawa, “AI choreographer: Music
[46] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High- conditioned 3D dance generation with AIST,” in Proc. IEEE/CVF Int.
resolution image synthesis with latent diffusion models,” in Proc. IEEE Conf. Comput. Vis., 2021, pp. 13381–13392.
Conf. Comput. Vis. Pattern Recognit., 2022, pp. 10674–10685. [76] L. Siyao et al., “Bailando: 3D dance generation by actor-critic GPT
[47] C.-H. Lin et al., “Magic3D: High-resolution text-to-3D content creation,” with choreographic memory,” in Proc. IEEE Conf. Comput. Vis. Pattern
2022, arXiv:2211.10440. Recognit., 2022, pp. 11040–11049.
[48] J. Sun et al., “IDE-3D: Interactive disentangled editing for high-resolution [77] Y. Yin et al., “OR-NeRF: Object removing from 3D scenes guided
3D-aware portrait synthesis,” 2022, arXiv:2205.15517. by multiview segmentation with neural radiance fields,” 2023,
[49] K. Deng, G. Yang, D. Ramanan, and J. Y. Zhu, “3D-aware conditional arXiv:2305.10503.
image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., [78] A. Mikaeili et al., “SKED: Sketch-guided text-based 3D editing,”
2023, pp. 4434–4445. 2023, arXiv:2303.10735.
[50] T.-C. Wang, M. -Y. Liu, J. -Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, [79] C. Bao et al., “Sine: Semantic-driven image-based nerf editing with prior-
“High-resolution image synthesis and semantic manipulation with condi- guided editing field,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
tional GANs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, 2023, pp. 20919–20929.
pp. 8798–8807. [80] S. Weder et al., “Removing objects from neural radiance fields,”
[51] H.-Y. Lee et al., “Diverse image-to-image translation via disentangled in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023,
representations,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 36–52. pp. 16528–16538.
[52] V. Sushko et al., “You only need adversarial supervision for semantic [81] D. Xu et al., “SinNeRF: Training neural radiance fields on complex
image synthesis,” 2020, arXiv:2012.04781. scenes from a single image,” in Proc. Eur. Conf. Comput. Vis., 2022,
[53] H. Zhang et al., “StackGAN: Text to photo-realistic image synthesis with pp. 736–753.
stacked generative adversarial networks,” in Proc. IEEE/CVF Int. Conf. [82] J. Hyung, S. Hwang, D. Kim, H. Lee, and J. Choo, “Local 3D editing
Comput. Vis., 2017, pp. 5908–5916. via 3D distillation of clip knowledge,” in Proc. IEEE Conf. Comput. Vis.
[54] T. Xu et al., “AttnGAN: Fine-grained text to image generation with Pattern Recognit., 2023, pp. 12674–12684.
attentional generative adversarial networks,” in Proc. IEEE Conf. Com- [83] C. Wang, M. Chai, M. He, D. Chen, and J. Liao, “CLIP-NeRF: Text-
put. Vis. Pattern Recognit., 2018, pp. 1316–1324. and-image driven manipulation of neural radiance fields,” in Proc. IEEE
[55] T. Qiao, J. Zhang, D. Xu, and D. Tao, “MirrorGAN: Learning text-to- Conf. Comput. Vis. Pattern Recognit., 2022, pp. 3825–3834.
image generation by redescription,” in Proc. IEEE Conf. Comput. Vis. [84] Z. Wang et al., “ProlificDreamer: High-fidelity and diverse text-to-3D
Pattern Recognit., 2019, pp. 1505–1514. generation with variational score distillation,” 2023, arXiv:2305.16213.
[56] M. Kang et al., “Scaling up GANs for text-to-image synthesis,” in Proc. [85] Z. Ye et al., “GeneFace: Generalized and high-fidelity audio-driven 3D
IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 10124–10134. talking face synthesis,” 2023, arXiv:2301.13430.
[57] K. Prajwal et al., “A lip sync expert is all you need for speech to lip [86] Z. Ye et al., “GeneFace: Generalized and stable real-time audio-driven
generation in the wild,” in Proc. 28th ACM Int. Conf. Multimedia, 2020, 3D talking face generation,” 2023, arXiv:2305.00787.
pp. 484–492. [87] S. Shen et al., “Learning dynamic facial radiance fields for few-
[58] Y. Zhou et al., “MakeltTalk: Speaker-aware talking-head animation,” shot talking head synthesis,” in Proc. Eur. Conf. Comput. Vis., 2022,
ACM Trans. Graph., vol. 39, 2020, Art. no. 221. pp. 666–682.
[59] H. Zhou, Y. Sun, W. Wu, C. C. Loy, X. Wang, and Z. Liu, “Pose- [88] X. Liu et al., “Semantic-aware implicit neural audio-driven video portrait
controllable talking face generation by implicitly modularized audio- generation,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 106–125.
visual representation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog- [89] L. Ma et al., “Pose guided person image generation,” 2017,
nit., 2021, pp. 4174–4184. arXiv:1705.09368.
[60] S. J. Park et al., “SyncTalkFace: Talking face generation with precise [90] Y. Men, Y. Mao, Y. Jiang, W. -Y. Ma, and Z. Lian, “Controllable person
lip-syncing via audio-lip memory,” in Proc. AAAI Conf. Artif. Intell., image synthesis with attribute-decomposed GAN,” in Proc. IEEE Conf.
2022, pp. 2062–2070. Comput. Vis. Pattern Recognit., 2020, pp. 5083–5092.
[61] W. Wang et al., “Semantic image synthesis via diffusion models,” [91] C. Zhang et al., “Deep monocular 3D human pose estimation via cascaded
2022, arXiv:2207.00050. dimension-lifting,” 2021, arXiv:2104.03520.
[62] C. Qin et al., “UniControl: A unified diffusion model for controllable [92] J.-Y. Zhu et al., “Toward multimodal image-to-image translation,” in
visual generation in the wild,” 2023, arXiv:2305.11147. Proc. 31st Int. Conf. Neural Inf. Process. Syst., 2017, pp. 465–476.


[93] C. Gao, Q. Liu, Q. Xu, L. Wang, J. Liu, and C. Zou, “SketchyCOCO: [122] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T.
Image generation from freehand scene sketches,” in Proc. IEEE Conf. Freeman, “Visually indicated sounds,” in Proc. IEEE Conf. Comput. Vis.
Comput. Vis. Pattern Recognit., 2020, pp. 5173–5182. Pattern Recognit., 2016, pp. 2405–2413.
[94] W. Chen and J. Hays, “SketchyGAN: Towards diverse and realistic sketch [123] Y. Song et al., “Talking face generation by conditional recurrent adver-
to image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., sarial network,” 2018, arXiv: 1804.04786.
2018, pp. 9416–9425. [124] P. Ekman and W. V. Friesen, “Facial action coding system,” Environ.
[95] S.-Y. Chen et al., “DeepFaceDrawing: Deep generation of face images Psychol. Nonverbal Behav., 1978.
from sketches,” ACM Trans. Graph., vol. 39, 2020, Art. no. 72. [125] J. Johnson, A. Gupta, and L. Fei-Fei, “Image generation from scene
[96] M. Zhu, J. Li, N. Wang, and X. Gao, “A deep collaborative framework graphs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018,
for face photo–sketch synthesis,” IEEE Trans. Neural Netw. Learn. Syst., pp. 1219–1228.
vol. 30, no. 10, pp. 3096–3108, Oct. 2019. [126] D. M. Vo and A. Sugimoto, “Visual-relation conscious image gener-
[97] M. Zhu et al., “Learning deep patch representation for probabilistic graph- ation from structured-text,” in Proc. Eur. Conf. Comput. Vis., 2020,
ical model-based face sketch synthesis,” Int. J. Comput. Vis., vol. 129, pp. 290–306.
pp. 1820–1836, 2021. [127] X. Shi et al., “Convolutional LSTM network: A machine learning ap-
[98] M. Zhu et al., “Knowledge distillation for face photo–sketch synthesis,” proach for precipitation nowcasting,” in Proc. Int. Conf. Neural Inf.
IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 2, pp. 893–906, Process. Syst., 2015, pp. 802–810.
Feb. 2022. [128] T. Fang et al., “Reconstructing perceptive images from brain activity
[99] Z. Li, C. Deng, E. Yang, and D. Tao, “Staged sketch-to-image synthe- by shape-semantic GAN,” in Proc. Int. Conf. Neural Inf. Process. Syst.,
sis via semi-supervised generative adversarial networks,” IEEE Trans. 2020, pp. 13038–13048.
Multimedia, vol. 23, pp. 2694–2705, Aug. 2020. [129] S. Lin et al., “Mind reader: Reconstructing complex images from brain
[100] W. Sun and T. Wu, “Image synthesis from reconfigurable lay- activities,” 2022, arXiv:2210.01769.
out and style,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, [130] Y. Takagi and S. Nishimoto, “High-resolution image reconstruction with
pp. 10530–10539. latent diffusion models from human brain activity,” in Proc. IEEE/CVF
[101] B. Zhao, L. Meng, W. Yin, and L. Sigal, “Image generation from Conf. Comput. Vis. Pattern Recognit., 2023, pp. 14453–14463.
layout,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, [131] G. Yang and D. Ramanan, “Upgrading optical flow to 3D scene flow
pp. 8576–8585. through optical expansion,” in Proc. IEEE Conf. Comput. Vis. Pattern
[102] Y. Li, Y. Cheng, Z. Gan, L. Yu, L. Wang, and J. Liu, “BachGAN: High- Recognit., 2020, pp. 1331–1340.
resolution image synthesis from salient object layout,” in Proc. IEEE [132] Y. Endo, “User-controllable latent transformer for StyleGAN im-
Conf. Comput. Vis. Pattern Recognit., 2020, pp. 8362–8371. age layout editing,” Comput. Graph. Forum, vol. 41, pp. 395–406,
[103] Z. Li, I. Koh, Y. Tang, and L. Sun, “Image synthesis from layout with 2022.
locality-aware mask adaption,” in Proc. IEEE Int. Conf. Comput. Vis., [133] H. Tang et al., “Multi-channel attention selection GAN with cascaded
2021, pp. 13799–13808. semantic guidance for cross-view image translation,” in Proc. IEEE Conf.
[104] S. Frolov et al., “AttrLostGAN: Attribute controlled image synthesis from Comput. Vis. Pattern Recognit., 2019, pp. 2412–2421.
reconfigurable layout and style,” 2021, arXiv:2103.13722. [134] P. Zhang, B. Zhang, D. Chen, L. Yuan, and F. Wen, “Cross-domain
[105] J. Y. Koh, J. Baldridge, H. Lee, and Y. Yang, “Text-to-image generation correspondence learning for exemplar-based image translation,” in Proc.
grounded by fine-grained user attention,” in Proc. IEEE Winter Conf. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5142–5152.
Appl. Comput. Vis., 2021, pp. 237–246. [135] F. Zhan et al., “Unbalanced feature transport for exemplar-based image
[106] F. Zhan et al., “Bi-level feature alignment for versatile image translation translation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021,
and manipulation,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 224–241. pp. 15023–15033.
[107] H. Zheng et al., “Semantic layout manipulation with high-resolution [136] P. Zhu, R. Abdal, Y. Qin, and P. Wonka, “SEAN: Image synthesis with
sparse attention,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 3, semantic region-adaptive normalization,” in Proc. IEEE Conf. Comput.
pp. 3768–3782, Mar. 2023. Vis. Pattern Recognit., 2020, pp. 5103–5112,.
[108] H. Zhang et al., “StackGAN: Realistic image synthesis with stacked gen- [137] H. Tan, X. Liu, X. Li, Y. Zhang, and B. Yin, “Semantics-enhanced
erative adversarial networks,” IEEE Trans. Pattern Anal. Mach. Intell., adversarial nets for text-to-image synthesis,” in Proc. IEEE Int. Conf.
vol. 41, no. 8, pp. 1947–1962, Aug. 2019. Comput. Vis., 2019, pp. 10500–10509.
[109] S. Reed et al., “Generative adversarial text to image synthesis,” in Proc. [138] B. Li et al., “Controllable text-to-image generation,” 2019, arXiv:
Int. Conf. Mach. Learn., 2016, pp. 1060–1069. 1909.07083.
[110] S. Frolov et al., “Adversarial text-to-image synthesis: A review,” Neural [139] M. Zhu et al., “DM-GAN: Dynamic memory generative adversarial
Netw., vol. 144, pp. 187–209, 2021. networks for text-to-image synthesis,” in Proc. IEEE Conf. Comput. Vis.
[111] T. Mikolov et al., “Distributed representations of words and phrases and Pattern Recognit., 2019, pp. 5795–5803.
their compositionality,” in Proc. Int. Conf. Neural Inf. Process. Syst., [140] V. Blanz et al., “A morphable model for the synthesis of 3D faces,”
2013, pp. 3111–3119. in Proc. 26th Annu. Conf. Comput. Graph. Interactive Techn., 1999,
[112] Z. S. Harris, “Distributional structure,” Word, vol. 10, pp. 146–162, 1954. pp. 187–194.
[113] T. Wang, T. Zhang, and B. Lovell, “Faces à la carte: Text-to-face gen- [141] L. Chen et al., “Talking-head generation with rhythmic head motion,” in
eration via attribute disentanglement,” in Proc. IEEE Winter Conf. Appl. Proc. Eur. Conf. Comput. Vis., 2020, pp. 35–51.
Comput. Vis., 2021, pp. 3379–3387. [142] H. Zhou et al., “Talking face generation by adversarially disentangled
[114] D. Pavllo et al., “Controlling style and semantics in weakly-supervised audio-visual representation,” in Proc. AAAI Conf. Artif. Intell., 2019,
image generation,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 482–499. pp. 9299–9306.
[115] J. Devlin et al., “BERT: Pre-training of deep bidirectional transformers [143] S. Suwajanakorn et al., “Synthesizing Obama: Learning lip sync from
for language understanding,” 2018, arXiv: 1810.04805. audio,” ACM Trans. Graph., vol. 36, 2017, Art. no. 95.
[116] D. Harwath and J. R. Glass, “Learning word-like units from joint audio- [144] S. Wang et al., “One-shot talking face generation from single-speaker
visual analysis,” 2017, arXiv: 1701.07481. audio-visual correlation learning,” 2021, arXiv:2112.02749.
[117] D. Harwath et al., “Vision as an interlingua: Learning multilingual [145] T. Karras et al., “A style-based generator architecture for generative ad-
semantic embeddings of untranscribed speech,” in Proc. IEEE Int. Conf. versarial networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
Acoust. Speech Signal Process., 2018, pp. 4969–4973. 2019, pp. 4396–4405.
[118] J. Li et al., “Direct speech-to-image translation,” IEEE J. Sel. Topics [146] T. Karras, S. Laine, and T. Aila, “Progressive growing of GANS for
Signal Process., vol. 14, no. 3, pp. 517–529, Mar. 2020. improved quality, stability, and variation,” 2017, arXiv: 1710.10196.
[119] Y. Aytar et al., “SoundNet: Learning sound representations from un- [147] Z. Zhang, Y. Xie, and L. Yang, “Photographic text-to-image synthesis
labeled video,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2016, with a hierarchically-nested adversarial network,” in Proc. IEEE Conf.
pp. 892–900. Comput. Vis. Pattern Recognit., 2018, pp. 6199–6208.
[120] L. Chen, R. K. Maddox, Z. Duan, and C. Xu, “Hierarchical cross-modal [148] G. Yin, B. Liu, L. Sheng, N. Yu, X. Wang, and J. Shao, “Semantics
talking face generation with dynamic pixel-wise loss,” in Proc. IEEE disentangling for text-to-image generation,” in Proc. IEEE Conf. Comput.
Conf. Comput. Vis. Pattern Recognit., 2019, pp. 7824–7833. Vis. Pattern Recognit., 2019, pp. 2322–2331.
[121] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural [149] M. Cha et al., “Adversarial learning of semantic relevance in text to image
Computation, vol. 9, no. 8, pp. 1753–1780, 1997. synthesis,” in Proc. AAAI Conf. Artif. Intell., 2019, pp. 3272–3279.


[150] M. Amodio and S. Krishnaswamy, “TraVeLGAN: Image-to-image trans- [180] A. Nichol et al., “GLIDE: Towards photorealistic image generation and
lation by transformation vector learning,” in Proc. IEEE Conf. Comput. editing with text-guided diffusion models,” 2021, arXiv:2112.10741.
Vis. Pattern Recognit., 2019, pp. 8975–8984. [181] B. Kawar et al., “Imagic: Text-based real image editing with diffusion
[151] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image models,” 2022, arXiv:2210.09276.
translation using cycle-consistent adversarial networks,” in Proc. IEEE [182] C. Raffel et al., “Exploring the limits of transfer learning with a unified
Int. Conf. Comput. Vis., 2017, pp. 2242–2251. text-to-text transformer,” J. Mach. Learn. Res., vol. 21, pp. 5485–5551,
[152] H. Tang et al., “Cycle in cycle generative adversarial networks for 2020.
keypoint-guided image generation,” in Proc. ACM Int. Conf. Multimedia, [183] J. Ho et al., “Cascaded diffusion models for high fidelity image genera-
2019, pp. 2052–2060. tion,” J. Mach. Learn. Res., vol. 23, pp. 2249–2281, 2022.
[153] Q. Lao, M. Havaei, A. Pesaranghader, F. Dutil, L. D. Jorio, and T. Fevens, “Dual adversarial inference for text-to-image synthesis,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, pp. 7566–7575.
[154] Z. Chen and Y. Luo, “Cycle-consistent diverse image synthesis from natural language,” in Proc. IEEE Int. Conf. Multimedia Expo Workshops, 2019, pp. 459–464.
[155] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski, “Plug & play generative networks: Conditional iterative generation of images in latent space,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3510–3520.
[156] J. Johnson et al., “Perceptual losses for real-time style transfer and super-resolution,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 694–711.
[157] C. Wang, C. Xu, C. Wang, and D. Tao, “Perceptual adversarial networks for image-to-image transformation,” IEEE Trans. Image Process., vol. 27, no. 8, pp. 4066–4079, Aug. 2018.
[158] S. Benaim and L. Wolf, “One-sided unsupervised domain mapping,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 752–762.
[159] H. Fu, M. Gong, C. Wang, K. Batmanghelich, K. Zhang, and D. Tao, “Geometry-consistent generative adversarial networks for one-sided unsupervised domain mapping,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 2422–2431.
[160] A. V. D. Oord et al., “Representation learning with contrastive predictive coding,” 2018, arXiv:1807.03748.
[161] T. Park et al., “Contrastive learning for unpaired image-to-image translation,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 319–345.
[162] A. Andonian, T. Park, B. Russell, P. Isola, J.-Y. Zhu, and R. Zhang, “Contrastive feature loss for image prediction,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 1934–1943.
[163] H. Zhang, J. Y. Koh, J. Baldridge, H. Lee, and Y. Yang, “Cross-modal contrastive learning for text-to-image generation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 833–842.
[164] A. Brock et al., “Large scale GAN training for high fidelity natural image synthesis,” 2018, arXiv:1809.11096.
[165] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, “The unreasonable effectiveness of deep features as a perceptual metric,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 586–595.
[166] E. Richardson et al., “Encoding in style: A StyleGAN encoder for image-to-image translation,” 2020, arXiv:2008.00951.
[167] H. Wang et al., “Cycle-consistent inverse GAN for text-to-image synthesis,” in Proc. 29th ACM Int. Conf. Multimedia, 2021, pp. 630–638.
[168] Y. Jiang, Z. Huang, X. Pan, C. C. Loy, and Z. Liu, “Talk-to-edit: Fine-grained facial editing via dialog,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 13779–13788.
[169] D. Bau et al., “Paint by word,” 2021, arXiv:2103.10951.
[170] U. Kocasari et al., “StyleMC: Multi-channel based fast text-guided image generation and manipulation,” 2021, arXiv:2112.08493.
[171] X. Liu et al., “FuseDream: Training-free text-to-image generation with improved CLIP GAN space optimization,” 2021, arXiv:2112.01573.
[172] R. Gal et al., “StyleGAN-NADA: CLIP-guided domain adaptation of image generators,” 2021, arXiv:2108.00946.
[173] Y. Yu et al., “Towards counterfactual image manipulation via CLIP,” 2022, arXiv:2207.02812.
[174] J. Sohl-Dickstein et al., “Deep unsupervised learning using nonequilibrium thermodynamics,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 2256–2265.
[175] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” 2013, arXiv:1312.6114.
[176] D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in Proc. Int. Conf. Mach. Learn., 2015, pp. 1530–1538.
[177] L. Dinh et al., “Density estimation using real NVP,” 2016, arXiv:1605.08803.
[178] J. Menick and N. Kalchbrenner, “Generating high fidelity images with subscale pixel networks and multidimensional upscaling,” 2018, arXiv:1812.01608.
[179] A. Van Oord et al., “Pixel recurrent neural networks,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1747–1756.
[184] S. Gu et al., “Vector quantized diffusion model for text-to-image synthesis,” 2021, arXiv:2111.14822.
[185] Z. Tang et al., “Improved vector quantized diffusion models,” 2022, arXiv:2205.16007.
[186] P. Chahal, “Exploring transformer backbones for image diffusion models,” 2022, arXiv:2212.14678.
[187] Y. Jiang et al., “Text2Human: Text-driven controllable human image generation,” ACM Trans. Graph., vol. 41, 2022, Art. no. 162.
[188] N. Liu et al., “Compositional visual generation with composable diffusion models,” 2022, arXiv:2206.01714.
[189] O. Bar-Tal et al., “MultiDiffusion: Fusing diffusion paths for controlled image generation,” 2023, arXiv:2302.08113.
[190] A. Blattmann et al., “Retrieval-augmented diffusion models,” 2022, arXiv:2204.11824.
[191] O. Avrahami et al., “Blended diffusion for text-driven editing of natural images,” 2021, arXiv:2111.14818.
[192] Z. Zhang et al., “SINE: Single image editing with text-to-image diffusion models,” 2022, arXiv:2212.04489.
[193] F. Zhan et al., “Auto-regressive image synthesis with integrated quantization,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 110–127.
[194] P. Esser et al., “ImageBART: Bidirectional context with multinomial diffusion for autoregressive image synthesis,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 3518–3532.
[195] J. T. Rolfe, “Discrete variational autoencoders,” 2016, arXiv:1609.02200.
[196] Y. Bengio et al., “Estimating or propagating gradients through stochastic neurons for conditional computation,” 2013, arXiv:1308.3432.
[197] J. Yu et al., “Vector-quantized image modeling with improved VQGAN,” 2021, arXiv:2110.04627.
[198] W. Shin et al., “Translation-equivariant image quantizer for bi-directional image-text generation,” 2021, arXiv:2112.00384.
[199] A. Lamb et al., “Discriminative regularization for generative models,” 2016, arXiv:1602.03220.
[200] A. B. L. Larsen et al., “Autoencoding beyond pixels using a learned similarity metric,” in Proc. Int. Conf. Mach. Learn., 2016, pp. 1558–1566.
[201] X. Dong et al., “PeCo: Perceptual codebook for BERT pre-training of vision transformers,” 2021, arXiv:2111.12710.
[202] H. Bao et al., “BEiT: BERT pre-training of image transformers,” 2021, arXiv:2106.08254.
[203] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “VGGFace2: A dataset for recognising faces across pose and age,” in Proc. IEEE 13th Int. Conf. Autom. Face Gesture Recognit., 2018, pp. 67–74.
[204] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” 2020.
[205] J. Shi et al., “DiVAE: Photorealistic images synthesis with denoising diffusion decoder,” in Proc. Int. Conf. Learn. Representations, 2020.
[206] M. Ni et al., “NÜWA-LIP: Language guided image inpainting with defect-free VQGAN,” 2022, arXiv:2202.05009.
[207] A. Razavi et al., “Generating diverse high-fidelity images with VQ-VAE-2,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 14866–14876.
[208] D. Lee, C. Kim, S. Kim, M. Cho, and W.-S. Han, “Autoregressive image generation using residual quantization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11513–11522.
[209] J. Zhang et al., “Regularized vector quantization for tokenized image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2023, pp. 18467–18476.
[210] A. Baevski et al., “vq-wav2vec: Self-supervised learning of discrete speech representations,” 2019, arXiv:1910.05453.
[211] E. Jang et al., “Categorical reparameterization with Gumbel-softmax,” 2016, arXiv:1611.01144.
[212] J. Gao et al., “GET3D: A generative model of high quality 3D textured shapes learned from images,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, pp. 31841–31854.
[213] A. Van den Oord et al., “Conditional image generation with PixelCNN decoders,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 4797–4805.
[214] N. Parmar et al., “Image transformer,” in Proc. Int. Conf. Mach. Learn., 2018, pp. 4055–4064.

[215] Y. Huang et al., “A picture is worth a thousand words: A unified system for diverse captions and rich images generation,” in Proc. ACM Int. Conf. Multimedia, 2021, pp. 2792–2794.
[216] Y. Huang et al., “Unifying multimodal transformer for bi-directional image and text generation,” in Proc. ACM Int. Conf. Multimedia, 2021, pp. 1138–1147.
[217] E. Hoogeboom et al., “Argmax flows and multinomial diffusion: Towards non-autoregressive language models,” 2021, arXiv:2102.05379.
[218] H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman, “MaskGIT: Masked generative image transformer,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11305–11315.
[219] Z. Zhang et al., “M6-UFC: Unifying multi-modal controls for conditional image synthesis,” 2021, arXiv:2105.14211.
[220] Y. Yu et al., “Diverse image inpainting with bidirectional and autoregressive transformers,” in Proc. ACM Int. Conf. Multimedia, 2021, pp. 69–78.
[221] D. Liu et al., “ASSET: Autoregressive semantic scene editing with transformers at high resolutions,” ACM Trans. Graph., vol. 41, pp. 1–12, 2022.
[222] Y. Xie et al., “Neural fields in visual computing and beyond,” in Computer Graphics Forum, Hoboken, NJ, USA: Wiley, 2022.
[223] A. Jain, B. Mildenhall, J. T. Barron, P. Abbeel, and B. Poole, “Zero-shot text-guided object generation with dream fields,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 857–866.
[224] F. Hong et al., “AvatarCLIP: Zero-shot text-driven generation and animation of 3D avatars,” ACM Trans. Graph., vol. 41, pp. 1–19, 2022.
[225] T. Shen et al., “Deep marching tetrahedra: A hybrid representation for high-resolution 3D shape synthesis,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 6087–6101.
[226] D. Verbin, P. Hedman, B. Mildenhall, T. Zickler, J. T. Barron, and P. P. Srinivasan, “Ref-NeRF: Structured view-dependent appearance for neural radiance fields,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 5481–5490.
[227] K. Schwarz et al., “GRAF: Generative radiance fields for 3D-aware image synthesis,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, pp. 20154–20166.
[228] M. Niemeyer and A. Geiger, “GIRAFFE: Representing scenes as compositional generative neural feature fields,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 11448–11459.
[229] E. R. Chan et al., “pi-GAN: Periodic implicit generative adversarial networks for 3D-aware image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 5795–5805.
[230] E. Perez et al., “FiLM: Visual reasoning with a general conditioning layer,” in Proc. Conf. Assoc. Advance. Artif. Intell., 2018, pp. 3942–3951.
[231] V. Sitzmann et al., “Implicit neural representations with periodic activation functions,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, pp. 7462–7473.
[232] J. Gu et al., “StyleNeRF: A style-based 3D-aware generator for high-resolution image synthesis,” in Proc. Int. Conf. Learn. Representations, 2022.
[233] E. R. Chan et al., “Efficient geometry-aware 3D generative adversarial networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 16102–16112.
[234] K. Jo et al., “CG-NeRF: Conditional generative neural radiance fields,” 2021, arXiv:2112.03517.
[235] Y. Chen et al., “Sem2NeRF: Converting single-view semantic masks to neural radiance fields,” 2022, arXiv:2203.10821.
[236] J. Sun et al., “FENeRF: Face editing in neural radiance fields,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 7662–7672.
[237] C. Wang et al., “CLIP-NeRF: Text-and-image driven manipulation of neural radiance fields,” 2021, arXiv:2112.05139.
[238] Y. Xue et al., “Deep image synthesis from intuitive user input: A review and perspectives,” Comput. Vis. Media, vol. 8, pp. 3–31, 2022.
[239] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ADE20K dataset,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 5122–5130.
[240] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. Eur. Conf. Comput. Vis., 2014, pp. 740–755.
[241] H. Caesar, J. Uijlings, and V. Ferrari, “COCO-Stuff: Thing and stuff classes in context,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1209–1218.
[242] J. Yang et al., “Panoptic scene graph generation,” 2022, arXiv:2207.11247.
[243] M. Cordts et al., “The cityscapes dataset for semantic urban scene understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3213–3223.
[244] Z. Liu et al., “Deep learning face attributes in the wild,” in Proc. Int. Conf. Comput. Vis., 2015, pp. 3730–3738.
[245] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, “DeepFashion: Powering robust clothes recognition and retrieval with rich annotations,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 1096–1104.
[246] X. Liang et al., “Deep human parsing with active template regression,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 12, pp. 2402–2414, Dec. 2015.
[247] N. Silberman and R. Fergus, “Indoor scene segmentation using a structured light sensor,” in Proc. IEEE Int. Conf. Comput. Vis. Workshops, 2011, pp. 601–608.
[248] J. Krause et al., “3D object representations for fine-grained categorization,” in Proc. Int. Conf. Comput. Vis. Workshops, 2013, pp. 554–561.
[249] M.-E. Nilsback and A. Zisserman, “Automated flower classification over a large number of classes,” in Proc. 6th Indian Conf. Comput. Vis. Graph. Image Process., 2008, pp. 722–729.
[250] P. Welinder et al., “Caltech-UCSD birds 200,” 2010.
[251] C. Schuhmann et al., “LAION-5B: An open large-scale dataset for training next generation image-text models,” in Proc. NeurIPS Datasets Benchmarks Track, 2022, pp. 25278–25294.
[252] R. Krishna et al., “Visual genome: Connecting language and vision using crowdsourced dense image annotations,” Int. J. Comput. Vis., vol. 123, pp. 32–73, 2017.
[253] A. Nagrani et al., “VoxCeleb: A large-scale speaker identification dataset,” 2017, arXiv:1706.08612.
[254] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 3444–3453.
[255] T.-J. Fu et al., “Language-driven image style transfer,” 2021, arXiv:2106.00178.
[256] G. Kwon and J. C. Ye, “CLIPstyler: Image style transfer with a single text condition,” 2021, arXiv:2112.00374.
[257] S. Loeschcke et al., “Text-driven stylization of video objects,” 2022, arXiv:2206.12396.
[258] O. Michel, R. Bar-On, R. Liu, S. Benaim, and R. Hanocka, “Text2Mesh: Text-driven neural stylization for meshes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 13482–13492.
[259] N. Khalid et al., “Text to mesh without 3D supervision using limit subdivision,” 2022, arXiv:2203.13333.
[260] F. Mu, J. Wang, Y. Wu, and Y. Li, “3D photo stylization: Learning to generate stylized novel views from a single image,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 16252–16261.
[261] A. Noguchi and T. Harada, “RGBD-GAN: Unsupervised 3D representation learning from natural image datasets via RGBD image synthesis,” 2019, arXiv:1909.12573.
[262] J.-Y. Zhu et al., “Visual object networks: Image generation with disentangled 3D representations,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, pp. 118–129.
[263] Z. Shi et al., “3D-aware indoor scene synthesis with depth priors,” in Proc. Eur. Conf. Comput. Vis., 2022, pp. 406–422.
[264] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and Y. Yang, “HoloGAN: Unsupervised learning of 3D representations from natural images,” in Proc. Int. Conf. Comput. Vis., 2019, pp. 7587–7596.
[265] Y. Jiang et al., “TransGAN: Two pure transformers can make one strong GAN, and that can scale up,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 14745–14758.
[266] J. Park and Y. Kim, “Styleformer: Transformer based generative adversarial networks with style vector,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 8973–8982.
[267] D. A. Hudson and C. L. Zitnick, “Generative adversarial transformers,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 4487–4499.
[268] J. Cho et al., “X-LXMERT: Paint, caption and answer questions with multi-modal transformers,” 2020, arXiv:2009.11278.
[269] X. Huang et al., “Multimodal conditional image synthesis with product-of-experts GANs,” 2021, arXiv:2112.05130.
[270] Z. Tan et al., “Efficient semantic image synthesis via class-adaptive normalization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 9, pp. 4852–4866, Sep. 2022.
[271] X. Liu et al., “Learning to predict layout-to-image conditional convolutions for semantic image synthesis,” 2019, arXiv:1910.06809.
[272] Z. Zhu, Z. Xu, A. You, and X. Bai, “Semantically multi-modal image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 5466–5475.

[273] Z. Tan et al., “Diverse semantic image synthesis via probability distribution modeling,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 7958–7967.
[274] T. Hinz et al., “Semantic object accuracy for generative text-to-image synthesis,” 2019, arXiv:1910.13321.
[275] W. Li et al., “Object-driven text-to-image synthesis via adversarial training,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2019, pp. 12166–12174.
[276] Z. Wang, Z. Quan, Z.-J. Wang, X. Hu, and Y. Chen, “Text to image synthesis with bidirectional generative adversarial network,” in Proc. IEEE Int. Conf. Multimedia Expo, 2020, pp. 1–6.
[277] M. Wang et al., “End-to-end text-to-image synthesis with spatial constrains,” ACM Trans. Intell. Syst. Technol., vol. 11, pp. 1–19, 2020.
[278] R. Rombach et al., “Network-to-network translation with conditional invertible neural networks,” 2020, arXiv:2005.13580.
[279] J. Liang et al., “CPGAN: Content-parsing generative adversarial networks for text-to-image synthesis,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 491–508.
[280] M. Ding et al., “CogView2: Faster and better text-to-image generation via hierarchical transformers,” 2022, arXiv:2204.14217.
[281] J. S. Chung et al., “Deep speaker recognition,” 2018, arXiv:1806.05622.
[282] J. S. Chung and A. Zisserman, “Lip reading in the wild,” in Proc. Asian Conf. Comput. Vis., 2016, pp. 87–103.
[283] T. Salimans et al., “Improved techniques for training GANs,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2016, pp. 2234–2242.
[284] M. Heusel et al., “GANs trained by a two time-scale update rule converge to a local Nash equilibrium,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 6629–6640.
[285] S. Hong, D. Yang, J. Choi, and H. Lee, “Inferring semantic layout for hierarchical text-to-image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7986–7994.
[286] J. S. Chung and A. Zisserman, “Out of time: Automated lip sync in the wild,” in Proc. Asian Conf. Comput. Vis., 2016, pp. 251–263.
[287] T. Xiao et al., “Unified perceptual parsing for scene understanding,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 418–434.
[288] F. Yu, V. Koltun, and T. Funkhouser, “Dilated residual networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 636–644.
[289] L.-C. Chen et al., “Semantic image segmentation with deep convolutional nets and fully connected CRFs,” 2014, arXiv:1412.7062.
[290] B. Liang et al., “Expressive talking head generation with granular audio-visual control,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 3377–3386.
[291] X. Ji et al., “EAMM: One-shot emotional talking face via audio-based emotion-aware motion model,” 2022, arXiv:2205.15278.
[292] R. Wu et al., “Audio-driven talking face generation with diverse yet realistic facial animations,” 2023, arXiv:2304.08945.
[293] S. Wang et al., “One-shot talking face generation from single-speaker audio-visual correlation learning,” in Proc. Conf. Assoc. Advance. Artif. Intell., 2022, pp. 2531–2539.
[294] X. Wu et al., “Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis,” 2023, arXiv:2306.09341.
[295] Y. Kirstain et al., “Pick-a-Pic: An open dataset of user preferences for text-to-image generation,” 2023, arXiv:2305.01569.
[296] V. Jayaram and J. Thickstun, “Parallel and flexible sampling from autoregressive models via Langevin dynamics,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 4807–4818.
[297] T. Dockhorn et al., “Score-based generative modeling with critically-damped Langevin diffusion,” 2021, arXiv:2112.07068.
[298] Y. Song et al., “Consistency models,” in Proc. Int. Conf. Mach. Learn., 2023, pp. 32211–32252.
[299] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 248–255.
[300] I. Skorokhodov et al., “3D generation on ImageNet,” in Proc. Int. Conf. Learn. Representations, 2023.
[301] Q. Xu et al., “Point-NeRF: Point-based neural radiance fields,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 5428–5438.
[302] J. Bailey, “The tools of generative art, from flash to neural networks,” Art Amer., vol. 8, 2020, Art. no. 1.
[303] Y. Mirsky and W. Lee, “The creation and detection of deepfakes: A survey,” ACM Comput. Surv., vol. 54, pp. 1–41, 2021.

Fangneng Zhan received the PhD degree in computer science and engineering from Nanyang Technological University. He is a postdoctoral researcher with the Max Planck Institute for Informatics. His research interests include generative models and neural rendering. He serves as a reviewer or program committee member for top journals and conferences including IEEE Transactions on Pattern Analysis and Machine Intelligence, ICLR, ICML, NeurIPS, CVPR, and ICCV.

Yingchen Yu is currently working toward the PhD degree with the School of Computer Science and Engineering, Nanyang Technological University, under the Alibaba Talent Programme. His research interests are image synthesis and manipulation.

Rongliang Wu received the PhD degree from the School of Computer Science and Engineering, Nanyang Technological University. His research interests include computer vision and deep learning, specifically facial expression analysis and generation.

Jiahui Zhang is currently working toward the PhD degree with the School of Computer Science and Engineering, Nanyang Technological University. His research interests include computer vision and machine learning.

Shijian Lu received the PhD degree in electrical and computer engineering from the National University of Singapore. He is an associate professor with the School of Computer Science and Engineering, Nanyang Technological University. His research interests include computer vision and deep learning. He has published more than 100 internationally refereed journal and conference papers. He is currently an associate editor for the journals Pattern Recognition and Neurocomputing.

Lingjie Liu received the PhD degree from the University of Hong Kong, in 2019. She is the Aravind K. Joshi assistant professor with the Department of Computer and Information Science, University of Pennsylvania. Before that, she was a Lise Meitner postdoctoral researcher with the Visual Computing and AI Department, Max Planck Institute for Informatics. Her research interests are neural scene representations, neural rendering, human performance modeling and capture, and 3D reconstruction.

Adam Kortylewski received the PhD degree from the University of Basel with Thomas Vetter. He is a research group leader with the University of Freiburg and the Max Planck Institute for Informatics, where he leads the Generative Vision and Robust Learning Lab. Before that, he was a postdoc with Alan Yuille at Johns Hopkins University for three years. His research focuses on understanding the principles that enable artificial intelligence systems to reliably perceive our world through images. He was awarded the prestigious Emmy Noether Grant (2022) of the German Science Foundation for exceptionally qualified early career researchers.

Christian Theobalt is a professor of computer science and the director of the department “Visual Computing and Artificial Intelligence” with the Max Planck Institute for Informatics, Germany. He is also a professor with Saarland University. His research lies on the boundary between Computer Vision and Computer Graphics. He received several awards, for instance the Otto Hahn Medal of the Max Planck Society (2007), the EUROGRAPHICS Young Researcher Award (2009), the German Pattern Recognition Award (2012), an ERC Starting Grant (2013), an ERC Consolidator Grant (2017), and the Eurographics Outstanding Technical Contributions Award (2020). In 2015, he was elected one of Germany’s top 40 innovators under 40 by the magazine Capital.

Eric Xing (Fellow, IEEE) received the PhD degree in computer science from the University of California at Berkeley, Berkeley, California, in 2004. He is currently a professor of machine learning with the School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania. His principal research interests lie in the development of machine learning and statistical methodology, especially for solving problems involving automated learning, reasoning, and decision-making in high-dimensional, multimodal, and dynamic possible worlds in social and biological systems. He is a member of the DARPA Information Science and Technology (ISAT) Advisory Group and was the program chair of the International Conference on Machine Learning (ICML) 2014. He is also an associate editor of The Annals of Applied Statistics (AOAS), the Journal of the American Statistical Association (JASA), IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), and PLOS Computational Biology, and an action editor of the Machine Learning Journal (MLJ) and the Journal of Machine Learning Research (JMLR).
