
AI-Generated Content (AIGC) for Various Data Modalities: A Survey

LIN GENG FOO, Singapore University of Technology and Design, Singapore


HOSSEIN RAHMANI, Lancaster University, United Kingdom
JUN LIU∗ , Singapore University of Technology and Design, Singapore
AI-generated content (AIGC) methods aim to produce text, images, videos, 3D assets, and other media using AI algorithms. Due to its wide range of applications
and the demonstrated potential of recent works, AIGC developments have been attracting significant attention, and AIGC methods have been developed for
various data modalities, such as image, video, text, 3D shape (as voxels, point clouds, meshes, and neural implicit fields), 3D scene, 3D human avatar (body and
head), 3D motion, and audio – each presenting different characteristics and challenges. Furthermore, there have also been many significant developments in
cross-modality AIGC methods, where generative methods can receive conditioning input in one modality and produce outputs in another. Examples include going
from various modalities to image, video, 3D shape, 3D scene, 3D avatar (body and head), 3D motion (skeleton and avatar), and audio modalities. In this paper, we
provide a comprehensive review of AIGC methods across different data modalities, including both single-modality and cross-modality methods, highlighting the
various challenges, representative works, and recent technical directions in each setting. We also survey the representative datasets throughout the modalities,
and present comparative results for various modalities. Moreover, we discuss the challenges and potential future research directions.
CCS Concepts: • Computing methodologies → Computer vision; Neural networks.
Additional Key Words and Phrases: AI Generated Content (AIGC), Deep Generative Models, Data Modality, Single Modality, Cross-modality
ACM Reference Format:
Lin Geng Foo, Hossein Rahmani, and Jun Liu. 2023. AI-Generated Content (AIGC) for Various Data Modalities: A Survey. 1, 1 (October 2023), 35 pages.
https://doi.org/XXXXXXX.XXXXXXX

1 INTRODUCTION
Amidst the rapid advancement of artificial intelligence (AI), the development of content generation techniques stands out as one of the most
captivating and widely discussed topics in the field. AI-generated content (AIGC) encompasses the production of text, images, videos, 3D
assets, and other media through AI algorithms, which enables automation of the content creation process. Such automation facilitated by
AIGC significantly reduces the need for human labor and lowers the costs of content creation, fundamentally reshaping industries such as
advertising [148, 234, 673], education [33, 525, 793], code development [101, 112, 581], and entertainment [105, 497, 517].
In the earlier days of AIGC, developments and approaches mainly involved a single modality, where the inputs (if any) and outputs of the generation model both share the same modality. The seminal work [218] by Goodfellow et al. first introduced Generative Adversarial Networks (GANs), which were in principle capable of training deep neural networks to generate images that were difficult to distinguish from images in the training dataset. This demonstration of generative capabilities by deep neural networks led to extensive developments on single-modality generation for images [218, 260, 338], as well as various other modalities, such as videos [666, 684], text [63, 532, 533], 3D shapes (as voxels [558, 723], point clouds [6, 757], meshes [608, 672] and neural implicit fields [115, 496]), 3D scenes [458, 473], 3D avatars (full bodies [110, 803] and heads [271, 771]), 3D motions [265, 529], audio [311, 487], and so on. Moreover, such developments have continued consistently over the years, with the number of works published in the field increasing steadily every year, as shown in Fig. 1(a). Although generative models for each modality share some similar approaches and principles, they also encounter unique challenges. Consequently, the methods and designs for generative models of each modality are specially dedicated to addressing these distinct challenges.

Fig. 1. General trend of the number of papers published regarding a) single-modality and b) cross-modality generation/editing every year over the past five years in six top conferences (CVPR, ICCV, ECCV, NeurIPS, ICML, and ICLR). a) There is an increasing trend over the years for AIGC papers published regarding image, video and 3D (shape, human and scene) generation for a single modality. b) There is an observable spike over the last 2 years for papers published regarding cross-modality (text-to-image, text-to-video and text-to-3D) generation.
∗ Corresponding author.

Authors’ addresses: Lin Geng Foo, Singapore University of Technology and Design, Singapore, 487372; Hossein Rahmani, Lancaster University, Lancaster, United Kingdom, LA1 4YW; Jun
Liu, Singapore University of Technology and Design, Singapore, 487372.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or
commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from
[email protected].
© 2023 Association for Computing Machinery.
XXXX-XXXX/2023/10-ART $15.00
https://doi.org/XXXXXXX.XXXXXXX


Fig. 2. Taxonomy of a) single-modality AIGC methods and b) cross-modality AIGC methods in this survey.

Recently, there has also been a rapid development of AIGC involving multiple modalities, where the input and output modalities differ.
This cross-combination of modalities grants users greater control over what outputs to generate, e.g., generating desired images with input
text descriptions, or generating a personalized 3D human avatar from RGB images or videos. However, such cross-modality methods are
generally more challenging to train as there can be a large gap between the representations of different modalities. Moreover, they often
require larger datasets that include paired data from multiple modalities in order to effectively capture the diverse relationships between
them. Notably, recent works such as Stable Diffusion [561], Make-A-Video [605], and DreamFusion [520] have further demonstrated the
impressive capabilities of AIGC in receiving text prompts and delivering remarkable outputs in various modalities that can rival human
craftsmanship, which have inspired a large increase in the number of works in the field, as shown in Fig. 1(b). These recent advancements
exhibit the potential of AIGC in various modalities, while simultaneously opening new and exciting avenues for cross-modality content
generation.
Therefore, recognizing the diverse nature of generative models across different modalities and the timely significance of cross-modality
generation in light of recent advances, we review existing AIGC methods from this perspective. Specifically, we comprehensively review the
single-modality methods across a broad range of modalities, while also reviewing the latest cross-modality methods which lay the foundation
for future works. We discuss the challenges in each modality and setting, as well as the representative works and recent technical directions.
The main contributions are summarized as follows:
• To the best of our knowledge, this is the first survey paper that comprehensively reviews AIGC methods from the perspective of
modalities, including: image, video, text, 3D shape (as voxels, point clouds, meshes and neural implicit fields), 3D scene, 3D avatar (full
human and head), 3D motion, and audio modality. Since we focus on modalities, we further categorize the settings in each modality
based on the input conditioning information.
• We comprehensively review the cross-modality AIGC methods, including cross-modality image, video, 3D shape, 3D scene, 3D avatar
(full body and head), 3D motion (skeleton and avatar), and audio generation.
• We focus on reviewing the most recent and advanced AIGC methods to provide readers with the state-of-the-art approaches.
The paper is organized as follows. First, we compare our survey with existing works. Next, we introduce the generative models involving
a single modality, by introducing each modality separately. Since we focus on modalities, we further categorize the methods in each modality
according to whether they are an unconditional approach (where no constraints are provided regarding what content to generate), or
according to the type of conditioning information required. A taxonomy of these single-modality methods can be found in Fig. 2(a). Then, we
introduce the cross-modality AIGC approaches, and a taxonomy for these methods can be found in Fig. 2(b). Next, we survey the datasets and
benchmarks throughout various modalities. Lastly, we discuss the challenges of existing AIGC methods and possible future directions.


2 COMPARISON WITH EXISTING SURVEYS


Due to the importance and popularity of AIGC, there have been several previous attempts at surveying AIGC from various perspectives. Wu
et al. [714] surveyed the pros and cons of AIGC as a whole, as well as some potential challenges and applications. A few works [542, 761, 796]
specifically discuss ChatGPT and large language models, along with their potential impacts and challenges. Gozalo et al. [219] introduce a few
selected works across several modalities. Xu et al. [736] survey AIGC works in the context of mobile and cloud computing. Abdollahzadeh
et al. [4] focus on generative techniques using less data. Several works [75, 798, 825] focus mainly on surveying image and text generation
methods. Wang et al. [700] discuss the security, privacy and ethical issues of AIGC, as well as some solutions to these issues.
Differently, we discuss AIGC over a broad range of modalities, for many single-modality settings as well as for many cross-modality
settings. We introduce the unique challenges, representative works and technical directions of various settings and tasks within a broad
range of single-modality and cross-modality scenarios. We organize and catalog these single-modality and cross-modality settings according
to their generated output, as well as their conditioning input, in a unified and consistent manner. Furthermore, we provide standardized comparisons
and tables of results for AIGC methods throughout many modalities, which is much more extensive than previous works.

3 IMAGE MODALITY
The image modality was the earliest to undergo developments for deep generative modeling, and often forms the test bed for many foundational
techniques such as Generative Adversarial Networks (GANs), Variational Auto-Encoders (VAEs), Normalizing Flows (NFs) and Denoising
Diffusion Models (DMs). This is due to several reasons, including the ready availability of image data, the relative simplicity of images
compared to other modalities (e.g., videos or 3D data), and the ease and efficiency of modeling the grid-like RGB data with Convolutional
Neural Networks (CNNs).
Initial attempts at image generation with deep learning faced a myriad of difficulties. For instance, many methods faced training instability,
which is particularly evident in GANs with the risk of mode collapse. Additionally, modeling long-range dependencies and efficiently scaling
up image resolution posed significant difficulties. Besides, generating diverse images was also challenging. However, progress over the
years has mostly overcome these issues, making it relatively easy to train image generation models to produce diverse and high-quality
images that are often difficult to tell apart from real images with the naked eye. Below, we first discuss the unconditional methods, followed
by the conditional methods where various constraints are applied to the generation process.

3.1 Unconditional Image Generation


We discuss the representative methods for unconditional image generation. In general, these unconditional methods are the fundamental techniques behind image generation which seek to improve quality of generated images, training stability and efficiency. As these methods tend to be fundamental improvements, they are often applicable to conditional generation as well, with only minor modifications. Below, we split the methods into four categories: Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), Normalizing Flows (NFs) and Denoising Diffusion Models (DMs). Tab. 1 shows a summary and comparison of the representative methods in each category.

Table 1. Comparison between the quality of images generated by various deep generative models. Results are reported on CIFAR-10 (32 × 32) [345], ImageNet (128 × 128) [568], LSUN Cat and Bedroom (256 × 256) [781], CelebA-HQ (1024 × 1024) [317], and FFHQ (1024 × 1024) [320] datasets. We report the Fréchet Inception Distance (FID) metric [256] for all datasets and the Negative Log-Likelihood (NLL) metric (measured in terms of bits/dimension) for CIFAR-10, where lower is better for both metrics. FID measures the similarity between real and generated images, while NLL measures the ability of the generative model to represent the data distribution.

Type | Method | CIFAR-10 FID(↓) | CIFAR-10 NLL(↓) | LSUN Cat FID(↓) | LSUN Bedroom FID(↓) | CelebA-HQ FID(↓) | FFHQ FID(↓)
GANs | DCGAN [531] | 37.11 | - | - | - | - | -
GANs | WGAN-GP [228] | 36.40 | - | - | - | - | -
GANs | ProGAN [317] | 15.52 | - | 37.52 | 8.34 | 7.30 | 8.04
GANs | SNGAN [463] | 21.7 | - | - | - | - | -
GANs | StyleGAN [320] | - | - | 8.53 | - | 5.17 | 4.40
GANs | SAGAN [799] | - | - | - | - | - | -
GANs | BigGAN [61] | 14.73 | - | - | - | - | -
GANs | StyleGAN2 [321] | 11.1 | - | 6.93 | - | - | 2.84
GANs | StyleGAN3-T [319] | - | - | - | - | - | 2.79
GANs | GANsformer [284] | - | - | - | 6.51 | - | 7.42
GANs | TransGAN [303] | 9.26 | - | - | - | - | -
GANs | HiT [818] | - | - | - | - | 8.83 | 6.37
GANs | ViTGAN [356] | 6.66 | - | - | - | - | -
GANs | StyleSwin [795] | - | - | - | - | 4.43 | 5.07
VAEs | VAE [338] | - | ≤ 4.54 | - | - | - | -
VAEs | VLAE [111] | - | ≤ 2.95 | - | - | - | -
VAEs | IAF-VAE [337] | - | ≤ 3.11 | - | - | - | -
VAEs | Conv Draw [221] | - | ≤ 3.58 | - | - | - | -
VAEs | VQ-VAE [677] | - | ≤ 4.67 | - | - | - | -
VAEs | 𝛿-VAE [543] | - | ≤ 2.83 | - | - | - | -
VAEs | NVAE [670] | 23.49 | ≤ 2.91 | - | - | - | -
NFs | NICE [161] | - | 4.48 | - | - | - | -
NFs | RealNVP [162] | - | 3.49 | - | - | - | -
NFs | Glow [336] | 46.90 | 3.35 | - | - | - | -
NFs | i-Resnet [43] | 65.01 | 3.45 | - | - | - | -
NFs | FFJORD [709] | - | 3.40 | - | - | - | -
NFs | Residual Flow [103] | 46.37 | 3.28 | - | - | - | -
NFs | Flow++ [259] | - | 3.08 | - | - | - | -
NFs | DenseFlow [220] | 34.90 | 2.98 | - | - | - | -
DMs | DDPM [260] | 3.17 | ≤ 3.70 | - | 4.90 | - | -
DMs | NCSN [624] | 25.32 | - | - | - | - | -
DMs | NCSNv2 [625] | 10.87 | - | - | - | - | -
DMs | Improved DDPM [478] | 11.47 | ≤ 2.94 | - | - | - | -
DMs | VDM [335] | 4.00 | ≤ 2.65 | - | - | - | -
DMs | Score SDE (NCSN++) [626] | 2.20 | - | - | - | - | -
DMs | Score SDE (DDPM++) [626] | 2.92 | ≤ 2.99 | - | - | - | -
DMs | LSGM (balanced) [671] | 2.17 | ≤ 2.95 | - | - | - | -
DMs | EDM [318] | 1.97 | - | - | - | - | -
DMs | Consistency Distillation [622] | 2.93 | - | 8.84 | 5.22 | - | -

Generative Adversarial Networks (GANs) have been a very popular choice for image generation since their introduction by Goodfellow et al. [218]. GANs are trained in an adversarial manner, where a generator generates synthetic images and gets gradient updates from a discriminator that tries to determine if the images are real or fake. One major drawback of GANs is their instability during training, and the tendency for mode collapse to occur, where the generators stick to generating only one or a limited set of images. Since the introduction of GANs [218], many improvements have been proposed [577], and here we discuss some of the representative methods.

Many prominent approaches improve the architecture and design of the generator and discriminator, which are important for smoothly generating high-resolution images. DCGAN [531] is

a generative model with a deep convolutional structure, which follows a set of proposed architectural guidelines for better stability. ProGAN
[317] proposes a training methodology for GANs where both the generator and discriminator are progressively grown, which improves
quality, stability and variation of outputs. BigGAN [61] explores large-scale training of GANs and proposes a suite of tricks to improve the
generator and discriminator. SAGAN [799] proposes a self-attention module to model long range, multi-level dependencies across image
regions. StyleGAN [320] proposes a generator architecture that allows it to learn to separate high-level attributes in the generated images,
which gives users more control over the generated image. StyleGAN2 [321] re-designs the generator of StyleGAN and proposes an alternative
to progressive growing [317], leading to significant quality improvements. Subsequently, StyleGAN3 [319] proposes small architectural
changes to tackle aliasing and obtain a rotation equivariant generator. More recently, several works explore Transformer architectures
[680] for GANs. GANsformer [284] explores an architecture combining bipartite attention with convolutions, which enables long-range
interactions across the image. TransGAN [303] explores a pure transformer-based architecture for GANs. ViTGAN [356] integrates the Vision
Transformer (ViT) architecture [165] into GANs. HiT [818] improves the scaling of its Transformer architecture, allowing it to synthesize
high definition images. StyleSwin [795] proposes to leverage Swin Transformers [408] in a style-based architecture, which is scalable and
produces high-quality images.
Another line of work focuses on exploring GAN loss functions, hyperparameters and regularization. InfoGAN [108] aims to maximize
the mutual information. LSGAN [437] explores the least squares loss function. EBGAN [817] views the discriminator as an energy function
which is trained via an absolute deviation loss with margin. WGANs [24] minimizes an efficient approximation of the Earth Mover’s distance,
and is further improved in WGAN-GP [228] with a gradient penalty. BEGAN [46] proposes a loss based on the Wasserstein distance for
training auto-encoder based GANs such as EBGAN [817]. SNGAN [463] proposes spectral normalization to stabilize training. Other works
[418, 450] discuss regularization strategies and hyperparameter choices for training GANs.
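To make the adversarial training procedure described above concrete, a minimal PyTorch-style sketch of one GAN training step is given below; the network architectures, the binary cross-entropy loss choice and the hyperparameters are illustrative placeholders rather than those of any specific method surveyed here.

```python
import torch
import torch.nn as nn

# Illustrative toy generator/discriminator; real architectures (e.g., DCGAN, StyleGAN) are far larger.
G = nn.Sequential(nn.Linear(128, 1024), nn.ReLU(), nn.Linear(1024, 3 * 32 * 32), nn.Tanh())
D = nn.Sequential(nn.Linear(3 * 32 * 32, 1024), nn.LeakyReLU(0.2), nn.Linear(1024, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4, betas=(0.5, 0.999))
bce = nn.BCEWithLogitsLoss()

def gan_training_step(real_images):
    """real_images: (B, 3*32*32) tensor with values in [-1, 1]."""
    b = real_images.size(0)
    z = torch.randn(b, 128)

    # 1) Discriminator update: classify real images as 1 and generated images as 0.
    fake_images = G(z).detach()  # detach so only D receives gradients in this step
    d_loss = bce(D(real_images), torch.ones(b, 1)) + bce(D(fake_images), torch.zeros(b, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Generator update: fool the discriminator into predicting "real" for generated images.
    g_loss = bce(D(G(z)), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Most of the GAN variants discussed above keep this alternating update structure and differ mainly in the architectures, loss functions and regularizers that are plugged into it.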
Variational Autoencoders (VAEs) [338, 554] are a variational Bayesian approach to image generation, which learn to encode images
to a probabilistic latent space and reconstruct images from that latent space. Then, new images can be generated by sampling from the
probabilistic latent space and decoding the samples. VAEs are generally more stable to train than GANs, but struggle to achieve high resolutions and
tend to display blurry artifacts.
Many works have been proposed to improve VAEs, especially the quality of generated images. IWAE [64] explores a strictly tighter
log-likelihood lower bound that is derived from importance weighting. VLAE [111] proposes to learn global representations by combining
VAEs with neural autoregressive models. Conv Draw [221] adopts a hierarchical latent space and a fully convolutional architecture for better
compression of high-level concepts. PixelVAE [229] proposes to use an autoregressive decoder based on PixelCNN [676]. 𝛽-VAE [257] aims to
discover interpretable and factorised latent representations from unsupervised training with raw image data, which is further improved in
[65]. WAE [657] uses a loss based on the Wasserstein distance between the model and target distributions. VampPrior [658] introduces a new
type of prior consisting of a mixture distribution. VQ-VAE [677] learns discrete latent representations by incorporating ideas from vector
quantisation into VAEs. VQ-VAE 2 [544] scales and enhances the autoregressive priors used in VQ-VAE, which generates higher quality
samples. 𝛿-VAE [543] prevents posterior collapse that can come with the use of a powerful autoregressive decoder, to ensure meaningful
training of the encoder. RAE [212] proposes regularization schemes and an ex-post density estimation scheme. NVAE [670] proposes a deep
hierarchical VAE with a carefully designed network architecture.
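As a concrete illustration of the encode/sample/decode pipeline described above, below is a minimal VAE sketch with the reparameterization trick and the standard ELBO objective (reconstruction term plus KL term); the layer sizes and architecture are illustrative only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim=3 * 32 * 32, z_dim=64):
        super().__init__()
        self.enc = nn.Linear(x_dim, 256)
        self.mu, self.logvar = nn.Linear(256, z_dim), nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, x_dim), nn.Sigmoid())

    def forward(self, x):
        # x: (B, x_dim) images flattened to [0, 1]
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        return self.dec(z), mu, logvar

def negative_elbo(x, x_recon, mu, logvar):
    # Negative ELBO = reconstruction error + KL(q(z|x) || N(0, I)).
    recon = F.binary_cross_entropy(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Sampling new images: decode latent codes drawn from the prior N(0, I).
# x_new = TinyVAE().dec(torch.randn(16, 64))
```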
Normalizing Flows (NFs) [161, 553, 640, 641] transform a simple probability distribution into a richer and more complex distribution
(e.g., image data distribution) via a series of invertible and differentiable transformations. After learning the data distribution, images can be
generated by sampling from the initial density and applying the learned transformations. NFs hold the advantage of being able to produce
tractable distributions due to their invertibility, which allows for efficient image sampling and exact likelihood evaluation. However, it can be
difficult for NFs to achieve high resolutions and high image quality.
Rezende et al. [553] use NFs for variational inference and also develop categories of finite and infinitesimal flows. NICE [161] proposes to
learn complex non-linear transformations via deep neural networks. IAF [337] proposes a new type of autoregressive NF that scales well to
high-dimensional latent spaces. RealNVP [162] defines a class of invertible functions with tractable Jacobian determinant, by making the
Jacobian triangular. MNF [417] introduces multiplicative NFs for variational inference with Bayesian neural networks. SNF [675] proposes to
use Sylvester’s determinant identity to remove the single-unit bottleneck from planar flows [553] to improve flexibility. Glow [336] defines a
generative flow using an invertible 1 × 1 convolution. Emerging convolutions [274] generalize it to invertible 𝑑 × 𝑑 convolutions, which
significantly improves the generative flow model. Neural ODEs [104] introduced continuous normalizing flows, which is extended in FFJORD
[709] with an unbiased stochastic estimator of the likelihood that allows completely unrestricted architectures. i-Resnets [43] introduce a
variant of ResNet [247] with a different normalization scheme that is invertible, and use them to define a NF for image generation. Some
other recent advancements in NF methods include Flow++ [259], Residual Flow [103], Neural Flows [48], DenseFlow [220] and Rectified Flow
[406] which further improve the quality of the generated image.
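To illustrate the invertible, tractable-Jacobian transformations that these flow models are built from, here is a simplified sketch of a RealNVP-style affine coupling layer; the dimensions and the small network predicting the scale and shift are illustrative.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """y1 = x1;  y2 = x2 * exp(s(x1)) + t(x1).  Invertible, with log|det J| = sum(s(x1))."""
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * (dim - self.half)))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=1)
        s = torch.tanh(s)                 # keep the predicted log-scales bounded for stability
        y2 = x2 * torch.exp(s) + t
        log_det = s.sum(dim=1)            # contribution to the exact log-likelihood
        return torch.cat([x1, y2], dim=1), log_det

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.net(y1).chunk(2, dim=1)
        s = torch.tanh(s)
        x2 = (y2 - t) * torch.exp(-s)     # exact inverse of the forward transformation
        return torch.cat([y1, x2], dim=1)
```

Stacking many such couplings (with permutations or invertible 1 × 1 convolutions in between, as in Glow) and maximizing the exact log-likelihood trains the flow; sampling simply inverts the stack starting from Gaussian noise.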
Denoising Diffusion Models (DMs) [260, 617] generate images by iteratively “denoising” random noise to produce an image from
the desired data distribution. Such an approach can also be seen as estimating the gradients (i.e., score function) of the data distribution
[624]. DMs have become very popular recently as they tend to be more stable during training and avoid issues like mode collapse of GANs.
Furthermore, DMs also tend to produce high-quality samples and scale well to higher resolutions, while also being able to sample diverse
images. However, due to their iterative approach, DMs may require longer training and inference times as compared to other generative
models.
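A minimal sketch of this iterative denoising idea, in the spirit of the DDPM formulation [260], is given below; the noise-prediction network, the noise schedule and the choice of reverse-process variance are simplified placeholders.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # simple linear noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)     # cumulative product of alphas

def ddpm_training_loss(model, x0):
    """Sample a timestep, noise the clean images x0 (B, C, H, W), and regress the added noise.

    model is assumed to take (x_t, t) and predict the noise epsilon.
    """
    b = x0.size(0)
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].view(b, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps   # forward diffusion q(x_t | x_0)
    return ((model(x_t, t) - eps) ** 2).mean()     # simple epsilon-prediction loss

@torch.no_grad()
def reverse_step(model, x_t, t):
    """One step of the reverse ("denoising") process p(x_{t-1} | x_t) for integer timestep t."""
    eps_hat = model(x_t, torch.full((x_t.size(0),), t, dtype=torch.long))
    mean = (x_t - betas[t] / (1 - alpha_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + betas[t].sqrt() * noise          # simple choice of reverse variance: beta_t
```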
DDPM [260] designs a denoising diffusion probabilistic model which builds upon diffusion probabilistic models [617] and incorporates
the ideas of denoising over multiple steps from NCSN [624]. NCSN v2 [625] introduces a way to scale noise at each step for more effective


score-based modeling, which leads to improvements in image quality. Score-SDE [626] proposes a stochastic differential equation (SDE)
to gradually remove the noise, and use numerical SDE solvers to generate image samples. DDIM [620] significantly improves efficiency
by accelerating the sampling process, introducing a class of iterative implicit probabilistic models that can use significantly less sampling
steps. Improved DDPM [478] proposes to learn the variance used in the reverse process and design a cosine noise schedule to improve
upon DDPM [260]. D3PM [28] explores discrete diffusion models for quantized images, which generalizes the Multinomial diffusion model
[272]. Analytic-DPM [37] presents the analytic forms of both the optimal reverse variance and the corresponding optimal KL divergence
of a DM, and uses it to propose a training-free inference framework that improves the efficiency of a pre-trained DM. Karras et al. [318]
investigate the sampling processes and the networks’ inputs, outputs and loss functions, which enable significant improvements in quality
of synthesized images. Progressive Distillation [578] reduces the sampling time of DMs by progressively distilling a pre-trained DM into a
smaller model requiring less steps. DDSS [706] treats the design of fast diffusion model samplers as a differentiable optimization problem, and
finds fast samplers that improve the image quality and sampling efficiency. Besides, some works explore combining multiple DMs for better
performance, e.g., Cascaded DMs [261] or Composable DMs [400].
Another line of work investigates integrating DMs with ideas from other types of generative models. VDMs [335] propose variational
diffusion models which optimize the diffusion process parameters jointly with the rest of the model, which makes it a type of VAE. LSGM
[671] leverages a VAE framework to map the image data into latent space, before applying a score-based generative model in the latent
space. Denoising Diffusion GANs [728] model the denoising distributions with conditional GANs, achieving a speed-up compared to current
diffusion models while outperforming GANs in terms of sample diversity. DiffFlow [810] is proposed to extend NFs by adding noise to the
sampling trajectories, while also making the forward diffusion process trainable. ScoreFlow [623] proposes to train score-based diffusion
models with maximum likelihood training (often used for training continuous NFs [104, 709]) to achieve better likelihood values. Consistency
models [622] have been proposed as a new family of models to directly generate images from noise in a single step, which borrows ideas
from both DMs [318] and NF-based models [48, 104].

3.2 Conditional Image Generation


In order to control the content of the generated image, various types of conditioning information can be provided as input to the generative models. Below, we first discuss the methods that generate images based on class information, which is one of the basic ways to control the generated content. Then, we discuss the usage of input images as conditioning information (for image inpainting or image-to-image translation), as well as with user inputs. A summary of these settings is shown in Fig. 3. Note that the conditioning with other modalities (e.g., text) has been left to Sec. 12.

Fig. 3. Illustration of various conditional image generation settings. Examples obtained from [262, 492, 570, 782].

Class-conditional Generation aims to produce images containing the specified class by conditioning the generative model on the class labels (e.g., “dog” or “cat”), and is one of the basic approaches for conditional image generation. Below we discuss some representative works, including
some fundamental developments in conditional generation.
Conditional GANs (cGANs) [459] are the first to extend GANs [218] to a conditional setting, where both the generator and discriminator
are conditioned on additional information (e.g., class labels that are encoded as one-hot vectors). AC-GAN [483] introduces an auxiliary
classifier to add more structure to the GAN latent space, which is improved in [464] with a projection to improve the diversity of the
generated samples for each class. PPGN [472] aims to generate images at test time according to a replaceable condition classifier network in a
plug-and-play manner. Conditional Instance Normalization [168] and CBN [145] propose conditional normalization mechanisms as a way to
modulate the intermediate feature representations based on the conditioning information. MSGAN [434] aims to mitigate the mode collapse
issue for cGANs. Score-SDE [626] explores a score-based generative approach for conditional generation, where the conditional reverse-time
SDE can be estimated from unconditional scores. To guide the DM’s generation process, classifier guided diffusion sampling [154] proposes
to use the gradients of a pre-trained image classifier. To remove the need for a classifier, classifier-free guided sampling [262] is proposed
as a way to train a conditional and unconditional model by training with the class information as inputs and randomly dropping the class
information during training, which provides control over the generation process. Self-guided Diffusion Models [276] aim to eliminate the
need for labeled image data by creating self-annotations with a self-supervised feature extractor. Meng et al. [447] introduce a distillation
approach to distill classifier-free guided DMs to obtain models that are more efficient to sample from.
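As a concrete illustration of classifier-free guided sampling described above, the sketch below combines the conditional and unconditional noise predictions of a single jointly trained model; the model interface, the null label and the guidance weight are illustrative assumptions rather than a specific published API.

```python
import torch

def classifier_free_guided_eps(model, x_t, t, class_label, null_label, guidance_weight=3.0):
    """Combine conditional and unconditional noise predictions from one jointly trained model.

    During training, class_label is randomly replaced by null_label (e.g., a fixed fraction of
    the time), so the same network learns both the conditional and the unconditional prediction.
    guidance_weight = 1 recovers the plain conditional prediction; larger values strengthen
    the influence of the class condition.
    """
    eps_cond = model(x_t, t, class_label)    # noise prediction given the class
    eps_uncond = model(x_t, t, null_label)   # noise prediction with the condition dropped
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```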
Image Inpainting is where a model takes as input an image that is incomplete in some way (e.g., missing pixels), and tries to complete the
image. Many approaches are designed to better leverage the context surrounding the missing patch, and to handle patches of various shapes.
Pathak et al. [501] are the first to adopt GANs for image inpainting, and introduce a Context Encoder to generate the contents of a region
based on the surrounding context. Iizuka et al. [290] propose to use a combination of global and local context discriminators to complete the
image with better local and global consistency. Shift-Net [752] aims to use the encoder features of the known regions in the missing regions,
and introduces a shift-connection layer to its model architecture for feature rearrangement. Yu et al. [782] present a contextual attention layer
with channel-wise weights to encourage coherency between the existing and generated pixels. Partial Convolutions [392] are convolutions
that are masked and renormalized such that the convolutional results depend only on the non-missing regions. Gated Convolutions [783]


learn a dynamic feature gating mechanism which performs well when the missing regions have arbitrary shapes, i.e., free-form inpainting.
LaMa [638] uses fast Fourier convolutions which have an image-wide receptive field for effectively inpainting large masks. Repaint [419]
proposes a DM for inpainting, which leverages a pre-trained unconditional DM and does not require training on the inpainting task itself,
allowing it to generalize to any mask shape given during testing while having powerful semantic generation capabilities.
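To illustrate the masked-and-renormalized convolution idea behind Partial Convolutions [392], below is a simplified sketch; it follows the published renormalization rule in spirit, but is not the authors' implementation, and it assumes an odd kernel size so that the output resolution matches the input.

```python
import torch
import torch.nn.functional as F

def partial_conv2d(x, mask, weight, bias=None):
    """x: (B, C, H, W) image with holes; mask: (B, 1, H, W), 1 = valid pixel, 0 = hole.

    The convolution only "sees" valid pixels, and the output is renormalized by the fraction
    of valid pixels inside each sliding window. Returns the result and an updated mask that
    marks windows containing at least one valid pixel.
    """
    kh, kw = weight.shape[2], weight.shape[3]
    pad = (kh // 2, kw // 2)

    out = F.conv2d(x * mask, weight, bias=None, padding=pad)   # convolve over valid pixels only
    ones = torch.ones(1, 1, kh, kw, device=x.device)
    valid_count = F.conv2d(mask, ones, padding=pad)            # number of valid pixels per window
    out = out * (kh * kw) / valid_count.clamp(min=1.0)         # renormalization factor
    if bias is not None:
        out = out + bias.view(1, -1, 1, 1)
    new_mask = (valid_count > 0).float()                       # mask update for the next layer
    return out * new_mask, new_mask
```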
Image-to-Image translation between different visual domains is an important task, where a generative model takes an image as
input and produces an image belonging to another domain. Some popular applications include style transfer (e.g., sketch to photo, day to night,
summer to winter, between artistic styles, synthetic to real), pose transfer (e.g., modifying a human to have a given pose), attribute transfer
(e.g., dog to cat, dog breed translation), colorization (e.g., RGB from grayscale images or depth images), etc. While these applications are
very different, they generally can be tackled with similar approaches (as introduced below), but with the image data from the corresponding
domains. In general, works in this direction aim to learn without the strong constraint of requiring paired images from both domains, while
also improving the diversity of generated images while preserving key attributes.
Pix2Pix [292] investigates using cGANS [459] to tackle image-to-image translation problems in a common framework. CycleGAN [832]
designs an algorithm that can translate between image domains without paired examples between domains in the training data, through
training with a cycle consistency loss. UNIT [398] also tackles the case where no paired examples exist between domains, and approaches
the task by enforcing a shared VAE latent space between domains with an adversarial training objective. DiscoGAN [332] aims to discover
cross-domain relations when given unpaired data, and can transfer styles while preserving key attributes in the output image. BicycleGAN
[833] encourages the connection between output and the latent code to be invertible, which helps prevent mode collapse and produces more
diverse results for image-to-image translation. DRIT [353] proposes to generate diverse outputs with unpaired training data via disentangling
the latent space into a domain-invariant content space and a domain-specific attribute space, together with a cross-cycle consistency loss.
Concurrently, MUNIT [281] also aims to produce diverse image-to-image translation outputs with unpaired training data, achieving this by
encoding content codes and style codes separately and sampling different style codes to generate diverse images. StarGAN [120] explores
handling more than two domains with a single model by introducing the target domain label as an input to the model, such that users can
control the translation of the image into the desired domain. StarGAN v2 [121] further improves upon StarGAN, enabling it to generate more
diverse images. FUNIT [399] explores the few-shot setting, where only a few images of the target class are needed at test time. Recently,
Palette [570] adopts a DM for image-to-image translation which achieves good performance.
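To make the unpaired-translation idea concrete, the sketch below shows a CycleGAN-style cycle-consistency term which, combined with the usual adversarial losses, allows training without paired images; the generators and the loss weighting are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G_ab, G_ba, real_a, real_b, lam=10.0):
    """G_ab maps domain A -> B, G_ba maps domain B -> A.

    Translating to the other domain and back should reconstruct the input, which constrains
    the mappings even though no paired (A, B) examples are available.
    """
    rec_a = G_ba(G_ab(real_a))   # A -> B -> A
    rec_b = G_ab(G_ba(real_b))   # B -> A -> B
    return lam * (F.l1_loss(rec_a, real_a) + F.l1_loss(rec_b, real_b))

# Full objective (sketch): adversarial losses for G_ab/D_b and G_ba/D_a,
# plus cycle_consistency_loss(G_ab, G_ba, real_a, real_b).
```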
Image editing and manipulation with user interaction aims to control the image synthesis process via human input (e.g., scribbles
and sketches). Methods in this category explore different ways to incorporate user inputs into deep image generation pipelines.
Zhu et al. [831] explore manipulating the image’s shape and color through the users’ scribbles, and rely on the learned manifold of
natural images in GANs to constrain the generated image to lie on the learned manifold, i.e., a GAN inversion approach. Image2StyleGAN++
[3] also adopts a GAN inversion approach for image editing based on the Image2StyleGAN [2] latent space embedding framework, and
introduces local embeddings to enable local edits. SketchyGAN [106] aims to synthesize images from many diverse class categories given a
sketch. Ghosh et al. [211] also use a single model to generate a wide range of object classes, and introduce a feedback loop where the user can
interactively edit the sketch based on the network’s recommendations. Lee et al. [355] colorize a sketch based on a given colored reference
image. Paint2pix [607] explores a progressive image synthesis pipeline, which allows a user to progressively synthesize the images from
scratch via user interactions at each step. Recently, DragGAN [492] proposes to allow users to flexibly edit the image by dragging points of
the image toward target areas. A similar approach is adopted by DragDiffusion [597], but a DM is used instead.
Others. Various works also explore the conditional image generation process in other directions. Some works [44, 178, 389, 587, 592, 642]
take in human brain activity (e.g., fMRI signals) to generate the corresponding visual images. Another research direction involves layout
generation [83, 237, 310, 368, 819], where the layout of the room is generated conditioned on the given elements with user-specified attributes.

4 VIDEO MODALITY
Following the developments in image-based AIGC, there has also been much attention on video-based AIGC, which has many applications in the advertising and entertainment industries. However, video generation remains a very challenging task. Beyond the difficulties in generating individual frames/images, the generated videos must also be temporally consistent to ensure coherence between frames, which can be extremely challenging for longer videos. Moreover, it can also be difficult to produce realistic motions. Besides, due to the much larger size of the output, it is also challenging to generate videos quickly and efficiently. We discuss some approaches to overcome these challenges below, starting with the unconditional methods.

Table 2. Comparison between representative video generative models. Results are reported on the UCF-101 [627] and Sky Timelapse [730] datasets. We report the Fréchet Video Distance (FVD) evaluation metric [668] based on the standardized implementation by [614] (lower is better), where FVD16 and FVD128 refer to evaluation of clips with 16 and 128 frames respectively. FVD measures the similarity between generated videos and real videos. We also report the Inception Score (IS) [571, 577] on UCF-101, which measures the class diversity over generated videos and how clearly each generated video corresponds to a class. Besides, we also report the resolution (Res.) of each video frame for each evaluation result. An asterisk (*) indicates that methods were trained on both training and testing videos of the dataset.

Method | Res. | UCF-101 IS(↑) | UCF-101 FVD16(↓) | UCF-101 FVD128(↓) | Sky Timelapse FVD16(↓) | Sky Timelapse FVD128(↓)
VGAN [684] | 64 × 64 | 8.31±0.09 | - | - | - | -
TGAN [571] | 64 × 64 | 11.85±0.07 | - | - | - | -
MoCoGAN [666] | 64 × 64 | 12.42±0.07 | - | - | - | -
MoCoGAN [666] | 256 × 256 | 10.09±0.30 | 2886.8 | 3679.0 | 206.6 | 575.9
DVD-GAN [129] | 128 × 128 | 32.97±1.7 | - | - | - | -
MoCoGAN+SG2 [321, 614] | 256 × 256 | 15.26±0.95 | 1821.4 | 2311.3 | 85.88 | 272.8
MoCoGAN-HD [655] | 256 × 256 | 23.39±1.48 | 1729.6 | 2606.5 | 164.1 | 878.1
VideoGPT [750] | 256 × 256 | 12.61±0.33 | 2880.6 | - | 222.7 | -
VideoGPT [750] | 128 × 128 | 24.69±0.30 | - | - | - | -
DIGAN [788] | 128 × 128 | 29.71±0.53 | - | - | - | -
DIGAN [788] | 256 × 256 | 23.16±1.13 | 1630.2 | 2293.7 | 83.11 | 196.7
Video Diffusion* [263] | 64 × 64 | 57.00±0.62 | - | - | - | -
TATS [203] | 128 × 128 | 57.63±0.24 | - | - | - | -
StyleGAN-V* [614] | 256 × 256 | 23.94±0.73 | 1431.0 | 1773.4 | 79.52 | 197.0
VideoFusion [423] | 128 × 128 | 72.22 | - | - | - | -
PVDM [787] | 256 × 256 | 74.40±1.25 | 343.6 | 648.4 | 55.41 | 125.2

4.1 Unconditional Video Generation

Here, we discuss the methods for unconditional generation of videos, which generally aim to fundamentally improve the video generation


quality, diversity and stability. They are mostly applicable to the conditional setting as well, with some modifications. Tab. 2 shows a summary
and comparison between some representative methods.
VGAN [684] proposes a GAN for video generation with a spatial-temporal convolutional architecture. TGAN [571] uses two separate
generators – a temporal generator to output a set of latent variables and an image generator that takes in these latent variables to produce
frames of the video, and adopts a WGAN [24] model for more stable training. MoCoGAN [666] learns to decompose motion and content by
utilizing both video and image discriminators, allowing it to generate videos with the same content and varying motion, and vice versa.
DVD-GAN [129] decomposes the discriminator into spatial and temporal components to improve efficiency. MoCoGAN-HD [655] improves
upon MoCoGAN by introducing a motion generator in the latent space and also a pre-trained and fixed image generator, which enables
generation of high-quality videos. VideoGPT [750] and Video VQ-VAE [688] both explore using VQ-VAE [677] to generate videos via discrete
latent variables. DIGAN [788] designs a dynamics-aware approach leveraging a neural implicit representation-based GAN [613] with temporal
dynamics. StyleGAN-V [614] also proposes a continuous-time video generator based on implicit neural representations, and is built upon
StyleGAN2 [321]. Brooks et al. [62] aim to generate consistent and realistic long videos via a two-stage approach: a low-resolution
generation stage and a super-resolution stage. Video Diffusion [263] explores video generation through a DM with a video-based 3D U-Net
[128] architecture. Video LDM [53], PVDM [787] and VideoFusion [423] employ latent diffusion models (LDMs) for video generation, showing
a strong ability to generate coherent and long videos at high resolutions. Several recent works [203, 245, 774] also aim to improve the
generation of long videos.

4.2 Conditional Video Generation


In conditional video generation, users can control the content of the generated video through various inputs. In this subsection, we investigate the scenario where the input information can be a collection of video frames (i.e., including single images), or an input class. A summary of these settings is shown in Fig. 4.

Fig. 4. Illustration of various conditional video generation inputs. Examples obtained from [84, 164, 475].

Video-to-Video Synthesis aims to generate a video conditioned on an input video clip. There are many applications for this task, including motion transfer and synthetic-to-real synthesis. Most approaches identify ways to transfer information (e.g., motion information) to the generated video while maintaining consistency in other aspects (e.g., identities remain the same).
Vid2vid [697] learns a cGAN using paired input and output videos with a spatio-temporal learning objective to learn to map videos
from one visual domain to another. Chan et al. [84] extract the pose sequence from a subject and transfer it to a target subject, which
allows users to synthesize people dancing by transferring the motions from a dancer. Wang et al. [696] propose a few-shot approach for
video-to-video synthesis which requires only a few videos of a new person to perform the transfer of motions between subjects. Mallya et al.
[431] aim to maintain 3D consistency in generated videos by condensing and storing the information of the 3D world as a 3D point cloud,
improving temporal coherence. LIA [702] approaches the video-to-video task without explicit structure representations, and achieves it purely
by manipulating the latent space of a GAN [218] for motion transfer.
Future Video Prediction is where a video generative model takes in some prefix frames (i.e., an image or video clip), and aims to generate
the future frames.
Srivastava et al. [628] propose to learn video representations with LSTMs in an unsupervised manner via future prediction, which allows
the model to generate future frames of a given input video clip. Walker et al. [687] adopt a conditional VAE-based approach for self-supervised
learning via future prediction, which can produce multiple predictions for an ambiguous future. Mathieu et al. [442] tackle the issue where
blurry predictions are obtained, by introducing an adversarial training approach, a multi-scale architecture, and an image gradient difference
loss function. PredNet [416] designs an architecture where only the deviations of predictions from early network layers are fed to subsequent
network layers, which learns better representations for motion prediction. VPN [312] is trained to estimate the joint distribution of the pixel
values in a video and models the factorization of the joint likelihood, which makes the computation of a video’s likelihood fully tractable.
SV2P [31] aims to provide effective stochastic multi-frame predictions for real-world videos via an approach based on stochastic inference.
MD-GAN [730] presents a GAN-based approach to generate realistic time-lapse videos given a photo, and produce the videos via a multi-stage
approach for more realistic modeling. Dorkenwald et al. [164] use a conditional invertible neural network to map between the image and
video domain, allowing more control over the video synthesis task.
Class-conditional Generation aims to generate videos containing activities according to a given class label. A prominent example is the
action generation task, where action classes are fed into the video generative model.
CDNA [183] proposes an unsupervised approach for learning to generate the motion of physical objects while conditioned on the action
via pixel advection. PSGAN [755] leverages human pose as an intermediate representation to guide the generation of videos conditioned on
extracted pose from the input image and an action label. Kim et al. [334] adopt an unsupervised approach to train a model to detect the
keypoints of arbitrary objects, which are then used as pseudo-labels for learning the objects’ motions, enabling the generative model to be
applied to datasets without costly annotations of keypoints in the videos. GameGAN [331] trains a GAN to generate the next video frame of a
graphics game based on the key pressed by the agent. ImaGINator [699] introduces a spatio-temporal cGAN architecture that decomposes


appearance and motion for better generation quality. Recently, LFDM [475] proposes a DM for temporal latent flow generation based on class
information, which is more efficient than generating in the pixel space.
Others. Some works [51, 52, 244] propose pipelines for users to interact with the video generation process, e.g., via pokes [51, 52] or
a user-specified trajectory [244]. Besides, Mahapatra and Kulkarni [428] aim toward controlling the animation of fluids from images via
various user inputs (e.g., arrows and masks), and build upon an animation technique [267] based on Eulerian Motion Fields.

5 TEXT MODALITY
Another field which has received a lot of attention is text generation, which has gained more widespread interest with well-known chatbots such as ChatGPT. Text generation is a challenging task due to several reasons. Initial approaches found it challenging to adopt generative methods such as GANs for text representations, which are discrete, which also led to issues with training stability. It is also challenging to maintain coherence and consistently keep track of the context over longer passages of text. Moreover, it is difficult to apply deep generative models to generate text that adheres to grammatical rules while also capturing the intended tone, style and level of formality.

Table 3. Comparison between recent representative text generative models. Results are reported on common sense reasoning benchmarks BoolQ [130], WinoGrande [575], ARC easy and ARC challenge (ARC-e and ARC-c) [131], as well as closed book question answering benchmarks Natural Questions [349] and TriviaQA [309]. For all datasets, we report accuracy as the evaluation metric (higher is better). We report results for the fine-tuned setting where the pre-trained model is fine-tuned on the dataset as a downstream task, and also the zero-shot setting where the model does not get any additional task-specific training. We also report the size of each model, i.e., number of parameters.

Setting | Method | Size | BoolQ | WinoGrande | ARC-e | ARC-c | NaturalQuestions | TriviaQA
Fine-tuned | T5-Base [535] | 223M | 81.4 | 66.6 | 56.7 | 35.5 | 25.8 | 24.5
Fine-tuned | T5-Large [535] | 770M | 85.4 | 79.1 | 68.8 | 35.5 | 27.6 | 29.5
Fine-tuned | Switch-Base [180] | 7B | - | 73.3 | 61.3 | 32.8 | 26.8 | 30.7
Fine-tuned | Switch-Large [180] | 26B | - | 83.0 | 66.0 | 35.5 | 29.5 | 36.9
Zero-shot | GPT-3 [63] | 175B | 60.5 | 70.2 | 68.8 | 51.4 | 14.6 | -
Zero-shot | Gopher [534] | 280B | 79.3 | 70.1 | - | - | 10.1 | 43.5
Zero-shot | Chinchilla [264] | 70B | 83.7 | 74.9 | - | - | 16.6 | 55.4
Zero-shot | PaLM [123] | 62B | 84.8 | 77.0 | 75.2 | 52.5 | 18.1 | -
Zero-shot | PaLM [123] | 540B | 88.0 | 81.1 | 76.6 | 53.0 | 21.2 | -
Zero-shot | LLaMA [659] | 7B | 76.5 | 70.1 | 72.8 | 47.6 | - | -
Zero-shot | LLaMA [659] | 33B | 83.1 | 76.0 | 80.0 | 57.8 | 24.9 | 65.1
In general, text generation models are mostly trained and tested in the conditional setting, where the models produce text while
conditioned on a text input (e.g., input question, preceding text, or text to be translated). Therefore, we do not categorize based on conditional
or unconditional. Instead, we categorize the methods according to their generative techniques: VAEs, GANs, and Autoregressive Transformers.
Tab. 3 reports the performance of some recent text generation methods.
VAEs. Bowman et al. [60] first explore a VAE-based approach for text generation, which overcomes the discrete representation of text by
learning a continuous latent representation through the VAE. Many subsequent works (e.g., Yang et al. [765], Semeniuta et al. [588], Xu et al.
[733], Dieng et al. [157]) have explored ways to improve the training stability and avoid the problem of KL collapse where the VAE’s latent
representation does not properly encode the input text. However, these methods generally still do not achieve high levels of performance,
stability and diversity.
GANs. Training text-based GANs poses a challenge due to the non-differentiability of the discrete text representation. Therefore, building
upon earlier works [365, 366] that learn to generate text via reinforcement learning (RL), many GAN-based approaches leverage RL to
overcome the differentiability issue. SeqGAN [785] adopts a policy gradient RL approach to train a GAN, where the real-or-fake prediction
score of the discriminator is used as a reward signal to train the generator. However, this approach faces considerable training instability.
Hence, several works focus on improving the design of the guidance signals from the discriminator (e.g., RankGAN [387], LeakGAN [233]),
improving the architecture design (e.g., TextGAN [814]), or exploring various training techniques (e.g., MaskGAN [179]). However, using
GANs for text generation has not been very successful, due to the instability of GANs in general, coupled with the complications brought by
the non-differentiability of the discrete text representations [66].
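A simplified sketch of this RL-based workaround, in the spirit of SeqGAN [785] but omitting its Monte Carlo rollouts for per-token rewards, is shown below; the generator and discriminator interfaces (init_state, step) are assumed placeholders rather than any published API.

```python
import torch

def policy_gradient_generator_step(generator, discriminator, opt_g, batch_size=32, max_len=20):
    """Generator update for discrete text, where backpropagating through sampling is impossible."""
    tokens, log_probs = [], []
    state = generator.init_state(batch_size)              # assumed helper on the generator
    for _ in range(max_len):
        logits, state = generator.step(state)             # assumed API: next-token logits (B, vocab)
        dist = torch.distributions.Categorical(logits=logits)
        tok = dist.sample()                                # non-differentiable sampling step
        log_probs.append(dist.log_prob(tok))
        tokens.append(tok)
    sequence = torch.stack(tokens, dim=1)                  # (batch, max_len) token ids

    with torch.no_grad():
        # Discriminator's real-or-fake score, treated as a per-sequence reward.
        reward = discriminator(sequence).squeeze(-1)

    # REINFORCE: increase the log-probability of sequences that received high reward.
    loss = -(reward * torch.stack(log_probs, dim=1).sum(dim=1)).mean()
    opt_g.zero_grad(); loss.backward(); opt_g.step()
    return loss.item()
```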
Autoregressive Transformers. GPT [532] proposes a generative pre-training approach to train a Transformer-based language model,
where a Transformer learns to autoregressively predict the next tokens during pre-training. GPT-2 [533] shows that GPT’s approach can be
effective in generating predictions for natural language processing tasks, even in a zero-shot setting. XLNet [764] is designed to enable learning
of bidirectional contexts during pretraining and is built upon the Transformer-XL [141] architecture. BART [359] presents a bidirectional
encoder with an autoregressive decoder, enabling pre-training with an arbitrary corruption of the original text. T5 [535] proposes to treat
every text processing task as a text generation task and investigates the transfer learning capabilities of the model in such a setting. GPT-3
[63] significantly scales up the size of language models, showing large improvements overall, and achieving good performance even in
few-shot and zero-shot settings. Notably, the ChatGPT chatbot is mainly based on GPT-3, but with further fine-tuning via reinforcement
learning from human feedback (RLHF) as performed in InstructGPT [491]. Gopher [534] investigates the scaling of Transformer-based
language models (up to a 280 billion parameter model) by evaluating across many tasks. The training of Chinchilla [264] is based on an
estimated compute-optimal frontier, where a better performance is achieved when the model is 4 times smaller than Gopher, while training
on 4 times more data. PaLM [123] proposes a large and densely activated language model, which is trained using Pathways [39], allowing for
efficient training across TPU pods. Switch Transformer [180] is a model with a large number of parameters, yet is sparsely activated due to
its Mixture of Experts design where different parameters are activated for each input sample. LLaMA [659] is trained by only using publicly
available data, making it compatible with open-sourcing, and it is also efficient, outperforming GPT-3 on many benchmarks despite being
much smaller. LLaMA 2 [660] is further fine-tuned and optimized for dialogue use cases.
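The decoding loop shared by these autoregressive models can be sketched as follows, where the model repeatedly scores the next token given the current prefix and a token is sampled with temperature and top-k truncation; the model interface and hyperparameters are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=50, temperature=0.8, top_k=50):
    """Autoregressive sampling: repeatedly predict the next token and append it to the context."""
    ids = prompt_ids.clone()                               # (1, prompt_len) token ids
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :]                      # assumed to return (1, len, vocab) logits
        logits = logits / temperature                      # sharpen or flatten the distribution
        topk_vals, topk_idx = torch.topk(logits, top_k)    # keep only the k most likely tokens
        probs = F.softmax(topk_vals, dim=-1)
        next_id = topk_idx.gather(-1, torch.multinomial(probs, num_samples=1))
        ids = torch.cat([ids, next_id], dim=1)
    return ids
```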
Diffusion Models. Several works also explore adopting DMs to generate text, which holds the advantage of generating more diverse text
samples. Diffusion-LM [377] introduces a diffusion-based approach for text generation, which denoises random noise into word vectors in a
non-autoregressive manner. DiffuSeq [217] also adopts a diffusion model for text generation, which shows high diversity during generation.
GENIE [391] proposes a pre-training framework to pre-train the language model, generating text of high quality and diversity.


6 3D SHAPES MODALITY
Another important and rapidly advancing field is 3D shape generation, which aims to generate novel 3D shapes and objects. The ability to swiftly generate 3D assets can be very useful, particularly in sectors like manufacturing and entertainment, where it aids in rapid prototyping and design exploration. Notably, when generating 3D shapes, users can choose to generate the 3D shape in various 3D representations: voxel, point cloud, mesh, or neural implicit fields, where each 3D representation generally adopts different settings and backbones, and each has its own characteristics, advantages and disadvantages. A visualization of the various 3D representations is shown in Fig. 5. In practice, for many tasks a specific representation can be more suited than the others, where considerations can include the memory efficiency, ease of handling the representation, and the cost of obtaining supervision signals. Below, we further categorize the 3D shape generation methods based on the generated output 3D data representation of each method. Tab. 4 reports the performance of some representative 3D shape methods.

Fig. 5. Visualization of the Stanford Bunny as represented in (i) voxel; (ii) point cloud; (iii) mesh; (iv) neural implicit field. Images are obtained from [7, 71, 496, 563].

6.1 Voxels
The voxel representation is a natural extension of the 2D image pixel representation into 3D, where the 3D space is divided into grids and each voxel stores some values (e.g., occupancy [723] or signed distance values [139, 462]). As the voxel representation is the natural extension of the 2D pixel space, it can conveniently be processed with 3D CNNs, which is why earlier 3D generative works [715, 723] tend to leverage the voxel representation. However, generation via the 3D voxel representation can be computationally costly with poor memory efficiency, due to the cubic growth of the volume with increasing resolution.

Table 4. Comparison between representative 3D shape generative models on the chair and airplane shapes from ShapeNet [91]. We report the Minimum Matching Distance (MMD) and Coverage (COV) scores [6] based on Earth Mover’s Distance (EMD) and Light Field Descriptor (LFD) [96] and the 1-Nearest Neighbour Accuracy (1-NNA) metric [415, 757] based on LFD. MMD is an indication of the fidelity of the generated samples (lower is better), COV measures the coverage and diversity of the generated samples (higher is better), while 1-NNA measures the similarity between the distribution of generated samples and the data distribution (lower is better). On the other hand, EMD and LFD are different ways of computing distances for these metrics. Note that MMD (EMD) scores are multiplied by 10². As there have been several different data pre-processing, post-processing approaches and implementations of the evaluation metrics (which lead to varying results), we also indicate which protocol each of the reported results use. We denote each protocol with a symbol, where each protocol comes from a specific paper, as follows: †[598], ‡[421], *[6], §[827], ¶[287], **[672], ††[424], §§[287], ¶¶[199].

Type | Method | Chair MMD(↓) EMD | Chair MMD(↓) LFD | Chair COV(↑) EMD | Chair COV(↑) LFD | Chair 1-NNA(↓) EMD | Airplane MMD(↓) EMD | Airplane MMD(↓) LFD | Airplane COV(↑) EMD | Airplane COV(↑) LFD | Airplane 1-NNA(↓) EMD
Voxel | 3D-GAN [715] * | 9.1 | - | 22.4 | - | - | - | - | - | - | -
Voxel | 3D-GAN [715] §§ | - | 4365 | - | 25.07 | - | - | - | - | - | -
Voxel | GRASS [367] ¶¶ | 7.44 | - | 44.5 | - | - | - | - | - | - | -
Voxel | G2L [692] ¶¶ | 6.82 | - | 83.4 | - | - | - | - | - | - | -
Voxel | SAGNet [724] ¶¶ | 6.08 | - | 74.3 | - | - | - | - | - | - | -
Voxel | Octree Transformer [287] §§ | - | 2958 | - | 76.47 | - | - | 3664 | - | 73.05 | -
Point Cloud | r-GAN [6] * | 12.3 | - | 19.0 | - | - | - | - | - | - | -
Point Cloud | l-GAN (EMD) [6] * | 6.9 | - | 57.1 | - | - | - | - | - | - | -
Point Cloud | PC-GAN [362] §§ | - | 3143 | - | 70.06 | - | - | 3737 | - | 73.55 | -
Point Cloud | PC-GAN [362] ‡ | 31.04 | - | 22.14 | - | 100.00 | 18.10 | - | 13.84 | - | 98.52
Point Cloud | GCN-GAN [674] ‡ | 22.13 | - | 35.09 | - | 95.80 | 16.50 | - | 18.62 | - | 98.60
Point Cloud | Tree-GAN [600] ‡ | 36.13 | - | 6.77 | - | 100.00 | 19.53 | - | 8.40 | - | 99.67
Point Cloud | PointFlow [757] ‡ | 18.56 | - | 43.38 | - | 68.40 | 10.90 | - | 44.65 | - | 69.36
Point Cloud | ShapeGF [67] ‡ | 17.85 | - | 46.71 | - | 62.69 | 10.27 | - | 47.12 | - | 70.51
Point Cloud | Luo et al. [421] ‡ | 17.84 | - | 47.52 | - | 69.06 | 10.61 | - | 45.47 | - | 75.12
Point Cloud | r-GAN [6] § | - | - | - | - | 99.70 | - | - | - | - | 96.79
Point Cloud | l-GAN (EMD) [6] § | - | - | - | - | 64.65 | - | - | - | - | 76.91
Point Cloud | PointFlow [757] § | - | - | - | - | 60.57 | - | - | - | - | 70.74
Point Cloud | SoftFlow [329] § | - | - | - | - | 60.05 | - | - | - | - | 65.80
Point Cloud | DPF-Net [339] § | - | - | - | - | 58.53 | - | - | - | - | 65.55
Point Cloud | ShapeGF [67] § | - | - | - | - | 65.48 | - | - | - | - | 76.17
Point Cloud | Luo et al. [421] § | - | - | - | - | 74.77 | - | - | - | - | 86.91
Point Cloud | PVD [827] § | - | - | - | - | 53.32 | - | - | - | - | 64.81
Point Cloud | Luo et al. [421] ** | - | - | - | - | 74.96 | - | - | - | - | 96.04
Point Cloud | PVD [827] ** | - | - | - | - | 57.90 | - | - | - | - | 56.06
Point Cloud | PointFlow [757] † | 4.21 | - | 49.63 | - | 74.74 | 1.67 | - | 52.97 | - | 62.50
Point Cloud | Luo et al. [421] † | 4.08 | - | 41.65 | - | 76.66 | 1.64 | - | 52.23 | - | 63.37
Point Cloud | PVD [827] † | 3.56 | - | 50.37 | - | 53.03 | 1.55 | - | 53.96 | - | 52.72
Point Cloud | Tree-GAN [600] †† | 9.02 | - | 48.89 | - | 62.04 | 4.04 | - | 43.81 | - | 71.78
Point Cloud | ShapeGF [67] †† | 8.53 | - | 50.96 | - | 54.14 | 4.03 | - | 41.58 | - | 71.29
Point Cloud | PVD [827] †† | 8.38 | - | 50.52 | - | 52.36 | 4.29 | - | 34.9 | - | 83.91
Point Cloud | Luo et al. [421] †† | 9.51 | - | 47.27 | - | 69.50 | 4.00 | - | 48.76 | - | 68.81
Point Cloud | SP-GAN [373] †† | 9.67 | - | 31.91 | - | 75.85 | 4.03 | - | 46.29 | - | 70.42
Mesh | SDM-NET [199] ¶¶ | 0.671 | - | 84.1 | - | - | - | - | - | - | -
Mesh | LION [672] § | - | - | - | - | 52.34 | - | - | - | - | 53.70
Mesh | LION [672] ** | - | - | - | - | 48.67 | - | - | - | - | 53.84
Mesh | SLIDE (centroid) [424] †† | 8.49 | - | 49.63 | - | 51.18 | 3.77 | - | 46.29 | - | 65.84
Mesh | SLIDE (random) [424] †† | 8.63 | - | 50.37 | - | 53.10 | 3.81 | - | 47.03 | - | 67.82
Neural Fields | IM-GAN [115] ** | - | - | - | - | 58.20 | - | - | - | - | 77.85
Neural Fields | IM-GAN [115] §§ | - | 2893 | - | 75.44 | - | - | 3689 | - | 70.33 | -
Neural Fields | Grid IM-GAN [288] §§ | - | 2768 | - | 82.08 | - | - | 3226 | - | 81.58 | -
Neural Fields | SDF-Diffusion [598] † | 3.61 | - | 49.31 | - | 51.77 | 1.49 | - | 55.20 | - | 48.14

3D ShapeNet [723] explores using 3D voxel grids to represent 3D shapes which allows for 3D shape completion, and introduces ModelNet (a large dataset with 3D CAD models) to train it. 3D-GAN [715] proposes to leverage GANs to model and generate 3D objects, and designs a 3D GAN architecture to produce objects in a 3D voxel representation. VSL [403] designs a hierarchical architecture which learns a hierarchical representation of 3D objects, which allows for improved generalization capabilities when used across various tasks. PlatonicGAN [253] aims to train a 3D generative model from a collection of 2D images, by introducing a 2D discriminator and rendering layers that connect between the 3D generator and 2D discriminator. Several other works propose other GAN-based approaches (e.g., G2L [692]) or VAE-based approaches (e.g., GRASS [367] and SAGNet [724]) to improve the quality of the generated voxel-based representation.
However, storing 3D voxel representation in memory can be inefficient, since the memory and computational requirements of handling
the 3D voxel grid grow cubically with the resolution, which limits the 3D voxel output to low resolutions. To overcome this, OctNet [558]
leverages the octree data structure [98, 443, 629] – a data structure with adaptive cell sizes – and hierarchically partitions the 3D space into a
set of unbalanced octrees, which allows for more memory and computations to be allocated to the regions that require higher resolutions.
OGN [646] presents a convolutional decoder architecture to operate on octrees, where intermediate convolutional layers can predict the
occupancy value of each cell at different granularity levels, which is thus able to flexibly predict the octree’s structure and does not need to
know it in advance. HSP [241] builds upon the octree data representation, and only predicts voxels around the surface of the object (e.g.,
predicting the object’s boundary), performing the hierarchical predictions in a coarse-to-fine manner to obtain higher resolutions. More
recently, Octree Transformer [287] introduces a Transformer-based architecture to effectively generate octrees via an autoregressive sequence
generation approach.
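The octree structure used by these methods can be illustrated with a minimal sketch (ours, not code from OctNet, OGN, or the Octree Transformer; all names are hypothetical): a dense occupancy volume is recursively subdivided only where a cell is not homogeneous, so that storage concentrates near surfaces.

```python
# Illustrative octree construction over a dense occupancy grid (hypothetical names).
import numpy as np

class OctreeNode:
    def __init__(self, origin, size, value=None, children=None):
        self.origin = origin      # (x, y, z) index of the cell's corner
        self.size = size          # edge length of the cubic cell, in voxels
        self.value = value        # occupancy value if the cell is a leaf
        self.children = children  # list of 8 child nodes if subdivided

def build_octree(volume, origin=(0, 0, 0), size=None):
    """Recursively subdivide a cubic 0/1 occupancy grid with adaptive cell sizes."""
    size = volume.shape[0] if size is None else size
    x, y, z = origin
    block = volume[x:x + size, y:y + size, z:z + size]
    if size == 1 or block.min() == block.max():
        # Homogeneous (all empty or all occupied): store a single coarse leaf.
        return OctreeNode(origin, size, value=float(block.flat[0]))
    half = size // 2
    children = [build_octree(volume, (x + dx * half, y + dy * half, z + dz * half), half)
                for dx in (0, 1) for dy in (0, 1) for dz in (0, 1)]
    return OctreeNode(origin, size, children=children)

# Example: a 32^3 grid with a small occupied cube stays coarse everywhere else.
vol = np.zeros((32, 32, 32), dtype=np.uint8)
vol[8:16, 8:16, 8:16] = 1
root = build_octree(vol)
```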




In order to achieve better precision for 3D voxel representations efficiently, another approach is to use the 3D voxel grid to represent a
signed distance function (SDF) [136]. In this representation, instead of storing occupancy values, each voxel stores the signed distance
to the nearest 3D surface point, where the inside of the object has negative distance values and the outside has positive distance values.
3D-EPN [139] represents the shape’s surface by storing the signed distances in the 3D voxel grid instead of occupancy values to perform 3D
shape completion from partial, low-resolution scans. OctNetFusion [559] proposes a deep 3D convolutional architecture based on OctNets
[558], that can fuse depth information from different viewpoints to reconstruct 3D objects via estimation of the signed distance. AutoSDF
[462] adopts a VAE-based approach and stores features in a 3D voxel grid, reconstructing the object by referring to a codebook learnt using
the VAE which maps the features in each locality into more precise shapes. This SDF approach is further extended into the continuous space
with neural implicit functions in other works, and a discussion can be found under Sec. 6.4.
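As a concrete illustration of the voxel-SDF representation described above (a generic sketch, not code from 3D-EPN or AutoSDF), each cell of a regular grid stores the signed distance to the nearest surface point, and the zero level set can be extracted as a mesh, here with scikit-image's marching cubes:

```python
# Voxel grid of signed distances to a sphere, with iso-surface extraction (illustrative).
import numpy as np
from skimage.measure import marching_cubes

res = 64
coords = np.linspace(-1.0, 1.0, res)
x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")

# Signed distance to a sphere of radius 0.5: negative inside, positive outside.
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5

# The surface is the zero-crossing of the SDF; marching cubes recovers a triangle mesh.
verts, faces, normals, _ = marching_cubes(sdf, level=0.0, spacing=(2.0 / (res - 1),) * 3)
print(verts.shape, faces.shape)  # (N, 3) vertices and (M, 3) triangle indices
```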

6.2 Point Cloud


Point clouds are unordered sets of 3D points that represent a surface of a 3D object or scene. One advantage of point clouds is that they
are relatively scalable as compared to voxels, since they only explicitly encode the surface. Furthermore, 3D point clouds are a popular
representation because they can be conveniently obtained via depth sensors (e.g., LiDAR or the Kinect). On the other hand, point clouds do not have a regular grid-like structure, which makes them challenging to process with CNNs. Additionally, because point clouds are unordered, models that handle them need to be permutation-invariant, i.e., invariant to the ordering of the points. Another disadvantage is that point clouds do not have explicit surfaces, so it can be difficult to directly add textures and lighting to the 3D object or scene. To overcome this, 3D point clouds can be transformed to meshes through surface reconstruction techniques [242, 325, 326, 507, 710].
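To illustrate the permutation-invariance requirement discussed above, the following is a minimal PointNet-style encoder (a generic sketch, not the architecture of any specific surveyed method): a shared per-point MLP followed by symmetric max-pooling, so shuffling the input points leaves the output unchanged.

```python
# Minimal permutation-invariant point cloud encoder (illustrative; assumes PyTorch).
import torch
import torch.nn as nn

class PointEncoder(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Per-point MLP applied identically to every point.
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, points):            # points: (batch, num_points, 3)
        per_point = self.mlp(points)      # (batch, num_points, feat_dim)
        # Max pooling over the point axis is a symmetric function, so permuting
        # the points does not change the output.
        return per_point.max(dim=1).values

encoder = PointEncoder()
pts = torch.randn(2, 1024, 3)
perm = torch.randperm(1024)
assert torch.allclose(encoder(pts), encoder(pts[:, perm]))
```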
Some works generate 3D point clouds via GANs. Achlioptas et al. [6] explore GANs for point cloud generation and improve the
performance by training them in the fixed latent space of an autoencoder network. PC-GAN [362] improves the sampling process and
proposes a sandwiching objective. GCN-GAN [674] investigates a GAN architecture with graph convolutions to effectively generate 3D point
clouds by exploiting local topology. Tree-GAN [600] introduces a tree-structured graph convolutional network (GCN) which is more accurate
and can also generate more diverse object categories. Recently, GAN-based approaches have also seen various improvements in generation
quality [285, 373, 644].
However, GAN-based approaches tend to generate a fixed number of points and lack flexibility. Therefore, PointFlow [757] explores an
approach based on NFs, which can sample an arbitrary number of points for the generation of the 3D object. DPF-Nets [339] propose a discrete
alternative to PointFlow [757] which significantly improves computational efficiency. SoftFlow [329] estimates the conditional distribution of
perturbed input data instead of directly learning the data distribution, which reduces the difficulty of generating thin structures. ShapeGF
[67] adopts a score-based approach, learning gradient fields to move an arbitrary number of sampled points to form the shape’s surface.
More recently, Luo et al. [421] tackle 3D point cloud generation via DMs, which are simpler to train than GANs and continuous-flow-based
models, and removes the requirement of invertibility for flow-based models. PVD [827] is also a DM-based method, but adopts a point-voxel
representation [411], which provides point clouds with a structured locality and spatial correlations that can be exploited.
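The DM-based point cloud generators above follow the general denoising-diffusion recipe, sketched below as a generic DDPM-style training loss on raw point coordinates (a simplified illustration, not the exact objective of Luo et al. [421] or PVD [827]; `eps_model` is an assumed noise-prediction network).

```python
# Generic DDPM-style training loss on point coordinates (illustrative sketch).
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)       # cumulative product \bar{alpha}_t

def diffusion_loss(eps_model, x0):
    """x0: (batch, num_points, 3) clean point clouds; eps_model is assumed."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    a_bar = alphas_bar[t].view(b, 1, 1)
    eps = torch.randn_like(x0)
    # Forward process: q(x_t | x_0) = N(sqrt(a_bar) * x_0, (1 - a_bar) * I).
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
    # The network is trained to recover the injected noise.
    return ((eps_model(x_t, t) - eps) ** 2).mean()
```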

6.3 Mesh
Polygonal meshes depict surfaces of 3D shapes by utilizing a set of vertices, edges, and faces. Meshes are more memory-efficient and scalable than voxels as they only encode the surfaces of scenes, and are also more efficient than point clouds: a mesh requires only three vertices and one face to represent a large triangular surface, whereas many points would need to be sampled to represent it as a point cloud. Furthermore, meshes offer
an additional benefit by being well-suited for geometric transformations such as translation, rotation, and scaling of objects, which can be
easily accomplished through straightforward vertex operations. Besides, meshes can also encode textures more conveniently than voxels or
point clouds. Thus, meshes are commonly used in traditional computer graphics and related applications. Nevertheless, dealing with and
generating 3D meshes can present challenges, and the level of difficulty surpasses that of generating point clouds. Similar to point clouds,
the irregular data structure of meshes presents handling challenges. However, predicting both vertex positions and topology for meshes
introduces further complexity, making it challenging to synthesize meshes with plausible connections between mesh vertices. Therefore,
many works use point cloud as an intermediate representation, and transform point clouds to mesh via surface reconstruction techniques
[242, 325, 326, 507, 710].
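As an example of the point-cloud-to-mesh step mentioned above, the following sketch uses Open3D's Poisson surface reconstruction (the snippet, including the random stand-in point cloud and parameter choices, is illustrative; any of the cited surface reconstruction techniques could be used instead).

```python
# Point cloud to mesh via Poisson surface reconstruction (illustrative usage of Open3D).
import numpy as np
import open3d as o3d

points = np.random.rand(2048, 3)                      # stand-in for a generated point cloud
pcd = o3d.geometry.PointCloud()
pcd.points = o3d.utility.Vector3dVector(points)
pcd.estimate_normals()                                # Poisson reconstruction needs normals

mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(pcd, depth=8)
print(np.asarray(mesh.vertices).shape, np.asarray(mesh.triangles).shape)
```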
SurfNet [608] explores an approach and a network architecture to generate a 3D mesh represented by surface points, to reduce the
computational burden from voxel representations. SDM-NET [199] introduces a network architecture to produce structured deformable
meshes, that decomposes the overall shape into a set of parts and recovers the finer-scale geometry of each part by deforming a box, which
is an approach that facilitates the editing and interpolation of the generated shapes. PolyGen [471] presents an autoregressive approach
with a Transformer-based architecture, which directly predicts vertices and faces sequentially while taking into account the long-range
dependencies in the mesh. LION [672] uses a hierarchical VAE with two diffusion models in the latent space to produce the overall shapes
and the latent points, before using them to reconstruct a mesh. SLIDE [424] designs a diffusion-based approach for mesh generation which
employs point clouds as an intermediate representation via a cascaded approach, where one diffusion model learns the distribution of the
sparse latent points (in a 3D point cloud format) and another diffusion model learns to model the features at latent points.
Mesh Textures. As 3D mesh is capable of representing surface textures, a line of work focuses on synthesizing textures of 3D surfaces.
Texture Fields [484] proposes a texture representation which allows for regressing a continuous function (parameterized by a neural network),
and is trained by minimizing the difference between the pixel colors predicted by the Texture Field and the ground truth pixel colors.

Henderson et al. [251] aims to generate textured 3D meshes from only a collection of 2D images without any 3D information, and learns to
warp a spherical mesh template to the target shape. AUV-Net [114] learns to embed 3D surfaces into a 2D aligned UV space, which allows
textures to be aligned across objects and facilitates swapping or transfer of textures. Texturify [603] generates geometry-aware textures for
untextured collections of 3D objects without any explicit 3D color supervision. GET3D [197] aims to generate 3D meshes that are practical for
real-world applications, which have detailed geometry and arbitrary topology, is a textured mesh, and is trained from 2D image collections.
Mesh2Tex [56] leverages a hybrid mesh-neural-field texture representation which compactly encodes high-resolution textures.

6.4 Neural Implicit Fields


Neural implicit fields (or neural fields) are an implicit representation of 3D shapes where a neural network “represents” the 3D shape. In
order to observe the 3D shape, the 3D point coordinates are fed as inputs into the neural network, and the properties pertaining to each point
can be predicted. The properties predicted by the neural field tend to be either occupancy values [115, 451] or signed distance values [496],
from which we can infer the 3D shape. For example, by querying the 3D space for occupancy values, we can infer which regions of the 3D
space are occupied by the 3D shape. Neural fields hold the advantage of being a continuous representation, which can potentially encode
3D shapes in high resolutions with high memory efficiency. They are also flexible in encoding various topologies. However, extracting the
3D surfaces from neural fields is typically slow, since it requires a dense evaluation of 3D points with a neural network (usually an MLP).
Furthermore, it can be challenging to train neural fields as there is a lack of ground truth data to effectively supervise the training, and it is
also not straightforward to train generative models such as GANs in a neural field representation. Moreover, since each neural field only
captures one 3D shape, more effort is required to learn the ability to generalize and generate new shapes. It can also be difficult to edit the 3D
shape, as the information is stored implicitly as neural network weights.
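A minimal sketch of the querying process described above (a generic illustration, not any specific surveyed model): an MLP maps each 3D coordinate to a signed distance or occupancy value, and the shape is recovered by densely evaluating the network over a grid, which is also why surface extraction can be slow.

```python
# Querying a neural implicit field over a dense coordinate grid (illustrative sketch).
import torch
import torch.nn as nn

field = nn.Sequential(
    nn.Linear(3, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),          # predicted SDF value (or occupancy logit)
)

# Dense grid of query coordinates in [-1, 1]^3.
res = 64
axis = torch.linspace(-1, 1, res)
grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1).reshape(-1, 3)

with torch.no_grad():
    sdf = field(grid).reshape(res, res, res)   # every point must be evaluated
occupied = sdf < 0                              # inside/outside decided from the sign
```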
Occupancy Fields. IM-GAN [115] and ONet [451] are the first to perform 3D shape generation while modeling the implicit field with a
neural network, which assigns an occupancy value to each point. LDIF [207] decomposes a shape into a set of overlapping regions according
to a structured implicit function [208] template, which improves the efficiency and generalizability of the neural implicit field. Peng et al.
[508] introduce convolutional occupancy networks that leverage convolutions, which are scalable and also improves over the fully-connected
architecture often used in previous works.
Signed Distance Fields. However, predicting only the occupancy values in an occupancy field (as described above) provides limited
utility when compared to predicting the metric signed distance to the surface of the 3D object via a signed distance field (SDF). For instance,
the signed distance values of the SDF can be used to raycast against surfaces to render the model, and its gradients can be used to compute
surface normals. DeepSDF [496] is the first work to use a neural field to predict the signed distance values to the surface of the 3D shape at
each 3D point. However, ReLU-based neural networks tend to face difficulties encoding high frequency signals such as textures, thus SIRENs
[611] use periodic activation functions instead, in order to better handle high-frequency details. Unlike previous methods that depend on a
low-dimensional latent code to generalize across various shapes, MetaSDF [610] aims to improve the generalization ability of neural SDFs via
meta-learning [184]. Grid IM-GAN [288] splits the space into a grid and learns to capture the local geometry in each grid cell separately (instead of with a global implicit function), which also enables composing and arranging local parts to create new shapes. SDF-Diffusion [598]
proposes a diffusion-based approach for generating high-resolution 3D shapes in the form of an SDF through iteratively increasing the
resolution of the SDF via a denoising diffusion process.
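The periodic activations used by SIRENs, mentioned above, can be sketched as follows (a simplified illustration that omits the original work's initialization scheme):

```python
# Sine-activated layer in the spirit of SIREN (simplified, illustrative).
import torch
import torch.nn as nn

class SineLayer(nn.Module):
    def __init__(self, in_dim, out_dim, omega_0=30.0):
        super().__init__()
        self.omega_0 = omega_0
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # sin(omega_0 * (Wx + b)) lets the network fit high-frequency detail
        # that ReLU MLPs struggle to represent.
        return torch.sin(self.omega_0 * self.linear(x))

siren_sdf = nn.Sequential(SineLayer(3, 256), SineLayer(256, 256), nn.Linear(256, 1))
print(siren_sdf(torch.rand(8, 3)).shape)   # (8, 1) predicted signed distances
```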
Others. Some works also explore the manipulation and editing of 3D shapes [150, 243] with neural fields. Other works [26, 27, 224] aim to construct a neural implicit field from raw point cloud data. Dupont et al. [169] learn to directly generate the implicit neural representations
(functa) given the input data, which can be used to represent the input data for performing various tasks, including generation.

7 NOVEL VIEW SYNTHESIS FOR 3D SCENES


Along with the progress in 3D shape reconstruction, there has also been more attention and interest on 3D scenes, which can involve one
or multiple objects and the background. The mainstream generative approach involving 3D scenes is to explicitly or implicitly encode a
3D scene representation (i.e., via a voxel-based representation or a neural implicit representation), which allows for synthesis of images
from novel views when required. Due to the need for encoding complete scenes, this task tends to be much more challenging than 3D shape
generation. Another difficulty arises with how to implicitly encode the 3D scene, since the rendering of 3D scenes involves the generation of color, texture and lighting, which are challenging elements that now need to be encoded, while the occupancy fields and signed distance fields introduced in Sec. 6.4 for 3D shape representations do not naturally encode color.

7.1 Voxel-based Representations


One approach for novel view synthesis stores the 3D scene information (geometry and appearance) as features in a voxel grid, which can be
rendered into images.
DeepVoxels [612] introduces a deep 3D voxel representation where features are stored in a small 3D voxel grid, which can be rendered by
a neural network (i.e., neural rendering) to produce 3D shapes. HoloGAN [473] has an architecture that enables direct manipulation of view,
shape and appearance of 3D objects and can be trained using unlabeled 2D images only, while leveraging a deep 3D voxel representation
[612]. BlockGAN [474] also generates 3D scene representations by learning from unlabelled 2D images, while adding compositionality where
the individual objects can be added or edited independently. HoloDiffusion [315] introduces a DM-based approach that can be trained on

posed images without access to 3D ground truth, where the DM generates a 3D feature voxel grid, which is rendered by a rendering function
(MLP) to produce the 2D images.
Besides, sparse voxel-based representations have been proposed to improve the optimization efficiency. PlenOctrees [779] and Plenoxels [189] are sparse voxel grids where each voxel corner stores the scalar density and a vector of spherical harmonic coefficients for each color
channel. Sun et al. [632] also aim to speed up the optimization process, by directly optimizing the volume density. VoxGRAF [586] proposes a
3D-aware GAN which represents the scene with a sparse voxel grid to generate novel views.

7.2 Neural Radiance Fields (NeRFs)


Another approach involves implicitly encoding the 3D scene as a neural implicit field, i.e., where a neural network represents the 3D scene. In order to additionally output appearance on top of shape, NeRF [458] presents neural radiance fields that encode color and density given each 3D point and the viewing direction, which can produce a 3D shape with color and texture after volume rendering. NeRF-based scene representations have become very popular as they are memory efficient and scalable to higher resolutions.
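For reference, the volume rendering step mentioned above composites the predicted densities σ and colors c along each camera ray r(t) = o + t d; below is the standard NeRF formulation [458] and its discretized form used in practice (notation ours):
\[ \hat{C}(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt, \qquad T(t) = \exp\!\Big(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\Big), \]
\[ \hat{C}(\mathbf{r}) \approx \sum_i T_i \,\big(1 - e^{-\sigma_i \delta_i}\big)\,\mathbf{c}_i, \qquad T_i = \exp\!\Big(-\sum_{j<i} \sigma_j \delta_j\Big), \quad \delta_i = t_{i+1} - t_i . \]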
Table 5. Comparison between 3D scene novel view synthesis models. Results are reported on FFHQ [320] at 256 × 256 and 512 × 512, Cats [811] at 256 × 256, and CARLA [166, 585] at 128 × 128. We report FID scores [256] to measure the quality of images rendered from the scene (lower is better). We report results from various protocols (each denoted with a symbol), where each protocol comes from a specific paper, as follows: †[615], §[586], ‡[489], *[151].

Method | FFHQ 256² | FFHQ 512² | Cat 256² | CARLA 128²
GIRAFFE [479] | 32 | - | 33.39 | -
π-GAN [86] | - | - | - | 29.2
StyleNeRF [225] | 8.00 | 7.8 | 5.91 | -
StyleSDF [489] | 11.5 | 11.19 | - | -
EG3D [85] | 4.8 | 4.7 | - | -
VolumeGAN [745] | 9.1 | - | - | 7.9
MVCGAN [812] | 13.7 | 13.4 | 39.16 | -
GIRAFFE-HD [746] | 11.93 | - | 12.36 | -
GRAF [585] * | 73.0 | - | 59.5 | 32.1
π-GAN [86] * | 55.2 | - | 53.7 | 36.0
GIRAFFE [479] * | 32.6 | - | 20.7 | 105
GRAM [151] * | 17.9 | - | 14.6 | 26.3
π-GAN [86] † | 53.2 | - | 68.28 | -
GRAM [151] † | 13.78 | - | 13.40 | -
EpiGRAF [615] † | 9.71 | 9.92 | 6.93 | -
HoloGAN [473] ‡ | 90.9 | - | - | -
GRAF [585] ‡ | 79.2 | - | - | -
π-GAN [86] ‡ | 83.0 | - | - | -
GIRAFFE [479] ‡ | 31.2 | - | - | -
StyleSDF [489] ‡ | 11.5 | - | - | -
GRAF [585] § | 71 | - | - | 41
GIRAFFE [479] § | 31.5 | - | - | -
π-GAN [86] § | 85 | - | - | 29.2
GOF [743] § | 69.2 | - | - | 29.3
GRAM [151] § | 17.9 | - | - | 26.3
VoxGRAF [586] § | 9.6 | - | - | 6.7

GRAF [585] is the first work to generate radiance fields via an adversarial approach, where the discriminator is trained to predict if the 2D images rendered from the radiance fields are real or fake. π-GAN [86] improves the expressivity of the generated object by adopting a network based on SIREN [611] for the neural radiance field, while leveraging a progressive growing strategy. GOF [743] combines NeRFs and occupancy networks to ensure compactness of learned object surfaces. Mip-NeRF [40] incorporates mipmapping from computer graphics rendering pipelines to improve the efficiency and quality of rendering novel views. GRAM [151] confines the sampling and radiance learning to a reduced space (a set of implicit surfaces), which improves the learning of fine details. MVCGAN [812] incorporates geometry constraints between views. StyleNeRF [225] and StyleSDF [489] incorporate a style-based generator [321] to improve the rendering efficiency while generating high-resolution images. EpiGRAF [615] improves the training process through a location- and scale-aware discriminator design, with a modified patch sampling strategy to stabilize training. Mip-NeRF 360 [41] builds upon Mip-NeRF [40] to extend to unbounded scenes, where the camera can rotate 360 degrees around a point. TensoRF [94] models the radiance field of a scene as a 4D tensor and factorizes the tensors, leading to significant gains in reconstruction speed and rendering quality.

Another line of work aims to edit the radiance fields. Editing-NeRF [404] proposes to edit the NeRF based on a user's coarse scribbles, which allows for color modification or removing of certain parts of the shape. ObjectNeRF [754] learns an object-compositional NeRF to duplicate, move, or rotate objects in the scene. NeRF-editing [791] explores a method to edit a static NeRF to perform user-controlled shape deformation, where the user can edit a reconstructed mesh. D3D [650] learns to disentangle the geometry, appearance and pose of the 3D scene using just monocular observations, and allows for editing of real images by computing their embeddings in the disentangled GAN space to enable control of those attributes. Deforming-NeRF [741] and CageNeRF [510] enable free-form radiance field deformation by extending cage-based deformation of meshes to radiance field deformation, which allows for explicit object-level scene deformation or animation. Besides, some works [177, 282, 395, 806] also focus on editing the visual style of the NeRF.
Some works also explore the compositional generation of scenes for more control over the generation process. GIRAFFE [479] incorporates
a compositional 3D scene representation to control the image formation process with respect to the camera pose, object poses and objects’
shapes and appearances. GIRAFFE HD [746] extends GIRAFFE to generate high-quality high-resolution images by leveraging a style-based
[321] neural renderer while generating the foreground and background independently to enforce disentanglement. DisCoScene [744] spatially
disentangles the scene into object-centric generative radiance fields, which allows for flexible editing of objects and composing them and the
background into a complete scene. Recently, FusedRF [215] introduces a way to fuse multiple radiance fields into a single scene.

7.3 Hybrid 3D Representations and Others


Hybrid representations generally combine explicit representations (e.g., voxels, point clouds, meshes) with implicit representations (e.g.,
neural fields). Such hybrid representations aim to capitalize on the strengths of each type of representation, such as the explicit control over
geometry afforded by explicit representations, as well as the memory efficiency and flexibility of the implicit representations.
Voxels and Neural Fields. One line of work combines voxels with neural implicit fields. NSVF [396] defines a set of neural fields which
are organized in a sparse voxel-based structure, where the neural field in each cell models the local properties in that cell. Jiang et al. [299]
present a set of overlapping voxel grid cells which store local implicit latent codes. SNeRG [250] is a representation with a sparse voxel grid

of diffuse color, volume density and features, where images can be rendered by an MLP in real-time. VolumeGAN [745] represents 3D scenes
in a hybrid fashion, with explicit voxel grids and NeRF-like implicit feature fields to generate novel views. DiffRF [468] generates the neural
radiance fields via a DM on a voxel grid and a volumetric rendering loss, which leads to efficient, realistic and view consistent renderings.
Triplane representation. Another promising hybrid approach is the triplane representation introduced by EG3D [85], where the scene
information is stored in three axis-aligned orthogonal feature planes, and a decoder network takes in aggregated 3D features from the three
planes to predict the color and density values at each point. This representation is efficient, as it can keep the decoder small by leveraging
explicit features from the triplanes, while the triplanes scale quadratically instead of cubically (as compared to dense voxels) in terms of
the memory requirement. Shue et al. [601] generate neural fields in a triplane representation with a DM for generation of high-fidelity and
diverse scenes.
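A minimal sketch of the triplane lookup described above (a generic illustration in the spirit of EG3D, not its actual implementation): features are bilinearly sampled from the three axis-aligned planes at each 3D point, aggregated, and decoded into color and density.

```python
# Triplane feature lookup and decoding (illustrative sketch; assumes PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

C, R = 32, 128
planes = nn.Parameter(torch.randn(3, C, R, R))      # XY, XZ, YZ feature planes
decoder = nn.Sequential(nn.Linear(C, 64), nn.ReLU(), nn.Linear(64, 4))  # RGB + density

def query_triplane(points):                           # points: (N, 3) in [-1, 1]^3
    projections = [points[:, [0, 1]], points[:, [0, 2]], points[:, [1, 2]]]
    feats = []
    for plane, uv in zip(planes, projections):
        # grid_sample takes a (1, N, 1, 2) sampling grid and returns (1, C, N, 1).
        grid = uv.view(1, -1, 1, 2)
        sampled = F.grid_sample(plane.unsqueeze(0), grid, align_corners=True)
        feats.append(sampled.view(C, -1).t())          # (N, C) per plane
    fused = sum(feats)                                  # aggregate the three planes
    return decoder(fused)                               # (N, 4): color and density

print(query_triplane(torch.rand(1000, 3) * 2 - 1).shape)
```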
Others. Other approaches have also been explored for representing 3D scenes. SRN [609] represents the 3D scenes implicitly while
formulating the image formation as a differentiable ray-marching algorithm, allowing it to be trained with only 2D images and their camera
poses. NeuMesh [753] uses a mesh-based implicit field for editing of the scene’s geometry and texture. Point-NeRF [740] explores a point-based
neural radiance field.

8 3D HUMAN DATA GENERATION


Besides generation of 3D shapes and scenes, generation of 3D humans is also an important task. Different from generation of shapes which
tend to be static and inanimate, the ultimate goal of 3D human generation is to synthesize animatable humans with movable joints and
non-rigid components (such as hair and clothing), which further increases the difficulty of the task. In general, the work on generating 3D
human avatars can be split into the full body and head, which we discuss separately below.

8.1 3D Avatar Body


3D human avatar generation aims to generate diverse 3D virtual humans with different identities and shapes, which can take arbitrary
poses. This task can be very challenging, as there can be many variations in clothed body shapes and their articulations can be very complex.
Furthermore, since 3D avatars should ideally be animatable, 3D shape generation methods are typically not easily extended to non-rigid
clothed humans.
Unconditional. Early works (e.g., SCAPE [22], SMPL [413], Adam [308], STAR [490], GHUM [732]) explore generating 3D human meshes
via human parametric models, which express human body shapes in terms of a relatively small set of parameters that deform a template
human mesh. These models conveniently synthesize human shapes as users only need to fit/regress values for a small set of parameters. Some
works also explore a similar approach for modeling clothed humans [227, 425, 760]. However, such human parametric models are not able to
capture many finer and personalized details due to the fixed mesh topology and the bounded resolution. To overcome these limitations, a line
of works propose to generate 3D human avatars by adopting implicit non-rigid representations [113, 573]. gDNA [110] proposes a method to
generate novel detailed 3D humans with a variety of garments, by predicting a skinning field (following [113]) and a normal field (i.e., using
an MLP to predict the surface normals), and is trained using raw posed 3D scans of humans. AvatarGen [803] leverages an SDF on top of a
SMPL human prior, and enables disentangled control over the human model’s geometry and appearance. GNARF [45] proposes a generative
and articulated NeRF that is trained with posed 2D images. EVA3D [268] is a compositional human NeRF representation that can be trained
using sparse 2D human image collections, which is efficient and can render in high-resolution. HumanGen [300] generates 3D humans by
building upon an existing 2D generator which provides prior information, and adopts a neural rendering approach to produce texture, while
an SDF produces the fine-grained geometry information.
Conditioned on 3D scans. Another line of work aims to produce an animatable 3D avatar from 3D scans. Traditionally, articulated
deformation of a given 3D human mesh is often performed with the classic linear blend skinning algorithm. However, the deformation is
simple and cannot produce pose-dependent variations, which leads to various artifacts [358]. Many methods have been proposed to tackle
these issues, such as dual quaternion blend skinning [323] and multi-weight enveloping [449, 698].
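For reference, classic linear blend skinning deforms each rest-pose vertex by a convex combination of rigid joint transformations (standard formulation; symbols ours):
\[ \mathbf{v}_i' = \sum_{k=1}^{K} w_{i,k}\, \mathbf{T}_k(\theta)\, \mathbf{v}_i, \qquad \sum_{k=1}^{K} w_{i,k} = 1, \]
where \(\mathbf{v}_i\) is a template vertex in homogeneous coordinates, \(\mathbf{T}_k(\theta)\) is the transformation of joint \(k\) under pose \(\theta\), and \(w_{i,k}\) are the skinning weights. Since the deformed vertex is a fixed linear blend of rigid transforms, it cannot express pose-dependent, non-rigid surface changes, which underlies the artifacts discussed above.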
Recently, many works aim to produce an animatable 3D avatar from 3D scans via implicit representations of the human body [480], which
are resolution-independent, smooth and continuous. NASA [147] learns an occupancy field to model humans as a collection of deformable
components, which is extended in LEAP [456] to also represent unseen people with a single neural network. Some works such as SNARF
[113] and S3 [766] also leverage occupancy field representations and can generate 3D animated human characters which generalize well
to unseen poses. Some other works (e.g., SCANimate [574], Neural-GIF [656], imGHUM [19], UNIF [527]) explore an SDF-based implicit
representation.
Some works (e.g., Peng et al. [506], Neural Actor [397], Neural Body [509], Zheng et al. [822], Weng et al. [707], Zhao et al. [816], NeuMan
[302], InstantAvatar [301]) also propose deformable and animatable NeRFs for synthesizing humans from novel poses in novel views.

8.2 3D Avatar Head


On the other hand, 3D avatar head generation aims to generate a 3D morphable face model with fine-grained control over the facial
expressions. It is challenging to produce realistic 3D avatar heads, and it is even more difficult to model the complex parts such as hair.
Unconditional. Traditional approaches model facial appearance and geometry based on the 3D Morphable Model (3DMM) [50], which is a parametric model that simplifies face modeling to fitting values in a linear subspace. Subsequently, many variants have been proposed,
including multilinear models [73, 682], full-head PCA models [375] and fully articulated head models [209, 505]. Several deep generative

methods [116, 616, 663] that generate an explicit 3D face model have also been explored; however, these methods tend to produce 3D avatar heads that lack realism. To improve the realism, many recent works propose implicit methods for unconditional generation of head
avatars, with some exploring an SDF-based approach [771], NeRF-based approach [271, 836], or with occupancy fields [721]. Several of these
methods [721, 771] can even model hair.
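To make the linear-subspace idea behind 3DMMs (discussed at the beginning of this subsection) concrete, the face shape (and, analogously, the texture) is expressed as a mean plus a weighted combination of learned basis vectors (standard formulation; symbols ours):
\[ \mathbf{S}(\boldsymbol{\alpha}) = \bar{\mathbf{S}} + \sum_{i=1}^{m} \alpha_i\, \mathbf{B}_i, \]
where \(\bar{\mathbf{S}}\) is the mean face and \(\mathbf{B}_i\) are basis shapes (e.g., PCA components learned from scans), so fitting a new face reduces to estimating the low-dimensional coefficients \(\boldsymbol{\alpha}\).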
Others. Recently, one line of works explores the editing of avatar faces [160, 494, 636]. IDE-3D [634] allows users to perform interactive
editing of shape and texture.

9 3D MOTION GENERATION
Besides generating 3D humans, many AIGC works also generate 3D human motions to drive the movements of 3D humans. The most
common representation used for 3D motion generation is the 3D skeleton pose, a simple yet effective representation that captures the human joint locations. After generating the 3D skeleton pose sequences, they can be utilized to generate simple moving human meshes, or be used to animate a rigged avatar (which can be generated by methods in Sec. 8). In 3D motion generation, it can be challenging to synthesize long-term motions that are realistic and coherent. It is also difficult to capture multiple diverse yet plausible motions from a starting point.

9.1 Unconditional 3D Motion Generation


A few works explore the unconditional synthesis of human motions. Holden et al. [265] introduce a deep learning framework for motion
synthesis with an autoencoder that learns the motion manifold. CSGN [749] introduces a GAN-based approach to generate long motion
sequences with meaningful and continuous actions. MoDi [529] proposes an encoder architecture to learn a structured, highly semantic
and clustered latent space, which facilitates motion generation via a GAN-based approach. MDM [648] adopts a DM for generating human
motion, which frees the assumption of the one-to-one mapping of autoencoders and can express many plausible outcomes.

9.2 Conditional 3D Motion Generation


Conditioned on prefix frames. In conditional 3D motion generation, one line of works aims to generate the motion following a given set
of prefix frames, which is also known as motion prediction. These prefix 3D skeleton poses can be obtained using depth sensors such as the
Kinect, or extracted from an input image or video with a 3D pose estimation algorithm [186, 216, 504]. Many works rely on recurrent neural
networks (RNNs) [12, 187, 439, 828] for learning to autoregressively predict and generate future motion. In order to improve the modeling of
spatio-temporal information and long-term motion generation, other architectures have also been investigated, such as CNNs [363], GCNs
[142, 370, 426, 435, 436], Transformers [69, 440] and MLPs [235]. Furthermore, in order to predict multiple diverse yet plausible future poses,
some other works also adopt probabilistic elements such as VAEs [70, 238], GANs [42, 254, 813], or NFs [790].
Conditioned on actions. Another line of work focuses on generating motions of a specified action class, where an initial pose or sequence is not required. Generally, this can be achieved by feeding the target action class as input to the motion generation model during training, with class-annotated clips as supervision. Action2Motion [232] first introduces a Lie Algebra-based VAE framework to generate motions
from desired action categories to cover a broad range of motions. ACTOR [511] is a Transformer VAE that can synthesize variable-length
motion sequences of a given action. Some approaches based on GANs have also been proposed to generate more diverse motions [146, 704]
or capture rich structure information [786]. Furthermore, some works [429] also perform multi-person motion generation.

10 AUDIO MODALITY
Many AIGC methods also aim to generate audio, which facilitates the creation of voiceovers, music, and other sound effects. They can also be useful in text-to-speech applications, e.g., assistive technology and entertainment purposes. However, it can be challenging to generate audio realistically, including the pitch and timbre variations. In the context of speech, it can be difficult to capture the emotional and tonal variations, and it is also difficult to generate speech based on a given identity. In various applications, the inference speed of generating the audio is also important.

Table 6. Comparison between audio generation models. We report results for unconditional generation based on raw waveform on the Speech Commands [705] dataset. For this task, we report the Fréchet Inception Distance (FID), Inception Score (IS) and 5-scale Mean Opinion Score (MOS) metrics. FID measures the similarity between real and generated audio, IS measures the diversity of generated audio and whether it can be clearly determined by a classifier, while MOS evaluates the quality according to human subjects. Results for neural vocoding (where mel-spectrograms are given) are reported on the proprietary dataset of [102] and the LJ speech [293] dataset, where we report the 5-scale Mean Opinion Score (MOS) metric. We report the results from two different protocols, which have been denoted with different symbols: †[102], ‡[341].

Method | Speech Commands FID (↓) | Speech Commands IS (↑) | Speech Commands MOS (↑) | LJ MOS (↑) | [102]'s dataset MOS (↑)
WaveRNN [311] † | - | - | - | 4.49±0.05 | 4.49±0.04
Parallel WaveGAN [748] † | - | - | - | - | 3.92±0.05
MelGAN [347] † | - | - | - | - | 3.95±0.06
Multi-band MelGAN [758] † | - | - | - | - | 4.10±0.05
GAN-TTS [49] † | - | - | - | - | 4.34±0.04
WaveGrad Base [102] † | - | - | - | 4.35±0.05 | 4.41±0.03
WaveGrad Large [102] † | - | - | - | 4.55±0.05 | 4.51±0.04
WaveNet-128 [487] ‡ | 3.279 | 2.54 | 1.34±0.29 | 4.43±0.10 | -
WaveNet-256 [487] ‡ | 2.947 | 2.84 | 1.43±0.30 | - | -
WaveGAN [163] ‡ | 1.349 | 4.53 | 2.03±0.33 | - | -
ClariNet [513] ‡ | - | - | - | 4.27±0.09 | -
WaveGlow [523] ‡ | - | - | - | 4.33±0.12 | -
WaveFlow-64 [515] ‡ | - | - | - | 4.30±0.11 | -
WaveFlow-128 [515] ‡ | - | - | - | 4.40±0.07 | -
DiffWave Base [341] ‡ | - | - | - | 4.38±0.08 | -
DiffWave Large [341] ‡ | - | - | - | 4.44±0.07 | -
DiffWave [341] ‡ | 1.287 | 5.30 | 3.39±0.32 | - | -

10.1 Speech
In this subsection, we discuss the unconditional speech generation works. In general, these methods can also be slightly modified to apply to the conditional settings (e.g., text-to-speech, which is discussed in Sec. 18).
WaveNet [487] is the first deep generative model for audio generation. It is an autoregressive model that predicts the next audio sample conditioned on previous audio samples, and is built based on causal convolutional layers. Subsequently, some other works such as SampleRNN [444] and WaveRNN [311] also adopt an autoregressive approach while improving the modeling of long-term dependencies [444] and reducing the sampling time [311]. VQ-VAE [677] leverages a VAE to encode a latent representation for audio synthesis. Some works

[336, 553] also explore a NF-based approach to generate audio. Notably, Parallel WaveNet [486] distills a trained WaveNet model into a
flow-based IAF [337] model for the sake of efficient training and sampling. WaveGlow [523] extends the flow-based Glow [336] with WaveNet
for efficient and high-quality audio synthesis. WaveFlow [515] builds upon WaveGlow, further improving the fidelity and synthesis speed.
ClariNet [513] distills a flow-based Gaussian IAF model from the autoregressive WaveNet. WaveGAN [163], MelGAN [347] and HiFi-GAN
[340] adopt a GAN-based approach to generating audio, with MelGAN and HiFi-GAN providing significant improvements in generation
speed. Parallel WaveGAN [748] and Multi-band MelGAN [758] further improve the speed of audio waveform generation. Diffusion-based
audio generation has also been explored in DiffWave [341] and WaveGrad [102], which show strong performance. Recently, Audio-LM [57]
casts audio generation as a language modeling task which is capable of long-term consistency.
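Returning to the causal convolutions underlying WaveNet-style autoregressive models discussed above, the following is a minimal sketch (generic, not the original implementation): the input is padded only on the left so that each output sample depends solely on past samples, and stacking dilated layers grows the receptive field exponentially.

```python
# Dilated causal 1D convolution for autoregressive audio modeling (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                       # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))             # pad only on the left (the past)
        return self.conv(x)

# Dilations 1, 2, 4, ... give an exponentially growing receptive field.
net = nn.Sequential(*[CausalConv1d(16, dilation=2 ** i) for i in range(6)])
print(net(torch.randn(1, 16, 16000)).shape)     # (1, 16, 16000)
```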

10.2 Music
Another widely explored audio generation task is music generation. Different from speech, music can have multiple tracks which represent
different instruments. There are also many styles of music (e.g., pop, classical) and different emotions/themes, which can be challenging to
keep consistent over long ranges. Although there are many earlier attempts at music generation [58, 172], producing complex multi-track
music has largely been facilitated by deep generative methods.
The first attempts [125, 277] to generate music with deep learning adopted recurrent neural networks (RNNs) to recurrently generate
new notes. Subsequently, Engel et al. [173] approach music synthesis with a WaveNet-based autoencoder, and also introduce the NSynth
dataset with annotated notes of various instruments for training, which enables expressive and realistic generation of instrument sounds.
MiDiNet [762] aims to generate melodies from scratch or from a few given bars of music by taking a GAN-based approach, and can be
expanded to generate music with multiple tracks. DeepBach [239] aims to generate polyphonic music, with a specific focus on hymn-like
pieces. Dieleman et al. [156] do not handle the symbolic representations of music (e.g., scores, MIDI sequences) and handle raw audio instead,
which improves generality and is more able to capture the precise timing, timbre and volume of the notes. Jukebox [153] generates music
with singing with a multi-scale VQ-VAE encoder and an autoregressive Transformer decoder, which can generate high-fidelity and diverse
songs that are coherent for up to several minutes. It can also be conditioned on the artist and genre to control the vocal style.

11 OTHER MODALITIES
Vector Graphics. A few works generate vector graphics [414], with some taking images as inputs [427, 546]. These can be important when
designs need to be sharp, scalable and flexible, e.g., logos.
Graphs. Some works focus on generation of graphs [140, 383, 777], which are often applied to molecular generation. Some works
[144, 200, 204, 273, 307, 401, 420, 681, 718] generate 3D molecules unconditionally, i.e., directly generating each atom type and position. Some
works [422, 432, 595, 604, 737, 738] generate molecules conditioned on the 2D SMILE structure.
Others. There are various other modalities where generative models have been explored, including tabular data [343], sketches [694],
CAD models [720], game maps [155], and so on.

12 CROSS-MODALITY IMAGE GENERATION


In this section, we discuss cross-modality image generation, where images are generated using conditioning information from other modalities. In the cross-modality setting, it is crucial that we learn the interaction between multiple modalities, which enables us to control the generation process via the inputs from another modality. A summary of various cross-modality image generation settings is shown in Fig. 6.

Fig. 6. Illustration of various cross-modality image generation settings. Note that "[V]" refers to the subject's unique identifier. Examples obtained from [324, 567, 569].

12.1 Text-to-Image Generation
In the text-to-image task, the goal is to generate images corresponding to a given textual description. Text-to-image is a popular direction that has attracted a lot of attention, and many developments have been made over the years. In general, these developments can be categorized as: GANs, Autoregressive Transformers, and DMs. A comparison between representative works can be found in Tab. 7.
GANs. Most GAN-based works for text-to-image generation incorporate a text encoder (e.g., an RNN) to encode text embeddings which
are fed to the GAN. AlignDRAW [433] mainly relies on RNNs [222] to encode the text inputs and generate images, learning an alignment between the input captions and the generating canvas, while employing GANs for post-processing to improve the image quality. Reed et al. [549] develop a fully end-to-end differentiable deep convolutional GAN architecture with a convolutional-RNN, which significantly improves the realism and resolution of the generated images (from 28 × 28 of AlignDRAW to 64 × 64 images). GAWWN [550] builds upon
[549] and finds that image quality can be improved by providing additional information to the model, such as bounding boxes that instruct
the model where to generate the objects. StackGAN [801] and Stackgan++ [802] adopt the text encoder of [549], and present a coarse-to-fine
refinement approach and stack multiple GANs to improve the image resolution, yielding images of resolution 256 × 256. AttnGAN [742]
proposes an attention-driven architecture that allows subregions of the image to be drawn by focusing on the most relevant words of the

caption, providing significant improvements to the image quality. Other GAN-based text-to-image methods have been proposed for better
control (e.g., ControlGAN [360], ManiGAN [361]), resolution (e.g., HDGAN [815], DM-GAN [835]), or semantic alignment (e.g., MirrorGAN
[528], XMC-GAN [800]).
Autoregressive Transformers. DALL-E [539] is the first to leverage autoregressive Transformers [680] for text-to-image generation, by using them to generate a sequence of image tokens after taking the text tokens as input. The image tokens are then decoded into high-quality images via a trained discrete VAE [677]. CogView [158] also proposes a similar approach, but with significant improvements in image quality, and also investigates fine-tuning for downstream tasks. CogView 2 [159] introduces a cross-modality pre-training method (CogLM) which facilitates the conditioning on both image and text tokens to perform various tasks such as image captioning or image editing, on top of text-to-image generation. NÜWA [712] trains an encoder that takes text or visual inputs, and an autoregressive decoder with an efficient 3DNA attention mechanism that is shared among 8 visual synthesis tasks, e.g., text-to-image generation and manipulation, text-to-video generation and manipulation. Make-A-Scene [194] improves the controllability of text-to-image generation by allowing an optional segmentation map as input, while also increasing the quality and resolution of the generated images by improving the tokenization process. Parti [784] further scales up the model size to improve image quality.

Table 7. Comparison between representative text-to-image generative models. Results are reported on MS-COCO [390] and CUB [548, 686]. For both datasets, we report the Fréchet Inception Distance (FID) and Inception Score (IS), where FID measures the similarity between the generated images and reference ground truth images with the same captions, while IS measures the diversity of generated images and whether it can be clearly determined by a classifier. On MS-COCO, we also report the zero-shot FID metric, where models are tasked to generate images for the MS-COCO captions without dataset-specific tuning, which is a metric that recent works tend to focus on.

Type | Method | MS-COCO FID (↓) | MS-COCO Zero-shot FID (↓) | MS-COCO IS (↑) | CUB FID (↓) | CUB IS (↑)
GANs | GAN-INT-CLS [549] | - | - | 7.88 | - | 2.88
GANs | GAWWN [550] | - | - | - | - | 3.62
GANs | StackGAN [801] | 74.05 | - | 8.45 | 51.89 | 3.70
GANs | StackGAN++ [802] | 81.59 | - | 8.30 | 15.30 | 4.04
GANs | PPGN [472] | - | - | 9.58 | - | -
GANs | HDGAN [815] | - | - | 11.86 | - | 4.15
GANs | AttnGAN [742] | 35.49 | - | 25.89 | 23.98 | 4.36
GANs | MirrorGAN [528] | 34.71 | - | 26.47 | 18.34 | 4.56
GANs | DM-GAN [835] | 32.64 | - | 30.49 | 16.09 | 4.75
GANs | DF-GAN [645] | 19.32 | - | - | 14.81 | 5.10
GANs | DM-GAN+CL [768] | 20.79 | - | 33.34 | 14.38 | 4.77
GANs | XMC-GAN [800] | 9.33 | - | 30.45 | - | -
GANs | LAFITE [829] | 8.12 | 26.94 | 32.34 | 10.48 | 5.97
Autoregressive | DALL-E [539] | - | 27.50 | - | - | -
Autoregressive | CogView [158] | - | 27.10 | - | - | -
Autoregressive | CogView2 [159] | 17.5 | 24.0 | 25.2 | - | -
Autoregressive | NÜWA [712] | 12.9 | - | 27.2 | - | -
Autoregressive | Make-A-Scene [194] | 7.55 | 11.84 | - | - | -
Autoregressive | Parti [784] | 3.22 | 7.23 | - | - | -
DMs | Stable Diffusion [561] | - | 12.61 | - | - | -
DMs | GLIDE [477] | - | 12.24 | - | - | -
DMs | VQ-Diffusion [226] | 13.86 | - | - | 10.32 | -
DMs | DALL-E 2 [538] | - | 10.39 | - | - | -
DMs | Imagen [569] | - | 7.27 | - | - | -
DMs | eDiff-i [36] | - | 6.95 | - | - | -
DMs | GLIGEN [378] | 5.61 | - | - | - | -

Diffusion Models. More recently, DMs have been shown to be a very effective approach for text-to-image generation, and have
attracted much attention. GLIDE [477] explores diffusion-based text-to-image generation, and investigates using guidance from CLIP [530] or
classifier-free guidance [262], where the latter produces photorealistic images that are consistent with the captions. Imagen [569] leverages
a generic language model that is pre-trained only on text (T5 [535]) to encode text and a dynamic thresholding technique for diffusion
sampling, which provide improved image quality and image-text alignment. Different from previous methods that generate in the pixel
space, Stable Diffusion [561] introduces Latent Diffusion Models that perform diffusion in the latent space. Specifically, an autoencoder is
first trained, and the diffusion model is trained to generate features in the latent space of the autoencoder, which greatly reduces training
costs and inference costs. DALL-E 2 [538] also performs diffusion in the latent space, but instead uses the joint text and image latent space of
CLIP [530], which has the advantage of being able to semantically modify images by moving in the direction of any encoded text vector.
VQ-Diffusion [226] performs diffusion in the latent space of a VQ-VAE [677] while encoding text inputs using CLIP. Some other developments
include conditional generation using pre-trained unconditional diffusion models (e.g., ILVR [119]), having a database of reference images
(e.g., RDM [54], kNN-Diffusion [594]), or training an ensemble of text-to-image diffusion models (e.g., eDiff-I [36]).
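Since several of the diffusion models above rely on classifier-free guidance [262], a short sketch of the guided noise prediction used at sampling time is given below (generic form; `eps_model`, `text_emb` and `null_emb` are placeholders for illustration):

```python
# Classifier-free guidance at sampling time (illustrative sketch).
import torch

def guided_eps(eps_model, x_t, t, text_emb, null_emb, w=7.5):
    eps_cond = eps_model(x_t, t, text_emb)     # conditioned on the caption
    eps_uncond = eps_model(x_t, t, null_emb)   # conditioned on an empty prompt
    # w = 0 recovers unconditional sampling; larger w trades diversity for
    # stronger adherence to the text prompt.
    return eps_uncond + w * (eps_cond - eps_uncond)
```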

12.2 Text-driven Image Editing


In text-driven image editing, input images are manipulated according to given text descriptions to obtain an edited version of the original
image. Earlier works on text-driven editing are based on GANs, where a text encoder is added to the GAN and the GAN is trained on
image-text pairs (e.g., ManiGAN [361], TAGAN [470]). For better zero-shot performance with pre-trained image GANs, a GAN inversion
technique is combined with the CLIP’s text-image embedding space for image editing (e.g., StyleCLIP [500], e4e [661], HyperStyle [13]).
Recently, many works focus on editing of images with DMs. DiffusionCLIP [328] explores the text-driven editing of an image by using a CLIP
loss on DMs. Specifically, an input image (to be edited) is first transformed to the latent space via DMs, and the DM is fine-tuned with the
directional CLIP loss (which uses the text prompts to provide a gradient), such that it produces updated samples during the reverse diffusion
process. Blended Diffusion [30] adopts a similar approach, but also allows for region-based image editing where edits are contained in a local region, and enables this through masking the CLIP loss and enforcing a background preservation loss. Hertz et al. [255] modify a pre-trained
text-to-image DM by injecting cross-attention maps during the diffusion process (which can automatically be obtained from the internal
layers of the generative model itself), to control the pixels that the prompt text should attend to. Imagic [324] also leverages a pre-trained
text-to-image DM, which enables sophisticated manipulations of a real high-resolution image, including editing of multiple objects.
Some other works input additional information to the generation process to have more control over the generated images. Voynov et al.
[685] take in an additional input sketch to guide a pre-trained text-to-image DM to generate an image that follows the spatial layout of the
sketch. Make-A-Scene [194] explores controllable text-to-image generation through adding additional sketches or editing extracted semantic
maps. ControlNet [807] aims to allow pre-trained large text-to-image DMs to support additional input conditions, e.g., scribbles, poses,
segmentation maps, via a fine-tuning approach involving the zero convolution layer. SpaText [29] allows the user to provide a segmentation
map with annotated language descriptions of each region. GLIGEN [378] takes in captions as well as other additional information (e.g.,

bounding boxes, human poses, depth maps, semantic maps) to perform grounded text-to-image generation using a frozen pre-trained
text-to-image DM.

12.3 Personalized Image Synthesis


In this task, generative models aim to generate novel images of a given subject (e.g., a person). Many works leverage GANs to perform
this task, such as IC-GAN [80] which additionally requires the discriminator to predict instance information, and Pivotal Tuning [560]
which adopts a GAN inversion approach. Recently, many works leverage pre-trained text-to-image DMs, which have shown promising
improvements. Textual Inversion [195] uses only a few images (typically 3-5) of a user-provided concept, and represents the concept through
new "words" in the embedding space of a text-to-image DM, enabling personalized creation of concepts guided by natural language sentences.
DreamBooth [567] aims to generate “personalized” images for a given subject (e.g., a specific dog or human) and introduces an approach to fine-tune the pre-trained text-to-image diffusion model, which results in better performance. Custom Diffusion [348] further extends this to
multiple concepts and their compositions, through jointly training for multiple concepts or combining multiple fine-tuned models into one.
Several recent developments [196, 566, 596] further improve the efficiency and speed of the personalization process.

13 CROSS-MODALITY VIDEO GENERATION


Following the tremendous progress in cross-modality image generation, more and more works have also explored the more challenging
cross-modality video generation task. We discuss two popular settings below: text-to-video generation, and text-driven video editing.

13.1 Text-to-Video generation


The text-to-video task aims to generate videos based on the given text captions, and is significantly more challenging than the text-to-image task, since the video frames need to be temporally consistent, while also adhering to the specified motions or actions. A comparison between recent methods is shown in Tab. 8.

Table 8. Comparison between representative text-to-video generative models. Results are reported on MSR-VTT [734] and UCF-101 [627], on both the fine-tuned setting (where the dataset is used during training) and the zero-shot setting (where the dataset is not used during training). For the MSR-VTT dataset, we report the Fréchet Inception Distance (FID) [499] and CLIP similarity (CLIPSIM) [711], where CLIPSIM measures the average CLIP similarity between the generated video frames and text, which evaluates the semantic match between them (higher is better). For the UCF-101 dataset, we report the FVD and IS metrics (refer to Tab. 2 for more details). Note that, for the UCF-101 dataset in the text-to-video setting, class names are provided directly as text conditioning.

Type | Method | MSR-VTT FID (↓) | MSR-VTT CLIPSIM (↑) | UCF-101 FVD (↓) | UCF-101 IS (↑)
Fine-tuned | GODIVA [711] | - | 0.2402 | - | -
Fine-tuned | NÜWA [712] | 47.68 | 0.2439 | - | -
Fine-tuned | Make-A-Video [605] | - | - | 81.25 | 82.55
Zero-shot | CogVideo (Chinese) [270] | 24.78 | 0.2614 | 751.34 | 23.55
Zero-shot | CogVideo (English) [270] | 23.59 | 0.2631 | 701.59 | 25.27
Zero-shot | Make-A-Video [605] | 13.17 | 0.3049 | 367.23 | 33.00
Zero-shot | VideoLDM [53] | - | 0.2929 | 550.61 | 33.45

Non-diffusion-based approaches. In the earlier days of text-to-video generation, many works explored GAN-based approaches [35, 493] and VAE-based approaches [441, 461], or a combination of both [380], to perform the task. These works encoded text descriptions into embeddings with a text encoder (e.g., an RNN) and fed the embeddings to their decoder (which is usually an RNN). However, these approaches tend to focus on simpler settings, and generally only produce a short clip with low resolution (e.g., 64 × 64). Subsequently, with the rise of
autoregressive Transformers, GODIVA [711] leverages an image-based VQVAE with an autoregressive model for text-to-video generation,
where a Transformer takes in text tokens and produces video frames autoregressively. This approach is extended in NÜWA [712] with
multi-task pre-training on both images and videos. NÜWA-Infinity [382] further improves the autoregressive generation process by a
patch-by-patch generation approach, which improves the synthesis ability on long-duration videos. CogVideo [270] adds temporal attention
modules to a frozen CogView 2 [159] (which is an autoregressive text-to-image Transformer model) to perform text-to-video generation,
which significantly reduces the training cost by inheriting knowledge from the pre-trained text-to-image model.
Diffusion Models. Similar to the success found in text-to-image generation, DMs have also found much success in text-to-video
generation. Video Diffusion [263] is the first to apply DMs to large-scale text-conditioned video generation, proposing a dedicated architecture for video DMs and training it on captioned videos. Imagen Video [258] further
improves this by using a cascade of video DMs to improve the quality and fidelity of generated videos. Make-A-Video [605] aims to perform
text-to-video generation without paired text-video data by leveraging text-to-image models for the correspondence between text and visuals.
Story-LDM [536] aims to generate stories, where characters and backgrounds are consistent.
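To make the CLIPSIM metric reported in Tab. 8 more concrete, below is a minimal sketch of how it can be computed with an off-the-shelf CLIP model from the HuggingFace transformers library; the checkpoint name and choice of library are illustrative assumptions rather than the exact evaluation code used by the surveyed works.

    # Minimal CLIPSIM sketch: average CLIP similarity between generated video
    # frames and the conditioning caption (higher means a better semantic match).
    # The checkpoint below is an illustrative assumption.
    import torch
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    @torch.no_grad()
    def clipsim(frames, caption):
        """frames: list of PIL.Image video frames; caption: the text prompt."""
        inputs = processor(text=[caption], images=frames, return_tensors="pt", padding=True)
        out = model(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        # Cosine similarity of every frame against the single caption, averaged over frames.
        return (img @ txt.T).mean().item()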

13.2 Text-driven Video Editing


One research direction also focuses on the editing of videos with text descriptions. Text2LIVE [38] introduces a general approach without
using a pre-trained generator in the loop, which can perform semantic, localized editing of real world videos. Subsequently, many approaches
leverage DMs in their approach. Tune-A-Video [716] uses pre-trained image diffusion models to edit the video of a given text-video pair.
Dreamix [466] and Gen-1 [174] perform video editing with a text-conditioned video diffusion model. Video-P2P [405] performs video editing
with cross-attention control based on the text inputs by adapting an image-based diffusion model.

14 CROSS-MODALITY 3D SHAPE GENERATION


In addition to cross-modality generation of 2D modalities such as image and video, many AIGC methods have also tackled the cross-modality
generation of 3D shapes. Many works have explored generating 3D shapes with input text or image information, which we comprehensively
discuss below.


14.1 Text-to-3D Shapes


In text-to-3D shape generation, the goal is to generate 3D assets corresponding to text descriptions. This is very challenging, since most 3D
shape generative models are trained on datasets of specific object categories like ShapeNet [91] and struggle to generalize to the zero-shot
setting. Another challenge is the lack of large-scale captioned 3D shape data for training, which can limit the capabilities of trained text-to-3D
shape models.
Earlier approaches train their generative models with text-3D shape pairs. Text2Shape [99] collected a dataset of natural language
descriptions on the ShapeNet dataset [91], and used it to train a 3D GAN for text-to-3D shape generation. Liu et al. [412] adopted an implicit
occupancy field representation and used an Implicit Maximum Likelihood Estimation (IMLE) approach instead of GANs. ShapeCrafter [191]
collected a larger dataset (Text2Shape++), and used BERT [152], a pre-trained text encoder, to encode the text information. However, such
approaches are limited by the lack of text-3D shape pairs, and they also cannot perform zero-shot generation. To overcome
this, recent works generally rely on one of two approaches to perform text-guided 3D shape generation: either a CLIP-based approach, or a
diffusion-based approach.
CLIP-based Approaches. CLIP-Forge [579] uses CLIP to train the generative model using images, but uses text embeddings at inference
time. Text2Mesh [454] leverages CLIP to manipulate the style of 3D meshes, by using CLIP to enforce semantic similarity between the
rendered images (from the mesh) and the text prompt. CLIP-Mesh [465] uses CLIP in a similar manner, but is able to directly generate both
shape and texture. Dream Fields [294] generates 3D models from natural language prompts, while avoiding the use of any 3D training data.
Specifically, Dream Fields optimizes a NeRF (which represents a 3D object) from many camera views such that rendered images score highly
with a target caption according to a pre-trained CLIP model. Dream3D [735] incorporates a 3D shape prior using Stable Diffusion, which
forms the initialization of a NeRF to be optimized via the CLIP-based loss (as in Dream Fields).
Diffusion-based Approaches. DreamFusion [520] adopts a similar approach to Dream Fields, but instead replaces the CLIP-based loss
with a loss based on sampling a pre-trained image diffusion model through a proposed Score Distillation Sampling (SDS) approach. SJC
[691] also adopts SDS to optimize their 3D representation (which is a voxel NeRF based on DVGO [632] and TensoRF [94]), and adopts
Stable Diffusion as their image-based diffusion model instead. Magic3D [386] proposes a coarse-to-fine optimization approach with multiple
diffusion priors optimized via a SDS-based approach, which eventually produces a textured mesh. Magic3D significantly improves the speed
of DreamFusion (by 2×) while achieving higher resolution (by 8×). Latent-NeRF [452] proposes to operate a NeRF in the latent space of a
Latent Diffusion Model to improve efficiency, and also introduces shape-guidance to the generation process, which allows users to guide the
3D shape towards a desired shape. Some recent works also explore SDF-based representations with generative DMs (e.g., Diffusion-SDF [371],
SDFusion [117]), or focus on generating textures from text input (e.g., TEXTure [555], Text2Tex [97]).
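To make the SDS idea more concrete, the following is a minimal sketch of a single Score Distillation Sampling update in PyTorch. Here, render_fn (a differentiable renderer of the 3D representation), eps_model (a frozen text-conditioned diffusion model that predicts noise), and alphas_cumprod (its noise schedule) are assumed placeholders rather than a specific library API, and details such as timestep ranges, weighting, and guidance scales vary across the cited works.

    import torch

    def sds_step(optimizer, render_fn, eps_model, alphas_cumprod, camera, text_emb,
                 guidance_scale=100.0):
        """One Score Distillation Sampling (SDS) update on a 3D representation."""
        x = render_fn(camera)                                   # rendered image, differentiable w.r.t. the 3D parameters
        t = torch.randint(20, 980, (1,), device=x.device)       # random diffusion timestep
        a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
        noise = torch.randn_like(x)
        x_noisy = a_t.sqrt() * x + (1.0 - a_t).sqrt() * noise   # diffuse the rendered image

        with torch.no_grad():                                    # query the frozen diffusion prior
            eps_cond = eps_model(x_noisy, t, text_emb)
            eps_uncond = eps_model(x_noisy, t, None)
            eps_hat = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

        # SDS treats w(t) * (eps_hat - noise) as the gradient w.r.t. the rendered image,
        # skipping backpropagation through the diffusion U-Net itself.
        grad = (1.0 - a_t) * (eps_hat - noise)
        loss = (grad.detach() * x).sum()                         # surrogate loss whose gradient w.r.t. x equals grad
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()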

14.2 Image-to-3D Shapes


Besides generating 3D shapes from text input, many works also focus on synthesizing 3D shapes from input images. This is very challenging
as there can be depth ambiguity in images, and some parts of the 3D shape might be occluded and not visible in the image. Works in
this direction generally add an image encoder to capture the image information while improving the correspondence between the image and
the output 3D shape. Some works also reduce the reliance on 3D data, and aim to learn the 3D shape completely from images only. Below, we
categorize the methods based on their output 3D representation.
Image-to-3D Voxel. Some earlier works focus on generating 3D shapes as voxels. 3D-R2N2 [124] is an RNN-based architecture which takes
in one or more images of an object instance and outputs the corresponding 3D voxel occupancy grid, where the output 3D object can be
sequentially refined by feeding more observations to the RNN. TL-embedding network [214] introduces an autoencoding-based approach
for mapping images to a 3D voxel map, which learns a meaningful representation where it can perform both prediction of 3D voxels from
images and also conditional generation by combining feature vectors. Rezende et al. [306] and Yan et al. [751] learn to recover 3D volumetric
structures from pixels, using only 2D image data as supervision. Pix2Vox [729] learns to exploit context across multiple views by selecting
high-quality reconstructions for each part of the object and fusing them.
Image-to-3D Point Cloud. A few works also explore generation of 3D point clouds conditioned on images, which is more scalable
than the voxel representation. Fan et al. [175] aim to reconstruct a 3D point cloud from a single input image, by designing a point set
generation network architecture and exploring loss functions based on the Chamfer distance and Earth Mover’s distance. PC2 [446] introduces
a conditional diffusion-based approach to reconstruct a point cloud from an input image, including a projection conditioning approach where
local image features are projected onto the point cloud at each diffusion step.
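For reference, the Chamfer distance used in such point-cloud reconstruction losses can be written in a few lines of PyTorch; this is a minimal O(N·M)-memory sketch, whereas practical implementations usually rely on optimized CUDA kernels.

    import torch

    def chamfer_distance(pred, gt):
        """Symmetric Chamfer distance between point clouds pred (B, N, 3) and gt (B, M, 3)."""
        d = torch.cdist(pred, gt, p=2) ** 2           # pairwise squared distances, (B, N, M)
        pred_to_gt = d.min(dim=2).values.mean(dim=1)  # nearest ground-truth point for each predicted point
        gt_to_pred = d.min(dim=1).values.mean(dim=1)  # nearest predicted point for each ground-truth point
        return (pred_to_gt + gt_to_pred).mean()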
Image-to-3D Mesh. There has also been much interest in generating 3D meshes from image inputs. Kato et al. [322] propose a neural
renderer that can render the mesh output (i.e., project the mesh vertices onto the screen coordinate system and generate the image) as a
differentiable operation, enabling generation of 3D meshes from images. Kanazawa et al. [314] present an approach to generate meshes with
only a collection of RGB images, without requiring ground truth 3D data or multi-view images of the object, which enables image-to-mesh
reconstruction. Pixel2Mesh [693] also generates 3D meshes from single RGB images with their GCN-based architecture in a coarse-to-fine
fashion, which learns to deform ellipsoid meshes into the target shape. Liao et al. [385] design a differentiable Marching Cubes layer which
allows learning of an explicit surface mesh representation given raw observations (e.g., images or point clouds) as input.
Image-to-neural fields. To further improve scalability and resolution, some works explore generating neural fields from images. DISN
[739] introduces an approach that takes a single image as input and predicts the SDF, a continuous field that represents the 3D shape with
arbitrary resolution. Michalkiewicz et al. [453] incorporate level set methods into the architecture and introduce an implicit representation
of 3D surfaces as a distinct layer in the architecture of a CNN to improve accuracy on 3D reconstruction from images. SDFDiff [304] is a
differentiable renderer based on ray-casting SDFs, which allows 3D reconstruction with a high level of detail and complex topology from one or
several images.
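To illustrate the ray-casting of SDFs behind differentiable renderers such as SDFDiff, here is a minimal sphere-tracing sketch; sdf_fn is an assumed callable that maps 3D points to signed distances (e.g., a neural SDF), and the step count and thresholds are illustrative choices.

    import torch

    def sphere_trace(sdf_fn, origins, dirs, n_steps=64, eps=1e-3, far=5.0):
        """March rays towards the zero level set of an SDF.

        origins, dirs: (R, 3) ray origins and unit directions.
        Returns surface points (R, 3) and a boolean hit mask (R,).
        """
        t = torch.zeros(origins.shape[0], device=origins.device)
        hit = torch.zeros_like(t, dtype=torch.bool)
        for _ in range(n_steps):
            pts = origins + t.unsqueeze(-1) * dirs
            d = sdf_fn(pts).squeeze(-1)        # signed distance at the current points
            hit = hit | (d.abs() < eps)        # close enough to the surface
            # Step forward by the signed distance (which safely bounds the free space),
            # freezing rays that have converged or left the scene bounds.
            t = torch.where(hit | (t > far), t, t + d)
        return origins + t.unsqueeze(-1) * dirs, hit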

15 CROSS-MODALITY 3D SCENE VIEW SYNTHESIS


With the increasing progress in AIGC methods, more research has focused on generation of 3D scenes instead of 3D shapes only. The
generation of such scenes can be controlled by other input modalities, such as text, image, and video, which we discuss below.

15.1 Text-to-3D Scene


Many recent works have explored modelling 3D scenes with text input. Due to the lack of paired text-3D scene data, most works leverage
CLIP or pre-trained text-to-image models to handle the text input. Text2Scene [286] leverages CLIP to model and stylize 3D scenes with text
(or image) inputs, by decomposing the scene into sub-parts for handling. Set-the-Scene [132] and Text2NeRF [804] generate NeRFs from text
with the aid of text-to-image DMs to represent 3D scenes. SceneScape [188] and Text2Room [266] further leverage a pre-trained monocular
depth prediction model for more geometric consistency, and directly generate the 3D textured mesh representation of the scene. Po and
Wetzstein [518] propose a locally conditioned diffusion technique for compositional scene generation with various text captions. MAV3D
[606] aims to generate 4D dynamic scenes by using a text-to-video diffusion model.

15.2 Image-to-3D Scene


Unlike using text captions that can be rather vague, some works also synthesize 3D scene representations from image input, where the
3D scene corresponds closely to what is depicted in the image. PixelNeRF [780] predicts a NeRF based on one or a few posed images via a
re-projection approach. Pix2NeRF [68] learns to output a neural radiance field from a single image by building upon π-GAN [86]. LOLNeRF
[545] aims to learn to reconstruct 3D objects from single-view images, with the use of an auto-decoder [496]. On the other hand, NeRF-VAE
[342] explores a VAE-based approach to learn to generate a NeRF with very few views of an unseen scene. Recently, Chan et al. [87] introduce
a diffusion-based approach for generating novel views from a single image, where the latent 3D feature field captures a distribution of scene
representations which is sampled from during inference.

15.3 Video-to-Dynamic Scene


Besides encoding static 3D scenes (with image or text input), another line of research aims to encode dynamic scenes, where the scene
can be changing with respect to time. This is sometimes also called 4D view synthesis, where a video is taken as input, and a dynamic
3D scene representation is obtained, where users can render images of the scene at any given time instance. Xian et al. [725] learn neural
irradiance fields (with no view dependency) while estimating the depth to constrain the scene’s geometry at any moment. D-NeRF [524],
NSFF [381] and Nerfies [498] aim to synthesize novel views of dynamic scenes with complex and non-rigid geometries from a monocular
video, through a dynamic NeRF representation that additionally takes the temporal aspect into account to deform the scene. NR-NeRF [664]
implements scene deformation as ray bending, which deforms straight rays non-rigidly. D2 NeRF [722] learns separate radiance fields for
the dynamic and static portions of the scene in a fully self-supervised manner, effectively decoupling the dynamic and static components.
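A key ingredient shared by several of these methods (e.g., D-NeRF and Nerfies) is a deformation field that warps a query point at time t into a canonical frame before evaluating a static radiance field. The sketch below only illustrates this query pattern; positional encodings, network depths, and the ray-marching loop are omitted, and the layer sizes are assumptions.

    import torch
    import torch.nn as nn

    class DynamicNeRF(nn.Module):
        """Sketch of a deformation-based dynamic NeRF query."""

        def __init__(self, hidden=128):
            super().__init__()
            # Deformation field: (x, t) -> offset into the canonical frame.
            self.deform = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, 3))
            # Canonical radiance field: (canonical point, view direction) -> (rgb, density).
            self.canonical = nn.Sequential(nn.Linear(6, hidden), nn.ReLU(), nn.Linear(hidden, 4))

        def forward(self, x, d, t):
            """x: (N, 3) points, d: (N, 3) view directions, t: (N, 1) timestamps."""
            dx = self.deform(torch.cat([x, t], dim=-1))   # time-dependent offset
            x_canon = x + dx                              # warp into the canonical space
            out = self.canonical(torch.cat([x_canon, d], dim=-1))
            rgb, sigma = torch.sigmoid(out[:, :3]), torch.relu(out[:, 3:])
            return rgb, sigma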

16 CROSS-MODALITY 3D HUMAN GENERATION


Besides cross-modality generation of 3D shapes and 3D scenes, cross-modality generation of 3D humans has also attracted much attention, where various other modalities are used to condition the generation process to produce the desired 3D human avatar. Similar to the single-modality case (in Sec. 8), cross-modality generation of 3D humans can also be categorized into avatar body methods and avatar head methods, and the various settings are illustrated in Fig. 7.

Fig. 7. Illustration of various cross-modality 3D human generation settings: image-to-3D and video-to-3D avatar body generation, as well as text-to-3D, image-to-3D and video-to-3D avatar head generation. Pictures obtained from [17, 289, 297, 808].

16.1 3D Avatar Body

Image-to-3D avatar. An earlier line of works focuses on recovering the human mesh from RGB images, including model-based methods [313, 503] that build upon human parametric models and model-free methods [118, 185, 388] that directly predict the human mesh. However, this line of works tends to overlook
the clothing aspect and requires additional modifications such as external garment
layers [47, 298], displacements [18, 830], or volumetric methods [678, 824]. More recently, PIFu [572] proposed to digitize clothed humans from
RGB images via implicit functions (an occupancy field). By aligning the 3D human with pixels of 2D images, PIFu is able to infer 3D surfaces
and textures and model intricate shapes such as clothing and hairstyles. PIFuHD [573] further improves the fidelity of PIFu. Geo-PIFu [248]
further improves the quality of reconstructed human meshes by learning latent voxel features which serve as a coarse human shape proxy.
The fidelity and quality of reconstructed clothed humans are further improved in more recent works (e.g., PaMIR [823], PHORUM [20], CAR
[384], 2K2K [240]).
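The pixel-aligned formulation underlying PIFu-style methods can be summarized as: project each 3D query point into the image, sample the image feature at that pixel, and predict occupancy from the sampled feature together with a depth cue. The sketch below illustrates only this query pattern; the stand-in encoder, MLP sizes, and the project callable (mapping camera-space points to normalized pixel coordinates) are assumptions for illustration, not the architecture of the cited works.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PixelAlignedOccupancy(nn.Module):
        """Sketch of a pixel-aligned implicit occupancy function."""

        def __init__(self, feat_dim=64):
            super().__init__()
            self.encoder = nn.Conv2d(3, feat_dim, kernel_size=7, padding=3)   # stand-in image encoder
            self.mlp = nn.Sequential(nn.Linear(feat_dim + 1, 128), nn.ReLU(), nn.Linear(128, 1))

        def forward(self, image, points, project):
            """image: (B, 3, H, W); points: (B, N, 3) camera-space queries;
            project: callable mapping 3D points to pixel coordinates in [-1, 1]."""
            feats = self.encoder(image)                                        # (B, C, H, W)
            uv = project(points)                                               # (B, N, 2)
            # Bilinearly sample the pixel-aligned feature for every query point.
            sampled = F.grid_sample(feats, uv.unsqueeze(2), align_corners=True)
            sampled = sampled.squeeze(-1).transpose(1, 2)                      # (B, N, C)
            z = points[..., 2:3]                                               # depth as the 3D cue
            return torch.sigmoid(self.mlp(torch.cat([sampled, z], dim=-1)))    # (B, N, 1) occupancy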


The methods listed above produce meshes or models corresponding to an RGB image, but they are not animation ready. To overcome this,
ARCH [283] learns to generate detailed 3D rigged human avatars from a single RGB image. Specifically, ARCH can perform animation of the
generated 3D human body by controlling its 3D skeleton pose. Arch++ [249] further improves the reconstruction quality with an end-to-end
point-based geometry encoder [526] and a mesh refinement strategy.
Video-to-3D avatar. Some other works take in an RGB video as input to generate a 3D avatar. This removes the need for posed 3D scans
or posed 2D images (for multi-view settings) while providing better performance than using single images only. Some earlier works are based
on human parametric models, and use video information to model the clothing and hair by learning the offsets to the mesh template (e.g.,
Alldieck et al. [16, 17], Octopus [15]).
ICON [731] produces an animatable avatar from an RGB video via an implicit SDF-based approach. Specifically, ICON first performs 3D
mesh recovery in a frame-wise manner to obtain the surface normals in each frame, and then combines them to form a clothed and
animatable avatar with realistic clothing deformations. SelfRecon [297] adopts a hybrid approach which optimizes both an explicit mesh and
SDF. SCARF [182] represents the body with a mesh (based on SMPL-X [502]) and the clothing with a NeRF, which offers animator control over
the avatar shape, pose, hand articulations, and body expressions, while allowing extraction and transfer of clothing. Recently, Vid2Avatar
[230] introduces a self-supervised way to reconstruct detailed 3D avatars from monocular videos in the wild, which does not require any
ground truth supervision or other priors extracted from large clothed human datasets. MonoHuman [789] introduces an animatable human
neural field from monocular video.
Others. Other works explore other types of inputs, e.g., thermal images [402] or depth images [695]. Another line of work also aims to
generate and edit personalized 3D avatars given a photograph (e.g., 3DAvatarGAN [1]).

16.2 3D Avatar Head


Apart from 3D avatar bodies, many works also focus on generating a 3D avatar head with conditioning information from other modalities.
Text-to-avatar head. Wu et al. [719] explore a text-to-3D face generation approach, which adopts CLIP to encode text information and
generates a 3D face with 3DMM representation. DreamFace [808] also leverages CLIP and DMs in the latent and image space to generate
animatable 3D faces. ClipFace [21] uses CLIP to perform text-guided editing of 3DMMs. StyleAvatar3D [797] aims to produce stylized avatars
from text inputs.
Image-to-avatar head. Earlier approaches in image-to-avatar head generation tend to be optimization-based methods, which aim to
fit a parametric face model to align with a given image [205, 562] or collection of images [565]. However, these methods tend to require
iterative optimization, which can be costly. On the other hand, many learning-based approaches have also been introduced, where a model
(often a deep neural network) directly regresses the 3D face model (usually based on 3DMM [50] or FLAME [375]) from input images. Some
earlier works introduce methods that require full supervision [351, 556, 557], while some methods explore weakly-supervised [149, 662] or
unsupervised settings [580, 590, 652] that do not require ground truth 3D data.
Another line of works produces personalized rigs for each generated face, which facilitates animation. Some earlier works facilitate
the facial animation by estimating a personalized set of blendshapes (e.g., a set containing different expressions) via deformation transfer
[74, 275, 630], which copies the deformations on a source mesh to a specified target mesh. More recently, some works [32, 93, 759] estimate the
blendshapes via a deep-learning approach. Notably, DECA [181] and ROME [327] propose approaches that produce a face with person-specific
details from a single image, and can be trained with only in-the-wild 2D images.
On the other hand, some works [271, 635, 717, 836] leverage an implicit NeRF representation to further increase the fidelity of generated
images from novel views, when given 2D images of a person.
Video-to-avatar head. Some works aim to generate the 3D avatar head from RGB video, by leveraging multiple unposed and uncurated
images to learn the 3D face structure. Earlier works tend to be optimization-based [192, 201, 202, 654]. Subsequently, some works explore
learning-based approaches [649, 651, 663, 821], which generalize better and can attain high speeds for avatar head reconstruction. Some
works [540] also explore implicit SDF representation for producing detailed geometry for the avatar heads.
In order to facilitate head animations, some works [289] also generate morphable head avatars from input videos. To further improve the
quality (such as modeling of hair), many works explore implicit representations such as SDFs [771] and NeRFs [193, 703, 820].

17 CROSS-MODALITY 3D MOTION GENERATION


Beyond generating 3D human avatars, many AIGC methods also generate the desired 3D motions of the human avatar, often using various
other modalities as input. The generated motions are often represented using 3D skeleton poses or 3D avatars, which we discuss separately
below.

17.1 Skeleton Pose


Since 3D skeleton pose is a simple yet effective representation of the human body (as introduced in Sec. 9), many works generate motions
using this representation.
Text-to-Motion. Some works focus on generating motions that correspond to an input text description. JL2P [11] learns a joint text-pose
embedding space via an autoencoder to map text to pose motions. Text2Action [10] introduces a GAN-based model to synthesize pose
sequences from language descriptions. To generate multiple sequential and compositional actions, Ghosh et al. [210] adopt a GAN-based
hierarchical two-stream approach. TEMOS [512] and T2M [231] employ a VAE to synthesize diverse motions. MotionCLIP [647] leverages the
rich semantic knowledge of CLIP [530] to align text with the synthesized motion. T2M-GPT [805] autoregressively generates motion tokens
via GPT and uses a VQ-VAE [677] to decode the tokens into motions, which can handle challenging text descriptions and generate high-quality
motion sequences. Recently, DMs have also been explored for text-to-motion synthesis (e.g., MDM [648], MotionDiffuse [809], MoFusion
[137]), which have been shown to be capable of generating varied motions with many vivid and fine details.
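To make the CLIP-alignment idea used by works such as MotionCLIP more concrete, the sketch below pairs a toy motion encoder with a cosine alignment loss against frozen CLIP text embeddings; the encoder architecture, pose dimensionality, and embedding size are illustrative assumptions rather than the design of any specific cited method.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MotionEncoder(nn.Module):
        """Toy motion encoder: a GRU over per-frame skeleton poses -> CLIP-sized embedding."""

        def __init__(self, pose_dim=66, clip_dim=512, hidden=256):
            super().__init__()
            self.gru = nn.GRU(pose_dim, hidden, batch_first=True)
            self.proj = nn.Linear(hidden, clip_dim)

        def forward(self, motion):                 # motion: (B, T, pose_dim)
            _, h = self.gru(motion)
            return self.proj(h[-1])                # (B, clip_dim)

    def clip_alignment_loss(motion_emb, text_emb):
        """Pull each motion embedding towards its (frozen) CLIP text embedding."""
        motion_emb = F.normalize(motion_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        return (1.0 - (motion_emb * text_emb).sum(dim=-1)).mean()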
Music-to-Motion. Another line of works aim to generate motion (e.g., dance) from music inputs. Traditional methods tend to learn the
statistical relationships between the music’s loudness, rhythm, beat and style and the dance motions [176, 333, 485]. Early deep learning
methods leverage the power of RNNs [14, 643, 747] to learn to generate dance motions with music input. To generate diverse plausible
motions, some works [354, 633] propose a GAN-based generator. However, GANs can be unstable to train, thus some methods [369, 374]
generate motions from music inputs via autoregressive Transformers with seed motion, which allows for generation of multiple motions
given the same input audio, even with a deterministic model. Aristidou et al. [23] introduce a motion generation framework that generates
long-term dance animations that are well aligned to the musical style and rhythm, which also enables high-level choreography control.
Recently, MoFusion [137] adopts DMs for motion generation, which are effective at producing long-term and semantically accurate motions. Besides,
some works [665] aim to edit the generated dance motions, or aim to generate the motions of pianists or violinists given piano or violin
music pieces [599].
Speech-to-body gesture. Another line of works condition the generation process on an input speech to generate gestures. This is also
known as the co-speech gesture synthesis task, and is useful to help a talking avatar act more vividly. Before the proliferation of deep
learning, many rule-based methods [81, 82, 438] were proposed, where speech-gesture pairs were manually defined. Subsequently, RNN-based
approaches were proposed to automatically learn the speech-to-gesture synthesis [246, 346, 776] from collected datasets. In order to produce
diverse motions, approaches based on VAEs [364, 772] and NFs [252, 769] have also been proposed. Moreover, to further improve the realism
of generated motions, GAN-based approaches [213, 407] were proposed. Recently, DMs have been explored for co-speech gesture synthesis
[834], which further improve the generation quality and avoid the training instability of GANs.

17.2 3D Avatar
Although 3D skeleton poses efficiently represent 3D human poses, they do not capture body shape and surface information. Therefore,
methods have been developed to directly generate motion based on a 3D avatar, which can be a 3D mesh or various implicit representations
such as dynamic NeRFs [25, 498]. This increases the fidelity and realism of the synthesized motions as compared to using skeleton pose only.
Text-to-motion of 3D avatar. Several works explore motion generation for 3D avatars conditioned on a text description. Some works
(e.g., CLIP-Actor [778], AvatarCLIP [269], MotionCLIP [647]) rely on the rich text and semantic knowledge of CLIP to generate animations
based on text input. TBGAN [72] also leverages CLIP with a GAN-based generator to generate facial animations.
Speech-to-Head Motions. Instead of generating full body motions, some works also focus on generating head and lip movements based
on the given speech. This is also known as “talking head” generation. Some earlier works explore RNNs [619] and CNNs [295, 708] to directly
synthesize videos with appropriate lip movements. Some works [100, 143, 522, 826] also employ GANs to generate the lip movements in
videos from audio more realistically.
Some works directly perform animation on the 3D avatar heads. Karras et al. [316] propose a CNN architecture to learn to map audio
inputs and an emotional state into animations of a 3D head mesh. Several works [135, 621, 639, 653] instead represent head movements
with a head parametric model (i.e., a 3DMM). Recently, talking heads have also been synthesized via implicit representations. AD-NeRF [236]
presents an approach to map the audio features to dynamic NeRFs, which is a more powerful and flexible representation that can freely
adjust the deformations of the head and model the fine details such as teeth and hair. Subsequent works [376, 767, 770] further improve the
fidelity and control over the head poses and expressions.

18 CROSS-MODALITY AUDIO GENERATION


While audio (e.g., speech and music) can be used to generate 3D motion in a cross-modality manner, the audio itself can also be generated
from other modalities (e.g., text), which we discuss below.

18.1 Text-to-Speech
A popular research direction explores the generation of speech conditioned on text. Tacotron [701] introduces an RNN-based end-to-end
trainable generative model that generates speech from characters, and is trained on audio-text pairs without phoneme-level alignment,
while using the Griffin-Lim reconstruction algorithm [223] to synthesize the final waveform to reduce artifacts. DeepVoice 3 [514] is a
fully convolutional architecture that is compatible with various vocoders (which convert compact audio representations into
audio waveforms) such as the Griffin-Lim [223], WORLD [467] and WaveNet [487] vocoders. Tacotron 2 [593] extends Tacotron and uses a
WaveNet-based vocoder to synthesize high-quality speech from a sequence of characters. Transformer-TTS [372] leverages a Transformer-
based architecture which effectively models long-term dependencies and improves training efficiency. FastSpeech [552] greatly speeds up the
audio synthesis through parallel generation of mel-spectrograms. GAN-TTS [49] introduces a GAN-based approach to produce speech, which
is more efficient than autoregressive approaches. Glow-TTS [330] combines the properties of a flow-based generative method [336] and
dynamic programming to learn its own alignment between speech and text via a technique called monotonic alignment search. Grad-TTS
[521] adopts a DM for text-to-speech synthesis, which generates high-quality mel-spectrograms and also aligns text and speech through
monotonic alignment search, and provides control to trade-off the quality of the mel-spectrogram with inference speed. As DMs are more
stable and simpler to train than GANs while providing high-quality generated audio, many recent works adopt them for text-to-speech
synthesis, e.g., BDDM [350], ProDiff [280], Diffsound [756], and DiffSinger [393].
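As a concrete illustration of the vocoding step mentioned above (converting a compact spectrogram representation back into a waveform), below is a minimal Griffin-Lim sketch using librosa; the mel parameters are illustrative, the API details may vary across librosa versions, and the cited systems generally rely on neural vocoders for higher quality.

    import librosa

    def mel_to_waveform(mel_db, sr=22050, n_fft=1024, hop_length=256, n_iter=60):
        """Invert a log-mel spectrogram (n_mels, T), given in dB, back to audio.

        The mel filterbank is approximately inverted to a linear spectrogram,
        and the missing phase is estimated iteratively with Griffin-Lim.
        """
        mel_power = librosa.db_to_power(mel_db)
        return librosa.feature.inverse.mel_to_audio(
            mel_power, sr=sr, n_fft=n_fft, hop_length=hop_length, n_iter=n_iter)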

18.2 Text-to-Music
Another direction also aims to generate music from text inputs. Jukebox [153] is an earlier work that allows generation of a singing voice
conditioned on lyrics. Recent works focus more on music generation based on provided text descriptions. MusicLM [8] relies on the joint
text-audio embedding space of MuLan [278] to produce audio representations from text input. More recently, Noise2music [279] and Moûsai
[582] adopt DMs with Transformer-based text encoders, which further improves performance.

19 DATASETS
Many datasets have been created to train AIGC methods for various data modalities. We summarize a list of the representative single-modality and cross-modality datasets in Tab. 9 and Tab. 10 respectively.

Table 9. Some representative benchmark datasets for AIGC throughout various data modalities. We report the total number of samples in the full dataset, as well as the number of classes (if available). Note that, for the large web-crawled text datasets, we follow [535, 659] to report the dataset size on disk, since it can be hard to compare between them (as each sample can be a sentence, a paragraph, or a document).

Dataset | Year | Modality | #Sample | #Class
Cats [811] | 2008 | image | 10K | -
CIFAR-10 [345] | 2009 | image | 60K | 10
ImageNet [568] | 2009 | image | 1.4M | 1000
LSUN Cat [781] | 2015 | image | 1.6M | -
LSUN Bedroom [781] | 2015 | image | 3.0M | -
LSUN Horse [781] | 2015 | image | 2.0M | -
LSUN Churches [781] | 2015 | image | 126K | -
Summer ↔ Winter [832] | 2017 | image | 2K | 2
Photo ↔ Art [832] | 2017 | image | 10K | 5
CelebA-HQ [317] | 2018 | image | 30K | -
FFHQ [320] | 2019 | image | 70K | -
AFHQ [121] | 2020 | image | 15K | -
JFT-300M [631] | 2017 | image | 300M | 18291
UCF-101 [627] | 2012 | video | 13K | 101
Moving MNIST [628] | 2015 | video | 10K | -
Cityscapes [134] | 2016 | video | 3K | -
Youtube-8M [5] | 2016 | video | 8M | 4800
Robotic pushing prediction [183] | 2016 | video | 57K | -
BAIR robot pushing [171] | 2017 | video | 45K | -
Kinetics-600 [79] | 2017 | video | 490K | 600
Sky Timelapse [730] | 2018 | video | 38K | -
FaceForensics [564] | 2018 | video | 1K | -
Taichi-HD [602] | 2019 | video | 3K | -
SNLI [59] | 2015 | text | 570K | 3
LAMBADA [495] | 2016 | text | 10K | -
SQuAD [537] | 2016 | text | 107K | -
TriviaQA [309] | 2017 | text | 174K | -
OpenBookQA [457] | 2018 | text | 5K | -
ARC-e [131] | 2018 | text | 5K | -
ARC-c [131] | 2018 | text | 2K | -
NaturalQuestions [349] | 2019 | text | 315K | -
CoQA [547] | 2019 | text | 127K | -
BoolQ [130] | 2019 | text | 12K | -
GLUE [690] | 2019 | text | 1.4M | -
SuperGLUE [689] | 2019 | text | 195K | -
HumanEval [101] | 2021 | text | 164 | -
WinoGrande [575] | 2021 | text | 44K | -
Common Crawl¹ [659] | 2017-2020 | text | 3.3TB | -
C4 [535] | 2020 | text | 783GB | -
English Wikipedia² | 2002-now | text | 20GB | -
The Pile [198] | 2020 | text | 825GB | -
ShapeNet [91] | 2015 | 3D shape | 3M | 3135
ShapeNet Core [91] | 2015 | 3D shape | 51K | 13
ShapeNet Chair [91] | 2015 | 3D shape | 6778 | -
ShapeNet Airplane [91] | 2015 | 3D shape | 4045 | -
ShapeNet Car [91] | 2015 | 3D shape | 7497 | -
ModelNet [723] | 2015 | 3D shape | 48K | 660
CelebA [410] | 2015 | scene views | 200K | 10K identities
CompCars [763] | 2015 | scene views | 136K | 1716 car models
CARLA [166, 585] | 2020 | scene views | 10K | 16 car models
D-Faust [55] | 2017 | 3D human sequences | 129 sequences | 10 subjects
CAPE [425] | 2020 | 3D clothed human sequences | >600 sequences | 15 subjects
FaceScape [759] | 2020 | 3D face | 18K | 938 subjects
DeepFashion [409] | 2016 | clothes image | 800K | 50
CelebAMask-HQ [352] | 2020 | face image | 30K | 19
SHHQ [190] | 2022 | full body human image | 230K | -
CMU-Mocap³ [363] | 2018 | 3D motion | 249 | 8
AMASS [430] | 2019 | 3D motion | 11K | 344 subjects
HumanAct12 [232] | 2020 | 3D motion | 1K | 12
Speech Commands [705] | 2018 | audio | 100K | 35

¹ https://commoncrawl.org/
² https://dumps.wikimedia.org/
³ http://mocap.cs.cmu.edu/

20 CHALLENGES AND DISCUSSION

Despite the rapid progress of AIGC methods, there are still many challenges faced by researchers and users. We explore some of these issues and challenges below.
Deepfakes. AIGC has the potential to be misused to create deepfakes, e.g., synthetically generated videos that look like real videos, which can be used to spread disinformation and fake news [669]. With the significant improvements in AIGC, where the generated outputs are of high quality and fidelity, it is getting more difficult to detect these synthetic media (e.g., images and videos) by the naked eye. Thus, several algorithms have been proposed to detect them, especially for GAN-based CNN generators [88, 89, 170]. With the rapid development of more recent architectures (e.g., Transformers) and techniques (e.g., DMs), how to detect emerging deepfakes deserves further investigation.
Adversarial Attacks. AIGC methods can also be vulnerable to adversarial attacks. For instance, they can be vulnerable to backdoor attacks, where the generated output can be manipulated by a malicious input trigger signal. Specifically, these backdoor attacks can generate a pre-defined target image or an image from a specified class when the trigger pattern is observed. Adversaries can insert such backdoors into models by training them on poisoned data before making them publicly available, where users might unknowingly use these backdoored models in their applications, which can be risky. Some works investigate such backdoor attacks on GANs [541, 576] and DMs [107, 122]. Some defensive measures have been proposed [379, 541, 713], but there still exists room for improvement in terms of defense efficacy while maintaining good generative performance.
Privacy Issues. There are also privacy issues with AIGC methods, since many models (e.g., chatbots such as ChatGPT)
often handle sensitive user data, and such sensitive data can
be collected to further train the model. Such training data can
potentially be memorized by large models [77] which can be leaked by the model [76, 78], causing privacy concerns. How to mitigate such
privacy issues in AIGC largely remains an open problem.
Legal Issues. There are also potential legal issues with AIGC, especially with regards to the intellectual property rights of generated
outputs. For instance, recent text-to-image DMs [538, 561] are often trained on extremely large datasets with billions of text-caption pairs
that are obtained from the web. Due to the size of the dataset, it is often difficult to carefully curate and check for the intellectual property
rights of each image. Furthermore, due to the memorization ability of large models [76, 77], they might directly replicate images from the
training set, which might lead to copyright infringement. To overcome this, some works [538] employ pre-training mitigations [476] such as
deduplicating the training data, while some works investigate training factors to encourage diffusion models to generate novel images [618].
Further investigations in this direction are required for more effective and efficient ways of tackling this issue.
Table 10. Some representative datasets for AIGC throughout various cross-modality settings. We report the total number of samples in the full dataset, as well as the number of classes (if available) or subjects. Moreover, if the number of annotations for each modality is different, we list the number of annotations for each modality (separated by a comma).

Dataset | Year | Modalities | #Sample | #Class
MS-COCO [109, 390] | 2014 | image,text | 123K,616K | 80
CUB [548, 686] | 2016 | image,text | 11K,110K | 200
Oxford-102 [481, 548] | 2016 | image,text | 8K,80K | 102
Conceptual Captions [591] | 2018 | image,text | 3.3M | -
LN-COCO [519] | 2020 | image,text | 142K | -
LAION-400M [583] | 2021 | image,text | 400M | -
Conceptual 12M [92] | 2021 | image,text | 12.4M | -
MSVD [95] | 2011 | video,text | 2K,85K | -
MSR-VTT [734] | 2016 | video,text | 10K,200K | -
KTH [441, 584] | 2017 | video,text | 2391 | 6
ANet Captions [344] | 2017 | video,text | 100K | -
HowTo100M [455] | 2019 | video,text | 136M | -
WebVid-10M [34] | 2021 | video,text | 10.7M | -
PASCAL3D+ [726] | 2014 | 3D shape,image | 36K | 12
ShapeNet (3D-R2N2) [124] | 2016 | 3D shape,image | 51K | 13
Pix3D [637] | 2018 | 3D shape,image | 10K,395 | -
CO3Dv2 [551] | 2021 | 3D shape,video | 19K | 50
Text2Shape [99] | 2018 | 3D shape,text | 15K,75K | -
Text2Shape++ [191] | 2022 | 3D shape,text | 369K | -
Matterport3D [90] | 2017 | 3D scene,image | 90 | 40
ScanNet v2 [138] | 2017 | 3D scene,video | 1513 | 20
BU-3DFE [773] | 2006 | 3D face,image | 2.5K | 100 subjects
FaceWarehouse [73] | 2013 | 3D face,image | 150 | 47 facial expressions
VoxCeleb [469] | 2017 | face video,audio | 153K | 1251 subjects
VoxCeleb2 [126] | 2018 | face video,audio | 1.1M | 6K
DeepHuman [824] | 2019 | 3D human,image | 7K | -
People-Snapshot [17] | 2018 | 3D human,video | 24 sequences | 11 subjects
3DPW [683] | 2018 | 3D human,video | 51K frames | -
MPI-INF-3DHP [445] | 2017 | 3D human,video | 1.3M frames | 8 activities
TED Gesture [775] | 2020 | 3D human,video,audio,text | 1766 videos | -
DeepFashion-MultiModal [305] | 2022 | human images,text | 11K | 24
KIT [516] | 2016 | 3D motion,text | 3K,6K | -
HumanML3D [231] | 2022 | 3D motion,text | 14K,44K | -
Human3.6M [291] | 2013 | 3D motion,video | 3.6M frames | 15
NTU RGB+D [589] | 2016 | 3D motion,video | 56K | 60
NTU RGB+D 120 [394] | 2019 | 3D motion,video | 114K | 120
AIST++ [374] | 2021 | 3D motion,music | 1408 sequences | -
LJ Speech [293] | 2017 | audio,text | 13K | -
Audio Set [206] | 2017 | audio,text | 1.7M | 632
LibriTTS [792] | 2019 | audio,text | 585 hours | 2456 speakers
GRID [133] | 2006 | audio,video | 34K sentences | 34 subjects
LRW [127] | 2017 | audio,video | 1M words | 1K subjects
Music Caps [8] | 2023 | music,text | 5.5K | -

Domain-specialized Applications. A potential challenge for AIGC methods is their application to specialized domains which might have less data, such as medical areas [482] or law [727]. When using large language models such as ChatGPT, they can also hallucinate and make factual errors [296], which can be problematic in these specialized domains. A promising direction is to find ways to apply domain-specific knowledge to reduce and mitigate these errors when large language models are applied to specific domains [9], or to enhance the factuality of the generated text using a text corpus [357].
Multi-modal Combinations. Generative models with multi-modal inputs are very useful as they can gain understanding of the context through different modalities, and can better capture the complexity of the real world. Such multi-modal models are important for many practical applications such as robotics [167] and dialogue systems [488]. However, when we aim to have several modalities as input, it can be difficult to curate large-scale datasets with all modalities provided for each sample, which makes training such models challenging. To overcome this, one line of works [667] allows for multiple input modalities by having multiple pre-trained DMs, each handling a different input modality, collaborate during generation. Further advancements in this domain are necessary to devise more efficient strategies for attaining multi-modality.
Updating of Knowledge. As human knowledge, pop culture, and word usage keep evolving, large pre-trained AIGC models should also update their knowledge accordingly. Importantly, such updating of knowledge should be done in an efficient and effective manner, without requiring a total re-training from scratch with all the previous and updated data, which can be very costly. To learn new information, some works perform continual learning
[679, 794], where models can learn new information in the absence of the previous training data, without suffering from catastrophic
forgetting. Another direction of model editing [448, 460] aims to inject updated knowledge and erase incorrect knowledge or undesirable
behaviours, e.g., editing factual associations in language models [448]. However, as pointed out in [460], existing methods are limited in
their enforcing of the locality of edits, and also do not assess the edits in terms of their generality (i.e., editing of indirect associations and
implications as well). Thus, there are many opportunities for further investigations into this area.

21 CONCLUSION
AIGC is an important topic that has garnered significant research attention across diverse modalities, each presenting different characteristics
and challenges. In this survey, we have comprehensively reviewed AIGC methods across different data modalities, including single-modality
and cross-modality methods. Moreover, we have organized these methods based on the conditioning input information, affording a structured
and standardized overview of the landscape in terms of input modalities. We highlight the various challenges, representative works, and
recent technical directions in each setting. Furthermore, we make performance comparisons between representative works, and also review
the representative datasets and benchmarks throughout the modalities. We also discuss the challenges and potential future research directions.

REFERENCES
[1] Rameen Abdal, Hsin-Ying Lee, Peihao Zhu, et al. 2023. 3DAvatarGAN: Bridging Domains for Personalized Editable Avatars. In CVPR.
[2] Rameen Abdal, Yipeng Qin, and Peter Wonka. 2019. Image2stylegan: How to embed images into the stylegan latent space?. In ICCV.
[3] Rameen Abdal, Yipeng Qin, and Peter Wonka. 2020. Image2stylegan++: How to edit the embedded images?. In CVPR.
[4] Milad Abdollahzadeh, Touba Malekzadeh, et al. 2023. A Survey on Generative Modeling with Limited Data, Few Shots, and Zero Shot. arXiv (2023).
[5] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, et al. 2016. Youtube-8m: A large-scale video classification benchmark. arXiv (2016).
[6] Panos Achlioptas, Olga Diamanti, Ioannis Mitliagkas, et al. 2018. Learning representations and generative models for 3d point clouds. In ICML.
[7] Parag Agarwal and Balakrishnan Prabhakaran. 2009. Robust blind watermarking of point-sampled geometry. TIFS (2009).
[8] Andrea Agostinelli, Timo I Denk, Zalán Borsos, et al. 2023. Musiclm: Generating music from text. arXiv (2023).
[9] Monica Agrawal, Stefan Hegselmann, Hunter Lang, et al. 2022. Large language models are few-shot clinical information extractors. In EMNLP.
[10] Hyemin Ahn, Timothy Ha, Yunho Choi, et al. 2018. Text2action: Generative adversarial synthesis from language to action. In ICRA.
[11] Chaitanya Ahuja and Louis-Philippe Morency. 2019. Language2pose: Natural language grounded pose forecasting. In 3DV.
[12] Emre Aksan, Manuel Kaufmann, and Otmar Hilliges. 2019. Structured prediction helps 3d human motion modelling. In ICCV.
[13] Yuval Alaluf, Omer Tov, Ron Mokady, et al. 2022. Hyperstyle: Stylegan inversion with hypernetworks for real image editing. In CVPR.
[14] Omid Alemi, Jules Françoise, et al. 2017. GrooveNet: Real-time music-driven dance movement generation using artificial neural networks. Networks (2017).
[15] Thiemo Alldieck, Marcus Magnor, Bharat Lal Bhatnagar, et al. 2019. Learning to reconstruct people in clothing from a single RGB camera. In CVPR.
[16] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, et al. 2018. Detailed human avatars from monocular video. In 3DV.
[17] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, et al. 2018. Video based reconstruction of 3d people models. In CVPR.
[18] Thiemo Alldieck, Gerard Pons-Moll, Christian Theobalt, et al. 2019. Tex2shape: Detailed full human body geometry from a single image. In ICCV.
[19] Thiemo Alldieck, Hongyi Xu, and Cristian Sminchisescu. 2021. imghum: Implicit generative models of 3d human shape and articulated pose. In ICCV.
[20] Thiemo Alldieck, Mihai Zanfir, and Cristian Sminchisescu. 2022. Photorealistic monocular 3d reconstruction of humans wearing clothing. In CVPR.
[21] Shivangi Aneja, Justus Thies, Angela Dai, et al. 2022. Clipface: Text-guided editing of textured 3d morphable models. arXiv (2022).
[22] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, et al. 2005. Scape: shape completion and animation of people. In SIGGRAPH.
[23] Andreas Aristidou, Anastasios Yiannakidis, et al. 2022. Rhythm is a dancer: Music-driven motion synthesis with global structure. TVCG (2022).
[24] Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein generative adversarial networks. In ICML.
[25] ShahRukh Athar, Zexiang Xu, Kalyan Sunkavalli, et al. 2022. Rignerf: Fully controllable neural 3d portraits. In CVPR.
[26] Matan Atzmon, Niv Haim, Lior Yariv, et al. 2019. Controlling neural level sets. NeurIPS (2019).
[27] Matan Atzmon and Yaron Lipman. 2020. Sal: Sign agnostic learning of shapes from raw data. In CVPR.
[28] Jacob Austin, Daniel D Johnson, Jonathan Ho, et al. 2021. Structured denoising diffusion models in discrete state-spaces. NeurIPS (2021).
[29] Omri Avrahami, Thomas Hayes, Oran Gafni, et al. 2023. Spatext: Spatio-textual representation for controllable image generation. In CVPR.
[30] Omri Avrahami, Dani Lischinski, and Ohad Fried. 2022. Blended diffusion for text-driven editing of natural images. In CVPR.
[31] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, et al. 2018. Stochastic Variational Video Prediction. In ICLR.
[32] Ziqian Bai, Zhaopeng Cui, Xiaoming Liu, et al. 2021. Riggable 3d face reconstruction via in-network optimization. In CVPR.
[33] David Baidoo-Anu and Leticia Owusu Ansah. 2023. Education in the era of generative artificial intelligence (AI): Understanding the potential benefits of
ChatGPT in promoting teaching and learning. SSRN 4337484 (2023).
[34] Max Bain, Arsha Nagrani, Gül Varol, et al. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV.
[35] Yogesh Balaji, Martin Renqiang Min, Bing Bai, et al. 2019. Conditional GAN with Discriminative Filter Generation for Text-to-Video Synthesis. In IJCAI.
[36] Yogesh Balaji, Seungjun Nah, Xun Huang, et al. 2022. ediffi: Text-to-image diffusion models with an ensemble of expert denoisers. arXiv (2022).
[37] Fan Bao, Chongxuan Li, et al. 2022. Analytic-DPM: an Analytic Estimate of the Optimal Reverse Variance in Diffusion Probabilistic Models. In ICLR.
[38] Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, et al. 2022. Text2live: Text-driven layered image and video editing. In ECCV.
[39] Paul Barham, Aakanksha Chowdhery, Jeff Dean, et al. 2022. Pathways: Asynchronous distributed dataflow for ml. MLSys (2022).
[40] Jonathan T Barron, Ben Mildenhall, Matthew Tancik, et al. 2021. Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In ICCV.
[41] Jonathan T Barron, Ben Mildenhall, Dor Verbin, et al. 2022. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In CVPR.
[42] Emad Barsoum, John Kender, and Zicheng Liu. 2018. Hp-gan: Probabilistic 3d human motion prediction via gan. In CVPR Workshops.
[43] Jens Behrmann, Will Grathwohl, Ricky TQ Chen, et al. 2019. Invertible residual networks. In ICML.
[44] Roman Beliy, Guy Gaziv, et al. 2019. From voxels to pixels and back: Self-supervision in natural-image reconstruction from fMRI. NeurIPS (2019).
[45] Alexander Bergman, Petr Kellnhofer, Wang Yifan, et al. 2022. Generative neural articulated radiance fields. NeurIPS (2022).
[46] David Berthelot, Thomas Schumm, and Luke Metz. 2017. Began: Boundary equilibrium generative adversarial networks. arXiv (2017).
[47] Bharat Lal Bhatnagar, Garvita Tiwari, Christian Theobalt, et al. 2019. Multi-garment net: Learning to dress 3d people from images. In ICCV.
[48] Marin Biloš, Johanna Sommer, Syama Sundar Rangapuram, et al. 2021. Neural flows: Efficient alternative to neural ODEs. NeurIPS (2021).
[49] Mikołaj Bińkowski, Jeff Donahue, Sander Dieleman, et al. 2020. High Fidelity Speech Synthesis with Adversarial Networks. In ICLR.
[50] Volker Blanz and Thomas Vetter. 1999. A morphable model for the synthesis of 3D faces. In SIGGRAPH.
[51] Andreas Blattmann, Timo Milbich, Michael Dorkenwald, et al. 2021. ipoke: Poking a still image for controlled stochastic video synthesis. In ICCV.
[52] Andreas Blattmann, Timo Milbich, Michael Dorkenwald, et al. 2021. Understanding object dynamics for interactive image-to-video synthesis. In CVPR.
[53] Andreas Blattmann, Robin Rombach, Huan Ling, et al. 2023. Align your latents: High-resolution video synthesis with latent diffusion models. In CVPR.
[54] Andreas Blattmann, Robin Rombach, Kaan Oktay, et al. 2022. Retrieval-augmented diffusion models. NeurIPS (2022).
[55] Federica Bogo, Javier Romero, Gerard Pons-Moll, et al. 2017. Dynamic FAUST: Registering human bodies in motion. In CVPR.
[56] Alexey Bokhovkin, Shubham Tulsiani, and Angela Dai. 2023. Mesh2Tex: Generating Mesh Textures from Image Queries. arXiv (2023).
[57] Zalán Borsos, Raphaël Marinier, Damien Vincent, et al. 2023. Audiolm: a language modeling approach to audio generation. TASLP (2023).
[58] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. 2012. Modeling temporal dependencies in high-dimensional sequences: Application
to polyphonic music generation and transcription. arXiv (2012).
[59] Samuel Bowman, Gabor Angeli, Christopher Potts, et al. 2015. A large annotated corpus for learning natural language inference. In EMNLP.
[60] Samuel Bowman, Luke Vilnis, Oriol Vinyals, et al. 2016. Generating Sentences from a Continuous Space. In SIGNLL.
[61] Andrew Brock, Jeff Donahue, Karen Simonyan, et al. 2019. Large Scale GAN Training for High Fidelity Natural Image Synthesis. In ICLR.
[62] Tim Brooks, Janne Hellsten, Miika Aittala, et al. 2022. Generating long videos of dynamic scenes. NeurIPS (2022).
[63] Tom Brown, Benjamin Mann, Nick Ryder, et al. 2020. Language models are few-shot learners. NeurIPS (2020).
[64] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. 2015. Importance weighted autoencoders. In ICLR.
[65] Christopher P Burgess, Irina Higgins, Arka Pal, et al. 2018. Understanding disentangling in beta-VAE. arXiv (2018).
[66] Massimo Caccia, Lucas Caccia, William Fedus, et al. 2020. Language GANs Falling Short. In ICLR.
[67] Ruojin Cai, Guandao Yang, Hadar Averbuch-Elor, et al. 2020. Learning gradient fields for shape generation. In ECCV.
[68] Shengqu Cai, Anton Obukhov, et al. 2022. Pix2nerf: Unsupervised conditional p-gan for single image to neural radiance fields translation. In CVPR.
[69] Yujun Cai, Lin Huang, Yiwei Wang, Tat-Jen Cham, Jianfei Cai, Junsong Yuan, Jun Liu, et al. 2020. Learning progressive joint propagation for human motion
prediction. In ECCV.
[70] Yujun Cai, Yiwei Wang, Yiheng Zhu, Tat-Jen Cham, Jianfei Cai, Junsong Yuan, Jun Liu, et al. 2021. A unified 3d human motion synthesis model via
conditional variational auto-encoder. In ICCV.
[71] Elena Camuffo, Daniele Mari, et al. 2022. Recent advancements in learning algorithms for point clouds: An updated overview. Sensors (2022).
[72] Zehranaz Canfes, M Furkan Atasoy, Alara Dirik, et al. 2023. Text and image guided 3d avatar generation and manipulation. In WACV.
[73] Chen Cao, Yanlin Weng, Shun Zhou, et al. 2013. Facewarehouse: A 3d facial expression database for visual computing. TVCG (2013).
[74] Chen Cao, Hongzhi Wu, Yanlin Weng, et al. 2016. Real-time facial animation with image-based dynamic avatars. ToG (2016).
[75] Yihan Cao, Siyu Li, et al. 2023. A comprehensive survey of ai-generated content (aigc): A history of generative ai from gan to chatgpt. arXiv (2023).
[76] Nicholas Carlini, Jamie Hayes, Milad Nasr, et al. 2023. Extracting training data from diffusion models. arXiv (2023).
[77] Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, et al. 2023. Quantifying Memorization Across Neural Language Models. In ICLR.
[78] Nicholas Carlini, Florian Tramer, Eric Wallace, et al. 2021. Extracting training data from large language models. In USENIX Security.
[79] Joao Carreira, Eric Noland, Andras Banki-Horvath, et al. 2018. A short note about kinetics-600. arXiv (2018).
[80] Arantxa Casanova, Marlene Careil, Jakob Verbeek, et al. 2021. Instance-conditioned gan. NeurIPS (2021).
[81] Justine Cassell, Catherine Pelachaud, Norman Badler, et al. 1994. Animated conversation: rule-based generation of facial expression, gesture & spoken
intonation for multiple conversational agents. In SIGGRAPH.
[82] Justine Cassell, Hannes Högni Vilhjálmsson, and Timothy Bickmore. 2001. Beat: the behavior expression animation toolkit. In SIGGRAPH.
[83] Shang Chai, Liansheng Zhuang, and Fengying Yan. 2023. LayoutDM: Transformer-based Diffusion Model for Layout Generation. In CVPR.
[84] Caroline Chan, Shiry Ginosar, Tinghui Zhou, et al. 2019. Everybody dance now. In ICCV.
[85] Eric R Chan, Connor Z Lin, Matthew A Chan, et al. 2022. Efficient geometry-aware 3D generative adversarial networks. In CVPR.
[86] Eric R Chan, Marco Monteiro, Petr Kellnhofer, et al. 2021. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In CVPR.
[87] Eric R Chan, Koki Nagano, Matthew A Chan, et al. 2023. Generative novel view synthesis with 3d-aware diffusion models. arXiv (2023).
[88] Keshigeyan Chandrasegaran, Ngoc-Trung Tran, et al. 2021. A closer look at fourier spectrum discrepancies for cnn-generated images detection. In CVPR.
[89] Keshigeyan Chandrasegaran, Ngoc-Trung Tran, et al. 2022. Discovering transferable forensic features for cnn-generated images detection. In ECCV.
[90] Angel Chang, Angela Dai, Thomas Funkhouser, et al. 2017. Matterport3D: Learning from RGB-D Data in Indoor Environments. In 3DV.
[91] Angel X Chang, Thomas Funkhouser, Leonidas Guibas, et al. 2015. Shapenet: An information-rich 3d model repository. arXiv (2015).
[92] Soravit Changpinyo et al. 2021. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR.
[93] Bindita Chaudhuri, Noranart Vesdapunt, et al. 2020. Personalized face modeling for improved face reconstruction and motion retargeting. In ECCV.
[94] Anpei Chen, Zexiang Xu, Andreas Geiger, et al. 2022. Tensorf: Tensorial radiance fields. In ECCV.
[95] David Chen and William B Dolan. 2011. Collecting highly parallel data for paraphrase evaluation. In HLT.
[96] Ding-Yun Chen, Xiao-Pei Tian, Yu-Te Shen, et al. 2003. On visual similarity based 3D model retrieval. In CGFORUM.
[97] Dave Zhenyu Chen, Yawar Siddiqui, Hsin-Ying Lee, et al. 2023. Text2tex: Text-driven texture synthesis via diffusion models. arXiv (2023).
[98] Jiawen Chen, Dennis Bautembach, and Shahram Izadi. 2013. Scalable real-time volumetric surface reconstruction. ToG (2013).
[99] Kevin Chen, Christopher B Choy, et al. 2019. Text2shape: Generating shapes from natural language by learning joint embeddings. In ACCV.
[100] Lele Chen, Ross K Maddox, Zhiyao Duan, et al. 2019. Hierarchical cross-modal talking face generation with dynamic pixel-wise loss. In CVPR.
[101] Mark Chen, Jerry Tworek, Heewoo Jun, et al. 2021. Evaluating large language models trained on code. arXiv (2021).
[102] Nanxin Chen, Yu Zhang, Heiga Zen, et al. 2021. WaveGrad: Estimating Gradients for Waveform Generation. In ICLR.
[103] Ricky TQ Chen, Jens Behrmann, David K Duvenaud, et al. 2019. Residual flows for invertible generative modeling. NeurIPS (2019).
[104] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt, et al. 2018. Neural ordinary differential equations. NeurIPS (2018).
[105] Shu-Ching Chen. 2022. Multimedia research toward the metaverse. IEEE MultiMed. (2022).
[106] Wengling Chen and James Hays. 2018. Sketchygan: Towards diverse and realistic sketch to image synthesis. In CVPR.
[107] Weixin Chen, Dawn Song, and Bo Li. 2023. Trojdiff: Trojan attacks on diffusion models with diverse targets. In CVPR.
[108] Xi Chen, Yan Duan, et al. 2016. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. NeurIPS (2016).
[109] Xinlei Chen, Hao Fang, Tsung-Yi Lin, et al. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv (2015).
[110] Xu Chen, Tianjian Jiang, Jie Song, et al. 2022. gdna: Towards generative detailed neural avatars. In CVPR.
[111] Xi Chen, Diederik P Kingma, Tim Salimans, et al. 2016. Variational Lossy Autoencoder. In ICLR.
[112] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. 2023. Teaching large language models to self-debug. arXiv (2023).
[113] Xu Chen, Yufeng Zheng, Michael J Black, et al. 2021. Snarf: Differentiable forward skinning for animating non-rigid neural implicit shapes. In ICCV.
[114] Zhiqin Chen, Kangxue Yin, and Sanja Fidler. 2022. Auv-net: Learning aligned uv maps for texture transfer and synthesis. In CVPR.
[115] Zhiqin Chen and Hao Zhang. 2019. Learning implicit fields for generative shape modeling. In CVPR.
[116] Shiyang Cheng, Michael Bronstein, Yuxiang Zhou, et al. 2019. Meshgan: Non-linear 3d morphable models of faces. arXiv (2019).
[117] Yen-Chi Cheng, Hsin-Ying Lee, Sergey Tulyakov, et al. 2023. Sdfusion: Multimodal 3d shape completion, reconstruction, and generation. In CVPR.
[118] Hongsuk Choi et al. 2020. Pose2mesh: Graph convolutional network for 3d human pose and mesh recovery from a 2d human pose. In ECCV.
[119] Jooyoung Choi, Sungwon Kim, Yonghyun Jeong, et al. 2021. ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models. In ICCV.
[120] Yunjey Choi, Minje Choi, et al. 2018. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR.
[121] Yunjey Choi, Youngjung Uh, Jaeju Yoo, et al. 2020. Stargan v2: Diverse image synthesis for multiple domains. In CVPR.
[122] Sheng-Yen Chou, Pin-Yu Chen, and Tsung-Yi Ho. 2023. How to backdoor diffusion models?. In CVPR.
[123] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. 2022. Palm: Scaling language modeling with pathways. arXiv (2022).
[124] Christopher B Choy, Danfei Xu, JunYoung Gwak, et al. 2016. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In ECCV.
[125] Hang Chu, Raquel Urtasun, and Sanja Fidler. 2016. Song from PI: A musically plausible network for pop music generation. arXiv (2016).
[126] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. 2018. VoxCeleb2: Deep speaker recognition. Proc. Interspeech (2018).
[127] Joon Son Chung and Andrew Zisserman. 2017. Lip reading in the wild. In ACCV.
[128] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, et al. 2016. 3D U-Net: learning dense volumetric segmentation from sparse annotation. In MICCAI.
[129] Aidan Clark, Jeff Donahue, and Karen Simonyan. 2019. Adversarial video generation on complex datasets. arXiv (2019).
[130] Christopher Clark, Kenton Lee, Ming-Wei Chang, et al. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In NAACL.
[131] Peter Clark, Isaac Cowhey, Oren Etzioni, et al. 2018. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv (2018).
[132] Dana Cohen-Bar, Elad Richardson, Gal Metzer, et al. 2023. Set-the-Scene: Global-Local Training for Generating Controllable NeRF Scenes. arXiv (2023).
[133] Martin Cooke, Jon Barker, Stuart Cunningham, et al. 2006. An audio-visual corpus for speech perception and automatic speech recognition. JASA (2006).
[134] Marius Cordts, Mohamed Omran, Sebastian Ramos, et al. 2016. The cityscapes dataset for semantic urban scene understanding. In CVPR.
[135] Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, et al. 2019. Capture, learning, and synthesis of 3D speaking styles. In CVPR.
[136] Brian Curless and Marc Levoy. 1996. A volumetric method for building complex models from range images. In SIGGRAPH.
[137] Rishabh Dabral, Muhammad Hamza Mughal, et al. 2023. Mofusion: A framework for denoising-diffusion-based motion synthesis. In CVPR.
[138] Angela Dai, Angel X Chang, Manolis Savva, et al. 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR.
[139] Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. 2017. Shape completion using 3d-encoder-predictor cnns and shape synthesis. In CVPR.
[140] Hanjun Dai, Azade Nazi, Yujia Li, et al. 2020. Scalable deep generative modeling for sparse graphs. In ICML.
[141] Zihang Dai, Zhilin Yang, Yiming Yang, et al. 2019. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In ACL.
[142] Lingwei Dang, Yongwei Nie, et al. 2021. MSR-GCN: Multi-Scale Residual Graph Convolution Networks for Human Motion Prediction. In ICCV.
[143] Dipanjan Das, Sandika Biswas, Sanjana Sinha, et al. 2020. Speech-driven facial animation using cascaded gans for learning of motion and texture. In ECCV.
[144] Nicola De Cao and Thomas Kipf. 2018. MolGAN: An implicit generative model for small molecular graphs. arXiv (2018).
[145] Harm De Vries, Florian Strub, Jérémie Mary, et al. 2017. Modulating early visual processing by language. NeurIPS (2017).
[146] Bruno Degardin, Joao Neves, Vasco Lopes, et al. 2022. Generative adversarial graph convolutional networks for human action synthesis. In WACV.
[147] Boyang Deng, John P Lewis, Timothy Jeruzalski, et al. 2020. Nasa neural articulated shape approximation. In ECCV.
[148] Shasha Deng, Chee-Wee Tan, Weijun Wang, et al. 2019. Smart generation system of personalized advertising copy and its application to advertising practice and research. J. Advert. (2019).
[149] Yu Deng, Jiaolong Yang, et al. 2019. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In CVPR Workshops.
[150] Yu Deng, Jiaolong Yang, and Xin Tong. 2021. Deformed implicit field: Modeling 3d shapes with learned dense correspondence. In CVPR.
[151] Yu Deng, Jiaolong Yang, Jianfeng Xiang, et al. 2022. Gram: Generative radiance manifolds for 3d-aware image generation. In CVPR.
[152] Jacob Devlin, Ming-Wei Chang, Kenton Lee, et al. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL.
[153] Prafulla Dhariwal, Heewoo Jun, Christine Payne, et al. 2020. Jukebox: A generative model for music. arXiv (2020).
[154] Prafulla Dhariwal and Alexander Nichol. 2021. Diffusion models beat gans on image synthesis. NeurIPS (2021).
[155] Luca Di Liello, Pierfrancesco Ardino, et al. 2020. Efficient generation of structured objects with constrained adversarial networks. NeurIPS (2020).
[156] Sander Dieleman, Aaron van den Oord, et al. 2018. The challenge of realistic music generation: modelling raw audio at scale. NeurIPS (2018).
[157] Adji B Dieng, Yoon Kim, Alexander M Rush, et al. 2019. Avoiding latent variable collapse with generative skip models. In AISTATS.
[158] Ming Ding, Zhuoyi Yang, Wenyi Hong, et al. 2021. Cogview: Mastering text-to-image generation via transformers. NeurIPS (2021).
[159] Ming Ding, Wendi Zheng, Wenyi Hong, et al. 2022. Cogview2: Faster and better text-to-image generation via hierarchical transformers. NeurIPS (2022).
[160] Zheng Ding, Xuaner Zhang, Zhihao Xia, et al. 2023. DiffusionRig: Learning Personalized Priors for Facial Appearance Editing. In CVPR.
[161] Laurent Dinh, David Krueger, and Yoshua Bengio. 2015. Nice: Non-linear independent components estimation. In ICLR Workshops.
[162] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. 2017. Density estimation using Real NVP. In ICLR.
[163] Chris Donahue, Julian McAuley, and Miller Puckette. 2019. Adversarial Audio Synthesis. In ICLR.
[164] Michael Dorkenwald, Timo Milbich, Andreas Blattmann, et al. 2021. Stochastic image-to-video synthesis using cinns. In CVPR.
[165] Alexey Dosovitskiy, Lucas Beyer, et al. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In ICLR.
[166] Alexey Dosovitskiy, German Ros, Felipe Codevilla, et al. 2017. CARLA: An open urban driving simulator. In CoRL.
[167] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, et al. 2023. PaLM-E: An Embodied Multimodal Language Model. arXiv (2023).
[168] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. 2017. A Learned Representation For Artistic Style. In ICLR.
[169] Emilien Dupont, Hyunjik Kim, SM Ali Eslami, et al. 2022. From data to functa: Your data point is a function and you can treat it like one. In ICML.
[170] Ricard Durall, Margret Keuper, and Janis Keuper. 2020. Watch your up-convolution: Cnn based generative deep neural networks are failing to reproduce spectral distributions. In CVPR.
[171] Frederik Ebert, Chelsea Finn, Alex X Lee, et al. 2017. Self-Supervised Visual Planning with Temporal Skip Connections. CoRL (2017).
[172] Douglas Eck and Juergen Schmidhuber. 2002. A first look at music composition using lstm recurrent neural networks. IDSIA (2002).
[173] Jesse Engel, Cinjon Resnick, Adam Roberts, et al. 2017. Neural audio synthesis of musical notes with wavenet autoencoders. In ICML.
[174] Patrick Esser, Johnathan Chiu, Parmida Atighehchian, et al. 2023. Structure and content-guided video synthesis with diffusion models. arXiv (2023).
[175] Haoqiang Fan, Hao Su, and Leonidas J Guibas. 2017. A point set generation network for 3d object reconstruction from a single image. In CVPR.
[176] Rukun Fan, Songhua Xu, and Weidong Geng. 2011. Example-based automatic music-driven conventional dance motion synthesis. TVCG (2011).
[177] Zhiwen Fan, Yifan Jiang, Peihao Wang, et al. 2022. Unified implicit neural stylization. In ECCV.
[178] Tao Fang, Yu Qi, and Gang Pan. 2020. Reconstructing perceptive images from brain activity by shape-semantic gan. NeurIPS (2020).
[179] William Fedus, Ian Goodfellow, and Andrew M Dai. 2018. MaskGAN: Better Text Generation via Filling in the _. In ICLR.
[180] William Fedus, Barret Zoph, et al. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR (2022).
[181] Yao Feng, Haiwen Feng, Michael J Black, et al. 2021. Learning an animatable detailed 3D face model from in-the-wild images. ToG (2021).
[182] Yao Feng, Jinlong Yang, Marc Pollefeys, et al. 2022. Capturing and animation of body and clothing from monocular video. In SIGGRAPH Asia.
[183] Chelsea Finn, Ian Goodfellow, and Sergey Levine. 2016. Unsupervised learning for physical interaction through video prediction. NeurIPS (2016).
[184] Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML.
[185] Lin Geng Foo, Jia Gong, Hossein Rahmani, and Jun Liu. 2023. Distribution-Aligned Diffusion for Human Mesh Recovery. In ICCV.
[186] Lin Geng Foo, Tianjiao Li, Hossein Rahmani, Qiuhong Ke, and Jun Liu. 2023. Unified pose sequence modeling. In CVPR.
[187] Katerina Fragkiadaki, Sergey Levine, Panna Felsen, et al. 2015. Recurrent network models for human dynamics. In ICCV.
[188] Rafail Fridman, Amit Abecasis, Yoni Kasten, et al. 2023. Scenescape: Text-driven consistent scene generation. arXiv (2023).
[189] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, et al. 2022. Plenoxels: Radiance fields without neural networks. In CVPR.
[190] Jianglin Fu, Shikai Li, Yuming Jiang, et al. 2022. Stylegan-human: A data-centric odyssey of human generation. In ECCV.
[191] Rao Fu, Xiao Zhan, Yiwen Chen, et al. 2022. Shapecrafter: A recursive text-conditioned 3d shape generation model. NeurIPS (2022).
[192] Graham Fyffe, Andrew Jones, Oleg Alexander, et al. 2014. Driving high-resolution facial scans with video performance capture. ToG (2014).
[193] Guy Gafni, Justus Thies, Michael Zollhofer, et al. 2021. Dynamic neural radiance fields for monocular 4d facial avatar reconstruction. In CVPR.
[194] Oran Gafni, Adam Polyak, Oron Ashual, et al. 2022. Make-a-scene: Scene-based text-to-image generation with human priors. In ECCV.
[195] Rinon Gal, Yuval Alaluf, et al. 2023. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In ICLR.
[196] Rinon Gal, Moab Arar, Yuval Atzmon, et al. 2023. Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models. arXiv (2023).
[197] Jun Gao, Tianchang Shen, Zian Wang, et al. 2022. Get3d: A generative model of high quality 3d textured shapes learned from images. NeurIPS (2022).
[198] Leo Gao, Stella Biderman, Sid Black, et al. 2020. The pile: An 800gb dataset of diverse text for language modeling. arXiv (2020).
[199] Lin Gao, Jie Yang, Tong Wu, et al. 2019. SDM-NET: Deep generative network for structured deformable mesh. ToG (2019).
[200] Victor Garcia Satorras, Emiel Hoogeboom, Fabian Fuchs, et al. 2021. E (n) equivariant normalizing flows. NeurIPS (2021).
[201] Pablo Garrido, Levi Valgaerts, Chenglei Wu, et al. 2013. Reconstructing detailed dynamic face geometry from monocular video. ToG (2013).
[202] Pablo Garrido, Michael Zollhöfer, Dan Casas, et al. 2016. Reconstruction of personalized 3D face rigs from monocular video. ToG (2016).
[203] Songwei Ge, Thomas Hayes, Harry Yang, et al. 2022. Long video generation with time-agnostic vqgan and time-sensitive transformer. In ECCV.
[204] Niklas Gebauer, Michael Gastegger, et al. 2019. Symmetry-adapted generation of 3d point sets for the targeted discovery of molecules. NeurIPS (2019).
[205] Baris Gecer, Stylianos Ploumpis, Irene Kotsia, et al. 2019. Ganfit: Generative adversarial network fitting for high fidelity 3d face reconstruction. In CVPR.
[206] Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, et al. 2017. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP.
[207] Kyle Genova, Forrester Cole, Avneesh Sud, et al. 2020. Local deep implicit functions for 3d shape. In CVPR.
[208] Kyle Genova, Forrester Cole, Daniel Vlasic, et al. 2019. Learning shape templates with structured implicit functions. In ICCV.
[209] Thomas Gerig, Andreas Morel-Forster, Clemens Blumer, et al. 2018. Morphable face models-an open framework. In FG.
[210] Anindita Ghosh, Noshaba Cheema, Cennet Oguz, et al. 2021. Synthesis of compositional animations from textual descriptions. In ICCV.
[211] Arnab Ghosh, Richard Zhang, Puneet K Dokania, et al. 2019. Interactive sketch & fill: Multiclass sketch-to-image translation. In ICCV.
[212] Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, et al. 2020. From Variational to Deterministic Autoencoders. In ICLR.
[213] Shiry Ginosar, Amir Bar, Gefen Kohavi, et al. 2019. Learning individual styles of conversational gesture. In CVPR.
[214] Rohit Girdhar, David F Fouhey, Mikel Rodriguez, et al. 2016. Learning a predictable and generative vector representation for objects. In ECCV.
[215] Rahul Goel, Dhawal Sirikonda, Rajvi Shah, et al. 2023. FusedRF: Fusing Multiple Radiance Fields. In CVPR Workshops.
[216] Jia Gong, Lin Geng Foo, Zhipeng Fan, Qiuhong Ke, Hossein Rahmani, and Jun Liu. 2023. Diffpose: Toward more reliable 3d pose estimation. In CVPR.
[217] Shansan Gong, Mukai Li, Jiangtao Feng, et al. 2023. DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models. In ICLR.
[218] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, et al. 2014. Generative adversarial nets. NeurIPS (2014).
[219] Roberto Gozalo-Brizuela and Eduardo C Garrido-Merchan. 2023. ChatGPT is not all you need. A State of the Art Review of large Generative AI models. arXiv (2023).
[220] Matej Grcić, Ivan Grubišić, and Siniša Šegvić. 2021. Densely connected normalizing flows. NeurIPS (2021).
[221] Karol Gregor, Frederic Besse, Danilo Jimenez Rezende, et al. 2016. Towards conceptual compression. NeurIPS (2016).
[222] Karol Gregor, Ivo Danihelka, Alex Graves, et al. 2015. Draw: A recurrent neural network for image generation. In ICML.
[223] Daniel Griffin and Jae Lim. 1984. Signal estimation from modified short-time Fourier transform. ITASS (1984).
[224] Amos Gropp, Lior Yariv, Niv Haim, et al. 2020. Implicit Geometric Regularization for Learning Shapes. In ICML.
[225] Jiatao Gu, Lingjie Liu, Peng Wang, et al. 2022. StyleNeRF: A Style-based 3D Aware Generator for High-resolution Image Synthesis. In ICLR.
[226] Shuyang Gu, Dong Chen, Jianmin Bao, et al. 2022. Vector quantized diffusion model for text-to-image synthesis. In CVPR.
[227] Peng Guan, Loretta Reiss, David A Hirshberg, et al. 2012. Drape: Dressing any person. ToG (2012).
[228] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, et al. 2017. Improved training of wasserstein gans. NeurIPS (2017).
[229] Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, et al. 2016. PixelVAE: A Latent Variable Model for Natural Images. In ICLR.
[230] Chen Guo, Tianjian Jiang, et al. 2023. Vid2avatar: 3d avatar reconstruction from videos in the wild via self-supervised scene decomposition. In CVPR.
[231] Chuan Guo, Shihao Zou, Xinxin Zuo, et al. 2022. Generating diverse and natural 3d human motions from text. In CVPR.
[232] Chuan Guo, Xinxin Zuo, Sen Wang, et al. 2020. Action2motion: Conditioned generation of 3d human motions. In ACM MM.
[233] Jiaxian Guo, Sidi Lu, Han Cai, et al. 2018. Long text generation via adversarial training with leaked information. In AAAI.
[234] Shunan Guo, Zhuochen Jin, Fuling Sun, et al. 2021. Vinci: an intelligent graphic design system for generating advertising posters. In CHI.
[235] Wen Guo, Yuming Du, Xi Shen, et al. 2022. Back to MLP: A Simple Baseline for Human Motion Prediction. arXiv (2022).
[236] Yudong Guo, Keyu Chen, Sen Liang, et al. 2021. Ad-nerf: Audio driven neural radiance fields for talking head synthesis. In ICCV.
[237] Kamal Gupta, Justin Lazarow, Alessandro Achille, et al. 2021. Layouttransformer: Layout generation and completion with self-attention. In ICCV.
[238] Ikhsanul Habibie, Daniel Holden, Jonathan Schwarz, et al. 2017. A recurrent variational autoencoder for human motion synthesis. In BMVC.
[239] Gaëtan Hadjeres, François Pachet, and Frank Nielsen. 2017. Deepbach: a steerable model for bach chorales generation. In ICML.
[240] Sang-Hun Han, Min-Gyu Park, Ju Hong Yoon, et al. 2023. High-fidelity 3D Human Digitization from Single 2K Resolution Images. In CVPR.
[241] Christian Häne, Shubham Tulsiani, and Jitendra Malik. 2017. Hierarchical surface prediction for 3d object reconstruction. In 3DV.
[242] Rana Hanocka, Gal Metzer, Raja Giryes, et al. 2020. Point2Mesh: a self-prior for deformable meshes. ToG (2020).
[243] Zekun Hao, Hadar Averbuch-Elor, Noah Snavely, et al. 2020. Dualsdf: Semantic shape manipulation using a two-level representation. In CVPR.
[244] Zekun Hao, Xun Huang, and Serge Belongie. 2018. Controllable video generation with sparse trajectories. In CVPR.
[245] William Harvey, Saeid Naderiparizi, Vaden Masrani, et al. 2022. Flexible diffusion modeling of long videos. NeurIPS (2022).
[246] Dai Hasegawa, Naoshi Kaneko, Shinichi Shirakawa, et al. 2018. Evaluation of speech-to-gesture generation using bi-directional LSTM network. In IVA.
[247] Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al. 2016. Deep residual learning for image recognition. In CVPR.
[248] Tong He, John Collomosse, et al. 2020. Geo-pifu: Geometry and pixel aligned implicit functions for single-view human reconstruction. NeurIPS (2020).
[249] Tong He, Yuanlu Xu, Shunsuke Saito, et al. 2021. Arch++: Animation-ready clothed human reconstruction revisited. In ICCV.
[250] Peter Hedman, Pratul P Srinivasan, Ben Mildenhall, et al. 2021. Baking neural radiance fields for real-time view synthesis. In ICCV.
[251] Paul Henderson, Vagia Tsiminaki, and Christoph H Lampert. 2020. Leveraging 2d data to learn textured 3d mesh generation. In CVPR.
[252] Gustav Eje Henter, Simon Alexanderson, et al. 2020. Moglow: Probabilistic and controllable motion synthesis using normalising flows. ToG (2020).
[253] Philipp Henzler, Niloy J Mitra, and Tobias Ritschel. 2019. Escaping plato’s cave: 3d shape from adversarial rendering. In ICCV.
[254] Alejandro Hernandez, Jurgen Gall, and Francesc Moreno-Noguer. 2019. Human motion prediction via spatio-temporal inpainting. In ICCV.
[255] Amir Hertz, Ron Mokady, Jay Tenenbaum, et al. 2023. Prompt-to-Prompt Image Editing with Cross-Attention Control. In ICLR.
[256] Martin Heusel, Hubert Ramsauer, et al. 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS (2017).
[257] Irina Higgins, Loic Matthey, Arka Pal, et al. 2017. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR.
[258] Jonathan Ho, William Chan, Chitwan Saharia, et al. 2022. Imagen video: High definition video generation with diffusion models. arXiv (2022).
[259] Jonathan Ho, Xi Chen, et al. 2019. Flow++: Improving flow-based generative models with variational dequantization and architecture design. In ICML.
[260] Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising diffusion probabilistic models. NeurIPS (2020).
[261] Jonathan Ho, Chitwan Saharia, William Chan, et al. 2022. Cascaded diffusion models for high fidelity image generation. JMLR (2022).
[262] Jonathan Ho and Tim Salimans. 2021. Classifier-Free Diffusion Guidance. In NeurIPS Workshops.
[263] Jonathan Ho, Tim Salimans, Alexey Gritsenko, et al. 2022. Video Diffusion Models. In NeurIPS.
[264] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. 2022. Training compute-optimal large language models. arXiv (2022).
[265] Daniel Holden, Jun Saito, and Taku Komura. 2016. A deep learning framework for character motion synthesis and editing. ToG (2016).
[266] Lukas Höllein, Ang Cao, Andrew Owens, et al. 2023. Text2room: Extracting textured 3d meshes from 2d text-to-image models. arXiv (2023).
[267] Aleksander Holynski, Brian L Curless, Steven M Seitz, et al. 2021. Animating pictures with eulerian motion fields. In CVPR.
[268] Fangzhou Hong, Zhaoxi Chen, Yushi Lan, et al. 2023. EVA3D: Compositional 3D Human Generation from 2D Image Collections. In ICLR.
[269] Fangzhou Hong, Mingyuan Zhang, Liang Pan, et al. 2022. AvatarCLIP: zero-shot text-driven generation and animation of 3D avatars. ToG (2022).
[270] Wenyi Hong, Ming Ding, Wendi Zheng, et al. 2023. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers. In ICLR.
[271] Yang Hong, Bo Peng, Haiyao Xiao, et al. 2022. Headnerf: A real-time nerf-based parametric head model. In CVPR.
[272] Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, et al. 2021. Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions. In NeurIPS.
[273] Emiel Hoogeboom, Víctor Garcia Satorras, Clément Vignac, et al. 2022. Equivariant diffusion for molecule generation in 3d. In ICML.
[274] Emiel Hoogeboom, Rianne Van Den Berg, and Max Welling. 2019. Emerging convolutions for generative normalizing flows. In ICML.
[275] Liwen Hu, Shunsuke Saito, Lingyu Wei, et al. 2017. Avatar digitization from a single image for real-time rendering. ToG (2017).
[276] Vincent Tao Hu, David W. Zhang, Yuki M. Asano, et al. 2023. Self-Guided Diffusion Models. In CVPR.
[277] Allen Huang and Raymond Wu. 2016. Deep learning for music. arXiv (2016).
[278] Qingqing Huang, Aren Jansen, Joonseok Lee, et al. 2022. MuLan: A Joint Embedding of Music Audio and Natural Language. In ISMIR.
[279] Qingqing Huang, Daniel S Park, Tao Wang, et al. 2023. Noise2music: Text-conditioned music generation with diffusion models. arXiv (2023).
[280] Rongjie Huang, Zhou Zhao, Huadai Liu, et al. 2022. Prodiff: Progressive fast diffusion model for high-quality text-to-speech. In ACM MM.
[281] Xun Huang, Ming-Yu Liu, Serge Belongie, et al. 2018. Multimodal unsupervised image-to-image translation. In ECCV.
[282] Yi-Hua Huang, Yue He, Yu-Jie Yuan, et al. 2022. Stylizednerf: consistent 3d scene stylization as stylized nerf via 2d-3d mutual learning. In CVPR.
[283] Zeng Huang, Yuanlu Xu, Christoph Lassner, et al. 2020. Arch: Animatable reconstruction of clothed humans. In CVPR.
[284] Drew A Hudson and Larry Zitnick. 2021. Generative adversarial transformers. In ICML.
[285] Le Hui, Rui Xu, Jin Xie, et al. 2020. Progressive point cloud deconvolution generation network. In ECCV.
[286] Inwoo Hwang, Hyeonwoo Kim, and Young Min Kim. 2023. Text2Scene: Text-Driven Indoor Scene Stylization With Part-Aware Details. In CVPR.
[287] Moritz Ibing et al. 2023. Octree transformer: Autoregressive 3d shape generation on hierarchically structured sequences. In CVPR Workshops.
[288] Moritz Ibing, Isaak Lim, and Leif Kobbelt. 2021. 3d shape generation with grid-based implicit functions. In CVPR.
[289] Alexandru Eugen Ichim, Sofien Bouaziz, and Mark Pauly. 2015. Dynamic 3D avatar creation from hand-held video input. ToG (2015).
[290] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa. 2017. Globally and locally consistent image completion. ToG (2017).
[291] Catalin Ionescu, Dragos Papava, Vlad Olaru, et al. 2013. Human3.6M: Large scale datasets and predictive methods for 3d human sensing in natural environments. TPAMI (2013).
[292] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, et al. 2017. Image-to-image translation with conditional adversarial networks. In CVPR.
[293] Keith Ito and Linda Johnson. 2017. The lj speech dataset. (2017).
[294] Ajay Jain, Ben Mildenhall, Jonathan T Barron, et al. 2022. Zero-shot text-guided object generation with dream fields. In CVPR.
[295] Amir Jamaludin, Joon Son Chung, and Andrew Zisserman. 2019. You said that?: Synthesising talking faces from audio. IJCV (2019).
[296] Ziwei Ji, Nayeon Lee, Rita Frieske, et al. 2023. Survey of hallucination in natural language generation. Comput. Surveys (2023).
[297] Boyi Jiang, Yang Hong, Hujun Bao, et al. 2022. Selfrecon: Self reconstruction your digital avatar from monocular video. In CVPR.
[298] Boyi Jiang, Juyong Zhang, Yang Hong, et al. 2020. Bcnet: Learning body and cloth shape from a single image. In ECCV.
[299] Chiyu Jiang, Avneesh Sud, Ameesh Makadia, et al. 2020. Local implicit grid representations for 3d scenes. In CVPR.
[300] Suyi Jiang, Haoran Jiang, Ziyu Wang, et al. 2023. Humangen: Generating human radiance fields with explicit priors. In CVPR.
[301] Tianjian Jiang, Xu Chen, Jie Song, et al. 2023. Instantavatar: Learning avatars from monocular video in 60 seconds. In CVPR.
[302] Wei Jiang, Kwang Moo Yi, Golnoosh Samei, et al. 2022. Neuman: Neural human radiance field from a single video. In ECCV.
[303] Yifan Jiang, Shiyu Chang, and Zhangyang Wang. 2021. Transgan: Two pure transformers can make one strong gan, and that can scale up. NeurIPS (2021).
[304] Yue Jiang, Dantong Ji, Zhizhong Han, et al. 2020. Sdfdiff: Differentiable rendering of signed distance fields for 3d shape optimization. In CVPR.
[305] Yuming Jiang, Shuai Yang, Haonan Qiu, et al. 2022. Text2human: Text-driven controllable human image generation. ToG (2022).
[306] Danilo Jimenez Rezende, SM Eslami, Shakir Mohamed, et al. 2016. Unsupervised learning of 3d structure from images. NeurIPS (2016).
[307] Jaehyeong Jo, Seul Lee, and Sung Ju Hwang. 2022. Score-based generative modeling of graphs via the system of stochastic differential equations. In ICML.
[308] Hanbyul Joo, Tomas Simon, and Yaser Sheikh. 2018. Total capture: A 3d deformation model for tracking faces, hands, and bodies. In CVPR.
[309] Mandar Joshi, Eunsol Choi, et al. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In ACL.
[310] Akash Abdu Jyothi, Thibaut Durand, Jiawei He, et al. 2019. Layoutvae: Stochastic scene layout generation from a label set. In ICCV.
[311] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, et al. 2018. Efficient neural audio synthesis. In ICML.
[312] Nal Kalchbrenner, Aäron Oord, Karen Simonyan, et al. 2017. Video pixel networks. In ICML.
[313] Angjoo Kanazawa, Michael J Black, David W Jacobs, et al. 2018. End-to-end recovery of human shape and pose. In CVPR.
[314] Angjoo Kanazawa, Shubham Tulsiani, Alexei A Efros, et al. 2018. Learning category-specific mesh reconstruction from image collections. In ECCV.
[315] Animesh Karnewar, Andrea Vedaldi, David Novotny, et al. 2023. Holodiffusion: Training a 3D diffusion model using 2D images. In CVPR.
[316] Tero Karras, Timo Aila, Samuli Laine, et al. 2017. Audio-driven facial animation by joint end-to-end learning of pose and emotion. ToG (2017).
[317] Tero Karras, Timo Aila, Samuli Laine, et al. 2018. Progressive Growing of GANs for Improved Quality, Stability, and Variation. In ICLR.
[318] Tero Karras, Miika Aittala, Timo Aila, et al. 2022. Elucidating the design space of diffusion-based generative models. NeurIPS (2022).
[319] Tero Karras, Miika Aittala, Samuli Laine, et al. 2021. Alias-free generative adversarial networks. NeurIPS (2021).
[320] Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In CVPR.
[321] Tero Karras, Samuli Laine, Miika Aittala, et al. 2020. Analyzing and improving the image quality of stylegan. In CVPR.
[322] Hiroharu Kato, Yoshitaka Ushiku, and Tatsuya Harada. 2018. Neural 3d mesh renderer. In CVPR.
[323] Ladislav Kavan, Steven Collins, Jiří Žára, et al. 2008. Geometric skinning with approximate dual quaternion blending. ToG (2008).
[324] Bahjat Kawar, Shiran Zada, Oran Lang, et al. 2023. Imagic: Text-based real image editing with diffusion models. In CVPR.
[325] Michael Kazhdan, Matthew Bolitho, and Hugues Hoppe. 2006. Poisson surface reconstruction. In SGP.
[326] Michael Kazhdan and Hugues Hoppe. 2013. Screened poisson surface reconstruction. ToG (2013).
[327] Taras Khakhulin, Vanessa Sklyarova, Victor Lempitsky, et al. 2022. Realistic one-shot mesh-based head avatars. In ECCV.
[328] Gwanghyun Kim, Taesung Kwon, and Jong Chul Ye. 2022. Diffusionclip: Text-guided diffusion models for robust image manipulation. In CVPR.
[329] Hyeongju Kim, Hyeonseung Lee, Woo Hyun Kang, et al. 2020. Softflow: Probabilistic framework for normalizing flow on manifolds. NeurIPS (2020).
[330] Jaehyeon Kim, Sungwon Kim, Jungil Kong, et al. 2020. Glow-tts: A generative flow for text-to-speech via monotonic alignment search. NeurIPS (2020).
[331] Seung Wook Kim, Yuhao Zhou, Jonah Philion, et al. 2020. Learning to simulate dynamic environments with gamegan. In CVPR.
[332] Taeksoo Kim, Moonsu Cha, Hyunsoo Kim, et al. 2017. Learning to discover cross-domain relations with generative adversarial networks. In ICML.
[333] Tae-hoon Kim, Sang Il Park, and Sung Yong Shin. 2003. Rhythmic-motion synthesis based on motion-beat analysis. ToG (2003).
[334] Yunji Kim, Seonghyeon Nam, In Cho, et al. 2019. Unsupervised keypoint learning for guiding class-conditional video prediction. NeurIPS (2019).
[335] Diederik Kingma, Tim Salimans, Ben Poole, et al. 2021. Variational diffusion models. NeurIPS (2021).
[336] Durk P Kingma and Prafulla Dhariwal. 2018. Glow: Generative flow with invertible 1x1 convolutions. NeurIPS (2018).
[337] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, et al. 2016. Improved variational inference with inverse autoregressive flow. NeurIPS (2016).
[338] Diederik P Kingma and Max Welling. 2014. Auto-encoding variational bayes. In ICLR.
[339] Roman Klokov, Edmond Boyer, and Jakob Verbeek. 2020. Discrete point flow networks for efficient point cloud generation. In ECCV.
[340] Jungil Kong, Jaehyeon Kim, et al. 2020. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. NeurIPS (2020).
[341] Zhifeng Kong, Wei Ping, Jiaji Huang, et al. 2021. DiffWave: A Versatile Diffusion Model for Audio Synthesis. In ICLR.
[342] Adam R Kosiorek, Heiko Strathmann, Daniel Zoran, et al. 2021. Nerf-vae: A geometry aware 3d scene generative model. In ICML.
[343] Akim Kotelnikov, Dmitry Baranchuk, Ivan Rubachev, et al. 2023. Tabddpm: Modelling tabular data with diffusion models. In ICML.
[344] Ranjay Krishna, Kenji Hata, Frederic Ren, et al. 2017. Dense-captioning events in videos. In ICCV.
[345] Alex Krizhevsky. 2009. Learning multiple layers of features from tiny images. Technical Report.
[346] Taras Kucherenko, Dai Hasegawa, Gustav Eje Henter, et al. 2019. Analyzing input and output representations for speech-driven gesture generation. In IVA.
[347] Kundan Kumar, Rithesh Kumar, et al. 2019. Melgan: Generative adversarial networks for conditional waveform synthesis. NeurIPS (2019).
[348] Nupur Kumari, Bingliang Zhang, Richard Zhang, et al. 2023. Multi-concept customization of text-to-image diffusion. In CVPR.
[349] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, et al. 2019. Natural questions: a benchmark for question answering research. TACL (2019).
[350] Max W. Y. Lam, Jun Wang, Dan Su, et al. 2022. BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis. In ICLR.
[351] Alexandros Lattas, Stylianos Moschoglou, Baris Gecer, et al. 2020. AvatarMe: Realistically Renderable 3D Facial Reconstruction "in-the-wild". In CVPR.
[352] Cheng-Han Lee, Ziwei Liu, Lingyun Wu, et al. 2020. Maskgan: Towards diverse and interactive facial image manipulation. In CVPR.
[353] Hsin-Ying Lee, Hung-Yu Tseng, Jia-Bin Huang, et al. 2018. Diverse image-to-image translation via disentangled representations. In ECCV.
[354] Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, et al. 2019. Dancing to music. NeurIPS (2019).
[355] Junsoo Lee et al. 2020. Reference-based sketch image colorization using augmented-self reference and dense semantic correspondence. In CVPR.
[356] Kwonjoon Lee, Huiwen Chang, Lu Jiang, et al. 2022. ViTGAN: Training GANs with Vision Transformers. In ICLR.
[357] Nayeon Lee, Wei Ping, Peng Xu, et al. 2022. Factuality enhanced language models for open-ended text generation. NeurIPS (2022).
[358] John P Lewis et al. 2000. Pose space deformation: a unified approach to shape interpolation and skeleton-driven deformation. In SIGGRAPH.
[359] Mike Lewis, Yinhan Liu, Naman Goyal, et al. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation,
and Comprehension. In ACL.
[360] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, et al. 2019. Controllable text-to-image generation. NeurIPS (2019).
[361] Bowen Li, Xiaojuan Qi, Thomas Lukasiewicz, et al. 2020. Manigan: Text-guided image manipulation. In CVPR.
[362] Chun-Liang Li, Manzil Zaheer, Yang Zhang, et al. 2019. Point Cloud GAN. In ICLR Workshops.
[363] Chen Li, Zhen Zhang, Wee Sun Lee, et al. 2018. Convolutional sequence to sequence model for human dynamics. In CVPR.
[364] Jing Li, Di Kang, et al. 2021. Audio2gestures: Generating diverse gestures from speech audio with conditional variational autoencoders. In ICCV.
[365] Jiwei Li, Will Monroe, Alan Ritter, et al. 2016. Deep Reinforcement Learning for Dialogue Generation. In EMNLP.
[366] Jiwei Li, Will Monroe, Tianlin Shi, et al. 2017. Adversarial Learning for Neural Dialogue Generation. In EMNLP.
[367] Jun Li, Kai Xu, Siddhartha Chaudhuri, et al. 2017. Grass: Generative recursive autoencoders for shape structures. ToG (2017).
[368] Jianan Li, Jimei Yang, Aaron Hertzmann, et al. 2020. Layoutgan: Synthesizing graphic layouts with vector-wireframe adversarial networks. TPAMI (2020).
[369] Jiaman Li, Yihang Yin, Hang Chu, et al. 2020. Learning to generate diverse dance motions with transformer. arXiv (2020).
[370] Maosen Li, Siheng Chen, et al. 2021. Multiscale Spatio-Temporal Graph Neural Networks for 3D Skeleton-Based Motion Prediction. TIP (2021).
[371] Muheng Li, Yueqi Duan, Jie Zhou, et al. 2023. Diffusion-sdf: Text-to-shape via voxelized diffusion. In CVPR.
[372] Naihan Li, Shujie Liu, Yanqing Liu, et al. 2019. Neural speech synthesis with transformer network. In AAAI.
[373] Ruihui Li, Xianzhi Li, Ka-Hei Hui, et al. 2021. SP-GAN: Sphere-guided 3D shape generation and manipulation. ToG (2021).
[374] Ruilong Li, Shan Yang, David A Ross, et al. 2021. Ai choreographer: Music conditioned 3d dance generation with aist++. In ICCV.
[375] Tianye Li, Timo Bolkart, Michael J Black, et al. 2017. Learning a model of facial shape and expression from 4D scans. ToG (2017).
[376] Weichuang Li, Longhao Zhang, Don Wang, et al. 2023. One-Shot High-Fidelity Talking-Head Synthesis with Deformable Neural Radiance Field. In CVPR.
[377] Xiang Li, John Thickstun, and Ishaan Gulrajani. 2022. Diffusion-lm improves controllable text generation. NeurIPS (2022).
[378] Yuheng Li, Haotian Liu, Qingyang Wu, et al. 2023. Gligen: Open-set grounded text-to-image generation. In CVPR.
[379] Yige Li, Xixiang Lyu, Nodens Koren, et al. 2021. Neural Attention Distillation: Erasing Backdoor Triggers from Deep Neural Networks. In ICLR.
[380] Yitong Li, Martin Min, Dinghan Shen, et al. 2018. Video generation from text. In AAAI.
[381] Zhengqi Li, Simon Niklaus, Noah Snavely, et al. 2021. Neural scene flow fields for space-time view synthesis of dynamic scenes. In CVPR.
[382] Jian Liang, Chenfei Wu, et al. 2022. NUWA-infinity: Autoregressive over autoregressive generation for infinite visual synthesis. NeurIPS (2022).
[383] Renjie Liao, Yujia Li, Yang Song, et al. 2019. Efficient graph generation with graph recurrent attention networks. NeurIPS (2019).
[384] Tingting Liao, Xiaomei Zhang, Yuliang Xiu, et al. 2023. High-Fidelity Clothed Avatar Reconstruction from a Single Image. In CVPR.
[385] Yiyi Liao, Simon Donne, and Andreas Geiger. 2018. Deep marching cubes: Learning explicit surface representations. In CVPR.
[386] Chen-Hsuan Lin, Jun Gao, Luming Tang, et al. 2023. Magic3d: High-resolution text-to-3d content creation. In CVPR.
[387] Kevin Lin, Dianqi Li, Xiaodong He, et al. 2017. Adversarial ranking for language generation. NeurIPS (2017).
[388] Kevin Lin, Lijuan Wang, and Zicheng Liu. 2021. End-to-end human pose and mesh reconstruction with transformers. In CVPR.
[389] Sikun Lin, Thomas Sprague, and Ambuj K Singh. 2022. Mind reader: Reconstructing complex images from brain activities. NeurIPS (2022).
[390] Tsung-Yi Lin, Michael Maire, Serge Belongie, et al. 2014. Microsoft coco: Common objects in context. In ECCV.
[391] Zhenghao Lin, Yeyun Gong, Yelong Shen, et al. 2023. Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise. In ICML.
[392] Guilin Liu, Fitsum A Reda, Kevin J Shih, et al. 2018. Image inpainting for irregular holes using partial convolutions. In ECCV.
[393] Jinglin Liu, Chengxi Li, Yi Ren, et al. 2022. Diffsinger: Singing voice synthesis via shallow diffusion mechanism. In AAAI.
[394] Jun Liu, Amir Shahroudy, Mauricio Perez, et al. 2019. Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding. TPAMI (2019).
[395] Kunhao Liu, Fangneng Zhan, Yiwen Chen, et al. 2023. StyleRF: Zero-shot 3D Style Transfer of Neural Radiance Fields. In CVPR.
[396] Lingjie Liu, Jiatao Gu, Kyaw Zaw Lin, et al. 2020. Neural sparse voxel fields. NeurIPS (2020).
[397] Lingjie Liu, Marc Habermann, Viktor Rudnev, et al. 2021. Neural actor: Neural free-view synthesis of human actors with pose control. ToG (2021).
[398] Ming-Yu Liu, Thomas Breuel, and Jan Kautz. 2017. Unsupervised image-to-image translation networks. NeurIPS (2017).
[399] Ming-Yu Liu, Xun Huang, Arun Mallya, et al. 2019. Few-shot unsupervised image-to-image translation. In ICCV.
[400] Nan Liu, Shuang Li, Yilun Du, et al. 2022. Compositional visual generation with composable diffusion models. In ECCV.
[401] Qi Liu, Miltiadis Allamanis, Marc Brockschmidt, et al. 2018. Constrained graph variational autoencoders for molecule design. NeurIPS (2018).
[402] Ruoshi Liu and Carl Vondrick. 2023. Humans as Light Bulbs: 3D Human Reconstruction from Thermal Reflection. In CVPR.
[403] Shikun Liu, Lee Giles, and Alexander Ororbia. 2018. Learning a hierarchical latent-variable model of 3d shapes. In 3DV.
[404] Steven Liu, Xiuming Zhang, Zhoutong Zhang, et al. 2021. Editing conditional radiance fields. In ICCV.
[405] Shaoteng Liu, Yuechen Zhang, Wenbo Li, et al. 2023. Video-p2p: Video editing with cross-attention control. arXiv (2023).
[406] Xingchao Liu, Chengyue Gong, and Qiang Liu. 2023. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow. In ICLR.
[407] Xian Liu, Qianyi Wu, Hang Zhou, et al. 2022. Learning hierarchical cross-modal association for co-speech gesture generation. In CVPR.
[408] Ze Liu, Yutong Lin, Yue Cao, et al. 2021. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV.
[409] Ziwei Liu, Ping Luo, Shi Qiu, et al. 2016. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In CVPR.
[410] Ziwei Liu, Ping Luo, Xiaogang Wang, et al. 2015. Deep learning face attributes in the wild. In ICCV.
[411] Zhijian Liu, Haotian Tang, Yujun Lin, et al. 2019. Point-voxel cnn for efficient 3d deep learning. NeurIPS (2019).
[412] Zhengzhe Liu, Yi Wang, Xiaojuan Qi, et al. 2022. Towards implicit text-guided 3d shape generation. In CVPR.
[413] Matthew Loper, Naureen Mahmood, Javier Romero, et al. 2015. SMPL: A skinned multi-person linear model. ToG (2015).
[414] Raphael Gontijo Lopes, David Ha, Douglas Eck, et al. 2019. A learned representation for scalable vector graphics. In ICCV.
[415] David Lopez-Paz and Maxime Oquab. 2017. Revisiting Classifier Two-Sample Tests. In ICLR.
[416] William Lotter, Gabriel Kreiman, and David Cox. 2017. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. In ICLR.
[417] Christos Louizos and Max Welling. 2017. Multiplicative normalizing flows for variational bayesian neural networks. In ICML.
[418] Mario Lucic, Karol Kurach, Marcin Michalski, et al. 2018. Are gans created equal? a large-scale study. NeurIPS (2018).
[419] Andreas Lugmayr, Martin Danelljan, Andres Romero, et al. 2022. Repaint: Inpainting using denoising diffusion probabilistic models. In CVPR.
[420] Shitong Luo, Jiaqi Guan, Jianzhu Ma, et al. 2021. A 3D generative model for structure-based drug design. NeurIPS (2021).
[421] Shitong Luo and Wei Hu. 2021. Diffusion probabilistic models for 3d point cloud generation. In CVPR.
[422] Shitong Luo, Chence Shi, Minkai Xu, et al. 2021. Predicting molecular conformation via dynamic graph score matching. NeurIPS (2021).
[423] Zhengxiong Luo, Dayou Chen, Yingya Zhang, et al. 2023. VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation. In CVPR.
[424] Zhaoyang Lyu, Jinyi Wang, Yuwei An, et al. 2023. Controllable Mesh Generation Through Sparse Latent Point Diffusion Models. In CVPR.
[425] Qianli Ma, Jinlong Yang, Anura Ranjan, et al. 2020. Learning to dress 3d people in generative clothing. In CVPR.
[426] Tiezheng Ma et al. 2022. Progressively Generating Better Initial Guesses Towards Next Stages for High-Quality Human Motion Prediction. In CVPR.
[427] Xu Ma, Yuqian Zhou, Xingqian Xu, et al. 2022. Towards layer-wise image vectorization. In CVPR.
[428] Aniruddha Mahapatra and Kuldeep Kulkarni. 2022. Controllable animation of fluid elements in still images. In CVPR.
[429] Shubh Maheshwari, Debtanu Gupta, et al. 2022. Mugl: Large scale multi person conditional action generation with locomotion. In WACV.
[430] Naureen Mahmood, Nima Ghorbani, Nikolaus F Troje, et al. 2019. AMASS: Archive of motion capture as surface shapes. In ICCV.
[431] Arun Mallya, Ting-Chun Wang, Karan Sapra, et al. 2020. World-consistent video-to-video synthesis. In ECCV.
[432] Elman Mansimov, Omar Mahmood, et al. 2019. Molecular geometry prediction using a deep generative graph neural network. Sci. Rep. (2019).
[433] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, et al. 2016. Generating images from captions with attention. In ICLR.
[434] Qi Mao, Hsin-Ying Lee, Hung-Yu Tseng, et al. 2019. Mode seeking generative adversarial networks for diverse image synthesis. In CVPR.
[435] Wei Mao, Miaomiao Liu, and Mathieu Salzmann. 2020. History repeats itself: Human motion prediction via motion attention. In ECCV.
[436] Wei Mao, Miaomiao Liu, Mathieu Salzmann, et al. 2019. Learning trajectory dependencies for human motion prediction. In ICCV.
[437] Xudong Mao, Qing Li, Haoran Xie, et al. 2017. Least squares generative adversarial networks. In ICCV.
[438] Stacy Marsella, Yuyu Xu, Margaux Lhommet, et al. 2013. Virtual character performance from speech. In SCA.
[439] Julieta Martinez, Michael J Black, and Javier Romero. 2017. On human motion prediction using recurrent neural networks. In CVPR.
[440] Angel Martínez-González, Michael Villamizar, and Jean-Marc Odobez. 2021. Pose Transformers (POTR): Human Motion Prediction With Non-Autoregressive Transformers. In ICCV Workshops.
[441] Tanya Marwah, Gaurav Mittal, and Vineeth N Balasubramanian. 2017. Attentive semantic video generation using captions. In ICCV.
[442] Michael Mathieu, Camille Couprie, and Yann LeCun. 2016. Deep multi-scale video prediction beyond mean square error. In ICLR.
[443] Donald Meagher. 1982. Geometric modeling using octree encoding. CGIP (1982).
[444] Soroush Mehri, Kundan Kumar, Ishaan Gulrajani, et al. 2017. SampleRNN: An Unconditional End-to-End Neural Audio Generation Model. In ICLR.
[445] Dushyant Mehta, Helge Rhodin, Dan Casas, et al. 2017. Monocular 3d human pose estimation in the wild using improved cnn supervision. In 3DV.
[446] Luke Melas-Kyriazi, Christian Rupprecht, et al. 2023. PC2: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction. In CVPR.
[447] Chenlin Meng, Robin Rombach, Ruiqi Gao, et al. 2023. On Distillation of Guided Diffusion Models. In CVPR.
[448] Kevin Meng, David Bau, Alex Andonian, et al. 2022. Locating and editing factual associations in GPT. NeurIPS (2022).
[449] Bruce Merry, Patrick Marais, and James Gain. 2006. Animation space: A truly linear framework for character animation. ToG (2006).
[450] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. 2018. Which training methods for GANs do actually converge?. In ICML.
[451] Lars Mescheder, Michael Oechsle, Michael Niemeyer, et al. 2019. Occupancy networks: Learning 3d reconstruction in function space. In CVPR.
[452] Gal Metzer, Elad Richardson, Or Patashnik, et al. 2023. Latent-nerf for shape-guided generation of 3d shapes and textures. In CVPR.
[453] Mateusz Michalkiewicz, Jhony K Pontes, Dominic Jack, et al. 2019. Implicit surface representations as layers in neural networks. In ICCV.
[454] Oscar Michel, Roi Bar-On, Richard Liu, et al. 2022. Text2mesh: Text-driven neural stylization for meshes. In CVPR.
[455] Antoine Miech, Dimitri Zhukov, et al. 2019. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In ICCV.
[456] Marko Mihajlovic, Yan Zhang, Michael J Black, et al. 2021. LEAP: Learning articulated occupancy of people. In CVPR.
[457] Todor Mihaylov, Peter Clark, et al. 2018. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In EMNLP.
[458] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, et al. 2020. NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. In ECCV.
[459] Mehdi Mirza and Simon Osindero. 2014. Conditional generative adversarial nets. arXiv (2014).
[460] Eric Mitchell, Charles Lin, Antoine Bosselut, et al. 2022. Fast Model Editing at Scale. In ICLR.
[461] Gaurav Mittal, Tanya Marwah, et al. 2017. Sync-draw: Automatic video generation using deep recurrent attentive architectures. In ACM MM.
[462] Paritosh Mittal, Yen-Chi Cheng, Maneesh Singh, et al. 2022. Autosdf: Shape priors for 3d completion, reconstruction and generation. In CVPR.
[463] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, et al. 2018. Spectral Normalization for Generative Adversarial Networks. In ICLR.
[464] Takeru Miyato and Masanori Koyama. 2018. cGANs with Projection Discriminator. In ICLR.
[465] Nasir Mohammad Khalid et al. 2022. Clip-mesh: Generating textured meshes from text using pretrained image-text models. In SIGGRAPH Asia.
[466] Eyal Molad, Eliahu Horwitz, Dani Valevski, et al. 2023. Dreamix: Video diffusion models are general video editors. arXiv (2023).
[467] Masanori Morise et al. 2016. WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Trans. Inf. Syst. (2016).
[468] Norman Müller, Yawar Siddiqui, Lorenzo Porzi, et al. 2023. Diffrf: Rendering-guided 3d radiance field diffusion. In CVPR.
[469] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. 2017. VoxCeleb: A Large-Scale Speaker Identification Dataset. Proc. Interspeech (2017), 2616–2620.
[470] Seonghyeon Nam, Yunji Kim, et al. 2018. Text-adaptive generative adversarial networks: manipulating images with natural language. NeurIPS (2018).
[471] Charlie Nash, Yaroslav Ganin, SM Ali Eslami, et al. 2020. Polygen: An autoregressive generative model of 3d meshes. In ICML.
[472] Anh Nguyen, Jeff Clune, Yoshua Bengio, et al. 2017. Plug & play generative networks: Conditional iterative generation of images in latent space. In CVPR.
[473] Thu Nguyen-Phuoc, Chuan Li, Lucas Theis, et al. 2019. Hologan: Unsupervised learning of 3d representations from natural images. In ICCV.
[474] Thu H Nguyen-Phuoc et al. 2020. Blockgan: Learning 3d object-aware scene representations from unlabelled images. NeurIPS (2020).
[475] Haomiao Ni, Changhao Shi, Kai Li, et al. 2023. Conditional Image-to-Video Generation with Latent Flow Diffusion Models. In CVPR.
[476] Alex Nichol. 2022. DALL-E 2 pre-training mitigations. OpenAI blog (2022).
[477] Alexander Quinn Nichol et al. 2022. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In ICML.
[478] Alexander Quinn Nichol and Prafulla Dhariwal. 2021. Improved denoising diffusion probabilistic models. In ICML.
[479] Michael Niemeyer and Andreas Geiger. 2021. Giraffe: Representing scenes as compositional generative neural feature fields. In CVPR.
[480] Michael Niemeyer, Lars Mescheder, Michael Oechsle, et al. 2019. Occupancy flow: 4d reconstruction by learning particle dynamics. In ICCV.
[481] Maria-Elena Nilsback and Andrew Zisserman. 2008. Automated flower classification over a large number of classes. In ICVGIP.
[482] Harsha Nori, Nicholas King, Scott Mayer McKinney, et al. 2023. Capabilities of gpt-4 on medical challenge problems. arXiv (2023).
[483] Augustus Odena, Christopher Olah, and Jonathon Shlens. 2017. Conditional image synthesis with auxiliary classifier gans. In ICML.
[484] Michael Oechsle, Lars Mescheder, Michael Niemeyer, et al. 2019. Texture fields: Learning texture representations in function space. In ICCV.
[485] Ferda Ofli, Engin Erzin, Yücel Yemez, et al. 2011. Learn2dance: Learning statistical music-to-dance mappings for choreography synthesis. TMM (2011).
[486] Aaron Oord, Yazhe Li, Igor Babuschkin, et al. 2018. Parallel wavenet: Fast high-fidelity speech synthesis. In ICML.
[487] Aaron van den Oord, Sander Dieleman, Heiga Zen, et al. 2016. Wavenet: A generative model for raw audio. arXiv (2016).
[488] OpenAI. 2023. GPT-4 Technical Report. arXiv (2023).
[489] Roy Or-El, Xuan Luo, Mengyi Shan, et al. 2022. Stylesdf: High-resolution 3d-consistent image and geometry generation. In CVPR.
[490] Ahmed AA Osman, Timo Bolkart, and Michael J Black. 2020. Star: Sparse trained articulated human body regressor. In ECCV.
[491] Long Ouyang, Jeffrey Wu, Xu Jiang, et al. 2022. Training language models to follow instructions with human feedback. NeurIPS (2022).
[492] Xingang Pan, Ayush Tewari, et al. 2023. Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold. In SIGGRAPH.
[493] Yingwei Pan, Zhaofan Qiu, Ting Yao, et al. 2017. To create what you tell: Generating videos from captions. In ACM MM.
[494] Youxin Pang, Yong Zhang, Weize Quan, et al. 2023. Dpe: Disentanglement of pose and expression for general video portrait editing. In CVPR.
[495] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, et al. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. In ACL.
[496] Jeong Joon Park, Peter Florence, Julian Straub, et al. 2019. Deepsdf: Learning continuous signed distance functions for shape representation. In CVPR.
[497] Kyungjin Park et al. 2019. Generating educational game levels with multistep deep convolutional generative adversarial networks. In CoG.
[498] Keunhong Park, Utkarsh Sinha, Jonathan T Barron, et al. 2021. Nerfies: Deformable neural radiance fields. In ICCV.
[499] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. 2022. On aliased resizing and surprising subtleties in gan evaluation. In CVPR.
[500] Or Patashnik, Zongze Wu, Eli Shechtman, et al. 2021. Styleclip: Text-driven manipulation of stylegan imagery. In ICCV.
[501] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, et al. 2016. Context encoders: Feature learning by inpainting. In CVPR.
[502] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, et al. 2019. Expressive body capture: 3d hands, face, and body from a single image. In CVPR.
[503] Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, et al. 2018. Learning to estimate 3D human pose and shape from a single color image. In CVPR.
[504] Dario Pavllo, Christoph Feichtenhofer, et al. 2019. 3d human pose estimation in video with temporal convolutions and semi-supervised training. In CVPR.
[505] Pascal Paysan, Reinhard Knothe, Brian Amberg, et al. 2009. A 3D face model for pose and illumination invariant face recognition. In AVSS.
[506] Sida Peng, Junting Dong, Qianqian Wang, et al. 2021. Animatable neural radiance fields for modeling dynamic human bodies. In ICCV.
[507] Songyou Peng, Chiyu Jiang, Yiyi Liao, et al. 2021. Shape as points: A differentiable poisson solver. NeurIPS (2021).
[508] Songyou Peng, Michael Niemeyer, Lars Mescheder, et al. 2020. Convolutional occupancy networks. In ECCV.
[509] Sida Peng, Yuanqing Zhang, Yinghao Xu, et al. 2021. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In CVPR.
[510] Yicong Peng, Yichao Yan, et al. 2022. CageNeRF: Cage-based neural radiance field for generalized 3D deformation and animation. NeurIPS (2022).
[511] Mathis Petrovich, Michael J Black, and Gül Varol. 2021. Action-conditioned 3D human motion synthesis with transformer VAE. In ICCV.
[512] Mathis Petrovich, Michael J Black, and Gül Varol. 2022. TEMOS: Generating diverse human motions from textual descriptions. In ECCV.
[513] Wei Ping, Kainan Peng, and Jitong Chen. 2019. ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech. In ICLR.
[514] Wei Ping, Kainan Peng, Andrew Gibiansky, et al. 2018. Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning. In ICLR.
[515] Wei Ping, Kainan Peng, Kexin Zhao, et al. 2020. Waveflow: A compact flow-based model for raw audio. In ICML.
[516] Matthias Plappert, Christian Mandery, and Tamim Asfour. 2016. The KIT motion-language dataset. Big data (2016).
[517] Cale Plut and Philippe Pasquier. 2020. Generative music in video games: State of the art, challenges, and prospects. Entertain. Comput. (2020).
[518] Ryan Po and Gordon Wetzstein. 2023. Compositional 3d scene generation using locally conditioned diffusion. arXiv (2023).
[519] Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, et al. 2020. Connecting vision and language with localized narratives. In ECCV.
[520] Ben Poole, Ajay Jain, Jonathan T. Barron, et al. 2023. DreamFusion: Text-to-3D using 2D Diffusion. In ICLR.
[521] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, et al. 2021. Grad-tts: A diffusion probabilistic model for text-to-speech. In ICML.
[522] KR Prajwal, Rudrabha Mukhopadhyay, et al. 2020. A lip sync expert is all you need for speech to lip generation in the wild. In ACM MM.
[523] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. 2019. Waveglow: A flow-based generative network for speech synthesis. In ICASSP.
[524] Albert Pumarola, Enric Corona, Gerard Pons-Moll, et al. 2021. D-nerf: Neural radiance fields for dynamic scenes. In CVPR.
[525] Junaid Qadir. 2023. Engineering education in the era of ChatGPT: Promise and pitfalls of generative AI for education. In EDUCON.
[526] Charles Ruizhongtai Qi, Li Yi, Hao Su, et al. 2017. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. NeurIPS (2017).
[527] Shenhan Qian, Jiale Xu, Ziwei Liu, et al. 2022. UNIF: United neural implicit functions for clothed human reconstruction and animation. In ECCV.
[528] Tingting Qiao, Jing Zhang, Duanqing Xu, et al. 2019. Mirrorgan: Learning text-to-image generation by redescription. In CVPR.
[529] Sigal Raab, Inbal Leibovitch, Peizhuo Li, et al. 2023. MoDi: Unconditional Motion Synthesis from Diverse Data. In CVPR.
[530] Alec Radford, Jong Wook Kim, Chris Hallacy, et al. 2021. Learning transferable visual models from natural language supervision. In ICML.
[531] Alec Radford, Luke Metz, et al. 2016. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR.
[532] Alec Radford, Karthik Narasimhan, Tim Salimans, et al. 2018. Improving language understanding by generative pre-training. (2018).
[533] Alec Radford, Jeffrey Wu, Rewon Child, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog (2019).
[534] Jack W Rae, Sebastian Borgeaud, Trevor Cai, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv (2021).
[535] Colin Raffel, Noam Shazeer, Adam Roberts, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR (2020).
[536] Tanzila Rahman, Hsin-Ying Lee, Jian Ren, et al. 2023. Make-a-story: Visual memory conditioned consistent story generation. In CVPR.
[537] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, et al. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP.
[538] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, et al. 2022. Hierarchical text-conditional image generation with clip latents. arXiv (2022).
[539] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, et al. 2021. Zero-shot text-to-image generation. In ICML.
[540] Eduard Ramon, Gil Triginer, Janna Escur, et al. 2021. H3d-net: Few-shot high-fidelity 3d head reconstruction. In ICCV.
[541] Ambrish Rawat, Killian Levacher, and Mathieu Sinn. 2022. The devil is in the GAN: backdoor attacks and defenses in deep generative models. In ESORICS.
[542] Partha Pratim Ray. 2023. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems (2023).
[543] Ali Razavi, Aaron van den Oord, Ben Poole, et al. 2019. Preventing Posterior Collapse with delta-VAEs. In ICLR.
[544] Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. 2019. Generating diverse high-fidelity images with vq-vae-2. NeurIPS (2019).
[545] Daniel Rebain, Mark Matthews, Kwang Moo Yi, et al. 2022. Lolnerf: Learn from one look. In CVPR.
[546] Pradyumna Reddy, Michael Gharbi, Michal Lukac, et al. 2021. Im2vec: Synthesizing vector graphics without vector supervision. In CVPR.
[547] Siva Reddy, Danqi Chen, and Christopher D Manning. 2019. Coqa: A conversational question answering challenge. TACL (2019).
[548] Scott Reed, Zeynep Akata, Honglak Lee, et al. 2016. Learning deep representations of fine-grained visual descriptions. In CVPR.
[549] Scott Reed, Zeynep Akata, Xinchen Yan, et al. 2016. Generative adversarial text to image synthesis. In ICML.
[550] Scott E Reed, Zeynep Akata, Santosh Mohan, et al. 2016. Learning what and where to draw. NeurIPS (2016).
[551] Jeremy Reizenstein et al. 2021. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV.
[552] Yi Ren, Yangjun Ruan, Xu Tan, et al. 2019. Fastspeech: Fast, robust and controllable text to speech. NeurIPS (2019).
[553] Danilo Rezende and Shakir Mohamed. 2015. Variational inference with normalizing flows. In ICML.
[554] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 2014. Stochastic backpropagation and approximate inference in deep generative models. In ICML.
[555] Elad Richardson, Gal Metzer, Yuval Alaluf, et al. 2023. TEXTure: Text-guided texturing of 3d shapes. arXiv (2023).
[556] Elad Richardson, Matan Sela, and Ron Kimmel. 2016. 3D face reconstruction by learning from synthetic data. In 3DV.
[557] Elad Richardson, Matan Sela, Roy Or-El, et al. 2017. Learning detailed face reconstruction from a single image. In CVPR.
[558] Gernot Riegler, Ali Osman Ulusoy, and Andreas Geiger. 2017. Octnet: Learning deep 3d representations at high resolutions. In CVPR.
[559] Gernot Riegler, Ali Osman Ulusoy, Horst Bischof, et al. 2017. Octnetfusion: Learning depth fusion from data. In 3DV.
[560] Daniel Roich, Ron Mokady, Amit H Bermano, et al. 2022. Pivotal tuning for latent-based editing of real images. ToG (2022).
[561] Robin Rombach, Andreas Blattmann, Dominik Lorenz, et al. 2022. High-resolution image synthesis with latent diffusion models. In CVPR.
[562] Sami Romdhani and Thomas Vetter. 2005. Estimating 3D shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In CVPR.
[563] Alessandro Rossi, Marco Barbiero, et al. 2021. Robust Visibility Surface Determination in Object Space via Plücker Coordinates. J. Imaging (2021).
[564] Andreas Rössler, Davide Cozzolino, et al. 2018. Faceforensics: A large-scale video dataset for forgery detection in human faces. arXiv (2018).
[565] Joseph Roth, Yiying Tong, and Xiaoming Liu. 2016. Adaptive 3D face reconstruction from unconstrained photo collections. In CVPR.
[566] Nataniel Ruiz, Yuanzhen Li, et al. 2023. HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models. arXiv (2023).
[567] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, et al. 2023. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR.

[568] Olga Russakovsky, Jia Deng, Hao Su, et al. 2015. Imagenet large scale visual recognition challenge. IJCV (2015).
[569] Chitwan Saharia, William Chan, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. NeurIPS (2022).
[570] Chitwan Saharia, William Chan, Huiwen Chang, et al. 2022. Palette: Image-to-image diffusion models. In SIGGRAPH.
[571] Masaki Saito, Eiichi Matsumoto, and Shunta Saito. 2017. Temporal generative adversarial nets with singular value clipping. In ICCV.
[572] Shunsuke Saito, Zeng Huang, Ryota Natsume, et al. 2019. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV.
[573] Shunsuke Saito, Tomas Simon, et al. 2020. Pifuhd: Multi-level pixel-aligned implicit function for high-resolution 3d human digitization. In CVPR.
[574] Shunsuke Saito, Jinlong Yang, Qianli Ma, et al. 2021. SCANimate: Weakly supervised learning of skinned clothed avatar networks. In CVPR.
[575] Keisuke Sakaguchi, Ronan Le Bras, et al. 2021. Winogrande: An adversarial winograd schema challenge at scale. Commun. ACM (2021).
[576] Ahmed Salem, Yannick Sautter, et al. 2020. Baaan: Backdoor attacks against autoencoder and gan-based machine learning models. arXiv (2020).
[577] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, et al. 2016. Improved techniques for training gans. NeurIPS (2016).
[578] Tim Salimans and Jonathan Ho. 2022. Progressive Distillation for Fast Sampling of Diffusion Models. In ICLR.
[579] Aditya Sanghi, Hang Chu, Joseph G Lambourne, et al. 2022. Clip-forge: Towards zero-shot text-to-shape generation. In CVPR.
[580] Soubhik Sanyal, Timo Bolkart, et al. 2019. Learning to regress 3D face shape and expression from an image without 3D supervision. In CVPR.
[581] Sami Sarsa et al. 2022. Automatic generation of programming exercises and code explanations using large language models. In ICER.
[582] Flavio Schneider, Zhijing Jin, and Bernhard Schölkopf. 2023. Moûsai: Text-to-Music Generation with Long-Context Latent Diffusion. arXiv (2023).
[583] Christoph Schuhmann, Richard Vencu, Romain Beaumont, et al. 2021. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv (2021).
[584] Christian Schuldt, Ivan Laptev, and Barbara Caputo. 2004. Recognizing human actions: a local SVM approach. In ICPR.
[585] Katja Schwarz, Yiyi Liao, Michael Niemeyer, et al. 2020. Graf: Generative radiance fields for 3d-aware image synthesis. NeurIPS (2020).
[586] Katja Schwarz, Axel Sauer, Michael Niemeyer, et al. 2022. Voxgraf: Fast 3d-aware image synthesis with sparse voxel grids. NeurIPS (2022).
[587] Katja Seeliger, Umut Güçlü, et al. 2018. Generative adversarial networks for reconstructing natural images from brain activity. NeuroImage (2018).
[588] Stanislau Semeniuta, Aliaksei Severyn, and Erhardt Barth. 2017. A Hybrid Convolutional Variational Autoencoder for Text Generation. In EMNLP.
[589] Amir Shahroudy, Jun Liu, Tian-Tsong Ng, et al. 2016. Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In CVPR.
[590] Jiaxiang Shang et al. 2020. Self-supervised monocular 3d face reconstruction by occlusion-aware multi-view geometry consistency. In ECCV.
[591] Piyush Sharma, Nan Ding, et al. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL.
[592] Guohua Shen, Tomoyasu Horikawa, Kei Majima, et al. 2019. Deep image reconstruction from human brain activity. PLoS Comput. Biol. (2019).
[593] Jonathan Shen, Ruoming Pang, Ron J Weiss, et al. 2018. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In ICASSP.
[594] Shelly Sheynin, Oron Ashual, Adam Polyak, et al. 2023. kNN-Diffusion: Image Generation via Large-Scale Retrieval. In ICLR.
[595] Chence Shi, Shitong Luo, Minkai Xu, et al. 2021. Learning gradient fields for molecular conformation generation. In ICML.
[596] Jing Shi, Wei Xiong, Zhe Lin, et al. 2023. Instantbooth: Personalized text-to-image generation without test-time finetuning. arXiv (2023).
[597] Yujun Shi, Chuhui Xue, Jiachun Pan, et al. 2023. DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing. arXiv (2023).
[598] Jaehyeok Shim, Changwoo Kang, and Kyungdon Joo. 2023. Diffusion-Based Signed Distance Fields for 3D Shape Generation. In CVPR.
[599] Eli Shlizerman, Lucio Dery, Hayden Schoen, et al. 2018. Audio to body dynamics. In CVPR.
[600] Dong Wook Shu, Sung Woo Park, et al. 2019. 3d point cloud generative adversarial network based on tree structured graph convolutions. In ICCV.
[601] J Ryan Shue, Eric Ryan Chan, Ryan Po, et al. 2023. 3d neural field generation using triplane diffusion. In CVPR.
[602] Aliaksandr Siarohin, Stéphane Lathuilière, Sergey Tulyakov, et al. 2019. First order motion model for image animation. NeurIPS (2019).
[603] Yawar Siddiqui, Justus Thies, Fangchang Ma, et al. 2022. Texturify: Generating textures on 3d shape surfaces. In ECCV.
[604] Gregor Simm and Jose Miguel Hernandez-Lobato. 2020. A Generative Model for Molecular Distance Geometry. In ICML.
[605] Uriel Singer, Adam Polyak, Thomas Hayes, et al. 2023. Make-A-Video: Text-to-Video Generation without Text-Video Data. In ICLR.
[606] Uriel Singer, Shelly Sheynin, Adam Polyak, et al. 2023. Text-To-4D Dynamic Scene Generation. In ICML.
[607] Jaskirat Singh, Liang Zheng, Cameron Smith, et al. 2022. Paint2pix: interactive painting based progressive image synthesis and editing. In ECCV.
[608] Ayan Sinha, Asim Unmesh, Qixing Huang, et al. 2017. Surfnet: Generating 3d shape surfaces using deep residual networks. In CVPR.
[609] Vincent Sitzmann et al. 2019. Scene representation networks: Continuous 3d-structure-aware neural scene representations. NeurIPS (2019).
[610] Vincent Sitzmann, Eric Chan, Richard Tucker, et al. 2020. Metasdf: Meta-learning signed distance functions. NeurIPS (2020).
[611] Vincent Sitzmann, Julien Martel, Alexander Bergman, et al. 2020. Implicit neural representations with periodic activation functions. NeurIPS (2020).
[612] Vincent Sitzmann, Justus Thies, Felix Heide, et al. 2019. Deepvoxels: Learning persistent 3d feature embeddings. In CVPR.
[613] Ivan Skorokhodov, Savva Ignatyev, and Mohamed Elhoseiny. 2021. Adversarial generation of continuous images. In CVPR.
[614] Ivan Skorokhodov, Sergey Tulyakov, et al. 2022. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In CVPR.
[615] Ivan Skorokhodov, Sergey Tulyakov, Yiqun Wang, et al. 2022. Epigraf: Rethinking training of 3d gans. NeurIPS (2022).
[616] Ron Slossberg et al. 2018. High quality facial surface and texture synthesis via generative adversarial networks. In ECCV Workshops.
[617] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, et al. 2015. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML.
[618] Gowthami Somepalli, Vasu Singla, Micah Goldblum, et al. 2023. Diffusion art or digital forgery? investigating data replication in diffusion models. In CVPR.
[619] Joon Son Chung, Andrew Senior, Oriol Vinyals, et al. 2017. Lip reading sentences in the wild. In CVPR.
[620] Jiaming Song, Chenlin Meng, and Stefano Ermon. 2021. Denoising Diffusion Implicit Models. In ICLR.
[621] Linsen Song, Wayne Wu, Chen Qian, et al. 2022. Everybody’s talkin’: Let me talk as you want. TIFS (2022).
[622] Yang Song, Prafulla Dhariwal, Mark Chen, et al. 2023. Consistency models. In ICML.
[623] Yang Song, Conor Durkan, Iain Murray, et al. 2021. Maximum likelihood training of score-based diffusion models. NeurIPS (2021).
[624] Yang Song and Stefano Ermon. 2019. Generative modeling by estimating gradients of the data distribution. NeurIPS (2019).
[625] Yang Song and Stefano Ermon. 2020. Improved techniques for training score-based generative models. NeurIPS (2020).
[626] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, et al. 2021. Score-Based Generative Modeling through Stochastic Differential Equations. In ICLR.
[627] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv (2012).
[628] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. 2015. Unsupervised learning of video representations using lstms. In ICML.
[629] Frank Steinbrücker, Jürgen Sturm, and Daniel Cremers. 2014. Volumetric 3D mapping in real-time on a CPU. In ICRA.
[630] Robert W Sumner and Jovan Popović. 2004. Deformation transfer for triangle meshes. ToG (2004).
[631] Chen Sun, Abhinav Shrivastava, Saurabh Singh, et al. 2017. Revisiting unreasonable effectiveness of data in deep learning era. In ICCV.
[632] Cheng Sun, Min Sun, and Hwann-Tzong Chen. 2022. Direct voxel grid optimization: Super-fast convergence for radiance fields reconstruction. In CVPR.
[633] Guofei Sun, Yongkang Wong, Zhiyong Cheng, et al. 2020. DeepDance: music-to-dance motion choreography with adversarial learning. TMM (2020).
[634] Jingxiang Sun, Xuan Wang, Yichun Shi, et al. 2022. Ide-3d: Interactive disentangled editing for high-resolution 3d-aware portrait synthesis. ToG (2022).
[635] Jingxiang Sun, Xuan Wang, Lizhen Wang, et al. 2023. Next3d: Generative neural texture rasterization for 3d-aware head avatars. In CVPR.
[636] Jingxiang Sun, Xuan Wang, Yong Zhang, et al. 2022. Fenerf: Face editing in neural radiance fields. In CVPR.
[637] Xingyuan Sun, Jiajun Wu, Xiuming Zhang, et al. 2018. Pix3d: Dataset and methods for single-image 3d shape modeling. In CVPR.
[638] Roman Suvorov, Elizaveta Logacheva, Anton Mashikhin, et al. 2022. Resolution-robust large mask inpainting with fourier convolutions. In WACV.
[639] Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing obama: learning lip sync from audio. ToG (2017).

[640] Esteban G Tabak and Cristina V Turner. 2013. A family of nonparametric density estimation algorithms. CPAM (2013).
[641] Esteban G Tabak and Eric Vanden-Eijnden. 2010. Density estimation by dual ascent of the log-likelihood. CMS (2010).
[642] Yu Takagi and Shinji Nishimoto. 2023. High-resolution image reconstruction with latent diffusion models from human brain activity. In CVPR.
[643] Taoran Tang, Jia Jia, and Hanyang Mao. 2018. Dance with melody: An lstm-autoencoder approach to music-oriented dance synthesis. In ACM MM.
[644] Yingzhi Tang, Yue Qian, Qijian Zhang, et al. 2022. WarpingGAN: Warping multiple uniform priors for adversarial 3D point cloud generation. In CVPR.
[645] Ming Tao, Hao Tang, Fei Wu, et al. 2022. Df-gan: A simple and effective baseline for text-to-image synthesis. In CVPR.
[646] Maxim Tatarchenko et al. 2017. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In ICCV.
[647] Guy Tevet, Brian Gordon, Amir Hertz, et al. 2022. Motionclip: Exposing human motion generation to clip space. In ECCV.
[648] Guy Tevet, Sigal Raab, Brian Gordon, et al. 2023. Human Motion Diffusion Model. In ICLR.
[649] Ayush Tewari, Florian Bernard, Pablo Garrido, et al. 2019. Fml: Face model learning from videos. In CVPR.
[650] Ayush Tewari, Xingang Pan, Ohad Fried, et al. 2022. Disentangled3d: Learning a 3d generative model with disentangled geometry and appearance from monocular images. In CVPR.
[651] Ayush Tewari, Michael Zollhöfer, et al. 2018. Self-supervised multi-level face model learning for monocular reconstruction at over 250 hz. In CVPR.
[652] Ayush Tewari, Michael Zollhofer, Hyeongwoo Kim, et al. 2017. Mofa: Model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In ICCV Workshops.
[653] Justus Thies, Mohamed Elgharib, Ayush Tewari, et al. 2020. Neural voice puppetry: Audio-driven facial reenactment. In ECCV.
[654] Justus Thies, Michael Zollhofer, Marc Stamminger, et al. 2016. Face2face: Real-time face capture and reenactment of rgb videos. In CVPR.
[655] Yu Tian, Jian Ren, Menglei Chai, et al. 2021. A Good Image Generator Is What You Need for High-Resolution Video Synthesis. In ICLR.
[656] Garvita Tiwari, Nikolaos Sarafianos, Tony Tung, et al. 2021. Neural-gif: Neural generalized implicit functions for animating people in clothing. In ICCV.
[657] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, et al. 2018. Wasserstein Auto-Encoders. In ICLR.
[658] Jakub Tomczak and Max Welling. 2018. VAE with a VampPrior. In AISTATS.
[659] Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. 2023. Llama: Open and efficient foundation language models. arXiv (2023).
[660] Hugo Touvron, Louis Martin, Kevin Stone, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv (2023).
[661] Omer Tov, Yuval Alaluf, Yotam Nitzan, et al. 2021. Designing an encoder for stylegan image manipulation. TOG (2021).
[662] Luan Tran and Xiaoming Liu. 2018. Nonlinear 3d face morphable model. In CVPR.
[663] Luan Tran and Xiaoming Liu. 2019. On learning 3d face morphable model from in-the-wild images. TPAMI (2019).
[664] Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, et al. 2021. Non-rigid neural radiance fields: Reconstruction and novel view synthesis of a dynamic scene from monocular video. In ICCV.
[665] Jonathan Tseng, Rodrigo Castellon, and Karen Liu. 2023. Edge: Editable dance generation from music. In CVPR.
[666] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, et al. 2018. Mocogan: Decomposing motion and content for video generation. In CVPR.
[667] Narek Tumanyan, Michal Geyer, Shai Bagon, et al. 2023. Plug-and-play diffusion features for text-driven image-to-image translation. In CVPR.
[668] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, et al. 2019. FVD: A new Metric for Video Generation. In ICLR Workshops.
[669] Cristian Vaccari and Andrew Chadwick. 2020. Deepfakes and disinformation: Exploring the impact of synthetic political video on deception, uncertainty, and trust in news. SM+S (2020).
[670] Arash Vahdat and Jan Kautz. 2020. NVAE: A deep hierarchical variational autoencoder. NeurIPS (2020).
[671] Arash Vahdat, Karsten Kreis, and Jan Kautz. 2021. Score-based generative modeling in latent space. NeurIPS (2021).
[672] Arash Vahdat, Francis Williams, Zan Gojcic, et al. 2022. LION: Latent Point Diffusion Models for 3D Shape Generation. NeurIPS (2022).
[673] Demetrios Vakratsas and Xin Wang. 2021. Artificial intelligence in advertising creativity. J. Advert. (2021).
[674] Diego Valsesia, Giulia Fracastoro, and Enrico Magli. 2019. Learning localized generative models for 3d point clouds via graph convolution. In ICLR.
[675] Rianne Van Den Berg, Leonard Hasenclever, Jakub M Tomczak, et al. 2018. Sylvester normalizing flows for variational inference. In UAI.
[676] Aäron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. 2016. Pixel recurrent neural networks. In ICML.
[677] Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. NeurIPS (2017).
[678] Gul Varol, Duygu Ceylan, Bryan Russell, et al. 2018. Bodynet: Volumetric inference of 3d human body shapes. In ECCV.
[679] Sakshi Varshney, Vinay Kumar Verma, et al. 2021. Cam-GAN: Continual adaptation modules for generative adversarial networks. NeurIPS (2021).
[680] Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. 2017. Attention is all you need. NeurIPS (2017).
[681] Clement Vignac, Igor Krawczuk, Antoine Siraudin, et al. 2022. DiGress: Discrete Denoising diffusion for graph generation. In ICLR.
[682] Daniel Vlasic, Matthew Brand, Hanspeter Pfister, et al. 2006. Face transfer with multilinear models. In SIGGRAPH Courses.
[683] Timo Von Marcard, Roberto Henschel, et al. 2018. Recovering accurate 3d human pose in the wild using imus and a moving camera. In ECCV.
[684] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. 2016. Generating videos with scene dynamics. NeurIPS (2016).
[685] Andrey Voynov, Kfir Aberman, and Daniel Cohen-Or. 2022. Sketch-guided text-to-image diffusion models. arXiv (2022).
[686] Catherine Wah, Steve Branson, Peter Welinder, et al. 2011. The caltech-ucsd birds-200-2011 dataset. (2011).
[687] Jacob Walker, Carl Doersch, Abhinav Gupta, et al. 2016. An uncertain future: Forecasting from static images using variational autoencoders. In ECCV.
[688] Jacob Walker, Ali Razavi, and Aäron van den Oord. 2021. Predicting video with vqvae. arXiv (2021).
[689] Alex Wang, Yada Pruksachatkun, et al. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. NeurIPS (2019).
[690] Alex Wang, Amanpreet Singh, et al. 2019. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In ICLR.
[691] Haochen Wang, Xiaodan Du, Jiahao Li, et al. 2023. Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation. In CVPR.
[692] Hao Wang, Nadav Schor, Ruizhen Hu, et al. 2018. Global-to-local generative model for 3d shapes. ToG (2018).
[693] Nanyang Wang, Yinda Zhang, Zhuwen Li, et al. 2018. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV.
[694] Qiang Wang, Haoge Deng, Yonggang Qi, et al. 2023. SketchKnitter: Vectorized Sketch Generation with Diffusion Models. In ICLR.
[695] Shaofei Wang, Marko Mihajlovic, Qianli Ma, et al. 2021. Metaavatar: Learning animatable clothed human models from few depth images. NeurIPS (2021).
[696] Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, et al. 2019. Few-shot Video-to-Video Synthesis. NeurIPS (2019).
[697] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, et al. 2018. Video-to-Video Synthesis. NeurIPS (2018).
[698] Xiaohuan Corina Wang and Cary Phillips. 2002. Multi-weight enveloping: least-squares approximation techniques for skin animation. In SCA.
[699] Yaohui Wang, Piotr Bilinski, Francois Bremond, et al. 2020. Imaginator: Conditional spatio-temporal gan for video generation. In WACV.
[700] Yuntao Wang, Yanghe Pan, Miao Yan, et al. 2023. A Survey on ChatGPT: AI-Generated Contents, Challenges, and Solutions. arXiv (2023).
[701] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, et al. 2017. Tacotron: Towards End-to-End Speech Synthesis. Proc. Interspeech (2017), 4006–4010.
[702] Yaohui Wang, Di Yang, Francois Bremond, et al. 2022. Latent Image Animator: Learning to Animate Images via Latent Space Navigation. In ICLR.
[703] Ziyan Wang, Timur Bagautdinov, Stephen Lombardi, et al. 2021. Learning compositional radiance fields of dynamic human heads. In CVPR.
[704] Zhenyi Wang, Ping Yu, Yang Zhao, et al. 2020. Learning diverse stochastic human-action generators by learning smooth latent transitions. In AAAI.
[705] Pete Warden. 2018. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv (2018).
[706] Daniel Watson, William Chan, Jonathan Ho, et al. 2022. Learning fast samplers for diffusion models by differentiating through sample quality. In ICLR.
[707] Chung-Yi Weng, Brian Curless, Pratul P Srinivasan, et al. 2022. Humannerf: Free-viewpoint rendering of moving people from monocular video. In CVPR.

[708] Olivia Wiles, A Koepke, and Andrew Zisserman. 2018. X2face: A network for controlling face generation using images, audio, and pose codes. In ECCV.
[709] Will Grathwohl, Ricky T. Q. Chen, et al. 2019. FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models. In ICLR.
[710] Francis Williams, Teseo Schneider, Claudio Silva, et al. 2019. Deep geometric prior for surface reconstruction. In CVPR.
[711] Chenfei Wu, Lun Huang, Qianxi Zhang, et al. 2021. Godiva: Generating open-domain videos from natural descriptions. arXiv (2021).
[712] Chenfei Wu, Jian Liang, Lei Ji, et al. 2022. Nüwa: Visual synthesis pre-training for neural visual world creation. In ECCV.
[713] Dongxian Wu and Yisen Wang. 2021. Adversarial neuron pruning purifies backdoored deep models. NeurIPS (2021).
[714] Jiayang Wu, Wensheng Gan, Zefeng Chen, et al. 2023. Ai-generated content (aigc): A survey. arXiv (2023).
[715] Jiajun Wu, Chengkai Zhang, et al. 2016. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. NeurIPS (2016).
[716] Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, et al. 2022. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. arXiv (2022).
[717] Keyu Wu, Yifan Ye, et al. 2022. Neuralhdhair: Automatic high-fidelity hair modeling from a single image using implicit neural representations. In CVPR.
[718] Lemeng Wu, Chengyue Gong, Xingchao Liu, et al. 2022. Diffusion-based molecule generation with informative prior bridges. NeurIPS (2022).
[719] Menghua Wu, Hao Zhu, Linjia Huang, et al. 2023. High-Fidelity 3D Face Generation From Natural Language Descriptions. In CVPR.
[720] Rundi Wu, Chang Xiao, and Changxi Zheng. 2021. Deepcad: A deep generative network for computer-aided design models. In ICCV.
[721] Sijing Wu, Yichao Yan, Yunhao Li, et al. 2023. GANHead: Towards Generative Animatable Neural Head Avatars. In CVPR.
[722] Tianhao Walter Wu et al. 2022. D^2NeRF: Self-Supervised Decoupling of Dynamic and Static Objects from a Monocular Video. In NeurIPS.
[723] Zhirong Wu, Shuran Song, Aditya Khosla, et al. 2015. 3d shapenets: A deep representation for volumetric shapes. In CVPR.
[724] Zhijie Wu, Xiang Wang, Di Lin, et al. 2019. Sagnet: Structure-aware generative network for 3d-shape modeling. ToG (2019).
[725] Wenqi Xian, Jia-Bin Huang, Johannes Kopf, et al. 2021. Space-time neural irradiance fields for free-viewpoint video. In CVPR.
[726] Yu Xiang, Roozbeh Mottaghi, and Silvio Savarese. 2014. Beyond pascal: A benchmark for 3d object detection in the wild. In WACV.
[727] Chaojun Xiao, Xueyu Hu, Zhiyuan Liu, et al. 2021. Lawformer: A pre-trained language model for chinese legal long documents. AI Open (2021).
[728] Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. 2022. Tackling the Generative Learning Trilemma with Denoising Diffusion GANs. In ICLR.
[729] Haozhe Xie, Hongxun Yao, Xiaoshuai Sun, et al. 2019. Pix2vox: Context-aware 3d reconstruction from single and multi-view images. In ICCV.
[730] Wei Xiong, Wenhan Luo, Lin Ma, et al. 2018. Learning to generate time-lapse videos using multi-stage dynamic generative adversarial networks. In CVPR.
[731] Yuliang Xiu, Jinlong Yang, Dimitrios Tzionas, et al. 2022. Icon: Implicit clothed humans obtained from normals. In CVPR.
[732] Hongyi Xu, Eduard Gabriel Bazavan, Andrei Zanfir, et al. 2020. Ghum & ghuml: Generative 3d human shape and articulated pose models. In CVPR.
[733] Jiacheng Xu and Greg Durrett. 2018. Spherical Latent Spaces for Stable Variational Autoencoders. In EMNLP.
[734] Jun Xu, Tao Mei, Ting Yao, et al. 2016. Msr-vtt: A large video description dataset for bridging video and language. In CVPR.
[735] Jiale Xu, Xintao Wang, et al. 2023. Dream3d: Zero-shot text-to-3d synthesis using 3d shape prior and text-to-image diffusion models. In CVPR.
[736] Minrui Xu, Hongyang Du, et al. 2023. Unleashing the power of edge-cloud generative ai in mobile networks: A survey of aigc services. arXiv (2023).
[737] Minkai Xu, Shitong Luo, Yoshua Bengio, et al. 2021. Learning Neural Generative Dynamics for Molecular Conformation Generation. In ICLR.
[738] Minkai Xu, Lantao Yu, Yang Song, et al. 2022. GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation. In ICLR.
[739] Qiangeng Xu, Weiyue Wang, et al. 2019. Disn: Deep implicit surface network for high-quality single-view 3d reconstruction. NeurIPS (2019).
[740] Qiangeng Xu, Zexiang Xu, Julien Philip, et al. 2022. Point-nerf: Point-based neural radiance fields. In CVPR.
[741] Tianhan Xu and Tatsuya Harada. 2022. Deforming radiance fields with cages. In ECCV.
[742] Tao Xu, Pengchuan Zhang, et al. 2018. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In CVPR.
[743] Xudong Xu, Xingang Pan, Dahua Lin, et al. 2021. Generative occupancy fields for 3d surface-aware image synthesis. NeurIPS (2021).
[744] Yinghao Xu et al. 2023. DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-aware Scene Synthesis. In CVPR.
[745] Yinghao Xu, Sida Peng, Ceyuan Yang, et al. 2022. 3d-aware image synthesis via learning structural and textural representations. In CVPR.
[746] Yang Xue, Yuheng Li, Krishna Kumar Singh, et al. 2022. Giraffe hd: A high-resolution 3d-aware generative model. In CVPR.
[747] Nelson Yalta, Shinji Watanabe, et al. 2019. Weakly-supervised deep recurrent neural networks for basic dance step generation. In IJCNN.
[748] Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim. 2020. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP.
[749] Sijie Yan, Zhizhong Li, Yuanjun Xiong, et al. 2019. Convolutional sequence generation for skeleton-based action synthesis. In ICCV.
[750] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, et al. 2021. Videogpt: Video generation using vq-vae and transformers. arXiv (2021).
[751] Xinchen Yan, Jimei Yang, et al. 2016. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. NeurIPS (2016).
[752] Zhaoyi Yan, Xiaoming Li, Mu Li, et al. 2018. Shift-net: Image inpainting via deep feature rearrangement. In ECCV.
[753] Bangbang Yang, Chong Bao, et al. 2022. Neumesh: Learning disentangled neural mesh-based implicit field for geometry and texture editing. In ECCV.
[754] Bangbang Yang, Yinda Zhang, Yinghao Xu, et al. 2021. Learning object-compositional neural radiance field for editable scene rendering. In ICCV.
[755] Ceyuan Yang, Zhe Wang, Xinge Zhu, et al. 2018. Pose guided human video generation. In ECCV.
[756] Dongchao Yang, Jianwei Yu, Helin Wang, et al. 2023. Diffsound: Discrete diffusion model for text-to-sound generation. TASLP (2023).
[757] Guandao Yang, Xun Huang, Zekun Hao, et al. 2019. Pointflow: 3d point cloud generation with continuous normalizing flows. In ICCV.
[758] Geng Yang, Shan Yang, Kai Liu, et al. 2021. Multi-band melgan: Faster waveform generation for high-quality text-to-speech. In SLT.
[759] Haotian Yang, Hao Zhu, Yanru Wang, et al. 2020. Facescape: a large-scale high quality 3d face dataset and detailed riggable 3d face prediction. In CVPR.
[760] Jinlong Yang, Jean-Sébastien Franco, Franck Hétroy-Wheeler, et al. 2018. Analyzing clothing layer deformation statistics of 3d human motions. In ECCV.
[761] Jingfeng Yang, Hongye Jin, Ruixiang Tang, et al. 2023. Harnessing the power of llms in practice: A survey on chatgpt and beyond. arXiv (2023).
[762] Li-Chia Yang et al. 2017. MidiNet: A Convolutional Generative Adversarial Network for Symbolic-Domain Music Generation. In ISMIR.
[763] Linjie Yang, Ping Luo, Chen Change Loy, et al. 2015. A large-scale car dataset for fine-grained categorization and verification. In CVPR.
[764] Zhilin Yang, Zihang Dai, Yiming Yang, et al. 2019. Xlnet: Generalized autoregressive pretraining for language understanding. NeurIPS (2019).
[765] Zichao Yang, Zhiting Hu, Ruslan Salakhutdinov, et al. 2017. Improved variational autoencoders for text modeling using dilated convolutions. In ICML.
[766] Ze Yang, Shenlong Wang, Sivabalan Manivasagam, et al. 2021. S3: Neural shape, skeleton, and skinning fields for 3d human modeling. In CVPR.
[767] Shunyu Yao, RuiZhe Zhong, et al. 2022. DFA-NeRF: Personalized talking head generation via disentangled face attributes neural rendering. arXiv (2022).
[768] Hui Ye, Xiulong Yang, Martin Takac, et al. 2021. Improving Text-to-Image Synthesis Using Contrastive Learning. BMVC (2021).
[769] Sheng Ye, Yu-Hui Wen, Yanan Sun, et al. 2022. Audio-driven stylized gesture generation with flow-based model. In ECCV.
[770] Zhenhui Ye, Ziyue Jiang, Yi Ren, et al. 2023. GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis. In ICLR.
[771] Tarun Yenamandra, Ayush Tewari, Florian Bernard, et al. 2021. i3dmm: Deep implicit 3d morphable model of human heads. In CVPR.
[772] Hongwei Yi, Hualin Liang, Yifei Liu, et al. 2023. Generating holistic 3d human motion from speech. In CVPR.
[773] Lijun Yin, Xiaozhou Wei, Yi Sun, et al. 2006. A 3D facial expression database for facial behavior research. In FGR.
[774] Shengming Yin, Chenfei Wu, Huan Yang, et al. 2023. NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation. arXiv (2023).
[775] Youngwoo Yoon, Bok Cha, Joo-Haeng Lee, et al. 2020. Speech gesture generation from the trimodal context of text, audio, and speaker identity. ToG (2020).
[776] Youngwoo Yoon, Woo-Ri Ko, et al. 2019. Robots learn social skills: End-to-end learning of co-speech gesture generation for humanoid robots. In ICRA.
[777] Jiaxuan You, Rex Ying, Xiang Ren, et al. 2018. Graphrnn: Generating realistic graphs with deep auto-regressive models. In ICML.
[778] Kim Youwang, Kim Ji-Yeon, and Tae-Hyun Oh. 2022. Clip-actor: Text-driven recommendation and stylization for animating human meshes. In ECCV.

[779] Alex Yu, Ruilong Li, Matthew Tancik, et al. 2021. Plenoctrees for real-time rendering of neural radiance fields. In ICCV.
[780] Alex Yu, Vickie Ye, Matthew Tancik, et al. 2021. pixelnerf: Neural radiance fields from one or few images. In CVPR.
[781] Fisher Yu, Ari Seff, Yinda Zhang, et al. 2015. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv (2015).
[782] Jiahui Yu, Zhe Lin, Jimei Yang, et al. 2018. Generative image inpainting with contextual attention. In CVPR.
[783] Jiahui Yu, Zhe Lin, Jimei Yang, et al. 2019. Free-form image inpainting with gated convolution. In ICCV.
[784] Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, et al. 2022. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation. TMLR (2022).
[785] Lantao Yu, Weinan Zhang, Jun Wang, et al. 2017. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI.
[786] Ping Yu, Yang Zhao, Chunyuan Li, et al. 2020. Structure-aware human-action generation. In ECCV.
[787] Sihyun Yu, Kihyuk Sohn, Subin Kim, et al. 2023. Video probabilistic diffusion models in projected latent space. In CVPR.
[788] Sihyun Yu, Jihoon Tack, Sangwoo Mo, et al. 2022. Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks. In ICLR.
[789] Zhengming Yu, Wei Cheng, Xian Liu, et al. 2023. MonoHuman: Animatable Human Neural Field from Monocular Video. In CVPR.
[790] Ye Yuan and Kris Kitani. 2020. Dlow: Diversifying latent flows for diverse human motion prediction. In ECCV.
[791] Yu-Jie Yuan, Yang-Tian Sun, Yu-Kun Lai, et al. 2022. NeRF-editing: geometry editing of neural radiance fields. In CVPR.
[792] Heiga Zen, Viet Dang, Rob Clark, et al. 2019. LibriTTS: A corpus derived from librispeech for text-to-speech. Proc. Interspeech (2019).
[793] Aeron Zentner. 2022. Applied Innovation: Artificial Intelligence in Higher Education. SSRN 4314180 (2022).
[794] Mengyao Zhai, Lei Chen, Frederick Tung, et al. 2019. Lifelong gan: Continual learning for conditional image generation. In ICCV.
[795] Bowen Zhang, Shuyang Gu, Bo Zhang, et al. 2022. Styleswin: Transformer-based gan for high-resolution image generation. In CVPR.
[796] Chaoning Zhang et al. 2023. One small step for generative ai, one giant leap for agi: A complete survey on chatgpt in aigc era. arXiv (2023).
[797] Chi Zhang, Yiwen Chen, et al. 2023. StyleAvatar3D: Leveraging Image-Text Diffusion Models for High-Fidelity 3D Avatar Generation. arXiv (2023).
[798] Chaoning Zhang, Chenshuang Zhang, et al. 2023. A complete survey on generative ai (aigc): Is chatgpt from gpt-4 to gpt-5 all you need? arXiv (2023).
[799] Han Zhang, Ian Goodfellow, Dimitris Metaxas, et al. 2019. Self-attention generative adversarial networks. In ICML.
[800] Han Zhang, Jing Yu Koh, Jason Baldridge, et al. 2021. Cross-modal contrastive learning for text-to-image generation. In CVPR.
[801] Han Zhang, Tao Xu, Hongsheng Li, et al. 2017. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV.
[802] Han Zhang, Tao Xu, Hongsheng Li, et al. 2018. Stackgan++: Realistic image synthesis with stacked generative adversarial networks. TPAMI (2018).
[803] Jianfeng Zhang, Zihang Jiang, Dingdong Yang, et al. 2022. Avatargen: a 3d generative model for animatable human avatars. In ECCV.
[804] Jingbo Zhang, Xiaoyu Li, Ziyu Wan, et al. 2023. Text2NeRF: Text-Driven 3D Scene Generation with Neural Radiance Fields. arXiv (2023).
[805] Jianrong Zhang, Yangsong Zhang, et al. 2023. T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations. In CVPR.
[806] Kai Zhang, Nick Kolkin, Sai Bi, et al. 2022. Arf: Artistic radiance fields. In ECCV.
[807] Lvmin Zhang and Maneesh Agrawala. 2023. Adding conditional control to text-to-image diffusion models. arXiv (2023).
[808] Longwen Zhang, Qiwei Qiu, Hongyang Lin, et al. 2023. DreamFace: Progressive Generation of Animatable 3D Faces under Text Guidance. arXiv (2023).
[809] Mingyuan Zhang, Zhongang Cai, Liang Pan, et al. 2022. Motiondiffuse: Text-driven human motion generation with diffusion model. arXiv (2022).
[810] Qinsheng Zhang and Yongxin Chen. 2021. Diffusion normalizing flow. NeurIPS (2021).
[811] Weiwei Zhang, Jian Sun, and Xiaoou Tang. 2008. Cat head detection-how to effectively exploit shape and texture features. In ECCV.
[812] Xuanmeng Zhang, Zhedong Zheng, et al. 2022. Multi-view consistent generative adversarial networks for 3d-aware image synthesis. In CVPR.
[813] Yan Zhang, Michael J Black, and Siyu Tang. 2021. We are more than our joints: Predicting how 3d bodies move. In CVPR.
[814] Yizhe Zhang, Zhe Gan, Kai Fan, et al. 2017. Adversarial feature matching for text generation. In ICML.
[815] Zizhao Zhang, Yuanpu Xie, and Lin Yang. 2018. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. In CVPR.
[816] Fuqiang Zhao, Wei Yang, Jiakai Zhang, et al. 2022. Humannerf: Efficiently generated human radiance field from sparse inputs. In CVPR.
[817] Junbo Zhao, Michael Mathieu, and Yann LeCun. 2017. Energy-based Generative Adversarial Networks. In ICLR.
[818] Long Zhao, Zizhao Zhang, Ting Chen, et al. 2021. Improved transformer for high-resolution gans. NeurIPS (2021).
[819] Xinru Zheng, Xiaotian Qiao, Ying Cao, et al. 2019. Content-aware generative modeling of graphic design layouts. ToG (2019).
[820] Yufeng Zheng, Victoria Fernández Abrevaya, Marcel C Bühler, et al. 2022. Im avatar: Implicit morphable head avatars from videos. In CVPR.
[821] Yufeng Zheng, Wang Yifan, Gordon Wetzstein, et al. 2023. Pointavatar: Deformable point-based head avatars from videos. In CVPR.
[822] Zerong Zheng, Han Huang, Tao Yu, et al. 2022. Structured local radiance fields for human avatar modeling. In CVPR.
[823] Zerong Zheng, Tao Yu, et al. 2021. Pamir: Parametric model-conditioned implicit representation for image-based human reconstruction. TPAMI (2021).
[824] Zerong Zheng, Tao Yu, Yixuan Wei, et al. 2019. Deephuman: 3d human reconstruction from a single image. In ICCV.
[825] Ce Zhou, Qian Li, Chen Li, et al. 2023. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. arXiv (2023).
[826] Hang Zhou, Yasheng Sun, et al. 2021. Pose-controllable talking face generation by implicitly modularized audio-visual representation. In CVPR.
[827] Linqi Zhou, Yilun Du, and Jiajun Wu. 2021. 3d shape generation and completion through point-voxel diffusion. In ICCV.
[828] Yi Zhou, Zimo Li, Shuangjiu Xiao, et al. 2018. Auto-Conditioned Recurrent Networks for Extended Complex Human Motion Synthesis. In ICLR.
[829] Yufan Zhou, Ruiyi Zhang, Changyou Chen, et al. 2022. Towards language-free training for text-to-image generation. In CVPR.
[830] Hao Zhu, Xinxin Zuo, Sen Wang, et al. 2019. Detailed human shape estimation from a single image by hierarchical mesh deformation. In CVPR.
[831] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, et al. 2016. Generative visual manipulation on the natural image manifold. In ECCV.
[832] Jun-Yan Zhu, Taesung Park, Phillip Isola, et al. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.
[833] Jun-Yan Zhu, Richard Zhang, Deepak Pathak, et al. 2017. Toward multimodal image-to-image translation. NeurIPS (2017).
[834] Lingting Zhu, Xian Liu, Xuanyu Liu, et al. 2023. Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation. In CVPR.
[835] Minfeng Zhu, Pingbo Pan, Wei Chen, et al. 2019. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. In CVPR.
[836] Yiyu Zhuang, Hao Zhu, Xusen Sun, et al. 2022. Mofanerf: Morphable facial neural radiance field. In ECCV.
