Generative Adversarial Networks for Image and Video Synthesis: Algorithms and Applications

ABSTRACT | The generative adversarial network (GAN) framework has emerged as a powerful tool for various image and video synthesis tasks, allowing the synthesis of visual content in an unconditional or input-conditional manner. It has enabled the generation of high-resolution photorealistic images and videos, a task that was challenging or impossible with prior methods. It has also led to the creation of many new applications in content creation. In this article, we provide an overview of GANs with a special focus on algorithms and applications for visual synthesis. We cover several important techniques to stabilize GAN training, which has a reputation for being notoriously difficult. We also discuss its applications to image translation, image processing, video synthesis, and neural rendering.

KEYWORDS | Computer vision; generative adversarial networks (GANs); image and video synthesis; image processing; neural rendering.

Manuscript received August 4, 2020; revised November 26, 2020; accepted December 24, 2020. Date of publication February 1, 2021; date of current version April 30, 2021. (All the authors contributed equally to this work.) (Corresponding author: Ming-Yu Liu.)
Ming-Yu Liu, Xun Huang, Ting-Chun Wang, and Arun Mallya are with NVIDIA, Santa Clara, CA 95050-2519 USA (e-mail: [email protected]).
Jiahui Yu is with Google, Mountain View, CA 10011 USA.
Digital Object Identifier 10.1109/JPROC.2021.3049196

I. INTRODUCTION
The generative adversarial network (GAN) framework is a deep learning architecture [59], [100] introduced by Goodfellow et al. [60]. It consists of two interacting neural networks—a generator network G and a discriminator network D—which are trained jointly by playing a zero-sum game, where the objective of the generator is to synthesize fake data that resemble real data and the objective of the discriminator is to distinguish between real and fake data. When the training is successful, the generator is an approximator of the underlying data generation mechanism in the sense that the distribution of the fake data converges to the real one. Due to this distribution-matching capability, GANs have become a popular tool for various data synthesis and manipulation problems, especially in the visual domain.
GAN's rise also marks another major success of deep learning in replacing hand-designed components with machine-learned components in modern computer vision pipelines. Although deep learning has directed the community to abandon hand-designed features, such as the histogram of oriented gradients (HOG) [36], in favor of deep features computed by deep neural networks, the objective function used to train the networks remains largely hand-designed. While this is not a major issue for a classification task, since effective and descriptive objective functions such as the cross-entropy loss exist, it is a serious hurdle for a generation task. After all, how can one hand-design a function to guide a generator to produce a better cat image? How can we even mathematically describe "felineness" in an image?
GANs address the issue by deriving a functional form of the objective using training data. As the discriminator is trained to tell whether an input image is a cat image from the training data set or one synthesized by the generator, it defines an objective function that can guide the
traditional methods. We will provide an overview of GAN methods in these image-processing tasks in Section V. Video synthesis is another exciting area in which GANs have shown promising results. Many research studies have utilized GANs to synthesize realistic human videos or transfer motions from one person to another for various entertainment applications, which we will review in Section VI. Finally, due to its great capability in generating photorealistic images, GANs have played an important role in the development of neural rendering—using neural networks to boost the performance of the graphics rendering pipeline. We will cover GAN studies in Section VII.

II. RELATED WORKS
Several GAN review articles exist, including the introductory article by Goodfellow [58]. The articles by Creswell et al. [35] and Pan et al. [154] summarize GAN methods prior to 2018. Wang et al. [223] provided a taxonomy of GANs. Our work differs from the prior studies in that we provide a more contemporary summary of GANs with a focus on image and video synthesis.
There are many different deep generative models or deep neural networks that model the generation process of some data. Besides GANs, other popular deep generative models include deep Boltzmann machines (DBMs), variational autoencoders (VAEs), deep autoregressive models (DARs), and normalizing flow models (NFMs). We compare these models in Fig. 3 and briefly review them in the following.
A. Deep Boltzmann Machines
DBMs [45], [48], [68], [175] are energy-based models [101], which can be represented by undirected graphs. Let x denote the array of image pixels, often called visible nodes. Let h denote the hidden nodes. DBMs model the probability density function of data based on the Boltzmann (or Gibbs) distribution as

p(x; θ) = (1/N(θ)) Σ_h exp(−E(x, h; θ))    (1)

where E is an energy function modeling interactions of nodes in the graph, N is the partition function, and θ denotes the network parameters to be learned. Once a DBM is trained, a new image can be generated by applying Markov chain Monte Carlo (MCMC) sampling, ascending from a random configuration to one with high probability. While extensively expressive, the reliance on MCMC sampling in both training and generation makes DBMs scale poorly compared to other deep generative models since efficient MCMC sampling is itself a challenging problem, especially for large networks.
B. Variational AutoEncoders
VAEs [93], [94], [168] are directed probabilistic graphical models, inspired by the Helmholtz machine [37]. They are also descendants of latent variable models, such as principal component analysis and autoencoders [18], which concern representing high-dimensional data x using lower dimensional latent variables z. In terms of structure, a VAE employs an inference model q(z|x; φ) and a generation model p(x|z; θ)p(z), where p(z) is usually a Gaussian distribution, from which we can easily sample, and q(z|x; φ) approximates the posterior p(z|x; θ). Both the inference and generation models are implemented using feed-forward neural networks. VAE training is through maximizing the evidence lower bound (ELBO) of log p(x; θ), and the nondifferentiability of the stochastic sampling is elegantly handled by the reparameterization trick [94]. One can also show that maximization of the ELBO is equivalent to minimizing the Kullback–Leibler (KL) divergence

KL(q(x)q(z|x; φ) || p(z)p(x|z; θ))    (2)

where q(x) is the empirical distribution of the data [94]. Once a VAE is trained, an image can be efficiently generated by first sampling z from the Gaussian prior p(z) and then passing it through the feed-forward deep neural network p(x|z; θ). VAEs are effective in learning useful latent representations [188]. However, they tend to generate blurry output images.
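To make the ELBO objective and the reparameterization trick concrete, the following minimal PyTorch sketch (our own illustrative example, not code from any of the cited works) trains a Gaussian-prior VAE on flattened images; the encoder q(z|x; φ) outputs a mean and a log-variance, and the loss is the negative ELBO. The network sizes and the Bernoulli likelihood are placeholder choices.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE sketch: q(z|x) and p(x|z) are small MLPs, p(z) = N(0, I)."""
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z), mu, logvar

def negative_elbo(x, x_logits, mu, logvar):
    # Reconstruction term (Bernoulli likelihood) + KL(q(z|x) || p(z)) in closed form.
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return (recon + kl) / x.size(0)

# Usage sketch: one gradient step on a random batch standing in for flattened images in [0, 1].
vae = TinyVAE()
opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
x = torch.rand(64, 784)
x_logits, mu, logvar = vae(x)
loss = negative_elbo(x, x_logits, mu, logvar)
opt.zero_grad(); loss.backward(); opt.step()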
Fig. 3. Structure comparison of different deep generative models. Except for the DBM, which is based on undirected graphs, all other models are based on directed graphs, which enjoy a faster inference speed. (a) Boltzmann machine. (b) VAE. (c) Autoregressive model. (d) NFM. (e) GAN.
C. Deep AutoRegressive Models
DARs [30], [153], [177], [207] are deep learning implementations of classical autoregressive models, which assume an ordering of the random variables to be modeled and generate the variables sequentially based on the ordering. This induces a factorized form of the data distribution given by

p(x; θ) = Π_i p(x_i | x_<i ; θ)    (3)

where x_i are the variables in x, and x_<i is the union of the variables that precede x_i in the assumed ordering. DARs are conditional generative models in that they generate a new portion of the signal based on what has been generated or observed so far. The learning is based on maximum likelihood learning by solving

max_θ E_{x∼D} [ Σ_i log p(x_i | x_<i ; θ) ].    (4)

DAR training is more stable compared to the other generative models. However, due to their recurrent nature, they are slow at inference. Also, while, for audio or text, a natural ordering of the variables can be determined based on the time dimension, such an ordering does not exist for images. One, hence, has to enforce an order prior that is an unnatural fit to the image grid.
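As a toy illustration of the factorization in (3) and the objective in (4), the sketch below (our own example, not taken from any cited work) implements a one-layer autoregressive model over a binary vector with a causally masked linear layer; log_prob sums the per-variable conditional log-likelihoods, and sample generates the variables sequentially.

import torch
import torch.nn as nn

class TinyAR(nn.Module):
    """Toy autoregressive model over a binary vector x of length dim.
    Each x_i is predicted from x_<i via a strictly lower-triangular (causal) mask."""
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.weight = nn.Parameter(torch.zeros(dim, dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self.register_buffer("mask", torch.tril(torch.ones(dim, dim), diagonal=-1))

    def logits(self, x):
        return x @ (self.weight * self.mask).t() + self.bias

    def log_prob(self, x):                      # eq. (3)/(4): sum_i log p(x_i | x_<i)
        return -nn.functional.binary_cross_entropy_with_logits(
            self.logits(x), x, reduction="none").sum(dim=1)

    @torch.no_grad()
    def sample(self, n):                        # sequential (hence slow) generation
        x = torch.zeros(n, self.dim)
        for i in range(self.dim):
            p = torch.sigmoid(self.logits(x)[:, i])
            x[:, i] = torch.bernoulli(p)
        return x

model = TinyAR(dim=16)
loss = -model.log_prob(torch.bernoulli(torch.full((8, 16), 0.5))).mean()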
D. Normalizing Flow Models
NFMs [40], [41], [92], [167] are based on the normalizing flow—a transformation of a simple probability distribution into a more complex distribution by a sequence of invertible and differentiable mappings. Each mapping corresponds to a layer in a deep neural network. With a layer design that guarantees invertibility and differentiability for all possible weights, one can stack many such layers to construct a powerful mapping because the composition of invertible and differentiable functions is itself invertible and differentiable. Let F = f^(1) ∘ f^(2) ∘ · · · ∘ f^(K) be such a K-layer mapping that maps the simple probability distribution Z to the data distribution X. The probability density of a sample x ∼ X can be computed by transforming it back to the corresponding z. Hence, we can apply maximum likelihood learning to train NFMs because the log-likelihood under the complex data distribution can be converted to the log-likelihood under the simple prior distribution minus the log-determinant (Jacobian) terms. This gives

log p(x; θ) = log p(z; θ) − Σ_{i=1}^{K} log |det(df^(i)/dz_{i−1})|    (5)

where z_i = f^(i)(z_{i−1}). One key strength of NFMs is in supporting direct evaluation of the probability density. However, NFMs require an invertible mapping, which greatly limits the choices of applicable architectures.
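The change-of-variables computation in (5) can be made concrete with a deliberately simple flow. The sketch below (our own illustrative code; real NFMs use coupling or autoregressive layers) stacks invertible elementwise affine layers whose log-determinant is available in closed form and evaluates log p(x) exactly.

import math
import torch
import torch.nn as nn

class ElementwiseAffineFlow(nn.Module):
    """One invertible layer: z_i = exp(s) * z_{i-1} + t (elementwise), so
    log|det(df^(i)/dz_{i-1})| = sum(s)."""
    def __init__(self, dim):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(dim))
        self.t = nn.Parameter(torch.zeros(dim))

    def forward(self, z):                        # z_{i-1} -> z_i, plus the layer's log|det|
        return z * torch.exp(self.s) + self.t, self.s.sum()

    def inverse(self, x):                        # z_i -> z_{i-1}
        return (x - self.t) * torch.exp(-self.s)

class TinyFlow(nn.Module):
    def __init__(self, dim, num_layers=4):
        super().__init__()
        self.dim = dim
        self.layers = nn.ModuleList(ElementwiseAffineFlow(dim) for _ in range(num_layers))

    def log_prob(self, x):
        """Eq. (5): log p(x) = log p(z) - sum_i log|det df^(i)/dz_{i-1}|, with z obtained
        by inverting the flow and p(z) a standard Gaussian prior."""
        log_det_sum = x.new_zeros(())
        z = x
        for layer in reversed(self.layers):
            z = layer.inverse(z)
            _, log_det = layer(z)                # log|det| of this layer's forward map
            log_det_sum = log_det_sum + log_det
        log_pz = -0.5 * (z ** 2).sum(dim=1) - 0.5 * self.dim * math.log(2 * math.pi)
        return log_pz - log_det_sum

flow = TinyFlow(dim=2)
nll = -flow.log_prob(torch.randn(16, 2)).mean()  # maximum likelihood training objective
nll.backward()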
III. LEARNING
Let θ and φ be the learnable parameters in G and D, respectively. GAN training is formulated as a minimax problem

min_θ max_φ V(θ, φ).    (6)

Solving this minimax problem with stochastic gradient methods is difficult in practice, and training often fails in characteristic ways, such as mode collapse and mode dropping. In mode dropping, the generator does not faithfully model the target distribution and misses some portion of it. Other common failure cases include checkerboard and water drop artifacts. In this article, we cover the basics of GAN training and some techniques invented to improve training stability.
A. Learning Objective
The core idea in GAN training is to minimize the discrepancy between the true data distribution p(x) and the fake data distribution p(G(z; θ)). As there are a variety of ways to measure the distance between two distributions, such as the Jensen–Shannon divergence, the KL divergence, and the integral probability metric, there are also a variety of GAN losses, including the saturated GAN loss [60], the nonsaturated GAN loss [60], the Wasserstein GAN loss [6], [64], the least-square GAN loss [134], the hinge GAN loss [112], [242], the f-divergence GAN loss [81], [150], and the relativistic GAN loss [80]. Empirically, which loss works best depends on factors such as the task, the training data, and the network architecture. As of the time of writing this survey article, there is no clear consensus on which one is absolutely better.
Here, we give a generic GAN learning objective formulation that subsumes several popular ones. For the discriminator update step, the learning objective is

max_φ E_{x∼D}[ f_D(D(x; φ)) ] + E_{z∼Z}[ f_G(D(G(z; θ); φ)) ]    (7)

where f_D and f_G are the output layers that transform the results computed by the discriminator D into the classification scores for the real and fake images, respectively. For the generator update step, the learning objective is

min_θ E_{z∼Z}[ g_G(D(G(z; θ); φ)) ]    (8)

where g_G is the output layer that transforms the result computed by the discriminator into the score for the fake image. In Table 1, we compare f_D, f_G, and g_G for several popular GAN losses.

Table 1. Comparison of different GAN losses, including saturated [60], nonsaturated [60], Wasserstein [6], least square [134], and hinge [112], [242], in terms of the discriminator output layer type in (7) and (8). We maximize f_D and f_G for training the discriminator. As shown in (7) and (8), we minimize g_G for training the generator. Note that σ(x) = 1/(1 + e^(−x)) is the sigmoid function.
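As a concrete reading of (7), (8), and the loss choices compared in Table 1, the sketch below (our own illustrative code) implements the output-layer functions f_D, f_G, and g_G for the nonsaturated and hinge losses; d_real and d_fake stand for the raw scalar outputs (logits) D(x) and D(G(z)).

import torch
import torch.nn.functional as F

# Nonsaturated GAN loss [60]: f_D(t) = log sigmoid(t), f_G(t) = log(1 - sigmoid(t)),
# g_G(t) = -log sigmoid(t). softplus(-t) = -log sigmoid(t) gives a numerically stable form.
def nonsat_d_loss(d_real, d_fake):      # discriminator maximizes (7) => minimize the negation
    return F.softplus(-d_real).mean() + F.softplus(d_fake).mean()

def nonsat_g_loss(d_fake):              # generator minimizes (8)
    return F.softplus(-d_fake).mean()

# Hinge GAN loss [112], [242]: f_D(t) = min(0, -1 + t), f_G(t) = min(0, -1 - t), g_G(t) = -t.
def hinge_d_loss(d_real, d_fake):
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def hinge_g_loss(d_fake):
    return -d_fake.mean()

# Usage sketch with random logits standing in for D(x) and D(G(z)).
d_real, d_fake = torch.randn(8), torch.randn(8)
print(nonsat_d_loss(d_real, d_fake), hinge_g_loss(d_fake))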
B. Training
Two variants of stochastic gradient descent/ascent (SGD) schemes are commonly used for GAN training: the simultaneous update scheme and the alternating update scheme. Let V_D(θ, φ) and V_G(θ, φ) be the objective functions in (7) and (8), respectively. In the simultaneous update, each training iteration contains a discriminator update step and a generator update step given by

φ^(t+1) = φ^(t) + α_D ∂V_D(θ^(t), φ^(t))/∂φ    (9)
θ^(t+1) = θ^(t) − α_G ∂V_G(θ^(t), φ^(t))/∂θ    (10)

where α_D and α_G are the learning rates for the discriminator and generator, respectively. In the alternating update, each training iteration consists of one discriminator update step followed by a generator update step, given by

φ^(t+1) = φ^(t) + α_D ∂V_D(θ^(t), φ^(t))/∂φ    (11)
θ^(t+1) = θ^(t) − α_G ∂V_G(θ^(t), φ^(t+1))/∂θ.    (12)

Note that, in the alternating update scheme, the generator update (12) utilizes the newly updated discriminator parameters φ^(t+1), while, in the simultaneous update (10), it does not. These two schemes have their pros and cons. The simultaneous update scheme can be computed more efficiently, as a major part of the computation in the two steps can be shared. On the other hand, the alternating update scheme tends to be more stable as the generator update is computed based on the latest discriminator. Recent GAN studies [24], [64], [70], [118], [156] mostly use the alternating update scheme. Sometimes, the discriminator update (11) is performed several times before computing (12) [24], [64].
Among various SGD algorithms, ADAM [91], which is based on adaptive estimates of the first- and second-order moments, is very popular for training GANs. ADAM has several user-defined parameters. Typically, the first momentum is set to 0, while the second momentum is set to 0.999. The learning rate for the discriminator update is often set to two to four times larger than the learning rate for the generator update (usually set to 0.0001), which is called the two time-scale update rule (TTUR) [67]. We also note that RMSProp [201] is popular for GAN training [64], [84], [85], [118].
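The following sketch (our own illustrative code; the networks and the data batch are placeholders) puts (11), (12), and the ADAM/TTUR settings described above together into an alternating-update training loop using the hinge loss.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder networks; any generator/discriminator pair with matching shapes would do.
z_dim = 128
G = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

# TTUR-style ADAM: beta1 = 0, beta2 = 0.999, discriminator lr a few times larger than generator lr.
opt_d = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.999))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.999))

def sample_real_batch(batch_size=64):
    return torch.rand(batch_size, 784) * 2 - 1   # stand-in for real images in [-1, 1]

for step in range(1000):
    # Discriminator update, eq. (11): ascend V_D (here, descend the hinge D loss).
    x_real = sample_real_batch()
    z = torch.randn(x_real.size(0), z_dim)
    with torch.no_grad():
        x_fake = G(z)                            # generator is frozen for this step
    d_loss = F.relu(1.0 - D(x_real)).mean() + F.relu(1.0 + D(x_fake)).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update, eq. (12): uses the freshly updated discriminator.
    z = torch.randn(x_real.size(0), z_dim)
    g_loss = -D(G(z)).mean()                     # hinge g_G(t) = -t
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()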
C. Regularization
We review several popular regularization techniques available for countering instability in GAN training.
Gradient penalty (GP) is an auxiliary loss term that penalizes the deviation of the gradient norm from a desired value [64], [138], [169]. To use GP, one adds it to the objective function for the discriminator update, that is, (7). There are several variants of GP. Generally, they can be expressed as

GP-δ = E_x̂ [ (‖∇D(x̂)‖₂ − δ)² ].    (13)

The two most common forms are GP-1 [64] and GP-0 [138].
GP-1 was first introduced by Gulrajani et al. [64]. It uses an imaginary data distribution

x̂ = u·x + (1 − u)·G(z),   u ∼ U(0, 1)    (14)

where u is a uniform random variable between 0 and 1. Basically, x̂ is neither real nor fake. It is a convex combination of a real sample and a fake sample. The design of GP-1 is motivated by the property of an optimal D that solves the Wasserstein GAN loss. However, GP-1 is also useful when using other GAN losses. In practice, it has the effect of countering the vanishing and exploding gradients that occur during GAN training.
On the other hand, the design of GP-0 is based on the idea of penalizing the discriminator for deviating away from the Nash equilibrium. GP-0 takes a simpler form in which it does not use the imaginary sample distribution but the real data distribution, that is, setting x̂ ≡ x. We find the use of GP-0 in several state-of-the-art GAN algorithms [85], [86].
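A minimal sketch of both penalties, assuming D maps a batch of samples to one scalar per sample (our own illustrative code; the weighting factor 10.0 in the usage comment is a common choice, not prescribed here):

import torch

def gradient_penalty(D, x_real, x_fake, delta=1.0):
    """GP-delta as in (13)-(14). delta=1 with interpolated samples gives GP-1;
    delta=0 evaluates the penalty on real samples only, as in GP-0."""
    if delta > 0:
        u = torch.rand(x_real.size(0), *[1] * (x_real.dim() - 1), device=x_real.device)
        x_hat = u * x_real + (1 - u) * x_fake    # convex combination of real and fake samples
    else:
        x_hat = x_real.clone()
    x_hat.requires_grad_(True)
    d_out = D(x_hat).sum()
    (grad,) = torch.autograd.grad(d_out, x_hat, create_graph=True)
    grad_norm = grad.flatten(start_dim=1).norm(2, dim=1)
    return ((grad_norm - delta) ** 2).mean()

# Usage: add the penalty to the discriminator loss before calling backward(), e.g.,
#   d_loss = d_loss + 10.0 * gradient_penalty(D, x_real, x_fake.detach(), delta=1.0)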
Spectral normalization (SN) [140] is an effective regularization technique used in many recent GAN algorithms [24], [156], [170], [242]. SN regularizes the spectral norm of the linear (projection) operation at each layer of the discriminator by simply dividing the weight matrix by its largest singular value. Let W be the weight matrix of a layer of the discriminator network. With SN, the weight that is actually applied is

W / √(λ_max(WᵀW))    (15)

where λ_max(A) extracts the largest eigenvalue of the square matrix A, so that √(λ_max(WᵀW)) is the largest singular value of W. In other words, the weight matrix applied at each layer has a spectral norm equal to 1.
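In practice, the largest singular value is estimated cheaply with power iteration rather than a full decomposition. The sketch below (our own illustrative code) shows that estimate for a 2-D weight matrix; PyTorch provides the same functionality as a module hook via torch.nn.utils.spectral_norm.

import torch

def spectrally_normalize(W, n_power_iterations=1, u=None, eps=1e-12):
    """Estimate the largest singular value of W with power iteration and return
    W divided by it (a sketch of what SN does at each discriminator layer)."""
    if u is None:
        u = torch.randn(W.size(0))
    with torch.no_grad():
        for _ in range(n_power_iterations):
            v = torch.mv(W.t(), u)
            v = v / (v.norm() + eps)
            u = torch.mv(W, v)
            u = u / (u.norm() + eps)
    sigma = torch.dot(u, torch.mv(W, v))         # approx. largest singular value of W
    return W / sigma, u                          # u is reused across training iterations

W = torch.randn(64, 128)
W_sn, u = spectrally_normalize(W, n_power_iterations=5)
print(torch.linalg.matrix_norm(W_sn, ord=2))     # should be close to 1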
Fig. 4. Generator evolution. Since the debut of GANs [60], the generator architecture has continuously evolved. (a)–(c) One can observe
the change from simple MLPs to deep convolutional and residual networks. Recently, conditional architectures, including (d) conditional ANs
and (e) conditional convolutions, have gained popularity as they allow users to have more control over the generation outputs. (a) MLP.
(b) Deep ConvNet. (c) Residual Net. (d) Residual Net. with Cond. Act. Norm. (e) Cond. ConvNet.
Feature matching (FM) provides a way to encourage the generator to generate images that are similar to real ones in some sense. Similar to GP, FM is an auxiliary loss. There are two popular implementations: one is batch-based [176], and the other is instance-based [99], [218]. Let D_i be the ith layer of a discriminator D, that is, D = D_d ∘ · · · ∘ D_2 ∘ D_1. The batch-based FM loss matches the moments of the activations extracted from the real and fake images, respectively. For the ith layer, the loss is

‖ E_{x∼D}[ D_i ∘ · · · ∘ D_1(x) ] − E_{z∼Z}[ D_i ∘ · · · ∘ D_1(G(z)) ] ‖.    (16)

One can apply the FM loss to a subset of layers in the discriminator and use the weighted sum as the final FM loss. The instance-based FM loss is only applicable to conditional generation models where we have the corresponding real image for a fake image. For the ith layer, the instance-based FM loss is given by

‖ D_i ∘ · · · ∘ D_1(x_i) − D_i ∘ · · · ∘ D_1(G(z, y_i)) ‖    (17)

where y_i is the control signal for x_i.
Perceptual loss [79]: Often, when the instance-based FM loss is applicable, one can additionally match features extracted from real and fake images using a pretrained network. Such a variant of the FM loss is called the perceptual loss [79].
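A sketch of the instance-based FM loss in (17) follows (our own illustrative code). Equation (17) does not fix the norm or the layer weighting; the L1 distance, the chosen layers, and their weights below are our own placeholder choices, and detaching the real features is a common implementation detail rather than part of the definition.

import torch
import torch.nn as nn

def discriminator_features(d_layers, x):
    """Run x through D_1, ..., D_d and keep every intermediate activation D_i o ... o D_1(x)."""
    feats, h = [], x
    for layer in d_layers:
        h = layer(h)
        feats.append(h)
    return feats

def instance_fm_loss(d_layers, x_real, x_fake, layer_ids=(0, 1), weights=(1.0, 1.0)):
    """Instance-based FM loss, eq. (17), summed over a chosen subset of discriminator layers."""
    f_real = discriminator_features(d_layers, x_real)
    f_fake = discriminator_features(d_layers, x_fake)
    loss = 0.0
    for i, w in zip(layer_ids, weights):
        loss = loss + w * (f_real[i].detach() - f_fake[i]).abs().mean()
    return loss

# Usage sketch with a toy 3-layer discriminator over flattened images.
d_layers = nn.ModuleList([
    nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2)),
    nn.Sequential(nn.Linear(256, 128), nn.LeakyReLU(0.2)),
    nn.Linear(128, 1),
])
x_real, x_fake = torch.rand(8, 784), torch.rand(8, 784)
fm = instance_fm_loss(d_layers, x_real, x_fake)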
Model average (MA) can improve the quality of images generated by a GAN. To use MA, we keep two copies of the generator network during training, where one is the original generator with weight θ and the other is the MA generator with weight θ_MA. At iteration t, we update θ_MA based on

θ_MA^(t) = β·θ^(t) + (1 − β)·θ_MA^(t−1)    (18)

where β is a scalar controlling the contribution from the current model weight.
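Equation (18) is an exponential moving average over generator weights. A sketch follows (our own illustrative code; the value β = 0.001 and the toy network are placeholders).

import copy
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 784))
G_ma = copy.deepcopy(G)                 # the model-average copy, typically used only for evaluation
for p in G_ma.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def update_model_average(g_ma, g, beta=0.001):
    """Eq. (18): theta_MA <- beta * theta + (1 - beta) * theta_MA."""
    for p_ma, p in zip(g_ma.parameters(), g.parameters()):
        p_ma.mul_(1.0 - beta).add_(p, alpha=beta)
    for b_ma, b in zip(g_ma.buffers(), g.buffers()):
        b_ma.copy_(b)                   # e.g., BN running statistics are simply copied

# Call once after every generator update step:
update_model_average(G_ma, G)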
D. Network Architecture
Network architectures provide a convenient way to inject inductive biases. Certain network designs often work better than others for a given task. Since the introduction of GANs, we have observed an evolution of the network architecture for both the generator and discriminator.
1) Generator Evolution: In Fig. 4, we visualize the evolution of the GAN generator architecture. In the original GAN paper [60], both the generator and the discriminator are based on the multilayer perceptron (MLP) [see Fig. 4(a)]. As an MLP fails to model the translational invariance property of natural images, its output images are of limited quality. In the DCGAN work [163], a deep convolutional architecture [see Fig. 4(b)] is used for the GAN generator. As the convolutional architecture is a better fit for modeling image signals, the outputs produced by the DCGAN are often of better quality. Researchers have also borrowed architecture designs from discriminative modeling tasks. As the residual architecture [66] has proved to be effective for training deep networks, several GAN studies have started to use the residual architecture in their generator design [see Fig. 4(c)] [6], [140].
A residual block used in modern GAN generators typically consists of a skip connection paired with a series of batch normalization (BN) [74], nonlinearity, and convolution operations. The BN is one type of activation norm (AN), a technique that normalizes the activation values
within each channel and then applies a learned affine transform, where γ_c and β_c are scalars used to scale and shift the postnormalization activation values. They are constants learned during training.
For many applications, it is required to have some way to control the output produced by a generator. This desire has motivated various conditional generator architectures [see Fig. 4(d)] for the GAN generator [24], [70], [156]. The most common approach is to use the conditional AN. In a conditional AN, both γ_c and β_c are data-dependent. Often, one employs a separate network to map input control signals to the target γ_c and β_c values. Another way to achieve such controllability is to use hypernetworks, that is, using an auxiliary network to produce weights for the main network. For example, we can have a convolutional layer where the filter weights are generated by a separate network. We often call such a scheme conditional convolutions [see Fig. 4(e)], and it has been used for several state-of-the-art GAN generators [86], [216].
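A sketch of a conditional AN layer follows (our own illustrative code): the per-channel γ_c and β_c are predicted from a control signal y (for example, a class embedding) by small linear maps instead of being fixed constants. Centering the predicted scale around 1 is our own initialization choice.

import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    """Conditional AN sketch: gamma_c and beta_c are data-dependent functions of y."""
    def __init__(self, num_channels, cond_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_channels, affine=False)   # normalization only
        self.to_gamma = nn.Linear(cond_dim, num_channels)
        self.to_beta = nn.Linear(cond_dim, num_channels)

    def forward(self, x, y):
        h = self.bn(x)
        gamma = 1.0 + self.to_gamma(y).unsqueeze(-1).unsqueeze(-1)
        beta = self.to_beta(y).unsqueeze(-1).unsqueeze(-1)
        return gamma * h + beta

# Usage: modulate a 64-channel feature map with a 128-D control embedding.
cbn = ConditionalBatchNorm2d(64, 128)
x, y = torch.randn(4, 64, 16, 16), torch.randn(4, 128)
out = cbn(x, y)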
2) Discriminator Evolution: GAN discriminators have also undergone an evolution. However, the change has mostly been in moving from the MLP to deep convolutional and residual architectures. As the discriminator is solving a classification task, new breakthroughs in architecture design for image classification tasks could influence future GAN discriminator designs.
3) Conditional Discriminator Architecture: There are several effective architectures for utilizing control signals (conditional inputs y) in the GAN discriminator to achieve better image generation quality, as visualized in Fig. 5. These include the auxiliary classifier (AC) [151], input concatenation (IC) [75], and the projection discriminator (PD) [141]. The AC and PD are mostly used for category-conditional image generation tasks, while the IC is common for image-to-image translation tasks.

Fig. 5. Conditional discriminator architectures. There are several ways to leverage the user input signal y in the GAN discriminator. (a) AC [151]. In this design, the discriminator is asked to predict the ground-truth label for the real image. (b) Input concatenation [75]. In this design, the discriminator learns to reason whether the input is real by learning a joint feature embedding of image and label. (c) PD [141]. In this design, the discriminator computes an image embedding and correlates it with the label embedding (through the dot product) to determine whether the input is real or fake.

4) Neural Architecture Search: As neural architecture search has become a popular topic for various recognition tasks, efforts have been made to automatically find a performant architecture for GANs [56].
While this section and Section III have focused on introducing the GAN mechanism and various algorithms used to train them, Sections IV–VII focus on various applications of GANs in generating images and videos.

IV. IMAGE TRANSLATION
This section discusses the application of GANs to image-to-image translation, which aims to map an image from one domain to a corresponding image in a different domain, for example, sketches to shoes, label maps to photos, and summer to winter. The problem can be studied in a supervised setting, where sample pairs of corresponding images are available, or an unsupervised setting, where such training data are unavailable, and we only have two independent sets of images. In Sections IV-A and IV-B, we will discuss recent progress in both settings.

A. Supervised Image Translation
Isola et al. [75] proposed the pix2pix framework as a general-purpose solution to image-to-image translation in the supervised setting. The training objective of pix2pix combines the conditional GAN loss with a pixelwise ℓ1 loss between the generated image and the ground truth. One notable design choice of pix2pix is the use of patchwise discriminators (PatchGAN), which attempt to discriminate each local image patch rather than the whole image. This design incorporates the prior knowledge that the underlying image translation function we want to learn is local, assuming independence between pixels that are far away. In other words, translation mostly involves style or texture changes. It significantly alleviates the burden of the discriminator because it requires much less model capacity to discriminate local patches than whole images.
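A sketch of the training objective described above follows (our own illustrative code). The tiny networks, the weight λ = 100, and the nonsaturating GAN loss are placeholder choices; the original pix2pix pairs a U-Net generator with a multilayer PatchGAN discriminator.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder conditional networks: G maps the input image to the output image,
# D scores the concatenated (input, output) pair with patchwise logits.
G = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.Conv2d(64, 3, 3, padding=1))
D = nn.Sequential(nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                  nn.Conv2d(64, 1, 4, stride=2, padding=1))   # one logit per local patch

def pix2pix_losses(x_in, y_real, lam=100.0):
    y_fake = G(x_in)
    d_real = D(torch.cat([x_in, y_real], dim=1))
    d_fake = D(torch.cat([x_in, y_fake.detach()], dim=1))
    # Patchwise GAN loss: every spatial location of D's output is a real/fake logit.
    d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    g_gan = F.softplus(-D(torch.cat([x_in, y_fake], dim=1))).mean()
    g_loss = g_gan + lam * F.l1_loss(y_fake, y_real)          # GAN term + pixelwise L1 term
    return d_loss, g_loss

x_in, y_real = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
d_loss, g_loss = pix2pix_losses(x_in, y_real)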
One important limitation of pix2pix is that its translation function is restricted to be one-to-one. However, many of the mappings that we aim to learn are one-to-many in nature. In other words, the distribution of possible outputs is multimodal. For example, one can imagine many shoes in different colors and styles that correspond to the same sketch of a shoe. Naively injecting a Gaussian noise latent code into the generator does not lead to many variations since the generator is free to ignore that latent code. BicycleGAN [255] explores approaches to encourage the generator to make use of the latent code to represent output variations, including applying a KL divergence loss
to the encoded latent code and reconstructing the sampled latent code from the generated image. Other strategies to encourage diversity include using different generators to capture different output modes [54], replacing the reconstruction loss with a maximum likelihood objective [106], [107], and directly encouraging the distance between output images generated from different latent codes to be large [120], [133], [234].
Besides, the quality of image-to-image translation has been significantly improved by some recent studies [104], [122], [156], [194], [218], [250]. In particular, pix2pixHD [218] is able to generate high-resolution (HR) images with a coarse-to-fine generator and a multiscale discriminator. SPADE [156] further improves the image quality with a spatially adaptive normalization layer. SPADE, in addition, allows a style image input for better control of the desired look of the output image. Some examples of SPADE are shown in Fig. 6.

Fig. 6. Image translation examples of SPADE [156], which converts semantic label maps into photorealistic natural scenes. The style of the output image can also be controlled by a reference image (the leftmost column). Images are from Park et al. [156].
B. Unsupervised Image Translation
For many tasks, paired training images are very difficult to obtain [16], [32], [70], [90], [105], [117], [235], [254]. Unsupervised learning of mappings between corresponding images in two domains is a much harder problem but has wider applications than the supervised setting. CycleGAN [254] simultaneously learns mappings in both directions and employs a cycle consistency loss to enforce that, if an image is translated to the other domain and translated back to the original domain, the output should be close to the original image. UNIT [117] makes a shared latent space assumption [119] that a pair of corresponding images can be mapped to the same latent code in a shared latent space. It is shown that the shared latent space implies cycle consistency and imposes a stronger regularization. DistanceGAN [16] encourages the mapping to preserve the distance between any pair of images before and after translation. While the methods above need to train a different model for each pair of image domains, StarGAN [32] is able to translate images across multiple domains using only a single model.
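The cycle consistency term described above can be sketched as follows (our own illustrative code) for two translators G_AB: A→B and G_BA: B→A; the adversarial terms for the two domain discriminators would be added exactly as in the earlier training-loop sketch, and the weight λ = 10 is a common but not mandated choice.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder translators between domains A and B (CycleGAN itself uses ResNet-based generators).
G_AB = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 3, 3, padding=1))
G_BA = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 3, 3, padding=1))

def cycle_consistency_loss(x_a, x_b, lam=10.0):
    """||G_BA(G_AB(x_a)) - x_a||_1 + ||G_AB(G_BA(x_b)) - x_b||_1: translating to the
    other domain and back should recover the original image."""
    loss_a = F.l1_loss(G_BA(G_AB(x_a)), x_a)
    loss_b = F.l1_loss(G_AB(G_BA(x_b)), x_b)
    return lam * (loss_a + loss_b)

x_a, x_b = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
loss = cycle_consistency_loss(x_a, x_b)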
In many unsupervised image translation tasks (e.g., horses to zebras and dogs to cats), the two image domains mainly differ in the foreground objects, and the background distribution is very similar. Ideally, the model should only modify the foreground objects and leave the background region untouched. Some work [31], [137], [232] employs spatial attention to detect and change the foreground region without influencing the background. InstaGAN [142] further allows the shape of the foreground objects to be changed.
The early work mentioned above focuses on unimodal translation. On the other hand, recent advances [5], [57], [70], [105], [128], [133] have made it possible to perform multimodal translation, generating diverse output images given the same input. For example, MUNIT [70] assumes that images can be encoded into two disentangled latent spaces: a domain-invariant content space that captures the information that should be preserved during translation, and a domain-specific style space that represents the variations that are not specified by the input image. To generate diverse translation results, we can recombine the content code of the input image with different style codes sampled from the style space of the target domain. Fig. 7 compares MUNIT with existing unimodal translation methods, including CycleGAN and UNIT. The disentangled latent space not only enables multimodal translation but also allows example-guided translation in which the generator recombines the domain-invariant content of an image from the source domain and the domain-specific style of an image from the target domain. The idea of using a guiding style image has also been applied to the supervised setting [156], [214], [244].
Although paired example images are not needed in the unsupervised setting, most existing methods still require access to a large number of unpaired example images in both source and target domains. Some studies seek to reduce the number of training examples without much loss of performance. Benaim and Wolf [17] focused on the situation where there are many images in the target domain but only a single image in the source domain. The work of Cohen and Wolf [34] enables translation in the opposite direction where the source domain has many images, but the target domain has only one. The above setting assumes that the source and target domain

V. IMAGE PROCESSING
GAN's strength in generating realistic images makes it ideal for solving various image-processing problems, especially those where the perceptual quality of the output images is the primary evaluation criterion. This section will discuss some prominent GAN-based methods for several key image-processing problems, including image restoration and enhancement (SR, denoising, deblurring, and compression artifact removal) and image inpainting.

A. Image Restoration and Enhancement
The traditional way of evaluating algorithms for image restoration and enhancement tasks is to measure the distortion, that is, the difference between the ground-truth and restored images, using metrics such as the mean square error (MSE), the peak signal-to-noise ratio (PSNR), and the structural similarity index (SSIM). Recently, metrics for measuring perceptual quality, such as the no-reference (NR) metric [127], have been proposed, as the visual quality is arguably the most important factor for
Fig. 16. Face swapping versus reenactment [149]. Face swapping focuses on pasting the face region from one subject to another, while
face reenactment concerns transferring the expressions and head poses from the target subject to the source image. Images are from
Nirkin et al. [149].
The synthesized videos are usually only up to a few seconds long and of simple video content, such as facial motion.
On the other hand, conditional video synthesis generates videos conditioned on input content. A common category is future frame prediction [39], [47], [83], [103], [110], [125], [136], [191], [208], [212], [213], [230], which attempts to predict the next frame of a sequence based on the past frames. Another common category of conditional video synthesis is conditioning on an input video that shares the same high-level representation. Such a setting is often referred to as video-to-video synthesis [217]. This line of studies has shown promising results on various tasks, such as transforming high-level representations into photorealistic videos [217], animating characters with new expressions or motions [26], [199], or innovating a new rendering pipeline for graphics engines [50]. Due to its broader impact, we will mainly focus on conditional video synthesis. Particularly, we will focus on its two major domains: face reenactment and pose transfer.

A. Face Reenactment
Conditional face video synthesis exists in many forms. The most common forms include face swapping and face reenactment. Face swapping focuses on pasting the face region from one subject to another, whereas face reenactment concerns transferring the subject's expressions and head poses. Fig. 16 illustrates the difference. Here, we only focus on face reenactment. It has many applications in fields such as gaming or the film industry, where the characters can be animated by human actors. Based on whether the trained model can only work for a specific person or is universal to all persons, face reenactment can be classified as subject-specific or subject-agnostic, as described in the following.
1) Subject-Specific Model: Traditional methods usually build a subject-specific model, which can only synthesize one predetermined subject, by focusing on transferring the expressions without transferring the head movement [192], [197]–[199], [209]. This line of studies usually starts by collecting footage of the target person to be synthesized, either using an RGBD sensor [198] or an RGB sensor [199]. Then, a 3-D model of the target person is built for the face region [20]. At test time, given the new expressions, they can be used to drive the 3-D model to generate the desired motions, as shown in Fig. 17. Instead of extracting the driving expressions from someone else, they can also be directly synthesized from speech inputs [192]. Since 3-D models are involved, this line of studies typically does not use GANs.

Fig. 17. Face reenactment using 3-D face models [87]. These methods first construct a 3-D model for the person to be synthesized, so they can easily animate the model with new expressions. Images are from Kim et al. [87].

Some follow-up studies take transferring head motions into account and can model both expressions and different head poses at the same time [11], [87], [227]. For example, RecycleGAN [11] extends CycleGAN [254] to incorporate temporal constraints, so it can transform
videos of a particular person into another fixed person. On the other hand, ReenactGAN [227] can transfer movements and expressions from an arbitrary person to a fixed person. Still, the subject-dependent nature of these studies greatly limits their usability. One model can only work for one person, and generalizing to another person requires training a new model. Moreover, collecting training data for the target person may not be feasible at all times, which motivates the emergence of subject-agnostic models.

Table 2. Categorization of face reenactment methods. Subject-specific models can only work on one subject per model, while subject-agnostic models can work on general targets. Among each of them, some frameworks only focus on the inner face region, so they can only transfer expressions, while others can also transfer head movements. Studies with * do not use GANs in their framework.
Fig. 18. Few-shot face reenactment methods which require only a 2-D image as input [239]. The driving expressions are usually
represented by facial landmarks or keypoints. Images are from Zakharov et al. [239].
Table 3. Categories of pose transfer methods. Again, they can be classified depending on whether one model can work for only one person or any person. Some of the frameworks only focus on generating single images, while others also demonstrate their effectiveness on videos. Studies with * do not use GANs in their framework.

the final result. Grigorev et al. [61] also mapped the input to texture space and inpainted the textures before warping them back to the target pose. Huang et al. [71] combined the SMPL model [123] with the implicit field estimation framework [172] to rig the reconstructed meshes with desired motions. While these methods work reasonably well in transferring poses, as shown in Fig. 19, directly applying them to videos will usually result in unsatisfactory artifacts, such as flickering or inconsistent results. In the following, we introduce methods specifically targeting video generation, which work on a one-person-per-model basis.
and successfully demonstrated the transfer results on several dancing sequences, opening the era for a new application (see Fig. 20). Chan et al. [26] also adopted a similar approach to generate many dancing examples but used simple temporal smoothing on the inputs instead of explicitly modeling temporal consistency with the network. Following these studies, many subsequent studies improve upon them [3], [115], [116], [183], [252], usually by combining the neural network with 3-D models or graphics engines. For example, instead of predicting RGB values directly, Shysheya et al. [183] predicted DensePose-like part maps and texture maps from input 3-D keypoints and adopted a neural renderer to render the outputs. Liu et al. [116] first constructed a 3-D character model of the target by capturing multiview static images and then trained a character-to-image translation network using a monocular video of the target. The authors later combined the constructed 3-D model with the monocular video to estimate dynamic textures, so they can use different texture maps when synthesizing different motions to increase the realism [115].

VII. NEURAL RENDERING
Neural rendering is a recent and upcoming topic in the area of neural networks, which combines classical rendering and generative models. Classical rendering can produce photorealistic images given the complete specification of the world. This includes all the objects in it, their geometry, material properties, the lighting, the cameras, and so on. Creating such a world from scratch is a laborious process that often requires expert manual input. Moreover, faithfully reproducing such data directly from images of the world can often be hard or impossible. On the other hand, as described in Sections IV–VI, GANs have had great success in producing photorealistic images given minimal semantic inputs. The ability to synthesize and learn material properties, textures, and other intangibles from training data can help overcome the drawbacks of classical rendering.
Neural rendering aims to combine the strengths of the two areas to create a more powerful and flexible framework. Neural networks can either be applied as a postprocessing step after classical rendering or as part of the rendering pipeline with the design of 3-D-aware and
Fig. 22. Two common frameworks for neural rendering. (a) In the first set of studies [109], [132], [135], [139], [158], a neural network that
purely operates in the 2-D domain is trained to enhance an input image, possibly supplemented with other information, such as
depth or segmentation maps. (b) The second set of studies [147], [148], [178], [187], [225] introduces native 3-D operations that produce and
transform 3-D features. This allows the network to reason in 3-D and produce view-consistent outputs. (a) 3-D to 2-D projection as a
preprocessing step. (b) 3-D ↔ 2-D transform as a part of network training.
differentiable layers. Sections VII-A and VII-B discuss such approaches and how they use GAN losses to improve the quality of outputs. In this article, we focus on studies that use GANs to train neural networks and augment the classical rendering pipeline to generate images. For a general survey on the use of neural networks in rendering, refer to the survey paper on neural rendering by Tewari et al. [196].
We divide the studies on GAN-based neural rendering into two parts: 1) studies that treat 3-D to 2-D projection as a preprocessing step and apply neural networks purely in the 2-D domain and 2) studies that incorporate layers that perform differentiable operations to transform features from 3-D to 2-D or vice versa (3-D ↔ 2-D) and learn some implicit form of geometry to provide 3-D understanding to the network.

A. 3-D to 2-D Projection as a Preprocessing Step
A number of studies [109], [132], [135], [139], [158] improve upon traditional techniques by casting the task of rendering into the framework of image-to-image translation, possibly unimodal, multimodal, or conditional, depending on the exact use case. Using given camera parameters, the source 3-D world is first projected to a 2-D feature map containing per-pixel information, such as color, depth, surface normals, and segmentation. This feature map is then fed as input to a generator, which tries to produce desired outputs, usually a realistic-looking RGB image. The deep neural network application happens in the 2-D space after the 3-D world is projected to the camera view, and no features or gradients are backpropagated to the 3-D source world or through the camera projection. A key advantage of this approach is that the traditional graphics rendering pipeline can be easily augmented to immediately take advantage of proven and mature techniques from 2-D image-to-image translation (as discussed in Section IV), without the need for designing and implementing differentiable projection layers or transformations that are part of the deep network during training. This type of framework is illustrated in Fig. 22(a).
Martin-Brualla et al. [135] introduced the notion of rerendering, where a deep neural network takes as input a rendered 2-D image and enhances it (improving colors, boundaries, resolution, and so on) to produce a rerendered image. The full pipeline consists of two steps—a traditional 3-D to 2-D rendering step and a trainable deep network that enhances the rendered 2-D image. The 3-D to 2-D rendering technique can be differentiable or nondifferentiable, but no gradients are backpropagated through this step. This allows one to use more complex rendering techniques. By using this two-step process, the output of a performance capture system, which might suffer from noise, poor color reproduction, and other issues, can be improved. In this particular work, they did not see an improvement from using a GAN loss, perhaps because they trained their system on the limited domain of people and faces, using carefully captured footage.
Meshry et al. [139] and Li et al. [109] extended this approach to the more challenging domain of unstructured photo collections. They produce multiple plausible views
camera parameters. The attributes of the 3-D point are copied to the 2-D pixel location to which it is projected. Mallya et al. [132] precomputed the mapping of the 3-D world point cloud to the pixel locations in the images produced by cameras with known parameters and used this to obtain an estimate of the next frame, referred to as a "guidance image." They learn to output video frames consistent over time and viewpoints by conditioning the generator on these noisy estimates.
In these studies, the use of a generator coupled with an adversarial loss helps produce better-looking outputs conditioned on the input feature maps. Similar to applications of pix2pixHD [218], such as manipulating output images by editing input segmentation maps, Meshry et al. [139] are able to remove people and transient objects from images of landmarks and generate plausible inpainting.
A key motivation of the work of Pittaluga et al. [158] was to explore if a user's privacy can be protected by techniques such as discarding the color of the 3-D points. A very interesting observation was that discarding color information helps prevent accurate reproduction. However, the use of a GAN loss recovers plausible colors and greatly improves the output results, as shown in Fig. 23. GAN losses might also be helpful in cases where it is hard to manually define a good loss function, either due to the inherent ambiguity in determining the desired behavior or the difficulty in fully labeling the data.

Fig. 23. Images synthesized from projected 3-D points with associated depth and SIFT attributes [158]. The top row of images is produced by a generator trained without an adversarial loss, whereas the bottom row uses an adversarial loss. Using an adversarial loss helps generate better details and more plausible colors. Images are from Pittaluga et al. [158].

B. 3-D ↔ 2-D Transform as a Part of Network Training
In the previous set of studies, the geometry of the world or object is explicitly provided, and neural rendering is purely used to enhance the appearance or add details to the traditionally rendered image or feature maps. The studies in this section [147], [148], [178], [187], [225] introduce native 3-D operations in the neural network used to learn from and produce images. These operations enable them to model the geometry and appearance of the scene in the feature space. The general pipeline of this line of studies is illustrated in Fig. 22(b). Learning a 3-D representation and modeling the process of image projection and formation within the network have several advantages: the ability to reason in 3-D, control the pose, and produce a series of consistent views of a scene. In contrast, the neural network shown in Fig. 22(a) operates purely in the 2-D domain.
DeepVoxels [187] learns a persistent 3-D voxel feature representation of a scene given a set of multiview images and their associated camera intrinsic and extrinsic parameters. Features are first extracted from the 2-D views and then lifted to a 3-D volume. This 3-D volume is then integrated into the persistent DeepVoxels representation. These 3-D features are then projected to 2-D using a projection layer, and a new view of the object is synthesized using a U-Net generator. This generator network is trained with an ℓ1 loss and a GAN loss. The authors found that using a GAN loss accelerates the generation of high-frequency details, especially at earlier stages of training. Similar to DeepVoxels [187], visual object networks (VONs) [256] generate a voxel grid from a sample noise vector and use a differentiable projection layer to map the voxel grid to a 2.5-D sketch. Inspired by classical graphics rendering pipelines, this work decomposes image formation into three conditionally independent factors of shape, viewpoint, and texture. Trained with a GAN loss, their model synthesizes more photorealistic images, and the use of the disentangled representation allows for 3-D manipulations, which are not feasible with purely 2-D methods.
HoloGAN [147] proposes a system to learn 3-D voxel feature representations of the world and to render them
to realistic-looking images. Unlike VONs [256], HoloGAN does not require explicit 3-D data or supervision and can do so using unlabeled images (no pose, explicit 3-D shape, or multiple views). By incorporating a 3-D rigid-body transformation module and a 3-D-to-2-D projection module in the network, HoloGAN provides the ability to control the pose of the generated objects. HoloGAN employs a multiscale feature GAN discriminator, and the authors empirically observed that this helps prevent mode collapse. BlockGAN [148] extends the unsupervised approach of HoloGAN [147] to also consider object disentanglement. BlockGAN learns 3-D features per object and for the background. These are combined into 3-D scene features after applying appropriate transformations before projecting them into the 2-D space. One issue with learning scene compositionality without explicit supervision is the conflation of features of the foreground object and the background, which results in visual artifacts when objects or the camera moves. By adding more powerful "style" discriminators (feature discriminators introduced in [147]) to their training scheme, the authors observed that the disentangling of features improved, resulting in cleaner outputs.
SynSin [225] learns an end-to-end model for view synthesis from a single image, without any ground-truth 3-D supervision. Unlike the above studies that internally use a feature voxel representation, SynSin predicts a point cloud of features from the input image and then projects it to new views using a differentiable point cloud renderer. Two-dimensional image features and a depth map are first predicted from the input image. Based on the depth map, the 2-D features are projected to 3-D to obtain the 3-D feature point cloud. The network is trained adversarially with a discriminator based on the one proposed by Wang et al. [218].
One of the drawbacks of voxel-based feature representations is the cubic growth in the memory required to store them. To keep requirements manageable, voxel-based approaches are typically restricted to low resolutions. GRAF [178] proposes to use conditional radiance fields, which are a continuous mapping from a 3-D location and a 2-D viewing direction to an RGB color value, as the intermediate feature representation. They also use a single discriminator similar to PatchGAN [75], with weights that are shared across patches with different receptive fields. This allows them to capture the global context and refine local details.

Table 4. Key differences among 3-D-aware methods. Adversarial losses are used by a range of methods that differ in the type of 3-D feature representation and training supervision.

As summarized in Table 4, the studies discussed in this section use a variety of 3-D feature representations and train their networks using paired input–output with known transformations or unlabeled and unpaired data. The use of a GAN loss is common to all these approaches. This is perhaps because traditional hand-designed losses, such as the ℓ1 loss or even the perceptual loss, are unable to fully capture what makes a synthesized image look unrealistic. Furthermore, in the case where explicit task supervision is unavailable, BlockGAN [148] shows that a GAN loss can help in learning disentangled features by ensuring that the outputs after projection and rendering look realistic. The learnability and flexibility of the GAN loss for the task at hand help provide feedback, guiding how to change the generated image, and, thus, the upstream features, so that it looks as if it were sampled from the distribution of real images. This makes the GAN framework a powerful asset in the toolbox of any neural rendering practitioner.

VIII. LIMITATIONS AND OPEN PROBLEMS
Despite the successful applications introduced above, there are still limitations of GANs that need to be addressed by future work.

A. Evaluation Metrics
Evaluating and comparing different GAN models is difficult. The most popular evaluation metrics are perhaps the inception score (IS) [176] and the Fréchet inception distance (FID) [67], which both have many shortcomings. The IS, for example, is not able to detect intraclass mode collapse [23]. In other words, a model that generates only a single image per class can obtain a high IS. FID can better measure such diversity, but it does not have an unbiased estimator [19]. The kernel inception distance (KID) [19] can capture higher order statistics and has an unbiased estimator but has been empirically found to suffer from high variance [165]. In addition to the above measures that summarize the performance with a single number, there are metrics that separately evaluate the fidelity and diversity of the generator distribution [97], [143], [173].
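As an illustration of how a summary metric like FID is computed, the sketch below (our own code, not from the article) evaluates the Fréchet distance between Gaussians fitted to two sets of feature vectors; in practice, the features come from a pretrained Inception network, and the random vectors here are only stand-ins.

import numpy as np
from scipy import linalg

def frechet_distance(feat_real, feat_fake):
    """FID-style score between two sets of feature vectors (rows are samples):
    ||mu_r - mu_f||^2 + Tr(C_r + C_f - 2 (C_r C_f)^(1/2))."""
    mu_r, mu_f = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_f = np.cov(feat_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    covmean = covmean.real                      # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Usage sketch with random vectors standing in for Inception features.
rng = np.random.default_rng(0)
feat_real = rng.normal(size=(500, 64))
feat_fake = rng.normal(loc=0.1, size=(500, 64))
print(frechet_distance(feat_real, feat_fake))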
B. Instability
Although the regularization techniques introduced in Section III-C have greatly improved the stability of GAN training, GANs are still much more unstable to train than supervised discriminative models or likelihood-based generative models. For example, even the state-of-the-art BigGAN model would eventually collapse in the late stage of training on ImageNet [24]. Also, the final performance is generally very sensitive to hyperparameters [96], [126].
REFERENCES
[1] R. Abdal, Y. Qin, and P. Wonka, "Image2StyleGAN: How to embed images into the StyleGAN latent space?" in Proc. ICCV, 2019, pp. 4432–4441.
[2] R. Abdal, Y. Qin, and P. Wonka, "Image2StyleGAN++: How to edit the embedded images?" in Proc. CVPR, 2020, pp. 8296–8305.
[3] K. Aberman, M. Shi, J. Liao, D. Lischinski, B. Chen, and D. Cohen-Or, "Deep video-based performance cloning," Comput. Graph. Forum, vol. 38, no. 2, pp. 219–233, May 2019.
[4] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, "Generative adversarial networks for extreme learned image compression," in Proc. ICCV, 2019, pp. 221–231.
[5] A. Almahairi, S. Rajeshwar, A. Sordoni, P. Bachman, and A. Courville, "Augmented CycleGAN: Learning many-to-many mappings from unpaired data," in Proc. ICML, 2018, pp. 195–204.
[6] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," in Proc. ICML, 2017, pp. 214–223.
[7] H. Averbuch-Elor, D. Cohen-Or, J. Kopf, and M. F. Cohen, "Bringing portraits to life," ACM Trans. Graph., vol. 36, no. 6, pp. 1–13, Nov. 2017.
[8] J. Lei Ba, J. Ryan Kiros, and G. E. Hinton, "Layer normalization," 2016, arXiv:1607.06450. [Online]. Available: http://arxiv.org/abs/1607.06450
[9] K. Baek, Y. Choi, Y. Uh, J. Yoo, and H. Shim, "Rethinking the truly unsupervised image-to-image translation," 2020, arXiv:2006.06500. [Online]. Available: http://arxiv.org/abs/2006.06500
[15] …, "… image pair," 2020, arXiv:2004.02222. [Online]. Available: http://arxiv.org/abs/2004.02222
[16] S. Benaim and L. Wolf, "One-sided unsupervised domain mapping," in Proc. NeurIPS, 2017, pp. 752–762.
[17] S. Benaim and L. Wolf, "One-shot unsupervised cross domain translation," in Proc. NeurIPS, 2018, pp. 2104–2114.
[18] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013.
[19] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton, "Demystifying MMD GANs," in Proc. ICLR, 2018.
[20] V. Blanz and T. Vetter, "A morphable model for the synthesis of 3D faces," in Proc. 26th Annu. Conf. Comput. Graph. Interact. Techn., 1999, pp. 187–194.
[21] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and L. Zelnik-Manor, "The 2018 PIRM challenge on perceptual image super-resolution," in Proc. ECCV Workshop, 2018, pp. 334–355.
[22] Y. Blau and T. Michaeli, "The perception-distortion tradeoff," in Proc. CVPR, 2018, pp. 6228–6237.
[23] A. Borji, "Pros and cons of GAN evaluation measures," Comput. Vis. Image Understand., vol. 179, pp. 41–65, Feb. 2019.
[24] A. Brock, J. Donahue, and K. Simonyan, "Large scale GAN training for high fidelity natural image synthesis," in Proc. ICLR, 2019.
[25] L. Chai, D. Bau, S.-N. Lim, and P. Isola, "What makes fake images detectable? Understanding properties that generalize," in Proc. ECCV, 2020.
[31] …, "Attention-GAN for object transfiguration in wild images," in Proc. ECCV, 2018, pp. 164–180.
[32] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, "StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation," in Proc. CVPR, 2018, pp. 8789–8797.
[33] A. Clark, J. Donahue, and K. Simonyan, "Efficient video generation on complex datasets," 2019, arXiv:1907.06571. [Online]. Available: https://arxiv.org/abs/1907.06571
[34] T. Cohen and L. Wolf, "Bidirectional one-shot unsupervised domain mapping," in Proc. ICCV, 2019, pp. 1784–1792.
[35] A. Creswell, T. White, V. Dumoulin, K. Arulkumaran, B. Sengupta, and A. A. Bharath, "Generative adversarial networks: An overview," IEEE Signal Process. Mag., vol. 35, no. 1, pp. 53–65, Jan. 2018.
[36] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. CVPR, 2005, pp. 886–893.
[37] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel, "The Helmholtz machine," Neural Comput., vol. 7, no. 5, pp. 889–904, Sep. 1995.
[38] E. de Bézenac, I. Ayed, and P. Gallinari, "Optimal unsupervised domain translation," 2019, arXiv:1906.01292. [Online]. Available: http://arxiv.org/abs/1906.01292
[39] E. L. Denton and V. Birodkar, "Unsupervised learning of disentangled representations from video," in Proc. NeurIPS, 2017, pp. 4414–4423.
[40] L. Dinh, D. Krueger, and Y. Bengio, "NICE: Non-linear independent components estimation," in Proc. ICLR, 2015.
[10] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, pp. 103–120. [41] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density
and J. Guttag, “Synthesizing images of humans in [26] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros, estimation using real NVP,” in Proc. ICLR, 2017.
unseen poses,” in Proc. CVPR, 2018, “Everybody dance now,” in Proc. ICCV, 2019, [42] C. Dong, C. C. Loy, K. He, and X. Tang, “Image
pp. 8340–8348. pp. 5933–5942. super-resolution using deep convolutional
[11] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, [27] J. Chen, J. Chen, H. Chao, and M. Yang, “Image networks,” IEEE Trans. Pattern Anal. Mach. Intell.,
“Recycle-GAN: Unsupervised video retargeting,” blind denoising with generative adversarial vol. 38, no. 2, pp. 295–307, Feb. 2016.
in Proc. ECCV, 2018, pp. 119–135. network based noise modeling,” in Proc. CVPR, [43] C. Dong, C. C. Loy, and X. Tang, “Accelerating the
[12] C. Barnes, E. Shechtman, A. Finkelstein, and 2018, pp. 3155–3164. super-resolution convolutional neural network,”
D. B. Goldman, “PatchMatch: A randomized [28] L. Chen, Z. Li, R. K. Maddox, Z. Duan, and C. Xu, in Proc. ECCV, 2016, pp. 391–407.
correspondence algorithm for structural image “Lip movements generation at a glance,” in Proc. [44] H. Dong, X. Liang, K. Gong, H. Lai, J. Zhu, and
editing,” ACM Trans. Graph., vol. 28, no. 3, p. 24, ECCV, 2018, pp. 520–535. J. Yin, “Soft-gated warping-GAN for pose-guided
2009. [29] L. Chen, R. K. Maddox, Z. Duan, and C. Xu, person image synthesis,” in Proc. NeurIPS, 2018,
[13] D. Bau et al., “GAN dissection: Visualizing and “Hierarchical cross-modal talking face generation pp. 474–484.
understanding generative adversarial networks,” with dynamic pixel-wise loss,” in Proc. CVPR, [45] Y. Du and I. Mordatch, “Implicit generation and
in Proc. ICLR, 2019. 2019, pp. 7832–7841. generalization in energy-based models,” in Proc.
[14] D. Bau, J.-Y. Zhu, J. Wulff, W. Peebles, H. Strobelt, [30] X. Chen, N. Mishra, M. Rohaninejad, and NeurIPS, 2019, pp. 3608–3618.
B. Zhou, and A. Torralba, “Seeing what a GAN P. Abbeel, “PixelSNAIL: An improved [46] P. Esser, E. Sutter, and B. Ommer, “A variational
cannot generate,” in Proc. ICCV, 2019. autoregressive generative model,” in Proc. ICML, u-net for conditional appearance and shape
[15] S. Benaim, R. Mokady, A. Bermano, D. Cohen-Or, 2018, pp. 864–872. generation,” in Proc. CVPR, 2018, pp. 8857–8866.
and L. Wolf, “Structural-analogy from a single [31] X. Chen, C. Xu, X. Yang, and D. Tao, [47] C. Finn, I. Goodfellow, and S. Levine,
“Unsupervised learning for physical interaction “Multimodal unsupervised image-to-image deblurring using conditional adversarial
through video prediction,” in Proc. NeurIPS, 2016, translation,” in Proc. ECCV, 2018, pp. 172–189. networks,” in Proc. CVPR, 2018, pp. 8183–8192.
pp. 64–72. [71] Z. Huang, Y. Xu, C. Lassner, H. Li, and T. Tung, [96] K. Kurach, M. Lučić, X. Zhai, M. Michalski, and
[48] A. Fischer and C. Igel, “An introduction to “ARCH: Animatable reconstruction of clothed S. Gelly, “A large-scale study on regularization
restricted Boltzmann machines,” in Proc. humans,” in Proc. CVPR, 2020, pp. 3093–3102. and normalization in GANs,” in Proc. Int. Conf.
Iberoamerican Congr. Pattern Recognit., 2012, [72] M. Huh, R. Zhang, J.-Y. Zhu, S. Paris, and Mach. Learn., 2019, pp. 3581–3590.
pp. 14–36. A. Hertzmann, “Transforming and projecting [97] T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen,
[49] O. Fried et al., “Text-based editing of talking-head images into class-conditional generative and T. Aila, “Improved precision and recall metric
video,” ACM Trans. Graph., vol. 38, no. 4, networks,” in Proc. ECCV, 2020, pp. 17–34. for assessing generative models,” in Proc. NeurIPS,
pp. 1–14, Jul. 2019. [73] S. Iizuka, E. Simo-Serra, and H. Ishikawa, 2019, pp. 3927–3936.
[50] O. Gafni, L. Wolf, and Y. Taigman, “Vid2Game: “Globally and locally consistent image [98] W.-S. Lai, J.-B. Huang, O. Wang, E. Shechtman,
Controllable characters extracted from real-world completion,” ACM Trans. Graph., vol. 36, no. 4, E. Yumer, and M.-H. Yang, “Learning blind video
videos,” in Proc. ICLR, 2020. pp. 1–14, Jul. 2017. temporal consistency,” in Proc. ECCV, 2018,
[51] T. Galanti, L. Wolf, and S. Benaim, “The role of [74] S. Ioffe and C. Szegedy, “Batch normalization: pp. 170–185.
minimal complexity functions in unsupervised Accelerating deep network training by reducing [99] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and
learning of semantic mappings,” in Proc. ICLR, internal covariate shift,” in Proc. ICML, 2015, O. Winther, “Autoencoding beyond pixels using a
2018. pp. 448–456. learned similarity metric,” in Proc. ICML, 2016,
[52] L. Galteri, L. Seidenari, M. Bertini, and [75] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, pp. 1558–1566.
A. D. Bimbo, “Deep generative adversarial “Image-to-image translation with conditional [100] Y. LeCun, Y. Bengio, and G. Hinton, “Deep
compression artifact removal,” in Proc. ICCV, adversarial networks,” in Proc. CVPR, 2017, learning,” Nature, vol. 521, pp. 436–444,
2017, pp. 4826–4835. pp. 1125–1134. May 2015.
[53] J. Geng, T. Shao, Y. Zheng, Y. Weng, and K. Zhou, [76] A. Jahanian, L. Chai, and P. Isola, “On the [101] Y. LeCun et al., “A tutorial on energy-based
“Warp-guided GANs for single-photo facial ‘steerability’ of generative adversarial networks,” learning,” in Predicting Structured Data.
animation,” ACM Trans. Graph., vol. 37, no. 6, in Proc. ICLR, 2020. Cambridge, MA, USA: MIT Press 2006.
pp. 1–12, Jan. 2019. [77] A. Jamaludin, J. S. Chung, and A. Zisserman, “You [102] C. Ledig et al., “Photo-realistic single image
[54] A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. Torr, said that: Synthesising talking faces from audio,” super-resolution using a generative adversarial
and P. K. Dokania, “Multi-agent diverse generative in Proc. IJCV, 2019, pp. 1767–1779. network,” in Proc. CVPR, 2017.
adversarial networks,” in Proc. CVPR, 2018, [78] Y. Jo and J. Park, “SC-FEGAN: Face editing [103] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn,
pp. 8513–8521. generative adversarial network with user’s sketch and S. Levine, “Stochastic adversarial video
[55] L. Goetschalckx, A. Andonian, A. Oliva, and and color,” in Proc. ICCV, 2019, pp. 1745–1753. prediction,” 2018, arXiv:1804.01523. [Online].
P. Isola, “GANalyze: Toward visual definitions of [79] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual Available: http://arxiv.org/abs/
cognitive image properties,” in Proc. ICCV, 2019, losses for real-time style transfer and 1804.01523
pp. 5744–5753. super-resolution,” in Proc. ECCV, 2016, [104] C.-H. Lee, Z. Liu, L. Wu, and P. Luo, “MaskGAN:
[56] X. Gong, S. Chang, Y. Jiang, and Z. Wang, pp. 694–711. Towards diverse and interactive facial image
“AutoGAN: Neural architecture search for [80] A. Jolicoeur-Martineau, “The relativistic manipulation,” in Proc. CVPR, 2020.
generative adversarial networks,” in Proc. ICCV, discriminator: A key element missing from [105] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and
2019, pp. 3224–3234. standard GAN,” in Proc. ICLR, 2019. M.-H. Yang, “Diverse image-to-image translation
[57] A. Gonzalez-Garcia, J. Van De Weijer, and [81] A. Jolicoeur-Martineau, “On relativistic via disentangled representations,” in Proc. ECCV,
Y. Bengio, “Image-to-image translation for f-divergences,” in Proc. ICML, 2020, 2018, pp. 35–51.
cross-domain disentanglement,” in Proc. NeurIPS, pp. 4931–4939. [106] S. Lee, J. Ha, and G. Kim, “Harmonizing
2018, pp. 1287–1298. [82] D. Joo, D. Kim, and J. Kim, “Generating a fusion maximum likelihood with GANs for multimodal
[58] I. Goodfellow, “NIPS 2016 tutorial: Generative image: One’s identity and another’s shape,” in conditional generation,” in Proc. ICLR, 2019.
adversarial networks,” 2017, arXiv:1701.00160. Proc. CVPR, 2018, pp. 1635–1643. [107] K. Li, T. Zhang, and J. Malik, “Diverse image
[Online]. Available: [83] N. Kalchbrenner et al., “Video pixel networks,” in synthesis from semantic layouts via conditional
http://arxiv.org/abs/1701.00160 Proc. ICML, 2017, pp. 1771–1779. IMLE,” in Proc. ICCV, 2019, pp. 4220–4229.
[59] I. Goodfellow, Y. Bengio, and A. Courville, Deep [84] T. Karras, T. Aila, S. Laine, and J. Lehtinen, [108] Y. Li, C. Huang, and C. C. Loy, “Dense intrinsic
Learning. Cambridge, MA, USA: MIT Press, 2016. “Progressive growing of GANs for improved appearance flow for human pose transfer,” in Proc.
[60] I. Goodfellow et al., “Generative adversarial quality, stability, and variation,” in Proc. ICLR, CVPR, 2019, pp. 3693–3702.
networks,” in Proc. NeurIPS, 2014. 2018. [109] Z. Li, W. Xian, A. Davis, and N. Snavely,
[61] A. Grigorev, A. Sevastopolsky, A. Vakhitov, and [85] T. Karras, S. Laine, and T. Aila, “A style-based “Crowdsampling the plenoptic function,” in Proc.
V. Lempitsky, “Coordinate-based texture generator architecture for generative adversarial ECCV, 2020, pp. 178–196.
inpainting for pose-guided human image networks,” in Proc. CVPR, 2019, pp. 4401–4410. [110] X. Liang, L. Lee, W. Dai, and E. P. Xing, “Dual
generation,” in Proc. CVPR, 2019, [86] T. Karras, S. Laine, M. Aittala, J. Hellsten, motion GAN for future-flow embedded video
pp. 12135–12144. J. Lehtinen, and T. Aila, “Analyzing and improving prediction,” in Proc. NeurIPS, 2017,
[62] K. Gu, Y. Zhou, and T. S. Huang, “FLNet: the image quality of StyleGAN,” in Proc. CVPR, pp. 1744–1752.
Landmark driven fetching and learning network 2020, pp. 8110–8119. [111] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee,
for faithful talking facial animation synthesis,” in [87] H. Kim et al., “Deep video portraits,” ACM Trans. “Enhanced deep residual networks for single
Proc. AAAI, 2020, pp. 10861–10868. Graph., vol. 37, no. 4, pp. 1–14, Aug. 2018. image super-resolution,” in Proc. CVPR Workshop,
[63] R. A. Güler, N. Neverova, and I. Kokkinos, [88] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate 2017, pp. 136–144.
“DensePose: Dense human pose estimation in the image super-resolution using very deep [112] J. Hyun Lim and J. Chul Ye, “Geometric GAN,”
wild,” in Proc. CVPR, 2018, pp. 7297–7306. convolutional networks,” in Proc. CVPR, 2016, 2017, arXiv:1705.02894. [Online]. Available:
[64] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, pp. 1646–1654. http://arxiv.org/abs/1705.02894
and A. C. Courville, “Improved training of [89] K. Kim, Y. Yun, K.-W. Kang, K. Kong, S. Lee, and [113] J. Lin, Y. Pang, Y. Xia, Z. Chen, and J. Luo,
wasserstein GANs,” in Proc. NeurIPS, 2017, S.-J. Kang, “Painting outside as inside: Edge “TuiGAN: Learning versatile image-to-image
pp. 5767–5777. guided image outpainting via bidirectional translation with two unpaired images,” in Proc.
[65] S. Ha, M. Kersner, B. Kim, S. Seo, and D. Kim, rearrangement with progressive step learning,” ECCV, 2020, pp. 18–35.
“MarioNETte: Few-shot face reenactment 2020, arXiv:2010.01810. [Online]. Available: [114] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao,
preserving identity of unseen targets,” in Proc. http://arxiv.org/abs/2010.01810 and B. Catanzaro, “Image inpainting for irregular
AAAI, 2020, pp. 10893–10900. [90] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, holes using partial convolutions,” in Proc. ECCV,
[66] K. He, X. Zhang, S. Ren, and J. Sun, “Deep “Learning to discover cross-domain relations with 2018, pp. 85–100.
residual learning for image recognition,” in Proc. generative adversarial networks,” in Proc. ICML, [115] L. Liu et al., “Neural human video rendering by
CVPR, 2016, pp. 10893–10900. 2017, pp. 1857–1865. learning dynamic textures and rendering-to-video
[67] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, [91] D. Kingma and J. Ba, “Adam: A method for translation,” IEEE Trans. Vis. Comput. Graphics,
and S. Hochreiter, “GANs trained by a two stochastic optimization,” in Proc. ICLR, 2015. early access, May 2020, doi:
time-scale update rule converge to a local Nash [92] D. P. Kingma and P. Dhariwal, “Glow: Generative 10.1109/TVCG.2020.2996594.
equilibrium,” in Proc. NeurIPS, 2017, flow with invertible 1×1 convolutions,” in Proc. [116] L. Liu et al., “Neural rendering and reenactment of
pp. 6626–6637. NeurIPS, 2018, pp. 10215–10224. human actor videos,” ACM Trans. Graph., vol. 38,
[68] G. E. Hinton, “Reducing the dimensionality of [93] D. P. Kingma and M. Welling, “Auto-encoding no. 5, pp. 1–14, Nov. 2019.
data with neural networks,” Science, vol. 313, variational Bayes,” in Proc. ICLR, 2013. [117] M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised
no. 5786, pp. 504–507, Jul. 2006. [94] D. P. Kingma and M. Welling, “An introduction to image-to-image translation networks,” in Proc.
[69] X. Huang and S. Belongie, “Arbitrary style transfer variational autoencoders,” 2019, NeurIPS, 2017.
in real-time with adaptive instance arXiv:1906.02691. [Online]. Available: [118] M.-Y. Liu et al., “Few-shot unsupervised
normalization,” in Proc. ICCV, 2017, http://arxiv.org/abs/1906.02691 image-to-image translation,” in Proc. ICCV, 2019,
pp. 1501–1510. [95] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, pp. 10551–10560.
[70] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, and J. Matas, “DeblurGAN: Blind motion [119] M.-Y. Liu and O. Tuzel, “Coupled generative
adversarial networks,” in Proc. NeurIPS, 2016, generative models,” in Proc. ICML, 2020, [166] Y. Ren, X. Yu, J. Chen, T. H. Li, and G. Li, “Deep
pp. 469–477. pp. 7176–7185. image spatial transformation for person image
[120] S. Liu, X. Zhang, J. Wangni, and J. Shi, [144] K. Nagano et al., “PaGAN: Real-time avatars using generation,” in Proc. CVPR, 2020,
“Normalized diversification,” in Proc. CVPR, 2019, dynamic textures,” ACM Trans. Graph., vol. 37, pp. 7690–7699.
pp. 10306–10315. no. 6, pp. 1–12, Jan. 2019. [167] D. J. Rezende and S. Mohamed, “Variational
[121] W. Liu, Z. Piao, J. Min, W. Luo, L. Ma, and S. Gao, [145] K. Nazeri, E. Ng, T. Joseph, F. Z. Qureshi, and inference with normalizing flows,” in Proc. ICML,
“Liquid warping GAN: A unified framework for M. Ebrahimi, “EdgeConnect: Generative image 2015, pp. 1530–1538.
human motion imitation, appearance transfer and inpainting with adversarial edge learning,” 2019, [168] D. J. Rezende, S. Mohamed, and D. Wierstra,
novel view synthesis,” in Proc. ICCV, 2019, arXiv:1901.00212. [Online]. Available: “Stochastic backpropagation and approximate
pp. 5904–5913. http://arxiv.org/abs/1901.00212 inference in deep generative models,” in Proc.
[122] X. Liu, G. Yin, J. Shao, and X. Wang, “Learning to [146] N. Neverova, R. Alp Guler, and I. Kokkinos, “Dense ICML, 2014, pp. 1278–1286.
predict layout-to-image conditional convolutions pose transfer,” in Proc. ECCV, 2018, pp. 123–138. [169] K. Roth, A. Lucchi, S. Nowozin, and T. Hofmann,
for semantic image synthesis,” in Proc. NeurIPS, [147] T. Nguyen-Phuoc, C. Li, L. Theis, C. Richardt, and “Stabilizing training of generative adversarial
2019, pp. 570–580. Y.-L. Yang, “HoloGAN: Unsupervised learning of networks through regularization,” in Proc.
[123] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, 3d representations from natural images,” in Proc. NeurIPS, 2017, pp. 2018–2028.
and M. J. Black, “SMPL: A skinned multi-person CVPR, 2019, pp. 7588–7597. [170] K. Saito, K. Saenko, and M.-Y. Liu, “COCO-FUNIT:
linear model,” ACM Trans. Graph., vol. 34, no. 6, [148] T. Nguyen-Phuoc, C. Richardt, L. Mai, Y.-L. Yang, Few-shot unsupervised image translation with a
pp. 1–16, Nov. 2015. and N. Mitra, “BlockGAN: Learning 3D content conditioned style encoder,” in Proc. ECCV,
[124] D. Lorenz, L. Bereska, T. Milbich, and B. Ommer, object-aware scene representations from 2020, pp. 382–398.
“Unsupervised part-based disentangling of object unlabelled images,” 2020, arXiv:2002.08988. [171] M. Saito, E. Matsumoto, and S. Saito, “Temporal
shape and appearance,” in Proc. CVPR, 2019, [Online]. Available: http://arxiv.org/ generative adversarial nets with singular value
pp. 10955–10964. abs/2002.08988 clipping,” in Proc. ICCV, 2017, pp. 2830–2839.
[125] W. Lotter, G. Kreiman, and D. Cox, “Deep [149] Y. Nirkin, Y. Keller, and T. Hassner, “FSGAN: [172] S. Saito, Z. Huang, R. Natsume, S. Morishima,
predictive coding networks for video prediction Subject agnostic face swapping and reenactment,” A. Kanazawa, and H. Li, “PIFu: Pixel-aligned
and unsupervised learning,” in Proc. ICLR, 2017. in Proc. ICCV, 2019, pp. 7184–7193. implicit function for high-resolution clothed
[126] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and [150] S. Nowozin, B. Cseke, and R. Tomioka, “f-GAN: human digitization,” in Proc. ICCV, 2019,
O. Bousquet, “Are GANs created equal? A Training generative neural samplers using pp. 2304–2314.
large-scale study,” in Proc. Adv. Neural Inf. Process. variational divergence minimization,” in Proc. [173] M. S. Sajjadi, O. Bachem, M. Lucic, O. Bousquet,
Syst., 2018, pp. 700–709. NeurIPS, 2016, pp. 271–279. and S. Gelly, “Assessing generative models via
[127] C. Ma, C.-Y. Yang, X. Yang, and M.-H. Yang, [151] A. Odena, C. Olah, and J. Shlens, “Conditional precision and recall,” in Proc. NeurIPS, 2018,
“Learning a no-reference quality metric for image synthesis with auxiliary classifier GANs,” in pp. 5228–5237.
single-image super-resolution,” Comput. Vis. Proc. ICML, 2017, pp. 2642–2651. [174] M. S. Sajjadi, B. Scholkopf, and M. Hirsch,
Image Understand., vol. 158, pp. 1–16, [152] K. Olszewski et al., “Realistic dynamic facial “EnhanceNet: Single image super-resolution
May 2017. textures from a single image using GANs,” in Proc. through automated texture synthesis,” in Proc.
[128] L. Ma, X. Jia, S. Georgoulis, T. Tuytelaars, and ICCV, 2017, pp. 5429–5438. ICCV, 2017, pp. 4491–4500.
L. Van Gool, “Exemplar guided unsupervised [153] A. van den Oord et al., “WaveNet: A generative [175] R. Salakhutdinov and G. Hinton, “Deep
image-to-image translation with semantic model for raw audio,” 2016, arXiv:1609.03499. Boltzmann machines,” in Artificial Intelligence and
consistency,” in Proc. ICLR, 2019. [Online]. Available: http://arxiv.org/ Statistics. 2009, pp. 448–455.
[129] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and abs/1609.03499 [176] T. Salimans, I. Goodfellow, W. Zaremba,
L. Van Gool, “Pose guided person image [154] Z. Pan, W. Yu, X. Yi, A. Khan, F. Yuan, and V. Cheung, A. Radford, and X. Chen, “Improved
generation,” in Proc. NeurIPS, 2017, pp. 406–416. Y. Zheng, “Recent progress on generative techniques for training GANs,” in Proc. NeurIPS,
[130] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, adversarial networks (GANs): A survey,” IEEE 2016, pp. 2234–2242.
B. Schiele, and M. Fritz, “Disentangled person Access, vol. 7, pp. 36322–36333, 2019. [177] T. Salimans, A. Karpathy, X. Chen, and
image generation,” in Proc. CVPR, 2018, [155] T. Park, A. A. Efros, R. Zhang, and J.-Y. Zhu, D. P. Kingma, “PixelCNN++: Improving the
pp. 99–108. “Contrastive learning for unpaired image-to-image PixelCNN with discretized logistic mixture
[131] S. Maeda, “Unpaired image super-resolution using translation,” in Proc. ECCV, 2020, pp. 319–345. likelihood and other modifications,” in Proc. ICLR,
pseudo-supervision,” in Proc. CVPR, 2020, [156] T. Park, M.-Y. Liu, T.-C. Wang, and J.-Y. Zhu, 2017.
pp. 291–300. “Semantic image synthesis with spatially-adaptive [178] K. Schwarz, Y. Liao, M. Niemeyer, and A. Geiger,
[132] A. Mallya, T.-C. Wang, K. Sapra, and M.-Y. Liu, normalization,” in Proc. CVPR, 2019, “GRAF: Generative radiance fields for 3D-aware
“World-consistent video-to-video synthesis,” in pp. 2337–2346. image synthesis,” 2020, arXiv:2007.02442.
Proc. ECCV, 2020, pp. 359–378. [157] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, [Online]. Available: http://arxiv.
[133] Q. Mao, H.-Y. Lee, H.-Y. Tseng, S. Ma, and and A. A. Efros, “Context encoders: Feature org/abs/2007.02442
M.-H. Yang, “Mode seeking generative adversarial learning by inpainting,” in Proc. CVPR, 2016, [179] T. R. Shaham, T. Dekel, and T. Michaeli, “SinGAN:
networks for diverse image synthesis,” in Proc. pp. 2536–2544. Learning a generative model from a single natural
CVPR, 2019, pp. 1429–1437. [158] F. Pittaluga, S. J. Koppal, S. B. Kang, and image,” in Proc. ICCV, 2019, pp. 4570–4580.
[134] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. N. Sinha, “Revealing scenes by inverting [180] Y. Shen, J. Gu, X. Tang, and B. Zhou, “Interpreting
S. P. Smolley, “Least squares generative structure from motion reconstructions,” in Proc. the latent space of GANs for semantic face
adversarial networks,” in Proc. ICCV, 2017, CVPR, 2019, pp. 145–154. editing,” in Proc. CVPR, 2020, pp. 9243–9252.
pp. 2794–2802. [159] A. Pumarola, A. Agudo, A. M. Martinez, [181] Z. Shen, W.-S. Lai, T. Xu, J. Kautz, and M.-H. Yang,
[135] R. Martin-Brualla et al., “LookinGood: Enhancing A. Sanfeliu, and F. Moreno-Noguer, “GANimation: “Deep semantic face deblurring,” in Proc. CVPR,
performance capture with real-time neural Anatomically-aware facial animation from a single 2018, pp. 8260–8269.
re-rendering,” ACM Trans. Graph., vol. 37, no. 6, image,” in Proc. ECCV, 2018, pp. 818–833. [182] W. Shi et al., “Real-time single image and video
pp. 1–14, Jan. 2019. [160] A. Pumarola, A. Agudo, A. M. Martinez, super-resolution using an efficient sub-pixel
[136] M. Mathieu, C. Couprie, and Y. LeCun, “Deep A. Sanfeliu, and F. Moreno-Noguer, “GANimation: convolutional neural network,” in Proc. CVPR,
multi-scale video prediction beyond mean square One-shot anatomically consistent facial 2016, pp. 1874–1883.
error,” in Proc. ICLR, 2016. animation,” Int. J. Comput. Vis., vol. 128, [183] A. Shysheya et al., “Textured neural avatars,” in
[137] Y. A. Mejjati, C. Richardt, J. Tompkin, D. Cosker, pp. 698–713, Aug. 2020. Proc. CVPR, 2019, pp. 1874–1883.
and K. I. Kim, “Unsupervised attention-guided [161] A. Pumarola, A. Agudo, A. Sanfeliu, and [184] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci,
image-to-image translation,” in Proc. NeurIPS, F. Moreno-Noguer, “Unsupervised person image and N. Sebe, “First order motion model for image
2018, pp. 3693–3703. synthesis in arbitrary poses,” in Proc. IEEE Conf. animation,” in Proc. NeurIPS, 2019,
[138] L. Mescheder, A. Geiger, and S. Nowozin, “Which Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7137–7147.
training methods for GANs do actually converge?” pp. 8620–8628. [185] A. Siarohin, S. Lathuiliére, S. Tulyakov, E. Ricci,
ICML, 2018, pp. 3693–3703. [162] S. Qian et al., “Make a face: Towards arbitrary and N. Sebe, “Animating arbitrary objects via deep
[139] M. Meshry et al., “Neural rerendering in the wild,” high fidelity face manipulation,” in Proc. ICCV, motion transfer,” in Proc. CVPR, 2019,
in Proc. CVPR, 2019, pp. 6878–6887. 2019, pp. 10033–10042. pp. 2377–2386.
[140] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida, [163] A. Radford, L. Metz, and S. Chintala, [186] A. Siarohin, E. Sangineto, S. Lathuiliere, and
“Spectral normalization for generative adversarial “Unsupervised representation learning with deep N. Sebe, “Deformable GANs for pose-based human
networks,” in Proc. ICLR, 2018. convolutional generative adversarial networks,” in image generation,” in Proc. CVPR, 2018,
[141] T. Miyato and M. Koyama, “cGANs with projection Proc. ICLR, 2015. pp. 3408–3416.
discriminator,” in Proc. ICLR, 2018. [164] A. Raj, P. Sangkloy, H. Chang, J. Hays, D. Ceylan, [187] V. Sitzmann, J. Thies, F. Heide, M. Nießner,
[142] S. Mo, M. Cho, and J. Shin, “InstaGAN: and J. Lu, “SwapNet: Image based garment G. Wetzstein, and M. Zollhofer, “DeepVoxels:
Instance-aware image-to-image translation,” in transfer,” in Proc. ECCV, 2018, pp. 679–695. Learning persistent 3D feature embeddings,” in
Proc. ICLR, 2019. [165] S. Ravuri and O. Vinyals, “Seeing is not necessarily Proc. CVPR, 2019, pp. 2437–2446.
[143] M. F. Naeem, S. J. Oh, Y. Uh, Y. Choi, and J. Yoo, believing: Limitations of BigGANs for data [188] C. K. Sønderby, T. Raiko, L. Maaløe,
“Reliable fidelity and diversity metrics for augmentation,” in Proc. ICLR Workshop, 2019. S. K. Sønderby, and O. Winther, “Ladder
variational autoencoders,” in Proc. NeurIPS, 2016, [212] J. Walker, C. Doersch, A. Gupta, and M. Hebert, translation,” in Proc. ICCV, 2017, pp. 2849–2857.
pp. 3738–3746. “An uncertain future: Forecasting from static [236] J. Yu, Y. Fan, and T. Huang, “Wide activation for
[189] S. Song, W. Zhang, J. Liu, and T. Mei, images using variational autoencoders,” in Proc. efficient and accurate image super-resolution,” in
“Unsupervised person image generation with ECCV, 2016, pp. 835–851. Proc. BMVC, 2019.
semantic parsing transformation,” in Proc. CVPR, [213] J. Walker, K. Marino, A. Gupta, and M. Hebert, [237] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and
2019, pp. 2357–2366. “The pose knows: Video forecasting by generating T. S. Huang, “Generative image inpainting with
[190] Y. Song, J. Zhu, D. Li, A. Wang, and H. Qi, pose futures,” in Proc. ICCV, 2017, pp. 3332–3341. contextual attention,” in Proc. CVPR, 2018,
“Talking face generation by conditional recurrent [214] M. Wang et al., “Example-guided style-consistent pp. 5505–5514.
adversarial network,” in Proc. IJCAI, 2019, image synthesis from semantic labeling,” in Proc. [238] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and
pp. 919–925. CVPR, 2019, pp. 1495–1504. T. S. Huang, “Free-form image inpainting with
[191] N. Srivastava, E. Mansimov, and R. Salakhudinov, [215] S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and gated convolution,” in Proc. ICCV, 2019,
“Unsupervised learning of video representations A. A. Efros, “CNN-generated images are pp. 4471–4480.
using LSTMs,” in Proc. ICML, 2015, pp. 843–852. surprisingly easy to spot.. for now,” in Proc. CVPR, [239] E. Zakharov, A. Shysheya, E. Burkov, and
[192] S. Suwajanakorn, S. M. Seitz, and 2020, pp. 8692–8701. V. Lempitsky, “Few-shot adversarial learning of
I. Kemelmacher-Shlizerman, “Synthesizing [216] T.-C. Wang, M.-Y. Liu, A. Tao, G. Liu, J. Kautz, and realistic neural talking head models,” in Proc.
Obama: Learning lip sync from audio,” ACM B. Catanzaro, “Few-shot video-to-video synthesis,” ICCV, 2019, pp. 9459–9468.
Trans. Graph., vol. 36, no. 4, pp. 1–13, Jul. 2017. in Proc. NeurIPS, 2019, pp. 8695–8704. [240] M. Zanfir, A.-I. Popa, A. Zanfir, and
[193] Y. Tai, J. Yang, X. Liu, and C. Xu, “MemNet: [217] T.-C. Wang et al., “Video-to-video synthesis,” in C. Sminchisescu, “Human appearance transfer,” in
A persistent memory network for image Proc. NeurIPS, 2018, pp. 1152–1164. Proc. CVPR, 2018, pp. 5391–5399.
restoration,” in Proc. ICCV, 2017, pp. 4539–4547. [218] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, [241] Y. Zeng, J. Fu, H. Chao, and B. Guo, “Learning
[194] H. Tang, X. Qi, D. Xu, P. H. S. Torr, and N. Sebe, and B. Catanzaro, “High-resolution image pyramid-context encoder network for high-quality
“Edge guided GANs with semantic preserving for synthesis and semantic manipulation with image inpainting,” in Proc. CVPR, 2019,
semantic image synthesis,” 2020, conditional GANs,” in Proc. CVPR, 2018, pp. 1486–1494.
arXiv:2003.13898. [Online]. Available: pp. 8798–8807. [242] H. Zhang, I. Goodfellow, D. Metaxas, and
http://arxiv.org/abs/2003.13898 [219] T.-C. Wang, A. Mallya, and M.-Y. Liu, “One-shot A. Odena, “Self-attention generative adversarial
[195] P. Teterwak et al., “Boundless: Generative free-view neural talking-head synthesis for video networks,” in Proc. ICML, 2019, pp. 7354–7363.
adversarial networks for image extension,” in conferencing,” 2020, arXiv:2011.15126. [Online]. [243] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang,
Proc. ICCV, 2019, pp. 10521–10530. Available: http://arxiv.org/abs/2011.15126 “Beyond a Gaussian denoiser: Residual learning of
[196] A. Tewari et al., “State of the art on neural [220] X. Wang et al., “ESRGAN: Enhanced deep CNN for image denoising,” IEEE Trans.
rendering,” Comput. Graph. Forum (EG STAR), super-resolution generative adversarial networks,” Image Process., vol. 26, no. 7, pp. 3142–3155,
vol. 39, no. 2, pp. 701–727, 2020. in Proc. ECCV, 2018, pp. 63–79. Jul. 2017.
[197] J. Thies, M. Zollhöfer, and M. Nießner, “Deferred [221] Y. Wang, S. Khan, A. Gonzalez-Garcia, [244] P. Zhang, B. Zhang, D. Chen, L. Yuan, and F. Wen,
neural rendering: Image synthesis using neural J. van de Weijer, and F. S. Khan, “Semi-supervised “Cross-domain correspondence learning for
textures,” ACM Trans. Graph., vol. 38, no. 4, learning for few-shot image-to-image translation,” exemplar-based image translation,” in Proc. CVPR,
pp. 1–12, Jul. 2019. in Proc. CVPR, 2020, pp. 4453–4462. 2020, pp. 5143–5153.
[198] J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, [222] Y. Wang, X. Tao, X. Qi, X. Shen, and J. Jia, “Image [245] X. Zhang, S. Karaman, and S.-F. Chang, “Detecting
M. Stamminger, and C. Theobalt, “Real-time inpainting via generative multi-column and simulating artifacts in GAN fake images,” in
expression transfer for facial reenactment,” ACM convolutional neural networks,” in Proc. NeurIPS, Proc. WIFD, 2019, pp. 1–6.
Trans. Graph., vol. 34, no. 6, pp. 1–14, Nov. 2015. 2018, pp. 331–340. [246] Y. Zhang, K. Li, K. Li, L. Wang, B. Zhong, and
[199] J. Thies, M. Zollhofer, M. Stamminger, [223] Z. Wang, Q. She, and T. E. Ward, “Generative Y. Fu, “Image super-resolution using very deep
C. Theobalt, and M. Nießner, “Face2Face: adversarial networks in computer vision: A survey residual channel attention networks,” in Proc.
Real-time face capture and reenactment of RGB and taxonomy,” 2019, arXiv:1906.01529. ECCV, 2018, pp. 286–301.
videos,” in Proc. CVPR, 2016, pp. 2387–2395. [Online]. Available: http://arxiv.org/ [247] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu,
[200] J. Thies, M. Zollhöfer, C. Theobalt, abs/1906.01529 “Residual dense network for image
M. Stamminger, and M. Niessner, “Headon: [224] C.-Y. Weng, B. Curless, and super-resolution,” in Proc. CVPR, 2018,
Real-time reenactment of human portrait videos,” I. Kemelmacher-Shlizerman, “Photo wake-up: 3D pp. 2472–2481.
ACM Trans. Graph., vol. 37, no. 4, pp. 1–13, character animation from a single photo,” in Proc. [248] B. Zhao, X. Wu, Z.-Q. Cheng, H. Liu, Z. Jie,
Aug. 2018. CVPR, 2019, pp. 5908–5917. and J. Feng, “Multi-view image generation
[201] T. Tieleman and G. Hinton, “Lecture [225] O. Wiles, G. Gkioxari, R. Szeliski, and J. Johnson, from a single-view,” in Proc. MM, 2018,
6.5—RmsProp: Divide the gradient by a running “SynSin: End-to-end view synthesis from a single pp. 383–391.
average of its recent magnitude,” in Proc. Neural image,” in Proc. CVPR, 2020, pp. 7467–7477. [249] C. Zheng, T.-J. Cham, and J. Cai, “Pluralistic
Netw. Mach. Learn. (COURSERA), 2012. [226] O. Wiles, A. S. Koepke, and A. Zisserman, image completion,” in Proc. CVPR, 2019,
[202] I. Tolstikhin, O. Bousquet, S. Gelly, and “X2Face: A network for controlling face pp. 1438–1447.
B. Schoelkopf, “Wasserstein auto-encoders,” in generation using images, audio, and pose codes,” [250] H. Zheng, H. Liao, L. Chen, W. Xiong, T. Chen, and
Proc. ICLR, 2018. in Proc. ECCV, 2018, pp. 670–686. J. Luo, “Example-guided image synthesis across
[203] T. Tong, G. Li, X. Liu, and Q. Gao, “Image [227] W. Wu, Y. Zhang, C. Li, C. Qian, and C. Change arbitrary scenes using masked spatial-channel
super-resolution using dense skip connections,” in Loy, “ReenactGAN: Learning to reenact faces via attention and self-supervision,” 2020,
Proc. ICCV, 2017, pp. 4799–4807. boundary transfer,” in Proc. ECCV, 2018, arXiv:2004.10024. [Online]. Available:
[204] M. Tschannen, E. Agustsson, and M. Lucic, “Deep pp. 603–619. http://arxiv.org/abs/2004.10024
generative models for distribution-preserving [228] Y. Wu and K. He, “Group normalization,” in Proc. [251] H. Zhou, Y. Liu, Z. Liu, P. Luo, and X. Wang,
lossy compression,” in Proc. NeurIPS, 2018, ECCV, 2018, pp. 3–19. “Talking face generation by adversarially
pp. 5929–5940. [229] W. Xiong et al., “Foreground-aware image disentangled audio-visual representation,” in Proc.
[205] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, inpainting,” in Proc. CVPR, 2019, pp. 5840–5848. AAAI, 2019, pp. 9299–9306.
“MoCoGAN: Decomposing motion and content for [230] T. Xue, J. Wu, K. Bouman, and B. Freeman, “Visual [252] Y. Zhou, Z. Wang, C. Fang, T. Bui, and T. L. Berg,
video generation,” in Proc. CVPR, 2018, dynamics: Probabilistic future frame synthesis via “Dance dance generation: Motion transfer for
pp. 1526–1535. cross convolutional networks,” in Proc. NeurIPS, Internet videos,” 2019, arXiv:1904.00129.
[206] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance 2016, pp. 91–99. [Online]. Available:
normalization: The missing ingredient for fast [231] C. Yang, T. Kim, R. Wang, H. Peng, and http://arxiv.org/abs/1904.00129
stylization,” 2016, arXiv:1607.08022. [Online]. C.-C. J. Kuo, “ESTHER: Extremely simple image [253] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and
Available: http://arxiv.org/abs/1607.08022 translation through self-regularization,” in Proc. A. A. Efros, “Generative visual manipulation on
[207] A. Van den Oord et al., “Conditional image BMVC, 2018, pp. 91–99. the natural image manifold,” in Proc. ECCV, 2016,
generation with PixelCNN decoders,” in Proc. [232] C. Yang, T. Kim, R. Wang, H. Peng, and pp. 597–613.
NeurIPS, 2016, pp. 4790–4798. C.-C. J. Kuo, “Show, attend, and translate: [254] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros,
[208] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, Unsupervised image translation with “Unpaired image-to-image translation using
“Decomposing motion and content for natural self-regularization and attention,” IEEE Trans. cycle-consistent adversarial networks,” in Proc.
video sequence prediction,” in Proc. ICLR, 2017. Image Process., vol. 28, no. 10, pp. 4845–4856, ICCV, 2017, pp. 2223–2232.
[209] D. Vlasic, M. Brand, H. Pfister, and J. Popovic, Oct. 2019. [255] J.-Y. Zhu et al., “Toward multimodal
“Face transfer with multilinear models,” ACM [233] C. Yang, Z. Wang, X. Zhu, C. Huang, J. Shi, and image-to-image translation,” in Proc. NeurIPS,
Trans. Graph., vol. 24, no. 3, pp. 426–433, D. Lin, “Pose guided human video generation,” in 2017, pp. 465–476.
Jul. 2005. Proc. ECCV, 2018, pp. 201–216. [256] J.-Y. Zhu et al., “Visual object networks: Image
[210] C. Vondrick, H. Pirsiavash, and A. Torralba, [234] D. Yang, S. Hong, Y. Jang, T. Zhao, and H. Lee, generation with disentangled 3D representations,”
“Generating videos with scene dynamics,” in Proc. “Diversity-sensitive conditional generative in Proc. NeurIPS, 2018, pp. 118–129.
NeurIPS, 2016, pp. 613–621. adversarial networks,” in Proc. ICLR, 2019, [257] Z. Zhu, T. Huang, B. Shi, M. Yu, B. Wang, and
[211] K. Vougioukas, S. Petridis, and M. Pantic, pp. 201–216. X. Bai, “Progressive pose attention transfer for
“Realistic speech-driven facial animation with [235] Z. Yi, H. Zhang, P. Tan, and M. Gong, “DualGAN: person image generation,” in Proc. CVPR, 2019,
GANs,” in Proc. IJCV, 2019, pp. 1–16. Unsupervised dual learning for image-to-image pp. 2347–2356.