Generative Adversarial Networks For Image and Video Synthesis: Algorithms and Applications

Download as pdf or txt
Download as pdf or txt
You are on page 1of 24

Generative Adversarial

Networks for Image and


Video Synthesis:
Algorithms and Applications
This article provides an overview of generative adversarial networks (GANs) with a
special focus on algorithms and applications for visual synthesis.
By M ING -Y U L IU , X UN H UANG , J IAHUI Y U , Member IEEE, T ING -C HUN WANG , AND A RUN M ALLYA

ABSTRACT | The generative adversarial network (GAN) frame- networks—a generator network G and a discriminator
work has emerged as a powerful tool for various image and network D—which are trained jointly by playing a
video synthesis tasks, allowing the synthesis of visual content zero-sum game where the objective of the generator is
in an unconditional or input-conditional manner. It has enabled to synthesize fake data that resembles real data and the
the generation of high-resolution photorealistic images and objective of the discriminator is to distinguish between real
videos, a task that was challenging or impossible with prior and fake data. When the training is successful, the gener-
methods. It has also led to the creation of many new applica- ator is an approximator of the underlying data generation
tions in content creation. In this article, we provide an overview mechanism in the sense that the distribution of the fake
of GANs with a special focus on algorithms and applications data converges to the real one. Due to the distribution
for visual synthesis. We cover several important techniques matching capability, GANs have become a popular tool
to stabilize GAN training, which has a reputation for being for various data synthesis and manipulation problems,
notoriously difficult. We also discuss its applications to image especially in the visual domain.
translation, image processing, video synthesis, and neural GAN’s rise also marks another major success of deep
rendering. learning in replacing hand-designed components with
machine-learned components in modern computer vision
KEYWORDS | Computer vision; generative adversarial net-
pipelines. As deep learning has directed the community
works (GANs); image and video synthesis; image processing;
to abandon hand-designed features, such as the histogram
neural rendering.
of oriented gradients (HOG) [36], for deep features com-
puted by deep neural networks, the objective function
I. I N T R O D U C T I O N used to train the networks remains largely hand-designed.
The generative adversarial network (GAN) framework is While this is not a major issue for a classification task since
a deep learning architecture [59], [100] introduced by effective and descriptive objective functions, such as the
Goodfellow et al. [60]. It consists of two interacting neural cross-entropy loss exist, this is a serious hurdle for a gener-
ation task. After all, how can one hand-design a function to
Manuscript received August 4, 2020; revised November 26, 2020; accepted
guide a generator to produce a better cat image? How can
December 24, 2020. Date of publication February 1, 2021; date of current we even mathematically describe “felineness” in an image?
version April 30, 2021. (All the authors contributed equally to this work.)
(Corresponding author: Ming-Yu Liu.)
GANs address the issue by deriving a functional form
Ming-Yu Liu, Xun Huang, Ting-Chun Wang, and Arun Mallya are with of the objective using training data. As the discriminator
NVIDIA, Santa Clara, CA 95050-2519 USA (e-mail: [email protected]).
is trained to tell whether an input image is a cat image
Jiahui Yu is with Google, Mountain View, CA 10011 USA.
from the training data set or one synthesized by the gen-
Digital Object Identifier 10.1109/JPROC.2021.3049196 erator, it defines an objective function that can guide the

0018-9219 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.

Vol. 109, No. 5, May 2021 | P ROCEEDINGS OF THE IEEE 839

Authorized licensed use limited to: Rutgers University. Downloaded on May 15,2021 at 09:39:26 UTC from IEEE Xplore. Restrictions apply.
Liu et al.: GANs for Image and Video Synthesis: Algorithms and Applications

network architectures, and regularization techniques, have


been proposed to stabilize GAN training. We will review
several representative approaches in Section III. In Fig. 2,
we illustrate the progress of GANs over the past few years.
In the original GAN formulation [60], the generator is
formulated as a mapping function that converts a simple,
unconditional distribution, such as a uniform distribu-
tion or a Gaussian distribution, to complex data distri-
bution, such as a natural image distribution. We, now,
Fig. 1. Unconditional versus conditional GANs. (a) In unconditional
generally refer to this formulation as the unconditional
GANs, the generator converts a noise input z to a fake image G(z), GAN framework. While the unconditional framework has
where z ∼ Z , and Z is usually a Gaussian random variable. The several important applications on its own, the lack of
discriminator tells apart real images x from the training data set D controllability in the generation outputs makes it unfit for
and fake images from G. (b) In conditional GANs, the generator
many applications. This has motivated the development of
takes an additional input y as the control signal, which could be
another image (image-to-image translation), text (text-to-image
the conditional GAN framework. In the conditional frame-
synthesis), or a categorical label (label-to-image synthesis). The work, the generator additionally takes a control signal
discriminator tells apart real from fake by leveraging the as input. The signal can take in many different forms,
information in y . In both settings, the combination of the including category labels, texts, images, layouts, sounds,
discriminator and real training data defines an objective function for
and even graphs. The goal of the generator is to produce
image synthesis. This data-driven objective function definition is a
powerful tool for many computer vision problems.
outputs corresponding to the signal. In Fig. 1, we compare
these two frameworks. This conditional GAN framework
has led to many exciting applications. We will cover several
representative ones through Sections IV–VII.
GANs have led to the creation of many exciting new
generator in improving its generation based on its current applications. For example, it has been the core building
network weights. The generator can keep improving as block to semantic image synthesis algorithms that con-
long as the discriminator can differentiate real and fake cern converting human-editable semantic representations,
cat images. The only way that a generator can beat the such as segmentation masks or sketches, to photorealistic
discriminator is to produce images similar to the real images. GANs have also led to the development of many
images used for training. Since all the training images image-to-image translation methods that aim to translate
contain cats, the generator output must contain cats to win an image in one domain to a corresponding image in
the game. Moreover, when we replace the cat images with a different domain. These methods find a wide range
dog images, we can use the same method to train a dog of applicability, ranging from image editing to domain
image generator. The objective function for the generator adaptation. We will review some algorithms in Section IV.
is defined by the training data set and the discriminator We can, now, find GAN’s footprint in many visual
architecture. It is, thus, a very flexible framework to define processing systems. For example, for image restoration,
the objective function for a generation task, as illustrated super-resolution (SR), and inpainting, where the goal is
in Fig. 1. to transform an input image distribution to a target image
However, despite its excellent modeling power, GANs distribution, GANs have been shown to generate results
are notoriously difficult to train because it involves chasing with much better visual quality than those produced with
a moving target. Not only do we need to make sure that the
generator can reach the target but also that the target can
reach a desirable level of goodness. Recall that the goal
of the discriminator is to differentiate real and fake data.
As the generator changes, the fake data distribution also
changes as well. This poses a new classification problem to
the discriminator, distinguishing the same real but a new
kind of fake data distribution, one that is presumably more
similar to the real data distribution. As the discriminator
is updated according to the new classification problem,
it induces a new objective for the generator. Without
careful control of the dynamics, a learning algorithm tends
to experience failures in GAN training. Often, the discrim-
Fig. 2. GAN progress on face synthesis. The progress of GANs on
inator becomes too strong and provides strong gradients
face synthesis over the years. From left to right, we have face
that push the generator to a numerically unstable region. synthesis results by (a) the original GAN [60], (b) DCGAN [163],
This is a well-recognized issue. Fortunately, over the years, (c) CoGAN [119], (d) PgGAN [84], and (e) StyleGAN [85]. This image
various approaches, including better training algorithms, was originally created and shared by Ian Goodfellow on Twitter.

840 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 5, May 2021

Authorized licensed use limited to: Rutgers University. Downloaded on May 15,2021 at 09:39:26 UTC from IEEE Xplore. Restrictions apply.
Liu et al.: GANs for Image and Video Synthesis: Algorithms and Applications

traditional methods. We will provide an overview of GAN are also descendants of latent variable models, such
methods in these image-processing tasks in Section V. as principal component analysis and autoencoders [18],
Video synthesis is another exciting area that GANs have which concerns representing high-dimensional data x
shown promising results. Many research studies have uti- using lower dimensional latent variables z . In terms of
lized GANs to synthesize realistic human videos or transfer structure, a VAE employs an inference model q(z|x; φ)
motions from one person to another for various entertain- and a generation model p(x|z; θ)p(z), where p(z) is usu-
ment applications, which we will review in Section VI. ally a Gaussian distribution, which we can easily sample
Finally, due to its great capability in generating photo- from, and q(z|x; φ) approximates the posterior p(z|x; θ).
realistic images, GANs have played an important role in Both the inference and generation models are imple-
the development of neural rendering—using neural net- mented using feed-forward neural networks. VAE training
works to boost the performance of the graphics rendering is through maximizing the evidence lower bound (ELBO)
pipeline. We will cover GAN studies in Section VII. of log p(x; θ), and the nondifferentiability of the stochastic
sampling is elegantly handled by the reparameterization
II. R E L AT E D W O R K S trick [94]. One can also show that maximization of the
Several GAN review articles exist, including the intro- ELBO is equivalent to minimizing the Kullback–Leibler
ductory article by Goodfellow [58]. The articles by (KL) divergence
Creswell et al. [35] and Pan et al. [154] summarize GAN
methods prior to 2018. Wang et al. [223] provided a KL (q(x)q(z|x; φ)||p(z)p(x|z; θ)) (2)
taxonomy of GANs. Our work differs from the prior studies
in that we provide a more contemporary summary of GANs
with a focus on image and video synthesis. where q(x) is the empirical distribution of the data [94].
There are many different deep generative mod- Once a VAE is trained, an image can be efficiently gen-
els or deep neural networks that model the genera- erated by first sampling z from the Gaussian prior p(z)
tion process of some data. Besides GANs, other popular and then passing it through the feed-forward deep neural
deep generative models include deep Boltzmann machines network p(x|z; θ). VAEs are effective in learning useful
(DBMs), variational autoencoders (VAEs), deep autore- latent representations [188]. However, they tend to gen-
gressive models (DARs), and normalizing flow models erate blurry output images.
(NFMs). We compare these models in Fig. 3 and briefly
review them in the following.
C. Deep AutoRegressive Models
A. Deep Boltzmann Machines DARs [30], [153], [177], [207] are deep learning
DBMs [45], [48], [68], [175] are energy-based mod- implementations of classical autoregressive models, which
els [101], which can be represented by undirected graphs. assumes an ordering to the random variables to be mod-
Let x denote the array of image pixels, often called visible eled and generates the variables sequentially based on the
nodes. Let h denote the hidden nodes. DBMs model the ordering. This induces a factorization form to the data
distribution given by solving


probability density function of data based on the Boltz-
mann (or Gibbs) distribution as
p(x; θ) = p (xi |x<i ; θ) (3)
1 i
p(x; θ) = exp(−E(x, h; θ)) (1)
N (θ)
h
where xi are variables in x, and x<i are the union of
the variables that are prior to xi based on the assumed
where E is an energy function modeling interactions of
ordering. DARs are conditional generative models where
nodes in the graph, N is the partition function, and θ
they generate a new portion of the signal based on what
denotes the network parameters to be learned. Once a
has been generated or observed so far. The learning is
DBM is trained, a new image can be generated by applying
based on maximum likelihood learning
Markov chain Monte Carlo (MCMC) sampling, ascending
from a random configuration to one with high probability.
While extensively expressive, the reliance on MCMC sam- max Ex∼D [log p(xi |x<i ; θ)] . (4)
θ
pling on both training and generation makes DBMs scale
poorly compared to other deep generative models since DAR training is more stable compared to the other genera-
efficient MCMC sampling is itself a challenging problem, tive models. However, due to the recurrent nature, they are
especially for large networks. slow in inference. Also, while, for audio or text, a natural
ordering of the variables can be determined based on
B. Variational AutoEncoders the time dimension, such an ordering does not exist for
VAEs [93], [94], [168] are directed probabilistic graphic images. One, hence, has to enforce an order prior that is
models, inspired by the Helmholtz machine [37]. They an unnatural fit to the image grid.

Vol. 109, No. 5, May 2021 | P ROCEEDINGS OF THE IEEE 841

Authorized licensed use limited to: Rutgers University. Downloaded on May 15,2021 at 09:39:26 UTC from IEEE Xplore. Restrictions apply.
Liu et al.: GANs for Image and Video Synthesis: Algorithms and Applications

Fig. 3. Structure comparison of different deep generative models. Except for the DBM that is based on undirected graphs, other models
are all based on directed graphs, which enjoys a faster inference speed. (a) Boltzmann machine. (b) VAE. (c) Autoregressive model. (d) NFM.
(e) GAN.

D. Normalizing Flow Models In mode dropping, the generator does not faithfully model
NFMs [40], [41], [92], [167] are based on the normal- the target distribution and misses some portion of it. Other
izing flow—a transformation of a simple probability dis- common failure cases include checkerboard and water
tribution into a more complex distribution by a sequence drop artifacts. In this article, we cover the basics of GAN
of invertible and differentiable mappings. Each mapping training and some techniques invented to improve training
corresponds to a layer in a deep neural network. With a stability.
layer design that guarantees invertibility and differentiabil-
ity for all possible weights, one can stack many such layers A. Learning Objective
to construct a powerful mapping because composition of The core idea in GAN training is to minimize the discrep-
invertible and differentiable functions is invertible and dif- ancy between the true data distribution p(x) and the fake
ferentiable. Let F = f (1) ◦ f (2) · · · ◦ f (K) be such a K -layer data distribution p(G(z; θ)). As there are a variety of ways
mapping that maps the simple probability distribution Z to measure the distance between two distributions, such as
to the data distribution X . The probability density of a the Jensen–Shannon divergence, the KL divergence, and
sample x ∼ X can be computed by transforming it back to the integral probability metric, there are also a variety
the corresponding z . Hence, we can apply maximum likeli- of GAN losses, including the saturated GAN loss [60],
hood learning to train NFMs because the log-likelihood of the nonsaturated GAN loss [60], the Wasserstein GAN
the complex data distribution can be converted to the log- loss [6], [64], the least-square GAN loss [134], the hinge
likelihood of the simple prior distribution subtracted by the GAN loss [112], [242], the f-divergence GAN loss [81],
Jacobians terms. This gives [150], and the relativistic GAN loss [80]. Empirically,

  the performance of a GAN loss depends on the application

 
K and the network architecture. As of the time of writing this
df (i)
log p(x; θ) = log p(z; θ) − log det (5) survey article, there is no clear consensus on which one is
dz i−1
i=1 absolutely better.
Here, we give a generic GAN learning objective for-
where z i = f (i) (z i−1 ). One key strength of NFMs is in mulation that subsumes several popular ones. For the
supporting direct evaluation of probability density calcula-

   
discriminator update step, the learning objective is
tion. However, NFMs require an invertible mapping, which
greatly limits the choices of applicable architectures.
max Ex∼D fD (D(x; φ) + Ez∼Z fG (D(G(z; θ); φ) (7)
φ

III. L E A R N I N G
Let θ and φ be the learnable parameters in G and D, where fD and fG are the output layers that transform the
respectively. GAN training is formulated as a minimax results computed by the discriminator D to the classifica-
problem tion scores for the real and fake images, respectively. For
min max V (θ, φ)
 
(6) the generator update step, the learning objective is
φ θ

where V is the utility function.


min Ez∼Z gG (D(G(z; θ); φ) (8)
GAN training is challenging. Famous failure cases θ
include mode collapse and mode dropping. In mode col-
lapse, the generator is trapped to a certain local minimum where gG is the output layer that transforms the result
where it only captures a small portion of the distribution. computed by the discriminator to a classification score for

842 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 5, May 2021

Authorized licensed use limited to: Rutgers University. Downloaded on May 15,2021 at 09:39:26 UTC from IEEE Xplore. Restrictions apply.
Liu et al.: GANs for Image and Video Synthesis: Algorithms and Applications

Table 1 Comparison of Different GAN Losses, Including Satu- second-order moments, is very popular for training GANs.
rated [60], Nonsaturated [60], Wasserstein [6], Least Square [134], and ADAM has several user-defined parameters. Typically,
Hinge [112], [242], in Terms of the Discriminator Output Layer Type
the first momentum is set to 0, while the second momen-
in (7) and (8). We Maximize fD and fG for Training the Discriminator.
As Shown in (7) and (8), We Minimize gG for Training the Generator. Note tum is set to 0.999. The learning rate for the discrimina-
That σ x   /  e −x Is the Sigmoid Function tor update is often set to two to four times larger than
the learning rate for the generator update (usually set
to 0.0001), which is called the two-time update scales
(TTUR) [67]. We also note that RMSProp [201] is popular
for GAN training [64], [84], [85], [118].

C. Regularization
We review several popular regularization techniques
available for countering instability in GAN training.
the fake image. In Table 1, we compare fD , fG , and gG for Gradient penalty (GP) is an auxiliary loss term that
several popular GAN losses. penalizes deviation of gradient norm from the desired
value [64], [138], [169]. To use GP, one adds it to the
B. Training objective function for the discriminator update, that is, (7).
There are several variants of GP. Generally, they can be
Two variants of stochastic gradient descent/ascent

 
expressed as
(SGD) schemes are commonly used for GAN training: the
simultaneous update scheme and the alternating update
scheme. Let VD (θ, φ) and VG (θ, φ) be the objective func- GP-δ = Ex̂ ∇D(x̂)2 − δ . (13)
tions in (7) and (8), respectively. In the simultaneous
update, each training iteration contains a discriminator The two most common forms are GP-1 [64] and
update step and a generator update step given by GP-0 [138].
GP-1 was first introduced by Gulrajani et al. [64]. It uses
∂VD (θ (t) , φ(t) ) an imaginary data distribution
φ(t+1) = φ(t) + αD (9)
∂φ
(t)
∂V (θ , φ(t) )
θ (t+1) = θ (t) − αG x̂ = ux + (1 − u)G(z), u ∼ U(0, 1)
G
(10) (14)
∂θ

where αD and αG are the learning rates for the generator where u is a uniform random variable between 0 and 1.
and discriminator, respectively. In the alternating update, Basically, x̂ is neither real nor fake. It is a convex com-
each training iteration consists of one discriminator update bination of a real sample and a fake sample. The design
step followed by a generator update step, given by of the GP-1 is motivated by the property of an optimal D
that solves the Wasserstein GAN loss. However, GP-1 is also
∂VD (θ (t) , φ(t) ) useful when using other GAN losses. In practice, it has the
φ(t+1) = φ(t) + αD (11) effect of countering vanishing and exploding gradients that
∂φ
occurred during GAN training.
∂VG (θ (t) , φ(t+1) )
θ (t+1) = θ (t) − αG . (12) On the other hand, the design of GP-0 is based on
∂θ
the idea of penalizing the discriminator deviating away
Note that in the alternating update scheme, the genera- from the Nash equilibrium. GP-0 takes a simpler form
tor update (12) utilizes the newly updated discriminator where they do not use imaginary sample distribution but
parameters θ (t+1) , while, in the simultaneous update (10), use the real data distribution, that is, setting x̂ ≡ x.
it does not. These two schemes have their pros and cons. We find the use of GP-0 in several state-of-the-art GAN
The simultaneous update scheme can be computed more algorithms [85], [86].
efficiently, as a major part of the computation in the two Spectral normalization (SN) [140] is an effective
steps can be shared. On the other hand, the alternating regularization technique used in many recent GAN algo-
update scheme tends to be more stable as the generator rithms [24], [156], [170], [242]. SN is based on regular-
update is computed based on the latest discriminator. izing the spectral norm of the projection operation at each
Recent GAN studies [24], [64], [70], [118], [156] mostly layer of the discriminator, by simply dividing the weight
use the alternating update scheme. Sometimes, the dis- matrix by its largest eigenvalue. Let W be the weight
criminator update (11) is performed several times before matrix of a layer of the discriminator network. With SN,
the true weight that is applied is


computing (12) [24], [64].
Among various SGD algorithms, ADAM [91], which
is based on adaptive estimates of the first- and W/ λmax (W T W ) (15)

Vol. 109, No. 5, May 2021 | P ROCEEDINGS OF THE IEEE 843

Authorized licensed use limited to: Rutgers University. Downloaded on May 15,2021 at 09:39:26 UTC from IEEE Xplore. Restrictions apply.
Liu et al.: GANs for Image and Video Synthesis: Algorithms and Applications

Fig. 4. Generator evolution. Since the debut of GANs [60], the generator architecture has continuously evolved. (a)–(c) One can observe
the change from simple MLPs to deep convolutional and residual networks. Recently, conditional architectures, including (d) conditional ANs
and (e) conditional convolutions, have gained popularity as they allow users to have more control on the generation outputs. (a) MLP.
(b) Deep ConvNet. (c) Residual Net. (d) Residual Net. with Cond. Act. Norm. (e) Cond. ConvNet.

where λmax (A) extracts the largest eigenvalue from the original generator with weight θ and the other is the MA
square matrix A. In other words, each project layer has generator with weight θ M A . At iteration t, we update θ M A
a projection matrix with a spectral norm equal to 1. based on
(t) (t−1)
Feature matching (FM) provides a way to encourage θ M A = βθ (t) + (1 − β)θ M A (18)
the generator to generate images similar to real ones in
some sense. Similar to GP, FM is an auxiliary loss. There are where β is a scalar controlling the contribution from the
two popular implementations: one is batch-based [176], current model weight.
and the other is instance-based [99], [218]. Let Di be the
ith layer of a discriminator D, that is, D = Dd ◦ · · · ◦ D2 ◦ D. Network Architecture
D1 . For the batch-based FM loss, it matches the moments
Network architectures provide a convenient way to
of the activations extracted by the real and fake images,
inject inductive biases. Certain network designs often work

 
respectively. For the ith layer, the loss is
better than others for a given task. Since the introduction
of GANs, we have observed an evolution of the network
Ex∼D [Di ◦ · · · ◦ D1 (x)] − Ez∼Z [Di ◦ · · · ◦ D1 (G(z))] . architecture for both the generator and discriminator.
(16) 1) Generator Evolution: In Fig. 4, we visualize the evo-
lution of the GAN generator architecture. In the original
One can apply the FM loss to a subset of layers in the gen- GAN paper [60], both the generator and the discrim-
erator and use the weighted sum as the final FM loss. The inator are based on the multilayer perceptron (MLP)
instance-based FM loss is only applicable to conditional [see Fig. 4(a)]. As an MLP fails to model the translational
generation models where we have the corresponding real invariance property of natural images, its output images
image for a fake image. For the ith layer, the instance-based are of limited quality. In the DCGAN work [163], deep con-
volutional architecture [see Fig. 4(b)] is used for the GAN

 
FM loss is given by
generator. As the convolutional architecture is a better fit
for modeling image signals, the outputs produced by the
[Di ◦ · · · ◦ D1 (xi )] − [Di ◦ · · · ◦ D1 (G(z, y i ))] (17)
DCGAN are often of better quality. Researchers also borrow
architecture designs from discriminative modeling tasks.
where y i is the control signal for xi . As the residual architecture [66] is proved to be effective
Perceptual loss [79]: Often, when instance-based FM for training deep networks, several GAN studies start to
loss is applicable, one can additionally match features use the residual architecture in their generator design [see
extracted from real and fake images using a pretrained net- Fig. 4(c)] [6], [140].
work. Such a variant of FM losses is called the perceptual A residual block used in modern GAN generators typi-
loss [79]. cally consists of a skip connection paired with a series of
Model average (MA) can improve the quality of images batch normalization (BN) [74], nonlinearity, and convo-
generated by a GAN. To use MA, we keep two copies of lution operations. The BN is one type of activation norm
the generator network during training, where one is the (AN), a technique that normalizes the activation values

844 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 5, May 2021

Authorized licensed use limited to: Rutgers University. Downloaded on May 15,2021 at 09:39:26 UTC from IEEE Xplore. Restrictions apply.
Liu et al.: GANs for Image and Video Synthesis: Algorithms and Applications

to facilitate training. Other AN variants have also been


exploited for the GAN generator, including the instance
normalization [206], the layer normalization [8], and the
group normalization [228]. Generally, an activation nor-
malization scheme consists of a whitening step followed by
an affine transformation step. Let hc be the output of the
whitening step for h. The final output of the normalization
layer is
γc hc + βc (19)

where γc and βc are scalars used to shift the postnormaliza- Fig. 5. Conditional discriminator architectures. There are several
ways to leverage the user input signal y in the GAN discriminator.
tion activation values. They are constants learned during
(a) AC [151]. In this design, the discriminator is asked to predict the
training.
ground-truth label for the real image. (b) Input concatenation [75].
For many applications, it is required to have some way In this design, the discriminator learns to reason whether the input
to control the output produced by a generator. This desire is real by learning a joint feature embedding of image and label.
has motivated various conditional generator architectures (c) PD [141]. In this design, the discriminator computes an image
embedding and correlates it with the label embedding (through the
[see Fig. 4(d)] for the GAN generator [24], [70], [156].
dot product) to determine whether the input is real or fake.
The most common approach is to use the conditional AN.
In a conditional AN, both γc and βc are data-dependent.
Often, one employs a separate network to map input
control signals to the target γc and βc values. Another way domain to a corresponding image in a different domain, for
to achieve such controllability is to use hypernetworks, example, sketch to shoes, label maps to photos, and sum-
basically, using an auxiliary network to produce weights mer to winter. The problem can be studied in a supervised
for the main network. For example, we can have a convo- setting, where sample pairs of corresponding images are
lutional layer where the filter weights are generated by a available, or an unsupervised setting, where such training
separate network. We often call such a scheme conditional data are unavailable, and we only have two independent
convolutions [see Fig. 4(e)], and it has been used for sets of images. In Sections IV-A and B, we will discuss
several state-of-the-art GAN generators [86], [216]. recent progress in both settings.
2) Discriminator Evolution: GAN discriminators have
also undergone an evolution. However, the change has A. Supervised Image Translation
mostly been on moving from the MLP to deep convolu- Isola et al. [75] proposed the pix2pix framework as a
tional and residual architectures. As the discriminator is general-purpose solution to image-to-image translation in
solving a classification task, new breakthroughs in archi- the supervised setting. The training objective of pix2pix
tecture design for image classification tasks could influence combines conditional GANs with the pixelwise 1 loss
future GAN discriminator designs. between the generated image and the ground truth. One
notable design choice of pix2pix is the use of patchwise dis-
3) Conditional Discriminator Architecture: There are sev-
criminators (PatchGAN), which attempts to discriminate
eral effective architectures for utilizing control signals
each local image patch rather than the whole image. This
(conditional inputs y ) in the GAN discriminator to achieve
design incorporates the prior knowledge that the underly-
better image generation quality, as visualized in Fig. 5.
ing image translation function we want to learn is local,
This includes the auxiliary classifier (AC) [151], input
assuming independence between pixels that are far away.
concatenation (IC) [75], and the projection discriminator
In other words, translation mostly involves style or texture
(PD) [141]. The AC and PD are mostly used for category-
changes. It significantly alleviates the burden of the dis-
conditional image generation tasks, while the PD is com-
criminator because it requires much less model capacity to
mon for image-to-image translation tasks.
discriminate local patches than whole images.
4) Neural Architecture Search: As neural architecture One important limitation of pix2pix is that its translation
search has become a popular topic for various recognition function is restricted to be one-to-one. However, many of
tasks, efforts have been made in trying to automatically the mappings that we aim to learn are one-to-many in
find a performant architecture for GANs [56]. nature. In other words, the distribution of possible outputs
While this section and Section III have focused on intro- is multimodal. For example, one can imagine many shoes
ducing the GAN mechanism and various algorithms used to in different colors and styles that correspond to the same
train them, Sections IV–VII focus on various applications of sketch of a shoe. Naively injecting a Gaussian noise latent
GANs in generating images and videos. code to the generator does not lead to many variations
since the generator is free to ignore that latent code.
IV. I M A G E T R A N S L AT I O N BicycleGAN [255] explores approaches to encourage the
This section discusses the application of GANs to image-to- generator to make use of the latent code to represent
image translation, which aims to map an image from one output variations, including applying a KL divergence loss

Vol. 109, No. 5, May 2021 | P ROCEEDINGS OF THE IEEE 845

Authorized licensed use limited to: Rutgers University. Downloaded on May 15,2021 at 09:39:26 UTC from IEEE Xplore. Restrictions apply.
Liu et al.: GANs for Image and Video Synthesis: Algorithms and Applications

Fig. 6. Image translation examples of SPADE [156], which converts semantic label maps into photorealistic natural scenes. The style of the
output image can also be controlled by a reference image (the leftmost column). Images are from Park et al. [156].

to the encoded latent code and reconstructing the sampled latent space. It is shown that shared-latent space implies
latent code from the generated image. Other strategies cycle consistency and imposes a stronger regularization.
to encourage diversity include using different generators DistanceGAN [16] encourages the mapping to preserve
to capture different output modes [54], replacing the the distance between any pair of images before and after
reconstruction loss with maximum likelihood objective translation. While the methods above need to train a differ-
[106], [107], and directly encouraging the distance ent model for each pair of image domains, StarGAN [32]
between output images generated from different latent is able to translate images across multiple domains using
codes to be large [120], [133], [234]. only a single model.
Besides, the quality of image-to-image translation In many unsupervised image translation tasks
has been significantly improved by some recent stud- (e.g., horses to zebras and dogs to cats), the two-
ies [104], [122], [156], [194], [218], [250]. In particu- image domains mainly differ in the foreground objects,
lar, pix2pixHD [218] is able to generate high-resolution and the background distribution is very similar. Ideally,
(HR) images with a coarse-to-fine generator and a mul- the model should only modify the foreground objects and
tiscale discriminator. SPADE [156] further improves the leave the background region untouched. Some work [31],
image quality with a spatially adaptive normalization layer. [137], [232] employs spatial attention to detect and
SPADE, in addition, allows a style image input for better change the foreground region without influencing the
control of the desired look of the output image. Some background. InstaGAN [142] further allows the shape of
examples of SPADE are shown in Fig. 6. the foreground objects to be changed.
The early work mentioned above focuses on unimodal
B. Unsupervised Image Translation translation. On the other hand, recent advances [5], [57],
For many tasks, paired training images are very difficult [70], [105], [128], [133] have made it possible to perform
to obtain [16], [32], [70], [90], [105], [117], [235], multimodal translation, generating diverse output images
[254]. Unsupervised learning of mappings between corre- given the same input. For example, MUNIT [70] assumes
sponding images in two domains is a much harder problem that images can be encoded into two disentangled latent
but has wider applications than the supervised setting. spaces: a domain-invariant content space that captures
CycleGAN [254] simultaneously learns mappings in both the information that should be preserved during trans-
directions and employs a cycle consistency loss to enforce lation, and a domain-specific style space that represents
that, if an image is translated to the other domain and the variations that are not specified by the input image.
translated back to the original domain, the output should To generate diverse translation results, we can recombine
be close to the original image. UNIT [117] makes a shared the content code of the input image with different style
latent space assumption [119] that a pair of corresponding codes sampled from the style space of the target domain.
images can be mapped to the same latent code in a shared Fig. 7 compares MUNIT with existing unimodal translation

846 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 5, May 2021

Authorized licensed use limited to: Rutgers University. Downloaded on May 15,2021 at 09:39:26 UTC from IEEE Xplore. Restrictions apply.
Liu et al.: GANs for Image and Video Synthesis: Algorithms and Applications

images, whether there are many or few, are available


during training. Liu et al. [118] proposed FUNIT to address
a different situation where there are many source domain
images that are available during training but few target
domain images that are available only at test time. The
target domain images are used to guide translation similar
to the example-guided translation procedure in MUNIT.
Saito et al. [170] proposed a content-conditioned style
encoder to better preserve the domain-invariant content of
the input image. However, the above scenario [118], [170]
still assumes access to the domain labels of the training
images. Some recent work aims to reduce the need for
such supervision by using few [221] or even no [9] domain
labels. Very recently, some studies [15], [113], [155] are
able to achieve image translation even when each domain
only has a single image, inspired by recent advances that
can train GANs on a single image [179].
Despite the empirical successes, the problem of unsu-
pervised image-to-image translation is inherently ill-posed,
even with constraints such as cycle consistency or shared
latent space. Specifically, there exist infinitely many map-
pings that satisfy those constraints [38], [51], [231], yet
most of them are not semantically meaningful. How do
current methods successfully find meaningful mapping
Fig. 7. Comparisons among unsupervised image translation in practice? Galanti et al. [51] assume that meaningful
methods (CycleGAN [254], UNIT [117], and MUNIT [70]). X 1 and X 2
mapping is of minimal complexity, and the popular gener-
are two different image domains (dogs and cats in this example).
(a) CycleGAN enforces the learned mappings to be inverses of each
ator architectures are not expressive enough to represent
other. (b) UNIT autoencodes images in both domains to a common mappings that are highly complex. Bézenac et al. [38]
latent space Z . Both CycleGAN and UNIT can only perform unimodal further argued that the popular architectures are implicitly
translation. (c) MUNIT (randomly sampled) decomposes the latent biased toward mappings that produce minimal changes
space into a shared content space C and unshared style spaces
to the input, which are usually semantically meaningful.
S 1 , S 2 . Diverse outputs can be obtained by sampling different style
codes from the target style space. (d) Style of the translation output
In summary, the training objectives of unsupervised image
can also be controlled by a guiding image in the target domain translation alone cannot guarantee that the model can find
[MUNIT (example-guided)]. semantically meaningful mappings and the inductive bias
of generator architectures plays an important role.

V. I M A G E P R O C E S S I N G
methods including CycleGAN and UNIT. The disentangled GAN’s strength in generating realistic images makes it ideal
latent space not only enables multimodal translation but for solving various image-processing problems, especially
also allows example-guided translation in which the gener- for those where the perceptual quality of image outputs
ator recombines the domain-invariant content of an image is the primary evaluation criterion. This section will dis-
from the source domain and the domain-specific style of cuss some prominent GAN-based methods for several key
an image from the target domain. The idea of using a image-processing problems, including image restoration
guiding style image has also been applied to the supervised and enhancement (SR, denoising, deblurring, and com-
setting [156], [214], [244]. pression artifacts removal) and image inpainting.
Although paired example images are not needed in the
unsupervised setting, most existing methods still require
access to a large number of unpaired example images in A. Image Restoration and Enhancement
both source and target domains. Some studies seek to The traditional way of evaluating algorithms for image
reduce the number of training examples without much restoration and enhancement tasks is to measure the dis-
loss of performance. Benaim and Wolf [17] focused on tortion, the difference between the ground-truth images
the situation where there are many images in the target and restored images using metrics, such as the mean
domain but only a single image in the source domain. square error (MSE), the peak signal-to-noise ratio (PSNR),
The work of Cohen and Wolf [34] enables translation and the structural similarity index (SSIM). Recently, met-
in the opposite direction where the source domain has rics for measuring perceptual quality, such as the no-
many images, but the target domain has only one. The reference (NR) metric [127], have been proposed, as the
above setting assumes that the source and target domain visual quality is arguably the most important factor for

Vol. 109, No. 5, May 2021 | P ROCEEDINGS OF THE IEEE 847

Authorized licensed use limited to: Rutgers University. Downloaded on May 15,2021 at 09:39:26 UTC from IEEE Xplore. Restrictions apply.
Liu et al.: GANs for Image and Video Synthesis: Algorithms and Applications

Fig. 8. Perception–distortion tradeoff [22]. Distortion metrics,


Fig. 9. Perception–distortion curve of the ESRGAN [220] on PIRM
including the MSE, PSNR, and SSIM, measure the similarity between
self-validation data set [21]. The curve also compares the ESRGAN
the ground-truth image and the restored images. Perceptual quality
with the EnhanceNet [174], the RCAN [246], and the EDSR [111]. The
metrics, including NR [127], measure the distribution distance
curve is from Wang et al. [220].
between the recovered image distribution and the target image
distribution. Blau and Michaeli [22] showed that an image
restoration algorithm can be characterized by the distortion and
perceptual quality tradeoff curve. The plot is from Blau and Michaeli
[22].
GAN-based image SR methods and practices can be found
in the 2018 PIRM challenge report [21].
The above image SR algorithms all operate in the super-
vised setting where they assume corresponding LR and HR
the usability of an algorithm. Blau and Michaeli [22] pairs in the training data set. Typically, they create such
proposed the perception–distortion tradeoff [22], which a training data set by downsampling the ground-truth HR
states that an image restoration algorithm can potentially images. However, the downsampled HR images are very
improve only in terms of its distortion or in terms of its different from the LR images captured by a real sensor,
perceptual quality, as shown in Fig. 8. Blau and Michaeli which often contains noise and other distortion. As a result,
[22] further demonstrated that GANs provide a principled these SR algorithms are not directly applicable to upsample
way to approach the perception–distortion bound. LR images captured in the wild. Several methods have
Image SR aims at estimating an HR image from addressed the issue by studying image SR in the unsu-
its low-resolution (LR) counterpart. Deep learning has pervised setting where they only assume a data set of LR
enabled faster and more accurate SR methods, including images captured by a sensor and a data set of HR images.
SRCNN [42], FSRCNN [43], ESPCN [182], VDSR [88], Recently, Maeda [131] proposed a GAN-based image SR
SRResNet [102], EDSR [111], SRDenseNet [203], Mem- algorithm that operates in the unsupervised setting for
Net [193], RDN [247], WDSR [236], and many others. bridging the gap.
However, the above SR approaches focus on improving Image denoising aims at removing noise from noisy
the distortion metrics and pay little to no attention to the images. The task is challenging since the noise distribu-
perceptual quality metrics. As a result, they tend to predict tion is usually unknown. This setting is also referred to
oversmoothed outputs and fail to synthesize finer high- as blind image denoising. DnCNN [243] is one of the
frequency details.
Recent image SR algorithms improve the perceptual
quality of outputs by leveraging GANs. The SRGAN [102]
is the first of its kind and can generate photorealistic
images with 4× or higher upscaling factors. The quality
of the SRGAN [102] outputs is mainly measured by the
mean opinion score (MOS) over 26 raters. To enhance
the visual quality further, Wang et al. [220] revisited the
design of the three key components in the SRGAN: the
network architecture, the GAN loss, and the perceptual
loss. They proposed the enhanced SRGAN (ESRGAN),
which achieves consistently better visual quality with more
realistic and natural textures than the competing methods,
as shown in Figs. 9 and 10. The ESRGAN is the winner
of the 2018 Perceptual Image Restoration and Manipu- Fig. 10. Visual comparison between the ESRGAN [220] and the
lation (PIRM) challenge [21] (region 3 in Fig. 9). Other SRGAN [102]. Images are from Wang et al. [220].

848 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 5, May 2021

Authorized licensed use limited to: Rutgers University. Downloaded on May 15,2021 at 09:39:26 UTC from IEEE Xplore. Restrictions apply.
Liu et al.: GANs for Image and Video Synthesis: Algorithms and Applications

first approaches using feed-forward convolutional neural


networks for image denoising. However, DnCNN [243]
requires knowing the noise distribution in the noisy image
and, hence, has limited applicability. To tackle blind image
denoising, Chen et al. [27] proposed the GAN–CNN-based
Blind Denoiser (GCBD) that consists of: 1) a GAN trained
to estimate the noise distribution over the input noisy
images to generate noise samples and 2) a deep CNN
that learns to denoise on generated noisy images. The
GAN training criterion of GCBD [27] is based on Wasser-
stein GAN [6], and the generator network is based on
DCGAN [163].
Image deblurring sharpens blurry images that result
from motion blur, out of focus, and possibly other causes.
DeblurGAN [95] trains an image motion deblurring net-
Fig. 12. Image compression with GANs [4]. Comparing a
work using Wasserstein GAN [6] with the GP-1 loss and
GAN-based approach [4] for image compression to those obtained
the perceptual loss (see Section III). Shen et al. [181] by the off-the-shelf codecs. Even with fewer than half the number of
used a similar approach to deblur face image by using bytes, GAN-based compression [4] produces more realistic visual
GAN and perceptual loss and incrementally training results. Images are from Agustsson et al. [4].

the deblurring network. Visual examples are shown


in Fig. 11.
Lossy image compression algorithms (e.g., JPEG,
JPEG2000, BPG, and WebP) can efficiently reduce image B. Image Inpainting
sizes but introduce visual artifacts in compressed images Image inpainting aims at filling missing pixels in an
when the compression ratio is high. Deep neural networks image such that the result is visually realistic and seman-
have been widely explored for removing the introduced tically correct. Image inpainting algorithms can be used to
artifacts [4], [52], [204]. Galteri et al. [52] showed remove distracting objects or retouch undesired regions
that a residual network trained with a GAN loss is able in photos and can be further extended to other tasks,
to produce images with more photorealistic details than including image uncropping, rotation, stitching, retarget-
MSE- or SSIM-based objectives for the removal of image ing, recomposition, compression, SR, harmonization, and
compression artifacts. Tschannen et al. [204] further pro- more.
posed distribution-preserving lossy compression using a Traditionally, patch-based approaches, such as the Patch-
new combination of Wasserstein GAN and Wasserstein Match [12], copy background patches according to the
autoencoder [202]. More recently, Agustsson et al. [4] built low-level FM (e.g., Euclidean distance on pixel RGB val-
an extreme image compression system using unconditional ues) and paste them into the missing regions. These
and conditional GANs, outperforming all other codecs in approaches can synthesize plausible stationary textures but
the low bit-rate setting. Some compression visual exam- fail at nonstationary image regions, such as faces, objects,
ples [4] are shown in Fig. 12. and complicated scenes. Recently, deep learning and GAN-
based approaches [73], [78], [114], [145], [222], [229],
[237], [238], [241], [249] open a new direction for image
inpainting using deep neural networks learned on large-
scale data in an end-to-end fashion. Compared with Patch-
Match, these methods are more scalable and can leverage
large-scale data.
The context encoder (CE) approach [157] is one of
the first in using a GAN generator to predict the missing
regions and is trained with the 2 pixelwise reconstruction
loss and a GAN loss. Iizuka et al. [73] further improved the
GAN-based inpainting framework using both global and
local GAN discriminators, with the global one operating
on the entire image and the local one operating on only
the patch in the hole. We note that the postprocessing
techniques, such as image blending, are still required in
these GAN-based approaches [73], [157] to reduce visual
artifacts near the hole boundaries.
Fig. 11. Face deblurring results with GANs [181]. Images are from Yu et al. [237] proposed DeepFill, a GAN framework for
Shen et al. [181]. end-to-end image inpainting without any postprocessing

Vol. 109, No. 5, May 2021 | P ROCEEDINGS OF THE IEEE 849

Authorized licensed use limited to: Rutgers University. Downloaded on May 15,2021 at 09:39:26 UTC from IEEE Xplore. Restrictions apply.
Liu et al.: GANs for Image and Video Synthesis: Algorithms and Applications

Fig. 14. Free-form image inpainting results using the


DeepFillV2 [238]. From left to right, we have the ground-truth
image, the free-form mask, and the DeepFillV2 inpainting result.
Original images are from Yu et al. [238].
Fig. 13. Image inpainting results using the DeepFill [237]. Missing
regions are shown in white. In each pair, the left is the input image,
and the right is the direct output of trained GAN without any
postprocessing. Images are from Yu et al. [237].
[229], [238] have been proposed to provide an option to
take additional user inputs, for example, sketches, as guid-
ance for image inpainting networks. An example of user-
step, which leverages a stacked network, consisting of a guided image inpainting is shown in Fig. 15.
coarse network and a refinement network, to ensure the Finally, we note that the image out-painting or extrap-
color and texture consistency between the in-filled regions olation tasks are closely related to image inpainting [89],
and their surroundings. Moreover, as convolutions are [195]. They can also be benefited from a GAN formulation.
local operators and less effective in capturing long-range
spatial dependencies, the contextual attention layer [237] VI. V I D E O S Y N T H E S I S
is introduced and integrated into the DeepFill to borrow Video synthesis focuses on generating video content
information from distant spatial locations explicitly. Visual instead of static images. Compared with image synthesis,
examples of the DeepFill [237] are shown in Fig. 13. video synthesis needs to ensure the temporal consistency
One common issue with the earlier GAN-based inpaint- of the output videos. This is usually achieved by using a
ing approaches [73], [157], [237] is that the training temporal discriminator [205], flow-warping loss on neigh-
is performed with randomly sampled rectangular masks. boring frames [217], smoothing the inputs before process-
While allowing easy processing during training, these ing [26], or a postprocessing step [98]. Each of them might
approaches do not generalize well to free-form masks, that be suitable for a particular task.
is, irregular masks with arbitrary shapes. To address the Similar to image synthesis, video synthesis can be clas-
issue, Liu et al. [114] proposed the partial convolution sified into unconditional and conditional video synthesis.
layer where the convolution is masked and renormalized Unconditional video synthesis generates sequences using
to utilize valid pixels only. Yu et al. [238] further proposed random noise inputs [33], [171], [205], [210]. Because
the gated convolution layer, generalizing the partial convo- such a method needs to model all the spatial and tem-
lution by providing a learnable dynamic feature selection poral content in a video, the generated results are often
mechanism for each channel at each spatial location across short or with very constrained motion patterns. For exam-
all layers. In addition, as free-form masks may appear ple, MoCoGAN [205] decomposes the motion and content
anywhere in images with any shape, global and local parts of the sequence and uses a fixed latent code for the
GANs [73] designed for a single rectangular mask are not content and a series of latent codes to generate the motion.
applicable. To address this issue, Yu et al. [238] introduced
a patch-based GAN loss, SNPatchGAN [238], by apply-
ing spectral-normalized discriminator on the dense image
patches. Visual examples of the DeepFillV2 [238] with
free-form masks are shown in Fig. 14.
Although capable of handling free-form masks, these
inpainting methods perform poorly in reconstructing fore-
ground details. This motivated the design of edge-guided
image inpainting methods [145], [229]. These methods
decompose inpainting into two stages. The first stage
predicts edges or contours of foregrounds, and the second
stage takes predicted edges to predict the final output.
Fig. 15. User-guided image inpainting results using the
Moreover, for image inpainting, enabling user interactivity DeepFillV2 [238]. From left to right, we have the ground-truth
is essential as there are many plausible solutions for filling image, the mask with user-provided edge guidance, and the
a hole in an image. User-guided inpainting methods [145], DeepFillV2 inpainting result. Images are from Yu et al. [238].

850 P ROCEEDINGS OF THE IEEE | Vol. 109, No. 5, May 2021

Authorized licensed use limited to: Rutgers University. Downloaded on May 15,2021 at 09:39:26 UTC from IEEE Xplore. Restrictions apply.
Liu et al.: GANs for Image and Video Synthesis: Algorithms and Applications

Fig. 16. Face swapping versus reenactment [149]. Face swapping focuses on pasting the face region from one subject to another, while
face reenactment concerns transferring the expressions and head poses from the target subject to the source image. Images are from
Nirkin et al. [149].

On the other hand, conditional video synthesis generates videos conditioned on input content. A common category is future frame prediction [39], [47], [83], [103], [110], [125], [136], [191], [208], [212], [213], [230], which attempts to predict the next frame of a sequence based on the past frames. Another common category of conditional video synthesis is conditioning on an input video that shares the same high-level representation. Such a setting is often referred to as video-to-video synthesis [217]. This line of studies has shown promising results on various tasks, such as transforming high-level representations to photorealistic videos [217], animating characters with new expressions or motions [26], [199], or providing a new rendering pipeline for graphics engines [50]. Due to its broader impact, we will mainly focus on conditional video synthesis, particularly on its two major domains: face reenactment and pose transfer.

A. Face Reenactment

Conditional face video synthesis exists in many forms. The most common forms include face swapping and face reenactment. Face swapping focuses on pasting the face region from one subject to another, whereas face reenactment concerns transferring the subject's expressions and head poses. Fig. 16 illustrates the difference. Here, we only focus on face reenactment. It has many applications in fields such as gaming and the film industry, where characters can be animated by human actors. Based on whether the trained model can only work for a specific person or is universal to all persons, face reenactment can be classified as subject-specific or subject-agnostic, as described in the following.

Fig. 16. Face swapping versus reenactment [149]. Face swapping focuses on pasting the face region from one subject to another, while face reenactment concerns transferring the expressions and head poses from the target subject to the source image. Images are from Nirkin et al. [149].

1) Subject-Specific Model: Traditional methods usually build a subject-specific model, which can only synthesize one predetermined subject, by focusing on transferring the expressions without transferring the head movement [192], [197]–[199], [209]. This line of studies usually starts by collecting footage of the target person to be synthesized, either using an RGBD sensor [198] or an RGB sensor [199]. Then, a 3-D model of the target person is built for the face region [20]. At test time, given new expressions, they can be used to drive the 3-D model to generate the desired motions, as shown in Fig. 17. Instead of extracting the driving expressions from someone else, they can also be directly synthesized from speech inputs [192]. Since 3-D models are involved, this line of studies typically does not use GANs.

Fig. 17. Face reenactment using 3-D face models [87]. These methods first construct a 3-D model for the person to be synthesized, so they can easily animate the model with new expressions. Images are from Kim et al. [87].

Some follow-up studies take transferring head motions into account and can model both expressions and different head poses at the same time [11], [87], [227]. For example, RecycleGAN [11] extends CycleGAN [254] to incorporate temporal constraints, so it can transform videos of a particular person into another fixed person. On the other hand, ReenactGAN [227] can transfer movements and expressions from an arbitrary person to a fixed person. Still, the subject-dependent nature of these studies greatly limits their usability. One model can only work for one person, and generalizing to another person requires training a new model. Moreover, collecting training data for the target person may not be feasible at all times, which motivates the emergence of subject-agnostic models.

Table 2. Categorization of face reenactment methods. Subject-specific models can only work on one subject per model, while subject-agnostic models can work on general targets. Among each of them, some frameworks only focus on the inner face region, so they can only transfer expressions, while others can also transfer head movements. Studies with * do not use GANs in their framework.

2) Subject-Agnostic Model: Several recent studies propose subject-agnostic frameworks, which focus on transferring the facial expressions without head movements [28], [29], [49], [53], [77], [144], [152], [159], [160], [190], [211], [251]. In particular, many studies only focus on the mouth region since it is the most expressive part of talking. For example, given an audio speech and one lip image of the target identity, Chen et al. [28] synthesized a video of the desired lip movements. Fried et al. [49] edited the lower face region of an existing video, so they can edit the video script and synthesize a new video corresponding to the change. While these studies have better generalization capability than the previous subject-specific methods, they usually cannot synthesize spontaneous head motions. The head movements cannot be transferred from the driving sequence to the target person.
Very recently, some studies have been able to handle both expressions and head movements using subject-agnostic frameworks [7], [62], [149], [184], [216], [219], [226], [239]. These frameworks only need a single 2-D image of the target person and can synthesize talking videos of this person given arbitrary motions. These motions are represented using either facial landmarks [7] or keypoints learned without supervision [184]. Since the input is only a 2-D image, many methods rely on warping the input or its extracted features and then filling in the disoccluded areas to refine the results. For example, Averbuch-Elor et al. [7] first warped the image and directly copied the teeth region from the driving image to fill in the holes in the case of an open mouth. Siarohin et al. [184] warped the extracted features from the input image using motion fields estimated from sparse keypoints. On the other hand, Zakharov et al. [239] demonstrated that it is possible to achieve promising results using direct synthesis methods without any warping. To synthesize the target identity, they extract features from the source images and inject the information into the generator through the AdaIN [69] parameters. Similarly, the few-shot vid2vid [216] injects the information into its generator by dynamically determining the SPADE [156] parameters. Since these methods require only an image as input, they become particularly powerful and can be used in even more cases. For instance, several studies [7], [216], [239] demonstrate successes in animating paintings or graffiti instead of real humans, as shown in Fig. 18, which is not possible with the previous subject-dependent approaches. Recently, Wang et al. [219] demonstrated the use of a novel subject-agnostic face reenactment method for video conferencing, achieving an order of magnitude bandwidth saving over the H.264 standard. However, while these methods have achieved great results in synthesizing people talking under natural motions, they usually struggle to generate satisfying outputs under extreme poses or uncommon expressions, especially when the target pose is very different from the original one. Moreover, synthesizing complex regions, such as hair or background, is still hard. This is, indeed, a very challenging task that is still open to further research. A summary of the different categories of face reenactment methods can be found in Table 2.
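To make the injection mechanism concrete, the following is a minimal sketch of AdaIN-based identity injection: features extracted from the source image predict per-channel scales and offsets that modulate the generator's normalized activations. The encoder, layer sizes, and tensor shapes here are placeholders rather than the actual architectures of [239] or [216].

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization [69]: normalize per channel, then apply a scale and
    offset predicted from an external code (here, an identity code of the target person)."""
    def __init__(self, num_channels, style_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.to_scale_offset = nn.Linear(style_dim, 2 * num_channels)

    def forward(self, x, style):
        gamma, beta = self.to_scale_offset(style).chunk(2, dim=-1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)          # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(x) + beta

# Hypothetical identity encoder: a few convolutions pooled into a style vector.
identity_encoder = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 128))

source_image = torch.randn(1, 3, 128, 128)       # single image of the target person
driving_features = torch.randn(1, 256, 32, 32)   # generator activations driven by landmarks/keypoints
style = identity_encoder(source_image)            # identity code extracted from the source image
modulated = AdaIN(num_channels=256, style_dim=128)(driving_features, style)
```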

Fig. 18. Few-shot face reenactment methods, which require only a 2-D image as input [239]. The driving expressions are usually represented by facial landmarks or keypoints. Images are from Zakharov et al. [239].

B. Pose Transfer

Pose transfer techniques aim at transferring the body pose of one person to another person. It can be seen as the whole-body counterpart of face reenactment. In contrast to talking head generation, which usually shares similar motions, body poses have more varieties and are, thus, much harder to synthesize. Early studies focus on simple pose transfers that generate low-resolution and lower quality images, and they work only on single images instead of videos. Recent studies have shown the capability to generate high-quality and high-resolution videos for challenging poses but can only work on a particular person per model. Very recently, several studies attempt to perform subject-agnostic video synthesis. A summary of the categories is shown in Table 3. In the following, we introduce each category in more detail.

Table 3. Categories of pose transfer methods. Again, they can be classified depending on whether one model can work for only one person or any person. Some of the frameworks only focus on generating single images, while others also demonstrate their effectiveness on videos. Studies with * do not use GANs in their framework.

1) Subject-Agnostic Image Generation: Although we focus on video synthesis in this section, since most of the existing motion transfer approaches only focus on synthesizing images, we still briefly introduce them here (see [10], [44], [46], [61], [82], [108], [124], [129], [130], [146], [161], [164], [186], [189], [240], [248], and [257]). Ma et al. [129] adopted a two-stage coarse-to-fine approach using GANs to synthesize a person in a different pose, represented by a set of keypoints. In their follow-up work [130], the foreground, background, and poses in the image are further disentangled into different latent codes to provide more flexibility and controllability. Later, Siarohin et al. [186] introduced deformable skip connections to move local features to the target pose position in a U-Net generator. Similarly, Balakrishnan et al. [10] decomposed different parts of the body into different layer masks and applied spatial transforms to each of them. The transformed segments are then fused together to form the final output.
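A common way to feed such a keypoint-based pose representation to a GAN generator is to rasterize each keypoint into a Gaussian heatmap channel and concatenate the heatmaps with the source image. The sketch below illustrates this encoding under assumed settings (18 keypoints, a 256 x 256 resolution, and an arbitrary Gaussian width); it is not the exact input encoding of [129].

```python
import torch

def keypoints_to_heatmaps(keypoints, height, width, sigma=6.0):
    """Rasterize K (x, y) keypoints into K Gaussian heatmap channels of shape (K, H, W)."""
    ys = torch.arange(height).float().view(1, height, 1)
    xs = torch.arange(width).float().view(1, 1, width)
    kx = keypoints[:, 0].view(-1, 1, 1)
    ky = keypoints[:, 1].view(-1, 1, 1)
    d2 = (xs - kx) ** 2 + (ys - ky) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))

# Source appearance plus rasterized target pose form the generator input.
source_image = torch.randn(3, 256, 256)                    # person in the original pose
target_pose = torch.rand(18, 2) * 256                      # 18 keypoints of the desired pose
pose_maps = keypoints_to_heatmaps(target_pose, 256, 256)   # (18, 256, 256)
generator_input = torch.cat([source_image, pose_maps], dim=0).unsqueeze(0)  # (1, 21, 256, 256)
```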
The above methods work in a supervised setting where images of different poses of the same person are available during training. To work in the unsupervised setting, Pumarola et al. [161] rendered the synthesized image back to the original pose and applied a cycle-consistency constraint on the back-rendered image. Lorenz et al. [124] decoupled the shape and appearance from images without supervision by adopting a two-stream auto-encoding architecture, so they can resynthesize images in a different shape with the same appearance.

Recently, instead of relying solely on 2-D keypoints, some frameworks choose to utilize 3-D or 2.5-D information. For example, Zanfir et al. [240] incorporated estimating 3-D parametric models into their framework to aid the synthesis process. Similarly, Li et al. [108] predicted 3-D dense flows to warp the source image by estimating 3-D models from the input images. Neverova et al. [146] adopted DensePose [63] to help warp the input textures according to their UV-coordinates and inpainted the holes to generate the final result. Grigorev et al. [61] also mapped the input to texture space and inpainted the textures before warping them back to the target pose. Huang et al. [71] combined the SMPL models [123] with the implicit field estimation framework [172] to rig the reconstructed meshes with desired motions. While these methods work reasonably well in transferring poses, as shown in Fig. 19, directly applying them to videos will usually result in unsatisfactory artifacts, such as flickering or inconsistent results. In the following, we introduce methods specifically targeting video generation, which work on a one-person-per-model basis.

Fig. 19. Subject-agnostic pose transfer examples [257]. Using only a 2-D image and the target pose to be synthesized, these methods can realistically generate the desired outputs. Images are from Zhu et al. [257].

2) Subject-Specific Video Generation: For high-quality video synthesis, most methods employ a subject-specific model, which can only synthesize a particular person. These approaches start with collecting training data of the target person to be synthesized (e.g., a few minutes of a subject performing various motions) and then train a neural network or infer a 3-D model from it to synthesize the output. For example, Thies et al. [200] extended their previous face reenactment work [199] to include shoulders and part of the upper body to increase realism and fidelity. To extend to whole-body motion transfer, Wang et al. [217] extended their image synthesis framework [218] to videos and successfully demonstrated the transfer results on several dancing sequences, opening the era for a new application (see Fig. 20). Chan et al. [26] also adopted a similar approach to generate many dancing examples but used a simple temporal smoothing on the inputs instead of explicitly modeling temporal consistency with the network. Following these studies, many subsequent studies improved upon them [3], [115], [116], [183], [252], usually by combining the neural network with 3-D models or graphics engines. For example, instead of predicting RGB values directly, Shysheya et al. [183] predicted DensePose-like part maps and texture maps from input 3-D keypoints and adopted a neural renderer to render the outputs. Liu et al. [116] first constructed a 3-D character model of the target by capturing multiview static images and then trained a character-to-image translation network using a monocular video of the target. The authors later combined the constructed 3-D model with the monocular video to estimate dynamic textures, so they can use different texture maps when synthesizing different motions to increase the realism [115].
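The temporal smoothing mentioned above can be as simple as low-pass filtering the per-frame pose keypoints before they are passed to the image generator. Below is a small sketch using an exponential moving average; the exact filter used in [26] may differ.

```python
import torch

def smooth_keypoints(keypoint_sequence, alpha=0.7):
    """Exponential moving average over a (T, K, 2) sequence of detected pose keypoints.
    A larger alpha trusts the current detection more; a smaller alpha smooths harder."""
    smoothed = [keypoint_sequence[0]]
    for t in range(1, keypoint_sequence.shape[0]):
        smoothed.append(alpha * keypoint_sequence[t] + (1 - alpha) * smoothed[-1])
    return torch.stack(smoothed, dim=0)

raw_poses = torch.rand(120, 18, 2) * 256     # 120 frames of 18 jittery detected keypoints
stable_poses = smooth_keypoints(raw_poses)   # smoothed poses used to condition the generator
```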
question.

Fig. 20. Subject-specific pose transfer examples for video generation [217]. For each image triplet, (left) driving sequence, (middle) intermediate pose representation, and (right) synthesized output. By using a model specifically trained on the target person, these methods can synthesize realistic output videos that faithfully reflect the driving motions. Images are from Wang et al. [217].

3) Subject-Agnostic Video Generation: Finally, the most general framework would be to have one model that can work universally regardless of the target identity. Early studies in this category synthesize videos unconditionally and do not have full control over the synthesized sequence (e.g., MoCoGAN [205]). Some other studies, such as [233], have control over the appearance and the starting pose of the person, but the motion generation is still unconditional. Due to these factors, the synthesized videos are usually shorter and of lower quality. Very recently, a few studies have shown the ability to render higher quality videos for pose transfer [121], [166], [184], [185], [216], [224]. Weng et al. [224] reconstructed the SMPL model [123] from the input image and animated it with some simple motions, such as running. Liu et al. [121] proposed a unified framework for pose transfer, novel view synthesis, and appearance transfer all at once. Siarohin et al. [184], [185] estimated unsupervised keypoints from the input images and predicted a dense motion field to warp the source features to the target pose. Wang et al. [216] extended vid2vid [217] to the few-shot setting by predicting the kernels in the SPADE [156] modules. Similarly, Ren et al. [166] also predicted kernels in their local attention modules using the input images to adaptively select features and warp them to the target pose. While these approaches have achieved better results than previous studies (see Fig. 21), their quality is still not comparable to that of state-of-the-art subject-specific models. Moreover, most of them still synthesize lower resolution outputs (256 or 512 pixels). How to further increase the quality and resolution to the photorealistic level is still an open question.
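For reference, a SPADE [156] block modulates normalized activations with spatially varying scales and offsets predicted from a conditioning map; in the few-shot setting of [216], the weights of the small convolutions below are themselves predicted from the example images. The following sketch shows only the basic modulation and is not the exact architecture of [156] or [216].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SPADE(nn.Module):
    """Spatially-adaptive denormalization: per-pixel scale and offset predicted from a
    conditioning map (e.g., a rasterized pose or a semantic layout)."""
    def __init__(self, num_channels, cond_channels, hidden=64):
        super().__init__()
        self.norm = nn.BatchNorm2d(num_channels, affine=False)
        self.shared = nn.Sequential(nn.Conv2d(cond_channels, hidden, 3, padding=1), nn.ReLU())
        self.to_gamma = nn.Conv2d(hidden, num_channels, 3, padding=1)
        self.to_beta = nn.Conv2d(hidden, num_channels, 3, padding=1)

    def forward(self, x, cond):
        # Resize the conditioning map to the feature resolution before predicting gamma/beta.
        cond = F.interpolate(cond, size=x.shape[-2:], mode="nearest")
        h = self.shared(cond)
        return self.norm(x) * (1 + self.to_gamma(h)) + self.to_beta(h)

features = torch.randn(2, 128, 32, 32)    # generator activations
pose_map = torch.randn(2, 18, 256, 256)   # rasterized driving pose for each sample
out = SPADE(num_channels=128, cond_channels=18)(features, pose_map)
```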

Fig. 21. Subject-agnostic pose transfer videos [216]. Given an example image and a driving pose sequence, the methods can output a sequence of the person performing the motions. Images are from Wang et al. [216].

VII. NEURAL RENDERING

Neural rendering is a recent and upcoming topic in the area of neural networks, which combines classical rendering and generative models. Classical rendering can produce photorealistic images given the complete specification of the world. This includes all the objects in it, their geometry, material properties, the lighting, the cameras, and so on. Creating such a world from scratch is a laborious process that often requires expert manual input. Moreover, faithfully reproducing such data directly from images of the world can often be hard or impossible. On the other hand, as described in Sections IV–VI, GANs have had great success in producing photorealistic images given minimal semantic inputs. The ability to synthesize and learn material properties, textures, and other intangibles from training data can help overcome the drawbacks of classical rendering.

Neural rendering aims to combine the strengths of the two areas to create a more powerful and flexible framework. Neural networks can either be applied as a postprocessing step after classical rendering or as part of the rendering pipeline with the design of 3-D-aware and differentiable layers. Sections VII-A and VII-B discuss such approaches and how they use GAN losses to improve the quality of the outputs. In this article, we focus on studies that use GANs to train neural networks and augment the classical rendering pipeline to generate images. For a general survey on the use of neural networks in rendering, refer to the survey paper on neural rendering by Tewari et al. [196].

We divide the studies on GAN-based neural rendering into two parts: 1) studies that treat 3-D to 2-D projection as a preprocessing step and apply neural networks purely in the 2-D domain and 2) studies that incorporate layers that perform differentiable operations to transform features from 3-D to 2-D or vice versa (3-D ↔ 2-D) and learn some implicit form of geometry to provide 3-D understanding to the network.

Fig. 22. Two common frameworks for neural rendering. (a) In the first set of studies [109], [132], [135], [139], [158], a neural network that purely operates in the 2-D domain is trained to enhance an input image, possibly supplemented with other information, such as depth or segmentation maps. (b) The second set of studies [147], [148], [178], [187], [225] introduces native 3-D operations that produce and transform 3-D features. This allows the network to reason in 3-D and produce view-consistent outputs. (a) 3-D to 2-D projection as a preprocessing step. (b) 3-D ↔ 2-D transform as a part of network training.

A. 3-D to 2-D Projection as a Preprocessing Step

A number of studies [109], [132], [135], [139], [158] improve upon traditional techniques by casting the task of rendering into the framework of image-to-image translation, possibly unimodal, multimodal, or conditional, depending on the exact use cases. Using given camera parameters, the source 3-D world is first projected to a 2-D feature map containing per-pixel information, such as color, depth, surface normals, and segmentation. This feature map is then fed as input to a generator, which tries to produce desired outputs, usually a realistic-looking RGB image. The deep neural network application happens in the 2-D space after the 3-D world is projected to the camera view, and no features or gradients are backpropagated to the 3-D source world or through the camera projection. A key advantage of this approach is that the traditional graphics rendering pipeline can be easily augmented to immediately take advantage of proven and mature techniques from 2-D image-to-image translation (as discussed in Section IV), without the need for designing and implementing differentiable projection layers or transformations that are part of the deep network during training. This type of framework is illustrated in Fig. 22(a).
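A minimal version of this preprocessing step, here for a point cloud with per-point attributes (e.g., color) and a pinhole camera, is sketched below. Real systems add splatting, anti-aliasing, and more careful depth handling on top of this, and the camera and attribute settings in the example are arbitrary.

```python
import numpy as np

def project_points(points, attributes, K, R, t, height, width):
    """Project Nx3 world points into an (H, W, C) attribute map plus a depth map."""
    cam = points @ R.T + t                      # world -> camera coordinates
    in_front = cam[:, 2] > 1e-6                 # keep points in front of the camera
    cam, attributes = cam[in_front], attributes[in_front]
    pix = cam @ K.T                             # pinhole projection
    u = np.round(pix[:, 0] / pix[:, 2]).astype(int)
    v = np.round(pix[:, 1] / pix[:, 2]).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, attributes = u[inside], v[inside], cam[inside, 2], attributes[inside]

    feat = np.zeros((height, width, attributes.shape[1]), dtype=np.float32)
    depth = np.full((height, width), np.inf, dtype=np.float32)
    for ui, vi, zi, ai in zip(u, v, z, attributes):
        if zi < depth[vi, ui]:                  # crude z-buffer: keep the nearest point per pixel
            depth[vi, ui] = zi
            feat[vi, ui] = ai
    return feat, depth

# Toy scene: 10,000 random colored points seen by a 256x256 camera at the origin.
pts = np.random.randn(10000, 3) + np.array([0.0, 0.0, 5.0])
colors = np.random.rand(10000, 3)
K = np.array([[200.0, 0.0, 128.0], [0.0, 200.0, 128.0], [0.0, 0.0, 1.0]])
feature_map, depth_map = project_points(pts, colors, K, np.eye(3), np.zeros(3), 256, 256)
# feature_map (and optionally depth_map) is then fed to a 2-D image-to-image generator.
```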

Martin-Brualla et al. [135] introduced the notion of rerendering, where a deep neural network takes as input a rendered 2-D image and enhances it (improving colors, boundaries, resolution, and so on) to produce a rerendered image. The full pipeline consists of two steps: a traditional 3-D to 2-D rendering step and a trainable deep network that enhances the rendered 2-D image. The 3-D to 2-D rendering technique can be differentiable or nondifferentiable, but no gradients are backpropagated through this step, which allows one to use more complex rendering techniques. By using this two-step process, the output of a performance capture system, which might suffer from noise, poor color reproduction, and other issues, can be improved. In this particular work, they did not see an improvement from using a GAN loss, perhaps because they trained their system on the limited domain of people and faces, using carefully captured footage.

Meshry et al. [139] and Li et al. [109] extended this approach to the more challenging domain of unstructured photo collections. They produce multiple plausible views of famous landmarks from noisy point clouds generated from Internet photo collections by utilizing structure from motion (SfM). Meshry et al. [139] generated a 2-D feature map containing per-pixel albedo and depth by splatting points of the 3-D point cloud onto a given viewpoint. The segmentation map of the expected output image is also concatenated to this feature representation. The problem is then framed as a multimodal image translation problem: a noisy and incomplete input has to be translated to a realistic image conditioned on a style code to produce desired environmental effects, such as lighting. Li et al. [109] used a similar approach but with multiplane images and achieved better photorealism. Pittaluga et al. [158] tackled the task of producing 2-D color images of the underlying scene given as input a sparse SfM point cloud with associated point attributes, such as color, depth, and SIFT descriptors. The input to their network is a 2-D feature map obtained by projecting the 3-D points to the image plane given the camera parameters; the attributes of each 3-D point are copied to the 2-D pixel location to which it is projected. Mallya et al. [132] precomputed the mapping of the 3-D world point cloud to the pixel locations in the images produced by cameras with known parameters and used this to obtain an estimate of the next frame, referred to as a "guidance image." They learn to output video frames consistent over time and viewpoints by conditioning the generator on these noisy estimates.

Fig. 23. Inverting images from 3-D point clouds and their associated depth and SIFT attributes [158]. The top row of images is produced by a generator trained without an adversarial loss, whereas the bottom row uses an adversarial loss. Using an adversarial loss helps generate better details and more plausible colors. Images are from Pittaluga et al. [158].

In these studies, the use of a generator coupled with an adversarial loss helps produce better-looking outputs conditioned on the input feature maps. Similar to applications of pix2pixHD [218], such as manipulating output images by editing input segmentation maps, Meshry et al. [139] are able to remove people and transient objects from images of landmarks and generate plausible inpaintings. A key motivation of the work of Pittaluga et al. [158] was to explore whether a user's privacy can be protected by techniques such as discarding the color of the 3-D points. A very interesting observation was that discarding color information helps prevent accurate reproduction. However, the use of a GAN loss recovers plausible colors and greatly improves the output results, as shown in Fig. 23. GAN losses might also be helpful in cases where it is hard to manually define a good loss function, either due to the inherent ambiguity in determining the desired behavior or the difficulty in fully labeling the data.
B. 3-D ↔ 2-D Transform as a Part of Network Training

In the previous set of studies, the geometry of the world or object is explicitly provided, and neural rendering is purely used to enhance the appearance or add details to the traditionally rendered image or feature maps. The studies in this section [147], [148], [178], [187], [225] introduce native 3-D operations in the neural network used to learn from and produce images. These operations enable them to model the geometry and appearance of the scene in the feature space. The general pipeline of this line of studies is illustrated in Fig. 22(b). Learning a 3-D representation and modeling the process of image projection and formation inside the network have several advantages: the ability to reason in 3-D, to control the pose, and to produce a series of consistent views of a scene. Contrast this with the neural network shown in Fig. 22(a), which operates purely in the 2-D domain.

DeepVoxels [187] learns a persistent 3-D voxel feature representation of a scene given a set of multiview images and their associated camera intrinsic and extrinsic parameters. Features are first extracted from the 2-D views and then lifted to a 3-D volume. This 3-D volume is then integrated into the persistent DeepVoxels representation. These 3-D features are then projected to 2-D using a projection layer, and a new view of the object is synthesized using a U-Net generator. This generator network is trained with an ℓ1 loss and a GAN loss. The authors found that using a GAN loss accelerates the generation of high-frequency details, especially at earlier stages of training. Similar to DeepVoxels [187], visual object networks (VONs) [256] generate a voxel grid from a sampled noise vector and use a differentiable projection layer to map the voxel grid to a 2.5-D sketch. Inspired by classical graphics rendering pipelines, this work decomposes image formation into three conditionally independent factors of shape, viewpoint, and texture. Trained with a GAN loss, their model synthesizes more photorealistic images, and the use of the disentangled representation allows for 3-D manipulations, which are not feasible with purely 2-D methods.
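To give a flavor of such a projection layer, here is a heavily simplified, differentiable sketch that collapses a voxel feature grid along the depth axis using learned per-voxel visibility weights. The actual layers in [187] and [256] first resample the volume into the camera frustum before this reduction, so this is an illustration rather than a reimplementation.

```python
import torch
import torch.nn as nn

class DepthSoftmaxProjection(nn.Module):
    """Collapse (B, C, D, H, W) voxel features to (B, C, H, W) with soft visibility weights."""
    def __init__(self, feat_channels):
        super().__init__()
        # One visibility logit per voxel, predicted from its feature vector.
        self.visibility = nn.Conv3d(feat_channels, 1, kernel_size=1)

    def forward(self, voxels):
        logits = self.visibility(voxels)          # (B, 1, D, H, W)
        weights = torch.softmax(logits, dim=2)    # soft occlusion reasoning along the depth axis
        return (voxels * weights).sum(dim=2)      # (B, C, H, W)

voxel_features = torch.randn(1, 32, 16, 64, 64)   # features lifted from multiview images
image_plane_features = DepthSoftmaxProjection(32)(voxel_features)
# These 2-D features would then be decoded by a U-Net generator trained with L1 + GAN losses.
```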

HoloGAN [147] proposes a system to learn 3-D voxel feature representations of the world and to render them to realistic-looking images. Unlike VONs [256], HoloGAN does not require explicit 3-D data or supervision and can do so using unlabeled images (no pose, explicit 3-D shape, or multiple views). By incorporating a 3-D rigid-body transformation module and a 3-D-to-2-D projection module in the network, HoloGAN provides the ability to control the pose of the generated objects. HoloGAN employs a multiscale feature GAN discriminator, and the authors empirically observed that this helps prevent mode collapse. BlockGAN [148] extends the unsupervised approach of HoloGAN [147] to also consider object disentanglement. BlockGAN learns 3-D features per object and for the background. These are combined into 3-D scene features after applying appropriate transformations before projecting them into the 2-D space. One issue with learning scene compositionality without explicit supervision is the conflation of features of the foreground object and the background, which results in visual artifacts when objects or the camera move. By adding more powerful "style" discriminators (feature discriminators introduced in [147]) to their training scheme, the authors observed that the disentangling of features improved, resulting in cleaner outputs.

SynSin [225] learns an end-to-end model for view synthesis from a single image, without any ground-truth 3-D supervision. Unlike the above studies that internally use a feature voxel representation, SynSin predicts a point cloud of features from the input image and then projects it to new views using a differentiable point cloud renderer. Two-dimensional image features and a depth map are first predicted from the input image. Based on the depth map, the 2-D features are projected to 3-D to obtain the 3-D feature point cloud. The network is trained adversarially with a discriminator based on the one proposed by Wang et al. [218].

One of the drawbacks of voxel-based feature representations is the cubic growth in the memory required to store them. To keep requirements manageable, voxel-based approaches are typically restricted to low resolutions. GRAF [178] proposes to use conditional radiance fields, which are a continuous mapping from a 3-D location and a 2-D viewing direction to an RGB color value, as the intermediate feature representation. They also use a single discriminator similar to PatchGAN [75], with weights that are shared across patches with different receptive fields. This allows them to capture the global context and refine local details.
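The radiance-field representation itself is a small coordinate network. Below is a minimal sketch of a conditional radiance field in the spirit of [178]; the positional encoding, the separate shape and appearance codes, the patch-based discriminator, and the volume-rendering step that turns these point-wise predictions into pixels are all omitted.

```python
import torch
import torch.nn as nn

class ConditionalRadianceField(nn.Module):
    """Map (3-D location, viewing direction, latent code) to (RGB color, density)."""
    def __init__(self, latent_dim=128, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.density_head = nn.Linear(hidden, 1)
        self.color_head = nn.Sequential(
            nn.Linear(hidden + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid())

    def forward(self, xyz, view_dir, z):
        h = self.trunk(torch.cat([xyz, z], dim=-1))
        sigma = torch.relu(self.density_head(h))                 # nonnegative density
        rgb = self.color_head(torch.cat([h, view_dir], dim=-1))  # view-dependent color
        return rgb, sigma

field = ConditionalRadianceField()
xyz = torch.rand(4096, 3)                       # points sampled along camera rays
view_dir = torch.randn(4096, 3)
view_dir = view_dir / view_dir.norm(dim=-1, keepdim=True)
z = torch.randn(1, 128).expand(4096, -1)        # one latent code shared by the whole object/scene
rgb, sigma = field(xyz, view_dir, z)            # pixels are obtained by integrating these along rays
```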
Table 4. Key differences among 3-D-aware methods. Adversarial losses are used by a range of methods that differ in the type of 3-D feature representation and training supervision.

As summarized in Table 4, the studies discussed in this section use a variety of 3-D feature representations and train their networks using paired input–output with known transformations or unlabeled and unpaired data. The use of a GAN loss is common to all these approaches. This is perhaps because traditional hand-designed losses, such as the ℓ1 loss or even the perceptual loss, are unable to fully capture what makes a synthesized image look unrealistic. Furthermore, in the case where explicit task supervision is unavailable, BlockGAN [148] shows that a GAN loss can help in learning disentangled features by ensuring that the outputs after projection and rendering look realistic. The learnability and flexibility of the GAN loss to the task at hand help provide feedback, guiding how to change the generated image and, thus, the upstream features, so that it looks as if it were sampled from the distribution of real images. This makes the GAN framework a powerful asset in the toolbox of any neural rendering practitioner.

VIII. LIMITATIONS AND OPEN PROBLEMS

Despite the successful applications introduced above, there are still limitations of GANs that need to be addressed by future work.

A. Evaluation Metrics

Evaluating and comparing different GAN models is difficult. The most popular evaluation metrics are perhaps the inception score (IS) [176] and the Fréchet inception distance (FID) [67], both of which have many shortcomings. The IS, for example, is not able to detect intraclass mode collapse [23]. In other words, a model that generates only a single image per class can obtain a high IS. FID can better measure such diversity, but it does not have an unbiased estimator [19]. The kernel inception distance (KID) [19] can capture higher order statistics and has an unbiased estimator but has been empirically found to suffer from high variance [165]. In addition to the above measures that summarize performance with a single number, there are metrics that separately evaluate the fidelity and diversity of the generator distribution [97], [143], [173].
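For concreteness, FID fits a Gaussian to Inception features of real and of generated images and computes the Fréchet distance between the two Gaussians, FID = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2 (Sigma_r Sigma_g)^(1/2)). A small sketch of the final distance computation, assuming the Inception activations have already been extracted, is given below.

```python
import numpy as np
from scipy import linalg

def frechet_inception_distance(feats_real, feats_fake):
    """FID between two sets of Inception activations, each of shape (N, D)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)        # matrix square root of the covariance product
    if np.iscomplexobj(covmean):
        covmean = covmean.real                   # drop tiny imaginary components from numerics
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Toy example with random "activations"; in practice, the features come from a fixed layer of
# an Inception network evaluated on tens of thousands of real and generated images.
fid = frechet_inception_distance(np.random.randn(5000, 64), np.random.randn(5000, 64))
```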

B. Instability

Although the regularization techniques introduced in Section III-C have greatly improved the stability of GAN training, GANs are still much more unstable to train than supervised discriminative models or likelihood-based generative models. For example, even the state-of-the-art BigGAN model would eventually collapse in the late stage of training on ImageNet [24]. Also, the final performance is generally very sensitive to hyperparameters [96], [126].

C. Interpretability

Despite the impressive quality of the generated images, there has been a lack of understanding of how GANs represent the image structure internally in the generator. Bau et al. [13] visualized the causal effect of different neurons on the output image. After finding the semantic meaning of individual neurons or directions in the latent space [55], [76], [180], one can edit a real image by inverting it to the latent space, editing the latent code according to the desired semantic change, and regenerating it with the generator. Finding the best way to encode an image into the latent space is, therefore, another interesting research direction [1], [2], [14], [72], [86], [253].
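A schematic of this edit-by-inversion workflow is sketched below with a toy stand-in generator and a hypothetical, precomputed semantic direction; in practice, both the generator and the direction come from a pretrained GAN and a method such as [180], and the inversion objective is more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in for a pretrained generator; in practice this would be, e.g., a StyleGAN generator.
generator = nn.Sequential(nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 3 * 64 * 64), nn.Tanh())
for p in generator.parameters():
    p.requires_grad_(False)                      # the generator stays fixed during inversion

def invert(image, steps=500, lr=0.05):
    """Optimization-based inversion: find a latent code whose output matches the image."""
    z = torch.zeros(1, 128, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(generator(z), image)   # real systems add perceptual/feature losses
        loss.backward()
        opt.step()
    return z.detach()

real_image = torch.rand(1, 3 * 64 * 64) * 2 - 1       # flattened image in [-1, 1]
z = invert(real_image)
smile_direction = torch.randn(1, 128)                  # hypothetical semantic direction in latent space
edited_image = generator(z + 2.0 * smile_direction)    # regenerate with the desired semantic change
```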
D. Forensics

The success of GANs has enabled many new applications but has also raised ethical and social concerns, such as fraud and fake news. The ability to detect GAN-generated images is essential to prevent malicious usage of GANs. Recent studies have found it possible to train a classifier to detect generated images and generalize to unseen generator architectures [25], [215], [245]. This cat-and-mouse game may continue, as generated images may become increasingly harder to detect in the future.

IX. CONCLUSION

In this article, we present a comprehensive overview of GANs with an emphasis on algorithms and applications to visual synthesis. We summarize the evolution of the network architectures in GANs and the strategies to stabilize GAN training. We then introduce several fascinating applications of GANs, including image translation, image processing, video synthesis, and neural rendering. In the end, we point out some open problems for GANs, and we hope that this article will inspire future research to solve them.

REFERENCES
[1] R. Abdal, Y. Qin, and P. Wonka, “Image2StyleGAN: image pair,” 2020, arXiv:2004.02222. [Online]. “Attention-GAN for object transfiguration in wild
How to embed images into the StyleGAN latent Available: http://arxiv.org/abs/2004.02222 images,” in Proc. ECCV, 2018, pp. 164–180.
space?” ICCV, 2019, pp. 4432–4441. [16] S. Benaim and L. Wolf, “One-sided unsupervised [32] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and
[2] R. Abdal, Y. Qin, and P. Wonka, domain mapping,” in Proc. NeurIPS, 2017, J. Choo, “StarGAN: Unified generative adversarial
“Image2StyleGAN++: How to edit the embedded pp. 752–762. networks for multi-domain image-to-image
images?” CVPR, 2020, pp. 8296–8305. [17] S. Benaim and L. Wolf, “One-shot unsupervised translation,” in Proc. CVPR, 2018, pp. 8789–8797.
[3] K. Aberman, M. Shi, J. Liao, D. Liscbinski, cross domain translation,” in Proc. NeurIPS, 2018, [33] A. Clark, J. Donahue, and K. Simonyan, “Efficient
B. Chen, and D. Cohen-Or, “Deep video-based pp. 2104–2114. video generation on complex datasets,” 2019,
performance cloning,” Comput. Graph. Forum, [18] Y. Bengio, A. Courville, and P. Vincent, arXiv:1907.06571. [Online]. Available:
vol. 38, no. 2, pp. 219–233, May 2019. “Representation learning: A review and new https://arxiv.org/abs/1907.06571
[4] E. Agustsson, M. Tschannen, F. Mentzer, perspectives,” IEEE Trans. Pattern Anal. Mach. [34] T. Cohen and L. Wolf, “Bidirectional one-shot
R. Timofte, and L. Van Gool, “Generative Intell., vol. 35, no. 8, pp. 1798–1828, Aug. 2013. unsupervised domain mapping,” in Proc. ICCV,
adversarial networks for extreme learned image [19] M. Bińkowski, D. J. Sutherland, M. Arbel, and 2019, pp. 1784–1792.
compression,” in Proc. ICCV, 2019, pp. 221–231. A. Gretton, “Demystifying MMD GANs,” in Proc. [35] A. Creswell, T. White, V. Dumoulin,
[5] A. Almahairi, S. Rajeshwar, A. Sordoni, ICLR, 2018. K. Arulkumaran, B. Sengupta, and A. A. Bharath,
P. Bachman, and A. Courville, “Augmented [20] V. Blanz and T. Vetter, “A morphable model for the “Generative adversarial networks: An overview,”
CycleGAN: Learning many-to-many mappings synthesis of 3D faces,” in Proc. 26th Annu. Conf. IEEE Signal Process. Mag., vol. 35, no. 1,
from unpaired data,” in Proc. ICML, 2018, Comput. Graph. Interact. Techn., 1999, pp. 53–65, Jan. 2018.
pp. 195–204. pp. 187–194. [36] N. Dalal and B. Triggs, “Histograms of oriented
[6] M. Arjovsky, S. Chintala, and L. Bottou, [21] Y. Blau, R. Mechrez, R. Timofte, T. Michaeli, and gradients for human detection,” in Proc. CVPR,
“Wasserstein GAN,” in Proc. ICML, 2017, L. Zelnik-Manor, “The 2018 PIRM challenge on 2005, pp. 886–893.
pp. 214–223. perceptual image super-resolution,” in Proc. ECCV [37] P. Dayan, G. E. Hinton, R. M. Neal, and
[7] H. Averbuch-Elor, D. Cohen-Or, J. Kopf, and Workshop, 2018, pp. 334–355. R. S. Zemel, “The Helmholtz machine,” Neural
M. F. Cohen, “Bringing portraits to life,” ACM [22] Y. Blau and T. Michaeli, “The perception-distortion Comput., vol. 7, no. 5, pp. 889–904, Sep. 1995.
Trans. Graph., vol. 36, no. 6, pp. 1–13, Nov. 2017. tradeoff,” in Proc. CVPR, 2018, pp. 6228–6237. [38] E. de Bézenac, I. Ayed, and P. Gallinari, “Optimal
[8] J. Lei Ba, J. Ryan Kiros, and G. E. Hinton, “Layer [23] A. Borji, “Pros and cons of GAN evaluation unsupervised domain translation,” 2019,
normalization,” 2016, arXiv:1607.06450. measures,” Comput. Vis. Image Understand., arXiv:1906.01292. [Online]. Available:
[Online]. Available: http://arxiv.org/ vol. 179, pp. 41–65, Feb. 2019. http://arxiv.org/abs/1906.01292
abs/1607.06450 [24] A. Brock, J. Donahue, and K. Simonyan, “Large [39] E. L. Denton and V. Birodkar, “Unsupervised
[9] K. Baek, Y. Choi, Y. Uh, J. Yoo, and H. Shim, scale GAN training for high fidelity natural image learning of disentangled representations from
“Rethinking the truly unsupervised synthesis,” in Proc. ICLR, 2019. video,” in Proc. NeurIPS, 2017, pp. 4414–4423.
image-to-image translation,” 2020, [25] L. Chai, D. Bau, S.-N. Lim, and P. Isola, “What [40] L. Dinh, D. Krueger, and Y. Bengio, “NICE:
arXiv:2006.06500. [Online]. Available: makes fake images detectable? understanding Non-linear independent components estimation,”
http://arxiv.org/abs/2006.06500 properties that generalize,” in Proc. ECCV, 2020, in Proc. ICLR, 2015.
[10] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, pp. 103–120. [41] L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density
and J. Guttag, “Synthesizing images of humans in [26] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros, estimation using real NVP,” in Proc. ICLR, 2017.
unseen poses,” in Proc. CVPR, 2018, “Everybody dance now,” in Proc. ICCV, 2019, [42] C. Dong, C. C. Loy, K. He, and X. Tang, “Image
pp. 8340–8348. pp. 5933–5942. super-resolution using deep convolutional
[11] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, [27] J. Chen, J. Chen, H. Chao, and M. Yang, “Image networks,” IEEE Trans. Pattern Anal. Mach. Intell.,
“Recycle-GAN: Unsupervised video retargeting,” blind denoising with generative adversarial vol. 38, no. 2, pp. 295–307, Feb. 2016.
in Proc. ECCV, 2018, pp. 119–135. network based noise modeling,” in Proc. CVPR, [43] C. Dong, C. C. Loy, and X. Tang, “Accelerating the
[12] C. Barnes, E. Shechtman, A. Finkelstein, and 2018, pp. 3155–3164. super-resolution convolutional neural network,”
D. B. Goldman, “PatchMatch: A randomized [28] L. Chen, Z. Li, R. K. Maddox, Z. Duan, and C. Xu, in Proc. ECCV, 2016, pp. 391–407.
correspondence algorithm for structural image “Lip movements generation at a glance,” in Proc. [44] H. Dong, X. Liang, K. Gong, H. Lai, J. Zhu, and
editing,” ACM Trans. Graph., vol. 28, no. 3, p. 24, ECCV, 2018, pp. 520–535. J. Yin, “Soft-gated warping-GAN for pose-guided
2009. [29] L. Chen, R. K. Maddox, Z. Duan, and C. Xu, person image synthesis,” in Proc. NeurIPS, 2018,
[13] D. Bau et al., “GAN dissection: Visualizing and “Hierarchical cross-modal talking face generation pp. 474–484.
understanding generative adversarial networks,” with dynamic pixel-wise loss,” in Proc. CVPR, [45] Y. Du and I. Mordatch, “Implicit generation and
in Proc. ICLR, 2019. 2019, pp. 7832–7841. generalization in energy-based models,” in Proc.
[14] D. Bau, J.-Y. Zhu, J. Wulff, W. Peebles, H. Strobelt, [30] X. Chen, N. Mishra, M. Rohaninejad, and NeurIPS, 2019, pp. 3608–3618.
B. Zhou, and A. Torralba, “Seeing what a GAN P. Abbeel, “PixelSNAIL: An improved [46] P. Esser, E. Sutter, and B. Ommer, “A variational
cannot generate,” in Proc. ICCV, 2019. autoregressive generative model,” in Proc. ICML, u-net for conditional appearance and shape
[15] S. Benaim, R. Mokady, A. Bermano, D. Cohen-Or, 2018, pp. 864–872. generation,” in Proc. CVPR, 2018, pp. 8857–8866.
and L. Wolf, “Structural-analogy from a single [31] X. Chen, C. Xu, X. Yang, and D. Tao, [47] C. Finn, I. Goodfellow, and S. Levine,


“Unsupervised learning for physical interaction “Multimodal unsupervised image-to-image deblurring using conditional adversarial
through video prediction,” in Proc. NeurIPS, 2016, translation,” in Proc. ECCV, 2018, pp. 172–189. networks,” in Proc. CVPR, 2018, pp. 8183–8192.
pp. 64–72. [71] Z. Huang, Y. Xu, C. Lassner, H. Li, and T. Tung, [96] K. Kurach, M. Lučić, X. Zhai, M. Michalski, and
[48] A. Fischer and C. Igel, “An introduction to “ARCH: Animatable reconstruction of clothed S. Gelly, “A large-scale study on regularization
restricted Boltzmann machines,” in Proc. humans,” in Proc. CVPR, 2020, pp. 3093–3102. and normalization in GANs,” in Proc. Int. Conf.
Iberoamerican Congr. Pattern Recognit., 2012, [72] M. Huh, R. Zhang, J.-Y. Zhu, S. Paris, and Mach. Learn., 2019, pp. 3581–3590.
pp. 14–36. A. Hertzmann, “Transforming and projecting [97] T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen,
[49] O. Fried et al., “Text-based editing of talking-head images into class-conditional generative and T. Aila, “Improved precision and recall metric
video,” ACM Trans. Graph., vol. 38, no. 4, networks,” in Proc. ECCV, 2020, pp. 17–34. for assessing generative models,” in Proc. NeurIPS,
pp. 1–14, Jul. 2019. [73] S. Iizuka, E. Simo-Serra, and H. Ishikawa, 2019, pp. 3927–3936.
[50] O. Gafni, L. Wolf, and Y. Taigman, “Vid2Game: “Globally and locally consistent image [98] W.-S. Lai, J.-B. Huang, O. Wang, E. Shechtman,
Controllable characters extracted from real-world completion,” ACM Trans. Graph., vol. 36, no. 4, E. Yumer, and M.-H. Yang, “Learning blind video
videos,” in Proc. ICLR, 2020. pp. 1–14, Jul. 2017. temporal consistency,” in Proc. ECCV, 2018,
[51] T. Galanti, L. Wolf, and S. Benaim, “The role of [74] S. Ioffe and C. Szegedy, “Batch normalization: pp. 170–185.
minimal complexity functions in unsupervised Accelerating deep network training by reducing [99] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and
learning of semantic mappings,” in Proc. ICLR, internal covariate shift,” in Proc. ICML, 2015, O. Winther, “Autoencoding beyond pixels using a
2018. pp. 448–456. learned similarity metric,” in Proc. ICML, 2016,
[52] L. Galteri, L. Seidenari, M. Bertini, and [75] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, pp. 1558–1566.
A. D. Bimbo, “Deep generative adversarial “Image-to-image translation with conditional [100] Y. LeCun, Y. Bengio, and G. Hinton, “Deep
compression artifact removal,” in Proc. ICCV, adversarial networks,” in Proc. CVPR, 2017, learning,” Nature, vol. 521, pp. 436–444,
2017, pp. 4826–4835. pp. 1125–1134. May 2015.
[53] J. Geng, T. Shao, Y. Zheng, Y. Weng, and K. Zhou, [76] A. Jahanian, L. Chai, and P. Isola, “On the [101] Y. LeCun et al., “A tutorial on energy-based
“Warp-guided GANs for single-photo facial ‘steerability’ of generative adversarial networks,” learning,” in Predicting Structured Data.
animation,” ACM Trans. Graph., vol. 37, no. 6, in Proc. ICLR, 2020. Cambridge, MA, USA: MIT Press 2006.
pp. 1–12, Jan. 2019. [77] A. Jamaludin, J. S. Chung, and A. Zisserman, “You [102] C. Ledig et al., “Photo-realistic single image
[54] A. Ghosh, V. Kulharia, V. P. Namboodiri, P. H. Torr, said that: Synthesising talking faces from audio,” super-resolution using a generative adversarial
and P. K. Dokania, “Multi-agent diverse generative in Proc. IJCV, 2019, pp. 1767–1779. network,” in Proc. CVPR, 2017.
adversarial networks,” in Proc. CVPR, 2018, [78] Y. Jo and J. Park, “SC-FEGAN: Face editing [103] A. X. Lee, R. Zhang, F. Ebert, P. Abbeel, C. Finn,
pp. 8513–8521. generative adversarial network with user’s sketch and S. Levine, “Stochastic adversarial video
[55] L. Goetschalckx, A. Andonian, A. Oliva, and and color,” in Proc. ICCV, 2019, pp. 1745–1753. prediction,” 2018, arXiv:1804.01523. [Online].
P. Isola, “GANalyze: Toward visual definitions of [79] J. Johnson, A. Alahi, and L. Fei-Fei, “Perceptual Available: http://arxiv.org/abs/
cognitive image properties,” in Proc. ICCV, 2019, losses for real-time style transfer and 1804.01523
pp. 5744–5753. super-resolution,” in Proc. ECCV, 2016, [104] C.-H. Lee, Z. Liu, L. Wu, and P. Luo, “MaskGAN:
[56] X. Gong, S. Chang, Y. Jiang, and Z. Wang, pp. 694–711. Towards diverse and interactive facial image
“AutoGAN: Neural architecture search for [80] A. Jolicoeur-Martineau, “The relativistic manipulation,” in Proc. CVPR, 2020.
generative adversarial networks,” in Proc. ICCV, discriminator: A key element missing from [105] H.-Y. Lee, H.-Y. Tseng, J.-B. Huang, M. Singh, and
2019, pp. 3224–3234. standard GAN,” in Proc. ICLR, 2019. M.-H. Yang, “Diverse image-to-image translation
[57] A. Gonzalez-Garcia, J. Van De Weijer, and [81] A. Jolicoeur-Martineau, “On relativistic via disentangled representations,” in Proc. ECCV,
Y. Bengio, “Image-to-image translation for f-divergences,” in Proc. ICML, 2020, 2018, pp. 35–51.
cross-domain disentanglement,” in Proc. NeurIPS, pp. 4931–4939. [106] S. Lee, J. Ha, and G. Kim, “Harmonizing
2018, pp. 1287–1298. [82] D. Joo, D. Kim, and J. Kim, “Generating a fusion maximum likelihood with GANs for multimodal
[58] I. Goodfellow, “NIPS 2016 tutorial: Generative image: One’s identity and another’s shape,” in conditional generation,” in Proc. ICLR, 2019.
adversarial networks,” 2017, arXiv:1701.00160. Proc. CVPR, 2018, pp. 1635–1643. [107] K. Li, T. Zhang, and J. Malik, “Diverse image
[Online]. Available: [83] N. Kalchbrenner et al., “Video pixel networks,” in synthesis from semantic layouts via conditional
http://arxiv.org/abs/1701.00160 Proc. ICML, 2017, pp. 1771–1779. IMLE,” in Proc. ICCV, 2019, pp. 4220–4229.
[59] I. Goodfellow, Y. Bengio, and A. Courville, Deep [84] T. Karras, T. Aila, S. Laine, and J. Lehtinen, [108] Y. Li, C. Huang, and C. C. Loy, “Dense intrinsic
Learning. Cambridge, MA, USA: MIT Press, 2016. “Progressive growing of GANs for improved appearance flow for human pose transfer,” in Proc.
[60] I. Goodfellow et al., “Generative adversarial quality, stability, and variation,” in Proc. ICLR, CVPR, 2019, pp. 3693–3702.
networks,” in Proc. NeurIPS, 2014. 2018. [109] Z. Li, W. Xian, A. Davis, and N. Snavely,
[61] A. Grigorev, A. Sevastopolsky, A. Vakhitov, and [85] T. Karras, S. Laine, and T. Aila, “A style-based “Crowdsampling the plenoptic function,” in Proc.
V. Lempitsky, “Coordinate-based texture generator architecture for generative adversarial ECCV, 2020, pp. 178–196.
inpainting for pose-guided human image networks,” in Proc. CVPR, 2019, pp. 4401–4410. [110] X. Liang, L. Lee, W. Dai, and E. P. Xing, “Dual
generation,” in Proc. CVPR, 2019, [86] T. Karras, S. Laine, M. Aittala, J. Hellsten, motion GAN for future-flow embedded video
pp. 12135–12144. J. Lehtinen, and T. Aila, “Analyzing and improving prediction,” in Proc. NeurIPS, 2017,
[62] K. Gu, Y. Zhou, and T. S. Huang, “FLNet: the image quality of StyleGAN,” in Proc. CVPR, pp. 1744–1752.
Landmark driven fetching and learning network 2020, pp. 8110–8119. [111] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee,
for faithful talking facial animation synthesis,” in [87] H. Kim et al., “Deep video portraits,” ACM Trans. “Enhanced deep residual networks for single
Proc. AAAI, 2020, pp. 10861–10868. Graph., vol. 37, no. 4, pp. 1–14, Aug. 2018. image super-resolution,” in Proc. CVPR Workshop,
[63] R. A. Güler, N. Neverova, and I. Kokkinos, [88] J. Kim, J. Kwon Lee, and K. Mu Lee, “Accurate 2017, pp. 136–144.
“DensePose: Dense human pose estimation in the image super-resolution using very deep [112] J. Hyun Lim and J. Chul Ye, “Geometric GAN,”
wild,” in Proc. CVPR, 2018, pp. 7297–7306. convolutional networks,” in Proc. CVPR, 2016, 2017, arXiv:1705.02894. [Online]. Available:
[64] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, pp. 1646–1654. http://arxiv.org/abs/1705.02894
and A. C. Courville, “Improved training of [89] K. Kim, Y. Yun, K.-W. Kang, K. Kong, S. Lee, and [113] J. Lin, Y. Pang, Y. Xia, Z. Chen, and J. Luo,
wasserstein GANs,” in Proc. NeurIPS, 2017, S.-J. Kang, “Painting outside as inside: Edge “TuiGAN: Learning versatile image-to-image
pp. 5767–5777. guided image outpainting via bidirectional translation with two unpaired images,” in Proc.
[65] S. Ha, M. Kersner, B. Kim, S. Seo, and D. Kim, rearrangement with progressive step learning,” ECCV, 2020, pp. 18–35.
“MarioNETte: Few-shot face reenactment 2020, arXiv:2010.01810. [Online]. Available: [114] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao,
preserving identity of unseen targets,” in Proc. http://arxiv.org/abs/2010.01810 and B. Catanzaro, “Image inpainting for irregular
AAAI, 2020, pp. 10893–10900. [90] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, holes using partial convolutions,” in Proc. ECCV,
[66] K. He, X. Zhang, S. Ren, and J. Sun, “Deep “Learning to discover cross-domain relations with 2018, pp. 85–100.
residual learning for image recognition,” in Proc. generative adversarial networks,” in Proc. ICML, [115] L. Liu et al., “Neural human video rendering by
CVPR, 2016, pp. 10893–10900. 2017, pp. 1857–1865. learning dynamic textures and rendering-to-video
[67] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, [91] D. Kingma and J. Ba, “Adam: A method for translation,” IEEE Trans. Vis. Comput. Graphics,
and S. Hochreiter, “GANs trained by a two stochastic optimization,” in Proc. ICLR, 2015. early access, May 2020, doi:
time-scale update rule converge to a local Nash [92] D. P. Kingma and P. Dhariwal, “Glow: Generative 10.1109/TVCG.2020.2996594.
equilibrium,” in Proc. NeurIPS, 2017, flow with invertible 1×1 convolutions,” in Proc. [116] L. Liu et al., “Neural rendering and reenactment of
pp. 6626–6637. NeurIPS, 2018, pp. 10215–10224. human actor videos,” ACM Trans. Graph., vol. 38,
[68] G. E. Hinton, “Reducing the dimensionality of [93] D. P. Kingma and M. Welling, “Auto-encoding no. 5, pp. 1–14, Nov. 2019.
data with neural networks,” Science, vol. 313, variational Bayes,” in Proc. ICLR, 2013. [117] M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised
no. 5786, pp. 504–507, Jul. 2006. [94] D. P. Kingma and M. Welling, “An introduction to image-to-image translation networks,” in Proc.
[69] X. Huang and S. Belongie, “Arbitrary style transfer variational autoencoders,” 2019, NeurIPS, 2017.
in real-time with adaptive instance arXiv:1906.02691. [Online]. Available: [118] M.-Y. Liu et al., “Few-shot unsupervised
normalization,” in Proc. ICCV, 2017, http://arxiv.org/abs/1906.02691 image-to-image translation,” in Proc. ICCV, 2019,
pp. 1501–1510. [95] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, pp. 10551–10560.
[70] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, and J. Matas, “DeblurGAN: Blind motion [119] M.-Y. Liu and O. Tuzel, “Coupled generative


adversarial networks,” in Proc. NeurIPS, 2016, generative models,” in Proc. ICML, 2020, [166] Y. Ren, X. Yu, J. Chen, T. H. Li, and G. Li, “Deep
pp. 469–477. pp. 7176–7185. image spatial transformation for person image
[120] S. Liu, X. Zhang, J. Wangni, and J. Shi, [144] K. Nagano et al., “PaGAN: Real-time avatars using generation,” in Proc. CVPR, 2020,
“Normalized diversification,” in Proc. CVPR, 2019, dynamic textures,” ACM Trans. Graph., vol. 37, pp. 7690–7699.
pp. 10306–10315. no. 6, pp. 1–12, Jan. 2019. [167] D. J. Rezende and S. Mohamed, “Variational
[121] W. Liu, Z. Piao, J. Min, W. Luo, L. Ma, and S. Gao, [145] K. Nazeri, E. Ng, T. Joseph, F. Z. Qureshi, and inference with normalizing flows,” in Proc. ICML,
“Liquid warping GAN: A unified framework for M. Ebrahimi, “EdgeConnect: Generative image 2015, pp. 1530–1538.
ABOUT THE AUTHORS

Ming-Yu Liu received the Ph.D. degree from the University of Maryland, College Park, MD, USA, in 2012, advised by Prof. Rama Chellappa.

He was a Principal Research Scientist with Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA, USA. He is currently a Distinguished Research Scientist and a Manager with NVIDIA Research, Santa Clara, CA, USA. His goal is to give machines human-like imagination capabilities. His research interest is in generative image modeling.

Dr. Liu won the R&D 100 Award for his contribution to a commercial robotic bin-picking system. His layered street-view semantic labeling paper was a best paper finalist at the Robotics: Science and Systems (RSS) conference in 2015. He won first place in both the Domain Adaptation for Semantic Segmentation Competition and the Robust Optical Flow Estimation Challenge at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) in 2018, and his SPADE paper was a best paper finalist at CVPR in 2019. At SIGGRAPH Real-Time Live 2019, he won both the Best in Show Award and the Audience Choice Award for his GauGAN demo; the GauGAN app further won the Best of What's New Award from Popular Science magazine in 2019. In 2021, his work on GANs for video compression helped NVIDIA win the most disruptive innovator award from Forbes. He has served as an Area Chair for various computer vision and machine learning conferences, including Advances in Neural Information Processing Systems (NeurIPS), International Conference on Machine Learning (ICML), International Conference on Learning Representations (ICLR), Conference on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision (ICCV), European Conference on Computer Vision (ECCV), British Machine Vision Conference (BMVC), and IEEE Winter Conference on Applications of Computer Vision (WACV), and he served as the Program Chair of WACV in 2020.

Xun Huang received the Ph.D. degree from Cornell University, Ithaca, NY, USA, in 2020, under the supervision of Prof. Serge Belongie.

He is currently a Research Scientist with NVIDIA Research, Santa Clara, CA, USA. His research interests include developing new architectures and training algorithms for generative adversarial networks, as well as applications such as image editing and synthesis.

Dr. Huang was a recipient of the NVIDIA Graduate Fellowship, the Adobe Research Fellowship, and the Snap Research Fellowship.

Jiahui Yu (Member, IEEE) received the bachelor's degree (honors) from the School of the Gifted Young in Computer Science, University of Science and Technology of China, Hefei, China, in 2016, and the Ph.D. degree from the University of Illinois at Urbana–Champaign, Champaign, IL, USA, in 2020.

He is currently a Research Scientist with Google Brain, Mountain View, CA, USA. His research interests are in sequence modeling (language, speech, video, and financial data), machine perception (vision), generative models [generative adversarial networks (GANs)], and high-performance computing.

Dr. Yu is a member of the Association for Computing Machinery (ACM) and the Association for the Advancement of Artificial Intelligence (AAAI). He was a recipient of the Baidu Scholarship, the Thomas and Margaret Huang Research Award, and the Microsoft-IEEE Young Fellowship.

Ting-Chun Wang received the Ph.D. degree in electrical engineering and computer sciences (EECS) from the University of California at Berkeley (UC Berkeley), Berkeley, CA, USA, in 2017, advised by Prof. Ravi Ramamoorthi and Prof. Alexei A. Efros.

He is currently a Senior Research Scientist with NVIDIA Research, Santa Clara, CA, USA. His research interests include computer vision, machine learning, and computer graphics, particularly the intersections of all three. His recent research focus is on using generative adversarial models to synthesize realistic images and videos, with applications to rendering, visual manipulation, and beyond.

Dr. Wang won first place in the Domain Adaptation for Semantic Segmentation Competition at CVPR in 2018. His semantic image synthesis paper was a best paper finalist at CVPR in 2019, and the corresponding GauGAN app won the Best in Show Award and the Audience Choice Award at SIGGRAPH Real-Time Live in 2019. He served as an Area Chair of WACV in 2020.

Arun Mallya received the B.Tech. degree in computer science and engineering from IIT Kharagpur, Kharagpur, India, in 2012, and the M.S. degree in computer science and the Ph.D. degree, with a focus on performing multiple tasks efficiently with a single deep network, from the University of Illinois at Urbana–Champaign, Champaign, IL, USA, in 2014 and 2018, respectively.

He is currently a Senior Research Scientist with NVIDIA Research, Santa Clara, CA, USA. He is interested in generative modeling and enabling new applications of deep neural networks.

Dr. Mallya was selected as a Siebel Scholar in 2014.
