Gans
Abstract—Generative adversarial networks (GANs) provide a way to learn deep representations without extensively annotated training data. They achieve this through deriving backpropagation signals through a competitive process involving a pair of networks. The representations that can be learned by GANs may be used in a variety of applications, including image synthesis, semantic image editing, style transfer, image super-resolution and classification. The aim of this review paper is to provide an overview of GANs for the signal processing community, drawing on familiar analogies and concepts where possible. In addition to identifying different methods for training and constructing GANs, we also point to remaining challenges in their theory and application.

Index Terms—neural networks, unsupervised learning, semi-supervised learning.

be used to train the generator, leading it towards being able to produce forgeries of better quality.

The networks that represent the generator and discriminator are typically implemented by multi-layer networks consisting of convolutional and/or fully-connected layers. The generator and discriminator networks must be differentiable, though it is not necessary for them to be directly invertible. If one considers the generator network as mapping from some representation space, called a latent space, to the space of the data (we shall focus on images), then we may express this more formally as $G : G(z) \to \mathbb{R}^{|x|}$, where $z \in \mathbb{R}^{|z|}$ is a sample from the latent space, $x \in \mathbb{R}^{|x|}$ is an image and $|\cdot|$ denotes the number of dimensions.
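As a minimal sketch of the mapping from latent space to data space described above, the following toy generator uses plain Python only. The layer sizes, the random weights and the name `make_generator` are assumptions made for illustration; they are not taken from any architecture in the surveyed papers. The only properties it is meant to show are that the map is differentiable (tanh and sigmoid layers) and that it takes a vector of length |z| to a vector of length |x|.

```python
import math
import random


def make_generator(latent_dim, data_dim, hidden_dim=8, seed=0):
    """Build a toy two-layer generator G: R^latent_dim -> R^data_dim.

    Weights are random and untrained; this is only an illustration of
    the shape of the mapping, not a trained GAN generator.
    """
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
          for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0.0, 1.0) for _ in range(hidden_dim)]
          for _ in range(data_dim)]

    def generator(z):
        # Hidden layer with a smooth (differentiable) non-linearity.
        h = [math.tanh(sum(w * zi for w, zi in zip(row, z))) for row in w1]
        # Sigmoid output keeps each "pixel" within [0, 1].
        return [1.0 / (1.0 + math.exp(-sum(w * hi for w, hi in zip(row, h))))
                for row in w2]

    return generator


# A 4-dimensional latent vector mapped to a 2 x 2 x 3 "image" (12 values).
G = make_generator(latent_dim=4, data_dim=12)
z = [random.Random(1).gauss(0.0, 1.0) for _ in range(4)]
x = G(z)
```

Here `x` plays the role of a flattened image; in practice the layers would be convolutional and/or fully connected and far larger.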
Fig. 1. In this figure, the two models which are learned during the training process for a GAN are the discriminator (D ) and the generator (G ).
These are typically implemented with neural networks, but they could be implemented by any form of differentiable system that maps data from
one space to another; see text for details.
estimating conditional density functions, possibly indirectly in the form of a model which learns the joint distribution of variables of interest and the observed data. The difficulty we face is that likelihood functions for high-dimensional, real-world image data are difficult to construct. Whilst GANs don’t explicitly provide a way of evaluating density functions, for a generator-discriminator pair of suitable capacity, the generator implicitly captures the distribution of the data.

D. Related Work
One may view the principles of generative models by making comparisons with standard techniques in signal processing and data analysis. For example, signal processing makes wide use of the idea of representing a signal as the weighted combination of basis functions. Fixed basis functions underlie standard techniques such as Fourier-based and wavelet representations. Data-driven approaches to constructing basis functions can be traced back to the Hotelling [8] transform, rooted in Pearson’s observation that principal components minimize a reconstruction error according to a minimum squared error criterion. Despite its wide use, standard Principal Components Analysis (PCA) does not have an overt statistical model for the observed data, though it has been shown that the bases of PCA may be derived as a maximum likelihood parameter estimation problem.

Despite wide adoption, PCA itself is limited – the basis functions emerge as the eigenvectors of the covariance matrix over observations of the input data, and the mapping from the representation space back to signal or image space is linear. So, we have both a shallow and a linear mapping, limiting the complexity of the model, and hence of the data, that can be represented.

Independent Components Analysis (ICA) provides another level up in sophistication, in which the signal components no longer need to be orthogonal; the mixing coefficients used to blend components together to construct examples of data are merely considered to be statistically independent. ICA has various formulations that differ in their objective functions used during estimating signal components, or in the generative model that expresses how signals or images are generated from those components. A recent innovation explored through ICA is noise contrastive estimation (NCE); this may be seen as approaching the spirit of GANs [9]: the objective function for learning independent components compares a statistic applied to noise with that produced by a candidate generative model [10]. The original NCE approach did not include updates to the generator.

What other comparisons can be made between GANs and the standard tools of signal processing? For PCA, ICA, Fourier and wavelet representations, the latent space of GANs is, by analogy, the coefficient space of what we commonly refer to as transform space. What sets GANs apart from these standard tools of signal processing is the level of complexity of the models that map vectors from latent space to image space. Because the generator networks contain non-linearities, and can be of almost arbitrary depth, this mapping – as with many other deep learning approaches – can be extraordinarily complex.

With regard to deep image-based models, modern approaches to generative image modelling can be grouped into explicit density models and implicit density models. Explicit density models are either tractable (change of variables models, autoregressive models) or intractable (directed models trained with variational inference, undirected models trained using Markov chains). Implicit density models capture the statistical distribution of the data through a generative process which makes use of either ancestral sampling [11] or Markov chain-based sampling. GANs fall into the directed implicit model category. A more detailed overview and relevant papers can be found in Ian Goodfellow’s NIPS 2016 tutorial [12].

III. GAN ARCHITECTURES
A. Fully Connected GANs
The first GAN architectures used fully connected neural networks for both the generator and discriminator [1]. This type of architecture was applied to relatively simple image datasets, namely MNIST (handwritten digits), CIFAR-10 (natural images) and the Toronto Face Dataset (TFD).

B. Convolutional GANs
Going from fully-connected to convolutional neural networks is a natural extension, given that CNNs are extremely well suited to image data. Early experiments conducted on CIFAR-10 suggested that it was more difficult to train generator and discriminator networks using CNNs with the same level of capacity and representational power as the ones used for supervised learning.

The Laplacian pyramid of adversarial networks (LAPGAN) [13] offered one solution to this problem, by decomposing the generation process using multiple scales: a ground truth image is itself decomposed into a Laplacian pyramid, and a conditional, convolutional GAN is trained to produce each layer given the one above.

Additionally, Radford et al. [5] proposed a family of network architectures called DCGAN (for “deep convolutional GAN”) which allows training a pair of deep convolutional generator and discriminator networks. DCGANs make use of strided and fractionally-strided convolutions which allow the spatial down-sampling and up-sampling operators to
SUBMITTED TO IEEE-SPM, APRIL 2017 4
Fig. 2. During GAN training, the generator is encouraged to produce a distribution of samples, pg (x) to match that of real data, pdata (x). For
an appropriately parametrized and trained GAN, these distributions will be nearly identical. The representations embodied by GANs are captured
in the learned parameters (weights) of the generator and discriminator networks.
be learned during training. These operators handle the change in sampling rates and locations, a key requirement in mapping from image space to possibly lower-dimensional latent space, and from image space to a discriminator. Further details of the DCGAN architecture and training are presented in Section IV-B.

As an extension to synthesizing images in 2D, Wu et al. [14] presented GANs that were able to synthesize 3D data samples using volumetric convolutions. Wu et al. [14] synthesized novel objects including chairs, tables and cars; in addition, they also presented a method to map from 2D images to 3D versions of objects portrayed in those images.

C. Conditional GANs
Mirza et al. [15] extended the (2D) GAN framework to the conditional setting by making both the generator and the discriminator networks class-conditional (Fig. 3). Conditional GANs have the advantage of being able to provide better representations for multi-modal data generation. A parallel can be drawn between conditional GANs and InfoGAN [16], which decomposes the noise source into an incompressible source and a “latent code”, attempting to discover latent factors of variation by maximizing the mutual information between the latent code and the generator’s output. This latent code can be used to discover object classes in a purely unsupervised fashion, although it is not strictly necessary that the latent code be categorical. The representations learned by InfoGAN appear to be semantically meaningful, dealing with complex inter-tangled factors in image appearance, including variations in pose, lighting and emotional content of facial images [16].

D. GANs with Inference Models
In their original formulation, GANs lacked a way to map a given observation, x, to a vector in latent space – in the GAN literature, this is often referred to as an inference mechanism. Several techniques have been proposed to invert the generator of pre-trained GANs [17], [18]. The independently proposed Adversarially Learned Inference (ALI) [19] and Bidirectional GANs [20] provide simple but effective extensions, introducing an inference network in which the discriminators examine joint (data, latent) pairs.

In this formulation, the generator consists of two networks: the “encoder” (inference network) and the “decoder”. They are jointly trained to fool the discriminator. The discriminator itself receives pairs of (x, z) vectors (see Fig. 4), and has to determine which pair constitutes a genuine tuple consisting of real image sample and its encoding, or a fake image sample and the corresponding latent-space input to the generator.

Ideally, in an encoding-decoding model the output, referred to as a reconstruction, should be similar to the input. Typically, the fidelity of reconstructed data samples synthesised using an ALI/BiGAN is poor. The fidelity of samples may be improved with an additional adversarial cost on the distribution of data samples and their reconstructions [21].

E. Adversarial Autoencoders (AAE)
Autoencoders are networks, composed of an “encoder” and “decoder”, that learn to map data to an internal latent representation and out again. That is, they learn a deterministic mapping (via the encoder) from a data space – e.g., images – into a latent or representation space, and a mapping (via the decoder) from the latent space back to data space. The composition of these two mappings results in a “reconstruction”, and the two mappings are
Fig. 3. Left, the Conditional GAN, proposed by Mirza et al. [15] performs class-conditional image synthesis; the discriminator performs class-
conditional discrimination of real from fake images. The InfoGAN (right) [16], on the other hand, has a discriminator network that also estimates
the class label.
Fig. 4. The ALI/BiGAN structure [20], [19] consists of three networks. One of these serves as a discriminator, another maps the noise vectors
from latent space to image space (decoder, depicted as a generator G in the figure), with the final network (encoder, depicted as E ) mapping
from image space to latent space.
trained such that a reconstructed image is as close as possible to the original.

Autoencoders are reminiscent of the perfect-reconstruction filter banks that are widely used in image and signal processing. However, autoencoders generally learn non-linear mappings in both directions. Further, when implemented with deep networks, the possible architectures that can be used to implement autoencoders are remarkably flexible. Training can be unsupervised, with backpropagation being applied between the reconstructed image and the original in order to learn the parameters of both the encoder and the decoder.

As suggested earlier, one often wants the latent space to have a useful organization. Additionally, one may want to perform feedforward, ancestral sampling [11] from an autoencoder. Adversarial training provides a route to achieve these two goals. Specifically, adversarial training may be applied between the latent space and a desired prior distribution on the latent space (latent-space GAN). This results in a combined loss function [22] that reflects both the reconstruction error and a measure of how different the distribution of the prior is from that produced by a candidate encoding network. This approach is akin to a variational autoencoder (VAE) [23] for which the latent-space GAN plays the role of the KL-divergence term of the loss function.

Mescheder et al. [24] unified variational autoencoders with adversarial training in the form of the Adversarial Variational Bayes (AVB) framework. Similar ideas were
presented in Ian Goodfellow’s NIPS 2016 tutorial [12]. AVB tries to optimise the same criterion as that of variational autoencoders, but uses an adversarial training objective rather than the Kullback-Leibler divergence.

IV. TRAINING GANS
A. Introduction
Training of GANs involves both finding the parameters of a discriminator that maximize its classification accuracy, and finding the parameters of a generator which maximally confuse the discriminator. This training process is summarized in Fig. 5.

The cost of training is evaluated using a value function, $V(G, D)$, that depends on both the generator and the discriminator. The training involves solving:

$$\max_D \min_G V(G, D)$$

where

$$V(G, D) = \mathbb{E}_{p_{data}(x)} \log D(x) + \mathbb{E}_{p_g(x)} \log(1 - D(x))$$

During training, the parameters of one model are updated, while the parameters of the other are fixed. Goodfellow et al. [1] show that for a fixed generator there is a unique optimal discriminator, $D^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_g(x)}$. They also show that the generator, $G$, is optimal when $p_g(x) = p_{data}(x)$, which is equivalent to the optimal discriminator predicting 0.5 for all samples drawn from x. In other words, the generator is optimal when the discriminator, $D$, is maximally confused and cannot distinguish real samples from fake ones.

Ideally, the discriminator is trained until optimal with respect to the current generator; then, the generator is again updated. However, in practice, the discriminator might not be trained until optimal, but rather may only be trained for a small number of iterations, and the generator is updated simultaneously with the discriminator. Further, an alternate, non-saturating training criterion is typically used for the generator, using $\max_G \log D(G(z))$ rather than $\min_G \log(1 - D(G(z)))$.

Despite the theoretical existence of unique solutions, GAN training is challenging and often unstable for several reasons [5], [25], [26]. One approach to improving GAN training is to assess the empirical “symptoms” that might be experienced during training. These symptoms include:
• Difficulties in getting the pair of models to converge [5];
• The generative model “collapsing” to generate very similar samples for different inputs [25];
• The discriminator loss converging quickly to zero [26], providing no reliable path for gradient updates to the generator.

Several authors suggested heuristic approaches to address these issues [1], [25]; these are discussed in Section IV-B.

Early attempts to explain why GAN training is unstable were proposed by Goodfellow and Salimans et al. [1], [25] who observed that gradient descent methods typically used for updating both the parameters of the generator and discriminator are inappropriate when the solution to the optimization problem posed by GAN training actually constitutes a saddle point. Salimans et al. provided a simple example which shows this [25]. However, stochastic gradient descent is often used to update neural networks, and there are well developed machine learning programming environments that make it easy to construct and update networks using stochastic gradient descent.

Although an early theoretical treatment [1] showed that the generator is optimal when $p_g(x) = p_{data}(x)$, a very neat result with a strong underlying intuition, the real data samples reside on a manifold which sits in a high-dimensional space of possible representations. For instance, if colour image samples are of size $N \times N \times 3$ with pixel values $[0, R^+]^3$, the space that may be represented – which we can call $X$ – is of dimensionality $3N^2$, with each dimension taking values between 0 and the maximum measurable pixel intensity. The data samples in the support of $p_{data}$, however, constitute the manifold of the real data associated with some particular problem, typically occupying a very small part of the total space, $X$. Similarly, the samples produced by the generator should also occupy only a small portion of $X$.

Arjovsky et al. [26] showed that the supports of $p_g(x)$ and $p_{data}(x)$ lie in a lower dimensional space than that corresponding to $X$. The consequence of this is that $p_g(x)$ and $p_{data}(x)$ may have no overlap, and so there exists a nearly trivial discriminator that is capable of distinguishing real samples, $x \sim p_{data}(x)$, from fake samples, $x \sim p_g(x)$, with 100% accuracy. In this case, the discriminator error quickly converges to zero. Parameters of the generator may only be updated via the discriminator, so when this happens, the gradients used for updating parameters of the generator also converge to zero and so may no longer be useful for updates to the generator. Arjovsky et al.’s [26] explanations account for several of the symptoms related to GAN training.

Goodfellow et al. [1] also showed that when $D$ is optimal, training $G$ is equivalent to minimizing the Jensen-Shannon divergence between $p_g(x)$ and $p_{data}(x)$. If $D$ is not optimal, the update may be less meaningful, or inaccurate. This theoretical insight has motivated research into cost functions based on alternative distances. Several of these are explored in Section IV-C.
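The link between the optimal discriminator and the Jensen-Shannon divergence can be checked numerically on a small discrete example. The sketch below assumes two hand-picked probability vectors over three outcomes (they are illustrative values, not data from the paper) and verifies the standard identity that the value function evaluated at the optimal discriminator equals 2·JSD(p_data, p_g) − log 4. All function names are hypothetical helpers written for this illustration.

```python
import math


def optimal_discriminator(p_data, p_g):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for each outcome x."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]


def value_function(p_data, p_g, d):
    """V(G, D) = E_{p_data}[log D(x)] + E_{p_g}[log(1 - D(x))]."""
    return sum(pd * math.log(dx) + pg * math.log(1.0 - dx)
               for pd, pg, dx in zip(p_data, p_g, d))


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Assumed toy distributions with strictly positive probabilities.
p_data = [0.1, 0.4, 0.5]
p_g = [0.3, 0.3, 0.4]

d_star = optimal_discriminator(p_data, p_g)
v_star = value_function(p_data, p_g, d_star)
identity_gap = abs(v_star - (2.0 * jsd(p_data, p_g) - math.log(4.0)))

# When the generator matches the data, D* is 0.5 everywhere and V = -log 4.
d_matched = optimal_discriminator(p_data, p_data)
v_matched = value_function(p_data, p_data, d_matched)
```

Because the Jensen-Shannon divergence is non-negative and zero only when the two distributions coincide, `v_matched` is the smallest value that `v_star` can take, which is the sense in which the generator is optimal when $p_g = p_{data}$.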
Fig. 5. The main loop of GAN training. Novel data samples, $x'$, may be drawn by passing random samples, $z$, through the generator network.
The gradient of the discriminator may be updated k times before updating the generator.
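The loop in Fig. 5 can be sketched with a deliberately tiny one-dimensional model, written in plain Python with hand-derived gradients. Everything specific here is an assumption made for illustration: the data distribution (a Gaussian with mean 3.0), the affine generator G(z) = a·z + b, the logistic discriminator D(x) = sigmoid(w·x + c), the learning rate and the batch size. The sketch shows only the update schedule (k discriminator steps per generator step, with the non-saturating generator criterion) and makes no claim about convergence.

```python
import math
import random


def sigmoid(s):
    # Clamp the argument so math.exp cannot overflow.
    s = max(-60.0, min(60.0, s))
    return 1.0 / (1.0 + math.exp(-s))


def train_gan(steps=200, k=1, batch=16, lr=0.05, seed=0):
    """Alternate k discriminator updates with one generator update."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0   # generator parameters: G(z) = a * z + b
    w, c = 0.1, 0.0   # discriminator parameters: D(x) = sigmoid(w * x + c)
    d_updates = g_updates = 0

    for _ in range(steps):
        # Discriminator: gradient ascent on log D(x) + log(1 - D(G(z))).
        for _ in range(k):
            grad_w = grad_c = 0.0
            for _ in range(batch):
                x_real = rng.gauss(3.0, 0.5)          # assumed real data
                x_fake = a * rng.gauss(0.0, 1.0) + b  # generated sample
                d_real = sigmoid(w * x_real + c)
                d_fake = sigmoid(w * x_fake + c)
                grad_w += (1.0 - d_real) * x_real - d_fake * x_fake
                grad_c += (1.0 - d_real) - d_fake
            w += lr * grad_w / batch
            c += lr * grad_c / batch
            d_updates += 1

        # Generator: gradient ascent on the non-saturating log D(G(z)).
        grad_a = grad_b = 0.0
        for _ in range(batch):
            z = rng.gauss(0.0, 1.0)
            d_fake = sigmoid(w * (a * z + b) + c)
            grad_a += (1.0 - d_fake) * w * z
            grad_b += (1.0 - d_fake) * w
        a += lr * grad_a / batch
        b += lr * grad_b / batch
        g_updates += 1

    return {"a": a, "b": b, "w": w, "c": c,
            "d_updates": d_updates, "g_updates": g_updates}


result = train_gan(steps=50, k=3)
```

With `steps=50` and `k=3` the discriminator is updated 150 times and the generator 50 times; setting `k=1` gives the simultaneous-style schedule mentioned in the text.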
second part looks at alternative cost functions which aim to directly address the problem of vanishing gradients.

1) Generalisations of the GAN cost function: Nowozin et al. [30] showed that GAN training may be generalized to minimize not only the Jensen-Shannon divergence, but an estimate of $f$-divergences; these are referred to as $f$-GANs. The $f$-divergences include well-known divergence measures such as the Kullback-Leibler divergence. Nowozin et al. showed that $f$-divergence may be approximated by applying the Fenchel conjugates of the desired $f$-divergence to samples drawn from the distribution of generated samples, after passing those samples through a discriminator [30]. They provide a list of Fenchel conjugates for commonly used $f$-divergences, as well as activation functions that may be used in the final layer of the generator network, depending on the choice of $f$-divergence. Having derived the generalized cost functions for training the generator and discriminator of an $f$-GAN, Nowozin et al. [30] observe that, in its raw form, maximizing the generator objective is likely to lead to weak gradients, especially at the start of training, and proposed an alternative cost function for updating the generator which is less likely to saturate at the beginning of training. This means that when the discriminator is trained, the derivative of the $f$-divergence on the ratio of the real and fake data distributions is estimated, while when the generator is trained only an estimate of the $f$-divergence is minimized. Uehara et al. [31] extend the $f$-GAN further, where in the discriminator step the ratio of the distributions of real and fake data is predicted, and in the generator step the $f$-divergence is directly minimized. Alternatives to the JS-divergence are also covered by Goodfellow [12].

2) Alternative Cost functions to prevent vanishing gradients: Arjovsky et al. [32] proposed the WGAN, a GAN with an alternative cost function which is derived from an approximation of the Wasserstein distance. Unlike the original GAN cost function, the WGAN is more likely to provide gradients that are useful for updating the generator. The cost function derived for the WGAN relies on the discriminator, which they refer to as the “critic”, being a $k$-Lipschitz continuous function; practically, this may be implemented by simply clipping the parameters of the discriminator. However, more recent research [33] suggested that weight clipping adversely reduces the capacity of the discriminator model, forcing it to learn simpler functions. Gulrajani et al. [33] proposed an improved method for training the discriminator for a WGAN, by penalizing the norm of discriminator gradients with respect to data samples during training, rather than performing parameter clipping.

D. A Brief Comparison of GAN Variants
GANs allow us to synthesize novel data samples from random noise, but they are considered difficult to train due partially to vanishing gradients. All GAN models that we have discussed in this paper require careful hyperparameter tuning and model selection for training. However, perhaps the easier models to train are the AAE and the WGAN. The AAE is relatively easy to train because the adversarial loss is applied to a fairly simple distribution in lower dimensions (than the image data). The WGAN [33] is designed to be easier to train, using a different formulation of the training objective which does not suffer from the vanishing gradient problem. The WGAN may also be trained successfully even without batch normalisation; it is also less sensitive to the choice of non-linearities used between convolutional layers.

Samples synthesised using a GAN or WGAN may belong to any class present in the training data. Conditional GANs provide an approach to synthesising samples with user specified content.

It is evident from various visualisation techniques (Fig. 6) that the organisation of the latent space harbours some meaning, but vanilla GANs do not provide an inference model to allow data samples to be mapped to latent representations. Both BiGANs and ALI provide a mechanism to map image data to a latent space (inference); however, reconstruction quality suggests that they do not necessarily faithfully encode and decode samples. A very recent development shows that ALI may recover encoded data samples faithfully [21]. However, this model shares a lot in common with the AVB and AAE. These are autoencoders, similar to variational autoencoders (VAEs), where the latent space is regularised using adversarial training rather than a KL-divergence between encoded samples and a prior.

V. THE STRUCTURE OF LATENT SPACE
GANs build their own representations of the data they are trained on, and in doing so produce structured geometric vector spaces for different domains. This is a quality shared with other neural network models, including VAEs [23], as well as linguistic models such as word2vec [34]. In general, the domain of the data to be modelled is mapped to a vector space which has fewer dimensions than the data space, forcing the model to discover interesting structure in the data and represent it efficiently. This latent space is at the “originating” end of the generator network, and the data at this level of representation (the latent space) can be highly structured, and may support high level semantic operations [5]. Examples include rotation of faces from trajectories through latent space, as
well as image analogies which have the effect of adding visual attributes such as eyeglasses on to a “bare” face.

All (vanilla) GAN models have a generator which maps data from the latent space into the space to be modelled, but many GAN models have an “encoder” which additionally supports the inverse mapping [19], [20]. This becomes a powerful method for exploring and using the structured latent space of the GAN network. With an encoder, collections of labelled images can be mapped into latent spaces and analysed to discover “concept vectors” that represent high level attributes such as “smiling” or “wearing a hat”. These vectors can be applied at scaled offsets in latent space to influence the behaviour of the generator (Fig. 6). Similar to using an encoding process to model the distribution of latent samples, Gurumurthy et al. [35] propose modelling the latent space as a mixture of Gaussians and learning the mixture components that maximize the likelihood of generated data samples under the data generating distribution.

VI. APPLICATIONS OF GANS
Discovering new applications for adversarial training of deep networks is an active area of research. We examine a few computer vision applications that have appeared in the literature and have been subsequently refined. These applications were chosen to highlight some different approaches to using GAN-based representations for image-manipulation, analysis or characterization, and do not fully reflect the potential breadth of application of GANs.

Using GANs for image classification places them within the broader context of machine learning and provides a useful quantitative assessment of the features extracted in unsupervised learning. Image synthesis remains a core GAN capability, and is especially useful when the generated image can be subject to pre-existing constraints. Super-resolution [36], [37], [38] offers an example of how an existing approach can be supplemented with an adversarial loss component to achieve higher quality results. Finally, image-to-image translation demonstrates how GANs offer a general purpose solution to a family of tasks which require automatically converting an input image into an output image.

A. Classification and Regression
After GAN training is complete, the neural network can be reused for other downstream tasks. For example, outputs of the convolutional layers of the discriminator can be used as a feature extractor, with simple linear models fitted on top of these features using a modest quantity of (image, label) pairs [5], [25]. The quality of the unsupervised representations within a DCGAN network have been assessed by applying a regularized L2-SVM classifier to a feature vector extracted from the (trained) discriminator [5]. Good classification scores were achieved using this approach on both supervised and semi-supervised datasets, even those that were disjoint from the original training data.

The quality of the data representation may be improved when adversarial training includes jointly learning an inference mechanism such as with an ALI [19]. A representation vector built from the last three hidden layers of the ALI encoder, used with a similar L2-SVM classifier, achieved a misclassification rate significantly lower than the DCGAN [19]. Additionally, ALI has achieved state-of-the-art classification results when label information is incorporated into the training routine.

When labelled training data is in limited supply, adversarial training may also be used to synthesize more training samples. Shrivastava et al. [39] use GANs to refine synthetic images, while maintaining their annotation information. By training models only on GAN-refined synthetic images (i.e. no real training data) Shrivastava et al. [39] achieved state-of-the-art performance on pose and gaze estimation tasks. Similarly, good results were obtained for gaze estimation and prediction using a spatio-temporal GAN architecture [40]. In some cases, models trained on synthetic data do not generalize well when applied to real data [3]. Bousmalis et al. [3] propose to address this problem by adapting synthetic samples from a source domain to match a target domain using adversarial training. Additionally, Liu et al. [41] propose using multiple GANs – one per domain – with tied weights to synthesize pairs of corresponding image samples from different domains.

Because the quality of generated samples is hard to quantitatively judge across models, classification tasks are likely to remain an important quantitative tool for performance assessment of GANs, even as new and diverse applications in computer vision are explored.

B. Image Synthesis
Much of the recent GAN research focuses on improving the quality and utility of the image generation capabilities. The LAPGAN model introduced a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion [13]. A similar approach is used by Huang et al. [42] with GANs operating on intermediate representations rather than lower resolution images.

LAPGAN also extended the conditional version of the GAN model where both G and D networks receive additional label information as input; this technique has proved
Fig. 6. Example of applying a “smile vector” with an ALI model [19]. On the left hand side is an example of a woman without a smile and on
the right a woman with a smile. A z value for the image of the woman on the left is inferred, z1 and for the right, z2 . Interpolating along a
vector that connects z1 and z2 , gives z values that may be passed through a generator to synthesize novel samples. Note the implication: a
displacement vector in latent space traverses smile “intensity” in image space.
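The interpolation described in the Fig. 6 caption amounts to simple vector arithmetic in latent space. The sketch below assumes that z1 and z2 have already been inferred by some encoder, and uses short hand-written lists in their place; the function names `interpolate` and `apply_concept_vector` are hypothetical helpers written for this illustration and are not part of ALI or any library. Passing each resulting z through a trained generator would yield the synthesized images.

```python
def interpolate(z1, z2, num_steps):
    """Return num_steps latent vectors evenly spaced from z1 to z2."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    path = []
    for i in range(num_steps):
        t = i / (num_steps - 1)
        path.append([(1.0 - t) * u + t * v for u, v in zip(z1, z2)])
    return path


def apply_concept_vector(z, z1, z2, scale):
    """Offset z along the displacement (z2 - z1), scaled by `scale`."""
    return [zi + scale * (v - u) for zi, u, v in zip(z, z1, z2)]


# Assumed stand-ins for the inferred latent codes of the two images.
z1 = [0.0, 1.0]
z2 = [1.0, 3.0]

path = interpolate(z1, z2, num_steps=5)
shifted = apply_concept_vector([0.0, 0.0], z1, z2, scale=0.5)
```

A `scale` between 0 and 1 moves part of the way along the displacement, which corresponds to the varying "intensity" noted in the caption; values outside that range extrapolate beyond the two inferred codes.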
useful and is now a common practice to improve image quality. This idea of GAN conditioning was later extended to incorporate natural language. For example, Reed et al. [43] used a GAN architecture to synthesize images from text descriptions, which one might describe as reverse captioning. For example, given a text caption of a bird such as “white with some black on its head and wings and a long orange beak”, the trained GAN can generate several plausible images that match the description.

In addition to conditioning on text descriptions, the Generative Adversarial What-Where Network (GAWWN) conditions on image location [44]. The GAWWN system supported an interactive interface in which large images could be built up incrementally with textual descriptions of parts and user-supplied bounding boxes (Fig. 7).

Conditional GANs not only allow us to synthesize novel samples with specific attributes, they also allow us to develop tools for intuitively editing images – for example editing the hair style of a person in an image, making them wear glasses or making them look younger [35]. Additional applications of GANs to image editing include work by Zhu and Brock et al. [2], [45].

C. Image-to-image translation
Conditional adversarial networks are well suited for translating an input image into an output image, which is a recurring theme in computer graphics, image processing, and computer vision. The pix2pix model offers a general purpose solution to this family of problems [46]. In addition to learning the mapping from input image to output image, the pix2pix model also constructs a loss function to train this mapping. This model has demonstrated effective results for different problems of computer vision which had previously required separate machinery, including semantic segmentation, generating maps from aerial photos, and colorization of black and white images. Wang et al. present a similar idea, using GANs to first synthesize surface-normal maps (similar to depth maps) and then map these images to natural

CycleGAN [4] extends this work by introducing a cycle consistency loss that attempts to preserve the original image after a cycle of translation and reverse translation. In this formulation, matching pairs of images are no longer needed for training. This makes data preparation much simpler, and opens the technique to a larger family of applications. For example, artistic style transfer [47] renders natural images in the style of artists, such as Picasso or Monet, by simply being trained on an unpaired collection of paintings and natural images (Fig. 8).

D. Super-resolution
Super-resolution allows a high-resolution image to be generated from a lower resolution image, with the trained model inferring photo-realistic details while up-sampling. The SRGAN model [36] extends earlier efforts by adding an adversarial loss component which constrains images to reside on the manifold of natural images.

The SRGAN generator is conditioned on a low resolution image, and infers photo-realistic natural images with 4x up-scaling factors. Unlike most GAN applications, the adversarial loss is one component of a larger loss function, which also includes perceptual loss from a pretrained classifier, and a regularization loss that encourages spatially coherent images. In this context, the adversarial loss constrains the overall solution to the manifold of natural images, producing perceptually more convincing solutions.

Customizing deep learning applications can often be hampered by the availability of relevant curated training datasets. However, SRGAN is straightforward to customize to specific domains, as new training image pairs can easily be constructed by down-sampling a corpus of high-resolution images. This is an important consideration in practice, since the inferred photo-realistic details that the GAN generates will vary depending on the domain of images used in the training set.

VII. DISCUSSION
A. Open Questions
GANs have attracted considerable attention due to their
scenes. ability to leverage vast amounts of unlabelled data. While
SUBMITTED TO IEEE-SPM, APRIL 2017 11
Fig. 7. Examples of image synthesis using the Generative Adversarial What-Where Network (GAWWN). In GAWWN, images are conditioned
on both text descriptions and image location, specified by either keypoint or bounding box. Figure reproduced from [44] with authors’ permission.
Fig. 8. The CycleGAN model learns image-to-image translations between two unordered image collections. Shown here are examples of bidirectional image mappings: Monet paintings to landscape photos, zebras to horses, and summer to winter photos in Yosemite park. Figure reproduced from [4].
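As a rough illustration of the cycle consistency idea behind the translations in Fig. 8, the sketch below penalizes the difference between an image and its reconstruction after a translation and a reverse translation. It is a minimal sketch under stated assumptions, not the implementation of [4]: the L1 (mean absolute) penalty is the form we assume here, images are represented as flat lists of pixel values to avoid any dependencies, and the names `G`, `F`, `identity`, `l1_mean` and `cycle_consistency_loss` are hypothetical stand-ins for trained translators and their loss.

```python
import random

def l1_mean(a, b):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def cycle_consistency_loss(x, y, G, F):
    """Cycle loss for mappings G: X -> Y and F: Y -> X (L1 form assumed)."""
    forward = l1_mean(F(G(x)), x)   # x -> G(x) -> F(G(x)) should return to x
    backward = l1_mean(G(F(y)), y)  # y -> F(y) -> G(F(y)) should return to y
    return forward + backward

# Toy translators (hypothetical; real ones would be trained networks).
def G(img):
    return [2.0 * p for p in img]

def F(img):
    return [0.5 * p for p in img]

def identity(img):
    return list(img)

random.seed(0)
x = [random.random() for _ in range(64)]  # a sample from collection X
y = [random.random() for _ in range(64)]  # an unpaired sample from collection Y

loss_inverse = cycle_consistency_loss(x, y, G, F)          # F undoes G exactly
loss_mismatch = cycle_consistency_loss(x, y, G, identity)  # reverse map does not undo G
```

When `F` exactly inverts `G`, both reconstruction terms vanish; when the reverse mapping fails to undo the forward one, the loss is positive. Note that `x` and `y` are never compared with each other, which is why unpaired collections suffice.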
much progress has been made to alleviate some of the challenges related to training and evaluating GANs, there still remain several open challenges.
1) Mode Collapse: As articulated in Section IV, a common problem of GANs involves the generator collapsing to produce a small family of similar samples (partial collapse), and in the worst case producing simply a single sample (complete collapse) [26], [48].
Diversity in the generator can be increased by practical hacks to balance the distribution of samples produced by the discriminator for real and fake batches, or by employing multiple GANs to cover the different modes of the probability distribution [49]. Yet another solution to alleviate mode collapse is to alter the distance measure used to compare statistical distributions. Arjovsky et al. [32] proposed to compare distributions based on a Wasserstein distance rather than a KL-based divergence (DCGAN [5]) or a total-variation distance (energy-based GAN [50]). Metz et al. [51] proposed unrolling the discriminator for several steps, i.e., letting it calculate its updates on the current generator for several steps, and then using the “unrolled” discriminators to update the generator using the normal minimax objective. As normal, the discriminator only trains on its update from one step, but the generator now has access to how the discriminator would update itself. With the usual one-step generator objective, the discriminator will simply assign a low probability to the generator’s previous outputs, forcing the generator to move, resulting either in convergence, or an endless cycle of mode hopping. However, with the unrolled objective, the generator can prevent the discriminator from focusing on the previous update, and update its own generations with the foresight of how the discriminator would have responded.
2) Training instability – saddle points: In a GAN, the Hessian of the loss function becomes indefinite. The optimal solution, therefore, lies in finding a saddle point rather than a local minimum. In deep learning, a large number of optimizers depend only on the first derivative of the loss function; converging to a saddle point for GANs requires good initialization. By invoking the stable manifold theorem from non-linear systems theory, Lee et al. [52] showed that, were we to select the initial points of an optimizer at random, gradient descent would not converge to a saddle with probability one (also see [53], [25]). Additionally, Mescheder et al. [54] have argued that convergence of a GAN’s objective function suffers from the presence of a zero real part of the Jacobian matrix
as well as eigenvalues with large imaginary parts. This is disheartening for GAN training; yet, due to the existence of second-order optimizers, not all hope is lost. Unfortunately, Newton-type methods have compute-time complexity that scales cubically or quadratically with the dimension of the parameters. Therefore, another line of questions lies in applying and scaling second-order optimizers for adversarial training.
A more fundamental problem is the existence of an equilibrium for a GAN. Using results from Bayesian non-parametrics, Arora et al. [48] connect the existence of the equilibrium to a finite mixture of neural networks; this means that below a certain capacity, no equilibrium might exist. On a closely related note, it has also been argued that whilst GAN training can appear to have converged, the trained distribution could still be far away from the target distribution. To alleviate this issue, Arora et al. [48] propose a new measure called the ‘neural net distance’.
3) Evaluating Generative Models: How can one gauge the fidelity of samples synthesized by a generative model? Should we use a likelihood estimation? Can a GAN trained using one methodology be compared to another (model comparison)? These are open-ended questions that are not only relevant for GANs, but also for probabilistic models in general. Theis et al. [55] argued that evaluating GANs using different measures can lead to conflicting conclusions about the quality of synthesised samples; the decision to select one measure over another depends on the application.

B. Conclusions
The explosion of interest in GANs is driven not only by their potential to learn deep, highly non-linear mappings from a latent space into a data space and back, but also by their potential to make use of the vast quantities of unlabelled image data that remain closed to deep representation learning. Within the subtleties of GAN training, there are many opportunities for developments in theory and algorithms, and with the power of deep networks, there are vast opportunities for new applications.

ACKNOWLEDGMENT
The authors would like to thank David Warde-Farley for his valuable feedback on previous revisions of the paper. Antonia Creswell acknowledges the support of the EPSRC through a Doctoral training scholarship.

REFERENCES
[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[2] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, “Generative visual manipulation on the natural image manifold,” in European Conference on Computer Vision. Springer, 2016, pp. 597–613.
[3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, “Unsupervised pixel-level domain adaptation with generative adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[4] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in Proceedings of the International Conference on Computer Vision, 2017. [Online]. Available: https://arxiv.org/abs/1703.10593
[5] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in Proceedings of the 5th International Conference on Learning Representations (ICLR) - workshop track, 2016.
[6] A. Creswell and A. A. Bharath, “Adversarial training for sketch retrieval,” in Computer Vision – ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part I. Springer International Publishing, 2016.
[7] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[8] H. Hotelling, “Analysis of a complex of statistical variables into principal components,” Journal of Educational Psychology, vol. 24, no. 6, p. 417, 1933.
[9] I. J. Goodfellow, “On distinguishability criteria for estimating generative models,” International Conference on Learning Representations - workshop track, 2015.
[10] M. Gutmann and A. Hyvärinen, “Noise-contrastive estimation: A new estimation principle for unnormalized statistical models,” in AISTATS, vol. 1, no. 2, 2010, p. 6.
[11] Y. Bengio, L. Yao, G. Alain, and P. Vincent, “Generalized denoising auto-encoders as generative models,” in Advances in Neural Information Processing Systems, 2013, pp. 899–907.
[12] I. Goodfellow, “NIPS 2016 tutorial: Generative adversarial networks,” 2016, presented at the Neural Information Processing Systems Conference. [Online]. Available: https://arxiv.org/abs/1701.00160
[13] E. L. Denton, S. Chintala, R. Fergus et al., “Deep generative image models using a Laplacian pyramid of adversarial networks,” in Advances in Neural Information Processing Systems, 2015, pp. 1486–1494.
[14] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum, “Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling,” in Advances in Neural Information Processing Systems, 2016, pp. 82–90.
[15] M. Mirza and S. Osindero, “Conditional generative adversarial nets,” arXiv preprint arXiv:1411.1784, 2014.
[16] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, “InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets,” in Advances in Neural Information Processing Systems, 2016.
[17] A. Creswell and A. A. Bharath, “Inverting the generator of a generative adversarial network,” in NIPS Workshop on Adversarial Training, 2016.
[18] Z. C. Lipton and S. Tripathi, “Precise recovery of latent vectors from generative adversarial networks,” in ICLR (workshop track), 2017.
[19] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, “Adversarially learned inference,” in (accepted, to appear) Proceedings of the International Conference on Learning Representations, 2017.
[20] J. Donahue, P. Krähenbühl, and T. Darrell, “Adversarial feature learning,” in (accepted, to appear) Proceedings of the International Conference on Learning Representations, 2017.
[21] C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin, “Towards understanding adversarial learning for joint distribution matching,” in Advances in Neural Information Processing Systems, 2017.
[22] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow, “Adversarial autoencoders,” in International Conference on Learning Representations (to appear), 2016. [Online]. Available: http://arxiv.org/abs/1511.05644
[23] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.
[24] L. M. Mescheder, S. Nowozin, and A. Geiger, “Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks,” 2017. [Online]. Available: http://arxiv.org/abs/1701.04722
[25] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training GANs,” in Advances in Neural Information Processing Systems, 2016, pp. 2226–2234.
[26] M. Arjovsky and L. Bottou, “Towards principled methods for training generative adversarial networks,” NIPS 2016 Workshop on Adversarial Training, 2016.
[27] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017.
[28] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proceedings of The 32nd International Conference on Machine Learning, 2015, pp. 448–456.
[29] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár, “Amortised MAP inference for image super-resolution,” in International Conference on Learning Representations, 2017.
[30] S. Nowozin, B. Cseke, and R. Tomioka, “f-GAN: Training generative neural samplers using variational divergence minimization,” in Advances in Neural Information Processing Systems, 2016, pp. 271–279.
[31] M. Uehara, I. Sato, M. Suzuki, K. Nakayama, and Y. Matsuo, “Generative adversarial nets from a density ratio estimation perspective,” arXiv preprint arXiv:1610.02920, 2016.
[32] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein GAN,” in Proceedings of The 34th International Conference on Machine Learning, 2017.
[33] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, “Improved training of Wasserstein GANs,” in (accepted, to appear) Advances in Neural Information Processing Systems, 2017.
[34] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in International Conference on Learning Representations, 2013.
[35] S. Gurumurthy, R. K. Sarvadevabhatla, and V. B. Radhakrishnan, “DeLiGAN: Generative adversarial networks for diverse and limited data,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[36] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[37] X. Yu and F. Porikli, “Ultra-resolving face images by discriminative generative networks,” in European Conference on Computer Vision. Springer, 2016, pp. 318–333.
[38] X. Yu and F. Porikli, “Hallucinating very low-resolution unaligned and noisy face images by transformative discriminative autoencoders,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3760–3768.
[39] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learning from simulated and unsupervised images through adversarial training,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[40] M. Zhang, K. T. Ma, J. H. Lim, Q. Zhao, and J. Feng, “Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4372–4381.
[41] M.-Y. Liu and O. Tuzel, “Coupled generative adversarial networks,” in Advances in Neural Information Processing Systems, 2016, pp. 469–477.
[42] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie, “Stacked generative adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[43] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” in International Conference on Machine Learning, 2016. [Online]. Available: https://arxiv.org/abs/1605.05396
[44] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, “Learning what and where to draw,” in Advances in Neural Information Processing Systems, 2016, pp. 217–225.
[45] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, “Neural photo editing with introspective adversarial networks,” in Proceedings of the 6th International Conference on Learning Representations (ICLR), 2017.
[46] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with conditional adversarial networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[47] C. Li and M. Wand, “Precomputed real-time texture synthesis with Markovian generative adversarial networks,” in European Conference on Computer Vision. Springer, 2016, pp. 702–716.
[48] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang, “Generalization and equilibrium in generative adversarial nets (GANs),” in Proceedings of The 34th International Conference on Machine Learning, 2017.
[49] I. Tolstikhin, S. Gelly, O. Bousquet, C.-J. Simon-Gabriel, and B. Schölkopf, “AdaGAN: Boosting generative models,” Tech. Rep., 2017.
[50] J. Zhao, M. Mathieu, and Y. LeCun, “Energy-based generative adversarial network,” in International Conference on Learning Representations, 2017. [Online]. Available: https://arxiv.org/abs/1609.03126
[51] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, “Unrolled generative adversarial networks,” in Proceedings of the International Conference on Learning Representations, 2017. [Online]. Available: https://arxiv.org/abs/1611.02163
[52] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, “Gradient descent only converges to minimizers,” in Conference on Learning Theory, 2016, pp. 1246–1257.
[53] R. Pemantle, “Nonconvergence to unstable points in urn models and stochastic approximations,” Ann. Probab., vol. 18, no. 2, pp. 698–712, 04 1990.
[54] L. M. Mescheder, S. Nowozin, and A. Geiger, “The numerics of GANs,” in Advances in Neural Information Processing Systems, 2017. [Online]. Available: http://arxiv.org/abs/1705.10461
[55] L. Theis, A. van den Oord, and M. Bethge, “A note on the evaluation of generative models,” in Proceedings of the International Conference on Learning Representations.

Antonia Creswell ([email protected]) holds a first-class degree from Imperial College in Biomedical Engineering (2011), and is currently a PhD student in the Biologically Inspired Computer Vision (BICV) Group at Imperial College London (2015). The focus of her PhD is on improving the training of generative adversarial networks and applying