

Generative Adversarial Networks: An Overview


Antonia Creswell§, Tom White¶,
Vincent Dumoulin‡, Kai Arulkumaran§, Biswa Sengupta†§ and Anil A Bharath§, Member IEEE
§BICV Group, Dept. of Bioengineering, Imperial College London
¶School of Design, Victoria University of Wellington, New Zealand
‡MILA, University of Montreal, Montreal H3T 1N8
†Cortexica Vision Systems Ltd., London, United Kingdom

Abstract—Generative adversarial networks (GANs) provide a way to learn deep representations without extensively annotated training data. They achieve this through deriving backpropagation signals through a competitive process involving a pair of networks. The representations that can be learned by GANs may be used in a variety of applications, including image synthesis, semantic image editing, style transfer, image super-resolution and classification. The aim of this review paper is to provide an overview of GANs for the signal processing community, drawing on familiar analogies and concepts where possible. In addition to identifying different methods for training and constructing GANs, we also point to remaining challenges in their theory and application.

Index Terms—neural networks, unsupervised learning, semi-supervised learning.

I. INTRODUCTION

GENERATIVE adversarial networks (GANs) are an emerging technique for both semi-supervised and unsupervised learning. They achieve this through implicitly modelling high-dimensional distributions of data. Proposed in 2014 [1], they can be characterized by training a pair of networks in competition with each other. A common analogy, apt for visual data, is to think of one network as an art forger, and the other as an art expert. The forger, known in the GAN literature as the generator, G, creates forgeries, with the aim of making realistic images. The expert, known as the discriminator, D, receives both forgeries and real (authentic) images, and aims to tell them apart (see Fig. 1). Both are trained simultaneously, and in competition with each other.

Crucially, the generator has no direct access to real images – the only way it learns is through its interaction with the discriminator. The discriminator has access to both the synthetic samples and samples drawn from the stack of real images. The error signal to the discriminator is provided through the simple ground truth of knowing whether the image came from the real stack or from the generator. The same error signal, via the discriminator, can be used to train the generator, leading it towards being able to produce forgeries of better quality.

The networks that represent the generator and discriminator are typically implemented by multi-layer networks consisting of convolutional and/or fully-connected layers. The generator and discriminator networks must be differentiable, though it is not necessary for them to be directly invertible. If one considers the generator network as mapping from some representation space, called a latent space, to the space of the data (we shall focus on images), then we may express this more formally as G : G(z) → R^|x|, where z ∈ R^|z| is a sample from the latent space, x ∈ R^|x| is an image and |·| denotes the number of dimensions; a minimal code sketch of such a pair of networks follows Fig. 1.

In a basic GAN, the discriminator network, D, may be similarly characterized as a function that maps from image data to a probability that the image is from the real data distribution, rather than the generator distribution: D : D(x) → (0, 1). For a fixed generator, G, the discriminator, D, may be trained to classify images as either being from the training data (real, close to 1) or from a fixed generator (fake, close to 0). When the discriminator is optimal, it may be frozen, and the generator, G, may continue to be trained so as to lower the accuracy of the discriminator. If the generator distribution is able to match the real data distribution perfectly then the discriminator will be maximally confused, predicting 0.5 for all inputs. In practice, the discriminator might not be trained until it is optimal; we explore the training process in more depth in Section IV.

On top of the interesting academic problems related to training and constructing GANs, the motivations behind training GANs may not necessarily be the generator or the discriminator per se: the representations embodied by either of the pair of networks can be used in a variety of subsequent tasks. We explore the applications of these representations in Section VI.

Fig. 1. In this figure, the two models which are learned during the training process for a GAN are the discriminator (D) and the generator (G). These are typically implemented with neural networks, but they could be implemented by any form of differentiable system that maps data from one space to another; see text for details.
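To make these mappings concrete, the sketch below (ours, not from any of the cited papers) implements G : R^|z| → R^|x| and D : R^|x| → (0, 1) as small fully connected networks in PyTorch; the latent dimension, layer widths and 28 × 28 image size are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

LATENT_DIM, IMG_DIM = 100, 28 * 28   # |z| and |x|; sizes are illustrative

# G: maps a latent sample z in R^|z| to an image-shaped vector in R^|x|
generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, IMG_DIM), nn.Tanh(),      # pixel values scaled to [-1, 1]
)

# D: maps an image x in R^|x| to a probability in (0, 1)
discriminator = nn.Sequential(
    nn.Linear(IMG_DIM, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

z = torch.randn(16, LATENT_DIM)      # a batch of latent samples
x_fake = generator(z)                # forgeries
p_real = discriminator(x_fake)       # D's belief that each forgery is real
print(x_fake.shape, p_real.shape)    # torch.Size([16, 784]) torch.Size([16, 1])
```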

II. PRELIMINARIES

A. Terminology

Generative models learn to capture the statistical distribution of training data, allowing us to synthesize samples from the learned distribution. On top of synthesizing novel data samples, which may be used for downstream tasks such as semantic image editing [2], data augmentation [3] and style transfer [4], we are also interested in using the representations that such models learn for tasks such as classification [5] and image retrieval [6].

We occasionally refer to fully connected and convolutional layers of deep networks; these are generalizations of perceptrons or of spatial filter banks with non-linear post-processing. In all cases, the network weights are learned through backpropagation [7].

B. Notation

The GAN literature generally deals with multi-dimensional vectors, and often represents vectors in a probability space by italics (e.g. latent space is z). In the field of signal processing, it is common to represent vectors by bold lowercase symbols, and we adopt this convention in order to emphasize the multi-dimensional nature of variables. Accordingly, we will commonly refer to pdata(x) as representing the probability density function over a random vector x which lies in R^|x|. We will use pg(x) to denote the distribution of the vectors produced by the generator network of the GAN. We use the calligraphic symbols G and D to denote the generator and discriminator networks, respectively. Both networks have sets of parameters (weights), ΘD and ΘG, that are learned through optimization during training.

As with all deep learning systems, training requires that we have some clear objective function. Following the usual notation, we use JG(ΘG; ΘD) and JD(ΘD; ΘG) to refer to the objective functions of the generator and discriminator, respectively. The choice of notation reminds us that the two objective functions are in a sense co-dependent on the evolving parameter sets ΘG and ΘD of the networks as they are iteratively updated. We shall explore this further in Section IV. Finally, note that multidimensional gradients are used in the updates; we use ∇ΘG to denote the gradient operator with respect to the weights of the generator parameters, and ∇ΘD to denote the gradient operator with respect to the weights of the discriminator. The expected gradients are indicated by the notation E∇•.

C. Capturing Data Distributions

A central problem of signal processing and statistics is that of density estimation: obtaining a representation – implicit or explicit, parametric or non-parametric – of data in the real world. This is the key motivation behind GANs. In the GAN literature, the term data generating distribution is often used to refer to the underlying probability density or probability mass function of observation data. GANs learn through implicitly computing some sort of similarity between the distribution of a candidate model and the distribution corresponding to real data.

Why bother with density estimation at all? The answer lies at the heart of – arguably – many problems of visual inference, including image categorization, visual object detection and recognition, object tracking and object registration. In principle, through Bayes' Theorem, all inference problems of computer vision can be addressed through estimating conditional density functions, possibly indirectly in the form of a model which learns the joint distribution of variables of interest and the observed data. The difficulty we face is that likelihood functions for high-dimensional, real-world image data are difficult to construct. Whilst GANs don't explicitly provide a way of evaluating density functions, for a generator-discriminator pair of suitable capacity, the generator implicitly captures the distribution of the data.

D. Related Work

One may view the principles of generative models by making comparisons with standard techniques in signal processing and data analysis. For example, signal processing makes wide use of the idea of representing a signal as the weighted combination of basis functions. Fixed basis functions underlie standard techniques such as Fourier-based and wavelet representations. Data-driven approaches to constructing basis functions can be traced back to the Hotelling [8] transform, rooted in Pearson's observation that principal components minimize a reconstruction error according to a minimum squared error criterion. Despite its wide use, standard Principal Components Analysis (PCA) does not have an overt statistical model for the observed data, though it has been shown that the bases of PCA may be derived as a maximum likelihood parameter estimation problem.

Despite wide adoption, PCA itself is limited – the basis functions emerge as the eigenvectors of the covariance matrix over observations of the input data, and the mapping from the representation space back to signal or image space is linear. So, we have both a shallow and a linear mapping, limiting the complexity of the model, and hence of the data, that can be represented.

Independent Components Analysis (ICA) provides another level up in sophistication, in which the signal components no longer need to be orthogonal; the mixing coefficients used to blend components together to construct examples of data are merely considered to be statistically independent. ICA has various formulations that differ in the objective functions used during estimation of signal components, or in the generative model that expresses how signals or images are generated from those components. A recent innovation explored through ICA is noise contrastive estimation (NCE); this may be seen as approaching the spirit of GANs [9]: the objective function for learning independent components compares a statistic applied to noise with that produced by a candidate generative model [10]. The original NCE approach did not include updates to the generator.

What other comparisons can be made between GANs and the standard tools of signal processing? For PCA, ICA, Fourier and wavelet representations, the latent space of GANs is, by analogy, the coefficient space of what we commonly refer to as transform space. What sets GANs apart from these standard tools of signal processing is the level of complexity of the models that map vectors from latent space to image space. Because the generator networks contain non-linearities, and can be of almost arbitrary depth, this mapping – as with many other deep learning approaches – can be extraordinarily complex.

With regard to deep image-based models, modern approaches to generative image modelling can be grouped into explicit density models and implicit density models. Explicit density models are either tractable (change of variables models, autoregressive models) or intractable (directed models trained with variational inference, undirected models trained using Markov chains). Implicit density models capture the statistical distribution of the data through a generative process which makes use of either ancestral sampling [11] or Markov chain-based sampling. GANs fall into the directed implicit model category. A more detailed overview and relevant papers can be found in Ian Goodfellow's NIPS 2016 tutorial [12].

III. GAN ARCHITECTURES

A. Fully Connected GANs

The first GAN architectures used fully connected neural networks for both the generator and discriminator [1]. This type of architecture was applied to relatively simple image datasets, namely MNIST (handwritten digits), CIFAR-10 (natural images) and the Toronto Face Dataset (TFD).

B. Convolutional GANs

Going from fully-connected to convolutional neural networks is a natural extension, given that CNNs are extremely well suited to image data. Early experiments conducted on CIFAR-10 suggested that it was more difficult to train generator and discriminator networks using CNNs with the same level of capacity and representational power as the ones used for supervised learning.

The Laplacian pyramid of adversarial networks (LAPGAN) [13] offered one solution to this problem, by decomposing the generation process using multiple scales: a ground truth image is itself decomposed into a Laplacian pyramid, and a conditional, convolutional GAN is trained to produce each layer given the one above.

Additionally, Radford et al. [5] proposed a family of network architectures called DCGAN (for "deep convolutional GAN") which allows training a pair of deep convolutional generator and discriminator networks. DCGANs make use of strided and fractionally-strided convolutions which allow the spatial down-sampling and up-sampling operators to be learned during training. These operators handle the change in sampling rates and locations, a key requirement in mapping from image space to possibly lower-dimensional latent space, and from image space to a discriminator. Further details of the DCGAN architecture and training are presented in Section IV-B; a small code sketch of the strided/fractionally-strided design follows Fig. 2.

Fig. 2. During GAN training, the generator is encouraged to produce a distribution of samples, pg(x), to match that of real data, pdata(x). For an appropriately parametrized and trained GAN, these distributions will be nearly identical. The representations embodied by GANs are captured in the learned parameters (weights) of the generator and discriminator networks.
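The sketch below illustrates this design in PyTorch for small 16 × 16 single-channel images: the generator up-samples with fractionally-strided (transposed) convolutions and the discriminator down-samples with strided convolutions, so both resampling operators are learned. The channel counts and depths are our own illustrative choices, far smaller than those used in [5].

```python
import torch
import torch.nn as nn

generator = nn.Sequential(
    nn.ConvTranspose2d(100, 128, kernel_size=4, stride=1, bias=False),  # 1x1 -> 4x4
    nn.BatchNorm2d(128), nn.ReLU(),
    nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1, bias=False),    # 4x4 -> 8x8
    nn.BatchNorm2d(64), nn.ReLU(),
    nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1),                  # 8x8 -> 16x16
    nn.Tanh(),
)

discriminator = nn.Sequential(
    nn.Conv2d(1, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),        # 16x16 -> 8x8
    nn.Conv2d(64, 128, 4, stride=2, padding=1, bias=False),
    nn.BatchNorm2d(128), nn.LeakyReLU(0.2),                             # 8x8 -> 4x4
    nn.Conv2d(128, 1, 4), nn.Sigmoid(),                                 # 4x4 -> 1x1 score
)

z = torch.randn(8, 100, 1, 1)                # latent samples as 1x1 feature maps
print(generator(z).shape)                    # torch.Size([8, 1, 16, 16])
print(discriminator(generator(z)).shape)     # torch.Size([8, 1, 1, 1])
```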

As an extension to synthesizing images in 2D, Wu et al. [14] presented GANs that were able to synthesize 3D data samples using volumetric convolutions. Wu et al. [14] synthesized novel objects including chairs, tables and cars; in addition, they also presented a method to map from 2D images to 3D versions of objects portrayed in those images.

C. Conditional GANs

Mirza et al. [15] extended the (2D) GAN framework to the conditional setting by making both the generator and the discriminator networks class-conditional (Fig. 3). Conditional GANs have the advantage of being able to provide better representations for multi-modal data generation. A parallel can be drawn between conditional GANs and InfoGAN [16], which decomposes the noise source into an incompressible source and a "latent code", attempting to discover latent factors of variation by maximizing the mutual information between the latent code and the generator's output. This latent code can be used to discover object classes in a purely unsupervised fashion, although it is not strictly necessary that the latent code be categorical. The representations learned by InfoGAN appear to be semantically meaningful, dealing with complex inter-tangled factors in image appearance, including variations in pose, lighting and emotional content of facial images [16].

D. GANs with Inference Models

In their original formulation, GANs lacked a way to map a given observation, x, to a vector in latent space – in the GAN literature, this is often referred to as an inference mechanism. Several techniques have been proposed to invert the generator of pre-trained GANs [17], [18]. The independently proposed Adversarially Learned Inference (ALI) [19] and Bidirectional GANs (BiGANs) [20] provide simple but effective extensions, introducing an inference network in which the discriminator examines joint (data, latent) pairs.

In this formulation, the generator consists of two networks: the "encoder" (inference network) and the "decoder". They are jointly trained to fool the discriminator. The discriminator itself receives pairs of (x, z) vectors (see Fig. 4), and has to determine which pair constitutes a genuine tuple consisting of a real image sample and its encoding, or a fake image sample and the corresponding latent-space input to the generator.

Ideally, in an encoding-decoding model the output, referred to as a reconstruction, should be similar to the input. Typically, the fidelity of reconstructed data samples synthesised using an ALI/BiGAN is poor. The fidelity of samples may be improved with an additional adversarial cost on the distribution of data samples and their reconstructions [21].

E. Adversarial Autoencoders (AAE)

Autoencoders are networks, composed of an "encoder" and a "decoder", that learn to map data to an internal latent representation and out again. That is, they learn a deterministic mapping (via the encoder) from a data space – e.g., images – into a latent or representation space, and a mapping (via the decoder) from the latent space back to data space. The composition of these two mappings results in a "reconstruction", and the two mappings are trained such that a reconstructed image is as close as possible to the original.

Fig. 3. Left: the conditional GAN, proposed by Mirza et al. [15], performs class-conditional image synthesis; the discriminator performs class-conditional discrimination of real from fake images. The InfoGAN (right) [16], on the other hand, has a discriminator network that also estimates the class label.

Fig. 4. The ALI/BiGAN structure [19], [20] consists of three networks. One of these serves as a discriminator, another maps the noise vectors from latent space to image space (decoder, depicted as a generator G in the figure), with the final network (encoder, depicted as E) mapping from image space to latent space.

Autoencoders are reminiscent of the perfect-reconstruction filter banks that are widely used in image and signal processing. However, autoencoders generally learn non-linear mappings in both directions. Further, when implemented with deep networks, the possible architectures that can be used to implement autoencoders are remarkably flexible. Training can be unsupervised, with backpropagation being applied between the reconstructed image and the original in order to learn the parameters of both the encoder and the decoder.

As suggested earlier, one often wants the latent space to have a useful organization. Additionally, one may want to perform feedforward, ancestral sampling [11] from an autoencoder. Adversarial training provides a route to achieve these two goals. Specifically, adversarial training may be applied between the latent space and a desired prior distribution on the latent space (latent-space GAN). This results in a combined loss function [22] that reflects both the reconstruction error and a measure of how different the distribution of the prior is from that produced by a candidate encoding network. This approach is akin to a variational autoencoder (VAE) [23] for which the latent-space GAN plays the role of the KL-divergence term of the loss function; a simplified code sketch of this combined loss is given at the end of this subsection.

Mescheder et al. [24] unified variational autoencoders with adversarial training in the form of the Adversarial Variational Bayes (AVB) framework. Similar ideas were presented in Ian Goodfellow's NIPS 2016 tutorial [12]. AVB tries to optimise the same criterion as that of variational autoencoders, but uses an adversarial training objective rather than the Kullback-Leibler divergence.
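A deliberately simplified reading of the AAE objective is sketched below: a pixel reconstruction loss plus an adversarial loss that pushes the encoder's codes towards a prior, with a latent-space discriminator d_latent assumed to be trained separately on prior samples versus encodings. All networks, sizes and the 0.1 weighting are illustrative assumptions, not the configuration of [22].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 32))
decoder = nn.Sequential(nn.Linear(32, 256), nn.ReLU(), nn.Linear(256, 784))
d_latent = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

def aae_loss(x, adv_weight=0.1):
    z = encoder(x)
    # 1) reconstruction term: decoded output should match the input
    recon = F.mse_loss(decoder(z), x)
    # 2) adversarial term on the latent space: the encoder is rewarded when
    #    its codes are classified (by d_latent) as samples from the prior
    adv = F.binary_cross_entropy_with_logits(
        d_latent(z), torch.ones(x.size(0), 1))
    return recon + adv_weight * adv

print(aae_loss(torch.rand(16, 784)).item())
```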

IV. TRAINING GANS

A. Introduction

Training of GANs involves both finding the parameters of a discriminator that maximize its classification accuracy, and finding the parameters of a generator which maximally confuse the discriminator. This training process is summarized in Fig. 5.

The cost of training is evaluated using a value function, V(G, D), that depends on both the generator and the discriminator. The training involves solving:

max_D min_G V(G, D),

where

V(G, D) = E_pdata(x) log D(x) + E_pg(x) log(1 − D(x)).

During training, the parameters of one model are updated, while the parameters of the other are fixed. Goodfellow et al. [1] show that for a fixed generator there is a unique optimal discriminator, D*(x) = pdata(x) / (pdata(x) + pg(x)). They also show that the generator, G, is optimal when pg(x) = pdata(x), which is equivalent to the optimal discriminator predicting 0.5 for all samples drawn from x. In other words, the generator is optimal when the discriminator, D, is maximally confused and cannot distinguish real samples from fake ones.

Ideally, the discriminator is trained until optimal with respect to the current generator; then, the generator is again updated. However, in practice the discriminator might not be trained until optimal, but rather may only be trained for a small number of iterations, and the generator is updated simultaneously with the discriminator. Further, an alternate, non-saturating training criterion is typically used for the generator, using max_G log D(G(z)) rather than min_G log(1 − D(G(z))); a code sketch of these alternating updates is given after Fig. 5.

Despite the theoretical existence of unique solutions, GAN training is challenging and often unstable for several reasons [5], [25], [26]. One approach to improving GAN training is to assess the empirical "symptoms" that might be experienced during training. These symptoms include:
• Difficulties in getting the pair of models to converge [5];
• The generative model "collapsing", generating very similar samples for different inputs [25];
• The discriminator loss converging quickly to zero [26], providing no reliable path for gradient updates to the generator.

Several authors suggested heuristic approaches to address these issues [1], [25]; these are discussed in Section IV-B.

Early attempts to explain why GAN training is unstable were proposed by Goodfellow and Salimans et al. [1], [25], who observed that the gradient descent methods typically used for updating both the parameters of the generator and discriminator are inappropriate when the solution to the optimization problem posed by GAN training actually constitutes a saddle point. Salimans et al. provided a simple example which shows this [25]. However, stochastic gradient descent is often used to update neural networks, and there are well developed machine learning programming environments that make it easy to construct and update networks using stochastic gradient descent.

Although an early theoretical treatment [1] showed that the generator is optimal when pg(x) = pdata(x), a very neat result with a strong underlying intuition, the real data samples reside on a manifold which sits in a high-dimensional space of possible representations. For instance, if colour image samples are of size N × N × 3 with pixel values in [0, R+]^3, the space that may be represented – which we can call X – is of dimensionality 3N², with each dimension taking values between 0 and the maximum measurable pixel intensity. The data samples in the support of pdata, however, constitute the manifold of the real data associated with some particular problem, typically occupying a very small part of the total space, X. Similarly, the samples produced by the generator should also occupy only a small portion of X.

Arjovsky et al. [26] showed that the supports of pg(x) and pdata(x) lie in a lower dimensional space than that corresponding to X. The consequence of this is that pg(x) and pdata(x) may have no overlap, and so there exists a nearly trivial discriminator that is capable of distinguishing real samples, x ∼ pdata(x), from fake samples, x ∼ pg(x), with 100% accuracy. In this case, the discriminator error quickly converges to zero. Parameters of the generator may only be updated via the discriminator, so when this happens, the gradients used for updating parameters of the generator also converge to zero and so may no longer be useful for updates to the generator. Arjovsky et al.'s [26] explanations account for several of the symptoms related to GAN training.

Goodfellow et al. [1] also showed that when D is optimal, training G is equivalent to minimizing the Jensen-Shannon divergence between pg(x) and pdata(x). If D is not optimal, the update may be less meaningful, or inaccurate. This theoretical insight has motivated research into cost functions based on alternative distances. Several of these are explored in Section IV-C.

Fig. 5. The main loop of GAN training. Novel data samples, x′, may be drawn by passing random samples, z, through the generator network. The gradient of the discriminator may be updated k times before updating the generator.
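One way to write this alternating loop, using the non-saturating generator criterion described above, is sketched below in PyTorch. The generator and discriminator are assumed to be networks of the kind sketched in Section I (with a sigmoid output on the discriminator), and k is the number of discriminator updates per generator update shown in Fig. 5; all other choices are illustrative.

```python
import torch
import torch.nn.functional as F

def gan_step(generator, discriminator, g_opt, d_opt, x_real, z_dim=100, k=1):
    """One iteration of the alternating GAN loop (illustrative sketch)."""
    n = x_real.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    for _ in range(k):                      # k discriminator updates per G update
        z = torch.randn(n, z_dim)
        x_fake = generator(z).detach()      # freeze G while D is trained
        d_loss = (F.binary_cross_entropy(discriminator(x_real), ones)
                  + F.binary_cross_entropy(discriminator(x_fake), zeros))
        d_opt.zero_grad()
        d_loss.backward()
        d_opt.step()

    # Non-saturating criterion: maximise log D(G(z)) instead of minimising
    # log(1 - D(G(z))), which provides stronger gradients early in training.
    z = torch.randn(n, z_dim)
    g_loss = F.binary_cross_entropy(discriminator(generator(z)), ones)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```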

B. Training Tricks

One of the first major improvements in the training of GANs for generating images was the DCGAN architecture proposed by Radford et al. [5]. This work was the result of an extensive exploration of CNN architectures previously used in computer vision, and resulted in a set of guidelines for constructing and training both the generator and discriminator. In Section III-B, we alluded to the importance of strided and fractionally-strided convolutions [27], which are key components of the architectural design. These allow both the generator and the discriminator to learn good up-sampling and down-sampling operations, which may contribute to improvements in the quality of image synthesis. More specifically to training, batch normalization [28] was recommended for use in both networks in order to stabilize training in deeper models. Another suggestion was to minimize the number of fully connected layers used, to increase the feasibility of training deeper models. Finally, Radford et al. [5] showed that using leaky ReLU activation functions between the intermediate layers of the discriminator gave superior performance over using regular ReLUs.

Later, Salimans et al. [25] proposed further heuristic approaches for stabilizing the training of GANs. The first, feature matching, changes the objective of the generator slightly in order to increase the amount of information available. Specifically, the discriminator is still trained to distinguish between real and fake samples, but the generator is now trained to match the discriminator's expected intermediate activations (features) of its fake samples with the expected intermediate activations of the real samples. The second, mini-batch discrimination, adds an extra input to the discriminator, which is a feature that encodes the distance between a given sample in a mini-batch and the other samples. This is intended to prevent mode collapse, as the discriminator can easily tell if the generator is producing the same outputs.

A third heuristic trick, heuristic averaging, penalizes the network parameters if they deviate from a running average of previous values, which can help convergence to an equilibrium. The fourth, virtual batch normalization, reduces the dependency of one sample on the other samples in the mini-batch by calculating the batch statistics for normalization with the sample placed within a reference mini-batch that is fixed at the beginning of training.

Finally, one-sided label smoothing makes the target for the discriminator 0.9 instead of 1, smoothing the discriminator's classification boundary and hence preventing an overly confident discriminator that would provide weak gradients for the generator. Sønderby et al. [29] advanced the idea of challenging the discriminator by adding noise to the samples before feeding them into the discriminator. Sønderby et al. [29] argued that one-sided label smoothing biases the optimal discriminator, whilst their technique, instance noise, moves the manifolds of the real and fake samples closer together, at the same time preventing the discriminator from easily finding a discrimination boundary that completely separates the real and fake samples. In practice, this can be implemented by adding Gaussian noise to both the synthesized and real images, annealing the standard deviation over time; a sketch of both tricks is given below. The process of adding noise to data samples to stabilize training was later formally justified by Arjovsky et al. [26].
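Both one-sided label smoothing and instance noise amount to a few lines inside the discriminator update; the fragment below shows one way to apply them. The 0.9 target follows [25]; the initial σ = 0.1 and the linear annealing schedule are our own illustrative choices.

```python
import torch
import torch.nn.functional as F

def d_loss_with_tricks(discriminator, x_real, x_fake, step, total_steps):
    # Instance noise: add Gaussian noise to both real and synthesized
    # samples, annealing its standard deviation towards zero over training.
    sigma = 0.1 * (1.0 - step / total_steps)
    x_real = x_real + sigma * torch.randn_like(x_real)
    x_fake = x_fake + sigma * torch.randn_like(x_fake)

    # One-sided label smoothing: real targets become 0.9 rather than 1.0,
    # while fake targets stay at 0.
    real_targets = torch.full((x_real.size(0), 1), 0.9)
    fake_targets = torch.zeros(x_fake.size(0), 1)
    return (F.binary_cross_entropy(discriminator(x_real), real_targets)
            + F.binary_cross_entropy(discriminator(x_fake), fake_targets))
```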

C. Alternative formulations

The first part of this section considers other information-theoretic interpretations and generalizations of GANs. The second part looks at alternative cost functions which aim to directly address the problem of vanishing gradients.

1) Generalisations of the GAN cost function: Nowozin et al. [30] showed that GAN training may be generalized to minimize not only the Jensen-Shannon divergence, but an estimate of f-divergences; these are referred to as f-GANs. The f-divergences include well-known divergence measures such as the Kullback-Leibler divergence. Nowozin et al. showed that the f-divergence may be approximated by applying the Fenchel conjugate of the desired f-divergence to samples drawn from the distribution of generated samples, after passing those samples through a discriminator [30]. They provide a list of Fenchel conjugates for commonly used f-divergences, as well as activation functions that may be used in the final layer of the generator network, depending on the choice of f-divergence. Having derived the generalized cost functions for training the generator and discriminator of an f-GAN, Nowozin et al. [30] observe that, in its raw form, maximizing the generator objective is likely to lead to weak gradients, especially at the start of training, and proposed an alternative cost function for updating the generator which is less likely to saturate at the beginning of training. This means that when the discriminator is trained, the derivative of the f-divergence on the ratio of the real and fake data distributions is estimated, while when the generator is trained only an estimate of the f-divergence is minimized. Uehara et al. [31] extend the f-GAN further: in the discriminator step the ratio of the distributions of real and fake data is predicted, and in the generator step the f-divergence is directly minimized. Alternatives to the JS-divergence are also covered by Goodfellow [12].

2) Alternative cost functions to prevent vanishing gradients: Arjovsky et al. [32] proposed the WGAN, a GAN with an alternative cost function which is derived from an approximation of the Wasserstein distance. Unlike the original GAN cost function, the WGAN is more likely to provide gradients that are useful for updating the generator. The cost function derived for the WGAN relies on the discriminator, which they refer to as the "critic", being a k-Lipschitz continuous function; practically, this may be implemented by simply clipping the parameters of the discriminator. However, more recent research [33] suggested that weight clipping adversely reduces the capacity of the discriminator model, forcing it to learn simpler functions. Gulrajani et al. [33] proposed an improved method for training the discriminator for a WGAN, by penalizing the norm of discriminator gradients with respect to data samples during training, rather than performing parameter clipping; a sketch of this penalty is given below.
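A compact sketch of such a gradient penalty is given below for data flattened to vectors; it interpolates between real and generated samples and pushes the critic's gradient norm towards 1, broadly in the spirit of [33]. Note that the critic here outputs unbounded scores (no sigmoid), and the weighting lam is an illustrative choice.

```python
import torch

def gradient_penalty(critic, x_real, x_fake, lam=10.0):
    """Penalize deviations of the critic's gradient norm from 1 at points
    interpolated between real and generated samples (vector-shaped data)."""
    eps = torch.rand(x_real.size(0), 1)                  # per-sample mix weights
    x_hat = (eps * x_real + (1 - eps) * x_fake).requires_grad_(True)
    scores = critic(x_hat)                               # unbounded critic scores
    grads, = torch.autograd.grad(scores.sum(), x_hat, create_graph=True)
    return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()
```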
D. A Brief Comparison of GAN Variants

GANs allow us to synthesize novel data samples from random noise, but they are considered difficult to train, due partially to vanishing gradients. All GAN models that we have discussed in this paper require careful hyperparameter tuning and model selection for training. However, perhaps the easiest models to train are the AAE and the WGAN. The AAE is relatively easy to train because the adversarial loss is applied to a fairly simple distribution in lower dimensions (than the image data). The WGAN [33] is designed to be easier to train, using a different formulation of the training objective which does not suffer from the vanishing gradient problem. The WGAN may also be trained successfully even without batch normalisation; it is also less sensitive to the choice of non-linearities used between convolutional layers.

Samples synthesised using a GAN or WGAN may belong to any class present in the training data. Conditional GANs provide an approach to synthesising samples with user-specified content.

It is evident from various visualisation techniques (Fig. 6) that the organisation of the latent space harbours some meaning, but vanilla GANs do not provide an inference model to allow data samples to be mapped to latent representations. Both BiGANs and ALI provide a mechanism to map image data to a latent space (inference); however, reconstruction quality suggests that they do not necessarily faithfully encode and decode samples. A very recent development shows that ALI may recover encoded data samples faithfully [21]. However, this model shares a lot in common with the AVB and AAE. These are autoencoders, similar to variational autoencoders (VAEs), where the latent space is regularised using adversarial training rather than a KL-divergence between encoded samples and a prior.

V. THE STRUCTURE OF LATENT SPACE

GANs build their own representations of the data they are trained on, and in doing so produce structured geometric vector spaces for different domains. This is a quality shared with other neural network models, including VAEs [23], as well as linguistic models such as word2vec [34]. In general, the domain of the data to be modelled is mapped to a vector space which has fewer dimensions than the data space, forcing the model to discover interesting structure in the data and represent it efficiently. This latent space is at the "originating" end of the generator network, and the data at this level of representation (the latent space) can be highly structured, and may support high level semantic operations [5]. Examples include rotation of faces from trajectories through latent space, as well as image analogies which have the effect of adding visual attributes such as eyeglasses on to a "bare" face.

All (vanilla) GAN models have a generator which maps data from the latent space into the space to be modelled, but many GAN models have an "encoder" which additionally supports the inverse mapping [19], [20]. This becomes a powerful method for exploring and using the structured latent space of the GAN network. With an encoder, collections of labelled images can be mapped into latent spaces and analysed to discover "concept vectors" that represent high level attributes such as "smiling" or "wearing a hat". These vectors can be applied at scaled offsets in latent space to influence the behaviour of the generator (Fig. 6); a code sketch of this operation is given below. Similar to using an encoding process to model the distribution of latent samples, Gurumurthy et al. [35] propose modelling the latent space as a mixture of Gaussians and learning the mixture components that maximize the likelihood of generated data samples under the data generating distribution.
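The concept-vector operation can be sketched in a few lines: encode a group of images that shows the attribute and a group that does not, take the difference of the mean codes, and add a scaled multiple of it to the code of a target image. Any trained encoder/generator pair with a shared latent space is assumed; the function below is purely illustrative.

```python
import torch

def apply_concept(encoder, generator, x_with, x_without, x_target, scale=1.0):
    """Shift target images along a 'concept vector' (e.g. "smiling").

    x_with / x_without: batches of images that do / do not show the attribute.
    """
    with torch.no_grad():
        concept = encoder(x_with).mean(0) - encoder(x_without).mean(0)
        z = encoder(x_target)
        return generator(z + scale * concept)  # scaled offset in latent space
```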
VI. APPLICATIONS OF GANS

Discovering new applications for adversarial training of deep networks is an active area of research. We examine a few computer vision applications that have appeared in the literature and have been subsequently refined. These applications were chosen to highlight some different approaches to using GAN-based representations for image manipulation, analysis or characterization, and do not fully reflect the potential breadth of application of GANs.

Using GANs for image classification places them within the broader context of machine learning and provides a useful quantitative assessment of the features extracted in unsupervised learning. Image synthesis remains a core GAN capability, and is especially useful when the generated image can be subject to pre-existing constraints. Super-resolution [36], [37], [38] offers an example of how an existing approach can be supplemented with an adversarial loss component to achieve higher quality results. Finally, image-to-image translation demonstrates how GANs offer a general purpose solution to a family of tasks which require automatically converting an input image into an output image.

A. Classification and Regression

After GAN training is complete, the neural network can be reused for other downstream tasks. For example, outputs of the convolutional layers of the discriminator can be used as a feature extractor, with simple linear models fitted on top of these features using a modest quantity of (image, label) pairs [5], [25]; a code sketch of this procedure is given at the end of this subsection. The quality of the unsupervised representations within a DCGAN network has been assessed by applying a regularized L2-SVM classifier to a feature vector extracted from the (trained) discriminator [5]. Good classification scores were achieved using this approach on both supervised and semi-supervised datasets, even those that were disjoint from the original training data.

The quality of the data representation may be improved when adversarial training includes jointly learning an inference mechanism, as with ALI [19]. A representation vector built using the last three hidden layers of the ALI encoder and a similar L2-SVM classifier achieved a misclassification rate significantly lower than that of the DCGAN [19]. Additionally, ALI has achieved state-of-the-art classification results when label information is incorporated into the training routine.

When labelled training data is in limited supply, adversarial training may also be used to synthesize more training samples. Shrivastava et al. [39] use GANs to refine synthetic images, while maintaining their annotation information. By training models only on GAN-refined synthetic images (i.e. no real training data), Shrivastava et al. [39] achieved state-of-the-art performance on pose and gaze estimation tasks. Similarly, good results were obtained for gaze estimation and prediction using a spatio-temporal GAN architecture [40]. In some cases, models trained on synthetic data do not generalize well when applied to real data [3]. Bousmalis et al. [3] propose to address this problem by adapting synthetic samples from a source domain to match a target domain using adversarial training. Additionally, Liu et al. [41] propose using multiple GANs – one per domain – with tied weights to synthesize pairs of corresponding image samples from different domains.

Because the quality of generated samples is hard to judge quantitatively across models, classification tasks are likely to remain an important quantitative tool for performance assessment of GANs, even as new and diverse applications in computer vision are explored.
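A minimal sketch of this evaluation protocol is given below, standing in for the L2-SVM-on-discriminator-features experiments described above: the first few layers of a frozen discriminator act as the feature extractor and a linear SVM from scikit-learn is fitted on top. The layer count and SVM settings are illustrative assumptions.

```python
import torch
from sklearn.svm import LinearSVC

def fit_probe(discriminator, x_train, y_train, n_feature_layers=4):
    """Fit a linear classifier on intermediate discriminator activations.

    `discriminator` is assumed to be an nn.Sequential whose first few
    layers serve as a frozen feature extractor."""
    features = discriminator[:n_feature_layers]
    with torch.no_grad():
        feats = features(x_train).flatten(1).numpy()
    return LinearSVC(C=1.0).fit(feats, y_train)
```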
B. Image Synthesis

Much of the recent GAN research focuses on improving the quality and utility of the image generation capabilities. The LAPGAN model introduced a cascade of convolutional networks within a Laplacian pyramid framework to generate images in a coarse-to-fine fashion [13]. A similar approach is used by Huang et al. [42] with GANs operating on intermediate representations rather than lower resolution images.

LAPGAN also extended the conditional version of the GAN model where both G and D networks receive additional label information as input; this technique has proved useful and is now a common practice to improve image quality.

Fig. 6. Example of applying a "smile vector" with an ALI model [19]. On the left-hand side is an example of a woman without a smile, and on the right a woman with a smile. A z value is inferred for the image of the woman on the left, z1, and for the right, z2. Interpolating along a vector that connects z1 and z2 gives z values that may be passed through a generator to synthesize novel samples. Note the implication: a displacement vector in latent space traverses smile "intensity" in image space.

This idea of GAN conditioning was later extended to incorporate natural language. For example, Reed et al. [43] used a GAN architecture to synthesize images from text descriptions, which one might describe as reverse captioning. For example, given a text caption of a bird such as "white with some black on its head and wings and a long orange beak", the trained GAN can generate several plausible images that match the description.

In addition to conditioning on text descriptions, the Generative Adversarial What-Where Network (GAWWN) conditions on image location [44]. The GAWWN system supported an interactive interface in which large images could be built up incrementally with textual descriptions of parts and user-supplied bounding boxes (Fig. 7).

Conditional GANs not only allow us to synthesize novel samples with specific attributes, they also allow us to develop tools for intuitively editing images – for example editing the hair style of a person in an image, making them wear glasses or making them look younger [35]. Additional applications of GANs to image editing include work by Zhu et al. [2] and Brock et al. [45].

C. Image-to-image translation

Conditional adversarial networks are well suited for translating an input image into an output image, which is a recurring theme in computer graphics, image processing, and computer vision. The pix2pix model offers a general purpose solution to this family of problems [46]. In addition to learning the mapping from input image to output image, the pix2pix model also constructs a loss function to train this mapping. This model has demonstrated effective results for different problems of computer vision which had previously required separate machinery, including semantic segmentation, generating maps from aerial photos, and colorization of black and white images. Wang et al. present a similar idea, using GANs to first synthesize surface-normal maps (similar to depth maps) and then map these images to natural scenes.

CycleGAN [4] extends this work by introducing a cycle consistency loss that attempts to preserve the original image after a cycle of translation and reverse translation; a sketch of this loss is given below. In this formulation, matching pairs of images are no longer needed for training. This makes data preparation much simpler, and opens the technique to a larger family of applications. For example, artistic style transfer [47] renders natural images in the style of artists, such as Picasso or Monet, by simply being trained on an unpaired collection of paintings and natural images (Fig. 8).
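A sketch of the cycle consistency term for two generators mapping between domains A and B follows; the L1 reconstruction distance and the weighting lam are illustrative choices rather than a faithful reproduction of [4].

```python
import torch.nn.functional as F

def cycle_loss(g_ab, g_ba, x_a, x_b, lam=10.0):
    """Cycle consistency sketch: translating A->B->A (and B->A->B) should
    recover the original images, so paired examples are not required."""
    loss_a = F.l1_loss(g_ba(g_ab(x_a)), x_a)   # A -> B -> A
    loss_b = F.l1_loss(g_ab(g_ba(x_b)), x_b)   # B -> A -> B
    return lam * (loss_a + loss_b)
```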
D. Super-resolution

Super-resolution allows a high-resolution image to be generated from a lower resolution image, with the trained model inferring photo-realistic details while up-sampling. The SRGAN model [36] extends earlier efforts by adding an adversarial loss component which constrains images to reside on the manifold of natural images.

The SRGAN generator is conditioned on a low resolution image, and infers photo-realistic natural images with 4x up-scaling factors. Unlike most GAN applications, the adversarial loss is one component of a larger loss function, which also includes a perceptual loss from a pretrained classifier, and a regularization loss that encourages spatially coherent images. In this context, the adversarial loss constrains the overall solution to the manifold of natural images, producing perceptually more convincing solutions.

Customizing deep learning applications can often be hampered by the availability of relevant curated training datasets. However, SRGAN is straightforward to customize to specific domains, as new training image pairs can easily be constructed by down-sampling a corpus of high-resolution images (see the sketch below). This is an important consideration in practice, since the inferred photo-realistic details that the GAN generates will vary depending on the domain of images used in the training set.
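Constructing such pairs is essentially a one-liner; the sketch below down-samples a batch of high-resolution images to produce (low-resolution, high-resolution) training pairs. The 4x factor matches the text; the use of bicubic interpolation is our own illustrative choice.

```python
import torch.nn.functional as F

def make_sr_pairs(hr_batch, factor=4):
    """Build (low-res, high-res) training pairs from a corpus of
    high-resolution images of shape (B, C, H, W)."""
    lr_batch = F.interpolate(hr_batch, scale_factor=1.0 / factor,
                             mode='bicubic', align_corners=False)
    return lr_batch, hr_batch
```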

Fig. 7. Examples of image synthesis using the Generative Adversarial What-Where Network (GAWWN). In GAWWN, images are conditioned on both text descriptions and image location, specified either as a keypoint or as a bounding box. Figure reproduced from [44] with the authors' permission.

Fig. 8. The CycleGAN model learns image-to-image translations between two unordered image collections. Shown here are examples of bidirectional image mappings: Monet paintings to landscape photos, zebras to horses, and summer to winter photos in Yosemite park. Figure reproduced from [4].

VII. DISCUSSION

A. Open Questions

GANs have attracted considerable attention due to their ability to leverage vast amounts of unlabelled data. While much progress has been made to alleviate some of the challenges related to training and evaluating GANs, there still remain several open challenges.

1) Mode Collapse: As articulated in Section IV, a common problem of GANs involves the generator collapsing to produce a small family of similar samples (partial collapse), and in the worst case producing simply a single sample (complete collapse) [26], [48].

Diversity in the generator can be increased by practical hacks to balance the distribution of samples produced by the discriminator for real and fake batches, or by employing multiple GANs to cover the different modes of the probability distribution [49]. Yet another solution to alleviate mode collapse is to alter the distance measure used to compare statistical distributions. Arjovsky [32] proposed to compare distributions based on a Wasserstein distance rather than a KL-based divergence (DCGAN [5]) or a total-variation distance (energy-based GAN [50]).

Metz et al. [51] proposed unrolling the discriminator for several steps, i.e., letting it calculate its updates on the current generator for several steps, and then using the "unrolled" discriminators to update the generator using the normal minimax objective. As normal, the discriminator only trains on its update from one step, but the generator now has access to how the discriminator would update itself. With the usual one-step generator objective, the discriminator will simply assign a low probability to the generator's previous outputs, forcing the generator to move, resulting either in convergence, or an endless cycle of mode hopping. However, with the unrolled objective, the generator can prevent the discriminator from focusing on the previous update, and update its own generations with the foresight of how the discriminator would have responded.

2) Training instability – saddle points: In a GAN, the Hessian of the loss function becomes indefinite. The optimal solution, therefore, lies in finding a saddle point rather than a local minimum. In deep learning, a large number of optimizers depend only on the first derivative of the loss function; converging to a saddle point for GANs requires good initialization. By invoking the stable manifold theorem from non-linear systems theory, Lee et al. [52] showed that, were we to select the initial points of an optimizer at random, gradient descent would not converge to a saddle with probability one (also see [53], [25]). Additionally, Mescheder et al. [54] have argued that convergence of a GAN's objective function suffers from the presence of a zero real part of the Jacobian matrix, as well as eigenvalues with large imaginary parts.

This is disheartening for GAN training; yet, due to the existence of second-order optimizers, not all hope is lost. Unfortunately, Newton-type methods have compute-time complexity that scales cubically or quadratically with the dimension of the parameters. Therefore, another line of questions lies in applying and scaling second-order optimizers for adversarial training.

A more fundamental problem is the existence of an equilibrium for a GAN. Using results from Bayesian non-parametrics, Arora et al. [48] connect the existence of the equilibrium to a finite mixture of neural networks – this means that below a certain capacity, no equilibrium might exist. On a closely related note, it has also been argued that whilst GAN training can appear to have converged, the trained distribution could still be far away from the target distribution. To alleviate this issue, Arora et al. [48] propose a new measure called the 'neural net distance'.

3) Evaluating Generative Models: How can one gauge the fidelity of samples synthesized by a generative model? Should we use a likelihood estimation? Can a GAN trained using one methodology be compared to another (model comparison)? These are open-ended questions that are not only relevant for GANs, but also for probabilistic models in general. Theis [55] argued that evaluating GANs using different measures can lead to conflicting conclusions about the quality of synthesised samples; the decision to select one measure over another depends on the application.

B. Conclusions

The explosion of interest in GANs is driven not only by their potential to learn deep, highly non-linear mappings from a latent space into a data space and back, but also by their potential to make use of the vast quantities of unlabelled image data that remain closed to deep representation learning. Within the subtleties of GAN training, there are many opportunities for developments in theory and algorithms, and with the power of deep networks, there are vast opportunities for new applications.

ACKNOWLEDGMENT

The authors would like to thank David Warde-Farley for his valuable feedback on previous revisions of the paper. Antonia Creswell acknowledges the support of the EPSRC through a Doctoral training scholarship.

REFERENCES

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[2] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, "Generative visual manipulation on the natural image manifold," in European Conference on Computer Vision. Springer, 2016, pp. 597–613.
[3] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, "Unsupervised pixel-level domain adaptation with generative adversarial networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[4] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the International Conference on Computer Vision, 2017. [Online]. Available: https://arxiv.org/abs/1703.10593
[5] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," in Proceedings of the 5th International Conference on Learning Representations (ICLR) – workshop track, 2016.
[6] A. Creswell and A. A. Bharath, "Adversarial training for sketch retrieval," in Computer Vision – ECCV 2016 Workshops, Part I. Springer International Publishing, 2016.
[7] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[8] H. Hotelling, "Analysis of a complex of statistical variables into principal components," Journal of Educational Psychology, vol. 24, no. 6, p. 417, 1933.
[9] I. J. Goodfellow, "On distinguishability criteria for estimating generative models," International Conference on Learning Representations – workshop track, 2015.
[10] M. Gutmann and A. Hyvärinen, "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models," in AISTATS, vol. 1, no. 2, 2010, p. 6.
[11] Y. Bengio, L. Yao, G. Alain, and P. Vincent, "Generalized denoising auto-encoders as generative models," in Advances in Neural Information Processing Systems, 2013, pp. 899–907.
[12] I. Goodfellow, "NIPS 2016 tutorial: Generative adversarial networks," 2016, presented at the Neural Information Processing Systems Conference. [Online]. Available: https://arxiv.org/abs/1701.00160
[13] E. L. Denton, S. Chintala, R. Fergus et al., "Deep generative image models using a Laplacian pyramid of adversarial networks," in Advances in Neural Information Processing Systems, 2015, pp. 1486–1494.
[14] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum, "Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling," in Advances in Neural Information Processing Systems, 2016, pp. 82–90.
[15] M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.
[16] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel, "InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets," in Advances in Neural Information Processing Systems, 2016.
[17] A. Creswell and A. A. Bharath, "Inverting the generator of a generative adversarial network," in NIPS Workshop on Adversarial Training, 2016.
[18] Z. C. Lipton and S. Tripathi, "Precise recovery of latent vectors from generative adversarial networks," in ICLR (workshop track), 2017.
[19] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville, "Adversarially learned inference," in Proceedings of the International Conference on Learning Representations, 2017.
[20] J. Donahue, P. Krähenbühl, and T. Darrell, "Adversarial feature learning," in Proceedings of the International Conference on Learning Representations, 2017.

[21] C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin, "Towards understanding adversarial learning for joint distribution matching," in Advances in Neural Information Processing Systems, 2017.
[22] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow, "Adversarial autoencoders," in International Conference on Learning Representations, 2016. [Online]. Available: http://arxiv.org/abs/1511.05644
[23] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proceedings of the 2nd International Conference on Learning Representations (ICLR), 2014.
[24] L. M. Mescheder, S. Nowozin, and A. Geiger, "Adversarial variational Bayes: Unifying variational autoencoders and generative adversarial networks," 2017. [Online]. Available: http://arxiv.org/abs/1701.04722
[25] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, 2016, pp. 2226–2234.
[26] M. Arjovsky and L. Bottou, "Towards principled methods for training generative adversarial networks," NIPS 2016 Workshop on Adversarial Training, 2016.
[27] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017.
[28] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proceedings of the 32nd International Conference on Machine Learning, 2015, pp. 448–456.
[29] C. K. Sønderby, J. Caballero, L. Theis, W. Shi, and F. Huszár, "Amortised MAP inference for image super-resolution," in International Conference on Learning Representations, 2017.
[30] S. Nowozin, B. Cseke, and R. Tomioka, "f-GAN: Training generative neural samplers using variational divergence minimization," in Advances in Neural Information Processing Systems, 2016, pp. 271–279.
[31] M. Uehara, I. Sato, M. Suzuki, K. Nakayama, and Y. Matsuo, "Generative adversarial nets from a density ratio estimation perspective," arXiv preprint arXiv:1610.02920, 2016.
[32] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," in Proceedings of the 34th International Conference on Machine Learning, 2017.
[33] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville, "Improved training of Wasserstein GANs," in Advances in Neural Information Processing Systems, 2017.
[34] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation
[39] A. Shrivastava et al., "… through adversarial training," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[40] M. Zhang, K. T. Ma, J. H. Lim, Q. Zhao, and J. Feng, "Deep future gaze: Gaze anticipation on egocentric videos using adversarial networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4372–4381.
[41] M.-Y. Liu and O. Tuzel, "Coupled generative adversarial networks," in Advances in Neural Information Processing Systems, 2016, pp. 469–477.
[42] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie, "Stacked generative adversarial networks," in IEEE Conference on Computer Vision and Pattern Recognition, 2016.
[43] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, "Generative adversarial text to image synthesis," in International Conference on Machine Learning, 2016. [Online]. Available: https://arxiv.org/abs/1605.05396
[44] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee, "Learning what and where to draw," in Advances in Neural Information Processing Systems, 2016, pp. 217–225.
[45] A. Brock, T. Lim, J. M. Ritchie, and N. Weston, "Neural photo editing with introspective adversarial networks," in Proceedings of the 6th International Conference on Learning Representations (ICLR), 2017.
[46] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[47] C. Li and M. Wand, "Precomputed real-time texture synthesis with Markovian generative adversarial networks," in European Conference on Computer Vision. Springer, 2016, pp. 702–716.
[48] S. Arora, R. Ge, Y. Liang, T. Ma, and Y. Zhang, "Generalization and equilibrium in generative adversarial nets (GANs)," in Proceedings of the 34th International Conference on Machine Learning, 2017.
[49] I. Tolstikhin, S. Gelly, O. Bousquet, C.-J. Simon-Gabriel, and B. Schölkopf, "AdaGAN: Boosting generative models," Tech. Rep., 2017.
[50] J. Zhao, M. Mathieu, and Y. LeCun, "Energy-based generative adversarial network," in International Conference on Learning Representations, 2017. [Online]. Available: https://arxiv.org/abs/1609.03126
[51] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, "Unrolled generative adversarial networks," in Proceedings of the International Conference on Learning Representations, 2017. [Online]. Available: https://arxiv.org/abs/1611.02163
[52] J. D. Lee, M. Simchowitz, M. I. Jordan, and B. Recht, "Gradient descent only converges to minimizers," in Conference on Learning Theory, 2016, pp. 1246–1257.
[53] R. Pemantle, "Nonconvergence to unstable points in urn models
of word representations in vector space,” in International Confer- and stochastic approximations,” Ann. Probab., vol. 18, no. 2, pp.
ence on Learning Representations, 2013. 698–712, 04 1990.
[35] S. Gurumurthy, R. K. Sarvadevabhatla, and V. B. Radhakrishnan, [54] L. M. Mescheder, S. Nowozin, and A. Geiger, “The numerics of
“Deligan: Generative adversarial networks for diverse and limited gans,” in Advances in Neural Information Processing Systems,
data,” in IEEE Conference On Computer Vision and Pattern 2017. [Online]. Available: http://arxiv.org/abs/1705.10461
Recognition (CVPR), 2017. [55] L. Theis, A. van den Oord, and M. Bethge, “A note on the eval-
[36] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Aitken, A. Tejani, uation of generative models,” in Proceedings of the International
J. Totz, Z. Wang, and W. Shi, “Photo-realistic single image Conference on Learning Representations.
super-resolution using a generative adversarial network,” in IEEE
Conference on Computer Vision and Pattern Recognition, 2017.
[37] X. Yu and F. Porikli, “Ultra-resolving face images by discrimina-
tive generative networks,” in European Conference on Computer
Vision. Springer, 2016, pp. 318–333.
[38] ——, “Hallucinating very low-resolution unaligned and noisy face
Antonia Creswell ([email protected]) holds a first-class degree in Biomedical Engineering from Imperial College (2011), and is currently a PhD student in the Biologically Inspired Computer Vision (BICV) Group at Imperial College London (2015). The focus of her PhD is on improving the training of generative adversarial networks and applying them to visual search and to learning representations in unlabelled sources of image data.
Tom White received his BS in Mathematics from the University of Georgia, USA, and his MS in Media Arts and Sciences from the Massachusetts Institute of Technology. He is currently a senior lecturer in the School of Design at Victoria University of Wellington, New Zealand. His current research focuses on exploring the growing use of constructive machine learning in computational design and the creative potential of human designers working collaboratively with artificial neural networks during the exploration of design ideas and prototyping.
Vincent Dumoulin holds a BSc in Physics and Computer Science from the University of Montréal. He is a doctoral candidate at the Montréal Institute for Learning Algorithms under the co-supervision of Yoshua Bengio and Aaron Courville, working on deep learning approaches to generative modelling.
Kai Arulkumaran ([email protected]) is a Ph.D. candidate in the Department of Bioengineering at Imperial College London. He received a B.A. in Computer Science from the University of Cambridge in 2012, and an M.Sc. in Biomedical Engineering from Imperial College London in 2014. He was a Research Intern at Twitter Magic Pony and Microsoft Research in 2017. His research focus is deep reinforcement learning and computer vision for visuomotor control.
Biswa Sengupta received his B.Eng. (Hons.) and M.Sc. degrees in electrical and computer engineering (2004) and theoretical computer science (2005), respectively, from the University of York. He then read for a second M.Sc. degree in neural and behavioural sciences (2007) at the Max Planck Institute for Biological Cybernetics, obtaining his PhD in theoretical neuroscience (2011) from the University of Cambridge. He received further training in Bayesian statistics and differential geometry at University College London and the University of Cambridge before leading Cortexica Vision Systems as its Chief Scientist. Currently, he is a visiting scientist at Imperial College London and leads machine learning research at Noah's Ark Lab of Huawei Technologies UK.
Anil Anthony Bharath ([email protected]) is a Reader in the Department of Bioengineering at Imperial College London, an Academic Fellow of Imperial's Data Science Institute, and a Fellow of the Institution of Engineering and Technology. He received a B.Eng. in Electronic and Electrical Engineering from University College London in 1988, and a Ph.D. in Signal Processing from Imperial College London in 1993. He was an academic visitor in the Signal Processing Group at the University of Cambridge in 2006. He is a co-founder of Cortexica Vision Systems. His research interests are in deep architectures for visual inference.