Denoising Adversarial Autoencoders
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 30, NO. 4, APRIL 2019
Abstract: Unsupervised learning is of growing interest because it unlocks the potential held in vast amounts of unlabeled data to learn useful representations for inference. Autoencoders, a form of generative model, may be trained by learning to reconstruct unlabeled input data from a latent representation space. More robust representations may be produced by an autoencoder if it learns to recover clean input samples from corrupted ones. Representations may be further improved by introducing regularization during training to shape the distribution of the encoded data in the latent space. We suggest denoising adversarial autoencoders (AAEs), which combine denoising and regularization, shaping the distribution of latent space using adversarial training. We introduce a novel analysis that shows how denoising may be incorporated into the training and sampling of AAEs. Experiments are performed to assess the contributions that denoising makes to the learning of representations for classification and sample synthesis. Our results suggest that autoencoders trained using a denoising criterion achieve higher classification performance and can synthesize samples that are more consistent with the input data than those trained without a corruption process.

Index Terms: Image analysis, pattern recognition, semisupervised learning, unsupervised learning.

Manuscript received July 3, 2017; revised January 3, 2018, March 17, 2018, and June 24, 2018; accepted June 25, 2018. Date of publication August 16, 2018; date of current version March 18, 2019. This work was supported by the Engineering and Physical Sciences Research Council through a Doctoral Training Studentship under Grant EP/L504786/1. (Corresponding author: Antonia Creswell.)
The authors are with the Biologically Inspired Computer Vision Group, Imperial College London, London SW7 2AZ, U.K.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TNNLS.2018.2852738
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/

I. INTRODUCTION

Modeling and drawing data samples from complex, high-dimensional distributions are challenging. Generative models may be used to capture an underlying statistical structure from real-world data. A good generative model is not only able to draw samples from the distribution of data being modeled but should also be useful for inference.

Modeling complicated distributions may be made easier by learning the parameters of conditional probability distributions that map intermediate latent variables [2] from simpler distributions to more complex ones [4]. Often, the intermediate representations that are learned can be used for tasks such as retrieval or classification [20], [24], [26], [30].

Typically, to train a model for classification, a deep neural network may be constructed, demanding large labeled data sets to achieve high accuracy [15]. Large labeled data sets may be expensive or difficult to obtain for some tasks. However, many state-of-the-art generative models can be trained without labeled data sets [9], [12], [14], [24]. For example, autoencoders learn a generative model, referred to as a decoder, by recovering inputs from corrupted [5], [12], [30] or encoded [14] versions of themselves.

Two broad approaches to learning state-of-the-art generative autoencoders that do not require labeled training data include: 1) introduction of a denoising criterion [5], [30], [31], where the model learns to reconstruct clean samples from corrupted ones and 2) regularization of the latent space to match a prior [14], [20]; for the latter, the priors take a simple form, such as multivariate normal distributions.

The denoising variational autoencoder (DVAE) [12] combines both denoising and regularization in a single generative model. However, introducing a denoising criterion makes the variational cost function, which is used to match the latent distribution to the prior, analytically intractable. Reformulation of the cost function makes it tractable but only for certain families of prior and posterior distributions. We propose using adversarial training [9] to match the posterior distribution to the prior. Taking this approach expands the possible choices for families of prior and posterior distributions.

When a denoising criterion is introduced to an adversarial autoencoder (AAE), we have a choice to either shape the conditional distribution of latent variables given corrupted samples to match the prior (as was done using a variational approach [12]) or to shape the full posterior conditional on the original data samples to match the prior. Shaping the posterior distribution over corrupted samples does not require additional sampling during training, but trying to shape the full conditional distribution with respect to the original data samples does. We explore both approaches using adversarial training to avoid the difficulties posed by analytically intractable cost functions.

In addition, a model that has been trained using the posterior conditioned on the corrupted data requires an iterative process for synthesizing samples, whereas using the full posterior conditioned on the original data does not. Similar challenges exist for the DVAE but were not addressed by Im et al. [12]. We analyze and address these challenges for AAEs, introducing a novel sampling approach for synthesizing samples from trained models.

In summary, our contributions include: 1) two types of denoising AAEs: one which is more efficient to train and one which is more efficient to draw samples from; 2) methods to draw synthetic data samples from denoising AAEs through Markov chain (MC) sampling; and 3) an analysis of the quality of features learned with denoising AAEs through their application to discriminative tasks.
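The choice described in the Introduction, between matching the encoding of a single corrupted sample to the prior and matching the full posterior conditioned on the original sample, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: `toy_encoder` is a hypothetical stand-in for a trained encoder, the corruption is additive Gaussian noise, and the Monte Carlo step is assumed to average the encodings of several corrupted copies. The authors' own procedure (their Algorithm 2) is given in the Appendix and may differ.

```python
import random


def corrupt(x, sigma, rng):
    """One draw from an additive Gaussian corruption process c(x_tilde | x)."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]


def toy_encoder(x_tilde):
    """Hypothetical stand-in for a trained encoder: a fixed linear map to 2-D."""
    mean = sum(x_tilde) / len(x_tilde)
    return [mean, x_tilde[0] - x_tilde[-1]]


def latent_single_corruption(x, sigma, rng):
    """Encode one corrupted copy of x: a latent sample conditioned on x_tilde.

    This costs a single corruption and a single encoder pass.
    """
    return toy_encoder(corrupt(x, sigma, rng))


def latent_full_posterior(x, sigma, rng, num_draws):
    """Assumed Monte Carlo approximation conditioned on the original x.

    Averages the encodings of num_draws corrupted copies, so it costs
    num_draws corruptions and encoder passes per training sample.
    """
    draws = [toy_encoder(corrupt(x, sigma, rng)) for _ in range(num_draws)]
    dim = len(draws[0])
    return [sum(d[k] for d in draws) / num_draws for k in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    x = [0.0, 1.0, 1.0, 0.0]
    print(latent_single_corruption(x, 0.5, rng))
    print(latent_full_posterior(x, 0.5, rng, num_draws=5))
```

The second function makes the extra training cost visible: it calls the encoder `num_draws` times per sample, whereas the first calls it once.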
II. BACKGROUND

A. Autoencoders

In a supervised learning setting, given a set of training data $\{(y_i, x_i)\}_{i=1}^{N}$, we wish to learn a model $f_\psi(y|x)$ that maximizes the likelihood $\mathbb{E}_{p(y|x)} f_\psi(y|x)$ of the true label $y$ given an observation $x$. In the supervised setting, there are many ways to calculate and approximate the likelihood because there is a ground-truth label for every training data sample.

When trying to learn a generative model $p_\theta(x)$ in the absence of a ground truth, calculating the likelihood of the model under the observed data distribution, $\mathbb{E}_{x \sim p(x)} p_\theta(x)$, is challenging. Autoencoders introduce a two-step learning process that allows $p(x)$ to be estimated by $p_\theta(x)$ via an auxiliary variable $z$. The variable $z$ may take many forms, and we shall explore several of these in this section. The two-step process involves first learning a probabilistic encoder [14] $q_\phi(z|x)$ conditioned on observed samples and a second probabilistic decoder [14] $p_\theta(x|z)$ conditioned on the auxiliary variables. Using the probabilistic encoder, we may form a training data set $\{(z_i, x_i)\}_{i=1}^{N}$, where $x_i$ is the ground truth output for $x \sim p(x|z_i)$ with the input being $z_i \sim q_\phi(z|x_i)$. The probabilistic decoder $p_\theta(x|z)$ may then be trained on this data set in a supervised fashion. By sampling $p_\theta(x|z)$ conditioned on suitable $z$ values, we may obtain a joint distribution $p_\theta(x, z)$, which may be marginalized by integrating over all $z$ values to obtain $p_\theta(x)$. Note that a deterministic autoencoder is a special case of a probabilistic one.

In some situations, the encoding distribution is chosen rather than learned [5], and in other situations, the encoder and the decoder are learned simultaneously [12], [14], [20].

B. Denoising Autoencoders

Bengio et al. [5] treat the encoding process as a local corruption process that does not need to be learned. In the corruption process, defined as $c(\tilde{x}|x)$, the corrupted $x$, denoted $\tilde{x}$, is the auxiliary variable (instead of $z$). The decoder $p_\theta(x|\tilde{x})$ is therefore trained on the data pairs $\{(\tilde{x}_i, x_i)\}_{i=1}^{N}$.

By using a local corruption process (e.g., additive white Gaussian noise [5]), both $\tilde{x}$ and $x$ have the same number of dimensions and are close to each other. This makes it very easy to learn $p_\theta(x|\tilde{x})$. Bengio et al. [5] show how the learned model may be sampled using an iterative process but do not explore how representations learned by the model may transfer to other applications such as classification.

Hinton and Salakhutdinov [11] show that when auxiliary variables of an autoencoder have lower dimension than the observed data, the encoding model learns representations that may be useful for tasks such as classification and retrieval.

Rather than treating the corruption process $c(\tilde{x}|x)$ as an encoding process [5], which misses out on the potential benefits of using a lower dimensional auxiliary variable, Vincent et al. [30], [31] learn an encoding distribution $q_\phi(z|\tilde{x})$ conditioned on the corrupted samples. The decoding distribution $p_\theta(x|z)$ learns to reconstruct images from encoded, corrupted images; see the denoising autoencoders (DAEs) in Fig. 1. Vincent et al. [30], [31] show that compared with regular autoencoders, DAEs learn representations that are more useful and robust for tasks such as classification. Parameters $\phi$ and $\theta$ are learned simultaneously by minimizing the reconstruction error for the training set $\{(\tilde{x}_i, x_i)\}_{i=1}^{N}$, which does not include $z_i$. The ground truth $z_i$ for a given $\tilde{x}_i$ is unknown. The form of the distribution over $z$ to which $x$ samples are mapped, $p_\phi(z)$, is also unknown, making it difficult to draw novel data samples from the decoder model $p_\theta(x|z)$.

Fig. 1. Comparison of autoencoding models. Previous works include DAEs [5], [30], VAEs [14], AAEs [20], and DVAEs [12]. Our contributions are the DAAE and iDAAE models. Arrows represent mappings implemented using trained neural networks.

C. Variational Autoencoders

VAEs [14] specify a prior distribution $p(z)$, to which $q_\phi(z|x)$ should map all $x$ samples, by formulating and maximizing a variational lower bound on the log-likelihood of $p_\theta(x)$.

The variational lower bound on the log-likelihood of $p_\theta(x)$ is given by [14]

$$\log p_\theta(x) \geq \mathbb{E}_{z \sim q_\phi(z|x)}[\log p_\theta(x|z)] - KL[q_\phi(z|x) \,\|\, p(z)]. \tag{1}$$

The term $p_\theta(x|z)$ corresponds to the likelihood of a reconstructed $x$ value given the encoding $z$ of a data sample $x$. This formulation of the variational lower bound does not involve a corruption process. The term $KL[q_\phi(z|x) \,\|\, p(z)]$ is the Kullback–Leibler (KL) divergence between $q_\phi(z|x)$ and $p(z)$. Samples are drawn from $q_\phi(z|x)$ via a reparametrization trick (see the VAE in Fig. 1).

If $q_\phi(z|x)$ is chosen to be a parameterized multivariate Gaussian $\mathcal{N}(\mu_\phi(x), \sigma_\phi(x))$ and the prior is chosen to be a Gaussian distribution, then $KL[q_\phi(z|x) \,\|\, p(z)]$ may be computed analytically. The KL divergence may only be computed analytically for certain (limited) choices of prior and posterior distributions.

VAE training encourages $q_\phi(z|x)$ to map observed samples to the chosen prior $p(z)$. Therefore, novel observed data samples may be generated via the following simple sampling process: $z_i \sim p(z)$, $x_i \sim p_\theta(x|z_i)$ [14].

Note that despite the benefits of the denoising criterion shown by Vincent et al. [30], [31] for regular
autoencoders, no corruption process was introduced by Kingma and Welling [14] for VAEs.

D. Denoising Variational Autoencoders

Adding the denoising criterion to a VAE is nontrivial because the variational lower bound becomes intractable.

Consider the conditional probability density function $\tilde{q}_\phi(z|x) = \int q_\phi(z|\tilde{x}) c(\tilde{x}|x)\,d\tilde{x}$, where $q_\phi(z|\tilde{x})$ is the probabilistic encoder conditioned on the corrupted $x$ samples $\tilde{x}$, and $c(\tilde{x}|x)$ is a corruption process. The variational lower bound may be formed in the following way [12]:

$$\log p_\theta(x) \geq \mathbb{E}_{\tilde{q}_\phi(z|x)} \log \frac{p_\theta(x, z)}{q_\phi(z|\tilde{x})} \geq \mathbb{E}_{\tilde{q}_\phi(z|x)} \log \frac{p_\theta(x, z)}{\tilde{q}_\phi(z|x)}.$$

If $q_\phi(z|\tilde{x})$ is chosen to be Gaussian, then in many cases, $\tilde{q}_\phi(z|x)$ will be a mixture of Gaussians. If this is the case, there is no analytical solution for $KL[\tilde{q}_\phi(z|x) \,\|\, p(z)]$, and so the denoising variational lower bound becomes analytically intractable. However, there may still be an analytical solution for $KL[q_\phi(z|\tilde{x}) \,\|\, p(z)]$. The DVAE therefore maximizes $\mathbb{E}_{\tilde{q}_\phi(z|x)} \log [p_\theta(x, z) / q_\phi(z|\tilde{x})]$. We refer to the model which is trained to maximize this objective as a DVAE (see Fig. 1). Im et al. [12] show that the DVAE achieves lower negative variational lower bounds than the regular VAE on the test data set.

However, note that $q_\phi(z|\tilde{x})$ is matched to the prior $p(z)$ rather than $\tilde{q}_\phi(z|x)$. This means that generating novel samples using $p_\theta(x|z)$ is not as simple as the process of generating samples from a VAE. To generate novel samples, we should sample $z_i \sim \tilde{q}_\phi(z|x)$ and $x_i \sim p_\theta(x|z_i)$, which is difficult because of the need to evaluate $\tilde{q}_\phi(z|x)$. Im et al. [12] do not address this problem.

For both DVAEs and VAEs, there is a limited choice of prior and posterior distributions for which there exists an analytic solution for the KL divergence. Alternatively, adversarial training may be used to learn a model that matches samples to an arbitrarily complicated target distribution, provided that samples may be drawn from both the target and model distributions.

III. RELATED WORK

In this section, we introduce adversarial training and AAEs, on which this paper builds directly.

A. Adversarial Training

Adversarial training, as introduced by Goodfellow et al. [9], involves learning a mapping from a latent sample $v$ to a data sample $w$. However, at a more abstract level, $w$ may be thought of as a sample from any chosen target distribution and $v$ as a sample from any distribution that we wish to map to $w$.

More formally, in adversarial training [9], a model $g_\phi(w|v)$ is trained to produce output samples $w$ that match a target probability distribution $t(w)$. This is achieved by iteratively training two competing models: a generative model $g_\phi(w|v)$ and a discriminative model $d_\chi(w)$. The discriminative model is fed with samples either from the generator (i.e., "fake" samples) or from the target distribution (i.e., "real" samples) and trained to correctly predict whether the samples are "real" or "fake." The generative model, fed with input samples $v$ drawn from a chosen prior distribution $p(v)$, is trained to generate output samples $w$ that are indistinguishable from target $w$ samples in order to "fool" [24] the discriminative model into making incorrect predictions. This may be achieved by the following minimax objective [9]:

$$\min_{g} \max_{d} \; \mathbb{E}_{w \sim t(w)}[\log d_\chi(w)] + \mathbb{E}_{w \sim g_\phi(w|v)}[\log(1 - d_\chi(w))].$$

It has been shown that for an optimal discriminative model, optimizing the generative model is equivalent to minimizing the Jensen–Shannon divergence between the generated and target distributions [9]. In general, it is reasonable to assume that, during training, the discriminative model quickly achieves near optimal performance [9]. This property is useful for learning distributions for which the Jensen–Shannon divergence may not be easily calculated.

The generative model is optimal when the distribution of the generated samples matches the target distribution. Under these conditions, the discriminator is maximally confused and cannot distinguish "real" samples from "fake" ones. As a consequence of this, adversarial training may be used to capture very complicated data distributions and has been shown to be able to synthesize images of handwritten digits and human faces that are almost indistinguishable from real data [24].

B. Adversarial Autoencoders

Makhzani et al. [20] introduce the AAE, where $q_\phi(z|x)$ is both the probabilistic encoding model in an autoencoder framework and the generative model in an adversarial framework.

A new discriminative model $d_\chi(z)$ is introduced. This discriminative model is trained to distinguish between latent samples drawn from $p(z)$ and $q_\phi(z|x)$. The cost function used to train the discriminator $d_\chi(z)$ is

$$L_{dis} = -\frac{1}{N} \sum_{i=0}^{N-1} \log d_\chi(z_i) - \frac{1}{N} \sum_{j=N}^{2N-1} \log(1 - d_\chi(z_j))$$

where $z_{i=0:N-1} \sim p(z)$, $z_{j=N:2N-1} \sim q_\phi(z|x)$, and $N$ is the size of the training batch.

Adversarial training is used to match $q_\phi(z|x)$ to an arbitrarily chosen prior $p(z)$. The cost function for matching $q_\phi(z|x)$ to the prior $p(z)$ is as follows:

$$L_{prior} = \frac{1}{N} \sum_{i=0}^{N-1} \log(1 - d_\chi(z_i)) \tag{2}$$

where $z_{i=0:N-1} \sim q_\phi(z|x)$ and $N$ is the size of a training batch. If both $L_{prior}$ and $L_{dis}$ are optimized, $q_\phi(z|x)$ will be indistinguishable from $p(z)$.

In Makhzani et al.'s [20] AAE, $q_\phi(z|x)$ is specified by a neural network whose input is $x$ and whose output is $z$. This allows $q_\phi(z|x)$ to have arbitrary complexity, unlike the VAE where the structure of $q_\phi(z|x)$ is usually limited to a multivariate Gaussian. In an AAE, the posterior does not have to be analytically defined because an adversary is used to
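
As a numerical illustration of the two cost functions in Section III-B, the following minimal sketch evaluates the discriminator cost and the prior-matching cost of (2) from lists of discriminator outputs. The function names and the plain-Python lists are assumptions made for illustration, not the authors' implementation; the inputs stand for values of $d_\chi(z)$ in the open interval (0, 1).

```python
import math


def discriminator_loss(d_real, d_fake):
    """L_dis: negative mean log-probability of the correct labels.

    d_real holds discriminator outputs for samples drawn from the prior p(z);
    d_fake holds outputs for samples produced by the encoder.
    """
    n = len(d_real)
    real_term = -sum(math.log(p) for p in d_real) / n
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / n
    return real_term + fake_term


def prior_matching_loss(d_fake):
    """L_prior from (2): mean of log(1 - d(z)) over encoder samples.

    Minimizing this pushes the discriminator outputs for encoder samples
    toward 1, that is, toward being classified as "real".
    """
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)


if __name__ == "__main__":
    # A maximally confused discriminator outputs 0.5 for every sample.
    confused = [0.5] * 4
    print(discriminator_loss(confused, confused))  # 2 * log(2)
    print(prior_matching_loss(confused))           # -log(2)
```

For a maximally confused discriminator the two values are $2\log 2$ and $-\log 2$. An encoder that fools the discriminator more often, so that its outputs on encoder samples approach 1, drives the prior-matching cost lower.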
$x_1 \sim p_\theta(x|\tilde{x}_0)$ [5]. In the case where the auxiliary variable is an encoding [30], [31], the sampling process is the same, with $p_\theta(x|\tilde{x})$ encompassing both the encoding and decoding processes.

However, since a DAE [5] is trained to reconstruct corrupted versions of its inputs, the sample $x_1$ is likely to be very similar to $x_0$. Bengio et al. [5] propose a method for iteratively sampling DAEs by defining an MC whose stationary distribution exists under certain conditions and is equivalent, under certain assumptions, to the training data distribution. This approach is generalized and extended by Bengio et al. [4] to introduce a latent distribution with no prior assumptions on $z$.

We now consider the implication for drawing samples from the denoising AAEs introduced in Section IV-A. By using the iDAAE formulation (see Section IV-A1), where $\tilde{q}_\phi(z|x)$ is matched to the prior over $z$, $x$ samples may be drawn from $p_\theta(x|z)$, conditioning on $z \sim p(z)$. However, if we use the DAAE, which matches $q_\phi(z|\tilde{x})$ to a prior, sampling becomes nontrivial.

On the surface, it may appear easy to draw samples from DAAEs (see Section IV-A2) by first sampling the prior $p(z)$ and then sampling $p_\theta(x|z)$. However, the full posterior distribution is given by $\tilde{q}_\phi(z|x) = \int q_\phi(z|\tilde{x}) c(\tilde{x}|x)\,d\tilde{x}$, but only $q_\phi(z|\tilde{x})$ is matched to $p(z)$ during training (see Fig. 2). The implication of this is that when attempting to synthesize novel samples from $p_\theta(x|z)$, drawing samples from the prior $p(z)$ is unlikely to yield samples consistent with $p(x)$. This will become clearer in Section V-B.

B. Proposed Method for Sampling DAAEs

Here, we propose a method for synthesizing novel samples using trained DAAEs. In order to draw samples from $p_\theta(x|z)$, we need to be able to draw samples from $\tilde{q}_\phi(z|x)$.

To ensure that we draw novel data samples, we do not want to draw samples from the training data at any point during sample synthesis. This means that we cannot use data samples from our training data to approximately draw samples from $\tilde{q}_\phi(z|x)$.

Instead, similar to Bengio et al. [5], we formulate an MC, which we show has the necessary properties to converge and which converges to $P(z) = \int \tilde{q}_\phi(z|x) P(x)\,dx$. Unlike Bengio's formulation, our chain is initialized with a random vector of the same dimensions as the latent space, rather than a sample drawn from the training set.

We define an MC by the following sampling process:

$$z^{(0)} \in \mathbb{R}^{a}, \quad x^{(t)} \sim p_\theta(x|z^{(t)}), \quad \tilde{x}^{(t)} \sim c(\tilde{x}|x^{(t)}), \quad z^{(t+1)} \sim q_\phi(z|\tilde{x}^{(t)}), \quad t \geq 0. \tag{3}$$

Notice that our first sample is any real vector of dimension $a$, where $a$ is the dimension of the latent space. This MC has the transition operator

$$T_{\theta,\phi}(z^{(t+1)}|z^{(t)}) = \int\!\!\int q_\phi(z^{(t+1)}|\tilde{x}^{(t)})\, c(\tilde{x}^{(t)}|x^{(t)})\, p_\theta(x^{(t)}|z^{(t)})\,dx\,d\tilde{x}. \tag{4}$$

We will now show that under certain conditions, this transition operator defines an ergodic MC that converges to $P(z) = \int \tilde{q}_\phi(z|x) P(x)\,dx$ in the following steps: 1) we will show that there exists a stationary distribution $P(z)$ for $z^{(0)}$ drawn from a specific choice of initial distribution (see Lemma 1); 2) the MC is homogeneous, because the transition operator is defined by a set of distributions whose parameters are fixed during sampling; 3) we will show that the MC is also ergodic (see Lemma 2); and 4) since the chain is both homogeneous and ergodic, there exists a unique stationary distribution to which the MC will converge [22].

Step 1) shows that one stationary distribution is $P(z)$, which we now know by 2) and 3) to be the unique stationary distribution. So the MC converges to $P(z)$.

In this section only, we use a change of notation, where the training data probability distribution, previously represented as $p(x)$, is represented as $P(x)$; this is to help make distinctions between "natural system" probability distributions and the learned distributions. Furthermore, note that $p(z)$ is the prior, while the distribution required for sampling $P(x|z)$ is $P(z)$ such that

$$P(x) = \int P(x|z) P(z)\,dz \approx \int p_\theta(x|z) P(z)\,dz. \tag{5}$$

$$P(z) = \int \tilde{q}_\phi(z|x) P(x)\,dx = \int\!\!\int q_\phi(z|\tilde{x})\, c(\tilde{x}|x)\,d\tilde{x}\, P(x)\,dx. \tag{6}$$

Lemma 1: $P(z)$ is a stationary distribution for the MC defined by the sampling process in (3).
For proof, see the Appendix.

Lemma 2: The MC defined by the transition operator $T_{\theta,\phi}(z^{(t+1)}|z^{(t)})$ in (4) is ergodic, provided that the corruption process is additive Gaussian noise and that the adversarial pair $q_\phi(z|\tilde{x})$ and $d_\chi(z)$ are optimal within the adversarial framework.
For proof, see the Appendix.

Theorem 1: Under the assumptions that $p_\theta(x|z) = P(x|z)$ and that the adversarial pair $q_\phi(z|\tilde{x})$ and $d_\chi(z)$ are optimal, the transition operator $T_{\theta,\phi}(z^{(t+1)}|z^{(t)})$ defines an MC whose stationary distribution is $P(z) = \int \tilde{q}_\phi(z|x) P(x)\,dx$.
Proof: This follows from Lemmas 1 and 2.

This sampling method uncovers the distribution $P(z)$ on which samples drawn from $p_\theta(x|z)$ must be conditioned in order to sample $p_\theta(x)$. Assuming that $p_\theta(x|z) = P(x|z)$, this allows us to draw samples from $P(x)$.

For completeness, we would like to acknowledge that there are several other methods that use MCs during the training of autoencoders [2], [21] to improve performance. Our approach for synthesizing samples using the DAAE is focused on sampling only from trained models; the MC sampling is not used to update model parameters.

VI. IMPLEMENTATION

The analyses of Sections IV and V were deliberately general; they did not rely on any specific implementation choice to capture the model distributions. In this section, we consider a specific implementation of denoising AAEs and apply them to
the task of learning models for image distributions. We define an encoding model $E_\phi(\tilde{x})$ that maps corrupted data samples to a latent space and a decoding model $R_\theta(z)$ that maps samples from a latent space to an image space. These, respectively, draw samples according to the conditional probabilities $q_\phi(z|\tilde{x})$ and $p_\theta(x|z)$. We also define a corruption process $C(x)$, which draws samples according to $c(\tilde{x}|x)$.

The parameters $\theta$ and $\phi$ of models $R_\theta(z)$ and $E_\phi(\tilde{x})$ are learned under an autoencoder framework; the parameter $\phi$ is also updated under an adversarial framework. The models are trained using large data sets of unlabeled images.

A. Autoencoder

Under the regular (nondenoising) autoencoder framework, $E_\phi(x)$ is the encoder, and $R_\theta(z)$ is the decoder. We used neural networks for both the encoder and the decoder. Rectifying linear units (ReLUs) were used between all intermediate layers to encourage the networks to learn representations that capture multimodal distributions. In the final layer of the decoder network, a sigmoid activation function is used so that the output represents the pixels of an image. The final layer of the encoder network is left as a linear layer so that the distribution of encoded samples is not restricted.

As described in Section IV-A, the autoencoder is trained to maximize the log-likelihood of the reconstructed image given the corrupted image. Although there are several ways in which one may evaluate this log-likelihood, we chose to measure pixelwise binary cross entropy between the reconstructed sample $\hat{x}$ and the original sample before corruption $x$. During training, we aim to learn parameters $\phi$ and $\theta$ that minimize the binary cross entropy between $\hat{x}$ and $x$. The training process is summarized by lines 1–10 in Algorithm 1.

The elements of the vectors output by the encoder may take any real values, and so minimizing reconstruction error is not sufficient to match either $q_\phi(z|\tilde{x})$ or $\tilde{q}_\phi(z|x)$ to the prior $p(z)$. For this, parameter $\phi$ must also be updated under the adversarial framework.

B. Adversarial Training

To perform adversarial training, we define the discriminator $d_\chi(z)$, described in Section III-A, to be a neural network, which we denote $D_\chi(z)$. The output of $D_\chi(z)$ is a "probability" because the final layer of the neural network has a sigmoid activation function, constraining the range of $D_\chi(z)$ to lie in (0, 1). Intermediate layers of the network have ReLU activation functions to encourage the network to capture highly nonlinear relations between $z$ and the labels, {"real", "fake"}.

How adversarial training is applied depends on whether $\tilde{q}_\phi(z|x)$ or $q_\phi(z|\tilde{x})$ is being fit to the prior $p(z)$. Here, $z_{fake}$ refers to the samples drawn from the distribution that we wish to fit to $p(z)$, and $z_{real}$ to samples drawn from the prior $p(z)$. The discriminator $D_\chi(z)$ is trained to predict whether the values of $z$ are "real" or "fake." This may be achieved by learning the parameters $\chi$ that maximize the probability of the correct labels being assigned to $z_{fake}$ and $z_{real}$. This training procedure is shown in Algorithm 1 on lines 14–17.

Drawing samples $z_{real}$ involves sampling the prior distribution $p(z)$, often a Gaussian. Now, we consider how to draw fake samples $z_{fake}$. How these samples are drawn depends on whether $q_\phi(z|\tilde{x})$ (DAAE) or $\tilde{q}_\phi(z|x)$ (iDAAE) is being fit to the prior. Drawing samples $z_{fake}$ is easy if $q_\phi(z|\tilde{x})$ is being matched to the prior, as these are simply obtained by mapping corrupted samples through the encoder: $z_{fake} = E_\phi(\tilde{x})$.

However, if $\tilde{q}_\phi(z|x)$ is being matched to the prior, we must use Monte Carlo sampling to approximate $z_{fake}$ samples (see Section IV-A1). The process for calculating $z_{fake}$ is given by Algorithm 2 in the Appendix and detailed in Section IV-A1.

Finally, in order to match the distribution of $z_{fake}$ samples to the prior $p(z)$, adversarial training is used to update parameters $\phi$ while holding the parameters $\chi$ fixed. Parameter $\phi$ is updated to minimize the likelihood that $D_\chi(\cdot)$ correctly classifies $z_{fake}$ as being "fake." The training procedure is laid out in lines 18–20 of Algorithm 1.

Algorithm 1 shows the steps taken to train an iDAAE. To train a DAAE instead, all lines in Algorithm 1 are the same except line 12, which may be replaced by $z_{fake} = E_\phi(\tilde{x})$.

Algorithm 1: Algorithm for training an iDAAE. This algorithm may be altered for DAAE training by replacing line 12 with $z_{fake} = E_\phi(\tilde{x})$.
 1  # Draw a batch of samples from the training data:
 2  $x = \{x_0, x_1, \ldots, x_{N-1}\} \sim p(x)$
 3  for $k = 1$ to NoEpoch do
 4      $\tilde{x} = C(x)$  # Corrupt all samples
 5      $z = E_\phi(\tilde{x})$  # Encode all corrupted samples
 6      $\hat{x} = R_\theta(z)$  # Reconstruct
 7      # Minimize reconstruction cost
 8      $L_{rec} = -\frac{1}{N} \sum_{i=0}^{N-1} \left( x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \right)$
 9      $\phi \leftarrow \phi - \alpha \nabla_\phi L_{rec}$
10      $\theta \leftarrow \theta - \alpha \nabla_\theta L_{rec}$
11      # Match $\tilde{q}_\phi(z|x)$ to $p(z)$ using adversarial training
12      $z_{fake} = \mathrm{approx\_z}(x)$  # Draw samples for $\tilde{q}_\phi(z|x)$
13      $z_{real} \sim p(z)$  # Draw samples from prior $p(z)$
14      # Train the discriminator:
15      $L_{dis} = -\frac{1}{N} \sum_{i=0}^{N-1} \log D_\chi(z_{real,i})$
16               $- \frac{1}{N} \sum_{i=0}^{N-1} \log(1 - D_\chi(z_{fake,i}))$
17      $\chi \leftarrow \chi - \alpha \nabla_\chi L_{dis}$
18      # Train the encoder to match the prior:
19      $L_{prior} = \frac{1}{N} \sum_{i=0}^{N-1} \log(1 - D_\chi(z_{fake,i}))$
20      $\phi \leftarrow \phi - \alpha \nabla_\phi L_{prior}$
21  end

C. Sampling

Although the training process for matching $\tilde{q}_\phi(z|x)$ to $p(z)$ is less computationally efficient than matching $q_\phi(z|\tilde{x})$ to $p(z)$, it is very easy to draw samples when $\tilde{q}_\phi(z|x)$ is matched to the prior (iDAAE). We simply draw a random $z^{(0)}$ value from $p(z)$ and calculate $x^{(0)} = R_\theta(z^{(0)})$, where $x^{(0)}$ is a new sample. When drawing samples, parameters $\theta$ and $\phi$ are fixed.

If $q_\phi(z|\tilde{x})$ is matched to the prior (DAAE), an iterative sampling process is needed in order to draw new samples
from $p(x)$. This sampling process is described in Section V-B. Implementing this sampling process is trivial. A random sample $z^{(0)}$ is drawn from any distribution; the distribution does not have to be the chosen prior $p(z)$. New samples $z^{(t+1)}$ are obtained by iteratively decoding, corrupting, and encoding $z^{(t)}$, such that $z^{(t+1)}$ is given by $z^{(t+1)} = E_\phi(C(R_\theta(z^{(t)})))$.

In Section VII, we evaluate the performance of denoising AAEs on three image data sets: a synthetic color image data set of tiny images (Sprites) [25], a complex data set of handwritten characters [17], and color faces (CelebA) [19]. Some results on handwritten digits (MNIST) are presented in the Appendix. The denoising and nondenoising AAEs are compared for tasks such as reconstruction, generation, and classification.

VII. EXPERIMENTS AND RESULTS

A. Code Available Online

We make our PyTorch [23] code available at the following link: https://github.com/ToniCreswell/pyTorch_DAAE.

B. Data Sets

We evaluate our denoising AAE on three image data sets of varying complexity. Here, we describe the data sets and their complexity in terms of variation within the data set, the number of training examples, and the size of the images.

1) Data Sets (Omniglot): The Omniglot data set is a handwritten character data set consisting of 1623 categories of characters from 50 different writing systems, with only 20 examples of each character. Each example in the data set is 105 × 105 pixels, taking values {0, 1}. The data set is split such that 19 examples from 964 categories make up the training data set, while one example from each of those 964 categories makes up the testing data set. The 20 examples from each of the remaining 659 categories make up the evaluation data set. This means that experiments may be performed to reconstruct or classify samples from categories not seen during training of the autoencoders.

2) Data Sets (Sprites): The Sprites data set is made up of 672 unique humanlike characters. Each character has seven attributes, including hair, body, armor, trousers, arm, and weapon type, as well as gender. For each character, there are 20 animations consisting of 6–13 frames each. There are between 120 and 260 examples of each character; however, every example is in a different pose. Each sample is 60 × 60 pixels and is in color. The training, validation, and test data sets are split to have frames from 500, 72, and 100 unique characters each, with no two sets having frames containing the same character.

3) Data Sets (CelebA): The CelebA data set consists of 250k images of faces in color. Though a version of the data set with tightly cropped faces exists, we use the uncropped data set. We use 1000 samples for testing and the rest for

of labeled facial attributes, for example, "No Beard," "Blond Hair," and "Wavy Hair." This face data set is more complex than the Toronto Face data set used by Makhzani et al. [20] for training the AAE.

C. Architecture and Training

For each data set, we detail the architecture and training parameters of the networks used to implement each of the denoising AAEs. For each data set, several DAAEs, iDAAEs, and AAEs are trained. In order to compare models trained on the same data sets, the same network architectures, batch size, learning rate, annealing rate, and size of latent code are used for each.

Each set of models was trained using the same optimization algorithm. The trained AAE [20] models act as a benchmark, allowing us to compare our proposed DAAEs and iDAAEs.

1) Architecture and Training (Omniglot): The decoder, encoder, and discriminator networks consisted of 6, 3, and 2 fully connected layers, respectively, each layer having 1000 neurons. We found that deeper networks than those proposed by Makhzani et al. [20] (for the MNIST data set) led to better convergence. The networks are trained for 1000 epochs, using a learning rate of $10^{-5}$, a batch size of 64, and the Adam [13] optimization algorithm. We used a 200-D Gaussian for the prior and additive Gaussian noise with a standard deviation of 0.5 for the corruption process. When training the iDAAE, we use $M = 5$ steps of Monte Carlo integration (see Algorithm 2 in the Appendix).

2) Architecture and Training (Sprites): Both the encoder and the discriminator are two-layer fully connected neural networks with 1000 neurons in each layer. For the decoder, we used a three-layer fully connected network with 1000 neurons in the first layer and 500 in each of the last layers; this configuration allowed us to capture complexity in the data without overfitting. The networks were trained for 5 epochs, using a batch size of 128, a learning rate of $10^{-4}$, and the Adam [13] optimization algorithm. We used an encoding of 200 units, a 200-D Gaussian for the prior, and additive Gaussian noise with a standard deviation of 0.25 for the corruption process. The iDAAE was trained with $M = 5$ steps of Monte Carlo integration.

3) Architecture and Training (CelebA): The encoder and the decoder were constructed with convolutional layers, rather than fully connected layers, since the CelebA data set is more complex than the Toronto Face data set used by Makhzani et al. [20]. The encoder and the decoder consisted of four convolutional layers with a similar structure to that of the deep convolutional generative adversarial network proposed by Radford et al. [24]. We used a three-layer fully connected network for the discriminator. Networks were trained for 100 epochs with a batch size of 64 using RMSprop with a learning
training. Each example has dimensions 64 × 64 and a set rate of 10−4 and a momentum of ρ = 0.1 for training the
discriminator. We found that using smaller momentum values
1 An older version of our code in Theano available at leads to more blurred images, and however, larger momentum
https://github.com/ToniCreswell/DAAE_ with our results presented in values prevented the network from converging and made
iPython notebooks. Since this is a revised version of this paper and Theano
is no longer being supported, our new experiments on the CelebA data sets training unstable. When using Adam instead of RMSprop (on
were performed using PyTorch. the CelebA data set specifically), we found that the values in
the encodings became very large and were not consistent with the prior. The encoding was made up of 200 units, and we used a 200-D Gaussian for the prior. We used additive Gaussian noise for the corruption process. We experimented with different noise levels σ in the range [0.1, 1.0], finding several values in this range to be suitable. For our classification experiments, we fixed σ = 0.25, and for synthesis from the DAAE, to demonstrate the effect of sampling, we used σ = 1.0. For the iDAAE, we experimented with M = 2, 5, 20, 50. We found that M < 5 (when σ = 1.0) was not sufficient to train an iDAAE. By comparing the histograms of encoded data samples to histograms of the prior (see Fig. 2), for an iDAAE trained with a particular M value, we are able to see whether M is sufficiently large or not. We found M = 5 to be sufficiently large for most experiments.

Fig. 3. Omniglot mean log-likelihood Lθ compared on the testing and evaluation data sets. The training and evaluation data sets have samples from different handwritten character classes. All models were trained using a 200-D Gaussian prior. The training and testing data sets have samples from the same handwritten character classes. Error bars denote the standard error.
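The histogram comparison used in the CelebA training description to judge whether M is large enough can be sketched as follows. This is only an illustration: the binning range, the number of bins, and the use of half the L1 distance as a gap measure are assumptions made here, and `histogram` and `histogram_gap` are hypothetical helper names rather than functions from the authors' code.

```python
import random

def histogram(values, lo, hi, bins):
    """Normalized histogram (fraction of values per bin), clipping to [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((min(max(v, lo), hi) - lo) / width)
        counts[min(idx, bins - 1)] += 1  # v == hi falls into the last bin
    return [c / len(values) for c in counts]

def histogram_gap(encoded, prior, lo=-4.0, hi=4.0, bins=16):
    """Half the L1 distance between the two histograms (0 = identical, 1 = disjoint).

    The choice of this distance is an assumption for illustration only.
    """
    h_e = histogram(encoded, lo, hi, bins)
    h_p = histogram(prior, lo, hi, bins)
    return 0.5 * sum(abs(a - b) for a, b in zip(h_e, h_p))

# Toy data standing in for one latent dimension of encoded samples.
rng = random.Random(0)
prior_draws = [rng.gauss(0.0, 1.0) for _ in range(5000)]
well_matched = [rng.gauss(0.0, 1.0) for _ in range(5000)]
too_large = [rng.gauss(0.0, 5.0) for _ in range(5000)]  # encodings with inflated scale

gap_good = histogram_gap(well_matched, prior_draws)
gap_bad = histogram_gap(too_large, prior_draws)
```

A small gap suggests that the encodings are consistent with the prior, while encodings whose values have become very large produce a visibly larger gap.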
D. Sampling DAAEs and iDAAEs
Samples may be synthesized using the decoder of a trained iDAAE or AAE by passing latent samples drawn from the prior through the decoder. On the other hand, if we pass samples from the prior through the decoder of a trained DAAE, the samples are likely to be inconsistent with the training data. To synthesize more consistent samples using the DAAE, we draw an initial z^(0) value from any random distribution (we use a Gaussian distribution for simplicity²) and decode, corrupt, and encode the sample several times for each synthesized sample. This process is equivalent to sampling an MC where one iteration of the MC includes decoding, corrupting, and encoding to get a z^(t) value after t iterations. The sample z^(t) may be used to synthesize a novel sample, which we call x^(t). x^(0) is the sample generated when z^(0) is passed through the decoder.

² Which happens to be equivalent to our choice of prior.

To evaluate the quality of some synthesized samples, we calculated the log-likelihood Lθ of real (hold-out) samples under the model [20]. This is achieved by fitting a Parzen window to a number of synthesized samples. Further details of how the log-likelihood is calculated for each data set are given in Appendix G.

We expect initial samples x^(0) drawn from the DAAE to have a lower (worse) log-likelihood than those drawn from the AAE; however, we expect MC sampling to improve synthesized samples, such that x^(t) for t > 0 should have a larger log-likelihood than the initial samples. It is not clear whether x^(t) for t > 0 drawn using a DAAE will be better than samples drawn from an iDAAE. The purpose of these experiments is to demonstrate the challenges associated with drawing samples from denoising AAEs and show that our proposed methods for sampling a DAAE and training iDAAEs allow us to address these challenges. We also hope to show that iDAAE and DAAE samples are competitive with those drawn from an AAE.

1) Sampling (Omniglot): Here, we explore the Omniglot data set, where we look at the log-likelihood score on both the testing and evaluation data sets. Recall (see Section VII-B1) that the testing data set has samples from the same classes as the training data set and the evaluation data set has samples from different classes.

First, we discuss the results on the evaluation data set. The results, as shown in Fig. 3, are consistent with what is expected of the models. The iDAAE outperformed the AAE, with a higher (better) log-likelihood. The initial samples drawn using the DAAE had smaller (worse) log-likelihood values than samples drawn using the AAE. However, after one iteration of MC sampling, the synthesized samples had larger (better, i.e., moving away from −∞) log-likelihood values than those from the AAE. Additional iterations of MC sampling led to worse results, possibly because synthesized samples tend toward multiple modes of the data generating distribution, appearing to be more like samples from classes represented in the training data.

The Omniglot testing data set consists of one example of every category in the training data set. This means that if multiple iterations of MC sampling cause synthesized samples to tend toward modes in the training data, the likelihood score on the testing data set is likely to increase. The results shown in Fig. 3 confirm this expectation; the log-likelihood for the fifth sample is higher (better) than for the first sample. These apparently conflicting results (in Fig. 3), whether sampling improves or worsens synthesized samples, highlight the challenges involved with evaluating generative models using the log-likelihood, discussed in more depth by Theis et al. [29]. For this reason, we also show qualitative results.

Fig. 4(a) shows a set of initial samples (x^(0)) drawn from a DAAE, and Fig. 4(b) shows samples synthesized after nine iterations (x^(9)) of MC sampling; these samples display good variation, capturing multiple modes of the data generating distribution.

2) Sampling (Sprites): In alignment with expectation, the iDAAE model synthesizes samples with higher (better) log-likelihood, 2122 ± 5, than the AAE, 2085 ± 5. The initial image samples drawn from the DAAE model underperform compared with the AAE model, 2056 ± 5; however, after just one iteration of sampling, the synthesized samples have
976 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 30, NO. 4, APRIL 2019
TABLE I
Reconstruction Shows the Mean Squared Error for Reconstructions of Corrupted Test Data Samples. This Table Serves Two Purposes: 1) to Demonstrate That in Most Cases, the DAAE and iDAAE Are Better Able to Reconstruct Images Compared With the AAE and 2) to Motivate Why We Are Interested in AAEs, as Opposed to Other GAN [9] Related Approaches. We Compare Reconstruction Error on MNIST for the State-of-the-Art GAN Variant, ALICE [18], Designed to Improve Reconstruction Fidelity in GAN-Like Models. The MNIST Data Set and Experiments Are Described in the Appendix

Fig. 7. CelebA reconstruction with an iDAAE. (a) Original. (b) Reconstructions.

TABLE II
Omniglot Classification on All 964 Test Set Classes and on 20 Evaluation Classes
Fig. 10. DAAE Robustness to hyperparameters.

From this section, we may conclude that with the exception of three facial attributes, AAEs and variations of AAEs are able to outperform the VAE and β-VAE on the task of facial attribute classification. This suggests that AAEs and their variants are interesting models to study in the setting of learning linearly separable encodings. We also show that for a specific set of several facial attribute categories, the iDAAE or DAAE performs better than the AAE. This consistency suggests that there are some specific attributes that the denoising variants of the AAE learn better than the nondenoising AAE.

G. Tradeoffs in Performance
The results presented in this section suggest that both the DAAE and the iDAAE outperform the AAE models on most generation and some reconstruction tasks and suggest that it is sometimes beneficial to incorporate denoising into the training of AAEs. However, it is less clear which of the two new models, DAAE or iDAAE, is better for classification. When evaluating which one to use, we must consider both the practicalities of training and, for generative purposes, the practicalities (primarily computational load) of each model.

The integrating steps required for training an iDAAE mean that it may take longer to train than a DAAE (see Fig. 11 in the Appendix). On the other hand, it is possible to perform the integration process in parallel, provided that sufficient computational resources are available. Furthermore, once the model is trained, the time taken to compute encodings for classification is the same for both models. Finally, results suggest that using as few as M = 5 integrating steps during training leads to an improvement in classification score. This means that for some classification tasks, it may be worthwhile to train an iDAAE rather than a DAAE.

For generative tasks, neither the DAAE nor the iDAAE model consistently outperforms the other in terms of log-likelihood of synthesized samples. The choice of model may

VIII. CONCLUSION
We propose two types of DAEs, where a posterior is shaped to match a prior using adversarial training. In the first, we match the posterior conditional on corrupted data samples to the prior; we call this model a DAAE. In the second, we match the posterior, conditional on original data samples, to the prior. We call the second model an iDAAE because the approach involves using Monte Carlo integration during training.

Our first contribution is the extension of AAEs to denoising AAEs (DAAEs and iDAAEs). Our second contribution includes identifying and addressing challenges related to synthesizing data samples using the DAAE models. We propose synthesizing data samples by iteratively sampling a DAAE according to an MC transition operator, defined by the learned encoder and decoder of the DAAE model, and the corruption process used during training.

Finally, we present results on three data sets for three tasks that compare representations of both DAAE and iDAAE to AAE models. The data sets include: handwritten characters (Omniglot [17]), a collection of humanlike sprite characters (Sprites [25]), and a data set of faces (CelebA [19]). The tasks are reconstruction, classification, and sample synthesis.

APPENDIX A
CLASSIFICATION RESULTS
Table III shows the numerical facial attribute classification results, corresponding to Fig. 8.

APPENDIX B
TIME COMPARISON FOR VARIOUS M VALUES
Fig. 11 shows the average times taken to run 100 training iterations of batch size 128 for an iDAAE with different numbers of integration steps M. Note that for M = 1, the iDAAE is equivalent to the DAAE. Models were trained on a Linux machine running Ubuntu 14.0, using an Nvidia Tesla K80 GPU (11.4GB) and CUDA 8.0.61.

APPENDIX C
PROOFS
Lemma 1: P(z) is a stationary distribution for the MC defined by the sampling process in (3).
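A minimal sketch of the sampling chain that the lemma concerns, assuming (as in the description of DAAE sampling in Section VII-D) that one iteration consists of decoding, corrupting, and encoding. The functions `toy_decode` and `toy_encode` are hypothetical deterministic stand-ins introduced only for illustration; in a trained DAAE, the decoder and encoder would be the learned networks, and each step would draw from the corresponding conditional distribution.

```python
import random

def corrupt(x, sigma, rng):
    """Additive Gaussian corruption: adds N(0, sigma^2) noise to every element."""
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def mc_sample(z0, decode, encode, sigma, n_iters, rng):
    """Run the decode -> corrupt -> encode chain starting from z0.

    Returns the synthesized samples [x^(0), x^(1), ..., x^(n_iters)].
    """
    z = list(z0)
    samples = [decode(z)]               # x^(0): z^(0) passed through the decoder
    for _ in range(n_iters):
        x_tilde = corrupt(samples[-1], sigma, rng)
        z = encode(x_tilde)             # z^(t+1)
        samples.append(decode(z))       # x^(t+1)
    return samples

# Hypothetical stand-ins for a trained decoder and encoder (toy linear maps).
def toy_decode(z):
    return [2.0 * zi for zi in z]

def toy_encode(x):
    return [0.5 * xi for xi in x]

rng = random.Random(0)
z0 = [rng.gauss(0.0, 1.0) for _ in range(4)]   # initial z^(0) from a Gaussian
xs = mc_sample(z0, toy_decode, toy_encode, sigma=0.25, n_iters=9, rng=rng)
```

Here `xs[0]` plays the role of x^(0) and `xs[9]` the role of x^(9); with `sigma=0.0` and these toy maps the chain stays fixed, which makes the effect of the corruption step easy to isolate.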
TABLE III
Facial Attribute Classification Results. Comparison of (%) Classification Scores for an AAE, DAAE, and iDAAE Compared With the VAE [14] and β-VAE [10]. A Linear SVM Classifier Is Trained on Encodings to Demonstrate the Linear Separability of Representation Learned by Each Model. The Attribute Classification Values for the VAE and β-VAE Were Obtained From Kumar et al. [16]
TABLE IV
Models Trained on MNIST. Five Models Are Trained on the MNIST Data Set. CORRUPTION Indicates the Standard Deviation of Gaussian Noise Added During the Corruption Process c(x̃|x). PRIOR Indicates the Prior Distribution Imposed on the Latent Space. M Is the Number of Monte Carlo Integration Steps (See Algorithm 2 in the Appendix) Used During Training; This Applies Only to the iDAAE

APPENDIX E
ALGORITHM FOR MONTE CARLO INTEGRATION
See Algorithm 2.

Algorithm 2 Drawing Samples From q̃φ(z|x)
function approx_z({x_0, x_1, ..., x_{N−1}})
    for i = 0 to N − 1 do
        ẑ_{i+1} = [ ]
        for j = 1 to M do
            x̃_{i,j} = C(x_i)
            z_{i+1,j} = E_φ(x̃_{i,j})
        end
        ẑ_{i+1} = (1/M) Σ_{j=1}^{M} z_{i+1,j}
    end
    return ẑ = {ẑ_1, ẑ_2, ..., ẑ_N}
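Algorithm 2 can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation: `toy_corrupt` and `toy_encode` are hypothetical stand-ins for the corruption process C and the trained encoder E_φ, and samples are represented as plain lists of floats.

```python
import random

def approx_z(xs, encode, corrupt, M):
    """Monte Carlo estimate of the encoding of each clean sample in xs.

    For every x, draw M corrupted copies, encode each one, and average
    the M encodings element-wise, as in Algorithm 2.
    """
    z_hat = []
    for x in xs:
        codes = [encode(corrupt(x)) for _ in range(M)]
        dim = len(codes[0])
        z_hat.append([sum(c[d] for c in codes) / M for d in range(dim)])
    return z_hat

# Hypothetical stand-ins: additive Gaussian corruption and a toy linear encoder.
rng = random.Random(0)

def toy_corrupt(x, sigma=0.5):
    return [xi + rng.gauss(0.0, sigma) for xi in x]

def toy_encode(x_tilde):
    return [0.5 * v for v in x_tilde]

batch = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
z_hat = approx_z(batch, toy_encode, toy_corrupt, M=5)
```

With `M=1` the function returns the encoding of a single corrupted copy of each sample, which matches the observation in Appendix B that the iDAAE with M = 1 is equivalent to the DAAE.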
TABLE V
MNIST Reconstruction. RECON. Shows the Mean Squared Error for Reconstructions of Corrupted Test Data Samples

TABLE VI
MNIST Log-Likelihood Lθ. To Calculate the Log-Likelihood Lθ of a Model pθ(x), a Parzen Window Was Fit to 10⁴ Generated Samples and the Mean Log-Likelihood Was Reported for the Testing Data Set. The Bandwidth Used for the Parzen Window Was Determined Using a Validation Set. The Training, Test, and Validation Data Sets Had Different Samples
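The Parzen-window estimate described in the caption of Table VI can be sketched as follows. This is a minimal illustration under stated assumptions: an isotropic Gaussian kernel is assumed, the candidate bandwidths and the toy two-dimensional data are invented for the example, and `parzen_mean_log_likelihood` is a hypothetical helper name, not a function from the authors' code.

```python
import math
import random

def parzen_mean_log_likelihood(test_points, generated, bandwidth):
    """Mean log-likelihood of test_points under an isotropic Gaussian
    Parzen window centered on the generated samples."""
    d = len(generated[0])
    n = len(generated)
    log_norm = -math.log(n) - 0.5 * d * math.log(2.0 * math.pi * bandwidth ** 2)
    total = 0.0
    for x in test_points:
        # Exponent of each kernel: -||x - g||^2 / (2 h^2)
        exps = [
            -sum((xi - gi) ** 2 for xi, gi in zip(x, g)) / (2.0 * bandwidth ** 2)
            for g in generated
        ]
        m = max(exps)  # log-sum-exp trick for numerical stability
        total += log_norm + m + math.log(sum(math.exp(e - m) for e in exps))
    return total / len(test_points)

# Toy data standing in for generated, validation, and test samples.
rng = random.Random(0)

def draw(n):
    return [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(n)]

generated, validation, test = draw(200), draw(50), draw(50)

# Pick the bandwidth on the validation set, then report on the test set.
candidates = [0.1, 0.3, 1.0, 3.0]
best_h = max(candidates, key=lambda h: parzen_mean_log_likelihood(validation, generated, h))
test_ll = parzen_mean_log_likelihood(test, generated, best_h)
```

Selecting the bandwidth on a separate validation set, as the caption describes, keeps the reported test log-likelihood from being tuned on the test data itself.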
TABLE VII
MNIST Classification: SVM Classifiers With RBF Kernels Were Trained on Encoded MNIST Training Data Samples. The Samples Were Encoded Using the Encoder of the Trained AAE, DAAE, or iDAAE Models. Classification Scores Are Given for the MNIST Test Data Set

correctly predicting a label in this interval. For the MNIST data set, the SVM classifier is trained on encoded samples from the MNIST training data set and evaluated on encoded samples from the MNIST testing data set; results are shown in Table VII.

First, we consider the results for DAAE, iDAAE, and AAE models trained with a 10-D Gaussian prior. Classifiers trained on encodings extracted from the encoders of trained DAAE, iDAAE, or AAE models outperformed classifiers trained on PCA of image samples. Classifiers trained on the encodings extracted from the encoders of learned DAAE and iDAAE models outperformed those trained on the encodings extracted from the encoders of the AAE model.

The differences in classification score for each model on the MNIST data set are small; this might be because it is relatively easy to classify MNIST digits with very high accuracy [28].

APPENDIX G
DETAILS FOR CALCULATING LOG-LIKELIHOOD
A. Omniglot
To calculate log-likelihood of samples, a Parzen window was fit to 1 × 10³ synthesized samples, where the bandwidth was determined on the testing data set in a similar way to that in Appendix F-D. The log-likelihood was evaluated on both the evaluation data set and the testing data set. To compute the log-likelihood on the testing data set, a Parzen window was fit to a new set of synthesized samples, different to those used to calculate the bandwidth. The results are shown in Fig. 3.

B. Sprites
To calculate log-likelihood of samples, a Parzen window was fit to 1 × 10³ synthesized samples. The bandwidth was set as previously described in Appendix F-D; we found the optimal bandwidth to be 1.29.

APPENDIX H
iDAAE ROBUSTNESS TO HYPERPARAMETERS
See Fig. 15.

Fig. 15. iDAAE Robustness to hyperparameters.

ACKNOWLEDGMENT
The authors would like to thank K. Arulkumaran for interesting discussions and managing the compute cluster on which many experiments were performed. They would like to thank Dr. L. Hadjilucas for discussions on the SVM results. They would also like to thank N. Pawlowski and M. Rajchl for additional help with cluster access.

REFERENCES
[1] G. Alain and Y. Bengio, “What regularized auto-encoders learn from the data-generating distribution,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 3563–3593, 2014.
[2] P. Bachman and D. Precup, “Variational generative stochastic networks with collaborative shaping,” in Proc. 32nd Int. Conf. Mach. Learn., 2015, pp. 1964–1972.
[3] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach. Learn., vol. 2, no. 1, pp. 1–127, 2009.
[4] Y. Bengio, É. Thibodeau-Laufer, G. Alain, and J. Yosinski, “Deep generative stochastic networks trainable by backprop,” in Proc. 31st Int. Conf. Mach. Learn., vol. 32, 2014, pp. II-226–II-234.
[5] Y. Bengio, L. Yao, G. Alain, and P. Vincent, “Generalized denoising auto-encoders as generative models,” in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 899–907.
[6] J. Donahue, P. Krähenbühl, and T. Darrell. (2016). “Adversarial feature learning.” [Online]. Available: https://arxiv.org/abs/1605.09782
[7] V. Dumoulin et al. (2016). “Adversarially learned inference.” [Online]. Available: https://arxiv.org/abs/1606.00704
[8] H. Edwards and A. Storkey. (2016). “Towards a neural statistician.” [Online]. Available: https://arxiv.org/abs/1606.02185
[9] I. Goodfellow et al., “Generative adversarial nets,” in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[10] I. Higgins et al., “β-VAE: Learning basic visual concepts with a constrained variational framework,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2017.
[11] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
[12] D. J. Im, S. Ahn, R. Memisevic, and Y. Bengio, “Denoising criterion for variational auto-encoding framework,” in Proc. 31st AAAI Conf. Artif. Intell., 2017, pp. 2059–2065.
[13] D. P. Kingma and L. J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2014.
[14] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2014.
[15] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[16] A. Kumar, P. Sattigeri, and A. Balakrishnan. (2017). “Variational inference of disentangled latent concepts from unlabeled observations.” [Online]. Available: https://arxiv.org/abs/1711.00848
[17] B. M. Lake, R. Salakhutdinov, and J. B. Tenenbaum, “Human-level concept learning through probabilistic program induction,” Science, vol. 350, no. 6266, pp. 1332–1338, 2015.
[18] C. Li et al., “Alice: Towards understanding adversarial learning for joint distribution matching,” in Proc. Neural Inf. Process. Syst. (NIPS), 2017, pp. 5495–5503.
[19] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in Proc. Int. Conf. Comput. Vis. (ICCV), 2015, pp. 3730–3738.
[20] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. (2015). “Adversarial autoencoders.” [Online]. Available: https://arxiv.org/abs/1511.05644
[21] A. Nguyen, J. Clune, Y. Bengio, A. Dosovitskiy, and J. Yosinski. (2016). “Plug & play generative networks: Conditional iterative generation of images in latent space.” [Online]. Available: https://arxiv.org/abs/1612.00005
[22] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Sampling (Markov Chain Convergence). Harrisburg, PA, USA: Harrisburg Univ. Sci. Technol., 2005, ch. 1, p. 5.
[23] A. Paszke et al., “Automatic differentiation in PyTorch,” in Proc. Adv. Neural Inf. Process. Syst., 2017.
[24] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in Proc. Int. Conf. Learn. Represent. (ICLR), 2016.
[25] S. E. Reed, Y. Zhang, Y. Zhang, and H. Lee, “Deep visual analogy-making,” in Proc. Adv. Neural Inf. Process. Syst., 2015, pp. 1252–1260.
[26] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. (2016). “Improved techniques for training GANs.” [Online]. Available: https://arxiv.org/abs/1606.03498
[27] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. (2016). “One-shot learning with memory-augmented neural networks.” [Online]. Available: https://arxiv.org/abs/1605.06065
[28] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” in Proc. ICDAR, vol. 3, Aug. 2003, pp. 958–962.
[29] L. Theis, A. van den Oord, and M. Bethge, “A note on the evaluation of generative models,” in Proc. Int. Conf. Learn. Represent., 2015.
[30] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. ACM 25th Int. Conf. Mach. Learn., 2008, pp. 1096–1103.
[31] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” J. Mach. Learn. Res., vol. 11, pp. 3371–3408, Dec. 2010.
[32] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in Proc. Adv. Neural Inf. Process. Syst., 2016, pp. 3630–3638.

Antonia Creswell received the M.Eng. degree in biomedical engineering from Imperial College London, London, U.K., where she is currently pursuing the Ph.D. degree with the Department of Bioengineering.

Anil Anthony Bharath received the B.Eng. degree in electronic and electrical engineering from University College London, London, U.K., in 1988, and the Ph.D. degree in signal processing from Imperial College London, London, in 1993.
He was an Academic Visitor with the Signal Processing Group, University of Cambridge, Cambridge, U.K., in 2006. He is a Co-Founder of Cortexica Vision Systems, London. He is currently a Reader with the Department of Bioengineering, Imperial College London, where he is also a fellow of the Data Science Institute. His current research interests include deep architectures for visual inference.
Dr. Bharath is a fellow of the Institution of Engineering and Technology.