Foundations and Trends® in Machine Learning
An Introduction to
Variational Autoencoders
Suggested Citation: Diederik P. Kingma and Max Welling (2019), “An Introduction to
Variational Autoencoders”, Foundations and Trends® in Machine Learning: Vol. 12, No.
4, pp 307–392. DOI: 10.1561/2200000056.
Diederik P. Kingma
Google
[email protected]
Max Welling
University of Amsterdam
Qualcomm
[email protected]
This article may be used only for the purpose of research, teaching,
and/or private study. Commercial use or systematic downloading
(by robots or other automatic processes) is prohibited without ex-
plicit Publisher approval.
Boston — Delft
Contents
1 Introduction
1.1 Motivation
1.2 Aim
1.3 Probabilistic Models and Variational Inference
1.4 Parameterizing Conditional Distributions with Neural Networks
1.5 Directed Graphical Models and Neural Networks
1.6 Learning in Fully Observed Models with Neural Nets
1.7 Learning and Inference in Deep Latent Variable Models
1.8 Intractabilities
5 Conclusion
Acknowledgements
Appendices
A Appendix
A.1 Notation and definitions
A.2 Alternative methods for learning in DLVMs
A.3 Stochastic Gradient Descent
References
Diederik P. Kingma1 and Max Welling2,3
1 Google; [email protected]
2 University of Amsterdam
3 Qualcomm; [email protected]
ABSTRACT
Variational autoencoders provide a principled framework
for learning deep latent-variable models and corresponding
inference models. In this work, we provide an introduction
to variational autoencoders and some important extensions.
1.1 Motivation
may help us build useful abstractions of the world that can be used
for multiple prediction tasks downstream. This quest for disentangled,
semantically meaningful, statistically independent and causal factors
of variation in data is generally known as unsupervised representation
learning, and the variational autoencoder (VAE) has been extensively
employed for that purpose. Alternatively, one may view this as an
implicit form of regularization: by forcing the representations to be
meaningful for data generation, we bias the inverse of that process, which
maps from input to representation, into a certain mould. The auxiliary
task of predicting the world is used to better understand the world at
an abstract level and thus to better make downstream predictions.
The VAE can be viewed as two coupled, but independently parame-
terized models: the encoder or recognition model, and the decoder or
generative model. These two models support each other. The recogni-
tion model delivers to the generative model an approximation to its
posterior over latent random variables, which it needs to update its
parameters inside an iteration of “expectation maximization” learning.
Conversely, the generative model is a scaffolding of sorts for the
recognition model to learn meaningful representations of the data, including
possibly class-labels. The recognition model is the approximate inverse
of the generative model according to Bayes' rule.
One advantage of the VAE framework, relative to ordinary Variational
Inference (VI), is that the recognition model (also called inference
model) is now a (stochastic) function of the input variables. This is in
contrast to VI, where each data-case has a separate variational
distribution, which is inefficient for large data-sets. The recognition model uses
one set of parameters to model the relation between input and latent
variables and as such is called “amortized inference”. This recognition
model can be arbitrarily complex but is still reasonably fast because
by construction it can be done using a single feedforward pass from
input to latent variables. However, the price we pay is that this sampling
induces sampling noise in the gradients required for learning. Perhaps
the greatest contribution of the VAE framework is the realization that
we can counteract this variance by using what is now known as the
“reparameterization trick”, a simple procedure to reorganize our gradient
computation that reduces variance in the gradients.
1.2 Aim
x ∼ pθ (x) (1.1)
allows scaling to large models and large datasets. We will denote a deep
neural network as a vector function: NeuralNet(.).
At the time of writing, deep learning has been shown to work well for
a large variety of classification and regression problems, as summarized
in (LeCun et al., 2015; Goodfellow et al., 2016). In the case of
neural-network-based image classification (LeCun et al., 1998), for example,
neural networks parameterize a categorical distribution pθ(y|x) over a
class label y, conditioned on an image x.
p = NeuralNet(x) (1.4)
pθ (y|x) = Categorical(y; p) (1.5)
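Equations (1.4) and (1.5) can be illustrated with a minimal sketch. The single-layer `neural_net` below, its weights, and the softmax output layer are assumptions made only for illustration; the text does not fix an architecture.

```python
import math

def neural_net(x, weights, biases):
    """Hypothetical one-layer network: p = softmax(W x + b), as in eq. (1.4)."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + b
              for row, b in zip(weights, biases)]
    top = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_log_prob(y, p):
    """log p_theta(y|x) = log Categorical(y; p), as in eq. (1.5)."""
    return math.log(p[y])

x = [0.5, -1.0]                              # toy "image" with two features
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]    # illustrative weights, 3 classes
b = [0.0, 0.1, -0.2]
p = neural_net(x, W, b)
log_p_class0 = categorical_log_prob(0, p)
```

The vector `p` is a valid probability vector by construction, so `categorical_log_prob` can be summed over a labelled dataset to obtain a log-likelihood objective.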
where Pa(xj) is the set of parent variables of node j in the directed graph.
For non-root-nodes, we condition on the parents. For root nodes, the set
of parents is the empty set, such that the distribution is unconditional.
Traditionally, each conditional probability distribution pθ(xj|Pa(xj))
is parameterized as a lookup table or a linear model (Koller and
Friedman, 2009). As we explained above, a more flexible way to parameterize
such conditional distributions is with neural networks. In this case,
neural networks take as input the parents of a variable in a directed
If all variables in the directed graphical model are observed in the data,
then we can compute and differentiate the log-probability of the data
under the model, leading to relatively straightforward optimization.
1.6.1 Dataset
We often collect a dataset D consisting of N ≥ 1 datapoints:
$$\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\} \equiv \{x^{(i)}\}_{i=1}^{N} \equiv x^{(1:N)} \tag{1.9}$$
The datapoints are assumed to be independent samples from an
unchanging underlying distribution. In other words, the dataset is assumed
to consist of distinct, independent measurements from the same
(unchanging) system. In this case, the observations $\mathcal{D} = \{x^{(i)}\}_{i=1}^{N}$ are said
to be i.i.d., for independently and identically distributed. Under the
i.i.d. assumption, the probability of the datapoints given the
parameters factorizes as a product of individual datapoint probabilities. The
log-probability assigned to the data by the model is therefore given by:
$$\log p_\theta(\mathcal{D}) = \sum_{x \in \mathcal{D}} \log p_\theta(x) \tag{1.10}$$
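As a small numerical illustration of the factorization behind eq. (1.10), the sketch below assumes a univariate Gaussian model (chosen here only for illustration) and checks that the summed log-probabilities equal the log of the product of individual probabilities.

```python
import math

def log_normal(x, mean, std):
    """Log-density of a univariate Gaussian N(x; mean, std^2)."""
    return (-0.5 * math.log(2 * math.pi) - math.log(std)
            - 0.5 * ((x - mean) / std) ** 2)

def dataset_log_prob(data, mean, std):
    """log p_theta(D) as the sum of per-datapoint log-probabilities."""
    return sum(log_normal(x, mean, std) for x in data)

data = [0.2, -0.4, 1.3]                      # toy i.i.d. dataset
total = dataset_log_prob(data, mean=0.0, std=1.0)
# The same quantity computed as the log of a product of densities.
product = math.prod(math.exp(log_normal(x, 0.0, 1.0)) for x in data)
```

Working with the sum of logs rather than the product avoids numerical underflow once the dataset becomes large.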
The ≃ symbol means that one of the two sides is an unbiased estimator
of the other side. So one side (in this case the right-hand side) is a
random variable due to some noise source, and the two sides are equal
when averaged over the noise distribution. The noise source, in this case,
is the randomly drawn minibatch of data M. The unbiased estimator
log pθ (M) is differentiable, yielding the unbiased stochastic gradients:
$$\frac{1}{N_{\mathcal{D}}} \nabla_\theta \log p_\theta(\mathcal{D}) \simeq \frac{1}{N_{\mathcal{M}}} \nabla_\theta \log p_\theta(\mathcal{M}) = \frac{1}{N_{\mathcal{M}}} \sum_{x \in \mathcal{M}} \nabla_\theta \log p_\theta(x) \tag{1.12}$$
These gradients can be plugged into stochastic gradient-based optimizers;
see section A.3 for further discussion. In a nutshell, we can optimize the
objective function by repeatedly taking small steps in the direction of
the stochastic gradient.
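The unbiasedness of the minibatch gradient can be checked exactly on a toy problem. The sketch below assumes a unit-variance Gaussian model with unknown mean θ (an illustrative choice) and averages the minibatch gradient over every equally likely minibatch of size two.

```python
import itertools

def grad_log_prob(x, theta):
    """d/dtheta of log N(x; theta, 1), which equals x - theta."""
    return x - theta

def full_gradient(data, theta):
    """(1 / N_D) * gradient of log p_theta(D)."""
    return sum(grad_log_prob(x, theta) for x in data) / len(data)

def minibatch_gradient(batch, theta):
    """(1 / N_M) * gradient of log p_theta(M)."""
    return sum(grad_log_prob(x, theta) for x in batch) / len(batch)

data = [0.5, 1.5, -0.3, 2.1]
theta = 0.7
# Enumerate all minibatches of size 2; each is drawn with equal probability.
batches = list(itertools.combinations(data, 2))
avg_estimate = sum(minibatch_gradient(b, theta) for b in batches) / len(batches)
```

Each individual minibatch gradient differs from `full_gradient(data, theta)`, but their average over the minibatch distribution matches it, which is what the ≃ relation expresses.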
This is also called the (single datapoint) marginal likelihood or the model
evidence, when taken as a function of θ.
Such an implicit distribution over x can be quite flexible. If z is
discrete and pθ (x|z) is a Gaussian distribution, then pθ (x) is a
mixture-of-Gaussians distribution. For continuous z, pθ (x) can be seen as an
infinite mixture, which is potentially more powerful than a discrete
mixture. Such marginal distributions are also called compound probability
distributions.
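For the discrete case, the marginal is the finite sum pθ(x) = Σz pθ(z) pθ(x|z). The following sketch uses two mixture components with made-up parameters and a crude Riemann sum to confirm that the resulting mixture-of-Gaussians density integrates to one.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(x; mean, std^2)."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def marginal_pdf(x, prior, means, stds):
    """p(x) = sum_z p(z) p(x|z) for a discrete latent variable z."""
    return sum(pz * normal_pdf(x, m, s) for pz, m, s in zip(prior, means, stds))

prior = [0.3, 0.7]     # p(z) over two latent states (illustrative values)
means = [-2.0, 1.0]    # mean of p(x|z) for each state
stds = [0.5, 1.0]      # standard deviation of p(x|z) for each state

# Riemann sum of p(x) over [-10, 10) with step 0.01.
dx = 0.01
area = sum(marginal_pdf(-10 + i * dx, prior, means, stds) * dx for i in range(2000))
```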
where pθ (z) and/or pθ (x|z) are specified. The distribution p(z) is often
called the prior distribution over z, since it is not conditioned on any
observations.
1.8 Intractabilities
Figure 2.1: A VAE learns stochastic mappings between an observed x-space, whose
empirical distribution qD (x) is typically complicated, and a latent z-space, whose
distribution can be relatively simple (such as spherical, as in this figure). The
generative model learns a joint distribution pθ (x, z) that is often (but not always)
factorized as pθ (x, z) = pθ (z)pθ (x|z), with a prior distribution over latent space
pθ (z), and a stochastic decoder pθ (x|z). The stochastic encoder qφ (z|x), also called
inference model, approximates the true but intractable posterior pθ (z|x) of the
generative model.
and zero if, and only if, qφ (z|x) equals the true posterior distribution.
The first term in eq. (2.8) is the variational lower bound, also called
the evidence lower bound (ELBO):
2. The gap between the ELBO Lθ,φ (x) and the marginal likelihood
log pθ (x); this is also called the tightness of the bound. The better
qφ (z|x) approximates the true (posterior) distribution pθ (z|x), in
terms of the KL divergence, the smaller the gap.
2.3 Stochastic Gradient-Based Optimization of the ELBO
[Figure: diagram with labels "Datapoint" and "Objective: ELBO = log p(x,z) − log q(z|x)".]
The individual-datapoint ELBO and its gradient ∇θ,φ Lθ,φ (x) are, in
general, intractable. However, good unbiased estimators ∇̃θ,φ Lθ,φ (x)
exist, as we will show, such that we can still perform minibatch SGD.
where z = g(ε, φ, x), and the expectation and gradient operators become
commutative, and we can form a simple Monte Carlo estimator:
where in the last line, z = g(φ, x, ε) with random noise sample ε ∼ p(ε).
See figure 2.3 for an illustration and further clarification, and figure 3.2
for an illustration of the resulting posteriors for a 2D toy problem.
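A minimal sketch of such a reparameterized Monte Carlo gradient estimator follows. It assumes a Gaussian q with parameters φ = (µ, σ) and the test function f(z) = z², both chosen for illustration because the exact gradients of E[z²] = µ² + σ² are known in closed form (2µ and 2σ).

```python
import random

def reparam_grad_estimates(mu, sigma, num_samples, seed=0):
    """Monte Carlo estimates of d/dmu and d/dsigma of E_{z ~ N(mu, sigma^2)}[z^2]."""
    rng = random.Random(seed)
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)   # noise sample eps ~ p(eps)
        z = mu + sigma * eps        # z = g(phi, eps)
        df_dz = 2.0 * z             # gradient of f(z) = z^2 at the sample
        g_mu += df_dz * 1.0         # chain rule: dz/dmu = 1
        g_sigma += df_dz * eps      # chain rule: dz/dsigma = eps
    return g_mu / num_samples, g_sigma / num_samples

g_mu, g_sigma = reparam_grad_estimates(mu=1.5, sigma=0.5, num_samples=20000)
# Exact values for comparison: 2 * mu = 3.0 and 2 * sigma = 1.0.
```

Because the randomness enters only through ε, the gradient can pass through the deterministic map g, which is what makes the expectation and gradient operators interchangeable.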
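[Figure 2.3: two computation graphs for f. Left: z ∼ qφ(z|x) with inputs φ and x. Right: z = g(φ, x, ε) with inputs φ, x and ε ∼ p(ε), through which ∇z f and ∇φ f are backpropagated.]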
ε ∼ p(ε) (2.27)
z = g(φ, x, ε) (2.28)
L̃θ,φ (x) = log pθ (x, z) − log qφ (z|x) (2.29)
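The three steps in eqs. (2.27) to (2.29) can be sketched on a toy model. The model below (prior p(z) = N(0, 1), likelihood p(x|z) = N(z, 1), Gaussian q with parameters µ and σ) is an assumption made for illustration; it is convenient because the true posterior N(x/2, 1/2) and the marginal N(x; 0, 2) are known.

```python
import math
import random

def log_normal(v, mean, std):
    """Log-density of a univariate Gaussian N(v; mean, std^2)."""
    return (-0.5 * math.log(2 * math.pi) - math.log(std)
            - 0.5 * ((v - mean) / std) ** 2)

def elbo_estimate(x, mu, sigma, rng):
    """Single-sample ELBO estimate for the toy model described above."""
    eps = rng.gauss(0.0, 1.0)                                    # eq. (2.27)
    z = mu + sigma * eps                                         # eq. (2.28)
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)  # log p(x, z)
    log_q = log_normal(z, mu, sigma)                             # log q(z|x)
    return log_joint - log_q                                     # eq. (2.29)

x = 1.2
log_px = log_normal(x, 0.0, math.sqrt(2.0))   # exact log marginal likelihood

# With q equal to the true posterior, every sample returns log p(x) exactly.
exact_q = elbo_estimate(x, mu=x / 2, sigma=math.sqrt(0.5), rng=random.Random(0))

# With a mismatched q (here the prior), the average falls below log p(x).
rng = random.Random(1)
avg_mismatched = sum(elbo_estimate(x, 0.0, 1.0, rng) for _ in range(5000)) / 5000
```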
Unbiasedness
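This gradient is an unbiased estimator of the exact single-datapoint
ELBO gradient; when averaged over noise ε ∼ p(ε), this gradient equals
the single-datapoint ELBO gradient:
$$\mathbb{E}_{p(\epsilon)}\left[\nabla_{\theta,\phi} \tilde{\mathcal{L}}_{\theta,\phi}(x; \epsilon)\right] = \mathbb{E}_{p(\epsilon)}\left[\nabla_{\theta,\phi}(\log p_\theta(x, z) - \log q_\phi(z|x))\right] \tag{2.30}$$
$$= \nabla_{\theta,\phi}\left(\mathbb{E}_{p(\epsilon)}\left[\log p_\theta(x, z) - \log q_\phi(z|x)\right]\right) \tag{2.31}$$
$$= \nabla_{\theta,\phi} \mathcal{L}_{\theta,\phi}(x) \tag{2.32}$$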
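where the second term is the log of the absolute value of the determinant
of the Jacobian matrix (∂z/∂ε):
$$\log d_\phi(x, \epsilon) = \log \left| \det\left( \frac{\partial z}{\partial \epsilon} \right) \right| \tag{2.34}$$
We call this the log-determinant of the transformation from ε to z. We
use the notation log dφ (x, ε) to make explicit that this log-determinant,
similar to g(), is a function of x, ε and φ. The Jacobian matrix contains
all first derivatives of the transformation from ε to z:
$$\frac{\partial z}{\partial \epsilon} = \frac{\partial (z_1, \ldots, z_k)}{\partial (\epsilon_1, \ldots, \epsilon_k)} = \begin{pmatrix} \frac{\partial z_1}{\partial \epsilon_1} & \cdots & \frac{\partial z_1}{\partial \epsilon_k} \\ \vdots & \ddots & \vdots \\ \frac{\partial z_k}{\partial \epsilon_1} & \cdots & \frac{\partial z_k}{\partial \epsilon_k} \end{pmatrix} \tag{2.35}$$
As we will show, we can build very flexible transformations g() for which
log dφ (x, ε) is simple to compute, resulting in highly flexible inference
models qφ (z|x).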
ε ∼ N (0, I) (2.38)
(µ, log σ) = EncoderNeuralNetφ (x) (2.39)
z = µ + σ ⊙ ε (2.40)
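For this factorized Gaussian case, the Jacobian ∂z/∂ε is diagonal with entries σi, so the log-determinant is Σi log σi. The sketch below takes `mu` and `log_sigma` as given (hypothetical stand-ins for the encoder outputs of eq. (2.39)) and compares the change-of-variables density with the Gaussian density evaluated directly.

```python
import math
import random

def sample_factorized_gaussian(mu, log_sigma, rng):
    """Sample z via eqs. (2.38) and (2.40) and return (z, log q(z|x))."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    sigma = [math.exp(ls) for ls in log_sigma]
    z = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    # log q(z|x) = log p(eps) - log|det(dz/deps)|
    #            = sum_i [log N(eps_i; 0, 1) - log sigma_i]
    log_q = sum(-0.5 * e * e - 0.5 * math.log(2 * math.pi) - ls
                for e, ls in zip(eps, log_sigma))
    return z, log_q

def log_q_direct(z, mu, log_sigma):
    """Diagonal-Gaussian log-density evaluated directly at z."""
    return sum(-0.5 * math.log(2 * math.pi) - ls
               - 0.5 * ((zi - m) / math.exp(ls)) ** 2
               for zi, m, ls in zip(z, mu, log_sigma))

mu = [0.3, -1.2, 0.0]            # placeholder encoder outputs
log_sigma = [-0.5, 0.2, 0.1]
z, log_q = sample_factorized_gaussian(mu, log_sigma, random.Random(0))
```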
ε ∼ N (0, I) (2.46)
z = µ + Lε (2.47)
After training a VAE, we can estimate the probability of data under the
model using an importance sampling technique, as originally proposed
by Rezende et al., 2014. The marginal likelihood of a datapoint can be
written as:
log pθ (x) = log Eqφ (z|x) [pθ (x, z)/qφ (z|x)] (2.56)
where each z(l) ∼ qφ (z|x) is a random sample from the inference model.
By making L large, the approximation becomes a better estimate of the
marginal likelihood, and in fact since this is a Monte Carlo estimator,
for L → ∞ this converges to the actual marginal likelihood.
Notice that when setting L = 1, this equals the ELBO estimator
of the VAE. We can also use the estimator of eq. (2.57) as our
objective function; this is the objective used in importance weighted
autoencoders (Burda et al., 2015) (IWAE). In that paper, it was also
shown that the objective has increasing tightness for increasing value
of L. It was later shown by Cremer et al., 2017 that the IWAE
objective can be re-interpreted as an ELBO objective with a particular
inference model. The downside of these approaches for optimizing a
tighter bound is that importance weighted estimates have notoriously
bad scaling properties to high-dimensional latent spaces.
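The importance-sampling estimate of log pθ(x) can be sketched as follows. The toy model (p(z) = N(0, 1), p(x|z) = N(z, 1), so that log pθ(x) = log N(x; 0, 2) is known exactly) and the Gaussian proposal q are assumptions for illustration, and the log-sum-exp step is a standard numerical-stability device rather than something specified in the text.

```python
import math
import random

def log_normal(v, mean, std):
    """Log-density of a univariate Gaussian N(v; mean, std^2)."""
    return (-0.5 * math.log(2 * math.pi) - math.log(std)
            - 0.5 * ((v - mean) / std) ** 2)

def log_marginal_estimate(x, mu, sigma, num_samples, rng):
    """log (1/L) sum_l p(x, z_l) / q(z_l|x), with z_l ~ q(z|x) = N(mu, sigma^2)."""
    log_w = []
    for _ in range(num_samples):
        z = mu + sigma * rng.gauss(0.0, 1.0)
        log_w.append(log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
                     - log_normal(z, mu, sigma))
    top = max(log_w)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(w - top) for w in log_w) / num_samples)

x = 1.2
log_px = log_normal(x, 0.0, math.sqrt(2.0))   # exact value for this toy model

# Mismatched proposal (the prior) with many samples.
many = log_marginal_estimate(x, 0.0, 1.0, 20000, random.Random(0))
# L = 1 with the exact posterior as proposal recovers log p(x) exactly.
single = log_marginal_estimate(x, x / 2, math.sqrt(0.5), 1, random.Random(0))
```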
DKL (qD (x)||pθ (x)) = −EqD (x) [log pθ (x)] + EqD (x) [log qD (x)] (2.61)
= − log pθ (D) + constant (2.62)
2.8 Challenges
Figure 2.4: The maximum likelihood (ML) objective can be viewed as the mini-
mization of DKL (qD,φ (x)||pθ (x)), while the ELBO objective can be viewed as the
minimization of DKL (qD,φ (x, z)||pθ (x, z)), which upper bounds DKL (qD,φ (x)||pθ (x)).
If a perfect fit is not possible, then pθ (x, z) will typically end up with higher variance
than qD,φ (x, z), because of the direction of the KL divergence.
advantageous:
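$$\tilde{\mathcal{L}}_\lambda = \mathbb{E}_{x \sim \mathcal{M}}\left[\mathbb{E}_{q(z|x)}\left[\log p(x|z)\right]\right] \tag{2.69}$$
$$\qquad - \sum_{j=1}^{K} \mathrm{maximum}\left(\lambda, \mathbb{E}_{x \sim \mathcal{M}}\left[D_{KL}(q(z_j|x)\|p(z_j))\right]\right) \tag{2.70}$$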
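The objective in eqs. (2.69) and (2.70) can be sketched as below. The sketch assumes Gaussian q(zj|x) and a standard normal prior p(zj), for which the KL divergence has a closed form; the reconstruction terms are passed in as precomputed numbers, and all function names are illustrative.

```python
import math

def gaussian_kl_to_standard_normal(mu, sigma):
    """KL(N(mu, sigma^2) || N(0, 1)) in closed form."""
    return 0.5 * (mu * mu + sigma * sigma - 1.0) - math.log(sigma)

def free_bits_objective(recon_log_probs, mus, sigmas, lam):
    """recon_log_probs[n] estimates E_q[log p(x_n|z)] for datapoint n;
    mus[n][j] and sigmas[n][j] parameterize q(z_j|x_n)."""
    batch = len(recon_log_probs)
    num_latents = len(mus[0])
    recon = sum(recon_log_probs) / batch                  # eq. (2.69)
    penalty = 0.0
    for j in range(num_latents):
        # Minibatch average of the KL term for latent dimension j.
        kl_j = sum(gaussian_kl_to_standard_normal(mus[n][j], sigmas[n][j])
                   for n in range(batch)) / batch
        penalty += max(lam, kl_j)                         # eq. (2.70)
    return recon - penalty

recon_log_probs = [-1.0, -3.0]
mus = [[0.0, 2.0], [0.0, 2.0]]      # dimension 0 matches the prior
sigmas = [[1.0, 1.0], [1.0, 1.0]]
objective = free_bits_objective(recon_log_probs, mus, sigmas, lam=0.5)
```

In this example dimension 0 has zero KL and is charged the floor λ = 0.5, while dimension 1 has KL 2.0 and is charged in full, giving −2.0 − (0.5 + 2.0) = −4.5.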
where z ∼ qφ (z|x).
This is also known as the likelihood ratio estimator (Glynn, 1990;
Fu, 2006) and the REINFORCE gradient estimator (Williams, 1992).
The method has been successfully used in various methods like neural
variational inference (Mnih and Gregor, 2014), black-box variational
inference (Ranganath et al., 2014), automated variational inference
(Wingate and Weber, 2013), and variational stochastic search (Paisley
et al., 2012), often in combination with various novel control variate
techniques (Glasserman, 2013) for variance reduction. An advantage
of the likelihood ratio estimator is its applicability to discrete latent
variables.
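To make the likelihood ratio estimator concrete, the sketch below applies it to a toy problem: estimating d/dµ of E[z²] for z ∼ N(µ, σ²), whose exact value is 2µ. The test function and parameter values are assumptions for illustration, and a pathwise (reparameterized) estimate is computed from the same noise so that the empirical variances can be compared.

```python
import random

def gradient_estimates(mu, sigma, num_samples, seed=0):
    """Per-sample estimates of d/dmu E_{z ~ N(mu, sigma^2)}[z^2]."""
    rng = random.Random(seed)
    score, pathwise = [], []
    for _ in range(num_samples):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps
        f = z * z
        # Likelihood ratio: f(z) * d/dmu log N(z; mu, sigma^2)
        #                 = f(z) * (z - mu) / sigma^2
        score.append(f * (z - mu) / sigma ** 2)
        # Pathwise: df/dz * dz/dmu = 2 z
        pathwise.append(2.0 * z)
    return score, pathwise

def mean(values):
    return sum(values) / len(values)

def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

score, pathwise = gradient_estimates(mu=1.5, sigma=1.0, num_samples=50000)
# Both sample means approximate the exact gradient 2 * mu = 3.0.
```

The likelihood ratio estimate only needs the scalar f(z) and the score of q, which is why it also applies when z is discrete and f has no useful gradient.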
We do not directly compare to these techniques, since we concern
ourselves with continuous latent variables, in which case we have (compu-
tationally cheap) access to gradient information ∇z log pθ (x, z), courtesy
of the backpropagation algorithm. The score function estimator solely
uses the scalar-valued log pθ (x, z), ignoring the gradient information
about the function log pθ (x, z), generally leading to much higher vari-
ance. This has been experimentally confirmed by e.g. (Kucukelbir et al.,
2017), which finds that a sophisticated score function estimator requires
Here we will review two general techniques for improving the flexibility
of approximate posteriors in the context of gradient-based variational
inference: auxiliary latent variables, and normalizing flows.
From this equation it can be seen that in principle, the ELBO gets
worse by augmenting the VAE with an auxiliary variable u:
DKL (qD,φ (x, z, u)||pθ (x, z, u)) ≥ DKL (qD,φ (x, z)||pθ (x, z))
But because we now have access to a much more flexible class of inference
distributions, qφ (z|x), the original ELBO objective DKL (qD,φ (x, z)||
pθ (x, z)) can improve, potentially outweighing the additional cost of
EqD (x,z) [DKL (qD,φ (u|x, z)||pθ (u|x, z))]. In (Salimans et al., 2015), (Ran-
ganath et al., 2016) and (Maaløe et al., 2016) it was shown that auxiliary
variables can indeed lead to significant improvements in models.
The introduction of auxiliary latent variables in the graph is a
special case of VAEs with multiple layers of latent variables, which are
discussed in chapter 4. In our experiment with CIFAR-10, we make use
of multiple layers of stochastic variables.
ε = (y − µ(y))/σ(y) (3.19)
transformation has a lower triangular Jacobian (∂εi /∂yj = 0 for j > i),
with a simple diagonal: ∂εi /∂yi = 1/σi . The determinant of a lower
triangular matrix equals the product of the diagonal terms. As a result, the
log-determinant of the Jacobian of the transformation is remarkably
simple and straightforward to compute:
$$\log \left| \det \frac{d\epsilon}{dy} \right| = \sum_{i=1}^{D} -\log \sigma_i(y) \tag{3.20}$$
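Eq. (3.20) can be verified numerically on a small example. In the sketch below, `mu_sigma` is a made-up autoregressive function (µi and σi depend only on earlier elements of y), and the analytic log-determinant is compared with the determinant of a finite-difference Jacobian in three dimensions.

```python
import math

def mu_sigma(y):
    """Toy autoregressive functions: mu_i and sigma_i depend only on y[0..i-1]."""
    mus, sigmas = [], []
    for i in range(len(y)):
        prefix = sum(y[:i])
        mus.append(0.5 * prefix)
        sigmas.append(math.exp(0.3 * math.tanh(prefix)))
    return mus, sigmas

def transform(y):
    """eps = (y - mu(y)) / sigma(y), as in eq. (3.19)."""
    mus, sigmas = mu_sigma(y)
    return [(yi - m) / s for yi, m, s in zip(y, mus, sigmas)]

def log_det_analytic(y):
    """Sum over i of -log sigma_i(y), as in eq. (3.20)."""
    _, sigmas = mu_sigma(y)
    return -sum(math.log(s) for s in sigmas)

def jacobian_numeric(y, h=1e-6):
    """Central finite-difference Jacobian; jac[i][j] = d eps_i / d y_j."""
    d = len(y)
    jac = [[0.0] * d for _ in range(d)]
    for j in range(d):
        up, dn = list(y), list(y)
        up[j] += h
        dn[j] -= h
        e_up, e_dn = transform(up), transform(dn)
        for i in range(d):
            jac[i][j] = (e_up[i] - e_dn[i]) / (2 * h)
    return jac

def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

y = [0.4, -1.1, 0.7]
jac = jacobian_numeric(y)
log_det_numeric = math.log(abs(det3(jac)))
```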
ε0 ∼ N (0, I) (3.23)
(µ0 , log σ0 , h) = EncoderNeuralNet(x; θ) (3.24)
z0 = µ0 + σ0 ⊙ ε0 (3.25)
for t ← 1 to T do
    [m, s] ← AutoregressiveNN(z; h, t, θ)
    σ ← (1 + exp(−s))^(−1)
    z ← σ ⊙ z + (1 − σ) ⊙ m
    l ← l − Σi (log σi )
end
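The loop above can be sketched in plain Python. The `autoregressive_nn` function below is a toy stand-in (an assumption, not the network used by the authors), the context vector h is omitted for brevity, and l is initialized from the initial Gaussian step so that the final value agrees with eq. (3.29). A finite-difference Jacobian check for two dimensions is included as a sanity test of the accumulated log-density.

```python
import math

def autoregressive_nn(z, t):
    """Toy stand-in for AutoregressiveNN(z; h, t, theta): outputs for
    dimension i depend only on z[0..i-1], keeping the Jacobian triangular."""
    m, s = [], []
    for i in range(len(z)):
        prefix = sum(z[:i])
        m.append(0.5 * prefix + 0.1 * t)
        s.append(1.0 + math.tanh(prefix))
    return m, s

def iaf_sample(mu0, sigma0, eps, num_steps):
    """Return the final iterate z and l = log q(z|x)."""
    z = [m + s * e for m, s, e in zip(mu0, sigma0, eps)]
    l = -sum(math.log(s) + 0.5 * e * e + 0.5 * math.log(2 * math.pi)
             for s, e in zip(sigma0, eps))
    for t in range(1, num_steps + 1):
        m, s = autoregressive_nn(z, t)
        sigma = [1.0 / (1.0 + math.exp(-si)) for si in s]   # sigmoid gate
        z = [sg * zi + (1.0 - sg) * mi for sg, zi, mi in zip(sigma, z, m)]
        l -= sum(math.log(sg) for sg in sigma)
    return z, l

def log_q_by_finite_differences(mu0, sigma0, eps, num_steps, h=1e-6):
    """2-D check: log N(eps; 0, I) - log|det(dz/deps)| with a numerical Jacobian."""
    cols = []   # cols[j][i] = d z_i / d eps_j
    for j in range(2):
        up, dn = list(eps), list(eps)
        up[j] += h
        dn[j] -= h
        z_up, _ = iaf_sample(mu0, sigma0, up, num_steps)
        z_dn, _ = iaf_sample(mu0, sigma0, dn, num_steps)
        cols.append([(a - b) / (2 * h) for a, b in zip(z_up, z_dn)])
    det = cols[0][0] * cols[1][1] - cols[1][0] * cols[0][1]
    log_p_eps = sum(-0.5 * e * e - 0.5 * math.log(2 * math.pi) for e in eps)
    return log_p_eps - math.log(abs(det))

mu0, sigma0, eps = [0.1, -0.2], [0.9, 1.2], [0.3, -0.8]
z, l = iaf_sample(mu0, sigma0, eps, num_steps=3)
check = log_q_by_finite_differences(mu0, sigma0, eps, num_steps=3)
```

Each step costs one pass through the autoregressive function and one sum of log σ terms, so the density of the final iterate is obtained without ever forming a full Jacobian.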
Figure 3.1: Like other normalizing flows, drawing samples from an approximate
posterior with Inverse Autoregressive Flow (IAF) (Kingma et al., 2016) starts with
a distribution with tractable density, such as a Gaussian with diagonal covariance,
followed by a chain of nonlinear invertible transformations of z, each with a simple
Jacobian determinant. The final iterate has a flexible distribution.
z ≡ εT (3.28)
$$\log q(z|x) = -\sum_{i=1}^{D} \left( \tfrac{1}{2}\epsilon_i^2 + \tfrac{1}{2}\log(2\pi) + \sum_{t=0}^{T} \log \sigma_{t,i} \right) \tag{3.29}$$
Figure 3.2: Best viewed in color. We fitted a variational autoencoder (VAE) with
a spherical Gaussian prior, and with factorized Gaussian posteriors (b) or inverse
autoregressive flow (IAF) posteriors (c) to a toy dataset with four datapoints. Each
colored cluster corresponds to the posterior distribution of one datapoint. IAF greatly
improves the flexibility of the posterior distributions, and allows for a much better
fit between the posteriors and the prior.
4.1 Inference and Learning with Multiple Latent Variables
2. Evaluating the scalar value (log pθ (x, z) − log qφ (z|x)) at the
resulting sample z and datapoint x. This scalar is an unbiased
stochastic estimate of a lower bound on log pθ (x). It is also
differentiable and optimizable with SGD.
[Figure 4.1, panel (a), VAE with bottom-up inference: deep generative model + bottom-up inference model = VAE with bottom-up inference. Panel (b), VAE with top-down inference: deep generative model + top-down inference model = VAE with top-down inference.]
Figure 4.1: Illustration, taken from Kingma et al., 2016, of two choices of
directionality of the inference model. Sharing directionality of inference, as in
(b), has the benefit that it allows for straightforward sharing of parameters
between the generative model and the inference model.
To see why this might be a good idea, note that the true posterior
over the latent variables is a function of the prior:
pθ (z|x) ∝ pθ (z)pθ (x|z) (4.6)
Likewise, the posterior of a latent variable given its parents (in the
generative model) is:
pθ (zi |x, Pa(zi )) ∝ pθ (zi |Pa(zi ))pθ (x|zi , Pa(zi )) (4.7)
Optimization of the generative model changes both pθ (zi |Pa(zi )) and
pθ (x|zi , Pa(zi )). By coupling the inference model qφ (zi |x, Pa(zi )) and
prior pθ (zi |Pa(zi )), changes in pθ (zi |Pa(zi )) can be directly reflected
in changes in qφ (zi |Pa(zi )).
This coupling is especially straightforward when pθ (zi |Pa(zi )) is
Gaussian distributed. The inference model can be directly specified
as the product of this Gaussian distribution, with a learned quadratic
pseudo-likelihood term:
qφ (zi |Pa(zi ), x) = pθ (zi |Pa(zi )) l̃(zi ; x, Pa(zi ))/Z,
where Z is tractable to compute. This idea is explored by (Salimans,
2016) and (Sønderby et al., 2016a). In principle this idea could be
extended to a more general class of conjugate priors, but no work on
this is known at the time of writing.
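In the univariate case, multiplying a Gaussian prior by a quadratic (Gaussian-shaped) pseudo-likelihood again gives a Gaussian whose precision is the sum of the two precisions. The sketch below uses scalar, made-up parameter values and checks that the ratio between the unnormalized product and the resulting density is the same constant Z at different points.

```python
import math

def normal_pdf(z, mean, std):
    """Density of a univariate Gaussian N(z; mean, std^2)."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def combine_gaussian(prior_mean, prior_std, lik_mean, lik_std):
    """Mean and std of q(z) proportional to
    N(z; prior_mean, prior_std^2) * exp(-(z - lik_mean)^2 / (2 * lik_std^2))."""
    prior_prec = 1.0 / prior_std ** 2
    lik_prec = 1.0 / lik_std ** 2
    post_prec = prior_prec + lik_prec          # precisions add
    post_mean = (prior_prec * prior_mean + lik_prec * lik_mean) / post_prec
    return post_mean, math.sqrt(1.0 / post_prec)

prior_mean, prior_std = 0.5, 2.0   # illustrative prior p(z_i | Pa(z_i))
lik_mean, lik_std = -1.0, 0.8      # illustrative pseudo-likelihood parameters
q_mean, q_std = combine_gaussian(prior_mean, prior_std, lik_mean, lik_std)

def normalizer_at(z):
    """Unnormalized product divided by q(z); equals Z for every z."""
    unnormalized = (normal_pdf(z, prior_mean, prior_std)
                    * math.exp(-0.5 * ((z - lik_mean) / lik_std) ** 2))
    return unnormalized / normal_pdf(z, q_mean, q_std)
```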
A less constraining variant, explored by (Kingma et al., 2016), is
to simply let the neural network that parameterizes qφ (zi |Pa(zi ), x) be
partially specified by a part of the neural network that parameterizes
pθ (zi |Pa(zi )). In general, we can let the two distributions share
parameters. This allows for more complicated posteriors, like normalizing flows
or IAF.
Chemical Design
One example of a recent scientific application of artificial creativity
is shown in Gómez-Bombarelli et al., 2018. In this paper, a fairly
straightforward VAE is trained on hundreds of thousands of existing
chemical structures. The resulting continuous representation (latent
Astronomy
In (Ravanbakhsh et al., 2017), VAEs are applied to simulate observations
of distant galaxies. This helps with the calibration of systems that need
to indirectly detect the shearing of observations of distant galaxies,
caused by weak gravitational lensing in the presence of dark matter
between earth and those galaxies. Since the lensing effects are so weak,
such systems need to be calibrated with ground-truth images with a
known amount of shearing. Since real data is still limited, the proposed
solution is to use deep generative models for synthesis of pseudo-data.
Image (Re-)Synthesis
A popular application is image (re)synthesis. One can optimize a VAE to
form a generative model over images. One can synthesize images from the
generative model, but the inference model (or encoder) also allows one
to encode real images into a latent space. One can modify the encoding
in this latent space, then decode the image back into the observed
space. Relatively simple transformations in the latent space, such
as linear transformations, often translate into semantically meaningful
modifications of the original image. One example, as demonstrated by
White, 2016, is the modification of images in latent space along a "smile
vector" in order to make them look happier or sadder. See
figure 4.4 for an example.
Figure 4.4: VAEs can be used for image resynthesis. In this example by White,
2016, an original image (left) is modified in a latent space in the direction of a smile
vector, producing a range of versions of the original, from smiling to sadness.
but has not been discussed in-depth here since (as is often the case with
importance-weighted estimators) it can be difficult to scale to
high-dimensional latent spaces. Other objectives have been proposed such as
Rényi divergence variational inference (Li and Turner, 2016), Generative
Moment Matching Networks (Li et al., 2015), objectives based on
normalizing flows such as NICE and RealNVP (Sohl-Dickstein et al., 2015;
Dinh et al., 2014), black-box α-divergence minimization
(Hernández-Lobato et al., 2016) and Bi-directional Helmholtz Machines (Bornschein
et al., 2016).
Various combinations with adversarial objectives have been pro-
posed. In (Makhzani et al., 2015), the "adversarial autoencoder" (AAE)
was proposed, a probabilistic autoencoder that uses a generative adver-
sarial network (GAN) (Goodfellow et al., 2014) to perform variational
inference. In (Dumoulin et al., 2017) Adversarially Learned Inference
(ALI) was proposed, which aims to minimize a GAN objective between
the joint distributions qφ (x, z) and pθ (x, z). Other hybrids have been
proposed as well (Larsen et al., 2016; Brock et al., 2017; Hsu et al.,
2017).
One of the most prominent, and most difficult, applications of gener-
ative models is image modeling. In (Kulkarni et al., 2015) (Deep convo-
lutional inverse graphics network), a convolutional VAE was applied to
modeling images with some success, building on work by (Dosovitskiy
et al., 2015) proposing convolutional networks for image synthesis. In
(Gregor et al., 2015) (DRAW), an attention mechanism was combined
with a recurrent inference model and recurrent generative model for
image synthesis. This approach was further extended in (Gregor et al.,
2016) (Towards Conceptual Compression) with convolutional networks,
scalable to larger images, and applied to image compression. In (Kingma
et al., 2016), deep convolutional inference models and generative models
were also applied to images. Furthermore, (Gulrajani et al., 2017)
(PixelVAE) and (Chen et al., 2017) (Variational Lossy Autoencoder)
combined convolutional VAEs with the PixelCNN model (Van Oord
et al., 2016; Van den Oord et al., 2016). Methods and VAE architectures
for controlled image generation from attributes or text were studied
in (Kingma et al., 2014; Yan et al., 2016; Mansimov et al., 2015; Brock
et al., 2017; White, 2016). Predicting the color of pixels based on a
Acknowledgements
We are grateful for the help of Tim Salimans, Alec Radford, Rif A.
Saurous and others who have given us valuable feedback at various
stages of writing.
Appendices
A Appendix
A.1.1 Notation
Example(s) Description
A.1.2 Definitions
Term Description
A.1.3 Distributions
We overload the notation of distributions (e.g. p(x) = N (x; µ, Σ)) with
two meanings: (1) a distribution from which we can sample, and (2)
the probability density function (PDF) of that distribution.
Term Description
Bayes’ Rule
The prior p(θ) in equation (A.5) has diminishing effect for increasingly
large N . For this reason, in case of optimization with large datasets,
we often choose to simply use the maximum likelihood criterion by
omitting the prior from the objective, which is numerically equivalent
to setting p(θ) = constant.
until convergence. Why does this work? Note that at the E-step:
A.2.3 MCMC-EM
Another Bayesian approach towards optimizing the likelihood pθ (x)
with DLVMs is Expectation Maximization (EM) with Markov Chain
Monte Carlo (MCMC). In case of MCMC, the posterior is approximated
by a mixture of a set of approximately i.i.d. samples from the posterior,
acquired by running a Markov chain. Note that posterior gradients
in DLVMs are relatively affordable to compute by differentiating the
log-joint distribution w.r.t. z:
∇z log pθ (z|x) = ∇z log[pθ (x, z)/pθ (x)] (A.11)
= ∇z [log pθ (x, z) − log pθ (x)] (A.12)
= ∇z log pθ (x, z) − ∇z log pθ (x) (A.13)
= ∇z log pθ (x, z) (A.14)
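The identity in eqs. (A.11) to (A.14) can be checked numerically. The sketch assumes a toy model with p(z) = N(0, 1) and p(x|z) = N(z, 1), for which the true posterior is N(x/2, 1/2), and compares a finite-difference gradient of the log-joint with the analytic gradient of the log-posterior.

```python
def log_joint(x, z):
    """log p(x, z) up to an additive constant, for p(z) = N(0, 1), p(x|z) = N(z, 1)."""
    return -0.5 * z * z - 0.5 * (x - z) ** 2

def grad_log_joint(x, z, h=1e-5):
    """Central finite-difference approximation of d/dz log p(x, z)."""
    return (log_joint(x, z + h) - log_joint(x, z - h)) / (2 * h)

def grad_log_posterior(x, z):
    """Analytic d/dz log p(z|x), using p(z|x) = N(x/2, 1/2) for this toy model."""
    return -(z - 0.5 * x) / 0.5

x, z = 1.2, 0.4
from_joint = grad_log_joint(x, z)
from_posterior = grad_log_posterior(x, z)
```

The normalizer log pθ(x) never has to be evaluated, since it does not depend on z and drops out of the gradient.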
One version of MCMC which uses such posterior gradients for relatively fast
convergence is Hamiltonian MCMC (Neal, 2011). A disadvantage of
this approach is the requirement for running an independent MCMC
chain per datapoint.
References
Deshpande, A., J. Lu, M.-C. Yeh, M. Jin Chong, and D. Forsyth. 2017.
“Learning diverse image colorization”. In: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition. 6837–
6845.
Dinh, L., D. Krueger, and Y. Bengio. 2014. “NICE: Non-linear indepen-
dent components estimation”. arXiv preprint arXiv:1410.8516.
Dinh, L., J. Sohl-Dickstein, and S. Bengio. 2016. “Density estimation
using Real NVP”. arXiv preprint arXiv:1605.08803.
Dosovitskiy, A., J. Tobias Springenberg, and T. Brox. 2015. “Learning
to generate chairs with convolutional neural networks”. In: Pro-
ceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. 1538–1546.
Dumoulin, V., I. Belghazi, B. Poole, A. Lamb, M. Arjovsky, O. Mas-
tropietro, and A. Courville. 2017. “Adversarially learned inference”.
International Conference on Learning Representations.
Edwards, H. and A. Storkey. 2017. “Towards a neural statistician”.
International Conference on Learning Representations.
Eslami, S. A., N. Heess, T. Weber, Y. Tassa, D. Szepesvari, G. E. Hinton,
et al. 2016. “Attend, infer, repeat: Fast scene understanding with
generative models”. In: Advances In Neural Information Processing
Systems. 3225–3233.
Fan, K., Z. Wang, J. Beck, J. Kwok, and K. A. Heller. 2015. “Fast
second order stochastic backpropagation for variational inference”.
In: Advances in Neural Information Processing Systems. 1387–1395.
Fortunato, M., C. Blundell, and O. Vinyals. 2017. “Bayesian recurrent
neural networks”. arXiv preprint arXiv:1704.02798.
Fraccaro, M., S. K. Sønderby, U. Paquet, and O. Winther. 2016. “Sequen-
tial neural models with stochastic layers”. In: Advances in Neural
Information Processing Systems. 2199–2207.
Fu, M. C. 2006. “Gradient estimation”. Handbooks in Operations Re-
search and Management Science. 13: 575–616.
Gal, Y. and Z. Ghahramani. 2016. “A theoretically grounded application
of dropout in recurrent neural networks”. In: Advances in neural
information processing systems. 1019–1027.
Pu, Y., Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin. 2016.
“Variational autoencoder for deep learning of images, labels and
captions”. In: Advances in Neural Information Processing Systems.
2352–2360.
Ranganath, R., S. Gerrish, and D. Blei. 2014. “Black Box Variational
Inference”. In: International Conference on Artificial Intelligence
and Statistics. 814–822.
Ranganath, R., D. Tran, and D. Blei. 2016. “Hierarchical variational
models”. In: International Conference on Machine Learning. 324–
333.
Ravanbakhsh, S., F. Lanusse, R. Mandelbaum, J. Schneider, and B.
Poczos. 2017. “Enabling dark energy science with deep generative
models of galaxy images”. In: AAAI Conference on Artificial Intel-
ligence.
Rezende, D. J., S. Mohamed, I. Danihelka, K. Gregor, and D. Wierstra.
2016a. “One-shot generalization in deep generative models”. In:
International Conference on Machine Learning. 1521–1529.
Rezende, D. J., S. Mohamed, and D. Wierstra. 2014. “Stochastic back-
propagation and approximate inference in deep generative models”.
In: International Conference on Machine Learning. 1278–1286.
Rezende, D. J., S. A. Eslami, S. Mohamed, P. Battaglia, M. Jaderberg,
and N. Heess. 2016b. “Unsupervised learning of 3d structure from
images”. In: Advances In Neural Information Processing Systems.
4997–5005.
Rezende, D. and S. Mohamed. 2015. “Variational inference with nor-
malizing flows”. In: International Conference on Machine Learning.
1530–1538.
Roeder, G., Y. Wu, and D. K. Duvenaud. 2017. “Sticking the landing:
Simple, lower-variance gradient estimators for variational inference”.
In: Advances in Neural Information Processing Systems. 6925–6934.
Rosca, M., B. Lakshminarayanan, and S. Mohamed. 2018. “Distribution
matching in variational inference”. arXiv preprint arXiv:1802.06847.
Roweis, S. 1998. “EM algorithms for PCA and SPCA”. Advances in
Neural Information Processing Systems: 626–632.