Adversarial Variational Bayes
arXiv:1701.04722v3 [cs.LG] 2 Aug 2017

1 Autonomous Vision Group, MPI Tübingen 2 Microsoft Research Cambridge 3 Computer Vision and Geometry Group, ETH Zürich. Correspondence to: Lars Mescheder <[email protected]>.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

Variational Autoencoders (VAEs) are expressive latent variable models that can be used to learn complex probability distributions from training data. However, the quality of the resulting model crucially relies on the expressiveness of the inference model. We introduce Adversarial Variational Bayes (AVB), a technique for training Variational Autoencoders with arbitrarily expressive inference models. We achieve this by introducing an auxiliary discriminative network that allows us to rephrase the maximum-likelihood problem as a two-player game, hence establishing a principled connection between VAEs and Generative Adversarial Networks (GANs). We show that in the nonparametric limit our method yields an exact maximum-likelihood assignment for the parameters of the generative model, as well as the exact posterior distribution over the latent variables given an observation. Contrary to competing approaches which combine VAEs with GANs, our approach has a clear theoretical justification, retains most advantages of standard Variational Autoencoders and is easy to implement.

Figure 1. We propose a method which enables neural samplers with intractable density for Variational Bayes and as inference models for learning latent variable models. This toy example demonstrates our method's ability to accurately approximate complex posterior distributions like the one shown on the right.

1. Introduction

Generative models in machine learning are models that can be trained on an unlabeled dataset and are capable of generating new data points after training is completed. As generating new content requires a good understanding of the training data at hand, such models are often regarded as a key ingredient to unsupervised learning.

In recent years, generative models have become more and more powerful. While many model classes such as PixelRNNs (van den Oord et al., 2016b), PixelCNNs (van den Oord et al., 2016a), real NVP (Dinh et al., 2016) and Plug & Play generative networks (Nguyen et al., 2016) have been introduced and studied, the two most prominent ones are Variational Autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014) and Generative Adversarial Networks (GANs) (Goodfellow et al., 2014).

Both VAEs and GANs come with their own advantages and disadvantages: while GANs generally yield visually sharper results when applied to learning a representation of natural images, VAEs are attractive because they naturally yield both a generative model and an inference model. Moreover, it has been reported that VAEs often lead to better log-likelihoods (Wu et al., 2016). The recently introduced BiGANs (Donahue et al., 2016; Dumoulin et al., 2016) add an inference model to GANs. However, it was observed that the reconstruction results often only vaguely resemble the input and often do so only semantically and not in terms of pixel values.

The failure of VAEs to generate sharp images is often attributed to the fact that the inference models used during training are usually not expressive enough to capture the true posterior distribution. Indeed, recent work shows that using more expressive model classes can lead to substantially better results (Kingma et al., 2016), both visually and in terms of log-likelihood bounds. Recent work (Chen et al., 2016) also suggests that highly expressive inference models are essential in the presence of a strong decoder to allow the model to make use of the latent space at all.
[...] In Adversarial Autoencoders (AAEs) (Makhzani et al., 2015) the Kullback-Leibler regularization term that appears in the training objective for VAEs is replaced with an adversarial loss that encourages the aggregated posterior to be close to the prior over the latent variables. Even though AAEs do not maximize a lower bound to the maximum-likelihood objective, we show in Section 6.2 that AAEs can be interpreted as an approximation to our approach, thereby establishing a connection of AAEs to maximum-likelihood learning.

Outside the context of generative models, AVB yields a new method for performing Variational Bayes (VB) with neural samplers. This is illustrated in Figure 1, where we used AVB to train a neural network to sample from a nontrivial unnormalized probability density. This allows us to accurately approximate the posterior distribution of a probabilistic model, e.g. for Bayesian parameter estimation. The only other variational methods we are aware of that can deal with such expressive inference models are based on Stein Discrepancy (Ranganath et al., 2016; Liu & Feng, 2016). However, those methods usually do not directly target the reverse Kullback-Leibler-Divergence and can therefore not be used to approximate the variational lower bound for learning a latent variable model. Our contributions are as follows:

• We enable the usage of arbitrarily complex inference models for Variational Autoencoders using adversarial training.
• We give theoretical insights into our method, showing that in the nonparametric limit our method recovers the true posterior distribution as well as a true maximum-likelihood assignment for the parameters of the generative model.
• We empirically demonstrate that our model is able to learn rich posterior distributions and show that the model is able to generate compelling samples for complex data sets.

1 Concurrently to our work, several researchers have described similar ideas. Some ideas of this paper were described independently by Huszár in a blog post on http://www.inference.vc and in Huszár (2017). The idea to use adversarial training to improve the encoder network was also suggested by Goodfellow in an exploratory talk he gave at NIPS 2016 and by Li & Liu (2016). A similar idea was also mentioned by Karaletsos (2016) in the context of message passing in graphical models.

(a) Standard VAE (b) Our model
Figure 2. Schematic comparison of a standard VAE and a VAE with black-box inference model, where $\epsilon_1$ and $\epsilon_2$ denote samples from some noise distribution. While more complicated inference models for Variational Autoencoders are possible, they are usually not as flexible as our black-box approach.

2. Background

As our model is an extension of Variational Autoencoders (VAEs) (Kingma & Welling, 2013; Rezende et al., 2014), we start with a brief review of VAEs.

VAEs are specified by a parametric generative model $p_\theta(x \mid z)$ of the visible variables given the latent variables, a prior $p(z)$ over the latent variables and an approximate inference model $q_\phi(z \mid x)$ over the latent variables given the visible variables. It can be shown that

$$\log p_\theta(x) \ge -\mathrm{KL}(q_\phi(z \mid x), p(z)) + \mathrm{E}_{q_\phi(z \mid x)} \log p_\theta(x \mid z). \tag{2.1}$$

The right hand side of (2.1) is called the variational lower bound or evidence lower bound (ELBO). If there is $\phi$ such that $q_\phi(z \mid x) = p_\theta(z \mid x)$, we would have

$$\log p_\theta(x) = \max_\phi \; -\mathrm{KL}(q_\phi(z \mid x), p(z)) + \mathrm{E}_{q_\phi(z \mid x)} \log p_\theta(x \mid z). \tag{2.2}$$

However, in general this is not true, so that we only have an inequality in (2.2).

When performing maximum-likelihood training, our goal is to optimize the marginal log-likelihood

$$\mathrm{E}_{p_D(x)} \log p_\theta(x), \tag{2.3}$$
where $p_D$ is the data distribution. Unfortunately, computing $\log p_\theta(x)$ requires marginalizing out $z$ in $p_\theta(x, z)$ which is usually intractable. Variational Bayes uses inequality (2.1) to rephrase the intractable problem of optimizing (2.3) into

$$\max_\theta \max_\phi \; \mathrm{E}_{p_D(x)} \Big[ -\mathrm{KL}(q_\phi(z \mid x), p(z)) + \mathrm{E}_{q_\phi(z \mid x)} \log p_\theta(x \mid z) \Big]. \tag{2.4}$$

Due to inequality (2.1), we still optimize a lower bound to the true maximum-likelihood objective (2.3).

Naturally, the quality of this lower bound depends on the expressiveness of the inference model $q_\phi(z \mid x)$. Usually, $q_\phi(z \mid x)$ is taken to be a Gaussian distribution with diagonal covariance matrix whose mean and variance vectors are parameterized by neural networks with $x$ as input (Kingma & Welling, 2013; Rezende et al., 2014). While this model is very flexible in its dependence on $x$, its dependence on $z$ is very restrictive, potentially limiting the quality of the resulting generative model. Indeed, it was observed that applying standard Variational Autoencoders to natural images often results in blurry images (Larsen et al., 2015).

3. Method

In this work we show how we can instead use a black-box inference model $q_\phi(z \mid x)$ and use adversarial training to obtain an approximate maximum likelihood assignment $\theta^*$ to $\theta$ and a close approximation $q_{\phi^*}(z \mid x)$ to the true posterior $p_{\theta^*}(z \mid x)$. This is visualized in Figure 2: on the left hand side the structure of a typical VAE is shown. The right hand side shows our flexible black-box inference model. In contrast to a VAE with Gaussian inference model, we include the noise $\epsilon_1$ as additional input to the inference model instead of adding it at the very end, thereby allowing the inference network to learn complex probability distributions.

3.1. Derivation

Writing out the Kullback-Leibler divergence, the objective in (2.4) reads

$$\max_\theta \max_\phi \; \mathrm{E}_{p_D(x)} \mathrm{E}_{q_\phi(z \mid x)} \big( \log p(z) - \log q_\phi(z \mid x) + \log p_\theta(x \mid z) \big). \tag{3.1}$$

For a black-box inference model the density $q_\phi(z \mid x)$ cannot be evaluated, so (3.1) cannot be optimized directly.

The idea of our approach is to circumvent this problem by implicitly representing the term

$$\log p(z) - \log q_\phi(z \mid x) \tag{3.2}$$

as the optimal value of an additional real-valued discriminative network $T(x, z)$ that we introduce to the problem.

More specifically, consider the following objective for the discriminator $T(x, z)$ for a given $q_\phi(z \mid x)$:

$$\max_T \; \mathrm{E}_{p_D(x)} \mathrm{E}_{q_\phi(z \mid x)} \log \sigma(T(x, z)) + \mathrm{E}_{p_D(x)} \mathrm{E}_{p(z)} \log \big(1 - \sigma(T(x, z))\big). \tag{3.3}$$

Here, $\sigma(t) := (1 + e^{-t})^{-1}$ denotes the sigmoid function. Intuitively, $T(x, z)$ tries to distinguish pairs $(x, z)$ that were sampled independently using the distribution $p_D(x) p(z)$ from those that were sampled using the current inference model, i.e., using $p_D(x) q_\phi(z \mid x)$.

To simplify the theoretical analysis, we assume that the model $T(x, z)$ is flexible enough to represent any function of the two variables $x$ and $z$. This assumption is often referred to as the nonparametric limit (Goodfellow et al., 2014) and is justified by the fact that deep neural networks are universal function approximators (Hornik et al., 1989).

As it turns out, the optimal discriminator $T^*(x, z)$ according to the objective in (3.3) is given by the negative of (3.2).

Proposition 1. For $p_\theta(x \mid z)$ and $q_\phi(z \mid x)$ fixed, the optimal discriminator $T^*$ according to the objective in (3.3) is given by

$$T^*(x, z) = \log q_\phi(z \mid x) - \log p(z). \tag{3.4}$$

Proof. The proof is analogous to the proof of Proposition 1 in Goodfellow et al. (2014). See the Supplementary Material for details.

Together with (3.1), Proposition 1 allows us to write the optimization objective in (2.4) as

$$\max_{\theta, \phi} \; \mathrm{E}_{p_D(x)} \mathrm{E}_{q_\phi(z \mid x)} \big( -T^*(x, z) + \log p_\theta(x \mid z) \big). \tag{3.5}$$
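Proposition 1 and the rewriting of (2.4) in terms of $T^*$ can be checked numerically on a small discrete model. The following is a minimal sketch, not the authors' code: the table sizes, the random probability tables and all variable names are assumptions made for illustration. It finds the maximizer of the pointwise discriminator objective by grid search, compares it with the closed form (3.4), and confirms that the per-datapoint ELBO equals the same quantity written with $-T^*$ and never exceeds $\log p_\theta(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)


def random_dist(shape):
    """Strictly positive probability table, normalized over the last axis."""
    w = rng.uniform(0.1, 1.0, size=shape)
    return w / w.sum(axis=-1, keepdims=True)


n_x, n_z = 3, 4                        # hypothetical toy sizes
p_z = random_dist(n_z)                 # prior p(z)
p_x_given_z = random_dist((n_z, n_x))  # p_theta(x | z), indexed [z, x]
q_z_given_x = random_dist((n_x, n_z))  # q_phi(z | x), indexed [x, z]

# Closed form from Proposition 1: T*(x, z) = log q(z | x) - log p(z).
T_star = np.log(q_z_given_x) - np.log(p_z)[None, :]

# Brute force: maximize a*log(sigma(t)) + b*log(1 - sigma(t)) over a grid of t
# for every pair (x, z), with a = q(z | x) and b = p(z).
t_grid = np.linspace(-8.0, 8.0, 16001)            # grid spacing 0.001
sig = 1.0 / (1.0 + np.exp(-t_grid))
a = q_z_given_x[:, :, None]
b = p_z[None, :, None]
integrand = a * np.log(sig) + b * np.log1p(-sig)  # shape (n_x, n_z, grid)
T_grid = t_grid[np.argmax(integrand, axis=-1)]

# ELBO per data point, and the same quantity written with T* as in (3.5).
log_lik = np.log(p_x_given_z).T                   # log p(x | z), indexed [x, z]
elbo = np.sum(
    q_z_given_x * (np.log(p_z)[None, :] - np.log(q_z_given_x) + log_lik), axis=1
)
objective_35 = np.sum(q_z_given_x * (-T_star + log_lik), axis=1)
log_px = np.log(p_z @ p_x_given_z)                # exact log p_theta(x)

print("max |T_grid - T_star|:", np.max(np.abs(T_grid - T_star)))
print("ELBO:", elbo, "log p(x):", log_px)
```

The grid search only locates the maximizer up to the grid spacing, so the comparison with (3.4) is approximate by construction.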
Proposition 2. We have

$$\mathrm{E}_{q_\phi(z \mid x)} \big( \nabla_\phi T^*(x, z) \big) = 0. \tag{3.6}$$

Proof. The proof can be found in the Supplementary Material.

Using the reparameterization trick (Kingma & Welling, 2013; Rezende et al., 2014), (3.5) can be rewritten in the form

$$\max_{\theta, \phi} \; \mathrm{E}_{p_D(x)} \mathrm{E}_{\epsilon} \big( -T^*(x, z_\phi(x, \epsilon)) + \log p_\theta(x \mid z_\phi(x, \epsilon)) \big) \tag{3.7}$$

for a suitable function $z_\phi(x, \epsilon)$. Together with Proposition 1, (3.7) allows us to take unbiased estimates of the gradients of (3.5) with respect to $\phi$ and $\theta$.

3.2. Algorithm

In theory, Propositions 1 and 2 allow us to apply Stochastic Gradient Descent (SGD) directly to the objective in (2.4). However, this requires keeping $T^*(x, z)$ optimal which is computationally challenging. We therefore regard the optimization problems in (3.3) and (3.7) as a two-player game. Propositions 1 and 2 show that any Nash-equilibrium of this game yields a stationary point of the objective in (2.4).

In practice, we try to find a Nash-equilibrium by applying SGD with step sizes $h_i$ jointly to (3.3) and (3.7), see Algorithm 1. Here, we parameterize the neural network $T$ with a vector $\psi$. Even though we have no guarantees that this algorithm converges, any fixed point of this algorithm yields a stationary point of the objective in (2.4).

Algorithm 1 Adversarial Variational Bayes (AVB)
  $i \leftarrow 0$
  while not converged do
    Sample $\{x^{(1)}, \dots, x^{(m)}\}$ from data distribution $p_D(x)$
    Sample $\{z^{(1)}, \dots, z^{(m)}\}$ from prior $p(z)$
    Sample $\{\epsilon^{(1)}, \dots, \epsilon^{(m)}\}$ from $N(0, 1)$
    Compute $\theta$-gradient (eq. 3.7):
      $g_\theta \leftarrow \frac{1}{m} \sum_{k=1}^{m} \nabla_\theta \log p_\theta\big(x^{(k)} \mid z_\phi(x^{(k)}, \epsilon^{(k)})\big)$
    Compute $\phi$-gradient (eq. 3.7):
      $g_\phi \leftarrow \frac{1}{m} \sum_{k=1}^{m} \nabla_\phi \Big[ -T_\psi\big(x^{(k)}, z_\phi(x^{(k)}, \epsilon^{(k)})\big) + \log p_\theta\big(x^{(k)} \mid z_\phi(x^{(k)}, \epsilon^{(k)})\big) \Big]$
    Compute $\psi$-gradient (eq. 3.3):
      $g_\psi \leftarrow \frac{1}{m} \sum_{k=1}^{m} \nabla_\psi \Big[ \log \sigma\big(T_\psi(x^{(k)}, z_\phi(x^{(k)}, \epsilon^{(k)}))\big) + \log\big(1 - \sigma(T_\psi(x^{(k)}, z^{(k)}))\big) \Big]$
    Perform SGD-updates for $\theta$, $\phi$ and $\psi$:
      $\theta \leftarrow \theta + h_i g_\theta$, $\phi \leftarrow \phi + h_i g_\phi$, $\psi \leftarrow \psi + h_i g_\psi$
    $i \leftarrow i + 1$
  end while

Note that optimizing (3.5) with respect to $\phi$ while keeping $\theta$ and $T$ fixed makes the encoder network collapse to a deterministic function. This is also a common problem for regular GANs (Radford et al., 2015). It is therefore crucial to keep the discriminative network $T$ close to optimality while optimizing (3.5). A variant of Algorithm 1 therefore performs several SGD-updates for the adversary for one SGD-update of the generative model. However, throughout our experiments we use the simple 1-step version of AVB unless stated otherwise.

3.3. Theoretical results

In Section 3.1 we derived AVB as a way of performing stochastic gradient descent on the variational lower bound in (2.4). In this section, we analyze the properties of Algorithm 1 from a game theoretical point of view.

As the next proposition shows, global Nash-equilibria of Algorithm 1 yield global optima of the objective in (2.4):

Proposition 3. Assume that $T$ can represent any function of two variables. If $(\theta^*, \phi^*, T^*)$ defines a Nash-equilibrium of the two-player game defined by (3.3) and (3.7), then

$$T^*(x, z) = \log q_{\phi^*}(z \mid x) - \log p(z) \tag{3.8}$$

and $(\theta^*, \phi^*)$ is a global optimum of the variational lower bound in (2.4).

Proof. The proof can be found in the Supplementary Material.

Our parameterization of $q_\phi(z \mid x)$ as a neural network allows $q_\phi(z \mid x)$ to represent almost any probability density on the latent space. This motivates

Corollary 4. Assume that $T$ can represent any function of two variables and $q_\phi(z \mid x)$ can represent any probability density on the latent space. If $(\theta^*, \phi^*, T^*)$ defines a Nash-equilibrium for the game defined by (3.3) and (3.7), then

1. $\theta^*$ is a maximum-likelihood assignment
2. $q_{\phi^*}(z \mid x)$ is equal to the true posterior $p_{\theta^*}(z \mid x)$
3. $T^*$ is the pointwise mutual information between $x$ and $z$, i.e.

$$T^*(x, z) = \log \frac{p_{\theta^*}(x, z)}{p_{\theta^*}(x)\, p(z)}. \tag{3.9}$$

Proof. This is a straightforward consequence of Proposition 3, as in this case $(\theta^*, \phi^*)$ optimizes the variational lower bound in (2.4) if and only if 1 and 2 hold. Inserting the result from 2 into (3.8) yields 3.
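The statements of Corollary 4 can be illustrated on a discrete toy model in which the true posterior is available in closed form. This is a sketch under assumptions (random probability tables, hypothetical names and sizes), not an implementation of the training procedure: it only verifies that the ELBO is tight when $q$ equals the true posterior, that another $q$ gives a strictly smaller value, and that $\log q(z \mid x) - \log p(z)$ then coincides with the pointwise mutual information in (3.9).

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, n_z = 3, 4  # hypothetical toy sizes


def random_dist(shape):
    """Strictly positive probability table, normalized over the last axis."""
    w = rng.uniform(0.1, 1.0, size=shape)
    return w / w.sum(axis=-1, keepdims=True)


p_z = random_dist(n_z)                    # prior p(z)
p_x_given_z = random_dist((n_z, n_x))     # p_theta(x | z), indexed [z, x]
joint = (p_z[:, None] * p_x_given_z).T    # p_theta(x, z), indexed [x, z]
p_x = joint.sum(axis=1)                   # marginal p_theta(x)
posterior = joint / p_x[:, None]          # true posterior p_theta(z | x)


def elbo(q):
    """Per-datapoint ELBO E_q[log p(x, z) - log q(z | x)] for a table q[x, z]."""
    return np.sum(q * (np.log(joint) - np.log(q)), axis=1)


elbo_at_posterior = elbo(posterior)
elbo_other = elbo(random_dist((n_x, n_z)))   # some other inference model

# T* from (3.8) evaluated at q = posterior, and the PMI from (3.9).
T_star = np.log(posterior) - np.log(p_z)[None, :]
pmi = np.log(joint) - np.log(p_x)[:, None] - np.log(p_z)[None, :]

print("ELBO at posterior:", elbo_at_posterior)
print("log p(x):         ", np.log(p_x))
print("ELBO at other q:  ", elbo_other)
```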
4. Adaptive Contrast

While in the nonparametric limit our method yields the correct results, in practice $T(x, z)$ may fail to become sufficiently close to the optimal function $T^*(x, z)$. The reason for this problem is that AVB calculates a contrast between the two densities $p_D(x) q_\phi(z \mid x)$ and $p_D(x) p(z)$, which are usually very different. However, it is known that logistic regression works best for likelihood-ratio estimation when comparing two very similar densities (Friedman et al., 2001).
Figure 3. Comparison of KL to ground truth posterior obtained by Hamiltonian Monte Carlo (HMC).

To improve the quality of the estimate, we therefore propose to introduce an auxiliary conditional probability distribution $r_\alpha(z \mid x)$ with known density that approximates $q_\phi(z \mid x)$. For example, $r_\alpha(z \mid x)$ could be a Gaussian distribution with diagonal covariance matrix whose mean and variance match the mean and variance of $q_\phi(z \mid x)$.

Using this auxiliary distribution, we can rewrite the variational lower bound in (2.4) as

$$\mathrm{E}_{p_D(x)} \Big[ -\mathrm{KL}\big(q_\phi(z \mid x), r_\alpha(z \mid x)\big) + \mathrm{E}_{q_\phi(z \mid x)} \big( -\log r_\alpha(z \mid x) + \log p_\theta(x, z) \big) \Big]. \tag{4.1}$$

[...] with mean 0 and variance 1. This way, the adversary only has to account for the deviation of $q_\phi(z \mid x)$ from a Gaussian distribution, not its location and scale. Please see the Supplementary Material for pseudo code of the resulting algorithm.

In practice, we estimate $\mu(x)$ and $\sigma(x)$ using a Monte-Carlo estimate. In the Supplementary Material we describe a network architecture for $q_\phi(z \mid x)$ that makes the computation of this estimate particularly efficient.
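That (4.1) is the same lower bound as in (2.4) for any auxiliary density can be confirmed on a discrete toy model. The sketch below is illustrative only: the probability tables, sizes and names are assumptions, and the auxiliary distribution is taken to be a simple mixture that stays close to $q$, used here as a stand-in for the moment-matched Gaussian described in the text. It also shows that the divergence the adversary has to estimate becomes much smaller once the contrast is taken against $r$ instead of the prior.

```python
import numpy as np

rng = np.random.default_rng(2)
n_x, n_z = 3, 4  # hypothetical toy sizes


def random_dist(shape):
    """Strictly positive probability table, normalized over the last axis."""
    w = rng.uniform(0.1, 1.0, size=shape)
    return w / w.sum(axis=-1, keepdims=True)


def kl(q, r):
    """KL(q, r) along the last axis for strictly positive tables."""
    return np.sum(q * (np.log(q) - np.log(r)), axis=-1)


p_z = random_dist(n_z)                  # prior p(z)
p_x_given_z = random_dist((n_z, n_x))   # p_theta(x | z), indexed [z, x]
q = random_dist((n_x, n_z))             # q_phi(z | x), indexed [x, z]

# Hypothetical auxiliary distribution r(z | x) that stays close to q.
r = 0.9 * q + 0.1 * p_z[None, :]

log_lik = np.log(p_x_given_z).T                  # log p(x | z), indexed [x, z]
log_joint = np.log(p_z)[None, :] + log_lik       # log p_theta(x, z)

# Per-datapoint integrands of (2.4) and (4.1).
bound_24 = -kl(q, p_z[None, :]) + np.sum(q * log_lik, axis=1)
bound_41 = -kl(q, r) + np.sum(q * (-np.log(r) + log_joint), axis=1)

print("bound (2.4):", bound_24)
print("bound (4.1):", bound_41)
print("KL(q, prior):", kl(q, p_z[None, :]))
print("KL(q, r):    ", kl(q, r))
```

For this particular mixture, convexity of the KL divergence in its second argument guarantees that KL(q, r) is at most one tenth of KL(q, prior).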
[Figure 4 panels: columns $(\mu, \tau)$ and $(\tau, \eta_1)$; row label: VB (full-rank)]
Figure 4. Comparison of AVB to VB on the "Eight Schools" example by inspecting two marginal distributions of the approximation to the 10-dimensional posterior. We see that AVB accurately captures the multi-modality of the posterior distribution. In contrast, VB only focuses on a single mode. The ground truth is shown in the last row and has been obtained using HMC.

(a) VAE (b) AVB
Table 1. Comparison of VAE and AVB on synthetic dataset. The optimal log-likelihood score on this dataset is $-\log(4) \approx -1.386$.

[...] Carlo (HMC) for 500000 steps using STAN. Note that AVB and the baseline variational methods allow us to draw an arbitrary number of samples after training is completed whereas HMC only yields a fixed number of samples.

We evaluate all methods by estimating the Kullback-Leibler-Divergence to the ground-truth data using the ITE-package (Szabo, 2013) applied to 10000 samples from the ground-truth data and the respective approximation. The resulting Kullback-Leibler divergence over the number of iterations for the different methods is plotted in Figure 3. We see that our method clearly outperforms the methods with Gaussian inference model. For a qualitative visualization, we also applied Kernel-density-estimation to the 2-dimensional marginals of the $(\mu, \tau)$- and $(\tau, \eta_1)$-variables as illustrated in Figure 4. In contrast to variational Bayes with Gaussian inference model, our approach clearly captures the multi-modality of the posterior distribution. We also observed that Adaptive Contrast makes learning more robust and improves the quality of the resulting model.

5.2. Generative Models

Synthetic Example To illustrate the application of our method to learning a generative model, we trained the neural networks on a simple synthetic dataset containing only the 4 data points from the space of 2 × 2 binary images shown in Figure 5 and a 2-dimensional latent space. Both the encoder and decoder are parameterized by 2-layer fully connected neural networks with 512 hidden units each. The encoder network takes as input a data point $x$ and a vector of Gaussian random noise $\epsilon$ and produces a latent code $z$. The decoder network takes as input a latent code $z$ and produces the parameters for four independent Bernoulli-distributions, one for each pixel of the output image. The adversary is parameterized by two neural networks with two 512-dimensional hidden layers each, acting on $x$ and $z$ respectively, whose 512-dimensional outputs are combined using an inner product.

We compare our method to a Variational Autoencoder with a diagonal Gaussian posterior distribution. The encoder and decoder networks are parameterized as above, but the encoder does not take the noise $\epsilon$ as input and produces a mean and variance vector instead of a single sample.

We visualize the resulting division of the latent space in Figure 6, where each color corresponds to one state in the $x$-space. Whereas the Variational Autoencoder divides the space into a mixture of 4 Gaussians, the Adversarial Variational Autoencoder learns a complex posterior distribution. Quantitatively this can be verified by computing the KL-divergence between the prior $p(z)$ and the aggregated posterior $q_\phi(z) := \int q_\phi(z \mid x) p_D(x) \, dx$, which we estimate using the ITE-package (Szabo, 2013), see Table 1. Note that the variations for different colors in Figure 6 are solely due to the noise $\epsilon$ used in the inference model.

The ability of AVB to learn more complex posterior models leads to improved performance as Table 1 shows. In particular, AVB leads to a higher likelihood score that is close to the optimal value of $-\log(4)$ compared to a standard VAE that struggles with the fact that it cannot divide [...]
Ali, Syed Mumtaz and Silvey, Samuel D. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), pp. 131–142, 1966.

Chen, Xi, Kingma, Diederik P, Salimans, Tim, Duan, Yan, Dhariwal, Prafulla, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.

Dinh, Laurent, Krueger, David, and Bengio, Yoshua. Nice: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Dinh, Laurent, Sohl-Dickstein, Jascha, and Bengio, Samy. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.

Donahue, Jeff, Krähenbühl, Philipp, and Darrell, Trevor. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

Dumoulin, Vincent, Belghazi, Ishmael, Poole, Ben, Lamb, Alex, Arjovsky, Martin, Mastropietro, Olivier, and Courville, Aaron. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

Karaletsos, Theofanis. Adversarial message passing for graphical models. arXiv preprint arXiv:1612.05048, 2016.

Kingma, Diederik P and Welling, Max. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, Diederik P, Salimans, Tim, and Welling, Max. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.

Kucukelbir, Alp, Ranganath, Rajesh, Gelman, Andrew, and Blei, David. Automatic variational inference in stan. In Advances in neural information processing systems, pp. 568–576, 2015.

Larsen, Anders Boesen Lindbo, Sønderby, Søren Kaae, and Winther, Ole. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Nguyen, Anh, Yosinski, Jason, Bengio, Yoshua, Dosovitskiy, Alexey, and Clune, Jeff. Plug & play generative networks: Conditional iterative generation of images in latent space. arXiv preprint arXiv:1612.00005, 2016.

Nguyen, XuanLong, Wainwright, Martin J, and Jordan, Michael I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.

Nowozin, Sebastian, Cseke, Botond, and Tomioka, Ryota. f-gan: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709, 2016.

[...] Neural Information Processing Systems, pp. 4790–4798, 2016a.

van den Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016b.

Wu, Yuhuai, Burda, Yuri, Salakhutdinov, Ruslan, and Grosse, Roger. On the quantitative analysis of decoder-based generative models. arXiv preprint arXiv:1611.04273, 2016.
I. Proofs

This section contains the proofs that were omitted in the main text.

The derivation of AVB in Section 3.1 relies on the fact that we have an explicit representation of the optimal discriminator $T^*(x, z)$. This was stated in the following Proposition:

Proposition 1. For $p_\theta(x \mid z)$ and $q_\phi(z \mid x)$ fixed, the optimal discriminator $T^*$ according to the objective in (3.3) is given by

$$T^*(x, z) = \log q_\phi(z \mid x) - \log p(z). \tag{3.4}$$

Proof. As in the proof of Proposition 1 in Goodfellow et al. (2014), we rewrite the objective in (3.3) as

$$\int \Big( p_D(x) q_\phi(z \mid x) \log \sigma(T(x, z)) + p_D(x) p(z) \log\big(1 - \sigma(T(x, z))\big) \Big) \, dx \, dz. \tag{I.1}$$

This integral is maximal as a function of $T(x, z)$ if and only if the integrand is maximal for every $(x, z)$. However, the function

$$t \mapsto a \log(t) + b \log(1 - t) \tag{I.2}$$

attains its maximum at $t = \frac{a}{a+b}$, showing that

$$\sigma(T^*(x, z)) = \frac{q_\phi(z \mid x)}{q_\phi(z \mid x) + p(z)} \tag{I.3}$$

or, equivalently,

$$T^*(x, z) = \log q_\phi(z \mid x) - \log p(z). \tag{I.4}$$

Proof of Proposition 2. By Proposition 1,

$$\mathrm{E}_{q_\phi(z \mid x)} \big( \nabla_\phi T^*(x, z) \big) = \mathrm{E}_{q_\phi(z \mid x)} \big( \nabla_\phi \log q_\phi(z \mid x) \big). \tag{I.5}$$

For an arbitrary family of probability densities $q_\phi$ we have

$$\mathrm{E}_{q_\phi} \big( \nabla_\phi \log q_\phi \big) = \int q_\phi(z) \frac{\nabla_\phi q_\phi(z)}{q_\phi(z)} \, dz = \int \nabla_\phi q_\phi(z) \, dz = \nabla_\phi 1 = 0. \tag{I.6}$$

Together with (I.5), this implies (3.6).

In Section 3.3 we characterized the Nash-equilibria of the two-player game defined by our algorithm. The following Proposition shows that in the nonparametric limit for $T(x, z)$ any Nash-equilibrium defines a global optimum of the variational lower bound:

Proposition 3. Assume that $T$ can represent any function of two variables. If $(\theta^*, \phi^*, T^*)$ defines a Nash-equilibrium of the two-player game defined by (3.3) and (3.7), then

$$T^*(x, z) = \log q_{\phi^*}(z \mid x) - \log p(z) \tag{3.8}$$

and $(\theta^*, \phi^*)$ is a global optimum of the variational lower bound in (2.4).

Proof. If $(\theta^*, \phi^*, T^*)$ defines a Nash-equilibrium, Proposition 1 shows (3.8). Inserting (3.8) into (3.5) shows that $(\phi^*, \theta^*)$ maximizes

$$\mathrm{E}_{p_D(x)} \mathrm{E}_{q_\phi(z \mid x)} \big( -\log q_{\phi^*}(z \mid x) + \log p(z) + \log p_\theta(x \mid z) \big) \tag{I.7}$$
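The identity (I.6), which drives the proof of Proposition 2, can be checked numerically for a concrete family. The sketch below assumes a softmax-parameterized distribution over four outcomes as a stand-in for $q_\phi$ (the family, the parameter values and the function names are illustrative choices, not from the paper) and evaluates $\mathrm{E}_{q_\phi}(\nabla_\phi \log q_\phi)$ with central finite differences.

```python
import numpy as np


def log_softmax(phi):
    """Log-probabilities of the softmax distribution with logits phi."""
    shifted = phi - phi.max()
    return shifted - np.log(np.exp(shifted).sum())


def grad_log_q(phi, eps=1e-6):
    """Finite-difference Jacobian J[k, j] = d log q_phi(k) / d phi_j."""
    n = phi.size
    jac = np.zeros((n, n))
    for j in range(n):
        step = np.zeros(n)
        step[j] = eps
        jac[:, j] = (log_softmax(phi + step) - log_softmax(phi - step)) / (2 * eps)
    return jac


phi = np.array([0.3, -1.2, 0.8, 2.0])     # hypothetical parameter vector
q = np.exp(log_softmax(phi))

# E_q[grad_phi log q_phi] = sum_k q(k) * grad_phi log q_phi(k)
expected_score = q @ grad_log_q(phi)
print(expected_score)
```

Each entry of the printed vector is zero up to finite-difference error, which matches (I.6). The identity concerns only the explicit dependence of the log-density on $\phi$, which is the term Proposition 2 refers to.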
[Figure 8 diagram; node labels: $\epsilon_1, \dots, \epsilon_m$; $f_1, \dots, f_m$; $v_1, \dots, v_m$; $a_1, \dots, a_m$; $g$; $x$; $z$]
Figure 8. Architecture of the network used for the MNIST-experiment

(a) Training data (b) Random samples
Figure 9. Independent samples for a model trained on celebA.

Figure 10. Interpolation experiments for celebA

$$\mathrm{E}(z_k) = \sum_{i=1}^{m} \mathrm{E}[v_{i,k}(\epsilon_i)] \, a_{i,k}(x). \tag{III.2}$$

$$\mathrm{Var}(z_k) = \sum_{i=1}^{m} \mathrm{Var}[v_{i,k}(\epsilon_i)] \, a_{i,k}(x)^2. \tag{III.3}$$

By estimating $\mathrm{E}[v_{i,k}(\epsilon_i)]$ and $\mathrm{Var}[v_{i,k}(\epsilon_i)]$ via sampling once per mini-batch, we can efficiently compute the moments of $q_\phi(z \mid x)$ for all the data points $x$ in a single mini-batch.
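The moment computation in (III.2) and (III.3) can be sketched in NumPy. The example assumes that the latent code is the linear combination $z_k = \sum_i v_{i,k}(\epsilon_i)\, a_{i,k}(x)$ with independent noise variables $\epsilon_i$, which is what the two formulas suggest; the concrete choices for $v$ (a tanh of scaled noise) and $a$ (a linear map of $x$), as well as all sizes and names, are hypothetical placeholders for the actual networks.

```python
import numpy as np

rng = np.random.default_rng(3)
n_noise, m, K, batch, d_x = 200_000, 3, 2, 5, 4   # hypothetical sizes

# Hypothetical noise basis: v[s, i, k] = tanh(w[i, k] * eps[s, i] + c[i, k]).
w = rng.normal(size=(m, K))
c = rng.normal(size=(m, K))
eps = rng.normal(size=(n_noise, m))                # independent eps_i per sample
v = np.tanh(w[None] * eps[:, :, None] + c[None])   # shape (n_noise, m, K)

# Hypothetical coefficient network: a linear map of x, reshaped to (batch, m, K).
x = rng.normal(size=(batch, d_x))
A = rng.normal(size=(d_x, m * K))
a = (x @ A).reshape(batch, m, K)

# Estimate the moments of v once per mini-batch ...
v_mean = v.mean(axis=0)                            # shape (m, K)
v_var = v.var(axis=0)

# ... then (III.2) and (III.3) give the moments of z for every x in the batch.
z_mean = np.einsum("ik,nik->nk", v_mean, a)
z_var = np.einsum("ik,nik->nk", v_var, a ** 2)

# Brute-force reference: build z for every noise sample and every x.
z = np.einsum("sik,nik->snk", v, a)                # shape (n_noise, batch, K)
print(np.abs(z_mean - z.mean(axis=0)).max())
print(np.abs(z_var / z.var(axis=0) - 1.0).max())
```

The mean agrees with the brute-force value up to floating-point error because both are linear in the same samples. The variance agrees only approximately, since (III.3) drops the cross-covariances between different $v_{i,k}$, which vanish in expectation for independent $\epsilon_i$ but not exactly in a finite sample.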
IV. Additional Experiments

celebA We also used AVB (without AC) to train a deep convolutional network on the celebA-dataset (Liu et al., 2015) for a 64-dimensional latent space with $N(0, 1)$-prior. For the decoder and adversary we use two deep convolutional neural networks acting on $x$ like in Radford et al. (2015). We add the noise $\epsilon$ and the latent code $z$ to each hidden layer via a learned projection matrix. Moreover, in the encoder and decoder we use three RESNET-blocks (He et al., 2015) at each scale of the neural network. We add the log-prior $\log p(z)$ explicitly to the adversary $T(x, z)$, so that it only has to learn the log-density of the inference model $q_\phi(z \mid x)$.

The samples for celebA are shown in Figure 9. We see that our model produces visually sharp images of faces. To demonstrate that the model has indeed learned an abstract representation of the data, we show reconstruction results and the result of linearly interpolating the $z$-vector in the latent space in Figure 10. We see that the reconstructions are reasonably sharp and the model produces realistic images for all interpolated $z$-values.

MNIST To evaluate how AVB with adaptive contrast compares against other methods on a fixed decoder architecture, we reimplemented the methods from Maaløe et al. (2016) and Kingma et al. (2016). The method from Maaløe et al. (2016) tries to make the variational approximation to the posterior more flexible by using auxiliary variables; the method from Kingma et al. (2016) tries to improve the variational approximation by employing an Inverse Autoregressive Flow (IAF), a particularly flexible instance of a normalizing flow (Rezende & Mohamed, 2015). In our experiments, we compare AVB with adaptive contrast to a standard VAE with diagonal Gaussian inference model as well as the methods from Maaløe et al. (2016) and Kingma et al. (2016).

In our first experiment, we evaluate all methods on training a decoder that is given by a fully-connected neural network with ELU-nonlinearities and two hidden layers with 300 units each. The prior distribution $p(z)$ is given by a 32-dimensional standard-Gaussian distribution.

The results are shown in Table 3a. We observe that both AVB and the VAE with auxiliary variables achieve a better (approximate) ELBO than a standard VAE. When evaluated using AIS, both methods result in similar log- [...]