From Optimal Transport To Generative Modeling
Google Brain · Max Planck Institute for Intelligent Systems
Abstract
We study unsupervised generative modeling in terms of the optimal transport (OT) problem
between true (but unknown) data distribution PX and the latent variable model distribution
PG . We show that the OT problem can be equivalently written in terms of probabilistic
encoders, which are constrained to match the posterior and prior distributions over the latent
space. When relaxed, this constrained optimization problem leads to a penalized optimal
transport (POT) objective, which can be efficiently minimized using stochastic gradient descent
by sampling from PX and PG . We show that POT for the 2-Wasserstein distance coincides with
the objective heuristically employed in adversarial auto-encoders (AAE) [1], which provides the
first theoretical justification for AAEs known to the authors. We also compare POT to other
popular techniques like variational auto-encoders (VAE) [2]. Our theoretical results include
(a) a better understanding of the commonly observed blurriness of images generated by VAEs,
and (b) establishing duality between Wasserstein GAN [3] and POT for the 1-Wasserstein
distance.
1 Introduction
The field of representation learning was initially driven by supervised approaches, with impressive
results using large labelled datasets. Unsupervised generative modeling, in contrast, used to be
a domain governed by probabilistic approaches focusing on low-dimensional data. Recent years
have seen a convergence of those two approaches. In the new field that formed at the intersection,
variational autoencoders (VAEs) [2] form one well-established approach, theoretically elegant yet
with the drawback that they tend to generate blurry images. In contrast, generative adversarial
networks (GANs) [4] turned out to be more impressive in terms of the visual quality of images
sampled from the model, but have been reported to be harder to train. There has been a flurry of
activity in assaying numerous configurations of GANs as well as combinations of VAEs and GANs.
A unifying theory relating GANs to VAEs in a principled way is yet to be discovered. This forms
a major motivation for the studies underlying the present paper.
Following [3], we approach generative modeling from the optimal transport point of view. The
optimal transport (OT) cost [5] is a way to measure a distance between probability distributions
and provides a much weaker topology than many others, including f -divergences associated with
the original GAN algorithms. This is particularly important in applications, where data is usually
supported on low dimensional manifolds in the input space X . As a result, stronger notions of
distances (such as f-divergences, which capture the density ratio between distributions) often max out, providing no useful gradients for training. In contrast, the optimal transport cost behaves better and may lead to more stable training [3].
In this work we aim at minimizing the optimal transport cost Wc (PX , PG ) between the true
(but unknown) data distribution PX and a latent variable model PG . We do so via the primal
form and investigate theoretical properties of this optimization problem. Our main contributions
are listed below; cf. also Figure 1.
1. We derive an equivalent formulation for the primal form of Wc (PX , PG ), which makes the
role of latent space Z and probabilistic encoders Q(Z|X) explicit (Theorem 1).
2. Unlike in VAE, we arrive at an optimization problem where the Q are constrained. We relax
the constraints by penalization, arriving at the penalized optimal transport (POT) objective
(13) which can be minimized with stochastic gradient descent by sampling from PX and PG .
3. We show that for squared Euclidean cost c (Section 4.1), POT coincides with the objective
of adversarial auto-encoders (AAE) [1]. We believe this provides the first theoretical jus-
tification for AAE, showing that they approximately minimize the 2-Wasserstein distance
W2 (PX , PG ). We also compare POT to VAE, adversarial variational Bayes (AVB) [6], and
other methods based on the marginal log-likelihood. In particular, we show that all these
methods necessarily suffer from blurry outputs, unlike POT or AAE.
4. When c is the Euclidean distance (Section 4.2), POT and WGAN [3] both minimize the
1-Wasserstein distance W1 (PX , PG ). They approach this problem from primal/dual forms
respectively, which leads to a different behaviour of the resulting algorithms.
Finally, following a somewhat well-established tradition of using acronyms based on the “GAN”
suffix, and because we give a recipe for blending VAE and GAN (using POT), we propose to call
this work a VEGAN cookbook.
Related work The authors of [7] address computing the OT cost in large scale using stochastic
gradient descent (SGD) and sampling. They approach this task either through the dual formula-
tion, or via a regularized version of the primal. They do not discuss any implications for generative
modeling. In contrast, our approach is based on the primal form of OT, arrives at very different regularizers, and focuses on generative modeling. The Wasserstein GAN [3] minimizes
the 1-Wasserstein distance W1 (PX , PG ) for generative modeling. The authors approach this task
from the dual form. Unfortunately, their algorithm cannot be readily applied to any other OT
cost, because the famous Kantorovich duality holds only for W1 . In contrast, our algorithm POT
approaches the same problem from the primal form and can be applied for any cost function c.
The present paper is structured as follows. In Section 2 we introduce our notation and discuss
existing generative modeling techniques, including GANs, VAEs, and AAE. Section 3 contains
our main results, including a theoretical analysis of the primal form of OT and the novel POT
objective. Section 4 discusses the implications of our new results. Finally, we discuss future work
in Section 5. Proofs may be found in Appendix B.
[Figure 1 panels: (a) VAE and AVB; (b) Optimal transport (primal form) and AAE.]
Figure 1: Different behaviours of generative models. The top half represents the latent space Z
with codes (triangles) sampled from PZ . The bottom half represents the data space X , with true
data points X (circles) and generated ones Y (squares). The arrows represent the conditional
distributions. Generally these are not one to one mappings, but for improved readability we
show only one or two arrows to the most likely points. In the left figure, describing VAE [2] and AVB [6], ΓVAE(Y|X) is a composite of the encoder QVAE(Z|X) and the decoder PG(Y|Z), mapping each true data point X to a distribution over generated points Y. For a fixed decoder, the optimal encoder Q∗VAE will assign mass proportionally to the distance between Y and X and the probability PZ(Z) (see Eq. 14). We see how different points X are mapped with high probability to the same Y, while other generated points Y are reached only with low probabilities. In the right figure, the OT is expressed as a conditional mapping ΓOT(Y|X). One of our main results (Theorem 1) shows that this mapping can be reparametrized via the transport X → Z → Y, making explicit the role of the encoder QOT(Z|X).
2 Notation and preliminaries
We denote probability distributions with capital letters (i.e. P(X)) and densities with lower case letters (i.e. p(x)). By δx we denote the Dirac distribution putting mass 1 on x ∈ X, and supp P denotes the support of P.
We will often need to measure the agreement between two probability distributions P and Q, and there are many ways to do so. The class of f-divergences [8] is defined by

Df(P‖Q) := ∫ f(p(x)/q(x)) q(x) dx,

where f: (0, ∞) → R is any convex function satisfying f(1) = 0. It is known that Df ≥ 0 and Df = 0 if P = Q. Classical examples include the Kullback-Leibler DKL and Jensen-Shannon DJS divergences. Another rich class of divergences is induced by the optimal transport (OT) problem [5]. Kantorovich's formulation [9] of the problem is given by

Wc(PX, PG) := inf_{Γ ∈ P(X∼PX, Y∼PG)} E_{(X,Y)∼Γ}[c(X, Y)],   (1)

where c(x, y): X × X → R+ is any measurable cost function and P(X ∼ PX, Y ∼ PG) is the set of all joint distributions of (X, Y) with marginals PX and PG respectively. When c(x, y) = d^p(x, y) for a metric d and p ≥ 1, Wp := (Wc)^{1/p} is the p-Wasserstein distance. For p = 1 the Kantorovich-Rubinstein duality yields

W1(PX, PG) = sup_{f ∈ FL} E_{X∼PX}[f(X)] − E_{Y∼PG}[f(Y)],   (2)

where FL is the class of all bounded 1-Lipschitz functions on (X, d). Note that the same symbol is used for Wp and Wc, but only p is a number; thus the above W1 refers to the 1-Wasserstein distance.
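As a toy illustration (our example, not part of the original development), in one dimension the optimal coupling in the primal form (1) between two empirical measures with equally many atoms simply matches sorted samples, so W1 can be estimated directly; the Gaussian parameters below are arbitrary illustrative choices.

import numpy as np

def w1_empirical_1d(x, y):
    # In 1-D the optimal coupling of two empirical measures with the same number of atoms
    # pairs the sorted samples, so the primal form (1) with c(x, y) = |x - y| reduces to the
    # average absolute difference of order statistics.
    x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    assert x.shape == y.shape
    return np.mean(np.abs(x - y))

rng = np.random.default_rng(0)
samples_px = rng.normal(0.0, 1.0, size=10_000)  # samples from P_X
samples_pg = rng.normal(0.5, 1.0, size=10_000)  # samples from P_G
print(w1_empirical_1d(samples_px, samples_pg))  # close to 0.5, the W1 distance between these two Gaussians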
Modern generative models, including VAEs and GANs, are built around latent variable models PG defined by a two-step procedure: first a code Z is sampled from a fixed prior distribution PZ over a latent space Z, and then Z is mapped to X ∈ X by a (possibly random) transformation parametrized by G. When PG(X|Z) admits a density, this results in

pG(x) := ∫_Z pG(x|z) pz(z) dz,   (3)

assuming all involved densities are properly defined. These models are easy to sample from and, provided G can be differentiated analytically with respect to its parameters, PG can be trained with SGD. The field is growing rapidly and numerous variations of VAEs and GANs are available in the literature. Next we introduce and compare several of them.
The original generative adversarial network (GAN) [4] approach minimizes
DGAN(PX, PG) = sup_{T∈T} E_{X∼PX}[log T(X)] + E_{Z∼PZ}[log(1 − T(G(Z)))]   (4)
where T is a class of discriminator functions taking values in (0, 1). It is known that DGAN(PX, PG) ≤ 2 · DJS(PX, PG) − log 4, and [10, 11] generalize this adversarial approximation to arbitrary f-divergences, resulting in objectives Df,GAN. The Wasserstein GAN (WGAN) [3] instead minimizes

DWGAN(PX, PG) = sup_{T∈W} E_{X∼PX}[T(X)] − E_{Z∼PZ}[T(G(Z))],

where W is any subset of 1-Lipschitz functions on X. It follows from (2) that DWGAN(PX, PG) ≤ W1(PX, PG) and thus WGAN is minimizing a lower bound on the 1-Wasserstein distance.
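A minimal numerical sketch (ours, not from the paper) of estimating this dual lower bound with a small critic network whose weights are clipped, as in [3], to keep it 1-Lipschitz; the architecture, data, and hyperparameters are arbitrary illustrative choices.

import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # a small restricted critic class W
opt = torch.optim.RMSprop(critic.parameters(), lr=5e-4)

def wgan_lower_bound(x_real, x_fake, steps=200, clip=0.01):
    # Maximize E[T(X)] - E[T(G(Z))] over the clipped critic; with this small clipping value the
    # network stays 1-Lipschitz, so the result lower bounds W1 between the two empirical samples.
    for _ in range(steps):
        loss = -(critic(x_real).mean() - critic(x_fake).mean())  # gradient ascent on the dual criterion
        opt.zero_grad()
        loss.backward()
        opt.step()
        for p in critic.parameters():  # weight clipping, as used in [3], to control the Lipschitz constant
            p.data.clamp_(-clip, clip)
    with torch.no_grad():
        return (critic(x_real).mean() - critic(x_fake).mean()).item()

x_real = torch.randn(512, 2)        # stand-in samples from P_X
x_fake = torch.randn(512, 2) + 1.0  # stand-in samples from P_G (a shifted Gaussian)
print(wgan_lower_bound(x_real, x_fake))  # a (typically loose) lower bound on W1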
Variational auto-encoders (VAE) [2] utilize models PG of the form (3) and minimize
DVAE(PX, PG) = inf_{Q(Z|X)∈Q} E_{PX}[ DKL(Q(Z|X), PZ) − E_{Q(Z|X)}[log pG(X|Z)] ]   (5)
with respect to a random decoder mapping PG (X|Z). The conditional distribution PG (X|Z)
is often parametrized by a deep net G and can have any form as long as its density pG (x|z)
can be computed and differentiated with respect to the parameters of G. A typical choice is
to use Gaussians PG (X|Z) = N (X; G(Z), σ 2 · I). If Q is the set of all conditional probability
distributions Q(Z|X), the objective of VAE coincides with the negative marginal log-likelihood
DVAE (PX , PG ) = −EPX [log PG (X)]. However, in order to make the DKL term of (5) tractable
in closed form, the original implementation of VAE uses a standard normal PZ and restricts
Q to a class of Gaussian distributions Q(Z|X) = N(Z; µ(X), Σ(X)) with mean µ and diagonal
covariance Σ parametrized by deep nets. As a consequence, VAE is minimizing an upper bound on
the negative log-likelihood or, equivalently, on the KL-divergence DKL (PX , PG ). Further details
can be found in Section A.
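For concreteness, here is a minimal sketch (ours) of a single-sample Monte Carlo estimate of the bracketed term in (5), assuming a standard normal PZ, a diagonal Gaussian encoder, and a Gaussian decoder with fixed variance σ²; the maps mu_enc, logvar_enc, and G are hypothetical stand-ins for deep nets.

import numpy as np

def vae_objective_sample(x, mu_enc, logvar_enc, G, sigma2=1.0):
    # Assumes P_Z = N(0, I), Q(Z|X=x) = N(mu_enc(x), diag(exp(logvar_enc(x)))),
    # and a Gaussian decoder P_G(X|Z) = N(G(z), sigma2 * I).
    rng = np.random.default_rng(0)
    mu, logvar = mu_enc(x), logvar_enc(x)
    # Closed-form KL(Q(Z|X=x) || P_Z) for diagonal Gaussians.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    # Reparametrized sample z ~ Q(Z|X=x), then -log p_G(x|z) up to an additive constant.
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    neg_log_lik = np.sum((x - G(z)) ** 2) / (2.0 * sigma2)
    return kl + neg_log_lik

# Toy usage with linear "networks" (hypothetical stand-ins for deep nets).
x = np.array([0.3, -1.2])
print(vae_objective_sample(x, mu_enc=lambda x: 0.5 * x,
                           logvar_enc=lambda x: np.zeros_like(x),
                           G=lambda z: 2.0 * z))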
One possible way to reduce the gap between the true negative log-likelihood and the upper
bound provided by DVAE is to enlarge the class Q. Adversarial variational Bayes (AVB) [6]
follows this argument by employing the idea of GANs. Given any point x ∈ X, a noise ε ∼ N(0, 1), and any fixed transformation e: X × R → Z, the random variable e(x, ε) implicitly defines one particular conditional distribution Qe(Z|X = x). AVB allows Q to contain all such distributions for different choices of e, replaces the intractable term DKL(Qe(Z|X), PZ) in (5) by the adversarial approximation Df,GAN corresponding to the KL-divergence, and proposes to minimize

DAVB(PX, PG) = inf_{Qe(Z|X)∈Q} E_{PX}[ Df,GAN(Qe(Z|X), PZ) − E_{Qe(Z|X)}[log pG(X|Z)] ].   (6)
The DKL term in (5) may be viewed as a regularizer. Indeed, VAE reduces to the classical
unregularized auto-encoder if this term is dropped, minimizing the reconstruction cost of the
encoder-decoder pair Q(Z|X), PG (X|Z). This often results in different training points being
encoded into non-overlapping zones chaotically scattered all across the Z space with “holes” in
between where the decoder mapping PG (X|Z) has never been trained. Overall, the encoder
Q(Z|X) trained in this way does not provide a useful representation and sampling from the latent
space Z becomes hard [12].
Adversarial auto-encoders (AAE) [1] replace the DKL term in (5) with another regularizer:
DAAE(PX, PG) = inf_{Q(Z|X)∈Q} DGAN(QZ, PZ) − E_{PX} E_{Q(Z|X)}[log pG(X|Z)],   (7)
where QZ is the marginal distribution of Z when first X is sampled from PX and then Z is sampled
from Q(Z|X), also known as the aggregated posterior [1]. Similarly to AVB, there is no clear link to
log-likelihood, as DAAE ≤ DAVB (see Appendix A). The authors of [1] argue that matching QZ to
PZ in this way ensures that there are no “holes” left in the latent space Z and PG (X|Z) generates
reasonable samples whenever Z ∼ PZ . They also report an equally good performance of different
types of conditional distributions Q(Z|X), including Gaussians as used in VAEs, implicit models
Qe as used in AVB, and deterministic encoder mappings, i.e. Q(Z|X) = δµ(X) with µ : X → Z.
3 From optimal transport to penalized optimal transport
Our goal is to fit the latent variable model PG by minimizing the OT cost Wc(PX, PG). In this section we explain how this can be done in the primal formulation of the OT problem (1) by reparametrizing the space of couplings (Section 3.1) and relaxing the marginal constraint (Section 3.2), leading to a formulation involving expectations over PX and PG that can thus be solved using SGD and sampling.
3.1 Reparametrization of the space of couplings
Formal statement As in Section 2, P(X ∼ PX , Y ∼ PG ) denotes the set of all joint distri-
butions of (X, Y ) with marginals PX , PG , and likewise for P(X ∼ PX , Z ∼ PZ ). The set of all
joint distributions of (X, Y, Z) such that X ∼ PX, (Y, Z) ∼ PG,Z, and (Y ⊥⊥ X) | Z will be denoted
by PX,Y,Z . Finally, we denote by PX,Y and PX,Z the sets of marginals on (X, Y ) and (X, Z) (re-
spectively) induced by distributions in PX,Y,Z . Note that P(PX , PG ), PX,Y,Z , and PX,Y depend
on the choice of conditional distributions PG (Y |Z), while PX,Z does not. In fact, it is easy to
check that PX,Z = P(X ∼ PX , Z ∼ PZ ). From the definitions it is clear that PX,Y ⊆ P(PX , PG )
and we get the following upper bound:
Wc(PX, PG) ≤ Wc†(PX, PG) := inf_{P∈PX,Y} E_{(X,Y)∼P}[c(X, Y)].   (9)
If the PG(Y|Z) are Dirac measures (i.e., Y = G(Z)), the two sets actually coincide, thus justifying
the reparametrization (8) and the illustration in Figure 1(b), as demonstrated in the following
theorem:
Theorem 1. If PG(Y|Z = z) = δG(z) for all z ∈ Z, where G: Z → X, we have

Wc(PX, PG) = Wc†(PX, PG) = inf_{P∈PX,Z} E_{(X,Z)∼P}[c(X, G(Z))]   (10)
= inf_{Q: QZ=PZ} E_{PX} E_{Q(Z|X)}[c(X, G(Z))],   (11)

where QZ is the marginal distribution of Z when X ∼ PX and Z ∼ Q(Z|X).
3.2 Relaxing the marginal constraint
The constraint QZ = PZ in (11) makes the optimization problem hard. We therefore relax it by adding to the objective a penalty λ · F(Q), where F(Q) ≥ 0 quantifies the deviation of QZ from PZ and λ > 0, resulting in

Wcλ(PX, PG) := inf_{Q(Z|X)∈Q} E_{PX} E_{Q(Z|X)}[c(X, G(Z))] + λ · F(Q).   (12)

It is well known [13] that under mild conditions adding a penalty as in (12) is equivalent to adding
a constraint of the form F (Q) ≤ µλ for some µλ > 0. As λ increases, the corresponding µλ
decreases, and as λ → ∞, the solutions of (12) reach the feasible region where PZ = QZ . This
shows that Wcλ (PX , PG ) ≤ Wc (PX , PG ) for all λ ≥ 0 and the gap reduces with increasing λ.
One possible choice for F is a convex divergence between the prior PZ and the aggregated
posterior QZ , such as DJS (QZ , PZ ), DKL (QZ , PZ ), or any other member of the f -divergence
family. However, this generally results in an intractable F. Instead, similarly to AVB, we may utilize the
adversarial approximation DGAN (QZ , PZ ), which becomes tight in the nonparametric limit. We
thus arrive at the problem of minimizing a penalized optimal transport (POT) objective
DPOT(PX, PG) := inf_{Q(Z|X)∈Q} E_{PX} E_{Q(Z|X)}[c(X, G(Z))] + λ · DGAN(QZ, PZ),   (13)
where Q is any nonparametric set of conditional distributions. If the cost function c is differen-
tiable, this problem can be solved with SGD similarly to AAE, where we iterate between updating
(a) an encoder-decoder pair Q, G and (b) an adversarial discriminator of DGAN , trying to sepa-
rate latent codes sampled from PZ and QZ . Moreover, in Section 4.1 we will show that DPOT
coincides with DAAE when c is the squared Euclidean cost and the PG (Y |Z) are Gaussian. The
good empirical performance of AAEs reported in [1] provides what we believe to be rather strong
support of our theoretical results.
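To make the alternating procedure concrete, here is a minimal PyTorch-style sketch (ours, not the authors' exact algorithm) of one POT/AAE training iteration with a deterministic encoder and c(x, y) = ‖x − y‖²; the names encoder, decoder, discrim, the architectures, and the value of λ are arbitrary placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_x, d_z, lam = 2, 2, 10.0  # illustrative dimensions and penalty coefficient

encoder = nn.Sequential(nn.Linear(d_x, 64), nn.ReLU(), nn.Linear(64, d_z))  # deterministic Q(Z|X)
decoder = nn.Sequential(nn.Linear(d_z, 64), nn.ReLU(), nn.Linear(64, d_x))  # G(Z)
discrim = nn.Sequential(nn.Linear(d_z, 64), nn.ReLU(), nn.Linear(64, 1))    # latent-space discriminator

opt_ae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
opt_d = torch.optim.Adam(discrim.parameters(), lr=1e-3)

def pot_step(x):
    # (b) update the discriminator separating codes Z ~ P_Z from codes Z ~ Q_Z.
    z_prior = torch.randn(x.size(0), d_z)   # samples from P_Z = N(0, I)
    z_post = encoder(x).detach()            # samples from the aggregated posterior Q_Z
    d_loss = F.binary_cross_entropy_with_logits(discrim(z_prior), torch.ones(x.size(0), 1)) + \
             F.binary_cross_entropy_with_logits(discrim(z_post), torch.zeros(x.size(0), 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # (a) update the encoder-decoder pair on reconstruction cost plus the adversarial penalty.
    z = encoder(x)
    recon = ((x - decoder(z)) ** 2).sum(dim=1).mean()  # estimate of E_PX E_Q(Z|X)[c(X, G(Z))]
    adv = F.binary_cross_entropy_with_logits(discrim(z), torch.ones(x.size(0), 1))  # fool the discriminator
    opt_ae.zero_grad()
    (recon + lam * adv).backward()
    opt_ae.step()
    return recon.item(), d_loss.item()

x_batch = torch.randn(128, d_x)  # stand-in for a minibatch from P_X
pot_step(x_batch)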
Random decoders The above discussion assumed Dirac measures PG (Y |Z). If this is not the
case, we can still upper bound the 2-Wasserstein distance W2(PX, PG), corresponding to c(x, y) = ‖x − y‖², in a very similar way to Theorem 1. The case of Gaussian decoders PG(Y|Z), which
will be particularly useful when discussing the relation to VAEs, is summarized in the following
remark:
Remark 1. For X = Rd and Gaussian PG (Y |Z) = N (Y ; G(Z), σ 2 ·Id ) the value of Wc (PX , PG ) is
upper bounded by Wc† (PX , PG ), which coincides with the r.h.s. of (11) up to a d · σ 2 additive term
(see Corollary 7 in Section B.2). In other words, objective (12) coincides with the relaxed version
of Wc†(PX, PG) up to an additive constant, while DPOT corresponds to its adversarial approximation.
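The d · σ² term in Remark 1 comes from the following one-line computation, which we spell out here for intuition (it is worked out in Section B.2): conditionally on X and Z, a Gaussian decoder inflates the expected squared reconstruction error by its total variance,

E_{Y∼N(G(Z), σ²·Id)}[ ‖X − Y‖² | X, Z ] = ‖X − G(Z)‖² + d · σ².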
4 Implications
4.1 The 2-Wasserstein distance: relation to VAE, AVB, and AAE
Throughout this section we take X = Rd, the squared Euclidean cost c(x, y) = ‖x − y‖², and Gaussian decoders PG(Y|Z) = N(Y; G(Z), σ² · Id), as used by VAE, AVB, and AAE. In order to verify the differentiability of log pG(x|z), all three methods require σ² > 0 and have problems handling the case of deterministic decoders (σ² = 0). To emphasize the role of the variance σ² we will denote the resulting latent variable model PGσ.
Relation to VAE and AVB The analysis of Section 3 shows that the value of W2 (PX , PGσ )
is upper bounded by Wc† (PX , PGσ ) of the form (16) and the two coincide when σ 2 = 0. Next we
summarize properties of the solutions G minimizing these two values W2 and Wc†:
Proposition 2. If σ² = 0 then Wc(PX, PGσ) and Wc†(PX, PGσ) coincide for any function G: Z → X. If σ² > 0 then the functions G∗σ and G† minimizing Wc(PX, PGσ) and Wc†(PX, PGσ) respectively are different: G∗σ depends on σ², while G† does not. The function G† is also a minimizer of Wc(PX, PG0).
For the purpose of generative modeling, the noise σ 2 > 0 is often not desirable, and it is
common practice to sample from the trained model G∗ by simply returning G∗ (Z) for Z ∼ PZ
without adding noise to the output. This leads to a mismatch between inference and training.
Furthermore, VAE, AVB, and other similar variational methods implicitly use σ 2 as a factor to
balance the ℓ2 reconstruction cost and the KL-regularizer.
In contrast, Proposition 2 shows that for the same Gaussian models with any given σ 2 ≥ 0 we
can minimize Wc†(PX, PGσ) and the solution G† will indeed be the one resulting in the smallest 2-
Wasserstein distance between PX and the noiseless implicit model G(Z), Z ∼ PZ used in practice.
Relation to AAE Next, we wish to convey the following intriguing finding. Substituting
an analytical form of log pG (x|z) in (7), we immediately see that the DAAE objective coincides
with DPOT up to additive terms independent of Q and G when the regularization coefficient λ
is set to 2σ 2 .
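To spell this substitution out (a short derivation we add for clarity, using the Gaussian decoder of this section): since

−log pG(x|z) = ‖x − G(z)‖² / (2σ²) + (d/2) log(2πσ²),

plugging this into (7) gives

DAAE(PX, PG) = inf_{Q(Z|X)∈Q} DGAN(QZ, PZ) + (1/(2σ²)) E_{PX} E_{Q(Z|X)}[‖X − G(Z)‖²] + (d/2) log(2πσ²),

and multiplying by the positive constant 2σ² (which does not change the minimizers over Q and G) yields exactly DPOT(PX, PG) with c(x, y) = ‖x − y‖² and λ = 2σ², up to the additive constant.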
For 0 < σ 2 < ∞ this means (see Remark 1) that AAE is minimizing the penalized relaxation
DPOT of the constrained optimization problem corresponding to Wc† (PX , PGσ ). The size of the gap
between DPOT and Wc† depends on the choice of λ, i.e., on σ 2 . If σ 2 → 0, we know (Remark 1)
that the upper bound Wc† converges to the OT cost Wc; however, the relaxation DPOT gets loose,
as λ = 2σ 2 → 0. In this case AAE approaches the classical unregularized auto-encoder and does
not have any connections to the OT problem. If σ 2 → ∞, the solution of the penalized objective
DPOT reaches the feasible region of the original constrained optimization problem (11), because
λ = 2σ 2 → ∞, and as a result DPOT converges to Wc† (PX , PGσ ). In this case AAE is searching
for the solution G† of minG Wc† (PX , PGσ ), which is also the function minimizing Wc (PX , PG0 ) for
the deterministic decoder Y = G(Z) according to Proposition 2. In other words, the function G†
learned by AAE with σ 2 → ∞ minimizes the 2-Wasserstein distance between PX and G(Z) when
Z ∼ PZ .
The authors of [6] tried to establish a connection between AAE and log-likelihood maximiza-
tion. They argued that AAE is “a crude approximation” to AVB. Our results suggest that AAE
is in fact attempting to minimize the 2-Wasserstein distance between PX and PGσ , which may
explain its good empirical performance reported in [1].
Blurriness of VAE and AVB We next add to the discussion regarding the blurriness com-
monly attributed to VAE samples. Our argument shows that VAE, AVB, and other methods based
on the marginal log-likelihood necessarily lead to an averaging in the input space if PG (Y |Z) are
Gaussian.
First we notice that in the VAE and AVB objectives, for any fixed encoder Q(Z|X), the decoder
is minimizing the expected ℓ2-reconstruction cost E_{PX} E_{Q(Z|X)}[‖X − G(Z)‖²] with respect to G.
The optimal solution G∗ is of the form G∗ (z) = EPz∗ [X], where Pz∗ (X) ∝ PX (X)Q(Z = z|X).
Hence, as soon as supp Pz∗ is non-singleton, the optimal decoder G∗ will end up averaging points
in the input space. In particular this will happen whenever there are two points x1 , x2 in supp PX
such that supp Q(Z|X = x1 ) and supp Q(Z|X = x2 ) overlap.
This overlap necessarily happens in VAEs, which use Gaussian encoders Q(Z|X) supported
on the entire Z. When probabilistic encoders Q are allowed to be flexible enough, as in AVB, for
any fixed PG (Y |Z) the optimal Q∗ will try to invert the decoder (see Appendix A) and take the
form
Q∗(Z|X) ≈ PG(Z|X) := PG(X|Z) PZ(Z) / PG(X).   (14)
This approximation becomes exact in the nonparametric limit of Q. When PG (Y |Z) is Gaussian
we have pG (y|z) > 0 for all y ∈ X and z ∈ Z, showing that supp Q∗ (Z|X = x) = supp PZ for all
x ∈ X . This will again lead to the overlap of encoders if supp PZ = Z. In contrast, the optimal
encoders of AAE and POT do not necessarily overlap, as they are not inverting the decoders.
The common belief today is that the blurriness of VAEs is caused by the ℓ2 reconstruction
cost, or equivalently by the Gaussian form of decoders PG (Y |Z). We argue that it is instead
caused by the combination of (a) Gaussian decoders and (b) the objective (KL-divergence) being
minimized.
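The averaging effect described above can be seen in a tiny toy example (ours, not from the paper): two training points whose Gaussian encoders overlap force the optimal decoder, at any shared code z, to output a weighted mean of the two points.

import numpy as np

# Two training points in X = R whose Gaussian encoders Q(Z|X=x_i) = N(z; m_i, 1) overlap.
x1, x2 = -1.0, 1.0
m1, m2 = -0.5, 0.5

def optimal_decoder(z):
    # G*(z) = E_{P_z*}[X], with P_z*(x_i) proportional to P_X(x_i) * q(z | x_i) (P_X uniform here).
    w1 = np.exp(-0.5 * (z - m1) ** 2)  # unnormalized Gaussian encoder densities at code z
    w2 = np.exp(-0.5 * (z - m2) ** 2)
    return (w1 * x1 + w2 * x2) / (w1 + w2)

print(optimal_decoder(0.0))   # 0.0: the average of x1 and x2, i.e. a "blurry" output
print(optimal_decoder(-3.0))  # close to x1, since Q(Z|X=x2) puts little mass there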
4.2 The 1-Wasserstein distance: relation to WGAN
We have shown that the DPOT criterion leads to a generalized version of the AAE algorithm and
can be seen as a relaxation of the optimal transport cost Wc . In particular, if we choose c to be the
Euclidean distance c(x, y) = ‖x − y‖, we get a primal formulation of W1. This is the same criterion
that WGAN aims to minimize in the dual formulation (see Eq. 2). As a result of Theorem 1, we
have
W1(PX, PG) = inf_{Q: QZ=PZ} E_{X∼PX, Z∼Q(Z|X)}[‖X − G(Z)‖] = sup_{f∈FL} E_{PX}[f(X)] − E_{PZ}[f(G(Z))].
This means we can now approach the problem of optimizing W1 in two distinct ways, taking
gradient steps either in the primal or in the dual forms. Denote by Q∗ the optimal encoder in
the primal and f ∗ the optimal witness function in the dual. By the envelope theorem, gradients
of W1 with respect to G can be computed by taking a gradient of the criteria evaluated at the
optimal points Q∗ or f ∗ .
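Written out (our paraphrase of the envelope-theorem argument, with the gradient taken with respect to the parameters of G as above), the two gradient estimates are

∇_G W1(PX, PG) = E_{X∼PX, Z∼Q∗(Z|X)}[ ∇_G ‖X − G(Z)‖ ]   (primal),
∇_G W1(PX, PG) = −E_{Z∼PZ}[ ∇_G f∗(G(Z)) ]   (dual),

so in practice the quality of the gradient hinges on how well Q∗ (primal) or f∗ (dual) is approximated at each step.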
Despite the theoretical equivalence of both approaches, practical considerations lead to differ-
ent behaviours and to potentially poor approximations of the real gradients. For example, in the
dual formulation, one usually restricts the witness functions to be smooth, while in the primal
formulation, the constraint on Q is only approximately enforced. We will study the effect of these
approximations.
Imperfect gradients in the dual (i.e., for WGAN) We show that (i) if the true optimum f ∗
is not reached exactly (no matter how close), the effect on the gradient in the dual formulation can
be arbitrarily large, and (ii) this also holds when the optimization is performed only in a restricted
class of smooth functions. We write the criterion to be optimized as JD (f ) := EPX [f (X)] −
EPZ[f(G(Z))] and denote its gradient with respect to G by ∇JD(f). Let H be a subset of the 1-Lipschitz functions FL on X containing smooth functions with bounded Hessian. Denote by h∗ the minimizer of JD in H. A(f, f′) := cos(∇JD(f), ∇JD(f′)) will denote the cosine of the angle between the gradients of the criterion at different functions.
Proposition 3. There exists a constant C > 0 such that for any ε > 0, one can construct distributions PX, PG and pick witness functions fε ∈ FL and hε ∈ H that are ε-optimal, i.e. |JD(fε) − JD(f∗)| ≤ ε and |JD(hε) − JD(h∗)| ≤ ε, but which give (at some point z ∈ Z) gradients whose direction is at least C-wrong: A(fε, f∗) ≤ 1 − C, A(h0, h∗) ≤ 1 − C, and A(hε, h∗) ≤ 0.
Imperfect posterior in the primal (i.e., for POT) In the primal formulation, when the
constraint is violated, that is, when the aggregated posterior QZ does not match PZ, there can be two
kinds of negative effects: (i) the gradient of the criterion is only computed on a (possibly small)
subset of the latent space reached by QZ ; (ii) several input points could be mapped by Q(Z|X)
to the same latent code z, thus giving gradients that encourage G(z) to be the average/median of
several inputs (hence encouraging a blurriness). See Section 4.1 for the details and Figure 1 for
an illustration.
5 Conclusion
This work proposes a way to fit generative models by minimizing any optimal transport cost. It
also establishes novel links between different popular unsupervised probabilistic modeling tech-
niques. Whilst our contribution is on the theoretical side, it is reassuring to note that the empirical
results of [1] show the strong performance of our method for the special case of the 2-Wasserstein
distance. Experiments with other cost functions c are beyond the scope of the present work and
left for future studies.
Acknowledgments
The authors are thankful to Mateo Rojas-Carulla and Fei Sha for stimulating discussions. CJSG
is supported by a Google European Doctoral Fellowship in Causal Inference.
References
[1] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. In ICLR, 2016.
[2] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
[5] C. Villani. Topics in Optimal Transportation. AMS Graduate Studies in Mathematics, 2003.
[6] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational bayes: Unifying variational
autoencoders and generative adversarial networks, 2017.
[7] A. Genevay, M. Cuturi, G. Peyré, and F. R. Bach. Stochastic optimization for large-scale
optimal transport. In NIPS, pages 3432–3440, 2016.
[8] F. Liese and K.-J. Miescke. Statistical Decision Theory. Springer, 2008.
[9] L. Kantorovich. On the transfer of masses (in Russian). Doklady Akademii Nauk, 37(2):227–
229, 1942.
[10] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
[11] B. Poole, A. Alemi, J. Sohl-Dickstein, and A. Angelova. Improved generator objectives for
GANs, 2016.
[12] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new per-
spectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 2013.
[13] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and
Examples. Springer-Verlag, 2006.
[14] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37, 1991.
A Further details on VAEs and GANs
VAE, KL-divergence and a marginal log-likelihood For models PG of the form (3) and
any conditional distribution Q(Z|X) it can be easily verified that
−E_{PX}[log PG(X)] = −E_{PX}[ DKL(Q(Z|X), PG(Z|X)) ] + E_{PX}[ DKL(Q(Z|X), PZ) − E_{Q(Z|X)}[log pG(X|Z)] ].   (15)
Here the conditional distribution PG (Z|X) is induced by a joint distribution PG,Z (X, Z), which
is in turn specified by the 2-step latent variable procedure: (a) sample Z from PZ , (b) sample
X from PG (X|Z). Note that the first term on the r.h.s. of (15) is always non-positive, while the
l.h.s. does not depend on Q. This shows that if conditional distributions Q are not restricted then
−E_{PX}[log PG(X)] = inf_{Q} E_{PX}[ DKL(Q(Z|X), PZ) − E_{Q(Z|X)}[log pG(X|Z)] ],
where the infimum is achieved for Q(Z|X) = PG (Z|X). However, for any restricted class Q of
conditional distributions Q(Z|X) we only have

−E_{PX}[log PG(X)] ≤ DVAE(PX, PG),

where the inequality accounts for the fact that Q(Z|X) might not be flexible enough to match PG(Z|X) for all values of X.
= DAVB (PX , PG ).
Proof. We already mentioned that DGAN(P, Q) ≤ 2 · DJS(P, Q) − log(4) for any distributions P and Q. Furthermore, DJS(P, Q) ≤ ½ DTV(P, Q) [14, Theorem 3] and DTV(P, Q) ≤ √(DKL(P, Q)).
Together with the joint convexity of DJS and Jensen's inequality this implies

DAAE(PX, PG) := inf_{Q(Z|X)∈Q} DGAN(∫_X Q(Z|x) pX(x) dx, PZ) − E_{PX} E_{Q(Z|X)}[log pG(X|Z)]
≤ inf_{Q(Z|X)∈Q} DJS(∫_X Q(Z|x) pX(x) dx, PZ) − E_{PX} E_{Q(Z|X)}[log pG(X|Z)]
≤ inf_{Q(Z|X)∈Q} E_{PX}[ DJS(Q(Z|X), PZ) − E_{Q(Z|X)}[log pG(X|Z)] ]
≤ inf_{Q(Z|X)∈Q} E_{PX}[ ½ √(DKL(Q(Z|X), PZ)) − E_{Q(Z|X)}[log pG(X|Z)] ]
≤ inf_{Q(Z|X)∈Q} E_{PX}[ DKL(Q(Z|X), PZ) − E_{Q(Z|X)}[log pG(X|Z)] ]
= DVAE(PX, PG).
B Proofs
B.1 Proof of Theorem 1
We start by introducing an important lemma relating the two sets over which Wc and Wc† are
optimized.
Lemma 6. PX,Y ⊆ P(PX, PG), with identity if the PG(Y|Z = z) are Dirac distributions for all z ∈ Z.
Proof. The first assertion is obvious. To prove the identity, note that when Y is a deterministic function of Z, for any A in the sigma-algebra induced by Y we have E[1_{[Y∈A]} | X, Z] = E[1_{[Y∈A]} | Z]. This implies (Y ⊥⊥ X) | Z and concludes the proof.
Inequality (9) and the first identity in (10) obviously follow from Lemma 6. The tower rule of expectation and the conditional independence property of PX,Y,Z imply

Wc†(PX, PG) = inf_{P∈PX,Y,Z} E_{PZ} E_{X∼P(X|Z)} E_{Y∼P(Y|Z)}[c(X, Y)] = inf_{P∈PX,Z} E_{(X,Z)∼P}[c(X, G(Z))],

where the last equality uses PG(Y|Z = z) = δG(z). Together with the fact that PX,Z = P(X ∼ PX, Z ∼ PZ) this establishes (10) and (11).

B.2 Proof of Corollary 7

Corollary 7. For X = Rd and Gaussian PG(Y|Z) = N(Y; G(Z), σ² · Id) we have

Wc(PX, PGσ) ≤ Wc†(PX, PGσ) = inf_{Q: QZ=PZ} E_{PX} E_{Q(Z|X)}[‖X − G(Z)‖²] + d · σ².   (16)
First inequality follows from (9). For the identity we proceed similarly to the proof of Theorem
1 and write
Wc†(PX, PG) = inf_{P∈PX,Y,Z} E_{PZ} E_{X∼P(X|Z)} E_{Y∼P(Y|Z)}[‖X − Y‖²].   (17)
Note that for PG(Y|Z) = N(Y; G(Z), diag(σ1², ..., σd²)),

E_{Y∼PG(Y|Z)}[ ‖X − Y‖² | X, Z ] = ‖X − G(Z)‖² + Σ_{i=1}^{d} σi².
Together with (17) and the fact that PX,Z = P(X ∼ PX , Z ∼ PZ ) this concludes the proof.
B.3 Proof of Proposition 2

We will use the following simple fact about Gaussian decoders.

Lemma 8. Let X = Z = R and PG(Y|Z) = N(Y; G(Z), σ²), and let Y ∼ PGσ. Then E[Y] = E_{Z∼PZ}[G(Z)] and Var[Y] = σ² + Var_{Z∼PZ}[G(Z)].

Proof. The first identity follows from the tower rule of expectation. Then
Var[Y] := ∫_R ∫_Z (y − E[G(Z)])² pG(y|z) pZ(z) dz dy
= ∫_R ∫_Z (y − G(z))² pG(y|z) pZ(z) dz dy + ∫_R ∫_Z (G(z) − E[G(Z)])² pG(y|z) pZ(z) dz dy
= σ² + Var_{Z∼PZ}[G(Z)].
Next we prove the remaining implication of Proposition 2. Namely, that when σ 2 > 0 the
function G∗σ minimizing Wc (PX , PGσ ) depends on σ 2 . The proof is based on the following example:
X = Z = R, PX = N (0, 1), PZ = N (0, 1), and 0 < σ 2 < 1. Note that by setting G(z) = c · z
for any c > 0 we ensure that PGσ is a Gaussian distribution, because a convolution of two Gaussians is also Gaussian. In particular, if we take G∗(z) = √(1 − σ²) · z, Lemma 8 implies that
PGσ∗ is the standard normal Gaussian N (0, 1). In other words, we obtain the global minimum
Wc (PX , PGσ∗ ) = 0 and G∗ clearly depends on σ 2 .
B.4 Proof of Proposition 3
We just give a sketch of the proof: consider discrete distributions PX supported on two points
{x0, x1}, and PZ supported on {0, 1}, and let y0 = G(0), y1 = G(1) with y0 ≠ y1. Given an optimal f∗, one can modify it locally around y0 without changing its Lipschitz constant such that the obtained fε is an ε-approximation of f∗ whose gradients at y0 and y1 point in directions arbitrarily different from those of f∗. For smooth functions, by moving y0 and y1 away from the segment [x0, x1] but close to each other, ‖y0 − y1‖ ≤ K, the gradients of f∗ will point in directions roughly opposite, but the constraint on the Hessian will force the gradients of fF,0 at y0 and y1 to be very close. Finally, putting y0, y1 on the segment [x0, x1], one can get an f∗F whose gradients at y0 and y1 are exactly opposite, while taking fF,ε(y) = f∗F(y + ε), we can swap the direction at one of the points while changing the criterion by less than ε.