From Optimal Transport To Generative Modeling
Google Brain · Max Planck Institute for Intelligent Systems
Abstract
We study unsupervised generative modeling in terms of the optimal transport (OT) problem
between true (but unknown) data distribution PX and the latent variable model distribution
PG . We show that the OT problem can be equivalently written in terms of probabilistic
encoders, which are constrained to match the posterior and prior distributions over the latent
space. When relaxed, this constrained optimization problem leads to a penalized optimal
transport (POT) objective, which can be efficiently minimized using stochastic gradient descent
by sampling from PX and PG . We show that POT for the 2-Wasserstein distance coincides with
the objective heuristically employed in adversarial auto-encoders (AAE) [1], which provides the
first theoretical justification for AAEs known to the authors. We also compare POT to other
popular techniques like variational auto-encoders (VAE) [2]. Our theoretical results include
(a) a better understanding of the commonly observed blurriness of images generated by VAEs,
and (b) establishing duality between Wasserstein GAN [3] and POT for the 1-Wasserstein
distance.
1 Introduction
The field of representation learning was initially driven by supervised approaches, with impressive
results using large labelled datasets. Unsupervised generative modeling, in contrast, used to be
a domain governed by probabilistic approaches focusing on low-dimensional data. Recent years
have seen a convergence of those two approaches. In the new field that formed at the intersection,
variational autoencoders (VAEs) [2] form one well-established approach, theoretically elegant yet
with the drawback that they tend to generate blurry images. In contrast, generative adversarial
networks (GANs) [4] turned out to be more impressive in terms of the visual quality of images
sampled from the model, but have been reported to be harder to train. There has been a flurry of
activity in assaying numerous configurations of GANs as well as combinations of VAEs and GANs.
A unifying theory relating GANs to VAEs in a principled way is yet to be discovered. This forms
a major motivation for the studies underlying the present paper.
Following [3], we approach generative modeling from the optimal transport point of view. The
optimal transport (OT) cost [5] is a way to measure a distance between probability distributions
and provides a much weaker topology than many others, including f -divergences associated with
the original GAN algorithms. This is particularly important in applications, where data is usually
supported on low dimensional manifolds in the input space X . As a result, stronger notions of
distances (such as f-divergences, which capture the density ratio between distributions) often max out, providing no useful gradients for training. In contrast, the optimal transport cost behaves better and may lead to more stable training [3].
In this work we aim at minimizing the optimal transport cost Wc (PX , PG ) between the true
(but unknown) data distribution PX and a latent variable model PG . We do so via the primal
form and investigate theoretical properties of this optimization problem. Our main contributions
are listed below; cf. also Figure 1.
1. We derive an equivalent formulation for the primal form of Wc (PX , PG ), which makes the
role of latent space Z and probabilistic encoders Q(Z|X) explicit (Theorem 1).
2. Unlike in VAE, we arrive at an optimization problem where the Q are constrained. We relax
the constraints by penalization, arriving at the penalized optimal transport (POT) objective
(13) which can be minimized with stochastic gradient descent by sampling from PX and PG .
3. We show that for squared Euclidean cost c (Section 4.1), POT coincides with the objective
of adversarial auto-encoders (AAE) [1]. We believe this provides the first theoretical jus-
tification for AAE, showing that they approximately minimize the 2-Wasserstein distance
W2 (PX , PG ). We also compare POT to VAE, adversarial variational Bayes (AVB) [6], and
other methods based on the marginal log-likelihood. In particular, we show that all these
methods necessarily suffer from blurry outputs, unlike POT or AAE.
4. When c is the Euclidean distance (Section 4.2), POT and WGAN [3] both minimize the
1-Wasserstein distance W1 (PX , PG ). They approach this problem from primal/dual forms
respectively, which leads to a different behaviour of the resulting algorithms.
Finally, following a somewhat well-established tradition of using acronyms based on the “GAN”
suffix, and because we give a recipe for blending VAE and GAN (using POT), we propose to call
this work a VEGAN cookbook.
Related work The authors of [7] address computing the OT cost in large scale using stochastic
gradient descent (SGD) and sampling. They approach this task either through the dual formula-
tion, or via a regularized version of the primal. They do not discuss any implications for generative
modeling. In contrast, our approach is based on the primal form of OT, arrives at very different regularizers, and focuses on generative modeling. The Wasserstein GAN [3] minimizes
the 1-Wasserstein distance W1 (PX , PG ) for generative modeling. The authors approach this task
from the dual form. Unfortunately, their algorithm cannot be readily applied to any other OT
cost, because the famous Kantorovich duality holds only for W1 . In contrast, our algorithm POT
approaches the same problem from the primal form and can be applied for any cost function c.
The present paper is structured as follows. In Section 2 we introduce our notation and discuss
existing generative modeling techniques, including GANs, VAEs, and AAE. Section 3 contains
our main results, including a theoretical analysis of the primal form of OT and the novel POT
objective. Section 4 discusses the implications of our new results. Finally, we discuss future work
in Section 5. Proofs may be found in Appendix B.
[Figure 1 panels: (a) VAE and AVB; (b) Optimal transport (primal form) and AAE.]
Figure 1: Different behaviours of generative models. The top half represents the latent space Z
with codes (triangles) sampled from PZ . The bottom half represents the data space X , with true
data points X (circles) and generated ones Y (squares). The arrows represent the conditional
distributions. Generally these are not one to one mappings, but for improved readability we
show only one or two arrows to the most likely points. In the left figure, describing VAE [2] and AVB [6], ΓVAE(Y|X) is a composite of the encoder QVAE(Z|X) and the decoder PG(Y|Z), mapping each true data point X to a distribution over generated points Y. For a fixed decoder, the optimal encoder Q∗VAE will assign mass proportionally to the distance between Y and X and the probability PZ(Z) (see Eq. 14). We see how different points X are mapped with high probability to the same Y, while other generated points Y are reached only with low probabilities. In the right figure, the OT is expressed as a conditional mapping ΓOT(Y|X). One of our main results (Theorem 1) shows that this mapping can be reparametrized via the transport X → Z → Y, making explicit the role of the encoder QOT(Z|X).
2 Notation and preliminaries
We denote probability distributions with capital letters (i.e. P(X)) and densities with lower case letters (i.e. p(x)). By δx we denote the Dirac distribution putting mass 1 on x ∈ X, and supp P denotes the support of P.
We will often need to measure the agreement between two probability distributions P and Q, and there are many ways to do so. The class of f-divergences [8] is defined by

Df(P‖Q) := ∫ f(p(x)/q(x)) q(x) dx,

where f: (0, ∞) → R is any convex function satisfying f(1) = 0. It is known that Df ≥ 0 and Df = 0 if P = Q. Classical examples include the Kullback-Leibler DKL and Jensen-Shannon DJS divergences. Another rich class of divergences is induced by the optimal transport (OT) problem [5]. Kantorovich's formulation [9] of the problem is given by

Wc(PX, PG) := inf_{Γ ∈ P(X∼PX, Y∼PG)} E_{(X,Y)∼Γ}[c(X, Y)],   (1)

where c(x, y): X × X → R+ is any measurable cost function and P(X ∼ PX, Y ∼ PG) is the set of all joint distributions of (X, Y) with marginals PX and PG respectively. When c(x, y) = d^p(x, y) for a metric d and p ≥ 1, Wp := (Wc)^{1/p} is the p-Wasserstein distance. For p = 1 the Kantorovich-Rubinstein duality yields

W1(PX, PG) = sup_{f ∈ FL} E_{X∼PX}[f(X)] − E_{Y∼PG}[f(Y)],   (2)

where FL is the class of all bounded 1-Lipschitz functions on (X, d). Note that the same symbol is used for Wp and Wc, but only p is a number; thus the above W1 refers to the 1-Wasserstein distance.
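As a toy illustration (our example, not part of the original development), in one dimension the optimal coupling in the primal form (1) between two empirical measures with equally many atoms simply matches sorted samples, so W1 can be estimated directly; the Gaussian parameters below are arbitrary illustrative choices.

import numpy as np

def w1_empirical_1d(x, y):
    # In 1-D the optimal coupling of two empirical measures with the same number of atoms
    # pairs the sorted samples, so the primal form (1) with c(x, y) = |x - y| reduces to the
    # average absolute difference of order statistics.
    x, y = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    assert x.shape == y.shape
    return np.mean(np.abs(x - y))

rng = np.random.default_rng(0)
samples_px = rng.normal(0.0, 1.0, size=10_000)  # samples from P_X
samples_pg = rng.normal(0.5, 1.0, size=10_000)  # samples from P_G
print(w1_empirical_1d(samples_px, samples_pg))  # close to 0.5, the W1 distance between these two Gaussians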
Modern generative models, including VAEs and GANs, are built around latent variable models PG defined by a two-step procedure: first a code Z is sampled from a fixed prior distribution PZ over a latent space Z, and then Z is mapped to X ∈ X by a (possibly random) transformation parametrized by G. When PG(X|Z) admits a density, this results in

pG(x) := ∫_Z pG(x|z) pz(z) dz,   (3)

assuming all involved densities are properly defined. These models are easy to sample from and, provided G can be differentiated analytically with respect to its parameters, PG can be trained with SGD. The field is growing rapidly and numerous variations of VAEs and GANs are available in the literature. Next we introduce and compare several of them.
The original generative adversarial network (GAN) [4] approach minimizes
DGAN(PX, PG) = sup_{T∈T} E_{X∼PX}[log T(X)] + E_{Z∼PZ}[log(1 − T(G(Z)))]   (4)
where T is a class of discriminator functions taking values in (0, 1). It is known that DGAN(PX, PG) ≤ 2 · DJS(PX, PG) − log 4, and [10, 11] generalize this adversarial approximation to arbitrary f-divergences, resulting in objectives Df,GAN. The Wasserstein GAN (WGAN) [3] instead minimizes

DWGAN(PX, PG) = sup_{T∈W} E_{X∼PX}[T(X)] − E_{Z∼PZ}[T(G(Z))],

where W is any subset of 1-Lipschitz functions on X. It follows from (2) that DWGAN(PX, PG) ≤ W1(PX, PG) and thus WGAN is minimizing a lower bound on the 1-Wasserstein distance.
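A minimal numerical sketch (ours, not from the paper) of estimating this dual lower bound with a small critic network whose weights are clipped, as in [3], to keep it 1-Lipschitz; the architecture, data, and hyperparameters are arbitrary illustrative choices.

import torch
import torch.nn as nn

critic = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1))  # a small restricted critic class W
opt = torch.optim.RMSprop(critic.parameters(), lr=5e-4)

def wgan_lower_bound(x_real, x_fake, steps=200, clip=0.01):
    # Maximize E[T(X)] - E[T(G(Z))] over the clipped critic; with this small clipping value the
    # network stays 1-Lipschitz, so the result lower bounds W1 between the two empirical samples.
    for _ in range(steps):
        loss = -(critic(x_real).mean() - critic(x_fake).mean())  # gradient ascent on the dual criterion
        opt.zero_grad()
        loss.backward()
        opt.step()
        for p in critic.parameters():  # weight clipping, as used in [3], to control the Lipschitz constant
            p.data.clamp_(-clip, clip)
    with torch.no_grad():
        return (critic(x_real).mean() - critic(x_fake).mean()).item()

x_real = torch.randn(512, 2)        # stand-in samples from P_X
x_fake = torch.randn(512, 2) + 1.0  # stand-in samples from P_G (a shifted Gaussian)
print(wgan_lower_bound(x_real, x_fake))  # a (typically loose) lower bound on W1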
Variational auto-encoders (VAE) [2] utilize models PG of the form (3) and minimize
DVAE(PX, PG) = inf_{Q(Z|X)∈Q} E_{PX}[ DKL(Q(Z|X), PZ) − E_{Q(Z|X)}[log pG(X|Z)] ]   (5)
with respect to a random decoder mapping PG (X|Z). The conditional distribution PG (X|Z)
is often parametrized by a deep net G and can have any form as long as its density pG (x|z)
can be computed and differentiated with respect to the parameters of G. A typical choice is
to use Gaussians PG (X|Z) = N (X; G(Z), σ 2 · I). If Q is the set of all conditional probability
distributions Q(Z|X), the objective of VAE coincides with the negative marginal log-likelihood
DVAE (PX , PG ) = −EPX [log PG (X)]. However, in order to make the DKL term of (5) tractable
in closed form, the original implementation of VAE uses a standard normal PZ and restricts
Q to a class of Gaussian distributions Q(Z|X) = N(Z; µ(X), Σ(X)) with mean µ and diagonal
covariance Σ parametrized by deep nets. As a consequence, VAE is minimizing an upper bound on
the negative log-likelihood or, equivalently, on the KL-divergence DKL (PX , PG ). Further details
can be found in Section A.
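For concreteness, here is a minimal sketch (ours) of a single-sample Monte Carlo estimate of the bracketed term in (5), assuming a standard normal PZ, a diagonal Gaussian encoder, and a Gaussian decoder with fixed variance σ²; the maps mu_enc, logvar_enc, and G are hypothetical stand-ins for deep nets.

import numpy as np

def vae_objective_sample(x, mu_enc, logvar_enc, G, sigma2=1.0):
    # Assumes P_Z = N(0, I), Q(Z|X=x) = N(mu_enc(x), diag(exp(logvar_enc(x)))),
    # and a Gaussian decoder P_G(X|Z) = N(G(z), sigma2 * I).
    rng = np.random.default_rng(0)
    mu, logvar = mu_enc(x), logvar_enc(x)
    # Closed-form KL(Q(Z|X=x) || P_Z) for diagonal Gaussians.
    kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
    # Reparametrized sample z ~ Q(Z|X=x), then -log p_G(x|z) up to an additive constant.
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
    neg_log_lik = np.sum((x - G(z)) ** 2) / (2.0 * sigma2)
    return kl + neg_log_lik

# Toy usage with linear "networks" (hypothetical stand-ins for deep nets).
x = np.array([0.3, -1.2])
print(vae_objective_sample(x, mu_enc=lambda x: 0.5 * x,
                           logvar_enc=lambda x: np.zeros_like(x),
                           G=lambda z: 2.0 * z))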
One possible way to reduce the gap between the true negative log-likelihood and the upper
bound provided by DVAE is to enlarge the class Q. Adversarial variational Bayes (AVB) [6]
follows this argument by employing the idea of GANs. Given any point x ∈ X, a noise ε ∼ N(0, 1), and any fixed transformation e: X × R → Z, the random variable e(x, ε) implicitly defines one particular conditional distribution Qe(Z|X = x). AVB allows Q to contain all such distributions for different choices of e, replaces the intractable term DKL(Qe(Z|X), PZ) in (5) by the adversarial approximation Df,GAN corresponding to the KL-divergence, and proposes to minimize

DAVB(PX, PG) = inf_{Qe(Z|X)∈Q} E_{PX}[ Df,GAN(Qe(Z|X), PZ) − E_{Qe(Z|X)}[log pG(X|Z)] ].   (6)
The DKL term in (5) may be viewed as a regularizer. Indeed, VAE reduces to the classical
unregularized auto-encoder if this term is dropped, minimizing the reconstruction cost of the
encoder-decoder pair Q(Z|X), PG (X|Z). This often results in different training points being
encoded into non-overlapping zones chaotically scattered all across the Z space with “holes” in
between where the decoder mapping PG (X|Z) has never been trained. Overall, the encoder
Q(Z|X) trained in this way does not provide a useful representation and sampling from the latent
space Z becomes hard [12].
Adversarial auto-encoders (AAE) [1] replace the DKL term in (5) with another regularizer:
DAAE(PX, PG) = inf_{Q(Z|X)∈Q} DGAN(QZ, PZ) − E_{PX} E_{Q(Z|X)}[log pG(X|Z)],   (7)
where QZ is the marginal distribution of Z when first X is sampled from PX and then Z is sampled
from Q(Z|X), also known as the aggregated posterior [1]. Similarly to AVB, there is no clear link to
log-likelihood, as DAAE ≤ DAVB (see Appendix A). The authors of [1] argue that matching QZ to
PZ in this way ensures that there are no “holes” left in the latent space Z and PG (X|Z) generates
reasonable samples whenever Z ∼ PZ . They also report an equally good performance of different
types of conditional distributions Q(Z|X), including Gaussians as used in VAEs, implicit models
Qe as used in AVB, and deterministic encoder mappings, i.e. Q(Z|X) = δµ(X) with µ : X → Z.
3 From optimal transport to penalized optimal transport
Our goal is to fit the latent variable model PG by minimizing the OT cost Wc(PX, PG). In this section we explain how this can be done in the primal formulation of the OT problem (1) by reparametrizing the space of couplings (Section 3.1) and relaxing the marginal constraint (Section 3.2), leading to a formulation involving expectations over PX and PG that can thus be solved using SGD and sampling.
3.1 Reparametrization of the space of couplings
Formal statement As in Section 2, P(X ∼ PX , Y ∼ PG ) denotes the set of all joint distri-
butions of (X, Y ) with marginals PX , PG , and likewise for P(X ∼ PX , Z ∼ PZ ). The set of all
joint distributions of (X, Y, Z) such that X ∼ PX, (Y, Z) ∼ PG,Z, and (Y ⊥⊥ X) | Z will be denoted
by PX,Y,Z . Finally, we denote by PX,Y and PX,Z the sets of marginals on (X, Y ) and (X, Z) (re-
spectively) induced by distributions in PX,Y,Z . Note that P(PX , PG ), PX,Y,Z , and PX,Y depend
on the choice of conditional distributions PG (Y |Z), while PX,Z does not. In fact, it is easy to
check that PX,Z = P(X ∼ PX , Z ∼ PZ ). From the definitions it is clear that PX,Y ⊆ P(PX , PG )
and we get the following upper bound:
Wc(PX, PG) ≤ Wc†(PX, PG) := inf_{P∈PX,Y} E_{(X,Y)∼P}[c(X, Y)].   (9)
If the PG(Y|Z) are Dirac measures (i.e., Y = G(Z)), the two sets actually coincide, thus justifying
the reparametrization (8) and the illustration in Figure 1(b), as demonstrated in the following
theorem:
Theorem 1. If PG(Y|Z = z) = δG(z) for all z ∈ Z, where G: Z → X, we have

Wc(PX, PG) = Wc†(PX, PG) = inf_{P∈PX,Z} E_{(X,Z)∼P}[c(X, G(Z))]   (10)
= inf_{Q: QZ=PZ} E_{PX} E_{Q(Z|X)}[c(X, G(Z))],   (11)

where QZ is the marginal distribution of Z when X ∼ PX and Z ∼ Q(Z|X).
3.2 Relaxing the marginal constraint
The constraint QZ = PZ in (11) makes the optimization problem hard. We therefore relax it by adding to the objective a penalty λ · F(Q), where F(Q) ≥ 0 quantifies the deviation of QZ from PZ and λ > 0, resulting in

Wcλ(PX, PG) := inf_{Q(Z|X)∈Q} E_{PX} E_{Q(Z|X)}[c(X, G(Z))] + λ · F(Q).   (12)

It is well known [13] that under mild conditions adding a penalty as in (12) is equivalent to adding
a constraint of the form F (Q) ≤ µλ for some µλ > 0. As λ increases, the corresponding µλ
decreases, and as λ → ∞, the solutions of (12) reach the feasible region where PZ = QZ . This
shows that Wcλ (PX , PG ) ≤ Wc (PX , PG ) for all λ ≥ 0 and the gap reduces with increasing λ.
One possible choice for F is a convex divergence between the prior PZ and the aggregated
posterior QZ , such as DJS (QZ , PZ ), DKL (QZ , PZ ), or any other member of the f -divergence
family. However, this generally results in an intractable F. Instead, similarly to AVB, we may utilize the
adversarial approximation DGAN (QZ , PZ ), which becomes tight in the nonparametric limit. We
thus arrive at the problem of minimizing a penalized optimal transport (POT) objective
DPOT(PX, PG) := inf_{Q(Z|X)∈Q} E_{PX} E_{Q(Z|X)}[c(X, G(Z))] + λ · DGAN(QZ, PZ),   (13)
where Q is any nonparametric set of conditional distributions. If the cost function c is differen-
tiable, this problem can be solved with SGD similarly to AAE, where we iterate between updating
(a) an encoder-decoder pair Q, G and (b) an adversarial discriminator of DGAN , trying to sepa-
rate latent codes sampled from PZ and QZ . Moreover, in Section 4.1 we will show that DPOT
coincides with DAAE when c is the squared Euclidean cost and the PG (Y |Z) are Gaussian. The
good empirical performance of AAEs reported in [1] provides what we believe to be rather strong
support of our theoretical results.
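To make the alternating procedure concrete, here is a minimal PyTorch-style sketch (ours, not the authors' exact algorithm) of one POT/AAE training iteration with a deterministic encoder and c(x, y) = ‖x − y‖²; the names encoder, decoder, discrim, the architectures, and the value of λ are arbitrary placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_x, d_z, lam = 2, 2, 10.0  # illustrative dimensions and penalty coefficient

encoder = nn.Sequential(nn.Linear(d_x, 64), nn.ReLU(), nn.Linear(64, d_z))  # deterministic Q(Z|X)
decoder = nn.Sequential(nn.Linear(d_z, 64), nn.ReLU(), nn.Linear(64, d_x))  # G(Z)
discrim = nn.Sequential(nn.Linear(d_z, 64), nn.ReLU(), nn.Linear(64, 1))    # latent-space discriminator

opt_ae = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
opt_d = torch.optim.Adam(discrim.parameters(), lr=1e-3)

def pot_step(x):
    # (b) update the discriminator separating codes Z ~ P_Z from codes Z ~ Q_Z.
    z_prior = torch.randn(x.size(0), d_z)   # samples from P_Z = N(0, I)
    z_post = encoder(x).detach()            # samples from the aggregated posterior Q_Z
    d_loss = F.binary_cross_entropy_with_logits(discrim(z_prior), torch.ones(x.size(0), 1)) + \
             F.binary_cross_entropy_with_logits(discrim(z_post), torch.zeros(x.size(0), 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # (a) update the encoder-decoder pair on reconstruction cost plus the adversarial penalty.
    z = encoder(x)
    recon = ((x - decoder(z)) ** 2).sum(dim=1).mean()  # estimate of E_PX E_Q(Z|X)[c(X, G(Z))]
    adv = F.binary_cross_entropy_with_logits(discrim(z), torch.ones(x.size(0), 1))  # fool the discriminator
    opt_ae.zero_grad()
    (recon + lam * adv).backward()
    opt_ae.step()
    return recon.item(), d_loss.item()

x_batch = torch.randn(128, d_x)  # stand-in for a minibatch from P_X
pot_step(x_batch)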
Random decoders The above discussion assumed Dirac measures PG (Y |Z). If this is not the
case, we can still upper bound the 2-Wasserstein distance W2(PX, PG), corresponding to c(x, y) = ‖x − y‖², in a very similar way to Theorem 1. The case of Gaussian decoders PG(Y|Z), which
will be particularly useful when discussing the relation to VAEs, is summarized in the following
remark:
Remark 1. For X = Rd and Gaussian PG (Y |Z) = N (Y ; G(Z), σ 2 ·Id ) the value of Wc (PX , PG ) is
upper bounded by Wc† (PX , PG ), which coincides with the r.h.s. of (11) up to a d · σ 2 additive term
(see Corollary 7 in Section B.2). In other words, objective (12) coincides with the relaxed version
of Wc†(PX, PG) up to an additive constant, while DPOT corresponds to its adversarial approximation.
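The d · σ² term in Remark 1 comes from the following one-line computation, which we spell out here for intuition (it is worked out in Section B.2): conditionally on X and Z, a Gaussian decoder inflates the expected squared reconstruction error by its total variance,

E_{Y∼N(G(Z), σ²·Id)}[ ‖X − Y‖² | X, Z ] = ‖X − G(Z)‖² + d · σ².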
4 Implications
4.1 The 2-Wasserstein distance: relation to VAE, AVB, and AAE
Throughout this section we take X = Rd, the squared Euclidean cost c(x, y) = ‖x − y‖², and Gaussian decoders PG(Y|Z) = N(Y; G(Z), σ² · Id), as used by VAE, AVB, and AAE. In order to verify the differentiability of log pG(x|z), all three methods require σ² > 0 and have problems handling the case of deterministic decoders (σ² = 0). To emphasize the role of the variance σ² we will denote the resulting latent variable model PGσ.
Relation to VAE and AVB The analysis of Section 3 shows that the value of W2 (PX , PGσ )
is upper bounded by Wc† (PX , PGσ ) of the form (16) and the two coincide when σ 2 = 0. Next we
summarize properties of the solutions G minimizing these two values W2 and Wc†:
Proposition 2. If σ² = 0 then Wc(PX, PGσ) and Wc†(PX, PGσ) coincide for any function G: Z → X. If σ² > 0 then the functions G∗σ and G† minimizing Wc(PX, PGσ) and Wc†(PX, PGσ) respectively are different: G∗σ depends on σ², while G† does not. The function G† is also a minimizer of Wc(PX, PG0).
For the purpose of generative modeling, the noise σ 2 > 0 is often not desirable, and it is
common practice to sample from the trained model G∗ by simply returning G∗ (Z) for Z ∼ PZ
without adding noise to the output. This leads to a mismatch between inference and training.
Furthermore, VAE, AVB, and other similar variational methods implicitly use σ 2 as a factor to
balance the ℓ2 reconstruction cost and the KL-regularizer.
In contrast, Proposition 2 shows that for the same Gaussian models with any given σ 2 ≥ 0 we
can minimize Wc†(PX, PGσ) and the solution G† will indeed be the one resulting in the smallest 2-
Wasserstein distance between PX and the noiseless implicit model G(Z), Z ∼ PZ used in practice.
Relation to AAE Next, we wish to convey the following intriguing finding. Substituting
an analytical form of log pG (x|z) in (7), we immediately see that the DAAE objective coincides
with DPOT up to additive terms independent of Q and G when the regularization coefficient λ
is set to 2σ 2 .
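To spell this substitution out (a short derivation we add for clarity, using the Gaussian decoder of this section): since

−log pG(x|z) = ‖x − G(z)‖² / (2σ²) + (d/2) log(2πσ²),

plugging this into (7) gives

DAAE(PX, PG) = inf_{Q(Z|X)∈Q} DGAN(QZ, PZ) + (1/(2σ²)) E_{PX} E_{Q(Z|X)}[‖X − G(Z)‖²] + (d/2) log(2πσ²),

and multiplying by the positive constant 2σ² (which does not change the minimizers over Q and G) yields exactly DPOT(PX, PG) with c(x, y) = ‖x − y‖² and λ = 2σ², up to the additive constant.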
For 0 < σ 2 < ∞ this means (see Remark 1) that AAE is minimizing the penalized relaxation
DPOT of the constrained optimization problem corresponding to Wc† (PX , PGσ ). The size of the gap
between DPOT and Wc† depends on the choice of λ, i.e., on σ 2 . If σ 2 → 0, we know (Remark 1)
that the upper bound Wc† converges to the OT cost Wc; however, the relaxation DPOT gets loose,
as λ = 2σ 2 → 0. In this case AAE approaches the classical unregularized auto-encoder and does
not have any connections to the OT problem. If σ 2 → ∞, the solution of the penalized objective
DPOT reaches the feasible region of the original constrained optimization problem (11), because
λ = 2σ 2 → ∞, and as a result DPOT converges to Wc† (PX , PGσ ). In this case AAE is searching
for the solution G† of minG Wc† (PX , PGσ ), which is also the function minimizing Wc (PX , PG0 ) for
the deterministic decoder Y = G(Z) according to Proposition 2. In other words, the function G†
learned by AAE with σ 2 → ∞ minimizes the 2-Wasserstein distance between PX and G(Z) when
Z ∼ PZ .
The authors of [6] tried to establish a connection between AAE and log-likelihood maximiza-
tion. They argued that AAE is “a crude approximation” to AVB. Our results suggest that AAE
is in fact attempting to minimize the 2-Wasserstein distance between PX and PGσ , which may
explain its good empirical performance reported in [1].
Blurriness of VAE and AVB We next add to the discussion regarding the blurriness com-
monly attributed to VAE samples. Our argument shows that VAE, AVB, and other methods based
on the marginal log-likelihood necessarily lead to an averaging in the input space if PG (Y |Z) are
Gaussian.
First we notice that in the VAE and AVB objectives, for any fixed encoder Q(Z|X), the decoder
is minimizing the expected ℓ2-reconstruction cost E_{PX} E_{Q(Z|X)}[‖X − G(Z)‖²] with respect to G.
The optimal solution G∗ is of the form G∗ (z) = EPz∗ [X], where Pz∗ (X) ∝ PX (X)Q(Z = z|X).
Hence, as soon as supp Pz∗ is non-singleton, the optimal decoder G∗ will end up averaging points
in the input space. In particular this will happen whenever there are two points x1 , x2 in supp PX
such that supp Q(Z|X = x1 ) and supp Q(Z|X = x2 ) overlap.
This overlap necessarily happens in VAEs, which use Gaussian encoders Q(Z|X) supported
on the entire Z. When probabilistic encoders Q are allowed to be flexible enough, as in AVB, for
any fixed PG (Y |Z) the optimal Q∗ will try to invert the decoder (see Appendix A) and take the
form
Q∗(Z|X) ≈ PG(Z|X) := PG(X|Z) PZ(Z) / PG(X).   (14)
This approximation becomes exact in the nonparametric limit of Q. When PG (Y |Z) is Gaussian
we have pG (y|z) > 0 for all y ∈ X and z ∈ Z, showing that supp Q∗ (Z|X = x) = supp PZ for all
x ∈ X . This will again lead to the overlap of encoders if supp PZ = Z. In contrast, the optimal
encoders of AAE and POT do not necessarily overlap, as they are not inverting the decoders.
The common belief today is that the blurriness of VAEs is caused by the ℓ2 reconstruction
cost, or equivalently by the Gaussian form of decoders PG (Y |Z). We argue that it is instead
caused by the combination of (a) Gaussian decoders and (b) the objective (KL-divergence) being
minimized.
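The averaging effect described above can be seen in a tiny toy example (ours, not from the paper): two training points whose Gaussian encoders overlap force the optimal decoder, at any shared code z, to output a weighted mean of the two points.

import numpy as np

# Two training points in X = R whose Gaussian encoders Q(Z|X=x_i) = N(z; m_i, 1) overlap.
x1, x2 = -1.0, 1.0
m1, m2 = -0.5, 0.5

def optimal_decoder(z):
    # G*(z) = E_{P_z*}[X], with P_z*(x_i) proportional to P_X(x_i) * q(z | x_i) (P_X uniform here).
    w1 = np.exp(-0.5 * (z - m1) ** 2)  # unnormalized Gaussian encoder densities at code z
    w2 = np.exp(-0.5 * (z - m2) ** 2)
    return (w1 * x1 + w2 * x2) / (w1 + w2)

print(optimal_decoder(0.0))   # 0.0: the average of x1 and x2, i.e. a "blurry" output
print(optimal_decoder(-3.0))  # close to x1, since Q(Z|X=x2) puts little mass there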
4.2 The 1-Wasserstein distance: relation to WGAN
We have shown that the DPOT criterion leads to a generalized version of the AAE algorithm and
can be seen as a relaxation of the optimal transport cost Wc . In particular, if we choose c to be the
Euclidean distance c(x, y) = ‖x − y‖, we get a primal formulation of W1. This is the same criterion
that WGAN aims to minimize in the dual formulation (see Eq. 2). As a result of Theorem 1, we
have
W1(PX, PG) = inf_{Q: QZ=PZ} E_{X∼PX, Z∼Q(Z|X)}[‖X − G(Z)‖] = sup_{f∈FL} E_{PX}[f(X)] − E_{PZ}[f(G(Z))].
This means we can now approach the problem of optimizing W1 in two distinct ways, taking
gradient steps either in the primal or in the dual forms. Denote by Q∗ the optimal encoder in
the primal and f ∗ the optimal witness function in the dual. By the envelope theorem, gradients
of W1 with respect to G can be computed by taking a gradient of the criteria evaluated at the
optimal points Q∗ or f ∗ .
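Written out (our paraphrase of the envelope-theorem argument, with the gradient taken with respect to the parameters of G as above), the two gradient estimates are

∇_G W1(PX, PG) = E_{X∼PX, Z∼Q∗(Z|X)}[ ∇_G ‖X − G(Z)‖ ]   (primal),
∇_G W1(PX, PG) = −E_{Z∼PZ}[ ∇_G f∗(G(Z)) ]   (dual),

so in practice the quality of the gradient hinges on how well Q∗ (primal) or f∗ (dual) is approximated at each step.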
Despite the theoretical equivalence of both approaches, practical considerations lead to differ-
ent behaviours and to potentially poor approximations of the real gradients. For example, in the
dual formulation, one usually restricts the witness functions to be smooth, while in the primal
formulation, the constraint on Q is only approximately enforced. We will study the effect of these
approximations.
Imperfect gradients in the dual (i.e., for WGAN) We show that (i) if the true optimum f ∗
is not reached exactly (no matter how close), the effect on the gradient in the dual formulation can
be arbitrarily large, and (ii) this also holds when the optimization is performed only in a restricted
class of smooth functions. We write the criterion to be optimized as JD (f ) := EPX [f (X)] −
EPZ[f(G(Z))] and denote its gradient with respect to G by ∇JD(f). Let H be a subset of the 1-Lipschitz functions FL on X containing smooth functions with bounded Hessian. Denote by h∗ the minimizer of JD in H. A(f, f′) := cos(∇JD(f), ∇JD(f′)) will denote the cosine of the angle between the gradients of the criterion at different functions.
Proposition 3. There exists a constant C > 0 such that for any ε > 0, one can construct distributions PX, PG and pick witness functions fε ∈ FL and hε ∈ H that are ε-optimal, i.e. |JD(fε) − JD(f∗)| ≤ ε and |JD(hε) − JD(h∗)| ≤ ε, but which give (at some point z ∈ Z) gradients whose direction is at least C-wrong: A(fε, f∗) ≤ 1 − C, A(h0, h∗) ≤ 1 − C, and A(hε, h∗) ≤ 0.
Imperfect posterior in the primal (i.e., for POT) In the primal formulation, when the
constraint is violated, that is, when the aggregated posterior QZ does not match PZ, there can be two
kinds of negative effects: (i) the gradient of the criterion is only computed on a (possibly small)
subset of the latent space reached by QZ ; (ii) several input points could be mapped by Q(Z|X)
to the same latent code z, thus giving gradients that encourage G(z) to be the average/median of
several inputs (hence encouraging a blurriness). See Section 4.1 for the details and Figure 1 for
an illustration.
5 Conclusion
This work proposes a way to fit generative models by minimizing any optimal transport cost. It
also establishes novel links between different popular unsupervised probabilistic modeling tech-
niques. Whilst our contribution is on the theoretical side, it is reassuring to note that the empirical
results of [1] show the strong performance of our method for the special case of the 2-Wasserstein
distance. Experiments with other cost functions c are beyond the scope of the present work and
left for future studies.
Acknowledgments
The authors are thankful to Mateo Rojas-Carulla and Fei Sha for stimulating discussions. CJSG
is supported by a Google European Doctoral Fellowship in Causal Inference.
References
[1] A. Makhzani, J. Shlens, N. Jaitly, and I. Goodfellow. Adversarial autoencoders. In ICLR, 2016.
[2] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[3] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN, 2017.
[4] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
[5] C. Villani. Topics in Optimal Transportation. AMS Graduate Studies in Mathematics, 2003.
[6] L. Mescheder, S. Nowozin, and A. Geiger. Adversarial variational bayes: Unifying variational
autoencoders and generative adversarial networks, 2017.
[7] A. Genevay, M. Cuturi, G. Peyré, and F. R. Bach. Stochastic optimization for large-scale
optimal transport. In NIPS, pages 3432–3440, 2016.
[8] F. Liese and K.-J. Miescke. Statistical Decision Theory. Springer, 2008.
[9] L. Kantorovich. On the transfer of masses (in Russian). Doklady Akademii Nauk, 37(2):227–
229, 1942.
[10] S. Nowozin, B. Cseke, and R. Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
[11] B. Poole, A. Alemi, J. Sohl-Dickstein, and A. Angelova. Improved generator objectives for
GANs, 2016.
[12] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new per-
spectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 2013.
[13] J. M. Borwein and A. S. Lewis. Convex Analysis and Nonlinear Optimization: Theory and
Examples. Springer-Verlag, 2006.
[14] J. Lin. Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37, 1991.
A Further details on VAEs and GANs
VAE, KL-divergence and a marginal log-likelihood For models PG of the form (3) and
any conditional distribution Q(Z|X) it can be easily verified that
−E_{PX}[log PG(X)] = −E_{PX}[ DKL(Q(Z|X), PG(Z|X)) ] + E_{PX}[ DKL(Q(Z|X), PZ) − E_{Q(Z|X)}[log pG(X|Z)] ].   (15)
Here the conditional distribution PG (Z|X) is induced by a joint distribution PG,Z (X, Z), which
is in turn specified by the 2-step latent variable procedure: (a) sample Z from PZ , (b) sample
X from PG (X|Z). Note that the first term on the r.h.s. of (15) is always non-positive, while the
l.h.s. does not depend on Q. This shows that if conditional distributions Q are not restricted then
−E_{PX}[log PG(X)] = inf_{Q} E_{PX}[ DKL(Q(Z|X), PZ) − E_{Q(Z|X)}[log pG(X|Z)] ],
where the infimum is achieved for Q(Z|X) = PG (Z|X). However, for any restricted class Q of
conditional distributions Q(Z|X) we only have

−E_{PX}[log PG(X)] ≤ DVAE(PX, PG),

where the inequality accounts for the fact that Q(Z|X) might not be flexible enough to match PG(Z|X) for all values of X.
= DAVB (PX , PG ).
Proof. We already mentioned that DGAN(P, Q) ≤ 2 · DJS(P, Q) − log(4) for any distributions P and Q. Furthermore, DJS(P, Q) ≤ ½ DTV(P, Q) [14, Theorem 3] and DTV(P, Q) ≤ √(DKL(P, Q)).
Together with the joint convexity of DJS and Jensen's inequality this implies

DAAE(PX, PG) := inf_{Q(Z|X)∈Q} DGAN(∫_X Q(Z|x) pX(x) dx, PZ) − E_{PX} E_{Q(Z|X)}[log pG(X|Z)]
≤ inf_{Q(Z|X)∈Q} DJS(∫_X Q(Z|x) pX(x) dx, PZ) − E_{PX} E_{Q(Z|X)}[log pG(X|Z)]
≤ inf_{Q(Z|X)∈Q} E_{PX}[ DJS(Q(Z|X), PZ) − E_{Q(Z|X)}[log pG(X|Z)] ]
≤ inf_{Q(Z|X)∈Q} E_{PX}[ ½ √(DKL(Q(Z|X), PZ)) − E_{Q(Z|X)}[log pG(X|Z)] ]
≤ inf_{Q(Z|X)∈Q} E_{PX}[ DKL(Q(Z|X), PZ) − E_{Q(Z|X)}[log pG(X|Z)] ]
= DVAE(PX, PG).
B Proofs
B.1 Proof of Theorem 1
We start by introducing an important lemma relating the two sets over which Wc and Wc† are
optimized.
Lemma 6. PX,Y ⊆ P(PX, PG), with identity if the PG(Y|Z = z) are Dirac distributions for all z ∈ Z.
Proof. The first assertion is obvious. To prove the identity, note that when Y is a deterministic function of Z, for any A in the sigma-algebra induced by Y we have E[1_{[Y∈A]} | X, Z] = E[1_{[Y∈A]} | Z]. This implies (Y ⊥⊥ X) | Z and concludes the proof.
Inequality (9) and the first identity in (10) obviously follow from Lemma 6. The tower rule of expectation and the conditional independence property of PX,Y,Z imply

Wc†(PX, PG) = inf_{P∈PX,Y,Z} E_{PZ} E_{X∼P(X|Z)} E_{Y∼P(Y|Z)}[c(X, Y)] = inf_{P∈PX,Z} E_{(X,Z)∼P}[c(X, G(Z))],

where the last equality uses PG(Y|Z = z) = δG(z). Together with the fact that PX,Z = P(X ∼ PX, Z ∼ PZ) this establishes (10) and (11).

B.2 Proof of Corollary 7

Corollary 7. For X = Rd and Gaussian PG(Y|Z) = N(Y; G(Z), σ² · Id) we have

Wc(PX, PGσ) ≤ Wc†(PX, PGσ) = inf_{Q: QZ=PZ} E_{PX} E_{Q(Z|X)}[‖X − G(Z)‖²] + d · σ².   (16)
First inequality follows from (9). For the identity we proceed similarly to the proof of Theorem
1 and write
Wc†(PX, PG) = inf_{P∈PX,Y,Z} E_{PZ} E_{X∼P(X|Z)} E_{Y∼P(Y|Z)}[‖X − Y‖²].   (17)
Note that for PG(Y|Z) = N(Y; G(Z), diag(σ1², ..., σd²)),

E_{Y∼PG(Y|Z)}[ ‖X − Y‖² | X, Z ] = ‖X − G(Z)‖² + Σ_{i=1}^{d} σi².
Together with (17) and the fact that PX,Z = P(X ∼ PX , Z ∼ PZ ) this concludes the proof.
B.3 Proof of Proposition 2

We will use the following simple fact about Gaussian decoders.

Lemma 8. Let X = Z = R and PG(Y|Z) = N(Y; G(Z), σ²), and let Y ∼ PGσ. Then E[Y] = E_{Z∼PZ}[G(Z)] and Var[Y] = σ² + Var_{Z∼PZ}[G(Z)].

Proof. The first identity follows from the tower rule of expectation. Then
Var[Y] := ∫_R ∫_Z (y − E[G(Z)])² pG(y|z) pZ(z) dz dy
= ∫_R ∫_Z (y − G(z))² pG(y|z) pZ(z) dz dy + ∫_R ∫_Z (G(z) − E[G(Z)])² pG(y|z) pZ(z) dz dy
= σ² + Var_{Z∼PZ}[G(Z)].
Next we prove the remaining implication of Proposition 2. Namely, that when σ 2 > 0 the
function G∗σ minimizing Wc (PX , PGσ ) depends on σ 2 . The proof is based on the following example:
X = Z = R, PX = N (0, 1), PZ = N (0, 1), and 0 < σ 2 < 1. Note that by setting G(z) = c · z
for any c > 0 we ensure that PGσ is a Gaussian distribution, because a convolution of two Gaussians is also Gaussian. In particular, if we take G∗(z) = √(1 − σ²) · z, Lemma 8 implies that
PGσ∗ is the standard normal Gaussian N (0, 1). In other words, we obtain the global minimum
Wc (PX , PGσ∗ ) = 0 and G∗ clearly depends on σ 2 .
B.4 Proof of Proposition 3
We just give a sketch of the proof: consider discrete distributions PX supported on two points
{x0, x1}, and PZ supported on {0, 1}, and let y0 = G(0), y1 = G(1) with y0 ≠ y1. Given an optimal f∗, one can modify it locally around y0 without changing its Lipschitz constant such that the obtained fε is an ε-approximation of f∗ whose gradients at y0 and y1 point in directions arbitrarily different from those of f∗. For smooth functions, by moving y0 and y1 away from the segment [x0, x1] but close to each other, ‖y0 − y1‖ ≤ K, the gradients of f∗ will point in directions roughly opposite, but the constraint on the Hessian will force the gradients of fF,0 at y0 and y1 to be very close. Finally, putting y0, y1 on the segment [x0, x1], one can get an f∗F whose gradients at y0 and y1 are exactly opposite, while taking fF,ε(y) = f∗F(y + ε), we can swap the direction at one of the points while changing the criterion by less than ε.