Optimal Transport
Nhat Ho
1
Talk Outline
• Applications/ Methods of Optimal Transport (OT): Brief Introduction
• Foundations of Optimal Transport
• Monge’s Optimal Transport Formulation
• Kantorovich’s Optimal Transport Formulation
• Entropic Regularized Optimal Transport
• Application of Optimal Transport to Deep Generative Model
• Wasserstein GAN
• Issues of Wasserstein GAN and Solutions
2
Some Applications/ Methods of Optimal
Transport (OT): Brief Introduction
3
OT’s Method: Deep Generative Model
CIFAR 10
Speech
Goal: Given a set of high-dimensional data (e.g., images, speech, words, etc.),
we would like to learn the underlying data distribution
4
OT’s Method: Deep Generative Model
• OT is used as a loss between the push-forward distribution from a low-dimensional
space and the empirical distribution of the data
5
OT’s Method: Transfer Learning
• Optimal transport is an efficient loss function that captures the difference between
the source and target domains (e.g., [4] and [5])
6
OT’s Method: Transfer Learning
[6] Trung Nguyen, Hieu Pham, Tam Le, Tung Pham, Nhat Ho, Son Hua. Point-set distances for learning representations of 3D point clouds. ICCV, 2021
8
OT’s Method: (Multilevel) Clustering
• Each image contains several annotated regions, such as those of animals,
buildings, trees, etc.
• Goal: Based on the clustering behaviors of annotated regions from the images,
we would like to learn the themes/clusters of images
9
OT’s Method: Multilevel Clustering
[7] Nhat Ho, Long Nguyen, Mikhail Yurochkin, Hung Bui, Viet Huynh, and Dinh Phung. Multilevel clustering via Wasserstein means. ICML, 2017
[8] Viet Huynh, Nhat Ho, Nhan Dam, Long Nguyen, Mikhail Yurochkin, Hung Bui, Dinh Phung. On efficient multilevel clustering via Wasserstein distances. Journal of Machine
Learning Research (JMLR), 2021
10
OT’s Method: Other Applications
• Optimal Transport is also a powerful tool for other important applications:
• Forecasting Time Series (e.g., forecasting sales (Walmart), forecasting
expenses (Amazon), etc.) [9]
11
OT is also useful as a foundational theory tool
[Figure from [12]: a deconvolutional generative model — object category at layer L, latent variables and intermediate renderings at layers L−1 through 1, and the image at layer 0]
[12] Tan Nguyen, Nhat Ho, Ankit Patel, Anima Anandkumar, Michael I. Jordan, Richard Baraniuk. A Bayesian Perspective of Convolutional Neural Networks through a
Deconvolutional Generative Model. Under Revision, Journal of Machine Learning Research (JMLR), 2022
12
OT is also useful as a foundational theory tool
• A few other popular applications of OT for understanding machine learning methods and
models include the convergence analysis of parameter estimation in mixture models [13, 14, 15]
and Wasserstein distributionally robust optimization [16, 17]
13
Foundations of Optimal Transport
• Monge’s Optimal Transport Formulation
• Kantorovich’s Optimal Transport Formulation
• Entropic Regularized Optimal Transport
14
Monge’s OT Formulation: Motivation
• Optimal Transport was created by mathematician Gaspard Monge to find
optimal ways to transport commodities and products under certain constraints
15
Monge’s OT Formulation: Motivation
• We start with a simple practical example of moving products from bakeries
(denoted by B) to restaurants (denoted by R), with transportation cost Ci,j between
bakery Bi and restaurant Rj
18
Monge’s OT Formulation: Equivalent Form
• We define Pn = (1/n) ∑_{i=1}^{n} δ_{Bi} and Qn = (1/n) ∑_{i=1}^{n} δ_{Ri} as corresponding
empirical measures of bakeries and restaurants
• We denote Cij = ∥Bi − Rj∥² as the distance between Bi and Rj
• The Monge's formulation in equation (1) can be rewritten as
inf_T ∫ ∥x − T(x)∥² dPn(x),
where the mapping T : ℝ^d → ℝ^d in the infimum is such that T♯Pn = Qn
• Recall that, Pn = (1/n) ∑_{i=1}^{n} δ_{Bi} and T : ℝ^d → ℝ^d
• Then, T♯Pn = (1/n) ∑_{i=1}^{n} δ_{T(Bi)}
20
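As a small aside (not on the slides), the push-forward of an empirical measure can be computed explicitly: applying T to each support point gives T♯Pn. A minimal NumPy sketch with hypothetical locations and map:

```python
import numpy as np

# Empirical measure P_n = (1/n) sum_i delta_{B_i}, with toy 2D "bakery" locations B_i
B = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def T(x):
    """A hypothetical transport map: shift every location by a fixed vector."""
    return x + np.array([1.0, 2.0])

# Push-forward T#P_n = (1/n) sum_i delta_{T(B_i)}: same uniform weights, mapped support points
pushforward_support = np.array([T(b) for b in B])
weights = np.full(len(B), 1.0 / len(B))
```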
General Monge’s OT Formulation
• In general, we can define the Monge's optimal transport
beyond discrete probability distributions, such as
Gaussian distributions:
inf_T ∫ ∥x − T(x)∥² dP(x),   (2)
where the mapping T : ℝ^d → ℝ^d in the infimum is such that T♯P = Q
• Note that, for continuous distributions, T♯P = Q means
that P(T⁻¹(A)) = Q(A) for any measurable set A of ℝ^d
21
General Monge’s OT Formulation: Challenges
• Good settings: When (i) P and Q admit density functions or (ii) P and
Q are discrete with uniform weights, there exist optimal maps T that
solve the Monge’s OT in equation (2)
• Pathological settings:
• In certain settings when P and Q are discrete, the existence of a
mapping T such that T♯P = Q may not always be possible
• Assume that P = δ_x and Q = (1/2) δ_{y1} + (1/2) δ_{y2}; the equation T♯P = Q
means that
P(T⁻¹({y1})) = Q({y1}) = 1/2
• However, this is not possible, as P(T⁻¹({y1})) ∈ {0, 1} depending
on whether x ∈ T⁻¹({y1})
22
General Monge’s OT Formulation: Challenges
• The non-existence of transport map T under pathological settings makes it
challenging to use Monge’s OT formulation when the probability distributions
P and Q are discrete
• Furthermore, due to the non-linearity of the constraint T♯P = Q, it is non-
trivial to solve for or approximate the optimal mapping T in equation (2)
23
Kantorovich’s Optimal Transport Formulation
24
Kantorovich’s OT Formulation
• Given two probability distributions P and Q, the Kantorovich’s Optimal Transport
between P and Q can be defined as
OT(P, Q) := inf_{π ∈ Π(P,Q)} ∫ c(x, y) dπ(x, y),   (3)
where Π(P, Q) denotes the set of joint probability measures between P and Q and c(·, ·) is a given cost function
• Under certain assumptions (see Section 4 in [18]), the Kantorovich’s OT and Monge’s
OT are equivalent
25
Kantorovich’s OT for Discrete Measures
• When P = δ_η and Q = ∑_{i=1}^{m} q_i δ_{θi}, then
OT(P, Q) = ∑_{i=1}^{m} q_i ⋅ c(η, θ_i)
• When P = ∑_{i=1}^{n} p_i δ_{ηi} and Q = ∑_{j=1}^{m} q_j δ_{θj}, then
OT(P, Q) = min_{π ≥ 0} ∑_{i=1}^{n} ∑_{j=1}^{m} π_{ij} ⋅ c(η_i, θ_j),   (4)
s.t. ∑_{i=1}^{n} π_{ij} = q_j for all 1 ≤ j ≤ m; ∑_{j=1}^{m} π_{ij} = p_i for all 1 ≤ i ≤ n
• These simple examples show that there always exists an optimal transportation
plan when P and Q are discrete, which is in contrast to the Monge's OT
formulation
26
Kantorovich’s OT for Discrete Measures
• We can rewrite the problem (4) as follows
OT(P, Q) = min_{π ∈ ℝ^{n×m}} ⟨C, π⟩   (5)
s.t. π ≥ 0; π 1_m = p; π^⊤ 1_n = q,
where C = (c(η_i, θ_j))_{i,j} is the cost matrix, p = (p1, p2, …, pn), and q = (q1, q2, …, qm)
• The problem (5) is a linear programming problem
• The set 𝒫 = {π ∈ ℝ^{n×m} : π ≥ 0, π 1_m = p, π^⊤ 1_n = q} is called a
transportation polytope, which is a convex set
27
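As a concrete illustration (my own sketch, not from the slides), the linear program (5) can be handed to a generic LP solver; below is a minimal version using scipy.optimize.linprog with hypothetical toy marginals and cost. For large n, m a dedicated solver such as the network simplex is preferable, which is exactly the scalability issue discussed on the next slide.

```python
import numpy as np
from scipy.optimize import linprog

def ot_lp(p, q, C):
    """Solve the Kantorovich LP (5): min <C, pi> s.t. pi >= 0, pi 1_m = p, pi^T 1_n = q."""
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0   # sum_j pi_ij = p_i
    for j in range(m):
        A_eq[n + j, j::m] = 1.0            # sum_i pi_ij = q_j
    b_eq = np.concatenate([p, q])
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.x.reshape(n, m), res.fun    # optimal transportation plan and OT value

# Toy example: two discrete measures with uniform weights on the line
p = np.ones(3) / 3
q = np.ones(4) / 4
C = np.abs(np.subtract.outer(np.linspace(0, 1, 3), np.linspace(0, 1, 4)))
plan, value = ot_lp(p, q, C)
```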
Computational Complexity of Kantorovich’s Formulation
• The theorem below yields the best computational complexity of the network
simplex algorithm for solving the linear program (5)
• Therefore, the network simplex algorithm is not sufficiently scalable to use for
large-scale machine learning and deep learning applications
28
Entropic (Regularized) Optimal Transport
29
Entropic (Regularized) Optimal Transport
• We now discuss a useful approach to obtain a scalable approximation of optimal
transport
• The idea is that we regularize the optimal transport (5) by the entropy of the
transportation plan [20], named entropic (regularized) optimal transport:
EOT_η(P, Q) = min_{π ∈ 𝒫(p,q)} ⟨C, π⟩ − η H(π),   (6)
where H(π) = − ∑_{i=1}^{n} ∑_{j=1}^{m} π_{ij} log(π_{ij});
𝒫(p, q) = {π ∈ ℝ₊^{n×m} : π 1_m = p, π^⊤ 1_n = q};
and η > 0 is a regularization parameter
Here, we use a convention that log(x) = −∞ when x ≤ 0
30
Properties of Entropic Optimal Transport
• For each regularization parameter η > 0, the objective function of the entropic
regularized optimal transport is an η-strongly convex function
• As the constraint set 𝒫(p, q) is convex, this implies that there exists a unique
optimal transportation plan, denoted by π*_η, for solving the entropic regularized
optimal transport
31
Properties of Entropic Optimal Transport
Theorem 2: (a) When η → 0, we have
EOT_η(P, Q) → OT(P, Q) and π*_η → arg min_{π ∈ 𝒫 : ⟨C, π⟩ = OT(P,Q)} {−H(π)}
(b) When η → ∞, we have π*_η → p q^⊤, i.e., the optimal plan converges to the
independent (product) coupling of p and q
• The result of part (b) indicates that when the regularization parameter η is
sufficiently large, we can treat the distributions P and Q as independent
distributions
32
Sinkhorn Algorithm
• We now discuss a popular algorithm, named Sinkhorn algorithm, for solving
the entropic regularized optimal transport (6)
• Dual form of entropic optimal transport (6): We will demonstrate that solving
the dual form of (6), which is an unconstrained optimization problem, is easier
33
Sinkhorn Algorithm: Detailed Description
• Step 1: Initialize u⁰ = 0 ∈ ℝⁿ and v⁰ = 0 ∈ ℝᵐ
• Step 2: For any t ≥ 0, we perform
• If t is an even number, then for all (i, j)
u_i^{t+1} = log(p_i) − log( ∑_{j′=1}^{m} exp( v_{j′}^{t} − C_{ij′}/η ) ),   v_j^{t+1} = v_j^{t}
• If t is an odd number, then for all (i, j)
v_j^{t+1} = log(q_j) − log( ∑_{i′=1}^{n} exp( u_{i′}^{t} − C_{i′j}/η ) ),   u_i^{t+1} = u_i^{t}
• Increase t ← t + 1
34
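A minimal NumPy sketch of these log-domain updates (my own illustration; the function name and the fixed iteration count are choices of convenience):

```python
import numpy as np
from scipy.special import logsumexp

def sinkhorn_log(p, q, C, eta, n_iters=1000):
    """Log-domain Sinkhorn iterations for entropic OT with regularization eta.
    Even steps match the row marginals p, odd steps match the column marginals q."""
    n, m = C.shape
    u, v = np.zeros(n), np.zeros(m)
    for t in range(n_iters):
        if t % 2 == 0:
            # u_i <- log p_i - log sum_j exp(v_j - C_ij / eta)
            u = np.log(p) - logsumexp(v[None, :] - C / eta, axis=1)
        else:
            # v_j <- log q_j - log sum_i exp(u_i - C_ij / eta)
            v = np.log(q) - logsumexp(u[:, None] - C / eta, axis=0)
    # Transportation plan pi_ij = exp(u_i + v_j - C_ij / eta)
    return np.exp(u[:, None] + v[None, :] - C / eta)

# Usage: pi_eta = sinkhorn_log(p, q, C, eta=0.05) for marginals p, q and cost matrix C
```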
Approximation of Optimal Transport via Sinkhorn algorithm
• Now, we discuss briefly the complexity of approximating the value of optimal
transport via the Sinkhorn algorithm
35
Approximation of Optimal Transport via Sinkhorn algorithm
• Denote (u^t, v^t) as the updates of step t from the Sinkhorn algorithm (See
Slide 35)
36
Approximation of Optimal Transport via Sinkhorn algorithm
• The transportation plan π^t built from (u^t, v^t) generally satisfies only one of the two
marginal constraints exactly; therefore, we need to do an extra rounding step to
transform π^t into π̄^t such that π̄^t 1_m = p and (π̄^t)^⊤ 1_n = q
• Details of that rounding step are in Algorithm 2 in [21] (we skip this step in the
lecture for simplicity)
Theorem 3: Assume that η = ϵ / (4 log(max{n, m})). Denote by (u^t, v^t) the updates from the
Sinkhorn algorithm for the entropic optimal transport with regularization parameter η, and
denote by π̄^t the rounded transportation plan we obtain from these updates. Then, we
have
⟨C, π̄^t⟩ ≤ min_{π ∈ 𝒫} ⟨C, π⟩ + ϵ
as long as t = 𝒪( ∥C∥²_∞ log(max{n, m}) / ϵ² ).
37
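For completeness, a small NumPy sketch of the rounding idea, written from my reading of Algorithm 2 in [21]; treat it as an illustrative approximation rather than a verbatim transcription:

```python
import numpy as np

def round_to_marginals(F, p, q):
    """Project an approximate plan F onto the transportation polytope with marginals (p, q),
    in the spirit of Algorithm 2 in [21]."""
    # Scale down rows whose sums exceed p, then columns whose sums exceed q.
    r = F.sum(axis=1)
    F = F * np.minimum(p / np.maximum(r, 1e-300), 1.0)[:, None]
    c = F.sum(axis=0)
    F = F * np.minimum(q / np.maximum(c, 1e-300), 1.0)[None, :]
    # Distribute the leftover mass with a rank-one correction so both marginals match exactly.
    err_p = p - F.sum(axis=1)
    err_q = q - F.sum(axis=0)
    total = err_p.sum()
    if total > 0:
        F = F + np.outer(err_p, err_q) / total
    return F
```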
Approximation of Optimal Transport via Sinkhorn algorithm
• The proof of Theorem 3 can be found in Theorem 2 of [22]
• Each iteration of the Sinkhorn algorithm requires max{n, m}² arithmetic
operations
38
Other Approximations of Optimal Transport
• There are other optimization algorithms that outperform Sinkhorn:
• Greedy version of Sinkhorn (Greenkhorn) [23]
• Accelerated Sinkhorn [24]
• The scalable approximations of optimal transport via these optimization
algorithms have led to several interesting methodological developments in
machine learning
[23] Tianyi Lin, Nhat Ho, Michael I. Jordan. On efficient optimal transport: an analysis of greedy and accelerated mirror descent algorithms. ICML, 2019
[24] Tianyi Lin, Nhat Ho, Michael I. Jordan. On the efficiency of entropic regularized algorithms for optimal transport. Journal of Machine Learning Research (JMLR), 2022
39
Deep Generative Model via Optimal Transport
• Wasserstein GAN
• Issues of Wasserstein GAN:
• Misspecified Matchings of Minibatch Schemes
• Curse of Dimensionality
40
Generative Model
• We now discuss an important application of optimal transport to the generative
modeling task
Imagenet
CIFAR 10
• Goal: Given a collection of very high dimensional data, we would like to learn
the underlying data distribution P effectively
41
Generative Model
• There are several approaches:
• Nonparametric approaches:
• Frequentist density estimator
• Bayesian nonparametric models
• Parametric approaches via latent variable assumption:
• Bayesian hierarchical models
• Deep learning models, e.g., Variational Auto-Encoders (VAE)
[25], Generative Adversarial Networks (GANs) [26], etc.
42
Generative Adversarial Networks (GANs)
• Generative Adversarial Networks are an instance of implicit methods, i.e., we
do not need explicit density estimation
43
Generative Adversarial Networks (GANs)
General recipe of implicit methods:
44
Generative Adversarial Networks (GANs)
• For GANs [26], the choice of that divergence is the Jensen–Shannon divergence (JS):
min_ϕ JS(Tϕ(z), P),   (8)
where JS(Tϕ(z), P) := (1/2) KL( Tϕ(z), (P + Tϕ(z))/2 ) + (1/2) KL( P, (P + Tϕ(z))/2 )
• The JS divergence behaves poorly when Tϕ(z) and P have:
• Disjoint supports
• One is a continuous distribution and the other is a discrete distribution
• Example: To see that, we will consider the following simple example:
Tϕ(z) = (ϕ, z) where z ∼ U(0,1), and P = (0, U(0,1))
• Direct calculation shows that
JS(Tϕ(z), P) = log(2) if ϕ ≠ 0, and 0 otherwise
• The paper [27] suggests that we can use the first order Wasserstein metric
• For any two distributions P and Q, the first order Wasserstein metric between
P and Q is defined as follows:
W1(P, Q) = inf_{π ∈ Π(P,Q)} ∫ ∥x − y∥ dπ(x, y),
where Π(P, Q) denotes the set of joint probability measures between P and Q
47
Wasserstein GANs
• The objective of Wasserstein GANs is then given by:
min_ϕ W1(Tϕ(z), P)   (9)
• The first order Wasserstein metric is meaningful even when the two
distributions have disjoint supports
48
Wasserstein GANs
• In the example above, we can verify that W1(Tϕ(z), P) = |ϕ| for all ϕ ∈ ℝ
• It is clear that this function is continuous in ϕ, and we can use an optimization
method to solve min_ϕ |ϕ|
49
Wasserstein GANs
• These observations indicate that the first order Wasserstein metric is a valid
choice for GANs
• From the definition of the first order Wasserstein metric, we can rewrite equation
(9) as follows:
min_ϕ W1(Tϕ(z), P) = min_ϕ inf_{π ∈ Π(Tϕ(z), P)} ∫ ∥x − y∥ dπ(x, y)   (10)
• We will discuss a dual function approach for dealing with that optimization
problem
50
Wasserstein GANs: Dual Function Approach
• Dual Function Approach: For any two probability distributions P and Q, the
dual form of the first order Wasserstein metric between P and Q has the
following form:
• Please refer to Section 5 in [27] about how to derive the dual form (11)
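The dual form (11) referred to here is, presumably, the classical Kantorovich–Rubinstein duality (derived in Section 5 of [27]):
W1(P, Q) = sup_{∥f∥_L ≤ 1} 𝔼_{x∼P}[ f(x)] − 𝔼_{y∼Q}[ f(y)],   (11)
where the supremum is taken over all 1-Lipschitz functions f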
51
Wasserstein GANs: Dual Function Approach
• Given the dual form of the first order Wasserstein metric in equation (11), we
can rewrite Wasserstein GANs as follows:
min_ϕ sup_{∥f∥_L ≤ 1} 𝔼_{z∼pZ}[ f(Tϕ(z))] − 𝔼_{x∼P}[ f(x)]   (12)
52
Wasserstein GANs: Dual Function Approach
• Therefore, restricting the dual function f to a parametric family of neural networks
{fω}, we approximate the Wasserstein GANs (12) as
min_ϕ max_ω 𝔼_{z∼pZ}[ fω(Tϕ(z))] − 𝔼_{x∼P}[ fω(x)]   (13)
53
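For illustration only, a minimal PyTorch-style sketch of alternating updates for objective (13); the `generator`, `critic`, the optimizers, the latent dimension, and the clipping constant are hypothetical placeholders, with weight clipping used to roughly enforce the 1-Lipschitz constraint as in [27]:

```python
import torch

def wgan_step(generator, critic, x_real, opt_g, opt_c, z_dim=64, clip=0.01):
    """One alternating update of objective (13): max over the critic f_w, min over the generator T_phi."""
    # Critic update: maximize E_z[f_w(T_phi(z))] - E_x[f_w(x)], i.e., minimize its negative
    z = torch.randn(x_real.size(0), z_dim)
    critic_obj = critic(generator(z).detach()).mean() - critic(x_real).mean()
    opt_c.zero_grad()
    (-critic_obj).backward()
    opt_c.step()
    for p in critic.parameters():          # weight clipping, as in [27]
        p.data.clamp_(-clip, clip)
    # Generator update: minimize E_z[f_w(T_phi(z))] (the E_x term does not depend on phi)
    z = torch.randn(x_real.size(0), z_dim)
    gen_obj = critic(generator(z)).mean()
    opt_g.zero_grad()
    gen_obj.backward()
    opt_g.step()
```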
Limitations of Dual Function Approach
• Limitations of dual function approach:
• It relies on the choice of the first order Wasserstein metric and the Euclidean
distance to have a nice dual form
• This motivates a more general formulation, Optimal Transport GANs (OT-GANs), with a general cost:
min_ϕ OT(Tϕ(z), P),   (14)
where OT(Tϕ(z), P) = inf_{π ∈ Π(Tϕ(z), P)} ∫ c(x, y) dπ(x, y) and c(·, ·) is some metric
54
Optimal Transport GANs (OT-GANs)
• For a general cost function c(·, ·), the dual form of OT-GANs (14) can be non-
trivial to use
55
Optimal Transport GANs (OT-GANs)
• For the distribution P, we can use Pn = (1/n) ∑_{i=1}^{n} δ_{Xi}, where X1, X2, …, Xn are the
data
• For Tϕ(z), we can use (1/M) ∑_{i=1}^{M} δ_{Tϕ(zi)}, where z1, z2, …, zM are i.i.d. samples from
pZ(·)
• It suggests the following approximation of OT-GANs (14):
inf_ϕ OT( (1/M) ∑_{i=1}^{M} δ_{Tϕ(zi)}, (1/n) ∑_{i=1}^{n} δ_{Xi} )   (15)
56
Computational Challenge of OT-GANs
57
Computational Challenge of OT-GANs
• Computational Challenge:
• The computational complexity of approximating the optimal transport between
(1/M) ∑_{i=1}^{M} δ_{Tϕ(zi)} and (1/n) ∑_{i=1}^{n} δ_{Xi} is 𝒪(max{M, n}²)
• In practice, n can be very large (as large as a few millions) and M needs to be
chosen quite large (scaling with the dimension) to guarantee a good
approximation of Tϕ(z) via the empirical distribution (1/M) ∑_{i=1}^{M} δ_{Tϕ(zi)}
• Unfortunately, this memory issue of optimal transport is unavoidable
• Practical Solution: A popular approach is to consider minibatches
of the entire data, which we refer to as minibatch optimal transport GANs
58
Minibatch Optimal Transport
59
Minibatch Optimal Transport GANs (mOT-GANs)
• To set up the stage, we need the following notation:
• We denote by m the minibatch size, where m ≤ min{M, n}
• We denote by X_n^{(m)} and z_M^{(m)} the sets of all m-element subsets of {X1, …, Xn}
and {z1, …, zM}, respectively
• For any X^m ∈ X_n^{(m)} and z^m ∈ z_M^{(m)}, we respectively denote by
P_{X^m} = (1/m) ∑_{x ∈ X^m} δ_x and P_{z^m} = (1/m) ∑_{z′ ∈ z^m} δ_{z′} the empirical measures of X^m and
z^m
60
Minibatch Optimal Transport GANs (mOT-GANs)
Minibatch Optimal Transport GANs (mOT-GANs): For any batch size
1 ≤ m ≤ min{M, n} and number of minibatches k, we draw X_1^m, …, X_k^m and z_1^m, …, z_k^m
uniformly from X_n^{(m)} and z_M^{(m)}. The minibatch optimal transport GANs objective is given by:
min_ϕ (1/k) ∑_{i=1}^{k} OT( Tϕ(P_{z_i^m}), P_{X_i^m} )   (16)
• Note that the choice k = 1 can lead to sub-optimal results in practice
61
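A rough sketch of evaluating the loss in (16) for a fixed ϕ (my own illustration; it assumes the POT library's `ot.dist` and `ot.emd2` for the per-minibatch OT costs, with `gen_samples` standing in for samples Tϕ(zi)):

```python
import numpy as np
import ot  # Python Optimal Transport (POT), assumed available

def minibatch_ot_loss(gen_samples, data, m, k, rng=None):
    """Average exact OT cost over k random minibatch pairs, as in objective (16), for fixed phi."""
    rng = rng or np.random.default_rng(0)
    losses = []
    for _ in range(k):
        zb = gen_samples[rng.choice(len(gen_samples), size=m, replace=False)]
        xb = data[rng.choice(len(data), size=m, replace=False)]
        C = ot.dist(zb, xb)                 # pairwise (squared Euclidean) cost matrix
        u = np.full(m, 1.0 / m)             # uniform minibatch weights
        losses.append(ot.emd2(u, u, C))     # exact OT value between the two minibatches
    return float(np.mean(losses))
```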
Minibatch Optimal Transport GANs (mOT-GANs)
• Computational Complexity of mOT-GANs:
• When ϕ is given, the complexity of computing OT( Tϕ(P_{z_i^m}), P_{X_i^m} ) exactly
is of the order 𝒪(m³) if we use an exact solver for the linear
programming
• We can improve the complexity to 𝒪(m²) by using entropic regularized
optimal transport to approximate OT( Tϕ(P_{z_i^m}), P_{X_i^m} )
• Therefore, the best complexity of approximating ∑_{i=1}^{k} OT( Tϕ(P_{z_i^m}), P_{X_i^m} ) is
𝒪(km²)
62
OT GANs: Minibatch Approach
• For the approximation of OT-GANs in equation (15), the complexity is
𝒪(max{M, n}²)
• As long as km² ≪ max{M, n}², the complexity of mOT-GANs is much
cheaper than that of OT-GANs for each parameter ϕ
63
Wasserstein GANs: Minibatch Approach
• Examples of CIFAR 10 generated images via mOT-GANs:
Data
Generated data
64
Issues of mOT-GANs
• mOT-GANs suffer from a misspecified matching issue, i.e., the optimal transport
plan from the mOT-GANs contains wrong matchings that do not appear in the
original optimal transport plan of OT-GANs
• There are a few recent proposals to solve the misspecified matching issue,
including partial optimal transport [28], hierarchical optimal transport
[29], and unbalanced optimal transport [30]
65
Minibatch Partial Optimal Transport [28]
[28] Khai Nguyen, Dang Nguyen, Tung Pham, Nhat Ho. Improving minibatch optimal transport via partial transportation. ICML, 2022
66
Misspecified Matching Issue of MOT
• We consider a simple example where Pn, Qn are two empirical distributions
with 5 support points in 2D: {(0,1), (0,2), (0,3), (0,4), (0,5)} and
{(1,1), (1,2), (1,3), (1,4), (1,5)}
• The minibatch OT (m-OT) between Pn and Qn is defined analogously to (16):
m-OT(Pn, Qn) = (1/k) ∑_{i=1}^{k} OT( P_{X_i^m}, P_{Y_i^m} ),
where X_1^m, …, X_k^m ∈ X_n^{(m)}; Y_1^m, …, Y_k^m ∈ Y_n^{(m)};
P_{X_i^m}, P_{Y_i^m} are the empirical measures associated with X_i^m and Y_i^m
Computational Complexity of Minibatch Partial Optimal Transport
• We have an equivalent way to write m-POT in terms of m-OT as follows:
m-POT_s(Pn, Qn) = (1/k) ∑_{i=1}^{k} min_{π ∈ Π(ᾱi, ᾱi)} ⟨C̄i, π⟩,
where C̄i = ( Ci  0 ; 0  Ai ) ∈ ℝ₊^{(m+1)×(m+1)};
Ci is a cost matrix formed by the differences of elements of X_i^m and Y_i^m;
Ai > 0 for all i = 1, 2, …, k;
ᾱi = [u_m, 1 − s] for all i = 1, 2, …, k, where u_m denotes the uniform weights (1/m) 1_m and s is the transported fraction of mass
• By using the entropic regularized approach, we can compute the m-POT with
computational complexity 𝒪(k(m + 1)²), which is comparable to that of m-OT
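To make the reduction above concrete, a small sketch (mine, with an assumed dummy cost A and POT's `ot.dist`/`ot.emd2`) that evaluates POT_s between two uniform minibatches via the extended (m+1)×(m+1) problem:

```python
import numpy as np
import ot  # POT library, assumed available

def partial_ot_via_extension(Xb, Yb, s, A=1.0):
    """Partial OT transporting a fraction s of mass, via the extended cost matrix C_bar."""
    m = len(Xb)
    C_bar = np.zeros((m + 1, m + 1))
    C_bar[:m, :m] = ot.dist(Xb, Yb)               # real-to-real costs C_i
    C_bar[m, m] = A                               # dummy-to-dummy cost A > 0
    u_m = np.full(m, 1.0 / m)                     # uniform minibatch weights
    alpha_bar = np.concatenate([u_m, [1.0 - s]])  # extended marginal [u_m, 1 - s]
    mass = alpha_bar.sum()                        # = 2 - s; normalize for the solver, rescale after
    return mass * ot.emd2(alpha_bar / mass, alpha_bar / mass, C_bar)
```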
Minibatch Partial Optimal Transport
• The corresponding transportation plan of minibatch partial optimal transport
with transportation fraction s is given by:
π^{m-POT_s} = (1/k) ∑_{i=1}^{k} π^{POT_s}_{P_{X_i^m}, P_{Y_i^m}},
where π^{POT_s}_{P_{X_i^m}, P_{Y_i^m}} is a transportation matrix from solving POT_s( P_{X_i^m}, P_{Y_i^m} );
π^{POT_s}_{P_{X_i^m}, P_{Y_i^m}} is expanded to an n × n matrix that has padded zero entries at
indices which are different from those of X_i^m and Y_i^m
Minibatch Partial Optimal Transport
• The m-POT can alleviate misspecified matchings
[29] Khai Nguyen, Dang Nguyen, Quoc Nguyen, Tung Pham, Dinh Phung, Hung Bui, Trung Le, Nhat Ho. On transportation of mini-batches: A
hierarchical approach. ICML, 2022
75
Alleviating Misspecified Matching of m-OT via Hierarchical Approach
• The m-POT requires choosing a good transportation fraction s, which can be non-trivial in
practice
• We now describe another approach, Batch of Minibatches Optimal Transport (BoMb-OT),
that can be used to alleviate the misspecified matching of m-OT without any tuning parameter:
BoMb-OT(Pn, Qn) = min_{γ ∈ Π(P_k^{⊗m}, Q_k^{⊗m})} ∑_{i=1}^{k} ∑_{j=1}^{k} γ_{ij} OT( P_{X_i^m}, P_{Y_j^m} ),
where X_1^m, …, X_k^m ∈ X_n^{(m)}; Y_1^m, …, Y_k^m ∈ Y_n^{(m)};
P_k^{⊗m} = (1/k) ∑_{i=1}^{k} δ_{X_i^m} and Q_k^{⊗m} = (1/k) ∑_{i=1}^{k} δ_{Y_i^m};
P_{X_i^m}, P_{Y_j^m} are the empirical measures associated with X_i^m and Y_j^m
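A sketch of how BoMb-OT could be evaluated (again my own illustration leaning on POT's `ot.dist`/`ot.emd2`): build the k × k matrix of minibatch-level OT costs, then solve an outer OT problem between the uniform measures over minibatches:

```python
import numpy as np
import ot  # POT library, assumed available

def bomb_ot(X_batches, Y_batches):
    """BoMb-OT: an outer OT over minibatches whose ground cost is the inner OT between minibatches."""
    k = len(X_batches)
    D = np.zeros((k, k))
    for i in range(k):
        a = np.full(len(X_batches[i]), 1.0 / len(X_batches[i]))
        for j in range(k):
            b = np.full(len(Y_batches[j]), 1.0 / len(Y_batches[j]))
            D[i, j] = ot.emd2(a, b, ot.dist(X_batches[i], Y_batches[j]))
    w = np.full(k, 1.0 / k)            # uniform weights over minibatches
    return ot.emd2(w, w, D)            # outer OT value = BoMb-OT(P_n, Q_n)
```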
Batch of Minibatches Optimal Transport
• The corresponding transportation plan of Batch of Minibatches optimal
transport (BoMb-OT) between Pn and Qn is defined as
π_k^{BoMb-OT} = ∑_{i=1}^{k} ∑_{j=1}^{k} γ_{ij} π^{OT}_{P_{X_i^m}, P_{Y_j^m}},
where γ is the optimal coupling from BoMb-OT(Pn, Qn)
81
Curse of Dimensionality of OT-GANs
• Another important issue of OT-GANs is the curse of dimensionality
• The number of samples required for OT-GANs to obtain a good
estimate of the underlying data distribution grows exponentially with
the dimension
• Solutions: We utilize sliced OT-GANs and their variants [31], [32], [33], [34]
[31] Khai Nguyen, Nhat Ho, Tung Pham, Hung Bui. Distributional sliced-Wasserstein and applications to deep generative modeling. ICLR, 2021
[32] Khai Nguyen, Nhat Ho, Tung Pham, Hung Bui. Improving relational regularized autoencoders with spherical sliced fused Gromov Wasserstein. ICLR, 2021
[33] Khai Nguyen, Nhat Ho. Revisiting projected Wasserstein metric on images: from vectorization to convolution. Arxiv Preprint, 2022
[34] Khai Nguyen, Nhat Ho. Amortized projection optimization for sliced Wasserstein generative models. Arxiv Preprint, 2022
82
Sliced Optimal Transport
• We first define sliced optimal transport, which is key to defining sliced OT-GANs
• The sliced optimal transport (sliced OT) between two probability distributions μ and ν is defined
as follows:
SW_p(μ, ν) := ( ∫_{𝕊^{d−1}} W_p^p(θ♯μ, θ♯ν) dθ )^{1/p},
where θ♯μ is the push-forward probability measure of μ through the function Tθ : ℝ^d → ℝ
with Tθ(x) = θ^⊤ x.
83
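A Monte Carlo sketch of SW_p between two equal-size empirical measures (my own illustration; random directions approximate the integral over 𝕊^{d−1}, and the one-dimensional W_p reduces to sorting the projections):

```python
import numpy as np

def sliced_wasserstein(X, Y, p=2, n_projections=100, rng=None):
    """Monte Carlo estimate of SW_p between two empirical measures with equal sample sizes."""
    rng = rng or np.random.default_rng(0)
    d = X.shape[1]
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # approximately uniform directions on the sphere
    sw_p = 0.0
    for t in theta:
        x_proj = np.sort(X @ t)                              # 1D projections theta^T x
        y_proj = np.sort(Y @ t)
        # For equal-size uniform empirical measures, W_p^p in 1D is the mean p-th power gap of sorted samples
        sw_p += np.mean(np.abs(x_proj - y_proj) ** p)
    return (sw_p / n_projections) ** (1.0 / p)
```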
Properties of Sliced OT
There are three key properties of sliced optimal transport that make it
appealing for large-scale applications:
• The sliced OT does not suffer from the curse of dimensionality, namely, the
number of samples required for the sliced OT to obtain a good estimate of the
underlying probability distribution does not scale exponentially with the
dimension
84
Sliced-OT GANs
• Given the definition of sliced OT, the sliced optimal transport GANs (Sliced-OT GANs) objective is:
min_ϕ SW_p(Tϕ(z), P),
• A limitation of this approach is memory inefficiency, since each slicing direction is a
vector that has the same dimension as the images
85
Sliced-OT GANs
86
Convolution Sliced-OT GANs [33]
[33] Khai Nguyen, Nhat Ho. Revisiting projected Wasserstein metric on images: from vectorization to convolution. Arxiv Preprint, 2022
87
Convolution
• To efficiently capture the spatial structures and improve the memory efficiency
of sliced OT, we apply convolution operators to the slicing process of
sliced optimal transport
88
Convolution Slicer
90
Convolution Sliced Optimal Transport
91
Experiments: Deep Generative Models
93
Experiments: Deep Generative Models
94
Experiments: Deep Generative Models
95
Experiments: Deep Generative Models
96
Conclusion
• We have studied both the computational complexities of optimal transport and
its applications to deep generative models
97
Thank You!
98
References
[1] Martin Arjovsky, Soumith Chintala, Léon Bottou. Wasserstein Generative Adversarial Networks. ICML,
2017
[2] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron C. Courville. Improved
Training of Wasserstein GANs. NIPS, 2017
[3] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, Bernhard Scholkopf. Wasserstein Auto-Encoders. ICLR,
2018
[4] Nicolas Courty, Rémi Flamary, Devis Tuia, Alain Rakotomamonjy. Optimal Transport for Domain
Adaptation. IEEE Transactions on Pattern Analysis and Artificial Intelligence (PAMI), 2017
[5] Bharath Bhushan Damodaran, Benjamin Kellenberger, Rémi Flamary, Devis Tuia, Nicolas Courty.
DeepJDOT: Deep Joint Distribution Optimal Transport for Unsupervised Domain Adaptation. ECCV, 2018
[6] Trung Nguyen, Hieu Pham, Tam Le, Tung Pham, Nhat Ho, Son Hua. Point-set distances for learning
representations of 3D point clouds. ICCV, 2021
[7] Nhat Ho, Long Nguyen, Mikhail Yurochkin, Hung Bui, Viet Huynh, and Dinh Phung. Multilevel clustering
via Wasserstein means. ICML, 2017
[8] Viet Huynh, Nhat Ho, Nhan Dam, Long Nguyen, Mikhail Yurochkin, Hung Bui, Dinh Phung. On efficient
multilevel clustering via Wasserstein distances. Journal of Machine Learning Research, 2021
99
References
[9] Xing Han, Tongzheng Ren, Jing Hu, Joydeep Ghosh, Nhat Ho. Efficient Forecasting of Large Scale Hierarchical Time
Series via Multilevel Clustering. Under review, NeurIPS, 2022
[10] Jingjing Xu, Hao Zhou, Chun Gan, Zaixiang Zheng, Lei Li. Vocabulary Learning via Optimal Transport for Neural Machine
Translation. ACL, 2021
[11] Khang Le, Huy Nguyen, Quang Nguyen, Tung Pham, Hung Bui, Nhat Ho. On robust optimal transport: Computational
complexity and barycenter computation. NeurIPS, 2021
[12] Nhat Ho, Tan Nguyen, Ankit Patel, Anima Anandkumar, Michael I. Jordan, Richard Baraniuk. A Bayesian Perspective
of Convolutional Neural Networks through a Deconvolutional Generative Model. Under Revision, Journal of Machine
Learning Research, 2021
[13] Long Nguyen. Convergence of latent mixing measures in finite and infinite mixture models. Annals of Statistics, 2013
[14] Nhat Ho, Long Nguyen. Convergence rates of parameter estimation for some weakly identifiable finite mixtures.
Annals of Statistics, 2016
[15] Nhat Ho, Chiao-Yu Yang, Michael I. Jordan. Convergence rates for Gaussian mixtures of experts. Journal of Machine
Learning Research, 2022 (Accepted Under Minor Revision)
[16] Rui Gao, Anton J Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance. Arxiv preprint
arXiv:1604.02199, 2016
[17] Daniel Kuhn, Peyman Mohajerin Esfahani, Viet Anh Nguyen, Soroosh Shafieezadeh-Abadeh. Wasserstein
distributionally robust optimization: Theory and applications in machine learning. INFORMS Tutorials in Operations
Research
100
References
[18] Matthew Thorpe. Introduction to Optimal Transport (https://www.math.cmu.edu/~mthorpe/
OTNotes)
[19] Gabriel Peyré, Marco Cuturi. Computational Optimal Transport: With Applications to Data Science.
Foundations and Trends in Machine Learning, 2019
[20] Marco Cuturi. Sinkhorn Distances: Lightspeed Computation of Optimal Transport. NIPS 2013
[21] Jason Altschuler, Jonathan Weed, Philippe Rigollet. Near-linear time approximation algorithms for
optimal transport via Sinkhorn iteration. NIPS, 2017
[22] Pavel Dvurechensky, Alexander Gasnikov, Alexey Kroshnin. Computational Optimal Transport:
Complexity by Accelerated Gradient Descent Is Better Than by Sinkhorn’s Algorithm. ICML, 2018
[23] Tianyi Lin, Nhat Ho, Michael I. Jordan. On efficient optimal transport: an analysis of greedy and accelerated
mirror descent algorithms. ICML, 2019
[24] Tianyi Lin, Nhat Ho, Michael I. Jordan. On the efficiency of entropic regularized algorithms for optimal
transport. Journal of Machine Learning Research (JMLR), 2022
101
References
[25] Diederik P Kingma, Max Welling. Auto-Encoding Variational Bayes. ICLR, 2014
[26] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, Yoshua Bengio. Generative Adversarial Networks. NIPS, 2014
[27] Martin Arjovsky, Soumith Chintala, Léon Bottou. Wasserstein Generative Adversarial Networks.
ICML, 2017
[28] Khai Nguyen, Dang Nguyen, Tung Pham, Nhat Ho. Improving minibatch optimal transport via partial
transportation. ICML, 2022
[29] Khai Nguyen, Dang Nguyen, Quoc Nguyen, Tung Pham, Dinh Phung, Hung Bui, Trung Le, Nhat Ho.
On transportation of mini-batches: A hierarchical approach. ICML, 2022
[30] Kilian Fatras, Thibault Sejourne, Rémi Flamary, and Nicolas Courty. Unbalanced minibatch optimal
transport; applications to domain adaptation. ICML, 2021
102
References
[31] Khai Nguyen, Nhat Ho, Tung Pham, Hung Bui. Distributional sliced-
Wasserstein and applications to deep generative modeling. ICLR, 2021
[32] Khai Nguyen, Nhat Ho, Tung Pham, Hung Bui. Improving relational
regularized autoencoders with spherical sliced fused Gromov Wasserstein. ICLR,
2021
[33] Khai Nguyen, Nhat Ho. Revisiting projected Wasserstein metric on images:
from vectorization to convolution. Arxiv Preprint, 2022
[34] Khai Nguyen, Nhat Ho. Amortized projection optimization for sliced
Wasserstein generative models. Arxiv Preprint, 2022
103