Introduction To Optimal Transport
Matthew Thorpe
Lent 2018
Current Version: Thursday 8th March, 2018
Foreword
These notes have been written to supplement my lectures given at the University of Cambridge
in the Lent term 2017-2018. The purpose of the lectures is to provide an introduction to
optimal transport. Optimal transport dates back to Gaspard Monge in 1781 [11], with significant
advancements by Leonid Kantorovich in 1942 [8] and Yann Brenier in 1987 [4]. The latter
in particular lead to connections with partial differential equations, fluid mechanics, geometry,
probability theory and functional analysis. Currently optimal transport enjoys applications in
image retrieval, signal and image representation, inverse problems, cancer detection, texture
and colour modelling, shape and image registration, and machine learning, to name a few. The
purpose of this course is to introduce the basic theory that surrounds optimal transport, in the
hope that it may find uses in people’s own research, rather than focus on any specific application.
I can recommend the following references. My lectures and notes are based on Topics in
Optimal Transportation [15]. Two other accessible introductions are Optimal Transport: Old and
New [16] (also freely available online) and Optimal Transport for Applied Mathematicians [12]
(also available for free online). For a more technical treatment of optimal transport I refer to
Gradient Flows in Metric Spaces and in the Space of Probability Measures [2]. For a short
review of applications in optimal transport see the article Optimal Mass Transport for Signal
Processing and Machine Learning [9].
Please let me know of any mistakes in the text. I will also be updating the notes as the course
progresses.
Some Notation:
4. P(Z) is the set of probability measures on Z, i.e. the subset of M+ (Z) with unit mass.
7. A function Θ : E → R ∪ {+∞} is convex if for all (z1 , z2 , t) ∈ E × E × [0, 1] we have
Θ(tz1 +(1−t)z2 ) ≤ tΘ(z1 )+(1−t)Θ(z2 ). A convex function Θ is proper if Θ(z) > −∞
for all z ∈ E and there exists z † ∈ E such that Θ(z † ) < +∞.
8. If E is a normed vector space then E ∗ is its dual space, i.e. the space of all bounded and
linear functions on E.
9. For a set A in a topological space Z the interior of A, which we denote by int(A), is the
set of points a ∈ A such that there exists an open set O with the property a ∈ O ⊆ A.
11. The closure of a set A in a topological space Z, which we denote by A, is the set of all
points a ∈ Z such that for any open set O with a ∈ O we have O ∩ A ≠ ∅.
12. The graph of a function ϕ : X → R which we denote by Gra(ϕ), is the set {(x, y) : x ∈
X, y = ϕ(x)}.
13. The kth moment of µ ∈ P(X) is defined as ∫_X |x|^k dµ(x).
14. The support of a probability measure µ ∈ P(X) is the smallest closed set A such that
µ(A) = 1.
16. We write µ⌞A for the measure µ restricted to A, i.e. (µ⌞A)(B) = µ(A ∩ B) for all measurable B.
17. Given a probability measure µ we say a property holds µ-almost surely if it holds on a
set of probability one. If µ is the Lebesgue measure we will just say that it holds almost
surely.
Contents
2 Special Cases
2.1 Optimal Transport in One Dimension
2.2 Existence of Transport Maps for Discrete Measures

3 Kantorovich Duality
3.1 Kantorovich Duality
3.2 Fenchel-Rockafellar Duality
3.3 Proof of Kantorovich Duality
3.4 Existence of Maximisers to the Dual Problem

5 Wasserstein Distances
5.1 Wasserstein Distances
5.2 The Wasserstein Topology
5.3 Geodesics in the Wasserstein Space
Chapter 1
There are two ways to formulate the optimal transport problem: the Monge and Kantorovich
formulations. We explain both these formulations in this chapter. Historically the Monge for-
mulation comes before Kantorovich which is why we introduce Monge first. The Kantorovich
formulation can be seen as a generalisation of Monge. Both formulations have their advantages
and disadvantages. My experience is that Monge is more useful in applications, whilst Kan-
torovich is more useful theoretically. In a later chapter (see Chapter 4) we will show sufficient
conditions for the two problems to be considered equivalent. After introducing both formula-
tions we give an existence result for the Kantorovich problem; existence results for Monge are
considerably more difficult. We look at special cases of the Monge and Kantorovich problems in
the next chapter, a more general treatment is given in Chapters 3 and 4.
To visualise the transport map see Figure 1.1. For greater generality we work with the
inverse of T rather than T itself; the inverse is treated in the general set-valued sense, i.e.
x ∈ T^{-1}(y) if T(x) = y. If the function T is injective then we can equivalently say that
ν(T(A)) = µ(A) for all µ-measurable A. What Figure 1.1 shows is that for any ν-measurable
B, and A = {x ∈ X : T(x) ∈ B}, we have µ(A) = ν(B). This is what we mean by T transporting
µ to ν. As shorthand we write ν = T#µ if (1.1) is satisfied.
Proposition 1.2. Let µ ∈ P(X), T : X → Y, S : Y → Z and f ∈ L¹(Y). Then

1. the change of variables formula holds:

(1.2) ∫_Y f(y) d(T#µ)(y) = ∫_X f(T(x)) dµ(x);

2. composition of maps: (S ◦ T)#µ = S#(T#µ).
where both supremums are taken over simple functions. Hence (1.2) holds for non-negative
functions. By treating signed functions as f = f + − f − we can prove the proposition for
f ∈ L1 (Y ).
For the second statement let A ⊂ Z and observe that T −1 (S −1 (A)) = (S ◦ T )−1 (A). Then
S# (T# µ)(A) = T# µ(S −1 (A)) = µ(T −1 (S −1 (A))) = µ((S ◦ T )−1 (A)) = (S ◦ T )# µ(A).
Hence S# (T# µ) = (S ◦ T )# µ.
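The proposition can be checked concretely for discrete measures, where the pushforward simply moves point masses. Below is a minimal sketch (the measures, maps and test function are all illustrative choices, not from the text) verifying the change of variables formula (1.2) and the composition rule:

```python
from collections import Counter
from fractions import Fraction

def pushforward(mu, T):
    """Pushforward T#mu of a discrete measure mu (dict: point -> mass) under the map T."""
    nu = Counter()
    for x, mass in mu.items():
        nu[T(x)] += mass
    return dict(nu)

# mu is uniform on {0, 1, 2, 3}; T(x) = x mod 2 pushes it to uniform on {0, 1}.
mu = {x: Fraction(1, 4) for x in range(4)}
T = lambda x: x % 2
nu = pushforward(mu, T)
assert nu == {0: Fraction(1, 2), 1: Fraction(1, 2)}

# Change of variables (1.2): integrating f against T#mu equals integrating f∘T against mu.
f = lambda y: 3 * y + 1
lhs = sum(f(y) * m for y, m in nu.items())
rhs = sum(f(T(x)) * m for x, m in mu.items())
assert lhs == rhs

# Composition of maps: (S∘T)#mu = S#(T#mu).
S = lambda y: y + 10
assert pushforward(mu, lambda x: S(T(x))) == pushforward(pushforward(mu, T), S)
```

For discrete measures both sides of (1.2) are finite sums, so the identity holds exactly in rational arithmetic.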
Given two measures µ and ν the existence of a transport map T such that T#µ = ν is not
only non-trivial, but it may also be false. For example, consider two discrete measures µ = δ_{x1},
ν = (1/2)δ_{y1} + (1/2)δ_{y2} where y1 ≠ y2. Then ν({y1}) = 1/2 but µ(T^{-1}(y1)) ∈ {0, 1} depending on
whether x1 ∈ T^{-1}(y1). Hence no transport maps exist.
There are two important cases where transport maps exist:
1. the discrete case when µ = (1/n) Σ_{i=1}^n δ_{xi} and ν = (1/n) Σ_{j=1}^n δ_{yj};
2. the absolutely continuous case when dµ(x) = f (x) dx and dν(y) = g(y) dy.
It is important that in the discrete case that µ and ν are supported on the same number of points;
the supports do not have to be the same but they do have to be of the same size. We will revisit
both cases (the discrete case in the next chapter, the absolutely continuous case in Chapter 4).
Figure 1.1: Monge’s transport map, figure modified from Figure 1 in [9].
With this notation we can define Monge’s optimal transport problem as follows.
Definition 1.3. Monge's Optimal Transport Problem: given µ ∈ P(X) and ν ∈ P(Y),

minimise M(T) = ∫_X c(x, T(x)) dµ(x)

over µ-measurable maps T : X → Y subject to ν = T#µ.
The above constraint is highly non-linear and difficult to handle with standard techniques from
the calculus of variations.
Note that if we allow mass to be split, i.e. half of the mass from x1 goes to y1 and half the mass goes
to y2, then we have a natural relaxation of Monge's problem. This is in effect what the Kantorovich formulation does.
To formalise this we consider a measure π ∈ P(X × Y ) and think of dπ(x, y) as the amount of
mass transferred from x to y; this way mass can be transferred from x to multiple locations. Of
course the total amount of mass removed from any measurable set A ⊂ X has to equal to µ(A),
and the total amount of mass transferred to any measurable set B ⊂ Y has to be equal to ν(B).
In particular, we have the constraints:

π(A × Y) = µ(A) for all measurable A ⊆ X, and π(X × B) = ν(B) for all measurable B ⊆ Y.
We say that any π which satisfies the above has first marginal µ and second marginal ν; we
denote the set of such π by Π(µ, ν). We will call Π(µ, ν) the set of transport plans between µ
and ν. Note that Π(µ, ν) is always non-empty (in contrast with the set of transport maps) since
µ ⊗ ν ∈ Π(µ, ν); this is the trivial transport plan which distributes the mass at each point x over Y
in proportion to ν. We can now define Kantorovich's formulation of optimal transport.
Definition 1.4. Kantorovich's Optimal Transport Problem: given µ ∈ P(X) and ν ∈ P(Y),

minimise K(π) = ∫_{X×Y} c(x, y) dπ(x, y)

over π ∈ Π(µ, ν).
By the example with discrete measures, where we showed there did not exist any transport
maps, we know that Kantorovich’s and Monge’s optimal transport problems do not always
coincide. However, let us assume that there exists a transport map T † : X → Y that is
optimal for Monge, then if we define dπ(x, y) = dµ(x)δy=T † (x) a quick calculation shows that
π ∈ Π(µ, ν):
π(A × Y) = ∫_A δ_{T†(x)∈Y} dµ(x) = µ(A),
π(X × B) = ∫_X δ_{T†(x)∈B} dµ(x) = µ((T†)^{-1}(B)) = (T†)#µ(B) = ν(B).

Since

∫_{X×Y} c(x, y) dπ(x, y) = ∫_X ∫_Y c(x, y) δ_{y=T†(x)} dy dµ(x) = ∫_X c(x, T†(x)) dµ(x)

it follows that

(1.3) inf_{π∈Π(µ,ν)} K(π) ≤ inf_{T : T#µ=ν} M(T).

In fact one does not need minimisers of Monge's problem to exist. If M(T†) ≤ inf M(T) + ε
for some ε > 0 then inf K(π) ≤ inf M(T) + ε. Since ε > 0 was arbitrary, (1.3) holds.
When the optimal plan π † can be written in the form dπ † (x, y) = dµ(x)δy=T (x) it follows
that T is an optimal transport map and inf K(π) = inf M(T ). Conditions sufficient for such a
condition will be explored in Chapter 4. For now we just say that if c(x, y) = |x − y|², µ, ν both
have finite second moments, and µ does not give mass to small sets², then there exists an optimal
plan π† which can be written as dπ†(x, y) = dµ(x)δ_{y=T†(x)} where T† is an optimal map.
Let us observe the advantages of both Monge and Kantorovich formulation. Transport maps
give a natural method of interpolation between two measures, in particular we can define
µt = ((1 − t)Id + tT † )# µ
then µt interpolates between µ and ν. In fact this line of reasoning will lead us directly to
geodesics that we consider in greater detail in Chapter 5. In Figure 1.2 we compare the optimal
transport interpolation with the Euclidean interpolation defined by µ^E_t = (1 − t)µ + tν. In many
applications the Lagrangian nature of optimal transport will be more realistic than Euclidean
formulations.
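For empirical measures this interpolation is easy to compute: each particle travels along the straight line from x to T†(x). A minimal sketch (the points below are illustrative; T is a transport map between two uniform empirical measures) showing the support of µ_t moving with t while the mean interpolates linearly between the means of µ and ν:

```python
from fractions import Fraction

# Uniform empirical measures: mu on xs, nu on ys; T maps x_i to y_i, so T#mu = nu.
xs = [0, 1, 2, 3]
ys = [4, 6, 8, 10]
T = dict(zip(xs, ys))

def interpolate(t):
    """Support points of mu_t = ((1 - t) Id + t T)#mu; the masses stay uniform."""
    return [(1 - t) * x + t * T[x] for x in xs]

def mean(points):
    return sum(points, Fraction(0)) / len(points)

t = Fraction(1, 3)
# The mean of mu_t moves linearly from the mean of mu to the mean of nu.
assert mean(interpolate(t)) == (1 - t) * mean(xs) + t * mean(ys)
assert interpolate(Fraction(0)) == xs and interpolate(Fraction(1)) == ys
```

By contrast the Euclidean mixture (1 − t)µ + tν never moves any support point; it only reweights the two clouds, which is the behaviour seen on the right of Figure 1.2.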
Figure 1.2: Interpolation in the optimal transport framework (left) and Euclidean space (right), figure
modified from Figure 2 in [9].
Notice that the Kantorovich problem is convex (the constraints are convex and one usually
has that the cost function c(x, y) = d(x − y) where d is convex). In particular let us consider
the Kantorovich problem between discrete measures µ = Σ_{i=1}^m αi δ_{xi} and ν = Σ_{j=1}^n βj δ_{yj} where
Σ_{i=1}^m αi = 1 = Σ_{j=1}^n βj, αi ≥ 0, βj ≥ 0. Let cij = c(xi, yj) and πij = π({(xi, yj)}). Then the
Kantorovich problem is to solve

minimise Σ_{i=1}^m Σ_{j=1}^n cij πij over π subject to πij ≥ 0, Σ_{i=1}^m πij = βj, Σ_{j=1}^n πij = αi.
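The constraint set {π : πij ≥ 0, Σ_i πij = βj, Σ_j πij = αi} is always non-empty; one classical way to construct a feasible (generally non-optimal) plan is the north-west corner rule. This rule is not used in the text, so the following is only an illustrative sketch of what a transport plan looks like in the discrete setting:

```python
from fractions import Fraction

def northwest_corner(alpha, beta):
    """Build one feasible transport plan pi with row sums alpha and column sums beta."""
    assert sum(alpha) == sum(beta) == 1
    pi = [[Fraction(0)] * len(beta) for _ in alpha]
    a, b = list(alpha), list(beta)
    i = j = 0
    while i < len(a) and j < len(b):
        m = min(a[i], b[j])        # ship as much mass as possible from x_i to y_j
        pi[i][j] = m
        a[i] -= m
        b[j] -= m
        if a[i] == 0:
            i += 1
        if b[j] == 0:
            j += 1
    return pi

alpha = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
beta = [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)]
pi = northwest_corner(alpha, beta)
assert [sum(row) for row in pi] == alpha            # first marginal is mu
assert [sum(col) for col in zip(*pi)] == beta       # second marginal is nu
```

An optimal plan would instead be found by minimising Σ cij πij over this polytope, e.g. with a linear programming solver.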
This is a linear programme! In fact Kantorovich is considered to be the inventor of linear
programming. Not only does this provide a method for solving optimal transport problems (either
through off-the-shelf linear programming algorithms, or through more recent advances such as
² µ ∈ P(R^d) does not give mass to small sets if µ(A) = 0 for all sets A of Hausdorff dimension at most d − 1.
entropy regularised approaches, see [5]) but also the dual formulation:

inf_{π≥0, C^⊤π=(µ^⊤,ν^⊤)^⊤} c · π = sup_{C(ϕ^⊤,φ^⊤)^⊤ ≤ c} (µ · ϕ + ν · φ)
Let πn ∈ Π(µ, ν) with πn ⇀* π. We choose f(x, y) = f̃(x), where f̃ is continuous and bounded. We have

∫_X f̃(x) dµ(x) = ∫_{X×Y} f̃(x) dπn(x, y) → ∫_{X×Y} f̃(x) dπ(x, y) = ∫_X f̃(x) d(P^X_#π)(x)
where P X (x, y) = x is the projection onto X (so P#X π is the X marginal). Since this is true for
all f˜ ∈ Cb0 (X) it follows that P#X π = µ. Similarly, P#Y π = ν. Hence, π ∈ Π(µ, ν) and Π(µ, ν)
is weakly* closed.
Let πn ∈ Π(µ, ν) be a minimising sequence, i.e. K(πn) → inf_{π∈Π(µ,ν)} K(π). Since Π(µ, ν)
is compact we can assume that πn ⇀* π† ∈ Π(µ, ν). Our candidate for a minimiser is π†. Note
that c is lower semi-continuous and bounded from below. Then,

inf_{π∈Π(µ,ν)} K(π) = lim_{n→∞} ∫_{X×Y} c(x, y) dπn(x, y) ≥ ∫_{X×Y} c(x, y) dπ†(x, y)

where we use the Portmanteau theorem which provides equivalent characterisations of weak*
convergence. Hence π† is a minimiser.
Chapter 2
Special Cases
In this section we look at some special cases where we can prove existence and characterisation
of optimal transport maps and plans. Generalising these results requires some work and in
particular a duality theorem. On the other hand the results in this chapter require less background
and are in some sense "lower-hanging fruit". Chapters 3 and 4 are essentially the results of this
chapter generalised to more abstract settings. The two special cases we consider here are when
measures µ, ν are on the real line, and when measures µ, ν are discrete. We start with the real
line.
Theorem 2.1. Let µ, ν ∈ P(R) with cumulative distributions F and G respectively. Assume
c(x, y) = d(x − y) where d is convex and continuous. Let π † be the measure on R2 with cumu-
lative distribution function H(x, y) = min{F (x), G(y)}. Then π † ∈ Π(µ, ν) and furthermore
π † is optimal for Kantorovich’s optimal transport problem with cost function c. Moreover the
optimal transport cost is

min_{π∈Π(µ,ν)} K(π) = ∫_0^1 d(F^{-1}(t) − G^{-1}(t)) dt.
2. If µ does not give mass to atoms then minπ∈Π(µ,ν) K(π) = minT : T# µ=ν M(T ) and further-
more T † = G−1 ◦ F is a minimiser to Monge’s optimal transport problem, i.e. T#† µ = ν
and
inf M(T ) = M(T † ).
T : T# µ=ν
Proof. For the first part, by Theorem 2.1, it is enough to show that

∫_0^1 |F^{-1}(t) − G^{-1}(t)| dt = ∫_R |F(x) − G(x)| dx.

Define A ⊂ R² by

A = {(x, t) : min{F^{-1}(t), G^{-1}(t)} ≤ x ≤ max{F^{-1}(t), G^{-1}(t)}, t ∈ [0, 1]}.

By Fubini's theorem

L(A) = ∫_R ∫_{min{F(x),G(x)}}^{max{F(x),G(x)}} dt dx = ∫_0^1 ∫_{min{F^{-1}(t),G^{-1}(t)}}^{max{F^{-1}(t),G^{-1}(t)}} dx dt.

The first double integral equals ∫_R |F(x) − G(x)| dx. Similarly

∫_0^1 ∫_{min{F^{-1}(t),G^{-1}(t)}}^{max{F^{-1}(t),G^{-1}(t)}} dx dt = ∫_0^1 |F^{-1}(t) − G^{-1}(t)| dt.
Figure 2.1: Optimal transport distance in 1D with cost c(x, y) = |x − y|, figure is taken from [10].
Since inf T : T# µ=ν M(T ) ≥ minπ∈Π(µ,ν) K(π) then the minimum of the Monge and Kantorovich
optimal transport problems coincide and T † is an optimal map for Monge.
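In the discrete uniform case the map T† = G^{-1} ∘ F amounts to matching sorted points to sorted points. The sketch below (with illustrative data and the strictly convex cost d(z) = z²) checks the monotone matching against a brute-force search over all transport maps:

```python
from itertools import permutations

def monge_cost(xs, ys, sigma, d):
    """Cost of the transport map x_i -> y_{sigma(i)} between uniform empirical measures."""
    return sum(d(x - ys[j]) for x, j in zip(xs, sigma)) / len(xs)

def monotone_cost(xs, ys, d):
    """Sorted-to-sorted matching: the discrete analogue of T = G^{-1} o F."""
    return sum(d(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

d = lambda z: z * z                       # a strictly convex cost, d(x - y) = |x - y|^2
xs = [0.3, -1.2, 2.5, 0.9]
ys = [1.1, 0.0, -0.7, 3.0]

# Brute force over all transport maps (permutations) agrees with the monotone map.
brute = min(monge_cost(xs, ys, s, d) for s in permutations(range(len(xs))))
assert abs(brute - monotone_cost(xs, ys, d)) < 1e-12
```

The brute-force search is exponential in n and is only there to confirm the theorem on a toy example; the monotone matching itself costs a single sort.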
Before we prove Theorem 2.1 we give some of the basic ideas behind the proof. The key is the
idea of monotonicity. We say that a set Γ ⊂ R² is monotone (with respect to d) if for all
(x1, y1), (x2, y2) ∈ Γ we have

d(x1 − y1) + d(x2 − y2) ≤ d(x1 − y2) + d(x2 − y1).
For example, if Γ = {(x, y) : f (x) = y} and f is increasing, then Γ is monotone (assuming
that d is increasing). The definition generalises to higher dimensions and often appears in convex
analysis (for example the subdifferential of a convex function satisfies a monotonicity property).
As a result, this concept can also be used to prove analogous results to Theorem 2.1 in higher
dimensions. The definition should be natural for optimal transport. In particular, let Γ be the
support of π † , which is a solution of Kantorovich’s optimal transport problem. If π † transports
mass from x1 to y1 and from x2 > x1 to y2 we expect y2 > y1 , else it would have been cheaper to
transport from x1 to y2 , and from x2 to y1 . The following proposition formalises this reasoning.
Proposition 2.3. Let µ, ν ∈ P(R). Assume π† ∈ Π(µ, ν) is an optimal transport plan in the
Kantorovich sense for cost function c(x, y) = d(x − y) where d is continuous. Then for all
(x1, y1), (x2, y2) ∈ supp(π†) we have

d(x1 − y1) + d(x2 − y2) ≤ d(x1 − y2) + d(x2 − y1),

i.e. supp(π†) is monotone.
Proof. Let Γ = supp(π†) and (x1, y1), (x2, y2) ∈ Γ. Assume there exists η > 0 such that

d(x1 − y1) + d(x2 − y2) ≥ d(x1 − y2) + d(x2 − y1) + η.

3. Ii × Jj are disjoint;
And choose any π̃12 ∈ Π(µ̃1, ν̃2), π̃21 ∈ Π(µ̃2, ν̃1). We define π̃ to satisfy

π̃(A × B) =
  π†(A × B)                 if (A × B) ∩ (Ii × Jj) = ∅ for all i, j,
  0                         if A × B ⊆ Ii × Ji for some i,
  π†(A × B) + π̃12(A × B)   if A × B ⊆ I1 × J2,
  π†(A × B) + π̃21(A × B)   if A × B ⊆ I2 × J1.
For sets with (A × B) ∩ (Ii × Jj) ≠ ∅ but A × B ⊄ Ii × Jj we define π̃(A × B) by additivity. If
B ∩ (J1 ∪ J2) = ∅ then

π̃(R × B) = π†(R × B) = ν(B).

If B ⊆ J1 then π̃(R × B) = ν(B),
since π̃21(I2 × B) = ν̃1(B) = π†(I1 × (B ∩ J1)) = π†(I1 × B). Similarly for B ⊆ J2. Hence we
have π̃(R × B) = ν(B) for all measurable B. Analogously π̃(A × R) = µ(A) for all measurable
A. Therefore π̃ ∈ Π(µ, ν).
Now,

∫_{R×R} d(x − y) dπ†(x, y) − ∫_{R×R} d(x − y) dπ̃(x, y)
  = ∫_{I1×J1 ∪ I2×J2} d(x − y) dπ†(x, y) − ∫_{I1×J2} d(x − y) dπ̃12(x, y) − ∫_{I2×J1} d(x − y) dπ̃21(x, y)
  ≥ δ(d(x1 − y1) − ε) + δ(d(x2 − y2) − ε) − δ(d(x1 − y2) + ε) − δ(d(x2 − y1) + ε)
  ≥ δ(η − 4ε)
  > 0
since π̃12 (I1 × J2 ) = µ̃1 (I1 ) = π † (I1 × J1 ) = δ, and similarly π̃21 (I2 × J1 ) = δ. This contradicts
the assumption that π † is optimal, hence no such η can exist.
Finally we remark that if π†(I1 × J1) > π†(I2 × J2) then one can adapt the constructed plan
π̃ by transporting some mass with the original plan π†. In particular the new constructed plan is
chosen to satisfy

π̃(A × B) = π†(A × B) (1 − π†(I2 × J2)/π†(I1 × J1))

if A × B ⊂ I1 × J1, and µ̃1, ν̃1 are rescaled:

µ̃1 = (π†(I2 × J2)/π†(I1 × J1)) P^X_# (π†⌞(I1 × J1)),  ν̃1 = (π†(I2 × J2)/π†(I1 × J1)) P^Y_# (π†⌞(I1 × J1)).
All other definitions remain unchanged. One can go through the argument above and reach the
same conclusion.
We now prove Theorem 2.1.
Proof of Theorem 2.1. Assume d is continuous and strictly convex. By Proposition 1.5 there
exists π* ∈ Π(µ, ν) that is an optimal transport plan in the Kantorovich sense. We will show that
π* = π†. By Proposition 2.3, Γ = supp(π*) is monotone, i.e.

d(x1 − y1) + d(x2 − y2) ≤ d(x1 − y2) + d(x2 − y1)

for all (x1, y1), (x2, y2) ∈ Γ. We claim that for any x1, x2, y1, y2 satisfying the above and x1 < x2
we have y1 ≤ y2. Assume for contradiction that y2 < y1 and let a = x1 − y1, b = x2 − y2 and δ = x2 − x1. We know
that

d(a) + d(b) ≤ d(b − δ) + d(a + δ).

Let t = δ/(b − a); it is easy to check that t ∈ (0, 1) and b − δ = (1 − t)b + ta, a + δ = tb + (1 − t)a.
Then, by strict convexity of d,

d(b − δ) + d(a + δ) < (1 − t)d(b) + td(a) + td(b) + (1 − t)d(a) = d(a) + d(b),

a contradiction. Hence y1 ≤ y2.
Γ ⊂ {(x, y) : x ≤ x0 , y ≤ y0 } ∪ {(x, y) : x ≥ x0 , y ≥ y0 } .
But,
is strictly convex and satisfies 0 ≤ f(x) ≤ 1 + d(x). Then dε := d + εf is strictly convex and
satisfies d ≤ dε ≤ (1 + ε)d + ε. Now let π ∈ Π(µ, ν), then
∫_{R×R} d(x − y) dπ†(x, y) ≤ ∫_{R×R} dε(x − y) dπ†(x, y)
                            ≤ ∫_{R×R} dε(x − y) dπ(x, y)
                            ≤ (1 + ε) ∫_{R×R} d(x − y) dπ(x, y) + ε.
Taking ε → 0 proves that π† is an optimal plan in the sense of Kantorovich.
Now we show that ∫_{R×R} d(x − y) dπ†(x, y) = ∫_0^1 d(F^{-1}(t) − G^{-1}(t)) dt. We claim that
Furthermore η can be chosen such that the cardinality of the support of η is at most dim(B) + 1
and (the support is) independent of π†.

Proof. Let d = dim(B). It is enough to show that there exist {ai}_{i=0}^d such that π† = Σ_{i=0}^d ai π^(i)
where Σ_{i=0}^d ai = 1 and {π^(i)}_{i=0}^d ⊂ E(B). We prove the result by induction. The case when
d = 0 is trivial since B is just a point.

Now assume the result is true for all sets of dimension at most d − 1. Pick π† ∈ B and assume
π† ∉ E(B). Pick π^(0) ∈ E(B), take the line segment [π^(0), π†] and extend it until it intersects
the boundary of B, i.e. letting θ parametrise the line, {θ : (1 − θ)π^(0) + θπ† ∈ B} = [0, α]
for some α ≥ 1 (where α exists and is finite by convexity and compactness of B). Let
ξ = (1 − α)π^(0) + απ†; then π† = (1 − θ0)ξ + θ0 π^(0) where θ0 = 1 − 1/α. Now since ξ ∈ F
for some proper face F of B¹, by the induction hypothesis there exist {π^(i)}_{i=1}^d such
that ξ = Σ_{i=1}^d θi π^(i) with Σ_{i=1}^d θi = 1. Hence π† = Σ_{i=1}^d (1 − θ0)θi π^(i) + θ0 π^(0). Since
(1 − θ0) Σ_{i=1}^d θi + θ0 = 1, π† is a convex combination of {π^(i)}_{i=0}^d. Note that we chose π^(0)
independently of π†.
Theorem 2.6. Birkhoff's theorem. Let B be the set of n × n bistochastic matrices, i.e.

B = {π ∈ R^{n×n} : πij ≥ 0 for all i, j; Σ_{i=1}^n πij = 1 for all j; Σ_{j=1}^n πij = 1 for all i}.

Then the set of extremal points E(B) of B is exactly the set of permutation matrices, i.e.

E(B) = {π ∈ {0, 1}^{n×n} : Σ_{i=1}^n πij = 1 for all j; Σ_{j=1}^n πij = 1 for all i}.
Proof. We start by showing that every permutation matrix is an extremal point. Let πij = δ_{j=σ(i)}
where σ is a permutation. Assume that π ∉ E(B). Then there exist π^(1), π^(2) ∈ B, with
π^(1) ≠ π ≠ π^(2), and t ∈ (0, 1) such that π = tπ^(1) + (1 − t)π^(2). Let ij be such that
0 = πij ≠ π^(1)_ij; then

0 = πij = tπ^(1)_ij + (1 − t)π^(2)_ij  ⟹  π^(2)_ij = −tπ^(1)_ij/(1 − t) < 0.

This contradicts π^(2)_ij ≥ 0, hence π ∈ E(B).
Now we show that every π ∈ E(B) is a permutation matrix. We do this in two parts: we (i)
show that π ∈ E(B) implies that πij ∈ {0, 1}, then (ii) show πij = δ_{j=σ(i)} for a permutation σ.
For (i) let π ∈ E(B) and assume there exists i1 j1 such that π_{i1 j1} ∈ (0, 1). Since Σ_{i=1}^n π_{i j1} = 1
there exists i2 ≠ i1 such that π_{i2 j1} ∈ (0, 1). Similarly, since Σ_{j=1}^n π_{i2 j} = 1 there exists
j2 ≠ j1 such that π_{i2 j2} ∈ (0, 1). Continuing this procedure until im = i1 we obtain two
sequences:

I = {ik jk : k ∈ {1, . . . , m − 1}},  I⁺ = {i_{k+1} jk : k ∈ {1, . . . , m − 1}}
¹ A face F of a convex set B is any set with the property that if π^(1), π^(2) ∈ B, t ∈ (0, 1) and tπ^(1) + (1 − t)π^(2) ∈
F then π^(1), π^(2) ∈ F. A proper face is a face which has dimension at most dim(B) − 1. A result we use without
proof is that the boundary of a convex set is the union of all proper faces.
with i_{k+1} ≠ ik and j_{k+1} ≠ jk. Define π^(δ) by the following:

π^(δ)_ij =
  π_{ik jk} + δ        if ij = ik jk for some k,
  π_{i_{k+1} jk} − δ   if ij = i_{k+1} jk for some k,
  πij                  else.
Then,

Σ_{i=1}^n π^(δ)_ij = Σ_{i=1}^n πij + δ |{ij ∈ I : i ∈ {1, . . . , n}}| − δ |{ij ∈ I⁺ : i ∈ {1, . . . , n}}|.

Now if ij ∈ I then there exists i′ such that i′j ∈ I⁺, and likewise, if ij ∈ I⁺ then there exists i′
such that i′j ∈ I. Hence,

|{ij ∈ I : i ∈ {1, . . . , n}}| = |{ij ∈ I⁺ : i ∈ {1, . . . , n}}|.

It follows that Σ_{i=1}^n π^(δ)_ij = 1 and analogously Σ_{j=1}^n π^(δ)_ij = 1.
Choose δ = min{min{πij, 1 − πij} : ij ∈ I ∪ I⁺} ∈ (0, 1). Define π^(1) = π^(−δ) and π^(2) = π^(δ).
We have that π^(1)_ij, π^(2)_ij ≥ 0 and therefore π^(1), π^(2) ∈ B with π^(1) ≠ π^(2). Moreover we have
π = (1/2)π^(1) + (1/2)π^(2). Hence, π ∉ E(B). The contradiction implies that there does not exist i1 j1
such that π_{i1 j1} ∈ (0, 1). We have shown that if π ∈ E(B) then πij ∈ {0, 1}.
We're left to show (ii): that πij = δ_{j=σ(i)}. Since π ∈ B, for each i there exists j* such
that π_{ij*} = 1 (else Σ_{j=1}^n πij ≠ 1). We let σ(i) = j*, so by construction we have π_{iσ(i)} = 1. We
claim σ is a permutation. It is enough to show that σ is injective. Now if j = σ(i1) = σ(i2)
where i1 ≠ i2 then

1 = Σ_{i=1}^n πij ≥ π_{i1 j} + π_{i2 j} = 2,

a contradiction. Hence σ is injective and therefore a permutation.
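Conversely, every bistochastic matrix is a convex combination of permutation matrices, and for small n this decomposition can be computed directly. The sketch below (a Birkhoff-von Neumann style decomposition; the brute-force search over permutations is purely illustrative and only practical for small n) peels off one permutation at a time:

```python
from fractions import Fraction
from itertools import permutations

def birkhoff_decompose(pi):
    """Write a bistochastic matrix as a convex combination of permutation matrices."""
    n = len(pi)
    pi = [row[:] for row in pi]            # work on a copy
    decomposition = []
    while any(any(row) for row in pi):
        # Find a permutation supported on the positive entries (one exists by Birkhoff's theorem).
        sigma = next(s for s in permutations(range(n))
                     if all(pi[i][s[i]] > 0 for i in range(n)))
        weight = min(pi[i][sigma[i]] for i in range(n))
        decomposition.append((weight, sigma))
        for i in range(n):
            pi[i][sigma[i]] -= weight      # remove that much mass and repeat
    return decomposition

pi = [[Fraction(1, 2), Fraction(1, 2), Fraction(0)],
      [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)],
      [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]]
parts = birkhoff_decompose(pi)
assert sum(w for w, _ in parts) == 1
# Reconstruct pi from the weighted permutation matrices.
recon = [[sum(w for w, s in parts if s[i] == j) for j in range(3)] for i in range(3)]
assert recon == pi
```

Each step zeroes at least one entry, so the loop terminates after at most n² iterations.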
Theorem 2.7. Let µ = (1/n) Σ_{i=1}^n δ_{xi} and ν = (1/n) Σ_{j=1}^n δ_{yj}. Then there exists a solution to Monge's
optimal transport problem between µ and ν.

Proof. Let cij = c(xi, yj) and B be the set of bistochastic n × n matrices, i.e.

B = {π ∈ R^{n×n} : πij ≥ 0 for all i, j; Σ_{i=1}^n πij = 1 for all j; Σ_{j=1}^n πij = 1 for all i}.
Although, by Proposition 1.5, there exists a minimiser to the Kantorovich optimal transport
problem we do not use this fact here. Let M be the minimum of the Kantorovich optimal
transport problem, let ε > 0, and find an approximate minimiser π^ε ∈ B such that

M ≥ Σ_{ij} cij π^ε_ij − ε.

If we let f(π) = Σ_{ij} cij πij then, since B is compact and convex, there exists a measure η
supported on E(B) such that

f(π^ε) = ∫ f(π) dη(π).

Hence

M ≥ ∫ Σ_{ij} cij πij dη(π) − ε ≥ inf_{π∈E(B)} Σ_{ij} cij πij − ε ≥ M − ε.
P
Since this is true for all ε it holds that inf_{π∈E(B)} Σ_{ij} cij πij = M. We claim that E(B) is compact,
in which case there exists a minimiser π† ∈ E(B). Note that we have also shown (independently
from Proposition 1.5) that there exists a solution to Kantorovich's optimal transport problem.

By Birkhoff's theorem π† is a permutation matrix, that is there exists a permutation σ† :
{1, . . . , n} → {1, . . . , n} such that π†_ij = δ_{j=σ†(i)}. Let T† : X → Y be defined by T†(xi) = y_{σ†(i)}.
We already know that the set of transport maps is non-empty. Let T be any transport map
and define πij = δ_{yj=T(xi)} (it is easy to see that π ∈ B); then

Σ_{i=1}^n c(xi, T(xi)) = Σ_{ij} cij πij ≥ Σ_{ij} cij π†_ij = Σ_{i=1}^n c(xi, T†(xi)).
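The structure of the argument can be checked numerically: by linearity of the cost, a convex combination of permutation matrices never costs less than the best permutation. The cost matrix and weights below are illustrative:

```python
import random
from fractions import Fraction
from itertools import permutations

n = 3
c = [[1, 5, 2], [4, 1, 3], [2, 2, 1]]        # illustrative cost matrix c_ij = c(x_i, y_j)
perms = list(permutations(range(n)))

def perm_matrix(sigma):
    """0-1 matrix of the permutation sigma."""
    return [[Fraction(int(j == sigma[i])) for j in range(n)] for i in range(n)]

def plan_cost(pi):
    return sum(c[i][j] * pi[i][j] for i in range(n) for j in range(n))

pmats = [perm_matrix(s) for s in perms]
best = min(plan_cost(P) for P in pmats)

# A convex combination of permutation matrices is bistochastic; by linearity its cost
# is the same convex combination of permutation costs, so it never beats `best`.
random.seed(0)
for _ in range(50):
    w = [Fraction(random.randint(1, 10)) for _ in perms]
    total = sum(w)
    pi = [[sum(wk * P[i][j] for wk, P in zip(w, pmats)) / total for j in range(n)]
          for i in range(n)]
    assert [sum(row) for row in pi] == [1] * n       # bistochastic
    assert plan_cost(pi) >= best
```

This is exactly why the linear programme attains its minimum at an extremal point, i.e. at a transport map.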
Chapter 3
Kantorovich Duality
We saw in the previous chapter how Kantorovich’s optimal transport problem resembles a linear
programme. It should not therefore be surprising that Kantorovich’s optimal transport problem
admits a dual formulation. In the following section we state the duality result and give an intuitive
but non-rigorous proof. In Section 3.2 we give a general minimax principle upon which we can
base the proof of Kantorovich duality. In Section 3.3 we can then rigorously prove duality. With
additional assumptions such as restricting X, Y to Euclidean spaces we prove the existence of
solutions to the dual problem in Section 3.4.
Let Φc be defined by

Φc = {(ϕ, ψ) ∈ L¹(µ) × L¹(ν) : ϕ(x) + ψ(y) ≤ c(x, y)}

where the inequality is understood to hold for µ-almost every x ∈ X and ν-almost every y ∈ Y,
and let J(ϕ, ψ) = ∫_X ϕ dµ + ∫_Y ψ dν.
Then,

min_{π∈Π(µ,ν)} K(π) = sup_{(ϕ,ψ)∈Φc} J(ϕ, ψ).
Let us give an informal interpretation of the result which originally comes from Caffarelli
and which I take from Villani [15]. Consider the shipper's problem. Suppose we own a number of coal
mines and a number of factories, and we wish to transport the coal from mines to factories. The
amount each mine produces and each factory requires is fixed (and we assume equal). The cost
for us to transport from mine x to factory y is c(x, y). The total optimal cost is the solution
to Kantorovich's optimal transport problem. Now a clever shipper comes to us and says they
will ship for us: we just pay a price ϕ(x) for loading and ψ(y) for unloading. To make it
in our interest the shipper makes sure that ϕ(x) + ψ(y) ≤ c(x, y), that is, the cost is no more
than what we would have spent transporting the coal ourselves. Kantorovich duality tells us that
one can find ϕ and ψ such that this price scheme costs just as much as paying for the cost of
transport ourselves.
We now give an informal proof that will subsequently be made rigorous. Let M =
inf_{π∈Π(µ,ν)} K(π). Observe that

(3.2) M = inf_{π∈M+(X×Y)} sup_{(ϕ,ψ)} [∫_{X×Y} c(x, y) dπ + ∫_X ϕ d(µ − P^X_#π) + ∫_Y ψ d(ν − P^Y_#π)]

where we take the supremum on the right hand side over (ϕ, ψ) ∈ C⁰_b(X) × C⁰_b(Y). This follows
since

sup_{ϕ∈C⁰_b(X)} ∫_X ϕ d(µ − P^X_#π) = { +∞ if µ ≠ P^X_#π; 0 else. }
Hence, the infimum over π of the right hand side of (3.2) is on the set where P^X_#π = µ and,
similarly, P^Y_#π = ν (which means that π ∈ Π(µ, ν)). We can rewrite (3.2) more conveniently as

(3.3) M = inf_{π∈M+(X×Y)} sup_{(ϕ,ψ)} [∫_{X×Y} (c(x, y) − ϕ(x) − ψ(y)) dπ + ∫_X ϕ dµ + ∫_Y ψ dν].
Formally exchanging the infimum and the supremum (this is the minimax step that must be
justified), if there exists (x0, y0) ∈ X × Y and ε > 0 such that ϕ(x0) + ψ(y0) − c(x0, y0) = ε > 0
then by letting πλ = λδ_{(x0,y0)} for some λ > 0 we have

inf_{π∈M+(X×Y)} ∫_{X×Y} (c(x, y) − ϕ(x) − ψ(y)) dπ ≤ −λε → −∞ as λ → ∞.

Hence the supremum over (ϕ, ψ) on the right hand side of (3.3) can be restricted to when ϕ(x) + ψ(y) ≤ c(x, y)
for all (x, y) ∈ X × Y, i.e. (ϕ, ψ) ∈ Φc (this heuristic argument actually used (ϕ, ψ) ∈ C⁰_b(X) ×
C⁰_b(Y) not L¹(µ) × L¹(ν), and there is a difference between the constraint ϕ(x) + ψ(y) ≤ c(x, y)
holding everywhere and holding almost everywhere; these are technical details that are not
important at this stage). When (ϕ, ψ) ∈ Φc then

inf_{π∈M+(X×Y)} ∫_{X×Y} (c(x, y) − ϕ(x) − ψ(y)) dπ = 0
which is achieved for π ≡ 0 for example. Hence,

inf_{π∈Π(µ,ν)} K(π) = sup_{(ϕ,ψ)∈Φc} (∫_X ϕ(x) dµ(x) + ∫_Y ψ(y) dν(y)).
This is the statement of Kantorovich duality. To complete this argument we need to make the
minimax principle rigorous. In the next section we prove a minimax principle; in the section
after we apply it to Kantorovich duality and provide a complete proof.
Convex analysis will play a greater role in the sequel, in particular in Chapter 4 where we will
provide a more in-depth review. We now state the minimax principle taken from Villani [15].
Theorem 3.2. Fenchel-Rockafellar Duality. Let E be a normed vector space and Θ, Ξ : E →
R ∪ {+∞} two convex functions. Assume there exists z0 ∈ E such that Θ(z0) < ∞, Ξ(z0) < ∞
and Θ is continuous at z0. Then,

inf_E (Θ + Ξ) = max_{z*∈E*} (−Θ*(−z*) − Ξ*(z*)).
A = {(z, t) ∈ E × R : t ≥ Θ(z)} .
B = {(z, t) ∈ E × R : t ≤ Θ(z)} .
4. If D ⊂ E is convex and int(D) 6= ∅ then D = int(D).
The following theorem, the Hahn-Banach theorem can be stated in multiple different forms.
The most convenient form for us is in terms of separation of convex sets.
Theorem 3.4. Hahn-Banach Theorem. Let E be a topological vector space. Assume A, B are
convex, non-empty and disjoint subsets of E, and that A is open. Then there exists a closed
hyperplane separating A and B.
A = {(x, λ) ∈ E × R : λ ≥ Θ(x)}
B = {(y, σ) ∈ E × R : σ ≤ M − Ξ(y)} .
By Lemma 3.3 A and B are convex. By continuity and finiteness of Θ at z0 the interior of A is
non-empty and by finiteness of Ξ at z0, B is non-empty. Let C = int(A) (which is convex by
Lemma 3.3). Now, if (x, λ) ∈ C then λ > Θ(x), therefore λ + Ξ(x) > Θ(x) + Ξ(x) ≥ M. Hence
(x, λ) 6∈ B. In particular B ∩ C = ∅. By the Hahn-Banach theorem there exists a hyperplane
H = {Φ = α} that separates B and C, i.e. if we write Φ(x, λ) = f (x) + kλ (where f is linear)
then
∀(x, λ) ∈ C, f (x) + kλ ≥ α
∀(x, λ) ∈ B, f (x) + kλ ≤ α.
Now if (x, λ) ∈ A then there exists a sequence (xn, λn) ∈ C such that (xn, λn) → (x, λ). Hence
f(x) + kλ = lim_{n→∞} (f(xn) + kλn) ≥ α. Therefore

(3.4) ∀(x, λ) ∈ A, f(x) + kλ ≥ α

while for the other side of the hyperplane

(3.5) ∀(x, λ) ∈ B, f(x) + kλ ≤ α.
We know that (z0 , λ) ∈ A for λ sufficiently large, hence k ≥ 0. We claim k > 0. Assume
k = 0. Then
As z0 ∈ Dom(Θ) ∩ Dom(Ξ), f(z0) = α. Since Θ is continuous at z0 there exists r > 0
such that B(z0, r) ⊂ Dom(Θ), hence for all z with ‖z‖ < r and δ ∈ R with |δ| < 1 we have

f(z0 + δz) = f(z0) + δf(z) ≥ α = f(z0), and therefore δf(z) ≥ 0.

This is true for all δ ∈ (−1, 1) and therefore f(z) = 0 for z ∈ B(0, r). Hence f ≡ 0 on E. It
follows that Φ ≡ 0 which is clearly a contradiction (either H = E × R if α = 0 or H = ∅). It
must be that k > 0.
By (3.4) we have

Θ*(−f/k) = sup_{z∈E} (−f(z)/k − Θ(z)) = −(1/k) inf_{z∈E} (f(z) + kΘ(z)) ≤ −α/k

since (z, Θ(z)) ∈ A. Similarly, by (3.5) we have

Ξ*(f/k) = sup_{z∈E} (f(z)/k − Ξ(z)) = −M + (1/k) sup_{z∈E} (f(z) + k(M − Ξ(z))) ≤ −M + α/k

since (z, M − Ξ(z)) ∈ B. It follows that

M ≥ sup_{z*∈E*} (−Θ*(−z*) − Ξ*(z*)) ≥ −Θ*(−f/k) − Ξ*(f/k) ≥ α/k + M − α/k = M.

So

inf_{x∈E} (Θ(x) + Ξ(x)) = M = sup_{z*∈E*} (−Θ*(−z*) − Ξ*(z*)).

Furthermore z* = f/k must achieve the supremum.
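The duality formula can be checked by hand for quadratics on E = R (so that E* = R as well). The sketch below uses the illustrative choices Θ(z) = z² and Ξ(z) = (z − 1)², approximating the conjugates by maximising over a grid; both sides of the duality equal 1/2, attained at z = 1/2 and z* = −1 respectively:

```python
def theta(z): return z * z
def xi(z):    return (z - 1) ** 2

def conjugate(f, zstar, grid):
    """Legendre-Fenchel transform f*(z*) = sup_z (z* z - f(z)), sup taken over a grid."""
    return max(zstar * z - f(z) for z in grid)

grid = [k / 50 for k in range(-250, 251)]     # the interval [-5, 5] in steps of 0.02

primal = min(theta(z) + xi(z) for z in grid)
dual = max(-conjugate(theta, -zs, grid) - conjugate(xi, zs, grid) for zs in grid)

# inf (Theta + Xi) = 1/2 at z = 1/2; the dual maximum is also 1/2, at z* = -1.
assert abs(primal - 0.5) < 1e-3
assert abs(dual - 0.5) < 1e-3
```

Here Θ*(z*) = z*²/4 and Ξ*(z*) = z* + z*²/4 in closed form, so the grid only serves as a sanity check on the sup/max formulas.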
Proof. Let (ϕ, ψ) ∈ Φc and π ∈ Π(µ, ν). Let A ⊂ X and B ⊂ Y be sets such that µ(A) = 1,
ν(B) = 1 and

ϕ(x) + ψ(y) ≤ c(x, y)  ∀(x, y) ∈ A × B.

Now π((A × B)^c) ≤ π(A^c × Y) + π(X × B^c) = µ(A^c) + ν(B^c) = 0. Hence,

π(A × B) = 1.

So it follows that ϕ(x) + ψ(y) ≤ c(x, y) for π-almost every (x, y). Then,

J(ϕ, ψ) = ∫_X ϕ dµ + ∫_Y ψ dν = ∫_{X×Y} (ϕ(x) + ψ(y)) dπ(x, y) ≤ ∫_{X×Y} c(x, y) dπ(x, y).
The result of the lemma follows by taking the supremum over (ϕ, ψ) ∈ Φc on the right hand side
and the infimum over π ∈ Π(µ, ν) on the left hand side.
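In the discrete setting the lemma (weak duality) is easy to verify numerically. Given any ϕ, setting ψ_j = min_i (c_ij − ϕ_i) makes the pair feasible, and then J(ϕ, ψ) is at most the cost of any transport plan, for instance the product plan. All data below is illustrative:

```python
c = [[1.0, 2.0, 5.0],
     [3.0, 1.0, 2.0]]                  # illustrative cost matrix
alpha = [0.5, 0.5]                     # marginal of mu
beta = [0.2, 0.3, 0.5]                 # marginal of nu

phi = [0.7, -0.1]                      # arbitrary potentials on the x-side
# Feasibility: psi_j = min_i (c_ij - phi_i) guarantees phi_i + psi_j <= c_ij for all i, j.
psi = [min(c[i][j] - phi[i] for i in range(2)) for j in range(3)]

J = sum(a * p for a, p in zip(alpha, phi)) + sum(b * q for b, q in zip(beta, psi))

# The product plan pi_ij = alpha_i * beta_j is one transport plan in Pi(mu, nu).
K = sum(alpha[i] * beta[j] * c[i][j] for i in range(2) for j in range(3))
assert J <= K + 1e-12                  # weak duality: J(phi, psi) <= K(pi)
```

The construction of ψ from ϕ is precisely the c-transform introduced in Section 3.4.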
To complete the proof of Theorem 3.1 we need to show that the opposite inequality in
Lemma 3.5 is also true.
1. Let E = Cb0 (X × Y ) equipped with the supremum norm. The dual space of E is the space
of Radon measures E ∗ = M(X × Y ) (by the Riesz–Markov–Kakutani representation theorem).
Define

Θ(u) = { 0 if u(x, y) ≥ −c(x, y) for all (x, y); +∞ else },

Ξ(u) = { ∫_X ϕ(x) dµ(x) + ∫_Y ψ(y) dν(y) if u(x, y) = ϕ(x) + ψ(y); +∞ else }.
Note that although the representation u(x, y) = ϕ(x) + ψ(y) is not unique (ϕ and ψ are only
unique up to a constant) Ξ is still well defined. We claim that Θ and Ξ are convex. For Θ
24
consider u, v with Θ(u), Θ(v) < +∞, then u(x, y) ≥ −c(x, y) and v(x, y) ≥ −c(x, y), hence
tu(x, y) + (1 − t)v(x, y) ≥ −c(x, y) for any t ∈ [0, 1]. It follows that
Θ(tu + (1 − t)v) = 0 = tΘ(u) + (1 − t)Θ(v).
If either Θ(u) = +∞ or Θ(v) = +∞ then clearly
Θ(tu + (1 − t)v) ≤ tΘ(u) + (1 − t)Θ(v).
Hence Θ is convex. For Ξ if either Ξ(u) = +∞ or Ξ(v) = +∞ then clearly
Ξ(tu + (1 − t)v) ≤ tΞ(u) + (1 − t)Ξ(v).
Assume u(x, y) = ϕ1 (x) + ψ1 (y), v(x, y) = ϕ2 (x) + ψ2 (y) then
tu(x, y) + (1 − t)v(x, y) = tϕ1 (x) + (1 − t)ϕ2 (x) + tψ1 (y) + (1 − t)ψ2 (y)
and therefore
Ξ(tu + (1 − t)v) = ∫_X (tϕ1 + (1 − t)ϕ2) dµ + ∫_Y (tψ1 + (1 − t)ψ2) dν = tΞ(u) + (1 − t)Ξ(v).
Hence Ξ is convex.
Let u ≡ 1; then Θ(u), Ξ(u) < +∞ and Θ is continuous at u. By Theorem 3.2,

(3.6) inf_{u∈E} (Θ(u) + Ξ(u)) = max_{π∈E*} (−Θ*(−π) − Ξ*(π)).
We now consider the right hand side of (3.6). To do so we need to find the convex conjugates
of Θ and Ξ. For Θ∗ we compute
Θ∗(−π) = sup_{u∈E} (−∫_{X×Y} u dπ − Θ(u)) = sup_{u≥−c} (−∫_{X×Y} u dπ) = sup_{u≤c} ∫_{X×Y} u dπ.
Then we find
Θ∗(−π) = ∫_{X×Y} c(x, y) dπ if π ∈ M+(X × Y), and Θ∗(−π) = +∞ else.
For Ξ∗ we have
Ξ∗(π) = sup_{u∈E} (∫_{X×Y} u dπ − Ξ(u))
= sup_{u(x,y)=ϕ(x)+ψ(y)} (∫_{X×Y} u dπ − ∫_X ϕ(x) dµ − ∫_Y ψ(y) dν)
= sup_{u(x,y)=ϕ(x)+ψ(y)} (∫_X ϕ d(P^X_#π − µ) + ∫_Y ψ d(P^Y_#π − ν))
= 0 if π ∈ Π(µ, ν), and Ξ∗(π) = +∞ else.
Hence, the right hand side of (3.6) reads
max_{π∈E∗} (−Θ∗(−π) − Ξ∗(π)) = − min_{π∈Π(µ,ν)} ∫_{X×Y} c(x, y) dπ = − min_{π∈Π(µ,ν)} K(π).
Furthermore we can choose (ϕ, ψ) = (η^cc, η^c) for some η ∈ L1(µ), where η^c is defined below.
The condition that M < ∞ is effectively a moment condition on µ and ν. In particular, if
c(x, y) = |x − y|^p then c(x, y) ≤ C(|x|^p + |y|^p) and the requirement that M < ∞ is exactly the
condition that µ, ν have finite pth moments.
The proof relies on similar concepts as the proof of duality. In particular, for ϕ : X → R, the
c-transforms ϕ^c, ϕ^cc defined by
ϕ^c : Y → R, ϕ^c(y) = inf_{x∈X} (c(x, y) − ϕ(x)),
ϕ^cc : X → R, ϕ^cc(x) = inf_{y∈Y} (c(x, y) − ϕ^c(y))
are key; one should compare this to the Legendre–Fenchel transform defined in the previous
section. We first give a result which implies we only need to consider c-transform pairs.
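The c-transforms are straightforward to compute on a discretisation of X and Y, which can help build intuition for the lemma below. A minimal numerical sketch (the grid, cost and potential are illustrative choices, not from the notes):

```python
import numpy as np

# Discretise X = Y = [0, 1] and take the quadratic cost c(x, y) = |x - y|^2.
x = np.linspace(0.0, 1.0, 50)
C = (x[:, None] - x[None, :]) ** 2  # C[i, j] = c(x_i, y_j)

phi = np.sin(2 * np.pi * x)  # an arbitrary potential phi : X -> R

# c-transform: phi^c(y) = min_x (c(x, y) - phi(x))
phi_c = np.min(C - phi[:, None], axis=0)
# double c-transform: phi^cc(x) = min_y (c(x, y) - phi^c(y))
phi_cc = np.min(C - phi_c[None, :], axis=1)

# phi^cc >= phi, and the pair (phi^cc, phi^c) satisfies the constraint
# phi^cc(x) + phi^c(y) <= c(x, y), as in the lemma that follows
assert np.all(phi_cc >= phi - 1e-12)
assert np.all(phi_cc[:, None] + phi_c[None, :] <= C + 1e-12)
```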
Lemma 3.8. Let µ ∈ P(X), ν ∈ P(Y). For any a ∈ R and (ϕ̃, ψ̃) ∈ Φc the pair (ϕ, ψ) =
(ϕ̃^cc − a, ϕ̃^c + a) satisfies J(ϕ, ψ) ≥ J(ϕ̃, ψ̃) and ϕ(x) + ψ(y) ≤ c(x, y) for µ-almost every
x ∈ X and ν-almost every y ∈ Y.
Furthermore, if J(ϕ̃, ψ̃) > −∞, M < +∞ (where M is defined by (3.7)), and there exist
cX ∈ L1(µ) and cY ∈ L1(ν) such that ϕ ≤ cX and ψ ≤ cY, then (ϕ, ψ) ∈ Φc.
Proof. Clearly J(ϕ − a, ψ + a) = J(ϕ, ψ) for all a ∈ R, ϕ ∈ L1(µ) and ψ ∈ L1(ν), so it is
enough to show that ϕ = ϕ̃^cc ≥ ϕ̃, ψ = ϕ̃^c ≥ ψ̃, and ϕ(x) + ψ(y) ≤ c(x, y).
Note that
ψ(y) = ϕ̃^c(y) = inf_{z∈X} (c(z, y) − ϕ̃(z)) ≥ ψ̃(y)
since ψ̃(y) ≤ c(z, y) − ϕ̃(z) for every z ∈ X, and similarly ϕ(x) = ϕ̃^cc(x) ≥ ϕ̃(x).
We easily see that
ϕ(x) + ψ(y) ≤ c(x, y)
by choosing z = y in the infimum ϕ(x) = inf_{z∈Y} (c(x, z) − ϕ̃^c(z)).
For the furthermore part of the lemma it is left to show integrability of ϕ, ψ. Note that
∫_X (ϕ(x) − cX(x)) dµ(x) + ∫_Y (ψ(y) − cY(y)) dν(y) = J(ϕ, ψ) − M ≥ J(ϕ̃, ψ̃) − M
and since ϕ − cX ≤ 0, ψ − cY ≤ 0 then both integrals on the left hand side are non-positive. In
particular
‖ϕ − cX‖_{L1(µ)} + ‖ψ − cY‖_{L1(ν)} = −∫_X (ϕ(x) − cX(x)) dµ(x) − ∫_Y (ψ(y) − cY(y)) dν(y)
≤ M − J(ϕ̃, ψ̃).
ϕk(x) ≤ cX(x) ∀x ∈ X, ∀k ∈ N,
ψk(y) ≤ cY(y) ∀y ∈ Y, ∀k ∈ N.
Proof. Let (ϕ̃k, ψ̃k) ∈ Φc be a maximising sequence. Notice that since 0 ≤ sup_{Φc} J ≤
inf_{Π(µ,ν)} K ≤ M < ∞ then ϕ̃k, ψ̃k must be proper functions (in fact not equal to ±∞ anywhere).
Let (ϕk, ψk) = (ϕ̃k^cc − ak, ϕ̃k^c + ak) where we will choose
ak = inf_{y∈Y} (cY(y) − ϕ̃k^c(y)).
By Lemma 3.8, (ϕk, ψk) ∈ Φc and (ϕk, ψk) is a maximising sequence once we have shown that
ϕk ≤ cX and ψk ≤ cY.
We start by showing ak ∈ (−∞, +∞). Since (ϕ̃k, ψ̃k) ∈ Φc then ϕ̃k(x) ≤ c(x, y) − ψ̃k(y)
for all y ∈ Y. Hence there exists y0 ∈ Y and b0 ∈ R (possibly depending on k) such that
ϕ̃k(x) ≤ c(x, y0) + b0, from which it follows that ak ≤ cY(y0) + b0 < +∞. Then,
cY(y) − ϕ̃k^c(y) = sup_{x∈X} (cY(y) − c(x, y) + ϕ̃k(x)) ≥ sup_{x∈X} (ϕ̃k(x) − cX(x)) ≥ ϕ̃k(x0) − cX(x0)
for any x0 ∈ X. Hence, ak ≥ ϕ̃k(x0) − cX(x0) > −∞. We have shown that
ak ∈ (−∞, +∞) and the pair (ϕk, ψk) is well defined.
Clearly ψk(y) = ϕ̃k^c(y) + ak ≤ cY(y). And,
ϕk(x) − cX(x) = inf_{y∈Y} (c(x, y) − ϕ̃k^c(y) − ak − cX(x)) ≤ inf_{y∈Y} (cY(y) − ϕ̃k^c(y) − ak) = 0,
and
ϕk^{(ℓ)}(x) + ψk^{(ℓ)}(y) ≤ max {ϕk(x) − cX(x) + ψk(y) − cY(y), −ℓ} + cX(x) + cY(y)
(3.8) ≤ max {c(x, y) − cX(x) − cY(y), −ℓ} + cX(x) + cY(y).
For each ℓ the sequence ϕk^{(ℓ)} is bounded in L∞ so {ϕk^{(ℓ)}}_{k∈N} is weakly compact in Lp(µ)
for any p ∈ (1, ∞) (bounded sequences in a reflexive Banach space have weakly convergent
subsequences). Let us choose p = 2; then, after extracting a subsequence, we have that
ϕk^{(ℓ)} ⇀ ϕ^{(ℓ)} ∈ L1(µ) (since L2(µ) ⊂ L1(µ)). By a diagonalisation argument we can assume that
ϕk^{(ℓ)} ⇀ ϕ^{(ℓ)} for all ℓ ∈ N. We can apply the same argument to ψk^{(ℓ)} to imply the existence of weak
limits ψ^{(ℓ)} ∈ L1(ν). We have that
cX ≥ ϕ^{(1)} ≥ ϕ^{(2)} ≥ . . .
cY ≥ ψ^{(1)} ≥ ψ^{(2)} ≥ . . .
Since ϕ^{(ℓ)}, ψ^{(ℓ)} are bounded above by an L1 function and monotonically decreasing we can
apply the Monotone Convergence Theorem to infer
lim_{ℓ→∞} ∫_X ϕ^{(ℓ)}(x) dµ(x) = ∫_X ϕ†(x) dµ(x),
lim_{ℓ→∞} ∫_Y ψ^{(ℓ)}(y) dν(y) = ∫_Y ψ†(y) dν(y),
where ϕ† and ψ† are the pointwise limits of ϕ^{(ℓ)} and ψ^{(ℓ)} respectively.
The functions (ϕ†, ψ†) are our candidate maximisers. We are required to show that (ϕ†, ψ†) ∈
Φc and J(ϕ, ψ) ≤ J(ϕ†, ψ†) for all (ϕ, ψ) ∈ Φc.
Since sup_{Φc} J = lim_{k→∞} J(ϕk, ψk) ≤ lim_{k→∞} J(ϕk^{(ℓ)}, ψk^{(ℓ)}) = J(ϕ^{(ℓ)}, ψ^{(ℓ)}) for any ℓ ∈ N
then
J(ϕ†, ψ†) = lim_{ℓ→∞} J(ϕ^{(ℓ)}, ψ^{(ℓ)}) ≥ sup_{Φc} J.
Hence (ϕ†, ψ†) maximises J.
It follows from taking ℓ → ∞ in (3.8) that ϕ†(x) + ψ†(y) ≤ c(x, y). Now integrability
follows from
0 ≥ ∫_X (ϕ†(x) − cX(x)) dµ(x) + ∫_Y (ψ†(y) − cY(y)) dν(y) ≥ sup_{Φc} J − M.
In particular, since ϕ† − cX ≤ 0, ψ† − cY ≤ 0 then it follows that both integrals are finite and
ϕ† − cX ∈ L1(µ), ψ† − cY ∈ L1(ν). Hence ϕ† ∈ L1(µ) and ψ† ∈ L1(ν).
For the furthermore part of the theorem we use the double c-transform trick as in the proof of
Lemma 3.9. For any a ∈ R we have, by Lemma 3.8,
J((ϕ†)^cc − a, (ϕ†)^c + a) ≥ J(ϕ†, ψ†).
We only have to show ((ϕ†)^cc, (ϕ†)^c) ∈ L1(µ) × L1(ν). Let a = inf_{y∈Y} (cY(y) − (ϕ†)^c(y)); then
a ∈ R for the same reasons that ak ∈ R in the proof of Lemma 3.9. Clearly (ϕ†)^c(y) + a ≤ cY(y)
and
(ϕ†)^cc(x) − a = inf_{y∈Y} (c(x, y) − (ϕ†)^c(y) − a) ≤ cX(x) + inf_{y∈Y} (cY(y) − (ϕ†)^c(y)) − a = cX(x).
Hence, ((ϕ†)^cc − a, (ϕ†)^c + a) ∈ L1(µ) × L1(ν) by Lemma 3.8. Trivially ((ϕ†)^cc, (ϕ†)^c) ∈
L1(µ) × L1(ν).
Chapter 4
Our aim is to characterise the optimal transport plans that arise as minimisers of the Kantorovich
optimal transportation problem and show sufficient conditions for the existence of optimal
transport maps. In this chapter we will restrict ourselves to the cost function c(x, y) = ½|x − y|^2.
Generalisations to other cost functions are possible but with an additional notational burden.¹ The
second restriction we make is to assume that X and Y are subsets of Rn.
The chapter is structured as follows. We first state our objectives and in particular the results
we will prove, then give motivating explanations. We will require some results and definitions
from convex analysis which we give in Section 4.2 before finally proving the main theorems
from the first section.
We will review some convex analysis in Section 4.2 which will inform the definition. For now it
is enough to know that the subdifferential is a generalisation of the differential which always
exists for lower semi-continuous convex functions. Note that the subdifferential is in general a
set however if ϕ is differentiable at x then ∂ϕ(x) = {∇ϕ(x)}.
¹For example it is possible to show the existence of optimal transport maps for strictly convex and superlinear
cost functions, and characterise them in terms of c-superdifferentials of c-convex functions.
Theorem 4.1 (Knott–Smith Optimality Criterion). Let µ ∈ P(X), ν ∈ P(Y) with X, Y ⊂ Rn
and assume that µ, ν both have finite second moments. Then π ∈ Π(µ, ν) is a minimiser of
Kantorovich's optimal transport problem with cost c(x, y) = ½|x − y|^2 if and only if there
exists an L1(µ), convex, lower semi-continuous function ϕ such that supp(π) ⊆ Gra(∂ϕ), or
equivalently y ∈ ∂ϕ(x) for π-almost every (x, y). Moreover the pair (ϕ, ϕ∗) is a minimiser of
the problem inf_Φ̃ J(ϕ, ψ) where ϕ∗ is the convex conjugate defined in Definition 4.4, J is defined
by (3.1), and Φ̃ is defined by
Φ̃ = {(ϕ̃, ψ̃) ∈ L1(µ) × L1(ν) : ϕ̃(x) + ψ̃(y) ≥ x · y}.
Why do we expect the optimal plan to have support in the graph of the subdifferential of
a convex function? Let us first consider the 1D case and a map T# µ = ν. We should expect
any map that is ‘optimal’ to be order preserving, i.e. if x1 ≤ x2 then T (x1 ) ≤ T (x2 ). This is
equivalent to saying that T is non-decreasing.
Maps rule out the splitting of mass since each x is sent to the single point T(x). However if we
let T be set valued (i.e. we are considering plans instead of maps) then the increasing property
should, in some sense, still hold.
Let
Γ = {(x, y) : x ∈ X and y ∈ T (x)} .
We can write the increasing property as: for any (x1 , y1 ), (x2 , y2 ) ∈ Γ with x1 ≤ x2 then y1 ≤ y2 .
In convex analysis this property is called cyclical monotonicity. It can be shown that any
cyclically monotone set can be written as the subgradient of a convex function (since any convex
function has a non-decreasing derivative). Hence we expect any optimal plan to be supported in
the subgradient of a convex function. This turns out to also be true in dimensions greater than
one.
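This intuition can be tested numerically: in 1D with quadratic cost, matching sorted atoms (the monotone map) achieves the minimal cost over all assignments. A brute-force sketch (the sample size and distributions are arbitrary choices, not from the notes):

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(0)
x = rng.normal(size=6)  # atoms of mu (equal weights 1/6)
y = rng.normal(size=6)  # atoms of nu (equal weights 1/6)

def cost(sigma):
    # Monge cost of the map x_i -> y_{sigma(i)} for c(x, y) = |x - y|^2
    return np.sum((x - y[list(sigma)]) ** 2)

# the monotone map: i-th smallest atom of mu goes to i-th smallest atom of nu
ranks = np.argsort(np.argsort(x))
monotone = np.argsort(y)[ranks]

# exhaustive search over all 6! assignments confirms optimality
best = min(cost(sigma) for sigma in permutations(range(6)))
assert np.isclose(cost(monotone), best)
```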
The next result specifically gives conditions sufficient for the existence of transport maps.
Theorem 4.2 (Brenier's Theorem). Let µ ∈ P(X), ν ∈ P(Y) with X, Y ⊂ Rn and assume
that µ, ν both have finite second moments and that µ does not give mass to small sets. Then
there is a unique solution π† ∈ Π(µ, ν) to Kantorovich's optimal transport problem with cost
c(x, y) = ½|x − y|^2 which is given by
π† = (Id × ∇ϕ)#µ, or equivalently dπ†(x, y) = dµ(x) δ(y = ∇ϕ(x)),
where ∇ϕ is the gradient of a convex function (defined µ-almost everywhere) that pushes µ
forward to ν, i.e. (∇ϕ)#µ = ν.
Any convex function is locally Lipschitz on the interior of its domain. It follows (from
Rademacher’s theorem) that ϕ is almost everywhere (in the Lebesgue sense) differentiable on the
interior of its domain. The fact that the set of non-differentiability is a small set and µ gives zero
mass to small sets implies we can talk about the derivative of ϕ on the support of π as though it
exists everywhere.
The following corollary is immediate from Brenier’s theorem.
Corollary 4.3. Under the assumptions of Theorem 4.2, ∇ϕ is the unique solution to the Monge
transportation problem:
½ ∫_X |x − ∇ϕ(x)|^2 dµ(x) = inf_{T#µ=ν} ½ ∫_X |x − T(x)|^2 dµ(x).
We sketch the proof of the corollary. From the inequality
min_{π∈Π(µ,ν)} K(π) ≤ inf_{T : T#µ=ν} M(T)
that was argued in Section 1.2 it is enough to show T† = ∇ϕ satisfies M(T†) ≤ min_{Π(µ,ν)} K
(T†#µ = ν is given in Theorem 4.2). Now, let π† ∈ Π(µ, ν) be as in Theorem 4.2,
M(T†) = ½ ∫_X |x − T†(x)|^2 dµ(x)
= ½ ∫_{X×Y} |x − T†(x)|^2 dπ†(x, y)
= ½ ∫_{X×Y} |x − y|^2 dπ†(x, y)
= min_{π∈Π(µ,ν)} K(π)
since T†(x) = y for π†-almost every (x, y) ∈ X × Y. Uniqueness of T† follows from uniqueness
of π†.
In fact if ϕ is convex then ϕ is differentiable almost everywhere, hence we have that ∂ϕ(x) =
{∇ϕ(x)} for almost every x.
Proof. Let x ∈ int(Dom(ϕ)) and δ ∗ be such that B(x, δ ∗ ) ⊂ int(Dom(ϕ)). We show that ϕ is
Lipschitz continuous on B(x, δ ∗ /4). Then, by Rademacher’s theorem2 , ϕ is differentiable almost
everywhere on B(x, δ ∗ /4), and therefore differentiable almost everywhere on int(Dom(ϕ)).
This will complete the proof of (1).
We show ϕ is Lipschitz on B(x, δ∗/4) by first showing that ϕ is uniformly bounded on
B(x, δ∗/2). By the Minkowski–Carathéodory Theorem (Theorem 2.5) there exists {xi}_{i=0}^n ⊂
∂B(x, δ∗) such that for all y ∈ B(x, δ∗) there exists {λi}_{i=0}^n ⊂ [0, 1] with Σ_{i=0}^n λi = 1 and
y = Σ_{i=0}^n λi xi. So,
ϕ(y) = ϕ(Σ_{i=0}^n λi xi) ≤ Σ_{i=0}^n λi ϕ(xi) ≤ max_{i=0,...,n} |ϕ(xi)|.
Hence
‖ϕ‖_{L∞(B(x,δ∗/2))} ≤ max { max_{i=0,...,n} |ϕ(xi)| − 2ϕ(x), max_{i=0,...,n} |ϕ(xi)| } < ∞.
To show ϕ is Lipschitz on B(x, δ∗/4) let x1, x2 ∈ B(x, δ∗/4) (x1 ≠ x2) and take x3 to be the
point of intersection of the line through x1 and x2 with ∂B(x, δ∗/2); there are two possibilities
for x3, we choose the option where x2 lies between x1 and x3. Let λ = |x2 − x3|/|x1 − x3| ∈ (0, 1). Now,
x2 = λx1 + (1 − λ)x3 since (x2 − x1)/|x2 − x1| = (x3 − x2)/|x3 − x2|. So by convexity of ϕ,
ϕ(x) + y · (z − x) ≤ ϕ(z)
Proposition 4.7. Let ϕ : Rn → R ∪ {+∞} be proper. Then the following are equivalent:
1. ϕ is convex and lower semi-continuous;
2. ϕ = ψ∗ for some proper function ψ : Rn → R ∪ {+∞};
3. ϕ∗∗ = ϕ.
Proof. 3 clearly implies 2. We first show that 2 implies 1. Let ϕ = ψ∗ and we will show that ϕ
is convex and lower semi-continuous. For convexity let x1, x2 ∈ Rn, t ∈ [0, 1]; then
ϕ(tx1 + (1 − t)x2) = sup_{y∈Rn} ((tx1 + (1 − t)x2) · y − ψ(y))
≤ t sup_{y∈Rn} (x1 · y − ψ(y)) + (1 − t) sup_{y∈Rn} (x2 · y − ψ(y))
= tϕ(x1) + (1 − t)ϕ(x2).
For lower semi-continuity let xm → x; then
lim inf_{m→∞} ϕ(xm) = lim inf_{m→∞} sup_{y∈Rn} (xm · y − ψ(y)) ≥ lim_{m→∞} (xm · y − ψ(y)) = x · y − ψ(y)
for any y ∈ Rn. Taking the supremum over y ∈ Rn implies lim inf_{m→∞} ϕ(xm) ≥ ϕ(x) as
required.
Finally we show that 1 implies 3. Let ϕ be lower semi-continuous and convex. Fix x ∈ Rn;
we want to show that ϕ(x) = ϕ∗∗(x). Since ϕ∗(y) ≥ x · y − ϕ(x) for all y ∈ Rn then
ϕ∗∗(x) = sup_{y∈Rn} (x · y − ϕ∗(y)) ≤ ϕ(x).
The pointwise limit of an arbitrary collection of lower semi-continuous functions is lower semi-
continuous, hence ϕε is lower semi-continuous. Now let x ∈ Rd and since ϕ is proper there
exists y0 ∈ Rd such that ϕ(y0 ) < ∞, hence ϕε (x) ≤ ϕ(y0 ) + ψε (x − y0 ). Since ψ is everywhere
finite then it follows that ϕε (x) is finite hence x ∈ Dom(ϕε ). In particular int(Dom(ϕε )) = Rd .
We now show that lim inf_{ε→0} ϕε(x) ≥ ϕ(x) (we actually show that lim_{ε→0} ϕε(x) = ϕ(x)
but the former statement is all we really need). Fix x ∈ Rn and note that ϕε(x) ≤ A + Bε. Since
ϕ is convex then it is bounded below by an affine function, say ϕ(z) ≥ a · z + b for all z ∈ Rn.
Let yε be a minimising sequence, i.e. ϕε(x) ≥ ϕ(x − yε) + ψε(yε) − ε. Then
In both cases the right hand side is greater than ϕ(x), hence lim_{n→∞} ϕ_{εn}(x) ≥ ϕ(x).
On the other hand, since ϕ(x) ≥ ϕε(x) then
ϕ∗∗(x) = sup_{y∈Rn} inf_{z∈Rn} (y · (x − z) + ϕ(z)) ≥ sup_{y∈Rn} inf_{z∈Rn} (y · (x − z) + ϕε(z)) = ϕε∗∗(x).
Hence,
ϕ∗∗(x) ≥ lim inf_{ε→0} ϕε∗∗(x) = lim inf_{ε→0} ϕε(x) ≥ ϕ(x).
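The biconjugation result can be explored numerically: applying a discrete Legendre–Fenchel transform twice recovers a convex lower semi-continuous function, and returns the convex envelope (hence something ≤ the function) in the nonconvex case. A minimal sketch, where the grids and test functions are illustrative choices:

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 201)   # primal grid
y = np.linspace(-5.0, 5.0, 401)   # slope (dual) grid

def conjugate(f, xs, ys):
    # discrete Legendre-Fenchel transform: f*(y) = max_x (x*y - f(x))
    return np.max(xs[None, :] * ys[:, None] - f[None, :], axis=1)

phi = 0.5 * x ** 2                       # convex and lower semi-continuous
phi_star = conjugate(phi, x, y)
phi_ss = conjugate(phi_star, y, x)       # biconjugate, back on the x grid

# phi** = phi for convex l.s.c. phi, up to discretisation error
assert np.max(np.abs(phi_ss - phi)) < 1e-2

# for a nonconvex function the biconjugate is the convex envelope, so phi** <= phi
w = x ** 4 - x ** 2
w_ss = conjugate(conjugate(w, x, y), y, x)
assert np.all(w_ss <= w + 1e-9)
```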
Furthermore J(ϕ̃, ψ̃) = M − J(ϕ, ψ) where
M = ½ ∫_X |x|^2 dµ(x) + ½ ∫_Y |y|^2 dν(y).
And for π ∈ Π(µ, ν),
K(π) = ½ ∫_{X×Y} |x − y|^2 dπ(x, y) = M − ∫_{X×Y} x · y dπ(x, y).
Hence,
M − J(ϕ̃, ψ̃) = J(ϕ, ψ) ≤ K(π) = M − ∫_{X×Y} x · y dπ(x, y).
Or more conveniently,
J(ϕ̃, ψ̃) ≥ ∫_{X×Y} x · y dπ(x, y).
Kantorovich duality (Theorem 3.1) implies that
(4.1) min_{(ϕ̃,ψ̃)∈Φ̃} J(ϕ̃, ψ̃) = max_{π∈Π(µ,ν)} ∫_{X×Y} x · y dπ(x, y).
We notice that if π† ∈ Π(µ, ν) minimises K then it also maximises ∫_{X×Y} x · y dπ(x, y), and vice
versa.
ψ̃(y) = ½|y|^2 − ϕ^c(y)
= sup_{x∈X} (½|y|^2 − ½|x − y|^2 + ϕ(x))
= sup_{x∈X} (x · y − ϕ̃(x))
= ϕ̃∗(y).
Hence minimisers of J over Φ̃ take the form (ϕ̃∗∗, ϕ̃∗).
Let η̃ = ϕ̃∗∗; then by Proposition 4.7 η̃ is convex and lower semi-continuous. Furthermore,
again by Proposition 4.7, η̃∗ = ϕ̃∗∗∗ = ϕ̃∗. Hence there exist minimisers of J over Φ̃ of the
form (η̃, η̃∗) where η̃ is a proper, convex and lower semi-continuous function.
Proof of Theorem 4.1. Let π† ∈ Π(µ, ν) minimise K over Π(µ, ν) and ϕ̃ be the proper lower
semi-continuous function such that the pair (ϕ̃, ϕ̃∗) minimises J over Φ̃. By Kantorovich duality
(in particular (4.1)) we have
∫_X ϕ̃(x) dµ(x) + ∫_Y ϕ̃∗(y) dν(y) = ∫_{X×Y} x · y dπ†(x, y).
Equivalently,
∫_{X×Y} (ϕ̃(x) + ϕ̃∗(y) − x · y) dπ†(x, y) = 0.
By definition of the convex conjugate ϕ̃(x) + ϕ̃∗(y) ≥ x · y and therefore the integrand is
non-negative. We must have ϕ̃(x) + ϕ̃∗(y) = x · y for π†-almost every (x, y) and therefore by
Proposition 4.5 y ∈ ∂ϕ̃(x) for π†-almost every (x, y).
Conversely, suppose y ∈ ∂ϕ̃(x) for π†-almost every (x, y) where ϕ̃ is an L1(µ), proper, lower
semi-continuous and convex function. Then by Proposition 4.5,
∫_{X×Y} (ϕ̃(x) + ϕ̃∗(y) − x · y) dπ†(x, y) = 0.
Notice that by definition of the Legendre–Fenchel transform we have that ϕ̃(x) + ϕ̃∗(y) ≥ x · y.
We will show integrability of ϕ̃∗ shortly; for now it is assumed, so that (ϕ̃, ϕ̃∗) ∈ Φ̃. Hence,
min_Φ̃ J ≤ J(ϕ̃, ϕ̃∗) = ∫_{X×Y} x · y dπ†(x, y) ≤ max_{π∈Π(µ,ν)} ∫_{X×Y} x · y dπ(x, y).
By duality (i.e. (4.1)) it follows that (ϕ̃, ϕ̃∗) ∈ Φ̃ achieves the minimum of J and π† achieves
the maximum of ∫_{X×Y} x · y dπ(x, y) in Π(µ, ν). Hence π† is an optimal plan in the Kantorovich
sense.
The last detail we have to show is ϕ̃∗ ∈ L1(ν). Since ϕ̃ is proper then ϕ̃∗ can be bounded
below by an affine function; in particular there exists x0 ∈ X such that ϕ̃∗(y) ≥ x0 · y − ϕ̃(x0) ≥
x0 · y − b0 =: f(y). So the integral
‖ϕ̃∗ − f‖_{L1(ν)} = ∫_Y (ϕ̃∗(y) − f(y)) dν(y) ≤ J(ϕ̃, ϕ̃∗) + ‖ϕ̃‖_{L1(µ)} + ½|x0|^2 + ½ ∫_Y |y|^2 dν(y) + b0
is finite. Hence ϕ̃∗ − f ∈ L1(ν), and since f ∈ L1(ν) then ϕ̃∗ ∈ L1(ν) as required.
Proof of Theorem 4.2. Let π† be a minimiser of Kantorovich's optimal transport problem. If we
write (by disintegration of measures)
π†(A × B) = ∫_A π†(B|x) dµ(x)
for some family {π†(·|x)}_{x∈X} ⊂ P(Y), then supp(π†(·|x)) ⊆ ∂ϕ(x) for µ-a.e. x ∈ X by
Theorem 4.1. By Proposition 4.6, ∂ϕ(x) = {∇ϕ(x)} for L-a.e. x ∈ X (and therefore µ-a.e.
x ∈ X). Hence supp(π†(·|x)) ⊂ {∇ϕ(x)} for µ-a.e. x ∈ X. This implies π†(·|x) = δ_{∇ϕ(x)} for
µ-a.e. x ∈ X. We have shown that there exists an optimal π† that can be written as
π† = (Id × ∇ϕ)#µ
and
ν(B) = π†(Rn × B)
= (Id × ∇ϕ)#µ(Rn × B)
= µ((Id × ∇ϕ)^{−1}(Rn × B))
where the second line follows as y ∈ ∂ϕ(x) for π-a.e. (x, y) and by Proposition 4.5. Also,
∫_{X×Y} (ϕ̄(x) + ϕ̄∗(y)) dπ†(x, y) = ∫_X (ϕ̄(x) + ϕ̄∗(∇ϕ(x))) dµ(x).
Hence
∫_X (ϕ̄(x) + ϕ̄∗(∇ϕ(x)) − x · ∇ϕ(x)) dµ(x) = 0.
In particular, ϕ̄(x) + ϕ̄∗(∇ϕ(x)) − x · ∇ϕ(x) = 0 for µ-almost every x. By Proposition 4.5 this
implies that ∇ϕ(x) ∈ ∂ϕ̄(x) for µ-almost every x and therefore ∇ϕ(x) = ∇ϕ̄(x) for µ-almost
every x.
Chapter 5
Wasserstein Distances
Eulerian based costs, such as Lp, define a metric based on "pointwise differences". This has
some notable disadvantages. For example, consider in 1D two indicator functions f(x) = χ_{[0,1]}(x)
and fδ(x) = χ_{[δ,δ+1]}(x). Notice that in Lp,
‖f − fδ‖_{Lp}^p = 2|δ| if |δ| < 1, and ‖f − fδ‖_{Lp}^p = 2 else.
In particular, we notice that once |δ| ≥ 1 the Lp distance is constant. In more general examples,
where f and fδ are not necessarily indicator functions, the Lp distance will be the sum of the Lp
norms whenever the supports of f and fδ are disjoint.
Why do we care? Say we are trying to fit a parametrised curve fδ to f. Then say we start
from a bad initialisation where the support of fδ is disjoint from the support of f. In this regime
the derivative (d/dδ)‖fδ − f‖_{Lp} = 0, which is a problem for gradient based optimisation.
On the other hand, we would hope that a transport based distance would do a better job.
In particular, in the elementary example f(x) = χ_{[0,1]}(x) and fδ(x) = χ_{[δ,δ+1]}(x) the OT cost
would be
min_{T#f=fδ} ∫_0^1 |x − T(x)|^p dx = |δ|^p
where the cost is c(x, y) = |x − y|^p and with an abuse of notation we associate f and fδ with the
measures with densities f and fδ respectively. Note that the OT cost now strictly increases as a
function of |δ|.
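The saturation of the Lp distance, in contrast with the growing OT cost, is simple to check numerically. A sketch (the grid is an arbitrary discretisation of the line):

```python
import numpy as np

def lp_dist_p(delta, p=2, n=20001):
    # ||chi_[0,1] - chi_[delta,delta+1]||_{L^p}^p on a fine grid over [-1, 4]
    x = np.linspace(-1.0, 4.0, n)
    f = ((x >= 0) & (x <= 1)).astype(float)
    g = ((x >= delta) & (x <= delta + 1)).astype(float)
    return np.trapz(np.abs(f - g) ** p, x)

# L^p saturates at 2 once the supports are disjoint (|delta| >= 1) ...
assert abs(lp_dist_p(1.5) - 2.0) < 0.01
assert abs(lp_dist_p(2.5) - 2.0) < 0.01
# ... while for small shifts it equals 2|delta| (indicators, any p)
assert abs(lp_dist_p(0.25) - 0.5) < 0.01

# the OT cost with c(x, y) = |x - y|^p, by contrast, is |delta|^p for
# every delta, since T(x) = x + delta pushes f forward to f_delta
```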
The objective of this section is to understand how optimal transport can be used to define
a metric, and some of the metric properties. In particular, we will define the Wasserstein distance
(also sometimes known as the earth mover's distance) in the next section. In Section 5.2 we look
at the topology of Wasserstein spaces and show that the Wasserstein distance metrizes weak*
convergence. Finally we will look at geodesics and the relation to fluid dynamics.
Throughout this chapter we will assume that c(x, y) = |x − y|^p for p ∈ [1, +∞) and X, Y
are subsets of Rd.
Before proceeding to the Wasserstein distance, let us note one other important example that
can be posed as an optimal transport problem. Let c(x, y) = I_{x≠y}, i.e. c(x, y) = 0 if x = y and
c(x, y) = 1 otherwise. Then the optimal transport problem coincides with the total variation
distance between measures.
Proposition 5.1. Let µ, ν ∈ P(X) where X is a Polish space and c(x, y) = I_{x≠y}; then
inf_{π∈Π(µ,ν)} K(π) = ½‖µ − ν‖_{TV}
where
‖µ‖_{TV} := 2 sup_A |µ(A)|.
Proof. We prove the proposition in 4 steps, in the first two steps we only assume that c is a
metric, in particular we use that c is lower semi-continuous, symmetric and satisfies the triangle
inequality. In the third step we use the specific form of c, but this can also be avoided, see
Remark 5.2.
1. Let ϕ ∈ L1(µ); we claim that |ϕ^c(x) − ϕ^c(y)| ≤ c(x, y) for almost every (x, y) ∈ X × X.
Indeed,
ϕ^c(x) − ϕ^c(y) = sup_{z∈X} (ϕ^c(x) − c(y, z) + ϕ(z)) ≤ sup_{z∈X} (c(x, z) − c(y, z)) ≤ c(x, y)
by the triangle inequality.
By switching x and y we have that |ϕ^c(x) − ϕ^c(y)| ≤ c(x, y).
2. We claim ϕ^cc = −ϕ^c. By part 1 we have c(x, y) − ϕ^c(y) ≥ −ϕ^c(x) and therefore
ϕ^cc(x) = inf_{y∈X} (c(x, y) − ϕ^c(y)) ≥ −ϕ^c(x).
On the other hand, choosing y = x gives ϕ^cc(x) ≤ c(x, x) − ϕ^c(x) = −ϕ^c(x).
3. Since |η^c(x) − η^c(y)| ≤ 1 we can, without loss of generality, assume that η^c(x) ∈ [0, 1] in
the following. By Theorem 3.1 and Theorem 3.7,
min_{π∈Π(µ,ν)} K(π) = sup_{η∈L1(µ)} J(η^cc, η^c) = sup_{η∈L1(µ)} J(−η^c, η^c) ≤ sup_{0≤f≤1} J(−f, f) ≤ min_{π∈Π(µ,ν)} K(π),
where the last inequality follows as (−f, f) ∈ Φc; to see this we only need to show
−f(x) + f(y) ≤ c(x, y). Clearly −f(x) + f(y) ≤ 1 = c(x, y) for x ≠ y and −f(x) + f(y) =
0 = c(x, y) for x = y. It follows that
min_{π∈Π(µ,ν)} K(π) = sup_{0≤f≤1} J(−f, f).
4. Now let ν − µ = (ν − µ)^+ − (ν − µ)^− be the Jordan decomposition of ν − µ, where
(ν − µ)^± ∈ M^+(X) are mutually singular. It follows that
‖µ − ν‖_{TV} = 2(ν − µ)^+(X).
And,
sup_{0≤f≤1} J(−f, f) = sup_{0≤f≤1} ∫_X f d(ν − µ) = (ν − µ)^+(X).
This is actually a special case of the Kantorovich–Rubinstein Theorem (see [15, Theorem 1.14])
which states that, when c is a metric,
min_{π∈Π(µ,ν)} K(π) = sup { ∫_X ϕ d(µ − ν) : ϕ ∈ L1(|µ − ν|), ‖ϕ‖_{Lip} ≤ 1 }
where
‖ϕ‖_{Lip} = sup_{x≠y} |ϕ(x) − ϕ(y)| / c(x, y).
Of course, if X is bounded then Pp (X) = P(X). We now define the Wasserstein distance, it
will be the objective of this section to prove that the Wasserstein distance is a metric.
Definition 5.3. Let µ, ν ∈ Pp(X); then the Wasserstein distance is defined as
dWp(µ, ν) = ( min_{π∈Π(µ,ν)} ∫_{X×X} |x − y|^p dπ(x, y) )^{1/p}.
The Wasserstein distance is the pth root of the minimum of the Kantorovich optimal transport
problem for cost function c(x, y) = |x − y|^p. The motivation is that this cost resembles an Lp
distance (in fact we use properties of Lp distances to prove the triangle inequality). One could
also consider an analogous distance for cost function c(x, y) = d(x, y) where d is a metric on
X. This type of distance is known as the earth mover's distance. Notice that when p = 1 the
Wasserstein distance is also an earth mover's distance. We will not focus on earth mover's
distances here.
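In one dimension the Wasserstein distance between empirical measures with equally many, equally weighted atoms reduces to matching sorted samples (the monotone coupling is optimal), which gives a quick way to experiment with the definition. A minimal sketch; the sample size and the translation test are illustrative choices:

```python
import numpy as np

def wasserstein_p_1d(xs, ys, p=2):
    """d_{W_p} between the empirical measures (1/n) sum_i delta_{x_i} and
    (1/n) sum_i delta_{y_i} on R, via the monotone (sorted) coupling."""
    xs, ys = np.sort(xs), np.sort(ys)
    return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)

# translating a measure by delta gives d_{W_p} = |delta|
rng = np.random.default_rng(1)
xs = rng.normal(size=1000)
assert np.isclose(wasserstein_p_1d(xs, xs + 0.7), 0.7)
```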
Let us note here that µ, ν ∈ Pp(X) is enough to guarantee dWp(µ, ν) < +∞. In particular,
dWp^p(µ, ν) ≤ 2^{p−1} inf_{π∈Π(µ,ν)} ∫_{X×X} (|x|^p + |y|^p) dπ(x, y) = 2^{p−1} ( ∫_X |x|^p dµ(x) + ∫_X |y|^p dν(y) ).
We now state the result that dW p is a metric. The proof, minus the triangle inequality, is given
below.
Proof. We give the proof of all the required criteria with the exception of the triangle inequality
which will require some preliminary results. Firstly, it is clear that dWp(µ, ν) ≥ 0 for all
µ, ν ∈ P(X), and by symmetry of the cost function c(x, y) = |x − y|^p and π ∈ Π(µ, ν) ⇔
S#π ∈ Π(ν, µ) where S(x, y) = (y, x) we have symmetry of dWp. Now if µ = ν then we can
take dπ(x, y) = dδx(y) dµ(x) so that
dWp^p(µ, ν) ≤ ∫_{X×X} |x − y|^p dπ(x, y) = 0
as x = y π-almost everywhere. Now if dWp(µ, ν) = 0 then there exists π ∈ Π(µ, ν) such that
x = y π-almost everywhere. Hence for any test function f : X → R,
∫_X f(x) dµ(x) = ∫_{X×X} f(x) dπ(x, y) = ∫_{X×X} f(y) dπ(x, y) = ∫_X f(y) dν(y),
and therefore µ = ν.
Lemma 5.5. Given measures µ ∈ P(X), ν ∈ P(Y), ω ∈ P(Z) and transport plans π1 ∈
Π(µ, ν) and π2 ∈ Π(ν, ω) there exists a measure γ ∈ P(X × Y × Z) such that P^{X,Y}_#γ = π1
and P^{Y,Z}_#γ = π2 where P^{X,Y}(x, y, z) = (x, y) and P^{Y,Z}(x, y, z) = (y, z) are the projections
onto the two first and two last variables respectively.
Proof. By disintegration of measures we may write
π1(A × B) = ∫_B π1(A|y) dν(y)
for some family of probability measures π1(·|y) ∈ P(X), and similarly for π2,
π2(B × C) = ∫_B π2(C|y) dν(y).
Define γ ∈ M(X × Y × Z) by
γ(A × B × C) = ∫_B π1(A|y) π2(C|y) dν(y).
Then,
γ(A × B × Z) = ∫_B π1(A|y) π2(Z|y) dν(y) = ∫_B π1(A|y) dν(y) = π1(A × B),
and similarly γ(X × B × C) = π2(B × C).
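The gluing construction can be written down explicitly for discrete measures: with plans stored as matrices, γ_{ijk} = π1_{ij} π2_{jk} / ν_j whenever ν_j > 0. A minimal sketch, using independent couplings only because their marginals are easy to verify:

```python
import numpy as np

mu = np.full(3, 1 / 3)
nu = np.array([0.2, 0.3, 0.5])
om = np.full(3, 1 / 3)

# plans pi1 in Pi(mu, nu) and pi2 in Pi(nu, om); the independent
# (product) couplings have the required marginals
pi1 = np.outer(mu, nu)
pi2 = np.outer(nu, om)

# glue along the common marginal nu: gamma_{ijk} = pi1_{ij} pi2_{jk} / nu_j
gamma = pi1[:, :, None] * pi2[None, :, :] / nu[None, :, None]

assert np.allclose(gamma.sum(axis=2), pi1)  # (X, Y)-marginal is pi1
assert np.allclose(gamma.sum(axis=0), pi2)  # (Y, Z)-marginal is pi2
```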
Let γ ∈ P(X × X × X) be such that P^{X,Y}_#γ = πXY and P^{Y,Z}_#γ = πYZ (such γ exists by
Lemma 5.5). Let πXZ = P^{X,Z}_#γ. Then,
πXZ(A × X) = γ({(x, y, z) : x ∈ A})
= γ(A × X × X)
= πXY(A × X)
= µ(A).
Similarly πXZ(X × B) = ω(B). So, πXZ ∈ Π(µ, ω).
Now,
dWp(µ, ω) ≤ ( ∫_{X×X} |x − z|^p dπXZ(x, z) )^{1/p}
= ( ∫_{X×X×X} |x − z|^p dγ(x, y, z) )^{1/p}
≤ ( ∫_{X×X×X} |x − y|^p dγ(x, y, z) )^{1/p} + ( ∫_{X×X×X} |y − z|^p dγ(x, y, z) )^{1/p}
= ( ∫_{X×X} |x − y|^p dπXY(x, y) )^{1/p} + ( ∫_{X×X} |y − z|^p dπYZ(y, z) )^{1/p}
= dWp(µ, ν) + dWp(ν, ω),
where the third line is Minkowski's inequality in Lp(γ).
This proves the triangle inequality.
One can also prove the triangle inequality using transport maps and an approximation
argument. Slightly more precisely, if µ, ν and ω all have densities with respect to the Lebesgue
measure then we know there exist transport maps T and S with T#µ = ν and S#ν = ω. The map
S ∘ T then pushes µ onto ω. One can argue, similarly to our proof, that dWp(µ, ν) + dWp(ν, ω) ≥ dWp(µ, ω).
To extend the argument to arbitrary probability measures µ, ν and ω one uses mollifiers to
define µ̃ε = µ ∗ Jε, analogously for ν̃ε, ω̃ε, where Jε = ε^{−d} J(·/ε) and J is a standard mollifier.
The measures µ̃ε, ν̃ε, ω̃ε have densities with respect to the Lebesgue measure and one can show
dWp(µ̃ε, ν̃ε) → dWp(µ, ν) as ε → 0. We refer to [12, Lemma 5.2 and Lemma 5.3] for full details.
Our final result of the section gives sufficient conditions for equivalence of Wasserstein
distances.
Proposition 5.6. For every p ∈ [1, +∞) and any µ, ν ∈ Pp(X) we have dWp(µ, ν) ≥ dW1(µ, ν).
Furthermore, if X ⊂ Rn is bounded then dWp^p(µ, ν) ≤ diam(X)^{p−1} dW1(µ, ν).
Proof. By Jensen's inequality, for π ∈ Π(µ, ν), we have
( ∫_{X×X} |x − y|^p dπ(x, y) )^{1/p} ≥ ∫_{X×X} |x − y| dπ(x, y).
Hence,
∫_{X×X} |x − y|^p dπ(x, y) ≤ (diam(X))^{p−1} ∫_{X×X} |x − y| dπ(x, y).
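Both inequalities of the proposition can be verified numerically for 1D empirical measures, where the sorted coupling is optimal for every p. A sketch with X = [0, 1], so that diam(X) = 1 (the samples are arbitrary):

```python
import numpy as np

def w_p(xs, ys, p):
    # 1D empirical Wasserstein distance via the monotone (sorted) coupling
    return np.mean(np.abs(np.sort(xs) - np.sort(ys)) ** p) ** (1.0 / p)

rng = np.random.default_rng(3)
xs, ys = rng.uniform(0, 1, 500), rng.uniform(0, 1, 500)  # supported in [0, 1]

w1 = w_p(xs, ys, 1)
for p in [1.5, 2.0, 3.0]:
    wp = w_p(xs, ys, p)
    assert wp >= w1 - 1e-12      # d_{W_p} >= d_{W_1}
    assert wp ** p <= w1 + 1e-12  # d_{W_p}^p <= diam(X)^{p-1} d_{W_1} with diam = 1
```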
Let ϕ be a Lipschitz function with Lip(ϕ) > 0; then ϕ̃ = ϕ/Lip(ϕ) is a 1-Lipschitz function and
therefore
(1/Lip(ϕ)) ∫_X ϕ d(µm − µ) = ∫_X ϕ̃ d(µm − µ) ≤ dW1(µm, µ) → 0.
By substituting ϕ ↦ −ϕ we have that
∫_X ϕ dµm → ∫_X ϕ dµ
for any Lipschitz function ϕ. By the Portmanteau theorem µm ⇀* µ.
For the converse statement we assume that µm ⇀* µ and let mk be a subsequence such that
lim sup_{m→∞} dW1(µm, µ) = lim_{k→∞} dW1(µmk, µ), and let ϕ̃mk be 1-Lipschitz functions with
dW1(µmk, µ) ≤ ∫_X ϕ̃mk d(µmk − µ) + 1/k. Note that ∫_X ϕ dν is unchanged by adding a constant
to ϕ if ν(X) = 0. Hence if we let ϕmk(x) = ϕ̃mk(x) − ϕ̃mk(x0) then
dW1(µmk, µ) ≤ ∫_X ϕmk d(µmk − µ) + 1/k,
ϕmk are 1-Lipschitz (in particular equicontinuous) and bounded. By the Arzelà–Ascoli theorem
there exists a further subsequence (relabelled) such that ϕmk → ϕ uniformly. In particular, ϕ is
1-Lipschitz. Hence,
lim sup_{m→∞} dW1(µm, µ) ≤ lim sup_{k→∞} ( ∫_X ϕmk d(µmk − µ) + 1/k )
≤ lim sup_{k→∞} ( ∫_X (ϕmk − ϕ) d(µmk − µ) + ∫_X ϕ d(µmk − µ) )
≤ lim sup_{k→∞} 2‖ϕmk − ϕ‖_{L∞} + lim sup_{k→∞} ∫_X ϕ d(µmk − µ)
= 0.
Hence, dW1(µm, µ) → 0 as m → ∞.
We now generalise to unbounded domains.
R
Theorem 5.8. Let µm , µ ∈ Pp (Rn ). Then dW p (µm , µ) → 0 if and only if Rn
|x|p dµm →
R *
Rn
|x|p dµ and µm * µ.
Proof. Let dWp(µm, µ) → 0. Then by Proposition 5.6 we have dW1(µm, µ) → 0. Analogously
to the proof of Theorem 5.7 we have ∫_X ϕ d(µm − µ) → 0 for all Lipschitz functions ϕ. Hence,
by the Portmanteau theorem µm ⇀* µ.
To show ∫_{Rn} |x|^p dµm → ∫_{Rn} |x|^p dµ we note that
∫_{Rn} |x|^p dµm = dWp^p(µm, δ0) and ∫_{Rn} |x|^p dµ = dWp^p(µ, δ0).
Now,
dWp(µm, δ0) ≤ dWp(µm, µ) + dWp(µ, δ0) → dWp(µ, δ0)
and
dWp(µm, δ0) ≥ dWp(µ, δ0) − dWp(µm, µ) → dWp(µ, δ0).
Hence ∫_{Rn} |x|^p dµm → ∫_{Rn} |x|^p dµ.
* R R
For the converse statement let µm ⇀* µ and ∫_{Rn} |x|^p dµm → ∫_{Rn} |x|^p dµ. For any R > 0 let
φR(x) = (|x| ∧ R)^p = (min{|x|, R})^p which is continuous and bounded. We have
(5.1) ∫_{Rn} (|x|^p − φR(x)) dµm → ∫_{Rn} (|x|^p − φR(x)) dµ.
Let PR : Rn → B(0, R) be the projection onto the ball B(0, R), i.e.
PR(x) = x if x ∈ B(0, R), and PR(x) = argmin_{y∈∂B(0,R)} |y − x| else.
The map PR is continuous and equal to the identity on B(0, R). For all x ∉ B(0, R) we have
|x − PR(x)| = |x| − R. Hence,
dWp(µ, (PR)#µ) ≤ ( ∫_{Rn} |x − PR(x)|^p dµ(x) )^{1/p}
= ( ∫_{|x|>R} |x − PR(x)|^p dµ(x) )^{1/p}
= ( ∫_{|x|>R} (|x| − R)^p dµ(x) )^{1/p}
≤ ε^{1/p},
and similarly,
dWp(µm, (PR)#µm) ≤ ε^{1/p}.
For any ϕ ∈ Cb0(Rn) we have
∫ ϕ d(PR)#µm = ∫ ϕ(PR(x)) dµm → ∫ ϕ(PR(x)) dµ = ∫ ϕ d(PR)#µ
since ϕ ∘ PR is continuous and bounded. Hence, (PR)#µm ⇀* (PR)#µ.
Now, (PR)#µm, (PR)#µ have support in B(0, R) (a compact set) so by Theorem 5.7 we
have dWp((PR)#µm, (PR)#µ) → 0. Hence,
lim sup_{m→∞} dWp(µm, µ) ≤ lim sup_{m→∞} ( dWp(µm, (PR)#µm) + dWp((PR)#µm, (PR)#µ) + dWp((PR)#µ, µ) )
≤ 2ε^{1/p}.
Since ε > 0 was arbitrary it follows that dWp(µm, µ) → 0.
Definition 5.9. Let (Z, d) be a metric space and ω : [0, 1] → Z a curve in Z. We say ω is
absolutely continuous if there exists g ∈ L1([0, 1]) such that d(ω(t0), ω(t1)) ≤ ∫_{t0}^{t1} g(s) ds for
any 0 ≤ t0 < t1 ≤ 1. We denote the set of absolutely continuous curves on Z by AC(Z).
Definition 5.10. Let (Z, d) be a metric space and ω : [0, 1] → Z a curve in Z. We define the
length of ω by
Len(ω) := sup { Σ_{k=0}^{n−1} d(ω(tk), ω(tk+1)) : n ≥ 1, 0 = t0 < t1 < · · · < tn = 1 }.
Note that if ω : [0, 1] → Z is a constant speed geodesic then it is a geodesic. Indeed, assume
that ω : [0, 1] → Z and ω̃ : [0, 1] → Z satisfy ω(0) = z0 = ω̃(0), ω(1) = z1 = ω̃(1),
d(ω(s), ω(t)) = |t − s| d(z0, z1) for all s, t ∈ [0, 1], and Len(ω̃) < Len(ω).
I.e. we assume that ω is a constant speed geodesic but not a geodesic. Then there exists n ∈ N
and 0 = t0 < t1 < · · · < tn = 1 such that
Len(ω̃) < Σ_{k=0}^{n−1} d(ω(tk), ω(tk+1)) = d(z0, z1) Σ_{k=0}^{n−1} (tk+1 − tk) = d(z0, z1).
This implies Len(ω̃) < d(z0, z1). Clearly this is a contradiction (choosing n = 1 in the definition
of Len(ω̃) implies Len(ω̃) ≥ d(z0, z1)).
Note also that if d(ω(t), ω(s)) = |t − s|d(z0 , z1 ) then ω ∈ AC(Z) with g(s) = d(z0 , z1 ).
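The constant speed property can be checked numerically for displacement interpolation between 1D empirical measures, where the monotone coupling is optimal. A sketch (the sample sizes and distributions are illustrative choices):

```python
import numpy as np

def w2(xs, ys):
    # 1D empirical 2-Wasserstein distance via sorted samples
    return np.sqrt(np.mean((np.sort(xs) - np.sort(ys)) ** 2))

rng = np.random.default_rng(4)
xs = np.sort(rng.normal(size=400))             # atoms of mu
ys = np.sort(rng.normal(loc=3.0, size=400))    # atoms of nu

def interp(t):
    # displacement interpolation mu_t = ((1 - t) P^X + t P^Y)_# pi for the
    # monotone coupling pi of the sorted samples
    return (1 - t) * xs + t * ys

d = w2(xs, ys)
for s, t in [(0.0, 0.5), (0.25, 0.75), (0.3, 1.0)]:
    # constant speed: d_{W_2}(mu_s, mu_t) = |t - s| d_{W_2}(mu, nu)
    assert np.isclose(w2(interp(s), interp(t)), abs(t - s) * d)
```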
Definition 5.11. Let (Z, d) be a metric space. We say (Z, d) is a length space if
d(z0, z1) = inf {Len(ω) : ω ∈ AC(Z), ω(0) = z0, ω(1) = z1} for all z0, z1 ∈ Z,
and a geodesic space if the infimum is always attained.
We now show that the Wasserstein space (Pp(X), dWp) is a geodesic space.
Proof. Note that P0 = P^X and P1 = P^Y. Therefore, µ0 = (P0)#π = µ, µ1 = (P1)#π = ν,
so µt connects µ and ν. To show dWp(µs, µt) = |t − s| dWp(µ, ν) it is enough to prove that
dWp(µs, µt) ≤ |t − s| dWp(µ, ν). Indeed, assuming this is true, if dWp(µs, µt) < |t − s| dWp(µ, ν)
for some 0 ≤ s < t ≤ 1 then
dWp(µ, ν) ≤ dWp(µ, µs) + dWp(µs, µt) + dWp(µt, ν) < (s + (t − s) + (1 − t)) dWp(µ, ν) = dWp(µ, ν),
a contradiction.
To show dW p (µs , µt ) ≤ |t − s|dW p (µ, ν) let πs,t = (Ps , Pt )# π. Then for any (measurable)
A ⊆ X,
as required.
If π = (Id×T )# µ where T is as in the statement of the theorem, then for A ⊂ X (measurable)
we have
Bibliography
[1] L. Ambrosio. Mathematical Aspects of Evolving Interfaces: Lectures given at the C.I.M.-
C.I.M.E. joint Euro-Summer School held in Madeira, Funchal, Portugal, July 3-9, 2000,
chapter Lecture Notes on Optimal Transport Problems, pages 1–52. Springer Berlin
Heidelberg, 2003.
[2] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows in metric spaces and in the space
of probability measures. Lectures in Mathematics ETH Zürich. Birkhäuser Verlag, Basel,
second edition, 2008.
[3] L. Ambrosio and A. Pratelli. Optimal Transportation and Applications: Lectures given at
the C.I.M.E. Summer School, held in Martina Franca, Italy, September 2-8, 2001, chapter
Existence and stability results in the L1 theory of optimal transportation, pages 123–160.
Springer Berlin Heidelberg, 2003.
[6] L. C. Evans and W. Gangbo. Differential equations methods for the Monge-Kantorovich
mass transfer problem, volume 653. American Mathematical Society, 1999.
[7] W. Gangbo and R. J. McCann. The geometry of optimal transportation. Acta Mathematica,
177(2):113–161, 1996.
[9] S. Kolouri, S. R. Park, M. Thorpe, D. Slepčev, and G. K. Rohde. Optimal mass transport:
Signal processing and machine-learning applications. IEEE Signal Processing Magazine,
34(4):43–59, 2017.
[10] S. Kolouri and G. K. Rohde. Optimal transport a crash course. IEEE ICIP 2016 Tutorial
Slides: Part 1, 2016.
[11] G. Monge. Mémoire sur la théorie des déblais et des remblais. De l’Imprimerie Royale,
1781.
[12] F. Santambrogio. Optimal transport for applied mathematicians. Birkhäuser/Springer, Basel,
2015.
[16] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business
Media, 2008.