
Introduction to Optimal Transport

Matthew Thorpe

F2.08, Centre for Mathematical Sciences


University of Cambridge
Email: [email protected]

Lent 2018
Current Version: Thursday 8th March, 2018
Foreword

These notes have been written to supplement my lectures given at the University of Cambridge
in the Lent term 2017-2018. The purpose of the lectures is to provide an introduction to
optimal transport. Optimal transport dates back to Gaspard Monge in 1781 [11], with significant
advancements by Leonid Kantorovich in 1942 [8] and Yann Brenier in 1987 [4]. The latter
in particular led to connections with partial differential equations, fluid mechanics, geometry,
probability theory and functional analysis. Currently optimal transport enjoys applications in
image retrieval, signal and image representation, inverse problems, cancer detection, texture
and colour modelling, shape and image registration, and machine learning, to name a few. The
purpose of this course is to introduce the basic theory that surrounds optimal transport, in the
hope that it may find uses in people’s own research, rather than focus on any specific application.
I can recommend the following references. My lectures and notes are based on Topics in
Optimal Transportation [15]. Two other accessible introductions are Optimal Transport: Old and
New [16] (also freely available online) and Optimal Transport for Applied Mathematicians [12]
(also available for free online). For a more technical treatment of optimal transport I refer to
Gradient Flows in Metric Spaces and in the Space of Probability Measures [2]. For a short
review of applications in optimal transport see the article Optimal Mass Transport for Signal
Processing and Machine Learning [9].
Please let me know of any mistakes in the text. I will also be updating the notes as the course
progresses.

Some Notation:

1. C_b^0(Z) is the space of all continuous and bounded functions on Z.

2. A sequence of probability measures π_n ∈ P(Z) converges weak* to π, and we write
   π_n ⇀* π, if for any f ∈ C_b^0(Z) we have ∫_Z f dπ_n → ∫_Z f dπ.

3. A Polish space is a separable completely metrizable topological space (i.e. a complete


metric space with a countable dense subset).

4. P(Z) is the set of probability measures on Z, i.e. the subset of M+ (Z) with unit mass.

5. M+ (Z) is the set of positive Radon measures on Z.

6. P^X : X × Y → X is the projection onto X, i.e. P^X(x, y) = x; similarly P^Y : X × Y → Y
   is given by P^Y(x, y) = y.

7. A function Θ : E → R ∪ {+∞} is convex if for all (z1 , z2 , t) ∈ E × E × [0, 1] we have
Θ(tz1 +(1−t)z2 ) ≤ tΘ(z1 )+(1−t)Θ(z2 ). A convex function Θ is proper if Θ(z) > −∞
for all z ∈ E and there exists z † ∈ E such that Θ(z † ) < +∞.

8. If E is a normed vector space then E ∗ is its dual space, i.e. the space of all bounded and
linear functions on E.

9. For a set A in a topological space Z the interior of A, which we denote by int(A), is the
set of points a ∈ A such that there exists an open set O with the property a ∈ O ⊆ A.

10. All vector spaces are assumed to be over R.

11. The closure of a set A in a topological space Z, which we denote by A, is the set of all
points a ∈ Z such that for any open set O with a ∈ O we have O ∩ A ≠ ∅.

12. The graph of a function ϕ : X → R, which we denote by Gra(ϕ), is the set {(x, y) : x ∈
X, y = ϕ(x)}.
13. The k-th moment of µ ∈ P(X) is defined as ∫_X |x|^k dµ(x).

14. The support of a probability measure µ ∈ P(X) is the smallest closed set A such that
µ(A) = 1.

15. L is the Lebesgue measure on Rd (the dimension d should be clear by context).

16. We write µ⌞A for the measure µ restricted to A, i.e. (µ⌞A)(B) = µ(A ∩ B) for all measurable B.

17. Given a probability measure µ we say a property holds µ-almost surely if it holds on a
set of probability one. If µ is the Lebesgue measure we will just say that it holds almost
surely.

Contents

1 Formulation of Optimal Transport
  1.1 The Monge Formulation
  1.2 The Kantorovich Formulation
  1.3 Existence of Transport Plans

2 Special Cases
  2.1 Optimal Transport in One Dimension
  2.2 Existence of Transport Maps for Discrete Measures

3 Kantorovich Duality
  3.1 Kantorovich Duality
  3.2 Fenchel-Rockafellar Duality
  3.3 Proof of Kantorovich Duality
  3.4 Existence of Maximisers to the Dual Problem

4 Existence and Characterisation of Transport Maps
  4.1 Knott-Smith Optimality and Brenier's Theorem
  4.2 Preliminary Results from Convex Analysis
  4.3 Proof of the Knott-Smith Optimality Criterion
  4.4 Proof of Brenier's Theorem

5 Wasserstein Distances
  5.1 Wasserstein Distances
  5.2 The Wasserstein Topology
  5.3 Geodesics in the Wasserstein Space
Chapter 1

Formulation of Optimal Transport

There are two ways to formulate the optimal transport problem: the Monge and Kantorovich
formulations. We explain both of these formulations in this chapter. Historically the Monge
formulation comes before Kantorovich, which is why we introduce Monge first. The Kantorovich
formulation can be seen as a generalisation of Monge. Both formulations have their advantages
and disadvantages. My experience is that Monge is more useful in applications, whilst Kantorovich
is more useful theoretically. In a later chapter (see Chapter 4) we will show sufficient
conditions for the two problems to be considered equivalent. After introducing both formulations
we give an existence result for the Kantorovich problem; existence results for Monge are
considerably more difficult. We look at special cases of the Monge and Kantorovich problems in
the next chapter; a more general treatment is given in Chapters 3 and 4.

1.1 The Monge Formulation


Optimal transport gives a Lagrangian framework for comparing measures µ and ν: essentially
one pays a cost for transporting one measure to another. To illustrate this, consider
the first measure µ as a pile of sand and the second measure ν as a hole we wish to fill up.
We assume that both measures are probability measures on spaces X and Y respectively. Let
c : X × Y → [0, +∞] be a cost function, where c(x, y) measures the cost of transporting one
unit of mass from x ∈ X to y ∈ Y. The optimal transport problem is how to transport µ to ν
whilst minimising the cost c.¹ First, we should be precise about what is meant by transporting
one measure to another.

Definition 1.1. We say that T : X → Y transports µ ∈ P(X) to ν ∈ P(Y), and we call T a
transport map, if

(1.1) ν(B) = µ(T^{-1}(B)) for all ν-measurable sets B.


¹ Some time ago I either read or was told that the original motivation for Monge was how to design defences for
Napoleon. In this case the pile of sand is a wall and the hole a moat. Obviously one wishes to make the wall
using the dirt dug out to form the moat. In this context the optimal transport problem is how to transport the dirt
from the moat to the wall.
To visualise the transport map see Figure 1.1. For greater generality we work with the
inverse of T rather than T itself; the inverse is treated in the general set-valued sense, i.e.
x ∈ T^{-1}(y) if T(x) = y. If the function T is injective then we can equivalently say that
ν(T(A)) = µ(A) for all µ-measurable A. What Figure 1.1 shows is that for any ν-measurable
B, and A = {x ∈ X : T(x) ∈ B}, we have µ(A) = ν(B). This is what we mean by T transporting
µ to ν. As shorthand we write ν = T_#µ if (1.1) is satisfied.
Proposition 1.2. Let µ ∈ P(X), T : X → Y, S : Y → Z and f ∈ L¹(Y). Then

1. change of variables formula:

(1.2) ∫_Y f(y) d(T_#µ)(y) = ∫_X f(T(x)) dµ(x);

2. composition of maps:

(S ◦ T)_#µ = S_#(T_#µ).

Proof. Recall that, for non-negative f : Y → R,

∫_Y f(y) d(T_#µ)(y) := sup { ∫_Y s(y) d(T_#µ)(y) : 0 ≤ s ≤ f and s is simple }.

Now if s(y) = Σ_{i=1}^N a_i δ_{U_i}(y), where a_i = s(y) for any y ∈ U_i, then

∫_Y s(y) d(T_#µ)(y) = Σ_{i=1}^N a_i T_#µ(U_i) = Σ_{i=1}^N a_i µ(V_i) = ∫_X r(x) dµ(x)

for V_i = T^{-1}(U_i) and r = Σ_{i=1}^N a_i δ_{V_i}. For x ∈ V_i we have T(x) ∈ U_i and therefore r(x) =
a_i = s(T(x)) ≤ f(T(x)). From this it is not hard to see that

sup_{0≤s≤f} ∫_Y s(y) d(T_#µ)(y) = sup_{0≤r≤f◦T} ∫_X r(x) dµ(x)

where both supremums are taken over simple functions. Hence (1.2) holds for non-negative
functions. By treating signed functions as f = f⁺ − f⁻ we can prove the proposition for
f ∈ L¹(Y).
For the second statement let A ⊂ Z and observe that T^{-1}(S^{-1}(A)) = (S ◦ T)^{-1}(A). Then

S_#(T_#µ)(A) = T_#µ(S^{-1}(A)) = µ(T^{-1}(S^{-1}(A))) = µ((S ◦ T)^{-1}(A)) = (S ◦ T)_#µ(A).

Hence S_#(T_#µ) = (S ◦ T)_#µ.
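
To make the pushforward concrete, here is a minimal numerical sketch (an illustration added to these notes, assuming NumPy is available): for a discrete measure, T_#µ places the mass of each x_i at T(x_i), aggregating weights when points collide, and (1.2) reduces to a finite sum.

import numpy as np

x = np.array([-1.0, 0.0, 1.0, 2.0])
w = np.array([0.25, 0.25, 0.25, 0.25])       # mu = sum_i w_i delta_{x_i}
Tx = x ** 2                                  # T(x) = x^2 glues -1 and 1 together

support, inverse = np.unique(Tx, return_inverse=True)
weights = np.bincount(inverse, weights=w)    # weights of T_# mu on its support
print(dict(zip(support, weights)))           # {0.0: 0.25, 1.0: 0.5, 4.0: 0.25}

# Change of variables (1.2): int_Y f d(T_# mu) = int_X f(T(x)) dmu(x).
f = np.exp
assert np.isclose((f(support) * weights).sum(), (f(Tx) * w).sum())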
Given two measures µ and ν the existence of a transport map T such that T_#µ = ν is not
only non-trivial, but it may also be false. For example, consider two discrete measures µ = δ_{x_1},
ν = ½δ_{y_1} + ½δ_{y_2} where y_1 ≠ y_2. Then ν({y_1}) = ½ but µ(T^{-1}({y_1})) ∈ {0, 1} depending on
whether x_1 ∈ T^{-1}({y_1}). Hence no transport maps exist.
There are two important cases where transport maps exist:

1. the discrete case when µ = (1/n) Σ_{i=1}^n δ_{x_i} and ν = (1/n) Σ_{j=1}^n δ_{y_j};

2. the absolutely continuous case when dµ(x) = f (x) dx and dν(y) = g(y) dy.
It is important that in the discrete case µ and ν are supported on the same number of points;
the supports do not have to be the same but they do have to be of the same size. We will revisit
both cases (the discrete case in the next chapter, the absolutely continuous case in Chapter 4).

Figure 1.1: Monge’s transport map, figure modified from Figure 1 in [9].

With this notation we can define Monge’s optimal transport problem as follows.
Definition 1.3. Monge's Optimal Transport Problem: given µ ∈ P(X) and ν ∈ P(Y),

minimise M(T) = ∫_X c(x, T(x)) dµ(x)

over µ-measurable maps T : X → Y subject to ν = T_#µ.


Monge originally considered the problem with L¹ cost, i.e. c(x, y) = |x − y|. This problem
is significantly harder than with L² cost, i.e. c(x, y) = |x − y|². In fact the first correct proof with
L¹ cost dates back only a few years, to 1999 (see Evans and Gangbo [6]), and required stronger
assumptions than the L² cost; Sudakov was thought to have proven the result in 1979 [14], however
this was found to contain a mistake which was later fixed by Ambrosio and Pratelli [1, 3].
In general Monge’s problem is difficult due to the non-linearity in the constraint (1.1). If we
assume that µ and ν are absolutely continuous with respect to the Lebegue measure on Rd , i.e.
dµ(x) = f (x) dx and dν(y) = g(y) dy, and assume T is a C 1 diffeomorphism (T is bijective
and T, T −1 are differentiable) then one can show that (1.1) is equivalent to

f (x) = g(T (x)) |det(∇T (x))| .

The above constraint is highly non-linear and difficult to handle with standard techniques from
the calculus of variations.
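
As a short worked example of the constraint (added here for orientation): in one dimension, if T is increasing and differentiable the constraint reads f(x) = g(T(x)) T′(x), and integrating both sides from −∞ to x gives F(x) = G(T(x)), where F and G are the cumulative distribution functions of µ and ν. Whenever G is invertible this forces T = G^{-1} ◦ F, which is exactly the map we will meet again in Chapter 2.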

1.2 The Kantorovich Formulation


Observe that in the Monge formulation mass is mapped x ↦ T(x). In particular, this means that
mass is not split. In the discrete case this causes difficulties concerning the existence of maps T
such that T_#µ = ν; see the example µ = δ_{x_1}, ν = ½δ_{y_1} + ½δ_{y_2} in the previous section. Observe
that if we allow mass to be split, i.e. half of the mass from x_1 goes to y_1 and half the mass goes
to y_2, then we have a natural relaxation. This is in effect what the Kantorovich formulation does.
To formalise this we consider a measure π ∈ P(X × Y) and think of dπ(x, y) as the amount of
mass transferred from x to y; this way mass can be transferred from x to multiple locations. Of
course the total amount of mass removed from any measurable set A ⊂ X has to equal µ(A),
and the total amount of mass transferred to any measurable set B ⊂ Y has to equal ν(B).
In particular, we have the constraints:

π(A × Y) = µ(A), π(X × B) = ν(B) for all measurable sets A ⊆ X, B ⊆ Y.

We say that any π which satisfies the above has first marginal µ and second marginal ν, and we
denote the set of such π by Π(µ, ν). We will call Π(µ, ν) the set of transport plans between µ
and ν. Note that Π(µ, ν) is never empty (in contrast with the set of transport maps) since
µ ⊗ ν ∈ Π(µ, ν); this is the trivial transport plan which transports every grain of sand at x to y
proportionally to ν. We can now define Kantorovich's formulation of optimal transport.

Definition 1.4. Kantorovich's Optimal Transport Problem: given µ ∈ P(X) and ν ∈ P(Y),

minimise K(π) = ∫_{X×Y} c(x, y) dπ(x, y)

over π ∈ Π(µ, ν).
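
As a quick sanity check on the constraint set (a minimal sketch added to these notes, assuming NumPy is available): in the discrete case the trivial plan µ ⊗ ν is the outer product of the weight vectors, and its marginals are µ and ν by construction.

import numpy as np

alpha = np.array([0.2, 0.3, 0.5])            # weights of mu
beta = np.array([0.25, 0.25, 0.5])           # weights of nu
pi = np.outer(alpha, beta)                   # pi_ij = alpha_i * beta_j

assert np.allclose(pi.sum(axis=1), alpha)    # first marginal is mu
assert np.allclose(pi.sum(axis=0), beta)     # second marginal is nu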

By the example with discrete measures, where we showed there did not exist any transport
maps, we know that Kantorovich's and Monge's optimal transport problems do not always
coincide. However, let us assume that there exists a transport map T† : X → Y that is
optimal for Monge. Then if we define dπ(x, y) = dµ(x) δ_{y=T†(x)} a quick calculation shows that
π ∈ Π(µ, ν):

π(A × Y) = ∫_A δ_{T†(x)∈Y} dµ(x) = µ(A)
π(X × B) = ∫_X δ_{T†(x)∈B} dµ(x) = µ((T†)^{-1}(B)) = T†_#µ(B) = ν(B).

Since

∫_{X×Y} c(x, y) dπ(x, y) = ∫_X ∫_Y c(x, y) δ_{y=T†(x)} dy dµ(x) = ∫_X c(x, T†(x)) dµ(x)

it follows that

(1.3) inf K(π) ≤ inf M(T).

In fact one does not need minimisers of Monge's problem to exist: if M(T†) ≤ inf_T M(T) + ε
for some ε > 0 then inf K(π) ≤ inf M(T) + ε, and since ε > 0 was arbitrary (1.3) holds.
When the optimal plan π† can be written in the form dπ†(x, y) = dµ(x) δ_{y=T(x)} it follows
that T is an optimal transport map and inf K(π) = inf M(T). Conditions sufficient for such a
representation will be explored in Chapter 4. For now we just say that if c(x, y) = |x − y|², µ, ν both
have finite second moments, and µ does not give mass to small sets², then there exists an optimal
plan π† which can be written as dπ†(x, y) = dµ(x) δ_{y=T†(x)} where T† is an optimal map.
Let us observe the advantages of both the Monge and Kantorovich formulations. Transport maps
give a natural method of interpolation between two measures; in particular we can define

µ_t = ((1 − t)Id + tT†)_#µ

and then µ_t interpolates between µ and ν. In fact this line of reasoning will lead us directly to the
geodesics that we consider in greater detail in Chapter 5. In Figure 1.2 we compare the optimal
transport interpolation with the Euclidean interpolation defined by µ^E_t = (1 − t)µ + tν. In many
applications the Lagrangian nature of optimal transport will be more realistic than Euclidean
formulations.

Figure 1.2: Interpolation in the optimal transport framework (left) and Euclidean space (right), figure
modified from Figure 2 in [9].
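
To see the difference concretely, here is a minimal sketch (added to these notes, assuming NumPy is available) for two point masses on the line: the optimal transport interpolant moves the mass, whilst the Euclidean interpolant fades one mass out as the other fades in.

import numpy as np

x0, x1 = 0.0, 4.0                            # mu = delta_{x0}, nu = delta_{x1}
T = lambda s: s + (x1 - x0)                  # the (only) transport map here
for t in np.linspace(0, 1, 5):
    # Transport: mu_t = ((1 - t)Id + tT)_# mu is a single moving particle.
    pos = (1 - t) * x0 + t * T(x0)
    print(f"t={t:.2f}  transport: delta at {pos:.1f};"
          f"  Euclidean: {1 - t:.2f}*delta_0 + {t:.2f}*delta_4")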

Notice that the Kantorovich problem is convex (the constraints are convex and one usually
has that the cost function is c(x, y) = d(x − y) where d is convex). In particular let us consider
the Kantorovich problem between discrete measures µ = Σ_{i=1}^m α_i δ_{x_i}, ν = Σ_{j=1}^n β_j δ_{y_j} where
Σ_{i=1}^m α_i = 1 = Σ_{j=1}^n β_j, α_i ≥ 0, β_j ≥ 0. Let c_ij = c(x_i, y_j) and π_ij = π({(x_i, y_j)}). Then the
Kantorovich problem is to solve

minimise Σ_{i=1}^m Σ_{j=1}^n c_ij π_ij over π subject to π_ij ≥ 0, Σ_{i=1}^m π_ij = β_j, Σ_{j=1}^n π_ij = α_i.

This is a linear programme! In fact Kantorovich is considered the inventor of linear
programming. Not only does this provide a method for solving optimal transport problems (either
through off-the-shelf linear programming algorithms, or through more recent advances such as
entropy regularised approaches, see [5]) but the dual formulation

inf_{π≥0, C^⊤π=(µ^⊤,ν^⊤)^⊤} c · π = sup_{C(ϕ^⊤,φ^⊤)^⊤ ≤ c} (µ · ϕ + ν · φ)

is an important stepping stone in establishing important properties such as the characterisation of
optimal transport maps and plans. We study the dual formulation in Chapter 3. In the next
section we prove the existence of transport plans under fairly general conditions.

² µ ∈ P(R^d) does not give mass to small sets if µ(A) = 0 for all sets A of Hausdorff dimension at most d − 1.
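
Before moving on, here is the discrete problem above solved with an off-the-shelf LP solver (a minimal sketch added to these notes, assuming SciPy is available):

import numpy as np
from scipy.optimize import linprog

m, n = 3, 4
rng = np.random.default_rng(0)
x, y = rng.normal(size=m), rng.normal(size=n)
alpha = np.full(m, 1 / m)                    # source weights alpha_i
beta = np.full(n, 1 / n)                     # target weights beta_j
C = (x[:, None] - y[None, :]) ** 2           # cost c_ij = |x_i - y_j|^2

# Equality constraints: row sums of pi equal alpha, column sums equal beta.
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1
for j in range(n):
    A_eq[m + j, j::n] = 1
b_eq = np.concatenate([alpha, beta])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
pi = res.x.reshape(m, n)                     # an optimal transport plan
print("optimal cost:", res.fun)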

1.3 Existence of Transport Plans


Section references: Proposition 1.5 is taken from [15, Proposition 2.1].
We complete this chapter by proving the existence of a minimiser to Kantorovich's optimal
transport problem. For the proof we use the direct method from the calculus of variations.
Approximately, the direct method is compactness plus lower semi-continuity. More precisely,
if we are considering a variational problem inf_{v∈V} F(v) then we first show that the set V
is compact (or at least that a set which contains the minimiser is compact). Then, let v_n be a
minimising sequence, i.e. F(v_n) → inf F. Upon extracting a subsequence we can assume that
v_n → v† ∈ V. This gives a candidate minimiser. If we can show that F is lower semi-continuous
then lim_{n→∞} F(v_n) ≥ F(v†) and hence v† is a minimiser.
Proposition 1.5. Let µ ∈ P(X), ν ∈ P(Y ) where X, Y are Polish spaces, and assume c :
X × Y → [0, ∞) is lower semi-continuous. Then there exists π † ∈ Π(µ, ν) that minimises K
(defined in Definition 1.4) over all π ∈ Π(µ, ν).
Proof. Note that Π(µ, ν) is non-empty. Let us see that Π(µ, ν) is compact in the weak* topology.
Let δ > 0 and choose compact sets K ⊂ X, L ⊂ Y such that

µ(X \ K) ≤ δ, ν(Y \ L) ≤ δ.

(Existence of such sets follows directly since by definition Radon measures are inner regular.) If
(x, y) ∈ (X × Y) \ (K × L) then either x ∉ K or y ∉ L, hence (x, y) ∈ X × (Y \ L) or
(x, y) ∈ (X \ K) × Y. So, for any π ∈ Π(µ, ν),

π((X × Y) \ (K × L)) ≤ π(X × (Y \ L)) + π((X \ K) × Y) = ν(Y \ L) + µ(X \ K) ≤ 2δ.

Hence, Π(µ, ν) is tight. By Prokhorov's theorem the closure of Π(µ, ν) is sequentially compact
in the topology of weak* convergence.³
To check that Π(µ, ν) is (weak*) closed let π_n ∈ Π(µ, ν) be a sequence weakly* converging
to π ∈ M(X × Y), i.e.

∫_{X×Y} f(x, y) dπ_n(x, y) → ∫_{X×Y} f(x, y) dπ(x, y) ∀f ∈ C_b^0(X × Y).

We choose f(x, y) = f̃(x), where f̃ is continuous and bounded. We have

∫_X f̃(x) dµ(x) = ∫_{X×Y} f̃(x) dπ_n(x, y) → ∫_{X×Y} f̃(x) dπ(x, y) = ∫_X f̃(x) d(P^X_#π)(x)

where P^X(x, y) = x is the projection onto X (so P^X_#π is the X marginal). Since this is true for
all f̃ ∈ C_b^0(X) it follows that P^X_#π = µ. Similarly, P^Y_#π = ν. Hence, π ∈ Π(µ, ν) and Π(µ, ν)
is weakly* closed.

³ Prokhorov's theorem: if (S, ρ) is a separable metric space then K ⊂ P(S) is tight if and only if the closure of
K is sequentially compact in P(S) equipped with the topology of weak* convergence.
Let π_n ∈ Π(µ, ν) be a minimising sequence, i.e. K(π_n) → inf_{π∈Π(µ,ν)} K(π). Since Π(µ, ν)
is compact we can assume that π_n ⇀* π† ∈ Π(µ, ν). Our candidate for a minimiser is π†. Note
that c is lower semi-continuous and bounded from below. Then

inf_{π∈Π(µ,ν)} K(π) = lim_{n→∞} ∫_{X×Y} c(x, y) dπ_n(x, y) ≥ ∫_{X×Y} c(x, y) dπ†(x, y)

where we use the Portmanteau theorem, which provides equivalent characterisations of weak*
convergence. Hence π† is a minimiser.

Chapter 2

Special Cases

In this section we look at some special cases where we can prove existence and characterisation
of optimal transport maps and plans. Generalising these results requires some work and in
particular a duality theorem. On the other hand the results in this chapter require less background
and are somehow "lower hanging fruit". Chapters 3 and 4 are essentially the results of this
chapter generalised to more abstract settings. The two special cases we consider here are when
measures µ, ν are on the real line, and when measures µ, ν are discrete. We start with the real
line.

2.1 Optimal Transport in One Dimension


Section references: a version of Theorem 2.1 can be found in [15, Theorem 2.18] and [12,
Theorem 2.9 and Proposition 2.17], versions of Corollary 2.2 can be found in [15, Remark 2.19]
and [12, Lemma 2.8 and Proposition 2.17], Proposition 2.3 can be found in [7, Theorem 2.3].
Let us consider two measures µ, ν ∈ P(R) with cumulative distribution functions F and G
respectively. We recall that

F(x) = ∫_{-∞}^x dµ = µ((−∞, x])

and F is right-continuous, non-decreasing, F(−∞) = 0 and F(+∞) = 1. We define the
generalised inverse of F on [0, 1] by

F^{-1}(t) = inf {x ∈ R : F(x) > t}.

In general F^{-1}(F(x)) ≥ x and F(F^{-1}(t)) ≥ t. If F is invertible then F^{-1}(F(x)) = x and
F(F^{-1}(t)) = t. The main result of this section is the following theorem.

Theorem 2.1. Let µ, ν ∈ P(R) with cumulative distribution functions F and G respectively. Assume
c(x, y) = d(x − y) where d is convex and continuous. Let π† be the measure on R² with cumulative
distribution function H(x, y) = min{F(x), G(y)}. Then π† ∈ Π(µ, ν) and furthermore
π† is optimal for Kantorovich's optimal transport problem with cost function c. Moreover the
optimal transport cost is

min_{π∈Π(µ,ν)} K(π) = ∫_0^1 d(F^{-1}(t) − G^{-1}(t)) dt.

Before proving the theorem we state a corollary.


Corollary 2.2. Under the assumptions of Theorem 2.1 the following holds.

1. If c(x, y) = |x − y| then the optimal transport distance is the L¹ distance between the
cumulative distribution functions, i.e.

inf_{π∈Π(µ,ν)} K(π) = ∫_R |F(x) − G(x)| dx.

2. If µ does not give mass to atoms then min_{π∈Π(µ,ν)} K(π) = min_{T: T_#µ=ν} M(T) and furthermore
T† = G^{-1} ◦ F is a minimiser to Monge's optimal transport problem, i.e. T†_#µ = ν
and

inf_{T: T_#µ=ν} M(T) = M(T†).

Proof. For the first part, by Theorem 2.1, it is enough to show that

∫_0^1 |F^{-1}(t) − G^{-1}(t)| dt = ∫_R |F(x) − G(x)| dx.

Define A ⊂ R² by

A = {(x, t) : min{F(x), G(x)} ≤ t ≤ max{F(x), G(x)}, x ∈ R}.

From Figure 2.1 we notice that we can equivalently write

A = {(x, t) : min{F^{-1}(t), G^{-1}(t)} ≤ x ≤ max{F^{-1}(t), G^{-1}(t)}, t ∈ [0, 1]}.

By Fubini's theorem

L(A) = ∫_R ∫_{min{F(x),G(x)}}^{max{F(x),G(x)}} dt dx = ∫_0^1 ∫_{min{F^{-1}(t),G^{-1}(t)}}^{max{F^{-1}(t),G^{-1}(t)}} dx dt

where L is the Lebesgue measure. Since max{a, b} − min{a, b} = |a − b| then

∫_R ∫_{min{F(x),G(x)}}^{max{F(x),G(x)}} dt dx = ∫_R (max{F(x), G(x)} − min{F(x), G(x)}) dx = ∫_R |F(x) − G(x)| dx.

Similarly

∫_0^1 ∫_{min{F^{-1}(t),G^{-1}(t)}}^{max{F^{-1}(t),G^{-1}(t)}} dx dt = ∫_0^1 |F^{-1}(t) − G^{-1}(t)| dt.
Figure 2.1: Optimal transport distance in 1D with cost c(x, y) = |x − y|, figure is taken from [10].

This proves the first part of the corollary.
For the second part we recall by Proposition 1.2 that T_#µ = G^{-1}_#(F_#µ). We show that (i)
G^{-1}_#(L⌞[0,1]) = ν and (ii) L⌞[0,1] = F_#µ. This is enough to show that T_#µ = ν. For (i),

G^{-1}_#(L⌞[0,1])((−∞, y]) = L⌞[0,1]({t : G^{-1}(t) ≤ y})
= L⌞[0,1]({t : G(y) ≥ t})
= G(y)
= ν((−∞, y])

where we used G^{-1}(t) ≤ y ⇔ G(y) ≥ t. For (ii) we note that F is continuous (as µ does not
give mass to atoms). So for all t ∈ (0, 1) the set F^{-1}([0, t]) is closed; in particular F^{-1}([0, t]) =
(−∞, x_t] for some x_t with F(x_t) = t. Now, for t ∈ (0, 1),

F_#µ([0, t]) = µ({x : F(x) ≤ t})
= µ({x : x ≤ x_t})
= F(x_t)
= t.

Hence F_#µ = L⌞[0,1].
Now we show that T† is optimal. By Theorem 2.1

inf_{π∈Π(µ,ν)} K(π) = ∫_0^1 d(F^{-1}(t) − G^{-1}(t)) dt
= ∫_R d(x − G^{-1}(F(x))) dµ(x)    since F_#µ = L⌞[0,1] and by Proposition 1.2
= ∫_R d(x − T†(x)) dµ(x)
≥ inf_{T: T_#µ=ν} M(T).

Since inf_{T: T_#µ=ν} M(T) ≥ min_{π∈Π(µ,ν)} K(π), the minima of the Monge and Kantorovich
optimal transport problems coincide and T† is an optimal map for Monge.
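
As a quick numerical illustration of Corollary 2.2 (a sketch added to these notes, assuming SciPy is available): for c(x, y) = |x − y| and equal numbers of samples, the empirical quantile functions at t = (k + ½)/n are the order statistics, so the optimal map pairs sorted samples and the cost is a mean of absolute differences; SciPy's wasserstein_distance computes the same quantity.

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
xs = np.sort(rng.normal(0.0, 1.0, size=1000))   # sorted samples of mu
ys = np.sort(rng.normal(2.0, 0.5, size=1000))   # sorted samples of nu

cost = np.mean(np.abs(xs - ys))                 # int_0^1 |F^{-1} - G^{-1}| dt
print(cost, wasserstein_distance(xs, ys))       # the two values agree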
Before we prove Theorem 2.1 we give some of the basic ideas in the proof. The key is the
idea of monotonicity. We say that a set Γ ⊂ R² is monotone (with respect to d) if for all
(x_1, y_1), (x_2, y_2) ∈ Γ,

d(x_1 − y_1) + d(x_2 − y_2) ≤ d(x_1 − y_2) + d(x_2 − y_1).
For example, if Γ = {(x, y) : f (x) = y} and f is increasing, then Γ is monotone (assuming
that d is increasing). The definition generalises to higher dimensions and often appears in convex
analysis (for example the subdifferential of a convex function satisfies a monotonicity property).
As a result, this concept can also be used to prove analogous results to Theorem 2.1 in higher
dimensions. The definition should be natural for optimal transport. In particular, let Γ be the
support of π † , which is a solution of Kantorovich’s optimal transport problem. If π † transports
mass from x1 to y1 and from x2 > x1 to y2 we expect y2 > y1 , else it would have been cheaper to
transport from x1 to y2 , and from x2 to y1 . The following proposition formalises this reasoning.
Proposition 2.3. Let µ, ν ∈ P(R). Assume π † ∈ Π(µ, ν) is an optimal transport plan in the
Kantorovich sense for cost function c(x, y) = d(x − y) where d is continuous. Then for all
(x1 , y1 ), (x2 , y2 ) ∈ supp(π † ) we have

d(x1 − y1 ) + d(x2 − y2 ) ≤ d(x1 − y2 ) + d(x2 − y1 ).

Proof. Let Γ = supp(π†) and (x_1, y_1), (x_2, y_2) ∈ Γ. Assume there exists η > 0 such that

d(x_1 − y_1) + d(x_2 − y_2) − d(x_1 − y_2) − d(x_2 − y_1) ≥ η.

Let I_1, I_2, J_1, J_2 be closed intervals with the following properties:

1. x_i ∈ I_i, y_i ∈ J_i, i = 1, 2;

2. |d(x − y) − d(x_i − y_j)| ≤ ε for x ∈ I_i, y ∈ J_j, i, j = 1, 2, where ε < η/4;

3. the sets I_i × J_j are disjoint;

4. π†(I_1 × J_1) = π†(I_2 × J_2) = δ > 0.

Properties 1-3 can be satisfied by choosing the intervals I_i, J_j sufficiently small. It may not be
possible to satisfy property 4; however, since (x_i, y_i) ∈ Γ, we can find sets I_i, J_j that satisfy
1-3 and π†(I_1 × J_1) > 0, π†(I_2 × J_2) > 0. It makes the notation in the proof easier to assume
that π†(I_1 × J_1) = π†(I_2 × J_2); if not, the proof can be adapted and we briefly describe how
at the end.
The idea of the proof is, instead of transferring mass from x_1 to y_1 and from x_2 to y_2, to
transfer mass from x_1 to y_2, and from x_2 to y_1. To make the argument rigorous we talk about the
mass around each of x_i, y_i (hence the need for the intervals I_i, J_i).
Let

µ̃_1 = P^X_#(π†⌞I_1×J_1), µ̃_2 = P^X_#(π†⌞I_2×J_2), ν̃_1 = P^Y_#(π†⌞I_1×J_1), ν̃_2 = P^Y_#(π†⌞I_2×J_2)

and choose any π̃_12 ∈ Π(µ̃_1, ν̃_2), π̃_21 ∈ Π(µ̃_2, ν̃_1). We define π̃ to satisfy

π̃(A × B) = π†(A × B)                   if (A × B) ∩ (I_i × J_j) = ∅ for all i, j
π̃(A × B) = 0                           if A × B ⊆ I_i × J_i for some i
π̃(A × B) = π†(A × B) + π̃_12(A × B)     if A × B ⊆ I_1 × J_2
π̃(A × B) = π†(A × B) + π̃_21(A × B)     if A × B ⊆ I_2 × J_1.
For sets with (A × B) ∩ (I_i × J_j) ≠ ∅ but A × B ⊄ I_i × J_j we define π̃(A × B) by

π̃(A × B) = π̃((A × B) ∩ (I_i × J_j)) + π̃((A × B) ∩ (I_i × J_j)^c).

By construction, for B ∩ (J_1 ∪ J_2) = ∅,

π̃(R × B) = π†(R × B) = ν(B).

If B ⊆ J_1 then

π̃(R × B) = π̃((R \ (I_1 ∪ I_2)) × B) + π̃(I_1 × B) + π̃(I_2 × B)
= π†((R \ (I_1 ∪ I_2)) × B) + 0 + π†(I_2 × B) + π̃_21(I_2 × B)
= π†((R \ I_1) × B) + π†(I_1 × B)
= π†(R × B)
= ν(B)

since π̃_21(I_2 × B) = ν̃_1(B) = π†(I_1 × (B ∩ J_1)) = π†(I_1 × B). Similarly for B ⊆ J_2. Hence we
have π̃(R × B) = ν(B) for all measurable B. Analogously π̃(A × R) = µ(A) for all measurable
A. Therefore π̃ ∈ Π(µ, ν).
Now,

∫_{R×R} d(x − y) dπ†(x, y) − ∫_{R×R} d(x − y) dπ̃(x, y)
= ∫_{I_1×J_1 ∪ I_2×J_2} d(x − y) dπ†(x, y) − ∫_{I_1×J_2} d(x − y) dπ̃_12(x, y) − ∫_{I_2×J_1} d(x − y) dπ̃_21(x, y)
≥ δ(d(x_1 − y_1) − ε) + δ(d(x_2 − y_2) − ε) − δ(d(x_1 − y_2) + ε) − δ(d(x_2 − y_1) + ε)
≥ δ(η − 4ε)
> 0

since π̃_12(I_1 × J_2) = µ̃_1(I_1) = π†(I_1 × J_1) = δ, and similarly π̃_21(I_2 × J_1) = δ. This contradicts
the assumption that π† is optimal, hence no such η can exist.
Finally we remark that if π†(I_1 × J_1) > π†(I_2 × J_2) then one can adapt the constructed plan
π̃ by transporting some mass with the original plan π†. In particular the new constructed plan is
chosen to satisfy

π̃(A × B) = π†(A × B) (1 − π†(I_2 × J_2)/π†(I_1 × J_1))

if A × B ⊂ I_1 × J_1, and µ̃_1, ν̃_1 are rescaled:

µ̃_1 = (π†(I_2 × J_2)/π†(I_1 × J_1)) P^X_#(π†⌞I_1×J_1), ν̃_1 = (π†(I_2 × J_2)/π†(I_1 × J_1)) P^Y_#(π†⌞I_1×J_1).

All other definitions remain unchanged. One can go through the argument above and reach the
same conclusion.
We now prove Theorem 2.1.
Proof of Theorem 2.1. Assume first that d is continuous and strictly convex. By Proposition 1.5 there
exists π* ∈ Π(µ, ν) that is an optimal transport plan in the Kantorovich sense. We will show that
π* = π†. By Proposition 2.3, Γ = supp(π*) is monotone, i.e.

d(x_1 − y_1) + d(x_2 − y_2) ≤ d(x_1 − y_2) + d(x_2 − y_1)

for all (x_1, y_1), (x_2, y_2) ∈ Γ. We claim that for any x_1, x_2, y_1, y_2 satisfying the above with x_1 < x_2
we have y_1 ≤ y_2. Assume that y_2 < y_1 and let a = x_1 − y_1, b = x_2 − y_2 and δ = x_2 − x_1. We know
that

d(a) + d(b) ≤ d(b − δ) + d(a + δ).

Let t = δ/(b − a); it is easy to check that t ∈ (0, 1) and b − δ = (1 − t)b + ta, a + δ = tb + (1 − t)a.
Then, by strict convexity of d,

d(b − δ) + d(a + δ) < (1 − t)d(b) + td(a) + td(b) + (1 − t)d(a) = d(b) + d(a).

This is a contradiction, hence y_2 ≥ y_1.


Now we show that π† = π*. More precisely we show that π*((−∞, x] × (−∞, y]) =
min{F(x), G(y)}. Let A = (−∞, x] × (y, +∞), B = (x, +∞) × (−∞, y]. We know that if
(x_1, y_1), (x_2, y_2) ∈ Γ and x_1 < x_2 then y_1 ≤ y_2. This implies that, if (x_0, y_0) ∈ Γ, then

Γ ⊂ {(x, y) : x ≤ x_0, y ≤ y_0} ∪ {(x, y) : x ≥ x_0, y ≥ y_0}.

Hence π*(A) and π*(B) cannot both be non-zero. In particular

π*((−∞, x] × (−∞, y]) = min{π*(((−∞, x] × (−∞, y]) ∪ A), π*(((−∞, x] × (−∞, y]) ∪ B)}.

But,

π*(((−∞, x] × (−∞, y]) ∪ A) = π*((−∞, x] × R) = F(x)
π*(((−∞, x] × (−∞, y]) ∪ B) = π*(R × (−∞, y]) = G(y).

Hence π*((−∞, x] × (−∞, y]) = min{F(x), G(y)}.


Now we generalise to d not strictly convex. Since d is convex it can be bounded below by an
affine function: let d(x) ≥ (ax + b)₊. One can check that f(x) = ½√(4 + (ax + b)²) + ½(ax + b)
is strictly convex and satisfies 0 ≤ f(x) ≤ 1 + d(x). Then d_ε := d + εf is strictly convex and
satisfies d ≤ d_ε ≤ (1 + ε)d + ε. Now let π ∈ Π(µ, ν); then

∫_{R×R} d(x − y) dπ†(x, y) ≤ ∫_{R×R} d_ε(x − y) dπ†(x, y)
≤ ∫_{R×R} d_ε(x − y) dπ(x, y)
≤ (1 + ε) ∫_{R×R} d(x − y) dπ(x, y) + ε,

where the second inequality uses that, by the first part of the proof, π† is optimal for the strictly
convex cost d_ε.

Taking ε → 0 proves that π† is an optimal plan in the sense of Kantorovich.
Now we show that ∫_{R×R} d(x − y) dπ†(x, y) = ∫_0^1 d(F^{-1}(t) − G^{-1}(t)) dt. We claim that
π† = (F^{-1}, G^{-1})_#(L⌞[0,1]). Assuming so, then

∫_{R×R} d(x − y) dπ†(x, y) = ∫_{R×R} d(x − y) d((F^{-1}, G^{-1})_#(L⌞[0,1]))(x, y) = ∫_0^1 d(F^{-1}(t) − G^{-1}(t)) dt

by the change of variables formula (Proposition 1.2).
To prove the claim we have

(F^{-1}, G^{-1})_#(L⌞[0,1])((−∞, x] × (−∞, y]) = L⌞[0,1]((F^{-1}, G^{-1})^{-1}((−∞, x] × (−∞, y]))
= L⌞[0,1]({t : F^{-1}(t) ≤ x and G^{-1}(t) ≤ y})
= L⌞[0,1]({t : F(x) ≥ t and G(y) ≥ t})
= min{F(x), G(y)}
= π†((−∞, x] × (−∞, y])

where we used F^{-1}(t) ≤ x ⇔ F(x) ≥ t.


Remark 2.4. Note that we actually showed that if d is continuous and strictly convex then π † is
unique.

2.2 Existence of Transport Maps for Discrete Measures


Section references: the discrete special case is based on the proof outlined in the introduction
to [15]. The proof of the Minkowski-Carathéodory Theorem comes from [13, Theorem 8.11].
Proving the existence of a transport map T† that is optimal for Monge's optimal transport
problem, i.e. T† minimises M(T) over all T satisfying T_#µ = ν, is difficult, and in fact for
general measures we will only consider this problem for the specific cost function c(x, y) = |x − y|².
Here we consider general cost functions but restrict to discrete measures µ = (1/n) Σ_{i=1}^n δ_{x_i} and
ν = (1/n) Σ_{j=1}^n δ_{y_j}. Note that since all points X = {x_i}_{i=1}^n, Y = {y_j}_{j=1}^n have equal mass, the
map T : X → Y defined by T(x_i) = y_{σ(i)}, where σ : {1, . . . , n} → {1, . . . , n} is a permutation,
is a transport map (i.e. satisfies (1.1)). Hence the set of transport maps is non-empty.
For a convex and compact set B in a Banach space M we define the set of extreme points,
which we denote by E(B), as the set of points in B that cannot be written as nontrivial convex
combinations of points in B. I.e. if B ∋ π = Σ_{i=1}^m α_i π_i (where Σ_{i=1}^m α_i = 1, α_i ≥ 0,
π_i ∈ B) then π ∈ E(B) if and only if α_i ∈ {0, 1}. We recall two results. The first is
the Minkowski–Carathéodory Theorem. The theorem is set in Euclidean spaces but can be
generalised to Banach spaces where it is known as Choquet's theorem.
Theorem 2.5. Minkowski–Carathéodory Theorem. Let B ⊂ R^M be a non-empty, convex and
compact set. Then for any π† ∈ B there exists a measure η supported on E(B) such that for any
affine function f

f(π†) = ∫ f(π) dη(π).

Furthermore η can be chosen such that the cardinality of the support of η is at most dim(B) + 1
and (the support is) independent of π†.
Proof. Let d = dim(B). It is enough to show that there exist {a_i}_{i=0}^d such that π† = Σ_{i=0}^d a_i π^{(i)}
where Σ_{i=0}^d a_i = 1 and {π^{(i)}}_{i=0}^d ⊂ E(B). We prove the result by induction. The case when
d = 0 is trivial since B is just a point.
Now assume the result is true for all sets of dimension at most d − 1. Pick π† ∈ B and assume
π† ∉ E(B). Pick π^{(0)} ∈ E(B), take the line segment [π^{(0)}, π†] and extend it until it intersects
with the boundary of B; i.e. letting θ parametrise the line, {θ : (1 − θ)π^{(0)} + θπ† ∈ B} = [0, α]
for some α ≥ 1 (where α exists and is finite by convexity and compactness of B). Let
ξ = (1 − α)π^{(0)} + απ†; then π† = (1 − θ_0)ξ + θ_0 π^{(0)} where θ_0 = 1 − 1/α. Now since ξ ∈ F
for some proper face¹ F of B, by the induction hypothesis there exist {π^{(i)}}_{i=1}^d such
that ξ = Σ_{i=1}^d θ_i π^{(i)} with Σ_{i=1}^d θ_i = 1. Hence, π† = Σ_{i=1}^d (1 − θ_0)θ_i π^{(i)} + θ_0 π^{(0)}. Since
(1 − θ_0) Σ_{i=1}^d θ_i + θ_0 = 1, π† is a convex combination of {π^{(i)}}_{i=0}^d. Note that we chose π^{(0)}
independently of π†.

¹ A face F of a convex set B is any set with the property that if π^{(1)}, π^{(2)} ∈ B, t ∈ (0, 1) and tπ^{(1)} + (1 − t)π^{(2)} ∈
F then π^{(1)}, π^{(2)} ∈ F. A proper face is a face which has dimension at most dim(B) − 1. A result we use without
proof is that the boundary of a convex set is the union of all proper faces.
Theorem 2.6. Birkhoff's Theorem. Let B be the set of n × n bistochastic matrices, i.e.

B = {π ∈ R^{n×n} : π_ij ≥ 0 ∀ij; Σ_{i=1}^n π_ij = 1 ∀j; Σ_{j=1}^n π_ij = 1 ∀i}.

Then the set of extreme points E(B) of B is exactly the set of permutation matrices, i.e.

E(B) = {π ∈ {0, 1}^{n×n} : Σ_{i=1}^n π_ij = 1 ∀j; Σ_{j=1}^n π_ij = 1 ∀i}.

Proof. We start by showing that every permutation matrix is an extreme point. Let π_ij = δ_{j=σ(i)}
where σ is a permutation. Assume that π ∉ E(B). Then there exist π^{(1)}, π^{(2)} ∈ B, with
π^{(1)} ≠ π ≠ π^{(2)}, and t ∈ (0, 1) such that π = tπ^{(1)} + (1 − t)π^{(2)}. Let ij be such that
0 = π_ij ≠ π^{(1)}_ij; then

0 = π_ij = tπ^{(1)}_ij + (1 − t)π^{(2)}_ij =⇒ π^{(2)}_ij = −tπ^{(1)}_ij/(1 − t) < 0.

This contradicts π^{(2)}_ij ≥ 0, hence π ∈ E(B).
Now we show that every π ∈ E(B) is a permutation matrix. We do this in two parts: we (i)
show that π ∈ E(B) implies that π_ij ∈ {0, 1}, then (ii) show π_ij = δ_{j=σ(i)} for a permutation σ.
For (i) let π ∈ E(B) and assume there exists i_1 j_1 such that π_{i_1 j_1} ∈ (0, 1). Since Σ_{i=1}^n π_{i j_1} = 1
there exists i_2 ≠ i_1 such that π_{i_2 j_1} ∈ (0, 1). Similarly, since Σ_{j=1}^n π_{i_2 j} = 1 there exists
j_2 ≠ j_1 such that π_{i_2 j_2} ∈ (0, 1). Continuing this procedure until i_m = i_1 we obtain two
sequences:

I = {i_k j_k : k ∈ {1, . . . , m − 1}}, I⁺ = {i_{k+1} j_k : k ∈ {1, . . . , m − 1}}
with i_{k+1} ≠ i_k and j_{k+1} ≠ j_k. Define π^{(δ)} by the following:

π^{(δ)}_ij = π_{i_k j_k} + δ       if ij = i_k j_k for some k
π^{(δ)}_ij = π_{i_{k+1} j_k} − δ   if ij = i_{k+1} j_k for some k
π^{(δ)}_ij = π_ij                 else.

Then,

Σ_{i=1}^n π^{(δ)}_ij = Σ_{i=1}^n π_ij + δ |{ij ∈ I : i ∈ {1, . . . , n}}| − δ |{ij ∈ I⁺ : i ∈ {1, . . . , n}}|.

Now if ij ∈ I then there exists i′ such that i′j ∈ I⁺, and likewise, if ij ∈ I⁺ then there exists i′
such that i′j ∈ I. Hence,

|{ij ∈ I : i ∈ {1, . . . , n}}| = |{ij ∈ I⁺ : i ∈ {1, . . . , n}}|.

It follows that Σ_{i=1}^n π^{(δ)}_ij = 1 and analogously Σ_{j=1}^n π^{(δ)}_ij = 1.
Choose δ = min {min{π_ij, 1 − π_ij} : ij ∈ I ∪ I⁺} ∈ (0, 1). Define π^{(1)} = π^{(−δ)}, π^{(2)} = π^{(δ)}.
We have that π^{(1)}_ij, π^{(2)}_ij ≥ 0 and therefore π^{(1)}, π^{(2)} ∈ B with π^{(1)} ≠ π^{(2)}. Moreover we have
π = ½π^{(1)} + ½π^{(2)}. Hence, π ∉ E(B). The contradiction implies that there does not exist i_1 j_1
such that π_{i_1 j_1} ∈ (0, 1). We have shown that if π ∈ E(B) then π_ij ∈ {0, 1}.
We're left to show (ii): that π_ij = δ_{j=σ(i)}. Since π ∈ B, for each i there exists j* such
that π_{ij*} = 1 (else Σ_{j=1}^n π_ij ≠ 1). We let σ(i) = j*, so by construction we have π_{iσ(i)} = 1. We
claim σ is a permutation. It is enough to show that σ is injective. Now if j = σ(i_1) = σ(i_2)
where i_1 ≠ i_2 then

1 = Σ_{i=1}^n π_ij ≥ π_{i_1 j} + π_{i_2 j} = 2.

The contradiction implies that i_1 = i_2 and therefore σ is injective.


We now show the existence of optimal transport maps between discrete measures
µ = (1/n) Σ_{i=1}^n δ_{x_i} and ν = (1/n) Σ_{j=1}^n δ_{y_j}.

Theorem 2.7. Let µ = (1/n) Σ_{i=1}^n δ_{x_i} and ν = (1/n) Σ_{j=1}^n δ_{y_j}. Then there exists a solution to Monge's
optimal transport problem between µ and ν.

Proof. Let c_ij = c(x_i, y_j) and B be the set of bistochastic n × n matrices, i.e.

B = {π ∈ R^{n×n} : π_ij ≥ 0 ∀ij; Σ_{i=1}^n π_ij = 1 ∀j; Σ_{j=1}^n π_ij = 1 ∀i}.

The Kantorovich problem reads as

minimise (1/n) Σ_{i,j} c_ij π_ij over π ∈ B.
Although, by Proposition 1.5, there exists a minimiser to the Kantorovich optimal transport
problem, we do not use this fact here. Let M be the minimum of the Kantorovich optimal
transport problem, let ε > 0, and find an approximate minimiser π^ε ∈ B such that

M ≥ Σ_ij c_ij π^ε_ij − ε.

If we let f(π) = Σ_ij c_ij π_ij then, assuming that B is compact and convex, there
exists a measure η supported on E(B) such that

f(π^ε) = ∫ f(π) dη(π).

Hence

M ≥ ∫ Σ_ij c_ij π_ij dη(π) − ε ≥ inf_{π∈E(B)} Σ_ij c_ij π_ij − ε ≥ M − ε.

Since this is true for all ε it holds that inf_{π∈E(B)} Σ_ij c_ij π_ij = M. We claim that E(B) is compact,
in which case there exists a minimiser π† ∈ E(B). Note that we have also shown (independently
from Proposition 1.5) that there exists a solution to Kantorovich's optimal transport problem.
By Birkhoff's theorem π† is a permutation matrix, that is there exists a permutation σ† :
{1, . . . , n} → {1, . . . , n} such that π†_ij = δ_{j=σ†(i)}. Let T† : X → Y be defined by T†(x_i) =
y_{σ†(i)}.
We already know that the set of transport maps is non-empty. Let T be any transport map
and define π_ij = δ_{y_j=T(x_i)} (it is easy to see that π ∈ B); then

Σ_{i=1}^n c(x_i, T(x_i)) = Σ_ij c_ij π_ij ≥ Σ_ij c_ij π†_ij = Σ_{i=1}^n c(x_i, T†(x_i)).

Hence T† is a solution to Monge's optimal transport problem.


We are left to show that B is compact and P convex, and E(B) is compact. To show B is
compact we consider the `1 norm: kπk1 := ij |πij | (since all norms are equivalent on finite
dimensional spaces it does not really matter which norm we choose). Clearly B is bounded as for
all π ∈ B we have kπk1 ≤ n2 . For closure, we consider a sequence π (m) ∈ B with π (m) → π.
(m) Pn Pn (m)
Trivially
Pn π ij → π ij for all ij and therefore π ij ≥ 0, likewise i=1 π ij = limm→∞ i=1 πij =
1 and j=1 πij = 1. Hence π ∈ B and B is closed. Therefore B is compact.
Convexity of B is easy to check by considering π (1) , π (2) ∈ B and π = tπ (1) + (1 − t)π (2)
for t ∈ [0, 1] then clearly πij ≥ 0,
n
X n
X n
X
(1) (2)
πij = t πij + (1 − t) πij = t + (1 − t) = 1,
i=1 i=1 i=1
Pn
and similarly j=1 πij = 1. Hence π ∈ B and B is convex.
For compactness of E(B) it is enough to show closure. If E(B) 3 π (m) → π then we already
(m)
know that π ∈ B and by pointwise convergence of πij → πij we also have πij ∈ {0, 1}. Hence
π ∈ E(B) and therefore E(B) is closed.
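
In practice the optimal permutation σ† can be computed directly; a minimal sketch (added to these notes, assuming SciPy is available) using the assignment solver:

import numpy as np
from scipy.optimize import linear_sum_assignment

n = 5
rng = np.random.default_rng(2)
x = rng.normal(size=(n, 2))                  # support of mu
y = rng.normal(size=(n, 2))                  # support of nu
C = np.linalg.norm(x[:, None] - y[None, :], axis=2) ** 2  # c_ij = |x_i - y_j|^2

rows, sigma = linear_sum_assignment(C)       # optimal permutation sigma
print("Monge cost:", C[rows, sigma].mean())  # (1/n) sum_i c(x_i, T(x_i))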

Chapter 3

Kantorovich Duality

We saw in the previous chapter how Kantorovich's optimal transport problem resembles a linear
programme. It should not therefore be surprising that Kantorovich's optimal transport problem
admits a dual formulation. In the following section we state the duality result and give an intuitive
but non-rigorous proof. In Section 3.2 we give a general minimax principle upon which we can
base the proof of Kantorovich duality. In Section 3.3 we can then rigorously prove duality. With
additional assumptions, such as restricting X, Y to Euclidean spaces, we prove the existence of
solutions to the dual problem in Section 3.4.

3.1 Kantorovich Duality


Section references: The statement and proof of the main result, Theorem 3.1, come from [15,
Theorem 1.3].
We start by stating the Kantorovich duality theorem, then give an intuitive proof with one key
step missing. The proof is made rigorous in Section 3.3.
Theorem 3.1. Kantorovich Duality. Let µ ∈ P(X), ν ∈ P(Y) where X, Y are Polish spaces.
Let c : X × Y → [0, +∞] be a lower semi-continuous cost function. Define K as in Definition 1.4
and J by

(3.1) J : L¹(µ) × L¹(ν) → R, J(ϕ, ψ) = ∫_X ϕ dµ + ∫_Y ψ dν.

Let Φ_c be defined by

Φ_c = {(ϕ, ψ) ∈ L¹(µ) × L¹(ν) : ϕ(x) + ψ(y) ≤ c(x, y)}

where the inequality is understood to hold for µ-almost every x ∈ X and ν-almost every y ∈ Y.
Then,

min_{π∈Π(µ,ν)} K(π) = sup_{(ϕ,ψ)∈Φ_c} J(ϕ, ψ).

Let us give an informal interpretation of the result, which originally comes from Caffarelli
and which I take from Villani [15]. Consider the shipper's problem. Suppose we own a number of coal
mines and a number of factories, and we wish to transport the coal from mines to factories. The
amount each mine produces and each factory requires is fixed (and we assume equal). The cost
for us to transport from mine x to factory y is c(x, y). The total optimal cost is the solution
to Kantorovich's optimal transport problem. Now a clever shipper comes to us and says they
will ship for us: we just pay a price ϕ(x) for loading and ψ(y) for unloading. To make it
in our interest the shipper makes sure that ϕ(x) + ψ(y) ≤ c(x, y), that is, the cost is no more
than what we would have spent transporting the coal ourselves. Kantorovich duality tells us that
the shipper can find ϕ and ψ such that this price scheme costs just as much as paying for the cost of
transport ourselves.
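
Duality can be checked numerically on the discrete problem from Chapter 1 (a sketch added to these notes, assuming SciPy ≥ 1.7 with the HiGHS solver): the Lagrange multipliers of the two marginal constraints play the role of (ϕ, ψ), and J(ϕ, ψ) matches the primal cost.

import numpy as np
from scipy.optimize import linprog

n = 4
rng = np.random.default_rng(3)
C = rng.random((n, n))                       # cost matrix c_ij
mu = np.full(n, 1 / n)
nu = np.full(n, 1 / n)

A_eq = np.zeros((2 * n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1           # row sums = mu
    A_eq[n + i, i::n] = 1                    # column sums = nu
res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
              bounds=(0, None), method="highs")

phi, psi = res.eqlin.marginals[:n], res.eqlin.marginals[n:]  # dual potentials
# Primal and dual values agree (up to the solver's sign convention for marginals).
print(res.fun, mu @ phi + nu @ psi)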
We now give an informal proof that will subsequently be made rigorous. Let M =
inf_{π∈Π(µ,ν)} K(π). Observe that

(3.2) M = inf_{π∈M₊(X×Y)} sup_{(ϕ,ψ)} ( ∫_{X×Y} c(x, y) dπ + ∫_X ϕ d(µ − P^X_#π) + ∫_Y ψ d(ν − P^Y_#π) )

where we take the supremum on the right hand side over (ϕ, ψ) ∈ C_b^0(X) × C_b^0(Y). This follows
since

sup_{ϕ∈C_b^0(X)} ∫_X ϕ d(µ − P^X_#π) = +∞ if µ ≠ P^X_#π, and 0 else.

Hence, the infimum over π of the right hand side of (3.2) is on the set where P^X_#π = µ and,
similarly, P^Y_#π = ν (which means that π ∈ Π(µ, ν)). We can rewrite (3.2) more conveniently as

M = inf_{π∈M₊(X×Y)} sup_{(ϕ,ψ)} ( ∫_{X×Y} (c(x, y) − ϕ(x) − ψ(y)) dπ + ∫_X ϕ dµ + ∫_Y ψ dν ).

Assuming a minimax principle we switch the infimum and supremum to obtain

(3.3) M = sup_{(ϕ,ψ)} ( ∫_X ϕ dµ + ∫_Y ψ dν + inf_{π∈M₊(X×Y)} ∫_{X×Y} (c(x, y) − ϕ(x) − ψ(y)) dπ ).

Now if there exist (x_0, y_0) ∈ X × Y and ε > 0 such that ϕ(x_0) + ψ(y_0) − c(x_0, y_0) = ε > 0,
then by letting π_λ = λδ_{(x_0,y_0)} for some λ > 0 we have

inf_{π∈M₊(X×Y)} ∫_{X×Y} (c(x, y) − ϕ(x) − ψ(y)) dπ ≤ −λε → −∞ as λ → ∞.

Hence the supremum on the right hand side of (3.3) can be restricted to when ϕ(x) + ψ(y) ≤ c(x, y)
for all (x, y) ∈ X × Y, i.e. (ϕ, ψ) ∈ Φ_c (this heuristic argument actually used (ϕ, ψ) ∈ C_b^0(X) ×
C_b^0(Y), not L¹(µ) × L¹(ν), and there is a difference between the constraint ϕ(x) + ψ(y) ≤ c(x, y)
holding everywhere and holding almost everywhere; these are technical details that are not
important at this stage). When (ϕ, ψ) ∈ Φ_c then

inf_{π∈M₊(X×Y)} ∫_{X×Y} (c(x, y) − ϕ(x) − ψ(y)) dπ = 0,

which is achieved for π ≡ 0 for example. Hence,

inf_{π∈Π(µ,ν)} K(π) = sup_{(ϕ,ψ)∈Φ_c} ( ∫_X ϕ(x) dµ(x) + ∫_Y ψ(y) dν(y) ).

This is the statement of Kantorovich duality. To complete this argument we need to make the
minimax principle rigorous. In the next section we prove a minimax principle; in the section
after we apply it to Kantorovich duality and provide a complete proof.

3.2 Fenchel-Rockafellar Duality


Section references: I take the duality theorem (Theorem 3.2) from [15, Theorem 1.9]. Lemma 3.3
is hopefully obvious and the Hahn-Banach theorem is well known.
To rigorously prove the Kantorovich duality theorem we need a minimax principle, i.e.
conditions sufficient to interchange the infimum and supremum when we introduce the Lagrange
multipliers ϕ, ψ in (3.2). The minimax principle is specific to convex functions; at this stage it is
perhaps not clear how to apply it to Kantorovich's optimal transport problem when we made no
convexity assumption on c. We define the Legendre-Fenchel transform of a convex function
Θ : E → R ∪ {+∞}, where E is a normed vector space, by

Θ* : E* → R ∪ {+∞}, Θ*(z*) = sup_{z∈E} (⟨z*, z⟩ − Θ(z)).
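
As a standard worked example (added here for orientation): on E = R^n with Θ(z) = ½|z|², the supremum in ⟨z*, z⟩ − ½|z|² is attained at z = z*, giving Θ*(z*) = ½|z*|²; this function is its own transform.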

Convex analysis will play a greater role in the sequel, in particular in Chapter 4 where we will
provide a more in-depth review. We now state the minimax principle taken from Villani [15].
Theorem 3.2. Fenchel-Rockafellar Duality. Let E be a normed vector space and Θ, Ξ : E →
R ∪ {+∞} two convex functions. Assume there exists z_0 ∈ E such that Θ(z_0) < ∞, Ξ(z_0) < ∞
and Θ is continuous at z_0. Then,

inf_E (Θ + Ξ) = max_{z*∈E*} (−Θ*(−z*) − Ξ*(z*)).

In particular the supremum on the right hand side is attained.


We recall a couple of preliminary results (that we do not prove) before we prove the theorem.
Lemma 3.3. Let E be a normed vector space.
1. If Θ : E → R ∪ {+∞} is convex then so is the epigraph A defined by

A = {(z, t) ∈ E × R : t ≥ Θ(z)} .

2. If Θ : E → R ∪ {+∞} is concave then so is the hypograph B defined by

B = {(z, t) ∈ E × R : t ≤ Θ(z)} .

3. If C ⊂ E is convex then int(C) is convex.

4. If D ⊂ E is convex and int(D) ≠ ∅ then the closure of D coincides with the closure of int(D).

The following theorem, the Hahn-Banach theorem, can be stated in multiple different forms.
The most convenient form for us is in terms of separation of convex sets.

Theorem 3.4. Hahn-Banach Theorem. Let E be a topological vector space. Assume A, B are
convex, non-empty and disjoint subsets of E, and that A is open. Then there exists a closed
hyperplane separating A and B.

We now prove Theorem 3.2.

Proof of Theorem 3.2. By writing

−Θ*(−z*) − Ξ*(z*) = inf_{x,y∈E} (Θ(x) + Ξ(y) + ⟨z*, x − y⟩)

and choosing y = x on the right hand side we see that

inf_{x∈E} (Θ(x) + Ξ(x)) ≥ sup_{z*∈E*} (−Θ*(−z*) − Ξ*(z*)).

Let M = inf (Θ + Ξ), and define the sets A, B by

A = {(x, λ) ∈ E × R : λ ≥ Θ(x)}
B = {(y, σ) ∈ E × R : σ ≤ M − Ξ(y)}.

By Lemma 3.3, A and B are convex. By continuity and finiteness of Θ at z_0 the interior of A is
non-empty, and by finiteness of Ξ at z_0, B is non-empty. Let C = int(A) (which is convex by
Lemma 3.3). Now, if (x, λ) ∈ C then λ > Θ(x), therefore λ + Ξ(x) > Θ(x) + Ξ(x) ≥ M. Hence
(x, λ) ∉ B. In particular B ∩ C = ∅. By the Hahn-Banach theorem there exists a hyperplane
H = {Φ = α} that separates B and C, i.e. if we write Φ(x, λ) = f(x) + kλ (where f is linear)
then

f(x) + kλ ≥ α ∀(x, λ) ∈ C
f(x) + kλ ≤ α ∀(x, λ) ∈ B.

Now if (x, λ) ∈ A then there exists a sequence (x_n, λ_n) ∈ C such that (x_n, λ_n) → (x, λ). Hence
f(x) + kλ ← f(x_n) + kλ_n ≥ α. Therefore

(3.4) ∀(x, λ) ∈ A, f(x) + kλ ≥ α
(3.5) ∀(x, λ) ∈ B, f(x) + kλ ≤ α.

We know that (z_0, λ) ∈ A for λ sufficiently large, hence k ≥ 0. We claim k > 0. Assume
k = 0. Then

∀(x, λ) ∈ A, f(x) ≥ α =⇒ f(x) ≥ α ∀x ∈ Dom(Θ)
∀(x, λ) ∈ B, f(x) ≤ α =⇒ f(x) ≤ α ∀x ∈ Dom(Ξ).

As Dom(Ξ) ∋ z_0 ∈ Dom(Θ) then f(z_0) = α. Since Θ is continuous at z_0 there exists r > 0
such that B(z_0, r) ⊂ Dom(Θ), hence for all z with ‖z‖ < r and δ ∈ R with |δ| < 1 we have

f(z_0 + δz) ≥ α =⇒ f(z_0) + δf(z) ≥ α =⇒ δf(z) ≥ 0.

This is true for all δ ∈ (−1, 1) and therefore f(z) = 0 for z ∈ B(0, r). Hence f ≡ 0 on E. It
follows that Φ ≡ 0, which is clearly a contradiction (either H = E × R if α = 0 or H = ∅). It
must be that k > 0.
By (3.4) we have

Θ*(−f/k) = sup_{z∈E} (−f(z)/k − Θ(z)) = −(1/k) inf_{z∈E} (f(z) + kΘ(z)) ≤ −α/k

since (z, Θ(z)) ∈ A. Similarly, by (3.5) we have

Ξ*(f/k) = sup_{z∈E} (f(z)/k − Ξ(z)) = −M + (1/k) sup_{z∈E} (f(z) + k(M − Ξ(z))) ≤ −M + α/k

since (z, M − Ξ(z)) ∈ B. It follows that

M ≥ sup_{z*∈E*} (−Θ*(−z*) − Ξ*(z*)) ≥ −Θ*(−f/k) − Ξ*(f/k) ≥ α/k + M − α/k = M.

So

inf_{x∈E} (Θ(x) + Ξ(x)) = M = sup_{z*∈E*} (−Θ*(−z*) − Ξ*(z*)).

Furthermore z* = f/k must achieve the supremum.

3.3 Proof of Kantorovich Duality


Section references: The two lemmas in this section together prove the Kantorovich duality
theorem; both lemmas come from [15].
Finally we can prove Kantorovich duality as stated in Theorem 3.1. We break the theorem
into two parts.
Lemma 3.5. Under the same conditions as Theorem 3.1 we have

sup_{(ϕ,ψ)∈Φ_c} J(ϕ, ψ) ≤ inf_{π∈Π(µ,ν)} K(π).

Proof. Let (ϕ, ψ) ∈ Φ_c and π ∈ Π(µ, ν). Let A ⊂ X and B ⊂ Y be sets such that µ(A) = 1,
ν(B) = 1 and

ϕ(x) + ψ(y) ≤ c(x, y) ∀(x, y) ∈ A × B.

Now π(A^c × B^c) ≤ π(A^c × Y) + π(X × B^c) = µ(A^c) + ν(B^c) = 0. Hence,

π(A × B) = π(X × B) − π(A^c × B)
= ν(B) − π(A^c × Y) + π(A^c × B^c)
= 1 − µ(A^c) + π(A^c × B^c)
= 1.

So it follows that ϕ(x) + ψ(y) ≤ c(x, y) for π-almost every (x, y). Then,

J(ϕ, ψ) = ∫_X ϕ dµ + ∫_Y ψ dν = ∫_{X×Y} (ϕ(x) + ψ(y)) dπ(x, y) ≤ ∫_{X×Y} c(x, y) dπ(x, y).

The result of the lemma follows by taking the supremum over (ϕ, ψ) ∈ Φ_c on the left hand side
and the infimum over π ∈ Π(µ, ν) on the right hand side.
To complete the proof of Theorem 3.1 we need to show that the opposite inequality in
Lemma 3.5 is also true.

Lemma 3.6. Under the same conditions as Theorem 3.1 we have

sup_{(ϕ,ψ)∈Φ_c} J(ϕ, ψ) ≥ inf_{π∈Π(µ,ν)} K(π).

Proof. The proof is completed in three steps in increasing generality:

1. we assume X, Y are compact and c is continuous;

2. the assumption that X, Y are compact is relaxed, c is still continuous;

3. c is only assumed to be lower semi-continuous.

1. Let E = C_b^0(X × Y) equipped with the supremum norm. The dual space of E is the space
of Radon measures E* = M(X × Y) (by the Riesz–Markov–Kakutani representation theorem).
Define

Θ(u) = 0 if u(x, y) ≥ −c(x, y), and +∞ else,
Ξ(u) = ∫_X ϕ(x) dµ(x) + ∫_Y ψ(y) dν(y) if u(x, y) = ϕ(x) + ψ(y), and +∞ else.

Note that although the representation u(x, y) = ϕ(x) + ψ(y) is not unique (ϕ and ψ are only
unique up to a constant) Ξ is still well defined. We claim that Θ and Ξ are convex. For Θ
consider u, v with Θ(u), Θ(v) < +∞; then u(x, y) ≥ −c(x, y) and v(x, y) ≥ −c(x, y), hence
tu(x, y) + (1 − t)v(x, y) ≥ −c(x, y) for any t ∈ [0, 1]. It follows that

Θ(tu + (1 − t)v) = 0 = tΘ(u) + (1 − t)Θ(v).

If either Θ(u) = +∞ or Θ(v) = +∞ then clearly

Θ(tu + (1 − t)v) ≤ tΘ(u) + (1 − t)Θ(v).

Hence Θ is convex. For Ξ, if either Ξ(u) = +∞ or Ξ(v) = +∞ then clearly

Ξ(tu + (1 − t)v) ≤ tΞ(u) + (1 − t)Ξ(v).

Assume u(x, y) = ϕ_1(x) + ψ_1(y), v(x, y) = ϕ_2(x) + ψ_2(y); then

tu(x, y) + (1 − t)v(x, y) = tϕ_1(x) + (1 − t)ϕ_2(x) + tψ_1(y) + (1 − t)ψ_2(y)

and therefore

Ξ(tu + (1 − t)v) = ∫_X (tϕ_1 + (1 − t)ϕ_2) dµ + ∫_Y (tψ_1 + (1 − t)ψ_2) dν = tΞ(u) + (1 − t)Ξ(v).

Hence Ξ is convex.
Let u ≡ 1; then Θ(u), Ξ(u) < +∞ and Θ is continuous at u. By Theorem 3.2

(3.6) inf_{u∈E} (Θ(u) + Ξ(u)) = max_{π∈E*} (−Θ*(−π) − Ξ*(π)).

First we calculate the left hand side of (3.6). We have

inf_{u∈E} (Θ(u) + Ξ(u)) ≥ inf { ∫_X ϕ(x) dµ(x) + ∫_Y ψ(y) dν(y) : ϕ(x) + ψ(y) ≥ −c(x, y), ϕ ∈ L¹(µ), ψ ∈ L¹(ν) } = − sup_{(ϕ,ψ)∈Φ_c} J(ϕ, ψ).

We now consider the right hand side of (3.6). To do so we need to find the convex conjugates
of Θ and Ξ. For Θ* we compute

Θ*(−π) = sup_{u∈E} ( −∫_{X×Y} u dπ − Θ(u) ) = sup_{u≥−c} ( −∫_{X×Y} u dπ ) = sup_{u≤c} ∫_{X×Y} u dπ.

Then we find

Θ*(−π) = ∫_{X×Y} c(x, y) dπ if π ∈ M₊(X × Y), and +∞ else.

For Ξ* we have

Ξ*(π) = sup_{u∈E} ( ∫_{X×Y} u dπ − Ξ(u) )
= sup_{u(x,y)=ϕ(x)+ψ(y)} ( ∫_{X×Y} u dπ − ∫_X ϕ(x) dµ − ∫_Y ψ(y) dν )
= sup_{u(x,y)=ϕ(x)+ψ(y)} ( ∫_X ϕ d(P^X_#π − µ) + ∫_Y ψ d(P^Y_#π − ν) )
= 0 if π ∈ Π(µ, ν), and +∞ else.

Hence, the right hand side of (3.6) reads

max_{π∈E*} (−Θ*(−π) − Ξ*(π)) = − min_{π∈Π(µ,ν)} ∫_{X×Y} c(x, y) dπ = − min_{π∈Π(µ,ν)} K(π).

This completes the proof of part 1.


Parts 2 and 3 are more complicated (part 2 takes some work, part 3 is actually quite
straightforward) and are omitted; both parts can be found in [15, pp 28-32].

3.4 Existence of Maximisers to the Dual Problem


Section references: Theorem 3.7 is adapted from the special case X = Y = R^n, c(x, y) = |x − y|²
in [15, Theorem 2.9]; the other results in this section, Lemmas 3.8 and 3.9, are adapted from [15,
Lemma 2.10].
The objective of this section is to prove the existence of a maximiser to the dual problem.
We state the theorem before giving a preliminary result followed by the proof of the theorem.
Theorem 3.7. Let µ ∈ P(X), ν ∈ P(Y), where X and Y are Polish, and c : X × Y → [0, ∞).
Assume that there exist c_X ∈ L¹(µ), c_Y ∈ L¹(ν) such that c(x, y) ≤ c_X(x) + c_Y(y) for µ-almost
every x ∈ X and ν-almost every y ∈ Y. In addition, assume that

(3.7) M := ∫_X c_X(x) dµ(x) + ∫_Y c_Y(y) dν(y) < ∞.

Then there exists (ϕ, ψ) ∈ Φ_c such that

sup_{Φ_c} J = J(ϕ, ψ).

Furthermore we can choose (ϕ, ψ) = (η^{cc}, η^c) for some η ∈ L¹(µ), where the c-transform η^c is
defined below.
The condition that M < ∞ is effectively a moment condition on µ and ν. In particular, if
c(x, y) = |x − y|^p then c(x, y) ≤ C(|x|^p + |y|^p) and the requirement that M < ∞ is exactly the
condition that µ, ν have finite p-th moments.
The proof relies on similar concepts to the proof of duality. In particular, for ϕ : X → R, the
c-transforms ϕ^c, ϕ^{cc} defined by

ϕ^c : Y → R, ϕ^c(y) = inf_{x∈X} (c(x, y) − ϕ(x)),
ϕ^{cc} : X → R, ϕ^{cc}(x) = inf_{y∈Y} (c(x, y) − ϕ^c(y))

are key; one should compare these to the Legendre-Fenchel transform defined in the previous
section. We first give a result which implies we only need to consider c-transform pairs.
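
On finite grids the c-transform is a minimum over a cost matrix; a minimal sketch (added to these notes, assuming NumPy is available) that also checks the inequality ϕ^{cc} ≥ ϕ proved in Lemma 3.8 below:

import numpy as np

x = np.linspace(-1, 1, 50)                   # grid for X
y = np.linspace(-1, 1, 60)                   # grid for Y
C = (x[:, None] - y[None, :]) ** 2           # cost c(x, y) = |x - y|^2

phi = np.sin(3 * x)                          # an arbitrary potential on X
phi_c = np.min(C - phi[:, None], axis=0)     # phi^c(y) = min_x (c - phi)
phi_cc = np.min(C - phi_c[None, :], axis=1)  # phi^{cc}(x) = min_y (c - phi^c)

assert np.all(phi_cc >= phi - 1e-12)         # the double transform dominates phi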
Lemma 3.8. Let µ ∈ P(X), ν ∈ P(Y). For any a ∈ R and (ϕ̃, ψ̃) ∈ Φ_c the pair (ϕ, ψ) =
(ϕ̃^{cc} − a, ϕ̃^c + a) satisfies J(ϕ, ψ) ≥ J(ϕ̃, ψ̃) and ϕ(x) + ψ(y) ≤ c(x, y) for µ-almost every
x ∈ X and ν-almost every y ∈ Y.
Furthermore, if J(ϕ̃, ψ̃) > −∞, M < +∞ (where M is defined by (3.7)), and there exist
c_X ∈ L¹(µ) and c_Y ∈ L¹(ν) such that ϕ ≤ c_X and ψ ≤ c_Y, then (ϕ, ψ) ∈ Φ_c.
Proof. Clearly J(ϕ − a, ψ + a) = J(ϕ, ψ) for all a ∈ R, ϕ ∈ L¹(µ) and ψ ∈ L¹(ν), so it is
enough to show that ϕ = ϕ̃^{cc} ≥ ϕ̃, ψ = ϕ̃^c ≥ ψ̃ and ϕ(x) + ψ(y) ≤ c(x, y).
Note that

ψ(y) = inf_{x∈X} (c(x, y) − ϕ̃(x)) ≥ ψ̃(y)

since ϕ̃(x) + ψ̃(y) ≤ c(x, y), and

ϕ(x) = inf_{y∈Y} sup_{z∈X} (c(x, y) − c(z, y) + ϕ̃(z)) ≥ ϕ̃(x)

by choosing z = x.
We easily see that

ϕ(x) + ψ(y) = inf_{z∈Y} (c(x, z) − ϕ̃^c(z)) + ϕ̃^c(y) ≤ c(x, y)

by choosing z = y.
For the furthermore part of the lemma it is left to show the integrability of ϕ, ψ. Note that

∫_X (ϕ(x) − c_X(x)) dµ(x) + ∫_Y (ψ(y) − c_Y(y)) dν(y) = J(ϕ, ψ) − M ≥ J(ϕ̃, ψ̃) − M

and since ϕ − c_X ≤ 0, ψ − c_Y ≤ 0, both integrals on the left hand side are negative. In
particular

‖ϕ − c_X‖_{L¹(µ)} + ‖ψ − c_Y‖_{L¹(ν)} = −∫_X (ϕ(x) − c_X(x)) dµ(x) − ∫_Y (ψ(y) − c_Y(y)) dν(y)
≤ M − J(ϕ̃, ψ̃).

Hence ϕ − c_X ∈ L¹(µ), ψ − c_Y ∈ L¹(ν), from which it follows that ϕ ∈ L¹(µ), ψ ∈ L¹(ν).


The next result gives an upper bound on maximising sequences.
Lemma 3.9. Let µ ∈ P(X), ν ∈ P(Y), and c : X × Y → R. Assume that c(x, y) ≤
c_X(x) + c_Y(y) where c_X ∈ L¹(µ) and c_Y ∈ L¹(ν). Furthermore, assume that M given by (3.7)
satisfies M < ∞. Then there exists a sequence (ϕ_k, ψ_k) ∈ Φ_c such that J(ϕ_k, ψ_k) → sup_{Φ_c} J
and satisfying the bounds

ϕ_k(x) ≤ c_X(x) ∀x ∈ X, ∀k ∈ N
ψ_k(y) ≤ c_Y(y) ∀y ∈ Y, ∀k ∈ N.

Proof. Let (ϕ̃_k, ψ̃_k) ∈ Φ_c be a maximising sequence. Notice that since 0 ≤ sup_{Φ_c} J ≤
inf_{Π(µ,ν)} K ≤ M < ∞, the ϕ̃_k, ψ̃_k must be proper functions (in fact not equal to ±∞ anywhere).
Let (ϕ_k, ψ_k) = (ϕ̃_k^{cc} − a_k, ϕ̃_k^c + a_k) where we choose

a_k = inf_{y∈Y} (c_Y(y) − ϕ̃_k^c(y)).

By Lemma 3.8, (ϕ_k, ψ_k) ∈ Φ_c and (ϕ_k, ψ_k) is a maximising sequence once we have shown that
ϕ_k ≤ c_X and ψ_k ≤ c_Y.
We start by showing a_k ∈ (−∞, +∞). Since (ϕ̃_k, ψ̃_k) ∈ Φ_c then ϕ̃_k(x) ≤ c(x, y) − ψ̃_k(y)
for all y ∈ Y. Hence there exist y_0 ∈ Y and b_0 ∈ R (possibly depending on k) such that
ϕ̃_k(x) ≤ c(x, y_0) + b_0. Then,

ϕ̃_k^c(y_0) = inf_{x∈X} (c(x, y_0) − ϕ̃_k(x)) ≥ −b_0.

Hence, a_k ≤ c_Y(y_0) − ϕ̃_k^c(y_0) ≤ c_Y(y_0) + b_0 < ∞. We also have

c_Y(y) − ϕ̃_k^c(y) = sup_{x∈X} (c_Y(y) − c(x, y) + ϕ̃_k(x)) ≥ sup_{x∈X} (ϕ̃_k(x) − c_X(x)) ≥ ϕ̃_k(x_0) − c_X(x_0)

for any x_0 ∈ X. Hence, a_k ≥ ϕ̃_k(x_0) − c_X(x_0) > −∞. We have shown that a_k ∈ (−∞, +∞)
and the pair (ϕ_k, ψ_k) is well defined.
Clearly ψ_k(y) = ϕ̃_k^c(y) + a_k ≤ c_Y(y). And,

ϕ_k(x) − c_X(x) = inf_{y∈Y} (c(x, y) − ϕ̃_k^c(y) − a_k − c_X(x)) ≤ inf_{y∈Y} (c_Y(y) − ϕ̃_k^c(y) − a_k) = 0.

So (ϕ_k, ψ_k) satisfies the bounds stated in the lemma.


With the help of the preceding lemma we can prove Theorem 3.7.
Proof of Theorem 3.7. Note that inf_{π∈Π(µ,ν)} K(π) ≤ M < ∞ by Lemma 3.5. Let (ϕ_k, ψ_k) ∈ Φ_c be a maximising sequence as in Lemma 3.9. Define ϕ_k^{(ℓ)}, ψ_k^{(ℓ)} by

    ϕ_k^{(ℓ)}(x) = max{ ϕ_k(x) − c_X(x), −ℓ } + c_X(x),
    ψ_k^{(ℓ)}(y) = max{ ψ_k(y) − c_Y(y), −ℓ } + c_Y(y).

Note that ϕ_k ≤ ϕ_k^{(ℓ)}, ψ_k ≤ ψ_k^{(ℓ)},

    −ℓ ≤ ϕ_k^{(ℓ)}(x) − c_X(x) ≤ 0   ∀x ∈ X, ∀k ∈ N, ∀ℓ ∈ N,
    −ℓ ≤ ψ_k^{(ℓ)}(y) − c_Y(y) ≤ 0   ∀y ∈ Y, ∀k ∈ N, ∀ℓ ∈ N,
    ϕ_k^{(1)} ≥ ϕ_k^{(2)} ≥ …,
    ψ_k^{(1)} ≥ ψ_k^{(2)} ≥ …,

and

    ϕ_k^{(ℓ)}(x) + ψ_k^{(ℓ)}(y) ≤ max{ ϕ_k(x) − c_X(x) + ψ_k(y) − c_Y(y), −ℓ } + c_X(x) + c_Y(y)
    (3.8)                      ≤ max{ c(x, y) − c_X(x) − c_Y(y), −ℓ } + c_X(x) + c_Y(y).
For each ℓ the sequence ϕ_k^{(ℓ)} − c_X is bounded in L^∞(µ), so {ϕ_k^{(ℓ)} − c_X}_{k∈N} is weakly compact in L^p(µ) for any p ∈ (1, ∞) (for reflexive Banach spaces boundedness plus closure is equivalent to weak compactness). Choosing p = 2, after extracting a subsequence we have that ϕ_k^{(ℓ)} ⇀ ϕ^{(ℓ)} ∈ L^1(µ) (since µ is a probability measure, L^2(µ) ⊂ L^1(µ)). By a diagonalisation argument we can assume that ϕ_k^{(ℓ)} ⇀ ϕ^{(ℓ)} for all ℓ ∈ N. We can apply the same argument to ψ_k^{(ℓ)} to imply the existence of weak limits ψ^{(ℓ)} ∈ L^1(ν). We have that

    c_X ≥ ϕ^{(1)} ≥ ϕ^{(2)} ≥ …,
    c_Y ≥ ψ^{(1)} ≥ ψ^{(2)} ≥ ….

Since ϕ^{(ℓ)}, ψ^{(ℓ)} are bounded above by an L^1 function and monotonically decreasing we can apply the Monotone Convergence Theorem to infer

    lim_{ℓ→∞} ∫_X ϕ^{(ℓ)}(x) dµ(x) = ∫_X ϕ†(x) dµ(x),
    lim_{ℓ→∞} ∫_Y ψ^{(ℓ)}(y) dν(y) = ∫_Y ψ†(y) dν(y),

where ϕ†, ψ† are the pointwise limits of ϕ^{(ℓ)}, ψ^{(ℓ)}:

    ϕ†(x) = lim_{ℓ→∞} ϕ^{(ℓ)}(x),   ψ†(y) = lim_{ℓ→∞} ψ^{(ℓ)}(y).

The functions (ϕ†, ψ†) are our candidate maximisers. We are required to show that (ϕ†, ψ†) ∈ Φ_c and J(ϕ, ψ) ≤ J(ϕ†, ψ†) for all (ϕ, ψ) ∈ Φ_c.
Since sup_{Φ_c} J = lim_{k→∞} J(ϕ_k, ψ_k) ≤ lim_{k→∞} J(ϕ_k^{(ℓ)}, ψ_k^{(ℓ)}) = J(ϕ^{(ℓ)}, ψ^{(ℓ)}) for any ℓ ∈ N, then

    J(ϕ†, ψ†) = lim_{ℓ→∞} J(ϕ^{(ℓ)}, ψ^{(ℓ)}) ≥ sup_{Φ_c} J.

Hence (ϕ†, ψ†) maximises J.
It follows from taking ℓ → ∞ in (3.8) that ϕ†(x) + ψ†(y) ≤ c(x, y). Now integrability follows from

    0 ≥ ∫_X ( ϕ†(x) − c_X(x) ) dµ(x) + ∫_Y ( ψ†(y) − c_Y(y) ) dν(y) ≥ sup_{Φ_c} J − M.

In particular, since ϕ† − c_X ≤ 0 and ψ† − c_Y ≤ 0, it follows that both integrals are finite and ϕ† − c_X ∈ L^1(µ), ψ† − c_Y ∈ L^1(ν). Hence ϕ† ∈ L^1(µ) and ψ† ∈ L^1(ν).
For the furthermore part of the theorem we use the double c-transform trick as in the proof of Lemma 3.9. For any a ∈ R we have, by Lemma 3.8,

    J(ϕ†, ψ†) ≤ J((ϕ†)^{cc} − a, (ϕ†)^c + a) = J((ϕ†)^{cc}, (ϕ†)^c).

We only have to show ((ϕ†)^{cc}, (ϕ†)^c) ∈ L^1(µ) × L^1(ν). Let a = inf_{y∈Y} ( c_Y(y) − (ϕ†)^c(y) ); then a ∈ R for the same reasons that a_k ∈ R in the proof of Lemma 3.9. Clearly (ϕ†)^c(y) + a ≤ c_Y(y), and

    (ϕ†)^{cc}(x) − a = inf_{y∈Y} ( c(x, y) − (ϕ†)^c(y) − a ) ≤ inf_{y∈Y} ( c_X(x) + c_Y(y) − (ϕ†)^c(y) − a ) ≤ c_X(x).

Hence, ((ϕ†)^{cc} − a, (ϕ†)^c + a) ∈ L^1(µ) × L^1(ν) by Lemma 3.8. Trivially ((ϕ†)^{cc}, (ϕ†)^c) ∈ L^1(µ) × L^1(ν).
Chapter 4

Existence and Characterisation of Transport Maps

Our aim is to characterise the optimal transport plans that arise as minimisers of the Kantorovich optimal transportation problem and show sufficient conditions for the existence of optimal transport maps. In this chapter we will restrict ourselves to the cost function c(x, y) = (1/2)|x − y|^2. Generalisations to other cost functions are possible but with an additional notational burden.¹ The second restriction we make is to assume that X and Y are subsets of R^n.
The chapter is structured as follows. We first state our objectives and in particular the results we will prove, then give motivating explanations. We will require some results and definitions from convex analysis, which we give in Section 4.2, before finally proving the main theorems from the first section.

4.1 Knott-Smith Optimality and Brenier’s Theorem
Section references: Theorem 4.1, Theorem 4.2 and Corollary 4.3 form [15, Theorem 2.12].
We will give (1) a characterisation of optimal transport plans and (2) sufficient conditions for the existence of optimal transport maps (and when min_{π∈Π(µ,ν)} K(π) = min_{T_#µ=ν} M(T)).
We will restate the theorem in Section 4.3 with a change of notation (more precisely, we look at the equivalent problems sup_{π∈Π(µ,ν)} ∫_{X×Y} x · y dπ(x, y) and inf_{Φ̃} J, where Φ̃ is defined in Section 4.3).
The subdifferential ∂ϕ of a convex function ϕ is defined by

    ∂ϕ(x) := { y : ϕ(z) ≥ ϕ(x) + y · (z − x) ∀z ∈ R^n }.

We will review some convex analysis in Section 4.2 which will inform the definition. For now it is enough to know that the subdifferential is a generalisation of the differential which always exists for lower semi-continuous convex functions. Note that the subdifferential is in general a set; however, if ϕ is differentiable at x then ∂ϕ(x) = {∇ϕ(x)}. For example, for ϕ(x) = |x| on R we have ∂ϕ(0) = [−1, 1], while ∂ϕ(x) = {sign(x)} for x ≠ 0.
¹For example it is possible to show the existence of optimal transport maps for strictly convex and superlinear cost functions, and characterise them in terms of c-superdifferentials of c-convex functions.
Theorem 4.1 (Knott-Smith Optimality Criterion). Let µ ∈ P(X), ν ∈ P(Y) with X, Y ⊂ R^n and assume that µ, ν both have finite second moments. Then π ∈ Π(µ, ν) is a minimiser of Kantorovich’s optimal transport problem with cost c(x, y) = (1/2)|x − y|^2 if and only if there exists a convex lower semi-continuous function ϕ ∈ L^1(µ) such that supp(π) ⊆ Gra(∂ϕ), or equivalently y ∈ ∂ϕ(x) for π-almost every (x, y). Moreover the pair (ϕ, ϕ*) is a minimiser of the problem inf_{Φ̃} J(ϕ̃, ψ̃), where ϕ* is the convex conjugate defined in Definition 4.4, J is defined by (3.1), and Φ̃ is defined by

    Φ̃ = { (ϕ̃, ψ̃) ∈ L^1(µ) × L^1(ν) : ϕ̃(x) + ψ̃(y) ≥ x · y }.
Why do we expect the optimal plan to have support in the graph of the subdifferential of a convex function? Let us first consider the 1D case and a map T_#µ = ν. We should expect any map that is ‘optimal’ to be order preserving, i.e. if x_1 ≤ x_2 then T(x_1) ≤ T(x_2). This is equivalent to saying that T is non-decreasing.
Maps rule out the splitting of mass, since each x is sent to the single point T(x). However, if we let T be set-valued (i.e. we are considering plans instead of maps) then the increasing property in some sense should still hold. Let

    Γ = { (x, y) : x ∈ X and y ∈ T(x) }.

We can write the increasing property as: for any (x_1, y_1), (x_2, y_2) ∈ Γ with x_1 ≤ x_2 we have y_1 ≤ y_2. In convex analysis this property is called cyclical monotonicity. It can be shown that any cyclically monotone set can be written as the subgradient of a convex function (since any convex function has a non-decreasing derivative). Hence we expect any optimal plan to be supported in the subgradient of a convex function. This turns out to also be true in dimensions greater than one.
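The order preserving property is easy to observe numerically. For two empirical measures with n atoms each and cost c(x, y) = |x − y|^2, the optimal assignment pairs the points in sorted order; the sketch below (our own illustration, using scipy's linear assignment solver, which is not discussed in the notes) verifies this.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    rng = np.random.default_rng(1)
    x = rng.normal(size=8)
    y = rng.normal(loc=2.0, size=8)
    C = (x[:, None] - y[None, :]) ** 2     # quadratic cost matrix
    _, sigma = linear_sum_assignment(C)    # optimal permutation: x_i -> y_{sigma(i)}
    # order preserving: the i-th smallest x is matched to the i-th smallest y
    assert np.array_equal(sigma[np.argsort(x)], np.argsort(y))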
The next result specifically gives conditions sufficient for the existence of transport maps.
Theorem 4.2 (Brenier’s Theorem). Let µ ∈ P(X), ν ∈ P(Y) with X, Y ⊂ R^n and assume that µ, ν both have finite second moments and that µ does not give mass to small sets. Then there is a unique solution π† ∈ Π(µ, ν) to Kantorovich’s optimal transport problem with cost c(x, y) = (1/2)|x − y|^2, which is given by

    π† = (Id × ∇ϕ)_#µ,   or equivalently   dπ†(x, y) = dµ(x)δ(y = ∇ϕ(x)),

where ∇ϕ is the gradient of a convex function (defined µ-almost everywhere) that pushes µ forward to ν, i.e. (∇ϕ)_#µ = ν.
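As a simple illustration (a standard example, not worked through in these notes): if µ = N(m_0, σ_0^2) and ν = N(m_1, σ_1^2) are Gaussian measures on R with σ_0, σ_1 > 0, then µ gives no mass to small sets and the map T(x) = m_1 + (σ_1/σ_0)(x − m_0) satisfies T_#µ = ν. Moreover T = ∇ϕ for the convex function

    ϕ(x) = m_1 x + (σ_1/(2σ_0))(x − m_0)^2,

so Theorem 4.2 identifies T as the optimal map and π† = (Id × T)_#µ as the unique optimal plan.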
Any convex function is locally Lipschitz on the interior of its domain. It follows (from Rademacher’s theorem) that ϕ is almost everywhere (in the Lebesgue sense) differentiable on the interior of its domain. The fact that the set of non-differentiability is a small set, and µ gives zero mass to small sets, implies we can talk about the derivative of ϕ on the support of π as though it exists everywhere.
The following corollary is immediate from Brenier’s theorem.
Corollary 4.3. Under the assumptions of Theorem 4.2, ∇ϕ is the unique solution to the Monge transportation problem:

    (1/2) ∫_X |x − ∇ϕ(x)|^2 dµ(x) = inf_{T_#µ=ν} (1/2) ∫_X |x − T(x)|^2 dµ(x).
We sketch the proof of the corollary. From the inequality

    min_{π∈Π(µ,ν)} K(π) ≤ inf_{T : T_#µ=ν} M(T)

that was argued in Section 1.2, it is enough to show that T† = ∇ϕ satisfies M(T†) ≤ min_{Π(µ,ν)} K (that T†_#µ = ν is given in Theorem 4.2). Now, let π† ∈ Π(µ, ν) be as in Theorem 4.2; then

    M(T†) = (1/2) ∫_X |x − T†(x)|^2 dµ(x)
          = (1/2) ∫_{X×Y} |x − T†(x)|^2 dπ†(x, y)
          = (1/2) ∫_{X×Y} |x − y|^2 dπ†(x, y)
          = min_{π∈Π(µ,ν)} K(π)

since T†(x) = y for π†-almost every (x, y) ∈ X × Y. Uniqueness of T† follows from uniqueness of π†.
4.2 Preliminary Results from Convex Analysis

Section references: These results are a subset of the background in convex analysis given in [15]. In particular Proposition 4.5 is [15, Proposition 2.4] and Proposition 4.7 is [15, Proposition 2.5].
In order to characterise subgradients we will use the convex conjugate defined below. This
is essentially a special case of the Legendre-Fenchel transform we defined in Section 3.2. The
convex conjugate is also sometimes called the Legendre transform.
Definition 4.4. The convex conjugate of a proper function ϕ : R^n → R ∪ {+∞} is defined by

    ϕ*(y) = sup_{x∈R^n} ( x · y − ϕ(x) ).
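Numerically, on a bounded grid the supremum becomes a maximum over grid points; the sketch below (our own, in Python/numpy) computes a discrete convex conjugate and checks the biconjugate identity ϕ** = ϕ for a convex function (see Proposition 4.7 below).

    import numpy as np

    def conjugate(f, grid_in, grid_out):
        # returns f^*(y_j) = max_i ( x_i * y_j - f(x_i) ) for y_j in grid_out
        return np.max(grid_in[:, None] * grid_out[None, :] - f[:, None], axis=0)

    x = np.linspace(-3, 3, 601)
    phi = 0.5 * x ** 2                   # convex and lower semi-continuous
    phi_star = conjugate(phi, x, x)      # approximately 0.5 * y^2
    phi_bi = conjugate(phi_star, x, x)   # biconjugate phi^{**}
    assert np.max(np.abs(phi_bi - phi)) < 1e-2   # phi^{**} = phi up to grid error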
The following proposition characterises the subdifferential.
Proposition 4.5. Let ϕ be a proper, lower semi-continuous, convex function on R^n. Then for all x, y ∈ R^n

    x · y = ϕ(x) + ϕ*(y) ⇔ y ∈ ∂ϕ(x).

Proof. Since ϕ*(y) ≥ x · y − ϕ(x) for all x, y we have

    x · y = ϕ(x) + ϕ*(y) ⇔ x · y ≥ ϕ(x) + ϕ*(y)
                         ⇔ x · y ≥ ϕ(x) + y · z − ϕ(z) ∀z ∈ R^n
                         ⇔ ϕ(z) ≥ ϕ(x) + y · (z − x) ∀z ∈ R^n
                         ⇔ y ∈ ∂ϕ(x),

which proves the proposition.
In fact if ϕ is convex then ϕ is differentiable almost everywhere; hence we have that ∂ϕ(x) = {∇ϕ(x)} for almost every x.

Proposition 4.6. If ϕ : R^n → R ∪ {+∞} is convex then (1) ϕ is almost everywhere differentiable and (2) whenever ϕ is differentiable, ∂ϕ(x) = {∇ϕ(x)}.
Proof. Let x ∈ int(Dom(ϕ)) and δ* be such that B(x, δ*) ⊂ int(Dom(ϕ)). We show that ϕ is Lipschitz continuous on B(x, δ*/4). Then, by Rademacher’s theorem², ϕ is differentiable almost everywhere on B(x, δ*/4), and therefore differentiable almost everywhere on int(Dom(ϕ)). This will complete the proof of (1).
We show ϕ is Lipschitz on B(x, δ*/4) by first showing that ϕ is uniformly bounded on B(x, δ*/2). By the Minkowski-Carathéodory Theorem (Theorem 2.5) there exists {x_i}_{i=0}^n ⊂ ∂B(x, δ*) such that for all y ∈ B(x, δ*) there exists {λ_i}_{i=0}^n ⊂ [0, 1] with Σ_{i=0}^n λ_i = 1 and y = Σ_{i=0}^n λ_i x_i. So,

    ϕ(y) = ϕ( Σ_{i=0}^n λ_i x_i ) ≤ Σ_{i=0}^n λ_i ϕ(x_i) ≤ max_{i=0,…,n} |ϕ(x_i)|.
Now for y ∈ B(x, δ*) and y′ = x − (y − x) = 2x − y we have y′ ∈ B(x, δ*) and x = (1/2)y′ + (1/2)y. Therefore ϕ(x) ≤ (1/2)ϕ(y′) + (1/2)ϕ(y). In particular,

    ϕ(y) ≥ 2ϕ(x) − ϕ(y′) ≥ 2ϕ(x) − max_{i=0,…,n} |ϕ(x_i)|.

We have shown that

    2ϕ(x) − max_{i=0,…,n} |ϕ(x_i)| ≤ ϕ(y) ≤ max_{i=0,…,n} |ϕ(x_i)|   ∀y ∈ B(x, δ*).

Hence

    ‖ϕ‖_{L^∞(B(x,δ*/2))} ≤ max{ max_{i=0,…,n} |ϕ(x_i)| − 2ϕ(x), max_{i=0,…,n} |ϕ(x_i)| } < ∞.
To show ϕ is Lipschitz on B(x, δ*/4) let x_1, x_2 ∈ B(x, δ*/4) (x_1 ≠ x_2) and take x_3 to be the point of intersection of the line through x_1 and x_2 with ∂B(x, δ*/2); there are two possibilities for x_3, and we choose the option where x_2 lies between x_1 and x_3. Let λ = |x_2 − x_3| / |x_1 − x_3| ∈ (0, 1). Now,

    λx_1 + (1 − λ)x_3 = λx_2 + λ(x_1 − x_2) + (1 − λ)x_2 + (1 − λ)(x_3 − x_2)
                      = x_2 + ( |x_3 − x_2|(x_1 − x_2) + (|x_3 − x_1| − |x_3 − x_2|)(x_3 − x_2) ) / |x_3 − x_1|
                      = x_2 + ( |x_3 − x_2|(x_1 − x_2) + |x_2 − x_1|(x_3 − x_2) ) / |x_3 − x_1|
                      = x_2

since (x_2 − x_1)/|x_2 − x_1| = (x_3 − x_2)/|x_3 − x_2|. So by convexity of ϕ,

    ϕ(x_2) − ϕ(x_1) ≤ (1 − λ)( ϕ(x_3) − ϕ(x_1) )
                    = ( (|x_1 − x_3| − |x_2 − x_3|) / |x_1 − x_3| )( ϕ(x_3) − ϕ(x_1) )
                    ≤ 8M|x_1 − x_2| / δ*

where M = ‖ϕ‖_{L^∞(B(x,δ*/2))} and we use that |x_1 − x_3| ≥ δ*/4. Switching x_1 and x_2 implies that |ϕ(x_2) − ϕ(x_1)| ≤ 8M|x_1 − x_2|/δ*; hence ϕ is Lipschitz continuous, with constant L = 8M/δ*, in B(x, δ*/4).

²Rademacher’s theorem: if U ⊂ R^n is open and f : U → R is Lipschitz continuous then f is differentiable almost everywhere on U.
For (2) let ϕ be differentiable at x. Then,

    ϕ(x) + ∇ϕ(x) · (z − x) = ϕ(x) + lim_{h→0+} ( ϕ(x + (z − x)h) − ϕ(x) ) / h
                           = ϕ(x) + lim_{h→0+} ( ϕ((1 − h)x + hz) − ϕ(x) ) / h
                           ≤ ϕ(x) + lim_{h→0+} ( (1 − h)ϕ(x) + hϕ(z) − ϕ(x) ) / h
                           = ϕ(z).

Hence ∇ϕ(x) ∈ ∂ϕ(x). Now if y ∈ ∂ϕ(x) then

    ϕ(x) + y · (z − x) ≤ ϕ(z)

for all z ∈ R^n. Let z = x + hw; then we can infer that

    y · w ≤ ( ϕ(x + hw) − ϕ(x) ) / h

for all h > 0 and w ∈ R^n. Letting h → 0+ we have y · w ≤ ∇ϕ(x) · w for all w ∈ R^n. Substituting w ↦ −w we have y · w = ∇ϕ(x) · w for all w ∈ R^n. Hence y = ∇ϕ(x).
Our final preliminary result gives equivalent conditions for convexity and lower semi-continuity.

Proposition 4.7. Let ϕ : R^n → R ∪ {+∞} be proper. Then the following are equivalent:

1. ϕ is convex and lower semi-continuous;

2. ϕ = ψ* for some proper function ψ;

3. ϕ** = ϕ.
Proof. Clearly 3 implies 2. We first show that 2 implies 1. Let ϕ = ψ*; we will show that ϕ is convex and lower semi-continuous. For convexity let x_1, x_2 ∈ R^n, t ∈ [0, 1]; then

    ϕ(tx_1 + (1 − t)x_2) = ψ*(tx_1 + (1 − t)x_2)
                         = sup_{y∈R^n} ( (tx_1 + (1 − t)x_2) · y − ψ(y) )
                         ≤ sup_{y∈R^n} ( tx_1 · y − tψ(y) ) + sup_{y∈R^n} ( (1 − t)x_2 · y − (1 − t)ψ(y) )
                         = tψ*(x_1) + (1 − t)ψ*(x_2)
                         = tϕ(x_1) + (1 − t)ϕ(x_2).

For lower semi-continuity let x_m → x; then

    lim inf_{m→∞} ϕ(x_m) = lim inf_{m→∞} sup_{y∈R^n} ( x_m · y − ψ(y) ) ≥ lim_{m→∞} ( x_m · y − ψ(y) ) = x · y − ψ(y)

for any y ∈ R^n. Taking the supremum over y ∈ R^n implies lim inf_{m→∞} ϕ(x_m) ≥ ϕ(x) as required.
Finally we show that 1 implies 3. Let ϕ be lower semi-continuous and convex. Fix x ∈ R^n; we want to show that ϕ(x) = ϕ**(x). Since ϕ*(y) ≥ x · y − ϕ(x) for all y ∈ R^n, then

    ϕ(x) ≥ sup_{y∈R^n} ( x · y − ϕ*(y) ) = ϕ**(x).

We are left to show ϕ(x) ≤ ϕ**(x).
Let x ∈ int(Dom(ϕ)); then, since ϕ can be bounded below by an affine function passing through ϕ(x) (since ϕ is convex), we have ∂ϕ(x) ≠ ∅. Let y_0 ∈ ∂ϕ(x). By Proposition 4.5, x · y_0 = ϕ(x) + ϕ*(y_0); then

    ϕ(x) = x · y_0 − ϕ*(y_0) ≤ sup_{y∈R^n} ( x · y − ϕ*(y) ) = ϕ**(x).

Hence we have proved the proposition for any ϕ with int(Dom(ϕ)) = R^n.
For all other ϕ we define ψ_ε(x) = |x|^2/ε and

    ϕ_ε(x) = inf_{y∈R^n} ( ϕ(x − y) + ψ_ε(y) ) = inf_{y∈R^n} ( ϕ(y) + ψ_ε(x − y) ).

In order to show ϕ_ε = ϕ_ε** on R^n it is enough to show that ϕ_ε is convex, lower semi-continuous and int(Dom(ϕ_ε)) = R^n. For convexity we note,

    ϕ_ε(tx_1 + (1 − t)x_2) = inf_{y∈R^n} ( ϕ(tx_1 + (1 − t)x_2 − y) + ψ_ε(y) )
                           = inf_{y_1,y_2∈R^n} ( ϕ(t(x_1 − y_1) + (1 − t)(x_2 − y_2)) + ψ_ε(ty_1 + (1 − t)y_2) )
                           ≤ inf_{y_1,y_2∈R^n} ( t(ϕ(x_1 − y_1) + ψ_ε(y_1)) + (1 − t)(ϕ(x_2 − y_2) + ψ_ε(y_2)) )
                           = tϕ_ε(x_1) + (1 − t)ϕ_ε(x_2).
The pointwise limit of an arbitrary collection of lower semi-continuous functions is lower semi-continuous, hence ϕ_ε is lower semi-continuous. Now let x ∈ R^n; since ϕ is proper there exists y_0 ∈ R^n such that ϕ(y_0) < ∞, hence ϕ_ε(x) ≤ ϕ(y_0) + ψ_ε(x − y_0). Since ψ_ε is everywhere finite it follows that ϕ_ε(x) is finite, hence x ∈ Dom(ϕ_ε). In particular int(Dom(ϕ_ε)) = R^n.
We now show that lim inf_{ε→0} ϕ_ε(x) ≥ ϕ(x) (we actually show that lim_{ε→0} ϕ_ε(x) = ϕ(x), but the former statement is all we really need). Fix x ∈ R^n and note that, by the above, ϕ_ε(x) ≤ A + B/ε where A = ϕ(y_0) and B = |x − y_0|^2. Since ϕ is convex it is bounded below by an affine function, say ϕ(z) ≥ a · z + b for all z ∈ R^n. Let y_ε be a minimising sequence, i.e. ϕ_ε(x) ≥ ϕ(x − y_ε) + ψ_ε(y_ε) − ε. Then

    A + B/ε ≥ ϕ_ε(x) ≥ a · (x − y_ε) + b + |y_ε|^2/ε − ε ≥ b − (1 + ε)|a|^2/ε − |x|^2/2 + |y_ε|^2/(2ε) − ε.

This implies |y_ε| = O(1).
Now let ε_n → 0 be a subsequence with lim inf_{ε→0} ϕ_ε(x) = lim_{n→∞} ϕ_{ε_n}(x). Since y_{ε_n} is bounded there exists a further subsequence (which we relabel) and some y ∈ R^n such that y_{ε_n} → y. Furthermore,

    lim_{n→∞} ϕ_{ε_n}(x) ≥ lim inf_{n→∞} ( ϕ(x − y_{ε_n}) + ψ_{ε_n}(y_{ε_n}) − ε_n ) ≥ { ϕ(x) if y = 0;  +∞ else }.

In both cases the right hand side is at least ϕ(x), hence lim_{n→∞} ϕ_{ε_n}(x) ≥ ϕ(x).
On the other hand, since ϕ(x) ≥ ϕ_ε(x),

    ϕ**(x) = sup_{y∈R^n} inf_{z∈R^n} ( y · (x − z) + ϕ(z) ) ≥ sup_{y∈R^n} inf_{z∈R^n} ( y · (x − z) + ϕ_ε(z) ) = ϕ_ε**(x).

Hence,

    ϕ**(x) ≥ lim inf_{ε→0} ϕ_ε**(x) = lim inf_{ε→0} ϕ_ε(x) ≥ ϕ(x),

which concludes the proof.
4.3 Proof of the Knott-Smith Optimality Criterion

Section references: The proof of the Knott-Smith optimality condition is based on Villani’s proof in [15, Theorem 2.12].
Before proving the theorem we manipulate the Kantorovich optimal transport problem and the dual problem. Let (ϕ, ψ) ∈ Φ_c (where c(x, y) = (1/2)|x − y|^2) and define ϕ̃(x) = (1/2)|x|^2 − ϕ(x), ψ̃(y) = (1/2)|y|^2 − ψ(y). Clearly ϕ̃ ∈ L^1(µ), ψ̃ ∈ L^1(ν) whenever µ, ν have finite second moments. Furthermore

    ϕ̃(x) + ψ̃(y) = (1/2)|x|^2 + (1/2)|y|^2 − ϕ(x) − ψ(y) ≥ (1/2)|x|^2 + (1/2)|y|^2 − (1/2)|x − y|^2 = x · y.

In fact, (ϕ, ψ) ∈ Φ_c ⇔ (ϕ̃, ψ̃) ∈ Φ̃ where

    Φ̃ = { (ϕ̃, ψ̃) ∈ L^1(µ) × L^1(ν) : ϕ̃(x) + ψ̃(y) ≥ x · y }.
Furthermore J(ϕ̃, ψ̃) = M − J(ϕ, ψ) where

    M = (1/2) ∫_X |x|^2 dµ(x) + (1/2) ∫_Y |y|^2 dν(y).

And for π ∈ Π(µ, ν),

    K(π) = (1/2) ∫_{X×Y} |x − y|^2 dπ(x, y) = M − ∫_{X×Y} x · y dπ(x, y).

Hence,

    M − J(ϕ̃, ψ̃) = J(ϕ, ψ) ≤ K(π) = M − ∫_{X×Y} x · y dπ(x, y),

or more conveniently,

    J(ϕ̃, ψ̃) ≥ ∫_{X×Y} x · y dπ(x, y).

Kantorovich duality (Theorem 3.1) implies that

    (4.1)   min_{(ϕ̃,ψ̃)∈Φ̃} J(ϕ̃, ψ̃) = max_{π∈Π(µ,ν)} ∫_{X×Y} x · y dπ(x, y).
We notice that if π† ∈ Π(µ, ν) minimises K then it also maximises ∫_{X×Y} x · y dπ(x, y), and vice versa. On the other hand, if (ϕ, ψ) ∈ Φ_c maximises J, then (ϕ̃, ψ̃) = ( (1/2)|·|^2 − ϕ, (1/2)|·|^2 − ψ ) ∈ Φ̃ minimises J, and vice versa.
Existence of maximisers of J over Φ_c implies there exists ϕ ∈ L^1(µ) such that (ϕ̃, ψ̃) = ( (1/2)|·|^2 − ϕ, (1/2)|·|^2 − ϕ^c ) ∈ Φ̃ and (ϕ̃, ψ̃) minimises J over Φ̃. Furthermore,
    ψ̃(y) = (1/2)|y|^2 − ϕ^c(y)
          = sup_{x∈X} ( (1/2)|y|^2 − (1/2)|x − y|^2 + ϕ(x) )
          = sup_{x∈X} ( x · y − ϕ̃(x) )
          = ϕ̃*(y),

where ϕ̃* is the convex conjugate of ϕ̃. We also have,

    ϕ̃(x) = (1/2)|x|^2 − ϕ^{cc}(x)
          = sup_{y∈Y} ( (1/2)|x|^2 − (1/2)|x − y|^2 + ϕ^c(y) )
          = sup_{y∈Y} ( (1/2)|x|^2 − (1/2)|x − y|^2 + (1/2)|y|^2 − ϕ̃*(y) )
          = sup_{y∈Y} ( x · y − ϕ̃*(y) )
          = ϕ̃**(x).
Hence minimisers of J over Φ̃ take the form (ϕ̃**, ϕ̃*).
Let η̃ = ϕ̃**; then by Proposition 4.7, η̃ is convex and lower semi-continuous. Furthermore, again by Proposition 4.7, η̃* = ϕ̃*** = ϕ̃*. Hence there exist minimisers of J over Φ̃ of the form (η̃, η̃*) where η̃ is a proper, convex and lower semi-continuous function.
Proof of Theorem 4.1. Let π† ∈ Π(µ, ν) minimise K over Π(µ, ν) and let ϕ̃ be the proper lower semi-continuous function such that the pair (ϕ̃, ϕ̃*) minimises J over Φ̃. By Kantorovich duality (in particular (4.1)) we have

    ∫_X ϕ̃(x) dµ(x) + ∫_Y ϕ̃*(y) dν(y) = ∫_{X×Y} x · y dπ†(x, y).

Equivalently,

    ∫_{X×Y} ( ϕ̃(x) + ϕ̃*(y) − x · y ) dπ†(x, y) = 0.

By definition of the convex conjugate ϕ̃(x) + ϕ̃*(y) ≥ x · y, and therefore the integrand is non-negative. We must have ϕ̃(x) + ϕ̃*(y) = x · y for π†-almost every (x, y), and therefore by Proposition 4.5, y ∈ ∂ϕ̃(x) for π†-almost every (x, y).
Conversely, suppose y ∈ ∂ϕ̃(x) for π†-almost every (x, y), where ϕ̃ ∈ L^1(µ) is a proper, lower semi-continuous and convex function. Then by Proposition 4.5,

    ∫_{X×Y} ( ϕ̃(x) + ϕ̃*(y) − x · y ) dπ†(x, y) = 0.

Notice that by definition of the Legendre-Fenchel transform we have that ϕ̃(x) + ϕ̃*(y) ≥ x · y. We will show integrability of ϕ̃* shortly; for now it is assumed, so that (ϕ̃, ϕ̃*) ∈ Φ̃. Hence,

    min_{Φ̃} J ≤ J(ϕ̃, ϕ̃*) = ∫_{X×Y} x · y dπ†(x, y) ≤ max_{π∈Π(µ,ν)} ∫_{X×Y} x · y dπ(x, y).

By duality (i.e. (4.1)) it follows that (ϕ̃, ϕ̃*) ∈ Φ̃ achieves the minimum of J and π† achieves the maximum of ∫_{X×Y} x · y dπ(x, y) in Π(µ, ν). Hence π† is an optimal plan in the Kantorovich sense.
The last detail we have to show is ϕ̃* ∈ L^1(ν). Since ϕ̃ is convex, ϕ̃* can be bounded below by an affine function; in particular there exists x_0 ∈ X such that ϕ̃*(y) ≥ x_0 · y − ϕ̃(x_0) ≥ x_0 · y − b_0 =: f(y). So the integral

    ‖ϕ̃* − f‖_{L^1(ν)} = ∫_Y ( ϕ̃*(y) − f(y) ) dν(y) ≤ J(ϕ̃, ϕ̃*) + ‖ϕ̃‖_{L^1(µ)} + (1/2)|x_0|^2 + (1/2) ∫_Y |y|^2 dν(y) + b_0

is finite. Hence ϕ̃* − f ∈ L^1(ν), and since f ∈ L^1(ν) then ϕ̃* ∈ L^1(ν) as required.
4.4 Proof of Brenier’s Theorem

Section references: The proof of Brenier’s theorem is based on Villani’s proof in [15, Theorem 2.12].
Proof of Theorem 4.2. Let π† be a minimiser of Kantorovich’s optimal transport problem. If we write (by disintegration of measures)

    π†(A × B) = ∫_A π†(B|x) dµ(x)

for some family {π†(·|x)}_{x∈X} ⊂ P(Y), then supp(π†(·|x)) ⊆ ∂ϕ(x) for µ-a.e. x ∈ X by Theorem 4.1. By Proposition 4.6, ∂ϕ(x) = {∇ϕ(x)} for L-a.e. x ∈ X (and therefore µ-a.e. x ∈ X). Hence supp(π†(·|x)) ⊆ {∇ϕ(x)} for µ-a.e. x ∈ X. This implies π†(·|x) = δ_{∇ϕ(x)} for µ-a.e. x ∈ X. We have shown that there exists an optimal π† that can be written as

    π† = (Id × ∇ϕ)_#µ,

and

    ν(B) = π†(R^n × B)
         = (Id × ∇ϕ)_#µ(R^n × B)
         = µ( (Id × ∇ϕ)^{−1}(R^n × B) )
         = µ( { x : (Id × ∇ϕ)(x) ∈ R^n × B } )
         = µ( { x : ∇ϕ(x) ∈ B } )
         = (∇ϕ)_#µ(B),

so we also have that (∇ϕ)_#µ = ν.
We are left to show uniqueness. Assume ϕ̄ is another convex function with (∇ϕ̄)_#µ = ν. We will show that ∇ϕ = ∇ϕ̄ up to µ-null sets.
By Theorem 4.1 we know that (Id × ∇ϕ̄)_#µ is an optimal transport plan and the pair (ϕ̄, ϕ̄*) minimises J over Φ̃. So,

    ∫_X ϕ̄ dµ + ∫_Y ϕ̄* dν = ∫_X ϕ dµ + ∫_Y ϕ* dν.
The above implies that

    ∫_{X×Y} ( ϕ̄(x) + ϕ̄*(y) ) dπ†(x, y) = ∫_{X×Y} ( ϕ(x) + ϕ*(y) ) dπ†(x, y)
                                        = ∫_{X×Y} x · y dπ†(x, y)
                                        = ∫_{X×Y} x · y d(Id × ∇ϕ)_#µ(x, y)
                                        = ∫_X x · ∇ϕ(x) dµ(x)

where the second line follows as y ∈ ∂ϕ(x) for π†-a.e. (x, y) and by Proposition 4.5. Also,

    ∫_{X×Y} ( ϕ̄(x) + ϕ̄*(y) ) dπ†(x, y) = ∫_X ( ϕ̄(x) + ϕ̄*(∇ϕ(x)) ) dµ(x).
Hence

    ∫_X ( ϕ̄(x) + ϕ̄*(∇ϕ(x)) − x · ∇ϕ(x) ) dµ(x) = 0.

In particular, ϕ̄(x) + ϕ̄*(∇ϕ(x)) − x · ∇ϕ(x) = 0 for µ-almost every x. By Proposition 4.5 this implies that ∇ϕ(x) ∈ ∂ϕ̄(x) for µ-almost every x, and therefore ∇ϕ(x) = ∇ϕ̄(x) for µ-almost every x.
Chapter 5

Wasserstein Distances
Eulerian based costs, such as L^p, define a metric based on "pointwise differences". This has some notable disadvantages. For example, consider in 1D the two indicator functions f(x) = χ_{[0,1]}(x) and f_δ(x) = χ_{[δ,δ+1]}(x). Notice that in L^p,

    ‖f − f_δ‖_{L^p}^p = 2|δ| if |δ| < 1,   and   ‖f − f_δ‖_{L^p}^p = 2 otherwise.

In particular, we notice that once |δ| ≥ 1 the L^p distance is constant. In more general examples, where f and f_δ are not necessarily indicator functions, the L^p distance will be the sum of the L^p norms whenever the supports of f and f_δ are disjoint.
Why do we care? Say we are trying to fit a parametrised curve f_δ to f, and say we start from a bad initialisation where the support of f_δ is disjoint from the support of f. In this regime the derivative (d/dδ)‖f_δ − f‖_{L^p} = 0, which is a problem for gradient based optimisation.
On the other hand, we would hope that a transport based distance would do a better job. In particular, in the elementary example f(x) = χ_{[0,1]}(x) and f_δ(x) = χ_{[δ,δ+1]}(x) the OT cost would be

    min_{T_#f=f_δ} ∫_0^1 |x − T(x)|^p dx = |δ|^p,

where the cost is c(x, y) = |x − y|^p and, with an abuse of notation, we associate f and f_δ with the measures with densities f and f_δ respectively. Note that the OT cost now strictly increases as a function of |δ|.
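This behaviour is simple to check numerically. In the sketch below (our own, in Python; scipy.stats.wasserstein_distance computes the 1-Wasserstein distance between one-dimensional distributions), the L^1 distance saturates at 2 once |δ| ≥ 1, while the transport distance keeps growing linearly.

    import numpy as np
    from scipy.stats import wasserstein_distance

    x = np.linspace(-1.0, 5.0, 6001)                 # common grid for the L^1 norm
    pts = np.linspace(0.0, 1.0, 2001)                # points representing Unif[0, 1]
    f = ((x >= 0) & (x <= 1)).astype(float)
    for delta in [0.25, 0.5, 1.0, 2.0, 3.0]:
        fd = ((x >= delta) & (x <= delta + 1)).astype(float)
        L1 = np.sum(np.abs(f - fd)) * (x[1] - x[0])  # saturates at 2 for delta >= 1
        W1 = wasserstein_distance(pts, pts + delta)  # equals delta: keeps growing
        print(f"delta={delta:4.2f}  L1={L1:5.3f}  W1={W1:5.3f}")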
The objective of this chapter is to understand how optimal transport can be used to define a metric, and some of the metric properties. In particular, we will define the Wasserstein distance (also sometimes known as the earth mover’s distance) in the next section. In Section 5.2 we look at the topology of Wasserstein spaces and show that the Wasserstein distance metrizes weak* convergence. Finally we will look at geodesics and the relation to fluid dynamics.
Throughout this chapter we will assume that c(x, y) = |x − y|^p for p ∈ [1, +∞) and X, Y are subsets of R^d.
Before proceeding to the Wasserstein distance, let us note one other important example that can be posed as an optimal transport problem. Let c(x, y) = I_{x≠y}, i.e. c(x, y) = 0 if x = y and c(x, y) = 1 otherwise. Then the optimal transport problem coincides with the total variation distance between measures.
Proposition 5.1. Let µ, ν ∈ P(X) where X is a Polish space and c(x, y) = I_{x≠y}; then

    inf_{π∈Π(µ,ν)} K(π) = (1/2)‖µ − ν‖_{TV}

where

    ‖µ‖_{TV} := 2 sup_A |µ(A)|.
Proof. We prove the proposition in 4 steps. In the first two steps we only assume that c is a metric; in particular we use that c is lower semi-continuous, symmetric and satisfies the triangle inequality. In the third step we use the specific form of c, but this can also be avoided, see Remark 5.2.

1. Let ϕ ∈ L^1(µ); we claim that |ϕ^c(x) − ϕ^c(y)| ≤ c(x, y) for almost every (x, y) ∈ X × X. Indeed,

    ϕ^c(x) − ϕ^c(y) = inf_{z_1∈X} sup_{z_2∈X} ( c(x, z_1) − c(y, z_2) − ϕ(z_1) + ϕ(z_2) )
                    ≤ sup_{z_2∈X} ( c(x, z_2) − c(y, z_2) )            choosing z_1 = z_2
                    ≤ sup_{z_2∈X} ( c(x, y) + c(y, z_2) − c(y, z_2) )  by the triangle inequality
                    ≤ c(x, y).

By switching x and y we have that |ϕ^c(x) − ϕ^c(y)| ≤ c(x, y).
2. We claim ϕ^{cc} = −ϕ^c. By part 1 we have c(x, y) − ϕ^c(y) ≥ −ϕ^c(x) and therefore

    ϕ^{cc}(x) = inf_{y∈X} ( c(x, y) − ϕ^c(y) ) ≥ −ϕ^c(x).

On the other hand,

    ϕ^{cc}(x) = inf_{y∈X} ( c(x, y) − ϕ^c(y) ) ≤ −ϕ^c(x)

by choosing y = x. Hence ϕ^{cc} = −ϕ^c as claimed.
3. Since |η^c(x) − η^c(y)| ≤ 1 we can, without loss of generality, assume that η^c(x) ∈ [0, 1] in the following. By Theorem 3.1 and Theorem 3.7,

    min_{π∈Π(µ,ν)} K(π) = sup_{η∈L^1(µ)} J(η^{cc}, η^c) = sup_{η∈L^1(µ)} J(−η^c, η^c)
                        ≤ sup_{0≤f≤1} J(−f, f) ≤ sup_{Φ_c} J(ϕ, ψ) = min_{π∈Π(µ,ν)} K(π),

where the penultimate inequality follows as (−f, f) ∈ Φ_c; to see this we only need to show −f(x) + f(y) ≤ c(x, y). Clearly −f(x) + f(y) ≤ 1 = c(x, y) for x ≠ y and −f(x) + f(y) = 0 = c(x, y) for x = y. It follows that

    min_{π∈Π(µ,ν)} K(π) = sup_{0≤f≤1} J(−f, f).
4. Now let ν − µ = (ν − µ)^+ − (ν − µ)^− be the Jordan decomposition of ν − µ, where (ν − µ)^± ∈ M_+(X) are mutually singular. It follows that

    ‖µ − ν‖_{TV} = 2(ν − µ)^+(X).

And,

    sup_{0≤f≤1} J(−f, f) = sup_{0≤f≤1} ∫_X f d(ν − µ) = (ν − µ)^+(X).

Hence min_{π∈Π(µ,ν)} K(π) = sup_{0≤f≤1} J(−f, f) = (1/2)‖µ − ν‖_{TV}.
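On a finite space Proposition 5.1 can be verified directly by solving the Kantorovich problem as a linear program. A minimal sketch (our own, using scipy.optimize.linprog, which does not appear in the notes):

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(2)
    n = 6
    mu = rng.random(n); mu /= mu.sum()
    nu = rng.random(n); nu /= nu.sum()
    C = 1.0 - np.eye(n)                       # c(x, y) = 1 if x != y, 0 if x = y

    # equality constraints: row sums of pi equal mu, column sums equal nu
    A_eq = np.zeros((2 * n, n * n))
    for i in range(n):
        A_eq[i, i * n:(i + 1) * n] = 1.0      # sum_j pi[i, j] = mu[i]
        A_eq[n + i, i::n] = 1.0               # sum_k pi[k, i] = nu[i]
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([mu, nu]),
                  bounds=(0, None))
    assert abs(res.fun - 0.5 * np.abs(mu - nu).sum()) < 1e-8  # = (1/2)||mu - nu||_TV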
Remark 5.2. In step 3 of the above proof we showed that

    min_{π∈Π(µ,ν)} K(π) = sup_{(ϕ,ψ)∈Φ_c} J(ϕ, ψ) = sup_{0≤f≤1} ∫_X f d(ν − µ).

This is actually a special case of the Kantorovich-Rubinstein Theorem (see [15, Theorem 1.14]), which states that, when c is a metric,

    min_{π∈Π(µ,ν)} K(π) = sup { ∫_X ϕ d(µ − ν) : ϕ ∈ L^1(|µ − ν|), ‖ϕ‖_{Lip} ≤ 1 }

where

    ‖ϕ‖_{Lip} = sup_{x≠y} |ϕ(x) − ϕ(y)| / c(x, y).
5.1 Wasserstein Distances

Section references: The proof that the Wasserstein distance is a metric, Proposition 5.4, comes from [12, Proposition 5.1 and Lemma 5.4], with the preliminary result, Lemma 5.5, coming from [12, Lemma 5.5]. The equivalence of Wasserstein norms is from [12, Section 5.1].
We will work on the space of probability measures on X ⊂ R^d with bounded pth moment, i.e.

    P_p(X) := { µ ∈ P(X) : ∫_X |x|^p dµ(x) < +∞ }.

Of course, if X is bounded then P_p(X) = P(X). We now define the Wasserstein distance; it will be the objective of this section to prove that the Wasserstein distance is a metric.
Definition 5.3. Let µ, ν ∈ P_p(X); then the Wasserstein distance is defined as

    d_{W^p}(µ, ν) = ( min_{π∈Π(µ,ν)} ∫_{X×X} |x − y|^p dπ(x, y) )^{1/p}.
The Wasserstein distance is the pth root of the minimum of the Kantorovich optimal transport problem for cost function c(x, y) = |x − y|^p. The motivation is that this cost resembles an L^p distance (in fact we use properties of L^p distances to prove the triangle inequality). One could also consider an analogous distance for cost function c(x, y) = d(x, y) where d is a metric on X. This type of distance is known as the earth mover’s distance. Notice that when p = 1 the Wasserstein distance is also an earth mover’s distance. We will not focus on earth mover’s distances here.
Let us note here that µ, ν ∈ P_p(X) is enough to guarantee d_{W^p}(µ, ν) < +∞. In particular, since |x − y|^p ≤ 2^{p−1}(|x|^p + |y|^p),

    d_{W^p}^p(µ, ν) ≤ 2^{p−1} inf_{π∈Π(µ,ν)} ∫_{X×X} ( |x|^p + |y|^p ) dπ(x, y) = 2^{p−1} ∫_X |x|^p dµ(x) + 2^{p−1} ∫_X |y|^p dν(y).
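In one dimension the Wasserstein distance has a well-known closed form via quantile functions, d_{W^p}^p(µ, ν) = ∫_0^1 |F_µ^{−1}(t) − F_ν^{−1}(t)|^p dt (standard, though not proved in these notes). For empirical measures with n equally weighted atoms this reduces to matching sorted samples, as in the sketch below (our own):

    import numpy as np

    def wasserstein_1d(x, y, p):
        # W_p between (1/n) sum_i delta_{x_i} and (1/n) sum_i delta_{y_i}:
        # match the sorted samples (the monotone coupling is optimal in 1D)
        return np.mean(np.abs(np.sort(x) - np.sort(y)) ** p) ** (1.0 / p)

    rng = np.random.default_rng(3)
    x = rng.normal(size=2000)             # samples from N(0, 1)
    y = rng.normal(loc=1.5, size=2000)    # samples from N(1.5, 1)
    print(wasserstein_1d(x, y, 1))        # approximately 1.5
    print(wasserstein_1d(x, y, 2))        # also approximately 1.5 (a pure shift)

Note that the p = 2 value is at least the p = 1 value, consistent with Proposition 5.6 below.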
We now state the result that d_{W^p} is a metric. The proof, minus the triangle inequality, is given below.

Proposition 5.4. The distance d_{W^p} : P_p(X) × P_p(X) → [0, ∞) is a metric on P_p(X).
Proof. We give the proof of all the required criteria with the exception of the triangle inequality, which will require some preliminary results. Firstly, it is clear that d_{W^p}(µ, ν) ≥ 0 for all µ, ν ∈ P(X); and by symmetry of the cost function c(x, y) = |x − y|^p, together with π ∈ Π(µ, ν) ⇔ S_#π ∈ Π(ν, µ) where S(x, y) = (y, x), we have symmetry of d_{W^p}. Now if µ = ν then we can take dπ(x, y) = δ_x(y) dµ(x) so that

    d_{W^p}^p(µ, ν) ≤ ∫_{X×X} |x − y|^p dπ(x, y) = 0

as x = y π-almost everywhere. Conversely, if d_{W^p}(µ, ν) = 0 then there exists π ∈ Π(µ, ν) such that x = y π-almost everywhere. Hence for any test function f : X → R,

    ∫_X f(x) dµ(x) = ∫_{X×X} f(x) dπ(x, y) = ∫_{X×X} f(y) dπ(x, y) = ∫_X f(y) dν(y).

As this holds for all test functions f, then µ = ν.
The following lemma is known as the gluing lemma and we will use it to "glue" two transport plans π_1 ∈ Π(µ, ν) and π_2 ∈ Π(ν, ω). The triangle inequality then follows from the triangle inequality for L^p distances.

Lemma 5.5. Given measures µ ∈ P(X), ν ∈ P(Y), ω ∈ P(Z) and transport plans π_1 ∈ Π(µ, ν) and π_2 ∈ Π(ν, ω), there exists a measure γ ∈ P(X × Y × Z) such that P^{X,Y}_#γ = π_1 and P^{Y,Z}_#γ = π_2, where P^{X,Y}(x, y, z) = (x, y) and P^{Y,Z}(x, y, z) = (y, z) are the projections onto the first two and last two variables respectively.
Proof. By the disintegration of measures we can write

    π_1(A × B) = ∫_B π_1(A|y) dν(y)

for some family of probability measures π_1(·|y) ∈ P(X), and similarly for π_2,

    π_2(B × C) = ∫_B π_2(C|y) dν(y).

Define γ ∈ M(X × Y × Z) by

    γ(A × B × C) = ∫_B π_1(A|y)π_2(C|y) dν(y).

Then,

    γ(A × B × Z) = ∫_B π_1(A|y)π_2(Z|y) dν(y) = ∫_B π_1(A|y) dν(y) = π_1(A × B).

Similarly, γ(X × B × C) = π_2(B × C). Therefore, P^{X,Y}_#γ = π_1 and P^{Y,Z}_#γ = π_2 as required.
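For discrete measures the gluing construction is completely explicit: disintegration amounts to dividing by the middle marginal. A small sketch (our own, in Python/numpy):

    import numpy as np

    rng = np.random.default_rng(4)
    pi1 = rng.random((3, 4)); pi1 /= pi1.sum()   # plan in Pi(mu, nu)
    nu = pi1.sum(axis=0)                         # middle marginal
    K = rng.random((4, 5)); K /= K.sum(axis=1, keepdims=True)
    pi2 = nu[:, None] * K                        # plan in Pi(nu, omega)

    # glue: gamma[i, j, k] = pi1[i, j] * pi2[j, k] / nu[j]
    gamma = pi1[:, :, None] * pi2[None, :, :] / nu[None, :, None]
    assert np.allclose(gamma.sum(axis=2), pi1)   # P^{X,Y}_# gamma = pi1
    assert np.allclose(gamma.sum(axis=0), pi2)   # P^{Y,Z}_# gamma = pi2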
We are now in a position to complete the proof of Proposition 5.4.
Proof of Proposition 5.4 (the triangle inequality). Let µ, ν, ω ∈ P_p(X) and assume π_{XY} ∈ Π(µ, ν), π_{YZ} ∈ Π(ν, ω) are optimal, i.e.

    d_{W^p}^p(µ, ν) = ∫_{X×X} |x − y|^p dπ_{XY}(x, y),
    d_{W^p}^p(ν, ω) = ∫_{X×X} |y − z|^p dπ_{YZ}(y, z).
Let γ ∈ P(X × X × X) be such that P^{X,Y}_#γ = π_{XY} and P^{Y,Z}_#γ = π_{YZ} (such γ exists by Lemma 5.5). Let π_{XZ} = P^{X,Z}_#γ. Then,

    π_{XZ}(A × X) = P^{X,Z}_#γ(A × X)
                  = γ( { (x, y, z) : P^{X,Z}(x, y, z) = (x, z) ∈ A × X } )
                  = γ( { (x, y, z) : x ∈ A } )
                  = γ(A × X × X)
                  = π_{XY}(A × X)
                  = µ(A).
Similarly π_{XZ}(X × B) = ω(B). So, π_{XZ} ∈ Π(µ, ω).
Now,

    d_{W^p}(µ, ω) ≤ ( ∫_{X×X} |x − z|^p dπ_{XZ}(x, z) )^{1/p}
                 = ( ∫_{X×X×X} |x − z|^p dγ(x, y, z) )^{1/p}
                 ≤ ( ∫_{X×X×X} |x − y|^p dγ(x, y, z) )^{1/p} + ( ∫_{X×X×X} |y − z|^p dγ(x, y, z) )^{1/p}
                 = ( ∫_{X×X} |x − y|^p dπ_{XY}(x, y) )^{1/p} + ( ∫_{X×X} |y − z|^p dπ_{YZ}(y, z) )^{1/p}
                 = d_{W^p}(µ, ν) + d_{W^p}(ν, ω),

where the second inequality is the triangle inequality in L^p(γ). This proves the triangle inequality.
One can also prove the triangle inequality using transport maps and an approximation argument. Slightly more precisely, if µ, ν and ω all have densities with respect to the Lebesgue measure then we know there exist transport maps T and S with T_#µ = ν and S_#ν = ω. The map S ∘ T then pushes µ onto ω. One can argue, similarly to our proof, that d_{W^p}(µ, ν) + d_{W^p}(ν, ω) ≥ d_{W^p}(µ, ω). To extend the argument to arbitrary probability measures µ, ν and ω one uses mollifiers to define µ̃_ε = µ ∗ J_ε, and analogously ν̃_ε, ω̃_ε, where J_ε = ε^{−d}J(·/ε) and J is a standard mollifier. The measures µ̃_ε, ν̃_ε, ω̃_ε have densities with respect to the Lebesgue measure and one can show d_{W^p}(µ̃_ε, ν̃_ε) → d_{W^p}(µ, ν) as ε → 0. We refer to [12, Lemma 5.2 and Lemma 5.3] for full details.
Our final result of the section gives sufficient conditions for equivalence of Wasserstein distances.

Proposition 5.6. For every p ∈ [1, +∞) and any µ, ν ∈ P_p(X) we have d_{W^p}(µ, ν) ≥ d_{W^1}(µ, ν). Furthermore, if X ⊂ R^n is bounded then d_{W^p}^p(µ, ν) ≤ diam(X)^{p−1} d_{W^1}(µ, ν).
Proof. By Jensen’s inequality, for π ∈ Π(µ, ν), we have

    ( ∫_{X×X} |x − y|^p dπ(x, y) )^{1/p} ≥ ∫_{X×X} |x − y| dπ(x, y).

Hence, d_{W^p}(µ, ν) ≥ d_{W^1}(µ, ν).
Now if X is bounded, then for all x, y ∈ X,

    |x − y|^p ≤ ( max_{w,z∈X} |w − z|^{p−1} )|x − y| = ( diam(X) )^{p−1}|x − y|.

Hence,

    ∫_{X×X} |x − y|^p dπ(x, y) ≤ ( diam(X) )^{p−1} ∫_{X×X} |x − y| dπ(x, y),

from which it follows that d_{W^p}^p(µ, ν) ≤ diam(X)^{p−1} d_{W^1}(µ, ν).

In fact the above is also true for p = +∞, however we do not consider (or even define) dW ∞
here and instead refer to [12, Section 5.5.1] for more information on the ∞-Wasserstein distance.

5.2 The Wasserstein Topology

Section references: The two results regarding the relationship between convergence in Wasserstein distance and weak* convergence can be found in [12, Theorem 5.10 and Theorem 5.11].
In this section we prove the relationship of convergence in Wasserstein distance with weak* convergence. We start with the case when X ⊂ R^n is compact.
Theorem 5.7. Let X ⊂ R^n be compact, and µ_m, µ ∈ P(X). Then µ_m ⇀* µ if and only if d_{W^p}(µ_m, µ) → 0.
Proof. By Proposition 5.6 it is enough to show the result for p = 1. Assume d_{W^1}(µ_m, µ) → 0. By the Kantorovich-Rubinstein theorem, see Remark 5.2, we can write

    d_{W^1}(µ, ν) = sup { ∫_X ϕ d(µ − ν) : ϕ ∈ L^1(|µ − ν|), |ϕ(x) − ϕ(y)| ≤ |x − y| }.

Let ϕ be a Lipschitz function with Lip(ϕ) > 0; then ϕ̃ = ϕ/Lip(ϕ) is a 1-Lipschitz function and therefore

    (1/Lip(ϕ)) ∫_X ϕ d(µ_m − µ) = ∫_X ϕ̃ d(µ_m − µ) ≤ d_{W^1}(µ_m, µ) → 0.

By substituting ϕ ↦ −ϕ we have that

    ∫_X ϕ dµ_m → ∫_X ϕ dµ

for any Lipschitz function ϕ. By the Portmanteau theorem µ_m ⇀* µ.
For the converse statement we assume that µ_m ⇀* µ and let m_k be a subsequence such that

    lim_{k→∞} d_{W^1}(µ_{m_k}, µ) = lim sup_{m→∞} d_{W^1}(µ_m, µ).

Let ϕ̃_{m_k} be 1-Lipschitz and such that

    d_{W^1}(µ_{m_k}, µ) ≤ ∫_X ϕ̃_{m_k} d(µ_{m_k} − µ) + 1/k.

Pick x_0 ∈ supp(µ). Note that, for any ϕ ∈ L^1(ν) where ν ∈ M(X) and c ∈ R,

    ∫_X (ϕ + c) dν = ∫_X ϕ dν

if ν(X) = 0. Hence if we let ϕ_{m_k}(x) = ϕ̃_{m_k}(x) − ϕ̃_{m_k}(x_0) then

    d_{W^1}(µ_{m_k}, µ) ≤ ∫_X ϕ_{m_k} d(µ_{m_k} − µ) + 1/k,

and the ϕ_{m_k} are 1-Lipschitz (in particular equicontinuous) and bounded (since ϕ_{m_k}(x_0) = 0 and X is compact). By the Arzelà-Ascoli theorem there exists a further subsequence (relabelled) such that ϕ_{m_k} → ϕ uniformly. In particular, ϕ is 1-Lipschitz. Hence,

    lim sup_{m→∞} d_{W^1}(µ_m, µ) ≤ lim sup_{k→∞} ( ∫_X ϕ_{m_k} d(µ_{m_k} − µ) + 1/k )
                                 ≤ lim sup_{k→∞} ( ∫_X (ϕ_{m_k} − ϕ) d(µ_{m_k} − µ) + ∫_X ϕ d(µ_{m_k} − µ) )
                                 ≤ lim sup_{k→∞} 2‖ϕ_{m_k} − ϕ‖_{L^∞} + lim sup_{k→∞} ∫_X ϕ d(µ_{m_k} − µ)
                                 = 0.

Hence, d_{W^1}(µ_m, µ) → 0 as m → ∞.
We now generalise to unbounded domains.
Theorem 5.8. Let µ_m, µ ∈ P_p(R^n). Then d_{W^p}(µ_m, µ) → 0 if and only if ∫_{R^n} |x|^p dµ_m → ∫_{R^n} |x|^p dµ and µ_m ⇀* µ.
Proof. Let d_{W^p}(µ_m, µ) → 0. Then by Proposition 5.6 we have d_{W^1}(µ_m, µ) → 0. Analogously to the proof of Theorem 5.7 we have ∫_X ϕ d(µ_m − µ) → 0 for all Lipschitz functions ϕ. Hence, by the Portmanteau theorem, µ_m ⇀* µ.
To show ∫_{R^n} |x|^p dµ_m → ∫_{R^n} |x|^p dµ we note that

    ∫_{R^n} |x|^p dµ_m = d_{W^p}^p(µ_m, δ_0)   and   ∫_{R^n} |x|^p dµ = d_{W^p}^p(µ, δ_0).

Now,

    d_{W^p}(µ_m, δ_0) ≤ d_{W^p}(µ_m, µ) + d_{W^p}(µ, δ_0) → d_{W^p}(µ, δ_0)

and

    d_{W^p}(µ_m, δ_0) ≥ d_{W^p}(µ, δ_0) − d_{W^p}(µ_m, µ) → d_{W^p}(µ, δ_0).

Hence ∫_{R^n} |x|^p dµ_m → ∫_{R^n} |x|^p dµ.
For the converse statement let µ_m ⇀* µ and ∫_{R^n} |x|^p dµ_m → ∫_{R^n} |x|^p dµ. For any R > 0 let φ_R(x) = (|x| ∧ R)^p = (min{|x|, R})^p, which is continuous and bounded. We have

    (5.1)   ∫_{R^n} ( |x|^p − φ_R(x) ) dµ_m → ∫_{R^n} ( |x|^p − φ_R(x) ) dµ

by weak* convergence and convergence of pth moments. Now

    ∫_{R^n} ( |x|^p − φ_R(x) ) dµ(x) = ∫_{|x|>R} ( |x|^p − R^p ) dµ ≤ ∫_{|x|>R} |x|^p dµ < ∞.

In particular, we let ε > 0 and choose R > 0 such that

    ∫_{R^n} ( |x|^p − φ_R(x) ) dµ(x) < ε/2.

By (5.1) we also have ∫_{R^n} ( |x|^p − φ_R(x) ) dµ_m(x) < ε for m sufficiently large.
For a > b > 0 and p ≥ 1 we have (a + b)^p = a^p + pbξ^{p−1} for some ξ ∈ [a, a + b]. Hence, (a + b)^p ≥ a^p + pa^{p−1}b ≥ a^p + b^p.
Using the above, for |x| > R we have (|x| − R)^p ≤ |x|^p − R^p = |x|^p − φ_R(x). So for m sufficiently large,

    ∫_{|x|>R} ( |x| − R )^p dµ_m < ε   and   ∫_{|x|>R} ( |x| − R )^p dµ < ε.
Let P_R : R^n → B(0, R) be the projection onto the ball B(0, R), i.e.

    P_R(x) = x if x ∈ B(0, R),   and   P_R(x) = argmin_{y∈∂B(0,R)} |y − x| otherwise.
The map P_R is continuous and equal to the identity on B(0, R). For all x ∉ B(0, R) we have |x − P_R(x)| = |x| − R. Hence,

    d_{W^p}(µ, (P_R)_#µ) ≤ ( ∫_{R^n} |x − P_R(x)|^p dµ(x) )^{1/p}
                         = ( ∫_{|x|>R} |x − P_R(x)|^p dµ(x) )^{1/p}
                         = ( ∫_{|x|>R} ( |x| − R )^p dµ(x) )^{1/p}
                         ≤ ε^{1/p},

and similarly,

    d_{W^p}(µ_m, (P_R)_#µ_m) ≤ ε^{1/p}.
For any ϕ ∈ C_b^0(R^n) we have

    ∫ ϕ d(P_R)_#µ_m = ∫ ϕ(P_R(x)) dµ_m → ∫ ϕ(P_R(x)) dµ = ∫ ϕ d(P_R)_#µ

since ϕ ∘ P_R is continuous and bounded. Hence, (P_R)_#µ_m ⇀* (P_R)_#µ.
Now, (P_R)_#µ_m and (P_R)_#µ have support in B(0, R) (a compact set), so by Theorem 5.7 we have d_{W^p}((P_R)_#µ_m, (P_R)_#µ) → 0. Hence,

lim sup dW p (µm , µ) ≤ lim sup dW p (µm , (PR )# µm ) + dW p ((PR )# µm , (PR )# µ)
m→∞ m→∞

+ dW p ((PR )# µ, µ)
1
≤ 2ε p .

Letting ε → 0 implies limm→∞ dW p (µm , µ) = 0 as required.

5.3 Geodesics in the Wasserstein Space

Section references: The result that the Wasserstein space is a geodesic space (Theorem 5.12) can be found in [12, Theorem 5.27].
The aim of this section is to show that the Wasserstein space (P_p(X), d_{W^p}) is a geodesic space. We start with some definitions. The definitions are given in terms of a metric space (Z, d); of course we have in mind that this will later be the Wasserstein space.
Definition 5.9. Let (Z, d) be a metric space and ω : [0, 1] → Z a curve in Z. We say ω is absolutely continuous if there exists g ∈ L^1([0, 1]) such that d(ω(t_0), ω(t_1)) ≤ ∫_{t_0}^{t_1} g(s) ds for any 0 ≤ t_0 < t_1 ≤ 1. We denote the set of absolutely continuous curves on Z by AC(Z).
Definition 5.10. Let (Z, d) be a metric space and ω : [0, 1] → Z a curve in Z. We define the length of ω by

    Len(ω) := sup { Σ_{k=0}^{n−1} d(ω(t_k), ω(t_{k+1})) : n ≥ 1, 0 = t_0 < t_1 < ⋯ < t_n = 1 }.

A curve ω : [0, 1] → Z is said to be a geodesic between z_0 ∈ Z and z_1 ∈ Z if

    ω ∈ argmin { Len(ω̃) : ω̃ : [0, 1] → Z, ω̃(0) = z_0, ω̃(1) = z_1 }.

A curve ω : [0, 1] → Z is said to be a constant speed geodesic between z_0 ∈ Z and z_1 ∈ Z if

    d(ω(t), ω(s)) = |t − s| d(ω(0), ω(1))   for all t, s ∈ [0, 1].
Note that if ω : [0, 1] → Z is a constant speed geodesic then it is a geodesic. Indeed, assume that ω : [0, 1] → Z and ω̃ : [0, 1] → Z satisfy ω(0) = z_0 = ω̃(0), ω(1) = z_1 = ω̃(1),

    d(ω(t), ω(s)) = |t − s| d(z_0, z_1)   ∀0 ≤ t < s ≤ 1,   and   Len(ω̃) < Len(ω),

i.e. we assume that ω is a constant speed geodesic but not a geodesic. Then there exist n ∈ N and 0 = t_0 < t_1 < ⋯ < t_n = 1 such that

    Len(ω̃) < Σ_{k=0}^{n−1} d(ω(t_k), ω(t_{k+1})) = d(z_0, z_1) Σ_{k=0}^{n−1} (t_{k+1} − t_k) = d(z_0, z_1).

This implies Len(ω̃) < d(z_0, z_1). Clearly this is a contradiction (choosing n = 1 in the definition of Len(ω̃) implies Len(ω̃) ≥ d(z_0, z_1)).
Note also that if d(ω(t), ω(s)) = |t − s| d(z_0, z_1) then ω ∈ AC(Z) with g(s) = d(z_0, z_1).
Definition 5.11. Let (Z, d) be a metric space. We say (Z, d) is a length space if

    d(x, y) = inf { Len(ω) : ω ∈ AC(Z), ω(0) = x, ω(1) = y }.

We say (Z, d) is a geodesic space if

    d(x, y) = min { Len(ω) : ω ∈ AC(Z), ω(0) = x, ω(1) = y }.
We now show that the Wasserstein space (P_p(X), d_{W^p}) is a geodesic space.

Theorem 5.12. Let p ≥ 1, let X ⊆ R^n be convex, and define P_t : X × X → X by P_t(x, y) = (1 − t)x + ty. Let µ, ν ∈ P_p(X) and assume π ∈ Π(µ, ν) minimises K over Π(µ, ν). Then the curve µ_t = (P_t)_#π is a constant speed geodesic in (P_p(X), d_{W^p}) connecting µ and ν. In particular, if π = (Id × T)_#µ for some transport map T : X → X that pushes µ forward to ν, i.e. T_#µ = ν (that is, T is a solution to the Monge optimal transport problem), then µ_t = ((1 − t)Id + tT)_#µ.
Proof. Note that P_0 = P^X and P_1 = P^Y. Therefore, µ_0 = (P_0)_#π = µ and µ_1 = (P_1)_#π = ν, so µ_t connects µ and ν. To show d_{W^p}(µ_s, µ_t) = |t − s| d_{W^p}(µ, ν) it is enough to prove that d_{W^p}(µ_s, µ_t) ≤ |t − s| d_{W^p}(µ, ν). Indeed, assuming this is true, if d_{W^p}(µ_s, µ_t) < |t − s| d_{W^p}(µ, ν) for any 0 ≤ s < t ≤ 1 then we have

    d_{W^p}(µ, ν) ≤ d_{W^p}(µ, µ_s) + d_{W^p}(µ_s, µ_t) + d_{W^p}(µ_t, ν)
                 < ( s + (t − s) + (1 − t) ) d_{W^p}(µ, ν)
                 = d_{W^p}(µ, ν),

a contradiction.
To show d_{W^p}(µ_s, µ_t) ≤ |t − s| d_{W^p}(µ, ν) let π_{s,t} = (P_s, P_t)_#π. Then for any (measurable) A ⊆ X,

    π_{s,t}(A × X) = π( { (x, y) : (1 − s)x + sy ∈ A, (1 − t)x + ty ∈ X } )
                   = π( { (x, y) : (1 − s)x + sy ∈ A } )
                   = (P_s)_#π(A)
                   = µ_s(A).

Hence P^X_#π_{s,t} = µ_s. Similarly, P^Y_#π_{s,t} = µ_t, so π_{s,t} ∈ Π(µ_s, µ_t). Now,

    d_{W^p}(µ_s, µ_t) ≤ ( ∫_{X×X} |x − y|^p dπ_{s,t}(x, y) )^{1/p}
                     = ( ∫_{X×X} |P_s(x, y) − P_t(x, y)|^p dπ(x, y) )^{1/p}
                     = ( ∫_{X×X} |(t − s)x − (t − s)y|^p dπ(x, y) )^{1/p}
                     = |t − s| ( ∫_{X×X} |x − y|^p dπ(x, y) )^{1/p}
                     = |t − s| d_{W^p}(µ, ν)

as required.
If π = (Id × T)_#µ where T is as in the statement of the theorem, then for measurable A ⊆ X we have

    µ_t(A) = (P_t)_#π(A)
           = π( { (x, y) : (1 − t)x + ty ∈ A } )
           = (Id × T)_#µ( { (x, y) : (1 − t)x + ty ∈ A } )
           = µ( { x : (1 − t)x + tT(x) ∈ A } )
           = ((1 − t)Id + tT)_#µ(A),

which shows µ_t = ((1 − t)Id + tT)_#µ.
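The geodesic µ_t is often called displacement interpolation. In 1D it is easy to realise numerically: for empirical measures the optimal map T is the sorted matching (cf. Chapter 4), and each atom of µ travels in a straight line at constant speed. A sketch (our own, in Python/numpy):

    import numpy as np

    rng = np.random.default_rng(5)
    x = np.sort(rng.normal(size=500))                       # atoms of mu
    y = np.sort(rng.normal(loc=3.0, scale=0.5, size=500))   # atoms of nu; T: x_i -> y_i
    for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
        mu_t = (1 - t) * x + t * y     # atoms of mu_t = ((1 - t) Id + t T)_# mu
        print(f"t={t:4.2f}  mean={mu_t.mean():5.2f}  std={mu_t.std():4.2f}")
    # the mean moves linearly from ~0 to ~3; contrast with the linear mixture
    # (1 - t) mu + t nu, which is bimodal for intermediate t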
Bibliography

[1] L. Ambrosio. Mathematical Aspects of Evolving Interfaces: Lectures given at the C.I.M.-C.I.M.E. joint Euro-Summer School held in Madeira, Funchal, Portugal, July 3-9, 2000, chapter Lecture Notes on Optimal Transport Problems, pages 1–52. Springer Berlin Heidelberg, 2003.

[2] L. Ambrosio, N. Gigli, and G. Savaré. Gradient flows in metric spaces and in the space of probability measures. Lectures in Mathematics ETH Zürich. Birkhäuser Verlag, Basel, second edition, 2008.

[3] L. Ambrosio and A. Pratelli. Optimal Transportation and Applications: Lectures given at the C.I.M.E. Summer School, held in Martina Franca, Italy, September 2-8, 2001, chapter Existence and stability results in the L1 theory of optimal transportation, pages 123–160. Springer Berlin Heidelberg, 2003.

[4] Y. Brenier. Décomposition polaire et réarrangement monotone des champs de vecteurs. Comptes rendus de l’Académie des Sciences, Paris, Série I, 305:805–808, 1987.

[5] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems (NIPS), pages 2292–2300, 2013.

[6] L. C. Evans and W. Gangbo. Differential equations methods for the Monge-Kantorovich mass transfer problem, volume 653. American Mathematical Society, 1999.

[7] W. Gangbo and R. J. McCann. The geometry of optimal transportation. Acta Mathematica, 177(2):113–161, 1996.

[8] L. V. Kantorovich. On translation of mass (in Russian), C. R. Doklady. Proceedings of the USSR Academy of Sciences, 37:199–201, 1942.

[9] S. Kolouri, S. R. Park, M. Thorpe, D. Slepčev, and G. K. Rohde. Optimal mass transport: Signal processing and machine-learning applications. IEEE Signal Processing Magazine, 34(4):43–59, 2017.

[10] S. Kolouri and G. K. Rohde. Optimal transport: a crash course. IEEE ICIP 2016 Tutorial Slides: Part 1, 2016.

[11] G. Monge. Mémoire sur la théorie des déblais et des remblais. De l’Imprimerie Royale, 1781.

[12] F. Santambrogio. Optimal transport for applied mathematicians. Birkhäuser Springer, Basel, 2015.

[13] B. Simon. Convexity: An Analytic Viewpoint. Cambridge University Press, 2011.

[14] V. N. Sudakov. Geometric problems in the theory of infinite-dimensional probability distributions. Proceedings of the Steklov Institute of Mathematics, 141:1–178, 1979.

[15] C. Villani. Topics in Optimal Transportation. American Mathematical Society, 2003.

[16] C. Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.