
Journal of Machine Learning Research 20 (2019) 1-23 Submitted 2/18; Revised 3/19; Published 7/19

Optimal Transport: Fast Probabilistic Approximation with Exact Solvers
Max Sommerfeld [email protected]
Felix-Bernstein Institute for Mathematical Statistics in the Biosciences
University of Göttingen
Goldschmidtstr. 7, 37077 Göttingen

Jörn Schrieber [email protected]


Institute for Mathematical Stochastics
University of Göttingen
Goldschmidtstr. 7, 37077 Göttingen

Yoav Zemel [email protected]


Felix-Bernstein Institute for Mathematical Statistics in the Biosciences
University of Göttingen
Goldschmidtstr. 7, 37077 Göttingen

Axel Munk [email protected]


Institute for Mathematical Stochastics and Felix-Bernstein Institute for Mathematical Statistics in the Biosciences
University of Göttingen
Goldschmidtstr. 7, 37077 Göttingen
and
Max-Planck-Institute for Biophysical Chemistry
Am Faßberg 11, 37077 Göttingen

Editor: Animashree Anandkumar

Abstract

We propose a simple subsampling scheme for fast randomized approximate computation of optimal
transport distances on finite spaces. This scheme operates on a random subset of the full data and
can use any exact algorithm as a black-box back-end, including state-of-the-art solvers and entropically penalized versions. It is based on averaging the exact distances between empirical measures
generated from independent samples from the original measures and can easily be tuned towards
higher accuracy or shorter computation times. To this end, we give non-asymptotic deviation
bounds for its accuracy in the case of discrete optimal transport problems. In particular, we show
that in many important instances, including images (2D-histograms), the approximation error is
independent of the size of the full problem. We present numerical experiments that demonstrate
that a very good approximation in typical applications can be obtained in a computation time
that is several orders of magnitude smaller than what is required for exact computation of the full
problem.

Keywords: computational vs statistical accuracy, covering numbers, empirical optimal transport, resampling, risk bounds, spanning tree, Wasserstein distance

© 2019 Max Sommerfeld, Jörn Schrieber, Yoav Zemel, and Axel Munk.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at
http://jmlr.org/papers/v20/18-079.html.

1. Introduction
Optimal transport distances, a.k.a. Wasserstein, earth-mover’s, Monge-Kantorovich-Rubinstein or
Mallows distances, as metrics to compare probability measures (Rachev and Rüschendorf, 1998;
Villani, 2008) have become a popular tool in a wide range of applications in computer science,
machine learning and statistics. Important examples are image retrieval (Rubner et al., 2000) and
classification (Zhang et al., 2007), computer vision (Ni et al., 2009), but also therapeutic equivalence
(Munk and Czado, 1998), generative modeling (Bousquet et al., 2017), biometrics (Sommerfeld and
Munk, 2018), metagenomics (Evans and Matsen, 2012) and medical imaging (Ruttenberg et al.,
2013).
Optimal transport distances compare probability measures by incorporating a suitable ground
distance on the underlying space, typically driven by the particular application, e.g. the Euclidean distance. This often makes it preferable to competing distances such as the total variation or χ²-distances, which are oblivious to any metric or similarity structure on the ground space. Note that total variation is the Wasserstein distance with respect to the trivial metric, which usually does not carry the geometry of the underlying ground space. In this setting, optimal transport distances have a clear and intuitive interpretation as the amount of 'work' required to transport one probability distribution onto the other. This notion is typically well-aligned with human perception of similarity
(Rubner et al., 2000).

1.1. Computation
The outstanding theoretical and practical performance of optimal transport distances is contrasted by their excessive computational cost. For example, optimal transport distances can be computed with an auction algorithm (Bertsekas, 1992). For two probability measures supported on N points this algorithm has a worst case run time of O(N^3 log N). Other methods like the transportation simplex have sub-cubic empirical average runtimes (compare Gottschlich and Schuhmacher, 2014), but exponential worst case runtimes.
Therefore, many attempts have been made to design improved algorithms. We give some selective references: Ling and Okada (2007) proposed a specialized algorithm for the L1 ground distance on a regular grid X and report an empirical runtime of O(N^2). Gottschlich and Schuhmacher (2014) improved existing general purpose algorithms by initializing with a greedy heuristic. Their Shortlist algorithm achieves an empirical average runtime of the order O(N^{5/2}). Schmitzer (2016) solves the optimal transport problem by solving a sequence of sparse problems. The theoretical runtime of his algorithm is not known, but it exhibits excellent performance on two-dimensional grids (Schrieber et al., 2016). The literature on this topic is growing rapidly; for further recent work we refer to Liu et al. (2018), Dvurechensky et al. (2018), Lin et al. (2019), and the references given there.
Despite these efforts, many practically relevant problems still remain well outside the scope of available algorithms. See Schrieber et al. (2016) for an overview and a numerical comparison of state-of-the-art algorithms for discrete optimal transport. This is true in particular for two- or three-dimensional images and spatio-temporal imaging, which constitute an important area of potential applications. Here, N is the number of pixels or voxels and is typically of size 10^5 to 10^7. Naturally,
this problem is aggravated when many distances have to be computed, as is the case for Wasserstein
barycenters (Agueh and Carlier, 2011; Cuturi and Doucet, 2014), which have become an important
use case.
To bypass this computational bottleneck, many surrogates for optimal transport distances that are more amenable to fast computation have also been proposed. Shirdhonkar and Jacobs (2008)
proposed to use an equivalent distance based on wavelets that can be computed in linear time but
cannot be calibrated to approximate the Wasserstein distance with arbitrary accuracy. Pele and
Werman (2009) threshold the ground distance to reduce the complexity of the underlying linear
program, obtaining a lower bound for the exact distance. Cuturi (2013) altered the optimization
problem by adding an entropic penalty term in order to use faster and more stable algorithms, see


[Figure 1: log-log scatter plot of relative error (y axis, 1% to 100%) against relative runtime (x axis, 10^-5 to 10^0), with point shapes encoding the problem size (32 × 32, 64 × 64, 128 × 128).]

Figure 1: Relative error and relative runtime of the proposed scheme compared to exact computation. Optimal transport distances and their approximations were computed between
images of different sizes (32 × 32, 64 × 64, 128 × 128). Each point represents a specific
parameter choice in the scheme and is a mean over different problem instances, solvers
and cost exponents. For the relative runtimes the geometric mean is reported. For details
on the parameters see Figure 2.

also Altschuler et al. (2017). Bonneel et al. (2015) consider the 1D Wasserstein distances of radial
projections of the original measures, exploiting the fact that, in one dimension, computing the
Wasserstein distance amounts to sorting the point masses and hence has quasi-linear computation
time.
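As an illustration of this one-dimensional special case, the following minimal Python sketch (our own illustration, not code from the cited references; the function name is hypothetical) computes W_p between two equal-size samples by sorting:

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """W_p between two equal-size 1D samples with uniform weights.

    In one dimension the optimal coupling matches the sorted samples,
    so the distance reduces to order statistics in O(n log n) time.
    """
    x, y = np.sort(x), np.sort(y)
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

# Two samples from unit-variance normals shifted by 1; W_1 is close to 1.
rng = np.random.default_rng(0)
print(wasserstein_1d(rng.normal(0.0, 1.0, 10000), rng.normal(1.0, 1.0, 10000)))
```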

1.2. Contribution
We do not propose a new algorithm to solve the optimal transport problem. Instead, we propose a
simple probabilistic scheme as a meta-algorithm that can use any algorithm (e.g., those mentioned
above) solving finitely supported optimal transport problems as a black-box back-end and gives a
random but fast approximation of the exact distance. This scheme

a) is extremely easy to implement, to parallelize and to tune towards higher accuracy or shorter
computation time as desired (see Figure 1);

b) can be used with any algorithm for transportation problems as a back-end, including general LP
solvers, specialized network solvers and algorithms using entropic penalization (Cuturi, 2013);

c) comes with theoretical non-asymptotic guarantees for the approximation error of the Wasserstein
distance—in particular, this error is independent of the size of the original problem in many
important cases, including images;

d) works well in practice. For example, the Wasserstein distance between two 128 × 128 pixel images can typically be approximated with a relative error of less than 5% in only 1% of the time required for exact computation.


2. Problem and Algorithm


Although our meta-algorithm is applicable to exact solvers for any optimal transport distance be-
tween probability measures, for example the Sinkhorn distance (Cuturi, 2013), the theory we present
here concerns the Kantorovich (1942) transport distance, often also denoted as Wasserstein distance.
Wasserstein Distance Consider a fixed finite space X = {x_1, ..., x_N} with a metric d : X × X → [0, ∞). Every probability measure on X is given by a vector r in

$$\mathcal{P}_{\mathcal{X}} = \Big\{ r = (r_x)_{x \in \mathcal{X}} \in \mathbb{R}_{\geq 0}^{\mathcal{X}} : \sum_{x \in \mathcal{X}} r_x = 1 \Big\},$$

via P_r({x}) = r_x. We will not distinguish between the vector r and the measure it defines. For p ≥ 1, the p-th Wasserstein distance between two probability measures r, s ∈ P_X is defined as

$$W_p(r, s) = \left( \min_{w \in \Pi(r, s)} \sum_{x, x' \in \mathcal{X}} d^p(x, x') \, w_{x, x'} \right)^{1/p}, \qquad (1)$$

where Π(r, s) is the set of all probability measures on X × X with marginal distributions r and s, respectively. The minimization in (1) can be written as the linear program

$$\min_{w} \sum_{x, x' \in \mathcal{X}} w_{x, x'} \, d^p(x, x') \quad \text{s.t.} \quad \sum_{x' \in \mathcal{X}} w_{x, x'} = r_x, \quad \sum_{x \in \mathcal{X}} w_{x, x'} = s_{x'}, \quad w_{x, x'} \geq 0, \qquad (2)$$

with N^2 variables w_{x,x'} and 2N constraints, where the weights d^p(x, x') are known and have been precalculated.
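For small instances, the linear program (2) can be solved directly with a generic LP solver. The following Python sketch (our illustration; the paper's simulations used R implementations, and any of the specialized solvers from Section 1.1 would serve as well) does exactly this with scipy:

```python
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import cdist

def wasserstein_lp(x, r, y, s, p=2):
    """Exact W_p(r, s) via the linear program (2).

    x, y: (n, D) and (m, D) arrays of support points;
    r, s: probability vectors of lengths n and m.
    """
    n, m = len(r), len(s)
    cost = cdist(x, y) ** p                  # precalculated weights d^p(x, x')
    A_eq = np.zeros((n + m, n * m))          # marginal constraints on w
    for i in range(n):                       # sum_j w_ij = r_i
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):                       # sum_i w_ij = s_j
        A_eq[n + j, j::m] = 1.0
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=np.concatenate([r, s]),
                  bounds=(0, None), method="highs")
    return res.fun ** (1.0 / p)
```

Since the LP has N^2 variables, this direct approach is only feasible for small N; this is precisely the bottleneck that the subsampling scheme below circumvents.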

2.1. Approximating the Wasserstein Distance


The idea of the proposed algorithm is to replace a probability measure r ∈ P_X with an empirical measure r̂_S based on i.i.d. picks X_1, ..., X_S ∼ r for some integer S:

$$\hat{r}_{S,x} = \frac{1}{S} \, \# \{ k : X_k = x \}, \qquad x \in \mathcal{X}. \qquad (3)$$

Likewise, replace s with ŝ_S. Then, use the empirical optimal transport distance (EOT) W_p(r̂_S, ŝ_S) as a random approximation of W_p(r, s).

Algorithm 1 Statistical approximation of W_p(r, s)

1: Input: probability measures r, s ∈ P_X, sample size S and number of repetitions B
2: for i = 1, ..., B do
3:    Sample i.i.d. X_1, ..., X_S ∼ r and independently Y_1, ..., Y_S ∼ s
4:    r̂_{S,x} ← #{k : X_k = x}/S for all x ∈ X
5:    ŝ_{S,x} ← #{k : Y_k = x}/S for all x ∈ X
6:    Compute Ŵ^{(i)} ← W_p(r̂_S, ŝ_S)
7: end for
8: Return: Ŵ_p^{(S)}(r, s) ← B^{-1} Σ_{i=1}^{B} Ŵ^{(i)}
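In code, Algorithm 1 is only a few lines. The following Python sketch (ours; the function name `subsampled_wasserstein` is hypothetical) assumes a back-end with the signature of `wasserstein_lp` above:

```python
import numpy as np

def subsampled_wasserstein(x, r, y, s, S, solver, B=1, p=2, seed=None):
    """Algorithm 1: average exact distances between subsampled measures."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(B):
        # Lines 3-5: draw S i.i.d. points from r and s, form empirical measures (3).
        counts_r = np.bincount(rng.choice(len(r), size=S, p=r), minlength=len(r))
        counts_s = np.bincount(rng.choice(len(s), size=S, p=s), minlength=len(s))
        ir, js = np.nonzero(counts_r)[0], np.nonzero(counts_s)[0]
        # Line 6: the black-box back-end only sees the at most S occupied points.
        estimates.append(solver(x[ir], counts_r[ir] / S, y[js], counts_s[js] / S, p=p))
    return float(np.mean(estimates))   # line 8
```

Note that the solver is only ever handed measures supported on at most S points, which is what drives the runtime reduction discussed next.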

In each of the B iterations in Algorithm 1, the Wasserstein distance between two sets of S point
masses has to be computed. For the exact Wasserstein distance, two measures on N points need to
be compared. If we take for example the super-cubic runtime of the auction algorithm as a basis,
Algorithm 1 has worst case runtime
O(B S^3 log S)


compared to O(N^3 log N) for the exact distance. This means a dramatic reduction in computation time if S (and B) are small compared to N.
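For a rough sense of scale (our back-of-the-envelope calculation): with N = 128^2 = 16384 pixels and a single repetition with S = 1000, the ratio of the two worst case bounds is (N^3 log N)/(S^3 log S) ≈ 16.4^3 · 1.4 ≈ 6 · 10^3, i.e., more than three orders of magnitude, before any parallelization over the B repetitions.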
The application of Algorithm 1 to other optimal transport distances is straightforward. One
can simply replace Wp (r̂S , ŝS ) with the desired distance, e.g., the Sinkhorn distance (Cuturi, 2013),
see also our numerical experiments below. Further, the algorithm can be applied to non-discrete
instances as long as we can sample from the measures. However, the theoretical results below only
apply to the EOT on a finite ground space X .
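As an example of such a swap, here is a minimal Sinkhorn back-end with the same signature as the LP solver above (our own sketch of the scaling iteration from Cuturi (2013); the regularization strength `reg` is a free choice here, not the heuristic used in Section 4):

```python
import numpy as np
from scipy.spatial.distance import cdist

def sinkhorn_backend(x, r, y, s, p=2, reg=0.05, n_iter=2000):
    """Entropically regularized transport cost, reported on the W_p scale."""
    M = cdist(x, y) ** p
    K = np.exp(-M / reg)                 # Gibbs kernel; underflows if reg is tiny
    u = np.ones(len(r))
    for _ in range(n_iter):              # alternating marginal scaling
        v = s / (K.T @ u)
        u = r / (K @ v)
    plan = u[:, None] * K * v[None, :]   # regularized transport plan
    return float(np.sum(plan * M)) ** (1.0 / p)

# Drop-in replacement for the exact back-end:
# subsampled_wasserstein(x, r, y, s, S=1000, solver=sinkhorn_backend)
```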

3. Theoretical Results
We give general non-asymptotic guarantees for the quality of the approximation Ŵ_p^{(S)}(r, s) = B^{-1} Σ_{i=1}^{B} W_p(r̂_{S,i}, ŝ_{S,i}) (where the r̂_{S,i} are independent empirical measures of size S from r; see Algorithm 1) in terms of the expected L^1-error. That is, we give bounds of the form

$$\mathbb{E}\big| \hat{W}_p^{(S)}(r, s) - W_p(r, s) \big| \leq g(S, \mathcal{X}, p), \qquad (4)$$

for some function g. We are particularly interested in the dependence of the bound on the size
N of X and on the sample size S as this determines how the number of sampling points S (and
hence the computational effort of Algorithm 1) must be increased for increasing problem size N in
order to retain (on average) a certain approximation quality. In a second step, we obtain deviation inequalities for Ŵ_p^{(S)}(r, s) via concentration of measure techniques.

Related work The question of the convergence of empirical measures to the true measure in
expected Wasserstein distance has been considered in detail by Boissard and Le Gouic (2014) and
Fournier and Guillin (2015). The case of the underlying measures being different (that is, the convergence of E W_p(r̂_S, ŝ_S) to W_p(r, s) when r ≠ s) has, to the best of our knowledge, not been considered. Theorem 1 is reminiscent of the main result of Boissard and Le Gouic (2014). However,
we give a result here, which is explicitly tailored to finite spaces and makes explicit the dependence
of the constants on the size N of the underlying set X . In fact, when we consider finite spaces X
which are subsets of RD later in Theorem 3, we will see that in contrast to the results of Boissard
and Le Gouic (2014), the rate of convergence (in S) does not change when the dimension gets large,
but rather the dependence of the constants on N changes. This is a valuable insight as our main
concern here is how the subsample size S (driving the computational cost) must be chosen when N
grows in order to retain a certain approximation quality.

3.1. Expected Absolute Error


Recall that, for δ > 0, the covering number N(X, δ) of X is defined as the minimal number of closed balls with radius δ and centers in X needed to cover X. Note that, in contrast to continuous spaces, N(X, δ) is bounded by N for all δ > 0.

Theorem 1. Let r̂_S be the empirical measure obtained from i.i.d. samples X_1, ..., X_S ∼ r. Then

$$\mathbb{E}\big[ W_p^p(\hat{r}_S, r) \big] \;\leq\; \mathcal{E}_q / \sqrt{S}, \qquad (5)$$

where the constant E_q := E_q(X, p) is given by

$$\mathcal{E}_q = 2^{p-1} q^{2p} \big(\mathrm{diam}(\mathcal{X})\big)^p \left( q^{-(l_{\max}+1)p} \sqrt{N} + \sum_{l=0}^{l_{\max}} q^{-lp} \sqrt{\mathcal{N}\big(\mathcal{X},\, q^{-l}\,\mathrm{diam}(\mathcal{X})\big)} \right) \qquad (6)$$

for any integer q ≥ 2 and l_max ∈ N.


Remark 1. Since Theorem 1 holds for any integer q ≥ 2 and l_max ∈ N, these can be chosen freely to minimize the constant E_q. In the proof they appear as the branching number and depth of a spanning tree that is constructed on X (see appendix). In general, an optimal choice of q and l_max cannot be given. However, in the Euclidean case the optimal values for q and l_max will be determined, and in particular we will show that q = 2 is optimal (see the discussion after Theorem 3, and Lemma 1).
Remark 2 (covering by arbitrary sets). At the price of a factor 2^p, we can replace the balls defining the covering numbers N with arbitrary sets and obtain the bound

$$\mathcal{E}_q = 2^{2p-1} q^{2p} \big(\mathrm{diam}(\mathcal{X})\big)^p \left( q^{-(l_{\max}+1)p} \sqrt{N} + \sum_{l=0}^{l_{\max}} q^{-lp} \sqrt{\mathcal{N}_1\big(\mathcal{X},\, q^{-l}\,\mathrm{diam}(\mathcal{X})\big)} \right),$$

where N_1(X, δ) is the minimal number of closed sets of diameter at most 2δ needed to cover X. The proof is given in the appendix. These alternative covering numbers lead to better bounds in high-dimensional Euclidean spaces when p > 2.5 (see Remark 3).
Based on Theorem 1, we can formulate a bound for the mean approximation error of Algorithm
1. A mean squared error version is given below, in Theorem 5.
Theorem 2. Let Ŵ_p^{(S)}(r, s) be as in Algorithm 1 for any choice of B ∈ N. Then for every integer q ≥ 2,

$$\mathbb{E}\big| \hat{W}_p^{(S)}(r, s) - W_p(r, s) \big| \leq 2\,\mathcal{E}_q^{1/p}\, S^{-1/(2p)}. \qquad (7)$$

Proof. The statement is an immediate consequence of the reverse triangle inequality for the Wasserstein distance, Jensen's inequality and Theorem 1:

$$\mathbb{E}\big| \hat{W}_p^{(S)}(r, s) - W_p(r, s) \big| \leq \mathbb{E}\big[ W_p(\hat{r}_S, r) + W_p(\hat{s}_S, s) \big] \leq \big( \mathbb{E}\, W_p^p(\hat{r}_S, r) \big)^{1/p} + \big( \mathbb{E}\, W_p^p(\hat{s}_S, s) \big)^{1/p} \leq 2\,\mathcal{E}_q^{1/p} / S^{1/(2p)}.$$

Measures on Euclidean Space While the constant E_q in Theorem 1 may be difficult to compute or estimate in general, we give explicit bounds in the case when X is a finite subset of a Euclidean space. They exhibit the dependence of the approximation error on N = |X|. In particular, this comprises the case when the measures represent images (two- or higher-dimensional).
Theorem 3. Let X be a finite subset of R^D with the usual Euclidean metric. Then

$$\mathcal{E}_2 \leq D^{p/2}\, 2^{3p-1} \big(\mathrm{diam}(\mathcal{X})\big)^p \cdot C_{D,p}(N),$$

where N = |X| and

$$C_{D,p}(N) = \begin{cases} 1/(1 - 2^{D/2-p}) & D < 2p,\\ 2 + D^{-1}\log_2 N & D = 2p,\\ N^{1/2 - p/D}\,\big[2 + 1/(2^{D/2-p}-1)\big] & D > 2p. \end{cases} \qquad (8)$$

One can obtain bounds for E_q, q > 2 (see the proof), but the choice q = 2 leads to the smallest bound (Lemma 1a, page 17). Further, if p is an integer, then

$$C_{D,p}(N) \leq \begin{cases} 2 + \sqrt{2} & D < 2p,\\ 2 + D^{-1}\log_2 N & D = 2p,\\ (3 + \sqrt{2})\, N^{1/2 - p/D} & D > 2p \end{cases}$$

(see Lemma 1b).


In particular, we have for the most important cases p = 1, 2:


Corollary 1. Under the conditions of Theorem 3,

$$p = 1 \implies \mathcal{E}_2 \leq 4 D^{1/2}\,\mathrm{diam}(\mathcal{X}) \cdot \begin{cases} 1/(1 - 2^{D/2-1}) & D < 2,\\ 2 + (1/2)\log_2 N & D = 2,\\ N^{1/2 - 1/D}\,\big[2 + 1/(2^{D/2-1}-1)\big] & D > 2, \end{cases}$$

$$p = 2 \implies \mathcal{E}_2 \leq 32 D\, \big(\mathrm{diam}(\mathcal{X})\big)^2 \cdot \begin{cases} 1/(1 - 2^{D/2-2}) & D < 4,\\ 2 + (1/4)\log_2 N & D = 4,\\ N^{1/2 - 2/D}\,\big[2 + 1/(2^{D/2-2}-1)\big] & D > 4. \end{cases}$$
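The constants in Theorem 3 and Corollary 1 are straightforward to evaluate numerically. The following sketch (our illustration, with hypothetical function names) computes the resulting bound on the expected error from Theorem 2 for images on the normalized grid [0, 1]^2 with p = 2, where D < 2p and the bound is independent of the number of pixels N:

```python
import numpy as np

def C_Dp(D, p, N):
    """The constant C_{D,p}(N) from equation (8)."""
    if D < 2 * p:
        return 1.0 / (1.0 - 2.0 ** (D / 2.0 - p))
    if D == 2 * p:
        return 2.0 + np.log2(N) / D
    return N ** (0.5 - p / D) * (2.0 + 1.0 / (2.0 ** (D / 2.0 - p) - 1.0))

def expected_error_bound(D, p, N, diam, S):
    """Bound 2 E_2^{1/p} S^{-1/(2p)} from Theorems 2 and 3."""
    E2 = D ** (p / 2.0) * 2.0 ** (3 * p - 1) * diam ** p * C_Dp(D, p, N)
    return 2.0 * E2 ** (1.0 / p) * S ** (-1.0 / (2 * p))

# Same bound for every resolution, since D = 2 < 2p = 4. The constants are
# conservative; the empirical errors in Section 4 are far smaller.
for R in (32, 64, 128):
    print(R, expected_error_bound(D=2, p=2, N=R * R, diam=np.sqrt(2), S=1000))
```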

Remark 3 (improved bounds in high dimensions). The term D^{p/2} appears because in the proof of Theorem 3 we switch between the Euclidean norm and the supremum norm. One may wonder whether this change of norms is necessary. We can stay in the Euclidean setting and may assume without loss of generality that X is included in B_{diam(X)}(0), where B_r(x) = {y : ‖y − x‖₂ ≤ r} is the closed ball of radius r around x. According to Verger-Gaugry (2005), there exists an absolute constant C such that N(B_1(0), ε) ≤ C 2^D D^{5/2} ε^{−D}. Using this would allow one to replace D^{p/2} by C 2^{D/2} D^{5/4}, or, combining with the alternative covering numbers N_1 (Remark 2), by C 2^p D^{5/4}. This is better than D^{p/2} when p > 2.5 and D is large.
Theorem 3 gives control over the error made by the approximation Ŵ_p^{(S)}(r, s) of W_p(r, s). Of particular interest is the behavior of this error as N gets large (e.g., for high resolution images). We distinguish three cases. In the low-dimensional case p′ = D/2 − p < 0, we have C_{D,p}(N) = O(1) and the approximation error is O(S^{−1/(2p)}), independent of the size of the image. In the critical case p′ = 0, the approximation error is no longer independent of N but is of order O(log(N) S^{−1/(2p)}). Finally, in the high-dimensional case p′ > 0 the dependence on N becomes stronger, with an approximation error of order

$$O\left( \left( \frac{N^{1 - 2p/D}}{S} \right)^{1/(2p)} \right).$$

In all cases one can choose S = o(N) while still guaranteeing a vanishing approximation error for N → ∞. In practice, this means that S can typically be chosen (much) smaller than N to obtain a good approximation of the Wasserstein distance. In particular, this implies that for low-dimensional applications with two- or three-dimensional histograms (for example grayscale images, where N corresponds to the number of pixels / voxels and r, s correspond to the grey value distribution after normalization), the approximation error is essentially not affected by the size of the problem when p is not too small, e.g., p = 2.
While the three cases in Theorem 3 resemble those given by Boissard and Le Gouic (2014), the rate of convergence in S as seen in Theorem 1 is O(S^{−1/2}), regardless of the dimension of the underlying space X. The constant depends on D, however, roughly at the polynomial rate D^{p/2} and through C_{D,p}(N). It is also worth mentioning that, by considering the dual transport problem, our results can be recast in the framework of Shalev-Shwartz et al. (2010).
Remark 4. The results presented here extend to the case where X is a bounded, countable subset
of R^D. However, our bounds for E_q contain the term C_{D,p}(N), which remains bounded as N → ∞ in the low-dimensional case (D < 2p) but diverges otherwise. Finding a better bound for E_q when X is
countable is challenging and an interesting topic for further research.

3.2. Concentration Bounds


Based on the bounds for the expected approximation error we now give non-asymptotic guarantees
for the approximation error in the form of deviation bounds using standard concentration of measure
techniques.


Theorem 4. If Ŵ_p^{(S)}(r, s) is obtained from Algorithm 1, then for every z ≥ 0,

$$P\left[\, \big| \hat{W}_p^{(S)}(r, s) - W_p(r, s) \big| \geq z + \frac{2\,\mathcal{E}_q^{1/p}}{S^{1/(2p)}} \,\right] \leq 2 \exp\left( - \frac{S B z^{2p}}{8\,\mathrm{diam}(\mathcal{X})^{2p}} \right). \qquad (9)$$

Note that while the mean approximation quality 2 E_q^{1/p} / S^{1/(2p)} only depends on the subsample size S, the stochastic variability (see the right-hand side of Equation 9) depends on the product S B. This means that the repetition number B cannot decrease the expected error, but it decreases the magnitude of fluctuation around it.
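Inverting the right-hand side of (9) gives a simple rule for the product S·B needed to keep the fluctuation term below a tolerance z with prescribed probability. A minimal sketch (ours; the function name `required_SB` is hypothetical):

```python
import numpy as np

def required_SB(z, delta, diam, p):
    """Smallest S*B with 2 exp(-S*B*z^(2p) / (8 diam^(2p))) <= delta."""
    return int(np.ceil(8.0 * diam ** (2 * p) * np.log(2.0 / delta) / z ** (2 * p)))

# Fluctuation at most z = 0.05 with probability 0.95, for measures on
# [0,1]^2 (diam = sqrt(2)) and p = 1; yields S*B of the order 2.4e4.
print(required_SB(z=0.05, delta=0.05, diam=np.sqrt(2), p=1))
```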
From these concentration bounds we can obtain a mean squared error version of Theorem 2:

Theorem 5. Let Ŵ_p^{(S)}(r, s) be as in Algorithm 1 for any choice of B ∈ N. Then for every integer q ≥ 2 the mean squared error of the EOT can be bounded as

$$\mathbb{E}\Big[ \big( \hat{W}_p^{(S)}(r, s) - W_p(r, s) \big)^2 \Big] \leq 18\,\mathcal{E}_q^{2/p}\, S^{-1/p} = O(S^{-1/p}).$$

Remark 5. The power 2 can be replaced by any α ≤ 2p, with rate S^{−α/(2p)}, as can be seen from a straightforward modification of the first lines of the proof.

For example, in view of Theorem 3, when X is a finite subset of R^D and q = 2, we obtain

$$\mathbb{E}\Big[ \big( \hat{W}_p^{(S)}(r, s) - W_p(r, s) \big)^2 \Big] \leq 3^2\, 2^{7-2/p}\, D\, C_{D,p}(N)^{2/p}\, \big[\mathrm{diam}(\mathcal{X})\big]^2\, S^{-1/p},$$

with the constant C_{D,p}(N) given in (8). Thus, we qualitatively observe the same dependence on N as in Theorem 3; e.g., the mean squared error is independent of N when D < 2p.

4. Simulations
This section covers the numerical findings of the simulations. Runtimes and returned values of
Algorithm 1 for each back-end solver are reported in relation to the results of that solver on the
original problem. Four different solvers are tested.

4.1. Simulation Setup


The setup of our simulations is identical to that of Schrieber et al. (2016). One single core of a
Linux server (AMD Opteron Processor 6140 from 2011 with 2.6 GHz) was used. The original and
subsampled instances were run under the same conditions.
Three of the four methods featured in this simulation are exact linear programming solvers. The
transportation simplex is a modified version of the network simplex solver tailored towards optimal
transport problems. Details can be found for example in Luenberger and Ye (2008). The shortlist
method (Gottschlich and Schuhmacher, 2014) is a modification of the transportation simplex that performs an additional greedy step to quickly find a good initial solution. The parameters were
chosen as the default parameters described in that paper. The third method is the network simplex
solver of CPLEX (www.ibm.com/software/commerce/optimization/cplex-optimizer/). For the
transportation simplex and the shortlist method the implementations provided in the R package
transport (Schuhmacher et al., 2014) were used. The models for the CPLEX solver were created and
solved via the R package Rcplex (Bravo and Theussl, 2016).
Additionally, the Sinkhorn scaling algorithm (Cuturi, 2013) was tested in our simulation. This
method computes an entropy regularized optimal transport distance. The regularization parameter
was chosen according to the heuristic in Cuturi (2013). Note that the Sinkhorn distance is not


covered by the theoretical results from Section 3. The errors reported for the Sinkhorn scaling are
relative to the values returned by the algorithm on the full problems, which themselves differ from
the actual Wasserstein distances.
The instances of optimal transport considered here are discrete instances of two different types:
regular grids in two dimensions, that is, images in various resolutions, as well as point clouds in [0, 1]^D with dimensions D = 2, 3 and 4. For the image case, three instances were chosen from the DOTmark, a benchmark containing images of various types intended to be used as optimal transport instances in the form of two-dimensional histograms: two images from each of the classes White Noise, Cauchy Density, and Classic Images, which are then treated in the three resolutions 32 × 32, 64 × 64 and 128 × 128. Images are interpreted as finitely supported measures. The mass of a pixel is given by its grayscale value and the support of the measure is the grid {1, ..., R} × {1, ..., R} for an image with resolution R × R.
In the White Noise class the grayscale values of the pixels are independent of each other, the
Cauchy Density images show bivariate Cauchy densities with random centers and varying scale
ellipses, while Classic Images contains grayscale test images. See Schrieber et al. (2016) for further
details on the different image classes and example images. The instances were chosen to cover
different types of images, while still allowing for the simulation of a large variety of parameters for
subsampling.
The point cloud type instances were created as follows. The support points of the measures are independently, uniformly distributed on [0, 1]^D. The number of points N was chosen as 32^2, 64^2 and 128^2 in order to match the size of the grid based instances. For each choice of D and N, three instances were generated in analogy to the three image types used in the grid based case. Two measures on the points are drawn from the Dirichlet distribution with all parameters equal to one; that is, the masses on different points are independent of each other, similar to the white noise images. To create point cloud versions of the Cauchy Density and Classic Images classes, the grayscale values of the same images were used to obtain the mass values for the support points. In three and four dimensions, the product measure of the images with their sum of columns and with themselves, respectively, was used.
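For concreteness, the Dirichlet point cloud instances described above can be generated as follows (a sketch; the function name is ours):

```python
import numpy as np

def dirichlet_instance(N, D, seed=None):
    """One point cloud instance of the Dirichlet type: N support points
    uniform on [0,1]^D, two independent mass vectors from a flat Dirichlet."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(N, D))        # common support points
    r = rng.dirichlet(np.ones(N))       # first measure
    s = rng.dirichlet(np.ones(N))       # second, independent measure
    return x, r, s

x, r, s = dirichlet_instance(N=32 * 32, D=3, seed=1)
```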
All original instances were solved by each back-end solver in each resolution for the values p = 1,
p = 2, and p = 3 in order to be compared to the approximative results for the subsamples in terms
of runtime and accuracy, with the exception of CPLEX, where the 128 × 128 instances could not
be solved due to memory limitations. Algorithm 1 was applied to each of these instances with
parameters S ∈ {100, 500, 1000, 2000, 4000} and B ∈ {1, 2, 5}. For every combination of instance
and parameters, the subsampling algorithm was run 5 times in order to mitigate the randomness of
the results.
Since the linear programming solvers had a very similar performance on the grid based instances (see below), only one of them, the transportation simplex, was tested on the point cloud instances.

4.2. Computational Results


As mentioned before, all results of Algorithm 1 are relative to the results of the methods applied
to the original problems. We are mainly interested in the reduction in runtime and accuracy of
the returned values. Many important results can be observed in Figures 2 and 3. The points in
the diagram represent averages over the different methods, instances, and multiple tries, but are
separated in resolution and choices of the parameters S and B in Algorithm 1.
For images we observe a decrease in relative runtimes with higher resolution, while the average
relative error is independent of the image resolution. In the point cloud case, however, the relative
error increases slightly with the instance size. The number S of sampled points seems to considerably
affect the relative error. An increase of the number of points results in more accurate values, with
average relative errors as low as about 3% for S = 4000, while still maintaining a speedup of two
orders of magnitude on 128 × 128 images. Lower sample sizes yield higher average errors, but also


[Figure 2: three log-log panels (32 × 32, 64 × 64, 128 × 128) showing relative error (y axis) against relative runtime (x axis, 10^-5 to 10^0); legend: repetitions B ∈ {1, 2, 5} and sample sizes S ∈ {100, 500, 1000, 2000, 4000}.]

Figure 2: Relative errors |Ŵ_p^{(S)}(r, s) − W_p(r, s)| / W_p(r, s) vs. relative runtimes t̂/t for different parameters S and B and different problem sizes for images. Here t̂ is the runtime of Algorithm 1 and t is the runtime of the respective back-end solver without subsampling.

[Figure 3: same layout and legend as Figure 2, for the point cloud instances.]

Figure 3: Relative errors vs. relative runtimes for different parameters S and B and different problem
sizes for point clouds. The number of support points matches the number of pixels in the
images.





[Figure 4: scatter plot of the signed relative error (y axis, 0% to 150%) against the sample size S ∈ {100, 500, 1000, 2000, 4000}, colored by problem size (32 × 32, 64 × 64, 128 × 128).]
 
Figure 4: The signed relative approximation error (Ŵ_p^{(S)}(r, s) − W_p(r, s)) / W_p(r, s), showing that the approximation overestimates the exact distance for small S but that the bias vanishes for larger S.

lower runtimes. With S = 500 the runtime is reduced by over four orders of magnitude with an average relative error of less than 10%. As is to be expected, runtime increases linearly with the number of repetitions B. However, the impact on the relative errors is rather inconsistent. This is due to the fact that the costs returned by the subsampling algorithm are often overestimates, so averaging over multiple tries does not yield improvements (see Figure 4). This means that in order to increase the accuracy of the algorithm it is advisable to keep B = 1 and instead increase the sample size S. However, increasing B can be useful to lower the variability of the results.
In contrast, there is a big difference in accuracy between the image classes. While Algorithm
1 has consistently low relative errors on the Cauchy Density images, the exact optimal costs for
White Noise images cannot be approximated as reliably. The relative errors fluctuate more and
are generally much higher, as one can see from Figure 5 (left). In images with smooth structures
and regular features the subsamples are able to capture that structure and therefore deliver a more
precise representation of the images and a more precise value. This is not possible in images that are
very irregular or noisy, such as the White Noise images, which have no structure to begin with. The
Classic Images contain both regular structures and more irregular regions, therefore their relative
errors are slightly higher than in the Cauchy Density cases. The algorithm has a similar performance
on the point cloud instances, that are modelled after the Cauchy Density and Classic Images classes,
while the Dirichlet instances have a more desirable accuracy compared to the White Noise images,
as seen in Figure 5 (right).
There are no significant differences in performance between the different back-end solvers for the
Wasserstein distance. As Figure 6 shows, accuracy seems to be better for the Sinkhorn distance
compared to the other three solvers, which report the exact Wasserstein distance.
In the results of the point cloud instances we can observe the influence of the value p′ = (D/2) − p on the scaling of the relative error with the instance size N for constant sample size (S = 4000). This is shown in Figure 7. We observe an increase of the relative error with p′, as expected from the theory. However, we are not able to clearly distinguish between the three cases p′ < 0, p′ = 0

[Figure 5: two log-scale panels of relative error against sample size S; the left panel is grouped by image class (Cauchy Density, Classic Images, White Noise), the right panel by point cloud class (Cauchy Density, Classic Images, Dirichlet).]

Figure 5: A comparison of the relative errors for different image classes (left) and point cloud instance classes (right).

[Figure 6: relative error (log scale) against sample size S, colored by distance (Wasserstein vs. Sinkhorn).]

Figure 6: A comparison between the approximations of the Wasserstein and Sinkhorn distances.


[Figure 7: mean relative error (y axis, 1% to 5%) against p′ = D/2 − p (x axis, −2 to 1), with one line per problem size (32 × 32, 64 × 64, 128 × 128).]

Figure 7: A comparison of the mean relative errors in the point cloud instances with sample size S = 4000 for different values of p′ = (D/2) − p.

and p′ > 0. This might be due to the relatively small instance sizes N in the experiments. While we see that the relative errors are independent of N in the image case (compare Figure 2), for the point clouds N has an influence on the accuracy that depends on p′.

5. Discussion
As our simulations demonstrate, subsampling is a simple yet powerful tool to obtain good approximations to Wasserstein distances with only a small fraction of the required runtime and memory. It is especially remarkable that in the case of two-dimensional images, for a fixed number of subsampled points, and therefore a fixed amount of time and memory, the relative error is independent of the resolution/size of the images. Based on these results, we expect the subsampling algorithm to return similarly precise results at even higher image resolutions, while the effort to obtain them stays the same. Even in point cloud instances the relative error scales only mildly with the original input size N, in a manner that depends on the value p′.
The numerical results (Figure 2) show an inverse polynomial decrease of the approximation error with S, in accordance with the theoretical results. In fact, the rate $O(S^{-1/(2p)})$ is optimal. Indeed, when $r = s$ (and the measures are nontrivial), Sommerfeld and Munk (2018) show that $Z_S = S^{1/(2p)}\,[W_p(\hat{r}_S, \hat{s}_S) - W_p(r, s)]$ has a nondegenerate limiting distribution $Z$. For each $R > 0$ the function $x \mapsto \min(R, |x|)$ is nonnegative, continuous and bounded, so
$$\liminf_{S\to\infty} \mathbb{E}\big\{S^{1/(2p)}\,|W_p(\hat{r}_S, \hat{s}_S) - W_p(r, s)|\big\} = \liminf_{S\to\infty} \mathbb{E}\{|Z_S|\} \ge \liminf_{S\to\infty} \mathbb{E}\min\{R, |Z_S|\} = \mathbb{E}\min(R, |Z|).$$
Letting $R \to \infty$ and using the monotone convergence theorem yields
$$\liminf_{S\to\infty} \mathbb{E}\big\{S^{1/(2p)}\,|W_p(\hat{r}_S, \hat{s}_S) - W_p(r, s)|\big\} \ge \mathbb{E}|Z| > 0.$$

When applying the algorithm, it is important to note that the quality of the returned values depends on the structure of the data. In very irregular instances it is necessary to increase the sample size in order to obtain similarly precise results, while in regular structures a small sample size suffices.
Our scheme allows the parameters S and B to be easily tuned towards faster runtimes or more precise results, as desired. Increasing or decreasing the sample size S will improve or worsen the mean approximation of $W_p$ by $\hat{W}_p^{(S)}$, while B only affects the concentration around $\mathbb{E}\,\hat{W}_p^{(S)}$. Empirically, we found that for fixed computational cost the best performance is achieved when B = 1 (compare Figure 2), suggesting that the bias dominates the variance in the mean squared error.
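To make this tuning concrete, the following minimal sketch implements the scheme around an arbitrary exact solver passed in as a black box (this is our illustration, not code from the paper; the function name and the solver interface are assumptions):

    import numpy as np

    def subsampled_ot(x, r, y, s, solver, S=1000, B=1, seed=None):
        # Approximate the optimal transport distance between the discrete
        # measures (weights r on support points x) and (weights s on support
        # points y): draw S i.i.d. samples from each measure, hand the two
        # empirical measures to an exact black-box solver, and average the
        # B resulting distances.
        rng = np.random.default_rng(seed)
        values = []
        for _ in range(B):
            ix = rng.choice(len(r), size=S, p=r)        # S samples from r
            iy = rng.choice(len(s), size=S, p=s)        # S samples from s
            ux, cx = np.unique(ix, return_counts=True)  # collapse repeated points
            uy, cy = np.unique(iy, return_counts=True)  # into weighted measures
            values.append(solver(cx / S, x[ux], cy / S, y[uy]))
        return float(np.mean(values))

Any exact back-end can be plugged in through the solver argument; for instance, assuming the Python package POT and its functions ot.dist and ot.emd2, a W_2 back-end could read solver = lambda a, xa, b, yb: ot.emd2(a, b, ot.dist(xa, yb, metric='euclidean') ** 2) ** 0.5. Increasing S shrinks the bias of the returned value, while increasing B only averages out its fluctuations, in line with the discussion above.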
The scheme presented here can readily be applied to other optimal transport distances, as long as
a solver is available, as we demonstrated with the Sinkhorn distance (Cuturi, 2013). Empirically, we
can report good performance in this case, suggesting that entropically regularized distances might be
even more amenable to subsampling approximation than the Wasserstein distance itself. Extending
the theoretical results to this case would require an analysis of the mean speed of convergence of
empirical Sinkhorn distances, which is an interesting task for future research.
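As an illustration of this flexibility, an entropically regularized back-end drops into the same wrapper. The following hypothetical sketch assumes the POT package; the regularization strength reg is an arbitrary illustrative choice, not a value from our experiments:

    import ot  # POT: Python Optimal Transport

    def sinkhorn_solver(a, xa, b, yb, reg=0.05):
        # Entropically regularized back-end for subsampled_ot above (p = 2):
        # squared Euclidean cost matrix, Sinkhorn cost, then the p-th root.
        M = ot.dist(xa, yb, metric='euclidean') ** 2
        return float(ot.sinkhorn2(a, b, M, reg)) ** 0.5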
All in all, subsampling proves to be a general, powerful and versatile tool that can be used with virtually any optimal transport solver as a back-end, and it comes with both theoretical approximation error guarantees and convincing performance in practice. It remains a challenge to extend this method in a way that is specifically tailored to the geometry of the underlying space $\mathcal{X}$, which may result in further improvements.

Acknowledgments

Jörn Schrieber and Max Sommerfeld acknowledge support by the German Research Foundation
(DFG) through the RTG 2088. Yoav Zemel is supported by Swiss National Science Foundation
Grant #178220. We thank an Associate Editor and three reviewers for insightful comments on
previous versions of this work.

Appendix A. Proofs
This Appendix contains proofs of all our theoretical results.

A.1. Proof of Theorem 1


Proof strategy  The method used in this proof has been employed before to bound the mean rate of convergence of the empirical Wasserstein distance on a general metric space $(\mathcal{X}, d)$ (Boissard and Le Gouic, 2014; Fournier and Guillin, 2015). In essence, it constructs a tree on the space $\mathcal{X}$ and bounds the Wasserstein distance with some transport metric in the tree, which can either be computed explicitly or bounded easily (see also Heinrich and Kahn, 2018, who use a coarse-graining tree in order to bound the Wasserstein distance in the context of mixture models). Our construction is specifically tailored to finite spaces, and allows us to obtain a better dependence on $N = |\mathcal{X}|$ in Theorem 3 while preserving the rate $S^{-1/2}$.
More precisely, in our case of finite spaces, let $T$ be a spanning tree on $\mathcal{X}$ (that is, a tree with vertex set $\mathcal{X}$ and edge lengths given by the metric $d$ on $\mathcal{X}$) and $d_T$ the metric on $\mathcal{X}$ defined by the path lengths in the tree. Clearly, the tree metric $d_T$ dominates the original metric $d$ on $\mathcal{X}$ and hence $W_p(r, s) \le W_p^T(r, s)$ for all $r, s \in \mathcal{P}(\mathcal{X})$, where $W_p^T$ denotes the Wasserstein distance evaluated with respect to the tree metric. The goal is now to bound $\mathbb{E}\big[(W_p^T(\hat{r}_S, r))^p\big]$. We refer to Tameling and Munk (2018) for examples and comparisons of different spanning trees on two-dimensional grids.
Assume $T$ is rooted at $\mathrm{root}(T) \in \mathcal{X}$. Then, for $x \in \mathcal{X}$ with $x \neq \mathrm{root}(T)$ we may define $\mathrm{par}(x) \in \mathcal{X}$ as the immediate neighbor of $x$ in the unique path connecting $x$ and $\mathrm{root}(T)$. We set $\mathrm{par}(\mathrm{root}(T)) = \mathrm{root}(T)$. We also define $\mathrm{children}(x)$ as the set of vertices $x' \in \mathcal{X}$ such that there exists a sequence $x' = x_1, \ldots, x_l = x \in \mathcal{X}$ with $\mathrm{par}(x_j) = x_{j+1}$ for $j = 1, \ldots, l-1$. Note that with this definition $x \in \mathrm{children}(x)$. Additionally, define the linear operator $S_T : \mathbb{R}^{\mathcal{X}} \to \mathbb{R}^{\mathcal{X}}$,
$$(S_T u)_x = \sum_{x' \in \mathrm{children}(x)} u_{x'}. \tag{10}$$

Building the tree  We build a $q$-ary tree on $\mathcal{X}$. To this end, we split $\mathcal{X}$ into $l_{\max} + 2$ groups and build the tree in such a way that a node at level $l+1$ has a unique parent at level $l$, with edge length $q^{-l}\,\mathrm{diam}(\mathcal{X})$. The formal construction follows.
For $l \in \{0, \ldots, l_{\max}\}$ we let $Q_l \subset \mathcal{X}$ be the center points of a $q^{-l}\,\mathrm{diam}(\mathcal{X})$ covering of $\mathcal{X}$, that is,
$$\bigcup_{x \in Q_l} B(x, q^{-l}\,\mathrm{diam}(\mathcal{X})) = \mathcal{X}, \quad\text{and}\quad |Q_l| = \mathcal{N}(\mathcal{X}, q^{-l}\,\mathrm{diam}(\mathcal{X})),$$
where $B(x, \epsilon) = \{x' \in \mathcal{X} : d(x, x') \le \epsilon\}$. Additionally set $Q_{l_{\max}+1} = \mathcal{X}$. Now define $\tilde{Q}_l = Q_l \times \{l\}$; we will build a tree structure on $\bigcup_{l=0}^{l_{\max}+1} \tilde{Q}_l$.
Since we must have $|\tilde{Q}_0| = 1$ we can take this element as the root. Assume now that the tree already contains all elements of $\bigcup_{j=0}^{l} \tilde{Q}_j$. Then, we add to the tree all elements of $\tilde{Q}_{l+1}$ by choosing for $(x, l+1) \in \tilde{Q}_{l+1}$ (exactly one) parent element $(x', l) \in \tilde{Q}_l$ such that $d(x, x') \le q^{-l}\,\mathrm{diam}(\mathcal{X})$. This is possible, since $Q_l$ is a $q^{-l}\,\mathrm{diam}(\mathcal{X})$ covering of $\mathcal{X}$. We set the length of the connecting edge to $q^{-l}\,\mathrm{diam}(\mathcal{X})$.
In this fashion we obtain a spanning tree $T$ of $\bigcup_{l=0}^{l_{\max}+1} \tilde{Q}_l$ and a partition $\{\tilde{Q}_l\}_{l=0,\ldots,l_{\max}+1}$. About this tree we know that:
• it is in fact a tree. First, it is connected, because the construction starts with one connected component and in every subsequent step all additional vertices are connected to it. Second, it contains no cycles. To see this let $((x_1, l_1), \ldots, (x_K, l_K))$ be a cycle in $T$. Without loss of generality we may assume $l_1 = \min\{l_1, \ldots, l_K\}$. Then $(x_1, l_1)$ must have at least two edges connecting it to vertices in a $\tilde{Q}_l$ with $l \ge l_1$, which is impossible by construction.
• $|\tilde{Q}_l| = \mathcal{N}(\mathcal{X}, q^{-l}\,\mathrm{diam}(\mathcal{X}))$ for $0 \le l \le l_{\max}$.
• $d(x, \mathrm{par}(x)) = q^{-l+1}\,\mathrm{diam}(\mathcal{X})$ whenever $x \in \tilde{Q}_l$, $l \ge 1$.
• $d(x, x') \le d_T((x, l_{\max}+1), (x', l_{\max}+1))$.
Since the leaves of $T$ can be identified with $\mathcal{X}$, a measure $r \in \mathcal{P}(\mathcal{X})$ canonically defines a probability measure $r^T \in \mathcal{P}(T)$ for which $r^T_{(x,\, l_{\max}+1)} = r_x$ and $r^T_{(x,\, l)} = 0$ for $l \le l_{\max}$. In slight abuse of notation we will denote the measure $r^T$ simply by $r$. With this notation, we have $W_p(r, s) \le W_p^T(r, s)$ for all $r, s \in \mathcal{P}(\mathcal{X})$.
Wasserstein distance on trees  Note also that $T$ is ultra-metric, that is, all its leaves are at the same distance from the root. For trees of this type, we can define a height function $h : \mathcal{X} \to [0, \infty)$ such that $h(x) = 0$ if $x \in \mathcal{X}$ is a leaf and $h(\mathrm{par}(x)) - h(x) = d_T(x, \mathrm{par}(x))$ for all $x \neq \mathrm{root}(T)$. There is an explicit formula for the Wasserstein distance on ultra-metric trees (Kloeckner, 2015). Indeed, if $r, s \in \mathcal{P}(\mathcal{X})$ then
$$(W_p^T(r, s))^p = 2^{p-1} \sum_{x \in \mathcal{X}} \big(h(\mathrm{par}(x))^p - h(x)^p\big)\, \big|(S_T r)_x - (S_T s)_x\big|, \tag{11}$$
with the operator $S_T$ as defined in (10).
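As a quick sanity check of (11) (an illustration we add here; it is not needed for the argument), consider the smallest nontrivial ultra-metric tree: a root at height $h_0$ with two leaves $x$ and $y$, each attached by an edge of length $h_0$, and take $r = \delta_x$, $s = \delta_y$. Then $(S_T r)_x = 1 = (S_T s)_y$ and $(S_T r)_y = 0 = (S_T s)_x$, while the root contributes nothing since $\mathrm{par}(\mathrm{root}(T)) = \mathrm{root}(T)$. Formula (11) gives
$$(W_p^T(\delta_x, \delta_y))^p = 2^{p-1}\big[(h_0^p - 0)\cdot 1 + (h_0^p - 0)\cdot 1\big] = (2h_0)^p,$$
that is, $W_p^T(\delta_x, \delta_y) = 2h_0 = d_T(x, y)$, as it must be for two point masses.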
For the tree $T$ constructed above and $x \in \tilde{Q}_l$ with $l = 0, \ldots, l_{\max}$ we have
$$h(x) = \sum_{j=l}^{l_{\max}} q^{-j}\,\mathrm{diam}(\mathcal{X}),$$
and therefore $\mathrm{diam}(\mathcal{X})\, q^{-l} \le h(x) \le 2\,\mathrm{diam}(\mathcal{X})\, q^{-l}$. This yields
$$h(\mathrm{par}(x))^p - h(x)^p \le (\mathrm{diam}(\mathcal{X}))^p\, q^{-(l-2)p}.$$

Then (11) yields
$$\mathbb{E}\big[W_p^p(\hat{r}_S, r)\big] \le 2^{p-1} q^{2p} (\mathrm{diam}(\mathcal{X}))^p \sum_{l=0}^{l_{\max}+1} q^{-lp} \sum_{x \in \tilde{Q}_l} \mathbb{E}\big|(S_T \hat{r}_S)_x - (S_T r)_x\big|.$$

Since $(S_T \hat{r}_S)_x$ is the mean of $S$ i.i.d. Bernoulli variables with expectation $(S_T r)_x$ we have
$$\sum_{x \in \tilde{Q}_l} \mathbb{E}\big|(S_T \hat{r}_S)_x - (S_T r)_x\big| \le \sum_{x \in \tilde{Q}_l} \sqrt{\frac{(S_T r)_x (1 - (S_T r)_x)}{S}} \le \frac{1}{\sqrt{S}} \Bigg(\sum_{x \in \tilde{Q}_l} (S_T r)_x\Bigg)^{1/2} \Bigg(\sum_{x \in \tilde{Q}_l} \big(1 - (S_T r)_x\big)\Bigg)^{1/2} \le \sqrt{|\tilde{Q}_l| / S},$$
using Hölder's inequality and the fact that $\sum_{x \in \tilde{Q}_l} (S_T r)_x = 1$ for all $l = 0, \ldots, l_{\max}+1$. This finally yields
$$\mathbb{E}\big[W_p^p(\hat{r}_S, r)\big] \le 2^{p-1} q^{2p} (\mathrm{diam}(\mathcal{X}))^p \Bigg(q^{-(l_{\max}+1)p} \sqrt{N} + \sum_{l=0}^{l_{\max}} q^{-lp} \sqrt{\mathcal{N}(\mathcal{X}, q^{-l}\,\mathrm{diam}(\mathcal{X}))}\Bigg) \Big/ \sqrt{S} \le \mathcal{E}_q(\mathcal{X}, p) / \sqrt{S}.$$

Covering by arbitrary sets  We now explain how to obtain the second formula for $\mathcal{E}_q$ as stated in Remark 2. The idea is to define the coverings with arbitrary sets, not necessarily balls. Let
$$\mathcal{N}_1(\mathcal{X}, \delta) = \inf\{m : \exists A_1, \ldots, A_m \subseteq \mathcal{X},\ \mathrm{diam}(A_i) \le 2\delta,\ \cup A_i \supseteq \mathcal{X}\}.$$
Since balls satisfy the diameter condition, $\mathcal{N}_1 \le \mathcal{N}$. Furthermore, if $\mathcal{X}' \supseteq \mathcal{X}$, then $\mathcal{N}_1(\mathcal{X}, \delta) \le \mathcal{N}_1(\mathcal{X}', \delta)$, which is not the case for $\mathcal{N}$. For example, let $\mathcal{X} = \{-1, 1\} \subset \{-1, 0, 1\} = \mathcal{X}'$ and observe that
$$\mathcal{N}_1(\mathcal{X}, 1) = 1 = \mathcal{N}_1(\mathcal{X}', 1), \quad\text{but}\quad \mathcal{N}(\mathcal{X}, 1) = 2 > 1 = \mathcal{N}(\mathcal{X}', 1).$$
The tree construction with respect to the new covering numbers is done in a similar manner. For each $0 \le l \le l_{\max}$ let $Q_l'$ be a collection of disjoint sets of diameter $2q^{-l}\,\mathrm{diam}(\mathcal{X})$ that cover $\mathcal{X}$ with $|Q_l'| = \mathcal{N}_1(\mathcal{X}, q^{-l}\,\mathrm{diam}(\mathcal{X}))$. Let $Q_l = \{x_1, \ldots, x_{|Q_l'|}\} \subseteq \mathcal{X}$ be an arbitrary collection of representatives from the sets in $Q_l'$. Such representatives exist by minimality of $|Q_l'|$, and they are distinct because the sets in $Q_l'$ are disjoint. Additionally set $Q_{l_{\max}+1} = \mathcal{X}$. Construct the tree in the same way, except that now we only have the bound $d(x, x') \le 2q^{-l}\,\mathrm{diam}(\mathcal{X})$ for $(x, l+1) \in \tilde{Q}_{l+1}$ and a corresponding parent $(x', l) \in \tilde{Q}_l$, so we need to set the edge length to be $2q^{-l}\,\mathrm{diam}(\mathcal{X})$, twice as much as in the original construction. The proof then goes through in the same way, with an extra factor $2^p$. We obtain the alternative bound
$$\mathcal{E}_q = 2^{2p-1} q^{2p} (\mathrm{diam}(\mathcal{X}))^p \Bigg(q^{-(l_{\max}+1)p} \sqrt{N} + \sum_{l=0}^{l_{\max}} q^{-lp} \sqrt{\mathcal{N}_1(\mathcal{X}, q^{-l}\,\mathrm{diam}(\mathcal{X}))}\Bigg).$$
In comparison with (6), we replaced $\mathcal{N}$ by $\mathcal{N}_1$. The price to pay for this is an additional factor of $2^p$.


A.2. Proof of Theorem 3


We may assume without loss of generality that $\mathcal{X} \subseteq [0, \mathrm{diam}(\mathcal{X})]^D$. The covering numbers of the cube with Euclidean balls behave badly in high dimensions, so it will prove useful to replace the Euclidean norm by the infinity norm $\|x\|_\infty = \max_i |x_i|$, $x = (x_1, \ldots, x_D) \in \mathbb{R}^D$. With this norm we have $\mathcal{N}([0, \mathrm{diam}(\mathcal{X})]^D, \epsilon\,\mathrm{diam}(\mathcal{X}), \|\cdot\|_\infty) \le \lceil 1/(2\epsilon) \rceil^D$. If $q$ is an integer, then
$$\mathcal{N}(\mathcal{X}, q^{-l}\,\mathrm{diam}(\mathcal{X}), \|\cdot\|_\infty) \le \mathcal{N}([0, \mathrm{diam}(\mathcal{X})]^D, q^{-l}\,\mathrm{diam}(\mathcal{X})/2, \|\cdot\|_\infty) \le \lceil q^l \rceil^D = q^{lD}.$$
This yields
$$\sum_{l=0}^{l_{\max}} q^{-lp} \sqrt{\mathcal{N}(\mathcal{X}, q^{-l}\,\mathrm{diam}(\mathcal{X}))} \le \sum_{l=0}^{l_{\max}} q^{l(D/2 - p)} = \begin{cases} \big(1 - q^{(l_{\max}+1)(D/2-p)}\big)\big/\big(1 - q^{D/2-p}\big) & D \neq 2p, \\ l_{\max} + 1 & D = 2p. \end{cases}$$

Denote for brevity $p' = D/2 - p$ and plug this into (6) to bound $S^{1/2}\, \mathbb{E}\big[W_p^p(\hat{r}_S, r, \|\cdot\|_\infty)\big]$ by
$$2^{p-1} q^{2p} (\mathrm{diam}(\mathcal{X}))^p \left[q^{-p(l_{\max}+1)} \sqrt{N} + \begin{cases} \big(1 - q^{(l_{\max}+1)p'}\big)\big/\big(1 - q^{p'}\big) & p' \neq 0, \\ l_{\max} + 1 & p' = 0. \end{cases}\right]$$
If $p' < 0$, then let $l_{\max} \to \infty$. Otherwise, choose $l_{\max} = \lfloor D^{-1} \log_q N \rfloor$ (giving the best dependence on $N$), so that the element inside the square brackets is smaller than
$$\begin{cases} 1/(1 - q^{p'}) & p' < 0, \\ 2 + D^{-1} \log_q N & p' = 0, \\ N^{1/2 - p/D} + \big(N^{1/2 - p/D} q^{p'} - 1\big)\big/\big(q^{p'} - 1\big) & p' > 0 \end{cases} \;\le\; \begin{cases} 1/(1 - q^{p'}) & p' < 0, \\ 2 + D^{-1} \log_q N & p' = 0, \\ \big(2q^{p'} - 1\big) N^{1/2 - p/D}\big/\big(q^{p'} - 1\big) & p' > 0. \end{cases} \tag{12}$$

The right-hand side is $C_{D,p}(N)$ for $q = 2$. To get back to the Euclidean norm use $\|a\|_2 \le \sqrt{D}\, \|a\|_\infty$, so that
$$\mathbb{E}\big[W_p^p(\hat{r}_S, r)\big] \le D^{p/2}\, \mathbb{E}\big[W_p^p(\hat{r}_S, r, \|\cdot\|_\infty)\big] \le D^{p/2}\, 2^{p-1} q^{2p} (\mathrm{diam}(\mathcal{X}))^p\, C_{D,p}(N) / \sqrt{S},$$
which is the desired conclusion.
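To put the constant into perspective (a numerical illustration we add here; it is not part of the proof): for two-dimensional images compared with $p = 1$ we have $D = 2p$ and hence $p' = 0$, so (12) with $q = 2$ gives $C_{2,1}(N) = 2 + \tfrac{1}{2}\log_2 N$; for a $128 \times 128$ image, $N = 16384$ and $C_{2,1}(N) = 2 + 14/2 = 9$. The constant thus grows only logarithmically in the resolution, matching the very mild dependence on image size observed in the experiments.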

Lemma 1. (a) Let $\tilde{C}_{D,p}(q, N)$ denote the right-hand side of (12). Then the minimum of the function $q \mapsto q^{2p}\, \tilde{C}_{D,p}(q, N)$ on $[2, \infty)$ is attained at $q = 2$.
(b) Let $q \ge 2$, let $p, D$ be integers, and set $p' = D/2 - p$. If $p' < 0$, then $1/(1 - q^{p'}) \le 2 + \sqrt{2}$, and if $p' > 0$, then $2 + 1/(q^{p'} - 1) \le 3 + \sqrt{2}$.
Proof. We begin with (b). If $p' < 0$ then $1/(1 - q^{p'})$ is decreasing in $q$ and increasing in $p'$. The integer constraints on $D$ and $p$ imply that the maximal value $p'$ can attain is $-0.5$. The smallest value $q$ can attain is $2$. Thus
$$1/(1 - q^{p'}) \le 1/(1 - 2^{-0.5}) = \frac{\sqrt{2}}{\sqrt{2} - 1} = \sqrt{2}(\sqrt{2} + 1) = 2 + \sqrt{2}.$$
When $p' > 0$ the term $2 + 1/(q^{p'} - 1)$ is decreasing in $p' \ge 0.5$ and in $q \ge 2$, so it is bounded by $2 + 1/(\sqrt{2} - 1) = 3 + \sqrt{2}$.

To prove (a) we shall differentiate the function $q^{2p}\, \tilde{C}_{D,p}(q, N)$ with respect to $q$ and show that the derivative is positive for all $q \ge 2$ and all $p, D, N \ge 1$.


For negative $p'$ consider the function
$$f_1(q) = \frac{q^{2p}}{1 - q^{p'}}, \qquad q \ge 2;\ p \ge 1;\ p' < 0.$$
Its derivative is
$$f_1'(q) = \frac{2p q^{2p-1}(1 - q^{p'}) + p' q^{p'-1} q^{2p}}{(1 - q^{p'})^2} = \frac{q^{2p-1}}{1 - q^{p'}} \left[2p + \frac{p' q^{p'}}{1 - q^{p'}}\right].$$
It suffices to show that the term in square brackets is positive, since $1 - q^{p'} > 0$. Let us bound $q^{p'}$ and the denominator $(1 - q^{p'})^{-1}$. Since $e^x \ge 1 + x$ for $x \ge 0$, $e^{-x} \le 1/(1 + x)$, and setting $x = -p' \log q$ gives
$$q^{p'} = e^{p' \log q} \le \frac{1}{1 - p' \log q}.$$
Hence
$$1 - q^{p'} \ge 1 - \frac{1}{1 - p' \log q} = \frac{1 - p' \log q - 1}{1 - p' \log q} = \frac{-p' \log q}{1 - p' \log q},$$
so that
$$\frac{q^{p'}}{1 - q^{p'}} \le \frac{1}{1 - p' \log q} \cdot \frac{1 - p' \log q}{-p' \log q} = \frac{1}{-p' \log q}.$$
Conclude that, since $p' < 0$,
$$2p + \frac{p' q^{p'}}{1 - q^{p'}} \ge 2p + p' \cdot \frac{1}{-p' \log q} = 2p - \frac{1}{\log q} \ge 2p - \frac{1}{\log 2} \ge 2 - \frac{1}{\log 2} > 0.$$

For $p' = 0$ consider the function
$$f_2(q) = q^{2p}\big(2 + D^{-1} \log_q N\big) = 2q^{2p} + \frac{q^{2p} \log N}{D \log q}, \qquad q \ge 2;\ D = 2p \ge 2.$$
Its derivative is
$$f_2'(q) = 4p q^{2p-1} + \frac{\log N}{D (\log q)^2}\big(2p q^{2p-1} \log q - q^{2p} q^{-1}\big) = q^{2p-1}\left[4p + \frac{\log N}{D (\log q)^2}\big(2p \log q - 1\big)\right] > 0$$
since $2p \log q \ge 2 \log 2 > 1$.


For $p' > 0$ consider the function
$$f_3(q) = q^{2p}\big[2 + 1/(q^{p'} - 1)\big] = 2q^{2p} + \frac{q^{2p}}{q^{p'} - 1} = 2q^{2p} - f_1(q), \qquad q \ge 2;\ p \ge 1;\ p' > 0.$$
The derivative is
$$4p q^{2p-1} - \frac{q^{2p-1}}{1 - q^{p'}}\left[2p + \frac{p' q^{p'}}{1 - q^{p'}}\right] = 4p q^{2p-1} + \frac{q^{2p-1}}{q^{p'} - 1}\left[2p - \frac{p' q^{p'}}{q^{p'} - 1}\right].$$
This function is more complicated and we need to split into cases according to small, large or moderate values of $p'$.
Case 1: $p' \le 0.5$. Then the negative term can be bounded using $q^{p'} - 1 \ge p' \log q$ as
$$\frac{p' q^{p'}}{q^{p'} - 1} = p' + \frac{p'}{q^{p'} - 1} \le p' + \frac{1}{\log q} \le p' + \frac{1}{\log 2} \le 0.5 + \frac{1}{\log 2} < 2 \le 2p.$$


Thus $f_3'(q) \ge 0$ in this case.
To deal with larger values of $p'$ rewrite the derivative as
$$q^{2p-1}\left[4p + \frac{2p}{q^{p'} - 1} - \frac{p' q^{p'}}{(q^{p'} - 1)^2}\right],$$
and bound the negative part:
$$\frac{p' q^{p'}}{(q^{p'} - 1)^2} = \frac{p'}{q^{p'} - 1} + \frac{p'}{(q^{p'} - 1)^2} \le \frac{1}{\log q} + \frac{1}{(q^{p'} - 1) \log q}.$$
Case 2: $p' \ge 1$. Then $q^{p'} - 1 \ge 1$, so this is smaller than
$$\frac{1}{\log 2} + \frac{1}{\log 2} = \frac{2}{\log 2} < 4 \le 4p.$$
Hence the derivative is positive in this case.


Case 3: $p' \ge 1/2$ and $q \ge e$. Then this is smaller than
$$1 + \frac{1}{e^{1/2} - 1} \le 1 + \frac{1}{\sqrt{2} - 1} = 2 + \sqrt{2} < 4 \le 4p.$$
Hence the derivative is positive in this case.
Case 4: $q \le e$ and $p' \in [1/2, 1]$. The negative term is bounded above by
$$\frac{1}{\log q} + \frac{1}{(q^{p'} - 1) \log q} \le \frac{1}{\log 2} + \frac{1}{(q^{p'} - 1) \log 2} \le \frac{1}{\log 2} + \frac{1}{(\sqrt{2} - 1) \log 2} = \frac{2 + \sqrt{2}}{\log 2} \approx 4.93,$$
whereas the positive term can be bounded below as
$$4p + \frac{2p}{q^{p'} - 1} \ge 4 + \frac{2}{e - 1} \approx 5.16 > 4.93.$$

This completes the proof.

A.3. Proof of Theorem 4


We introduce some additional notation. For $(x, y), (x', y') \in \mathcal{X}^2$ we set
$$d_{\mathcal{X}^2}\big((x, y), (x', y')\big) = \big\{d^p(x, x') + d^p(y, y')\big\}^{1/p}.$$
We further define the function $Z : (\mathcal{X}^2)^{SB} \to \mathbb{R}$ via
$$\big((x_{11}, y_{11}), \ldots, (x_{SB}, y_{SB})\big) \mapsto \frac{1}{B} \sum_{i=1}^B W_p\Bigg(\frac{1}{S} \sum_{j=1}^S \delta_{x_{ji}},\ \frac{1}{S} \sum_{j=1}^S \delta_{y_{ji}}\Bigg) - W_p(r, s).$$

Since $W_p^p(\cdot, \cdot)$ is jointly convex (Villani, 2008, Theorem 4.8),
$$W_p\Bigg(\frac{1}{S} \sum_{j=1}^S \delta_{x_j},\ \frac{1}{S} \sum_{j=1}^S \delta_{y_j}\Bigg) \le \Bigg\{\frac{1}{S} \sum_{j=1}^S W_p^p(\delta_{x_j}, \delta_{y_j})\Bigg\}^{1/p} = S^{-1/p} \Bigg\{\sum_{j=1}^S d^p(x_j, y_j)\Bigg\}^{1/p}.$$


Our first goal is to show that $Z$ is Lipschitz continuous. Let $((x_{11}, y_{11}), \ldots, (x_{SB}, y_{SB}))$ and $((x'_{11}, y'_{11}), \ldots, (x'_{SB}, y'_{SB}))$ be arbitrary elements of $(\mathcal{X}^2)^{SB}$. Then, using the reverse triangle inequality and the relations above,
$$\big|Z\big((x_{11}, y_{11}), \ldots, (x_{SB}, y_{SB})\big) - Z\big((x'_{11}, y'_{11}), \ldots, (x'_{SB}, y'_{SB})\big)\big|$$
$$\le \frac{1}{B} \sum_{i=1}^B \Bigg|W_p\Bigg(\frac{1}{S} \sum_{j=1}^S \delta_{x_{ji}}, \frac{1}{S} \sum_{j=1}^S \delta_{y_{ji}}\Bigg) - W_p\Bigg(\frac{1}{S} \sum_{j=1}^S \delta_{x'_{ji}}, \frac{1}{S} \sum_{j=1}^S \delta_{y'_{ji}}\Bigg)\Bigg|$$
$$\le \frac{1}{B} \sum_{i=1}^B \Bigg\{W_p\Bigg(\frac{1}{S} \sum_{j=1}^S \delta_{x_{ji}}, \frac{1}{S} \sum_{j=1}^S \delta_{x'_{ji}}\Bigg) + W_p\Bigg(\frac{1}{S} \sum_{j=1}^S \delta_{y_{ji}}, \frac{1}{S} \sum_{j=1}^S \delta_{y'_{ji}}\Bigg)\Bigg\}$$
$$\le \frac{S^{-1/p}}{B} \sum_{i=1}^B \Bigg[\Bigg\{\sum_{j=1}^S d^p(x_{ji}, x'_{ji})\Bigg\}^{1/p} + \Bigg\{\sum_{j=1}^S d^p(y_{ji}, y'_{ji})\Bigg\}^{1/p}\Bigg]$$
$$\le \frac{S^{-1/p}}{B} (2B)^{\frac{p-1}{p}} \Bigg\{\sum_{i,j} d^p_{\mathcal{X}^2}\big((x_{ji}, y_{ji}), (x'_{ji}, y'_{ji})\big)\Bigg\}^{1/p}.$$
Hence, $Z/2$ is Lipschitz continuous with constant $(SB)^{-1/p}$ relative to the $p$-metric generated by $d_{\mathcal{X}^2}$ on $(\mathcal{X}^2)^{SB}$.
For $\tilde{r} \in \mathcal{P}(\mathcal{X}^2)$ let $H(\cdot \mid \tilde{r})$ denote the relative entropy with respect to $\tilde{r}$. Since $\mathcal{X}^2$ has $d_{\mathcal{X}^2}$-diameter $2^{1/p}\,\mathrm{diam}(\mathcal{X})$, we have by Bolley and Villani (2005, Particular case 2.5, page 337) that for every $\tilde{s}$
$$W_p(\tilde{r}, \tilde{s}) \le \big\{8\,\mathrm{diam}(\mathcal{X})^{2p}\, H(\tilde{r} \mid \tilde{s})\big\}^{1/(2p)}. \tag{13}$$
If $X_{11}, \ldots, X_{SB} \sim r$ and $Y_{11}, \ldots, Y_{SB} \sim s$ are all independent, we have
$$Z\big((X_{11}, Y_{11}), \ldots, (X_{SB}, Y_{SB})\big) \sim \hat{W}_p^{(S)}(r, s) - W_p(r, s).$$
The Lipschitz continuity of $Z$ and the transportation inequality (13) yield a concentration result for this random variable. In fact, by Gozlan and Léonard (2007, Lemma 6) we have
$$P\Big\{\hat{W}_p^{(S)}(r, s) - W_p(r, s) \ge \mathbb{E}\big[\hat{W}_p^{(S)}(r, s) - W_p(r, s)\big] + z\Big\} \le \exp\left(\frac{-SB z^{2p}}{8\,\mathrm{diam}(\mathcal{X})^{2p}}\right)$$
for all $z \ge 0$. Note that $-Z$ is Lipschitz continuous as well and hence, by the union bound,
$$P\Big\{\big|\hat{W}_p^{(S)}(r, s) - W_p(r, s)\big| \ge \Big|\mathbb{E}\big[\hat{W}_p^{(S)}(r, s)\big] - W_p(r, s)\Big| + z\Big\} \le 2\exp\left(\frac{-SB z^{2p}}{8\,\mathrm{diam}(\mathcal{X})^{2p}}\right).$$
Now, with the reverse triangle inequality, Jensen's inequality and Theorem 1,
$$\Big|\mathbb{E}\big[\hat{W}_p^{(S)}(r, s)\big] - W_p(r, s)\Big| \le \mathbb{E}\big[W_p(\hat{r}_S, r) + W_p(\hat{s}_S, s)\big] \le \big\{\mathbb{E}\, W_p^p(\hat{r}_S, r)\big\}^{1/p} + \big\{\mathbb{E}\, W_p^p(\hat{s}_S, s)\big\}^{1/p} \le 2\,\mathcal{E}_q^{1/p} / S^{1/(2p)}.$$
Together with the last concentration inequality above, this concludes the proof of Theorem 4.

A.4. Proof of Theorem 5


Denote $V = \big|\hat{W}_p^{(S)}(r, s) - W_p(r, s)\big|$ and $C = 2\,\mathcal{E}_q^{1/p} / S^{1/(2p)} \ge 0$, and observe that
$$\mathbb{E}\big[V^2\big] = \int_0^\infty P\big(V > \sqrt{t}\big)\, dt = 2\int_0^\infty P(V > s)\, s\, ds = 2\int_{-C}^\infty P(V > z + C)(z + C)\, dz$$
$$\le 2\int_{-C}^{C} (z + C)\, dz + 4\int_C^\infty P(V > z + C)\, z\, dz \le 4C^2 + 8\int_C^\infty z \exp\left(-\frac{SB z^{2p}}{8\,\mathrm{diam}(\mathcal{X})^{2p}}\right) dz,$$

by Theorem 4. Changing variables and using the inequality $y^{2p} \ge y^2$ (valid for $y, p \ge 1$) gives
$$8\int_C^\infty z \exp\left(-\frac{SB z^{2p}}{8\,\mathrm{diam}(\mathcal{X})^{2p}}\right) dz = 8C^2 \int_1^\infty y \exp\left(-\frac{SB (Cy)^{2p}}{8\,\mathrm{diam}(\mathcal{X})^{2p}}\right) dy$$
$$\le 8C^2 \int_1^\infty y \exp\left(-\frac{SB C^{2p} y^2}{8\,\mathrm{diam}(\mathcal{X})^{2p}}\right) dy = 8C^2\, \frac{4\,\mathrm{diam}(\mathcal{X})^{2p}}{SB C^{2p}} \exp\left(-\frac{SB C^{2p}}{8\,\mathrm{diam}(\mathcal{X})^{2p}}\right)$$
$$= 4C^2\, \frac{(\mathrm{diam}(\mathcal{X}))^{2p}}{2^{2p-3}\, \mathcal{E}_q^2\, B} \exp\left(-\frac{4^p\, \mathcal{E}_q^2\, B}{8\,\mathrm{diam}(\mathcal{X})^{2p}}\right),$$
where we have used $C^2 = 4\,\mathcal{E}_q^{2/p}\, S^{-1/p}$. Deduce that
$$\mathbb{E}\left[\Big(\hat{W}_p^{(S)}(r, s) - W_p(r, s)\Big)^2\right] \le 16\,\mathcal{E}_q^{2/p} \left\{1 + \frac{(\mathrm{diam}(\mathcal{X}))^{2p}}{2^{2p-3}\, \mathcal{E}_q^2\, B} \exp\left(-\frac{4^p\, \mathcal{E}_q^2\, B}{8\,\mathrm{diam}(\mathcal{X})^{2p}}\right)\right\} S^{-1/p}.$$

Now note that (6) implies $\mathcal{E}_q^2 \ge 2^{6p-2}\, [\mathrm{diam}(\mathcal{X})]^{2p}$ and hence $[\mathrm{diam}(\mathcal{X})]^{2p} / [B\, 2^{2p-3}\, \mathcal{E}_q^2] \le 2^{5-8p} \le 1/8$, so the term in braces is smaller than $1 + 1/8$. Consequently the mean squared error is bounded by $18\,\mathcal{E}_q^{2/p}\, S^{-1/p}$.
Similar computations show that $\mathbb{E}\Big[\big|\hat{W}_p^{(S)}(r, s) - W_p(r, s)\big|^\alpha\Big] = O\big(S^{-\alpha/(2p)}\big)$ for all $0 \le \alpha \le 2p$.

References
Martial Agueh and Guillaume Carlier. Barycenters in the Wasserstein space. SIAM J. Math. Anal.,
43(2):904–924, 2011.
Jason Altschuler, Jonathan Weed, and Philippe Rigollet. Near-linear time approximation algorithms
for optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing Sys-
tems, pages 1964–1974, 2017.
Dimitri P. Bertsekas. Auction algorithms for network flow problems: A tutorial introduction. Com-
putational Optimization and Applications, 1(1):7–66, 1992.
Emmanuel Boissard and Thibaut Le Gouic. On the mean speed of convergence of empirical and
occupation measures in Wasserstein distance. Ann. Inst. H. Poincaré Probab. Statist., 50(2):
539–563, 2014.
François Bolley and Cédric Villani. Weighted Csiszár-Kullback-Pinsker inequalities and applications
to transportation inequalities. Annales de La Faculté Des Sciences de Toulouse: Mathématiques,
14(3):331–352, 2005.
Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and Radon Wasserstein
barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard
Schoelkopf. From optimal transport to generative modeling: the VEGAN cookbook. 2017. URL
https://arxiv.org/abs/1705.07642.
Hector Corrada Bravo and Stefan Theussl. Rcplex: R interface to cplex, 2016. URL https://CRAN.R-project.org/package=Rcplex. R package version 0.3-3.

Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, pages 2292–2300, 2013.
Marco Cuturi and Arnaud Doucet. Fast computation of Wasserstein barycenters. In International
Conference on Machine Learning, pages 685–693, 2014.
Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal transport:
Complexity by accelerated gradient descent is better than by Sinkhorn’s algorithm. In Interna-
tional Conference on Machine Learning, pages 1367–1376. 2018.
Steven N. Evans and Frederick A. Matsen. The phylogenetic Kantorovich–Rubinstein metric for
environmental sequence samples. J. R. Stat. Soc. B, 74(3):569–592, 2012.
Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the
empirical measure. Probab. Theory Relat. Fields, 162(3-4):707–738, 2015.
Carsten Gottschlich and Dominic Schuhmacher. The Shortlist method for fast computation of the
earth mover’s distance and finding optimal solutions to transportation problems. PLoS ONE, 9
(10):e110214, 2014.
Nathael Gozlan and Christian Léonard. A large deviation approach to some transportation cost
inequalities. Probab. Theory Relat. Fields, 139(1-2):235–283, 2007.
Philippe Heinrich and Jonas Kahn. Strong identifiability and optimal minimax rates for finite
mixture estimation. Ann. Stat., 46(6A):2844–2870, 2018.
Leonid Vitaliyevich Kantorovich. On the translocation of masses. (Dokl.) Acad. Sci. URSS 37, 3:
199–201, 1942.
Benoît R. Kloeckner. A geometric study of Wasserstein spaces: Ultrametrics. Mathematika, 61(1):
162–178, 2015.
Tianyi Lin, Nhat Ho, and Michael I Jordan. On efficient optimal transport: An analysis of greedy
and accelerated mirror descent algorithms. 2019. URL https://arxiv.org/abs/1901.06482.
Haibin Ling and Kazunori Okada. An efficient earth mover’s distance algorithm for robust histogram
comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):840–853,
2007.
Jialin Liu, Wotao Yin, Wuchen Li, and Yat Tin Chow. Multilevel optimal transport: A fast approx-
imation of Wasserstein-1 distances. 2018. URL https://arxiv.org/abs/1810.00118.
David G. Luenberger and Yinyu Ye. Linear and Nonlinear Programming. Springer, New York, 2008.
Axel Munk and Claudia Czado. Nonparametric validation of similar distributions and assessment
of goodness of fit. J. R. Stat. Soc. B, 60(1):223–241, 1998.
Kangyu Ni, Xavier Bresson, Tony Chan, and Selim Esedoglu. Local histogram based segmentation
using the Wasserstein distance. International Journal of Computer Vision, 84(1):97–111, 2009.
Ofir Pele and Michael Werman. Fast and robust earth mover’s distances. In IEEE 12th International
Conference on Computer Vision, pages 460–467, 2009.
Svetlozar T. Rachev and Ludger Rüschendorf. Mass Transportation Problems, Volume 1: Theory.
Springer, New York, 1998.
Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. The earth mover’s distance as a metric for
image retrieval. International Journal of Computer Vision, 40(2):99–121, 2000.


Brian E. Ruttenberg, Gabriel Luna, Geoffrey P. Lewis, Steven K. Fisher, and Ambuj K. Singh.
Quantifying spatial relationships from whole retinal images. Bioinformatics, 29(7):940–946, 2013.
Bernhard Schmitzer. A sparse multi-scale algorithm for dense optimal transport. Journal of Math-
ematical Imaging and Vision, 56(2):238–259, 2016.
Jörn Schrieber, Dominic Schuhmacher, and Carsten Gottschlich. DOTmark — a benchmark for
discrete optimal transport. IEEE Access, 5:271–282, 2016. doi: 10.1109/ACCESS.2016.2639065.

Dominic Schuhmacher, Carsten Gottschlich, and Bjoern Baehre. R-package transport: Optimal
transport in various forms, 2014. URL https://cran.r-project.org/package=transport. R
package version 0.6-3.
Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability
and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670, 2010.
Sameer Shirdhonkar and David W. Jacobs. Approximate earth mover’s distance in linear time. In
IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.
Max Sommerfeld and Axel Munk. Inference for empirical Wasserstein distances on finite spaces. J.
R. Stat. Soc. B, 80(1):219–238, 2018.

Carla Tameling and Axel Munk. Computational strategies for statistical inference based on empirical
optimal transport. In 2018 IEEE Data Science Workshop, pages 175–179. IEEE, 2018.
Jean-Louis Verger-Gaugry. Covering a ball with smaller equal balls in R^n. Discrete & Computational
Geometry, 33(1):143–155, 2005.

Cédric Villani. Optimal Transport: Old and New. Springer, New York, 2008.
Jianguo Zhang, Marcin Marszalek, Svetlana Lazebnik, and Cordelia Schmid. Local features and
kernels for classification of texture and object categories: A comprehensive study. International
Journal of Computer Vision, 73(2):213–238, 2007.
