
A Fast and Accurate Splitting Method for Optimal Transport:

Analysis and Implementation


Vien V. Mai Jacob Lindbäck Mikael Johansson
arXiv:2110.11738v1 [math.OC] 22 Oct 2021

KTH Royal Institute of Technology

October 25, 2021

Abstract
We develop a fast and reliable method for solving large-scale optimal transport (OT)
problems at an unprecedented combination of speed and accuracy. Built on the celebrated
Douglas-Rachford splitting technique, our method tackles the original OT problem directly
instead of solving an approximate regularized problem, as many state-of-the-art techniques
do. This allows us to provide sparse transport plans and avoid numerical issues of methods
that use entropic regularization. The algorithm has the same cost per iteration as the
popular Sinkhorn method, and each iteration can be executed efficiently, in parallel. The
proposed method enjoys an iteration complexity of O(1/ε) compared to the best-known
O(1/ε²) of the Sinkhorn method. In addition, we establish a linear convergence rate
for our formulation of the OT problem. We detail an efficient GPU implementation of
the proposed method that maintains a primal-dual stopping criterion at no extra cost.
Substantial experiments demonstrate the effectiveness of our method, both in terms of
computation times and robustness.

1 Introduction
We study the discrete optimal transport (OT) problem:

$$
\begin{array}{ll}
\underset{X \ge 0}{\text{minimize}} & \langle C, X \rangle \\
\text{subject to} & X \mathbf{1}_n = p, \quad X^{\top} \mathbf{1}_m = q,
\end{array}
\tag{1}
$$

where $X \in \mathbb{R}^{m\times n}_+$ is the transport plan, $C \in \mathbb{R}^{m\times n}_+$ is the cost matrix, and $p \in \mathbb{R}^m_+$ and $q \in \mathbb{R}^n_+$ are two discrete probability measures. OT has a very rich history in mathematics and
operations research dating back to at least the 18th century. By exploiting geometric properties
of the underlying ground space, OT provides a powerful and flexible way to compare probability
measures. It has quickly become a central topic in machine learning and has found countless
applications in tasks such as deep generative models [4], domain adaptation [12], and inference
of high-dimensional cell trajectories in genomics [31]; we refer to [29] for a more comprehensive
survey of OT theory and applications. However, the power of OT comes at the price of an
enormous computational cost for determining the optimal transport plan. Standard methods
for solving linear programs (LPs) suffer from a super-linear time complexity in terms of the
problem size [29]. Such methods are also challenging to parallelize on modern processing

hardware. Therefore, there has been substantial research in developing new efficient methods
for OT. This paper advances the state of the art in this direction.

1.1 Related work


Below we review some of the topics most closely related to our work.

Sinkhorn method The Sinkhorn (SK) method [13] aims to solve an approximation of (1)
in which the objective is replaced by a regularized version of the form $\langle C, X\rangle - \eta H(X)$. Here,
$H(X) = -\sum_{ij} X_{ij}\log(X_{ij})$ is an entropy function and $\eta > 0$ is a regularization parameter.
The Sinkhorn method defines the quantity $K = \exp(-C/\eta)$ and repeats the following steps

$$ u_k = p / (K v_{k-1}) \quad \text{and} \quad v_k = q / (K^{\top} u_k), $$

until $\|u_k \odot (K v_k) - p\| + \|v_k \odot (K^{\top} u_k) - q\|$ becomes small, then returns $\mathrm{diag}(u_k)\, K\, \mathrm{diag}(v_k)$.
The division (/) and multiplication (⊙) operators between two vectors are to be understood
entrywise. Each SK iteration is built from matrix-vector multiplies and element-wise arithmetic operations, and is hence readily parallelized on multi-core CPUs and GPUs. However,
due to the entropic regularization, SK suffers from numerical issues and can struggle to find
even moderately accurate solutions. This problem is even more prevalent in GPU implemen-
tations as most modern GPUs are built and optimized for single-precision arithmetic [23, 11].
Substantial care is therefore needed to select an appropriate η that is small enough to provide
a meaningful approximation, while avoiding numerical issues. In addition, the entropy term
enforces a dense solution, which can be undesirable when the optimal transportation plan
itself is of interest [8]. We mention that there has been substantial research in improving the
performance of SK [1, 2, 8, 26]. Most of these contributions improve certain aspects of SK:
some result in more stable but much slower methods, while others allow to produce sparse
solutions but at a much higher cost per iteration. Also, some sophisticated changes make
parallelization challenging due to the many branching conditions that they introduce.
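For reference, the SK update above can be written out in a few lines of NumPy. This is a minimal illustrative sketch (the function name sinkhorn and the tolerance handling are ours, not the POT implementation), assuming C, p, q are NumPy arrays of compatible shapes:

```python
import numpy as np

def sinkhorn(C, p, q, eta=1e-2, max_iter=1000, tol=1e-6):
    """Minimal Sinkhorn iteration for entropy-regularized OT (illustrative only)."""
    K = np.exp(-C / eta)                      # Gibbs kernel
    u = np.ones_like(p)
    v = np.ones_like(q)
    for _ in range(max_iter):
        u = p / (K @ v)                       # row scaling
        v = q / (K.T @ u)                     # column scaling
        # marginal violations used as the stopping criterion
        err = np.linalg.norm(u * (K @ v) - p) + np.linalg.norm(v * (K.T @ u) - q)
        if err < tol:
            break
    return u[:, None] * K * v[None, :]        # diag(u) K diag(v)
```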

Operator splitting solvers for general LP With a relatively low per-iteration cost and
the ability to exploit sparsity in the problem data, operator splitting methods such as Douglas-
Rachford splitting [15] and ADMM [18] have gained widespread popularity in large-scale op-
timization. Such algorithms can quickly produce solutions of moderate accuracy and are the
engine of several successful first-order solvers [28, 32, 19]. As OT can be cast as an LP, it
can, in principle, be solved by these splitting-based solvers. However, there has not been much
success reported in this context, probably due to the memory-bound nature of large-scale OTs.
For OTs, the main bottleneck is not floating-point computations, but rather time-consuming
memory operations on large two-dimensional arrays. Even an innocent update like X ← X −C
is more expensive than the two matrix-vector multiplies in SK. To design a high-performance
splitting method, it is thus crucial to minimize the memory operations associated with large
arrays. In addition, since these solvers target general LPs, they often solve a linear system at
each iteration, which is prohibitively expensive for many OT applications.

Convergence rates of DR for LP Many splitting methods, including Douglas-Rachford,


are known to converge linearly under strong convexity (see e.g. [20]). Recently, it has been
shown that algorithms based on DR/ADMM often enjoy similar convergence properties also

in the absence of strong convexity. For example, Liang et al. [25] derived local linear con-
vergence guarantees under mild assumptions; Applegate et al. [3] established global linear
convergence for Primal-Dual methods for LPs using restart strategies; and Wang and Shroff
[33] established linear convergence for an ADMM-based algorithm on general LPs. Yet, these
frameworks quickly become intractable on large OT problems, due to the many memory op-
erations required.

1.2 Contributions
We demonstrate that DR splitting, when properly applied and implemented, can solve large-
scale OT problems to modest accuracy reliably and quickly, while retaining the excellent
memory footprint and parallelization properties of the Sinkhorn method. Concretely, we make
the following contributions:

• We develop a DR splitting algorithm that solves the original OT directly, avoiding the
numerical issues of SK and forgoing the need for solving linear systems of general solvers.
We perform simplifications to eliminate variables so that the final algorithm can be exe-
cuted with only one matrix variable, while maintaining the same degree of parallelization.
Our method implicitly maintains a primal-dual pair, which facilitates a simple evaluation
of stopping criteria.

• We derive an iteration complexity of O(1/ε) for our method. This is a significant improvement over the best-known estimate O(1/ε²) for the Sinkhorn method (cf. [26]). We also
provide a global linear convergence rate that holds independently of the initialization,
despite the absence of strong convexity in the OT problem.

• We detail an efficient GPU implementation that fuses many intermediate steps into one and
performs several on-the-fly reductions between a read and write of the matrix variable.
We also show how a primal-dual stopping criterion can be evaluated at no extra cost.
The implementation is available as open source and gives practitioners a fast and robust
OT solver also for applications where regularized OT is not suitable.

As a by-product of solving the original OT problem, our approximate solution is guaranteed


to be sparse. Indeed, it is known that DR can identify an optimal support in a finite time,
even before reaching a solution [22]. To avoid cluttering the presentation, we focus on the
primal OT, but note that the approach applies equally well to the dual form. Moreover, our
implementation is readily extended to other splitting schemes (e.g. Chambolle and Pock [10]).

2 Background
Notation For any $x, y \in \mathbb{R}^n$, $\langle x, y\rangle$ is the Euclidean inner product of x and y, and $\|\cdot\|_2$
denotes the $\ell_2$-norm. For matrices $X, Y \in \mathbb{R}^{m\times n}$, $\langle X, Y\rangle = \mathrm{tr}(X^{\top} Y)$ denotes their inner
product and $\|\cdot\|_F = \sqrt{\langle\cdot,\cdot\rangle}$ is the induced Frobenius norm. We use $\|\cdot\|$ to indicate either
$\|\cdot\|_2$ or $\|\cdot\|_F$. For a closed and convex set $\mathcal{X}$, the distance and the projection map are given by
$\mathrm{dist}(x, \mathcal{X}) = \min_{z\in\mathcal{X}}\|z - x\|$ and $P_{\mathcal{X}}(x) = \arg\min_{z\in\mathcal{X}}\|z - x\|$, respectively. The Euclidean
projection of $x \in \mathbb{R}^n$ onto the nonnegative orthant is denoted by $[x]_+ = \max(x, 0)$, and
$\Delta_n = \{x \in \mathbb{R}^n_+ : \sum_{i=1}^n x_i = 1\}$ is the (n − 1)-dimensional probability simplex.

OT and optimality conditions Let $e = \mathbf{1}_n$ and $f = \mathbf{1}_m$, and consider the linear mapping
$\mathcal{A}: \mathbb{R}^{m\times n} \to \mathbb{R}^{m+n}: X \mapsto (Xe, X^{\top} f)$,
and its adjoint $\mathcal{A}^*: \mathbb{R}^{m+n} \to \mathbb{R}^{m\times n}: (y, x) \mapsto y e^{\top} + f x^{\top}$. The projection onto the range of $\mathcal{A}$ is
$P_{\mathrm{ran}\mathcal{A}}((y, x)) = (y, x) - \alpha(f, -e)$, where $\alpha = (f^{\top} y - e^{\top} x)/(m+n)$ [6]. With $b = (p, q) \in \mathbb{R}^{m+n}$,
Problem (1) can be written as a linear program of the form

$$ \underset{X \in \mathbb{R}^{m\times n}}{\text{minimize}} \ \langle C, X\rangle \quad \text{subject to} \quad \mathcal{A}(X) = b, \ X \ge 0. \tag{2} $$

Let $(\mu, \nu) \in \mathbb{R}^{m+n}$ be the dual variable associated with the affine constraint in (2). Then, using the
definition of $\mathcal{A}^*$, the dual problem of (2), or equivalently of (1), reads

$$ \underset{\mu, \nu}{\text{maximize}} \ p^{\top}\mu + q^{\top}\nu \quad \text{subject to} \quad \mu e^{\top} + f\nu^{\top} \le C. \tag{3} $$

OT is a bounded program and always admits a feasible solution. It is thus guaranteed to have
an optimal solution, and the optimality conditions can be summarized as follows.

Proposition 1. A pair X and $(\mu, \nu)$ are primal and dual optimal if and only if: (i) $Xe = p$, $X^{\top} f = q$, $X \ge 0$; (ii) $\mu e^{\top} + f\nu^{\top} \le C$; (iii) $\langle X, C - \mu e^{\top} - f\nu^{\top}\rangle = 0$.

These conditions mean that: (i) X is primal feasible; (ii) µ, ν are dual feasible; and (iii)
the duality gap is zero, $\langle C, X\rangle = p^{\top}\mu + q^{\top}\nu$, or equivalently, complementary slackness holds.
Thus, solving the OT problem amounts to (approximately) finding such a primal-dual pair.
To this end, we rely on the celebrated Douglas-Rachford splitting method [27, 16, 17, 5].
Douglas-Rachford splitting Consider composite convex optimization problems of the form

$$ \underset{x \in \mathbb{R}^n}{\text{minimize}} \ f(x) + g(x), \tag{4} $$

where f and g are proper closed and convex functions. To solve Problem (4), the DR splitting
algorithm starts from $y_0 \in \mathbb{R}^n$ and repeats the following steps:

$$ x_{k+1} = \mathrm{prox}_{\rho f}(y_k), \quad z_{k+1} = \mathrm{prox}_{\rho g}(2x_{k+1} - y_k), \quad y_{k+1} = y_k + z_{k+1} - x_{k+1}, \tag{5} $$

where $\rho > 0$ is a penalty parameter. The preceding procedure can be viewed as a fixed-point
iteration $y_{k+1} = T(y_k)$ for the mapping

$$ T(y) = y + \mathrm{prox}_{\rho g}\big(2\,\mathrm{prox}_{\rho f}(y) - y\big) - \mathrm{prox}_{\rho f}(y). \tag{6} $$

The DR iterations (5) can also be derived from the ADMM method applied to a problem equivalent
to (4) (see Appendix B). Indeed, both are special instances of the classical
proximal point method in [30]. As for convergence, Lions and Mercier [27] showed that T is
a firmly nonexpansive mapping, from which they obtained convergence of $y_k$. Moreover, the
sequence $x_k$ is guaranteed to converge to a minimizer of f + g (assuming a minimizer exists)
for any choice of $\rho > 0$. In particular, we have the following general convergence result, whose
proof can be found in [5, Corollary 28.3].

Lemma 2.1. Consider the composite problem (4) and its Fenchel–Rockafellar dual defined as

$$ \underset{u \in \mathbb{R}^n}{\text{maximize}} \ -f^*(-u) - g^*(u). \tag{7} $$

Let $\mathcal{P}^\star$ and $\mathcal{D}^\star$ be the sets of solutions to the primal (4) and dual (7) problems, respectively.
Let $x_k$, $y_k$, and $z_k$ be generated by procedure (5) and let $u_k := (y_{k-1} - x_k)/\rho$. Then, there exists
$y^\star \in \mathbb{R}^n$ such that $y_k \to y^\star$. Setting $x^\star = \mathrm{prox}_{\rho f}(y^\star)$ and $u^\star = (y^\star - x^\star)/\rho$, it holds that
(i) $x^\star \in \mathcal{P}^\star$ and $u^\star \in \mathcal{D}^\star$; (ii) $x_k - z_k \to 0$, $x_k \to x^\star$ and $z_k \to x^\star$; (iii) $u_k \to u^\star$.
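As a concrete illustration of procedure (5), the following minimal sketch (our own; the helper names and the toy choice of f and g are not from the paper) runs the DR iteration for two proximal operators supplied as functions:

```python
import numpy as np

def douglas_rachford(prox_f, prox_g, y0, rho=1.0, n_iters=100):
    """Generic DR iteration (5): x = prox_f(y), z = prox_g(2x - y), y += z - x."""
    y = y0.copy()
    for _ in range(n_iters):
        x = prox_f(y, rho)
        z = prox_g(2 * x - y, rho)
        y = y + z - x
    return x, y          # x converges to a minimizer of f + g

# Toy example: f(x) = ||x||_1 (prox = soft-thresholding), g(x) = 0.5 ||x - a||^2
a = np.array([3.0, -0.5, 1.2])
soft = lambda y, r: np.sign(y) * np.maximum(np.abs(y) - r, 0.0)
prox_g = lambda y, r: (y + r * a) / (1 + r)
x_star, _ = douglas_rachford(soft, prox_g, np.zeros_like(a), rho=1.0, n_iters=200)
```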

3 Douglas-Rachford splitting for optimal transport
To efficiently apply DR to OT, we need to specify the functions f and g as well as how to
evaluate their proximal operations. We begin by introducing a recent result for computing the
projection onto the set of real-valued matrices with prescribed row and column sums [6].

Lemma 3.1. Let $e = \mathbf{1}_n$ and $f = \mathbf{1}_m$. Let $p \in \Delta_m$ and $q \in \Delta_n$. Then, the set $\mathcal{X}$ defined by

$$ \mathcal{X} := \big\{ X \in \mathbb{R}^{m\times n} \ \big|\ Xe = p \ \text{and}\ X^{\top} f = q \big\} $$

is non-empty, and for every given $X \in \mathbb{R}^{m\times n}$, we have

$$ P_{\mathcal{X}}(X) = X - \frac{1}{n}\big((Xe - p) - \gamma f\big)e^{\top} - \frac{1}{m}\, f\big((X^{\top} f - q) - \gamma e\big)^{\top}, $$

where $\gamma = f^{\top}(Xe - p)/(m+n) = e^{\top}(X^{\top} f - q)/(m+n)$.


The lemma follows immediately from [6, Corollary 3.4] and the fact that (p, q) = PranA ((p, q)).
It implies that PX (X) can be carried out by basic linear algebra operations, such as matrix-
vector multiplies and rank-one updates, that can be effectively parallelized.
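A direct NumPy transcription of the projection formula in Lemma 3.1 could look as follows; the function name project_transport_polytope is ours, and the sanity check simply verifies the prescribed row and column sums:

```python
import numpy as np

def project_transport_polytope(X, p, q):
    """Project X onto {Z : Z e = p, Z^T f = q} using the formula of Lemma 3.1."""
    m, n = X.shape
    row_res = X.sum(axis=1) - p                  # Xe - p
    col_res = X.sum(axis=0) - q                  # X^T f - q
    gamma = row_res.sum() / (m + n)              # = col_res.sum() / (m + n)
    X = X - np.outer(row_res - gamma * np.ones(m), np.ones(n)) / n
    X = X - np.outer(np.ones(m), col_res - gamma * np.ones(n)) / m
    return X

# Sanity check: row/column sums of the projection match p and q
rng = np.random.default_rng(0)
m, n = 4, 6
p, q = np.full(m, 1 / m), np.full(n, 1 / n)
Z = project_transport_polytope(rng.random((m, n)), p, q)
assert np.allclose(Z.sum(axis=1), p) and np.allclose(Z.sum(axis=0), q)
```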

3.1 Algorithm derivation


Our algorithm is based on re-writing (2) on the standard form for DR-splitting (4) using a
carefully selected f and g that ensures a rapid convergence of the iterates and an efficient
execution of the iterations. In particular, we propose to select f and g as follows:

$$ f(X) = \langle C, X\rangle + I_{\mathbb{R}^{m\times n}_+}(X) \quad \text{and} \quad g(X) = I_{\{Y : \mathcal{A}(Y) = b\}}(X). \tag{8} $$

By Lemma 3.1, we readily have an explicit formula for the proximal operator of g, namely
$\mathrm{prox}_{\rho g}(\cdot) = P_{\mathcal{X}}(\cdot)$. The proximal operator of f can also be evaluated explicitly as

$$ \mathrm{prox}_{\rho f}(X) = P_{\mathbb{R}^{m\times n}_+}(X - \rho C) = [X - \rho C]_+. $$

The Douglas-Rachford splitting applied to this formulation of the OT problem then reads:

$$ X_{k+1} = [Y_k - \rho C]_+, \quad Z_{k+1} = P_{\mathcal{X}}(2X_{k+1} - Y_k), \quad Y_{k+1} = Y_k + Z_{k+1} - X_{k+1}. \tag{9} $$

Despite their apparent simplicity, the updates in (9) are inefficient to execute in practice
due to the many memory operations needed to operate on the large arrays Xk , Yk , Zk and
C. To reduce the memory access, we will now perform several simplifications to eliminate
variables from (9). The resulting algorithm can be executed with only one matrix variable
while maintaining the same degree of parallelization.
We first note that the linearity of $P_{\mathcal{X}}(\cdot)$ allows us to eliminate Z, yielding the Y-update

$$ Y_{k+1} = X_{k+1} - n^{-1}\big(2X_{k+1}e - Y_k e - p - \gamma_k f\big)e^{\top} - m^{-1} f\big(2X_{k+1}^{\top} f - Y_k^{\top} f - q - \gamma_k e\big)^{\top}, $$

where $\gamma_k = f^{\top}(2X_{k+1}e - Y_k e - p)/(m+n) = e^{\top}(2X_{k+1}^{\top} f - Y_k^{\top} f - q)/(m+n)$. We also
define the following quantities that capture how $Y_k$ affects the update of $Y_{k+1}$:

$$ a_k = Y_k e - p, \quad b_k = Y_k^{\top} f - q, \quad \alpha_k = f^{\top} a_k/(m+n) = e^{\top} b_k/(m+n). $$

Similarly, for $X_k$, we let

$$ r_k = X_k e - p, \quad s_k = X_k^{\top} f - q, \quad \beta_k = f^{\top} r_k/(m+n) = e^{\top} s_k/(m+n). $$

Recall that the pair $(r_k, s_k)$ represents the primal residual at $X_k$. Now, the preceding update
can be written compactly as

$$ Y_{k+1} = X_{k+1} + \phi_{k+1} e^{\top} + f \varphi_{k+1}^{\top}, \tag{10} $$

where

$$ \phi_{k+1} = \big(a_k - 2r_{k+1} + (2\beta_{k+1} - \alpha_k) f\big)/n, \qquad \varphi_{k+1} = \big(b_k - 2s_{k+1} + (2\beta_{k+1} - \alpha_k) e\big)/m. $$

We can see that the Y-update can be implemented using 4 matrix-vector multiplies (for computing $a_k$, $b_k$, $r_{k+1}$, $s_{k+1}$), followed by 2 rank-one updates. As a rank-one update requires a
read from an input matrix and a write to an output matrix, it is typically twice as costly as
a matrix-vector multiply (which only writes its output to a vector). Thus, this would involve 8
memory operations on large arrays, which is still significant.
Next, we show that the Y-update can be removed too. Notice that updating $Y_{k+1}$ from
$Y_k$ and $X_{k+1}$ does not require the full matrix $Y_k$, but only the ability to compute $a_k$ and $b_k$.
This allows us to use a single physical memory array to represent both the sequences $X_k$ and
$Y_k$. Suppose that we overwrite the matrix $X_{k+1}$ as

$$ X_{k+1} \leftarrow X_{k+1} + \phi_{k+1} e^{\top} + f \varphi_{k+1}^{\top}; $$

then, after the two rank-one updates, the X-array holds the value of $Y_{k+1}$. We can recover the
actual X-value again in the next update, which now reads $X_{k+2} \leftarrow [X_{k+1} - \rho C]_+$. It thus
remains to show that $a_k$ and $b_k$ can be computed efficiently. By multiplying both sides of (10)
by e and subtracting p from the result, we obtain

$$ Y_{k+1} e - p = X_{k+1} e - p + \phi_{k+1} e^{\top} e + (\varphi_{k+1}^{\top} e) f. $$

Since $e^{\top} e = n$ and $(b_k - 2s_{k+1})^{\top} e = (m+n)(\alpha_k - 2\beta_{k+1})$, it holds that $(\varphi_{k+1}^{\top} e) f = (\alpha_k - 2\beta_{k+1}) f$. We also have $\phi_{k+1} e^{\top} e = a_k - 2r_{k+1} + (2\beta_{k+1} - \alpha_k) f$. Therefore, we end up with an
extremely simple recursive form for updating $a_k$:

$$ a_{k+1} = a_k - r_{k+1}. $$

Similarly, we have bk+1 = bk − sk+1 and αk+1 = αk − βk+1 . In summary, the DROT method
can be implemented with a single matrix variable as summarized in Algorithm 1.

Stopping criteria It is interesting to note that while DROT directly solves the primal
problem (1), it maintains a pair of vectors that implicitly play the role of the dual variables
µ and ν in the dual problem (3). To get a feel for this, we note that the optimality conditions
in Proposition 1 are equivalent to the existence of a pair $X^\star$ and $(\mu^\star, \nu^\star)$ such that

$$ (X^\star e,\ X^{\star\top} f) = (p, q) \quad \text{and} \quad X^\star = \big[X^\star + \mu^\star e^{\top} + f \nu^{\star\top} - C\big]_+. $$

Here, the latter condition encodes the nonnegativity constraint, the dual feasibility, and the
zero duality gap. The result follows by invoking [14, Theorem 5.6(ii)] with the convex cone
Algorithm 1 Douglas-Rachford Splitting for Optimal Transport (DROT)
Input: OT(C, p, q), initial point X0, penalty parameter ρ
1: φ0 = 0, ϕ0 = 0
2: a0 = X0 e − p, b0 = X0⊤ f − q, α0 = f⊤ a0 /(m + n)
3: for k = 0, 1, 2, . . . do
4:    Xk+1 = [Xk + φk e⊤ + f ϕk⊤ − ρC]+
5:    rk+1 = Xk+1 e − p, sk+1 = Xk+1⊤ f − q, βk+1 = f⊤ rk+1 /(m + n)
6:    φk+1 = (ak − 2rk+1 + (2βk+1 − αk )f )/n
7:    ϕk+1 = (bk − 2sk+1 + (2βk+1 − αk )e)/m
8:    ak+1 = ak − rk+1 , bk+1 = bk − sk+1 , αk+1 = αk − βk+1
9: end for
Output: XK

$\mathcal{C} = \mathbb{R}^{m\times n}_+$, $y = X^\star \in \mathcal{C}$ and $z = \mu^\star e^{\top} + f \nu^{\star\top} - C \in \mathcal{C}^{\circ}$. Now, compared to Step 4 in DROT,
the second condition above suggests that $\phi_k/\rho$ and $\varphi_k/\rho$ are such implicit variables. To see
why this is indeed the case, let $U_k = (Y_{k-1} - X_k)/\rho$. Then it is easy to verify that

$$ (Z_k - X_k)/\rho = (X_k - Y_{k-1} + \phi_k e^{\top} + f \varphi_k^{\top})/\rho = -U_k + (\phi_k/\rho) e^{\top} + f (\varphi_k/\rho)^{\top}. $$

By Lemma 2.1, we have $Z_k - X_k \to 0$ and $U_k \to U^\star \in \mathbb{R}^{m\times n}$, from which it follows that

$$ (\phi_k/\rho) e^{\top} + f (\varphi_k/\rho)^{\top} \to U^\star. $$

Finally, by evaluating the conjugate functions $f^*$ and $g^*$ in Lemma 2.1, it can be shown that
$U^\star$ must have the form $\mu^\star e^{\top} + f \nu^{\star\top}$, where $\mu^\star \in \mathbb{R}^m$ and $\nu^\star \in \mathbb{R}^n$ satisfy $\mu^\star e^{\top} + f \nu^{\star\top} \le C$;
see Appendix C for details. With such primal-dual pairs at our disposal, we can now explicitly
evaluate their distance to the set of solutions laid out in Proposition 1 by considering:

$$ r_{\mathrm{primal}} = \sqrt{\|r_k\|^2 + \|s_k\|^2}, \qquad r_{\mathrm{dual}} = \big\|\big[(\phi_k/\rho) e^{\top} + f (\varphi_k/\rho)^{\top} - C\big]_+\big\|, \qquad \mathrm{gap} = \big|\langle C, X_k\rangle - (p^{\top}\phi_k + \varphi_k^{\top} q)/\rho\big|. $$

As problem (1) is feasible and bounded, Lemma 2.1 and strong duality guarantee that all
three terms will converge to zero. Thus, we terminate DROT when these become smaller
than some user-specified tolerances.
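Putting Algorithm 1 and the stopping criteria together, a minimal NumPy reference version of DROT might look as follows. This is our own readable sketch, not the authors' CUDA implementation: it does not emulate the single-array memory layout (NumPy allocates temporaries), but it follows Steps 4–8 and the primal-dual stopping test as stated above.

```python
import numpy as np

def drot(C, p, q, rho, max_iter=1000, tol=1e-6):
    """Minimal NumPy version of Algorithm 1 (DROT) with the primal-dual stopping test."""
    m, n = C.shape
    X = np.outer(p, q)                              # X0 = p q^T
    phi, varphi = np.zeros(m), np.zeros(n)
    a, b = X.sum(axis=1) - p, X.sum(axis=0) - q     # a0, b0
    alpha = a.sum() / (m + n)
    for _ in range(max_iter):
        # Step 4: X_{k+1} = [X_k + phi e^T + f varphi^T - rho C]_+
        X = np.maximum(X + phi[:, None] + varphi[None, :] - rho * C, 0.0)
        # Step 5: primal residuals
        r, s = X.sum(axis=1) - p, X.sum(axis=0) - q
        beta = r.sum() / (m + n)
        # Steps 6-8: rank-one correction coefficients and recursive updates
        phi = (a - 2 * r + (2 * beta - alpha)) / n
        varphi = (b - 2 * s + (2 * beta - alpha)) / m
        a, b, alpha = a - r, b - s, alpha - beta
        # Stopping test built from the implicit duals mu = phi/rho, nu = varphi/rho
        r_primal = np.sqrt(np.linalg.norm(r) ** 2 + np.linalg.norm(s) ** 2)
        r_dual = np.linalg.norm(np.maximum(phi[:, None] / rho + varphi[None, :] / rho - C, 0.0))
        gap = abs(np.sum(C * X) - (p @ phi + q @ varphi) / rho)
        if max(r_primal, r_dual, gap) < tol:
            break
    return X
```

Since the projection $[\cdot]_+$ and the rank-one corrections are the only matrix operations, the cost per iteration is dominated by a handful of memory passes over X and C, which is what the GPU kernel in Section 3.3 optimizes.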

3.2 Convergence rates


In this section, we state the main convergence results of the paper, namely a sublinear and a
linear rate of convergence of the given splitting algorithm. In order to establish the sublinear
rate, we need the following function:

$$ V(X, Z, U) = f(X) + g(Z) + \langle U, Z - X\rangle, $$

which is defined for $X \in \mathbb{R}^{m\times n}_+$, $Z \in \mathcal{X}$, and $U \in \mathbb{R}^{m\times n}$. We can now state the first result.

Theorem 1 (Sublinear rate). Let $X_k$, $Y_k$, $Z_k$ be generated by procedure (9). Then, for any
$X \in \mathbb{R}^{m\times n}_+$, $Z \in \mathcal{X}$, and $Y \in \mathbb{R}^{m\times n}$, we have

$$ V\big(X_{k+1}, Z_{k+1}, (Y - Z)/\rho\big) - V\big(X, Z, (Y_k - X_{k+1})/\rho\big) \le \frac{1}{2\rho}\|Y_k - Y\|^2 - \frac{1}{2\rho}\|Y_{k+1} - Y\|^2. $$

Furthermore, let $Y^\star$ be a fixed point of T in (6) and let $X^\star$ be a solution of (1) defined from
$Y^\star$ in the manner of Lemma 2.1. Then, it holds that

$$ \langle C, \bar{X}_k\rangle - \langle C, X^\star\rangle \le \frac{1}{k}\left(\frac{1}{2\rho}\|Y_0\|^2 + \frac{2}{\rho}\|X^\star\|\,\|Y_0 - Y^\star\|\right), \qquad \|\bar{Z}_k - \bar{X}_k\| = \frac{\|Y_k - Y_0\|}{k} \le \frac{2\|Y_0 - Y^\star\|}{k}, $$

where $\bar{X}_k = \sum_{i=1}^k X_i/k$ and $\bar{Z}_k = \sum_{i=1}^k Z_i/k$.

The theorem implies that one can compute a solution satisfying $\langle C, X\rangle - \langle C, X^\star\rangle \le \epsilon$ in
$O(1/\epsilon)$ iterations. This is a significant improvement over the best-known iteration complexity
$O(1/\epsilon^2)$ of the Sinkhorn method (cf. [26]). Note that the linearity of $\langle C, \cdot\rangle$ allows us to update
the scalar value $\langle C, \bar{X}_k\rangle$ recursively without ever needing to form the ergodic sequence $\bar{X}_k$.
Yet, in terms of rate, this result is still conservative, as the next theorem shows.
Theorem 2 (Linear rate). Let $X_k$ and $Y_k$ be generated by (9). Let $\mathcal{G}^\star$ be the set of fixed points
of T in (6) and let $\mathcal{X}^\star$ be the set of primal solutions to (1). Then, $\{Y_k\}$ is bounded, $\|Y_k\| \le M$
for all k, and

$$ \mathrm{dist}(Y_k, \mathcal{G}^\star) \le \mathrm{dist}(Y_0, \mathcal{G}^\star)\cdot r^k, \qquad \mathrm{dist}(X_k, \mathcal{X}^\star) \le \mathrm{dist}(Y_0, \mathcal{G}^\star)\cdot r^{k-1}, $$

where $r = c/\sqrt{c^2 + 1} < 1$, $c = \gamma(1 + \rho(\|e\| + \|f\|)) \ge 1$, and $\gamma = \theta_{S^\star}(1 + \rho^{-1}(M + 1))$. Here,
$\theta_{S^\star} > 0$ is a problem-dependent constant characterized by the primal-dual solution sets only.

This means that an ε-optimal solution can be computed in $O(\log 1/\epsilon)$ iterations. However,
it is, in general, difficult to estimate the convergence factor r, and it may in the worst case be
close to one. In such settings, the sublinear rate will typically dominate for the first iterations.
In either case, DROT always satisfies the better of the two bounds at each k.

3.3 Implementation
In this section, we detail our implementation of DROT to exploit parallel processing on GPUs.
We review only the most basic concepts of GPU programming necessary to describe our kernel
and refer to [23, 11] for comprehensive treatments.

Thread hierarchy When a kernel function is launched, a large number of threads are gener-
ated to execute its statements. These threads are organized into a two-level hierarchy. A grid
contains multiple blocks and a block contains multiple threads. Each block is scheduled to one
of the streaming multiprocessors (SMs) on the GPU concurrently or sequentially, depending
on available hardware. While all threads in a thread block run logically in parallel, not all
of them can run physically at the same time. As a result, different threads in a thread block
may progress at a different pace. Once a thread block is scheduled to an SM, its threads are
further partitioned into warps. A warp consists of 32 consecutive threads that execute the
same instruction at the same time. Each thread has its own instruction address counter and
register state, and carries out the current instruction on its own data.

[Figure 1 diagram: left, a 2D grid of 1D thread blocks tiling the matrix X; right, the activity of one working block, with warp-level reductions and atomicAdd updates moving results between thread registers, shared memory (on-chip), and global memory (off-chip).]

Figure 1: Left: Logical view of the main kernel. To expose sufficient parallelism to GPU, we organize threads
into a 2D grid of blocks (3 × 2 in the figure), which allows several threads per row. The threads are then
grouped in 1D blocks (shown in red) along the columns of X. This ensures that global memory access is
aligned and coalesced to maximize bandwidth utilization. We use the parameter work-size ws to indicate how
many elements of a row each thread should handle. For simplicity, this parameter represents multiples of the
block size bs. Each arrow denotes the activity of a single thread in a thread block. Memory storage is assumed
to be column-major. Right: Activity of a normal working block, which handles a submatrix of size bs×(ws·bs).

Memory hierarchy Registers and shared memory (“on-chip”) are the fastest memory spaces
on a GPU. Registers are private to each thread, while shared memory is visible to all threads
in the same thread block. An automatic variable declared in a kernel is generally stored in a
register. Shared memory is programmable, and users have full control over when data is moved
into or evicted from the shared memory. It enables block-level communication, facilitates reuse
of on-chip data, and can greatly reduce the global memory access of a kernel. However, there
are typically only a couple dozen registers per thread and a couple of kBs shared memory
per thread block. The largest memory on a GPU card is global memory, physically separated
from the compute chip (“off-chip”). All threads can access global memory, but its latency
is much higher, typically hundreds of clock cycles.1 Therefore, minimizing global memory
transactions is vital to a high-performance kernel. When all threads of a warp execute a load
(store) instruction that access consecutive memory locations, these will be coalesced into as
few transactions as possible. For example, if they access consecutive 4-byte words such as
float32 values, four coalesced 32-byte transactions will service that memory access.
Before proceeding further, we state the main result of the section.
Claim 3.1. On average, an iteration of DROT, including all the stopping criteria, can be
done using 2.5 memory operations on m × n arrays. In particular, this includes one read from
and one write to X in every iteration, and one read from C in every other iteration.

Main kernel Steps 4–5 and the stopping criteria are the main computational components
of DROT, since they involve matrix operations. We will design an efficient kernel that: (i) updates $X_k$ to $X_{k+1}$; (ii) computes $u_{k+1} := X_{k+1} e$ and $v_{k+1} := X_{k+1}^{\top} f$; (iii) evaluates $\langle C, X_{k+1}\rangle$
in the duality gap expression. The kernel fuses many intermediate steps into one and performs
several on-the-fly reductions while updating $X_k$, thereby minimizing global memory access.
Our kernel is designed so that each thread block handles a submatrix x of size bs × (ws · bs),
except for the corner blocks, which will have fewer rows and/or columns. Since all threads in
a block need the same values of ϕ, it is best to read these into shared memory once per block
and then let threads access them from there. We therefore divide ϕ into chunks of the block
size and set up a loop to let the threads collaborate in reading chunks in a coalesced fashion
into shared memory. Since each thread works on a single row, it accesses the same element of
φ throughout, and we can thus load and store that value directly in a register. This allows
maximum reuse of ϕ and φ.

1 https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
In the j-th step, the working block loads column j of x to the chip in a coalesced fashion.
Each thread $i \in \{0, 1, \ldots, bs - 1\}$ uses the loaded $x_{ij}$ to compute and store $x^{+}_{ij}$ in a register:

$$ x^{+}_{ij} = \max(x_{ij} + \phi_i + \varphi_j - \rho c_{ij},\ 0), $$

where c is the corresponding submatrix of C. As sum reduction is order-independent, it can


be done locally at various levels, and local results can then be combined to produce the final
value. We therefore reuse $x^{+}_{ij}$ and perform several such partial reductions. First, to compute
the local value for $u_{k+1}$, at column j, thread i simply adds $x^{+}_{ij}$ to a running sum kept in
a register. The reduction leading to $\langle C, X_{k+1}\rangle$ can be done in the same way. The vertical
reduction to compute $v_{k+1}$ is more challenging, as coordination between threads is needed. We
rely on warp-level reduction, in which the data exchange is performed between registers. This
way, we can also leverage CUDA's efficient built-in functions for collective communication at
the warp level.2 When done, the first thread in a warp holds the total reduced value and simply
adds it atomically to the proper coordinate of $v_{k+1}$. Finally, all threads write $x^{+}_{ij}$ to the global
memory. The process is repeated by moving to the next column.
When the block finishes processing the last column of the submatrix x, each thread adds
its running sum atomically to the proper coordinate of $u_{k+1}$. The threads then perform one vertical
(warp-level) reduction to collect their private sums and obtain the partial cost value $\langle c, x\rangle$.
When all the submatrices have been processed, we are granted all the quantities described
in (i)–(iii), as desired. It is essential to notice that each entry of x is only read and written
once during this process. To conclude Claim 3.1, we note that one can skip the load of C in
every other iteration. Indeed, if in iteration k, instead of writing $X_{k+1}$, one writes the value
of $X_{k+1} - \rho C$, then the next update is simply $X_{k+2} = [X_{k+1} + \phi_{k+1} e^{\top} + f \varphi_{k+1}^{\top}]_+$. Finally,
all the remaining updates in DROT only involve vector and scalar operations, and can thus
be finished off with a simple auxiliary kernel.
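The logical effect of the fused kernel, one column at a time, can be sketched in NumPy as follows. This is our own illustration of what the pass computes; a real kernel performs these reductions in registers and shared memory rather than Python loops:

```python
import numpy as np

def fused_column_pass(X, C, phi, varphi, rho):
    """One logical pass over X mimicking the fused kernel: update X in place and
    accumulate u = X^+ e, v = (X^+)^T f and <C, X^+> on the fly."""
    m, n = X.shape
    u = np.zeros(m)            # running row sums (one register per thread in the kernel)
    v = np.zeros(n)            # column sums (warp-level reduction + atomicAdd in the kernel)
    cost = 0.0                 # running <C, X^+>
    for j in range(n):         # the kernel processes one column of the submatrix at a time
        col = np.maximum(X[:, j] + phi + varphi[j] - rho * C[:, j], 0.0)
        u += col
        v[j] = col.sum()
        cost += col @ C[:, j]
        X[:, j] = col          # each entry is read once and written once
    return u, v, cost
```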

4 Experimental results
In this section, we perform experiments to validate our method and to demonstrate its ef-
ficiency both in terms of accuracy and speed. We focus on comparisons with the Sinkhorn
method, as implemented in the POT toolbox3 , due to its minimal per-iteration cost and its
publicly available GPU implementation. All runtime tests were carried out on an NVIDIA
Tesla T4 GPU with 16GB of global memory. The CUDA C++ implementation of DROT is
open source and available at https://github.com/vienmai/drot.
We consider six instances of SK, called SK1–SK6, corresponding to $\eta = 10^{-4}, 10^{-3}, 5\times 10^{-3}, 10^{-2}, 5\times 10^{-2}, 10^{-1}$, in that order. Given m and n, we generate source and target samples
$x_s$ and $x_t$, whose entries are drawn from 2D Gaussian distributions with randomized
means and covariances $(\mu_s, \Sigma_s)$ and $(\mu_t, \Sigma_t)$. Here, $\mu_s \in \mathbb{R}^2$ has normally distributed entries,
and $\Sigma_s = A_s A_s^{\top}$, where $A_s \in \mathbb{R}^{2\times 2}$ is a matrix with random entries in [0, 1]; $\mu_t \in \mathbb{R}^2$ has
entries generated from $\mathcal{N}(5, \sigma_t)$ for some $\sigma_t > 0$, and $\Sigma_t \in \mathbb{R}^{2\times 2}$ is generated similarly to $\Sigma_s$.
Given $x_s$ and $x_t$, the cost matrix C represents the pair-wise squared Euclidean distance between
samples: $C_{ij} = \|x_s[i] - x_t[j]\|_2^2$ for $i \in \{1, \ldots, m\}$ and $j \in \{1, \ldots, n\}$, and it is normalized so that
$\|C\|_\infty = 1$. The marginals p and q are set to $p = \mathbf{1}_m/m$ and $q = \mathbf{1}_n/n$. For simplicity,
we set the penalty parameter ρ in DROT to $\rho_0/(m+n)$ for some constant $\rho_0$. This choice is
inspired by a theoretical limit in the Chambolle and Pock [10] method: $\rho < 1/\|\mathcal{A}\|^2$, where
$\|\mathcal{A}\|^2 = \|e\|^2 + \|f\|^2 = m + n$. We note, however, that this choice is conservative, and it is
likely that better performance can be achieved with more careful selection strategies (cf. [9]).
Finally, DROT always starts at $X_0 = pq^{\top}$.

2 https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/
3 https://pythonot.github.io

Figure 2: The percentage of problems solved up to various accuracy levels for σt = 5, 10. Panels: (a) Float 32, σt = 5; (b) Float 64, σt = 5; (c) Float 64, σt = 10. Each panel plots the fraction of problems solved (performance ratio) against the accuracy level for DROT and SK1–SK6.

Robustness and accuracy For each method, we evaluate for each ε > 0 the percentage
of experiments for which $|f(X_K) - f(X^\star)|/f(X^\star) \le \epsilon$, where $f(X^\star)$ is the optimal value
and $f(X_K)$ is the objective value at termination. An algorithm is terminated as soon as the
constraint violation goes below $10^{-4}$ or 1000 iterations have been executed. Figure 2 depicts
the fraction of problems that are successfully solved up to the accuracy level ε given on the
x-axis. For each subplot, we set m = n = 512, generate 100 random problems, and
set $\rho_0 = 2$. We can see that DROT is consistently more accurate and robust. This reinforces
that substantial care is needed to select the right η for SK. We can see that SK is extremely
vulnerable in single precision. Even in double precision, an SK instance that seems to work
in one setting can run into numerical issues in another. For example, by slightly changing the
statistical properties of the underlying data, nearly 40% of the problems in Fig. 2(c) cannot
be solved by SK2 due to numerical errors, even though it works reasonably well in Fig. 2(b).

Runtime As we strive for the excellent per-iteration cost of SK, this work would be incomplete if we did not compare the two. Figure 3(a) shows the median of the per-iteration runtime
and the 95% confidence interval, as a function of the dimensions. Here, m and n range from
100 to 20000 and each plot is obtained by performing 10 runs; in each run, the per-iteration
runtime is averaged over 100 iterations. Since evaluating the termination criterion in SK is
very expensive, we follow the POT default and only do so once every 10 iterations. We note
also that SK calls the highly optimized cuBLAS library to carry out its updates. The result
confirms the efficacy of our kernel, and for very large problems the per-iteration runtimes of
the two methods are almost identical. Finally, Figs. 3(b)-(c) show the median times required
for the total error (the sum of the function gap and the constraint violation) to reach different ε values.
It should be noted that, by design, all methods start from different initial points.4 Therefore,
with very large ε, this can lead to substantially different results, since all methods terminate
early. It seems that SK can struggle to find even moderately accurate solutions, especially
for large problems. This is because, for a given η, the achievable accuracy of SK scales like
ε = O(η log(n)), which can become significant as n grows [2]. Moreover, the larger number of
data entries in larger problems results in a higher chance of numerical issues. Note
that none of the SK instances can find a solution of accuracy ε = 0.001.

4 The SK methods have their own matrices $K = \exp(-C/\eta)$ and $X_0 = \mathrm{diag}(u_0)\, K\, \mathrm{diag}(v_0)$. For DROT, since $X_{k+1} = [X_k + \phi_k e^{\top} + f\varphi_k^{\top} - \rho C]_+$, where $X_0 = pq^{\top}$ (all entries equal to $1/(mn)$), $\|C\|_\infty = 1$, $\rho = \rho_0/(m+n)$, and $\phi_0 = \varphi_0 = 0$, the term inside $[\cdot]_+$ is of the order $O(1/(mn) - \rho_0/(m+n)) \ll 0$. This makes the first few iterations of DROT identically zero. To keep the canonical $X_0$, we simply set $\rho_0$ to $1/\log(m)$ to reduce the warm-up period as m increases, which is certainly suboptimal for DROT performance.

Figure 3: Wall-clock runtime performance versus the dimensions m = n for σt = 5. Panels: (a) runtime per iteration; (b) time to reach ε = 0.01; (c) time to reach ε = 0.001. Each panel plots time against the dimension m = n for DROT and the SK variants.

Sparsity of transportation By design, DROT efficiently finds sparse transport plans. To


illustrate this, we apply DROT to a color transferring problem between two images (see [8]).
By doing so, we obtain a highly sparse plan (approximately 99% zeros), and in turn, a high-
quality artificial image, visualized in Figure 4. In the experiment, we quantize each image
with KMeans to reduce the number of distinct colors to 750. We subsequently use DROT to
estimate an optimal color transfer between color distributions of the two images.

Figure 4: Color transfer via DROT: The left-most image is a KMeans compressed source
image (750 centroids), the right-most is a compressed target image (also obtained via 750
KMeans centroids). The middle panel displays an artificial image generated by mapping the
pixel values of each centroid in the source to a weighted mean of the target centroid. The
weights are determined by the sparse transportation plan computed via DROT.
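In outline, the color-transfer pipeline can be reproduced with a sketch like the one below. It is our own illustration: it assumes scikit-learn's KMeans for the quantization step and a solver callable ot_solver(C, p, q) returning a transport plan (e.g. a wrapper around the drot sketch above); the cluster count and parameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def color_transfer(src_pixels, tgt_pixels, ot_solver, k=750):
    """Quantize both images to k colors, transport source clusters onto target clusters,
    and recolor each source pixel by a plan-weighted mean of the target centroids."""
    km_s = KMeans(n_clusters=k, n_init=4).fit(src_pixels)    # src_pixels: (N, 3) RGB in [0, 1]
    km_t = KMeans(n_clusters=k, n_init=4).fit(tgt_pixels)
    cs, ct = km_s.cluster_centers_, km_t.cluster_centers_
    p = np.bincount(km_s.labels_, minlength=k) / len(src_pixels)   # source color histogram
    q = np.bincount(km_t.labels_, minlength=k) / len(tgt_pixels)   # target color histogram
    C = ((cs[:, None, :] - ct[None, :, :]) ** 2).sum(-1)           # squared color distances
    C /= C.max()
    X = ot_solver(C, p, q)                                         # (sparse) transport plan
    # Map each source centroid to the barycenter of the target centroids it sends mass to.
    new_colors = (X @ ct) / X.sum(axis=1, keepdims=True)
    return new_colors[km_s.labels_]                                # recolored source pixels
```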

5 Conclusions
We developed, analyzed, and implemented an operator splitting method (DROT) for solving
the discrete optimal transport problem. Unlike popular Sinkhorn-based methods, which solve
approximate regularized problems, our method tackles the OT problem directly in its primal
form. Each DROT iteration can be executed very efficiently, in parallel, and our implemen-
tation can perform more extensive computations, including the continuous monitoring of a
primal-dual stopping criterion, with the same per-iteration cost and memory footprint as the
Sinkhorn method. The net effect is a fast method that can solve OTs to high accuracy, provide
sparse transport plans, and avoid the numerical issues of the Sinkhorn family. Our algorithm
enjoys strong convergence guarantees, including an iteration complexity of O(1/ε), compared to
O(1/ε²) for the Sinkhorn method, and a problem-dependent linear convergence rate.

Acknowledgement
This work was supported in part by the Knut and Alice Wallenberg Foundation, the Swedish
Research Council and the Swedish Foundation for Strategic Research, and the Wallenberg
AI, Autonomous Systems and Software Program (WASP). The computations were enabled by
resources provided by the Swedish National Infrastructure for Computing (SNIC), partially
funded by the Swedish Research Council through grant agreement no. 2018-05973.

References
[1] M. Z. Alaya, M. Berar, G. Gasso, and A. Rakotomamonjy. Screening Sinkhorn algorithm
for regularized optimal transport. In Advances in Neural Information Processing Systems,
volume 32, 2019.

[2] J. Altschuler, J. Weed, and P. Rigollet. Near-linear time approximation algorithms for
optimal transport via Sinkhorn iteration. In Advances in Neural Information Processing
Systems, pages 1964–1974, 2017.

[3] D. Applegate, O. Hinder, H. Lu, and M. Lubin. Faster first-order primal-dual methods
for linear programming using restarts and sharpness. arXiv preprint arXiv:2105.12715,
2021.

[4] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks.


In International conference on machine learning, pages 214–223. PMLR, 2017.

[5] H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in
Hilbert spaces. Springer, 2nd edition, 2017.

[6] H. H. Bauschke, S. Singh, and X. Wang. Projecting onto rectangular matrices with
prescribed row and column sums. arXiv preprint arXiv:2105.12222, 2021.

[7] A. Beck. First-order methods in optimization, volume 25. SIAM, 2017.

[8] M. Blondel, V. Seguy, and A. Rolet. Smooth and sparse optimal transport. In Interna-
tional Conference on Artificial Intelligence and Statistics, pages 880–889. PMLR, 2018.

[9] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al. Distributed optimization and
statistical learning via the alternating direction method of multipliers. Foundations and
Trends® in Machine learning, 3(1):1–122, 2011.

[10] A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with
applications to imaging. Journal of mathematical imaging and vision, 40(1):120–145,
2011.

[11] J. Cheng, M. Grossman, and T. McKercher. Professional CUDA C programming. John


Wiley & Sons, 2014.

[12] N. Courty, R. Flamary, D. Tuia, and A. Rakotomamonjy. Optimal transport for domain
adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853–
1865, 2016.

[13] M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. Advances


in neural information processing systems, 26:2292–2300, 2013.

[14] F. Deutsch. Best approximation in inner product spaces, volume 7. Springer, 2001.

[15] J. Douglas and H. H. Rachford. On the numerical solution of heat conduction problems
in two and three space variables. Transactions of the American mathematical Society, 82
(2):421–439, 1956.

[16] J. Eckstein and D. P. Bertsekas. On the Douglas–Rachford splitting method and the
proximal point algorithm for maximal monotone operators. Mathematical Programming,
55(1):293–318, 1992.

[17] M. Fukushima. The primal Douglas-Rachford splitting algorithm for a class of monotone
mappings with application to the traffic equilibrium problem. Mathematical Programming,
72(1):1–15, 1996.

[18] D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational
problems via finite element approximation. Computers & mathematics with applications,
2(1):17–40, 1976.

[19] M. Garstka, M. Cannon, and P. Goulart. COSMO: A conic operator splitting method
for convex conic problems. Journal of Optimization Theory and Applications, pages 1–32,
2021.

[20] P. Giselsson and S. Boyd. Linear convergence and metric selection for Douglas–Rachford
splitting and ADMM. IEEE Transactions on Automatic Control, 62(2):532–544, 2016.

[21] B. He and X. Yuan. On the O(1/n) convergence rate of the Douglas-Rachford alternating
direction method. SIAM Journal on Numerical Analysis, 50(2):700–709, 2012.

[22] F. Iutzeler and J. Malick. Nonsmoothness in machine learning: specific structure, proxi-
mal identification, and applications. Set-Valued and Variational Analysis, 28(4):661–678,
2020.

[23] D. B. Kirk and W. H. Wen-Mei. Programming massively parallel processors: a hands-on
approach. Morgan Kaufmann, 2016.

[24] W. Li. Sharp Lipschitz constants for basic optimal solutions and basic feasible solutions
of linear programs. SIAM journal on control and optimization, 32(1):140–153, 1994.

[25] J. Liang, J. Fadili, and G. Peyré. Local convergence properties of Douglas–Rachford


and alternating direction method of multipliers. Journal of Optimization Theory and
Applications, 172(3):874–913, 2017.

[26] T. Lin, N. Ho, and M. Jordan. On efficient optimal transport: An analysis of greedy and
accelerated mirror descent algorithms. In International Conference on Machine Learning,
pages 3982–3991. PMLR, 2019.

[27] P.-L. Lions and B. Mercier. Splitting algorithms for the sum of two nonlinear operators.
SIAM Journal on Numerical Analysis, 16(6):964–979, 1979.

[28] B. O'Donoghue, E. Chu, N. Parikh, and S. Boyd. Conic optimization via operator splitting
and homogeneous self-dual embedding. Journal of Optimization Theory and Applications,
169(3):1042–1068, 2016.

[29] G. Peyré and M. Cuturi. Computational optimal transport. Foundations and Trends®
in Machine Learning, 11(5-6):355–607, 2019.

[30] R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM journal
on control and optimization, 14(5):877–898, 1976.

[31] G. Schiebinger, J. Shu, M. Tabaka, B. Cleary, V. Subramanian, A. Solomon, J. Gould,


S. Liu, S. Lin, P. Berube, et al. Optimal-transport analysis of single-cell gene expression
identifies developmental trajectories in reprogramming. Cell, 176(4):928–943, 2019.

[32] B. Stellato, G. Banjac, P. Goulart, A. Bemporad, and S. Boyd. OSQP: An operator


splitting solver for quadratic programs. Mathematical Programming Computation, pages
1–36, 2020.

[33] S. Wang and N. Shroff. A new alternating direction method for linear programming.
In Proceedings of the 31st International Conference on Neural Information Processing
Systems, pages 1479–1487, 2017.

A Proofs of convergence rates
In order to establish the convergence rates for DROT, we first collect some useful results
associated with the mapping that underpins the DR splitting:

$$ T(y) = y + \mathrm{prox}_{\rho g}\big(2\,\mathrm{prox}_{\rho f}(y) - y\big) - \mathrm{prox}_{\rho f}(y). $$

DR corresponds to finding a fixed point of T, i.e. a point $y^\star$ with $y^\star = T(y^\star)$, or equivalently,
finding a point in the nullspace of G, where $G(y) = y - T(y)$. The operator T, and thereby
also G, is firmly nonexpansive (see e.g. [27]). That is, for all $y, y' \in \mathrm{dom}(T)$:

$$ \langle T(y) - T(y'), y - y'\rangle \ge \|T(y) - T(y')\|^2, \qquad \langle G(y) - G(y'), y - y'\rangle \ge \|G(y) - G(y')\|^2. \tag{11} $$

Let $\mathcal{G}^\star$ be the set of all fixed points of T, i.e. $\mathcal{G}^\star = \{y \in \mathrm{dom}(T)\,|\, G(y) = 0\}$. Then (11)
implies the following bound:

Lemma A.1. Let $y^\star \in \mathcal{G}^\star$ and $y_{k+1} = T(y_k)$. Then it holds that

$$ \|y_{k+1} - y^\star\|^2 \le \|y_k - y^\star\|^2 - \|y_{k+1} - y_k\|^2. $$

Proof. We have

$$
\begin{aligned}
\|y_{k+1} - y^\star\|^2 &= \|y_{k+1} - y_k + y_k - y^\star\|^2 \\
&= \|y_{k+1} - y_k\|^2 + 2\,\langle y_{k+1} - y_k,\ y_k - y^\star\rangle + \|y_k - y^\star\|^2 \\
&= \|y_{k+1} - y_k\|^2 - 2\,\langle G(y_k) - G(y^\star),\ y_k - y^\star\rangle + \|y_k - y^\star\|^2 \\
&\le \|y_{k+1} - y_k\|^2 - 2\,\|G(y_k) - G(y^\star)\|^2 + \|y_k - y^\star\|^2 \\
&= \|y_k - y^\star\|^2 - \|y_{k+1} - y_k\|^2.
\end{aligned}
$$

Corollary A.1. Lemma A.1 implies that

$$ \|y_{k+1} - y_k\| \le \|y_k - y^\star\| \tag{12a} $$
$$ (\mathrm{dist}(y_{k+1}, \mathcal{G}^\star))^2 \le (\mathrm{dist}(y_k, \mathcal{G}^\star))^2 - \|y_{k+1} - y_k\|^2. \tag{12b} $$

Proof. The first inequality follows directly from the non-negativity of the left-hand side of
Lemma A.1. The latter inequality follows from

$$ (\mathrm{dist}(y_{k+1}, \mathcal{G}^\star))^2 = \|y_{k+1} - P_{\mathcal{G}^\star}(y_{k+1})\|^2 \le \|y_{k+1} - P_{\mathcal{G}^\star}(y_k)\|^2. $$

Applying Lemma A.1 to the right-hand side with $y^\star = P_{\mathcal{G}^\star}(y_k)$ yields the desired result.

A.1 Proof of Theorem 1

Since DR is equivalent to ADMM up to changes of variables and swapping of the iteration
order, it is expected that DR can attain the ergodic rate O(1/k) of ADMM, both for the
objective gap and the constraint violation [21, 7]. However, mapping the convergence proof of
ADMM to a specific instance of DR is tedious and error-prone. We thus give a direct proof
here instead. We first recall the DROT method:

$$ X_{k+1} = [Y_k - \rho C]_+ \tag{13a} $$
$$ Z_{k+1} = P_{\mathcal{X}}(2X_{k+1} - Y_k) \tag{13b} $$
$$ Y_{k+1} = Y_k + Z_{k+1} - X_{k+1}. \tag{13c} $$

Since $X_{k+1}$ is the projection of $Y_k - \rho C$ onto $\mathbb{R}^{m\times n}_+$, it holds that

$$ \langle X_{k+1} - Y_k + \rho C,\ X - X_{k+1}\rangle \ge 0 \quad \forall X \in \mathbb{R}^{m\times n}_+. \tag{14} $$

Also, since $\mathcal{X}$ is a closed affine subspace, we have

$$ \langle Z_{k+1} - 2X_{k+1} + Y_k,\ Z - Z_{k+1}\rangle = 0 \quad \forall Z \in \mathcal{X}. \tag{15} $$

Next, for $X \in \mathbb{R}^{m\times n}_+$, $Z \in \mathcal{X}$, and $U \in \mathbb{R}^{m\times n}$ we define the function

$$ V(X, Z, U) = f(X) + g(Z) + \langle U, Z - X\rangle. $$

Let $U_{k+1} = (Y_k - X_{k+1})/\rho$. It holds that

$$
\begin{aligned}
&V(X_{k+1}, Z_{k+1}, U_{k+1}) - V(X, Z, U_{k+1}) \\
&\quad = f(X_{k+1}) + g(Z_{k+1}) + \langle U_{k+1}, Z_{k+1} - X_{k+1}\rangle - f(X) - g(Z) - \langle U_{k+1}, Z - X\rangle \\
&\quad = \langle C, X_{k+1}\rangle - \langle C, X\rangle + \frac{1}{\rho}\langle Y_k - X_{k+1},\ Z_{k+1} - Z + X - X_{k+1}\rangle.
\end{aligned}
$$

By (14), we have $\langle Y_k - X_{k+1},\ X - X_{k+1}\rangle/\rho \le \langle C, X\rangle - \langle C, X_{k+1}\rangle$; we thus arrive at

$$
\begin{aligned}
V(X_{k+1}, Z_{k+1}, U_{k+1}) - V(X, Z, U_{k+1}) &\le \frac{1}{\rho}\langle Y_k - X_{k+1},\ Z_{k+1} - Z\rangle \\
&= \frac{1}{\rho}\langle X_{k+1} - Z_{k+1},\ Z_{k+1} - Z\rangle = -\frac{1}{\rho}\|X_{k+1} - Z_{k+1}\|^2 + \frac{1}{\rho}\langle X_{k+1} - Z_{k+1},\ X_{k+1} - Z\rangle, \quad (16)
\end{aligned}
$$

where the second step follows from (15). Now, for any $Y \in \mathbb{R}^{m\times n}$, we also have

$$
\begin{aligned}
V\Big(X_{k+1}, Z_{k+1}, \frac{Y - Z}{\rho}\Big) - V(X_{k+1}, Z_{k+1}, U_{k+1})
&= \frac{1}{\rho}\langle Y - Z,\ Z_{k+1} - X_{k+1}\rangle - \frac{1}{\rho}\langle Y_k - X_{k+1},\ Z_{k+1} - X_{k+1}\rangle \\
&= \frac{1}{\rho}\langle Y - Y_k,\ Z_{k+1} - X_{k+1}\rangle + \frac{1}{\rho}\langle X_{k+1} - Z,\ Z_{k+1} - X_{k+1}\rangle. \quad (17)
\end{aligned}
$$

Therefore, by adding both sides of (16) and (17), we obtain

$$ V\Big(X_{k+1}, Z_{k+1}, \frac{Y - Z}{\rho}\Big) - V(X, Z, U_{k+1}) \le \frac{1}{\rho}\langle Y - Y_k,\ Z_{k+1} - X_{k+1}\rangle - \frac{1}{\rho}\|X_{k+1} - Z_{k+1}\|^2. \tag{18} $$

Since $Z_{k+1} - X_{k+1} = Y_{k+1} - Y_k$, it holds that

$$ \frac{1}{\rho}\langle Y - Y_k,\ Z_{k+1} - X_{k+1}\rangle = \frac{1}{2\rho}\|Y_k - Y\|^2 - \frac{1}{2\rho}\|Y_{k+1} - Y\|^2 + \frac{1}{2\rho}\|Y_{k+1} - Y_k\|^2. \tag{19} $$
Plugging (19) into (18) yields

$$ V\Big(X_{k+1}, Z_{k+1}, \frac{Y - Z}{\rho}\Big) - V(X, Z, U_{k+1}) \le \frac{1}{2\rho}\|Y_k - Y\|^2 - \frac{1}{2\rho}\|Y_{k+1} - Y\|^2 - \frac{1}{2\rho}\|Y_{k+1} - Y_k\|^2, \tag{20} $$

which, by dropping the last term on the right-hand side, gives the first claim in the theorem.
Let $Y^\star$ be a fixed point of the mapping T and let $X^\star$ be a solution of (1) defined from $Y^\star$
in the manner of Lemma 2.1. Invoking (20) with $Z = X = X^\star \in \mathbb{R}^{m\times n}_+ \cap \mathcal{X}$ and summing
both sides of (20) over the iterations $0, \ldots, k-1$ gives

$$ \langle C, \bar{X}_k\rangle - \langle C, X^\star\rangle + \Big\langle \frac{Y - X^\star}{\rho},\ \bar{Z}_k - \bar{X}_k \Big\rangle \le \frac{1}{k}\Big(\frac{1}{2\rho}\|Y_0 - Y\|^2 - \frac{1}{2\rho}\|Y_k - Y\|^2\Big), \tag{21} $$

where $\bar{X}_k = \sum_{i=1}^k X_i/k$ and $\bar{Z}_k = \sum_{i=1}^k Z_i/k$. Since

$$ \frac{1}{2\rho}\|Y_0 - Y\|^2 - \frac{1}{2\rho}\|Y_k - Y\|^2 = \frac{1}{2\rho}\|Y_0\|^2 - \frac{1}{2\rho}\|Y_k\|^2 + \frac{1}{\rho}\langle Y,\ Y_k - Y_0\rangle, $$

it follows that

$$ \langle C, \bar{X}_k\rangle - \langle C, X^\star\rangle + \frac{1}{\rho}\Big\langle Y,\ \bar{Z}_k - \bar{X}_k - \frac{Y_k - Y_0}{k}\Big\rangle \le \frac{1}{2\rho k}\|Y_0\|^2 - \frac{1}{2\rho k}\|Y_k\|^2 + \frac{1}{\rho}\langle X^\star,\ \bar{Z}_k - \bar{X}_k\rangle. $$

Since Y is arbitrary in the preceding inequality, we must have that

$$ \bar{Z}_k - \bar{X}_k - \frac{Y_k - Y_0}{k} = 0, \tag{22} $$

from which we deduce that

$$
\begin{aligned}
\langle C, \bar{X}_k\rangle - \langle C, X^\star\rangle &\le \frac{1}{2\rho k}\|Y_0\|^2 - \frac{1}{2\rho k}\|Y_k\|^2 + \frac{1}{\rho}\langle X^\star,\ \bar{Z}_k - \bar{X}_k\rangle \\
&\le \frac{1}{2\rho k}\|Y_0\|^2 + \frac{1}{\rho}\|X^\star\|\,\|\bar{Z}_k - \bar{X}_k\| \\
&= \frac{1}{2\rho k}\|Y_0\|^2 + \frac{1}{\rho k}\|X^\star\|\,\|Y_k - Y_0\|. \quad (23)
\end{aligned}
$$

By Lemma A.1, we have

$$ \|Y_k - Y^\star\|^2 \le \|Y_{k-1} - Y^\star\|^2 \le \cdots \le \|Y_0 - Y^\star\|^2, $$

which implies that

$$ \|Y_k - Y_0\| \le \|Y_k - Y^\star\| + \|Y_0 - Y^\star\| \le 2\|Y_0 - Y^\star\|. \tag{24} $$

Finally, plugging (24) into (23) and (22) yields

$$ \langle C, \bar{X}_k\rangle - \langle C, X^\star\rangle \le \frac{1}{k}\Big(\frac{1}{2\rho}\|Y_0\|^2 + \frac{2}{\rho}\|X^\star\|\,\|Y_0 - Y^\star\|\Big), \qquad \|\bar{Z}_k - \bar{X}_k\| = \frac{\|Y_k - Y_0\|}{k} \le \frac{2\|Y_0 - Y^\star\|}{k}, $$

which concludes the proof.
A.2 Proof of Theorem 2
We will employ a similar technique for proving linear convergence as [33] used for ADMM on
LPs. Although they, in contrast to our approach, handled the linear constraints via relaxation,
we show that the linear convergence also holds for our OT-customized splitting algorithm
which relies on polytope projections. The main difference is that the dual variables are given
implicitly, rather than explicitly via the multipliers defined in the ADMM framework. As a
consequence, their proof needs to be adjusted to apply to our proposed algorithm.
To facilitate the analysis of the given splitting algorithm, we let e = 1n and f = 1m and
consider the following equivalent LP:
$$
\begin{array}{ll}
\underset{X, Z \in \mathbb{R}^{m\times n}}{\text{minimize}} & \langle C, X\rangle \\
\text{subject to} & Ze = p \\
& Z^{\top} f = q \\
& X \ge 0 \\
& Z = X.
\end{array}
\tag{25}
$$

The optimality conditions can be written on the form

$$
\begin{aligned}
Z^\star e &= p & (26a)\\
Z^{\star\top} f &= q & (26b)\\
X^\star &\ge 0 & (26c)\\
Z^\star &= X^\star & (26d)\\
\mu^\star e^{\top} + f\nu^{\star\top} &\le C & (26e)\\
\langle C, X^\star\rangle - p^{\top}\mu^\star - q^{\top}\nu^\star &= 0. & (26f)
\end{aligned}
$$

This means that the set of optimal solutions, $S^\star$, is a polyhedron

$$ S^\star = \{\mathcal{Z} = (X, Z, \mu, \nu) \,|\, M_{\mathrm{eq}}(\mathcal{Z}) = 0,\ M_{\mathrm{in}}(\mathcal{Z}) \le 0\}, $$

where $M_{\mathrm{eq}}$ and $M_{\mathrm{in}}$ denote the equality and inequality constraints of (26), respectively. Li
[24] proposed a uniform bound on the distance between a point and such a polyhedron in terms of
the constraint violation:

$$ \mathrm{dist}(\mathcal{Z}, S^\star) \le \theta_{S^\star} \left\| \begin{pmatrix} M_{\mathrm{eq}}(\mathcal{Z}) \\ [M_{\mathrm{in}}(\mathcal{Z})]_+ \end{pmatrix} \right\|. \tag{27} $$

For the (X, Z)-variables generated by (9), conditions (26a), (26b), and (26c) will always hold due to the
projections carried out in the subproblems. As a consequence, only (26d), (26e), and (26f)
will contribute to the bound (27). In particular:

$$ \mathrm{dist}((X_k, Z_k, \mu_k, \nu_k), S^\star) \le \theta_{S^\star} \left\| \begin{pmatrix} Z_k - X_k \\ \langle C, X_k\rangle - p^{\top}\mu_k - q^{\top}\nu_k \\ [\mu_k e^{\top} + f\nu_k^{\top} - C]_+ \end{pmatrix} \right\|. \tag{28} $$


The following lemma allows us to express the right-hand side of (28) in terms of $Y_{k+1} - Y_k$.

Lemma A.2. Let $X_k$, $Z_k$ and $Y_k$ be sequences generated by (9). It then holds that

$$
\begin{aligned}
Z_{k+1} - X_{k+1} &= Y_{k+1} - Y_k \\
\mu_{k+1} e^{\top} + f \nu_{k+1}^{\top} - C &\le \rho^{-1}(Y_{k+1} - Y_k) \\
\langle C, X_{k+1}\rangle - p^{\top}\mu_{k+1} - q^{\top}\nu_{k+1} &= -\rho^{-1}\langle Y_{k+1},\ Y_{k+1} - Y_k\rangle.
\end{aligned}
$$
Proof. The first equality follows directly from the Y-update in (9). Let

$$ \mu_k = \phi_k/\rho, \qquad \nu_k = \varphi_k/\rho; \tag{29} $$

then the Y-update in (10) yields

$$ Y_{k+1} = Y_k + Z_{k+1} - X_{k+1} = X_{k+1} + \rho\mu_{k+1} e^{\top} + \rho f \nu_{k+1}^{\top}. \tag{30} $$

In addition, the X-update

$$ X_{k+1} = \max(Y_k - \rho C,\ 0) \tag{31} $$

can be combined with the first equality in (30) to obtain

$$ Y_{k+1} = \min(Y_k + Z_{k+1},\ Z_{k+1} + \rho C), $$

or equivalently,

$$ Y_{k+1} - Z_{k+1} - \rho C = \min(Y_k - \rho C,\ 0). \tag{32} $$

By (30), the left-hand side of (32) can be written as

$$ Y_{k+1} - Z_{k+1} - \rho C = X_{k+1} - Z_{k+1} + \rho\mu_{k+1} e^{\top} + \rho f \nu_{k+1}^{\top} - \rho C = \rho\big(\mu_{k+1} e^{\top} + f \nu_{k+1}^{\top} - C - \rho^{-1}(Y_{k+1} - Y_k)\big). $$

Substituting this into (32) gives

$$ \mu_{k+1} e^{\top} + f \nu_{k+1}^{\top} - C - \rho^{-1}(Y_{k+1} - Y_k) = \rho^{-1}\min(Y_k - \rho C,\ 0) \le 0, \tag{33} $$

which gives the second relation in the lemma. Moreover, (31) and (33) imply that $X_{k+1}$ and
$\mu_{k+1} e^{\top} + f \nu_{k+1}^{\top} - \rho^{-1}(Y_{k+1} - Y_k) - C$ cannot be nonzero simultaneously. As a consequence,

$$ \Big\langle X_{k+1},\ \mu_{k+1} e^{\top} + f \nu_{k+1}^{\top} - \rho^{-1}(Y_{k+1} - Y_k) - C \Big\rangle = 0, $$

or

$$
\begin{aligned}
\langle C, X_{k+1}\rangle &= \Big\langle X_{k+1},\ \mu_{k+1} e^{\top} + f \nu_{k+1}^{\top} - \rho^{-1}(Y_{k+1} - Y_k)\Big\rangle \\
&= \Big\langle Z_{k+1} - (Y_{k+1} - Y_k),\ \mu_{k+1} e^{\top} + f \nu_{k+1}^{\top} - \rho^{-1}(Y_{k+1} - Y_k)\Big\rangle \\
&= \Big\langle Z_{k+1},\ \mu_{k+1} e^{\top} + f \nu_{k+1}^{\top}\Big\rangle - \Big\langle \rho^{-1} Z_{k+1} + \mu_{k+1} e^{\top} + f \nu_{k+1}^{\top},\ Y_{k+1} - Y_k\Big\rangle + \rho^{-1}\|Y_{k+1} - Y_k\|^2.
\end{aligned}
$$

By (30), $\rho^{-1} Z_{k+1} + \mu_{k+1} e^{\top} + f \nu_{k+1}^{\top} = \rho^{-1}(Z_{k+1} + Y_{k+1} - X_{k+1}) = \rho^{-1}(2Y_{k+1} - Y_k)$; hence

$$
\begin{aligned}
\langle C, X_{k+1}\rangle &= \Big\langle Z_{k+1},\ \mu_{k+1} e^{\top} + f \nu_{k+1}^{\top}\Big\rangle - \rho^{-1}\langle 2Y_{k+1} - Y_k,\ Y_{k+1} - Y_k\rangle + \rho^{-1}\|Y_{k+1} - Y_k\|^2 \\
&= \Big\langle Z_{k+1},\ \mu_{k+1} e^{\top} + f \nu_{k+1}^{\top}\Big\rangle - \rho^{-1}\langle Y_{k+1},\ Y_{k+1} - Y_k\rangle. \quad (34)
\end{aligned}
$$

Note that

$$
\begin{aligned}
\Big\langle Z_{k+1},\ \mu_{k+1} e^{\top} + f \nu_{k+1}^{\top}\Big\rangle
&= \mathrm{tr}(Z_{k+1}^{\top}\mu_{k+1} e^{\top}) + \mathrm{tr}(Z_{k+1}^{\top} f \nu_{k+1}^{\top}) \\
&= \mathrm{tr}(e^{\top} Z_{k+1}^{\top}\mu_{k+1}) + \mathrm{tr}(\nu_{k+1}^{\top} Z_{k+1}^{\top} f) \\
&= \mathrm{tr}(p^{\top}\mu_{k+1}) + \mathrm{tr}(\nu_{k+1}^{\top} q) \\
&= p^{\top}\mu_{k+1} + q^{\top}\nu_{k+1}. \quad (35)
\end{aligned}
$$

Substituting (35) into (34), we get the third equality of the lemma:

$$ \langle C, X_{k+1}\rangle - p^{\top}\mu_{k+1} - q^{\top}\nu_{k+1} = -\rho^{-1}\langle Y_{k+1},\ Y_{k+1} - Y_k\rangle. $$

By combining Lemma A.2 and (28), we obtain the following result:


Lemma A.3. Let $X_k$, $Z_k$ and $Y_k$ be sequences generated by (9) and let $\mu_k$, $\nu_k$ be defined by
(29). Then, there is a $\gamma > 0$ such that

$$ \mathrm{dist}((X_k, Z_k, \mu_k, \nu_k), S^\star) \le \gamma\,\|Y_k - Y_{k-1}\|. $$

Proof. Let $Y^\star \in \mathcal{G}^\star$; then Lemma A.1 asserts that

$$ \|Y_k - Y^\star\| \le \|Y_{k-1} - Y^\star\| \le \cdots \le \|Y_0 - Y^\star\|, $$

which implies that $\|Y_k\| \le M$ for some positive constant M. In fact, $\{Y_k\}$ belongs to a ball
centered at $Y^\star$ with radius $\|Y_0 - Y^\star\|$. From the bound (28) and Lemma A.2, we have

$$
\begin{aligned}
\mathrm{dist}((X_k, Z_k, \mu_k, \nu_k), S^\star)
&\le \theta_{S^\star} \left\| \begin{pmatrix} Z_k - X_k \\ \langle C, X_k\rangle - p^{\top}\mu_k - q^{\top}\nu_k \\ [\mu_k e^{\top} + f\nu_k^{\top} - C]_+ \end{pmatrix} \right\| \\
&\le \theta_{S^\star}\big(\|Z_k - X_k\| + |\langle C, X_k\rangle - p^{\top}\mu_k - q^{\top}\nu_k| + \|[\mu_k e^{\top} + f\nu_k^{\top} - C]_+\|\big) \\
&\le \theta_{S^\star}\big(\|Y_k - Y_{k-1}\| + \rho^{-1}|\langle Y_k,\ Y_k - Y_{k-1}\rangle| + \rho^{-1}\|[Y_k - Y_{k-1}]_+\|\big) \\
&\le \theta_{S^\star}\big(1 + \rho^{-1} + \rho^{-1}\|Y_k\|\big)\|Y_k - Y_{k-1}\| \\
&\le \theta_{S^\star}\big(1 + \rho^{-1}(M + 1)\big)\|Y_k - Y_{k-1}\|.
\end{aligned}
$$

By letting $\gamma = \theta_{S^\star}(1 + \rho^{-1}(M + 1))$ we obtain the desired result.

Recall that for DROT, $\mathrm{prox}_{\rho f}(\cdot) = P_{\mathbb{R}^{m\times n}_+}(\cdot - \rho C)$ and $\mathrm{prox}_{\rho g}(\cdot) = P_{\mathcal{X}}(\cdot)$. The following
lemma bridges the optimality conditions stated in (26) with $\mathcal{G}^\star$.

Lemma A.4. If $Y^\star = X^\star + \rho(\mu^\star e^{\top} + f\nu^{\star\top})$, where $X^\star$, $\mu^\star$, and $\nu^\star$ satisfy (26), then $G(Y^\star) = 0$.

Proof. We need to prove that

$$ G(Y^\star) = P_{\mathbb{R}^{m\times n}_+}(Y^\star - \rho C) - P_{\mathcal{X}}\Big(2P_{\mathbb{R}^{m\times n}_+}(Y^\star - \rho C) - Y^\star\Big) $$

is equal to zero. Let us start by simplifying $P_{\mathbb{R}^{m\times n}_+}(Y^\star - \rho C)$. By complementary slackness,

$$ \langle X^\star,\ \mu^\star e^{\top} + f\nu^{\star\top} - C\rangle = 0, $$

and by dual feasibility,

$$ \mu^\star e^{\top} + f\nu^{\star\top} - C \le 0. $$

Consequently, we have

$$ \langle \mu^\star e^{\top} + f\nu^{\star\top} - C,\ X - X^\star\rangle \le 0 \quad \forall X \in \mathbb{R}^{m\times n}_+, $$

which defines a normal cone of $\mathbb{R}^{m\times n}_+$, i.e. $\mu^\star e^{\top} + f\nu^{\star\top} - C \in N_{\mathbb{R}^{m\times n}_+}(X^\star)$. By the definition of $Y^\star$, and since $\rho > 0$, this implies that

$$ Y^\star - X^\star - \rho C \in N_{\mathbb{R}^{m\times n}_+}(X^\star), $$

which corresponds to the optimality condition of the projection

$$ X^\star = P_{\mathbb{R}^{m\times n}_+}(Y^\star - \rho C). $$

We thus have $P_{\mathcal{X}}\big(2P_{\mathbb{R}^{m\times n}_+}(Y^\star - \rho C) - Y^\star\big) = P_{\mathcal{X}}(2X^\star - Y^\star)$. But by the definition of $Y^\star$,

$$ 0 = Y^\star - X^\star - \rho(\mu^\star e^{\top} + f\nu^{\star\top}) = X^\star - (2X^\star - Y^\star) - \rho\mu^\star e^{\top} - \rho f\nu^{\star\top}, $$

which means that $X^\star = P_{\mathcal{X}}(2X^\star - Y^\star)$ (following from the optimality condition of the projection). Substituting all these quantities into the formula for $G(Y^\star)$, we obtain $G(Y^\star) = 0$.

Lemma A.5. There exists a constant $c \ge 1$ such that

$$ \mathrm{dist}(Y_k, \mathcal{G}^\star) \le c\,\|Y_k - Y_{k-1}\|. $$

Proof. Let $Y^\star = X^\star + \rho\mu^\star e^{\top} + \rho f\nu^{\star\top}$, where $(X^\star, X^\star, \mu^\star, \nu^\star) = P_{S^\star}((X_k, Z_k, \mu_k, \nu_k))$. According to Lemma A.4, we have that $Y^\star \in \mathcal{G}^\star$. Then

$$ \mathrm{dist}(Y_k, \mathcal{G}^\star) \le \|Y_k - Y^\star\| = \|Y_k - X^\star - \rho(\mu^\star e^{\top} + f\nu^{\star\top})\|. $$

Since $Y_k = X_k + \rho\mu_k e^{\top} + \rho f\nu_k^{\top}$, it holds that

$$
\begin{aligned}
\mathrm{dist}(Y_k, \mathcal{G}^\star) &\le \|X_k - X^\star + \rho(\mu_k - \mu^\star)e^{\top} + \rho f(\nu_k - \nu^\star)^{\top}\| \\
&\le \|X_k - X^\star\| + \rho\|e\|\,\|\mu_k - \mu^\star\| + \rho\|f\|\,\|\nu_k - \nu^\star\| \\
&\le \sqrt{1 + \rho^2\|e\|^2 + \rho^2\|f\|^2}\ \sqrt{\|X_k - X^\star\|^2 + \|\mu_k - \mu^\star\|^2 + \|\nu_k - \nu^\star\|^2} \\
&\le \sqrt{1 + \rho^2\|e\|^2 + \rho^2\|f\|^2}\ \|(X_k - X^\star,\ Z_k - X^\star,\ \mu_k - \mu^\star,\ \nu_k - \nu^\star)\| \\
&= \sqrt{1 + \rho^2\|e\|^2 + \rho^2\|f\|^2}\ \mathrm{dist}((X_k, Z_k, \mu_k, \nu_k), S^\star) \\
&\le \gamma(1 + \rho(\|e\| + \|f\|))\,\|Y_k - Y_{k-1}\|,
\end{aligned}
$$

where the last inequality follows from Lemma A.3. The fact that

$$ c = \gamma(1 + \rho(\|e\| + \|f\|)) \ge 1 \tag{36} $$

follows from (12a) in Corollary A.1.
A.3 Proof of linear convergence

By combining Lemma A.5 with (12b) in Corollary A.1, we obtain

$$ (\mathrm{dist}(Y_{k+1}, \mathcal{G}^\star))^2 \le \frac{c^2}{1 + c^2}\,(\mathrm{dist}(Y_k, \mathcal{G}^\star))^2, $$

or equivalently,

$$ \mathrm{dist}(Y_{k+1}, \mathcal{G}^\star) \le \frac{c}{\sqrt{1 + c^2}}\,\mathrm{dist}(Y_k, \mathcal{G}^\star). $$

Letting $r = c/\sqrt{1 + c^2} < 1$ and iterating the inequality k times gives

$$ \mathrm{dist}(Y_k, \mathcal{G}^\star) \le \mathrm{dist}(Y_0, \mathcal{G}^\star)\, r^k. $$

Let $Y^\star := P_{\mathcal{G}^\star}(Y_k)$. Since $\mathrm{prox}_{\rho f}(Y^\star) \in \mathcal{X}^\star$ and $X_{k+1} = \mathrm{prox}_{\rho f}(Y_k)$, we have

$$ \mathrm{dist}(X_{k+1}, \mathcal{X}^\star) \le \|\mathrm{prox}_{\rho f}(Y_k) - \mathrm{prox}_{\rho f}(Y^\star)\| \le \|Y_k - Y^\star\| = \mathrm{dist}(Y_k, \mathcal{G}^\star) \le \mathrm{dist}(Y_0, \mathcal{G}^\star)\, r^k, $$

where the second inequality follows from the nonexpansiveness of $\mathrm{prox}_{\rho f}(\cdot)$. This completes
the proof of Theorem 2.

B DR and its connection to ADMM

The Douglas-Rachford updates given in (5) can equivalently be formulated as:

$$ x_{k+1} = \mathrm{prox}_{\rho f}(y_k), \qquad z_{k+1} = \mathrm{prox}_{\rho g}(2x_{k+1} - y_k), \qquad y_{k+1} = y_k + z_{k+1} - x_{k+1}. $$

Shifting the point at which an iteration starts, the same updates read

$$ z_{k+1} = \mathrm{prox}_{\rho g}(2x_{k+1} - y_k), \qquad y_{k+1} = y_k + z_{k+1} - x_{k+1}, \qquad x_{k+2} = \mathrm{prox}_{\rho f}(y_{k+1}), $$

or, after substituting the y-update into the x-update,

$$ z_{k+1} = \mathrm{prox}_{\rho g}(2x_{k+1} - y_k), \qquad x_{k+2} = \mathrm{prox}_{\rho f}(y_k + z_{k+1} - x_{k+1}), \qquad y_{k+1} = y_k + z_{k+1} - x_{k+1}. $$

If we let $w_{k+1} = \frac{1}{\rho}(y_k - x_{k+1})$, so that $2x_{k+1} - y_k = x_{k+1} - \rho w_{k+1}$, we get

$$ z_{k+1} = \mathrm{prox}_{\rho g}(x_{k+1} - \rho w_{k+1}), \qquad x_{k+2} = \mathrm{prox}_{\rho f}(z_{k+1} + \rho w_{k+1}), \qquad w_{k+2} = w_{k+1} + \frac{1}{\rho}(z_{k+1} - x_{k+2}). $$

By re-indexing the x and the w variables, we get:

$$ z_{k+1} = \mathrm{prox}_{\rho g}(x_k - \rho w_k), \qquad x_{k+1} = \mathrm{prox}_{\rho f}(z_{k+1} + \rho w_k), \qquad w_{k+1} = w_k + \frac{1}{\rho}(z_{k+1} - x_{k+1}). $$

This is exactly ADMM applied to the composite convex problem of the form (4), written with the consensus constraint z = x.
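The equivalence can also be checked numerically. The following small script (our own; the toy f and g are illustrative) verifies that the re-indexed ADMM recursion above reproduces the x-iterates of the DR iteration (5):

```python
import numpy as np

rng = np.random.default_rng(1)
a, rho = rng.normal(size=5), 0.7
prox_f = lambda y, r: np.sign(y) * np.maximum(np.abs(y) - r, 0.0)   # f = ||.||_1
prox_g = lambda y, r: (y + r * a) / (1 + r)                          # g = 0.5 ||. - a||^2

# Douglas-Rachford iteration (5)
y = np.zeros(5)
xs_dr = []
for _ in range(20):
    x = prox_f(y, rho)
    z = prox_g(2 * x - y, rho)
    y = y + z - x
    xs_dr.append(x)

# ADMM form derived above, started from x_1 = prox_f(y_0), w_1 = (y_0 - x_1)/rho
y0 = np.zeros(5)
x = prox_f(y0, rho)
w = (y0 - x) / rho
xs_admm = [x]
for _ in range(19):
    z = prox_g(x - rho * w, rho)
    x = prox_f(z + rho * w, rho)
    w = w + (z - x) / rho
    xs_admm.append(x)

assert np.allclose(xs_dr, xs_admm)   # identical x-sequences
```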

C Fenchel-Rockafellar duality of OT
Recall that

$$ f(X) = \langle C, X\rangle + I_{\mathbb{R}^{m\times n}_+}(X) \quad \text{and} \quad g(X) = I_{\{Y : \mathcal{A}(Y) = b\}}(X) = I_{\mathcal{X}}(X). $$

Recall also that the conjugate of the indicator function $I_{\mathcal{C}}$ of a closed and convex set $\mathcal{C}$ is the
support function $\sigma_{\mathcal{C}}$ of the same set [7, Chapter 4]. For a non-empty affine set $\mathcal{X} = \{Y : \mathcal{A}(Y) = b\}$, its support function can be computed as

$$ \sigma_{\mathcal{X}}(U) = \langle U, X_0\rangle + I_{\mathrm{ran}\mathcal{A}^*}(U), $$

where $X_0$ is a point in $\mathcal{X}$ [7, Example 2.30]. By letting $X_0 = pq^{\top} \in \mathcal{X}$, it follows that

$$ g^*(U) = \langle U, pq^{\top}\rangle + I_{\mathrm{ran}\mathcal{A}^*}(U). $$

Since $\mathcal{A}^*: \mathbb{R}^{m+n} \to \mathbb{R}^{m\times n}: (y, x) \mapsto y e^{\top} + f x^{\top}$, any matrix $U \in \mathrm{ran}\mathcal{A}^*$ must have the form
$\mu e^{\top} + f\nu^{\top}$ for some $\mu \in \mathbb{R}^m$ and $\nu \in \mathbb{R}^n$. The function $g^*(U)$ can thus be evaluated as

$$ g^*(U) = p^{\top}\mu + q^{\top}\nu. $$

We also have

$$
\begin{aligned}
f^*(-U) &= \max_X\,\{\langle -U, X\rangle - \langle C, X\rangle - I_{\mathbb{R}^{m\times n}_+}(X)\} \\
&= \max_X\,\{\langle -U - C, X\rangle - I_{\mathbb{R}^{m\times n}_+}(X)\} \\
&= \sigma_{\mathbb{R}^{m\times n}_+}(-U - C) \\
&= \begin{cases} 0 & \text{if } -U - C \le 0, \\ +\infty & \text{otherwise.} \end{cases}
\end{aligned}
$$

The Fenchel–Rockafellar dual of problem (4) in Lemma 2.1 thus becomes

$$ \underset{\mu \in \mathbb{R}^m,\ \nu \in \mathbb{R}^n}{\text{maximize}}\ -p^{\top}\mu - q^{\top}\nu \quad \text{subject to} \quad -\mu e^{\top} - f\nu^{\top} - C \le 0, $$

which is identical to (3) upon replacing µ by −µ and ν by −ν.
