On The Convergence of Projected Alternating Maximization For Equitable and Optimal Transport

Download as pdf or txt
Download as pdf or txt
You are on page 1of 33

Journal of Machine Learning Research 25 (2024) 1-33 Submitted 5/22; Revised 7/24; Published 10/24

On the Convergence of Projected Alternating Maximization


for Equitable and Optimal Transport

Minhui Huang [email protected]


Department of Electrical and Computer Engineering
University of California
Davis, CA 95616-5270, USA
Shiqian Ma [email protected]
Department of Computational Applied Mathematics and Operations Research
Rice University
Houston, TX 77005-1827, USA
Lifeng Lai [email protected]
Department of Electrical and Computer Engineering
University of California
Davis, CA 95616-5270, USA

Editor: Suvrit Sra

Abstract
This paper studies the equitable and optimal transport (EOT) problem, which has many
applications such as fair division problems and optimal transport with multiple agents etc.
In the discrete distributions case, the EOT problem can be formulated as a linear program
(LP). Since this LP is prohibitively large for general LP solvers, (Scetbon et al., 2021)
suggests to perturb the problem by adding an entropy regularization. They proposed
a projected alternating maximization algorithm (PAM) to solve the dual of the entropy
regularized EOT. In this paper, we provide the first convergence analysis of PAM. A novel
rounding procedure is proposed to help construct the primal solution for the original EOT
problem. We also propose a variant of PAM by incorporating the extrapolation technique
that can numerically improve the performance of PAM. Results in this paper may shed
lights on block coordinate (gradient) descent methods for general optimization problems.
Keywords: Equitable and Optimal Transport, Fairness, Saddle Point Problem, Projected
Alternating Maximization, Block Coordinate Descent, Acceleration, Rounding.

1. Introduction
Optimal transport (OT) is a classical problem that recently finds many emerging applica-
tions in machine learning and artificial intelligence, including generative models (Arjovsky
et al., 2017), representation learning (Ozair et al., 2019), reinforcement learning (Bellemare
et al., 2017) and word embeddings (Alvarez-Melis et al., 2019) etc. More recently, (Scetbon
et al., 2021) proposed an equitable and optimal transport (EOT) problem that targets to
fairly distribute the workload of OT when there are multiple agents. In this problem formu-
lation, there are multiple agents working together to move mass from measures µ to ν and
each agent has its unique cost function. A very important issue that needs to be considered
here is the fairness, which aims at finding transportation plans such that the workloads

c 2024 Minhui Huang and Shiqian Ma and Lifeng Lai.


License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided
at http://jmlr.org/papers/v25/22-0524.html.
Huang and Ma and Lai

among all the agents are equal to each other. This can be achieved by minimizing the
largest transportation cost among all agents, which leads to a convex-concave saddle point
problem. The EOT problem has wide applications in economics and machine learning,
such as fair division or the cake-cutting problem (Moulin, 2003; Brandt et al., 2016), multi-
type resource allocation (Mackin and Xia, 2015), internet minimal transportation time and
sequential optimal transport (Scetbon et al., 2021).
WePnow describe the EOT Pn problem formally. Given two discrete probability measures
n
µn = i=1 ai δxi and νn = i=1 bi δyi , the EOT studies the problem of transporting mass
from µ to ν by N agents. Here, {x1 , x2 , ..., xn } ⊂ Rd and {y1 , y2 , ..., yn } ⊂ Rd are the
support points of each measure and a = [a1 , a2 , ..., an ]> ∈ ∆n+ , b = [b1 , b2 , ..., bn ]> ∈ ∆n+ are
corresponding weights for each measure, where ∆n+ denotes the probability simplex in Rn .
Moreover, throughout this paper, we assume vector b > 0. For each agent k, we denote
its unique cost function as ck (x, y), k ∈ [N ] = {1, . . . , N } and its cost matrix as C k , where
k = ck (x , y ). Moreover, we define the following coupling decomposition set
Ci,j i j
( ! ! )
X X
ΠN k
a,b := π = (π )k∈[N ] r πk = a, c πk = b, k
πij ≥ 0, ∀i, j ∈ [n] ,
k k

where r(π) = π1, c(π) = π > 1 are the row sum and column sum of matrix π respectively.
Mathematically, the EOT problem can be formulated as

min max hπ k , C k i. (1)


π ∈ΠN
a,b
1≤k≤N

When N = 1, (1) reduces to the standard OT problem. Note that (1) minimizes the point-
wise maximum of a finite collection of functions. It is easy to see that (1) is equivalent to
the following constrained problem:
N
X
π , λ) :=
min max `(π λk hπ k , C k i. (2)
π ∈ΠN N
a,b λ∈∆+ k=1

The following proposition shows an important property of EOT: at the optimum of the
minimax EOT formulation (2), the transportation costs of the agents are equal to each
other.

Proposition 1 (Scetbon et al., 2021)[Proposition 1] Assume that all entries of all cost
matrices C k , k ∈ [N ] have the same sign. Let π ∗ ∈ ΠN a,b be the optimal solution of (2). It
holds that
h(π ∗ )i , C i i = h(π ∗ )j , C j i, ∀i, j ∈ [N ]. (3)

Note that Proposition 1 requires all entries of all cost matrices to have the same sign. When
the cost matrices are all non-negative, (2) solves the transportation problem with multiple
agents. When the cost matrices are all non-positive, the cost matrices are interpreted as
the utility functions and (2) solves the fair division problem (Moulin, 2003).
The discrete OT is a linear programming (LP) problem (in fact, an assignment prob-
lem) with a complexity of O(n3 log n) (Tarjan, 1997). Due to this cubic dependence on the

2
On the Convergence of PAM for Equitable and Optimal Transport

dimension n, it is challenging to solve large-scale OT in practice. A widely adopted com-


promise is to add an entropy regularizer to the OT problem (Cuturi, 2013). The resulting
problem is strongly convex and smooth, and its dual problem can be efficiently solved by the
celebrated Sinkhorn’s algorithm (Sinkhorn and Knopp, 1967; Cuturi, 2013). This strategy
is now widely used in the OT community due to its computational advantages as well as
improved sample complexity (Genevay et al., 2019). Similar ideas were also used for com-
puting the Wasserstein barycenter (Benamou et al., 2015), projection robust Wasserstein
distance (Paty and Cuturi, 2019; Lin et al., 2020; Huang et al., 2021a), projection robust
Wasserstein barycenter (Huang et al., 2021b). Motivated by these previous works, (Scetbon
et al., 2021) proposed to add an entropy regularizer to (2), and designed a projected alter-
nating maximization algorithm (PAM) to solve its dual problem. However, the convergence
of PAM has not been studied. (Scetbon et al., 2021) also proposed an accelerated projected
gradient ascent algorithm (APGA) for solving a different form of the dual problem of the
entropy regularized EOT. Since the objective function of this new dual form has Lipschitz
continuous gradient, APGA is essentially the Nesterov’s accelerated gradient method and
thus its convergence rate is known. However, numerical experiments conducted in (Scetbon
et al., 2021) indicate that APGA performs worse than PAM. We will discuss the reasons in
details later.
Our Contributions. There are mainly three issues with the PAM and APGA al-
gorithms in (Scetbon et al., 2021), and we will address all of them in this paper. Our
results may shed lights on designing new block coordinate descent algorithms. Our main
contributions are given below.
• The PAM algorithm in (Scetbon et al., 2021) only returns the dual variables. How to
find the primal solution of (2), i.e., the optimal transport plans π , was not discussed
in (Scetbon et al., 2021). In this paper, we propose a novel rounding procedure to find
the primal solution. Our rounding procedure is different from the one widely used in
the literature (Altschuler et al., 2017).

• We provide the first convergence analysis of the PAM algorithm, and analyze its
iteration complexity for finding an -optimal solution to the EOT problem (2). In
particular, we show that it takes at most O(N n2 −2 ) arithmetic operations to find
an -optimal solution to (2). This matches the rate of the Sinkhorn’s algorithm for
computing the Wasserstein distance (Dvurechensky et al., 2018).

• We propose a variant of PAM that incorporates the extrapolation technique as used


in Nesterov’s accelerated gradient method. We name this variant as PAM with Ex-
trapolation (PAME). The iteration complexity of PAME is also analyzed. Though
we are not able to prove a better complexity over PAM at this moment, we find that
PAME performs much better than PAM numerically.
Notation. For vectors a and b with the same dimension, a./b denotes their entry-wise
division. We denote c∞ := maxk kC k k∞ . Throughout this paper, we assume vector b > 0,
and we denote ι := minj log(bj ). We use 1n to denote the n-dimensional vector whose
entries are all equal to one. We use IX (x) to denote the indicator PN function of set X , i.e.,
IX (x) = 0 if x ∈ X , and IX (x) = ∞ otherwise. We denote c = c( k=1 π (f , g t , λt )). For
t k t+1

integer N > 0, we denote [N ] := {1, . . . , N }. We also denote π (f, g, λ) = [π k (f, g, λ)]k∈[N ] .

3
Huang and Ma and Lai

2. Projected Alternating Maximization Algorithm


The PAM algorithm proposed in (Scetbon et al., 2021) aims to solve the entropy regularized
EOT problem, which is given by
N
X
π , λ) :=
min max `η (π pkη (π k , λ) (4)
π ∈ΠN N
a,b λ∈∆+ k=1

k k k k k
where η > 0 is a regularization parameter, P pη (π , λ) := λk hπ , C i − ηH(π ), and the
entropy function H is defined as H(π) = − i,j πi,j (log πi,j −1). The entropy regularization
was first introduced into the OT problem by (Cuturi, 2013) and is now widely used in the
OT community. By adding an entropy regularizer, the primal problem becomes strongly
convex and the dual problem is unconstrained and is suitable for alternating maximization.
This leads to the Sinkhorn’s algorithm which has low per-iteration complexity and thus
is scalable. The PAM algorithm proposed by (Scetbon et al., 2021) used the same idea
for the EOT problem. Note that (4) is a strongly-convex-concave minimax problem whose
constraint sets are convex and bounded, and thus the Sion’s minimax theorem (Sion, 1958)
guarantees that
min max `η (π π , λ) = max min `η (π π , λ). (5)
π ∈ΠN N
a,b λ∈∆+ λ∈∆N N
+ π ∈Πa,b

Now we consider the dual problem of minπ ∈ΠN `η (π π , λ). First, we add a redundant
a,b
k
P
constraint k,i,j πi,j = 1 and consider the dual of

min π , λ).
`η (π (6)
π ∈ΠN k =1
P
a,b , k,i,j πi,j

The reason for adding this redundant constraint is to guarantee that the dual objective
function is Lipschitz smooth. It is easy to verify that the dual problem of (6) is given by
N
!! !!
X X X
k k > k > k >
max P min λk hπ , C i − ηH(ππ) + f a−r π +g b−c (π ) ,
f,g k =1,
πi,j
k,i,j k=1 k k
n×n N
π ∈(R+ )
(7)
k ). It is noted that problem (7)
P
π) =
where f and g are the dual variables and H(π k H(π
admits the following solution:

ζ k (f, g, λ)
π k (f, g, λ) = P k
, ∀k ∈ [N ], (8)
k kζ (f, g, λ)k1

where
f 1> > k
 
k n + 1n g − λ k C
ζ (f, g, λ) = exp , ∀k ∈ [N ]. (9)
η
By plugging (8) into (7), we obtain the following dual problem of (6):
N
!
X
max
n n
hf, ai + hg, bi − η log kζ k (f, g, λ)k1 − η. (10)
f ∈R , g∈R
k=1

4
On the Convergence of PAM for Equitable and Optimal Transport

Plugging (10) into (5), we know that the entropy regularized EOT problem (4) is equivalent
to a pure maximization problem:
N
!
X
max F (f, g, λ) := hf, ai + hg, bi − η log kζ k (f, g, λ)k1 − η. (11)
f ∈Rn , g∈Rn , λ∈∆N
+ k=1

Function F (f, g, λ) is a smooth concave function with three block variables (f, g, λ). We
use (f ∗ , g ∗ , λ∗ ) to denote an optimal solution of (11), and we denote F ∗ = F (f ∗ , g ∗ , λ∗ ).
The PAM algorithm proposed in (Scetbon et al., 2021) is essentially a block coordinate
descent (BCD) algorithm for solving (11). More specifically, the PAM updates the three
block variables by the following scheme:

f t+1 ∈ argmax F (f, g t , λt ), (12a)


f

g t+1 ∈ argmax F (f t+1 , g, λt ), (12b)


g

λt+1 := Proj∆N λt + τ ∇λ F (f t+1 , g t+1 , λt ) .



(12c)
+

Each iteration of PAM consists of two exact maximization steps followed by one projected
gradient step. Importantly, the two exact maximization problems (12a)-(12b) have numer-
ous optimal solutions, and we choose to use the following ones:
 
a
f t+1 = f t + η log  P  , (13)
N k (f t , g t , λt )
r k=1 ζ
 
b
g t+1 = g t + η log  P  . (14)
N k t+1 , g t , λt )
c k=1 ζ (f

Furthermore, the optimality conditions of (12a)-(12b) imply that


P 
N k (f t+1 , g t , λt )
r k=1 ζ
a− P = 0,
kζ k (f t+1 , g t , λt )k1
Pk  (15)
N k t+1 , g t+1 , λt )
c k=1 ζ (f
b− P k t+1 , g t+1 , λt )k
= 0, ∀t.
k kζ (f 1

However, we need to point out that the PAM (12) only returns the dual variables (f t , g t , λt ).
One can compute the primal variable π using (8), but it is not necessarily a feasible solution.
That is, π computed from (8) does not satisfy π ∈ ΠN a,b . How to obtain an optimal primal
solution from the dual variables was not discussed in (Scetbon et al., 2021). For the OT
problem, i.e., N = 1, a rounding procedure for returning a feasible primal solution has been
proposed in (Altschuler et al., 2017). However, this rounding procedure cannot be applied
to the EOT problem directly. In the next section, we propose a new rounding procedure
for returning a primal solution based on the dual solution (f t , g t , λt ). This new rounding
procedure involves a dedicated way to compute the margins.

5
Huang and Ma and Lai

2.1 The Rounding Procedure and the Margins


Given a ∈ ∆n+ , b ∈ ∆n+ , and π = {π k }k∈[N ] satisfying r( k π k ) = a, we construct vectors
P

ak , bk ∈ Rn , k ∈ [N ] from the procedure

(ak , bk )k∈[N ] = Margins(π


π , a, b). (16)

The details
PN of this procedure are given below. First, we set ak = r(π k ), which immediately
implies k=1 a = a. We then construct bk such that the following properties hold (these
k

properties are required in our convergence analysis later):

(i) bk ≥ 0;
PN k
(ii) k=1 b = b;
Pn k
Pn k
(iii) i=1 ai = j=1 bj , ∀k ∈ [N ];

(iv) For any fixed j ∈ [n], the quantities bkj − [c(π k )]j have the same sign for all k ∈ [N ].
That is, for any k and k 0 , we have
0 0
(bkj − [c(π k )]j ) · (bkj − [c(π k )]j ) ≥ 0, (17)

which provides the following identity that is useful in our convergence analysis later:
N
X
kbk − c(π k )k1
k=1
XN Xn n X
X N
= |bkj k
− [c(π )]j | = (bkj − [c(π k )]j ) (18)
k=1 j=1 j=1 k=1
n N N
" !# !
X X X
k k
= bj − c π = b−c π .
j=1 k=1 j k=1 1

The procedure on constructing (bk )k∈[N ] satisfying these four properties is provided in Ap-
pendix ??.
After (ak , bk )k∈[N ] are constructed from (16) with π = π (f T , g T −1 , λT −1 ), we adopt the
rounding procedure proposed in (Altschuler et al., 2017) to output a primal feasible solution
(π̂ k )k∈[N ] . The rounding procedure is described in Algorithm 2.
With this new procedure for rounding and computing the margins ak , bk , we now for-
mally describe our PAM algorithm in Algorithm 1. Note that the algorithm is terminated
when the following criteria are met:

kct−1 − bk1 ≤ /(6(6c∞ − ηι)), (19a)


t t−1
λ −λ 2
≤ η/(18c2∞ ), (19b)
F̃ (f t , g t−1 , λt−1 ) ≤ /6, (19c)

where F̃ (f, g, λ) is the suboptimality defined as: F̃ (f, g, λ) = F (f ∗ , g ∗ , λ∗ ) − F (f, g, λ).

6
On the Convergence of PAM for Equitable and Optimal Transport

Algorithm 1 Projected Alternating Maximization Algorithm


1: Input: Cost matrices {C k }1≤k≤N , vectors a, b ∈ ∆n+ with b > 0, accuracy . f 0 = g 0 =
[1, ..., 1]> , λ0 = [1/N, ..., 1/N ]> ∈ ∆N+. t = 0
2: Choose parameters as
 

η = min , c∞ , τ = η/c2∞ . (20)
3(log(n2 N ) + 1)

3: while (19) is not met do


4: Compute f t+1 by (13)
5: Compute g t+1 by (14)
6: Compute λt+1 by (12c)
7: t←t+1
8: end while
9: Assume stopping condition (19) is satisfied at the T -th iteration. Compute
k k T
π (f , g
(a , b )k∈[N ] = Margins(π T −1 ,λT −1 ), a, b) as in Section 2.1.
10: Output: (π̂, λ̂) where π̂ = Round(π (f , g T −1 , λT −1 ), ak , bk ), ∀k ∈ [N ], λ̂ = λT −1 .
k k T

Algorithm 2 Round(π, a, b)
1: Input: π ∈ Rn×n , a ∈ Rn+ , b ∈ Rn+ .
ai
2: X = Diag (x) with xi = r(π) i
∧1
3: 0
π = Xπ
b
4: Y = Diag (y) with yj = c(πj0 )j ∧ 1
5: π 00 = π 0 Y
6: erra = a − r(π 00 ), errb = b − c(π 00 )
7: Output: π 00 + erra errb> /kerra k1 .

2.2 Connections with BCD and BCGD Methods


The block coordinate descent (BCD) algorithm minimizes a multivariate objective function
by iteratively updating subsets of variables while keeping the rest fixed. The convergence of
BCD has been studied in many literatures (Beck and Tetruashvili, 2013; Diakonikolas and
Orecchia, 2018; Sun and Hong, 2015). We now discuss the connections between PAM and
the block coordinate descent method and the block coordinate gradient descent (BCGD)
method. For the ease of presentation, we now assume that we are dealing with the following
general convex optimization problem with m block variables:

min J(x1 , x2 , . . . , xm ), (21)


xi ∈Xi ,i=1,...,m

where Xi ⊂ Rdi and J is convex and differentiable. The BCD method for solving (21)
iterates as follows:

xt+1
i = argmin J(xt+1 t+1 t+1 t t
1 , x2 , . . . , xi−1 , xi , xi+1 , . . . , xm ), (22)
xi ∈Xi

7
Huang and Ma and Lai

and it assumes that these subproblems are easy to solve. The BCGD method for solving
(21) iterates as follows:
1
xt+1
i = argmin h∇xi J(xt+1 t+1 t t t
1 , . . . , xi−1 , xi , xi+1 , . . . , xm ), xi − xi i + kxi − xti k22 , (23)
xi ∈Xi 2τ
where τ > 0 is the step size. The PAM (12) is a hybrid of BCD (22) and BCGD (23), in
the sense that some block variables are updated by exactly solving a maximization problem
(the f and g steps), and some other block variables are updated by taking a gradient step
(the λ step). Though this hybrid idea has been studied in the literature (Hong et al., 2017;
Xu and Yin, 2013), their convergence analysis requires the blocks corresponding to exact
minimization to be strongly convex. However, in our problem (11), the negative of the
objective function is merely convex. Hence we need to develop new convergence proofs to
analyze the convergence of PAM (Algorithm 1). How to extend our convergence results of
PAM (Algorithm 1) to more general settings is a very interesting topic for future study.

3. Convergence Analysis of PAM


In this section, we analyze the iteration complexity of Algorithm 1 for obtaining an -
optimal solution to the original EOT problem (2). The -optimal solution to (2) is defined
as follows.
Definition 2 (see, e.g., (Nemirovski, 2005)) We call (π̂ π , λ̂) ∈ ΠN N
a,b × ∆+ an -optimal
solution to the EOT problem (2) if the following inequality holds:
π , λ) − min `(π
max `(π̂ π , λ̂) ≤ .
λ∈∆N
+ π ∈ΠN
a,b

Note that the left hand side of the inequality is the duality gap of (2).

3.1 Technical Preparations


We first give the partial gradients of F .
" !#
k )/η)
P
k,j exp((f i + g j − λ k C ij
X
[∇f F (f, g, λ)]i = ai − P k (f, g, λ))k
= ai − r π k (f, g, λ) ,
k kζ 1
k i
(24a)
" !#
k )/η)
P
k,i exp((fi + gj − λk Cij X
[∇g F (f, g, λ)]j = bj − P k
= bj − c π k (f, g, λ) ,
k kζ (f, g, λ))k1 k j
(24b)
k exp((f + g − λ C k )/η)
P
Cij i,j i j k ij
[∇λ F (f, g, λ)]k = P k
= hπ k (f, g, λ), C k i. (24c)
k kζ (f, g, λ))k1

Since (13) and (14) renormalize the row sum and column sum of k ζ k (f, g, λ) to be a and
P
b, we immediately have
N
X N
X
kζ k (f t+1 , g t , λt )k1 = 1, kζ k (f t+1 , g t+1 , λt )k1 = 1, ∀t, (25)
k=1 k=1

8
On the Convergence of PAM for Equitable and Optimal Transport

which, combined with (8), yields


π k (f t+1 , g t , λt ) = ζ k (f t+1 , g t , λt ), π k (f t+1 , g t+1 , λt ) = ζ k (f t+1 , g t+1 , λt ), ∀t. (26)
The following lemma gives an error bound for Algorithm 2 (see (Altschuler et al., 2017)).
Lemma 3 (Rounding Error) Let a, b ∈ Rn+ with ni=1 ai = nj=1 bj = q, π ∈ R+ n×n
P P
, and
π̂ = Round(π, a, b). The following inequality holds:
kπ̂ − πk1 ≤ 2(kr(π) − ak1 + kc(π) − bk1 ).
Proof. This a directly result of (Altschuler et al., 2017)[Lemma 7]. 

The following corollary follows (Scetbon et al., 2021)[Proposition 12] and shows that
∇λ F is Lipschitz continuous.
Corollary 4 For any f, g ∈ Rn and λ1 , λ2 ∈ ∆N
+ , the following inequality holds

k∇λ F (f, g, λ1 ) − ∇λ F (f, g, λ2 )k2 ≤ c2∞ kλ1 − λ2 k2 /η, (27)


which immediately implies
c2∞ 1
F (f, g, λ1 ) ≥ F (f, g, λ2 ) + h∇λ F (f, g, λ2 ), λ1 − λ2 i − kλ − λ2 k22 . (28)

The next corollary is an extension of Dvurechensky et al. (2018)[Lemma 1] and gives a
bound for g.
Corollary 5 Let (f t , g t , λt ) be the sequence generated by Algorithm 1. For any t ≥ 0, it
holds that
max gjt − min gjt ≤ c∞ − ηι, (29a)
j j

max gj∗ − min gj∗ ≤ c∞ − ηι. (29b)


j j

Lemma 6 Let {f t , g t , λt } be generated by PAM (Algorithm 1). The following equality holds.
N
X
kπ k (f t+1 , g t+1 , λt ) − π k (f t+1 , g t , λt )k1 = ct − b 1
, ∀t.
k

Proof. By (26), we have


N
X
kπ k (f t+1 , g t+1 , λt ) − π k (f t+1 , g t , λt )k1
k
N
t+1
+gjt+1 −λtk Ci,j
k /η t+1
+gjt −λtk Ci,j
k /η
|e(fi ) − e(fi )
XX
= |
k i,j
N X
X X
= [π k (f t+1 , g t , λt )]i,j bj /ctj − 1 = ctj |bj /ctj − 1| = ct − b 1
.
k i,j j

9
Huang and Ma and Lai

3.2 Key Lemmas


In this subsection, we provide a few useful lemmas that will lead to our main theorem on the
iteration complexity of PAM (Algorithm 1). These lemmas yield the following results: the
function F is monotonically increasing (Lemmas 7), the suboptimality of the dual problem
can be upper bounded (Lemma 8-10), and the PAM returns an -optimal solution under
conditions (43) (Lemma 11). In Theorem 12 we will show that these conditions can indeed
be satisfied.

Lemma 7 [Increase of F ] Let {f t , g t , λt } be generated by PAM (Algorithm 1). The follow-


ing inequalities hold:
F (f t+1 , g t , λt ) − F (f t , g t , λt ) ≥ 0 (30a)
η t 2
F (f t+1 , g t+1 , λt ) − F (f t+1 , g t , λt ) ≥ c −b 1 (30b)
2
F (f t+1 , g t+1 , λt+1 ) − F (f t+1 , g t+1 , λt ) ≥ c2∞ kλt+1 − λt k2 /(2η). (30c)

Proof.
First, (30a) is a direct consequence of (12a).
Next, we prove (30b). We have
F (f t+1 , g t+1 , λt ) − F (f t+1 , g t , λt )
N N
! !
X X
t+1 t k t+1 t+1 t k t+1 t t
= hg − g , bi − η log kζ (f ,g , λ )k1 + η log kζ (f , g , λ )k1
k=1 k=1
n
t+1 t
X η t
= hg − g , bi = η bj log(bj /ctj ) = ηK(b||ct ) ≥ kc − bk21 ,
2
j=1

where K(x||y) denotes the KL divergence of x and y, the second equality is due to (25), the
third equality is due to (14), and the last inequality follows the Pinsker’s inequality.
Finally, we prove (30c). From the optimality condition of (12c), we know that there
exists
h(λt+1 ) ∈ ∂I∆+ (λt+1 ) (31)
N

such that
1
∇λ F (f t+1 , g t+1 , λt ) − (λt+1 − λt ) − h(λt+1 ) = 0. (32)
τ
From (28) we have
F (f t+1 , g t+1 , λt+1 ) − F (f t+1 , g t+1 , λt )
c2∞ t+1
≥h∇λ F (f t+1 , g t+1 , λt ), λt+1 − λt i −
kλ − λ t k2

1 c2
=h (λt+1 − λt ) + h(λt+1 ), λt+1 − λt i − ∞ kλt+1 − λt k2
τ 2η
1 c2
≥h (λt+1 − λt ), λt+1 − λt i − ∞ kλt+1 − λt k2
τ 2η
2 t+1 t 2
=c∞ kλ − λ k /(2η),

10
On the Convergence of PAM for Equitable and Optimal Transport

where the first equality is due to (32), the second inequality is due to (31), and the last
equality is due to the definition of τ in (20).


Before we bound the suboptimality gap, we need the following lemma.

Lemma 8 Let {f t , g t , λt } be generated by PAM (Algorithm 1). For any λ ∈ ∆+


N , the
following inequality holds:

λ − λt , ∇λ F (f t+1 , g t , λt ) ≤ 3c2∞ kλt+1 − λt k2 /η + c∞ ct − b 1


. (33)

Proof. The optimality condition of (12c) is given by:

1
hλ − λt+1 , (λt+1 − λt ) − ∇λ F (f t+1 , g t+1 , λt )i ≥ 0, ∀λ ∈ ∆+
N,
(34)
τ
which implies that

hλt+1 − λ, −∇λ F (f t+1 , g t+1 , λt )i


1 1 (35)
≤ hλ − λt+1 , (λt+1 − λt )i ≤ kλ − λt+1 k2 kλt+1 − λt k2 ≤ 2c2∞ kλt+1 − λt k2 /η,
τ τ

where the last inequality is due to the fact that the diameter of ∆+
N is bounded by 2 ≤ 2.
Moreover, we have

hλt − λ, ∇λ F (f t+1 , g t+1 , λt ) − ∇λ F (f t+1 , g t , λt )i


N
X
= (λtk − λk ) · hπ k (f t+1 , g t+1 , λt ) − π k (f t+1 , g t , λt ), C k i
k
(36)
N
X
k t+1 t+1 t k t+1 t t k
≤ kπ (f ,g , λ ) − π (f , g , λ )k1 kC k∞
k
≤ c∞ kct − bk1 ,

where the equality is due to (24c), and the last inequality is due to Lemma 6. Finally, we
have

λt − λ, −∇λ F (f t+1 , g t , λt )
= λt − λt+1 , −∇λ F (f t+1 , g t+1 , λt ) + λt+1 − λ, −∇λ F (f t+1 , g t+1 , λt ) +
(37)
λt − λ, ∇λ F (f t+1 , g t+1 , λt ) − ∇λ F (f t+1 , g t , λt )
≤kλt − λt+1 k2 · k∇λ F (f t+1 , g t+1 , λt )k2 + 2c2∞ kλt+1 − λt k2 /η + c∞ kct − bk1 ,

where the first inequality is due to (35) and (36). From (24c) we have k∇λ F (f t+1 , g t+1 , λt )k2 ≤
c∞ , which, combined with (37) and the fact that η ≤ c∞ , yields the desired result. 

The suboptimality of (11) is defined as: F̃ (f, g, λ) = F (f ∗ , g ∗ , λ∗ ) − F (f, g, λ). Note


that F̃ (f, g, λ) ≥ 0, ∀f, g, λ ∈ ∆+
N.

11
Huang and Ma and Lai

Lemma 9 Let (f t , g t , λt ) be generated by PAM (Algorithm 1). The following inequality


holds:
F̃ (f t+1 , g t , λt ) ≤ (2c∞ − ηι)kct − bk1 + 3c2∞ kλt+1 − λt k2 /η.

Proof. Denote ut = (maxj gjt + minj gjt )/2, u∗ = (maxj gj∗ + minj gj∗ )/2. From (26) we get

n
X n
X
t
h1, c − bi = ai − bj = 0,
i=1 j=1

which further implies

g t − g ∗ , ct − b = (g t − ut 1) − (g ∗ − u∗ 1), ct − b
(38)
≤ kg t − ut 1k∞ + kg ∗ − u∗ 1k∞ ct − b ≤ (c∞ − ηι) ct − b

1 1
,

where the last inequality is due to Corollary 5. Now we set λ = λ∗ in (33), and we obtain

λt − λ∗ , −∇λ F (f t+1 , g t , λt ) ≤ 3c2∞ kλt+1 − λt k2 /η + c∞ ct − b 1


. (39)

Since F (f, g, λ) is a concave function, we have

F (f ∗ , g ∗ , λ∗ ) ≤ F (f t+1 , g t , λt ) + h∇F (f t+1 , g t , λt ), (f ∗ , g ∗ , λ∗ ) − (f t+1 , g t , λt )i,

which, combining with (24) yields

F̃ (f t+1 , g t , λt ) = F (f ∗ , g ∗ , λ∗ ) − F (f t+1 , g t , λt )
≤ hf t+1 − f ∗ , r( N k t+1 , g t , λt )) − ai + g t − g ∗ , ct − b
P
k=1 π (f
+ λt − λ∗ , −∇λ F (f t+1 , g t , λt )
≤ (2c∞ − ηι)kct − bk1 + 3c2∞ kλt+1 − λt k2 /η,

where the last inequality follows from (15), (26), (38) and (39). 

The next lemma shows that the suboptimality gap F̃ (f, g, λ) can be bounded by O(1/t).

Lemma 10 Let (f t , g t , λt ) be generated by PAM (Algorithm 1). The following inequality


holds:
4/(ηγ0 )
F̃ (f t+1 , g t+1 , λt+1 ) ≤ ,
t + 1 + 4/(ηγ0 F̃ (f 0 , g 0 , λ0 ))
n o
where γ0 = min (2c 1−ηι)2 , 9c12 is a constant.
∞ ∞

Proof. Combining (30b) and (30c), we have


η t 2 2
F (f t+1 , g t+1 , λt+1 ) − F (f t+1 , g t , λt ) ≥ c −b 1
+ c2∞ λt+1 − λt 2
/(2η). (40)
2

12
On the Convergence of PAM for Equitable and Optimal Transport

Therefore, we have
F̃ (f t+1 , g t , λt ) − F̃ (f t+1 , g t+1 , λt+1 )
η t 2 2
≤− c − b 1 − c2∞ λt+1 − λt 2 /(2η)
2
η
≤ − γ0 · ((2c∞ − ηι)kct − bk1 )2 + (3c2∞ kλt+1 − λt k2 /η)2

2 (41)
η 2
≤ − γ0 (2c∞ − ηι)kct − bk1 + 3c2∞ kλt+1 − λt k2 /η
4
η
≤ − γ0 F̃ (f t+1 , g t , λt )2 ,
4
where the last inequality is from Lemma 9. Dividing both sides of (41) by F̃ (f t+1 , g t+1 , λt+1 )·
F̃ (f t+1 , g t , λt ), we have

1 1 η F̃ (f t+1 , g t , λt )
≥ + γ0 ·
F̃ (f t+1 , g t+1 , λt+1 ) F̃ (f t+1 , g t , λt ) 4 F̃ (f t+1 , g t+1 , λt+1 )
(42)
1 η 1 η
≥ + γ0 ≥ + γ0 ,
F̃ (f , g , λ ) 4
t+1 t t F̃ (f , g , λ ) 4
t t t

where the second inequality is due to (41) and the last inequality is from (30a). Summing
(42) from 0 to t leads to
1 1 η(t + 1)
≥ + γ0 ,
F̃ (f t+1 , g t+1 , λt+1 ) F̃ (f 0 , g 0 , λ0 ) 4
which implies the desired result. 

The next lemma gives sufficient conditions for the PAM algorithm to return an -optimal
solution to the original EOT problem (2).

Lemma 11 Assume PAM terminates at the T -iteration, i.e.,

kcT −1 − bk1 ≤ /(6(6c∞ − ηι)), (43a)


λT − λT −1 2
≤ η/(18c2∞ ), (43b)
T T −1 T −1
F̃ (f , g ,λ ) ≤ /6. (43c)

Then the output (π̂, λ̂) of PAM (Algorithm 1), i.e.,

π̂ k = Round(π k (f T , g T −1 , λT −1 ), ak , bk ), ∀k ∈ [N ], λ̂ = λT −1 ,

is an -optimal solution of the original EOT problem (2).


+
π , λ̂) ∈ ΠN
Proof. According to Definition 2, it is sufficient to show that the output (π̂ a,b × ∆N
satisfies the following two inequalities:

max ` (π̂ π , λ) − `(π̂
π , λ̂) ≤ , (44a)
λ∈∆N + 2

π , λ̂) − min `(π
`(π̂ π , λ̂) ≤ . (44b)
N
π ∈Πa,b 2

13
Huang and Ma and Lai

We prove (44a) first. For ease of presentation, we denote π̃ π = π (f T , g T −1 , λT −1 ), π ∗ =


∗ ∗ ∗ k k k k
π (f , g , λ ). Note that π̂ = Round(π̃ , a , b ), ∀k ∈ [N ]. We also denote

N
( )
X
π ) := argmax `(π
λ̄(π π , λ) = λk hπ k , C k i . (45)
λ∈∆+
N k=1

Note that the term on the left hand side of (44a) can be rewritten as

  
` π̂ π ) − ` π̂
π , λ̄(π̂ π , λ̂
π )) − `(π̃
π , λ̄(π̂
= (`(π̂ π , λ̄(π̃
π ))) + ([`(π̃ π )) − ηH(π̃
π , λ̄(π̃ π ∗ , λ∗ ) − ηH(π
π )] − [`(π π ∗ )])
| {z } | {z }
(I) (II) (46)
π ∗ , λ∗ ) − ηH(π
+ ([`(π π ∗ )] − [`(π̃
π , λ̂) − ηH(π̃ π , λ̂) − `(π̂
π )]) + (`(π̃ π , λ̂)) .
| {z } | {z }
(III) (IV )

We now provide upper bounds for these four terms. Denote

k̂ ∗ = argmaxhπ̂ k , C k i, k̃ ∗ = argmaxhπ̃ k , C k i. (47)


k∈[N ] k∈[N ]

Since (1) and (2) are equivalent, we have the following for the term (I):

∗ ∗ ∗ ∗
X X
(I) = π )]k hπ̂ k , C k i −
[λ̄(π̂ π )]k hπ̃ k , C k i = hπ̂ k̂ , C k̂ i − hπ̃ k̃ , C k̃ i
[λ̄(π̃
k k
k̂∗ k̂∗ k̂∗ k̂∗ ∗ ∗
X
≤hπ̂ , C i − hπ̃ , C i ≤ kπ̂ k̂ − π̃ k̂ k1 kC k k∞ ≤ c∞ kπ̂ k − π̃ k k1 (48)
k
X
≤2c∞ (kr(π̃ k ) − ak k1 + kc(π̃ k ) − bk k1 ) = 2c∞ kcT −1 − bk1 ,
k

where the first inequality follows from the definition of k̃ ∗ in (47), the fourth inequality is
from Lemma 3, and the last equality follows from (15) and (17).
For the term (II), recall that

!
X
k k k f T 1> + 1(g T −1 )> − λTk −1 C k
π) = −
H(π πi,j (log πi,j − 1) and π̃ = exp
η
k,i,j

14
On the Convergence of PAM for Equitable and Optimal Transport

maxj gjT −1 +minj gjT −1


due to (26), and define uT −1 = 2 . We have

fiT + gjT −1 − λTk −1 Cij


!
X X k
(II) = π )k hπ̃ k , C k i + η
λ̄(π̃ k
π̃i,j −1 − F∗
η
k k,i,j
X X  
= π )k − λ̂k )hπ̃ k , C k i +
(λ̄(π̃ k
π̃i,j fiT + gjT −1 − η − F ∗
k k,i,j
D E
π ) − λ̂, ∇λ F (f T , g T −1 , λT −1 ) + f T , a + g T −1 , cT −1 − η i,j,k π̃i,j
k − F∗
P
= λ̄(π̃
!
D E X
T −1 T −1 T −1 T −1
= λ̄(π̃ T
π ) − λ̂, ∇λ F (f , g ,λ T
) + f ,a + g ,c − log kπ̃ k1 − η − F ∗
k

k
D E
T −1 T −1 T −1 T −1 T −1
π ) − λ̂, ∇λ F (f , g
= λ̄(π̃ T
,λ ) + hg ,c − bi + F (f , g T
, λT −1 ) − F ∗
D E
π ) − λ̂, ∇λ F (f T , g T −1 , λT −1 ) + hg T −1 , cT −1 − bi
≤ λ̄(π̃
≤c∞ kcT −1 − bk1 + 3c2∞ kλT − λT −1 k2 /η + hg T −1 − uT −1 1, cT −1 − bi
≤c∞ kcT −1 − bk1 + 3c2∞ kλT − λT −1 k2 /η + kg T −1 − uT −1 1k∞ kcT −1 − bk1
≤(3c∞ /2 − ηι/2)kcT −1 − bk1 + 3c2∞ kλT − λT −1 k2 /η,
(49)
where the third equality uses (26), (24c) and (15), the second inequality follows from Lemma
π ) and t = T − 1, and the last inequality uses Corollary 5.
8 by setting λ = λ̄(π̃
For the term (III), we have

fiT + gjT −1 − λTk −1 C k


!
X X
(III) ≤ k k
λ̂k hπ̃ , C i + η k
π̃i,j − 1 − F∗
η
k i,j
(50)
=|hg T −1 , cT −1 − bi + F (f T , g T −1 , λT −1 ) − F ∗ |
≤(c∞ /2 − ηι/2)kcT −1 − bk1 + |F (f T , g T −1 , λT −1 ) − F ∗ |,
where the last inequality follows from Corollary 5.
Finally, for the term (IV), we have
X X
(IV ) = hπ̃ k − π̂ k , λ̂k C k i ≤ kπ̃ k − π̂ k k1 kC k k∞
k k
X (51)
≤2c∞ (kr(π̃ k ) − ak k1 + kc(π̃ k ) − bk k1 ) = 2c∞ kcT −1 − bk1 ,
k

where the first inequality uses |λ̂k | ≤ 1, the second inequality uses Lemma 3 and (17).
Plugging (48) - (51) into (46), and using (43), we obtain (44a).
Now we prove (44b). For ease of presentation, we denote
π (λ) := argmin `(π
π̄ π , λ). (52)
π ∈ΠN
a,b

We also denote b̃ = cT −1 = c π̃ k and π 0k = Round(π̄(λ̂)k , ãk , b̃k ), where


P 
k

(ãk , b̃k )k∈[N ] := Margins(π̄


π (λ̂), a, b̃),

15
Huang and Ma and Lai

as defined in (16). From (18) we know that

X   X   X
c (π̄(λ̂))k − b̃k = c (π̄(λ̂))k − b̃k = kb − b̃k1 = kb − cT −1 k1 , (53)
1
k k k 1

π (λ̂) ∈ ΠN π (λ̂))k ) = b, and the fact


P
where the second equality is due to π̄ a,b and thus c( k (π̄
P k
that k b̃ = b̃ due to Property (ii) of the Margins procedure in Section 2.1. By the
Sinkhorn’s theorem (Sinkhorn, 1967), π̃ is the unique optimal solution of minπ ∈ΠN `η (ππ , λ̂).
a,b̃
Therefore X X
λ̂k hπ̃ k , C k i − ηH(π̃
π̃) ≤
π̃ λ̂k hπ 0k , C k i − ηH(π
π0 ). (54)
k k

Now, note that the left hand side of (44b) can be arranged into three parts:

π̂, λ̂) − `(π̄


π̂
`(π̂ π (λ̂), λ̂)
! !
X X X X
= λ̂k hπ̂ k , C k i − λ̂k hπ̃ k , C k i + λ̂k hπ̃ k , C k i − λ̂k hπ 0k , C k i
k k k k
| {z } | {z }
(V ) (V I) (55)
!
X X
+ λ̂k hπ 0k , C k i − λ̂k h(π̄(λ̂))k , C k i .
k
| {zk }
(V II)

We now upper bound these three terms. First note that the term (V) is the same as the
term (IV) and thus has the same upper bound in (51). Since 0 ≤ H(ππ ) ≤ log(n2 N ) + 1,
from (54) we have that
X X 1
(V I) = λ̂k hπ̃ k , C k i − λ̂k hπ 0k , C k i ≤ η H(π̃ π0 ) ≤ ,
π ) − H(π (56)
3
k k

where the last step uses the definition of η in (20).


For the term (VII), we have
X X X
(V II) = λ̂k hπ 0k , C k i − λ̂k h(π̄(λ̂))k , C k i ≤ kπ 0k − (π̄(λ̂))k k1 kC k k∞
k k k
X (57)
≤2c∞ (kr((π̄(λ̂)) ) − ã k1 + kc((π̄(λ̂)) ) − b̃k k1 ) = 2c∞ kcT −1 − bk1 ,
k k k

where the second inequality follows from Lemma 3, the second equality uses (53) and the
fact that r((π̄(λ̂))k ) = ãk due to the property of the Margins procedure in (16).
Finally, plugging (51) (note (V)=(IV)), (56) and (57) into (55), and using (43a) and
noting ι < 0, we obtain (44b). This completes the proof.


16
On the Convergence of PAM for Equitable and Optimal Transport

3.3 Main Result


We now present our main theorem, which gives the iteration complexity of PAM such that
(43) is satisfied, and as a result of Lemma 11, an -optimal solution to the original EOT
problem (2) is obtained.

Theorem 12 Define 0 = /(6c∞ − ηι), and set T as

36 648c2∞ 28
= O c2∞ −2 ,

T =5+ √ 0 + + (58)
η γ0  η ηγ0 
n o
where γ0 = min 1
−ηι)
(2c∞
1
2 , 9c2 is a constant and we know γ0 = O(c−2∞ ). The output pair

of Algorithm 1 is an -optimal solution of the EOT problem (2).

Proof. According to Lemma 11, we only need to show that (43) holds after T iterations
as defined in (58). To guarantee (43a) and (43b), we follow the ideas of Dvurechensky et
al. (Dvurechensky et al., 2018) and construct a switching process. We first reduce F̃ from
F̃ (f 0 , g 0 , λ0 ) to a constant s by running t1 steps. In this process, Lemma 10 indicates
4 4
t1 ≤ 1 + − . (59)
ηγ0 s ηγ0 F̃ (f 0 , g 0 , λ0 )

Secondly, starting from s, we continue running the algorithm, and assume that there are t2
iterations in which (43a) fails. By (30b) we have
72s
t2 ≤ 1 + .
η02
Therefore, we know that the total iteration number that (43a) fails is upper bounded by
72s 4 4
T1 = t1 + t2 ≤ 2 + 02
+ −
η ηγ0 s ηγ0 F̃ (f 0 , g 0 , λ0 )

iterations. By choosing s = 0 /(6 γ0 ), we know that
0
2 + η√12 √24 4 √36
(
γ0 0 + η γ0 0 − ηγ0 F̃ (f 0 ,g 0 ,λ0 ) ≤ 2 + η γ0 0 if F̃ (f 0 , g 0 , λ0 ) ≥ √
6 γ0
T1 ≤
2 + η√12 √24 4
γ0 0 + η γ0 0 − ηγ F̃ (f 0 ,g 0 ,λ0 ) ≤ 2 +
√12
η γ0 0 otherwise.
0

Therefore, we have T1 ≤ 2 + η√36γ0 0 . Similarly, starting from s, the number of iterations


that (43b) fails can be bounded by

648sc2∞
t3 ≤ 1 + ,
η2
where we apply (30b). By choosing s = , we know that the total iteration number that
(43b) fails is upper bounded by

648c2∞ 4 4
T2 = t1 + t3 ≤ 2 + + −
η ηγ0  ηγ0 F̃ (f 0 , g 0 , λ0 )

17
Huang and Ma and Lai

iterations. Finally, by letting s = /6 in (59), we know that

F̃ (f T3 −1 , g T3 −1 , λT3 −1 ) ≤ /6

after
24
T3 = 1 +
ηγ0 
iterations. From (30a) we know that after T3 iterations, we have

F̃ (f T3 , g T3 −1 , λT3 −1 ) ≤ F̃ (f T3 −1 , g T3 −1 , λT3 −1 ) ≤ /6,

i.e., (43c) holds. Combining the above discussions, we know that after T = T1 + T2 + T3 + 1
iterations, there must exist at least one iteration such that (43) holds, and thus the output
of PAM is an -optimal solution to the original EOT problem (2).


Remark 13 Though our complexity result matches the rate of the Sinkhorn’s algorithm
in terms of the dependence on , we argue that EOT is a more difficult problem than the
entropic regularized OT, and thus our results are promising. First, EOT is a saddle-point
problem while entropic regularized OT is a minimization problem. Second, the extra vari-
able λ in EOT requires a gradient projection step in the PAM algorithm, which introduces
significant difficulty to the analysis of the convergence behavior. While for Sinkhorn’s algo-
rithm it is much easier to analyze, because the dual is unconstrained. Third, since there are
multiple agents in EOT, it is more difficult to design the rounding procedure to obtain the
primal solution. We also note that the dependence of c∞ in our result and in the result of
Sinkhorn’s algorithm (Dvurechensky et al., 2018) are both c2∞ .

4. Projected Alternating Maximization with Extrapolation


In this section, we discuss how to accelerate the PAM algorithm (Algorithm 1). It can be
shown that the gradient of F in (11) is Lipschitz continuous1 . Therefore, Scetbon et al.
(Scetbon et al., 2021) proposed to adopt Nesterov’s accelerated gradient method (Nesterov,
2004) to solve (11). Their algorithm, named APGA (Accelerated Projected Gradient Ascent
algorithm), iterates as follows:
  
> t−1 t−1 t−1 > t−2 
(v, w, z) ← (f , g , λ ) + · (f t−1 , g t−1 , λt−1 )> − (f t−2 , g t−2 , λt−2 )>
t+1
1
(f t , g t )> ← (v, w)> + ∇(f,g) F (v, w, z) (60a)
 L 
1
(λt )> ← Proj∆+ z + ∇λ F (v, w, z) , (60b)
N L

where L is the Lipschitz constant of ∇F . Note that APGA treats the problem (11) as a
generic convex and smooth problem, and does not take advantage of the special structures
of (11). In particular, f and g are updated using gradient ascent steps. This is in contrast
1. In Lemma 4 we proved that ∇λ F is Lipschitz continuous. The Lipschitz continuity of ∇f F and ∇g F
can be proved similarly.

18
On the Convergence of PAM for Equitable and Optimal Transport

Algorithm 3 Projected Alternating Maximization with Extrapolation Algorithm


1: Input: Cost matrices {C k }1≤k≤N , accuracy , θ ∈ (0, 1).
+
2: Initialization: f 0 = g 0 = [1, ..., 1]> , λ0 = [1/N, ..., 1/N ]> ∈ ∆N .
3: Choose parameters as
 
 η
η = min , c∞ , τ = 2 . (61)
3(log(n2 N ) + 1) 2c∞

4: while (63) is not met do


5: Compute f t+1 by (62a)
6: Compute g t+1 by (62b)
7: Compute y t+1 by (62c)
8: Compute λt+1 by (62d)
9: t←t+1
10: end while
11: Assume the stop condition (63) is satisfied at the T -th iteration. Compute
π (f T , g T −1 , λT −1 ), a, b) as in Section 2.1.
(ak , bk )k∈[N ] = Margins(π
12: Output: (π̂, λ̂) where π̂ k = Round(π k (f T , g T −1 , λT −1 ), ak , bk ), ∀k ∈ [N ], λ̂ = λT −1 .

to PAM in which f and g are obtained by exact maximizations, which is expected to


improve the function value of F more significantly. In the following, we will design an
accelerated algorithm that utilizes this property. Our method is called PAME (PAM with
Extrapolation) and it incorporates the extrapolation technique to the gradient step for
updating λ, and f and g are still updated using exact maximizations. We note that currently
we are not able to prove a better complexity for PAME. Our iteration complexity result
in Theorem 19 is in the same order as that of PAM, but numerically we have observed
great improvement of PAME over PAM. It is an interesting future topic to study other
accelerations to PAM that can provably achieve improved complexity.
A typical iteration of our PAME algorithm is given below:
 
a
f t+1 = f t + η log P k t t t , (62a)
r ( k ζ (f , g , λ ))
 
t+1 t b
g = g + η log , (62b)
c ( k ζ k (f t+1 , g t , λt ))
P

y t+1 = Proj∆+ λt + (1 − θ)(λt − λt−1 ) ,



(62c)
N

λt+1 = Proj∆+ y t+1 + τ ∇λ F (f t+1 , g t+1 , y t+1 ) .



(62d)
N

Here θ ∈ (0, 1) is a given parameter for the extrapolation step. We see that steps (62a)-
(62b) are the same as (13)-(14) and they are solutions to the exact maximizations (12a)-
(12b). Steps (62c)-(62c) give extrapolation to the gradient step for λ, similar to Nesterov’s
accelerated gradient method. Note that PAME (62) solves the dual entropy-regularized
EOT problem (11). We use the same rounding procedure in Section 2.1 to generate a primal
solution to the original EOT problem (1). The complete PAME algorithm is described in

19
Huang and Ma and Lai

Algorithm 3. Note that the algorithm is terminated when the following criteria are met:

kct−1 − bk1 ≤ /(6(6c∞ − ηι)), (63a)


t−1 t−2
kλ −λ k2 ≤ η/(60(1 − θ)c2∞ ), (63b)
t t
kλ − y k2 ≤ η/(42c2∞ ), (63c)
t t−1 t−1
F̃ (f , g ,λ ) ≤ /6. (63d)

5. Convergence Analysis of PAME Algorithm


In this section, we analyze the iteration complexity of PAME (Algorithm 3) for obtaining
an -optimal solution to the original EOT problem (1). The proof for PAME is different
from that of PAM, and here we need to analyze the behavior of the following Hamiltonian,
inspired by (Jin et al., 2018).

1 1
E(f, g, λ1 , λ2 ) = F (f, g, λ1 ) − kλ − λ2 k22 . (64)

5.1 Technical Preparation


The following simple fact is useful for our analysis later.

ky t+1 − λt k2 = kProj∆+ λt + (1 − θ)(λt − λt−1 ) − Proj∆+ λt k2 ≤ (1 − θ)kλt − λt−1 k2 ,


 
N N
(65)
where the equality follows from the definition of y t+1 in (62c), and the inequality is due to
the non-expansiveness of the projection operator.
The following lemma shows that the Hamiltonian E(f t , g t , λt , λt−1 ) is monotonically
increasing when updating λ in Algorithm (3).

Lemma 14 [Sufficient increase in λ] Let {f t , g t , y t , λt } be generated by PAME (Algorithm


3). The following inequality holds:

2θ − θ2 t 1
E(f t+1 , g t+1 , λt+1 , λt ) − E(f t+1 , g t+1 , λt , λt−1 ) ≥
kλ − λt−1 k22 + kλt+1 − y t+1 k22 .
2τ 4τ
(66)
Note that since θ ∈ (0, 1), the right hand side of (66) is always nonnegative.

Proof. From the optimality condition of (62d) we know that, there exists h(λt+1 ) ∈
∂I∆+ (λt+1 ) such that
N

1
∇λ F (f t+1 , g t+1 , y t+1 ) − (λt+1 − y t+1 ) − h(λt+1 ) = 0. (67)
τ

By the convexity of the indicator function I∆+ (λt+1 ), we have


N

hy t+1 − λt+1 , h(λt+1 )i ≤ 0, hλt − λt+1 , h(λt+1 )i ≤ 0. (68)

20
On the Convergence of PAM for Equitable and Optimal Transport

Moreover, we have the following inequality:

kλt+1 − λt k22 = kλt+1 − y t+1 + y t+1 − λt k22


= ky t+1 − λt k22 + 2hλt+1 − y t+1 , y t+1 − λt i + kλt+1 − y t+1 k22 (69)
2 t
≤ (1 − θ) kλ − λt−1 k22 + 2hλ t+1
−y t+1
,y t+1 t
− λ i + kλ t+1
− y t+1 k22 ,

where the inequality is from (65).


We then have the following inequality:

F (f t+1 , g t+1 , λt ) − F (f t+1 , g t+1 , λt+1 )


≤ F (f t+1 , g t+1 , y t+1 ) + h∇λ F (f t+1 , g t+1 , y t+1 ), λt − y t+1 i −


F (f t+1 , g t+1 , y t+1 ) + h∇λ F (f t+1 , g t+1 , y t+1 ), λt+1 − y t+1 i − c2∞ kλt+1 − y t+1 k22 /(2η)


= h∇λ F (f t+1 , g t+1 , y t+1 ), λt − λt+1 i + c2∞ kλt+1 − y t+1 k22 /(2η)
≤ h∇λ F (f t+1 , g t+1 , y t+1 ) − h(λt+1 ), λt − λt+1 i + c2∞ kλt+1 − y t+1 k22 /(2η)
1 1
= hλt+1 − y t+1 , λt − λt+1 i + kλt+1 − y t+1 k2
τ 4τ
1 t+1 t+1 t t+1 t+1 1
= hλ − y ,λ − y +y − λt+1 i + kλt+1 − y t+1 k2
τ 4τ
1 3
= − hλt+1 − y t+1 , y t+1 − λt i − kλt+1 − y t+1 k2 ,
τ 4τ
(70)
where the first inequality is from the concavity of F with respect to λ and (28), the second
inequality is due to (68), the second equality is due to (67). Combining (69) and (70) leads
to
1 t+1
E(f t+1 , g t+1 , λt+1 , λt ) = F (f t+1 , g t+1 , λt+1 ) − kλ − λt k22

1 3
≥ F (f t+1 , g t+1 , λt ) + hλt+1 − y t+1 , y t+1 − λt i + kλt+1 − y t+1 k2
τ 4τ
(1 − θ)2 t 1 1
− kλ − λt−1 k22 − hλt+1 − y t+1 , y t+1 − λt i − kλt+1 − y t+1 k22
2τ τ 2τ
1 2θ − θ 2 1
= F (f t+1 , g t+1 , λt ) − kλt − λt−1 k22 + kλt − λt−1 k22 + kλt+1 − y t+1 k2
2τ 2τ 4τ
2θ − θ 2 1
= E(f t+1 , g t+1 , λt , λt−1 ) + kλt − λt−1 k22 + kλt+1 − y t+1 k2 ,
2τ 4τ
which completes the proof. 

Now we define the following function Ẽ, and later we will prove that Ẽ(f t , g t , λt , λt−1 )
can be upper bounded by O(1/t).

Ẽ(f, g, λ1 , λ2 ) = F (f ∗ , g ∗ , λ∗ ) − E(f, g, λ1 , λ2 ).

The next lemma is useful for obtaining the upper bound for Ẽ(f t , g t , λt , λt−1 ). Moreover,
it is noted that Ẽ(f, g, λ1 , λ2 ) ≥ 0, ∀f, g, λ1 , λ2 , and Ẽ(f, g, λ, λ) = F̃ (f, g, λ), ∀f, g, λ.

21
Huang and Ma and Lai

Lemma 15 Let {f t , g t , y t , λt } be generated by PAME (Algorithm 3). For any λ ∈ ∆+


N , the
following inequality holds
λ − λt , ∇λ F (f t+1 , g t , λt ) ≤ c∞ kct − bk1 + 7c2∞ kλt+1 − y t+1 k2 /η + 5(1 − θ)c2∞ kλt − λt−1 k2 /η.
(71)

Proof. From the optimality condition of (62d), we have the following inequality:
 
t+1 1
λ − λ , (λ t+1
− y ) − ∇λ F (f , g , y ) ≥ 0, ∀λ ∈ ∆+
t+1 t+1 t+1 t+1
N. (72)
τ
The left hand side of (71) can be rearranged to three terms.
hλ − λt , ∇λ F (f t+1 , g t , λt )i
= hλt − λ, −∇λ F (f t+1 , g t+1 , y t+1 )i + hλt − λ, ∇λ F (f t+1 , g t+1 , y t+1 ) − ∇λ F (f t+1 , g t+1 , λt )i
| {z } | {z }
(I) (II)
t t+1 t+1 t t+1 t t
+ hλ − λ, ∇λ F (f ,g , λ ) − ∇λ F (f , g , λ )i .
| {z }
(III)
(73)
We now bound these three terms one by one. To bound the term (I), we first note that
from (24c) and (15), we have
k∇λ F (f t+1 , g t+1 , λt )k2 ≤ c∞ ≤ c2∞ /η, (74)
where the second inequality is due to the definition of η (61). Now we can bound the term
(I) as follows:
(I) = λt − λt+1 , −∇λ F (f t+1 , g t+1 , λt ) + λt − λt+1 , ∇λ F (f t+1 , g t+1 , λt ) − ∇λ F (f t+1 , g t+1 , y t+1 )
+ λt+1 − λ, −∇λ F (f t+1 , g t+1 , y t+1 )
c2∞ t
≤ λt − λt+1 2
· ∇λ F (f t+1 , g t+1 , λt ) 2
+ λ − λt+1 2
· λt − y t+1 2
η
1 t+1
+ λ − λ 2 · λt+1 − y t+1 2
τ
≤3c2∞ kλt − λt+1 k2 /η + 4c2∞ kλt+1 − y t+1 k2 /η,
(75)
where the first inequality uses Lemma 4 and (72), the second inequality uses (74) and the
facts that λt − y t+1 2 ≤ 2 and λt − λ 2 ≤ 2.
For the term (II), Lemma 4 yields:
(II) ≤2 ∇λ F (f t+1 , g t+1 , y t+1 ) − ∇λ F (f t+1 , g t+1 , λt ) 2
≤ 2c2∞ y t+1 − λt 2
/η. (76)
For the term (III), it can be bounded as:
N
X
(III) = (λtk − λk ) · hπ k (f t+1 , g t+1 , λt ) − π k (f t+1 , g t , λt ), C k i
k=1
(77)
N
X
≤ kπ k (f t+1 , g t+1 , λt ) − π k (f t+1 , g t , λt )k1 kC k k∞ ≤ c∞ kct − bk1 ,
k=1

22
On the Convergence of PAM for Equitable and Optimal Transport

where the last inequality is due to Lemma (6). Plugging (75) - (77) into (73) and applying
the triangle inequality, we obtain

λ − λt , ∇λ F (f t+1 , g t , λt ) ≤ c∞ kct − bk1 + 7c2∞ kλt+1 − y t+1 k2 /η + 5c2∞ ky t+1 − λt k2 /η,

which immediately implies (71) by noting (65).




Lemma 16 Let (f t , g t , y t , λt ) be generated by PAME (Algorithm 3). The following inequal-


ity holds:

Ẽ(f t+1 , g t , λt , λt−1 ) ≤ (2c∞ −ηι)kct −bk1 +7c2∞ kλt+1 −y t+1 k2 /η +(7−5θ)c2∞ kλt −λt−1 k2 /η.

Proof. Since F (f, g, λ) is a concave function, we have

F (f ∗ , g ∗ , λ∗ ) ≤ F (f t+1 , g t , λt ) + ∇F (f t+1 , g t , λt ), (f ∗ , g ∗ , λ∗ ) − (f t+1 , g t , λt ) ,

which implies that

F̃ (f t+1 , g t , λt ) ≤hg t − g ∗ , ct − bi + hλt − λ∗ , −∇λ F (f t+1 , g t , λt )i


≤(c∞ − ηι)kct − bk1 + c∞ kct − bk1 + 7c2∞ kλt+1 − y t+1 k2 /η (78)
+ 5(1 − θ)c2∞ kλt −λ t−1
k2 /η.

where in the first inequality we have used (26), and the second inequality follows from (38)
and setting λ = λ∗ in (71). From (78) we immediately get

1 t
Ẽ(f t+1 , g t , λt , λt−1 ) = F̃ (f t+1 , g t , λt ) +
kλ − λt−1 k22

≤ c2∞ kλt − λt−1 k22 /η + (2c∞ − ηι)kct − bk1 + 7c2∞ kλt+1 − y t+1 k2 /η + 5(1 − θ)c2∞ kλt − λt−1 k2 /η
≤ 2c2∞ kλt − λt−1 k2 /η + (2c∞ − ηι)kct − bk1 + 7c2∞ kλt+1 − y t+1 k2 /η + 5(1 − θ)c2∞ kλt − λt−1 k2 /η,

where the second inequality is due to kλt − λt−1 k2 ≤ 2. This completes the proof. 

The following lemma bounds Ẽ(f t+1 , g t+1 , λt+1 , λt ) by O(1/t).

Lemma 17 Let {f t , g t , y t λt } be generated by PAME (Algorithm 3). The following inequal-


ity holds:
6/(ηγ1 )
Ẽ(f t+1 , g t+1 , λt+1 , λt ) ≤ , ∀t ≥ 0,
t + 1 + 6/(ηγ1 F̃ (f 0 , g 0 , λ0 ))
where we assume λ−1 = λ0 , and

2(2θ − θ2 )
 
1 1
γ1 = min 2 , (7 − 5θ)2 c2 , 49c2 (79)
(2c∞ − ηι) ∞ ∞

is a constant.

23
Huang and Ma and Lai

Proof. Combining (30b) and Lemma 14, we have


E(f t+1 , g t+1 , λt+1 , λt ) − E(f t+1 , g t , λt , λt−1 )
= E(f t+1 , g t+1 , λt+1 , λt ) − E(f t+1 , g t+1 , λt , λt−1 ) + F (f t+1 , g t+1 , λt ) − F (f t+1 , g t , λt )
 

η t 2θ − θ2 t 2 1 2
≥ kc − bk21 + λ − λt−1 2
+ λt+1 − y t+1 2
,
2 2τ 4τ
which implies that
Ẽ(f t+1 , g t+1 , λt+1 , λt ) − Ẽ(f t+1 , g t , λt , λt−1 )
η c2 c2
≤ − kct − bk21 − (2θ − θ2 ) ∞ kλt − λt−1 k22 − ∞ kλt+1 − y t+1 k22
2 η 2η
η h 2 2 2 i
≤ − γ1 (2c∞ − ηι) ct − b 1 + (7 − 5θ)c2∞ kλt − λt−1 k2 /η + 7c2∞ kλt+1 − y t+1 k2 /η

2
η  2
≤ − γ1 (2c∞ − ηι) ct − b 1 + (7 − 5θ)c2∞ kλt − λt−1 k2 /η + 7c2∞ kλt+1 − y t+1 k2 /η
6
η
≤ − γ1 Ẽ(f t+1 , g t , λt , λt−1 )2 ,
6
(80)
where the last inequality applies Lemma 16. We then divide both sides of (80) by
Ẽ(f t+1 , g t+1 , λt+1 , λt ) · Ẽ(f t+1 , g t , λt , λt−1 ),
and we obtain
1 1 η Ẽ(f t+1 , g t , λt , λt−1 )
≥ + γ1 ·
Ẽ(f t+1 , g t+1 , λt+1 , λt ) Ẽ(f t+1 , g t , λt , λt−1 ) 6 Ẽ(f t+1 , g t+1 , λt+1 , λt )
(81)
1 η 1 η
≥ + γ1 ≥ + γ1 ,
Ẽ(f , g , λ , λ ) 6
t+1 t t t−1 Ẽ(f , g , λ , λ ) 6
t t t t−1

where the second inequality holds because (80) implies that


Ẽ(f t+1 , g t , λt , λt−1 ) ≥ Ẽ(f t+1 , g t+1 , λt+1 , λt ),
and the last inequality follows from (30a). Summing (81) from 0 to t leads to
1 1 η(t + 1) 1 η(t + 1)
≥ + γ1 = + γ1 ,
Ẽ(f t+1 , g t+1 , λt+1 , λt ) Ẽ(f 0 , g 0 , λ0 , λ−1 ) 6 F̃ (f 0 , g 0 , λ0 ) 6
which immediately leads to the desired result. 

Similar to Lemma 11, the following lemma provides some sufficient conditions for the
PAME algorithm to return an -optimal solution to the original EOT problem (2).

Lemma 18 Assume PAME terminates at the T -iteration, i.e.,


kcT −1 − bk1 ≤ /(6(6c∞ − ηι)), (82a)
T −1 T −2
kλ −λ k2 ≤ η/(60(1 − θ)c2∞ ), (82b)
kλT − y T k2 ≤ η/(42c2∞ ), (82c)
T T −1 T −1
F̃ (f , g ,λ ) ≤ /6. (82d)

24
On the Convergence of PAM for Equitable and Optimal Transport

Then the output (π̂, λ̂) of PAME (Algorithm 3), i.e.,

π̂ k = Round(π k (f T , g T −1 , λT −1 ), ak , bk ), ∀k ∈ [N ], λ̂ = λT −1 ,

is an -optimal solution of the original EOT problem (2).

Proof. The proof is essentially the same as that of Lemma 11. More specifically, we again
need to show that the output of PAME (π̂ π , λ̂) satisfies (44). The proof of (44b) is exactly
the same as the proof of Lemma 11. The proof of (44a) only requires to develop a new
bound for D E
π ) − λ̂, ∇λ F (f T , g T −1 , λT −1 )
λ̄(π̃ (83)

that is used in (49). Other parts are again exactly the same as the ones in Lemma 11. The
new bound of (83) can be obtained by applying Lemma 15 with λ = λ̄(π̃ π ) and t = T − 1,
which yields

π ) − λ̂, ∇λ F (f T , g T −1 , λT −1 )i
hλ̄(π̃
(84)
≤ c∞ kcT −1 − bk1 + 5(1 − θ)c2∞ kλT −1 − λT −2 k2 /η + 7c2∞ kλT − y T k2 /η.

By combining (84) with (48)-(51), we can bound the left hand side of (44a) by

` π̂ π ) − `(π̂
π , λ̄(π̂ π , λ̂)
≤ (6c∞ − ηι) cT −1 − b 1
+ 5(1 − θ)c2∞ λT −1 − λT −2 2
/η + 7c2∞ kλT − y T k2 /η
+ F (f T , g T −1 , λT −1 ) − F ∗ (85)
 
1 1 1 1 1
≤ + + +  = ,
6 12 12 6 2

where in the last inequality we have used all the sufficient conditions (82a)-(82d). 

Theorem 19 Define 0 = /(6c∞ − ηι), and set T to be

3600(1 − θ)2 + 882 c2∞ s



48 48
T =8 + √ 0 + 2
+
η γ1  η ηγ1  (86)
2 −2

=O c∞  ,
n o
2(2θ−θ2 )
where γ1 = min 1
(2c∞ −ηι)
1
2 , (7−5θ)2 c2 , 49c2 and we know γ1 = O(c−2
∞ ). At least one of
∞ ∞
the iterations in Algorithm 3, after rounding, is an -saddle point of the EOT problem (2).

Proof. According to Lemma 18, we only need to show that (82) holds after T iterations as
defined in (86).
We follow the same idea as the proof of Theorem 12. First we reduce Ẽ(f t+1 , g t+1 , λt+1 , λt )
from Ẽ(f 0 , g 0 , λ0 , λ−1 ) = F̃ (f 0 , g 0 , λ0 ) to a constant s by running t1 steps. By Lemma 17,
we have
6 6
t1 ≤ 1 + − . (87)
ηγ1 s ηγ1 F̃ (f 0 , g 0 , λ0 )

25
Huang and Ma and Lai

Secondly, starting from s, we continue running the algorithm, and assume that there are t2
iteration in which (82a) fails. By (30b) we have

72s
t2 ≤ 1 + .
η02

Therefore, we know that the total iteration number that (82a) fails can be upper bounded
by
72s 6 6
T1 = t1 + t2 ≤ 2 + 02 + −
η ηγ1 s ηγ1 F̃ (f 0 , g 0 , λ0 )
0
iterations. By choosing s = 6 γ1 ,
√ we know that

√12 √36 6 √48 0


(
2+ η γ1 0 + η γ1 0 − ηγ1 F̃ (f 0 ,g 0 ,λ0 )
≤2+ η γ1 0 if F̃ (f 0 , g 0 , λ0 ) ≥ 6 γ1 ,

T1 ≤
2+ √12 + √36 − 6
≤2+ √12 otherwise.
η γ1 0 η γ1 0 ηγ1 F̃ (f 0 ,g 0 ,λ0 ) η γ1 0

Therefore, we have T1 ≤ 2 + η√48 γ1 0 . Similarly, from Lemma 14 we know that, starting


from s, the number of iterations that (82b) and (82c) fail can be respectively bounded by

3600(1 − θ)2 c2∞ s 3528c2∞ s


t3 ≤ 1 + , and t4 ≤ 1 + .
η2 (2θ − θ2 ) η2

By choosing s = , we have the total iteration numbers that (82b) and (82c) fail can be
respectively bounded by

3600(1 − θ)2 c2∞ 6 6 3600(1 − θ)2 c2∞ 6


T2 = t1 + t3 ≤ 2 + 2
+ − ≤ 2 + 2
+
η(2θ − θ ) ηγ1  ηγ1 F̃ (f 0 , g 0 , λ0 ) η(2θ − θ ) ηγ1 

and
3528c2∞ 6 6 3528c2∞ 6
T3 = t1 + t4 ≤ 2 + + − ≤2+ + .
η ηγ1  ηγ1 F̃ (f 0 , g 0 , λ0 ) η ηγ1 

Finally, by letting s = /6 in (87), we know that

Ẽ(f T4 −1 , g T4 −1 , λT4 −1 , λT4 −2 ) ≤ /6 (88)

after
36
T4 = 1 +
ηγ1 
iterations. From (88) we know that

F̃ (f T4 −1 , g T4 −1 , λT4 −1 ) ≤ /6,

which implies that (82d) holds with T = T4 by noting (30a).


Combining the above discussions, we know that after T = T1 +T2 +T3 +T4 +1 iterations,
there must exist at least one iteration such that the sufficient condition (82) holds, and thus
the output of PAME is an -optimal solution to the original EOT problem (2). 

26
On the Convergence of PAM for Equitable and Optimal Transport

Figure 1: Computational time comparison between PAM, PAME and APGA algorithms
on Gaussian distributions. Upper Left: N = 10, n = 100, η = 0.1, Upper Right:
N = 10, n = 500, η = 0.1, Bottom Left: N = 10, n = 100, η = 0.5, Bottom Right:
N = 5, n = 100, η = 0.1.

Remark 20 We are not able to analytically prove that PAME has an improved complexity
bound at this moment yet. The APGA proposed in (Scetbon et al., 2021) in fact has better
complexity than PAM and PAME. However, as demonstrated in (Scetbon et al., 2021) and
in our numerical experiments (Sections 6), APGA performs worse than PAM. We believe
the reason is that APGA takes gradient step for the variables f and g, while PAM exactly
minimizes the subproblems corresponding to these two variables. It is the exact minimization
step that led to the improvement. Developing a provably better algorithm is definitely an
important and interesting future direction.

6. Numerical Experiments

We compare the performance of PAME with PAM and APGA (60) (Scetbon et al., 2021)
on a synthetic dataset: the Gaussian distributions. We also conduct numerical comparison
on another synthetic dataset: the fragmented hypercube dataset. The results are included
in the following sections.

27
Huang and Ma and Lai

6.1 Numerical Results on Gaussian Distribution


Gaussian Distribution: Consider the case when two sets of discrete support {xi }i∈[n] , {yj }j∈[n]
are independently sampled from Gaussian distributions
   
1 10 1
N , (89)
1 1 10

and    
2 1 −0.2
N , (90)
2 −0.2 1
respectively. The base cost matrix C base is computed by Ci,j base = kx − y k2 . Assume we
i j 2
have N agents. The cost matrix of each agent can be obtained by adding Gaussian noise
sampled from N (0, 10) to each element of the base cost. For instance, for the k-th agent
with a cost matrix C k , we have Ci,j
k = |C base + N (0, 10)|.
i,j
We then set a = b = [1/n, ..., 1/n] for all experiments. For all algorithms, we set τ = c5η
2

and we set θ = 0.1 for the PAME algorithm. We consider the EOT error as a measure of
optimality. The EOT error at iteration t is defined by

Error = |`(π(f t , g t , λt ), λt ) − `∗ |, (91)

where `∗ is the approximated optimal value of EOT (2) obtained by running the PAM
algorithm for 20000 iterations. Figure 1 plots the EOT error against the execution time for
Gaussian distributions. We run each algorithm for 2000 iterations for different parameter
settings. In all cases, the PAME and PAM perform significantly better than APGA, and
PAME also shows significant improvement over PAM.
Figure 2 shows the optimal couplings obtained from the standard OT and EOT of two
Gaussian distributions under three different metrics: the Euclidean cost ($\|\cdot\|_2$), the squared
Euclidean cost ($\|\cdot\|_2^2$) and the $L_1^{1.5}$ norm ($\|\cdot\|_1^{1.5}$), respectively. We set $n = 4$, $\eta = 0.05$ and
generate samples independently according to (89). For the EOT problem, we consider three
agents whose cost matrices are computed with the three metrics mentioned above. Note that the
entropy regularized models lead to dense transportation plans, and Figure 2 only plots the
couplings with a probability larger than $10^{-3}$. We see that all the agents have the same
total cost in the EOT model and, as expected, this cost is smaller than the three OT
costs obtained using the same metrics. The subfigures in the first row imply that if we
split the workload evenly into three parts, the three agents would incur costs 5.935/3,
2.158/3 and 5.030/3, respectively, which is not fair because these costs differ. The EOT
model, in contrast, guarantees fairness.
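As a simple numerical check of this fairness property, one can compute each agent's total transport cost directly from the returned plans. The sketch below assumes the solver returns a list of per-agent plans $\pi^k$ and cost matrices $C^k$ as NumPy arrays; the function names are ours, not part of any released code.

```python
import numpy as np

def agent_costs(plans, costs):
    """Total cost carried by agent k: sum_{i,j} pi^k_{ij} * C^k_{ij}."""
    return [float(np.sum(pi_k * C_k)) for pi_k, C_k in zip(plans, costs)]

def plotted_couplings(pi_k, threshold=1e-3):
    """Indices (i, j) of the couplings kept when plotting a dense regularized plan."""
    return np.argwhere(pi_k > threshold)
```

In the experiment of Figure 2, the values returned by agent_costs should be approximately equal across agents, reflecting the equal workloads that the EOT model aims for.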
We further compare the computational time for Gaussian distributions. We generate
the data as above and set the parameters as $\eta = 0.5$, $\tau = 5\eta/c_\infty^2$. We stop all the
algorithms when the EOT error (91) is less than $10^{-4}$. Tables 1 and 2 show the CPU time
(in seconds) for different $(n, N)$ pairs; the reported computational time is averaged over 5
runs. In Table 1, the APGA algorithm fails to reach an error of $10^{-4}$ within 500000 iterations
when $n = 100$ and $n = 500$. We conclude that the APGA algorithm converges much more slowly
than the PAM and PAME algorithms. The PAME algorithm performs the best among all three
algorithms.


Figure 2: Optimal couplings of standard OT (first row) and EOT (second row). OT Square
Euclidean Cost: 5.935; OT Euclidean Cost: 2.158; OT $L_1^{1.5}$ Cost: 5.030; EOT Cost: 0.906.

Table 1: CPU time (in seconds) comparison for Gaussian Distributions. Fixed N = 3.

Algorithms    n = 10      n = 20       n = 50        n = 100     n = 500
PAM           0.038283    0.096039     0.177209      1.038785    1.768560
PAME          0.025210    0.065552     0.091593      0.564618    1.340104
APGA          1.125673    12.364775    106.768840    -           -

6.2 Numerical Results on Fragmented Hypercube Dataset

In this section, we compare the performance of PAME with PAM and APGA (60) (Scetbon
et al., 2021) on the fragmented hypercube dataset.

Fragmented Hypercube: We now consider transferring mass between a uniform distribution
over a hypercube, $\mu = \mathcal{U}([-1,1]^d)$, and a distribution $\nu$ obtained as the pushforward
$\nu = T_\sharp\mu$, where $T(x) = x + 2\,\mathrm{sign}(x) \odot \sum_{m=1}^{m^*} e_m$. Here $\mathrm{sign}(\cdot)$ is taken elementwise,
$m^* \in [d]$, and $e_i$, $i \in [d]$, denotes the canonical basis of $\mathbb{R}^d$. In our experiments, we
set $d = 10$, $m^* = 2$ and sample two base support sets $\{x_i^{\mathrm{base}}\}_{i\in[n]}$, $\{y_j^{\mathrm{base}}\}_{j\in[n]}$ independently
from $\mu$ and $\nu$. To obtain the cost matrix for one agent, we first add Gaussian noise
sampled from $\mathcal{N}(0,1)$ to the base support sets to get $\{x_i^{\mathrm{noisy}}\}_{i\in[n]}$, $\{y_j^{\mathrm{noisy}}\}_{j\in[n]}$, and then
compute the cost using the noisy support sets. For instance, for the $k$-th agent, we have
$(x_i^{\mathrm{noisy}})^k = x_i^{\mathrm{base}} + \mathcal{N}(0,1)$, $(y_j^{\mathrm{noisy}})^k = y_j^{\mathrm{base}} + \mathcal{N}(0,1)$ and $C^k_{i,j} = \|(x_i^{\mathrm{noisy}})^k - (y_j^{\mathrm{noisy}})^k\|_2^2$.
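A minimal sketch of this construction (assuming NumPy; the names and the random seed are ours, not from any released code) is:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m_star, n, N = 10, 2, 100, 5

def T(x):
    # T(x) = x + 2*sign(x) applied to the first m_star coordinates only,
    # i.e. x + 2 sign(x) (elementwise) sum_{m=1}^{m*} e_m.
    e = np.zeros(d)
    e[:m_star] = 1.0
    return x + 2.0 * np.sign(x) * e

# Base supports: x from mu = U([-1, 1]^d), y from the pushforward nu = T_# mu.
x_base = rng.uniform(-1.0, 1.0, size=(n, d))
y_base = np.array([T(z) for z in rng.uniform(-1.0, 1.0, size=(n, d))])

# Each agent perturbs the base supports with N(0, 1) noise and uses squared
# Euclidean costs between its own noisy supports.
C_agents = []
for _ in range(N):
    x_noisy = x_base + rng.normal(0.0, 1.0, size=(n, d))
    y_noisy = y_base + rng.normal(0.0, 1.0, size=(n, d))
    C_agents.append(np.sum((x_noisy[:, None, :] - y_noisy[None, :, :]) ** 2, axis=2))
```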


Table 2: CPU time (in seconds) comparison for Gaussian Distributions. Fixed n = 50.

Algorithms N =2 N =3 N =5 N = 10 N = 20
PAM 0.180343 0.177209 1.021598 0.719909 0.903429
PAME 0.105775 0.091593 0.560785 0.385495 0.618381
APGA 70.959300 106.768840 169.213121 178.637889 212.697866

Figure 3: CPU time comparison between PAM, PAME and APGA algorithms on the Frag-
mented Hypercube dataset. Upper Left: N = 5, n = 100, η = 0.2, Upper Right:
N = 5, n = 500, η = 0.2, Bottom Left: N = 5, n = 100, η = 0.1, Bottom Right:
N = 10, n = 100, η = 0.2.

Figure 3 plots the EOT error versus the CPU time for the Fragmented Hypercube dataset.
We run PAM for 20000 iterations to obtain an approximate optimal value $\ell^*$, and run all
algorithms for 2000 iterations under different parameter settings. In all cases, PAME and PAM
perform significantly better than APGA, and PAME also shows a significant improvement over PAM.
We then compare the computational time for the Fragmented Hypercube dataset. We set
the parameters as $\eta = 0.2$, $\tau = 5\eta/c_\infty^2$. We stop all the algorithms when the EOT error (91)
is less than $10^{-4}$. Tables 3 and 4 show the CPU time (averaged over 5 runs) for different
$(n, N)$ pairs. We see that the PAME algorithm still performs the best among all three algorithms.


Note that in Table 3 the APGA algorithm fails to reach an error of $10^{-4}$ within
500000 iterations when $n = 50$, $n = 100$ and $n = 500$.

Table 3: CPU time (in seconds) comparison for Fragmented Hypercube. Fixed N = 3.

Algorithms    n = 10       n = 20       n = 50      n = 100     n = 500
PAM           0.165165     0.101363     0.177209    1.154193    3.840804
PAME          0.112527     0.068529     0.091593    0.588553    2.007653
APGA          13.771911    22.017804    -           -           -

Table 4: CPU time (in seconds) comparison for Fragmented Hypercube. Fixed n = 20.

Algorithms N =2 N =3 N =5 N = 10 N = 20
PAM 0.003180 0.101363 0.172696 0.253926 0.231646
PAME 0.040302 0.068529 0.110080 0.150513 0.129156
APGA 1.007166 22.017804 13.801154 6.304324 3.749495

7. Conclusion
In this paper, we have provided the first convergence analysis of the PAM algorithm for solving
the EOT problem. Specifically, we have shown that the PAM algorithm finds an $\epsilon$-saddle point
in at most $O(\epsilon^{-2})$ iterations. We have also proposed the PAME algorithm, which incorporates
an extrapolation technique into PAM; PAME shows significant numerical improvement over PAM.
The results in this paper may shed light on the design of new BCD-type algorithms.

Acknowledgments

Research of Shiqian Ma was supported in part by Office of Naval Research (ONR) grant
N00014-24-1-2705, National Science Foundation (NSF) grants DMS-2243650, CCF-2308597,
CCF-2311275 and ECCS-2326591, UC Davis CeDAR (Center for Data Science and Artificial
Intelligence Research) Innovative Data Science Seed Funding Program, and a startup fund
from Rice University. Research of Lifeng Lai was supported by National Science Foundation
under grants CCF-2112504 and CCF-2232907.

References
Jason Altschuler, Jonathan Niles-Weed, and Philippe Rigollet. Near-linear time approxi-
mation algorithms for optimal transport via Sinkhorn iteration. In Advances in neural
information processing systems, pages 1964–1974, 2017.


David Alvarez-Melis, Stefanie Jegelka, and Tommi S Jaakkola. Towards optimal transport
with global invariances. In The 22nd International Conference on Artificial Intelligence
and Statistics, pages 1870–1879. PMLR, 2019.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial
networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
Amir Beck and Luba Tetruashvili. On the convergence of block coordinate descent type
methods. SIAM journal on Optimization, 23(4):2037–2060, 2013.
Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on rein-
forcement learning. In International Conference on Machine Learning, pages 449–458,
2017.
Jean-David Benamou, Guillaume Carlier, Marco Cuturi, Luca Nenna, and Gabriel Peyré.
Iterative Bregman projections for regularized transportation problems. SIAM Journal on
Scientific Computing, 37(2):A1111–A1138, 2015.
Felix Brandt, Vincent Conitzer, Ulle Endriss, Jérôme Lang, and Ariel D Procaccia. Hand-
book of computational social choice. Cambridge University Press, 2016.
Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Ad-
vances in neural information processing systems, pages 2292–2300, 2013.
Jelena Diakonikolas and Lorenzo Orecchia. Alternating randomized block coordinate de-
scent. In International Conference on Machine Learning, pages 1224–1232. PMLR, 2018.
Pavel Dvurechensky, Alexander Gasnikov, and Alexey Kroshnin. Computational optimal
transport: Complexity by accelerated gradient descent is better than by Sinkhorn’s al-
gorithm. In International Conference on Machine Learning, pages 1367–1376. PMLR,
2018.
Aude Genevay, Lénaic Chizat, Francis Bach, Marco Cuturi, and Gabriel Peyré. Sample
complexity of sinkhorn divergences. In The 22nd International Conference on Artificial
Intelligence and Statistics, pages 1574–1583, 2019.
M. Hong, X. Wang, M. Razaviyayn, and Z.-Q. Luo. Iteration complexity analysis of block
coordinate descent method. Mathematical Programming Series A, 163(1):85–114, 2017.
Minhui Huang, Shiqian Ma, and Lifeng Lai. A Riemannian block coordinate descent method
for computing the projection robust Wasserstein distance. In Proceedings of the 38th
International Conference on Machine Learning, volume 139, pages 4446–4455. PMLR,
2021a.
Minhui Huang, Shiqian Ma, and Lifeng Lai. Projection robust Wasserstein barycenters.
In Proceedings of the 38th International Conference on Machine Learning, volume 139,
pages 4456–4465. PMLR, 2021b.
Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes
saddle points faster than gradient descent. In Conference On Learning Theory, pages
1042–1085. PMLR, 2018.


Tianyi Lin, Chenyou Fan, Nhat Ho, Marco Cuturi, and Michael Jordan. Projection robust
Wasserstein distance and Riemannian optimization. In NeurIPS, volume 33, 2020.

Erika Mackin and Lirong Xia. Allocating indivisible items in categorized domains. arXiv
preprint arXiv:1504.05932, 2015.

Hervé Moulin. Fair division and collective welfare. MIT press, 2003.

A. Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities
with Lipschitz continuous monotone operators and smooth convex-concave saddle point
problems. SIAM Journal on Optimization, 15(1):229–251, 2005.

Y. E. Nesterov. Introductory lectures on convex optimization: A basic course. Applied
Optimization. Kluwer Academic Publishers, Boston, MA, 2004. ISBN 1-4020-7553-7.

Sherjil Ozair, Corey Lynch, Yoshua Bengio, Aaron Van den Oord, Sergey Levine, and Pierre
Sermanet. Wasserstein dependency measure for representation learning. In Advances in
Neural Information Processing Systems, pages 15604–15614, 2019.

François-Pierre Paty and Marco Cuturi. Subspace robust Wasserstein distances. In Inter-
national Conference on Machine Learning, pages 5072–5081, 2019.

Meyer Scetbon, Laurent Meunier, Jamal Atif, and Marco Cuturi. Equitable and optimal
transport with multiple agents. In International Conference on Artificial Intelligence and
Statistics, pages 2035–2043. PMLR, 2021.

R. Sinkhorn and P. Knopp. Concerning nonnegative matrices and doubly stochastic matri-
ces. Pacific J. Math., 21:343–348, 1967.

Richard Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums.
The American Mathematical Monthly, 74(4):402–405, 1967.

Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176,
1958.

Ruoyu Sun and Mingyi Hong. Improved iteration complexity bounds of cyclic block coordi-
nate descent for convex problems. Advances in Neural Information Processing Systems,
28, 2015.

Robert E Tarjan. Dynamic trees as search trees via Euler tours, applied to the network
simplex algorithm. Mathematical Programming, 78(2):169–177, 1997.

Y. Xu and W. Yin. A block coordinate descent method for regularized multi-convex
optimization with applications to nonnegative tensor factorization and completion. SIAM
Journal on Imaging Sciences, 6(3):1758–1789, 2013.
