Park 22 C
Park 22 C
Abstract of Lieder (2021); Kim (2021); Yoon & Ryu (2021), and is
distinct from Nesterov’s acceleration.
Despite the broad use of fixed-point iterations
throughout applied mathematics, the optimal con-
vergence rate of general fixed-point problems with 1.1. Preliminaries and notations
nonexpansive nonlinear operators has not been We review standard definitions and set up the notation.
established. This work presents an acceleration
mechanism for fixed-point iterations with non- Monotone and set-valued operators. We follow stan-
expansive operators, contractive operators, and dard notation of Bauschke & Combettes (2017); Ryu & Yin
nonexpansive operators satisfying a Hölder-type (2020). For the underlying space, consider Rn with standard
growth condition. We then provide matching com- inner product ⟨·, ·⟩ and norm ∥ · ∥, although our results can
plexity lower bounds to establish the exact opti- be extended to infinite-dimensional Hilbert spaces.
mality of the acceleration mechanisms in the non-
expansive and contractive setups. Finally, we pro- We say 𝔸 is an operator on Rn and write 𝔸 : Rn ⇒ Rn if
vide experiments with CT imaging, optimal trans- 𝔸 maps a point in Rn to a subset of Rn . For notational sim-
port, and decentralized optimization to demon- plicity, also write 𝔸x = 𝔸(x). Write Gra 𝔸 = {(x, u) |
strate the practical effectiveness of the accelera- u ∈ 𝔸x} for the graph of 𝔸. Write 𝕀 : Rn → Rn for the
tion mechanism. identity operator. We say 𝔸 : Rn ⇒ Rn is monotone if
for k = 0, 1, . . . with some starting point x0 ∈ Rn . The An operator 𝔸 is maximally monotone if there is no other
general rubric of formulating solutions of a problem at hand monotone 𝔹 such that Gra 𝔸 ⊂ Gra 𝔹 properly, and is
as fixed points of an operator and then performing the fixed- maximally µ-strongly monotone if there is no other µ-
point iterations is ubiquitous throughout applied mathemat- strongly monotone 𝔹 such that Gra 𝔸 ⊂ Gra 𝔹 properly.
ics, science, engineering, and machine learning. For L ∈ (0, ∞), single-valued operator 𝕋 : Rn → Rn is
Surprisingly, however, the iteration complexity of the ab- L-Lipschitz if
stract fixed-point iteration has not been thoroughly studied.
∥𝕋x − 𝕋y∥ ≤ L∥x − y∥, ∀x, y ∈ Rn .
This stands in sharp contrast with the literature on con-
vex optimization algorithms, where convergence rates and 𝕋 is contractive if it is L-Lipschitz with L < 1 and non-
matching lower bounds are carefully studied. expansive if it is 1-Lipschitz. For θ ∈ (0, 1), an operator
In this paper, we establish the exact optimal complexity of 𝕊 : Rn → Rn is θ-averaged if 𝕊 = (1 − θ)𝕀 + θ𝕋 and a
fixed-point iterations by providing an accelerated method nonexpansive operator 𝕋.
and a matching complexity lower bound. The acceleration is Write 𝕁𝔸 = (𝕀 + 𝔸)−1 for the resolvent of 𝔸, and ℝ𝔸 =
based on a Halpern mechanism, which follows the footsteps 2𝕁𝔸 −𝕀 for the reflected resolvent of 𝔸. When 𝔸 is maximal
1
Department of Mathematical Sciences, Seoul National Univer- monotone, it is well known that 𝕁𝔸 is single-valued with
sity. Correspondence to: Ernest K. Ryu <[email protected]>. dom 𝕁𝔸 = Rn , ℝ𝔸 is a nonexpansive operator, and 𝕁𝔸 =
1 1
2 𝕀 + 2 ℝ𝔸 is 1/2-averaged.
Proceedings of the 39 th International Conference on Machine
Learning, Baltimore, Maryland, USA, PMLR 162, 2022. Copy- We say x⋆ ∈ Rn is a zero of 𝔸 if 0 ∈ 𝔸x⋆ . We say y⋆
right 2022 by the author(s). is a fixed-point of 𝕋 if 𝕋y⋆ = y⋆ . Write Zer 𝔸 for the set
Exact Optimal Accelerated Complexity for Fixed-Point Iterations
of zeros of 𝔸 and Fix 𝕋 for the set of all fixed-points of it has been established for KM (Ishikawa, 1976; Borwein
𝕋. For any x ∈ Rn such that x = 𝕁𝔸 y for some y ∈ Rn , et al., 1992) and Halpern (Wittmann, 1992; Xu, 2002).
˜ = y − 𝕁𝔸 y as the resolvent residual of 𝔸 at x.
define 𝔸x
˜ ∈ 𝔸x. For any y ∈ Rn , define y − 𝕋y as the The convergence rate of the KM iteration in terms of
Note that 𝔸x
∥yk − 𝕋yk ∥2 was shown to exhibit O(1/k)-rate (Cominetti
fixed-point residual of 𝕋 at y.
et al., 2014; Liang et al., 2016; Bravo & Cominetti, 2018)
and o(1/k)-rate (Baillon & Bruck, 1992; Davis & Yin, 2016;
Fixed-point iterations. There is a long and rich history
Matsushita, 2017) under various setups. In addition, Bor-
of iterative methods for finding a fixed point of an operator
wein et al. (2017); Lin & Xu (2021) studied the convergence
𝕋 : Rn → Rn (Rhoades, 1991; Brezinski, 2000; Rhoades
rate of the distance to solution under additional bounded
& Saliga, 2001; Berinde & Takens, 2007). In this work, we
Hölder regularity assumption.
consider the following three: the Picard iteration
For the convergence rate of the Halpern iteration in terms of
yk+1 = 𝕋yk , ∥yk − 𝕋yk ∥2 , Leustean (2007) proved a O(1/(log k)2 )-rate
and later Kohlenbach (2011) improved this to a O(1/k)-
the Krasnosel’skiı̆–Mann iteration (KM iteration) rate. Sabach & Shtern (2017) first proved the O(1/k 2 )-rate
yk+1 = λk+1 yk + (1 − λk+1 )𝕋yk , of Halpern iteration, and this rate has been improved in its
constant by a factor of 16 by Lieder (2021).
and the Halpern iteration
Monotone inclusions and splitting methods. As we soon
yk+1 = λk+1 y0 + (1 − λk+1 )𝕋yk ,
establish in Section 2, monotone operators are intimately
where y0 ∈ Rn is an initial point and {λk }k∈N ⊂ (0, 1). connected to fixed-point iterations. Splitting methods
Under suitable assumptions, the {yk }k∈N sequence of these such as forward-backward splitting (FBS) (Bruck Jr, 1977;
iterations converges to a fixed point of 𝕋. Passty, 1979), augmented Lagrangian method (Hestenes,
1969; Powell, 1969), Douglas–Rachford splitting (DRS)
1.2. Prior work (Peaceman & Rachford, 1955; Douglas & Rachford, 1956;
Lions & Mercier, 1979), alternating direction method of
Fixed-point iterations. Picard iteration’s convergence multiplier (ADMM) (Gabay & Mercier, 1976), Davis–Yin
with a contractive operator was established by Banach’s splitting (DYS) (Davis & Yin, 2017), (PDHG) (Chambolle
fixed-point theorem (Banach, 1922). What we refer to as & Pock, 2011), and Condat–Vũ (Condat, 2013; Vũ, 2013)
the Krasnosel’skiı̆–Mann iteration is a generalization of are all fixed-point iterations with respect to specific nonex-
the setups by Krasnosel’skiı̆ (1955) and Mann (1953). Its pansive operators. Therefore, an acceleration of the abstract
convergence with general nonexpansive operators is due to fixed-point iteration is applicable to the broad range of split-
Martinet (1972). The iteration of Halpern (1967) converges ting methods for monotone inclusions.
1
for the wider choice of parameter λk (including λk = k+1 )
due to Wittmann (1992). Halpern iteration is later gen-
Acceleration. Since the seminal work by Nesterov (1983)
eralized to the sequential averaging method (Xu, 2004).
on accelerating gradient methods convex minimization prob-
Ishikawa iteration (Ishikawa, 1976) is an iteration with two
lems, much work as been dedicated to algorithms with
sequences updated in an alternating manner. Anderson ac-
faster accelerated rates. Gradient descent (Cauchy, 1847)
celeration (Anderson, 1965) is another acceleration scheme
can be accelerated in terms of function value suboptimal-
for fixed-point iterations, and it has recently attracted sig-
ity for smooth convex minimization problems (Nesterov,
nificant interest (Walker & Ni, 2011; Scieur et al., 2020;
1983; Kim & Fessler, 2016a), smooth strongly convex mini-
Barré et al., 2020; Zhang et al., 2020; Bertrand & Massias,
mization problems (Nesterov, 2004; Van Scoy et al., 2018;
2021). A number of inertial fixed-point iterations have also
Park et al., 2021; Taylor & Drori, 2021; Salim et al., 2022),
been proposed to accelerate fixed-point iterations (Maingé,
and convex composite minimization problems (Güler, 1992;
2008; Dong et al., 2018; Shehu, 2018; Reich et al., 2021).
Beck & Teboulle, 2009). Recently, accelerated methods for
Our presented method is optimal (in the sense made precise
reducing the squared gradient magnitude for smooth convex
by the theorems) when compared these prior non-stochastic
minimization (Kim & Fessler, 2021; Lee et al., 2021) and
fixed-point iterations.
smooth convex-concave minimax optimization (Diakoniko-
las & Wang, 2021; Yoon & Ryu, 2021) were presented.
Convergence rates of fixed-point iterations. The
squared fixed-point residual ∥yk − 𝕋yk ∥2 is the error mea- Recently, it was discovered that acceleration is also possible
sure for fixed-point problems that we focus on. Its con- in solving monotone inclusions. The accelerated proximal
vergence to 0 (without a specified rate) is referred to as point method (APPM) (Kim, 2021) provides an accelerated
asymptotic regularity (Browder & Petryshyn, 1966), and ˜ k ∥2 compared to the O(1/k)-rate of
O(1/k 2 )-rate of ∥𝔸x
Exact Optimal Accelerated Complexity for Fixed-Point Iterations
proximal point method (PPM) (Martinet, 1970; Gu & Yang, Performance estimation problem. The discovery of the
2020) for monotone inclusions. Maingé (2021) improved main algorithm of Section 3 heavily relied on the use of the
this rate to o(1/k 2 ) rate with another accelerated variant of performance estimation problem (PEP) technique (Drori &
proximal point method called CRIPA-S. Teboulle, 2014). Loosely speaking, the PEP is a computer-
assisted methodology for finding optimal methods by nu-
merically solving semidefinite programs (Drori & Teboulle,
Complexity lower bound. Under the information-based 2014; Kim & Fessler, 2016a; Taylor et al., 2018; Drori &
complexity framework (Nemirovski, 1992), complexity Taylor, 2020; Kim & Fessler, 2021). We discuss the details
lower bound on first-order methods for convex optimiza- of our use of the PEP in Section C of the appendix.
tion has been thoroughly studied (Nesterov, 2004; Drori,
2017; Drori & Shamir, 2020; Carmon et al., 2020; 2021; 1.3. Contributions
Drori & Taylor, 2022). When a complexity lower bound
We summarize the contribution of this work as follows.
matches an algorithms’ guarantee, it establishes optimality
First, we present novel accelerated fixed-point iteration
of the algorithm (Nemirovski, 1992; Drori & Teboulle, 2016;
(OC-Halpern) and its equivalent form (OS-PPM) for mono-
Kim & Fessler, 2016a; Taylor & Drori, 2021; Yoon & Ryu,
tone inclusions. Second, we present exact matching com-
2021; Salim et al., 2022). In the fixed-point theory literature,
plexity lower bounds and thereby establish the exact opti-
Diakonikolas (2020) provided the lower bound result for the
mality of our presented methods. Third, using a restarting
rate of ⟨𝔸xk , xk − x⋆ ⟩ for variational inequalities with Lip-
mechanism, we extend the acceleration to a broader setup
schitz,
monotone operator. Colao & Marino (2021) showed
2− q2
with operators satisfying a Hölder-type growth condition.
Ω 1/k lower bound on ∥yk − y⋆ ∥2 for Halpern itera- Finally, we demonstrate the effectiveness of the proposed
tions in q-uniformly smooth Banach spaces. Recently, there acceleration mechanism through extensive experiments.
has been work establishing complexity lower bounds for
the more restrictive “1-SCLI” class of algorithms (Arjevani 2. Equivalence of nonexpansive operators and
et al., 2016). The class of 1-SCLI fixed-point iterations
includes the KM iteration but not Halpern. Up-to-constant
monotone operators
optimality of the KM iteration among 1-SCLI algorithms Before presenting the main content, we quickly establish
was proved with the Ω(1/k) lower bound by Diakonikolas the equivalence between the fixed-point problem
& Wang (2021).
find y = 𝕋y
There also has been recent work on lower bounds for the y∈Rn
general class of algorithms (not just 1-SCLI) for fixed-point
problems. Contreras & Cominetti (2021) established a and the monotone inclusion
Ω(1/k 2 ) lower bound on the fixed-point residual for the find 0 ∈ 𝔸x,
general Mann iteration, which includes the KM and Halpern x∈Rn
Theorem 23.8) (Bauschke et al., 2012; Combettes, 2018). When 𝔸 is strongly monotone (µ > 0), (OS-PPM) exhibits
This lemma generalizes the equivalence to γ ≥ 1 and µ ≥ 0. an accelerated O(e−4µN )-rate compared to the O(e−2µN )-
As we see in Appendix A of the appendix, the equivalence is rate of the proximal point method (PPM) (Rockafellar, 1976;
straightforwardly established using the scaled relative graph Bauschke & Combettes, 2017). When 𝕋 is contractive
(SRG) (Ryu et al., 2021), but we also provide a classical (γ < 1), both (OC-Halpern) and the Picard iteration ex-
proof based on inequalities without using the SRG. hibit O(γ −2N )-rates on the squared fixed-point residual. In
1 fact, the Picard iteration with the 𝕋 of Lemma 2.1 instead of
Remark. Since 𝕀 − 𝕋 = (1 + 1+2µ )(𝕀 − 𝕁𝔸 ), finding an
𝕁𝔸 is faster than the regular PPM and achieves a O(e−4µN )
algorithm that effectively reduces ∥yN −1 − 𝕋yN −1 ∥2 for rate. (OC-Halpern) is exactly optimal and is faster than Pi-
fixed-point problem is equivalent to finding an algorithm card in higher order terms hidden in the big-O notation. To
˜ N ∥2 for monotone inclusions.
that effectively reduces ∥𝔸x clarify, the O considers the regime µ → 0.
When 𝔸 is not strongly monotone (µ = 0) or 𝕋 is not
3. Exact optimal methods contractive (γ = 1), (OS-PPM) and (OC-Halpern) respec-
We now present our methods and their accelerated rates. tively reduces to accelerated PPM (APPM) of Kim (2021)
and Halpern iteration of Lieder (2021), sharing the same
For a 1/γ-contractive operator 𝕋 : Rn → Rn , the Optimal O(1/N 2 )-rate. In this paper, we refer to the method of
Contractive Halpern (OC-Halpern) is Lieder (2021) as the optimized Halpern method (OHM).
1 1
yk = 1 − 𝕋yk−1 + y0 (OC-Halpern) The discovery of (OC-Halpern) and (OS-PPM) was assisted
φk φk by the performance estimation problem (Drori & Teboulle,
Pk
for k = 1, 2, . . . , where φk = i=0 γ 2i and y0 ∈ Rn is a 2014; Kim & Fessler, 2016b; Taylor et al., 2017; Drori &
starting point. For a maximal µ-strongly monotone operator Taylor, 2020; Ryu et al., 2020; Kim & Fessler, 2021; Park
𝔸 : Rn ⇒ Rn , the Optimal Strongly-monotone Proximal & Ryu, 2021) The details are discussed in Section C of the
Point Method (OS-PPM) is appendix.
xk = 𝕁𝔸 yk−1 (OS-PPM)
3.1. Proof outline of Theorem 3.2
φk−1 − 1 2µφk−1
yk = xk + (xk − xk−1 ) − (yk−1 − xk ) Here, we quickly outline the proof of Theorem 3.2 while
φk φk
(1 + 2µ)φk−2 deferring the full proof to Section B of the appendix.
+ (yk−2 − xk−1 )
φk Define the Lyapunov function
Pk
for k = 1, 2, . . . , where φk = i=0 (1 + 2µ)2i , φ−1 = 0, " !2
k−1
and x0 = y0 = y−1 ∈ Rn is a starting point. These two k −k
X
n ˜ k ∥2
V = (1 + γ ) γ ∥𝔸x
methods are equivalent.
n=0
Lemma 3.1. Suppose γ = 1 + 2µ. Let 𝔸 = k−1
!
−1 X
𝕋 + γ𝕀1 1
1 + γ − 𝕀 given 𝕋, or equivalently let +2 γn ˜ k − µ(xk − x⋆ ), xk − x⋆ ⟩
⟨𝔸x
n=0
1 1
𝕋 = 1 + 1+2µ 𝕁𝔸 − 1+2µ 𝕀 given 𝔸. Then the yk -iterates
k−1 !
X
2 #
of (OC-Halpern) and (OS-PPM) are identical provided they +γ −k
γ n ˜ k
𝔸xk − γ (xk − x⋆ ) + (xk − y0 )
start from the same initial point y0 = ỹ0 .
n=0
γ
and we conclude Lemma 4.3. 𝕋 is γ1 -contractive if and only if 𝔾 = 1+γ (𝕀 −
1
!2 𝕋) is 1+γ -averaged.
˜ N ∥2 ≤ 1
∥𝔸x PN −1 ∥y0 − x⋆ ∥2 .
k=0 γk By Lemma 4.3, finding the worst-case γ1 -contractive opera-
1
tor 𝕋 is equivalent to finding the worst-case 1+γ -averaged
4. Complexity lower bound operator 𝔾, which we define in the following lemma.
We now establish exact optimality of (OC-Halpern) and Lemma 4.4. Let R > 0. Define ℕ, 𝔾 : RN +1 → RN +1 as
(OS-PPM) through matching complexity lower bound. By
ℕ(x1 , x2 , . . . , xN , xN +1 ) = (xN +1 , −x1 , −x2 , . . . , −xN )
exact, we mean that the lower bound is exactly equal to
upper bounds of Theorem 3.2 and Corollary 3.3. 1 + γ N +1
−p Re1
Theorem 4.1. For n ≥ N +1 and any initial point y0 ∈ Rn , 1 + γ 2 + · · · + γ 2N
there exists an 1/γ-Lipschitz operator 𝕋 : Rn → Rn with a
and
fixed point y⋆ ∈ Fix 𝕋 such that 1 γ
𝔾= ℕ+ 𝕀.
2 !2 1+γ 1+γ
2 1 1
∥yN − 𝕋yN ∥ ≥ 1 + PN ∥y0 − y⋆ ∥2 That is,
γ k=0 γ
k
γ 0 ··· 0 1
for any iterates {yk }N
k=0 satisfying −1 γ ··· 0 0
1
.. .. .. ..
.. x
yk ∈ y0 + span{y0 − 𝕋y0 , y1 − 𝕋y1 , . . . , yk−1 − 𝕋yk−1 } 𝔾x = .
1+γ . . . .
0 0 ··· γ 0
for k = 1, . . . , N . 0 0 ··· −1 γ
| {z }
The following corollary translates Theorem 4.1 to an equiv- =:H
alent complexity lower bound for proximal point methods 1 1 + γ N +1
− Re1 .
in monotone inclusions.
p
1 + γ 1 + γ 2 + · · · + γ 2N
Corollary 4.2. For n ≥ N and any initial point x0 = y0 ∈
| {z }
=:b
Rn , there exists a maximal µ-strongly monotone operator
1
𝔸 : Rn ⇒ Rn with a zero x⋆ ∈ Zer 𝔸 such that Then ℕ is nonexpansive, and 𝔾 is 1+γ -averaged.
!2
˜ 2 1 Following lemma states the property of iterations {yk }N
∥𝔸xN ∥ ≥ PN −1 ∥y0 − x⋆ ∥2 k=0
(1 + 2µ) k with respect to 𝔾, that proper span condition results in grad-
k=0
ually expanding support of yk .
for any iterates {xk }k=0,1,... and {yk }k=0,1,... satisfying Lemma 4.5. Let 𝔾 : RN +1 → RN +1 be defined as in
xk = 𝕁𝔸 yk−1 Lemma 4.4. For any {yk }N
k=0 with y0 = 0 satisfying
˜ 1 , 𝔸x
yk ∈ y0 + span{𝔸x ˜ 2 , . . . , 𝔸x
˜ k} yk ∈ y0 + span{𝔾y0 , 𝔾y1 , . . . , 𝔾yk−1 }, k = 1, . . . , N,
˜ k = yk−1 − xk .
for k = 1, . . . , N , where 𝔸x we have
we can use Gaussian elimination on H to obtain an upper 5. Acceleration under Hölder-type growth
triangular matrix with nonzero diagonals. This makes 𝔾 an condition
invertible affine operator with the unique zero
While (OS-PPM) provides an accelerated rate when the un-
R ⊺
γ N −1 · · · γ 1 .
N
y⋆ = p γ derlying operator is monotone or strongly monotone, many
1 + γ 2 + · · · + γ 2N operators encountered in practice have a structure lying
So Fix 𝕋 = Zer 𝔾 = {y⋆ } and ∥y0 − y⋆ ∥ = ∥y⋆ ∥ = R. between these two assumptions. For (OC-Halpern), this
corresponds to a fixed-point operator that is not strictly con-
Let the iterates {yk }N
k=0 satisfy the span condition of Theo- tractive but has structure stronger than nonexpansiveness.
rem 4.1, which is equivalent to In this section, we accelerate the proximal point method
yk ∈ y0 + span{𝔾y0 , 𝔾y1 , . . . , 𝔾yk−1 }, k = 1, . . . , N. when the underlying operator is uniformly monotone, an
assumption weaker than strong monotonicity but stronger
By Lemma 4.5, yN ∈ span{e1 , . . . , eN }. Therefoere
than monotonicity.
𝔾yN = HyN − b ∈ span{He1 , . . . , HeN } − b.
We say an operator 𝔸 : Rn ⇒ Rn is uniformly monotone
and
2 with parameters µ > 0 and α > 1 if it is monotone and
2
∥𝔾yN ∥ ≥
Pspan{He1 ,...,HeN }⊥ (b)
,
⟨𝔸x, x − x⋆ ⟩ ≥ µ∥x − x⋆ ∥α+1
where PV is the orthogonal projection onto the subspace V .
⊥
As span{He1 , . . . , HeN } = span{v} with for any x ∈ Rn and x⋆ ∈ Zer 𝔸. This is a special case
v = 1 γ · · · γ N −1 γ N ,
⊺ of uniform monotonicity in Bauschke & Combettes (2017,
Definition 22.1). We also refer to this as a Hölder-type
we get growth condition, as it resembles the Hölderian error bound
!2 condition with function-value suboptimality replaced by
2
⟨b, v⟩
2(∗)
2 1 ⟨𝔸x, x − x⋆ ⟩ (Lojasiewicz, 1963; Bolte et al., 2017).
R2 ,
∥𝔾yN ∥ ≥ Pspan{v} (b) =
v
= PN
⟨v, v⟩
k=0 γ
k
The following theorem establishes a convergence rate of the
where (∗) is established in the Section D of the appendix. (unaccelerated) proximal point method. This rate serves as
Finally, a baseline to improve upon with acceleration.
2
Theorem 5.1. Let 𝔸 : Rn ⇒ Rn be uniformly monotone
2
1
∥yN − 𝕋yN ∥ =
1 +
𝔾yN
γ
with parameters µ > 0 and α > 1. Let x⋆ ∈ Zer 𝔸.
2 !2 Then the iterates {xk }N
k=0 generated by the proximal point
1 1 method xk+1 = 𝕁𝔸 xk starting from x0 ∈ Rn satisfy
≥ 1+ PN R2
γ k=0 γ k
α 2
α−1
!2 α+3 α−1 −2
2 2
1
2
1 2 α−1 max µ , ∥x0 − x⋆ ∥
= 1+ PN ∥y0 − y⋆ ∥2 . ˜ N ∥2 ≤
∥𝔸x
γ γ k α+1
k=0 N α−1
4.3. Generalized complexity lower bound result ˜ N = xN −1 − xN .
for N ∈ N where 𝔸x
In order to extend the lower bound results of Theorem 4.1 We now present an accelerated method based on (OS-PPM)
and Corollary 4.2 to general deterministic fixed-point itera- and restarting (Nesterov, 2013; Roulet & d’Aspremont,
tions and proximal point methods (which do not necessarily 2020). Given a uniformly monotone operator 𝔸 : Rn ⇒ Rn
satisfy the span condition), we use the resisting oracle tech- with µ > 0 and α > 1, x⋆ ∈ Zer 𝔸, and an initial point
nique of Nemirovski & Yudin (1983). Here, we quickly x0 ∈ Rn , Restarted OS-PPM is:
state the result while deferring the proofs to the Section D
of the appendix. x̃0 = 𝕁𝔸 x0 (OS-PPMres
0 )
Theorem 4.6. Let n ≥ 2N for N ∈ N. For any determin- x̃k ← OS-PPM0 (x̃k−1 , tk ), k = 1, . . . , R,
istic fixed-point iteration A and any initial point y0 ∈ Rn ,
there exists a γ1 -Lipschitz operator 𝕋 : Rn → Rn with a where OS-PPM0 (x̃k−1 , tk ) is the execution of tk iterations
fixed point y⋆ ∈ Fix 𝕋 such that of (OS-PPM) with µ = 0 starting from x̃k−1 . The following
2 !2 theorem provides a restarting schedule, i.e., specified values
1 1 of t1 , . . . , tR , and an accelerated rate.
∥yN − 𝕋yN ∥2 ≥ 1 + PN ∥y0 − y⋆ ∥2
γ k=0 γ k
10-1
10-2 10-3
10-3
kxk − Txk k 2
kM̃xk k 2
10-5
10-4
10-5 Picard PPM
OHM 10-7 APPM
10-6 OC-Halpern OS-PPM
100 101 102 100 101 102
Iteration count Iteration count
Figure 1. Fixed-point and resolvent residuals versus iteration count for the 2D toy example of Section 6.1. Here, γ = 1/0.95 = 1.0526,
µ = 0.035, θ = 15◦ and N = 101. Indeed, (OC-Halpern) and (OS-PPM) exhibit the fastest rates.
0.8 0.6
0.6
0.4
0.4
0.2 0.2
0.0 0.0
0.2
0.2
0.4 Picard PPM
OHM APPM
0.6 OC-Halpern 0.4 OS-PPM
1.00 0.75 0.50 0.25 0.00 0.25 0.50 0.75 1.00 0.4 0.2 0.0 0.2 0.4 0.6 0.8 1.0
Figure 2. Trajectories of iterates for the 2D toy example of Section 6.1. Here, γ = 1/0.95 = 1.0526, µ = 0.035, θ = 15◦ and N = 101.
A marker is placed at every iterate. Picard and PPM are slowed down by the cyclic behavior. Halpern and APPM dampens the cycling
behavior, but does so too aggressively. The fastest rate is achieved by (OC-Halpern) and (OS-PPM), which appears to be due to the
cycling behavior being optimally dampened.
R R+1 The proofs of Theorems 5.1 and 5.2 are presented in Sec-
tion E of the appendix. When the values of α, µ, and
X X
βk βk
⌈λe ⌉ ≤ N − 1 < ⌈λe ⌉,
k=1 k=1
∥x0 − x⋆ ∥2 are unknown, as in the case in most practical se-
tups, one can use a grid search
as2αin Roulet & d’Aspremont
and let tk be defined as (2020) and retain the O N − α−1 (log N )2 -rate. Using
Lemma 4.3, (OS-PPMres 0 ) can be translated into a restarted
( OC-Halpern method. The experiments of Section 6 indi-
λeβk for k = 1, . . . , R − 1 cate that (OS-PPMres
0 ) does provide an acceleration in cases
tk = PR−1
N − 1 − k=1 tk . for k = R. where (OS-PPM) by itself does not.
Exact Optimal Accelerated Complexity for Fixed-Point Iterations
kxk − Txk k 2F
kxk − Txk k 2M
103 10-9 10-1
102 10-11 PG-EXTRA
PDHG OHM
101 OHM 10-2 Restarted OC-Halpern
Restarted OC-Halpern 10-13 OC-Halpern
100 101 102 103 0 20000 40000 60000 80000 100000 100 101 102
Iteration count Iteration count Iteration count
(a) CT imaging (b) Earth mover’s distance (c) Decentralized compressed sensing
Figure 3. Reduction of fixed-point residuals. The norms are the norms under which the fixed-point operator 𝕋 is nonexpansive.
Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower Dong, Q., Yuan, H., Cho, Y., and Rassias, T. M. Modified
bounds for finding stationary points I. Mathematical inertial Mann algorithm and inertial CQ-algorithm for
Programming, 184:71–120, 2020. nonexpansive mappings. Optimization Letters, 12(1):
87–102, 2018.
Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. Lower
bounds for finding stationary points II: first-order meth- Douglas, J. and Rachford, H. H. On the numerical solu-
ods. Mathematical Programming, 185:315–355, 2021. tion of heat conduction problems in two and three space
variables. Transactions of the American Mathematical
Cauchy, A.-L. Méthode générale pour la résolution des Society, 82(2):421–439, 1956.
systemes d’équations simultanées. Comptes rendus de
l’Académie des Sciences, 25:536–538, 1847. Drori, Y. The exact information-based complexity of smooth
convex minimization. Journal of Complexity, 39:1–16,
Chambolle, A. and Pock, T. A first-order primal-dual algo- 2017.
rithm for convex problems with applications to imaging.
Journal of Mathematical Imaging and Vision, 40(1):120– Drori, Y. and Shamir, O. The complexity of finding sta-
145, 2011. tionary points with stochastic gradient descent. ICML,
2020.
Colao, V. and Marino, G. On the rate of convergence of
Halpern iterations. Journal of Nonlinear and Convex Drori, Y. and Taylor, A. On the oracle complexity of smooth
Analysis, 22(12):2639–2646, 2021. strongly convex minimization. Journal of Complexity, 68,
2022.
Combettes, P. L. Monotone operator theory in convex opti-
mization. Mathematical Programming, 170(1):177–206, Drori, Y. and Taylor, A. B. Efficient first-order methods for
2018. convex minimization: a constructive approach. Mathe-
matical Programming, 184(1–2):183–220, 2020.
Cominetti, R., Soto, J. A., and Vaisman, J. On the rate of
convergence of Krasnosel’skiı̆-Mann iterations and their Drori, Y. and Teboulle, M. Performance of first-order meth-
connection with sums of Bernoullis. Israel Journal of ods for smooth convex minimization: a novel approach.
Mathematics, 199(2):757–772, 2014. Mathematical Programming, 145(1–2):451–482, 2014.
Condat, L. A primal–dual splitting method for convex opti- Drori, Y. and Teboulle, M. An optimal variant of Kelley’s
mization involving Lipschitzian, proximable and linear cutting-plane method. Mathematical Programming, 160
composite terms. Journal of Optimization Theory and (1–2):321–351, 2016.
Applications, 158(2):460–479, 2013.
Esser, E., Zhang, X., and Chan, T. F. A general framework
Contreras, J. P. and Cominetti, R. Optimal error bounds for a class of first order primal-dual algorithms for con-
for nonexpansive fixed-point iterations in normed spaces. vex optimization in imaging science. SIAM Journal on
arXiv preprint arXiv:2108.10969, 2021. Imaging Sciences, 3(4):1015–1046, 2010.
Davis, D. and Yin, W. Convergence rate analysis of several Fercoq, O. and Qu, Z. Adaptive restart of accelerated gradi-
splitting schemes. In Glowinski, R., Osher, S. J., and Yin, ent methods under local quadratic growth condition. IMA
W. (eds.), Splitting Methods in Communication, Imaging, Journal of Numerical Analysis, 39(4):2069–2095, 2019.
Science, and Engineering, pp. 115–163. Springer, 2016.
Gabay, D. and Mercier, B. A dual algorithm for the solu-
Davis, D. and Yin, W. A three-operator splitting scheme and tion of nonlinear variational problems via finite element
its optimization applications. Set-Valued and Variational approximation. Computers & Mathematics with Applica-
Analysis, 25(4):829–858, 2017. tions, 2(1):17–40, 1976.
Diakonikolas, J. Halpern iteration for near-optimal and Gu, G. and Yang, J. Tight sublinear convergence rate of
parameter-free monotone inclusion and strong solutions the proximal point algorithm for maximal monotone in-
to variational inequalities. COLT, 2020. clusion problems. SIAM Journal on Optimization, 30(3):
1905–1921, 2020.
Diakonikolas, J. and Wang, P. Potential function-based
framework for making the gradients small in convex and Güler, O. New proximal point algorithms for convex mini-
min-max optimization. arXiv preprint arXiv:2101.12101, mization. SIAM Journal on Optimization, 2(4):649–664,
2021. 1992.
Exact Optimal Accelerated Complexity for Fixed-Point Iterations
Halpern, B. Fixed points of nonexpanding maps. Bulletin Li, W., Ryu, E. K., Osher, S., Yin, W., and Gangbo, W. A
of the American Mathematical Society, 73(6):957–961, parallel method for earth mover’s distance. Journal of
1967. Scientific Computing, 75(1):182–197, 2018.
He, B. and Yuan, X. Convergence analysis of primal-dual Liang, J., Fadili, J., and Peyré, G. Convergence rates with
algorithms for a saddle-point problem: from contraction inexact non-expansive operators. Mathematical Program-
perspective. SIAM Journal on Imaging Sciences, 5(1): ming, 159(1–2):403–434, 2016.
119–149, 2012.
Lieder, F. On the convergence rate of the Halpern-iteration.
Hestenes, M. R. Multiplier and gradient methods. Journal Optimization Letters, 15(2):405–418, 2021.
of Optimization Theory and Applications, 4(5):303–320,
Lin, Q. and Xiao, L. An adaptive accelerated proximal
1969.
gradient method and its homotopy continuation for sparse
Ishikawa, S. Fixed points and iteration of a nonexpansive optimization. ICML, 2014.
mapping in a Banach space. Proceedings of the American
Lin, Y. and Xu, Y. Convergence rate analysis for fixed-
Mathematical Society, 59(1):65–71, 1976.
point iterations of generalized averaged nonexpansive
Ito, M. and Fukuda, M. Nearly optimal first-order methods operators. arXiv preprint arXiv:2108.06714, 2021.
for convex optimization under gradient norm measure: Lions, P.-L. and Mercier, B. Splitting algorithms for the sum
An adaptive regularization approach. Journal of Optimiza- of two nonlinear operators. SIAM Journal on Numerical
tion Theory and Applications, 188(3):770–804, 2021. Analysis, 16(6):964–979, 1979.
Kim, D. Accelerated proximal point method for maximally Lojasiewicz, S. Une propriété topologique des sous-
monotone operators. Mathematical Programming, 190 ensembles analytiques réels. Les Équations aux Dérivées
(1–2):57–87, 2021. Partielles, 117:87–89, 1963.
Kim, D. and Fessler, J. A. Optimized first-order methods for Maingé, P.-E. Convergence theorems for inertial KM-type
smooth convex minimization. Mathematical Program- algorithms. Journal of Computational and Applied Math-
ming, 159(1–2):81–107, 2016a. ematics, 219(1):223–236, 2008.
Kim, D. and Fessler, J. A. Optimized first-order methods Maingé, P.-E. Accelerated proximal algorithms with a cor-
for smooth convex minimization. Mathematical program- rection term for monotone inclusions. Applied Mathemat-
ming, 159(1):81–107, 2016b. ics & Optimization, 84(2):2027–2061, 2021.
Kim, D. and Fessler, J. A. Adaptive restart of the optimized Mann, W. R. Mean value methods in iteration. Proceedings
gradient method for convex optimization. Journal of of the American Mathematical Society, 4(3):506–510,
Optimization Theory and Applications, 178(1):240–263, 1953.
2018.
Martinet, B. Régularisation d’inéquations variationnelles
Kim, D. and Fessler, J. A. Optimizing the efficiency of par approximations successives. Revue Française de
first-order methods for decreasing the gradient of smooth Informatique et Recherche Opérationnelle, 4(R3):154–
convex functions. Journal of Optimization Theory and 158, 1970.
Applications, 188(1):192–219, 2021.
Martinet, B. Algorithmes pour la résolution de problèmes
Kohlenbach, U. On quantitative versions of theorems due d’optimisation et de minimax. PhD thesis, Université
to F. E. Browder and R. Wittmann. Advances in Mathe- Joseph-Fourier-Grenoble I, 1972.
matics, 226(3):2764–2795, 2011.
Matsushita, S.-Y. On the convergence rate of the Kras-
Krasnosel’skiı̆, M. A. Two remarks on the method of succes- nosel’skiı̆–Mann iteration. Bulletin of the Australian
sive approximations. Uspekhi Matematicheskikh Nauk, Mathematical Society, 96(1):162–170, 2017.
10:123–127, 1955.
Nemirovski, A. S. Information-based complexity of linear
Lee, J., Park, C., and Ryu, E. K. A geometric structure of operator equations. Journal of Complexity, 8(2):153–175,
acceleration and its role in making gradients small fast. 1992.
NeurIPS, 2021.
Nemirovski, A. S. and Nesterov, Y. E. Optimal methods
Leustean, L. Rates of asymptotic regularity for Halpern iter- of smooth convex minimization. USSR Computational
ations of nonexpansive mappings. Journal of Universal Mathematics and Mathematical Physics, 25(3–4):21–30,
Computer Science, 13(11):1680–1691, 2007. 1985.
Exact Optimal Accelerated Complexity for Fixed-Point Iterations
Nemirovski, A. S. and Yudin, D. B. Problem Complexity and Roulet, V. and d’Aspremont, A. Sharpness, restart, and
Method Efficiency in Optimization. Wiley-Interscience, acceleration. SIAM Journal on Optimization, 30(1):262–
1983. 289, 2020.
Nesterov, Y. Introductory Lectures on Convex Optimization: Ryu, E. and Yin, W. Large-scale Convex Optimization via
A Basic Course. Springer, 2004. Monotone Operators. Draft, 2020.
Nesterov, Y. Gradient methods for minimizing composite Ryu, E. K., Taylor, A. B., Bergeling, C., and Giselsson,
functions. Mathematical Programming, 140(1):125–161, P. Operator splitting performance estimation: Tight con-
2013. traction factors and optimal parameter selection. SIAM
Journal on Optimization, 30(3):2251–2271, 2020.
Nesterov, Y. E. A method for solving the convex program-
ming problem with convergence rate O(1/k 2 ). Doklady Ryu, E. K., Hannah, R., and Yin, W. Scaled relative
Akademii Nauk SSSR, 269:543–547, 1983. graphs: Nonexpansive operators via 2d Euclidean ge-
O’Donoghue, B. and Candes, E. Adaptive restart for accel- ometry. Mathematical Programming, 2021.
erated gradient schemes. Foundations of Computational
Sabach, S. and Shtern, S. A first order method for solving
Mathematics, 15(3):715–732, 2015.
convex bilevel optimization problems. SIAM Journal on
Park, C. and Ryu, E. K. Optimal first-order algorithms as a Optimization, 27(2):640–660, 2017.
function of inequalities. arXiv preprint arXiv:2110.11035,
Salim, A., Condat, L., Kovalev, D., and Richtárik, P. An op-
2021.
timal algorithm for strongly convex minimization under
√
Park, C., Park, J., and Ryu, E. K. Factor- 2 accelera- affine constraints. AISTATS, 2022.
tion of accelerated gradient methods. arXiv preprint
arXiv:2102.07366, 2021. Scieur, D., d’Aspremont, A., and Bach, F. Regularized
nonlinear acceleration. Mathematical Programming, 179
Passty, G. B. Ergodic convergence to a zero of the sum of (1–2):47–83, 2020.
monotone operators in Hilbert space. Journal of Mathe-
matical Analysis and Applications, 72(2):383–390, 1979. Shehu, Y. Convergence rate analysis of inertial
Krasnoselskii–Mann type iteration with applications. Nu-
Peaceman, D. W. and Rachford, Jr, H. H. The numerical merical Functional Analysis and Optimization, 39(10):
solution of parabolic and elliptic differential equations. 1077–1091, 2018.
Journal of the Society for Industrial and Applied Mathe-
matics, 3(1):28–41, 1955. Shi, W., Ling, Q., Wu, G., and Yin, W. A proximal gradi-
ent algorithm for decentralized composite optimization.
Pock, T., Cremers, D., Bischof, H., and Chambolle, A. An IEEE Transactions on Signal Processing, 63(22):6013–
algorithm for minimizing the Mumford-Shah functional. 6023, 2015.
ICCV, 2009.
Taylor, A. and Drori, Y. An optimal gradient method for
Powell, M. J. A method for nonlinear constraints in mini- smooth (possibly strongly) convex minimization. arXiv
mization problems. Optimization, pp. 283–298, 1969. preprint arXiv:2101.09741, 2021.
Reich, S., Thong, D. V., Cholamjiak, P., and Van Long,
Taylor, A. B., Hendrickx, J. M., and Glineur, F. Smooth
L. Inertial projection-type methods for solving pseu-
strongly convex interpolation and exact worst-case per-
domonotone variational inequality problems in Hilbert
formance of first-order methods. Mathematical Program-
space. Numerical Algorithms, 88(2):813–835, 2021.
ming, 161(1–2):307–345, 2017.
Rhoades, B. Some fixed point iteration procedures. In-
Taylor, A. B., Hendrickx, J. M., and Glineur, F. Exact
ternational Journal of Mathematics and Mathematical
worst-case convergence rates of the proximal gradient
Sciences, 14(1):1–16, 1991.
method for composite convex minimization. Journal of
Rhoades, B. and Saliga, L. Some fixed point iteration pro- Optimization Theory and Applications, 178(2):455–476,
cedures. II. Nonlinear Analysis Forum, 6(1):193–217, 2018.
2001.
Van Scoy, B., Freeman, R. A., and Lynch, K. M. The
Rockafellar, R. T. Monotone operators and the proximal fastest known globally convergent first-order method for
point algorithm. SIAM Journal on Control and Optimiza- minimizing strongly convex functions. IEEE Control
tion, 14(5):877–898, 1976. Systems Letters, 2(1):49–54, 2018.
Exact Optimal Accelerated Complexity for Fixed-Point Iterations
−1
1 1
𝔸= 𝕋+ 𝕀 1+ − 𝕀.
γ γ
Likewise,
1
𝕋(y + v) = y − v.
γ
u = 𝕋x
1 1
⇐⇒ u = 1 + 𝕁𝔸 x − x
1 + 2µ 1 + 2µ
1 1
⇐⇒ x+u= 1+ 𝕁𝔸 x
1 + 2µ 1 + 2µ
1 + 2µ 1 1 1 + 2µ
⇐⇒ x+u = x+ u = 𝕁𝔸 x
2 + 2µ 1 + 2µ 2 + 2µ 2 + 2µ
1 1 + 2µ
⇐⇒ x ∈ (𝕀 + 𝔸) x+ u
2 + 2µ 2 + 2µ
1 + 2µ 1 1 + 2µ
⇐⇒ (x − u) ∈ 𝔸 x+ u .
2 + 2µ 2 + 2µ 2 + 2µ
Likewise,
1 + 2µ 1 1 + 2µ
(y − v) ∈ 𝔸 y+ v .
2 + 2µ 2 + 2µ 2 + 2µ
From the µ-strong monotonicity of 𝔸,
1 1 + 2µ 1 1 + 2µ 1 1 + 2µ 1 1 + 2µ
𝔸 x+ u −𝔸 y+ v , x+ u − y+ v
2 + 2µ 2 + 2µ 2 + 2µ 2 + 2µ 2 + 2µ 2 + 2µ 2 + 2µ 2 + 2µ
2
1 1 + 2µ 1 1 + 2µ
≥ µ
2 + 2µ x + u − y + v
2 + 2µ 2 + 2µ 2 + 2µ
1 + 2µ 1 + 2µ 1 1 + 2µ 1 1 + 2µ
=⇒ (x − u) − (y − v), x+ u − y+ v
2 + 2µ 2 + 2µ 2 + 2µ 2 + 2µ 2 + 2µ 2 + 2µ
2
1 1 + 2µ 1 1 + 2µ
≥ µ
2 + 2µ x + 2 + 2µ u − 2 + 2µ y + 2 + 2µ v
1 + 2µ 1 + 2µ 1 1 + 2µ
⇐⇒ (x − y) − (u − v), (x − y) + (u − v)
2 + 2µ 2 + 2µ 2 + 2µ 2 + 2µ
2
1 1 + 2µ
≥ µ
2 + 2µ (x − y) + 2 + 2µ (u − v)
Proof of Lemma 2.1 with scaled relative graph. In this proof, we use the notations of Ryu et al. (2021) for the operator
classes, which we list below. Consider a class of operators Mµ of µ-strongly monotone operators and L1/γ of γ1 -contractions.
As Mµ , L1/γ are SRG-full classes, which means that the inclusion of the SRG of some operator to the SRG of an operator
class is equivalent to membership of that operator to the given operator class (Ryu et al., 2021, Section 3.3). Instead of
showing that the operators satisfy the equivalent inequality condition to the membership, we show the membership in terms
of the SRGs.
µ 1+µ 1 1
− 1+2µ 1
1+µ 1+2µ
F is a composition of only scalar addition/subtraction/multiplication and inversion, therefore preserves the SRG of Mµ and
L1/γ . SRG of F (Mµ ) and L1/γ match, and the SRG of F −1 (L1/γ ) and Mµ match.
xk = 𝕁𝔸 yk−1
1 1 1 1
yk = 1 − 1+ xk − yk−1 + y0
φk γ γ φk
where γ = 1 + 2µ.
Proof. It suffices to show the equivalence of yk -iterates. For k = 1, from (OS-PPM) update,
φ0 − 1 (γ − 1)φ0
y1 = x1 + (x1 − y0 ) − (y0 − x1 )
φ1 φ1
γ−1
= x1 − 2 (y0 − x1 ) (φ1 = 1 + γ 2 )
γ +1
1 1 1 1
= 1− 1+ x1 − y0 + y0 .
φ1 γ γ φ1
Exact Optimal Accelerated Complexity for Fixed-Point Iterations
Assume that the equivalence of the iterates holds for k = 1, 2, . . . , l. From the (OS-PPM) update,
φl − 1 (γ − 1)φl γφl−1
yl+1 = xl+1 + (xl+1 − xl ) − (yl − xl+1 ) + (yl−1 − xl )
φl+1 φl+1 φl+1
φl − 1 (γ − 1)φl φl − 1 γφl−1 (γ − 1)φl γφl−1
= 1+ + xl+1 − + xl − yl + yl−1
φl+1 φl+1 φl+1 φl+1 φl+1 φl+1
φl φl−1 (γ − 1)φl γφl−1
= γ(γ + 1) xl+1 − γ(γ + 1) xl − yl + yl−1 .
φl+1 φl+1 φl+1 φl+1
φl φl−1 (γ − 1)φl 1
yl+1 = γ(γ + 1) xl+1 − γ(γ + 1) xl − yl + {γ(γ + 1)φl−1 xl − φl yl + y0 }
φl+1 φl+1 φl+1 φl+1
φl φl 1
= γ(γ + 1) xl+1 − γ yl + y0
φl+1 φl+1 φl+1
γ 2 φl
1 1 1
= 1+ xl+1 − yl + y0
φl+1 γ γ φl+1
1 1 1 1
= 1− 1+ xl+1 − yl + y0 .
φl+1 γ γ φl+1
Proof of Lemma 3.1. Start from the same initial iterate y0 = ỹ0 . Suppose yk = ỹk for some k ≥ 0. Then,
1 1 1 1
yk+1 = 1 − 1+ xk+1 − yk + y0 (Lemma B.1)
φk+1 1 + 2µ 1 + 2µ φk+1
1 1 1 1
= 1− 1+ 𝕁𝔸 − 𝕀 yk + y0
φk+1 1 + 2µ 1 + 2µ φk+1
1 1
= 1− 𝕋yk + y0
φk+1 φk+1
1 1
= 1− 𝕋ỹk + y0 = ỹk+1 . (yk = ỹk by induction hypothesis)
φk+1 φk+1
First, we show that V k has an alternate form as below. This form is useful in proving the monotone decreasing property of
V k in k.
Lemma B.2. V k defined in (OS-PPM-Lyapunov) can be equivalently written as
Also, we have
˜ k − µ(xk − x⋆ ), xk − x⋆ ⟩
⟨𝔸x
˜ k − µ(xk − y0 ) − µ(y0 − x⋆ ), (xk − y0 ) + (y0 − x⋆ )⟩
= ⟨𝔸x
˜ k − µ(xk − y0 ), xk − y0 ⟩ + ⟨𝔸x
= ⟨𝔸x ˜ k , y0 − x⋆ ⟩ − 2µ⟨xk − y0 , y0 − x⋆ ⟩.
Then V k is expressed as
k−1
!
X
V k = 2(1 + γ −k ) γn ˜ k − µ(xk − y0 ), xk − y0 ⟩ + ⟨𝔸x
˜ k , y0 − x⋆ ⟩ − 2µ⟨xk − y0 , y0 − x⋆ ⟩
⟨ ⟨𝔸x
n=0
( k−1
!2 k−1
!
X X
+ (1 + γ −k
)γ −k
γ n ˜ k ∥2 − 2
∥𝔸x γ n ˜ k , xk − y0 ⟩ + (γ k − 1)2 ∥xk − y0 ∥2
(γ k − 1)⟨𝔸x
n=0 n=0
k−1
! )
X
−2 γn ˜ k , y0 − x⋆ ⟩ − 2γ k (γ k − 1)⟨xk − y0 , y0 − x⋆ ⟩ + γ 2k ∥y0 − x⋆ ∥2
γ k ⟨𝔸x
n=0
k−1
!2
X
+ (1 + γ −k ) γn ˜ k ∥2 + (1 − γ −k )∥y0 − x⋆ ∥2
∥𝔸x
n=0
k−1
!2 k−1
!
X X
= (1 + γ −k 2
) γ n ˜ k ∥ + 2(1 + γ
∥𝔸x 2 −k
) γ n ˜ k − µ(xk − y0 ), xk − y0 ⟩
⟨𝔸x
n=0 n=0
k−1
!
X
− 2γ −k
(1 + γ −k
)(γ − 1)k
γ n ˜ k , xk − y0 ⟩ + γ −k (1 + γ −k )(γ k − 1)2 ∥xk − y0 ∥2 + 2∥y0 − x⋆ ∥2
⟨𝔸x
n=0
k−1
!2 k−1
!
X X
= (1 + γ −k 2
) γ n ˜ k ∥2 + 2γ −k (1 + γ −k )
∥𝔸x γ n ˜ k − µ(xk − y0 ), xk − y0 ⟩ + 2∥y0 − x⋆ ∥2 .
⟨𝔸x
n=0 n=0
Exact Optimal Accelerated Complexity for Fixed-Point Iterations
As
k−1
! k−1
!
−k
X
n 1 + γk X
n 1+γ
(1 + γ ) γ = γ = φk−1 ,
n=0
γk n=0
γk
we have
V N ≤ V N −1 ≤ · · · ≤ V 1 ≤ V 0 .
˜ 1 ∥2 + 2γ −2 (1 + γ)⟨𝔸x
V 1 − V 0 = γ −2 (1 + γ)2 ∥𝔸x ˜ 1 − µ(x1 − y0 ), x1 − y0 ⟩
−2 ˜ 1 ∥ + 2⟨𝔸x
2 ˜ 1 − µ(x1 − y0 ), x1 − y0 ⟩
= γ (1 + γ) (1 + γ)∥𝔸x
= γ −2 (1 + γ) (1 + γ)∥𝔸x˜ 1 ∥2 − 2(1 + µ)∥𝔸x
˜ 1 ∥2 ˜ 1)
(x1 − y0 = −𝔸x
= 0. (1 + γ = 2(1 + µ))
˜ k+1 − 𝔸x
V k+1 − V k + 2γ −2k (1 + γ)φk φk−1 ⟨𝔸x ˜ k − µ(xk+1 − xk ), xk+1 − xk ⟩ = 0.
First,
˜ k+1 − 𝔸x
V k+1 − V k + 2γ −2k (1 + γ)φk φk−1 ⟨𝔸x ˜ k − µ(xk+1 − xk ), xk+1 − xk ⟩
˜ k+1 − µ(xk+1 − y0 ), xk+1 − xk ⟩
= V k+1 − V k + 2γ −2k (1 + γ)φk φk−1 ⟨𝔸x
˜ k − µ(xk − y0 ), xk+1 − xk ⟩
− 2γ −2k (1 + γ)φk φk−1 ⟨𝔸x
˜ k+1 − γφk−1 𝔸x
= γ −2(k+1) (1 + γ)2 ⟨φk 𝔸x ˜ k , φk 𝔸x
˜ k+1 + γφk−1 𝔸x
˜ k⟩
˜ k+1 − µ(xk+1 − y0 ), γ 2 φk−1 (xk+1 − xk ) + (xk+1 − y0 )⟩
+ 2γ −2(k+1) (1 + γ)φk ⟨𝔸x
˜ k − µ(xk − y0 ), φk (xk+1 − xk ) + (xk − y0 )⟩.
− 2γ −2k (1 + γ)φk−1 ⟨𝔸x
˜ k+1 − γφk−1 𝔸x
Letting U k = φk (xk+1 − y0 ) − γ 2 φk−1 (xk − y0 ) = −φk 𝔸x ˜ k , above formula is simplified as
˜ k+1 − 𝔸x
V k+1 − V k + 2γ −2k (1 + γ)φk φk−1 ⟨𝔸x ˜ k − µ(xk+1 − xk ), xk+1 − xk ⟩
˜ k+1 − γφk−1 𝔸x
= −γ −2(k+1) (1 + γ)2 ⟨φk 𝔸x ˜ k , Uk ⟩
˜ k+1 − µ(xk+1 − y0 ), Uk ⟩
+ 2γ −2(k+1) (1 + γ)φk ⟨𝔸x
˜ k − µ(xk − y0 ), Uk ⟩
− 2γ −2k (1 + γ)φk−1 ⟨𝔸x
˜ k+1 − γφk−1 𝔸x
= γ −2(k+1) (1 + γ) − (1 + γ)(φk 𝔸x ˜ k ) + 2φk {𝔸x
˜ k+1 − µ(xk+1 − y0 )}
2 ˜ k − µ(xk − y0 )}, Uk
− 2γ φk−1 {𝔸x
˜ k+1 + γφk−1 𝔸x
= γ −2(k+1) (1 + γ) (1 − γ)(φk 𝔸x ˜ k ) − 2µ{φk (xk+1 − y0 ) − γ 2 φk−1 (xk − y0 )}, Uk
2∥y0 − x⋆ ∥2 ≥ V N
N −1
!2 N −1
!
X X
= (1 + γ −N
) γ n ˜ N ∥2 + 2(1 + γ −N )
∥𝔸x γ n ˜ N − µ(xN − x⋆ ), xN − x⋆ ⟩
⟨𝔸x
n=0 n=0
N −1 !
2
X
−N −N
+ γ (1 + γ )
γ n ˜
𝔸xN − γ (xN − x⋆ ) + (xN − y0 )
+ (1 − γ −N )∥y0 − x⋆ ∥2
N
n=0
N −1
! 2
X
≥ (1 + γ −N ) γ n ∥𝔸x ˜ N ∥2 + (1 − γ −N )∥y0 − x⋆ ∥2 ,
n=0
N −1
!2
X
(1 + γ −N 2
)∥y0 − x⋆ ∥ ≥ (1 + γ −N
) γ n ˜ N ∥2 ,
∥𝔸x
n=0
or equivalently,
!2
˜ N ∥2 ≤ 1
∥𝔸x PN −1 ∥y0 − x⋆ ∥2 .
n=0 γn
Proof of Corollary 3.3. This immediately follows from Theorem 3.2 and Lemma 3.1 by
−1
˜ N = yN −1 − xN = 1
𝔸x 1+ (yN −1 − 𝕋yN −1 ) ∈ 𝔸xN .
γ
When discovering (OS-PPM), we used maximal monotonicity as our interpolation condition, just as in Ryu et al. (2020);
Kim (2021). We further extended this to cover the case of maximal strongly-monotone operators, in a slightly different
way with Taylor & Drori (2021) who considered strongly convex interpolation. The optimization variable is a positive
semidefinite matrix, and this is of a Gram matrix form which stores information on the iterates of algorithms. Usual choice
of basis vectors for the gram matrix in PEP is usually ∇f (x) for convex minimization setup (Kim & Fessler, 2016b; Taylor
˜ for operator setup (Kim, 2021). Here, we used x-iterates to form the gram matrix
et al., 2018; Taylor & Drori, 2021), or 𝔸x
of SDP.
This basic SDP is a primal problem of the PEP (Primal-PEP), and solving this returns an estimate to the worst-case
complexity of given algorithm. If we form a dual problem (dual-PEP) and minimize the optimal value of dual-PEP over
possible choices of stepsizes as in Kim & Fessler (2016b); Taylor et al. (2018); Kim (2021); Taylor & Drori (2021),
this provides possibly the fastest rate, and solution to this minimization problem gives possibly optimal algorithms. We
considered a class of algorithms satisfying the span assumption in Corollary 4.2, and obtained (OS-PPM).
1
∥𝕋x − 𝕋y∥2 ≤ ∥x − y∥2 ⇐⇒ ∥γ𝕋x − γ𝕋y∥2 ≤ ∥x − y∥2
γ2
⇐⇒ ∥{(1 + γ)𝔾x − γx} − {(1 + γ)𝔾y − γy}∥2 ≤ ∥x − y∥2
⇐⇒ (1 + γ)2 ∥𝔾x − 𝔾y∥2 − 2γ(1 + γ)⟨𝔾x − 𝔾y, x − y⟩ + γ 2 ∥x − y∥2 ≤ ∥x − y∥2
⇐⇒ (1 + γ)2 ∥𝔾x − 𝔾y∥2 + (γ 2 − 1)∥x − y∥2 ≤ 2γ(1 + γ)⟨𝔾x − 𝔾y, x − y⟩
γ−1 2γ
⇐⇒ ∥𝔾x − 𝔾y∥2 + ∥x − y∥2 ≤ ⟨𝔾x − 𝔾y, x − y⟩. (∵ 1 + γ > 0)
γ+1 γ+1
Proof of Lemma 4.3 with scaled relative graph. Using the notion of SRG (Ryu et al., 2021), we get the following equiva-
1
lence of SRGs. Here, N 1+γ
1 is a class of 1+γ -averaged operators. Therefore, we get the chain of equivalences
γ
SRG of 𝕋 ∈ L γ1 SRG of 1+γ (𝕀 − 𝕋)
− γ1 1
γ 1− 2
1+γ
1
SRG of 𝕀 − 1 + γ1 𝔾 SRG of 𝔾 ∈ N 1+γ
1
𝕋 ∈ L1/γ ⇐⇒ γ𝕋 ∈ L1 ⇐⇒ −γ𝕋 ∈ L1
γ 1
⇐⇒ 𝔾 = 𝕀+ (−γ𝕋) ∈ N 1+γ
1 ,
1+γ 1+γ
and conclude that 𝕋 is γ1 -Lipschitz if and only if 𝔾 is 1
1+γ -averaged.
Proof of Theorem 4.1. The proof outline of Theorem 4.1 in Section 4.2 is complete except for the part that the identity (∗)
holds, and that Theorem 4.1 holds for any initial point y0 ∈ Rn which is not necessarily zero.
First, we show that for any initial point y0 ∈ Rn , there exists an worst-case operator 𝕋 : Rn → Rn which cannot exhibit
better than the desired rate. Denote by 𝕋0 : Rn → Rn the worst-case operator constructed in the proof of Theorem 4.1 for
y0 = 0. Define 𝕋 : Rn → Rn as
𝕋y = 𝕋0 (y − y0 ) + y0
given y0 ∈ Rn . Then, first of all, the fixed point of 𝕋 is y⋆ = ỹ⋆ + y0 where ỹ⋆ is the unique solution of 𝕋0 . Also, if
{yk }N
k=0 satisfies the span condition
which is the same span condition in Theorem 4.1 with respect to 𝕋0 . This is true from the fact that
yk − 𝕋yk = yk − y0 +𝕋0 (yk − y0 ) = ỹk − 𝕋0 ỹk
| {z } | {z }
=ỹk ỹk
for k = 1, . . . , N .
Now, {ỹk }N
k=0 is a sequence starting from ỹ0 = 0 satisfying the span condition for 𝕋0 . This implies that,
2 !2
1 1
= 1+ PN ∥y0 − y⋆ ∥2 .
γ k=0 γ k
where ⊺
γ2 γN
v= 1 γ ... ,
especially the identity (∗).
⟨b, v⟩
2 2
v
= |⟨b, v⟩|
⟨v, v⟩
∥v∥2
!2
R 1 + γ N +1 1
= ×p ×
1+γ 1 + γ 2 + γ 4 + · · · + γ 2N 1 + γ 2 + γ 4 + · · · + γ 2N
2
1 + γ N +1
R
= ×
1+γ 1 + γ 2 + γ 4 + · · · + γ 2N
2
R
=
1 + γ + γ2 + · · · + γN
!2
1
= PN R2 .
k
k=0 γ
Exact Optimal Accelerated Complexity for Fixed-Point Iterations
γ−1
Proof of Corollary 4.2. According to Lemma 2.1, 𝕋 is 1/γ-contractive if and only if 𝔸 = (𝕋 + 1/γ𝕀)−1 − 𝕀 is 2 -strongly
monotone. For any y ∈ Rn , if x = 𝕁𝔸 y, then
1 1 1 1 ˜
y − 𝕋y = y − 1+ 𝕁𝔸 y − y = 1 + (y − x) = 1 + 𝔸x.
γ γ γ γ
if and only if
xk = 𝕁𝔸 yk−1
˜ 1 , . . . , 𝔸x
yk ∈ y0 + span{𝔸x ˜ k }, k = 1, . . . , N
where xk = 𝕁𝔸 yk−1 . Span conditions in the statements of Theorem 4.1 and Corollary 4.2 are equivalent under the
transformation 𝔸 = (𝕋 + 1/γ𝕀)−1 − 𝕀. Therefore, the lower bound result of this corollary can be derived from the lower
bound result of Theorem 4.1.
where yt is the t-th query point and ȳt is the t-th approximate solution produced by At . Here, we consider the algorithms
whose query points and approximate solutions are identical (yt = ȳt ).
Even though the A is defined to produce infinitely many yt - and ȳt -iterates, the definition includes the case where algorithm
terminates at a predetermined total iteration count N , i.e., the algorithm may have a predetermined iteration count N and
the behavior may depend on the specified value of N . In such cases, yN = ȳN = yN +1 = ȳN +1 = · · · .
Similarly, a deterministic proximal point method A is a mapping of an initial point y0 and a maximal monotone operator 𝔸
to a sequence of query points {yt }t∈N and approximate solutions {ȳt }t∈N , such that the output depends on 𝔸 only through
the resolvent residual oracle O𝔸 (y) = y − 𝕁𝔸 y = 𝔸x ˜ ∈ 𝔸x where x = 𝕁𝔸 y. Indeed, this method A yields the same
sequence of iterates given the same initial point y0 and oracle evaluations {O𝔸 (yt )}t∈N .
By the equivalency of the optimization problems and algorithms stated in Lemma 2.1 and Lemma 3.1, Theorem 4.6 also
generalizes Corollary 4.2 to general proximal point methods.
Corollary D.1 (Complexity lower bound of general proximal point methods). Let n ≥ 2N − 2 for N ∈ N. For any
deterministic proximal point method A and arbitrary initial point y0 ∈ Rn , there exists a µ-strongly monotone operator
𝔸 : Rn → Rn with a zero x⋆ ∈ Zer 𝔸 such that
2
˜ N ∥2 ≥ 1
∥𝔸x ∥y0 − x⋆ ∥2
1 + γ + · · · + γ N −1
for every t ∈ N ∪ {0}, where supp{z} := {i ∈ [d] | ⟨z, ei ⟩ = ̸ 0}. An deterministic fixed-point iteration A is called
zero-respecting if A generates a sequence {zt }t∈N∪{0} which is zero-respecting with respect to 𝕋 for any nonexpansive
𝕋 : Rd → Rd . Note that by definition, z0 = 0. And for notational simplicity, define suppV = z∈V supp{z}.
S
This property serves as an important intermediate step to the generalization of Theorem 4.1, where its similar form called
‘zero-chain’ has numerously appeared on the relevant references in convex optimization (Nesterov, 2004; Drori, 2017;
Carmon et al., 2020; Drori & Taylor, 2022). The worst-case operator found in the proof of Theorem 4.1 still performs the
best among all the zero-respecting query points with respect to 𝕋, according to the following lemma.
Lemma D.2. Let 𝕋 : RN +1 → RN +1 be the worst-case operator defined in the proof of Theorem 4.1. If the iterates {zt }N
t=0
are zero-respecting with respect to 𝕋,
2 2
2 1 1
∥zN − 𝕋zN ∥ ≥ 1+ ∥z0 − z⋆ ∥2
γ 1 + γ + · · · + γN
for z⋆ ∈ Fix 𝕋.
If k = 0, then y0 = 0 and from this, 𝔾0 ∈ span{e1 }. So the case of k = 0 holds. Now, suppose that 0 < k ≤ N
and the claim holds for all n < k. Then 𝔾zn ∈ span{e1 , . . . , en+1 } ⊆ span{e1 , . . . , ek } for 0 ≤ k < n. {zk }N
k=0 is
zero-respecting with respect to 𝕋, so
[
supp{zk } ⊆ supp{zn − 𝕋zn }
n<k
= supp 𝔾z0 , 𝔾z1 , . . . , 𝔾zk−1
⊆ supp{e1 , e2 , . . . , ek }.
Therefore, zk ∈ span{e1 , e2 , . . . , ek }, and 𝔾zk ∈ span{e1 , e2 , . . . , ek+1 }. The claim holds for k = 1, . . . , N .
According to the proof of Theorem 4.1,
2 2
1 1
∥zN − 𝕋zN ∥2 ≥ 1+ ∥z0 − z⋆ ∥2
γ 1 + γ + · · · + γN
for any zero-respecting iterates {zk }N
k=0 with respect to 𝕋.
We say that a matrix U ∈ Rm×n with m ≥ n is orthogonal, if each columns {ui }ni=1 ⊆ Rm of U as in
| ... |
U = u1 . . . un
| ... |
are orthonormal to each other, or in other words, U ⊺ U = In . It directly follows that U U ⊺ is an orthogonal projection from
Rm to the range R(U ) of U .
Lemma D.3. For any orthogonal matrix U ∈ Rm×n with m ≥ n and any arbitrary vector y0 ∈ Rm , if 𝕋 : Rn → Rn is a
1 m
γ -contractive operator with γ ≥ 1, then 𝕋U : R → Rm defined as
𝕋U (y) := U 𝕋U ⊺ (y − y0 ) + y0 , ∀y ∈ Rm
is also a γ1 -contractive operator. Furthermore, z⋆ ∈ Fix 𝕋 if and only if y⋆ = y0 + U z⋆ ∈ Fix 𝕋U .
Lemma D.4. Let A be a general deterministic fixed-point iteration, and 𝕋 : Rn → Rn be a 1/γ-contractive operator.
For m ≥ n + N − 1 and any arbitrary point y0 ∈ Rm , there exists an orthogonal matrix U ∈ Rm×n and the iterates
{yt }N
t=1 = A[y0 ; 𝕋U ] with the following properties.
(ii) {z (t) }N
t=0 satisfies
∥z (t) − 𝕋z (t) ∥ ≤ ∥yt − 𝕋U yt ∥, t = 0, . . . , N.
Proof. We first show that (i) implies (ii). From (i), we know that z (t) = U ⊺ (yt − y0 ) for t = 0, 1, . . . , N . Therefore,
Now we prove the existence of orthogonal U ∈ Rm×n with {yt }N t=0 = A[y0 ; 𝕋U ] and (i) holds. In order to show the
existence of such orthogonal matrix U as in (i), we provide the inductive scheme that finds the columns of U at each
iteration. Before describing the actual scheme, we first provide some observations useful to deriving the necessary conditions
for the columns {ui }ni=1 of U to satisfy.
Let t ∈ {1, . . . , N }, and define the set of indices St as
For {z (t) }N
t=0 to satisfy the zero-respecting property with respect to 𝕋, z
(t)
is required to satisfy
supp{z (t) } ⊆ St
yt − y0 ∈ span{ui }i∈St
or equivalently,
⟨ui , yt − y0 ⟩ = 0
(0) ⊺
for every i ∈
/ St . Note that z = U (y0 − y0 ) = 0 is trivial.
m×n
We now construct U ∈ R . Note that S0 = ∅ ⊆ S1 ⊆ · · · ⊆ St . {ui }i∈St \St−1 is chosen inductively starting from
t = 1. Suppose we have already chosen {ui }i∈St−1 . Choose {ui }i∈St \St−1 from the orthogonal complement of
Wt := span {y1 − y0 , · · · , yt−1 − y0 } ∪ {ui }i∈St−1
We now prove the complexity lower bound result for general fixed-point iterations.
Proof of Theorem 4.6. For any deterministic fixed-point iteration A and initial point y0 ∈ Rn , consider a worst-case
operator 𝕋 : RN +1 → RN +1 defined in the proof of Theorem 4.1. According to Lemma D.4, there exists an orthogonal
U ∈ Rn×(N +1) with n ≥ (N + 1) + (N − 1) = 2N such that z (k) = U ⊺ (yk − y0 ) for k = 0, . . . , N ,
where the query points {yk }Nk=0 are generated from applying A to 𝕋U given initial point y0 , and {z
(k) N
}k=0 is a zero-
respecting sequence with respect to 𝕋. According to Lemma D.2,
2 2
(N ) (N ) 2 1 1
∥z − 𝕋z ∥ ≥ 1+ ∥z (0) − z⋆ ∥2 .
γ 1 + γ + · · · + γN
where the second identity comes from orthogonality of U . We may conclude that
2 2
1 1
∥yN − 𝕋U yN ∥2 ≥ 1+ ∥y0 − y⋆ ∥2
γ 1 + γ + · · · + γN
Bk ≥ Bk+1 .
˜ k+1 where 𝔸x
Proof. Note that PPM update xk+1 = 𝕁𝔸 xk is equivalent to xk = xk+1 + 𝔸x ˜ k+1 ∈ 𝔸xk+1 .
˜ k+1 ) − x⋆ = (xk+1 − x⋆ ) + 𝔸x
xk − x⋆ = (xk+1 + 𝔸x ˜ k+1 .
Then
˜ k+1 , xk+1 − x⋆ ⟩
Ak = Ak+1 + Bk + 2⟨𝔸x
≥ Ak+1 + Bk + 2µ∥xk+1 − x⋆ ∥α+1
α+1
= Ak+1 + Bk + 2µAk+1
2
α+1
≥ Ak+1 + µ2 Aα
k+1 + 2µAk+1
2
α−1 2
≥ Ak+1 1 + µAk+12
Also, from
˜ k ∥2 − ∥𝔸x
Bk − Bk+1 = ∥𝔸x ˜ k+1 ∥2
˜ k ∥2 + ∥𝔸x
= (∥𝔸x ˜ k+1 ∥2 ) − 2∥𝔸x
˜ k+1 ∥2
˜ k , 𝔸x
≥ 2⟨𝔸x ˜ k+1 ⟩ − 2∥𝔸x
˜ k+1 ∥2 (Young’s inequality)
˜ k+1 − 𝔸x
= −2⟨𝔸x ˜ k , 𝔸x
˜ k+1 ⟩
˜ k+1 − 𝔸x
= 2⟨𝔸x ˜ k , xk+1 − xk ⟩ ≥ 0, (Monotonicity of 𝔸)
we get Bk ≥ Bk+1 .
Theorem E.2. If 𝔸 : Rn ⇒ Rn is a uniformly monotone operator with parameters µ > 0 and α > 1, there exists C > 0
such that the iterates {xk }k∈N generated by PPM exhibits the rate
C
∥xk − x⋆ ∥2 ≤ 2
k α−1
for any k ∈ N.
Proof. We use the induction on k to show the convergence rate, and find the necessary conditions for C > 0 to satisfy.
In case of k = 1, ∥x1 − x⋆ ∥2 ≤ C must be satisfied. Lemma E.1 implies the monotonicity of Ak , so C with C ≥ ∥x0 − x⋆ ∥2
is a suitable choice.
2 2
Now, suppose that Ak ≤ Ck − α−1 and k ≥ 1. We claim that Ak+1 ≤ C(k + 1)− α−1 for the same C > 0. Define
fµα : [0, ∞) → [0, ∞) as
α−1
2
fµα (t) := t 1 + µt 2 .
2
Then fµα (Ak+1 ) ≤ Ak from Lemma E.1. If fµα (Ak+1 ) ≤ fµα C(k + 1)− α−1 , since fµα is a monotonically increasing
function over [0, ∞), we are done. Define
( 1
α−1 )
1
an := (n + 1) 1+ −1 ,
n
2 α 2
for any choice of C > 0. Choosing C ≥ µ− α−1 (2 α−1 − 2) α−1 which is equivalent to
α
2 α−1 − 2
α−1 ≤ µ,
C 2
we get
! α−1 2 ! α−1 2
α 2 2
C 2 α−1 − 2 C C C
2 1+ α−1 2 ≤ 2 1+µ 2
(k + 1) α−1 C 2 (k + 1) α−1 (k + 1) α−1 (k + 1) α−1
!
C
= fµα 2 .
(k + 1) α−1
2 α 2
Gathering all the inequalities above, if C ≥ µ− α−1 (2 α−1 − 2) α−1 , then
!
C C
fµα (Ak+1 ) ≤ Ak ≤ 2 ≤ fµα 2
k α−1 (k + 1) α−1
so we get
C C
Ak ≤ 2 =⇒ Ak+1 ≤ 2
k α−1 (k + 1) α−1
for k = 1, 2, . . . .
Therefore,
α 2
α−1
2 α−1 −2
max µ , ∥x0 − x⋆ ∥2
C
∥xk − x⋆ ∥2 ≤ 2 = 2 .
k α−1 k α−1
˜ k+1 ∥2 .
We now prove the convergence rate of PPM in terms of Bk = ∥𝔸x
α 2
α−1
2 α−1 −2
where C = max µ , ∥x0 − x⋆ ∥2 . Therefore,
N C C
BN −1 ≤ 2 ≤ 2 ,
2 n α−1 N −1
α−1
2
N α−1
α+1
= O N − α−1
where the second inequality follows from 2(N − 1) ≥ N . Since this bound also holds for the case of N = 1 from
B0 ≤ ∥x0 − x⋆ ∥2 , we are done.
Roulet & d’Aspremont (2020) showed that if the objective function f of a smooth convex minimization problem satisfies a
Hölderian error bound condition
µ
∥x − x⋆ ∥r ≤ f (x) − f ⋆ , ∀ x ∈ K ⊂ Rn
r
where x⋆ ∈ K is a minimizer of f and K is a given set, then the unaccelerated base algorithm can be accelerated with a
restarting scheme. The restarting schedule uses tk iterations for each k-th outer loop recursively satisfying
f (xk ) − f ⋆ ≤ e−ηk (f (x0 ) − f ⋆ ), k = 1, 2, . . .
for some η > 0, where xk = A(xk−1 , tk ) is the output of k-th outer loop, which applies tk iterations of the base algorithm
A starting from xk−1 . If an objective function is strongly convex near the solution (r = 2), a constant restarting schedule
tk = λ provides a faster rate compared to an unaccelerated base algorithm (Nemirovski & Nesterov, 1985). If an objective
function satisfies a Hölderian error bound condition but it is not strongly convex (r > 2), then an exponentially-growing
schedule tk = λeβk for some λ > 0 and β > 0 results in a faster sublinear convergence rate.
As notable prior work, Kim (2021) studied APPM with a constant restarting schedule in the strongly monotone setup but
was not able to obtain a rate faster than plain PPM. We show that restarting with an exponentially increasing schedule
accelerates (OS-PPM) under uniform monotonicity, as for the case of r > 2 in Roulet & d’Aspremont (2020).
Proof of Theorem 5.2. Suppose that given an initial point x0 ∈ Rn , let x̃0 be an iterate generated by applying APPM on x0
only once. Then
1 1
x̃0 = (2𝕁𝔸 x0 − x0 ) + x0 = 𝕁𝔸 x0 ,
2 2
so we get
˜ 0 + (x̃0 − x⋆ )∥2
∥x0 − x⋆ ∥2 = ∥𝔸x̃
˜ 0 ∥2 + 2⟨𝔸x̃
= ∥𝔸x̃ ˜ 0 , x̃0 − x⋆ ⟩ + ∥x̃0 − x⋆ ∥2 .
Exact Optimal Accelerated Complexity for Fixed-Point Iterations
Now we describe the restarting scheme for APPM. Let t_k be the number of inner APPM iterations in the k-th outer iteration. The k-th outer iteration starts from x̃_{k−1} and outputs x̃_k after t_k iterations of APPM. Then the k-th outer iteration satisfies
$$\|\tilde{\mathbb{A}}\tilde{x}_k\|^2 \le \frac{1}{(t_k+1)^2}\|\tilde{x}_{k-1} - x^\star\|^2 \le \frac{1}{t_k^2}\|\tilde{x}_{k-1} - x^\star\|^2 \le \frac{1}{\mu^{2/\alpha}t_k^2}\|\tilde{\mathbb{A}}\tilde{x}_{k-1}\|^{2/\alpha},$$
where the last inequality follows from
$$\|\tilde{\mathbb{A}}\tilde{x}_{k-1}\|\,\|\tilde{x}_{k-1} - x^\star\| \ge \langle\tilde{\mathbb{A}}\tilde{x}_{k-1},\, \tilde{x}_{k-1} - x^\star\rangle \ge \mu\|\tilde{x}_{k-1} - x^\star\|^{\alpha+1}.$$
To find a valid restart schedule, we iteratively choose the number t_k of inner iterations of the k-th outer iteration so that
$$\|\tilde{\mathbb{A}}\tilde{x}_k\|^2 \le e^{-\eta k}\|x^0 - x^\star\|^2$$
for some η > 0. The case k = 0 holds automatically. Suppose k ≥ 1 and that t_1, . . . , t_{k−1} have already been chosen so that
$$\|\tilde{\mathbb{A}}\tilde{x}_{k-1}\|^2 \le e^{-\eta(k-1)}\|x^0 - x^\star\|^2.$$
Then
$$\|\tilde{\mathbb{A}}\tilde{x}_k\|^2 \le \frac{1}{\mu^{2/\alpha}t_k^2}\|\tilde{\mathbb{A}}\tilde{x}_{k-1}\|^{2/\alpha} \le \frac{1}{\mu^{2/\alpha}t_k^2}\,e^{-\frac{\eta(k-1)}{\alpha}}\|x^0 - x^\star\|^{\frac{2}{\alpha}},$$
so the claimed convergence rate is guaranteed if
$$\frac{1}{\mu^{2/\alpha}t_k^2}\,e^{-\frac{\eta(k-1)}{\alpha}}\|x^0 - x^\star\|^{\frac{2}{\alpha}} \le e^{-\eta k}\|x^0 - x^\star\|^2.$$
This is equivalent to
$$t_k \ge \underbrace{\mu^{-\frac{1}{\alpha}}e^{\frac{\eta}{2\alpha}}\|x^0 - x^\star\|^{\frac{1}{\alpha}-1}}_{=:\,\lambda}\,\exp\Big\{\underbrace{\frac{\eta}{2}\Big(1 - \frac{1}{\alpha}\Big)}_{=:\,\beta}\,k\Big\}.$$
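In code, evaluating this schedule is immediate once µ, α, η, and ∥x0 − x⋆∥ are given (treated here as hypothetical inputs):

```python
import math

def restart_schedule(k, mu, alpha, eta, dist0):
    """Smallest integer t_k satisfying the lower bound above, i.e.
    t_k = ceil(lam * exp(beta * k)) with lam and beta as defined in the text."""
    lam = mu ** (-1.0 / alpha) * math.exp(eta / (2.0 * alpha)) * dist0 ** (1.0 / alpha - 1.0)
    beta = 0.5 * eta * (1.0 - 1.0 / alpha)
    return math.ceil(lam * math.exp(beta * k))
```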
Accordingly, we take the restart schedule t_k = ⌈λe^{βk}⌉ for k = 1, . . . , R − 1 and assign the entire remaining budget to the last (R-th) outer iteration, where R is the largest integer for which the remaining budget is at least ⌈λe^{βR}⌉. Since computing x̃0 uses one iteration, the remaining budget is N − 1, and we have
$$t_R = N - 1 - \sum_{k=1}^{R-1}\lceil\lambda e^{\beta k}\rceil \ge \lceil\lambda e^{\beta R}\rceil \ge \lambda e^{\beta R},$$
so every outer iteration satisfies the required lower bound on t_k and, in particular,
$$\|\tilde{\mathbb{A}}\tilde{x}_R\|^2 \le e^{-\eta R}\|x^0 - x^\star\|^2.$$
To turn this into an upper bound on $\|\tilde{\mathbb{A}}\tilde{x}_R\|^2$ in terms of N, we derive a lower bound on R. From $\lambda e^{\beta k} \le \lceil\lambda e^{\beta k}\rceil \le \lambda e^{\beta k} + 1$ and $\lceil\lambda e^{\beta R}\rceil \le t_R < \lceil\lambda e^{\beta R}\rceil + \lceil\lambda e^{\beta(R+1)}\rceil$, we have
$$\sum_{k=1}^{R}\lambda e^{\beta k} \le N - 1 = \sum_{k=1}^{R}t_k = \sum_{k=1}^{R-1}\lceil\lambda e^{\beta k}\rceil + t_R \le \sum_{k=1}^{R+1}\lambda e^{\beta k} + R + 1. \qquad (1)$$
Summing the geometric series and bounding R through the left-hand inequality of (1), we obtain
$$N - 1 \le \lambda e^{\beta}\,\frac{e^{\beta(R+1)} - 1}{e^{\beta} - 1} + \frac{1}{\beta}\log\left(\frac{N-1}{\lambda}\cdot\frac{e^{\beta} - 1}{e^{\beta}} + 1\right) + 1.$$
Therefore,
$$\begin{aligned}
\|\tilde{\mathbb{A}}\tilde{x}_R\|^2 &\le e^{-\eta R}\|x^0 - x^\star\|^2\\
&\le e^{\eta}\left(\frac{e^{\beta} - 1}{\lambda e^{\beta}}\left(N - 2 - \frac{1}{\beta}\log\left(\frac{N-1}{\lambda}\cdot\frac{e^{\beta} - 1}{e^{\beta}} + 1\right)\right) + 1\right)^{-\frac{\eta}{\beta}}\|x^0 - x^\star\|^2\\
&= \left(\frac{e^{\beta} - 1}{\lambda e^{2\beta}}\left(N - 2 - \frac{1}{\beta}\log\left(\frac{e^{\beta} - 1}{\lambda e^{\beta}}(N - 1) + 1\right)\right) + \frac{1}{e^{\beta}}\right)^{-\frac{2\alpha}{\alpha-1}}\|x^0 - x^\star\|^2 \qquad (\text{choosing } \eta = 2)\\
&= O\!\left(N^{-\frac{2\alpha}{\alpha-1}}\right),
\end{aligned}$$
where $\lambda = \left(\frac{e}{\mu}\right)^{\frac{1}{\alpha}}\|x^0 - x^\star\|^{-\left(1 - \frac{1}{\alpha}\right)}$.
This is faster than the O(N^{−(α+1)/(α−1)}) rate of PPM. Although the monotonicity parameters µ > 0 and α > 1 are generally unknown, one can obtain a suboptimal restart schedule with the additional cost of a grid search, as in Roulet & d’Aspremont (2020), in which case the overall rate of the algorithm is $O\big(N^{-\frac{2\alpha}{\alpha-1}}(\log N)^2\big)$.
F. Experiment details
We now describe the experiments of Section 6 in further detail.
The CT image reconstruction experiment solves the regularized least-squares problem
$$\underset{x \in \mathbb{R}^n}{\operatorname{minimize}}\quad \frac{1}{2}\|Ex - b\|^2 + \lambda\|Dx\|_1, \qquad (2)$$
where x ∈ Rn is a vectorized image, E ∈ Rm×n is the discrete Radon transform, b = Ex is the measurement, and D
is the finite difference operator. This regularized least-squares problem can be solved using PDHG, also known as the
Chambolle–Pock method (Chambolle & Pock, 2011). PDHG can be interpreted as an instance of variable metric PPM (He
& Yuan, 2012); it is a nonexpansive fixed-point iteration (xk+1 , uk+1 , v k+1 ) = 𝕋(xk , uk , v k ) defined as
$$\begin{aligned}
x^{k+1} &= x^k - \alpha E^{\intercal}u^k - \beta D^{\intercal}v^k\\
u^{k+1} &= \frac{1}{1+\alpha}\left(u^k + \alpha E(2x^{k+1} - x^k) - \alpha b\right)\\
v^{k+1} &= \Pi_{[-\lambda\alpha/\beta,\,\lambda\alpha/\beta]}\left(v^k + \beta D(2x^{k+1} - x^k)\right),
\end{aligned}$$
where the nonexpansiveness is measured in the norm $\|\cdot\|_M$ induced by the metric
$$M = \begin{bmatrix} (1/\alpha)I & -E^{\intercal} & -(\beta/\alpha)D^{\intercal}\\ -E & (1/\beta)I & 0\\ -(\beta/\alpha)D & 0 & (1/\beta)I \end{bmatrix}.$$
Combining OHM with this fixed-point iteration yields
$$(x^{k+1}, u^{k+1}, v^{k+1}) = \left(1 - \frac{1}{k+2}\right)\mathbb{T}(x^k, u^k, v^k) + \frac{1}{k+2}(x^0, u^0, v^0). \qquad \text{(PDHG with OHM)}$$
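For reference, the following NumPy sketch implements the anchored (OHM) PDHG iteration above for problem (2). It assumes dense matrices `E` and `D` purely for illustration; the actual experiment applies the Radon transform and finite-difference operator matrix-free.

```python
import numpy as np

def pdhg_ohm(E, D, b, lam, alpha, beta, num_iters, x0=None):
    """Anchored (OHM) PDHG for (1/2)||Ex - b||^2 + lam * ||Dx||_1 (illustrative sketch)."""
    m, n = E.shape
    p = D.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    u, v = np.zeros(m), np.zeros(p)
    x_anchor, u_anchor, v_anchor = x.copy(), u.copy(), v.copy()
    r = lam * alpha / beta                      # clipping radius for v
    for k in range(num_iters):
        # plain PDHG step: (x^{k+1}, u^{k+1}, v^{k+1}) = T(x^k, u^k, v^k)
        x_new = x - alpha * (E.T @ u) - beta * (D.T @ v)
        u_new = (u + alpha * (E @ (2 * x_new - x)) - alpha * b) / (1 + alpha)
        v_new = np.clip(v + beta * (D @ (2 * x_new - x)), -r, r)
        # OHM (Halpern) anchoring toward the initial point
        theta = 1.0 / (k + 2)
        x = (1 - theta) * x_new + theta * x_anchor
        u = (1 - theta) * u_new + theta * u_anchor
        v = (1 - theta) * v_new + theta * v_anchor
    return x
```

The restarted OC-Halpern variant would additionally reset the anchor (x_anchor, u_anchor, v_anchor) and the counter k according to the restart schedule.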
Figure 6. Images reconstructed by applying PDHG, PDHG with OHM, and PDHG with restarted OC-Halpern for 1000 iterations.
Figure 7. Function value suboptimality f(x^k) − f⋆ of PDHG, PDHG with OHM, and PDHG with restarted OC-Halpern (OS-PPM_0^res) in CT image reconstruction.
Figure 6 shows the reconstructed images after 1000 iterations. Restarted OC-Halpern (OS-PPM_0^res) effectively recovers the original image at a faster rate. Figure 7 shows that, even without a theoretical guarantee, the function value suboptimality also decreases at a faster rate for OHM and restarted OC-Halpern.
To solve this problem, Li et al. (2018) used PDHG (Chambolle & Pock, 2011):
$$\begin{aligned}
\tilde{m}^{k+1}_{x,ij} &= \frac{1}{1+\varepsilon\mu}\,\operatorname{shrink}_1\!\left(\tilde{m}^{k}_{x,ij} + \mu(\nabla\Phi^k)_{x,ij},\ \mu\right)\\
\tilde{m}^{k+1}_{y,ij} &= \frac{1}{1+\varepsilon\mu}\,\operatorname{shrink}_1\!\left(\tilde{m}^{k}_{y,ij} + \mu(\nabla\Phi^k)_{y,ij},\ \mu\right)\\
\Phi^{k+1}_{ij} &= \Phi^{k}_{ij} + \tau\left(\operatorname{div}(2m^{k+1} - m^k)\right)_{ij} + \rho^1_{ij} - \rho^0_{ij}
\end{aligned}\qquad \text{(Primal-dual method for EMD-}L_1)$$
for k = 1, 2, . . . , where m̃ = (m̃x, m̃y) is m = (mx, my) with zero padding on the last row and the last column, respectively, so that m̃x, m̃y ∈ R^{n×n}. We denote this fixed-point iteration by 𝕋, so that (m̃x^{k+1}, m̃y^{k+1}, Φ^{k+1}) = 𝕋(m̃x^k, m̃y^k, Φ^k).
Combining OHM with this fixed-point iteration yields
$$(\tilde{m}_x^{k+1}, \tilde{m}_y^{k+1}, \Phi^{k+1}) = \left(1 - \frac{1}{k+2}\right)\mathbb{T}(\tilde{m}_x^{k}, \tilde{m}_y^{k}, \Phi^{k}) + \frac{1}{k+2}(\tilde{m}_x^{0}, \tilde{m}_y^{0}, \Phi^{0})$$
for k = 1, 2, . . . , and we additionally combine it with the restarting technique using an exponential schedule in the hope of further acceleration.
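As a minimal illustration of the building blocks above, the sketch below implements the shrink operator (assumed here to be elementwise soft-thresholding) and a generic OHM-anchored step that can wrap the EMD-L1 fixed-point map 𝕋; the function names are illustrative placeholders.

```python
import numpy as np

def shrink1(z, t):
    # Elementwise soft-thresholding: sign(z) * max(|z| - t, 0).
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ohm_step(T, state, anchor, k):
    """One OHM (Halpern) step: (1 - 1/(k+2)) * T(state) + (1/(k+2)) * anchor,
    applied componentwise to the tuple state = (m_x, m_y, Phi)."""
    theta = 1.0 / (k + 2)
    return tuple((1.0 - theta) * new + theta * anc
                 for new, anc in zip(T(*state), anchor))
```

The restarted variant simply resets `anchor` (and the counter k) to the current state according to the chosen schedule.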
This experiment computed an approximation of the Wasserstein distance between the two probability distributions shown in Figure 8. We applied the 3 algorithms for N = 100,000 iterations with algorithm parameters µ = 1.0 × 10−6 and ε = 1.0. Restarting the Halpern scheme every 10,000 iterations already provided an accelerated rate, but we chose a better exponential schedule for the plot in Figure 9.
(a) Probability distribution ρ0. (b) Probability distribution ρ1. (c) Solution of the discretized problem in Section 6.3.
Figure 8. Probability distributions ρ0 and ρ1. The distributions are shown as the colored regions of a 256 × 256 grid, where ρ0 is supported on the disk x² + y² ≤ (0.3)² and ρ1 consists of 4 identical circles of radius 0.2 centered at (±1, ±1).
Figure 9. Fixed-point residual of 𝕋 versus iteration count plot for approximating Wasserstein distance.
This experiment considered the regularized least-squares problem on R^50, where the sparse signal x⋆ has 10 nonzero entries:
$$\underset{x \in \mathbb{R}^n}{\operatorname{minimize}}\quad \frac{1}{n}\sum_{i=1}^{n}\|A^{(i)}x - b^{(i)}\|^2 + \lambda\|x\|_1.$$
Each node i maintains its local estimate x_i of x ∈ R^n and has access to a sensing matrix A^{(i)} ∈ R^{m_i×n}, where m_i is the number of accessible sensors. Here, we assume each node has m_i = 3 sensors, for a total of m = 30 sensors. We applied PG-EXTRA, PG-EXTRA combined with OHM, PG-EXTRA with OC-Halpern, and PG-EXTRA with restarted OC-Halpern (OS-PPM_0^res), since PG-EXTRA can be understood as a fixed-point iteration (Wu et al., 2018).
Let x^k ∈ R^{n×10} be a vertical stack of R^n vectors, where the i-th row vector x_i^k is the local copy of x stored at node i. The vector at node i only interacts with the vectors at nodes in the neighborhood of node i. The fixed-point iteration (x^{k+1}, w^{k+1}) = 𝕋(x^k, w^k) is
$$\begin{aligned}
x^{k+1}_i &= \operatorname{Prox}_{\alpha\lambda\|\cdot\|_1}\!\Big(\sum_j W_{i,j}x^{k}_j - \alpha A^{(i)\intercal}\big(A^{(i)}x^{k}_i - b^{(i)}\big) - w^{k}_i\Big)\\
w^{k+1} &= w^k + \frac{1}{2}(I - W)x^k,
\end{aligned}$$
and PG-EXTRA combined with OHM is
$$(x^{k+1}, w^{k+1}) = \left(1 - \frac{1}{k+2}\right)\mathbb{T}(x^k, w^k) + \frac{1}{k+2}(x^0, w^0) \qquad \text{(PG-EXTRA with OHM)}$$
Figure 10. Absolute function-value suboptimality |f (xk ) − f ⋆ | versus iteration count plot for approximating Wasserstein distance.
for k = 0, 1, . . . . For all these methods, we chose the mixing matrix W ∈ R^{10×10} to be the Metropolis–Hastings weight matrix with (i, j)-entry
$$W_{i,j} = \begin{cases} \dfrac{1}{\max\{\deg(i),\,\deg(j)\}} & (i \neq j)\\[6pt] 1 - \sum_{j \neq i} W_{i,j} & (i = j), \end{cases}$$
where deg(i) is the number of edges connected to node i. We applied each method (PG-EXTRA, PG-EXTRA with OHM, PG-EXTRA with OC-Halpern, and PG-EXTRA with restarted OC-Halpern (OS-PPM_0^res)) with stepsize α = 0.005 and regularization parameter λ = 0.002 for 100 iterations.
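For completeness, here is a small NumPy sketch of the Metropolis–Hastings mixing matrix and one PG-EXTRA step with OHM anchoring, following the updates above. The helper names and the 0/1 `adjacency` input are illustrative assumptions.

```python
import numpy as np

def metropolis_hastings_weights(adjacency):
    """Mixing matrix W from a 0/1 adjacency matrix (no self-loops):
    W[i, j] = 1 / max{deg(i), deg(j)} for neighboring i != j,
    and W[i, i] = 1 - sum_{j != i} W[i, j]."""
    n = adjacency.shape[0]
    deg = adjacency.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and adjacency[i, j]:
                W[i, j] = 1.0 / max(deg[i], deg[j])
        W[i, i] = 1.0 - W[i].sum()
    return W

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def pg_extra_ohm_step(x, w, x0, w0, k, W, A_list, b_list, alpha, lam):
    """One PG-EXTRA step followed by OHM anchoring toward (x0, w0).
    Row i of x is the local copy held by node i; A_list[i], b_list[i] are the
    local sensing matrix and measurements."""
    x_new = np.empty_like(x)
    for i in range(x.shape[0]):
        grad_i = A_list[i].T @ (A_list[i] @ x[i] - b_list[i])
        x_new[i] = soft_threshold(W[i] @ x - alpha * grad_i - w[i], alpha * lam)
    w_new = w + 0.5 * (np.eye(W.shape[0]) - W) @ x
    theta = 1.0 / (k + 2)
    return ((1 - theta) * x_new + theta * x0,
            (1 - theta) * w_new + theta * w0)
```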
Figure 12. Distance to solution ∥x^k − x⋆∥²_F versus iteration count for PG-EXTRA, PG-EXTRA with OHM, PG-EXTRA with OC-Halpern, and PG-EXTRA with restarted OC-Halpern (OS-PPM_0^res).