
Proceedings of Machine Learning Research vol 242:197–208, 2024

On the convergence of adaptive first order methods: proximal gradient and alternating minimization algorithms
Puya Latafat  puya.latafat@kuleuven.be
Department of Electrical Engineering (ESAT-STADIUS), KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium

Andreas Themelis  andreas.themelis@ees.kyushu-u.ac.jp
Faculty of Information Science and Electrical Engineering (ISEE), Kyushu University, 744 Motooka, Nishi-ku 819-0395, Fukuoka, Japan

Panagiotis Patrinos  panos.patrinos@esat.kuleuven.be
Department of Electrical Engineering (ESAT-STADIUS), KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium

Abstract
Building upon recent works on linesearch-free adaptive proximal gradient methods, this paper proposes AdaPGq,r, a framework that unifies and extends existing results by providing larger stepsize policies and improved lower bounds. Different choices of the parameters q and r are discussed and the efficacy of the resulting methods is demonstrated through numerical simulations. In an attempt to better understand the underlying theory, its convergence is established in a more general setting that allows for time-varying parameters. Finally, an adaptive alternating minimization algorithm is presented by exploring the dual setting. This algorithm not only incorporates additional adaptivity, but also expands its applicability beyond standard strongly convex settings.
Keywords: Convex minimization, proximal gradient method, alternating minimization algorithm,
locally Lipschitz gradient, linesearch-free adaptive stepsizes

1. Introduction
The proximal gradient (PG) method is the natural extension of gradient descent for constrained and nonsmooth problems. It addresses nonsmooth minimization problems by splitting them as

minimize_{x∈ℝⁿ} φ(x) := f(x) + g(x),   (P)

where f is here assumed locally Lipschitz differentiable, and g possibly nonsmooth but with an easy-to-compute proximal mapping, while both are convex (see Assumption 2.1 for details). It has long been known that the performance of first-order methods can be drastically improved by an appropriate stepsize selection, as is evident in the success of linesearch-based approaches.
Substantial effort has been devoted to developing adaptive methods. Most notably, in the context of stochastic (sub)gradient descent, numerous adaptive methods have been proposed starting with Duchi et al. (2011); we only point the reader to a few recent works in this area: Li and Orabona (2019); Ward et al. (2019); Yurtsever et al. (2021); Ene et al. (2021); Defazio et al. (2022); Ivgi et al. (2023). However, although applicable to a more general setting, such approaches tend to suffer from diminishing stepsizes, which can hinder their performance.
Closer to our setting are the recent works Grimmer et al. (2023); Altschuler and Parrilo (2023), which consider smooth optimization problems and propose predefined stepsize patterns. These methods obtain accelerated worst-case rates under global Lipschitz continuity assumptions. We also mention the recent work Li and Lan (2023) in the constrained smooth setting which, while also being bound to a global Lipschitz continuity assumption, uses an adaptive estimate for the Lipschitz modulus and achieves an accelerated worst-case rate.

© 2024 P. Latafat, A. Themelis & P. Patrinos.
In this paper, we extend recent results pioneered in Malitsky and Mishchenko (2020) and later further developed in Latafat et al. (2023b); Malitsky and Mishchenko (2023), where novel (self-)adaptive schemes are developed. We provide a unified analysis that bridges together and improves upon all these works by enabling larger stepsizes and, in some cases, providing tighter lower bounds.
Adaptivity refers to the fact that, in contrast to linesearch methods that employ a look-forward approach based on trial and error to ensure a sufficient descent in the cost, we look backward to yield stepsizes based only on past information. Specifically, we estimate the Lipschitz modulus of ∇f at consecutive iterates xk−1, xk ∈ ℝⁿ generated by the algorithm using the quantities

ℓk := ⟨∇f(xk) − ∇f(xk−1), xk − xk−1⟩ / ∥xk − xk−1∥²   and   Lk := ∥∇f(xk) − ∇f(xk−1)∥ / ∥xk − xk−1∥.   (1.1)

Throughout, we stick to the convention 0/0 = 0 so that ℓk and Lk are well-defined, nonnegative real numbers. In addition, we adhere to 1/0 = ∞. Note also that

ℓk ≤ Lk ≤ Lf,V   (1.2)

holds whenever Lf,V is a Lipschitz modulus for ∇f on a convex set V containing xk−1 and xk. Despite the mere dependence of these quantities on the previous iterates, they provide a sufficiently refined estimate of the local geometry of f. In fact, a carefully designed stepsize update rule not only ensures that the stepsize sequence is separated from zero, but also that a sufficient descent-type inequality can be indirectly ensured between (xk+1, xk) and (xk, xk−1) without any backtracks.
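In code, the two estimates (1.1) are one line each; a minimal sketch (NumPy; the function name is ours):

```python
import numpy as np

def curvature_estimates(grad_new, grad_old, x_new, x_old):
    """Backward-looking Lipschitz estimates l_k and L_k from (1.1).

    Returns (0.0, 0.0) when consecutive iterates coincide, following
    the paper's convention 0/0 = 0.
    """
    dx = x_new - x_old
    dg = grad_new - grad_old
    nx2 = float(dx @ dx)
    if nx2 == 0.0:
        return 0.0, 0.0
    ell = float(dg @ dx) / nx2              # <grad diff, x diff> / ||x diff||^2
    L = np.linalg.norm(dg) / np.sqrt(nx2)   # ||grad diff|| / ||x diff||
    return ell, L
```

For a quadratic f with identity Hessian both quantities equal 1, consistent with ℓk ≤ Lk ≤ Lf,V in (1.2).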
The ultimate deliverable of this manuscript is the general adaptive framework outlined in Algorithm 2.1. A special case of it is here condensed into a two-parameter simplified algorithm.

AdaPGq,r  Fix x−1 ∈ ℝⁿ and γ0 = γ−1 > 0. With ℓk and Lk as in (1.1), starting from x0 = proxγ0g(x−1 − γ0∇f(x−1)), iterate for k = 0, 1, . . .

γk+1 = γk min{ √(1/q + γk/γk−1), √( (1 − r/q) / [γk²Lk² + 2γkℓk(r − 1) − (2r − 1)]₊ ) }   (1.3a)
xk+1 = proxγk+1g(xk − γk+1∇f(xk))   (1.3b)
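In code, the scheme is a few lines per iteration. The following is a sketch (not the authors' implementation; grad_f and prox_g are user-supplied callables, and a nonpositive denominator in (1.3a) is read through [·]₊ as an infinite second term):

```python
import numpy as np

def adapg(grad_f, prox_g, x_init, gamma0, q=1.5, r=0.75, max_iter=500):
    """AdaPGq,r sketch: linesearch-free adaptive proximal gradient, cf. (1.3).

    prox_g(v, gamma) should evaluate prox_{gamma*g}(v).
    """
    x_old = x_init                              # plays the role of x^{-1}
    g_old = grad_f(x_old)
    gamma = gamma_prev = gamma0
    x = prox_g(x_old - gamma * g_old, gamma)    # x^0
    for _ in range(max_iter):
        g = grad_f(x)
        dx, dg = x - x_old, g - g_old
        nx2 = float(dx @ dx)
        if nx2 == 0.0:
            break                               # consecutive iterates coincide
        ell = float(dg @ dx) / nx2              # l_k in (1.1)
        L = np.linalg.norm(dg) / np.sqrt(nx2)   # L_k in (1.1)
        denom = (gamma * L) ** 2 + 2 * gamma * ell * (r - 1) - (2 * r - 1)
        term2 = np.sqrt((1 - r / q) / denom) if denom > 0 else np.inf
        term1 = np.sqrt(1 / q + gamma / gamma_prev)
        gamma_prev, gamma = gamma, gamma * min(term1, term2)   # (1.3a)
        x_old, g_old = x, g
        x = prox_g(x - gamma * g, gamma)        # (1.3b)
    return x
```

For instance, with grad_f(x) = x − b and prox_g the identity (g = 0), the iterates converge to b; for an ℓ1 regularizer, prox_g would be soft-thresholding.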

Theorem 1.1 Under Assumption 2.1, for any q > r ≥ 1/2 the sequence (xk)k∈ℕ generated by AdaPGq,r converges to some x⋆ ∈ arg min φ. If in addition q ≤ (3 + √5)/2, then

γk ≥ γmin := √( (1 − r/q) / max{1, q} ) · 1/Lf,V   holds for all k ≥ 2 [log_{1+1/q}( 1/(γ0 Lf,V) )]₊,

where Lf,V is a Lipschitz modulus for ∇f on a convex and compact set V that contains (xk)k∈ℕ. Moreover, min_{k≤K} (φ(xk) − min φ) ≤ U1(x⋆) / ∑_{k=1}^{K+1} γk holds for every K ≥ 1, where U1(x⋆) is as in (2.4).

The above worst-case sublinear rate depends on the aggregate of the stepsize sequence, providing a partial explanation for the fast convergence of the algorithm observed in practice. AdaPGq,r and Theorem 1.1 are particular instances of the general framework provided in Section 2; see Remark 2.2 and Theorem 2.7 for the details. Specific choices of the parameters q, r nevertheless allow AdaPGq,r to embrace and extend existing algorithms:
• r = 1/2 and q = 1. Then, γk+1 = γk min{ √(1 + γk/γk−1), √( 1 / (2[γk²Lk² − γkℓk]₊) ) } coincides with the update in (Latafat et al., 2023b, Alg. 2.1) with the second term improved by a factor of √2.

• Owing to the relation γk²Lk² − γkℓk ≤ γk²Lk², the case above is also a proximal extension of (Malitsky and Mishchenko, 2023, Alg. 1), which considers γk+1 = min{ γk√(1 + γk/γk−1), 1/(√2 Lk) } when g = 0, and which in turn is also a strict improvement over the previous work Malitsky and Mishchenko (2020).

• r = 3/4 and q = 3/2. Then, γk+1 = γk min{ √(2/3 + γk/γk−1), √( 1 / [2γk²Lk² − 1 − γkℓk]₊ ) } recovers the update rule (Malitsky and Mishchenko, 2023, Alg. 2) (in fact tighter because of the extra −γkℓk term).
The interplay between the parameters can then be understood by noting that √(1/q + γk/γk−1) allows the algorithm to recover from a potentially small stepsize, which can only decrease for a controlled number of iterations and will then rapidly enter a phase where it increases linearly until a certain threshold is reached; see the proof of Theorem 2.7. A smaller q allows for a more aggressive recovery, but comes at the cost of a more conservative second term. As for r, values in the range [1/2, 1], such as in the combinations reported in Table 1, work well in practice.
Table 1. Suggested options for q and r in AdaPGq,r. Green cells strike a nice balance between aggressive increases and large lower bounds (γmin) for the stepsize sequence, while the orange cell yields the largest theoretical lower bound. Here, Lf,V is a local Lipschitz modulus for ∇f as in Theorem 1.1.

1 − r/q    q      r      γmin Lf,V
1/4       10/9    5/6    (3/2)√(1/10) ≈ 0.47
2/5        8/5   24/25   1/2
1/2        5/3    5/6    √(3/10) ≈ 0.55
1/2        3/2    3/4    1/√3 ≈ 0.57
1/2        1      1/2    1/√2 ≈ 0.71
3/5        5/2    1      √6/5 ≈ 0.49
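The last column of Table 1 is the closed form γmin Lf,V = √((1 − r/q)/max{1, q}) from Theorem 1.1; a quick numerical check of the reported rows:

```python
from fractions import Fraction
from math import sqrt

def gamma_min_L(q, r):
    """gamma_min * L_{f,V} = sqrt((1 - r/q) / max(1, q)), cf. Theorem 1.1."""
    return sqrt((1 - r / q) / max(1.0, q))

# the (q, r) pairs of Table 1
rows = [(Fraction(10, 9), Fraction(5, 6)), (Fraction(8, 5), Fraction(24, 25)),
        (Fraction(5, 3), Fraction(5, 6)),  (Fraction(3, 2), Fraction(3, 4)),
        (Fraction(1), Fraction(1, 2)),     (Fraction(5, 2), Fraction(1))]
for q, r in rows:
    print(f"q={q}, r={r}: 1-r/q={1 - r/q}, gamma_min*L={gamma_min_L(float(q), float(r)):.2f}")
```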

As a final contribution, an adaptive variant of the alternating minimization algorithm (AMA) of Tseng (1991) is proposed that addresses composite problems of the form

minimize_{x∈ℝⁿ} ψ1(x) + ψ2(Ax).   (CP)

AMA is particularly interesting in settings where ψ1 is either nonsmooth or its gradient is computationally demanding. Its convergence was established in Tseng (1991) by framing it as the dual form of the splitting method introduced in Gabay (1983), and acceleration techniques have also been adapted to this setting Goldstein et al. (2014). In contrast to existing methods, ours not only incorporates an adaptive stepsize mechanism but also relaxes the strong convexity assumption to mere local strong convexity, see Assumption 3.1 for details. Due to space limitations, some proofs are deferred to the preprint version Latafat et al. (2023a).


2. A general framework for adaptive proximal gradient methods


In this section we consider plain proximal gradient iterations of the form

xk+1 = proxγk+1g(xk − γk+1∇f(xk)),   (2.1)

where (γk)k∈ℕ is a sequence of strictly positive stepsize parameters. The main oracles of the method are gradient and proximal maps (see (Beck, 2017, §6) for examples of proximable functions). Whenever g is convex, for any γ > 0 it is well known that proxγg is firmly nonexpansive (Bauschke and Combettes, 2017, §4.1 and Prop. 12.28), a property stronger than Lipschitz continuity. We here show that even when the stepsizes are time-varying as in (2.1) a similar property still holds for the iterates therein. This fact is a refinement of (Malitsky and Mishchenko, 2023, Lem. 12) that follows after an application of Cauchy-Schwarz and that will be used in our main descent inequality.

Lemma 2.1 (FNE-like inequality) Suppose that g is convex and that f is differentiable. Then, for any (γk)k∈ℕ ⊂ ℝ++, with Hk := id − γk∇f and ρk+1 := γk+1/γk, proximal gradient iterates (2.1) satisfy

∥xk+1 − xk∥² ≤ ρk+1 ⟨Hk(xk−1) − Hk(xk), xk − xk+1⟩ ≤ ρ²k+1 ∥Hk(xk−1) − Hk(xk)∥².   (2.2)

Throughout, we study problem (P) under the following assumptions.

Assumption 2.1 (Requirements for problem (P))

A1  f : ℝⁿ → ℝ is convex and has locally Lipschitz continuous gradient.
A2  g : ℝⁿ → ℝ ∪ {+∞} is proper, lsc, and convex.
A3  There exists x⋆ ∈ arg min(f + g).

The main adaptive framework, involving two time-varying parameters qk, ξk, is given in Algorithm 2.1. The shorthand notation 𝟙γkℓk≥1 equals 1 if γkℓk ≥ 1 and 0 otherwise, while for t ∈ ℝ we denote [t]₊ := max{0, t}.

Algorithm 2.1 General adaptive proximal gradient framework

REQUIRE  starting point x−1 ∈ ℝⁿ, stepsizes γ0 = γ−1 > 0,
         parameters 1/2 < qmin ≤ qmax, 0 < ξmin ≤ 2qmin − 1
INITIALIZE  x0 = proxγ0g(x−1 − γ0∇f(x−1)), ρ0 = 1, q0 ∈ [qmin, qmax], ξ0 ≥ ξmin
REPEAT FOR k = 0, 1, . . . until convergence
2.1.1:  Let ℓk and Lk be as in (1.1), and choose qk+1, ξk+1 such that
          qmin ≤ qk+1 ≤ min{qmax, qk + 𝟙γkℓk≥1},
          ξk+1 ≥ ξmin,  rk+1 := qk+1/(1 + ξk+1) ≥ 1/2
2.1.2:  γk+1 = γk min{ √( (1 + qkρk)/qk+1 ), √( rk+1ξk / (qk+1[γk²Lk² + 2γkℓk(rk+1 − 1) − (2rk+1 − 1)]₊) ) }
2.1.3:  Set ρk+1 = γk+1/γk and update xk+1 = proxγk+1g(xk − γk+1∇f(xk))
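For concreteness, steps 2.1.1–2.1.2 can be written as a single update function; the sketch below is an illustrative reading with the free choices of qk+1 and ξk+1 resolved in one particular way (largest admissible q, smallest admissible ξ; function and variable names are ours):

```python
import math

def step_sizes_general(gamma, gamma_prev, ell, L, q, xi, q_min, q_max, xi_min):
    """One pass of steps 2.1.1-2.1.2 of Algorithm 2.1 (a sketch)."""
    # step 2.1.1: q may grow by at most 1, and only when gamma*ell >= 1
    indicator = 1.0 if gamma * ell >= 1.0 else 0.0
    q_next = max(q_min, min(q_max, q + indicator))
    xi_next = xi_min                        # any xi_{k+1} >= xi_min is allowed
    r_next = q_next / (1.0 + xi_next)
    assert r_next >= 0.5, "parameters must keep r_{k+1} >= 1/2"
    # step 2.1.2: backward-looking stepsize update
    rho = gamma / gamma_prev
    term1 = math.sqrt((1.0 + q * rho) / q_next)
    denom = q_next * ((gamma * L) ** 2 + 2 * gamma * ell * (r_next - 1.0)
                      - (2.0 * r_next - 1.0))
    term2 = math.sqrt(r_next * xi / denom) if denom > 0 else math.inf
    return gamma * min(term1, term2), q_next, xi_next
```

With qk ≡ qmin = qmax and ξk ≡ ξmin this collapses to the AdaPGq,r rule (1.3a), as noted in Remark 2.2.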


Remark 2.2 (relation to AdaPGq,r). Whenever qk ≡ qmin = qmax =: q and ξk ≡ ξmin =: ξ, the conditions in Algorithm 2.1 reduce to q > 1/2, r = q/(ξ + 1) ≥ 1/2, and ξ = q/r − 1 > 0; equivalently, q > r ≥ 1/2 as in Theorem 1.1.

In what follows, for x ∈ dom φ we adopt the notation

Pk(x) := φ(xk) − φ(x).   (2.3)

Our convergence analysis revolves around showing that under appropriate stepsize update and parameter selection the function

Uk(x) := ½∥xk − x∥² + γk(1 + qkρk)Pk−1(x) + (ξk/2)∥xk − xk−1∥²,   (2.4)

monotonically decreases along the iterates for all x ∈ arg min φ. The main inequality, outlined in
Theorem 2.3, extends the one in (Latafat et al., 2023b, Eq. (2.8)) by blending it with Lemma 2.1.
This combination is achieved by adding and subtracting a multiple of the residual scaled by a newly
added parameter ξk . The proof is otherwise adapted from that of (Latafat et al., 2023b, Lem. 2.2),
and is included in full detail in the preprint version.

Theorem 2.3 (main PG inequality) Consider a sequence (xk)k∈ℕ generated by PG iterations (2.1) under Assumption 2.1, and denote ρk+1 := γk+1/γk. Then, for any x ∈ dom φ, qk, ξk ≥ 0 and νk > 0, k ∈ ℕ,

Uk+1(x) ≤ Uk(x) − γk(1 + qkρk − qk+1ρk+1)Pk−1(x)
          − ½∥xk − xk−1∥² [ 1 + ξk − 1/νk − ρ²k+1(νk+1 + ξk+1)( γk²Lk² + 2γkℓk( qk+1/(νk+1 + ξk+1) − 1 ) − ( 2qk+1/(νk+1 + ξk+1) − 1 ) ) ],   (2.5)

where Uk(x) is as in (2.4). In particular, with νk ≡ 1, if φ(x) ≤ inf_{k∈ℕ} φ(xk) (for instance, if x ∈ arg min φ), and

0 < ρ²k+1 ≤ min{ (1 + qkρk)/qk+1 , ξk / ( (1 + ξk+1)[ γk²Lk² + 2γkℓk( qk+1/(1 + ξk+1) − 1 ) + 1 − 2qk+1/(1 + ξk+1) ]₊ ) }   (2.6)

(with ξk > 0) holds for every k, then the coefficients of Pk−1(x) and ∥xk − xk−1∥² in (2.5) are negative, Uk+1(x) ≤ Uk(x), and thus (Uk(x))k∈ℕ converges and (xk)k∈ℕ is bounded.

Consistently with what was first observed in Malitsky and Mishchenko (2020), inequality (2.6) confirms that stepsizes should both not grow too fast and be controlled by the local curvature of f. We next show that, under a technical condition on qk, all that remains to do is ensuring that the stepsizes do not vanish, which is precisely the reason behind the restrictions on the parameters qk and ξk prescribed in Algorithm 2.1, as Theorem 2.7 will ultimately demonstrate. The technical condition turns out to be a controlled growth of qk, needed to guarantee that a sequence ϱk+1 ≈ √( (1 + qkϱk)/qk+1 ) will eventually stay above 1.


Theorem 2.4 (convergence of PG with nonvanishing stepsizes) Consider the iterates generated by (2.1) under Assumption 2.1, with γk+1 = γkρk+1 complying with (2.6). If qk+1 ≤ 1 + qk holds for every k and inf_{k∈ℕ} γk > 0, then:

(i) The (bounded) sequence (xk)k∈ℕ has exactly one optimal accumulation point.
(ii) If, in addition, (qk)k∈ℕ and (ξk)k∈ℕ are chosen bounded and bounded away from zero, then the entire sequence (xk)k∈ℕ converges to a solution x⋆ ∈ arg min φ, and Uk(x⋆) ↘ 0.

We now turn to the last piece of the puzzle, namely enforcing a strictly positive lower bound on
the stepsizes. The following elementary lemma provides the key insight to achieve this.

Lemma 2.5 Let f be convex and differentiable, and consider the iterates generated by Algorithm 2.1. Then, for every k ∈ ℕ such that γkℓk < 1 it holds that

γk+1 ≥ min{ γk √(1/qmax + ρk), √(ξmin rmin/qmax) · 1/Lk }.   (2.7)

Proof We start by observing that the assumptions on f guarantee that 0 ≤ ℓk ≤ Lk. The (squared) second term in the minimum of step 2.1.2 can be lower bounded as follows:

rk+1ξk / (qk+1[γk²Lk² + 2γkℓk(rk+1 − 1) + (1 − 2rk+1)]₊) ≥ ξmin rmin / (qmax[γk²Lk² + 2γkℓk(rmin − 1) + (1 − 2rmin)]₊)
  = ξmin rmin / (qmax[γk²Lk² − γkℓk + (γkℓk − 1)(2rmin − 1)]₊) ≥ ξmin rmin / (qmax γk²Lk²),   (2.8)

where the first inequality follows from the fact that the left-hand side is increasing with respect to rk+1, and the second inequality follows since rmin ≥ 1/2. In turn, the claimed inequality (2.7) follows from the fact that qk+1 ≤ qk ≤ qmax whenever γkℓk < 1, see step 2.1.1.

This lemma already hints at a potential lower bound for the stepsize, since boundedness of the sequence (xk)k∈ℕ ensures lower boundedness of the second term in the minimum. As for the first term, as long as qk is upper bounded, the stepsize can only decrease for a controlled number of iterations. This argument will be formally completed in the proof of Theorem 2.7, where the following notation will be instrumental.

Definition 2.6 Let ε > 0. With ϱ1 = √ε and ϱt+1 = √(ε + ϱt) for t ≥ 1, we denote

tε := max{t ∈ ℕ | ϱ1, . . . , ϱt < 1}   and   m(ε) := ∏_{t=1}^{tε} ϱt.

Notice that m(ε) ≤ 1 and equality holds iff ε ≥ 1 (equivalently, iff tε = 0). For ε ∈ (0, 1), tε is a well-defined strictly positive integer, owing to the monotonic increase of ϱt and its convergence to the positive root of the equation ϱ² − ϱ − ε = 0. In particular, m(ε) ≤ ϱ1 = √ε and identity holds if tε = 1, that is, √(ε + √ε) ≥ 1 (and ε < 1). This leads to a partially explicit expression

tε ≤ 1 and m(ε) = min{1, √ε}   if ε ≥ (3 − √5)/2 ≈ 0.382,
1 < tε ≤ ⌈ 1/(ε(2 − ε)) ⌉ and √ε^tε < m(ε) < √ε   otherwise.   (2.9)

The bound on tε in the second case is obtained by observing that

1 > ϱtε = (ϱtε − ϱ²tε) + ε + ϱtε−1 = · · · = ∑_{t=1}^{tε} (ϱt − ϱt²) + tε ε ≥ (tε − 1)ε(1 − ε) + (tε − 1)ε,

where we used the fact that ε ≤ ϱt ≤ 1 − ε, and thus ϱt − ϱt² ≥ ε(1 − ε), for t = 1, . . . , tε − 1. We also remark that the lower bound in the simplified setting of Theorem 1.1 pertains to the case when tε = 1, since ε = 1/q falls under the first case above.
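Definition 2.6 and the bounds in (2.9) are easy to check numerically; a small sketch (function name ours):

```python
from math import sqrt, prod

def t_and_m(eps):
    """t_eps = number of leading rho_t below 1, m(eps) = their product,
    with rho_1 = sqrt(eps), rho_{t+1} = sqrt(eps + rho_t) (Definition 2.6).
    The recursion always exits since rho_t increases to a root > 1."""
    rhos = []
    rho = sqrt(eps)
    while rho < 1.0:
        rhos.append(rho)
        rho = sqrt(eps + rho)
    return len(rhos), prod(rhos) if rhos else 1.0

# the threshold (3 - sqrt(5))/2 ~ 0.382 separates t_eps <= 1 from t_eps > 1
print(t_and_m(0.5))   # above the threshold: one decreasing step, m = sqrt(eps)
print(t_and_m(0.1))   # below: several decreasing steps are possible
```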
Theorem 2.7 (convergence of Algorithm 2.1) Under Assumption 2.1, the sequence (xk)k∈ℕ generated by Algorithm 2.1 converges to a solution x⋆ ∈ arg min φ and (Uk(x⋆))k∈ℕ ↘ 0. Moreover, there exists k0 ≤ 2 [log_{1+1/qmax}( 1/(γ0 Lf,V) )]₊ such that

γk ≥ γmin := m(1/qmax) √( ξmin rmin / qmax ) · 1/Lf,V   ∀k ≥ k0,

where rmin := inf_{k∈ℕ} rk ≥ 1/2, m(·) is as in Definition 2.6 (see also (2.9)), and Lf,V is a Lipschitz modulus for ∇f on a compact convex set V that contains (xk)k∈ℕ.
Proof The conditions prescribed in step 2.1.1 entail that the requirements of Theorem 2.4(ii) are met, so that the proof reduces to showing the claimed lower bound on (γk)k∈ℕ. Boundedness of the sequence (xk)k∈ℕ established in Theorem 2.3 ensures the existence of Lf,V > 0 as in the statement. In particular, recall that ℓk ≤ Lk ≤ Lf,V holds for all k ∈ ℕ, cf. (1.2). Lemma 2.5 then yields that

γkℓk < 1  ⇒  γk+1 ≥ min{ γk √(1/qmax + ρk), √(ξmin rmin/qmax) · 1/Lf,V }.   (2.10)

We first show that γk0 Lf,V ≥ √(ξmin rmin/qmax) holds for some k0 ≥ 0 upper bounded as in the statement. To this end, suppose that γk Lf,V < √(ξmin rmin/qmax) for k = 0, 1, . . . , K. The bounds ξk ≥ ξmin and qk ≤ qmax enforced in step 2.1.1 imply that 1/2 ≤ rmin ≤ rk ≤ qmax/(1 + ξmin) for any k. In particular, ξmin rmin/qmax ≤ ξmin rk/qmax ≤ ξmin/(1 + ξmin) < 1 holds for every k. Then, γkℓk ≤ γk Lf,V < 1, and (2.10) holds true for all such k, leading to γk+1 ≥ γk √(1/qmax + ρk) for k = 0, . . . , K − 1. Since ρ0 ≥ 1, it follows that ρk+1 = γk+1/γk ≥ √(1/qmax + 1) for k = 0, . . . , K − 1. Thus,

1 > ξmin rmin/qmax > (γK Lf,V)² ≥ (1 + 1/qmax)(γK−1 Lf,V)² ≥ · · · ≥ (1 + 1/qmax)^K (γ0 Lf,V)²,

from which the existence of k0 bounded as in the statement follows.

Let k ≥ k0 be an index such that γk Lf,V ≥ √(ξmin rmin/qmax), and suppose that γk+t Lf,V < √(ξmin rmin/qmax) for t = 1, . . . , T. As before, the inequalities in (2.10) hold true for all such iterates, leading to

ρk+t ≥ √(1/qmax + ρk+t−1),  t = 1, . . . , T + 1,  and in particular  ρk+1 ≥ √(1/qmax).

It then follows from the definition of m(ε) and tε as in Definition 2.6 with ε = 1/qmax that γk+t = γk+t−1 ρk+t can only decrease for at most t ≤ tε iterations (that is, T ≤ tε), at the end of which

γk+t = ( ∏_{i=1}^{t} ρk+i ) γk ≥ m(1/qmax) γk ≥ m(1/qmax) √(ξmin rmin/qmax) · 1/Lf,V = γmin,

and then increases linearly up to when it is again larger than √(ξmin rmin/qmax) · 1/Lf,V, proving that γk ≥ γmin holds for all k ≥ k0.


3. A class of adaptive alternating minimization algorithms


Leveraging an interpretation of AMA as the dual of the proximal gradient method, an adaptive variant is developed for solving (CP) under the following assumptions.

Assumption 3.1 (requirements for problem (CP))

A1∗ ψ1 : ℝⁿ → ℝ ∪ {+∞} is proper, closed, locally strongly convex, and 1-coercive;
A2∗ ψ2 : ℝᵐ → ℝ ∪ {+∞} is proper, convex and closed;
A3∗ A ∈ ℝᵐ×ⁿ and there exists x ∈ relint dom ψ1 such that Ax ∈ relint dom ψ2.
Under these requirements, problem (CP) admits a unique solution x⋆, and by virtue of (Rockafellar, 1970, Thm.s 23.8, 23.9, and Cor. 31.2.1) also its dual

minimize_{y∈ℝᵐ} ψ1∗(−A⊤y) + ψ2∗(y)   (D)

has solutions y⋆ characterized by y⋆ ∈ ∂ψ2(Ax⋆) and −A⊤y⋆ ∈ ∂ψ1(x⋆), and strong duality holds. In fact, Assumption 3.1.A1∗ ensures that the conjugate ψ1∗ is a (real-valued) locally Lipschitz differentiable function (Goebel and Rockafellar, 2008, Thm. 4.1). We note that the weaker notion of local strong monotonicity of ∂ψ1 relative to its graph would suffice, and that this minor departure from the reference is used for simplicity of exposition. Problem (D) can then be addressed with the proximal gradient iterations y⁺ = proxγg(y − γ∇f(y)) analyzed in the previous section, with f := ψ1∗ ∘ (−A⊤) and g := ψ2∗. In terms of primal variables x and z, these iterations result in the alternating minimization algorithm. We here reproduce the simple textbook steps. First, observe that

x = ∇ψ1∗(−A⊤y) ⇔ −A⊤y ∈ ∂ψ1(x) ⇔ 0 ∈ A⊤y + ∂ψ1(x) = ∂[⟨A·, y⟩ + ψ1](x).

Hence, by strict convexity, x = arg min{ψ1 + ⟨A·, y⟩}. By the Moreau decomposition,

proxγg(y − γ∇f(y)) = proxγψ2∗(y + γAx) = y + γAx − γ proxψ2/γ(γ⁻¹y + Ax).
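The Moreau decomposition used in the last identity, proxγψ∗(v) = v − γ proxψ/γ(v/γ), can be sanity-checked numerically; a sketch with ψ = ∥·∥1, whose conjugate is the indicator of the unit sup-norm ball (so the left-hand side is simply a projection, i.e. a componentwise clip):

```python
import numpy as np

def prox_l1(v, t):
    """prox of t*||.||_1: soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_conj(v, gamma):
    """prox_{gamma*psi^*}(v) via Moreau: v - gamma * prox_{psi/gamma}(v/gamma),
    instantiated for psi = ||.||_1, whose scaled prox is soft-thresholding
    with level 1/gamma."""
    return v - gamma * prox_l1(v / gamma, 1.0 / gamma)

v = np.array([-3.0, 0.4, 2.0])
# psi^* is the indicator of [-1, 1]^n, so prox_conj must coincide with a clip:
print(prox_conj(v, 0.5), np.clip(v, -1.0, 1.0))
```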

AMA iterations thus generate a sequence (yk)k∈ℕ given in (3.1), where

Lγ(x, z, y) := ψ1(x) + ψ2(z) + ⟨y, Ax − z⟩ + (γ/2)∥Ax − z∥²

is the γ-augmented Lagrangian associated to (CP).

AdaAMAq,r  Fix y−1 ∈ ℝᵐ and γ0 = γ−1 > 0. With ℓk and Lk as in (3.2), starting from

x−1 = arg min_{x∈ℝⁿ} { ψ1(x) + ⟨y−1, Ax⟩ }
z0 = arg min_{z∈ℝᵐ} Lγ0(x−1, z, y−1)
y0 = y−1 + γ0(Ax−1 − z0),

iterate for k = 0, 1, . . .

xk = arg min_{x∈ℝⁿ} { ψ1(x) + ⟨yk, Ax⟩ }   ( = arg min_{x∈ℝⁿ} L0(x, zk, yk))   (3.1a)
γk+1 = γk min{ √(1/q + γk/γk−1), √( (1 − r/q) / [(1 − 2r) + γk²Lk² + 2γkℓk(r − 1)]₊ ) }   (3.1b)
zk+1 = proxψ2/γk+1(γk+1⁻¹yk + Axk)   ( = arg min_{z∈ℝᵐ} Lγk+1(xk, z, yk))   (3.1c)
yk+1 = yk + γk+1(Axk − zk+1)   (3.1d)
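As an illustration of the mechanics (a sketch, not the authors' implementation), the iterations (3.1) can be instantiated on a toy problem in which every subproblem has a closed form; here ψ1 = ½∥· − c∥² and ψ2 = ½∥·∥² are example choices made for the sake of the sketch:

```python
import numpy as np

def ada_ama(A, c, gamma0=1.0, q=1.5, r=0.75, iters=2000):
    """AdaAMAq,r sketch for min_x 0.5*||x - c||^2 + 0.5*||A x||^2,
    i.e. psi1 = 0.5*||. - c||^2 (strongly convex) and psi2 = 0.5*||.||^2."""
    y_prev = np.zeros(A.shape[0])                     # y^{-1}
    x = c - A.T @ y_prev                              # x^{-1}: argmin psi1 + <y, A.>
    Ax_prev = A @ x
    z = (y_prev + gamma0 * Ax_prev) / (1.0 + gamma0)  # z^0 (closed form for psi2)
    y = y_prev + gamma0 * (Ax_prev - z)               # y^0
    gamma = gamma_prev = gamma0
    for _ in range(iters):
        x = c - A.T @ y                               # (3.1a)
        Ax = A @ x
        dy = y - y_prev
        ny2 = float(dy @ dy)
        if ny2 == 0.0:
            break
        ell = -float((Ax - Ax_prev) @ dy) / ny2       # (3.2)
        L = np.linalg.norm(Ax - Ax_prev) / np.sqrt(ny2)
        denom = (1 - 2 * r) + (gamma * L) ** 2 + 2 * gamma * ell * (r - 1)
        term2 = np.sqrt((1 - r / q) / denom) if denom > 0 else np.inf
        new_gamma = gamma * min(np.sqrt(1 / q + gamma / gamma_prev), term2)  # (3.1b)
        gamma_prev, gamma = gamma, new_gamma
        y_prev, Ax_prev = y, Ax
        z = (y + gamma * Ax) / (1.0 + gamma)          # (3.1c) for psi2 = 0.5||.||^2
        y = y + gamma * (Ax - z)                      # (3.1d)
    return x
```

For this quadratic instance the primal solution is x⋆ = (I + A⊤A)⁻¹c, which the iterates approach without any stepsize tuning.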


The chosen iteration indexing reflects the dependency on the stepsize γk: xk depends on yk but not on γk+1, whereas zk+1 does depend on it. Moreover, this convention is consistent with the relation ∇f(yk) = −Axk. Local Lipschitz estimates of ∇f as in (1.1) are thus expressed as

ℓk = − ⟨Axk − Axk−1, yk − yk−1⟩ / ∥yk − yk−1∥²   and   Lk = ∥Axk − Axk−1∥ / ∥yk − yk−1∥.   (3.2)

Being dually equivalent algorithms, convergence of AdaAMAq,r is deduced from that of AdaPGq,r.

Theorem 3.1 Under Assumption 3.1, for any q > r ≥ 1/2 the sequence (xk)k∈ℕ generated by AdaAMAq,r converges to the (unique) primal solution of (CP), and (yk)k∈ℕ to a solution of the dual problem (D).

4. Numerical simulations
Performance of AdaPGq,r with five different parameter choices from Table 1 is reported through a series of experiments on (i) logistic regression, (ii) cubic regularization for logistic loss, and (iii) regularized least squares. The first two simulations use three standard datasets from the LIBSVM library Chang and Lin (2011), while for lasso synthetic data is generated based on (Nesterov, 2013, §6); for further details the reader is referred to (Latafat et al., 2023b, §4.1), where the same problem setup is used. When applicable, the following algorithms are included in the comparisons.¹

PG-lsb Proximal gradient method with nonmonotone backtracking


Nesterov Nesterov’s acceleration with constant stepsize 1/Lf (Beck, 2017, §10.7)
adaPG (Latafat et al., 2023b, Alg. 2.1)
adaPG-MM Proximal extension of (Malitsky and Mishchenko, 2020, Alg. 1)

The backtracking procedure in PG-lsb is meant in the sense of (Beck, 2017, §10.4.2) (see also (Salzo, 2017, LS1) and (De Marchi and Themelis, 2022, Alg. 3) for the locally Lipschitz smooth case), without enforcing a monotonic decrease of the stepsize sequence. To improve performance, the initial guess for γk+1 is warm-started as bγk, where γk is the accepted value at the previous iteration and b ≥ 1 is a backtracking factor. For each simulation we tested all values of b ∈ {1, 1.1, 1.3, 1.5, 2} and only report the best outcome.
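Our reading of one such backtracking step (a sketch, not the authors' code; the acceptance test is the standard quadratic upper-bound inequality f(x⁺) ≤ f(x) + ⟨∇f(x), x⁺ − x⟩ + ∥x⁺ − x∥²/(2γ), with halving on failure and warm start bγk):

```python
import numpy as np

def pg_linesearch_step(f, grad_f, prox_g, x, gamma_prev, b=1.3):
    """One proximal gradient step with nonmonotone backtracking:
    warm-start the trial stepsize at b * gamma_prev and halve it until
    the quadratic upper-bound test holds (small slack for float safety)."""
    gamma = b * gamma_prev
    fx, gx = f(x), grad_f(x)
    while True:
        x_new = prox_g(x - gamma * gx, gamma)
        d = x_new - x
        if f(x_new) <= fx + gx @ d + (d @ d) / (2 * gamma) + 1e-12:
            return x_new, gamma
        gamma *= 0.5
```

Since the stepsize is allowed to grow again at the next iteration (b ≥ 1), the resulting sequence of accepted stepsizes is nonmonotone, in contrast with classical monotone backtracking.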

5. Conclusions
This paper proposed a general framework for a class of adaptive proximal gradient methods, demonstrating its capacity to extend and tighten existing results when restricting to certain parameter choices. Moreover, application of the developed method was explored in the dual setting, which led to a class of novel adaptive alternating minimization algorithms.
Future research directions include extensions to nonconvex problems, variational inequalities,
and simple bilevel optimization expanding upon Malitsky (2020) and Latafat et al. (2023c). It would
also be interesting to investigate the effectiveness of time-varying parameters in our framework for
further improving performance and worst-case convergence rate guarantees.
1. https://github.com/pylat/adaptive-proximal-algorithms-extended-experiments


[Figure 1 consists of nine convergence plots; only the panel information is reproduced here.]

Figure 1: First row: regularized least squares on synthetic data (m = 100, n = 300, n⋆ = 30; m = 500, n = 1000, n⋆ = 100; m = 4000, n = 1000, n⋆ = 100); second row: ℓ1-regularized logistic regression; third row: cubic regularization with Hessian generated for the logistic loss problem evaluated at zero, the latter two on the a5a (m = 6414, n = 123, density = 0.11), mushroom (m = 8124, n = 112, density = 0.19), and phishing (m = 11055, n = 68, density = 0.44) datasets. Each panel plots φ(xk) − min φ, comparing Nesterov, adaPG-MM, adaPG, and adaPG(q, r) for (q, r) ∈ {(5/3, 5/6), (10/9, 5/6), (3/2, 3/4), (5/2, 1), (1, 1/2)}; for the linesearch method PG-lsb, in each simulation only the best outcome for b ∈ {1, 1.1, 1.3, 1.5, 2} is reported.

Acknowledgments
Work supported by: the Research Foundation Flanders (FWO) postdoctoral grant 12Y7622N and re-
search projects G081222N, G033822N, and G0A0920N; European Union’s Horizon 2020 research
and innovation programme under the Marie Skłodowska-Curie grant agreement No. 953348; Japan
Society for the Promotion of Science (JSPS) KAKENHI grants JP21K17710 and JP24K20737.


References
Jason M. Altschuler and Pablo A. Parrilo. Acceleration by stepsize hedging II: Silver stepsize
schedule for smooth convex optimization. arXiv:2309.16530, 2023.

Heinz H. Bauschke and Patrick L. Combettes. Convex analysis and monotone operator theory in
Hilbert spaces. CMS Books in Mathematics. Springer, 2017. ISBN 978-3-319-48310-8.

Amir Beck. First-Order Methods in Optimization. SIAM, Philadelphia, PA, 2017.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology (TIST), 2:1–27, 2011.

Alberto De Marchi and Andreas Themelis. Proximal gradient algorithms under local Lipschitz
gradient continuity: A convergence and robustness analysis of PANOC. Journal of Optimization
Theory and Applications, 194:771–794, 2022.

Aaron Defazio, Baoyu Zhou, and Lin Xiao. Grad-GradaGrad? A non-monotone adaptive stochastic
gradient method. arXiv:2206.06900, 2022.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of machine learning research, 12(7), 2011.

Alina Ene, Huy L Nguyen, and Adrian Vladu. Adaptive gradient methods for constrained convex
optimization and variational inequalities. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 35, pages 7314–7321, 2021.

Daniel Gabay. Chapter IX Applications of the method of multipliers to variational inequalities. In Michel Fortin and Roland Glowinski, editors, Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems, volume 15 of Studies in Mathematics and Its Applications, pages 299–331. Elsevier, 1983.

Rafal Goebel and R Tyrrell Rockafellar. Local strong convexity and local Lipschitz continuity of
the gradient of convex functions. Journal of Convex Analysis, 15(2):263, 2008.

Tom Goldstein, Brendan O’Donoghue, Simon Setzer, and Richard Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588–1623, 2014.

Benjamin Grimmer, Kevin Shu, and Alex L Wang. Accelerated gradient descent via long steps.
arXiv:2309.09961, 2023.

Maor Ivgi, Oliver Hinder, and Yair Carmon. DoG is SGD’s best friend: A parameter-free dynamic
step size schedule. arXiv:2302.12022, 2023.

Puya Latafat, Andreas Themelis, and Panagiotis Patrinos. On the convergence of adaptive first order
methods: proximal gradient and alternating minimization algorithms. arXiv:2311.18431, 2023a.


Puya Latafat, Andreas Themelis, Lorenzo Stella, and Panagiotis Patrinos. Adaptive proximal algorithms for convex optimization under local Lipschitz continuity of the gradient. arXiv:2301.04431, 2023b.

Puya Latafat, Andreas Themelis, Silvia Villa, and Panagiotis Patrinos. On the convergence of proximal gradient methods for convex simple bilevel optimization. arXiv:2305.03559, 2023c.

Tianjiao Li and Guanghui Lan. A simple uniformly optimal method without line search for convex
optimization. arXiv:2310.10082, 2023.

Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive
stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages
983–992. PMLR, 2019.

Yura Malitsky. Golden ratio algorithms for variational inequalities. Mathematical Programming,
184(1):383–410, 2020.

Yura Malitsky and Konstantin Mishchenko. Adaptive gradient descent without descent. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 6702–6712. PMLR, 2020.

Yura Malitsky and Konstantin Mishchenko. Adaptive proximal gradient method for convex optimization. arXiv:2308.02261, 2023.

Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

Ralph T. Rockafellar. Convex analysis. Princeton University Press, 1970.

Saverio Salzo. The variable metric forward-backward splitting algorithm under mild differentiability assumptions. SIAM Journal on Optimization, 27(4):2153–2181, 2017.

Paul Tseng. Applications of a splitting algorithm to decomposition in convex programming and variational inequalities. SIAM Journal on Control and Optimization, 29(1):119–138, 1991.

Rachel Ward, Xiaoxia Wu, and Leon Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6677–6686. PMLR, 2019.

Alp Yurtsever, Alex Gu, and Suvrit Sra. Three operator splitting with subgradients, stochastic gradients, and adaptive learning rates. Advances in Neural Information Processing Systems, 34:19743–19756, 2021.
