
Proceedings of Machine Learning Research vol 242:197–208, 2024

On the convergence of adaptive first order methods: proximal gradient and alternating minimization algorithms
Puya Latafat  puya.latafat@kuleuven.be
Department of Electrical Engineering (ESAT-STADIUS), KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium

Andreas Themelis  andreas.themelis@ees.kyushu-u.ac.jp
Faculty of Information Science and Electrical Engineering (ISEE), Kyushu University, 744 Motooka, Nishi-ku 819-0395, Fukuoka, Japan

Panagiotis Patrinos  panos.patrinos@esat.kuleuven.be
Department of Electrical Engineering (ESAT-STADIUS), KU Leuven, Kasteelpark Arenberg 10, 3001 Leuven, Belgium

Abstract
Building upon recent works on linesearch-free adaptive proximal gradient methods, this paper proposes AdaPGq,r, a framework that unifies and extends existing results by providing larger stepsize policies and improved lower bounds. Different choices of the parameters q and r are discussed and the efficacy of the resulting methods is demonstrated through numerical simulations. In an attempt to better understand the underlying theory, its convergence is established in a more general setting that allows for time-varying parameters. Finally, an adaptive alternating minimization algorithm is presented by exploring the dual setting. This algorithm not only incorporates additional adaptivity, but also expands its applicability beyond standard strongly convex settings.
Keywords: Convex minimization, proximal gradient method, alternating minimization algorithm,
locally Lipschitz gradient, linesearch-free adaptive stepsizes

1. Introduction
The proximal gradient (PG) method is the natural extension of gradient descent for constrained and nonsmooth problems. It addresses nonsmooth minimization problems by splitting them as

minimize_{x∈ℝⁿ} φ(x) := f(x) + g(x),   (P)

where f is here assumed locally Lipschitz differentiable, and g possibly nonsmooth but with an easy-to-compute proximal mapping, while both are convex (see Assumption 2.1 for details). It has long been known that the performance of first-order methods can be drastically improved by an appropriate stepsize selection, as is evident in the success of linesearch-based approaches.
Substantial effort has been devoted to developing adaptive methods. Most notably, in the context of stochastic (sub)gradient descent, numerous adaptive methods have been proposed starting with Duchi et al. (2011); we only point the reader to a few recent works in this area: Li and Orabona (2019); Ward et al. (2019); Yurtsever et al. (2021); Ene et al. (2021); Defazio et al. (2022); Ivgi et al. (2023). However, although applicable to a more general setting, such approaches tend to suffer from diminishing stepsizes, which can hinder their performance.
Closer to our setting are the recent works Grimmer et al. (2023); Altschuler and Parrilo (2023), which consider smooth optimization problems and propose predefined stepsize patterns. These methods obtain accelerated worst-case rates under global Lipschitz continuity assumptions. We also mention the recent work Li and Lan (2023) in the constrained smooth setting which, while also being bound to a global Lipschitz continuity assumption, uses an adaptive estimate for the Lipschitz modulus and achieves an accelerated worst-case rate.

© 2024 P. Latafat, A. Themelis & P. Patrinos.
In this paper, we extend recent results pioneered in Malitsky and Mishchenko (2020) and later further developed in Latafat et al. (2023b); Malitsky and Mishchenko (2023), where novel (self-)adaptive schemes are developed. We provide a unified analysis that bridges together and improves upon all these works by enabling larger stepsizes and, in some cases, providing tighter lower bounds.
Adaptivity refers to the fact that, in contrast to linesearch methods that employ a look-forward approach based on trial and error to ensure a sufficient descent in the cost, we look backward to yield stepsizes based only on past information. Specifically, we estimate the Lipschitz modulus of ∇f at consecutive iterates xk−1, xk ∈ ℝⁿ generated by the algorithm using the quantities

ℓk := ⟨∇f(xk) − ∇f(xk−1), xk − xk−1⟩ / ∥xk − xk−1∥²   and   Lk := ∥∇f(xk) − ∇f(xk−1)∥ / ∥xk − xk−1∥.   (1.1)

Throughout, we stick to the convention 0/0 = 0 so that ℓk and Lk are well-defined, nonnegative real numbers. In addition, we adhere to 1/0 = ∞. Note also that

ℓk ≤ Lk ≤ Lf,V   (1.2)

holds whenever Lf,V is a Lipschitz modulus for ∇f on a convex set V containing xk−1 and xk. Despite the mere dependence of these quantities on the previous iterates, they provide a sufficiently refined estimate of the local geometry of f. In fact, a carefully designed stepsize update rule not only ensures that the stepsize sequence is separated from zero, but also that a sufficient descent-type inequality can be indirectly ensured between (xk+1, xk) and (xk, xk−1) without any backtracks.
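In code, the two estimates (1.1) are one line each; a minimal sketch (NumPy; the function name is ours):

```python
import numpy as np

def curvature_estimates(grad_new, grad_old, x_new, x_old):
    """Backward-looking Lipschitz estimates l_k and L_k from (1.1).

    Returns (0.0, 0.0) when consecutive iterates coincide, following
    the paper's convention 0/0 = 0.
    """
    dx = x_new - x_old
    dg = grad_new - grad_old
    nx2 = float(dx @ dx)
    if nx2 == 0.0:
        return 0.0, 0.0
    ell = float(dg @ dx) / nx2              # <grad diff, x diff> / ||x diff||^2
    L = np.linalg.norm(dg) / np.sqrt(nx2)   # ||grad diff|| / ||x diff||
    return ell, L
```

For a quadratic f with identity Hessian both quantities equal 1, consistent with ℓk ≤ Lk ≤ Lf,V in (1.2).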
The ultimate deliverable of this manuscript is the general adaptive framework outlined in Algorithm 2.1. A special case of it is here condensed into a two-parameter simplified algorithm.

AdaPGq,r  Fix x−1 ∈ ℝⁿ and γ0 = γ−1 > 0. With ℓk and Lk as in (1.1), starting from x0 = proxγ0g(x−1 − γ0∇f(x−1)), iterate for k = 0, 1, . . .

γk+1 = γk min{ √(1/q + γk/γk−1), √( (1 − r/q) / [γk²Lk² + 2γkℓk(r − 1) − (2r − 1)]₊ ) }   (1.3a)
xk+1 = proxγk+1g(xk − γk+1∇f(xk))   (1.3b)
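In code, the scheme is a few lines per iteration. The following is a sketch (not the authors' implementation; grad_f and prox_g are user-supplied callables, and a nonpositive denominator in (1.3a) is read through [·]₊ as an infinite second term):

```python
import numpy as np

def adapg(grad_f, prox_g, x_init, gamma0, q=1.5, r=0.75, max_iter=500):
    """AdaPGq,r sketch: linesearch-free adaptive proximal gradient, cf. (1.3).

    prox_g(v, gamma) should evaluate prox_{gamma*g}(v).
    """
    x_old = x_init                              # plays the role of x^{-1}
    g_old = grad_f(x_old)
    gamma = gamma_prev = gamma0
    x = prox_g(x_old - gamma * g_old, gamma)    # x^0
    for _ in range(max_iter):
        g = grad_f(x)
        dx, dg = x - x_old, g - g_old
        nx2 = float(dx @ dx)
        if nx2 == 0.0:
            break                               # consecutive iterates coincide
        ell = float(dg @ dx) / nx2              # l_k in (1.1)
        L = np.linalg.norm(dg) / np.sqrt(nx2)   # L_k in (1.1)
        denom = (gamma * L) ** 2 + 2 * gamma * ell * (r - 1) - (2 * r - 1)
        term2 = np.sqrt((1 - r / q) / denom) if denom > 0 else np.inf
        term1 = np.sqrt(1 / q + gamma / gamma_prev)
        gamma_prev, gamma = gamma, gamma * min(term1, term2)   # (1.3a)
        x_old, g_old = x, g
        x = prox_g(x - gamma * g, gamma)        # (1.3b)
    return x
```

For instance, with grad_f(x) = x − b and prox_g the identity (g = 0), the iterates converge to b; for an ℓ1 regularizer, prox_g would be soft-thresholding.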

Theorem 1.1 Under Assumption 2.1, for any q > r ≥ 1/2 the sequence (xk)k∈ℕ generated by AdaPGq,r converges to some x⋆ ∈ arg min φ. If in addition q ≤ (3 + √5)/2, then

γk ≥ γmin := √( (1 − r/q) / max{1, q} ) · 1/Lf,V   holds for all k ≥ 2 [log_{1+1/q}( 1/(γ0 Lf,V) )]₊,

where Lf,V is a Lipschitz modulus for ∇f on a convex and compact set V that contains (xk)k∈ℕ. Moreover, min_{k≤K} (φ(xk) − min φ) ≤ U1(x⋆) / ∑_{k=1}^{K+1} γk holds for every K ≥ 1, where U1(x⋆) is as in (2.4).

The above worst-case sublinear rate depends on the aggregate of the stepsize sequence, providing a partial explanation for the fast convergence of the algorithm observed in practice. AdaPGq,r and Theorem 1.1 are particular instances of the general framework provided in Section 2; see Remark 2.2 and Theorem 2.7 for the details. Specific choices of the parameters q, r nevertheless allow AdaPGq,r to embrace and extend existing algorithms:
• r = 1/2 and q = 1. Then, γk+1 = γk min{ √(1 + γk/γk−1), √( 1 / (2[γk²Lk² − γkℓk]₊) ) } coincides with the update in (Latafat et al., 2023b, Alg. 2.1) with the second term improved by a factor of √2.

• Owing to the relation γk²Lk² − γkℓk ≤ γk²Lk², the case above is also a proximal extension of (Malitsky and Mishchenko, 2023, Alg. 1), which considers γk+1 = min{ γk√(1 + γk/γk−1), 1/(√2 Lk) } when g = 0, and which in turn is also a strict improvement over the previous work Malitsky and Mishchenko (2020).

• r = 3/4 and q = 3/2. Then, γk+1 = γk min{ √(2/3 + γk/γk−1), √( 1 / [2γk²Lk² − 1 − γkℓk]₊ ) } recovers the update rule (Malitsky and Mishchenko, 2023, Alg. 2) (in fact tighter because of the extra −γkℓk term).
The interplay between the parameters can then be understood by noting that √(1/q + γk/γk−1) allows the algorithm to recover from a potentially small stepsize, which can only decrease for a controlled number of iterations and will then rapidly enter a phase where it increases linearly until a certain threshold is reached; see the proof of Theorem 2.7. A smaller q allows for a more aggressive recovery, but comes at the cost of a more conservative second term. As for r, values in the range [1/2, 1], such as in the combinations reported in Table 1, work well in practice.
Table 1. Suggested options for q and r in AdaPGq,r. Green cells strike a nice balance between aggressive increases and large lower bounds (γmin) for the stepsize sequence, while the orange cell yields the largest theoretical lower bound. Here, Lf,V is a local Lipschitz modulus for ∇f as in Theorem 1.1.

1 − r/q    q      r      γmin Lf,V
1/4       10/9    5/6    (3/2)√(1/10) ≈ 0.47
2/5        8/5   24/25   1/2
1/2        5/3    5/6    √(3/10) ≈ 0.55
1/2        3/2    3/4    1/√3 ≈ 0.57
1/2        1      1/2    1/√2 ≈ 0.71
3/5        5/2    1      √6/5 ≈ 0.49
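The last column of Table 1 is the closed form γmin Lf,V = √((1 − r/q)/max{1, q}) from Theorem 1.1; a quick numerical check of the reported rows:

```python
from fractions import Fraction
from math import sqrt

def gamma_min_L(q, r):
    """gamma_min * L_{f,V} = sqrt((1 - r/q) / max(1, q)), cf. Theorem 1.1."""
    return sqrt((1 - r / q) / max(1.0, q))

# the (q, r) pairs of Table 1
rows = [(Fraction(10, 9), Fraction(5, 6)), (Fraction(8, 5), Fraction(24, 25)),
        (Fraction(5, 3), Fraction(5, 6)),  (Fraction(3, 2), Fraction(3, 4)),
        (Fraction(1), Fraction(1, 2)),     (Fraction(5, 2), Fraction(1))]
for q, r in rows:
    print(f"q={q}, r={r}: 1-r/q={1 - r/q}, gamma_min*L={gamma_min_L(float(q), float(r)):.2f}")
```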

As a final contribution, an adaptive variant of the alternating minimization algorithm (AMA) of Tseng (1991) is proposed that addresses composite problems of the form

minimize_{x∈ℝⁿ} ψ1(x) + ψ2(Ax).   (CP)

AMA is particularly interesting in settings where ψ1 is either nonsmooth or its gradient is computationally demanding. Its convergence was established in Tseng (1991) by framing it as the dual form of the splitting method introduced in Gabay (1983), and acceleration techniques have also been adapted to this setting Goldstein et al. (2014). In contrast to existing methods, ours not only incorporates an adaptive stepsize mechanism but also relaxes the strong convexity assumption to mere local strong convexity, see Assumption 3.1 for details. Due to space limitations, some proofs are deferred to the preprint version Latafat et al. (2023a).


2. A general framework for adaptive proximal gradient methods


In this section we consider plain proximal gradient iterations of the form

xk+1 = proxγk+1g(xk − γk+1∇f(xk)),   (2.1)

where (γk)k∈ℕ is a sequence of strictly positive stepsize parameters. The main oracles of the method are gradient and proximal maps (see (Beck, 2017, §6) for examples of proximable functions). Whenever g is convex, for any γ > 0 it is well known that proxγg is firmly nonexpansive (Bauschke and Combettes, 2017, §4.1 and Prop. 12.28), a property stronger than Lipschitz continuity. We here show that even when the stepsizes are time-varying as in (2.1) a similar property still holds for the iterates therein. This fact is a refinement of (Malitsky and Mishchenko, 2023, Lem. 12) that follows after an application of Cauchy-Schwarz and that will be used in our main descent inequality.

Lemma 2.1 (FNE-like inequality) Suppose that g is convex and that f is differentiable. Then, for any (γk)k∈ℕ ⊂ ℝ++, with Hk := id − γk∇f and ρk+1 := γk+1/γk, proximal gradient iterates (2.1) satisfy

∥xk+1 − xk∥² ≤ ρk+1 ⟨Hk(xk−1) − Hk(xk), xk − xk+1⟩ ≤ ρ²k+1 ∥Hk(xk−1) − Hk(xk)∥².   (2.2)

Throughout, we study problem (P) under the following assumptions.

Assumption 2.1 (Requirements for problem (P))

A1  f : ℝⁿ → ℝ is convex and has locally Lipschitz continuous gradient.
A2  g : ℝⁿ → ℝ ∪ {+∞} is proper, lsc, and convex.
A3  There exists x⋆ ∈ arg min(f + g).

The main adaptive framework, involving two time-varying parameters qk, ξk, is given in Algorithm 2.1. The shorthand notation 𝟙γkℓk≥1 equals 1 if γkℓk ≥ 1 and 0 otherwise, while for t ∈ ℝ we denote [t]₊ := max{0, t}.

Algorithm 2.1 General adaptive proximal gradient framework

REQUIRE  starting point x−1 ∈ ℝⁿ, stepsizes γ0 = γ−1 > 0,
         parameters 1/2 < qmin ≤ qmax, 0 < ξmin ≤ 2qmin − 1
INITIALIZE  x0 = proxγ0g(x−1 − γ0∇f(x−1)), ρ0 = 1, q0 ∈ [qmin, qmax], ξ0 ≥ ξmin
REPEAT FOR k = 0, 1, . . . until convergence
2.1.1:  Let ℓk and Lk be as in (1.1), and choose qk+1, ξk+1 such that
          qmin ≤ qk+1 ≤ min{qmax, qk + 𝟙γkℓk≥1},
          ξk+1 ≥ ξmin,  rk+1 := qk+1/(1 + ξk+1) ≥ 1/2
2.1.2:  γk+1 = γk min{ √( (1 + qkρk)/qk+1 ), √( rk+1ξk / (qk+1[γk²Lk² + 2γkℓk(rk+1 − 1) − (2rk+1 − 1)]₊) ) }
2.1.3:  Set ρk+1 = γk+1/γk and update xk+1 = proxγk+1g(xk − γk+1∇f(xk))
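For concreteness, steps 2.1.1–2.1.2 can be written as a single update function; the sketch below is an illustrative reading with the free choices of qk+1 and ξk+1 resolved in one particular way (largest admissible q, smallest admissible ξ; function and variable names are ours):

```python
import math

def step_sizes_general(gamma, gamma_prev, ell, L, q, xi, q_min, q_max, xi_min):
    """One pass of steps 2.1.1-2.1.2 of Algorithm 2.1 (a sketch)."""
    # step 2.1.1: q may grow by at most 1, and only when gamma*ell >= 1
    indicator = 1.0 if gamma * ell >= 1.0 else 0.0
    q_next = max(q_min, min(q_max, q + indicator))
    xi_next = xi_min                        # any xi_{k+1} >= xi_min is allowed
    r_next = q_next / (1.0 + xi_next)
    assert r_next >= 0.5, "parameters must keep r_{k+1} >= 1/2"
    # step 2.1.2: backward-looking stepsize update
    rho = gamma / gamma_prev
    term1 = math.sqrt((1.0 + q * rho) / q_next)
    denom = q_next * ((gamma * L) ** 2 + 2 * gamma * ell * (r_next - 1.0)
                      - (2.0 * r_next - 1.0))
    term2 = math.sqrt(r_next * xi / denom) if denom > 0 else math.inf
    return gamma * min(term1, term2), q_next, xi_next
```

With qk ≡ qmin = qmax and ξk ≡ ξmin this collapses to the AdaPGq,r rule (1.3a), as noted in Remark 2.2.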


Remark 2.2 (relation to AdaPGq,r). Whenever qk ≡ qmin = qmax =: q and ξk ≡ ξmin =: ξ, the conditions in Algorithm 2.1 reduce to q > 1/2, r = q/(ξ + 1) ≥ 1/2, and ξ = q/r − 1 > 0; equivalently, q > r ≥ 1/2 as in Theorem 1.1.

In what follows, for x ∈ dom φ we adopt the notation

Pk(x) := φ(xk) − φ(x).   (2.3)

Our convergence analysis revolves around showing that under appropriate stepsize update and parameter selection the function

Uk(x) := ½∥xk − x∥² + γk(1 + qkρk)Pk−1(x) + (ξk/2)∥xk − xk−1∥²,   (2.4)

monotonically decreases along the iterates for all x ∈ arg min φ. The main inequality, outlined in
Theorem 2.3, extends the one in (Latafat et al., 2023b, Eq. (2.8)) by blending it with Lemma 2.1.
This combination is achieved by adding and subtracting a multiple of the residual scaled by a newly
added parameter ξk . The proof is otherwise adapted from that of (Latafat et al., 2023b, Lem. 2.2),
and is included in full detail in the preprint version.

Theorem 2.3 (main PG inequality) Consider a sequence (xk)k∈ℕ generated by PG iterations (2.1) under Assumption 2.1, and denote ρk+1 := γk+1/γk. Then, for any x ∈ dom φ, qk, ξk ≥ 0 and νk > 0, k ∈ ℕ,

Uk+1(x) ≤ Uk(x) − γk(1 + qkρk − qk+1ρk+1)Pk−1(x)
          − ½∥xk − xk−1∥² [ 1 + ξk − 1/νk − ρ²k+1(νk+1 + ξk+1)( γk²Lk² + 2γkℓk( qk+1/(νk+1 + ξk+1) − 1 ) − ( 2qk+1/(νk+1 + ξk+1) − 1 ) ) ],   (2.5)

where Uk(x) is as in (2.4). In particular, with νk ≡ 1, if φ(x) ≤ inf_{k∈ℕ} φ(xk) (for instance, if x ∈ arg min φ), and

0 < ρ²k+1 ≤ min{ (1 + qkρk)/qk+1 , ξk / ( (1 + ξk+1)[ γk²Lk² + 2γkℓk( qk+1/(1 + ξk+1) − 1 ) + 1 − 2qk+1/(1 + ξk+1) ]₊ ) }   (2.6)

(with ξk > 0) holds for every k, then the coefficients of Pk−1(x) and ∥xk − xk−1∥² in (2.5) are negative, Uk+1(x) ≤ Uk(x), and thus (Uk(x))k∈ℕ converges and (xk)k∈ℕ is bounded.

Consistently with what was first observed in Malitsky and Mishchenko (2020), inequality (2.6) confirms that stepsizes should both not grow too fast and be controlled by the local curvature of f. We next show that, under a technical condition on qk, all that remains to do is ensuring that the stepsizes do not vanish, which is precisely the reason behind the restrictions on the parameters qk and ξk prescribed in Algorithm 2.1, as Theorem 2.7 will ultimately demonstrate. The technical condition turns out to be a controlled growth of qk, needed to guarantee that a sequence ϱk+1 ≈ √( (1 + qkϱk)/qk+1 ) will eventually stay above 1.


Theorem 2.4 (convergence of PG with nonvanishing stepsizes) Consider the iterates generated by (2.1) under Assumption 2.1, with γk+1 = γkρk+1 complying with (2.6). If qk+1 ≤ 1 + qk holds for every k and inf_{k∈ℕ} γk > 0, then:

(i) The (bounded) sequence (xk)k∈ℕ has exactly one optimal accumulation point.
(ii) If, in addition, (qk)k∈ℕ and (ξk)k∈ℕ are chosen bounded and bounded away from zero, then the entire sequence (xk)k∈ℕ converges to a solution x⋆ ∈ arg min φ, and Uk(x⋆) ↘ 0.

We now turn to the last piece of the puzzle, namely enforcing a strictly positive lower bound on
the stepsizes. The following elementary lemma provides the key insight to achieve this.

Lemma 2.5 Let f be convex and differentiable, and consider the iterates generated by Algorithm 2.1. Then, for every k ∈ ℕ such that γkℓk < 1 it holds that

γk+1 ≥ min{ γk √(1/qmax + ρk), √(ξmin rmin/qmax) · 1/Lk }.   (2.7)

Proof We start by observing that the assumptions on f guarantee that 0 ≤ ℓk ≤ Lk. The (squared) second term in the minimum of step 2.1.2 can be lower bounded as follows:

rk+1ξk / (qk+1[γk²Lk² + 2γkℓk(rk+1 − 1) + (1 − 2rk+1)]₊) ≥ ξmin rmin / (qmax[γk²Lk² + 2γkℓk(rmin − 1) + (1 − 2rmin)]₊)
  = ξmin rmin / (qmax[γk²Lk² − γkℓk + (γkℓk − 1)(2rmin − 1)]₊) ≥ ξmin rmin / (qmax γk²Lk²),   (2.8)

where the first inequality follows from the fact that the left-hand side is increasing with respect to rk+1, and the second inequality follows since rmin ≥ 1/2. In turn, the claimed inequality (2.7) follows from the fact that qk+1 ≤ qk ≤ qmax whenever γkℓk < 1, see step 2.1.1.

This lemma already hints at a potential lower bound for the stepsize, since boundedness of the sequence (xk)k∈ℕ ensures lower boundedness of the second term in the minimum. As for the first term, as long as qk is upper bounded, the stepsize can only decrease for a controlled number of iterations. This argument will be formally completed in the proof of Theorem 2.7, where the following notation will be instrumental.

Definition 2.6 Let ε > 0. With ϱ1 = √ε and ϱt+1 = √(ε + ϱt) for t ≥ 1, we denote

tε := max{t ∈ ℕ | ϱ1, . . . , ϱt < 1}   and   m(ε) := ∏_{t=1}^{tε} ϱt.

Notice that m(ε) ≤ 1 and equality holds iff ε ≥ 1 (equivalently, iff tε = 0). For ε ∈ (0, 1), tε is a well-defined strictly positive integer, owing to the monotonic increase of ϱt and its convergence to the positive root of the equation ϱ² − ϱ − ε = 0. In particular, m(ε) ≤ ϱ1 = √ε and identity holds if tε = 1, that is, √(ε + √ε) ≥ 1 (and ε < 1). This leads to a partially explicit expression

tε ≤ 1 and m(ε) = min{1, √ε}   if ε ≥ (3 − √5)/2 ≈ 0.382,
1 < tε ≤ ⌈ 1/(ε(2 − ε)) ⌉ and √ε^tε < m(ε) < √ε   otherwise.   (2.9)

The bound on tε in the second case is obtained by observing that

1 > ϱtε = (ϱtε − ϱ²tε) + ε + ϱtε−1 = · · · = ∑_{t=1}^{tε} (ϱt − ϱt²) + tε ε ≥ (tε − 1)ε(1 − ε) + (tε − 1)ε,

where we used the fact that ε ≤ ϱt ≤ 1 − ε, and thus ϱt − ϱt² ≥ ε(1 − ε), for t = 1, . . . , tε − 1. We also remark that the lower bound in the simplified setting of Theorem 1.1 pertains to the case when tε = 1, since ε = 1/q falls under the first case above.
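Definition 2.6 and the bounds in (2.9) are easy to check numerically; a small sketch (function name ours):

```python
from math import sqrt, prod

def t_and_m(eps):
    """t_eps = number of leading rho_t below 1, m(eps) = their product,
    with rho_1 = sqrt(eps), rho_{t+1} = sqrt(eps + rho_t) (Definition 2.6).
    The recursion always exits since rho_t increases to a root > 1."""
    rhos = []
    rho = sqrt(eps)
    while rho < 1.0:
        rhos.append(rho)
        rho = sqrt(eps + rho)
    return len(rhos), prod(rhos) if rhos else 1.0

# the threshold (3 - sqrt(5))/2 ~ 0.382 separates t_eps <= 1 from t_eps > 1
print(t_and_m(0.5))   # above the threshold: one decreasing step, m = sqrt(eps)
print(t_and_m(0.1))   # below: several decreasing steps are possible
```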
Theorem 2.7 (convergence of Algorithm 2.1) Under Assumption 2.1, the sequence (xk)k∈ℕ generated by Algorithm 2.1 converges to a solution x⋆ ∈ arg min φ and (Uk(x⋆))k∈ℕ ↘ 0. Moreover, there exists k0 ≤ 2 [log_{1+1/qmax}( 1/(γ0 Lf,V) )]₊ such that

γk ≥ γmin := m(1/qmax) √( ξmin rmin / qmax ) · 1/Lf,V   ∀k ≥ k0,

where rmin := inf_{k∈ℕ} rk ≥ 1/2, m(·) is as in Definition 2.6 (see also (2.9)), and Lf,V is a Lipschitz modulus for ∇f on a compact convex set V that contains (xk)k∈ℕ.
Proof The conditions prescribed in step 2.1.1 entail that the requirements of Theorem 2.4(ii) are met, so that the proof reduces to showing the claimed lower bound on (γk)k∈ℕ. Boundedness of the sequence (xk)k∈ℕ established in Theorem 2.3 ensures the existence of Lf,V > 0 as in the statement. In particular, recall that ℓk ≤ Lk ≤ Lf,V holds for all k ∈ ℕ, cf. (1.2). Lemma 2.5 then yields that

γkℓk < 1  ⇒  γk+1 ≥ min{ γk √(1/qmax + ρk), √(ξmin rmin/qmax) · 1/Lf,V }.   (2.10)

We first show that γk0 Lf,V ≥ √(ξmin rmin/qmax) holds for some k0 ≥ 0 upper bounded as in the statement. To this end, suppose that γk Lf,V < √(ξmin rmin/qmax) for k = 0, 1, . . . , K. The bounds ξk ≥ ξmin and qk ≤ qmax enforced in step 2.1.1 imply that 1/2 ≤ rmin ≤ rk ≤ qmax/(1 + ξmin) for any k. In particular, ξmin rmin/qmax ≤ ξmin rk/qmax ≤ ξmin/(1 + ξmin) < 1 holds for every k. Then, γkℓk ≤ γk Lf,V < 1, and (2.10) holds true for all such k, leading to γk+1 ≥ γk √(1/qmax + ρk) for k = 0, . . . , K − 1. Since ρ0 ≥ 1, it follows that ρk+1 = γk+1/γk ≥ √(1/qmax + 1) for k = 0, . . . , K − 1. Thus,

1 > ξmin rmin/qmax > (γK Lf,V)² ≥ (1 + 1/qmax)(γK−1 Lf,V)² ≥ · · · ≥ (1 + 1/qmax)^K (γ0 Lf,V)²,

from which the existence of k0 bounded as in the statement follows.

Let k ≥ k0 be an index such that γk Lf,V ≥ √(ξmin rmin/qmax), and suppose that γk+t Lf,V < √(ξmin rmin/qmax) for t = 1, . . . , T. As before, the inequalities in (2.10) hold true for all such iterates, leading to

ρk+t ≥ √(1/qmax + ρk+t−1),  t = 1, . . . , T + 1,  and in particular  ρk+1 ≥ √(1/qmax).

It then follows from the definition of m(ε) and tε as in Definition 2.6 with ε = 1/qmax that γk+t = γk+t−1 ρk+t can only decrease for at most t ≤ tε iterations (that is, T ≤ tε), at the end of which

γk+t = ( ∏_{i=1}^{t} ρk+i ) γk ≥ m(1/qmax) γk ≥ m(1/qmax) √(ξmin rmin/qmax) · 1/Lf,V = γmin,

and then increases linearly up to when it is again larger than √(ξmin rmin/qmax) · 1/Lf,V, proving that γk ≥ γmin holds for all k ≥ k0.


3. A class of adaptive alternating minimization algorithms


Leveraging an interpretation of AMA as the dual of the proximal gradient method, an adaptive variant is developed for solving (CP) under the following assumptions.

Assumption 3.1 (requirements for problem (CP))

A1∗ ψ1 : ℝⁿ → ℝ ∪ {+∞} is proper, closed, locally strongly convex, and 1-coercive;
A2∗ ψ2 : ℝᵐ → ℝ ∪ {+∞} is proper, convex and closed;
A3∗ A ∈ ℝᵐ×ⁿ and there exists x ∈ relint dom ψ1 such that Ax ∈ relint dom ψ2.
Under these requirements, problem (CP) admits a unique solution x⋆, and by virtue of (Rockafellar, 1970, Thm.s 23.8, 23.9, and Cor. 31.2.1) also its dual

minimize_{y∈ℝᵐ} ψ1∗(−A⊤y) + ψ2∗(y)   (D)

has solutions y⋆ characterized by y⋆ ∈ ∂ψ2(Ax⋆) and −A⊤y⋆ ∈ ∂ψ1(x⋆), and strong duality holds. In fact, Assumption 3.1.A1∗ ensures that the conjugate ψ1∗ is a (real-valued) locally Lipschitz differentiable function (Goebel and Rockafellar, 2008, Thm. 4.1). We note that the weaker notion of local strong monotonicity of ∂ψ1 relative to its graph would suffice, and that this minor departure from the reference is used for simplicity of exposition. Problem (D) can then be addressed with the proximal gradient iterations y⁺ = proxγg(y − γ∇f(y)) analyzed in the previous section, with f := ψ1∗ ∘ (−A⊤) and g := ψ2∗. In terms of primal variables x and z, these iterations result in the alternating minimization algorithm. We here reproduce the simple textbook steps. First, observe that

x = ∇ψ1∗(−A⊤y) ⇔ −A⊤y ∈ ∂ψ1(x) ⇔ 0 ∈ A⊤y + ∂ψ1(x) = ∂[⟨A·, y⟩ + ψ1](x).

Hence, by strict convexity, x = arg min{ψ1 + ⟨A·, y⟩}. By the Moreau decomposition,

proxγg(y − γ∇f(y)) = proxγψ2∗(y + γAx) = y + γAx − γ proxψ2/γ(γ⁻¹y + Ax).
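The Moreau decomposition used in the last identity, proxγψ∗(v) = v − γ proxψ/γ(v/γ), can be sanity-checked numerically; a sketch with ψ = ∥·∥1, whose conjugate is the indicator of the unit sup-norm ball (so the left-hand side is simply a projection, i.e. a componentwise clip):

```python
import numpy as np

def prox_l1(v, t):
    """prox of t*||.||_1: soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_conj(v, gamma):
    """prox_{gamma*psi^*}(v) via Moreau: v - gamma * prox_{psi/gamma}(v/gamma),
    instantiated for psi = ||.||_1, whose scaled prox is soft-thresholding
    with level 1/gamma."""
    return v - gamma * prox_l1(v / gamma, 1.0 / gamma)

v = np.array([-3.0, 0.4, 2.0])
# psi^* is the indicator of [-1, 1]^n, so prox_conj must coincide with a clip:
print(prox_conj(v, 0.5), np.clip(v, -1.0, 1.0))
```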

AMA iterations thus generate a sequence (yk)k∈ℕ given in (3.1), where

Lγ(x, z, y) := ψ1(x) + ψ2(z) + ⟨y, Ax − z⟩ + (γ/2)∥Ax − z∥²

is the γ-augmented Lagrangian associated to (CP).

AdaAMAq,r  Fix y−1 ∈ ℝᵐ and γ0 = γ−1 > 0. With ℓk and Lk as in (3.2), starting from

x−1 = arg min_{x∈ℝⁿ} { ψ1(x) + ⟨y−1, Ax⟩ }
z0 = arg min_{z∈ℝᵐ} Lγ0(x−1, z, y−1)
y0 = y−1 + γ0(Ax−1 − z0),

iterate for k = 0, 1, . . .

xk = arg min_{x∈ℝⁿ} { ψ1(x) + ⟨yk, Ax⟩ }   ( = arg min_{x∈ℝⁿ} L0(x, zk, yk))   (3.1a)
γk+1 = γk min{ √(1/q + γk/γk−1), √( (1 − r/q) / [(1 − 2r) + γk²Lk² + 2γkℓk(r − 1)]₊ ) }   (3.1b)
zk+1 = proxψ2/γk+1(γk+1⁻¹yk + Axk)   ( = arg min_{z∈ℝᵐ} Lγk+1(xk, z, yk))   (3.1c)
yk+1 = yk + γk+1(Axk − zk+1)   (3.1d)
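As an illustration of the mechanics (a sketch, not the authors' implementation), the iterations (3.1) can be instantiated on a toy problem in which every subproblem has a closed form; here ψ1 = ½∥· − c∥² and ψ2 = ½∥·∥² are example choices made for the sake of the sketch:

```python
import numpy as np

def ada_ama(A, c, gamma0=1.0, q=1.5, r=0.75, iters=2000):
    """AdaAMAq,r sketch for min_x 0.5*||x - c||^2 + 0.5*||A x||^2,
    i.e. psi1 = 0.5*||. - c||^2 (strongly convex) and psi2 = 0.5*||.||^2."""
    y_prev = np.zeros(A.shape[0])                     # y^{-1}
    x = c - A.T @ y_prev                              # x^{-1}: argmin psi1 + <y, A.>
    Ax_prev = A @ x
    z = (y_prev + gamma0 * Ax_prev) / (1.0 + gamma0)  # z^0 (closed form for psi2)
    y = y_prev + gamma0 * (Ax_prev - z)               # y^0
    gamma = gamma_prev = gamma0
    for _ in range(iters):
        x = c - A.T @ y                               # (3.1a)
        Ax = A @ x
        dy = y - y_prev
        ny2 = float(dy @ dy)
        if ny2 == 0.0:
            break
        ell = -float((Ax - Ax_prev) @ dy) / ny2       # (3.2)
        L = np.linalg.norm(Ax - Ax_prev) / np.sqrt(ny2)
        denom = (1 - 2 * r) + (gamma * L) ** 2 + 2 * gamma * ell * (r - 1)
        term2 = np.sqrt((1 - r / q) / denom) if denom > 0 else np.inf
        new_gamma = gamma * min(np.sqrt(1 / q + gamma / gamma_prev), term2)  # (3.1b)
        gamma_prev, gamma = gamma, new_gamma
        y_prev, Ax_prev = y, Ax
        z = (y + gamma * Ax) / (1.0 + gamma)          # (3.1c) for psi2 = 0.5||.||^2
        y = y + gamma * (Ax - z)                      # (3.1d)
    return x
```

For this quadratic instance the primal solution is x⋆ = (I + A⊤A)⁻¹c, which the iterates approach without any stepsize tuning.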


The chosen iteration indexing reflects the dependency on the stepsize γk: xk depends on yk but not on γk+1, whereas zk+1 does depend on it. Moreover, this convention is consistent with the relation ∇f(yk) = −Axk. Local Lipschitz estimates of ∇f as in (1.1) are thus expressed as

ℓk = − ⟨Axk − Axk−1, yk − yk−1⟩ / ∥yk − yk−1∥²   and   Lk = ∥Axk − Axk−1∥ / ∥yk − yk−1∥.   (3.2)

Being dually equivalent algorithms, convergence of AdaAMAq,r is deduced from that of AdaPGq,r.

Theorem 3.1 Under Assumption 3.1, for any q > r ≥ 1/2 the sequence (xk)k∈ℕ generated by AdaAMAq,r converges to the (unique) primal solution of (CP), and (yk)k∈ℕ to a solution of the dual problem (D).

4. Numerical simulations
Performance of AdaPGq,r with five different parameter choices from Table 1 is reported through a series of experiments on (i) logistic regression, (ii) cubic regularization for logistic loss, and (iii) regularized least squares. The first two simulations use three standard datasets from the LIBSVM library Chang and Lin (2011), while for lasso synthetic data is generated based on (Nesterov, 2013, §6); for further details the reader is referred to (Latafat et al., 2023b, §4.1), where the same problem setup is used. When applicable, the following algorithms are included in the comparisons.¹

PG-lsb Proximal gradient method with nonmonotone backtracking


Nesterov Nesterov’s acceleration with constant stepsize 1/Lf (Beck, 2017, §10.7)
adaPG (Latafat et al., 2023b, Alg. 2.1)
adaPG-MM Proximal extension of (Malitsky and Mishchenko, 2020, Alg. 1)

The backtracking procedure in PG-lsb is meant in the sense of (Beck, 2017, §10.4.2) (see also (Salzo, 2017, LS1) and (De Marchi and Themelis, 2022, Alg. 3) for the locally Lipschitz smooth case), without enforcing a monotonic decrease of the stepsize sequence. To improve performance, the initial guess for γk+1 is warm-started as bγk, where γk is the accepted value at the previous iteration and b ≥ 1 is a backtracking factor. For each simulation we tested all values of b ∈ {1, 1.1, 1.3, 1.5, 2} and only report the best outcome.
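Our reading of one such backtracking step (a sketch, not the authors' code; the acceptance test is the standard quadratic upper-bound inequality f(x⁺) ≤ f(x) + ⟨∇f(x), x⁺ − x⟩ + ∥x⁺ − x∥²/(2γ), with halving on failure and warm start bγk):

```python
import numpy as np

def pg_linesearch_step(f, grad_f, prox_g, x, gamma_prev, b=1.3):
    """One proximal gradient step with nonmonotone backtracking:
    warm-start the trial stepsize at b * gamma_prev and halve it until
    the quadratic upper-bound test holds (small slack for float safety)."""
    gamma = b * gamma_prev
    fx, gx = f(x), grad_f(x)
    while True:
        x_new = prox_g(x - gamma * gx, gamma)
        d = x_new - x
        if f(x_new) <= fx + gx @ d + (d @ d) / (2 * gamma) + 1e-12:
            return x_new, gamma
        gamma *= 0.5
```

Since the stepsize is allowed to grow again at the next iteration (b ≥ 1), the resulting sequence of accepted stepsizes is nonmonotone, in contrast with classical monotone backtracking.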

5. Conclusions
This paper proposed a general framework for a class of adaptive proximal gradient methods, demonstrating its capacity to extend and tighten existing results when restricting to certain parameter choices. Moreover, application of the developed method was explored in the dual setting, which led to a class of novel adaptive alternating minimization algorithms.
Future research directions include extensions to nonconvex problems, variational inequalities,
and simple bilevel optimization expanding upon Malitsky (2020) and Latafat et al. (2023c). It would
also be interesting to investigate the effectiveness of time-varying parameters in our framework for
further improving performance and worst-case convergence rate guarantees.
1. https://github.com/pylat/adaptive-proximal-algorithms-extended-experiments


[Figure 1 consists of nine convergence plots; only the panel information is reproduced here.]

Figure 1: First row: regularized least squares on synthetic data (m = 100, n = 300, n⋆ = 30; m = 500, n = 1000, n⋆ = 100; m = 4000, n = 1000, n⋆ = 100); second row: ℓ1-regularized logistic regression; third row: cubic regularization with Hessian generated for the logistic loss problem evaluated at zero, the latter two on the a5a (m = 6414, n = 123, density = 0.11), mushroom (m = 8124, n = 112, density = 0.19), and phishing (m = 11055, n = 68, density = 0.44) datasets. Each panel plots φ(xk) − min φ, comparing Nesterov, adaPG-MM, adaPG, and adaPG(q, r) for (q, r) ∈ {(5/3, 5/6), (10/9, 5/6), (3/2, 3/4), (5/2, 1), (1, 1/2)}; for the linesearch method PG-lsb, in each simulation only the best outcome for b ∈ {1, 1.1, 1.3, 1.5, 2} is reported.

Acknowledgments
Work supported by: the Research Foundation Flanders (FWO) postdoctoral grant 12Y7622N and re-
search projects G081222N, G033822N, and G0A0920N; European Union’s Horizon 2020 research
and innovation programme under the Marie Skłodowska-Curie grant agreement No. 953348; Japan
Society for the Promotion of Science (JSPS) KAKENHI grants JP21K17710 and JP24K20737.


References
Jason M. Altschuler and Pablo A. Parrilo. Acceleration by stepsize hedging II: Silver stepsize
schedule for smooth convex optimization. arXiv:2309.16530, 2023.

Heinz H. Bauschke and Patrick L. Combettes. Convex analysis and monotone operator theory in
Hilbert spaces. CMS Books in Mathematics. Springer, 2017. ISBN 978-3-319-48310-8.

Amir Beck. First-Order Methods in Optimization. SIAM, Philadelphia, PA, 2017.

Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM
Transactions on Intelligent Systems and Technology (TIST), 2:1–27, 2011.

Alberto De Marchi and Andreas Themelis. Proximal gradient algorithms under local Lipschitz
gradient continuity: A convergence and robustness analysis of PANOC. Journal of Optimization
Theory and Applications, 194:771–794, 2022.

Aaron Defazio, Baoyu Zhou, and Lin Xiao. Grad-GradaGrad? A non-monotone adaptive stochastic
gradient method. arXiv:2206.06900, 2022.

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of machine learning research, 12(7), 2011.

Alina Ene, Huy L Nguyen, and Adrian Vladu. Adaptive gradient methods for constrained convex
optimization and variational inequalities. In Proceedings of the AAAI Conference on Artificial
Intelligence, volume 35, pages 7314–7321, 2021.

Daniel Gabay. Chapter IX Applications of the method of multipliers to variational inequalities. In Michel Fortin and Roland Glowinski, editors, Augmented Lagrangian Methods: Applications to the Numerical Solution of Boundary-Value Problems, volume 15 of Studies in Mathematics and Its Applications, pages 299–331. Elsevier, 1983.

Rafal Goebel and R Tyrrell Rockafellar. Local strong convexity and local Lipschitz continuity of
the gradient of convex functions. Journal of Convex Analysis, 15(2):263, 2008.

Tom Goldstein, Brendan O’Donoghue, Simon Setzer, and Richard Baraniuk. Fast alternating direction optimization methods. SIAM Journal on Imaging Sciences, 7(3):1588–1623, 2014.

Benjamin Grimmer, Kevin Shu, and Alex L Wang. Accelerated gradient descent via long steps.
arXiv:2309.09961, 2023.

Maor Ivgi, Oliver Hinder, and Yair Carmon. DoG is SGD’s best friend: A parameter-free dynamic
step size schedule. arXiv:2302.12022, 2023.

Puya Latafat, Andreas Themelis, and Panagiotis Patrinos. On the convergence of adaptive first order
methods: proximal gradient and alternating minimization algorithms. arXiv:2311.18431, 2023a.


Puya Latafat, Andreas Themelis, Lorenzo Stella, and Panagiotis Patrinos. Adaptive proximal algorithms for convex optimization under local Lipschitz continuity of the gradient. arXiv:2301.04431, 2023b.

Puya Latafat, Andreas Themelis, Silvia Villa, and Panagiotis Patrinos. On the convergence of proximal gradient methods for convex simple bilevel optimization. arXiv:2305.03559, 2023c.

Tianjiao Li and Guanghui Lan. A simple uniformly optimal method without line search for convex
optimization. arXiv:2310.10082, 2023.

Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive
stepsizes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages
983–992. PMLR, 2019.

Yura Malitsky. Golden ratio algorithms for variational inequalities. Mathematical Programming,
184(1):383–410, 2020.

Yura Malitsky and Konstantin Mishchenko. Adaptive gradient descent without descent. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 6702–6712. PMLR, 2020.

Yura Malitsky and Konstantin Mishchenko. Adaptive proximal gradient method for convex optimization. arXiv:2308.02261, 2023.

Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.

Ralph T. Rockafellar. Convex analysis. Princeton University Press, 1970.

Saverio Salzo. The variable metric forward-backward splitting algorithm under mild differentiability assumptions. SIAM Journal on Optimization, 27(4):2153–2181, 2017.

Paul Tseng. Applications of a splitting algorithm to decomposition in convex programming and variational inequalities. SIAM Journal on Control and Optimization, 29(1):119–138, 1991.

Rachel Ward, Xiaoxia Wu, and Leon Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6677–6686. PMLR, 2019.

Alp Yurtsever, Alex Gu, and Suvrit Sra. Three operator splitting with subgradients, stochastic gradients, and adaptive learning rates. Advances in Neural Information Processing Systems, 34:19743–19756, 2021.
