
UW-Madison CS/ISyE/Math/Stat 726 Spring 2024

Lecture 11: Acceleration via Regularization and Restarting; Lower Bounds

Yudong Chen

Last week we discussed two variants of Nesterov’s accelerated gradient descent (AGD).

Algorithm 1 Nesterov’s AGD, smooth and strongly convex


input: initial $x_0$, strong convexity and smoothness parameters $m, L$, number of iterations $K$
initialize: $x_{-1} = x_0$, $\beta = \frac{\sqrt{L/m} - 1}{\sqrt{L/m} + 1}$.
for $k = 0, 1, \ldots, K$
    $y_k = x_k + \beta (x_k - x_{k-1})$
    $x_{k+1} = y_k - \frac{1}{L} \nabla f(y_k)$
return $x_K$
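For concreteness, here is a minimal NumPy sketch of Algorithm 1; the function name and the callable grad_f (returning $\nabla f$) are illustrative conventions, not part of the lecture.

import numpy as np

def agd_strongly_convex(grad_f, x0, m, L, K):
    # Nesterov's AGD for m-strongly convex, L-smooth f (Algorithm 1).
    beta = (np.sqrt(L / m) - 1) / (np.sqrt(L / m) + 1)  # fixed momentum weight
    x_prev, x = x0.copy(), x0.copy()                    # x_{-1} = x_0
    for _ in range(K):
        y = x + beta * (x - x_prev)        # extrapolation (momentum) step
        x_prev, x = x, y - grad_f(y) / L   # gradient step from y, step size 1/L
    return x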

Theorem 1. For Nesterov's AGD Algorithm 1 applied to $m$-strongly convex $L$-smooth $f$, we have
\[
f(x_k) - f^* \le \left(1 - \sqrt{\frac{m}{L}}\right)^k \cdot \frac{(L+m)\|x_0 - x^*\|_2^2}{2}.
\]
Equivalently, we have $f(x_k) - f^* \le \epsilon$ after at most $k = O\left(\sqrt{\frac{L}{m}} \log \frac{L\|x_0 - x^*\|_2^2}{\epsilon}\right)$ iterations.

Algorithm 2 Nesterov’s AGD, smooth convex


input: initial $x_0$, smoothness parameter $L$, number of iterations $K$
initialize: $x_{-1} = x_0$, $\lambda_0 = 0$, $\beta_0 = 0$.
for $k = 0, 1, \ldots, K$
    $y_k = x_k + \beta_k (x_k - x_{k-1})$
    $x_{k+1} = y_k - \frac{1}{L} \nabla f(y_k)$
    $\lambda_{k+1} = \frac{1 + \sqrt{1 + 4\lambda_k^2}}{2}$, $\beta_{k+1} = \frac{\lambda_k - 1}{\lambda_{k+1}}$
return $x_K$
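A minimal NumPy sketch of Algorithm 2, under the same illustrative conventions as the previous sketch (grad_f returns $\nabla f$):

import numpy as np

def agd_convex(grad_f, x0, L, K):
    # Nesterov's AGD for L-smooth convex f (Algorithm 2).
    x_prev, x = x0.copy(), x0.copy()   # x_{-1} = x_0
    lam, beta = 0.0, 0.0               # lambda_0 = 0, beta_0 = 0
    for _ in range(K):
        y = x + beta * (x - x_prev)       # momentum step with time-varying beta_k
        x_prev, x = x, y - grad_f(y) / L  # gradient step from y
        lam_next = (1 + np.sqrt(1 + 4 * lam ** 2)) / 2
        beta = (lam - 1) / lam_next       # beta_{k+1} = (lambda_k - 1) / lambda_{k+1}
        lam = lam_next
    return x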

Theorem 2. For Nesterov's AGD Algorithm 2 applied to $L$-smooth convex $f$, we have
\[
f(x_k) - f(x^*) \le \frac{2L\|x_0 - x^*\|_2^2}{k^2}.
\]

In this lecture, we will show that the two types of acceleration above are closely related: we
can use one to derive the other. We then show that in a certain precise (but narrow) sense, the
convergence rates of AGD are optimal among first-order methods. For this reason, AGD is also
known as Nesterov’s optimal method.


1 Acceleration via regularization


Suppose we only know the AGD method for strongly convex functions (Algorithm 1) and its $\left(1 - \sqrt{m/L}\right)^k$ guarantee (Theorem 1). Can we use it as a subroutine to develop an accelerated algorithm for (non-strongly) convex functions with a $\frac{1}{k^2}$ convergence rate?
The answer is yes (up to logarithmic factors). One approach is to add a regularizer $\epsilon\|x\|_2^2$ to $f(x)$ and apply Algorithm 1 to the function $f(x) + \epsilon\|x\|_2^2$, which is strongly convex. See HW 3.
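As a sketch of this reduction, one can reuse the hypothetical agd_strongly_convex from the sketch after Algorithm 1; the choice of $\epsilon$ and the resulting guarantee are left to HW 3.

def agd_via_regularization(grad_f, x0, L, eps, K):
    # Apply Algorithm 1 to the regularized function f(x) + eps * ||x||_2^2,
    # which is (2*eps)-strongly convex and (L + 2*eps)-smooth.
    def reg_grad(x):
        return grad_f(x) + 2 * eps * x
    return agd_strongly_convex(reg_grad, x0, m=2 * eps, L=L + 2 * eps, K=K)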

2 Acceleration via restarting


In the opposite direction, suppose we only know the AGD method for (non-strongly) convex functions (Algorithm 2) and its $\frac{1}{k^2}$ guarantee (Theorem 2). Can we use it as a subroutine to develop an accelerated algorithm for strongly convex functions with a $\left(1 - \sqrt{m/L}\right)^k$ convergence rate (equivalently, a $\sqrt{\frac{L}{m}}\log\frac{1}{\epsilon}$ iteration complexity)?
This is possible using a classical and powerful idea in optimization: restarting. See Algorithm 3. In each round, we run Algorithm 2 for $\sqrt{\frac{8L}{m}}$ iterations to obtain $x_{t+1}$. In the next round, we restart Algorithm 2 using $x_{t+1}$ as the initial solution and run for another $\sqrt{\frac{8L}{m}}$ iterations. This is repeated for $T$ rounds.

Algorithm 3 Restarting AGD


input: initial $x_0$, strong convexity and smoothness parameters $m, L$, number of rounds $T$
for $t = 0, 1, \ldots, T$
    Run Algorithm 2 with $x_t$ (initial solution), $L$ (smoothness parameter), $\sqrt{8L/m}$ (number of iterations) as the input. Let $x_{t+1}$ be the output.
return $x_T$
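A minimal sketch of Algorithm 3 as a wrapper around the hypothetical agd_convex from the sketch after Algorithm 2:

import numpy as np

def restart_agd(grad_f, x0, m, L, T):
    # Restarting AGD (Algorithm 3): T rounds of Algorithm 2, each warm-started
    # at the previous round's output; the momentum state is reset every round.
    K = int(np.ceil(np.sqrt(8 * L / m)))   # iterations per round
    x = x0.copy()
    for _ in range(T):
        x = agd_convex(grad_f, x, L, K)
    return x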

Exercise 1. How is Algorithm 3 different from running Algorithm 2 without restarting for $T \times \sqrt{\frac{8L}{m}}$ iterations?

2.1 Analysis
Suppose $f$ is $m$-strongly convex and $L$-smooth. By Theorem 2, we know that
\[
f(x_{t+1}) - f(x^*) \le \frac{2L\|x_t - x^*\|_2^2}{8L/m} = \frac{m\|x_t - x^*\|_2^2}{4}.
\]
By strong convexity, we have
\[
f(x_t) \ge f(x^*) + \underbrace{\langle \nabla f(x^*), x_t - x^* \rangle}_{=0} + \frac{m}{2}\|x_t - x^*\|_2^2,
\]
hence $\|x_t - x^*\|_2^2 \le \frac{2}{m}\left(f(x_t) - f(x^*)\right)$. Combining, we get
\[
f(x_{t+1}) - f(x^*) \le \frac{f(x_t) - f(x^*)}{2}.
\]


That is, each round of Algorithm 3 halves the optimality gap. It follows that
\[
f(x_T) - f(x^*) \le \left(\frac{1}{2}\right)^T \left(f(x_0) - f(x^*)\right).
\]
Therefore, $f(x_T) - f(x^*) \le \epsilon$ can be achieved after at most
\[
T = O\left(\log\frac{f(x_0) - f(x^*)}{\epsilon}\right) \text{ rounds},
\]
which corresponds to a total of
\[
T \times \sqrt{\frac{8L}{m}} = O\left(\sqrt{\frac{L}{m}}\log\frac{f(x_0) - f(x^*)}{\epsilon}\right) \text{ AGD iterations}.
\]
This iteration complexity is the same as Theorem 1 up to a logarithmic factor.
Remark 1. Note how strong convexity is needed in the above argument.
Remark 2. Optional reading: This overview article discusses restarting as a general/meta algorithmic technique.

3 Lower bounds
In this section, we consider a class of first-order iterative algorithms that satisfy $x_0 = 0$ and
\[
x_{k+1} \in \operatorname{Lin}\{\nabla f(x_0), \nabla f(x_1), \ldots, \nabla f(x_k)\}, \quad \forall k \ge 0, \tag{1}
\]
where the RHS denotes the linear subspace spanned by $\nabla f(x_0), \nabla f(x_1), \ldots, \nabla f(x_k)$; in other words, $x_{k+1}$ is an (arbitrary) linear combination of the gradients at the previous $(k+1)$ iterates.

3.1 Smooth and convex f


Theorem 3. There exists an $L$-smooth convex function $f$ such that any first-order method in the sense of (1) must satisfy
\[
f(x_k) - f(x^*) \ge \frac{3L\|x_0 - x^*\|_2^2}{32(k+1)^2}.
\]
Comparing with this lower bound, we see that the $\frac{L}{k^2}$ rate for AGD in Theorem 2 is optimal/unimprovable (up to constants).
Proof of Theorem 3. Let $A \in \mathbb{R}^{d \times d}$ be the matrix given by
\[
A_{ij} = \begin{cases} 2, & i = j, \\ -1, & j \in \{i-1, i+1\}, \\ 0, & \text{otherwise.} \end{cases} \tag{2}
\]
Explicitly,
\[
A = \begin{pmatrix}
2 & -1 & 0 & 0 & \cdots & \cdots & 0 \\
-1 & 2 & -1 & 0 & \cdots & \cdots & 0 \\
0 & -1 & 2 & -1 & 0 & \cdots & 0 \\
& & \ddots & \ddots & \ddots & & \\
0 & \cdots & & -1 & 2 & -1 \\
0 & \cdots & & & -1 & 2
\end{pmatrix}.
\]


Let $e_i \in \mathbb{R}^d$ denote the $i$-th standard basis vector. Consider the quadratic function
\[
f(x) = \frac{L}{8} x^\top A x - \frac{L}{4} x^\top e_1,
\]
which is convex and $L$-smooth since $0 \preceq A \preceq 4I$. Note that $\nabla f(x) = \frac{L}{4}(Ax - e_1)$. By induction, we can show that for $k \ge 1$,
\[
x_k \in \operatorname{Lin}\{e_1, Ax_1, \ldots, Ax_{k-1}\} \subseteq \operatorname{Lin}\{e_1, \ldots, e_k\}.
\]
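This staircase structure can be checked numerically. The sketch below (variable names illustrative) builds $A$ from (2) and runs plain gradient descent, one particular method satisfying (1), verifying that $x_k$ is supported on the first $k$ coordinates.

import numpy as np

d, L = 50, 1.0
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)   # tridiagonal matrix from (2)
e1 = np.eye(d)[:, 0]

def grad_f(x):
    # gradient of f(x) = (L/8) x'Ax - (L/4) x'e1
    return (L / 4) * (A @ x - e1)

x = np.zeros(d)                       # x_0 = 0
for k in range(1, 11):
    x = x - grad_f(x) / L             # gradient descent: a method of the form (1)
    assert np.allclose(x[k:], 0.0)    # x_k lies in Lin{e_1, ..., e_k}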

Therefore, if we let $A_k \in \mathbb{R}^{d \times d}$ denote the matrix obtained by zeroing out the entries of $A$ outside the top-left $k \times k$ block, then
\[
f(x_k) = \frac{L}{8} x_k^\top A_k x_k - \frac{L}{4} x_k^\top e_1 \ge f_k^* := \min_x \left\{ \frac{L}{8} x^\top A_k x - \frac{L}{4} x^\top e_1 \right\}.
\]
By setting the gradient to zero, we find that the minimum above is attained by
\[
x_k^* := \left(1 - \frac{1}{k+1},\ 1 - \frac{2}{k+1},\ \ldots,\ 1 - \frac{k}{k+1},\ 0, \ldots, 0\right)^\top \in \mathbb{R}^d,
\]
with $f_k^* = -\frac{L}{8}\left(1 - \frac{1}{k+1}\right)$. It follows that the global minimizer $x^* = x_d^*$ of $f$ satisfies $f(x^*) = f_d^* = -\frac{L}{8}\left(1 - \frac{1}{d+1}\right)$ and (since $x_0 = 0$)
\[
\|x_d^* - x_0\|_2^2 = \|x_d^*\|_2^2 = \sum_{i=1}^d \left(1 - \frac{i}{d+1}\right)^2 = \sum_{j=1}^d \left(\frac{j}{d+1}\right)^2 = \frac{d(2d+1)}{6(d+1)} \le \frac{d+1}{3}.
\]

Combining pieces and taking $d = 2k+1$, we have
\begin{align*}
f(x_k) - f(x^*) \ge f_k^* - f_d^* &= \frac{L}{8}\left(\frac{1}{k+1} - \frac{1}{2k+2}\right) \\
&= \frac{L}{16} \cdot \frac{k+1}{(k+1)^2} \\
&= \frac{L}{32} \cdot \frac{d+1}{(k+1)^2} \\
&\ge \frac{3L\|x^* - x_0\|_2^2}{32(k+1)^2},
\end{align*}
where the last step uses $\|x^* - x_0\|_2^2 \le \frac{d+1}{3}$.

3.2 Smooth and strongly convex f


For strongly convex functions, we have the following lower bound, which shows that the $\left(1 - \frac{1}{\sqrt{L/m}}\right)^k$ rate of AGD in Theorem 1 cannot be significantly improved.

Theorem 4. There exists an $m$-strongly convex and $L$-smooth function such that any first-order method in the sense of (1) must satisfy
\[
f(x_k) - f(x^*) \ge \frac{m}{2}\left(1 - \frac{4}{\sqrt{L/m}}\right)^{k+1} \|x_0 - x^*\|_2^2.
\]


Proof. Let $A \in \mathbb{R}^{d \times d}$ be defined in (2) above and consider the function
\[
f(x) = \frac{L-m}{8}\left(x^\top A x - 2 x^\top e_1\right) + \frac{m}{2}\|x\|_2^2,
\]
which is $L$-smooth and $m$-strongly convex. Strong convexity implies that
\[
f(x_k) - f(x^*) \ge \frac{m}{2}\|x_k - x^*\|_2^2. \tag{3}
\]
A similar argument as above shows that $x_k \in \operatorname{Lin}\{e_1, \ldots, e_k\}$, hence
\[
\|x_k - x^*\|_2^2 \ge \sum_{i=k+1}^d x^*(i)^2, \tag{4}
\]
where $x^*(i)$ denotes the $i$-th entry of $x^*$. For simplicity we take $d \to \infty$ (we omit the formal limiting argument).¹ The minimizer $x^*$ can be computed by setting the gradient of $f$ to zero, which gives an infinite set of equations
\begin{align*}
1 - 2\,\frac{L/m+1}{L/m-1}\, x^*(1) + x^*(2) &= 0, \\
x^*(k-1) - 2\,\frac{L/m+1}{L/m-1}\, x^*(k) + x^*(k+1) &= 0, \quad k = 2, 3, \ldots
\end{align*}
Solving these equations gives
\[
x^*(i) = \left(\frac{\sqrt{L/m} - 1}{\sqrt{L/m} + 1}\right)^i, \quad i = 1, 2, \ldots \tag{5}
\]

Combining pieces, we obtain
\begin{align*}
f(x_k) - f(x^*) &\ge \frac{m}{2}\sum_{i=k+1}^{\infty} x^*(i)^2 && \text{by (3) and (4)} \\
&\ge \frac{m}{2}\left(\frac{\sqrt{L/m} - 1}{\sqrt{L/m} + 1}\right)^{2(k+1)} \|x_0 - x^*\|_2^2 && \text{by (5) and } x_0 = 0 \\
&= \frac{m}{2}\left(1 - \frac{4}{\sqrt{L/m} + 1} + \frac{4}{(\sqrt{L/m} + 1)^2}\right)^{k+1} \|x_0 - x^*\|_2^2 \\
&\ge \frac{m}{2}\left(1 - \frac{4}{\sqrt{L/m}}\right)^{k+1} \|x_0 - x^*\|_2^2.
\end{align*}
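As a numerical sanity check of (5), one can solve the finite-dimensional problem for a large $d$ and compare the leading entries of the minimizer with the geometric formula; the values of d, L, m below are arbitrary illustrative choices.

import numpy as np

d, L, m = 400, 10.0, 1.0
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
e1 = np.eye(d)[:, 0]

# Setting the gradient (L-m)/4 * (A x - e1) + m x to zero gives a linear system.
x_star = np.linalg.solve((L - m) / 4 * A + m * np.eye(d), (L - m) / 4 * e1)

q = (np.sqrt(L / m) - 1) / (np.sqrt(L / m) + 1)
# For large d, the leading entries agree with x*(i) = q^i from (5) up to a tiny boundary effect.
assert np.allclose(x_star[:20], q ** np.arange(1, 21), atol=1e-8)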

Remark 3. The lower bounds in Theorems 3 and 4 are in the worst-case/minimax sense: one cannot find a first-order method that achieves a better convergence rate than AGD on all smooth convex functions. This, however, does not prevent better rates from being achieved for a subclass of such functions. It is also possible to achieve better rates by using higher-order information (e.g., the Hessian).
¹ The convergence rates for AGD in Theorems 1 and 2 do not explicitly depend on the dimension $d$, hence these results can be generalized to infinite dimensions.
