
UW-Madison CS/ISyE/Math/Stat 726 Spring 2024

Lecture 11: Acceleration via Regularization and Restarting; Lower Bounds

Yudong Chen

Last week we discussed two variants of Nesterov’s accelerated gradient descent (AGD).

Algorithm 1 Nesterov’s AGD, smooth and strongly convex


input: initial $x_0$, strong convexity and smoothness parameters $m, L$, number of iterations $K$
initialize: $x_{-1} = x_0$, $\beta = \frac{\sqrt{L/m} - 1}{\sqrt{L/m} + 1}$.
for $k = 0, 1, \ldots, K$
    $y_k = x_k + \beta (x_k - x_{k-1})$
    $x_{k+1} = y_k - \frac{1}{L} \nabla f(y_k)$
return $x_K$
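For concreteness, here is a minimal NumPy sketch of Algorithm 1; the function name and the callable grad_f (returning $\nabla f$) are illustrative conventions, not part of the lecture.

import numpy as np

def agd_strongly_convex(grad_f, x0, m, L, K):
    # Nesterov's AGD for m-strongly convex, L-smooth f (Algorithm 1).
    beta = (np.sqrt(L / m) - 1) / (np.sqrt(L / m) + 1)  # fixed momentum weight
    x_prev, x = x0.copy(), x0.copy()                    # x_{-1} = x_0
    for _ in range(K):
        y = x + beta * (x - x_prev)        # extrapolation (momentum) step
        x_prev, x = x, y - grad_f(y) / L   # gradient step from y, step size 1/L
    return x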

Theorem 1. For Nesterov's AGD Algorithm 1 applied to $m$-strongly convex $L$-smooth $f$, we have
\[
f(x_k) - f^* \le \left(1 - \sqrt{\frac{m}{L}}\right)^k \cdot \frac{(L+m)\|x_0 - x^*\|_2^2}{2}.
\]
Equivalently, we have $f(x_k) - f^* \le \epsilon$ after at most $k = O\left(\sqrt{\frac{L}{m}} \log \frac{L\|x_0 - x^*\|_2^2}{\epsilon}\right)$ iterations.

Algorithm 2 Nesterov’s AGD, smooth convex


input: initial $x_0$, smoothness parameter $L$, number of iterations $K$
initialize: $x_{-1} = x_0$, $\lambda_0 = 0$, $\beta_0 = 0$.
for $k = 0, 1, \ldots, K$
    $y_k = x_k + \beta_k (x_k - x_{k-1})$
    $x_{k+1} = y_k - \frac{1}{L} \nabla f(y_k)$
    $\lambda_{k+1} = \frac{1 + \sqrt{1 + 4\lambda_k^2}}{2}$, $\beta_{k+1} = \frac{\lambda_k - 1}{\lambda_{k+1}}$
return $x_K$
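A minimal NumPy sketch of Algorithm 2, under the same illustrative conventions as the previous sketch (grad_f returns $\nabla f$):

import numpy as np

def agd_convex(grad_f, x0, L, K):
    # Nesterov's AGD for L-smooth convex f (Algorithm 2).
    x_prev, x = x0.copy(), x0.copy()   # x_{-1} = x_0
    lam, beta = 0.0, 0.0               # lambda_0 = 0, beta_0 = 0
    for _ in range(K):
        y = x + beta * (x - x_prev)       # momentum step with time-varying beta_k
        x_prev, x = x, y - grad_f(y) / L  # gradient step from y
        lam_next = (1 + np.sqrt(1 + 4 * lam ** 2)) / 2
        beta = (lam - 1) / lam_next       # beta_{k+1} = (lambda_k - 1) / lambda_{k+1}
        lam = lam_next
    return x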

Theorem 2. For Nesterov's AGD Algorithm 2 applied to $L$-smooth convex $f$, we have
\[
f(x_k) - f(x^*) \le \frac{2L\|x_0 - x^*\|_2^2}{k^2}.
\]

In this lecture, we will show that the two types of acceleration above are closely related: we
can use one to derive the other. We then show that in a certain precise (but narrow) sense, the
convergence rates of AGD are optimal among first-order methods. For this reason, AGD is also
known as Nesterov’s optimal method.


1 Acceleration via regularization


Suppose we only know the AGD method for strongly convex functions (Algorithm 1) and its $\left(1 - \sqrt{m/L}\right)^k$ guarantee (Theorem 1). Can we use it as a subroutine to develop an accelerated algorithm for (non-strongly) convex functions with a $\frac{1}{k^2}$ convergence rate?
The answer is yes (up to logarithmic factors). One approach is to add a regularizer $\epsilon\|x\|_2^2$ to $f(x)$ and apply Algorithm 1 to the function $f(x) + \epsilon\|x\|_2^2$, which is strongly convex. See HW 3.
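As a sketch of this reduction, one can reuse the hypothetical agd_strongly_convex from the sketch after Algorithm 1; the choice of $\epsilon$ and the resulting guarantee are left to HW 3.

def agd_via_regularization(grad_f, x0, L, eps, K):
    # Apply Algorithm 1 to the regularized function f(x) + eps * ||x||_2^2,
    # which is (2*eps)-strongly convex and (L + 2*eps)-smooth.
    def reg_grad(x):
        return grad_f(x) + 2 * eps * x
    return agd_strongly_convex(reg_grad, x0, m=2 * eps, L=L + 2 * eps, K=K)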

2 Acceleration via restarting


In the opposite direction, suppose we only know the AGD method for (non-strongly) convex functions (Algorithm 2) and its $\frac{1}{k^2}$ guarantee (Theorem 2). Can we use it as a subroutine to develop an accelerated algorithm for strongly convex functions with a $\left(1 - \sqrt{m/L}\right)^k$ convergence rate (equivalently, a $\sqrt{\frac{L}{m}}\log\frac{1}{\epsilon}$ iteration complexity)?
This is possible using a classical and powerful idea in optimization: restarting. See Algorithm 3. In each round, we run Algorithm 2 for $\sqrt{\frac{8L}{m}}$ iterations to obtain $x_{t+1}$. In the next round, we restart Algorithm 2 using $x_{t+1}$ as the initial solution and run for another $\sqrt{\frac{8L}{m}}$ iterations. This is repeated for $T$ rounds.

Algorithm 3 Restarting AGD


input: initial $x_0$, strong convexity and smoothness parameters $m, L$, number of rounds $T$
for $t = 0, 1, \ldots, T$
    Run Algorithm 2 with $x_t$ (initial solution), $L$ (smoothness parameter), $\sqrt{8L/m}$ (number of iterations) as the input. Let $x_{t+1}$ be the output.
return $x_T$
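A minimal sketch of Algorithm 3 as a wrapper around the hypothetical agd_convex from the sketch after Algorithm 2:

import numpy as np

def restart_agd(grad_f, x0, m, L, T):
    # Restarting AGD (Algorithm 3): T rounds of Algorithm 2, each warm-started
    # at the previous round's output; the momentum state is reset every round.
    K = int(np.ceil(np.sqrt(8 * L / m)))   # iterations per round
    x = x0.copy()
    for _ in range(T):
        x = agd_convex(grad_f, x, L, K)
    return x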

Exercise 1. How is Algorithm 3 different from running Algorithm 2 without restarting for $T \times \sqrt{\frac{8L}{m}}$ iterations?

2.1 Analysis
Suppose $f$ is $m$-strongly convex and $L$-smooth. By Theorem 2, we know that
\[
f(x_{t+1}) - f(x^*) \le \frac{2L\|x_t - x^*\|_2^2}{8L/m} = \frac{m\|x_t - x^*\|_2^2}{4}.
\]
By strong convexity, we have
\[
f(x_t) \ge f(x^*) + \underbrace{\langle \nabla f(x^*), x_t - x^* \rangle}_{=0} + \frac{m}{2}\|x_t - x^*\|_2^2,
\]
hence $\|x_t - x^*\|_2^2 \le \frac{2}{m}\left(f(x_t) - f(x^*)\right)$. Combining, we get
\[
f(x_{t+1}) - f(x^*) \le \frac{f(x_t) - f(x^*)}{2}.
\]


That is, each round of Algorithm 3 halves the optimality gap. It follows that
\[
f(x_T) - f(x^*) \le \left(\frac{1}{2}\right)^T \left(f(x_0) - f(x^*)\right).
\]
Therefore, $f(x_T) - f(x^*) \le \epsilon$ can be achieved after at most
\[
T = O\left(\log\frac{f(x_0) - f(x^*)}{\epsilon}\right) \text{ rounds},
\]
which corresponds to a total of
\[
T \times \sqrt{\frac{8L}{m}} = O\left(\sqrt{\frac{L}{m}}\log\frac{f(x_0) - f(x^*)}{\epsilon}\right) \text{ AGD iterations}.
\]
This iteration complexity is the same as Theorem 1 up to a logarithmic factor.
Remark 1. Note how strong convexity is needed in the above argument.
Remark 2. Optional reading: This overview article discusses restarting as a general/meta algorithmic technique.

3 Lower bounds
In this section, we consider a class of first-order iterative algorithms that satisfy $x_0 = 0$ and
\[
x_{k+1} \in \operatorname{Lin}\{\nabla f(x_0), \nabla f(x_1), \ldots, \nabla f(x_k)\}, \quad \forall k \ge 0, \tag{1}
\]
where the RHS denotes the linear subspace spanned by $\nabla f(x_0), \nabla f(x_1), \ldots, \nabla f(x_k)$; in other words, $x_{k+1}$ is an (arbitrary) linear combination of the gradients at the previous $(k+1)$ iterates.

3.1 Smooth and convex f


Theorem 3. There exists an $L$-smooth convex function $f$ such that any first-order method in the sense of (1) must satisfy
\[
f(x_k) - f(x^*) \ge \frac{3L\|x_0 - x^*\|_2^2}{32(k+1)^2}.
\]
Comparing with this lower bound, we see that the $\frac{L}{k^2}$ rate for AGD in Theorem 2 is optimal/unimprovable (up to constants).
Proof of Theorem 3. Let $A \in \mathbb{R}^{d \times d}$ be the matrix given by
\[
A_{ij} = \begin{cases} 2, & i = j, \\ -1, & j \in \{i-1, i+1\}, \\ 0, & \text{otherwise.} \end{cases} \tag{2}
\]
Explicitly,
\[
A = \begin{pmatrix}
2 & -1 & 0 & 0 & \cdots & \cdots & 0 \\
-1 & 2 & -1 & 0 & \cdots & \cdots & 0 \\
0 & -1 & 2 & -1 & 0 & \cdots & 0 \\
& & \ddots & \ddots & \ddots & & \\
0 & \cdots & & -1 & 2 & -1 \\
0 & \cdots & & & -1 & 2
\end{pmatrix}.
\]


Let $e_i \in \mathbb{R}^d$ denote the $i$-th standard basis vector. Consider the quadratic function
\[
f(x) = \frac{L}{8} x^\top A x - \frac{L}{4} x^\top e_1,
\]
which is convex and $L$-smooth since $0 \preceq A \preceq 4I$. Note that $\nabla f(x) = \frac{L}{4}(Ax - e_1)$. By induction, we can show that for $k \ge 1$,
\[
x_k \in \operatorname{Lin}\{e_1, Ax_1, \ldots, Ax_{k-1}\} \subseteq \operatorname{Lin}\{e_1, \ldots, e_k\}.
\]
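This staircase structure can be checked numerically. The sketch below (variable names illustrative) builds $A$ from (2) and runs plain gradient descent, one particular method satisfying (1), verifying that $x_k$ is supported on the first $k$ coordinates.

import numpy as np

d, L = 50, 1.0
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)   # tridiagonal matrix from (2)
e1 = np.eye(d)[:, 0]

def grad_f(x):
    # gradient of f(x) = (L/8) x'Ax - (L/4) x'e1
    return (L / 4) * (A @ x - e1)

x = np.zeros(d)                       # x_0 = 0
for k in range(1, 11):
    x = x - grad_f(x) / L             # gradient descent: a method of the form (1)
    assert np.allclose(x[k:], 0.0)    # x_k lies in Lin{e_1, ..., e_k}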

Therefore, if we let $A_k \in \mathbb{R}^{d \times d}$ denote the matrix obtained by zeroing out the entries of $A$ outside the top-left $k \times k$ block, then
\[
f(x_k) = \frac{L}{8} x_k^\top A_k x_k - \frac{L}{4} x_k^\top e_1 \ge f_k^* := \min_x \left\{ \frac{L}{8} x^\top A_k x - \frac{L}{4} x^\top e_1 \right\}.
\]
By setting the gradient to zero, we find that the minimum above is attained by
\[
x_k^* := \left(1 - \frac{1}{k+1},\ 1 - \frac{2}{k+1},\ \ldots,\ 1 - \frac{k}{k+1},\ 0, \ldots, 0\right)^\top \in \mathbb{R}^d,
\]
with $f_k^* = -\frac{L}{8}\left(1 - \frac{1}{k+1}\right)$. It follows that the global minimizer $x^* = x_d^*$ of $f$ satisfies $f(x^*) = f_d^* = -\frac{L}{8}\left(1 - \frac{1}{d+1}\right)$ and (since $x_0 = 0$)
\[
\|x_d^* - x_0\|_2^2 = \|x_d^*\|_2^2 = \sum_{i=1}^d \left(1 - \frac{i}{d+1}\right)^2 = \sum_{j=1}^d \left(\frac{j}{d+1}\right)^2 = \frac{d(2d+1)}{6(d+1)} \le \frac{d+1}{3}.
\]

Combining pieces and taking $d = 2k+1$, we have
\begin{align*}
f(x_k) - f(x^*) \ge f_k^* - f_d^* &= \frac{L}{8}\left(\frac{1}{k+1} - \frac{1}{2k+2}\right) \\
&= \frac{L}{16} \cdot \frac{k+1}{(k+1)^2} \\
&= \frac{L}{32} \cdot \frac{d+1}{(k+1)^2} \\
&\ge \frac{3L\|x^* - x_0\|_2^2}{32(k+1)^2},
\end{align*}
where the last step uses $\|x^* - x_0\|_2^2 \le \frac{d+1}{3}$.

3.2 Smooth and strongly convex f


For strongly convex functions, we have the following lower bound, which shows that the $\left(1 - \frac{1}{\sqrt{L/m}}\right)^k$ rate of AGD in Theorem 1 cannot be significantly improved.

Theorem 4. There exists an $m$-strongly convex and $L$-smooth function such that any first-order method in the sense of (1) must satisfy
\[
f(x_k) - f(x^*) \ge \frac{m}{2}\left(1 - \frac{4}{\sqrt{L/m}}\right)^{k+1} \|x_0 - x^*\|_2^2.
\]


Proof. Let $A \in \mathbb{R}^{d \times d}$ be defined in (2) above and consider the function
\[
f(x) = \frac{L-m}{8}\left(x^\top A x - 2 x^\top e_1\right) + \frac{m}{2}\|x\|_2^2,
\]
which is $L$-smooth and $m$-strongly convex. Strong convexity implies that
\[
f(x_k) - f(x^*) \ge \frac{m}{2}\|x_k - x^*\|_2^2. \tag{3}
\]
A similar argument as above shows that $x_k \in \operatorname{Lin}\{e_1, \ldots, e_k\}$, hence
\[
\|x_k - x^*\|_2^2 \ge \sum_{i=k+1}^d x^*(i)^2, \tag{4}
\]
where $x^*(i)$ denotes the $i$-th entry of $x^*$. For simplicity we take $d \to \infty$ (we omit the formal limiting argument).¹ The minimizer $x^*$ can be computed by setting the gradient of $f$ to zero, which gives an infinite set of equations
\begin{align*}
1 - 2\,\frac{L/m+1}{L/m-1}\, x^*(1) + x^*(2) &= 0, \\
x^*(k-1) - 2\,\frac{L/m+1}{L/m-1}\, x^*(k) + x^*(k+1) &= 0, \quad k = 2, 3, \ldots
\end{align*}
Solving these equations gives
\[
x^*(i) = \left(\frac{\sqrt{L/m} - 1}{\sqrt{L/m} + 1}\right)^i, \quad i = 1, 2, \ldots \tag{5}
\]

Combining pieces, we obtain
\begin{align*}
f(x_k) - f(x^*) &\ge \frac{m}{2}\sum_{i=k+1}^{\infty} x^*(i)^2 && \text{by (3) and (4)} \\
&\ge \frac{m}{2}\left(\frac{\sqrt{L/m} - 1}{\sqrt{L/m} + 1}\right)^{2(k+1)} \|x_0 - x^*\|_2^2 && \text{by (5) and } x_0 = 0 \\
&= \frac{m}{2}\left(1 - \frac{4}{\sqrt{L/m} + 1} + \frac{4}{(\sqrt{L/m} + 1)^2}\right)^{k+1} \|x_0 - x^*\|_2^2 \\
&\ge \frac{m}{2}\left(1 - \frac{4}{\sqrt{L/m}}\right)^{k+1} \|x_0 - x^*\|_2^2.
\end{align*}
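As a numerical sanity check of (5), one can solve the finite-dimensional problem for a large $d$ and compare the leading entries of the minimizer with the geometric formula; the values of d, L, m below are arbitrary illustrative choices.

import numpy as np

d, L, m = 400, 10.0, 1.0
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
e1 = np.eye(d)[:, 0]

# Setting the gradient (L-m)/4 * (A x - e1) + m x to zero gives a linear system.
x_star = np.linalg.solve((L - m) / 4 * A + m * np.eye(d), (L - m) / 4 * e1)

q = (np.sqrt(L / m) - 1) / (np.sqrt(L / m) + 1)
# For large d, the leading entries agree with x*(i) = q^i from (5) up to a tiny boundary effect.
assert np.allclose(x_star[:20], q ** np.arange(1, 21), atol=1e-8)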

Remark 3. The lower bounds in Theorems 3 and 4 are in the worst-case/minimax sense: one cannot find a first-order method that achieves a better convergence rate than AGD on all smooth convex functions. This, however, does not prevent better rates from being achieved for a subclass of such functions. It is also possible to achieve better rates by using higher-order information (e.g., the Hessian).
¹ The convergence rates for AGD in Theorems 1 and 2 do not explicitly depend on the dimension $d$, hence these results can be generalized to infinite dimensions.
