
Module B: Algorithms for Optimization

Recall that an optimization problem in standard form is given by

$$\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & f(x) \\
\text{s.t.} \quad & g_i(x) \le 0, \quad i \in [m] := \{1, 2, \dots, m\}, \\
& h_j(x) = 0, \quad j \in [p].
\end{aligned}$$

Most algorithms generate a sequence $x_0, x_1, x_2, \dots$ by exploiting local information collected along the path.

Zeroth Order: Only the function values $f(x_t)$, $g_i(x_t)$, $h_j(x_t)$ are available.

First Order: Gradients $\nabla f(x_t)$, $\nabla g_i(x_t)$, $\nabla h_j(x_t)$ are used. Heavily used in ML.

Second Order: Hessian information is used, e.g., Newton's method.

Distributed Algorithms

Stochastic/Randomized Algorithms

Measure of progress

Let $x^\star$ be the optimal solution. Iterative algorithms continue until one of the following error metrics is sufficiently small.

$\text{err}_t := \|x_t - x^\star\|$

$\text{err}_t := f(x_t) - f(x^\star)$

A solution $\bar{x}$ is $\varepsilon$-optimal when

$$f(\bar{x}) \le f(x^\star) + \varepsilon.$$

We often run the algorithm until $\text{err}_t$ is smaller than a sufficiently small $\varepsilon > 0$. For constrained problems, an error metric that also accounts for constraint violation is

$$\text{err}_t := \max\big(f(x_t) - f(x^\star),\; g_1(x_t),\; g_2(x_t),\; \dots,\; g_m(x_t)\big).$$

First order methods: Gradient descent

Consider the unconstrained optimization problem: $\min_{x \in \mathbb{R}^n} f(x)$.


Gradient Descent (GD): $x_{t+1} = x_t - \eta_t \nabla f(x_t)$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$ (a minimal implementation sketch follows this list of definitions).

The convergence rate depends on the choice of step size $\eta_t$ and on characteristics of the function.

Bounded Gradient: $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$.

Smooth: A differentiable convex $f$ is $\beta$-smooth if for any $x, y$, we have

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{\beta}{2}\|y - x\|^2.$$

We can obtain a quadratic upper bound on the function from local information.

Strongly Convex: A differentiable convex $f$ is $\alpha$-strongly convex if for any $x, y$, we have

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2}\|y - x\|^2.$$

We can obtain a quadratic lower bound on the function from local information.
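To make the update concrete, here is a minimal NumPy sketch of GD on a toy least-squares objective $f(x) = \frac{1}{2}\|Ax - b\|^2$, which is $\beta$-smooth with $\beta = \lambda_{\max}(A^\top A)$. The data $A$, $b$, the horizon, and the constant step size $1/\beta$ are illustrative assumptions, not prescriptions from these notes.

```python
import numpy as np

# Toy smooth objective f(x) = 0.5 * ||A x - b||^2; A and b are synthetic.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad_f(x):
    # Gradient of f at x: A^T (A x - b).
    return A.T @ (A @ x - b)

beta = np.linalg.norm(A, 2) ** 2   # smoothness constant: lambda_max(A^T A)
eta = 1.0 / beta                   # constant step size 1/beta

x = np.zeros(5)                    # initial guess x_0
for t in range(1000):
    x = x - eta * grad_f(x)        # GD update: x_{t+1} = x_t - eta_t grad f(x_t)
```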

Gradient Descent with Bounded Gradient Assumption

Let $x_0, x_1, \dots, x_T$ be the iterates generated by the GD algorithm. For any $t$, we define $\hat{x}_t := \frac{1}{t}\sum_{i=0}^{t-1} x_i$. Let $x^\star$ be the optimal solution.
Theorem 1: Convergence of Gradient Descent

Let the function $f$ satisfy the bounded gradient property. Let $\|x_0 - x^\star\| \le D$. Then, for the choice of step size $\eta_t = \frac{D}{G\sqrt{T}}$, we have

$$f(\hat{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}.$$

To find an $\varepsilon$-optimal solution, choose $T \ge \left(\frac{DG}{\varepsilon}\right)^2$ and $\eta = \frac{\varepsilon}{G^2}$.
Possible Limitation: the step size requires knowing $D$, $G$, and the horizon $T$ in advance, and the $1/\sqrt{T}$ rate is slow.

Proof: Define the following (potential) function:

$$\Phi_t := \frac{1}{2\eta}\|x_t - x^\star\|^2.$$

We show that $\Phi_t$ is decreasing in $t$. We compute $\Phi_{t+1} - \Phi_t$ as:

Proof

Proof Continues
Gradient Descent with Smoothness Assumption

Recall that a differentiable convex $f$ is $\beta$-smooth if for any $x, y$, we have

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{\beta}{2}\|y - x\|^2.$$
Theorem 2
Let the function $f$ be $\beta$-smooth. Let $\|x_0 - x^\star\| \le D$. Then, for the choice of step size $\eta_t = \frac{1}{\beta}$, we have

$$f(x_T) - f(x^\star) \le \frac{\beta\|x_0 - x^\star\|^2}{2T}.$$

Proof: Define the following (potential) function:

$$\Phi_t := t\,[f(x_t) - f(x^\star)] + \frac{\beta}{2}\|x_t - x^\star\|^2.$$

We show that $\Phi_t$ is decreasing in $t$. We compute $\Phi_{t+1} - \Phi_t$ as:

Proof
Gradient Descent with Smoothness and Strong Convexity

Recall that a differentiable convex $f$ is $\alpha$-strongly convex if for any $x, y$, we have

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2}\|y - x\|^2.$$
Theorem 3
Let the function $f$ be $\beta$-smooth and $\alpha$-strongly convex with $\alpha \le \beta$. Define the condition number $\kappa := \frac{\beta}{\alpha}$. Then, for the choice of step size $\eta_t = \frac{1}{\beta}$, we have

$$f(x_T) - f(x^\star) \le e^{-\frac{T}{\kappa}}\,(f(x_0) - f(x^\star)).$$

Note: To obtain an $\varepsilon$-optimal solution, choose $T = O\!\left(\kappa \log\frac{1}{\varepsilon}\right)$.

Proof: Define the following (potential) function:

$$\Phi_t := (1 + \gamma)^t\,[f(x_t) - f(x^\star)], \quad \text{where } \gamma = \frac{1}{\kappa - 1} = \frac{\alpha}{\beta - \alpha}.$$

We need to show that $\Phi_{t+1} \le \Phi_t$.

Proof

Proof Continues
Summary of gradient descent convergence rates

Consider the unconstrained optimization problem: $\min_{x \in \mathbb{R}^n} f(x)$.

Gradient Descent (GD): $x_{t+1} = x_t - \eta_t \nabla f(x_t)$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$.

Theorem 4: GD Convergence rates

Let $\|x_0 - x^\star\| \le D$.

If $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$, then with $\eta_t = \frac{D}{G\sqrt{T}}$, $f(\hat{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}$.

If $f$ is $\beta$-smooth, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le \frac{\beta\|x_0 - x^\star\|^2}{2T}$.

If $f$ is $\beta$-smooth and $\alpha$-strongly convex, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le e^{-\frac{T}{\kappa}}(f(x_0) - f(x^\star))$, where $\kappa := \frac{\beta}{\alpha}$ is the condition number.

Gradient descent: Constrained Case

Consider the constrained optimization problem: $\min_{x \in X} f(x)$, where $X \subseteq \mathbb{R}^n$ is a convex feasible set.

Projected Gradient Descent (PGD): $x_{t+1} = \Pi_X[x_t - \eta_t \nabla f(x_t)]$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$, where $\Pi_X(y)$ is the projection of $y$ onto the set $X$.

Theorem 5
Let $\|x_0 - x^\star\| \le D$.

If $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$, then with $\eta_t = \frac{D}{G\sqrt{T}}$, $f(\hat{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}$.

If $f$ is $\beta$-smooth, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le \frac{\beta\|x_0 - x^\star\|^2}{2T}$.

If $f$ is $\beta$-smooth and $\alpha$-strongly convex, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le e^{-\frac{T}{\kappa}}(f(x_0) - f(x^\star))$, where $\kappa := \frac{\beta}{\alpha}$ is the condition number.

Note: Convergence rates remain unchanged.

Note: Projection itself is another optimization problem!

Non-expansive property, which preserves the convergence rates:

$$\|\Pi_X(y_1) - \Pi_X(y_2)\| \le \|y_1 - y_2\|.$$
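As an illustration, here is a minimal PGD sketch, assuming projection onto a Euclidean ball (one of the easy cases from the next slide). The data $A$, $b$, the radius, and the horizon are illustrative assumptions.

```python
import numpy as np

def project_ball(y, r):
    # Projection onto X = {x : ||x||_2 <= r}: rescale y if it lies outside X.
    norm = np.linalg.norm(y)
    return y if norm <= r else (r / norm) * y

# Toy smooth objective f(x) = 0.5 * ||A x - b||^2 with synthetic data.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
r = 1.0                                  # radius of the feasible ball X

eta = 1.0 / np.linalg.norm(A, 2) ** 2    # step size 1/beta
x = np.zeros(5)                          # feasible initial guess
for t in range(1000):
    grad = A.T @ (A @ x - b)
    x = project_ball(x - eta * grad, r)  # gradient step, then project back onto X
```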

When is Projection easy to find?

Note that $\Pi_X(y) = \operatorname{argmin}_{x \in X} \|y - x\|^2$. Find a closed-form expression for the projection in the following cases (one possible set of answers is sketched after the list).

$X = \{x \in \mathbb{R}^n : \|x\|_2 \le r\}$.

$X = \{x \in \mathbb{R}^n : x_l \le x \le x_u\}$.

$X = \{x \in \mathbb{R}^n : Ax = b\}$.

$X = \{x \in \mathbb{R}^n : x \ge 0,\ \sum_{i=1}^n x_i \le 1\}$.
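Since these are left as exercises, the following NumPy sketch records one possible set of answers: rescaling for the ball, coordinate-wise clipping for the box, a normal-equations correction for the affine set (assuming $A$ has full row rank), and the standard sort-and-threshold rule for the last set. The function names are mine.

```python
import numpy as np

def proj_ball(y, r):
    # X = {x : ||x||_2 <= r}: rescale y onto the ball if it lies outside.
    nrm = np.linalg.norm(y)
    return y if nrm <= r else (r / nrm) * y

def proj_box(y, xl, xu):
    # X = {x : xl <= x <= xu}: clip each coordinate independently.
    return np.clip(y, xl, xu)

def proj_affine(y, A, b):
    # X = {x : A x = b}, A full row rank: y - A^T (A A^T)^{-1} (A y - b).
    return y - A.T @ np.linalg.solve(A @ A.T, A @ y - b)

def proj_simplex_ineq(y):
    # X = {x >= 0, sum_i x_i <= 1}: clip at zero; if the sum still exceeds 1,
    # the sum constraint is active and we project onto the probability simplex
    # by sorted thresholding: x_i = max(y_i - tau, 0).
    x = np.maximum(y, 0.0)
    if x.sum() <= 1.0:
        return x
    u = np.sort(y)[::-1]                  # y sorted in decreasing order
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(y) + 1)
    k = ks[u - css / ks > 0][-1]          # largest k with u_k > (sum_{i<=k} u_i - 1)/k
    tau = css[k - 1] / k
    return np.maximum(y - tau, 0.0)
```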

Accelerated Gradient Descent

Start with $x_0 = y_0 = z_0 \in \mathbb{R}^n$. At every time step $t$,

$$\begin{aligned}
y_{t+1} &= x_t - \tfrac{1}{\beta}\nabla f(x_t) \\
z_{t+1} &= z_t - \eta_t \nabla f(x_t) \\
x_{t+1} &= (1 - \tau_{t+1})\, y_{t+1} + \tau_{t+1}\, z_{t+1}
\end{aligned}$$

Theorem 6
Let $f$ be $\beta$-smooth, $\eta_t = \frac{t+1}{2\beta}$ and $\tau_t = \frac{2}{t+2}$. Then, we have

$$f(y_T) - f(x^\star) \le \frac{2\beta\|x_0 - x^\star\|^2}{T(T+1)}.$$

Proof: Define $\Phi_t = t(t+1)\,(f(y_t) - f(x^\star)) + 2\beta\|z_t - x^\star\|^2$ and show that $\Phi_{t+1} \le \Phi_t$.
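A minimal sketch of this three-sequence scheme on the same toy least-squares objective, with $\eta_t$ and $\tau_t$ as in Theorem 6; the data and horizon are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
beta = np.linalg.norm(A, 2) ** 2          # smoothness constant

def grad_f(x):
    return A.T @ (A @ x - b)

x = y = z = np.zeros(5)                   # x_0 = y_0 = z_0
for t in range(200):
    g = grad_f(x)
    y = x - g / beta                      # y_{t+1} = x_t - (1/beta) grad f(x_t)
    z = z - (t + 1) / (2 * beta) * g      # z_{t+1} = z_t - eta_t grad f(x_t)
    tau = 2.0 / (t + 3)                   # tau_{t+1} = 2 / ((t+1) + 2)
    x = (1 - tau) * y + tau * z           # x_{t+1} mixes y_{t+1} and z_{t+1}
```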

Accelerated Gradient Descent 2

Start with $x_0 = y_0$. At every step $t$,

$$\begin{aligned}
y_{t+1} &= x_t - \tfrac{1}{\beta}\nabla f(x_t) \\
x_{t+1} &= \left(1 + \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right) y_{t+1} - \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\, y_t
\end{aligned}$$

Theorem 7
Let $f$ be $\beta$-smooth and $\alpha$-strongly convex with $\kappa = \frac{\beta}{\alpha}$, and let $\gamma = \frac{1}{\sqrt{\kappa} - 1}$. Then, we have

$$f(y_T) - f(x^\star) \le (1 + \gamma)^{-T}\,\frac{\alpha + \beta}{2}\,\|x_0 - x^\star\|^2.$$

This improves upon the previous rate, where we had $\gamma = \frac{1}{\kappa - 1}$.
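A sketch of this constant-momentum variant, assuming a strongly convex toy quadratic $f(x) = \frac{1}{2}x^\top H x - c^\top x$ so that $\alpha$ and $\beta$ can be read off the eigenvalues of $H$; all data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 5))
H = A.T @ A + 0.1 * np.eye(5)             # Hessian of f(x) = 0.5 x^T H x - c^T x
c = rng.standard_normal(5)

beta = np.linalg.eigvalsh(H).max()        # smoothness constant
alpha = np.linalg.eigvalsh(H).min()       # strong convexity constant
kappa = beta / alpha
m = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # momentum coefficient

x = y_prev = np.zeros(5)                  # x_0 = y_0
for t in range(200):
    g = H @ x - c                         # grad f(x_t)
    y = x - g / beta                      # y_{t+1} = x_t - (1/beta) grad f(x_t)
    x = (1 + m) * y - m * y_prev          # x_{t+1} = (1+m) y_{t+1} - m y_t
    y_prev = y
```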

Further details

AGD was invented by Nesterov in a series of papers in the 1980s and early 2000s, and was later popularized by ML researchers.

The convergence rates in the previous two theorems are the best possible
ones.

Book by Nesterov:
https://link.springer.com/book/10.1007/978-1-4419-8853-9

https://francisbach.com/continuized-acceleration/

https://www.nowpublishers.com/article/Details/OPT-036

Finite Sum Setting

A large number of problems that arise in (supervised) ML can be written as

$$\min_{x \in \mathbb{R}^n} f(x) := \frac{1}{N}\sum_{i=1}^{N} f_i(x) = \frac{1}{N}\sum_{i=1}^{N} l(x, \xi_i).$$

Example: Regression/Least Squares, SVM, NN Training

The above problem can also be viewed as a sample average approximation of the stochastic optimization problem

$$f(x) = \mathbb{E}[l(x, \xi)]$$

involving an uncertain parameter or random variable $\xi$.


Challenge: $N$ (the number of samples) or $n$ (the dimension of the decision variable) may be large. Samples may be located on different servers. (A small finite-sum example is sketched below.)
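For concreteness, here is a hedged sketch of the finite-sum form for least squares, where $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$ and the pairs $(a_i, b_i)$ play the role of the samples $\xi_i$; the synthetic data and names are mine. The SGD sketch on the next slide reuses these definitions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 1000, 10                        # number of samples, decision dimension
A = rng.standard_normal((N, n))        # row i is the sample a_i
b = rng.standard_normal(N)

def grad_fi(x, i):
    # Gradient of the single-sample loss f_i(x) = 0.5 * (a_i^T x - b_i)^2.
    return (A[i] @ x - b[i]) * A[i]

def grad_f(x):
    # Full gradient: the average of the N per-sample gradients.
    return A.T @ (A @ x - b) / N
```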

Gradient Descent vs. Stochastic Gradient Descent

Gradient Descent (GD): $x_{t+1} = x_t - \eta_t \nabla f(x_t) = x_t - \frac{\eta_t}{N}\sum_{i=1}^{N}\nabla f_i(x_t)$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$.

Each step requires $N$ gradient computations.

Stochastic Gradient Descent (SGD): At every time step $t$,

Pick an index (sample) $i_t$ uniformly at random from the set $\{1, 2, \dots, N\}$.
Set $x_{t+1} = x_t - \eta_t \nabla f_{i_t}(x_t)$.

Each step requires one gradient computation, which is a noisy version of the true gradient of the cost function at $x_t$.
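A minimal SGD sketch, reusing the NumPy import and grad_fi, rng, n, N from the finite-sum example above; the horizon $T$, the constant $c$, and the diminishing schedule $\eta_t = \frac{1}{c\sqrt{t+1}}$ (discussed a few slides below) are illustrative choices.

```python
T, c = 5000, 10.0                         # horizon and schedule constant
x = np.zeros(n)                           # initial guess x_0
for t in range(T):
    i_t = rng.integers(N)                 # pick i_t uniformly from {0, ..., N-1}
    eta_t = 1.0 / (c * np.sqrt(t + 1))    # diminishing step size
    x = x - eta_t * grad_fi(x, i_t)       # noisy but unbiased gradient step
```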

Key result for SGD convergence

Under the following assumptions:

Convexity: each $f_i$ is convex,
Bounded variance: $\mathbb{E}[\|\nabla f_{i_t}(x)\|^2] \le \sigma^2$ for some $\sigma$ and all $x$,
Unbiased gradient estimate: $\mathbb{E}[\nabla f_{i_t}(x)] = \nabla f(x)$ for all $x$,

the solutions generated by the SGD algorithm satisfy

$$\sum_{t=0}^{T-1} \eta_t\,\big[\mathbb{E}[f(x_t)] - f(x^\star)\big] \le \frac{1}{2}\|x_0 - x^\star\|^2 + \frac{\sigma^2}{2}\sum_{t=0}^{T-1}\eta_t^2$$

$$\implies \quad \mathbb{E}[f(\hat{x}_T)] - f(x^\star) \le \frac{\|x_0 - x^\star\|^2}{2\sum_{t=0}^{T-1}\eta_t} + \frac{\sigma^2\sum_{t=0}^{T-1}\eta_t^2}{2\sum_{t=0}^{T-1}\eta_t},$$

where $\hat{x}_T = \frac{1}{\sum_{t=0}^{T-1}\eta_t}\sum_{t=0}^{T-1}\eta_t x_t$.

Proof Continues
Choice of stepsize

A constant step size will not give us convergence. For convergence, we need to choose step sizes that are diminishing and square-summable but not summable, i.e.,

$$\lim_{T \to \infty} \sum_{t=0}^{T-1} \eta_t = \infty, \qquad \lim_{T \to \infty} \sum_{t=0}^{T-1} \eta_t^2 < \infty.$$

If $\eta_t := \frac{1}{c\sqrt{t+1}}$, then $\mathbb{E}[f(\hat{x}_T)] - f(x^\star) \le O\!\left(\frac{\log T}{\sqrt{T}}\right)$. This rate does not improve when the function is smooth.

When the function is smooth, for an appropriately chosen constant step size $\eta_t := \eta$, the R.H.S. will be of order $O\!\left(\frac{1}{\eta T}\right) + O(\eta)$. (Both schedules are written out below.)
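The two schedules, written as plain functions for reference; the constants are illustrative, not values prescribed by the notes.

```python
import numpy as np

def eta_diminishing(t, c=10.0):
    # Square-summable but not summable: sum_t eta_t = inf, sum_t eta_t^2 < inf.
    return 1.0 / (c * np.sqrt(t + 1))

def eta_constant(t, eta=1e-2):
    # Constant step: error of order O(1/(eta*T)) + O(eta); no exact convergence.
    return eta
```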

Analysis for Smooth and Strongly Convex Functions

When the function $f$ is $\beta$-smooth and $\alpha$-strongly convex, we have the following guarantees for SGD after $T$ iterations.

If $\eta_t := \frac{1}{ct}$ for a suitable constant $c$, then the error bound is $O\!\left(\frac{\log T}{T}\right)$. This can be improved to $O\!\left(\frac{1}{T}\right)$.

If $\eta_t := \eta$, then the error bound is

$$\mathbb{E}\big[\|x_T - x^\star\|^2\big] \le (1 - \eta\alpha)^T\,\|x_0 - x^\star\|^2 + \frac{\eta\sigma^2}{2\alpha}.$$

With a constant step size $\eta < \frac{1}{\alpha}$, convergence is quick to a neighborhood of the optimal solution; see the sketch below.
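To see the neighborhood behavior, here is a hedged sketch that reuses A, b, grad_fi, rng, n, and N from the finite-sum example above; the step size and horizon are arbitrary illustrative choices.

```python
# Constant-step SGD: the error contracts geometrically at first and then
# hovers near a noise floor of order eta * sigma^2 / (2 * alpha).
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # minimizer of the toy objective
x, eta = np.zeros(n), 1e-2
for t in range(20000):
    x = x - eta * grad_fi(x, rng.integers(N))
print(np.linalg.norm(x - x_star))               # small, but bounded away from zero
```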

Extension: Mini-Batch

Extension: Stochastic Averaging

Further Reading

SAG: Schmidt, Mark, Nicolas Le Roux, and Francis Bach. "Minimizing finite sums with the stochastic average gradient." Mathematical Programming 162 (2017): 83-112.

SAGA: Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien. "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives." Advances in Neural Information Processing Systems 27 (2014).

Recent Review: Gower, Robert M., Mark Schmidt, Francis Bach, and Peter Richtárik. "Variance-reduced methods for machine learning." Proceedings of the IEEE 108, no. 11 (2020): 1968-1983.

Allen-Zhu, Zeyuan. "Katyusha: The First Direct Acceleration of Stochastic Gradient Methods." Journal of Machine Learning Research 18 (2018): 1-51.

Varre, Aditya, and Nicolas Flammarion. "Accelerated SGD for non-strongly-convex least squares." In Conference on Learning Theory, pp. 2062-2126. PMLR, 2022.

Hanzely, Filip, Konstantin Mishchenko, and Peter Richtárik. "SEGA: Variance reduction via gradient sketching." Advances in Neural Information Processing Systems 31 (2018).

Extension: Adaptive Step-sizes

AdaGrad: Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12, no. 7 (2011).

Adam: Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).

