
Module B: Algorithms for Optimization

Recall that an optimization problem in standard form is given by

$$\begin{aligned}
\min_{x \in \mathbb{R}^n} \quad & f(x) \\
\text{s.t.} \quad & g_i(x) \le 0, \quad i \in [m] := \{1, 2, \dots, m\}, \\
& h_j(x) = 0, \quad j \in [p].
\end{aligned}$$

Most algorithms generate a sequence $x_0, x_1, x_2, \dots$ by exploiting local information collected along the path.

Zeroth Order: Only the function values $f(x_t)$, $g_i(x_t)$, $h_j(x_t)$ are available.

First Order: Gradients $\nabla f(x_t)$, $\nabla g_i(x_t)$, $\nabla h_j(x_t)$ are used. Heavily used in ML.

Second Order: Hessian information is used, e.g., Newton's method.

Distributed Algorithms

Stochastic/Randomized Algorithms

Measure of progress

Let $x^\star$ be the optimal solution. Iterative algorithms continue until one of the following error metrics is sufficiently small.

$\text{err}_t := \|x_t - x^\star\|$

$\text{err}_t := f(x_t) - f(x^\star)$

A solution $\bar{x}$ is $\varepsilon$-optimal when

$$f(\bar{x}) \le f(x^\star) + \varepsilon.$$

We often run the algorithm until $\text{err}_t$ is smaller than a sufficiently small $\varepsilon > 0$. For constrained problems, an error metric that also accounts for constraint violation is

$$\text{err}_t := \max\big(f(x_t) - f(x^\star),\; g_1(x_t),\; g_2(x_t),\; \dots,\; g_m(x_t)\big).$$

First order methods: Gradient descent

Consider the unconstrained optimization problem: $\min_{x \in \mathbb{R}^n} f(x)$.


Gradient Descent (GD): $x_{t+1} = x_t - \eta_t \nabla f(x_t)$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$ (a minimal implementation sketch follows this list of definitions).

The convergence rate depends on the choice of step size $\eta_t$ and on characteristics of the function.

Bounded Gradient: $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$.

Smooth: A differentiable convex $f$ is $\beta$-smooth if for any $x, y$, we have

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{\beta}{2}\|y - x\|^2.$$

We can obtain a quadratic upper bound on the function from local information.

Strongly Convex: A differentiable convex $f$ is $\alpha$-strongly convex if for any $x, y$, we have

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2}\|y - x\|^2.$$

We can obtain a quadratic lower bound on the function from local information.
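To make the update concrete, here is a minimal NumPy sketch of GD on a toy least-squares objective $f(x) = \frac{1}{2}\|Ax - b\|^2$, which is $\beta$-smooth with $\beta = \lambda_{\max}(A^\top A)$. The data $A$, $b$, the horizon, and the constant step size $1/\beta$ are illustrative assumptions, not prescriptions from these notes.

```python
import numpy as np

# Toy smooth objective f(x) = 0.5 * ||A x - b||^2; A and b are synthetic.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad_f(x):
    # Gradient of f at x: A^T (A x - b).
    return A.T @ (A @ x - b)

beta = np.linalg.norm(A, 2) ** 2   # smoothness constant: lambda_max(A^T A)
eta = 1.0 / beta                   # constant step size 1/beta

x = np.zeros(5)                    # initial guess x_0
for t in range(1000):
    x = x - eta * grad_f(x)        # GD update: x_{t+1} = x_t - eta_t grad f(x_t)
```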

Gradient Descent with Bounded Gradient Assumption

Let $x_0, x_1, \dots, x_T$ be the iterates generated by the GD algorithm. For any $t$, we define $\hat{x}_t := \frac{1}{t}\sum_{i=0}^{t-1} x_i$. Let $x^\star$ be the optimal solution.
Theorem 1: Convergence of Gradient Descent

Let the function $f$ satisfy the bounded gradient property. Let $\|x_0 - x^\star\| \le D$. Then, for the choice of step size $\eta_t = \frac{D}{G\sqrt{T}}$, we have

$$f(\hat{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}.$$

To find an $\varepsilon$-optimal solution, choose $T \ge \left(\frac{DG}{\varepsilon}\right)^2$ and $\eta = \frac{\varepsilon}{G^2}$.
Possible Limitation: the step size requires knowing $D$, $G$, and the horizon $T$ in advance, and the $1/\sqrt{T}$ rate is slow.

Proof: Define the following (potential) function:

$$\Phi_t := \frac{1}{2\eta}\|x_t - x^\star\|^2.$$

We show that $\Phi_t$ is decreasing in $t$. We compute $\Phi_{t+1} - \Phi_t$ as:

Proof

Proof Continues
Gradient Descent with Smoothness Assumption

Recall that a differentiable convex $f$ is $\beta$-smooth if for any $x, y$, we have

$$f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{\beta}{2}\|y - x\|^2.$$
Theorem 2
Let the function $f$ be $\beta$-smooth. Let $\|x_0 - x^\star\| \le D$. Then, for the choice of step size $\eta_t = \frac{1}{\beta}$, we have

$$f(x_T) - f(x^\star) \le \frac{\beta\|x_0 - x^\star\|^2}{2T}.$$

Proof: Define the following (potential) function:

$$\Phi_t := t\,[f(x_t) - f(x^\star)] + \frac{\beta}{2}\|x_t - x^\star\|^2.$$

We show that $\Phi_t$ is decreasing in $t$. We compute $\Phi_{t+1} - \Phi_t$ as:

Proof
Gradient Descent with Smoothness and Strong Convexity

Recall that a differentiable convex $f$ is $\alpha$-strongly convex if for any $x, y$, we have

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle + \frac{\alpha}{2}\|y - x\|^2.$$
Theorem 3
Let the function $f$ be $\beta$-smooth and $\alpha$-strongly convex with $\alpha \le \beta$. Define the condition number $\kappa := \frac{\beta}{\alpha}$. Then, for the choice of step size $\eta_t = \frac{1}{\beta}$, we have

$$f(x_T) - f(x^\star) \le e^{-\frac{T}{\kappa}}\,(f(x_0) - f(x^\star)).$$

Note: To obtain an $\varepsilon$-optimal solution, choose $T = O\!\left(\kappa \log\frac{1}{\varepsilon}\right)$.

Proof: Define the following (potential) function:

$$\Phi_t := (1 + \gamma)^t\,[f(x_t) - f(x^\star)], \quad \text{where } \gamma = \frac{1}{\kappa - 1} = \frac{\alpha}{\beta - \alpha}.$$

We need to show that $\Phi_{t+1} \le \Phi_t$.

Proof

Proof Continues
Summary of gradient descent convergence rates

Consider the unconstrained optimization problem: $\min_{x \in \mathbb{R}^n} f(x)$.

Gradient Descent (GD): $x_{t+1} = x_t - \eta_t \nabla f(x_t)$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$.

Theorem 4: GD Convergence rates

Let $\|x_0 - x^\star\| \le D$.

If $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$, then with $\eta_t = \frac{D}{G\sqrt{T}}$, $f(\hat{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}$.

If $f$ is $\beta$-smooth, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le \frac{\beta\|x_0 - x^\star\|^2}{2T}$.

If $f$ is $\beta$-smooth and $\alpha$-strongly convex, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le e^{-\frac{T}{\kappa}}(f(x_0) - f(x^\star))$, where $\kappa := \frac{\beta}{\alpha}$ is the condition number.

Gradient descent: Constrained Case

Consider the constrained optimization problem: $\min_{x \in X} f(x)$, where $X \subseteq \mathbb{R}^n$ is a convex feasible set.

Projected Gradient Descent (PGD): $x_{t+1} = \Pi_X[x_t - \eta_t \nabla f(x_t)]$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$, where $\Pi_X(y)$ is the projection of $y$ onto the set $X$.

Theorem 5
Let $\|x_0 - x^\star\| \le D$.

If $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$, then with $\eta_t = \frac{D}{G\sqrt{T}}$, $f(\hat{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}$.

If $f$ is $\beta$-smooth, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le \frac{\beta\|x_0 - x^\star\|^2}{2T}$.

If $f$ is $\beta$-smooth and $\alpha$-strongly convex, for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le e^{-\frac{T}{\kappa}}(f(x_0) - f(x^\star))$, where $\kappa := \frac{\beta}{\alpha}$ is the condition number.

Note: Convergence rates remain unchanged.

Note: Projection itself is another optimization problem!

Non-expansive property, which preserves the convergence rates:

$$\|\Pi_X(y_1) - \Pi_X(y_2)\| \le \|y_1 - y_2\|.$$
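As an illustration, here is a minimal PGD sketch, assuming projection onto a Euclidean ball (one of the easy cases from the next slide). The data $A$, $b$, the radius, and the horizon are illustrative assumptions.

```python
import numpy as np

def project_ball(y, r):
    # Projection onto X = {x : ||x||_2 <= r}: rescale y if it lies outside X.
    norm = np.linalg.norm(y)
    return y if norm <= r else (r / norm) * y

# Toy smooth objective f(x) = 0.5 * ||A x - b||^2 with synthetic data.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
r = 1.0                                  # radius of the feasible ball X

eta = 1.0 / np.linalg.norm(A, 2) ** 2    # step size 1/beta
x = np.zeros(5)                          # feasible initial guess
for t in range(1000):
    grad = A.T @ (A @ x - b)
    x = project_ball(x - eta * grad, r)  # gradient step, then project back onto X
```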

When is Projection easy to find?

Note that $\Pi_X(y) = \operatorname{argmin}_{x \in X} \|y - x\|^2$. Find a closed-form expression for the projection in the following cases (one possible set of answers is sketched after the list).

$X = \{x \in \mathbb{R}^n : \|x\|_2 \le r\}$.

$X = \{x \in \mathbb{R}^n : x_l \le x \le x_u\}$.

$X = \{x \in \mathbb{R}^n : Ax = b\}$.

$X = \{x \in \mathbb{R}^n : x \ge 0,\ \sum_{i=1}^n x_i \le 1\}$.
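Since these are left as exercises, the following NumPy sketch records one possible set of answers: rescaling for the ball, coordinate-wise clipping for the box, a normal-equations correction for the affine set (assuming $A$ has full row rank), and the standard sort-and-threshold rule for the last set. The function names are mine.

```python
import numpy as np

def proj_ball(y, r):
    # X = {x : ||x||_2 <= r}: rescale y onto the ball if it lies outside.
    nrm = np.linalg.norm(y)
    return y if nrm <= r else (r / nrm) * y

def proj_box(y, xl, xu):
    # X = {x : xl <= x <= xu}: clip each coordinate independently.
    return np.clip(y, xl, xu)

def proj_affine(y, A, b):
    # X = {x : A x = b}, A full row rank: y - A^T (A A^T)^{-1} (A y - b).
    return y - A.T @ np.linalg.solve(A @ A.T, A @ y - b)

def proj_simplex_ineq(y):
    # X = {x >= 0, sum_i x_i <= 1}: clip at zero; if the sum still exceeds 1,
    # the sum constraint is active and we project onto the probability simplex
    # by sorted thresholding: x_i = max(y_i - tau, 0).
    x = np.maximum(y, 0.0)
    if x.sum() <= 1.0:
        return x
    u = np.sort(y)[::-1]                  # y sorted in decreasing order
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(y) + 1)
    k = ks[u - css / ks > 0][-1]          # largest k with u_k > (sum_{i<=k} u_i - 1)/k
    tau = css[k - 1] / k
    return np.maximum(y - tau, 0.0)
```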

Accelerated Gradient Descent

Start with $x_0 = y_0 = z_0 \in \mathbb{R}^n$. At every time step $t$,

$$\begin{aligned}
y_{t+1} &= x_t - \tfrac{1}{\beta}\nabla f(x_t) \\
z_{t+1} &= z_t - \eta_t \nabla f(x_t) \\
x_{t+1} &= (1 - \tau_{t+1})\, y_{t+1} + \tau_{t+1}\, z_{t+1}
\end{aligned}$$

Theorem 6
Let $f$ be $\beta$-smooth, $\eta_t = \frac{t+1}{2\beta}$ and $\tau_t = \frac{2}{t+2}$. Then, we have

$$f(y_T) - f(x^\star) \le \frac{2\beta\|x_0 - x^\star\|^2}{T(T+1)}.$$

Proof: Define $\Phi_t = t(t+1)\,(f(y_t) - f(x^\star)) + 2\beta\|z_t - x^\star\|^2$ and show that $\Phi_{t+1} \le \Phi_t$.
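A minimal sketch of this three-sequence scheme on the same toy least-squares objective, with $\eta_t$ and $\tau_t$ as in Theorem 6; the data and horizon are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)
beta = np.linalg.norm(A, 2) ** 2          # smoothness constant

def grad_f(x):
    return A.T @ (A @ x - b)

x = y = z = np.zeros(5)                   # x_0 = y_0 = z_0
for t in range(200):
    g = grad_f(x)
    y = x - g / beta                      # y_{t+1} = x_t - (1/beta) grad f(x_t)
    z = z - (t + 1) / (2 * beta) * g      # z_{t+1} = z_t - eta_t grad f(x_t)
    tau = 2.0 / (t + 3)                   # tau_{t+1} = 2 / ((t+1) + 2)
    x = (1 - tau) * y + tau * z           # x_{t+1} mixes y_{t+1} and z_{t+1}
```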

Accelerated Gradient Descent 2

Start with $x_0 = y_0$. At every step $t$,

$$\begin{aligned}
y_{t+1} &= x_t - \tfrac{1}{\beta}\nabla f(x_t) \\
x_{t+1} &= \left(1 + \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right) y_{t+1} - \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\, y_t
\end{aligned}$$

Theorem 7
Let $f$ be $\beta$-smooth and $\alpha$-strongly convex with $\kappa = \frac{\beta}{\alpha}$, and let $\gamma = \frac{1}{\sqrt{\kappa} - 1}$. Then, we have

$$f(y_T) - f(x^\star) \le (1 + \gamma)^{-T}\,\frac{\alpha + \beta}{2}\,\|x_0 - x^\star\|^2.$$

This improves upon the previous rate, where we had $\gamma = \frac{1}{\kappa - 1}$.
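A sketch of this constant-momentum variant, assuming a strongly convex toy quadratic $f(x) = \frac{1}{2}x^\top H x - c^\top x$ so that $\alpha$ and $\beta$ can be read off the eigenvalues of $H$; all data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((20, 5))
H = A.T @ A + 0.1 * np.eye(5)             # Hessian of f(x) = 0.5 x^T H x - c^T x
c = rng.standard_normal(5)

beta = np.linalg.eigvalsh(H).max()        # smoothness constant
alpha = np.linalg.eigvalsh(H).min()       # strong convexity constant
kappa = beta / alpha
m = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)   # momentum coefficient

x = y_prev = np.zeros(5)                  # x_0 = y_0
for t in range(200):
    g = H @ x - c                         # grad f(x_t)
    y = x - g / beta                      # y_{t+1} = x_t - (1/beta) grad f(x_t)
    x = (1 + m) * y - m * y_prev          # x_{t+1} = (1+m) y_{t+1} - m y_t
    y_prev = y
```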

Further details

AGD was invented by Nesterov in a series of papers in the 1980s and early 2000s, and was later popularized by ML researchers.

The convergence rates in the previous two theorems are the best possible
ones.

Book by Nesterov:
https://link.springer.com/book/10.1007/978-1-4419-8853-9

https://francisbach.com/continuized-acceleration/

https://www.nowpublishers.com/article/Details/OPT-036

Finite Sum Setting

A large number of problems that arise in (supervised) ML can be written as

$$\min_{x \in \mathbb{R}^n} f(x) := \frac{1}{N}\sum_{i=1}^{N} f_i(x) = \frac{1}{N}\sum_{i=1}^{N} l(x, \xi_i).$$

Example: Regression/Least Squares, SVM, NN Training

The above problem can also be viewed as a sample average approximation of the stochastic optimization problem

$$f(x) = \mathbb{E}[l(x, \xi)]$$

involving an uncertain parameter or random variable $\xi$.


Challenge: $N$ (the number of samples) or $n$ (the dimension of the decision variable) may be large. Samples may be located on different servers. (A small finite-sum example is sketched below.)
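For concreteness, here is a hedged sketch of the finite-sum form for least squares, where $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$ and the pairs $(a_i, b_i)$ play the role of the samples $\xi_i$; the synthetic data and names are mine. The SGD sketch on the next slide reuses these definitions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, n = 1000, 10                        # number of samples, decision dimension
A = rng.standard_normal((N, n))        # row i is the sample a_i
b = rng.standard_normal(N)

def grad_fi(x, i):
    # Gradient of the single-sample loss f_i(x) = 0.5 * (a_i^T x - b_i)^2.
    return (A[i] @ x - b[i]) * A[i]

def grad_f(x):
    # Full gradient: the average of the N per-sample gradients.
    return A.T @ (A @ x - b) / N
```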

Gradient Descent vs. Stochastic Gradient Descent

Gradient Descent (GD): $x_{t+1} = x_t - \eta_t \nabla f(x_t) = x_t - \frac{\eta_t}{N}\sum_{i=1}^{N}\nabla f_i(x_t)$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$.

Each step requires $N$ gradient computations.

Stochastic Gradient Descent (SGD): At every time step $t$,

Pick an index (sample) $i_t$ uniformly at random from the set $\{1, 2, \dots, N\}$.
Set $x_{t+1} = x_t - \eta_t \nabla f_{i_t}(x_t)$.

Each step requires one gradient computation, which is a noisy version of the true gradient of the cost function at $x_t$.
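A minimal SGD sketch, reusing the NumPy import and grad_fi, rng, n, N from the finite-sum example above; the horizon $T$, the constant $c$, and the diminishing schedule $\eta_t = \frac{1}{c\sqrt{t+1}}$ (discussed a few slides below) are illustrative choices.

```python
T, c = 5000, 10.0                         # horizon and schedule constant
x = np.zeros(n)                           # initial guess x_0
for t in range(T):
    i_t = rng.integers(N)                 # pick i_t uniformly from {0, ..., N-1}
    eta_t = 1.0 / (c * np.sqrt(t + 1))    # diminishing step size
    x = x - eta_t * grad_fi(x, i_t)       # noisy but unbiased gradient step
```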

Key result for SGD convergence

Under the following assumptions:

Convexity: each $f_i$ is convex,
Bounded variance: $\mathbb{E}[\|\nabla f_{i_t}(x)\|^2] \le \sigma^2$ for some $\sigma$ and all $x$,
Unbiased gradient estimate: $\mathbb{E}[\nabla f_{i_t}(x)] = \nabla f(x)$ for all $x$,

the solutions generated by the SGD algorithm satisfy

$$\sum_{t=0}^{T-1} \eta_t\,\big[\mathbb{E}[f(x_t)] - f(x^\star)\big] \le \frac{1}{2}\|x_0 - x^\star\|^2 + \frac{\sigma^2}{2}\sum_{t=0}^{T-1}\eta_t^2$$

$$\implies \quad \mathbb{E}[f(\hat{x}_T)] - f(x^\star) \le \frac{\|x_0 - x^\star\|^2}{2\sum_{t=0}^{T-1}\eta_t} + \frac{\sigma^2\sum_{t=0}^{T-1}\eta_t^2}{2\sum_{t=0}^{T-1}\eta_t},$$

where $\hat{x}_T = \frac{1}{\sum_{t=0}^{T-1}\eta_t}\sum_{t=0}^{T-1}\eta_t x_t$.

Proof Continues
Choice of stepsize

A constant step size will not give us convergence. For convergence, we need to choose step sizes that are diminishing and square-summable but not summable, i.e.,

$$\lim_{T \to \infty} \sum_{t=0}^{T-1} \eta_t = \infty, \qquad \lim_{T \to \infty} \sum_{t=0}^{T-1} \eta_t^2 < \infty.$$

If $\eta_t := \frac{1}{c\sqrt{t+1}}$, then $\mathbb{E}[f(\hat{x}_T)] - f(x^\star) \le O\!\left(\frac{\log T}{\sqrt{T}}\right)$. This rate does not improve when the function is smooth.

When the function is smooth, for an appropriately chosen constant step size $\eta_t := \eta$, the R.H.S. will be of order $O\!\left(\frac{1}{\eta T}\right) + O(\eta)$. (Both schedules are written out below.)
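The two schedules, written as plain functions for reference; the constants are illustrative, not values prescribed by the notes.

```python
import numpy as np

def eta_diminishing(t, c=10.0):
    # Square-summable but not summable: sum_t eta_t = inf, sum_t eta_t^2 < inf.
    return 1.0 / (c * np.sqrt(t + 1))

def eta_constant(t, eta=1e-2):
    # Constant step: error of order O(1/(eta*T)) + O(eta); no exact convergence.
    return eta
```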

Analysis for Smooth and Strongly Convex Functions

When the function $f$ is $\beta$-smooth and $\alpha$-strongly convex, we have the following guarantees for SGD after $T$ iterations.

If $\eta_t := \frac{1}{ct}$ for a suitable constant $c$, then the error bound is $O\!\left(\frac{\log T}{T}\right)$. This can be improved to $O\!\left(\frac{1}{T}\right)$.

If $\eta_t := \eta$, then the error bound is

$$\mathbb{E}\big[\|x_T - x^\star\|^2\big] \le (1 - \eta\alpha)^T\,\|x_0 - x^\star\|^2 + \frac{\eta\sigma^2}{2\alpha}.$$

With a constant step size $\eta < \frac{1}{\alpha}$, convergence is quick to a neighborhood of the optimal solution; see the sketch below.
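To see the neighborhood behavior, here is a hedged sketch that reuses A, b, grad_fi, rng, n, and N from the finite-sum example above; the step size and horizon are arbitrary illustrative choices.

```python
# Constant-step SGD: the error contracts geometrically at first and then
# hovers near a noise floor of order eta * sigma^2 / (2 * alpha).
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # minimizer of the toy objective
x, eta = np.zeros(n), 1e-2
for t in range(20000):
    x = x - eta * grad_fi(x, rng.integers(N))
print(np.linalg.norm(x - x_star))               # small, but bounded away from zero
```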

Extension: Mini-Batch

Extension: Stochastic Averaging

Further Reading

SAG: Schmidt, Mark, Nicolas Le Roux, and Francis Bach. "Minimizing finite sums with the stochastic average gradient." Mathematical Programming 162 (2017): 83-112.

SAGA: Defazio, Aaron, Francis Bach, and Simon Lacoste-Julien. "SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives." Advances in Neural Information Processing Systems 27 (2014).

Recent Review: Gower, Robert M., Mark Schmidt, Francis Bach, and Peter Richtárik. "Variance-reduced methods for machine learning." Proceedings of the IEEE 108, no. 11 (2020): 1968-1983.

Allen-Zhu, Zeyuan. "Katyusha: The First Direct Acceleration of Stochastic Gradient Methods." Journal of Machine Learning Research 18 (2018): 1-51.

Varre, Aditya, and Nicolas Flammarion. "Accelerated SGD for non-strongly-convex least squares." In Conference on Learning Theory, pp. 2062-2126. PMLR, 2022.

Hanzely, Filip, Konstantin Mishchenko, and Peter Richtárik. "SEGA: Variance reduction via gradient sketching." Advances in Neural Information Processing Systems 31 (2018).

Extension: Adaptive Step-sizes

AdaGrad: Duchi, John, Elad Hazan, and Yoram Singer. "Adaptive subgradient methods for online learning and stochastic optimization." Journal of Machine Learning Research 12, no. 7 (2011).

Adam: Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).

