Convex Module B
$\min_{x \in \mathbb{R}^n} f(x)$
s.t. $g_i(x) \le 0, \quad i \in [m] := \{1, 2, \ldots, m\}$,
     $h_j(x) = 0, \quad j \in [p]$.
First Order: Gradients $\nabla f(x_t)$, $\nabla g_i(x_t)$, $\nabla h_j(x_t)$ are used. Heavily used in ML.
Distributed Algorithms
Stochastic/Randomized Algorithms
Measure of progress
Let $x^\star$ be the optimal solution. The iterative algorithms continue until one of the following error metrics is sufficiently small:
$\mathrm{err}_t := \|x_t - x^\star\|$,
$f(\bar{x}) \le f(x^\star) + \epsilon$.
We often run the algorithm until $\mathrm{err}_t$ is smaller than a sufficiently small $\epsilon > 0$.
First order methods: Gradient descent
Gradient Descent with Bounded Gradient Assumption
Let $x_0, x_1, \ldots, x_T$ be the iterates generated by the GD algorithm.
For any $t$, we define $\bar{x}_t := \frac{1}{t}\sum_{i=0}^{t-1} x_i$. Let $x^\star$ be the optimal solution.
Theorem 1: Convergence of Gradient Descent
Let the function $f$ satisfy the bounded gradient property. Let $\|x_0 - x^\star\| \le D$. Then, for the choice of step size $\eta_t = \frac{D}{G\sqrt{T}}$, we have
$f(\bar{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}$.
To find an $\epsilon$-optimal solution, choose $T \ge \left(\frac{DG}{\epsilon}\right)^2$ and $\eta = \frac{\epsilon}{G^2}$.
Possible Limitation:
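To make the averaged-iterate guarantee concrete, here is a minimal NumPy sketch of GD with the constant step size $\eta = \frac{D}{G\sqrt{T}}$ and the running average $\bar{x}_T$ from Theorem 1. The least-squares objective and the values of $D$, $G$, and $T$ below are illustrative assumptions (for a quadratic the gradient is not globally bounded, so this only demonstrates the update and the averaging, not the theorem's hypotheses).

import numpy as np

# Illustrative objective (assumption): f(x) = 0.5 * ||A x - b||^2
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))
b = rng.standard_normal(20)

def grad_f(x):
    return A.T @ (A @ x - b)

def gd_averaged(x0, D, G, T):
    """GD with step size D / (G * sqrt(T)); returns the averaged iterate bar{x}_T."""
    eta = D / (G * np.sqrt(T))
    x = x0.copy()
    x_sum = np.zeros_like(x0)
    for _ in range(T):
        x_sum += x                  # accumulate x_0, ..., x_{T-1}
        x = x - eta * grad_f(x)     # x_{t+1} = x_t - eta * grad f(x_t)
    return x_sum / T

# D and G are placeholder constants, not certified bounds for this objective.
x_bar = gd_averaged(np.zeros(5), D=10.0, G=100.0, T=10_000)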
Proof
Proof Continues
Gradient Descent with Smoothness Assumption
If $f$ is $\beta$-smooth, then for $\eta_t = \frac{1}{\beta}$,
$f(x_T) - f(x^\star) \le \frac{\beta\|x_0 - x^\star\|^2}{2T}$.
Proof
Gradient Descent with Smoothness and Strong Convexity
Proof
Proof Continues
Summary of gradient descent convergence rates
Let $\|x_0 - x^\star\| \le D$.
If $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$, then with $\eta_t = \frac{D}{G\sqrt{T}}$,
$f(\bar{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}$.
If $f$ is $\beta$-smooth, then for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le \frac{\beta\|x_0 - x^\star\|^2}{2T}$.
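As a quick worked comparison of the two bounds above (with $\|x_0 - x^\star\| \le D$), reaching an $\epsilon$-optimal point requires

$\frac{DG}{\sqrt{T}} \le \epsilon \;\Longleftrightarrow\; T \ge \frac{D^2 G^2}{\epsilon^2}, \qquad \frac{\beta D^2}{2T} \le \epsilon \;\Longleftrightarrow\; T \ge \frac{\beta D^2}{2\epsilon},$

so the bounded-gradient rate needs $O(1/\epsilon^2)$ iterations while the smooth rate needs only $O(1/\epsilon)$.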
Gradient descent: Constrained Case
Projected GD update: $x_{t+1} = \Pi_X\!\left(x_t - \eta_t \nabla f(x_t)\right)$, where $\Pi_X$ denotes the projection on the set $X$.
Theorem 5
Let $\|x_0 - x^\star\| \le D$.
If $\|\nabla f(x)\| \le G$ for all $x \in \mathbb{R}^n$, then with $\eta_t = \frac{D}{G\sqrt{T}}$,
$f(\bar{x}_T) - f(x^\star) \le \frac{DG}{\sqrt{T}}$.
If $f$ is $\beta$-smooth, then for $\eta_t = \frac{1}{\beta}$, $f(x_T) - f(x^\star) \le \frac{\beta\|x_0 - x^\star\|^2}{2T}$.
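A minimal sketch of the projected update, taking a box constraint $X = [0,1]^n$ as the example set; the quadratic objective, step size, and horizon below are illustrative assumptions, not part of the slides.

import numpy as np

def projected_gd(grad_f, project, x0, eta, T):
    """Projected GD: x_{t+1} = Pi_X(x_t - eta * grad f(x_t)); also returns the averaged iterate."""
    x = x0.copy()
    x_sum = np.zeros_like(x0)
    for _ in range(T):
        x_sum += x                        # accumulate for bar{x}_T
        x = project(x - eta * grad_f(x))  # gradient step followed by projection onto X
    return x, x_sum / T

# Illustrative use (assumption): minimize 0.5 * ||x - c||^2 over the box [0, 1]^n.
c = np.array([2.0, -1.0, 0.5])
grad = lambda x: x - c
box_project = lambda y: np.clip(y, 0.0, 1.0)
x_last, x_avg = projected_gd(grad, box_project, x0=np.zeros(3), eta=0.1, T=500)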
When is Projection easy to find?
Note that $\Pi_X(y) = \arg\min_{x \in X} \|y - x\|^2$. Find a closed-form expression for the projection in the following cases (a sketch for the first three cases follows the list).
$X = \{x \in \mathbb{R}^n : \|x\|_2 \le r\}$.
$X = \{x \in \mathbb{R}^n : x_l \le x \le x_u\}$.
$X = \{x \in \mathbb{R}^n : Ax = b\}$.
$X = \{x \in \mathbb{R}^n : x \ge 0, \; \sum_{i=1}^{n} x_i \le 1\}$.
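A minimal sketch of the closed forms for the first three sets; the affine case assumes $A$ has full row rank so that $AA^\top$ is invertible. The last set (nonnegative vectors with coordinate sum at most 1) generally needs a small sorting-based routine rather than a one-line formula, so it is omitted here.

import numpy as np

def proj_l2_ball(y, r):
    """X = {x : ||x||_2 <= r}: rescale y onto the ball if it lies outside."""
    norm = np.linalg.norm(y)
    return y if norm <= r else (r / norm) * y

def proj_box(y, xl, xu):
    """X = {x : xl <= x <= xu}: clip each coordinate to [xl_i, xu_i]."""
    return np.minimum(np.maximum(y, xl), xu)

def proj_affine(y, A, b):
    """X = {x : Ax = b}: y - A^T (A A^T)^{-1} (A y - b), assuming A has full row rank."""
    return y - A.T @ np.linalg.solve(A @ A.T, A @ y - b)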
Accelerated Gradient Descent
$z_{t+1} = z_t - \eta_t \nabla f(x_t)$
$x_{t+1} = (1 - \tau_{t+1})\, y_{t+1} + \tau_{t+1}\, z_{t+1}$
Theorem 6
Let $f$ be $\beta$-smooth, $\eta_t = \frac{t+1}{2\beta}$ and $\tau_t = \frac{2}{t+2}$. Then, we have
$f(y_T) - f(x^\star) \le \frac{2\beta\,\|x_0 - x^\star\|^2}{T(T+1)}$.
Proof: Define $\Phi_t = t(t+1)\big(f(y_t) - f(x^\star)\big) + 2\beta\,\|z_t - x^\star\|^2$ and show that $\Phi_{t+1} \le \Phi_t$.
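A minimal NumPy sketch of this accelerated scheme with $\eta_t = \frac{t+1}{2\beta}$ and $\tau_t = \frac{2}{t+2}$. The gradient step $y_{t+1} = x_t - \frac{1}{\beta}\nabla f(x_t)$, the quadratic objective, and the value of $\beta$ are assumptions used for illustration; the $y$-update is the standard companion of the potential-function analysis quoted in Theorem 6.

import numpy as np

# Illustrative beta-smooth objective (assumption): f(x) = 0.5 * x^T Q x, beta = largest eigenvalue of Q.
Q = np.diag([1.0, 10.0, 100.0])
beta = 100.0
grad_f = lambda x: Q @ x

def agd(x0, beta, T):
    """Accelerated GD with eta_t = (t+1)/(2*beta) and tau_t = 2/(t+2)."""
    x, y, z = x0.copy(), x0.copy(), x0.copy()
    for t in range(T):
        g = grad_f(x)
        y = x - g / beta                   # assumed gradient step defining y_{t+1}
        z = z - (t + 1) / (2 * beta) * g   # z_{t+1} = z_t - eta_t * grad f(x_t)
        tau = 2.0 / (t + 3)                # tau_{t+1} = 2 / ((t+1) + 2)
        x = (1 - tau) * y + tau * z        # x_{t+1} = (1 - tau_{t+1}) y_{t+1} + tau_{t+1} z_{t+1}
    return y

y_T = agd(np.array([1.0, 1.0, 1.0]), beta, T=200)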
Accelerated Gradient Descent 2
Theorem 7
Further details
AGD was invented by Nesterov in a series of papers in the 1980s and early 2000s, and was later popularized by ML researchers.
The convergence rates in the previous two theorems are the best possible
ones.
Book by Nesterov:
https://link.springer.com/book/10.1007/978-1-4419-8853-9
https://francisbach.com/continuized-acceleration/
https://www.nowpublishers.com/article/Details/OPT-036
Finite Sum Setting
Gradient Descent vs. Stochastic Gradient Descent
Gradient Descent (GD): $x_{t+1} = x_t - \eta_t \nabla f(x_t) = x_t - \frac{\eta_t}{N}\sum_{i=1}^{N} \nabla f_i(x_t)$, $t \ge 0$, starting from an initial guess $x_0 \in \mathbb{R}^n$.
Stochastic Gradient Descent (SGD): each step requires only 1 gradient computation (that of a single randomly chosen component $f_i$), which gives a noisy version of the true gradient of the cost function at $x_t$.
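A minimal NumPy sketch contrasting the two updates on a finite sum; the least-squares components, step sizes, and uniform sampling of the index are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
N, n = 100, 5
A = rng.standard_normal((N, n))
b = rng.standard_normal(N)

# Finite sum (assumption): f(x) = (1/N) * sum_i f_i(x), with f_i(x) = 0.5 * (a_i^T x - b_i)^2
grad_fi = lambda x, i: (A[i] @ x - b[i]) * A[i]
grad_f  = lambda x: A.T @ (A @ x - b) / N

def gd(x0, eta, T):
    x = x0.copy()
    for _ in range(T):
        x = x - eta * grad_f(x)      # one full pass over all N components per step
    return x

def sgd(x0, eta, T):
    x = x0.copy()
    for _ in range(T):
        i = rng.integers(N)          # uniformly sampled component index
        x = x - eta * grad_fi(x, i)  # one (noisy) gradient per step
    return x

x_gd, x_sgd = gd(np.zeros(n), 0.01, 200), sgd(np.zeros(n), 0.01, 200 * N)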
Key result for SGD convergence
where $\bar{x}_T = \frac{\sum_{t=0}^{T-1} \eta_t x_t}{\sum_{t=0}^{T-1} \eta_t}$.
Proof Continues
Proof Continues
Proof Continues
Choice of stepsize
If $\eta_t := \frac{1}{c\sqrt{t+1}}$, then $\mathbb{E}[f(\bar{x}_T)] - f(x^\star) \le O\!\left(\frac{\log T}{\sqrt{T}}\right)$. This rate does not improve when the function is smooth.
When the function is smooth, for $\eta_t := \eta$ chosen appropriately, the R.H.S. will be of order $O\!\left(\frac{1}{\eta T}\right) + O(\eta)$.
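As a rough worked check of the $O(\log T/\sqrt{T})$ claim, assuming the key SGD result takes the standard weighted-average form $\mathbb{E}[f(\bar{x}_T)] - f(x^\star) \le \frac{\|x_0 - x^\star\|^2 + G^2 \sum_{t=0}^{T-1} \eta_t^2}{2\sum_{t=0}^{T-1} \eta_t}$ under bounded stochastic gradients $\mathbb{E}\|g_t\|^2 \le G^2$ (this exact form is an assumption):

$\sum_{t=0}^{T-1} \frac{1}{c\sqrt{t+1}} \ge \frac{2(\sqrt{T+1}-1)}{c}, \qquad \sum_{t=0}^{T-1} \frac{1}{c^2 (t+1)} \le \frac{1 + \log T}{c^2},$

so the right-hand side is of order $\frac{c\,\|x_0 - x^\star\|^2 + (G^2/c)\log T}{\sqrt{T}} = O\!\left(\frac{\log T}{\sqrt{T}}\right)$ for a fixed constant $c$.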
Analysis for Smooth and Strongly Convex Functions
When the function $f$ is $\beta$-smooth and $\alpha$-strongly convex, we have the following guarantees for SGD after $T$ iterations.
If $\eta_t := \frac{1}{ct}$ for a suitable constant $c$, then the error bound is $O\!\left(\frac{\log T}{T}\right)$. This can be improved to $O\!\left(\frac{1}{T}\right)$.
Extension: Mini-Batch
Extension: Stochastic Averaging
Further Reading
SAG: Schmidt, Mark, Nicolas Le Roux, and Francis Bach. “Minimizing finite
sums with the stochastic average gradient.” Mathematical Programming
162 (2017): 83-112.
Recent Review: Gower, Robert M., Mark Schmidt, Francis Bach, and Peter
Richtárik. “Variance-reduced methods for machine learning.” Proceedings
of the IEEE 108, no. 11 (2020): 1968-1983.
Extension: Adaptive Step-sizes
AdaGrad: Duchi, John, Elad Hazan, and Yoram Singer. “Adaptive subgradient methods for online learning and stochastic optimization.” Journal of Machine Learning Research 12, no. 7 (2011).
Adam: Kingma, Diederik P., and Jimmy Ba. “Adam: A method for stochastic optimization.” arXiv preprint arXiv:1412.6980 (2014).