
MIT 9.520/6.860, Fall 2018


Statistical Learning Theory and Applications

Class 06: Learning with Stochastic Gradients

Sasha Rakhlin



Why Optimization?

Much (but not all) of Machine Learning: write down objective function
involving data and parameters, find good (or optimal) parameters
through optimization.

Key idea: find a near-optimal solution by iteratively using only local information about the objective (e.g. gradient, Hessian).



Motivating example: Newton’s Method

Newton’s method in 1d:

w_{t+1} = w_t - (f''(w_t))^{-1} f'(w_t)

Example (parabola): f(w) = aw^2 + bw + c.

Start with any w_1. Then Newton's method gives

w_2 = w_1 - (2a)^{-1} (2a w_1 + b),

which means w_2 = -b/(2a). It finds the minimum of f in 1 step, no matter where you start!
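A quick numerical check of this one-step property (a minimal sketch in Python; the coefficients a, b, c and the starting point are arbitrary choices):

```python
# Newton's method in 1d on f(w) = a*w^2 + b*w + c (a > 0).
a, b, c = 3.0, -4.0, 1.0

def f_prime(w):
    return 2 * a * w + b        # f'(w)

def f_double_prime(w):
    return 2 * a                # f''(w), constant for a parabola

w1 = 100.0                      # start anywhere
w2 = w1 - f_prime(w1) / f_double_prime(w1)

print(w2, -b / (2 * a))         # both 0.666...: the minimum in one step
```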



Newton’s Method in multiple dim:

w_{t+1} = w_t - [\nabla^2 f(w_t)]^{-1} \nabla f(w_t)

(here \nabla^2 f(w_t) is the Hessian, assumed invertible)



Recalling Least Squares

Least Squares objective (without 1/n normalization):

f(w) = \sum_{i=1}^n (y_i - x_i^T w)^2 = \|Y - Xw\|^2

Calculate: \nabla^2 f(w) = 2X^T X and \nabla f(w) = -2X^T (Y - Xw).

Taking w_1 = 0, Newton's method gives

w_2 = 0 + (2X^T X)^{-1} \cdot 2X^T (Y - X \cdot 0) = (X^T X)^{-1} X^T Y

which is the least-squares solution (global min). Again, 1 step is enough.

Verify: if f(w) = \|Y - Xw\|^2 + \lambda \|w\|^2, then (X^T X) becomes (X^T X + \lambda I).
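To see this concretely, here is a minimal sketch (assuming NumPy; the synthetic data shapes are arbitrary) checking that one Newton step from w_1 = 0 recovers the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
Y = rng.standard_normal(50)

w1 = np.zeros(5)
grad = -2 * X.T @ (Y - X @ w1)            # gradient of f at w_1 = 0
hess = 2 * X.T @ X                        # Hessian, constant in w
w2 = w1 - np.linalg.solve(hess, grad)     # one Newton step

w_ls = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(w2, w_ls))              # True
```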



What do we do if the data (x_1, y_1), ..., (x_n, y_n), ... are streaming? Can we incorporate data on the fly without having to recompute the inverse of (X^T X) at every step?

→ Online Learning



Let w_1 = 0. Let w_t be the least-squares solution after seeing t - 1 data points. Can we get w_t from w_{t-1} cheaply? Newton's method will do it in 1 step (since the objective is quadratic).

Let C_t = \sum_{i=1}^t x_i x_i^T (or + \lambda I), and X_t = [x_1, ..., x_t]^T, Y_t = [y_1, ..., y_t]^T. Newton's method gives

w_{t+1} = w_t + C_t^{-1} X_t^T (Y_t - X_t w_t)

This can be simplified to

w_{t+1} = w_t + C_t^{-1} x_t (y_t - x_t^T w_t)

since the residuals up to t - 1 are orthogonal to the columns of X_{t-1}.

The bottleneck is computing C_t^{-1}. Can we update it quickly from C_{t-1}^{-1}?



Sherman-Morrison formula: for an invertible square A and any u, v with 1 + v^T A^{-1} u \ne 0,

(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}

Hence

C_t^{-1} = C_{t-1}^{-1} - \frac{C_{t-1}^{-1} x_t x_t^T C_{t-1}^{-1}}{1 + x_t^T C_{t-1}^{-1} x_t}

and (do the calculation)

C_t^{-1} x_t = C_{t-1}^{-1} x_t \cdot \frac{1}{1 + x_t^T C_{t-1}^{-1} x_t}

Computation required: a d × d matrix C_{t-1}^{-1} times a d × 1 vector, i.e. O(d^2) time to incorporate a new datapoint. Memory: O(d^2). Unlike re-solving the full regression from scratch, this does not depend on the amount of data t.
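A minimal sketch of the resulting recursive least-squares update (assuming NumPy; the ridge term lam and the synthetic stream are illustrative choices):

```python
import numpy as np

d, lam = 5, 1.0
w = np.zeros(d)
C_inv = np.eye(d) / lam                        # C_0^{-1}, with C_0 = lam * I

def rls_step(w, C_inv, x, y):
    """Incorporate one (x, y) pair in O(d^2) time."""
    Cx = C_inv @ x                             # C_{t-1}^{-1} x_t
    denom = 1.0 + x @ Cx                       # 1 + x_t^T C_{t-1}^{-1} x_t
    C_inv = C_inv - np.outer(Cx, Cx) / denom   # Sherman-Morrison rank-one update
    w = w + (Cx / denom) * (y - x @ w)         # uses C_t^{-1} x_t = Cx / denom
    return w, C_inv

rng = np.random.default_rng(0)
w_true = rng.standard_normal(d)
for _ in range(1000):
    x = rng.standard_normal(d)
    y = x @ w_true + 0.1 * rng.standard_normal()
    w, C_inv = rls_step(w, C_inv, x, y)
print(np.round(w - w_true, 2))                 # near zero
```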



Recursive Least Squares (cont.)

Recap: recursive least squares is

w_{t+1} = w_t + C_t^{-1} x_t (y_t - x_t^T w_t)

with a rank-one update of C_{t-1}^{-1} to get C_t^{-1}.

Consider throwing away the second-derivative information and replacing it with a scalar:

w_{t+1} = w_t + \eta_t x_t (y_t - x_t^T w_t),

where \eta_t is a decreasing sequence.



Online Least Squares

The algorithm

w_{t+1} = w_t + \eta_t x_t (y_t - x_t^T w_t)

- is recursive;
- does not require storing the matrix C_t^{-1};
- does not require updating the inverse, only vector/vector multiplication.

However, we are not guaranteed convergence in 1 step. How many steps? How to choose \eta_t?
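A minimal sketch of this scalar-stepsize recursion on a synthetic stream (assuming NumPy; the schedule \eta_t = 0.1/\sqrt{t} is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
w_true = rng.standard_normal(d)
w = np.zeros(d)

for t in range(1, 10001):
    x = rng.standard_normal(d)
    y = x @ w_true + 0.1 * rng.standard_normal()
    eta_t = 0.1 / np.sqrt(t)                  # decreasing stepsize
    w = w + eta_t * x * (y - x @ w)           # w_{t+1} = w_t + eta_t x_t (y_t - x_t^T w_t)

print(np.round(w - w_true, 2))                # drifts toward w_true, with no matrix storage
```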



First, recognize that

-\nabla_w (y_t - x_t^T w)^2 = 2 x_t (y_t - x_t^T w).

Hence, the proposed method is gradient descent. Let us study it abstractly and then come back to least squares.



Lemma: Let f be convex and G-Lipschitz. Let w^* \in \operatorname{argmin}_w f(w) with \|w^*\| \le B. Then gradient descent

w_{t+1} = w_t - \eta \nabla f(w_t)

with \eta = \frac{B}{G \sqrt{T}} and w_1 = 0 yields a sequence of iterates whose average \bar{w}_T = \frac{1}{T} \sum_{t=1}^T w_t over the trajectory satisfies

f(\bar{w}_T) - f(w^*) \le \frac{BG}{\sqrt{T}}.

Proof:

\|w_{t+1} - w^*\|^2 = \|w_t - \eta \nabla f(w_t) - w^*\|^2
                  = \|w_t - w^*\|^2 + \eta^2 \|\nabla f(w_t)\|^2 - 2\eta \nabla f(w_t)^T (w_t - w^*)

Rearrange:

2\eta \nabla f(w_t)^T (w_t - w^*) = \|w_t - w^*\|^2 - \|w_{t+1} - w^*\|^2 + \eta^2 \|\nabla f(w_t)\|^2.

Note: Lipschitzness of f is equivalent to \|\nabla f(w)\| \le G.


Summing over t = 1, ..., T, telescoping, dropping the negative term, using w_1 = 0, and dividing both sides by 2\eta,

\sum_{t=1}^T \nabla f(w_t)^T (w_t - w^*) \le \frac{1}{2\eta} \|w^*\|^2 + \frac{\eta}{2} T G^2 \le BG\sqrt{T}.

Convexity of f means

f(w_t) - f(w^*) \le \nabla f(w_t)^T (w_t - w^*)

and so

\frac{1}{T} \sum_{t=1}^T f(w_t) - f(w^*) \le \frac{1}{T} \sum_{t=1}^T \nabla f(w_t)^T (w_t - w^*) \le \frac{BG}{\sqrt{T}}

The lemma follows by convexity of f and Jensen's inequality. (end of proof)
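A quick numerical check of the lemma (a sketch on f(w) = |w - 2|, which is convex and 1-Lipschitz, so G = 1 and we may take B = 2; T is an arbitrary choice):

```python
import numpy as np

G, B, T = 1.0, 2.0, 10000
eta = B / (G * np.sqrt(T))

w, iterates = 0.0, []
for _ in range(T):
    iterates.append(w)
    w = w - eta * np.sign(w - 2.0)       # (sub)gradient step on f(w) = |w - 2|
w_bar = np.mean(iterates)

# f(w_bar) - f(w*) = |w_bar - 2|; the lemma promises at most BG / sqrt(T).
print(abs(w_bar - 2.0), B * G / np.sqrt(T))
```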



Gradient descent can be written as

w_{t+1} = \operatorname{argmin}_w \; \eta \{ f(w_t) + \nabla f(w_t)^T (w - w_t) \} + \frac{1}{2} \|w - w_t\|^2

which can be interpreted as minimizing a linear approximation of f while staying close to the previous solution.

Alternatively, one can interpret it as building a second-order model locally (since we cannot fully trust the local information, unlike our first parabola example).



Remarks:

- Gradient descent for non-smooth functions does not guarantee actual descent of the iterates w_t (only of their average).
- For constrained optimization problems over a set K, do the projected gradient step

  w_{t+1} = \mathrm{Proj}_K(w_t - \eta \nabla f(w_t))

  The proof is essentially the same; a minimal sketch follows after this list.
- Can take the stepsize \eta_t = \frac{B}{G \sqrt{t}} to make the method horizon-independent.
- Knowledge of G and B is not necessary (with appropriate changes).
- Faster convergence holds under additional assumptions on f (smoothness, strong convexity).
- Last class: for smooth functions (the gradient is L-Lipschitz), a constant step size of 1/L gives faster O(1/T) convergence.
- Gradients can be replaced with stochastic gradients (unbiased estimates).
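A minimal sketch of the projected gradient step, using the Euclidean ball {w : \|w\| \le r} as K since its projection has a closed form (f, eta, and r are illustrative choices):

```python
import numpy as np

def proj_l2_ball(w, r):
    """Project w onto the Euclidean ball {w : ||w|| <= r}."""
    norm = np.linalg.norm(w)
    return w if norm <= r else w * (r / norm)

rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)

def grad(w):
    return 2 * A.T @ (A @ w - b)            # gradient of f(w) = ||Aw - b||^2

w, eta, r = np.zeros(5), 0.01, 0.5
for _ in range(2000):
    w = proj_l2_ball(w - eta * grad(w), r)  # projected gradient step
print(np.linalg.norm(w) <= r + 1e-12)       # the iterate stays feasible: True
```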



Stochastic Gradients

Suppose we only have access to an unbiased estimate \nabla_t of \nabla f(w_t) at step t; that is, E[\nabla_t | w_t] = \nabla f(w_t). Then Stochastic Gradient Descent (SGD)

w_{t+1} = w_t - \eta \nabla_t

enjoys the guarantee

E[f(\bar{w}_T)] - f(w^*) \le \frac{BG}{\sqrt{T}}

where G is such that E[\|\nabla_t\|^2] \le G^2 for all t.

Kind of amazing: at each step we go in a direction that is wrong (but correct on average) and still converge.



Stochastic Gradients

Setting #1:

The empirical loss can be written as

f(w) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, w^T x_i) = E_{I \sim \mathrm{unif}[1:n]} \ell(y_I, w^T x_I)

Then \nabla_t = \nabla \ell(y_I, w_t^T x_I) is an unbiased gradient:

E[\nabla_t | w_t] = E[\nabla \ell(y_I, w_t^T x_I) | w_t] = \nabla E[\ell(y_I, w_t^T x_I) | w_t] = \nabla f(w_t)

Conclusion: if we pick an index I uniformly at random from the dataset and make a gradient step along \nabla \ell(y_I, w_t^T x_I), then we are performing SGD on the empirical loss objective.
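A minimal sketch of Setting #1 with the square loss (assuming NumPy; the data sizes and stepsize are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
for t in range(1, 50001):
    i = rng.integers(n)                        # I ~ unif over the dataset
    grad_i = -2 * X[i] * (Y[i] - X[i] @ w)     # gradient of (y_I - w^T x_I)^2
    w = w - (0.01 / np.sqrt(t)) * grad_i       # SGD step

w_erm = np.linalg.lstsq(X, Y, rcond=None)[0]   # empirical minimizer
print(np.round(w - w_erm, 2))                  # near zero
```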



Stochastic Gradients

Setting #2:

The expected loss can be written as

f(w) = E \ell(Y, w^T X)

where (X, Y) is drawn i.i.d. from the population P_{X \times Y}.

Then \nabla_t = \nabla \ell(Y, w_t^T X) is an unbiased gradient:

E[\nabla_t | w_t] = E[\nabla \ell(Y, w_t^T X) | w_t] = \nabla E[\ell(Y, w_t^T X) | w_t] = \nabla f(w_t)

Conclusion: if we pick a fresh example (X, Y) from the distribution P_{X \times Y} and make a gradient step along \nabla \ell(Y, w_t^T X), then we are performing SGD on the expected loss objective. This is equivalent to going through a dataset once.



Stochastic Gradients

Say we are in Setting #2 and we go through the dataset once. The guarantee is

E[f(\bar{w}_T)] - f(w^*) \le \frac{BG}{\sqrt{T}}

after T iterations. So, the time complexity to find an \epsilon-minimizer of the expected objective E \ell(Y, w^T X) is independent of the dataset size n! Suitable for large-scale problems.



Stochastic Gradients

In practice, we cycle through the dataset several times (which is somewhere between Setting #1 and Setting #2).



Appendix

A function f: R^d \to R is convex if

f(\alpha u + (1 - \alpha) v) \le \alpha f(u) + (1 - \alpha) f(v)

for any \alpha \in [0, 1] and u, v \in R^d (or restricted to a convex set). For a differentiable function, convexity is equivalent to monotonicity of the gradient:

\langle \nabla f(u) - \nabla f(v), u - v \rangle \ge 0,    (1)

where

\nabla f(u) = \left( \frac{\partial f(u)}{\partial u_1}, ..., \frac{\partial f(u)}{\partial u_d} \right).



Appendix

For a convex differentiable function it holds that

f(u) \ge f(v) + \langle \nabla f(v), u - v \rangle.    (2)

The subdifferential set at a given v is defined precisely as the set of all vectors \nabla such that

f(u) \ge f(v) + \langle \nabla, u - v \rangle    (3)

for all u. The subdifferential set is denoted by \partial f(v). A subdifferential will often substitute for the gradient, even if we don't specify it.



Appendix

If f(v) = \max_i f_i(v) for convex differentiable f_i, then, for a given v, whenever i \in \operatorname{argmax}_i f_i(v), it holds that

\nabla f_i(v) \in \partial f(v).

(Prove it!) We conclude that a subgradient of the hinge loss \max\{0, 1 - y_t \langle w, x_t \rangle\} with respect to w is

-y_t x_t \cdot \mathbf{1}\{ y_t \langle w, x_t \rangle < 1 \}.    (4)
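A direct transcription of (4) (a minimal sketch assuming NumPy):

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """A subgradient of max{0, 1 - y * <w, x>} with respect to w, as in (4)."""
    return -y * x if y * (w @ x) < 1 else np.zeros_like(w)
```

Note that at the kink y \langle w, x \rangle = 1 the subdifferential is a whole set, and the indicator in (4) simply picks the valid choice 0 there.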



Appendix
A function f is L-Lipschitz over a set S with respect to a norm \|\cdot\| if

|f(u) - f(v)| \le L \|u - v\|

for all u, v \in S. A function f is \beta-smooth if its gradient map is Lipschitz:

\|\nabla f(v) - \nabla f(u)\| \le \beta \|u - v\|,

which implies

f(u) \le f(v) + \langle \nabla f(v), u - v \rangle + \frac{\beta}{2} \|u - v\|^2.

(Prove that the other implication also holds.) The dual notion to smoothness is that of strong convexity. A function f is \sigma-strongly convex if

f(\alpha u + (1 - \alpha) v) \le \alpha f(u) + (1 - \alpha) f(v) - \frac{\sigma}{2} \alpha (1 - \alpha) \|u - v\|^2,

which means

f(u) \ge f(v) + \langle u - v, \nabla f(v) \rangle + \frac{\sigma}{2} \|u - v\|^2.
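For the least-squares objective f(w) = \|Y - Xw\|^2, the gradient map is w \mapsto -2X^T(Y - Xw), so one can take \beta = 2\lambda_{\max}(X^T X) and \sigma = 2\lambda_{\min}(X^T X). A quick numerical check of the gradient-Lipschitz property (a sketch assuming NumPy; the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
beta = 2 * np.linalg.eigvalsh(X.T @ X).max()       # smoothness constant

u, v = rng.standard_normal(5), rng.standard_normal(5)
# grad f(v) - grad f(u) = 2 X^T X (v - u); the Y term cancels out.
lhs = np.linalg.norm(2 * X.T @ X @ (v - u))
print(lhs <= beta * np.linalg.norm(v - u) + 1e-9)  # True
```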
