
MIT 9.520/6.860, Fall 2018


Statistical Learning Theory and Applications

Class 06: Learning with Stochastic Gradients

Sasha Rakhlin



Why Optimization?

Much (but not all) of Machine Learning: write down objective function
involving data and parameters, find good (or optimal) parameters
through optimization.

Key idea: find a near-optimal solution by iteratively using only local information about the objective (e.g. gradient, Hessian).



Motivating example: Newton’s Method

Newton’s method in 1d:

w_{t+1} = w_t - (f''(w_t))^{-1} f'(w_t)

Example (parabola): f(w) = aw^2 + bw + c.

Start with any w_1. Then Newton's method gives

w_2 = w_1 - (2a)^{-1} (2a w_1 + b),

which means w_2 = -b/(2a). It finds the minimum of f in 1 step, no matter where you start!
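A quick numerical check of this one-step property (a minimal sketch in Python; the coefficients a, b, c and the starting point are arbitrary choices):

```python
# Newton's method in 1d on f(w) = a*w^2 + b*w + c (a > 0).
a, b, c = 3.0, -4.0, 1.0

def f_prime(w):
    return 2 * a * w + b        # f'(w)

def f_double_prime(w):
    return 2 * a                # f''(w), constant for a parabola

w1 = 100.0                      # start anywhere
w2 = w1 - f_prime(w1) / f_double_prime(w1)

print(w2, -b / (2 * a))         # both 0.666...: the minimum in one step
```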



Newton’s Method in multiple dim:

w_{t+1} = w_t - [\nabla^2 f(w_t)]^{-1} \nabla f(w_t)

(here \nabla^2 f(w_t) is the Hessian, assumed invertible)



Recalling Least Squares

Least Squares objective (without 1/n normalization):

f(w) = \sum_{i=1}^n (y_i - x_i^T w)^2 = \|Y - Xw\|^2

Calculate: \nabla^2 f(w) = 2X^T X and \nabla f(w) = -2X^T (Y - Xw).

Taking w_1 = 0, Newton's method gives

w_2 = 0 + (2X^T X)^{-1} \cdot 2X^T (Y - X \cdot 0) = (X^T X)^{-1} X^T Y

which is the least-squares solution (global min). Again, 1 step is enough.

Verify: if f(w) = \|Y - Xw\|^2 + \lambda \|w\|^2, then (X^T X) becomes (X^T X + \lambda I).
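To see this concretely, here is a minimal sketch (assuming NumPy; the synthetic data shapes are arbitrary) checking that one Newton step from w_1 = 0 recovers the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
Y = rng.standard_normal(50)

w1 = np.zeros(5)
grad = -2 * X.T @ (Y - X @ w1)            # gradient of f at w_1 = 0
hess = 2 * X.T @ X                        # Hessian, constant in w
w2 = w1 - np.linalg.solve(hess, grad)     # one Newton step

w_ls = np.linalg.lstsq(X, Y, rcond=None)[0]
print(np.allclose(w2, w_ls))              # True
```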



What do we do if the data (x_1, y_1), ..., (x_n, y_n), ... are streaming? Can we incorporate data on the fly without having to recompute the inverse of (X^T X) at every step?

→ Online Learning



Let w_1 = 0. Let w_t be the least-squares solution after seeing t - 1 data points. Can we get w_t from w_{t-1} cheaply? Newton's method will do it in 1 step (since the objective is quadratic).

Let C_t = \sum_{i=1}^t x_i x_i^T (or + \lambda I), and X_t = [x_1, ..., x_t]^T, Y_t = [y_1, ..., y_t]^T. Newton's method gives

w_{t+1} = w_t + C_t^{-1} X_t^T (Y_t - X_t w_t)

This can be simplified to

w_{t+1} = w_t + C_t^{-1} x_t (y_t - x_t^T w_t)

since the residuals up to t - 1 are orthogonal to the columns of X_{t-1}.

The bottleneck is computing C_t^{-1}. Can we update it quickly from C_{t-1}^{-1}?



Sherman-Morrison formula: for an invertible square A and any u, v with 1 + v^T A^{-1} u \ne 0,

(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u}

Hence

C_t^{-1} = C_{t-1}^{-1} - \frac{C_{t-1}^{-1} x_t x_t^T C_{t-1}^{-1}}{1 + x_t^T C_{t-1}^{-1} x_t}

and (do the calculation)

C_t^{-1} x_t = C_{t-1}^{-1} x_t \cdot \frac{1}{1 + x_t^T C_{t-1}^{-1} x_t}

Computation required: a d × d matrix C_{t-1}^{-1} times a d × 1 vector, i.e. O(d^2) time to incorporate a new datapoint. Memory: O(d^2). Unlike re-solving the full regression from scratch, this does not depend on the amount of data t.
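A minimal sketch of the resulting recursive least-squares update (assuming NumPy; the ridge term lam and the synthetic stream are illustrative choices):

```python
import numpy as np

d, lam = 5, 1.0
w = np.zeros(d)
C_inv = np.eye(d) / lam                        # C_0^{-1}, with C_0 = lam * I

def rls_step(w, C_inv, x, y):
    """Incorporate one (x, y) pair in O(d^2) time."""
    Cx = C_inv @ x                             # C_{t-1}^{-1} x_t
    denom = 1.0 + x @ Cx                       # 1 + x_t^T C_{t-1}^{-1} x_t
    C_inv = C_inv - np.outer(Cx, Cx) / denom   # Sherman-Morrison rank-one update
    w = w + (Cx / denom) * (y - x @ w)         # uses C_t^{-1} x_t = Cx / denom
    return w, C_inv

rng = np.random.default_rng(0)
w_true = rng.standard_normal(d)
for _ in range(1000):
    x = rng.standard_normal(d)
    y = x @ w_true + 0.1 * rng.standard_normal()
    w, C_inv = rls_step(w, C_inv, x, y)
print(np.round(w - w_true, 2))                 # near zero
```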



Recursive Least Squares (cont.)

Recap: recursive least squares is

w_{t+1} = w_t + C_t^{-1} x_t (y_t - x_t^T w_t)

with a rank-one update of C_{t-1}^{-1} to get C_t^{-1}.

Consider throwing away the second-derivative information and replacing it with a scalar:

w_{t+1} = w_t + \eta_t x_t (y_t - x_t^T w_t),

where \eta_t is a decreasing sequence.



Online Least Squares

The algorithm

w_{t+1} = w_t + \eta_t x_t (y_t - x_t^T w_t)

- is recursive;
- does not require storing the matrix C_t^{-1};
- does not require updating the inverse, only vector/vector multiplication.

However, we are not guaranteed convergence in 1 step. How many steps? How to choose \eta_t?
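A minimal sketch of this scalar-stepsize recursion on a synthetic stream (assuming NumPy; the schedule \eta_t = 0.1/\sqrt{t} is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
w_true = rng.standard_normal(d)
w = np.zeros(d)

for t in range(1, 10001):
    x = rng.standard_normal(d)
    y = x @ w_true + 0.1 * rng.standard_normal()
    eta_t = 0.1 / np.sqrt(t)                  # decreasing stepsize
    w = w + eta_t * x * (y - x @ w)           # w_{t+1} = w_t + eta_t x_t (y_t - x_t^T w_t)

print(np.round(w - w_true, 2))                # drifts toward w_true, with no matrix storage
```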



First, recognize that

-\nabla_w (y_t - x_t^T w)^2 = 2 x_t (y_t - x_t^T w).

Hence, the proposed method is gradient descent. Let us study it abstractly and then come back to least squares.



Lemma: Let f be convex and G-Lipschitz. Let w^* \in \operatorname{argmin}_w f(w) with \|w^*\| \le B. Then gradient descent

w_{t+1} = w_t - \eta \nabla f(w_t)

with \eta = \frac{B}{G \sqrt{T}} and w_1 = 0 yields a sequence of iterates whose average \bar{w}_T = \frac{1}{T} \sum_{t=1}^T w_t over the trajectory satisfies

f(\bar{w}_T) - f(w^*) \le \frac{BG}{\sqrt{T}}.

Proof:

\|w_{t+1} - w^*\|^2 = \|w_t - \eta \nabla f(w_t) - w^*\|^2
                  = \|w_t - w^*\|^2 + \eta^2 \|\nabla f(w_t)\|^2 - 2\eta \nabla f(w_t)^T (w_t - w^*)

Rearrange:

2\eta \nabla f(w_t)^T (w_t - w^*) = \|w_t - w^*\|^2 - \|w_{t+1} - w^*\|^2 + \eta^2 \|\nabla f(w_t)\|^2.

Note: Lipschitzness of f is equivalent to \|\nabla f(w)\| \le G.


Summing over t = 1, ..., T, telescoping, dropping the negative term, using w_1 = 0, and dividing both sides by 2\eta,

\sum_{t=1}^T \nabla f(w_t)^T (w_t - w^*) \le \frac{1}{2\eta} \|w^*\|^2 + \frac{\eta}{2} T G^2 \le BG\sqrt{T}.

Convexity of f means

f(w_t) - f(w^*) \le \nabla f(w_t)^T (w_t - w^*)

and so

\frac{1}{T} \sum_{t=1}^T f(w_t) - f(w^*) \le \frac{1}{T} \sum_{t=1}^T \nabla f(w_t)^T (w_t - w^*) \le \frac{BG}{\sqrt{T}}

The lemma follows by convexity of f and Jensen's inequality. (end of proof)
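A quick numerical check of the lemma (a sketch on f(w) = |w - 2|, which is convex and 1-Lipschitz, so G = 1 and we may take B = 2; T is an arbitrary choice):

```python
import numpy as np

G, B, T = 1.0, 2.0, 10000
eta = B / (G * np.sqrt(T))

w, iterates = 0.0, []
for _ in range(T):
    iterates.append(w)
    w = w - eta * np.sign(w - 2.0)       # (sub)gradient step on f(w) = |w - 2|
w_bar = np.mean(iterates)

# f(w_bar) - f(w*) = |w_bar - 2|; the lemma promises at most BG / sqrt(T).
print(abs(w_bar - 2.0), B * G / np.sqrt(T))
```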



Gradient descent can be written as

w_{t+1} = \operatorname{argmin}_w \; \eta \{ f(w_t) + \nabla f(w_t)^T (w - w_t) \} + \frac{1}{2} \|w - w_t\|^2

which can be interpreted as minimizing a linear approximation of f while staying close to the previous solution.

Alternatively, one can interpret it as building a second-order model locally (since we cannot fully trust the local information, unlike our first parabola example).



Remarks:

- Gradient descent for non-smooth functions does not guarantee actual descent of the iterates w_t (only of their average).
- For constrained optimization problems over a set K, do the projected gradient step

  w_{t+1} = \mathrm{Proj}_K(w_t - \eta \nabla f(w_t))

  The proof is essentially the same; a minimal sketch follows after this list.
- Can take the stepsize \eta_t = \frac{B}{G \sqrt{t}} to make the method horizon-independent.
- Knowledge of G and B is not necessary (with appropriate changes).
- Faster convergence holds under additional assumptions on f (smoothness, strong convexity).
- Last class: for smooth functions (the gradient is L-Lipschitz), a constant step size of 1/L gives faster O(1/T) convergence.
- Gradients can be replaced with stochastic gradients (unbiased estimates).
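A minimal sketch of the projected gradient step, using the Euclidean ball {w : \|w\| \le r} as K since its projection has a closed form (f, eta, and r are illustrative choices):

```python
import numpy as np

def proj_l2_ball(w, r):
    """Project w onto the Euclidean ball {w : ||w|| <= r}."""
    norm = np.linalg.norm(w)
    return w if norm <= r else w * (r / norm)

rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)

def grad(w):
    return 2 * A.T @ (A @ w - b)            # gradient of f(w) = ||Aw - b||^2

w, eta, r = np.zeros(5), 0.01, 0.5
for _ in range(2000):
    w = proj_l2_ball(w - eta * grad(w), r)  # projected gradient step
print(np.linalg.norm(w) <= r + 1e-12)       # the iterate stays feasible: True
```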



Stochastic Gradients

Suppose we only have access to an unbiased estimate \nabla_t of \nabla f(w_t) at step t; that is, E[\nabla_t | w_t] = \nabla f(w_t). Then Stochastic Gradient Descent (SGD)

w_{t+1} = w_t - \eta \nabla_t

enjoys the guarantee

E[f(\bar{w}_T)] - f(w^*) \le \frac{BG}{\sqrt{T}}

where G is such that E[\|\nabla_t\|^2] \le G^2 for all t.

Kind of amazing: at each step we go in a direction that is wrong (but correct on average) and still converge.



Stochastic Gradients

Setting #1:

The empirical loss can be written as

f(w) = \frac{1}{n} \sum_{i=1}^n \ell(y_i, w^T x_i) = E_{I \sim \mathrm{unif}[1:n]} \ell(y_I, w^T x_I)

Then \nabla_t = \nabla \ell(y_I, w_t^T x_I) is an unbiased gradient:

E[\nabla_t | w_t] = E[\nabla \ell(y_I, w_t^T x_I) | w_t] = \nabla E[\ell(y_I, w_t^T x_I) | w_t] = \nabla f(w_t)

Conclusion: if we pick an index I uniformly at random from the dataset and make a gradient step along \nabla \ell(y_I, w_t^T x_I), then we are performing SGD on the empirical loss objective.
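A minimal sketch of Setting #1 with the square loss (assuming NumPy; the data sizes and stepsize are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

w = np.zeros(d)
for t in range(1, 50001):
    i = rng.integers(n)                        # I ~ unif over the dataset
    grad_i = -2 * X[i] * (Y[i] - X[i] @ w)     # gradient of (y_I - w^T x_I)^2
    w = w - (0.01 / np.sqrt(t)) * grad_i       # SGD step

w_erm = np.linalg.lstsq(X, Y, rcond=None)[0]   # empirical minimizer
print(np.round(w - w_erm, 2))                  # near zero
```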



Stochastic Gradients

Setting #2:

The expected loss can be written as

f(w) = E \ell(Y, w^T X)

where (X, Y) is drawn i.i.d. from the population P_{X \times Y}.

Then \nabla_t = \nabla \ell(Y, w_t^T X) is an unbiased gradient:

E[\nabla_t | w_t] = E[\nabla \ell(Y, w_t^T X) | w_t] = \nabla E[\ell(Y, w_t^T X) | w_t] = \nabla f(w_t)

Conclusion: if we pick a fresh example (X, Y) from the distribution P_{X \times Y} and make a gradient step along \nabla \ell(Y, w_t^T X), then we are performing SGD on the expected loss objective. This is equivalent to going through a dataset once.



Stochastic Gradients

Say we are in Setting #2 and we go through the dataset once. The guarantee is

E[f(\bar{w}_T)] - f(w^*) \le \frac{BG}{\sqrt{T}}

after T iterations. So, the time complexity to find an \epsilon-minimizer of the expected objective E \ell(Y, w^T X) is independent of the dataset size n! Suitable for large-scale problems.



Stochastic Gradients

In practice, we cycle through the dataset several times (which is somewhere between Setting #1 and Setting #2).



Appendix

A function f: R^d \to R is convex if

f(\alpha u + (1 - \alpha) v) \le \alpha f(u) + (1 - \alpha) f(v)

for any \alpha \in [0, 1] and u, v \in R^d (or restricted to a convex set). For a differentiable function, convexity is equivalent to monotonicity of the gradient:

\langle \nabla f(u) - \nabla f(v), u - v \rangle \ge 0,    (1)

where

\nabla f(u) = \left( \frac{\partial f(u)}{\partial u_1}, ..., \frac{\partial f(u)}{\partial u_d} \right).



Appendix

For a convex differentiable function it holds that

f(u) \ge f(v) + \langle \nabla f(v), u - v \rangle.    (2)

The subdifferential set at a given v is defined precisely as the set of all vectors \nabla such that

f(u) \ge f(v) + \langle \nabla, u - v \rangle    (3)

for all u. The subdifferential set is denoted by \partial f(v). A subdifferential will often substitute for the gradient, even if we don't specify it.



Appendix

If f(v) = \max_i f_i(v) for convex differentiable f_i, then, for a given v, whenever i \in \operatorname{argmax}_i f_i(v), it holds that

\nabla f_i(v) \in \partial f(v).

(Prove it!) We conclude that a subgradient of the hinge loss \max\{0, 1 - y_t \langle w, x_t \rangle\} with respect to w is

-y_t x_t \cdot \mathbf{1}\{ y_t \langle w, x_t \rangle < 1 \}.    (4)
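A direct transcription of (4) (a minimal sketch assuming NumPy):

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """A subgradient of max{0, 1 - y * <w, x>} with respect to w, as in (4)."""
    return -y * x if y * (w @ x) < 1 else np.zeros_like(w)
```

Note that at the kink y \langle w, x \rangle = 1 the subdifferential is a whole set, and the indicator in (4) simply picks the valid choice 0 there.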



Appendix
A function f is L-Lipschitz over a set S with respect to a norm \|\cdot\| if

|f(u) - f(v)| \le L \|u - v\|

for all u, v \in S. A function f is \beta-smooth if its gradient map is Lipschitz:

\|\nabla f(v) - \nabla f(u)\| \le \beta \|u - v\|,

which implies

f(u) \le f(v) + \langle \nabla f(v), u - v \rangle + \frac{\beta}{2} \|u - v\|^2.

(Prove that the other implication also holds.) The dual notion to smoothness is that of strong convexity. A function f is \sigma-strongly convex if

f(\alpha u + (1 - \alpha) v) \le \alpha f(u) + (1 - \alpha) f(v) - \frac{\sigma}{2} \alpha (1 - \alpha) \|u - v\|^2,

which means

f(u) \ge f(v) + \langle u - v, \nabla f(v) \rangle + \frac{\sigma}{2} \|u - v\|^2.
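For the least-squares objective f(w) = \|Y - Xw\|^2, the gradient map is w \mapsto -2X^T(Y - Xw), so one can take \beta = 2\lambda_{\max}(X^T X) and \sigma = 2\lambda_{\min}(X^T X). A quick numerical check of the gradient-Lipschitz property (a sketch assuming NumPy; the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
beta = 2 * np.linalg.eigvalsh(X.T @ X).max()       # smoothness constant

u, v = rng.standard_normal(5), rng.standard_normal(5)
# grad f(v) - grad f(u) = 2 X^T X (v - u); the Y term cancels out.
lhs = np.linalg.norm(2 * X.T @ X @ (v - u))
print(lhs <= beta * np.linalg.norm(v - u) + 1e-9)  # True
```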
