Stochastic Gradient Descent

This document discusses stochastic gradient descent (SGD), an optimization algorithm for minimizing an objective function. SGD approximates the gradient of the objective function using a random subset of training examples rather than the entire training set. This makes each iteration faster but introduces noise. The document outlines the benefits of SGD such as faster convergence and lower memory requirements. It also discusses important considerations like learning rate scheduling and monitoring validation error in addition to training error. Mini-batch SGD is introduced as a variant that reduces variance compared to single example SGD. Recommendations are provided for effectively implementing SGD.


Stochastic Gradient Descent

CS 584: Big Data Analytics


Gradient Descent Recap
• Simplest and extremely popular

• Main idea: take a step proportional to the negative of the gradient

• Easy to implement

• Each iteration is relatively cheap

• Can be slow to converge

CS 584 [Spring 2016] - Ho


Example: Linear Regression
• Optimization problem:

      min_w ||Xw − y||²

• Closed-form solution:

      w* = (XᵀX)⁻¹ Xᵀy

• Gradient update:

      w⁺ = w − (γ/m) Σᵢ (xᵢᵀw − yᵢ) xᵢ

  Requires an entire pass through the data!
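The batch update above can be sketched in Python. This is a minimal illustration on synthetic, noiseless data; the step size γ = 0.1, the iteration count, and the data itself are arbitrary choices, not part of the slides:

```python
import numpy as np

# Batch gradient descent for least squares min_w ||Xw - y||^2,
# compared against the closed-form solution w* = (X'X)^{-1} X'y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                           # noiseless synthetic targets

w = np.zeros(3)
gamma = 0.1                              # learning rate (illustrative)
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(y)    # one full pass over the data
    w -= gamma * grad

w_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w, w_closed, atol=1e-4))  # both reach the same w
```

Note that every iteration touches all m rows of X, which is exactly the cost that stochastic methods below avoid.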
Tackling Compute Problems: Scaling to Large n
• Streaming implementation

• Parallelize your batch algorithm

• Aggressively subsample the data

• Change algorithm or training method

• Optimization is a surrogate for learning

• Trade-off weaker optimization with more data



Tradeoffs of Large Scale Learning
• True (generalization) error is a function of approximation
error, estimation error, and optimization error subject to
number of training examples and computational time

• Solution will depend on which budget constraint is active

Bottou and Bousquet (2011). The Tradeoffs of Large-Scale Learning.
In Optimization for Machine Learning (pp. 351–368).
Minimizing Generalization Error

If n → ∞, then ε_est → 0

For a fixed generalization error, as the number of samples increases,
we can increase the optimization tolerance.
Slide credit: talk by Aditya Menon, UCSD
Expected Risk vs Empirical Risk Minimization
Expected Risk

• Assume we know the ground-truth distribution P(x, y)

• Expected risk associated with the classification function:

      E(f_w) = ∫ L(f_w(x), y) dP(x, y) = E[L(f_w(x), y)]

Empirical Risk

• In the real world, the ground-truth distribution is not known

• Only the empirical risk can be calculated for the function:

      E_n(f_w) = (1/n) Σᵢ L(f_w(xᵢ), yᵢ)

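The relationship between the two risks can be illustrated numerically: as n grows, the empirical risk of a fixed predictor concentrates around its expected risk. The distribution P(x, y) and the predictor below are invented purely for illustration:

```python
import numpy as np

# Empirical risk E_n(f) approaches the expected risk E(f) as n grows.
# Illustrative setup: x ~ N(0, 1), y = 2x + noise, and f(x) = 1.5x is
# a fixed (deliberately suboptimal) predictor under squared loss.
rng = np.random.default_rng(0)

def empirical_risk(n):
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)
    return np.mean((1.5 * x - y) ** 2)

# Expected risk of f: the residual is -0.5x - noise, so
# E[(0.5x + noise)^2] = 0.25 + 1 = 1.25.
err_small = abs(empirical_risk(10) - 1.25)       # noisy at small n
err_big = abs(empirical_risk(100_000) - 1.25)    # concentrates at large n
print(err_small, err_big)
```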


Gradient Descent Reformulated
      w⁺ = w − γ · (1/n) Σᵢ ∇_w L(f_w(xᵢ), yᵢ) = w − γ ∇E_n(f_w)

  where γ is the learning rate (or gain)

• True gradient descent is a batch algorithm: slow but sure

• Under sufficient regularity assumptions, if the initial estimate is
  close to the optimum and the gain is sufficiently small, there is
  linear convergence



Stochastic Optimization Motivation
• Information is redundant amongst samples

• Sufficient samples means we can afford more frequent, noisy updates

• A never-ending stream means we should not wait for all the data

• Tracking non-stationary data means that the target is moving



Stochastic Optimization
• Idea: estimate the function and gradient from a small, current
  subsample of your data; with enough iterations and data, you will
  converge to the true minimum

• Pro: better for large datasets and often faster convergence

• Con: hard to reach high accuracy

• Con: the best classical methods can't handle stochastic approximation

• Con: theoretical definitions for convergence are not as well-defined
Stochastic Gradient Descent (SGD)
• Randomized gradient estimate to minimize the function using a
  single randomly picked example

  Instead of ∇f, use ∇̃f, where E[∇̃f] = ∇f

• The resulting update is of the form:

      w⁺ = w − γ ∇_w L(f_w(xᵢ), yᵢ)

• Although random noise is introduced, it behaves like gradient
  descent in expectation



SGD Algorithm
Randomly initialize the parameter w and learning rate γ
while not converged do
    Randomly shuffle the examples in the training set
    for i = 1, …, N do
        w ← w − γ ∇_w L(f_w(xᵢ), yᵢ)
    end
end

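The pseudocode above transcribes directly into Python. The synthetic noiseless least-squares data, the fixed step size, and the epoch count are illustrative choices, not part of the slides:

```python
import numpy as np

# SGD following the pseudocode: shuffle every epoch, then update on
# one example at a time in the shuffled order.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                            # noiseless synthetic targets

w = np.zeros(3)
gamma = 0.05                              # fixed learning rate (illustrative)
for epoch in range(20):
    for i in rng.permutation(len(y)):     # randomly shuffle the training set
        grad_i = (X[i] @ w - y[i]) * X[i]  # gradient on a single example
        w -= gamma * grad_i

print(np.allclose(w, w_true, atol=1e-2))
```

Because the data here are noiseless, the iterates settle near the true weights even with a fixed step; with noisy data a decreasing schedule (discussed below) is needed to drive the residual oscillation down.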


The Benefits of SGD
• Gradient is easy to calculate ("instantaneous")

• Less prone to local minima

• Small memory footprint

• Gets to a reasonable solution quickly

• Works for non-stationary environments as well as online settings

• Can be used for more complex models and error surfaces


Importance of Learning Rate
• Learning rate has a large impact on convergence

• Too small → too slow

• Too large → oscillatory, and may even diverge

• Should the learning rate be fixed or adaptive?

• Is convergence necessary?

  • Non-stationary: convergence may not be required or desired

  • Stationary: learning rate should decrease with time

• A Robbins–Monro sequence, e.g. γ_t = 1/t, is adequate
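The γ_t = 1/t schedule works because it satisfies the classical stochastic-approximation conditions Σ_t γ_t = ∞ (so the iterates can travel arbitrarily far) while Σ_t γ_t² < ∞ (so the injected noise stays summable). A quick numerical check illustrates both:

```python
# Sanity check of the Robbins-Monro conditions for gamma_t = 1/t:
# partial sums of gamma_t keep growing (like ln T), while partial
# sums of gamma_t^2 stay bounded (approaching pi^2/6).
T = 100_000
gammas = [1.0 / t for t in range(1, T + 1)]

sum_gamma = sum(gammas)                   # ~ ln(T) + 0.577 ~ 12.09
sum_gamma_sq = sum(g * g for g in gammas)  # ~ pi^2/6 ~ 1.645
print(sum_gamma, sum_gamma_sq)
```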
Mini-batch Stochastic Gradient Descent
• Rather than using a single point, use a random subset whose size
  is less than the original data size:

      w⁺ = w − γ · (1/|S_k|) Σ_{i∈S_k} ∇_w L(f_w(xᵢ), yᵢ),  where S_k ⊆ [n]

• Like the single random sample, the full gradient is approximated
  via an unbiased noisy estimate

• The random subset reduces the variance by a factor of 1/|S_k|,
  but each update is also |S_k| times more expensive

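A mini-batch version of the earlier least-squares sketch; the batch size |S_k| = 32 and the other constants are illustrative:

```python
import numpy as np

# Mini-batch SGD: average the per-example gradients over a random
# subset S_k instead of using a single point.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                           # noiseless synthetic targets

w = np.zeros(3)
gamma, batch = 0.1, 32
for _ in range(300):
    S = rng.choice(len(y), size=batch, replace=False)  # random S_k
    grad = X[S].T @ (X[S] @ w - y[S]) / batch          # mean over |S_k|
    w -= gamma * grad

print(np.allclose(w, w_true, atol=1e-3))
```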


Example: Regularized Logistic Regression
• Optimization problem:

      min_β (1/n) Σᵢ ( −yᵢ xᵢᵀβ + log(1 + e^{xᵢᵀβ}) ) + (λ/2) ||β||₂²

• Gradient computation:

      ∇f(β) = −(1/n) Σᵢ (yᵢ − pᵢ(β)) xᵢ + λβ,  where pᵢ(β) = 1 / (1 + e^{−xᵢᵀβ})

• Update costs:

  • Batch: O(nd) — doable if n is moderate, but not when n is huge

  • Stochastic: O(d)

  • Mini-batch: O(|S_k| d)

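The cost comparison can be sketched numerically (the data, λ, and 0/1 labels below are invented for illustration). The sketch also checks that averaging the single-example gradients recovers the batch gradient, which is why the stochastic estimate is unbiased:

```python
import numpy as np

# Batch vs single-example gradients for regularized logistic regression.
rng = np.random.default_rng(0)
n, d, lam = 1000, 20, 0.1                 # illustrative sizes and lambda
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)   # 0/1 labels
beta = rng.normal(size=d) * 0.01

p = 1.0 / (1.0 + np.exp(-X @ beta))        # p_i(beta) for all i
grad_batch = X.T @ (p - y) / n + lam * beta  # O(nd): touches every row

i = int(rng.integers(n))                   # one random example
p_i = 1.0 / (1.0 + np.exp(-X[i] @ beta))
grad_one = (p_i - y[i]) * X[i] + lam * beta  # O(d): touches one row

# Averaging the per-example gradients over all i recovers the batch
# gradient, so the single-example gradient is an unbiased estimate.
grad_avg = np.mean((p - y)[:, None] * X, axis=0) + lam * beta
print(np.allclose(grad_avg, grad_batch))
```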


Example: n=10,000, d=20

Iterations make better progress as the mini-batch size grows, but each
iteration also takes more computation time.
http://stat.cmu.edu/~ryantibs/convexopt/lectures/25-fast-stochastic.pdf
SGD Updates for Various Systems

Bottou, L. (2012). Stochastic Gradient Descent Tricks.
In Neural Networks: Tricks of the Trade.
Asymptotic Analysis of GD and SGD

Bottou, L. (2012). Stochastic Gradient Descent Tricks.
In Neural Networks: Tricks of the Trade.
SGD Recommendations
• Randomly shuffle the training examples

  • Although theory says you should randomly pick examples, it is
    easier to make a sequential pass through your training set

  • Shuffling before each pass eliminates the effect of order

• Monitor both the training cost and the validation error

  • Set aside samples for a decent validation set

  • Compute the objective on both the training set and validation set
    (expensive, but better than overfitting or wasting computation)

Bottou, L. (2012). Stochastic Gradient Descent Tricks.
In Neural Networks: Tricks of the Trade.
SGD Recommendations (2)
• Check the gradient using finite differences

  • A slightly incorrect gradient computation can yield an erratic and
    slow algorithm

  • Verify your code by slightly perturbing the parameter and
    inspecting the difference between the two gradients

• Experiment with learning rates using a small sample of the training set

  • SGD convergence rates are independent of the sample size

  • Use traditional optimization algorithms as a reference point
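The finite-difference check can be sketched as follows; the squared-error loss here is just a stand-in for whatever loss you are actually training with:

```python
import numpy as np

# Gradient check by central finite differences: perturb each parameter
# coordinate by +/- eps and compare against the analytic gradient.
def loss(w, x, y):
    return 0.5 * (x @ w - y) ** 2

def grad(w, x, y):          # the analytic gradient being verified
    return (x @ w - y) * x

rng = np.random.default_rng(0)
w = rng.normal(size=4)
x = rng.normal(size=4)
y = 1.0

eps = 1e-6
g_fd = np.array([
    (loss(w + eps * e, x, y) - loss(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(4)      # e is the i-th unit vector
])
print(np.allclose(g_fd, grad(w, x, y), atol=1e-5))
```

If the two gradients disagree beyond finite-difference precision, the analytic gradient code has a bug.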


SGD Recommendations (3)
• Leverage the sparsity of the training examples

  • For very high-dimensional vectors with few nonzero coefficients,
    you only need to update the weight coefficients corresponding to
    the nonzero pattern in x

• Use learning rates of the form γ_t = γ₀ / (1 + γ₀ λ t)

  • Allows you to start from a reasonable learning rate γ₀ determined
    by testing on a small sample

  • Works well in most situations if γ₀ is slightly smaller than the
    best value observed on the training sample
Some Resources for SGD
• Francis Bach's talk in 2012:
  http://www.ann.jussieu.fr/~plc/bach2012.pdf

• Stochastic Gradient Methods Workshop:
  http://yaroslavvb.blogspot.com/2014/03/stochastic-gradient-methods-2014.html

• Python implementation in scikit-learn:
  http://scikit-learn.org/stable/modules/sgd.html

• iPython notebook for implementing GD and SGD in Python:
  https://github.com/dtnewman/gradient_descent/blob/master/stochastic_gradient_descent.ipynb
