Stochastic Gradient Descent

This document discusses stochastic gradient descent (SGD), an optimization algorithm for minimizing an objective function. SGD approximates the gradient of the objective function using a random subset of training examples rather than the entire training set. This makes each iteration faster but introduces noise. The document outlines the benefits of SGD such as faster convergence and lower memory requirements. It also discusses important considerations like learning rate scheduling and monitoring validation error in addition to training error. Mini-batch SGD is introduced as a variant that reduces variance compared to single example SGD. Recommendations are provided for effectively implementing SGD.


Stochastic Gradient Descent

CS 584: Big Data Analytics


Gradient Descent Recap
• Simplest and extremely popular

• Main idea: take a step proportional to the negative of the gradient

• Easy to implement

• Each iteration is relatively cheap

• Can be slow to converge

CS 584 [Spring 2016] - Ho


Example: Linear Regression
• Optimization problem:

      min_w ||Xw − y||²

• Closed-form solution:

      w* = (XᵀX)⁻¹ Xᵀy

• Gradient update:

      w⁺ = w − (γ/m) Σᵢ (xᵢᵀw − yᵢ) xᵢ

  Requires an entire pass through the data!
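The batch update above can be sketched in Python. This is a minimal illustration on synthetic, noiseless data; the step size γ = 0.1, the iteration count, and the data itself are arbitrary choices, not part of the slides:

```python
import numpy as np

# Batch gradient descent for least squares min_w ||Xw - y||^2,
# compared against the closed-form solution w* = (X'X)^{-1} X'y.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                           # noiseless synthetic targets

w = np.zeros(3)
gamma = 0.1                              # learning rate (illustrative)
for _ in range(500):
    grad = X.T @ (X @ w - y) / len(y)    # one full pass over the data
    w -= gamma * grad

w_closed = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w, w_closed, atol=1e-4))  # both reach the same w
```

Note that every iteration touches all m rows of X, which is exactly the cost that stochastic methods below avoid.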
Tackling Compute Problems: Scaling to Large n
• Streaming implementation

• Parallelize your batch algorithm

• Aggressively subsample the data

• Change algorithm or training method

• Optimization is a surrogate for learning

• Trade-off weaker optimization with more data



Tradeoffs of Large Scale Learning
• True (generalization) error is a function of approximation
error, estimation error, and optimization error subject to
number of training examples and computational time

• Solution will depend on which budget constraint is active

Bottou and Bousquet (2011). The Tradeoffs of Large-Scale Learning.
In Optimization for Machine Learning (pp. 351–368).
Minimizing Generalization Error

If n → ∞, then ε_est → 0

For a fixed generalization error, as the number of samples increases,
we can increase the optimization tolerance.
Slide credit: talk by Aditya Menon, UCSD
Expected Risk vs Empirical Risk Minimization
Expected Risk

• Assume we know the ground-truth distribution P(x, y)

• Expected risk associated with the classification function:

      E(f_w) = ∫ L(f_w(x), y) dP(x, y) = E[L(f_w(x), y)]

Empirical Risk

• In the real world, the ground-truth distribution is not known

• Only the empirical risk can be calculated for the function:

      E_n(f_w) = (1/n) Σᵢ L(f_w(xᵢ), yᵢ)

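The relationship between the two risks can be illustrated numerically: as n grows, the empirical risk of a fixed predictor concentrates around its expected risk. The distribution P(x, y) and the predictor below are invented purely for illustration:

```python
import numpy as np

# Empirical risk E_n(f) approaches the expected risk E(f) as n grows.
# Illustrative setup: x ~ N(0, 1), y = 2x + noise, and f(x) = 1.5x is
# a fixed (deliberately suboptimal) predictor under squared loss.
rng = np.random.default_rng(0)

def empirical_risk(n):
    x = rng.normal(size=n)
    y = 2.0 * x + rng.normal(size=n)
    return np.mean((1.5 * x - y) ** 2)

# Expected risk of f: the residual is -0.5x - noise, so
# E[(0.5x + noise)^2] = 0.25 + 1 = 1.25.
err_small = abs(empirical_risk(10) - 1.25)       # noisy at small n
err_big = abs(empirical_risk(100_000) - 1.25)    # concentrates at large n
print(err_small, err_big)
```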


Gradient Descent Reformulated
      w⁺ = w − γ · (1/n) Σᵢ ∇_w L(f_w(xᵢ), yᵢ) = w − γ ∇E_n(f_w)

  where γ is the learning rate (or gain)

• True gradient descent is a batch algorithm: slow but sure

• Under sufficient regularity assumptions, if the initial estimate is
  close to the optimum and the gain is sufficiently small, there is
  linear convergence



Stochastic Optimization Motivation
• Information is redundant amongst samples

• Sufficient samples means we can afford more frequent, noisy updates

• A never-ending stream means we should not wait for all the data

• Tracking non-stationary data means that the target is moving



Stochastic Optimization
• Idea: estimate the function and gradient from a small, current
  subsample of your data; with enough iterations and data, you will
  converge to the true minimum

• Pro: better for large datasets and often faster convergence

• Con: hard to reach high accuracy

• Con: the best classical methods can't handle stochastic approximation

• Con: theoretical definitions for convergence are not as well-defined
Stochastic Gradient Descent (SGD)
• Randomized gradient estimate to minimize the function using a
  single randomly picked example

  Instead of ∇f, use ∇̃f, where E[∇̃f] = ∇f

• The resulting update is of the form:

      w⁺ = w − γ ∇_w L(f_w(xᵢ), yᵢ)

• Although random noise is introduced, it behaves like gradient
  descent in expectation



SGD Algorithm
Randomly initialize the parameter w and learning rate γ
while not converged do
    Randomly shuffle the examples in the training set
    for i = 1, …, N do
        w ← w − γ ∇_w L(f_w(xᵢ), yᵢ)
    end
end

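The pseudocode above transcribes directly into Python. The synthetic noiseless least-squares data, the fixed step size, and the epoch count are illustrative choices, not part of the slides:

```python
import numpy as np

# SGD following the pseudocode: shuffle every epoch, then update on
# one example at a time in the shuffled order.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                            # noiseless synthetic targets

w = np.zeros(3)
gamma = 0.05                              # fixed learning rate (illustrative)
for epoch in range(20):
    for i in rng.permutation(len(y)):     # randomly shuffle the training set
        grad_i = (X[i] @ w - y[i]) * X[i]  # gradient on a single example
        w -= gamma * grad_i

print(np.allclose(w, w_true, atol=1e-2))
```

Because the data here are noiseless, the iterates settle near the true weights even with a fixed step; with noisy data a decreasing schedule (discussed below) is needed to drive the residual oscillation down.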


The Benefits of SGD
• Gradient is easy to calculate ("instantaneous")

• Less prone to local minima

• Small memory footprint

• Gets to a reasonable solution quickly

• Works for non-stationary environments as well as online settings

• Can be used for more complex models and error surfaces


Importance of Learning Rate
• Learning rate has a large impact on convergence

• Too small → too slow

• Too large → oscillatory, and may even diverge

• Should the learning rate be fixed or adaptive?

• Is convergence necessary?

  • Non-stationary: convergence may not be required or desired

  • Stationary: learning rate should decrease with time

• A Robbins–Monro sequence, e.g. γ_t = 1/t, is adequate
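The γ_t = 1/t schedule works because it satisfies the classical stochastic-approximation conditions Σ_t γ_t = ∞ (so the iterates can travel arbitrarily far) while Σ_t γ_t² < ∞ (so the injected noise stays summable). A quick numerical check illustrates both:

```python
# Sanity check of the Robbins-Monro conditions for gamma_t = 1/t:
# partial sums of gamma_t keep growing (like ln T), while partial
# sums of gamma_t^2 stay bounded (approaching pi^2/6).
T = 100_000
gammas = [1.0 / t for t in range(1, T + 1)]

sum_gamma = sum(gammas)                   # ~ ln(T) + 0.577 ~ 12.09
sum_gamma_sq = sum(g * g for g in gammas)  # ~ pi^2/6 ~ 1.645
print(sum_gamma, sum_gamma_sq)
```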
Mini-batch Stochastic Gradient Descent
• Rather than using a single point, use a random subset whose size
  is less than the original data size:

      w⁺ = w − γ · (1/|S_k|) Σ_{i∈S_k} ∇_w L(f_w(xᵢ), yᵢ),  where S_k ⊆ [n]

• Like the single random sample, the full gradient is approximated
  via an unbiased noisy estimate

• The random subset reduces the variance by a factor of 1/|S_k|,
  but each update is also |S_k| times more expensive

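A mini-batch version of the earlier least-squares sketch; the batch size |S_k| = 32 and the other constants are illustrative:

```python
import numpy as np

# Mini-batch SGD: average the per-example gradients over a random
# subset S_k instead of using a single point.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                           # noiseless synthetic targets

w = np.zeros(3)
gamma, batch = 0.1, 32
for _ in range(300):
    S = rng.choice(len(y), size=batch, replace=False)  # random S_k
    grad = X[S].T @ (X[S] @ w - y[S]) / batch          # mean over |S_k|
    w -= gamma * grad

print(np.allclose(w, w_true, atol=1e-3))
```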


Example: Regularized Logistic Regression
• Optimization problem:

      min_β (1/n) Σᵢ ( −yᵢ xᵢᵀβ + log(1 + e^{xᵢᵀβ}) ) + (λ/2) ||β||₂²

• Gradient computation:

      ∇f(β) = −(1/n) Σᵢ (yᵢ − pᵢ(β)) xᵢ + λβ,  where pᵢ(β) = 1 / (1 + e^{−xᵢᵀβ})

• Update costs:

  • Batch: O(nd) — doable if n is moderate, but not when n is huge

  • Stochastic: O(d)

  • Mini-batch: O(|S_k| d)

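The cost comparison can be sketched numerically (the data, λ, and 0/1 labels below are invented for illustration). The sketch also checks that averaging the single-example gradients recovers the batch gradient, which is why the stochastic estimate is unbiased:

```python
import numpy as np

# Batch vs single-example gradients for regularized logistic regression.
rng = np.random.default_rng(0)
n, d, lam = 1000, 20, 0.1                 # illustrative sizes and lambda
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(float)   # 0/1 labels
beta = rng.normal(size=d) * 0.01

p = 1.0 / (1.0 + np.exp(-X @ beta))        # p_i(beta) for all i
grad_batch = X.T @ (p - y) / n + lam * beta  # O(nd): touches every row

i = int(rng.integers(n))                   # one random example
p_i = 1.0 / (1.0 + np.exp(-X[i] @ beta))
grad_one = (p_i - y[i]) * X[i] + lam * beta  # O(d): touches one row

# Averaging the per-example gradients over all i recovers the batch
# gradient, so the single-example gradient is an unbiased estimate.
grad_avg = np.mean((p - y)[:, None] * X, axis=0) + lam * beta
print(np.allclose(grad_avg, grad_batch))
```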


Example: n=10,000, d=20

Iterations make better progress as the mini-batch size grows, but each
iteration also takes more computation time.
http://stat.cmu.edu/~ryantibs/convexopt/lectures/25-fast-stochastic.pdf
SGD Updates for Various Systems

Bottou, L. (2012). Stochastic Gradient Descent Tricks.
In Neural Networks: Tricks of the Trade.
Asymptotic Analysis of GD and SGD

Bottou, L. (2012). Stochastic Gradient Descent Tricks.
In Neural Networks: Tricks of the Trade.
SGD Recommendations
• Randomly shuffle the training examples

  • Although theory says you should randomly pick examples, it is
    easier to make a sequential pass through your training set

  • Shuffling before each pass eliminates the effect of order

• Monitor both the training cost and the validation error

  • Set aside samples for a decent validation set

  • Compute the objective on both the training set and validation set
    (expensive, but better than overfitting or wasting computation)

Bottou, L. (2012). Stochastic Gradient Descent Tricks.
In Neural Networks: Tricks of the Trade.
SGD Recommendations (2)
• Check the gradient using finite differences

  • A slightly incorrect gradient computation can yield an erratic and
    slow algorithm

  • Verify your code by slightly perturbing the parameter and
    inspecting the difference between the two gradients

• Experiment with learning rates using a small sample of the training set

  • SGD convergence rates are independent of the sample size

  • Use traditional optimization algorithms as a reference point
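The finite-difference check can be sketched as follows; the squared-error loss here is just a stand-in for whatever loss you are actually training with:

```python
import numpy as np

# Gradient check by central finite differences: perturb each parameter
# coordinate by +/- eps and compare against the analytic gradient.
def loss(w, x, y):
    return 0.5 * (x @ w - y) ** 2

def grad(w, x, y):          # the analytic gradient being verified
    return (x @ w - y) * x

rng = np.random.default_rng(0)
w = rng.normal(size=4)
x = rng.normal(size=4)
y = 1.0

eps = 1e-6
g_fd = np.array([
    (loss(w + eps * e, x, y) - loss(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(4)      # e is the i-th unit vector
])
print(np.allclose(g_fd, grad(w, x, y), atol=1e-5))
```

If the two gradients disagree beyond finite-difference precision, the analytic gradient code has a bug.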


SGD Recommendations (3)
• Leverage the sparsity of the training examples

  • For very high-dimensional vectors with few nonzero coefficients,
    you only need to update the weight coefficients corresponding to
    the nonzero pattern in x

• Use learning rates of the form γ_t = γ₀ / (1 + γ₀ λ t)

  • Allows you to start from a reasonable learning rate γ₀ determined
    by testing on a small sample

  • Works well in most situations if γ₀ is slightly smaller than the
    best value observed on the training sample
Some Resources for SGD
• Francis Bach's talk in 2012:
  http://www.ann.jussieu.fr/~plc/bach2012.pdf

• Stochastic Gradient Methods Workshop:
  http://yaroslavvb.blogspot.com/2014/03/stochastic-gradient-methods-2014.html

• Python implementation in scikit-learn:
  http://scikit-learn.org/stable/modules/sgd.html

• iPython notebook for implementing GD and SGD in Python:
  https://github.com/dtnewman/gradient_descent/blob/master/stochastic_gradient_descent.ipynb
