CS221 - Artificial Intelligence - Machine Learning - 4 Stochastic Gradient Descent
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
w ← w − η∇w TrainLoss(w)
Problem: each iteration requires a pass over all the training examples, which is expensive when we have lots of data!
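The loop above can be sketched in Python. This is a minimal sketch assuming the average squared loss for linear regression; the function name and data layout are illustrative, not the lecture's actual code:

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, T=100):
    """Plain gradient descent, assuming TrainLoss(w) = mean((X w - y)^2).
    Every single update touches the entire dataset, which is exactly
    the expense the slide complains about."""
    w = np.zeros(X.shape[1])
    for t in range(T):
        # Gradient of the training loss: average of per-example gradients
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= eta * grad
    return w
```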
CS221 2
• So far, we’ve seen gradient descent as a general-purpose algorithm to optimize the training loss.
• But one problem with gradient descent is that it is slow.
• Recall that the training loss is a sum over the training data. If we have one million training examples, then each gradient computation requires
going through those one million examples, and this must happen before we can make any progress.
• Can we make progress before seeing all the data?
Stochastic gradient descent
TrainLoss(w) = (1/|Dtrain|) ∑_{(x,y) ∈ Dtrain} Loss(x, y, w)
Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
For (x, y) ∈ Dtrain :
w ← w − η∇w Loss(x, y, w)
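A minimal Python sketch of this inner loop, again assuming the squared loss for linear regression (names illustrative):

```python
import numpy as np

def sgd(X, y, eta=0.1, T=10):
    """Stochastic gradient descent: one cheap update per example.
    T passes over |Dtrain| examples give T * |Dtrain| updates,
    versus only T updates for full gradient descent."""
    w = np.zeros(X.shape[1])
    for t in range(T):
        for xi, yi in zip(X, y):
            # Gradient of the loss on a single example (x, y)
            grad = 2 * (xi @ w - yi) * xi
            w -= eta * grad
    return w
```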
• The answer is stochastic gradient descent (SGD).
• Rather than looping through all the training examples to compute a single gradient and making one step, SGD loops through the examples
(x, y) and updates the weights w based on each example.
• Each update is not as good because we’re only looking at one example rather than all the examples, but we can make many more updates
this way.
• Aside: there is a continuum between SGD and GD called minibatch SGD, where each update consists of an average over B examples.
• Aside: There are other variants of SGD. You can randomize the order in which you loop over the training data in each iteration. Think about
why this matters if your training data had all the positive examples first and all the negative examples after that.
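Both asides can be sketched together: a minibatch SGD loop that reshuffles the data each epoch. This is a sketch under stated assumptions (squared loss, illustrative names); B=1 recovers SGD and B=|Dtrain| recovers full gradient descent:

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.1, T=300, B=8, seed=0):
    """Minibatch SGD: each update averages the gradient over B examples.
    Reshuffling every epoch avoids pathological orderings (e.g. all
    positive examples first, then all negative ones)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for t in range(T):
        order = rng.permutation(n)          # randomize the example order
        for start in range(0, n, B):
            idx = order[start:start + B]
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
            w -= eta * grad
    return w
```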
Step size
w ← w − η ∇w Loss(x, y, w), where η is the step size.

η near 0: conservative, more stable. Larger η: aggressive, faster.
Strategies:
• Constant: η = 0.1
• Decreasing: η = 1/√(# updates made so far)
• One remaining issue is choosing the step size, which in practice is quite important.
• Generally, larger step sizes are like driving fast. You can get faster convergence, but you might also get very unstable results and crash and
burn.
• On the other hand, with smaller step sizes you get more stability, but you might get to your destination more slowly. Note that the weights
do not change at all if η = 0.
• A suggested form for the step size is to set the initial step size to 1 and let the step size decrease as the inverse of the square root of the
number of updates we’ve taken so far.
• Aside: There are more sophisticated algorithms like AdaGrad and Adam that adapt the step size based on the data, so that you don’t have
to tweak it as much.
• Aside: There are some nice theoretical results showing that SGD is guaranteed to converge in this case (provided all your gradients are
bounded).
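• To give a flavor of these adaptive methods, here is a sketch of the AdaGrad update (Adam additionally keeps momentum-style running averages; details omitted). The function and variable names are illustrative:

```python
import numpy as np

def adagrad_update(w, grad, G, eta=0.1, eps=1e-8):
    """One AdaGrad step: each coordinate gets its own effective step size
    eta / sqrt(sum of that coordinate's squared gradients so far), so
    frequently-updated coordinates automatically slow down.
    G accumulates the squared gradients across all updates."""
    G = G + grad ** 2
    w = w - eta * grad / (np.sqrt(G) + eps)
    return w, G
```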
Stochastic gradient descent in Python
[code]
• Now let us code up stochastic gradient descent for linear regression in Python.
• First we generate a dataset large enough that speed actually matters: one million points with x ∼ N (0, I) and
y ∼ N (w∗ · x, 1), where w∗ is the true weight vector, hidden from the algorithm.
• This way, we can diagnose whether the algorithm is actually working or not by checking whether it recovers something close to w∗ .
• Let’s first run gradient descent and observe that it makes progress, but very slowly.
• Now let us implement stochastic gradient descent. It is much faster.
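• The slide's code is not reproduced here; the following is a hedged reconstruction matching the notes' description. It uses a smaller dataset than the lecture's one million points, a specific illustrative w∗, and a step-size cap added purely for numerical stability in this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100_000, 3                    # the lecture uses one million points
true_w = np.array([2.0, -1.0, 0.5])  # w* (illustrative), hidden from the algorithm

# Generate data: x ~ N(0, I), y ~ N(w* . x, 1)
X = rng.normal(size=(n, d))
y = X @ true_w + rng.normal(size=n)

def sgd_linear_regression(X, y, T=1):
    """SGD with the decreasing step size eta = 1/sqrt(# updates so far).
    The cap at 0.1 is a stability tweak for this sketch, not part of the
    lecture's recipe."""
    w = np.zeros(X.shape[1])
    num_updates = 0
    for t in range(T):
        for i in range(len(y)):
            num_updates += 1
            eta = min(0.1, 1.0 / np.sqrt(num_updates))
            grad = 2 * (X[i] @ w - y[i]) * X[i]  # single-example gradient
            w -= eta * grad
    return w

w_hat = sgd_linear_regression(X, y)
print(np.linalg.norm(w_hat - true_w))
```

Checking ‖w − w∗‖ is the diagnostic the notes describe: SGD is working if this distance shrinks toward the noise floor.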
Summary
TrainLoss(w) = (1/|Dtrain|) ∑_{(x,y) ∈ Dtrain} Loss(x, y, w)
• In summary, we’ve shown how stochastic gradient descent can be faster than gradient descent.
• Gradient descent spends too much time refining its gradient (quality), while SGD gets a quick and dirty estimate from a single example and
makes many more updates (quantity).
• Of course, sometimes stochastic gradient descent can be unstable, and other techniques such as mini-batching can be used to stabilize it.