CS221 - Artificial Intelligence - Machine Learning - 4 Stochastic Gradient Descent

Stochastic gradient descent is an optimization algorithm for training machine learning models. It addresses the limitation that gradient descent is slow by computing the gradient using a single training example at each step, rather than the entire training dataset. This allows for more frequent updates to the model parameters. The key idea is making progress through many stochastic updates, rather than refining the gradient with high quality but costly computations using the whole dataset.

Machine learning: stochastic gradient descent

• In this module, we will introduce stochastic gradient descent.


Gradient descent is slow
TrainLoss(w) = (1/|Dtrain|) ∑_{(x,y)∈Dtrain} Loss(x, y, w)

Algorithm: gradient descent

Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
    w ← w − η ∇w TrainLoss(w)

Problem: each iteration requires going over all training examples — expensive when we have lots of data!

• So far, we’ve seen gradient descent as a general-purpose algorithm to optimize the training loss.
• But one problem with gradient descent is that it is slow.
• Recall that the training loss is an average over the training data. If we have one million training examples, then each gradient computation requires going through all one million examples, and this must happen before we can make any progress; a rough sketch of this cost follows these notes.
• Can we make progress before seeing all the data?
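
To make the cost concrete before answering, here is a minimal numpy sketch of one full-batch gradient computation, assuming squared loss for linear regression (the loss choice and the names X, y, eta are illustrative assumptions, not necessarily this module's exact setup):

import numpy as np

def train_loss_gradient(w, X, y):
    """Gradient of the average squared loss over ALL examples.

    X is an (n, d) matrix of inputs, y a length-n vector of targets.
    The whole dataset is touched before a single update can be made.
    """
    residuals = X @ w - y                  # one full pass over the n examples
    return 2 * X.T @ residuals / len(y)

# One gradient descent update (eta is the step size):
#   w = w - eta * train_loss_gradient(w, X, y)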
Stochastic gradient descent
TrainLoss(w) = (1/|Dtrain|) ∑_{(x,y)∈Dtrain} Loss(x, y, w)

Algorithm: stochastic gradient descent

Initialize w = [0, . . . , 0]
For t = 1, . . . , T :
    For (x, y) ∈ Dtrain :
        w ← w − η ∇w Loss(x, y, w)

• The answer is stochastic gradient descent (SGD).
• Rather than looping through all the training examples to compute a single gradient and making one step, SGD loops through the examples
(x, y) and updates the weights w based on each example.
• Each update is not as good because we’re only looking at one example rather than all the examples, but we can make many more updates
this way.
• Aside: there is a continuum between SGD and GD called minibatch SGD, where each update consists of an average over B examples.
• Aside: There are other variants of SGD. You can randomize the order in which you loop over the training data in each iteration (see the sketch after these notes). Think about why this is important if, in your training data, you had all the positive examples first and all the negative examples after that.
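
As a concrete illustration of the per-example updates and the shuffling aside, here is a minimal sketch of one SGD epoch; grad_loss and Dtrain are placeholder names assumed for illustration, not the module's actual code:

import random

def sgd_epoch(w, Dtrain, eta, grad_loss):
    """One pass of stochastic gradient descent over the training set.

    w         -- current weight vector (e.g. a numpy array)
    Dtrain    -- list of (x, y) training examples
    eta       -- step size
    grad_loss -- function (w, x, y) -> gradient of Loss(x, y, w) w.r.t. w
    """
    random.shuffle(Dtrain)                      # randomize the order each epoch
    for x, y in Dtrain:
        w = w - eta * grad_loss(w, x, y)        # update from a single example
    return w

# Minibatch variant: average the gradient over B examples per update,
# interpolating between SGD (B = 1) and gradient descent (B = |Dtrain|).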
Step size
w ← w − η ∇w Loss(x, y, w)     (η is the step size)

Question: what should η be?

[Spectrum of η from 0 to 1: smaller η is conservative and more stable; larger η is aggressive and faster]

Strategies:
• Constant: η = 0.1

• Decreasing: η = 1/√(# updates made so far)

• One remaining issue is choosing the step size, which in practice is quite important.
• Generally, larger step sizes are like driving fast. You can get faster convergence, but you might also get very unstable results and crash and
burn.
• On the other hand, with smaller step sizes you get more stability, but you might get to your destination more slowly. Note that the weights do not change at all if η = 0.
• A suggested form for the step size is to set the initial step size to 1 and let the step size decrease as the inverse of the square root of the
number of updates we’ve taken so far.
• Aside: There are more sophisticated algorithms like AdaGrad and Adam that adapt the step size based on the data, so that you don’t have
to tweak it as much.
• Aside: There are some nice theoretical results showing that SGD is guaranteed to converge in this case (provided all your gradients are
bounded).
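
To illustrate the two step-size strategies, here is a small sketch; the surrounding SGD loop and grad_loss are assumed rather than shown, and the decreasing form follows the 1/√(# updates) suggestion in these notes:

import math

# Strategy 1: constant step size.
eta_constant = 0.1

# Strategy 2: decreasing step size, eta = 1 / sqrt(number of updates made so far).
def eta_decreasing(num_updates):
    return 1.0 / math.sqrt(num_updates)

# Inside an SGD loop, counting updates from 1:
#   w = w - eta_decreasing(num_updates) * grad_loss(w, x, y)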
Stochastic gradient descent in Python

[code]

• Now let us code up stochastic gradient descent for linear regression in Python.
• First we generate a large enough dataset so that speed actually matters: 1 million points drawn according to x ∼ N(0, I) and y ∼ N(w∗ · x, 1), where w∗ is the true weight vector, hidden from the algorithm.
• This way, we can diagnose whether the algorithm is actually working or not by checking whether it recovers something close to w∗ .
• Let’s first run gradient descent, and watch that it makes progress but it is very slow.
• Now let us implement stochastic gradient descent. It is much faster.
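
The demo code itself is not reproduced in this transcript, but here is a minimal sketch along the lines these notes describe; the dataset size, step sizes, and number of SGD updates are illustrative assumptions, not the course's actual values:

import numpy as np

np.random.seed(0)
n, d = 1_000_000, 10
true_w = np.random.randn(d)                    # hidden from the algorithm
X = np.random.randn(n, d)                      # x ~ N(0, I)
y = X @ true_w + np.random.randn(n)            # y ~ N(true_w . x, 1)

def train_loss_gradient(w):
    """Full-batch gradient of the average squared loss (touches all n examples)."""
    return 2 * X.T @ (X @ w - y) / n

# Gradient descent: each update requires a full pass over the data.
w_gd = np.zeros(d)
for t in range(20):
    w_gd -= 0.1 * train_loss_gradient(w_gd)

# Stochastic gradient descent: one cheap update per example.
w_sgd = np.zeros(d)
eta = 0.01
for i in np.random.permutation(n)[:100_000]:   # a fraction of one epoch already suffices
    x_i = X[i]
    w_sgd -= eta * 2 * (x_i @ w_sgd - y[i]) * x_i

# Diagnose both by checking how close they get to the hidden true_w.
print("GD  distance to true_w:", np.linalg.norm(w_gd - true_w))
print("SGD distance to true_w:", np.linalg.norm(w_sgd - true_w))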
Summary
TrainLoss(w) = (1/|Dtrain|) ∑_{(x,y)∈Dtrain} Loss(x, y, w)

[Figure: gradient descent vs. stochastic gradient descent]

Key idea: stochastic updates

It’s not about quality, it’s about quantity.

• In summary, we’ve shown how stochastic gradient descent can be faster than gradient descent.
• Gradient descent just spends too much time refining its gradient (quality), while you can get a quick and dirty estimate from a single example and make many more updates (quantity).
• Of course, sometimes stochastic gradient descent can be unstable, and other techniques such as mini-batching can be used to stabilize it.
