9.b Handout-3-GD Variants

Source: http://cs231n.github.io/optimization-1/ (CS231n Convolutional Neural Networks for Visual Recognition)


The numerical gradient is very simple to compute using the finite difference approximation, but
the downside is that it is approximate (since we have to pick a small value of h, while the true
gradient is defined as the limit as h goes to zero), and that it is very computationally expensive to
compute. The second way to compute the gradient is analytically using Calculus, which allows us
to derive a direct formula for the gradient (no approximations) that is also very fast to compute.
However, unlike the numerical gradient it can be more error prone to implement, which is why in
practice it is very common to compute the analytic gradient and compare it to the numerical
gradient to check the correctness of your implementation. This is called a gradient check.
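
For instance, here is a minimal gradient-check sketch; the numerical_gradient helper, the toy function, and the error threshold are illustrative assumptions, not part of the notes:

import numpy as np

def numerical_gradient(f, x, h=1e-5):
  # centered finite-difference approximation of the gradient of f at x
  grad = np.zeros_like(x)
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:
    ix = it.multi_index
    old = x[ix]
    x[ix] = old + h
    fxph = f(x)          # evaluate f(x + h)
    x[ix] = old - h
    fxmh = f(x)          # evaluate f(x - h)
    x[ix] = old          # restore the original value
    grad[ix] = (fxph - fxmh) / (2 * h)
    it.iternext()
  return grad

# gradient check on a toy function f(w) = sum(w^2), whose analytic
# gradient is 2w; a relative error below ~1e-7 usually indicates a
# correct analytic gradient
w = np.random.randn(5)
num_grad = numerical_gradient(lambda v: np.sum(v ** 2), w)
analytic_grad = 2 * w
rel_error = np.abs(num_grad - analytic_grad) / np.maximum(1e-8, np.abs(num_grad) + np.abs(analytic_grad))
print(rel_error.max())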

Let's use the example of the SVM loss function for a single datapoint:

L_i = \sum_{j \neq y_i} \left[ \max(0, \, w_j^T x_i - w_{y_i}^T x_i + \Delta) \right]

We can differentiate the function with respect to the weights. For example, taking the gradient
with respect to w_{y_i} we obtain:

\nabla_{w_{y_i}} L_i = - \left( \sum_{j \neq y_i} \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \right) x_i

where \mathbb{1} is the indicator function that is one if the condition inside is true and zero
otherwise. While the expression may look scary when it is written out, when you're implementing
this in code you'd simply count the number of classes that didn't meet the desired margin (and
hence contributed to the loss function); the data vector x_i scaled by this count is the gradient.
Notice that this is the gradient only with respect to the row of W that corresponds to the correct
class. For the other rows, where j \neq y_i, the gradient is:

\nabla_{w_j} L_i = \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \, x_i

Once you have derived the expressions for the gradient, it is straightforward to implement them
and use them to perform the gradient update.
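
As a concrete illustration, here is one way to implement the two expressions above for a single datapoint; this is a sketch, and the function name, the row-per-class layout of W, and delta = 1.0 are assumptions:

import numpy as np

def svm_loss_grad_single(W, x_i, y_i, delta=1.0):
  # W: (C, D) weights, one row per class; x_i: (D,) datapoint; y_i: correct class index
  scores = W.dot(x_i)                      # class scores, shape (C,)
  margins = scores - scores[y_i] + delta   # margin for every class
  margins[y_i] = 0                         # the correct class contributes no loss
  loss = np.sum(np.maximum(0, margins))

  dW = np.zeros_like(W)
  violated = margins > 0                   # classes that did not meet the margin
  dW[violated] = x_i                       # rows j != y_i: gradient is x_i
  dW[y_i] = -np.sum(violated) * x_i        # correct-class row: -count * x_i
  return loss, dW

Counting the margin violations and scaling x_i by that count is exactly the shortcut described in the paragraph above.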

Gradient Descent
Now that we can compute the gradient of the loss function, the procedure of repeatedly
evaluating the gradient and then performing a parameter update is called Gradient Descent. Its
vanilla version looks as follows:

# Vanilla Gradient Descent


while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  weights += - step_size * weights_grad # perform parameter update

This simple loop is at the core of all Neural Network libraries. There are other ways of performing
the optimization (e.g. LBFGS), but Gradient Descent is currently by far the most common and
established way of optimizing Neural Network loss functions. Throughout the class we will put
some bells and whistles on the details of this loop (e.g. the exact details of the update equation),
but the core idea of following the gradient until we’re happy with the results will remain the same.

Mini-batch gradient descent. In large-scale applications (such as the ILSVRC challenge), the
training data can have on the order of millions of examples. Hence, it seems wasteful to compute the
full loss function over the entire training set in order to perform only a single parameter update. A
very common approach to addressing this challenge is to compute the gradient over batches of
the training data. For example, in current state of the art ConvNets, a typical batch contains 256
examples from the entire training set of 1.2 million. This batch is then used to perform a
parameter update:

# Vanilla Minibatch Gradient Descent

while True:
  data_batch = sample_training_data(data, 256) # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update

The reason this works well is that the examples in the training data are correlated. To see this,
consider the extreme case where all 1.2 million images in ILSVRC are in fact made up of exact
duplicates of only 1000 unique images (one for each class, or in other words 1200 identical
copies of each image). Then it is clear that the gradients we would compute for all 1200 identical
copies would all be the same, and when we average the data loss over all 1.2 million images we
would get the exact same loss as if we only evaluated on a small subset of 1000. In practice, of
course, the dataset will not contain duplicate images, but the gradient from a mini-batch is still a
good approximation of the gradient of the full objective. Therefore, much faster convergence can be
achieved in practice by evaluating the mini-batch gradients to perform more frequent parameter
updates.
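
For concreteness, the sample_training_data helper used in the loop above might look as follows; this is a sketch that assumes data is a tuple (X, y) of NumPy arrays, and sampling with replacement would also be a reasonable choice:

import numpy as np

def sample_training_data(data, batch_size):
  # draw a random mini-batch of examples and their labels
  X, y = data
  idx = np.random.choice(X.shape[0], batch_size, replace=False)
  return X[idx], y[idx]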

The extreme case of this is a setting where the mini-batch contains only a single example. This
process is called Stochastic Gradient Descent (SGD) (or also sometimes on-line gradient
descent). This is relatively less common to see because in practice due to vectorized code
optimizations it can be computationally much more efficient to evaluate the gradient for 100
examples, than the gradient for one example 100 times. Even though SGD technically refers to
using a single example at a time to evaluate the gradient, you will hear people use the term SGD
even when referring to mini-batch gradient descent (i.e. mentions of MGD for “Minibatch Gradient
Descent”, or BGD for “Batch gradient descent” are rare to see), where it is usually assumed that
mini-batches are used. The size of the mini-batch is a hyperparameter but it is not very common
to cross-validate it. It is usually based on memory constraints (if any), or set to some value, e.g.
32, 64 or 128. We use powers of 2 in practice because many vectorized operation
implementations work faster when their inputs are sized in powers of 2.
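
In code, single-example SGD is just the mini-batch loop above with a batch size of one (reusing the same hypothetical helpers):

# Stochastic Gradient Descent: a mini-batch of size 1
while True:
  data_batch = sample_training_data(data, 1) # a single example
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update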

Summary

Summary of the information flow. The dataset of pairs of (x, y) is given and fixed. The weights start out as
random numbers and can change. During the forward pass the score function computes class scores,
stored in vector f. The loss function contains two components: the data loss computes the compatibility
between the scores f and the labels y, and the regularization loss is only a function of the weights. During
Gradient Descent, we compute the gradient on the weights (and optionally on the data if we wish) and use
it to perform a parameter update.

In this section,

- We developed the intuition of the loss function as a high-dimensional optimization
  landscape in which we are trying to reach the bottom. The working analogy we developed
  was that of a blindfolded hiker who wishes to reach the bottom. In particular, we saw that
  the SVM cost function is piece-wise linear and bowl-shaped.
- We motivated the idea of optimizing the loss function with iterative refinement, where we
  start with a random set of weights and refine them step by step until the loss is minimized.
- We saw that the gradient of a function gives the steepest ascent direction and we
  discussed a simple but inefficient way of computing it numerically using the finite
  difference approximation (the finite difference being the value of h used in computing the
  numerical gradient).
