9.b Handout-3-GD Variants
The numerical gradient is very simple to compute using the finite difference approximation, but
the downside is that it is approximate (since we have to pick a small value of h, while the true
gradient is defined as the limit as h goes to zero), and that it is very computationally expensive to
compute. The second way to compute the gradient is analytically using Calculus, which allows us
to derive a direct formula for the gradient (no approximations) that is also very fast to compute.
However, unlike the numerical gradient it can be more error prone to implement, which is why in
practice it is very common to compute the analytic gradient and compare it to the numerical
gradient to check the correctness of your implementation. This is called a gradient check.
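For illustration, here is a minimal sketch of such a gradient check in numpy. The helper names (numerical_gradient, gradient_check), the centered-difference formula, and the step size h = 1e-5 are illustrative choices, not prescribed by these notes:

import numpy as np

def numerical_gradient(loss_fun, w, h=1e-5):
  # centered finite-difference approximation of the gradient at w
  grad = np.zeros_like(w)
  it = np.nditer(w, flags=['multi_index'])
  while not it.finished:
    ix = it.multi_index
    old = w[ix]
    w[ix] = old + h
    fxph = loss_fun(w)              # f(w + h)
    w[ix] = old - h
    fxmh = loss_fun(w)              # f(w - h)
    w[ix] = old                     # restore the original value
    grad[ix] = (fxph - fxmh) / (2 * h)
    it.iternext()
  return grad

def gradient_check(loss_fun, analytic_grad, w):
  # compare the analytic gradient to the numerical one via relative error
  num = numerical_gradient(loss_fun, w)
  ana = analytic_grad(w)
  rel = np.abs(num - ana) / np.maximum(1e-8, np.abs(num) + np.abs(ana))
  print('max relative error: %e' % np.max(rel))

A very small maximum relative error (say, below 1e-6) is good evidence that the analytic gradient is implemented correctly.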
Let's use the example of the SVM loss function for a single datapoint:
$$L_i = \sum_{j \neq y_i} \left[ \max(0,\, w_j^T x_i - w_{y_i}^T x_i + \Delta) \right]$$
We can differentiate the function with respect to the weights. For example, taking the gradient
with respect to $w_{y_i}$ we obtain:

$$\nabla_{w_{y_i}} L_i = - \left( \sum_{j \neq y_i} \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \right) x_i$$
where $\mathbb{1}$ is the indicator function that is one if the condition inside is true or zero otherwise. While
the expression may look scary when it is written out, when you're implementing this in code you'd
simply count the number of classes that didn't meet the desired margin (and hence contributed to
the loss function), and then the data vector $x_i$ scaled by this number is the gradient. Notice that
this is the gradient only with respect to the row of $W$ that corresponds to the correct class. For
the other rows, where $j \neq y_i$, the gradient is:
$$\nabla_{w_j} L_i = \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0)\, x_i$$
Once you derive the expression for the gradient, it is straightforward to implement it and use it
to perform the gradient update.
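As a sanity check on the two expressions above, here is a minimal numpy sketch for a single data point. The layout of W as a [K x D] matrix with one row of weights per class, and the names below, are illustrative assumptions rather than part of the notes:

import numpy as np

def svm_loss_grad_single(W, x, y, delta=1.0):
  # W: [K x D] weights (one row per class), x: [D] example, y: correct class index
  scores = W.dot(x)                         # class scores, shape [K]
  margins = scores - scores[y] + delta      # margin term for every class
  margins[y] = 0                            # the correct class does not contribute
  loss = np.sum(np.maximum(0, margins))

  dW = np.zeros_like(W)
  positive = margins > 0                    # indicator 1(w_j^T x - w_y^T x + delta > 0)
  dW[positive] = x                          # rows j != y that violated the margin get x
  dW[y] = -np.sum(positive) * x             # correct-class row: -(violation count) * x
  return loss, dW

Note how the code mirrors the math: the row for the correct class is $x_i$ scaled by the (negative) number of margin violations, and every other violating row is simply $x_i$.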
Gradient Descent
Now that we can compute the gradient of the loss function, the procedure of repeatedly
evaluating the gradient and then performing a parameter update is called Gradient Descent. Its
vanilla version looks as follows:
while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  weights += - step_size * weights_grad # perform parameter update
This simple loop is at the core of all Neural Network libraries. There are other ways of performing
the optimization (e.g. LBFGS), but Gradient Descent is currently by far the most common and
established way of optimizing Neural Network loss functions. Throughout the class we will put
some bells and whistles on the details of this loop (e.g. the exact details of the update equation),
but the core idea of following the gradient until we’re happy with the results will remain the same.
Mini-batch gradient descent. In large-scale applications (such as the ILSVRC challenge), the
training data can be on the order of millions of examples. Hence, it seems wasteful to compute the
full loss function over the entire training set in order to perform only a single parameter update. A
very common approach to addressing this challenge is to compute the gradient over batches of
the training data. For example, in current state of the art ConvNets, a typical batch contains 256
examples from the entire training set of 1.2 million. This batch is then used to perform a
parameter update:
while True:
  data_batch = sample_training_data(data, 256) # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update
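The sample_training_data helper is left unspecified in the notes; one plausible implementation, assuming the data is stored as a tuple of numpy arrays (X, y), would be:

import numpy as np

def sample_training_data(data, batch_size):
  # draw a random mini-batch without replacement; data is assumed to be (X, y)
  X, y = data
  idx = np.random.choice(X.shape[0], batch_size, replace=False)
  return X[idx], y[idx]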
The reason this works well is that the examples in the training data are correlated. To see this,
consider the extreme case where all 1.2 million images in ILSVRC are in fact made up of exact
duplicates of only 1000 unique images (one for each class, or in other words 1200 identical
copies of each image). Then it is clear that the gradients we would compute for all 1200 identical
copies would all be the same, and when we average the data loss over all 1.2 million images we
would get the exact same loss as if we only evaluated on a small subset of 1000. In practice, of
course, the dataset would not contain duplicate images, but the gradient from a mini-batch is still
a good approximation of the gradient of the full objective. Therefore, much faster convergence can be
achieved in practice by evaluating the mini-batch gradients to perform more frequent parameter
updates.
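This duplication argument is easy to verify numerically. The toy example below uses a simple mean-squared-error loss (an illustrative stand-in for the real data loss) with 1200 identical copies of 100 unique examples:

import numpy as np

np.random.seed(0)
X_small = np.random.randn(100, 10)        # 100 unique examples
y_small = np.random.randn(100)
X_full = np.tile(X_small, (1200, 1))      # 1200 identical copies of each example
y_full = np.tile(y_small, 1200)
w = np.random.randn(10)

def mse_grad(X, y, w):
  # gradient of the mean squared error data loss (1/N) * sum_i (x_i . w - y_i)^2
  return 2.0 * X.T.dot(X.dot(w) - y) / X.shape[0]

# the gradient over all copies equals the gradient over just the unique examples
print(np.allclose(mse_grad(X_full, y_full, w), mse_grad(X_small, y_small, w)))  # True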
The extreme case of this is a setting where the mini-batch contains only a single example. This
process is called Stochastic Gradient Descent (SGD) (or also sometimes on-line gradient
descent). This is relatively uncommon to see because in practice, due to vectorized code
optimizations, it can be computationally much more efficient to evaluate the gradient for 100
examples at once than the gradient for one example 100 times. Even though SGD technically refers to
using a single example at a time to evaluate the gradient, you will hear people use the term SGD
even when referring to mini-batch gradient descent (i.e. mentions of MGD for “Minibatch Gradient
Descent”, or BGD for “Batch gradient descent” are rare to see), where it is usually assumed that
mini-batches are used. The size of the mini-batch is a hyperparameter but it is not very common
to cross-validate it. It is usually based on memory constraints (if any), or set to some value, e.g.
32, 64 or 128. We use powers of 2 in practice because many vectorized operation
implementations work faster when their inputs are sized in powers of 2.
Summary
Summary of the information flow. The dataset of pairs of (x, y) is given and fixed. The weights start out as
random numbers and can change. During the forward pass the score function computes class scores,
stored in vector f. The loss function contains two components: the data loss computes the compatibility
between the scores f and the labels y, and the regularization loss is only a function of the weights. During
Gradient Descent, we compute the gradient on the weights (and optionally on the data if we wish) and use
them to perform a parameter update.
In this section, we saw how to compute the gradient of the loss function analytically, verify it with a
numerical gradient check, and use it to iteratively update the parameters with (mini-batch) gradient descent.