CS231n Deep Learning for Computer Vision
Table of Contents:
Gradient checks
Sanity checks
Babysitting the learning process
Loss function
Train/val accuracy
Weights:Updates ratio
Activation/Gradient distributions per layer
Visualization
Parameter updates
First-order (SGD), momentum, Nesterov momentum
Annealing the learning rate
Second-order methods
Per-parameter adaptive learning rates (Adagrad, RMSProp)
Hyperparameter Optimization
Evaluation
Model Ensembles
Summary
Additional References
Learning
In the previous sections we’ve discussed the static parts of a Neural Network: how we can
set up the network connectivity, the data, and the loss function. This section is devoted to the
dynamics, or in other words, the process of learning the parameters and finding good
hyperparameters.
Gradient Checks
In theory, performing a gradient check is as simple as comparing the analytic gradient to the
numerical gradient. In practice, the process is much more involved and error prone. Here are
some tips, tricks, and issues to watch out for:
Use the centered formula. The formula you may have seen for the finite difference
approximation when evaluating the numerical gradient looks as follows:
$$\frac{df(x)}{dx} = \frac{f(x+h) - f(x)}{h} \hspace{0.1in} \text{(bad, do not use)}$$
where h is a very small number, in practice approximately 1e-5 or so. In practice, it turns out
that it is much better to use the centered difference formula of the form:
$$\frac{df(x)}{dx} = \frac{f(x+h) - f(x-h)}{2h} \hspace{0.1in} \text{(use instead)}$$
This requires you to evaluate the loss function twice to check every single dimension of the
gradient (so it is about 2 times as expensive), but the gradient approximation turns out to be
much more precise. To see this, you can use the Taylor expansion of $f(x+h)$ and $f(x-h)$
and verify that the first formula has an error on the order of $O(h)$, while the second formula only
has error terms on the order of $O(h^2)$ (i.e. it is a second order approximation).
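For example, here is a minimal sketch of a centered-difference numerical gradient (the function name, the default h, and the in-place perturbation scheme are just one possible implementation; f is assumed to take a float array x and return a scalar loss):

import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    # centered-difference estimate of the gradient of f at x (x must be a float array)
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h
        fxph = f(x)                          # evaluate f(x + h) along this dimension
        x[ix] = old_value - h
        fxmh = f(x)                          # evaluate f(x - h) along this dimension
        x[ix] = old_value                    # restore the original value
        grad[ix] = (fxph - fxmh) / (2 * h)   # centered difference
        it.iternext()
    return grad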
Use relative error for the comparison. What are the details of comparing the numerical
gradient $f'_n$ and the analytic gradient $f'_a$? That is, how do we know if the two are not compatible?
You might be tempted to keep track of the difference $|f'_a - f'_n|$ or its square and define the
gradient check as failed if that difference is above a threshold. However, this is problematic.
For example, consider the case where their difference is 1e-4. This seems like a very
appropriate difference if the two gradients are about 1.0, so we’d consider the two gradients to
match. But if the gradients were both on order of 1e-5 or lower, then we’d consider 1e-4 to be a
huge difference and likely a failure. Hence, it is always more appropriate to consider the
relative error:
$$\frac{|f'_a - f'_n|}{\max(|f'_a|, |f'_n|)}$$
which considers the ratio of the difference to the maximum of the absolute values of both
gradients. Notice that normally the relative error formula only includes one of the two terms
(either one), but I prefer to max (or add) both to make it symmetric and to prevent dividing by
zero in the case where one of the two is zero (which can often happen, especially with ReLUs).
However, one must explicitly keep track of the case where both are zero and pass the gradient
check in that edge case. In practice:
relative error > 1e-2 usually means the gradient is probably wrong
1e-2 > relative error > 1e-4 should make you feel uncomfortable
1e-4 > relative error is usually okay for objectives with kinks. But if there are no kinks (e.g.
use of tanh nonlinearities and softmax), then 1e-4 is too high.
1e-7 and less you should be happy.
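A small helper implementing the comparison above might look as follows (a hypothetical sketch; grad_analytic and grad_numerical are assumed to be numpy arrays of the same shape):

import numpy as np

def rel_error(grad_analytic, grad_numerical):
    # elementwise relative error, with the both-zero edge case counted as a pass
    num = np.abs(grad_analytic - grad_numerical)
    denom = np.maximum(np.abs(grad_analytic), np.abs(grad_numerical))
    out = np.zeros_like(num)
    nonzero = denom > 0
    out[nonzero] = num[nonzero] / denom[nonzero]   # where both gradients are zero, the error stays 0
    return np.max(out)                             # compare the worst dimension against the thresholds above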
Also keep in mind that the deeper the network, the higher the relative errors will be. So if you
are gradient checking the input data for a 10-layer network, a relative error of 1e-2 might be
okay because the errors build up on the way. Conversely, an error of 1e-2 for a single
differentiable function likely indicates an incorrect gradient.
Use double precision. A common pitfall is using single precision floating point to compute
the gradient check. It is often the case that you might get high relative errors (as high as 1e-2)
even with a correct gradient implementation. In my experience I’ve sometimes seen my
relative errors plummet from 1e-2 to 1e-8 by switching to double precision.
Stick around active range of floating point. It’s a good idea to read through “What Every
Computer Scientist Should Know About Floating-Point Arithmetic”, as it may demystify your
errors and enable you to write more careful code. For example, in neural nets it can be
common to normalize the loss function over the batch. However, if your gradients per
datapoint are very small, then additionally dividing them by the number of data points is
starting to give very small numbers, which in turn will lead to more numerical issues. This is
why I like to always print the raw numerical/analytic gradient, and make sure that the numbers
you are comparing are not extremely small (e.g. roughly 1e-10 and smaller in absolute value is
worrying). If they are you may want to temporarily scale your loss function up by a constant to
bring them to a “nicer” range where floats are more dense - ideally on the order of 1.0, where
your float exponent is 0.
Kinks in the objective. One source of inaccuracy to be aware of during gradient checking is the
problem of kinks. Kinks refer to non-differentiable parts of an objective function, introduced by
functions such as ReLU (max(0, x)), or the SVM loss, Maxout neurons, etc. Consider
gradient checking the ReLU function at x = −1e-6. Since x < 0, the analytic gradient at this
point is exactly zero. However, the numerical gradient would suddenly compute a non-zero
gradient because f(x + h) might cross over the kink (e.g. if h > 1e-6) and introduce a
non-zero contribution. You might think that this is a pathological case, but in fact this case can
be very common. For example, an SVM for CIFAR-10 contains up to 450,000 max(0, x)
terms because there are 50,000 examples and each example yields 9 terms to the objective.
Moreover, a Neural Network with an SVM classifier will contain many more kinks due to ReLUs.
Note that it is possible to know if a kink was crossed in the evaluation of the loss. This can be
done by keeping track of the identities of all “winners” in a function of the form max(x, y); that
is, whether x or y was higher during the forward pass. If the identity of at least one winner changes
when evaluating f (x + h) and then f (x − h), then a kink was crossed and the numerical
gradient will not be exact.
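A sketch of this bookkeeping for a ReLU layer might look as follows (a hypothetical helper; it simply reports whether a perturbation of size h would flip the winner of max(0, x) in any coordinate):

import numpy as np

def relu_kink_crossed(x, h=1e-5):
    # True if perturbing any coordinate by +/- h changes which side of the kink it is on
    winners_plus = (x + h) > 0     # winner identities when evaluating f(x + h)
    winners_minus = (x - h) > 0    # winner identities when evaluating f(x - h)
    return np.any(winners_plus != winners_minus)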
Use only few datapoints. One fix to the above problem of kinks is to use fewer datapoints,
since loss functions that contain kinks (e.g. due to use of ReLUs or margin losses etc.) will
have fewer kinks with fewer datapoints, so it is less likely for you to cross one when you
perform the finite difference approximation. Moreover, if your gradient check passes for only ~2 or 3
datapoints, then it would almost certainly pass for an entire batch. Using very few
datapoints also makes your gradient check faster and more efficient.
Be careful with the step size h. It is not necessarily the case that smaller is better, because
when h is much smaller, you may start running into numerical precision problems. Sometimes
when the gradient doesn’t check, it is possible that you change h to be 1e-4 or 1e-6 and
suddenly the gradient will be correct. This Wikipedia article contains a chart that plots the value
of h on the x-axis and the numerical gradient error on the y-axis.
Don’t let the regularization overwhelm the data. It is often the case that a loss function is a
sum of the data loss and the regularization loss (e.g. L2 penalty on weights). One danger to be
aware of is that the regularization loss may overwhelm the data loss, in which case the
gradients will be primarily coming from the regularization term (which usually has a much
simpler gradient expression). This can mask an incorrect implementation of the data loss
gradient. Therefore, it is recommended to turn off regularization and check the data loss alone
first, and then the regularization term second and independently. One way to perform the latter
is to hack the code to remove the data loss contribution. Another way is to increase the
regularization strength so as to ensure that its effect is non-negligible in the gradient check,
and that an incorrect implementation would be spotted.
Sanity checks
Look for correct loss at chance performance. Make sure you’re getting the loss you
expect when you initialize with small parameters. It’s best to first check the data loss
alone (so set the regularization strength to zero). For example, for CIFAR-10 with a Softmax
classifier we would expect the initial loss to be 2.302, because we expect a diffuse
probability of 0.1 for each class (since there are 10 classes), and the Softmax loss is the
negative log probability of the correct class: -ln(0.1) = 2.302 (a small numeric check of this
value appears after these sanity checks). For the Weston Watkins SVM, we expect all desired
margins to be violated (since all scores are approximately zero), and hence expect a loss of 9
(since the margin is 1 for each of the 9 wrong classes). If you’re not seeing these losses, there
might be an issue with the initialization.
As a second sanity check, increasing the regularization strength should increase the loss.
Overfit a tiny subset of data. Lastly and most importantly, before training on the full
dataset try to train on a tiny portion (e.g. 20 examples) of your data and make sure you
can achieve zero cost. For this experiment it’s also best to set regularization to zero,
otherwise this can prevent you from getting zero cost. Unless you pass this sanity check
with a small dataset it is not worth proceeding to the full dataset. Note that it may
happen that you can overfit a very small dataset but still have an incorrect
implementation. For instance, if your datapoints’ features are random due to some bug,
then it will be possible to overfit your small training set but you will never notice any
generalization when you fold it into your full dataset.
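As an illustration of the first check, here is a minimal sketch for a Softmax classifier with CIFAR-10-like dimensions (random stand-in data and labels; only the chance-level loss is being verified):

import numpy as np

np.random.seed(0)
N, D, C = 100, 3072, 10                  # examples, input dimension, classes
X = 0.01 * np.random.randn(N, D)         # stand-in data
y = np.random.randint(C, size=N)         # stand-in labels
W = 0.0001 * np.random.randn(D, C)       # small initial weights
b = np.zeros(C)

scores = X.dot(W) + b                            # all scores are roughly zero
scores -= scores.max(axis=1, keepdims=True)      # for numerical stability
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
loss = -np.log(probs[np.arange(N), y]).mean()    # data loss only (no regularization)

print(loss)           # should be close to -ln(0.1)
print(-np.log(0.1))   # 2.302...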
Babysitting the learning process
Loss function
The first quantity that is useful to track during training is the loss, as it is evaluated on the
individual batches during the forward pass. Below is a cartoon diagram showing the loss over
time, and especially what the shape might tell you about the learning rate:
Left: A cartoon depicting the effects of different learning rates. With low learning rates the improvements
will be linear. With high learning rates they will start to look more exponential. Higher learning rates will
decay the loss faster, but they get stuck at worse values of loss (green line). This is because there is too
much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in
a nice spot in the optimization landscape. Right: An example of a typical loss function over time, while
training a small network on CIFAR-10 dataset. This loss function looks reasonable (it might indicate a
slightly too small learning rate based on its speed of decay, but it's hard to say), and also indicates that
the batch size might be a little too low (since the cost is a little too noisy).
The amount of “wiggle” in the loss is related to the batch size. When the batch size is 1, the
wiggle will be relatively high. When the batch size is the full dataset, the wiggle will be minimal
because every gradient update should be improving the loss function monotonically (unless
the learning rate is set too high).
Some people prefer to plot their loss functions in the log domain. Since learning progress
generally takes an exponential form, the plot appears as a slightly more interpretable
straight line, rather than a hockey stick. Additionally, if multiple cross-validated models are
plotted on the same loss graph, the differences between them become more apparent.
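A small sketch of such a log-domain plot (the loss curve here is synthetic, just to illustrate the plotting call):

import numpy as np
import matplotlib.pyplot as plt

losses = 2.3 * np.exp(-np.arange(1000) / 300.0) + 0.05 * np.random.rand(1000)  # stand-in loss curve
plt.plot(losses)
plt.yscale('log')          # exponential decay now appears roughly linear
plt.xlabel('iteration')
plt.ylabel('loss')
plt.show()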
Sometimes loss functions can look funny lossfunctions.tumblr.com.
Train/Val accuracy
The second important quantity to track while training a classifier is the validation/training
accuracy. This plot can give you valuable insights into the amount of overfitting in your model:
the gap between the training and validation accuracy indicates the amount of overfitting. A large
gap suggests strong overfitting (more regularization or more data may help), while a validation
accuracy that closely tracks the training accuracy suggests the model capacity could be increased.
Ratio of weights:updates
The last quantity you might want to track is the ratio of the update magnitudes to the value
magnitudes. Note: updates, not the raw gradients (e.g. in vanilla sgd this would be the gradient
multiplied by the learning rate). You might want to evaluate and track this ratio for every set of
parameters independently. A rough heuristic is that this ratio should be somewhere around
1e-3. If it is lower than this then the learning rate might be too low. If it is higher then the learning
rate is likely too high. Here is a specific example:
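A minimal sketch of such a check, assuming a parameter array W, its gradient dW, and a learning_rate are already defined (this version looks at the elementwise ratios):

import numpy as np

update = -learning_rate * dW                      # updates, not the raw gradients
ratios = np.abs(update) / (np.abs(W) + 1e-12)     # per-parameter update:weight ratio
W += update                                       # the actual parameter update
print(ratios.min(), ratios.max())                 # want these to be around 1e-3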
Instead of tracking the min or the max, some people prefer to compute and track the norm of
the gradients and their updates instead. These metrics are usually correlated and often give
approximately the same results.
Activation / Gradient distributions per layer
An incorrect initialization can slow down or even completely stall the learning process. Luckily,
this issue can be diagnosed relatively easily. One way to do so is to plot activation/gradient
histograms for all layers of the network. Intuitively, it is not a good sign to see any strange
distributions - e.g. with tanh neurons we would like to see a distribution of neuron activations
spread over the full range of [-1,1], instead of seeing all neurons outputting zero, or all neurons
being completely saturated at either -1 or 1.
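As an illustration, here is a small sketch that builds a tiny random tanh network on stand-in data and plots the per-layer activation histograms (the layer sizes and the 0.01 weight scale are arbitrary choices for the demo):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
X = np.random.randn(256, 100)                    # stand-in input batch
layer_sizes = [100, 50, 50, 10]
activations = {}
h = X
for i in range(len(layer_sizes) - 1):
    W = 0.01 * np.random.randn(layer_sizes[i], layer_sizes[i + 1])   # try other scales too
    h = np.tanh(h.dot(W))
    activations['layer %d' % (i + 1)] = h

fig, axes = plt.subplots(1, len(activations), figsize=(12, 3))
for ax, (name, acts) in zip(axes, activations.items()):
    ax.hist(acts.ravel(), bins=50)               # distribution of tanh activations in this layer
    ax.set_title(name)
plt.tight_layout()
plt.show()

With a poor weight scale the histograms collapse toward zero or pile up at -1 and 1, which is exactly the kind of pathology this diagnostic is meant to reveal.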
First-layer Visualizations
Lastly, when one is working with image pixels it can be helpful and satisfying to plot the first-
layer features visually:
Examples of visualized weights for the first layer of a neural network. Left: Noisy features can be a
symptom of an unconverged network, an improperly set learning rate, or a very low weight regularization penalty.
Right: Nice, smooth, clean and diverse features are a good indication that the training is proceeding well.
Parameter updates
Once the analytic gradient is computed with backpropagation, the gradients are used to
perform a parameter update. There are several approaches for performing the update, which
we discuss next.
We note that optimization for deep networks is currently a very active area of research. In this
section we highlight some established and common techniques you may see in practice,
briefly describe their intuition, but leave a detailed analysis outside of the scope of the class.
We provide some further pointers for an interested reader.
Vanilla update. The simplest form of update is to change the parameters along the negative
gradient direction (since the gradient indicates the direction of increase, whereas we typically
wish to minimize the loss). Assuming a vector of parameters x and the gradient dx, the simplest
update has the form:
# Vanilla update
x += - learning_rate * dx
where learning_rate is a hyperparameter (a fixed constant).
Momentum update is another approach that almost always enjoys better convergence rates on
deep networks. This update can be motivated from a physical perspective of the optimization
problem. In particular, the loss can be interpreted as the height of a hilly terrain (and therefore
also to the potential energy since U = mgh and therefore U ∝ h ). Initializing the
parameters with random numbers is equivalent to setting a particle with zero initial velocity at
some location. The optimization process can then be seen as equivalent to the process of
simulating the parameter vector (i.e. a particle) as rolling on the landscape.
Since the force on the particle is related to the gradient of potential energy (i.e. F = −∇U ),
the force felt by the particle is precisely the (negative) gradient of the loss function. Moreover,
F = ma so the (negative) gradient is in this view proportional to the acceleration of the
particle. Note that this is different from the SGD update shown above, where the gradient
directly integrates the position. Instead, the physics view suggests an update in which the
gradient only directly influences the velocity, which in turn has an effect on the position:
# Momentum update
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position
Nesterov Momentum is a slightly different version of the momentum update that has recently
been gaining popularity. It enjoys stronger theoretical convergence guarantees for convex
functions and in practice it also consistently works slightly better than standard momentum.
The core idea behind Nesterov momentum is that when the current parameter vector is at
some position x , then looking at the momentum update above, we know that the momentum
term alone (i.e. ignoring the second term with the gradient) is about to nudge the parameter
vector by mu * v . Therefore, if we are about to compute the gradient, we can treat the future
approximate position x + mu * v as a “lookahead” - this is a point in the vicinity of where we
are soon going to end up. Hence, it makes sense to compute the gradient at x + mu * v
instead of at the “old/stale” position x .
Nesterov momentum. Instead of evaluating gradient at the current position (red circle), we know that our
momentum is about to carry us to the tip of the green arrow. With Nesterov momentum we therefore
instead evaluate the gradient at this "looked-ahead" position.
x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v
However, in practice people prefer to express the update to look as similar to vanilla SGD or to
the previous momentum update as possible. This is possible to achieve by manipulating the
update above with a variable transform x_ahead = x + mu * v , and then expressing the
update in terms of x_ahead instead of x . That is, the parameter vector we are actually
storing is always the ahead version. The equations in terms of x_ahead (but renaming it back
to x ) then become:
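Writing dx for the gradient evaluated at the stored (ahead) parameter vector, one common way to express the resulting update is the following sketch, which follows from substituting x_ahead = x + mu * v into the update above:

v_prev = v                                # back this up
v = mu * v - learning_rate * dx           # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v          # position update changes form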