CS231n Deep Learning for Computer Vision
Table of Contents:
Gradient checks
Sanity checks
Babysitting the learning process
Loss function
Train/val accuracy
Weights:Updates ratio
Activation/Gradient distributions per layer
Visualization
Parameter updates
First-order (SGD), momentum, Nesterov momentum
Annealing the learning rate
Second-order methods
Per-parameter adaptive learning rates (Adagrad, RMSProp)
Hyperparameter Optimization
Evaluation
Model Ensembles
Summary
Additional References
Learning
In the previous sections we’ve discussed the static parts of a Neural Network: how we can
set up the network connectivity, the data, and the loss function. This section is devoted to the
dynamics, or in other words, the process of learning the parameters and finding good
hyperparameters.
Gradient Checks
In theory, performing a gradient check is as simple as comparing the analytic gradient to the
numerical gradient. In practice, the process is much more involved and error prone. Here are
some tips, tricks, and issues to watch out for:
Use the centered formula. The formula you may have seen for the finite difference
approximation when evaluating the numerical gradient looks as follows:
$$\frac{df(x)}{dx} = \frac{f(x+h) - f(x)}{h} \hspace{0.1in} \text{(bad, do not use)}$$
where h is a very small number, in practice approximately 1e-5 or so. In practice, it turns out
that it is much better to use the centered difference formula of the form:
$$\frac{df(x)}{dx} = \frac{f(x+h) - f(x-h)}{2h} \hspace{0.1in} \text{(use instead)}$$
This requires you to evaluate the loss function twice to check every single dimension of the
gradient (so it is about 2 times as expensive), but the gradient approximation turns out to be
much more precise. To see this, you can use the Taylor expansion of $f(x+h)$ and $f(x-h)$
and verify that the first formula has an error on the order of $O(h)$, while the second formula only
has error terms on the order of $O(h^2)$ (i.e. it is a second order approximation).
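For example, here is a minimal sketch of a centered-difference numerical gradient (the function name, the default h, and the in-place perturbation scheme are just one possible implementation; f is assumed to take a float array x and return a scalar loss):

import numpy as np

def eval_numerical_gradient(f, x, h=1e-5):
    # centered-difference estimate of the gradient of f at x (x must be a float array)
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        ix = it.multi_index
        old_value = x[ix]
        x[ix] = old_value + h
        fxph = f(x)                          # evaluate f(x + h) along this dimension
        x[ix] = old_value - h
        fxmh = f(x)                          # evaluate f(x - h) along this dimension
        x[ix] = old_value                    # restore the original value
        grad[ix] = (fxph - fxmh) / (2 * h)   # centered difference
        it.iternext()
    return grad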
Use relative error for the comparison. What are the details of comparing the numerical
gradient $f'_n$ and the analytic gradient $f'_a$? That is, how do we know if the two are not compatible?
You might be tempted to keep track of the difference $|f'_a - f'_n|$ or its square and define the
gradient check as failed if that difference is above a threshold. However, this is problematic.
For example, consider the case where their difference is 1e-4. This seems like a very
appropriate difference if the two gradients are about 1.0, so we’d consider the two gradients to
match. But if the gradients were both on order of 1e-5 or lower, then we’d consider 1e-4 to be a
huge difference and likely a failure. Hence, it is always more appropriate to consider the
relative error:
$$\frac{|f'_a - f'_n|}{\max(|f'_a|, |f'_n|)}$$
which considers the ratio of the difference to the maximum of the absolute values of both
gradients. Notice that normally the relative error formula only includes one of the two terms
(either one), but I prefer to max (or add) both to make it symmetric and to prevent dividing by
zero in the case where one of the two is zero (which can often happen, especially with ReLUs).
However, one must explicitly keep track of the case where both are zero and pass the gradient
check in that edge case. In practice:
relative error > 1e-2 usually means the gradient is probably wrong
1e-2 > relative error > 1e-4 should make you feel uncomfortable
1e-4 > relative error is usually okay for objectives with kinks. But if there are no kinks (e.g.
use of tanh nonlinearities and softmax), then 1e-4 is too high.
1e-7 and less you should be happy.
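A small helper implementing the comparison above might look as follows (a hypothetical sketch; grad_analytic and grad_numerical are assumed to be numpy arrays of the same shape):

import numpy as np

def rel_error(grad_analytic, grad_numerical):
    # elementwise relative error, with the both-zero edge case counted as a pass
    num = np.abs(grad_analytic - grad_numerical)
    denom = np.maximum(np.abs(grad_analytic), np.abs(grad_numerical))
    out = np.zeros_like(num)
    nonzero = denom > 0
    out[nonzero] = num[nonzero] / denom[nonzero]   # where both gradients are zero, the error stays 0
    return np.max(out)                             # compare the worst dimension against the thresholds above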
Also keep in mind that the deeper the network, the higher the relative errors will be. So if you
are gradient checking the input data for a 10-layer network, a relative error of 1e-2 might be
okay because the errors build up on the way. Conversely, an error of 1e-2 for a single
differentiable function likely indicates an incorrect gradient.
Use double precision. A common pitfall is using single precision floating point to compute
the gradient check. It is often the case that you might get high relative errors (as high as 1e-2)
even with a correct gradient implementation. In my experience I’ve sometimes seen my
relative errors plummet from 1e-2 to 1e-8 by switching to double precision.
Stick around active range of floating point. It’s a good idea to read through “What Every
Computer Scientist Should Know About Floating-Point Arithmetic”, as it may demystify your
errors and enable you to write more careful code. For example, in neural nets it can be
common to normalize the loss function over the batch. However, if your gradients per
datapoint are very small, then additionally dividing them by the number of data points is
starting to give very small numbers, which in turn will lead to more numerical issues. This is
why I like to always print the raw numerical/analytic gradient, and make sure that the numbers
you are comparing are not extremely small (e.g. roughly 1e-10 and smaller in absolute value is
worrying). If they are you may want to temporarily scale your loss function up by a constant to
bring them to a “nicer” range where floats are more dense - ideally on the order of 1.0, where
your float exponent is 0.
Kinks in the objective. One source of inaccuracy to be aware of during gradient checking is the
problem of kinks. Kinks refer to non-differentiable parts of an objective function, introduced by
functions such as ReLU (max(0, x)), or the SVM loss, Maxout neurons, etc. Consider
gradient checking the ReLU function at x = −1e-6. Since x < 0, the analytic gradient at this
point is exactly zero. However, the numerical gradient would suddenly compute a non-zero
gradient because f(x + h) might cross over the kink (e.g. if h > 1e-6) and introduce a
non-zero contribution. You might think that this is a pathological case, but in fact this case can
be very common. For example, an SVM for CIFAR-10 contains up to 450,000 max(0, x)
terms because there are 50,000 examples and each example yields 9 terms to the objective.
Moreover, a Neural Network with an SVM classifier will contain many more kinks due to ReLUs.
Note that it is possible to know if a kink was crossed in the evaluation of the loss. This can be
done by keeping track of the identities of all “winners” in a function of the form max(x, y); that
is, whether x or y was higher during the forward pass. If the identity of at least one winner changes
when evaluating f (x + h) and then f (x − h), then a kink was crossed and the numerical
gradient will not be exact.
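A sketch of this bookkeeping for a ReLU layer might look as follows (a hypothetical helper; it simply reports whether a perturbation of size h would flip the winner of max(0, x) in any coordinate):

import numpy as np

def relu_kink_crossed(x, h=1e-5):
    # True if perturbing any coordinate by +/- h changes which side of the kink it is on
    winners_plus = (x + h) > 0     # winner identities when evaluating f(x + h)
    winners_minus = (x - h) > 0    # winner identities when evaluating f(x - h)
    return np.any(winners_plus != winners_minus)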
Use only few datapoints. One fix to the above problem of kinks is to use fewer datapoints,
since loss functions that contain kinks (e.g. due to use of ReLUs or margin losses etc.) will
have fewer kinks with fewer datapoints, so it is less likely for you to cross one when you
perform the finite difference approximation. Moreover, if your gradient check passes for only ~2 or 3
datapoints, then it would almost certainly pass for an entire batch. Using very few
datapoints also makes your gradient check faster and more efficient.
Be careful with the step size h. It is not necessarily the case that smaller is better, because
when h is much smaller, you may start running into numerical precision problems. Sometimes
when the gradient doesn’t check, it is possible that you change h to be 1e-4 or 1e-6 and
suddenly the gradient will be correct. This Wikipedia article contains a chart that plots the value
of h on the x-axis and the numerical gradient error on the y-axis.
Don’t let the regularization overwhelm the data. It is often the case that a loss function is a
sum of the data loss and the regularization loss (e.g. L2 penalty on weights). One danger to be
aware of is that the regularization loss may overwhelm the data loss, in which case the
gradients will be primarily coming from the regularization term (which usually has a much
simpler gradient expression). This can mask an incorrect implementation of the data loss
gradient. Therefore, it is recommended to turn off regularization and check the data loss alone
first, and then the regularization term second and independently. One way to perform the latter
is to hack the code to remove the data loss contribution. Another way is to increase the
regularization strength so as to ensure that its effect is non-negligible in the gradient check,
and that an incorrect implementation would be spotted.
Sanity checks
Look for correct loss at chance performance. Make sure you’re getting the loss you
expect when you initialize with small parameters. It’s best to first check the data loss
alone (so set the regularization strength to zero). For example, for CIFAR-10 with a Softmax
classifier we would expect the initial loss to be 2.302, because we expect a diffuse
probability of 0.1 for each class (since there are 10 classes), and the Softmax loss is the
negative log probability of the correct class: -ln(0.1) = 2.302 (a small numeric check of this
value appears after these sanity checks). For the Weston Watkins SVM, we expect all desired
margins to be violated (since all scores are approximately zero), and hence expect a loss of 9
(since the margin is 1 for each of the 9 wrong classes). If you’re not seeing these losses, there
might be an issue with the initialization.
As a second sanity check, increasing the regularization strength should increase the loss.
Overfit a tiny subset of data. Lastly and most importantly, before training on the full
dataset try to train on a tiny portion (e.g. 20 examples) of your data and make sure you
can achieve zero cost. For this experiment it’s also best to set regularization to zero,
otherwise this can prevent you from getting zero cost. Unless you pass this sanity check
with a small dataset it is not worth proceeding to the full dataset. Note that it may
happen that you can overfit a very small dataset but still have an incorrect
implementation. For instance, if your datapoints’ features are random due to some bug,
then it will be possible to overfit your small training set but you will never notice any
generalization when you fold it into your full dataset.
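As an illustration of the first check, here is a minimal sketch for a Softmax classifier with CIFAR-10-like dimensions (random stand-in data and labels; only the chance-level loss is being verified):

import numpy as np

np.random.seed(0)
N, D, C = 100, 3072, 10                  # examples, input dimension, classes
X = 0.01 * np.random.randn(N, D)         # stand-in data
y = np.random.randint(C, size=N)         # stand-in labels
W = 0.0001 * np.random.randn(D, C)       # small initial weights
b = np.zeros(C)

scores = X.dot(W) + b                            # all scores are roughly zero
scores -= scores.max(axis=1, keepdims=True)      # for numerical stability
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
loss = -np.log(probs[np.arange(N), y]).mean()    # data loss only (no regularization)

print(loss)           # should be close to -ln(0.1)
print(-np.log(0.1))   # 2.302...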
Babysitting the learning process
Loss function
The first quantity that is useful to track during training is the loss, as it is evaluated on the
individual batches during the forward pass. Below is a cartoon diagram showing the loss over
time, and especially what the shape might tell you about the learning rate:
Left: A cartoon depicting the effects of different learning rates. With low learning rates the improvements
will be linear. With high learning rates they will start to look more exponential. Higher learning rates will
decay the loss faster, but they get stuck at worse values of loss (green line). This is because there is too
much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in
a nice spot in the optimization landscape. Right: An example of a typical loss function over time, while
training a small network on CIFAR-10 dataset. This loss function looks reasonable (it might indicate a
slightly too small learning rate based on its speed of decay, but it's hard to say), and also indicates that
the batch size might be a little too low (since the cost is a little too noisy).
The amount of “wiggle” in the loss is related to the batch size. When the batch size is 1, the
wiggle will be relatively high. When the batch size is the full dataset, the wiggle will be minimal
because every gradient update should be improving the loss function monotonically (unless
the learning rate is set too high).
Some people prefer to plot their loss functions in the log domain. Since learning progress
generally takes an exponential form, the plot appears as a slightly more interpretable
straight line, rather than a hockey stick. Additionally, if multiple cross-validated models are
plotted on the same loss graph, the differences between them become more apparent.
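A small sketch of such a log-domain plot (the loss curve here is synthetic, just to illustrate the plotting call):

import numpy as np
import matplotlib.pyplot as plt

losses = 2.3 * np.exp(-np.arange(1000) / 300.0) + 0.05 * np.random.rand(1000)  # stand-in loss curve
plt.plot(losses)
plt.yscale('log')          # exponential decay now appears roughly linear
plt.xlabel('iteration')
plt.ylabel('loss')
plt.show()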
Sometimes loss functions can look funny lossfunctions.tumblr.com.
Train/Val accuracy
The second important quantity to track while training a classifier is the validation/training
accuracy. This plot can give you valuable insights into the amount of overfitting in your model:
the gap between the training and validation accuracy indicates the amount of overfitting. A large
gap suggests strong overfitting (more regularization or more data may help), while a validation
accuracy that closely tracks the training accuracy suggests the model capacity could be increased.
Ratio of weights:updates
The last quantity you might want to track is the ratio of the update magnitudes to the value
magnitudes. Note: updates, not the raw gradients (e.g. in vanilla sgd this would be the gradient
multiplied by the learning rate). You might want to evaluate and track this ratio for every set of
parameters independently. A rough heuristic is that this ratio should be somewhere around
1e-3. If it is lower than this then the learning rate might be too low. If it is higher then the learning
rate is likely too high. Here is a specific example:
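A minimal sketch of such a check, assuming a parameter array W, its gradient dW, and a learning_rate are already defined (this version looks at the elementwise ratios):

import numpy as np

update = -learning_rate * dW                      # updates, not the raw gradients
ratios = np.abs(update) / (np.abs(W) + 1e-12)     # per-parameter update:weight ratio
W += update                                       # the actual parameter update
print(ratios.min(), ratios.max())                 # want these to be around 1e-3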
Instead of tracking the min or the max, some people prefer to compute and track the norm of
the gradients and their updates instead. These metrics are usually correlated and often give
approximately the same results.
Activation / Gradient distributions per layer
An incorrect initialization can slow down or even completely stall the learning process. Luckily,
this issue can be diagnosed relatively easily. One way to do so is to plot activation/gradient
histograms for all layers of the network. Intuitively, it is not a good sign to see any strange
distributions - e.g. with tanh neurons we would like to see a distribution of neuron activations
spread over the full range of [-1,1], instead of seeing all neurons outputting zero, or all neurons
being completely saturated at either -1 or 1.
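As an illustration, here is a small sketch that builds a tiny random tanh network on stand-in data and plots the per-layer activation histograms (the layer sizes and the 0.01 weight scale are arbitrary choices for the demo):

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(0)
X = np.random.randn(256, 100)                    # stand-in input batch
layer_sizes = [100, 50, 50, 10]
activations = {}
h = X
for i in range(len(layer_sizes) - 1):
    W = 0.01 * np.random.randn(layer_sizes[i], layer_sizes[i + 1])   # try other scales too
    h = np.tanh(h.dot(W))
    activations['layer %d' % (i + 1)] = h

fig, axes = plt.subplots(1, len(activations), figsize=(12, 3))
for ax, (name, acts) in zip(axes, activations.items()):
    ax.hist(acts.ravel(), bins=50)               # distribution of tanh activations in this layer
    ax.set_title(name)
plt.tight_layout()
plt.show()

With a poor weight scale the histograms collapse toward zero or pile up at -1 and 1, which is exactly the kind of pathology this diagnostic is meant to reveal.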
First-layer Visualizations
Lastly, when one is working with image pixels it can be helpful and satisfying to plot the first-
layer features visually:
Examples of visualized weights for the first layer of a neural network. Left: Noisy features can be a
symptom of an unconverged network, an improperly set learning rate, or a very low weight regularization penalty.
Right: Nice, smooth, clean and diverse features are a good indication that the training is proceeding well.
Parameter updates
Once the analytic gradient is computed with backpropagation, the gradients are used to
perform a parameter update. There are several approaches for performing the update, which
we discuss next.
We note that optimization for deep networks is currently a very active area of research. In this
section we highlight some established and common techniques you may see in practice,
briefly describe their intuition, but leave a detailed analysis outside of the scope of the class.
We provide some further pointers for an interested reader.
Vanilla update. The simplest form of update is to change the parameters along the negative
gradient direction (since the gradient indicates the direction of increase, whereas we typically
wish to minimize the loss). Assuming a vector of parameters x and the gradient dx, the simplest
update has the form:
# Vanilla update
x += - learning_rate * dx
where learning_rate is a hyperparameter (a fixed constant).
Momentum update is another approach that almost always enjoys better convergence rates on
deep networks. This update can be motivated from a physical perspective of the optimization
problem. In particular, the loss can be interpreted as the height of a hilly terrain (and therefore
also to the potential energy since U = mgh and therefore U ∝ h ). Initializing the
parameters with random numbers is equivalent to setting a particle with zero initial velocity at
some location. The optimization process can then be seen as equivalent to the process of
simulating the parameter vector (i.e. a particle) as rolling on the landscape.
Since the force on the particle is related to the gradient of potential energy (i.e. F = −∇U ),
the force felt by the particle is precisely the (negative) gradient of the loss function. Moreover,
F = ma so the (negative) gradient is in this view proportional to the acceleration of the
particle. Note that this is different from the SGD update shown above, where the gradient
directly integrates the position. Instead, the physics view suggests an update in which the
gradient only directly influences the velocity, which in turn has an effect on the position:
# Momentum update
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position
Nesterov Momentum is a slightly different version of the momentum update that has recently
been gaining popularity. It enjoys stronger theoretical convergence guarantees for convex
functions and in practice it also consistently works slightly better than standard momentum.
The core idea behind Nesterov momentum is that when the current parameter vector is at
some position x , then looking at the momentum update above, we know that the momentum
term alone (i.e. ignoring the second term with the gradient) is about to nudge the parameter
vector by mu * v . Therefore, if we are about to compute the gradient, we can treat the future
approximate position x + mu * v as a “lookahead” - this is a point in the vicinity of where we
are soon going to end up. Hence, it makes sense to compute the gradient at x + mu * v
instead of at the “old/stale” position x .
Nesterov momentum. Instead of evaluating gradient at the current position (red circle), we know that our
momentum is about to carry us to the tip of the green arrow. With Nesterov momentum we therefore
instead evaluate the gradient at this "looked-ahead" position.
x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v
However, in practice people prefer to express the update to look as similar to vanilla SGD or to
the previous momentum update as possible. This is possible to achieve by manipulating the
update above with a variable transform x_ahead = x + mu * v , and then expressing the
update in terms of x_ahead instead of x . That is, the parameter vector we are actually
storing is always the ahead version. The equations in terms of x_ahead (but renaming it back
to x ) then become:
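Writing dx for the gradient evaluated at the stored (ahead) parameter vector, one common way to express the resulting update is the following sketch, which follows from substituting x_ahead = x + mu * v into the update above:

v_prev = v                                # back this up
v = mu * v - learning_rate * dx           # velocity update stays the same
x += -mu * v_prev + (1 + mu) * v          # position update changes form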