CS231n (cs231n.github.io): Neural Networks Case Study
In this section we’ll walk through a complete implementation of a toy Neural Network in 2 dimensions. We’ll first
implement a simple linear classifier and then extend the code to a 2-layer Neural Network. As we’ll see, this
extension is surprisingly simple and very few changes are necessary.
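The toy dataset itself is not shown in this extract. A spiral dataset of the kind described can be generated along the following lines (the exact number of points per class, the noise scale, and the spiral parameters here are assumptions):

```python
import numpy as np

np.random.seed(0)
N = 100  # points per class (an assumption)
D = 2    # dimensionality
K = 3    # number of classes
X = np.zeros((N * K, D))            # data matrix (each row = one example)
y = np.zeros(N * K, dtype='uint8')  # class labels
for j in range(K):
    ix = range(N * j, N * (j + 1))
    r = np.linspace(0.0, 1, N)                                      # radius
    t = np.linspace(j * 4, (j + 1) * 4, N) + np.random.randn(N) * 0.2  # angle, with noise
    X[ix] = np.c_[r * np.sin(t), r * np.cos(t)]
    y[ix] = j
```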
Normally we would want to preprocess the dataset so that each feature has zero mean and unit standard
deviation, but in this case the features are already in a nice range from -1 to 1, so we skip this step.
In this example we have 300 2-D points, so after this multiplication the array scores will have size [300 x 3],
where each row gives the class scores corresponding to the 3 classes (blue, red, yellow).
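The multiplication referred to above is the linear score function. A minimal sketch of the missing setup follows; the weight initialization scale and the placeholder data `X` are assumptions here:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(300, 2)  # placeholder for the 300 2-D points

# initialize parameters randomly: D = 2 input dimensions, K = 3 classes
W = 0.01 * np.random.randn(2, 3)
b = np.zeros((1, 3))

# compute class scores for a linear classifier
scores = np.dot(X, W) + b  # shape [300 x 3]
```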
$$L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$$
We can see that the Softmax classifier interprets every element of f as holding the (unnormalized) log
probabilities of the three classes. We exponentiate these to get (unnormalized) probabilities, and then normalize
them to get probabilities. Therefore, the expression inside the log is the normalized probability of the correct
class. Note how this expression works: this quantity is always between 0 and 1. When the probability of the
correct class is very small (near 0), the loss will go towards (positive) infinity. Conversely, when the correct class
probability goes towards 1, the loss will go towards zero because log(1) = 0. Hence, the expression for Li is
low when the correct class probability is high, and it’s very high when it is low.
Recall also that the full Softmax classifier loss is then defined as the average cross-entropy loss over the training
examples and the regularization:
$$L = \underbrace{\frac{1}{N}\sum_i L_i}_{\text{data loss}} + \underbrace{\frac{1}{2}\lambda \sum_k \sum_l W_{k,l}^2}_{\text{regularization loss}}$$
Given the array of scores we’ve computed above, we can compute the loss. First, the way to obtain the
probabilities is straightforward:
num_examples = X.shape[0]
# get unnormalized probabilities
exp_scores = np.exp(scores)
# normalize them for each example
probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
We now have an array probs of size [300 x 3], where each row now contains the class probabilities. In
particular, since we’ve normalized them every row now sums to one. We can now query for the log probabilities
assigned to the correct classes in each example:
correct_logprobs = -np.log(probs[range(num_examples),y])
The array correct_logprobs is a 1D array holding just the negative log probabilities assigned to the correct
classes for each example. The full loss is then the average of these and the regularization loss:
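The loss computation is elided in this extract; it presumably looks like the following (the uniform toy probabilities below are placeholders used only to make the snippet runnable):

```python
import numpy as np

# placeholder setup with the shapes used in the text
np.random.seed(0)
num_examples = 300
reg = 1e-3                        # regularization strength lambda
W = 0.01 * np.random.randn(2, 3)
y = np.random.randint(0, 3, num_examples)
probs = np.full((num_examples, 3), 1.0 / 3)  # uniform class probabilities
correct_logprobs = -np.log(probs[range(num_examples), y])

# average cross-entropy (data loss) plus L2 regularization
data_loss = np.sum(correct_logprobs) / num_examples
reg_loss = 0.5 * reg * np.sum(W * W)
loss = data_loss + reg_loss
```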
In this code, the regularization strength λ is stored in the variable reg . The convenience factor of 0.5 multiplying
the regularization will become clear in a second. Evaluating this in the beginning (with random parameters) might
give us loss = 1.1 , which is -np.log(1.0/3) , since with small initial random weights all probabilities
assigned to all classes are about one third. We now want to make the loss as low as possible, with loss = 0
as the absolute lower bound. The lower the loss is, the higher the probabilities assigned to the correct
classes for all examples.
$$p_k = \frac{e^{f_k}}{\sum_j e^{f_j}} \qquad\qquad L_i = -\log(p_{y_i})$$
We now wish to understand how the computed scores inside f should change to decrease the loss Li that this
example contributes to the full objective. In other words, we want to derive the gradient ∂ Li /∂ fk . The loss Li
is computed from p, which in turn depends on f . It’s a fun exercise for the reader to use the chain rule to derive
the gradient, but it turns out to be extremely simple and interpretable in the end, after a lot of things cancel out:
$$\frac{\partial L_i}{\partial f_k} = p_k - \mathbb{1}(y_i = k)$$
Notice how elegant and simple this expression is. Suppose the probabilities we computed were p = [0.2,
0.3, 0.5] , and that the correct class was the middle one (with probability 0.3). According to this derivation the
gradient on the scores would be df = [0.2, -0.7, 0.5] . Recalling the interpretation of the gradient,
we see that this result is highly intuitive: increasing the first or last element of the score vector f (the scores of
the incorrect classes) leads to an increased loss (due to the positive signs +0.2 and +0.5) - and increasing the
loss is bad, as expected. However, increasing the score of the correct class has negative influence on the loss.
The gradient of -0.7 is telling us that increasing the correct class score would lead to a decrease of the loss Li ,
which makes sense.
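As a quick sanity check, the worked example above can be reproduced in a couple of lines:

```python
import numpy as np

p = np.array([0.2, 0.3, 0.5])  # computed probabilities for one example
correct = 1                    # index of the correct (middle) class

df = p.copy()
df[correct] -= 1               # apply p_k - 1(y_i = k)
print(df)                      # [ 0.2 -0.7  0.5]
```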
All of this boils down to the following code. Recall that probs stores the probabilities of all classes (as rows) for
each example. To get the gradient on the scores, which we call dscores , we proceed as follows:
dscores = probs
dscores[range(num_examples),y] -= 1
dscores /= num_examples
Lastly, we had that scores = np.dot(X, W) + b , so armed with the gradient on scores (stored in
dscores ), we can now backpropagate into W and b :
dW = np.dot(X.T, dscores)
db = np.sum(dscores, axis=0, keepdims=True)
dW += reg*W # don't forget the regularization gradient
Where we see that we have backpropped through the matrix multiply operation, and also added the contribution
from the regularization. Note that the regularization gradient has the very simple form reg*W since we used the
constant 0.5 for its loss contribution, i.e. $\frac{d}{dw}\left(\frac{1}{2}\lambda w^2\right) = \lambda w$. This is a common convenience trick that
simplifies the gradient expression.
# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength
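The training loop itself is elided from this extract. Assuming vanilla gradient descent with the hyperparameters above, it might look as follows; the clustered placeholder data here stands in for the spiral dataset:

```python
import numpy as np

# placeholder data: three well-separated Gaussian clusters, 100 points each
np.random.seed(0)
y = np.repeat(np.arange(3), 100)
centers = np.array([[0.0, 2.0], [-2.0, -1.0], [2.0, -1.0]])
X = 0.4 * np.random.randn(300, 2) + centers[y]

# initialize parameters
W = 0.01 * np.random.randn(2, 3)
b = np.zeros((1, 3))

# some hyperparameters
step_size = 1e-0
reg = 1e-3  # regularization strength

num_examples = X.shape[0]
for i in range(200):
    # forward pass: scores and normalized probabilities
    scores = np.dot(X, W) + b
    exp_scores = np.exp(scores)
    probs = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

    # loss: average cross-entropy plus L2 regularization
    correct_logprobs = -np.log(probs[range(num_examples), y])
    loss = np.sum(correct_logprobs) / num_examples + 0.5 * reg * np.sum(W * W)

    # backward pass: gradient on scores, then on W and b
    dscores = probs
    dscores[range(num_examples), y] -= 1
    dscores /= num_examples
    dW = np.dot(X.T, dscores) + reg * W
    db = np.sum(dscores, axis=0, keepdims=True)

    # vanilla gradient descent parameter update
    W += -step_size * dW
    b += -step_size * db
```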
We see that we’ve converged to something after about 190 iterations. We can evaluate the training set accuracy:
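The evaluation code is not shown in this extract; it presumably takes the argmax of the scores in each row, along these lines (the data and parameters below are placeholder stand-ins for the trained classifier):

```python
import numpy as np

# placeholder stand-ins for the trained classifier
np.random.seed(0)
X = np.random.randn(300, 2)
y = np.random.randint(0, 3, 300)
W = 0.01 * np.random.randn(2, 3)
b = np.zeros((1, 3))

# predicted class is the argmax of the scores in each row
scores = np.dot(X, W) + b
predicted_class = np.argmax(scores, axis=1)
print('training accuracy: %.2f' % np.mean(predicted_class == y))
```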
This prints 49%. Not very good at all, but also not surprising given that the dataset is constructed so it is not
linearly separable. We can also plot the learned decision boundaries:
Linear classifier fails to learn the toy spiral dataset.
Notice that the only change from before is one extra line of code, where we first compute the hidden layer
representation and then the scores based on this hidden layer. Crucially, we’ve also added a non-linearity, which
in this case a simple ReLU that thresholds the activations of the hidden layer at zero.
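The forward pass being described might look like this; the hidden layer size of 100 and the placeholder data are assumptions:

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(300, 2)  # placeholder for the spiral data

h = 100  # size of the hidden layer (an assumption)
W = 0.01 * np.random.randn(2, h)
b = np.zeros((1, h))
W2 = 0.01 * np.random.randn(h, 3)
b2 = np.zeros((1, 3))

# the one extra line: hidden representation with a ReLU non-linearity
hidden_layer = np.maximum(0, np.dot(X, W) + b)
scores = np.dot(hidden_layer, W2) + b2
```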
Everything else remains the same. We compute the loss based on the scores exactly as before, and get the
gradient for the scores dscores exactly as before. However, the way we backpropagate that gradient into the
model parameters now changes form, of course. First let’s backpropagate the second layer of the Neural
Network. This looks identical to the code we had for the Softmax classifier, except we’re replacing X (the raw
data) with the variable hidden_layer :
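The second-layer backprop code is elided here; it presumably mirrors the earlier dW and db computation (the placeholder arrays below exist only to make the snippet runnable):

```python
import numpy as np

# placeholder activations and score gradients with the shapes from the text
np.random.seed(0)
hidden_layer = np.maximum(0, np.random.randn(300, 100))
dscores = np.random.randn(300, 3) / 300
W2 = 0.01 * np.random.randn(100, 3)
reg = 1e-3

# backprop into the second layer: hidden_layer plays the role of X
dW2 = np.dot(hidden_layer.T, dscores)
db2 = np.sum(dscores, axis=0, keepdims=True)
dW2 += reg * W2  # regularization gradient
```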
However, unlike before we are not yet done, because hidden_layer is itself a function of other parameters
and the data! We need to continue backpropagation through this variable. Its gradient can be computed as:
dhidden = np.dot(dscores, W2.T)
Now we have the gradient on the outputs of the hidden layer. Next, we have to backpropagate the ReLU non-
linearity. This turns out to be easy because ReLU during the backward pass is effectively a switch. Since
$r = \max(0, x)$, we have that $\frac{dr}{dx} = \mathbb{1}(x > 0)$. Combined with the chain rule, we see that the ReLU unit lets
the gradient pass through unchanged if its input was greater than 0, but kills it if its input was less than zero
during the forward pass. Hence, we can backpropagate the ReLU in place simply with:
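The in-place ReLU backprop being referred to can be sketched as follows (placeholder arrays stand in for the forward-pass activations and the incoming gradient):

```python
import numpy as np

# placeholder forward-pass activations and incoming gradient
np.random.seed(0)
hidden_layer = np.maximum(0, np.random.randn(300, 100))
dhidden = np.random.randn(300, 100)

# the ReLU backward pass is a switch: kill gradients where it did not fire
dhidden[hidden_layer <= 0] = 0
```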
And now we finally continue to the first layer weights and biases:
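This final step is elided in the extract; it presumably repeats the matrix-multiply backprop with dhidden in place of dscores (placeholder arrays again make the snippet runnable):

```python
import numpy as np

# placeholder data and hidden-layer gradient (after the ReLU backprop)
np.random.seed(0)
X = np.random.randn(300, 2)
dhidden = np.random.randn(300, 100)
W = 0.01 * np.random.randn(2, 100)
reg = 1e-3

# finally, backprop into the first layer parameters
dW = np.dot(X.T, dhidden)
db = np.sum(dhidden, axis=0, keepdims=True)
dW += reg * W  # don't forget the regularization gradient
```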
We’re done! We have the gradients dW,db,dW2,db2 and can perform the parameter update. Everything else
remains unchanged. The full code looks very similar:
# some hyperparameters
step_size = 1e-0
reg = 1e-3 # regularization strength
This prints:
Summary
We’ve worked with a toy 2D dataset and trained both a linear network and a 2-layer Neural Network. We saw
that the change from a linear classifier to a Neural Network involves very few changes in the code. The score
function changes its form (1 line of code difference), and the backpropagation changes its form (we have to
perform one more round of backprop through the hidden layer to the first layer of the network).
You may want to look at this IPython Notebook code rendered as HTML.
Or download the ipynb file