Assignment 2
In this assignment you will train and test a two-layer network with multiple
outputs to classify images from the CIFAR-10 dataset. You will train
the network using mini-batch gradient descent applied to a cost function
that computes the cross-entropy loss of the classifier applied to the labelled
training data and an L2 regularization term on the weight matrix.
The overall structure of your code for this assignment should mimic that
from Assignment 1. You will have more parameters than before and you
will have to change the functions that 1) evaluate the network (the forward
pass) and 2) compute the gradients (the backward pass). We will also be
paying more attention to how to search for good parameter settings for the
network’s regularization term and the learning rate. Welcome to the nitty
gritty of training neural networks!
The network computes the class probabilities p for an input image x as

s1 = W1 x + b1          (1)
h = max(0, s1)          (2)
s = W2 h + b2           (3)
p = softmax(s)          (4)

where the max in equation (2) is applied elementwise and, for a network with d input dimensions, m hidden nodes and K classes, W1 and b1 have size m × d and m × 1 while W2 and b2 have size K × m and K × 1.
The predicted class corresponds to the label with the highest probability:

k* = arg max_{1 ≤ k ≤ K} pk
Figure 1: For this assignment, the computational graph of the classification function applied to an input x and the computational graph of the cost function applied to a mini-batch of size 1. The nodes of the graphs are the intermediary values x, s1, h, s and p, the cross-entropy loss l = −yᵀ log(p), the regularization term r = ‖W1‖² + ‖W2‖² and the cost J = l + λr; the parameters W1, b1, W2, b2 and the inputs y and λ enter as indicated.
The cost function on a labelled dataset D is the average cross-entropy loss of the classifier plus an L2 regularization term on the two weight matrices:

J(D, λ, Θ) = (1/|D|) Σ_{(x,y)∈D} l_cross(x, y, Θ) + λ (‖W1‖² + ‖W2‖²)

where

l_cross(x, y, Θ) = −yᵀ log(p)

and p has been calculated using equations (1-4). (Note that as the label is encoded by a one-hot representation, the cross-entropy loss −yᵀ log(p) = − log(py).) The optimization problem we have to solve is

Θ* = arg min_Θ J(D, λ, Θ)

where Θ denotes the full set of parameters {W1, b1, W2, b2}.
In this assignment (as described in the lectures) we will solve this optimiza-
tion problem via mini-batch gradient descent with cyclic learning rates.
For mini-batch gradient descent we begin with a sensible random initialization of the parameters W, b and we then update our estimate for the parameters, for k = 1, 2, with

Wk^(t+1) = Wk^(t) − η ∂J(B^(t+1), λ, W, b)/∂Wk

bk^(t+1) = bk^(t) − η ∂J(B^(t+1), λ, W, b)/∂bk

where η is the learning rate and B^(t+1) is called a mini-batch and is a random
subset of the training data D.
To compute the relevant gradients for the mini-batch, we then have to com-
pute the gradient of the loss w.r.t. each training example in the mini-batch.
You should refer to the lecture notes for the explicit description of how to
compute these gradients. Once again I would advise you to implement the
efficient vectorized version as it results in significant speed ups.
There is not really one optimal learning rate when training a neural network
with vanilla mini-batch gradient descent. Choose too small a learning rate
and training will take too long; choose too large a learning rate and
training may diverge. Ideally, one should have an adaptive learning rate which
changes to match the local shape of the cost surface at the current estimate
of the network’s parameters. Many variants of mini-batch training try to
achieve this - ADAM, mini-batch with a momentum term etc. - and these
variants are covered in the lectures. For this assignment though we will ex-
plore the rather recent idea of exploiting cyclical learning rates [Smith, 2015]
as this approach eliminates much of the trial-and-error associated with find-
ing a good learning rate and some of the costly hyper-parameter optimiza-
tion over multiple parameters associated with training with momentum. It
also empirically seems to work well when training relatively small networks
as in these assignments. The main idea of cyclical learning rates is that dur-
ing training the learning rate is periodically changed in a systematic fashion
from a small value to a large one and then from this large value back to the
small value. And this process is then repeated again and again until training
is stopped. See figure 2 for an illustration of a typical example of how the
learning rate is scheduled to change periodically during training. This is the
schedule you will implement.
Assume that you have defined a minimum ηmin and a maximum ηmax learning
rate. ηmin and ηmax define the range of learning rates where learning occurs
without divergence. (Note, in general, these values will be affected by λ,
network architecture and parameter initialization.) Let ηt represent the
learning rate at the tth update step. One complete cycle will take 2ns update
steps, where ns is known as the stepsize. When t = 2lns then ηt = ηmin and
when t = (2l + 1)ns then ηt = ηmax for l = 0, 1, . . .. A rule of thumb is to
set ns = k ⌊n/nbatch⌋ with k being an integer between 2 and 8, where n is the
total number of training examples and nbatch is the number of examples in
a batch. The triangular learning rate schedule is then defined as follows.
Figure 2: Schedule for the cyclic learning rate. The graph above plots the
learning rate ηt at each update step. Initially η1 = ηmin and its value increases
linearly until it reaches a maximum value of ηmax when t = ns. Then ηt decreases
linearly until it has a value of ηmin again when t = 2ns. The cycle can be
repeated as many times as one likes and in this example it is repeated three times.
For most practical applications the number of cycles is ≥ 2 and ≤ 10. The positive
integer ns is known as the stepsize and is usually chosen so that one cycle of training
corresponds to a multiple of epochs of training.
If 2lns ≤ t ≤ (2l + 1)ns for some l ∈ {0, 1, 2, . . .} set

ηt = ηmin + ((t − 2lns)/ns) (ηmax − ηmin)          (14)

while if (2l + 1)ns ≤ t ≤ 2(l + 1)ns for some l ∈ {0, 1, 2, . . .} set

ηt = ηmax − ((t − (2l + 1)ns)/ns) (ηmax − ηmin)     (15)
At each update step t the parameters are then updated as usual, but with the current learning rate ηt:

θt = θt−1 − ηt (∂J/∂θ)|θ=θt−1
Normally training is run for a set number of complete cycles and is stopped
when the learning rate is at its smallest, that is t = 2lns for some l ≥ 2. For
this assignment I will give you values for ηmin , ηmax and ns that work well
for the default network used in the assignment. You can read [Smith, 2015],
the paper which forms the basis for this assignment, to get guidelines and
tests about how to set ηmin and ηmax .
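To make equations (14) and (15) concrete, here is a minimal sketch of a helper that returns ηt for a given update step t (the function name and the convention of counting t from 0 are illustrative choices, not part of the assignment specification):

def cyclic_eta(t, eta_min, eta_max, n_s):
    # triangular schedule of equations (14) and (15)
    l = t // (2 * n_s)                      # index of the current cycle
    if t <= (2 * l + 1) * n_s:              # increasing half of the cycle, equation (14)
        return eta_min + (t - 2 * l * n_s) / n_s * (eta_max - eta_min)
    else:                                   # decreasing half of the cycle, equation (15)
        return eta_max - (t - (2 * l + 1) * n_s) / n_s * (eta_max - eta_min)

For example, with eta_min=1e-5, eta_max=1e-1 and n_s=500 this returns 1e-1 at t=500 and is back at 1e-5 at t=1000.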
Exercise 1: Read in the data & initialize the parameters of the network
For this assignment (to begin) you will just use the data in the file data_batch_1
for training, the file data_batch_2 for validation and the file test_batch for
testing. You have already written a function for Assignment 1 to read in
the data and pre-process it. For this assignment we will apply the same
pre-processing as before. Remember you should transform it to have zero
mean. If trainX is the d × n image data matrix (each column corresponds
to an image) for the training data, then you first compute the mean and
standard deviation of each dimension of trainX.
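A minimal way to do this with numpy (assuming trainX has already been converted to a d × n array of floats) could be:

mean_X = np.mean(trainX, axis=1, keepdims=True)   # d x 1 vector of per-dimension means
std_X = np.std(trainX, axis=1, keepdims=True)     # d x 1 vector of per-dimension standard deviations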
Then you should normalize the training, validation and test data with respect
to this mean and these standard deviations. If X is a d × n image data
matrix then you can normalize X as

X = X - mean_X
X = X / std_X

(with mean_X and std_X stored as d × 1 arrays, numpy broadcasting applies them to every column of X).
Next you have to set up the data structure for the parameters of the network
and to initialize their values. In the assignment we will just focus on a
network that has m=50 nodes in the hidden layer. As W1 and W2 will have
different sizes, as will b1 and b2, I recommend you use a list to store
these weight matrices and bias vectors within a dictionary, that is if L is the
number of layers then you can set up the container like this:

net_params = {}
net_params['W'] = [None] * L
net_params['b'] = [None] * L

where net_params['W'][0] and net_params['b'][0] hold W1 and b1, and net_params['W'][1] and net_params['b'][1] hold W2 and b2.
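With L = 2 this container can then be filled in, for example, using the same style of initialization as the small-network test code later in this document (Gaussian entries scaled by 1/√fan-in and zero bias vectors); the exact values below are illustrative:

import numpy as np

rng = np.random.default_rng()
d, m, K = 3072, 50, 10                 # CIFAR-10 input size, hidden nodes, number of classes
net_params['W'][0] = (1 / np.sqrt(d)) * rng.standard_normal(size=(m, d))
net_params['b'][0] = np.zeros((m, 1))
net_params['W'][1] = (1 / np.sqrt(m)) * rng.standard_normal(size=(K, m))
net_params['b'][1] = np.zeros((K, 1))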
Next you will write functions to compute the gradient of your two-layer
network w.r.t. its weight and bias parameters. As before I suggest you re-use
much of your code from Assignment1.py. You will need to write (update)
the following functions (that you wrote previously): a forward-pass function
(called ApplyNetwork in the sample code below) that applies the network to
the input data and also returns the intermediary activation values, and a
backward-pass function (called BackwardPass below) that computes the
gradients. Once again I would recommend using a container such as a
dictionary to store this forward-pass data so it can be easily passed to the
backward pass function to compute the gradients. For the sample code in
this document I refer to this dictionary as fp_data.
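As a rough guide (not a prescribed implementation), vectorized sketches of these two functions, assuming the net_params structure above, a d × n input matrix X, a K × n one-hot label matrix Y and the cost J defined earlier, could look like:

import numpy as np

def ApplyNetwork(X, net_params):
    # forward pass of equations (1)-(4) applied to all columns of X at once
    s1 = net_params['W'][0] @ X + net_params['b'][0]           # (1): m x n first-layer scores
    h = np.maximum(0, s1)                                       # (2): ReLU activations
    s = net_params['W'][1] @ h + net_params['b'][1]             # (3): K x n final scores
    e = np.exp(s - np.max(s, axis=0, keepdims=True))            # shifted for numerical stability
    p = e / np.sum(e, axis=0, keepdims=True)                    # (4): softmax probabilities
    return {'s1': s1, 'h': h, 's': s, 'p': p}                   # fp_data dictionary

def BackwardPass(X, Y, fp_data, net_params, lam):
    # efficient (vectorized) gradients of J for a mini-batch; the lecture notes
    # remain the authoritative reference for this derivation
    n = X.shape[1]
    P, H = fp_data['p'], fp_data['h']
    grads = {'W': [None] * 2, 'b': [None] * 2}
    G = P - Y                                                   # gradient of the loss w.r.t. s (K x n)
    grads['W'][1] = (G @ H.T) / n + 2 * lam * net_params['W'][1]
    grads['b'][1] = np.sum(G, axis=1, keepdims=True) / n
    G = net_params['W'][1].T @ G                                # back-propagate to the hidden layer
    G = G * (H > 0)                                             # gradient through the ReLU
    grads['W'][0] = (G @ X.T) / n + 2 * lam * net_params['W'][0]
    grads['b'][0] = np.sum(G, axis=1, keepdims=True) / n
    return grads

The exact contents and field names of fp_data are up to you; the torch comparison described next is a good way to check whichever version you write.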
Once you have written the code to compute the gradients the next step
is debugging. Download the file torch_gradient_computations.py from the
Canvas page. This file contains skeleton code to compute the gradients via
torch. The lines missing are those that compute the scores for input data
X corresponding to equations (1-3). You need to use torch operations to
compute them instead of numpy ones. To help you out, here is one way to
compute the ReLU function on each entry of the m×n torch array H:
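import torch

H = torch.nn.functional.relu(H)    # elementwise max(0, x); torch.clamp(H, min=0) would work equally well

For the numerical check itself you can, as in Assignment 1, apply both your code and the completed torch code to a small network and a small sub-sample of the data, for example: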
# set up a small network and a small amount of data for the gradient check
small_net = {'W': [None] * 2, 'b': [None] * 2}
d_small = 5
n_small = 3
m = 6
lam = 0
small_net['W'][0] = (1/np.sqrt(d_small))*rng.standard_normal(size = (m, d_small))
small_net['b'][0] = np.zeros((m, 1))
small_net['W'][1] = (1/np.sqrt(m))*rng.standard_normal(size = (10, m))
small_net['b'][1] = np.zeros((10, 1))
X_small = trainX[0:d_small, 0:n_small]
Y_small = trainY[:, 0:n_small]
fp_data = ApplyNetwork(X_small, small_net)
my_grads = BackwardPass(X_small, Y_small, fp_data, small_net, lam)
torch_grads = ComputeGradsWithTorch(X_small, train_y[0:n_small], small_net)
After you have computed the gradients (with no regularization) via your
code and PyTorch you should check they have produced the same output
and follow the guidelines described in the first assignment. You can add in
L2 regularization when you are convinced all the loss gradients are correct
in both your code and the PyTorch code.
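One simple way of doing this comparison (assuming, as a guess, that ComputeGradsWithTorch returns its gradients in the same {'W': [...], 'b': [...]} structure as the BackwardPass sketch above; adjust the indexing if the skeleton code uses a different layout) is to print the maximum relative error per parameter:

for name in ('W', 'b'):
    for k in range(2):
        num = np.abs(my_grads[name][k] - torch_grads[name][k])
        denom = np.maximum(1e-10, np.abs(my_grads[name][k]) + np.abs(torch_grads[name][k]))
        print(f"{name}{k+1}: max relative error = {np.max(num / denom):.2e}")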
Once you have convinced yourself that your analytic gradient computations
are correct then you can move forward with the following sanity check.
Try and train your network on a small amount of the training data (say
100 examples) with regularization turned off (lam=0) and check if you can
overfit to the training data and get a very low loss on the training data after
training for a sufficient number of epochs (∼200) and with a reasonable η.
Being able to achieve this indicates that your gradient computations and
mini-batch gradient descent algorithm are okay.
Up until now you have trained your networks with vanilla mini-batch gra-
dient descent. To help speed up training times and avoid time-consuming
searches for good values of η you will now implement mini-batch gradient
descent training where the learning rate at each update step is defined by
equations (14) and (15) and where you have set eta_min = 1e-5, eta_max
= 1e-1 and n_s=500 and the batch size to 100. To help you debug, fig-
ure 3 shows the training and validation loss/cost I achieved when lam=.01
and I ran training for one cycle that is from t=1 until t=2*n_s. Once you
have convinced yourself that you have a bug free implementation of the
cyclic scheduled learning rate then you are ready to somewhat optimize the
performance of your network.
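For orientation only, one possible shape of this training loop (mini-batches taken in order without shuffling for brevity; ApplyNetwork, BackwardPass and cyclic_eta refer to the sketches above, and the assumed structure of the returned gradients mirrors net_params):

n = trainX.shape[1]                        # e.g. 10000 when training on data_batch_1
n_batch, n_s = 100, 500
eta_min, eta_max, lam = 1e-5, 1e-1, 0.01
n_epochs = (2 * n_s) // (n // n_batch)     # one full cycle = 2*n_s update steps = 10 epochs here
t = 0
for epoch in range(n_epochs):
    for j in range(n // n_batch):
        j_start = j * n_batch
        Xb = trainX[:, j_start:j_start + n_batch]
        Yb = trainY[:, j_start:j_start + n_batch]
        fp_data = ApplyNetwork(Xb, net_params)
        grads = BackwardPass(Xb, Yb, fp_data, net_params, lam)
        eta_t = cyclic_eta(t, eta_min, eta_max, n_s)
        for k in range(2):
            net_params['W'][k] -= eta_t * grads['W'][k]
            net_params['b'][k] -= eta_t * grads['b'][k]
        t += 1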
Figure 3: Training curves (cost, loss, accuracy) for one cycle of train-
ing. In this example one batch of the training data is used. The hyper-parameter
settings of the training algorithm are eta_min = 1e-5, eta_max = 1e-1, lam=.01
and n_s=500. The last parameter setting implies, as the batch size is 100, one
full cycle corresponds to 10 epochs of training. In this simple example only one
cycle of training is performed, but already at the end of training a test accuracy of
46.29% is achieved. Please note my curves are relatively smooth because I plot the
loss/cost/accuracy scores 10 times per cycle as opposed to plotting these quantities
at every update step.
Now you should run your training algorithm for more cycles (say 3) and
for a larger n_s=800. For reference the performance curves I obtained with
these parameter settings are shown in figure 4. I measured my performance
on the whole training and validation set 9 times per cycle. At the moment
you have not optimized the value of the regularization term lam at all.
Figure 4: Training curves (cost, loss, accuracy) for three cycles of train-
ing. In this example one batch of the training data is used. The hyper-parameter
settings of the training algorithm are eta_min = 1e-5, eta_max = 1e-1, lam=.01
and n_s=800. In the loss and accuracy plots you can clearly see how the loss and
accuracy vary as ηt varies. After the three cycles of training a test accuracy of
48.11% is achieved.
Coarse-to-fine random search to set lam. At this point you may need
to restructure/re-organize your code so that you can cleanly and easily call
function(s) to initialize your network, perform training and then check the
learnt network’s best performance on the validation set. To perform your
random search you’ll need to train your network from a random initializa-
tion and measure its performance (via the accuracy on the validation set)
multiple times as the hyper-parameter lam varies. You should first perform
a coarse search over a very broad range of values for lam. To perform this
search you should use most of the training data available and the rest for
the validation. This is because more data and increasing the value of lam
are both forms of regularization. When you have less training data you will
need a higher lam and when you have more training data you will need a
lower lam. Thus you should perform your search using the same ballpark
amount of data that you will use when you train your final network. Thus
for this part of the assignment you should load all 5 training batches and
use all for training except for 5000 images that should be used as your val-
idation set. When you train each network you should only run 2 cycles (1
cycle could also potentially work) of training and you should set n_s = 2 *
np.floor(n / n_batch) to get a good idea of performance for a given lam.
Search for lam on a log scale; for example, to generate one random sample
for lam, draw it uniformly on a log scale in the range from 10^l_min to 10^l_max.
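A minimal sketch of generating such a sample (assuming rng = np.random.default_rng() as in the earlier initialization code):

l = l_min + (l_max - l_min) * rng.uniform()   # exponent sampled uniformly in [l_min, l_max]
lam = 10 ** l                                 # lam is then uniformly distributed on a log scale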
In my experiments for the coarse search I set l_min=-5 and l_max=-1 and
actually used a uniform grid with eight different values. Save all the pa-
rameter settings tried and their resulting best scores on the validation set
to a file. Inspect this file after finishing the coarse search and see what
parameter ranges gave the best results. Next perform a random search but
with your search adjusted to a narrower range focused on the good settings
found in the coarse search and possibly run training for a few more cycles
than before. Once again save the results and look for the best parameter
settings. You could do another round of random search or just use the best
lam found, and then train the network using most of the training data, for
more cycles and for a larger n_s and see what final performance you get on
the test set. You should be getting performances of >50% for your good
settings given ≥ 2 cycles of training (or even just one cycle of training). For
reference I was able to train networks that achieved test accuracies ∼ 52%
without an overly exhaustive search.
For Assignment 2 I will award at most 5 bonus points.
(a) Explore whether having significantly more hidden nodes improves the final
classification rate. One would expect that with more hidden nodes the
amount of regularization would have to increase.
(b) Apply dropout to your training if you have a high number of hidden nodes
and you feel you need more regularization.
(c) Apply data augmentation during training - random mirroring as described in
the bonus part of the assignment and also random translations. As a hint, if you
want to translate your image xx by positive integer translations tx and ty, then
you can do it by computing the following indices (conceptually not hard but a
little tricky to get right):
aa = np.arange(32).reshape((32, 1))
vv = np.tile(32*aa, (1, 32-tx))
bb1 = np.arange(tx, 32, 1).reshape((32-tx, 1))
bb2 = np.arange(0, 32-tx, 1).reshape((32-tx, 1))
ind_fill = vv.reshape((32*(32-tx), 1)) + np.tile(bb1, (32, 1))
ii = np.transpose(np.nonzero(ind_fill > ty*32+1))
ind_fill = ind_fill[ii[0, 0]:]
ind_xx = vv.reshape((32*(32-tx), 1)) + np.tile(bb2, (32, 1))
ii = np.transpose(np.nonzero(ind_xx < 1024-ty*32))
ind_xx = ind_xx[0:ii[-1, 0]+1]
inds_fill = np.vstack((ind_fill, 1024+ind_fill))
inds_fill = np.vstack((inds_fill, 2048+ind_fill))
inds_xx = np.vstack((ind_xx, 1024+ind_xx))
inds_xx = np.vstack((inds_xx, 2048+ind_xx))
and applying them to produce the shifted image:

xx_shifted[inds_fill] = xx[inds_xx]
Note the code above has to be changed a bit if either tx or ty is negative.
So please visualize your before and after images to ensure there are no bugs.
Even if you pre-compute the indices for all the different (tx, ty) pairs given
-3 <= tx <= 3 and -3 <= ty <= 3, there will be a slow-down in your training
if you apply a shift randomly to every image in your batch, as doing so
accesses different parts of memory. So perhaps use this augmentation
somewhat judiciously to avoid too much of a slow-down.
(d) In the basic assignment SGD with cyclical learning rates was the optimizer
used. Using Nesterov momentum with a straightforward linear decay of the
learning rate could perform just as well or better given a fixed number of
update steps; a minimal sketch of such an update step is given below.
(See Budgeted Training: Rethinking Deep Neural Network Training Under
Resource Constraints, M. Li, E. Yumer, and D. Ramanan, ICLR 2020 for
evidence. Their default implementation has: base learning rate 0.1, momentum
0.9, weight decay 0.0005 and a batch size of 128, but this is for a ResNet
as opposed to a fully connected network.)
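As a starting point for option (d), here is a minimal sketch of one such update step. It uses the look-ahead (Sutskever) form of Nesterov momentum together with the ApplyNetwork/BackwardPass sketches from earlier; the helper name, the velocity container and the linear-decay formula in the comment are illustrative assumptions rather than a prescribed implementation:

import numpy as np

def nesterov_update(net_params, velocities, Xb, Yb, lam, eta_t, mu=0.9):
    # evaluate the gradient at the look-ahead point theta + mu * v
    look_ahead = {name: [p + mu * v for p, v in zip(net_params[name], velocities[name])]
                  for name in ('W', 'b')}
    grads = BackwardPass(Xb, Yb, ApplyNetwork(Xb, look_ahead), look_ahead, lam)
    for name in ('W', 'b'):
        for k in range(2):
            velocities[name][k] = mu * velocities[name][k] - eta_t * grads[name][k]
            net_params[name][k] += velocities[name][k]

# velocities are initialized to zero, one per parameter
velocities = {name: [np.zeros_like(p) for p in net_params[name]] for name in ('W', 'b')}
# linear decay over a fixed budget of T update steps: eta_t = eta_0 * (1 - t / T)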
To get the bonus point(s) you must upload the following to the Canvas
assignment page Assignment 2 Bonus Points:
1. Your code.
2. A PDF document which
- reports on your trained network with the best test accuracy, what
improvements you made and which ones brought the largest gains
(if any!). (Exercise 5.1)
- summarizes the training and search you completed and the final
test accuracies you achieved. (Exercise 5.2)
Remember that you can get at most 5 points for Assignment 2.
References
[Smith, 2015] Smith, L. N. (2015). Cyclical learning rates for training neural
networks. arXiv:1506.01186 [cs.CV].