Assignment 3 – Deep Learning
In this assignment you will train and test a three-layer network with multiple
outputs to classify images from the CIFAR-10 dataset. The first layer will
be a convolution layer applied with a stride equal to the width of the filter.
In the parlance of computer vision this corresponds to a patchify layer, see
figure 1, and this is the first layer that is applied in the Vision Transformer
and other architectures such as MLPMixer and ConvMixer. You will train
the network using mini-batch gradient descent applied to a cost function
computing the cross-entropy loss of the classifier applied to the labelled
training data and an L2 regularization term on the weight matrix.
Figure 1: To patchify an input image, split it into a regular grid of non-overlapping sub-
regions. For Vision Transformers the pixel data in each patch is flattened into a vector,
transformed with an affine transformation and this output vector becomes an input to
a Transformer network. The patchify operation is just a convolution applied with stride
equal to the width of the filter.
The overall structure of your code for this assignment should mimic that
from the previous assignments. You will have slightly different parameters
than before and you will have to change the functions that 1) evaluate the
network (the forward pass) and 2) compute the gradients (the backward
pass). There is some work to do to get an efficient implementation of the
first convolutional layer to allow reasonable training times without a GPU!
But the reward will be increased performance over Assignment 2 (don’t
get your expectations too high though).
The network function you will apply to the input X is:
H_i = max(0, X ∗ F_i)   for i = 1, . . . , n_f                     (1)
h = [ vec(H_1); vec(H_2); . . . ; vec(H_{n_f}) ]   (stacked into a single column vector)   (2)
x_1 = max(0, W1 h + b1)                                            (3)
s = W2 x_1 + b2                                                    (4)
p = SoftMax(s)                                                     (5)
where
• The predicted class corresponds to the label with the highest proba-
bility:
k^∗ = arg max_{1≤k≤K} {p_1, . . . , p_K}                           (7)
In the lecture notes I have the convention that the operator vec(·) makes
a vector by traversing the elements of the matrix row by row from left to
right. This simple example illustrates:
H = [ H_11  H_12 ]
    [ H_21  H_22 ]   =⇒   vec(H) = ( H_11, H_12, H_21, H_22 )^T
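In numpy this row-by-row flattening is just a C-order reshape; a minimal illustration:

import numpy as np

H = np.array([[1., 2.],
              [3., 4.]])
vec_H = H.reshape(-1, order='C')   # row by row, left to right: [1., 2., 3., 4.]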
I have omitted all bias terms in the network for the sake of clarity and
simplicity of implementation.
H = X ∗ F   =⇒   h = M_X vec(F)                                    (9)
and the definition of F is extended from equation (8) similarly. Now when
we convolve X with F (stride 2 and no zero-padding) the output still has
size 2 × 2. Once again we can write this convolution as a matrix
multiplication, but in this case:
M_X = [ X_111 X_121 X_211 X_221 X_112 X_122 X_212 X_222 ]
      [ X_131 X_141 X_231 X_241 X_132 X_142 X_232 X_242 ]
      [ X_311 X_321 X_411 X_421 X_312 X_322 X_412 X_422 ]          (12)
      [ X_331 X_341 X_431 X_441 X_332 X_342 X_432 X_442 ]
and
vec(F) = ( F_111 F_121 F_211 F_221 F_112 F_122 F_212 F_222 )^T     (13)
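To make the construction concrete, here is a small numerical check (a sketch, not part of the assignment code) that h = M_X vec(F) reproduces the stride-2, no-padding patch-wise dot products for a random 4 × 4 × 2 input and a 2 × 2 × 2 filter:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4, 2))
F = rng.standard_normal((2, 2, 2))

# Direct computation: dot product of the filter with each non-overlapping 2x2 patch.
H = np.zeros((2, 2))
for a in range(2):
    for b in range(2):
        H[a, b] = np.sum(X[2*a:2*a+2, 2*b:2*b+2, :] * F)

def vec3(A):
    # element order of equation (13): channel slowest, then row, then column
    return np.transpose(A, (2, 0, 1)).reshape(-1)

# M_X as in equation (12): one row per patch, ordered top-left, top-right,
# bottom-left, bottom-right.
MX = np.stack([vec3(X[2*a:2*a+2, 2*b:2*b+2, :])
               for a in range(2) for b in range(2)])

# vec(H) traverses H row by row, so compare against H.reshape(-1).
assert np.allclose(MX @ vec3(F), H.reshape(-1))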
where g is the vector of size 4 × 1 containing the gradient of the loss relative
to the vector h. Here we have applied the convention that the gradient of the loss
w.r.t. a vector is a column vector. Apologies for the slight inconsistencies
w.r.t. the lecture notes. If there is a batch of input images then
∂L/∂vec(F) = (1/|B|) Σ_{(X,y)∈B} M_X^T g_y                          (15)
Returning to consider our simple example, we now review the case when
we apply 3 filters F_1, F_2 and F_3 to each patch. These three filters can be
applied efficiently to the region patches defined in M_X by concatenating the
flattened versions of each F_i into an 8 × 3 matrix F_all, that is
H = M_X F_all                                                       (16)
where H is 4 × 3 and
F_all = ( vec(F_1), vec(F_2), vec(F_3) )                            (17)
∂L/∂F_all = M_X^T G                                                 (18)
where G is the 4 × 3 matrix constructed by concatenating the g_i’s
G = ( g_1, g_2, g_3 )                                               (19)
As per usual let L(·, ·) represent the loss for the mini-batch of data B
L(B, Θ) = (1/|B|) Σ_{(X,y)∈B} l_cross(y, f_network(X, Θ))           (20)
where l_cross(·) is the usual cross-entropy loss and f_network represents the
function corresponding to our network. To learn the parameters Θ =
{F_all, W1, W2} we need to compute the gradient of the batch loss w.r.t.
each parameter.
Given the probabilistic outputs of the network for all the images in the mini-
batch, then the gradient of the batch loss w.r.t. W1 and W2 is computed as
in Assignment 2. The novel quantity to be computed is the gradient w.r.t.
F_all. Given the expression of the gradient for one input in equation (18),
the gradient for the batch is:
∂L/∂F_all = (1/|B|) Σ_{(X,y)∈B} M_X^T G_y                           (21)
where we have used the subscript y for G_y to denote the 2D array (n_p × nf)
of gradients, as defined in equation (19), for a particular input. In code you
will store the “M_X” representations of each input in a 3D array M of size
n_p × 3f² × n and the gradients for the H nodes for each input in another
3D array G of size n_p × nf × n. Then
F_all^grad = (1/n) Σ_{i=1}^{n} M(:, :, i)^T G(:, :, i)              (22)
where F_all^grad has size 3f² × nf and contains the gradient of the loss w.r.t. all
the convolution filters.
As you begin to fit larger models you need to apply more regularization to
allow longer training runs without over-fitting. These longer runs are re-
quired to train the best possible version of your network! In this training
scenario a simple approach, complementary to other regularization techniques
such as data augmentation and weight decay, is label smoothing. Label
smoothing works as follows. Assume you have a labelled training example
(X, y) where y ∈ {1, . . . , K}. Let y represent the one-hot encoding of label
y. This is the usual target output for cross-entropy training. However, this
target vector does not necessarily need to be a one-hot encoded vector. One
can instead spread some of the probability from the ground truth class to
the other classes. Let ϵ be a small number in [0, 1). The ith entry of the
label-smoothed target vector y_smooth is defined by
y_smooth,i = { 1 − ϵ          if i = y
             { ϵ/(K − 1)      otherwise                             (23)
Typically we set ϵ = 0.1. Once you have defined y_smooth then you only have
to make a minimal change in your training algorithm. In the backward pass,
when you propagate the gradient through the cross-entropy loss and softmax
operations, you should replace the one-hot target y with y_smooth (so the
gradient p − y w.r.t. the scores becomes p − y_smooth).
Thus label smoothing also has the benefit that it has minimal computational
overhead per update iteration. Though, of course, as with other regularization
techniques, when it is applied longer training is required as you have made
the training task harder.
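A minimal sketch of the smoothing itself, assuming the targets are stored as a K × n one-hot matrix Y (exactly where the smoothed targets enter the backward pass depends on the conventions of your Assignment 2 code):

import numpy as np

def smooth_labels(Y, eps=0.1):
    # Apply equation (23) column-wise to a K x n one-hot target matrix Y.
    K = Y.shape[0]
    return (1 - eps) * Y + (eps / (K - 1)) * (1 - Y)

# In the backward pass, wherever the one-hot targets enter the gradient of the
# softmax + cross-entropy (e.g. P - Y), use smooth_labels(Y) instead of Y.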
In Assignment 2 you trained with cyclical learning rates. For most of the
experiments you will use this optimizer again for this assignment. However,
you will apply a small upgrade to this algorithm for longer training runs to
help with efficiency; that is, we will have shorter cycles at the start of training.
The upgrade is that the number of update steps per cycle is doubled after
each cycle: if n_{i,s} is the number of steps for the ith cycle, then
n_{i+1,s} = 2 n_{i,s}, i.e. n_{i,s} = 2^{i−1} n_{1,s}.
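A minimal sketch of the resulting schedule (the names are illustrative; how step_1 maps onto the step-size parameter of your Assignment 2 optimizer is up to your implementation):

def steps_per_cycle(step_1, n_cycles):
    # Number of update steps in each cycle when the count doubles every cycle.
    return [step_1 * 2**i for i in range(n_cycles)]

# e.g. steps_per_cycle(800, 3) -> [800, 1600, 3200]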
For this assignment to be a success you must have a bug free and fast im-
plementation of the convolution applied at the first layer. To ensure this
happens you will write slow but straightforward code as your first imple-
mentation of the convolutional layer - a dot-product between the convolu-
tion filter and each non-overlapping sub-patch of the input image. I have
provided debugging information on the Canvas assignment page to ensure a
bug-free implementation. Read in the file debug_info.npz and extract the
image data X and the convolution filter Fs:
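For example (a sketch; the key names 'X' and 'Fs' are an assumption, so inspect load_data.files if they differ):

import numpy as np

load_data = np.load('debug_info.npz')
X = load_data['X']     # flattened image data, 3072 x n
Fs = load_data['Fs']   # convolution filters, f x f x 3 x nf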
X contains the image data from 5 CIFAR-10 images and has size 3072×n where
n=5, and Fs has size f×f×3×nf where f=4 and nf=2. The images in X have
been flattened; to convert them back into images of size 32×32×3×n apply
X_ims = np.transpose(X.reshape((32, 32, 3, n), order='F'), (1, 0, 2, 3))
I visited the sub-patches in the order where the second dimension of X_ims[:,
:, :, i] changes first and then the first dimension. This is to be consistent
with how I flatten load_data['conv_outputs']. Importantly, MX just needs
to be computed once at the beginning of training and stored. During training
MX will be accessed continually, but it does not change: it is completely
defined by the input data and the size and stride of the convolution filters,
and these quantities do not change.
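As a starting point, a minimal sketch (not the provided code) of building MX with the naive patch loop is given below; the within-patch flattening order is chosen to match Fs_flat = Fs.reshape((f*f*3, nf), order='C') used later, so verify the resulting products MX[:, :, i] @ Fs_flat against the provided debug outputs before trusting it:

import numpy as np

n_p = (32 // f) * (32 // f)                  # number of patches per image
MX = np.zeros((n_p, f * f * 3, n))
for i in range(n):
    p = 0
    for r in range(0, 32, f):                # first dimension, changes slowest
        for c in range(0, 32, f):            # second dimension, changes fastest
            patch = X_ims[r:r + f, c:c + f, :, i]        # f x f x 3 sub-patch
            MX[p, :, i] = patch.reshape(-1, order='C')
            p += 1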
You are almost ready to compute the convolution as a matrix multiplication.
First, though, you have to flatten each filter via:
Fs_flat = Fs.reshape((f*f*3, nf), order='C')
where Fs_flat has size f*f*3 × nf. The convolutions can now be computed
for each image as a matrix multiplication, as in equation (9):
for i in range(n):
    conv_outputs_mat[:, :, i] = np.matmul(MX[:, :, i], Fs_flat)
where conv_outputs_mat, storing the result, has size n_p × nf × n. You
should compare this output to that from your dot-product implementation.
First, though, you have to reshape conv_outputs to have the same shape
as conv_outputs_mat, that is
conv_outputs_flat = conv_outputs.reshape((n_p, nf, n), order='C')
The last step to make the convolution computations even faster is to remove
the for loop over images using an Einstein summation:
conv_outputs_mat = np.einsum('ijn,jl->iln', MX, Fs_flat, optimize=True)
You should check this produces the same output as before. The moment it
does you can feel like a deep learning guru! Using einsum is hardcore! It
is okay if you do not fully understand the workings of the einsum operator,
to be honest I just did some pattern matching and interpolated to our use
case, but more details are available at the numpy einsum help page. For
our test case MX is small and only contains data from 5 images so there is
probably no speed up using the Einstein summation. But during training MX
will contain data from a training batch of ∼ 100 images and on my laptop
(M1 MacBook Pro) the einsum implementation gives a more than 3× speed-up
over the for-loop implementation.
With the fast convolution computations in place, you are ready to tackle
writing the code for the bare bones of the back-prop algorithm. First you
have to write the forward pass and then the backward pass. Initially, you will
check your gradient computations using debugging information provided.
But after clearing these checks you will clean up your code, add bias terms
to your network function and use pytorch to do a more thorough job of
debugging your gradient calculations.
Forward Pass You are ready to write the forward pass of the back-prop
algorithm for the network described in equations (1 - 5). The function you
write should have as input the MX representation of the input data and the
parameters of your network. These parameters are the convolution filters of
the first layer and the weight matrices for the two subsequent layers. You
have written the code to apply the first convolutional layer. Then you have
to apply the ReLu activation function to the output of the convolution and
then reshape the output array so you can apply the first fully connected
layer. This corresponds to:
conv_flat = np.fmax(conv_outputs_mat.reshape((n_p*nf, n), order='C'), 0)
(Note I don’t think this code vectorizes the output of the convolution in the
same order as equation (13). This does not matter as the critical issue is
that the ordering is consistent within your code and this is why I’m being
very explicit about this in the code description.) Given conv flat you can
proceed with the fully connected layers of the network as in Assignment 2 to
finally produce the probability vector for each input image. The function you
write should return the information needed for the backward pass. These
required quantities are the final probabilities and the intermediary outputs
at each layer (i.e. those corresponding to equations (2), (3) and (5)). To help
you debug your code the debugging file you loaded previously, which had
the parameters for the convolution filters, also contains the arrays for the
rest of the network’s parameters:
load_data['W1'] (size: nh × n_p*nf),
load_data['W2'] (size: 10 × nh),
load_data['b1'] (size: nh × 1),
load_data['b2'] (size: 10 × 1)
and the values for the arrays your forward pass should return:
load_data['conv_flat'] (size: n_p*nf × n),
load_data['X1'] (size: nh × n),
load_data['P'] (size: 10 × n)
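Putting the pieces together, the forward pass could be organised as in the sketch below (the function signature and return values are illustrative choices, with the bias terms included since the debug data contains them; adapt it to however you store your parameters):

import numpy as np

def softmax(S):
    # Numerically stable column-wise softmax.
    S = S - S.max(axis=0, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=0, keepdims=True)

def forward_pass(MX, Fs_flat, W1, W2, b1, b2):
    # A sketch of equations (1)-(5): patchify convolution, ReLU, reshape,
    # then two fully connected layers.
    n = MX.shape[2]
    conv_outputs_mat = np.einsum('ijn,jl->iln', MX, Fs_flat, optimize=True)
    conv_flat = np.fmax(conv_outputs_mat.reshape((-1, n), order='C'), 0)
    X1 = np.fmax(W1 @ conv_flat + b1, 0)
    P = softmax(W2 @ X1 + b2)
    return conv_flat, X1, P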
Backward Pass Next up is the backward pass code. The target labels
for the given debug data are in load_data['Y'] (10 × n). Propagating the
gradient from the loss node through the fully-connected layers and to the
weight matrices W2 and W1 is the same as in Assignment 2 though, of course,
the input to the first fully-connected layer is the output of the convolutional
layer as opposed to the original input data. Let G_batch be the n_p*nf × n
array containing the gradient information after back-prop to the h node. At
this stage in the forward-pass we had just performed a flattening/reshaping
operation. We need to undo this operation w.r.t. G_batch and obtain an
array GG of size n_p × nf × n:
GG = G_batch.reshape((n_p, nf, n), order='C')
Given GG and MX you can now compute the sum in equation (22) and obtain
the gradient for Fs_flat. Write the code for this and you can check your
answer against load_data['grad_Fs_flat'].
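One way to write this is the explicit loop below (a sketch with illustrative variable names; equation (22) contains a 1/n factor, so drop the final division if that factor is already folded into GG by your loss gradient):

import numpy as np

grad_Fs_flat = np.zeros((MX.shape[1], GG.shape[1]))
for i in range(n):
    grad_Fs_flat += MX[:, :, i].T @ GG[:, :, i]
grad_Fs_flat /= n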
The last step to make your gradient computations fast is to remove the for
loop over images in equation (22) and replace with an Einstein summation:
MXt = np.transpose(MX, (1, 0, 2))
grad_Fs_flat = np.einsum('ijn,jln->il', MXt, GG, optimize=True)
Clean up your code! You have written the essential code for applying
the network function and computing the gradients. At this point you prob-
ably need to clean up your code and come up with good ways of storing
your network’s parameters etc., You will also need to write and/or upgrade
existing code
• to pre-compute, save and load the MX representation for all the training
data and test data,
Check gradient computations To double check your gradient compu-
tations are okay, you should upgrade the torch gradient code supplied for
Assignment 2, to compute the gradients for your new network. Here you
should emulate in torch the forward pass you have just written and also
compute the loss. Use the for loop implementation, equation (21), to apply
the convolutional filters to each input image, to avoid any nuisance complications
caused by differences in the calling conventions and the ordering of inputs and
outputs between torch’s and numpy’s versions of einsum. The total number of
training examples is large. Thus you should compare your gradients to the
torch gradients computed on just a small subset of training images, small nf
(< 10) and small nh (< 10). After completing this and checking for agreement
between the two sets of gradient computations you should:
After these upgrades you will need to check the gradient calculations again
to ensure you have a bug free implementation.
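A minimal sketch (with illustrative naming) of the kind of torch code that emulates the forward pass and loss; build torch tensors from a small subset of the data, create the parameter tensors with requires_grad=True, call loss.backward() and compare the resulting .grad attributes with your numpy gradients:

import torch

def torch_forward_and_loss(MX_t, Y_t, Fs_flat, W1, W2, b1, b2):
    # MX_t: n_p x 3f^2 x n tensor, Y_t: K x n one-hot targets,
    # Fs_flat, W1, W2, b1, b2: parameter tensors with requires_grad=True.
    n = MX_t.shape[2]
    cols = []
    for i in range(n):                               # for-loop over the images
        Hi = torch.relu(MX_t[:, :, i] @ Fs_flat)     # n_p x nf
        cols.append(Hi.reshape(-1))                  # flatten as in the numpy code
    conv_flat = torch.stack(cols, dim=1)             # (n_p*nf) x n
    X1 = torch.relu(W1 @ conv_flat + b1)
    P = torch.softmax(W2 @ X1 + b2, dim=0)
    loss = -torch.mean(torch.sum(Y_t * torch.log(P), dim=0))
    return loss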
You are now ready to start training your first ConvNets. For training we will
use the cyclic learning rates optimizer as in Assignment 2. Do, if possible,
re-use your code. The initial network you will train has architecture:
[Two-panel plot: Loss vs. update step (training loss, test loss) and Accuracy vs. update step (training accuracy, test accuracy).]
Figure 2: The training and test curves for the first ConvNet (f = 4, nf = 10, nh =
50). The final test accuracy achieved for this network is 57.61% when trained with
three cycles of cyclical learning rate and step=800. This accuracy probably exceeds
most if not all the models you trained in Assignment 1 and 2.
Train for longer The results you just obtained indicate working with
filters of size f=16 is going to be too slow and with f=2 we minimize the
effect of the convolutional layer. Therefore f ∈ {4, 8} is probably the best
option for our set-up. Now you should investigate the benefit of training
the networks with architecture 2 and 3 for longer. You should upgrade
the basic cyclic learning rate algorithm to the algorithm with increasing
cycle length introduced in background section 6. Set step_1 = 800, still
with n_cycles=3, and train the two models. Keep a record of the final test
accuracies achieved and the plots of the training and test loss.
As you can see training for more updates improves results! At least one
of the networks should have given final test performance of >60%. Hurrah
and well done! With longer training the bigger jump in performance was
for architecture 3. But was this caused by the bigger filter size, or by the
fact that architecture 3 is significantly wider, i.e. it applies more filters at the
first layer than architecture 2? Let’s get some indication of whether layer
width is a critical factor. For architecture 2, bump up the number of filters to
nf=40 and re-run the previous experiment. Keep a record of the final test
accuracies achieved and the plots of the training and test loss. As you can
see, increasing width helps a lot. There is lots of anecdotal and empirical
evidence that increasing the width of a neural network helps training and
capacity. If you are interested, check out the paper [Tan and Le, 2019], which
introduces EfficientNet and gives rule-of-thumb scaling laws for how best to
increase the width, depth and resolution of your network architecture if given
more resources.
As training with f=4 is generally quicker, we will now investigate how making
each layer wider increases the capacity and performance of your network.
But, of course, when you
begin to train larger models and have longer training times then over-fitting,
particularly in relation to the loss, is much more likely. And if you over-fit
quickly w.r.t. the loss then it is not possible to train for many update steps.
In this part of the assignment you will generate some results to help you get
a feel for the issue. But remember we are only barely scratching the surface
here. Consider the network:
Train this network with n_cycle=4 and step_1=800 and with L2 regulariza-
tion lam=.0025. (Note this run will take some time, ∼ 10 minutes for me.)
We reduce the L2 regularization so that the network over-fits more quickly
than it would with a higher L2 regularization. Plot the training and test
losses for this architecture, regularization and length of training. It should
be obvious that for this set-up the loss
has over-fit. To regularize your model, instead of applying more L2 regular-
ization, implement label smoothing as in background section 5. Re-run the
same experiment but with label smoothing. Keep a record of the loss plots
plus the final test accuracy achieved and report on the qualitative difference
of the loss plots for the two approaches.
To complete the assignment:
For Assignment 3 I will award at most 4 bonus points (the special bonus
points are not included in this calculation).
are lots of options to try to make gains! Make the network wider, use
data-augmentation, try to find the right balance between the amount
of L2 regularization and label smoothing, decay eta max in the cycli-
cal learning rate algorithm, concatenate the output from convolution
filters of multiple sizes (just a crazy idea), . . . From my not so ex-
tensive and slightly haphazard investigations the maximum I achieved
was 67.26%. The tricks I used are the ones I describe in the assignment!
But, in general, data-augmentation and making the network wider are
usually a good bet!
Bonus Points Available: 4 points (if you complete at least 3 improve-
ments) - you can follow my suggestions, think of your own or some
combination of the two. I’ll also award an additional extra bonus
point if you get a test accuracy ≥68% and two if you achieve ≥70%.
Note for this assignment because training takes a relatively long time,
I’m okay with the test set becoming your validation set! Thus any
final result you get will be a slight over-estimate of the true test error.
Note for the project such behaviour will not be tolerated!
2. Compare the speed of your training to using pytorch Use the
torch function torch.nn.functional.conv2d to compute the convo-
lutions and its auto-diff computations for training where the computa-
tions are calculated on the CPU. Please compare the training timings
of your implementation versus the torch implementation for this net-
work across a variety of filter widths and numbers of filters. These tests
do not need to be exhaustive. We just want to get an indication of whether
your implementation is faster/slower than the in-built pytorch functions
for this particular network and whether the speed-up / slow-down is
correlated with the filter size etc. (a minimal conv2d timing sketch is
given after this list).
Bonus Points Available: 3 points
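A minimal sketch (illustrative, not required code) of computing the patchify layer with torch.nn.functional.conv2d for timing purposes; X_ims and Fs are the numpy arrays used earlier and f is the filter width:

import time
import numpy as np
import torch
import torch.nn.functional as F

# conv2d expects input of shape n x 3 x 32 x 32 and weights of shape nf x 3 x f x f.
X_t = torch.from_numpy(np.transpose(X_ims, (3, 2, 0, 1)).copy()).float()
W_t = torch.from_numpy(np.transpose(Fs, (3, 2, 0, 1)).copy()).float()

start = time.perf_counter()
out = F.conv2d(X_t, W_t, stride=f)   # stride equal to the filter width, no padding
elapsed = time.perf_counter() - start
print(f'torch conv2d: {elapsed:.4f} s, output shape {tuple(out.shape)}')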
To get the bonus point(s) you must upload the following to the Canvas
assignment page Assignment 3 Bonus Points:
1. Your code.
2. You can get at most 4 points for Assignment 3.
3. A pdf document which
- reports on your trained network with the best test accuracy, what
improvements you made and which ones brought the largest gains
(if any!). (Exercise 5.1)
- Summarizes the training times for pytorch Vs your training code
for several network architectures and the general conclusions you
can draw from these results. (Exercise 5.2)
References
[Tan and Le, 2019] Tan, M. and Le, Q. (2019). EfficientNet: Rethinking
model scaling for convolutional neural networks. In Proceedings of the
International Conference on Machine Learning (ICML).