
Course: DD2424 - Assignment 3

In this assignment you will train and test a three-layer network with multiple
outputs to classify images from the CIFAR-10 dataset. The first layer will
be a convolution layer applied with a stride equal to the width of the filter.
In the parlance of computer vision this corresponds to a patchify layer, see
figure 1, and this is the first layer that is applied in the Vision Transformer
and other architectures such as MLPMixer and ConvMixer. You will train
the network using mini-batch gradient descent applied to a cost function
computing the cross-entropy loss of the classifier applied to the labelled
training data and an L2 regularization term on the weight matrix.

Figure 1: To patchify an input image, split it into a regular grid of non-overlapping sub-
regions. For Vision Transformers the pixel data in each patch is flattened into a vector,
transformed with an affine transformation and this output vector becomes an input to
a Transformer network. The patchify operation is just a convolution applied with stride
equal to the width of the filter.

The overall structure of your code for this assignment should mimic that
from the previous assignments. You will have slightly different parameters
than before and you will have to change the functions that 1) evaluate the
network (the forward pass) and 2) compute the gradients (the backward
pass). There is some work to do to get an efficient implementation of the
first convolutional layer to allow reasonable training times without a GPU!
But the reward will be increased performance over Assignment 2 (don't
get your expectations too high though).

Background 1: Network with an initial patchify layer

Each input image X in its original form is a 3D array of size 32 × 32 × 3.

The network function you will apply to the input X is:
    H_i = max(0, X * F_i)    for i = 1, ..., nf                          (1)

    h = [vec(H_1); vec(H_2); ... ; vec(H_nf)]                            (2)

    x_1 = max(0, W_1 h + b_1)                                            (3)

    s = W_2 x_1 + b_2                                                    (4)

    p = SoftMax(s)                                                       (5)
where

• The input image X has size 32 × 32 × 3.

• Each filter F_i has size f × f × 3 and is applied with stride f and no
  zero-padding. The possible values are f ∈ {2, 4, 8, 16}. We denote the
  set of layer-1 filters as F = {F_1, ..., F_nf}.

• The output of each convolution H_i has size 32/f × 32/f × 1.

• The H_i's are each flattened with the vec(·) operation and concatenated,
  so h has size nf·np × 1 where np = (32/f)^2 is the number of
  sub-patches to which each filter is applied.

• Two successive fully connected layers are then applied, where W_1 has
  size d × d_0 with d_0 = nf·np and W_2 has size K × d. The last weight
  matrix implies s has size K × 1.

• SoftMax is defined as

    SoftMax(s) = exp(s) / (1^T exp(s))                                   (6)

• The predicted class corresponds to the label with the highest probability:

    k* = argmax_{1 ≤ k ≤ K} p_k                                          (7)

In the lecture notes I have the convention that the operator vec(·) makes
a vector by traversing the elements of the matrix row by row from left to
right. This simple example illustrates:
 
    H = | H11  H12 |    =>    vec(H) = (H11, H12, H21, H22)^T
        | H21  H22 |
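As a minimal numpy illustration of this row-major flattening convention (the values in H are arbitrary, for illustration only):

    import numpy as np

    H = np.array([[1.0, 2.0],
                  [3.0, 4.0]])            # H11=1, H12=2, H21=3, H22=4
    vec_H = H.reshape(-1, order='C')      # row-by-row flattening: [1, 2, 3, 4]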

I have omitted all bias terms in the network for the sake of clarity and
simplicity of implementation.

Background 2: Writing the convolution as a matrix multiplication

To make the back-propagation algorithm transparent and relatively efficient
(as we will be running this on a CPU, not a GPU, and want to take computational
advantage of fast matrix operations) we will set up the convolutions as matrix
multiplications.
Let's consider this simple example where X is a 4 × 4 input matrix and F
is a 2 × 2 filter:
 
    X = | X11  X12  X13  X14 |          F = | F11  F12 |
        | X21  X22  X23  X24 |              | F21  F22 |                 (8)
        | X31  X32  X33  X34 |
        | X41  X42  X43  X44 |

When we convolve X with F (stride 2 and no zero-padding) the output
has size 2 × 2. We can write this convolution as a matrix multiplication:

    H = X * F    =>    h = M_X vec(F)                                    (9)

where M_X is the matrix of size 4 × 4


 
    M_X = | X11  X12  X21  X22 |
          | X13  X14  X23  X24 |                                         (10)
          | X31  X32  X41  X42 |
          | X33  X34  X43  X44 |

The rows of M_X correspond to all the sub-blocks of X that are involved
in a dot product with F when the convolution with stride f and no zero-padding
is applied.
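For concreteness, here is a minimal numpy sketch of this single-channel toy example, checking that the matrix-multiplication form in equation (9) reproduces the naive stride-2 convolution (the array values and helper variables are my own, for illustration only):

    import numpy as np

    # Toy 4x4 input and 2x2 filter (arbitrary values)
    X = np.arange(16, dtype=np.float64).reshape(4, 4)
    F = np.array([[1.0, -1.0],
                  [0.5,  2.0]])
    f = 2

    # Naive stride-f convolution (no zero-padding): one dot product per sub-block
    H = np.zeros((4 // f, 4 // f))
    for r in range(0, 4, f):
        for c in range(0, 4, f):
            H[r // f, c // f] = np.sum(X[r:r+f, c:c+f] * F)

    # The same convolution written as a matrix multiplication, as in equation (9)
    MX = np.zeros((4, f * f))
    l = 0
    for r in range(0, 4, f):
        for c in range(0, 4, f):
            MX[l, :] = X[r:r+f, c:c+f].reshape(-1, order='C')   # one row per sub-block
            l += 1
    h = MX @ F.reshape(-1, order='C')

    assert np.allclose(h, H.reshape(-1, order='C'))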

Multiple Channels  The input image has multiple channels corresponding
to the red, green and blue channels of an image. We will extend the
tutorial example to the case where X is a 3D array. Let the input array
X have size 4 × 4 × 2 and the filter F have size 2 × 2 × 2. Then:
   
    X(:, :, 1) = | X111  X121  X131  X141 |     X(:, :, 2) = | X112  X122  X132  X142 |
                 | X211  X221  X231  X241 |                  | X212  X222  X232  X242 |    (11)
                 | X311  X321  X331  X341 |                  | X312  X322  X332  X342 |
                 | X411  X421  X431  X441 |                  | X412  X422  X432  X442 |

and the definition of F is extended from equation (8) similarly. Now when
we convolve X with F (stride 2 and no zero-padding) the output still has
size 2 × 2. Once again we can write this convolution as a matrix multiplication,
but in this case:
 
    M_X = | X111  X121  X211  X221  X112  X122  X212  X222 |
          | X131  X141  X231  X241  X132  X142  X232  X242 |             (12)
          | X311  X321  X411  X421  X312  X322  X412  X422 |
          | X331  X341  X431  X441  X332  X342  X432  X442 |

and

    vec(F) = (F111, F121, F211, F221, F112, F122, F212, F222)^T          (13)

Note here we have flattened F channel by channel. You can, of course,


flatten F in any order you like but you have to ensure that MX is defined
to be consistent with this ordering!
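A minimal sketch of the corresponding construction in the two-channel case, flattening channel by channel so that M_X stays consistent with equations (12) and (13) (the array values and the helper function are illustrative assumptions):

    import numpy as np

    # Toy 4x4x2 input and 2x2x2 filter (arbitrary values)
    X = np.arange(32, dtype=np.float64).reshape(4, 4, 2)
    F = np.arange(8, dtype=np.float64).reshape(2, 2, 2)
    f = 2

    def flatten_channelwise(A):
        # channel by channel, row-major within each channel, as in equations (12)-(13)
        return np.concatenate([A[:, :, c].reshape(-1, order='C') for c in range(A.shape[2])])

    MX = np.zeros((4, f * f * 2))
    l = 0
    for r in range(0, 4, f):
        for c in range(0, 4, f):
            MX[l, :] = flatten_channelwise(X[r:r+f, c:c+f, :])
            l += 1

    h = MX @ flatten_channelwise(F)   # the 4 outputs of the stride-2 convolution of X with F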

Propagating the gradient to F  To back-propagate the gradient to the filter
F, since we have written the convolution as a matrix multiplication, you can
compute

    ∂L/∂vec(F) = M_X^T g                                                 (14)

where g is the vector of size 4 × 1 containing the gradient of the loss relative
to the vector h. Here we have applied the convention that the gradient of the
loss w.r.t. a vector is a column vector. Apologies for the slight inconsistencies
w.r.t. the lecture notes. If there is a batch of input images then

    ∂L/∂vec(F) = (1/|B|) Σ_{(X,y) ∈ B} M_X^T g_y                         (15)

where L is the average cross-entropy loss for the batch.
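A minimal sketch of equations (14) and (15) with placeholder arrays (the sizes and values are arbitrary, for illustration only):

    import numpy as np

    # Placeholder per-image M_X matrices and gradients w.r.t. h for a batch of 3 images
    MX_batch = [np.random.randn(4, 4) for _ in range(3)]   # each M_X is n_p x f*f
    g_batch = [np.random.randn(4) for _ in range(3)]       # each g is n_p x 1

    grad_one = MX_batch[0].T @ g_batch[0]                                    # equation (14)
    grad_batch = np.mean([M.T @ g for M, g in zip(MX_batch, g_batch)], axis=0)  # equation (15)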

Background 3: Apply multiple convolution filters

Returning to consider our simple example, we now review the case when
we apply 3 filters F1 , F2 and F3 to each patch. These three filters can be
applied efficiently to the region patches defined in MX by concatenating the
flattened versions of each Fi into an 8 × 3 matrix Fall that is

H = MX Fall (16)

where H is 4 × 3 and

    F_all = [vec(F_1), vec(F_2), vec(F_3)]                               (17)

In this case the gradient of the loss w.r.t. F_all is

    ∂L/∂F_all = M_X^T G                                                  (18)

where G is the 4 × 3 matrix constructed by concatenating the g_i's

    G = [g_1, g_2, g_3]                                                  (19)

and M_X is the same 4 × 8 matrix defined in equation (12).
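A minimal sketch of equations (16)-(19) with placeholder arrays (sizes chosen to match the toy example):

    import numpy as np

    n_p, flat_dim, n_filters = 4, 8, 3

    MX = np.random.randn(n_p, flat_dim)           # placeholder M_X from equation (12)
    Fall = np.random.randn(flat_dim, n_filters)   # columns are vec(F_1), vec(F_2), vec(F_3)

    H = MX @ Fall                                 # equation (16), shape (4, 3)

    G = np.random.randn(n_p, n_filters)           # placeholder gradients [g_1, g_2, g_3]
    grad_Fall = MX.T @ G                          # equation (18), shape (8, 3)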

Background 4: Equations for the back-propagation algorithm

As per usual let L(·, ·) represent the loss for the mini-batch of data B:

    L(B, Θ) = (1/|B|) Σ_{(X,y) ∈ B} l_cross(y, f_network(X, Θ))          (20)

where lcross (·) is the usual cross-entropy loss and fnetwork represents the
function corresponding to our network. To learn the parameters Θ =
{Fall , W1 , W2 } we need to compute the gradient of the batch loss w.r.t.
each parameter.
Given the probabilistic outputs of the network for all the images in the mini-
batch, then the gradient of the batch loss w.r.t. W1 and W2 is computed as
in Assignment 2. The novel quantity to be computed is the gradient w.r.t.
Fall . Given the expression of the gradient for one input in equation (18),
the gradient for the batch is:

    ∂L(B)/∂F_all = (1/|B|) Σ_{(X,y) ∈ B} ∂l_cross(y, f_network(X, Θ))/∂F_all
                 = (1/|B|) Σ_{(X,y) ∈ B} M_X^T G_y                       (21)

where we have used the subscript y on G_y to denote the 2D array (n_p × nf)
of gradients, as defined in equation (19), for a particular input. In code you
will store the "M_X" representations of each input in a 3D array M of size
n_p × 3f^2 × n and the gradients for the H nodes for each input in another
3D array G of size n_p × nf × n. Then

    F_all^grad = (1/n) Σ_{i=1}^{n} M(:, :, i)^T G(:, :, i)               (22)

where F_all^grad has size 3f^2 × nf and contains the gradient of the loss w.r.t. all
the convolution filters.
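A minimal sketch of the batch gradient in equation (22) with placeholder arrays (the sizes n_p, f, nf and n are chosen arbitrarily here):

    import numpy as np

    n_p, f, nf, n = 64, 4, 10, 5                     # arbitrary sizes for illustration
    M = np.random.randn(n_p, 3 * f * f, n)           # stacked M_X matrices, one per image
    G = np.random.randn(n_p, nf, n)                  # gradients w.r.t. the H nodes per image

    Fall_grad = np.zeros((3 * f * f, nf))
    for i in range(n):
        Fall_grad += M[:, :, i].T @ G[:, :, i]
    Fall_grad /= n                                   # equation (22)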

Background 5: Label smoothing, another form of regularization

As you begin to fit larger models you need to apply more regularization to
allow longer training runs without over-fitting. These longer runs are required
to train the best possible version of your network! In this training scenario a
simple approach, complementary to other regularization techniques such as
data augmentation and weight decay, is label smoothing. Label smoothing
works as follows. Assume you have a labelled training example (X, y) where
y ∈ {1, ..., K}. Let y represent the one-hot encoding of the label y. This is
the usual target output for cross-entropy training. However, this target vector
does not necessarily need to be a one-hot encoded vector. One can instead
spread some of the probability from the ground truth class to the other classes.
Let ϵ be a small number in [0, 1). The ith entry of the label-smoothed target
vector y_smooth is defined by

    y_smooth,i = 1 - ϵ           if i = y
                 ϵ/(K - 1)       otherwise                               (23)

Typically we set ϵ = 0.1. Once you have defined y_smooth then you only have
to make a minimal change in your training algorithm. In the backward pass,
when you propagate the gradient through the cross-entropy loss and softmax
operations, you should make this replacement:

    -(y - p)    =>    -(y_smooth - p)                                    (24)

Thus label smoothing has the added benefit of minimal computational overhead
per update iteration. Though, of course, similar to other regularization
techniques, if it is applied then longer training is required as you have made
the training task harder.
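A minimal sketch of equations (23) and (24), assuming the targets are stored as a K × n one-hot matrix Y and the network outputs as a K × n probability matrix P (both placeholders here):

    import numpy as np

    K, n, eps = 10, 4, 0.1
    y = np.array([3, 0, 7, 3])                     # placeholder integer labels for n examples

    # One-hot targets (K x n), then label-smoothed targets as in equation (23)
    Y = np.zeros((K, n))
    Y[y, np.arange(n)] = 1.0
    Y_smooth = np.full((K, n), eps / (K - 1))
    Y_smooth[y, np.arange(n)] = 1.0 - eps

    # Backward pass through softmax + cross-entropy, equation (24)
    P = np.random.rand(K, n)
    P /= P.sum(axis=0, keepdims=True)              # placeholder predicted probabilities
    G_batch = -(Y_smooth - P)                      # instead of the unsmoothed -(Y - P)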

Background 6: Cyclical learning rates with increasing step sizes

In Assignment 2 you trained with cyclical learning rates. For most of the
experiments you will use this optimizer again for this assignment. However,
you will apply a small upgrade to this algorithm for longer training runs to
help with efficiency, that is we will have shorter cycles at the start of training.
The upgrade is that the number of update steps per cycle is doubled after
each cycle. Let n_{i,s} be the number of steps for the ith cycle; then

    n_{i+1,s} = 2 n_{i,s}                                                (25)

This schedule of the learning rate is an approximation to the cosine with


warm restarts schedule of [Loshchilov and Hutter, 2017]. For an even more
accurate approximation you should also decay ηmax for each new cycle but
in the basic assignment we will not do this.
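Below is a minimal sketch of one possible triangular cyclical learning-rate schedule whose cycle length doubles after each cycle, as in equation (25). The exact shape of the schedule should follow whatever you used in Assignment 2, so treat this generator only as an illustration (eta_min, eta_max and step_1 match the hyper-parameter names used later in the assignment):

    def cyclical_lr(eta_min, eta_max, step_1, n_cycles):
        """Yield one learning rate per update step for n_cycles triangular cycles,
        where the number of steps per half-cycle doubles after each cycle."""
        n_s = step_1
        for _ in range(n_cycles):
            for t in range(2 * n_s):                     # one full cycle: up then down
                if t < n_s:
                    eta = eta_min + (t / n_s) * (eta_max - eta_min)
                else:
                    eta = eta_max - ((t - n_s) / n_s) * (eta_max - eta_min)
                yield eta
            n_s *= 2                                     # equation (25)

    # Example: total number of update steps for 3 cycles with step_1 = 800
    etas = list(cyclical_lr(1e-5, 1e-1, 800, 3))
    assert len(etas) == 2 * (800 + 1600 + 3200)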

Exercise 1: Write code to implement the convolution efficiently

For this assignment to be a success you must have a bug-free and fast
implementation of the convolution applied at the first layer. To ensure this
happens you will write slow but straightforward code as your first implementation
of the convolutional layer - a dot product between the convolution
filter and each non-overlapping sub-patch of the input image. I have
provided debugging information on the Canvas assignment page to ensure a
bug-free implementation. Read in the debug file and extract the
image data X and the convolution filters Fs:

    debug_file = 'debug_conv_info.npz'
    load_data = np.load(debug_file)
    X = load_data['X']
    Fs = load_data['Fs']

X contains the image data from 5 CIFAR-10 images and has size 3072 × n where
n=5, and Fs has size f × f × 3 × nf where f=4 and nf=2. The images in X have
been flattened; to convert them back into images of size 32 × 32 × 3 × n
apply

    X_ims = np.transpose(X.reshape((32, 32, 3, n), order='F'), (1, 0, 2, 3))

For each image X_ims[:, :, :, i] an easy way to compute the convolution
applied with stride f between X_ims[:, :, :, i] and the kth filter Fs[:, :, :, k]
is to have nested for loops that visit each sub-patch. At each sub-patch
compute the dot product between the sub-patch of X_ims[:, :, :, i] of size
f × f × 3 and Fs[:, :, :, k]. The dot product can be computed with the numpy
commands np.multiply and np.sum. There are 32/f × 32/f sub-patches to
visit. Write a function which for each image, each sub-window and each filter
computes all the appropriate dot products and puts the result in a 4D array
of size 32/f × 32/f × nf × n. To debug you should
compare your result to load_data['conv_outputs'].
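A sketch of such a naive implementation is given below (the function name and exact loop order are my own choices; what matters is that your result matches load_data['conv_outputs']):

    import numpy as np

    def conv_naive(X_ims, Fs):
        """Stride-f, no-padding convolution via explicit dot products.
        X_ims: 32 x 32 x 3 x n, Fs: f x f x 3 x nf. Returns 32/f x 32/f x nf x n."""
        f, _, _, nf = Fs.shape
        n = X_ims.shape[3]
        n_side = X_ims.shape[0] // f
        out = np.zeros((n_side, n_side, nf, n))
        for i in range(n):
            for k in range(nf):
                for r in range(n_side):
                    for c in range(n_side):
                        patch = X_ims[r*f:(r+1)*f, c*f:(c+1)*f, :, i]
                        out[r, c, k, i] = np.sum(np.multiply(patch, Fs[:, :, :, k]))
        return out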
Once your code produces the correct numbers, it will act as your ground-truth
code and you can begin to write a more efficient implementation via matrix
multiplication. This more efficient implementation first requires constructing
the multi-dimensional array MX of size n_p × f*f*3 × n which contains the M_X,
defined as in equation (10) (extended to three channels as in equation (12)),
for each input image X_ims[:, :, :, i]. To construct MX you should allocate
the memory by initializing an array of zeros of the appropriate size. Then for
each image X_ims[:, :, :, i], you should iterate through each non-overlapping
sub-region (as previously), let X_patch be the pixel data extracted from the
lth sub-region, and set:

    MX[l, :, i] = X_patch.reshape((1, f*f*3), order='C')

I visited the sub-patches in the order where the second dimension of
X_ims[:, :, :, i] changes first and then the first dimension. This is to be
consistent with how I flatten load_data['conv_outputs']. Importantly, MX just
needs to be computed once at the beginning of training and stored. During
training MX will be accessed continually but it does not change, as it is
completely defined by the input data and the size and stride of the convolution
filters, and these quantities do not change.
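A sketch of this construction, under the patch-visiting order described above (the function name is my own choice):

    import numpy as np

    def build_MX(X_ims, f):
        """Build the n_p x f*f*3 x n array of patch matrices for the stride-f patchify layer."""
        n = X_ims.shape[3]
        n_side = X_ims.shape[0] // f
        n_p = n_side * n_side
        MX = np.zeros((n_p, f * f * 3, n), dtype=np.float32)
        for i in range(n):
            l = 0
            for r in range(n_side):          # first dimension (rows of patches)
                for c in range(n_side):      # second dimension changes fastest
                    X_patch = X_ims[r*f:(r+1)*f, c*f:(c+1)*f, :, i]
                    MX[l, :, i] = X_patch.reshape((1, f * f * 3), order='C')
                    l += 1
        return MX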
You are almost ready to compute the convolution as a matrix multiplication.
First, though, you have to flatten each filter via:

    Fs_flat = Fs.reshape((f*f*3, nf), order='C')

where Fs_flat has size f*f*3 × nf. The convolutions can now be computed
for each image as a matrix multiplication, see equation (9):

    for i in range(n):
        conv_outputs_mat[:, :, i] = np.matmul(MX[:, :, i], Fs_flat)

where conv_outputs_mat, storing the result, has size n_p × nf × n. You
should compare this output to that from your dot-product implementation.
First, though, you have to reshape conv_outputs to have the same shape
as conv_outputs_mat, that is

    conv_outputs_flat = conv_outputs.reshape((n_p, nf, n), order='C')

The last step to make the convolution computations even faster is to remove
the for loop over images using an Einstein summation:

    conv_outputs_mat = np.einsum('ijn, jl -> iln', MX, Fs_flat, optimize=True)

You should check this produces the same output as before. The moment it
does you can feel like a deep learning guru! Using einsum is hardcore! It
is okay if you do not fully understand the workings of the einsum operator;
to be honest I just did some pattern matching and interpolated to our use
case, but more details are available at the numpy einsum help page. For
our test case MX is small and only contains data from 5 images so there is
probably no speed-up using the Einstein summation. But during training MX
will contain data from a training batch of ∼100 images, and on my laptop
(M1 MacBook Pro) the einsum implementation gives a >3 times
speed-up over the for-loop implementation.

Exercise 2: Compute gradients

With the fast convolution computations in place, you are ready to tackle
writing the code for the bare bones of the back-prop algorithm. First you
have to write the forward pass and then the backward pass. Initially, you will
check your gradient computations using debugging information provided.
But after clearing these checks you will clean up your code, add bias terms
to your network function and use pytorch to do a more thorough job of
debugging your gradient calculations.

Forward Pass You are ready to write the forward pass of the back-prop
algorithm for the network described in equations (1 - 5). The function you
write should have as input the MX representation of the input data and the
parameters of your network. These parameters are the convolution filters of
the first layer and the weight matrices for the two subsequent layers. You
have written the code to apply the first convolutional layer. Then you have
to apply the ReLU activation function to the output of the convolution and
then reshape the output array so you can apply the first fully connected
layer. This corresponds to:

    conv_flat = np.fmax(conv_outputs_mat.reshape((n_p*nf, n), order='C'), 0)

(Note I don’t think this code vectorizes the output of the convolution in the
same order as equation (13). This does not matter as the critical issue is
that the ordering is consistent within your code and this is why I’m being
very explicit about this in the code description.) Given conv_flat you can
proceed with the fully connected layers of the network as in Assignment 2 to
finally produce the probability vector for each input image. The function you
write should return the information needed for the backward pass. These
required quantities are the final probabilities and the intermediary outputs
at each layer (i.e. those corresponding to equations (2), (3) and (5)). To help
you debug your code, the debugging file you loaded previously, which had
the parameters for the convolution filters, also contains the arrays for the
rest of the network's parameters:

    load_data['W1'] (size: nh × n_p*nf)
    load_data['W2'] (size: 10 × nh)
    load_data['b1'] (size: nh × 1)
    load_data['b2'] (size: 10 × 1)

and the values for the arrays your forward pass should return:

    load_data['conv_flat'] (size: n_p*nf × n)
    load_data['X1'] (size: nh × n)
    load_data['P'] (size: 10 × n)

My naming convention tries to be consistent with equations (1-5) and hopefully
from the context you can figure out to which arrays they correspond.
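A sketch of what such a forward pass could look like under the conventions above (the exact function signature and return values are my assumptions; check every array against the debug data):

    import numpy as np

    def forward_pass(MX, Fs_flat, W1, W2, b1, b2):
        """MX: n_p x f*f*3 x n, Fs_flat: f*f*3 x nf. Returns P and the intermediaries."""
        n_p, _, n = MX.shape
        nf = Fs_flat.shape[1]

        # Patchify convolution for the whole batch, then ReLU and flatten (equations (1)-(2))
        conv_outputs_mat = np.einsum('ijn, jl -> iln', MX, Fs_flat, optimize=True)
        conv_flat = np.fmax(conv_outputs_mat.reshape((n_p * nf, n), order='C'), 0)

        # Fully connected layers (equations (3)-(5))
        X1 = np.fmax(W1 @ conv_flat + b1, 0)
        S = W2 @ X1 + b2
        P = np.exp(S - S.max(axis=0, keepdims=True))   # shift for numerical stability
        P /= P.sum(axis=0, keepdims=True)
        return P, X1, conv_flat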

Backward Pass  Next up is the backward pass code. The target labels
for the given debug data are in load_data['Y'] (10 × n). Propagating the
gradient from the loss node through the fully-connected layers and to the
weight matrices W2 and W1 is the same as in Assignment 2 though, of course,
the input to the first fully-connected layer is the output of the convolutional
layer as opposed to the original input data. Let G_batch be the n_p*nf × n
array containing the gradient information after back-propagation to the h node.
At this stage in the forward pass we had just performed a flattening/reshaping
operation. We need to undo this operation w.r.t. G_batch and obtain an
array GG of size n_p × nf × n:

    GG = G_batch.reshape((n_p, nf, n), order='C')

Given GG and MX you can now compute the sum in equation (22) and compute
the gradient for Fs_flat. Write the code for this and you can check your
answer against load_data['grad_Fs_flat'].
The last step to make your gradient computations fast is to remove the for
loop over images in equation (22) and replace it with an Einstein summation:

    MXt = np.transpose(MX, (1, 0, 2))
    grad_Fs_flat = np.einsum('ijn, jln -> il', MXt, GG, optimize=True)
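For orientation, here is a sketch of how the complete backward pass might look, excluding the L2 regularization terms and the convolution bias you will add later. Whether the 1/n factor is applied here or already folded into G_batch is a convention choice, so be consistent with your own code and the debug arrays:

    import numpy as np

    def backward_pass(MX, Y, P, X1, conv_flat, W1, W2):
        """Sketch: gradients of the average cross-entropy loss (no L2 term, no conv bias)."""
        n = Y.shape[1]
        G = -(Y - P)                                   # or -(Y_smooth - P) with label smoothing
        grad_W2 = (G @ X1.T) / n
        grad_b2 = np.sum(G, axis=1, keepdims=True) / n
        G = W2.T @ G
        G = G * (X1 > 0)                               # back through the ReLU of equation (3)
        grad_W1 = (G @ conv_flat.T) / n
        grad_b1 = np.sum(G, axis=1, keepdims=True) / n
        G_batch = W1.T @ G
        G_batch = G_batch * (conv_flat > 0)            # back through the ReLU after the convolution
        n_p = MX.shape[0]
        nf = G_batch.shape[0] // n_p
        GG = G_batch.reshape((n_p, nf, n), order='C')
        MXt = np.transpose(MX, (1, 0, 2))
        grad_Fs_flat = np.einsum('ijn, jln -> il', MXt, GG, optimize=True) / n
        return grad_W1, grad_b1, grad_W2, grad_b2, grad_Fs_flat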

Clean up your code! You have written the essential code for applying
the network function and computing the gradients. At this point you prob-
ably need to clean up your code and come up with good ways of storing
your network’s parameters etc., You will also need to write and/or upgrade
existing code

• to pre-compute, save and load the MX representation for all the training
data and test data,

• to initialize the network's parameters, etc. (You should use He initialization
  for both the convolution filters and the fully connected layers; a minimal
  sketch is given after this list.)

• and of course integrate your new forward and backward algorithms


into your training.
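A minimal sketch of He initialization for this architecture (the shapes follow the storage advice below; the seed and variable names are my own assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    f, nf, nh, K = 4, 10, 50, 10
    n_p = (32 // f) ** 2

    # He initialization: std = sqrt(2 / fan_in) for layers followed by a ReLU
    Fs_flat = (rng.standard_normal((f * f * 3, nf)) * np.sqrt(2.0 / (f * f * 3))).astype(np.float32)
    W1 = (rng.standard_normal((nh, nf * n_p)) * np.sqrt(2.0 / (nf * n_p))).astype(np.float32)
    W2 = (rng.standard_normal((K, nh)) * np.sqrt(2.0 / nh)).astype(np.float32)
    b1 = np.zeros((nh, 1), dtype=np.float32)
    b2 = np.zeros((K, 1), dtype=np.float32)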

A few words of advice:

• Regarding MX: if you are having trouble fitting everything into memory
  then you should use np.float32 as the numerical type for each number
  (input data and network parameters); to be honest this is the
  default type used in deep learning. The precision of doubles is not
  needed.

• As you apply the convolution via matrix multiplication you should


store your network’s convolutional filters as an array of size 3*f*f ×
nf.

Check gradient computations  To double check your gradient computations
are okay, you should upgrade the torch gradient code supplied for
Assignment 2 to compute the gradients for your new network. Here you
should emulate in torch the forward pass you have just written and also
compute the loss (a sketch is given after the upgrade list below). Use the
for loop implementation, equation (21), to apply the convolutional filters
to each input image to avoid any nuisance complications caused by differences
in the calling conventions and the ordering of inputs and outputs between
torch's and numpy's versions of einsum. The total number of
training examples is large. Thus you should compare your gradients to the
torch gradients computed on just a small subset of training images, small nf
(<10) and small nh (<10). After completing this and checking for agreement
between the two sets of gradient computations you should:

• Upgrade your network and gradient calculations to also include a bias


vector for the convolutional layer. This vector will have nf entries.

• Add an L2 regularization for the weights of the fully connected layers


and the convolutional filters.

After these upgrades you will need to check the gradient calculations again
to ensure you have a bug free implementation.
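A sketch of what such a torch-based check could look like (the function and its exact inputs are my own assumptions; adapt it to your data layout and add the bias and regularization terms once you include them):

    import torch

    def torch_gradients(MX_np, Y_np, Fs_flat_np, W1_np, W2_np, b1_np, b2_np):
        """Sketch: torch's gradients w.r.t. the network parameters for a small batch."""
        MX = torch.tensor(MX_np, dtype=torch.float64)
        Y = torch.tensor(Y_np, dtype=torch.float64)
        params = [torch.tensor(p, dtype=torch.float64, requires_grad=True)
                  for p in (Fs_flat_np, W1_np, W2_np, b1_np, b2_np)]
        Fs_flat, W1, W2, b1, b2 = params

        n = MX.shape[2]
        cols = []
        for i in range(n):                              # for-loop version of the convolution
            H_i = torch.relu(MX[:, :, i] @ Fs_flat)     # n_p x nf
            cols.append(H_i.reshape(-1))                # row-major flatten, matching order='C'
        conv_flat = torch.stack(cols, dim=1)            # n_p*nf x n

        X1 = torch.relu(W1 @ conv_flat + b1)
        S = W2 @ X1 + b2
        P = torch.softmax(S, dim=0)
        loss = -torch.mean(torch.sum(Y * torch.log(P), dim=0))
        loss.backward()
        return [p.grad.numpy() for p in params]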

Exercise 3: Train small networks with cyclic learning rates

You are now ready to start training your first ConvNets. For training we will
use the cyclic learning rates optimizer as in Assignment 2. Do, if possible,
re-use your code. The initial network you will train has architecture:

• Network architecture: f=4, nf=10, nh=50

with the hyper-parameters for training and L2 regularization set to

• Cyclic learning rate hyper-parameters:
  n_cycles=3, step=800, eta_min=1e-5, eta_max=1e-1, n_batch=100

• Regularization: lam=.003

With this setting my final model achieved a test performance of ∼57.61%
and took under 50 seconds to train on an M1 MacBook Pro when I used
49,000 training examples; see figure 2 for the training curves. Note I had
some variation in my training run times. Please train this network and check
you can get a similar level of performance in a reasonable running time.
Even if we keep the number of layers of this network fixed, there are too
many hyper-parameters w.r.t. number of filters, number of hidden nodes, size

[Figure 2 plots: left panel "Loss curves" shows the training and test loss vs. update step; right panel "Accuracy curves" shows the training and test accuracy vs. update step.]

Figure 2: The training and test curves for the first ConvNet (f = 4, nf = 10, nh =
50). The final test accuracy achieved for this network is 57.61% when trained with
three cycles of the cyclical learning rate and step=800. This accuracy probably exceeds
most if not all of the models you trained in Assignments 1 and 2.

of filters, number of updates applied during training, amount of regularization
applied, etc., to fully explore the best configuration given your limited
computational resources. You will therefore just run a small subset of
experiments to get some feel for how changing some factors while keeping
others fixed impacts test performance. Our initial set of experiments is to
see the impact of the filter size, f, and the number of filters, nf, on test
performance and run time. Given the same training set-up as before you
should run the following architectures:

• Network architecture 1: f=2, nf=3, nh=50


• Network architecture 2: f=4, nf=10, nh=50
• Network architecture 3: f=8, nf=40, nh=50
• Network architecture 4: f=16, nf=160, nh=50

The number of filters, nf, has been chosen to keep the number of outputs
from the convolution layer constant (or at least approximately the same) for
each network. Remember you have already trained architecture 2. Keep
a record of the final test performance and also the training time for each
architecture and make two bar charts: one showing the final test performance
of each model and the other displaying the run time.

Train for longer  The results you just obtained indicate that working with
filters of size f=16 is going to be too slow and that with f=2 we minimize the
effect of the convolutional layer. Therefore f ∈ {4, 8} is probably the best
option for our set-up. Now you should investigate the benefit of training
the networks with architecture 2 and 3 for longer. You should upgrade
the basic cyclic learning rate algorithm to the algorithm with increasing
cycle length introduced in background section 6. Set step 1 = 800, still

12
with n cycles=3 and train the two models. Keep a record of the final test
accuracies achieved and the plots of the training and test loss.
As you can see training for more updates improves results! At least one
of the networks should have given final test performance of >60%. Hurrah
and well done! With longer training the bigger jump in performance was
for architecture 3. But was this caused by the bigger filter size or that
architecture 3 is significantly wider, i.e. it applies more filters at the first
layer than architecture 2? Let’s get some indication whether layer width
is a critical factor. For architecture 2 bump up the number of filters to
nf=40 and re-run the previous experiment. Keep a record of the final test
accuracies achieved and the plots of the training and test loss. As you can
see, increasing width helps a lot. There is lots of anecdotal and empirical
evidence that increasing the width of a neural network helps training and
capacity. If you are interested check out the paper [Tan and Le, 2019], which
introduces EfficientNet and gives rule-of-thumb scaling laws for how
best to increase the width, depth and resolution of your network
architecture if given more resources.

Exercise 4: Larger networks and regularization with label smoothing

As training with f=4 is generally quicker, we will now investigate how
increasing the capacity of the network by making each layer wider affects
the performance of your network. But, of course, when you
begin to train larger models and have longer training times then over-fitting,
particularly in relation to the loss, is much more likely. And if you over-fit
quickly w.r.t. the loss then it is not possible to train for many update steps.
In this part of the assignment you will generate some results to help you get
a feel for the issue. But remember we are only barely scratching the surface
here. Consider the network:

• Network architecture 5: f=4, nf=40, nh=300

Train this network with n_cycles=4 and step_1=800 and with L2 regularization
lam=.0025. (Note this run will take some time, ∼10 minutes for me.)
We reduce the L2 regularization so that the loss over-fits more quickly than
it would with higher L2 regularization. Plot the training and test losses for
this architecture, regularization and length of training. It is obvious for this
set-up that the loss has over-fit. To regularize your model, instead of applying
more L2 regularization, implement label smoothing as in background section 5.
Re-run the
ization, implement label smoothing as in background section 5. Re-run the
same experiment but with label smoothing. Keep a record of the loss plots
plus the final test accuracy achieved and report on the qualitative difference
of the loss plots for the two approaches.

To complete the assignment:

To pass the assignment you need to upload to Canvas:

1. The code for your assignment assembled into one file.


2. A brief pdf report with the following content:
i) State how you checked your analytic gradient computations and
   whether you think that your gradient computations were bug-free.
   Give evidence for these conclusions. Please also report the
   training time for the initial three-layer ConvNet you train in
   Exercise 3.
ii) The bar charts of the final test accuracy and training times for
    the 4 network architectures trained with short training runs and
    varying f and nf in Exercise 3.
iii) The curves for the training and test loss asked for in the Train
for longer part of Exercise 3. To keep running times down I only
computed the test loss and accuracy sparsely for these curves. I
computed the performance metrics at every j*step/2 th update
iteration for j = 0, 1, 2, . . .. I did the same for the training loss
and used a big chunk of the training data to have a less noisy
estimate than the estimate from the batch.
iv) The curves for the training and test loss for network architecture
    5 with and without label smoothing applied. Please
    comment on the qualitative difference in the curves.
v) Imagine you want to boost the performance of your three-layer net-
work even further and you have more available compute than
now but it is not unlimited. What would be the next set of ex-
periments you would run and why? Remember the goal here is
to investigate factor(s) that you feel, given the experiments from
this assignment and the previous assignments, would give the
most bang for your buck w.r.t. final test accuracy! There is no
perfect answer here; I'm just curious to find out what you think
would be the important architecture and/or training issues to investigate.

Exercise 5: Optional for bonus points

For Assignment 3 I will award at most 4 bonus points (the special bonus
points are not included in this calculation).

1. Push performance of the network. It would be fun to discover


how high the performance of Assignment 3’s network (a 3-layer net-
work with an initial patchify layer) on CIFAR-10 can be pushed. There

are lots of options to try to make gains! Make the network wider, use
data-augmentation, try to find the right balance between the amount
of L2 regularization and label smoothing, decay eta max in the cycli-
cal learning rate algorithm, concatenate the output from convolution
filters of multiple sizes (just a crazy idea), . . . From my not so ex-
tensive and slightly haphazard investigations the maximum I achieved
was 67.26%. The tricks I used are the ones I describe in this assignment!
But, in general, data-augmentation and making the network wider is
usually a good bet!
Bonus Points Available: 4 points (if you complete at least 3 improve-
ments) - you can follow my suggestions, think of your own or some
combination of the two. I’ll also award an additional extra bonus
point if you get a test accuracy ≥68% and two if you achieve ≥70%.
Note for this assignment because training takes a relatively long time,
I’m okay with the test set becoming your validation set! Thus any
final result you get will be a slight over-estimate of the true test error.
Note for the project such behaviour will not be tolerated!
2. Compare the speed of your training to using pytorch Use the
torch function torch.nn.functional.conv2d to compute the convo-
lutions and its auto-diff computations for training where the computa-
tions are calculated on the CPU. Please compare the training timings
of your implementation versus the torch implementation for this net-
work across a variety of filter widths and numbers of filters. These tests
do not need to be exhaustive. We just want to get an indication of whether
your implementation is faster/slower than the in-built pytorch functions
for this particular network and whether the speed-up / slow-down is
correlated with the filter size, etc. (A sketch of the patchify convolution
using torch.nn.functional.conv2d is given after this list.)
Bonus Points Available: 3 points
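For reference, here is a minimal sketch of computing the patchify convolution with torch.nn.functional.conv2d on the CPU (the conversion of the data layout to torch's (n, channels, height, width) convention is my own assumption):

    import torch
    import torch.nn.functional as F

    f, nf, n = 4, 10, 100

    # torch expects images as (n, channels, height, width) and filters as (nf, channels, f, f)
    X_torch = torch.randn(n, 3, 32, 32)            # placeholder batch of images
    Fs_torch = torch.randn(nf, 3, f, f)            # placeholder filters
    bias = torch.zeros(nf)

    # Stride f with no padding reproduces the patchify layer: output is (n, nf, 32/f, 32/f)
    H = F.conv2d(X_torch, Fs_torch, bias=bias, stride=f)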

To get the bonus point(s) you must upload the following to the Canvas
assignment page Assignment 3 Bonus Points:

1. Your code.
2. You can get at most 4 points for Assignment 3.
3. A pdf document which
- reports on your trained network with the best test accuracy, what
improvements you made and which ones brought the largest gains
(if any!). (Exercise 5.1)
- summarizes the training times for pytorch vs. your training code
  for several network architectures and the general conclusions you
  can draw from these results. (Exercise 5.2)

References

[Loshchilov and Hutter, 2017] Loshchilov, I. and Hutter, F. (2017). SGDR:
Stochastic gradient descent with warm restarts. In Proceedings of the
International Conference on Learning Representations (ICLR).

[Tan and Le, 2019] Tan, M. and Le, Q. (2019). EfficientNet: Rethinking
model scaling for convolutional neural networks. In Proceedings of the
International Conference on Machine Learning (ICML).
