Assignment 2
In this assignment you will train and test a two-layer network with multiple
outputs to classify images from the CIFAR-10 dataset. You will train
the network using mini-batch gradient descent applied to a cost function
that computes the cross-entropy loss of the classifier applied to the labelled
training data and an L2 regularization term on the weight matrix.
The overall structure of your code for this assignment should mimic that
from Assignment 1. You will have more parameters than before and you
will have to change the functions that 1) evaluate the network (the forward
pass) and 2) compute the gradients (the backward pass). We will also be
paying more attention to how to search for good parameter settings for the
network’s regularization term and the learning rate. Welcome to the nitty
gritty of training neural networks!
The network computes the class probabilities p for an input image x as

s1 = W1 x + b1          (1)
h = max(0, s1)          (2)
s = W2 h + b2           (3)
p = softmax(s)          (4)

where the max in equation (2) is applied elementwise and, for a network with d input dimensions, m hidden nodes and K classes, W1 and b1 have size m × d and m × 1 while W2 and b2 have size K × m and K × 1.
The predicted class corresponds to the label with the highest probability:

k* = arg max_{1 ≤ k ≤ K} pk
Figure 1: For this assignment, the computational graph of the classification function applied to an input x and the computational graph of the cost function applied to a mini-batch of size 1. The nodes of the graphs are the intermediary values x, s1, h, s and p, the cross-entropy loss l = −yᵀ log(p), the regularization term r = ‖W1‖² + ‖W2‖² and the cost J = l + λr; the parameters W1, b1, W2, b2 and the inputs y and λ enter as indicated.
The cost function on a labelled dataset D is the average cross-entropy loss of the classifier plus an L2 regularization term on the two weight matrices:

J(D, λ, Θ) = (1/|D|) Σ_{(x,y)∈D} l_cross(x, y, Θ) + λ (‖W1‖² + ‖W2‖²)

where

l_cross(x, y, Θ) = −yᵀ log(p)

and p has been calculated using equations (1-4). (Note that as the label is encoded by a one-hot representation, the cross-entropy loss −yᵀ log(p) = − log(py).) The optimization problem we have to solve is

Θ* = arg min_Θ J(D, λ, Θ)

where Θ denotes the full set of parameters {W1, b1, W2, b2}.
In this assignment (as described in the lectures) we will solve this optimiza-
tion problem via mini-batch gradient descent with cyclic learning rates.
For mini-batch gradient descent we begin with a sensible random initialization of the parameters W, b and we then update our estimate for the parameters, for k = 1, 2, with

Wk^(t+1) = Wk^(t) − η ∂J(B^(t+1), λ, W, b)/∂Wk

bk^(t+1) = bk^(t) − η ∂J(B^(t+1), λ, W, b)/∂bk

where η is the learning rate and B^(t+1) is called a mini-batch and is a random
subset of the training data D.
To compute the relevant gradients for the mini-batch, we then have to com-
pute the gradient of the loss w.r.t. each training example in the mini-batch.
You should refer to the lecture notes for the explicit description of how to
compute these gradients. Once again I would advise you to implement the
efficient vectorized version as it results in significant speed ups.
There is not really one optimal learning rate when training a neural network
with vanilla mini-batch gradient descent. Choose too small a learning rate
and training will take too long; choose too large a learning rate and
training may diverge. Ideally, one should have an adaptive learning rate which
changes to match the local shape of the cost surface at the current estimate
of the network’s parameters. Many variants of mini-batch training try to
achieve this - ADAM, mini-batch with a momentum term etc. - and these
variants are covered in the lectures. For this assignment though we will ex-
plore the rather recent idea of exploiting cyclical learning rates [Smith, 2015]
as this approach eliminates much of the trial-and-error associated with find-
ing a good learning rate and some of the costly hyper-parameter optimiza-
tion over multiple parameters associated with training with momentum. It
also empirically seems to work well when training relatively small networks
as in these assignments. The main idea of cyclical learning rates is that dur-
ing training the learning rate is periodically changed in a systematic fashion
from a small value to a large one and then from this large value back to the
small value. And this process is then repeated again and again until training
is stopped. See figure 2 for an illustration of a typical example of how the
learning rate is scheduled to change periodically during training. This is the
schedule you will implement.
Assume that you have defined a minimum ηmin and a maximum ηmax learning
rate. ηmin and ηmax define the range of learning rates where learning occurs
without divergence. (Note, in general, these values will be affected by λ,
network architecture and parameter initialization.) Let ηt represent the
learning rate at the tth update step. One complete cycle will take 2ns update
steps, where ns is known as the stepsize. When t = 2lns then ηt = ηmin and
when t = (2l + 1)ns then ηt = ηmax for l = 0, 1, . . .. A rule of thumb is to
set ns = k ⌊n/nbatch⌋ with k being an integer between 2 and 8, where n is the
total number of training examples and nbatch is the number of examples in
a batch. The triangular learning rate schedule is then defined as follows.
Figure 2: Schedule for the cyclic learning rate. The graph above plots the
learning rate ηt at each update step. Initially η1 = ηmin and its value increases
linearly until it reaches a maximum value of ηmax when t = ns. Then ηt decreases
linearly until it has a value of ηmin again when t = 2ns. The cycle can be
repeated as many times as one likes and in this example it is repeated three times.
For most practical applications the number of cycles is ≥ 2 and ≤ 10. The positive
integer ns is known as the stepsize and is usually chosen so that one cycle of training
corresponds to a multiple of epochs of training.
If 2lns ≤ t ≤ (2l + 1)ns for some l ∈ {0, 1, 2, . . .} set

ηt = ηmin + ((t − 2lns)/ns) (ηmax − ηmin)          (14)

while if (2l + 1)ns ≤ t ≤ 2(l + 1)ns for some l ∈ {0, 1, 2, . . .} set

ηt = ηmax − ((t − (2l + 1)ns)/ns) (ηmax − ηmin)     (15)
At each update step t the parameters are then updated as usual, but with the current learning rate ηt:

θt = θt−1 − ηt (∂J/∂θ)|θ=θt−1
Normally training is run for a set number of complete cycles and is stopped
when the learning rate is at its smallest, that is t = 2lns for some l ≥ 2. For
this assignment I will give you values for ηmin , ηmax and ns that work well
for the default network used in the assignment. You can read [Smith, 2015],
the paper which forms the basis for this assignment, to get guidelines and
tests about how to set ηmin and ηmax .
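To make equations (14) and (15) concrete, here is a minimal sketch of a helper that returns ηt for a given update step t (the function name and the convention of counting t from 0 are illustrative choices, not part of the assignment specification):

def cyclic_eta(t, eta_min, eta_max, n_s):
    # triangular schedule of equations (14) and (15)
    l = t // (2 * n_s)                      # index of the current cycle
    if t <= (2 * l + 1) * n_s:              # increasing half of the cycle, equation (14)
        return eta_min + (t - 2 * l * n_s) / n_s * (eta_max - eta_min)
    else:                                   # decreasing half of the cycle, equation (15)
        return eta_max - (t - (2 * l + 1) * n_s) / n_s * (eta_max - eta_min)

For example, with eta_min=1e-5, eta_max=1e-1 and n_s=500 this returns 1e-1 at t=500 and is back at 1e-5 at t=1000.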
Exercise 1: Read in the data & initialize the parameters of the network
For this assignment (to begin) you will just use the data in the file data_batch_1
for training, the file data_batch_2 for validation and the file test_batch for
testing. You have already written a function for Assignment 1 to read in
the data and pre-process it. For this assignment we will apply the same
pre-processing as before. Remember you should transform it to have zero
mean. If trainX is the d × n image data matrix (each column corresponds
to an image) for the training data, then you first compute the mean and
standard deviation of each dimension of trainX.
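A minimal way to do this with numpy (assuming trainX has already been converted to a d × n array of floats) could be:

mean_X = np.mean(trainX, axis=1, keepdims=True)   # d x 1 vector of per-dimension means
std_X = np.std(trainX, axis=1, keepdims=True)     # d x 1 vector of per-dimension standard deviations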
Then you should normalize the training, validation and test data with respect
to this mean and these standard deviations. If X is a d × n image data
matrix then you can normalize X as

X = X - mean_X
X = X / std_X

(with mean_X and std_X stored as d × 1 arrays, numpy broadcasting applies them to every column of X).
Next you have to set up the data structure for the parameters of the network
and to initialize their values. In the assignment we will just focus on a
network that has m=50 nodes in the hidden layer. As W1 and W2 will have
different sizes, as will b1 and b2, I recommend you use a list to store
these weight matrices and bias vectors within a dictionary, that is if L is the
number of layers then you can set up the container like this:

net_params = {}
net_params['W'] = [None] * L
net_params['b'] = [None] * L

where net_params['W'][0] and net_params['b'][0] hold W1 and b1, and net_params['W'][1] and net_params['b'][1] hold W2 and b2.
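With L = 2 this container can then be filled in, for example, using the same style of initialization as the small-network test code later in this document (Gaussian entries scaled by 1/√fan-in and zero bias vectors); the exact values below are illustrative:

import numpy as np

rng = np.random.default_rng()
d, m, K = 3072, 50, 10                 # CIFAR-10 input size, hidden nodes, number of classes
net_params['W'][0] = (1 / np.sqrt(d)) * rng.standard_normal(size=(m, d))
net_params['b'][0] = np.zeros((m, 1))
net_params['W'][1] = (1 / np.sqrt(m)) * rng.standard_normal(size=(K, m))
net_params['b'][1] = np.zeros((K, 1))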
Next you will write functions to compute the gradient of your two-layer
network w.r.t. its weight and bias parameters. As before I suggest you re-use
much of your code from Assignment1.py. You will need to write (update)
the following functions (that you wrote previously): a forward-pass function
(called ApplyNetwork in the sample code below) that applies the network to
the input data and also returns the intermediary activation values, and a
backward-pass function (called BackwardPass below) that computes the
gradients. Once again I would recommend using a container such as a
dictionary to store this forward-pass data so it can be easily passed to the
backward pass function to compute the gradients. For the sample code in
this document I refer to this dictionary as fp_data.
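As a rough guide (not a prescribed implementation), vectorized sketches of these two functions, assuming the net_params structure above, a d × n input matrix X, a K × n one-hot label matrix Y and the cost J defined earlier, could look like:

import numpy as np

def ApplyNetwork(X, net_params):
    # forward pass of equations (1)-(4) applied to all columns of X at once
    s1 = net_params['W'][0] @ X + net_params['b'][0]           # (1): m x n first-layer scores
    h = np.maximum(0, s1)                                       # (2): ReLU activations
    s = net_params['W'][1] @ h + net_params['b'][1]             # (3): K x n final scores
    e = np.exp(s - np.max(s, axis=0, keepdims=True))            # shifted for numerical stability
    p = e / np.sum(e, axis=0, keepdims=True)                    # (4): softmax probabilities
    return {'s1': s1, 'h': h, 's': s, 'p': p}                   # fp_data dictionary

def BackwardPass(X, Y, fp_data, net_params, lam):
    # efficient (vectorized) gradients of J for a mini-batch; the lecture notes
    # remain the authoritative reference for this derivation
    n = X.shape[1]
    P, H = fp_data['p'], fp_data['h']
    grads = {'W': [None] * 2, 'b': [None] * 2}
    G = P - Y                                                   # gradient of the loss w.r.t. s (K x n)
    grads['W'][1] = (G @ H.T) / n + 2 * lam * net_params['W'][1]
    grads['b'][1] = np.sum(G, axis=1, keepdims=True) / n
    G = net_params['W'][1].T @ G                                # back-propagate to the hidden layer
    G = G * (H > 0)                                             # gradient through the ReLU
    grads['W'][0] = (G @ X.T) / n + 2 * lam * net_params['W'][0]
    grads['b'][0] = np.sum(G, axis=1, keepdims=True) / n
    return grads

The exact contents and field names of fp_data are up to you; the torch comparison described next is a good way to check whichever version you write.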
Once you have written the code to compute the gradients the next step
is debugging. Download the file torch_gradient_computations.py from the
Canvas page. This file contains skeleton code to compute the gradients via
torch. The lines missing are those that compute the scores for input data
X corresponding to equations (1-3). You need to use torch operations to
compute them instead of numpy ones. To help you out, here is one way to
compute the ReLU function on each entry of the m×n torch array H:
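import torch

H = torch.nn.functional.relu(H)    # elementwise max(0, x); torch.clamp(H, min=0) would work equally well

For the numerical check itself you can, as in Assignment 1, apply both your code and the completed torch code to a small network and a small sub-sample of the data, for example: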
# set up a small network and a small amount of data for the gradient check
small_net = {'W': [None] * 2, 'b': [None] * 2}
d_small = 5
n_small = 3
m = 6
lam = 0
small_net['W'][0] = (1/np.sqrt(d_small))*rng.standard_normal(size = (m, d_small))
small_net['b'][0] = np.zeros((m, 1))
small_net['W'][1] = (1/np.sqrt(m))*rng.standard_normal(size = (10, m))
small_net['b'][1] = np.zeros((10, 1))
X_small = trainX[0:d_small, 0:n_small]
Y_small = trainY[:, 0:n_small]
fp_data = ApplyNetwork(X_small, small_net)
my_grads = BackwardPass(X_small, Y_small, fp_data, small_net, lam)
torch_grads = ComputeGradsWithTorch(X_small, train_y[0:n_small], small_net)
After you have computed the gradients (with no regularization) via your
code and PyTorch you should check they have produced the same output
and follow the guidelines described in the first assignment. You can add in
L2 regularization when you are convinced all the loss gradients are correct
in both your code and the PyTorch code.
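One simple way of doing this comparison (assuming, as a guess, that ComputeGradsWithTorch returns its gradients in the same {'W': [...], 'b': [...]} structure as the BackwardPass sketch above; adjust the indexing if the skeleton code uses a different layout) is to print the maximum relative error per parameter:

for name in ('W', 'b'):
    for k in range(2):
        num = np.abs(my_grads[name][k] - torch_grads[name][k])
        denom = np.maximum(1e-10, np.abs(my_grads[name][k]) + np.abs(torch_grads[name][k]))
        print(f"{name}{k+1}: max relative error = {np.max(num / denom):.2e}")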
Once you have convinced yourself that your analytic gradient computations
are correct then you can move forward with the following sanity check.
Try and train your network on a small amount of the training data (say
100 examples) with regularization turned off (lam=0) and check if you can
overfit to the training data and get a very low loss on the training data after
training for a sufficient number of epochs (∼200) and with a reasonable η.
Being able to achieve this indicates that your gradient computations and
mini-batch gradient descent algorithm are okay.
Up until now you have trained your networks with vanilla mini-batch gra-
dient descent. To help speed up training times and avoid time-consuming
searches for good values of η you will now implement mini-batch gradient
descent training where the learning rate at each update step is defined by
equations (14) and (15) and where you have set eta_min = 1e-5, eta_max
= 1e-1 and n_s=500 and the batch size to 100. To help you debug, fig-
ure 3 shows the training and validation loss/cost I achieved when lam=.01
and I ran training for one cycle that is from t=1 until t=2*n_s. Once you
have convinced yourself that you have a bug free implementation of the
cyclic scheduled learning rate then you are ready to somewhat optimize the
performance of your network.
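For orientation only, one possible shape of this training loop (mini-batches taken in order without shuffling for brevity; ApplyNetwork, BackwardPass and cyclic_eta refer to the sketches above, and the assumed structure of the returned gradients mirrors net_params):

n = trainX.shape[1]                        # e.g. 10000 when training on data_batch_1
n_batch, n_s = 100, 500
eta_min, eta_max, lam = 1e-5, 1e-1, 0.01
n_epochs = (2 * n_s) // (n // n_batch)     # one full cycle = 2*n_s update steps = 10 epochs here
t = 0
for epoch in range(n_epochs):
    for j in range(n // n_batch):
        j_start = j * n_batch
        Xb = trainX[:, j_start:j_start + n_batch]
        Yb = trainY[:, j_start:j_start + n_batch]
        fp_data = ApplyNetwork(Xb, net_params)
        grads = BackwardPass(Xb, Yb, fp_data, net_params, lam)
        eta_t = cyclic_eta(t, eta_min, eta_max, n_s)
        for k in range(2):
            net_params['W'][k] -= eta_t * grads['W'][k]
            net_params['b'][k] -= eta_t * grads['b'][k]
        t += 1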
Figure 3: Training curves (cost, loss, accuracy) for one cycle of train-
ing. In this example one batch of the training data is used. The hyper-parameter
settings of the training algorithm are eta_min = 1e-5, eta_max = 1e-1, lam=.01
and n_s=500. The last parameter setting implies, as the batch size is 100, one
full cycle corresponds to 10 epochs of training. In this simple example only one
cycle of training is performed, but already at the end of training a test accuracy of
46.29% is achieved. Please note my curves are relatively smooth because I plot the
loss/cost/accuracy scores 10 times per cycle as opposed to plotting these quantities
at every update step.
Now you should run your training algorithm for more cycles (say 3) and
for a larger n_s=800. For reference the performance curves I obtained with
these parameter settings are shown in figure 4. I measured my performance
on the whole training and validation set 9 times per cycle. At the moment
you have not optimized the value of the regularization term lam at all.
Figure 4: Training curves (cost, loss, accuracy) for three cycles of train-
ing. In this example one batch of the training data is used. The hyper-parameter
settings of the training algorithm are eta_min = 1e-5, eta_max = 1e-1, lam=.01
and n_s=800. In the loss and accuracy plots you can clearly see how the loss and
accuracy vary as ηt varies. After the three cycles of training a test accuracy of
48.11% is achieved.
Coarse-to-fine random search to set lam. At this point you may need
to restructure/re-organize your code so that you can cleanly and easily call
function(s) to initialize your network, perform training and then check the
learnt network’s best performance on the validation set. To perform your
random search you’ll need to train your network from a random initializa-
tion and measure its performance (via the accuracy on the validation set)
multiple times as the hyper-parameter lam varies. You should first perform
a coarse search over a very broad range of values for lam. To perform this
search you should use most of the training data available and the rest for
the validation. This is because more data and increasing the value of lam
are both forms of regularization. When you have less training data you will
need a higher lam and when you have more training data you will need a
lower lam. Thus you should perform your search using the same ballpark
amount of data that you will use when you train your final network. Thus
for this part of the assignment you should load all 5 training batches and
use all for training except for 5000 images that should be used as your val-
idation set. When you train each network you should only run 2 cycles (1
cycle could also potentially work) of training and you should set n_s = 2 *
np.floor(n / n_batch) to get a good idea of performance for a given lam.
Search for lam on a log scale; for example, to generate one random sample
for lam, draw it uniformly on a log scale in the range from 10^l_min to 10^l_max.
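A minimal sketch of generating such a sample (assuming rng = np.random.default_rng() as in the earlier initialization code):

l = l_min + (l_max - l_min) * rng.uniform()   # exponent sampled uniformly in [l_min, l_max]
lam = 10 ** l                                 # lam is then uniformly distributed on a log scale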
In my experiments for the coarse search I set l_min=-5 and l_max=-1 and
actually used a uniform grid with eight different values. Save all the pa-
rameter settings tried and their resulting best scores on the validation set
to a file. Inspect this file after finishing the coarse search and see what
parameter ranges gave the best results. Next perform a random search but
with your search adjusted to a narrower range focused on the good settings
found in the coarse search and possibly run training for a few more cycles
than before. Once again save the results and look for the best parameter
settings. You could do another round of random search or just use the best
lam found, and then train the network using most of the training data, for
more cycles and for a larger n_s and see what final performance you get on
the test set. You should be getting performances of >50% for your good
settings given ≥ 2 cycles of training (or even just one cycle of training). For
reference I was able to train networks that achieved test accuracies ∼ 52%
without an overly exhaustive search.
For Assignment 2 I will award at most 5 bonus points.
(a) Explore whether having significantly more hidden nodes improves the final
classification rate. One would expect that with more hidden nodes the
amount of regularization would have to increase.
(b) Apply dropout to your training if you have a high number of hidden nodes
and you feel you need more regularization.
(c) Apply data augmentation during training - random mirroring as described in
the bonus part of the assignment and also random translations. As a hint, if you
want to translate your image xx by positive integer translations tx and ty, then
you can do it by computing the following indices (conceptually not hard but a
little tricky to get right):
aa = np.arange(32).reshape((32, 1))
vv = np.tile(32*aa, (1, 32-tx))
bb1 = np.arange(tx, 32, 1).reshape((32-tx, 1))
bb2 = np.arange(0, 32-tx, 1).reshape((32-tx, 1))
ind_fill = vv.reshape((32*(32-tx), 1)) + np.tile(bb1, (32, 1))
ii = np.transpose(np.nonzero(ind_fill > ty*32+1))
ind_fill = ind_fill[ii[0, 0]:]
ind_xx = vv.reshape((32*(32-tx), 1)) + np.tile(bb2, (32, 1))
ii = np.transpose(np.nonzero(ind_xx < 1024-ty*32))
ind_xx = ind_xx[0:ii[-1, 0]+1]
inds_fill = np.vstack((ind_fill, 1024+ind_fill))
inds_fill = np.vstack((inds_fill, 2048+ind_fill))
inds_xx = np.vstack((ind_xx, 1024+ind_xx))
inds_xx = np.vstack((inds_xx, 2048+ind_xx))
and applying them to produce the shifted image:

xx_shifted[inds_fill] = xx[inds_xx]
Note the code above has to be changed a bit if either tx or ty is negative.
So please visualize your before and after images to ensure there are no bugs.
Even if you pre-compute the indices for all the different (tx, ty) pairs given
-3 <= tx <= 3 and -3 <= ty <= 3, there will be a slow-down in your training
if you apply a shift randomly to every image in your batch, as doing so
accesses different parts of memory. So perhaps use this augmentation
somewhat judiciously to avoid too much of a slow-down.
(d) In the basic assignment SGD with cyclical learning rates was the optimizer
used. Using Nesterov momentum with a straightforward linear decay of the
learning rate could perform just as well or better given a fixed number of
update steps; a minimal sketch of such an update step is given below.
(See Budgeted Training: Rethinking Deep Neural Network Training Under
Resource Constraints, M. Li, E. Yumer, and D. Ramanan, ICLR 2020 for
evidence. Their default implementation has: base learning rate 0.1, momentum
0.9, weight decay 0.0005 and a batch size of 128, but this is for a ResNet
as opposed to a fully connected network.)
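As a starting point for option (d), here is a minimal sketch of one such update step. It uses the look-ahead (Sutskever) form of Nesterov momentum together with the ApplyNetwork/BackwardPass sketches from earlier; the helper name, the velocity container and the linear-decay formula in the comment are illustrative assumptions rather than a prescribed implementation:

import numpy as np

def nesterov_update(net_params, velocities, Xb, Yb, lam, eta_t, mu=0.9):
    # evaluate the gradient at the look-ahead point theta + mu * v
    look_ahead = {name: [p + mu * v for p, v in zip(net_params[name], velocities[name])]
                  for name in ('W', 'b')}
    grads = BackwardPass(Xb, Yb, ApplyNetwork(Xb, look_ahead), look_ahead, lam)
    for name in ('W', 'b'):
        for k in range(2):
            velocities[name][k] = mu * velocities[name][k] - eta_t * grads[name][k]
            net_params[name][k] += velocities[name][k]

# velocities are initialized to zero, one per parameter
velocities = {name: [np.zeros_like(p) for p in net_params[name]] for name in ('W', 'b')}
# linear decay over a fixed budget of T update steps: eta_t = eta_0 * (1 - t / T)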
To get the bonus point(s) you must upload the following to the Canvas
assignment page Assignment 2 Bonus Points:
1. Your code.
2. A PDF document which
- reports on your trained network with the best test accuracy, what
improvements you made and which ones brought the largest gains
(if any!). (Exercise 5.1)
- summarizes the training and search you completed and the final
test accuracies you achieved. (Exercise 5.2)
Remember that you can get at most 5 points for Assignment 2.
References
[Smith, 2015] Smith, L. N. (2015). Cyclical learning rates for training neural
networks. arXiv:1506.01186 [cs.CV].