
CSC413 Assignment 2

Deadline: Nov 7, 2023 by 6pm EST


Submission: Compile and submit a PDF report containing your written solutions. You may also submit an
image of your legible hand-written solutions. Submissions will be done on Markus.
Late Submission: Please see the syllabus for the late submission criteria. You must work individually on this
assignment.

Question 1. Dead Units (3 pts)


Consider the following neural network, where x ∈ R^2, h ∈ R^2, and y ∈ R^2:

\[ h = \mathrm{ReLU}\big(W^{(1)} x + b^{(1)}\big) \]
\[ y = w^{(2)} h + b^{(2)} \]

Suppose also that each element of x is between -1 and 1.

Part (a)
Come up with example values of the parameters W^(1) and b^(1) such that both hidden units h_1 and h_2 are dead.
Answer:
   
\[
W^{(1)} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \end{bmatrix}, \qquad
b^{(1)} = \begin{bmatrix} -1 \\ -1 \end{bmatrix}
\]

Regardless of the input x, the pre-activation is z = W^(1) x + b^(1) = [−1, −1]^T, so each hidden unit equals ReLU(−1) = max(−1, 0) = 0; both h_1 and h_2 are therefore dead.
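As a quick numerical sanity check (a minimal sketch of my own, not part of the assignment; it assumes PyTorch is available), the chosen parameters make both hidden units output zero for every x with entries in [−1, 1]:

```python
import torch

# Parameters from the answer above: W^(1) = 0, b^(1) = [-1, -1]^T.
W1 = torch.zeros(2, 2)
b1 = torch.tensor([-1.0, -1.0])

# Many random inputs with entries in [-1, 1].
x = torch.rand(1000, 2) * 2 - 1

z = x @ W1.T + b1             # pre-activations: always [-1, -1]
h = torch.relu(z)             # hidden activations

print(h.abs().max().item())   # 0.0 -> both h_1 and h_2 are dead for every input
```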

Part (b)
Show that the gradients of y with respect to W^(1) and b^(1) are zero.
Answer:
Let z = W^(1) x + b^(1). Since z_1 = z_2 = −1 < 0 for every valid input, the ReLU derivative ∂h/∂z is zero there, so

\[
\frac{\partial y}{\partial W^{(1)}}
= \frac{\partial y}{\partial h} \cdot \left.\frac{\partial h}{\partial z}\right|_{z_1 = z_2 = -1} \cdot \frac{\partial z}{\partial W^{(1)}}
= w^{(2)} \cdot 0 \cdot x = 0
\]
\[
\frac{\partial y}{\partial b^{(1)}}
= \frac{\partial y}{\partial h} \cdot \left.\frac{\partial h}{\partial z}\right|_{z_1 = z_2 = -1} \cdot \frac{\partial z}{\partial b^{(1)}}
= w^{(2)} \cdot 0 \cdot I = 0
\]
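The same conclusion can be checked with autograd (again a sketch of my own; the values of w^(2) and b^(2) are arbitrary, since they do not affect the result):

```python
import torch

W1 = torch.zeros(2, 2, requires_grad=True)
b1 = torch.full((2,), -1.0, requires_grad=True)
w2 = torch.tensor([0.5, -2.0])   # arbitrary second-layer parameters
b2 = torch.tensor(0.3)

x = torch.rand(2) * 2 - 1        # any input with entries in [-1, 1]
h = torch.relu(W1 @ x + b1)      # h = [0, 0]: both units are dead
y = w2 @ h + b2

y.backward()
print(W1.grad)   # all zeros
print(b1.grad)   # all zeros
```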

Question 2. Dropout (3 pts)


Part (a)
In a dropout layer, instead of “zeroing out” activations at test time, we multiply the weights by 1 − p, where p
is the probability that an activation is set to zero during training. Explain why the multiplication by 1 − p is
necessary for the neural network to make meaningful predictions.
Answer:

During training, each activation is set to zero independently with probability p, so on average only a fraction 1 − p of the activations in a layer survive; this effectively trains a thinner sub-network on each batch. At test time the dropout layer is not applied, so every neuron contributes to the next layer, and each unit's input would be roughly 1/(1 − p) times larger than what the downstream weights saw on average during training. Multiplying the weights by 1 − p at test time keeps the expected value of those inputs the same as during training, so the network makes meaningful predictions without having to adapt to a sudden change in activation scale.
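A small simulation of this expectation argument (a sketch; the layer size, the value p = 0.5, and the positive weights are arbitrary choices of mine):

```python
import torch

torch.manual_seed(0)
p = 0.5                          # probability that an activation is zeroed during training
a = torch.rand(10)               # activations of one layer
w = torch.rand(10)               # positive weights feeding a single output unit

# Training: average the unit's input over many random dropout masks.
masks = (torch.rand(100_000, 10) > p).float()    # each activation kept with prob 1 - p
avg_train_input = ((masks * a) @ w).mean()

test_unscaled = a @ w            # test time, no dropout, unscaled weights
test_scaled = a @ ((1 - p) * w)  # test time with weights scaled by 1 - p

print(avg_train_input.item(), test_scaled.item())   # approximately equal
print(test_unscaled.item())                         # about 1 / (1 - p) times larger
```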

Part (b)
Explain the difference between model.train() and model.eval() modes of evaluating a network in PyTorch. Does the Dropout layer in PyTorch behave differently in these two modes? Feel free to look at the online documentation for PyTorch.
Answer:

model.train() puts the network in training mode: dropout layers are active and randomly zero a fraction of activations on each forward pass, which helps prevent overfitting. model.eval() puts the network in evaluation mode: dropout layers are disabled and pass all activations through unchanged, giving consistent, deterministic outputs during validation, testing, and inference. So yes, the Dropout layer in PyTorch behaves differently in the two modes.
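A minimal demonstration with torch.nn.Dropout (my own sketch; note that PyTorch uses "inverted" dropout, so during training it scales the surviving activations by 1/(1 − p) rather than scaling the weights by 1 − p at test time, and the expected values match either way):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()      # the mode set recursively by model.train()
print(drop(x))    # random zeros; surviving entries scaled to 1 / (1 - 0.5) = 2.0

drop.eval()       # the mode set recursively by model.eval()
print(drop(x))    # identity: all ones, deterministic
```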

Question 3. Bias-variance decomposition (4 pts)


Let D = {(x_i, y_i) | i = 1, ..., n} be a dataset obtained from the true underlying data distribution P, i.e. D ∼ P^n, and let h_D(·) be a classifier trained on D. Show the bias-variance decomposition

\[
\underbrace{E_{D,x,y}\big[(h_D(x) - y)^2\big]}_{\text{Expected test error}}
= \underbrace{E_{D,x}\big[(h_D(x) - \hat{h}(x))^2\big]}_{\text{Variance}}
+ \underbrace{E_{x,y}\big[(\hat{y}(x) - y)^2\big]}_{\text{Noise}}
+ \underbrace{E_{x}\big[(\hat{h}(x) - \hat{y}(x))^2\big]}_{\text{Bias}^2}
\]

where ĥ(x) = E_{D∼P^n}[h_D(x)] is the expected regressor over possible training sets, given the learning algorithm A, and ŷ(x) = E_{y|x}[y] is the expected label given x. As mentioned in the lecture, labels might not be deterministic given x. To carry out the proof, proceed in the following steps:

Part (a)
Show that the following identity holds:
\[
E_{D,x,y}\big[(h_D(x) - y)^2\big]
= E_{D,x}\big[(h_D(x) - \hat{h}(x))^2\big]
+ E_{x,y}\big[(\hat{h}(x) - y)^2\big]
\tag{1}
\]

Answer:
Reformulate (1) by adding and subtracting ĥ(x) inside the square:
\[
\begin{aligned}
E_{D,x,y}\big[(h_D(x) - y)^2\big]
&= E_{D,x,y}\Big[\big((h_D(x) - \hat{h}(x)) + (\hat{h}(x) - y)\big)^2\Big] \\
&= E_{D,x}\big[(h_D(x) - \hat{h}(x))^2\big]
 + 2\,E_{D,x,y}\big[(h_D(x) - \hat{h}(x))(\hat{h}(x) - y)\big]
 + E_{x,y}\big[(\hat{h}(x) - y)^2\big] \\
&= E_{D,x}\big[(h_D(x) - \hat{h}(x))^2\big] + E_{x,y}\big[(\hat{h}(x) - y)^2\big]
\end{aligned}
\]

Note that the second term in the above equation is zero because
\[
\begin{aligned}
E_{D,x,y}\big[(h_D(x) - \hat{h}(x))(\hat{h}(x) - y)\big]
&= E_{x,y}\Big[E_D\big[h_D(x) - \hat{h}(x)\big]\,(\hat{h}(x) - y)\Big] \\
&= E_{x,y}\Big[\big(E_D[h_D(x)] - \hat{h}(x)\big)(\hat{h}(x) - y)\Big] \\
&= E_{x,y}\big[(\hat{h}(x) - \hat{h}(x))(\hat{h}(x) - y)\big] \\
&= E_{x,y}[0] \\
&= 0
\end{aligned}
\]

Part (b)
Next, show
\[
E_{x,y}\big[(\hat{h}(x) - y)^2\big]
= E_{x,y}\big[(\hat{y}(x) - y)^2\big]
+ E_{x}\big[(\hat{h}(x) - \hat{y}(x))^2\big]
\tag{2}
\]

which completes the proof by substituting (2) into (1).

Answer:
Reformulate (2) by adding and subtracting ŷ(x) inside the square:
\[
\begin{aligned}
E_{x,y}\big[(\hat{h}(x) - y)^2\big]
&= E_{x,y}\Big[\big((\hat{h}(x) - \hat{y}(x)) + (\hat{y}(x) - y)\big)^2\Big] \\
&= E_{x}\big[(\hat{h}(x) - \hat{y}(x))^2\big]
 + 2\,E_{x,y}\big[(\hat{h}(x) - \hat{y}(x))(\hat{y}(x) - y)\big]
 + E_{x,y}\big[(\hat{y}(x) - y)^2\big] \\
&= E_{x}\big[(\hat{h}(x) - \hat{y}(x))^2\big] + E_{x,y}\big[(\hat{y}(x) - y)^2\big]
\end{aligned}
\]

Note that the second term in the above equation is also zero because
\[
\begin{aligned}
E_{x,y}\big[(\hat{h}(x) - \hat{y}(x))(\hat{y}(x) - y)\big]
&= E_{x}\Big[E_{y|x}\big[(\hat{h}(x) - \hat{y}(x))(\hat{y}(x) - y)\big]\Big] \\
&= E_{x}\Big[(\hat{h}(x) - \hat{y}(x))\,E_{y|x}\big[\hat{y}(x) - y\big]\Big] \\
&= E_{x}\Big[(\hat{h}(x) - \hat{y}(x))\big(\hat{y}(x) - E_{y|x}[y]\big)\Big] \\
&= E_{x}\big[(\hat{h}(x) - \hat{y}(x))(\hat{y}(x) - \hat{y}(x))\big] \\
&= E_{x}[0] \\
&= 0
\end{aligned}
\]
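The decomposition can also be verified numerically with a small Monte Carlo simulation (a sketch under assumptions of my own: a sin-plus-Gaussian-noise data distribution, a degree-1 polynomial regressor fit by least squares, and x evaluated on a uniform grid; none of this is specified in the assignment). The two printed values should agree up to Monte Carlo error, since the identity holds exactly in expectation.

```python
import numpy as np

rng = np.random.default_rng(0)
NOISE_STD = 0.3

def sample_dataset(n=20):
    # Data distribution P: x ~ Uniform(0, 1), y = sin(2*pi*x) + Gaussian noise.
    x = rng.uniform(0, 1, n)
    y = np.sin(2 * np.pi * x) + rng.normal(0, NOISE_STD, n)
    return x, y

xs = np.linspace(0, 1, 200)          # fixed grid of test inputs x
y_bar = np.sin(2 * np.pi * xs)       # yhat(x) = E[y | x]
noise = NOISE_STD ** 2               # E_{x,y}[(yhat(x) - y)^2]

# h_D: a degree-1 polynomial fit on each sampled dataset D.
preds = np.array([np.polyval(np.polyfit(*sample_dataset(), 1), xs)
                  for _ in range(2000)])
h_bar = preds.mean(axis=0)           # hhat(x) = E_D[h_D(x)]

variance = ((preds - h_bar) ** 2).mean()     # E_{D,x}[(h_D(x) - hhat(x))^2]
bias_sq = ((h_bar - y_bar) ** 2).mean()      # E_x[(hhat(x) - yhat(x))^2]

# Expected test error: fresh noisy labels for each trained regressor.
test_errors = [((pred - (y_bar + rng.normal(0, NOISE_STD, xs.shape))) ** 2).mean()
               for pred in preds]
expected_error = np.mean(test_errors)

print(f"variance + noise + bias^2 = {variance + noise + bias_sq:.4f}")
print(f"expected test error       = {expected_error:.4f}")   # approximately equal
```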

Part (c)
Explain in a sentence or two what overfitting means and which term in this formula represents it.
Answer:

Overfitting means the model fits the training data too closely, so it gives accurate predictions on the training data but not on new data. When the model overfits, the variance term of this formula will be very high and the bias term will be very low.
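To connect this to the simulation sketched after Part (b) (same made-up data distribution and sample size, still just an illustrative assumption of mine): replacing the degree-1 fit with a much more flexible degree-9 polynomial on the same 20-point datasets drives the bias term down and the variance term up, which is the overfitting regime described above.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(0, 1, 200)
y_bar = np.sin(2 * np.pi * xs)        # yhat(x) for the made-up data distribution

def variance_and_bias_sq(degree, trials=2000, n=20):
    preds = []
    for _ in range(trials):
        x = rng.uniform(0, 1, n)
        y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
        preds.append(np.polyval(np.polyfit(x, y, degree), xs))
    preds = np.array(preds)
    h_bar = preds.mean(axis=0)
    return ((preds - h_bar) ** 2).mean(), ((h_bar - y_bar) ** 2).mean()

for degree in (1, 9):
    var, bias_sq = variance_and_bias_sq(degree)
    print(f"degree {degree}: variance = {var:.3f}, bias^2 = {bias_sq:.3f}")
```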
