
Introduction to Deep Learning (I2DL)

Mock Exam - Solutions


IN2346 - SoSe 2020
Technical University of Munich

Problem                 Full Points    Your Score
1 Multiple Choice                10
2 Short Questions                12
3 Backpropagation                 9

Total                            31

Total Time: 31 Minutes


Allowed Resources: None

The purpose of this mock exam is to give you an idea of the type of problems and the
structure of the final exam. The mock exam is not graded. The final exam will most
probably be composed of 90 graded points with a total time of 90 minutes.

Multiple Choice Questions:


• For all multiple choice questions, any number of answers can be correct, i.e. zero (!),
one, or multiple.

• For each question, you’ll receive 2 points if all boxes are answered correctly (i.e. correct
answers are checked, wrong answers are not checked) and 0 otherwise.

How to Check a Box:


• Please cross the respective box: (interpreted as checked)

• If you change your mind, please fill the box: (interpreted as not checked)

• If you change your mind again, please circle the box: (interpreted as checked)

Part I: Multiple Choice (10 points)


1. (2 points) To avoid overfitting, you can...

[ ] increase the size of the network.
[x] use data augmentation.
[ ] use Xavier initialization.
[x] stop training earlier.

2. (2 points) What is true about Dropout?

[ ] The training process is faster and more stable to initialization when using Dropout.
[ ] You should not use Leaky ReLU as non-linearity when using Dropout.
[x] Dropout acts as regularization.
[x] Dropout is applied differently during training and testing.
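For illustration, a minimal NumPy sketch of the last point (hypothetical helper name dropout, not from the exam), assuming inverted dropout: during training, units are randomly zeroed and the survivors rescaled by 1/(1-p); at test time the layer is the identity.

import numpy as np

def dropout(x, p=0.5, train=True):
    # Inverted dropout: only active during training.
    if not train:
        return x                                         # test time: identity
    mask = (np.random.rand(*x.shape) > p) / (1.0 - p)    # keep with prob. 1-p, rescale
    return x * mask

x = np.ones((2, 4))
print(dropout(x, train=True))    # some entries are 0, the surviving ones are 2.0
print(dropout(x, train=False))   # unchanged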
3. (2 points) What is true about Batch Normalization?

[x] Batch Normalization uses two trainable parameters that allow the network to undo the normalization effect of this layer if needed.
[x] Batch Normalization makes the gradients more stable so that we can train deeper networks.
[x] At test time, Batch Normalization uses a mean and variance computed on training samples to normalize the data.
[x] Batch Normalization has learnable parameters.
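A schematic NumPy sketch of how these points fit together (hypothetical names batchnorm_forward, gamma, beta; not from the exam): gamma and beta are the two learnable parameters, and the test-time branch normalizes with running statistics estimated on the training data.

import numpy as np

def batchnorm_forward(x, gamma, beta, running_mean, running_var,
                      train=True, momentum=0.9, eps=1e-5):
    if train:
        mean, var = x.mean(axis=0), x.var(axis=0)         # batch statistics
        running_mean = momentum * running_mean + (1 - momentum) * mean
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        mean, var = running_mean, running_var             # statistics from training
    x_hat = (x - mean) / np.sqrt(var + eps)               # normalize
    out = gamma * x_hat + beta                            # learnable scale and shift
    return out, running_mean, running_var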

4. (2 points) Which of the following optimization methods use first order momentum?

[ ] Stochastic Gradient Descent
[x] Adam
[ ] RMSProp
[ ] Gauss-Newton

5. (2 points) Making your network deeper by adding more parametrized layers will always...

[x] slow down training and inference speed.
[ ] reduce the training loss.
[ ] improve the performance on unseen data.
[x] (Optional: make your model sound cooler when bragging about it at parties.)

Part II: Short Questions (12 points)


1. (2 points) You’re training a neural network and notice that the validation error is significantly lower than the training error. Name two possible reasons for this to happen.

Solution:
The model performs better on unseen data than on training data - this should not
happen under normal circumstances. Possible explanations:

• Training and Validation data sets are not from the same distribution
• Error in the implementation
• ...

2. (2 points) You’re working for a cool tech startup that receives thousands of job applications every day, so you train a neural network to automate the entire hiring process. Your model automatically classifies resumes of candidates, and rejects or sends job offers to all candidates accordingly. Which of the following measures is more important for your model? Explain.
Recall = True Positives / Total Positive Samples
Precision = True Positives / Total Predicted Positive Samples

Solution:
Precision: High precision means a low rate of false positives.
False Negatives are okay: since we get "thousands of applications", it's not too bad if
we miss a few candidates even when they'd be a good fit. However, we don't want
False Positives, i.e. offering a job to people who are not well suited.
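A small self-contained sketch of the two measures (hypothetical helper precision_recall, not from the exam); label 1 means "send an offer":

def precision_recall(y_true, y_pred):
    # y_true / y_pred: lists of 0/1 labels, 1 = "send an offer"
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # hurt by false positives
    recall = tp / (tp + fn) if tp + fn else 0.0      # hurt by false negatives
    return precision, recall

print(precision_recall([1, 0, 0, 1, 0], [1, 1, 0, 0, 0]))   # (0.5, 0.5)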

3. (2 points) You’re training a neural network for image classification with a very large dataset. Your friend who studies mathematics suggests: "If you used Newton's method for optimization, your neural network would converge much faster than with gradient descent!" Explain whether this statement is true (1p) and discuss potential downsides of following his suggestion (1p).

Solution:
Faster convergence in terms of the number of iterations ("mathematical view"). (1 pt.)
However: computing or approximating the inverse Hessian is computationally very costly and
not feasible for high-dimensional parameter spaces. (1 pt.)
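A back-of-the-envelope illustration of why the Hessian is the bottleneck (illustrative numbers, not from the exam):

# Memory needed just to store a dense Hessian for a network with n parameters:
n = 10_000_000                          # assume ~10M parameters, small by today's standards
print(n * n * 4 / 1e12, "TB")           # n x n float32 Hessian: ~400 TB
print(n * 4 / 1e6, "MB")                # the gradient used by (S)GD: ~40 MB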

4. (2 points) Your colleague trained a neural network using standard stochastic gradient
descent and L2 weight regularization with four different learning rates (shown below)
and plotted the corresponding loss curves (also shown below). Unfortunately, he
forgot which curve belongs to which learning rate. Please assign each of the learning rate
values below to the curve (A/B/C/D) it probably belongs to and explain your thoughts.
learning_rates = [3e-4, 4e-1, 2e-5, 8e-3]

[Figure: "Training Loss history" - training loss (y-axis, roughly 1.9 to 2.4) over iterations 0-140 (x-axis) for Curve A (red), Curve B (blue), Curve C (green) and Curve D (orange).]

Solution:
Curve A: 4e-1 = 0.4 (Learning Rate is way too high)
Curve B: 2e-5 = 0.00002 (Learning Rate is too low)
Curve C: 8e-3 = 0.008 (Learning Rate is too high)
Curve D: 3e-4 = 0.0003 (Good Learning Rate)

5. (1 point) Explain why we need activation functions.

Solution:
Without non-linearities, our network can only learn linear functions, because the
composition of linear functions is again linear.
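A short NumPy demonstration of this argument (illustration only, not from the exam): two stacked linear layers without a non-linearity collapse into a single equivalent linear layer.

import numpy as np

np.random.seed(0)
W1, b1 = np.random.randn(4, 3), np.random.randn(4)
W2, b2 = np.random.randn(2, 4), np.random.randn(2)
x = np.random.randn(3)

y_stacked = W2 @ (W1 @ x + b1) + b2           # two linear layers, no activation
y_single = (W2 @ W1) @ x + (W2 @ b1 + b2)     # one equivalent linear layer
print(np.allclose(y_stacked, y_single))       # True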

6. (3 points) When implementing a neural network layer from scratch, we usually implement a 'forward' and a 'backward' function for each layer. Explain what these functions do, potential variables that they need to save, which arguments they take, and what they return.

Solution:
Forward Function:

• takes output from previous layer, performs operation, returns result (1 pt.)
• caches values needed for gradient computation during backprop (1 pt.)

Backward Function:

• takes upstream gradient, returns all partial derivatives (1 pt.)
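A minimal sketch of what such a pair of functions could look like for an affine (fully connected) layer (hypothetical names affine_forward / affine_backward, not from the exam):

import numpy as np

def affine_forward(x, W, b):
    # Takes the previous layer's output, performs the operation, returns the result.
    out = x @ W + b
    cache = (x, W)                 # values needed later for the gradient computation
    return out, cache

def affine_backward(dout, cache):
    # Takes the upstream gradient, returns all partial derivatives.
    x, W = cache
    dx = dout @ W.T                # gradient w.r.t. the layer input
    dW = x.T @ dout                # gradient w.r.t. the weights
    db = dout.sum(axis=0)          # gradient w.r.t. the bias
    return dx, dW, db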

7. (0 points) Optional: Given a Convolution Layer with 8 filters, a filter size of 6, a stride
of 2, and a padding of 1. For an input feature map of 32 × 32 × 32, what is the output
dimensionality after applying the Convolution Layer to the input?

Solution:
(32 − 6 + 2·1)/2 + 1 = 14 + 1 = 15 (1 pt.)

15 × 15 × 8 (1 pt.)
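The same computation as a small sketch (hypothetical helper conv_output_size, not from the exam):

def conv_output_size(n_in, kernel, stride, pad):
    return (n_in - kernel + 2 * pad) // stride + 1

side = conv_output_size(32, kernel=6, stride=2, pad=1)
print(side, "x", side, "x", 8)   # 15 x 15 x 8 (depth = number of filters)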

Part III: Backpropagation (9 points)


1. (9 points) Given the following neural network with fully connected layers and ReLU
activations, including two input units (i1, i2), four hidden units (h1, h2) and (h3, h4). The
output units are indicated as (o1, o2) and their targets are indicated as (t1, t2). The
weights and biases of the fully connected layers are called w and b with specific sub-descriptors.
[Network diagram: inputs i1, i2 are connected to hidden units h1, h2 via weights w11, w12, w21, w22 and biases b1, b2; h3 = ReLU(h1) and h4 = ReLU(h2); h3, h4 are connected to outputs o1, o2 via weights w31, w32, w41, w42 and biases b3, b4.]

The values of variables are given in the following table:


Variable i1 i2 w11 w12 w21 w22 w31 w32 w41 w42 b1 b2 b3 b4 t1 t2
Value 2.0 -1.0 1.0 -0.5 0.5 -1.0 0.5 -1.0 -0.5 1.0 0.5 -0.5 -1.0 0.5 1.0 0.5

(a) (3 points) Compute the output (o1, o2) with the input (i1, i2) and network parameters
as specified above. Write down all calculations, including intermediate layer results.

Solution:

Forward pass:

h1 = i1 × w11 + i2 × w21 + b1 = 2.0 × 1.0 − 1.0 × 0.5 + 0.5 = 2.0


h2 = i1 × w12 + i2 × w22 + b2 = 2.0 × −0.5 + −1.0 × −1.0 − 0.5 = −0.5
h3 = max(0, h1 ) = h1 = 2
h4 = max(0, h2 ) = 0
o1 = h3 × w31 + h4 × w41 + b3 = 2 × 0.5 + 0 × −0.5 − 1.0 = 0
o2 = h3 × w32 + h4 × w42 + b4 = 2 × −1.0 + 0 × 1.0 + 0.5 = −1.5
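A small numerical cross-check of this forward pass in NumPy (illustration only, not from the exam), using the values from the table:

import numpy as np

i = np.array([2.0, -1.0])                       # inputs i1, i2
W1 = np.array([[1.0, -0.5],                     # row i1: [w11, w12]
               [0.5, -1.0]])                    # row i2: [w21, w22]
b12 = np.array([0.5, -0.5])                     # b1, b2
W2 = np.array([[0.5, -1.0],                     # row h3: [w31, w32]
               [-0.5, 1.0]])                    # row h4: [w41, w42]
b34 = np.array([-1.0, 0.5])                     # b3, b4

h = i @ W1 + b12                                # [h1, h2] = [ 2.0, -0.5]
h_relu = np.maximum(0, h)                       # [h3, h4] = [ 2.0,  0.0]
o = h_relu @ W2 + b34                           # [o1, o2] = [ 0.0, -1.5]
print(h, h_relu, o)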

(b) (1 point) Compute the mean squared error of the output (o1 , o2 ) calculated above
and the target (t1 , t2 ).

Solution:
MSE = 1/2 × (t1 − o1)² + 1/2 × (t2 − o2)² = 0.5 × 1.0 + 0.5 × 4.0 = 2.5
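The same number checked in plain Python (illustration only, not from the exam):

o = [0.0, -1.5]            # outputs from part (a)
t = [1.0, 0.5]             # targets t1, t2
mse = 0.5 * (t[0] - o[0]) ** 2 + 0.5 * (t[1] - o[1]) ** 2
print(mse)                 # 2.5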

(c) (5 points) Update the weight w21 using gradient descent with learning rate 0.1 as
well as the loss computed previously. (Please write down all your computations.)

Solution:

Backward pass (Applying chain rule):

∂MSE/∂w21 = ∂(1/2 (t1 − o1)²)/∂o1 × ∂o1/∂h3 × ∂h3/∂h1 × ∂h1/∂w21
          + ∂(1/2 (t2 − o2)²)/∂o2 × ∂o2/∂h3 × ∂h3/∂h1 × ∂h1/∂w21
        = (o1 − t1) × w31 × 1.0 × i2 + (o2 − t2) × w32 × 1.0 × i2
        = (0 − 1.0) × 0.5 × (−1.0) + (−1.5 − 0.5) × (−1.0) × (−1.0)
        = 0.5 − 2.0 = −1.5

Update using gradient descent:

w21⁺ = w21 − lr × ∂MSE/∂w21 = 0.5 − 0.1 × (−1.5) = 0.65
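A numerical cross-check of this gradient via central differences (illustration only, not from the exam; the helper loss simply re-implements the forward pass and MSE above as a function of w21):

def loss(w21):
    h1 = 2.0 * 1.0 + (-1.0) * w21 + 0.5          # i1*w11 + i2*w21 + b1
    h2 = 2.0 * (-0.5) + (-1.0) * (-1.0) - 0.5    # i1*w12 + i2*w22 + b2
    h3, h4 = max(0.0, h1), max(0.0, h2)          # ReLU
    o1 = h3 * 0.5 + h4 * (-0.5) - 1.0            # h3*w31 + h4*w41 + b3
    o2 = h3 * (-1.0) + h4 * 1.0 + 0.5            # h3*w32 + h4*w42 + b4
    return 0.5 * (1.0 - o1) ** 2 + 0.5 * (0.5 - o2) ** 2

eps = 1e-6
grad = (loss(0.5 + eps) - loss(0.5 - eps)) / (2 * eps)
print(round(grad, 4))                            # approx. -1.5
print(0.5 - 0.1 * grad)                          # updated w21, approx. 0.65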

Additional Space for solutions. Clearly mark the problem your answers are
related to and strike out invalid solutions.
