
Chair of Visual Computing & Artificial Intelligence

Department of Informatics
Technical University of Munich

Note:
• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and to the signature field in the attendance check list.

Introduction to Deep Learning


Exam: IN2346 / Endterm Date: Tuesday 13th July, 2021
Examiner: Prof. Dr. Matthias Nießner Time: 17:30 – 19:00

Working instructions
• This exam consists of 16 pages with a total of 5 problems.
Please make sure now that you received a complete copy of the exam.

• The total amount of achievable credits in this exam is 91 credits.

• Detaching pages from the exam is prohibited.

• Allowed resources: None

• Do not write with red or green colors

– Page 1 / 16 –
Problem 1 Multiple Choice (18 credits)
Below you can see how you can answer multiple choice questions.

Mark correct answers with a cross ×

To undo a cross, completely fill out the answer option ■

To re-mark an option, use a human-readable marking ×■
• For all multiple choice questions any number of answers, i.e. either zero (!), one or multiple
answers can be correct.

• For each question, you’ll receive 2 points if all boxes are answered correctly (i.e. correct
answers are checked, wrong answers are not checked) and 0 otherwise.

1.1 Which of the following models are unsupervised learning methods?


Auto-Encoder
Maximum Likelihood Estimate   (annotation: debatable, badly phrased)
K-means Clustering
Linear regression

1.2 In which cases would you usually reduce the learning rate when training a neural network?
When the training loss stops decreasing
To reduce memory consumption
After increasing the mini-batch size
After reducing the mini-batch size

1.3 Which techniques will typically decrease your training loss?


Add additional training data
Remove data augmentation
Add batch normalization
Add dropout

1.4 Which techniques will typically decrease your validation loss?


Add dropout
Add additional training data
Remove data augmentation
Use ReLU activations instead of LeakyReLU

– Page 2 / 16 –
1.5 Which of the following are affected by multiplying the loss function by a constant positive
value when using SGD?
Memory consumption during training
Magnitude of the gradient step
Location of minima
Number of mini-batches per epoch

1.6 Which of the following functions are not suitable as activation functions to add non-linearity
to a network?
sin(x)

ReLU(x) − ReLU(−x)

log (ReLU(x) + 1)

log (ReLU(x + 1))

1.7 Which of the following introduce non-linearity in the neural network?


LeakyReLU with α = 0
Convolution
MaxPool
Skip connection

1.8 Compared to the L1 loss, the L2 loss...


is robust to outliers
is costly to compute
has a different optimum
will lead to sparser solutions

1.9 Which of the following datasets are NOT i.i.d. (independent and identically distributed)?
A sequence (toss number, result) of 10,000 coin flips using biased coins with p(toss result = 1) = 0.7

A set of (image, label) pairs where each image is a frame in a video and each label
indicates whether that frame contains humans.
A monthly sample of Munich’s population over the past 100 years
A set of (image, number) pairs where each image is a chest X-ray of a different human
and each number represents the volume of their lungs.

– Page 3 / 16 –
Problem 2 Short Questions (29 credits)

2.1 Explain the idea of data augmentation (1p). Specify 4 different data augmentation techniques you can apply on a dataset of RGB images (2p).
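For illustration, a minimal sketch of four common RGB augmentations, assuming the torchvision package is available:

import torchvision.transforms as T

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                   # random mirroring
    T.RandomRotation(degrees=15),                    # small random rotations
    T.ColorJitter(brightness=0.2, contrast=0.2),     # photometric jitter
    T.RandomResizedCrop(size=64, scale=(0.8, 1.0)),  # random crop, then resize back
])
# augmented = augment(image)   # image: a PIL image (or tensor, in recent torchvision versions)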

2.2 You are training a deep neural network for the task of binary classification using the Binary Cross Entropy loss. What is the expected loss value for the first mini-batch with batch size N = 64 for an untrained, randomly initialized network? Hint: BCE = −(1/N) Σ_i [ y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i) ]
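(One way to reason about this, sketched under the assumption of a standard sigmoid output: with random initialization the logits are close to zero, so ŷ_i ≈ 0.5 for every sample; each term then contributes −log(0.5), and the mini-batch average is roughly ln 2 ≈ 0.69, independent of N = 64.)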

2.3 Explain the differences between ReLU, LeakyReLU and Parametric ReLU.

2.4 How will weights be initialized by Xavier initialization? Which mean and variance will the weights have? Which mean and variance will the output data have?

– Page 4 / 16 –
2.5 Why do we often refer to L2-regularization as "weight decay"? Derive a mathematical expression that includes the weights W, the learning rate η, and the L2 regularization hyperparameter λ to explain your point. (Note: only true in the context of SGD.)
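(For reference, a sketch of the plain SGD update with L2 regularization: minimizing L_total = L + (λ/2)·‖W‖² gives
W ← W − η·(∇_W L + λ·W) = (1 − η·λ)·W − η·∇_W L,
so each step first shrinks, i.e. "decays", the weights by the factor (1 − η·λ).)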

2.6 Given a Convolution Layer in a network with 6 filters, kernel size 5, a stride of 3, and a padding of 2. For an input feature map of shape 28 × 28 × 28, what are the dimensions/shape of the output tensor after applying the Convolution Layer to the input?
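As a general reminder (sketch in standard notation): for input spatial size H, kernel size k, padding p and stride s, a convolution outputs spatial size ⌊(H + 2p − k)/s⌋ + 1, and the number of output channels equals the number of filters. A minimal check, assuming PyTorch is available:

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=28, out_channels=6, kernel_size=5, stride=3, padding=2)
x = torch.randn(1, 28, 28, 28)   # (batch, channels, height, width)
print(conv(x).shape)             # spatial size follows floor((H + 2p - k)/s) + 1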

2.7 You are given a Convolutional Layer with: number of input channels 3, number of filters 5, kernel size 4, stride 2, padding 1. What is the total number of trainable parameters for this layer? Don't forget to consider the bias.
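(As a general reminder, a sketch in standard notation: a convolutional layer with C_in input channels, C_out filters, kernel size k and one bias per filter has (k · k · C_in + 1) · C_out trainable parameters; stride and padding do not change this count.)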

2.8 You are given a fully-connected network with 2 hidden layers, the first of which has 10 neurons, and the second hidden layer contains 5 neurons. Both layers use dropout with probability 0.5. The network classifies gray-scale images of size 8 × 8 pixels as one of 3 different classes. All neurons include a bias. Calculate the total number of trainable parameters in this network.
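A minimal sketch of such a network, assuming PyTorch is available (dropout layers contribute no trainable parameters):

import torch.nn as nn

net = nn.Sequential(
    nn.Flatten(),                 # 8 x 8 gray-scale image -> 64 input features
    nn.Linear(64, 10), nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(10, 5),  nn.ReLU(), nn.Dropout(p=0.5),
    nn.Linear(5, 3),              # 3 output classes
)
print(sum(p.numel() for p in net.parameters()))   # total trainable parameters, biases included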

– Page 5 / 16 –
2.9 "Breaking the symmetry": Why is initializing all weights of a fully-connected layer to the same value problematic?

2.10 Explain the difference between Auto-Encoders and Variational Auto-Encoders.

2.11 Generative Adversarial Networks (GANs): What is the input to the generator network (1 pt)? What are the two inputs to the discriminator (1 pt)?

2.12 Explain how LSTM networks often outperform traditional RNNs. What in their architecture enables this?

2.13 Explain how batch normalization is applied differently between a fully connected layer and a convolutional layer (1 pt). How many learnable parameters does batch normalization contain following (a) a single fully-connected layer (1 pt), and (b) a single convolutional layer with 16 filters (1 pt)?
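A sketch of the two variants, assuming PyTorch is available (the width of 100 features after the fully-connected layer is an arbitrary example value, not given above): BatchNorm1d normalizes per feature, BatchNorm2d per channel, and each normalized dimension gets one scale γ and one shift β.

import torch.nn as nn

bn_fc   = nn.BatchNorm1d(num_features=100)   # after a fully-connected layer with 100 outputs (example width)
bn_conv = nn.BatchNorm2d(num_features=16)    # after a convolutional layer with 16 filters
print(sum(p.numel() for p in bn_fc.parameters()))    # one scale and one shift per feature
print(sum(p.numel() for p in bn_conv.parameters()))  # one scale and one shift per channel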

– Page 6 / 16 –
Problem 3 Convolutions (13 credits)
You are asked to perform per-pixel semantic segmentation on the Cityscapes dataset, which
consists of RGB images of European city streets, and you want to segment the images into 5
classes (vehicle, road, sky, nature, other). You have designed the following network, as seen in
the illustration below:
For clarification of notation: The shape after having applied the operation ‘conv1’ (the first
convolutional layer in the network) is 50x32x32.
You are using 2D convolutions with: stride = 2 , padding = 1 , and kernel_size = 4 .
For the MaxPool operation, you are using: stride = 2 , padding = 0 , and kernel_size = 2 .
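(Illustrative sketch, assuming a 64 × 64 input as in question 3.6: with these hyperparameters each convolution maps a spatial size H to ⌊(H + 2·1 − 4)/2⌋ + 1 = H/2, and each max-pool maps H to ⌊(H − 2)/2⌋ + 1 = H/2, so for example 64 → 32 after 'conv1', consistent with the stated 50x32x32.)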

3.1 What is the shape of the weight matrix of the fully-connected layer 'fc1'? (Ignore the bias)

3.2 Explain the term 'receptive field' (1p). What is the receptive field of one pixel of the activation map after performing the operation 'maxpool1' (1p)? What is the receptive field of a single neuron in the output of layer 'fc1' (1p)?

– Page 7 / 16 –
3.3 You now want to be able to classify finer-grained labels, which comprise 30 classes. What is the minimal change in network architecture needed in order to support this without adding any additional layers?

3.4 Luckily, you found a pre-trained version of this network, which is trained on the original 5 labels (it outputs a tensor of shape 5 × 64 × 64). How can you make use of / build upon this pre-trained network (as a black box) to perform segmentation into 30 classes?

3.5 Luckily, you have gained access to a large dataset of city street images. Unfortunately, these images are not labelled, and you do not have the resources to annotate them. However, how can you still make use of these images to improve your network? Explain the architecture of any networks that you will use and explain how training will be performed. (Note: This question is independent of (3.3) and (3.4).)

3.6 Instead of taking 64 × 64 images as input, you now want to be able to train the network to segment images of arbitrary size > 64. List, explicitly, two different approaches that would allow this. Your new network should support varying image sizes at run-time, without having to be re-trained.

– Page 8 / 16 –
Problem 4 Optimization (13 credits)

4.1 Explain the idea behind the RMSProp optimizer. How does it enable faster convergence than standard SGD? How does it make use of the gradient?
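(For reference, a sketch of the standard RMSProp update, with decay rate β, learning rate η and a small ε for numerical stability:
v ← β·v + (1 − β)·(∇_W L)²   (element-wise running average of squared gradients)
W ← W − η · ∇_W L / (√v + ε)
The per-parameter division rescales the step so that directions with consistently large gradients take smaller steps, and vice versa.)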

4.2 What is the bias correction in the ADAM optimizer? Explain the problem that it fixes.

4.3 You read that when training deeper networks, you may suffer from the vanishing gradients problem. Explain what vanishing gradients are in the context of deep convolutional networks and what the underlying cause of the problem is.

– Page 9 / 16 –
4.4 In the following image you can see a segment of a very deep architecture that uses residual connections. How are residual connections helpful against vanishing gradients? Demonstrate this mathematically by performing a weight update for w0. Make sure to explain how this reduces the effect of vanishing gradients. Hint: Write the mathematical expression for ∂z/∂w0 in terms of all the other weights.

– Page 10 / 16 –
Problem 5 Multi-Class Classification (18 credits)
Note: If you cannot solve a sub-question and need its answer for a calculation in following sub-
questions, mark it as such and use a symbolic placeholder (i.e., the mathematical expression
you could not explicitly calculate + a note that it is missing from the previous question.)

Assume you are given a labeled dataset {X, y}, where each sample x_i belongs to one of C = 10 classes. We denote its corresponding label y_i ∈ {1, ..., 10}. In addition, you can assume each data sample is a row vector.
You are asked to train a classifier for this classification task, namely, a 2-layer fully-connected
network. For a visualization of the setting, refer to the following illustration:

5.1 Why does one use a Softmax activation at the end of such a classification network? What property does it have that makes it a common choice for a classification task?

5.2 For a vector of logits z, the Softmax function σ : R^C → R^C is defined as:

ŷ_i = σ(z)_i = e^{z_i} / Σ_{j=1}^{C} e^{z_j}

where C is the number of classes and z_i is the i-th logit.

A special property of this function is that its derivative can be expressed in terms of the Softmax function itself. How could this be advantageous for training neural networks?
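A minimal NumPy sketch of this function (illustrative; subtracting the maximum logit is a standard trick for numerical stability and does not change the result):

import numpy as np

def softmax(z):
    z = z - np.max(z)        # improves numerical stability
    e = np.exp(z)
    return e / np.sum(e)     # outputs are positive and sum to 1

print(softmax(np.array([1.0, 2.0, 3.0])))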

– Page 11 / 16 –
5.3 Show explicitly how this can be done, by writing ∂ŷ_i/∂z_i in terms of ŷ_i.

5.4 Similarly, show explicitly how this can be done, by writing ∂ŷ_i/∂z_j in terms of ŷ_i and ŷ_j, for i ≠ j.

– Page 12 / 16 –
5.5 Using the Softmax activation, what loss function L(y, ŷ) would you want to minimize to train a network on such a multi-class classification task? Name this loss function (1 pt), and write down its formula (2 pt), for a single sample x, in terms of the network's prediction ŷ and its true label y. Here, you can assume the label y ∈ {0, 1}^C is a one-hot encoded vector:

y_i = 1 if i == true class index, and y_i = 0 otherwise.

5.6 Having done a forward pass with our sample x, we will back-propagate through the network. We want to perform a gradient update for the weight w²_{j,k} (the weight in row j, column k of the second weights' matrix W²). First, use the chain rule to write down the derivative ∂L/∂w_{j,k} as a product of 3 partial derivatives (no need to compute them). For convenience, you can ignore the bias and omit the 2 superscript.

– Page 13 / 16 –
5.7 Now, compute the gradient for the weight w²_{3,1}. For this, you will need to compute each of the partial derivatives you have written above, and perform the multiplication to get the final answer. You can assume the ground-truth label for the sample was true_class = 3. Hint: The derivative of the logarithm is (log t)′ = 1/t.

– Page 14 / 16 –
Additional space for solutions. Clearly mark the (sub)problem your answers are related to and strike out invalid solutions.

– Page 15 / 16 –
– Page 16 / 16 –
