SS 2021
Department of Informatics
Technical University of Munich
Note:
• During the attendance check a sticker containing a unique code will be put on this exam.
• This code contains a unique number that associates this exam with your registration number.
• This number is printed both next to the code and to the signature field in the attendance check list.
Working instructions
• This exam consists of 16 pages with a total of 5 problems.
Please make sure now that you received a complete copy of the exam.
Problem 1 Multiple Choice (18 credits)
Below you can see how to answer multiple-choice questions: check the boxes of all correct answers.
• For each question, you'll receive 2 points if all boxes are answered correctly (i.e., correct answers are checked, wrong answers are not checked) and 0 points otherwise.
1.2 In which cases would you usually reduce the learning rate when training a neural network?
☐ When the training loss stops decreasing
☐ To reduce memory consumption
☐ After increasing the mini-batch size
☐ After reducing the mini-batch size
1.5 Which of the following are affected by multiplying the loss function by a constant positive value when using SGD?
☐ Memory consumption during training
☐ Magnitude of the gradient step
☐ Location of minima
☐ Number of mini-batches per epoch
1.6 Which of the following functions are not suitable as activation functions to add non-linearity to a network?
☐ sin(x)
☐ ReLU(x) − ReLU(−x)
☐ log(ReLU(x) + 1)
1.9 Which of the following datasets are NOT i.i.d. (independent and identically distributed)?
☐ A sequence of (toss number, result) pairs from 10,000 coin flips using biased coins with p(toss result = 1) = 0.7
☐ A set of (image, label) pairs where each image is a frame in a video and each label indicates whether that frame contains humans.
☐ A monthly sample of Munich's population over the past 100 years
☐ A set of (image, number) pairs where each image is a chest X-ray of a different human and each number represents the volume of their lungs.
Problem 2 Short Questions (29 credits)
2.1 Explain the idea of data augmentation (1p). Specify 4 different data augmentation techniques you can apply to a dataset of RGB images (2p).
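For illustration only (an editor's sketch, not part of the exam): four such techniques expressed with torchvision transforms; the parameter values are arbitrary examples.

```python
import torchvision.transforms as T

# A sketch of four common augmentations for RGB images:
# random horizontal flip, random crop with padding,
# color jitter, and random rotation.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                # mirror the image half the time
    T.RandomCrop(size=64, padding=4),             # shift content via a padded crop
    T.ColorJitter(brightness=0.2, contrast=0.2),  # perturb color statistics
    T.RandomRotation(degrees=15),                 # small random rotations
    T.ToTensor(),
])
```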
2.2 You are training a deep neural network for the task of binary classification using the Binary Cross Entropy loss. What is the expected loss value for the first mini-batch with batch size N = 64 for an untrained, randomly initialized network?
Hint: $\mathrm{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$
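Editor's note, as a hedged sanity check: an untrained network is expected to output probabilities around 0.5, so each summand contributes roughly −log(0.5), independent of N = 64.

```latex
% Expected first-batch loss, assuming outputs near 0.5:
\mathrm{BCE} \approx -\frac{1}{N}\sum_{i=1}^{N}\log(0.5) = \log 2 \approx 0.69
```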
2.3 Explain the differences between ReLU, LeakyReLU and Parametric ReLU.
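For reference, a minimal NumPy sketch of the three activations (the 0.01 slope is a common default, not specified by the exam):

```python
import numpy as np

def relu(x):
    # Zero for negative inputs, identity otherwise.
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # Like ReLU, but negative inputs keep a small fixed slope.
    return np.where(x > 0, x, slope * x)

def parametric_relu(x, a):
    # Like LeakyReLU, but the negative slope `a` is a learnable parameter.
    return np.where(x > 0, x, a * x)
```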
2.4 How will weights be initialized by Xavier initialization? Which mean and variance will the weights have? Which mean and variance will the output data have?
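Editor's reminder (conventions vary between the original Glorot/Bengio paper and course notes, so treat the exact variance as an assumption):

```latex
% Xavier initialization: zero-mean weights whose variance is chosen
% so that activations keep roughly unit variance across layers.
\mathbb{E}[W] = 0, \qquad
\operatorname{Var}(W) = \frac{1}{n_{\text{in}}}
\quad \text{or} \quad
\operatorname{Var}(W) = \frac{2}{n_{\text{in}} + n_{\text{out}}}
```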
2.5 Why do we often refer to L2-regularization as "weight decay"? (Note: this equivalence is only true in the context of SGD.) Derive a mathematical expression that includes the weights W, the learning rate η, and the L2 regularization hyperparameter λ to explain your point.
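A sketch of the expected derivation (editor's note): adding the L2 term $\frac{\lambda}{2}\lVert W \rVert^2$ to the loss makes the SGD step shrink the weights multiplicatively.

```latex
W \leftarrow W - \eta\left(\nabla_W \mathcal{L} + \lambda W\right)
  = (1 - \eta\lambda)\,W - \eta\,\nabla_W \mathcal{L}
```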
2.6 Given a Convolution Layer in a network with 6 filters, kernel size 5, a stride of 3, and a padding of 2. For an input feature map of shape 28 × 28 × 28, what are the dimensions/shape of the output tensor after applying the Convolution Layer to the input?
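The standard output-size formula, applied to the numbers above (editor's sketch, not the official solution):

```latex
% Per spatial dimension: floor((W_in + 2P - K) / S) + 1.
W_{\text{out}} = \left\lfloor \frac{28 + 2 \cdot 2 - 5}{3} \right\rfloor + 1 = 10
% With 6 filters, the output tensor would be 6 x 10 x 10.
```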
2.7 You are given a Convolutional Layer with: number of input channels 3, number of filters 5, kernel size 4, stride 2, padding 1. What is the total number of trainable parameters for this layer? Don't forget to consider the bias.
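A counting sketch (editor's note): each filter spans all input channels, plus one bias per filter; stride and padding contribute no parameters.

```latex
5 \cdot (3 \cdot 4 \cdot 4 + 1) = 5 \cdot 49 = 245
```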
2.8 You are given a fully-connected network with 2 hidden layers, the first of which has 10 neurons and the second of which has 5 neurons. Both layers use dropout with probability 0.5. The network classifies gray-scale images of size 8 × 8 pixels as one of 3 different classes. All neurons include a bias. Calculate the total number of trainable parameters in this network.
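A counting sketch (editor's note): the input has 8 · 8 = 64 values, and dropout adds no trainable parameters.

```latex
(64 \cdot 10 + 10) + (10 \cdot 5 + 5) + (5 \cdot 3 + 3) = 650 + 55 + 18 = 723
```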
2.9 "Breaking the symmetry": Why is initializing all weights of a fully-connected layer to the same value problematic?
2.11 Generative Adversarial Networks (GANs): What is the input to the generator network (1 pt)? What are the two inputs to the discriminator (1 pt)?
2.12 Explain how LSTM networks often outperform traditional RNNs. What in their architecture enables this?
2.13 Explain how batch normalization is applied differently between a fully connected layer and a convolutional layer (1 pt). How many learnable parameters does batch normalization contain following (a) a single fully-connected layer (1 pt), and (b) a single convolutional layer with 16 filters (1 pt)?
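A hedged PyTorch sketch of the distinction (the fully-connected feature size of 100 is illustrative): BatchNorm1d learns a scale γ and shift β per feature, BatchNorm2d per channel.

```python
import torch.nn as nn

# After a fully-connected layer with D output features, batch norm
# normalizes each feature over the batch: 2 * D learnable parameters
# (a scale gamma and a shift beta per feature). D = 100 is illustrative.
bn_fc = nn.BatchNorm1d(num_features=100)   # 200 learnable parameters

# After a convolutional layer with 16 filters, batch norm normalizes
# each channel over batch and spatial dimensions: 2 * 16 = 32 parameters.
bn_conv = nn.BatchNorm2d(num_features=16)  # 32 learnable parameters
```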
Problem 3 Convolutions (13 credits)
You are asked to perform per-pixel semantic segmentation on the Cityscapes dataset, which consists of RGB images of European city streets, and you want to segment the images into 5 classes (vehicle, road, sky, nature, other). You have designed the following network, as seen in the illustration below:
[Figure: network architecture illustration not reproduced here.]
For clarification of notation: the shape after having applied the operation 'conv1' (the first convolutional layer in the network) is 50×32×32.
You are using 2D convolutions with: stride = 2, padding = 1, and kernel_size = 4.
For the MaxPool operation, you are using: stride = 2, padding = 0, and kernel_size = 2.
3.1 What is the shape of the weight matrix of the fully-connected layer 'fc1'? (Ignore the bias.)
3.2 Explain the term 'receptive field' (1p). What is the receptive field of one pixel of the activation map after performing the operation 'maxpool1' (1p)? What is the receptive field of a single neuron in the output of layer 'fc1' (1p)?
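Editor's reference: the receptive field grows with each layer's kernel size, scaled by the product of the strides of all earlier layers.

```latex
r_l = r_{l-1} + (k_l - 1)\prod_{i=1}^{l-1} s_i, \qquad r_0 = 1
% Assuming maxpool1 directly follows conv1 (k=4, s=2):
% r = 1 + (4-1) \cdot 1 + (2-1) \cdot 2 = 6, i.e. a 6x6 input patch.
```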
3.3 You now want to be able to classify finer-grained labels, which comprise 30 classes. What is the minimal change in network architecture needed in order to support this without adding any additional layers?
3.4 Luckily, you found a pre-trained version of this network, which is trained on the original 5 labels (it outputs a tensor of shape 5 × 64 × 64). How can you make use of/build upon this pre-trained network (as a black box) to perform segmentation into 30 classes?
3.5 Luckily, you have gained access to a large dataset of city street images. Unfortunately, these images are not labelled, and you do not have the resources to annotate them. How can you still make use of these images to improve your network? Explain the architecture of any networks that you will use and explain how training will be performed. (Note: this question is independent of (3.3) and (3.4).)
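One possible direction (an editor's sketch under the assumption that unsupervised pre-training is intended, not the official solution): train a convolutional autoencoder on the unlabeled images and reuse its encoder to initialize the segmentation network's feature extractor.

```python
import torch.nn as nn

# A minimal convolutional autoencoder sketch for unsupervised
# pre-training on the unlabeled street images. Channel sizes are
# illustrative, not taken from the exam's network.
autoencoder = nn.Sequential(
    # Encoder: downsample 3-channel images to a compact code.
    nn.Conv2d(3, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    # Decoder: upsample back to the input resolution.
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1),
)
# Training would minimize a reconstruction loss, e.g. nn.MSELoss(),
# between the input image and the autoencoder's output.
```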
3.6 Instead of taking 64 × 64 images as input, you now want to be able to train the network to segment images of arbitrary size > 64. List, explicitly, two different approaches that would allow this. Your new network should support varying image sizes at run-time, without having to be re-trained.
Problem 4 Optimization (13 credits)
4.1 Explain the idea behind the RMSProp optimizer. How does it enable faster convergence than standard SGD? How does it make use of the gradient?
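Editor's reference, the standard RMSProp update: a running average of squared gradients rescales each parameter's step, so dimensions with consistently large gradients take smaller steps.

```latex
v_t = \beta\, v_{t-1} + (1 - \beta)\, g_t^2, \qquad
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t
```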
4.2 What is the bias correction in the ADAM optimizer? Explain the problem that it fixes.
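Editor's reference, the standard bias-corrected moment estimates: $m_t$ and $v_t$ start at zero and are therefore biased toward zero early in training; dividing by $1 - \beta^t$ compensates.

```latex
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
```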
4.3 You read that when training deeper networks, you may suffer from the vanishing gradients problem. Explain what vanishing gradients are in the context of deep convolutional networks and the underlying cause of the problem.
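Editor's illustration of the underlying cause: backpropagation multiplies per-layer Jacobians, and if each factor is small (e.g. saturated sigmoids, whose derivative is at most 0.25), the product shrinks exponentially with depth.

```latex
\frac{\partial \mathcal{L}}{\partial W_1}
\propto \prod_{l=2}^{L} \frac{\partial z_l}{\partial z_{l-1}}
\;\longrightarrow\; 0 \quad \text{as } L \text{ grows}
```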
4.4 In the following image you can see a segment of a very deep architecture that uses residual connections. [Figure: residual block illustration not reproduced here.] How are residual connections helpful against vanishing gradients? Demonstrate this mathematically by performing a weight update for $w_0$. Make sure to explain how this reduces the effect of vanishing gradients. Hint: write the mathematical expression for $\frac{\partial z}{\partial w_0}$ w.r.t. all other weights.
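Editor's sketch of the key step: with a skip connection z = F(x) + x, the local Jacobian gains an identity term, so a gradient path survives even where ∂F/∂x is small.

```latex
z = F(x) + x \quad\Longrightarrow\quad
\frac{\partial z}{\partial x} = \frac{\partial F(x)}{\partial x} + 1
```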
Problem 5 Multi-Class Classification (18 credits)
Note: If you cannot solve a sub-question and need its answer for a calculation in following sub-questions, mark it as such and use a symbolic placeholder (i.e., the mathematical expression you could not explicitly calculate, plus a note that it is missing from the previous question).
Assume you are given a labeled dataset {X, y}, where each sample $x_i$ belongs to one of C = 10 classes. We denote its corresponding label $y_i \in \{1, ..., 10\}$. In addition, you can assume each data sample is a row vector.
You are asked to train a classifier for this classification task, namely, a 2-layer fully-connected network. For a visualization of the setting, refer to the following illustration:
[Figure: network illustration not reproduced here.]
5.1 Why does one use a Softmax activation at the end of such a classification network? What property does it have that makes it a common choice for a classification task?

$$\hat{y}_i = \sigma(\vec{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}$$
5.3 Show explicitly how this can be done, by writing $\frac{\partial \hat{y}_i}{\partial z_i}$ in terms of $\hat{y}_i$.
5.4 Similarly, show explicitly how this can be done, by writing $\frac{\partial \hat{y}_i}{\partial z_j}$ in terms of $\hat{y}_i$ and $\hat{y}_j$, for $i \neq j$.
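Editor's reference for the two cases (the standard softmax Jacobian):

```latex
\frac{\partial \hat{y}_i}{\partial z_i} = \hat{y}_i (1 - \hat{y}_i),
\qquad
\frac{\partial \hat{y}_i}{\partial z_j} = -\,\hat{y}_i \hat{y}_j
\quad (i \neq j)
```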
5.5 Using the Softmax activation, what loss function $\mathcal{L}(y, \hat{y})$ would you want to minimize to train a network on such a multi-class classification task? Name this loss function (1 pt) and write down its formula (2 pt), for a single sample x, in terms of the network's prediction $\hat{y}$ and its true label y. Here, you can assume the label $y \in \{0, 1\}^C$ is a one-hot encoded vector:

$$y_i = \begin{cases} 1, & \text{if } i == \text{true class index} \\ 0, & \text{otherwise} \end{cases}$$
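Editor's note (a hedged sketch of the intended answer): the usual choice is the cross-entropy loss, which for a one-hot label reduces to the negative log-probability of the true class.

```latex
\mathcal{L}(y, \hat{y}) = -\sum_{i=1}^{C} y_i \log \hat{y}_i
```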
5.6 Having done a forward pass with our sample x, we will back-propagate through the network. We want to perform a gradient update for the weight $w^{2}_{j,k}$ (the weight which is in row j, column k of the second weights' matrix $W^{2}$). First, use the chain rule to write down the derivative $\frac{\partial \mathcal{L}}{\partial w_{j,k}}$ as a product of 3 partial derivatives (no need to compute them). For convenience, you can ignore the bias and omit the 2 superscript.
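Editor's sketch of the expected decomposition, writing $z_k$ for the k-th logit and $x_j$ for the j-th input to the second layer (both names are assumptions about the figure's notation):

```latex
\frac{\partial \mathcal{L}}{\partial w_{j,k}}
= \frac{\partial \mathcal{L}}{\partial \hat{y}}
\cdot \frac{\partial \hat{y}}{\partial z_k}
\cdot \frac{\partial z_k}{\partial w_{j,k}}
```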
5.7 Now, compute the gradient for the weight $w^{2}_{3,1}$. For this, you will need to compute each of the partial derivatives you have written above, and perform the multiplication to get the final answer. You can assume the ground-truth label for the sample was true_class = 3. Hint: the derivative of the logarithm is $(\log t)' = \frac{1}{t}$.
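Editor's sketch, using the standard softmax-plus-cross-entropy simplification $\frac{\partial \mathcal{L}}{\partial z_k} = \hat{y}_k - y_k$ and assuming $z_1 = \sum_j x_j w_{j,1}$ as in 5.6 (the input notation x is an assumption about the figure):

```latex
\frac{\partial \mathcal{L}}{\partial w_{3,1}}
= (\hat{y}_1 - y_1)\, x_3
= \hat{y}_1\, x_3
\qquad \text{(since true\_class} = 3 \Rightarrow y_1 = 0)
```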
Additional space for solutions. Clearly mark the (sub)problem your answers are related to and strike out invalid solutions.