Solution 21 Winter
Instructions. This pack contains all questions for the final exam. It contains the questions only. Please
use the accompanying answer sheet to provide your answers by blackening out the corresponding squares.
As the exam will be graded by a computer, please make sure to blacken out the whole square; do not use ticks or crosses. During the exam you can use a pencil to fill out the squares, as well as an eraser to edit your answers. After the exam is over, we will collect the question pack and provide you with additional time to blacken out the squares on the answer sheet with a black pen. Nothing written on the pages of the question pack will be collected or marked. Only the separate answer sheet with the filled squares
will be marked.
Please make sure that your answer sheet is clean and all answers are clearly marked by filling the squares
out completely. We reserve the right to classify answers as wrong without further consideration if the sheet
is filled out ambiguously.
Collaboration on the exam is strictly forbidden. You are allowed a summary of two A4 pages and a simple,
non-programmable calculator. The use of any other aid, or any collaboration, will lead to exclusion from the exam and to disciplinary measures by the ETH Zurich disciplinary committee.
Question Types In this exam, you will encounter the following question types.
• Multiple Choice questions with a single answer.
Multiple Choice questions have exactly one correct choice. Depending on the difficulty of the question, 2, 3, or 4 points are awarded if answered correctly, and zero points are awarded if answered incorrectly or not attempted.
We are given a dataset consisting of n labeled training points D = {(x1, y1), . . . , (xn, yn)}, where xi ∈ Rd are the feature vectors and yi ∈ R are the labels. Here samples are generated independently from a distribution p(x, y) for which the following holds: $y_i = x_i^\top w^* + \varepsilon_i$, where the noise terms $\varepsilon_i$ are independent with zero mean and variance $\sigma^2$.
The true underlying parameters w∗ ∈ Rd are unknown. The rows of the design matrix X ∈ Rn×d are the feature vectors xi ∈ Rd. The label vector is denoted by y = (y1, . . . , yn)⊤ ∈ Rn, the noise vector by $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$, and we write $X = U \Lambda^{\frac{1}{2}} V^\top$ for a singular value decomposition of X. In all of Section 1, we assume X is full rank, i.e., rank(X) = min(n, d).
Recall from lecture that the empirical risk is defined as follows:
$$\hat{R}_D(w) \;=\; \sum_{i=1}^{n} (w^\top x_i - y_i)^2 \;=\; \lVert y - Xw \rVert_2^2. \qquad (1)$$
Question 2 [reg2] (3 points) Assume that the feature vectors of our training set are centered, i.e., $\sum_{i=1}^{n} x_i = 0$. Compute the following:
(i.) The empirical covariance matrix of our training data points: $\Sigma \,\triangleq\, \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top$.
(ii.) The covariance matrix of the random vector $\tilde{\varepsilon} \,\triangleq\, U^\top \varepsilon$.
A (i.) $\frac{1}{n} V (\Lambda^{\frac{1}{2}})^\top \Lambda^{\frac{1}{2}} V^\top$ (ii.) $\sigma^2 I_{n \times n}$
B (i.) $\frac{1}{n} U (\Lambda^{\frac{1}{2}})^\top \Lambda^{\frac{1}{2}} U^\top$ (ii.) $\sigma^2 I_{n \times n}$
C (i.) $\frac{1}{n} V (\Lambda^{\frac{1}{2}})^\top \Lambda^{\frac{1}{2}} V^\top$ (ii.) $U$
D (i.) $\frac{1}{n} U (\Lambda^{\frac{1}{2}})^\top \Lambda^{\frac{1}{2}} U^\top$ (ii.) $U$
Question 3 [reg3] (1 point) When n ≥ d, the empirical risk R̂D has a unique minimizer.
A True B False
Question 4 [reg4] (1 point) A local minimizer for the empirical risk R̂D is also a global minimizer.
A True B False
A True B False
Question 6 [reg6] (3 points) We would like to minimize the empirical risk R̂D using gradient descent.
What is the update formula?
A $w_{t+1} = w_t + \eta_t\,(X^\top X w_t - 2 X^\top y)$
B $w_{t+1} = w_t - \eta_t\,(2 X^\top X w_t - 2 X^\top y)$
C $w_{t+1} = w_t + \eta_t\,(2 X w_t - 2 X X^\top y)$
D $w_{t+1} = w_t - \eta_t\,(X w_t - 2 X X^\top y)$
E $w_{t+1} = w_t + \eta_t\,(2 y_i - 2 w_t^\top x_i)\, x_i$, for some randomly chosen i ∈ {1, 2, . . . , n}
F $w_{t+1} = w_t - \eta_t\,(2 y_i - 2 w_t^\top x_i)\, x_i$, for some randomly chosen i ∈ {1, 2, . . . , n}
G $w_{t+1} = w_t + \eta_t\,(y_i - 2 w_t^\top x_i)\, x_i$, for some randomly chosen i ∈ {1, 2, . . . , n}
H $w_{t+1} = w_t - \eta_t\,(2 y_i - w_t^\top x_i)\, x_i$, for some randomly chosen i ∈ {1, 2, . . . , n}
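As a sanity check on the full-batch options, the gradient of Equation (1) is $\nabla \hat{R}_D(w) = 2 X^\top (Xw - y)$, which gradient descent follows with step size $\eta_t$. A minimal numerical sketch (the data, step size, and iteration count below are made up for illustration):

```python
import numpy as np

# Toy data (made up for illustration); rows of X are the feature vectors x_i.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Full-batch gradient descent on ||y - Xw||^2, whose gradient is 2 X^T (Xw - y).
w = np.zeros(d)
eta = 0.005  # constant step size, chosen small enough for this toy problem
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y)
    w = w - eta * grad

# Compare with the least-squares solution.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_ls, atol=1e-5))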
The minimizer of Equation (2) is denoted by $\hat{w}_\lambda \in \mathbb{R}^d$.
Remember that for fixed x ∈ Rd the bias-variance tradeoff can be written as follows:
Question 8 [reg8] (1 point) A bigger λ (Equation 2) reduces the bias term in the bias-variance trade-off.
A True B False
Question 9 [reg9] (1 point) A smaller λ (Equation 2) increases the variance in the bias-variance trade-off.
A True B False
Question 10 [reg10] (1 point) Smaller λ (Equation 2) prevents overfitting to the training data.
A True B False
A True B False
1.3 Classification
Question 12 [classf1] (3 points) Consider linear classification with weights $w = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}$. Predictions take the form $y_{\text{pred}} = \operatorname{sign}(w^\top x)$. Consider the dataset
$$\left\{ \left(\begin{pmatrix} -1 \\ 0 \end{pmatrix}, -1\right),\; \left(\begin{pmatrix} -2 \\ 1 \end{pmatrix}, -1\right),\; \left(\begin{pmatrix} 1 \\ 0 \end{pmatrix}, +1\right),\; \left(\begin{pmatrix} 2 \\ 1 \end{pmatrix}, +1\right) \right\},$$
where the first element in each data point $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ is the feature vector and the second element is its class label $y = \pm 1$. The points are represented in the figure below.
[Figure: scatter plot of the four data points in the $(x_1, x_2)$ plane, with class −1 and class +1 indicated in the legend.]
2 Kernels
Question 14 [kernel1] (2 points) Consider the feature map $\Phi : \mathbb{R} \to \mathbb{R}^3$ defined as $\Phi(x) = \left(x,\, x^2,\, e^x\right)^\top$. Find the kernel $k(x, y)$ associated with $\Phi$.
A $x + x^2 + e^x$
B $xy + e^{x+y}$
C $x + y + x^2 + y^2 + e^{x+y}$
D $x^2 + y^2 + xy + e^{x+y}$
E $xy + (xy)^2 + e^{x+y}$
F $x + y + (xy)^2 + e^{xy}$
G $(xy + 1)^2 + e^{xy}$
H $xy + (xy)^2$
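The kernel associated with a feature map is the inner product $k(x, y) = \Phi(x)^\top \Phi(y)$; a throwaway numerical check for this particular $\Phi$ (not part of the exam itself):

```python
import numpy as np

def phi(x):
    # The feature map from Question 14: Phi(x) = (x, x^2, e^x).
    return np.array([x, x ** 2, np.exp(x)])

def k(x, y):
    # Candidate kernel: x*y + (x*y)^2 + e^(x+y).
    return x * y + (x * y) ** 2 + np.exp(x + y)

rng = np.random.default_rng(1)
for _ in range(5):
    x, y = rng.normal(size=2)
    assert np.isclose(phi(x) @ phi(y), k(x, y))
print("Phi(x)^T Phi(y) agrees with x*y + (x*y)^2 + e^(x+y)")
```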
Question 15 [kernel2] (4 points) Let $x, x' \in \mathbb{R}^3$ and $k(x, x') = (x^\top x' + 1)^2$. What is the minimal dimensionality of a feature map $\phi(x)$ such that $k(x, x') = \phi(x)^\top \phi(x')$?
A 6 B 9 C 10 D 12 E 13 F 15 G 16 H 27
A True B False
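For Question 15, one explicit feature map for $k(x, x') = (x^\top x' + 1)^2$ on $\mathbb{R}^3$ collects the monomials of degree at most two; counting its coordinates gives the dimensionality asked about. A quick check script (the $\sqrt{2}$ scalings are one standard choice, assumed here for illustration):

```python
import numpy as np

def phi(x):
    # One explicit feature map for k(x, x') = (x^T x' + 1)^2 with x in R^3:
    # all monomials of degree <= 2, with sqrt(2) scalings on the mixed terms.
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([
        1.0,                                     # constant
        s * x1, s * x2, s * x3,                  # linear terms
        x1 ** 2, x2 ** 2, x3 ** 2,               # squares
        s * x1 * x2, s * x1 * x3, s * x2 * x3,   # cross terms
    ])

rng = np.random.default_rng(2)
x, xp = rng.normal(size=3), rng.normal(size=3)
assert np.isclose(phi(x) @ phi(xp), (x @ xp + 1) ** 2)
print(phi(x).shape)  # (10,)
```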
For each of the following functions k, decide if it is a valid kernel function (True) or not (False).
Question 17 [kernel4] (1 point) For $x, x' \in \mathbb{R}_+$, define $k(x, x') = \frac{\max(x, x')}{\min(x, x')}$.
A True B False
Question 18 [kernel5] (1 point) For $x, x' \in \mathbb{R}^d$, define $k(x, x') = (x^\top x' + 1)^3 + e^{(x^\top x')}$.
A True B False
Question 19 [pca1] (2 points) What is the empirical covariance $\Sigma_z^*$ of the latent variables $z_i^*$?
Question 20 [pca2] (2 points) We obtain new data by first sampling $s \sim \mathcal{N}(0, \Sigma_z^*)$ in the latent space and then projecting the obtained sample to the original space, $x_{\text{new}} = W^* s$. What is the distribution of $x_{\text{new}}$?
Question 21 [pca3] (2 points) In Figure 1 we plot 1000 points. We apply PCA to those 1000 points.
What is the direction of the principal eigenvector?
[Figure 1: scatter plot of the 1000 points in the $(x, y)$ plane, with four candidate directions A, B, C, and D for the principal eigenvector.]
A Direction A    B Direction B    C Direction C    D Direction D
Question 22 [pca4] (1 point) Figure 2 shows the Swiss roll dataset. All data points in this dataset lie on
a 2D plane that has been wrapped around an axis. Linear (non-kernelized) PCA with k = 2 can explain all
sources of variance in this dataset (C ∗ in Equation 3 is equal to 0).
[Figure 2: 3D scatter plot of the Swiss roll dataset, with axes x, y, and z.]
A True B False
Question 23 [pca5] (1 point) Both the PCA and the k-means problems can be formulated as
$$\min_{W,\, z_1, \ldots, z_n} \; \sum_{i=1}^{n} \lVert W z_i - x_i \rVert_2^2,$$
A True B False
Question 24 [pca6] (1 point) If we use the Gaussian kernel for kernel PCA, we implicitly perform PCA
on an infinite-dimensional feature space.
A True B False
Question 25 [pca7] (1 point) If we use neural network autoencoders with nonlinear activation functions,
we seek to compress the data with a nonlinear map.
A True B False
4 Neural Networks
Question 26 [nn1] (1 point) With depth K = 1, we recover linear regression (with no bias term).
A True B False
Question 27 [nn2] (1 point) For K = 2 there is a unique pair of matrices W1, W2 that minimizes $\ell$.
A True B False
Question 28 [nn3] (1 point) Networks with increasing depth K allow one to model more complex relationships between x and y.
A True B False
A True B False
Explanation: For 2), we can see that the pair of matrices is not unique by taking an invertible matrix $A$ and considering the pair $W_2 A^{-1}$ and $A W_1$, which achieves the same loss as $W_2, W_1$. For 3), this is false since for all $K$ the function class remains the same as in linear regression (it only ever contains functions of the form $w^\top x$). 4) is true since this is necessary to achieve a scalar output.
Question 30 [nn5] (3 points) You plan to train this model with stochastic gradient descent and batch size 1. In each batch, you minimize $\tilde{\ell}_x(F) = (y - F(x))^2$, for a fixed data point $x$. For simplicity, suppose $K = 3$ and $W_3$ is a scalar.
Question 31 [nn6] (2 points) Again consider the loss calculated on a fixed data point $x \in \mathbb{R}^{d_0}$, $\tilde{\ell}_x(F) = (y - F(x))^2$. We use backpropagation to compute $\partial \tilde{\ell}_x / \partial W_1$. Suppose that $z_1(x) = W_1 x \in \mathbb{R}^{d_1}$. From previous steps in the backpropagation algorithm you know that $\partial \tilde{\ell}_x / \partial z_1(x) = a \in \mathbb{R}^{d_1}$. Please assume $a$ and $x$ are column vectors, i.e., $a \in \mathbb{R}^{d_1 \times 1}$ and $x \in \mathbb{R}^{d_0 \times 1}$.
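Since $z_1(x) = W_1 x$ is linear in $W_1$, the chain rule gives the following outer-product form (a worked note, assuming the standard layout $\partial \tilde{\ell}_x / \partial W_1 \in \mathbb{R}^{d_1 \times d_0}$):
$$\Bigl(\frac{\partial \tilde{\ell}_x}{\partial W_1}\Bigr)_{jk} = \sum_{m} \frac{\partial \tilde{\ell}_x}{\partial (z_1)_m}\, \frac{\partial (z_1)_m}{\partial (W_1)_{jk}} = a_j\, x_k \quad\Longrightarrow\quad \frac{\partial \tilde{\ell}_x}{\partial W_1} = a\, x^\top.$$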
Question 32 [nn7] (2 points) Which of the following describes one iteration of a stochastic gradient
descent update (still batch size 1 with single data point x) on W1 with step size α?
A $W_1 \leftarrow W_1 - \alpha\, \partial \tilde{\ell}_x / \partial W_1$
B $W_1 \leftarrow W_1 + \alpha\, \partial \tilde{\ell}_x / \partial W_1$
C $W_1 \leftarrow W_1 - \alpha\, \partial z_1(x) / \partial W_1$
D $W_1 \leftarrow W_1 + \alpha\, \partial z_1(x) / \partial W_1$
Question 33 [nn8] (1 point) Increasing the minibatch size in stochastic gradient descent (SGD) lowers
the variance of gradient estimates (assuming data points in the mini-batch are selected independently).
A True B False
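The statement can also be checked empirically: a minibatch gradient is an average of independent per-example gradients, so its variance scales like $1/B$ for batch size $B$. A small simulation sketch (toy data and numbers are made up):

```python
import numpy as np

# Toy regression data (made up); we look at the gradient estimate at a fixed w.
rng = np.random.default_rng(3)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)
w = np.zeros(d)

def minibatch_gradient(batch_size):
    # Average of batch_size independently sampled per-example gradients.
    idx = rng.integers(0, n, size=batch_size)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size

for batch_size in (1, 10, 100):
    grads = np.stack([minibatch_gradient(batch_size) for _ in range(2000)])
    print(batch_size, grads.var(axis=0).mean())  # shrinks roughly like 1/batch_size
```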
Question 34 [nn9] (1 point) For a minibatch size of 1, SGD is guaranteed to decrease the loss in every
iteration.
A True B False
Question 35 [nn10] (1 point) There exists a fixed learning rate η > 0 such that SGD with momentum is
guaranteed to converge to a global minimum of the empirical risk for any architecture of a neural network
and any dataset.
A True B False
Question 36 [nn11] (1 point) The cross entropy loss is designed for regression tasks, where the goal is to
predict arbitrary real-valued labels.
A True B False
5 Decision Theory
In the following questions, assume that data is generated from some known probabilistic model P (x, y). In
both questions, we use the shorthand p(x) = P (y = +1 | x).
Question 37 [decthe1] (3 points) Assume that we want to train a classifier y = f(x) where labels y take values y ∈ {+1, −1}. We extend the action (label) space and allow the classifier to abstain, i.e., refrain from making a prediction. This extends the label space to y ∈ {+1, −1, r}. In order to make sure the classifier does not always abstain, we introduce a cost c > 0 for an abstention. The resulting 0-1 loss with abstention is given by
$$\ell(f(x), y) = \mathbf{1}_{\{f(x) \neq y\}}\, \mathbf{1}_{\{f(x) \neq r\}} + c\, \mathbf{1}_{\{f(x) = r\}}.$$
A (Bayes) optimal classifier is one that minimizes the expected loss (risk) under the known conditional distribution. For a given input x, for which range of c should the optimal classifier abstain from predicting +1 or −1?
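One way to compare the three possible actions for a given $x$ (a derivation sketch, with $p(x)$ the shorthand defined above): the conditional expected losses are
$$\mathbb{E}[\ell(f(x), y) \mid x] = \begin{cases} 1 - p(x) & \text{if } f(x) = +1, \\ p(x) & \text{if } f(x) = -1, \\ c & \text{if } f(x) = r, \end{cases}$$
so abstaining is optimal exactly when $c < \min\{p(x),\, 1 - p(x)\}$.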
Question 38 [decthe2] (2 points) We want to use regression with the quantile loss to estimate the current
price y of our house given features x, defined as
A A B B C C D D
In this question, we use the (soft) expectation maximization (EM) algorithm to compute a maximum likelihood estimator (MLE) for the average lifetime of light bulbs. We assume the lifetime of a light bulb is exponentially distributed with unknown mean θ > 0, i.e., its cumulative distribution function is given by
$$F(x) = \left(1 - e^{-x/\theta}\right) \mathbf{1}_{\{x \ge 0\}}.$$
We test N + M independent light bulbs in two independent experiments. In the first experiment, we test
the first N light bulbs. Let Y = (Y1 , . . . , YN ), where each random variable Yi represents the exact lifetime
of light bulb i. In the second experiment we test the remaining M bulbs, but we only check the light bulbs
at some fixed time t > 0 and record for each bulb whether it is still working or not.
Let X = (X1 , . . . , XM ), where the random variable Xj = 1 if the bulb j from the second experiment was
still working at time t and 0 if it already expired. We denote by Z = (Z1 , . . . , ZM ) the unobserved lifetime
of the light bulbs from the second experiment.
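For reference (a side note, not taken from the exam text), the exponential distribution with the CDF above has density, survival function, and conditional mean
$$f_\theta(x) = \frac{1}{\theta}\, e^{-x/\theta}\, \mathbf{1}_{\{x \ge 0\}}, \qquad P_\theta(Z > t) = e^{-t/\theta}, \qquad \mathbb{E}_\theta[Z \mid Z > t] = t + \theta,$$
the last identity being the memorylessness property that the E-step computations rely on.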
A $\theta_0 + t$    B $\frac{1}{\theta_0 + t}$    C $t\theta_0 + t$    D $\frac{t}{\theta_0 + t}$
Question 42 [ema4] (2 points) We define the expected complete data log-likelihood $Q(\theta, \theta_0)$ in the usual way, and
$$k \,\triangleq\, \sum_{j=1}^{M} \mathbf{1}_{\{X_j = 1\}}$$
to be the number of light bulbs still working at time t in the second experiment. What is $Q(\theta, \theta_0)$?
Question 43 [gmm1] (1 point) The MLE objective for Gaussian mixture models (GMM) is non-convex
with respect to the clusters' means, covariances, and weights when we have strictly more than one Gaussian
in the mixture.
A True B False
Question 44 [gmm2] (1 point) An EM algorithm can also be used to fit GMMs in the semi-supervised
setting, where some data points are labeled and some are unlabeled.
A True B False
Question 45 [gmm3] (1 point) We fit a GMM to a dataset utilizing the (soft) EM algorithm. We compute
the log-likelihood of the data after each iteration. During this process the log-likelihood of the data never
decreases.
A True B False
Question 46 [gmm4] (2 points) You get 2D scatter plots of 3 different sets of data points (A, B, C respectively, see the figure below). You decide to cluster them with GMMs. You could model the covariance matrices of the two clusters as spherical, unrestricted, or diagonal. For datasets A, B, and C, assign the most appropriate covariance matrix.
[Figure: three 2D scatter plots, panels A, B, and C, each showing two clusters.]
Explanation: B is spherical since the covariance matrix is the identity, whilst C is diagonal since, within
clusters, there is no correlation between x1 and x2.
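If one wanted to reproduce this comparison in code, the three covariance structures correspond to the covariance_type argument of scikit-learn's GaussianMixture; a small sketch on synthetic data (the data and settings are made up, not the datasets from the figure):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# Two synthetic 2D clusters: one roughly spherical, one axis-aligned (diagonal).
X = np.vstack([
    rng.normal(loc=[-3.0, 0.0], scale=[1.0, 1.0], size=(200, 2)),
    rng.normal(loc=[+3.0, 0.0], scale=[0.3, 1.5], size=(200, 2)),
])

# 'spherical', 'diag', and 'full' correspond to spherical, diagonal,
# and unrestricted covariance matrices, respectively.
for cov_type in ("spherical", "diag", "full"):
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type,
                          random_state=0).fit(X)
    print(cov_type, gmm.bic(X))  # compare model fits (lower BIC is better)
```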
You train a generative adversarial network (GAN) with neural network discriminator D and neural network
generator G. Let z ∼ N (0, I) represent the random Gaussian (normal) noise input for G. Here, I is the
identity matrix. The objective during training is given by
Question 47 [gan1] (2 points) Consider a fixed data point x with probability density pdata (x). Suppose the
probability density of x under the (not necessarily optimal) trained generator is pG (x). Moreover, assume
that the trained discriminator D∗ is the optimal discriminator for G, based on the loss above. That is:
C 0    D 1    E Not enough information
Explanation: The second point is a sufficient condition. The third is not sufficient since, even at convergence, D may not be the optimal discriminator given G.
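As a reminder of the standard result this builds on (assuming the usual minimax GAN objective of Goodfellow et al.): for a fixed generator $G$, the pointwise optimal discriminator is
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)},$$
which equals $\tfrac{1}{2}$ wherever $p_G(x) = p_{\text{data}}(x)$.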
GANs can be used for the task of learning a generative model of data. However, GANs are not the only
generative models we have seen in the course. Indicate whether each of the following models is generative
or discriminative.
[Answer sheet: digit grid (0–9) for encoding the student number.]
Please encode your student number on the left, and write your first and last names below.
Firstname and Lastname: ....................................................