Solution 21 Winter
Instructions. This pack contains all questions for the final exam. It contains the questions only. Please
use the accompanying answer sheet to provide your answers by blackening out the corresponding squares.
As the exam will be graded by a computer, please make sure to blacken out the whole square; do not use ticks or crosses. During the exam you can use a pencil to fill out the squares, as well as an eraser to edit your answers. After the exam is over, we will collect the question pack and provide you with additional time to blacken out the squares on the answer sheet with a black pen. Nothing written on the pages of the question pack will be collected or marked. Only the separate answer sheet with the filled squares
will be marked.
Please make sure that your answer sheet is clean and all answers are clearly marked by filling the squares
out completely. We reserve the right to classify answers as wrong without further consideration if the sheet
is filled out ambiguously.
Collaboration on the exam is strictly forbidden. You are allowed a summary of two A4 pages and a simple,
non-programmable calculator. The use of any other aid, or any collaboration, will lead to exclusion from the exam and to disciplinary measures by the ETH Zurich disciplinary committee.
Question Types In this exam, you will encounter the following question types.
• Multiple Choice questions with a single answer.
Multiple Choice questions have exactly one correct choice. Depending on the difficulty of the question, 2, 3, or 4 points are awarded if answered correctly, and zero points are awarded if answered incorrectly or not attempted.
We are given a dataset consisting of n labeled training points D = {(x1, y1), . . . , (xn, yn)}, where xi ∈ Rd are the feature vectors and yi ∈ R are the labels. Here samples are generated independently from a distribution p(x, y) for which the following holds: $y_i = x_i^\top w^* + \varepsilon_i$, where the noise terms $\varepsilon_i$ are independent with zero mean and variance $\sigma^2$.
The true underlying parameters w∗ ∈ Rd are unknown. The rows of the design matrix X ∈ Rn×d are the feature vectors xi ∈ Rd. The label vector is denoted by y = (y1, . . . , yn)⊤ ∈ Rn, the noise vector by $\varepsilon = (\varepsilon_1, \ldots, \varepsilon_n)^\top$, and we write $X = U \Lambda^{\frac{1}{2}} V^\top$ for a singular value decomposition of X. In all of Section 1, we assume X is full rank, i.e., rank(X) = min(n, d).
Recall from lecture that the empirical risk is defined as follows:
$$\hat{R}_D(w) \;=\; \sum_{i=1}^{n} (w^\top x_i - y_i)^2 \;=\; \lVert y - Xw \rVert_2^2. \qquad (1)$$
Question 2 [reg2] (3 points) Assume that the feature vectors of our training set are centered, i.e., $\sum_{i=1}^{n} x_i = 0$. Compute the following:
(i.) The empirical covariance matrix of our training data points: $\Sigma \,\triangleq\, \frac{1}{n} \sum_{i=1}^{n} x_i x_i^\top$.
(ii.) The covariance matrix of the random vector $\tilde{\varepsilon} \,\triangleq\, U^\top \varepsilon$.
A (i.) $\frac{1}{n} V (\Lambda^{\frac{1}{2}})^\top \Lambda^{\frac{1}{2}} V^\top$ (ii.) $\sigma^2 I_{n \times n}$
B (i.) $\frac{1}{n} U (\Lambda^{\frac{1}{2}})^\top \Lambda^{\frac{1}{2}} U^\top$ (ii.) $\sigma^2 I_{n \times n}$
C (i.) $\frac{1}{n} V (\Lambda^{\frac{1}{2}})^\top \Lambda^{\frac{1}{2}} V^\top$ (ii.) $U$
D (i.) $\frac{1}{n} U (\Lambda^{\frac{1}{2}})^\top \Lambda^{\frac{1}{2}} U^\top$ (ii.) $U$
Question 3 [reg3] (1 point) When n ≥ d, the empirical risk R̂D has a unique minimizer.
A True B False
Question 4 [reg4] (1 point) A local minimizer for the empirical risk R̂D is also a global minimizer.
A True B False
A True B False
Question 6 [reg6] (3 points) We would like to minimize the empirical risk R̂D using gradient descent.
What is the update formula?
A $w_{t+1} = w_t + \eta_t\,(X^\top X w_t - 2 X^\top y)$
B $w_{t+1} = w_t - \eta_t\,(2 X^\top X w_t - 2 X^\top y)$
C $w_{t+1} = w_t + \eta_t\,(2 X w_t - 2 X X^\top y)$
D $w_{t+1} = w_t - \eta_t\,(X w_t - 2 X X^\top y)$
E $w_{t+1} = w_t + \eta_t\,(2 y_i - 2 w_t^\top x_i)\, x_i$, for some randomly chosen i ∈ {1, 2, . . . , n}
F $w_{t+1} = w_t - \eta_t\,(2 y_i - 2 w_t^\top x_i)\, x_i$, for some randomly chosen i ∈ {1, 2, . . . , n}
G $w_{t+1} = w_t + \eta_t\,(y_i - 2 w_t^\top x_i)\, x_i$, for some randomly chosen i ∈ {1, 2, . . . , n}
H $w_{t+1} = w_t - \eta_t\,(2 y_i - w_t^\top x_i)\, x_i$, for some randomly chosen i ∈ {1, 2, . . . , n}
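As a sanity check on the full-batch options, the gradient of Equation (1) is $\nabla \hat{R}_D(w) = 2 X^\top (Xw - y)$, which gradient descent follows with step size $\eta_t$. A minimal numerical sketch (the data, step size, and iteration count below are made up for illustration):

```python
import numpy as np

# Toy data (made up for illustration); rows of X are the feature vectors x_i.
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Full-batch gradient descent on ||y - Xw||^2, whose gradient is 2 X^T (Xw - y).
w = np.zeros(d)
eta = 0.005  # constant step size, chosen small enough for this toy problem
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y)
    w = w - eta * grad

# Compare with the least-squares solution.
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_ls, atol=1e-5))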
The minimizer of Equation (2) is denoted by $\hat{w}_\lambda \in \mathbb{R}^d$.
Remember that for fixed x ∈ Rd the bias-variance tradeoff can be written as follows:
Question 8 [reg8] (1 point) A bigger λ (Equation 2) reduces the bias term in the bias-variance trade-off.
A True B False
Question 9 [reg9] (1 point) A smaller λ (Equation 2) increases the variance in the bias-variance trade-off.
A True B False
Question 10 [reg10] (1 point) Smaller λ (Equation 2) prevents overfitting to the training data.
A True B False
A True B False
1.3 Classification
Question 12 [classf1] (3 points) Consider linear classification with weights $w = \begin{pmatrix} w_1 \\ w_2 \end{pmatrix}$. Predictions take the form $y_{\text{pred}} = \operatorname{sign}(w^\top x)$. Consider the dataset
$$\left\{ \left(\begin{pmatrix} -1 \\ 0 \end{pmatrix}, -1\right),\; \left(\begin{pmatrix} -2 \\ 1 \end{pmatrix}, -1\right),\; \left(\begin{pmatrix} 1 \\ 0 \end{pmatrix}, +1\right),\; \left(\begin{pmatrix} 2 \\ 1 \end{pmatrix}, +1\right) \right\},$$
where the first element in each data point $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ is the feature vector and the second element is its class label $y = \pm 1$. The points are represented in the figure below.
[Figure: scatter plot of the four data points in the $(x_1, x_2)$ plane, with class −1 and class +1 indicated in the legend.]
2 Kernels
Question 14 [kernel1] (2 points) Consider the feature map $\Phi : \mathbb{R} \to \mathbb{R}^3$ defined as $\Phi(x) = \left(x,\, x^2,\, e^x\right)^\top$. Find the kernel $k(x, y)$ associated with $\Phi$.
A $x + x^2 + e^x$
B $xy + e^{x+y}$
C $x + y + x^2 + y^2 + e^{x+y}$
D $x^2 + y^2 + xy + e^{x+y}$
E $xy + (xy)^2 + e^{x+y}$
F $x + y + (xy)^2 + e^{xy}$
G $(xy + 1)^2 + e^{xy}$
H $xy + (xy)^2$
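The kernel associated with a feature map is the inner product $k(x, y) = \Phi(x)^\top \Phi(y)$; a throwaway numerical check for this particular $\Phi$ (not part of the exam itself):

```python
import numpy as np

def phi(x):
    # The feature map from Question 14: Phi(x) = (x, x^2, e^x).
    return np.array([x, x ** 2, np.exp(x)])

def k(x, y):
    # Candidate kernel: x*y + (x*y)^2 + e^(x+y).
    return x * y + (x * y) ** 2 + np.exp(x + y)

rng = np.random.default_rng(1)
for _ in range(5):
    x, y = rng.normal(size=2)
    assert np.isclose(phi(x) @ phi(y), k(x, y))
print("Phi(x)^T Phi(y) agrees with x*y + (x*y)^2 + e^(x+y)")
```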
Question 15 [kernel2] (4 points) Let $x, x' \in \mathbb{R}^3$ and $k(x, x') = (x^\top x' + 1)^2$. What is the minimal dimensionality of a feature map $\phi(x)$ such that $k(x, x') = \phi(x)^\top \phi(x')$?
A 6 B 9 C 10 D 12 E 13 F 15 G 16 H 27
A True B False
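For Question 15, one explicit feature map for $k(x, x') = (x^\top x' + 1)^2$ on $\mathbb{R}^3$ collects the monomials of degree at most two; counting its coordinates gives the dimensionality asked about. A quick check script (the $\sqrt{2}$ scalings are one standard choice, assumed here for illustration):

```python
import numpy as np

def phi(x):
    # One explicit feature map for k(x, x') = (x^T x' + 1)^2 with x in R^3:
    # all monomials of degree <= 2, with sqrt(2) scalings on the mixed terms.
    x1, x2, x3 = x
    s = np.sqrt(2.0)
    return np.array([
        1.0,                                     # constant
        s * x1, s * x2, s * x3,                  # linear terms
        x1 ** 2, x2 ** 2, x3 ** 2,               # squares
        s * x1 * x2, s * x1 * x3, s * x2 * x3,   # cross terms
    ])

rng = np.random.default_rng(2)
x, xp = rng.normal(size=3), rng.normal(size=3)
assert np.isclose(phi(x) @ phi(xp), (x @ xp + 1) ** 2)
print(phi(x).shape)  # (10,)
```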
For each of the following functions k, decide if it is a valid kernel function (True) or not (False).
Question 17 [kernel4] (1 point) For $x, x' \in \mathbb{R}_+$, define $k(x, x') = \frac{\max(x, x')}{\min(x, x')}$.
A True B False
Question 18 [kernel5] (1 point) For $x, x' \in \mathbb{R}^d$, define $k(x, x') = (x^\top x' + 1)^3 + e^{(x^\top x')}$.
A True B False
Question 19 [pca1] (2 points) What is the empirical covariance $\Sigma_z^*$ of the latent variables $z_i^*$?
Question 20 [pca2] (2 points) We obtain new data by first sampling $s \sim \mathcal{N}(0, \Sigma_z^*)$ in the latent space and then projecting the obtained sample to the original space, $x_{\text{new}} = W^* s$. What is the distribution of $x_{\text{new}}$?
Question 21 [pca3] (2 points) In Figure 1 we plot 1000 points. We apply PCA to those 1000 points.
What is the direction of the principal eigenvector?
[Figure 1: scatter plot of the 1000 points in the $(x, y)$ plane, with four candidate directions A, B, C, and D for the principal eigenvector.]
A Direction A    B Direction B    C Direction C    D Direction D
Question 22 [pca4] (1 point) Figure 2 shows the Swiss roll dataset. All data points in this dataset lie on
a 2D plane that has been wrapped around an axis. Linear (non-kernelized) PCA with k = 2 can explain all
sources of variance in this dataset (C ∗ in Equation 3 is equal to 0).
[Figure 2: 3D scatter plot of the Swiss roll dataset, with axes x, y, and z.]
A True B False
Question 23 [pca5] (1 point) Both the PCA and the k-means problems can be formulated as
$$\min_{W,\, z_1, \ldots, z_n} \; \sum_{i=1}^{n} \lVert W z_i - x_i \rVert_2^2,$$
A True B False
Question 24 [pca6] (1 point) If we use the Gaussian kernel for kernel PCA, we implicitly perform PCA
on an infinite-dimensional feature space.
A True B False
Question 25 [pca7] (1 point) If we use neural network autoencoders with nonlinear activation functions,
we seek to compress the data with a nonlinear map.
A True B False
4 Neural Networks
Question 26 [nn1] (1 point) With depth K = 1, we recover linear regression (with no bias term).
A True B False
Question 27 [nn2] (1 point) For K = 2 there is a unique pair of matrices W1, W2 that minimizes $\ell$.
A True B False
Question 28 [nn3] (1 point) Networks with increasing depth K allow one to model more complex relationships between x and y.
A True B False
A True B False
Explanation: For 2), we can see that the pair of matrices is not unique by taking an invertible matrix $A$ and considering the pair $W_2 A^{-1}$ and $A W_1$, which achieves the same loss as $W_2, W_1$. For 3), this is false since for all $K$ the function class remains the same as in linear regression (it only ever contains functions of the form $w^\top x$). 4) is true since this is necessary to achieve a scalar output.
Question 30 [nn5] (3 points) You plan to train this model with stochastic gradient descent and batch size 1. In each batch, you minimize $\tilde{\ell}_x(F) = (y - F(x))^2$, for a fixed data point $x$. For simplicity, suppose $K = 3$ and $W_3$ is a scalar.
Question 31 [nn6] (2 points) Again consider the loss calculated on a fixed data point $x \in \mathbb{R}^{d_0}$, $\tilde{\ell}_x(F) = (y - F(x))^2$. We use backpropagation to compute $\partial \tilde{\ell}_x / \partial W_1$. Suppose that $z_1(x) = W_1 x \in \mathbb{R}^{d_1}$. From previous steps in the backpropagation algorithm you know that $\partial \tilde{\ell}_x / \partial z_1(x) = a \in \mathbb{R}^{d_1}$. Please assume $a$ and $x$ are column vectors, i.e., $a \in \mathbb{R}^{d_1 \times 1}$ and $x \in \mathbb{R}^{d_0 \times 1}$.
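Since $z_1(x) = W_1 x$ is linear in $W_1$, the chain rule gives the following outer-product form (a worked note, assuming the standard layout $\partial \tilde{\ell}_x / \partial W_1 \in \mathbb{R}^{d_1 \times d_0}$):
$$\Bigl(\frac{\partial \tilde{\ell}_x}{\partial W_1}\Bigr)_{jk} = \sum_{m} \frac{\partial \tilde{\ell}_x}{\partial (z_1)_m}\, \frac{\partial (z_1)_m}{\partial (W_1)_{jk}} = a_j\, x_k \quad\Longrightarrow\quad \frac{\partial \tilde{\ell}_x}{\partial W_1} = a\, x^\top.$$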
Question 32 [nn7] (2 points) Which of the following describes one iteration of a stochastic gradient
descent update (still batch size 1 with single data point x) on W1 with step size α?
A $W_1 \leftarrow W_1 - \alpha\, \partial \tilde{\ell}_x / \partial W_1$
B $W_1 \leftarrow W_1 + \alpha\, \partial \tilde{\ell}_x / \partial W_1$
C $W_1 \leftarrow W_1 - \alpha\, \partial z_1(x) / \partial W_1$
D $W_1 \leftarrow W_1 + \alpha\, \partial z_1(x) / \partial W_1$
Question 33 [nn8] (1 point) Increasing the minibatch size in stochastic gradient descent (SGD) lowers
the variance of gradient estimates (assuming data points in the mini-batch are selected independently).
A True B False
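The statement can also be checked empirically: a minibatch gradient is an average of independent per-example gradients, so its variance scales like $1/B$ for batch size $B$. A small simulation sketch (toy data and numbers are made up):

```python
import numpy as np

# Toy regression data (made up); we look at the gradient estimate at a fixed w.
rng = np.random.default_rng(3)
n, d = 1000, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(size=n)
w = np.zeros(d)

def minibatch_gradient(batch_size):
    # Average of batch_size independently sampled per-example gradients.
    idx = rng.integers(0, n, size=batch_size)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size

for batch_size in (1, 10, 100):
    grads = np.stack([minibatch_gradient(batch_size) for _ in range(2000)])
    print(batch_size, grads.var(axis=0).mean())  # shrinks roughly like 1/batch_size
```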
Question 34 [nn9] (1 point) For a minibatch size of 1, SGD is guaranteed to decrease the loss in every
iteration.
A True B False
Question 35 [nn10] (1 point) There exists a fixed learning rate η > 0 such that SGD with momentum is
guaranteed to converge to a global minimum of the empirical risk for any architecture of a neural network
and any dataset.
A True B False
Question 36 [nn11] (1 point) The cross entropy loss is designed for regression tasks, where the goal is to
predict arbitrary real-valued labels.
A True B False
5 Decision Theory
In the following questions, assume that data is generated from some known probabilistic model P (x, y). In
both questions, we use the shorthand p(x) = P (y = +1 | x).
Question 37 [decthe1] (3 points) Assume that we want to train a classifier y = f(x) where labels y take values y ∈ {+1, −1}. We extend the action (label) space and allow the classifier to abstain, i.e., refrain from making a prediction. This extends the label space to y ∈ {+1, −1, r}. In order to make sure the classifier does not always abstain, we introduce a cost c > 0 for an abstention. The resulting 0-1 loss with abstention is given by
$$\ell(f(x), y) = \mathbf{1}_{\{f(x) \neq y\}}\, \mathbf{1}_{\{f(x) \neq r\}} + c\, \mathbf{1}_{\{f(x) = r\}}.$$
A (Bayes) optimal classifier is one that minimizes the expected loss (risk) under the known conditional distribution. For a given input x, for which range of c should the optimal classifier abstain from predicting +1 or −1?
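One way to compare the three possible actions for a given $x$ (a derivation sketch, with $p(x)$ the shorthand defined above): the conditional expected losses are
$$\mathbb{E}[\ell(f(x), y) \mid x] = \begin{cases} 1 - p(x) & \text{if } f(x) = +1, \\ p(x) & \text{if } f(x) = -1, \\ c & \text{if } f(x) = r, \end{cases}$$
so abstaining is optimal exactly when $c < \min\{p(x),\, 1 - p(x)\}$.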
Question 38 [decthe2] (2 points) We want to use regression with the quantile loss to estimate the current
price y of our house given features x, defined as
A A B B C C D D
In this question, we use the (soft) expectation maximization (EM) algorithm to compute a maximum likelihood estimator (MLE) for the average lifetime of light bulbs. We assume the lifetime of a light bulb is exponentially distributed with unknown mean θ > 0, i.e., its cumulative distribution function is given by
$$F(x) = \left(1 - e^{-x/\theta}\right) \mathbf{1}_{\{x \ge 0\}}.$$
We test N + M independent light bulbs in two independent experiments. In the first experiment, we test
the first N light bulbs. Let Y = (Y1 , . . . , YN ), where each random variable Yi represents the exact lifetime
of light bulb i. In the second experiment we test the remaining M bulbs, but we only check the light bulbs
at some fixed time t > 0 and record for each bulb whether it is still working or not.
Let X = (X1 , . . . , XM ), where the random variable Xj = 1 if the bulb j from the second experiment was
still working at time t and 0 if it already expired. We denote by Z = (Z1 , . . . , ZM ) the unobserved lifetime
of the light bulbs from the second experiment.
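For reference (a side note, not taken from the exam text), the exponential distribution with the CDF above has density, survival function, and conditional mean
$$f_\theta(x) = \frac{1}{\theta}\, e^{-x/\theta}\, \mathbf{1}_{\{x \ge 0\}}, \qquad P_\theta(Z > t) = e^{-t/\theta}, \qquad \mathbb{E}_\theta[Z \mid Z > t] = t + \theta,$$
the last identity being the memorylessness property that the E-step computations rely on.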
A $\theta_0 + t$    B $\frac{1}{\theta_0 + t}$    C $t\theta_0 + t$    D $\frac{t}{\theta_0 + t}$
Question 42 [ema4] (2 points) We define the expected complete data log-likelihood $Q(\theta, \theta_0)$ in the usual way, and
$$k \,\triangleq\, \sum_{j=1}^{M} \mathbf{1}_{\{X_j = 1\}}$$
to be the number of light bulbs still working at time t in the second experiment. What is $Q(\theta, \theta_0)$?
Question 43 [gmm1] (1 point) The MLE objective for Gaussian mixture models (GMM) is non-convex
with respect to the clusters' means, covariances, and weights when we have strictly more than one Gaussian
in the mixture.
A True B False
Question 44 [gmm2] (1 point) An EM algorithm can also be used to fit GMMs in the semi-supervised
setting, where some data points are labeled and some are unlabeled.
A True B False
Question 45 [gmm3] (1 point) We fit a GMM to a dataset utilizing the (soft) EM algorithm. We compute
the log-likelihood of the data after each iteration. During this process the log-likelihood of the data never
decreases.
A True B False
Question 46 [gmm4] (2 points) You get 2D scatter plots of 3 different sets of data points (A, B, C respectively, see the figure below). You decide to cluster them with GMMs. You could model the covariance matrices of the two clusters as spherical, unrestricted, or diagonal. For datasets A, B, and C, assign the most appropriate covariance matrix.
[Figure: three 2D scatter plots, panels A, B, and C, each showing two clusters.]
Explanation: B is spherical since the covariance matrix is the identity, whilst C is diagonal since, within
clusters, there is no correlation between x1 and x2.
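If one wanted to reproduce this comparison in code, the three covariance structures correspond to the covariance_type argument of scikit-learn's GaussianMixture; a small sketch on synthetic data (the data and settings are made up, not the datasets from the figure):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# Two synthetic 2D clusters: one roughly spherical, one axis-aligned (diagonal).
X = np.vstack([
    rng.normal(loc=[-3.0, 0.0], scale=[1.0, 1.0], size=(200, 2)),
    rng.normal(loc=[+3.0, 0.0], scale=[0.3, 1.5], size=(200, 2)),
])

# 'spherical', 'diag', and 'full' correspond to spherical, diagonal,
# and unrestricted covariance matrices, respectively.
for cov_type in ("spherical", "diag", "full"):
    gmm = GaussianMixture(n_components=2, covariance_type=cov_type,
                          random_state=0).fit(X)
    print(cov_type, gmm.bic(X))  # compare model fits (lower BIC is better)
```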
You train a generative adversarial network (GAN) with neural network discriminator D and neural network
generator G. Let z ∼ N (0, I) represent the random Gaussian (normal) noise input for G. Here, I is the
identity matrix. The objective during training is given by
Question 47 [gan1] (2 points) Consider a fixed data point x with probability density pdata (x). Suppose the
probability density of x under the (not necessarily optimal) trained generator is pG (x). Moreover, assume
that the trained discriminator D∗ is the optimal discriminator for G, based on the loss above. That is:
C 0    D 1    E Not enough information
Explanation: The second point is a sufficient condition. The third is not sufficient since, even at convergence, D may not be the optimal discriminator given G.
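As a reminder of the standard result this builds on (assuming the usual minimax GAN objective of Goodfellow et al.): for a fixed generator $G$, the pointwise optimal discriminator is
$$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)},$$
which equals $\tfrac{1}{2}$ wherever $p_G(x) = p_{\text{data}}(x)$.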
GANs can be used for the task of learning a generative model of data. However, GANs are not the only
generative models we have seen in the course. Indicate whether each of the following models is generative
or discriminative.
[Answer sheet: digit grid (0–9) for encoding the student number.]
Please encode your student number on the left, and write your first and last names below.
Firstname and Lastname: ....................................................