

Introduction to Machine Learning Exam


Questions Pack
January 27, 2022

Time limit: 120 minutes

Instructions. This pack contains all questions for the final exam; it contains the questions only. Please use the accompanying answer sheet to provide your answers by blackening out the corresponding squares. As the exam will be graded by a computer, please make sure to blacken out the whole square; do not use ticks or crosses. During the exam you can use a pencil to fill out the squares, as well as an eraser to edit your answers. After the exam is over, we will collect the question pack and provide you with additional time to blacken out the squares on the answer sheet with a black pen. Nothing written on the pages of the question pack will be collected or marked. Only the separate answer sheet with the filled squares will be marked.

Please make sure that your answer sheet is clean and all answers are clearly marked by filling the squares
out completely. We reserve the right to classify answers as wrong without further consideration if the sheet
is filled out ambiguously.

Collaboration on the exam is strictly forbidden. You are allowed a summary of two A4 pages and a simple,
non-programmable calculator. The use of any other helping material or collaboration will lead to being
excluded from the exam and subjected to disciplinary measures by the ETH Zurich disciplinary committee.

Question Types. In this exam, you will encounter the following question types.
• Multiple Choice questions with a single answer.
Multiple Choice questions have exactly one correct choice. Depending on the difficulty of the question, 2, 3, or 4 points are awarded if answered correctly, and zero points are awarded if answered incorrectly or not attempted.

• True or False questions.
Each True or False question is worth 1 point if answered correctly, and 0 points if answered incorrectly or not attempted.

Not all questions need to be answered correctly to achieve the best grade. There is no negative marking, so you are incentivized to attempt all questions.

1 Regression and Classification

We are given a dataset consisting of $n$ labeled training points $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ are the feature vectors and $y_i \in \mathbb{R}$ are the labels. Here samples are generated independently from a distribution $p(x, y)$ for which the following holds:
$$y = w_*^\top x + \varepsilon, \quad \text{where } \varepsilon \sim \mathcal{N}(0, \sigma^2).$$
The true underlying parameters $w_* \in \mathbb{R}^d$ are unknown. The rows of the design matrix $X \in \mathbb{R}^{n \times d}$ are the feature vectors $x_i \in \mathbb{R}^d$. The label vector is denoted by $y = (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$. In all of Section 1, we assume $X$ is full rank, i.e., $\operatorname{rank}(X) = \min(n, d)$.
Recall from lecture that the empirical risk is defined as follows:
$$\hat{R}_D(w) = \sum_{i=1}^n (w^\top x_i - y_i)^2 = \|y - Xw\|_2^2. \tag{1}$$

The goal is to find $w \in \mathbb{R}^d$ that minimizes the empirical risk.
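As an illustration, the following NumPy sketch generates data according to the model above and evaluates the empirical risk of Equation (1); the problem sizes and noise level are assumed values, not part of the exam statement.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 5, 0.5                      # assumed sizes and noise level

w_star = rng.normal(size=d)                    # unknown true parameters w_*
X = rng.normal(size=(n, d))                    # design matrix with rows x_i
y = X @ w_star + sigma * rng.normal(size=n)    # y_i = w_*^T x_i + eps_i, eps_i ~ N(0, sigma^2)

def empirical_risk(w, X, y):
    """R_hat_D(w) = sum_i (w^T x_i - y_i)^2 = ||y - X w||_2^2."""
    return np.sum((X @ w - y) ** 2)

print(empirical_risk(w_star, X, y))            # risk of the true parameters (noise only)
```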

1.1 Ordinary Least Squares


Let $X$ have a singular value decomposition $X = U\Lambda^{\frac{1}{2}}V^\top$. Here $U \in \mathbb{R}^{n \times n}$ and $V \in \mathbb{R}^{d \times d}$ are orthogonal, and $\Lambda^{\frac{1}{2}} \in \mathbb{R}^{n \times d}$ has the singular values $\sigma_i > 0$ on its diagonal and is zero elsewhere, i.e., $\sigma_i = \Lambda^{\frac{1}{2}}_{i,i}$ or equivalently $\sigma_i^2 = \Lambda_{i,i}$.
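As a quick sanity check on this notation, the sketch below relates numpy.linalg.svd to the factorization $X = U\Lambda^{\frac{1}{2}}V^\top$ used here; how the singular values are padded into an $n \times d$ matrix is an assumption about how the notation is meant.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3                                   # assumed sizes
X = rng.normal(size=(n, d))

# numpy returns X = U @ diag(s) (padded to n x d) @ Vt
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Build Lambda^{1/2} in R^{n x d}: singular values on the diagonal, zero elsewhere
Lam_half = np.zeros((n, d))
Lam_half[:d, :d] = np.diag(s)

# Check X = U Lambda^{1/2} V^T and sigma_i^2 = Lambda_{i,i}
assert np.allclose(X, U @ Lam_half @ Vt)
assert np.allclose(s ** 2, np.diag(Lam_half.T @ Lam_half))
```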

Question 1 [reg1] (4 points) What is the estimator $\hat{w}$ you obtain by minimizing the empirical risk in Equation (1), i.e.,
$$\hat{w} \triangleq \arg\min_w \hat{R}_D(w),$$
in terms of $w_*$, $V$, $\Lambda$, and $\tilde\varepsilon \triangleq U^\top\varepsilon$?

Hint: Since $U$ and $V$ are orthogonal, it holds that $U^\top U = U U^\top = I_{n\times n}$ and $V V^\top = V^\top V = I_{d\times d}$.

A $V\Lambda^{-1}V^\top U\Lambda^{\frac{1}{2}}V^\top(w_* + \tilde\varepsilon)$      E $w_* + V\Lambda^{-\frac{1}{2}\top}\tilde\varepsilon$
B $V\Lambda^{-\top}U^\top(w_* + \tilde\varepsilon)$      F $w_* + V\Lambda^{-\top}\tilde\varepsilon$
C $V\Lambda^{-\frac{1}{2}}V^\top w_* + V\Lambda^{-\frac{1}{2}}\tilde\varepsilon$      G $w_* + V\Lambda^{\top}\tilde\varepsilon$
D $V\Lambda^{-\frac{1}{2}}V^\top w_* + V\Lambda^{-1}\tilde\varepsilon$      H $w_* + V\Lambda^{\frac{1}{2}\top}\tilde\varepsilon$

Question 2 [reg2] (3 points) Assume that the feature vectors of our training set are centered, i.e., $\sum_{i=1}^n x_i = 0$. Compute the following:
(i.) The empirical covariance matrix of our training data points: $\Sigma \triangleq \frac{1}{n}\sum_{i=1}^n x_i x_i^\top$.
(ii.) The covariance matrix of the random vector $\tilde\varepsilon \triangleq U^\top\varepsilon$.

A (i.) $\frac{1}{n}V\Lambda^{\frac{1}{2}\top}\Lambda^{\frac{1}{2}}V^\top$  (ii.) $\sigma^2 I_{n\times n}$      C (i.) $\frac{1}{n}V\Lambda^{\frac{1}{2}\top}\Lambda^{\frac{1}{2}}V^\top$  (ii.) $U$
B (i.) $\frac{1}{n}U\Lambda^{\frac{1}{2}}\Lambda^{\frac{1}{2}\top}U^\top$  (ii.) $\sigma^2 I_{n\times n}$      D (i.) $\frac{1}{n}U\Lambda^{\frac{1}{2}}\Lambda^{\frac{1}{2}\top}U^\top$  (ii.) $U$

For the following statements, decide if they are True or False.

Question 3 [reg3] (1 point) When $n \geq d$, the empirical risk $\hat{R}_D$ has a unique minimizer.

A True B False

Question 4 [reg4] (1 point) A local minimizer for the empirical risk $\hat{R}_D$ is also a global minimizer.

A True B False

Question 5 [reg5] (1 point) When n ≤ d, there always exists w such that Xw = y.

A True B False

Question 6 [reg6] (3 points) We would like to minimize the empirical risk R̂D using gradient descent.
What is the update formula?
A $w_{t+1} = w_t + \eta_t(X^\top X w_t - 2X^\top y)$
B $w_{t+1} = w_t - \eta_t(2X^\top X w_t - 2X^\top y)$
C $w_{t+1} = w_t + \eta_t(2X w_t - 2XX^\top y)$
D $w_{t+1} = w_t - \eta_t(X w_t - 2XX^\top y)$
E $w_{t+1} = w_t + \eta_t(2y_i - 2w_t^\top x_i)x_i$, for some randomly chosen $i \in \{1, 2, \ldots, n\}$
F $w_{t+1} = w_t - \eta_t(2y_i - 2w_t^\top x_i)x_i$, for some randomly chosen $i \in \{1, 2, \ldots, n\}$
G $w_{t+1} = w_t + \eta_t(y_i - 2w_t^\top x_i)x_i$, for some randomly chosen $i \in \{1, 2, \ldots, n\}$
H $w_{t+1} = w_t - \eta_t(2y_i - w_t^\top x_i)x_i$, for some randomly chosen $i \in \{1, 2, \ldots, n\}$

1.2 Ridge Regression


To avoid overfitting to the data, we add a regularization term to the empirical risk and minimize the following objective:
$$l_D(w) \triangleq \hat{R}_D(w) + \lambda\|w\|_2^2 = \sum_{i=1}^n (w^\top x_i - y_i)^2 + \lambda\|w\|_2^2, \qquad \lambda > 0. \tag{2}$$
The minimizer of Equation (2) is denoted by $\hat{w}_\lambda \in \mathbb{R}^d$.

Question 7 [reg7] (2 points) Assume $n > d$. The minimizer $\hat{w}_\lambda$ of Equation (2) in closed form is given by

A $\hat{w}_\lambda = (X^\top X + \lambda I)^{-1} X y$.
B $\hat{w}_\lambda = (X^\top X + \lambda I)^{-1} X^\top y$.
C $\hat{w}_\lambda = (XX^\top + \lambda I)^{-1} X y$.
D $\hat{w}_\lambda = (XX^\top + \lambda I)^{-1} X^\top y$.
E there is no closed form solution.

Remember that for fixed $x \in \mathbb{R}^d$ the bias-variance tradeoff can be written as follows:
$$\mathbb{E}_{D,\varepsilon}[(y - \hat{w}_\lambda^\top x)^2] = \bigl(\mathbb{E}_D[\hat{w}_\lambda^\top x - w_*^\top x]\bigr)^2 + \operatorname{Var}_D[\hat{w}_\lambda^\top x] + \sigma^2.$$
The first term is called the bias term, the second term is called the variance term, and the third term is the irreducible noise.
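The following Monte Carlo sketch estimates the bias and variance terms of this decomposition for a fixed test point over many resampled training sets; the problem sizes, noise level, and the helper fit_ridge (using the standard ridge closed form) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma, trials = 30, 5, 0.5, 2000       # assumed sizes
w_star = rng.normal(size=d)
x_test = rng.normal(size=d)

def fit_ridge(X, y, lam):
    # standard ridge closed form; solve() avoids forming an explicit inverse
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in [0.01, 1.0, 100.0]:
    preds = np.empty(trials)
    for trial in range(trials):
        X = rng.normal(size=(n, d))
        y = X @ w_star + sigma * rng.normal(size=n)
        preds[trial] = fit_ridge(X, y, lam) @ x_test
    bias2 = (preds.mean() - w_star @ x_test) ** 2   # squared bias term
    var = preds.var()                               # variance term
    print(f"lambda={lam:7.2f}  bias^2={bias2:.4f}  variance={var:.4f}")
```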

Question 8 [reg8] (1 point) A bigger λ (Equation 2) reduces the bias term in the bias-variance trade-off.

A True B False

Question 9 [reg9] (1 point) A smaller λ (Equation 2) increases the variance in the bias-variance trade-off.

A True B False

Question 10 [reg10] (1 point) Smaller λ (Equation 2) prevents overfitting to the training data.

A True B False

Question 11 [reg11] (1 point) The population risk $\mathbb{E}_{D,\varepsilon}[(y - \hat{w}_\lambda^\top x)^2]$ is constant with respect to $\lambda$ (Equation 2):

A True B False

1.3 Classification

 
Question 12 [classf1] (3 points) Consider linear classification with weights $w = (w_1, w_2)^\top$. Predictions take the form $y_{\text{pred}} = \operatorname{sign}(w^\top x)$. Consider the dataset
$$\bigl\{\bigl((-1, 0)^\top, -1\bigr),\ \bigl((-2, 1)^\top, -1\bigr),\ \bigl((1, 0)^\top, +1\bigr),\ \bigl((2, 1)^\top, +1\bigr)\bigr\},$$
where the first element in each data point $x = (x_1, x_2)^\top$ is the feature vector and the second element is its class label $y = \pm 1$. The points are represented in the figure below.

[Figure: scatter plot of the four data points in the $(x_1, x_2)$ plane, with the class $-1$ points on the left and the class $+1$ points on the right.]
The solution (normalized such that $\|w\|_2 = 1$) that classifies all points correctly and achieves the maximum margin is given by (recall that the margin is defined as the minimum distance between all of the data points and the decision boundary of the classifier):
   
A $w = \frac{1}{\sqrt{2}}(-1, 1)^\top$ with margin 2      E $w = (1, 0)^\top$ with margin 1
B $w = \frac{1}{\sqrt{2}}(1, -1)^\top$ with margin 2      F $w = \frac{1}{\sqrt{2}}(1, 1)^\top$ with margin 1
C $w = (0, 1)^\top$ with margin 2      G $w = \frac{1}{\sqrt{5}}(2, 1)^\top$ with margin 5
D $w = (0, 1)^\top$ with margin 1      H $w = \frac{1}{\sqrt{5}}(1, 2)^\top$ with margin 5

Question 13 [classf2] (2 points) Remember that the zero-one loss is given by
$$\ell_{0\text{-}1}(z) = \begin{cases} 0 & z \geq 0 \\ 1 & z < 0. \end{cases}$$

Which property is shared between the following “surrogate” loss functions?

hinge: $\ell_{\text{hinge}}(z) = \max\{0, 1 - z\}$
squared: $\ell_{\text{sq}}(z) = (1 - z)^2$
logistic: $\ell_{\text{logistic}}(z) = \ln(1 + e^{-z})/\ln(2)$
exponential: $\ell_{\exp}(z) = e^{-z}$

A Each one is an upper bound for the 0-1 loss.


B Each one is a lower bound for the 0-1 loss.
C Each one is differentiable on its whole domain.
D They are equally robust to outliers.

2 Kernels

Question 14 [kernel1] (2 points) Consider the feature map $\Phi : \mathbb{R} \to \mathbb{R}^3$ defined as $\Phi(x) = (x, x^2, e^x)^\top$. Find the kernel $k(x, y)$ associated with $\Phi$.

A $x + x^2 + e^x$      E $xy + (xy)^2 + e^{x+y}$
B $xy + e^{x+y}$      F $x + y + (xy)^2 + e^{xy}$
C $x + y + x^2 + y^2 + e^{x+y}$      G $(xy + 1)^2 + e^{xy}$
D $x^2 + y^2 + xy + e^{x+y}$      H $xy + (xy)^2$

Question 15 [kernel2] (4 points) Let $x, x' \in \mathbb{R}^3$ and $k(x, x') = (x^\top x' + 1)^2$. What is the minimal dimensionality of a feature map $\phi(x)$, such that $k(x, x') = \phi(x)^\top \phi(x')$?

A 6   B 9   C 10   D 12   E 13   F 15   G 16   H 27

Question 16 [kernel3] (1 point) Is the following statement True or False?

For every valid kernel $k(x, x')$, $k(x, x') \geq 0$ for all $x$ and $x'$.
For every valid kernel k(x, x0 ), k(x, x0 ) ≥ 0 for all x and x0 .

A True B False

For each of the following functions k, decide if it is a valid kernel function (True) or not (False).

Question 17 [kernel4] (1 point) For $x, x' \in \mathbb{R}^+$, define $k(x, x') = \frac{\max(x, x')}{\min(x, x')}$.

A True B False

Question 18 [kernel5] (1 point) For $x, x' \in \mathbb{R}^d$, define $k(x, x') = (x^\top x' + 1)^3 + e^{x^\top x'}$.

A True B False

3 Dimension Reduction with PCA

In (linear) principal component analysis (PCA), we map the data points $x_i \in \mathbb{R}^d$, $i = 1, \ldots, n$, to $z_i \in \mathbb{R}^k$, $k \ll d$, by solving the following optimization problem:
$$C^* = \min_{\substack{W \in \mathbb{R}^{d\times k},\, W^\top W = I_k \\ z_1, \ldots, z_n \in \mathbb{R}^k}} \frac{1}{n}\sum_{i=1}^n \|W z_i - x_i\|_2^2. \tag{3}$$
We denote by $W_*, z_1^*, \ldots, z_n^*$ the optimal solution of Equation (3). For all questions in Section 3, assume the data points are centered, i.e., $\sum_i x_i = 0$. Therefore, the empirical covariance of the data is as follows: $\Sigma_x = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top$.
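As an illustration of how problem (3) is typically solved in practice (sizes are assumed; this is a sketch, not a prescribed implementation), $W_*$ can be taken as the top-$k$ eigenvectors of $\Sigma_x$ and $z_i^* = W_*^\top x_i$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, k = 200, 5, 2                               # assumed sizes
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))
X -= X.mean(axis=0)                               # center the data as required above

Sigma_x = X.T @ X / n                             # empirical covariance
eigvals, eigvecs = np.linalg.eigh(Sigma_x)        # ascending eigenvalues
W = eigvecs[:, ::-1][:, :k]                       # top-k principal directions (columns)

Z = X @ W                                         # latent codes z_i^* = W^T x_i (as rows)
C = np.mean(np.sum((Z @ W.T - X) ** 2, axis=1))   # objective value of Eq. (3)
print(C, eigvals[::-1][k:].sum())                 # matches the sum of discarded eigenvalues
```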

Question 19 [pca1] (2 points) What is the empirical covariance $\Sigma_z^*$ of the latent variables $z_i^*$?

A $\Sigma_z^* = \Sigma_x W_*$   B $\Sigma_z^* = W_* \Sigma_x W_*^\top$   C $\Sigma_z^* = W_*^\top \Sigma_x$   D $\Sigma_z^* = W_*^\top \Sigma_x W_*$

Question 20 [pca2] (2 points) We obtain new data by first sampling $s \sim \mathcal{N}(0, \Sigma_z^*)$ in the latent space and then projecting the obtained sample to the original space, $x_{\text{new}} = W_* s$. What is the distribution of $x_{\text{new}}$?

A $\mathcal{N}(0, W_* \Sigma_z^* W_*^\top)$   B $\mathcal{N}(0, \Sigma_x)$   C $\mathcal{N}(0, W_*^\top \Sigma_z^* W_*)$   D $\mathcal{N}(0, W_* W_*^\top \Sigma_x)$

Question 21 [pca3] (2 points) In Figure 1 we plot 1000 points. We apply PCA to those 1000 points.
What is the direction of the principal eigenvector?

[Figure 1: scatter plot of the 1000 points in the $(x, y)$ plane, with four candidate directions labeled A, B, C, and D.]

Figure 1: Direction of principal eigenvector.

A A B B C C D D

Question 22 [pca4] (1 point) Figure 2 shows the Swiss roll dataset. All data points in this dataset lie on
a 2D plane that has been wrapped around an axis. Linear (non-kernelized) PCA with k = 2 can explain all
sources of variance in this dataset (C ∗ in Equation 3 is equal to 0).

Figure 2: The Swiss roll dataset.

A True B False

Question 23 [pca5] (1 point) Both the PCA and the k-means problems can be formulated as
$$\min_{W, z_1, \ldots, z_n} \sum_{i=1}^n \|W z_i - x_i\|_2^2,$$
albeit with different constraints on the matrix $W$ and vectors $z_1, \ldots, z_n$.

A True B False

Question 24 [pca6] (1 point) If we use the Gaussian kernel for kernel PCA, we implicitly perform PCA
on an infinite-dimensional feature space.

A True B False

Question 25 [pca7] (1 point) If we use neural network autoencoders with nonlinear activation functions,
we seek to compress the data with a nonlinear map.

A True B False

4 Neural Networks

4.1 Linear Neural Networks


This subsection concerns linear networks. For input $x \in \mathbb{R}^{d_0}$, a deep linear network $F : \mathbb{R}^{d_0} \to \mathbb{R}$ of depth $K$ outputs $F(x) = W_K W_{K-1} \cdots W_1 x$, where each $W_j$ is a matrix of appropriate dimension. We aim to train $F$ to minimize the mean squared error loss on predicting real-valued scalar labels $y$. The loss is specified by
$$\ell(F) = \frac{1}{n}\sum_i (y_i - F(x_i))^2,$$
where $i$ ranges over the dataset.
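For concreteness, here is a minimal sketch of this model and its loss (the layer widths and data are assumed values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
d0, d1, d2, n = 4, 3, 2, 10                  # assumed layer widths and dataset size

# Weight matrices of appropriate dimensions; the last one maps to a scalar output
W1 = rng.normal(size=(d1, d0))
W2 = rng.normal(size=(d2, d1))
W3 = rng.normal(size=(1, d2))

X = rng.normal(size=(n, d0))
y = rng.normal(size=n)

def F(x):
    """Deep linear network F(x) = W3 W2 W1 x (depth K = 3)."""
    return (W3 @ W2 @ W1 @ x).item()

loss = np.mean([(y[i] - F(X[i])) ** 2 for i in range(n)])   # mean squared error l(F)
print(loss)
```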

Question 26 [nn1] (1 point) With depth K = 1, we recover linear regression (with no bias term).

A True B False

Question 27 [nn2] (1 point) For $K = 2$ there is a unique pair of matrices $W_1, W_2$ that minimizes $\ell$.

A True B False

Question 28 [nn3] (1 point) Networks with increasing depth $K$ allow one to model more complex relationships between $x$ and $y$.

A True B False

Question 29 [nn4] (1 point) WK must be a row vector.

A True B False

Explanation: For Question 27, the pair of matrices is not unique: take any invertible matrix $A$ and consider the pair $W_2 A^{-1}$ and $A W_1$, which achieves the same loss as $W_2, W_1$. Question 28 is false, since for all $K$ the function class remains the same as in linear regression (it always only contains functions of the form $w \cdot x$). Question 29 is true, since a row vector is necessary to achieve a scalar output.
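A quick numerical illustration of the non-uniqueness argument above (toy dimensions assumed): rescaling by any invertible $A$ leaves the product, and hence the loss, unchanged.

```python
import numpy as np

rng = np.random.default_rng(5)
W1 = rng.normal(size=(3, 4))                 # assumed shapes
W2 = rng.normal(size=(1, 3))
A = rng.normal(size=(3, 3))                  # generically invertible

# (W2 A^{-1})(A W1) implements exactly the same linear map as W2 W1
assert np.allclose((W2 @ np.linalg.inv(A)) @ (A @ W1), W2 @ W1)
```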

Question 30 [nn5] (3 points) You plan to train this model with stochastic gradient descent and batch size 1. In each batch, you minimize $\tilde\ell_x(F) = (y - F(x))^2$, for a fixed data point $x$. For simplicity, suppose $K = 3$ and $W_3$ is a scalar.

Then, $\partial \tilde\ell_x / \partial W_3$ is equal to

A $-2(y - F(x))(W_2 W_1 x)$.      E $-2(y - F(x))$.
B $2(y - F(x))(W_2 W_1 x)$.      F $2(y - F(x))$.
C $(y - F(x))(W_2 W_1 x)$.      G $(y - F(x))$.
D $-(y - F(x))(W_2 W_1 x)$.      H $-(y - F(x))$.

Explanation: Application of the chain rule.
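One way to spell out that chain-rule computation (with $F(x) = W_3 W_2 W_1 x$ and $W_3$ scalar, as in the question) is:
$$\frac{\partial \tilde\ell_x}{\partial W_3} = \frac{\partial}{\partial W_3}\bigl(y - W_3\,(W_2 W_1 x)\bigr)^2 = -2\bigl(y - F(x)\bigr)\,(W_2 W_1 x).$$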



Question 31 [nn6] (2 points) Again consider the loss calculated on a fixed data point $x \in \mathbb{R}^{d_0}$,
$$\tilde\ell_x(F) = (y - F(x))^2.$$
We use backpropagation to compute $\partial \tilde\ell_x / \partial W_1$. Suppose that $z_1(x) = W_1 x \in \mathbb{R}^{d_1}$. From previous steps in the backpropagation algorithm you know that $\partial \tilde\ell_x / \partial z_1(x) = a \in \mathbb{R}^{d_1}$. Please assume $a$ and $x$ are column vectors, i.e., $a \in \mathbb{R}^{d_1 \times 1}$ and $x \in \mathbb{R}^{d_0 \times 1}$.

Compute $\partial \tilde\ell_x / \partial W_1 \in \mathbb{R}^{d_1 \times d_0}$.

A $2ax^\top$   C $ax^\top$   E $2a(W_1 x)^\top$   G $a(W_1 x)^\top$
B $-2ax^\top$   D $-ax^\top$   F $-2a(W_1 x)^\top$   H $-a(W_1 x)^\top$
  
Explanation: $\frac{\partial \tilde\ell_x}{\partial W_1} = \frac{\partial \tilde\ell_x}{\partial z_1(x)} \cdot \frac{\partial z_1(x)}{\partial W_1} = (a)(x^\top)$; chain rule for backpropagation.
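A small finite-difference check of this identity (toy dimensions assumed): the gradient of $\tilde\ell_x$ with respect to $W_1$ equals the outer product of $a = \partial\tilde\ell_x/\partial z_1(x)$ with $x$.

```python
import numpy as np

rng = np.random.default_rng(6)
d0, d1, d2 = 4, 3, 2                          # assumed layer widths
W1 = rng.normal(size=(d1, d0))
W2 = rng.normal(size=(d2, d1))
W3 = rng.normal(size=(1, d2))
x = rng.normal(size=d0)
y, eps = 1.3, 1e-6

loss_z = lambda z1: ((y - W3 @ W2 @ z1) ** 2).item()       # loss as a function of z1
loss_W1 = lambda W: ((y - W3 @ W2 @ (W @ x)) ** 2).item()  # loss as a function of W1

# a = d loss / d z1 by central finite differences
a = np.array([(loss_z(W1 @ x + eps * e) - loss_z(W1 @ x - eps * e)) / (2 * eps)
              for e in np.eye(d1)])

# d loss / d W1, entry by entry, by central finite differences
grad_W1 = np.zeros_like(W1)
for i in range(d1):
    for j in range(d0):
        E = np.zeros_like(W1)
        E[i, j] = eps
        grad_W1[i, j] = (loss_W1(W1 + E) - loss_W1(W1 - E)) / (2 * eps)

print(np.allclose(grad_W1, np.outer(a, x), atol=1e-4))      # True: the gradient is a x^T
```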

Question 32 [nn7] (2 points) Which of the following describes one iteration of a stochastic gradient descent update (still batch size 1 with single data point $x$) on $W_1$ with step size $\alpha$?

A $W_1 \leftarrow W_1 - \alpha\, \partial \tilde\ell_x / \partial W_1$      C $W_1 \leftarrow W_1 - \alpha\, \partial z_1(x) / \partial W_1$
B $W_1 \leftarrow W_1 + \alpha\, \partial \tilde\ell_x / \partial W_1$      D $W_1 \leftarrow W_1 + \alpha\, \partial z_1(x) / \partial W_1$

Explanation: Substituting into the definition of gradient descent.

4.2 Training Neural Networks


All questions in this subsection are independent from the previous subsection and are regarding training
neural networks.

Question 33 [nn8] (1 point) Increasing the minibatch size in stochastic gradient descent (SGD) lowers
the variance of gradient estimates (assuming data points in the mini-batch are selected independently).

A True B False

Question 34 [nn9] (1 point) For a minibatch size of 1, SGD is guaranteed to decrease the loss in every
iteration.

A True B False

Question 35 [nn10] (1 point) There exists a fixed learning rate η > 0 such that SGD with momentum is
guaranteed to converge to a global minimum of the empirical risk for any architecture of a neural network
and any dataset.

A True B False

Question 36 [nn11] (1 point) The cross entropy loss is designed for regression tasks, where the goal is to
predict arbitrary real-valued labels.

A True B False

5 Decision Theory

In the following questions, assume that data is generated from some known probabilistic model P (x, y). In
both questions, we use the shorthand p(x) = P (y = +1 | x).

Question 37 [decthe1] (3 points) Assume that we want to train a classifier $y = f(x)$ where labels $y$ take values $y \in \{1, -1\}$. We extend the action (label) space and allow the classifier to abstain, i.e., refrain from making a prediction. This extends the label space to $y \in \{+1, -1, r\}$. In order to make sure the classifier does not always abstain, we introduce a cost $c > 0$ for an abstention. The resulting 0-1 loss with abstention is given by:
$$\ell(f(x), y) = \mathbf{1}_{\{f(x) \neq y\}}\mathbf{1}_{\{f(x) \neq r\}} + c\,\mathbf{1}_{\{f(x) = r\}}.$$
A (Bayes) optimal classifier is one that minimizes the expected loss (risk) under the known conditional distribution. For a given input $x$, for which range of $c$ should the optimal classifier abstain from predicting $+1$ or $-1$?

A c < max{p(x), 1 − p(x)} D c > 1 − min{p(x), 1 − p(x)}


B c > min{p(x), 1 − p(x)} E c > 1 − p(x)
C c < min{p(x), 1 − p(x)} F c < p(x)

Question 38 [decthe2] (2 points) We want to use regression with the quantile loss to estimate the current price $y$ of our house given features $x$. The loss is defined as
$$\ell(f(x), y) = \tau \max(y - f(x), 0) + (1 - \tau)\max(f(x) - y, 0).$$
Here, $\tau \in (0, 1)$ is a parameter that balances overestimation and underestimation errors.


As we have enough time to sell the house, overestimation errors of the predictor are less critical than
underestimation errors. Which of the asymmetric loss functions in Figure 3 would you use for the estimation
of the current price of your house?

Figure 3: Different quantile loss functions.

A A B B C C D D

6 Expectation Maximization Algorithm

In this question, we use the (soft) expectation maximization (EM) algorithm to compute a maximum likelihood estimator (MLE) for the average lifetime of light bulbs. We assume the lifetime of a light bulb is exponentially distributed with unknown mean $\theta > 0$, i.e., its cumulative distribution function is given by $F(x) = \bigl(1 - e^{-x/\theta}\bigr)\mathbf{1}_{\{x \geq 0\}}$.

We test N + M independent light bulbs in two independent experiments. In the first experiment, we test
the first N light bulbs. Let Y = (Y1 , . . . , YN ), where each random variable Yi represents the exact lifetime
of light bulb i. In the second experiment we test the remaining M bulbs, but we only check the light bulbs
at some fixed time t > 0 and record for each bulb whether it is still working or not.
Let X = (X1 , . . . , XM ), where the random variable Xj = 1 if the bulb j from the second experiment was
still working at time t and 0 if it already expired. We denote by Z = (Z1 , . . . , ZM ) the unobserved lifetime
of the light bulbs from the second experiment.
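To make the data-generating process concrete, here is a short simulation sketch; the values of $\theta$, $t$, $N$, and $M$ are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(7)
theta, t = 3.0, 2.0            # assumed true mean lifetime and inspection time
N, M = 100, 200                # assumed numbers of bulbs in the two experiments

# First experiment: the exact lifetimes Y_1, ..., Y_N are observed
Y = rng.exponential(scale=theta, size=N)

# Second experiment: lifetimes Z_j are latent; we only observe X_j = 1{bulb j still works at t}
Z = rng.exponential(scale=theta, size=M)
X = (Z > t).astype(int)

print(Y.mean())                        # close to theta
print(X.mean(), np.exp(-t / theta))    # fraction still working vs. survival probability
```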

Question 39 [ema1] (2 points) What is the log-likelihood $\log p(X, Y, Z \mid \theta)$?

A $\log p(X, Y, Z \mid \theta) = -(N + M)\log\theta - \frac{1}{\theta}\sum_{i=1}^N Y_i - \frac{1}{\theta}\sum_{j=1}^M Z_j$
B $\log p(X, Y, Z \mid \theta) = -(N + M)\log\theta - \theta\sum_{i=1}^N Y_i - \theta\sum_{j=1}^M Z_j$
C $\log p(X, Y, Z \mid \theta) = -N\log\theta - \theta\sum_{i=1}^N Y_i - \theta\sum_{j=1}^M Z_j$
D $\log p(X, Y, Z \mid \theta) = -M\log\theta - \frac{1}{\theta}\sum_{i=1}^N Y_i - \frac{1}{\theta}\sum_{j=1}^M Z_j$

Question 40 [ema2] (3 points) What is $E_1(\theta') \triangleq \mathbb{E}[Z_j \mid X_j = 1, \theta']$?

A $\theta' + t$   B $\frac{1}{\theta' + t}$   C $t\theta' + t$   D $\frac{t}{\theta' + t}$

Question 41 [ema3] (2 points) What is $E_0(\theta') \triangleq \mathbb{E}[Z_j \mid X_j = 0, \theta']$?

A $\theta' - \dfrac{t e^{-t/\theta'}}{1 - e^{-t/\theta'}}$   B $\theta' - \dfrac{2t e^{-t/\theta'}}{1 - e^{-t/\theta'}}$   C $\theta' - \dfrac{1 - e^{-t/\theta'}}{t e^{-t/\theta'}}$   D $\dfrac{1}{\theta'} - \dfrac{t e^{-t/\theta'}}{1 - e^{-t/\theta'}}$

Question 42 [ema4] (2 points) We define the expected complete data log-likelihood $Q(\theta, \theta')$ to be
$$Q(\theta, \theta') \triangleq \mathbb{E}_Z[\log p(X, Y, Z \mid \theta) \mid X, Y, \theta']$$
and
$$k \triangleq \sum_{j=1}^M \mathbf{1}_{\{X_j = 1\}}$$
to be the number of light bulbs still working at time $t$ in the second experiment. What is $Q(\theta, \theta')$?

A $Q(\theta, \theta') = -(N + M)\log\theta - \frac{1}{\theta}\sum_{i=1}^N y_i - \frac{k}{\theta}E_1(\theta') - \frac{M-k}{\theta}E_0(\theta')$
B $Q(\theta, \theta') = -(N + M)\log\theta' - \frac{1}{\theta'}\sum_{i=1}^N y_i - \frac{k}{\theta'}E_1(\theta) - \frac{M-k}{\theta'}E_0(\theta)$
C $Q(\theta, \theta') = -(N + M)\log\theta - \theta\sum_{i=1}^N y_i - \theta k E_1(\theta') - \theta(M - k)E_0(\theta')$
D $Q(\theta, \theta') = -(N + M)\log\theta' - \theta'\sum_{i=1}^N y_i - \theta' k E_1(\theta) - \theta'(M - k)E_0(\theta)$

Question 43 [gmm1] (1 point) The MLE objective for Gaussian mixture models (GMM) is non-convex with respect to the clusters' means, covariances, and weights when we have strictly more than one Gaussian in the mixture.

A True B False

Question 44 [gmm2] (1 point) An EM algorithm can also be used to fit GMMs in the semi-supervised
setting, where some data points are labeled and some are unlabeled.

A True B False

Question 45 [gmm3] (1 point) We fit a GMM to a dataset utilizing the (soft) EM algorithm. We compute
the log-likelihood of the data after each iteration. During this process the log-likelihood of the data never
decreases.

A True B False

Question 46 [gmm4] (2 points) You get 2D scatter plots of 3 different sets of data points (A, B, and C, respectively; see the figure below). You decide to cluster them with GMMs. You could model the covariance matrices of the two clusters as spherical, unrestricted, or diagonal. For datasets A, B, and C, assign the most appropriate covariance matrix.

[Figure: three 2D scatter plots of the datasets, panels A, B, and C.]

A A: spherical, B: unrestricted, C: diagonal D A: diagonal, B: spherical, C: unrestricted


B A: spherical, B: diagonal, C: unrestricted E A: unrestricted, B: diagonal, C: spherical
C A: unrestricted, B: spherical, C: diagonal F A: diagonal, B: unrestricted, C: spherical

Explanation: B is spherical since the covariance matrix is the identity, whilst C is diagonal since, within
clusters, there is no correlation between x1 and x2.

7 Generative Adversarial Networks

You train a generative adversarial network (GAN) with neural network discriminator D and neural network
generator G. Let z ∼ N (0, I) represent the random Gaussian (normal) noise input for G. Here, I is the
identity matrix. The objective during training is given by

$$\min_G \max_D\; \mathbb{E}_{x\sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_z[\log(1 - D(G(z)))],$$
where $p_{\text{data}}$ is the data-generating distribution.
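As a toy numerical illustration of this objective (the particular discriminator, generator, data distribution, and sample size below are all assumptions, not trained models), the two expectations can be estimated from samples:

```python
import numpy as np

rng = np.random.default_rng(8)
m = 10_000                                         # assumed number of Monte Carlo samples

def D(x):
    """Toy discriminator: a sigmoid of a fixed linear score (assumed, not learned)."""
    return 1.0 / (1.0 + np.exp(-0.8 * x))

def G(z):
    """Toy generator: an affine map of the Gaussian noise (assumed, not learned)."""
    return 1.5 * z + 0.5

x_data = rng.normal(loc=1.0, scale=1.0, size=m)    # samples from an assumed p_data
z = rng.normal(size=m)                             # z ~ N(0, I) noise input for G

value = np.mean(np.log(D(x_data))) + np.mean(np.log(1.0 - D(G(z))))
print(value)   # Monte Carlo estimate of E_x[log D(x)] + E_z[log(1 - D(G(z)))]
```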

Question 47 [gan1] (2 points) Consider a fixed data point x with probability density pdata (x). Suppose the
probability density of x under the (not necessarily optimal) trained generator is pG (x). Moreover, assume
that the trained discriminator D∗ is the optimal discriminator for G, based on the loss above. That is:

$$D^* = \arg\max_D\; \mathbb{E}_{x\sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_z[\log(1 - D(G(z)))].$$

For the data point x, what is D∗ (x)?


A $D(x) = \dfrac{p_G(x)}{p_G(x) + p_{\text{data}}(x)}$
B $D(x) = \dfrac{p_{\text{data}}(x)}{p_G(x) + p_{\text{data}}(x)}$

C 0
D 1
E Not enough information

Explanation: The second point is a sufficient condition. The third is not sufficient since, even at convergence, D may not be the optimal discriminator given G.
GANs can be used for the task of learning a generative model of data. However, GANs are not the only
generative models we have seen in the course. Indicate whether each of the following models is generative
or discriminative.

Question 48 [gan2] (1 point) Support Vector Machines.

A Generative Model B Discriminative Model

Question 49 [gan3] (1 point) Gaussian Mixture Models.

A Generative Model B Discriminative Model

Question 50 [gan4] (1 point) Decision Trees.

A Generative Model B Discriminative Model



Answer Sheet of the Introduction to Machine Learning Exam

[Student-number grid: for each digit position, blacken one square from 0–9.] ← Please encode your student number on the left, and write your first and last names below.

Firstname and Lastname: ....................................................

Question 1: A B C D E F G H Question 26: A B


Question 2: A B C D Question 27: A B
Question 3: A B Question 28: A B
Question 4: A B Question 29: A B
Question 5: A B Question 30: A B C D E F G H
Question 6: A B C D E F G H Question 31: A B C D E F G H
Question 7: A B C D E Question 32: A B C D
Question 8: A B Question 33: A B
Question 9: A B Question 34: A B
Question 10: A B Question 35: A B
Question 11: A B Question 36: A B
Question 12: A B C D E F G H Question 37: A B C D E F
Question 13: A B C D Question 38: A B C D
Question 14: A B C D E F G H Question 39: A B C D
Question 15: A B C D E F G H Question 40: A B C D
Question 16: A B Question 41: A B C D
Question 17: A B Question 42: A B C D
Question 18: A B Question 43: A B
Question 19: A B C D Question 44: A B
Question 20: A B C D Question 45: A B
Question 21: A B C D Question 46: A B C D E F
Question 22: A B Question 47: A B C D E
Question 23: A B Question 48: A B
Question 24: A B Question 49: A B
Question 25: A B Question 50: A B
