

Machine Learning 1 — WS2014/2015 — Module IN2064 — Final Exam

1 Preliminaries

• Please write your immatriculation number but not your name on every page you hand in.

• The exam is closed book. You may, however, take one A4 sheet of handwritten notes.

• The exam is limited to 2 × 60 minutes.

• If a question says “Describe in 2–3 sentences” or “Show your work” or something similar, these
mean the same: give a succinct description or explanation.

• This exam consists of 8 pages, 15 problems. You can earn up to 44 points.

Problem 1 [3 points] Fill in your immatriculation number on every sheet you hand in. Make sure
it is easily readable. Make sure you do not write your name on any sheet you hand in.

2 Linear Algebra and Probability Theory

Problem 2 [2 points] Let X and Y be two random variables. Show that

\[
\operatorname{var}[X + Y] = \operatorname{var}[X] + \operatorname{var}[Y] + 2\operatorname{cov}[X, Y]
\]

where cov[X,Y ] is the covariance between X and Y . You can use that

\[
\operatorname{var}[X] = \mathbb{E}[X^2] - \mathbb{E}^2[X]
\]


\[
\operatorname{cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]
\]

We know [1 point]:

\[
\operatorname{var}[X] = \mathbb{E}[X^2] - \mathbb{E}^2[X]
\]


\[
\operatorname{cov}[X, Y] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]
\]

Hence [1 point]:

\begin{align*}
\operatorname{var}[X + Y] &= \mathbb{E}[(X + Y)^2] - \mathbb{E}^2[X + Y] \\
&= \mathbb{E}[X^2 + Y^2 + 2XY] - (\mathbb{E}[X] + \mathbb{E}[Y])^2 \\
&= \mathbb{E}[X^2] + \mathbb{E}[Y^2] + 2\,\mathbb{E}[XY] - \mathbb{E}^2[X] - \mathbb{E}^2[Y] - 2\,\mathbb{E}[X]\,\mathbb{E}[Y] \\
&= \underbrace{\mathbb{E}[X^2] - \mathbb{E}^2[X]}_{=\operatorname{var}[X]} + \underbrace{\mathbb{E}[Y^2] - \mathbb{E}^2[Y]}_{=\operatorname{var}[Y]} + 2\underbrace{\left(\mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]\right)}_{=\operatorname{cov}[X,Y]}
\end{align*}
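As a quick numerical sanity check (not part of the exam), the identity can be verified on random samples with NumPy, using population (ddof=0) estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
y = 0.5 * x + rng.normal(size=100_000)        # correlated with x

lhs = np.var(x + y)                            # var[X + Y], population estimate
cov_xy = np.mean(x * y) - x.mean() * y.mean()  # cov[X, Y] = E[XY] - E[X]E[Y]
rhs = np.var(x) + np.var(y) + 2 * cov_xy

print(lhs, rhs)   # the two values agree up to sampling noise
```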


3 Decision Trees & kNN

You are given two-dimensional input data with corresponding targets in the plot below.
[Plot: two-dimensional data points with class targets; axes x1 and x2.]

Problem 3 [2 points] Sketch the decision boundaries of a maximally-trained decision tree classifier
using misclassification rate.

[Answer sketch: decision boundaries drawn over the data plot, axes x1 and x2.]

Problem 4 [2 points] Describe how this model can overfit the data. Describe how that problem can
be solved or prevented.

A fully grown tree keeps splitting until every training point is classified correctly, so it memorises noise and outliers in the training data instead of the underlying pattern (overfitting). Possible remedies (a short scikit-learn sketch follows below):
• prune the tree
• restrict the depth of the tree
• train an ensemble of slightly different trees -> random forests
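As an illustration (not part of the original answer key), these remedies map directly onto scikit-learn hyperparameters; the data and values below are made up:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# X, t: the plotted 2-D inputs and class targets (hypothetical stand-ins here)
X = [[0, 1], [1, 0], [3, 4], [4, 3], [5, 5], [6, 4]]
t = [0, 0, 1, 1, 0, 1]

# prune the tree (cost-complexity pruning)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, t)

# restrict the depth of the tree
shallow_tree = DecisionTreeClassifier(max_depth=2).fit(X, t)

# ensemble of slightly different trees -> random forest
forest = RandomForestClassifier(n_estimators=100, max_depth=3).fit(X, t)
```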

Problem 5 [2 points] Perform 1-NN with leave-one-out cross validation on the data in the plot.
Circle all points that are misclassified and write down the accuracy.


[Answer plot: the misclassified points are circled, axes x1 and x2.]

accuracy = 6/10
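For reference, the leave-one-out 1-NN procedure can be carried out mechanically; the sketch below assumes the plotted points are available as arrays X and t, with purely hypothetical coordinates and labels:

```python
import numpy as np

# hypothetical stand-ins for the plotted data
X = np.array([[0.0, 1.0], [1.0, 0.5], [2.0, 3.0], [3.0, 2.5], [5.0, 5.0],
              [5.5, 4.5], [6.0, 1.0], [7.0, 0.5], [1.5, 4.0], [6.5, 3.0]])
t = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1])

correct = 0
for i in range(len(X)):
    d = np.linalg.norm(X - X[i], axis=1)   # distances to all other points
    d[i] = np.inf                          # leave the point itself out
    nearest = np.argmin(d)
    correct += int(t[nearest] == t[i])     # vote of the single nearest neighbour

print(f"LOOCV accuracy: {correct}/{len(X)}")
```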

4 Linear Classification

Problem 6 [4 points] The decision boundary for some linear classifier on two-dimensional data
crosses axis x1 at 2 and x2 at 5. First, write down the general form of a linear classifier model (how
many parameters do you need, given the dimensions?). Calculate the coefficients (parameters).

A linear classifier on two-dimensional data needs three parameters, $w_0, w_1, w_2$ (one weight per dimension plus a bias):
\begin{align*}
w_0 + w_1 x_1 + w_2 x_2 &= 0 \\
w_0 + 2 w_1 = 0 &\;\Rightarrow\; w_1 = -\frac{w_0}{2} \\
w_0 + 5 w_2 = 0 &\;\Rightarrow\; w_2 = -\frac{w_0}{5}
\end{align*}
Set, e.g., $w_0 = -2$, which gives $w_1 = 1$ and $w_2 = \frac{2}{5}$, or anything proportional.
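A quick check (not required for the exam) that these coefficients put both axis intercepts on the decision boundary:

```python
w0, w1, w2 = -2.0, 1.0, 2.0 / 5.0

def g(x1, x2):
    """Signed value of the linear classifier; zero on the decision boundary."""
    return w0 + w1 * x1 + w2 * x2

print(g(2, 0))   # 0.0 -> the boundary crosses the x1 axis at 2
print(g(0, 5))   # 0.0 -> the boundary crosses the x2 axis at 5
```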

Problem 7 [2 points] Which basis function φ(x1 , x2 ) makes the data in the example below linearly
separable (crosses in one class, circles in the other)?


φ(x1 , x2 ) = x1 x2
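The original figure is not reproduced here, but the intended answer suggests an XOR-like arrangement: crosses where x1 and x2 share a sign, circles where they differ. Under that assumption, the single feature x1·x2 is positive for one class and negative for the other, so a threshold at zero separates them:

```python
import numpy as np

# hypothetical XOR-like points: crosses where x1*x2 > 0, circles where x1*x2 < 0
crosses = np.array([[1, 1], [2, 3], [-1, -2], [-3, -1]])
circles = np.array([[-1, 2], [-2, 1], [1, -3], [3, -1]])

phi = lambda p: p[:, 0] * p[:, 1]   # basis function phi(x1, x2) = x1 * x2

print(phi(crosses))   # all positive
print(phi(circles))   # all negative -> a threshold at 0 separates the classes
```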

5 Gaussian Processes

The posterior conditional for an MVN with distribution


      
\[
\mathbf{y} = \begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{pmatrix}
\sim \mathcal{N}\!\left( \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix},\;
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)
\]

is given by

\begin{align*}
p(\mathbf{y}_1 \mid \mathbf{y}_2) &= \mathcal{N}\!\left(\mathbf{y}_1 \mid \boldsymbol{\mu}_{1|2}, \Sigma_{1|2}\right) \\
\boldsymbol{\mu}_{1|2} &= \boldsymbol{\mu}_1 + \Sigma_{12} \Sigma_{22}^{-1} (\mathbf{y}_2 - \boldsymbol{\mu}_2) \\
\Sigma_{1|2} &= \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}
\end{align*}

Assume a noise-free GP with mean function

m(x) = 0

and covariance function


\[
K(x, x') = 1 + (x - 2)(x' - 2).
\]

You are given two data points (0, 4) =: x.

Problem 8 [2 points] Compute the kernel matrix (aka covariance matrix) for x.

 
\[
K = \begin{pmatrix} 5 & -3 \\ -3 & 5 \end{pmatrix}
\]

Problem 9 [6 points] Given corresponding outputs y = (2, 6), compute the posterior function values
for data points x∗ = (0, 2, 4).


\begin{align*}
K^{-1} &= \begin{pmatrix} \frac{5}{16} & \frac{3}{16} \\ \frac{3}{16} & \frac{5}{16} \end{pmatrix} \\
K_*^T &= \begin{pmatrix} 5 & -3 \\ 1 & 1 \\ -3 & 5 \end{pmatrix} \\
\boldsymbol{\mu}_{f_* \mid y} &= m(x_*) + K_*^T K^{-1} \bigl( y - m(x) \bigr) \\
&= 0 + \begin{pmatrix} 1 & 0 \\ 1/2 & 1/2 \\ 0 & 1 \end{pmatrix} \left( \begin{pmatrix} 2 \\ 6 \end{pmatrix} - 0 \right) \\
&= \begin{pmatrix} 2 \\ 4 \\ 6 \end{pmatrix}
\end{align*}
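The computations of Problems 8 and 9 can be reproduced numerically; a short NumPy sketch:

```python
import numpy as np

k = lambda a, b: 1 + (a - 2) * (b - 2)      # covariance function K(x, x')

x = np.array([0.0, 4.0])                     # training inputs
y = np.array([2.0, 6.0])                     # training outputs
x_star = np.array([0.0, 2.0, 4.0])           # test inputs

K = k(x[:, None], x[None, :])                # kernel matrix, [[5, -3], [-3, 5]]
K_star = k(x_star[:, None], x[None, :])      # cross-covariances K_*^T, shape (3, 2)

mu_star = K_star @ np.linalg.solve(K, y)     # posterior mean (zero prior mean)
print(K)
print(mu_star)                               # [2. 4. 6.]
```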

Problem 10 [4 points] Which other algorithm does this resemble? Please describe the corresponding
feature space.

$K$ is a linear kernel: $K(x, x') = 1 + (x-2)(x'-2) = \varphi(x)^T \varphi(x')$ with feature map $\varphi(x) = (1,\; x - 2)$, so we are basically solving linear regression in this two-dimensional feature space.

6 Neural networks

Problem 11 [2 points] Geoffrey has a data set with input X ∈ R2 and output Y ∈ R1 . He has
neural network A with one hidden layer and 9 neurons in that layer, which can fit the data. However,
he does not know how good the model is, so he also tests neural network B with two hidden layers
and three neurons for each of these layers. Both models have biases for the hidden units only.
• How many free parameters do the two models have? Show your calculation, not just the result.
• What are the pros and cons of model A compared to model B? Mention at least one pro and
one con.

Model A and model B have 36 and 24 free parameters, respectively: A has 2·9 + 9 + 9·1 = 36, B has 2·3 + 3 + 3·3 + 3 + 3·1 = 24. Compared to B, a pro of model A is its simpler, shallower structure, which is easier to optimise; a con is that it needs more parameters, whereas B gets by with fewer parameters but its more complicated (deeper) structure is more prone to poor local minima.
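A small helper (not part of the exam) that counts parameters under the stated convention, biases for hidden units only, confirms both numbers:

```python
def count_params(layer_sizes):
    """layer_sizes = [inputs, hidden..., outputs]; biases only on hidden layers."""
    n_weights = sum(a * b for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))
    n_biases = sum(layer_sizes[1:-1])        # hidden units only
    return n_weights + n_biases

print(count_params([2, 9, 1]))       # model A: 2*9 + 9 + 9*1 = 36
print(count_params([2, 3, 3, 1]))    # model B: 2*3 + 3 + 3*3 + 3 + 3*1 = 24
```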

Problem 12 [4 points] You know that the sum of squared errors is related to the Gaussian distribution: if you assume a normal distribution of the data around their expectation, the maximum likelihood estimate (MLE) is reached when the sum of squared errors is minimised.
The same is true for the Laplace distribution and the sum of absolute errors. In particular, if the data follows a Laplace distribution

\[
p(z \mid x, w, \beta) = \prod_{n=1}^{N} p(z_n \mid x_n, w, \beta) = \prod_{n=1}^{N} \frac{1}{2\beta} \exp\!\left( -\frac{|z_n - y(x_n, w)|}{\beta} \right)
\]


then minimising the summed absolute errors

\[
\sum_{n=1}^{N} |z_n - y(x_n, w)|
\]

leads to MLE. In these equations, x is the vector of all inputs, xn is the input of sample n, while
y(xn , w) is the neural network prediction on xn . Then, zn is the desired output for xn .

Show that the MLE of the Laplace distribution minimises the sum of absolute errors.

Taking the negative logarithm, we obtain the error function


\[
-\ln p(z \mid x, w, \beta) = \frac{1}{\beta} \sum_{n=1}^{N} |z_n - y(x_n, w)| + N \ln(2\beta).
\]

Since $\beta$ is a fixed, positive constant, maximising the likelihood function with respect to $w$ is equivalent to minimising the error given by the sum of absolute errors
\[
\sum_{n=1}^{N} |z_n - y(x_n, w)|.
\]
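A numerical illustration (not part of the exam): for a constant prediction y(x_n, w) = w on made-up targets, the Laplace MLE coincides with the minimiser of the sum of absolute errors, which is the sample median:

```python
import numpy as np

z = np.array([1.0, 2.0, 2.5, 3.0, 10.0])        # hypothetical targets
beta = 1.0

ws = np.linspace(0, 12, 2401)                    # candidate constant predictions
sae = np.abs(z[:, None] - ws[None, :]).sum(axis=0)      # sum of absolute errors
nll = sae / beta + len(z) * np.log(2 * beta)             # Laplace negative log likelihood

# all three agree: the NLL minimiser, the SAE minimiser, and the median (2.5)
print(ws[np.argmin(nll)], ws[np.argmin(sae)], np.median(z))
```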

7 Unsupervised learning

Problem 13 [2 points] Consider the plot below. The two classes are circles and triangles. What
significance do class labels have for k-means? Draw the resulting decision boundary in the plot for
k-means with two centroids.


Class labels have no significance for k-means: it is an unsupervised method and ignores the targets entirely. The exact position of the sketched boundary is not important, but it should not separate the two classes perfectly.

[Answer plot: sketched k-means decision boundary between the two clusters.]

Problem 14 [3 points] Would the separation be different using the EM algorithm with a Gaussian
mixture model using two components and individual full covariance matrices? What would it look
like? The left cluster has 200 points and the right cluster has 40 points. Draw qualitatively in the
above figure.

Yes, EM for a GMM would distinguish better between the two classes. k-means implicitly assumes clusters of the same size and spread, since it only assigns points to the nearest centroid, whereas EM for a GMM with individual full covariance matrices can model a different size and shape for each cluster. The new boundary is now a closed curve (roughly a circle) around the right cluster:


[Answer plot: the GMM decision boundary forms a closed curve around the right cluster.]
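The qualitative difference can be reproduced with scikit-learn on synthetic clusters of 200 and 40 points; this is a sketch with made-up cluster positions and spreads:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
left = rng.normal(loc=[-5.0, 0.0], scale=4.0, size=(200, 2))   # large, spread-out cluster
right = rng.normal(loc=[8.0, 0.0], scale=1.0, size=(40, 2))    # small, tight cluster
X = np.vstack([left, right])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

# compare how many points each method assigns to its two components:
# with these made-up clusters, k-means typically cuts into the large cluster,
# while the GMM recovers the 200/40 split more closely
print(np.bincount(km.labels_))
print(np.bincount(gmm.predict(X)))
```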

Problem 15 [4 points] The likelihood for ICA is


\[
\ell(W) = \sum_{i=1}^{m} \sum_{j=1}^{n} \log p_{S_i}\!\left(w_i^T x\right) + \log |W|
\]

When calculating the gradient for this likelihood we need to compute the derivative of $\log p_{S_i}(w_i^T x)$. Here we are using $p_{S_i}(s) \approx \sigma'(s)$, where the sigmoid function is given as

\[
\sigma(a) = \frac{1}{1 + e^{-a}}
\]

with the derivative

\[
\sigma'(a) \equiv \frac{d\sigma(a)}{da} = \sigma(a)\,\bigl(1 - \sigma(a)\bigr).
\]

Show that $\displaystyle \frac{d}{ds} \log \sigma'(s) = 1 - 2\sigma(s)$.

\[
\frac{d}{ds} \log \sigma'(s)
= \frac{\sigma''(s)}{\sigma'(s)}
= \frac{\dfrac{d}{ds} \dfrac{e^{-s}}{(1+e^{-s})^2}}{\dfrac{e^{-s}}{(1+e^{-s})^2}}
= \frac{\dfrac{e^{-s}(e^{-s}-1)}{(1+e^{-s})^3}}{\dfrac{e^{-s}}{(1+e^{-s})^2}}
= \frac{e^{-s}-1}{1+e^{-s}}
= \frac{e^{-s}+1-2}{1+e^{-s}}
= 1 - \frac{2}{1+e^{-s}}
= 1 - 2\sigma(s)
\]
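A quick finite-difference check of this identity (not part of the exam):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
log_sigma_prime = lambda s: np.log(sigmoid(s) * (1.0 - sigmoid(s)))  # log sigma'(s)

s = np.linspace(-4, 4, 9)
h = 1e-6
numeric = (log_sigma_prime(s + h) - log_sigma_prime(s - h)) / (2 * h)  # central difference
analytic = 1.0 - 2.0 * sigmoid(s)

print(np.max(np.abs(numeric - analytic)))   # very small: the identity holds numerically
```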

