
CS 189 Introduction to Machine Learning
Spring 2013 Final


• You have 3 hours for the exam.
• The exam is closed book, closed notes except your one-page (two sides) or two-page (one side) crib sheet.

• Please use non-programmable calculators only.


• Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a
brief explanation. All short answer sections can be successfully answered in a few sentences AT MOST.
• For true/false questions, fill in the True/False bubble.

• For multiple-choice questions, fill in the bubbles for ALL CORRECT CHOICES (in some cases, there may be
more than one). For a question with p points and k choices, every false positive will incur a penalty of p/(k − 1)
points.
• For short answer questions, unnecessarily long explanations and extraneous data will be penalized.
Please try to be terse and precise and do the side calculations on the scratch papers provided.

• Please draw a bounding box around your answer in the Short Answers section. A missed answer without
a bounding box will not be regraded.

First name

Last name

SID

For staff use only:


Q1. True/False /23
Q2. Multiple Choice Questions /36
Q3. Short Answers /26
Total /85

Q1. [23 pts] True/False
(a) [1 pt] Solving a non-linear separation problem with a hard-margin kernelized SVM (Gaussian RBF kernel)
might lead to overfitting.
True False

(b) [1 pt] In SVMs, the sum of the Lagrange multipliers corresponding to the positive examples is equal to the sum
of the Lagrange multipliers corresponding to the negative examples.
True False

(c) [1 pt] SVMs directly give us the posterior probabilities P (y = 1|x) and P (y = −1|x).
True False

(d) [1 pt] V(X) = E[X]² − E[X²]


True False

(e) [1 pt] In the discriminative approach to solving classification problems, we model the conditional probability
of the labels given the observations.
True False

(f ) [1 pt] In a two class classification problem, a point on the Bayes optimal decision boundary x∗ always satisfies
P (y = 1|x∗ ) = P (y = 0|x∗ ).
True False

(g) [1 pt] Any linear combination of the components of a multivariate Gaussian is a univariate Gaussian.
True False

(h) [1 pt] For any two random variables X ∼ N(µ1, σ1²) and Y ∼ N(µ2, σ2²), X + Y ∼ N(µ1 + µ2, σ1² + σ2²).
True False

(i) [1 pt] Stanford and Berkeley students are trying to solve the same logistic regression problem for a dataset.
The Stanford group claims that their initialization point will lead to a much better optimum than Berkeley’s
initialization point. Stanford is correct.
True False
(j) [1 pt] In logistic regression, we model the odds ratio p/(1 − p) as a linear function.
True False

(k) [1 pt] Random forests can be used to classify infinite dimensional data.
True False

(l) [1 pt] In boosting we start with a Gaussian weight distribution over the training samples.
True False

(m) [1 pt] In Adaboost, the error of each hypothesis is calculated by the ratio of misclassified examples to the total
number of examples.
True False

(n) [1 pt] When k = 1 and N → ∞, the kNN classification error is bounded above by twice the Bayes error rate.
True False

(o) [1 pt] A single layer neural network with a sigmoid activation for binary classification with the cross entropy
loss is exactly equivalent to logistic regression.
True False

(p) [1 pt] The loss function for LeNet5 (the convolutional neural network by LeCun et al.) is convex.
True False

(q) [1 pt] Convolution is a linear operation i.e. (αf1 + βf2 ) ∗ g = αf1 ∗ g + βf2 ∗ g.
True False

(r) [1 pt] The k-means algorithm does coordinate descent on a non-convex objective function.
True False

(s) [1 pt] A 1-NN classifier has higher variance than a 3-NN classifier.
True False

(t) [1 pt] The single link agglomerative clustering algorithm groups two clusters on the basis of the maximum
distance between points in the two clusters.
True False

(u) [1 pt] The largest eigenvector of the covariance matrix is the direction of minimum variance in the data.
True False

(v) [1 pt] The eigenvectors of AAᵀ and AᵀA are the same.


True False

(w) [1 pt] The non-zero eigenvalues of AAᵀ and AᵀA are the same.
True False

Q2. [36 pts] Multiple Choice Questions
(a) [4 pts] In linear regression, we model P(y|x) ∼ N(wᵀx + w0, σ²). The irreducible error in this model is ______.

○ σ²
○ E[(y − E[y|x]) | x]
○ E[(y − E[y|x])² | x]
○ E[y|x]

(b) [4 pts] Let S1 and S2 be the set of support vectors and w1 and w2 be the learnt weight vectors for a linearly
separable problem using hard and soft margin linear SVMs respectively. Which of the following are correct?

○ S1 ⊂ S2
○ S1 may not be a subset of S2
○ w1 = w2
○ w1 may not be equal to w2

(c) [4 pts] Ordinary least-squares regression is equivalent to assuming that each data point is generated according
to a linear function of the input plus zero-mean, constant-variance Gaussian noise. In many systems, however,
the noise variance is itself a positive linear function of the input (which is assumed to be non-negative, i.e.,
x ≥ 0). Which of the following families of probability models correctly describes this situation in the univariate
case?
○ P(y|x) = 1/(σ√(2πx)) · exp(−(y − (w0 + w1 x))² / (2xσ²))
○ P(y|x) = 1/(σ√(2πx)) · exp(−(y − (w0 + (w1 + σ)x))² / (2σ²))
○ P(y|x) = 1/(σ√(2π)) · exp(−(y − (w0 + w1 x))² / (2σ²))
○ P(y|x) = 1/(σx√(2π)) · exp(−(y − (w0 + w1 x))² / (2x²σ²))

(d) [3 pts] The left singular vectors of a matrix A can be found in ______.

○ Eigenvectors of AAᵀ
○ Eigenvectors of A²
○ Eigenvectors of AᵀA
○ Eigenvalues of AAᵀ

(e) [3 pts] Averaging the output of multiple decision trees helps ______.

○ Increase bias
○ Increase variance
○ Decrease bias
○ Decrease variance

(f ) [4 pts] Let A be a symmetric matrix and S be the matrix containing its eigenvectors as column vectors, and D
a diagonal matrix containing the corresponding eigenvalues on the diagonal. Which of the following are true:

○ AS = SD
○ SA = DS
○ AS = DS
○ AS = DSᵀ

(g) [4 pts] Consider the following dataset: A = (0, 2), B = (0, 1) and C = (1, 0). The k-means algorithm is
initialized with centers at A and B. Upon convergence, the two centers will be at

○ A and C
○ C and the midpoint of AB
○ A and the midpoint of BC
○ A and B
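
A quick way to verify where the centers end up is to run Lloyd's algorithm on the three points directly. Below is a minimal Python/NumPy sketch (not part of the exam; the variable names are ours):

    import numpy as np

    # Dataset and initialization from the question: A = (0, 2), B = (0, 1), C = (1, 0)
    points = np.array([[0.0, 2.0], [0.0, 1.0], [1.0, 0.0]])
    centers = np.array([[0.0, 2.0], [0.0, 1.0]])   # initialized at A and B

    for _ in range(10):                            # converges after a couple of iterations
        # Assignment step: each point goes to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        new_centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers

    print(centers)   # converged centers for this initialization: A and the midpoint of B and C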

(h) [3 pts] Which of the following loss functions are convex?

○ Misclassification loss
○ Hinge loss
○ Logistic loss
○ Exponential loss (e^(−y f(x)))

(i) [3 pts] Consider T1, a decision stump (tree of depth 2), and T2, a decision tree grown to a maximum
depth of 4. Which of the following is/are correct?

○ Bias(T1) < Bias(T2)
○ Variance(T1) < Variance(T2)
○ Bias(T1) > Bias(T2)
○ Variance(T1) > Variance(T2)

(j) [4 pts] Consider the problem of building decision trees with k-ary splits (split one node into k nodes), where
you decide k for each node by calculating the entropy impurity for different values of k and optimizing
simultaneously over the splitting threshold(s) and k. Which of the following is/are true?

○ The algorithm will always choose k = 2
○ There will be k − 1 thresholds for a k-ary split
○ The algorithm will prefer high values of k
○ This model is strictly more powerful than a binary decision tree.

Q3. [26 pts] Short Answers
(a) [5 pts] Given that (x1, x2) are jointly normally distributed with mean µ = (µ1, µ2)ᵀ and covariance

        Σ = [ σ1²   σ12 ]
            [ σ21   σ2² ]      (σ21 = σ12),

give an expression for the mean of the conditional distribution p(x1 | x2 = a).

This can be solved by writing p(x1 | x2 = a) = p(x1, x2 = a) / p(x2 = a). Since x2 is a component of a multivariate Gaussian, it is itself univariate Gaussian, x2 ∼ N(µ2, σ2²). Write out the Gaussian densities and simplify (complete the square) to see the following:

        x1 | x2 = a ∼ N(µ̄, σ̄²),    µ̄ = µ1 + (σ12 / σ2²)(a − µ2)
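
The closed-form mean can be sanity-checked numerically by sampling from a bivariate Gaussian and averaging x1 over draws whose x2 falls near a. A rough Python/NumPy sketch (the parameter values below are illustrative, not from the exam):

    import numpy as np

    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0])                    # illustrative µ1, µ2
    Sigma = np.array([[2.0, 0.8],                 # illustrative σ1², σ12
                      [0.8, 1.5]])                #              σ21, σ2²
    a = -1.0

    # Formula from the solution: µ̄ = µ1 + (σ12 / σ2²)(a − µ2)
    mu_bar = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (a - mu[1])

    # Monte Carlo check: keep samples with x2 ≈ a and average their x1
    samples = rng.multivariate_normal(mu, Sigma, size=1_000_000)
    near_a = np.abs(samples[:, 1] - a) < 0.02
    print(mu_bar, samples[near_a, 0].mean())      # the two values should agree closely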

(b) [4 pts] The logistic function is given by σ(x) = 1 / (1 + e^(−x)). Show that σ′(x) = σ(x)(1 − σ(x)).

        σ′(x) = e^(−x) / (1 + e^(−x))²
              = [1 / (1 + e^(−x))] · [e^(−x) / (1 + e^(−x))]
              = σ(x) · (1 − 1 / (1 + e^(−x)))
              = σ(x)(1 − σ(x))
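
The identity can also be confirmed numerically with a central-difference check (a small sketch, not part of the original solution):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = np.linspace(-5.0, 5.0, 11)
    h = 1e-6
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)   # central difference
    analytic = sigmoid(x) * (1.0 - sigmoid(x))                # σ(x)(1 − σ(x))
    print(np.max(np.abs(numeric - analytic)))                 # on the order of 1e-10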

(c) Let X have a uniform distribution

        p(x; θ) = 1/θ   for 0 ≤ x ≤ θ,   and 0 otherwise.
Suppose that n samples x1 , . . . , xn are drawn independently according to p(x; θ).
(i) [5 pts] The maximum likelihood estimate of θ is x(n) = max(x1 , x2 , . . . , xn ). Show that this estimate of θ
is biased.

Biased estimator: θ̂ (the sample estimate) is a biased estimator of θ (the population distribution parameter) if E[θ̂] ≠ θ.

Here θ̂ = x(n), and E[x(n)] = (n / (n + 1)) θ ≠ θ. The steps for finding E[x(n)] are given in the solutions of Homework 2, problem 5(c).

(ii) [2 pts] Give an expression for an unbiased estimator of θ.

        θ̂_unbiased = ((n + 1) / n) · x(n)

        E[θ̂_unbiased] = E[((n + 1) / n) x(n)] = ((n + 1) / n) E[x(n)] = ((n + 1) / n) · (n / (n + 1)) θ = θ
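
Both claims are easy to check by simulation: the sample maximum underestimates θ on average, and rescaling by (n + 1)/n removes the bias. A brief sketch with illustrative values of θ and n:

    import numpy as np

    rng = np.random.default_rng(0)
    theta, n, trials = 2.0, 5, 200_000            # illustrative values

    x = rng.uniform(0.0, theta, size=(trials, n))
    mle = x.max(axis=1)                           # θ̂ = x(n), the sample maximum
    unbiased = (n + 1) / n * mle                  # the rescaled estimator above

    print(mle.mean())        # ≈ nθ/(n + 1) = 1.667, i.e. below θ = 2
    print(unbiased.mean())   # ≈ θ = 2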

(d) [5 pts] Consider the problem of fitting the following function to a dataset of 100 points {(xi , yi )}, i = 1 . . . 100:

y = αcos(x) + βsin(x) + γ

This problem can be solved using the least squares method with a solution of the form:
 
        (α, β, γ)ᵀ = (XᵀX)⁻¹ XᵀY

What are X and Y ?

   
        X = [ cos(x1)    sin(x1)    1 ]        Y = [ y1   ]
            [ cos(x2)    sin(x2)    1 ]            [ y2   ]
            [   ...        ...     ... ]           [ ...  ]
            [ cos(x100)  sin(x100)  1 ]            [ y100 ]
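
In practice the fit is a single linear least-squares solve once X and Y are assembled as above. A minimal sketch on synthetic data (the true α, β, γ below are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    xs = rng.uniform(0.0, 2.0 * np.pi, 100)
    ys = 1.5 * np.cos(xs) - 0.7 * np.sin(xs) + 0.3 + 0.05 * rng.standard_normal(100)

    # Design matrix with columns cos(xi), sin(xi), 1 (this is the X in the answer above)
    X = np.column_stack([np.cos(xs), np.sin(xs), np.ones_like(xs)])
    Y = ys

    coeffs, *_ = np.linalg.lstsq(X, Y, rcond=None)   # numerically safer than (XᵀX)⁻¹XᵀY
    print(coeffs)                                    # ≈ [1.5, -0.7, 0.3]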

(e) [5 pts] Consider the problem of binary classification using the Naive Bayes classifier. You are given two dimen-
sional features (X1 , X2 ) and the categorical class conditional distributions in the tables below. The entries in
the tables correspond to P (X1 = x1 |Ci ) and P (X2 = x2 |Ci ) respectively. The two classes are equally likely.

        P(X1 = x1 | Ci):                     P(X2 = x2 | Ci):

        X1      C1      C2                   X2      C1      C2
        −1      0.2     0.3                  −1      0.4     0.1
         0      0.4     0.6                   0      0.5     0.3
         1      0.4     0.1                   1      0.1     0.6

Given a data point (−1, 1), calculate the following posterior probabilities:

P(C1 | X1 = −1, X2 = 1): using Bayes' rule and the conditional independence assumption of Naive Bayes,

        P(C1 | X1 = −1, X2 = 1)
            = P(X1 = −1, X2 = 1 | C1) P(C1) / P(X1 = −1, X2 = 1)
            = P(X1 = −1 | C1) P(X2 = 1 | C1) P(C1) / [P(X1 = −1 | C1) P(X2 = 1 | C1) P(C1) + P(X1 = −1 | C2) P(X2 = 1 | C2) P(C2)]
            = 0.1

P(C2 | X1 = −1, X2 = 1) = 1 − P(C1 | X1 = −1, X2 = 1) = 0.9
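
The arithmetic is small enough to script directly from the two tables (a sketch; the dictionaries below simply transcribe the class-conditional tables above):

    # Class-conditional tables P(X1 = x1 | C) and P(X2 = x2 | C) from the question
    p_x1 = {'C1': {-1: 0.2, 0: 0.4, 1: 0.4}, 'C2': {-1: 0.3, 0: 0.6, 1: 0.1}}
    p_x2 = {'C1': {-1: 0.4, 0: 0.5, 1: 0.1}, 'C2': {-1: 0.1, 0: 0.3, 1: 0.6}}
    prior = {'C1': 0.5, 'C2': 0.5}                  # the two classes are equally likely

    x1, x2 = -1, 1
    joint = {c: p_x1[c][x1] * p_x2[c][x2] * prior[c] for c in ('C1', 'C2')}
    evidence = sum(joint.values())
    posterior = {c: joint[c] / evidence for c in joint}
    print(posterior)   # ≈ {'C1': 0.1, 'C2': 0.9}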
