Final: CS 189 Spring 2013 Introduction To Machine Learning
• For multiple-choice questions, fill in the bubbles for ALL CORRECT CHOICES (in some cases, there may be
more than one). For a question with p points and k choices, every false positive will incur a penalty of p/(k − 1)
points.
• For short answer questions, unnecessarily long explanations and extraneous data will be penalized.
Please try to be terse and precise and do the side calculations on the scratch papers provided.
• Please draw a bounding box around your answer in the Short Answers section. A missed answer without
a bounding box will not be regraded.
First name
Last name
SID
Q1. [23 pts] True/False
(a) [1 pt] Solving a non-linear separation problem with a hard margin kernelized SVM (Gaussian RBF kernel)
might lead to overfitting.
True
False
(b) [1 pt] In SVMs, the sum of the Lagrange multipliers corresponding to the positive examples is equal to the sum
of the Lagrange multipliers corresponding to the negative examples.
True
False
(c) [1 pt] SVMs directly give us the posterior probabilities P (y = 1|x) and P (y = −1|x).
True False
(e) [1 pt] In the discriminative approach to solving classification problems, we model the conditional probability
of the labels given the observations.
True
False
(f ) [1 pt] In a two class classification problem, a point on the Bayes optimal decision boundary x∗ always satisfies
P (y = 1|x∗ ) = P (y = 0|x∗ ).
True False
(g) [1 pt] Any linear combination of the components of a multivariate Gaussian is a univariate Gaussian.
True
False
(h) [1 pt] For any two random variables X ∼ N (µ1 , σ12 ) and Y ∼ N (µ2 , σ22 ), X + Y ∼ N (µ1 + µ2 , σ12 + σ22 ).
True False
(i) [1 pt] Stanford and Berkeley students are trying to solve the same logistic regression problem for a dataset.
The Stanford group claims that their initialization point will lead to a much better optimum than Berkeley’s
initialization point. Stanford is correct.
True False
(j) [1 pt] In logistic regression, we model the odds ratio (p / (1 − p)) as a linear function.
True False
(k) [1 pt] Random forests can be used to classify infinite dimensional data.
True
False
(l) [1 pt] In boosting we start with a Gaussian weight distribution over the training samples.
True False
(m) [1 pt] In Adaboost, the error of each hypothesis is calculated by the ratio of misclassified examples to the total
number of examples.
True False
(n) [1 pt] When k = 1 and N → ∞, the kNN classification error rate is bounded above by twice the Bayes error rate.
True
False
(o) [1 pt] A single layer neural network with a sigmoid activation for binary classification with the cross entropy
loss is exactly equivalent to logistic regression.
True
False
(p) [1 pt] The loss function for LeNet5 (the convolutional neural network by LeCun et al.) is convex.
True False
(q) [1 pt] Convolution is a linear operation, i.e., (αf1 + βf2) ∗ g = αf1 ∗ g + βf2 ∗ g (a numerical check appears after this True/False section).
True
False
(r) [1 pt] The k-means algorithm does coordinate descent on a non-convex objective function.
True
False
(s) [1 pt] A 1-NN classifier has higher variance than a 3-NN classifier.
True
False
(t) [1 pt] The single link agglomerative clustering algorithm groups two clusters on the basis of the maximum
distance between points in the two clusters.
True False
(u) [1 pt] The eigenvector of the covariance matrix corresponding to the largest eigenvalue is the direction of minimum variance in the data.
True False
(w) [1 pt] The non-zero eigenvalues of AAᵀ and AᵀA are the same.
True
False
Q2. [36 pts] Multiple Choice Questions
(a) [4 pts] In linear regression, we model P(y|x) ∼ N(wᵀx + w0, σ²). The irreducible error in this model is ______.
σ²        E[(y − E[y|x])|x]
(b) [4 pts] Let S1 and S2 be the set of support vectors and w1 and w2 be the learnt weight vectors for a linearly
separable problem using hard and soft margin linear SVMs respectively. Which of the following are correct?
(c) [4 pts] Ordinary least-squares regression is equivalent to assuming that each data point is generated according
to a linear function of the input plus zero-mean, constant-variance Gaussian noise. In many systems, however,
the noise variance is itself a positive linear function of the input (which is assumed to be non-negative, i.e.,
x ≥ 0). Which of the following families of probability models correctly describes this situation in the univariate
case?
P(y|x) = 1/(σ√(2πx)) · exp(−(y − (w0 + w1 x))² / (2xσ²))

P(y|x) = 1/(σ√(2πx)) · exp(−(y − (w0 + (w1 + σ)x))² / (2σ²))
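The following short sketch samples data according to the situation described in (c), where the Gaussian noise variance grows linearly with the non-negative input (illustrative only; the parameter values are assumptions, not from the exam):

import numpy as np

# y | x ~ N(w0 + w1*x, sigma^2 * x): the noise variance is a positive
# linear function of the input x.
rng = np.random.default_rng(0)
w0, w1, sigma = 1.0, 2.0, 0.5
x = rng.uniform(0.0, 10.0, size=1000)   # inputs assumed non-negative
y = w0 + w1 * x + np.sqrt(sigma**2 * x) * rng.standard_normal(x.shape)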
(f ) [4 pts] Let A be a symmetric matrix and S be the matrix containing its eigenvectors as column vectors, and D
a diagonal matrix containing the corresponding eigenvalues on the diagonal. Which of the following are true:
AS = SD        SA = DS
AS = DS        AS = DSᵀ
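A quick numerical check of the relationship in (f) (an illustrative sketch; the matrix below is an arbitrary example, not from the exam):

import numpy as np

# For a symmetric A, with its eigenvectors as the columns of S and the
# corresponding eigenvalues on the diagonal of D, the identity A S = S D holds.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigvals, S = np.linalg.eigh(A)   # columns of S are eigenvectors of A
D = np.diag(eigvals)
print(np.allclose(A @ S, S @ D))  # prints True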
(g) [4 pts] Consider the following dataset: A = (0, 2), B = (0, 1) and C = (1, 0). The k-means algorithm is
initialized with centers at A and B. Upon convergence, the two centers will be at
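A minimal sketch of Lloyd's algorithm on these three points, starting from the given centers (written out only to make the iteration concrete; not part of the exam):

import numpy as np

# Points A, B, C and the initial centers at A and B.
points = np.array([[0.0, 2.0], [0.0, 1.0], [1.0, 0.0]])
centers = np.array([[0.0, 2.0], [0.0, 1.0]])

for _ in range(10):
    # Assign each point to its nearest center.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute each center as the mean of the points assigned to it.
    centers = np.array([points[labels == k].mean(axis=0) for k in range(2)])

print(centers)  # converges to (0, 2) and (0.5, 0.5)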
(h) [3 pts] Which of the following loss functions are convex?
(i) [3 pts] Consider T1, a decision stump (tree of depth 2), and T2, a decision tree grown to a maximum depth of 4. Which of the following is/are correct?
(j) [4 pts] Consider the problem of building decision trees with k-ary splits (splitting one node into k nodes), where you decide k for each node by calculating the entropy impurity for different values of k and optimizing simultaneously over the splitting threshold(s) and k. Which of the following is/are true?
The algorithm will always choose k = 2        There will be k − 1 thresholds for a k-ary split
Q3. [26 pts] Short Answers
(a) [5 pts] Given that (x1, x2) are jointly normally distributed with mean μ = (μ1, μ2)ᵀ and covariance
Σ = [ σ1²   σ12 ]
    [ σ21   σ2² ]
(σ21 = σ12), give an expression for the mean of the conditional distribution p(x1 | x2 = a).
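For reference, the standard Gaussian conditioning result (stated here because the printed answer is not reproduced above) gives

E[x1 | x2 = a] = μ1 + (σ12 / σ2²)(a − μ2)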
σ′(x) = e^(−x) / (1 + e^(−x))²
      = [1 / (1 + e^(−x))] · [e^(−x) / (1 + e^(−x))]
      = σ(x) · [1 − 1/(1 + e^(−x))]
      = σ(x)(1 − σ(x))
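A quick finite-difference check of this identity (an illustrative sketch, not part of the original solution):

import numpy as np

# Compare a central-difference estimate of sigma'(x) with sigma(x)*(1 - sigma(x)).
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))
x = np.linspace(-4.0, 4.0, 9)
h = 1e-6
numeric = (sigma(x + h) - sigma(x - h)) / (2 * h)
print(np.allclose(numeric, sigma(x) * (1 - sigma(x)), atol=1e-6))  # prints True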
Biased estimator: θ̂ (the sample estimate) is a biased estimator of θ (the population distribution parameter) if E[θ̂] ≠ θ.
Here θ̂ = x(n), and E[x(n)] = (n/(n+1)) θ ≠ θ. The steps for finding E[x(n)] are given in the solutions of Homework 2, problem 5(c).
θ̂_unbiased = ((n+1)/n) x(n)
E[θ̂_unbiased] = E[((n+1)/n) x(n)] = ((n+1)/n) E[x(n)] = ((n+1)/n) · (n/(n+1)) θ = θ
(d) [5 pts] Consider the problem of fitting the following function to a dataset of 100 points {(xi, yi)}, i = 1 … 100:
y = α cos(x) + β sin(x) + γ
This problem can be solved using the least squares method with a solution of the form:
(α, β, γ)ᵀ = (XᵀX)⁻¹ XᵀY
where
X = [ cos(x1)     sin(x1)     1 ]        Y = [ y1   ]
    [ cos(x2)     sin(x2)     1 ]            [ y2   ]
    [   ⋮            ⋮        ⋮ ]            [  ⋮   ]
    [ cos(x100)   sin(x100)   1 ]            [ y100 ]
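A minimal NumPy sketch of this fit (illustrative only; the true coefficients and noise level below are made-up):

import numpy as np

# Generate synthetic data from y = alpha*cos(x) + beta*sin(x) + gamma + noise,
# build the design matrix with rows [cos(x_i), sin(x_i), 1], and solve the
# least-squares problem.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 2.0 * np.pi, 100)
alpha, beta, gamma = 1.5, -0.7, 0.3
y = alpha * np.cos(x) + beta * np.sin(x) + gamma + 0.05 * rng.standard_normal(100)

X = np.column_stack([np.cos(x), np.sin(x), np.ones_like(x)])
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)   # solves min ||X w - y||^2
print(coeffs)                                    # ≈ [1.5, -0.7, 0.3]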
(e) [5 pts] Consider the problem of binary classification using the Naive Bayes classifier. You are given two dimen-
sional features (X1 , X2 ) and the categorical class conditional distributions in the tables below. The entries in
the tables correspond to P (X1 = x1 |Ci ) and P (X2 = x2 |Ci ) respectively. The two classes are equally likely.
P(X1 = x1 | Ci):                    P(X2 = x2 | Ci):

  x1      C1     C2                   x2      C1     C2
  −1      0.2    0.3                  −1      0.4    0.1
   0      0.4    0.6                   0      0.5    0.3
   1      0.4    0.1                   1      0.1    0.6
Given a data point (−1, 1), calculate the following posterior probabilities:
P(C1 | X1 = −1, X2 = 1) =
Using Bayes' rule and the conditional independence assumption of Naive Bayes:
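The computation, reconstructed from the tables above (the original printed working is not reproduced here), is:

P(C1) · P(X1 = −1 | C1) · P(X2 = 1 | C1) = 0.5 × 0.2 × 0.1 = 0.01
P(C2) · P(X1 = −1 | C2) · P(X2 = 1 | C2) = 0.5 × 0.3 × 0.6 = 0.09

Normalizing gives P(C1 | X1 = −1, X2 = 1) = 0.01 / (0.01 + 0.09) = 0.1.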