
CS 189 Introduction to Machine Learning
Spring 2014 Midterm


• You have 2 hours for the exam.
• The exam is closed book, closed notes except your one-page crib sheet.

• Please use non-programmable calculators only.


• Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a
brief explanation.
• For true/false questions, fill in the True/False bubble.

• For multiple-choice questions, fill in the bubbles for ALL CORRECT CHOICES (in some cases, there may be
more than one). We have introduced a negative penalty for false positives for the multiple choice questions
such that the expected value of randomly guessing is 0. Don't worry: your overall score for this section will be
the maximum of your raw score and 0, so you cannot incur a negative score for this section.

First name

Last name

SID

First and last name of student to your left

First and last name of student to your right

For staff use only:


Q1. True or False /10
Q2. Multiple Choice /24
Q3. Decision Theory /8
Q4. Kernels /14
Q5. L2-Regularized Linear Regression with Newton’s Method /8
Q6. Maximum Likelihood Estimation /8
Q7. Affine Transformations of Random Variables /13
Q8. Generative Models /15
Total /100

Q1. [10 pts] True or False
(a) [1 pt] The hyperparameters in the regularized logistic regression model are η (learning rate) and λ (regularization
term).
True False

(b) [1 pt] The objective function used in L2 regularized logistic regression is convex.
True False

(c) [1 pt] In SVMs, the values of αi for non-support vectors are 0.


True False

(d) [1 pt] As the number of data points approaches ∞, the error rate of a 1-NN classifier approaches 0.
True False

(e) [1 pt] Cross validation will guarantee that our model does not overfit.
True False

(f) [1 pt] As the number of dimensions increases, the percentage of the volume in the unit ball shell with thickness
ε grows.
True False

(g) [1 pt] In logistic regression, the Hessian of the (non regularized) log likelihood is positive definite.
True False

(h) [1 pt] Given a binary classification scenario with Gaussian class conditionals and equal prior probabilities, the
optimal decision boundary will be linear.
True False

(i) [1 pt] In the primal version of SVM, we are minimizing the Lagrangian with respect to w and in the dual
version, we are minimizing the Lagrangian with respect to α.
True False

(j) [1 pt] For the dual version of soft margin SVM, the αi ’s for support vectors satisfy αi > C.
True False

Q2. [24 pts] Multiple Choice
(a) [3 pts] Consider the binary classification problem where y ∈ {0, 1} is the label and we have prior probability
P (y = 0) = π0 . If we model P (x|y = 1) to be the following distributions, which one(s) will cause the posterior
P (y = 1|x) to have a logistic function form?

Gaussian
Uniform
Poisson
None of the above

(b) [3 pts] Given the following data samples (square and triangle belong to two different classes), which one(s) of
the following algorithms can produce zero training error?

1-nearest neighbor
Logistic regression
Support vector machine
Linear discriminant analysis

(c) [3 pts] The following diagrams show the iso-probability contours for two different 2D Gaussian distributions. On
the left side, the data ∼ N (0, I) where I is the identity matrix. The right side has the same set of contour levels
as left side. What is the mean and covariance matrix for the right side’s multivariate Gaussian distribution?
[Two contour plots, each with x and y axes ranging from −5 to 5: the left panel shows the iso-probability contours of N(0, I); the right panel shows the same contour levels for the unknown Gaussian.]
μ = [0, 0]^T, Σ = [[1, 0], [0, 1]]
μ = [0, 1]^T, Σ = [[4, 0], [0, 0.25]]
μ = [0, 1]^T, Σ = [[1, 0], [0, 1]]
μ = [0, 1]^T, Σ = [[2, 0], [0, 0.5]]

(d) [3 pts] Given the following data samples (square and triangle mean two classes), which one(s) of the following
kernels can we use in SVM to separate the two classes?

Linear kernel
Gaussian RBF (radial basis function) kernel
Polynomial kernel
None of the above

(e) [3 pts] Consider the following plots of the contours of the unregularized error function along with the constraint
region. What regularization term is used in this case?

L2
L∞
L1
None of the above

(f) [3 pts] Suppose we have a covariance matrix

Σ = [[5, a], [a, 4]]

What is the set of values that a can take on such that Σ is a valid covariance matrix?

a ∈ ℝ
a ≥ 0
−√20 ≤ a ≤ √20
−√20 < a < √20

(g) [3 pts] The soft margin SVM formulation is as follows:
min_{w, b, ξ}   (1/2) w^T w + C Σ_{i=1}^N ξ_i
subject to      y_i (w^T x_i + b) ≥ 1 − ξ_i   ∀i
                ξ_i ≥ 0   ∀i

What is the behavior of the width of the margin (2/‖w‖) as C → 0?

Behaves like hard margin
Goes to zero
Goes to infinity
None of the above

(h) [3 pts] In Homework 4, you fit a logistic regression model on spam and ham data for a Kaggle Competition.
Assume you had a very good score on the public test set, but when the GSIs ran your model on a private test
set, your score dropped a lot. This is likely because you overfitted by submitting multiple times and changing
the following between submissions:

λ, your penalty term
ε, your convergence criterion
η, your step size
Fixing a random bug

(i) [0 pts] BONUS QUESTION (Answer this only if you have time and are confident of your other answers,
because it is not worth extra points.)
We have constructed the multiple choice problems such that every false positive will incur some negative
penalty. For one of these multiple choice problems, given that there are p points, r correct answers, and k
choices, what is the formula for the penalty such that the expected value of random guessing is equal to 0?
(You may assume k > r.)

p/(k − r)

Q3. [8 pts] Decision Theory
Consider the following generative model for a 2-class classification problem, in which the class conditionals are
Bernoulli distributions:

p(ω1 ) = π
p(ω2 ) = 1 − π
x | ω1 = 1 with probability 0.5,  0 with probability 0.5
x | ω2 = 1 with probability 0.5,  0 with probability 0.5

Assume the loss matrix

                      true class = 1    true class = 2
predicted class = 1          0               λ12
predicted class = 2         λ21               0

(a) [8 pts] Give a condition in terms of λ12 , λ21 , and π that determines when class 1 should always be chosen as
the minimum-risk class.

Based on Bayes' rule, the posterior probabilities P(ωi | x) are

P(ω1 | x) = P(x | ω1) P(ω1) / P(x) = (1/2)π / P(x)

P(ω2 | x) = P(x | ω2) P(ω2) / P(x) = (1/2)(1 − π) / P(x)

The risk for predicting class 1 is

R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x) = λ12 (1 − π) / (2 P(x))

The risk for predicting class 2 is

R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x) = λ21 π / (2 P(x))

Choose class 1 when R(α1 | x) < R(α2 | x), i.e. λ12 (1 − π) / (2 P(x)) < λ21 π / (2 P(x)), which simplifies to

λ12 (1 − π) < λ21 π
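As a quick numerical sanity check, here is a small Python sketch that computes both conditional risks directly and compares the result with the derived condition; the particular values of λ12, λ21, and π are made up for illustration.

    # Sketch: compare the two conditional risks with the derived decision rule.
    # The values of lam12, lam21, and pi below are illustrative only.
    def risks(lam12, lam21, pi):
        # P(x | omega1) = P(x | omega2) = 0.5, so the common factor P(x)
        # cancels and is dropped here.
        post1 = 0.5 * pi          # proportional to P(omega1 | x)
        post2 = 0.5 * (1 - pi)    # proportional to P(omega2 | x)
        r1 = lam12 * post2        # risk of predicting class 1 (lambda11 = 0)
        r2 = lam21 * post1        # risk of predicting class 2 (lambda22 = 0)
        return r1, r2

    lam12, lam21, pi = 2.0, 1.0, 0.8
    r1, r2 = risks(lam12, lam21, pi)
    print(r1 < r2)                        # True: class 1 has the smaller risk
    print(lam12 * (1 - pi) < lam21 * pi)  # the derived condition agrees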

Q4. [14 pts] Kernels
(a) [6 pts] Let k1 and k2 be (valid) kernels; that is, k1 (x, y) = Φ1 (x)T Φ1 (y) and k2 (x, y) = Φ2 (x)T Φ2 (y).
Show that k = k1 + k2 is a valid kernel by explicitly constructing a corresponding feature mapping Φ(z).

k(x, y) = k1(x, y) + k2(x, y) = Φ1(x)^T Φ1(y) + Φ2(x)^T Φ2(y) = [Φ1(x); Φ2(x)]^T [Φ1(y); Φ2(y)]

If we let Φ(z) = [Φ1(z); Φ2(z)], the concatenation of the two feature vectors, then k(x, y) = Φ(x)^T Φ(y). Therefore, k = k1 + k2 is a valid kernel.
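To make the construction concrete, here is a short Python sketch using two example feature maps of my own choosing (the identity map for k1 and the map to all pairwise products for k2), confirming that the concatenated feature map reproduces k1 + k2.

    import numpy as np

    # Sketch with two example kernels: Phi1(z) = z (linear kernel) and
    # Phi2(z) = all pairwise products z_i * z_j. These particular maps are
    # illustrative choices, not part of the exam.
    def phi1(z):
        return z

    def phi2(z):
        return np.outer(z, z).ravel()

    def phi(z):
        # Concatenation of the two feature vectors, as in the solution above.
        return np.concatenate([phi1(z), phi2(z)])

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=3), rng.normal(size=3)

    k_sum = phi1(x) @ phi1(y) + phi2(x) @ phi2(y)   # k1(x, y) + k2(x, y)
    k_concat = phi(x) @ phi(y)                      # kernel of the stacked map
    print(np.isclose(k_sum, k_concat))              # True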

(b) [8 pts] The polynomial kernel is defined to be

k(x, y) = (x^T y + c)^d

where x, y ∈ R^n and c ≥ 0. When we take d = 2, this kernel is called the quadratic kernel. Find the feature
mapping Φ(z) that corresponds to the quadratic kernel.

First we expand the dot product inside and square the entire sum. We get a constant term, the squares of the
components, the cross products, and the linear terms:

(x^T y + c)^2 = (c + Σ_{i=1}^n x_i y_i)^2
             = c^2 + Σ_{i=1}^n x_i^2 y_i^2 + Σ_{i=2}^n Σ_{j=1}^{i−1} 2 x_i y_i x_j y_j + Σ_{i=1}^n 2c x_i y_i

Pulling this sum apart into a dot product of x components and y components, we have

Φ(x) = (c, x_1^2, …, x_n^2, √2 x_1 x_2, …, √2 x_1 x_n, √2 x_2 x_3, …, √2 x_{n−1} x_n, √(2c) x_1, …, √(2c) x_n)

In this feature mapping we have c, the squared components of the vector x, √2 multiplied by all of the cross
terms, and √(2c) multiplied by all of the components.
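A small Python sketch that builds this feature mapping and checks Φ(x)·Φ(y) = (x^T y + c)^2 numerically; the dimension and the value of c are arbitrary choices.

    import numpy as np

    # Sketch: verify the quadratic-kernel feature map numerically.
    def quad_features(x, c):
        n = len(x)
        sq = x ** 2                                       # x_i^2 terms
        cross = [np.sqrt(2) * x[i] * x[j]                 # sqrt(2) x_i x_j, i < j
                 for i in range(n) for j in range(i + 1, n)]
        lin = np.sqrt(2 * c) * x                          # sqrt(2c) x_i terms
        return np.concatenate([[c], sq, cross, lin])

    rng = np.random.default_rng(1)
    x, y, c = rng.normal(size=4), rng.normal(size=4), 0.7

    lhs = (x @ y + c) ** 2
    rhs = quad_features(x, c) @ quad_features(y, c)
    print(np.isclose(lhs, rhs))   # True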

Q5. [8 pts] L2-Regularized Linear Regression with Newton's Method
Recall that the objective function for L2-regularized linear regression is

J(w) = ‖Xw − y‖_2^2 + λ‖w‖_2^2

where X is the design matrix (the rows of X are the data points).

The global minimizer of J is given by

w* = (X^T X + λI)^{-1} X^T y

(a) [8 pts] Consider running Newton’s method to minimize J.


Let w0 be an arbitrary initial guess for Newton’s method. Show that w1 , the value of the weights after one
Newton step, is equal to w∗ .

Recall that Newton's method for optimization takes the step

w1 = w0 − [H(J(w0))]^{-1} ∇_w J(w0)

Computing the gradient, we have

∇_w J(w) = 2X^T X w − 2X^T y + 2λw = 2[(X^T X + λI)w − X^T y]

Computing the Hessian, we have

H(J(w)) = ∇_w^2 J(w) = 2X^T X + 2λI = 2(X^T X + λI)

We initialize w0 to some arbitrary value; note that the choice will not matter. Plugging into the update, we have

w1 = w0 − [2(X^T X + λI)]^{-1} · 2[(X^T X + λI)w0 − X^T y]
   = w0 − (X^T X + λI)^{-1}(X^T X + λI)w0 + (X^T X + λI)^{-1} X^T y
   = w0 − w0 + (X^T X + λI)^{-1} X^T y
   = (X^T X + λI)^{-1} X^T y

Thus, w1 = w*.
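As a sanity check, the following Python sketch takes one Newton step on a randomly generated regression problem (the sizes and λ are arbitrary) and confirms it lands on the closed-form solution w*.

    import numpy as np

    # Sketch: one Newton step from a random w0 equals the closed-form
    # ridge-regression solution. Problem sizes and lambda are arbitrary.
    rng = np.random.default_rng(0)
    n, d, lam = 50, 5, 0.1
    X = rng.normal(size=(n, d))
    y = rng.normal(size=n)
    w0 = rng.normal(size=d)

    grad = 2 * (X.T @ X @ w0 - X.T @ y + lam * w0)   # gradient of J at w0
    hess = 2 * (X.T @ X + lam * np.eye(d))           # Hessian (constant in w)
    w1 = w0 - np.linalg.solve(hess, grad)            # one Newton step

    w_star = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    print(np.allclose(w1, w_star))                   # True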

Q6. [8 pts] Maximum Likelihood Estimation
(a) [8 pts] Let x1 , x2 , . . . , xn be independent samples from the following distribution:

P(x | θ) = θ x^{−θ−1}    where θ > 1, x ≥ 1

Find the maximum likelihood estimator of θ.

L(θ | x1, x2, …, xn) = Π_{i=1}^n θ x_i^{−θ−1} = θ^n Π_{i=1}^n x_i^{−θ−1}

ln L(θ | x1, x2, …, xn) = n ln θ − (θ + 1) Σ_{i=1}^n ln x_i

∂ ln L / ∂θ = n/θ − Σ_{i=1}^n ln x_i = 0

θ_mle = n / Σ_{i=1}^n ln x_i

Since the model requires θ > 1, an estimate θ_mle ≤ 1 lies outside the valid parameter range, so when the
unconstrained maximizer is at most 1 our best estimate is the boundary value θ_mle = 1. Therefore, the final
answer is θ_mle = max(1, n / Σ_{i=1}^n ln x_i).

However, we will still accept θ_mle = n / Σ_{i=1}^n ln x_i.
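The estimator is easy to check by simulation. Here is a Python sketch that draws samples from this density (a Pareto distribution with minimum value 1) and applies the estimator; the true θ and the sample size are arbitrary choices.

    import numpy as np

    # Sketch: p(x | theta) = theta * x^(-theta - 1) on x >= 1 is a Pareto
    # distribution with scale 1, obtained from numpy's Lomax sampler by adding 1.
    rng = np.random.default_rng(0)
    theta_true, n = 3.0, 100_000
    x = 1.0 + rng.pareto(theta_true, size=n)

    theta_mle = max(1.0, n / np.sum(np.log(x)))   # clipped to the range theta > 1
    print(theta_mle)                              # close to 3.0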

Q7. [13 pts] Affine Transformations of Random Variables
Let X be a d-dimensional random vector with mean µ and covariance matrix Σ. Let Y = AX + b, where A is an
n × d matrix and b is an n-dimensional vector.

(a) [6 pts] Show that the mean of Y is Aµ + b.

E(Y) = E(AX + b) = E(AX) + E(b) = AE(X) + b = Aµ + b

(b) [7 pts] Show that the covariance matrix of Y is AΣAT .

Var(Y) = E[(Y − E[Y])(Y − E[Y])^T] = E[(AX + b − Aµ − b)(AX + b − Aµ − b)^T]
       = E[(AX − Aµ)(AX − Aµ)^T] = E[A(X − µ)(X − µ)^T A^T] = A E[(X − µ)(X − µ)^T] A^T
       = AΣA^T
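Both identities can be confirmed empirically. Below is a Python sketch that samples X from a Gaussian with an arbitrary µ and Σ, applies an arbitrary affine map, and compares the empirical mean and covariance of Y with Aµ + b and AΣA^T.

    import numpy as np

    # Sketch: check E[Y] = A mu + b and Var(Y) = A Sigma A^T by sampling.
    # mu, Sigma, A, b, and the sample size are all arbitrary choices.
    rng = np.random.default_rng(0)
    mu = np.array([1.0, -2.0, 0.5])
    Sigma = np.array([[2.0, 0.3, 0.0],
                      [0.3, 1.0, 0.2],
                      [0.0, 0.2, 0.5]])
    A = rng.normal(size=(2, 3))
    b = rng.normal(size=2)

    X = rng.multivariate_normal(mu, Sigma, size=500_000)   # one sample per row
    Y = X @ A.T + b

    print(np.allclose(Y.mean(axis=0), A @ mu + b, atol=0.05))               # True
    print(np.allclose(np.cov(Y, rowvar=False), A @ Sigma @ A.T, atol=0.05)) # True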

Q8. [15 pts] Generative Models
Consider a generative classification model for K classes defined by the following:

• Prior class probabilities: P (Ck ) = πk k = 1, . . . , K


• General class-conditional densities: P (x|Ck ) k = 1, . . . , K

Suppose we are given training data {(x_n, y_n)}_{n=1}^N drawn independently from this model.

The labels y_i are "one-of-K" vectors; that is, K-dimensional vectors of all 0's except for a single 1 at the element
corresponding to the class. For example, if K = 4 and the true label of x_i is class 2, then

y_i = [0 1 0 0]^T

(a) [5 pts] Write the log likelihood of the data set. You may use y_{ij} to denote the j-th element of y_i.

The probability of one data point is

P(x, y) = P(x | y) P(y) = Π_{k=1}^K (P(x | C_k) π_k)^{y_k}

I denote the parameters of this model as θ. The independent samples allow us to take a product over the data
points:

L(θ) = Π_{n=1}^N Π_{k=1}^K (P(x_n | C_k) π_k)^{y_{n,k}}

Thus,

l(θ) = Σ_{n=1}^N Σ_{k=1}^K y_{n,k} [log P(x_n | C_k) + log π_k]
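As a concreteness check, here is a short Python sketch that evaluates this log likelihood for made-up class-conditional densities, priors, and one-hot labels (all illustrative values, not from the exam).

    import numpy as np

    # Sketch: evaluate l(theta) for made-up quantities. P is an N x K matrix of
    # class-conditional densities P(x_n | C_k), Y the one-hot labels, pi the priors.
    rng = np.random.default_rng(0)
    N, K = 5, 3
    P = rng.uniform(0.1, 1.0, size=(N, K))         # stand-in densities
    Y = np.eye(K)[rng.integers(0, K, size=N)]      # random one-hot labels
    pi = np.array([0.5, 0.3, 0.2])

    log_lik = np.sum(Y * (np.log(P) + np.log(pi))) # double sum over n and k
    print(log_lik)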

(b) [10 pts] What are the maximum likelihood estimates of the prior probabilities?
(Hint: Remember to use Lagrange multipliers!)
We want to maximize the log likelihood subject to the constraint that Σ_{k=1}^K π_k = 1, so we must introduce
a Lagrange multiplier. The parameters we care about here are the π_k's. Here is the Lagrangian:

ℒ(π, λ) = Σ_{n=1}^N Σ_{k=1}^K y_{n,k} [log P(x_n | C_k) + log π_k] + λ (Σ_{k=1}^K π_k − 1)

Taking the derivative with respect to π_k and setting it to 0, we have

∂ℒ/∂π_k = (1/π_k) Σ_{n=1}^N y_{n,k} + λ = 0  ⟹  π_k = −(1/λ) Σ_{n=1}^N y_{n,k} = −N_k/λ

where N_k is the number of data points whose label is class k. Taking the derivative with respect to λ, we have

∂ℒ/∂λ = Σ_{k=1}^K π_k − 1 = 0  ⟹  Σ_{k=1}^K π_k = 1

We can plug all of our values of the π_k's into the constraint, giving us the value of λ:

Σ_{k=1}^K π_k = Σ_{k=1}^K (−N_k/λ) = −N/λ = 1  ⟹  λ = −N

After having solved for λ, we can plug this back into our other equations to solve for the π_k's. Thus, we have
that the maximum likelihood estimates of the prior probabilities are

π_k = N_k/N
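In code, this estimate is just the column sums of the one-hot label matrix divided by N. A minimal Python sketch, with a made-up label matrix for N = 6 and K = 3:

    import numpy as np

    # Sketch: the MLE of the priors is the class frequency N_k / N.
    # The one-hot label matrix below is made up for illustration.
    Y = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [0, 1, 0],
                  [0, 0, 1],
                  [1, 0, 0],
                  [0, 1, 0]])

    N_k = Y.sum(axis=0)          # class counts: [2, 3, 1]
    pi_mle = N_k / Y.shape[0]    # [0.333..., 0.5, 0.166...]
    print(pi_mle)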

