ML Question CMU
• For multiple-choice questions, fill in the bubbles for ALL CORRECT CHOICES (in some cases, there may be more than one). We have introduced a negative penalty for false positives on the multiple-choice questions so that the expected value of random guessing is 0. Don't worry: for this section, your final score will be the maximum of your raw score and 0, so you cannot incur a negative score.
First name
Last name
SID
Q1. [10 pts] True or False
(a) [1 pt] The hyperparameters in the regularized logistic regression model are η (learning rate) and λ (regularization
term).
True False
(b) [1 pt] The objective function used in L2 regularized logistic regression is convex.
True False
(d) [1 pt] As the number of data points approaches ∞, the error rate of a 1-NN classifier approaches 0.
True False
(e) [1 pt] Cross validation will guarantee that our model does not overfit.
True False
(f) [1 pt] As the number of dimensions increases, the percentage of the unit ball's volume that lies in its outer shell of any fixed thickness grows.
True False
(g) [1 pt] In logistic regression, the Hessian of the (non-regularized) log likelihood is positive definite.
True False
(h) [1 pt] Given a binary classification scenario with Gaussian class conditionals and equal prior probabilities, the
optimal decision boundary will be linear.
True False
(i) [1 pt] In the primal version of SVM, we are minimizing the Lagrangian with respect to w and in the dual
version, we are minimizing the Lagrangian with respect to α.
True False
(j) [1 pt] For the dual version of soft margin SVM, the αi ’s for support vectors satisfy αi > C.
True False
Q2. [24 pts] Multiple Choice
(a) [3 pts] Consider the binary classification problem where y ∈ {0, 1} is the label and we have prior probability
P (y = 0) = π0 . If we model P (x|y = 1) to be the following distributions, which one(s) will cause the posterior
P (y = 1|x) to have a logistic function form?
Gaussian Uniform
(b) [3 pts] Given the following data samples (square and triangle belong to two different classes), which one(s) of
the following algorithms can produce zero training error?
(c) [3 pts] The following diagrams show the iso-probability contours for two different 2D Gaussian distributions. On
the left side, the data ∼ N (0, I) where I is the identity matrix. The right side has the same set of contour levels
as left side. What is the mean and covariance matrix for the right side’s multivariate Gaussian distribution?
[Figure: two iso-probability contour plots; both panels have x and y axes ranging from −5 to 5. Left: contours of N(0, I). Right: contours of the unknown Gaussian at the same contour levels.]
" # " #
1 0 4 0
µ = [0, 0]T , Σ= µ = [0, 1]T , Σ=
0 1 0 0.25
" # " #
T
1 0 T
2 0
µ = [0, 1] , Σ= µ = [0, 1] , Σ=
0 1 0 0.5
(d) [3 pts] Given the following data samples (square and triangle denote two different classes), which one(s) of the following kernels can we use in SVM to separate the two classes?
(e) [3 pts] Consider the following plots of the contours of the unregularized error function along with the constraint
region. What regularization term is used in this case?
L2 L∞
a ∈ ℝ          a ≥ 0
−√20 ≤ a ≤ √20          −√20 < a < √20
(g) [3 pts] The soft margin SVM formulation is as follows:
    min_{w, b, ξ}   (1/2) w^T w + C Σ_{i=1}^N ξ_i
    subject to   y_i (w^T x_i + b) ≥ 1 − ξ_i   ∀i
                 ξ_i ≥ 0   ∀i
What is the behavior of the width of the margin (2/‖w‖) as C → 0?
(h) [3 pts] In Homework 4, you fit a logistic regression model on spam and ham data for a Kaggle Competition.
Assume you had a very good score on the public test set, but when the GSIs ran your model on a private test
set, your score dropped a lot. This is likely because you overfitted by submitting multiple times and changing
the following between submissions:
(i) [0 pts] BONUS QUESTION (Answer this only if you have time and are confident in your other answers, because it is not worth extra points.)
We have constructed the multiple choice problems such that every false positive will incur some negative
penalty. For one of these multiple choice problems, given that there are p points, r correct answers, and k
choices, what is the formula for the penalty such that the expected value of random guessing is equal to 0?
(You may assume k > r)
p/(k − r)
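As a sanity check (assuming the p points are split evenly, so each of the r correct bubbles is worth p/r): the expected score of filling in one bubble uniformly at random is (r/k)·(p/r) − ((k − r)/k)·(p/(k − r)) = p/k − p/k = 0.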
Q3. [8 pts] Decision Theory
Consider the following generative model for a 2-class classification problem, in which the class conditionals are
Bernoulli distributions:
p(ω1 ) = π
p(ω2 ) = 1 − π
    x | ω1 = { 1 with probability 0.5;  0 with probability 0.5 }
    x | ω2 = { 1 with probability 0.5;  0 with probability 0.5 }
(a) [8 pts] Give a condition in terms of λ12 , λ21 , and π that determines when class 1 should always be chosen as
the minimum-risk class.
By Bayes' rule,
    P(ω2 | x) = P(x | ω2) P(ω2) / P(x) = (1/2)(1 − π) / P(x)
and likewise P(ω1 | x) = (1/2)π / P(x). Assuming zero loss for correct decisions (λ11 = λ22 = 0), the conditional risks are
    R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x) = λ12 (1 − π) / (2 P(x))
    R(α2 | x) = λ21 P(ω1 | x) = λ21 π / (2 P(x))
Class 1 is always the minimum-risk choice when R(α1 | x) ≤ R(α2 | x) for every x, i.e. when
    λ12 (1 − π) ≤ λ21 π
Q4. [14 pts] Kernels
(a) [6 pts] Let k1 and k2 be (valid) kernels; that is, k1 (x, y) = Φ1 (x)T Φ1 (y) and k2 (x, y) = Φ2 (x)T Φ2 (y).
Show that k = k1 + k2 is a valid kernel by explicitly constructing a corresponding feature mapping Φ(z).
k(x, y) = k1(x, y) + k2(x, y) = Φ1(x)^T Φ1(y) + Φ2(x)^T Φ2(y) = [Φ1(x); Φ2(x)]^T [Φ1(y); Φ2(y)]
If we let Φ(z) = [Φ1(z); Φ2(z)], the concatenation of the two feature vectors, then k(x, y) = Φ(x)^T Φ(y). Therefore, k = k1 + k2 is a valid kernel.
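A minimal numpy sketch of this construction, using two illustrative feature maps (Φ1(x) = x and Φ2(x) = the elementwise squares of x; these particular maps are only examples):

import numpy as np

def phi1(x):                     # illustrative feature map for k1 (linear kernel)
    return x

def phi2(x):                     # illustrative feature map for k2 (elementwise squares)
    return x ** 2

def phi(z):                      # concatenated feature map Phi(z) = [Phi1(z); Phi2(z)]
    return np.concatenate([phi1(z), phi2(z)])

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

k_sum = phi1(x) @ phi1(y) + phi2(x) @ phi2(y)   # k1(x, y) + k2(x, y)
print(np.isclose(k_sum, phi(x) @ phi(y)))       # True: the sum is the kernel of Phi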
(b) [8 pts] Consider the polynomial kernel k(x, y) = (x^T y + c)^d, where x, y ∈ ℝ^n and c ≥ 0. When we take d = 2, this kernel is called the quadratic kernel. Find the feature mapping Φ(z) that corresponds to the quadratic kernel.
First we expand the dot product inside, and square the entire sum. We will get a sum of the squares of the
components and a sum of the cross products.
    (x^T y + c)^2 = (c + Σ_{i=1}^n x_i y_i)^2
                  = c^2 + Σ_{i=1}^n x_i^2 y_i^2 + Σ_{i=2}^n Σ_{j=1}^{i−1} 2 x_i y_i x_j y_j + Σ_{i=1}^n 2c x_i y_i
Pulling this sum apart into a dot product of x-dependent and y-dependent factors, we have
    Φ(x) = (c, x_1^2, ..., x_n^2, √2 x_1 x_2, ..., √2 x_1 x_n, √2 x_2 x_3, ..., √2 x_{n−1} x_n, √(2c) x_1, ..., √(2c) x_n)
In this feature mapping we have c, the squared components of x, √2 times each of the cross terms x_i x_j (i < j), and √(2c) times each of the components.
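A small numpy check of this feature map against the quadratic kernel evaluated directly (the particular x, y, and c below are arbitrary):

import numpy as np

def quad_phi(x, c):
    # Explicit feature map derived above for the quadratic kernel (x^T y + c)^2
    n = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return np.concatenate([[c], x ** 2, cross, np.sqrt(2 * c) * x])

x = np.array([1.0, -2.0, 0.5])
y = np.array([3.0, 1.0, 2.0])
c = 1.5

print(np.isclose((x @ y + c) ** 2, quad_phi(x, c) @ quad_phi(y, c)))   # True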
Q5. [8 pts] L2-Regularized Linear Regression with Newton's Method
Recall that the objective function for L2-regularized linear regression is
    J(w) = ‖Xw − y‖^2 + λ‖w‖^2
where X is the design matrix (the rows of X are the data points) and y is the vector of targets. Show that a single Newton step reaches the minimizer w*, regardless of the initialization.
The gradient and Hessian of J are
    ∇_w J(w) = 2(X^T X + λI)w − 2X^T y,    H(J(w)) = 2(X^T X + λI)
and the Newton update is
    w_1 = w_0 − [H(J(w_0))]^{−1} ∇_w J(w_0)
We initialize w_0 to some value; note that this won't matter. Plugging this in, we have
    w_1 = w_0 − (X^T X + λI)^{−1} [(X^T X + λI) w_0 − X^T y] = (X^T X + λI)^{−1} X^T y = w*
Thus, w_1 = w*, the closed-form minimizer of J, after a single step.
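A minimal numpy sketch checking this one-step behavior (the data, targets, and λ below are arbitrary illustrations):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))              # design matrix: 50 data points, 3 features
y = rng.normal(size=50)                   # targets
lam = 0.1                                 # regularization strength

A = X.T @ X + lam * np.eye(3)
w_star = np.linalg.solve(A, X.T @ y)      # closed-form solution (X^T X + lam I)^{-1} X^T y

w0 = rng.normal(size=3)                   # arbitrary initialization
grad = 2 * (A @ w0 - X.T @ y)             # gradient of ||Xw - y||^2 + lam ||w||^2 at w0
hess = 2 * A                              # Hessian (constant in w)
w1 = w0 - np.linalg.solve(hess, grad)     # one Newton step

print(np.allclose(w1, w_star))            # True: Newton's method converges in one step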
Q6. [8 pts] Maximum Likelihood Estimation
(a) [8 pts] Let x_1, x_2, ..., x_n be independent samples from the following distribution (with parameter θ > 1):
    p(x; θ) = θ x^{−θ−1},   x ≥ 1
Find the maximum likelihood estimate of θ.
The likelihood is
    L(θ | x_1, x_2, ..., x_n) = ∏_{i=1}^n θ x_i^{−θ−1} = θ^n ∏_{i=1}^n x_i^{−θ−1}
so the log likelihood is
    ln L(θ | x_1, x_2, ..., x_n) = n ln θ − (θ + 1) Σ_{i=1}^n ln x_i
Setting the derivative to zero,
    ∂ ln L / ∂θ = n/θ − Σ_{i=1}^n ln x_i = 0   ⟹   θ_mle = n / Σ_{i=1}^n ln x_i
Since the parameter is constrained to θ > 1, an estimate at or below 1 is not admissible, so when n / Σ_{i=1}^n ln x_i ≤ 1 our best estimate is θ_mle = 1. Therefore, the final answer is θ_mle = max(1, n / Σ_{i=1}^n ln x_i).
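A quick numerical sanity check of this estimator, sampling from a classical Pareto distribution with minimum value 1, which has exactly the density θ x^{−θ−1} (the true θ below is arbitrary):

import numpy as np

rng = np.random.default_rng(0)
theta_true = 3.0
n = 100_000
x = rng.pareto(theta_true, size=n) + 1.0     # classical Pareto with x_m = 1: density theta * x^(-theta-1), x >= 1
theta_mle = max(1.0, n / np.log(x).sum())    # the estimator derived above, clipped at 1
print(theta_mle)                             # close to 3.0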
Q7. [13 pts] Affine Transformations of Random Variables
Let X be a d-dimensional random vector with mean µ and covariance matrix Σ. Let Y = AX + b, where A is an n × d matrix and b is an n-dimensional vector.
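For this setup, Y has mean Aµ + b and covariance AΣA^T. A minimal numpy sketch checking these identities empirically (the particular µ, Σ, A, and b below are arbitrary illustrations):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])                           # mean of X (d = 2)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])           # covariance of X
A = np.array([[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]])  # n x d with n = 3
b = np.array([0.5, -1.0, 2.0])

X = rng.multivariate_normal(mu, Sigma, size=500_000)  # samples of X (one per row)
Y = X @ A.T + b                                       # Y = A X + b, applied row-wise

print(np.abs(Y.mean(axis=0) - (A @ mu + b)).max())    # close to 0: E[Y] = A mu + b
print(np.abs(np.cov(Y.T) - A @ Sigma @ A.T).max())    # close to 0: Cov(Y) = A Sigma A^T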
Q8. [15 pts] Generative Models
Consider a generative classification model for K classes defined by class prior probabilities π_k = P(C_k), k = 1, ..., K, and class-conditional densities P(x|C_k).
The labels yi are “one-of-K” vectors; that is, K-dimensional vectors of all 0’s except for a single 1 at the element
corresponding to the class. For example, if K = 4 and the true label of xi is class 2, then
    y_i = [0, 1, 0, 0]^T
(a) [5 pts] Write the log likelihood of the data set. You may use yij to denote the j th element of yi .
The probability of one data point is
    P(x, y) = P(x | y) P(y) = ∏_{k=1}^K (P(x | C_k) π_k)^{y_k}
We denote the parameters of this model by θ. Since the samples are independent, the likelihood is a product over the data points:
    L(θ) = ∏_{n=1}^N ∏_{k=1}^K (P(x_n | C_k) π_k)^{y_{n,k}}
Thus,
    ℓ(θ) = Σ_{n=1}^N Σ_{k=1}^K y_{n,k} [log P(x_n | C_k) + log π_k]
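A compact numpy rendering of this expression, given hypothetical per-class log densities log P(x_n | C_k) stored in a matrix and one-of-K label vectors (all values below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
N, K = 5, 3
log_px = rng.normal(size=(N, K))             # hypothetical values of log P(x_n | C_k)
pi = np.array([0.2, 0.3, 0.5])               # class priors
Y = np.eye(K)[rng.integers(0, K, size=N)]    # one-of-K label vectors, one row per data point

loglik = np.sum(Y * (log_px + np.log(pi)))   # sum_n sum_k y_{n,k} [log P(x_n|C_k) + log pi_k]
print(loglik)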
(b) [10 pts] What are the maximum likelihood estimates of the prior probabilities?
(Hint: Remember to use Lagrange multipliers!)
We want to maximize the log likelihood subject to the constraint Σ_{k=1}^K π_k = 1, so we introduce a Lagrange multiplier. The parameters we care about here are the π_k's. The Lagrangian is
    ℒ(π, λ) = Σ_{n=1}^N Σ_{k=1}^K y_{n,k} [log P(x_n | C_k) + log π_k] + λ (Σ_{k=1}^K π_k − 1)
Taking the derivative with respect to π_k and setting it to zero gives
    ∂ℒ/∂π_k = Σ_{n=1}^N y_{n,k} / π_k + λ = N_k / π_k + λ = 0   ⟹   π_k = −N_k / λ
where N_k is the number of data points whose label is class k. Taking the derivative with respect to λ, we have
    ∂ℒ/∂λ = Σ_{k=1}^K π_k − 1 = 0   ⟹   Σ_{k=1}^K π_k = 1
We can plug in all of our values of the πk ’s into the constraint, giving us the value of λ:
    Σ_{k=1}^K π_k = Σ_{k=1}^K (−N_k / λ) = −N/λ = 1   ⟹   λ = −N
After having solved for λ, we can just plug this back into our other equations to solve for our πk ’s. Thus, we
have that the maximum likelihood estimates of the prior probabilities are
    π_k = N_k / N
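This estimator is just the empirical class frequency. A tiny numpy sketch on some made-up labels (the label vector below is purely illustrative):

import numpy as np

labels = np.array([0, 2, 1, 1, 0, 2, 2, 2])   # made-up class labels, K = 3 classes
K = 3
N_k = np.bincount(labels, minlength=K)        # N_k: count of points in each class
pi_mle = N_k / len(labels)                    # pi_k = N_k / N
print(pi_mle)                                 # [0.25 0.25 0.5]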