
CS 675 Introduction to Machine Learning (Spring 2022): Midterm Exam

Maximum Points to Gain: 100


SOLUTIONS
Name:

1. [20 points]

(a) [10 points] You have a new box containing 8 apples and 4 oranges and an old
box containing 10 apples and 2 oranges. You select a box at random with equal
probability, select an item from that box uniformly at random, and find it is an
apple. Use Bayes' theorem to find the probability that the apple came from the
old box.
Solution:

$$p(o \mid a) = \frac{p(a \mid o)\,p(o)}{p(a)} = \frac{(10/12)(1/2)}{(10/12)(1/2) + (8/12)(1/2)} = \frac{5}{9}$$
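As a quick numerical sanity check of this result (a minimal sketch in Python; the counts come from the problem statement, the variable names are illustrative):

    # Verify p(old | apple) = 5/9 using the counts from part (a)
    p_old, p_new = 0.5, 0.5          # each box is chosen with equal probability
    p_apple_old = 10 / 12            # old box: 10 apples out of 12 items
    p_apple_new = 8 / 12             # new box: 8 apples out of 12 items

    p_apple = p_apple_old * p_old + p_apple_new * p_new
    print(p_apple_old * p_old / p_apple)   # 0.5555... = 5/9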

(b) [10 points]


Consider two data sets D1 and D2, each consisting of scalar measurements:
xi, i = 1, . . . , N1 for D1 and xj, j = 1, . . . , N2 for D2. Assume that each set
of measurements comes from a Gaussian distribution. The two Gaussians share a
common variance σ2, and the mean µ2 of the Gaussian for data set D2 is known
to be three times the mean µ1 for the first data set D1, i.e., µ2 = 3µ1. The
parameters µ1, µ2, and σ2 are all assumed unknown and are to be estimated.
Define the log-likelihood for this problem in terms of µ1 and σ2.
Solution:
The likelihood of the datasets is given by:
$$p(D_1, D_2 \mid \mu_1, \mu_2, \sigma^2) = \prod_{i=1}^{N_1} \mathcal{N}(x_i \mid \mu_1, \sigma^2) \prod_{j=1}^{N_2} \mathcal{N}(x_j \mid \mu_2, \sigma^2) = \prod_{i=1}^{N_1} \mathcal{N}(x_i \mid \mu_1, \sigma^2) \prod_{j=1}^{N_2} \mathcal{N}(x_j \mid 3\mu_1, \sigma^2)$$

The log-likelihood is then given by

$$\ln p(D_1, D_2 \mid \mu_1, \mu_2, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N_1} (x_i - \mu_1)^2 - \frac{N_1}{2} \ln \sigma^2 - \frac{N_1}{2} \ln(2\pi) - \frac{1}{2\sigma^2} \sum_{j=1}^{N_2} (x_j - 3\mu_1)^2 - \frac{N_2}{2} \ln \sigma^2 - \frac{N_2}{2} \ln(2\pi)$$
$$= -\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{N_1} (x_i - \mu_1)^2 + \sum_{j=1}^{N_2} (x_j - 3\mu_1)^2 \right] - \frac{N_1 + N_2}{2} \ln \sigma^2 - \frac{N_1 + N_2}{2} \ln(2\pi)$$
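The derived expression can be sanity-checked numerically; the sketch below (assuming NumPy and SciPy are available, with made-up values for mu1, sigma2, N1, and N2) compares it against a direct evaluation of the Gaussian log-densities:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    mu1, sigma2 = 1.5, 2.0                           # illustrative parameter values
    D1 = rng.normal(mu1, np.sqrt(sigma2), 50)        # N1 = 50 samples with mean mu1
    D2 = rng.normal(3 * mu1, np.sqrt(sigma2), 80)    # N2 = 80 samples with mean 3*mu1
    N1, N2 = len(D1), len(D2)

    # Direct log-likelihood from the Gaussian densities
    direct = (norm.logpdf(D1, mu1, np.sqrt(sigma2)).sum()
              + norm.logpdf(D2, 3 * mu1, np.sqrt(sigma2)).sum())

    # Closed-form expression derived above
    closed = (-(np.sum((D1 - mu1) ** 2) + np.sum((D2 - 3 * mu1) ** 2)) / (2 * sigma2)
              - (N1 + N2) / 2 * np.log(sigma2) - (N1 + N2) / 2 * np.log(2 * np.pi))

    print(np.isclose(direct, closed))                # True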

2. [10 points]
Answer True or False for the following questions.

(a) Let us denote ∥a∥ as the norm of a vector a ∈ Rd . For any two vectors x ∈ Rd
and y ∈ Rd , we have ∥x + y∥ ≤ ∥x∥ + ∥y∥
Solution: True

(b) If λ is a non-zero eigenvalue of a square matrix A ∈ Rn×n , then for any real
positive constant c, c · λ must be an eigenvalue of A.
Solution: False

(c) For any two vectors x ∈ Rd and y ∈ Rd , we have x⊤ y = y⊤ x.


Solution: True

(d) Given two matrices A ∈ R7×6 and B ∈ R6×4 , the rank of their product AB
can be 6.
Solution: False

(e) If there exist two identical columns in a matrix A ∈ Rm×n , then the rank of
A must be smaller than n.
Solution: True
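These answers can be spot-checked numerically; a minimal sketch with randomly generated matrices (the shapes follow parts (a), (d), and (e); everything else is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    # (a) Triangle inequality for norms
    x, y = rng.normal(size=5), rng.normal(size=5)
    print(np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y))  # True

    # (d) rank(AB) <= min(rank(A), rank(B)) <= 4, so it can never be 6
    A, B = rng.normal(size=(7, 6)), rng.normal(size=(6, 4))
    print(np.linalg.matrix_rank(A @ B))   # at most 4

    # (e) Duplicating a column makes the columns linearly dependent
    M = rng.normal(size=(6, 4))
    M[:, 3] = M[:, 0]                     # two identical columns
    print(np.linalg.matrix_rank(M))       # 3 < n = 4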

3. [20 points]
The loss function for the linear regression model is as follows:

$$L(w) = \|y - Xw\|_2^2$$

where X ∈ RN ×d is the data matrix with N samples and d features, y ∈ RN ×1 is the
target, and w ∈ Rd×1 is the vector of parameters to be learned.

(a) [10 points] Derive the first-order derivative of L(w) with respect to w.
Solution:

Let L(w) denote the objective to be minimized:

$$L(w) = \|y - Xw\|_2^2 = y^T y - y^T X w - w^T X^T y + w^T X^T X w$$

We calculate the derivative:

$$\frac{\partial L(w)}{\partial w} = -2 X^T y + 2 X^T X w = 2 X^T X w - 2 X^T y$$
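The closed-form gradient can be verified with a finite-difference check; this is a minimal sketch using random data (the data and step size are illustrative, not part of the exam):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 20, 4
    X, y, w = rng.normal(size=(N, d)), rng.normal(size=N), rng.normal(size=d)

    loss = lambda w: np.sum((y - X @ w) ** 2)      # L(w) = ||y - Xw||_2^2
    grad = 2 * X.T @ X @ w - 2 * X.T @ y           # derived closed-form gradient

    # Central finite-difference approximation of the gradient
    eps = 1e-6
    num_grad = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                         for e in np.eye(d)])
    print(np.allclose(grad, num_grad, atol=1e-4))  # True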

(b) [10 points] Discuss whether or not we can achieve sparsity on the parameters
by optimizing the following problem:

$$\min_{w} \; \|y - Xw\|_2^2 + \lambda \|y\|_1 .$$

Provide your explanations.


Solution:
No. The vector y is a constant (the observed targets), so the term λ∥y∥1 does not
depend on w. Hence, the problem

$$\min_{w} \; \|y - Xw\|_2^2 + \lambda \|y\|_1$$

is the same as

$$\min_{w} \; \|y - Xw\|_2^2 .$$

Thus, the penalty has no impact on the final solution for w and cannot induce
sparsity in the parameters.
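For contrast, penalizing the parameters themselves with an ℓ1 term (the Lasso, λ∥w∥1) does induce sparsity. A minimal sketch, assuming scikit-learn is available and using synthetic data where only two features are informative:

    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    w_true = np.array([3.0, -2.0] + [0.0] * 8)      # only 2 informative features
    y = X @ w_true + 0.1 * rng.normal(size=100)

    # lambda * ||y||_1 is a constant, so that problem reduces to ordinary least squares
    ols = LinearRegression().fit(X, y)
    # lambda * ||w||_1 (Lasso) shrinks irrelevant coefficients exactly to zero
    lasso = Lasso(alpha=0.1).fit(X, y)

    print(np.sum(np.abs(ols.coef_) < 1e-8))    # 0 exact zeros
    print(np.sum(np.abs(lasso.coef_) < 1e-8))  # several exact zeros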

4. [10 points]
Answer True or False for the following questions about Perceptron and Logistic
Regression.

(a) The sigmoid function is able to map any real-value input to the range of (0, 1).
Solution: True

(b) Both Perceptron and Logistic Regression have linear boundaries.


Solution: True

(c) The Perceptron algorithm updates (makes changes to) the parameters only when
it encounters a mis-classified sample.
Solution: True

(d) When we adopt the Stochastic Gradient Descent (SGD) algorithm to optimize
the objective of Logistic Regression (at each step, we use a single sample to update
the parameters), SGD changes the parameters only when the current sample is
mis-classified.
Solution: False

(e) When dealing with a binary classification problem, the Perceptron algorithm can
always find a hyperplane to perfectly separate samples from two classes without
making any mistakes.
Solution: False
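The contrast between (c) and (d) follows from the two update rules. Below is a minimal sketch of a single update step for each model with labels in {−1, +1}; the function names, learning rates, and example data are illustrative assumptions:

    import numpy as np

    def perceptron_step(w, x, y, lr=1.0):
        # Perceptron: update only if the sample is misclassified (y * w^T x <= 0)
        if y * (w @ x) <= 0:
            w = w + lr * y * x
        return w

    def logistic_sgd_step(w, x, y, lr=0.1):
        # Logistic regression SGD: the log-loss gradient is nonzero even for
        # correctly classified samples, so the parameters change on (almost) every step
        sigma = 1.0 / (1.0 + np.exp(-y * (w @ x)))
        return w + lr * (1.0 - sigma) * y * x

    x, y = np.array([1.0, 2.0, -1.0]), 1
    print(perceptron_step(np.array([5.0, 5.0, 5.0]), x, y))   # unchanged: correctly classified
    print(logistic_sgd_step(np.array([5.0, 5.0, 5.0]), x, y)) # still updated (slightly)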

5. [20 points]
Naive Bayes.

(a) [9 points] Let (x[1], x[2], x[3]) denote the attributes of a sample x and y denote
the class label. For the sample x = (x[1], x[2], x[3]), the Naive Bayes classifier
predicts its label ŷ as:

$$\hat{y} = \operatorname{argmax}_y P(y \mid x[1], x[2], x[3])$$
$$\stackrel{(1)}{=} \operatorname{argmax}_y \frac{P(x[1], x[2], x[3] \mid y)\, P(y)}{P(x[1], x[2], x[3])}$$
$$\stackrel{(2)}{=} \operatorname{argmax}_y P(x[1], x[2], x[3] \mid y)\, P(y)$$
$$\stackrel{(3)}{=} \operatorname{argmax}_y P(x[1] \mid y)\, P(x[2] \mid y)\, P(x[3] \mid y)\, P(y)$$

Please explain the basis for each step of the derivation.


Solution:
Step (1): It follows directly from Bayes' theorem.
Step (2): The maximum is taken with respect to y; thus, P(x[1], x[2], x[3]) in the
denominator is a constant that does not affect the argmax and can be ignored.
Step (3): By the Naive Bayes assumption, the attributes x[1], x[2], x[3] are
conditionally independent of each other given the class label:

$$P(x[1], x[2], x[3] \mid y) = P(x[1] \mid y)\, P(x[2] \mid y)\, P(x[3] \mid y)$$

(b) [11 points] Consider the training data in the following table, where each sample
consists of two features x[1], x[2] and one label y. Specifically, the attributes
x[1] ∈ {1, 2, 3}, x[2] ∈ {S, M, L} and the label y ∈ {0, 1}. Learn a Naive Bayes
classifier from the training data and then predict the label of an unseen sample
xu = (1, S) with this learned Naive Bayes classifier.

ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
x[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
x[2] S M M S S S M M L L L M M L L
y 0 0 1 1 0 0 0 1 1 1 1 1 1 1 0
Solution:
P(x[1] = 1 | y = 0) = 1/2    P(x[1] = 2 | y = 0) = 1/3    P(x[1] = 3 | y = 0) = 1/6
P(x[1] = 1 | y = 1) = 2/9    P(x[1] = 2 | y = 1) = 1/3    P(x[1] = 3 | y = 1) = 4/9
P(x[2] = S | y = 0) = 1/2    P(x[2] = M | y = 0) = 1/3    P(x[2] = L | y = 0) = 1/6
P(x[2] = S | y = 1) = 1/9    P(x[2] = M | y = 1) = 4/9    P(x[2] = L | y = 1) = 4/9

P(y = 0) = 2/5    P(y = 1) = 3/5

P(xu[1] = 1, xu[2] = S | y = 0) P(y = 0)
  = P(xu[1] = 1 | y = 0) P(xu[2] = S | y = 0) P(y = 0)
  = 1/2 × 1/2 × 2/5 = 1/10

P(xu[1] = 1, xu[2] = S | y = 1) P(y = 1)
  = P(xu[1] = 1 | y = 1) P(xu[2] = S | y = 1) P(y = 1)
  = 2/9 × 1/9 × 3/5 = 2/135

Since 1/10 > 2/135, the Naive Bayes classifier predicts ŷ = 0.
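The estimates and the final comparison can be reproduced directly from the table; a minimal sketch using exact fractions and plain maximum-likelihood counts (no smoothing), matching the solution above:

    from fractions import Fraction

    x1 = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3]
    x2 = list("SMMSSSMMLLLMMLL")
    y  = [0,0,1,1,0,0,0,1,1,1,1,1,1,1,0]

    def cond_prob(feature, value, label):
        # Maximum-likelihood estimate of P(feature = value | y = label)
        idx = [i for i in range(len(y)) if y[i] == label]
        return Fraction(sum(feature[i] == value for i in idx), len(idx))

    prior = {c: Fraction(y.count(c), len(y)) for c in (0, 1)}
    scores = {c: cond_prob(x1, 1, c) * cond_prob(x2, "S", c) * prior[c]
              for c in (0, 1)}
    print(scores)                        # {0: Fraction(1, 10), 1: Fraction(2, 135)}
    print(max(scores, key=scores.get))   # predicted label: 0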

6. [20 points]
Consider a linearly separable dataset consisting of samples from two classes (circles
with label 1 and squares with label −1) as shown in Figure 1. The dataset can be
formally represented as {xi , yi }, i = 1, . . . , N , where xi is the data point and yi ∈ {−1, +1} is
the associated label. A linear decision boundary (a hyperplane) can be represented as
wT x + b = 0. By varying the values of w and b, we can obtain different hyperplanes
that can perfectly separate the samples from the two classes. For example, both the
hyperplane B1 and B2 can perfectly separate the samples from the two classes.
[Figure 1: Two hyperplanes, B1 and B2, that can perfectly separate the two classes; their margins are marked by b11, b12 and b21, b22.]

(a) [4 points] Both the decision boundaries B1 and B2 perfectly separate the data
samples from the two classes. Which boundary is better? Provide the reasons.
Solution: Boundary B1 is better, as it has a larger margin.

(b) [10 points] Let us focus on one decision boundary, say B1 . Assume that the
decision boundary B1 can be represented as wT x + b = 0 for some w and b.
i. [5 points] Show that w is orthogonal to the decision boundary B1 .
Solution: Let v be an arbitrary vector lying in this plane; then there exist
two points xs and xe on the plane such that v = xs − xe . Since both points
satisfy the plane equation, wT xs + b = 0 and wT xe + b = 0. We have

wT v = wT (xs − xe ) = (wT xs + b) − (wT xe + b) = 0.

Thus, w is orthogonal to any vector in the plane, which means it is orthogonal
to the plane.

ii. [5 points] Note that the decision boundary B1 perfectly separates samples
from the two classes. Specifically, for any positive sample x, we have w⊤ x +
b > 0. Show that yi (w⊤ xi + b) > 0 holds for any sample {xi , yi } in the
dataset.
Solution: For any positive sample {xi , yi }, we have yi = 1 and w⊤ xi + b > 0,
hence yi (w⊤ xi + b) = w⊤ xi + b > 0 holds for all positive samples. Similarly, for
any negative sample {xi , yi }, we have yi = −1 and, since B1 perfectly separates
the classes, w⊤ xi + b < 0. Thus, yi (w⊤ xi + b) = −(w⊤ xi + b) > 0 also holds
for all negative samples.

(c) [6 points] The (hard-margin) Support Vector Machine (SVM) can be formulated
by the following constrained optimization problem.

$$\min_{w, b} \; \frac{\|w\|^2}{2} \qquad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \; i = 1, 2, \ldots, N. \tag{1}$$

Explain the intuition behind this formulation. Specifically, answer the following
questions: 1) Why do we want to minimize ∥w∥2 /2 in the above optimization
problem? 2) Why do we need these constraints on all the samples in the above
optimization problem?
Solution: 1) We want to maximize the margin, and maximizing the margin is
equivalent to minimizing ∥w∥2 /2. 2) These constraints make sure that the boundary
defined by w and b perfectly separates all the samples in the dataset. They also
ensure that the minimum distance from any training sample to the decision boundary
is 1/∥w∥.
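A hedged numerical illustration of both points, assuming scikit-learn is available: a linear SVC with a very large C approximates the hard-margin SVM on well-separated synthetic data, and the minimum distance from the training points to the learned boundary comes out as 1/∥w∥ (all names and data below are illustrative):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Two well-separated blobs with labels -1 and +1
    X = np.vstack([rng.normal(loc=[-3, -3], scale=0.5, size=(20, 2)),
                   rng.normal(loc=[3, 3], scale=0.5, size=(20, 2))])
    y = np.array([-1] * 20 + [1] * 20)

    svm = SVC(kernel="linear", C=1e6).fit(X, y)      # large C ~ hard margin
    w, b = svm.coef_.ravel(), svm.intercept_[0]

    # Minimum distance from any training point to the decision boundary
    distances = np.abs(X @ w + b) / np.linalg.norm(w)
    print(np.isclose(distances.min(), 1 / np.linalg.norm(w), atol=1e-3))  # True
    print(np.all(y * (X @ w + b) >= 1 - 1e-6))                            # constraints satisfied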

7. [ Bonus: 8 points]
Answer True or False for the following questions about the three distinct approaches
(probabilistic generative models, probabilistic discriminative models, and discriminant
functions) for modeling classification.

(a) Probabilistic Discriminative models aim to model the class-conditional den-
sities p(x|y = k) and prior probabilities p(y = k), and then compute p(y = k|x)
using Bayes' theorem.
Solution: False

(b) Logistic Regression is an example of Probabilistic Discriminative models.


Solution: True

(c) Probabilistic Generative models typically require more parameters than Proba-
bilistic Discriminative models.
Solution: True

(d) Support Vector Machine (SVM) is an example of Probabilistic Generative
models.
Solution: False
