cs675 SS2022 Midterm Solution
1. [20 points]
(a) [10 points] You have a new box containing 8 apples and 4 oranges and an old
box containing 10 apples and 2 oranges. You select a box at random with equal
probability, select an item out of that box at random with equal probability, and
find that it is an apple. Use Bayes’ theorem to find the probability that the apple
came from the old box.
Solution: Let A denote the event that the selected item is an apple. Then
P(A | new) = 8/12 = 2/3, P(A | old) = 10/12 = 5/6, and P(new) = P(old) = 1/2.
By Bayes’ theorem,
P(old | A) = P(A | old)P(old) / [P(A | old)P(old) + P(A | new)P(new)]
= (5/6 · 1/2) / (5/6 · 1/2 + 2/3 · 1/2) = (5/12) / (9/12) = 5/9.
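The arithmetic can be verified exactly with a few lines of Python (the variable names below are illustrative, not part of the exam):

    # Exact check of the Bayes computation using rational arithmetic.
    from fractions import Fraction

    p_box = {"new": Fraction(1, 2), "old": Fraction(1, 2)}       # boxes chosen uniformly
    p_apple = {"new": Fraction(8, 12), "old": Fraction(10, 12)}  # P(apple | box)

    # Bayes' theorem: P(old | apple) = P(apple | old) P(old) / P(apple)
    evidence = sum(p_apple[b] * p_box[b] for b in p_box)
    print(p_apple["old"] * p_box["old"] / evidence)  # prints 5/9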
(b) [10 points] Suppose we observe a dataset D1 = {xi}, i = 1, . . . , N1, drawn i.i.d.
from N(µ1, σ²) and a dataset D2 = {xj}, j = 1, . . . , N2, drawn i.i.d. from N(3µ1, σ²).
Write down the likelihood and the log-likelihood of the two datasets.
Solution: Since all samples are independent, the likelihood is
p(D1, D2 | µ1, σ²) = Π_{i=1}^{N1} N(xi | µ1, σ²) · Π_{j=1}^{N2} N(xj | 3µ1, σ²)
The log-likelihood is then given by
ln p(D1, D2 | µ1, σ²)
= −(1/(2σ²)) Σ_{i=1}^{N1} (xi − µ1)² − (N1/2) ln σ² − (N1/2) ln(2π) − (1/(2σ²)) Σ_{j=1}^{N2} (xj − 3µ1)² − (N2/2) ln σ² − (N2/2) ln(2π)
= −(1/(2σ²)) [ Σ_{i=1}^{N1} (xi − µ1)² + Σ_{j=1}^{N2} (xj − 3µ1)² ] − ((N1 + N2)/2) ln σ² − ((N1 + N2)/2) ln(2π)
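A quick numerical check that the two lines above agree (the sample sizes and parameter values below are arbitrary):

    # Verify that the expanded and the combined forms of the log-likelihood match.
    import numpy as np

    rng = np.random.default_rng(0)
    mu1, sigma2 = 1.5, 0.8
    D1 = rng.normal(mu1, np.sqrt(sigma2), size=10)      # N1 = 10
    D2 = rng.normal(3 * mu1, np.sqrt(sigma2), size=7)   # N2 = 7
    N1, N2 = len(D1), len(D2)

    # First form: the two Gaussian log-likelihoods written out term by term.
    form1 = (-np.sum((D1 - mu1) ** 2) / (2 * sigma2)
             - N1 / 2 * np.log(sigma2) - N1 / 2 * np.log(2 * np.pi)
             - np.sum((D2 - 3 * mu1) ** 2) / (2 * sigma2)
             - N2 / 2 * np.log(sigma2) - N2 / 2 * np.log(2 * np.pi))

    # Second form: sums combined, log terms collected with N1 + N2.
    form2 = (-(np.sum((D1 - mu1) ** 2) + np.sum((D2 - 3 * mu1) ** 2)) / (2 * sigma2)
             - (N1 + N2) / 2 * np.log(sigma2) - (N1 + N2) / 2 * np.log(2 * np.pi))

    assert np.isclose(form1, form2)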
2. [10 points]
Answer True or False for the following questions.
(a) Let ∥a∥ denote the norm of a vector a ∈ Rd . For any two vectors x ∈ Rd
and y ∈ Rd , we have ∥x + y∥ ≤ ∥x∥ + ∥y∥.
Solution: True. This is the triangle inequality, which every norm satisfies.
(b) If λ is a non-zero eigenvalue of a square matrix A ∈ Rn×n , then for any real
positive constant c, c · λ must be an eigenvalue of A.
Solution: False. The eigenvalues of A are fixed: c · λ is an eigenvalue of cA, not of A, so for c ̸= 1 it need not be an eigenvalue of A.
(d) Given two matrices A ∈ R7×6 and B ∈ R6×4 , the rank of their product AB
can be 6.
Solution: False. rank(AB) ≤ min(rank(A), rank(B)) ≤ min(6, 4) = 4, so the rank can never reach 6.
(e) If two columns of a matrix A ∈ Rm×n are exactly the same, then the
rank of A must be smaller than n.
Solution: True. Two identical columns are linearly dependent, so the n columns cannot all be linearly independent and rank(A) < n.
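These facts can be spot-checked with numpy (a single random trial illustrates, but does not prove, each claim):

    import numpy as np

    rng = np.random.default_rng(0)

    # (a) Triangle inequality for the Euclidean norm.
    x, y = rng.normal(size=5), rng.normal(size=5)
    assert np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y)

    # (d) rank(AB) <= min(rank(A), rank(B)) <= 4, so rank 6 is impossible.
    A, B = rng.normal(size=(7, 6)), rng.normal(size=(6, 4))
    assert np.linalg.matrix_rank(A @ B) <= 4

    # (e) Duplicating a column makes the columns linearly dependent.
    M = rng.normal(size=(5, 4))
    M[:, 1] = M[:, 0]
    assert np.linalg.matrix_rank(M) < 4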
3. [20 points]
The loss function of the linear regression model is as follows:
L(w) = ∥y − Xw∥₂²
(a) [10 points] Derive the first-order derivative of L(w) with respect to w.
Solution:
L(w) = ∥y − Xw∥₂²
= yᵀy − yᵀXw − wᵀXᵀy + wᵀXᵀXw
= yᵀy − 2wᵀXᵀy + wᵀXᵀXw
Taking the derivative with respect to w gives
∂L(w)/∂w = −2Xᵀy + 2XᵀXw = 2Xᵀ(Xw − y).
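A finite-difference sanity check of this gradient (the problem dimensions below are arbitrary):

    # Compare the analytic gradient 2 X^T (Xw - y) with central differences.
    import numpy as np

    rng = np.random.default_rng(0)
    X, y, w = rng.normal(size=(8, 3)), rng.normal(size=8), rng.normal(size=3)

    loss = lambda w: np.sum((y - X @ w) ** 2)
    grad_analytic = 2 * X.T @ (X @ w - y)

    eps = 1e-6
    grad_numeric = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                             for e in np.eye(3)])
    assert np.allclose(grad_analytic, grad_numeric, atol=1e-4)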
(b) [10 points] Discuss whether or not we can achieve sparsity on the parameters
by optimizing the following problem:
is the same as min_w ∥y − Xw∥₂².
4. [10 points]
Answer True or False for the following questions about Perceptron and Logistic
Regression.
(a) The sigmoid function is able to map any real-valued input to the range of (0, 1).
Solution: True
(c) The Perceptron algorithm updates (makes change to) the parameters only when
dealing with a mis-classified sample.
Solution: True
(d) When we adopt the Stochastic Gradient Descent (SGD) algorithm to optimize
the objective of Logistic Regression (at each step, we use a single sample to update
the parameters), SGD makes changes to the parameters only when the current
sample is mis-classified.
Solution: False. The gradient of the logistic loss is nonzero even for correctly classified samples, so SGD updates the parameters at essentially every step.
(e) When dealing with a binary classification problem, the Perceptron algorithm can
always find a hyperplane to perfectly separate samples from two classes without
making any mistakes.
Solution: False. The Perceptron is guaranteed to find a separating hyperplane only when the data are linearly separable; otherwise it never converges.
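The contrast between (c) and (d) is visible directly in the two update rules; here is a minimal sketch for a single sample (x, y) with y ∈ {−1, +1} (the learning rates and function names are ours):

    import numpy as np

    def perceptron_step(w, x, y, lr=1.0):
        # Updates only when the sample is misclassified (question (c)).
        if y * (w @ x) <= 0:
            w = w + lr * y * x
        return w

    def logistic_sgd_step(w, x, y, lr=0.1):
        # Gradient of ln(1 + exp(-y w^T x)); it is nonzero even for
        # correctly classified samples, so every step moves w (question (d)).
        sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
        return w - lr * (-y * x * sigmoid(-y * (w @ x)))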
5. [20 points]
Naive Bayes.
(a) [9 points] Let (x[1], x[2], x[3]) denote the attributes of a sample x and y denote
the class label. For the sample x = (x[1], x[2], x[3]), the Naive Bayes classifier
predicts its label ŷ as:
ŷ = argmax_k P(y = k) · P(x[1] | y = k) · P(x[2] | y = k) · P(x[3] | y = k),
using the naive assumption that the attributes are conditionally independent given the class.
(b) [11 points] Consider the training data in the following table, where each sample
consists of two features x[1], x[2] and one label y. Specifically, the attributes
x[1] ∈ {1, 2, 3}, x[2] ∈ {S, M, L} and the label y ∈ {0, 1}. Learn a Naive Bayes
classifier from the training data and then predict the label of an unseen sample
xu = (1, S) with this learned Naive Bayes classifier.
ID    1  2  3  4  5  6  7  8  9  10  11  12  13  14  15
x[1]  1  1  1  1  1  2  2  2  2  2   3   3   3   3   3
x[2]  S  M  M  S  S  S  M  M  L  L   L   M   M   L   L
y     0  0  1  1  0  0  0  1  1  1   1   1   1   1   0
Solution: From the table, the class-conditional probabilities and priors are
P(x[1] = 1 | y = 0) = 1/2,  P(x[1] = 2 | y = 0) = 1/3,  P(x[1] = 3 | y = 0) = 1/6
P(x[1] = 1 | y = 1) = 2/9,  P(x[1] = 2 | y = 1) = 1/3,  P(x[1] = 3 | y = 1) = 4/9
P(x[2] = S | y = 0) = 1/2,  P(x[2] = M | y = 0) = 1/3,  P(x[2] = L | y = 0) = 1/6
P(x[2] = S | y = 1) = 1/9,  P(x[2] = M | y = 1) = 4/9,  P(x[2] = L | y = 1) = 4/9
P(y = 0) = 2/5,  P(y = 1) = 3/5
For the unseen sample xu = (1, S),
P(xu[1] = 1, xu[2] = S | y = 0) P(y = 0) = P(xu[1] = 1 | y = 0) P(xu[2] = S | y = 0) P(y = 0) = 1/2 × 1/2 × 2/5 = 1/10
P(xu[1] = 1, xu[2] = S | y = 1) P(y = 1) = P(xu[1] = 1 | y = 1) P(xu[2] = S | y = 1) P(y = 1) = 2/9 × 1/9 × 3/5 = 2/135
Since 1/10 > 2/135, the Naive Bayes classifier predicts ŷ = 0.
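All of these fractions can be recomputed from the training table with exact arithmetic (the helper names are ours):

    # Recompute the Naive Bayes quantities from the training table.
    from fractions import Fraction

    x1 = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
    x2 = list("SMMSSSMMLLLMMLL")
    y  = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0]

    def cond(feature, value, label):
        # P(feature = value | y = label), estimated by counting.
        num = sum(1 for f, l in zip(feature, y) if f == value and l == label)
        return Fraction(num, sum(1 for l in y if l == label))

    prior = lambda label: Fraction(sum(1 for l in y if l == label), len(y))

    # Unseen sample xu = (1, S)
    score0 = cond(x1, 1, 0) * cond(x2, "S", 0) * prior(0)
    score1 = cond(x1, 1, 1) * cond(x2, "S", 1) * prior(1)
    print(score0, score1)               # 1/10 2/135
    print(0 if score0 > score1 else 1)  # predicts 0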
6. [20 points]
Consider a linearly separable dataset consisting of samples from two classes (circles
with label 1 and squares with label −1) as shown in Figure 1. The dataset can be
formally represented as {(xi, yi)}, i = 1, . . . , N, where xi is the data point and yi ∈ {−1, +1} is
the associated label. A linear decision boundary (a hyperplane) can be represented as
wᵀx + b = 0. By varying the values of w and b, we can obtain different hyperplanes
that perfectly separate the samples from the two classes. For example, both
hyperplanes B1 and B2 perfectly separate the samples from the two classes.
Figure 1: Two hyperplanes, B1 and B2, that can perfectly separate the two classes.
(a) [4 points] Both the decision boundaries B1 and B2 perfectly separate the data
samples from the two classes. Which boundary is better? Provide the reasons.
Solution: Boundary B1 is better, as it has a larger margin; a larger margin makes the classifier more robust to perturbations of the data and tends to generalize better to unseen samples.
(b) [10 points] Let us focus on one decision boundary, say B1 . Assume that the
decision boundary B1 can be represented as wT x + b = 0 for some w and b.
i. [5 points] Show that w is orthogonal to the decision boundary B1 .
Solution: Let v be an arbitrary vector lying in this plane; then there exist
two points xs and xe on the plane such that v = xs − xe. Since both points lie on the plane, wᵀxs + b = 0 and wᵀxe + b = 0, hence
wᵀv = wᵀ(xs − xe) = (wᵀxs + b) − (wᵀxe + b) = 0.
Therefore w is orthogonal to every direction in the decision boundary B1.
ii. [5 points] Note that the decision boundary B1 perfectly separates samples
from the two classes. Specifically, for any positive sample x, we have w⊤ x +
b > 0. Show that yi (w⊤ xi + b) > 0 holds for any sample {xi , yi } in the
dataset.
Solution: For any positive sample {xi, yi}, we have yi = 1 and w⊤xi + b > 0, hence yi(w⊤xi + b) > 0 holds for all positive samples. Similarly, for
any negative sample {xi, yi}, we have yi = −1 and w⊤xi + b < 0, so
yi(w⊤xi + b) > 0 also holds for all negative samples.
(c) [6 points] The (hard-margin) Support Vector Machine (SVM) can be formulated
as the following constrained optimization problem:
min_{w,b}  ∥w∥² / 2                                  (1)
s.t.  yi(wᵀxi + b) ≥ 1,  i = 1, 2, . . . , N.
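A minimal sketch of this formulation in practice: a linear soft-margin SVM with a very large penalty C approximates the hard-margin problem (the toy dataset below is ours, not the one in Figure 1, and scikit-learn is assumed to be available):

    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.0],   # circles, label +1
                  [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])  # squares, label -1
    y = np.array([1, 1, 1, -1, -1, -1])

    clf = SVC(kernel="linear", C=1e6).fit(X, y)          # large C ~ hard margin
    w, b = clf.coef_[0], clf.intercept_[0]

    # Constraint (1) holds for every sample (up to solver tolerance),
    # and the width of the margin is 2 / ||w||.
    print(y * (X @ w + b) >= 1 - 1e-6)
    print(2 / np.linalg.norm(w))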
7. [Bonus: 8 points]
Answer True or False for the following questions about the three distinct approaches
(probabilistic generative models, probabilistic discriminative models, and discriminant
functions) for modeling classification.
(a) Probabilistic Discriminative models aim to model the class-conditional den-
sities p(x|y = k) and prior probabilities p(y = k), and then compute p(y = k|x)
using Bayes’ theorem.
Solution: False. This describes probabilistic generative models; discriminative models model the posterior p(y = k|x) directly.
(c) Probabilistic Generative models typically require more parameters than Proba-
bilistic Discriminative models.
Solution: True. A generative model must describe the full input distribution; for example, Gaussian class-conditional densities in M dimensions require on the order of M² covariance parameters, whereas a discriminative model such as logistic regression needs only M + 1.
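A rough parameter count behind (c), assuming a binary problem in M dimensions with a shared-covariance Gaussian generative model (this accounting is ours, following the standard textbook comparison):

    # Compare parameter counts: Gaussian generative vs. logistic regression.
    def gaussian_generative_params(M):
        # two class means, one shared symmetric covariance matrix, one prior
        return 2 * M + M * (M + 1) // 2 + 1

    def logistic_regression_params(M):
        # one weight per dimension plus a bias
        return M + 1

    for M in (2, 10, 100):
        print(M, gaussian_generative_params(M), logistic_regression_params(M))
    # The generative count grows quadratically in M, the discriminative count linearly.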