
CS 675 Introduction to Machine Learning (Spring 2022): Midterm Exam

Maximum Points to Gain: 100


SOLUTIONS
Name:

1. [20 points]

(a) [10 points] You have a new box containing 8 apples and 4 oranges and an old
box containing 10 apples and 2 oranges. You select a box at random with equal
probability, select an item from that box uniformly at random, and find it is an
apple. Use Bayes' theorem to find the probability that the apple came from the
old box.
Solution:

$$p(o \mid a) = \frac{p(a \mid o)\,p(o)}{p(a)} = \frac{(10/12)(1/2)}{(10/12)(1/2) + (8/12)(1/2)} = \frac{5}{9}$$
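As a quick numerical sanity check of this result (a minimal sketch in Python; the counts come from the problem statement, the variable names are illustrative):

    # Verify p(old | apple) = 5/9 using the counts from part (a)
    p_old, p_new = 0.5, 0.5          # each box is chosen with equal probability
    p_apple_old = 10 / 12            # old box: 10 apples out of 12 items
    p_apple_new = 8 / 12             # new box: 8 apples out of 12 items

    p_apple = p_apple_old * p_old + p_apple_new * p_new
    print(p_apple_old * p_old / p_apple)   # 0.5555... = 5/9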

(b) [10 points]


Consider two data sets D1 and D2, each consisting of scalar measurements:
xi, i = 1, . . . , N1 for D1 and xj, j = 1, . . . , N2 for D2. Assume that each set
of measurements comes from a Gaussian distribution. The two Gaussians share a
common variance σ2, and the mean µ2 of the Gaussian for data set D2 is known
to be three times the mean µ1 for the first data set D1, i.e., µ2 = 3µ1. The
parameters µ1, µ2, and σ2 are all assumed unknown and are to be estimated.
Define the log-likelihood for this problem in terms of µ1 and σ2.
Solution:
The likelihood of the datasets is given by:
$$p(D_1, D_2 \mid \mu_1, \mu_2, \sigma^2) = \prod_{i=1}^{N_1} \mathcal{N}(x_i \mid \mu_1, \sigma^2) \prod_{j=1}^{N_2} \mathcal{N}(x_j \mid \mu_2, \sigma^2) = \prod_{i=1}^{N_1} \mathcal{N}(x_i \mid \mu_1, \sigma^2) \prod_{j=1}^{N_2} \mathcal{N}(x_j \mid 3\mu_1, \sigma^2)$$

The log-likelihood is then given by

$$\ln p(D_1, D_2 \mid \mu_1, \mu_2, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{i=1}^{N_1} (x_i - \mu_1)^2 - \frac{N_1}{2} \ln \sigma^2 - \frac{N_1}{2} \ln(2\pi) - \frac{1}{2\sigma^2} \sum_{j=1}^{N_2} (x_j - 3\mu_1)^2 - \frac{N_2}{2} \ln \sigma^2 - \frac{N_2}{2} \ln(2\pi)$$
$$= -\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{N_1} (x_i - \mu_1)^2 + \sum_{j=1}^{N_2} (x_j - 3\mu_1)^2 \right] - \frac{N_1 + N_2}{2} \ln \sigma^2 - \frac{N_1 + N_2}{2} \ln(2\pi)$$
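The derived expression can be sanity-checked numerically; the sketch below (assuming NumPy and SciPy are available, with made-up values for mu1, sigma2, N1, and N2) compares it against a direct evaluation of the Gaussian log-densities:

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    mu1, sigma2 = 1.5, 2.0                           # illustrative parameter values
    D1 = rng.normal(mu1, np.sqrt(sigma2), 50)        # N1 = 50 samples with mean mu1
    D2 = rng.normal(3 * mu1, np.sqrt(sigma2), 80)    # N2 = 80 samples with mean 3*mu1
    N1, N2 = len(D1), len(D2)

    # Direct log-likelihood from the Gaussian densities
    direct = (norm.logpdf(D1, mu1, np.sqrt(sigma2)).sum()
              + norm.logpdf(D2, 3 * mu1, np.sqrt(sigma2)).sum())

    # Closed-form expression derived above
    closed = (-(np.sum((D1 - mu1) ** 2) + np.sum((D2 - 3 * mu1) ** 2)) / (2 * sigma2)
              - (N1 + N2) / 2 * np.log(sigma2) - (N1 + N2) / 2 * np.log(2 * np.pi))

    print(np.isclose(direct, closed))                # True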

2. [10 points]
Answer True or False for the following questions.

(a) Let us denote ∥a∥ as the norm of a vector a ∈ Rd . For any two vectors x ∈ Rd
and y ∈ Rd , we have ∥x + y∥ ≤ ∥x∥ + ∥y∥
Solution: True

(b) If λ is a non-zero eigenvalue of a square matrix A ∈ Rn×n , then for any real
positive constant c, c · λ must be an eigenvalue of A.
Solution: False

(c) For any two vectors x ∈ Rd and y ∈ Rd , we have x⊤ y = y⊤ x.


Solution: True

(d) Given two matrices A ∈ R7×6 and B ∈ R6×4 , the rank of their product AB
can be 6.
Solution: False

(e) If there exist two identical columns in a matrix A ∈ Rm×n , then the rank of
A must be smaller than n.
Solution: True
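These answers can be spot-checked numerically; a minimal sketch with randomly generated matrices (the shapes follow parts (a), (d), and (e); everything else is illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    # (a) Triangle inequality for norms
    x, y = rng.normal(size=5), rng.normal(size=5)
    print(np.linalg.norm(x + y) <= np.linalg.norm(x) + np.linalg.norm(y))  # True

    # (d) rank(AB) <= min(rank(A), rank(B)) <= 4, so it can never be 6
    A, B = rng.normal(size=(7, 6)), rng.normal(size=(6, 4))
    print(np.linalg.matrix_rank(A @ B))   # at most 4

    # (e) Duplicating a column makes the columns linearly dependent
    M = rng.normal(size=(6, 4))
    M[:, 3] = M[:, 0]                     # two identical columns
    print(np.linalg.matrix_rank(M))       # 3 < n = 4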

3. [20 points]
The loss function for the linear regression model is as follows:

$$L(w) = \|y - Xw\|_2^2$$

where X ∈ RN ×d is the data matrix with N samples and d features, y ∈ RN ×1 is the
target, and w ∈ Rd×1 is the vector of parameters to be learned.

(a) [10 points] Derive the first-order derivative of L(w) with respect to w.
Solution:

Let L(w) denote the objective to be minimized:

$$L(w) = \|y - Xw\|_2^2 = y^T y - y^T X w - w^T X^T y + w^T X^T X w$$

We calculate the derivative:

$$\frac{\partial L(w)}{\partial w} = -2 X^T y + 2 X^T X w = 2 X^T X w - 2 X^T y$$
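The closed-form gradient can be verified with a finite-difference check; this is a minimal sketch using random data (the data and step size are illustrative, not part of the exam):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d = 20, 4
    X, y, w = rng.normal(size=(N, d)), rng.normal(size=N), rng.normal(size=d)

    loss = lambda w: np.sum((y - X @ w) ** 2)      # L(w) = ||y - Xw||_2^2
    grad = 2 * X.T @ X @ w - 2 * X.T @ y           # derived closed-form gradient

    # Central finite-difference approximation of the gradient
    eps = 1e-6
    num_grad = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                         for e in np.eye(d)])
    print(np.allclose(grad, num_grad, atol=1e-4))  # True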

(b) [10 points] Discuss whether or not we can achieve sparsity on the parameters
by optimizing the following problem:

$$\min_{w} \; \|y - Xw\|_2^2 + \lambda \|y\|_1 .$$

Provide your explanations.


Solution:
No. The vector y is a constant (the observed targets), so the term λ∥y∥1 does not
depend on w. Hence, the problem

$$\min_{w} \; \|y - Xw\|_2^2 + \lambda \|y\|_1$$

is the same as

$$\min_{w} \; \|y - Xw\|_2^2 .$$

Thus, the penalty has no impact on the final solution for w and cannot induce
sparsity in the parameters.
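For contrast, penalizing the parameters themselves with an ℓ1 term (the Lasso, λ∥w∥1) does induce sparsity. A minimal sketch, assuming scikit-learn is available and using synthetic data where only two features are informative:

    import numpy as np
    from sklearn.linear_model import Lasso, LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    w_true = np.array([3.0, -2.0] + [0.0] * 8)      # only 2 informative features
    y = X @ w_true + 0.1 * rng.normal(size=100)

    # lambda * ||y||_1 is a constant, so that problem reduces to ordinary least squares
    ols = LinearRegression().fit(X, y)
    # lambda * ||w||_1 (Lasso) shrinks irrelevant coefficients exactly to zero
    lasso = Lasso(alpha=0.1).fit(X, y)

    print(np.sum(np.abs(ols.coef_) < 1e-8))    # 0 exact zeros
    print(np.sum(np.abs(lasso.coef_) < 1e-8))  # several exact zeros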

4. [10 points]
Answer True or False for the following questions about Perceptron and Logistic
Regression.

(a) The sigmoid function is able to map any real-value input to the range of (0, 1).
Solution: True

(b) Both Perceptron and Logistic Regression have linear boundaries.


Solution: True

(c) The Perceptron algorithm updates (makes changes to) the parameters only when
it encounters a mis-classified sample.
Solution: True

(d) When we adopt the Stochastic Gradient Descent (SGD) algorithm to optimize
the objective of Logistic Regression (at each step, we use a single sample to update
the parameters), SGD changes the parameters only when the current sample is
mis-classified.
Solution: False

(e) When dealing with a binary classification problem, the Perceptron algorithm can
always find a hyperplane to perfectly separate samples from two classes without
making any mistakes.
Solution: False
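The contrast between (c) and (d) follows from the two update rules. Below is a minimal sketch of a single update step for each model with labels in {−1, +1}; the function names, learning rates, and example data are illustrative assumptions:

    import numpy as np

    def perceptron_step(w, x, y, lr=1.0):
        # Perceptron: update only if the sample is misclassified (y * w^T x <= 0)
        if y * (w @ x) <= 0:
            w = w + lr * y * x
        return w

    def logistic_sgd_step(w, x, y, lr=0.1):
        # Logistic regression SGD: the log-loss gradient is nonzero even for
        # correctly classified samples, so the parameters change on (almost) every step
        sigma = 1.0 / (1.0 + np.exp(-y * (w @ x)))
        return w + lr * (1.0 - sigma) * y * x

    x, y = np.array([1.0, 2.0, -1.0]), 1
    print(perceptron_step(np.array([5.0, 5.0, 5.0]), x, y))   # unchanged: correctly classified
    print(logistic_sgd_step(np.array([5.0, 5.0, 5.0]), x, y)) # still updated (slightly)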

5. [20 points]
Naive Bayes.

(a) [9 points] Let (x[1], x[2], x[3]) denote the attributes of a sample x and y denote
the class label. For the sample x = (x[1], x[2], x[3]), the Naive Bayes classifier
predicts its label ŷ as:

$$\hat{y} = \operatorname{argmax}_y P(y \mid x[1], x[2], x[3])$$
$$\stackrel{(1)}{=} \operatorname{argmax}_y \frac{P(x[1], x[2], x[3] \mid y)\, P(y)}{P(x[1], x[2], x[3])}$$
$$\stackrel{(2)}{=} \operatorname{argmax}_y P(x[1], x[2], x[3] \mid y)\, P(y)$$
$$\stackrel{(3)}{=} \operatorname{argmax}_y P(x[1] \mid y)\, P(x[2] \mid y)\, P(x[3] \mid y)\, P(y)$$

Please explain the basis for each step of the derivation.


Solution:
Step (1): It follows directly from Bayes' theorem.
Step (2): The maximum is taken with respect to y; thus, P(x[1], x[2], x[3]) in the
denominator is a constant that does not affect the argmax and can be ignored.
Step (3): By the Naive Bayes assumption, the attributes x[1], x[2], x[3] are
conditionally independent of each other given the class label:

$$P(x[1], x[2], x[3] \mid y) = P(x[1] \mid y)\, P(x[2] \mid y)\, P(x[3] \mid y)$$

(b) [11 points] Consider the training data in the following table, where each sample
consists of two features x[1], x[2] and one label y. Specifically, the attributes
x[1] ∈ {1, 2, 3}, x[2] ∈ {S, M, L} and the label y ∈ {0, 1}. Learn a Naive Bayes
classifier from the training data and then predict the label of an unseen sample
xu = (1, S) with this learned Naive Bayes classifier.

ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
x[1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
x[2] S M M S S S M M L L L M M L L
y 0 0 1 1 0 0 0 1 1 1 1 1 1 1 0
Solution:
P(x[1] = 1 | y = 0) = 1/2    P(x[1] = 2 | y = 0) = 1/3    P(x[1] = 3 | y = 0) = 1/6
P(x[1] = 1 | y = 1) = 2/9    P(x[1] = 2 | y = 1) = 1/3    P(x[1] = 3 | y = 1) = 4/9
P(x[2] = S | y = 0) = 1/2    P(x[2] = M | y = 0) = 1/3    P(x[2] = L | y = 0) = 1/6
P(x[2] = S | y = 1) = 1/9    P(x[2] = M | y = 1) = 4/9    P(x[2] = L | y = 1) = 4/9

P(y = 0) = 2/5    P(y = 1) = 3/5

P(xu[1] = 1, xu[2] = S | y = 0) P(y = 0)
  = P(xu[1] = 1 | y = 0) P(xu[2] = S | y = 0) P(y = 0)
  = 1/2 × 1/2 × 2/5 = 1/10

P(xu[1] = 1, xu[2] = S | y = 1) P(y = 1)
  = P(xu[1] = 1 | y = 1) P(xu[2] = S | y = 1) P(y = 1)
  = 2/9 × 1/9 × 3/5 = 2/135

Since 1/10 > 2/135, the Naive Bayes classifier predicts ŷ = 0.
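The estimates and the final comparison can be reproduced directly from the table; a minimal sketch using exact fractions and plain maximum-likelihood counts (no smoothing), matching the solution above:

    from fractions import Fraction

    x1 = [1,1,1,1,1,2,2,2,2,2,3,3,3,3,3]
    x2 = list("SMMSSSMMLLLMMLL")
    y  = [0,0,1,1,0,0,0,1,1,1,1,1,1,1,0]

    def cond_prob(feature, value, label):
        # Maximum-likelihood estimate of P(feature = value | y = label)
        idx = [i for i in range(len(y)) if y[i] == label]
        return Fraction(sum(feature[i] == value for i in idx), len(idx))

    prior = {c: Fraction(y.count(c), len(y)) for c in (0, 1)}
    scores = {c: cond_prob(x1, 1, c) * cond_prob(x2, "S", c) * prior[c]
              for c in (0, 1)}
    print(scores)                        # {0: Fraction(1, 10), 1: Fraction(2, 135)}
    print(max(scores, key=scores.get))   # predicted label: 0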

6. [20 points]
Consider a linearly separable dataset consisting of samples from two classes (circles
with label 1 and squares with label −1) as shown in Figure 1. The dataset can be
formally represented as {xi , yi }, i = 1, . . . , N , where xi is the data point and yi ∈ {−1, +1} is
the associated label. A linear decision boundary (a hyperplane) can be represented as
wT x + b = 0. By varying the values of w and b, we can obtain different hyperplanes
that can perfectly separate the samples from the two classes. For example, both the
hyperplane B1 and B2 can perfectly separate the samples from the two classes.
[Figure 1: Two hyperplanes, B1 and B2, that can perfectly separate the two classes; their margins are marked by b11, b12 and b21, b22.]

(a) [4 points] Both the decision boundaries B1 and B2 perfectly separate the data
samples from the two classes. Which boundary is better? Provide the reasons.
Solution: Boundary B1 is better, as it has a larger margin.

(b) [10 points] Let us focus on one decision boundary, say B1 . Assume that the
decision boundary B1 can be represented as wT x + b = 0 for some w and b.
i. [5 points] Show that w is orthogonal to the decision boundary B1 .
Solution: Let v be an arbitrary vector lying in this plane; then there exist
two points xs and xe on the plane such that v = xs − xe . Since both points
satisfy the plane equation, wT xs + b = 0 and wT xe + b = 0. We have

wT v = wT (xs − xe ) = (wT xs + b) − (wT xe + b) = 0.

Thus, w is orthogonal to any vector in the plane, which means it is orthogonal
to the plane.

ii. [5 points] Note that the decision boundary B1 perfectly separates samples
from the two classes. Specifically, for any positive sample x, we have w⊤ x +
b > 0. Show that yi (w⊤ xi + b) > 0 holds for any sample {xi , yi } in the
dataset.
Solution: For any positive sample {xi , yi }, we have yi = 1 and w⊤ xi + b > 0,
hence yi (w⊤ xi + b) = w⊤ xi + b > 0 holds for all positive samples. Similarly, for
any negative sample {xi , yi }, we have yi = −1 and, since B1 perfectly separates
the classes, w⊤ xi + b < 0. Thus, yi (w⊤ xi + b) = −(w⊤ xi + b) > 0 also holds
for all negative samples.

(c) [6 points] The (hard-margin) Support Vector Machine (SVM) can be formulated
by the following constrained optimization problem.

$$\min_{w, b} \; \frac{\|w\|^2}{2} \qquad \text{s.t.} \quad y_i (w^T x_i + b) \ge 1, \; i = 1, 2, \ldots, N. \tag{1}$$

Explain the intuition behind this formulation. Specifically, answer the following
questions: 1) Why do we want to minimize ∥w∥2 /2 in the above optimization
problem? 2) Why do we need these constraints on all the samples in the above
optimization problem?
Solution: 1) We want to maximize the margin, and maximizing the margin is
equivalent to minimizing ∥w∥2 /2. 2) These constraints make sure that the boundary
defined by w and b perfectly separates all the samples in the dataset. They also
ensure that the minimum distance from any training sample to the decision boundary
is 1/∥w∥.
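A hedged numerical illustration of both points, assuming scikit-learn is available: a linear SVC with a very large C approximates the hard-margin SVM on well-separated synthetic data, and the minimum distance from the training points to the learned boundary comes out as 1/∥w∥ (all names and data below are illustrative):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    # Two well-separated blobs with labels -1 and +1
    X = np.vstack([rng.normal(loc=[-3, -3], scale=0.5, size=(20, 2)),
                   rng.normal(loc=[3, 3], scale=0.5, size=(20, 2))])
    y = np.array([-1] * 20 + [1] * 20)

    svm = SVC(kernel="linear", C=1e6).fit(X, y)      # large C ~ hard margin
    w, b = svm.coef_.ravel(), svm.intercept_[0]

    # Minimum distance from any training point to the decision boundary
    distances = np.abs(X @ w + b) / np.linalg.norm(w)
    print(np.isclose(distances.min(), 1 / np.linalg.norm(w), atol=1e-3))  # True
    print(np.all(y * (X @ w + b) >= 1 - 1e-6))                            # constraints satisfied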

7. [ Bonus: 8 points]
Answer True or False for the following questions about the three distinct approaches
(probabilistic generative models, probabilistic discriminative models, and discriminant
functions) for modeling classification.

(a) Probabilistic Discriminative models aim to model the class-conditional den-
sities p(x|y = k) and prior probabilities p(y = k), and then compute p(y = k|x)
using Bayes' theorem.
Solution: False

(b) Logistic Regression is an example of Probabilistic Discriminative models.


Solution: True

(c) Probabilistic Generative models typically require more parameters than Proba-
bilistic Discriminative models.
Solution: True

(d) Support Vector Machine (SVM) is an example of Probabilistic Generative
models.
Solution: False
