2019-20-I MS Key
Instructions:
1. This question paper contains 3 sheets (6 sides of paper). Please verify.
2. Write your name, roll number, department in block letters neatly with ink on each page of this question paper.
3. If you don’t write your name and roll number on all pages, pages may get lost when we unstaple the paper for scanning.
4. Write your final answers neatly with a blue/black pen. Pencil marks may get smudged.
5. Don’t overwrite/scratch answers especially in MCQ and T/F. We will entertain no requests for leniency.
Q1. Write T or F for True/False (write only in the box on the right hand side) (10x2=20 marks)
1. When using kNN to do classification, using a large value of k always gives better performance since more training points are used to decide the label of the test point. Ans: F
2. Cross validation means taking a small subset of the test data and using it to get an estimate of how well our algorithm will perform on the entire test dataset. Ans: F
3. The EM algorithm does not require a careful initialization of model parameters since it anyway considers all possible assignments of latent variables with different weights. Ans: F
4. If 𝑋 and 𝑌 are two real-valued random variables such that Cov(𝑋, 𝑌) < 0, then at least one of 𝑋 or 𝑌 must have negative variance, i.e. either 𝕍𝑋 < 0 or 𝕍𝑌 < 0. Ans: F
5. If 𝐚 ∈ ℝ² is a constant vector and 𝑓: ℝ² → ℝ is such that 𝑔(𝐱) = 𝑓(𝐱) + 𝐚^⊤𝐱 is a non-convex function, then ℎ(𝐱) = 𝑓(𝐱) − 𝐚^⊤𝐱 must be a non-convex function too. Ans: T
6. The SVM is so named because the decision boundary of the SVM classifier passes through the data points which are marked as being support vectors. Ans: F
7. Suppose 𝑋 is a real-valued random variable with variance 𝕍𝑋 = 9. Then the random variable 𝑌 defined as 𝑌 = 𝑋 − 2 will always satisfy 𝕍𝑌 = 𝕍𝑋 − 2² = 5. Ans: F
8. The LwP algorithm for binary classification always gives a linear decision boundary if we use one prototype per class and Euclidean distance to measure distances. Ans: T
9. If 𝑓, 𝑔: ℝ² → ℝ are two non-convex functions, then the function ℎ: ℝ² → ℝ defined as ℎ(𝐱) = 𝑓(𝐱) + 𝑔(𝐱) must always be non-convex too. Ans: F
10. If we learn models {𝐰_c}_{c=1}^C for multiclassification using the Crammer-Singer loss function, these models can be used to assign a PMF over the class labels [C]. Ans: T
where we break ties (when both terms on the RHS are equal) arbitrarily, say in favour of 1. We
may choose to break ties any way we wish since sign(0) is not cleanly defined and does not
matter in calculations.
2.2 Derive an expression for ℙ[𝐰 | 𝐮, 𝐳, 𝑋] using a standard Gaussian prior ℙ[𝐰] = 𝒩(𝟎, 𝐼_d). Then derive an expression for the MAP estimate for 𝐰 i.e. arg max_{𝐰∈ℝ^d} ℙ[𝐰 | 𝐮, 𝐳, 𝑋] (here we are using the shorthand notation 𝑋 = [𝐱^1, …, 𝐱^n]^⊤ ∈ ℝ^{n×d}, 𝐮 = [u^1, …, u^n] ∈ ℝ^n, 𝐳 = [z^1, …, z^n] ∈ ℝ^n).
Using independence, the Bayes rule, and ignoring proportionality constants as before gives us
ℙ[𝐰 | 𝐮, 𝐳, 𝑋] ∝ ℙ[𝐮 | 𝐰, 𝐳, 𝑋] ⋅ ℙ[𝐰] ∝ exp(−½‖𝐰‖₂²) ⋅ ∏_{i=1}^n exp(−½ (u^i z^i − 𝐰^⊤𝐱^i)²)
Note that the expression for ℙ[u^i | z^i, 𝐱^i, 𝐰] is available to us from the question text itself.
Taking negative logarithms as usual (and dropping the common factor of ½) gives us
𝐰̂_MAP = arg min_{𝐰∈ℝ^d} ‖𝐰‖₂² + ∑_{i=1}^n (u^i z^i − 𝐰^⊤𝐱^i)²
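Although the question only asks for the arg min expression, setting the gradient of this objective to zero gives a closed form that is convenient for part 2.3 below (here 𝐮 ⊙ 𝐳 denotes the vector with entries u^i z^i):
∇_𝐰 [‖𝐰‖₂² + ∑_{i=1}^n (u^i z^i − 𝐰^⊤𝐱^i)²] = 𝟎 ⇒ (𝐼_d + 𝑋^⊤𝑋) 𝐰 = 𝑋^⊤(𝐮 ⊙ 𝐳) ⇒ 𝐰̂_MAP = (𝐼_d + 𝑋^⊤𝑋)^{−1} 𝑋^⊤(𝐮 ⊙ 𝐳)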
2.3 Using the above derivations, give the pseudocode (as we write in lecture slides i.e. not
necessarily Python code or C code but sufficient details of the algorithm updates) for an
alternating optimization algorithm for estimating the model 𝐰 in the presence of the latent
variables. Give precise update expressions in your pseudocode and not just vague statements.
Q3 We have seen that algorithms such as the EM require weighted optimization problems to be
solved where different data points may have different weights. Consider the following problem
of L2 regularized squared hinge loss minimization but with different weights per data point. The
data points are 𝐱 𝑖 ∈ ℝ𝑑 and the labels are 𝑦 𝑖 ∈ {−1,1}. The weights 𝑞𝑖 are all known (i.e. are
constants) and are all strictly positive i.e. 𝑞𝑖 > 0, 𝑞𝑖 ≠ 0 for all 𝑖 = 1, … , 𝑛 (3+2+5=10 marks)
arg min_{𝐰∈ℝ^d} ½‖𝐰‖₂² + ∑_{i=1}^n q_i ⋅ ([1 − y^i ⋅ 𝐰^⊤𝐱^i]₊)²
3.1 As we did in assignment 1, rewrite the above problem as an equivalent problem that has
inequality constraints in it (the above problem does not have any constraints).
arg min_{𝐰∈ℝ^d, 𝛏∈ℝ^n} ½‖𝐰‖₂² + ∑_{i=1}^n q_i ⋅ ξ_i²   such that y^i ⋅ 𝐰^⊤𝐱^i ≥ 1 − ξ_i for all i ∈ [n]
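A one-line check of the equivalence (not demanded by the question): for any fixed 𝐰, the constraint forces ξ_i ≥ 1 − y^i ⋅ 𝐰^⊤𝐱^i, and since q_i > 0 the objective drives each ξ_i² as small as possible, so at the optimum ξ_i = [1 − y^i ⋅ 𝐰^⊤𝐱^i]₊, recovering the original unconstrained objective.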
3.3 Simplify the dual by eliminating the primal variables and write down the expression for the
simplified dual. Show only brief derivations.
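The Lagrangian from part 3.2 is not reproduced in this extract; under the standard construction, with a multiplier α_i ≥ 0 for each constraint above, it would read
ℒ(𝐰, 𝛏, 𝛂) = ½‖𝐰‖₂² + ∑_{i=1}^n q_i ξ_i² + ∑_{i=1}^n α_i (1 − ξ_i − y^i ⋅ 𝐰^⊤𝐱^i)
and the dual problem is max_{𝛂≥𝟎} min_{𝐰,𝛏} ℒ(𝐰, 𝛏, 𝛂).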
Applying first order optimality to the inner unconstrained optimization problem gives us:
∂ℒ/∂𝐰 = 𝟎 ⇒ 𝐰 = ∑_{i=1}^n α_i y^i ⋅ 𝐱^i
∂ℒ/∂ξ_i = 0 ⇒ ξ_i = α_i / (2q_i)
Putting these in the dual expression gives us the following simplified dual problem
max_{𝛂≥𝟎} { 𝛂^⊤𝟏 − ½ 𝛂^⊤(𝑄 + 𝐷)𝛂 }
where 𝑄 is an n×n matrix with Q_ij = y^i y^j ⟨𝐱^i, 𝐱^j⟩ and 𝐷 is an n×n diagonal matrix with D_ii = 1/(2q_i) and D_ij = 0 if i ≠ j.
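The simplified dual can also be checked numerically; a minimal projected gradient ascent sketch (the solver choice, step size, and names are illustrative and not part of the key):

```python
import numpy as np

def solve_weighted_squared_hinge_dual(X, y, q, n_iters=2000):
    """Sketch: maximize alpha^T 1 - 0.5 alpha^T (Q + D) alpha subject to alpha >= 0,
    with Q_ij = y^i y^j <x^i, x^j> and D = diag(1 / (2 q_i)), as derived in Q3.3."""
    Yx = y[:, None] * X                            # rows are y^i x^i
    A = Yx @ Yx.T + np.diag(1.0 / (2.0 * q))       # A = Q + D
    eta = 1.0 / np.linalg.norm(A, 2)               # step size from the largest eigenvalue of A
    alpha = np.zeros(X.shape[0])
    for _ in range(n_iters):
        grad = 1.0 - A @ alpha                     # gradient of the concave dual objective
        alpha = np.maximum(alpha + eta * grad, 0.0)  # ascent step, then project onto alpha >= 0
    w = X.T @ (alpha * y)                          # recover the primal: w = sum_i alpha_i y^i x^i
    return w, alpha
```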
Q4 Recall the uniform distribution over an interval [𝑎, 𝑏] ⊂ ℝ where 𝑎 < 𝑏. Just two parameters,
namely 𝑎, 𝑏, are required to define this distribution (no restrictions on 𝑎, 𝑏 being positive/non-
zero etc, just that we must have 𝑎 < 𝑏. Note this implies 𝑎 ≠ 𝑏). The PDF of this distribution is
ℙ[x | a, b] = 𝒰(x; a, b) ≜ 1/(b − a) if x ∈ [a, b], and 0 otherwise (i.e. if x < a or x > b)
Given n independent samples x^1, …, x^n ∈ ℝ (assume w.l.o.g. that not all samples are the same number) we wish to learn a uniform distribution as a generative distribution from these samples using the MLE technique, i.e. we wish to find
arg max_{a<b, a≠b} ℙ[x^1, …, x^n | a, b]
Give a brief derivation for, and the final values of, â_MLE and b̂_MLE. (5+5=10 marks)
Using independence, we have arg max_{a<b, a≠b} ℙ[x^1, …, x^n | a, b] = arg max_{a<b, a≠b} ∏_{i=1}^n 𝒰(x^i; a, b)
Now, suppose we have a pair (a, b) such that for some i ∈ [n] we have x^i ∉ [a, b]. Then 𝒰(x^i; a, b) = 0 and as a result ℙ[x^1, …, x^n | a, b] = 0 too! This means that if we denote m ≜ min_i x^i and M ≜ max_i x^i, then we must have a ≤ m and b ≥ M to get a non-zero value of the likelihood function, i.e. we need to solve
arg max_{a≤m, b≥M} ∏_{i=1}^n 𝒰(x^i; a, b) = arg max_{a≤m, b≥M} (1/(b − a))^n
The above is maximized for the smallest value of b − a which, subject to the constraints, is achieved exactly at a = m, b = M. Thus, we have
â_MLE = min_i x^i and b̂_MLE = max_i x^i
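A tiny numerical illustration of this result (the sample values below are made up purely for demonstration):

```python
import numpy as np

x = np.array([2.3, -0.7, 1.1, 4.2, 0.5])   # hypothetical samples x^1, ..., x^n
a_mle, b_mle = x.min(), x.max()             # a_hat_MLE = min_i x^i, b_hat_MLE = max_i x^i
print(a_mle, b_mle)                         # prints: -0.7 4.2
```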
Q5. Fill the circle (don’t tick) next to all the correct options (many may be correct). (2x3=6 marks)
5.1 The use of the Laplace (aka Laplacian) prior and Laplace (aka Laplacian) likelihood results in a
MAP problem that requires us to solve an optimization problem whose objective function is
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -- END OF EXAM - - - - -- - - - - - - - - - - - - - - - - - - - - - - - -