
CS 771A: Introduction to Machine Learning Midsem Exam (15 Sep 2019)

Name: SAMPLE SOLUTIONS    (80 marks)
Roll No:    Dept.:

Instructions:
1. This question paper contains 3 pages (6 sides of paper). Please verify.
2. Write your name, roll number, department in block letters neatly with ink on each page of this question paper.
3. If you don’t write your name and roll number on all pages, pages may get lost when we unstaple the paper for scanning.
4. Write your final answers neatly with a blue/black pen. Pencil marks may get smudged.
5. Don’t overwrite/scratch answers, especially in MCQ and T/F questions. We will entertain no requests for leniency.

Q1. Write T or F for True/False (write only in the box on the right-hand side). (10x2=20 marks)

1. When using kNN to do classification, using a large value of k always gives better performance since more training points are used to decide the label of the test point. (F)
2. Cross validation means taking a small subset of the test data and using it to get an estimate of how well our algorithm will perform on the entire test dataset. (F)
3. The EM algo does not require a careful initialization of model parameters since it anyway considers all possible assignments of latent variables with different weights. (F)
4. If 𝑋 and 𝑌 are two real-valued random variables such that Cov(𝑋, 𝑌) < 0, then at least one of 𝑋 or 𝑌 must have negative variance, i.e. either 𝕍𝑋 < 0 or 𝕍𝑌 < 0. (F)
5. If 𝐚 ∈ ℝ^2 is a constant vector and 𝑓: ℝ^2 → ℝ is such that 𝑔(𝐱) = 𝑓(𝐱) + 𝐚^⊤𝐱 is a non-convex function, then ℎ(𝐱) = 𝑓(𝐱) − 𝐚^⊤𝐱 must be a non-convex function too. (T)
6. The SVM is so named because the decision boundary of the SVM classifier passes through the data points which are marked as being support vectors. (F)
7. Suppose 𝑋 is a real-valued random variable with variance 𝕍𝑋 = 9. Then the random variable 𝑌 defined as 𝑌 = 𝑋 − 2 will always satisfy 𝕍𝑌 = 𝕍𝑋 − 2^2 = 5. (F)
8. The LwP algorithm for binary classification always gives a linear decision boundary if we use one prototype per class and Euclidean distance to measure distances. (T)
9. If 𝑓, 𝑔: ℝ^2 → ℝ are two non-convex functions, then the function ℎ: ℝ^2 → ℝ defined as ℎ(𝐱) = 𝑓(𝐱) + 𝑔(𝐱) must always be non-convex too. (F)
10. If we learn models {𝐰^𝑐} for 𝑐 = 1, …, 𝐶 for multiclassification using the Crammer-Singer loss function, these models can be used to assign a PMF over the class labels [𝐶]. (T)
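As an illustration of statement 7, here is a quick NumPy check that shifting a random variable by a constant leaves its variance unchanged; the sample size and scaling are arbitrary choices for this sketch.

```python
import numpy as np

# Numeric check of statement 7: variance is unaffected by an additive shift.
rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal(1_000_000)   # a sample whose variance is close to 9
y = x - 2.0                                # the shifted variable Y = X - 2
print(x.var(), y.var())                    # both are approximately 9, not 5
```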

Q2. Phase retrieval is used in X-ray crystallography. Let 𝐱^𝑖 ∈ ℝ^𝑑, 𝑖 ∈ [𝑛] be features and 𝑦^𝑖 ∈ ℝ be labels. All data points are independent. However, we only get to see the absolute value of the labels, i.e. the train data is {(𝐱^𝑖, 𝑢^𝑖)} for 𝑖 ∈ [𝑛], where 𝑢^𝑖 = |𝑦^𝑖|. Let 𝑧^𝑖 ∈ {−1, 1} be latent variables for the missing label signs (aka phases). Use the data likelihood function ℙ[𝑢^𝑖 | 𝑧^𝑖, 𝐱^𝑖, 𝐰] = 𝒩(𝑢^𝑖 𝑧^𝑖; 𝐰^⊤𝐱^𝑖, 1). Note that this is a discriminative setting (i.e. the 𝐱^𝑖 are constants). Expressions in your answers may contain unspecified normalization constants. Give only brief derivations. (8+6+6=20 marks)

2.1 Assuming ℙ[𝑧^𝑖 = 𝑐 | 𝐱^𝑖, 𝐰] = ℙ[𝑧^𝑖 = 𝑐] = 0.5 for 𝑐 ∈ {−1, 1} (i.e. a uniform prior on 𝑧^𝑖 that does not depend on features or model), derive an expression for ℙ[𝑧^𝑖 = 1 | 𝑢^𝑖, 𝐱^𝑖, 𝐰]. Using this, derive an expression for the MAP estimate

$$\underset{c \in \{-1,+1\}}{\arg\max}\ \mathbb{P}[z^i = c \mid u^i, \mathbf{x}^i, \mathbf{w}]$$
Applying Bayes' rule and ℙ[𝑧^𝑖 = 1 | 𝐱^𝑖, 𝐰] = 0.5, we have (omitting normalization constants)

$$\mathbb{P}[z^i = 1 \mid u^i, \mathbf{x}^i, \mathbf{w}] = \frac{\mathbb{P}[u^i \mid z^i = 1, \mathbf{x}^i, \mathbf{w}] \cdot \mathbb{P}[z^i = 1 \mid \mathbf{x}^i, \mathbf{w}]}{\mathbb{P}[u^i \mid \mathbf{x}^i, \mathbf{w}]} \propto \exp\left(-\frac{(u^i - \mathbf{w}^\top \mathbf{x}^i)^2}{2}\right)$$

Similarly, we have

$$\mathbb{P}[z^i = -1 \mid u^i, \mathbf{x}^i, \mathbf{w}] \propto \exp\left(-\frac{(-u^i - \mathbf{w}^\top \mathbf{x}^i)^2}{2}\right)$$

This tells us that we should set 𝑧^𝑖 to whichever value leads to the smaller residual error. A nice way of saying this is

$$\underset{c \in \{-1,+1\}}{\arg\max}\ \mathbb{P}[z^i = c \mid u^i, \mathbf{x}^i, \mathbf{w}] = \operatorname{sign}\left(\left|-u^i - \mathbf{w}^\top \mathbf{x}^i\right| - \left|u^i - \mathbf{w}^\top \mathbf{x}^i\right|\right)$$

where we break ties (when both terms on the RHS are equal) arbitrarily, say in favour of 1. We may choose to break ties any way we wish since sign(0) is not cleanly defined and does not matter in these calculations.
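As an aside, when 𝑢^𝑖 > 0 this rule simplifies further, since

$$\left|-u^i - \mathbf{w}^\top\mathbf{x}^i\right| - \left|u^i - \mathbf{w}^\top\mathbf{x}^i\right| > 0 \iff \left(u^i + \mathbf{w}^\top\mathbf{x}^i\right)^2 > \left(u^i - \mathbf{w}^\top\mathbf{x}^i\right)^2 \iff 4\,u^i\,\mathbf{w}^\top\mathbf{x}^i > 0,$$

so the MAP estimate of the sign is simply sign(𝐰^⊤𝐱^𝑖) whenever 𝑢^𝑖 > 0.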

2.2 Derive an expression for ℙ[𝐰 | 𝐮, 𝐳, 𝑋] using a standard Gaussian prior ℙ[𝐰] = 𝒩(𝟎, 𝐼_𝑑). Then derive an expression for the MAP estimate for 𝐰, i.e. arg max_{𝐰∈ℝ^𝑑} ℙ[𝐰 | 𝐮, 𝐳, 𝑋] (here we are using the shorthand notation 𝑋 = [𝐱^1, …, 𝐱^𝑛]^⊤ ∈ ℝ^(𝑛×𝑑), 𝐮 = [𝑢^1, …, 𝑢^𝑛] ∈ ℝ^𝑛, 𝐳 = [𝑧^1, …, 𝑧^𝑛] ∈ ℝ^𝑛).

Using independence, Bayes' rule, and ignoring proportionality constants as before gives us

$$\mathbb{P}[\mathbf{w} \mid \mathbf{u}, \mathbf{z}, X] \propto \mathbb{P}[\mathbf{u} \mid \mathbf{w}, \mathbf{z}, X] \cdot \mathbb{P}[\mathbf{w}] \propto \exp\left(-\frac{1}{2}\|\mathbf{w}\|_2^2\right) \cdot \prod_{i=1}^{n} \exp\left(-\frac{(u^i z^i - \mathbf{w}^\top \mathbf{x}^i)^2}{2}\right)$$

Note that the expression for ℙ[𝑢^𝑖 | 𝑧^𝑖, 𝐱^𝑖, 𝐰] is available to us from the question text itself. Taking logarithms as usual gives us

$$\hat{\mathbf{w}}_{\text{MAP}} = \underset{\mathbf{w} \in \mathbb{R}^d}{\arg\min}\ \|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} \left(u^i z^i - \mathbf{w}^\top \mathbf{x}^i\right)^2$$

Applying first-order optimality and using the shorthand 𝑣^𝑖 = 𝑢^𝑖 𝑧^𝑖 and 𝐯 = [𝑣^1, …, 𝑣^𝑛] ∈ ℝ^𝑛,

$$\hat{\mathbf{w}}_{\text{MAP}} = (X^\top X + I_d)^{-1} X^\top \mathbf{v}$$
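To spell out that last step: differentiating the objective with respect to 𝐰 and setting the gradient to zero gives

$$2\mathbf{w} + \sum_{i=1}^{n} 2\left(\mathbf{w}^\top\mathbf{x}^i - v^i\right)\mathbf{x}^i = \mathbf{0} \;\Longleftrightarrow\; (X^\top X + I_d)\,\mathbf{w} = X^\top\mathbf{v},$$

which is solved by the closed form above.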

2.3 Using the above derivations, give the pseudocode (as we write in lecture slides, i.e. not necessarily Python or C code, but with sufficient details of the algorithm updates) for an alternating optimization algorithm for estimating the model 𝐰 in the presence of the latent variables. Give precise update expressions in your pseudocode and not just vague statements.

AltOpt for Phase Retrieval

1. Initialize the model 𝐰
2. For each 𝑖 ∈ [𝑛], update 𝑧^𝑖 using the current 𝐰:
   a. Let 𝑧^𝑖 = sign(|−𝑢^𝑖 − 𝐰^⊤𝐱^𝑖| − |𝑢^𝑖 − 𝐰^⊤𝐱^𝑖|)
   b. Break ties arbitrarily
3. Update 𝐰 using the current {𝑧^𝑖}:
   a. Let 𝑣^𝑖 = 𝑢^𝑖 𝑧^𝑖 and 𝐯 = [𝑣^1, …, 𝑣^𝑛]
   b. Let 𝐰 = (𝑋^⊤𝑋 + 𝐼_𝑑)^{−1} 𝑋^⊤𝐯
4. Repeat steps 2–3 until convergence
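A minimal NumPy sketch of this alternating scheme is given below; the function name, random initialization, and fixed iteration count are illustrative choices rather than part of the official solution.

```python
import numpy as np

def alt_opt_phase_retrieval(X, u, n_iters=50, seed=0):
    """Alternating optimization for the phase retrieval model of Q2.
    X is the (n, d) feature matrix and u the (n,) vector of absolute labels u^i = |y^i|."""
    n, d = X.shape
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)                            # step 1: initialize the model
    for _ in range(n_iters):
        # step 2: update the latent signs z^i using the current w (MAP rule from 2.1)
        s = X @ w
        z = np.sign(np.abs(-u - s) - np.abs(u - s))
        z[z == 0] = 1                                     # break ties in favour of +1
        # step 3: update w using the current signs (closed form from 2.2)
        v = u * z
        w = np.linalg.solve(X.T @ X + np.eye(d), X.T @ v)
    return w
```

On synthetic data one would expect the recovered 𝐰 to match a ground-truth model only up to a global sign flip, since flipping 𝐰 together with all the signs 𝑧^𝑖 leaves the likelihood unchanged.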

Q3. We have seen that algorithms such as EM require weighted optimization problems to be solved, where different data points may have different weights. Consider the following problem of L2-regularized squared hinge loss minimization, but with different weights per data point. The data points are 𝐱^𝑖 ∈ ℝ^𝑑 and the labels are 𝑦^𝑖 ∈ {−1, 1}. The weights 𝑞_𝑖 are all known (i.e. are constants) and are all strictly positive, i.e. 𝑞_𝑖 > 0 for all 𝑖 = 1, …, 𝑛. (3+2+5=10 marks)

$$\underset{\mathbf{w} \in \mathbb{R}^d}{\arg\min}\ \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} q_i \cdot \left(\left[1 - y^i \cdot \mathbf{w}^\top \mathbf{x}^i\right]_+\right)^2$$

3.1 As we did in assignment 1, rewrite the above problem as an equivalent problem that has inequality constraints in it (the above problem does not have any constraints).

$$\underset{\mathbf{w} \in \mathbb{R}^d,\ \boldsymbol{\xi} \in \mathbb{R}^n}{\arg\min}\ \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} q_i \cdot \xi_i^2 \qquad \text{s.t.}\ \ y^i \cdot \mathbf{w}^\top\mathbf{x}^i \geq 1 - \xi_i\ \text{ for all } i \in [n]$$

Similar to what we observed in assignment 1, even in this case, including or omitting the constraints 𝜉_𝑖 ≥ 0 does not affect the solution.
3.2 Then introduce dual variables as appropriate and write down the expression for the dual problem as a max-min problem (no need to write the Lagrangian expression separately).

The Lagrangian is

$$\mathcal{L}(\mathbf{w}, \boldsymbol{\xi}, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} q_i \cdot \xi_i^2 + \sum_{i=1}^{n} \alpha_i\left(1 - \xi_i - y^i \cdot \mathbf{w}^\top\mathbf{x}^i\right)$$

Thus, the dual problem is

$$\max_{\boldsymbol{\alpha} \geq \mathbf{0}}\ \min_{\mathbf{w}, \boldsymbol{\xi}}\ \left\{ \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} q_i \cdot \xi_i^2 + \sum_{i=1}^{n} \alpha_i\left(1 - \xi_i - y^i \cdot \mathbf{w}^\top\mathbf{x}^i\right) \right\}$$

3.3 Simplify the dual by eliminating the primal variables and write down the expression for the simplified dual. Show only brief derivations.

Applying first-order optimality to the inner unconstrained optimization problem gives us:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \mathbf{0} \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n} \alpha_i y^i \cdot \mathbf{x}^i \qquad \text{and} \qquad \frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \;\Rightarrow\; \xi_i = \frac{\alpha_i}{2 q_i}$$

Substituting these back into the dual expression gives us the following simplified dual problem

$$\max_{\boldsymbol{\alpha} \geq \mathbf{0}}\ \left\{ \boldsymbol{\alpha}^\top \mathbf{1} - \frac{1}{2}\,\boldsymbol{\alpha}^\top (Q + D)\,\boldsymbol{\alpha} \right\}$$

where 𝑄 is an 𝑛 × 𝑛 matrix with 𝑄_𝑖𝑗 = 𝑦^𝑖 𝑦^𝑗 ⟨𝐱^𝑖, 𝐱^𝑗⟩ (the factors 𝛼_𝑖 𝛼_𝑗 are supplied by the quadratic form 𝛂^⊤𝑄𝛂) and 𝐷 is an 𝑛 × 𝑛 diagonal matrix with 𝐷_𝑖𝑖 = 1/(2𝑞_𝑖) and 𝐷_𝑖𝑗 = 0 if 𝑖 ≠ 𝑗.
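The exam only asks for the dual expression, but as a minimal sketch, the simplified dual can be maximized by projected gradient ascent; the function name, step size, and iteration count below are arbitrary illustrative choices.

```python
import numpy as np

def solve_weighted_dual(X, y, q, step=1e-3, n_iters=2000):
    """Projected gradient ascent on the simplified dual derived above.
    X: (n, d) features, y: (n,) labels in {-1, +1}, q: (n,) strictly positive weights."""
    n = X.shape[0]
    Yx = y[:, None] * X                               # rows are y^i * x^i
    Q = Yx @ Yx.T                                     # Q_ij = y^i y^j <x^i, x^j>
    D = np.diag(1.0 / (2.0 * q))                      # D_ii = 1 / (2 q_i)
    alpha = np.zeros(n)
    for _ in range(n_iters):
        grad = np.ones(n) - (Q + D) @ alpha           # gradient of alpha^T 1 - 0.5 alpha^T (Q + D) alpha
        alpha = np.maximum(0.0, alpha + step * grad)  # ascent step, then project onto alpha >= 0
    w = X.T @ (alpha * y)                             # recover the primal model w = sum_i alpha_i y^i x^i
    return w, alpha
```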

Q4. Recall the uniform distribution over an interval [𝑎, 𝑏] ⊂ ℝ where 𝑎 < 𝑏. Just two parameters, namely 𝑎, 𝑏, are required to define this distribution (there are no restrictions on 𝑎, 𝑏 being positive/non-zero etc., just that we must have 𝑎 < 𝑏; note this implies 𝑎 ≠ 𝑏). The PDF of this distribution is

$$\mathbb{P}[x \mid a, b] = \mathcal{U}(x; a, b) \triangleq \begin{cases} 0 & x < a \\ 1/(b - a) & x \in [a, b] \\ 0 & x > b \end{cases}$$

Given 𝑛 independent samples 𝑥^1, …, 𝑥^𝑛 ∈ ℝ (assume w.l.o.g. that not all samples are the same number), we wish to learn a uniform distribution as a generative distribution from these samples using the MLE technique, i.e. we wish to find

$$\underset{a < b,\ a \neq b}{\arg\max}\ \mathbb{P}[x^1, \ldots, x^n \mid a, b]$$

Give a brief derivation for, and the final values of, 𝑎̂_MLE and 𝑏̂_MLE. (5+5=10 marks)

Using independence, we have

$$\underset{a < b,\ a \neq b}{\arg\max}\ \mathbb{P}[x^1, \ldots, x^n \mid a, b] = \underset{a < b,\ a \neq b}{\arg\max}\ \prod_{i=1}^{n} \mathcal{U}(x^i; a, b)$$

Now, suppose we have a pair (𝑎, 𝑏) such that for some 𝑖 ∈ [𝑛] we have 𝑥^𝑖 ∉ [𝑎, 𝑏]. Then 𝒰(𝑥^𝑖; 𝑎, 𝑏) = 0 and as a result ℙ[𝑥^1, …, 𝑥^𝑛 | 𝑎, 𝑏] = 0 too! This means that if we denote 𝑚 ≜ min_𝑖 𝑥^𝑖 and 𝑀 ≜ max_𝑖 𝑥^𝑖, then we must have 𝑎 ≤ 𝑚 and 𝑏 ≥ 𝑀 to get a non-zero value of the likelihood function, i.e. we need to solve

$$\underset{a \leq m,\ b \geq M}{\arg\max}\ \prod_{i=1}^{n} \mathcal{U}(x^i; a, b) = \underset{a \leq m,\ b \geq M}{\arg\max}\ \left(\frac{1}{b - a}\right)^{n}$$

The above is maximized for the smallest value of 𝑏 − 𝑎 which, subject to the constraints, is achieved exactly at 𝑎 = 𝑚, 𝑏 = 𝑀. Thus, we have

$$\hat{a}_{\text{MLE}} = \min_i x^i \quad \text{and} \quad \hat{b}_{\text{MLE}} = \max_i x^i$$
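A tiny numeric illustration of this result; the interval endpoints and sample size below are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, b_true = -2.0, 5.0                 # hypothetical ground-truth interval
x = rng.uniform(a_true, b_true, size=1000)

# MLE for the uniform distribution: the tightest interval that still contains every sample
a_mle, b_mle = x.min(), x.max()
print(a_mle, b_mle)                        # close to (and strictly inside) [-2, 5]
```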

Q5. Fill the circle (don’t tick) next to all the correct options (many may be correct). (2x3=6 marks)
5.1 The use of the Laplace (aka Laplacian) prior and Laplace (aka Laplacian) likelihood results in a
MAP problem that requires us to solve an optimization problem whose objective function is

A Always convex and always differentiable
B Always convex but possibly non-differentiable
C Possibly non-convex but always differentiable
D Always non-convex and always non-differentiable
5.2 In probabilistic multiclassification with 𝐶 classes, if for a test data point, the ML algorithm
predicts a PMF over the classes with an extremely small variance, then it means that
A The mode of that PMF should have a probability value much larger than 0
B The mode of that PMF should have a probability value very close to 0
C The ML algorithm is very confident about its prediction on that data point
D The ML algorithm is very unsure about its prediction on that data point
Q6. Nadal and Federer have played a total of 80 matches, of which Nadal won 50 and Federer won 30. They have played on three types of courts: clay, grass, and hard. Among the matches Nadal won, 70% were played on clay courts, 4% on grass courts and the rest on hard courts. Federer has won a 15/120 fraction of the matches played on clay courts, a 96/120 fraction of the matches played on grass courts, and a 68/120 fraction of the matches played on hard courts. What is the number of matches that the two players have played on each of the three types of courts? (3x2=6 marks)

Clay ( 40 ) Grass ( 10 ) Hard ( 30 )
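One way to arrive at these counts: Nadal's 50 wins split as 35 on clay (70%), 2 on grass (4%) and 13 on hard courts (the rest). If 𝐶, 𝐺, 𝐻 denote the total number of matches on clay, grass and hard courts respectively, then Federer's wins account for the remaining matches on each surface, so

$$35 + \tfrac{15}{120}C = C \;\Rightarrow\; C = 40, \qquad 2 + \tfrac{96}{120}G = G \;\Rightarrow\; G = 10, \qquad 13 + \tfrac{68}{120}H = H \;\Rightarrow\; H = 30.$$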


Q7 Let 𝑋 be a discrete random variable with support {−1,0,1}. Find a PMF for 𝑋 for which 𝑋 has
the highest possible variance. What value of variance do you get in this case? Repeat the analysis
(i.e. give the highest variance PMF as well as the variance value) when 𝑋 is a Rademacher random
variable i.e. has support only over {−1,1}. Justify all your answers briefly. (3+1+3+1=8 marks)
Suppose the PMF assigns probability values 𝑝−1, 𝑝0, 𝑝1 to the support elements. Then we have 𝔼𝑋 = 𝑝1 − 𝑝−1 and 𝕍𝑋 = 𝔼[𝑋^2] − (𝔼𝑋)^2 = (𝑝1 + 𝑝−1) − (𝑝1 − 𝑝−1)^2. Now, whereas we could go all Lagrangian on this problem and solve it by brute force, a more careful look at the problem gives results more readily.

The largest (perhaps unachievably so) value of the last expression is achieved when 𝑝1 + 𝑝−1 takes on its largest value (which is 1, since 𝑝−1 + 𝑝0 + 𝑝1 = 1 and 𝑝0 ≥ 0) and (𝑝1 − 𝑝−1)^2 takes on its smallest value (which is 0, since the square of a real number can never be negative). Thus, we must not expect a result better than 𝕍𝑋 = 1.

However, the above can actually be achieved: (𝑝1 − 𝑝−1)^2 = 0 when 𝑝1 = 𝑝−1, and we can simultaneously ensure 𝑝1 + 𝑝−1 = 1 by setting 𝑝1 = 𝑝−1 = 0.5. Thus, at the PMF {𝑝−1 = 0.5, 𝑝0 = 0, 𝑝1 = 0.5}, the random variable attains the highest possible variance of 𝕍𝑋 = 1.

For the Rademacher case, the solution is readily seen to be {𝑝−1 = 0.5, 𝑝1 = 0.5} with 𝕍𝑋 = 1, since the solution obtained in the first case is exactly a Rademacher random variable once we restrict attention to the support elements that are assigned non-zero probability.
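A short numeric sanity check of this claim; the comparison PMFs below are arbitrary choices.

```python
import numpy as np

support = np.array([-1.0, 0.0, 1.0])
best = np.array([0.5, 0.0, 0.5])                    # the PMF claimed to maximize variance
others = [np.array([0.4, 0.2, 0.4]),                # two arbitrary comparison PMFs
          np.array([0.7, 0.0, 0.3])]

def variance(p):
    mean = np.dot(p, support)
    return np.dot(p, support**2) - mean**2

print(variance(best))                               # 1.0
print([variance(p) for p in others])                # 0.8 and 0.84, both below 1
```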

------------------------------------ END OF EXAM ------------------------------------