MS Key-4
Instructions:
1. This question paper contains 2 sheets of paper (4 sides). Please verify.
2. Write your name, roll number and department in block letters with ink on each page.
3. If you don’t do this, your pages may get lost when we unstaple your paper to scan the pages.
4. Write your final answers neatly with a blue/black pen. Pencil marks may get smudged.
5. Don’t overwrite/scratch answers, especially in MCQs – such cases will get straight 0 marks.
Q1. For the hangman problem, 5 decision tree splits at a node are given. For each split, write down
the information gain (entropy reduction) in the bold-border boxes next to the diagrams as
a single fraction or decimal number. Use logarithms with base 2 in the definition of entropy. The
numbers written in the nodes indicate how many words reached that node. (5 marks)
1.875 or 15/8
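The split diagrams themselves are not reproduced here, but as an illustration, here is a small Python helper (an assumption about the setup, not part of the key) that computes the information gain of a split under the convention that every word reaching a node is equally likely, so a node holding 𝑚 words has entropy log2 𝑚:

    import numpy as np

    def split_information_gain(child_counts):
        # child_counts[j] = number of words reaching the j-th child of the split.
        # Assumes every word at a node is equally likely, so a node with m words
        # has (base-2) entropy log2(m); the gain is the parent's entropy minus
        # the weighted average of the children's entropies.
        counts = np.asarray(child_counts, dtype=float)
        n = counts.sum()
        return np.log2(n) - np.sum((counts / n) * np.log2(counts))

    # For example, a hypothetical split of 16 words into children of sizes
    # 8, 4, 2, 1, 1 has gain 4 - (8/16*3 + 4/16*2 + 2/16*1) = 1.875 = 15/8.
    print(split_information_gain([8, 4, 2, 1, 1]))   # 1.875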
Q2. (Intriguing entropy) For a random variable 𝑋 with support {−1,1} with ℙ[𝑋 = 1] = 𝑝, define
its entropy as 𝐻(𝑋) ≝ −𝑝 ln 𝑝 − (1 − 𝑝) ln(1 − 𝑝) (use natural logarithms for the sake of simplicity).
Find (a) a value of 𝑝 ∈ [0,1] where the entropy 𝐻(𝑋) is largest and (b) a value 𝑝 ∈ [0,1] where the
entropy 𝐻(𝑋) is smallest. Show brief calculations/arguments for both parts. (3 + 2 = 5 marks)
\[
\frac{dH}{dp} = -1 - \ln p + 1 + \ln(1-p) = \ln\left(\frac{1}{p} - 1\right),
\]
which vanishes at 𝑝 = 1/2. The second derivative
\[
\frac{d^2 H}{dp^2} = -\frac{1}{p(1-p)}
\]
is negative at 𝑝 = 1/2, confirming that this is a local maximum. Since this value of 𝑝 also lies in the feasible set [0,1], we conclude that 𝐻(𝑋) is largest at 𝑝 = 1/2.
𝐻(𝑋) ≥ 0 for all values of 𝑝, and we see that for 𝑝 = 0 as well as for 𝑝 = 1 we get 𝐻(𝑋) = 0, since 1 ln 1 = 0 and, by the limiting convention, 0 ln 0 = 0. Since these values also lie in the feasible set [0,1], we conclude that 𝐻(𝑋) is smallest at these values.
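A quick numerical sanity check in Python (illustrative, not part of the key) that 𝐻 peaks at 𝑝 = 1/2 with value ln 2 and vanishes at the endpoints:

    import numpy as np

    # H(p) = -p ln p - (1-p) ln(1-p) on a fine grid of p values in (0, 1).
    p = np.linspace(1e-9, 1 - 1e-9, 100001)
    H = -p * np.log(p) - (1 - p) * np.log(1 - p)
    print(p[np.argmax(H)], H.max())   # ~0.5 and ~0.693 = ln 2
    print(H[0], H[-1])                # both ~0, matching H = 0 at p = 0 and p = 1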
Q3. (At a loss for names) Consider the following loss function where 𝜏 > 0 and 𝑧 ∈ ℝ.
\[
\ell_\tau(z) = \begin{cases}
1 - z & z < 1 - \tau \\[4pt]
-\dfrac{(1-z)^4}{16\tau^3} + \dfrac{3(1-z)^2}{8\tau} + \dfrac{1-z}{2} + \dfrac{3\tau}{16} & z \in [1-\tau,\, 1+\tau] \\[4pt]
0 & z > 1 + \tau
\end{cases}
\]
1. Write down expressions for $\frac{d\ell_\tau(z)}{dz}$ and $\frac{d^2\ell_\tau(z)}{dz^2}$. No need to show calculations.
2. Write down an expression for ∇𝐰 𝑓(𝐰) where 𝐰, 𝐱 𝑖 ∈ ℝ𝑑 , 𝑦 𝑖 ∈ {−1, +1}. You can use terms
such as ℓ′𝜏 (⋅) in your expression to denote the first derivative of ℓ𝜏 (⋅) to avoid clutter.
\[
f(\mathbf{w}) = \frac{1}{2}\|\mathbf{w}\|_2^2 + \sum_{i=1}^{n} \ell_\tau\!\left(y_i \cdot \mathbf{w}^\top \mathbf{x}_i\right)
\]
3. Write down an expression for what the loss function ℓ𝜏 (⋅) would look like as 𝜏 → 0+ .
Give brief calculations for the 2nd part and brief justification for the 3rd part. (2+2+3+1 = 8 marks)
\[
\frac{d\ell_\tau(z)}{dz} = \begin{cases}
-1 & z < 1 - \tau \\[4pt]
\dfrac{(1-z)^3}{4\tau^3} - \dfrac{3(1-z)}{4\tau} - \dfrac{1}{2} & z \in [1-\tau,\, 1+\tau] \\[4pt]
0 & z > 1 + \tau
\end{cases}
\]
\[
\frac{d^2\ell_\tau(z)}{dz^2} = \begin{cases}
0 & z < 1 - \tau \\[4pt]
-\dfrac{3(1-z)^2}{4\tau^3} + \dfrac{3}{4\tau} & z \in [1-\tau,\, 1+\tau] \\[4pt]
0 & z > 1 + \tau
\end{cases}
\]
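A brief finite-difference check in Python (an illustration, not part of the key) that the first derivative above matches the piecewise definition of ℓ𝜏 on the middle piece:

    import numpy as np

    def ell(z, tau):
        # The piecewise loss from Q3.
        if z < 1 - tau:
            return 1 - z
        if z > 1 + tau:
            return 0.0
        u = 1 - z
        return -u**4 / (16 * tau**3) + 3 * u**2 / (8 * tau) + u / 2 + 3 * tau / 16

    def dell(z, tau):
        # The closed-form first derivative stated above.
        if z < 1 - tau:
            return -1.0
        if z > 1 + tau:
            return 0.0
        u = 1 - z
        return u**3 / (4 * tau**3) - 3 * u / (4 * tau) - 0.5

    tau, h = 0.5, 1e-6
    for z in [0.7, 1.0, 1.3]:   # all three points lie in [1 - tau, 1 + tau]
        fd = (ell(z + h, tau) - ell(z - h, tau)) / (2 * h)
        print(z, dell(z, tau), fd)   # the last two columns should match closely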
By applying the chain rule, we get
\[
\nabla_{\mathbf{w}} f(\mathbf{w}) = \mathbf{w} + \sum_{i=1}^{n} \ell'_\tau(y_i \cdot \mathbf{w}^\top \mathbf{x}_i) \cdot y_i \cdot \mathbf{x}_i
\]
As 𝜏 → 0+, the interval [1 − 𝜏, 1 + 𝜏] shrinks to the point 𝑧 = 1 and ℓ𝜏 (𝑧) → max{0, 1 − 𝑧}, i.e., the hinge loss. The loss function ℓ𝜏 (⋅) is a doubly differentiable version of the hinge loss first proposed by Kamalika Chaudhuri, Claire Monteleoni and Anand D. Sarwate in their 2011 JMLR paper entitled “Differentially Private Empirical Risk Minimization”. Other such “smoothed” hinge loss variants also exist.
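As an illustration (function names and data layout are assumptions, not from the key), the gradient expression above in code, plus a numerical check that ℓ𝜏 approaches the hinge loss max{0, 1 − 𝑧} as 𝜏 → 0+:

    import numpy as np

    def ell_tau(z, tau):
        # Vectorised version of the piecewise loss from Q3.
        z = np.asarray(z, dtype=float)
        u = 1 - z
        mid = -u**4 / (16 * tau**3) + 3 * u**2 / (8 * tau) + u / 2 + 3 * tau / 16
        return np.where(z < 1 - tau, u, np.where(z > 1 + tau, 0.0, mid))

    def dell_tau(z, tau):
        # Vectorised first derivative of the loss.
        z = np.asarray(z, dtype=float)
        u = 1 - z
        mid = u**3 / (4 * tau**3) - 3 * u / (4 * tau) - 0.5
        return np.where(z < 1 - tau, -1.0, np.where(z > 1 + tau, 0.0, mid))

    def grad_f(w, X, y, tau):
        # grad f(w) = w + sum_i ell'_tau(y_i * w^T x_i) * y_i * x_i, with X of shape (n, d).
        z = y * (X @ w)
        return w + X.T @ (dell_tau(z, tau) * y)

    # As tau -> 0+, ell_tau approaches the hinge loss max(0, 1 - z).
    z = np.linspace(0.5, 1.5, 11)
    for tau in [0.1, 0.01, 0.001]:
        print(tau, np.max(np.abs(ell_tau(z, tau) - np.maximum(0.0, 1 - z))))  # shrinks with tau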
1. Write down the Lagrangian of this problem by introducing dual variables for the constraints.
2. Using the Lagrangian, create the dual problem (show a brief derivation). Simplify the dual as much as you can, otherwise the next part may get more cumbersome for you.
3. Give an expression for deriving the primal solution 𝑥 in terms of the dual variables.
4. Give pseudocode for a coordinate ascent/descent method to solve the dual. Use any coordinate selection method you like. Give precise expressions in pseudocode on how you would process a chosen coordinate, taking care of constraints. (2 + 4 + 2 + 4 = 12 marks)
Introducing dual variables 𝛼𝑖 , 𝛽𝑖 , 𝛾𝑖 for the 3 sets of constraints and compactly writing them and
the 𝑎𝑖 , 𝑐 𝑖 values as vectors gives the Lagrangian as (note that 𝟏 denotes the all-ones vector)
\[
\mathcal{L}(x, \mathbf{c}, \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}) = \frac{1}{2}x^2 + \mathbf{c}^\top \mathbf{1} + \boldsymbol{\alpha}^\top (x \cdot \mathbf{1} - \mathbf{a} - \mathbf{c}) - \boldsymbol{\beta}^\top (x \cdot \mathbf{1} - \mathbf{a} + \mathbf{c}) - \boldsymbol{\gamma}^\top \mathbf{c}
\]
Setting $\frac{\partial \mathcal{L}}{\partial \mathbf{c}} = \mathbf{0}$ gives us $\mathbf{1} - \boldsymbol{\alpha} - \boldsymbol{\beta} - \boldsymbol{\gamma} = \mathbf{0}$ and therefore $\boldsymbol{\alpha} + \boldsymbol{\beta} \leq \mathbf{1}$ (as $\boldsymbol{\gamma} \geq \mathbf{0}$).
Setting $\frac{\partial \mathcal{L}}{\partial x} = 0$ gives us $x + \boldsymbol{\alpha}^\top \mathbf{1} - \boldsymbol{\beta}^\top \mathbf{1} = 0$, i.e., $x = (\boldsymbol{\beta} - \boldsymbol{\alpha})^\top \mathbf{1} = \sum_{i=1}^{n} (\beta_i - \alpha_i)$.
Putting these back gives us
\[
\mathcal{L} = \frac{1}{2}x^2 + x \cdot (\boldsymbol{\alpha} - \boldsymbol{\beta})^\top \mathbf{1} - (\boldsymbol{\alpha} - \boldsymbol{\beta})^\top \mathbf{a} = -\frac{1}{2}x^2 - (\boldsymbol{\alpha} - \boldsymbol{\beta})^\top \mathbf{a},
\]
so the dual problem is to maximize $-\frac{1}{2}\big((\boldsymbol{\beta} - \boldsymbol{\alpha})^\top \mathbf{1}\big)^2 + (\boldsymbol{\beta} - \boldsymbol{\alpha})^\top \mathbf{a}$ over $\boldsymbol{\alpha}, \boldsymbol{\beta} \geq \mathbf{0}$ with $\boldsymbol{\alpha} + \boldsymbol{\beta} \leq \mathbf{1}$.
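A minimal Python sketch (an illustration, not the official key answer) of one possible cyclic coordinate-ascent scheme for this dual, using the facts derived above: the dual objective depends on each pair (𝛼𝑖, 𝛽𝑖) only through 𝑑𝑖 = 𝛽𝑖 − 𝛼𝑖 ∈ [−1, 1], and the primal solution is recovered as 𝑥 = ∑𝑖(𝛽𝑖 − 𝛼𝑖):

    import numpy as np

    def coordinate_ascent_dual(a, n_epochs=100):
        # Maximize  g(alpha, beta) = -0.5 * s**2 + sum_i (beta_i - alpha_i) * a_i,
        # where s = sum_i (beta_i - alpha_i), subject to alpha, beta >= 0 and
        # alpha_i + beta_i <= 1.  The objective depends on (alpha_i, beta_i) only
        # through d_i = beta_i - alpha_i, which can take any value in [-1, 1].
        a = np.asarray(a, dtype=float)
        n = len(a)
        d = np.zeros(n)       # d_i = beta_i - alpha_i
        s = 0.0               # running sum of d, equal to the primal variable x
        for _ in range(n_epochs):
            for i in range(n):                     # cyclic coordinate selection
                s_rest = s - d[i]                  # sum over the other coordinates
                # 1-D concave problem in d_i: maximize -0.5*(s_rest + d_i)**2 + a[i]*d_i;
                # the unconstrained maximizer a[i] - s_rest is clipped to the box [-1, 1].
                d[i] = np.clip(a[i] - s_rest, -1.0, 1.0)
                s = s_rest + d[i]
        beta = np.maximum(d, 0.0)
        alpha = np.maximum(-d, 0.0)
        x = s                  # primal recovery: x = sum_i (beta_i - alpha_i)
        return alpha, beta, x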
Q6. Melbo has learnt a decision tree to solve a binary classification problem with 95 train points of
which it gets 48 correct and 47 wrong. There are 10 real features for every training point and the
first feature 𝑥1 is an interesting one. 𝑥1 takes only 3 values, namely 0, 1, 2. Among the train points
that Melbo classified correctly, a 5/12 fraction had 𝑥1 = 0, a 1/6 fraction had 𝑥1 = 1 and the rest
had 𝑥1 = 2. Melbo got 2/3rd of the training points that had 𝑥1 = 0 wrong. Melbo got 1/5th of the
training points that had 𝑥1 = 1 wrong and 1/5th of the training points that had 𝑥1 = 2 wrong. Find
out how many train points had the feature value 𝑥1 = 0, 𝑥1 = 1 and 𝑥1 = 2.
(Bonus) Do you notice anything funny about the way Melbo’s decision tree gets answers right and
wrong? Can you improve its classification accuracy on the training set? You cannot change the
decision tree itself, but you can take the decision tree output on a data point and the value of the
first feature for that data point 𝑥1 and possibly change the output. What is the improved accuracy?
For this part, you may assume that the binary labels are +1 and −1. (2 x 3 = 6 + 3 marks)
Of the 48 correctly classified points, 5/12 × 48 = 20 had 𝑥1 = 0, 1/6 × 48 = 8 had 𝑥1 = 1 and the remaining 20 had 𝑥1 = 2. Since these correct points make up 1/3, 4/5 and 4/5 of the points with 𝑥1 = 0, 1, 2 respectively, there are 60 points with 𝑥1 = 0, 10 points with 𝑥1 = 1 and 25 points with 𝑥1 = 2. To see what is funny about this DT classifier, let us tabulate the breakup:
DT predictions as given:
             Correct   Wrong
  𝑥1 = 0       20       40
  𝑥1 = 1        8        2
  𝑥1 = 2       20        5

After inverting the DT output whenever 𝑥1 = 0:
             Correct   Wrong
  𝑥1 = 0       40       20
  𝑥1 = 1        8        2
  𝑥1 = 2       20        5
We note that the DT classifies many more points wrongly than correctly if 𝑥1 = 0. Thus, if we
invert the DT outputs for data points with 𝑥1 = 0, i.e., predict 𝑦 = +1 if the DT predicted 𝑦 = −1 and predict 𝑦 = −1 if the DT predicted 𝑦 = +1, then we would get 68/95 points correct (up from 48/95).
We leave DT predictions on points with 𝑥1 = 1 or 𝑥1 = 2 untouched.
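A quick Python check (illustrative, not part of the key) of the counts and of the improved accuracy after flipping the outputs on points with 𝑥1 = 0:

    # Correct/wrong breakup of the DT per value of x1, as tabulated above.
    correct = {0: 20, 1: 8, 2: 20}
    wrong   = {0: 40, 1: 2, 2: 5}

    totals = {v: correct[v] + wrong[v] for v in (0, 1, 2)}
    print(totals)                                   # {0: 60, 1: 10, 2: 25}

    before = sum(correct.values())                  # 48 correct originally
    after = wrong[0] + correct[1] + correct[2]      # invert outputs when x1 = 0
    print(before, after, round(after / 95, 3))      # 48 68 0.716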