
CS 771A: Intro to Machine Learning, IIT Kanpur Midsem Exam (24 Sep 2022)

Name: Melbo    Roll No: 000001    Dept.: AWSM    (40 marks)

Instructions:
1. This question paper contains 2 pages (4 sides of paper). Please verify.
2. Write your name, roll number, and department in block letters with ink on each page. If you don’t do this, your pages may get lost when we unstaple your paper to scan pages.
3. Write your final answers neatly with a blue/black pen. Pencil marks may get smudged.
4. Don’t overwrite/scratch answers, especially in MCQs – such cases will get straight 0 marks.

Q1. For the hangman problem, 5 decision tree splits at a node are given. For each split, write down the information gain (entropy reduction) in the bold-border box next to the diagram, as a single fraction or decimal number. Use logarithms with base 2 in the definition of entropy. The numbers written in the nodes indicate how many words reached that node. (5 marks)

[Split diagrams not reproduced here.] The information gains for the five splits are: 1.0 (= 1), 1.5 (= 3/2), 1.75 (= 7/4), 2.0 (= 2), and 1.875 (= 15/8).
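The split diagrams are not reproduced here, but the quantity asked for is mechanical to compute once the word counts at each node are known. A minimal Python sketch, assuming (as in the hangman setting) that every word reaching a node is equally likely; the example split sizes below are illustrative and not taken from the exam diagrams:

```python
from math import log2

def information_gain(child_counts):
    """Entropy reduction when a node's words (assumed equally likely)
    are split into children holding the given numbers of words."""
    n = sum(child_counts)                 # words at the parent node
    parent_entropy = log2(n)              # uniform distribution over n words
    child_entropy = sum((c / n) * log2(c) for c in child_counts)
    return parent_entropy - child_entropy

# e.g. 16 words split evenly into two children: gain = 4 - 3 = 1.0
print(information_gain([8, 8]))        # 1.0
# 16 words split evenly into four children: gain = 4 - 2 = 2.0
print(information_gain([4, 4, 4, 4]))  # 2.0
```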

Q2. (Intriguing entropy) For a random variable 𝑋 with support {−1,1} with ℙ[𝑋 = 1] = 𝑝, define
its entropy as 𝐻(𝑋) ≝ −𝑝 ln 𝑝 − (1 − 𝑝) ln(1 − 𝑝) (use natural logarithms for sake of simplicity).
Find (a) a value of 𝑝 ∈ [0,1] where the entropy 𝐻(𝑋) is largest and (b) a value 𝑝 ∈ [0,1] where the
entropy 𝐻(𝑋) is smallest. Show brief calculations/arguments for both parts. (3 + 2 = 5 marks)
dH/d𝑝 = −1 − ln 𝑝 + 1 + ln(1 − 𝑝) = ln(1/𝑝 − 1), which vanishes at 𝑝 = 1/2. Checking the second derivative, d²H/d𝑝² = −1/(𝑝(1 − 𝑝)) < 0 at 𝑝 = 1/2, confirms that this is a local maximum. Since this value of 𝑝 also lies in the feasible set [0,1], we conclude that 𝐻(𝑋) is largest at 𝑝 = 1/2.

𝐻(𝑋) ≥ 0 for all values of 𝑝, and for 𝑝 = 0 as well as 𝑝 = 1 we get 𝐻(𝑋) = 0, since 1 ln 1 = 0 and, by convention, 0 ln 0 = 0. Since these values are also in the feasible set [0,1], we conclude that 𝐻(𝑋) is smallest at 𝑝 = 0 and 𝑝 = 1.
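A quick numerical sanity check of both parts (a minimal NumPy sketch; the grid resolution is an arbitrary choice):

```python
import numpy as np

# Evaluate H(p) = -p ln p - (1 - p) ln(1 - p) on a fine grid of p values.
p = np.linspace(1e-6, 1 - 1e-6, 100001)
H = -p * np.log(p) - (1 - p) * np.log(1 - p)

print(p[np.argmax(H)])  # 0.5, where H = ln 2 ~ 0.693 (the maximum)
print(H.min())          # ~0, attained as p approaches 0 or 1
```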
Q3. (At a loss for names) Consider the following loss function where 𝜏 > 0 and 𝑧 ∈ ℝ.
ℓ𝜏(𝑧) = 1 − 𝑧 if 𝑧 < 1 − 𝜏
ℓ𝜏(𝑧) = −(1 − 𝑧)⁴/(16𝜏³) + 3(1 − 𝑧)²/(8𝜏) + (1 − 𝑧)/2 + 3𝜏/16 if 𝑧 ∈ [1 − 𝜏, 1 + 𝜏]
ℓ𝜏(𝑧) = 0 if 𝑧 > 1 + 𝜏
1. Write down expressions for dℓ𝜏(𝑧)/d𝑧 and d²ℓ𝜏(𝑧)/d𝑧². No need to show calculations.
2. Write down an expression for ∇𝐰 𝑓(𝐰) where 𝐰, 𝐱 𝑖 ∈ ℝ𝑑 , 𝑦 𝑖 ∈ {−1, +1}. You can use terms
such as ℓ′𝜏 (⋅) in your expression to denote the first derivative of ℓ𝜏 (⋅) to avoid clutter.
𝑓(𝐰) = (1/2)‖𝐰‖₂² + ∑_{𝑖=1}^{𝑛} ℓ𝜏(𝑦𝑖 ⋅ 𝐰⊤𝐱𝑖)
3. Write down an expression for what the loss function ℓ𝜏 (⋅) would look like as 𝜏 → 0+ .
Give brief calculations for the 2nd part and brief justification for the 3rd part. (2+2+3+1 = 8 marks)
dℓ𝜏(𝑧)/d𝑧 = −1 if 𝑧 < 1 − 𝜏
dℓ𝜏(𝑧)/d𝑧 = (1 − 𝑧)³/(4𝜏³) − 3(1 − 𝑧)/(4𝜏) − 1/2 if 𝑧 ∈ [1 − 𝜏, 1 + 𝜏]
dℓ𝜏(𝑧)/d𝑧 = 0 if 𝑧 > 1 + 𝜏

d²ℓ𝜏(𝑧)/d𝑧² = 0 if 𝑧 < 1 − 𝜏
d²ℓ𝜏(𝑧)/d𝑧² = −3(1 − 𝑧)²/(4𝜏³) + 3/(4𝜏) if 𝑧 ∈ [1 − 𝜏, 1 + 𝜏]
d²ℓ𝜏(𝑧)/d𝑧² = 0 if 𝑧 > 1 + 𝜏
By applying the chain rule, we get
∇𝐰 𝑓(𝐰) = 𝐰 + ∑_{𝑖=1}^{𝑛} ℓ′𝜏(𝑦𝑖 ⋅ 𝐰⊤𝐱𝑖) ⋅ 𝑦𝑖 ⋅ 𝐱𝑖

As 𝜏 → 0⁺ (i.e., 𝜏 approaches 0 from the right), the interval [1 − 𝜏, 1 + 𝜏] shrinks to the single point 1, since 1 − 𝜏 → 1⁻ and 1 + 𝜏 → 1⁺. Thus, the limiting behaviour of this function can be described as lim_{𝜏→0⁺} ℓ𝜏(𝑧) = [1 − 𝑧]₊, i.e., it approaches the hinge loss function.

The loss function ℓ𝜏 (⋅) is a doubly differentiable version of the hinge loss first proposed by
Kamalika Chaudhuri, Claire Monteleoni and Anand D. Sarwate in their 2011 JMLR paper
entitled “Differentially Private Empirical Risk Minimization”. Other such “smoothed” hinge loss
variants also exist.
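A minimal NumPy sketch of these expressions (the function and variable names are my own, not from the exam), which also checks numerically that ℓ𝜏 approaches the hinge loss as 𝜏 → 0⁺:

```python
import numpy as np

def smoothed_hinge(z, tau):
    """The piecewise loss l_tau(z) from Q3, vectorized over z."""
    z = np.asarray(z, dtype=float)
    u = 1.0 - z
    mid = -u**4 / (16 * tau**3) + 3 * u**2 / (8 * tau) + u / 2 + 3 * tau / 16
    return np.where(z < 1 - tau, u, np.where(z > 1 + tau, 0.0, mid))

def smoothed_hinge_grad(z, tau):
    """First derivative d l_tau / dz."""
    z = np.asarray(z, dtype=float)
    u = 1.0 - z
    mid = u**3 / (4 * tau**3) - 3 * u / (4 * tau) - 0.5
    return np.where(z < 1 - tau, -1.0, np.where(z > 1 + tau, 0.0, mid))

def grad_f(w, X, y, tau):
    """Gradient of f(w) = 0.5 * ||w||^2 + sum_i l_tau(y_i * w.x_i)."""
    margins = y * (X @ w)
    return w + X.T @ (smoothed_hinge_grad(margins, tau) * y)

# The maximum gap to the hinge loss [1 - z]_+ shrinks as tau -> 0+
z = np.linspace(-2, 3, 501)
for tau in (0.5, 0.05, 0.005):
    print(tau, np.max(np.abs(smoothed_hinge(z, tau) - np.maximum(1 - z, 0))))
```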

Q4. (A regularized median) Given a set of real numbers 𝑎1, 𝑎2, …, 𝑎𝑛 ∈ ℝ (all are distinct but may be positive, negative or zero), we wish to find its “regularized median” by solving min_𝑥 (1/2)𝑥² + ∑_{𝑖=1}^{𝑛} |𝑥 − 𝑎𝑖|. However, to design a solver, it would be helpful if we first rewrite this objective function, by artificially introducing constraints, as the constrained problem

min_{𝑥,𝐜} (1/2)𝑥² + ∑_{𝑖=1}^{𝑛} 𝑐𝑖
s.t. 𝑥 − 𝑎𝑖 ≤ 𝑐𝑖, 𝑥 − 𝑎𝑖 ≥ −𝑐𝑖, 𝑐𝑖 ≥ 0 for all 𝑖

1. Write down the Lagrangian of this problem by introducing dual variables for the constraints.
2. Using the Lagrangian, create the dual problem (show brief derivation). Simplify the dual as
much as you can otherwise the next part may get more cumbersome for you.
3. Give an expression for deriving the primal solution 𝑥 in terms of the dual variables.
4. Give pseudocode for a coordinate ascent/descent method to solve the dual. Use any coordinate selection method you like. Give precise expressions in pseudocode on how you would process a chosen coordinate, taking care of constraints. (2 + 4 + 2 + 4 = 12 marks)
Introducing dual variables 𝛼𝑖, 𝛽𝑖, 𝛾𝑖 for the 3 sets of constraints and compactly writing them and the 𝑎𝑖, 𝑐𝑖 values as vectors gives the Lagrangian as (note that 𝟏 denotes the all-ones vector)

ℒ(𝑥, 𝐜, 𝛂, 𝛃, 𝛄) = (1/2)𝑥² + 𝐜⊤𝟏 + 𝛂⊤(𝑥 ⋅ 𝟏 − 𝐚 − 𝐜) − 𝛃⊤(𝑥 ⋅ 𝟏 − 𝐚 + 𝐜) − 𝛄⊤𝐜

Setting ∂ℒ/∂𝐜 = 𝟎 gives us 𝟏 − 𝛂 − 𝛃 − 𝛄 = 𝟎 and therefore 𝛂 + 𝛃 ≤ 𝟏 (as 𝛄 ≥ 𝟎).
Setting ∂ℒ/∂𝑥 = 0 gives us 𝑥 + 𝛂⊤𝟏 − 𝛃⊤𝟏 = 0, i.e., 𝑥 = (𝛃 − 𝛂)⊤𝟏 = ∑_{𝑖=1}^{𝑛}(𝛽𝑖 − 𝛼𝑖).
Putting these back gives us ℒ = (1/2)𝑥² + 𝑥 ⋅ (𝛂 − 𝛃)⊤𝟏 − (𝛂 − 𝛃)⊤𝐚 = −(1/2)𝑥² − (𝛂 − 𝛃)⊤𝐚

Inverting the sign (optional) gives us the dual problem

min_{𝛂 ≥ 𝟎, 𝛃 ≥ 𝟎, 𝛂 + 𝛃 ≤ 𝟏}  (1/2)((𝛂 − 𝛃)⊤𝟏)² + (𝛂 − 𝛃)⊤𝐚

for 𝑖 = 1,2, … , 𝑛 (cyclic coordinate choice)


Terms involving only 𝛼𝑖, 𝛽𝑖 are
(1/2)(𝛼𝑖 − 𝛽𝑖)² + (𝛼𝑖 − 𝛽𝑖)𝑢𝑖 = (1/2)𝛼𝑖² + (1/2)𝛽𝑖² − 𝛼𝑖𝛽𝑖 + (𝛼𝑖 − 𝛽𝑖)𝑢𝑖
with the shorthand 𝑢𝑖 ≝ 𝑎𝑖 + ∑_{𝑗≠𝑖}(𝛼𝑗 − 𝛽𝑗). Note: 𝑢𝑖 will keep changing across iterations.
While updating 𝛼𝑖 , we must ensure 𝛼𝑖 ∈ [0, 1 − 𝛽𝑖 ] as 𝛼𝑖 + 𝛽𝑖 ≤ 1. The unconstrained
optimum is 𝑠𝑖 ≝ 𝛽𝑖 − 𝑢𝑖 but since quadratics are unimodal functions, the constrained
optimum is 𝑠𝑖 if 𝑠𝑖 ∈ [0,1 − 𝛽𝑖 ] else if 𝑠𝑖 < 0, the constrained optimum is 0 else it is 1 − 𝛽𝑖 .
Similarly, the unconstrained optimal value for 𝛽𝑖 is 𝑡𝑖 ≝ 𝛼𝑖 + 𝑢𝑖 but the constrained
optimum is 𝑡𝑖 if 𝑡𝑖 ∈ [0,1 − 𝛼𝑖 ] else if 𝑡𝑖 < 0, the constrained optimum is 0 else it is 1 − 𝛼𝑖 .
Note: use new 𝛼𝑖 value when updating 𝛽𝑖
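The coordinate-wise updates above translate directly into a solver. A minimal Python sketch using a cyclic coordinate choice (the function name, epoch count, and sample data are illustrative assumptions; the brute-force grid search at the end only sanity-checks the recovered primal solution):

```python
import numpy as np

def regularized_median(a, n_epochs=100):
    """Cyclic coordinate descent on the dual of Q4."""
    a = np.asarray(a, dtype=float)
    n = len(a)
    alpha, beta = np.zeros(n), np.zeros(n)
    for _ in range(n_epochs):
        for i in range(n):
            # u_i = a_i + sum_{j != i} (alpha_j - beta_j), recomputed every visit
            u = a[i] + (alpha.sum() - alpha[i]) - (beta.sum() - beta[i])
            # Update alpha_i: unconstrained optimum s_i = beta_i - u_i,
            # clipped to the feasible interval [0, 1 - beta_i]
            alpha[i] = np.clip(beta[i] - u, 0.0, 1.0 - beta[i])
            # Update beta_i using the new alpha_i: t_i = alpha_i + u_i,
            # clipped to [0, 1 - alpha_i]
            beta[i] = np.clip(alpha[i] + u, 0.0, 1.0 - alpha[i])
    # Recover the primal solution x = sum_i (beta_i - alpha_i)
    return (beta - alpha).sum()

a = np.array([-3.0, 0.5, 2.0, 7.0, 9.0])
x = regularized_median(a)

# Sanity check against direct minimization of 0.5*x^2 + sum_i |x - a_i|
grid = np.linspace(-20, 20, 400001)
obj = 0.5 * grid**2 + np.abs(grid[:, None] - a).sum(axis=1)
print(x, grid[np.argmin(obj)])  # both should be (close to) 1.0
```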
Q5. (Circular argument) Given a circle in the 2D plane with centre at
𝐜 = (𝑝, 𝑞) ∈ ℝ² and radius 𝑟 > 0, we wish to build a classifier
that gives output 𝑦 = −1 if a point 𝐱 = (𝑥, 𝑦) ∈ ℝ² is inside the
circle, i.e., (𝑥 − 𝑝)² + (𝑦 − 𝑞)² < 𝑟², and 𝑦 = +1 otherwise.
Do not worry about points on the boundary. Give a feature map
𝜙: ℝ2 → ℝ𝐷 for some 𝐷 > 0 and a corresponding classifier
𝐖 ∈ ℝ𝐷 such that for any 𝐱 ∈ ℝ2 , sign(𝐖 ⊤ 𝜙(𝐱)) is the correct
output. Your map 𝜙 must not depend on 𝑝, 𝑞, 𝑟 but your classifier
𝐖 may depend on 𝑝, 𝑞, 𝑟. (2 + 2 = 4 marks)
We wish to capture sign((𝑥 − 𝑝)² + (𝑦 − 𝑞)² − 𝑟²) – expanding the expression gives us
sign(𝑥² − 2𝑝𝑥 + 𝑝² + 𝑦² − 2𝑞𝑦 + 𝑞² − 𝑟²)
This is readily done by the following combination of feature map and classifier:
𝜙((𝑥, 𝑦)) ≝ [𝑥², 𝑥, 𝑦², 𝑦, 1] ∈ ℝ⁵
𝐖 = [1, −2𝑝, 1, −2𝑞, 𝑝² + 𝑞² − 𝑟²]
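A quick numerical check of this construction (the centre, radius, and sampling range below are arbitrary illustrative choices):

```python
import numpy as np

p, q, r = 1.5, -2.0, 3.0
W = np.array([1.0, -2 * p, 1.0, -2 * q, p**2 + q**2 - r**2])

def phi(point):
    x, y = point
    return np.array([x**2, x, y**2, y, 1.0])

# sign(W . phi(x)) must match the geometric inside/outside test
rng = np.random.default_rng(0)
for point in rng.uniform(-6, 6, size=(1000, 2)):
    geometric = -1 if (point[0] - p)**2 + (point[1] - q)**2 < r**2 else +1
    assert np.sign(W @ phi(point)) == geometric
print("all 1000 random points classified correctly")
```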

Q6. Melbo has learnt a decision tree to solve a binary classification problem with 95 train points of
which it gets 48 correct and 47 wrong. There are 10 real features for every training point and the
first feature 𝑥1 is an interesting one. 𝑥1 takes only 3 values, namely 0, 1, 2. Among the train points
that Melbo classified correctly, a 5/12 fraction had 𝑥1 = 0, a 1/6 fraction had 𝑥1 = 1 and the rest
had 𝑥1 = 2. Melbo got 2/3rd of the training points that had 𝑥1 = 0 wrong. Melbo got 1/5th of the
training points that had 𝑥1 = 1 wrong and 1/5th of the training points that had 𝑥1 = 2 wrong. Find
out how many train points had the feature value 𝑥1 = 0, 𝑥1 = 1 and 𝑥1 = 2.
(Bonus) Do you notice anything funny about the way Melbo’s decision tree gets answers right and
wrong? Can you improve its classification accuracy on the training set? You cannot change the
decision tree itself, but you can take the decision tree output on a data point and the value of the
first feature for that data point 𝑥1 and possibly change the output. What is the improved accuracy?
For this part, you may assume that the binary labels are +1 and −1. (2 x 3 = 6 + 3 marks)
There are 60 points with 𝑥1 = 0, 10 points with 𝑥1 = 1 and 25 points with 𝑥1 = 2. To see this, note that 5/12, 1/6 and the remaining 5/12 fractions of the 48 correct points give 20, 8 and 20 correct points with 𝑥1 = 0, 1, 2 respectively; since these are 1/3, 4/5 and 4/5 fractions of their groups, the group sizes are 60, 10 and 25. To see what is funny about this DT classifier, let us tabulate the breakup.

Original DT outputs:
         Correct  Wrong
𝑥1 = 0      20      40
𝑥1 = 1       8       2
𝑥1 = 2      20       5

After flipping the outputs for 𝑥1 = 0:
         Correct  Wrong
𝑥1 = 0      40      20
𝑥1 = 1       8       2
𝑥1 = 2      20       5
We note that the DT classifies many more points wrongly than correctly if 𝑥1 = 0. Thus, if we
invert the DT outputs for data points with 𝑥1 = 0 i.e., predict 𝑦 = +1 if the DT predicted 𝑦 =
−1 and predict 𝑦 = −1 if the DT predicted 𝑦 = +1, then we would get 68/95 points correct.
We leave DT predictions on points with 𝑥1 = 1 or 𝑥1 = 2 untouched.
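The counting argument can be verified with exact rational arithmetic (a minimal Python sketch):

```python
from fractions import Fraction as F

correct = 48
# Correct points broken up by the value of x1
c0 = correct * F(5, 12)   # 20 correct points with x1 = 0
c1 = correct * F(1, 6)    # 8 correct points with x1 = 1
c2 = correct - c0 - c1    # 20 correct points with x1 = 2

# Each group's correct count is a known fraction of that group's size
n0 = c0 / (1 - F(2, 3))   # 60 points with x1 = 0
n1 = c1 / (1 - F(1, 5))   # 10 points with x1 = 1
n2 = c2 / (1 - F(1, 5))   # 25 points with x1 = 2
print(n0, n1, n2)         # 60 10 25 (they sum to 95)

# Flipping the DT output whenever x1 = 0 turns its 40 mistakes into hits
improved = (n0 - c0) + c1 + c2
print(improved, "/ 95")   # 68 / 95
```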
