Machine Learning Questions Final - Solutions
Type: Closed Book Time: 180 minutes Max Marks: 120 Date: 09/05/2025
------------------------------------------------------------------------------------------------------
(Write in Answer Sheet) Self Declaration: I declare that I am not carrying with me:
1. Any written material on paper, clothes, or body parts
2. Any mobile phone, communication, or data storage devices.
Name: and Signature with date:
Note: This is only the solution key. You are expected to derive/solve to get the
solutions.
Q.1) [Total Marks: 16] You are using AdaBoost and obtain the final ensemble of weak classifiers as shown below.
There are 9 regions, and each region will predict either a +ve (+1) or a -ve (-1) outcome. Given
that the final prediction from region (5) is -ve, find the relationships between the weighting coefficients.
Solution:
Since we do not know the prediction of each classifier, we need to consider different cases. There are
16 different cases in total, and each case gives a different relationship between the weighting coefficients.
Case 1:
• Classifier 1 (with weighting coefficient 𝛼1 ): Left is +ve and Right is -ve
• Classifier 2 (with weighting coefficient 𝛼2 ): Left is +ve and Right is -ve
• Classifier 3 (with weighting coefficient α3): Top is +ve and bottom is -ve
• Classifier 4 (with weighting coefficient α4): Top is +ve and bottom is -ve
Relationship: −α1 + α2 − α3 + α4 < 0 and hence α2 + α4 < α1 + α3
Case 2:
• Classifier 1 (with weighting coefficient 𝛼1 ): Left is -ve and Right is +ve
• Classifier 2 (with weighting coefficient 𝛼2 ): Left is +ve and Right is -ve
• Classifier 3 (with weighting coefficient α3): Top is +ve and bottom is -ve
• Classifier 4 (with weighting coefficient α4): Top is +ve and bottom is -ve
Relationship: α1 + α2 − α3 + α4 < 0 and hence α1 + α2 + α4 < α3
So for each case you can write: ±α1 ± α2 ± α3 ± α4 < 0, where the sign in front of αm is the prediction of
classifier m in region (5). Hence, you will get all 16 possibilities (see the enumeration sketch below).
1 mark will be given for each case and the corresponding relationship.
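For completeness, a small Python sketch (not required in the exam) that enumerates the 16 sign patterns; the a1..a4 here stand for the weighting coefficients α1..α4, and each printed line is the inequality obtained in one case:

# Enumerate all 16 sign patterns for Q.1. The sign in front of each a_m is the
# prediction (+1 or -1) of classifier m in region (5); since the ensemble prediction
# in region (5) is -ve, each pattern gives one inequality on a1..a4.
from itertools import product

for signs in product([+1, -1], repeat=4):
    lhs = " ".join(f"{'+' if s > 0 else '-'} a{i+1}" for i, s in enumerate(signs))
    print(f"{lhs} < 0")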
Q.2) [Total Marks: 12] There are 2000 labelled samples and you decide to use bagging. You use an ensemble of
decision trees by generating 50 bootstrap samples, each used to train a separate tree. Each bootstrap sample is
created by sampling with replacement from the original dataset and contains 2000 instances (the same size as the
original dataset). Each tree has an accuracy of 60% and their errors are uncorrelated. We model X, the number
of trees that classify correctly, with a Binomial distribution. Note that a Chernoff bound is used to calculate a
conservative lower bound for the ensemble accuracy. For a Binomial X ~ Binomial(n, p), the Chernoff bound on the
probability that X deviates below its mean by a factor of δ is P(X < (1 − δ)μ) ≤ exp(−μδ²/2), where μ is the mean.
(a) [Marks: 8] What is the expected accuracy of the majority-vote ensemble using the Chernoff bound?
(b) [Marks: 4] Is there anything wrong with the result, or is it correct? Please justify your answer with a reason.
Solution:
The number of trees that correctly classify is modelled as Binomial. Each tree has an accuracy of 60%
(error rate of 40%). The mean of X is 𝜇 = 𝑛𝑝 = 50 × 0.6 = 30.
The majority vote is incorrect if fewer than 26 trees classify correctly. So for P(X < 26), δ = (30 − 26)/30 ≈ 0.133.
P(ensemble incorrect) = P(X < 26) ≤ exp(−μδ²/2) = exp(−(30 × 0.133²)/2) ≈ exp(−0.267) ≈ 0.766
The probability that the ensemble is incorrect is bounded above by 76.6%, so the bound only guarantees an
ensemble accuracy of at least 23.4%, which is much lower than the individual accuracy of 60%. Therefore the
result appears wrong: the Chernoff bound does not produce a useful result here.
Note that the Chernoff bound is useful when the sample size is large. In this case, the sample size (50 trees) is
small and hence the bound is not tight. (I do not expect you to know this, and hence no marks are
deducted if you do not mention this reason; it is only for your information.)
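A small Python sketch (for verification only, not expected in the exam) that reproduces the Chernoff-bound number above and compares it with the exact Binomial tail, which turns out to be roughly 0.1, confirming how loose the bound is:

# Verify the Chernoff bound for Q.2 and compare with the exact Binomial tail.
import math

n, p = 50, 0.6            # 50 trees, each 60% accurate, errors assumed independent
mu = n * p                # mean number of correct trees = 30
delta = (mu - 26) / mu    # deviation factor for P(X < 26), about 0.133

chernoff = math.exp(-mu * delta**2 / 2)                                    # ~0.766
exact = sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(26))  # ~0.1

print(f"Chernoff bound on P(X < 26): {chernoff:.3f}")
print(f"Exact P(X < 26):             {exact:.3f}")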
Q.3) [Total Marks: 22] 4 clusters and 5 data points are given to you. The data points are X1: [1,2]; X2: [2,1]; X3:
[8,8]; X4: [9,9]; X5: [4,5]. The mixing coefficients of all the clusters are the same and equal to 1/4. The means of the
four clusters are: C1: [1,1]; C2: [2,2]; C3: [8,8]; C4: [9,9]. We also assume that the covariance matrices are identity
matrices. Complete one step of E-M and answer the following:
(a) [Marks: 10] Draw a table with the values of the responsibilities (γ) after the E-step. The table should contain
rows representing data points and columns representing clusters.
(b) [Marks: 4] Data point [4,5] belongs to which cluster, and why?
(c) [Marks: 8] After the M-step, which cluster has the highest mixing coefficient? Why, and what is its value?
Solution:
(a)
        C1          C2          C3          C4
X1      0.5         0.5         2.87e-19    2.39e-25
X2      0.5         0.5         2.87e-19    2.39e-25
X3      3.83e-22    1.70e-16    0.73        0.27
X4      1.17e-28    3.83e-22    0.27        0.73
X5      0.002       0.995       0.002       8.27e-07
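A small Python sketch (for verification, assuming the setup above: equal mixing coefficients of 1/4 and identity covariances) that reproduces the responsibility table and also prints the M-step update of the mixing coefficients, which is what parts (b) and (c) rely on:

# One E-step of the 4-component Gaussian mixture in Q.3, followed by the M-step
# update of the mixing coefficients.
import numpy as np

X  = np.array([[1, 2], [2, 1], [8, 8], [9, 9], [4, 5]], dtype=float)  # data points X1..X5
mu = np.array([[1, 1], [2, 2], [8, 8], [9, 9]], dtype=float)          # cluster means C1..C4
pi = np.full(4, 0.25)                                                 # mixing coefficients

# E-step: gamma(n, k) is proportional to pi_k * N(x_n | mu_k, I);
# the Gaussian normalising constant 1/(2π) cancels in the normalisation.
d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared distances, shape (5, 4)
unnorm = pi * np.exp(-0.5 * d2)
gamma = unnorm / unnorm.sum(axis=1, keepdims=True)
print(gamma)                        # reproduces the table above

# M-step: pi_k = N_k / N with N_k = sum_n gamma(n, k); C2 has the largest value (~0.4).
print(gamma.sum(axis=0) / len(X))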
Q.4) [Total Marks: 24] Answer the following questions:
(a) [Marks: 10] For a linear SVM with decision boundary wᵀφ(x) + b = 0, derive the width of the margin.
(b) [Marks: 4] Can XOR data be classified correctly by a linear SVM? Why or why not?
(c) [Marks: 4] Let us assume that the basis function φ(x1, x2) = (x1, x2, x1·x2) is used on the XOR data. Will it now
be correctly classified by a linear SVM?
(d) [Marks: 4] Assuming that we are using a hard-margin linear SVM on the transformed data generated in part
(c), can you write the constraints for the optimization problem where you maximize the margin? You can
assume w1, w2 and w3 as the components of the weight vector.
(e) [Marks: 2] Instead of the manual transformation used in part (c), can you specify a polynomial kernel which can
be used?
Solution:
(a) Done in class. (Sketch: the closest points on either side of the boundary satisfy wᵀφ(x) + b = ±1, each at
distance 1/‖w‖ from it, so the margin width is 2/‖w‖.)
(b) No, since the XOR data are not linearly separable.
(c) Yes. Please provide the complete table of the transformed points and their labels.
(d) Four constraints for the four transformed points (see the feasibility sketch below):
i. For (1,0,0): w1 + b ≥ 1
ii. For (0,1,0): w2 + b ≥ 1
iii. For (0,0,0): b ≤ −1
iv. For (1,1,1): w1 + w2 + w3 + b ≤ −1
(e) The polynomial kernel k(x, x′) = (xᵀx′)² can be used. [Marks will be given for any other correct kernel as well.]
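A small feasibility sketch for parts (c)/(d): the particular weights w = (2, 2, −4) and b = −1 are an illustrative assumption (not necessarily the max-margin solution), chosen only to show that all four hard-margin constraints can be satisfied, i.e. that the transformed XOR data is linearly separable:

# Check the hard-margin constraints y_n * (w^T phi(x_n) + b) >= 1 on the transformed
# XOR data from Q.4(c), for one feasible (not necessarily optimal) weight vector.
import numpy as np

phi = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 1]], dtype=float)  # phi(x1, x2)
y   = np.array([-1, +1, +1, -1])                                           # XOR labels

w = np.array([2.0, 2.0, -4.0])   # hypothetical feasible weights (w1, w2, w3)
b = -1.0

print(y * (phi @ w + b))         # [1. 1. 1. 1.] -> all four constraints hold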
Q.5) [Total Marks: 26] Let us consider a 2-layer neural network with 1 hidden and 1 output layer. The input layer
contains 2 neurons, the hidden layer contains 2 neurons, and the output layer contains 1 neuron. The hidden layer
uses a Swish activation function, where Swish(z) = z·σ(z) with σ(z) the sigmoid function. The input is [1, 0.5] and
the true output is 1. The hidden layer weights are [w11 w12; w21 w22] = [0.1 0.2; 0.3 0.4] with both biases
b1 = b2 = 0.1, and the output layer weights are [w@11; w@12] = [0.7; 0.8] with bias b@1 = 0.2. Consider the loss
function to be MSE. Please solve and answer the following after one step of feedforward and backward propagation.
(a) [Marks: 2] What is the derivative of the activation function at the hidden layer?
(b) [Marks: 8] Find the loss.
(c) [Marks: 16] Find the gradients with respect to all the weights and biases.
Solution:
(a) The derivative of the Swish function is Swish′(z) = σ(z) + z·σ(z)(1 − σ(z)), which can equivalently be written as Swish(z) + σ(z)(1 − Swish(z)).
(b) You need to show the complete feedforward calculations.
Hidden layer
• Pre-activation input to the two units: [0.3, 0.6]
• After activation (Swish), output from the two units: [0.172, 0.387]
• Output layer pre-activation: [0.631]
• Output (sigmoid): 0.6526
• Loss = 0.0603
(c) Perform the complete backpropagation steps
• At the output unit: ∂L/∂w@11 = −0.0136, ∂L/∂w@12 = −0.0305, ∂L/∂b@1 = −0.0788
• At the hidden units:
Gradient with respect to weights: [∂L/∂w11 ∂L/∂w12; ∂L/∂w21 ∂L/∂w22] = [−0.0357 −0.0179; −0.0493 −0.0247]
Gradient with respect to biases: ∂L/∂b1 = −0.0357 and ∂L/∂b2 = −0.0493
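A short Python sketch (for verification only) of the full feedforward and backpropagation step, reproducing the loss and all the gradients listed above under the stated setup (Swish hidden activation, sigmoid output, loss taken as 0.5·(t − y)²):

# One feedforward and backpropagation step for the 2-2-1 network of Q.5.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([1.0, 0.5])                  # input
t  = 1.0                                   # target output
W1 = np.array([[0.1, 0.2], [0.3, 0.4]])    # hidden weights [w11 w12; w21 w22]
b1 = np.array([0.1, 0.1])
W2 = np.array([0.7, 0.8])                  # output weights [w@11, w@12]
b2 = 0.2

# Forward pass
z1 = W1 @ x + b1                  # [0.3, 0.6]
h  = z1 * sigmoid(z1)             # Swish: [0.172, 0.387]
z2 = W2 @ h + b2                  # ~0.631
y  = sigmoid(z2)                  # ~0.6526
loss = 0.5 * (t - y) ** 2         # ~0.0603

# Backward pass
delta2  = (y - t) * y * (1 - y)                                  # dL/dz2 ~ -0.0788
grad_W2 = delta2 * h                                             # ~[-0.0136, -0.0305]
grad_b2 = delta2                                                 # ~-0.0788
swish_d = sigmoid(z1) + z1 * sigmoid(z1) * (1 - sigmoid(z1))     # Swish'(z1)
delta1  = delta2 * W2 * swish_d                                  # ~[-0.0357, -0.0493]
grad_W1 = np.outer(delta1, x)                                    # rows ~[-0.0357, -0.0179], [-0.0493, -0.0247]
grad_b1 = delta1

print(loss, grad_W2, grad_b2, grad_W1, grad_b1, sep="\n")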
Q.6) [Total Marks: 20] Let us assume that we have IID observations {x1, x2, …, xn} which are drawn from the
following PDF: fw(x) = C exp(−(x − w)⁶/6), where C is the constant. Answer the following:
Solution:
(a) Derive the loss function. Since the observations are IID, L(w) = ∏ᵢ fw(xᵢ), and the final expression is:
log L(w) = n log(C) − (1/6) Σᵢ₌₁ⁿ (xᵢ − w)⁶
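A small numerical sketch showing that, once the log-likelihood above is available, the MLE of w can be found by minimising the loss (1/6) Σᵢ (xᵢ − w)⁶; the data values below are made up purely for illustration and are not from the question:

# Maximise log L(w) = n*log(C) - (1/6) * sum_i (x_i - w)^6 numerically; the constant
# term n*log(C) does not depend on w and is dropped.
import numpy as np

x = np.array([0.2, 1.1, 0.7, 1.5, 0.9])   # hypothetical IID observations

def loss(w):
    return np.sum((x - w) ** 6) / 6.0     # negative log-likelihood up to a constant

grid = np.linspace(x.min(), x.max(), 10001)
w_hat = grid[np.argmin([loss(w) for w in grid])]
print(w_hat)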