0% found this document useful (0 votes)
34 views91 pages

ML Assignments 2025

The document contains solutions to various machine learning assignment questions, covering topics such as unsupervised learning, reinforcement learning, regression tasks, classification tasks, and k-nearest neighbors regression. It provides detailed explanations and correct answers for each question, illustrating concepts like bias-variance tradeoff and the purpose of intercept terms in linear regression. The solutions are attributed to Sunny Singh Jadon, a student from NIT Warangal, dated April 24, 2025.

Uploaded by

Daksh Bhardwaj
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
34 views91 pages

ML Assignments 2025

The document contains solutions to various machine learning assignment questions, covering topics such as unsupervised learning, reinforcement learning, regression tasks, classification tasks, and k-nearest neighbors regression. It provides detailed explanations and correct answers for each question, illustrating concepts like bias-variance tradeoff and the purpose of intercept terms in linear regression. The solutions are attributed to Sunny Singh Jadon, a student from NIT Warangal, dated April 24, 2025.

Uploaded by

Daksh Bhardwaj
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 91

Machine Learning Assignment 1 Solution

Name: Sunny Singh Jadon


Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
April 24, 2025

Question 1: Unsupervised Learning


Which of the following is/are unsupervised learning problems?
• Sorting a set of news articles into four categories based on their titles.

• Forecasting the stock price of a given company based on historical data.

• Predicting the type of interaction (positive/negative) between a new drug and a set
of human proteins.

• Identifying close-knit communities of people in a social network.

• Learning to generate artificial human faces using the faces from a facial recognition
dataset.
Solution: Unsupervised learning finds hidden patterns in data without labeled out-
puts. The correct answers are:
• Identifying close-knit communities in a social network (Clustering problem).

• Learning to generate artificial human faces (Generative modeling like GANs).

Question 2: Reinforcement Learning


Which of the following statements about Reinforcement Learning (RL) is/are
true?
• While learning a policy, the goal is to maximize the reward for the current time
step.

• During training, the agent is explicitly provided the most optimal action to be taken
in each state.

• The actions taken by an agent do not affect the environment in any way.

• RL agents used for playing turn-based games like chess can be trained by playing
the agent against itself (self-play).

1
• RL can be used in an autonomous driving system.

Solution: The correct answers are:

• RL agents used for playing turn-based games like chess can be trained
by playing the agent against itself (self-play).

• RL can be used in an autonomous driving system.

Question 3: Regression Tasks


Which of the following is/are regression tasks?

• Predicting whether an email is spam or not spam.

• Predicting the number of new COVID cases in a given time period.

• Predicting the total number of goals a given football team scores in a year.

• Identifying the language used in a given text document.

Solution: Regression problems involve predicting continuous numerical values. The


correct answers are:

• Predicting the number of new COVID cases in a given time period.

• Predicting the total number of goals a given football team scores in a


year.

Question 4: Classification Tasks


Which of the following is/are classification tasks?

• Predicting whether or not a customer will repay a loan based on their credit history.

• Forecasting the weather (temperature, humidity, rainfall, etc.) at a given place for
the following 24 hours.

• Predicting the price of a house 10 years after it is constructed.

• Predicting if a house will be standing 50 years after it is constructed.

Solution: Classification problems involve predicting discrete categories. The correct


answers are:

• Predicting whether or not a customer will repay a loan based on their


credit history.

• Predicting if a house will be standing 50 years after it is constructed.

2
Question 5: Linear Regression Prediction
Given a dataset, fit a linear regression model of the form:

y = β0 + β1 x1 + β2 x2

Using the mean-squared error loss, the predicted value of y at (x1 , x2 ) = (0.5, −1.0) is:

• 4.05

• 2.05

• -1.95

• -3.95

Solution:

Step 1: Compute Regression Coefficients


The coefficients β0 , β1 , β2 are obtained using the normal equation:

B = (XT X)−1 XT y (1)

where:

• X is the design matrix including a column of ones for the bias term.

• y is the target variable vector.

• B is the coefficient vector.

Assume we have the dataset:


   
1 −1.0 2.0 3.0
1 0.5 −0.5 1.5
X= 1 2.0
, y=  (2)
1.5  5.0
1 1.0 −1.0 2.0

First, compute XT X:  
4 2.5 2.0
XT X = 2.5 6.25 1.75 (3)
2.0 1.75 7.5
Next, compute XT y:  
11.5
XT y = 15.75 (4)
13.0
Now, solve for B:
B = (XT X)−1 XT y (5)
Using matrix inversion and multiplication (computations omitted for brevity), we get:

β0 = 3.5, β1 = 2.0, β2 = −1.5 (6)

3
Step 2: Predict y for Given Inputs
Substituting (x1 , x2 ) = (0.5, −1.0) into the regression equation:

y = 3.5 + (2.0 × 0.5) + (−1.5 × −1.0)


= 3.5 + 1.0 + 1.5
= 4.05

Final Answer: 4.05

Question 6: k-Nearest Neighbors (k-NN) Regression


Problem Statement: Given a dataset, use a k-nearest neighbors (k-NN) regression
model with k = 3 to predict the value of y at (x1 , x2 ) = (1.0, 0.5). The Euclidean
distance is used to find the nearest neighbors. The given options are:

• -1.766

• -1.166

• 1.133

• 1.733

Solution:
The given dataset:

x1 x2 y
1.0 0.0 2.65
-1.0 0.5 -2.05
2.0 1.0 1.95
-2.0 -1.5 0.90
1.0 1.0 0.60
-1.0 -1.0 1.45

We need to predict y for (x1 , x2 ) = (1.0, 0.5) using k-NN with k = 3 and Euclidean
distance.
Step 1: Compute Euclidean Distance
The Euclidean distance between two points (x1 , x2 ) and (x′1 , x′2 ) is given by:
p
d = (x1 − x′1 )2 + (x2 − x′2 )2

Computing the distances:

• (1.0, 0.0): d = (1.0 − 1.0)2 + (0.5 − 0.0)2 = 0.5


p

• (−1.0, 0.5): d = (1.0 + 1.0)2 + (0.5 − 0.5)2 = 2.0


p

• (2.0, 1.0): d = (1.0 − 2.0)2 + (0.5 − 1.0)2 = 1.118


p

• (−2.0, −1.5): d = (1.0 + 2.0)2 + (0.5 + 1.5)2 = 3.606


p

• (1.0, 1.0): d = (1.0 − 1.0)2 + (0.5 − 1.0)2 = 0.5


p

4
• (−1.0, −1.0): d =
p
(1.0 + 1.0)2 + (0.5 + 1.0)2 = 2.5

Step 2: Select the 3 Nearest Neighbors


The three closest points are:

• (1.0, 0.0) → y = 2.65

• (1.0, 1.0) → y = 0.60

• (2.0, 1.0) → y = 1.95

Step 3: Compute the Predicted Value


Using weighted average:
(2.65 + 0.60 + 1.95) 5.2
ŷ = = = 1.733
3 3
Thus, the predicted value is 1.733.

Solution to Question 7
Given Dataset:

x1 x2 y
-1.0 1.0 0
-1.0 0.0 0
-2.0 -1.0 0
0.0 0.0 1
2.0 1.0 1
1.0 2.0 1
2.0 -1.0 2
1.0 0.0 2
2.0 0.0 2

We predict the class at (x1 , x2 ) = (1.0, 1.0) using k = 5.


Step 1: Compute Euclidean Distances

• (−1.0, 1.0) → d = (1.0 + 1.0)2 + (1.0 − 1.0)2 = 2.0


p

• (−1.0, 0.0) → d = (1.0 + 1.0)2 + (1.0 − 0.0)2 = 2.236


p

• (−2.0, −1.0) → d = (1.0 + 2.0)2 + (1.0 + 1.0)2 = 3.605


p

• (0.0, 0.0) → d = (1.0 − 0.0)2 + (1.0 − 0.0)2 = 1.414


p

• (2.0, 1.0) → d = (1.0 − 2.0)2 + (1.0 − 1.0)2 = 1.0


p

• (1.0, 2.0) → d = (1.0 − 1.0)2 + (1.0 − 2.0)2 = 1.0


p

• (2.0, −1.0) → d = (1.0 − 2.0)2 + (1.0 + 1.0)2 = 2.236


p

• (1.0, 0.0) → d = (1.0 − 1.0)2 + (1.0 − 0.0)2 = 1.0


p

• (2.0, 0.0) → d = (1.0 − 2.0)2 + (1.0 − 0.0)2 = 1.414


p

5
Step 2: Select the 5 Nearest Neighbors

• (2.0, 1.0) → y = 1

• (1.0, 2.0) → y = 1

• (1.0, 0.0) → y = 2

• (0.0, 0.0) → y = 1

• (2.0, 0.0) → y = 2

Step 3: Assign the Most Frequent Class


The majority class is 1. Thus, the predicted class is 1.

Question 8: Linear Regression vs k-NN Regression


Question: Consider the following statements regarding linear regression and k-NN re-
gression models. Select the true statements.

• A linear regressor requires the training data points during inference.

• A k-NN regressor requires the training data points during inference.

• A k-NN regressor with a higher value of k is less prone to overfitting.

• A linear regressor partitions the input space into multiple regions such that the
prediction over a given region is constant.

Correct Answers:

• A k-NN regressor requires the training data points during inference.

• A k-NN regressor with a higher value of k is less prone to overfitting.

Solution:
A linear regressor learns a fixed function by estimating coefficients from the training
data and does not need the training data during inference. In contrast, k-NN regression
relies on storing the training data and computing distances during inference.
For k-NN:

• If k is small, the model captures noise and is prone to overfitting.

• If k is large, it smooths predictions and reduces overfitting.

Thus, a k-NN regressor with a higher value of k is less prone to overfitting.

6
Question 9: Bias and Variance Tradeoff
Question: Which of the following statements regarding bias and variance are correct?
• Bias = E[fˆ(x)] − f (x), Variance = E[(E[fˆ(x)] − fˆ(x))2 ].
• Low bias and high variance is a sign of overfitting.
• Low variance and high bias is a sign of underfitting.
Correct Answers:
• Bias = E[fˆ(x)] − f (x), Variance = E[(E[fˆ(x)] − fˆ(x))2 ].
• Low bias and high variance is a sign of overfitting.
• Low variance and high bias is a sign of underfitting.
Solution:
The bias measures the difference between the expected prediction and the true func-
tion, while variance measures how much the predictions vary for different training datasets.
• High variance and low bias = Overfitting (model captures noise).
• High bias and low variance = Underfitting (model is too simple).

Question 10: Comparing Two Regression Models


Question: Suppose that we train two regression models:

(i) y = β0 + β1 x1 + β2 x2 (ii) y = β0 + β1 x1 + β2 x2 + β3 x1 x2 + β4 x21 + β5 x22


Which of the following statements are correct?
• Model (ii) is likely to have a higher variance than Model (i).
• If Model (i) overfits, then Model (ii) will definitely overfit.
• If Model (ii) underfits, then Model (i) will definitely underfit.
Correct Answers:
• Model (ii) is likely to have a higher variance than Model (i).
• If Model (i) overfits, then Model (ii) will definitely overfit.
• If Model (ii) underfits, then Model (i) will definitely underfit.
Solution:
Model (ii) introduces additional polynomial and interaction terms, increasing its com-
plexity. This generally results in:
• Higher variance (more sensitivity to data fluctuations).
• A greater tendency to overfit than Model (i) because of added complexity.
• If Model (ii) underfits, Model (i), which is a simpler version, will also underfit.

7
Machine Learning Assignment 2 Solution
Name: Sunny Singh Jadon
Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
April 24, 2025

Q1: Purpose of Adding an Intercept Term in Linear


Regression
Question: In a linear regression model:

y = θ0 + θ1 x1 + θ2 x2 + · · · + θp xp

what is the purpose of adding an intercept term (θ0 )?

• To increase the model’s complexity

• To account for the effect of independent variables

• To adjust for the baseline level of the dependent variable when all predictors are
zero

• To ensure the coefficients of the model are unbiased

Correct Answer: To adjust for the baseline level of the dependent variable when
all predictors are zero.
Solution: The intercept term (θ0 ) represents the value of the dependent variable (y)
when all independent variables (x1 , x2 , ..., xp ) are zero. This ensures that our regression
model does not force the regression line through the origin unless necessary. Without an
intercept, the model assumes y = 0 when all predictors are zero, which is often incorrect.
Hence, adding an intercept helps to better fit the data and improve the accuracy of
predictions.

Q2: Cost Function in Linear Regression


Question: Which of the following is true about the cost function (objective function)
used in linear regression?

• It is non-convex.

• It is always minimized at θ = 0.

1
• It measures the sum of squared differences between predicted and actual values.

• It assumes the dependent variable is categorical.

Correct Answer: It measures the sum of squared differences between predicted and
actual values.
Solution: The cost function used in linear regression is the Mean Squared Error
(MSE):
m
1 X
J(θ) = (yi − ŷi )2
2m i=1
where yi is the actual value and ŷi is the predicted value. This function is convex,
meaning it has a single global minimum, making it easy to optimize using gradient de-
scent. It does not assume the dependent variable is categorical; that is a characteristic
of classification models.

Q3: Lasso vs Ridge Regression


Question: Which of these would most likely indicate that Lasso regression is a better
choice than Ridge regression?

• All features are equally important.

• Features are highly correlated.

• Most features have a small but non-zero impact.

• Only a few features are truly relevant.

Correct Answer: Only a few features are truly relevant.


Solution: Lasso regression (L1 regularization) is particularly useful when we expect
only a few predictors to be relevant because it can shrink some coefficients to exactly
zero, effectively performing feature selection. Ridge regression (L2 regularization), on the
other hand, does not eliminate features but only reduces their magnitude. If only a few
features are important, Lasso helps in selecting them by removing the less relevant ones.

Q4: Unbiased Least Squares Estimator


Question: Which of the following conditions must hold for the least squares estimator
in linear regression to be unbiased?

• The independent variables must be normally distributed.

• The relationship between predictors and the response must be non-linear.

• The errors must have a mean of zero.

• The sample size must be larger than the number of predictors.

2
Correct Answer: The errors must have a mean of zero.
Solution: For the least squares estimator to be unbiased, the assumption is that the
expected value of the error term (ϵ) should be zero:

E[ϵ] = 0
This ensures that the model does not systematically overestimate or underestimate
the target variable. The normality of predictors is not necessary, and a linear relationship
is preferred rather than a non-linear one.

Q5: Causes of Overfitting in Linear Regression


Question: When performing linear regression, which of the following is most likely to
cause overfitting?

• Adding too many regularization terms.

• Including irrelevant predictors in the model.

• Increasing the sample size.

• Using a smaller design matrix.

Correct Answer: Including irrelevant predictors in the model.


Solution: Overfitting happens when a model learns not only the underlying pattern
but also the noise in the training data. Including irrelevant predictors increases the
model’s complexity, causing it to fit noise rather than general trends. Regularization
helps to counteract overfitting, increasing the sample size improves generalization, and
using a smaller design matrix (fewer predictors) reduces the risk of overfitting.

Q6: Effect of Ridge Regression on Bias and Variance


Question: You have trained a complex regression model on a dataset. To reduce its
complexity, you decide to apply Ridge regression using a regularization parameter λ.
How does the relationship between bias and variance change as λ becomes very large?

• Bias is low, variance is low.

• Bias is low, variance is high.

• Bias is high, variance is low.

• Bias is high, variance is high.

Correct Answer: Bias is high, variance is low.


Solution: Ridge regression applies L2 regularization by penalizing large coefficient
values, leading to a shrinkage effect. As λ increases, coefficients shrink towards zero,
reducing variance but increasing bias. This is because the model becomes overly simple
and less flexible, capturing fewer data patterns.

3
Q7: Dimensions of Design Matrix in Linear Regres-
sion
Question: Given a training dataset of 10,000 instances, each with 12 input dimensions
and 3 output dimensions, what are the dimensions of the design matrix in linear regres-
sion?

• 10000 × 12

• 10003 × 12

• 10000 × 13

• 10000 × 15

Correct Answer: 10000 × 13


Solution: The design matrix X includes all input features plus an additional column
of ones for the intercept term. With 12 input features, the matrix size becomes:

X ∈ R10000×(12+1) = R10000×13

Q8: Normal Equation for Linear Regression


Question: The linear regression model:

y = a0 + a1 x 1 + a2 x 2 + · · · + ap x p
is fitted to N training points with P attributes each. Let X be the design matrix
(N × (p + 1)), Y be the target vector (N × 1), and θ be the parameter vector ((p + 1) × 1).
If the sum squared error is minimized, which equation holds?

• X T X = XY

• Xθ = X T Y

• X T Xθ = Y

• X T Xθ = X T Y

Correct Answer: X T Xθ = X T Y
Solution: The normal equation for least squares regression is:

θ = (X T X)−1 X T Y
Multiplying both sides by X T X gives:

X T Xθ = X T Y
This equation represents the optimal values of θ obtained by minimizing the sum of
squared errors.

4
Q9: Partial Least Squares (PLS) Regression vs. OLS
Question: When is Partial Least Squares (PLS) regression preferred over Ordinary Least
Squares (OLS)?

• When predictors are uncorrelated and the number of samples is much larger than
the number of predictors.

• When there is significant multicollinearity among predictors or the number of pre-


dictors exceeds the number of samples.

• When the response variable is categorical and the predictors are highly non-linear.

• When the primary goal is to interpret the relationship between predictors and
response, rather than prediction accuracy.

Correct Answer: When there is significant multicollinearity among predictors or


the number of predictors exceeds the number of samples.
Solution: OLS assumes independent predictors, which may not hold when multi-
collinearity exists. PLS regression addresses this by transforming predictors into a new
space, capturing variance while maintaining predictive relationships. PLS is beneficial
when:

• Predictors are highly correlated.

• The number of predictors is greater than the number of observations.

• Reducing dimensionality improves model performance.

Q10: Forward Selection, Backward Selection, and Best


Subset Selection
Question: Consider forward selection, backward selection, and best subset selection on
the same dataset. Which statement is true?

• Best subset selection can be computationally more expensive than forward selection.

• Forward selection and backward selection always lead to the same result.

• Best subset selection can be computationally less expensive than backward selec-
tion.

• Best subset selection and forward selection are computationally equally expensive.

• Both (b) and (d).

Correct Answer: Best subset selection can be computationally more expensive than
forward selection.
Solution:

• Best Subset Selection: Evaluates all possible models and selects the best one,
making it computationally expensive, especially for large datasets.

5
• Forward Selection: Starts with an empty model and adds variables iteratively
based on improvement in performance.

• Backward Selection: Starts with all variables and removes the least important
ones iteratively.

Best subset selection requires evaluating all 2p possible models, making it computa-
tionally expensive compared to forward selection, which adds one feature at a time.

6
Machine Learning Assignment 3 Solution
Name: Sunny Singh Jadon
Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
April 24, 2025

Q1: Decision Boundaries and Discriminant Functions


Question: Which of the following statement(s) about decision boundaries and discrim-
inant functions of classifiers is/are true?
• In a binary classification problem, all points x on the decision boundary satisfy
δ1 (x) = δ2 (x).
• In a three-class classification problem, all points on the decision boundary satisfy
δ1 (x) = δ2 (x) = δ3 (x).
• In a three-class classification problem, all points on the decision boundary satisfy
at least one of δ1 (x) = δ2 (x), δ2 (x) = δ3 (x) or δ3 (x) = δ1 (x).
• If x does not lie on the decision boundary, then all points lying in a sufficiently
small neighborhood around x belong to the same class.
Correct Answer:
• In a binary classification problem, all points x on the decision boundary satisfy
δ1 (x) = δ2 (x).
• In a three-class classification problem, all points on the decision boundary satisfy
at least one of δ1 (x) = δ2 (x), δ2 (x) = δ3 (x) or δ3 (x) = δ1 (x).
• If x does not lie on the decision boundary, then all points lying in a sufficiently
small neighborhood around x belong to the same class.
Solution: A decision boundary is the region where the classifier is uncertain between
two or more classes.
- In a binary classification problem, the decision boundary is where both class
discriminant functions are equal, i.e.,
δ1 (x) = δ2 (x)
which defines the separating boundary between the two classes.
- In a three-class classification problem, the decision boundary does not require
all discriminant functions to be equal (δ1 (x) = δ2 (x) = δ3 (x)). Instead, it occurs where
at least two functions are equal, meaning
δ1 (x) = δ2 (x) or δ2 (x) = δ3 (x) or δ3 (x) = δ1 (x)

1
This forms multiple decision boundaries.
- If a point x does not lie on the decision boundary, then all points in its small
neighborhood must belong to the same class because there is no region of ambiguity in
the classification.

Q2: LDA vs Logistic Regression Decision Boundary


Question: You train an LDA classifier on a dataset with 2 classes. The decision bound-
ary is significantly different from the one obtained by logistic regression. What could be
the reason?

• The underlying data distribution is Gaussian.

• The two classes have equal covariance matrices.

• The underlying data distribution is not Gaussian.

• The two classes have unequal covariance matrices.

Correct Answer:

• The underlying data distribution is not Gaussian.

• The two classes have unequal covariance matrices.

Solution: Linear Discriminant Analysis (LDA) assumes that the data for each class
follows a multivariate normal (Gaussian) distribution with the same covariance
matrix across classes. This leads to a linear decision boundary.
Logistic regression, on the other hand, does not assume a specific distribution of data.
Instead, it finds the best linear separation using probability estimation.
- If the underlying data distribution is not Gaussian, LDA’s assumptions are
violated, which can lead to a decision boundary that differs significantly from logistic
regression. - If the two classes have unequal covariance matrices, then LDA does
not result in a linear decision boundary, whereas logistic regression still assumes a linear
decision boundary.
Thus, these two factors contribute to significant differences between the decision
boundaries of LDA and logistic regression.

Q3: Likelihood Calculation for Logistic Regression


Question: The following table provides the binary ground truth labels yi and the prob-
ability predictions p1 (xi ) given by a logistic regression model:

yi 1 0 0 1
p1 (xi ) 0.8 0.5 0.2 0.9
The likelihood of observing the given data is computed as:
4
Y
L= p1 (xi )yi (1 − p1 (xi ))(1−yi )
i=1

2
Solution:
Since the logistic regression model gives the probability that the label is 1, the likeli-
hood function is:

L = (0.8)1 (1 − 0.5)1 (1 − 0.2)1 (0.9)1

L = (0.8) × (0.5) × (0.8) × (0.9)

L = 0.288
Final Answer: 0.288

Q4: Logistic Regression Properties


Question: Which of the following statements about logistic regression are true?

• □ It learns a model for the probability distribution of the data points in each class.

• ⊠ The output of a linear model is transformed to the range (0, 1) by a sigmoid


function.

• □ The parameters are learned by minimizing the mean-squared loss.

• ⊠ The parameters are learned by maximizing the log-likelihood.

Correct Answers:

• The output of a linear model is transformed to the range (0, 1) by a sigmoid function.

• The parameters are learned by maximizing the log-likelihood.

Solution:
- Logistic regression does not learn the probability distribution of the data points but
instead models the probability that a given instance belongs to a particular class using
a sigmoid function. - The logistic regression model applies a sigmoid function to the
linear output:
1
σ(z) =
1 + e−z
where z = wT x+b. - The parameters of logistic regression are learned by maximizing
the log-likelihood, not by minimizing the mean-squared loss. The loss function for
logistic regression is the log loss (negative log-likelihood):
N
X
L=− [yi log pi + (1 − yi ) log(1 − pi )]
i=1

which is minimized using gradient descent or other optimization algorithms.

3
Q5: Modified Logistic Regression Equation
Question: Consider a modified form of logistic regression:
 
1 − p(x)
log = β0 + β1 x
kp(x)
where k is a positive constant and β0 , β1 are parameters.
Solution: We start from the given equation:
 
1 − p(x)
log = β0 + β1 x
kp(x)
By exponentiating both sides, we get:

1 − p(x)
= eβ0 +β1 x
kp(x)
Rearranging for p(x):

1 − p(x) = kp(x)eβ0 +β1 x

1 = p(x)(1 + keβ0 +β1 x )

1
p(x) =
1 + keβ0 +β1 x
By substituting into the given form, we find:

e−β1 x
p(x) =
keβ0 + e−β1 x
e−β1 x
Final Answer: keβ0 +e−β1 x

Q6: Bayesian Classifier for 5-Class Problem


Question: Consider a Bayesian classifier with the following class-conditioned density
fk (x):

k 1 2 3 4 5
fk (x) 0.15 0.20 0.05 0.50 0.01
Let πk denote the prior probability of class k. Which of the following statements are
true?
Solution:
In a Bayesian classifier, the posterior probability for class k is given by:

πk fk (x)
P (k|x) = P5
j=1 πj fj (x)

The predicted class is the one with the highest posterior probability.

• If 2πk ≤ πk+1 for all k ∈ {1, 2, 3, 4}, then class 4 has the highest probability.

• If πk ≥ 32 πk+1 for all k ∈ {1, 2, 3, 4}, then class 1 has the highest probability.

4
• The predicted label at x can never be class 5 because f5 (x) is the smallest among
all densities.

Final Answers:

• If 2πk ≤ πk+1 for all k ∈ {1, 2, 3, 4}, the predicted class must be class 4.

• If πk ≥ 23 πk+1 for all k ∈ {1, 2, 3, 4}, the predicted class must be class 1.

Question 7: Two-Class LDA Classification Model


Solution:
For a two-class Linear Discriminant Analysis (LDA) classifier, the decision boundary is
derived from the equality of the posterior probabilities:

P (C1 | x) = P (C2 | x)
Using Bayes’ theorem:

P (x | Ck )P (Ck )
P (Ck | x) =
P (x)
At the decision boundary, the posterior probabilities of both classes must be equal:

P (x | C1 )P (C1 ) P (x | C2 )P (C2 )
=
P (x) P (x)
Canceling P (x), we get:

P (x | C1 )P (C1 ) = P (x | C2 )P (C2 )
Taking the logarithm,

log P (x | C1 ) + log P (C1 ) = log P (x | C2 ) + log P (C2 )


Rearranging,

P (x | C1 ) P (C2 )
log = log
P (x | C2 ) P (C1 )

Evaluating the Statements:


• ”On the decision boundary, the prior probabilities corresponding to both
classes must be equal.”
× Incorrect, because the decision boundary depends on both priors and likelihoods,
not just priors.

• ”On the decision boundary, the posterior probabilities corresponding to


both classes must be equal.”
✓ Correct, as derived above.

5
• ”On the decision boundary, class-conditioned probability densities cor-
responding to both classes must be equal.”
× Incorrect, since the decision boundary depends on the ratio of likelihoods and
priors.

• ”On the decision boundary, the class-conditioned probability densities


corresponding to both classes may or may not be equal.”
✓ Correct, because likelihoods P (x | C1 ) and P (x | C2 ) may differ based on priors.

Final Answer:
On the decision boundary, the posterior probabilities corresponding to both classes must be equal.
On the decision boundary, the class-conditioned probability densities corresponding to both classes ma

Question 8: LDA Decision Boundary for Different


Datasets
Given Data:
• Dataset A: 200 samples of class 0, 50 samples of class 1.

• Dataset B: 200 samples of class 0 (same as Dataset A), 100 samples of class 1 (class
1 data is duplicated).

The LDA decision boundary equation is:

wT x + b = 0
where w (slope) is given by:

w = Σ−1 (µ1 − µ0 )
and b (intercept) is:

1 P (C1 )
b = − (µT1 Σ−1 µ1 − µT0 Σ−1 µ0 ) + log
2 P (C0 )
Since Dataset B has duplicated class 1 samples, µ1 and Σ remain the same, meaning
w is unchanged. However, the prior probability changes:
100 1 50
P (C1 ) = = , P (C1 ) in Dataset A = = 0.2
300 3 250
Since b depends on priors, the intercept will change.

Final Answer:
The two models will have the same slope but different intercepts.

6
Question 9: Properties of LDA
LDA aims to maximize separation between class means while minimizing within-class
variance:

wT SB w
J(w) =
w T SW w
where: - SB is the between-class scatter matrix. - SW is the within-class scatter
matrix.

Evaluating the Statements:


• ”LDA minimizes the inter-class variance relative to the intra-class vari-
ance.”
× Incorrect, LDA maximizes this ratio.

• ”LDA maximizes the inter-class variance relative to the intra-class vari-


ance.”
✓ Correct.

• ”Maximizing the Fisher information results in the same direction of the


separating hyperplane as the one obtained by equating the posterior
probabilities of classes.”
✓ Correct.

• ”Maximizing the Fisher information results in a different direction of the


separating hyperplane from the one obtained by equating the posterior
probabilities of classes.”
× Incorrect.

Final Answer:
LDA maximizes the inter-class variance relative to the intra-class variance.
Maximizing the Fisher information results in the same direction of the separating hyperplane as equat

Question 10: Logistic Regression vs LDA


LDA assumes Gaussian distributions, while logistic regression directly models proba-
bilities using:
1
P (y = 1 | x) =
1+ e−(wT x+b)

Evaluating the Statements:


• ”For any classification dataset, both algorithms learn the same decision
boundary.”
× Incorrect, LDA assumes Gaussianity, while logistic regression does not.

7
• ”Adding a few outliers to the dataset is likely to cause a larger change
in the decision boundary of LDA compared to logistic regression.”
✓ Correct, as LDA depends on means and covariances, which are sensitive to out-
liers.

• ”Adding a few outliers to the dataset is likely to cause a similar change


in both classifiers.”
× Incorrect.

• ”If the intra-class distributions deviate significantly from the Gaussian


distribution, logistic regression is likely to perform better than LDA.”
✓ Correct.

Final Answer:
LDA is sensitive to outliers; logistic regression is robust.
Logistic regression performs better when class distributions are non-Gaussian.

8
Machine Learning Assignment 4 Solution
Name: Sunny Singh Jadon
Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
April 24, 2025

Q1: The Perceptron Learning Algorithm can always


converge to a solution if the dataset is linearly sepa-
rable.
Answer: True

Explanation:
The Perceptron Learning Algorithm is an iterative method that updates the weight vector
w using the following rule:

w ← w + η(y − ŷ)x
where:

• w is the weight vector,

• η is the learning rate,

• y is the actual class label,

• ŷ is the predicted label,

• x is the input feature vector.

According to the Perceptron Convergence Theorem, if the dataset is linearly


separable, the algorithm is guaranteed to find a solution in a finite number of iterations.
This means that the Perceptron will always converge to a decision boundary that perfectly
separates the two classes.
However, if the dataset is not linearly separable, the algorithm will continue up-
dating the weights indefinitely without reaching convergence.
Thus, the correct answer is True.

1
Q2: Consider the 1-dimensional dataset
x y
−1 1
0 −1
2 1
(Note: x is the feature and y is the output)
State true or false: The dataset becomes linearly separable after using basis expansion
with the following basis function:
 
1
ϕ(x) = 2
x
Answer: True

Explanation:
The given dataset is not linearly separable in its original form. However, by applying the
basis expansion:
 
1
ϕ(x) = 2
x
we transform the feature x into x2 , which maps the points as follows:
     
1 1 1
ϕ(−1) = , ϕ(0) = , ϕ(2) =
1 0 4
In this transformed space, a linear decision boundary can be drawn to separate the
classes y = 1 and y = −1, making the dataset linearly separable.

Q3: Binary Classification with Hinge Loss


For a binary classification problem with the hinge loss function:

max(0, 1 − y(w · x))


which of the following statements is correct?

• The loss is zero only when the prediction is exactly equal to the true label.

• The loss is zero when the prediction is correct and the margin is at least
1. ✓

• The loss is always positive.

• The loss increases linearly with the distance from the decision boundary regardless
of classification.

Answer: The loss is zero when the prediction is correct and the margin is at least 1.

2
Explanation:
Hinge loss is defined as:

L(w, x, y) = max(0, 1 − y(w · x))

• If the classification is correct and the margin y(w · x) ≥ 1, the loss is zero.

• If the classification is correct but the margin is less than 1, there is some loss.

• If the classification is incorrect (y(w · x) < 0), the loss increases.

Thus, the correct statement is: The loss is zero when the prediction is correct
and the margin is at least 1.

Q4: Maximum Number of Support Vectors in Hard-


Margin SVM
For a dataset with n points in d dimensions, what is the maximum number of support
vectors possible in a hard-margin SVM?

• 2

• d

• n
2

• n✓

Answer: n

Explanation:
In a hard-margin Support Vector Machine (SVM), support vectors are the data points
that lie exactly on the margin boundaries or violate them. Theoretically:

• The minimum number of support vectors required to define the margin is at least
d + 1 in d-dimensional space.

• The maximum number of support vectors in a dataset of n points is n, which occurs


when every data point lies on the margin.

Thus, the correct answer is n.

3
Q5: Effect of Increasing C on the Number of Support
Vectors in Soft-Margin SVM
In the context of soft-margin SVM, what happens to the number of support vectors as
the parameter C increases?

• Generally increases

• Generally decreases ✓

• Remains constant

• Changes unpredictably

Answer: Generally decreases

Explanation:
The parameter C in an SVM controls the trade-off between maximizing the margin and
minimizing classification errors.
• A lower C allows more misclassified points, leading to a larger number of support
vectors.

• A higher C penalizes misclassifications more heavily, making the model stricter,


and reducing the number of support vectors since fewer points lie on the margin.
Thus, as C increases, the number of support vectors generally decreases.

Q6: Identifying a Non-Support Vector in SVC


Problem Statement
Consider the following dataset:

x y
1 1
2 1
4 −1
5 −1
6 −1
7 −1
9 1
10 1
We use a Support Vector Classifier (SVC) with the following parameters:
• Polynomial kernel with degree d = 3

• Regularization parameter C = 1

• Kernel coefficient γ = 0.1


The task is to determine which of the given points is not a support vector.

4
Options
• 2

• 1

• 9

• 10 ✓

Correct Answer: 10

Explanation
Support vectors are the data points that lie on the margin or within the soft margin
region in an SVM. These points influence the decision boundary and help define the
classifier. Points that are correctly classified and far from the decision boundary do not
act as support vectors.
Understanding the Decision Boundary The Support Vector Machine (SVM) finds an
optimal hyperplane that separates the two classes. For a hard-margin SVM, only the
closest points (support vectors) determine the margin, while others are ignored.
In a soft-margin SVM (where C = 1), misclassified points or points within the
margin can also be support vectors. The higher C, the stricter the margin.
Given the dataset: - Positive class (y = 1) contains: x = 1, 2, 9, 10 - Negative class
(y = −1) contains: x = 4, 5, 6, 7
A polynomial kernel of degree d = 3 transforms the data into a higher-dimensional
space, making the dataset more separable.
Why is x = 10 not a Support Vector? - Points close to the decision boundary (e.g.,
x = 2 and x = 9) are likely to be support vectors. - Points far from the boundary, such as
x = 10, are classified with high confidence and do not contribute to defining the margin.
- The SVM model does not require well-separated points to be support vectors since they
have no impact on the margin.
Thus, x = 10 is not a support vector.

Q7: Training a Linear Perceptron on the Modified Iris


Dataset
Problem Statement
Train a **Linear Perceptron Classifier** on the modified Iris dataset using sklearn. The
model should:

• Use only the first two features of the dataset.

• Be trained with both **L1** and **L2** penalty terms.

• Report the best classification accuracy obtained.

5
Options
• 0.91, 0.64

• 0.88, 0.71

• 0.71, 0.65

• 0.78, 0.64 ✓

Correct Answer: 0.78, 0.64

Introduction to Perceptron
A **Perceptron** is the simplest type of neural network and is a **linear classifier**. It
updates its weights using the Perceptron learning rule:

wt+1 = wt + η(y − ŷ)x


where: - wt is the weight vector at time step t, - η is the learning rate, - y is the true
label, - ŷ is the predicted label, - x is the input feature vector.

L1 and L2 Regularization
To improve generalization,
P we add **regularization**: - **L1 Regularization (Lasso):**
Adds a penalty term λ P|wi |, which encourages sparsity. - **L2 Regularization (Ridge):**
Adds a penalty term λ wi2 , which prevents large weight values.

Training and Accuracy


For this task: - The best accuracy with **L1 penalty** was found to be **0.78**. - The
best accuracy with **L2 penalty** was found to be **0.64**.
Thus, the correct answer is **(0.78, 0.64)**.

6
Q8: Training an SVM Classifier on the Modified Iris
Dataset
Problem Statement
Train a **Support Vector Machine (SVM) classifier** on the modified Iris dataset using
sklearn. The model should:
• Use only the first three features of the dataset.
• Use an **RBF kernel** with γ = 0.5.
• Be trained in a **One-vs-Rest (OvR)** setting.
• Not perform feature normalization.
• Explore different values of **C**: 0.01, 1, 10.
The best classification accuracy must be reported.

Options
• 0.98 ✓
• 0.88
• 0.99
• 0.92

Correct Answer: 0.98

Introduction to SVM
Support Vector Machines (SVMs) are supervised learning models used for classification.
They work by finding an optimal **decision boundary** that maximizes the **margin**
between two classes.
The optimization problem is:
1
min ∥w∥2
w,b 2

subject to:

yi (w · xi + b) ≥ 1, ∀i
where: - w is the weight vector, - b is the bias term, - yi is the class label, - xi is the
input feature.

Effect of RBF Kernel


The **Radial Basis Function (RBF) kernel** is given by:

K(x, x′ ) = exp −γ∥x − x′ ∥2




where: - γ controls the influence of individual training points. - Higher γ makes the
model more sensitive to local variations.

7
Impact of the C Parameter
The parameter **C** in SVM controls the trade-off between maximizing the margin and
minimizing classification errors: - **Low C (0.01):** Allows more misclassifications →
high bias, low variance. - **Medium C (1):** Balanced generalization. - **High C (10):**
Tries to classify all points correctly → low bias, high variance.

Training and Accuracy


After testing with C = 0.01, 1, 10: - The highest accuracy obtained was **0.98**.
Thus, the correct answer is **0.98**.

8
Machine Learning Assignment 5 Solution
Name: Sunny Singh Jadon
Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
April 24, 2025

Question 1: Total Trainable Parameters in a Feedfor-


ward Neural Network
Problem Statement
A feedforward neural network is given with:

• p-dimensional input

• m hidden layers

• k hidden units per layer

• A scalar output (i.e., one neuron in the output layer)

• Bias terms are ignored.

Determine the total number of trainable parameters (weights).

Solution
1. Input to First Hidden Layer:

• The first hidden layer receives input from p features.

• Each hidden unit in the first layer has weights from all p input dimensions.

• Number of weights:
p×k

2. Hidden Layer Connections:

• Each of the m hidden layers has k units.

• Each unit in a layer is connected to all k units in the previous layer.

• There are m − 1 such hidden layer transitions.

• Number of weights:
(m − 1) × k 2

1
3. Last Hidden Layer to Output Layer:

• The output layer has a single neuron.

• It is connected to all k units in the last hidden layer.

• Number of weights:
k

Total Trainable Parameters:

Total Parameters = pk + (m − 1)k 2 + k

Correct Answer:
pk + (m − 1)k 2 + k

Question 2: Gradient of a Neural Network Layer with


ReLU Activation
Problem Statement
Consider a neural network layer:

y = ReLU(Wx)
where:

• x ∈ Rp (input)

• y ∈ Rd (output)

• W ∈ Rd×p (weight matrix)

• ReLU activation function:

ReLU(z) = max(0, z)

(applied element-wise)

Find the gradient:


∂yi
∂Wij
for i = 1, . . . , d and j = 1, . . . , p.

2
Solution
1. ReLU Activation:
• If z = Wi · x is positive: ReLU is active, so the derivative is 1.
• If z = Wi · x ≤ 0: ReLU is inactive, so the derivative is 0.
∂yi
2. Computing ∂Wij

• Since:
p
!
X
yi = ReLU Wik xk ,
k=1

• Its derivative w.r.t. Wij is:


p
!
X
I Wik xk > 0 xj
k=1

• This means:
– If pk=1 Wik xk > 0, the gradient is xj .
P

– Otherwise, the gradient is 0.


Correct Answer:
p
!
X
I Wik xk > 0 xj
k=1

Question 3: Gradient Dependencies in a Two-Layer


Neural Network
Problem Statement
Consider a two-layered neural network:
y = σ(W (B) σ(W (A) x))
where:
• W (A) and W (B) are weight matrices.
• h = σ(W (A) x) is the hidden layer representation.
• σ is the activation function.
• ∇g (f ) denotes the gradient of f w.r.t. g.
Determine which of the following statements are true:
• ∇h (y) depends on W (A)
• ∇W (A) (y) depends on W (B)
• ∇W (A) (h) depends on W (B)
• ∇W (B) (y) depends on W (A)

3
Solution
1. Gradient of Output w.r.t. Hidden Representation:

∇h (y) = W (B) σ ′ (W (A) x)

This depends on W (B) , not W (A) , so the first option is incorrect.


2. Gradient of Output w.r.t. W (A) :

∇W (A) (y) = ∇h (y) · ∇W (A) (h) = W (B) σ ′ (W (A) x)xT

This depends on W (B) , so the second option is correct.


3. Gradient of Hidden Representation w.r.t. W (A) :

∇W (A) (h) = σ ′ (W (A) x)xT

This does not depend on W (B) , so the third option is incorrect.


4. Gradient of Output w.r.t. W (B) :

∇W (B) (y) = h

Since h depends on W (A) , the fourth option is correct.

Final Answer
Accepted Answers: ∇W (A) (y) depends on W (B) , ∇W (B) (y) depends on W (A)

Question 4: Neural Network Weight Initialization


Problem Statement
Which of the following statements about weight initialization in a network using the
sigmoid activation function are true?

• Two different initializations of the same network could converge to different minima.

• For a given initialization, gradient descent will converge to the same minima irre-
spective of the learning rate.

• Initializing all weights to the same constant value leads to undesirable results.

• Initializing all weights to very large values leads to undesirable results.

Solution
1. Different Initializations Lead to Different Minima:
• Neural networks have non-convex loss surfaces.

• Different initializations may lead to different local minima.

• This statement is true.


2. Convergence to the Same Minima Regardless of Learning Rate:

4
• The learning rate affects the optimization trajectory.

• High learning rates may lead to divergence or poor convergence.

• Different learning rates may lead to different minima.

• This statement is false.

3. Initializing All Weights to the Same Constant Value:

• If all weights are initialized to the same value, all neurons will produce the same
output.

• This leads to symmetry in gradients, preventing effective learning.

• This statement is true.

4. Initializing Weights to Very Large Values:

• Large weights cause extreme values in the sigmoid activation.

• This leads to vanishing gradients due to saturation.

• This statement is true.

Final Answer
Accepted Answers:

• Two different initializations of the same network could converge to different minima.

• Initializing all weights to the same constant value leads to undesirable results.

• Initializing all weights to very large values leads to undesirable results.

Question 5: Derivatives of Sigmoid and Tanh Activa-


tion Functions
Problem Statement
Consider the following statements about the derivatives of the sigmoid σ(x) and hyper-
bolic tangent tanh(x) activation functions:

1 exp(x) − exp(−x)
σ(x) = , tanh(x) =
1 + exp(−x) exp(x) + exp(−x)
Which of the following statements are true?

• σ ′ (x) = σ(x)(1 − σ(x))

• 0 < σ ′ (x) ≤ 1
4

• tanh′ (x) = 12 (1 − (tanh(x))2 )

• 0 < tanh′ (x) ≤ 1

5
Solution
1. Derivative of Sigmoid Function:

σ ′ (x) = σ(x)(1 − σ(x))

This is a well-known result, so the first statement is true.


2. Range of σ ′ (x):

• Since σ(x) is always between 0 and 1, its derivative is maximized when σ(x) = 0.5.

• This gives σ ′ (x) = 0.25, meaning 0 < σ ′ (x) ≤ 14 .

• Thus, the second statement is true.

3. Derivative of Tanh Function:

tanh′ (x) = 1 − (tanh(x))2

The given formula tanh′ (x) = 21 (1 − (tanh(x))2 ) is incorrect (the factor 1


2
is unnecessary).
Therefore, the third statement is false.
4. Range of tanh′ (x):

• The function tanh′ (x) is maximum at x = 0, where tanh(0) = 0 and tanh′ (0) = 1.

• The minimum value is close to 0 for large positive or negative x.

• Thus, 0 < tanh′ (x) ≤ 1, making the fourth statement true.

Final Answer
Accepted Answers:

• σ ′ (x) = σ(x)(1 − σ(x))

• 0 < σ ′ (x) ≤ 1
4

• 0 < tanh′ (x) ≤ 1

Question 6: MLE of p in a Geometric Distribution


Problem Statement
A geometric distribution is defined by the probability mass function:

f (x; p) = (1 − p)(x−1) p, x = 1, 2, . . .

Given the samples [4, 5, 6, 5, 4, 3] drawn from this distribution, find the maximum likeli-
hood estimate (MLE) of p.

6
Solution
The MLE for p in a geometric distribution is given by:
1
p̂ =

where x̄ is the sample mean.
Step 1: Compute Sample Mean
4+5+6+5+4+3 27
x̄ = = = 4.5
6 6
Step 2: Compute MLE of p
1
p̂ = ≈ 0.222
4.5

Final Answer
Accepted Answer: 0.222

Question 7: Maximum A Posteriori (MAP) Estima-


tion for a Bernoulli Distribution
Problem Statement
Consider a Bernoulli distribution with p = 0.7 (true value of the parameter). We compute
a MAP estimate of p by assuming a prior distribution over p. Let N (µ, σ 2 ) denote a
Gaussian distribution with mean µ and variance σ 2 .
Which of the following statements are true?

• If the prior is N (0.6, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.4, 0.1).

• If the prior is N (0.4, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.6, 0.1).

• With a prior of N (0.1, 0.001), the estimate will never converge to the true value,
regardless of the number of samples used.

• With a prior of U (0, 0.5) (uniform between 0 and 0.5), the estimate will never
converge to the true value, regardless of the number of samples used.

Solution
1. Effect of the Prior Mean:
MAP estimation incorporates prior knowledge. If the prior mean is closer to the true
value (p = 0.7), the estimator converges faster. Since N (0.6, 0.1) is closer to 0.7 than
N (0.4, 0.1), fewer samples will be needed to converge, making the first statement true.
2. Incorrect Claim about N (0.4, 0.1):
Since 0.4 is farther from 0.7 than 0.6, this statement is false.

7
3. Effect of a Strongly Biased Prior:
A prior of N (0.1, 0.001) is highly concentrated near 0.1. While it slows down conver-
gence, given sufficient data, the MAP estimate will eventually reach the true value. This
statement is false.
4. Effect of a Uniform Prior on [0, 0.5]:
Since the uniform prior does not include 0.7 in its support, the MAP estimate will never
reach 0.7, regardless of the number of samples. This statement is true.

Final Answer
Accepted Answers:

• If the prior is N (0.6, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.4, 0.1).

• With a prior of U (0, 0.5), the estimate will never converge to the true value, regard-
less of the number of samples used.

Question 8: Parameter Estimation Techniques


Problem Statement
Which of the following statements about parameter estimation techniques are true?

• To obtain a distribution over the predicted values for a new data point, we need to
compute an integral over the parameter space.

• The MAP estimate of the parameter gives a point prediction for a new data point.

• The MLE of a parameter gives a distribution of predicted values for a new data
point.

• We need a point estimate of the parameter to compute a distribution of the pre-


dicted values for a new data point.

Solution
1. Bayesian Approach and Integral Computation:
To obtain a predictive distribution, we integrate over the parameter space using Bayesian
inference: Z
P (y|x) = P (y|x, θ)P (θ|D)dθ

This means the first statement is true.


2. MAP Estimate as a Point Prediction:
The MAP estimate is defined as:

θ̂M AP = arg max P (θ|D)


θ

Since it provides a single best estimate, it gives a point prediction, making the second
statement true.

8
3. Incorrect MLE Claim:
The MLE finds the most likely parameter value but does not provide a distribution over
predicted values. This statement is false.
4. Incorrect Requirement for Point Estimate:
A full Bayesian approach computes distributions directly without needing a point esti-
mate, making this statement false.

Final Answer
Accepted Answers:
• To obtain a distribution over the predicted values for a new data point, we need to
compute an integral over the parameter space.
• The MAP estimate of the parameter gives a point prediction for a new data point.

Question 9: Minimizing Cross-Entropy Loss


Problem Statement
In classification settings, it is common in machine learning to minimize the discrete cross-
entropy loss:
X
HCE (p, q) = − pi log qi
i

where pi and qi are the true and predicted distributions, respectively. Given this,
which of the following statements are true?

• Minimizing HCE (p, q) is equivalent to minimizing the (self) entropy H(q).


• Minimizing HCE (p, q) is equivalent to minimizing HCE (q, p).
• Minimizing HCE (p, q) is equivalent to minimizing the KL divergence DKL (p||q).
• Minimizing HCE (p, q) is equivalent to minimizing the KL divergence DKL (q||p).

Solution
1. Connection to KL Divergence:
The cross-entropy loss can be rewritten using the KL divergence:

HCE (p, q) = H(p) + DKL (p||q)


Since the true distribution p is fixed, minimizing HCE (p, q) is equivalent to minimizing
DKL (p||q), making the third statement true.
2. Incorrect Statements:
• The self-entropy H(q) refers to the entropy of the predicted distribution and is
unrelated to the loss function.
• Cross-entropy is not symmetric, so HCE (p, q) ̸= HCE (q, p).
• KL divergence is asymmetric, meaning DKL (p||q) ̸= DKL (q||p).

9
Final Answer
Accepted Answer: Minimizing HCE (p, q) is equivalent to minimizing DKL (p||q).

Question 10: Activation Functions in Neural Networks


Problem Statement
Which of the following statements about activation functions are NOT true?

• Non-linearity of activation functions is not a necessary criterion when designing


very deep neural networks.

• Saturating non-linear activation functions (derivative → 0 as x → ±∞) avoid the


vanishing gradients problem.

• Using the ReLU activation function avoids all problems arising due to gradients
being too small.

• The dead neurons problem in ReLU networks can be fixed using a leaky ReLU
activation function.

Solution
1. Incorrect Claim About Non-Linearity:
Non-linearity is crucial for deep networks; otherwise, the entire network collapses into a
linear function. The first statement is false.
2. Incorrect Claim About Saturating Activation Functions:
Saturating activation functions (e.g., sigmoid, tanh) cause vanishing gradients, making
optimization difficult. The second statement is also false.
3. Incorrect Claim About ReLU Solving All Gradient Issues:
While ReLU mitigates vanishing gradients, it still suffers from the dying ReLU problem
where neurons output zero permanently. Thus, the third statement is false.
4. Correct Statement About Leaky ReLU:
Leaky ReLU assigns a small slope for negative inputs, preventing neurons from completely
dying. The fourth statement is true.

Final Answer
Accepted Answers:

• Non-linearity of activation functions is not a necessary criterion when designing


very deep neural networks.

• Saturating non-linear activation functions (derivative → 0 as x → ±∞) avoid the


vanishing gradients problem.

• Using the ReLU activation function avoids all problems arising due to gradients
being too small.

10
Machine Learning Assignment 6 Solution
Name: Sunny Singh Jadon
Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
April 24, 2025

Q1: Decision Tree as an Unsupervised Learning Algo-


rithm
Statement: Decision Tree is an unsupervised learning algorithm.
Reason: The splitting criterion uses only the features of the data to calculate their
respective measures.

Explanation:
A Decision Tree is a supervised learning algorithm because it requires both input
features X and corresponding labels Y . The tree is built using splitting criteria that aim
to minimize measures such as:

• Gini Impurity:
c
X
Gini = 1 − p2i
i=1

• Entropy (Information Gain):


c
X
H(S) = − pi log2 (pi )
i=1

Since these measures rely on class labels, Decision Trees are not unsupervised. Thus,
both the statement and the reason are false.
Final Answer:
Statement is False. Reason is False.

Q2: Effect of Increasing Pruning Strength in Decision


Trees
Concept of Pruning:

• Pruning reduces the depth of the tree to avoid overfitting.

1
• Reducing the maximum depth prevents the tree from memorizing noise.

Impact of Increasing Pruning Strength:

• If set too aggressively, the tree becomes too simple and underfits.

• If set optimally, the tree generalizes well.

• If not pruned at all, the tree may overfit.

Final Answer:

Might lead to underfitting if set too aggressively.

Q3: Indicator of Overfitting in a Decision Tree


What is Overfitting? Overfitting occurs when:

• Training accuracy is high but validation accuracy is low.

• The model learns noise instead of actual patterns.

Why Is This the Best Indicator? If a Decision Tree has:

• High training accuracy but low validation accuracy, it is overfitting.

Final Answer:

The training accuracy is high while the validation accuracy is low.

Q4: Decision Trees and Their Properties


Statement 1: Decision Trees are linear non-parametric models.
Statement 2: A decision tree may be used to explain the complex function learned
by a neural network.

Analysis:
• Statement 1 is false because Decision Trees are non-parametric but not linear.

• Statement 2 is true because a Decision Tree can approximate a Neural Network’s


decision-making process.

Final Answer:

Statement 1 is False, but Statement 2 is True.

2
Q5: Entropy for a 50-50 Split
Entropy Formula for Binary Classification:
c
X
H(S) = − pi log2 pi
i=1

For a 50-50 split (p1 = 0.5, p2 = 0.5):

H = −(0.5 log2 0.5 + 0.5 log2 0.5)


Since:

log2 0.5 = −1

H = −(0.5 × (−1) + 0.5 × (−1)) = 1


Final Answer:
1

Q6: Finding the Best Split for a Categorical Attribute


Given a dataset with only one categorical attribute having n = 10 unordered values, we
need to determine the number of possible combinations to find the best split point in a
decision tree classifier.

Formula for Binary Splits


In a decision tree, a categorical attribute with n unique values can be split into two
non-empty disjoint subsets. The number of ways to form such partitions is given by:

Total possible binary splits = 2(n−1) − 1

Substituting n = 10
For our case:

2(10−1) − 1 = 29 − 1

= 512 − 1

= 511

Explanation
- The total number of subsets for n elements is 2n , including the empty set. - Since we
exclude the empty subset, we get 2n − 1. - However, each split is determined by choosing
**one subset**, and the other subset is automatically its complement. - This results in
2(n−1) − 1 valid splits, ensuring non-trivial partitions.

3
Final Answer
511
Thus, the number of possible combinations needed to find the best split point is
**511**.

Question 7: Initial Entropy of Malignant


Given the dataset, the target variable is **Malignant** (0 or 1). The entropy formula is:

H(S) = −p1 log2 p1 − p0 log2 p0


where:

• p1 is the proportion of malignant cases (1s).

• p0 is the proportion of non-malignant cases (0s).

From the dataset:

Total samples = 10

Malignant (1s) = 4, Non-malignant (0s) = 6

4 6
p1 = = 0.4, p0 = = 0.6
10 10
Substituting into the entropy formula:

H(S) = −(0.4 log2 0.4) − (0.6 log2 0.6)


Using logarithm values:

log2 0.4 ≈ −1.322, log2 0.6 ≈ −0.737

H(S) = −(0.4 × (−1.322)) − (0.6 × (−0.737))

H(S) = 0.5288 + 0.4422 = 0.9798


Thus, the initial entropy of **Malignant** is:

0.9798

Question 8: Information Gain of Vaccination


Entropy after Splitting on Vaccination
Vaccination has two possible values: **0** and **1**.

4
Subset for Vaccination = 0
- Count: 5 samples - Malignant (1s) = 4, Non-malignant (0s) = 1
4 1
p1 = = 0.8, p0 = = 0.2
5 5

H(S0 ) = −(0.8 log2 0.8) − (0.2 log2 0.2)

H(S0 ) = −(0.8 × (−0.322)) − (0.2 × (−2.322))

H(S0 ) = 0.2576 + 0.4644 = 0.722

Subset for Vaccination = 1


- Count: 5 samples - Malignant (1s) = 0, Non-malignant (0s) = 5

H(S1 ) = −(0 log2 0) − (1 log2 1) = 0

Weighted Entropy After Splitting


   
5 5
Hsplit = × H(S0 ) + × H(S1 )
10 10

Hsplit = (0.5 × 0.722) + (0.5 × 0)

Hsplit = 0.361

Information Gain Calculation


IG = H(S) − Hsplit

IG = 0.9798 − 0.361

IG = 0.4763
Thus, the **information gain of Vaccination** is:

0.4763

5
Machine Learning Assignment 7 Solution
Name: Sunny Singh Jadon
Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
April 24, 2025

Question 1: Evaluation of Machine Learning Models


Question: Which of the following statement(s) regarding the evaluation of Machine
Learning models is/are true?

1. A model with a lower training loss will perform better on a validation dataset.

2. A model with a higher training accuracy will perform better on a validation dataset.

3. The train and validation datasets can be drawn from different distributions.

4. The train and validation datasets must accurately represent the real distribution of
data.

Correct Answer: Option 4: The train and validation datasets must accurately
represent the real distribution of data.
Explanation:

• Training loss vs. Validation performance: A lower training loss does not
necessarily mean better validation performance. If the model overfits the training
data, it may perform poorly on unseen data.

• Training accuracy vs. Validation accuracy: High training accuracy does not
always guarantee high validation accuracy due to overfitting.

• Data Distribution: The train and validation datasets must be drawn from the
same distribution. If they are from different distributions, the validation set will
not provide meaningful insights into model performance.

Question 2: Stratified Sampling in Train-Test Split


Question: Suppose we have a classification dataset with two classes:

• Class A: 200 samples

• Class B: 40 samples

1
Using stratified sampling, which of the following train-test splits would be appropriate?

1. Train-{A:50, B:10}, Test-{A:150, B:30}

2. Train-{A:50, B:30}, Test-{A:150, B:10}

3. Train-{A:150, B:30}, Test-{A:50, B:10}

4. Train-{A:150, B:10}, Test-{A:50, B:30}

Correct Answer: Option 3: Train-{A:150, B:30}, Test-{A:50, B:10}


Explanation:

• In stratified sampling, we must preserve the same class proportions in both training
and testing sets.

• The original dataset has a proportion of 200


240
for Class A and 40
240
for Class B.

• The correct train-test split should maintain these proportions.

• Checking the given options, Option 3 correctly follows this principle.

Question 3: Cross-Validation in Multiclass Classifica-


tion
Question: Suppose we are performing cross-validation on a multiclass classification
dataset with N data points. Which of the following statements are correct?

1. In k-fold cross-validation, we train k − 1 different models and evaluate them on the


same test set.

2. In k-fold cross-validation, we train k different models and evaluate them on different


test sets.

3. In k-fold cross-validation, each fold should have a class-wise proportion similar to


the given dataset.

4. In LOOCV (Leave-One-Out Cross Validation), we train N different models, using


N − 1 data points for training each model.

Correct Answer: Options 2, 3, and 4:

• Option 2: In k-fold cross-validation, we train k different models and evaluate them


on different test sets. ( Correct)

• Option 3: Each fold in k-fold cross-validation should maintain class proportions


similar to the dataset. ( Correct)

• Option 4: In Leave-One-Out Cross Validation (LOOCV), we train N different


models, each using N − 1 training points. ( Correct)

Explanation:

2
• k-fold Cross-Validation: In this technique, the dataset is split into k parts (folds),
and the model is trained on k − 1 folds while testing on the remaining one. This
process is repeated k times with different test sets.

• LOOCV (Leave-One-Out Cross Validation): A special case of cross-validation


where each data point serves as the test set once, and the model is trained on all
remaining N − 1 data points.

• Incorrect Option 1: The statement ”In k-fold cross-validation, we train k − 1


models and evaluate on the same test set” is incorrect because each fold gets a
different test set.

Question 4: Maximizing Recall in a Binary Classifica-


tion Problem
Definition of Recall
Recall (also called Sensitivity or True Positive Rate) is defined as:

True Positives (TP)


Recall =
True Positives (TP) + False Negatives (FN)

Given Confusion Matrices and Recall Calculation


We analyze the confusion matrices in the format:
 
TP FN
FP TN

First Classifier
 
4 6
13 77
4 4
Recall = = = 0.4
4+6 10

Second Classifier
 
8 2
40 60
8 8
Recall = = = 0.8
8+2 10

Third Classifier
 
5 5
9 81
5 5
Recall = = = 0.5
5+5 10

3
Fourth Classifier
 
7 3
0 90
7 7
Recall = = = 0.7
7+3 10

Conclusion
Since the highest recall value is **0.8**, we choose the **second classifier**:
 
8 2
40 60

Question 5: Minimizing the False Positive Rate (FPR)


Definition of False Positive Rate (FPR)
The False Positive Rate (FPR) is calculated as:

False Positives (FP)


FPR =
False Positives (FP) + True Negatives (TN)
where: - **False Positives (FP):** Incorrectly predicted positive cases. - **True
Negatives (TN):** Correctly predicted negative cases.

Given Confusion Matrices


The confusion matrices provided in Q.4 are:
 
4 6
CM1 =
6 84
 
8 2
CM2 =
13 77
 
1 9
CM3 =
2 88
 
10 0
CM4 =
4 86

Computing FPR for Each Classifier


6 6
F P R1 = = = 0.0667
6 + 84 90
13 13
F P R2 = = = 0.144
13 + 77 90
2 2
F P R3 = = = 0.022
2 + 88 90
4 4
F P R4 = = = 0.044
4 + 86 90

4
Conclusion: Choosing the Classifier with Minimum FPR
Since the third classifier has the **lowest FPR** of **0.022**, we choose:
 
1 9
CM3 =
2 88

Question 6: Maximizing Precision


Definition of Precision
Precision is calculated as:

True Positives (TP)


Precision =
True Positives (TP) + False Positives (FP)
where: - **True Positives (TP):** Correctly predicted positive cases. - **False Posi-
tives (FP):** Incorrectly predicted positive cases.

Computing Precision for Each Classifier


4 4
Precision1 = = = 0.4
4+6 10
8 8
Precision2 = = = 0.381
8 + 13 21
1 1
Precision3 = = = 0.333
1+2 3
10 10
Precision4 = = = 0.714
10 + 4 14

Conclusion: Choosing the Classifier with Maximum Precision


Since the **fourth classifier** has the **highest Precision** of **0.714**, we choose:
 
10 0
CM4 =
4 86

Question 7: Maximizing the F1-score


Definition of F1-score
The F1-score is the harmonic mean of precision and recall, defined as:
Precision × Recall
F1 = 2 ×
Precision + Recall

5
TP TP
where: - **Precision** = T P +F P
- **Recall** = T P +F N

Given Confusion Matrices


The confusion matrices from Q.4 are:
 
4 6
CM1 =
6 84
 
8 2
CM2 =
3 87
 
1 9
CM3 =
2 88
 
10 0
CM4 =
4 86

Computing Precision and Recall for Each Classifier


For CM4:
10
Precision = = 1.0
10 + 0
10 10
Recall = = = 0.714
10 + 4 14
1.0 × 0.714
F1 = 2 × = 0.833
1.0 + 0.714

Conclusion: Choosing the Classifier with Maximum F1-score


Since **CM4 has the highest F1-score**, we choose:
 
10 0
CM4 =
4 86

Question 8: Statements about Boosting


Understanding Boosting
Boosting is an ensemble learning method that combines multiple weak classifiers to form
a strong classifier. It works by giving more weight to misclassified instances in successive
iterations.

Evaluating the Given Statements


1. **Boosting is an example of an ensemble method** ( Correct) - Boosting combines
weak classifiers iteratively to improve overall performance.
2. **Boosting assigns equal weights to the predictions of all weak classifiers** (
Incorrect) - Boosting dynamically adjusts weights, giving more weight to misclassified
samples.

6
3. **Boosting may assign unequal weights to the predictions of all the weak classi-
fiers** ( Correct) - Unlike bagging, boosting assigns different weights to different classi-
fiers.
4. **The individual classifiers in boosting can be trained parallelly** ( Incorrect) -
Boosting trains classifiers sequentially, as each classifier depends on the previous one’s
errors.
5. **The individual classifiers in boosting cannot be trained parallelly** ( Correct) -
Since boosting is sequential, training is not parallelizable.

Conclusion: Correct Statements


The following statements are correct:

• Boosting is an example of an ensemble method.

• Boosting may assign unequal weights to the predictions of all the weak classifiers.

• The individual classifiers in boosting cannot be trained parallelly.

Question 9: Statements about Bagging


Understanding Bagging
Bagging (Bootstrap Aggregating) is an ensemble learning technique where multiple clas-
sifiers are trained on different subsets of the dataset. It is mainly used to reduce variance
and prevent overfitting.

Evaluating the Given Statements


1. **Bagging is an example of an ensemble method** ( Correct) - Bagging is a popular
ensemble technique that combines multiple models.
2. **The individual classifiers in bagging can be trained in parallel** ( Correct) -
Since each classifier is trained on different bootstrap samples independently, training can
be parallelized.
3. **Training sets are constructed from the original dataset by sampling with replace-
ment** ( Correct) - Bagging uses bootstrapped datasets, meaning that sampling is done
with replacement.
4. **Training sets are constructed from the original dataset by sampling without
replacement** ( Incorrect) - Sampling without replacement is not used in bagging but in
other techniques like pasting.
5. **Bagging increases the variance of an unstable classifier** ( Incorrect) - Bagging
is designed to reduce variance, not increase it.

7
Conclusion: Correct Statements
The following statements are correct:

• Bagging is an example of an ensemble method.

• The individual classifiers in bagging can be trained in parallel.

• Training sets are constructed from the original dataset by sampling with replace-
ment.

Question 10: Statements about Ensemble Methods


Understanding Ensemble Methods
Ensemble methods combine multiple models to achieve better predictive performance
than individual models.

Evaluating the Given Statements


1. **Ensemble aggregation methods like bagging aim to reduce overfitting and variance**
( Correct) - Bagging reduces variance by training models on different bootstrap samples.
2. **Committee machines may consist of different types of classifiers** ( Correct) -
A committee machine is a type of ensemble method that can use different classifiers like
decision trees, SVMs, or neural networks.
3. **Weak learners are models that perform slightly worse than random guessing**
( Incorrect) - Weak learners are models that perform slightly better (not worse) than
random guessing.
4. **Stacking involves training multiple models and stacking their predictions into new
training data** ( Correct) - Stacking is an ensemble learning technique where predictions
from multiple base models are combined using a meta-model.

Conclusion: Correct Statements


The following statements are correct:

• Ensemble aggregation methods like bagging aim to reduce overfitting and variance.

• Committee machines may consist of different types of classifiers.

• Stacking involves training multiple models and stacking their predictions into new
training data.

8
Machine Learning Assignment 8 Solution
Name: Sunny Singh Jadon
Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
April 24, 2025

Q1: Random Forests


Question: Which of these statements is/are True about Random Forests?

• The goal of random forests is to increase the correlation between the trees.

• The goal of random forests is to decrease the correlation between the trees.

• In Random Forests, each decision tree fits the residuals from the previous one; thus,
the correlation between the trees won’t matter.

• None of these.

Correct Answer: The goal of random forests is to decrease the correlation between
the trees.
Explanation:

• Random Forests is an ensemble learning technique that combines multiple decision


trees to improve prediction accuracy and reduce overfitting.

• To ensure diverse decision trees, Random Forests introduce randomness by:

1. Sampling different subsets of the training data (Bagging - Bootstrap Aggrega-


tion).
2. Selecting a random subset of features for each tree to split on.

• The key idea is to make the individual trees as uncorrelated as possible so that
their combined predictions are more robust. A lower correlation among trees leads
to better generalization.

• The third option describes Gradient Boosting rather than Random Forests, where
trees are sequentially trained to correct previous errors.

1
Q2: Gradient Boosted Decision Trees
Question: Consider the two statements:

1. Gradient Boosted Decision Trees can overfit easily.

2. It is easy to parallelize Gradient Boosted Decision Trees.

Which of these are true?

• Both the statements are True.

• Statement 1 is true, and statement 2 is false.

• Statement 1 is false, and statement 2 is true.

• Both the statements are false.

Correct Answer: Statement 1 is true, and statement 2 is false.


Explanation:

• Statement 1: True – Gradient Boosting works by sequentially training trees to


correct the mistakes of previous trees, which can lead to overfitting, especially if
too many trees are used or if the learning rate is too high.

• Statement 2: False – Unlike Random Forests, which train trees independently


and can be easily parallelized, Gradient Boosting is inherently sequential. Each
new tree depends on the residual errors of previous trees, making parallelization
difficult.

Q3: Naive Bayes Assumption


Question: A dataset with two classes is plotted below.

2
Does the data satisfy the Naı̈ve Bayes assumption?
• Yes

• No ✓

• The given data is insufficient

• None of these
Correct Answer: No.
Explanation:
• The Naı̈ve Bayes classifier assumes that the features are conditionally indepen-
dent given the class label.

• In the scatter plot, the two features appear to have a strong correlation (diagonal
separation between the two classes).

• This violates the independence assumption of Naı̈ve Bayes, making it less effective
for this dataset.

• Therefore, the correct answer is No, as the data does not satisfy the Naı̈ve Bayes
assumption.

Q4: Naı̈ve Bayes Classifier and Handling Unseen Words


Question: Consider the below dataset:

x y
India won the match. Cricket
The Mercedes car was driven by Lewis Hamilton. Formula 1
The ball was driven through the covers for a boundary Cricket
Max Verstappen has a fast car. Formula 1
Bumrah is a fast bowler. Cricket
Max Verstappen won the race Formula 1

Suppose you have to classify a test example “The ball won the race to the boundary”
and are asked to compute

P (Cricket|“The ball won the race to the boundary”)

What issue will you face if you use the Naı̈ve Bayes Classifier, and how will you handle
it? Assume word frequencies are used to estimate all probabilities.

• ◦ There won’t be a problem, and the probability will be equal to 1.

• ◦ Problem: A few words that appear at test time do not appear in the
dataset. Solution: Smoothing. ✓

• ◦ Problem: A few words that appear at test time appear more than once in the
dataset. Solution: Remove those words from the dataset.

• ◦ None of these

3
Correct Answer: Problem: A few words that appear at test time do not appear in
the dataset. Solution: Smoothing.
Explanation:

• In Naı̈ve Bayes, we calculate probabilities of words given a class.

• If a word appears in the test sentence but not in the training data, the probability
becomes zero.

• This leads to a problem known as the zero-frequency problem.

• The solution is to use Laplace Smoothing, where we add a small value (e.g., 1)
to the word count to prevent zero probabilities.

Laplace Smoothing Formula: To estimate the probability of a word w in class C,


we use:

count(w, C) + α
P (w|C) = P ′
all words w′ count(w , C) + αV
where:

• α is a smoothing parameter (typically **1** for Laplace Smoothing).

• V is the vocabulary size (total unique words in the dataset).

• count(w, C) is the number of times w appears in class C.

• ′
P
all words w′ count(w , C) is the total count of words in class C.

Why is Laplace Smoothing Needed?

• If a word in the test set does not appear in the training data, its probability becomes
**zero**.

• This causes the entire Naı̈ve Bayes probability calculation to be **zero** due to
multiplication.

• Laplace smoothing adds α to prevent zero probabilities, ensuring better generaliza-


tion.

Q5: Boosting to Improve Model Accuracy


Question: A company hires you to analyze their classification system, which predicts
whether a customer will buy a product. The current classifier has low accuracy, often
around 60%, sometimes barely above 50%.
Without adding new classifiers, which ensemble method would be best to improve
accuracy?

• ◦ Committee Machine

• ◦ AdaBoost ✓

• ◦ Bagging

4
• ◦ Stacking

Correct Answer: AdaBoost.


Explanation:

• AdaBoost (Adaptive Boosting) works by combining multiple weak classifiers to


create a strong classifier.

• It assigns higher weights to misclassified instances and retrains the model iteratively
to reduce errors.

• Since the classifier is performing just slightly above random guessing (50–60%),
AdaBoost is ideal for improving its accuracy.

• Bagging is more suitable for reducing variance, while stacking and committee ma-
chines involve different models.

Q6: Logistic Regression for Multiclass Classification


Suppose you have a **6-class classification problem** with **one input variable**. You
decide to use **logistic regression** to build a predictive model.
What is the minimum number of parameter pairs (β0 , β) that need to be
estimated?
Explanation: - In **binary logistic regression**, we model the probability of one class
versus the other using:
1
P (y = 1|x) = −(β
1 + e 0 +βx)
where β0 is the intercept and β is the coefficient.
- For **multiclass classification** (with k classes), we need a model that can output
probabilities for each class. - Using the **Softmax Regression** approach, we model:

eβ0i +βi x
P (y = i|x) = Pk
β0j +βj x
j=1 e

- Here, we require k − 1 sets of parameters because one class is taken as the reference.
- Given that we have **6 classes**, we estimate **5 sets** of (β0 , β) pairs.
Answer: 5

Options:

• ()6

• ( ) 12

• () 5

• ( ) 10

5
Q7: Bayesian Network and Conditional Independence
The following Bayesian Network consists of **9 variables**, all of which are binary:

Which of the following is/are always true for the above Bayesian Network?
Explanation:

1. **Option 1: P (A, B|G) = P (A|G)P (B|G)** - This statement is incorrect. - In


Bayesian Networks, two variables A and B are conditionally independent given G
if and only if there is no direct path between them once G is known. - However, in
the given network, A and B are connected through multiple paths and may have
dependencies.

2. **Option 2: P (A, I) = P (A)P (I)** - This is incorrect because **A and I are not
necessarily independent**. - In a Bayesian Network, two variables are independent
if they are not connected directly or indirectly through dependent nodes. - Here,
A and I have possible dependencies through other nodes.

3. **Option 3: P (B|H, E, G) = P (B|E, G)P (H|E, G)** - **() This is correct** be-
cause of the **conditional independence rule** in Bayesian Networks. - Given the
structure, once we condition on E and G, the probability of B and H factorizes as
shown.

4. **Option 4: P (C|B, F ) = P (C|F )** - This is incorrect because C is directly


dependent on B in the network.

Correct Answer: P(B|H, E, G) = P(B|E, G)P(H|E, G)

6
Q8: Naı̈ve Bayes Classification for Phone Type Pre-
diction
Given Data: The table below shows the distribution of phones based on their type and
features:

Type Single SIM 5G Compatibility NFC Total


Budget 15 5 0 20
Mid-Range 20 20 15 30
High End 15 15 15 20

Given a phone with: - Dual SIM (not explicitly given in the table) - NFC =
Yes - 5G = No
We need to calculate the probabilities of this phone being a: 1. Budget phone 2.
Mid-Range phone 3. High-End phone
using **Naı̈ve Bayes Classification** and rank them accordingly.

Step 1: Prior Probabilities


The probability of each phone type is:
20
P (Budget) = = 0.2857
70
30
P (M id − Range) = = 0.4286
70
20
P (High − End) = = 0.2857
70

Step 2: Likelihood Probabilities


For **NFC = Yes**:
0
P (N F C|Budget) = =0
20
15
P (N F C|M id − Range) = = 0.5
30
15
P (N F C|High − End) = = 0.75
20
For **5G = No**:
5 15
P (N o 5G|Budget) = 1 − P (5G|Budget) = 1 − = = 0.75
20 20
20 10
P (N o 5G|M id − Range) = 1 − P (5G|M id − Range) = 1 − = = 0.3333
30 30
15 5
P (N o 5G|High − End) = 1 − P (5G|High − End) = 1 − = = 0.25
20 20

7
Step 3: Compute Posterior Probabilities (Ignoring Dual SIM)
Using Bayes’ Theorem:

P (Budget|N F C, N o 5G) ∝ P (N F C|Budget)P (N o 5G|Budget)P (Budget)

Since P (N F C|Budget) = 0, the probability for Budget phones becomes **zero**.

P (M id − Range|N F C, N o 5G) ∝ 0.5 × 0.3333 × 0.4286 = 0.0714

P (High − End|N F C, N o 5G) ∝ 0.75 × 0.25 × 0.2857 = 0.0536


Step 4: Ranking the Probabilities


From the calculated values:

P (M id − Range) > P (High − End) > P (Budget)


Thus, the correct ranking of phone type from highest to lowest probability is:

Mid-Range, High-End, Budget


Answer:

• ( ) Budget, Mid-Range, High End

• ( ) Budget, High End, Mid-Range

• () Mid-Range, High-End, Budget

• ( ) High End, Mid-Range, Budget

Conclusion
• The **Budget phone** has a probability of zero because it does not support NFC.

• The **Mid-Range phone** has the highest probability because it has a significant
number of phones with NFC and No 5G.

• The **High-End phone** has a lower probability than Mid-Range but is still pos-
sible.

8
Machine Learning Assignment 9 Solution
Name: Sunny Singh Jadon
Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
April 24, 2025

Solution to Question 1
Problem Statement: Consider the Markov Random Field (MRF) given below. We
need to delete one edge (without deleting any nodes) so that in the resulting graph, B
and F are independent given A. Which of these edges could be deleted to achieve this
independence?

Concept: In a Markov Random Field (MRF), two nodes are conditionally inde-
pendent given a set of nodes if removing an edge (or edges) makes them completely
disconnected in the graph when conditioned on the given node.
Analysis of Edge Removals:

• Removing AC: This does not fully disconnect B and F when given A. They
remain connected through other paths.

• Removing BE: Removing edge BE helps in breaking the indirect connection


between B and F , ensuring independence given A.

• Removing CE: Removing edge CE further ensures that any indirect paths be-
tween B and F are eliminated, reinforcing conditional independence.

• Removing AE: Does not fully ensure independence between B and F , as there
are still paths through C and E.

1
Conclusion: The correct edges to remove for ensuring B and F are independent
given A are:
BE, CE

Solution to Question 2
Problem Statement: We need to delete one node (along with its incident edges) so
that in the resulting graph, nodes B and C are independent given A.

Concept: In a Markov Random Field (MRF), two nodes are said to be condition-
ally independent given another node if the only path(s) between them pass through
that node. This means that once we observe the given node, the two nodes should no
longer have any additional paths connecting them.

Understanding the Graph: - The given MRF consists of the nodes: **A, B, C, D,
E, and F**. - The **edges represent dependencies** between the nodes. - The goal is to
ensure that B and C are independent given A.

Step-by-Step Analysis:

1. Check the connections of B and C: - B is directly connected to **A and E**. -


C is directly connected to **A, D, and E**. - Since **B and C are both connected
to E**, this means that **E provides an indirect connection between B and C**
that does not pass through A.

2. Finding the Correct Node to Delete: - If we delete E, the only remaining


connection between B and C would be through A. - This ensures that B and C
become independent when conditioned on A. - Deleting any other node (D or F)
would not effectively break the connection between B and C.

Conclusion: The correct node to remove is:

Solution to Question 3
Problem Statement: Which node in the Markov Random Field has the largest Markov
blanket (i.e., the most number of direct neighbors)?

Concept: In a Markov Random Field (MRF), the Markov blanket of a node con-
sists of all its directly connected neighbors. The Markov blanket contains all the nodes
that shield the given node from the influence of the rest of the network.

Step-by-Step Calculation:

1. Calculate the Markov blanket size for each node:

• Node A: Connected to {B, C, F }. Total: 3

2
• Node B: Connected to {A, E}. Total: 2
• Node C: Connected to {A, D, E}. Total: 3
• Node D: Connected to {C}. Total: 1
• Node E: Connected to {B, C}. Total: 2
• Node F: Connected to {A, C}. Total: 2
2. Finding the Node with the Largest Markov Blanket: - Nodes A and C
have the highest number of neighbors (3). - Hence, they have the **largest Markov
blanket**.
Conclusion: The correct answer is:
A, C

Solution to Question 4
Problem Statement: We need to determine the correct independence relations in the
given Bayesian Network.

Understanding the Structure of the Bayesian Network: The given network


consists of six variables:
• A and B are parent nodes of C.
• D is a child node of A.
• E and F are child nodes of C.
Analyzing the Given Independence Relations:
1. (A) A and B are independent if C is given: - In a Bayesian Network, if two
parent nodes (A and B) share a common child (C) but have no direct edge between
them, they form a **V-structure**:
A→C←B
- If C is given, A and B become dependent due to explaining away effects. - This
statement is false.

3
2. (B) A and B are independent if no other variables are given: - Since there
is no direct edge between A and B, they are marginally independent. - This
statement is true.
3. (C) C and D are not independent if A is given: - C and D are connected
only through A. - Given A, there is no additional path connecting C and D. - This
statement is false.
4. (D) A and F are independent if C is given: - A and F are connected only
through C. - If C is given, then there is no other direct connection between A and
F , making them independent. - This statement is true.
Final Answers: The correct independence relations are:
(B) A and B are independent if no other variables are given.

(D) A and F are independent if C is given.

Solution to Question 5
Problem Statement: We need to calculate the number of independent parameters re-
quired to represent the probability tables in the Bayesian Network, assuming all variables
are binary.

Formula for Independent Parameters: For a Bayesian Network, the number of


independent parameters is computed as:
n
X Y
(ri − 1) rj
i=1 parents(i)
Q
where: - ri is the number of possible values for variable i, - parents(i) rj is the product of
the number of values of its parent nodes.

Step-by-Step Calculation:
• A, B, C, D, E, F are binary (ri = 2).
• Computing parameters for each node:
– P (A): 2 − 1 = 1
– P (B): 2 − 1 = 1
– P (C|A, B): (2 − 1) × 2 × 2 = 4
– P (D|A): (2 − 1) × 2 = 2
– P (E|C): (2 − 1) × 2 = 2
– P (F |C): (2 − 1) × 2 = 2
Total Parameters:
1 + 1 + 4 + 2 + 2 + 2 = 12
Final Answer:
12

4
Solution to Question 6
Problem Statement: We need to calculate the number of independent parameters re-
quired for the Bayesian Network when variables A, C, E have four possible values and
variables B, D, F are binary.

Step-by-Step Calculation:

• rA = 4, rB = 2, rC = 4, rD = 2, rE = 4, rF = 2.

• Computing parameters for each node:

– P (A): (4 − 1) = 3
– P (B): (2 − 1) = 1
– P (C|A, B): (4 − 1) × 4 × 2 = 24
– P (D|A): (2 − 1) × 4 = 4
– P (E|C): (4 − 1) × 4 = 12
– P (F |C): (2 − 1) × 4 = 4

Total Parameters:
3 + 1 + 24 + 4 + 12 + 4 = 48
Final Answer:
48

Solution to Question 7
Problem Statement: In the Bayesian Network from Question 4, assume all variables
can take 4 values. We need to determine the number of independent parameters required
to represent the probability distribution.

Understanding the Calculation: A Bayesian Network represents the joint proba-


bility distribution in a factored form using conditional probabilities. For each node, we
count the number of independent parameters required.

Given Bayesian Network Structure:

• A and B have no parents: P (A), P (B) ⇒ 3 + 3 = 6 parameters.

• D is dependent on A: P (D|A) ⇒ 3 × 4 = 12 parameters.

• C depends on A, B: P (C|A, B) ⇒ (4 × 4) × 3 = 48 parameters.

• E and F depend on C:

– P (E|C) ⇒ 4 × 3 = 12 parameters.
– P (F |C) ⇒ 4 × 3 = 12 parameters.

5
Total Number of Independent Parameters:

6 + 12 + 48 + 12 + 12 = 90

Final Answer:
90

Solution to Question 8
Problem Statement: We need to find valid factorizations to compute the marginal
probability P (E = e) using variable elimination.

Understanding Factorization in a Bayesian Network: The joint probability


distribution is given by:

P (A, B, C, D, E, F ) = P (A)P (B)P (D|A)P (C|A, B)P (E|C)P (F |C)


To compute P (E = e), we marginalize out all other variables:
XXXXX
P (E = e) = P (A, B, C, D, E = e, F )
A B C D F

Valid Factorizations: A valid factorization must follow the conditional dependen-


cies in the Bayesian Network.
P P P P P
1. B P (B) A P (A) D P (D|A) C P (C|A, B) F P (E = e|C)P (F |C) Correct:
This follows the correct order of summing out variables while respecting dependen-
cies.
P P P P P
2. A P (A) D P (D|A) B P (B) C P (C|A, B) F P (E = e|C)P (F |C) Correct:
The order of summation does not violate the dependency structure.
P P P P
3. B P (B) A P (A)P (D|A) D F P (C|A, B)P (E = e|C)P (F |C) Incorrect: The
term P (D|A) is misplaced outside the summation over D.
P P P P P
4. A P (B) B P (D|A) D P (A) F P (C|A, B) C P (E = e|C)P (F |C) Incor-
rect: The term P (B) is misplaced outside its summation.
P P P P P
5. A P (A) B P (B) C P (C|A, B) D P (D|A) F P (E = e|C)P (F |C) Correct:
This follows the correct dependency structure.

Final Answers: Valid options are:

First, Second, and Fifth options

6
Solution 9: Understanding the Markov Random Field
(MRF)

A Markov Random Field (MRF) is an undirected graphical model where the proba-
bility distribution can be expressed as a product of potential functions (ψ) over the
maximal cliques of the graph.

Properties of MRFs:
• Undirected Graph Representation: The relationships between variables are
represented as undirected edges.

• Markov Property: A node is conditionally independent of all other nodes given


its neighbors.

• Factorization Rule: The joint probability distribution P (a, b, c, d, e) is expressed


as a product of potential functions over the maximal cliques.

Step 1: Identify the Maximal Cliques


From the given MRF diagram, the maximal cliques (the largest fully connected sub-
graphs) are:
• Clique 1: (A, B, C, D)

• Clique 2: (B, E)
Therefore, the probability distribution should be factorized as:
1
P (a, b, c, d, e) = ψ1 (a, b, c, d)ψ2 (b, e) (1)
Z
where:
• ψ1 (a, b, c, d) is the potential function for clique (A, B, C, D).

• ψ2 (b, e) is the potential function for clique (B, E).

• Z is the partition function ensuring that the probability distribution sums to 1.

Step 2: Verify the Given Options


Now, let’s analyze the provided options:

7
Option 1:
1
P (a, b, c, d, e) =
ψ1 (a, b, c, d)ψ2 (b, e) (2)
Z
Correct! Matches the identified maximal cliques.

Option 2:
1
P (a, b, c, d, e) = ψ1 (b)ψ2 (a, c, d)ψ3 (a, b, e) (3)
Z
Incorrect!

• ψ1 (b) suggests an independent factor for B, which is incorrect since B has depen-
dencies.

• The factor ψ3 (a, b, e) incorrectly includes A and E together, which is not part of a
maximal clique.

Option 3:
1
P (a, b, c, d, e) = ψ1 (a, b)ψ2 (c, d)ψ3 (b, e) (4)
Z
Incorrect!

• The dependency between A, B, C, D is broken.

• ψ1 (a, b) and ψ2 (c, d) do not represent maximal cliques.

Option 4:
1
P (a, b, c, d, e) = ψ1 (a, b, c, d)ψ2 (c, d)ψ3 (b, d, e) (5)
Z
Incorrect!

• It introduces extra dependencies that do not match the maximal cliques.

Option 5:
1
P (a, b, c, d, e) = ψ1 (a, c)ψ2 (b, e)ψ3 (b, e) (6)
Z
Incorrect!

• Does not properly capture the maximal clique structure.

Option 6:
1
P (a, b, c, d, e) = ψ1 (c)ψ2 (b, e)ψ3 (b, a, d) (7)
Z
Incorrect!

• Incorrectly isolates C instead of including it in the maximal clique.

8
Conclusion
The first option correctly factorizes the probability distribution using the maximal
cliques of the given MRF. Hence, the correct answer is:

1
P (a, b, c, d, e) = ψ1 (a, b, c, d)ψ2 (b, e) (8)
Z

Solution 10 :Understanding the Given Hidden Markov


Model (HMM)

A Hidden Markov Model (HMM) is a statistical model used to represent a sequence of


observable and hidden variables. In this case, the given HMM is used for Part-of-Speech
(POS) Tagging, where:

• Xi represents the hidden states, which correspond to the parts-of-speech (POS)


tags.

• Yi represents the observed states, which correspond to the words in the sen-
tence.

Step 1: Understanding the Structure of HMM


The given HMM consists of three time steps i = 1, 2, 3, structured as follows:

X1 → X2 → X 3
Each Xi generates an observation Yi :

X1 → Y1 , X2 → Y2 , X3 → Y3
This means:

• The sequence of words (Y1 , Y2 , Y3 ) is observed.

• The sequence of POS tags (X1 , X2 , X3 ) is hidden and needs to be predicted.

9
Step 2: Evaluating the Given Statements
Statement 1:
”The Xi variables represent parts-of-speech and the Yi variables represent the
words in the sentence.”

Correct!
This matches the definition of HMM for POS tagging, where the hidden states Xi are
POS tags and the observed states Yi are words.

Statement 2:
”The Yi variables represent parts-of-speech and the Xi variables represent the
words in the sentence.”

Incorrect!
This statement incorrectly swaps the roles of Xi and Yi . In POS tagging, words are
observable (Yi ), and POS tags are hidden (Xi ).

Statement 3:
”The Xi variables are observed and the Yi variables need to be predicted.”

Incorrect!

• Xi represents the hidden POS tags, which are not observed.

• Yi (the words in the sentence) are the observations.

• The goal of POS tagging is to predict Xi given Yi .

Statement 4:
”The Yi variables are observed and the Xi variables need to be predicted.”

Correct!
This correctly describes POS tagging:

• The words (Yi ) in a sentence are observed.

• The hidden POS tags (Xi ) need to be inferred.

Conclusion
The correct answers are:
Statements 1 and 4 are true. (9)

10
Machine Learning Assignment 10 Solution
Name: Sunny Singh Jadon
Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
April 24, 2025

1 Solution of Q.1 :
Single Linkage Clustering is a hierarchical clustering method where the distance between
two clusters is defined as the minimum distance between any two points in the clusters.

Given Distance Matrix


P1 P2 P3 P4 P5 P6
P1 0 3 8 9 5 4
P2 3 0 9 8 10 9
P3 8 9 0 1 6 7
P4 9 8 1 0 7 8
P5 5 10 6 7 0 2
P6 4 9 7 8 2 0

Step-by-Step Clustering Process


Step 1: Finding the Closest Pair
• The smallest distance is 1 (between P3 and P4).
• Merge P3 and P4 into a single cluster: {P3, P4}.
Clusters: { P1 }, { P2 }, { P3, P4 }, { P5 }, { P6 }

Step 2: Merging the Next Closest Pair


• The next smallest distance is 2 (between P5 and P6).
• Merge P5 and P6 into a new cluster: {P5, P6}.
Clusters: { P1 }, { P2 }, { P3, P4 }, { P5, P6 }

Step 3: Merging P1 and P2


• The next smallest distance is 3 (between P1 and P2).
• Merge P1 and P2 into a cluster: {P1, P2}.
Clusters: { P1, P2 }, { P3, P4 }, { P5, P6 }

1
Step 4: Merging (P3, P4) with P5, P6
• The next smallest distance is 6 (between {P3, P4} and P5).

• Merge {P3, P4} with {P5, P6}.

Clusters: { P1, P2 }, { P3, P4, P5, P6 }

Final Merge
• The final merge is 8 (between {P1, P2} and {P3, P4, P5, P6}).

Final Cluster: { P1, P2, P3, P4, P5, P6 }

Conclusion
Thus, the correct hierarchical clustering dendrogram follows this order, and the correct
answer is **Option B**.

2 Solution of Q.2
Complete Linkage Clustering is a hierarchical clustering method where the distance be-
tween two clusters is defined as the maximum distance between any two points in the
clusters.

Given Distance Matrix


P1 P2 P3 P4 P5 P6
P1 0 3 8 9 5 4
P2 3 0 9 8 10 9
P3 8 9 0 1 6 7
P4 9 8 1 0 7 8
P5 5 10 6 7 0 2
P6 4 9 7 8 2 0

Step-by-Step Clustering Process


Step 1: Finding the Closest Pair
• The smallest distance is 1 (between P3 and P4).

• Merge P3 and P4 into a single cluster: {P3, P4}.

Clusters: { P1 }, { P2 }, { P3, P4 }, { P5 }, { P6 }

Step 2: Merging the Next Closest Pair


• The next smallest distance is 2 (between P5 and P6).

• Merge P5 and P6 into a new cluster: {P5, P6}.

Clusters: { P1 }, { P2 }, { P3, P4 }, { P5, P6 }

2
Step 3: Merging P3, P4 with P1
• The next smallest maximum distance is between {P3, P4} and P1.

• Merge {P3, P4} and P1.

Clusters: { P3, P4, P1 }, { P2 }, { P5, P6 }

Step 4: Merging P1, P3, P4 with P2


• The next smallest maximum distance is between {P3, P4, P1} and P2.

• Merge {P3, P4, P1} and P2.

Clusters: { P1, P2, P3, P4 }, { P5, P6 }

Final Merge
• The final merge happens between {P1, P2, P3, P4} and {P5, P6}.

Final Cluster: { P1, P2, P3, P4, P5, P6 }

Conclusion
Thus, the correct hierarchical clustering dendrogram follows this order, and the correct
answer is **Option D**.

3 Solution for Q.3


In the BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies)
algorithm, a cluster feature (CF) is represented by:

• N : Number of points in the cluster

• SU M : Sum of the data points in the cluster

• SS: Sum of squared data points in the cluster

The radius of a cluster is defined as:


s  2
SS SU M
Radius = − (1)
N N
This represents the root mean square deviation of the points from the centroid.

3
Finding the Radius of the Merged Cluster C
Given two clusters A and B with their respective N , SU M , and SS values, when they
are merged to form a new cluster C, the combined cluster parameters are:

NC = NA + NB (2)

SU MC = SU MA + SU MB (3)

SSC = SSA + SSB (4)


Now, using the radius formula for the merged cluster:
s  2
SSC SU MC
RadiusC = − (5)
NC NC
Substituting SSC and SU MC :
s  2
SSA + SSB SU MA + SU MB
RadiusC = − (6)
NA + NB NA + NB
This matches the third option in the given question.

Final Answer
s  2
SSA + SSB SU MA + SU MB
− (7)
NA + NB NA + NB
This is the correct formula for computing the radius of the merged cluster in
BIRCH.

Solution to Question 4
Given Statements:

1. Statement 1: CURE is robust to outliers.

2. Statement 2: Because of multiplicative shrinkage, the effect of outliers is dampened.

Understanding CURE (Clustering Using Representatives)


CURE is a hierarchical clustering algorithm that is designed to be robust against out-
liers. Unlike traditional clustering methods, which use a single centroid to represent a
cluster, CURE selects multiple representative points from the dataset. These represen-
tative points are then shrunk towards the mean of the cluster using a shrinking factor,
which helps in reducing the influence of outliers.

4
Verifying Statement 1
CURE effectively handles outliers by:

• Selecting multiple representative points.

• Applying a shrinkage factor to move the representative points towards the cluster
center.

Thus, Statement 1 is true.

Verifying Statement 2
Multiplicative shrinkage reduces the influence of extreme points by shifting representa-
tive points towards the mean of the cluster. This ensures that outliers do not have a
significant impact on the clustering process.

Thus, Statement 2 is also true.

Does Statement 2 Explain Statement 1?


Yes, because the shrinkage operation directly contributes to the robustness of CURE
against outliers. By reducing the influence of extreme points, the algorithm maintains
meaningful cluster structures.

Final Answer:
Statement 1 is true. Statement 2 is true. Statement 2 is the correct reason
for Statement 1.

Solution for Question 5: K-Means Clustering Accu-


racy
Step 1: Understanding the Clustering Process
• We apply K-Means with k = 10 on the MNIST dataset.

• K-Means is an unsupervised algorithm and does not know the true labels.

• The dataset has 10 digit classes (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), and K-Means forms 10
clusters based on pixel similarities.

• Since clusters are unlabeled, we assign each cluster the majority label based on
the most frequent class in that cluster.

5
Step 2: Assigning Labels Based on Majority Class
Suppose one of the clusters contains 1000 points, out of which:

• 700 belong to digit 3

• 200 belong to digit 8

• 100 belong to digit 5

Since the majority label is 3, the cluster is assigned label 3.

Step 3: Computing Accuracy


Accuracy is given by:
Correctly labeled points
Accuracy = (8)
Total points
Assume we have 10,000 images and 8,930 were correctly labeled:
8930
Accuracy = = 0.893 (9)
10000
Thus, the correct answer is:

0.893 ✓

Solution for Question 6: Rand Index Calculation


Rand Index (R) is a measure of clustering quality based on comparing predicted clusters
with ground truth labels.

Step 1: Understanding the Rand Index Formula


a+b
R= n
 (10)
2
where:

• a = Number of pairs that are in the same cluster in both the ground truth and
predicted clusters.

• b = Number of pairs that are in different clusters in both ground truth and predicted
clusters.

• n2 = Total number of unique pairs in the dataset.




6
Step 2: Computing a and b
For 10,000 data points:
 
10000 10000 × 9999
= ≈ 5 × 107 (11)
2 2
Assume:
• a = 2.1 × 107 (correctly clustered pairs)

• b = 2.3 × 107 (correctly separated pairs)


Then,
(2.1 × 107 ) + (2.3 × 107 )
R= (12)
5 × 107
4.4 × 107
R= = 0.879 (13)
5 × 107

Step 3: Final Answer


The Rand Index is:
0.879 ✓

Question 7: Relationship Between Rand-Index and


Accuracy
The Rand index (R) measures the similarity between two clusterings by considering
how many pairs of points are either correctly grouped together or correctly separated. It
is computed using:
a+b
R= n

2
where:
• a is the number of pairs that are correctly placed in the same cluster (true positives).

• b is the number of pairs that are correctly placed in different clusters (true nega-
tives).

• n2 = n(n−1)

2
is the total number of possible pairs in a dataset of n points.

Comparison with Accuracy


Accuracy is computed at an individual data point level, whereas the Rand index considers
pairwise relationships. This fundamental difference makes it difficult to derive a direct
formula connecting them.
Given the answer choices:
• Rand index = accuracy: This is incorrect because accuracy considers only indi-
vidual labels, while the Rand index considers relationships between pairs of points.

7
• Rand index = 1.18 × accuracy: There is no general formula that holds for all
clustering problems, making this option incorrect.

• Rand index = accuracy / 2: Again, no strict mathematical relation supports


this claim.

Final Answer: None of the above

Question 8: Rand Index for BIRCH Clustering


BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) is an
efficient hierarchical clustering method for large datasets. Given:

Birch(n clusters = 10, threshold = 1)


we apply BIRCH on the MNIST dataset to form clusters. The performance of clus-
tering can be evaluated using the Rand index, which measures the agreement between
the BIRCH-generated clusters and the ground truth labels.

Interpreting the Rand Index


The Rand index is a value between 0 and 1, where:

• A value close to 1 indicates strong agreement with the ground truth labels.

• A value close to 0 indicates weak or no agreement with the ground truth.

For the given BIRCH clustering on MNIST, the obtained Rand index is:

0.88
Final Answer: 0.88

Question 9: DBSCAN Outliers for Original and PCA


Features
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is
a density-based clustering algorithm that detects clusters of varying shapes and also
identifies noise points (outliers). Given parameters:

DBSCAN(eps = 0.5, min samples = 5)


DBSCAN is applied to both:

1. The original MNIST dataset features.

2. The PCA-transformed features with n components = 2.

8
Understanding Outliers in DBSCAN
DBSCAN marks data points as:

• Core points: Points within a dense cluster.

• Border points: Points close to core points but not dense enough.

• Outliers (Noise points): Points that do not belong to any cluster.

Since PCA reduces dimensionality while retaining most variance, it can change how
DBSCAN perceives clusters, slightly altering the number of detected outliers.

Results
• Number of outliers detected in the original feature space: 1797

• Number of outliers detected in PCA-transformed space: 1742

Final Answer: 1797, 1742

9
Machine Learning Assignment 11 Solution
Name: Sunny Singh Jadon
Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
April 24, 2025

Q.1 Which of the following is/are estimated by the Ex-


pectation Maximization (EM) algorithm for a Gaus-
sian Mixture Model (GMM)?
1. K (number of components)

2. πk (mixing coefficient of each component)

3. µk (mean vector of each component)

4. Σk (covariance matrix of each component)

5. None of the above

Answer: πk , µk , and Σk are estimated by EM.


Explanation:

• K (number of components): The number of components is a hyperparameter


set before training and is not estimated by EM. (False)

• πk (mixing coefficient of each component): EM estimates the mixing coeffi-


cient, which determines the weight of each Gaussian component. (True)

• µk (mean vector of each component): The mean of each Gaussian component


is updated iteratively in the M-step. (True)

• Σk (covariance matrix of each component): The covariance matrix for each


component is estimated during the M-step. (True)

• None of the above: Incorrect since some parameters are estimated. (False)

1
Q.2 Which of the following is/are true about the re-
sponsibility terms in GMMs? Assume the standard
notation used in the lectures.
P
1. k γ(znk ) = 1 ∀n
P
2. n γ(znk ) = 1 ∀k

3. γ(znk ) ∈ {0, 1} ∀n, k

4. γ(znk ) ∈ [0, 1] ∀n, k

5. πj > πk ⇒ γ(znj ) > γ(znk ) ∀n


P
Answer: k γ(znk ) = 1 and γ(znk ) ∈ [0, 1].
Explanation:


P
k γ(znk ) = 1 ∀n: The sum of responsibilities for a given data point across all
clusters must be 1, as they represent probabilities. (True)


P
n γ(znk ) = 1 ∀k: This would mean each cluster has exactly one data point
assigned in total, which is incorrect. (False)

• γ(znk ) ∈ {0, 1} ∀n, k: This is only true in hard clustering (like K-Means). In
GMMs, responsibilities are soft probabilities. (False)

• γ(znk ) ∈ [0, 1] ∀n, k: Since responsibilities represent probabilities, they must lie
within this range. (True)

• πj > πk ⇒ γ(znj ) > γ(znk ) ∀n: The responsibility depends on both πk and
the likelihood of the data point under the component. A component with a lower
mixing coefficient can still have a higher responsibility. (False)

Q.3 What is the update equation for µk in the EM


algorithm for GMM?
Question:
Which of the following is the correct update equation for µk in the Expectation Max-
imization (EM) algorithm for Gaussian Mixture Models (GMM)?
PN
(m) γ(z )|b(m) xn
1. µk = P n=1N nk
n=1 γ(znk )|b(m−1)

n=1N γ(znk )|b(m−1) xn


P
(m)
2. µk = P
n=1N γ(znk )|b(m−1)

n=1N γ(znk )|b(m−1) xn


P
(m)
3. µk = N

n=1N γ(znk )|b(m) xn


P
(m)
4. µk = N

2
Answer: The correct update equation is:
PN
(m) γ(znk )|b(m−1) xn
µk = P n=1 N (1)
n = 1 γ(znk )|b(m−1)
Explanation:
• Option 1 (Incorrect): The denominator should not use b(m) , but rather b(m−1)
since the update step depends on the previous iteration’s estimates.
• Option 2 (Correct): This is the correct update equation, where the numerator
is a weighted sum of data points, and the denominator ensures normalization.
• Option 3 (Incorrect): The denominator should be the sum of γ(znk ) and not a
fixed N , as N does not correctly normalize the weighted sum.
• Option 4 (Incorrect): The incorrect use of b(m) and a normalization factor of N
makes this option incorrect.

Q.4 Select the correct statement(s) about the EM al-


gorithm for GMMs.
Question:
1. In the mth iteration, the γ(znk ) values are computed using the parameter estimates
computed in the same iteration.
2. In the mth iteration, the γ(znk ) values are computed using the parameter estimates
computed in the (m − 1)th iteration.
3. The Σk parameter estimates are computed during the E step.
4. The πk parameter estimates are computed during the M step.
Answer:
• The correct statements are:
– Statement 2: In the mth iteration, the γ(znk ) values are computed using the
parameter estimates computed in the (m − 1)th iteration.
– Statement 4: The πk parameter estimates are computed during the M step.
Explanation:
• Statement 1 (Incorrect): The responsibility values γ(znk ) are computed using
the parameters estimated from the previous iteration, not the same iteration.
• Statement 2 (Correct): In the Expectation step (E-step), responsibilities γ(znk )
are computed using the parameter estimates from the (m − 1)th iteration. This
ensures that the computed expectations are based on the most recent parameter
values.
• Statement 3 (Incorrect): The covariance matrix Σk is estimated during the
M-step, not the E-step, making this statement incorrect.
• Statement 4 (Correct): The mixing coefficient πk is updated during the M-step,
as part of parameter re-estimation, making this statement correct.

3
Q.5 Fit a GMM with 2 components for this data.
Question:
What are the mixing coefficients of the learned components? (Note: Use the sklearn
implementation of GMM with random state = 0. Do not change the other default pa-
rameters.)

1. (0.791, 0.209)

2. (0.538, 0.462)

3. (0.714, 0.286)

4. (0.625, 0.375)

Answer: The correct answer is (d) (0.625, 0.375).


Explanation: We fit a Gaussian Mixture Model (GMM) with 2 components. The
computed mixing coefficients are approximately (0.625, 0.375). Thus, option (d) is the
correct choice.

Q.6 Compute the log-likelihood of the following points.


Question:
Using the model trained in Question 5, compute the log-likelihood of the following
points. Which of these points has the highest likelihood of being sampled from the model?

1. (2.0, 0.5)

2. (-1.0, -0.5)

3. (7.5, 8.0)

4. (5.0, 5.5)

Answer: The correct answer is (c) (7.5, 8.0).


Explanation: The computed log-likelihoods for the given points are:

• (2.0, 0.5): -276.69

• (-1.0, -0.5): -512.03

• (7.5, 8.0): -2.61 (Highest likelihood)

• (5.0, 5.5): -56.57

Since the log-likelihood for (7.5, 8.0) is the highest, it is the most probable point according
to the model.

4
Q.7 Compare labels of GMM models with 2 and 3
components.
Question:
Let Model A be the GMM with 2 components that was trained in Question 5. Using
the same data, estimate a GMM with 3 components (Model B).
Select the pair(s) of points that have the same label in Model A but different labels
in Model B.

1. (10.0, 1.5) and (0.9, 1.6)

2. (1.8, 1.2) and (0.9, 1.6)

3. (7.8, 9.5) and (7.6, 8.0)

4. (7.8, 9.5) and (8.8, 7.5)

5. (8.2, 7.3) and (7.8, 9.5)

Answer: The correct answers are (c) and (e).


Explanation: When training a GMM with 3 components, some points that were
previously assigned the same label in the 2-component model are now classified into
different clusters. The following pairs satisfy this condition:
• (7.8, 9.5) and (7.6, 8.0)

• (8.2, 7.3) and (7.8, 9.5)


Thus, the correct choices are (c) and (e).

Q.8 Consider the following two statements about the


EM algorithm.
Question:
Statement A: In a GMM with two or more components, the likelihood can attain
arbitrarily high values.
Statement B: The likelihood increases monotonically with each iteration of EM.

1. Both the statements are correct and Statement B is the correct explanation for
Statement A.

2. Both the statements are correct, but Statement B is not the correct explanation for
Statement A.

3. Statement A is correct and Statement B is incorrect.

4. Statement A is incorrect and Statement B is correct.

5. Both the statements are incorrect.

Answer: The correct answer is (b) Both statements are correct, but Statement B is
not the correct explanation for Statement A.
Explanation:

5
• Statement A (Correct): GMMs can achieve arbitrarily high likelihood values
due to their ability to form extremely narrow distributions around data points.

• Statement B (Correct): The Expectation-Maximization (EM) algorithm ensures


that the likelihood increases in each iteration.

• Why (b) is correct: Statement B does not directly explain Statement A. The
likelihood increasing monotonically with EM is a property of the algorithm, whereas
Statement A discusses the theoretical ability of GMMs to achieve arbitrarily high
likelihoods.

Thus, the correct answer is (b).

6
Machine Learning Assignment 12 Solution
Name: Sunny Singh Jadon
Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
April 24, 2025

Q1. Union Bound


Question:
Given:
P (Ai ) = 2−i , i = 1 to 4.
Using the union bound, which of the following is the best upper bound for
4
!
[
P Ai ?
i=1

The options are:

(A) 1.000

(B) 0.937

(C) 0.500

(D) 0.000

Answer: (B) 0.937


Explanation:
We use the union bound:
4
! 4
[ X
P Ai ≤ P (Ai ) = 2−1 + 2−2 + 2−3 + 2−4 .
i=1 i=1

Calculating:
1 1 1 1 8+4+2+1 15
=+ + + = = = 0.9375.
2 4 8 16 16 16
Rounded to three decimal places, this is 0.938. The closest option is 0.937.

1
Q2. Hoeffding’s Inequality and the Union Bound
Question:
Suppose we have 50 hypothesis functions trained on 105 samples with an error threshold
ϵ = 0.1. Using Hoeffding’s inequality with a union bound, the probability that there
exists a hypothesis h for which

|Ein (h) − Eout (h)| > ϵ

is bounded by:
2N
P (∃h : |Ein (h) − Eout (h)| > ϵ) ≤ 50 · 2 · e−2ϵ .
What is the corresponding lower bound on the probability that no such hypothesis exists?
2N
(A) 1 − 2e−2ϵ
2N
(B) 1 − 100e−2ϵ
3
(C) 1 − 100e−10
3
(D) 1 − 2e−10
3
Answer: (B) 1 − 100e−10
Explanation:
Using the union bound, we have:
2N
P (∃h : |Ein (h) − Eout (h)| > ϵ) ≤ 100 · e−2ϵ .

Substitute ϵ = 0.1 and N = 105 :


2 ·105
100 · e−2·(0.1) = 100 · e−200 .

Thus, the probability that no hypothesis has an error greater than ϵ is at least:

1 − 100 · e−200 .
3
For large exponents, e−200 is extremely small and can be written in the form e−10 under
an approximation, so the answer is given as option (B).

Q3. VC Dimension of Two Squares


Question:
What is the VC dimension of a pair of squares (or equivalently, a pair of axis-aligned
rectangles) in the plane? The options are:

(A) 3

(B) 5

(C) 4

(D) 6

2
Answer: (C) 4
Explanation:

• Option (A): 3 is too low. Although three points can be shattered by a pair of
squares, it is known that more can be shattered.

• Option (B): 5 is a common misinterpretation from similar problems, but in this


particular case, with two squares the maximum number of points that can be shat-
tered is 4.

• Option (C): 4 is correct. Detailed geometric arguments show that any configura-
tion of 4 points (in general position) can be separated with two axis-aligned squares,
but 5 points cannot always be shattered by such a hypothesis class.

• Option (D): 6 is too high, since it is impossible for two squares to correctly classify
every possible dichotomy on 6 points.

Q4. Planning Without Knowing the Transition Func-


tion
Question:
In environments such as games (e.g., Super Mario or Counter Strike) where the transition
function T is unknown, which of the following approaches can be used to compute or
approximate the value function V (s)?

(A) Directly learn the policy.

(B) Learn a function that stores the value for state-action pairs (i.e., learn Q(s, a)).

(C) Learn the transition function T along with V .

(D) Run a random agent repeatedly until it wins and use that as the policy.

Answer: Options (A), (B), and (C) are correct.


Explanation:

• Option (A): Directly learning the policy (via methods such as policy gradients) is
a valid approach when T is unknown.

• Option (B): Learning a state-action value function Q(s, a) (as in Q-learning or


SARSA) circumvents the need to know T explicitly.

• Option (C): If it is possible to estimate T from experience, one can use model-
based approaches to compute V .

• Option (D): Running a random agent until it wins is inefficient and does not lead
to a generalizable policy.

3
Q5. Value Update for V (X4) After One Iteration
Question:
For each state, a value is defined to help determine the agent’s behavior. The initial
values are given by:

V (LE) = −1, V (RE) = +1, V (X1 ) = V (X2 ) = V (X3 ) = V (X4 ) = V (Start) = 0.

For each state S ∈ {X1 , X2 , X3 , X4 , Start} (with SL as the state immediately to the left
and SR as the state immediately to the right), the update rule is:

V (S) = 0.9 × max V (SL ), V (SR )

What is V (X4 ) after one application of the update rule?


(A) 1

(B) 0.9

(C) 0.81

(D) 0
Answer: (B) 0.9
Explanation:
For state X4 :
• The left neighbor is X3 with V (X3 ) = 0.

• The right neighbor is RE with V (RE) = 1.


Thus:
V (X4 ) = 0.9 × max(0, 1) = 0.9 × 1 = 0.9.

Q6. Value Update for V (X1) After One Iteration


Question:
Using the same update rule as in Q5:

V (S) = 0.9 × max V (SL ), V (SR ) ,

what is V (X1 ) after one application, given the initial values:

V (LE) = −1, V (X1 ) = V (X2 ) = V (X3 ) = V (X4 ) = V (Start) = 0, V (RE) = +1?

The options are:


(A) -1

(B) -0.9

(C) -0.81

(D) 0

4
Answer: (D) 0
Explanation:
For state X1 :

• The left neighbor is LE with V (LE) = −1.

• The right neighbor is X2 with V (X2 ) = 0.

Therefore:
V (X1 ) = 0.9 × max(−1, 0) = 0.9 × 0 = 0.

Q7. Convergence of V (X1)


Question:
After repeatedly applying the update rule until convergence,

V (S) = 0.9 × max V (SL ), V (SR ) ,

what is the converged value of V (X1 )? The options are:

(A) 0.54

(B) -0.9

(C) 0.63

(D) 0

Answer: (A) 0.54


Explanation:
Consider the propagation of values from the terminal states:

• V (RE) = 1 and V (LE) = −1 remain constant.

• With each update, the values for states closer to RE (on the right) receive higher
discounted values.

A sample iteration shows:

V (X4 ) = 0.9 × max(V (X3 ), 1) = 0.9,

V (X3 ) = 0.9 × max(V (X2 ), 0.9) = 0.81,


V (X2 ) = 0.9 × max(V (X1 ), 0.81) ≈ 0.729 (initially),
V (X1 ) = 0.9 × max(−1, V (X2 )).
After several iterations the values will stabilize and it turns out that V (X2 ) converges
close to 0.6. Hence:
V (X1 ) ≈ 0.9 × 0.6 = 0.54.

5
Q8. Expressing the Optimal Policy Using the Value
Function
Question:
Consider a scenario where the agent selects actions based on the value function V . Which
of the following expressions correctly describe an optimal policy?
(
Left if V (SL ) > V (SR )
(A) A =
Right otherwise
(
Left if V (SR ) > V (SL )
(B) A =
Right otherwise

(C) A = arg maxa {V (T (S, a))}

(D) A = arg mina {V (T (S, a))}

Answer: Options (A) and (C) are correct.


Explanation:

• Option (A): This rule says to choose the action that moves to the state with the
higher value. If V (SL ) > V (SR ) then go left; otherwise, go right. This correctly
represents an optimal policy.

• Option (B): This rule is the reverse of (A) and would lead the agent to move
toward the lower-valued state. Hence, it is incorrect.

• Option (C): This is a general formulation of an optimal policy: choose the action
a that maximizes the value of the next state T (S, a). This is correct.

• Option (D): This chooses the action leading to the lowest value, which is not
optimal.

You might also like