ML Assignments 2025
• Predicting the type of interaction (positive/negative) between a new drug and a set
of human proteins.
• Learning to generate artificial human faces using the faces from a facial recognition
dataset.
Solution: Unsupervised learning finds hidden patterns in data without labeled outputs. The correct answers are:
• Identifying close-knit communities in a social network (Clustering problem).
• During training, the agent is explicitly provided the most optimal action to be taken
in each state.
• The actions taken by an agent do not affect the environment in any way.
• RL agents used for playing turn-based games like chess can be trained by playing
the agent against itself (self-play).
• RL can be used in an autonomous driving system.
• RL agents used for playing turn-based games like chess can be trained
by playing the agent against itself (self-play).
• Predicting the total number of goals a given football team scores in a year.
• Predicting whether or not a customer will repay a loan based on their credit history.
• Forecasting the weather (temperature, humidity, rainfall, etc.) at a given place for
the following 24 hours.
Question 5: Linear Regression Prediction
Given a dataset, fit a linear regression model of the form:
y = β0 + β1 x1 + β2 x2
Using the mean-squared error loss, the predicted value of y at (x1 , x2 ) = (0.5, −1.0) is:
• 4.05
• 2.05
• -1.95
• -3.95
Solution:
where:
• X is the design matrix including a column of ones for the bias term.
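First, compute $X^T X$:
$$X^T X = \begin{bmatrix} 4 & 2.5 & 2.0 \\ 2.5 & 6.25 & 1.75 \\ 2.0 & 1.75 & 7.5 \end{bmatrix} \quad (3)$$
Next, compute $X^T y$:
$$X^T y = \begin{bmatrix} 11.5 \\ 15.75 \\ 13.0 \end{bmatrix} \quad (4)$$
Now, solve for B:
$$B = (X^T X)^{-1} X^T y \quad (5)$$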
Using matrix inversion and multiplication (computations omitted for brevity), we get:
Step 2: Predict y for Given Inputs
Substituting (x1 , x2 ) = (0.5, −1.0) into the regression equation:
• -1.766
• -1.166
• 1.133
• 1.733
Solution:
The given dataset:
x1 x2 y
1.0 0.0 2.65
-1.0 0.5 -2.05
2.0 1.0 1.95
-2.0 -1.5 0.90
1.0 1.0 0.60
-1.0 -1.0 1.45
We need to predict y for (x1 , x2 ) = (1.0, 0.5) using k-NN with k = 3 and Euclidean
distance.
Step 1: Compute Euclidean Distance
The Euclidean distance between two points $(x_1, x_2)$ and $(x'_1, x'_2)$ is given by:
$$d = \sqrt{(x_1 - x'_1)^2 + (x_2 - x'_2)^2}$$
• (−1.0, −1.0): $d = \sqrt{(1.0 + 1.0)^2 + (0.5 + 1.0)^2} = 2.5$
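The k-NN prediction for this question can be reproduced with a short sketch. It uses the six training points from the table above, the query (1.0, 0.5), and k = 3; the function name `knn_regress` is only illustrative.

```python
import math

# Training data copied from the table above: (x1, x2, y)
data = [
    (1.0, 0.0, 2.65),
    (-1.0, 0.5, -2.05),
    (2.0, 1.0, 1.95),
    (-2.0, -1.5, 0.90),
    (1.0, 1.0, 0.60),
    (-1.0, -1.0, 1.45),
]

def knn_regress(query, data, k):
    """Average the targets of the k training points closest to `query`."""
    ranked = sorted(data, key=lambda row: math.dist(query, row[:2]))
    nearest = ranked[:k]
    return sum(row[2] for row in nearest) / k

prediction = knn_regress((1.0, 0.5), data, k=3)
print(round(prediction, 3))  # 1.733
```

The three nearest points are (1.0, 0.0), (1.0, 1.0) and (2.0, 1.0), so the prediction is (2.65 + 0.60 + 1.95) / 3 ≈ 1.733.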
Solution to Question 7
Given Dataset:
x1 x2 y
-1.0 1.0 0
-1.0 0.0 0
-2.0 -1.0 0
0.0 0.0 1
2.0 1.0 1
1.0 2.0 1
2.0 -1.0 2
1.0 0.0 2
2.0 0.0 2
Step 2: Select the 5 Nearest Neighbors
• (2.0, 1.0) → y = 1
• (1.0, 2.0) → y = 1
• (1.0, 0.0) → y = 2
• (0.0, 0.0) → y = 1
• (2.0, 0.0) → y = 2
• A linear regressor partitions the input space into multiple regions such that the
prediction over a given region is constant.
Correct Answers:
Solution:
A linear regressor learns a fixed function by estimating coefficients from the training
data and does not need the training data during inference. In contrast, k-NN regression
relies on storing the training data and computing distances during inference.
For k-NN:
Question 9: Bias and Variance Tradeoff
Question: Which of the following statements regarding bias and variance are correct?
• $\text{Bias} = E[\hat{f}(x)] - f(x)$, $\text{Variance} = E[(E[\hat{f}(x)] - \hat{f}(x))^2]$.
• Low bias and high variance is a sign of overfitting.
• Low variance and high bias is a sign of underfitting.
Correct Answers:
• $\text{Bias} = E[\hat{f}(x)] - f(x)$, $\text{Variance} = E[(E[\hat{f}(x)] - \hat{f}(x))^2]$.
• Low bias and high variance is a sign of overfitting.
• Low variance and high bias is a sign of underfitting.
Solution:
The bias measures the difference between the expected prediction and the true function, while variance measures how much the predictions vary for different training datasets.
• High variance and low bias = Overfitting (model captures noise).
• High bias and low variance = Underfitting (model is too simple).
Machine Learning Assignment 2 Solution
Name: Sunny Singh Jadon
Class: M.Sc Maths
Branch: Maths and Scientific Computing
NIT WARANGAL
April 24, 2025
$$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_p x_p$$
• To adjust for the baseline level of the dependent variable when all predictors are
zero
Correct Answer: To adjust for the baseline level of the dependent variable when
all predictors are zero.
Solution: The intercept term (θ0 ) represents the value of the dependent variable (y)
when all independent variables (x1 , x2 , ..., xp ) are zero. This ensures that our regression
model does not force the regression line through the origin unless necessary. Without an
intercept, the model assumes y = 0 when all predictors are zero, which is often incorrect.
Hence, adding an intercept helps to better fit the data and improve the accuracy of
predictions.
• It is non-convex.
• It is always minimized at θ = 0.
• It measures the sum of squared differences between predicted and actual values.
Correct Answer: It measures the sum of squared differences between predicted and
actual values.
Solution: The cost function used in linear regression is the Mean Squared Error (MSE):
$$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2$$
where $y_i$ is the actual value and $\hat{y}_i$ is the predicted value. This function is convex, meaning it has a single global minimum, making it easy to optimize using gradient descent. It does not assume the dependent variable is categorical; that is a characteristic of classification models.
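As a small illustration of the cost function, the sketch below evaluates J with the 1/(2m) scaling used above. The numbers passed in are hypothetical and serve only to show the arithmetic.

```python
def mse_cost(y_true, y_pred):
    """J = 1/(2m) * sum of squared residuals."""
    m = len(y_true)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / (2 * m)

# Hypothetical values for illustration only
cost = mse_cost([1.0, 2.0, 3.0], [1.0, 2.5, 2.0])
print(cost)
```

Here the squared residuals are 0, 0.25 and 1, so J = 1.25 / 6.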
Correct Answer: The errors must have a mean of zero.
Solution: For the least squares estimator to be unbiased, the assumption is that the
expected value of the error term (ϵ) should be zero:
E[ϵ] = 0
This ensures that the model does not systematically overestimate or underestimate the target variable. Normality of the predictors is not required, and the model assumes a relationship that is linear in the parameters rather than a non-linear one.
Q7: Dimensions of Design Matrix in Linear Regression
Question: Given a training dataset of 10,000 instances, each with 12 input dimensions and 3 output dimensions, what are the dimensions of the design matrix in linear regression?
• 10000 × 12
• 10003 × 12
• 10000 × 13
• 10000 × 15
$$X \in \mathbb{R}^{10000 \times (12+1)} = \mathbb{R}^{10000 \times 13}$$
$$y = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_p x_p$$
is fitted to N training points with p attributes each. Let X be the design matrix ($N \times (p+1)$), Y be the target vector ($N \times 1$), and θ be the parameter vector ($(p+1) \times 1$). If the sum of squared errors is minimized, which equation holds?
• $X^T X = XY$
• $X\theta = X^T Y$
• $X^T X\theta = Y$
• $X^T X\theta = X^T Y$
Correct Answer: $X^T X\theta = X^T Y$
Solution: The normal equation for least squares regression is:
$$\theta = (X^T X)^{-1} X^T Y$$
Multiplying both sides by $X^T X$ gives:
$$X^T X\theta = X^T Y$$
This equation is satisfied by the values of θ that minimize the sum of squared errors.
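The identity can be checked numerically. The sketch below assumes NumPy is available and uses randomly generated data with hypothetical sizes (N = 50, p = 3); it fits θ by least squares and confirms that the fitted θ satisfies the normal equation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 50, 3                                  # hypothetical sizes
features = rng.normal(size=(N, p))
X = np.hstack([np.ones((N, 1)), features])    # design matrix, N x (p + 1)
Y = rng.normal(size=(N, 1))                   # target vector, N x 1

# Least-squares fit: minimizes ||X theta - Y||^2
theta, *_ = np.linalg.lstsq(X, Y, rcond=None)

# At the minimizer, X^T X theta - X^T Y should vanish
residual = X.T @ X @ theta - X.T @ Y
print(np.allclose(residual, 0.0))
```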
Q9: Partial Least Squares (PLS) Regression vs. OLS
Question: When is Partial Least Squares (PLS) regression preferred over Ordinary Least
Squares (OLS)?
• When predictors are uncorrelated and the number of samples is much larger than
the number of predictors.
• When the response variable is categorical and the predictors are highly non-linear.
• When the primary goal is to interpret the relationship between predictors and
response, rather than prediction accuracy.
• Best subset selection can be computationally more expensive than forward selection.
• Forward selection and backward selection always lead to the same result.
• Best subset selection can be computationally less expensive than backward selection.
• Best subset selection and forward selection are computationally equally expensive.
Correct Answer: Best subset selection can be computationally more expensive than
forward selection.
Solution:
• Best Subset Selection: Evaluates all possible models and selects the best one, making it computationally expensive, especially when the number of predictors is large.
• Forward Selection: Starts with an empty model and adds variables iteratively
based on improvement in performance.
• Backward Selection: Starts with all variables and removes the least important
ones iteratively.
Best subset selection requires evaluating all $2^p$ possible models, whereas forward selection adds one feature at a time and fits at most $1 + p(p+1)/2$ models, so best subset selection becomes far more expensive as p grows.
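To make the gap concrete, the sketch below counts the models each procedure fits for a few values of p. The forward-selection count assumes the standard procedure: the null model, then p candidates in the first step, p − 1 in the second, and so on.

```python
def best_subset_models(p):
    """Every subset of the p predictors is a candidate model."""
    return 2 ** p

def forward_selection_models(p):
    """Null model plus p + (p - 1) + ... + 1 candidate fits."""
    return 1 + p * (p + 1) // 2

for p in (5, 10, 20):
    print(p, best_subset_models(p), forward_selection_models(p))
```

For p = 10 this gives 1024 models for best subset selection against 56 for forward selection.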
Machine Learning Assignment 3 Solution
This forms multiple decision boundaries.
- If a point x does not lie on the decision boundary, then all points in its small
neighborhood must belong to the same class because there is no region of ambiguity in
the classification.
Correct Answer:
Solution: Linear Discriminant Analysis (LDA) assumes that the data for each class
follows a multivariate normal (Gaussian) distribution with the same covariance
matrix across classes. This leads to a linear decision boundary.
Logistic regression, on the other hand, does not assume a specific distribution of the data. Instead, it models the log-odds of the class as a linear function of the features and fits the parameters by maximum likelihood.
- If the underlying data distribution is not Gaussian, LDA's assumptions are violated, which can lead to a decision boundary that differs significantly from that of logistic regression.
- If the two classes have unequal covariance matrices, LDA's shared-covariance assumption is violated (the Bayes-optimal boundary is then quadratic rather than linear), so the linear boundary fitted by LDA can differ markedly from the linear boundary found by logistic regression.
Thus, these two factors contribute to significant differences between the decision
boundaries of LDA and logistic regression.
y_i         1     0     0     1
p_1(x_i)    0.8   0.5   0.2   0.9
The likelihood of observing the given data is computed as:
$$L = \prod_{i=1}^{4} p_1(x_i)^{y_i} \, (1 - p_1(x_i))^{(1 - y_i)}$$
Solution:
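Since the logistic regression model gives the probability that the label is 1, each sample contributes $p_1(x_i)$ when $y_i = 1$ and $1 - p_1(x_i)$ when $y_i = 0$, so the likelihood is:
$$L = 0.8 \times (1 - 0.5) \times (1 - 0.2) \times 0.9 = 0.288$$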
Final Answer: 0.288
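The same product can be checked with a few lines of code, using the labels and predicted probabilities from the table in the question.

```python
labels = [1, 0, 0, 1]
p1 = [0.8, 0.5, 0.2, 0.9]   # model's probability that the label is 1

likelihood = 1.0
for y, p in zip(labels, p1):
    # A sample with label 1 contributes p, a sample with label 0 contributes 1 - p
    likelihood *= p if y == 1 else (1.0 - p)

print(round(likelihood, 3))  # 0.288
```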
• It learns a model for the probability distribution of the data points in each class.
Correct Answers:
• The output of a linear model is transformed to the range (0, 1) by a sigmoid function.
Solution:
- Logistic regression does not learn the probability distribution of the data points but instead models the probability that a given instance belongs to a particular class using a sigmoid function.
- The logistic regression model applies a sigmoid function to the linear output:
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where $z = w^T x + b$.
- The parameters of logistic regression are learned by maximizing the log-likelihood, not by minimizing the mean-squared loss. The loss function for logistic regression is the log loss (negative log-likelihood):
$$L = -\sum_{i=1}^{N} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]$$
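A minimal sketch of the two formulas above, written in plain Python. The function names are illustrative, and the inputs in the example are hypothetical.

```python
import math

def sigmoid(z):
    """Map a real-valued linear output z = w^T x + b into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(labels, probs):
    """Negative log-likelihood of binary labels under predicted probabilities."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, probs)
    )

# Hypothetical linear outputs for two samples with labels 1 and 0
probs = [sigmoid(2.0), sigmoid(-1.0)]
print(log_loss([1, 0], probs))
```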
Q5: Modified Logistic Regression Equation
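Question: Consider a modified form of logistic regression:
$$\log \frac{1 - p(x)}{k \, p(x)} = \beta_0 + \beta_1 x$$
where k is a positive constant and $\beta_0$, $\beta_1$ are parameters.
Solution: We start from the given equation:
$$\log \frac{1 - p(x)}{k \, p(x)} = \beta_0 + \beta_1 x$$
By exponentiating both sides, we get:
$$\frac{1 - p(x)}{k \, p(x)} = e^{\beta_0 + \beta_1 x}$$
Rearranging for p(x):
$$p(x) = \frac{1}{1 + k e^{\beta_0 + \beta_1 x}}$$
Multiplying the numerator and the denominator by $e^{-\beta_1 x}$ gives the equivalent form:
$$p(x) = \frac{e^{-\beta_1 x}}{k e^{\beta_0} + e^{-\beta_1 x}}$$
Final Answer: $p(x) = \dfrac{e^{-\beta_1 x}}{k e^{\beta_0} + e^{-\beta_1 x}}$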
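The algebra can be sanity-checked numerically. In the sketch below the parameter values are hypothetical; it confirms that the two closed forms of p(x) agree and that they satisfy the original modified log-odds equation.

```python
import math

def p_direct(x, b0, b1, k):
    """p(x) = 1 / (1 + k * exp(b0 + b1 * x))."""
    return 1.0 / (1.0 + k * math.exp(b0 + b1 * x))

def p_final(x, b0, b1, k):
    """p(x) = exp(-b1 * x) / (k * exp(b0) + exp(-b1 * x))."""
    return math.exp(-b1 * x) / (k * math.exp(b0) + math.exp(-b1 * x))

b0, b1, k = 0.3, -1.2, 2.5   # hypothetical parameter values
checks = []
for x in (-2.0, 0.0, 1.5):
    p = p_direct(x, b0, b1, k)
    lhs = math.log((1 - p) / (k * p))          # left side of the given equation
    checks.append((abs(p - p_final(x, b0, b1, k)), abs(lhs - (b0 + b1 * x))))

print(all(a < 1e-12 and b < 1e-9 for a, b in checks))
```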
k         1      2      3      4      5
f_k(x)    0.15   0.20   0.05   0.50   0.01
Let πk denote the prior probability of class k. Which of the following statements are
true?
Solution:
In a Bayesian classifier, the posterior probability for class k is given by:
$$P(k \mid x) = \frac{\pi_k f_k(x)}{\sum_{j=1}^{5} \pi_j f_j(x)}$$
The predicted class is the one with the highest posterior probability.
• If $2\pi_k \le \pi_{k+1}$ for all k ∈ {1, 2, 3, 4}, then class 4 has the highest probability.
• If $\pi_k \ge \frac{3}{2}\pi_{k+1}$ for all k ∈ {1, 2, 3, 4}, then class 1 has the highest probability.
• The predicted label at x can never be class 5 because f5 (x) is the smallest among
all densities.
Final Answers:
• If $2\pi_k \le \pi_{k+1}$ for all k ∈ {1, 2, 3, 4}, the predicted class must be class 4.
• If $\pi_k \ge \frac{3}{2}\pi_{k+1}$ for all k ∈ {1, 2, 3, 4}, the predicted class must be class 1.
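The prediction rule can be sketched in code. The densities are the ones from the table; the priors are a hypothetical example chosen so that each prior is exactly 3/2 times the next one, which is the boundary case of the second condition.

```python
densities = [0.15, 0.20, 0.05, 0.50, 0.01]   # f_k(x) for k = 1..5, from the table

def predict_class(priors, densities):
    """Return the 1-based class index with the largest pi_k * f_k(x)."""
    scores = [pi * f for pi, f in zip(priors, densities)]
    return scores.index(max(scores)) + 1

# Hypothetical priors with pi_k = 1.5 * pi_{k+1}, normalized to sum to 1
raw = [1.5 ** (4 - i) for i in range(5)]
priors = [r / sum(raw) for r in raw]

predicted = predict_class(priors, densities)
print(predicted)  # 1
```

The normalizing denominator of the posterior is the same for every class, so comparing the products π_k f_k(x) is enough.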
$$P(C_1 \mid x) = P(C_2 \mid x)$$
Using Bayes' theorem:
$$P(C_k \mid x) = \frac{P(x \mid C_k) P(C_k)}{P(x)}$$
At the decision boundary, the posterior probabilities of both classes must be equal:
$$\frac{P(x \mid C_1) P(C_1)}{P(x)} = \frac{P(x \mid C_2) P(C_2)}{P(x)}$$
Canceling $P(x)$, we get:
$$P(x \mid C_1) P(C_1) = P(x \mid C_2) P(C_2)$$
Taking the logarithm,
$$\log \frac{P(x \mid C_1)}{P(x \mid C_2)} = \log \frac{P(C_2)}{P(C_1)}$$
• "On the decision boundary, class-conditioned probability densities corresponding to both classes must be equal."
× Incorrect, since the decision boundary depends on the ratio of likelihoods and priors.
Final Answer:
On the decision boundary, the posterior probabilities corresponding to both classes must be equal.
On the decision boundary, the class-conditioned probability densities corresponding to both classes ma
• Dataset B: 200 samples of class 0 (same as Dataset A), 100 samples of class 1 (class
1 data is duplicated).
$$w^T x + b = 0$$
where w (slope) is given by:
$$w = \Sigma^{-1} (\mu_1 - \mu_0)$$
and b (intercept) is:
$$b = -\frac{1}{2} \left( \mu_1^T \Sigma^{-1} \mu_1 - \mu_0^T \Sigma^{-1} \mu_0 \right) + \log \frac{P(C_1)}{P(C_0)}$$
Since Dataset B has duplicated class 1 samples, $\mu_1$ and $\Sigma$ remain the same, meaning w is unchanged. However, the prior probability changes:
$$P(C_1) = \frac{100}{300} = \frac{1}{3}, \qquad P(C_1) \text{ in Dataset A} = \frac{50}{250} = 0.2$$
Since b depends on priors, the intercept will change.
Final Answer:
The two models will have the same slope but different intercepts.
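The size of the intercept shift can be worked out from the prior term alone. The sketch below estimates the priors from the class counts given in the text (50 vs 200 samples in Dataset A, 100 vs 200 in Dataset B) and compares the log-prior-ratio terms; it assumes, as the solution does, that the remaining part of b is unchanged.

```python
import math

def log_prior_ratio(n1, n0):
    """log(P(C1) / P(C0)) with priors estimated from class counts."""
    return math.log(n1 / n0)

term_a = log_prior_ratio(50, 200)    # Dataset A
term_b = log_prior_ratio(100, 200)   # Dataset B (class 1 duplicated)

print(term_b - term_a)  # equals log(2)
```

Duplicating class 1 doubles the prior odds, so the intercept moves by log 2 while the slope stays the same.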
Question 9: Properties of LDA
LDA aims to maximize separation between class means while minimizing within-class
variance:
$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
where:
- $S_B$ is the between-class scatter matrix.
- $S_W$ is the within-class scatter matrix.
Final Answer:
LDA maximizes the inter-class variance relative to the intra-class variance.
Maximizing the Fisher information results in the same direction of the separating hyperplane as equat
• "Adding a few outliers to the dataset is likely to cause a larger change in the decision boundary of LDA compared to logistic regression."
✓ Correct, as LDA depends on means and covariances, which are sensitive to outliers.
Final Answer:
LDA is sensitive to outliers; logistic regression is comparatively more robust.
Logistic regression performs better when class distributions are non-Gaussian.
Machine Learning Assignment 4 Solution
Explanation:
The Perceptron Learning Algorithm is an iterative method that updates the weight vector
w using the following rule:
w ← w + η(y − ŷ)x
where:
Q2: Consider the 1-dimensional dataset
x y
−1 1
0 −1
2 1
(Note: x is the feature and y is the output)
State true or false: The dataset becomes linearly separable after using basis expansion with the following basis function:
$$\phi(x) = \begin{bmatrix} 1 \\ x^2 \end{bmatrix}$$
Answer: True
Explanation:
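The given dataset is not linearly separable in its original form. However, by applying the basis expansion:
$$\phi(x) = \begin{bmatrix} 1 \\ x^2 \end{bmatrix}$$
we transform the feature x into $x^2$, which maps the points as follows:
$$\phi(-1) = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \quad \phi(0) = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \phi(2) = \begin{bmatrix} 1 \\ 4 \end{bmatrix}$$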
In this transformed space, a linear decision boundary can be drawn to separate the
classes y = 1 and y = −1, making the dataset linearly separable.
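One way to see the separability is to exhibit a separating line in the expanded space. The sketch below assumes the basis function maps x to (1, x²) and uses a hypothetical weight vector (−0.5, 1), i.e. the rule sign(x² − 0.5); any threshold strictly between 0 and 1 would work equally well.

```python
xs = [-1, 0, 2]
ys = [1, -1, 1]

def phi(x):
    """Basis expansion: constant term plus squared feature."""
    return (1, x ** 2)

w = (-0.5, 1.0)   # hypothetical weights: score = x^2 - 0.5

def predict(x):
    score = sum(wi * fi for wi, fi in zip(w, phi(x)))
    return 1 if score > 0 else -1

predictions = [predict(x) for x in xs]
print(predictions == ys)  # True
```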
• The loss is zero only when the prediction is exactly equal to the true label.
• The loss is zero when the prediction is correct and the margin is at least
1. ✓
• The loss increases linearly with the distance from the decision boundary regardless
of classification.
Answer: The loss is zero when the prediction is correct and the margin is at least 1.
Explanation:
Hinge loss is defined as $L = \max(0,\ 1 - y(w \cdot x))$, so:
• If the classification is correct and the margin y(w · x) ≥ 1, the loss is zero.
• If the classification is correct but the margin is less than 1, there is some loss.
Thus, the correct statement is: The loss is zero when the prediction is correct
and the margin is at least 1.
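The two cases above, plus the misclassified case, can be illustrated with the standard hinge loss max(0, 1 − y·score). The scores in the example are hypothetical.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw score w . x."""
    return max(0.0, 1.0 - y * score)

print(hinge_loss(1, 2.0))    # correct, margin >= 1      -> 0.0
print(hinge_loss(1, 0.5))    # correct, margin < 1       -> 0.5
print(hinge_loss(-1, 0.5))   # misclassified             -> 1.5
```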
• 2
• d
• n/2
• n ✓
Answer: n
Explanation:
In a hard-margin Support Vector Machine (SVM), support vectors are the data points that lie exactly on the margin boundaries; a hard margin permits no violations. Theoretically:
• At least two support vectors, one from each class, are needed to define the margin.
Q5: Effect of Increasing C on the Number of Support Vectors in Soft-Margin SVM
In the context of soft-margin SVM, what happens to the number of support vectors as
the parameter C increases?
• Generally increases
• Generally decreases ✓
• Remains constant
• Changes unpredictably
Explanation:
The parameter C in an SVM controls the trade-off between maximizing the margin and
minimizing classification errors.
• A lower C allows more misclassified points, leading to a larger number of support
vectors.
x y
1 1
2 1
4 −1
5 −1
6 −1
7 −1
9 1
10 1
We use a Support Vector Classifier (SVC) with the following parameters:
• Polynomial kernel with degree d = 3
• Regularization parameter C = 1
Options
• 2
• 1
• 9
• 10 ✓
Correct Answer: 10
Explanation
Support vectors are the data points that lie on the margin or within the soft margin
region in an SVM. These points influence the decision boundary and help define the
classifier. Points that are correctly classified and far from the decision boundary do not
act as support vectors.
Understanding the Decision Boundary
The Support Vector Machine (SVM) finds an optimal hyperplane that separates the two classes. For a hard-margin SVM, only the closest points (support vectors) determine the margin, while others are ignored.
In a soft-margin SVM (where C = 1), misclassified points or points within the margin can also be support vectors. The higher C, the stricter the margin.
Given the dataset:
- Positive class (y = 1) contains: x = 1, 2, 9, 10
- Negative class (y = −1) contains: x = 4, 5, 6, 7
A polynomial kernel of degree d = 3 transforms the data into a higher-dimensional space, making the dataset more separable.
Why is x = 10 not a Support Vector?
- Points close to the decision boundary (e.g., x = 2 and x = 9) are likely to be support vectors.
- Points far from the boundary, such as x = 10, are classified with high confidence and do not contribute to defining the margin.
- The SVM model does not require well-separated points to be support vectors since they have no impact on the margin.
Thus, x = 10 is not a support vector.
Options
• 0.91, 0.64
• 0.88, 0.71
• 0.71, 0.65
• 0.78, 0.64 ✓
Introduction to Perceptron
A **Perceptron** is the simplest type of neural network and is a **linear classifier**. It
updates its weights using the Perceptron learning rule:
L1 and L2 Regularization
To improve generalization, we add **regularization**:
- **L1 Regularization (Lasso):** Adds a penalty term $\lambda \sum_i |w_i|$, which encourages sparsity.
- **L2 Regularization (Ridge):** Adds a penalty term $\lambda \sum_i w_i^2$, which prevents large weight values.
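The two penalty terms are simple to compute directly. In the sketch below the weight vector and the value of λ are hypothetical.

```python
def l1_penalty(weights, lam):
    """Lasso penalty: lambda * sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge penalty: lambda * sum of squared weights."""
    return lam * sum(w * w for w in weights)

weights = [0.5, -2.0, 0.0]   # hypothetical weights
lam = 0.1                    # hypothetical regularization strength
print(l1_penalty(weights, lam), l2_penalty(weights, lam))
```

With these numbers the L1 penalty is 0.1 × 2.5 and the L2 penalty is 0.1 × 4.25; the squared term punishes the large weight −2.0 much more heavily than the small one.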
Q8: Training an SVM Classifier on the Modified Iris Dataset
Problem Statement
Train a **Support Vector Machine (SVM) classifier** on the modified Iris dataset using
sklearn. The model should:
• Use only the first three features of the dataset.
• Use an **RBF kernel** with γ = 0.5.
• Be trained in a **One-vs-Rest (OvR)** setting.
• Not perform feature normalization.
• Explore different values of **C**: 0.01, 1, 10.
The best classification accuracy must be reported.
Options
• 0.98 ✓
• 0.88
• 0.99
• 0.92
Introduction to SVM
Support Vector Machines (SVMs) are supervised learning models used for classification.
They work by finding an optimal **decision boundary** that maximizes the **margin**
between two classes.
The optimization problem is:
$$\min_{w, b} \ \frac{1}{2} \|w\|^2$$
subject to:
$$y_i (w \cdot x_i + b) \ge 1, \quad \forall i$$
where:
- w is the weight vector,
- b is the bias term,
- $y_i$ is the class label,
- $x_i$ is the input feature.
where:
- γ controls the influence of individual training points.
- Higher γ makes the model more sensitive to local variations.
Impact of the C Parameter
The parameter **C** in SVM controls the trade-off between maximizing the margin and minimizing classification errors:
- **Low C (0.01):** Allows more misclassifications → high bias, low variance.
- **Medium C (1):** Balanced generalization.
- **High C (10):** Tries to classify all points correctly → low bias, high variance.
Machine Learning Assignment 5 Solution
• p-dimensional input
• m hidden layers
Solution
1. Input to First Hidden Layer:
• Each hidden unit in the first layer has weights from all p input dimensions.
• Number of weights:
p×k
• Number of weights:
$(m - 1) \times k^2$
3. Last Hidden Layer to Output Layer:
• Number of weights:
k
Correct Answer:
$$pk + (m - 1)k^2 + k$$
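The count can be wrapped in a small helper. The sketch assumes what the formula implies: k units in every hidden layer, a single output unit, and no bias terms. The example sizes are hypothetical.

```python
def count_weights(p, m, k):
    """Weights in a fully connected net: p inputs, m hidden layers of k units, 1 output."""
    input_to_hidden = p * k
    hidden_to_hidden = (m - 1) * k * k
    hidden_to_output = k
    return input_to_hidden + hidden_to_hidden + hidden_to_output

print(count_weights(p=3, m=2, k=4))  # 12 + 16 + 4 = 32
```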
y = ReLU(Wx)
where:
• $x \in \mathbb{R}^p$ (input)
• $y \in \mathbb{R}^d$ (output)
ReLU(z) = max(0, z)
(applied element-wise)
Solution
1. ReLU Activation:
• If z = Wi · x is positive: ReLU is active, so the derivative is 1.
• If z = Wi · x ≤ 0: ReLU is inactive, so the derivative is 0.
2. Computing $\dfrac{\partial y_i}{\partial W_{ij}}$
• Since:
$$y_i = \mathrm{ReLU}\left( \sum_{k=1}^{p} W_{ik} x_k \right),$$
• This means:
– If $\sum_{k=1}^{p} W_{ik} x_k > 0$, the gradient is $x_j$.
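The gradient rule can be checked against a finite-difference estimate. In the sketch below W and x are hypothetical values, and the helper names are illustrative; the gradient is x_j when the pre-activation of unit i is positive and 0 otherwise.

```python
def relu(z):
    return max(0.0, z)

def forward(W, x):
    """y = ReLU(W x), computed row by row."""
    return [relu(sum(w * xk for w, xk in zip(row, x))) for row in W]

def grad_yi_wrt_Wij(W, x, i, j):
    """Analytic gradient: x_j if unit i is active, else 0."""
    pre_activation = sum(w * xk for w, xk in zip(W[i], x))
    return x[j] if pre_activation > 0 else 0.0

def numeric_grad(W, x, i, j, eps=1e-6):
    """Forward-difference estimate of d y_i / d W_ij."""
    bumped = [row[:] for row in W]
    bumped[i][j] += eps
    return (forward(bumped, x)[i] - forward(W, x)[i]) / eps

W = [[0.5, -1.0], [-2.0, 0.5]]   # hypothetical weights
x = [1.0, 0.25]                  # hypothetical input

print(grad_yi_wrt_Wij(W, x, 0, 1), numeric_grad(W, x, 0, 1))  # active unit
print(grad_yi_wrt_Wij(W, x, 1, 0), numeric_grad(W, x, 1, 0))  # inactive unit
```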
Solution
1. Gradient of Output w.r.t. Hidden Representation:
$$\nabla_{W^{(B)}}(y) = h$$
Final Answer
Accepted Answers: $\nabla_{W^{(A)}}(y)$ depends on $W^{(B)}$, $\nabla_{W^{(B)}}(y)$ depends on $W^{(A)}$
• Two different initializations of the same network could converge to different minima.
• For a given initialization, gradient descent will converge to the same minima irrespective of the learning rate.
• Initializing all weights to the same constant value leads to undesirable results.
Solution
1. Different Initializations Lead to Different Minima:
• Neural networks have non-convex loss surfaces.
• The learning rate affects the optimization trajectory.
• If all weights are initialized to the same value, all neurons will produce the same
output.
Final Answer
Accepted Answers:
• Two different initializations of the same network could converge to different minima.
• Initializing all weights to the same constant value leads to undesirable results.
$$\sigma(x) = \frac{1}{1 + \exp(-x)}, \qquad \tanh(x) = \frac{\exp(x) - \exp(-x)}{\exp(x) + \exp(-x)}$$
Which of the following statements are true?
• $0 < \sigma'(x) \le \frac{1}{4}$
Solution
1. Derivative of Sigmoid Function:
• Since σ(x) is always between 0 and 1, its derivative is maximized when σ(x) = 0.5.
• The function tanh′ (x) is maximum at x = 0, where tanh(0) = 0 and tanh′ (0) = 1.
Final Answer
Accepted Answers:
• $0 < \sigma'(x) \le \frac{1}{4}$
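The bounds on the derivatives can be checked numerically. The sketch uses the standard identities σ'(x) = σ(x)(1 − σ(x)) and tanh'(x) = 1 − tanh²(x), and scans a hypothetical grid of inputs between −10 and 10.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_prime(x):
    return 1.0 - math.tanh(x) ** 2

grid = [i / 10 for i in range(-100, 101)]   # hypothetical evaluation points
max_sigmoid_prime = max(sigmoid_prime(x) for x in grid)
max_tanh_prime = max(tanh_prime(x) for x in grid)

print(max_sigmoid_prime, max_tanh_prime)  # 0.25 and 1.0, both attained at x = 0
```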
$$f(x; p) = (1 - p)^{(x - 1)} p, \quad x = 1, 2, \ldots$$
Given the samples [4, 5, 6, 5, 4, 3] drawn from this distribution, find the maximum likeli-
hood estimate (MLE) of p.
Solution
The MLE for p in a geometric distribution is given by:
$$\hat{p} = \frac{1}{\bar{x}}$$
where $\bar{x}$ is the sample mean.
Step 1: Compute Sample Mean
$$\bar{x} = \frac{4 + 5 + 6 + 5 + 4 + 3}{6} = \frac{27}{6} = 4.5$$
Step 2: Compute MLE of p
$$\hat{p} = \frac{1}{4.5} \approx 0.222$$
Final Answer
Accepted Answer: 0.222
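The two steps collapse into a single expression, since 1 divided by the sample mean equals the number of samples divided by their sum.

```python
samples = [4, 5, 6, 5, 4, 3]

# MLE for the geometric distribution on x = 1, 2, ...: p_hat = 1 / mean(x) = n / sum(x)
p_hat = len(samples) / sum(samples)

print(round(p_hat, 3))  # 0.222
```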
• If the prior is N (0.6, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.4, 0.1).
• If the prior is N (0.4, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.6, 0.1).
• With a prior of N (0.1, 0.001), the estimate will never converge to the true value,
regardless of the number of samples used.
• With a prior of U (0, 0.5) (uniform between 0 and 0.5), the estimate will never
converge to the true value, regardless of the number of samples used.
Solution
1. Effect of the Prior Mean:
MAP estimation incorporates prior knowledge. If the prior mean is closer to the true
value (p = 0.7), the estimator converges faster. Since N (0.6, 0.1) is closer to 0.7 than
N (0.4, 0.1), fewer samples will be needed to converge, making the first statement true.
2. Incorrect Claim about N (0.4, 0.1):
Since 0.4 is farther from 0.7 than 0.6, this statement is false.
3. Effect of a Strongly Biased Prior:
A prior of N(0.1, 0.001) is highly concentrated near 0.1. While it slows down convergence, given sufficient data, the MAP estimate will eventually reach the true value. This statement is false.
4. Effect of a Uniform Prior on [0, 0.5]:
Since the uniform prior does not include 0.7 in its support, the MAP estimate will never
reach 0.7, regardless of the number of samples. This statement is true.
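The contrast between points 3 and 4 can be illustrated with a grid-search MAP estimate. Everything specific in the sketch is an assumption made for illustration: a Bernoulli likelihood with true p = 0.7, a large hypothetical sample (700,000 successes in 1,000,000 trials), and the second parameter of N(·, ·) read as a variance.

```python
import math

def map_estimate(successes, trials, log_prior):
    """Grid search over p in (0, 1) for the maximum of log-likelihood + log-prior."""
    best_p, best_score = None, -math.inf
    for i in range(1, 1000):
        p = i / 1000
        prior = log_prior(p)
        if prior == -math.inf:
            continue  # p lies outside the prior's support
        score = (successes * math.log(p)
                 + (trials - successes) * math.log(1 - p)
                 + prior)
        if score > best_score:
            best_p, best_score = p, score
    return best_p

def log_uniform_0_to_half(p):
    """Log-density of U(0, 0.5); zero density above 0.5."""
    return math.log(2.0) if p <= 0.5 else -math.inf

def log_normal(mean, variance):
    """Unnormalized Gaussian log-density (assumes the second argument is a variance)."""
    return lambda p: -0.5 * (p - mean) ** 2 / variance

successes, trials = 700_000, 1_000_000   # hypothetical data with true p = 0.7
map_uniform = map_estimate(successes, trials, log_uniform_0_to_half)
map_narrow = map_estimate(successes, trials, log_normal(0.1, 0.001))

print(map_uniform, map_narrow)
```

Under these assumptions the uniform prior pins the estimate at the edge of its support (0.5), while the narrow Gaussian prior is eventually overwhelmed by the data and the estimate lands close to 0.7.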
Final Answer
Accepted Answers:
• If the prior is N (0.6, 0.1), we will likely require fewer samples for converging to the
true value than if the prior is N (0.4, 0.1).
• With a prior of U(0, 0.5), the estimate will never converge to the true value, regardless of the number of samples used.
• To obtain a distribution over the predicted values for a new data point, we need to
compute an integral over the parameter space.
• The MAP estimate of the parameter gives a point prediction for a new data point.
• The MLE of a parameter gives a distribution of predicted values for a new data
point.
Solution
1. Bayesian Approach and Integral Computation:
To obtain a predictive distribution, we integrate over the parameter space using Bayesian inference:
$$P(y \mid x) = \int P(y \mid x, \theta) \, P(\theta \mid D) \, d\theta$$
Since it provides a single best estimate, it gives a point prediction, making the second
statement true.
3. Incorrect MLE Claim:
The MLE finds the most likely parameter value but does not provide a distribution over
predicted values. This statement is false.
4. Incorrect Requirement for Point Estimate:
A full Bayesian approach computes distributions directly without needing a point estimate, making this statement false.
Final Answer
Accepted Answers:
• To obtain a distribution over the predicted values for a new data point, we need to
compute an integral over the parameter space.
• The MAP estimate of the parameter gives a point prediction for a new data point.
where pi and qi are the true and predicted distributions, respectively. Given this,
which of the following statements are true?
Solution
1. Connection to KL Divergence:
The cross-entropy loss can be rewritten using the KL divergence:
Final Answer
Accepted Answer: Minimizing $H_{CE}(p, q)$ is equivalent to minimizing $D_{KL}(p \| q)$.
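The equivalence follows from the identity H_CE(p, q) = H(p) + D_KL(p‖q), where H(p) does not depend on q. The sketch below checks it numerically, assuming the standard definition H_CE(p, q) = −Σ p_i log q_i and using two hypothetical distributions.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # hypothetical true distribution
q = [0.5, 0.3, 0.2]   # hypothetical predicted distribution

gap = cross_entropy(p, q) - (entropy(p) + kl_divergence(p, q))
print(abs(gap) < 1e-12)  # True
```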
• Using the ReLU activation function avoids all problems arising due to gradients
being too small.
• The dead neurons problem in ReLU networks can be fixed using a leaky ReLU
activation function.
Solution
1. Incorrect Claim About Non-Linearity:
Non-linearity is crucial for deep networks; otherwise, the entire network collapses into a
linear function. The first statement is false.
2. Incorrect Claim About Saturating Activation Functions:
Saturating activation functions (e.g., sigmoid, tanh) cause vanishing gradients, making
optimization difficult. The second statement is also false.
3. Incorrect Claim About ReLU Solving All Gradient Issues:
While ReLU mitigates vanishing gradients, it still suffers from the dying ReLU problem
where neurons output zero permanently. Thus, the third statement is false.
4. Correct Statement About Leaky ReLU:
Leaky ReLU assigns a small slope for negative inputs, preventing neurons from completely
dying. The fourth statement is true.
Final Answer
Accepted Answers:
• The dead neurons problem in ReLU networks can be fixed using a leaky ReLU activation function.
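The difference described in point 4 of the solution shows up directly in the gradients. In this sketch the slope alpha = 0.01 is a commonly used but hypothetical choice.

```python
def relu_grad(z):
    """Gradient of ReLU: zero for all non-positive inputs."""
    return 1.0 if z > 0 else 0.0

def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: small slope alpha for negative inputs."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Gradient of leaky ReLU: never exactly zero."""
    return 1.0 if z > 0 else alpha

z = -3.0   # a negative pre-activation
print(relu_grad(z), leaky_relu_grad(z))  # 0.0 vs 0.01
```

A unit whose pre-activation stays negative receives no gradient under ReLU, whereas leaky ReLU still passes a small gradient, so its weights can keep updating.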
Machine Learning Assignment 6 Solution
Explanation:
A Decision Tree is a supervised learning algorithm because it requires both input
features X and corresponding labels Y . The tree is built using splitting criteria that aim
to minimize measures such as:
• Gini Impurity:
$$\text{Gini} = 1 - \sum_{i=1}^{c} p_i^2$$
Since these measures rely on class labels, Decision Trees are not unsupervised. Thus,
both the statement and the reason are false.
Final Answer:
Statement is False. Reason is False.
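The dependence on class labels is visible in how Gini impurity is computed: it takes per-class counts at a node as input. The counts in the example are hypothetical.

```python
def gini(class_counts):
    """Gini impurity of a node from its per-class sample counts."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)

print(gini([5, 5]))    # evenly mixed node -> 0.5
print(gini([10, 0]))   # pure node         -> 0.0
```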
• Reducing the maximum depth prevents the tree from memorizing noise.
• If set too aggressively, the tree becomes too simple and underfits.
Final Answer:
Final Answer:
Analysis:
• Statement 1 is false because Decision Trees are non-parametric but not linear.
Final Answer:
Q5: Entropy for a 50-50 Split
Entropy Formula for Binary Classification:
$$H(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$
$$\log_2 0.5 = -1$$
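With both class probabilities equal to 0.5, the formula gives an entropy of exactly 1 bit, which a short sketch confirms.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))  # 1.0
```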
Substituting n = 10
For our case:
$$2^{(10-1)} - 1 = 2^9 - 1 = 512 - 1 = 511$$
Explanation
- The total number of subsets for n elements is $2^n$, including the empty set.
- Since we exclude the empty subset, we get $2^n - 1$.
- However, each split is determined by choosing **one subset**, and the other subset is automatically its complement.
- This results in $2^{(n-1)} - 1$ valid splits, ensuring non-trivial partitions.
Final Answer
511
Thus, the number of possible combinations needed to find the best split point is
**511**.
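The formula can be cross-checked by brute force for a small number of categories. The sketch below enumerates every split into two non-empty groups, fixing the first category on the left side so that a split and its mirror image are counted once; the helper names are illustrative.

```python
from itertools import combinations

def count_binary_splits(n):
    """Number of ways to split n categories into two non-empty groups."""
    return 2 ** (n - 1) - 1

def enumerate_binary_splits(values):
    values = list(values)
    first, rest = values[0], values[1:]
    splits = []
    # Keep `first` on the left to avoid counting mirrored splits twice
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(values) - left
            if right:  # skip the trivial split with an empty right side
                splits.append((left, right))
    return splits

print(len(enumerate_binary_splits(range(4))), count_binary_splits(4))  # 7 7
print(count_binary_splits(10))  # 511
```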
Total samples = 10
$$p_1 = \frac{4}{10} = 0.4, \qquad p_0 = \frac{6}{10} = 0.6$$
Substituting into the entropy formula:
$$H(S) = -(0.4 \log_2 0.4 + 0.6 \log_2 0.6) \approx 0.971$$
Subset for Vaccination = 0
- Count: 5 samples
- Malignant (1s) = 4, Non-malignant (0s) = 1
$$p_1 = \frac{4}{5} = 0.8, \qquad p_0 = \frac{1}{5} = 0.2$$
$$H_{split} = 0.361$$
$$IG = H(S) - H_{split} = 0.971 - 0.361 = 0.610$$
Thus, the **information gain of Vaccination** is:
0.610
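The quantities can be recomputed from the counts. The sketch takes the parent counts (4 malignant, 6 non-malignant) and the Vaccination = 0 subset (4 malignant, 1 non-malignant) from the text; the Vaccination = 1 subset (0 malignant, 5 non-malignant) is inferred from those totals and is therefore an assumption.

```python
import math

def entropy(pos, neg):
    """Binary entropy in bits from class counts."""
    total = pos + neg
    return -sum((c / total) * math.log2(c / total) for c in (pos, neg) if c > 0)

parent = entropy(4, 6)

# (malignant, non-malignant) per branch; the second branch is inferred from the totals
children = [(4, 1), (0, 5)]
n_total = sum(p + n for p, n in children)
h_split = sum((p + n) / n_total * entropy(p, n) for p, n in children)

info_gain = parent - h_split
print(round(parent, 3), round(h_split, 3), round(info_gain, 3))
```

With these counts the parent entropy is about 0.971, the weighted child entropy about 0.361, and the information gain about 0.610.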
Machine Learning Assignment 7 Solution
1. A model with a lower training loss will perform better on a validation dataset.
2. A model with a higher training accuracy will perform better on a validation dataset.
3. The train and validation datasets can be drawn from different distributions.
4. The train and validation datasets must accurately represent the real distribution of
data.
Correct Answer: Option 4: The train and validation datasets must accurately
represent the real distribution of data.
Explanation:
• Training loss vs. Validation performance: A lower training loss does not
necessarily mean better validation performance. If the model overfits the training
data, it may perform poorly on unseen data.
• Training accuracy vs. Validation accuracy: High training accuracy does not
always guarantee high validation accuracy due to overfitting.
• Data Distribution: The train and validation datasets must be drawn from the
same distribution. If they are from different distributions, the validation set will
not provide meaningful insights into model performance.
• Class B: 40 samples
Using stratified sampling, which of the following train-test splits would be appropriate?
• In stratified sampling, we must preserve the same class proportions in both training
and testing sets.
Explanation:
• k-fold Cross-Validation: In this technique, the dataset is split into k parts (folds),
and the model is trained on k − 1 folds while testing on the remaining one. This
process is repeated k times with different test sets.
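First Classifier
$$\begin{bmatrix} 4 & 6 \\ 13 & 77 \end{bmatrix}$$
$$\text{Recall} = \frac{4}{4 + 6} = \frac{4}{10} = 0.4$$
Second Classifier
$$\begin{bmatrix} 8 & 2 \\ 40 & 60 \end{bmatrix}$$
$$\text{Recall} = \frac{8}{8 + 2} = \frac{8}{10} = 0.8$$
Third Classifier
$$\begin{bmatrix} 5 & 5 \\ 9 & 81 \end{bmatrix}$$
$$\text{Recall} = \frac{5}{5 + 5} = \frac{5}{10} = 0.5$$
Fourth Classifier
$$\begin{bmatrix} 7 & 3 \\ 0 & 90 \end{bmatrix}$$
$$\text{Recall} = \frac{7}{7 + 3} = \frac{7}{10} = 0.7$$
Conclusion
Since the highest recall value is **0.8**, we choose the **second classifier**:
$$\begin{bmatrix} 8 & 2 \\ 40 & 60 \end{bmatrix}$$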
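The comparison can be scripted. The sketch assumes each confusion matrix is laid out as [[TP, FN], [FP, TN]], which is the layout implied by the recall computations above.

```python
# Confusion matrices, assumed layout: [[TP, FN], [FP, TN]]
classifiers = {
    "first": [[4, 6], [13, 77]],
    "second": [[8, 2], [40, 60]],
    "third": [[5, 5], [9, 81]],
    "fourth": [[7, 3], [0, 90]],
}

def recall(cm):
    """Recall = TP / (TP + FN), read from the first row."""
    tp, fn = cm[0]
    return tp / (tp + fn)

recalls = {name: recall(cm) for name, cm in classifiers.items()}
best = max(recalls, key=recalls.get)
print(recalls, best)
```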
Conclusion: Choosing the Classifier with Minimum FPR
Since the third classifier has the **lowest FPR** of **0.022**, we choose:
$$CM_3 = \begin{bmatrix} 1 & 9 \\ 2 & 88 \end{bmatrix}$$
where:
- **Precision** = $\dfrac{TP}{TP + FP}$
- **Recall** = $\dfrac{TP}{TP + FN}$
3. **Boosting may assign unequal weights to the predictions of all the weak classifiers** (Correct)
- Unlike bagging, boosting assigns different weights to different classifiers.
4. **The individual classifiers in boosting can be trained parallelly** (Incorrect)
- Boosting trains classifiers sequentially, as each classifier depends on the previous one's errors.
5. **The individual classifiers in boosting cannot be trained parallelly** (Correct)
- Since boosting is sequential, training is not parallelizable.
• Boosting may assign unequal weights to the predictions of all the weak classifiers.
Conclusion: Correct Statements
The following statements are correct:
• Training sets are constructed from the original dataset by sampling with replacement.
• Ensemble aggregation methods like bagging aim to reduce overfitting and variance.
• Stacking involves training multiple models and stacking their predictions into new
training data.
Machine Learning Assignment 8 Solution
• The goal of random forests is to increase the correlation between the trees.
• The goal of random forests is to decrease the correlation between the trees.
• In Random Forests, each decision tree fits the residuals from the previous one; thus,
the correlation between the trees won’t matter.
• None of these.
Correct Answer: The goal of random forests is to decrease the correlation between
the trees.
Explanation:
• The key idea is to make the individual trees as uncorrelated as possible so that
their combined predictions are more robust. A lower correlation among trees leads
to better generalization.
• The third option describes Gradient Boosting rather than Random Forests, where
trees are sequentially trained to correct previous errors.
Q2: Gradient Boosted Decision Trees
Question: Consider the two statements:
Does the data satisfy the Naïve Bayes assumption?
• Yes
• No ✓
• None of these
Correct Answer: No.
Explanation:
• The Naïve Bayes classifier assumes that the features are conditionally independent given the class label.
• In the scatter plot, the two features appear to have a strong correlation (diagonal separation between the two classes).
• This violates the independence assumption of Naïve Bayes, making it less effective for this dataset.
• Therefore, the correct answer is No, as the data does not satisfy the Naïve Bayes assumption.
x y
India won the match. Cricket
The Mercedes car was driven by Lewis Hamilton. Formula 1
The ball was driven through the covers for a boundary Cricket
Max Verstappen has a fast car. Formula 1
Bumrah is a fast bowler. Cricket
Max Verstappen won the race Formula 1
Suppose you have to classify a test example “The ball won the race to the boundary”
and are asked to compute
What issue will you face if you use the Naïve Bayes Classifier, and how will you handle it? Assume word frequencies are used to estimate all probabilities.
• ◦ Problem: A few words that appear at test time do not appear in the
dataset. Solution: Smoothing. ✓
• ◦ Problem: A few words that appear at test time appear more than once in the
dataset. Solution: Remove those words from the dataset.
• ◦ None of these
Correct Answer: Problem: A few words that appear at test time do not appear in
the dataset. Solution: Smoothing.
Explanation:
• If a word appears in the test sentence but not in the training data, the probability
becomes zero.
• The solution is to use Laplace Smoothing, where we add a small value (e.g., 1)
to the word count to prevent zero probabilities.
$$P(w \mid C) = \frac{\text{count}(w, C) + \alpha}{\sum_{\text{all words } w'} \text{count}(w', C) + \alpha V}$$
where:
• $\sum_{\text{all words } w'} \text{count}(w', C)$ is the total count of words in class C.
• If a word in the test set does not appear in the training data, its probability becomes
**zero**.
• This causes the entire Naïve Bayes probability calculation to be **zero** due to multiplication.
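The effect of smoothing can be seen on the six training sentences from the table. The following is a minimal sketch, not the assignment's reference solution: the tokenizer (lowercasing, stripping periods) and the choice of V as the training vocabulary size are assumptions, and all function names are illustrative.

```python
from collections import Counter

# Training sentences and labels from the table in the question.
docs = [
    ("India won the match.", "Cricket"),
    ("The Mercedes car was driven by Lewis Hamilton.", "Formula 1"),
    ("The ball was driven through the covers for a boundary", "Cricket"),
    ("Max Verstappen has a fast car.", "Formula 1"),
    ("Bumrah is a fast bowler.", "Cricket"),
    ("Max Verstappen won the race", "Formula 1"),
]

def tokenize(text):
    # Assumed preprocessing: lowercase and drop periods.
    return text.lower().replace(".", "").split()

# Per-class word counts and the training vocabulary.
counts = {}
for text, label in docs:
    counts.setdefault(label, Counter()).update(tokenize(text))
vocab = {word for counter in counts.values() for word in counter}

def word_prob(word, label, alpha=1.0):
    """P(word | label) with additive smoothing; alpha=0 disables smoothing."""
    total = sum(counts[label].values())
    return (counts[label][word] + alpha) / (total + alpha * len(vocab))

# "race" never occurs in a Cricket sentence.
print(word_prob("race", "Cricket", alpha=0.0))  # unsmoothed estimate
print(word_prob("race", "Cricket", alpha=1.0))  # smoothed estimate
```

Without smoothing the estimate for "race" under Cricket is exactly zero, which would zero out the whole product for the test sentence; with α = 1 it becomes a small positive value.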
• ◦ Committee Machine
• ◦ AdaBoost ✓
• ◦ Bagging
• ◦ Stacking
• It assigns higher weights to misclassified instances and retrains the model iteratively
to reduce errors.
• Since the classifier is performing just slightly above random guessing (50–60%),
AdaBoost is ideal for improving its accuracy.
• Bagging is more suitable for reducing variance, while stacking and committee ma-
chines involve different models.
$$P(y = i \mid x) = \frac{e^{\beta_{0i} + \beta_i x}}{\sum_{j=1}^{k} e^{\beta_{0j} + \beta_j x}}$$
- Here, we require k − 1 sets of parameters because one class is taken as the reference.
- Given that we have **6 classes**, we estimate **5 sets** of (β0 , β) pairs.
Answer: 5
Options:
• ()6
• ( ) 12
• () 5
• ( ) 10
Q7: Bayesian Network and Conditional Independence
The following Bayesian Network consists of **9 variables**, all of which are binary:
Which of the following is/are always true for the above Bayesian Network?
Explanation:
2. **Option 2: P(A, I) = P(A)P(I)**
- This is incorrect because **A and I are not necessarily independent**.
- In a Bayesian Network, two variables are guaranteed to be marginally independent only if every path between them is blocked (d-separated) when nothing is observed.
- Here, A and I have possible dependencies through other nodes.
3. **Option 3: P(B|H, E, G) = P(B|E, G)P(H|E, G)**
- **This is correct** because of the **conditional independence rule** in Bayesian Networks.
- Given the structure, once we condition on E and G, the probability of B and H factorizes as shown.
Q8: Naïve Bayes Classification for Phone Type Prediction
Given Data: The table below shows the distribution of phones based on their type and
features:
Given a phone with:
- Dual SIM (not explicitly given in the table)
- NFC = Yes
- 5G = No
We need to calculate the probabilities of this phone being a:
1. Budget phone
2. Mid-Range phone
3. High-End phone
using **Naïve Bayes Classification** and rank them accordingly.
Step 3: Compute Posterior Probabilities (Ignoring Dual SIM)
Using Bayes’ Theorem:
Conclusion
• The **Budget phone** has a probability of zero because it does not support NFC.
• The **Mid-Range phone** has the highest probability because it has a significant
number of phones with NFC and No 5G.
• The **High-End phone** has a lower probability than Mid-Range but is still pos-
sible.
Machine Learning Assignment 9 Solution
Solution to Question 1
Problem Statement: Consider the Markov Random Field (MRF) given below. We
need to delete one edge (without deleting any nodes) so that in the resulting graph, B
and F are independent given A. Which of these edges could be deleted to achieve this
independence?
Concept: In a Markov Random Field (MRF), two nodes are conditionally independent given a set of nodes if every path between them passes through at least one node in that set. Equivalently, once the conditioning nodes are removed from the graph, the two nodes must be completely disconnected.
Analysis of Edge Removals:
• Removing AC: This does not fully disconnect B and F when given A. They
remain connected through other paths.
• Removing CE: Removing edge CE further ensures that any indirect paths be-
tween B and F are eliminated, reinforcing conditional independence.
• Removing AE: Does not fully ensure independence between B and F , as there
are still paths through C and E.
Conclusion: The correct edges to remove for ensuring B and F are independent
given A are:
BE, CE
Solution to Question 2
Problem Statement: We need to delete one node (along with its incident edges) so
that in the resulting graph, nodes B and C are independent given A.
Concept: In a Markov Random Field (MRF), two nodes are said to be condition-
ally independent given another node if the only path(s) between them pass through
that node. This means that once we observe the given node, the two nodes should no
longer have any additional paths connecting them.
Understanding the Graph:
- The given MRF consists of the nodes **A, B, C, D, E, and F**.
- The **edges represent dependencies** between the nodes.
- The goal is to ensure that B and C are independent given A.
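The separation criterion for an MRF can be tested mechanically: block the conditioning nodes and check whether the two query nodes are still connected. A minimal sketch follows. The adjacency list is a hypothetical four-node example, not the graph from the question, and the helper name `separated` is illustrative.

```python
from collections import deque

def separated(graph, x, y, given):
    """Return True if every path from x to y passes through a node in `given`."""
    blocked = set(given)
    seen, queue = {x}, deque([x])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == y:
                return False  # reached y without crossing a blocked node
            if neighbor not in seen and neighbor not in blocked:
                seen.add(neighbor)
                queue.append(neighbor)
    return True

# Hypothetical undirected graph (NOT the one from the assignment).
graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C"},
}

print(separated(graph, "B", "C", given={"A"}))       # path B-D-C avoids A
print(separated(graph, "B", "C", given={"A", "D"}))  # both paths are blocked
```

In this toy graph, B and C are not independent given A alone because of the path through D. Deleting D along with its edges would remove that path, which is the kind of reasoning the question asks for.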
Step-by-Step Analysis:
Solution to Question 3
Problem Statement: Which node in the Markov Random Field has the largest Markov
blanket (i.e., the most number of direct neighbors)?
Concept: In a Markov Random Field (MRF), the Markov blanket of a node con-
sists of all its directly connected neighbors. The Markov blanket contains all the nodes
that shield the given node from the influence of the rest of the network.
Step-by-Step Calculation:
• Node B: Connected to {A, E}. Total: 2
• Node C: Connected to {A, D, E}. Total: 3
• Node D: Connected to {C}. Total: 1
• Node E: Connected to {B, C}. Total: 2
• Node F: Connected to {A, C}. Total: 2
2. Finding the Node with the Largest Markov Blanket:
- Nodes A and C have the highest number of neighbors (3).
- Hence, they have the **largest Markov blanket**.
Conclusion: The correct answer is:
A, C
Solution to Question 4
Problem Statement: We need to determine the correct independence relations in the
given Bayesian Network.
2. (B) A and B are independent if no other variables are given:
- Since there is no direct edge between A and B, they are marginally independent.
- This statement is true.
3. (C) C and D are not independent if A is given:
- C and D are connected only through A.
- Given A, there is no additional path connecting C and D.
- This statement is false.
4. (D) A and F are independent if C is given:
- A and F are connected only through C.
- If C is given, then there is no other direct connection between A and F, making them independent.
- This statement is true.
Final Answers: The correct independence relations are:
(B) A and B are independent if no other variables are given.
Solution to Question 5
Problem Statement: We need to calculate the number of independent parameters re-
quired to represent the probability tables in the Bayesian Network, assuming all variables
are binary.
Step-by-Step Calculation:
• A, B, C, D, E, F are binary (ri = 2).
• Computing parameters for each node:
– P (A): 2 − 1 = 1
– P (B): 2 − 1 = 1
– P (C|A, B): (2 − 1) × 2 × 2 = 4
– P (D|A): (2 − 1) × 2 = 2
– P (E|C): (2 − 1) × 2 = 2
– P (F |C): (2 − 1) × 2 = 2
Total Parameters:
1 + 1 + 4 + 2 + 2 + 2 = 12
Final Answer:
12
Solution to Question 6
Problem Statement: We need to calculate the number of independent parameters re-
quired for the Bayesian Network when variables A, C, E have four possible values and
variables B, D, F are binary.
Step-by-Step Calculation:
• rA = 4, rB = 2, rC = 4, rD = 2, rE = 4, rF = 2.
– P (A): (4 − 1) = 3
– P (B): (2 − 1) = 1
– P (C|A, B): (4 − 1) × 4 × 2 = 24
– P (D|A): (2 − 1) × 4 = 4
– P (E|C): (4 − 1) × 4 = 12
– P (F |C): (2 − 1) × 4 = 4
Total Parameters:
3 + 1 + 24 + 4 + 12 + 4 = 48
Final Answer:
48
Solution to Question 7
Problem Statement: In the Bayesian Network from Question 4, assume all variables
can take 4 values. We need to determine the number of independent parameters required
to represent the probability distribution.
• E and F depend on C:
– P (E|C) ⇒ 4 × 3 = 12 parameters.
– P (F |C) ⇒ 4 × 3 = 12 parameters.
Total Number of Independent Parameters:
6 + 12 + 48 + 12 + 12 = 90
Final Answer:
90
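The parameter counts in Questions 5 to 7 all follow one rule: a node with r states and parents with q joint configurations needs (r − 1) × q independent parameters. A minimal sketch of that rule, using the parent sets read off the conditional tables listed in Question 5 (the function name and dictionary layout are illustrative):

```python
from math import prod

# Parent sets taken from the tables P(A), P(B), P(C|A,B), P(D|A), P(E|C), P(F|C).
parents = {
    "A": [],
    "B": [],
    "C": ["A", "B"],
    "D": ["A"],
    "E": ["C"],
    "F": ["C"],
}

def num_parameters(cardinality, parents):
    """Sum of (r_v - 1) * product of parent cardinalities over all nodes v."""
    return sum(
        (cardinality[node] - 1) * prod(cardinality[p] for p in pars)
        for node, pars in parents.items()
    )

all_binary = {node: 2 for node in parents}
mixed = {"A": 4, "B": 2, "C": 4, "D": 2, "E": 4, "F": 2}
all_four = {node: 4 for node in parents}

print(num_parameters(all_binary, parents))  # Question 5
print(num_parameters(mixed, parents))       # Question 6
print(num_parameters(all_four, parents))    # Question 7
```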
Solution to Question 8
Problem Statement: We need to find valid factorizations to compute the marginal
probability P (E = e) using variable elimination.
Solution 9: Understanding the Markov Random Field (MRF)
A Markov Random Field (MRF) is an undirected graphical model where the proba-
bility distribution can be expressed as a product of potential functions (ψ) over the
maximal cliques of the graph.
Properties of MRFs:
• Undirected Graph Representation: The relationships between variables are
represented as undirected edges.
• Clique 2: (B, E)
Therefore, the probability distribution should be factorized as:
$$P(a, b, c, d, e) = \frac{1}{Z}\,\psi_1(a, b, c, d)\,\psi_2(b, e) \tag{1}$$
where:
• ψ1 (a, b, c, d) is the potential function for clique (A, B, C, D).
Option 1:
$$P(a, b, c, d, e) = \frac{1}{Z}\,\psi_1(a, b, c, d)\,\psi_2(b, e) \tag{2}$$
Correct! Matches the identified maximal cliques.
Option 2:
$$P(a, b, c, d, e) = \frac{1}{Z}\,\psi_1(b)\,\psi_2(a, c, d)\,\psi_3(a, b, e) \tag{3}$$
Incorrect!
• ψ1 (b) suggests an independent factor for B, which is incorrect since B has depen-
dencies.
• The factor ψ3 (a, b, e) incorrectly includes A and E together, which is not part of a
maximal clique.
Option 3:
$$P(a, b, c, d, e) = \frac{1}{Z}\,\psi_1(a, b)\,\psi_2(c, d)\,\psi_3(b, e) \tag{4}$$
Incorrect!
Option 4:
$$P(a, b, c, d, e) = \frac{1}{Z}\,\psi_1(a, b, c, d)\,\psi_2(c, d)\,\psi_3(b, d, e) \tag{5}$$
Incorrect!
Option 5:
$$P(a, b, c, d, e) = \frac{1}{Z}\,\psi_1(a, c)\,\psi_2(b, e)\,\psi_3(b, e) \tag{6}$$
Incorrect!
Option 6:
$$P(a, b, c, d, e) = \frac{1}{Z}\,\psi_1(c)\,\psi_2(b, e)\,\psi_3(b, a, d) \tag{7}$$
Incorrect!
Conclusion
The first option correctly factorizes the probability distribution using the maximal
cliques of the given MRF. Hence, the correct answer is:
$$P(a, b, c, d, e) = \frac{1}{Z}\,\psi_1(a, b, c, d)\,\psi_2(b, e) \tag{8}$$
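To make the role of the normalizer Z concrete, the clique factorization can be evaluated by brute force for binary variables. This is a sketch under stated assumptions: the two potential functions below are arbitrary positive functions invented for illustration, not values from the assignment.

```python
from itertools import product

def psi1(a, b, c, d):
    # Hypothetical potential over the clique (A, B, C, D).
    return 1.0 + a + b + c + d

def psi2(b, e):
    # Hypothetical potential over the clique (B, E): favors agreement.
    return 2.0 if b == e else 1.0

# Enumerate all 2^5 joint assignments of (a, b, c, d, e).
states = list(product([0, 1], repeat=5))

# Z sums the unnormalized product of potentials over every assignment.
Z = sum(psi1(a, b, c, d) * psi2(b, e) for a, b, c, d, e in states)

def joint(a, b, c, d, e):
    """Normalized joint probability P(a, b, c, d, e)."""
    return psi1(a, b, c, d) * psi2(b, e) / Z

total = sum(joint(*state) for state in states)
print(Z, total)
```

The potentials themselves are not probabilities; only after dividing by Z does the product sum to one over all assignments.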
• Yi represents the observed states, which correspond to the words in the sen-
tence.
X1 → X2 → X3
Each Xi generates an observation Yi :
X1 → Y1 , X2 → Y2 , X3 → Y3
This means:
Step 2: Evaluating the Given Statements
Statement 1:
“The Xi variables represent parts-of-speech and the Yi variables represent the words in the sentence.”
Correct!
This matches the definition of HMM for POS tagging, where the hidden states Xi are
POS tags and the observed states Yi are words.
Statement 2:
“The Yi variables represent parts-of-speech and the Xi variables represent the words in the sentence.”
Incorrect!
This statement incorrectly swaps the roles of Xi and Yi . In POS tagging, words are
observable (Yi ), and POS tags are hidden (Xi ).
Statement 3:
“The Xi variables are observed and the Yi variables need to be predicted.”
Incorrect!
Statement 4:
“The Yi variables are observed and the Xi variables need to be predicted.”
Correct!
This correctly describes POS tagging:
Conclusion
The correct answers are:
Statements 1 and 4 are true. (9)
Machine Learning Assignment 10 Solution
1 Solution of Q.1 :
Single Linkage Clustering is a hierarchical clustering method where the distance between
two clusters is defined as the minimum distance between any two points in the clusters.
Step 4: Merging (P3, P4) with P5, P6
• The next smallest distance is 6 (between {P3, P4} and P5).
Final Merge
• The final merge is 8 (between {P1, P2} and {P3, P4, P5, P6}).
Conclusion
Thus, the correct hierarchical clustering dendrogram follows this order, and the correct
answer is **Option B**.
2 Solution of Q.2
Complete Linkage Clustering is a hierarchical clustering method where the distance be-
tween two clusters is defined as the maximum distance between any two points in the
clusters.
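The only difference between single and complete linkage is whether the minimum or the maximum pairwise distance defines the gap between two clusters. The sketch below shows how that choice changes the merge order. The four 1-D points are hypothetical and unrelated to P1 to P6 in the question, and the function names are illustrative.

```python
from itertools import combinations

def single_linkage(a, b):
    """Minimum distance between any point in a and any point in b."""
    return min(abs(p - q) for p in a for q in b)

def complete_linkage(a, b):
    """Maximum distance between any point in a and any point in b."""
    return max(abs(p - q) for p in a for q in b)

def agglomerate(points, linkage):
    """Repeatedly merge the closest pair of clusters; return the merges in order."""
    clusters = [(p,) for p in points]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

points = [0, 2, 5, 9]  # hypothetical, distinct 1-D points
for name, linkage in [("single", single_linkage), ("complete", complete_linkage)]:
    print(name, [(a, b, d) for a, b, d in agglomerate(points, linkage)])
```

With these points, single linkage attaches 5 to {0, 2} at distance 3, while complete linkage first merges 5 with 9 at distance 4, so the two dendrograms differ.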
Clusters: { P1 }, { P2 }, { P3, P4 }, { P5 }, { P6 }
Step 3: Merging P3, P4 with P1
• The next smallest maximum distance is between {P3, P4} and P1.
Final Merge
• The final merge happens between {P1, P2, P3, P4} and {P5, P6}.
Conclusion
Thus, the correct hierarchical clustering dendrogram follows this order, and the correct
answer is **Option D**.
Finding the Radius of the Merged Cluster C
Given two clusters A and B with their respective N, SUM, and SS values, when they are merged to form a new cluster C, the combined cluster parameters are:
$$N_C = N_A + N_B \tag{2}$$
$$SUM_C = SUM_A + SUM_B \tag{3}$$
Final Answer
$$\sqrt{\frac{SS_A + SS_B}{N_A + N_B} - \left(\frac{SUM_A + SUM_B}{N_A + N_B}\right)^2} \tag{7}$$
This is the correct formula for computing the radius of the merged cluster in
BIRCH.
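The merged-radius formula can be checked numerically against a direct computation. This is a minimal 1-D sketch: the two point sets are hypothetical, and the clustering feature is represented as a plain (N, SUM, SS) tuple for illustration.

```python
import math
from statistics import pstdev

def clustering_feature(points):
    """Return the BIRCH summary (N, SUM, SS) for 1-D points."""
    return (len(points), sum(points), sum(x * x for x in points))

def merge(cf_a, cf_b):
    """Clustering features are additive: add them component-wise."""
    return tuple(a + b for a, b in zip(cf_a, cf_b))

def radius(cf):
    """Radius = sqrt(SS / N - (SUM / N)^2)."""
    n, s, ss = cf
    return math.sqrt(ss / n - (s / n) ** 2)

cluster_a = [1.0, 2.0, 3.0]  # hypothetical data
cluster_b = [5.0, 7.0]       # hypothetical data

cf_c = merge(clustering_feature(cluster_a), clustering_feature(cluster_b))
print(cf_c, radius(cf_c))
# The same value computed directly from the raw points:
print(pstdev(cluster_a + cluster_b))
```

The point of the summary is that the radius of C is obtained without revisiting the raw points of A or B.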
Solution to Question 4
Given Statements:
Verifying Statement 1
CURE effectively handles outliers by:
• Applying a shrinkage factor to move the representative points towards the cluster
center.
Verifying Statement 2
Multiplicative shrinkage reduces the influence of extreme points by shifting representa-
tive points towards the mean of the cluster. This ensures that outliers do not have a
significant impact on the clustering process.
Final Answer:
Statement 1 is true. Statement 2 is true. Statement 2 is the correct reason
for Statement 1.
• K-Means is an unsupervised algorithm and does not know the true labels.
• The dataset has 10 digit classes (0, 1, 2, 3, 4, 5, 6, 7, 8, 9), and K-Means forms 10
clusters based on pixel similarities.
• Since clusters are unlabeled, we assign each cluster the majority label based on
the most frequent class in that cluster.
Step 2: Assigning Labels Based on Majority Class
Suppose one of the clusters contains 1000 points, out of which:
0.893 ✓
• a = Number of pairs that are in the same cluster in both the ground truth and
predicted clusters.
• b = Number of pairs that are in different clusters in both ground truth and predicted
clusters.
Step 2: Computing a and b
For 10,000 data points:
$$\binom{10000}{2} = \frac{10000 \times 9999}{2} \approx 5 \times 10^7 \tag{11}$$
Assume:
• a = 2.1 × 107 (correctly clustered pairs)
• b is the number of pairs that are correctly placed in different clusters (true nega-
tives).
• $\binom{n}{2} = \frac{n(n-1)}{2}$ is the total number of possible pairs in a dataset of n points.
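With a and b defined this way, the Rand index is (a + b) divided by the total number of pairs. A minimal pair-counting sketch follows; the label lists are hypothetical toy data, and the quadratic loop is only practical for small inputs.

```python
from itertools import combinations

def rand_index(truth, predicted):
    """Fraction of point pairs on which the two labelings agree."""
    a = b = 0
    pairs = list(combinations(range(len(truth)), 2))
    for i, j in pairs:
        same_truth = truth[i] == truth[j]
        same_pred = predicted[i] == predicted[j]
        if same_truth and same_pred:
            a += 1  # together in both labelings
        elif not same_truth and not same_pred:
            b += 1  # apart in both labelings
    return (a + b) / len(pairs)

# Hypothetical labels: cluster IDs need not match the class IDs.
print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # mostly disagreeing partitions
```

Because only pair membership matters, relabeling the clusters does not change the score.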
• Rand index = 1.18 × accuracy: There is no general formula that holds for all
clustering problems, making this option incorrect.
• A value close to 1 indicates strong agreement with the ground truth labels.
For the given BIRCH clustering on MNIST, the obtained Rand index is:
0.88
Final Answer: 0.88
Understanding Outliers in DBSCAN
DBSCAN marks data points as:
• Border points: Points close to core points but not dense enough.
Since PCA reduces dimensionality while retaining most variance, it can change how
DBSCAN perceives clusters, slightly altering the number of detected outliers.
Results
• Number of outliers detected in the original feature space: 1797
Machine Learning Assignment 11 Solution
• None of the above: Incorrect since some parameters are estimated. (False)
Q.2 Which of the following is/are true about the responsibility terms in GMMs? Assume the standard notation used in the lectures.
1. $\sum_k \gamma(z_{nk}) = 1 \;\; \forall n$
2. $\sum_n \gamma(z_{nk}) = 1 \;\; \forall k$
• $\sum_k \gamma(z_{nk}) = 1 \;\; \forall n$: The sum of responsibilities for a given data point across all clusters must be 1, as they represent probabilities. (True)
• $\sum_n \gamma(z_{nk}) = 1 \;\; \forall k$: This would mean each cluster has exactly one data point assigned in total, which is incorrect. (False)
• $\gamma(z_{nk}) \in \{0, 1\} \;\; \forall n, k$: This is only true in hard clustering (like K-Means). In GMMs, responsibilities are soft probabilities. (False)
• $\gamma(z_{nk}) \in [0, 1] \;\; \forall n, k$: Since responsibilities represent probabilities, they must lie within this range. (True)
• $\pi_j > \pi_k \Rightarrow \gamma(z_{nj}) > \gamma(z_{nk}) \;\; \forall n$: The responsibility depends on both $\pi_k$ and the likelihood of the data point under the component. A component with a lower mixing coefficient can still have a higher responsibility. (False)
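These properties are easy to confirm on a small example. The sketch below computes responsibilities for a 1-D mixture of two Gaussians; the mixing coefficients, means, and standard deviations are hypothetical values chosen so that the component with the smaller mixing coefficient wins.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, weights, means, stds):
    """gamma_k = pi_k N(x | mu_k, sigma_k) / sum_j pi_j N(x | mu_j, sigma_j)."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

# Hypothetical model: component 0 has the larger mixing coefficient.
weights = [0.8, 0.2]
means = [0.0, 5.0]
stds = [1.0, 1.0]

gamma = responsibilities(5.0, weights, means, stds)
print(gamma, sum(gamma))
```

For the point x = 5, the second component takes almost all of the responsibility even though its mixing coefficient is only 0.2, which is the counterexample behind the last (False) statement.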
Answer: The correct update equation is:
$$\mu_k^{(m)} = \frac{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{b^{(m-1)}}\, x_n}{\sum_{n=1}^{N} \gamma(z_{nk})\big|_{b^{(m-1)}}} \tag{1}$$
Explanation:
• Option 1 (Incorrect): The denominator should not use $b^{(m)}$, but rather $b^{(m-1)}$, since the update step depends on the previous iteration's estimates.
• Option 2 (Correct): This is the correct update equation, where the numerator is a weighted sum of data points, and the denominator ensures normalization.
• Option 3 (Incorrect): The denominator should be the sum of $\gamma(z_{nk})$ and not a fixed N, as N does not correctly normalize the weighted sum.
• Option 4 (Incorrect): The incorrect use of $b^{(m)}$ and a normalization factor of N makes this option incorrect.
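The mean update is a responsibility-weighted average, so the normalizer has to be the sum of the responsibilities rather than N. A minimal sketch with hypothetical 1-D data and hypothetical responsibilities (assumed to come from the previous E-step):

```python
def update_mean(points, gamma_k):
    """M-step mean for component k: sum_n gamma_nk x_n / sum_n gamma_nk."""
    weighted_sum = sum(g * x for g, x in zip(gamma_k, points))
    return weighted_sum / sum(gamma_k)

points = [1.0, 2.0, 9.0, 10.0]  # hypothetical data
gamma_k = [0.9, 0.8, 0.1, 0.2]  # hypothetical responsibilities for component k

mu_k = update_mean(points, gamma_k)
wrong = sum(g * x for g, x in zip(gamma_k, points)) / len(points)  # divides by N
print(mu_k, wrong)
```

Dividing by N instead gives a value that is pulled toward zero whenever the component owns less than the full dataset, which is why Options 3 and 4 fail.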
Q.5 Fit a GMM with 2 components for this data.
Question:
What are the mixing coefficients of the learned components? (Note: Use the sklearn implementation of GMM with random_state = 0. Do not change the other default parameters.)
1. (0.791, 0.209)
2. (0.538, 0.462)
3. (0.714, 0.286)
4. (0.625, 0.375)
1. (2.0, 0.5)
2. (-1.0, -0.5)
3. (7.5, 8.0)
4. (5.0, 5.5)
Since the log-likelihood for (7.5, 8.0) is the highest, it is the most probable point according
to the model.
Q.7 Compare labels of GMM models with 2 and 3 components.
Question:
Let Model A be the GMM with 2 components that was trained in Question 5. Using
the same data, estimate a GMM with 3 components (Model B).
Select the pair(s) of points that have the same label in Model A but different labels
in Model B.
1. Both the statements are correct and Statement B is the correct explanation for
Statement A.
2. Both the statements are correct, but Statement B is not the correct explanation for
Statement A.
Answer: The correct answer is (b) Both statements are correct, but Statement B is
not the correct explanation for Statement A.
Explanation:
• Statement A (Correct): GMMs can achieve arbitrarily high likelihood values
due to their ability to form extremely narrow distributions around data points.
• Why (b) is correct: Statement B does not directly explain Statement A. The
likelihood increasing monotonically with EM is a property of the algorithm, whereas
Statement A discusses the theoretical ability of GMMs to achieve arbitrarily high
likelihoods.
Machine Learning Assignment 12 Solution
(A) 1.000
(B) 0.937
(C) 0.500
(D) 0.000
Calculating:
$$\frac{1}{2} + \frac{1}{4} + \frac{1}{8} + \frac{1}{16} = \frac{8+4+2+1}{16} = \frac{15}{16} = 0.9375.$$
Rounded to three decimal places, this is 0.938. The closest option is 0.937.
Q2. Hoeffding’s Inequality and the Union Bound
Question:
Suppose we have 50 hypothesis functions trained on $10^5$ samples with an error threshold ϵ = 0.1. Using Hoeffding's inequality with a union bound, the probability that there exists a hypothesis h for which $|E_{in}(h) - E_{out}(h)| > \epsilon$ is bounded by:
$$P(\exists h : |E_{in}(h) - E_{out}(h)| > \epsilon) \le 50 \cdot 2 \cdot e^{-2\epsilon^2 N}.$$
What is the corresponding lower bound on the probability that no such hypothesis exists?
(A) $1 - 2e^{-2\epsilon^2 N}$
(B) $1 - 100e^{-2\epsilon^2 N}$
(C) $1 - 100e^{-10^3}$
(D) $1 - 2e^{-10^3}$
Answer: (B) $1 - 100e^{-2\epsilon^2 N}$
Explanation:
Using the union bound, we have:
$$P(\exists h : |E_{in}(h) - E_{out}(h)| > \epsilon) \le 100 \cdot e^{-2\epsilon^2 N}.$$
Thus, the probability that no hypothesis deviates by more than ϵ is at least:
$$1 - 100 \cdot e^{-2\epsilon^2 N}.$$
With ϵ = 0.1 and $N = 10^5$, the exponent is $2\epsilon^2 N = 2 \times 0.01 \times 10^5 = 2000$, so the term $100e^{-2000}$ is vanishingly small and the bound is essentially 1. This is the expression in option (B).
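The lower bound 1 − 2M·e^(−2ϵ²N) for M hypotheses can be evaluated numerically to see how quickly it approaches 1. This is a small sketch; the function name is illustrative, and the sample sizes other than 10^5 are added only for comparison.

```python
import math

def union_bound_lower(num_hypotheses, epsilon, n_samples):
    """Lower bound on P(no hypothesis has |E_in - E_out| > epsilon)."""
    failure = 2 * num_hypotheses * math.exp(-2 * epsilon**2 * n_samples)
    return 1 - failure

for n in (100, 1_000, 100_000):
    print(n, union_bound_lower(50, 0.1, n))
```

For N = 100 the bound is negative and therefore vacuous, while for N = 10^5 the exponential underflows to zero in floating point and the bound evaluates to exactly 1.0.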
(A) 3
(B) 5
(C) 4
(D) 6
Answer: (C) 4
Explanation:
• Option (A): 3 is too low. Although three points can be shattered by a pair of
squares, it is known that more can be shattered.
• Option (C): 4 is correct. Detailed geometric arguments show that some configuration of 4 points can be shattered with two axis-aligned squares, but no configuration of 5 points can be shattered by this hypothesis class.
• Option (D): 6 is too high, since it is impossible for two squares to correctly classify
every possible dichotomy on 6 points.
(B) Learn a function that stores the value for state-action pairs (i.e., learn Q(s, a)).
(D) Run a random agent repeatedly until it wins and use that as the policy.
• Option (A): Directly learning the policy (via methods such as policy gradients) is
a valid approach when T is unknown.
• Option (C): If it is possible to estimate T from experience, one can use model-
based approaches to compute V .
• Option (D): Running a random agent until it wins is inefficient and does not lead
to a generalizable policy.
Q5. Value Update for V (X4) After One Iteration
Question:
For each state, a value is defined to help determine the agent’s behavior. The initial
values are given by:
For each state S ∈ {X1 , X2 , X3 , X4 , Start} (with SL as the state immediately to the left
and SR as the state immediately to the right), the update rule is:
$$V(S) = 0.9 \times \max\big(V(S_L), V(S_R)\big)$$
(B) 0.9
(C) 0.81
(D) 0
Answer: (B) 0.9
Explanation:
For state X4 :
• The left neighbor is X3 with V (X3 ) = 0.
(B) -0.9
(C) -0.81
(D) 0
Answer: (D) 0
Explanation:
For state X1 :
Therefore:
V (X1 ) = 0.9 × max(−1, 0) = 0.9 × 0 = 0.
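The update rule can be run for a couple of sweeps to see how values propagate. The sketch below is built on assumptions: it uses a simplified chain with a left end worth −1 and a right end worth +1, consistent with the X1 and X4 calculations, but it omits the Start state and uses the hypothetical names `LE` and `RE` for the two fixed ends. All states are updated synchronously from the previous sweep's values.

```python
GAMMA = 0.9

# Assumed layout (not given in full above): LE, X1, X2, X3, X4, RE.
order = ["LE", "X1", "X2", "X3", "X4", "RE"]
initial = {"LE": -1.0, "X1": 0.0, "X2": 0.0, "X3": 0.0, "X4": 0.0, "RE": 1.0}

def sweep(values):
    """One synchronous update: V(S) = 0.9 * max(V(S_L), V(S_R)) for inner states."""
    new_values = dict(values)  # the two ends keep their fixed values
    for i in range(1, len(order) - 1):
        left, right = order[i - 1], order[i + 1]
        new_values[order[i]] = GAMMA * max(values[left], values[right])
    return new_values

after_one = sweep(initial)
after_two = sweep(after_one)
print(after_one)
print(after_two)
```

Under these assumptions the first sweep gives V(X4) = 0.9 and V(X1) = 0, matching the two answers above, and the second sweep pushes 0.81 one step further left to X3.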
(A) 0.54
(B) -0.9
(C) 0.63
(D) 0
• With each update, the values for states closer to RE (on the right) receive higher
discounted values.
Q8. Expressing the Optimal Policy Using the Value Function
Question:
Consider a scenario where the agent selects actions based on the value function V . Which
of the following expressions correctly describe an optimal policy?
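(A) $A = \begin{cases} \text{Left} & \text{if } V(S_L) > V(S_R) \\ \text{Right} & \text{otherwise} \end{cases}$
(B) $A = \begin{cases} \text{Left} & \text{if } V(S_R) > V(S_L) \\ \text{Right} & \text{otherwise} \end{cases}$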
• Option (A): This rule says to choose the action that moves to the state with the
higher value. If V (SL ) > V (SR ) then go left; otherwise, go right. This correctly
represents an optimal policy.
• Option (B): This rule is the reverse of (A) and would lead the agent to move
toward the lower-valued state. Hence, it is incorrect.
• Option (C): This is a general formulation of an optimal policy: choose the action
a that maximizes the value of the next state T (S, a). This is correct.
• Option (D): This chooses the action leading to the lowest value, which is not
optimal.