MLT - Solutions (12 Weeks Merged) PDF
Consider a point and a line passing through the origin which is represented by the
vector . What can you say about the following quantities? (MSQ)
Options
(a)
(b)
(c)
(d)
Answer
(b), (c)
Solution
We have . So, the projection is the zero vector. The residue is given by:
Common data for questions (2) to (5)
Statement
Consider a point and a line that passes through the origin . The point lies on the line.
Options
(a)
Statement-1
(b)
Statement-2
(c)
Statement-3
(d)
Statement-4
(e)
Answer
(e)
Solution
The projection of a point on a line is given by:
This is the expression when does not have unit length. In this problem, does not have unit
length. If , then the expression becomes:
Question-3 [1 point]
Statement
Find the length of the projection of on the line . Enter your answer correct to two decimal
places.
Answer
Range:
Solution
The length of the projection is given by:
Question-4 [1 point]
Statement
Find the residue after projecting on the line .
Options
(a)
(b)
(c)
(d)
Answer
(b)
Solution
The residue is given by:
Question-5 [1 point]
Statement
Find the reconstruction error for this point. Enter your answer correct to two decimal places.
Answer
Range:
Solution
The reconstruction error is given by the square of the length of the residue. If the residue is ,
then:
A programming-based solution. This is to be used only to verify the correctness of the calculations. The added benefit is that you get used to NumPy.
import numpy as np

x = np.array([2, 5])
w = np.array([1, 1])
w = w / np.linalg.norm(w)

# Projection of x on the unit vector w
proj = (x @ w) * w
print(f'Length of projection = {np.linalg.norm(proj)}')
# Residue
res = x - proj
print(f'Residue = {res}')
# Reconstruction error
recon = res @ res
print(f'Reconstruction error = {recon}')
Question-6 [0.5 point]
Statement
Consider the following images of points in 2D space. The red line segments in one of the images
represent the lengths of the residues after projecting the points on the line . Which image is it?
Image-1
Image-2
Options
(a)
Image-1
(b)
Image-2
Answer
(b)
Solution
The residue after the projection should be perpendicular to the line. Note that by projection we mean the orthogonal projection of a point on a line. The projection of a point on a line is one of the proxies for that point on the line; in fact, it is the "best" possible proxy. But not every proxy is a projection. The projection of a point on a line is unique.
Question-7 [1 point]
Statement
Consider a dataset that has samples, where each sample belongs to . PCA is run on this
dataset and the top principal components are retained, the rest being discarded. If it takes one
unit of memory to store a real number, find the percentage decrease in storage space of the
dataset by moving to its compressed representation. Enter your answer correct to two decimal
places; it should lie in the range .
Answer
Range:
Solution
Original space =
Compressed space =
Options
(a)
(b)
(c)
(d)
Answer
(a)
Solution
Let us first arrange the data in the form of a matrix. Here, and :
Recall that the first principal component is the most important. Enter your answer correct to two
decimal places.
Answer
Range:
Solution
If is the eigenpair for , we have:
import numpy as np

X = np.array([[-3, 0],
              [-2, 0],
              [-2, 1],
              [-1, 0],
              [1, 0],
              [2, 0],
              [2, -1],
              [3, 0]])

C = X.T @ X / X.shape[0]
print(f'Covariance matrix = {C}')
eigval, eigvec = np.linalg.eigh(C)
print(f'Variance = {eigval[-1]}')
A more detailed version. The variance of the dataset along the principal component is and
is given by:
So, the variance along the principal component is the largest eigenvalue of the covariance
matrix.
Question-10 [1 point]
Statement
Consider a dataset of points all of which lie in . The eigenvalues of the covariance matrix
are given below:
If we run the PCA algorithm on this dataset and retain the top- principal components, what is a
good choice of ? Use the heuristic that was discussed in the lectures.
Answer
4
Solution
The top- principal components should capture of the variance. Here is a code snippet to
answer this question:
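A sketch of such a snippet, with placeholder eigenvalues (the actual values are given in the question) and assuming the usual 95%-variance threshold for the heuristic:

import numpy as np

# Placeholder eigenvalues; substitute the ones given in the question.
eigvals = np.array([4.5, 3.0, 2.0, 1.3, 0.1, 0.05, 0.03, 0.02])
eigvals = np.sort(eigvals)[::-1]             # largest first
ratio = np.cumsum(eigvals) / eigvals.sum()   # variance captured by top-K
K = int(np.argmax(ratio >= 0.95)) + 1        # smallest K meeting the heuristic
print(f'K = {K}')                            # K = 4 for these placeholder values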
Options
(a)
(b)
(c)
(d)
Answer
(b), (d)
Solution
Each vector is associated with a line perpendicular to it. This line divides the space into two
halves. The basic idea is to identify the sign of the half-planes into which the line perpendicular to
the vector divides the space.
Week 2 Graded assignment
Common data for Questions 1 and 2
A function is defined as follows.
Question 1
Statement
Is a valid kernel?
Options
(a)
Yes
(b)
No
Answer
(a)
Solution
Let be the identity transformation, that is,
Question 2
If is a valid kernel, we apply it to the three-dimensional dataset to run kernel PCA. Select the correct options.
Options
(a)
(b)
(c)
It will be the same as the polynomial transformation of degree 2 and then run the PCA.
(d)
Answer
(b)
Solution
We have seen (in Question 1) that corresponds to the identity transformation. It implies that applying the kernel and then running PCA is the same as running standard PCA on the given dataset.
Question 3
Statement
Consider ten data points lying on a curve of degree two in a two-dimensional space. We run a
kernel PCA with a polynomial kernel of degree two on the same data points. Choose the correct
options.
Options
(a)
(b)
(c)
There will be some that all of the data points are orthogonal to.
(d)
There will be some that all of the data points are orthogonal to.
Answer
(a), (c)
Solution
Since we are applying the polynomial kernel of degree two on the 2D dataset, the dataset will be transformed into a 6D feature space (see the check below).
Since the dataset is given to lie on a curve of degree two, the transformed dataset will live in a linear subspace of , and therefore there will be some that all of the data points are orthogonal to.
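A quick check of this count, using the standard count of monomials of degree at most two in two variables:

from math import comb

d = 2  # dimension of the original space
# 1 constant + d linear terms + d squares + comb(d, 2) cross terms
print(1 + d + d + comb(d, 2))  # 6, which equals comb(d + 2, 2)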
Question 4
Statement
Which of the following matrices cannot be an appropriate matrix for some data matrix ?
Options
(a)
(b)
(c)
(d)
Answer
(a), (b) and (c)
Solution
We know that the matrix must be symmetric and positive semi-definite.
Option (a)
Question 5
Statement
A function is defined as
Is a valid kernel?
Options
(a)
Yes
(b)
No
Answer
(a)
Solution
The given function is
Question 6
Statement
Kernel PCA was run on the four data points and with the
polynomial kernel of degree 2. What will be the shape of the matrix ? Notations are used as per
lectures.
Options
(a)
(b)
(c)
(d)
Answer
(b)
Solution
The matrix is defined as , where is a data matrix of shape . That is, the matrix is of shape , where is the number of examples.
Question 7
Statement
Find the element at the index of the matrix defined in Question 6. Take the points in the
same order.
Options
(a)
(b)
(c)
(d)
Answer
(b)
Solution
The polynomial kernel of degree 2 is given by
Question 8
Statement
A dataset containing 200 examples in four-dimensional space has been transformed into a higher-dimensional space using the polynomial kernel of degree two. What will be the dimension of the transformed feature space?
Answer
15 (No range required)
Solution
Let the features be , and . After the transformation of degree two, the features will be
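A quick way to count these features for a four-dimensional input:

from math import comb

d = 4  # number of original features
# 1 constant + d linear terms + d squares + comb(d, 2) cross terms
print(1 + d + d + comb(d, 2))  # 15, i.e. comb(d + 2, 2)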
Options
(a)
(b)
(c)
(d)
Answer
(c)
Solution
If the largest eigenvalue and corresponding unit eigenvector of is and ,
Question 10
Statement
Let and be two valid kernels. Is a valid kernel?
Options
(a)
Yes
(b)
No
Answer
(a)
Solution
Let and be two valid kernels defined on . This implies that they satisfy the following two properties:
where,
Assume that
To show:
and
Now,
and
(1) ,
(2) and
(3)
Options
(a)
(b)
(c)
(d)
Answer
(c)
Solution
The first quantity represents the value of the objective function in iteration . The third quantity represents the value of the objective function in iteration . The second quantity is an intermediate one, which captures the distance of each data point from the mean that it will be moving towards in the t+1 iteration. Since, in every iteration, a reassignment happens only if a data point has found a closer mean, (3) will be less than (1). Further, since every point will want to move towards a closer cluster center in the subsequent iteration, the value of (2) will be between (1) and (3).
Question-2
Statement
Consider that in an iteration of Lloyd's algorithm, the partition configuration ( ) is
where each . Assume that the algorithm does not converge in
iteration , and hence some re-assignment happens, thus updating the partition configuration in
the next iteration ( ) to . How can we say that partition configuration
is better than ?
Options
(a)
The value of the objective function for should be more than that for
(b)
The value of the objective function for should be lesser than that for
(c)
Answer
(b)
Solution
Since, in every iteration, a reassignment happens only if a data point has found a closer mean, will be less than .
Question-3
Statement
With respect to Lloyd’s algorithm, choose the correct statements:
Options
(a)
At the end of k-means, the objective function settles in a local minimum, and reaching the global minimum may not be guaranteed.
(b)
At the end of k-means, the objective function always settles in the global minimum.
(c)
(d)
If the resources are limited and the data set is huge, it will be good to prefer K-means over K-
means ++.
(e)
Answer
(a), (d)
Solution
(a), (b) K-means may not always settle in the global minimum.
(c) Finding optimal clusters is an NP-hard problem. K-means provides approximate clusters.
(d) If the dataset is huge, the elaborate initialization step in K-means++ will take a lot of time.
(e) In practice, k should neither be very small nor very large, because in both these cases we may not be able to uncover the groupings present in the data.
Question-4
Statement
Consider two cluster centres and corresponding to two clusters and as shown in the image below. Consider four half-spaces represented by the lines and . Where would the data points falling in cluster lie?
Options
(a)
To the left of
(b)
Between and
(c)
Between and
(d)
To the left of
(e)
To the left of
Answer
(d)
Solution
The cluster boundaries are the perpendicular bisectors of the segments joining the cluster centers.
Question-5
Statement
Which of the following best represents a valid voronoi diagram for K-means algorithm with K = 3?
(The dots represent the cluster centres of respective clusters.)
Options
(a)
(b)
(c)
(d)
Answer
(d)
Solution
The cluster boundaries are the perpendicular bisectors of the segments joining the cluster centers.
Question-6
Statement
Consider the following data points:
Assume that K-means is applied on this data with = 2. Which of the following are expected to be
the clusters produced?
Options
(a)
(b)
Answer
(b)
Solution
The cluster boundaries are the perpendicular bisectors of the segments joining the cluster centers.
In the given data, in case of option (a), the cluster centers would coincide, which is something that does not happen as a result of applying k-means.
Question-7
Statement
Assume that in the initialization step of k-means++, the squared distances from the closest mean
for 10 points are: 25, 67, 89, 24, 56, 78, 90, 85, 35, 95. Which point has the highest
probability of getting chosen as the next mean and how much will that probability be?
Options
(a)
, 0.24
(b)
, 0.037
(c)
, 0.95
(d)
, 0.1475
Answer
(d)
Solution
25 + 67 + 89 + 24 + 56 + 78 + 90 + 85 + 35 + 95 = 644
Probability for
Probability for
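A quick verification of these numbers:

import numpy as np

# Squared distances of the 10 points from their closest mean
d2 = np.array([25, 67, 89, 24, 56, 78, 90, 85, 35, 95])
probs = d2 / d2.sum()                      # k-means++ sampling probabilities
print(d2.sum())                            # 644
print(np.argmax(probs) + 1, probs.max())   # point 10, probability 0.1475...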
Question-8
Statement
Consider 7 data points: {(0, 4), (4, 0), (2, 2), (4, 4), (6, 6), (5, 5), (9, 9)}. Assume that we want to form 3 clusters from these points using the K-Means algorithm. Assume that after the first iteration, the clusters have the following data points:
: {(0,4), (4,0)}
: {(5,5), (9,9)}
After second iteration, which of the clusters is the data point (2, 2) expected to move to?
Options
(a)
(b)
(c)
(d)
Answer
(b)
Solution
The cluster centers after the first iteration are C1: (4,4), C2: (2,2), C3: (7,7). The point (2, 2) is closest to the center of C2.
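A quick check of the distances, using the centers listed above:

import numpy as np

point = np.array([2, 2])
# Cluster centers after the first iteration, as listed in the solution
centers = {'C1': np.array([4, 4]), 'C2': np.array([2, 2]), 'C3': np.array([7, 7])}
dists = {c: np.linalg.norm(point - m) for c, m in centers.items()}
print(min(dists, key=dists.get))  # C2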
Options
(a)
1 and 3
(b)
1 and 2
(c)
2 and 3
(d)
1, 2, and 3
Answer
(d)
Solution
1. Different cluster center initializations may result in different clusters produced by k-means.
2. Some initializations may take more time to converge.
3. Some initializations may converge to a local minimum rather than the global minimum.
Question-10
Statement
If the data set has two features and , which of the following are true for K-means clustering with k = 3?
Options
(a)
(b)
(c)
Answer
(a)
Solution
If and have a correlation of 1, all data points will lie along a line.
Hence the cluster centers will also lie along the same line.
Graded
This document has questions.
Note to Learners
Statement
For all questions involving the Bernoulli distribution, the parameter is .
Question-1
Statement
Consider a dataset that has zeros and ones. What is the likelihood function if we assume a
Bernoulli distribution with parameter as the probabilistic model?
Options
(a)
(b)
(c)
(d)
Answer
(d)
Solution
We shall use the i.i.d. assumption. If is the random variable corresponding to the data-
point, we have:
and are independent for and they are identically distributed. The likelihood is
therefore the product of terms, five of which correspond to ones and the rest to zeros:
Question-2
Statement
In the previous question, what is the estimate of ? Enter your answer correct to two decimal
places.
Answer
Range:
Solution
The estimate is the fraction of ones:
Question-3
Statement
Consider a dataset that has a single feature ( ). The first column in the table below represents the
value of the feature, the second column represents the number of times it occurs in the dataset.
Frequency
If we use a Gaussian distribution to model this data, find the maximum likelihood estimate of the
mean.
Options
(a)
(b)
(c)
(d)
The mean cannot be computed as the variance of the Gaussian is not explicitly specified.
Answer
(c)
Solution
Question-4
Statement
Consider a beta prior for the parameter of a Bernoulli distribution:
Options
(a)
(b)
(c)
(d)
Answer
(c)
Solution
Since the beta distribution is a conjugate prior of the Bernoulli distribution, the posterior is also a
beta distribution. Specifically, if the prior is and the dataset has ones and zeros,
then the posterior:
In this problem, .
Question-5
Statement
In the previous question, we use the expected value of the posterior as a point-estimate for the
parameter of the Bernoulli distribution. What is ? Enter your answer correct to two decimal
places.
Answer
Range:
Solution
The expected value of a beta distribution with parameters and is:
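A one-line check of this formula; a and b below are placeholders standing in for the posterior parameters obtained in Question 4:

# Posterior mean of Beta(a, b); substitute the posterior parameters
a, b = 7, 5  # hypothetical values
print(round(a / (a + b), 2))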
Question-6
Statement
Ignore the values on the Y-axis and just focus on the shapes of the distributions. Which of the following could correspond to the observed data?
Options
(a)
(b)
(c)
Answer
(c)
Solution
The prior encodes the belief that the coin is somewhat unbiased. The posterior seems to have made that belief stronger. So, the data should have been something that strengthens the belief in the prior, meaning an equal number of ones and zeros.
Common Data for questions (7) to (9)
Statement
We wish to fit a GMM with for a dataset having points. At the beginning of the time
step of the EM algorithm, we have as follows:
The density of the points given a particular mixture is given to you for all four points. is the
density of a Gaussian.
Use three decimal places for all quantities throughout the questions.
Question-7
Statement
What is the value of for and after the E-step? Enter your answer correct to three
decimal places.
Answer
Range:
Solution
From Bayes rule, we have:
So:
Question-8
Statement
If we pause the algorithm at this stage (after the E-step) and use the values to do a hard-
clustering, what would be the cluster assignment? We use the following rule to come up with
cluster assignments:
Options
(a)
(b)
(c)
(d)
Answer
(d)
Solution
We need to compute the table of values from which we can read off the cluster assignments.
Question-9
Statement
What is the value of after the M-step? Enter your answer correct to three decimal places.
Answer
Range:
Solution
Now for :
Which is:
Question-10
Statement
A GMM is fit for a dataset with points. At some time-step in the EM algorithm, the following are
the values of for all points in the dataset for the mixture after the E-step:
What is the estimate of after the M-step? Enter your answer correct to two decimal places.
Answer
Range:
Solution
Question-11
Statement
What is the value of the following expression after the E-step at time-step in the EM algorithm?
There are data-points and mixtures.
Options
(a)
(b)
(c)
(d)
(e)
(f)
Answer
(b)
Solution
We know that the values should sum to for each data-point. Since there are data-points,
the expression should sum to .
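As a sanity check of this fact, here is a small sketch with randomly generated responsibilities; the n and K below are placeholders:

import numpy as np

rng = np.random.default_rng(0)
n, K = 4, 2                                # hypothetical counts of points and mixtures
gamma = rng.random((n, K))
gamma /= gamma.sum(axis=1, keepdims=True)  # each row sums to 1 (E-step output)
print(gamma.sum())                         # equals n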
Graded Assignment
Note:
1. In the following assignment, denotes the data matrix of shape where and are
the number of features and samples, respectively.
2. denotes the sample and denotes the corresponding label.
3. denotes the weight vector (parameter) in the linear regression model.
Question 1
Statement
An ML engineer comes up with two different models for the same dataset. The performances of
these two models on the training dataset and test dataset are as follows:
Options
(a)
Model 1
(b)
Model 2
Answer
(a)
Solution
In model , the test error is very low compared to model even though the training error is high
in model . We choose model as it worked well on unseen data.
Question 2
Statement
Consider a model for given -dimensional training data points and corresponding labels as follows:
where is the average of all the labels. Which of the following error functions will always give zero training error for the above model?
Options
(a)
(b)
(c)
(d)
Answer
(c)
Solution
The sum of squared errors and the absolute error will give zero error only if the predicted values are the same as the actual values for all the examples.
This error function will give zero error for the above model.
feature    label ( )
-1         5
0          7
1          6
Question 3
Statement
We want to fit a linear regression model of the form . Assume that the initial weight
vector is . What will be the weight after one iteration using the gradient descent algorithm
assuming the squared loss function? Assume the learning rate is .
Answer
No range is required
Solution
At iteration , we have
At , we have
Here
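A sketch of one gradient-descent step for this kind of problem; the data matrix, labels, initial weights, and learning rate below are placeholders, not the actual values from the question:

import numpy as np

# One gradient-descent step for linear regression with squared loss,
# L(w) = (1/2) * ||X^T w - y||^2, with X of shape (features, samples).
X = np.array([[1, 1, 1],      # hypothetical bias feature
              [-1, 0, 1]])    # hypothetical feature values
y = np.array([5, 7, 6])
w = np.zeros(2)               # initial weight vector
lr = 0.1                      # hypothetical learning rate

grad = X @ (X.T @ w - y)      # gradient of the squared loss
w = w - lr * grad
print(w)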
Question 4
Statement
If we stop the algorithm at the weight calculated in Question 3, what will be the prediction for the data point ?
Answer
8 No range is required
Solution
The model is given as
at ,
Question 5
Statement
Assume that denotes the updated weight after the iteration in the stochastic gradient
descent. At each step, a random sample of the data points is considered for weight update. What
will be the final weight after iterations?
Options
(a)
(b)
(c)
(d)
any of the
Answer
(c)
Solution
The final weight is given by the average of all the weights updated in all the iterations. That is why
option (c) is correct.
Question 6
Statement
Which data point plays the most important role in predicting the outcome for an unseen data
point? Write the data point index as per matrix assuming indices start from 1.
Answer
3, No range is required
Solution
Since is written as , the data point associated with the highest weight (coefficient) will have the most importance. The third data point is associated with the highest coefficient ( ); therefore, the third data point has the highest importance.
Question 7
What will be the prediction for the data point ?
Answer
6.5 No range is required
Solution
The polynomial kernel of degree is given by
Since,
That is
Question 8
Statement
If is the solution to the optimization problem of the linear regression model, which of the following expressions is always correct?
Options
(a)
(b)
(c)
(d)
Answer
(b)
Solution
We know that is the projection of the labels onto the subspace spanned by the features; that is, will be orthogonal to . For details, check Lecture 5.4.
Question 9
Statement
Gradient descent with a constant learning rate of for a convex function starts oscillating around the local minimum. What should be the ideal response in this case?
Options
(a)
(b)
Answer
(b)
Solution
One possible reason for the oscillation is that the weight vector jumps over the local minimum due to a large step size . That is, if we decrease the value of , the weight vector may not jump over the local minimum, and GD will converge to that local minimum.
Question 10
Statement
Is the following statement true or false?
Options
(a)
True
(b)
False
Answer
(a)
Solution
We make the assumption in the regression model that the error follows a Gaussian distribution with zero mean and a constant variance.
Graded
This document has questions.
Question-1
Statement
Assume that for a certain linear regression problem involving 4 features, the following weight
vectors produce an equal amount of mean square error:
= [2, 2, 3, 1]
= [1, 1, 3, 1]
= [3, 2, 4, 1]
= [1, 2, 1, 1]
Options
(a)
(b)
(c)
(d)
Answer
(d)
Solution
Total error = MSE +
If the MSE for all the given weights is the same, the weight vector whose squared length is the least will be chosen by ridge regression.
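A quick check of the squared lengths:

import numpy as np

W = np.array([[2, 2, 3, 1],
              [1, 1, 3, 1],
              [3, 2, 4, 1],
              [1, 2, 1, 1]])
# With equal MSE, ridge prefers the smallest squared length
print((W ** 2).sum(axis=1))  # [18 12 30  7] -> the fourth vector is chosen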
Question-2
Statement
Assuming that in the constrained version of ridge regression optimization problem, following are
the weight vectors to be considered, along with the mean squared error (MSE) produced by each:
If the value of is 13, which of the following weight vectors will be selected as the final weight
vector by ridge regression?
Options
(a)
(b)
(c)
(d)
Answer
(b)
Solution
We need to minimize MSE such that
for and .
Options
(a)
(b)
(c)
(d)
Answer
(c)
Solution
lies on a part of the function which is differentiable. For a differentiable function (subpart), only one sub-gradient is possible, which is the gradient itself.
lies at the intersection of two pieces, and . The function is not differentiable at this point (as the left slope is different from the right slope). Hence there are multiple sub-gradients possible at .
lies on a part of the function which is differentiable. For a differentiable function (subpart), only one sub-gradient is possible, which is the gradient itself.
lies at the intersection of two pieces, and . The function is not differentiable at this point (as the left slope is different from the right slope). Hence there are multiple sub-gradients possible at .
Question-4
Statement
For a data set with 1000 data points and 50 features, 10-fold cross-validation will perform
validation of how many models?
Options
(a)
10
(b)
50
(c)
1000
(d)
500
Answer
(a)
Solution
In 10-fold cross-validation, the data will be divided into 10 parts. In each of the ten iterations, a model will be built using nine of these parts, and the remaining part will be used for validation. Hence, in total, ten models will be validated.
Question-5
Statement
For a data set with 1000 data points and 50 features, assume that you keep 80% of the data for
training and remaining 20% of the data for validation during k-fold cross-validation. How many
models will be validated during cross-validation?
Options
(a)
80
(b)
20
(c)
(d)
Answer
(c)
Solution
If 20% of the data is used for validation, that means one-fifth of the data is used for validation, which means 5-fold cross-validation is being performed. In each iteration, one model will be validated. Hence, a total of 5 models will be validated.
Question-6
Statement
For a data set with 1000 data points and 50 features, how many models will be trained during
Leave-One-Out cross-validation?
Options
(a)
1000
(b)
50
(c)
5000
(d)
20
Answer
(a)
Solution
In leave-one-out cross-validation, only one data point is used for validation in each iteration, and the remaining n-1 data points are used for training. Hence a total of n = 1000 models will be trained.
Question-7
Statement
The mean squared error of will be small if
Options
(a)
(b)
(c)
(d)
Answer
(c), (d)
Solution
Mean squared error of . Trace of a matrix = sum of eigenvalues.
If the eigenvalues of are large, the eigenvalues of will be small. Hence, the trace will be small, and in turn the MSE will be small.
Question-8
Statement
The eigenvalues of a matrix are 2, 5 and 1. What will be the eigenvalues of the matrix
Options
(a)
4, 25, 1
(b)
2, 5, 1
(c)
0.5, 0.2, 1
(d)
Answer
(c)
Solution
If the eigenvalues of A are a, b and c, then the eigenvalues of will be 1/a, 1/b and 1/c.
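A quick numerical check:

import numpy as np

A = np.diag([2.0, 5.0, 1.0])                # any matrix with eigenvalues 2, 5, 1
print(np.linalg.eigvals(np.linalg.inv(A)))  # [0.5 0.2 1. ]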
Graded
This document has questions.
Question-1
Statement
We have a dataset of points for a classification problem using the -NN algorithm. Now consider the following statements:
S3: The number of data-points that we have to store increases as the size of increases.
S4: The number of data-points that we have to store is independent of the value of .
Options
(a)
(b)
(c)
(d)
(e)
Answer
(b)
Solution
The entire training dataset has to be stored in memory. For predicting the label of a test-point, we
have to perform the following steps:
Question-2
Statement
How should we recolor the black train point if the test point is classified as "red" without any uncertainty by a -NN classifier with ? Use the Euclidean distance metric for computing distances.
Options
(a)
blue
(b)
red
(c)
Insufficient information
Answer
(b)
Solution
Since we are looking at the -NN algorithm with , we need to look at the four nearest
neighbors of the test data-point. The four points from the training dataset that are closest to the
test data-point are the following:
: black
: blue
: red
: red
Each of them is at unit distance from the test data-point. From the problem statement, it is given
that the test data-point is classified as "red" without any uncertainty. Let us now consider two
scenarios that concern the black training data-point at :
There are three red neighbors and one blue neighbor. Therefore, the test-data point will be
classified as red. There is no uncertainty in the classification. This is what we want. However, for
the sake of completeness, let us look at the alternative possibility.
There will be exactly two neighbors that are blue and two that are red. In such a scenario, we can't classify the test-point without any uncertainty. That is, we could call it either red or blue.
This is one of the reasons why we choose an odd value of for the -NN algorithm. If is odd,
then this kind of a tie between the two classes can be avoided.
Question-3
Statement
Consider the following feature vectors:
If we use a -NN algorithm with , what would be the predicted label for the following test point:
Answer
1
Solution
The distances are:
We see that among the three nearest neighbors, two have label 1 and one has label 0. Hence the
predicted label is 1. For those interested in a code for the same:
import numpy as np

x_1 = np.array([1, 2, 1, -1])
x_2 = np.array([5, -3, -5, 10])
x_3 = np.array([3, 1, 2, 4])
x_4 = np.array([0, 1, 1, 0])
x_5 = np.array([10, 7, -3, 2])

x_test = np.array([1, 1, 1, 1])

for x in [x_1, x_2, x_3, x_4, x_5]:
    print(round(np.linalg.norm(x_test - x) ** 2))
Comprehension Type (4 to 6)
Statement
Consider the following split at some node in a decision tree:
Node   #Points   Class
Q1     100       0
Q1     100       1
L1     50        0
L1     30        1
L2     50        0
L2     70        1
For example, L1 has 80 points of which 50 belong to class 0 and 30 belong to class 1. Use for
all calculations that involve logarithms.
Question-4
Statement
If the algorithm is terminated at this level, then what are the labels associated with L1 and L2?
Options
(a)
L1 : 0
(b)
L1 : 1
(c)
L2 : 0
(d)
L2 : 1
Answer
(a), (d)
Solution
has data-points out of which belong to class- and belong to class- . Since the
majority of the points belong to class- , this node will have as the predicted label.
has data-points out of which belong to class- and belong to class- . Since the
majority of the points belong to class , this node will have as the predicted label.
Question-5
Statement
What is the impurity in L1 if we use entropy as a measure of impurity? Report your answer correct
to three decimal places.
Answer
Solution
If represents the proportion of the samples that belong to class-1 in a node, then the impurity
of this node using entropy as a measure is:
import math

imp = lambda p: -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(imp(3 / 8))
Question-6
Statement
What is the information gain for this split? Report your answer correct to three decimal places.
Use at least three decimal places in all intermediate computations.
Answer
Solution
The information gain because of this split is equal to the decrease in impurity. Here, and
denote the cardinality of the leaves. is the total number of points before the split at node .
To calculate the entropy of the three nodes, we need the proportion of points that belong to
class-1 in each of the three nodes. Let us call them for node , for node and for node
:
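A quick computation of the information gain using the counts from the table above:

import math

H = lambda p: -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Q1 has 200 points (100 of class 1), L1 has 80 (30 of class 1),
# L2 has 120 (70 of class 1)
gain = H(100 / 200) - (80 / 200) * H(30 / 80) - (120 / 200) * H(70 / 120)
print(round(gain, 3))  # 0.03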
Question-7
Statement
If a test-point comes up for prediction, what is the minimum and maximum number of questions that it would have to pass through before being assigned a label?
Options
(a)
(b)
(c)
(d)
(e)
Answer
(b), (e)
Solution
Look at all paths from the root to the leaves. Find the shortest and longest path.
Question-8
Statement
is the proportion of points with label 1 in some node in a decision tree. Which of the following
statements are true? [MSQ]
Options
(a)
(b)
(c)
(d)
Answer
(d)
Solution
Options (a) and (b) are incorrect, as the impurity increases from to and then decreases. Option (c) is incorrect for obvious reasons.
Question-9
Statement
Consider a binary classification problem in which all data-points are in . The red points belong
to class and the green points belong to class . A linear classifier has been trained on this
data. The decision boundary is given by the solid line.
This classifier misclassifies four points. Which of the following could be a possible value for the
weight vector?
Options
(a)
(b)
(c)
(d)
Answer
(b)
Solution
The weight vector is orthogonal to the decision boundary. So it will lie on the dotted line. This
gives us two quadrants in which the vector can lie in: second or fourth. In other words, we only
need to figure out its direction. If it is pointing in the second quadrant, then there will be four
misclassifications. If it is pointing in the fourth quadrant then all but four points will be
misclassified.
Question-10
Statement
Which of the following are valid decision regions for a decision tree classifier for datapoints in ?
The question in every internal node is of the form . Both the features are positive real
numbers.
Options
(a)
(b)
(c)
(d)
Answer
(a), (b), (d)
Solution
A question of the form can only result in one of these two lines:
a horizontal line
a vertical line
It cannot produce a slanted line as shown in option-(c). Options (a) and (d) correspond to what are
called decision stumps: a single node splitting into two child nodes.
Graded assignment
Question 1
Statement
Consider two different generative model-based algorithms.
1. Model : the chances of a feature occurring are affected by the occurrence of other features, and the model does not impose any additional condition on the conditional independence of the features.
2. Model : the chances of a feature occurring are not affected by the occurrence of other features, and therefore the model assumes that the features are conditionally independent given the label.
Options
(a)
Model
(b)
Model
Answer
(a)
Solution
In the first model, the features are not independent; therefore, we need to find the probabilities (or densities) for each and every possible example given the labels, whereas in Model 2 the features are independent, so we only need to find the pmf (or pdf) of the individual features.
Question 2
Statement
Which of the following statements is/are always correct in the context of the naive Bayes classification algorithm for binary classification with all binary features? Here, denotes the estimate for the probability that the feature value of a data point is given that the point has the label .
Options
(a)
(b)
for any
(c)
(d)
(e)
Answer
(d)
Solution
In general, the estimate for .
It means that denotes the parameters of different distributions for different and for different .
For different , the distributions are different. Therefore, it is not necessary that
If for , it implies that there are no label-0 examples whose feature value is . It doesn't mean that the feature value is for all label-1 examples.
Question 3
Statement
A naive Bayes model is trained on a dataset containing features . The labels are and . If a test point was predicted to have the label , which of the following expressions would be sufficient for this prediction?
Options
(a)
(b)
(c)
(d)
Answer
(c)
Solution
If a test example is predicted to have label , it implies that
Question 4
Statement
Consider a binary classification dataset that contains only one feature, where the data points given the label follow the distribution below.
If the decision boundary learned using the Gaussian naive Bayes algorithm is linear, what is the value of ?
Answer
Solution
Since the decision boundary is linear, both the variances will be the same. That is,
Question 5
Statement
Consider a binary classification dataset with two binary features and . The feature values
are for all label ' ' examples but the label ' ' examples take both values and for the feature
. If we apply the naive Bayes algorithm on the same dataset, what will be the prediction for
point ?
Options
(a)
Label
(b)
Label
(c)
Answer
(c)
Solution
Given that the feature values are for all label ' ' examples, it implies that .
Therefore,
But the label ' ' examples take both values and for the feature . It implies that
Still, the value of can be 0 if the value of is zero. So, we need the value of to make any conclusion.
f_1    f_2    label
0.5    1.3    1
0.7    1.1    1
1.3    2.0    0
2.3    2.4    0
Question 6
Statement
What will be the value of the estimate for ?
Answer
Solution
Question 7
Statement
What will be the value of
Options
(a)
(b)
(c)
(d)
Answer
(a)
Solution
Question 8
Statement
Consider a binary classification dataset containing two features and . The feature is categorical and can take three values, and the feature is numerical and follows a Gaussian distribution. How many independent parameters must be estimated if we apply the naive Bayes algorithm to this dataset?
Answer
Solution
We need one parameter for , as takes only two values.
The feature can take three values; therefore we need two estimates for and .
, estimate for
, estimate for
, estimate for
, estimate for
Question 9
Statement
What is the estimated value of ? Write your answer correct to two decimal
places.
Answer
Range: [0.97, 0.99]
Solution
Question 10
Statement
What will be the predicted label for the data point ?
Answer
No range is required
Solution
Since , the point will be predicted to have label .
Graded
Question-1
Statement
Assume that the Perceptron algorithm is applied to a data set in which the maximum of the lengths of the data points is 4 and the value of the margin ( ) of the optimal separator is 1. If the algorithm has made 10 mistakes at some point during its execution, which of the following can be valid squared length(s) of the weight vector obtained in the 11th iteration?
Options
(a)
90
(b)
150
(c)
170
(d)
190
Answer
(b), (c)
Solution
Given, ,
Need to find .
and
Hence,
and
Hence both (b) and (c) will be correct.
Question-2
Statement
Consider the following data set:
f_1 f_2 y
-1 -1 -1
0 1 +1
1 0 +1
1 1 +1
If the Perceptron algorithm is applied on this data set with the weight vector initialized to [0, 0], how many times will the weight vector be updated during the training process?
Options
(a)
(b)
(c)
(d)
Answer
(b)
Solution
The predictions for each of the four data points as per will be +1, +1, +1, +1.
Hence,
Resulting in
:
The predictions for each of the four data points as per will be -1, +1, +1, +1,
which are correct.
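A minimal sketch of the Perceptron updates on this dataset, assuming the convention that sign(0) is treated as +1:

import numpy as np

X = np.array([[-1, -1], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, 1])
w = np.zeros(2)

updates = 0
converged = False
while not converged:
    converged = True
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w >= 0 else -1  # sign(0) treated as +1
        if pred != yi:                   # mistake: update the weight vector
            w = w + yi * xi
            updates += 1
            converged = False
print(updates, w)  # 1 update, w = [1. 1.]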
Question-3
Statement
, ,
If the Perceptron algorithm is applied to this data with the initial weight vector being the zero vector, what will be the outcome?
Options
(a)
(b)
(c)
(d)
Answer
(c)
Solution
The predictions for each of the three data points as per will be +1, +1, +1.
Hence,
Resulting in
The predictions for each of the three data points as per will be +1, +1, -1, which
are correct.
Question-4
Statement
The blue line represents the weight vector. As per this weight vector, which classes will the Perceptron algorithm predict for the data points and ?
Options
(a)
(b)
(c)
(d)
Answer
(b)
Solution
The decision boundary will be perpendicular to . For the data points on the right side of it, will be greater than or equal to zero; on the LHS, it will be less than zero.
Options
(a)
(b)
(c)
(d)
Answer
(a)
Solution
If the weight vector is multiplied by -1, then for the data points on the RHS, will be less than 0, and on the LHS, it will be greater than or equal to zero.
Options
(a)
(b)
(c)
(d)
16
Answer
(b)
Solution
The maximum number of mistakes is given by
Which is
Question-7
Statement
If the scores (i.e., values) for some data points are -4, 3, 1, 2, -6 respectively, what will be the probabilities returned for these points by logistic regression?
Options
(a)
(b)
-1, 1, 1, 1, -1
(c)
(d)
0, 1, 1, 1, 0
Answer
(c)
Solution
Ex:
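A quick check of the sigmoid values:

import numpy as np

scores = np.array([-4, 3, 1, 2, -6])
probs = 1 / (1 + np.exp(-scores))  # sigmoid turns scores into probabilities
print(probs.round(3))              # [0.018 0.953 0.731 0.881 0.002]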
Question-8
Statement
Which of the lines (blue or brown) in the following image may represent the decision boundary of
Logistic Regression?
Options
(a)
Blue line
(b)
Brown line
(c)
Both
(d)
None of these
Answer
(b)
Solution
The decision boundary in logistic regression is always linear (Brown line).
It's just that the values obtained from a linear combination are reduced to values between 0 and 1
by the sigmoid function (blue line).
Question-9
Statement
Consider a data set , , , . Let the
corresponding class labels be and .
Assume you try to find the using the Perceptron algorithm. You decide to cycle through points
in the order repeatedly until you find a linear separator. How many mistakes
does your algorithm make and what is the linear separator your algorithm outputs?
Options
(a)
3, [1, 0]
(b)
2, [1, 1]
(c)
3, [-1, 1]
(d)
4, [-1, 0]
Answer
(b)
Solution
When the initial weight vector is not given, we take the zero vector as the initial weight vector.
We start with .
Hence, gives
Hence gives
Once again we check in that order, and they are all predicted correctly.
Hence the final is , and it was updated twice.
Graded
This document has questions.
Question-1 [1 point]
Statement
Consider a linearly separable dataset for a binary classification problem in . Three linear
classifiers have been trained on this dataset. All three pass through the origin and have the
following property:
Here, is the weight vector corresponding to the classifier. Note that the above property is
satisfied for each of the data-points. If is the weight vector corresponding to a hard-margin
SVM, which of the following statements is always true? You can assume that the norms of all three
weights are different from each other.
Options
(a)
(b)
(c)
(d)
Answer
(c)
Solution
will have the smallest norm (maximum margin) among the three classifiers. The three weight
vectors are feasible points for the primal. Among them, is optimal.
Common Data for questions (2) to (4)
Statement
Consider the following training dataset for a binary classification problem in . Each data-point
is represented by whose label is .
Index   x1   x2   y
1        1    0    1
2       -1    0   -1
3        5    4    1
4       -5   -4   -1
We wish to train a hard-margin SVM for this problem. represents the weight
vector. The index is for the data-point. is the Lagrange multiplier for the data-point.
Question-2 [1 point]
Statement
Select all primal constraints from the options given below.
Options
(a)
(b)
(c)
(d)
Answer
(a), (c)
Solution
Because of the symmetry in the problem, we effectively have only two constraints even though there are data-points:
But in order to remain consistent with our formulation, let us list them in the following manner:
Question-3 [2 points]
Statement
Which of the following is the objective function of the dual problem? In all options,
and .
Options
(a)
(b)
(c)
(d)
Answer
(b)
Solution
The objective function corresponding to the dual is:
We have:
Therefore:
And:
Finally:
Question-4 [1 point]
Statement
What is the optimal weight vector, ?
Hint: Plot the points and try to compute the answer using geometry; do not try to solve the dual
algebraically!
Options
(a)
(b)
(c)
(d)
Answer
(a)
Solution
We see that the axis has to be the optimal separator as that is the one which has maximum
margin. So, the equation of the decision boundary is . This implies, . Therefore, the
weight vector becomes . The choice of can be found out by noticing that and
are the supporting hyperplanes. Therefore, .
Question-5 [1 point]
Statement
Consider a kernel-SVM trained on a dataset of points with polynomial kernel of degree . If
is the optimal dual solution, what is the predicted label for a test-point ?
Options
(a)
(b)
(c)
(d)
Answer
(d)
Solution
The optimal weight vector is given by:
Here, is the vector in the transformed space. First we compute the dot-product:
Finally, the prediction is:
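A minimal sketch of this prediction; the training points, labels, dual variables, and test point below are placeholders, and the polynomial kernel is assumed to be k(u, v) = (1 + u.v)^degree:

import numpy as np

def predict(x_test, X, y, alpha, degree):
    k = (1 + X @ x_test) ** degree       # k(x_i, x_test) for every training point
    return np.sign(np.sum(alpha * y * k))

X = np.array([[1.0, 2.0], [-1.0, 0.5]])  # hypothetical training points
y = np.array([1, -1])
alpha = np.array([0.3, 0.3])             # hypothetical optimal duals
print(predict(np.array([0.5, 0.5]), X, y, alpha, degree=2))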
Common Data for questions (6) and (7)
Statement
Consider the transformation associated with the polynomial kernel with degree :
A kernel-SVM is trained on a dataset with the above kernel. The optimal weight vector is as
follows:
You can assume that the dataset is linearly separable in the transformed space.
Question-6 [1 point]
Statement
What is the shape of the decision boundary in ?
Options
(a)
(b)
(c)
It is a straight line
(d)
It is a circle
Answer
(d)
Solution
The decision boundary in is given by:
Options
(a)
(b)
(c)
(d)
(e)
Answer
(b), (c), (e)
Solution
All the support vectors will lie on the two supporting hyperplanes:
These curves are two circles, one smaller than the decision boundary and one larger than the
decision boundary. From the points given here, there are two points that could be support
vectors:
Note that these two are "potential" support vectors. Every support vector lies on the supporting
hyperplanes. But every point on the supporting hyperplanes need not be a support vector.
Question-8 [1 point]
Statement
Match the following classification datasets with the most appropriate choice of adjectives:
Options
(a)
Dataset-2: hard-margin
Dataset-3: soft-margin
Dataset-4: hard-margin
(b)
Dataset-2: hard-margin
Dataset-3: soft-margin
Dataset-1: soft-margin
Dataset-2: hard-margin
Dataset-3: kernel
Dataset-4: soft-margin
Answer
(b)
Solution
Dataset-1: the decision boundary is non-linear. The structure is non-linear and the problem
is linearly separable in some high dimensional space.
Dataset-2: this is a clear case of hard-margin SVM
Dataset-3: the boundary is linear. The presence of outliers suggests that this should be
solved using a soft-margin SVM
Dataset-4: The boundary is non-linear. In addition, the dataset has outliers. So, this should
involve both a kernel and a soft-margin formulation.
Week 11
Question 1
Statement
In each round of AdaBoost, the weight for a particular training observation is increased going
from round to round if the observation was...
Options
(a)
(b)
(c)
(d)
Answer
(a)
Solution
Since in AdaBoost we increase the weights of the incorrectly classified points for the next round, option (a) is correct.
Question 2
Statement
Which of the following statements are true about bagging?
Options
(a)
In general, the final model has a higher bias than the individual learners.
(b)
In general, the final model has less bias than the individual learners.
(c)
In general, the final model has a higher variance than the individual learners.
(d)
In general, the final model has less variance than the individual learners.
Answer
(d)
Solution
Bagging on high variance models will reduce the variance without increasing the bias.
There is always a tradeoff between bias and variance. And reducing variance may cost increment
in the bias. But bagging on high variance and low bias models reduces the variance without
making the predictions biased.
Question 3
Statement
Is the following statement true or false?
If a point lies between the supporting hyperplanes in the soft-margin SVM problem, it always pays
a positive bribe and plays a role in defining .
Options
(a)
True
(b)
False
Answer
(a)
Solution
If a point lies between the supporting hyperplanes, it satisfies the following:
Using the complementary slackness (CS) conditions,
it implies that , and therefore, if a point lies between the supporting hyperplanes in the soft-margin SVM problem, it plays a role in defining .
Question 4
Statement
Is the following statement true or false?
Options
(a)
True
(b)
False
Answer
(a)
Solution
Using the complementary slackness (CS) conditions,
It implies that
Label
The first decision stump was created using the question or not. The error of a decision
stump is defined as the proportion of misclassified points.
Question 5
Statement
Find the value of . Notation is defined as per lecture.
Options
(a)
(b)
(c)
(d)
Answer
(a)
Solution
If we split the root node as per the question or not, the left node will contain the points , and the labels of these points are respectively. Therefore, the prediction in the left node will be (the majority class).
Similarly, in the right node, the labels will be and , and the prediction will be (the majority class).
Only one point is misclassified.
Therefore, error is
Question 6
Statement
How will the weight corresponding to the last example change for creating the next stump?
Options
(a)
It will increase
(b)
It will decrease
Answer
(b)
Solution
Since the last example is correctly classified, its weight will decrease.
Question 7
Statement
A strong learner is formed as per the AdaBoost algorithm by three weak learners and
. Their performance/weights are and respectively. For a particular point,
and predict that its label is positive, and predicts that it’s negative. What is the final
prediction the learner makes on this point? Enter or .
Answer
No range is required
Solution
For the final prediction, we have
Question 8
Statement
Which of the following options is correct? Select all that apply.
Options
(a)
In bagging, typically around data points remain unselected in bags if the number of data
points is large.
(b)
Each weak learner has equal importance in making the final prediction in Bagging.
(c)
Each weak learner has equal importance in making the final prediction in AdaBoost.
(d)
Answer
(a), (b), (d)
Solution
The probability that a point will not be selected in any one pick will be .
As ,
That is, in bagging, typically around of the data points remain unselected in a bag if the number of data points is large.
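A quick numerical check of this limit:

n = 1_000_000
# Probability that a given point is never picked in n draws with replacement
print((1 - 1 / n) ** n)  # about 0.3679, i.e. roughly a third of the points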
In bagging, each learner has equal importance in making the final prediction as the majority of all
the predictions are taken into account, and therefore each prediction counts.
But in AdaBoost, the weighted average is taken into account, and the estimator which has a
higher value of will have a higher importance in making the final prediction.
In the random forest, overfit models are preferred as they have high variance and low bias.
Question 9
Statement
Which model tends to underfit?
Options
(a)
Model
(b)
Model
(c)
Model
(d)
Model
Answer
(d)
Solution
Model 4 has high training error as well as high test error. This means that Model 4 has high bias and tends to underfit.
Question 10
Statement
Which model tends to overfit?
Options
(a)
Model
(b)
Model
(c)
Model
(d)
Model
Answer
(a)
Solution
Model 1 has low training error but high test error. This means that Model 1 has high variance and low bias and tends to overfit.
Question 11
Statement
Which model would you choose?
Options
(a)
Model
(b)
Model
(c)
Model
(d)
Model
Answer
(c)
Solution
Model 3 has low training and test error; therefore, it is the most preferred.
Graded
Question-1
Statement
Consider the following data and the hypothesis function :
g(x)   y
+30    +1
-20    -1
-1     -1
+1     +1
Options
(a)
The values of 0-1 loss and squared loss will be same, which will be equal to zero.
(b)
The values of 0-1 loss and squared loss will be same, which will be equal to some large positive
quantity.
(c)
The value of 0-1 loss will be zero, while the value of squared loss will be some large positive
quantity.
(d)
The value of squared loss will be zero, while the value of 0-1 loss will be some large positive
quantity.
Answer
(c)
Solution
There is an ambiguity in this question. The given is supposed to explain the 0-1 loss and not the squared loss. In this case,

g(x)   y    sign(g(x))   0-1 loss   Squared loss (g(x) - y)^2
+30    +1   +1           0          (29)^2
-20    -1   -1           0          (-19)^2
-1     -1   -1           0          0
+1     +1   +1           0          0
The value of the 0-1 loss is zero, while the value of the squared loss is a large positive quantity. Hence, option (c) should be correct.
However, since h(x) is stated to be sign(g(x)), if the squared loss is computed on columns 2 and 3 of the above table, the squared loss will also come out to be 0, making option (a) correct.
Common data for the following questions: a neural network with layer sizes [5, 5, 4, 3, 1].
Question-2
Statement
How many hidden layers are there in the network?
Answer
3 (No range required)
Solution
The first layer is the input layer and the last one is the output layer. The intermediate 3 layers are hidden layers.
Question-3
Statement
Through how many paths can the 3rd neuron in the 2nd hidden layer affect the final output?
Answer
3 (No range required)
Solution
The three paths are shown below in different colors.
Question-4
Statement
Assuming that there is a bias associated with each neuron, how many total parameters need to
be computed?
Answer
73 (No range required)
Solution
#weights from input layer to hidden layer 1: 5*5 = 25
#weights for the remaining layers: 5*4 + 4*3 + 3*1 = 35
There will be a bias associated with each neuron except the input layer neurons. (The input layer simply passes on the inputs to the subsequent layers.)
#biases = 5+4+3+1 = 13
Total parameters = 25 + 35 + 13 = 73
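A quick check of the parameter count:

layers = [5, 5, 4, 3, 1]                                  # layer sizes
weights = sum(a * b for a, b in zip(layers, layers[1:]))  # 25+20+12+3 = 60
biases = sum(layers[1:])                                  # 13 (no input-layer biases)
print(weights + biases)                                   # 73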
Question-5
Statement
What will be an appropriate activation function for the output layer?
Options
(a)
Sigmoid
(b)
Linear
(c)
ReLU
Answer
(b)
Solution
ReLU is mostly used at the hidden layers. (It does not make sense to be used at the output layer.)
Options
(a)
(b)
(c)
(d)
Answer
(d)
Solution
ReLU converts the negative inputs to zero, and keeps the positive inputs same.
Question-7
Statement
Suppose we build a neural network for a 5-class classification task. Suppose for a single training
example, the true label is [1 0 0 0 0] while the predictions by the neural network are [0.1 0.5
0.1 0.1 0.2] . What would be the value of cross entropy loss for this example?
Answer
3.322 (Range: 3.2 to 3.4)
Solution
Cross entropy =
=
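A quick check of this computation:

import numpy as np

y_true = np.array([1, 0, 0, 0, 0])
y_pred = np.array([0.1, 0.5, 0.1, 0.1, 0.2])
# Only the true class contributes: -log2(0.1) = 3.322
print(-np.sum(y_true * np.log2(y_pred)))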
Question-8
Statement
State True or False:
Options
(a)
True
(b)
False
Answer
(b)
Solution
Cross entropy ( ) = , which is not commutative.