Machine 2020 Jul-Dec

This document contains a set of questions and solutions related to machine learning concepts. It covers topics like supervised vs unsupervised learning, classification vs regression, bias and variance, and dimensionality reduction techniques. Linear regression, logistic regression, and orthogonalization methods are also discussed in the context of solving machine learning problems.

Assignment 1

Introduction to Machine Learning


Prof. B. Ravindran
1. Which of the following is a supervised learning problem?

(a) Grouping people in a social network.


(b) Predicting credit approval based on historical data
(c) Predicting rainfall based on historical data
(d) all of the above

Sol. (b) and (c)


(a) does not have labels to indicate the groups. (b) and (c) have the correct answers for the
examples in the dataset.
2. Which of the following are classification problems? (multiple options may be correct)

(a) Predicting the temperature (in Celsius) of a room from other environmental features (such
as atmospheric pressure, humidity etc).
(b) Predicting if a cricket player is a batsman or bowler given his playing records.
(c) Predicting if a particular route between two points has traffic jam or not based on the
travel time of vehicles.
(d) Filtering of spam messages
Sol. (b),(c), (d)

3. Which of the following is a regression task? (multiple options may be correct)

(a) Predicting the monthly sales of a cloth store in rupees.


(b) Predicting if a user would like to listen to a newly released song or not based on historical
data.
(c) Predicting the confirmation probability (in fraction) of your train ticket whose current
status is waiting list based on historical data.
(d) Predicting if a patient has diabetes or not based on historical medical records.
(e) Predicting the gender of a human based on facial features.
Sol. (a) and (c)
4. Which of the following is an unsupervised task?

(a) Learning to play chess.


(b) Predicting if a new edible item is sweet or spicy based on the information of the ingredi-
ents, their quantities, and labels (sweet or spicy) for many other similar dishes.
(c) Grouping related documents from an unannotated corpus.
(d) all of the above

Sol. (c)
option (b) has the labels, so it is supervised task
5. Which of the following is a categorical feature?
(a) Number of legs of an animal
(b) Number of hours you study in a day
(c) Your weekly expenditure in rupees.
(d) Branch of an engineering student
(e) Ethnicity of a person
(f) Height of a person in inches
Sol. (d) and (e)
6. Let X and Y be uniformly distributed random variables over the intervals [0, 4] and [0, 6] respectively. If X and Y are independent, compute the probability
P(max(X, Y) > 2)
(a) 1/6
(b) 5/6
(c) 2/3
(d) None of the above
Sol. (b)

P(max(X, Y) > 2) = P(X > 2) + P(Y > 2) − P(X > 2 & Y > 2)
= 1/2 + 2/3 − 1/2 × 2/3
= 5/6
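A quick numerical sanity check of this value (a sketch assuming NumPy is available; the seed and sample size are arbitrary):

import numpy as np

# Monte Carlo check of P(max(X, Y) > 2) for X ~ U[0, 4], Y ~ U[0, 6], X and Y independent.
rng = np.random.default_rng(0)
x = rng.uniform(0, 4, 1_000_000)
y = rng.uniform(0, 6, 1_000_000)
print((np.maximum(x, y) > 2).mean())   # close to 5/6 ≈ 0.8333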
 
7. Let the trace and determinant of a 2 × 2 matrix A = [[a, b], [c, d]] be 4 and 3 respectively. The eigenvalues of A are
(a) (3 + ι√7)/2 and (3 − ι√7)/2, where ι = √−1
(b) 1, 3
(c) None of the above
(d) Can be computed only if A is a symmetric matrix.
(e) Can not be computed as the entries of the matrix A are not given.
Sol. (b)
Use the fact that the trace and determinant of a matrix are equal to the sum and product of its eigenvalues respectively. Using this,
λ1 + λ2 = 4, λ1 λ2 = 3
where λ1 and λ2 denote the eigenvalues. Solve the above two equations in two variables, or check which of the options satisfies them.
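A small check of this fact (a sketch; the particular matrix below is just one hypothetical example with trace 4 and determinant 3):

import numpy as np

# Any matrix with trace 4 and determinant 3 has eigenvalues 1 and 3.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # trace = 4, det = 3
print(np.linalg.eigvals(A))           # [3. 1.] (order may vary)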

8. What happens when your model complexity increases? (multiple options may be correct)

(a) Model Bias decreases


(b) Model Bias increases
(c) Variance of the model decreases
(d) Variance of the model increases

Sol. (a) and (d)


9. Based on a survey, it was found that the probability that a student likes to play football was
0.25 and the probability that a student likes to play cricket is 0.43. It was also found that
the probability that a student likes to play both football and cricket is 0.12. What is the
probability that a student does not like to play either?

(a) 0.32
(b) 0.2
(c) 0.44
(d) 0.56

Sol. (c)
Given P(football) = 0.25, P(cricket) = 0.43, and P(football ∩ cricket) = 0.12.
We are interested in the probability of students who do not like to play either football or cricket, i.e., P((football ∪ cricket)′).
From basic set theory, we have P(football ∪ cricket) = P(football) + P(cricket) − P(football ∩ cricket) = 0.25 + 0.43 − 0.12 = 0.56.
Also, the two events, a student likes to play football or cricket (football ∪ cricket) and a student does not like to play either football or cricket ((football ∪ cricket)′), are mutually exclusive.
Therefore, we have P((football ∪ cricket)′) = 1 − P(football ∪ cricket) = 1 − 0.56 = 0.44.

10. Which of the following are true about bias and variance of overfitted and underfitted models?
(multiple options may be correct)
(a) Underfitted models have high bias.
(b) Underfitted models have low bias.
(c) Overfitted models have low variance.
(d) Overfitted models have high variance.
(e) none of these
Sol. (a), (d)

Assignment 2
Introduction to Machine Learning
Prof. B. Ravindran
1. In building a linear regression model for a particular data set, you observe the coefficient of
one of the features having a relatively high negative value. This suggests that
(a) This feature has a strong effect on the model (should be retained)
(b) This feature does not have a strong effect on the model (should be ignored)
(c) It is not possible to comment on the importance of this feature without additional infor-
mation
Sol. (c)
A high magnitude suggests that the feature is important. However, it may be the case that another feature is highly correlated with this feature and its coefficient also has a high magnitude with the opposite sign, in effect cancelling out the effect of the former. Thus, we cannot really remark on the importance of a feature just because its coefficient has a relatively large magnitude.
2. We have seen methods like ridge and lasso to reduce variance among the co-efficients. We can
use these methods to do feature selection also. Which one of them is more appropriate?
(a) Ridge
(b) Lasso
Sol. (b)
For feature selection, we would prefer to use lasso since solving the optimisation problem when
using lasso will cause some of the coefficients to be exactly zero (depending of course on the
data) whereas with ridge regression, the magnitude of the coefficients will be reduced, but
won’t go down to zero.
3. Given a set of n data points, (x1 , y1 ), (x2 , y2 ), . . . , (xn , yn ), the best least squares fit f (x) is
obtained by minimization of:
(a) Σ_{i=1}^{n} [y_i − f(x_i)]
(b) min(y_i − f(x_i))
(c) Σ_{i=1}^{n} [y_i − f(x_i)]²
(d) max(y_i − f(x_i))
Sol. (c)

4. During linear regression, with regards to residuals, which among the following is true?
(a) Lower is better
(b) Higher is better
(c) Depends upon the data
(d) None of the above

Sol. (a)
Residuals refer to the errors in the model, hence, lower is better.
5. In the lecture on Multivariate Regression, you learn about using orthogonalization iteratively to obtain regression coefficients. This method is generally referred to as Multiple Regression using Successive Orthogonalization.
In the formulation of the method, we observe that in iteration k, we regress the entire dataset on z0 , z1 , . . . zk−1 . It seems like a waste of computation to recompute the coefficients for z0 a total of p times, z1 a total of p − 1 times and so on. Can we re-use the coefficients computed in iteration j for iteration j + 1 for zj−1 ?
(a) No. Doing so will result in the wrong γ matrix, and hence, the wrong βi 's.
(b) Yes. Since the residuals z0 , z1 , . . . , zj−1 are mutually orthogonal, the multiple regression in each iteration is essentially a univariate regression on each of the previous residuals. Since the regression coefficients for the previous residuals don't change over iterations, we can re-use the coefficients for further iterations.
Sol. (b)
The answer is self-explanatory. Please refer to the section on Multiple Regression using Suc-
cessive Orthogonalization in Elements of Statistical Learning, 2nd edition for the algorithm.
6. You decide to reduce the dimensionality of your data (N × p) using Best Subset Selection. The library you're using has a function regress(X, Y) that takes in X and Y and regresses Y on X. What is the expected number of times regress(·, ·) will be called during your dimensionality reduction?
(a) O(2^N)
(b) O(2^p)
(c) O(N^p)
(d) O(p^2)
Sol. (b)
In Best Subset Selection, each possible subset of features is regressed, and the number of possible subsets is O(2^p).
7. If the number of features is larger than the number of training data points, to identify a suitable
subset of the features for use with linear regression, we would prefer
(a) Forward stepwise selection
(b) Backward stepwise selection
Sol. (a)
Explanation: Recall that in backward stepwise selection, we need to first build a model using
all features. This can lead to problems in matrix inversion when the number of training data
points is less than the number of features. Thus, in this scenario we would prefer forward
stepwise selection.
8. Assume you have a five-dimensional input data for a three-class classification problem. Further
assume that all five dimensions of the input are independent to each other. In this scenario,
is it possible for linear regression using lasso to result in one or more coefficients to become
zero?

(a) Yes
(b) No
Sol. (a)
Note that even if the input dimensions are independent, one or more dimensions may not
be very useful in discriminating between examples of the different classes. In such cases, the
coefficients for such features may become zero on using lasso.

9. You are given the following five three-dimensional training data instances (along with one-
dimensional output)
• x1 = 5, x2 = 7, x3 = 3, y = 4
• x1 = 2, x2 = 4, x3 = 9, y = 8
• x1 = 3, x2 = 8, x3 = 1, y = 2
• x1 = 7, x2 = 7, x3 = 2, y = 3
• x1 = 1, x2 = 9, x3 = 7, y = 8
Using the K-nearest neighbour technique for performing regression, what will be the predicted
y value corresponding to the query point (x1 = 5, x2 = 3, x3 = 4), for K = 2?

(a) 3
(b) 2.5
(c) 3.5
(d) 2

Sol. (c)
When K = 2, the nearest points are x1 = 5, x2 = 7, x3 = 3 and x1 = 7, x2 = 7, x3 = 2. Taking
the average of the outputs of these two points, we have y = (4 + 3)/2 = 3.5.
10. For the dataset given in the previous question, what will be the predicted y value corresponding
to the query point (x1 = 5, x2 = 3, x3 = 4), for K = 3?

(a) 4.66
(b) 5
(c) 3
(d) 3.5

Sol. (b)
Similarly, when K = 3, nearest points are x1 = 5, x2 = 7, x3 = 3, x1 = 7, x2 = 7, x3 = 2
and x1 = 2, x2 = 4, x3 = 9. Taking the average of the outputs of these three points, we have
y = (4 + 8 + 3)/3 = 5
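A minimal NumPy sketch of the K-NN regression computation used in questions 9 and 10 (Euclidean distance is assumed, as is standard):

import numpy as np

X = np.array([[5, 7, 3], [2, 4, 9], [3, 8, 1], [7, 7, 2], [1, 9, 7]], dtype=float)
y = np.array([4, 8, 2, 3, 8], dtype=float)
query = np.array([5, 3, 4], dtype=float)

order = np.argsort(np.linalg.norm(X - query, axis=1))   # indices of nearest neighbours
for K in (2, 3):
    print(K, y[order[:K]].mean())                       # K=2 -> 3.5, K=3 -> 5.0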

Assignment 3
Introduction to Machine Learning
Prof. B. Ravindran
1. Which of the following is true about a logistic regression based classifier?

(a) The logistic function is non-linear in the weights


(b) The logistic function is linear in the weights
(c) The decision boundary is linear in the weights
(d) The decision boundary is non-linear in the weights

Sol. (a), (c)

2. For a binary classification problem, the decision boundary resulting from the use of logistic
regression is

(a) linear
(b) sigmoid
(c) parabolic
(d) exponential

Sol. (a)
Refer to the videos
3. (2 marks) Consider the case where two classes follow Gaussian distribution which are cen-
tered at (−1, 2) and (1, 4) and have identity covariance matrix. Which of the following is the
separating decision boundary using LDA?

(a) y − x = 3
(b) x + y = 3
(c) x + y = 6
(d) (b) and (c) are possible
(e) None of these
(f) Can not be found from the given information
Sol. (b)
As the class-conditional distributions are Gaussian with identical (identity) covariance matrices, the separating boundary will be linear. The decision boundary will be orthogonal to the line joining the centers and will pass through the midpoint of the centers.
4. Consider the following relation between a dependent variable and an independent variable
identified by doing simple linear regression. Which among the following relations between the
two variables does the graph indicate?

(a) as the independent variable increases, so does the dependent variable
(b) as the independent variable increases, the dependent variable decreases
(c) if an increase in the value of the dependent variable is observed, then the independent
variable will show a corresponding increase
(d) if an increase in the value of the dependent variable is observed, then the independent
variable will show a corresponding decrease
(e) the dependent variable in this graph does not actually depend on the independent variable
(f) none of the above

Sol. (e)
5. Given the following distribution of data points:

What method would you choose to perform Dimensionality Reduction?
(a) Linear Discriminant Analysis
(b) Principal Component Analysis
Sol. (a)
PCA does not use class labels and will treat all the points as instances of the same pool. Thus the principal component will be the vertical axis, as the most variance is along that direction. However, projecting all the points onto the vertical axis will mean that critical information is lost and both classes are mixed completely. LDA, on the other hand, models each class with a Gaussian. This will lead to a projection direction along the horizontal axis, which retains class information (the classes are still linearly separable).

6. In general, which of the following classification methods is the most resistant to gross outliers?

(a) Quadratic Discriminant Analysis (QDA)
(b) Linear Regression
(c) Logistic regression
(d) Linear Discriminant Analysis (LDA)
Sol. (c)
In general, a good way to tell if a method is sensitive to outliers is to look at the loss it incurs
upon ignoring outliers.
Linear Regression uses a square loss and thus, outliers that are far away from the hyperplane
contribute significantly to the loss.
LDA and QDA both use the L2-norm and, for the same reason, are sensitive to outliers.
Logistic Regression weights the points close to the boundary higher than points far away. This
is an implication of the Logistic loss function (beyond the boundary, roughly linear instead of
quadratic).
7. Suppose that we have two variables, X and Y (the dependent variable). We wish to find
the relation between them. An expert tells us that relation between the two has the form
Y = mX 2 + c. Available to us are samples of the variables X and Y. Is it possible to apply
linear regression to this data to estimate the values of m and c?

(a) no
(b) yes
(c) insufficient information

Sol. (b)
Instead of considering the dependent variable directly, we can transform the independent
variable by considering the square of each value. Thus, on the X-axis, we can plot values of
X 2 and on the Y-axis, we can plot values of Y . The relation between the dependent and
the transformed independent variable is linear and the value of slope and intercept can be
estimated using linear regression.

8. In a binary classification scenario where x is the independent variable and y is the dependent
variable, logistic regression assumes that the conditional distribution y|x follows a

(a) Bernoulli distribution


(b) binomial distribution
(c) normal distribution
(d) exponential distribution

Sol. (a)
The dependent variable is binary, so a Bernoulli distribution is assumed.

9. Consider the following data:

Feature 1 Feature 2 Class
1 1 A
2 3 A
2 4 A
5 3 A
8 6 B
8 8 B
9 9 B
11 7 B

Assuming that you apply LDA to this data, what is the estimated covariance matrix?
 
(a) [[1.875, 0.3125], [0.3125, 0.9375]]
(b) [[2.5, 0.4167], [0.4167, 1.25]]
(c) [[1.875, 0.3125], [0.3125, 1.2188]]
(d) [[2.5, 0.4167], [0.4167, 1.625]]
(e) None of these

Sol. (d)
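A short sketch of the computation, assuming the usual pooled (within-class) covariance estimate with divisor N − K:

import numpy as np

A = np.array([[1, 1], [2, 3], [2, 4], [5, 3]], dtype=float)    # class A points
B = np.array([[8, 6], [8, 8], [9, 9], [11, 7]], dtype=float)   # class B points

def scatter(P):
    d = P - P.mean(axis=0)
    return d.T @ d

pooled = (scatter(A) + scatter(B)) / (len(A) + len(B) - 2)     # N - K, with K = 2 classes
print(pooled)   # [[2.5, 0.4167], [0.4167, 1.625]]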

Assignment 4
Introduction to Machine Learning
Prof. B. Ravindran
1. (1 mark) Consider the data set given below. Can we use the perceptron learning algorithm to build a model, using only the given features, that achieves zero misclassification error on the training data?
(a) Yes
(b) No
(c) Depends on the initial weights
Sol. (b)
The given data specifies the well-known XOR problem which cannot be separated by a linear
boundary.
2. (1 mark) Suppose we use a linear kernel SVM to build a classifier for a 2-class problem where
the training data points are linearly separable. In general, will the classifier trained in this
manner be always the same as the classifier trained using the perceptron training algorithm
on the same training data?
(a) Yes
(b) No
Sol. (b) The hyperplane returned by the SVM approach will have a maximal margin, whereas
no such guarantee can be given for the hyperplane identified using the perceptron training
algorithm.
For Q3,4: Kindly download the synthetic dataset from the following link
https://rb.gy/jpcuaf
The dataset contains 1000 points and each input point contains 3 features.
3. (2 marks) Train a linear regression model (without regularization) on the above dataset. Re-
port the coefficients of the best fit model. Report the coefficients in the following format:
β0 , β1 , β2 , β3 . (You can round off the coefficient values to the nearest 2-decimal-place number.)
(a) -1.2, 2.1, 2.2, 1
(b) 1, 1.2, 2.1, 2.2
(c) -1, 1.2, 2.1, 2.2
(d) 1, -1.2, 2.1, 2.2

(e) 1, 1.2, -2.1, -2.2

Sol. (c)
Follow the steps given on the sklearn page.
4. (2 marks) Train an l2 regularized linear regression model on the above dataset. Vary the
regularization parameter from 1 to 10. As you increase the regularization parameter, absolute
value of the coefficients (excluding the intercept) of the model:

(a) increase
(b) first increase then decrease
(c) decrease
(d) first decrease then increase
(e) does not change

Sol. (c)
Follow the steps given on the sklearn page.

For Q5,6: Kindly download the modified version of the Iris dataset from this link.
Available at: https://goo.gl/vchhsd
The dataset contains 150 points and each input point contains 4 features and belongs to one
among three classes. Use the first 100 points as the training data and the remaining 50 as
test data. In the following questions, to report accuracy, use test dataset. You can round-off
the accuracy value to the nearest 2-decimal point number. (Note: Do not change the order of
data points.)
5. (2 marks) Train an l2 regularized logistic regression classifier on the modified iris dataset. We
recommend using sklearn. Use only the first three features for your model. We encourage you
to explore the impact of varying different hyperparameters of the model. Kindly note that the
C parameter mentioned below is the inverse of the regularization parameter λ. As part of the
assignment train a model with the following hyperparameters:
Model: logistic regression with one-vs-rest classifier, C = 1e4
For the above set of hyperparameters, report the best classification accuracy
(a) 0.88
(b) 0.86
(c) 0.98
(d) 0.68
Sol. (c)
The following code will give the desired result (assuming the features and labels have been loaded into arrays X and Y):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty='l2', C=1e4, multi_class='ovr').fit(X[0:100, 0:3], Y[0:100])
clf.score(X[100:, 0:3], Y[100:])   # evaluate on the remaining 50 test points

6. (2 marks) Train an SVM classifier on the modified iris dataset. We recommend using sklearn.
Use only the first three features for your model. We encourage you to explore the impact
of varying different hyperparameters of the model. Specifically try different kernels and the
associated hyperparameters. As part of the assignment train models with the following set of
hyperparameters
RBF-kernel, gamma = 0.5, one-vs-rest classifier, no-feature-normalization.
Try C = 0.01, 1, 10. For the above set of hyperparameters, report the best classification
accuracy along with total number of support vectors on the test data.
(a) 0.92, 69
(b) 0.88, 40
(c) 0.88, 69
(d) 0.98, 41
Sol. (d)
The following code will give the desired result (assuming X and Y hold the features and labels):

from sklearn import svm

clf = svm.SVC(C=1.0, kernel='rbf', decision_function_shape='ovr', gamma=0.5).fit(X[0:100, 0:3], Y[0:100])
clf.score(X[100:, 0:3], Y[100:])   # evaluate on the remaining 50 test points
clf.n_support_                     # number of support vectors per class

Assignment 5
Introduction to Machine Learning
Prof. B. Ravindran
1. (2 marks) For training a binary classification model with three independent variables, you
choose to use neural networks. You apply one hidden layer with three neurons. What is the number of parameters to be estimated? (Consider the bias term as a parameter)
(a) 16
(b) 21
(c) 3^4 = 81
(d) 4^3 = 64
(e) 12
(f) 4
(g) None of these

Sol. (a)
Number of weights from input to hidden layer = 3 × 3 = 9
Bias term for the three neurons in hidden layer = 3
Number of weights from hidden to output layer = 3
Bias term in the final output layer = 1
Summing the above = 16
2. Suppose the marks obtained by randomly sampled students follow a normal distribution with
unknown µ. A random sample of 5 marks are 30, 50, 69, 2 and 99. Using the given samples
find the maximum likelihood estimate for the mean.
(a) 54.2
(b) 67.75
(c) 50
(d) Information not sufficient for estimation
Sol. (c)
The MLE of the mean of a normal distribution is the sample mean: (30 + 50 + 69 + 2 + 99)/5 = 50.

3. You are given the following neural network, which takes two binary valued inputs x1 , x2 ∈ {0, 1} and whose activation function is the threshold function (h(x) = 1 if x > 0; 0 otherwise). Which of the following logical functions does it compute?

Figure 1: Q1

(a) OR
(b) AND
(c) NAND
(d) None of the above.

Sol. (a)

4. Using the notations used in class, evaluate the value of the neural network with a 3-3-1 architecture (2-dimensional input with 1 node for the bias term in both the layers). The parameters are as follows:
α = [[1, 0.2, 0.4], [−1, 0.8, 0.5]]
β = [0.3, 0.4, 0.5]
Using the sigmoid function as the activation function at both the layers, the output of the network for an input of (0.8, 0.7) will be
(a) 0.6710
(b) 0.9617
(c) 0.6948
(d) 0.7052
(e) None of these
Solution (d)
This is a straightforward computation task. First pad x with 1 and make it the X vector,
X = (1, 0.8, 0.7)ᵀ
The output of the first layer can be written as
o1 = αX
Next apply the sigmoid function and compute
a1(i) = 1 / (1 + e^(−o1(i)))
Then pad the a1 vector also with 1 for the bias, and compute the output of the second layer:
o2 = βa1
a2 = 1 / (1 + e^(−o2))
a2 = 0.7052
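The same forward pass as a small NumPy sketch (bias entries padded with 1, sigmoid activations as described above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

alpha = np.array([[1.0, 0.2, 0.4],
                  [-1.0, 0.8, 0.5]])
beta = np.array([0.3, 0.4, 0.5])

X = np.array([1.0, 0.8, 0.7])            # input (0.8, 0.7) padded with the bias term
a1 = sigmoid(alpha @ X)                  # hidden layer activations
a1 = np.concatenate(([1.0], a1))         # pad hidden layer with the bias term
a2 = sigmoid(beta @ a1)
print(round(float(a2), 4))               # 0.7052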

5. Which of the following statements is false:


(a) The chances of overfitting decrease with increasing the number of hidden nodes and increasing the number of hidden layers.
(b) A neural network with one hidden layer can represent any Boolean function given sufficient
number of hidden units and appropriate activation functions.
(c) Two hidden layer neural networks can represent any continuous functions (within a tol-
erance) as long as the number of hidden units is sufficient and appropriate activation
functions used.

Sol. (a) By increasing the number of hidden nodes or hidden layers we are increasing the number of parameters. A larger set of parameters is more capable of memorizing the training data. Hence it may result in overfitting.
6. Consider the functions f1(x) = e^(α0+αx) / (1 + e^(α0+αx)) and f2(x) = e^(β0+βx) / (1 + e^(β0+βx)), shown in the figure below:

Figure 2: f2 (x) Figure 3: f1 (x)

Which of the following is correct?


(a) 0 < β < α
(b) 0 < α < β
(c) α < β < 0
(d) β < α < 0

Sol. (c)
7. We have a function which takes a two-dimensional input x = (x1 , x2 ) and has two parameters w = (w1 , w2 ), given by f(x, w) = σ(σ(x1 w1)w2 + x2) where σ(x) = 1/(1 + e^(−x)). We use backpropagation to estimate the right parameter values. We start by setting both the parameters to 0. Assume that we are given a training point x2 = 1, x1 = 0, y = 5. Given this information answer the next two questions. What is the value of ∂f/∂w2?
(a) 0.150
(b) -0.25
(c) 0.125
(d) 0.098
(e) None of these
Solution: (d)
Write σ(x1 w1)w2 + x2 as o2 and x1 w1 as o1. Then
∂f/∂w2 = (∂f/∂o2) (∂o2/∂w2) = σ(o2)(1 − σ(o2)) × σ(o1)
8. If the learning rate is 0.5, what will be the value of w2 after one update using backpropagation
algorithm?
(a) 0.4197
(b) -0.4197
(c) 0.5625
(d) - 0.5625
Solution: (a)
The update equation would be
w2 = w2 − λ ∂L/∂w2
where L is the loss function, here L = (y − f)²
w2 = w2 − λ × 2(y − f) × (−1) × ∂f/∂w2

Now putting in the given values we get the right answer.
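A numeric check of questions 7 and 8 (a sketch; it simply evaluates the expressions derived above):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x1, x2, y = 0.0, 1.0, 5.0
w1, w2 = 0.0, 0.0                               # initial parameter values

o1 = x1 * w1
o2 = sigmoid(o1) * w2 + x2
f = sigmoid(o2)

df_dw2 = f * (1 - f) * sigmoid(o1)              # Q7: ≈ 0.098
dL_dw2 = 2 * (y - f) * (-1) * df_dw2            # gradient of L = (y - f)^2
w2_new = w2 - 0.5 * dL_dw2                      # Q8: ≈ 0.4197
print(round(df_dw2, 3), round(w2_new, 4))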


9. Which of the following are true when comparing ANNs and SVMs?
(a) ANN error surface has multiple local minima while SVM error surface has only one minima
(b) After training, an ANN might land on a different minimum each time, when initialized
with random weights during each run.
(c) As shown for Perceptron, there are some classes of functions that cannot be learnt by an
ANN. An SVM can learn a hyperplane for any kind of distribution.

(d) In training, ANN’s error surface is navigated using a gradient descent technique while
SVM’s error surface is navigated using convex optimization solvers.
Sol. (a), (b) and (d)
By the universal approximation theorem, we can argue that option (c) is not true.

Assignment 6
Introduction to Machine Learning
Prof. B. Ravindran
1. What is specified at any non-leaf node in a decision tree?
(a) Class of instance
(b) Data value description
(c) Test specification
(d) Data process description
Sol (c)
2. Suppose we use the decision tree model for solving a multi-class classification problem. As we
continue building the tree, w.r.t. the generalisation error of the model,
(a) the error due to bias increases
(b) the error due to bias decreases
(c) the error due to variance increases
(d) the error due to variance decreases
Sol. (b) & (c)
As we continue to build the decision tree model, it is possible that we overfit the data. In
this case, the model is sufficiently complex, i.e., the error due to bias is low. However, due to
overfitting, the error due to variance starts increasing.
3. (2 marks) Having built a decision tree, we are using reduced error pruning to reduce the size
of the tree. We select a node to collapse. For this particular node, on the left branch, there
are 3 training data points with the following outputs: 5, 7, 9.6 and for the right branch,
there are four training data points with the following outputs: 8.7, 9.8, 10.5, 11. The original
responses for data points along the two branches (left and right respectively) were response left and response right, and the new response after collapsing the node is response new. What are the values for response left, response right and response new (numbers in the options are given in the same order)?
(a) 21.6, 40, 61.6
(b) 7.2; 10; 8.8
(c) 3, 4, 7
(d) depends on the tree height.
Sol. (b)
Original responses:
Left: (5 + 7 + 9.6)/3 = 7.2
Right: (8.7 + 9.8 + 10.5 + 11)/4 = 10
New response: 7.2 × 3/7 + 10 × 4/7 = 8.8
4. Which of these classifiers do not require any additional modifications to their original descrip-
tions (as seen in the lectures) to use them when we have more than 2 classes? (multiple options
may be correct)

(a) decision trees
(b) logistic regression
(c) support vector machines
(d) k nearest neighbors
Sol. (a), (d)
Explanation: Logistic regression and SVMs need to be tweaked to make them work for multiclass classification problems. Decision trees and kNNs, on the other hand, are agnostic to the number of classes.
5. (2 marks) Consider the following data set.

price maintenance capacity airbag profitable


low low 2 no yes
low med 4 yes yes
low low 4 no yes
low med 4 no no
low high 4 no no
med med 4 no no
med med 4 yes yes
med high 2 yes no
med high 5 no yes
high med 4 yes yes
high med 2 yes yes
high high 2 yes no
high high 5 yes yes

Considering ’profitable’ as the binary valued attribute we are trying to predict, which of the
attributes would you select as the root in a decision tree with multi-way splits using the
cross-entropy impurity measure?

(a) price
(b) maintenance
(c) capacity
(d) airbag

Sol. (b)
Follow the steps given in the example from the lecture: https://bit.ly/2EtiMhH
6. In the above data set, what is the value of cross entropy when we consider capacity as the
attribute to split on (multi-way splits)?
(You can round-off the cross entropy value to the nearest 4-decimal place number)

(a) 0.7973
(b) 0.8684
(c) 0.8382
(d) 0.7688

Sol. (c)
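A short sketch of the computation, using the class counts of 'profitable' within each capacity value read off the table above:

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

# (yes, no) counts of 'profitable' for capacity = 2, 4 and 5 respectively
splits = {2: (2, 2), 4: (4, 3), 5: (2, 0)}
total = sum(sum(c) for c in splits.values())
ce = sum(sum(c) / total * entropy(c) for c in splits.values())
print(round(ce, 4))   # 0.8382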

7. An important factor that influences the variance of decision trees is the average height of the
tree. For the same dataset, if we limited the height of the trees to some H, how would the
variance of the decision tree algorithm be affected?

(a) Variance may increase with tree height H.

(b) Variance may decrease with tree height H.
(c) Variance is unaffected by tree height H.

Sol. (a)
Generally, a more complex classifier implies more variance and less bias.
An intuitive way to imagine it is to think of a decision tree with H = 1 as a linear classifier.
We know that a linear classifier has low variance and high bias. We also know that a general
decision tree has high variance and low bias. Changing the average tree height H produces
a spectrum of different classifiers ranging from low variance/high bias (a linear classifier) to
high variance/low bias (a very tall tree).

8. In which of the following situations is it appropriate to introduce a new category ’Missing’ for
missing values? (multiple options may be correct)

(a) When values are missing because the 108 emergency operator is sometimes attending a
very urgent distress call.
(b) When values are missing because the attendant spilled coffee on the papers from which
the data was extracted.
(c) When values are missing because the warehouse storing the paper records went up in
flames and burnt parts of it.
(d) When values are missing because the nurse/doctor finds the patient’s situation too urgent.

Sol. (a),(d)
We typically introduce a ‘Missing’ value when the fact that a value is missing can also be a
relevant feature. In the case of (a) it can imply that the call was so urgent that the operator couldn't note it down. This urgency could potentially be useful to determine the target.
But a coffee spill corrupting the records is likely to be completely random and we glean no
new information from it. In this case, a better method is to try to predict the missing data
from the available data.

Assignment 7
Introduction to Machine Learning
Prof. B. Ravindran
1. For the given confusion matrix, compute the recall

Actual Positive Actual Negative

Predicted Positive 6 3
Predicted Negative 4 7

(a) 0.73
(b) 0.7
(c) 0.6
(d) 0.67
(e) 0.78
(f) None of the above

Sol. (c)
Recall = TP/(TP + FN) = 6/(6 + 4) = 0.6.

2. Pallavi is working on developing a binary classifier which has a huge class imbalance. Which
of the following metric should she optimize the classifier over to develop a good model?

(a) Accuracy
(b) Precision
(c) Recall
(d) F-Score

Sol. (d)
3. While designing an experiment, which of these aspects should be considered?
(a) Floor/Ceiling Effects
(b) Order Effects
(c) Sampling Bias
Sol. (a), (b) and (c)
4. Which of the following are true?
TP - True Positive, TN - True Negative, FP - False Positive, FN - False Negative
(a) Precision = TP / (TP + FP)
(b) Recall = TP / (TP + FN)
(c) Accuracy = 2(TP + TN) / (TP + TN + FP + FN)
(d) Recall = FP / (TP + FP)

Sol. (a), (b)

5. In the ROC plot, what are the quantities along y and x axes respectively?
(a) Precision, Recall
(b) Recall, Precision,
(c) True Positive Rate, False Positive Rate
(d) False Positive Rate, True Positive Rate
(e) Specificity, Sensitivity
(f) True Positive, True Negative
(g) True Negative, True Positive

Sol. (c)
6. (2 marks) How does bagging help in improving the classification performance?
(a) If the parameters of the resultant classifiers are fully uncorrelated (independent), then
bagging is inefficient.
(b) It helps reduce variance
(c) If the parameters of the resultant classifiers are fully correlated, then bagging is inefficient.
(d) It helps reduce bias
Sol. (b), (c)
The lecture clearly states that correlated weights generally means that all the classifiers learn
very similar functions. This means that bagging gives no extra stability.
Having a lot of uncorrelated classifiers helps to reduce variance since the resultant ensemble is more resistant to a single outlier (it is likely that the outlier affects only a small fraction of classifiers in the ensemble).

7. Which method among bagging and stacking should be chosen in case of limited training data?
and What is the appropriate reason for your preference?
(a) Bagging, because we can combine as many classifier as we want by training each on a
different sample of the training data
(b) Bagging, because we use the same classification algorithms on all samples of the training
data
(c) Stacking, because we can use different classification algorithms on the training data
(d) Stacking, because each classifier is trained on all of the available data
Sol. (d)

8. (2 marks) Which of the following statements are TRUE when comparing Committee Machines
and Stacking
(a) Committee Machines are, in general, special cases of 2-layer stacking where the second-
layer classifier provides uniform weightage.

(b) Both Committee Machines and Stacking have similar mechanisms, but Stacking uses
different classifiers while Committee Machines use similar classifiers.
(c) Committee Machines are more powerful than Stacking
(d) Committee Machines are less powerful than Stacking
Sol. (a), (d)
Both Committee Machines and Stacked Classifiers use sets of different classifiers. Assigning
constant weight to all first layer classifiers in a Stacked Classifier is simply the same as giving
each one a single vote (Committee Machines).
Since Committee Machines are a special case of Stacked Classifiers, they are less powerful than
Stacking, which can assign an adaptive weight depending on the region.

Assignment 8
Introduction to Machine Learning
Prof. B. Ravindran
1. How does bagging help in designing better classifiers?
(a) If the parameters of the resultant classifiers are fully uncorrelated (independent), then
bagging is inefficient.
(b) It helps reduce bias
(c) If the parameters of the resultant classifiers are fully correlated, then bagging is inefficient.
(d) It helps reduce variance
Sol. (c), (d)
The lecture clearly states that correlated weights generally means that all the classifiers learn
very similar functions. This means that bagging gives no extra stability.
Having a lot of uncorrelated classifiers helps to reduce variance since the resultant ensemble is more resistant to a single outlier (it likely only affects a small fraction of classifiers in the ensemble).

2. In a random forest model let m << p be the number of randomly selected features that are
used to identify the best split at any node of a tree. Which of the following are true? (p is the
original number of features)
(Multiple options may be correct)
(a) increasing m reduces the correlation between any two trees in the forest
(b) decreasing m reduces the correlation between any two trees in the forest
(c) increasing m increases the performance of individual trees in the forest
(d) decreasing m increases the performance of individual trees in the forest
Sol. (b) and (c)
3. If you have a bad classifier, which of the following ensemble methods will give the worst
performance when including the given classifier?
(a) Gradient Boosting
(b) AdaBoost
(c) Bagging
(d) Committee Machine
Sol. (c)
As mentioned in the lectures, Bagging tends to make a bad classifier even worse (partially due
to the bootstrap mechanism providing fewer data points to an already bad classifier).
4. Which of the following properties is false in the case of a Bayesian Network?
(a) The edges are directed
(b) Contains cycles

(c) Represents conditional independence relations among random variables
(d) All of the above
Sol. (b)
5. A and B are Boolean random variables.
Given: P (A = T rue) = 0.3, P (A = F alse) = 0.7, P (B = T rue|A = T rue) = 0.4, P (B =
F alse|A = T rue) = 0.6, P (B = T rue|A = F alse) = 0.6, P (B = F alse|A = F alse) = 0.4.
Calculate P (A = F alse|B = F alse) by Bayes rule.
(a) 0.497
(b) 0.391
(c) 0.609
(d) 0.503
(e) None of the above
Sol. (c)
P(A = False | B = False) = (0.4 × 0.7) / (0.6 × 0.3 + 0.7 × 0.4) = 0.609
6. Can the boosting technique be applied to regression problems? Can bagging be applied to
regression problems?

(a) no, no
(b) no, yes
(c) yes, no
(d) yes, yes

Sol. (d)
Ensemble methods are not tied to the classification problem, and can be used for regression
as well.
7. In boosting, the weights of data points that were misclassified in the previous step are:

(a) increased as training progresses


(b) decreased as training progresses
(c) made zero as training progresses
(d) kept unchanged as training progresses

Sol. (a)

8. Which of the following statements are true about ensemble classifiers? (multiple options may
be correct)
(a) The different learners in boosting based ensembles can be trained in parallel
(b) The different learners in bagging based ensembles can be trained in parallel

(c) Boosting based algorithms which iteratively re-weight training points, such as AdaBoost,
are more sensitive to noise than bagging based methods.
(d) Boosting methods generally use strong learners as individual classifiers
(e) Boosting methods generally use weak learners as individual classifiers.
(f) An individual classifier in a boosting based ensemble is trained with every point in the
training set.

Sol. (b), (c), (e), (f)


9. Consider the following graphical model. Which of the following are false about the model? (multiple options may be correct)

(a) A is independent of B when C is known


(b) D is independent of A when C is known
(c) D is not independent of A when B is known
(d) D is not independent of A when C is known

Sol. (a), (b)


10. Consider the Bayesian network given in the previous question. Let ‘A’, ‘B’, ‘C’, ‘D’ and ‘E’ denote the random variables shown in the network. Which of the following can be inferred from the network structure?

(a) ‘A’ causes ‘D’
(b) ‘E’ causes ‘D’
(c) ‘C’ causes ‘A’
(d) options (a) and (b) are correct
(e) none of the above can be inferred
Sol. (e)
As discussed in the lecture, in Bayesian Network, the edges do not imply any causality.

Assignment 9
Introduction to Machine Learning
Prof. B. Ravindran
1. Consider the Bayesian network shown below.

Figure 1: Q1

Two students - Manish and Trisha make the following claims:


• Manish claims P (D|{S, L, C}) = P (D|{L, C})
• Trisha claims P (D|{S, L}) = P (D|L)
where P (X|Y ) denotes probability of event X given Y . Please note that Y can be a set. Which
of the following is true?
(a) Manish and Trisha are correct.
(b) Manish is correct and Trisha is incorrect.
(c) Manish is incorrect and Trisha is correct.
(d) Both are incorrect.
(e) Insufficient information to make any conclusion. Probability distributions of each variable
should be given.
Sol. (b)
D and S are independent given the two variables {L, C}, but not in the case when only L is given.
2. Consider the same Bayesian network shown in the previous question (Figure 1). Two other students
in the class - Trina and Manish make the following claims:
• Trina claims P (S|{G, C}) = P (S|C)
• Manish claims P (L|{D, G}) = P (L|G)
Which of the following is true?
(a) Both the students are correct.

(b) Trina is incorrect and Manish is correct.
(c) Trina is correct and Manish is incorrect.
(d) Both the students are incorrect.
(e) Insufficient information to make any conclusion. Probability distributions of each variable
should be given.
Sol. (a)
The reasoning is the same as in the previous question.
3. Consider the Bayesian graph shown below in Figure 2.

Figure 2: Q3

The random variables have the following notation: d - Difficulty, i - Intelligence, g - Grade, s -
SAT, l - Letter. The random variables are modeled as discrete variables and the corresponding
CPDs are as below.
d0     d1
0.6    0.4

i0     i1
0.6    0.4

         g1     g2     g3
i0, d0   0.3    0.4    0.3
i0, d1   0.05   0.25   0.7
i1, d0   0.9    0.08   0.02
i1, d1   0.5    0.3    0.2

       s0     s1
i0     0.95   0.05
i1     0.2    0.8

       l0     l1
g1     0.2    0.8
g2     0.4    0.6
g3     0.99   0.01
What is the probability of P (i = 1, d = 0, g = 2, s = 1, l = 1)?

(a) 0.004608
(b) 0.006144
(c) 0.001536
(d) 0.003992
(e) 0.009216
(f) 0.007309
(g) None of these

Sol. (e)

P(i = 1, d = 0, g = 2, s = 1, l = 1) = P(i = 1) P(d = 0) P(g = 2|i = 1, d = 0) P(s = 1|i = 1) P(l = 1|g = 2)
= 0.4 × 0.6 × 0.08 × 0.8 × 0.6 = 0.009216

4. Using the data given in the previous question, compute the probability of following assignment,
P (i = 1, g = 1, s = 1, l = 0) irrespective of the difficulty of the course? (up to 3 decimal places)

(a) 0.160
(b) 0.371
(c) 0.662
(d) 0.047
(e) 0.037
(f) 0.066
(g) 0.189

Sol. (d)
P(i = 1, g = 1, s = 1, l = 0) = P(i = 1) P(s = 1|i = 1) P(l = 0|g = 1) Σ_{d=0,1} P(d) P(g = 1|i = 1, d)
= 0.4 × 0.8 × 0.2 × (0.9 × 0.6 + 0.5 × 0.4) = 0.04736
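The two computations above as a small Python sketch (the dictionaries below just transcribe the CPDs given in question 3):

# CPDs from question 3
P_d = {0: 0.6, 1: 0.4}
P_i = {0: 0.6, 1: 0.4}
P_g = {(0, 0): [0.3, 0.4, 0.3], (0, 1): [0.05, 0.25, 0.7],
       (1, 0): [0.9, 0.08, 0.02], (1, 1): [0.5, 0.3, 0.2]}   # keyed by (i, d); entries are g = 1, 2, 3
P_s = {0: [0.95, 0.05], 1: [0.2, 0.8]}                       # keyed by i; entries are s = 0, 1
P_l = {1: [0.2, 0.8], 2: [0.4, 0.6], 3: [0.99, 0.01]}        # keyed by g; entries are l = 0, 1

# Q3: P(i=1, d=0, g=2, s=1, l=1)
q3 = P_i[1] * P_d[0] * P_g[(1, 0)][1] * P_s[1][1] * P_l[2][1]
# Q4: P(i=1, g=1, s=1, l=0), marginalising over d
q4 = P_i[1] * P_s[1][1] * P_l[1][0] * sum(P_d[d] * P_g[(1, d)][0] for d in (0, 1))
print(round(q3, 6), round(q4, 5))   # 0.009216, 0.04736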

5. Consider the Bayesian network shown below in Figure 3

Figure 3: Q5

Two students - Manish and Trisha make the following claims:


• Manish claims P (H|{S, G, J}) = P (H|{G, J})
• Trisha claims P (H|{S, C, J}) = P (H|{C, J})
Which of the following is true?
(a) Manish and Trisha are correct.
(b) Both are incorrect.
(c) Manish is incorrect and Trisha is correct.
(d) Manish is correct and Trisha is incorrect.
(e) Insufficient information to make any conclusion. Probability distributions of each variable
should be given.
Sol. (d)

6. Consider the Markov network shown below in Figure 4

Figure 4: Q6

Which of the following variables are NOT in the Markov blanket of variable “3” shown in the above Figure 4? (multiple answers may be correct)
(a) 1
(b) 8
(c) 2
(d) 5
(e) 6
(f) 4
(g) 7
Sol. (b), (c), (d) and (g)

7. In the Markov network given in Figure 4, two students make the following claims:
• Manish claims variable “1” is independent of variable “7” given variable “2”.
• Trina claims variable “2” is independent of variable “6” given variable “3”.

Which of the following is true?


(a) Both the students are correct.
(b) Trina is incorrect and Manish is correct.
(c) Trina is correct and Manish is incorrect.
(d) Both the students are incorrect.
(e) Insufficient information to make any conclusion. Probability distributions of each variable
should be given.
Sol. (b)

8. Four random variables are known to follow the given factorization

P(A1 = a1, A2 = a2, A3 = a3, A4 = a4) = (1/Z) ψ1(a2, a3) ψ2(a1, a4) ψ3(a2, a4) ψ4(a1, a3)

The corresponding Markov network would be

(a)

(b)

(c)

(d)

(e)

Sol. (e)

9. Does there exist a more compact factorization involving a smaller number of factors for the distribution given in the previous question?
(a) Yes
(b) No
(c) Insufficient information

Sol. (b)

10. Consider the following Markov Random Field.

Figure 10: Q10

Which of the following nodes will have no effect on D given the Markov Blanket of D?
(a) A
(b) B
(c) C
(d) E
(e) F
(f) G
(g) H
(h) I
(i) J
Sol. (a), (h) and (i)
The question requires you to select the random variables not in the Markov blanket of D. We see that the Markov blanket of D contains B, C, E, F, G, and H. The only other variables, other than D, are A, I, and J. These three variables can have no effect on D once the Markov blanket is known/given.

Assignment 10
Introduction to Machine Learning
Prof. B. Ravindran
1. (1 mark) In the lecture on the BIRCH algorithm, it is stated that using the number of points
N, sum of points SUM and sum of squared points SS, we can determine the centroid and
radius of the combination of any two clusters A and B. How do you determine the centroid of
the combined cluster? (In terms of N,SUM and SS of both the clusters)
(a) SUM_A + SUM_B
(b) SUM_A / N_A + SUM_B / N_B
(c) (SUM_A + SUM_B) / (N_A + N_B)
(d) (SS_A + SS_B) / (N_A + N_B)

Sol. (c)
Apply the centroid formula to the combined cluster points. It’s simply the sum of all points
divided by the total number of points.
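A sketch of the merge computation from the (N, SUM, SS) summaries; the radius formula below uses one common convention, radius² = SS/N − ‖centroid‖², and the two example clusters are hypothetical:

import numpy as np

def merge(cf_a, cf_b):
    # cf = (N, SUM, SS): number of points, vector sum of points, sum of squared norms
    N = cf_a[0] + cf_b[0]
    SUM = cf_a[1] + cf_b[1]
    SS = cf_a[2] + cf_b[2]
    centroid = SUM / N                                   # option (c)
    radius = np.sqrt(SS / N - centroid @ centroid)
    return centroid, radius

cf = lambda P: (len(P), P.sum(axis=0), (P ** 2).sum())   # build a summary from raw points
A = np.array([[1.0, 1.0], [2.0, 2.0]])                   # hypothetical cluster A
B = np.array([[4.0, 0.0]])                               # hypothetical cluster B
print(merge(cf(A), cf(B)))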
2. (1 mark) What assumption does the CURE clustering algorithm make with regards to the
shape of the clusters?
(a) No assumption
(b) Spherical
(c) Elliptical
Sol. (a)
Explanation CURE does not make any assumption on the shape of the clusters.
3. (1 mark) What, in general, would be the effect of increasing MinPts in DBSCAN while retaining the same Eps parameter? (Note that more than one statement may be correct)
(a) Increase in the sizes of individual clusters
(b) Decrease in the sizes of individual clusters
(c) Increase in the number of clusters
(d) Decrease in the number of clusters
Sol. (b), (c)
By increasing MinPts, we require a larger number of points in the neighbourhood for points to be included in a cluster; in one sense, by increasing MinPts we are looking for denser clusters. This can break not-so-dense clusters into more than one part, which can reduce cluster sizes and increase the number of clusters.

For the next question, kindly download the dataset - DS1. The first two columns in the dataset correspond to the co-ordinates of each data point. The third column corresponds to the actual cluster label.
DS1: https://bit.ly/2Lm75Ly
4. (3 marks) Visualize the dataset DS1. Which of the following algorithms will be able to recover
the true clusters (check by visual inspection).
(a) K-means clustering
(b) Single link hierarchical clustering
(c) Complete link hierarchical clustering
(d) Average link hierarchical clustering
Sol. (b)
The dataset contains spiral clusters. Single link hierarchical clustering can recover spiral
clusters with appropriate parameter settings.
5. (2 marks) Consider the similarity matrix given below: Which of the following shows the
hierarchy of clusters created by the single link clustering algorithm.
P1 P2 P3 P4 P5 P6
P1 1.0000 0.7895 0.1579 0.0100 0.5292 0.3542
P2 0.7895 1.0000 0.3684 0.2105 0.7023 0.5480
P3 0.1579 0.3684 1.0000 0.8421 0.5292 0.6870
P4 0.0100 0.2105 0.8421 1.0000 0.3840 0.5573
P5 0.5292 0.7023 0.5292 0.3840 1.0000 0.8105
P6 0.3542 0.5480 0.6870 0.5573 0.8105 1.0000

Sol. (b)

6. (2 marks) For the similarity matrix given in the previous question, which of the following shows
the hierarchy of clusters created by the complete link clustering algorithm.

Sol. (d)
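For questions 5 and 6, the merge orders can be reproduced with SciPy; a sketch, converting similarity to distance as 1 − similarity (one simple choice of conversion):

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

S = np.array([[1.0000, 0.7895, 0.1579, 0.0100, 0.5292, 0.3542],
              [0.7895, 1.0000, 0.3684, 0.2105, 0.7023, 0.5480],
              [0.1579, 0.3684, 1.0000, 0.8421, 0.5292, 0.6870],
              [0.0100, 0.2105, 0.8421, 1.0000, 0.3840, 0.5573],
              [0.5292, 0.7023, 0.5292, 0.3840, 1.0000, 0.8105],
              [0.3542, 0.5480, 0.6870, 0.5573, 0.8105, 1.0000]])
D = squareform(1.0 - S, checks=False)         # condensed distance vector
print(linkage(D, method='single'))            # single link merge order (Q5)
print(linkage(D, method='complete'))          # complete link merge order (Q6)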

Assignment 11
Introduction to Machine Learning
Prof. B. Ravindran
1. Given n samples x1 , x2 , . . . , xn drawn independently from a Geometric distribution with unknown parameter p, given by the pmf Pr(X = k) = (1 − p)^(k−1) p for k = 1, 2, 3, · · · , find the MLE of p.

(a) p_MLE = Σ_{i=1}^{n} x_i
(b) p_MLE = n Σ_{i=1}^{n} x_i
(c) p_MLE = n / Σ_{i=1}^{n} x_i
(d) p_MLE = (Σ_{i=1}^{n} x_i) / n
(e) p_MLE = n / Σ_{i=1}^{n−1} x_i
(f) p_MLE = (Σ_{i=1}^{n} x_i) / (n − 1)

Sol. (c)
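A brief sketch of the derivation: the log-likelihood of the sample is
ℓ(p) = Σ_{i=1}^{n} [(x_i − 1) log(1 − p) + log p] = (Σ_i x_i − n) log(1 − p) + n log p.
Setting dℓ/dp = n/p − (Σ_i x_i − n)/(1 − p) = 0 and solving gives p_MLE = n / Σ_{i=1}^{n} x_i, which is option (c).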

2. (2 marks) Suppose we are trying to model a p dimensional Gaussian distribution. What is the
actual number of independent parameters that need to be estimated in mean and covariance
matrix respectively?
(a) 1, 1
(b) p − 1, 1
(c) p, p
(d) p, p(p + 1)
(e) p, p(p + 1)/2
(f) p, (p + 3)/2

(g) p − 1, p(p + 1)
(h) p − 1, p(p + 1)/2 + 1
(i) p − 1, (p + 3)/2
(j) p, p(p + 1) − 1
(k) p, p(p + 1)/2 − 1
(l) p, (p + 3)/2 − 1
(m) p, p2
(n) p, p2 /2
(o) None of these
Sol. (e)
Explanation: Mean vector has p parameters. The covariance matrix is symmetric (p × p) and hence has p(p + 1)/2 independent parameters.

3. (2 marks) Given N samples x1 , x2 , . . . , xN drawn independently from a Binomial distribution with unknown parameter p, find the MLE of p. The Binomial distribution is used to model 'x' successes in 'n' Bernoulli trials. Its p.d.f. is given by:

f(x, n, p) = C(n, x) p^x (1 − p)^(n−x)

for x = 0, 1, 2, ..., n, where C(n, x) = n! / (x!(n − x)!)

(a) p_MLE = Σ_{i=1}^{N} x_i
(b) p_MLE = N Σ_{i=1}^{n} x_i
(c) p_MLE = n Σ_{i=1}^{N} x_i
(d) p_MLE = n / Σ_{i=1}^{N} x_i
(e) p_MLE = N / Σ_{i=1}^{n} x_i
(f) p_MLE = (Σ_{i=1}^{N} x_i) / n
(g) p_MLE = (Σ_{i=1}^{N} x_i) / N
(h) p_MLE = (n Σ_{i=1}^{N} x_i) / N
(i) p_MLE = (Σ_{i=1}^{N} x_i) / (n · N)
(j) p_MLE = N / Σ_{i=1}^{n−1} x_i
(k) p_MLE = (Σ_{i=1}^{N} x_i) / (N − 1)
(l) p_MLE = (Σ_{i=1}^{N} x_i) / (n − 1)

Sol. (i)

4. (2 marks) In Gaussian Mixture Models, πi are the mixing coefficients. Select the incorrect
conditions that the mixing coefficients need to satisfy for a valid GMM model.
(a) −1 ≤ πi ≤ 1, ∀i
(b) 0 ≤ πi ≤ 1, ∀i
(c) Σ_i πi = 1
(d) Σ_i πi need not be bounded

Sol. (a), (d)

5. (2 marks) Expectation-Maximization, or the EM algorithm, consists of two steps - E step and


the M-step. Using the following notation, select the correct set of equations used at each step
of the algorithm.
Notation.
X Known/Given variables/data
Z Hidden/Unknown variables
θ Total set of parameters to be learned
θk Values of all the parameters after stage k
Q(·, ·) The Q-function as described in the lectures
(a) E − E_{Z|X,θ} [log(Pr(X, Z|θ_m))]
(b) E − E_{Z|X,θ_{m−1}} [log(Pr(X, Z|θ))]
(c) M − argmax_θ Σ_Z Pr(Z|X, θ_{m−2}) · log(Pr(X, Z|θ))
(d) M − argmax_θ Q(θ, θ_{m−2})
(e) M − argmax_θ Q(θ, θ_{m−1})
Sol. (b), (e)
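To make the E and M steps concrete, here is a minimal 1-D, 2-component GMM sketch (synthetic data; the variable names and settings are illustrative, not from the lectures):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])   # synthetic data

pi = np.array([0.5, 0.5])                     # mixing coefficients
mu = np.array([-1.0, 1.0])                    # component means
sigma = np.array([1.0, 1.0])                  # component standard deviations

for _ in range(100):
    # E-step: responsibilities r[i, k] = Pr(Z = k | x_i, theta_{m-1})
    dens = np.stack([p * norm.pdf(x, m, s) for p, m, s in zip(pi, mu, sigma)], axis=1)
    r = dens / dens.sum(axis=1, keepdims=True)
    # M-step: argmax_theta Q(theta, theta_{m-1}); closed-form updates for a GMM
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(pi, mu, sigma)   # should approach the generating parameters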

Assignment 12
Introduction to Machine Learning
Prof. B. Ravindran
1. You have been recruited as a lead engineer by ArrEll corporation which wants to enter the
self-driving car market. In the context of the standard Reinforcement Learning framework,
what would you classify as the state and actions? Note that your system does not have
access to previous states.
(a) State-(Current steering wheel position, Current pedal positions, Current speed) Ac-
tions-(Turn the wheel, Press pedals)
(b) State-(Current steering wheel position, Current pedal positions, Current acceleration)
Actions-(Turn the wheel, Press pedals)
(c) State-(Current steering wheel position, Current pedal positions, Current speed) Ac-
tions-(Change direction, Change speed)
(d) State-(Current steering wheel position, Current pedal positions, Current acceleration)
Actions-(Change direction, Change speed)
Sol. (a)
2. After completing Introduction to Machine Learning on NPTEL, you have landed a job as
a Data Scientist at YumEll Solutions Inc. Your first assignment as a trainee is to learn a
classifier given some data and present insights on it to your manager, who apparently doesn’t
seem to have any knowledge on Machine Learning. Which of the following classification models
would you pick to best explain the nature of the data and the underlying distribution to your
manager?
(a) Linear Models
(b) Support Vector Machines
(c) Decision Trees
(d) Artificial Neural Networks
Sol. (c)

3. What happens when your model complexity (such as interaction terms in linear regression,
order of polynomial in SVM, etc.) increases?
(a) Model Bias increases
(b) Model Bias decreases
(c) Variance of the model increases
(d) Variance of the model decreases
Sol. (b) and (c)
4. (2 marks) In the context of Reinforcement Learning algorithms, which of the following defini-
tions constitutes a valid Markov State? (multiple options may be correct)

(a) For Chess: Positions of yours and the opponent’s remaining pieces
(b) For Tic-Tac-Toe: A snapshot of the game board (all Xs, Os and empty spaces)
(c) For Chess: Positions of your pieces and the identities of the opponent's defeated pieces.
(d) For Tennis: Position of the ball
Sol. (a) (b)
This is answered by considering if a given state has all the information needed to make the
next move, or if previous states need to be accessed for more information.

5. (2 marks) Suppose we want an RL agent to learn to play the game of golf. For training purposes,
we make use of a golf simulator program. Assume that the original reward distribution gives
a reward of +10 when the golf ball is hit into the hole and -1 for all other transitions. To aid
the agent’s learning process, we propose to give an additional reward of +3 whenever the ball
is within a 1 metre radius of the hole. Is this additional reward a good idea or not? Why?
(a) Yes. The additional reward will help speed-up learning.
(b) Yes. Getting the ball to within a metre of the hole is like a sub-goal and hence, should
be rewarded.
(c) No. The additional reward may actually hinder learning.
(d) No. It violates the idea that a goal must be outside the agent’s direct control.
Sol. (c)
In this specific case, the additional reward will be detrimental to the learning process, as the
agent will learn to accumulate rewards by keeping the ball within the 1 metre radius circle and
not actually hitting the ball in the hole.
6. (2 marks) You want to toss a fair coin a number of times and obtain the probability of getting
heads by taking a simple average. What is the estimated number of times you’ll have to toss
the coin to make sure that your estimated probability is within 10% of the actual probability,
at least 90% of the time?
(a) 400*ln(20)
(b) 800*ln(20)
(c) 200*ln(20)

Sol. (c)
Since you're given that the coin is fair, p = 0.5 is the mean of your tosses if tossed infinitely. Hence, the margin for error is 0.05. Now, using Chernoff-Hoeffding bounds, we can obtain the required number of trials as follows.

Pr(|X̄ − 0.5| ≥ 0.05) ≤ 2e^(−2·(0.05)²·n) ≤ 0.1
e^(−2·(0.05)²·n) ≤ 0.05
−2·(0.05)²·n ≤ ln(1/20)
n ≥ ln(20) / (2·(0.05)²)
n ≥ (400/2) ln(20)
⟹ n ≥ 200 ln(20) ≈ 599

7. You face a particularly challenging RL problem, where the reward distribution keeps changing
with time. In order to gain maximum reward in this scenario, does it make sense to stop
exploration or continue exploration?

(a) Stop exploration


(b) Continue exploration

Sol. (b)
Ideally, we would like to continue exploring, since this allows us to adapt to the changing
reward distribution.
