NPTEL ML Questions


1) Suppose we use a linear kernel SVM to build a classifier for a 2-class problem where the training data points are linearly separable. In general, will the classifier trained in this manner always be the same as the classifier trained using the perceptron training algorithm on the same training data?

No
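
The perceptron stops at whatever separating hyperplane it reaches first, which depends on initialisation and the order of updates, while the linear SVM returns the unique maximum-margin hyperplane, so the two generally differ. A minimal sketch of the comparison, assuming scikit-learn and made-up separable data:

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.svm import SVC

# A tiny linearly separable 2-class data set (illustrative values).
X = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

svm = SVC(kernel="linear", C=1e6).fit(X, y)  # hard-margin-like linear SVM
perc = Perceptron(max_iter=1000).fit(X, y)   # classic perceptron updates

# Both separate the data, but the hyperplanes generally differ:
# the SVM's is the unique maximum-margin one, the perceptron's is not.
print(svm.coef_, svm.intercept_)
print(perc.coef_, perc.intercept_)
```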

2) Consider the case where two classes follow Gaussian distributions centred at (−1, 2) and (1, 4), each with identity covariance matrix. Which of the following is the separating decision boundary using LDA?

x + y = 3 (assuming equal class priors: with identity covariances, the LDA boundary is the perpendicular bisector of the segment joining the means, passing through the midpoint (0, 3) with normal direction (1, 4) − (−1, 2) = (2, 2))
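
As a check, the closed-form boundary for two shared-covariance Gaussians (assuming equal class priors) can be computed directly; a small numpy sketch:

```python
import numpy as np

mu1, mu2 = np.array([-1.0, 2.0]), np.array([1.0, 4.0])
sigma_inv = np.linalg.inv(np.eye(2))  # identity covariance, shared by both classes

# With equal priors, the LDA boundary is {x : w.x = c}, where
# w = Sigma^{-1}(mu2 - mu1) and the boundary passes through the midpoint.
w = sigma_inv @ (mu2 - mu1)   # -> [2. 2.]
c = 0.5 * w @ (mu1 + mu2)     # -> 6.0
print(w, c)                   # 2x + 2y = 6, i.e. x + y = 3
```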

3) What is specified at any non-leaf node in a decision tree?

Test specification

4) Consider a modified k-NN method in which, once the k nearest neighbours to the query point are identified, you do a linear regression fit on them and output the fitted value for the query point. Which of the following is/are true regarding this method?

This method makes an assumption that the data is locally linear.

In order to perform well, this method would need dense, distributed training data.

This method has higher variance compared to k-NN.
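
A minimal sketch of this modified k-NN, assuming Euclidean distance and an ordinary least-squares fit on the k neighbours (the function and variable names are illustrative):

```python
import numpy as np

def knn_linear_predict(X_train, y_train, x_query, k=5):
    # Identify the k nearest neighbours of the query point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    idx = np.argsort(dists)[:k]
    # Fit a least-squares linear model (with intercept) to those k points only.
    A = np.hstack([X_train[idx], np.ones((k, 1))])
    coef, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)
    # Output the fitted value at the query point: a locally linear prediction.
    return np.append(x_query, 1.0) @ coef
```

The least-squares step is what introduces the locally linear assumption and the extra variance relative to plain k-NN averaging.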

5) How does LASSO differ from Ridge Regression?

LASSO uses L1 regularization while Ridge Regression uses L2 regularization

The lasso constraint region is a high-dimensional rhomboid, while the ridge regression constraint region is a high-dimensional ellipsoid.

Lasso can shrink coefficients exactly to 0, unlike ridge regression.
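
A quick way to see the practical difference (a sketch using scikit-learn with synthetic data): the L1 penalty drives many coefficients exactly to zero, while the L2 penalty only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
# Only the first two features actually matter.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print((lasso.coef_ == 0).sum())  # several coefficients are exactly 0
print((ridge.coef_ == 0).sum())  # typically 0: ridge shrinks but does not zero out
```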

6) Suppose we are trying to model a p-dimensional Gaussian distribution. What is the actual number of independent parameters that need to be estimated?

p(p+3)/2 (p parameters for the mean, plus p(p+1)/2 independent entries of the symmetric covariance matrix)

7) If the number of features is larger than the number of training data points, to identify a
suitable subset of the features for use with linear regression, we would prefer

Forward stepwise selection (backward stepwise selection would begin by fitting the full model, which cannot be done when the number of features exceeds the number of data points)

8) We have seen methods like ridge and lasso to reduce variance among the coefficients. We can use these methods to do feature selection as well. Which one of them is more appropriate?

Lasso

9) What assumption does the CURE clustering algorithm make with regards to the shape of the
clusters?

No assumption

10) Which method among bagging and stacking should be chosen in case of limited training data, and what is the appropriate reason for your preference?

Stacking, because each classifier is trained on all of the available data

11) Consider the following distribution of training data (the figure is not reproduced here). Which method would you choose for dimensionality reduction?

Linear Discriminant Analysis or Principal Component Analysis are equally good

12) For training a binary classification model with three independent variables, you choose to use neural networks. You apply one hidden layer with three neurons. What is the number of parameters to be estimated? (Consider the bias terms as parameters.)

21

13) Given N samples x1, x2, . . . , xN drawn independently from a Gaussian distribution with variance σ2 and unknown mean µ, find the MLE of the mean.

µ̂ = (x1 + x2 + · · · + xN)/N, the sample mean

14) In the context of Reinforcement Learning algorithms, which of the following definitions
constitutes a valid Markov State?

For Chess: Positions of your and the opponent’s remaining pieces

For Tic-Tac-Toe: A snapshot of the game board (all Xs, Os and empty spaces)

15) Given below are some properties of different classification algorithms. In which among the following would you expect feature normalisation to be useful?

uses a measure of distance between points
uses ridge regression
attempts to identify the maximum-margin hyperplane
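
All three selected properties depend on geometry in feature space, so unnormalised scales distort them. A small illustration (numpy/scikit-learn, made-up numbers):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Feature 2 lives on a much larger scale than feature 1, so raw
# Euclidean distances are dominated by feature 2 alone.
X = np.array([[0.0, 1000.0],
              [1.0, 1010.0],
              [0.1, 1500.0]])

print(np.linalg.norm(X[0] - X[1]))  # ~10, even though feature 1 differs a lot
print(np.linalg.norm(X[0] - X[2]))  # ~500, even though feature 1 is nearly equal

# Standardising each feature to zero mean and unit variance removes the distortion.
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[0] - X_scaled[1]))
```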

16)
17)

18) In Gaussian Mixture Models, πi are the mixing coefficients. Select the correct conditions that the mixing coefficients need to satisfy for a valid GMM model.

0 ≤ πi ≤ 1, ∀i
Σi πi = 1
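
These constraints make the πi a valid probability distribution over the mixture components, which is easy to check empirically (a sketch, assuming scikit-learn's GaussianMixture and synthetic data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data from two well-separated blobs.
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)        # fitted mixing coefficients, each in [0, 1]
print(gmm.weights_.sum())  # sums to 1.0
```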

19) Which of the following properties are true in the context of decision trees?
High variance
Lack of smoothness of prediction surfaces
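
The lack of smoothness is easy to demonstrate: a regression tree's prediction surface is piecewise constant. A sketch with scikit-learn:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.linspace(0, 2 * np.pi, 50).reshape(-1, 1)
y = np.sin(X).ravel()

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)
# A depth-3 tree has at most 8 leaves, so predictions take at most
# 8 distinct values: the fitted curve is a staircase, not smooth.
print(np.unique(tree.predict(X)).size)
```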

20)

21)

22) Which of the following graphical models captures the Naive Bayes assumption, where c represents the class label and fi are the features? (The answer options are diagrams that are not reproduced here.)

Solution: A

23) Based on a survey, it was found that the probability that a student likes to play football is 0.25 and the probability that a student likes to play cricket is 0.43. It was also found that the probability that a student likes to play both football and cricket is 0.12. What is the probability that a student does not like to play either?

0.44 (by inclusion-exclusion: 1 − (0.25 + 0.43 − 0.12) = 1 − 0.56 = 0.44)

24) For the ROC curve of True positive rate vs False positive rate, which of the following are
true?

The curve may or may not be concave

25) Which of the following are true about bias and variance of overfitted and underfitted
models?

Underfitted models have high bias.

Overfitted models have high variance.

26) Considering the AdaBoost algorithm, which among the following statements are false?

In each stage, we try to train a classifier which makes accurate predictions on any subset of the data points, where the subset size is at least half the size of the data set.

The weight assigned to an individual classifier depends upon the number of data points correctly classified by the classifier. (In fact, the weight depends on the classifier’s weighted error rate ε, via α = ½ ln((1 − ε)/ε), not on the raw count of correct classifications.)

27) (The question refers to a Bayesian network figure that is not reproduced here.)

d is independent of b when c is known

a is independent of b when c is known

28) Consider the Bayesian network given in the previous question. Let ‘A’, ‘B’, ‘C’, ‘D’ and ‘E’ denote the random variables shown in the network. Which of the following can be inferred from the network structure?

none of the above can be inferred

29) Consider the following one-dimensional data set: 12, 22, 2, 3, 33, 27, 5, 16, 6, 31, 20, 37, 8 and 18. Given k = 3 and initial cluster centres of 5, 6 and 31, what are the final cluster centres obtained on applying the k-means algorithm?

4.8, 17.6, 32
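
This can be verified with a few lines of numpy (a sketch of 1-D Lloyd's algorithm with the given data and initial centres):

```python
import numpy as np

data = np.array([12, 22, 2, 3, 33, 27, 5, 16, 6, 31, 20, 37, 8, 18], dtype=float)
centres = np.array([5.0, 6.0, 31.0])  # initial centres from the question

while True:
    # Assignment step: each point goes to its nearest centre.
    labels = np.argmin(np.abs(data[:, None] - centres[None, :]), axis=1)
    # Update step: each centre moves to the mean of its assigned points.
    new_centres = np.array([data[labels == j].mean() for j in range(3)])
    if np.allclose(new_centres, centres):
        break
    centres = new_centres

print(centres)  # [ 4.8  17.6  32. ]
```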

30) For the previous question, in how many iterations will the k-means algorithm converge?

31)

0.098

32)

None of these

33)

0.006144

34) Using the data given in the previous question, compute the probability of the following assignment, P(i = 1, g = 1, s = 1, l = 0), irrespective of the difficulty of the course. (Up to 3 decimal places.)

0.047

35)
36) Does there exist a more compact factorization involving a smaller number of factors for the distribution given in the previous question?

No

37) Considering ‘profitable’ as the binary-valued attribute we are trying to predict, which of the attributes would you select as the root in a decision tree with multi-way splits using the information gain measure?

capacity

38) For the same data set, suppose we decide to construct a decision tree using binary splits and
the Gini index impurity measure. Which among the following feature and split point
combinations would be the best to use as the root node assuming that we consider each of
the input features to be unordered?

maintenance - {high}|{med, low}
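
The data set itself is not reproduced here, but as a sketch, a candidate binary split such as maintenance - {high}|{med, low} is scored by the weighted Gini impurity of the two branches:

```python
from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class proportions.
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def binary_split_gini(feature_values, labels, left_set):
    # Weighted Gini impurity of a binary split, e.g. left_set = {"high"}.
    left = [l for f, l in zip(feature_values, labels) if f in left_set]
    right = [l for f, l in zip(feature_values, labels) if f not in left_set]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```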

39) In the above data set, what is the value of the cross entropy when we consider capacity as the attribute to split on (multi-way splits)? (You may round the cross entropy value to 4 decimal places.)

0.8382
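
Without the data set the 0.8382 value cannot be re-derived here, but the quantity being computed is the weighted average entropy of the children after a multi-way split; a sketch:

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def multiway_split_entropy(feature_values, labels):
    # Weighted average entropy of the subsets produced by splitting
    # on every distinct value of the chosen attribute.
    n = len(labels)
    total = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        total += len(subset) / n * entropy(subset)
    return total
```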

40)

0.4615
