Machine Learning MCQs
6.
Identify whether true or false: In PCA, the number of
principal components is equal to the number of input dimensions.
True
False
Answer - A) True. In PCA, the number of principal components
computed is equal to the number of input dimensions.
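A quick way to see this in code (a minimal sketch using NumPy and scikit-learn; the data values are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: 6 samples, 3 input dimensions (values are illustrative only).
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.1],
              [2.2, 2.9, 0.4],
              [1.9, 2.2, 0.6],
              [3.1, 3.0, 0.3],
              [2.3, 2.7, 0.5]])

pca = PCA()   # no n_components requested
pca.fit(X)

# With more samples than features, PCA derives as many principal
# components as there are input dimensions (3 here).
print(pca.n_components_)       # -> 3
print(pca.components_.shape)   # -> (3, 3)
```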
7.
Among the following, identify what dimensionality
reduction reduces.
Performance
Answer - D) Dimensionality reduction reduces collinearity.
8.
Which of the following machine learning algorithms is based
upon the idea of bagging?
Decision tree
Random forest
Classification
Regression
Answer - B) Random forest is based upon the idea of bagging.
13.
The father of machine learning is _____________
Geoffrey Everest Hinton
Geoffrey Hill
Geoffrey Chaucer
None of the above
Answer - A) Geoffrey Everest Hinton is widely regarded as the father of machine learning.
C. both A and B
Answer: Option C
Which of the following is a characteristic of the best machine
learning method?
A. scalable
B. accuracy
C. fast
D. all of the above
Answer: Option D
C. null
D. accuracy
Answer: Option A
Application of machine learning methods to large
databases is called
A. data mining
B. artificial intelligence
C. big data computing
D. internet of things
Answer: Option A
Which of the following is not Machine Learning?
A. artificial intelligence
C. both a and b
Answer: Option A
If a machine learning model's output does not involve the
target variable, then that model is called a
A. descriptive model
B. predictive model
C. reinforcement learning
D. both A and B
Answer: Option A
11.
What characterizes unlabeled examples in machine
learning?
A. there is no prior knowledge
Answer: Option A
12.
What characterizes a hyperplane in the geometrical model of
machine learning?
A. a plane with one dimension fewer than the number of input
attributes
Answer: Option A
13.
Imagine a newborn that starts to learn walking. It will try to
find a suitable policy to learn walking after repeatedly
falling and getting up. Specify what type of machine
learning is best suited.
A. classification
B. regression
C. k-means algorithm
D. reinforcement learning
Answer: Option D
14.
What are the popular algorithms of Machine Learning?
A. decision trees and neural networks (back propagation)
D. all of the above
Answer: Option D
A machine learning problem involves four attributes plus
a class. The attributes have 3, 2, 2, and 2 possible values
each. The class has 3 possible values. How many maximum
possible different examples are there?
A. 12
B. 24
C. 48
D. 72
Answer: Option D
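A quick worked count behind this answer: each attribute combination can pair with each class value, so the maximum number of distinct examples is 3 x 2 x 2 x 2 x 3 = 72.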
In machine learning, an algorithm (or learning algorithm)
is said to be unstable if a small change in the training data
causes a large change in the learned classifier. True or
False: Bagging of unstable classifiers is a good idea.
A. TRUE
B. FALSE
Answer: Option A
Which of the following is a characteristic of the best machine
learning method?
A. fast
B. accuracy
C. scalable
D. all of the above
Answer: Option D
Machine learning techniques differ from statistical
techniques in that machine learning methods
A. do not typically assume an underlying distribution for the
data.
Answer: Option A
What is Model Selection in Machine Learning?
A. The process of selecting models among different
mathematical models, which are used to describe the same
data set
B. Interference
C. Accuracy
D. None of the above
Answer: Option A
21.
The average squared difference between the classifier's
predicted output and the actual output is called:
A. mean squared error
B. Maximum Likelihood
C. Logarithmic Loss
D. Both A and B
Answer: Option A
23.
The following are descriptive models:
A. clustering
B. classification
C. association rule
D. both a and c
Answer: Option D
24.
Assume that you are given a data set and a neural network
model trained on the data set. You are asked to build a
decision tree model with the sole purpose of
understanding/interpreting the built neural network
model. In such a scenario, which among the following
measures would you concentrate most on optimising?
A. accuracy of the decision tree model on the given data
set
B. greedy algorithms
C. all above
D. none of these
B. ii and iii
C. iii and iv
D. none of these
Answer: Option C
27.
Which of the following can only be used when the training
data are linearly separable?
A. linear hard-margin SVM
Answer: Option A
29.
Given that we can select the same feature multiple times
during the recursive partitioning of the input space, is it
always possible to achieve 100% accuracy on the training
data (given that we allow for trees to grow to their
maximum size) when building decision trees?
A. Yes
B. No
Answer: Option B
30.
In many classification problems, the target dataset is made
up of categorical labels which cannot immediately be
processed by any algorithm. An encoding is needed and
scikit-learn offers at least ________ valid options
A. 1
B. 2
C. 3
D. 4
Answer: Option B
A. facts.
B. concepts.
C. procedures.
D. principles.
Answer: Option A
32.
B. you always get the same clusters whether or not you use
feature scaling
Answer: Option A
33.
C. Both A and B
Answer: Option C
34.
A. data
B. knowledge
C. rules
D. model
Answer: Option B
35.
A. supervised
B. unsupervised
C. semi-supervised
D. can't say
Answer: Option A
36.
A. 0.0398
B. 0.0389
C. 0.0368
D. 0.0396
Answer: Option D
37.
Answer: Option B
38.
D. none of these
Answer: Option A
39.
A. Supervised
B. Semi-supervised
C. Reinforcement
D. Clusters
Answer: Option B
40.
The binarize parameter in scikit-learn's BernoulliNB sets the
threshold for binarizing sample features.
A. TRUE
B. FALSE
Answer: Option A
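A minimal scikit-learn sketch of this behaviour (the data and the threshold value are made up for illustration):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Continuous features; binarize=2.5 maps each feature value > 2.5 to 1
# and every other value to 0 before the Bernoulli model is fitted.
X = np.array([[1.0, 3.0],
              [4.0, 0.5],
              [2.0, 2.0],
              [5.0, 5.0]])
y = np.array([0, 1, 0, 1])

clf = BernoulliNB(binarize=2.5)
clf.fit(X, y)
print(clf.predict([[0.1, 4.2]]))
```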
41.
C. not possible
D. none of these
Answer: Option A
42.
The ________ of the hyperplane depends upon the
number of features.
A. dimension
B. classification
C. reduction
Answer: Option A
43.
A. greedy
B. top down
C. procedural
D. step by step
Answer: Option A
44.
A. Yes
B. No
Answer: Option A
45.
A. 0.05
B. 0.06
C. 0.07
D. 0.09
Answer: Option B
46.
A. TRUE
B. FALSE
Answer: Option A
47.
A. training data
B. validation data
C. test data
D. hidden data
A. linear regression
B. logistic regression
C. simple regression
Answer: Option C
51.
A. TRUE
B. FALSE
Answer: Option B
52.
B. FALSE
Answer: Option A
53.
A. TRUE
B. FALSE
A. a and b
B. b and c
C. a and c
Answer: Option A
56.
D. none of these
B. linear, binary
C. nonlinear, numeric
D. nonlinear, binary
A. supervised learning
B. unsupervised learning
C. semisupervised learning
D. reinforcement learning
Answer: Option A
59.
C. both a and b
A. linear functions
B. nonlinear functions
C. discrete functions
D. exponential functions
A. regression
B. classification
C. random_state
D. missing_values
Answer: Option D
62.
A. branch of bank
B. expenditure in rupees
C. prize of house
D. weight of a person
A. an omnibus test
Answer: Option C
Linear Regression is a supervised machine learning
algorithm.
A. TRUE
B. FALSE
Answer: Option A
Which of the following is a good test dataset
characteristic?
C. both A and B
Answer: Option C.
Features being classified are ________ of each other in a
Naïve Bayes classifier.
A. independent
B. dependent
C. partial dependent
D. none
Answer: Option A
The Gini index is not biased towards multivalued
attributes.
A. TRUE
B. FALSE
Answer: Option B
This unsupervised clustering algorithm terminates
when mean values computed for the current iteration of
the algorithm are identical to the computed mean
values for the previous iteration.
A. agglomerative clustering
B. conceptual clustering
C. k-means clustering
D. expectation maximization
Answer: Option C
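A minimal NumPy sketch of that stopping rule (toy data and random initialisation; this is an illustration, not the scikit-learn implementation):

```python
import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)]  # initial means
    while True:
        # Assign every point to its nearest mean.
        labels = np.argmin(((X[:, None, :] - means[None, :, :]) ** 2).sum(-1), axis=1)
        new_means = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Terminate when this iteration's means are identical to the previous ones.
        if np.allclose(new_means, means):
            return new_means, labels
        means = new_means

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
print(kmeans(X, 2)[0])
```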
A. Matrix
B. Vector
C. Array
D. List
Answer: Option B
This set of Machine Learning Multiple Choice Questions &
Answers (MCQs) focuses on “Statistical Learning
Framework”.
1. How are the points in the domain set given as input to the
algorithm?
a) Vector of features
b) Scalar points
c) Polynomials
d) Clusters
Answer: a
Explanation: The variables are converted into a vector of
features, and then given as an input to the algorithm. The
vector is of the size (number of features x number of training
examples). The output of the learner is usually given as a
polynomial.
2. To which input does the learner has access to?
a) Testing Data
b) Label Data
c) Training Data
d) Cross-Validation Data
Answer: c
Explanation: The learner gets access to a particular set of
data on which it trains. This data is called as training data.
Testing Data is used for testing of the learner’s outputs. The
best outputs are then used on the cross-validation data. The
label data is a representation of different types of the
dependent variables.
3. The set which represents the different instances of the
target variable is known as ______
a) domain set
b) training set
c) label set
d) test set
Answer: c
Explanation: Label Set denotes all the possible forms the
target variable can take (for e.g. {0,1} or {yes, no} in a
logistic regression problem). Domain Set represents the
vector of features, given as input to the learner. Training Set
and Test Set are parts of the Domain Set which are used for
training and testing respectively.
4. What is the learner’s output also called?
a) Predictor, or Hypothesis, or Classifier
b) Predictor, or Hypothesis, or Trainer
c) Predictor, or Trainer, or Classifier
d) Trainer, or Hypothesis, or Classifier
Answer: a
Explanation: The output is called a predictor when it is used
to predict the type or the numerical value of the target
variable. It is called a hypothesis when it is a general
statement about the data set. It is called a classifier when it is
used to classify the training set in two or more types.
5. It is assumed that the learner has prior knowledge about
the probability distribution which generates the instances in
a training set.
a) True
b) False
Answer: b
Explanation: The learner has no prior knowledge about the
distribution. It is assumed that the distribution is completely
arbitrary. It is also assumed that there is a function which
“correctly” labels the training examples. The learner’s job is
to find out this function.
6. The labeling function is known to the learner in the
beginning.
a) True
b) False
Answer: b
Explanation: The function is unknown to the learner as this
is what the learner is trying to find out. In the beginning, the
learner just knows about the training set and the
corresponding label set.
7. The papaya learning algorithm is based on a dataset that
consists of three variables – color, softness, tastiness of the
papaya. Which is more likely to be the target variable?
a) Tastiness
b) Softness
c) Papaya
d) Color
Answer: a
Explanation: The tastiness is dependent on how ripe the
papaya is. The ripeness is determined by the color and
softness. Hence color and softness are the independent
variables and the tastiness is the dependent variable or
target variable.
8. The error of classifier is measured with respect to
_________
a) variance of data instances
b) labeling function
c) probability distribution
d) probability distribution and labeling function
Answer: d
Explanation: The error is the probability of choosing a
random instance from the data set and then misclassifying it
using the labeling function.
9. What is not accessible to the learner?
a) Training Set
b) Label Set
c) Labeling Function
d) Domain Set
Answer: c
Explanation: The learner has access to the domain set, from
which it extracts the training set. The label set is also given.
Then the algorithm is applied to the training set to teach the
learner, a function to determine the correct label of a given
instance. This is the labeling function.
10. What are the possible values of A, B, and C in the
following diagram?
a) Polynomial regression
b) Univariate linear regression
c) Logistic regression
d) Multivariate linear regression
Answer: a
Explanation: The expression has only one feature x, so it is
not a multivariate linear regression. There is more than one
term containing a feature, so it is also not a univariate linear
regression. The features are expressed as a polynomial, so it
is a polynomial regression.
5. h(x) = t0 + t1*x + t2*x^2, where t0 = t1 = t2 = 1 and x is the size of the
house. What is the minimum value of h(x)?
a) -1
b) 0
c) 0 or -1
d) 1
Answer: d
Explanation: h(x) = t0 + t1*x + t2*x^2
= 1 + x + x^2
Since x (a house size) cannot be negative, h(x) is minimized at x = 0,
where the minimum value of h(x) is 1.
6. h(x) = t0 + t1*x + t2*x^2, where t0 = 0 and t1 = t2 = 1, and x is the size of the
house. For what value of x is h(x) minimum?
a) -1
b) 0
c) 0 or -1
d) 1
Answer: b
Explanation: h(x) = t0 + t1*x + t2*x^2
= x + x^2
h(x) will be minimum when the expression (x + x^2) is minimum,
i.e. 0 (the size of a house cannot be negative):
x + x^2 = 0
or, x(x + 1) = 0
Since x cannot be negative, the value of x is 0.
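A throwaway NumPy check of these two answers, restricting x to non-negative values since it represents the size of a house:

```python
import numpy as np

x = np.linspace(0, 10, 1001)   # house size cannot be negative

h1 = 1 + x + x**2              # t0 = t1 = t2 = 1
h2 = x + x**2                  # t0 = 0, t1 = t2 = 1

print(x[np.argmin(h1)], h1.min())   # -> 0.0 1.0  (minimum value 1, at x = 0)
print(x[np.argmin(h2)], h2.min())   # -> 0.0 0.0  (minimum value 0, at x = 0)
```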
7. There are two features. One is of higher priority. What
can be done to improve the hypothesis?
a) Increase the power to which the feature with higher
priority is raised
b) Remove the feature with lower priority
c) Depends on the dataset
d) Nothing can be done
Answer: a
Explanation: One of the advantages of polynomial
regression is that of handling features with a different
priority. If a feature with higher priority is encountered, its
power can be raised to give it higher priority in the
hypothesis.
8. A drawback of Polynomial Regression is handling of
features with a different priority.
a) True
b) False
Answer: b
Explanation: Polynomial Regression can handle features
with varying priority very well. One of its drawbacks is that
it is sensitive to outliers. Overfitting may or may not occur.
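As a rough illustration of fitting a polynomial hypothesis (a scikit-learn sketch with synthetic data; the degree and coefficients are arbitrary choices):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic single-feature data with a quadratic trend plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=(40, 1))
y = 2 + 0.5 * x[:, 0] + 1.5 * x[:, 0] ** 2 + rng.normal(0, 0.3, size=40)

# PolynomialFeatures raises the feature to higher powers (here degree 2),
# and LinearRegression then fits a weight for each power.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(x, y)
print(model.predict([[2.0]]))   # prediction from the degree-2 fit
```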
1. What kind of algorithm is logistic regression?
a) Cost function minimization
b) Ranking
c) Regression
d) Classification
Answer: d
Explanation: Logistic regression is a classification problem.
The target variable is categorical (specific few options).
Logistic regression outputs in yes or no / true or false / 0 or 1
and so on.
2. Can a cancer detection problem be solved by logistic
regression?
a) Sometimes
b) No
c) Yes
d) Depends on the dataset
Answer: c
Explanation: If the target is to detect cancer, logistic
regression can always be used. Logistic regression algorithm
will output if the patient has cancer or not, depending on the
symptoms and training examples.
3. In a logistic regression problem, there are 300 instances.
270 people voted. 30 people did not cast their votes. What is
the probability of finding a person who cast one’s vote?
a) 10%
b) 90%
c) 0.9
d) 0.1
Answer: c
Explanation: 270 out of 300 people voted. Hence, the
probability of finding a person who cast his/her vote is
270/300 or 9/10, i.e. 0.9. Since a probability is expressed as a value
between 0 and 1, it is written as 0.9 rather than 90%.
4. In a logistic regression problem, what is a possible output
for a new instance?
a) 0.85
b) -0.19
c) 1.20
d) 89%
Answer: a
Explanation: The output in a logistic regression problem is
calculated by a probability function. Thus, the output can
only be between 0 and 1. It cannot be negative, or greater
than 1. It is not expressed in a percentage.
5. The output in a logistic regression problem is yes
(equivalent to 1 or true). What is its possible value?
a) Greater than 0.5
b) Depends on the algorithm’s threshold value
c) Greater than 0.6
d) Equal to 1
Answer: b
Explanation: If the output is true, the probability of the
instance to be true is greater than the threshold value. Now,
for different datasets, the threshold value can be different. It
can be 0.5, it can also be 0.6. It is dependent on the
algorithm.
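A small scikit-learn sketch of both ideas, the probability output lying in [0, 1] and the threshold being a modelling choice (toy data, illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)

p = clf.predict_proba([[2.4]])[0, 1]   # probability of class 1, always in [0, 1]
threshold = 0.6                        # the threshold is chosen per problem, not fixed at 0.5
print(p, int(p > threshold))
```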
6. Who invented logistic regression?
a) Vapnik
b) Ross Quinlan
c) DR Cox
d) Chervonenkis
Answer: c
Explanation: Statistician DR Cox invented Logistic
Regression in 1958. Ross Quinlan is the founder of the
machine learning model decision tree. Vapnik and
Chervonenkis introduced the idea of VC dimension.
7. An artificially intelligent car knows if to brake or not
based on its distance from the car in front of it. Logistic
regression algorithm is used.
a) True
b) False
Answer: a
Explanation: The output is given as yes or no, based on the
distance from the car in front of it. It is thus a classification
problem. Hence, the logistic regression algorithm can be
used to determine whether to stop or not.
8. An artificially intelligent car decreases its speed based on
its distance from the car in front of it. Which algorithm is
used?
a) Decision Tree
b) Naïve-Bayes
c) Logistic Regression
d) Linear Regression
Answer: d
Explanation: The output is numerical. It determines the
speed of the car. Hence it is not a classification problem. All
the three, decision tree, naïve-Bayes, and logistic regression
are classification algorithms. Linear regression, on the other
hand, outputs numerical values based on input. So, this can
be used.
9. In a logistic regression problem an instance is similar to 60
positive instances, 20 negative instances, dissimilar to 30
positive instances, 90 negative instances. What kind of an
instance is this?
a) Negative instance
b) Positive instance
c) Cannot be determined, even if the threshold is given
d) Can be determined, if the threshold is given
Answer: c
Explanation: Similarity or dissimilarity does not determine
the output of logistic regression. The output is completely
dependent on the independent variables and their values. So,
the output cannot be determined even if the threshold is
given.
10. When was logistic regression invented?
a) 1968
b) 1958
c) 1948
d) 1988
Answer: b
Explanation: Logistic regression was invented by statistician
DR Cox in the year 1958. It was introduced even before the
invention of machine learning. It was introduced as a part of
the direct probability model.
1. What function is used for hypothesis representation in
logistic regression?
a) Cos function
b) Laplace transformation
c) Lagrange’s function
d) Sigmoid function
Answer: d
Explanation: In logistic regression, the output is based on a
probability and thus must be within the range of 0 and 1.
The sigmoid function is used for models whose output is
given as a probability i.e. the range lies between 0 and 1. So,
the sigmoid function is used in hypothesis representation.
2. The value of a sigmoid function is 1.5.
a) True
b) False
Answer: b
Explanation: Sigmoid function can be used for machine
learning models where output is based on the prediction of a
probability. The function only exists between 0 and 1. Thus
its value can never be 1.5.
3. How is the hypothesis represented? (The transpose of t is written t^T.)
a) h(X) = t0 + t1*x1
b) h(X) = 1/(1 + e^(t^T x))
c) h(X) = e^(-t^T x)/(1 + e^(-t^T x))
d) h(X) = 1/(1 + e^(-t^T x))
Answer: d
Explanation: The hypothesis is a function of the term t^T x.
Since its value should be between 0 and 1, the sigmoid function
is used. The sigmoid function is given by g(a) = 1/(1 + e^(-a)).
h(x) = g(t^T x)
=> h(X) = 1/(1 + e^(-t^T x)).
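A direct translation of that hypothesis into NumPy (the parameter vector and feature vector below are made-up values):

```python
import numpy as np

def hypothesis(t, x):
    # h(x) = g(t^T x) = 1 / (1 + e^(-t^T x))
    return 1.0 / (1.0 + np.exp(-np.dot(t, x)))

t = np.array([0.5, -1.2, 2.0])   # illustrative parameters
x = np.array([1.0, 0.3, 0.8])    # illustrative feature vector
print(hypothesis(t, x))          # always lies between 0 and 1
```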
4. Let g be the sigmoid function. Let a = 0. What is the value
of g(a)?
a) 1/2
b) 1/4
c) 1
d) 0
Answer: a
Explanation: The sigmoid function is given by g(x) = 1/(1 + e^(-x)).
a = 0
Hence, g(a) = 1/(1 + e^0)
= 1/(1 + 1)
= 1/2.
5. Probability of an event occurring is 1.2. What is odds
ratio?
a) 6:1
b) -6:1
c) Undefined
d) 1:2
Answer: c
Explanation: Probability p has to be within the range of 0 to
1, so p can never be 1.2. The odds ratio is calculated as the ratio of
p and (1-p). Since p can never be 1.2, the odds ratio calculation
is not possible, and the answer is undefined.
6. Probability of an event occurring is 0.9. What is odds
ratio?
a) 0.9:1
b) 9:1
c) 1:9
d) 1:0.9
Answer: b
Explanation: p = 0.9 i.e. 9/10, hence (1-p) = 1 – 9/10 = 1/10
Odds ratio = p/(1-p)
= (9/10)/(1/10) = 9:1.
7. What is the odds ratio?
a) p/(1-p)
b) p
c) 1-p
d) p*(1-p)
Answer: a
Explanation: p is the probability that event y occurs. Then
the probability of event y not occurring can be given as (1-p).
Odds ratio is given by the ratio of the probability of an event
occurring and the probability that an event is not occurring.
Thus, odds ratio is p/(1-p).
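A tiny helper that mirrors this definition (the probability values are the ones used in this and the surrounding questions):

```python
def odds_ratio(p):
    # odds = p / (1 - p); only defined for a probability p in [0, 1).
    if not 0 <= p < 1:
        raise ValueError("p must be a probability in [0, 1)")
    return p / (1 - p)

print(odds_ratio(0.9))   # 9.0   -> 9:1
print(odds_ratio(0.8))   # 4.0   -> 4:1
print(odds_ratio(0.2))   # 0.25  -> 1:4
```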
8. The output of logistic regression is always 0 or 1.
a) True
b) False
Answer: b
Explanation: The output of logistic regression is not always 0
or 1. It can be yes or no. It can be even true or false. The
output of binary logistic regression is always 0 or 1.
1. h(x) > 0.6 -> y = 1. What does the value 0.6 represent?
a) Cost function
b) Threshold value
c) Gradient descent
d) Sigmoid function
Answer: b
Explanation: In logistic regression, a particular value is
taken. If the value of the hypothesis is greater than this
value, the output y is considered to be true or 1. This value is
the threshold value. Here, 0.6 is the threshold value.
2. The value of a sigmoid function is the threshold value.
a) True
b) False
Answer: b
Explanation: Sigmoid function is used in machine learning
models to predict the probability of an event happening. The
value of the function varies for different instances in the
training set. The threshold value is fixed for a particular
dataset.
3. Threshold value is 0.5. h(x) = 0.7 for a particular instance.
What is the value of y?
a) 0
b) 0.3
c) 0.7
d) 1
Answer: d
Explanation: The decision boundary depends on the value of
threshold. If output of function h(x) is greater than the
threshold value, the output y is equal to 1. Here, h(x) = 0.7
and threshold value = 0.5. Since 0.7 > 0.5, y = 1.
4. Let g be the sigmoid function. Let a >= 0. What is the
value of g(a)?
a) g(a) >= 1/2
b) g(a) <= 0
c) g(a) <= 1/2
d) g(a) >= 0
Answer: a
Explanation: The sigmoid function is given by g(x) = 1/(1 + e^(-x)).
For a >= 0, e^(-a) <= 1
Hence, g(a) >= 1/(1 + e^0)
g(a) >= 1/(1 + 1)
g(a) >= 1/2.
5. Probability of an event occurring is 0.2. What is odds
ratio?
a) -4:1
b) 4:1
c) 1:4
d) 1:0.4
Answer: c
Explanation: p = 0.2 i.e. 2/10 i.e. 1/5, hence (1-p) = 1 – 1/5 =
4/5
Odds ratio = p/(1-p)
= (1/5)/(4/5) = 1:4.
6. Probability of an event occurring is 0.8. What is odds
ratio?
a) 0.8:1
b) 4:1
c) 1:4
d) 2:0.8
Answer: b
Explanation: p = 0.8 i.e. 8/10, hence (1-p) = 1 – 8/10 = 2/10
Odds ratio = p/(1-p)
= (8/10)/(2/10) = 4:1.
7. Let g be the sigmoid function. Let a = infinite. What is the
value of g(a)?
a) 1/2
b) -1
c) 1
d) 0
Answer: c
Explanation: The sigmoid function is given by g(x) = 1/(1 + e^(-x)).
As a tends to infinity, e^(-a) tends to 0
Hence, g(a) = 1/(1 + 0)
= 1.
8. The decision boundary is an important parameter in
logistic regression.
a) True
b) False
Answer: a
Explanation: In logistic regression, the decision boundary is
based on the threshold value. It separates the area where
output y = 0 and y = 1. Without the decision boundary, the
output cannot be calculated. Thus, it is very important.
9. Let g be the sigmoid function. Let a = -(infinite). What is
the value of g(a)?
a) -1/2
b) 1
c) 1/2
d) 0
Answer: d
Explanation: The sigmoid function is given by g(x) = 1/(1 + e^(-x)).
As a tends to -(infinity), e^(-a) tends to infinity
Hence, g(a) = 1/(1 + infinity)
= 0.
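A quick numerical check of these limit values (a sketch where a = +/-50 stands in for +/-infinity):

```python
import numpy as np

def sigmoid(a):
    # g(a) = 1 / (1 + e^(-a))
    return 1.0 / (1.0 + np.exp(-a))

for a in (0.0, 50.0, -50.0):
    print(a, sigmoid(a))   # -> 0.5, ~1.0, ~0.0
```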
10. Threshold value is 0.6. h(x) = 0.3 for a particular
instance. What is the value of y?
a) 0
b) 0.3
c) 0.7
d) 1
Answer: a
Explanation: The threshold value separates the positive
instances from the negative instances. If output of function
h(x) is lesser than the threshold value, the output y is equal
to 0. Here, h(x) = 0.3 and threshold value = 0.6. Since 0.3 <
0.6, y = 0.
1. The cost functions for logistic regression and linear
regression are the same.
a) True
b) False
Answer: b
Explanation: Logistic regression deals with classification
based problems or probability based, whereas linear
regression is more based on regression problems. Obviously,
the two cost functions are different.
2. h(x) = y. What is the cost (h(x), y)?
a) -infinite
b) infinite
c) 0
d) always h(x)
Answer: c
Explanation: The cost function is used to determine the
similarity between the two parameters. The more the
similarity, higher is the tendency of cost function
approaching zero. Since h(x) = y here, the cost function is 0.
3. What is the generalized cost function?
a) cost(h(x),y) = -y*log(h(x)) – (1 – y)*log(1-h(x))
b) cost(h(x),y) = – (1 – y)*log(1-h(x))
c) cost(h(x),y) = -y*log(h(x))
d) cost(h(x),y) = y*log(h(x)) + (1 – y)*log(1-h(x))
Answer: a
Explanation: cost(h(x),y) = -y*log(h(x)) when y = 1, and – (1
– y)*log(1-h(x)) when y = 0
Thus the generalized function cost(h(x),y) = -y*log(h(x)) – (1
– y)*log(1-h(x)) becomes
cost(h(x),y) = -y*log(h(x)) when y = 1 as (1 – y) is 0 and
becomes – (1 – y)*log(1-h(x)) when y = 0.
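That generalized cost, written out as a small function (h is the predicted probability, y the true label; the values below are illustrative):

```python
import numpy as np

def cost(h, y):
    # cost(h(x), y) = -y*log(h(x)) - (1 - y)*log(1 - h(x))
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

print(cost(0.9, 1))   # small cost: prediction close to the true label 1
print(cost(0.1, 1))   # large cost: prediction far from the true label 1
print(cost(0.5, 0))   # intermediate cost for a true label of 0
```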
4. Let m be the number of training instances. What is the
summation of cost function multiplied by to get the gradient
descent?
a) 1/m
b) m
c) 1 + m
d) 1 – m
Answer: a
Explanation: Since the summation is taken of all the cost
functions starting from training instance 1 to training
instance m, an average needs to be taken to get the actual
cost function. So, it is multiplied by 1/m.
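The same idea over m instances, with the 1/m averaging made explicit (a sketch; the arrays are made-up predictions and labels):

```python
import numpy as np

def average_cost(h, y):
    # J = (1/m) * sum over all m instances of the per-instance cost.
    m = len(y)
    return (1.0 / m) * np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))

h = np.array([0.9, 0.2, 0.8])   # predicted probabilities
y = np.array([1, 0, 1])         # true labels
print(average_cost(h, y))
```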
5. y = 1. How does cost(h(x), y) change with h(x)?
a) cost(h(x), y) = infinite when h(x) = 1
b) cost(h(x), y) = 0 when h(x) = 0
c) cost(h(x), y) = 0 when h(x) = 1
d) it is independent of h(x)
Answer: c
Explanation: Since, the actual output is 1, the calculated
output tending toward 1 will reduce the cost function. Thus
cost function is 0 when h(x) is 1 and it is infinite when h(x) =
0.
6. Who invented gradient descent?
a) Ross Quinlan
b) Leslie Valiant
c) Thomas Bayes
d) Augustin-Louis Cauchy
Answer: d
Explanation: Cauchy invented gradient descent in 1847.
Bayes invented Bayes’ theorem. Leslie Valiant introduced
the idea of PAC learning. Quinlan is the founder of the
machine learning algorithm Decision Trees.
7. When was gradient descent invented?
a) 1847
b) 1947
c) 1857
d) 1957
Answer: a
Explanation: Augustin-Louis Cauchy, a French
mathematician invented the concept of gradient descent in
1847. Since then, it has been modified a few times. Gradient
descent algorithm has a lot of different applications.
8. h(x) = 1, y = 0. What is the cost (h(x), y)?
a) -infinite
b) infinite
c) 0
d) always h(x)
Answer: b
Explanation: The cost function determines the similarity
between the actual output and the calculated output. The
lesser the similarity, the higher is the cost function. It is
maximum (infinite) when h(x) and y are the exact opposite.
1. Which is a better algorithm than gradient descent for
optimization?
a) Conjugate gradient
b) Cost Function
c) ERM rule
d) PAC Learning
Answer: a
Explanation: Conjugate gradient is an optimization
algorithm and it gives better results than gradient descent.
Cost function is used to calculate the average difference
between predicted output and actual output. ERM although
tries to lower the cost function, it often leads to overfitting.
2. Who invented BFGS?
a) Quinlan
b) Bayes
c) Broyden, Fletcher, Goldfarb and Shanno
d) Cauchy
Answer: c
Explanation: Broyden, Fletcher, Goldfarb and Shanno are
credited with the invention of the BFGS method. Quinlan
introduced the algorithm of Decision trees. Bayes invented
Naïve-Bayes algorithm. Cauchy is the founder of gradient
descent algorithm.
3. Ax = b => [4 2, 2 3][x1, x2] = [2, 2]. Let x0, the initial guess
be [1, 1]. What is the residual vector?
a) [4, -3]
b) [-4, 3]
c) [-4, -3]
d) [4, 3]
Answer: c
Explanation: Residual vector, r0 = b – Ax0
r0 = [2, 2] – [4 2, 2 3][1, 1]
= [2, 2] – [6, 5]
= [-4, -3].
4. Ax = b => [2 2, 3 3][x1, x2] = [1, 2]. Let x0, the initial guess
be [1, 1]. What is the residual vector?
a) [3, -4]
b) [-4, 3]
c) [-4, -3]
d) [-3, -4]
Answer: d
Explanation: Residual vector, r0 = b – Ax0
r0 = [1, 2] – [2 2, 3 3][1, 1]
= [1, 2] – [4, 6]
= [-3, -4].
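The residual computation from these two examples, done with NumPy (same matrices and initial guess as above):

```python
import numpy as np

def residual(A, b, x0):
    # r0 = b - A x0
    return b - A @ x0

x0 = np.array([1, 1])

A1, b1 = np.array([[4, 2], [2, 3]]), np.array([2, 2])
print(residual(A1, b1, x0))   # -> [-4 -3]

A2, b2 = np.array([[2, 2], [3, 3]]), np.array([1, 2])
print(residual(A2, b2, x0))   # -> [-3 -4]
```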
5. In the L-BFGS algorithm, what does the letter L stand
for?
a) Lengthy
b) Limited-memory
c) Linear
d) Logistic
Answer: b
Explanation: L-BFGS is an approximation of the Broyden-
Fletcher-Goldfarb-Shanno algorithm. It is used for cases
which are limited in memory. Like BFGS, this method also
works better than gradient descent.
6. Ax = b => [3 2, 2 3][x1, x2] = [8, 6]. Let x0, the initial guess
be [2, 1]. What is the residual vector?
a) [-1, 0]
b) [0, -1]
c) [1, 0]
d) [0, 1]
Answer: b
Explanation: Residual vector, r0 = b – Ax0
r0 = [8, 6] – [3 2, 2 3][2, 1]
= [8, 6] – [8, 7]
= [0, -1].
7. Who developed conjugate gradient method?
a) Hestenes and Stiefel
b) Broyden, Fletcher, Goldfarb and Shanno
c) Valiant
d) Vapnik and Chervonenkis
Answer: a
Explanation: Magnus Hestenes and Eduard Stiefel
introduced the conjugate gradient algorithm. It is used for
advanced optimization. Broyden, Fletcher, Goldfarb and
Shanno invented the BFGS algorithm. Leslie Valiant
introduced the idea of PAC Learning. Vapnik and
Chervonenkis were the founders of the VC dimension.
8. When was BFGS invented?
a) 1960
b) 1965
c) 1975
d) 1970
Answer: d
Explanation: Broyden, Fletcher, Goldfarb and Shanno are
credited with the invention of the BFGS method. It was
invented in the year, 1970. BFGS is an advanced
optimization technique. It is a better algorithm than
gradient descent.
1. The output is whether a person will vote or not, based on
several features. It is an example of multiclass classification.
a) True
b) False
Answer: b
Explanation: In multiclass classification, the output, y
should have more than two values (or classes). In this
example, the output can be only yes or no. Hence, it is not an
example of multiclass classification.
2. The output is whether a person will surely vote or surely
not vote or may cast a vote, based on one feature. It is an
example of multiclass classification.
a) True
b) False
Answer: a
Explanation: In multiclass classification, the output, y
should have more than two values (or classes). Here, there
are three classes – i) surely vote, ii) surely not vote, and iii)
may cast a vote. Thus, it is an example of multiclass
classification.
3. y = {0, 1, …, n}. This problem is divided into ______
binary classification problems.
a) n
b) 1/n
c) n + 1
d) 1/(n+1)
Answer: c
Explanation: The indexing starts at 0. So there are n + 1
output classes. Hence, to get the correct output, we need to
divide the problem into n + 1 classification problems with
binary outputs (0 or 1). If 1 is the output, the instance
belongs to that particular class.
4. y = {0, 1, …, 8}. This problem is divided into ______
binary classification problems.
a) 1/9
b) 9
c) 8
d) 1/8
Answer: b
Explanation: Since, indexing starts at 0, the number of
classes is 9 and they are 0, 1, 2, 3, 4, 5, 6, 7, and 8. To solve
this 9-class problem, we need to divide the problem 9 binary
classification problems.
5. y = {0, 1, 2, 3, 4, 5, 6, 8}. This problem is divided into
______ binary classification problems.
a) 9
b) 1/9
c) 8
d) 1/8
Answer: c
Explanation: In this example, there are 8 different classes
and not 9. 0 is one of the classes but there is no class 7. So,
here number of classes is 8. Thus, the problem is divided into
8 binary classification problems.
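One way to see this decomposition in practice is scikit-learn's one-vs-rest wrapper, which trains one binary classifier per class (toy 3-class data, illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X = np.array([[0.0], [0.2], [1.0], [1.2], [2.0], [2.2]])
y = np.array([0, 0, 1, 1, 2, 2])

ovr = OneVsRestClassifier(LogisticRegression()).fit(X, y)
print(len(ovr.estimators_))   # -> 3 binary classifiers, one per class
print(ovr.predict([[1.1]]))
```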
6. The outputs of an image recognition system is {0, 0, 1, 0}.
The classes are dog, cat, elephant, and lion. What is the
image of, according to our algorithm?
a) Dog
b) Cat
c) Elephant
d) Lion
Answer: c
Explanation: The output vector is a representative of the
probability of the image being a particular class. According
to the algorithm, the probability of image being a cat is zero,
dog is zero, elephant is one, lion is zero. Thus, the image is of
an elephant.
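In code, picking the class with the largest output is just an argmax over the output vector (a trivial sketch):

```python
import numpy as np

classes = ["dog", "cat", "elephant", "lion"]
output = np.array([0, 0, 1, 0])            # per-class outputs from the classifier

print(classes[int(np.argmax(output))])     # -> elephant
```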
7. Who invented logistic regression?
a) Valiant
b) Ross Quinlan
c) DR Cox
d) Bayes
Answer: c
Explanation: Statistician DR Cox invented Logistic
Regression in 1958. Ross Quinlan is the founder of the
machine learning model decision tree. Leslie Valiant
introduced PAC Learning. Bayes is known for Naïve-Bayes
algorithm.
8. When was logistic regression invented?
a) 1957
b) 1959
c) 1960
d) 1958
Answer: d
Explanation: Logistic regression was invented by statistician
DR Cox in the year 1958. It was introduced even before the
invention of machine learning. It was introduced as a part of
the direct probability model.
. Which of the following statements is false about Ensemble
learning?
a) It is a supervised learning algorithm
b) More random algorithms can be used to produce a
stronger ensemble
c) It is an unsupervised learning algorithm
d) Ensembles can be shown to have more flexibility in the
functions they can represent
Answer: c
Explanation: Ensemble learning is not an unsupervised
learning algorithm. It is a supervised learning algorithm that
combines several machine learning techniques into one
predictive model to decrease variance and bias. It can be
trained and then used to make predictions. And this
ensemble can be shown to have more flexibility in the
functions they can represent.
2. Ensemble learning is not combining learners that always
make similar decisions; the aim is to be able to find a set of
diverse learners.
a) True
b) False
Answer: a
Explanation: Ensemble learning aims to find a set of diverse
learners who differ in their decisions so that they
complement each other. There is no point in combining
learners that always make similar decisions.
3. Which of the following is not a multi – expert model
combination scheme to generate the final output?
a) Global approach
b) Local approach
c) Parallel approach
d) Serial approach
Answer: d
Explanation: Multi – expert combination methods have base
– learners that work in parallel. Global approach and Local
approach are the two subdivisions of this parallel approach.
Serial approach is a multi – stage combination method.
4. The global approach is also known as learner fusion.
a) False
b) True
Answer: b
Explanation: The global approach is also called learner
fusion: given an input, all base – learners generate an
output and all these outputs are combined by voting or
averaging. This represents integration (fusion) functions
where for each pattern, all the classifiers contribute to the
final decision.
5. Which of the following statements is true about multi –
stage combination methods?
a) The next base – learner is trained on only the instances
where the previous base – learners are not accurate enough
b) It is a selection approach
c) It has base – learners that work in parallel
d) The base – learners are sorted in decreasing complexity
Answer: a
Explanation: It is a serial approach; the next base – learner
is trained or tested on only the instances where the previous
base – learners are not accurate enough. A multi – stage
combination method is neither a parallel approach nor a
selection approach. The base – learners are sorted in
increasing complexity.
6. Which of the following is not an example of a multi –
expert combination method?
a) Voting
b) Stacking
c) Mixture of experts
d) Cascading
Answer: d
Explanation: Cascading is not a multi – expert combination
example and is a multi – stage combination method. It is
based on the concatenation of several classifiers, which use
all the information collected from the output from a given
classifier as additional information for the next classifier in
the cascade. Voting, stacking and a mixture of experts are
the example of multi – expert combination methods.
7. Which of the following statements is false about the base –
learners?
a) The base – learners are chosen for their accuracy
b) The base – learners are chosen for their simplicity
c) The base – learners has to be diverse
d) Base – learners do not require them to be very accurate
individually
Answer: a
Explanation: When we generate multiple base – learners, we
want them to be reasonably accurate but do not require
them to be very accurate individually. Hence the base –
learners are not chosen for their accuracy, but for their
simplicity. However, the base – learners have to be diverse.
8. Different algorithms make different assumptions about
the data and lead to different classifiers in generating
diverse learners.
a) True
b) False
Answer: a
Explanation: Different algorithms make different
assumptions about the data and lead to different classifiers.
For example one base – learner may be parametric and
another may be nonparametric. When we decide on a single
algorithm, we give importance to a single method and ignore
all others.
9. Ensembles tend to yield better results when there is a
significant diversity among the models.
a) False
b) True
Answer: b
Explanation: Ensembles tend to yield better results when
there is a significant diversity among the models. Many
ensemble methods, therefore, try to promote diversity
among the models they combine.
10. The partitioning of the training sample cannot be done
based on locality in the input space.
a) False
b) True
Answer: a
Explanation: The partitioning of the training sample can
also be done based on locality in the input space. So each
base – learner is trained on instances in a certain local part
of the input space. And it is done by a mixture of experts.
11. Which of the following is represented by the below
figure?
a) Stacking
b) Mixture of Experts
c) Bagging
d) Boosting
Answer: b
Explanation: The figure shows a Mixture of Experts. It is
based on the divide – and – conquer principle and mixture of
experts trains individual models to become experts in
different regions of the feature space. Then, a gating
network decides which combination of ensemble learners is
used to predict the final output of any instance.
12. Given the target value of a mixture of expert
combinations is 0.8. The predictions of three experts and the
probability of picking them are 0.6, 0.4, 0.5 and 0.8, 0.5, 0.7
respectively. Then what is the simple error for training?
a) 0.13
b) 0.15
c) 0.18
d) 0.2
Answer: c
Explanation: We know the simple error for training:
E = Σ_i p_i (d – y_i)^2, where d is the target value, p_i is the
probability of picking expert i, and y_i is the individual
prediction of expert i. Given d = 0.8, y1 = 0.6, y2 = 0.4, y3 =
0.5 and p1 = 0.8, p2 = 0.5, p3 = 0.7,
E = p1(d – y1)^2 + p2(d – y2)^2 + p3(d – y3)^2
= 0.8(0.8 – 0.6)^2 + 0.5(0.8 – 0.4)^2 + 0.7(0.8 – 0.5)^2
= 0.8(0.2)^2 + 0.5(0.4)^2 + 0.7(0.3)^2
= 0.8 * 0.04 + 0.5 * 0.16 + 0.7 * 0.09
= 0.032 + 0.08 + 0.063
= 0.175 ≈ 0.18
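The same computation as a few lines of Python (values taken from the question):

```python
d = 0.8                # target value
y = [0.6, 0.4, 0.5]    # individual expert predictions
p = [0.8, 0.5, 0.7]    # probability of picking each expert

# E = sum_i p_i * (d - y_i)^2
E = sum(p_i * (d - y_i) ** 2 for p_i, y_i in zip(p, y))
print(round(E, 3))     # -> 0.175, i.e. approximately 0.18
```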
13. The ABC company has released their Android app. And
80 people have rated the app on a scale of 5 stars. Out of the
total people 15 people rated it with 1 star, 20 people rated it
with 2 stars, 30 people rated it with 3 stars, 10 people rated it
with 4 stars and 5 people rated it with 5 stars. What will be
the final prediction if we take the average of individual
predictions?
a) 2
b) 3
c) 4
d) 5
Answer: b
Explanation: Given that we are taking the average of
individual predictions to make the final prediction.
Average = ∑ (Rating * Number of people) / Total number of
people
= ((1 * 15) + (2 * 20) + (3 * 30) + (4 * 10) + (5 * 5)) / 80
= (15 + 40 + 90 + 40 + 25) / 80
= 210 / 80
= 2.625
And the nearest integer is 3. So the final prediction will be 3.
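The same arithmetic in a couple of lines (counts taken from the question):

```python
ratings = {1: 15, 2: 20, 3: 30, 4: 10, 5: 5}   # star rating -> number of people

total_people = sum(ratings.values())                              # 80
average = sum(r * n for r, n in ratings.items()) / total_people   # 2.625
print(average, round(average))                                    # 2.625 3
```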
14. Consider there are 5 employees A, B, C, D, and E of ABC
company. Where people A, B and C are experienced, D and
E are fresher. They have rated the company app as given in
the table. What will be the final prediction if we are taking
the weighted average?
Employee Weight Rating
A 0.4 3
B 0.4 2
C 0.4 2
D 0.2 2
E 0.2 4
a) 2
b) 3
c) 4
d) 5
Answer: c
Explanation: We have,
Weighted average = ∑ (Weight * Rating)
= (0.4 * 3) + (0.4 * 2) + (0.4 * 2) + (0.2 * 2) + (0.2 * 4)
= 1.2 + 0.8 + 0.8 + 0.4 + 0.8
= 4
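The same calculation in code, following the formula used above (a plain weighted sum of the ratings):

```python
weights = [0.4, 0.4, 0.4, 0.2, 0.2]   # employee weights from the table
ratings = [3, 2, 2, 2, 4]             # each employee's rating

prediction = sum(w * r for w, r in zip(weights, ratings))
print(round(prediction, 2))            # -> 4.0
```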
1. Which of the following statements is false about Ensemble
voting?
a) It takes a linear combination of the learners
b) It takes non-linear combination of the learners
c) It is the simplest way to combine multiple classifiers
d) It is also known as ensembles and linear opinion pools
Answer: b
Explanation: Voting doesn’t take non-linear combination of
the learners. It is the simplest way to combine multiple
classifiers, which corresponds to taking a linear combination
of the learners (y_i = Σ_j w_j d_ji, where w_j ≥ 0, Σ_j w_j = 1, w_j is the
weight of learner j, and d_ji is the vote of learner j for class C_i).
So this is also known as ensembles and linear opinion pools.
2. In the simplest case of voting, all the learners are given
equal weight.
a) True
b) False
Answer: a
Explanation: In the simplest case, all learners are given
equal weight and here we have simple voting that
corresponds to taking an average. There are also other
combination rules and taking a weighted sum is only one of
such possibilities.
3. With the product rule, if one learner has an output of 0,
the overall output goes to zero.
a) True
b) False
Answer: a
Explanation: With the product rule (y_i = Π_j d_ji, where d_ji is the
vote of learner j for class C_i), each learner has veto power.
That is regardless of the other ones, if one learner has an
output of 0, the overall output goes to 0.
4. In plurality voting the winner is the class with maximum
number of votes.
a) True
b) False
Answer: a
Explanation: Plurality voting in classification is where the
class having the maximum number of votes is the winner. In
plurality voting a classification of an unlabelled instance is
performed according to the class that obtains the highest
number of votes. So in reality plurality voting is commonly
used to solve the multi class problems.
5. In majority voting, the winning class gets more than half
of the total votes.
a) True
b) False
Answer: a
Explanation: It is majority voting, when there are two
classes and the winning class gets more than half of the
votes. Here every model makes a prediction (votes) for each
test instance and the final output prediction (votes) is the one
that receives more than half of the votes.
6. Which of the following statements is true about the
combination rules?
a) Maximum rule is pessimistic
b) Sum rule takes the weighted sum of vote of each learner
for each class
c) Median rule is more robust to outliers
d) Minimum rule is optimistic
Answer: c
Explanation: Median rule is more robust to outliers. If you
throw away the largest and smallest values (predictions) in
the data set, then the median doesn’t change. The sum rule
takes the sum of vote of each learner for each class,
maximum rule is optimistic and minimum rule is pessimistic.
7. Hard voting is where the model is selected from an
ensemble to make the final prediction using simple majority
vote.
a) True
b) False
Answer: a
Explanation: In hard voting, a model is selected from an
ensemble by a simple majority vote to make the final
prediction for accuracy. Here every individual classifier
votes for a class, and the majority class wins. So it simply
aggregates the predictions of each classifier and predicts the
class that gets the most votes.
8. Borda count takes the rankings of the class supports into
consideration unlike the voting.
a) True
b) False
Answer: a
Explanation: Borda count can rank order the classifier
outputs. The classes can easily be rank ordered with respect
to the support they receive from the classifier. Where the
voting considers the support of the winning classes only and
ignores the support that non winning classes may receive.
9. Which of the following is a solution for the problem,
where the classifiers erroneously give unusual low or high
support to a particular class?
a) Maximum rule
b) Minimum rule
c) Product rule
d) Trimmed mean rule
Answer: d
Explanation: Trimmed rule can be used to avoid the damage
done by the unusual vote given by the classifiers. It discards
the decisions of those classifiers with the highest and lowest
support before calculating the mean. And the mean is
calculated on the remaining supports, avoiding the extreme
values of support.
10. The weighted average rule combines the mean and the
weighted majority voting rules.
a) True
b) False
Answer: a
Explanation: The weighted average rule combines the mean
and the weighted majority voting rules. It makes use of
weighted majority voting and the ensemble prediction is
calculated as the average of the member predictions.
11. Assume we are combining three classifiers that classify a
training sample as given in the table. Then what is the class
of the samples using majority voting?
Classifier Class label
C1 0
C2 0
C3 1
a) 0
b) 1
c) 2
d) New class
Answer: a
Explanation: In majority voting the class label (y) is
predicted as, Y = mode {P1, P2, …, Pn} where P1, P2, …,
Pn are the predictions of n classifiers that are combined.
Here P1 = 0, P2 = 0, and P3 = 1. And Y can be calculated as,
Y = mode {P1, P2, P3}
= mode {0, 0, 1}
=0
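Majority voting as implemented here is just the mode of the individual predictions (a one-line sketch):

```python
from statistics import mode

predictions = [0, 0, 1]    # votes from classifiers C1, C2, C3
print(mode(predictions))   # -> 0
```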
12. Assume we are combining eight classifiers that classify a
training sample as given in the table. Then what is the class
of the samples using simple majority voting?
Classifier Class label
C1 0
C2 0
C3 1
C4 0
C5 2
C6 3
C7 0
C8 0
a) 1
b) 2
c) 0
d) 3
Answer: c
Explanation: Majority voting has three flavors. And one
among them is depending on whether the ensemble decision
is the class predicted by at least one more than half the
number of classifiers. And it is known as simple majority
voting. Total number of classifiers is 8 and the number of
classifiers predicts class label 0 is 5. So the number of
classifiers predicts class label 0 > 4 (Total number of
classifiers / 2). And the class label for the samples is 0.
13. Assume we are combining three classifiers that classify a
training sample and the probabilities are given in the table.
Given that it assigns equal weights to all classifiers w1=1,
w2=1, w3=1. What is the class of the samples using weighted
majority voting?
a) Class 0
b) Class 1
c) Class 2
d) New class
Answer: b
Explanation: Given the table of the probabilities of the
sample being classified to each class label by the three classifiers,
and assigning equal weights to all classifiers (w1 = w2 = w3 = 1),
class 1 has the highest weighted
average probability, thus we classify the sample as class 1.
1. In error-correcting output codes (ECOC), the main
classification task is defined in terms of a number of
subtasks that are implemented by the base-learners.
a) True
b) False
Answer: a
Explanation: In multi-class problems the original task of
separating one class from all other classes may be difficult.
So we want to define a set of simpler classification problems,
where each specializing in one aspect of the task. And we get
the final classifier by combining these simpler classifiers.
2. Which of the following statements is not true about error-
correcting output codes (ECOC)?
a) It is a method for solving multi-class classification
problems
b) It is a method for decomposing a multiway classification
problem into many binary classification tasks
c) It is a method for solving binary classification problems
d) It is a method for converting a k-class supervised learning
problem into a large number of two class supervised
learning problem
Answer: c
Explanation: Error-correcting output coding is not a method
for solving binary classification problems. It is a method for
solving multi-class classification problems. Here a k-class
supervised learning problem (multiway classification
problem) is converted into a large number L of two class
supervised learning problems (binary classification tasks).
3. Which of the following statements is not true about multi-
class classification?
a) An input can belong to one of K classes
b) Each input belongs to exactly one class
c) Each training data associated with class labels which is a
number from 1 to K
d) Each input belongs to more than one class
Answer: d
Explanation: In a multi-class classification problem, an
input cannot belong to more than one class. Here an input
can belong to one of K classes, but each input belongs to
exactly one class. And the training data input is associated
with a class label (a number from 1 to K).
4. Which of the following statements is false about error-
correcting output codes (ECOC)?
a) It is based on the embedding of binary classifiers
b) The ECOC designs are independent of the base classifier
applied
c) ECOC framework consists of designing a codeword for
each of the classes
d) The ECOC designs are dependent on the base classifier
applied
Answer: d
Explanation: ECOC designs are not dependent on the base
classifier applied and are independent of the base classifier
applied. ECOC is the powerful framework based on the
embedding of binary classifiers. It consists of designing a
codeword for each of the classes and these codewords encode
the membership information of each class for a given binary
problem.
5. Which of the following is not a problem independent
ECOC design?
a) One-versus-all
b) SFFS criterion
c) One-versus-one
d) Dense Random
Answer: b
Explanation: Sequential Forward Floating Search (SFFS) is
not a problem independent ECOC design. It is a method
used in problem dependent ECOC design for feature
selection. And all other three are the problem independent
ECOC designs.
6. Which of the following is not a problem dependent ECOC
design?
a) Sparse Random
b) DECOC
c) ECOC-ONE
d) Forest-ECOC
Answer: a
Explanation: Sparse Random is not a problem dependent
ECOC design and it is a problem independent ECOC
design. It uses n = 15 · logNc dichotomizers. All other three
are the problem dependent ECOC designs.
7. Which of the following ECOC designs uses n =
(Nc−1).T dichotomizers, where T stands for the number of
binary tree structures to be embedded?
a) DECOC
b) One-versus-all
c) Forest-ECOC
d) One-versus-one
Answer: c
Explanation: Forest-ECOC design uses n =
(Nc−1).T dichotomizers, where T stands for the number of
binary tree structures to be embedded. Whereas the
DECOC design uses n = Nc−1 dichotomizers, One-versus-all
uses Nc dichotomizers and One-versus-one uses n =
Nc(Nc−1)/2 dichotomizers.
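A small helper that reproduces these dichotomizer counts for a given number of classes Nc (T, the number of embedded trees, is a free parameter; the formulas are the ones quoted above):

```python
def ecoc_dichotomizers(Nc, T):
    # Number of binary problems (dichotomizers) per ECOC coding design.
    return {
        "one-versus-all": Nc,
        "one-versus-one": Nc * (Nc - 1) // 2,
        "DECOC": Nc - 1,
        "Forest-ECOC": (Nc - 1) * T,
    }

print(ecoc_dichotomizers(Nc=4, T=3))
# -> {'one-versus-all': 4, 'one-versus-one': 6, 'DECOC': 3, 'Forest-ECOC': 9}
```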
8. Forest-ECOC design which uses n =
(Nc−1).T dichotomizers, extends the variability of the
classifiers of the DECOC design.
a) True
b) False
Answer: a
Explanation: Forest-ECOC design uses n =
(Nc−1).T dichotomizers, where T stands for the number of
binary tree structures to be embedded. Whereas the
DECOC design uses n = Nc−1 dichotomizers. So Forest-
ECOC design extends the variability of the classifiers of the
DECOC design by including extra dichotomizers T (number
of binary tree structures to be embedded).
9. Problem independent ECOC design and Problem
dependent ECOC design are the two types of ECOC
decoding strategies.
a) True
b) False
Answer: b
Explanation: Problem independent ECOC design and
Problem dependent ECOC design are not the two types of
ECOC decoding strategies. The ECOC coding designs are
mainly divided into two main groups: problem-independent
approaches, and the problem-dependent designs. Hamming
decoding, Euclidean decoding etc. are the ECOC decoding
strategies.
10. Problem independent approaches take into account the
distribution of the data to define the coding matrix.
a) False
b) True
Answer: a
Explanation: Problem-independent approaches are used to
guide the coding design. It does not take into account the
distribution of the data to define the coding matrix. It
considers the row separation and column separation criteria
to build a code matrix.
11. Given the two strings “cats” and “dogs”. What is the
Hamming distance between two strings?
a) 4
b) 3
c) 2
d) 5
Answer: b
Explanation: The Hamming distance between two strings of
equal length is the number of positions at which the
corresponding symbols are different. It is the number of
substitutions required to transform one string into another.
cats ⇒ dats (substitute ‘d’ for ‘c’)
dats ⇒ dots (substitute ‘o’ for ‘a’)
dots ⇒ dogs (substitute ‘g’ for ‘t’)
So hamming distance is 3 as it requires 3 edit operations to
convert “cats” to “dogs”.
12. What is the hamming distance between the binary values
100111010 and 101111111?
a) 9
b) 7
c) 5
d) 3
Answer: d
Explanation: Hamming distance between two strings of
equal length is the minimum number of substitutions
required to change one string into the other.
100111010 ⇒ 101111010
101111010 ⇒ 101111110
101111110 ⇒ 101111111
So hamming distance is 3 as it requires 3 edit operations to
convert 100111010 to 101111111.
13. How many single bit errors take to turn “cow” to “fox”?
a) 2
b) 0
c) 1
d) 3
Answer: a
Explanation: The number of single bit errors taken to turn
one string into another is known as the hamming distance.
cow ⇒ fow (substitute ‘f’ for ‘c’)
fow ⇒ fox (substitute ‘x’ for ‘w’)
So the number of single bit errors taken to turn “cow” to
“fox” is 2.
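The three distances above can be checked with a small helper (equal-length strings only):

```python
def hamming(s, t):
    # Number of positions at which two equal-length strings differ.
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(s, t))

print(hamming("cats", "dogs"))                # -> 3
print(hamming("100111010", "101111111"))      # -> 3
print(hamming("cow", "fox"))                  # -> 2
```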
1. Boosting is a machine learning ensemble algorithm which
converts weak learners to strong ones.
a) True
b) False
Answer: a
Explanation: Boosting is a machine learning ensemble meta-
algorithm which converts weak learners to strong ones. A
weak learner is defined to be a classifier which is only
slightly correlated with the true classification and a strong
learner is a classifier that is arbitrarily well correlated with
the true classification.
2. Which of the following statements is not true about
boosting?
a) It uses the mechanism of increasing the weights of
misclassified data in preceding classifiers
b) It mainly increases the bias and the variance
c) It tries to generate complementary base-learners by
training the next learner on the mistakes of the previous
learners
d) It is a technique for solving two-class classification
problems
Answer: b
Explanation: Boosting does not increase the bias and
variance but it mainly reduces the bias and the variance. It is
a technique for solving two-class classification problems.
And it tries to generate complementary base-learners by
training the next learner (by increasing the weights) on the
mistakes (misclassified data) of the previous learners.
3. Boosting is a heterogeneous ensemble technique.
a) True
b) False
Answer: b
Explanation: Boosting is not a heterogeneous ensemble but
is a homogeneous ensemble. Homogeneous ensemble consists
of members having a single-type base learning algorithm.
Whereas a heterogeneous ensemble consists of members
having different base learning algorithms.
4. The issues that boosting addresses are the bias-complexity
tradeoff and computational complexity of learning.
a) True
b) False
Answer: a
Explanation: The more expressive the hypothesis class the
learner is searching over, the smaller the approximation
error is, but the larger the estimation error becomes. And
for many concept classes the task of finding an Empirical
Risk Minimization hypothesis may be computationally
infeasible.
5. Which of the following statements is not true about weak
learners?
a) They can be used as the building blocks for designing
more complex models by combining them
b) Boosting learns the weak learners sequentially in a very
adaptive way
c) They are combined using a deterministic strategy
d) They have low bias
Answer: d
Explanation: Weak learners do not have low bias but have
high bias. Boosting primarily reduces the bias by combining
the weak learners in a deterministic strategy. And boosting
learns the weak learners sequentially in a very adaptive way.
6. Which of the following is not related to boosting?
a) Non uniform distribution
b) Re-weighting
c) Re-sampling
d) Sequential style
Answer: c
Explanation: Re-sampling is done with the bagging
technique. Boosting uses a non-uniform distribution, during
the training the distribution will be modified and difficult
samples will have higher probability. And it follows a
sequential style to generate complementary base-learners by
re-weighting the learner.
7. In ensemble method if the classifier is unstable, then we
need to apply boosting.
a) True
b) False
Answer: b
Explanation: If the classifier is unstable which means it has
high variance, then we cannot apply boosting. We can use
bagging if the classifier is unstable. If the classifier is steady
and straightforward (high bias), then we have to apply
boosting.
8. The original boosting method requires a very large
training sample.
a) True
b) False
Answer: a
Explanation: The disadvantage of the original boosting method is that it requires a very large training sample. The sample has to be divided into three parts, and the second and third classifiers are trained only on subsets derived from the previous classifier's errors.
9. Which of the following is not true about boosting?
a) It considers the weightage of the higher accuracy sample
and lower accuracy sample
b) It helps when we are dealing with bias or under-fitting in
the data set
c) Net error is evaluated in each learning step
d) It always considers the overfitting or variance issues in
the data set
View Answer
Answer: d
Explanation: One of the main disadvantages of boosting is
that it often ignores overfitting or variance issues in the data
set. And it mainly reduces the bias and also the variance. All
other three options are the advantages of boosting.
10. Boosting can be used for spam filtering.
a) False
b) True
View Answer
Answer: b
Explanation: Boosting can be used for spam filtering, where the first classifier distinguishes between emails from contacts and others. Subsequent classifiers are used to find examples wrongly classified as spam and words/phrases appearing in spam. Finally, these are combined into a final classifier that predicts spam accurately.
11. Consider 7 weak learners, of which 4 vote FAKE for a social media account and 3 vote REAL. What will be the final prediction for the account if we are using a majority voting method?
a) FAKE
b) REAL
c) Undefined
d) Error
View Answer
Answer: a
Explanation: As we are using a majority voting method, the prediction with the higher number of votes among the weak learners is chosen. Here 4 out of 7 learners vote FAKE, which is more than the 3 votes for REAL, so the final prediction will be FAKE.
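A minimal sketch of how such a majority vote could be computed in Python (the vote list and helper name here are purely illustrative):

from collections import Counter

# Hypothetical votes from the 7 weak learners for one account
votes = ["FAKE", "FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]

def majority_vote(votes):
    # most_common(1) returns the label with the highest count
    label, count = Counter(votes).most_common(1)[0]
    return label

print(majority_vote(votes))  # FAKE (4 votes out of 7)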
12. Assume that we are training a boosting classifier using
decision stumps on the given dataset. Then which of the
given examples will have their weights increased at the end
of the first iteration?
a) Circle
b) Square
c) Both
d) No increment in weight
View Answer
Answer: b
Explanation: The square example will have its weight increased at the end of the first iteration. A decision stump is a 1-level decision tree, i.e. a test based on one feature. The decision stump with the least error in the first iteration is constant over the whole domain, so it only predicts incorrectly on the square example.
13. Assume that we are training a boosting classifier using decision stumps on the given dataset. At least how many iterations does it need to achieve zero training error?
a) 1
b) 2
c) 3
d) 0
View Answer
Answer: c
Explanation: It will require at least three iterations to
achieve zero training error. First iteration will misclassify
the square example. Second iteration will misclassify the two
square examples. And finally the third iteration will
misclassify the remaining two square examples which can
yield zero training error. So it requires at least three
iterations.
1. AdaBoost is an algorithm that has access to a weak
learner and finds a hypothesis with a low empirical risk.
a) True
b) False
View Answer
Answer: a
Explanation: AdaBoost (Adaptive Boosting) is an algorithm
that has access to a weak learner and finds a hypothesis with
a low empirical risk. Each iteration of AdaBoost involves
O(m) operations as well as a single call to the weak learner.
Therefore, if the weak learner can be implemented
efficiently, then the total training process will be efficient.
2. Which of the following statements is not true about
AdaBoost?
a) The boosting process proceeds in a sequence of
consecutive rounds
b) In each round t, the weak learner is assumed to return a
weak hypothesis ht
c) The output of AdaBoost algorithm is a weak classifier
d) It assigns a weight to the weak hypothesis that is inversely
proportional to the error of the weak hypothesis
View Answer
Answer: c
Explanation: The output of the AdaBoost algorithm is not a
weak classifier but is a strong classifier that is based on a
weighted sum of all the weak hypotheses. The boosting
process proceeds in a sequence of consecutive rounds. So in
each round t, the weak learner is assumed to return a weak
hypothesis ht and it assigns a weight to ht that is inversely
proportional to the error of ht.
3. AdaBoost runs in polynomial time.
a) False
b) True
View Answer
Answer: b
Explanation: AdaBoost runs in polynomial time and does
not require defining a large number of hyper parameters.
Each iteration of AdaBoost involves O (m) operations as well
as a single call to the weak learner. So overall running time
is polynomial in m.
4. The basic functioning of the AdaBoost algorithm is to
maintain a weight distribution over the data points.
a) True
b) False
View Answer
Answer: a
Explanation: The basic functioning of the algorithm is to
maintain a weight distribution d, over data points. A weak
learner, f(k) is trained on this weighted data. And the
(weighted) error rate of f(k) is used to determine the
adaptive parameter α, which controls how “important” a
weak learner, f(k) is.
5. The success of AdaBoost is due to its property of
increasing the margin.
a) False
b) True
View Answer
Answer: b
Explanation: The success of AdaBoost is due to its property of increasing the margin. In practice we observe that running boosting for many rounds does not overfit in most cases, and the margin explains this. The margins can be thought of as a measure of how confident a classifier is about how it labels each point, and one would ideally want to produce a classifier with margins as large as possible.
6. Which of the following statements is not true about AdaBoost?
a) It is generally more prone to overfitting.
b) It improves classification accuracy.
c) It is particularly prone to overfitting on noisy datasets.
d) Complexity of the weak learner is important in AdaBoost.
View Answer
Answer: a
Explanation: AdaBoost is generally less prone to overfitting, although it can overfit on noisy datasets. If you use very simple weak learners, the algorithm is much less prone to overfitting, and it improves classification accuracy. So the complexity of the weak learner is important in AdaBoost.
7. Which of the following statements is true about the
working of AdaBoost?
a) It starts with equal weights and re-weighting will be done.
b) It starts with unequal weights and re-weighting will be done.
c) It starts with unequal weights and random sampling.
d) It starts with equal weights and random sampling.
View Answer
Answer: d
Explanation: AdaBoost starts with equal weights and
random sampling. It starts by predicting the original dataset
and gives equal weights to each observation. So in the first
step of AdaBoost each sample has an identical weight that
indicates how important it is regarding the classification.
8. AdaBoost is a parallel ensemble method.
a) True
b) False
View Answer
Answer: b
Explanation: AdaBoost is not a parallel ensemble method. It
is a sequential ensemble method, where the base learners are
generated sequentially. The boosting process proceeds in a
sequence of consecutive rounds.
9. Given three training instances with weights 0.5, 0.2, and 0.04. The predicted values are 1, 1, and –1 and the actual outputs are –1, 1, and 1, so the per-instance error terms (terror) are 1, 0, and 1. What is the misclassification rate?
a) 0.71
b) 0.65
c) 0.73
d) 0.5
View Answer
Answer: c
Explanation: Misclassification rate or error = sum(w(i) * terror(i)) / sum(w)
= (0.5 * 1 + 0.2 * 0 + 0.04 * 1) / (0.5 + 0.2 + 0.04)
= (0.5 + 0 + 0.04) / 0.74
= 0.54 / 0.74
≈ 0.73
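A short Python sketch of this weighted error calculation, assuming the per-instance error term (terror) is 1 for a misclassified instance and 0 otherwise:

weights = [0.5, 0.2, 0.04]
terror = [1, 0, 1]  # 1 where predicted != actual, 0 otherwise

# error = sum(w(i) * terror(i)) / sum(w)
error = sum(w * t for w, t in zip(weights, terror)) / sum(weights)
print(round(error, 2))  # 0.73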
10. AdaBoost is sensitive to outliers.
a) False
b) True
View Answer
Answer: b
Explanation: AdaBoost is sensitive to outliers and label noise, and outliers tend to get misclassified. As the number of iterations increases, the weights corresponding to outlier points can become very large, and the subsequent classifiers keep trying to classify these outlier points correctly.
11. Consider two classifiers having errors 0.4 and 0.5. What will be the classifier weights in these two cases?
a) 0.401, 0.5
b) 0.903, 0.1
c) 0.304, 0.6
d) 0.205, 0
View Answer
Answer: d
Explanation: The weight of the classifier is calculated as,
α = (1 / 2) * ln((1 – error) / error)
Then for error = 0.4,
Weight α = (1 / 2) * ln((1 – 0.4) / 0.4)
= 0.5 * ln(0.6 / 0.4)
= 0.5 * ln(1.5)
= 0.5 * 0.41
= 0.205
For error = 0.5,
Weight α = (1 / 2) * ln((1 – 0.5) / 0.5)
= 0.5 * ln(0.5 / 0.5)
= 0.5 * ln(1)
= 0.5 * 0
= 0
12. The classifier weight for an instance will be less than zero
if the error is greater than or equal to 0.5.
a) True
b) False
View Answer
Answer: a
Explanation: The weight of the classifier is calculated as α =
(1 / 2) * ln((1 – error) / error). Consider a classifier instance
with error = 0.8. Then the weight will be calculated as,
Weight α = (1 / 2) * ln((1 – 0.8) / 0.8)
= 0.5 * ln(0.2 / 0.8)
= 0.5 * ln(0.25)
= 0.5 * –1.39
= –0.695
So when the error (0.8) is greater than or equal to 0.5, the
weight for an instance will be less than zero (- 0.695).
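The stage-weight formula used in the two previous questions can be checked with a few lines of Python; the error values are simply the ones from the worked examples, and the small differences from the figures above come only from rounding the logarithm to two decimals there:

import math

def classifier_weight(error):
    # alpha = 0.5 * ln((1 - error) / error)
    return 0.5 * math.log((1 - error) / error)

for e in (0.4, 0.5, 0.8):
    print(e, round(classifier_weight(e), 3))
# 0.4 -> 0.203 (about 0.205 above), 0.5 -> 0.0, 0.8 -> -0.693 (about -0.695 above),
# i.e. the weight becomes negative once the error exceeds 0.5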
1. Stacked generalization extends voting.
a) True
b) False
View Answer
Answer: a
Explanation: Voting is the simplest way to combine multiple
classifiers, which corresponds to taking a linear combination
of the learners. Stacked generalization extends voting by
combining the base learners through a combiner, which is
another learner.
2. Which of the following is represented by the below figure?
a) Bagging
b) Boosting
c) Mixture of Experts
d) Stacking
View Answer
Answer: d
Explanation: Stacking is a technique in which the outputs of
the base-learners are combined and is learned through a
combiner system, f(•|Φ). Where f(•|Φ) is another learner,
whose parameters Φ are also trained as: y = f (d1, d2, …,
dL|Φ) where d1, d2, …, dL are the base learners.
3. Which of the following is the main function of stacking?
a) Ensemble meta algorithm for reducing variance
b) Ensemble meta algorithm for reducing bias
c) Ensemble meta algorithms for improving predictions
d) Ensemble meta algorithms for increasing bias and
variance
View Answer
Answer: c
Explanation: Stacking is a way of combining multiple models. It uses the predictions of multiple models as “features” to train a new model and then uses the new model to make predictions on test data. So it is an ensemble meta-algorithm for improving predictions.
4. Which of the following is an example of stacking?
a) AdaBoost
b) Random Forest
c) Bagged Decision Trees
d) Voting Classifier
View Answer
Answer: d
Explanation: Voting classifiers (ensemble or majority voting
classifiers) are an example of stacking. They are used to
combine several classifiers to create the final classifier.
AdaBoost is a boosting technique whereas random forest
and bagged decision trees are examples of bagging.
5. The fundamental difference between voting and stacking
is how the final aggregation is done.
a) True
b) False
View Answer
Answer: a
Explanation: The fundamental difference between voting
and stacking is how the final aggregation is done. In voting,
user-specified weights are used to combine the classifiers
whereas stacking performs this aggregation by using a linear
or nonlinear function. And this function can be a
blender/meta classifier.
6. Which of the following statements is false about stacking?
a) It introduces the concept of a meta learner
b) It combines multiple classification or regression models
c) The combiner function can be nonlinear
d) Stacking ensembles are always homogeneous
View Answer
Answer: d
Explanation: Stacking ensembles are not always
homogeneous but are often heterogeneous, because the base
level often consists of different learning algorithms. It
combines multiple classification or regression models. And it
introduces the concept of a meta learner and this combiner
function can be nonlinear unlike voting.
7. Stacking trains a meta-learner to combine the individual
learners.
a) True
b) False
View Answer
Answer: a
Explanation: Stacking trains a meta-learner (second-level learner) to combine the individual learners (first-level learners). In stacking, the first-level learners are often generated by different learning algorithms, and the second-level learner learns from examples how to combine these first-level learners.
8. Associative switch can be used to combine multiple
classifiers in stacking.
a) True
b) False
View Answer
Answer: a
Explanation: The associative switch is also known as the meta-learner in stacking. It learns from examples how to combine multiple classifiers, an approach also known as combining by learning.
9. Which of the following is represented by the below figure?
a) Stacking
b) Mixture of Experts
c) Bagging
d) Boosting
View Answer
Answer: a
Explanation: Stacking introduces a level-1 algorithm, called
meta-learner, for learning the weights of the level-0
predictors. That means the predictions of each training
instance from the models become now training data for the
level-1 learner (generalizer).
10. What does the given figure indicate?
a) Stacking
b) Support vector machine
c) Bagging
d) Boosting
View Answer
Answer: a
Explanation: The given figure shows a stacked
generalization framework. It consists of level 0 (three
models) and level 1 (one Meta model). And these individual
algorithms (Light gradient boosting, Support vector
regression and neural network) improve the predictive
performance.
1. Gradient descent is an optimization algorithm for finding
the local minimum of a function.
a) True
b) False
View Answer
Answer: a
Explanation: Gradient descent is an optimization algorithm
for finding the local minimum of a function. It is used to find
the values of parameters of a function that minimizes a cost
function. The slope of this cost function curve tells us how to
update our parameters to make the model more accurate.
2. We can use gradient descent as a best solution, when the
parameters cannot be calculated analytically.
a) False
b) True
View Answer
Answer: b
Explanation: Gradient descent is best used when the
parameters cannot be calculated using linear algebra
(analytically). So, in order to solve a system of nonlinear
equations numerically, we have to reformulate it as an
optimization problem. And it must be searched by an
optimization algorithm like gradient descent.
3. Which of the following statements is false about gradient
descent?
a) It updates the weight to comprise a small step in the
direction of the negative gradient
b) The learning rate parameter is η where η > 0
c) In each iteration, the gradient is re-evaluated for the new
weight vector
d) In each iteration, the weight is updated in the direction of
positive gradient
View Answer
Answer: d
Explanation: Gradient descent is an optimization algorithm,
and in each iteration the weight is not updated in the
direction of positive gradient. Here it updates the weight in
the direction of the negative gradient. And the gradient is re-
evaluated for the new weight vector with a learning
parameter η > 0.
4. In batch method gradient descent, each step requires the
entire training set be processed in order to evaluate the error
function.
a) True
b) False
View Answer
Answer: a
Explanation: Techniques that use the whole data set at once
are called batch methods. So in batch method gradient
descent, each step requires the entire training set be
processed in order to evaluate the error function. Here the
error function is defined with respect to a training set.
5. Simple gradient descent is a better batch optimization
method than conjugate gradients and quasi-newton
methods.
a) False
b) True
View Answer
Answer: a
Explanation: Conjugate gradients and quasi-newton
methods are the more robust and faster batch optimization
methods than simple gradient descent. In these algorithms
the error function always decreases at each iteration unless
the weight vector has arrived at a local or global minimum
unlike gradient descent.
6. What is the gradient of the function 2x² – 3y² + 4y – 10 at the point (0, 0)?
a) 0i + 4j
b) 1i + 10j
c) 2i – 3j
d) -3i + 4j
View Answer
Answer: a
Explanation: Given the function f = 2x² – 3y² + 4y – 10 at the point (0, 0). Then the gradient of the function can be calculated as:
\( \frac {\partial f}{\partial x} = \frac {\partial (2x^2 – 3y^2 + 4y – 10)}{\partial x}\)
= 4x
= 4 * 0
= 0
\( \frac {\partial f}{\partial y} = \frac {\partial (2x^2 – 3y^2 + 4y – 10)}{\partial y}\)
= –6y + 4
= (–6 * 0) + 4
= 0 + 4
= 4
Gradient, ∇f = 0i + 4j
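A quick numerical check of this gradient using central differences (the function and point are the ones from the question; the step size h is an arbitrary small value):

def f(x, y):
    return 2 * x**2 - 3 * y**2 + 4 * y - 10

def gradient(f, x, y, h=1e-6):
    # Central differences approximate the two partial derivatives
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return dfdx, dfdy

print(gradient(f, 0.0, 0.0))  # approximately (0.0, 4.0)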
7. The gradient is set to zero to find the minimum or the
maximum of a function.
a) False
b) True
View Answer
Answer: b
Explanation: The gradient is set to zero, to find the
minimum or the maximum of a function. Because the value
of gradient at extremes (minimum or maximum) of a
function is always zero. So the derivative of the function is
zero at any local maximum or minimum.
8. The main difference between gradient descent variants is the amount of data they use.
a) True
b) False
View Answer
Answer: a
Explanation: There are mainly three types of gradient
descents. They are batch gradient descent, stochastic
gradient, and mini-batch gradient descent. The main
difference between these algorithms is the amount of data
they handle. And based on this their accuracy, and time
taken for the weight updating varies.
9. Which of the following statements is false about choosing
learning rate in gradient descent?
a) Small learning rate leads to slow convergence
b) Large learning rate cause the loss function to fluctuate
around the minimum
c) Large learning rate can cause to divergence
d) Small learning rate cause the training to progress very
fast
View Answer
Answer: d
Explanation: If the learning rate is too small then the
training will progress very slowly because the weight
updating is very small. So, it leads to slow convergence.
Whereas the large learning rate causes the loss function to
fluctuate around the minimum and even can cause
divergence.
10. Which of the following is not related to a gradient
descent?
a) AdaBoost
b) Adadelta
c) Adagrad
d) RMSprop
View Answer
Answer: a
Explanation: AdaBoost is a meta algorithm to combine the
base learners to form a final classifier. Where Adadelta,
Adagrad and RMSprop are the gradient descent
optimization algorithms. And these algorithms are most
widely used by the deep learning community to solve a
number of challenges.
11. Given a function y = (x + 4)². What is the local minimum of the function, and what is the value of x after the first iteration of gradient descent starting from x = 3 (assume the learning rate is 0.01)?
a) 0, 3.02
b) 0, 4.08
c) -4, 2.86
d) 4, 3.8
View Answer
Answer: c
Explanation: We know y = (x + 4)² reaches its minimum value when x = -4 (i.e. when x = -4, y = 0). Hence x = -4 is the local and global minimum of the function.
Let x0 = 3, learning rate = 0.01 and y = (x + 4)². Then using gradient descent,
\(\frac {dy}{dx} = \frac {d(x + 4)^2}{dx}\)
= 2 * (x + 4)
During the first iteration,
x1 = x0 – (learning rate * \(\frac {dy}{dx}\))
= 3 – (0.01 * (2 * (3 + 4)))
= 3 – (0.01 * (2 * 7))
= 3 – (0.01 * 14)
= 3 – 0.14
= 2.86
12. Given a function y = (x + 30)². How many iterations does gradient descent need for x to reach its first negative value, starting from x = 1 (assume the learning rate is 0.01)?
a) 3
b) 4
c) 2
d) 5
View Answer
Answer: c
Explanation: Let x0 = 1, learning rate = 0.01 and y = (x + 30)². Then using gradient descent,
\(\frac {dy}{dx} = \frac {d(x + 30)^2}{dx}\)
= 2 * (x + 30)
During the first iteration,
x1 = x0 – (learning rate * \(\frac {dy}{dx}\))
= 1 – (0.01 * (2 * (1 + 30)))
= 1 – (0.01 * (2 * 31))
= 1 – (0.01 * 62)
= 1 – 0.62
= 0.38
During the second iteration,
x2 = x1 – (learning rate * \(\frac {dy}{dx}\))
= 0.38 – (0.01 * (2 * (0.38 + 30)))
= 0.38 – (0.01 * (2 * 30.38))
= 0.38 – (0.01 * 60.76)
= 0.38 – 0.61
= -0.23
So, x reaches its first negative value after two iterations.
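A small loop, under the same assumptions (learning rate 0.01, derivative 2(x + a) for y = (x + a)²), reproduces both of the preceding worked examples; the helper name quad_gd is illustrative only:

def quad_gd(x, a, lr=0.01, steps=1):
    # Gradient descent on y = (x + a)^2, whose derivative is 2 * (x + a)
    for i in range(1, steps + 1):
        x = x - lr * 2 * (x + a)
        print(f"iteration {i}: x = {x:.2f}")
    return x

quad_gd(3, 4, steps=1)   # x = 2.86 after one step (question 11)
quad_gd(1, 30, steps=2)  # x = 0.38, then x = -0.23 (question 12)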
1. The Subgradient method is an algorithm for maximizing a
non-differentiable convex function.
a) True
b) False
View Answer
Answer: b
Explanation: The Subgradient method is not an algorithm
for maximizing a non-differentiable convex function but is
used to minimize the non differentiable convex function.
Convex optimization is the problem of minimizing convex
functions over convex sets. And when the objective function
is non-differentiable Subgradient methods are used.
2. Which of the following statements is not true about
Subgradient?
a) The step lengths are chosen via a line search
b) It can be directly applied to non-differentiable functions
c) It is an iterative method
d) The step lengths are fixed ahead of time
View Answer
Answer: a
Explanation: In Subgradient the step lengths are not chosen
via line search, and are often fixed ahead of time. It is an
iterative algorithm, which uses an initial guess to generate a
sequence of improving approximate solutions for a class of
problems. And these are directly applied to non-
differentiable functions.
3. Subgradient methods can be much slower than interior-
point methods.
a) True
b) False
View Answer
Answer: a
Explanation: Subgradient methods can be much slower than
interior-point methods. Where the interior-point methods
are used to solve linear and nonlinear convex optimization
problems and are second-order methods, not affected by
problem scaling. Subgradient methods are first-order
methods and their performance depends very much on the
problem of scaling and conditioning.
4. Which of the following statements is not true about the
Subgradient method?
a) It has small memory requirement than interior-point
methods
b) It can be used for extremely large problems
c) Simple distributed algorithm can be generated by
combining sub gradient with primal or dual composition
techniques
d) It is much faster than Newton’s method in the
unconstrained case
View Answer
Answer: d
Explanation: Subgradient methods are not faster than
Newton’s method but are slower than it. The advantages of
Sub-gradient are that it has smaller memory requirements
than interior-point methods and can be used for extremely
large problems. And can be combined with primal or dual
composition techniques to form a simple distributed
algorithm.
5. The step size rule αk = α, where α is a positive constant independent of k, is the constant step size rule.
a) True
b) False
View Answer
Answer: a
Explanation: The constant step size rule defines the step size as αk = α, a positive constant independent of k. Constant step length, non-summable diminishing step sizes, and non-summable diminishing step lengths are other step size rules in the Subgradient method, each defining the step size in a different way.
6. In SVM problems, we cannot directly apply gradient
descent but we can apply Subgradient descent.
a) True
b) False
View Answer
Answer: a
Explanation: In SVM problems we cannot directly apply gradient descent, but we can apply Subgradient descent, because the SVM objective is not continuously differentiable. Subgradient descent can be used to minimise this non-differentiable SVM objective function.
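As an illustration, a minimal sketch of one subgradient descent step for the (non-differentiable) hinge loss with L2 regularisation; the data point, learning rate and regularisation constant are made-up values:

import numpy as np

def hinge_subgradient(w, x, y, lam=0.01):
    # Subgradient of lam/2 * ||w||^2 + max(0, 1 - y * <w, x>) with respect to w.
    # At the kink (margin exactly 1) any valid subgradient could be used; the
    # zero-hinge branch is chosen there for simplicity.
    if y * np.dot(w, x) < 1:
        return lam * w - y * x
    return lam * w

w = np.zeros(2)
x, y, lr = np.array([1.0, 2.0]), 1, 0.1
w = w - lr * hinge_subgradient(w, x, y)  # one subgradient descent step
print(w)  # [0.1 0.2]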
7. Which of the following objective functions is not solved by
Subgradient?
a) Hinge loss
b) L1 norm
c) Perceptron loss
d) TanH function
View Answer
Answer: d
Explanation: TanH function cannot be handled by the
Subgradient. TanH function is a differentiable objective
function which cannot be solved by the Subgradient.
Because Subgradient is used to solve the non differentiable
convex problems where all the other three are the non
differentiable functions.
8. The step size rules in Subgradient are determined before
the algorithm is run.
a) False
b) True
View Answer
Answer: b
Explanation: The step size rules in subgradient are
determined before the algorithm is run. That is they do not
depend on any data computed during the algorithm. But in
standard descent methods the step size rules depend very
much on the current point and search direction.
9. The Subgradient is a descent method.
a) True
b) False
View Answer
Answer: b
Explanation: Unlike the ordinary gradient method, the
subgradient method is not a descent method, because the
function value often increases. The method looks very much
like the ordinary gradient method for differentiable
functions, but with several exceptions.
10. Subgradient descent can be used at points where
derivative is not defined.
a) True
b) False
View Answer
Answer: a
Explanation: Subgradient descent can be used at points
where derivative is not defined. It solves the non-
differentiable convex function. And it is like gradient
descent, but replacing gradients with subgradients.
11. Which of the following statements is not true about
Subgradient method?
a) Its convergence can be very fast
b) It handles general non-differentiable convex problem
c) It has no good stopping criterion
d) It often leads to very simple algorithms
View Answer
Answer: a
Explanation: The Subgradient method’s convergence can be very slow, not very fast, because of its iterative process. The other three statements are key features of the Subgradient method.
1. Stochastic gradient descent is also known as on-line
gradient descent.
a) True
b) False
View Answer
Answer: a
Explanation: Stochastic gradient descent is also known as
online gradient descent. It is said to be online because it can
update coefficients on new samples as it comes in the system.
So it makes an update to the weight vector based on one data
point at a time.
2. Stochastic gradient descent (SGD) methods handle
redundancy in the data much more efficiently than batch
methods.
a) True
b) False
View Answer
Answer: a
Explanation: Stochastic gradient descent methods handle
redundancy in the data much more efficiently than batch
methods. If we are doubling the dataset size then the error
function will be multiplied by a factor of 2. And SGD can
handle this error function normally but batch methods need
double the computational power to handle this error
function.
3. Which of the following statements is true about stochastic
gradient descent?
a) It processes all the training examples for each iteration of
gradient descent
b) It is computationally very expensive, if the number of
training examples is large
c) It processes one training example per iteration
d) It is not preferred, if the number of training examples is
large
View Answer
Answer: c
Explanation: Stochastic gradient descent processes one
training example per iteration. That is it updates the weight
vector based on one data point at a time. All other three are
the features of Batch Gradient Descent.
4. Which of the following statements is not true about the
stochastic gradient descent?
a) The parameters are being updated after one iteration
b) It is quite faster than batch gradient descent
c) Stochastic gradient descent is faster than mini batch
gradient descent
d) When the number of training examples is large, it can be
additional overhead for the system
View Answer
Answer: c
Explanation: Stochastic gradient descent is not faster than
mini batch gradient descent but is slower than it. But it is
faster than batch gradient descent and the parameters are
updated after each iteration. And when the number of
training examples is large, then the number of iterations
increases and it will be an overhead for the system.
5. Stochastic gradient descent falls under Non-convex
optimization.
a) True
b) False
View Answer
Answer: a
Explanation: Stochastic gradient descent falls under Non-
convex optimization. A non-convex optimization problem is
any problem where the objective or any of the constraints
are non-convex.
6. Which of the following statements is not true about
stochastic gradient descent?
a) Due to the frequent updates, there can be so many noisy
steps
b) It may take longer to achieve convergence to the minima
of the loss function
c) Frequent updates are computationally expensive
d) It is computationally slower
View Answer
Answer: d
Explanation: Stochastic gradient descent (SGD) is not
computationally slower but is faster, as only one sample is
processed at a time. All other three are the disadvantages of
SGD. Where the frequent updates make noisy steps and
make it to achieve convergence to the minima very slowly.
And it is computationally expensive also.
7. In stochastic gradient descent the high variance frequent
parameter updates causes the loss function to fluctuate
heavily.
a) False
b) True
View Answer
Answer: b
Explanation: In stochastic gradient descent the frequent
parameter updates have high variance and cause the loss
function (objective function) to fluctuate to different
intensities. The high variance parameter updates helps to
discover better local minima but at the same time it
complicates the convergence (unstable convergence) to the
exact minimum.
8. Stochastic gradient descent has the possibility of escaping
from local minima.
a) False
b) True
View Answer
Answer: b
Explanation: One of the properties of stochastic gradient
descent is the possibility of escaping from local minima.
Since a stationary point with respect to the error function
for the whole data set will generally not be a stationary point
for each data point individually.
9. Given an example from a dataset (x1, x2) = (4, 1), observed value y = 2 and the initial weights w1, w2 and bias b as -0.015, -0.038 and 0. What will be the prediction y’?
a) 0.01
b) 0.03
c) 0.05
d) -0.1
View Answer
Answer: d
Explanation: Given x1 = 4, x2 = 1, w1 = -0.015, w2 = -0.038, y = 2 and b = 0.
Then prediction y’ = w1x1 + w2x2 + b
= (-0.015 * 4) + (-0.038 * 1) + 0
= -0.06 + -0.038 + 0
= -0.098
≈ -0.1
10. Given an example from a dataset (x1, x2) = (2, 8), the dependent variable y = -14, and the model prediction y’ = -11. What will be the loss if we are using the squared difference method?
a) 6
b) -3
c) 9
d) 3
View Answer
Answer: c
Explanation: Given the observed value y = -14, the predicted value y’ = -11, and the inputs x1 = 2, x2 = 8.
Then using the squared difference method, Loss L = (y’ – y)²
= (-11 – (-14))²
= (-11 + 14)²
= (3)²
= 9
11. Given the current bias b = 0, learning rate = 0.01 and
gradient = -4.2. What will be the b’ value after the update?
a) -0.42
b) 0.042
c) 0.42
d) -0.042
View Answer
Answer: b
Explanation: Given b = 0, learning rate η = 0.01 and gradient = -4.2.
Then the bias value after the update, b’ = b – (η * gradient)
= 0 – (0.01 * -4.2)
= 0 – (-0.042)
= 0.042
12. Given the example from a data set x1 = 3, x2 = 1, observed value y = 2 and predicted value y’ = -0.05. What will be the gradient if you are using a squared difference method?
a) -4.1
b) -2.05
c) 4.1
d) 2.05
View Answer
Answer: a
Explanation: Given x1 = 3, x2 = 1, y = 2 and y’ = -0.05.
Then Gradient = 2 (y’ – y) as we are taking the partial derivative of (y’ – y)² with respect to y’.
Gradient = 2 (y’ – y)
= 2 (-0.05 – 2)
= 2 * -2.05
= -4.1
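The chain of calculations in questions 9–12 (linear prediction, squared-difference loss, gradient, and bias update) can be sketched in a few lines of Python; the numbers are the ones used in those questions:

def predict(x, w, b):
    # Linear model: y' = w1*x1 + w2*x2 + b
    return sum(wi * xi for wi, xi in zip(w, x)) + b

y_pred = predict([4, 1], [-0.015, -0.038], 0)
print(round(y_pred, 3))      # -0.098, i.e. approximately -0.1 (question 9)

loss = (-11 - (-14)) ** 2    # squared difference loss (question 10)
print(loss)                  # 9

grad = 2 * (-0.05 - 2)       # derivative of (y' - y)^2 with respect to y' (question 12)
print(grad)                  # -4.1

b_new = 0 - 0.01 * (-4.2)    # bias update b' = b - (learning rate * gradient) (question 11)
print(round(b_new, 3))       # 0.042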
13. Given the example from a data set x1 = 4, x2 = 1, weights w1 = -0.02, w2 = -0.03, bias b = 0, observed value y = 2, predicted value y’ = -0.11 and learning rate = 0.05. What will be the updated weight values if you are using a squared difference approach?
a) -0.902, -0.314
b) -0.864, -0.241
c) -0.594, -0.324
d) -0.625, -0.524
View Answer
Answer: b
Explanation: Given x1 = 4, x2 = 1, w1 = -0.02, w2 = -0.03, bias
b = 0, y = 2, y’ = -0.11 and η= 0.05.
Then w1’ = w1 – η(2 (y’ – y) * x1)
= -0.02 – 0.05 * (2(-0.11 – 2) * 4)
= -0.02 – 0.05 * (2 * 2.11 * 4)
= -0.02 – 0.05 * 16.88
= -0.02 – 0.844
= -0.864
Then w2’ = w2 – η(2 (y’ – y) * x2)
= -0.03 – 0.05 * (2(-0.11 – 2) * 1)
= -0.03 – 0.05 * (2 * 2.11 * 1)
= -0.03 – 0.05 * 4.22
= -0.03 – 0.211
= -0.241
1. Stochastic gradient descent cannot be used for risk
minimisation.
a) False
b) True
View Answer
Answer: a
Explanation: Stochastic gradient descent (SGD) can be used
for risk minimisation. In learning the problem we are facing
is minimising the risk function LD(w). With SGD, all we need
is to find an unbiased estimate of the gradient of LD(w) that
is, a random vector.
2. Stochastic gradient descent can be used for convex-smooth
learning problems.
a) False
b) True
View Answer
Answer: b
Explanation: Stochastic gradient descent can be used for
convex-smooth learning problems. Assume that for all z, the
loss function l(.,z) is convex, β-smooth, and nonnegative.
Then, if we can run the SGD algorithm for
minimising LD(w), it will minimise the loss function also.
3. Which of the following statements is not true about
stochastic gradient descent for regularised loss
minimisation?
a) Stochastic gradient descent has the same worst-case
sample complexity bound as regularised loss minimisation
b) On some distributions, regularised loss minimisation
yields a better solution than stochastic gradient descent
c) In some cases we solve the optimisation problem as
associated with regularised loss minimisation
d) Stochastic gradient descent has entirely different worst-
case sample complexity bound from regularised loss
minimisation
View Answer
Answer: d
Explanation: Stochastic gradient descent has the same worst-case sample complexity bound as regularised loss minimisation. But on some distributions, regularised loss minimisation yields a better solution than stochastic gradient descent. So in some cases we solve the optimisation problem associated with regularised loss minimisation.
4. In convex learning problems where the loss function is
convex, the preceding problem is also a convex optimisation
problem.
a) False
b) True
View Answer
Answer: b
Explanation: In convex learning problems where the loss
function is convex, the preceding problem is also a convex
optimisation problem that can be solved using SGD.
Consider f is a strongly convex function and we can apply
the SGD variant by constructing an unbiased estimate of a
sub-gradient of f at w(t).
1. Which of the following is not a variant of stochastic
gradient descent?
a) Adding a projection step
b) Variable step size
c) Strongly convex functions
d) Strongly non convex functions
View Answer
Answer: d
Explanation: There are several variants of Stochastic
Gradient Descent (SGD). Strongly non convex functions are
not a variant of stochastic gradient descent. Where adding a
projection step, variable step size, and strongly convex
functions are three variants of (SGD) which is used to
improve it.
2. Projection step is used to overcome the problem while
maintaining the same convergence rate.
a) True
b) False
View Answer
Answer: a
Explanation: Gradient descent and stochastic gradient
descent are restricting w* to a B-bounded hypothesis class
(w* is in the set H = {w : ∥ w∥ ≤ B}). So any step in the
opposite direction of the gradient might result in stepping
out of this bound. And projection step is used to overcome
this problem while maintaining the same convergence rate.
3. Which of the following statements is not true about two-
step update rule?
a) Two-step update rule is a way to add a projection step
b) First subtract a sub-gradient from the current value
of w and then project the resulting vector onto H
c) First add a sub-gradient to the current value of w and
then project the resulting vector onto H
d) The projection step replaces the current value of w by the
vector in H closest to it
View Answer
Answer: c
Explanation: In two-step update rule, we are not adding a
sub-gradient to the current value of w. The two-step rule is a
way to add a projection step, where we first subtract a sub-
gradient from the current value of w and then project the
resulting vector onto H. Then it replaces the current value of
w by the vector in H closest to it.
4. The variable step size rule decreases the step size as a function of the iteration t.
a) True
b) False
View Answer
Answer: a
Explanation: The variable step size rule decreases the step size as a function of the iteration t. It updates the value of w with ηt rather than with a constant η. When it is closer to the minimum of the function, it takes steps more carefully, so as not to overshoot the minimum.
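A minimal sketch of one such decreasing schedule; the 1/√t form and the initial value below are just one common choice, not the only possible rule:

import math

def step_size(eta0, t):
    # Decrease the step size as a function of the iteration number t
    return eta0 / math.sqrt(t)

for t in range(1, 6):
    print(t, round(step_size(0.5, t), 3))
# 0.5, 0.354, 0.289, 0.25, 0.224 -- smaller and smaller steps near the minimum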
5. More sophisticated averaging schemes can improve the
convergence speed in the case of strongly convex functions.
a) False
b) True
View Answer
Answer: b
Explanation: Averaging techniques are one of the variants of
stochastic gradient descent which is used to improve the
convergence speed in the case of strongly convex functions.
It can output the average of w(t) over the last αT iterations,
for some α ∈ (0, 1) or it can also take a weighted average of
the last few iterates.
1. The computational complexity challenge related to
learning half-space in high dimensional feature spaces can
be solved using the method of kernels.
a) True
b) False
View Answer
Answer: a
Explanation: When the data is mapped into a high
dimensional feature space, it extends the expressiveness of
half-space predictors. And it raises both sample complexity
and computational complexity challenges. And it can be
solved using the method of kernels.
2. A kernel is a type of a similarity measure between
instances.
a) True
b) False
View Answer
Answer: a
Explanation: A kernel is a type of a similarity measure
between instances. When we are embedding the data into a
high dimensional feature space we introduce the idea of
kernels. Mathematical meaning of a kernel is the inner
product in some Hilbert space. So a standard interpretation
of a kernel is the pair wise similarity between different
samples.
3. Let the domain be the real line and consider the domain points {-10, -9, -8, …, 0, 1, …, 9, 10} where the labels are +1 for all x such that |x| > 2 and -1 otherwise. The given training set is separable by a half-space.
a) True
b) False
View Answer
Answer: b
Explanation: The given training set is not separable by a half-space. The domain points are {-10, -9, -8, …, 0, 1, …, 9, 10} where the labels are +1 for all x such that |x| > 2 and -1 otherwise. Here the expressive power of half-spaces is rather restricted, so the set is not separable by a half-space.
4. Which of the following is not true about making the class
of half-spaces more expressive?
a) First map the original instance space into another high
dimension space
b) Initially map the original instance space into another low
dimension space
c) After mapping then learn a half-space in that space
d) Increasing expressive power is useful in separating the
training set by a half-space
View Answer
Answer: b
Explanation: Initially we map the original instance space into another higher-dimensional space, not a lower-dimensional one. After the mapping, a half-space is learned in that space, which is useful in separating the training set by a half-space.
5. Polynomial-based classifiers yield much richer hypothesis
classes than half-spaces.
a) False
b) True
View Answer
Answer: b
Explanation: Polynomial-based classifiers yield much richer hypothesis classes than half-spaces. Consider the domain points {-6, …, 0, 1, …, 5, 6} where the labels are +1 for all x such that |x| > 2 and -1 otherwise. It is not separable by a half-space, but after the embedding x ↦ (x, x²) it is perfectly separable.
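A tiny check of this embedding, assuming integer domain points labelled +1 when |x| > 2 and -1 otherwise: after mapping x to (x, x²), the linear rule x² > 4 (a half-space in the embedded space) reproduces the labels exactly.

points = list(range(-6, 7))
labels = [1 if abs(x) > 2 else -1 for x in points]

# Embedding x -> (x, x^2); the half-space x2 > 4 now separates the two classes
embedded = [(x, x * x) for x in points]
predictions = [1 if x2 > 4 else -1 for (_, x2) in embedded]

print(predictions == labels)  # True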
6. Which of the following statements is not true about Kernel
methods?
a) It can be used for pattern analysis or pattern recognition
b) It maps the data into higher dimensional space
c) The data can be easily separated in the higher
dimensional space
d) It only leads to finite dimensional space
View Answer
Answer: d
Explanation: The kernel methods lead not only to finite
dimensional space but also to infinite dimensional space as
there are no constraints of this mapping. Because it maps the
data into higher dimensional space by assuming that the
data can be easily separated in the higher dimensional space.
And it can be used for pattern analysis or pattern
recognition.
7. Which of the following statements is not true about Kernel
methods?
a) It works by embedding the input data to some high
dimensional feature space
b) Embedding into feature space can be determined
uniquely by specifying a kernel function that computes the
dot product between data points in the feature space
c) It defines only the linear mapping to the feature space
d) Expensive computations in the high dimensional feature
space can be avoided by evaluating the kernel function
View Answer
Answer: c
Explanation: The kernel function not only defines the linear
mapping to the feature space but also implicitly defines the
non linear mapping to the feature space and expensive
computations in the high dimensional feature space can be
avoided by evaluating the kernel function. All other three
statements are true about kernel methods.
1. When we make the half-space learning more expressive,
the computational complexity of learning may increase.
a) False
b) True
View Answer
Answer: b
Explanation: Embedding the input space into some high
dimensional feature space makes half-space learning more
expressive. But the computational complexity of such
learning may increase. So, computing linear separators over
very high dimensional data may be computationally
expensive.
2. Which of the following statements is not true about
kernel?
a) Kernel is used to describe inner products in the feature
space
b) The kernel function K specify the similarity between
instances
c) The kernel function K specify the embedding as mapping
the domain set into a space
d) The kernel function does the mapping of the domain set
into a space where the similarities are realised as outer
products
View Answer
Answer: d
Explanation: Mathematical meaning of a kernel is the inner
product in some Hilbert space not the outer products. And it
is a type of a similarity measure between instances. When we
are embedding the data into a high dimensional feature
space we introduce the idea of kernels.
3. Many learning algorithms for half-spaces can be carried
out just on the basis of the values of the kernel function over
pairs of domain points.
a) True
b) False
View Answer
Answer: a
Explanation: Many learning algorithms for half-spaces can
be carried out just on the basis of the values of the kernel
function over pairs of domain points. One of the main
advantages of such algorithms is that they implement linear
separators in high dimensional feature spaces without
having to specify points in that space or expressing the
embedding explicitly.
4. Which of the following statements is not true about the
learning algorithms?
a) A feature mapping can be viewed as expanding the class
of linear classifiers to a richer class
b) The suitability of any hypothesis class to a given learning
task depends on the nature of that task
c) An embedding is a way to express and utilise prior
knowledge about the problem at hand
d) The sample complexity required to learn with some kinds
of kernels is independent of the margin in the feature space
View Answer
Answer: d
Explanation: Sample complexity required to learn with some
kinds of kernels (Gaussian kernels) depends on the margin
in the feature space which will be large, but can in general
be arbitrarily small. All three other statements are true
about learning algorithms.
5. A Hilbert space is a vector space with an inner product,
which is also complete.
a) True
b) False
View Answer
Answer: a
Explanation: A Hilbert space is a vector space with an inner
product, which is also complete. An inner product space X is
called a Hilbert space if it is a complete metric space. And it
is complete if all Cauchy sequences in the space converge. In
feature mapping it maps the original instances into some
Hilbert space.
6. The k-degree polynomial kernel is defined as K(x, x’) = (1 + ⟨x, x’⟩)^k.
a) True
b) False
View Answer
Answer: a
Explanation: The k-degree polynomial kernel is defined as K(x, x’) = (1 + ⟨x, x’⟩)^k, where k is the degree of the polynomial. It is popular in image processing.
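A direct sketch of this kernel, assuming plain Python lists for the two input vectors:

def polynomial_kernel(x, x_prime, k=2):
    # K(x, x') = (1 + <x, x'>)^k
    dot = sum(a * b for a, b in zip(x, x_prime))
    return (1 + dot) ** k

print(polynomial_kernel([1, 2], [3, 4], k=2))  # (1 + 11)^2 = 144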
7. Which of the following statements is not true about kernel
trick?
a) It allows one to incorporate prior knowledge of the
problem domain
b) The training data only enter the algorithm through their
entries in the kernel matrix
c) The training data only enter the algorithm through their
individual attributes
d) The number of operations required is not necessarily
proportional to the number of features
View Answer
Answer: c
Explanation: The training data only enter the algorithm
through their entries in the kernel matrix (Gram matrix),
and never through their individual attributes. All three are
the advantages of kernel trick.
8. The Gaussian kernel is also called the RBF kernel, for
Radial Basis Functions.
a) True
b) False
View Answer
Answer: a
Explanation: The Gaussian kernel is also called the RBF kernel, for Radial Basis Functions. The RBF kernel is a kernel that is in the form of a radial basis function (more specifically, a Gaussian function). It is defined as: K_RBF(x, x’) = exp(–γ‖x – x’‖²).
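The RBF kernel can be sketched in the same way; the value of gamma below is an arbitrary example:

import math

def rbf_kernel(x, x_prime, gamma=0.5):
    # K_RBF(x, x') = exp(-gamma * ||x - x'||^2)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([1, 2], [1, 2]))            # 1.0 for identical points
print(round(rbf_kernel([1, 2], [2, 4]), 3))  # exp(-0.5 * 5) = 0.082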
9. Spectrum Kernel count the number of substrings in
common.
a) True
b) False
View Answer
Answer: a
Explanation: Spectrum Kernel counts the number of
substrings in common. It is a kernel since it is a dot product
between vectors of indicators of all the substrings. Other
kernels like: Gaussian kernel is a general-purpose kernel,
Polynomial kernel is popular in image processing and
sigmoid kernel can be used as the proxy for neural networks.
10. Which of the following statements is not true about
kernel trick?
a) It provides a bridge from linearity to non-linearity for any algorithm that can be expressed solely in terms of dot products between two vectors
b) If we first map the input data into a higher-dimensional
space, a linear algorithm operating in this space will behave
non-linearly in the original input space
c) The mapping is always need to be computed
d) If the algorithm can be expressed only in terms of an
inner product between two vectors, all it need is replacing
this inner product with the inner product from some other
suitable space
View Answer
Answer: c
Explanation: The kernel trick is interesting because the mapping never needs to be computed explicitly. That is where the trick resides: wherever a dot product is used, it is replaced with a kernel function. All other three statements describe the kernel trick correctly.
11. Which of the following statements is not true about
kernel properties?
a) Kernel functions must be continuous
b) Kernel functions must be symmetric
c) Kernels which are said to satisfy the Mercer’s theorem are
negative semi-definite
d) Kernel functions most preferably should have a positive
(semi-) definite Gram matrix
View Answer
Answer: c
Explanation: Kernels which are said to satisfy the Mercer’s
theorem are positive semi-definite as there is a property that
kernel functions most preferably should have a positive
(semi-) definite Gram matrix. And positive semi-definite
means that their kernel matrices have only non-negative
Eigen values.
12. Which of the following statements is not true about
choosing the right kernel?
a) A linear kernel allows picking out hyperspheres
b) A polynomial kernel allows modelling feature conjunctions up to the order of the polynomial
c) Radial basis functions allow picking out circles
d) A linear kernel allows picking out lines
View Answer
Answer: a
Explanation: A linear kernel only allows picking out lines (hyperplanes), not hyperspheres (circles). Radial basis functions allow picking out circles, and a polynomial kernel allows modelling feature conjunctions up to the order of the polynomial.
13. As per the given figure Kernel trick illustrates some
fundamental ideas about different ways to represent data
and how machine learning algorithms see these different
data representations.
a) True
b) False
View Answer
Answer: a
Explanation: It is the kernel trick used in an SVM. Implementing support vector classifiers requires specifying a kernel function (Φ). In the picture, data that is not linearly separable is transformed by a kernel function into a high dimensional feature space in which it becomes easily separable. The kernel trick illustrates some fundamental ideas about different ways to represent data.
1. A Support Vector Machine (SVM) is a discriminative
classifier defined by a separating hyperplane.
a) True
b) False
View Answer
Answer: a
Explanation: A Support Vector Machine (SVM) is a
discriminative classifier defined by a separating hyperplane.
Suppose we are given labeled training data, then the
algorithm outputs an optimal hyperplane which categories
new examples. And hyperplane is a line dividing a plane into
two parts where in each class lay in either side.
2. Support vector machines cannot be used for regression.
a) False
b) True
View Answer
Answer: a
Explanation: Support Vector Machine (SVM) is a
classification and regression prediction tool. These are a
popular set of supervised learning algorithms originally
developed for classification (categorical target) problems,
and then extended to regression (numerical target)
problems.
3. Which of the following statements is not true about SVM?
a) It is memory efficient
b) It can address a large number of predictor variables
c) It is versatile
d) It doesn’t require feature scaling
View Answer
Answer: d
Explanation: SVM requires feature scaling, so we have to do
feature scaling of variables before applying SVM. SVMs are
memory efficient, can address a large number of predictor
variables and are versatile since they support a large
number of different kernel functions.
4. Which of the following statements is not true about SVM?
a) It has regularization capabilities
b) It handles non-linear data efficiently
c) It has much improved stability
d) Choosing an appropriate kernel function is easy
View Answer
Answer: d
Explanation: Choosing an appropriate kernel function is not
an easy task. It could be tricky and complex. In case of using
a high dimension kernel, you might generate too many
support vectors which reduce the training speed. All other
three statements are advantages of SVM.
5. Minimizing a quadratic objective function (Σ wᵢ², i = 1 to n) subject to certain constraints in SVM is known as the primal formulation of linear SVMs.
a) True
b) False
View Answer
Answer: a
Explanation: Minimizing a quadratic objective function (Σ wᵢ²) subject to certain constraints in SVM is known as the primal formulation of linear SVMs. It is an SVM optimisation problem: a convex quadratic programming problem with n variables, where n is the number of features in the data set.
6. Given a primal problem f*, minimizing x² subject to x >= b, and a dual problem d*, maximizing d(α) subject to α >= 0. Then d* = f* if f is non convex and x*, α* satisfy zero gradient, primal feasibility, dual feasibility, and complementary slackness.
a) True
b) False
View Answer
Answer: b
Explanation: Given a primal problem f*,
minimizing x2 subject to x >= b and a dual problem d*,
maximizing d(α) subject to α >= 0. Then d* = f* if f is non
convex and x*, α* satisfy zero gradient, primal feasibility,
dual feasibility and complementary slackness. These are the
Karush–Kuhn–Tucker (KKT) conditions.
7. Which of the following statements is not true about dual
formulation in SVM optimisation problem?
a) No need to access data, need to access only dot products
b) Number of free parameters is bounded by the number of
support vectors
c) Number of free parameters is bounded by the number of
variables
d) Regularizing the sparse support vector associated with the
dual hypothesis is sometimes more intuitive than
regularizing the vector of regression coefficients
View Answer
Answer: c
Explanation: In dual formulation in SVM optimisation
problem number of free parameters is bounded not by the
number of variables but by the number of support vectors.
All other three statements are benefits of dual formulation in
SVM optimisation problem.
8. The optimal classifier is the one with the largest margin.
a) True
b) False
View Answer
Answer: a
Explanation: Consider all the samples are correctly
classified, where the data point can be as far from the
decision boundary as possible. Then we introduce the
concept of margin to measure the distance from data
samples to separating hyperplane. So the optimal classifier is
the one with the largest margin.
9. Suppose we have an equality optimization problem as follows: Minimize f(x, y) = x + 2y subject to x² + y² – 4 = 0. While solving the above equation we get x = ±2/√5, y = ±4/√5, λ = ±√5/4. At what values of x and y does the function f(x, y) have its minimum value?
a) –2/√5, –4/√5
b) 2/√5, –4/√5
c) –2/√5, 4/√5
d) 2/√5, 4/√5
View Answer
Answer: a
Explanation: When x = –2/√5, y = –4/√5 and λ = ±√5/4,
f(x, y, λ) = x + 2y + λ(x² + y² – 4)
= –2/√5 – 8/√5 ± (√5/4)(4/5 + 16/5 – 4)
= –10/√5 ± (√5/4)(4 – 4)
= –10/√5 ± (√5/4) * 0
= –10/√5
Similarly, when x = 2/√5, y = 4/√5 and λ = ±√5/4,
f(x, y, λ) = 10/√5
When x = –2/√5, y = 4/√5 and λ = ±√5/4,
f(x, y, λ) = 6/√5
When x = 2/√5, y = –4/√5 and λ = ±√5/4,
f(x, y, λ) = –6/√5
So the function f(x, y) has its minimum value (–10/√5) at x = –2/√5 and y = –4/√5.
10. Suppose we have an equality optimization problem as follows: Minimize f(x, y) = x + y subject to x² + y² – 2 = 0. While solving the above equation what will be the values of x, y and λ?
a) ±1, ±1, ±1/2
b) ±2, ±1, ±1/2
c) ±1, ±2, ±1/2
d) ±1/2, ±1/2, ±1
View Answer
Answer: a
Explanation: We know the Lagrangian L(x, y, λ) = x + y + λ(x² + y² − 2)
∂L/∂x = 1 + 2λx = 0
∂L/∂y = 1 + 2λy = 0
∂L/∂λ = x² + y² − 2 = 0
By solving the above three equations we get x = ±1, y = ±1 and λ = ±1/2.
11. Suppose we have an equality optimization problem as follows: Minimize f(x, y) = x + 2y subject to x² + y² – 9 = 0. What will be the values of x, y and λ?
a) ±3/√5, ±6/√5, ±√5/6
b) ±9/5, ±6/5, ±5/6
c) ±9/√5, ±6/5, ±5/6
d) ±3/5, ±6/5, ±√5/6
View Answer
Answer: a
Explanation: We know the Lagrangian L(x, y, λ) = x + 2y + λ(x² + y² – 9).
∂L/∂x = 1 + 2λx = 0
∂L/∂y = 2 + 2λy = 0
∂L/∂λ = x² + y² – 9 = 0
By solving the above three equations we get x = ±3/√5, y = ±6/√5 and λ = ±√5/6.
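These Lagrangian systems can also be checked symbolically; a sketch using SymPy (assuming it is available) for the constraint x² + y² = 9:

import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
L = x + 2*y + lam*(x**2 + y**2 - 9)

# Set the three partial derivatives of the Lagrangian to zero and solve the system
solutions = sp.solve([sp.diff(L, x), sp.diff(L, y), sp.diff(L, lam)], [x, y, lam])
print(solutions)
# Two solutions: x = ±3/√5, y = ±6/√5 with λ = ∓√5/6, matching the answer above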
1. The goal of a support vector machine is to find the optimal
separating hyperplane which minimizes the margin of the
training data.
a) False
b) True
View Answer
Answer: a
Explanation: The goal of a support vector machine is to find
the optimal separating hyperplane which maximizes the
margin of the training data. So it is based on finding the
hyperplane that gives the largest minimum distance to the
training examples.
2. Which of the following statements is not true about
hyperplane in SVM?
a) If a hyperplane is very close to a data point, its margin
will be small
b) If an hyperplane is far from a data point, its margin will
be large
c) Optimal hyperplane will be the one with the biggest
margin
d) If we select a hyperplane which is close to the data points
of one class, then it generalize well
View Answer
Answer: d
Explanation: If we select a hyperplane which is close to the
data points of one class, then it might not generalize well. If a
hyperplane is very close to a data point, its margin will be
small and if it is far from a data point, its margin will be
large. So the optimal hyperplane is the one with the biggest
margin.
3. Which of the following statements is not true about
optimal separating hyperplane?
a) It correctly classifies the training data
b) It is the one which will generalize better with unseen data
c) Finding the optimal separating hyperplane can be
formulated as a convex quadratic programming problem
d) The optimal hyperplane cannot correctly classifies all the
data while being farthest away from the data points
View Answer
Answer: d
Explanation: The optimal hyperplane correctly classifies all
the data while being farthest away from the data points. So it
correctly classifies the training data and will generalize
better with unseen data. And finding the optimal separating
hyperplane can be formulated as a convex quadratic
programming problem.
4. Support Vector Machines are known as Large Margin
Classifiers.
a) True
b) False
View Answer
Answer: a
Explanation: SVM is a type of classifier which classifies
sample data. And the largest margin is found in order to
avoid overfitting and the optimal hyperplane is at the
maximum distance from the samples. So the margin is
maximized to classify the data points accurately.
5. Which of the following statements is not true about the
role of C in SVM?
a) The C parameter tells the SVM optimisation how much
you want to avoid misclassifying each training example
b) For large values of C, the optimisation will choose a
smaller-margin hyperplane
c) For small values of C, the optimisation will choose a large-
margin hyperplane
d) If we increase margin, it will end up getting a low
misclassification rate
View Answer
Answer: d
Explanation: If we increase margin, it will end up getting a
high misclassification rate. Because the C parameter tells the
SVM optimisation how much you want to avoid
misclassifying each training example. For large values of C,
the optimisation will choose a smaller-margin hyperplane
and vice versa.
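A small scikit-learn sketch of this trade-off (scikit-learn assumed installed; the blob data is synthetic and purely illustrative):

from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Toy two-class data, made up only to illustrate the role of C
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.0, random_state=0)

for C in (0.01, 100):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # A small C tolerates some misclassified points and tends to keep a wider margin;
    # a large C penalises misclassification heavily and tends to narrow the margin.
    print(C, clf.score(X, y), len(clf.support_))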
6. Which of the following statements is not true about large
margin intuition classifier?
a) It has a hyperplane with the maximum margin
b) The hyperplane divides the data properly and is as far as
possible from your data points
c) The hyperplane is close to your data points
d) When new data comes in, even if it is a little closer to the
wrong class than the training points, it will still lie on the
right side of the hyperplane
View Answer
Answer: c
Explanation: The hyperplane is not close to your data points
but is as far as possible from it. In large margin intuition
classifier the hyper plane is with a maximum margin. So
when new data comes in, even if it is a little closer to the
wrong class than the training points, it will still lie on the
right side of the hyperplane.
7. Suppose the optimal separating hyperplane is given by
2x1 + 4x2 + x3 − 4 = 0 and the class labels are +1 and -1. For
the training example (1, 0.5, 1), the class label is -1, and is a
support vector.
a) True
b) False
View Answer
Answer: b
Explanation: Suppose the optimal separating hyperplane is
given by 2x1 + 4x2 + x3 − 4 = 0 and the class labels are +1 and
-1. For the training example (1, 0.5, 1), the class label is +1,
and is a support vector.
Let the training sample be (1, 0.5, 1) and the optimal
separating hyperplane be given by 2x1 + 4x2 + x3 − 4 = 0.
2x1 + 4x2 + x3 − 4 = 2 * 1 + 4 * 0.5 + 1 − 4
= 2 + 2 + 1 – 4
= 5 – 4
= +1
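The sign check above can be reproduced with a short numpy sketch (numpy assumed available):

import numpy as np

w, b = np.array([2.0, 4.0, 1.0]), -4.0   # hyperplane 2x1 + 4x2 + x3 - 4 = 0
x = np.array([1.0, 0.5, 1.0])
print(np.dot(w, x) + b)                  # 1.0, so the predicted label is +1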
8. The optimum separation hyperplane (OSH) is the linear
classifier with the minimum margin.
a) True
b) False
View Answer
Answer: b
Explanation: The optimum separation hyperplane (OSH) is
the linear classifier with the maximum margin for a given
finite set of learning patterns. To find the OSH draw convex
hull around each set of points and find the shortest line
segment connecting two convex hulls. Find midpoint of line
segment and the optimal hyperplane is perpendicular to
segment at midpoint of line segment.
9. SVM finds out the probability value.
a) True
b) False
View Answer
Answer: b
Explanation: SVM does not find out the probability value.
Suppose you are given a set of training examples, each
marked as belonging to one of two categories, an SVM
training algorithm builds a model that assigns new examples
into one category or the other, making it a non-probabilistic
binary classifier.
10. The given figure shows some data points classified by an
SVM classifier, and the bold line at the center represents the
optimal hyperplane. What is the perpendicular distance
between the two dashed lines, shown by a double-arrow
line, known as?
a) Maximum margin
b) Minimum margin
c) Support vectors
d) Hyperplane
View Answer
Answer: a
Explanation: The operation of the SVM algorithm is based
on finding the optimal hyperplane. Therefore, the optimal
separating hyperplane maximizes the margin of the training
data. And hence the distance between the two dashed lines
is known as the maximum margin.
11. What is the leave-one-out cross-validation error estimate
for maximum margin separation in the following figure?
a) Zero
b) Maximum
c) Minimum
d) Half of the previous error value
View Answer
Answer: a
Explanation: From the figure we can see that removing any
single point would not change the resulting maximum
margin separator. Here all the points are initially classified
correctly, so the leave-one-out error is zero.
1. In SVM the distance of the support vector points from the
hyperplane are called the margins.
a) True
b) False
View Answer
Answer: a
Explanation: The SVM is based on the idea of finding a
hyperplane that best separates the features into different
domains. And the points closest to the hyperplane are called
as the support vector points and the distance of the vectors
from the hyperplane are called the margins.
2. If the support vector points are farther from the
hyperplane, then this hyperplane can also be called as
margin maximizing hyperplane.
a) True
b) False
View Answer
Answer: a
Explanation: In SVM, the farther the support vector points
are from the hyperplane, the more this hyperplane can be
called a margin maximizing hyperplane, and the
probability of correctly classifying the points in their
respective regions or classes is high.
3. Which of the following statements is not true about the C
parameter in SVM?
a) Large values of C give solutions with less misclassification
errors
b) Large values of C give solutions with smaller margin
c) Small values of C give solutions with bigger margin
d) Small values of C give solutions with less classification
errors
View Answer
Answer: d
Explanation: Small values of C give solutions with more
classification errors but a bigger margin. So it focuses more
on finding a hyperplane with a big margin. And large values
of C give solutions with less misclassification errors but a
smaller margin.
4. Which of the following statements is not true about
margin in SVM?
a) The margin of a hyperplane with respect to a training set
is defined to be the minimal distance between a point in the
training set and the hyperplane
b) The margin of a hyperplane with respect to a training set
is defined to be the maximum distance between a point in the
training set and the hyperplane
c) If a hyperplane has a large margin, then it will still
separate the training set even if we slightly disturb each
instance
d) True error of a half space can be bounded in terms of the
margin that it has over the training sample
View Answer
Answer: b
Explanation: The margin of a hyperplane with respect to a
training set is defined to be not the maximum but the
minimal distance between a point in the training set and the
hyperplane. So if a hyperplane has a large margin, then it
will still separate the training set even if we slightly disturb
each instance. And the true error of a half space can be
bounded in terms of the margin it has over the training
sample.
5. The maximum margin linear classifier is the linear
classifier with the maximum margin.
a) True
b) False
View Answer
Answer: a
Explanation: The maximum margin linear classifier is the
linear classifier with the maximum margin. And these kinds
of SVMs are called Linear SVM (LSVM). Support vectors
are those data points that the margin pushes up against.
6. Which of the following statements is not true about
maximum margin?
a) It is safe and empirically works well
b) It is not sensitive to removal of any non support vector
data points
c) If the location of the boundary is not perfect due to noise,
this gives us the least chance of misclassification
d) It is not immune to removal of any non-support-vector
data points
View Answer
Answer: d
Explanation: The maximum margin is immune to removal of
any non support vector data points. It is safe and empirically
works well. So even If we have made a small error in the
location of the boundary (imperfect location of the
boundary) this gives us least chance of causing a
misclassification.
7. Hard SVM is the learning rule that returns an ERM
hyperplane separating the training set with the largest
possible margin.
a) True
b) False
View Answer
Answer: a
Explanation: Hard-SVM is the learning rule that returns an
ERM hyperplane separating the training set with the
largest possible margin. Here the margin of an ERM
hyperplane with respect to a training set is defined to be the
minimal distance between a point in the training set and the
ERM hyperplane.
8. The output of hard-SVM is the separating hyperplane
with the largest margin.
a) True
b) False
View Answer
Answer: a
Explanation: The output of hard-SVM is the separating
hyperplane with the largest margin and it seeks for the
separating plane with the largest margin. Hard-SVM works
on separable problems and it finds the linear predictor with
the maximal margin on the training sample.
9. Assume that we are training an SVM with quadratic
kernel. Given figure shows a dataset and the decision
boundary will be the one with maximum curvature for very
large values of C as shown in figure.
a) True
b) False
View Answer
Answer: b
Explanation: The slack penalty C will determine the location
of the separating parabola. When C is too large, we can’t
afford any misclassification. And hence among all the
parabolas, it chooses the minimum curvature one. So the
decision boundary will be the one with minimum curvature
as shown below.
a) True
b) False
View Answer
Answer: b
Explanation: The slack penalty C will determine the location
of the separating parabola. When the penalty for
misclassification is too small (C = 0) the decision boundary
will be linear. So the decision boundary will be like as shown
below.
11. The given figure shows the hard margin while classifying
a set of data points using SVM.
a) True
b) False
View Answer
Answer: a
Explanation: The given figure shows the hard margin while
classifying a set of data points using SVM. Here all the
points are correctly classified. And the hard margin
maximizes margin between separating hyperplane.
1. The Soft SVM assumes that the training set is linearly
separable.
a) True
b) False
View Answer
Answer: b
Explanation: The Soft SVM does not assume that the training
set is linearly separable. But the Hard SVM assumes that the
training set is linearly separable. And Soft SVM can be
applied even if the training set is not linearly separable.
2. Soft SVM is an extended version of Hard SVM.
a) True
b) False
View Answer
Answer: a
Explanation: Soft SVM is an extended version of Hard
SVM. Hard SVM can work only when data is completely
linearly separable without any errors (noise or outliers). If
there are errors then either the margin is smaller or hard
margin SVM fails. And Soft SVM was proposed to solve this
problem by introducing slack variables.
3. Linear Soft margin SVM can only be used when the
training data are linearly separable.
a) True
b) False
View Answer
Answer: b
Explanation: Linear soft-margin SVM is not limited to
linearly separable training data; only Hard SVM requires
that, because linear separability of the training data is a
strong assumption in Hard SVM. Soft SVM can be applied
even if the training set is not linearly separable.
4. Given a two-class classification problem with data points
x1 = -5, x2 = 3, x3 = 5, having class label +1 and x4 = 2 with
class label -1. The problem can be solved using Soft SVM.
a) True
b) False
View Answer
Answer: a
Explanation: The given problem is a one dimensional two-
class classification problem. Here the points x1, x2, and
x3 have class labels +1 and x4 has class label -1. And the
dataset is not linearly separable, so we can use Soft SVM to
solve this classification problem.
5. Given a two-class classification problem with data points
x1 = -5, x2 = 3, x3 = 5, having class label +1 and x4 = 2 with
class label -1. The problem can never be solved using Hard
SVM.
a) True
b) False
View Answer
Answer: b
Explanation: The given problem is a one dimensional two-
class classification problem and the data points are non-
linearly separable. So the problem cannot be solved by the
Hard SVM directly. But it can be solved using Hard SVM if
the one dimensional data set is transformed into a 2-
dimensional dataset using some function like (x, x²). Then
the problem is linearly separable and can be solved by Hard
SVM.
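A minimal sketch of the (x, x²) mapping mentioned above (numpy assumed):

import numpy as np

x = np.array([-5.0, 3.0, 5.0, 2.0])      # data points; labels: +1, +1, +1, -1
y = np.array([1, 1, 1, -1])
phi = np.column_stack([x, x**2])         # map each point to (x, x^2)
print(phi)
# In the (x, x^2) plane the three positive points have x^2 >= 9 while the
# negative point has x^2 = 4, so a horizontal line such as x^2 = 6.5 separates them
# and Hard SVM becomes applicable.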
6. Based on the decision tree shown below, what are the
commute time predictions for person A and person B respectively?
a) LONG, LONG
b) LONG, SHORT
c) SHORT, LONG
d) SHORT, SHORT
View Answer
Answer: c
Explanation: Given figure shows a decision tree. And person
A starts driving at 8:30 AM and there is no traffic. So he will
commute in SHORT time. At the same time person B starts
driving at 10 AM and there was an accident on the road. So
he will commute for a LONG time.
5. If the splitting rule at internal nodes of the tree is based on
thresholding the value of a single feature, it follows that a
tree with k leaves can shatter a set of k instances.
a) False
b) True
View Answer
Answer: b
Explanation: Here the splitting rule at internal nodes of the
tree is based on thresholding the value of a single feature; it
follows that a tree with k leaves can shatter a set
of k instances. Hence, if we allow decision trees of arbitrary
size, we obtain a hypothesis class of infinite VC dimension
and this approach can easily lead to overfitting.
6. Minimum description length (MDL) principle is used to
avoid overfitting in decision trees.
a) True
b) False
View Answer
Answer: a
Explanation: MDL procedures automatically and inherently
protect against overfitting and can be used to estimate both
the parameters and the structure of a model. Hence MDL
principle is used to avoid overfitting in decision trees and
aim at learning a decision tree that on one hand fits the data
well while on the other hand is not too large.
7. Suppose in a decision tree, we are making some
simplifying assumptions that each instance is a vector of d
bits (X = {0, 1}^d). Which of the following statements is not
true about the above situation?
a) Thresholding the value of a single feature corresponds to
a splitting rule of the form 1[xi = 1] for some i ∈ [d]
b) The hypothesis class becomes finite, but is still very large
c) Any classifier from {0, 1}^d to {0, 1} can be represented by a
decision tree with 2^d leaves and depth of d + 1
d) Any classifier from {0, 1}^d to {0, 1} can be represented by a
decision tree with 2^(d+1) leaves and depth of d + 1
View Answer
Answer: d
Explanation: Given the simplifying assumptions, any
classifier from {0, 1}^d to {0, 1} can be represented by a
decision tree not with 2^(d+1) leaves but with 2^d leaves and
depth of d + 1. And here the hypothesis class becomes finite,
but is still very large.
8. What does it mean that the VC dimension of a class is 2^d?
a) The number of examples needed to PAC learn the
hypothesis class grows with 2^d
b) The number of examples needed to PAC learn the
hypothesis class grows with 2^(d+1)
c) The number of examples needed to PAC learn the
hypothesis class grows with 2^(d-1)
d) The number of examples needed to PAC learn the
hypothesis class grows with 2^(d+1)
View Answer
Answer: a
Explanation: Suppose in a decision tree we are making some
simplifying assumptions that each instance is a vector of d
bits (X = {0, 1}^d). Then the VC dimension of the class is 2^d,
which means that the number of examples we need to PAC
learn the hypothesis class grows with 2^d. Unless d is very
small, this is a huge number of examples.
9. Consider the dataset given below where T and F represent
True and False respectively. What is the entropy H (Rain)?
Temperature Cloudy Rain
Low T
Low T
Medium T
Medium T
High T
High F
(The Rain column values are truncated in this copy; three of
the six examples have Rain = T and three have Rain = F.)
a) 1
b) 0.5
c) 0.2
d) 0.6
View Answer
Answer: a
Explanation: We know entropy H = −∑i Pi log2 Pi.
Entropy = – (3/6) * log2 (3/6) – (3/6) * log2 (3/6)
= – (1/2) * log2 (1/2) – (1/2) * log2 (1/2)
= – 0.5 * (-1) – 0.5 * (-1)
= 0.5 + 0.5
= 1
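The same entropy value can be checked with a short helper (a sketch):

from math import log2

def entropy(counts):
    # Shannon entropy of a label distribution given as class counts
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(entropy([3, 3]))   # 1.0 : three Rain = T and three Rain = F examples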
10. What is the entropy represented by the following figure?
A B A AND B
F F F
F T F
T T T
T F F
a) 0.5
b) -0.5
c) 1
d) -1
View Answer
Answer: c
Explanation: We know the entropy E = -p log2p – q log2q.
Here p = 0.5 and q = 1 – p = 1 – 0.5 = 0.5. So we have p = 0.5
and q = 0.5.
Entropy = (-0.5 * log2 0.5) – (0.5 * log2 0.5)
= (-0.5 * -1) – (0.5 * -1)
= 0.5 + 0.5
=1
14. Given entropy of parent = 1, weighted averages = (3/4, 1/4)
and entropy of children = (0.9, 0). What is the information
gain?
a) 0.675
b) 0.75
c) 0.325
d) 0.1
View Answer
Answer: c
Explanation: We know Information Gain = Entropy
(Parent) – ∑ (weighted average * entropy (Child)).
Information Gain = 1 – (3/4 * 0.9 + 1/4 * 0)
= 1 – (0.675 + 0)
= 1 – 0.675
= 0.325
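The information-gain arithmetic above as a tiny helper (a sketch):

def information_gain(parent_entropy, weights, child_entropies):
    # IG = H(parent) - sum_i w_i * H(child_i)
    return parent_entropy - sum(w * h for w, h in zip(weights, child_entropies))

print(information_gain(1.0, [3/4, 1/4], [0.9, 0.0]))   # 0.325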
1. In the ID3 algorithm the returned tree will usually be very
large.
a) True
b) False
View Answer
Answer: a
Explanation: In the ID3 algorithm the returned tree will
usually be very large. Such trees may have low empirical
risk, but their true risk will tend to be high. One solution is
to limit the number of iterations of ID3, leading to a tree
with a bounded number of nodes.
2. Pruning a tree reduces it to a much smaller tree.
a) True
b) False
View Answer
Answer: a
Explanation: Pruning a tree will reduce it to a much smaller
tree, but still with a similar empirical error. So pruning is
the process of adjusting the decision tree to minimize the
misclassification error.
3. Pruning can only be performed by a bottom up walk on
the decision tree.
a) True
b) False
View Answer
Answer: b
Explanation: Pruning can occur in a top down or bottom up
fashion. Usually, the pruning is performed by a bottom-up
walk on the tree. Each node might be replaced with one of its
subtrees or with a leaf. But there are situations where the
top down pruning is also used.
4. Which of the following statements is not true about
Pruning?
a) It removes the sections of the tree that provide little power
to classify instances
b) It is a technique in machine learning and search
algorithms to reduce the size of the decision trees
c) It increases the complexity of the final classifier
d) It improves the predictive accuracy by the reduction of
overfitting
View Answer
Answer: c
Explanation: Pruning reduces the complexity of the final
classifier and improves the predictive accuracy by the
reduction of overfitting. It is a technique in machine learning
and search algorithms to reduce the size of the decision
trees. And it removes the sections of the tree that provide
little power to classify instances.
5. Which of the following statements is not true about
Pruning?
a) It reduces the size of learning tree without reducing
predictive accuracy
b) It will not optimise the performance of the tree
c) Top down pruning will traverse nodes and trim subtrees
starting at the root
d) Bottom up pruning will traverse nodes and trim subtrees
starting at the leaf nodes
View Answer
Answer: b
Explanation: Pruning will optimise the performance of the
tree and it reduces the size of the learning tree without
reducing predictive accuracy. Bottom up pruning will
traverse nodes and trim subtrees starting at the leaf nodes
and Top down pruning starting at the root.
6. Which of the following is not a Pruning technique?
a) Cost based pruning
b) Cost complexity pruning
c) Minimum error pruning
d) Maximum error pruning
View Answer
Answer: d
Explanation: Maximum error pruning is not a pruning
technique. Cost based pruning, Cost complexity pruning,
and Minimum error pruning are the three popular pruning
techniques in Decision trees.
7. Which of the following statements is not true about the
pruning in the decision tree?
a) When the decision tree is created, many of the branches
will reflect anomalies in the training data due to noise
b) Overfitting happens when the learning algorithm
continues to develop hypotheses that reduce training set
error at the cost of an increased test set error
c) It optimises the computational efficiency
d) It reduces the classification accuracy
View Answer
Answer: d
Explanation: Pruning in decision trees improves the
classification accuracy and optimises computational
efficiency. When the decision tree is created, many of the
branches will reflect anomalies in the training data due to
noise. And over-fitting happens when the learning algorithm
continues to develop hypotheses that reduce training set
error at the cost of an increased test set error.
8. Post pruning is also known as backward pruning.
a) True
b) False
View Answer
Answer: a
Explanation: Post-pruning is also known as backward
pruning. Here generate the decision tree and then remove
non-significant branches. It allows the tree to perfectly
classify the training set, and then post prune the tree.
9. Which of the following statements is not true about Post
pruning?
a) It begins by generating the (complete) tree and then
adjust it with the aim of improving the classification
accuracy on unseen instances
b) It begins by converting the tree to an equivalent set of
rules
c) It would not overfit trees
d) It converts a complete tree to a smaller pruned one which
predicts the classification of unseen instances at least as
accurately
View Answer
Answer: c
Explanation: Post-pruning overfit trees in a more successful
way because it is not easy to precisely estimate when to stop
growing the tree. It begins by generating the (complete) tree
and then adjusting it with the aim of improving the
classification accuracy on unseen instances. The other two
statements are the two principal methods of doing this.
10. Which of the following statements is not true about
Reduced error pruning?
a) It is the simplest and most understandable method in
decision tree pruning
b) It considers each of the decision nodes in the tree to be
candidates for pruning, consist of removing the subtree
rooted at that node, making it a leaf node
c) If the error rate of the new tree would be equal to or
smaller than that of the original tree and that subtree
contains no subtree with the same property, then subtree is
replaced by leaf node
d) If the error rate of the new tree would be greater than
that of the original tree and that subtree contains no subtree
with the same property, then subtree is replaced by leaf
node, means pruning is done
View Answer
Answer: d
Explanation: If the error rate of the new tree would be
greater than that of the original tree and that subtree
contains no subtree with the same property, then subtree is
replaced by leaf node, meaning no pruning is done. All other
three statements are true about Reduced error pruning.
1. Which of the following statements is not an advantage of
Reduced error pruning?
a) Linear computational complexity
b) Over pruning
c) Simplicity
d) Speed
View Answer
Answer: b
Explanation: Over pruning is a disadvantage of Reduced
error pruning. When the test set is much smaller than the
training set, it may lead to over pruning. And the advantage
of this method is its linear computational complexity,
simplicity and speed.
2. Minimum error pruning is a Top down approach.
a) True
b) False
View Answer
Answer: b
Explanation: Minimum error pruning is not a Top down
approach. It is a bottom-up approach which seeks a single
tree that minimizes the expected error rate on an
independent data set. The tree is pruned back to the point
where the cross-validated error is a minimum.
3. Which of the following statements is not a step in
Minimum error pruning?
a) At each non leaf node in the tree, calculate expected error
rate if that subtree is pruned
b) Calculate the expected error rate for that node if subtree
is not pruned
c) If pruning the node leads to greater expected error rate,
then keep the subtree
d) If pruning the node leads to smaller expected error rate,
then don’t prune it
View Answer
Answer: d
Explanation: In Minimum error pruning if pruning the node
leads to smaller expected error rate, then prune it. Here at
each non leaf node in the tree, calculate expected error rate
if that subtree is pruned otherwise calculate the expected
error rate for that node if subtree is not pruned. And if
pruning the node leads to greater expected error rate, then
keep the subtree (no pruning).
4. Pre pruning is also known as online pruning.
a) True
b) False
View Answer
Answer: a
Explanation: Pre-pruning is also known as forward
pruning or online pruning. Pre-pruning prevents the
generation of non-significant branches. It prevents
overfitting by trying to stop the tree-building process early,
before it produces leaves with very small samples.
5. Which of the following statements is not a step in Pre
pruning?
a) Pre – pruning a decision tree involves using a termination
condition to decide when to terminate some of the branches
prematurely as the tree is generated
b) When constructing the tree, some significant measures
can be used to assess the goodness of a split
c) High threshold results in oversimplified trees
d) If partitioning the tuples at a node would result the split
that falls below a pre specified threshold, then further
partitioning of the given subset is expanded
View Answer
Answer: d
Explanation: If partitioning the tuples at a node would result
in the split that falls below a pre specified threshold, then
further partitioning of the given subset is halted otherwise it
is expanded. That is a high threshold result in oversimplified
trees, and low threshold result in very little simplification.
6. Minimum number of objects pruning is a Post pruning
technique.
a) True
b) False
View Answer
Answer: b
Explanation: Minimum number of objects pruning is not a
post pruning technique but it is a pre pruning technique. In
this method of pruning, the minimum number of objects is
specified as a threshold value. And there is one parameter
minobj which is set to specify threshold value.
7. Which of the following statements is not true about
Minimum number of objects pruning?
a) Whenever the split is made which yields a child leaf that
represents less than minobj from the data set, the parent
node and children node are compressed to a single node
b) Increasing no of objects increases accuracy of the dataset
c) Increasing no of objects simplifies the tree
d) The different ranges of the minimum no of objects are set
for few examples and tested for accuracy
View Answer
Answer: b
Explanation: In Minimum number of object pruning
increasing no of objects reduces accuracy of the dataset, but
it simplifies the tree. Whenever the split is made which yields
a child leaf that represents less than minobj from the data
set, the parent node and children node are compressed to a
single node.
8. Which of the following is not a Post pruning technique?
a) Reduced error pruning
b) Error complexity pruning
c) Minimum error pruning
d) Chi – square pruning
View Answer
Answer: d
Explanation: Chi-square pruning is not a post pruning
technique but a pre pruning technique. It converts
decision trees to a set of rules and eliminates variable values
in rules which are independent of the label using the
chi-square test for independence. It then simplifies the rule
set by eliminating unnecessary rules.
9. Which of the following is not a Post pruning technique?
a) Pessimistic error pruning
b) Iterative growing and pruning
c) Reduced error pruning
d) Early stopping pruning
View Answer
Answer: d
Explanation: Early stopping pruning is also known as Pre
pruning. So it is not a post pruning technique. To prevent
overfitting it tries to stop the tree-building process early,
before it produces leaves with very small samples. This
heuristic is also known as pre-pruning of decision trees.
10. Consider we have a set of data with 3 classes, and we
have observed 20 examples of which the greatest number 15
is in class c. If we predict that all future examples will be in
class c, what is the expected error rate using minimum error
pruning?
a) 0.304
b) 0.5
c) 0.402
d) 0.561
View Answer
Answer: a
Explanation: The expected error rate Ek = (n – nc + k – 1) / (n + k).
Given n = 20, nc = 15 and k = 3.
Expected error rate Ek = (20 – 15 + 3 – 1) / (20 + 3)
= 7/23
= 0.304
11. Consider we have a set of data with 3 classes, and we
have observed 20 examples of which the greatest number 15
is in class c. If we predict that all future examples will be in
class c, what is the expected error rate without pruning?
a) 0.22
b) 0.17
c) 0.15
d) 0.05
View Answer
Answer: a
Explanation: Given n = 20, nc = 15 and k = 3. Then without
pruning the Expected error rate Ek will be:
Expected error rate Ek = ((n – k)/n) * ((n – nc – 1)/n) + (k/n) * ((k – 1)/(2k))
= ((20 – 3)/20) * ((20 – 15 – 1)/20) + (3/20) * ((3 – 1)/(2 * 3))
= (17/20) * (4/20) + (3/20) * (2/6)
= 68/400 + 6/120
= 0.17 + 0.05
= 0.22
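Both expected-error figures can be reproduced directly (a sketch of the formulas exactly as used above):

n, nc, k = 20, 15, 3

# Expected error if we prune and predict the majority class c
pruned = (n - nc + k - 1) / (n + k)

# Expected error without pruning, as computed in the explanation above
unpruned = ((n - k) / n) * ((n - nc - 1) / n) + (k / n) * ((k - 1) / (2 * k))

print(round(pruned, 3), round(unpruned, 2))   # 0.304 0.22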
12. Consider the example: the number of corrected
misclassifications at a particular node is n'(t) = 15.5, and the
number of corrected misclassifications for its sub-tree is
n'(Tt) = 12. N(t) is the number of training set examples at
node t and is equal to 35. Here the tree should be pruned.
a) True
b) False
View Answer
Answer: b
Explanation: We know the standard error
SE = sqrt( n'(Tt) * (N(t) – n'(Tt)) / N(t) )
= sqrt( 12 * (35 – 12) / 35 )
= sqrt( 12 * 23 / 35 )
= 2.8
Since 12 + 2.8 = 14.8, which is less than 15.5, the sub-tree
should be kept and not pruned.
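The pruning test above, written out as a short sketch:

from math import sqrt

n_node, n_subtree, N = 15.5, 12, 35          # corrected misclassifications and sample count at node t

se = sqrt(n_subtree * (N - n_subtree) / N)   # standard error of the sub-tree estimate
print(round(se, 1))                          # 2.8
print(n_subtree + se < n_node)               # True: keep the sub-tree, do not prune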
1. Which of the following statements is not true about
Decision trees?
a) It builds classification models in the form of a tree
structure
b) It builds regression models in the form of a tree structure
c) The final result is a tree with decision nodes and leaf
nodes
d) It never breaks down a dataset into smaller subsets with
increase in depth of tree
View Answer
Answer: d
Explanation: The decision tree breaks down a dataset into
smaller subsets with increase in depth of tree. And it builds
both classification and regression models in the form of a
tree structure.
2. Splitting is the process of dividing a node into two or more
sub-nodes.
a) True
b) False
View Answer
Answer: a
Explanation: Splitting in decision tree is the process of
dividing a node into two or more sub-nodes. When a sub-
node splits into further sub-nodes, then it is called decision
node. And the nodes that do not split are called leaf or
terminal nodes.
3. Real valued features problems in decision trees cannot be
solved using ID3 algorithm.
a) True
b) False
View Answer
Answer: b
Explanation: Real valued features problem in decision tree
cannot be solved directly. But it can be solved by converting
it into a binary feature value problem using threshold based
splitting rules. Then we can solve this problem using ID3.
4. Which of the following statements is not true about
reducing a real valued feature problem into binary feature?
a) It utilizes the threshold based splitting rules
b) Once the binary features are constructed the ID3
algorithm can be easily applied
c) After ID3 is applied, it is easy to verify that there exists a
decision tree with different training error
d) After ID3 is applied, it is easy to verify that there exists a
decision tree with the same number of nodes
View Answer
Answer: c
Explanation: Once the real valued features are reduced to
binary features then we can apply ID3. And it is easy to
verify that for any decision tree with threshold based
splitting rules over the original real valued features that
there exists a decision tree over the constructed binary
features with the same training error and the same number
of nodes.
5. If the original number of real valued features is d and the
number of examples is m, then which of the following
statements is not true?
a) The number of constructed binary features becomes dm
b) Calculating the Gain of each feature might
take O(dm^2) operations
c) With a more improved implementation the run time can be
reduced to O(dm log(m))
d) The constructed binary features are d^m
View Answer
Answer: d
Explanation: If the original number of real valued features
is d and the number of examples is m, then the number of
constructed binary features becomes dm, not d^m. And here
calculating the Gain of each feature might
take O(dm^2) operations. But with a more improved
implementation the run time can be reduced
to O(dm log(m)).
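A minimal sketch of this threshold construction (numpy assumed; using the observed values themselves as thresholds, which is enough to illustrate the dm count):

import numpy as np

X = np.array([[2.3, 0.1],
              [1.7, 5.0],
              [4.2, 3.3]])                  # m = 3 examples, d = 2 real-valued features

binary_features = []
for i in range(X.shape[1]):                 # for each original real-valued feature
    for theta in np.unique(X[:, i]):        # one candidate threshold per observed value
        binary_features.append((X[:, i] < theta).astype(int))

print(len(binary_features))                 # at most d*m binary features (here 6)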
1. Inductive bias is also known as learning bias.
a) True
b) False
View Answer
Answer: a
Explanation: Inductive bias is also known as learning bias
and it is related with the learning algorithms. It is a set of
assumptions that the learner uses to predict outputs for the
given inputs that has not been encountered.
2. Which of the following statements is not true about the
Inductive bias in the decision tree?
a) It is harder to define because of heuristic search
b) Trees that place high information gain attributes close to
the root are preferred
c) Trees that place high information gain attributes far away
from the root are preferred
d) Shorter trees are preferred over longer ones
View Answer
Answer: c
Explanation: Here the trees that place high information gain
attributes close to the root are preferred over those that do
not. And it is harder to define because of heuristic search. It
prefers the shorter trees over longer trees.
3. According to Occam’s Razor, which of the following
statements is not favorable to short hypotheses?
a) It is good to use fewer short hypotheses than long
hypotheses
b) A short hypothesis that fits the data is unlikely to be a
coincidence
c) A long hypothesis that fits the data might be a coincidence
d) There are many ways to define small set of hypotheses
View Answer
Answer: d
Explanation: Occam’s Razor is the problem solving
principle that prefers the simplest hypotheses that fits the
data. And the argument opposed is that: there are many
ways to define a small set of hypotheses. All other three
statements are in favor of the short hypotheses.
4. Which of the following statements are not true about
Inductive bias in ID3?
a) It is the set of assumptions that along with the training
data justify the classifications assigned by the learner to
future instances
b) ID3 has preference of short trees with high information
gain attributes near the root
c) ID3 has preference for certain hypotheses over others,
with no hard restriction on the hypotheses space
d) ID3 has preference of long trees with high information
gain attributes far away from the root
View Answer
Answer: d
Explanation: ID3 prefers the short trees and not the long
trees. It has preference of short trees with high information
gain attributes near the root and for certain hypotheses over
others, with no hard restriction on the hypotheses space.
And it is the set of assumptions that along with the training
data justify the classifications assigned by the learner to
future instances.
5. Which of the following statements is not true about ID3?
a) ID3 searches incompletely through the hypotheses space,
from simple to complex hypotheses, until its termination
condition is met
b) Its inductive bias is solely a consequence of the ordering
of hypotheses by its search strategy
c) Its hypothesis space introduces additional bias in its each
iteration
d) Its hypothesis space introduces no additional bias
View Answer
Answer: c
Explanation: In ID3 its hypothesis space introduces no
additional bias and it is solely a consequence of the ordering
of hypotheses by its search strategy. And ID3 searches
incompletely through this space, from simple to complex
hypotheses, until its termination condition is met.
6. Which of the following statements is true about Candidate
elimination?
a) Candidate elimination searches the hypotheses
completely, and finding every hypothesis consistent with the
training data
b) Its inductive bias is solely a consequence of the ordering
of hypotheses by its search strategy
c) Its inductive bias is solely a consequence of the expressive
power of its hypothesis representation
d) Its search strategy introduces no additional bias
View Answer
Answer: b
Explanation: Its inductive bias is not solely a consequence of
the ordering of hypotheses by its search strategy but is solely
a consequence of the expressive power of its hypothesis
representation. All other statements are true about inductive
bias in Candidate elimination.
7. Preference bias is more desirable than a restriction bias.
a) True
b) False
View Answer
Answer: a
Explanation: Preference bias is more desirable than a
restriction bias because it allows the learner to work within a
complete hypothesis space that is assured to contain the
unknown target function. So Preference bias is more
desirable than a restriction bias (language bias).
8. Preference bias is also known as search bias.
a) True
b) False
View Answer
Answer: a
Explanation: Preference bias is also known as search bias. It
is used when a learning algorithm incompletely searches a
complete hypothesis space. It chooses which part of the
hypothesis space to search. A decision tree is an example.
1. Categorical Variable Decision tree has a categorical target
variable.
a) True
b) False
View Answer
Answer: a
Explanation: Decision tree is an algorithm having a
predefined target variable that is mostly used in
classification problems. If the target variable is a categorical
target variable then such type of classification tree is known
as Categorical Variable Decision tree.
2. Which of the following statements is not true about the
Classification tree?
a) It is used when the dependent variable is categorical
b) It divides the predictor space into distinct and non
overlapping regions
c) It divides the independent variables into distinct and non
overlapping regions
d) It is used when the dependent variable is continuous
View Answer
Answer: d
Explanation: Classification trees are used when the
dependent variable is categorical not continuous. And it
divides the predictor space (independent variables) into
distinct and non overlapping regions.
3. In Classification trees the value obtained by terminal node
in the training data is the mode of observations falling in
that region.
a) True
b) False
View Answer
Answer: a
Explanation: In Classification trees the value obtained by
terminal node in the training data is the mode of
observations falling in that region. And this value obtained
by terminal node is known as the class. So if an unseen data
observation falls in that region, it will make its prediction
with mode value.
4. Classification trees follow a top-down greedy approach.
a) True
b) False
View Answer
Answer: a
Explanation: Classification tree follows a top-down greedy
approach known as recursive binary splitting. It begins from
the top of tree when all the observations are available in a
single region and successively splits the predictor space into
two new branches down the tree. It is known as greedy
because the algorithm cares only about the current split, and
not about future splits which will lead to a better tree.
5. Which of the following statements is not true about
Classification trees?
a) It labels, records, and assigns variables to discrete classes
b) It can also provide a measure of confidence that the
classification is correct
c) It is built through a process known as binary recursive
partitioning
d) It will always look for the best variable available in the
future splits for a better tree
View Answer
Answer: d
Explanation: In classification trees it will always look for the
best variable available in the current split and not in the
future splits for a better tree and it is built through a process
known as binary recursive partitioning. It labels, records,
and assigns variables to discrete classes and it can also
provide a measure of confidence that the classification is
correct.
6. Which of the following statements are not true about the
Classification trees?
a) The target variable can take a discrete set of values
b) The leaves represent class labels
c) The branches represent conjunctions of features
d) The target variable can take real numbers
View Answer
Answer: d
Explanation: In classification trees, the target variable
cannot take real numbers but can take only a discrete set of
values. Here the leaves represent class labels and the
branches represent conjunctions of features that will lead to
those class labels.
7. Which of the following statements is not true about
CART?
a) It is used for generating regression tree
b) It is used for generating classification tree
c) It is used for binary classification
d) It always uses Gini index as cost function to evaluate split
in feature selection
View Answer
Answer: d
Explanation: It uses Gini index as a cost function to evaluate
split in feature selection in case of classification tree and it
uses least square as a metric to select features in case of
Regression tree. So it is used for generating both
classification and regression trees. And it is used for binary
classification also.
8. From the below table where the target is to predict play or
not (Yes or No) based on weather condition, what is the Gini
index for Climate = Sunny?
Day Climate Temperature
1 Sunny Cool
2 Sunny Hot
3 Rainy Medium
4 Winter Cool
5 Rainy Cool
6 Winter Cool
7 Sunny Hot
(The remaining columns, including the Play target, are
truncated in this copy.)
a) 0.45
b) 0.49
c) 0.47
d) 0.43
View Answer
Answer: a
Explanation: From the given table we have:
Climate Yes No
Sunny 1 2
Gini (Climate = Sunny) = 1 – (1/3)^2 – (2/3)^2 = 1 – 0.11 – 0.44 ≈ 0.45
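As a quick cross-check, a minimal Gini helper (a sketch in Python):

def gini(counts):
    # Gini impurity of a node given its class counts
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(round(gini([1, 2]), 2))   # 0.44, i.e. approximately the 0.45 quoted above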
a) True
b) False
View Answer
Answer: a
Explanation: From the given table we have:
Climate Yes No
Rainy 1 1
Winter 1 1
Gini (Climate = Rainy) = Gini (Climate = Winter) = 1 – (1/2)^2 – (1/2)^2 = 0.5
Day Climate Temperature
1 Sunny Medium
2 Sunny Hot
3 Rainy Medium
4 Winter Cool
5 Rainy Cool
6 Winter Cool
7 Sunny Hot
a) 0.43
b) 0.45
c) 0.48
d) 0.5
View Answer
Answer: c
Explanation: From the table we have:
Temperature Yes No
Hot 1 1
Cool 1 2
Medium 1 1
Weighted Gini (Temperature) = (2/7) * 0.5 + (3/7) * 0.44 + (2/7) * 0.5 ≈ 0.48
What is the Gini index for the Wind feature (four examples
have Strong wind with 1 Yes and 3 No; three have Weak
wind with 2 Yes and 1 No)?
a) 0.41
b) 0.43
c) 0.45
d) 0.47
View Answer
Answer: a
Explanation: We know the Gini index for the Wind feature
is the weighted sum of the Gini indices of its values. From
the table we have:
Wind Yes No
Strong 1 3
Weak 2 1
Gini (Wind) = (4/7) * (1 – (1/4)^2 – (3/4)^2) + (3/7) * (1 – (2/3)^2 – (1/3)^2)
= (4/7) * 0.375 + (3/7) * 0.444 ≈ 0.41
a) 10
b) 15.71
c) 4.08
d) 7.07
View Answer
Answer: c
Explanation: From the table we have,
1 Sunny Cool
2 Sunny Hot
7 Sunny Hot
a) 7.54
b) 15.71
c) 8.17
d) 7.07
View Answer
Answer: a
Explanation: From the table for Strong Wind we have,
Day Climate Temperature
1 Sunny Cool
5 Rainy Cool
6 Winter Cool
7 Sunny Hot
2 Sunny Hot
3 Rainy Medium
4 Winter Cool
Attribute value Standard deviation
Strong 7.07
Weak 8.17
Hot 5.85
Medium 7.54
Cool 4.58
a) 5.68
b) 0.9
c) 1.89
d) 2.34
View Answer
Answer: b
Explanation: From the table we have,
Weighted standard deviation of Temperature = (5.85 *
(4/10)) + (7.54 * (2/10)) + (4.58 * (4/10))
= (5.85 * 0.4) + (7.54 * 0.2) + (4.58 * 0.4)
= 2.34 + 1.51 + 1.83
= 5.68
Standard deviation reduction for Temperature = Standard
deviation of players – Weighted standard deviation of
Temperature
= 6.58 – 5.68
= 0.9
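The weighted-standard-deviation arithmetic above in a few lines (a sketch; the per-value deviations are taken from the table above):

weights = {'Hot': 4/10, 'Medium': 2/10, 'Cool': 4/10}
stdevs  = {'Hot': 5.85, 'Medium': 7.54, 'Cool': 4.58}

weighted = sum(weights[v] * stdevs[v] for v in weights)   # weighted std of Temperature
print(round(weighted, 2), round(6.58 - weighted, 2))      # 5.68 0.9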
1. Random forest can be used to reduce the danger of
overfitting in the decision trees.
a) True
b) False
View Answer
Answer: a
Explanation: One way to reduce the danger of overfitting is
by constructing an ensemble of trees. So Random forest is an
ensemble method which is better than a single decision tree
because it reduces the over-fitting by averaging the result.
2. Which of the following statements is not true about the
Random forests?
a) It is a classifier consisting of a collection of decision trees
b) Each tree is constructed by applying an algorithm on the
training set and an additional random vector
c) The prediction of the random forest is obtained by a
majority vote over the predictions of the individual trees
d) Each individual tree in the random forest will not spit
out a class prediction
View Answer
Answer: d
Explanation: As Random forest is a classifier consisting of a
collection of decision trees, each individual tree in the
random forest spits out a class prediction and the class with
the most votes becomes the model’s prediction. And each
tree is constructed by applying an algorithm on the training
set and an additional random vector.
3. Which of the following statements is not true about the
Random forests?
a) It is an ensemble learning method for classification only
b) It operates by constructing a multitude of decision trees at
training time
c) It outputs the class that is the mode of the classes
d) It outputs the mean prediction of the individual trees
View Answer
Answer: a
Explanation: Random forest is a supervised learning method
which are used for classification and regression. It is a group
of decision trees. The more the number of the trees the result
is error-free. During training time multitude of decision
trees will be constructed.
4. Which of the following statements is not true about
Random forests?
a) Scaling of data is required in the random forest algorithm
b) It works well for a large range of data items than a single
decision tree
c) It has less variance than single decision tree
d) Random forests are very flexible and possess very high
accuracy
View Answer
Answer: a
Explanation: Scaling of data is not required in the random
forest algorithm. It maintains good accuracy and it is very
flexible even after providing data without scaling. It works
well for a large range of data items and has less variance
than a single decision tree.
5. Which of the following statements is not true about
Random forests?
a) It has high complexity
b) Construction of Random forests is much easier than that
of decision trees
c) Construction of Random forests is more time-consuming
than that of decision trees
d) More computational resources are required to implement
Random Forest algorithm
View Answer
Answer: b
Explanation: Construction of Random forests is much
harder and time-consuming than decision trees as it requires
more computational resources for the implementation. And
it has high complexity.
6. There is a direct relationship between the number of trees
in the random forest and the results.
a) False
b) True
View Answer
Answer: b
Explanation: Random forest is a supervised machine
learning technique. And there is a direct relationship
between the number of trees in the forest and the results it
produces. If larger the number of trees, the result will be
more accurate.
7. A data set T is split into two subsets T1 and T2 with sizes
N1 and N2. And Gini index of the split data contains
examples from N classes. Then the Gini index of T is defined
by which of the following options?
a) Ginisplit (T) = (N2/N) gini (T1) + (N1/N) gini (T2)
b) Ginisplit (T) = (N/N1) gini (T1) + (N/N2) gini (T2)
c) Ginisplit (T) = (N/N2) gini (T1) + (N/N1) gini (T2)
d) Ginisplit (T) = (N1/N) gini (T1) + (N2/N) gini (T2)
View Answer
Answer: d
Explanation: Let a data set T is split into two subsets T1 and
T2 with sizes N1 and N2. And Gini index of the split data
contains examples from N classes. Then the Gini index of T
is defined by Ginisplit (T) = (N1/N) gini (T1) + (N2/N) gini (T2).
And its implementation is not easy as a decision tree with
impurity measures.
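A minimal sketch of this weighted split Gini (the class counts in the usage line are hypothetical):

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_split(counts1, counts2):
    # Weighted Gini of a split T -> (T1, T2): (N1/N)*gini(T1) + (N2/N)*gini(T2)
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return (n1 / n) * gini(counts1) + (n2 / n) * gini(counts2)

print(round(gini_split([1, 2], [2, 2]), 3))   # example with made-up class counts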
8. Random forest is known as the forest of Decision trees.
a) True
b) False
View Answer
Answer: a
Explanation: Random forest makes predictions by
combining the results from many individual decision trees.
So, we call them a forest of decision trees. Random forest
combines multiple models, and it falls under the category of
ensemble learning.
9. Bagging and Boosting are two main ways for combining
the outputs of multiple decision trees into a random forest.
a) True
b) False
View Answer
Answer: a
Explanation: Bagging and Boosting are two main ways for
combining the outputs of multiple decision trees into a
random forest. Bagging is also called Bootstrap aggregation
(used in Random Forests) and Boosting (used in Gradient
Boosting Machines).
10. Using majority voting with the three nearest neighbours
shown below, the predicted class of query point A is ‘Good’.
Neighbor Class
X Good
Y Bad
Z Bad
a) True
b) False
View Answer
Answer: b
Explanation: In majority voting approach, all votes are
equal. For each class C∈ L, we count how many of the k
neighbors have that class. We return the class with the most
votes. So here are two classes ‘Good’ and ‘Bad’. And the
class ‘Bad’ have the most votes (2 votes). So A’s predicted
class using majority voting will be ‘Bad’.
4. We have data from a survey and objective testing with two
attributes A and B to classify whether a special paper tissue
is good or not. Here are four training samples given in the
table. Now the factory produces a new paper tissue that
passes the laboratory test with A = 3 and B = 7. If K = 3, is
‘Good’ the classification of this new tissue?
A B C = Classification
7 6 Bad
7 4 Bad
4 4 Good
2 4 Good
a) True
b) False
View Answer
Answer: a
Explanation: We have K = 3. The squared distances to the
query (3, 7) are:
A B Squared distance
7 6 (7 – 3)^2 + (6 – 7)^2 = 17
7 4 (7 – 3)^2 + (4 – 7)^2 = 25
4 4 (4 – 3)^2 + (4 – 7)^2 = 10
2 4 (2 – 3)^2 + (4 – 7)^2 = 10
Ranking by distance:
A B Squared distance Rank
7 6 17 3
7 4 25 4
4 4 10 1
2 4 10 2
The three nearest neighbours are Good, Good and Bad, so
the majority class is ‘Good’.
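The K = 3 classification above as a short sketch in pure Python (the four training samples are the ones in the table):

from collections import Counter

train = [((7, 6), 'Bad'), ((7, 4), 'Bad'), ((4, 4), 'Good'), ((2, 4), 'Good')]
query = (3, 7)

# Sort by squared Euclidean distance to the query, then vote over the 3 nearest
nearest = sorted(train, key=lambda s: (s[0][0] - query[0])**2 + (s[0][1] - query[1])**2)
votes = Counter(label for _, label in nearest[:3])
print(votes.most_common(1)[0][0])   # 'Good' (neighbours: Good, Good, Bad)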
5. The three nearest neighbours of a query q, with their
classes and distances X = 0.1, Y = 0.3 and Z = 0.5, are shown
below. Using inverse distance weighted voting, the predicted
class of q is ‘Good’.
Neighbor Class
X Good
Y Bad
Z Bad
a) True
b) False
View Answer
Answer: a
Explanation: In this approach, closer neighbors get higher
votes. Take a neighbor’s vote to be the inverse of its distance
to q; this is known as inverse distance weighted voting.
Vote (X) = 1 / 0.1
= 10
Vote(Y) = 1 / 0.3
= 3.33
Vote (Z) = 1 / 0.5
=2
Here X (Good) gets a vote of 10 and Y (Bad), Z (Bad)
together gets a vote of 5.33 only. So, the predicted class will
be ‘Good’.
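The weighted vote above, reproduced as a sketch (the distances 0.1, 0.3 and 0.5 are those used in the explanation):

from collections import defaultdict

neighbours = [('Good', 0.1), ('Bad', 0.3), ('Bad', 0.5)]   # (class, distance to query q)

votes = defaultdict(float)
for label, d in neighbours:
    votes[label] += 1.0 / d              # inverse-distance weight

print(dict(votes))                       # {'Good': 10.0, 'Bad': 5.33...}
print(max(votes, key=votes.get))         # 'Good'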
6. Which of the following statements is not true about k
Nearest Neighbor?
a) It belongs to the supervised learning domain
b) It has an application in data mining and intrusion
detection
c) It is Non-parametric
d) It is not an instance based learning algorithm
View Answer
Answer: d
Explanation: k-NN is a supervised learning algorithm and a
non-parametric algorithm. It is also called a lazy learner
algorithm. kNN is used in applications like data mining,
intrusion detection, genetics and economic forecasting.
7. Which of the following statements is not supporting in
defining k Nearest Neighbor as a lazy learning algorithm?
a) It defers data processing until it receives a request to
classify unlabeled data
b) It replies to a request for information by combining its
stored training data
c) It stores all the intermediate results
d) It discards the constructed answer
View Answer
Answer: c
Explanation: k Nearest Neighbor is considered to be as a
lazy learning algorithm and it defers data processing until it
receives a request to classify unlabeled data. It replies to a
request for information by combining its stored training
data. And the most important thing is that it discards the
constructed answer and any intermediate results.
8. Which of the following statements is not supporting kNN
to be a lazy learner?
a) When it gets the training data, it does not learn and make
a model
b) When it gets the training data, it just stores the data
c) It derives a discriminative function from the training data
d) It uses the training data when it actually needs to do some
prediction
View Answer
Answer: c
Explanation: It does not derive any discriminative function
from the training data. So, kNN does not immediately learn
a model, but delays the learning, that is why it is called lazy
learner. All other three are the statements supporting kNN
to be a lazy learner.
9. Euclidian distance and Manhattan distance are the same
in kNN algorithm to calculate the distance.
a) True
b) False
View Answer
Answer: b
Explanation: Both Euclidian distance and Manhattan
distance are used to calculate the distance between two
points. But they are not the same. Euclidian distance takes
the square root of the sum of the squares of the difference of
the coordinates. Manhattan distance takes the sum of the
absolute values of the difference of the coordinates.
10. What is the Manhattan distance between a data point (9,
7) and a new query instance (3, 4)?
a) 7
b) 9
c) 3
d) 4
View Answer
Answer: b
Explanation: Manhattan distance takes the sum of the
absolute values of the difference of the coordinates. Let the
data point be (x1, y1) = (9, 7) and query instance be (x 2, y2) =
(3, 4).
Manhattan distance, d = |x1 – x2| + |y1 – y2|
= |9 – 3| + |7 – 4|
= |6| + |3|
=9
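A one-line check of the Manhattan distance computed above (a sketch):

def manhattan(p, q):
    # Sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan((9, 7), (3, 4)))   # 9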
1. In kNN too large value of K has a negative impact on the
data points.
a) True
b) False
View Answer
Answer: a
Explanation: Too large value of K in kNN has a negative
impact on the data points. A too large value of K is
detrimental as it destroys the locality of information since
farther examples are taken into account. It also increases the
computational burden.
2. It is good to use kNN for large data sets.
a) True
b) False
View Answer
Answer: b
Explanation: KNN works well with smaller dataset because
it is a lazy learner. It needs to store all the data and then
makes decision only at run time. So, if dataset is large, there
will be a lot of processing which may adversely impact the
performance of the algorithm.
3. When we set K = 1 in kNN algorithm, the predictions
become more stable.
a) True
b) False
View Answer
Answer: b
Explanation: As we decrease the value of K to 1, our
predictions become less stable. Choosing smaller values for
K can be noisy and will have a higher influence on the result.
In general, choosing the value of k is k = sqrt (N) where N
stands for the number of samples in your training dataset.
4. Setting large values of K in kNN is computationally
inexpensive.
a) True
b) False
View Answer
Answer: b
Explanation: Setting large values of K in kNN is
computationally expensive. Larger values of K will have
smoother decision boundaries which mean lower variance
but increased bias. ‘K’ in kNN algorithm is based on feature
similarity and choosing the right value of K is a process
called parameter tuning.
5. Which of the following statements is not a feature of kNN?
a) K-NN has assumptions
b) K-NN is pretty intuitive and simple
c) No Training Step
d) It constantly evolves
View Answer
Answer: a
Explanation: In kNN there are no assumptions to be met to
implement kNN. Parametric models like linear regression
has lots of assumptions to be met by data before it can be
implemented which is not the case with kNN. All other three
statements are the advantages of kNN.
6. Which of the following statements is not a feature of kNN?
a) Very easy to implement for multi-class problem
b) One Hyper Parameter
c) Variety of distance criteria to be choose from
d) Fast algorithm for large dataset
View Answer
Answer: d
Explanation: kNN is a slow algorithm. KNN might be very
easy to implement but as dataset grows efficiency or speed of
algorithm declines very fast. So, it is a slow algorithm for
large dataset. All other three statements are the advantages
of kNN.
7. Which of the following statements is not a feature of kNN?
a) K-NN does not need homogeneous features
b) Curse of Dimensionality
c) Optimal number of neighbors
d) Outlier sensitivity
View Answer
Answer: a
Explanation: K-NN needs homogeneous features. If you
decide to build k-NN using a common distance, like
Euclidean or Manhattan distances, it is completely necessary
that features have the same scale, since absolute differences
in features weight the same, i.e., a given distance in feature 1
must mean the same for feature 2.
8. KNN performs well on imbalanced data.
a) True
b) False
View Answer
Answer: b
Explanation: k-NN doesn’t perform well on imbalanced
data. If we consider two classes, A and B, and the majority
of the training data is labeled as A, then the model will
ultimately give a lot of preference to A. This might result in
getting the less common class B wrongly classified.
9. In kNN low K value is sensitive to outliers.
a) True
b) False
View Answer
Answer: a
Explanation: KNN is sensitive to outliers. Low k-value is
sensitive to outliers and a higher K-value is more flexible to
outliers as it considers more voters to decide prediction.
10. Cross-validation is a smart way to find out the optimal K
value.
a) True
b) False
View Answer
Answer: a
Explanation: Cross-validation is a smart way to find out the
optimal K value. It estimates the validation error rate by
holding out a subset of the training set from the model
building process.
1. Naïve Bayes classifier algorithms are mainly used in text
classification.
a) True
b) False
View Answer
Answer: a
Explanation: Naïve Bayes classifier is a simple probabilistic
framework for solving a classification problem. It is used to
organize text into categories based on the bayes probability
and is used to train data to learn document-class
probabilities before classifying text documents.
2. What is the formula for Bayes’ theorem? Where (A & B)
and (H & E) are events and P(B), P(H) & P(E) ≠ 0.
a) P(H|E) = [P(E|H) * P(E)] / P(H)
b) P(A|B) = [P(A|B) * P(A)] / P(B)
c) P(H|E) = [P(H|E) * P(H)] / P(E)
d) P(A|B) = [P(B|A) * P(A)] / P(B)
View Answer
Answer: d
Explanation: Here, P(A) &P(H) is the probability of
hypothesis before observing the evidence, P(B) & P(E) is the
probability of evidence, P(A|B) & P(H|E) is the posterior
probability and P(B|A) & P(E|H) is the likelihood
probability. Since Bayes Theorem states that:
Conditional Probability of A given B = (Conditional
probability of B given A * Prior probability of A) / (Prior
probability of B)
3. Which of the following statement is not true about Naïve
Bayes classifier algorithm?
a) It cannot be used for Binary as well as multi-class
classifications
b) It is the most popular choice for text classification
problems
c) It performs well in Multi-class prediction as compared to
other algorithms
d) It is one of the fast and easy machine learning algorithms
to predict a class of test datasets
View Answer
Answer: a
Explanation: Naïve Bayes algorithm can be used for binary
as well as multi-class classifications. It is a parametric
algorithm, which means it requires a fixed set of
assumptions or parameters to simplify the machine’s
learning process.
4. What are the assumptions of the Naïve Bayesian classifier?
a) It assumes that features of a data are completely
dependent on each other
b) It assumes that each input variable is dependent and the
model is not generative
c) It assumes that each input attribute is independent of the
others and the model is generative
d) It assumes that the data dimensions are dependent and
the model is generative
View Answer
Answer: c
Explanation: The Naïve Bayes classifier assumes that each
input attribute is independent of the others, which is the
“naïve” part, and that the model is generative, which is the
Bayesian part.
5. Which of the following is not a supervised machine
learning algorithm?
a) Decision tree
b) SVM for classification problems
c) Naïve Bayes
d) K-means
View Answer
Answer: d
Explanation: Decision tree, SVM (Support vector machines)
for classification problems and Naïve Bayes are examples
of supervised machine learning algorithms. K-means is an
example of an unsupervised machine learning algorithm.
6. Which one of the following terms is not used in the Bayes’
Theorem?
a) Prior
b) Unlikelihood
c) Posterior
d) Evidence
View Answer
Answer: b
Explanation: The terms Evidence, Prior, Likelihood and
Posterior are used in the Bayes’ Theorem. But, the term
unlikelihood is not used in the Bayes’ Theorem. Bayes
Theorem states that Posterior = (Likelihood * Prior) /
Evidence.
7. Is the assumption of the Naïve Bayes algorithm a
limitation to use it?
a) True
b) False
View Answer
Answer: a
Explanation: It is true that the assumption of the Naïve
Bayes algorithm is a limitation, since it implicitly
assumes that all the input attributes are mutually
independent of each other. In real life it is almost
impossible to get a set of input attributes that are
completely independent.
8. In which of the following case the Naïve Bayes’ algorithm
does not work well?
a) When faster prediction is required
b) When the Naïve assumption holds true
c) When there is the case of Zero Frequency
d) When there is a multiclass prediction
View Answer
Answer: c
Explanation: A “Zero Frequency” case occurs when a
categorical value in the test data was never observed for a
class in the training data, so the classifier assigns it zero
probability and cannot make a sensible prediction. This is
usually handled with smoothing, such as Laplace (add-one)
smoothing.
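A minimal sketch of Laplace (add-one) smoothing, using made-up word counts for a single hypothetical “spam” class:

# Sketch: add-one smoothing avoids zero-frequency probabilities.
word_counts = {"free": 3, "prize": 2, "meeting": 0}   # "meeting" never seen in this class
vocab_size = len(word_counts)
total = sum(word_counts.values())

def smoothed_prob(word, alpha=1):
    # add alpha to every count so no word ends up with probability exactly zero
    return (word_counts.get(word, 0) + alpha) / (total + alpha * vocab_size)

print(smoothed_prob("meeting"))   # 0.125 instead of 0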
9. There are two boxes. The first box contains 3 white and 2
red balls whereas the second contains 5 white and 4 red
balls. A ball is drawn at random from one of the two boxes
and is found to be white. Find the probability that the ball
was drawn from the second box?
a) 53/50
b) 50/104
c) 54/104
d) 54/44
View Answer
Answer: b
Explanation: Let the first box be A and the second box be B.
Then the probability of choosing either box is P(A) = 1/2 and
P(B) = 1/2.
We have to find the probability that the white ball was drawn
from the second box, i.e. P(B|W).
Now,
P(W|A) = 3/5 and P(W|B) = 5/9
According to Bayes’ theorem,
P(B|W) = \(\frac {P(W|B) * P(B)}{P(W|B) * P(B) + P(W|A) *
P(A)}\)
P(B|W) = \(\frac {5/9 * 1/2}{(5/9 * 1/2) + (3/5 * 1/2)}\)
P(B|W) = \(\frac {5/18}{5/18 + 3/10}\)
P(B|W) = \(\frac {5/18}{104/180}\)
P(B|W) = 50/104
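The same computation can be checked in a few lines of Python, using exact fractions from the standard library, purely as a verification of the worked answer above:

# Sketch: verifying the worked Bayes calculation with exact fractions.
from fractions import Fraction

p_A = p_B = Fraction(1, 2)
p_white_given_A = Fraction(3, 5)    # 3 white out of 5 balls in box A
p_white_given_B = Fraction(5, 9)    # 5 white out of 9 balls in box B

p_B_given_white = (p_white_given_B * p_B) / (p_white_given_B * p_B + p_white_given_A * p_A)
print(p_B_given_white)              # 25/52, which equals 50/104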
10. Which one of the following models is a generative model
used in machine learning?
a) Linear Regression
b) Logistic Regression
c) Naïve Bayes
d) Support vector machines
View Answer
Answer: c
Explanation: Naïve Bayes is a generative model used in
machine learning. Linear Regression, Logistic Regression and
Support vector machines are discriminative models used in
machine learning.
11. The numbers of balls in three boxes are as follows:
Box   Green   Blue   Total balls
A     3       2      6
B     2       1      5
C     4       2      9
One box is chosen at random and two balls are drawn from
it. The balls are green and blue. What is the probability that
the balls were drawn from the first box?
a) 37/18
b) 15/56
c) 18/37
d) 56/15
View Answer
Answer: c
Explanation: The probability of choosing one box out of
three boxes is P(A) = P(B) = P(C) = 1/3.
Here the event (E) is choosing the green and blue balls from
the random box.
Therefore, P(E|A) = \(\frac {^3C_1*^2C_1}{^6C_2}\) = 6/15
= 2/5
P(E|B) = \(\frac {^2C_1*^1C_1}{^5C_2}\) = 2/10 = 1/5
P(E|C) = \(\frac {^4C_1*^2C_1}{^9C_2} = \frac {8}{36}\) = 2/9
According to Bayes’ theorem, and since the equal priors of 1/3
cancel out,
P(A|E) = \(\frac {P(E|A)}{P(E|A) + P(E|B) + P(E|C)}\)
= \(\frac {2/5}{(2/5) + (1/5) + (2/9)}\)
= 18/37
12. Identify the parametric machine learning algorithm.
a) CNN (Convolutional neural network)
b) KNN (K-Nearest Neighbours)
c) Naïve Bayes
d) SVM (Support vector machines)
View Answer
Answer: c
Explanation: In machine learning, an algorithm that
summarizes the mapping function it learns with a fixed,
finite set of parameters is called a parametric machine
learning algorithm. Naïve Bayes is a parametric machine
learning algorithm, whereas CNN, KNN and SVM are
non-parametric machine learning algorithms.
13. Which one of the following applications is not an
example of Naïve Bayes algorithm?
a) Spam filtering
b) Text classification
c) Stock market forecasting
d) Sentiment analysis
View Answer
Answer: c
Explanation: Stock market forecasting is a core financial
application of KNN (K-Nearest Neighbours) rather than of
Naïve Bayes. Spam filtering, text classification and sentiment
analysis are applications of the Naïve Bayes algorithm, which
uses Bayes’ theorem of probability to predict unknown classes.
14. Arrange the following steps in sequence in order to
calculate the probability of an event through Naïve Bayes
classifier.
I. Find the likelihood probability with each attribute for
each class.
II. Calculate the prior probability for given class labels.
III. Put these values in Bayes formula and calculate
posterior probability.
IV. See which class has the higher posterior probability and
assign the input to that class.
a) I → II → III → IV
b) II → I → III → IV
c) III → II → I → IV
d) II → III → I → IV
View Answer
Answer: b
Explanation: The sequence in which Naïve Bayes calculates
the probability of an event is:
II. Calculate the prior probability for given class labels.
I. Find the likelihood probability with each attribute for
each class.
III. Put these values in Bayes formula and calculate
posterior probability.
IV. See which class has the higher posterior probability and
assign the input to that class.
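A minimal sketch of these four steps for a tiny hypothetical dataset (the outlook/play examples below are made up, and only a single attribute is used to keep the code short):

# Sketch: the four Naive Bayes steps for one attribute ("outlook") and two classes.
data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"),
        ("rain", "yes"), ("sunny", "yes"), ("rain", "no")]
x = "sunny"                                       # the input whose class we want

posteriors = {}
for c in ("yes", "no"):
    rows = [d for d in data if d[1] == c]
    prior = len(rows) / len(data)                               # step II: P(c)
    likelihood = sum(1 for d in rows if d[0] == x) / len(rows)  # step I: P(x | c)
    posteriors[c] = likelihood * prior                          # step III: numerator of Bayes formula
print(max(posteriors, key=posteriors.get))                      # step IV: class with higher probability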
15. “It is easy and fast to predict the class of the test data set
by using Naïve Bayes algorithm”.
Which of the following statements contradicts the statement
given above?
a) Because there is no iteration
b) Because there is no epoch
c) Because there is an error back propagation
d) Because there are no operations involved in solving a
matrix problem
View Answer
Answer: c
Explanation: The Naïve Bayes algorithm is easy and fast at
predicting the class of the test data set because there is no
iteration involved, there are no epochs, there are no matrix
operations to solve, and there is no error back propagation.