Machine Learning MCQs

1.

Among the following options, identify the one which is not a type of learning.
Semi unsupervised learning
Supervised learning
Reinforcement learning
Unsupervised learning
Answer - A) Semi unsupervised learning is not a type of learning.


2.
Identify the kind of learning algorithm for “facial identities
for facial expressions”.
Prediction
Recognition patterns
Recognizing anomalies
Generating patterns
Answer - B) "Recognition patterns" is used for facial identities and facial expressions.
3.
Identify the model which is trained with data in only a single
batch.
Offline learning
Batch learning
Both A and B
None
Answer - C) A model trained with data in only a single batch is known as batch learning or offline learning.
4.
What is the application of machine learning methods to a
large database called?
Big data computing
Internet of things
Data mining
Artificial intelligence
Answer - C) Application of machine learning methods to large databases is known as data mining.
5.
Identify the type of learning in which labeled training data is
used.
Semi unsupervised learning
Supervised learning
Reinforcement learning
Unsupervised learning

Answer - B) Supervised learning uses labeled training data.

6.
Identify whether true or false: In PCA the number of input
dimensions is equal to principal components.
True
False
Answer - A) True. In PCA, the number of principal components is equal to the number of input dimensions.
7.
Among the following, identify what dimensionality reduction reduces.
Performance
Answer - D) Dimensionality reduction reduces collinearity.

8.
Which of the following machine learning algorithm is based
upon the idea of bagging?
Decision tree
Random-forest
Classification
Regression

Answer - B) Random forest is based on the idea of bagging.


9.
Choose a disadvantage of decision trees among the following.
Decision trees are robust to outliers
Factor analysis
Decision trees are prone to overfit
All of the above

Answer - C) Decision trees are prone to overfitting.


10.
What is the term known as on which the machine learning
algorithms build a model based on sample data?
Data training
Training data
Transfer data
None of the above

Answer - B) The term is known as training data.


11.
Machine learning is a subset of which of the following.
Artificial intelligence
Deep learning
Data learning
None of the above
Answer - A) Machine learning is a subset of artificial intelligence.
12.
Which of the following machine learning techniques helps in
detecting the outliers in data?
Classification
Clustering
Anomaly detection
All of the above
Answer - C) The machine learning technique which helps in detecting the outliers is known as anomaly detection.

13.
The father of machine learning is _____________
Geoffrey Everest Hinton
Geoffrey Hill
Geoffrey Chaucer
None of the above
Answer - A) The father of machine learning is Geoffrey Everest Hinton.
14.
The most significant phase in genetic algorithm is _________
Mutation
Selection
Fitness function
Crossover
Answer - D) Crossover is the most significant phase in a genetic algorithm.
15.
Which of the following are common classes of problems in
machine learning?
Regression
Classification
Clustering
All of the above
Answer - D) All of the above are common classes of problems in machine learning.
16.
Among the following options identify the one which is false
regarding regression.
It is used for the prediction
It is used for interpretation
It relates inputs to outputs
It discovers causal relationships
Answer - D) Option D is false; regression does not discover causal relationships.


17.
Identify the successful applications of ML.
Learning to classify new astronomical structures
Learning to recognize spoken words
Learning to drive an autonomous vehicle
All of the above

Answer - D) All of the above are applications of ML.


18.
Identify the one which is not a numerical function among the various function representations of machine learning.
Case-based
Support vector machines
Linear regression
Neural network
Answer - A) Case-based is not a numerical function representation.


19.
FIND-S algorithm ignores?
Positive
Negative
Both
None
Answer - B) The FIND-S algorithm ignores negative examples.


20.
Select the correct definition of neuro software.
It is software used by neurosurgeons
It is software used to analyze neurons
It is a powerful and easy neural network
None of the above
Answer - C) Neuro software is a powerful and easy-to-use neural network.
21.
Choose whether the following statement is true or false: The
backpropagation law is also known as the generalized Delta
rule.
True
False
Answer - A) True. The backpropagation law is also known as the generalized Delta rule.
22.
Choose the general limitations of the backpropagation rule
among the following.
Slow convergence
Scaling
Local minima problem
All of the above
Answer - D) All of the above are general limitations of the backpropagation rule.
23.
Analysis of ML algorithms needs
Statistical learning theory
Computational learning theory
Both A and B
None of the above
Answer - C) Analysis of ML algorithms needs both statistical learning theory and computational learning theory.
24.
Choose the most widely used metrics and tools to assess the classification models.
The area under the ROC curve
Confusion matrix
Cost-sensitive accuracy
All of the above

Answer - D) All of the above are correct.


25.
Full form of PAC is _____________
Probably Approx Cost
Probably Approximately Correct
Probability Approx Communication
Probably Approximate Computation
Answer - B) PAC stands for Probably Approximately Correct.
26.
Choose whether the following statement is true or false: True error
is defined over the entire instance space, and not just over
training data
True
False
Answer - A) True. The above statement is correct.


27.
Choose the areas below of which CLT (Computational Learning Theory) is comprised.
Mistake bound
Sample complexity
Computational complexity
All of the above
Answer - D) CLT is comprised of all of the above.


28.
Choose the instance-based learner.
Eager learner
Lazy learner
Both A and B are correct
None of the above

Answer - B) Lazy learner is an instance-based learner.


29.
Identify the difficulties with the k-nearest neighbor
algorithm.
Curse of dimensionality
Calculate the distance of the test case from all training cases
Both A and B
None of the above

Answer - C) Both A and B are correct.


30.
The total number of layers in a radial basis function neural network is ______
1
2
3
4
Answer - C) There are a total of 3 layers in a radial basis function neural network (input, hidden, and output).
31.
Which of the following is an application of CBR?
Diagnosis
Design
Planning
All of the above

Answer - B) Design is an application of CBR.


32.
Choose the correct advantages of CBR.
Fast to train
A local approx is found for each test case
Knowledge is in a form understandable by human
All of the above
Answer - D) All of the above are advantages of CBR.


33.
Machine learning has various search and optimisation algorithms. Identify among the following which is not evolutionary computation.
Genetic algorithm
Genetic programming
Neuroevolution
Perceptron

Answer - D) Perceptron is not evolutionary computing.


34.
Choose whether the following statement is true or false:
Artificial intelligence is the process that allows a computer to
learn and make decisions like humans.
True
False

Answer - A) The above statement is true.


35.
Which of the following is not a machine learning discipline?
Information theory
Optimisation + control
Physics
Neuro statistics
Answer - D) Neuro statistics is not a machine learning discipline.
36.
What does K stand for in the K-means algorithm?
Number of clusters
Number of data
Number of attributes
Number of iterations
Answer - A) K stands for the number of clusters in the K-means algorithm.
37.
Choose whether true or false: Decision tree cannot be used
for clustering
True
False

Answer - B) False. A decision tree can be used for clustering.


38.
Identify the clustering method which takes care of variance
in data
Decision tree
Gaussian mixture model
K means
All of the above
Answer - B) The Gaussian mixture model takes care of variance in data.
39.
Which of the following is not a supervised learning algorithm?
PCA
Naive Bayesian
Linear regression
Decision tree
Answer - A) PCA is not a supervised learning algorithm.


40.
What is unsupervised learning?
Number of groups may be known
Features of group explicitly stated
Neither feature nor number of groups is known
None of the above
Answer - C) In unsupervised learning, neither the features nor the number of groups is known.
41.
Which of the following is not a machine learning algorithm?
SVM
SVG
Random forest
None of the above
Answer - B) SVG is not a machine learning algorithm.
42.
What is true about Machine Learning?
The main focus of ML is to allow computer systems to learn
from experience without being explicitly programmed or
human intervention.
ML is a type of artificial intelligence that extracts patterns
out of raw data by using an algorithm or method.
Machine Learning (ML) is a field of computer science.
All of the above
Answer - D) All of the above are true about machine learning.


43.
Which of the following is not machine learning?
Artificial intelligence
Rule-based inference
Both A and B
None of the above

Answer - B) Rule-based inference is not machine learning


44.
Identify the method which is used for trainControl
resampling.
svm
repeatedcv
Bag32
None of the above
Answer - B) repeatedcv is used for trainControl resampling.
45.
Among the following options, identify the one which is used to create the most common graph types.
plot
quickplot
qplot
All of the above
Answer - C) qplot is used to create the most common graph types.

In simple terms, what is machine learning?


A. training based on historical data

B. prediction to answer a query

C. both A and B

D. automation of complex tasks

Answer: Option C
Which of the following is the best machine learning
method?

A. scalable
B. accuracy

C. fast

D. all of the above

Answer: Option D

The output of training process in machine learning is

A. machine learning model

B. machine learning algorithm

C. null

D. accuracy

Answer: Option A
Application of machine learning methods to large
databases is called

A. data mining.

B. artificial intelligence
C. big data computing

D. internet of things



Answer: Option A
If a machine learning model's output involves a target variable, then that model is called a

A. descriptive model

B. predictive model

C. reinforcement learning

D. all of the above



Answer: Option B
What are the different Algorithm techniques in Machine
Learning?

A. supervised learning and semi-supervised learning

B. unsupervised learning and transduction


C. both A & B

D. none of the mentioned

Answer: Option A
Which of the following is not Machine Learning?

A. artificial intelligence

B. rule based inference

C. both a and b

D. none of the mentioned



Answer: Option B

What is 'Overfitting' in Machine learning?

A. when a statistical model describes random error or noise instead of the underlying relationship

B. robots are programmed so that they can perform the task based on data they gather from sensors
C. while involving the process of learning 'overfitting'
occurs.

D. a set of data is used to discover the potentially


predictive relationship

Answer: Option A
If a machine learning model's output does not involve a target variable, then that model is called a

A. descriptive model

B. predictive model

C. reinforcement learning

D. all of the above



Answer: Option A

Which are two techniques of Machine Learning ?

A. Genetic Programming and Inductive Learning


B. Speech recognition and Regression

C. Both A and B

D. None of the Mentioned

Answer: Option A
11.
What characterizes unlabeled examples in machine learning?
A. there is no prior knowledge

B. there is no confusing knowledge


C. there is prior knowledge

D. there is plenty of confusing knowledge

Answer: Option A

12.
What characterizes a hyperplane in the geometrical model of machine learning?
A. a plane with one dimension fewer than the number of input attributes
B. a plane with two dimensions fewer than the number of input attributes
C. a plane with one dimension more than the number of input attributes
D. a plane with two dimensions more than the number of input attributes


Answer: Option A

13.
Imagine a newborn starts to learn walking. It will try to find a suitable policy to learn walking after repeatedly falling and getting up. Specify what type of machine learning is best suited.
A. classification

B. regression

C. kmeans algorithm

D. reinforcement learning
Answer: Option D
14.
What are the popular algorithms of Machine Learning?
A. decision trees and neural networks (back propagation)

B. probabilistic networks and nearest neighbor

C. support vector machines

D. all

Answer: Option D
A machine learning problem involves four attributes plus
a class. The attributes have 3, 2, 2, and 2 possible values
each. The class has 3 possible values. How many maximum
possible different examples are there?
A. 12

B. 24

C. 48

D. 72

Answer: Option D
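A quick way to check this: the maximum number of distinct examples is the product of the attribute and class cardinalities, i.e. 3 × 2 × 2 × 2 × 3 = 72.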
In machine learning, an algorithm (or learning algorithm)
is said to be unstable if a small change in training data
cause the large change in the learned classifiers. True or
False: Bagging of unstable classifiers is a good idea
A. TRUE

B. FALSE

Answer: Option A
Which of the following is characteristic of best machine
learning method ?
A. fast

B. accuracy

C. scalable

D. all above

Answer: Option D
Machine learning techniques differ from statistical
techniques in that machine learning methods
A. typically assume an underlying distribution for the
data.

B. are better able to deal with missing and noisy data.

C. are not able to explain their behavior.

D. have trouble with large-sized datasets.

Answer: Option B
What is Model Selection in Machine Learning?
A. The process of selecting models among different
mathematical models, which are used to describe the same
data set

B. when a statistical model describes random error or


noise instead of underlying relationship

C. Find interesting directions in data and find novel


observations/ database cleaning

D. All above



Answer: Option A

Some people are using the term . . . . . . . . instead of


prediction only to avoid the weird idea that machine
learning is a sort of modern magic.
A. Inference

B. Interference

C. Accuracy
D. None of above

Answer: Option A
21.
The average squared difference between classifier
predicted output and actual output.
A. mean squared error

B. root mean squared error

C. mean absolute error

D. mean relative error



Answer: Option A
22.
Which of the following methods do we use to find the best
fit line for data in Linear Regression?
A. Least Square Error

B. Maximum Likelihood

C. Logarithmic Loss

D. Both A and B


Answer: Option A
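As an illustration, a minimal least-squares fit in Python (assuming NumPy is available; the data values are made up for demonstration only):

import numpy as np

# Toy data, hypothetical values for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.polyfit with deg=1 finds the slope and intercept that minimise
# the sum of squared residuals (the least-squares criterion).
slope, intercept = np.polyfit(x, y, deg=1)
print(slope, intercept)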

23.
Following are the descriptive models
A. clustering
B. classification

C. association rule

D. both a and c

Answer: Option D
24.
Assume that you are given a data set and a neural network
model trained on the data set. You are asked to build a
decision tree model with the sole purpose of
understanding/interpreting the built neural network
model. In such a scenario, which among the following
measures would you concentrate most on optimising?
A. accuracy of the decision tree model on the given data
set

B. f1 measure of the decision tree model on the given data


set

C. fidelity of the decision tree model, which is the fraction


of instances on which the neural network and the decision
tree give the same output

D. comprehensibility of the decision tree model, measured


in terms of the size of the corresponding rule set



Answer: Option C
25.
What are common feature selection methods in regression
task?
A. correlation coefficient

B. greedy algorithms

C. all above

D. none of these



Answer: Option C
26.
Regarding bias and variance, which of the following statements are true? (Here 'high' and 'low' are relative to the ideal model.)
i. Models which overfit are more likely to have high bias
ii. Models which overfit are more likely to have low bias
iii. Models which overfit are more likely to have high
variance
iv. Models which overfit are more likely to have low
variance
A. i and ii

B. ii and iii

C. iii and iv
D. none of these

Answer: Option C
27.
Which of the following can only be used when the training data are linearly separable?
A. linear hard-margin svm

B. linear logistic regression

C. linear soft margin svm

D. the centroid method



Answer: Option A
28.
Wrapper methods are hyper-parameter selection methods
that
A. should be used whenever possible because they are
computationally efficient

B. should be avoided unless there are no other options


because they are always prone to overfitting.

C. are useful mainly when the learning machines are


"black boxes"
D. should be avoided altogether.

Answer: Option C
29.
Given that we can select the same feature multiple times
during the recursive partitioning of the input space, is it
always possible to achieve 100% accuracy on the training
data (given that we allow for trees to grow to their
maximum size) when building decision trees?
A. Yes

B. No

Answer: Option B
30.
In many classification problems, the target dataset is made
up of categorical labels which cannot immediately be
processed by any algorithm. An encoding is needed and
scikit-learn offers at least . . . . . . . . valid options
A. 1

B. 2

C. 3

D. 4
Answer: Option B
A. facts.

B. concepts.

C. procedures.

D. principles.

Answer: Option A

32.

Why is feature scaling done before applying the K-Means algorithm?

A. in distance calculation it will give the same weights for


all features

B. you always get the same clusters whether you use feature scaling or not

C. in Manhattan distance it is an important step but in Euclidean distance it is not
D. none of these

Answer: Option A
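A minimal scikit-learn sketch of this point (the data values are hypothetical): without scaling, the feature with the larger numeric range dominates the Euclidean distances used by K-Means; standardising first gives every feature the same weight.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (hypothetical values).
X = np.array([[1.0, 1000.0], [1.2, 1100.0], [5.0, 90.0], [5.3, 80.0]])

# Standardise so each feature has mean 0 and unit variance,
# then cluster; both features now contribute equally to distances.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)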
33.

Which of the following is true about Naive Bayes?

A. Assumes that all the features in a dataset are equally


important

B. Assumes that all the features in a dataset are


independent

C. Both A and B

D. None of the above option

Answer: Option C
34.

KDD represents extraction of

A. data
B. knowledge

C. rules

D. model


Answer: Option B
35.

Linear Regression is a . . . . . . . . machine learning


algorithm.

A. supervised

B. unsupervised

C. semi-supervised

D. cant say
Answer: Option A
36.

The probability that a person owns a sports car given


that they subscribe to automotive magazine is 40%. We
also know that 3% of the adult population subscribes to
automotive magazine. The probability of a person
owning a sports car given that they don't subscribe to
automotive magazine is 30%. Use this information to
compute the probability that a person subscribes to
automotive magazine given that they own a sports car

A. 0.0398

B. 0.0389

C. 0.0368

D. 0.0396

Answer: Option D
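The answer can be checked with Bayes' rule, using only the numbers given in the question; a short Python verification:

# P(car | subscribes) = 0.40, P(subscribes) = 0.03, P(car | no subscription) = 0.30
p_car_given_sub = 0.40
p_sub = 0.03
p_car_given_nosub = 0.30

# Total probability of owning a sports car.
p_car = p_car_given_sub * p_sub + p_car_given_nosub * (1 - p_sub)

# Bayes' rule: P(subscribes | car).
p_sub_given_car = p_car_given_sub * p_sub / p_car
print(round(p_sub_given_car, 4))  # 0.0396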
37.

Which among the following statements best describes


our approach to learning decision trees
A. identify the best partition of the input space and
response per partition to minimise sum of squares error

B. identify the best approximation of the above by the


greedy approach (to identifying the partitions)

C. identify the model which gives the best performance


using the greedy approximation (option (b)) with the
smallest partition scheme

D. identify the model which gives performance close to the


best greedy approximation performance (option (b)) with
the smallest partition scheme

Answer: Option B
38.

Which of the following techniques would perform


better for reducing dimensions of a data set?

A. removing columns which have too many missing values

B. removing columns which have high variance in data


C. removing columns with dissimilar data trends

D. none of these

Answer: Option A
39.

. . . . . . . . can be adopted when it's necessary to


categorize a large amount of data with a few complete
examples or when there's the need to impose some
constraints to a clustering algorithm.

A. Supervised

B. Semi-supervised

C. Reinforcement

D. Clusters

Answer: Option B
40.
The binarize parameter in scikit-learn's BernoulliNB sets the threshold for binarizing sample features.

A. TRUE

B. FALSE

Answer: Option A
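A minimal sketch with scikit-learn's BernoulliNB (data values are made up): the binarize argument is the threshold above which a feature value is treated as 1 and otherwise as 0.

import numpy as np
from sklearn.naive_bayes import BernoulliNB

X = np.array([[0.1, 0.9], [0.8, 0.2], [0.7, 0.6], [0.2, 0.1]])
y = np.array([0, 1, 1, 0])

# Feature values greater than 0.5 are binarized to 1, the rest to 0,
# before the Bernoulli Naive Bayes model is fitted.
clf = BernoulliNB(binarize=0.5)
clf.fit(X, y)
print(clf.predict([[0.9, 0.3]]))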
41.

What would you do in PCA to get the same projection


as SVD?

A. transform data to zero mean

B. transform data to zero median

C. not possible

D. none of these

Answer: Option A
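A small NumPy sketch of the idea (random data used purely for illustration): after transforming the data to zero mean, the right singular vectors of the data matrix are the PCA directions.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))  # hypothetical data

X_centered = X - X.mean(axis=0)            # transform data to zero mean
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Rows of Vt are the principal directions; S**2 / (n - 1) are the
# variances that PCA would report for each component.
print(Vt)
print(S**2 / (X.shape[0] - 1))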

42.
The . . . . . . . . of the hyperplane depends upon the
number of features.

A. dimension

B. classification

C. reduction

D. none of the above

Answer: Option A
43.

What is the approach of basic algorithm for decision


tree induction?

A. greedy

B. top down

C. procedural
D. step by step

Answer: Option A
44.

Can we extract knowledge without applying feature selection?

A. Yes

B. No

Answer: Option A
45.

Suppose there are 25 base classifiers. Each classifier has


error rates of e = 0.35. Suppose you are using averaging
as ensemble technique. What will be the probabilities
that ensemble of above 25 classifiers will make a wrong
prediction? Note: All classifiers are independent of each
other

A. 0.05

B. 0.06
C. 0.07

D. 0.09

Answer: Option B
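The usual textbook calculation behind this answer treats the ensemble as a majority vote of 25 independent classifiers, which errs when 13 or more of them are wrong; a quick check in Python:

from math import comb

e, n = 0.35, 25
# Binomial tail: probability that at least 13 of the 25 classifiers are wrong.
p_wrong = sum(comb(n, k) * e**k * (1 - e) ** (n - k) for k in range(13, n + 1))
print(round(p_wrong, 2))  # ~0.06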
46.

When the number of classes is large Gini index is not a


good choice.

A. TRUE

B. FALSE

Answer: Option A
47.

Data used to build a data mining model.

A. training data

B. validation data

C. test data
D. hidden data



Answer: Option A
48.

This technique associates a conditional probability


value with each data instance.

A. linear regression

B. logistic regression

C. simple regression

D. multiple linear regression



Answer: Option B
49.

What is the purpose of the Kernel Trick?


A. to transform the data from nonlinearly separable to
linearly separable

B. to transform the problem from regression to


classification

C. to transform the problem from supervised to


unsupervised learning.

D. all of the above



Answer: Option A
50.

Having multiple perceptrons can actually solve the


XOR problem satisfactorily: this is because each
perceptron can partition off a linear part of the space
itself, and they can then combine their results.

A. true - this works always, and these multiple perceptrons


learn to classify even complex problems
B. false - perceptrons are mathematically incapable of
solving linearly inseparable functions, no matter what you
do

C. true - perceptrons can do this but are unable to learn to


do it - they have to be explicitly hand-coded

D. false - just having a single perceptron is enough

Answer: Option C
51.

Can we calculate the skewness of variables based on


mean and median?

A. TRUE

B. FALSE

Answer: Option B

52.

It is possible to design a Linear regression algorithm


using a neural network?
A. TRUE

B. FALSE

Answer: Option A
53.

If a Linear regression model fits the training data perfectly, i.e., train error is zero, then . . . . . . . .

A. Test error is also always zero

B. Test error is non zero

C. Couldn't comment on Test error

D. Test error is equal to Train error



Answer: Option C
54.
Increase in size of a convolutional kernel would
necessarily increase the performance of a convolutional
network.

A. TRUE

B. FALSE



Answer: Option B
55.

To control the size of the tree, we need to control the


number of regions. One approach to do this would be to
split tree nodes only if the resultant decrease in the sum
of squares error exceeds some threshold. For the
described method, which among the following are true?
a. It would, in general, help restrict the size of the trees
b. It has the potential to affect the performance of the
resultant regression/classification model
c. It is computationally infeasible

A. a and b

B. b and c
C. a and c

D. all of the above

Answer: Option A
56.

Which of the following is the difference between


stacking and blending?

A. stacking has less stable cv compared to blending

B. in blending, you create out of fold prediction

C. stacking is simpler than blending

D. none of these



Answer: Option D
57.

Logistic regression is a . . . . . . . . regression technique


that is used to model data having a . . . . . . . . outcome.
A. linear, numeric

B. linear, binary

C. nonlinear, numeric

D. nonlinear, binary



Answer: Option D
58.

You are given reviews of few netflix series marked as


positive, negative and neutral. Classifying reviews of a
new netflix series is an example of

A. supervised learning

B. unsupervised learning

C. semisupervised learning

D. reinforcement learning
Answer: Option A
59.

In which of the following feature selection methods do we start with an empty feature set?

A. forward feature selection

B. backward feature selection

C. both a and b

D. none of the above



Answer: Option A
60.

Neural Networks are complex . . . . . . . . with many


parameters.

A. linear functions
B. nonlinear functions

C. discrete functions

D. exponential functions



Answer: Option B
61.

How is it possible to use a different placeholder? Through the parameter . . . . . . . .

A. regression

B. classification

C. random_state

D. missing_values

Answer: Option D
62.

Which of the following is a categorical data?

A. branch of bank

B. expenditure in rupees

C. prize of house

D. weight of a person

Answer: Option A
63.


The F-test

A. an omnibus test

B. considers the reduction in error when moving from the


complete model to the reduced model

C. considers the reduction in error when moving from the


reduced model to the complete model
D. can only be conceptualized as a reduction in error

Answer: Option C
Linear Regression is a supervised machine learning
algorithm.

A. TRUE

B. FALSE



Answer: Option A
65.

Which of the following are supervised learning


applications

A. Spam detection, Pattern detection, Natural Language


Processing

B. Image classification, Real-time visual tracking


C. Autonomous car driving, Logistic optimization

D. Bioinformatics, Speech recognition

Answer: Option A
Which of the following is a good test dataset
characteristic?

A. large enough to yield meaningful results

B. is representative of the dataset as a whole

C. both A and B

D. none of the above

Answer: Option C.
Features being classified are . . . . . . . . of each other in a Naive Bayes Classifier

A. independent

B. dependent
C. partial dependent

D. none

Answer: Option A
The Gini index is not biased towards multivalued attributes.

A. TRUE

B. FALSE

Answer: Option B
This unsupervised clustering algorithm terminates
when mean values computed for the current iteration of
the algorithm are identical to the computed mean
values for the previous iteration.

A. agglomerative clustering

B. conceptual clustering

C. k-means clustering
D. expectation maximization

Answer: Option C
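A minimal NumPy sketch of that stopping rule (assuming no cluster ever becomes empty): the loop ends when the newly computed means are identical to those of the previous iteration.

import numpy as np

def kmeans(X, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Assign each point to its nearest mean (squared Euclidean distance).
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # means unchanged -> terminate
            return labels, new_centers
        centers = new_centers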

In syntax of linear model lm(formula,data,..), data


refers to . . . . . . . .

A. Matrix

B. Vector

C. Array

D. List

Answer: Option B
This set of Machine Learning Multiple Choice Questions &
Answers (MCQs) focuses on “Statistical Learning
Framework”.

1. How are the points in the domain set given as input to the
algorithm?
a) Vector of features
b) Scalar points
c) Polynomials
d) Clusters
View Answer
Answer: a
Explanation: The variables are converted into a vector of
features, and then given as an input to the algorithm. The
vector is of the size (number of features x number of training
data sets). The output of the learner is usually given as a
polynomial.
2. To which input does the learner have access?
a) Testing Data
b) Label Data
c) Training Data
d) Cross-Validation Data
View Answer
Answer: c
Explanation: The learner gets access to a particular set of
data on which it trains. This data is called as training data.
Testing Data is used for testing of the learner’s outputs. The
best outputs are then used on the cross-validation data. The
label data is a representation of different types of the
dependent variables.
3. The set which represents the different instances of the
target variable is known as ______
a) domain set
b) training set
c) label set
d) test set
View Answer
Answer: c
Explanation: Label Set denotes all the possible forms the
target variable can take (for e.g. {0,1} or {yes, no} in a
logistic regression problem). Domain Set represents the
vector of features, given as input to the learner. Training Set
and Test Set are parts of the Domain Set which are used for
training and testing respectively.
4. What is the learner’s output also called?
a) Predictor, or Hypothesis, or Classifier
b) Predictor, or Hypothesis, or Trainer
c) Predictor, or Trainer, or Classifier
d) Trainer, or Hypothesis, or Classifier
View Answer
Answer: a
Explanation: The output is called a predictor when it is used
to predict the type or the numerical value of the target
variable. It is called a hypothesis when it is a general
statement about the data set. It is called a classifier when it is
used to classify the training set in two or more types.
5. It is assumed that the learner has prior knowledge about
the probability distribution which generates the instances in
a training set.
a) True
b) False
View Answer
Answer: b
Explanation: The learner has no prior knowledge about the
distribution. It is assumed that the distribution is completely
arbitrary. It is also assumed that there is a function which
“correctly” labels the training examples. The learner’s job is
to find out this function.
6. The labeling function is known to the learner in the
beginning.
a) True
b) False
View Answer
Answer: b
Explanation: The function is unknown to the learner as this
is what the learner is trying to find out. In the beginning, the
learner just knows about the training set and the
corresponding label set.
7. The papaya learning algorithm is based on a dataset that
consists of three variables – color, softness, tastiness of the
papaya. Which is more likely to be the target variable?
a) Tastiness
b) Softness
c) Papaya
d) Color
View Answer
Answer: a
Explanation: The tastiness is dependent on how ripe the
papaya is. The ripeness is determined by the color and
softness. Hence color and softness are the independent
variables and the tastiness is the dependent variable or
target variable.
8. The error of classifier is measured with respect to
_________
a) variance of data instances
b) labeling function
c) probability distribution
d) probability distribution and labeling function
View Answer
Answer: d
Explanation: The error is the probability of choosing a
random instance from the data set and then misclassifying it
using the labeling function.
9. What is not accessible to the learner?
a) Training Set
b) Label Set
c) Labeling Function
d) Domain Set
View Answer
Answer: c
Explanation: The learner has access to the domain set, from
which it extracts the training set. The label set is also given.
Then the algorithm is applied to the training set to teach the
learner, a function to determine the correct label of a given
instance. This is the labeling function.
10. What are the possible values of A, B, and C in the
following diagram?

a) A – Training Set, B – Domain Set, C – Cross-Validation


Set
b) A – Training Set, B – Test Set, C – Cross-Validation Set
c) A – Training Set, B – Test Set, C – Domain Set
d) A – Test Set, B – Domain Set, C – Training Set
View Answer
Answer: b
Explanation: Domain Set comprises the total input data set. It is usually divided into a training set, a test set and a cross-validation set in the ratio 3:1:1. Since the learner learns about the data set from the training set, the training set is usually larger than the test and cross-validation sets.
This set of Machine Learning Multiple Choice Questions &
Answers (MCQs) focuses on “Empirical Minimization
Framework”.

1. The true error is available to the learner.


a) True
b) False
View Answer
Answer: b
Explanation: True error is calculated with respect to the
probability distribution of the generation of dataset
instances and labeling function. These two are not available
to the learner. Hence, the learner cannot calculate the true
error.
2. What is one of the drawbacks of Empirical Risk
Minimization?
a) Underfitting
b) Both Overfitting and Underfitting
c) Overfitting
d) No drawbacks
View Answer
Answer: c
Explanation: Empirical Risk Minimization makes the
learner output a predictor which gives the minimum error
on the training set. This often leads to a predictor that is
specifically designed to be accurate on the training data set
but fails to be highly accurate on the test set, as the predictor
was training set-specific. This is overfitting.
3. The error available to the learner is ______
a) true error
b) error of the classifier
c) training error
d) testing error
View Answer
Answer: c
Explanation: The learner only knows about the error it
incurred over the training set instances. It is minimized by
the learner to produce the labeling function. This is then
used on the testing set to generate a testing error. The true error, in contrast, is the error produced by randomly selecting an instance from the dataset and misclassifying it using the labeling function.
4. Which is the more desirable way to reduce overfitting?
a) Giving an upper bound to the size of the training set
b) Making the test set larger than the training set
c) Giving an upper bound to the accuracy obtained on the
training set
d) Overfitting cannot be reduced
View Answer
Answer: a
Explanation: More the number of training set examples,
more specific the predictor is going to be to the training set.
Hence reducing it can reduce overfitting. Making the test set
larger than the training set will lead to underfitting, which is
not desirable. Giving an upper bound on accuracy can
abruptly stop the learner at a premature stage.
5. What is the relation between Empirical Risk Minimization
and Training Error?
a) ERM tries to maximize training error
b) ERM tries to minimize training error
c) It depends on the dataset
d) ERM is not concerned with training error
View Answer
Answer: b
Explanation: ERM makes the learner develop a predictor
which works well on the training data (data available to the
learner). Its aim is to minimize the error. Lesser the error,
the better is the predictor (not considering overfitting).
6. What happens due to overfitting?
a) Hypothesis works poorly on training data but works well
on test data
b) Hypothesis works well on training data and works well on
test data
c) Hypothesis works well on training data but works poorly
on test data
d) Hypothesis works poorly on training data and works
poorly on test data
View Answer
Answer: c
Explanation: ERM tries to minimize the training error. This
often leads to the learner producing a hypothesis that is too
specific to the training data. This then performs badly on
any other data set. This is overfitting.
7. What is assumed while using empirical risk minimization
with inductive bias?
a) The learner has some prior knowledge about training
data
b) The learner has some knowledge about labeling function
c) Reduction of overfitting may lead to underfitting
d) No assumptions are made
View Answer
Answer: a
Explanation: The learner must choose a hypothesis from a
set of H, reduced hypothesis space. Since the choice is
determined before seeing the training set, the learner needs
to have prior knowledge of training data.
8. The hypothesis space H for inductive bias is a finite class.
a) False
b) True
View Answer
Answer: b
Explanation: The hypothesis space H contains a finite number of hypotheses. The learner is restricted to choose from only these hypotheses. If the hypothesis space is not finite, there is no question of restriction.
9. The assumption that the training set instances are
independently and identically distributed is known as the
__________
a) empirical risk assumption
b) inductive bias assumption
c) i.i.d. assumption
d) training set rule
View Answer
Answer: c
Explanation: The letters i.i.d. stand for independently and identically distributed. The instances are
not dependent on each other. Every one of them is unique,
and it is assumed that the instances follow a certain
distribution.
10. Delta is the __________ parameter of the prediction.
a) training
b) confidence
c) accuracy
d) computing
View Answer
Answer: b
Explanation: The confidence parameter is used to state that
the chosen hypothesis will give a successful outcome with a
certain probability. This probability is given by (1 – delta).
This set of Machine Learning Multiple Choice Questions &
Answers (MCQs) focuses on “PAC Learning”.

1. Who introduced the concept of PAC learning?


a) Francis Galton
b) Reverend Thomas Bayes
c) J.Ross Quinlan
d) Leslie Valiant
View Answer
Answer: d
Explanation: Valiant introduced PAC learning. Galton
introduced Linear Regression. Quinlan introduced the
Decision Tree. Bayes introduced Bayes’ rule and Naïve-
Bayes theorem.
2. When was PAC learning invented?
a) 1874
b) 1974
c) 1984
d) 1884
View Answer
Answer: c
Explanation: Leslie Valiant was the inventor of PAC
learning. He described it in 1984. It was introduced as a part
of the Computational Learning Theory.
3. The full form of PAC is ______
a) Partly Approximation Computation
b) Probability Approximation Curve
c) Probably Approximately Correct
d) Partly Approximately Correct
View Answer
Answer: c
Explanation: Probably Approximately Correct tries to build
a hypothesis that can predict with low generalization error
(approximately correct), with high probability (probably).
4. What can be explained by PAC learning?
a) Sample Complexity
b) Overfitting
c) Underfitting
d) Label Function
View Answer
Answer: a
Explanation: PAC learning can give a rough estimate of the
number of training examples required by the learning
algorithm to develop a hypothesis with the desired accuracy.
5. What is the significance of epsilon in PAC learning?
a) Probability of approximation < epsilon
b) Maximum error < epsilon
c) Minimum error > epsilon
d) Probability of approximation = delta – epsilon
View Answer
Answer: b
Explanation: A concept is PAC learnable by L if L can
output a hypothesis with error < epsilon. Hence the
maximum error obtained by the hypothesis should be less
than epsilon. Epsilon is usually 5% or 1%.
6. What is the significance of delta in PAC learning?
a) Probability of approximation < delta
b) Error < delta
c) Probability = 1 – delta
d) Probability of approximation = delta – epsilon
View Answer
Answer: c
Explanation: A concept C is PAC learnable by L if L can
predict a hypothesis with a certain error with probability
equal to 1-delta. Delta is usually very low.
7. One of the goals of PAC learning is to give __________
a) maximum accuracy
b) cross-validation complexity
c) error of classifier
d) computational complexity
View Answer
Answer: d
Explanation: PAC learning tells us about the amount of
effort required for computation so that a learner can come
up with a successful hypothesis with high probability.
8. A learner can be deemed consistent if it produces a
hypothesis that perfectly fits the __________
a) cross-validation data
b) overall dataset
c) test data
d) training data
View Answer
Answer: d
Explanation: PAC learning is concerned with the behavior
of the learning algorithm. The learner has access to only the
training data set.
9. Number of hypotheses |H| = 973, probability = 95%,
error < 0.1. Find minimum number of training examples, m,
required.
a) 98.8
b) 99.8
c) 99
d) 98
View Answer
Answer: c
Explanation: Probability = 1 – delta (d) = 0.95; d = 0.05.
Error < epsilon (e); e = 0.1.
m >= (1/e)(ln |H| + ln (1/d)), i.e. m >= (1/0.1)(ln 973 + ln (1/0.05)), i.e. m >= 98.8.
Since number of training examples, m, has to be an integer,
answer is 99.
10. In PAC learning, sample complexity grows as the logarithm of the number of hypotheses.
a) False
b) True
View Answer
Answer: b
Explanation: Sample complexity is the number of training
examples required to converge to a successful hypothesis. It
is given by m >= 1/e (ln |H| + ln (1/d)), where m is the
number of training examples and H is hypothesis space.
This set of Machine Learning Multiple Choice Questions &
Answers (MCQs) focuses on “Version Spaces”.

1. A Boolean-valued function can be an example of concept


learning.
a) True
b) False
View Answer
Answer: a
Explanation: Any function which can describe some concept
based on datasets can be part of concept learning. For e.g. a
function over pictures of animals, that is true for pictures of
cats, false for others.
2. How do we learn concepts from training examples?
a) Arbitrarily
b) Decremental
c) Incrementally
d) Non-incremental
View Answer
Answer: c
Explanation: From each training instance, a concept is
enhanced. We start with a basic hypothesis. Then at each
step, we go on developing the hypothesis based on the
training example.
3. What is the goal of concept learning?
a) To minimize cross-validation set error
b) To maximize test set accuracy
c) To find a hypothesis that is most suitable for training
instances
d) To identify all possible predictors
View Answer
Answer: c
Explanation: Concept learning algorithms are applied to the
training set. They output the hypothesis which best suits the
training set, irrespective of the possible overfitting.
4. Which is not a concept learning algorithm?
a) ID3
b) Find-S
c) Candidate Elimination
d) List-Then-Eliminate
View Answer
Answer: a
Explanation: The ID3 algorithm is one of the decision tree
algorithms. All the others – find-s, candidate elimination,
and list-then-eliminate are concept learning algorithms.
5. In the list-then-eliminate algorithm, the initial version
space contains _____
a) most specific hypothesis
b) all hypotheses in H
c) most accurate hypothesis
d) most general hypothesis
View Answer
Answer: b
Explanation: Initially, the version space contains all
hypotheses in H, including the most specific and most
general hypothesis. Gradually, it eliminates hypotheses
which are not suitable for the training sets. Finally, it
reaches the hypothesis which is most accurate for the
training set.
6. What happens to the version space in the list-then-
eliminate algorithm, at each step?
a) Remains the same
b) Increases
c) Shrinks
d) Depends on dataset
View Answer
Answer: c
Explanation: Since the version space initially contains all the
hypotheses, it gradually shrinks. At every step, it applies all
the hypotheses remaining in the version space and removes
each one that does not satisfy the current training example.
7. The list-then-eliminate algorithm can output more than
one hypothesis.
a) True
b) False
View Answer
Answer: a
Explanation: If the data available to the learner is
insufficient, the algorithm can output all the hypotheses that
still remain in the version space – they are consistent with
observed data.
8. What is the advantage of the list-then-eliminate
algorithm?
a) Computation is less
b) Time-effective
c) Overfitting never occurs
d) Contains all hypotheses consistent with observed data
View Answer
Answer: d
Explanation: As the initial version space contains all
hypotheses, the algorithm always outputs every hypothesis
consistent with training data. A concept learning algorithm
can always overfit. Since, all the hypotheses in version space
are tried at every step, a lot of computation is done which
takes a lot of time.
9. For a dataset with 4 attributes, which is the most general
hypothesis?
a) (Sunny, Warm, Strong, Humid)
b) (Sunny, ?, ?, ?)
c) (?, ?, ?, ?)
d) (phi, phi, phi, phi)
View Answer
Answer: c
Explanation: The most general hypothesis is the one that can
accept any training example. For any attribute, the most
general notation is (?). So, any hypothesis which consists of
only (?) is the most general hypothesis.
10. How is a hypothesis represented in concept learning?
a) Scalar
b) Vector
c) Polynomial
d) Either scalar or vector
View Answer
Answer: b
Explanation: The hypothesis is always expressed as a vector.
If the dataset contains n independent variables, then the
hypothesis is a vector with n constraints, each of which
specifies one of the attributes.
1. What is present in the version space of the Find-S
algorithm in the beginning?
a) Set of all hypotheses H
b) Both maximally general and maximally specific
hypotheses
c) Maximally general hypothesis
d) Maximally specific hypothesis
View Answer
Answer: d
Explanation: Initially, only the maximally specific
hypothesis is contained. That is generalized step by step after
encountering every positive example. At any stage, the
hypothesis is the most specific hypothesis consistent with
training data.
2. When does the hypothesis change in the Find-S algorithm,
while iteration?
a) Any example (positive or negative) is encountered
b) Any negative example is encountered
c) Positive Example inconsistent with the hypothesis is
encountered
d) Any positive example is encountered
View Answer
Answer: c
Explanation: Find-S algorithm does not care about any
negative example. It only changes when a positive example is
encountered, which is inconsistent with the given hypothesis.
The attribute(s) which is inconsistent is changed into a more
generalized form.
3. What is one of the assumptions of the Find-S algorithm?
a) No assumptions are made
b) The most specific hypothesis is also the most general
hypothesis
c) All training data are correct (there is no noise)
d) Overfitting does not occur
View Answer
Answer: c
Explanation: Since no negative examples are being
considered, a huge part of the data is discarded. In order to
output an accurate hypothesis, the remaining dataset must
be noise free and adequate.
4. What is one of the advantages of the Find-S algorithm?
a) Computation is faster than other concept learning
algorithms
b) All correct hypotheses are output
c) Most generalized hypothesis is output
d) Overfitting does not occur
View Answer
Answer: a
Explanation: All negative data are discarded. Version Space
consists of only one hypothesis, whereas the version space of
other algorithms consists of more than one hypothesis. At
each step, the version space is compared with the training
data instance. So, for Find-S computation is much faster.
5. How does the hypothesis change gradually?
a) Specific to Specific
b) Specific to General
c) General to Specific
d) General to General
View Answer
Answer: b
Explanation: Initially, the hypothesis is most specific –
consists of only phi. Gradually, after encountering each new
positive example, it generalizes with a change in attributes,
to remain consistent with training data.
6. S = <phi, phi, phi>. Training data = <rainy, cold, white> =>
No (negative example). How will S be represented after
encountering this training data?
a) <phi, phi, phi>
b) <sunny, warm, white>
c) <rainy, cold, black>
d) <?, ?, ?>
View Answer
Answer: a
Explanation: When a negative example is encountered, the
Find-S algorithm ignores it. Hence, the hypothesis remains
unchanged. It will only change when a positive example inconsistent with the hypothesis is encountered.
7. What is one of the drawbacks of the Find-S algorithm?
a) Computation cost is high
b) Time-ineffective
c) All correct hypotheses are not output
d) Most specific accurate hypothesis is not output
View Answer
Answer: a
Explanation: The hypothesis generated is always the most
specific one at each step. A more generalized hypothesis can
be there but it is not considered as negative examples are
discarded. Thus, hypotheses that are consistent with training
data may not be output by the learner.
8. Noise or errors in the dataset can severely affect the
performance of the Find-S algorithm.
a) True
b) False
View Answer
Answer: a
Explanation: The algorithm ignores negative examples.
Since a huge part of the dataset is discarded, the accuracy of
the learned hypothesis becomes heavily dependent on the
remaining portion. If there is error or noise in this part,
huge inaccuracies may occur.
9. S = <phi, phi, phi> Training data = <square, pointy,
white> => Yes (positive example). How will S be represented
after encountering this training data?
a) <phi, phi, phi>
b) <square, pointy, white >
c) <circular, blunt, black>
d) <?, ?, ? >
View Answer
Answer: b
Explanation: Initially, S contains phi, which implies that no
example is positive. It encounters a positive example, which
is inconsistent with the current hypothesis. So, it generalizes
accordingly to approve the new example. It thus takes the
values of the training instance.
10. The algorithm accommodates all the maximally specific
hypotheses.
a) True
b) False
View Answer
Answer: b
Explanation: S contains phi initially. Then it gradually
generalizes, with new training examples. But it only
generalizes in a particular order and never backtracks. It
never considers a different branch which may lead to a
different target concept.
1. The algorithm is trying to find a suitable day for
swimming. What is the most general hypothesis?
a) A rainy day is a positive example
b) A sunny day is a positive example
c) No day is a positive example
d) Every day is a positive example
View Answer
Answer: d
Explanation: The most general hypothesis must accept any
type of data instance. In this case, the hypothesis that states
that any day is a positive example accepts all the specific
days as positive.
2. Candidate-Elimination algorithm can be described by
____________
a) just a set of candidate hypotheses
b) depends on the dataset
c) set of instances, set of candidate hypotheses
d) just a set of instances
View Answer
Answer: c
Explanation: A set of instances is required. A set of
candidate hypotheses are given. These are applied to the
training data and the list of accurate hypotheses is output in
accordance with the candidate-elimination algorithm.
3. How is the version space represented?
a) Least general members
b) Most general members
c) Most general and least general members
d) Arbitrary members chosen form hypothesis space
View Answer
Answer: c
Explanation: The algorithm starts with the most general and
most specific (least general members). Then it tries to specify
more general members or generalize more specific members
based on the data from the training examples.
4. Let G be the set of maximally general hypotheses. While
iterating through the dataset, when is it changed for the first
time?
a) Negative example is encountered for the first time
b) Positive example is encountered for the first time
c) First example encountered, irrespective of whether it is
positive or negative
d) S, the set of maximally specific hypotheses, is changed
View Answer
Answer: a
Explanation: The most general hypothesis states that any
example is a positive example. So, it changes the first time
when it encounters the first negative example. It takes the
values of each attribute, other than the values, in the
negative example.
5. Let S be the set of maximally specific hypotheses. While
iterating through the dataset, when is it changed for the first
time?
a) Negative example is encountered for the first time
b) Positive example is encountered for the first time
c) First example encountered, irrespective of whether it is
positive or negative
d) G, the set of maximally general hypotheses, is changed
View Answer
Answer: b
Explanation: The most specific hypothesis states that any
example is a negative example. So, it changes the first time
when it encounters the first positive example. It takes the
values of each attribute in the positive example.
6. S = <sunny, warm, high, same>. Training data = <sunny,
warm, normal, same> => Yes (positive example). How will S
be represented after encountering this training data?
a) <sunny, warm, high, same>
b) <phi, phi, phi, phi>
c) <sunny, warm, ?, same>
d) <sunny, warm, normal, same>
View Answer
Answer: c
Explanation: Initially the S hypothesis states that if the
conditions are sunny, warm, high and same, then only the
example will be positive. But it encounters an example that
contains normal instead of high and is a positive example.
Hence, that attribute is not specific and it needs to be
generalized.
7. S = <phi, phi, phi, phi>. Training data = <rainy, cold,
normal, change> => No (negative example). How will S be
represented after encountering this training data?
a) <phi, phi, phi, phi>
b) <sunny, warm, high, same>
c) <rainy, cold, normal, change>
d) <?, ?, ?, ?>
View Answer
Answer: a
Explanation: Initially S is phi, which implies that the learner
is yet to encounter a positive example. S will remain the
same after encountering another negative example. It will
change only after encountering a positive example.
8. G = <?, ?, ?, ?>. Training data = <sunny, warm, normal,
same> => Yes (positive example). How will G be represented
after encountering this training data?
a) <sunny, warm, normal, same>
b) <phi, phi, phi, phi>
c) <rainy, cold, normal, change>
d) <?, ?, ?, ?>
View Answer
Answer: d
Explanation: Initially G is (?), which implies that the learner
is yet to encounter a negative example. G will remain the
same after encountering another positive example. It will
change only after encountering a negative example.
9. G = (<sunny, ?, ?, ?> ; <?, warm, ?, ?> ; <?, ?, high, ?>).
Training data = <sunny, warm, normal, same> => Yes
(positive example). How will G be represented after
encountering this training data?
a) <phi, phi, phi, phi>
b) (<sunny, ?, ?, ?> ; <?, warm, ?, ?> ; <?, ?, high, ?>)
c) (<sunny, ?, ?, ?> ; <?, warm, ?, ?>)
d) <?, ?, ?, ?>
View Answer
Answer: c
Explanation: Initially, third hypothesis in set G states that
irrespective of other attributes, if third attribute is high, the
example is positive and it is negative, otherwise. But, in the
given example, third attribute is normal (not high), but still
the example is positive. Thus, the third hypothesis is
incorrect. So, it is discarded.
10. It is possible that in the output, set S contains only phi.
a) False
b) True
View Answer
Answer: b
Explanation: Initially, set S contains only phi. It states that
no example is positive. If there is no positive example in the
dataset, the set will not change. Even after a complete
iteration, S will remain the same and will contain only phi.
1. What does VC dimension do?
a) Reduces complexity of hypothesis space
b) Removes noise from dataset
c) Measures complexity of training dataset
d) Measures the complexity of hypothesis space H
View Answer
Answer: d
Explanation: The VC dimension measures the complexity of
the hypothesis space H, not by the number of distinct
hypotheses |H|, but by the number of distinct instances from
X that can be completely discriminated using H.
2. An instance set S is given. How many dichotomies are
possible?
a) 2*|S|
b) 2/|S|
c) 2^|S|
d) |S|
View Answer
Answer: c
Explanation: Given some instance set S, there are 2^|S| possible dichotomies. H shatters S if every possible dichotomy of S can be represented by some hypothesis from H.
3. If h is a straight line, what is the maximum number of
points that can be shattered?
a) 4
b) 2
c) 3
d) 5
View Answer
Answer: c
Explanation: The main concept is to separate the positive data points from the negative data points. This is done using the line. So, the maximum number of points that can be shattered (or separated) is 3.
4. What is the VC dimension of a straight line?
a) 3
b) 2
c) 4
d) 0
View Answer
Answer: a
Explanation: The maximum number of points shattered by
the straight line is 3. Since VC dimension is the maximum
number of points shattered, the VC dimension of a straight
line is 3.
5. A set of 3 instances is shattered by _____ hypotheses.
a) 4
b) 8
c) 3
d) 2
View Answer
Answer: b
Explanation: A set S of instances can be shattered by 2^|S|
hypotheses. Here, the number of instances is 3. Hence, the
number of hypotheses is 2^3, i.e. 8.
6. What is the relation between VC dimension and
hypothesis space H?
a) VC(H) <= |H|
b) VC(H) != log2|H|
c) VC(H) <= log2|H|
d) VC(H) > log2|H|
View Answer
Answer: c
Explanation: Suppose that VC(H) = d. Then H will require
2^d distinct hypotheses to shatter d instances. Hence, 2^d <=
|H|, and d = VC(H) <= log2|H|.
7. VC Dimension can be infinite.
a) True
b) False
View Answer
Answer: a
Explanation: The VC dimension is defined as the largest
number of points in set X that can be shattered (or
separated) successfully by the chosen hypothesis space. If the
hypothesis space H can separate arbitrarily large number of
data points in a given set, then VC(H) = ∞.
8. Who invented VC dimension?
a) Francis Galton
b) J. Ross Quinlan
c) Leslie Valiant
d) Vapnik and Chervonenkis
View Answer
Answer: d
Explanation: Vapnik and Chervonenkis introduced VC
dimension. Valiant introduced the concept of PAC learning.
Galton introduced Linear Regression. Quinlan introduced
the Decision Tree.
9. What is the advantage of VC dimension over PAC
learning?
a) VC dimension reduces complexity of training data
b) VC dimension outputs more accurate predictors
c) VC dimension can work for infinite hypothesis space
d) There is no advantage
View Answer
Answer: c
Explanation: In the case of infinite hypothesis spaces we
cannot apply the PAC bound m >= (1/ε)(ln|H| + ln(1/δ)),
because |H| is infinite. Hence, we consider the VC dimension
of H instead, which can be finite even when |H| is not.
10. IF VC(H) increases, number of maximum training
examples required (m) increases.
a) False
b) True
View Answer
Answer: b
Explanation: m >= (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε)).
Thus, m is directly proportional to VC(H). Hence, if VC(H)
increases, m also increases.
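As a rough check of the bound quoted above, the sketch below evaluates m >= (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε)) for sample values of ε, δ and VC(H); the numbers chosen here are illustrative only.

import math

def m_bound(eps, delta, vc):
    # sample-complexity bound quoted in the explanation above
    return (1 / eps) * (4 * math.log2(2 / delta) + 8 * vc * math.log2(13 / eps))

print(m_bound(0.1, 0.05, 3))    # grows as VC(H) grows
print(m_bound(0.1, 0.05, 10))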
1. Instance space: X = set of real numbers, Hypothesis space
H: the set of intervals on the real number line. a and b can
be any constants used to represent the hypothesis. How is H
represented?
a) a – b < a + b
b) a + b < x < 2(a+b)
c) a/b < x < a*b
d) a < x < b
View Answer
Answer: d
Explanation: H is the set of hypotheses of the form a < x < b,
where a and b may be any real constants. It signifies an
interval whose lower bound is a and the upper bound is b. X
is contained within these bounds.
2. S = {3.1, 5.7}. How many hypotheses are required?
a) 2
b) 3
c) 4
d) 1
View Answer
Answer: c
Explanation: The four hypotheses (1 < x < 2), (1 < x < 4), (4 <
x < 7), and (1 < x < 7) represent the four dichotomies over S,
covering neither instance, either one of the instances, and
both of the instances, respectively.
3. S = {x0, x1, x2}. This set can be shattered by hypotheses of
form a < x < b, where a and b are arbitrary constants.
a) True
b) False
View Answer
Answer: b
Explanation: Without loss of generality, assume x0 < x1 <
x2. This set cannot be shattered, because the dichotomy that
includes x0 and x2, but not x1, cannot be represented by a
single closed interval.
4. S = {x0, x1, x2}. Hypotheses are of the form a < x < b.
What is H?
a) infinite
b) 0
c) 2
d) 1
View Answer
Answer: a
Explanation: No hypotheses of the form a < x < b, where a
and b can be any constants, can shatter the three points.
Suppose a hypothesis wants to shatter (x0, x2) and x1, but
there are no such intervals which contain both x0 and x2 but
not x1.
5. S = {x0, x1, x2}. Hypotheses are of the form a < x < b.
What is VC(H)?
a) 0
b) 2
c) 1
d) infinite
View Answer
Answer: b
Explanation: No subset S of size three can be shattered.
Maximum two points can be shattered by hypotheses of the
form a < x < b, where a and b can be any constants. Thus
VC(H) = 2.
6. S = {x0, x1, x2} and H is finite. What is VC(H)?
a) 1
b) 2
c) 3
d) infinite
View Answer
Answer: c
Explanation: Instead of random intervals, hypotheses are
straight lines. They can shatter all three points. For straight
lines, VC(H) = n, where n is the number of instances. Thus
VC(H) is 3.
7. S = {x0, x1, x2}. Hypotheses are straight lines. What is H?
a) 8
b) 3
c) 4
d) infinite
View Answer
Answer: a
Explanation: VC(H) = 3. The number of points is 3. Thus,
the number of dichotomies is 2^3 or 8. Each of the
2^3 dichotomies of these three instances is covered by some
hypothesis, a straight line. Thus H contains 8 hypotheses.
8. S = {x0, x1, x2, x3}. Hypotheses are straight lines. What is
H?
a) 8
b) 3
c) 4
d) infinite
View Answer
Answer: d
Explanation: VC dimension of straight line hypotheses is 3.
Hence, the maximum number of points that can be shattered
is 3. 4 points cannot be shattered. Thus H contains an
infinite number of hypotheses.
9. S contains 4 instances. H is the hypothesis space and it can
shatter S. What is the correct form of hypotheses?
a) Hypotheses are intervals
b) Hypotheses are straight lines
c) Hypotheses are rectangles
d) No such hypotheses exist
View Answer
Answer: c
Explanation: A rectangle can distinguish between 4 points
easily. So, they can shatter S. Number of instances is 4. The
number of hypotheses is 2^4 or 16.
10. For which combination H is infinite and VC(H) is finite?
a) S contains 2 points and hypotheses are intervals
b) S contains 3 points and hypotheses are intervals
c) S contains 3 points and hypotheses are straight lines
d) S contains 2 points and hypotheses are straight lines
View Answer
Answer: b
Explanation: Suppose S contains x0, x1, and x2, with x0 < x1
< x2. There is no interval that contains x0 and x2 but not x1,
so S is not shattered by H. Since a and b can be any real
constants, H is infinite. The maximum number of instances
that can be shattered by intervals is 2, so VC(H) is 2.
1. Any ERM rule is a successful PAC learner for hypothesis
space H.
a) True
b) False
View Answer
Answer: a
Explanation: ERM rules try to reduce the hypothesis error
on the training dataset, even if it overfits. PAC learner tries
to calculate the probability of finding an instance in which
the hypothesis cannot label correctly.
2. If distribution D assigns zero probability to instances
where h not equal to c, then an error will be ______
a) 1
b) 0.5
c) 0
d) infinite
View Answer
Answer: c
Explanation: For every instance selected, h is equal to c.
Thus no error occurs. Hence, the probability of finding an
error will be zero.
3. If distribution D assigns zero probability to instances
where h = c, then an error will be ______
a) Cannot be determined
b) 0.5
c) 1
d) 0
View Answer
Answer: c
Explanation: D assigns zero probability to instances where
h = c. Thus, whichever instance is selected, h is not equal to c,
i.e. an error occurs. Since an error is raised for every instance,
the probability of finding an error is 1.
4. Error strongly depends on distribution D.
a) True
b) False
View Answer
Answer: a
Explanation: If D is a uniform probability distribution that
assigns the same probability to every instance in X, then the
error for the hypothesis will be the fraction of the total
instance space that falls into the region where h and c
disagree.
5. PAC learning was introduced by ____________
a) Vapnik
b) Leslie Valiant
c) Chervonenkis
d) Reverend Thomas Bayes
View Answer
Answer: b
Explanation: Leslie Valiant introduced PAC Learning in
1984. Vapnik and Chervonenkis introduce the idea of VC
dimension. Thomas Bayes published Bayes’ Theorem.
6. Error is defined over the _____________
a) training set
b) test Set
c) domain set
d) cross-validation set
View Answer
Answer: c
Explanation: Error is defined over the entire distribution of
instances-not simply over the training examples-because this
is the true error one expects to encounter when actually
using the learned hypothesis h on subsequent instances
drawn from D.
7. The error of h with respect to c is the probability that a
randomly drawn instance will fall into the region where
_________
a) h and c disagree
b) h and c agree
c) h is greater than c but not less
d) h is lesser than c but not greater
View Answer
Answer: a
Explanation: The concepts c and h are depicted by the sets
of instances within X that they label as positive. If c says the
label is positive but h says it is negative, an error has occurred,
and it is this error's probability that is calculated.
8. When was PAC learning invented?
a) 1954
b) 1964
c) 1974
d) 1984
View Answer
Answer: d
Explanation: PAC learning was introduced in the domain
Computational Learning Theory. Leslie Valiant introduced
the concept of PAC learning in 1984.
1. In which category does linear regression belong to?
a) Neither supervised nor unsupervised learning
b) Both supervised and unsupervised learning
c) Unsupervised learning
d) Supervised learning
View Answer
Answer: d
Explanation: In linear regression, the dataset is given to the
learner. The classification is already done in the training
data, from which the learner can learn. Hence, it is
supervised learning.
2. The learner is trying to predict housing prices based on
the size of each house. What type of regression is this?
a) Multivariate Logistic Regression
b) Logistic Regression
c) Linear Regression
d) Multivariate Linear Regression
View Answer
Answer: c
Explanation: The learner is trying to predict a numerical value
instead of a binary output (yes or no / 0 or 1, etc.). Hence, it is
linear regression and not logistic regression. Since there
is only one independent variable, it is not multivariate linear
regression.
3. The learner is trying to predict housing prices based on
the size of each house. The variable “size” is ___________
a) dependent variable
b) label set variable
c) independent variable
d) target variable
View Answer
Answer: c
Explanation: The variable “size” is not dependent on the
price of the house. So, it is an independent variable. The
variable “price” is dependent on size.
4. The target variable is represented along ____________
a) Y axis
b) X axis
c) Either Y-axis or X-axis, it doesn’t matter
d) Depends on the dataset
View Answer
Answer: a
Explanation: The target variable is dependent on the other
variables. So, it is represented as a function of the
independent variable. Thus, it is plotted along the y-axis,
after its value is calculated using the function.
5. The learner is trying to predict the cost of papaya based
on its size. The variable “cost” is __________
a) independent variable
b) target Variable
c) ranked variable
d) categorical variable
View Answer
Answer: b
Explanation: The cost is dependent on the size of the papaya.
It is measured as a function of the size. The learner’s goal is
to predict the price of the papaya. Hence, it is the target
variable. It is represented along y-axis.
6. The independent variable is represented along _________
a) Either X-axis or Y-axis, it doesn’t matter
b) Y axis
c) X axis
d) Depends on the dataset
View Answer
Answer: c
Explanation: As the name suggests, the independent variable
is not dependent on other variables. The target variable’s
value is predicted using the value of the independent
variable. Thus, it is represented along the x-axis.
7. How many variables are required to represent a linear
regression model?
a) 3
b) 2
c) 1
d) 4
View Answer
Answer: a
Explanation: Three variables are required. They are: 1) m
= the number of training examples, 2) x = the independent
variable, 3) y = the target variable.
8. What does (x(5), y(5)) represent or imply?
a) There are 5 training examples
b) The values of x and y are 5
c) The fourth training example
d) The fifth training example
View Answer
Answer: d
Explanation: In a linear regression model, the set (x(i), y(i))
represents the ith example in the training set. x(i) gives the
ith value of x, and y(i) gives the ith value of y.
9. Hypothesis h maps from x (independent variable) to y
(dependent variable).
a) True
b) False
View Answer
Answer: a
Explanation: The hypothesis is developed by the learner to
predict y. It reads the value of x and then uses the mapping
function to calculate the value of y for the given value of x.
10. Learning algorithm outputs the hypothesis.
a) False
b) True
View Answer
Answer: b
Explanation: The learning algorithm iterates through the
training dataset, develops a function with minimal error to
predict the value of the target variable based on the value of
the independent variable.
1. The hypothesis is given by h(x) = t0 + t1x. What are t0 and
t1?
a) Value of h(x) when x is 0, intercept along y-axis
b) Value of h(x) when x is 0, the rate at which h(x) changes
with respect to x
c) The rate at which h(x) changes with respect to x, intercept
along the y-axis
d) Intercept along the y-axis, the rate at which h(x) changes
with respect to x
View Answer
Answer: d
Explanation: Since t1 is the coefficient of x, it is the rate at
which h(x) changes with respect to x. t0 is the intercept at the
y-axis, but practically, it may not be the value of h(x) when
x=0.
2. The hypothesis is given by h(x) = t0 + t1x. t0 gives the value
of h(x) when x is 0.
a) True
b) False
View Answer
Answer: b
Explanation: Although t0 is the intercept along the y-axis, it
is not always the value of h(x) when x is 0. For e.g. a learner
predicts a hypothesis which gives the price of a house based
on its size. The hypothesis may have a y-intercept but that
does not mean it is equal to the price of the house whose size
is 0. A non-existent house cannot have a price.
3. The hypothesis is given by h(x) = t0 + t1x. What is the goal
of t0 and t1?
a) Give negative h(x)
b) Give h(x) as close to 0 as possible, without themselves
being 0
c) Give h(x) as close to y, in training data, as possible
d) Give h(x) closer to x than y
View Answer
Answer: c
Explanation: t0 and t1 try to minimize prediction error on
the training set. Since y is the target variable, h(x) must be y
(ideally) or closer to y. This is what t0 and t1 try to achieve.
4. The hypothesis is given by h(x) = t0 + t1x. What does t1 = 0
after several iterations imply?
a) The target variable is independent of x
b) Hypothesis is wrong
c) t0 is 0
d) x is the target variable
View Answer
Answer: a
Explanation: The equation t1 = 0 implies that h(x) does not
change with change in x. The value of h(x) is not dependent
on x. Thus, the target variable is not dependent on x.
5. In a linear regression problem, h(x) is the predicted value
of the target variable, y is the actual value of the target
variable, m is the number of training examples. What do we
try to minimize?
a) (h(x) – y) / m
b) (h(x) – y)^2 / 2*m
c) (h(x) – y) / 2*m
d) (y – h(x))
View Answer
Answer: b
Explanation: The objective is to find the difference between
the predicted value and actual value of the target variable
and to minimize this error. If we get a negative value, then
no minimizing can be done, that’s why squaring is done. To
get an average over the dataset, the squared value is divided
by twice the number of training examples.
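A minimal sketch of the quantity described above, the sum of (h(x) - y)^2 divided by 2*m; the hypothesis and toy dataset used here are hypothetical.

def cost(h, xs, ys):
    m = len(xs)
    return sum((h(x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

h = lambda x: 1.5 * x                        # example hypothesis h(x) = t1*x
print(cost(h, [1, 2, 3], [1.5, 3, 4.5]))     # 0.0 for a perfect fit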
6. The cost function contains a summation expression.
a) True
b) False
View Answer
Answer: a
Explanation: The objective of the cost function is to
minimize the error. Thus it calculates the error for each
example and sums it. The error for each example is basically
the difference between the predicted value and the actual
value of the target variable.
7. What is the simplified hypothesis?
a) h(x) = t1x
b) h(x) = t0 + t1x
c) h(x) = t0
d) h(x) = t0x
View Answer
Answer: a
Explanation: In the simplified hypothesis, we assume that
t0 = 0. It is safe to assume this because it is often practical
that the value of h(x) is 0 when the value of x is 0, especially
in problems where we try to output the cost price of
something.
8. The simplified hypothesis reduces the complexity of the
cost function.
a) True
b) False
View Answer
Answer: a
Explanation: When we ignore the intercept term in the
hypothesis equation, we are only left with the x term. Thus
the cost function only tries to minimize the term t1 which is
the coefficient of x. This simplifies the calculation of the cost
function to a certain degree.
9. In the simplified hypothesis, what does hypothesis H and
cost function J depend on?
a) Both are functions of x
b) J is a function of x, H is a function of t1
c) H is a function of x, J is a function of t1
d) Both are functions of t1
View Answer
Answer: c
Explanation: The simplified hypothesis: h(x) = t1x; thus h is
only dependent on the value of x, as t1 is kept constant for a
single iteration through the dataset. Now, after the
completion of one iteration through the dataset, the cost
function calculates the error and alters t1 in order to
minimize the error.
10. (x(1), y(1)) = 1, 1.5, (x(2), y(2)) = 2, 3, (x(3), y(3)) = 3, 4.5.
Hypothesis: h(x) = t1x, where t1 = 1.5. How much error is
obtained?
a) 4.5
b) 0
c) 22.5
d) 1.5
View Answer
Answer: b
Explanation: Cost function: J(t1) = [(t1x(1) – y(1))^2 + (t1x(2) – y(2))^2 + (t1x(3) – y(3))^2] / (2*m)
= [0 + 0 + 0] / (2*3)
= 0/6
= 0.
11. (x(1), y(1)) = 1, 1.5, (x(2), y(2)) = 2, 3, (x(3), y(3)) = 3, 4.5.
Hypothesis: h(x) = t1x, where t1 = 2. How much error is
obtained?
a) 0.3
b) 0
c) 0.42
d) 0.5
View Answer
Answer: c
Explanation: Cost function: J(t1) = [(t1x(1) – y(1))^2 + (t1x(2) – y(2))^2 + (t1x(3) – y(3))^2] / (2*m)
= [0.5^2 + 1^2 + 1.5^2] / (2*3)
= 2.5/6
= 0.42.
12. How to graphically find t1 for which cost function is
minimized?
a) Plot J(t1) against t1 and find minima
b) Plot t1 against J(t1) and find minima
c) Plot J(t1) against t1 and find maxima
d) Plot t1 against J(t1) and find maxima
View Answer
Answer: a
Explanation: At the minima of the graph obtained by
plotting J(t1) against t1, we have the minimal value of J(t1)
for the given dataset, using linear regression. This is the
desired cost function. So, we take this value of t1 and use it in
the final hypothesis.
13. What is the ideal value of t1?
a) 0
b) Depends on the dataset
c) 1
d) 0.5
View Answer
Answer: b
Explanation: There is no way that t1 can be determined
before observing the dataset. It can take any value based on
the rate of change of the target variable with the change of
the independent variable. It can even take a negative value if
the target variable is inversely proportional to the
independent variable.
14. Hypothesis is: h(x) = t0 + t1x. How do we graphically find
the desired cost function?
a) Plot J(t0, t1) against t0 and find minima
b) Plot J(t0, t1)) against t1 and find minima
c) Plot J(t0, t1) against either t1 or t0 and find minima
d) Make a 3-d plot with J(t0, t1) against t1 and t0 and find
minima
View Answer
Answer: d
Explanation: J(t0, t1) is dependent on both the parameters,
t1 and t0. Thus we need to find the J(t0, t1) as a function of
both t1 and t0. So we need to plot in 3 dimensions.
1. What is the goal of gradient descent?
a) Reduce complexity
b) Reduce overfitting
c) Maximize cost function
d) Minimize cost function
View Answer
Answer: d
Explanation: Gradient descent starts with some random
t0 and t1. It keeps on altering them to reduce the cost
function J(t0, t1). It stops at a point where it assumes that the
cost function is minimal.
2. Gradient descent always gives minimal cost function.
a) True
b) False
View Answer
Answer: b
Explanation: Often, the gradient descent reaches the local
minimum of the cost function. From there, anywhere it
moves, the value of cost function increases. So, the learner
assumes that point gives the minimum cost function. In this
way, the global minimum is never reached.
3. What happens when the learning rate is high?
a) It always reaches the minima quickly
b) It overshoots the maxima
c) Most of the times, it overshoots the minima
d) Nothing happens
View Answer
Answer: c
Explanation: If the learning rate is high, the gradient
descent overshoots the minima most of the times. This is
because, when it is close to the minima, instead of reaching
it, the algorithm alters the parameters by a greater
percentage. This leads to overshooting the minima.
4. What is the correct way to update t0 and t1?
a) Calculate t0 and t1 and then update t0 and t1
b) Update t0 and t1 and then calculate t0 and t1
c) Calculate t0, update t0 and then calculate t1, update t1
d) Calculate t1, update t1 and then calculate t0, update t0
View Answer
Answer: a
Explanation: Both the calculations are done first, and then
updating is done. If we update one parameter before
calculating both, using the updated first parameter to
calculate the second parameter will lead to error.
5. The cost function contains a squared term and is divided
by 2*m where m is the number of training examples. What is
in the denominator of gradient descent function?
a) 2*m
b) m
c) m/2
d) m2
View Answer
Answer: b
Explanation: Gradient descent performs a partial derivative
of the cost function. The squared term produces a two after
differentiation. This is canceled out with the two in the
denominator, leaving only the term “m” there.
6. Cost function has a squared term, but gradient descent
does not. Why?
a) Integration of cost function
b) The square root of the cost function
c) Differentiation of cost function
d) They are not related
View Answer
Answer: c
Explanation: Gradient descent performs a partial derivative
of the cost function. The term which was raised to the power
of 2, is also differentiated. So, after differentiation, the
power goes down by 1, and no squared terms remain.
7. What is the output of gradient descent after each
iteration?
a) Updated t0, t1
b) J(t0, t1)
c) J(t1, t0)
d) A better learning rate
View Answer
Answer: a
Explanation: The goal of gradient descent is to alter t0 and
t1 until a minimum J(t0, t1) is reached. It only updates t0 and
t1. This, in turn, changes J(t0, t1) but it is never updated by
the gradient descent algorithm.
8. Who invented gradient descent?
a) Ross Quinlan
b) Leslie Valiant
c) Thomas Bayes
d) Augustin-Louis Cauchy
View Answer
Answer: d
Explanation: Cauchy invented gradient descent in 1847.
Bayes invented Bayes’ theorem. Leslie Valiant introduced
the idea of PAC learning. Quinlan is the founder of the
machine learning algorithm Decision Trees.
9. h(x) = t0 + t1x. Alpha value (learning rate) is 0.1. Initial
theta values are 0, 0. X = [1, 2, 3] and Y = [1, 3, 5]. What is
the value of cost function after 1st iteration?
a) 0.3
b) 0.73
c) 1.2953
d) 0.425
View Answer
Answer: c
Explanation: t0 = 0 – (0.1/3)((0 + 0*1 – 1) + (0 + 0*2 – 3) + (0 + 0*3 – 5)) = –(0.1/3)(–9)
t0 = 0.3
t1 = 0 – (0.1/3)((0 + 0*1 – 1)*1 + (0 + 0*2 – 3)*2 + (0 + 0*3 – 5)*3)
t1 = 0.73
Cost = (1/6)((0.3 + 0.73*1 – 1)^2 + (0.3 + 0.73*2 – 3)^2 + (0.3 + 0.73*3 – 5)^2) = 1.2953.
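The worked example above can be checked with the following sketch of a single batch gradient-descent step (simultaneous update, learning rate 0.1, initial t0 = t1 = 0); this is an illustration, not part of the original question.

X, Y = [1, 2, 3], [1, 3, 5]
alpha, m = 0.1, len(X)
t0, t1 = 0.0, 0.0

grad0 = sum((t0 + t1 * x) - y for x, y in zip(X, Y)) / m        # partial derivative w.r.t. t0
grad1 = sum(((t0 + t1 * x) - y) * x for x, y in zip(X, Y)) / m  # partial derivative w.r.t. t1
t0, t1 = t0 - alpha * grad0, t1 - alpha * grad1                 # simultaneous update

cost = sum((t0 + t1 * x - y) ** 2 for x, y in zip(X, Y)) / (2 * m)
print(round(t0, 2), round(t1, 2), round(cost, 3))               # 0.3 0.73 1.295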
10. h(x) = t0 + t1x. Alpha value (learning rate) is 0.1. Initial
theta values are 0.3, 0.73. X = [1, 2, 3] and Y = [1, 3, 5]. What
is the value of t0 after 1st iteration?
a) 0.73
b) 0.425
c) 1.064444
d) 0.392
View Answer
Answer: b
Explanation: Here,
t0 = .3-0.1/3((.3+.73*1–1) + (.3+.73*2–3) + (.3+.73*3-5))
t0 = 0.425.
11. What is the generalized goal of gradient descent?
a) Minimize J(t1)
b) Minimize J(t0, t1, t2, …, tn)
c) Minimize J(t0, t1)
d) Maximize J(t1)
View Answer
Answer: b
Explanation: The generalized goal of gradient descent is to
minimize the cost function J(t0, t1, t2, …, tn). The goal of gradient
descent for linear regression is to minimize J(t0, t1). The goal
of gradient descent for linear regression with the simplified
hypothesis is to minimize J(t1). Its goal is never to maximize
the cost function.
1. Multivariate linear regression belongs to which category?
a) Neither supervised nor unsupervised learning
b) Both supervised and unsupervised learning
c) Supervised learning
d) Unsupervised learning
View Answer
Answer: c
Explanation: In multivariate linear regression the dataset
with correct values of the target variable is given to the
learner. The correct answers are already given, from which
the learner can learn. Hence, it is supervised learning.
2. The learner is trying to predict housing prices based on
the size of each house and number of bedrooms. What type
of regression is this?
a) Multivariate Logistic Regression
b) Logistic Regression
c) Linear Regression
d) Multivariate Linear Regression
View Answer
Answer: d
Explanation: Learner is trying to output a value which can
be any number, instead of a binary output (yes or no / 0 or 1,
etc.). Thus it is linear regression and not logistic regression.
Since there are two independent variables “size” and
“number of bedrooms”, it is multivariate linear regression.
3. What does xn(i) represent?
a) Value of nth variable is i
b) Value of ith variable in the nth training example
c) Value of nth variable in the ith training example
d) Value of ith variable is n
View Answer
Answer: c
Explanation: In multivariate linear regression, while
representing an independent variable, the subscript denotes
the feature number and the superscript denotes the number
of the training example.
4. How is the hypothesis represented in multivariate
regression? Transpose of matrix a is represented as aT.
a) h(X) = tTX
b) h(X) = tX
c) h(X) = tXT
d) h(X) = tTXT
View Answer
Answer: a
Explanation: In multivariate regression, t = [t0 t1 t2 … tn]T, X
= [1 x1 x2 … xn]T. t is the vector of coefficients of the features.
X is the feature vector denoting all the features.
5. Let there be n features. What is the dimension of the X
vector in hypothesis h(X) = tTX?
a) n x 1
b) (n + 1) x 1
c) n x n
d) (n – 1) x 1
View Answer
Answer: b
Explanation: The feature vector contains n features and 1
which represents x0. It is not a feature. It is used to match
the dimension of vector t since vector t contains an extra
element t0 which is the intercept.
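A minimal sketch (with hypothetical numbers) of the hypothesis h(X) = tTX, showing the leading 1 (x0) prepended to the features so that the dimensions of t and X match and t0 acts as the intercept.

import numpy as np

t = np.array([1.0, 0.5, 2.0])          # [t0, t1, t2]
features = np.array([3.0, 4.0])        # [x1, x2]
X = np.concatenate(([1.0], features))  # [1, x1, x2], dimension (n + 1) x 1
print(t @ X)                           # 1 + 0.5*3 + 2*4 = 10.5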
6. What does X(i) represent?
a) A vector denoting all the values of the ith feature
b) ith feature in the ith example
c) A feature vector denoting the independent variables in the
ith example
d) Depends on the dataset
View Answer
Answer: c
Explanation: The feature vector is of dimension n x 1, where
n is the number of features. Its entries are the values of all
the features in the ith training example, where the value of
feature a appears in row a of the feature vector.
7. What is the minimum number of variables required to
represent a multivariate linear regression model?
a) 3
b) 2
c) 1
d) 4
View Answer
Answer: d
Explanation: At least four variables are required. One is
required to represent the number of training examples,
another should represent the target variable. Since it is
multivariate linear regression, there should be at least two
independent variables and they are represented by two more
variables.
8. What does (x1(4), x2(4), y(4)) represent or imply?
a) There are 4 training examples
b) The values of x1, x2, and y are 4
c) The fourth training example and there are two
independent variables
d) The second training example and there are two
independent variables
View Answer
Answer: c
Explanation: In a linear regression model, the set (x1(i), …,
xn(i), y(i)) represents the ith example in the training set.
xn(i) gives the ith value of xn, and y(i) gives the ith value of y. (x1(i),
…, xn(i), y(i)) implies that there are n independent variables.
9. There is no upper bound on the number of the
independent variable(s).
a) True
b) False
View Answer
Answer: a
Explanation: In multivariate linear regression, there is no
upper bound on the number of independent variables. Any
given model has some finite number of them (two or more),
but that number can be arbitrarily large. They can be
represented as a vector, and the target variable is
represented as a function of these independent variables.
10. There is no upper bound on the number of the target
variable(s).
a) True
b) False
View Answer
Answer: b
Explanation: The maximum number of target variables that
can be predicted by a learner is 1. The target variable is
dependent on the independent variables. The value of the
target variable is predicted based on a function of the
independent variables.
1. The cost function is minimized by __________
a) Linear regression
b) Polynomial regression
c) PAC learning
d) Gradient descent
View Answer
Answer: d
Explanation: Gradient descent starts with a random value of
t0, t1, …, tn. It alters them in order to change the cost
function at a particular learning rate. Once it reaches a local
minimum, it stops and outputs the value of t0, t1, …, tn.
2. What is the minimum number of parameters of the
gradient descent algorithm?
a) 1
b) 2
c) 3
d) 4
View Answer
Answer: c
Explanation: Since multivariate linear regression is being
considered, the minimum number of features is 2. For these
two variables, two parameters are t1 and t2. Another
parameter that is required is t0 which gives the y-intercept.
3. What happens when the learning rate is low?
a) It always reaches the minima quickly
b) It reaches the minima very slowly
c) It overshoots the minima
d) Nothing happens
View Answer
Answer: b
Explanation: If the learning rate is low, the gradient descent
reaches the minima very slowly. The parameters are then
altered by a very low percentage and thus a lot of iterations
are required to reach the minima. It is time ineffective.
4. When was gradient descent invented?
a) 1847
b) 1947
c) 1857
d) 1957
View Answer
Answer: a
Explanation: Augustin-Louis Cauchy, a French
mathematician invented the concept of gradient descent in
1847. Since then, it has been modified a few times. Gradient
descent algorithm has a lot of different applications.
5. Gradient descent tries to _____________
a) maximize the cost function
b) minimize the cost function
c) minimize the learning rate
d) maximize the learning rate.
View Answer
Answer: b
Explanation: Gradient descent tries to minimize the cost
function by updating the values of t0, t1, …, tn after each
iteration. The change in the values of t0, t1, …, tn depends on
the learning rate.
6. Feature scaling can be used to simplify gradient descent
for multivariate linear regression.
a) True
b) False
View Answer
Answer: a
Explanation: There are multiple features in multivariate
linear regression and all of them have different ranges. This
increases the complexity of gradient descent. So, feature
scaling is used to make the ranges of each feature similar.
7. x1’s range is 0 to 300. x2’s range is 0 to 1000. What are the
suitable ranges of x1 and x2 after mean normalization?
a) x1 = (x1 – 150)/300, x2 = (x2-500)/1000
b) x1 = x2 – 700
c) x1 = x1 – 300, x2 = x2 – 1000
d) x1 = x1/300, x2 = x2/1000
View Answer
Answer: a
Explanation: Mean normalization tries to make the range of
each feature similar. It subtracts the mean from the value
and divides it by the upper bound of the range. After
updating x1 = (x1 – 150)/300 and x2 = (x2 – 500)/1000, we get that
x1’s range is -0.5 to 0.5 and x2’s range is -0.5 to 0.5.
8. x1’s range is 0 to 300. x2’s range is 0 to 1000. What are the
suitable ranges of x1 and x2 after feature scaling?
a) x1 = x1 – 300, x2 = x2 – 1000
b) x1 = x2 – 700
c) x1 = x1/1000, x2 = x2/ 300
d) x1 = x1/300, x2 = x2/1000
View Answer
Answer: d
Explanation: Feature scaling tries to make the range of each
feature similar. After updating x1 = x1/300, x2 = x2/1000, we
get, x1’s range is 0 to 1 and x2’s range is 0 to 1.
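A short sketch of the two rescalings used in questions 7 and 8, for x1 in [0, 300] and x2 in [0, 1000]; the raw values below are hypothetical.

x1, x2 = 120.0, 450.0

# feature scaling: divide by the upper bound of the range -> values in [0, 1]
x1_scaled, x2_scaled = x1 / 300, x2 / 1000

# mean normalization: subtract the midpoint, divide by the range -> values in [-0.5, 0.5]
x1_norm, x2_norm = (x1 - 150) / 300, (x2 - 500) / 1000
print(x1_scaled, x2_scaled, x1_norm, x2_norm)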
9. On which factor is the updating of each parameter
dependent on?
a) The number of training examples
b) Target variable
c) The learning rate and the target variable
d) The learning rate
View Answer
Answer: c
Explanation: Updating each factor depends on both the
learning rate and the target variable. If the learning rate is
high, the change will be more and vice-versa. The updating
depends on how much closer the value predicted by the
hypothesis is to the value of the target variable.
10. What is updated by gradient descent after each
iteration?
a) The learning rate
b) Independent variables
c) Target variable
d) The number of training examples
View Answer
Answer: b
Explanation: The gradient descent algorithm updates the
value of all the features. It is done in order to minimize the
cost function. The change in the value of the independent
variables depends on the learning rate.
11. Who introduced the topic of gradient descent?
a) Vapnik
b) Augustin-Louis Cauchy
c) Chervonenkis
d) Alan Turing
View Answer
Answer: b
Explanation: Cauchy invented gradient descent in 1847.
Vapnik and Chervonenkis introduced the concept of VC
dimension. Alan Turing is known as the father of computer
science, for his various works in the field of artificial
intelligence, cryptanalysis, amongst others.
12. Mean normalization can be used to simplify gradient
descent for multivariate linear regression.
a) True
b) False
View Answer
Answer: a
Explanation: Mean normalization tries to reduce the
complexity of gradient descent by scaling down the range of
each feature. It subtracts the mean from the value of the
independent variable and divides it by the upper limit of its
range.
1. Who coined the term regression?
a) Andrey Markov
b) Alexey Chervonenkis
c) Vladimir Vapnik
d) Francis Galton
View Answer
Answer: d
Explanation: Galton introduced the idea of regression in the
19th century. Vapnik and Chervonenkis established the VC
dimension. Markov was best known for his work on
stochastic processes. He introduced the Markov model.
2. Polynomial regression and multivariate regression are the
same.
a) True
b) False
View Answer
Answer: b
Explanation: In multivariate regression, there must be at
least two independent variables. In polynomial regression,
even one variable is enough. Different degrees of indices or
products of features can be used.
3. The learner is trying to predict the price of a house based
on the length and width of the house.
x1 = length and x2 = width. What is a better hypothesis?
a) h(X) = t0 + t1x1
b) h(X) = t0 + t1x1 + t2x2
c) h(X) = t0 + t2x2
d) h(X) = t0 + t1X, where area of the house: X = x1 * x2
View Answer
Answer: d
Explanation: To predict the price of the house, the size is a
better parameter. It can be determined by the area of the
house which is length multiplied by width. So, instead of
using the two features separately, a better third feature can
be used.
4. h(X) = t0 + t1x + t2x^2 + t3x^3. What type of regression is this?
a) Polynomial regression
b) Univariate linear regression
c) Logistic regression
d) Multivariate linear regression
View Answer
Answer: a
Explanation: The expression has only one feature x, so it is
not a multivariate linear regression. There is more than one
term containing a feature, so it is also not a univariate linear
regression. The features are expressed as a polynomial, so it
is a polynomial regression.
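A minimal sketch of a degree-3 polynomial hypothesis in a single feature x, h(x) = t0 + t1*x + t2*x^2 + t3*x^3; the coefficients used here are hypothetical.

def h(x, t=(1.0, 0.5, 0.2, 0.1)):
    # t = (t0, t1, t2, t3), chosen arbitrarily for illustration
    return t[0] + t[1] * x + t[2] * x ** 2 + t[3] * x ** 3

print(h(2.0))   # 1 + 1 + 0.8 + 0.8 = 3.6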
5. h(x) = t0 + t1x + t2x^2. t0 = t1 = t2 = 1. x is the size of the
house. For what value of x is h(x) minimum?
a) -1
b) 0
c) 0 or -1
d) 1
View Answer
Answer: d
Explanation: h(x) = t0 + t1x + t2x^2 = 1 + x + x^2.
Since x (the size of the house) cannot be negative, h(x) is
minimized at x = 0, where its minimum value is 1.
6. h(x) = t0 + t1x + t2x^2. t0 = 0, t1 = t2 = 1. x is the size of the
house. For what value of x is h(x) minimum?
a) -1
b) 0
c) 0 or -1
d) 1
View Answer
Answer: b
Explanation: h(x) = t0 + t1x + t2x^2 = x + x^2
h(x) will be minimum when the expression (x + x^2) is minimum,
i.e. 0 (the size of a house cannot be negative):
x + x^2 = 0
or, x(x + 1) = 0
Since x cannot be negative, the value of x is 0.
7. There are two features. One is of higher priority. What
can be done to improve the hypothesis?
a) Increase the power to which the feature with higher
priority is raised
b) Remove the feature with lower priority
c) Depends on the dataset
d) Nothing can be done
View Answer
Answer: a
Explanation: One of the advantages of polynomial
regression is that of handling features with a different
priority. If a feature with higher priority is encountered, its
power can be raised to give it higher priority in the
hypothesis.
8. A drawback of Polynomial Regression is handling of
features with a different priority.
a) True
b) False
View Answer
Answer: b
Explanation: Polynomial Regression can handle features
with varying priority very well. One of its drawbacks is that
it is sensitive to outliers. Overfitting may or may not occur.
1. What kind of algorithm is logistic regression?
a) Cost function minimization
b) Ranking
c) Regression
d) Classification
View Answer
Answer: d
Explanation: Logistic regression is a classification problem.
The target variable is categorical (specific few options).
Logistic regression outputs in yes or no / true or false / 0 or 1
and so on.
2. Can a cancer detection problem be solved by logistic
regression?
a) Sometimes
b) No
c) Yes
d) Depends on the dataset
View Answer
Answer: c
Explanation: If the target is to detect cancer, logistic
regression can always be used. Logistic regression algorithm
will output if the patient has cancer or not, depending on the
symptoms and training examples.
3. In a logistic regression problem, there are 300 instances.
270 people voted. 30 people did not cast their votes. What is
the probability of finding a person who cast one’s vote?
a) 10%
b) 90%
c) 0.9
d) 0.1
View Answer
Answer: c
Explanation: 270 out of 300 people voted. Hence, the
probability of finding a person who cast his/her vote is
270/300 or 9/10, i.e. 0.9. Since a probability is expressed as a
value between 0 and 1, it is stated as 0.9 rather than 90%.
4. In a logistic regression problem, what is a possible output
for a new instance?
a) 0.85
b) -0.19
c) 1.20
d) 89%
View Answer
Answer: a
Explanation: The output in a logistic regression problem is
calculated by a probability function. Thus, the output can
only be between 0 and 1. It cannot be negative, or greater
than 1. It is not expressed in a percentage.
5. The output in a logistic regression problem is yes
(equivalent to 1 or true). What is its possible value?
a) Greater than 0.5
b) Depends on the algorithm’s threshold value
c) Greater than 0.6
d) Equal to 1
View Answer
Answer: b
Explanation: If the output is true, the probability of the
instance to be true is greater than the threshold value. Now,
for different datasets, the threshold value can be different. It
can be 0.5, it can also be 0.6. It is dependent on the
algorithm.
6. Who invented logistic regression?
a) Vapnik
b) Ross Quinlan
c) DR Cox
d) Chervonenkis
View Answer
Answer: c
Explanation: Statistician DR Cox invented Logistic
Regression in 1958. Ross Quinlan is the founder of the
machine learning model decision tree. Vapnik and
Chervonenkis introduced the idea of VC dimension.
7. An artificially intelligent car knows if to brake or not
based on its distance from the car in front of it. Logistic
regression algorithm is used.
a) True
b) False
View Answer
Answer: a
Explanation: The output is given as yes or no, based on the
distance from the car in front of it. It is thus a classification
problem. Hence, the logistic regression algorithm can be
used to determine whether to stop or not.
8. An artificially intelligent car decreases its speed based on
its distance from the car in front of it. Which algorithm is
used?
a) Decision Tree
b) Naïve-Bayes
c) Logistic Regression
d) Linear Regression
View Answer
Answer: d
Explanation: The output is numerical. It determines the
speed of the car. Hence it is not a classification problem. All
the three, decision tree, naïve-Bayes, and logistic regression
are classification algorithms. Linear regression, on the other
hand, outputs numerical values based on input. So, this can
be used.
9. In a logistic regression problem an instance is similar to 60
positive instances, 20 negative instances, dissimilar to 30
positive instances, 90 negative instances. What kind of an
instance is this?
a) Negative instance
b) Positive instance
c) Cannot be determined, even if the threshold is given
d) Can be determined, if the threshold is given
View Answer
Answer: c
Explanation: Similarity or dissimilarity does not determine
the output of logistic regression. The output is completely
dependent on the independent variables and their values. So,
the output cannot be determined even if the threshold is
given.
10. When was logistic regression invented?
a) 1968
b) 1958
c) 1948
d) 1988
View Answer
Answer: b
Explanation: Logistic regression was invented by statistician
DR Cox in the year 1958. It was introduced even before the
invention of machine learning. It was introduced as a part of
the direct probability model.
1. What function is used for hypothesis representation in
logistic regression?
a) Cos function
b) Laplace transformation
c) Lagrange’s function
d) Sigmoid function
View Answer
Answer: d
Explanation: In logistic regression, the output is based on a
probability and thus must be within the range of 0 and 1.
The sigmoid function is used for models whose output is
given as a probability i.e. the range lies between 0 and 1. So,
the sigmoid function is used in hypothesis representation.
2. The value of a sigmoid function is 1.5.
a) True
b) False
View Answer
Answer: b
Explanation: Sigmoid function can be used for machine
learning models where output is based on the prediction of a
probability. The function only exists between 0 and 1. Thus
its value can never be 1.5.
3. How is the hypothesis represented? Transpose of t is tT.
a) h(X) = t0 + t1x1
b) h(X) = 1/(1 + e^(tTx))
c) h(X) = e^(-tTx)/(1 + e^(-tTx))
d) h(X) = 1/(1 + e^(-tTx))
View Answer
Answer: d
Explanation: The hypothesis is a function of the term tTx.
Since its value should be between 0 and 1, the sigmoid function
is used. The sigmoid function is given by g(a) = 1/(1 + e^(-a)).
h(x) = g(tTx)
=> h(X) = 1/(1 + e^(-tTx)).
4. Let g be the sigmoid function. Let a = 0. What is the value
of g(a)?
a) 1/2
b) 1/4
c) 1
d) 0
View Answer
Answer: a
Explanation: The sigmoid function is given by g(x) = 1/(1 + e^(-x)).
a = 0
Hence, g(a) = 1/(1 + e^0)
= 1/(1 + 1)
= 1/2.
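A minimal sigmoid sketch illustrating the values discussed in this section: g(0) = 0.5, g(a) tends to 1 as a tends to +infinity, and to 0 as a tends to -infinity.

import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

print(sigmoid(0), sigmoid(50), sigmoid(-50))   # 0.5, ~1.0, ~0.0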
5. Probability of an event occurring is 1.2. What is odds
ratio?
a) 6:1
b) -6:1
c) Undefined
d) 1:2
View Answer
Answer: c
Explanation: Probability p has to be within the range of 0 to
1, so p can never be 1.2. The odds ratio is calculated as the ratio
of p and (1-p). Since p can never be 1.2, the odds ratio
calculation is not possible either, so it is undefined.
6. Probability of an event occurring is 0.9. What is odds
ratio?
a) 0.9:1
b) 9:1
c) 1:9
d) 1:0.9
View Answer
Answer: b
Explanation: p = 0.9 i.e. 9/10, hence (1-p) = 1 – 9/10 = 1/10
Odds ratio = p/(1-p)
= (9/10)/(1/10) = 9:1.
7. What is the odds ratio?
a) p/(1-p)
b) p
c) 1-p
d) p*(1-p)
View Answer
Answer: a
Explanation: p is the probability that event y occurs. Then
the probability of event y not occurring can be given as (1-p).
Odds ratio is given by the ratio of the probability of an event
occurring and the probability that an event is not occurring.
Thus, odds ratio is p/(1-p).
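A one-line sketch of the odds ratio p/(1-p) applied to the probabilities used in this section's questions.

def odds(p):
    return p / (1 - p)

print(odds(0.9), odds(0.2), odds(0.8))   # ~9.0 (9:1), 0.25 (1:4), ~4.0 (4:1)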
8. The output of logistic regression is always 0 or 1.
a) True
b) False
View Answer
Answer: b
Explanation: The output of logistic regression is not always 0
or 1. It can be yes or no. It can be even true or false. The
output of binary logistic regression is always 0 or 1.
1. h(x) > 0.6 -> y = 1. What does the value 0.6 represent?
a) Cost function
b) Threshold value
c) Gradient descent
d) Sigmoid function
View Answer
Answer: b
Explanation: In logistic regression, a particular value is
taken. If the value of the hypothesis is greater than this
value, the output y is considered to be true or 1. This value is
the threshold value. Here, 0.6 is the threshold value.
2. The value of a sigmoid function is the threshold value.
a) True
b) False
View Answer
Answer: b
Explanation: Sigmoid function is used in machine learning
models to predict the probability of an event happening. The
value of the function varies for different instances in the
training set. The threshold value is fixed for a particular
dataset.
3. Threshold value is 0.5. h(x) = 0.7 for a particular instance.
What is the value of y?
a) 0
b) 0.3
c) 0.7
d) 1
View Answer
Answer: d
Explanation: The decision boundary depends on the value of
threshold. If output of function h(x) is greater than the
threshold value, the output y is equal to 1. Here, h(x) = 0.7
and threshold value = 0.5. Since 0.7 > 0.5, y = 1.
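A tiny sketch of the thresholding rule described above; the threshold values mirror the examples in this section.

def predict(h_x, threshold=0.5):
    # output 1 when the hypothesis value exceeds the threshold, else 0
    return 1 if h_x > threshold else 0

print(predict(0.7, threshold=0.5))   # 1, as in question 3
print(predict(0.3, threshold=0.6))   # 0, as in question 10 below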
4. Let g be the sigmoid function. Let a >= 0. What is the
value of g(a)?
a) g(a) >= 1/2
b) g(a) <= 0
c) g(a) <= 1/2
d) g(a) >= 0
View Answer
Answer: a
Explanation: The sigmoid function is given by g(x) = 1/(1 + e^(-x)).
a >= 0
Hence, g(a) >= 1/(1 + e^0)
g(a) >= 1/(1 + 1)
g(a) >= 1/2.
5. Probability of an event occurring is 0.2. What is odds
ratio?
a) -4:1
b) 4:1
c) 1:4
d) 1:0.4
View Answer
Answer: c
Explanation: p = 0.2 i.e. 2/10 i.e. 1/5, hence (1-p) = 1 – 1/5 =
4/5
Odds ratio = p/(1-p)
= (1/5)/(4/5) = 1:4.
6. Probability of an event occurring is 0.8. What is odds
ratio?
a) 0.8:1
b) 4:1
c) 1:4
d) 2:0.8
View Answer
Answer: b
Explanation: p = 0.8 i.e. 8/10, hence (1-p) = 1 – 8/10 = 2/10
Odds ratio = p/(1-p)
= (8/10)/(2/10) = 4:1.
7. Let g be the sigmoid function. Let a = infinite. What is the
value of g(a)?
a) 1/2
b) -1
c) 1
d) 0
View Answer
Answer: c
Explanation: The sigmoid function is given by g(x) = 1/(1 + e^(-x)).
a = infinite
Hence, g(a) = 1/(1 + e^(-infinity))
= 1/(1 + 0)
= 1.
8. The decision boundary is an important parameter in
logistic regression.
a) True
b) False
View Answer
Answer: a
Explanation: In logistic regression, the decision boundary is
based on the threshold value. It separates the area where
output y = 0 and y = 1. Without the decision boundary, the
output cannot be calculated. Thus, it is very important.
9. Let g be the sigmoid function. Let a = -(infinite). What is
the value of g(a)?
a) -1/2
b) 1
c) 1/2
d) 0
View Answer
Answer: d
Explanation: The sigmoid function is given by g(x) = 1/(1 + e^(-x)).
a = -(infinite)
Hence, g(a) = 1/(1 + e^(infinity))
= 1/(1 + infinity)
= 1/infinity
= 0.
10. Threshold value is 0.6. h(x) = 0.3 for a particular
instance. What is the value of y?
a) 0
b) 0.3
c) 0.7
d) 1
View Answer
Answer: a
Explanation: The threshold value separates the positive
instances from the negative instances. If output of function
h(x) is lesser than the threshold value, the output y is equal
to 0. Here, h(x) = 0.3 and threshold value = 0.6. Since 0.3 <
0.6, y = 0.
1. The cost function for logistic regression and linear
regression are the same.
a) True
b) False
View Answer
Answer: b
Explanation: Logistic regression deals with classification
based problems or probability based, whereas linear
regression is more based on regression problems. Obviously,
the two cost functions are different.
2. h(x) = y. What is the cost (h(x), y)?
a) -infinite
b) infinite
c) 0
d) always h(x)
View Answer
Answer: c
Explanation: The cost function is used to determine the
similarity between the two parameters. The more the
similarity, higher is the tendency of cost function
approaching zero. Since h(x) = y here, the cost function is 0.
3. What is the generalized cost function?
a) cost(h(x),y) = -y*log(h(x)) – (1 – y)*log(1-h(x))
b) cost(h(x),y) = – (1 – y)*log(1-h(x))
c) cost(h(x),y) = -y*log(h(x))
d) cost(h(x),y) = y*log(h(x)) + (1 – y)*log(1-h(x))
View Answer
Answer: a
Explanation: cost(h(x),y) = -y*log(h(x)) when y = 1, and – (1
– y)*log(1-h(x)) when y = 0
Thus the generalized function cost(h(x),y) = -y*log(h(x)) – (1
– y)*log(1-h(x)) becomes
cost(h(x),y) = -y*log(h(x)) when y = 1 as (1 – y) is 0 and
becomes – (1 – y)*log(1-h(x)) when y = 0.
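A minimal sketch of the generalized per-example cost above, cost(h(x), y) = -y*log(h(x)) - (1 - y)*log(1 - h(x)); the predicted probabilities used are hypothetical.

import math

def cost(h_x, y):
    return -y * math.log(h_x) - (1 - y) * math.log(1 - h_x)

print(cost(0.99, 1))   # close to 0 when the prediction agrees with y = 1
print(cost(0.01, 1))   # large when the prediction is far from y = 1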
4. Let m be the number of training instances. What is the
summation of cost function multiplied by to get the gradient
descent?
a) 1/m
b) m
c) 1 + m
d) 1 – m
View Answer
Answer: a
Explanation: Since the summation is taken of all the cost
functions starting from training instance 1 to training
instance m, an average needs to be taken to get the actual
cost function. So, it is multiplied by 1/m.
5. y = 1. How does cost(h(x), y) change with h(x)?
a) cost(h(x), y) = infinite when h(x) = 1
b) cost(h(x), y) = 0 when h(x) = 0
c) cost(h(x), y) = 0 when h(x) = 1
d) it is independent of h(x)
View Answer
Answer: c
Explanation: Since, the actual output is 1, the calculated
output tending toward 1 will reduce the cost function. Thus
cost function is 0 when h(x) is 1 and it is infinite when h(x) =
0.
6. Who invented gradient descent?
a) Ross Quinlan
b) Leslie Valiant
c) Thomas Bayes
d) Augustin-Louis Cauchy
View Answer
Answer: d
Explanation: Cauchy invented gradient descent in 1847.
Bayes invented Bayes’ theorem. Leslie Valiant introduced
the idea of PAC learning. Quinlan is the founder of the
machine learning algorithm Decision Trees.
7. When was gradient descent invented?
a) 1847
b) 1947
c) 1857
d) 1957
View Answer
Answer: a
Explanation: Augustin-Louis Cauchy, a French
mathematician invented the concept of gradient descent in
1847. Since then, it has been modified a few times. Gradient
descent algorithm has a lot of different applications.
8. h(x) = 1, y = 0. What is the cost (h(x), y)?
a) -infinite
b) infinite
c) 0
d) always h(x)
View Answer
Answer: b
Explanation: The cost function determines the similarity
between the actual output and the calculated output. The
lesser the similarity, the higher is the cost function. It is
maximum (infinite) when h(x) and y are the exact opposite.
1. Which is a better algorithm than gradient descent for
optimization?
a) Conjugate gradient
b) Cost Function
c) ERM rule
d) PAC Learning
View Answer
Answer: a
Explanation: Conjugate gradient is an optimization
algorithm and it gives better results than gradient descent.
The cost function is used to calculate the average difference
between the predicted output and the actual output. ERM,
although it tries to lower the cost function, often leads to overfitting.
2. Who invented BFGS?
a) Quinlan
b) Bayes
c) Broyden, Fletcher, Goldfarb and Shanno
d) Cauchy
View Answer
Answer: c
Explanation: Broyden, Fletcher, Goldfarb and Shanno are
credited with the invention of the BFGS method. Quinlan
introduced the Decision tree algorithm. Bayes invented the
Naïve-Bayes algorithm. Cauchy is the founder of the gradient
descent algorithm.
3. Ax = b => [4 2, 2 3][x1, x2] = [2, 2]. Let x0, the initial guess
be [1, 1]. What is the residual vector?
a) [4, -3]
b) [-4, 3]
c) [-4, -3]
d) [4, 3]
View Answer
Answer: c
Explanation: Residual vector, r0 = b – Ax0
r0 = [2, 2] – [4 2, 2 3][1, 1]
= [2, 2] – [6, 5]
= [-4, -3].
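A quick numerical check of the residual computed above, r0 = b - A*x0, using NumPy.

import numpy as np

A = np.array([[4, 2], [2, 3]])
b = np.array([2, 2])
x0 = np.array([1, 1])
print(b - A @ x0)   # [-4 -3]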
4. Ax = b => [2 2, 3 3][x1, x2] = [1, 2]. Let x0, the initial guess
be [1, 1]. What is the residual vector?
a) [3, -4]
b) [-4, 3]
c) [-4, -3]
d) [-3, -4]
View Answer
Answer: d
Explanation: Residual vector, r0 = b – Ax0
r0 = [1, 2] – [2 2, 3 3][1, 1]
= [1, 2] – [4, 6]
= [-3, -4].
5. In the L-BFGS algorithm, what does the letter L stand
for?
a) Lengthy
b) Limited-memory
c) Linear
d) Logistic
View Answer
Answer: b
Explanation: L-BFGS is an approximation of the Broyden-
Fletcher-Goldfarb-Shanno algorithm. It is used for cases
which are limited in memory. Like BFGS, this method also
works better than gradient descent.
6. Ax = b => [3 2, 2 3][x1, x2] = [8, 6]. Let x0, the initial guess
be [2, 1]. What is the residual vector?
a) [-1, 0]
b) [0, -1]
c) [1, 0]
d) [0, 1]
View Answer
Answer: b
Explanation: Residual vector, r0 = b – Ax0
r0 = [8, 6] – [3 2, 2 3][2, 1]
= [8, 6] – [8, 7]
= [0, -1].
7. Who developed conjugate gradient method?
a) Hestenes and Stiefel
b) Broyden, Fletcher, Goldfarb and Shanno
c) Valiant
d) Vapnik and Chervonenkis
View Answer
Answer: a
Explanation: Magnus Hestenes and Eduard Stiefel
introduced the conjugate gradient algorithm. It is used for
advanced optimization. Broyden, Fletcher, Goldfarb and
Shanno invented the BFGS algorithm. Leslie Valiant
introduced the idea of PAC Learning. Vapnik and
Chervonenkis were the founders of the VC dimension.
8. When was BFGS invented?
a) 1960
b) 1965
c) 1975
d) 1970
View Answer
Answer: d
Explanation: Broyden, Fletcher, Goldfarb and Shanno are
credited with the invention of the BFGS method. It was
invented in the year 1970. BFGS is an advanced
optimization technique. It is a better algorithm than
gradient descent.
1. The output is whether a person will vote or not, based on
several features. It is an example of multiclass classification.
a) True
b) False
View Answer
Answer: b
Explanation: In multiclass classification, the output, y
should have more than two values (or classes). In this
example, the output can be only yes or no. Hence, it is not an
example of multiclass classification.
2. The output is whether a person will surely vote or surely
not vote or may cast a vote, based on one feature. It is an
example of multiclass classification.
a) True
b) False
View Answer
Answer: a
Explanation: In multiclass classification, the output, y
should have more than two values (or classes). Here, there
are three classes – i) surely vote, ii) surely not vote, and iii)
may cast a vote. Thus, it is an example of multiclass
classification.
3. y = {0, 1, …, n}. This problem is divided into ______
binary classification problems.
a) n
b) 1/n
c) n + 1
d) 1/(n+1)
View Answer
Answer: c
Explanation: The indexing starts at 0. So there are n + 1
output classes. Hence, to get the correct output, we need to
divide the problem into n + 1 classification problems with
binary outputs (0 or 1). If 1 is the output, the instance
belongs to that particular class.
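A small sketch of the one-vs-all split described above: a problem with classes 0..n is turned into n + 1 binary problems, one per class. The label vector below is hypothetical.

labels = [0, 2, 1, 2, 0, 1]        # y in {0, 1, 2}, so n = 2 and n + 1 = 3
classes = sorted(set(labels))
for c in classes:
    binary = [1 if y == c else 0 for y in labels]   # one binary problem per class
    print("class", c, "vs rest:", binary)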
4. y = {0, 1, …, 8}. This problem is divided into ______
binary classification problems.
a) 1/9
b) 9
c) 8
d) 1/8
View Answer
Answer: b
Explanation: Since, indexing starts at 0, the number of
classes is 9 and they are 0, 1, 2, 3, 4, 5, 6, 7, and 8. To solve
this 9-class problem, we need to divide the problem into 9 binary
classification problems.
5. y = {0, 1, 2, 3, 4, 5, 6, 8}. This problem is divided into
______ binary classification problems.
a) 9
b) 1/9
c) 8
d) 1/8
View Answer
Answer: c
Explanation: In this example, there are 8 different classes
and not 9. 0 is one of the classes but there is no class 7. So,
here number of classes is 8. Thus, the problem is divided into
8 binary classification problems.
6. The outputs of an image recognition system is {0, 0, 1, 0}.
The classes are dog, cat, elephant, and lion. What is the
image of, according to our algorithm?
a) Dog
b) Cat
c) Elephant
d) Lion
View Answer
Answer: c
Explanation: The output vector is a representative of the
probability of the image being a particular class. According
to the algorithm, the probability of image being a cat is zero,
dog is zero, elephant is one, lion is zero. Thus, the image is of
an elephant.
7. Who invented logistic regression?
a) Valiant
b) Ross Quinlan
c) DR Cox
d) Bayes
View Answer
Answer: c
Explanation: Statistician DR Cox invented Logistic
Regression in 1958. Ross Quinlan is the founder of the
machine learning model decision tree. Leslie Valiant
introduced PAC Learning. Bayes is known for Naïve-Bayes
algorithm.
8. When was logistic regression invented?
a) 1957
b) 1959
c) 1960
d) 1958
View Answer
Answer: d
Explanation: Logistic regression was invented by statistician
DR Cox in the year 1958. It was introduced even before the
invention of machine learning. It was introduced as a part of
the direct probability model.
1. Which of the following statements is false about Ensemble
learning?
a) It is a supervised learning algorithm
b) More random algorithms can be used to produce a
stronger ensemble
c) It is an unsupervised learning algorithm
d) Ensembles can be shown to have more flexibility in the
functions they can represent
View Answer
Answer: c
Explanation: Ensemble learning is not an unsupervised
learning algorithm. It is a supervised learning algorithm that
combines several machine learning techniques into one
predictive model to decrease variance and bias. It can be
trained and then used to make predictions. And this
ensemble can be shown to have more flexibility in the
functions they can represent.
2. Ensemble learning is not combining learners that always
make similar decisions; the aim is to be able to find a set of
diverse learners.
a) True
b) False
View Answer
Answer: a
Explanation: Ensemble learning aims to find a set of diverse
learners who differ in their decisions so that they
complement each other. There is no point in combining
learners that always make similar decisions.
3. Which of the following is not a multi – expert model
combination scheme to generate the final output?
a) Global approach
b) Local approach
c) Parallel approach
d) Serial approach
View Answer
Answer: d
Explanation: Multi – expert combination methods have base
– learners that work in parallel. Global approach and Local
approach are the two subdivisions of this parallel approach.
Serial approach is a multi – stage combination method.
4. The global approach is also known as learner fusion.
a) False
b) True
View Answer
Answer: b
Explanation: The global approach is also called learner
fusion: given an input, all base – learners generate an
output and all these outputs are combined by voting or
averaging. This represents integration (fusion) functions
where, for each pattern, all the classifiers contribute to the
final decision.
5. Which of the following statements is true about multi-stage
combination methods?
a) The next base-learner is trained on only the instances
where the previous base-learners are not accurate enough
b) It is a selection approach
c) It has base-learners that work in parallel
d) The base-learners are sorted in decreasing complexity
View Answer
Answer: a
Explanation: It is a serial approach; the next base-learner
is trained or tested on only the instances where the previous
base-learners are not accurate enough. A multi-stage
combination method is neither a parallel approach nor a
selection approach. The base-learners are sorted in
increasing complexity.
6. Which of the following is not an example of a multi-expert
combination method?
a) Voting
b) Stacking
c) Mixture of experts
d) Cascading
View Answer
Answer: d
Explanation: Cascading is not a multi-expert combination
example; it is a multi-stage combination method. It is
based on the concatenation of several classifiers, which use
all the information collected from the output of a given
classifier as additional information for the next classifier in
the cascade. Voting, stacking and a mixture of experts are
examples of multi-expert combination methods.
7. Which of the following statements is false about the
base-learners?
a) The base-learners are chosen for their accuracy
b) The base-learners are chosen for their simplicity
c) The base-learners have to be diverse
d) Base-learners are not required to be very accurate
individually
View Answer
Answer: a
Explanation: When we generate multiple base-learners, we
want them to be reasonably accurate but do not require
them to be very accurate individually. Hence the base-learners
are not chosen for their accuracy, but for their
simplicity. However, the base-learners have to be diverse.
8. Different algorithms make different assumptions about
the data and lead to different classifiers in generating
diverse learners.
a) True
b) False
View Answer
Answer: a
Explanation: Different algorithms make different
assumptions about the data and lead to different classifiers.
For example, one base-learner may be parametric and
another may be nonparametric. When we decide on a single
algorithm, we give importance to a single method and ignore
all others.
9. Ensembles tend to yield better results when there is a
significant diversity among the models.
a) False
b) True
View Answer
Answer: b
Explanation: Ensembles tend to yield better results when
there is a significant diversity among the models. Many
ensemble methods, therefore, try to promote diversity
among the models they combine.
10. The partitioning of the training sample cannot be done
based on locality in the input space.
a) False
b) True
View Answer
Answer: a
Explanation: The partitioning of the training sample can
also be done based on locality in the input space. So each
base-learner is trained on instances in a certain local part
of the input space, which is what a mixture of experts does.
11. Which of the following is represented by the below
figure?
[Figure not reproduced]
a) Stacking
b) Mixture of Experts
c) Bagging
d) Boosting
View Answer
Answer: b
Explanation: The figure shows a Mixture of Experts. It is
based on the divide-and-conquer principle, and a mixture of
experts trains individual models to become experts in
different regions of the feature space. Then, a gating
network decides which combination of ensemble learners is
used to predict the final output of any instance.
12. Given the target value of a mixture of expert
combinations is 0.8. The predictions of three experts and the
probability of picking them are 0.6, 0.4, 0.5 and 0.8, 0.5, 0.7
respectively. Then what is the simple error for training?
a) 0.13
b) 0.15
c) 0.18
d) 0.2
View Answer
Answer: c
Explanation: We know the simple error for training:
E = ∑i pi (d – yi)^2 where d is the target value and pi is the
probability of picking expert i, and yi is the individual
prediction of expert i. Given d = 0.8, y1 = 0.6, y2 = 0.4, y3 =
0.5 and p1 = 0.8, p2 = 0.5, p3 = 0.7,
E = p1(d – y1)^2 + p2(d – y2)^2 + p3(d – y3)^2
= 0.8(0.8 – 0.6)^2 + 0.5(0.8 – 0.4)^2 + 0.7(0.8 – 0.5)^2
= 0.8(0.2)^2 + 0.5(0.4)^2 + 0.7(0.3)^2
= 0.8 * 0.04 + 0.5 * 0.16 + 0.7 * 0.09
= 0.032 + 0.08 + 0.063
≈ 0.18
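As a rough illustration, the same computation can be written as a short Python sketch (values copied from the question above):

# Mixture-of-experts training error: E = sum_i p_i * (d - y_i)^2
d = 0.8                      # target value
y = [0.6, 0.4, 0.5]          # individual expert predictions
p = [0.8, 0.5, 0.7]          # probability of picking each expert
E = sum(pi * (d - yi) ** 2 for pi, yi in zip(p, y))
print(E)                     # ≈ 0.175, i.e. 0.18 after rounding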
13. The ABC company has released their Android app. And
80 people have rated the app on a scale of 5 stars. Out of the
total people 15 people rated it with 1 star, 20 people rated it
with 2 stars, 30 people rated it with 3 stars, 10 people rated it
with 4 stars and 5 people rated it with 5 stars. What will be
the final prediction if we take the average of individual
predictions?
a) 2
b) 3
c) 4
d) 5
View Answer
Answer: b
Explanation: Given that we are taking the average of
individual predictions to make the final prediction.
Average = ∑ (Rating * Number of people) / Total number of
people
= ((1 * 15) + (2 * 20) + (3 * 30) + (4 * 10) + (5 * 5)) / 80
= (15 + 40 + 90 + 40 + 25) / 80
= 210 / 80
= 2.625
And the nearest integer is 3. So the final prediction will be 3.
14. Consider there are 5 employees A, B, C, D, and E of ABC
company. Where people A, B and C are experienced, D and
E are fresher. They have rated the company app as given in
the table. What will be the final prediction if we are taking
the weighted average?
Employee   Weight   Rating
A          0.4      3
B          0.4      2
C          0.4      2
D          0.2      2
E          0.2      4
a) 2
b) 3
c) 4
d) 5
View Answer
Answer: c
Explanation: We have,
Weighted average = ∑ (Weight * Rating)
= (0.4 * 3) + (0.4 * 2) + (0.4 * 2) + (0.2 * 2) + (0.2 * 4)
= 1.2 + 0.8 + 0.8 + 0.4 + 0.8
=4
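The same weighted-average combination as a minimal Python sketch, with the weights and ratings copied from the table above:

# Weighted average of the employee ratings
weights = [0.4, 0.4, 0.4, 0.2, 0.2]
ratings = [3, 2, 2, 2, 4]
weighted_avg = sum(w * r for w, r in zip(weights, ratings))
print(weighted_avg)   # ≈ 4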
1. Which of the following statements is false about Ensemble
voting?
a) It takes a linear combination of the learners
b) It takes non-linear combination of the learners
c) It is the simplest way to combine multiple classifiers
d) It is also known as ensembles and linear opinion pools
View Answer
Answer: b
Explanation: Voting does not take a non-linear combination of
the learners. It is the simplest way to combine multiple
classifiers, which corresponds to taking a linear combination
of the learners (yi = ∑j wj dji, where wj ≥ 0, ∑j wj = 1, wj is the
weight of learner j and dji is the vote of learner j for class Ci).
So this is also known as ensembles and linear opinion pools.
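For illustration, a minimal Python sketch of such a linear opinion pool; the three learners, their class supports and the equal weights are made-up example values, not taken from any question here:

# Linear opinion pool: class score y_i = sum_j w_j * d_ji
def combine_by_voting(votes, weights):
    # votes[j][i] = support (vote/probability) of learner j for class i
    n_classes = len(votes[0])
    return [sum(w * d[i] for w, d in zip(weights, votes))
            for i in range(n_classes)]

votes = [[0.7, 0.3], [0.6, 0.4], [0.9, 0.1]]   # three learners, two classes
weights = [1/3, 1/3, 1/3]                      # equal weights = simple voting
print(combine_by_voting(votes, weights))       # ≈ [0.73, 0.27], class 0 wins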
2. In the simplest case of voting, all the learners are given
equal weight.
a) True
b) False
View Answer
Answer: a
Explanation: In the simplest case, all learners are given
equal weight and here we have simple voting that
corresponds to taking an average. There are also other
combination rules and taking a weighted sum is only one of
such possibilities.
3. With the product rule, if one learner has an output of 0,
the overall output goes to zero.
a) True
b) False
View Answer
Answer: a
Explanation: With the product rule (yi = Πj dji, where dji is the
vote of learner j for class Ci), each learner has veto power.
That is regardless of the other ones, if one learner has an
output of 0, the overall output goes to 0.
4. In plurality voting the winner is the class with maximum
number of votes.
a) True
b) False
View Answer
Answer: a
Explanation: Plurality voting in classification is where the
class having the maximum number of votes is the winner. In
plurality voting a classification of an unlabelled instance is
performed according to the class that obtains the highest
number of votes. So in reality plurality voting is commonly
used to solve the multi class problems.
5. In majority voting, the winning class gets more than half
of the total votes.
a) True
b) False
View Answer
Answer: a
Explanation: It is majority voting, when there are two
classes and the winning class gets more than half of the
votes. Here every model makes a prediction (votes) for each
test instance and the final output prediction (votes) is the one
that receives more than half of the votes.
6. Which of the following statements is true about the
combination rules?
a) Maximum rule is pessimistic
b) Sum rule takes the weighted sum of vote of each learner
for each class
c) Median rule is more robust to outliers
d) Minimum rule is optimistic
View Answer
Answer: c
Explanation: Median rule is more robust to outliers. If you
throw away the largest and smallest values (predictions) in
the data set, then the median doesn’t change. The sum rule
takes the sum of vote of each learner for each class,
maximum rule is optimistic and minimum rule is pessimistic.
7. Hard voting is where the model is selected from an
ensemble to make the final prediction using simple majority
vote.
a) True
b) False
View Answer
Answer: a
Explanation: In hard voting, a model is selected from an
ensemble by a simple majority vote to make the final
prediction. Here every individual classifier
votes for a class, and the majority class wins. So it simply
aggregates the predictions of each classifier and predicts the
class that gets the most votes.
8. Borda count takes the rankings of the class supports into
consideration unlike the voting.
a) True
b) False
View Answer
Answer: a
Explanation: Borda count can rank order the classifier
outputs. The classes can easily be rank ordered with respect
to the support they receive from the classifier. Voting, in
contrast, considers the support of the winning classes only and
ignores the support that non-winning classes may receive.
9. Which of the following is a solution for the problem,
where the classifiers erroneously give unusual low or high
support to a particular class?
a) Maximum rule
b) Minimum rule
c) Product rule
d) Trimmed mean rule
View Answer
Answer: d
Explanation: The trimmed mean rule can be used to avoid the damage
done by an unusual vote given by the classifiers. It discards
the decisions of those classifiers with the highest and lowest
support before calculating the mean. And the mean is
calculated on the remaining supports, avoiding the extreme
values of support.
10. The weighted average rule combines the mean and the
weighted majority voting rules.
a) True
b) False
View Answer
Answer: a
Explanation: The weighted average rule combines the mean
and the weighted majority voting rules. It makes use of
weighted majority voting and the ensemble prediction is
calculated as the average of the member predictions.
11. Assume we are combining three classifiers that classify a
training sample as given in the table. Then what is the class
of the samples using majority voting?
Classifier   Class label
C1           0
C2           0
C3           1
a) 0
b) 1
c) 2
d) New class
View Answer
Answer: a
Explanation: In majority voting the class label (y) is
predicted as, Y = mode {P1, P2, …, Pn} where P1, P2, …,
Pn are the predictions of n classifiers that are combined.
Here P1 = 0, P2 = 0, P3 = 1. And Y can be calculated as,
Y = mode {P1, P2, P3}
= mode {0, 0, 1}
=0
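The same majority (plurality) vote can be computed in a couple of lines of Python, using the predictions from the table above:

# Majority vote over the classifier predictions
from statistics import mode
predictions = [0, 0, 1]   # outputs of C1, C2, C3
print(mode(predictions))  # 0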
12. Assume we are combining eight classifiers that classify a
training sample as given in the table. Then what is the class
of the samples using simple majority voting?
Classifier   Class label
C1           0
C2           0
C3           1
C4           0
C5           2
C6           3
C7           0
C8           0
a) 1
b) 2
c) 0
d) 3
View Answer
Answer: c
Explanation: Majority voting has three flavors. In one of
them, the ensemble decision is the class predicted by at least
one more than half the number of classifiers, and this is known
as simple majority voting. The total number of classifiers is 8
and the number of classifiers that predict class label 0 is 5.
Since 5 > 4 (total number of classifiers / 2), the class label
for the sample is 0.
13. Assume we are combining three classifiers that classify a
training sample and the probabilities are given in the table.
Given that it assigns equal weights to all classifiers w1=1,
w2=1, w3=1. What is the class of the samples using weighted
majority voting?
                Class label 0   Class label 1
Classifier 1        0.3             0.5
Classifier 2        0.4             0.3
Classifier 3        0.2             0.4
a) Class 0
b) Class 1
c) Class 2
d) New class
View Answer
Answer: b
Explanation: Given the table about the probabilities of
samples classified to each class label by three classifiers. And
assigns equal weights to all classifiers (w1 = w2 = w3 = 1).
Then we have,
                   Class label 0                  Class label 1
Classifier 1       w1 * 0.3 = 1 * 0.3 = 0.3       w1 * 0.5 = 1 * 0.5 = 0.5
Classifier 2       w2 * 0.4 = 1 * 0.4 = 0.4       w2 * 0.3 = 1 * 0.3 = 0.3
Classifier 3       w3 * 0.2 = 1 * 0.2 = 0.2       w3 * 0.4 = 1 * 0.4 = 0.4
Weighted average   (0.3 + 0.4 + 0.2) / 3 = 0.3    (0.5 + 0.3 + 0.4) / 3 = 0.4
From the table above the class 1 has the highest weighted
average probability, thus we classify the sample as class 1.
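The same weighted soft-voting computation as a small Python sketch; only the two class columns shown in the table are used, and the equal weights come from the question:

# Weighted majority (soft) voting over class probabilities
probs = [
    [0.3, 0.5],   # classifier 1: support for class 0, class 1
    [0.4, 0.3],   # classifier 2
    [0.2, 0.4],   # classifier 3
]
weights = [1, 1, 1]
n_classes = len(probs[0])
avg = [sum(w * p[i] for w, p in zip(weights, probs)) / sum(weights)
       for i in range(n_classes)]
print(avg)                  # ≈ [0.3, 0.4]
print(avg.index(max(avg)))  # 1, so the sample is classified as class 1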
1. In error-correcting output codes (ECOC), the main
classification task is defined in terms of a number of
subtasks that are implemented by the base-learners.
a) True
b) False
View Answer
Answer: a
Explanation: In multi-class problems the original task of
separating one class from all other classes may be difficult.
So we want to define a set of simpler classification problems,
where each specializing in one aspect of the task. And we get
the final classifier by combining these simpler classifiers.
2. Which of the following statements is not true about error-
correcting output codes (ECOC)?
a) It is a method for solving multi-class classification
problems
b) It is a method for decomposing a multiway classification
problem into many binary classification tasks
c) It is a method for solving binary classification problems
d) It is a method for converting a k-class supervised learning
problem into a large number of two class supervised
learning problem
View Answer
Answer: c
Explanation: Error-correcting output coding is not a method
for solving binary classification problems. It is a method for
solving multi-class classification problems. Here a k-class
supervised learning problem (multiway classification
problem) is converted into a large number L of two class
supervised learning problems (binary classification tasks).
3. Which of the following statements is not true about multi-
class classification?
a) An input can belong to one of K classes
b) Each input belongs to exactly one class
c) Each training data associated with class labels which is a
number from 1 to K
d) Each input belongs to more than one class
View Answer
Answer: d
Explanation: In a multi-class classification problem, an
input cannot belong to more than one class. Here an input
can belong to one of K classes, but each input belongs to
exactly one class. And the training data input is associated
with a class label (a number from 1 to K).
4. Which of the following statements is false about error-
correcting output codes (ECOC)?
a) It is based on the embedding of binary classifiers
b) The ECOC designs are independent of the base classifier
applied
c) ECOC framework consists of designing a codeword for
each of the classes
d) The ECOC designs are dependent on the base classifier
applied
View Answer
Answer: d
Explanation: ECOC designs are not dependent on the base
classifier applied and are independent of the base classifier
applied. ECOC is the powerful framework based on the
embedding of binary classifiers. It consists of designing a
codeword for each of the classes and these codewords encode
the membership information of each class for a given binary
problem.
5. Which of the following is not a problem independent
ECOC design?
a) One-versus-all
b) SFFS criterion
c) One-versus-one
d) Dense Random
View Answer
Answer: b
Explanation: Sequential Forward Floating Search (SFFS) is
not a problem independent ECOC design. It is a method
used in problem dependent ECOC design for feature
selection. And all other three are the problem independent
ECOC designs.
6. Which of the following is not a problem dependent ECOC
design?
a) Sparse Random
b) DECOC
c) ECOC-ONE
d) Forest-ECOC
View Answer
Answer: a
Explanation: Sparse Random is not a problem dependent
ECOC design and it is a problem independent ECOC
design. It uses n = 15 · log(Nc) dichotomizers. All other three
are the problem dependent ECOC designs.
7. Which of the following ECOC designs uses n =
(Nc−1).T dichotomizers, where T stands for the number of
binary tree structures to be embedded?
a) DECOC
b) One-versus-all
c) Forest-ECOC
d) One-versus-one
View Answer
Answer: c
Explanation: Forest-ECOC design uses n =
(Nc−1).T dichotomizers, where T stands for the number of
binary tree structures to be embedded. Whereas the
DECOC design uses n = Nc−1 dichotomizers, One-versus-all
uses Nc dichotomizers and One-versus-one uses n =
Nc(Nc−1)/2 dichotomizers.
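To make these dichotomizer counts concrete, a tiny Python sketch; Nc = 10 classes and T = 3 embedded trees are arbitrary example values:

# Number of binary problems (dichotomizers) for the ECOC designs listed above
Nc, T = 10, 3
print("One-versus-all :", Nc)                   # 10
print("One-versus-one :", Nc * (Nc - 1) // 2)   # 45
print("DECOC          :", Nc - 1)               # 9
print("Forest-ECOC    :", (Nc - 1) * T)         # 27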
8. Forest-ECOC design which uses n =
(Nc−1).T dichotomizers, extends the variability of the
classifiers of the DECOC design.
a) True
b) False
View Answer
Answer: a
Explanation: Forest-ECOC design uses n =
(Nc−1).T dichotomizers, where T stands for the number of
binary tree structures to be embedded. Whereas the
DECOC design uses n = Nc−1 dichotomizers. So Forest-
ECOC design extends the variability of the classifiers of the
DECOC design by including extra dichotomizers T (number
of binary tree structures to be embedded).
9. Problem independent ECOC design and Problem
dependent ECOC design are the two types of ECOC
decoding strategies.
a) True
b) False
View Answer
Answer: b
Explanation: Problem independent ECOC design and
Problem dependent ECOC design are not the two types of
ECOC decoding strategies. The ECOC coding designs are
mainly divided into two main groups: problem-independent
approaches, and the problem-dependent designs. Hamming
decoding, Euclidean decoding etc. are the ECOC decoding
strategies.
10. Problem independent approaches take into account the
distribution of the data to define the coding matrix.
a) False
b) True
View Answer
Answer: a
Explanation: Problem-independent approaches are used to
guide the coding design. It does not take into account the
distribution of the data to define the coding matrix. It
considers the row separation and column separation criteria
to build a code matrix.
11. Given the two strings “cats” and “dogs”. What is the
Hamming distance between two strings?
a) 4
b) 3
c) 2
d) 5
View Answer
Answer: b
Explanation: The Hamming distance between two strings of
equal length is the number of positions at which the
corresponding symbols are different. It is the number of
substitutions required to transform one string into another.
cats ⇒ dats (substitute ‘d’ for ‘c’)
dats ⇒ dots (substitute ‘o’ for ‘a’)
dots ⇒ dogs (substitute ‘g’ for ‘t’)
So hamming distance is 3 as it requires 3 edit operations to
convert “cats” to “dogs”.
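A short Python helper for the Hamming distance used in this and the next two questions:

# Hamming distance: number of positions at which two equal-length strings differ
def hamming(a, b):
    assert len(a) == len(b), "strings must have equal length"
    return sum(x != y for x, y in zip(a, b))

print(hamming("cats", "dogs"))            # 3
print(hamming("100111010", "101111111"))  # 3
print(hamming("cow", "fox"))              # 2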
12. What is the hamming distance between the binary values
100111010 and 101111111?
a) 9
b) 7
c) 5
d) 3
View Answer
Answer: d
Explanation: Hamming distance between two strings of
equal length is the minimum number of substitutions
required to change one string into the other.
100111010 ⇒ 101111010
101111010 ⇒ 101111110
101111110 ⇒ 101111111
So hamming distance is 3 as it requires 3 edit operations to
convert 100111010 to 101111111.
13. How many single-character errors does it take to turn “cow” into “fox”?
a) 2
b) 0
c) 1
d) 3
View Answer
Answer: a
Explanation: The number of single-position errors needed to turn
one string into another is known as the Hamming distance.
cow ⇒ fow (substitute ‘f’ for ‘c’)
fow ⇒ fox (substitute ‘x’ for ‘w’)
So the number of single-character errors needed to turn “cow”
into “fox” is 2.
1. Boosting is a machine learning ensemble algorithm which
converts weak learners to strong ones.
a) True
b) False
View Answer
Answer: a
Explanation: Boosting is a machine learning ensemble meta-
algorithm which converts weak learners to strong ones. A
weak learner is defined to be a classifier which is only
slightly correlated with the true classification and a strong
learner is a classifier that is arbitrarily well correlated with
the true classification.
2. Which of the following statements is not true about
boosting?
a) It uses the mechanism of increasing the weights of
misclassified data in preceding classifiers
b) It mainly increases the bias and the variance
c) It tries to generate complementary base-learners by
training the next learner on the mistakes of the previous
learners
d) It is a technique for solving two-class classification
problems
View Answer
Answer: b
Explanation: Boosting does not increase the bias and
variance but it mainly reduces the bias and the variance. It is
a technique for solving two-class classification problems.
And it tries to generate complementary base-learners by
training the next learner (by increasing the weights) on the
mistakes (misclassified data) of the previous learners.
3. Boosting is a heterogeneous ensemble technique.
a) True
b) False
View Answer
Answer: b
Explanation: Boosting is not a heterogeneous ensemble but
is a homogeneous ensemble. Homogeneous ensemble consists
of members having a single-type base learning algorithm.
Whereas a heterogeneous ensemble consists of members
having different base learning algorithms.
4. The issues that boosting addresses are the bias-complexity
tradeoff and computational complexity of learning.
a) True
b) False
View Answer
Answer: a
Explanation: The more expressive the hypothesis class the
learner is searching over, the smaller the approximation
error is, but the larger the estimation error becomes. And
for many concept classes the task of finding an Empirical
Risk Minimization hypothesis may be computationally
infeasible.
5. Which of the following statements is not true about weak
learners?
a) They can be used as the building blocks for designing
more complex models by combining them
b) Boosting learns the weak learners sequentially in a very
adaptive way
c) They are combined using a deterministic strategy
d) They have low bias
View Answer
Answer: d
Explanation: Weak learners do not have low bias but have
high bias. Boosting primarily reduces the bias by combining
the weak learners in a deterministic strategy. And boosting
learns the weak learners sequentially in a very adaptive way.
6. Which of the following is not related to boosting?
a) Non uniform distribution
b) Re-weighting
c) Re-sampling
d) Sequential style
View Answer
Answer: c
Explanation: Re-sampling is done with the bagging
technique. Boosting uses a non-uniform distribution, during
the training the distribution will be modified and difficult
samples will have higher probability. And it follows a
sequential style to generate complementary base-learners by
re-weighting the learner.
7. In ensemble method if the classifier is unstable, then we
need to apply boosting.
a) True
b) False
View Answer
Answer: b
Explanation: If the classifier is unstable which means it has
high variance, then we cannot apply boosting. We can use
bagging if the classifier is unstable. If the classifier is steady
and straightforward (high bias), then we have to apply
boosting.
8. The original boosting method requires a very large
training sample.
a) True
b) False
View Answer
Answer: a
Explanation: The disadvantage of the original boosting
method is that it requires a very large training sample. And
the sample should be divided into three parts; furthermore, the
second and third classifiers are only trained on a subset of
the previous classifier’s errors.
9. Which of the following is not true about boosting?
a) It considers the weightage of the higher accuracy sample
and lower accuracy sample
b) It helps when we are dealing with bias or under-fitting in
the data set
c) Net error is evaluated in each learning step
d) It always considers the overfitting or variance issues in
the data set
View Answer
Answer: d
Explanation: One of the main disadvantages of boosting is
that it often ignores overfitting or variance issues in the data
set. And it mainly reduces the bias and also the variance. All
other three options are the advantages of boosting.
10. Boosting can be used for spam filtering.
a) False
b) True
View Answer
Answer: b
Explanation: Boosting can be used for spam filtering, where
the first classifier can be used to distinguish between emails
from contacts and others. And the subsequent classifiers
used to find examples wrongly classified as spam and find
words/phrases appearing in spam. And finally combine it to
the final classifier that predicts spam accurately.
11. Consider there are 7 weak learners, out of which 4
learners vote FAKE for a social media account and 3
learners vote REAL. What will be the final
prediction for the account if we are using a majority voting
method?
a) FAKE
b) REAL
c) Undefined
d) Error
View Answer
Answer: a
Explanation: With majority voting, the prediction of the weak
learners with the higher number of votes wins. Here 4 learners
out of 7 voted FAKE, which is more than the 3 votes for REAL.
So the final prediction will be FAKE.
12. Assume that we are training a boosting classifier using
decision stumps on the given dataset. Then which of the
given examples will have their weights increased at the end
of the first iteration?
[Figure not reproduced: dataset with circle and square examples]
a) Circle
b) Square
c) Both
d) No increment in weight
View Answer
Answer: b
Explanation: The square example will have their weights
increased at the end of the first iteration. Decision stump is a
1-level decision tree and is a test based on one feature. And
the decision stump with the least error in the first iteration is
constant over the whole domain. So it only predicts
incorrectly on the square example.
13. Assume that we are training a boosting classifier using
decision stumps on the given dataset. At the least how much
iteration does it need to achieve zero training error?
[Figure not reproduced: dataset with circle and square examples]
a) 1
b) 2
c) 3
d) 0
View Answer
Answer: c
Explanation: It will require at least three iterations to
achieve zero training error. First iteration will misclassify
the square example. Second iteration will misclassify the two
square examples. And finally the third iteration will
misclassify the remaining two square examples which can
yield zero training error. So it requires at least three
iterations.
1. AdaBoost is an algorithm that has access to a weak
learner and finds a hypothesis with a low empirical risk.
a) True
b) False
View Answer
Answer: a
Explanation: AdaBoost (Adaptive Boosting) is an algorithm
that has access to a weak learner and finds a hypothesis with
a low empirical risk. Each iteration of AdaBoost involves
O(m) operations as well as a single call to the weak learner.
Therefore, if the weak learner can be implemented
efficiently, then the total training process will be efficient.
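For reference, a minimal sketch of training an off-the-shelf AdaBoost classifier, assuming scikit-learn is available; the dataset is synthetic and only for illustration:

# AdaBoost with its default weak learner (a depth-1 decision stump in scikit-learn)
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X, y)
print(model.score(X, y))   # training accuracy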
2. Which of the following statements is not true about
AdaBoost?
a) The boosting process proceeds in a sequence of
consecutive rounds
b) In each round t, the weak learner is assumed to return a
weak hypothesis ht
c) The output of AdaBoost algorithm is a weak classifier
d) It assigns a weight to the weak hypothesis that is inversely
proportional to the error of the weak hypothesis
View Answer
Answer: c
Explanation: The output of the AdaBoost algorithm is not a
weak classifier but is a strong classifier that is based on a
weighted sum of all the weak hypotheses. The boosting
process proceeds in a sequence of consecutive rounds. So in
each round t, the weak learner is assumed to return a weak
hypothesis ht and it assigns a weight to ht that is inversely
proportional to the error of ht.
3. AdaBoost runs in polynomial time.
a) False
b) True
View Answer
Answer: b
Explanation: AdaBoost runs in polynomial time and does
not require defining a large number of hyper parameters.
Each iteration of AdaBoost involves O (m) operations as well
as a single call to the weak learner. So overall running time
is polynomial in m.
4. The basic functioning of the AdaBoost algorithm is to
maintain a weight distribution over the data points.
a) True
b) False
View Answer
Answer: a
Explanation: The basic functioning of the algorithm is to
maintain a weight distribution d, over data points. A weak
learner, f(k) is trained on this weighted data. And the
(weighted) error rate of f(k) is used to determine the
adaptive parameter α, which controls how “important” a
weak learner, f(k) is.
5. The success of AdaBoost is due to its property of
increasing the margin.
a) False
b) True
View Answer
Answer: b
Explanation: The success of AdaBoost is due to its property
of increasing the margin. In practice we observe that
running boosting for many rounds does not overfit in most
cases and margin is a solution for it. The margins can be
thought of as a measure of how confident a classifier is
about, how it labels each point, and one would hypothetically
desire to produce a classifier with margins as large as
possible.
6. Which of the following statements is false about
AdaBoost?
a) It is generally more prone to overfitting.
b) It improves classification accuracy.
c) It is particularly prone to overfitting on noisy datasets.
d) Complexity of the weak learner is important in AdaBoost.
View Answer
Answer: a
Explanation: AdaBoost is generally not more prone to
overfitting but is less prone to overfitting. And it is prone to
overfitting on noisy datasets. If you use very simple weak
learners, then the algorithms are much less prone to
overfitting and it improves classification accuracy. So
Complexity of the weak learner is important in AdaBoost.
7. Which of the following statements is true about the
working of AdaBoost?
a) It starts with equal weights and re – weighting will be
done.
b) It starts with unequal weights and re – weighting will be
done.
c) It starts with unequal weights and random sampling.
d) It starts with equal weights and random sampling.
View Answer
Answer: d
Explanation: AdaBoost starts with equal weights and
random sampling. It starts by predicting the original dataset
and gives equal weights to each observation. So in the first
step of AdaBoost each sample has an identical weight that
indicates how important it is regarding the classification.
8. AdaBoost is a parallel ensemble method.
a) True
b) False
View Answer
Answer: b
Explanation: AdaBoost is not a parallel ensemble method. It
is a sequential ensemble method, where the base learners are
generated sequentially. The boosting process proceeds in a
sequence of consecutive rounds.
9. Given three training instances with weights 0.5, 0.2, 0.04.
The predicted values are 1, 1, and – 1. The actual output
variables in the instances are – 1, 1, and 1. And the terror
would be 1, 0, and 1. What is the misclassification rate?
a) 0.71
b) 0.65
c) 0.73
d) 0.5
View Answer
Answer: c
Explanation: Misclassification rate or error = sum (w (i) *
terror (i)) / sum (w)
= (0.5 * 1 + 0.2 * 0 + 0.04 * 1) / (0.5 + 0.2 + 0.04)
= (0.5 + 0 + 0.04) / 0.74
= 0.54 / 0.74
= 0.73
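The same weighted misclassification rate in a few lines of Python (values from the question above):

# AdaBoost weighted error: sum(w_i * terror_i) / sum(w_i)
w = [0.5, 0.2, 0.04]      # instance weights
terror = [1, 0, 1]        # 1 if the instance is misclassified, else 0
error = sum(wi * ti for wi, ti in zip(w, terror)) / sum(w)
print(round(error, 2))    # 0.73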
10. AdaBoost is sensitive to outliers.
a) False
b) True
View Answer
Answer: b
Explanation: AdaBoost is sensitive to outliers or label noise
and the outliers are tending to get misclassified. As the
number of iterations increases, the weights corresponding to
outlier points can become very large. And the subsequent
classifiers are trying to classify these outlier points correctly.
11. Consider the two instances having errors 0.4, 0.5. Then
what will be the weights of the classifier for these two
instances?
a) 0.401, 0.5
b) 0.903, 0.1
c) 0.304, 0.6
d) 0.205, 0
View Answer
Answer: d
Explanation: The weight of the classifier is calculated as,
α = (1 / 2) * ln((1 – error) / error)
Then for error = 0.4,
Weight α = (1 / 2) * ln((1 – 0.4) / 0.4)
= 0.5 * ln(0.6 / 0.4)
= 0.5 * ln(1.5)
= 0.5 * 0.41
= 0.205
For error = 0.5,
Weight α = (1 / 2) * ln((1 – 0.5) / 0.5)
= 0.5 * ln(0.5 / 0.5)
= 0.5 * ln(1)
= 0.5 * 0
= 0
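The classifier-weight formula above in Python; note that keeping the full precision of ln(1.5) gives about 0.203 rather than the 0.205 obtained with the coarser rounding in the worked answer:

# AdaBoost classifier weight: alpha = 0.5 * ln((1 - error) / error)
import math

def classifier_weight(error):
    return 0.5 * math.log((1 - error) / error)

print(classifier_weight(0.4))   # ≈ 0.2027
print(classifier_weight(0.5))   # 0.0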
12. The classifier weight for an instance will be less than zero
if the error is greater than or equal to 0.5.
a) True
b) False
View Answer
Answer: a
Explanation: The weight of the classifier is calculated as α =
(1 / 2) * ln((1 – error) / error). Consider a classifier instance
with error = 0.8. Then the weight will be calculated as,
Weight α = (1 / 2) * ln((1 – 0.8) / 0.8)
= 0.5 * ln(0.2 / 0.8)
= 0.5 * ln(0.25)
= 0.5 * – 1.39
= – 0.695
So when the error (0.8) is greater than or equal to 0.5, the
weight for an instance will be less than zero (- 0.695).
1. Stacked generalization extends voting.
a) True
b) False
View Answer
Answer: a
Explanation: Voting is the simplest way to combine multiple
classifiers, which corresponds to taking a linear combination
of the learners. Stacked generalization extends voting by
combining the base learners through a combiner, which is
another learner.
2. Which of the following is represented by the below figure?
[Figure not reproduced]
a) Bagging
b) Boosting
c) Mixture of Experts
d) Stacking
View Answer
Answer: d
Explanation: Stacking is a technique in which the outputs of
the base-learners are combined through a combiner system,
f(•|Φ), where f(•|Φ) is another learner whose parameters Φ are
also trained: y = f (d1, d2, …, dL|Φ), where d1, d2, …, dL are
the outputs of the base learners.
3. Which of the following is the main function of stacking?
a) Ensemble meta algorithm for reducing variance
b) Ensemble meta algorithm for reducing bias
c) Ensemble meta algorithms for improving predictions
d) Ensemble meta algorithms for increasing bias and
variance
View Answer
Answer: c
Explanation: Stacking is a way of combining multiple
models. And it uses predictions of multiple models as
“features” to train a new model and use the new model to
make predictions on test data. So it is an ensemble meta-algorithm
for improving predictions.
4. Which of the following is an example of stacking?
a) AdaBoost
b) Random Forest
c) Bagged Decision Trees
d) Voting Classifier
View Answer
Answer: d
Explanation: Voting classifiers (ensemble or majority voting
classifiers) are an example of stacking. They are used to
combine several classifiers to create the final classifier.
AdaBoost is a boosting technique whereas random forest
and bagged decision trees are examples of bagging.
5. The fundamental difference between voting and stacking
is how the final aggregation is done.
a) True
b) False
View Answer
Answer: a
Explanation: The fundamental difference between voting
and stacking is how the final aggregation is done. In voting,
user-specified weights are used to combine the classifiers
whereas stacking performs this aggregation by using a linear
or nonlinear function. And this function can be a
blender/meta classifier.
6. Which of the following statements is false about stacking?
a) It introduces the concept of a meta learner
b) It combines multiple classification or regression models
c) The combiner function can be nonlinear
d) Stacking ensembles are always homogeneous
View Answer
Answer: d
Explanation: Stacking ensembles are not always
homogeneous but are often heterogeneous, because the base
level often consists of different learning algorithms. It
combines multiple classification or regression models. And it
introduces the concept of a meta learner and this combiner
function can be nonlinear unlike voting.
7. Stacking trains a meta-learner to combine the individual
learners.
a) True
b) False
View Answer
Answer: a
Explanation: Stacking trains a meta-learner (second-level
learner) to combine the individual learners (first-level
learners). In stacking the first-level learners are often
generated by different learning algorithms. And the second-level
learner learns from examples how to combine the multiple
classifiers, i.e. the first-level learners.
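A minimal stacking sketch, assuming scikit-learn is available; the particular first-level learners, the meta-learner and the synthetic dataset are arbitrary choices for illustration:

# Stacking: first-level learners combined by a trained meta-learner
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
base_learners = [("tree", DecisionTreeClassifier(max_depth=3)),
                 ("knn", KNeighborsClassifier())]
model = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())  # the combiner
model.fit(X, y)
print(model.score(X, y))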
8. Associative switch can be used to combine multiple
classifiers in stacking.
a) True
b) False
View Answer
Answer: a
Explanation: The associative switch is also known as the meta-
learner in stacking. It is used to learn from examples how to
combine multiple classifiers, and this is also known as
combining by learning.
9. Which of the following is represented by the below figure?
[Figure not reproduced]
a) Stacking
b) Mixture of Experts
c) Bagging
d) Boosting
View Answer
Answer: a
Explanation: Stacking introduces a level-1 algorithm, called
meta-learner, for learning the weights of the level-0
predictors. That means the predictions of each training
instance from the models become now training data for the
level-1 learner (generalizer).
10. What does the given figure indicate?
[Figure not reproduced]
a) Stacking
b) Support vector machine
c) Bagging
d) Boosting
View Answer
Answer: a
Explanation: The given figure shows a stacked
generalization framework. It consists of level 0 (three
models) and level 1 (one Meta model). And these individual
algorithms (Light gradient boosting, Support vector
regression and neural network) improve the predictive
performance.
1. Gradient descent is an optimization algorithm for finding
the local minimum of a function.
a) True
b) False
View Answer
Answer: a
Explanation: Gradient descent is an optimization algorithm
for finding the local minimum of a function. It is used to find
the values of parameters of a function that minimizes a cost
function. The slope of this cost function curve tells us how to
update our parameters to make the model more accurate.
2. We can use gradient descent as a best solution, when the
parameters cannot be calculated analytically.
a) False
b) True
View Answer
Answer: b
Explanation: Gradient descent is best used when the
parameters cannot be calculated using linear algebra
(analytically). So, in order to solve a system of nonlinear
equations numerically, we have to reformulate it as an
optimization problem. And it must be searched by an
optimization algorithm like gradient descent.
3. Which of the following statements is false about gradient
descent?
a) It updates the weight to comprise a small step in the
direction of the negative gradient
b) The learning rate parameter is η where η > 0
c) In each iteration, the gradient is re-evaluated for the new
weight vector
d) In each iteration, the weight is updated in the direction of
positive gradient
View Answer
Answer: d
Explanation: Gradient descent is an optimization algorithm,
and in each iteration the weight is not updated in the
direction of positive gradient. Here it updates the weight in
the direction of the negative gradient. And the gradient is re-
evaluated for the new weight vector with a learning
parameter η > 0.
4. In batch method gradient descent, each step requires the
entire training set be processed in order to evaluate the error
function.
a) True
b) False
View Answer
Answer: a
Explanation: Techniques that use the whole data set at once
are called batch methods. So in batch method gradient
descent, each step requires the entire training set be
processed in order to evaluate the error function. Here the
error function is defined with respect to a training set.
5. Simple gradient descent is a better batch optimization
method than conjugate gradients and quasi-newton
methods.
a) False
b) True
View Answer
Answer: a
Explanation: Conjugate gradients and quasi-newton
methods are the more robust and faster batch optimization
methods than simple gradient descent. In these algorithms
the error function always decreases at each iteration unless
the weight vector has arrived at a local or global minimum
unlike gradient descent.
6. What is the gradient of the function 2x^2 – 3y^2 + 4y – 10 at
point (0, 0)?
a) 0i + 4j
b) 1i + 10j
c) 2i – 3j
d) -3i + 4j
View Answer
Answer: a
Explanation: Given the function f = 2x^2 – 3y^2 + 4y – 10 at point
(0, 0). Then the gradient of the function can be calculated as:
\( \frac {\partial f}{\partial x} = \frac {\partial (2x^2 – 3y^2 +
4y – 10)}{\partial x}\)
= 4x
=4*0
=0
\( \frac {\partial f}{\partial y} = \frac {\partial (2x^2 – 3y^2 +
4y – 10)}{\partial y}\)
= -6y + 4
= (-6 * 0) + 4
=0+4
=4
Gradient, ∇ f = 0i + 4j
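The same gradient evaluation as a small Python sketch:

# Gradient of f(x, y) = 2x^2 - 3y^2 + 4y - 10 at a given point
def gradient(x, y):
    df_dx = 4 * x           # partial derivative with respect to x
    df_dy = -6 * y + 4      # partial derivative with respect to y
    return df_dx, df_dy

print(gradient(0, 0))   # (0, 4), i.e. the gradient 0i + 4j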
7. The gradient is set to zero to find the minimum or the
maximum of a function.
a) False
b) True
View Answer
Answer: b
Explanation: The gradient is set to zero, to find the
minimum or the maximum of a function. Because the value
of gradient at extremes (minimum or maximum) of a
function is always zero. So the derivative of the function is
zero at any local maximum or minimum.
8. The main difference between gradient descents variants
are based on the amount of data.
a) True
b) False
View Answer
Answer: a
Explanation: There are mainly three types of gradient
descents. They are batch gradient descent, stochastic
gradient, and mini-batch gradient descent. The main
difference between these algorithms is the amount of data
they handle. And based on this their accuracy, and time
taken for the weight updating varies.
9. Which of the following statements is false about choosing
learning rate in gradient descent?
a) Small learning rate leads to slow convergence
b) Large learning rate cause the loss function to fluctuate
around the minimum
c) Large learning rate can cause to divergence
d) Small learning rate cause the training to progress very
fast
View Answer
Answer: d
Explanation: If the learning rate is too small then the
training will progress very slowly because the weight
updating is very small. So, it leads to slow convergence.
Whereas the large learning rate causes the loss function to
fluctuate around the minimum and even can cause
divergence.
10. Which of the following is not related to a gradient
descent?
a) AdaBoost
b) Adadelta
c) Adagrad
d) RMSprop
View Answer
Answer: a
Explanation: AdaBoost is a meta algorithm to combine the
base learners to form a final classifier. Where Adadelta,
Adagrad and RMSprop are the gradient descent
optimization algorithms. And these algorithms are most
widely used by the deep learning community to solve a
number of challenges.
11. Given a function y = (x + 4)^2. What is the local minima of
the function starting from the point x = 3 and the value of x
after the first iteration using gradient descent (Assume the
learning rate is 0.01)?
a) 0, 3.02
b) 0, 4.08
c) -4, 2.86
d) 4, 3.8
View Answer
Answer: c
Explanation: We know y = (x + 4)^2 reaches its minimum
value when x = -4 (i.e. when x = -4, y = 0). Hence x = -4 is the
local and global minimum of the function.
Let x0 = 3, learning rate = 0.01 and y = (x + 4)^2. Then using
gradient descent,
\(\frac {dy}{dx} = \frac {d(x + 4)^2}{dx}\)
= 2 * (x + 4)
During the first iteration,
x1 = x0 – (learning rate * \(\frac {dy}{dx}\))
= 3 – (0.01 * (2 * (3 + 4)))
= 3 – (0.01 * (2 * 7))
= 3 – (0.01 * 14)
= 3 – 0.14
= 2.86
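These updates can be reproduced with a small Python sketch of gradient descent on y = (x + 4)^2, starting from x = 3 with learning rate 0.01:

# Gradient descent on y = (x + 4)^2, where dy/dx = 2 * (x + 4)
x = 3.0
learning_rate = 0.01
for i in range(1, 4):
    grad = 2 * (x + 4)
    x = x - learning_rate * grad
    print(i, round(x, 4))
# first step: 3 - 0.01 * 14 = 2.86; x keeps moving toward the minimum at -4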
12. Given a function y = (x + 30)^2. How many iterations does
it need to reach the first negative value of the function
starting from the point x = 1 using gradient descent (Assume
the learning rate is 0.01)?
a) 3
b) 4
c) 2
d) 5
View Answer
Answer: c
Explanation: Let x0 = 1, learning rate = 0.01 and y = (x +
30)^2. Then using gradient descent,
\(\frac {dy}{dx} = \frac {d(x + 30)^2}{dx}\)
= 2 * (x + 30)
During the first iteration,
x1 = x0 – (learning rate * \(\frac {dy}{dx}\))
= 1 – (0.01 * (2 * (1 + 30)))
= 1 – (0.01 * (2 * 31))
= 1 – (0.01 * 62)
= 1 – 0.62
= 0.38
During the second iteration,
x2 = x1 – (learning rate * \(\frac {dy}{dx}\))
= 0.38 – (0.01 * (2 * (0.38 + 30)))
= 0.38 – (0.01 * (2 * 30.38))
= 0.38 – (0.01 * 60.76)
= 0.38 – 0.61
= -0.23
So, the function reaches the first negative value after the two
iterations.
1. The Subgradient method is an algorithm for maximizing a
non-differentiable convex function.
a) True
b) False
View Answer
Answer: b
Explanation: The Subgradient method is not an algorithm
for maximizing a non-differentiable convex function but is
used to minimize the non differentiable convex function.
Convex optimization is the problem of minimizing convex
functions over convex sets. And when the objective function
is non-differentiable Subgradient methods are used.
2. Which of the following statements is not true about
Subgradient?
a) The step lengths are chosen via a line search
b) It can be directly applied to non-differentiable functions
c) It is an iterative method
d) The step lengths are fixed ahead of time
View Answer
Answer: a
Explanation: In Subgradient the step lengths are not chosen
via line search, and are often fixed ahead of time. It is an
iterative algorithm, which uses an initial guess to generate a
sequence of improving approximate solutions for a class of
problems. And these are directly applied to non-
differentiable functions.
3. Subgradient methods can be much slower than interior-
point methods.
a) True
b) False
View Answer
Answer: a
Explanation: Subgradient methods can be much slower than
interior-point methods. Where the interior-point methods
are used to solve linear and nonlinear convex optimization
problems and are second-order methods, not affected by
problem scaling. Subgradient methods are first-order
methods and their performance depends very much on the
problem of scaling and conditioning.
4. Which of the following statements is not true about the
Subgradient method?
a) It has small memory requirement than interior-point
methods
b) It can be used for extremely large problems
c) Simple distributed algorithm can be generated by
combining sub gradient with primal or dual composition
techniques
d) It is much faster than Newton’s method in the
unconstrained case
View Answer
Answer: d
Explanation: Subgradient methods are not faster than
Newton’s method but are slower than it. The advantages of
Sub-gradient are that it has smaller memory requirements
than interior-point methods and can be used for extremely
large problems. And can be combined with primal or dual
composition techniques to form a simple distributed
algorithm.
5. Step size, αk = α is a positive constant, independent of k is
represented by Constant step size rule.
a) True
b) False
View Answer
Answer: a
Explanation: Constant step size rule defines the step size, αk
= α is a positive constant, independent of k. And Constant
step length, Non-summable diminishing, and Non-summable
diminishing step lengths are other step size rules in
Subgradient which define the step size in different ways.
6. In SVM problems, we cannot directly apply gradient
descent but we can apply Subgradient descent.
a) True
b) False
View Answer
Answer: a
Explanation: In SVM problems we cannot directly apply
gradient descent but we can apply Subgradient descent.
Because SVM objective is not continuously differentiable
and we cannot apply gradient descent. And Sub-gradient
descent can be used to solve this non-differentiable SVM
objective function.
7. Which of the following objective functions is not solved by
Subgradient?
a) Hinge loss
b) L1 norm
c) Perceptron loss
d) TanH function
View Answer
Answer: d
Explanation: The TanH function is not handled by the
Subgradient method. TanH is a differentiable objective
function, so there is no need to solve it with subgradients:
the Subgradient method is used to solve non-differentiable
convex problems, whereas all the other three are
non-differentiable functions.
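As an illustration of a subgradient of one of these non-differentiable losses, a minimal Python sketch for the hinge loss l(w) = max(0, 1 - y * <w, x>); the example vectors are made up:

# A subgradient of the hinge loss with respect to w
def hinge_subgradient(w, x, y):
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin < 1:
        return [-y * xi for xi in x]   # gradient of 1 - y<w, x> on the active side
    return [0.0 for _ in x]            # 0 is a valid subgradient when margin >= 1

print(hinge_subgradient([0.1, -0.2], [1.0, 2.0], 1))   # margin = -0.3 < 1, gives [-1.0, -2.0]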
8. The step size rules in Subgradient are determined before
the algorithm is run.
a) False
b) True
View Answer
Answer: b
Explanation: The step size rules in subgradient are
determined before the algorithm is run. That is they do not
depend on any data computed during the algorithm. But in
standard descent methods the step size rules depend very
much on the current point and search direction.
9. The Subgradient is a descent method.
a) True
b) False
View Answer
Answer: b
Explanation: Unlike the ordinary gradient method, the
subgradient method is not a descent method, because the
function value often increases. The method looks very much
like the ordinary gradient method for differentiable
functions, but with several exceptions.
10. Subgradient descent can be used at points where
derivative is not defined.
a) True
b) False
View Answer
Answer: a
Explanation: Subgradient descent can be used at points
where derivative is not defined. It solves the non-
differentiable convex function. And it is like gradient
descent, but replacing gradients with subgradients.
11. Which of the following statements is not true about
Subgradient method?
a) Its convergence can be very fast
b) It handles general non-differentiable convex problem
c) It has no good stopping criterion
d) It often leads to very simple algorithms
View Answer
Answer: d
Explanation: The Subgradient method’s convergence can be
very slow, not very fast. It involves the convergence of the
iterative process, and this iterative process makes the
convergence very slow. All other three statements are the
key features of the Subgradient method.
1. Stochastic gradient descent is also known as on-line
gradient descent.
a) True
b) False
View Answer
Answer: a
Explanation: Stochastic gradient descent is also known as
online gradient descent. It is said to be online because it can
update coefficients on new samples as it comes in the system.
So it makes an update to the weight vector based on one data
point at a time.
2. Stochastic gradient descent (SGD) methods handle
redundancy in the data much more efficiently than batch
methods.
a) True
b) False
View Answer
Answer: a
Explanation: Stochastic gradient descent methods handle
redundancy in the data much more efficiently than batch
methods. If we are doubling the dataset size then the error
function will be multiplied by a factor of 2. And SGD can
handle this error function normally but batch methods need
double the computational power to handle this error
function.
3. Which of the following statements is true about stochastic
gradient descent?
a) It processes all the training examples for each iteration of
gradient descent
b) It is computationally very expensive, if the number of
training examples is large
c) It processes one training example per iteration
d) It is not preferred, if the number of training examples is
large
View Answer
Answer: c
Explanation: Stochastic gradient descent processes one
training example per iteration. That is it updates the weight
vector based on one data point at a time. All other three are
the features of Batch Gradient Descent.
4. Which of the following statements is not true about the
stochastic gradient descent?
a) The parameters are being updated after one iteration
b) It is quite faster than batch gradient descent
c) Stochastic gradient descent is faster than mini batch
gradient descent
d) When the number of training examples is large, it can be
additional overhead for the system
View Answer
Answer: c
Explanation: Stochastic gradient descent is not faster than
mini batch gradient descent but is slower than it. But it is
faster than batch gradient descent and the parameters are
updated after each iteration. And when the number of
training examples is large, then the number of iterations
increases and it will be an overhead for the system.
5. Stochastic gradient descent falls under Non-convex
optimization.
a) True
b) False
View Answer
Answer: a
Explanation: Stochastic gradient descent falls under Non-
convex optimization. A non-convex optimization problem is
any problem where the objective or any of the constraints
are non-convex.
6. Which of the following statements is not true about
stochastic gradient descent?
a) Due to the frequent updates, there can be so many noisy
steps
b) It may take longer to achieve convergence to the minima
of the loss function
c) Frequent updates are computationally expensive
d) It is computationally slower
View Answer
Answer: d
Explanation: Stochastic gradient descent (SGD) is not
computationally slower but is faster, as only one sample is
processed at a time. All other three are the disadvantages of
SGD. Where the frequent updates make noisy steps and
make it to achieve convergence to the minima very slowly.
And it is computationally expensive also.
7. In stochastic gradient descent the high variance frequent
parameter updates causes the loss function to fluctuate
heavily.
a) False
b) True
View Answer
Answer: b
Explanation: In stochastic gradient descent the frequent
parameter updates have high variance and cause the loss
function (objective function) to fluctuate with varying
intensity. The high-variance parameter updates help to
discover better local minima but at the same time
complicate the convergence (unstable convergence) to the
exact minimum.
8. Stochastic gradient descent has the possibility of escaping
from local minima.
a) False
b) True
View Answer
Answer: b
Explanation: One of the properties of stochastic gradient
descent is the possibility of escaping from local minima.
Since a stationary point with respect to the error function
for the whole data set will generally not be a stationary point
for each data point individually.
9. Given an example from a dataset (x1, x2) = (4, 1), observed
value y = 2 and the initial weights w1, w2, bias b as -0.015,
-0.038 and 0. What will be the prediction y’?
a) 0.01
b) 0.03
c) 0.05
d) -0.1
View Answer
Answer: d
Explanation: Given x1 = 4, x2 = 1, w1 = -0.015, w2 = -0.038, y
= 2 and b = 0.
Then prediction y’ = w1 x1 + w2 x2 + b
= (-0.015 * 4) + (-0.038 * 1) + 0
= -0.06 – 0.038 + 0
= -0.098
≈ -0.1
10. Given an example from a dataset (x1, x2) = (2, 8) and the
dependent variable y = -14, and the model prediction y’ = -
11. What will be the loss function if we are using a squared
difference method?
a) 6
b) -3
c) 9
d) 3
View Answer
Answer: c
Explanation: Given the observed variable, y = -14, predicted
value y’ = -11, and additional parameters x1 = 2, x2 = 8.
Then using the squared difference method, Loss L = (y’ – y)^2
= (-11 – (-14))^2
= (-11 + 14)^2
= (3)^2
= 9
11. Given the current bias b = 0, learning rate = 0.01 and
gradient = -4.2. What will be the b’ value after the update?
a) -0.42
b) 0.042
c) 0.42
d) -0.042
View Answer
Answer: b
Explanation: Given b = 0, learning rate η = 0.01 and gradient = -
4.2.
Then bias value after update, b’ = b – (η * Gradient)
= 0 – (0.01 * -4.2)
= 0 – -0.042
= 0.042
12. Given the example from a data set x1 = 3, x2 = 1, observed
value y = 2 and predicted value y’ = -0.05. What will be the
gradient if you are using a squared difference method?
a) -4.1
b) -2.05
c) 4.1
d) 2.05
View Answer
Answer: a
Explanation: Given x1 = 3, x2 = 1, y = 2 and y’ = -0.05.
Then Gradient = 2 (y’ – y), as we are taking the partial
derivative of (y’ – y)^2 with respect to y’.
Gradient = 2 (y’ – y)
= 2 (-0.05 – 2)
= 2 * -2.05
= -4.1
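The per-example bookkeeping used in these questions, sketched in Python; it reuses the inputs of the next question (x1 = 4, x2 = 1, w1 = -0.02, w2 = -0.03, b = 0, y = 2, learning rate 0.05) and shows only the prediction, loss, gradient and bias update:

# One SGD bookkeeping step with the squared-difference loss L = (y' - y)^2
x1, x2 = 4, 1
w1, w2, b = -0.02, -0.03, 0.0
y = 2
learning_rate = 0.05

y_pred = w1 * x1 + w2 * x2 + b      # prediction y' (≈ -0.11)
loss = (y_pred - y) ** 2            # squared-difference loss
grad = 2 * (y_pred - y)             # dL/dy' (≈ -4.22)
b_new = b - learning_rate * grad    # step in the negative gradient direction
print(y_pred, loss, grad, b_new)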
13. Given the example from a data set x1 = 4, x2 = 1, weights
w1 = -0.02, w2 = -0.03, bias b = 0, observed value y = 2,
predicted value y’ = -0.11 and learning rate = 0.05. What will
be the next weight updating values if you are using a
squared difference approach?
a) -0.902, -0.314
b) -0.864, -0.241
c) -0.594, -0.324
d) -0.625, -0.524
View Answer
Answer: b
Explanation: Given x1 = 4, x2 = 1, w1 = -0.02, w2 = -0.03, bias
b = 0, y = 2, y’ = -0.11 and η= 0.05.
Then w1’ = w1 – η(2 (y’ – y) * x1)
= -0.02 – 0.05 * (2(-0.11 – 2) * 4)
= -0.02 – 0.05 * (2 * 2.11 * 4)
= -0.02 – 0.05 * 16.88
= -0.02 – 0.844
= -0.864
Then w2’ = w2 – η(2 (y’ – y) * x2)
= -0.03 – 0.05 * (2(-0.11 – 2) * 1)
= -0.03 – 0.05 * (2 * 2.11 * 1)
= -0.03 – 0.05 * 4.22
= -0.03 – 0.211
= -0.241
1. Stochastic gradient descent cannot be used for risk
minimisation.
a) False
b) True
View Answer
Answer: a
Explanation: Stochastic gradient descent (SGD) can be used
for risk minimisation. In learning, the problem we face
is minimising the risk function LD(w). With SGD, all we need
is to find an unbiased estimate of the gradient of LD(w), that
is, a random vector.
2. Stochastic gradient descent can be used for convex-smooth
learning problems.
a) False
b) True
View Answer
Answer: b
Explanation: Stochastic gradient descent can be used for
convex-smooth learning problems. Assume that for all z, the
loss function l(.,z) is convex, β-smooth, and nonnegative.
Then, if we can run the SGD algorithm for
minimising LD(w), it will minimise the loss function also.
3. Which of the following statements is not true about
stochastic gradient descent for regularised loss
minimisation?
a) Stochastic gradient descent has the same worst-case
sample complexity bound as regularised loss minimisation
b) On some distributions, regularised loss minimisation
yields a better solution than stochastic gradient descent
c) In some cases we solve the optimisation problem as
associated with regularised loss minimisation
d) Stochastic gradient descent has entirely different worst-
case sample complexity bound from regularised loss
minimisation
View Answer
Answer: d
Explanation: Stochastic gradient descent has the same
worst-case sample complexity bound as regularised loss
minimisation. But on distributions, regularised loss
minimisation yields a better solution than stochastic gradient
descent. So in some cases we solve the optimisation problem
associated with regularised loss minimisation.
4. In convex learning problems where the loss function is
convex, the preceding problem is also a convex optimisation
problem.
a) False
b) True
View Answer
Answer: b
Explanation: In convex learning problems where the loss
function is convex, the preceding problem is also a convex
optimisation problem that can be solved using SGD.
Consider f is a strongly convex function and we can apply
the SGD variant by constructing an unbiased estimate of a
sub-gradient of f at w(t).
1. Which of the following is not a variant of stochastic
gradient descent?
a) Adding a projection step
b) Variable step size
c) Strongly convex functions
d) Strongly non convex functions
View Answer
Answer: d
Explanation: There are several variants of Stochastic
Gradient Descent (SGD). Strongly non-convex functions are
not a variant of stochastic gradient descent, whereas adding a
projection step, using a variable step size, and exploiting
strongly convex functions are three variants of SGD that are
used to improve it.
2. Projection step is used to overcome the problem while
maintaining the same convergence rate.
a) True
b) False
View Answer
Answer: a
Explanation: Gradient descent and stochastic gradient
descent restrict w* to a B-bounded hypothesis class
(w* is in the set H = {w : ∥w∥ ≤ B}). So any step in the
opposite direction of the gradient might result in stepping
out of this bound, and the projection step is used to overcome
this problem while maintaining the same convergence rate.
3. Which of the following statements is not true about two-
step update rule?
a) Two-step update rule is a way to add a projection step
b) First subtract a sub-gradient from the current value
of w and then project the resulting vector onto H
c) First add a sub-gradient to the current value of w and
then project the resulting vector onto H
d) The projection step replaces the current value of w by the
vector in H closest to it
View Answer
Answer: c
Explanation: In two-step update rule, we are not adding a
sub-gradient to the current value of w. The two-step rule is a
way to add a projection step, where we first subtract a sub-
gradient from the current value of w and then project the
resulting vector onto H. Then it replaces the current value of
w by the vector in H closest to it.
4. Variable step size decreases the step size as a function of
the iteration, t.
a) True
b) False
View Answer
Answer: a
Explanation: Variable step size decreases the step size as a
function of the iteration, t. It updates the value of w with
η_t rather than with a constant η. When it is closer to the
minimum of the function, it takes steps more carefully, so as
not to overshoot the minimum.
5. More sophisticated averaging schemes can improve the
convergence speed in the case of strongly convex functions.
a) False
b) True
View Answer
Answer: b
Explanation: Averaging techniques are one of the variants of
stochastic gradient descent which is used to improve the
convergence speed in the case of strongly convex functions.
It can output the average of w(t) over the last αT iterations,
for some α ∈ (0, 1) or it can also take a weighted average of
the last few iterates.
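A minimal sketch of these three SGD variants together (projection onto a B-bounded ball, a decreasing step size, and averaging of the last iterates); NumPy is assumed to be available, and the gradient oracle passed in is a placeholder assumption:

import numpy as np

def sgd_with_variants(subgradient, dim, B=1.0, eta0=0.1, T=1000, alpha=0.5):
    # subgradient(w) is assumed to return an unbiased estimate of a
    # subgradient of the objective at w (a random vector).
    w = np.zeros(dim)
    iterates = []
    for t in range(1, T + 1):
        eta_t = eta0 / np.sqrt(t)          # variable step size: shrinks with t
        w = w - eta_t * subgradient(w)     # gradient step
        norm = np.linalg.norm(w)
        if norm > B:                       # projection step onto H = {w : ||w|| <= B}
            w = w * (B / norm)
        iterates.append(w.copy())
    start = int((1 - alpha) * T)           # averaging over the last alpha*T iterates
    return np.mean(iterates[start:], axis=0)

# Illustrative use: minimise f(w) = ||w - c||^2 with a noisy gradient oracle.
c = np.array([0.3, -0.2])
rng = np.random.default_rng(0)
noisy_grad = lambda w: 2 * (w - c) + rng.normal(scale=0.1, size=w.shape)
print(sgd_with_variants(noisy_grad, dim=2))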
1. The computational complexity challenge related to
learning half-space in high dimensional feature spaces can
be solved using the method of kernels.
a) True
b) False
View Answer
Answer: a
Explanation: When the data is mapped into a high
dimensional feature space, it extends the expressiveness of
half-space predictors. And it raises both sample complexity
and computational complexity challenges. And it can be
solved using the method of kernels.
2. A kernel is a type of a similarity measure between
instances.
a) True
b) False
View Answer
Answer: a
Explanation: A kernel is a type of a similarity measure
between instances. When we are embedding the data into a
high dimensional feature space we introduce the idea of
kernels. Mathematical meaning of a kernel is the inner
product in some Hilbert space. So a standard interpretation
of a kernel is the pairwise similarity between different
samples.
3. Let the domain be the real line and consider the domain
points {-10, -9, -8, …, 0, 1, …, 9, 10} where the labels are +1
for all x such that |x| > 2 and -1 otherwise. The given training
set is separable by a half-space.
a) True
b) False
View Answer
Answer: b
Explanation: The given training set is not separable by a
half-space. Because the domain points are {-10, -9, -8, …, 0,
1, …, 9, 10} where the labels are +1 for all x such that |x| > 2
and -1 otherwise. Here the expressive power of half-spaces is
rather restricted. So it is not separable by a half-space.
4. Which of the following is not true about making the class
of half-spaces more expressive?
a) First map the original instance space into another high
dimension space
b) Initially map the original instance space into another low
dimension space
c) After mapping then learn a half-space in that space
d) Increasing expressive power is useful in separating the
training set by a half-space
View Answer
Answer: b
Explanation: Initially we are mapping the original instance
space not into another low dimension space but to a higher
dimension space. After the mapping then learns a half-space
in that space. And it is useful in separating the training set
by a half-space.
5. Polynomial-based classifiers yield much richer hypothesis
classes than half-spaces.
a) False
b) True
View Answer
Answer: b
Explanation: Polynomial-based classifiers yield much richer
hypothesis classes than half-spaces. Consider the domain
points {-6,…, 0, 1,…, 5, 6} where the labels are +1 for
all x such that |x| > 2 and -1 otherwise. It is not separable by
a half-space, but after the embedding x ↦ (x, x^2) it is
perfectly separable.
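To see this concretely, a small sketch: a single threshold on x cannot separate the two labels, but after the embedding x ↦ (x, x^2) a line in the embedded space does (the threshold 4 below is an illustrative choice):

# Labels are +1 when |x| > 2 and -1 otherwise; a single threshold on x cannot
# separate them, but the embedding x -> (x, x^2) makes them linearly separable.
points = range(-6, 7)
for x in points:
    label = 1 if abs(x) > 2 else -1
    x1, x2 = x, x ** 2                 # embedded representation (x, x^2)
    predicted = 1 if x2 > 4 else -1    # a line in the embedded space
    assert predicted == label
print("separable after the embedding")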
6. Which of the following statements is not true about Kernel
methods?
a) It can be used for pattern analysis or pattern recognition
b) It maps the data into higher dimensional space
c) The data can be easily separated in the higher
dimensional space
d) It only leads to finite dimensional space
View Answer
Answer: d
Explanation: The kernel methods lead not only to finite
dimensional space but also to infinite dimensional space as
there are no constraints of this mapping. Because it maps the
data into higher dimensional space by assuming that the
data can be easily separated in the higher dimensional space.
And it can be used for pattern analysis or pattern
recognition.
7. Which of the following statements is not true about Kernel
methods?
a) It works by embedding the input data to some high
dimensional feature space
b) Embedding into feature space can be determined
uniquely by specifying a kernel function that computes the
dot product between data points in the feature space
c) It defines only the linear mapping to the feature space
d) Expensive computations in the high dimensional feature
space can be avoided by evaluating the kernel function
View Answer
Answer: c
Explanation: The kernel function not only defines the linear
mapping to the feature space but also implicitly defines the
non linear mapping to the feature space and expensive
computations in the high dimensional feature space can be
avoided by evaluating the kernel function. All other three
statements are true about kernel methods.
1. When we make the half-space learning more expressive,
the computational complexity of learning may increase.
a) False
b) True
View Answer
Answer: b
Explanation: Embedding the input space into some high
dimensional feature space makes half-space learning more
expressive. But the computational complexity of such
learning may increase. So, computing linear separators over
very high dimensional data may be computationally
expensive.
2. Which of the following statements is not true about
kernel?
a) Kernel is used to describe inner products in the feature
space
b) The kernel function K specify the similarity between
instances
c) The kernel function K specify the embedding as mapping
the domain set into a space
d) The kernel function does the mapping of the domain set
into a space where the similarities are realised as outer
products
View Answer
Answer: d
Explanation: Mathematical meaning of a kernel is the inner
product in some Hilbert space not the outer products. And it
is a type of a similarity measure between instances. When we
are embedding the data into a high dimensional feature
space we introduce the idea of kernels.
3. Many learning algorithms for half-spaces can be carried
out just on the basis of the values of the kernel function over
pairs of domain points.
a) True
b) False
View Answer
Answer: a
Explanation: Many learning algorithms for half-spaces can
be carried out just on the basis of the values of the kernel
function over pairs of domain points. One of the main
advantages of such algorithms is that they implement linear
separators in high dimensional feature spaces without
having to specify points in that space or expressing the
embedding explicitly.
4. Which of the following statements is not true about the
learning algorithms?
a) A feature mapping can be viewed as expanding the class
of linear classifiers to a richer class
b) The suitability of any hypothesis class to a given learning
task depends on the nature of that task
c) An embedding is a way to express and utilise prior
knowledge about the problem at hand
d) The sample complexity required to learn with some kinds
of kernels is independent of the margin in the feature space
View Answer
Answer: d
Explanation: Sample complexity required to learn with some
kinds of kernels (Gaussian kernels) depends on the margin
in the feature space which will be large, but can in general
be arbitrarily small. All three other statements are true
about learning algorithms.
5. A Hilbert space is a vector space with an inner product,
which is also complete.
a) True
b) False
View Answer
Answer: a
Explanation: A Hilbert space is a vector space with an inner
product, which is also complete. An inner product space X is
called a Hilbert space if it is a complete metric space. And it
is complete if all Cauchy sequences in the space converge. In
feature mapping it maps the original instances into some
Hilbert space.
6. The k degree polynomial kernel is defined as K(x, x’) = (1
+ <x, x’>)^k.
a) True
b) False
View Answer
Answer: a
Explanation: The k degree polynomial kernel is defined as
K(x, x’) = (1 + <x, x’>)^k where k is the degree of the
polynomial. It is popular in image processing.
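A minimal sketch of this kernel as a function (NumPy is assumed to be available; the sample vectors are illustrative):

import numpy as np

def polynomial_kernel(x, x_prime, k=2):
    # K(x, x') = (1 + <x, x'>)^k, the degree-k polynomial kernel
    return (1.0 + np.dot(x, x_prime)) ** k

print(polynomial_kernel(np.array([1.0, 2.0]), np.array([0.5, -1.0])))  # (1 - 1.5)^2 = 0.25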
7. Which of the following statements is not true about kernel
trick?
a) It allows one to incorporate prior knowledge of the
problem domain
b) The training data only enter the algorithm through their
entries in the kernel matrix
c) The training data only enter the algorithm through their
individual attributes
d) The number of operations required is not necessarily
proportional to the number of features
View Answer
Answer: c
Explanation: The training data only enter the algorithm
through their entries in the kernel matrix (Gram matrix),
and never through their individual attributes. All three are
the advantages of kernel trick.
8. The Gaussian kernel is also called the RBF kernel, for
Radial Basis Functions.
a) True
b) False
View Answer
Answer: a
Explanation: The Gaussian kernel is also called the RBF
kernel, for Radial Basis Functions. RBF kernel is a kernel
that is in the form of a radial basis function (more
specifically, a Gaussian function). The RBF kernel is defined
as K_RBF(x, x’) = exp(−γ ǁx – x’ǁ^2).
9. Spectrum Kernel count the number of substrings in
common.
a) True
b) False
View Answer
Answer: a
Explanation: Spectrum Kernel counts the number of
substrings in common. It is a kernel since it is a dot product
between vectors of indicators of all the substrings. Other
kernels like: Gaussian kernel is a general-purpose kernel,
Polynomial kernel is popular in image processing and
sigmoid kernel can be used as the proxy for neural networks.
10. Which of the following statements is not true about
kernel trick?
a) It provides a bridge from linearity to non-linearity for any
algorithm that can be expressed solely in terms of dot
products between two vectors
b) If we first map the input data into a higher-dimensional
space, a linear algorithm operating in this space will behave
non-linearly in the original input space
c) The mapping always needs to be computed
d) If the algorithm can be expressed only in terms of an
inner product between two vectors, all it needs is to replace
this inner product with the inner product from some other
suitable space
View Answer
Answer: c
Explanation: The kernel trick is interesting precisely because
that mapping never needs to be computed explicitly. That is
where the trick resides: wherever a dot product is used, it is
replaced with a kernel function. The other three statements
correctly describe the kernel trick.
11. Which of the following statements is not true about
kernel properties?
a) Kernel functions must be continuous
b) Kernel functions must be symmetric
c) Kernels which are said to satisfy the Mercer’s theorem are
negative semi-definite
d) Kernel functions most preferably should have a positive
(semi-) definite Gram matrix
View Answer
Answer: c
Explanation: Kernels which are said to satisfy the Mercer’s
theorem are positive semi-definite as there is a property that
kernel functions most preferably should have a positive
(semi-) definite Gram matrix. And positive semi-definite
means that their kernel matrices have only non-negative
eigenvalues.
12. Which of the following statements is not true about
choosing the right kernel?
a) Linear kernel allows to picking out hyper spheres
b) A polynomial kernel allows to model feature conjunctions
up to the order of the polynomial
c) Radial basis functions allow to pick out circles
d) Linear kernel allows picking out lines
View Answer
Answer: a
Explanation: A linear kernel allows picking out only lines
(hyperplanes), not hyperspheres (circles). Radial basis
functions allow picking out circles, and a polynomial kernel
allows modelling feature conjunctions up to the order of the
polynomial.
13. As per the given figure Kernel trick illustrates some
fundamental ideas about different ways to represent data
and how machine learning algorithms see these different
data representations.

a) True
b) False
View Answer
Answer: a
Explanation: It is a kernel trick used in an SVM.
Implementing support vector classifiers requires specifying a
kernel function (Φ). Here, in the picture, data that is not
easily separable is transformed into a high dimensional
feature space in which it is now easily separable, using a
kernel function. The kernel trick illustrates some fundamental
ideas about different ways to represent data.
1. A Support Vector Machine (SVM) is a discriminative
classifier defined by a separating hyperplane.
a) True
b) False
View Answer
Answer: a
Explanation: A Support Vector Machine (SVM) is a
discriminative classifier defined by a separating hyperplane.
Suppose we are given labeled training data, then the
algorithm outputs an optimal hyperplane which categorizes
new examples. A hyperplane is a line dividing a plane into
two parts, with each class lying on either side.
2. Support vector machines cannot be used for regression.
a) False
b) True
View Answer
Answer: a
Explanation: Support Vector Machine (SVM) is a
classification and regression prediction tool. These are a
popular set of supervised learning algorithms originally
developed for classification (categorical target) problems,
and then extended to regression (numerical target)
problems.
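A minimal sketch of both uses, assuming scikit-learn is available (the toy data below is made up purely for illustration):

from sklearn.svm import SVC, SVR

X = [[0, 0], [1, 1], [2, 2], [3, 3]]

# Classification (categorical target)
clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, [0, 0, 1, 1])
print(clf.predict([[2.5, 2.5]]))

# Regression (numerical target)
reg = SVR(kernel="rbf", C=1.0)
reg.fit(X, [0.0, 1.1, 1.9, 3.2])
print(reg.predict([[2.5, 2.5]]))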
3. Which of the following statements is not true about SVM?
a) It is memory efficient
b) It can address a large number of predictor variables
c) It is versatile
d) It doesn’t require feature scaling
View Answer
Answer: d
Explanation: SVM requires feature scaling, so we have to do
feature scaling of variables before applying SVM. SVMs are
memory efficient, can address a large number of predictor
variables and are versatile since they support a large
number of different kernel functions.
4. Which of the following statements is not true about SVM?
a) It has regularization capabilities
b) It handles non-linear data efficiently
c) It has much improved stability
d) Choosing an appropriate kernel function is easy
View Answer
Answer: d
Explanation: Choosing an appropriate kernel function is not
an easy task. It could be tricky and complex. In case of using
a high dimension kernel, you might generate too many
support vectors which reduce the training speed. All other
three statements are advantages of SVM.
5. Minimizing a quadratic objective function (∑ wi^2, i = 1 to
n) subject to certain constraints, in SVM is known as the
primal formulation of linear SVMs.
a) True
b) False
View Answer
Answer: a
Explanation: Minimizing a quadratic objective function
(∑ wi^2) subject to certain constraints in SVM is known as
primal formulation of linear SVMs. It is an SVM
optimisation problem. It is a convex quadratic programming
optimisation problem with n variables, where n is the
number of features in the data set.
6. Given a primal problem f*, minimizing x^2 subject to x >=
b and a dual problem d*, maximizing d(α) subject to α >= 0.
Then d* = f* if f is non convex and x*, α* satisfy zero
gradient, primal feasibility dual feasibility, and
complementary slackness.
a) True
b) False
View Answer
Answer: b
Explanation: Given a primal problem f*,
minimizing x^2 subject to x >= b and a dual problem d*,
maximizing d(α) subject to α >= 0. Then d* = f* if f is convex
(not non-convex) and x*, α* satisfy zero gradient, primal
feasibility, dual feasibility and complementary slackness.
These are the Karush–Kuhn–Tucker (KKT) conditions.
7. Which of the following statements is not true about dual
formulation in SVM optimisation problem?
a) No need to access data, need to access only dot products
b) Number of free parameters is bounded by the number of
support vectors
c) Number of free parameters is bounded by the number of
variables
d) Regularizing the sparse support vector associated with the
dual hypothesis is sometimes more intuitive than
regularizing the vector of regression coefficients
View Answer
Answer: c
Explanation: In dual formulation in SVM optimisation
problem number of free parameters is bounded not by the
number of variables but by the number of support vectors.
All other three statements are benefits of dual formulation in
SVM optimisation problem.
8. The optimal classifier is the one with the largest margin.
a) True
b) False
View Answer
Answer: a
Explanation: Consider all the samples are correctly
classified, where the data point can be as far from the
decision boundary as possible. Then we introduce the
concept of margin to measure the distance from data
samples to separating hyperplane. So the optimal classifier is
the one with the largest margin.
9. Suppose we have an equality optimization problem as
follows: Minimize f(x, y) = x + 2y subject to x^2 + y^2 – 4 = 0.
While solving the above equation we get x = ±2/√5, y = ±4/√5,
λ = ±√5/4. At what values of x and y does the function f(x, y)
have its minimum value?
a) –2/√5, –4/√5
b) 2/√5, –4/√5
c) –2/√5, 4/√5
d) 2/√5, 4/√5
View Answer
Answer: a
Explanation: When x = –2/√5, y = –4/√5 and λ = ±√5/4,
f(x, y, λ) = x + 2y + λ(x^2 + y^2 – 4)
= –2/√5 – 8/√5 ± (√5/4)(4/5 + 16/5 – 4)
= –10/√5 ± (√5/4)(4 – 4)
= –10/√5 ± (√5/4) * 0
= –10/√5
Similarly when x = 2/√5, y = 4/√5 and λ = ±√5/4,
f(x, y, λ) = 10/√5
When x = –2/√5, y = 4/√5 and λ = ±√5/4,
f(x, y, λ) = 6/√5
When x = 2/√5, y = –4/√5 and λ = ±√5/4,
f(x, y, λ) = –6/√5
So the function f(x, y) has its minimum value (–10/√5) at
x = –2/√5 and y = –4/√5.
10. Suppose we have an equality optimization problem as
follows: Minimize f(x, y) = x + y subject to x^2 + y^2 – 2 = 0.
While solving the above equation, what will be the values of
x, y and λ?
a) ±1, ±1, ±1/2
b) ±2, ±1, ±1/2
c) ±1, ±2, ±1/2
d) ±1/2, ±1/2, ±1
View Answer
Answer: a
Explanation: We know the Lagrangian L(x, y, λ) = x + y +
λ(x^2 + y^2 − 2).
∂L/∂x = 1 + 2λx = 0
∂L/∂y = 1 + 2λy = 0
∂L/∂λ = x^2 + y^2 − 2 = 0
By solving the above three equations we get x = ±1, y = ±1
and λ = ±1/2.
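These stationarity conditions can be checked with a short symbolic sketch for the problem in question 10 above, assuming SymPy is available:

import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
# Lagrangian for f(x, y) = x + y with the constraint x^2 + y^2 - 2 = 0
L = x + y + lam * (x**2 + y**2 - 2)
stationarity = [sp.diff(L, v) for v in (x, y, lam)]
# Solutions: x = y = -1 with lam = 1/2, and x = y = 1 with lam = -1/2
print(sp.solve(stationarity, [x, y, lam]))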
11. Suppose we have an equality optimization problem as
follows: Minimize f(x, y) = x + 2y subject to x^2 + y^2 – 9 = 0.
What will be the values of x, y and λ?
a) ±3/√5, ±6/√5, ±√5/6
b) ±9/5, ±6/5, ±5/6
c) ±9/√5, ±6/5, ±5/6
d) ±3/5, ±6/5, ±√5/6
View Answer
Answer: a
Explanation: We know the Lagrangian L(x, y, λ) = x + 2y +
λ(x^2 + y^2 – 9).
∂L/∂x = 1 + 2λx = 0
∂L/∂y = 2 + 2λy = 0
∂L/∂λ = x^2 + y^2 – 9 = 0
By solving the above three equations we get x = ±3/√5,
y = ±6/√5 and λ = ±√5/6.
1. The goal of a support vector machine is to find the optimal
separating hyperplane which minimizes the margin of the
training data.
a) False
b) True
View Answer
Answer: a
Explanation: The goal of a support vector machine is to find
the optimal separating hyperplane which maximizes the
margin of the training data. So it is based on finding the
hyperplane that gives the largest minimum distance to the
training examples.
2. Which of the following statements is not true about
hyperplane in SVM?
a) If a hyperplane is very close to a data point, its margin
will be small
b) If an hyperplane is far from a data point, its margin will
be large
c) Optimal hyperplane will be the one with the biggest
margin
d) If we select a hyperplane which is close to the data points
of one class, then it generalizes well
View Answer
Answer: d
Explanation: If we select a hyperplane which is close to the
data points of one class, then it might not generalize well. If a
hyperplane is very close to a data point, its margin will be
small and if it is far from a data point, its margin will be
large. So the optimal hyperplane is the one with the biggest
margin.
3. Which of the following statements is not true about
optimal separating hyperplane?
a) It correctly classifies the training data
b) It is the one which will generalize better with unseen data
c) Finding the optimal separating hyperplane can be
formulated as a convex quadratic programming problem
d) The optimal hyperplane cannot correctly classify all the
data while being farthest away from the data points
View Answer
Answer: d
Explanation: The optimal hyperplane correctly classifies all
the data while being farthest away from the data points. So it
correctly classifies the training data and will generalize
better with unseen data. And finding the optimal separating
hyperplane can be formulated as a convex quadratic
programming problem.
4. Support Vector Machines are known as Large Margin
Classifiers.
a) True
b) False
View Answer
Answer: a
Explanation: SVM is a type of classifier which classifies
sample data. And the largest margin is found in order to
avoid overfitting and the optimal hyperplane is at the
maximum distance from the samples. So the margin is
maximized to classify the data points accurately.
5. Which of the following statements is not true about the
role of C in SVM?
a) The C parameter tells the SVM optimisation how much
you want to avoid misclassifying each training example
b) For large values of C, the optimisation will choose a
smaller-margin hyperplane
c) For small values of C, the optimisation will choose a large-
margin hyperplane
d) If we increase margin, it will end up getting a low
misclassification rate
View Answer
Answer: d
Explanation: If we increase margin, it will end up getting a
high misclassification rate. Because the C parameter tells the
SVM optimisation how much you want to avoid
misclassifying each training example. For large values of C,
the optimisation will choose a smaller-margin hyperplane
and vice versa.
6. Which of the following statements is not true about large
margin intuition classifier?
a) It has a hyperplane with the maximum margin
b) The hyperplane divides the data properly and is as far as
possible from your data points
c) The hyperplane is close to your data points
d) When new data comes in, even if it is a little closer to the
wrong class than the training points, it will still lie on the
right side of the hyperplane
View Answer
Answer: c
Explanation: The hyperplane is not close to your data points
but is as far as possible from it. In large margin intuition
classifier the hyper plane is with a maximum margin. So
when new data comes in, even if it is a little closer to the
wrong class than the training points, it will still lie on the
right side of the hyperplane.
7. Suppose the optimal separating hyperplane is given by
2x1 + 4x2 + x3 − 4 = 0 and the class labels are +1 and -1. For
the training example (1, 0.5, 1), the class label is -1, and is a
support vector.
a) True
b) False
View Answer
Answer: b
Explanation: Suppose the optimal separating hyperplane is
given by 2x1 + 4x2 + x3 − 4 = 0 and the class labels are +1 and
-1. For the training example (1, 0.5, 1), the class label is +1,
and is a support vector.
Let the training sample be (1, 0.5, 1) and the optimal
separating hyperplane be given by 2x1 + 4x2 + x3 − 4 = 0.
2x1 + 4x2 + x3 − 4 = 2 * 1 + 4 * 0.5 + 1 − 4
= 2 + 2 + 1 – 4
= 5 – 4
= +1
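The same check in a short NumPy sketch (NumPy assumed available):

import numpy as np

w = np.array([2.0, 4.0, 1.0])   # coefficients of 2*x1 + 4*x2 + x3 - 4 = 0
b = -4.0
x = np.array([1.0, 0.5, 1.0])
print(np.dot(w, x) + b)         # 2 + 2 + 1 - 4 = 1.0, i.e. the +1 side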
8. The optimum separation hyperplane (OSH) is the linear
classifier with the minimum margin.
a) True
b) False
View Answer
Answer: b
Explanation: The optimum separation hyperplane (OSH) is
the linear classifier with the maximum margin for a given
finite set of learning patterns. To find the OSH, draw a convex
hull around each set of points and find the shortest line
segment connecting the two convex hulls. Then find the
midpoint of that segment; the optimal hyperplane is
perpendicular to the segment at its midpoint.
9. SVM find outs the probability value.
a) True
b) False
View Answer
Answer: b
Explanation: SVM does not find out the probability value.
Suppose you are given a set of training examples, each
marked as belonging to one of two categories, an SVM
training algorithm builds a model that assigns new examples
into one category or the other, making it a non probabilistic
binary classifier.
10. Given figure shows some data points classified by an
SVM classifier and the bold line on the center represents the
optimal hyperplane. What the perpendicular distance
between the two dashed lines represented by a double arrow
line known as?

a) Maximum margin
b) Minimum margin
c) Support vectors
d) Hyperplane
View Answer
Answer: a
Explanation: The operation of the SVM algorithm is based
on finding the optimal hyperplane. Therefore, the optimal
separating hyperplane maximizes the margin of the training
data. And hence the distance between the two dashed lines
is known as the maximum margin.
11. What is the leave-one-out cross-validation error estimate
for maximum margin separation in the following figure?
a) Zero
b) Maximum
c) Minimum
d) Half of the previous error value
View Answer
Answer: a
Explanation: From the figure we can see that removing any
single point would not change the resulting maximum
margin separator. Here all the points are initially classified
correctly, so the leave-one-out error is zero.
1. In SVM the distance of the support vector points from the
hyperplane are called the margins.
a) True
b) False
View Answer
Answer: a
Explanation: The SVM is based on the idea of finding a
hyperplane that best separates the features into different
domains. The points closest to the hyperplane are called
the support vector points, and the distances of these vectors
from the hyperplane are called the margins.
2. If the support vector points are farther from the
hyperplane, then this hyperplane can also be called as
margin maximizing hyperplane.
a) True
b) False
View Answer
Answer: a
Explanation: In SVM, the farther the support vector points
are from the hyperplane, the more justifiably this hyperplane
can be called a margin maximizing hyperplane. And the
probability of correctly classifying the points in their
respective regions or classes is high.
3. Which of the following statements is not true about the C
parameter in SVM?
a) Large values of C give solutions with less misclassification
errors
b) Large values of C give solutions with smaller margin
c) Small values of C give solutions with bigger margin
d) Small values of C give solutions with less classification
errors
View Answer
Answer: d
Explanation: Small values of C give solutions with more
classification errors but a bigger margin. So it focuses more
on finding a hyperplane with a big margin. And large values
of C give solutions with less misclassification errors but a
smaller margin.
4. Which of the following statements is not true about
margin in SVM?
a) The margin of a hyperplane with respect to a training set
is defined to be the minimal distance between a point in the
training set and the hyperplane
b) The margin of a hyperplane with respect to a training set
is defined to be the maximum distance between a point in the
training set and the hyperplane
c) If a hyperplane has a large margin, then it will still
separate the training set even if we slightly disturb each
instance
d) True error of a half space can be bounded in terms of the
margin that it has over the training sample
View Answer
Answer: b
Explanation: The margin of a hyperplane with respect to a
training set is defined to be not the maximum but the
minimal distance between a point in the training set and the
hyperplane. So if a hyperplane has a large margin, then it
will still separate the training set even if we slightly disturb
each instance. And the true error of a half space can be
bounded in terms of the margin it has over the training
sample.
5. The maximum margin linear classifier is the linear
classifier with the maximum margin.
a) True
b) False
View Answer
Answer: a
Explanation: The maximum margin linear classifier is the
linear classifier with the maximum margin. And these kinds
of SVMs are called Linear SVM (LSVM). Support vectors
are those data points that the margin pushes up against.
6. Which of the following statements is not true about
maximum margin?
a) It is safe and empirically works well
b) It is not sensitive to removal of any non support vector
data points
c) If the location of the boundary is not perfect due to noise,
this gives us the least chance of misclassification
d) It is not immune to removal of any non-support-vector
data points
View Answer
Answer: d
Explanation: The maximum margin is immune to removal of
any non support vector data points. It is safe and empirically
works well. So even if we have made a small error in the
location of the boundary (an imperfect location of the
boundary), this gives us the least chance of causing a
misclassification.
7. Hard SVM is the learning rule in which return an ERM
hyperplane that separates the training set with the largest
possible margin.
a) True
b) False
View Answer
Answer: a
Explanation: Hard-SVM is the learning rule in which return
an ERM hyperplane that separates the training set with the
largest possible margin. Here the margin of an ERM
hyperplane with respect to a training set is defined to be the
minimal distance between a point in the training set and the
ERM hyperplane.
8. The output of hard-SVM is the separating hyperplane
with the largest margin.
a) True
b) False
View Answer
Answer: a
Explanation: The output of hard-SVM is the separating
hyperplane with the largest margin and it seeks for the
separating plane with the largest margin. Hard-SVM works
on separable problems and it finds the linear predictor with
the maximal margin on the training sample.
9. Assume that we are training an SVM with quadratic
kernel. Given figure shows a dataset and the decision
boundary will be the one with maximum curvature for very
large values of C as shown in figure.

a) True
b) False
View Answer
Answer: b
Explanation: The slack penalty C will determine the location
of the separating parabola. When C is too large, we can’t
afford any misclassification. And hence among all the
parabolas, it chooses the minimum curvature one. So the
decision boundary will be the one with minimum curvature
as shown below.

10. Assume that we are training an SVM with quadratic
kernel. Given figure shows a dataset and the decision
boundary will be the one with maximum curvature when
the value of C = 0 as shown in the figure.
[Figure]
a) True
b) False
View Answer
Answer: b
Explanation: The slack penalty C will determine the location
of the separating parabola. When the penalty for
misclassification is too small (C = 0) the decision boundary
will be linear. So the decision boundary will be as shown
below.
[Figure]
11. The given figure shows the hard margin while classifying
a set of data points using SVM.

a) True
b) False
View Answer
Answer: a
Explanation: The given figure shows the hard margin while
classifying a set of data points using SVM. Here all the
points are correctly classified. And the hard margin
maximizes margin between separating hyperplane.
1. The Soft SVM assumes that the training set is linearly
separable.
a) True
b) False
View Answer
Answer: b
Explanation: The Soft SVM does not assume that the training
set is linearly separable, whereas the Hard SVM does assume
that the training set is linearly separable. Soft SVM can be
applied even if the training set is not linearly separable.
2. Soft SVM is an extended version of Hard SVM.
a) True
b) False
View Answer
Answer: a
Explanation: Soft SVM is an extended version of Hard
SVM. Hard SVM can work only when data is completely
linearly separable without any errors (noise or outliers). If
there are errors then either the margin is smaller or hard
margin SVM fails. And Soft SVM was proposed to solve this
problem by introducing slack variables.
3. Linear Soft margin SVM can only be used when the
training data are linearly separable.
a) True
b) False
View Answer
Answer: b
Explanation: The linear soft margin SVM is not restricted to
training data that are linearly separable; only Hard SVM
requires that. Linear separability of the training data is a
strong assumption made in Hard SVM, and Soft SVM can be
applied even if the training set is not linearly separable.
4. Given a two-class classification problem with data points
x1 = -5, x2 = 3, x3 = 5, having class label +1 and x 4 = 2 with
class label -1. The problem can be solved using Soft SVM.
a) True
b) False
View Answer
Answer: a
Explanation: The given problem is a one dimensional two-
class classification problem. Here the points x 1, x2, and
x3 have class labels +1 and x 4 has class label -1. And the
dataset is not linearly separable, so we can use Soft SVM to
solve this classification problem.
5. Given a two-class classification problem with data points
x1 = -5, x2 = 3, x3 = 5, having class label +1 and x 4 = 2 with
class label -1. The problem can never be solved using Hard
SVM.
a) True
b) False
View Answer
Answer: b
Explanation: The given problem is a one dimensional two-
class classification problem and the data points are non-
linearly separable. So the problem cannot be solved by the
Hard SVM directly. But it can be solved using Hard SVM if
the one dimensional data set is transformed into a 2-
dimensional dataset using some function like (x, x^2). Then
the problem is linearly separable and can be solved by Hard
SVM.
6. Which of the following statements is not true about the
picture shown below?

a) The data are not completely separable


b) Slack variables can be introduced to the objective
function to allow error in the misclassification
c) Soft SVM can be used here to classify the data correctly
d) Hard SVM can be used here to classify the data flawlessly
View Answer
Answer: d
Explanation: Here we cannot use Hard SVM to classify the
data flawlessly, because the data are not completely
separable. But we can use Soft SVM for the same
classification purpose. And can introduce slack variables to
the SVM objective function to allow the error in the
misclassification.
7. The SVM relies on hinge loss.
a) True
b) False
View Answer
Answer: a
Explanation: The SVM relies on hinge loss. Because hinge
loss is convex and, therefore minimizing the hinge loss can
be performed efficiently. But the problem of minimising the
other losses is computationally intractable.
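A minimal sketch of the hinge loss for a linear predictor (NumPy assumed; the example values are illustrative):

import numpy as np

def hinge_loss(w, b, x, y):
    # hinge loss: max(0, 1 - y * (w . x + b)); it is zero only when the point
    # is on the correct side with functional margin at least 1
    return max(0.0, 1.0 - y * (np.dot(w, x) + b))

print(hinge_loss(np.array([2.0, 4.0]), -4.0, np.array([1.0, 1.0]), 1))   # 0.0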
8. Which of the following statements is not true about Soft
SVM classification?
a) If the data are non separable it needs to introduce some
tolerance to outlier data points
b) If the data are non separable, slack variable can be added
to allow misclassification of noisy examples
c) If the data are non separable it will add one slack variable
greater than or equal to zero for each training data point
d) The slack variable value is greater than one for the points
that are on the correct side of the margin
View Answer
Answer: d
Explanation: The slack variable value is equal to zero for the
points that are on the correct side of the margin. And if the
data are non-separable it introduces some tolerance to
outlier data points (slack variable). And the value of the
slack variable will be greater than or equal to zero.
9. The slack variable value of the point on the decision
boundary of the Soft SVM is equal to one.
a) True
b) False
View Answer
Answer: a
Explanation: The slack variable value of the point on the
decision boundary of the Soft SVM is equal to one. Slack
variables are introduced to allow certain constraints to be
violated. That is, certain training points will be allowed to be
within the margin.
10. The slack variables value ξi ≥ 1 for misclassified points,
and 0 < ξi < 1 for points close to the decision boundary.
a) True
b) False
View Answer
Answer: a
Explanation: The slack variables ξi ≥ 1 for misclassified
points, and 0 < ξi < 1 for points close to the decision
boundary. For non separable data it aims to both maximize
the margin and minimize violation of the margin constraints.
So one slack variable ξ must be optimized for each data
point.
11. The bounds derived for Soft-SVM do not depend on the
dimension of the instance space.
a) True
b) False
View Answer
Answer: a
Explanation: The bounds derived for Soft-SVM do not
depend on the dimension of the instance space. The bounds
depend on the norm of the examples, the norm of the half-
space. And in the non-separable case, the bounds also
depend on the minimum hinge loss of all half-spaces of norm
less than or equal to half-space.
1. A Lagrange dual of a convex optimisation problem is
another convex optimisation problem.
a) True
b) False
View Answer
Answer: a
Explanation: The optimisation problems can be either
primal problems or dual problems. The solution to the dual
problem provides a lower bound to the solution of the
primal problem. And the Lagrange dual of a convex
optimisation problem is another convex optimisation
problem where the optimisation variables are the Lagrange
multipliers of the original problem.
2. The difference between the primal and dual solutions is
known as duality gap.
a) True
b) False
View Answer
Answer: a
Explanation: In optimisation problems the difference
between the primal and dual solutions is known as the
duality gap. And the duality gap is zero if and only if strong
duality holds. Otherwise the gap is strictly positive and weak
duality holds.
3. In optimisation problems Lagrangian is used to find out
only the local minima of a function subject to certain
constraints.
a) True
b) False
View Answer
Answer: b
Explanation: In optimisation problems Lagrangian is used
to find out both the local minima and maxima of a function
subject to certain equality constraints. Lagrangian is the
function in which the constraints have been introduced by
multiplying them by positive coefficients called Lagrange
multipliers.
4. Karush–Kuhn–Tucker (KKT) conditions are second
derivative tests for a solution in nonlinear programming to
be optimal, provided that some regularity conditions are
satisfied.
a) False
b) True
View Answer
Answer: a
Explanation: Karush–Kuhn–Tucker (KKT) conditions are
not second derivative but are the first derivative tests for a
solution in nonlinear programming to be optimal, provided
that some regularity conditions are satisfied. And sometimes
it is also known as first-order necessary conditions.
5. When the constrained maximization/minimization
problem is rewritten as a Lagrange function, then its optimal
point is known as saddle point.
a) True
b) False
View Answer
Answer: a
Explanation: When the constrained
maximization/minimization problem is rewritten as a
Lagrange function, then its optimal point is known as saddle
point. A saddle point or minimax point is a point on the
surface of the graph of a function where the slopes
(derivatives) in orthogonal directions are all zero.
6. Support vector machine is a generative classifier.
a) True
b) False
View Answer
Answer: b
Explanation: A Support vector machine is a discriminative
classifier. Because rather than modeling each class, SVMs
simply find a line or curve that divides the classes from each
other. And SVM is a discriminative classifier which tries to
model by just depending on the observed data.
7. Which of the following statements is not true about
Lagrange multipliers?
a) The basic idea behind this is to convert a constrained
problem into a form such that the derivative test of an
unconstrained problem can still be applied
b) It is done by converting a constrained problem to an
equivalent unconstrained problem with the help of certain
unspecified parameters
c) The Hessian matrix determines the maxima, minima, or
saddle points before the stationary points have been
identified
d) Once the stationary points have been identified from the
first-order necessary conditions, it determines the maxima,
minima or saddle points
View Answer
Answer: c
Explanation: The Hessian matrix determines the maxima,
minima, or saddle points only after the stationary points
have been identified from the first-order necessary
conditions. And for this, initially the constrained problem is
converted into an unconstrained problem with the help of
Lagrange multipliers.
8. Let the problem be min f(x1, x2, …, xn) subject to h1(x1, x2,
…, xn) = 0, and let it be converted into min L(x1, x2, …, xn, λ) =
min {f(x1, x2, …, xn) – λh1(x1, x2, …, xn)}. Then L(x, λ) and λ are
known as the Lagrangian function and the Lagrange multiplier
respectively.
a) True
b) False
View Answer
Answer: a
Explanation: Given the optimisation problem above, L(x,
λ) is known as the Lagrangian function and λ is an
unspecified positive or negative constant called the Lagrange
multiplier. The method of Lagrange multipliers is widely
used to solve challenging constrained optimisation problems.
9. The training data points near to the separating
hyperplane are known as support vectors.
a) True
b) False
View Answer
Answer: a
Explanation: The training data points near to the separating
hyperplane are known as Support vectors. The SVM finds
the best line of separation and the points closest to the line
from both the classes. And these points are known as
support vectors.
10. Which of the following statements is not true about
support vectors?
a) Support vectors are used to maximize the margin of the
classifier
b) Deleting the support vectors will change the position of
the hyperplane
c) The vectors that define the hyperplane are the support
vectors
d) The extreme points in the data sets that define the
hyperplane are not included in the support vectors
View Answer
Answer: d
Explanation: The extreme points in the data sets that define
the hyperplane are the support vectors. It defines the
hyperplane and is used to maximize the margin of the
classifier. Deleting it will change the position of the
hyperplane.
1. SVM uses Gradient descent (GD) to minimize its margin
instead of using a Lagrange.
a) True
b) False
View Answer
Answer: b
Explanation: SVM does not use gradient descent to minimize
its margin in place of a Lagrangian; the two are used for
different purposes. GD minimizes an unconstrained
optimization problem, while Lagrange multipliers are used to
convert a constrained optimization problem into an
unconstrained problem.
2. Gradient descent and Lagrange are interchangeably used
by SVM.
a) True
b) False
View Answer
Answer: b
Explanation: Gradient descent and Lagrange are not
interchangeably used by SVM. They are used for different
purposes. Gradient descent minimizes an unconstrained
optimization problem. And Lagrange multipliers used to
convert a constrained optimization problem into an
unconstrained problem.
3. Let the optimization problem using Soft SVM be to minimize
a function, and let the update rule of SGD be w^(t+1) =
–(1/(λt)) ∑_{j=1..t} v_j. Then v_j is the sub-gradient of the
loss function.
a) True
b) False
View Answer
Answer: a
Explanation: In the given update rule of SGD, v_j is the sub-
gradient of the loss function at w^(j) on the random example
chosen at iteration j. Soft SVM relies on the SGD framework
for solving regularized loss minimization problems, and hence
the update rule can be rewritten as given.
4. Given the Soft SVM optimization problem and the update
rule of SGD w^(t+1) = –(1/(λt)) ∑_{j=1..t} v_j. For the hinge
loss, given an example (x, y), it can choose v_j to be one if
y⟨w^(j), x⟩ ≥ 1.
a) True
b) False
View Answer
Answer: b
Explanation: Given the Soft SVM optimization problem and
the update rule of SGD w^(t+1) = –(1/(λt)) ∑_{j=1..t} v_j. For
the hinge loss, given an example (x, y), we can choose v_j to
be zero if y⟨w^(j), x⟩ ≥ 1 and v_j = −y x otherwise.
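A sketch of this subgradient choice inside a Pegasos-style SGD loop; this is a minimal illustration under the stated update rule, not a tuned implementation, and NumPy plus the toy data are assumptions:

import numpy as np

def soft_svm_sgd(X, y, lam=0.1, T=1000, seed=0):
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])       # accumulates -v_j (negated subgradients)
    w_sum = np.zeros(X.shape[1])
    for t in range(1, T + 1):
        w = theta / (lam * t)          # w^(t+1) = -(1/(lam*t)) * sum_j v_j
        i = rng.integers(len(y))
        if y[i] * np.dot(w, X[i]) < 1:   # hinge loss active: v_j = -y_i * x_i
            theta += y[i] * X[i]
        # otherwise v_j = 0 and theta is unchanged
        w_sum += w
    return w_sum / T                   # average of the iterates

# Illustrative use on a tiny separable data set:
print(soft_svm_sgd([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]],
                   [1, 1, -1, -1]))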
5. Which of the following statements is not true about the
soft margin solution for optimization problem?
a) Every constraint can be satisfied if slack variable is
sufficiently large
b) C is a regularization parameter
c) Small C allows constraints to be hard to ignore
d) C = ∞ enforces all constraints and it implies hard margin
View Answer
Answer: c
Explanation: In the soft margin solution for optimization
problems, small C allows constraints to be easily ignored
(large margin) and not hard to ignore. Here C is a
regularization parameter and C = ∞ enforces all constraints
which implies hard margin. And every constraint can be
satisfied if the slack variable is sufficiently large.
1. Which of the following statements is not true about the
Decision tree?
a) It can be applied on binary classification problems only
b) It is a predictor that predicts the label associated with an
instance by traveling from a root node of a tree to a leaf
c) At each node, the successor child is chosen on the basis of
a splitting of the input space
d) The splitting is based on one of the features or on a
predefined set of splitting rules
View Answer
Answer: a
Explanation: Decision trees can be also used for other
prediction problems and not only for binary classification
problems. So it is a predictor that predicts the label
associated with an instance by traveling from a root node of
a tree to a leaf. The successor child is chosen on the basis of a
splitting of the input space and is based on one of the
features or on a predefined set of splitting rules.
2. Decision tree uses the inductive learning machine learning
approach.
a) True
b) False
View Answer
Answer: a
Explanation: Decision tree uses the inductive learning
machine learning approach. Inductive learning enables the
system to recognize patterns and regularities in previous
knowledge or training data and extract the general rules
from them. A decision tree is considered to be an inductive
learning task as it uses particular facts to make more
generalized conclusions.
3. Which of the following statements is not true about a
splitting rule at internal nodes of the tree based on
thresholding the value of a single feature?
a) It move to the right or left child of the node on the basis of
1[xi < ϑ], where i ∈ [d] is the index of the relevant feature
b) It move to the right or left child of the node on the basis of
1[xi < ϑ], where ϑ ∈ R is the threshold
c) Here a decision tree splits the instance space, X = Rd, into
cells, where each leaf of the tree corresponds to one cell
d) Splits based on thresholding the value of a single feature
are also known as multivariate splits
View Answer
Answer: d
Explanation: Splits based on thresholding the value of a
single feature are known as univariate splits. And here it
moves to the right or left child of the node on the basis of 1 [xi
< ϑ], where i ∈ [d] is the index of the relevant feature and ϑ ∈
R is the threshold. A decision tree splits the instance space, X
= Rd, into cells, where each leaf of the tree corresponds to
one cell.
4. Consider the figure. If person A starts driving at 8:30 AM
and there are no other vehicles on the road, and another
person B starts driving at 10 AM and there is an accident on
the road, what will be the commute time of A and B
respectively?

a) LONG, LONG
b) LONG, SHORT
c) SHORT, LONG
d) SHORT, SHORT
View Answer
Answer: c
Explanation: Given figure shows a decision tree. And person
A starts driving at 8:30 AM and there is no traffic. So he will
commute in SHORT time. At the same time person B starts
driving at 10 AM and there was an accident on the road. So
he will commute for a LONG time.
5. In a splitting rule at internal nodes of the tree based on
thresholding the value of a single feature, it follows that a
tree with k leaves can shatter a set of k instances.
a) False
b) True
View Answer
Answer: b
Explanation: Here the splitting rule at internal nodes of the
tree is based on thresholding the value of a single feature; it
follows that a tree with k leaves can shatter a set
of k instances. Hence, if we allow decision trees of arbitrary
size, we obtain a hypothesis class of infinite VC dimension
and this approach can easily lead to overfitting.
6. Minimum description length (MDL) principle is used to
avoid overfitting in decision trees.
a) True
b) False
View Answer
Answer: a
Explanation: MDL procedures automatically and inherently
protect against overfitting and can be used to estimate both
the parameters and the structure of a model. Hence MDL
principle is used to avoid overfitting in decision trees and
aim at learning a decision tree that on one hand fits the data
well while on the other hand is not too large.
7. Suppose in a decision tree, we are making some
simplifying assumptions that each instance is a vector of d
bits (X = {0, 1}^d). Which of the following statements is not
true about the above situation?
a) Thresholding the value of a single feature corresponds to
a splitting rule of the form 1[xi = 1] for some i ∈ [d]
b) The hypothesis class becomes finite, but is still very large
c) Any classifier from {0, 1}^d to {0, 1} can be represented by
a decision tree with 2^d leaves and depth of d + 1
d) Any classifier from {0, 1}^d to {0, 1} can be represented by
a decision tree with 2^(d+1) leaves and depth of d + 1
View Answer
Answer: d
Explanation: Given the simplifying assumptions, any
classifier from {0, 1}^d to {0, 1} can be represented by a
decision tree not with 2^(d+1) leaves but with 2^d leaves and
a depth of d + 1. And here the hypothesis class becomes
finite, but is still very large.
8. What does it mean by the VC dimension of a class is 2 d?
a) The number of examples need to PAC learn the
hypothesis class grows with 2d
b) The number of examples need to PAC learn the
hypothesis class grows with 2d+1
c) The number of examples need to PAC learn the
hypothesis class grows with 2d-1
d) The number of examples need to PAC learn the
hypothesis class grows with 2d+1
View Answer
Answer: a
Explanation: Suppose in a decision tree we are making some
simplifying assumptions that each instance is a vector of d
bits (X = {0, 1}d). Then the VC dimension of the class is 2 d,
which means that the number of examples we need to PAC
learn the hypothesis class grows with 2d. Unless d is very
small, this is a huge number of examples.
9. Consider the dataset given below where T and F represent
True and False respectively. What is the entropy H (Rain)?

Temperature Clou

Low T

Low T

Medium T

Medium T

High T
High F

a) 1
b) 0.5
c) 0.2
d) 0.6
View Answer
Answer: a
Explanation: We know entropy = – ∑_{i=1..n} P_i log2 P_i.
Entropy = – (3/6) * log2 (3/6) – (3/6) * log2 (3/6)
= – (1/2) * log2 (1/2) – (1/2) * log2 (1/2)
= – 0.5 * -1 – 0.5 * -1
= 0.5 + 0.5
= 1
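The same computation as a small sketch (the function name is illustrative):

import math

def entropy(probabilities):
    # H = -sum_i p_i * log2(p_i); zero-probability terms contribute nothing
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([3/6, 3/6]))   # 1.0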
10. What does the following figure represent?

a) Decision tree for OR


b) Decision tree for AND
c) Decision tree for XOR
d) Decision tree for XNOR
View Answer
Answer: b
Explanation: The given figure represents the decision tree
implementation of Boolean AND as per the following truth
table.

A B A AND B

F F F

F T F

T T T

T F F

So whenever A is false the decision tree will lead to false.


Otherwise it will lead to true or false according to the B.
1. Which of the following statements is not true about the
Decision tree?
a) A Decision tree is also known as a classification tree
b) Each element of the domain of the classification in
decision tree is called a class
c) It is a tree in which each internal node is labeled with an
input feature
d) It cannot be used in data mining applications as it only
classifies but not predicts anything
View Answer
Answer: d
Explanation: Decision trees can be widely used in data
mining applications because it is able to classify and predict
as well. It is also known as a classification tree. Each element
of the domain of the classification in the decision tree is
called a class and each internal node is labeled with an input
feature.
2. Practical decision tree learning algorithms are based on
heuristics.
a) True
b) False
View Answer
Answer: a
Explanation: Practical decision tree learning algorithms are
based on heuristics such as a greedy approach, where the
tree is constructed gradually, and locally optimal decisions
are made at the construction of each node. Such algorithms
cannot guarantee to return the globally optimal decision tree
but tend to work reasonably well in practice.
3. Which of the following statements is not true about the
Decision tree?
a) It starts with a tree with a single leaf and assign this leaf a
label according to a majority vote among all labels over the
training set
b) It performs a series of iterations and on each iteration, it
examine the effect of splitting a single leaf
c) It defines some gain measure that quantifies the
improvement due to the split
d) Among all possible splits, it either choose the one that
minimizes the gain and perform it, or choose not to split the
leaf at all
View Answer
Answer: d
Explanation: In decision trees among all the possible splits,
it chooses the one that maximizes the gain not the one that
minimizes it. Or it chooses not to split the leaf at all. All
other three are the correct statements about the Decision
tree.
4. Which of the following is not a Decision tree algorithm?
a) ID3
b) C4.5
c) DBSCAN
d) CART
View Answer
Answer: c
Explanation: DBSCAN is a clustering algorithm. ID3, C4.5,
and CART are the Decision tree algorithms. ID3 is known as
Iterative Dichotomiser 3 and C4.5 is the successor of ID3.
CART is the Classification and Regression Tree. There are
many other decision-tree algorithms also.
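For reference, a minimal sketch of fitting a CART-style decision tree with scikit-learn (assumed to be available); the four-row Boolean AND data set is purely illustrative:

from sklearn.tree import DecisionTreeClassifier

# Four training points implementing Boolean AND of two binary features
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 0, 1]
clf = DecisionTreeClassifier(criterion="entropy")   # split by information gain
clf.fit(X, y)
print(clf.predict([[1, 1], [0, 1]]))                # expected: [1 0]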
5. Which of the following statements is not true about the
ID3 algorithm?
a) It is used to generate a decision tree from a dataset
b) It begins with the original set S as the root node
c) On each iteration of the algorithm, it iterates through
every unused attribute of the set S and calculates the entropy
or the information gain of that attribute
d) Finally it selects the attribute which has the largest
entropy value
View Answer
Answer: d
Explanation: ID3 is an algorithm which is used to generate a
decision tree from a dataset and it begins with the original
set S as the root node. On each iteration of the algorithm, it
iterates through every unused attribute of the set S and
calculates the entropy or the information gain of that
attribute. And it then selects the attribute which has the
smallest entropy value.
6. Which of the following statements is not true about
Information Gain?
a) It is a gain measure that is used in the ID3 algorithms
b) It is the difference between the entropy of the label before
and after the split
c) It is based on the decrease in entropy after a data-set is
split on an attribute
d) Constructing a decision tree is all about finding attribute
that returns the lowest information gain
View Answer
Answer: d
Explanation: Information Gain is a measure that is used in
the ID3 algorithms. Constructing a decision tree is all about
finding the attribute that returns the highest information
gain not the lowest one. It is the difference between the
entropy of the label before and after the split.
7. Which of the following statements is not true about
Information Gain?
a) It is the addition in entropy by transforming a dataset
b) It is calculated by comparing the entropy of the dataset
before and after a transformation
c) It is often used in training decision trees
d) It is also known as Kullback-Leibler divergence
View Answer
Answer: a
Explanation: Information Gain is also known as Kullback-
Leibler divergence which is the reduction in entropy by
transforming a dataset. It is often used in training decision
trees and is calculated by comparing the entropy of the
dataset before and after a transformation.
8. Which of the following statements is not true about
Information Gain?
a) It is the amount of information gained about a random
variable or signal from observing another random variable
b) It tells us how important a given attribute of the feature
vectors is
c) It implies how much entropy we removed
d) Higher Information Gain implies less entropy removed
View Answer
Answer: d
Explanation: The higher Information Gain implies more
entropy removed not less because Information Gain implies
how much entropy we removed. It tells us how important a
given attribute of the feature vectors is. And it is the amount
of information gained about a random variable or signal
from observing another random variable.
9. Given the entropy for a split, Esplit = 0.39 and the entropy
before the split, Ebefore = 1. What is the Information Gain for
the split?
a) 1
b) 0.39
c) 0.61
d) 2.56
View Answer
Answer: c
Explanation: Information Gain is calculated for a split by
subtracting the weighted entropies of each branch from the
original entropy. We have Esplit = 0.39 and Ebefore = 1.
Then Information Gain, IG = Ebefore – Esplit
= 1 – 0.39
= 0.61
10. Which of the following statements is not an objective of
Information Gain?
a) It tries to determine which attribute in a given set of
training feature vectors is most useful for discriminating
between the classes to be learned
b) The Decision Tree algorithm will always try to minimize
Information Gain
c) It is used to decide the ordering of attributes in the nodes
of a decision tree
d) Information Gain of a certain event is the discrepancy between
the amount of information before someone observes that
event and the amount after observation
View Answer
Answer: b
Explanation: Decision Trees algorithm will always try to
maximize Information Gain. It tries to determine which
attribute in a given set of training feature vectors is most
useful for discriminating between the classes to be learned.
And it is used to decide the ordering of attributes in the
nodes of a decision tree.
11. Information Gain and Gini Index are the same.
a) True
b) False
View Answer
Answer: b
Explanation: Gini index measures the degree or probability
of a particular variable being wrongly classified when it is
randomly chosen. Unlike information gain, Gini Index is not
computationally intensive as it doesn’t involve the logarithm
function used to calculate entropy in information gain. So
Gini Index is preferred over Information gain.
12. Which of the following statements is not true about
Information Gain?
a) It is used to determine which feature/attribute gives us the
maximum information about a class
b) It is based on the concept of entropy, which is the degree
of impurity or disorder
c) It aims to reduce the level of entropy starting from the
root node to the leaf nodes
d) It often promotes the level of entropy starting from the
root node to the leaf nodes
View Answer
Answer: d
Explanation: Information Gain is based on the concept of
entropy and it never tries to promote but tries to reduce the
level of entropy starting from the root node to the leaf
nodes. All other options are true about Information Gain.
13. What is the entropy at P = 0.5 from the given figure?
[Figure: entropy E plotted as a function of the probability P]
a) 0.5
b) -0.5
c) 1
d) -1
View Answer
Answer: c
Explanation: We know the entropy E = –p log₂(p) – q log₂(q).
Here p = 0.5 and q = 1 – p = 1 – 0.5 = 0.5. So we have p = 0.5
and q = 0.5.
Entropy = (–0.5 * log₂(0.5)) – (0.5 * log₂(0.5))
= (–0.5 * –1) – (0.5 * –1)
= 0.5 + 0.5
= 1
14. Given entropy of parent = 1, weight averages = (3/4, 1/4)
and entropy of children = (0.9, 0). What is the information
gain?
a) 0.675
b) 0.75
c) 0.325
d) 0.1
View Answer
Answer: c
Explanation: We know Information Gain = Entropy(Parent)
– ∑(weight average * entropy(Child)).
Information Gain = 1 – (3/4 * 0.9 + 1/4 * 0)
= 1 – (0.675 + 0)
= 1 – 0.675
= 0.325
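The entropy and information gain arithmetic used in the last two questions can be checked with a short Python sketch. This is only an illustrative helper (the function names are made up; the numbers are the ones from the worked answers above):

import math

def entropy(p):
    # Binary entropy of a split with class probability p (base-2 logarithms)
    if p in (0, 1):
        return 0.0
    q = 1 - p
    return -p * math.log2(p) - q * math.log2(q)

def information_gain(parent_entropy, weights, child_entropies):
    # IG = entropy(parent) - sum(weight_i * entropy(child_i))
    return parent_entropy - sum(w * e for w, e in zip(weights, child_entropies))

print(entropy(0.5))                                  # 1.0, as in question 13
print(information_gain(1, [3/4, 1/4], [0.9, 0]))     # 0.325, as in question 14
print(information_gain(1, [1.0], [0.39]))            # 0.61, as in question 9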
1. In the ID3 algorithm the returned tree will usually be very
large.
a) True
b) False
View Answer
Answer: a
Explanation: In the ID3 algorithm the returned tree will
usually be very large. Such trees may have low empirical
risk, but their true risk will tend to be high. One solution is
to limit the number of iterations of ID3, leading to a tree
with a bounded number of nodes.
2. Pruning a tree reduces it to a much smaller tree.
a) True
b) False
View Answer
Answer: a
Explanation: Pruning a tree will reduce it to a much smaller
tree, but still with a similar empirical error. So pruning is
the process of adjusting the decision tree to minimize the
misclassification error.
3. Pruning can only be performed by a bottom up walk on
the decision tree.
a) True
b) False
View Answer
Answer: b
Explanation: Pruning can occur in a top down or bottom up
fashion. Usually, the pruning is performed by a bottom-up
walk on the tree. Each node might be replaced with one of its
subtrees or with a leaf. But there are situations where the
top down pruning is also used.
4. Which of the following statements is not true about
Pruning?
a) It removes the sections of the tree that provide little power
to classify instances
b) It is a technique in machine learning and search
algorithms to reduce the size of the decision trees
c) It increases the complexity of the final classifier
d) It improves the predictive accuracy by the reduction of
overfitting
View Answer
Answer: c
Explanation: Pruning reduces the complexity of the final
classifier and improves the predictive accuracy by the
reduction of overfitting. It is a technique in machine learning
and search algorithms to reduce the size of the decision
trees. And it removes the sections of the tree that provide
little power to classify instances.
5. Which of the following statements is not true about
Pruning?
a) It reduces the size of learning tree without reducing
predictive accuracy
b) It will not optimise the performance of the tree
c) Top down pruning will traverse nodes and trim subtrees
starting at the root
d) Bottom up pruning will traverse nodes and trim subtrees
starting at the leaf nodes
View Answer
Answer: b
Explanation: Pruning will optimise the performance of the
tree and it reduces the size of the learning tree without
reducing predictive accuracy. Bottom up pruning will
traverse nodes and trim subtrees starting at the leaf nodes
and Top down pruning starting at the root.
6. Which of the following is not a Pruning technique?
a) Cost based pruning
b) Cost complexity pruning
c) Minimum error pruning
d) Maximum error pruning
View Answer
Answer: d
Explanation: Maximum error pruning is not a pruning
technique. Cost based pruning, Cost complexity pruning,
and Minimum error pruning are the three popular pruning
techniques in Decision trees.
7. Which of the following statements is not true about the
pruning in the decision tree?
a) When the decision tree is created, many of the branches
will reflect anomalies in the training data due to noise
b) The overfitting happens when the learning algorithm
continues to develop hypotheses that reduce training set
error at the cost of increased test set error
c) It optimises the computational efficiency
d) It reduces the classification accuracy
View Answer
Answer: d
Explanation: Pruning in decision trees improves the
classification accuracy and optimises computational
efficiency. When the decision tree is created, many of the
branches will reflect anomalies in the training data due to
noise. And over-fitting happens when the learning algorithm
continues to develop hypotheses that reduce training set
error at the cost of an increased test set error.
8. Post pruning is also known as backward pruning.
a) True
b) False
View Answer
Answer: a
Explanation: Post-pruning is also known as backward
pruning. Here the decision tree is generated first and then the
non-significant branches are removed. It allows the tree to perfectly
classify the training set, and then post prunes the tree.
9. Which of the following statements is not true about Post
pruning?
a) It begins by generating the (complete) tree and then
adjust it with the aim of improving the classification
accuracy on unseen instances
b) It begins by converting the tree to an equivalent set of
rules
c) It would not overfit trees
d) It converts a complete tree to a smaller pruned one which
predicts the classification of unseen instances at least as
accurately
View Answer
Answer: c
Explanation: Post-pruning handles overfit trees in a more successful
way because it is not easy to precisely estimate when to stop
growing the tree. It begins by generating the (complete) tree
and then adjusting it with the aim of improving the
classification accuracy on unseen instances. The other two
statements are the two principal methods of doing this.
10. Which of the following statements is not true about
Reduced error pruning?
a) It is the simplest and most understandable method in
decision tree pruning
b) It considers each of the decision nodes in the tree to be
candidates for pruning, consist of removing the subtree
rooted at that node, making it a leaf node
c) If the error rate of the new tree would be equal to or
smaller than that of the original tree and that subtree
contains no subtree with the same property, then subtree is
replaced by leaf node
d) If the error rate of the new tree would be greater than
that of the original tree and that subtree contains no subtree
with the same property, then subtree is replaced by leaf
node, means pruning is done
View Answer
Answer: d
Explanation: If the error rate of the new tree would be
greater than that of the original tree and that subtree
contains no subtree with the same property, then subtree is
replaced by leaf node, meaning no pruning is done. All other
three statements are true about Reduced error pruning.
1. Which of the following statements is not an advantage of
Reduced error pruning?
a) Linear computational complexity
b) Over pruning
c) Simplicity
d) Speed
View Answer
Answer: b
Explanation: Over pruning is a disadvantage of Reduced
error pruning. When the test set is much smaller than the
training set, it may lead to over pruning. And the advantage
of this method is its linear computational complexity,
simplicity and speed.
2. Minimum error pruning is a Top down approach.
a) True
b) False
View Answer
Answer: b
Explanation: Minimum error pruning is not a Top down
approach. It is a bottom – up approach which seeks a single
tree that minimizes the expected error rate on an
independent data set. The tree is pruned back to the point
where the cross – validated error is a minimum.
3. Which of the following statements is not a step in
Minimum error pruning?
a) At each non leaf node in the tree, calculate expected error
rate if that subtree is pruned
b) Calculate the expected error rate for that node if subtree
is not pruned
c) If pruning the node leads to greater expected error rate,
then keep the subtree
d) If pruning the node leads to smaller expected error rate,
then don’t prune it
View Answer
Answer: d
Explanation: In Minimum error pruning if pruning the node
leads to smaller expected error rate, then prune it. Here at
each non leaf node in the tree, calculate expected error rate
if that subtree is pruned otherwise calculate the expected
error rate for that node if subtree is not pruned. And if
pruning the node leads to greater expected error rate, then
keep the subtree (no pruning).
4. Pre pruning is also known as online pruning.
a) True
b) False
View Answer
Answer: a
Explanation: Pre – pruning is also known as forward
pruning or online – pruning. Pre – pruning prevents the
generation of non – significant branches. It prevents
overfitting by trying to stop the tree – building process early,
before it produces leaves with very small samples.
5. Which of the following statements is not a step in Pre
pruning?
a) Pre – pruning a decision tree involves using a termination
condition to decide when to terminate some of the branches
prematurely as the tree is generated
b) When constructing the tree, some significant measures
can be used to assess the goodness of a split
c) High threshold results in oversimplified trees
d) If partitioning the tuples at a node would result in the split
that falls below a pre specified threshold, then further
partitioning of the given subset is expanded
View Answer
Answer: d
Explanation: If partitioning the tuples at a node would result
in the split that falls below a pre specified threshold, then
further partitioning of the given subset is halted otherwise it
is expanded. That is, a high threshold results in oversimplified
trees, and a low threshold results in very little simplification.
6. Minimum number of objects pruning is a Post pruning
technique.
a) True
b) False
View Answer
Answer: b
Explanation: Minimum number of objects pruning is not a
post pruning technique but it is a pre pruning technique. In
this method of pruning, the minimum number of objects is
specified as a threshold value. And there is one parameter
minobj which is set to specify threshold value.
7. Which of the following statements is not true about
Minimum number of objects pruning?
a) Whenever the split is made which yields a child leaf that
represents less than minobj from the data set, the parent
node and children node are compressed to a single node
b) Increasing no of objects increases accuracy of the dataset
c) Increasing no of objects simplifies the tree
d) The different ranges of the minimum no of objects are set
for few examples and tested for accuracy
View Answer
Answer: b
Explanation: In Minimum number of object pruning
increasing no of objects reduces accuracy of the dataset, but
it simplifies the tree. Whenever the split is made which yields
a child leaf that represents less than minobj from the data
set, the parent node and children node are compressed to a
single node.
8. Which of the following is not a Post pruning technique?
a) Reduced error pruning
b) Error complexity pruning
c) Minimum error pruning
d) Chi – square pruning
View Answer
Answer: d
Explanation: Chi – square pruning is not a post pruning
technique but it is a pre pruning technique. It converts
decision trees to a set of rules and eliminates variable values
in rules which are independent of label using chi – square
test for independence. And simplify rule set by eliminating
unnecessary rules.
9. Which of the following is not a Post pruning technique?
a) Pessimistic error pruning
b) Iterative growing and pruning
c) Reduced error pruning
d) Early stopping pruning
View Answer
Answer: d
Explanation: Early stopping pruning is also known as Pre
pruning. So it is not a post pruning technique. To prevent
overfitting it tries to stop the tree – building process early,
before it produces leaves with very small samples. This
heuristic is also known as Pre – pruning decision trees.
10. Consider we have a set of data with 3 classes, and we
have observed 20 examples of which the greatest number 15
is in class c. If we predict that all future examples will be in
class c, what is the expected error rate using minimum error
pruning?
a) 0.304
b) 0.5y
c) 0.402
d) 0.561
View Answer
Answer: a
Explanation: The expected error rate Ek = (n – nc + k – 1) / (n + k).
Given n = 20, nc = 15 and k = 3.
Expected error rate Ek = (20 – 15 + 3 – 1) / (20 + 3)
= 7/23
= 0.304
11. Consider we have a set of data with 3 classes, and we
have observed 20 examples of which the greatest number 15
is in class c. If we predict that all future examples will be in
class c, what is the expected error rate without pruning?
a) 0.22
b) 0.17
c) 0.15
d) 0.05
View Answer
Answer: a
Explanation: Given n = 20, nc = 15 and k = 3. Then without
pruning the Expected error rate Ek will be:
Expected error rate Ek = ((n – k)/n) * ((n – nc – 1)/n) + (k/n) * ((k – 1)/(2k))
= ((20 – 3)/20) * ((20 – 15 – 1)/20) + (3/20) * ((3 – 1)/(2 * 3))
= (17/20) * (4/20) + (3/20) * (2/6)
= 68/400 + 6/120
= 0.17 + 0.05
= 0.22
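The two expected error rate formulas above are easy to reproduce. A minimal Python sketch of the arithmetic, assuming the same notation (n examples at the node, nc of them in the majority class, k classes); the helper names are illustrative:

def expected_error_pruned(n, nc, k):
    # Expected error rate if the node is pruned to a leaf predicting the majority class
    return (n - nc + k - 1) / (n + k)

def expected_error_unpruned(n, nc, k):
    # Expected error rate without pruning, as used in question 11
    return ((n - k) / n) * ((n - nc - 1) / n) + (k / n) * ((k - 1) / (2 * k))

print(round(expected_error_pruned(20, 15, 3), 3))    # 0.304
print(round(expected_error_unpruned(20, 15, 3), 2))  # 0.22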
12. Consider the example, number of corrected mis –
classifications at a particular node, n'(t) = 15.5, and number
of corrected mis – classifications for sub – tree, n'(Tt) = 12.
N(t) is the number of training set examples at node t and it is
equal to 35. Here the tree should be pruned.
a) True
b) False
View Answer
Answer: b
Explanation: We know the standard error SE
= √(n'(Tt) * (N(t) – n'(Tt)) / N(t))
= √(12 * (35 – 12) / 35)
= √(12 * 23 / 35)
= √7.89
= 2.8
Since 12 + 2.8 = 14.8, which is less than 15.5, the sub – tree
should be kept and not pruned.
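The standard error check in this question can also be scripted. A minimal sketch, assuming the keep-or-prune rule described in the explanation (the helper name and signature are made up):

import math

def keep_subtree(n_node, n_subtree, n_examples):
    # Keep the subtree if its corrected misclassifications plus one standard error
    # are still smaller than the corrected misclassifications at the node itself
    se = math.sqrt(n_subtree * (n_examples - n_subtree) / n_examples)
    return n_subtree + se < n_node

print(keep_subtree(15.5, 12, 35))   # True: 12 + 2.8 = 14.8 < 15.5, so no pruning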
1. Which of the following statements is not true about
Decision trees?
a) It builds classification models in the form of a tree
structure
b) It builds regression models in the form of a tree structure
c) The final result is a tree with decision nodes and leaf
nodes
d) It never breaks down a dataset into smaller subsets with
increase in depth of tree
View Answer
Answer: d
Explanation: The decision tree breaks down a dataset into
smaller subsets with increase in depth of tree. And it builds
both classification and regression models in the form of a tree
structure.
2. Splitting is the process of dividing a node into two or more
sub-nodes.
a) True
b) False
View Answer
Answer: a
Explanation: Splitting in decision tree is the process of
dividing a node into two or more sub-nodes. When a sub-
node splits into further sub-nodes, then it is called decision
node. And the nodes that do not split are called leaf or
terminal nodes.
3. Real valued features problems in decision trees cannot be
solved using ID3 algorithm.
a) True
b) False
View Answer
Answer: b
Explanation: Real valued features problem in decision tree
cannot be solved directly. But it can be solved by converting
it into a binary feature value problem using threshold based
splitting rules. Then we can solve this problem using ID3.
4. Which of the following statements is not true about
reducing a real valued feature problem into binary feature?
a) It utilizes the threshold based splitting rules
b) Once the binary features are constructed the ID3
algorithm can be easily applied
c) After ID3 is applied it is easy to verify that there exists a
decision tree with a different training error
d) After ID3 is applied it is easy to verify that there exists a
decision tree with the same number of nodes
View Answer
Answer: c
Explanation: Once the real valued features are reduced to
binary features then we can apply ID3. And it is easy to
verify that for any decision tree with threshold based
splitting rules over the original real valued features that
there exists a decision tree over the constructed binary
features with the same training error and the same number
of nodes.
5. If the original number of real valued features is d and the
number of examples is m, then which of the following
statements is not true?
a) The number of constructed binary features becomes dm
b) Calculating the Gain of each feature might
take O(dm²) operations
c) With a more improved implementation the run time can be
reduced to O(dm·log(m))
d) The constructed binary features are dm
View Answer
Answer: d
Explanation: If the original number of real valued features
is d and the number of examples is m, then the number of
constructed binary features becomes dm not dm. And here
calculating the Gain of each feature might
take O(dm²) operations. But with a more improved
implementation the run time can be reduced
to O(dm·log(m)).
1. Inductive bias is also known as learning bias.
a) True
b) False
View Answer
Answer: a
Explanation: Inductive bias is also known as learning bias
and it is related with the learning algorithms. It is a set of
assumptions that the learner uses to predict outputs for the
given inputs that has not been encountered.
2. Which of the following statements is not true about the
Inductive bias in the decision tree?
a) It is harder to define because of heuristic search
b) Trees that place high information gain attributes close to
the root are preferred
c) Trees that place high information gain attributes far away
from the root are preferred
d) Shorter trees are preferred over longer ones
View Answer
Answer: c
Explanation: Here the trees that place high information gain
attributes close to the root are preferred over those that do
not. And it is harder to define because of heuristic search. It
prefers the shorter trees over longer trees.
3. According to Occam’s Razor, which of the following
statements is not favorable to short hypotheses?
a) It is good to use fewer short hypotheses than long
hypotheses
b) A short hypothesis that fits the data is unlikely to be a
coincidence
c) A long hypothesis that fits the data might be a coincidence
d) There are many ways to define small set of hypotheses
View Answer
Answer: d
Explanation: Occam’s Razor is the problem solving
principle that prefers the simplest hypotheses that fits the
data. And the argument opposed is that: there are many
ways to define a small set of hypotheses. All other three
statements are in favor of the short hypotheses.
4. Which of the following statements are not true about
Inductive bias in ID3?
a) It is the set of assumptions that along with the training
data justify the classifications assigned by the learner to
future instances
b) ID3 has preference of short trees with high information
gain attributes near the root
c) ID3 has preference for certain hypotheses over others,
with no hard restriction on the hypotheses space
d) ID3 has preference of long trees with high information
gain attributes far away from the root
View Answer
Answer: d
Explanation: ID3 prefers the short trees and not the long
trees. It has preference of short trees with high information
gain attributes near the root and for certain hypotheses over
others, with no hard restriction on the hypotheses space.
And it is the set of assumptions that along with the training
data justify the classifications assigned by the learner to
future instances.
5. Which of the following statements is not true about ID3?
a) ID3 searches incompletely through the hypotheses space,
from simple to complex hypotheses, until its termination
condition is met
b) Its inductive bias is solely a consequence of the ordering
of hypotheses by its search strategy
c) Its hypothesis space introduces additional bias in its each
iteration
d) Its hypothesis space introduces no additional bias
View Answer
Answer: c
Explanation: In ID3 its hypothesis space introduces no
additional bias and it is solely a consequence of the ordering
of hypotheses by its search strategy. And ID3 searches
incompletely through this space, from simple to complex
hypotheses, until its termination condition is met.
6. Which of the following statements is not true about Candidate
elimination?
a) Candidate elimination searches the hypotheses
completely, and finding every hypothesis consistent with the
training data
b) Its inductive bias is solely a consequence of the ordering
of hypotheses by its search strategy
c) Its inductive bias is solely a consequence of the expressive
power of its hypothesis representation
d) Its search strategy introduces no additional bias
View Answer
Answer: b
Explanation: Its inductive bias is not solely a consequence of
the ordering of hypotheses by its search strategy but is solely
a consequence of the expressive power of its hypothesis
representation. All other statements are true about inductive
bias in Candidate elimination.
7. Preference bias is more desirable than a restriction bias.
a) True
b) False
View Answer
Answer: a
Explanation: Preference bias is more desirable than a
restriction bias because it allows the learner to work within a
complete hypothesis space that is assured to contain the
unknown target function. So Preference bias is more
desirable than a restriction bias (language bias).
8. Preference bias is also known as search bias.
a) True
b) False
View Answer
Answer: a
Explanation: Preference bias is also known as search bias. It
is used when a learning algorithm incompletely searches a
complete hypothesis space. It chooses which part of the
hypothesis space to search. A decision tree is an example.
1. Categorical Variable Decision tree has a categorical target
variable.
a) True
b) False
View Answer
Answer: a
Explanation: Decision tree is an algorithm having a
predefined target variable that is mostly used in
classification problems. If the target variable is a categorical
target variable then such type of classification tree is known
as Categorical Variable Decision tree.
2. Which of the following statements is not true about the
Classification tree?
a) It is used when the dependent variable is categorical
b) It divides the predictor space into distinct and non
overlapping regions
c) It divides the independent variables into distinct and non
overlapping regions
d) It is used when the dependent variable is continuous
View Answer
Answer: d
Explanation: Classification trees are used when the
dependent variable is categorical not continuous. And it
divides the predictor space (independent variables) into
distinct and non overlapping regions.
3. In Classification trees the value obtained by terminal node
in the training data is the mode of observations falling in
that region.
a) True
b) False
View Answer
Answer: a
Explanation: In Classification trees the value obtained by
terminal node in the training data is the mode of
observations falling in that region. And this value obtained
by terminal node is known as the class. So if an unseen data
observation falls in that region, it will make its prediction
with mode value.
4. Classification trees follow a top-down greedy approach.
a) True
b) False
View Answer
Answer: a
Explanation: Classification tree follows a top-down greedy
approach known as recursive binary splitting. It begins from
the top of tree when all the observations are available in a
single region and successively splits the predictor space into
two new branches down the tree. It is known as greedy
because the algorithm cares only about the current split, and
not about future splits which will lead to a better tree.
5. Which of the following statements is not true about
Classification trees?
a) It labels, records, and assigns variables to discrete classes
b) It can also provide a measure of confidence that the
classification is correct
c) It is built through a process known as binary recursive
partitioning
d) It will always look for the best variable available in the
future splits for a better tree
View Answer
Answer: d
Explanation: In classification trees it will always look for the
best variable available in the current split and not in the
future splits for a better tree and it is built through a process
known as binary recursive partitioning. It labels, records,
and assigns variables to discrete classes and it can also
provide a measure of confidence that the classification is
correct.
6. Which of the following statements are not true about the
Classification trees?
a) The target variable can take a discrete set of values
b) The leaves represent class labels
c) The branches represent conjunctions of features
d) The target variable can take real numbers
View Answer
Answer: d
Explanation: In classification trees, the target variable
cannot take real numbers but can take only a discrete set of
values. Here the leaves represent class labels and the
branches represent conjunctions of features that will lead to
those class labels.
7. Which of the following statements is not true about
CART?
a) It is used for generating regression tree
b) It is used for generating classification tree
c) It is used for binary classification
d) It always uses Gini index as cost function to evaluate split
in feature selection
View Answer
Answer: d
Explanation: It uses Gini index as a cost function to evaluate
split in feature selection in case of classification tree and it
uses least square as a metric to select features in case of
Regression tree. So it is used for generating both
classification and regression trees. And it is used for binary
classification also.
8. From the below table where the target is to predict play or
not (Yes or No) based on weather condition, what is the Gini
index for Climate = Sunny?

Day  Climate  Temperature
1    Sunny    Cool
2    Sunny    Hot
3    Rainy    Medium
4    Winter   Cool
5    Rainy    Cool
6    Winter   Cool
7    Sunny    Hot

a) 0.45
b) 0.49
c) 0.47
d) 0.43
View Answer
Answer: a
Explanation: From the given table we have:

Climate  Yes  No
Sunny    1    2

Gini index = 1 – ∑ (pi)²
= 1 – ((1/3)² + (2/3)²)
= 1 – (0.11 + 0.44)
= 1 – 0.55
= 0.45
9. From the below table where the target is to predict play or
not (Yes or No) based on weather condition, the Gini index
for Climate = Rainy and Climate = Winter are the same.

Day  Climate  Temperature
1    Sunny    Cool
2    Sunny    Hot
3    Rainy    Medium
4    Winter   Cool
5    Rainy    Cool
6    Winter   Cool
7    Sunny    Hot
a) True
b) False
View Answer
Answer: a
Explanation: From the given table we have:

Climate  Yes  No
Rainy    1    1
Winter   1    1

Gini index = 1 – ∑ (pi)². Here the entries of Rainy and
Winter are the same, so the Gini index is also the same. And it
is:
Gini index = 1 – ((1/2)² + (1/2)²)
= 1 – (0.25 + 0.25)
= 1 – 0.5
= 0.5
10. From the below table where the target is to predict play
or not (Yes or No) based on weather condition, what is the
Gini index for the Temperature feature?

Day  Climate  Temperature
1    Sunny    Medium
2    Sunny    Hot
3    Rainy    Medium
4    Winter   Cool
5    Rainy    Cool
6    Winter   Cool
7    Sunny    Hot

a) 0.43
b) 0.45
c) 0.48
d) 0.5
View Answer
Answer: c
Explanation: From the table we have:

Temperature  Yes  No
Hot          1    1
Cool         1    2
Medium       1    1

We know the Gini index for the Temperature feature is the
weighted sum of the Gini indices of the Temperature values.
Gini index = 1 – ∑ (pi)²
Gini index (Temperature = Hot) = 1 – ((1/2)² + (1/2)²)
= 1 – (0.25 + 0.25)
= 0.5
Gini index (Temperature = Cool) = 1 – ((1/3)² + (2/3)²)
= 1 – (0.11 + 0.44)
= 0.45
Gini index (Temperature = Medium) = 1 – ((1/2)² + (1/2)²)
= 1 – (0.25 + 0.25)
= 0.5
Gini index (Temperature) = (2/7) * 0.5 + (3/7) * 0.45 + (2/7) * 0.5
= 0.29 * 0.5 + 0.43 * 0.45 + 0.29 * 0.5
= 0.145 + 0.194 + 0.145
= 0.48
11. From the below table where the target is to predict play
or not (Yes or No) based on weather conditions, what is the
Gini index for the Wind feature?

Day  Climate  Temperature
1    Sunny    Medium
2    Sunny    Hot
3    Rainy    Medium
4    Winter   Cool
5    Rainy    Cool
6    Winter   Cool
7    Sunny    Hot

a) 0.41
b) 0.43
c) 0.45
d) 0.47
View Answer
Answer: a
Explanation: We know the Gini index for the Wind feature
is the weighted sum of the Gini indices of the Wind values. From
the table we have:

Wind    Yes  No
Strong  1    3
Weak    2    1

Gini index = 1 – ∑ (pi)²
Gini index (Wind = Strong) = 1 – ((1/4)² + (3/4)²)
= 1 – (0.0625 + 0.5625)
= 1 – 0.625
= 0.38
Gini index (Wind = Weak) = 1 – ((2/3)² + (1/3)²)
= 1 – (0.44 + 0.11)
= 1 – 0.55
= 0.45
Gini index (Wind) = (4/7) * 0.38 + (3/7) * 0.45
= 0.57 * 0.38 + 0.43 * 0.45
= 0.217 + 0.194
= 0.41
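All of the Gini calculations in the last few questions follow the same two steps: compute 1 – ∑ (pi)² from the class counts of each attribute value, then take the size-weighted average over the values. A short illustrative Python sketch (the class counts are the ones from the worked answers; small differences from the quoted results come from intermediate rounding in those answers):

def gini(counts):
    # Gini impurity 1 - sum(p_i^2) for a list of class counts
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def weighted_gini(groups):
    # Weighted Gini of a feature; groups is one class-count list per attribute value
    n = sum(sum(g) for g in groups)
    return sum(sum(g) / n * gini(g) for g in groups)

print(round(gini([1, 2]), 2))                             # 0.44 (question 8 gets 0.45 after rounding)
print(round(weighted_gini([[1, 1], [1, 2], [1, 1]]), 2))  # 0.48 for Temperature, as in question 10
print(round(weighted_gini([[1, 3], [2, 1]]), 2))          # about 0.40; question 11 gets 0.41 after rounding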
1. Continuous Variable Decision tree has a categorical target
variable.
a) False
b) True
View Answer
Answer: a
Explanation: A Continuous Variable Decision tree doesn’t have
a categorical target variable but has a continuous target
variable. It is mainly used to predict the values for
continuous variables (when dependent variable is
continuous).
2. Which of the following statements is not true about the
Regression trees?
a) The general regression tree building methodology allows
input variables to be a mixture of continuous and categorical
variables
b) The terminal nodes of the tree contain the predicted
output variable values
c) Regression tree is a variant of decision trees, designed to
approximate real-valued functions
d) The root node holds the final prediction value
View Answer
Answer: d
Explanation: The root node doesn’t hold the final prediction
value, but the terminal nodes of the tree contain the
predicted output variable values. In the general regression
tree building methodology it allows input variables to be a
mixture of continuous and categorical variables. And it is a
variant of decision trees, designed to approximate real-
valued functions.
3. A Regression tree is built through a process known as
binary recursive partitioning.
a) True
b) False
View Answer
Answer: a
Explanation: During an iterative process a regression tree is
built by breaking the data into partitions or branches. This
process is called as binary recursive partitioning. During the
iteration each branch is broken into smaller groups in each
branch.
4. Which of the following statements is not true about the
Regression trees?
a) The algorithm allocates the data into the first two
partitions or branches, using every possible binary split on
every field
b) Initially, all records in the training set (pre-classified
records that are used to determine the structure of the tree)
are grouped into the same partition
c) The algorithm selects the split that minimizes the sum of
the squared deviations from the mean in the two separate
partitions
d) Algorithm starts from the leftmost leave nodes
View Answer
Answer: d
Explanation: The Regression tree algorithm doesn’t start
from the leave nodes. Initially, all records in the training set
are grouped into the same partition and it allocates the data
into the first two partitions or branches, using every possible
binary split on every field. Then it selects the split that
minimizes the sum of the squared deviations from the mean
in the two separate partitions and so on.
5. Which of the following statements is not true about the
Regression trees?
a) It has the advantage of being concise
b) It is able to make few assumptions beyond normality of
the response
c) It is not fast to compute
d) It works equally well with numerical or categorical
predictors
View Answer
Answer: c
Explanation: Regression trees (RT) are fast to compute. And
it is one of the main advantages of RT. All other three
statements are the advantages of RT making it to perform
well with numerical or categorical predictors. And no
linearity or smoothness is assumed in RT.
6. Which of the following statements is not true about the
Regression trees?
a) It needs more data than other regression techniques
b) It is especially sensitive to the particular data used to
build the tree
c) It gives crude predictions when it is sensitive to the
particular data
d) It gives processed predictions when it is sensitive to the
particular data
View Answer
Answer: d
Explanation: Regression trees won’t give processed
predictions when it is sensitive to the particular data. One of
the main disadvantages is that it needs more data than other
regression techniques being especially sensitive to the
particular data used to build the tree. And it gives crude
predictions.
7. Regression trees follow a top down greedy approach.
a) True
b) False
View Answer
Answer: a
Explanation: Regression trees follow a top down greedy
approach. It begins from the top of tree when all the
observations are available in a single region and successively
splits the predictor space into two new branches down the
tree (Top down approach). It looks for the best variable in
the current split (Greedy approach).
8. Which of the following is expressed by the given equation
Y = β0 + β1X + Ɛ which shows a real-valued dependent
variable Y is modeled as a function of a real-valued
independent variable X plus noise?
a) Binary classification
b) Linear Regression
c) Multiple Regression
d) Multi classification
View Answer
Answer: b
Explanation: The given equation shows the linear
regression. In simple linear regression a real-valued
dependent variable Y is modeled as a linear function of a
real-valued independent variable X plus noise. Here Ɛ is the
noise.
9. Which of the following is expressed by the given equation
Y = β0 + βᵀX + Ɛ which shows a real-valued dependent
variable Y is modeled as a function of multiple independent
variables X1, X2, …, Xp ≡ X plus noise?
a) Binary classification
b) Linear Regression
c) Multiple Regression
d) Multi classification
View Answer
Answer: c
Explanation: The given equation shows the multiple
regression. Let multiple independent variables X1, X2, …,
Xp ≡ X. And a real-valued dependent variable Y is modeled
as a function of multiple independent variables plus noise
where the noise is Ɛ.
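Both of these models are usually fitted by ordinary least squares. For the simple case Y = β0 + β1X + Ɛ, here is a minimal illustrative sketch using NumPy's polyfit; the data values are made up purely for demonstration:

import numpy as np

# Made-up sample data: y is roughly 2 + 3x plus a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 5.2, 7.9, 11.1, 13.8])

# polyfit with degree 1 returns the slope (beta1) first, then the intercept (beta0)
beta1, beta0 = np.polyfit(x, y, 1)
print(beta0, beta1)   # estimated intercept and slope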
10. Linear regression is a global model.
a) True
b) False
View Answer
Answer: a
Explanation: Linear regression is a global model, where
there is a single predictive formula holding over the entire
data-space. When the data has lots of features which interact
in complicated and nonlinear ways, assembling a single
global model can be very difficult. And it is confusing when
you do succeed.
11. Which of the following statements is not true about the
Regression trees?
a) It divides the predictor space into distinct and non-
overlapping regions
b) It divides the independent variables into distinct and non-
overlapping regions
c) It always looks for the best variable in the future split
d) It cares about only the current split
View Answer
Answer: d
Explanation: The regression tree always looks for the best
variable in the current split and not in the future split. And
it divides the predictor space (independent variables) into
distinct and non-overlapping regions like the classification
trees.
12. The value obtained by terminal nodes in the training
data is the mean response of observation falling in that
region.
a) True
b) False
View Answer
Answer: a
Explanation: In the case of a regression tree, the value
obtained by terminal nodes in the training data is the mean
response of observation falling in that region. Thus, if an
unseen data observation falls in that region, it will make its
prediction with mean value.
13. Which of the following statements is not true about the
Regression trees?
a) User can visualize each step which helps with making
decisions
b) Making decision based on regression is much easier than
other methods
c) It is not easy to prepare a regression tree
d) User can give the priority to a decision criterion
View Answer
Answer: c
Explanation: It is easy to prepare a regression tree
compared to the other methods. Because a user can present
the regression tree in a much easier way as it can be
represented on a simple chart or diagram. All other three
statements are the advantages of regression trees.
14. Given the table which shows the number of players who
play a particular game on various days according to the
weather conditions. What is the standard deviation of
players for Sunny Climate candidates?

Day  Climate  Temperature  Wind    Players
1    Sunny    Cool         Strong  15
2    Sunny    Hot          Weak    10
3    Rainy    Medium       Weak    20
4    Winter   Cool         Weak    30
5    Rainy    Cool         Strong  15
6    Winter   Cool         Strong  25
7    Sunny    Hot          Strong  5

a) 10
b) 15.71
c) 4.08
d) 7.07
View Answer
Answer: c
Explanation: From the table we have,

Day  Climate  Temperature  Wind    Players
1    Sunny    Cool         Strong  15
2    Sunny    Hot          Weak    10
7    Sunny    Hot          Strong  5

Players for Sunny climate = (15, 10, 5)
Average = (15 + 10 + 5) / 3
= 10
Standard deviation for Sunny climate = √(((15 – 10)² + (10 – 10)² + (5 – 10)²)/3)
= √((5² + 0² + (–5)²)/3)
= √((25 + 0 + 25)/3)
= √(50/3)
= √16.67
= 4.08
15. Given the table which shows the number of players who
play a particular game on various days according to the
weather conditions. What is the weighted standard deviation
of players for all the Wind candidates?

Day  Climate  Temperature  Wind    Players
1    Sunny    Cool         Strong  15
2    Sunny    Hot          Weak    10
3    Rainy    Medium       Weak    20
4    Winter   Cool         Weak    30
5    Rainy    Cool         Strong  15
6    Winter   Cool         Strong  25
7    Sunny    Hot          Strong  5

a) 7.54
b) 15.71
c) 8.17
d) 7.07
View Answer
Answer: a
Explanation: From the table for Strong Wind we have,

Day  Climate  Temperature  Wind    Players
1    Sunny    Cool         Strong  15
5    Rainy    Cool         Strong  15
6    Winter   Cool         Strong  25
7    Sunny    Hot          Strong  5

Players for Strong Wind = (15, 15, 25, 5)
Average = (15 + 15 + 25 + 5) / 4
= 15
Standard deviation for Strong Wind = √(((15 – 15)² + (15 – 15)² + (25 – 15)² + (5 – 15)²)/4)
= √((0² + 0² + 10² + (–10)²)/4)
= √((0 + 0 + 100 + 100)/4)
= √(200/4)
= √50
= 7.07
And we have,

Day  Climate  Temperature  Wind  Players
2    Sunny    Hot          Weak  10
3    Rainy    Medium       Weak  20
4    Winter   Cool         Weak  30

Players for Weak Wind = (10, 20, 30)
Average = (10 + 20 + 30) / 3
= 20
Standard deviation for Weak Wind = √(((10 – 20)² + (20 – 20)² + (30 – 20)²)/3)
= √(((–10)² + 0² + 10²)/3)
= √((100 + 0 + 100)/3)
= √(200/3)
= √66.67
= 8.17
Hence we get,

Wind    Standard deviation of players
Strong  7.07
Weak    8.17

Weighted standard deviation for Wind = (7.07 * (4/7)) + (8.17 * (3/7))
= (7.07 * 0.57) + (8.17 * 0.43)
= 4.03 + 3.51
= 7.54
16. Given the table which shows the abstract details of
players who play a particular game on various days
according to the weather conditions. The standard deviation
of players is 6.58. What is the Standard deviation reduction
for Temperature?

Temperature  Standard deviation of players
Hot          5.85
Medium       7.54
Cool         4.58

a) 5.68
b) 0.9
c) 1.89
d) 2.34
View Answer
Answer: b
Explanation: From the table we have,
Weighted standard deviation of Temperature = (5.85 *
(4/10)) + (7.54 * (2/10)) + (4.58 * (4/10))
= (5.85 * 0.4) + (7.54 * 0.2) + (4.58 * 0.4)
= 2.34 + 1.51 + 1.83
= 5.68
Standard deviation reduction for Temperature = Standard
deviation of players – Weighted standard deviation of
Temperature
= 6.58 – 5.68
= 0.9
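The standard deviation reduction arithmetic in questions 14–16 can be reproduced with a few lines of Python. A minimal sketch, assuming the population standard deviation that the worked answers use (the player lists are the ones quoted above):

import math

def std(values):
    # Population standard deviation, as used in the worked answers
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def weighted_std(groups):
    # Size-weighted standard deviation over the groups produced by a candidate split
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * std(g) for g in groups)

sunny = [15, 10, 5]
strong_wind, weak_wind = [15, 15, 25, 5], [10, 20, 30]

print(round(std(sunny), 2))                              # 4.08, as in question 14
print(round(weighted_std([strong_wind, weak_wind]), 2))  # 7.54, as in question 15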
1. Random forest can be used to reduce the danger of
overfitting in the decision trees.
a) True
b) False
View Answer
Answer: a
Explanation: One way to reduce the danger of overfitting is
by constructing an ensemble of trees. So Random forest is an
ensemble method which is better than a single decision tree
because it reduces the over-fitting by averaging the result.
2. Which of the following statements is not true about the
Random forests?
a) It is a classifier consisting of a collection of decision trees
b) Each tree is constructed by applying an algorithm on the
training set and an additional random vector
c) The prediction of the random forest is obtained by a
majority vote over the predictions of the individual trees
d) Each individual tree in the random forest will not spit
out a class prediction
View Answer
Answer: d
Explanation: As Random forest is a classifier consisting of a
collection of decision trees, each individual tree in the
random forest spits out a class prediction and the class with
the most votes becomes the model’s prediction. And each
tree is constructed by applying an algorithm on the training
set and an additional random vector.
3. Which of the following statements is not true about the
Random forests?
a) It is an ensemble learning method for classification only
b) It operates by constructing a multitude of decision trees at
training time
c) It outputs the class that is the mode of the classes
d) It outputs the mean prediction of the individual trees
View Answer
Answer: a
Explanation: Random forest is a supervised learning method
which is used for classification and regression. It is a group
of decision trees. The more the number of the trees the result
is error-free. During training time multitude of decision
trees will be constructed.
4. Which of the following statements is not true about
Random forests?
a) Scaling of data is required in the random forest algorithm
b) It works well for a large range of data items than a single
decision tree
c) It has less variance than single decision tree
d) Random forests are very flexible and possess very high
accuracy
View Answer
Answer: a
Explanation: Scaling of data is not required in the random
forest algorithm. It maintains good accuracy and it is very
flexible even after providing data without scaling. It works
well for a large range of data items and has less variance
than a single decision tree.
5. Which of the following statements is not true about
Random forests?
a) It has high complexity
b) Construction of Random forests is much easier than
decision trees
c) Construction of Random forests is more time-consuming than
decision trees
d) More computational resources are required to implement
Random Forest algorithm
View Answer
Answer: b
Explanation: Construction of Random forests is much
harder and time-consuming than decision trees as it requires
more computational resources for the implementation. And
it has high complexity.
6. There is a direct relationship between the number of trees
in the random forest and the results.
a) False
b) True
View Answer
Answer: b
Explanation: Random forest is a supervised machine
learning technique. And there is a direct relationship
between the number of trees in the forest and the results it
produces. If larger the number of trees, the result will be
more accurate.
7. A data set T is split into two subsets T1 and T2 with sizes
N1 and N2. And Gini index of the split data contains
examples from N classes. Then the Gini index of T is defined
by which of the following options?
a) Ginisplit (T) = (N2/N) gini (T1) + (N1/N) gini (T2)
b) Ginisplit (T) = (N/N1) gini (T1) + (N/N2) gini (T2)
c) Ginisplit (T) = (N/N2) gini (T1) + (N/N1) gini (T2)
d) Ginisplit (T) = (N1/N) gini (T1) + (N2/N) gini (T2)
View Answer
Answer: d
Explanation: Let a data set T be split into two subsets T1 and
T2 with sizes N1 and N2. And Gini index of the split data
contains examples from N classes. Then the Gini index of T
is defined by, Ginisplit (T) = (N1/N) gini (T1) + (N2/N) gini (T2).
And its implementation is not easy as a decision tree with
impurity measures.
8. Random forest is known as the forest of Decision trees.
a) True
b) False
View Answer
Answer: a
Explanation: Random forest makes predictions by
combining the results from many individual decision trees.
So, we call them a forest of decision trees. Random forest
combines multiple models, and it falls under the category of
ensemble learning.
9. Bagging and Boosting are two main ways for combining
the outputs of multiple decision trees into a random forest.
a) True
b) False
View Answer
Answer: a
Explanation: Bagging and Boosting are two main ways for
combining the outputs of multiple decision trees into a
random forest. Bagging is also called Bootstrap aggregation
(used in Random Forests) and Boosting (used in Gradient
Boosting Machines).
10. Which of the following is represented by the below
figure?
[Figure: several decision trees whose individual predictions are combined into one final prediction]
a) Support vector machine
b) Random forest
c) Regression tree
d) Classification tree
View Answer
Answer: b
Explanation: The given figure shows the Random forest
where the outputs of multiple decision trees are combined to
form a random forest. And the final result of the model is
calculated by averaging over all predictions from these
sampled trees or by majority vote.
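As the figure description says, a random forest combines its trees either by majority vote (classification) or by averaging (regression). A minimal illustrative Python sketch of just that aggregation step, with made-up per-tree predictions:

from collections import Counter

def forest_classify(tree_predictions):
    # Majority vote over the class predicted by each individual tree
    return Counter(tree_predictions).most_common(1)[0][0]

def forest_regress(tree_predictions):
    # Average of the values predicted by each individual tree
    return sum(tree_predictions) / len(tree_predictions)

print(forest_classify(["Good", "Bad", "Good", "Good"]))  # Good
print(forest_regress([11.0, 14.0, 9.0, 10.0]))           # 11.0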
11. Suppose we are using a random forest algorithm to solve
regression problem and there are 4 data points. The value
returned by the model and the actual value for the data
points 1, 2, 3, and 4 are 11, 14, 9, 10 and 8, 10, 12, 14
respectively. What is the mean squared error?
a) 11.5
b) 14
c) 12.5
d) 10
View Answer
Answer: c
Explanation: We know the mean squared error MSE
= (1/N) ∑ (fi – yi)²
Given N = 4, f1 = 11, f2 = 14, f3 = 9, f4 = 10, y1 = 8, y2 = 10, y3 =
12 and y4 = 14
MSE = (1/4) ((11-8)² + (14-10)² + (9-12)² + (10-14)²)
= (1/4) ((3)² + (4)² + (-3)² + (-4)²)
= (1/4) (9 + 16 + 9 + 16)
= 50/4
= 12.5
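The mean squared error computation in this question is a one-liner in Python. A minimal sketch using the numbers from the question:

def mse(predicted, actual):
    # Mean squared error: average of the squared differences
    return sum((f - y) ** 2 for f, y in zip(predicted, actual)) / len(actual)

print(mse([11, 14, 9, 10], [8, 10, 12, 14]))   # 12.5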
12. Consider we are performing Random Forests based on
classification data, and the relative frequencies of the class
you are observing in the dataset are 0.65, 0.35, 0.29 and 0.5.
What is the Gini index?
a) 0.423
b) 0.084
c) 0.12
d) 0.25
View Answer
Answer: c
Explanation: We know Gini index = 1 – ∑ (pi)², summed over the c classes, where
pi represents the relative frequency of the class you are
observing in the dataset and c represents the number of
classes. And we have p1 = 0.65, p2 = 0.35, p3 = 0.29 and p4 = 0.5
Gini index = 1 – (0.65² + 0.35² + 0.29² + 0.5²)
= 1 – (0.423 + 0.123 + 0.084 + 0.25)
= 1 – 0.88
= 0.12
1. Which of the following statements is false about k-Nearest
Neighbor algorithm?
a) It stores all available cases and classifies new cases based
on a similarity measure
b) It has been used in statistical estimation and pattern
recognition
c) It cannot be used for regression
d) The input consists of the k closest training examples in the
feature space
View Answer
Answer: c
Explanation: KNN is used for both classification and
regression and the input consist of the k closest training
examples in the feature space. It has been used in statistical
estimation and pattern recognition. It stores all available
cases and classifies new cases based on a similarity measure.
2. Which of the following statements is not true about k-
Nearest Neighbor classification?
a) The output is a class membership
b) An object is classified by a plurality vote of its neighbors
c) If k = 1, then the object is simply assigned to the class of
that single nearest neighbor
d) The output is the property value for the object
View Answer
Answer: d
Explanation: In k-Nearest Neighbor classification the output
is a class membership and not the property value for the
object. Here an object is classified by a plurality vote of its
neighbors. So if k = 1, then the object is simply assigned to
the class of that single nearest neighbor.
3. Suppose k = 3 and the data point A’s 3-nearest-neighbours
from the dataset are instances X, Y and Z. The table shows
their classes and the distances computed. Then A’s predicted
class using majority voting will be ‘Good’?

Neighbor  Class
X         Good
Y         Bad
Z         Bad

a) True
b) False
View Answer
Answer: b
Explanation: In majority voting approach, all votes are
equal. For each class C∈ L, we count how many of the k
neighbors have that class. We return the class with the most
votes. So here are two classes ‘Good’ and ‘Bad’. And the
class ‘Bad’ have the most votes (2 votes). So A’s predicted
class using majority voting will be ‘Bad’.
4. We have data from a survey and objective testing with two
attributes A and B to classify whether a special paper tissue
is good or not. Here are four training samples given in the
table. Now the factory produces a new paper tissue that passes a
laboratory test with A = 3 and B = 7. If K = 3, then ‘Good’ is
the classification of this new tissue?

A  B  C = Classification
7  6  Bad
7  4  Bad
4  4  Good
2  4  Good

a) True
b) False
View Answer
Answer: a
Explanation: We have K = 3. Then we have,

A  B  Square distance to query instance (3, 7)
7  6  (7 – 3)² + (6 – 7)² = 17
7  4  (7 – 3)² + (4 – 7)² = 25
4  4  (4 – 3)² + (4 – 7)² = 10
2  4  (2 – 3)² + (4 – 7)² = 10

When sorting the distance we get,

A  B  Square distance to query instance (3, 7)  Rank of minimum distance
7  6  17                                        3
7  4  25                                        4
4  4  10                                        1
2  4  10                                        2

Use the simple majority of the category of nearest neighbors as
the prediction value of the query instance. Here we have 2
‘Good’ and 1 ‘Bad’. Then the new tissue paper lies in the
category of ‘Good’.
5. Suppose k = 3 and the data point A’s 3-nearest-neighbours
from the dataset are instances X, Y and Z. The table shows
their classes and the distances computed. Then A’s predicted
class using inverse distance weighted voting will be ‘Good’?

Neighbor  Class  Distance
X         Good   0.1
Y         Bad    0.3
Z         Bad    0.5

a) True
b) False
View Answer
Answer: a
Explanation: In this approach, closer neighbors get higher
votes. Take a neighbor’s vote to be the inverse of its distance
to q; this is known as inverse distance weighted voting.
Vote (X) = 1 / 0.1
= 10
Vote(Y) = 1 / 0.3
= 3.33
Vote (Z) = 1 / 0.5
=2
Here X (Good) gets a vote of 10 and Y (Bad), Z (Bad)
together gets a vote of 5.33 only. So, the predicted class will
be ‘Good’.
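Both voting schemes used in questions 3–5 can be written directly. A minimal illustrative Python sketch, reusing the neighbour classes and distances from the table above:

from collections import Counter, defaultdict

# (class, distance) of the 3 nearest neighbours, taken from the table above
neighbours = [("Good", 0.1), ("Bad", 0.3), ("Bad", 0.5)]

def majority_vote(neighbours):
    # Every one of the k neighbours gets one equal vote
    return Counter(cls for cls, _ in neighbours).most_common(1)[0][0]

def inverse_distance_vote(neighbours):
    # Closer neighbours get larger votes: vote = 1 / distance
    votes = defaultdict(float)
    for cls, dist in neighbours:
        votes[cls] += 1 / dist
    return max(votes, key=votes.get)

print(majority_vote(neighbours))          # Bad  (two of the three neighbours)
print(inverse_distance_vote(neighbours))  # Good (a vote of 10 beats 3.33 + 2)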
6. Which of the following statements is not true about k
Nearest Neighbor?
a) It belongs to the supervised learning domain
b) It has an application in data mining and intrusion
detection
c) It is Non-parametric
d) It is not an instance based learning algorithm
View Answer
Answer: d
Explanation: k-NN is a supervised learning
algorithm and a non-parametric algorithm. It is also called
a lazy learner algorithm. KNN is used in applications like
data mining, intrusion detection, genetics, and economic
forecasting.
7. Which of the following statements is not supporting in
defining k Nearest Neighbor as a lazy learning algorithm?
a) It defers data processing until it receives a request to
classify unlabeled data
b) It replies to a request for information by combining its
stored training data
c) It stores all the intermediate results
d) It discards the constructed answer
View Answer
Answer: c
Explanation: k Nearest Neighbor is considered to be as a
lazy learning algorithm and it defers data processing until it
receives a request to classify unlabeled data. It replies to a
request for information by combining its stored training
data. And the most important thing is that it discards the
constructed answer and any intermediate results.
8. Which of the following statements is not supporting kNN
to be a lazy learner?
a) When it gets the training data, it does not learn and make
a model
b) When it gets the training data, it just stores the data
c) It derives a discriminative function from the training data
d) It uses the training data when it actually needs to do some
prediction
View Answer
Answer: c
Explanation: It does not derive any discriminative function
from the training data. So, kNN does not immediately learn
a model, but delays the learning, that is why it is called lazy
learner. All other three are the statements supporting kNN
to be a lazy learner.
9. Euclidian distance and Manhattan distance are the same
in kNN algorithm to calculate the distance.
a) True
b) False
View Answer
Answer: b
Explanation: Both Euclidian distance and Manhattan
distance are used to calculate the distance between two
points. But they are not the same. Euclidian distance takes
the square root of the sum of the squares of the difference of
the coordinates. Manhattan distance takes the sum of the
absolute values of the difference of the coordinates.
10. What is the Manhattan distance between a data point (9,
7) and a new query instance (3, 4)?
a) 7
b) 9
c) 3
d) 4
View Answer
Answer: b
Explanation: Manhattan distance takes the sum of the
absolute values of the difference of the coordinates. Let the
data point be (x1, y1) = (9, 7) and query instance be (x2, y2) =
(3, 4).
Manhattan distance, d = |x1 – x2| + |y1 – y2|
= |9 – 3| + |7 – 4|
= |6| + |3|
=9
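The two distance measures compared in the last two questions differ only in whether the coordinate differences are squared or taken as absolute values. A minimal Python sketch, checked against the numbers in question 10:

import math

def manhattan(p, q):
    # Sum of the absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    # Square root of the sum of the squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(manhattan((9, 7), (3, 4)))            # 9, as in question 10
print(round(euclidean((9, 7), (3, 4)), 2))  # 6.71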
1. In kNN too large value of K has a negative impact on the
data points.
a) True
b) False
View Answer
Answer: a
Explanation: Too large value of K in kNN has a negative
impact on the data points. A too large value of K is
detrimental as it destroys the locality of information since
farther examples are taken into account. It also increases the
computational burden.
2. It is good to use kNN for large data sets.
a) True
b) False
View Answer
Answer: b
Explanation: KNN works well with smaller dataset because
it is a lazy learner. It needs to store all the data and then
makes decision only at run time. So, if dataset is large, there
will be a lot of processing which may adversely impact the
performance of the algorithm.
3. When we set K = 1 in kNN algorithm, the predictions
become more stable.
a) True
b) False
View Answer
Answer: b
Explanation: As we decrease the value of K to 1, our
predictions become less stable. Choosing smaller values for
K can be noisy and will have a higher influence on the result.
In general, a common choice of k is k = sqrt(N), where N
stands for the number of samples in your training dataset.
advertisement
4. Setting large values of K in kNN is computationally
inexpensive.
a) True
b) False
View Answer
Answer: b
Explanation: Setting large values of K in kNN is
computationally expensive. Larger values of K will have
smoother decision boundaries which mean lower variance
but increased bias. ‘K’ in kNN algorithm is based on feature
similarity and choosing the right value of K is a process
called parameter tuning.
5. Which of the following statements is not a feature of kNN?
a) K-NN has assumptions
b) K-NN is pretty intuitive and simple
c) No Training Step
d) It constantly evolves
View Answer
Answer: a
Explanation: In kNN there are no assumptions to be met to
implement kNN. Parametric models like linear regression
has lots of assumptions to be met by data before it can be
implemented which is not the case with kNN. All other three
statements are the advantages of kNN.
6. Which of the following statements is not a feature of kNN?
a) Very easy to implement for multi-class problem
b) One Hyper Parameter
c) Variety of distance criteria to be choose from
d) Fast algorithm for large dataset
View Answer
Answer: d
Explanation: kNN is a slow algorithm. KNN might be very
easy to implement but as dataset grows efficiency or speed of
algorithm declines very fast. So, it is a slow algorithm for
large dataset. All other three statements are the advantages
of kNN.
7. Which of the following statements is not a feature of kNN?
a) K-NN does not need homogeneous features
b) Curse of Dimensionality
c) Optimal number of neighbors
d) Outlier sensitivity
View Answer
Answer: a
Explanation: K-NN needs homogeneous features. If you
decide to build k-NN using a common distance, like
Euclidean or Manhattan distances, it is completely necessary
that features have the same scale, since absolute differences
in features weigh the same, i.e., a given distance in feature 1
must mean the same for feature 2.
8. KNN performs well on imbalanced data.
a) True
b) False
View Answer
Answer: b
Explanation: k-NN doesn’t perform well on imbalanced
data. If we consider two classes, A and B, and the majority
of the training data is labeled as A, then the model will
ultimately give a lot of preference to A. This might result in
getting the less common class B wrongly classified.
9. In kNN low K value is sensitive to outliers.
a) True
b) False
View Answer
Answer: a
Explanation: kNN is sensitive to outliers. A low K value is
sensitive to outliers, while a higher K value is more robust
to them, as it considers more voters when deciding the
prediction.
10. Cross-validation is a smart way to find out the optimal K
value.
a) True
b) False
View Answer
Answer: a
Explanation: Cross-validation is a smart way to find out the
optimal K value. It estimates the validation error rate by
holding out a subset of the training set from the model
building process.
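A minimal sketch of this idea, assuming scikit-learn (the
dataset and the range of K values tried are only
illustrative):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Cross-validate a range of odd K values and keep the best-scoring one.
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": list(range(1, 31, 2))},
                      cv=5)
search.fit(X, y)
print(search.best_params_)     # the chosen K varies with the dataset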
1. Naïve Bayes classifier algorithms are mainly used in text
classification.
a) True
b) False
View Answer
Answer: a
Explanation: The Naïve Bayes classifier is a simple
probabilistic framework for solving classification problems.
It organizes text into categories based on Bayes' theorem: it
is trained on data to learn document-class probabilities,
which are then used to classify new text documents.
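A minimal sketch of such a text classifier, assuming
scikit-learn and using a tiny made-up corpus (the documents
and labels below are purely illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["win a free prize now", "lowest price guaranteed",
        "meeting rescheduled to monday", "please review the attached report"]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words counts feed a multinomial Naive Bayes model of class probabilities.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["claim your free prize"]))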
2. What is the formula for Bayes’ theorem? Where (A & B)
and (H & E) are events and P(B), P(H) & P(E) ≠ 0.
a) P(H|E) = [P(E|H) * P(E)] / P(H)
b) P(A|B) = [P(A|B) * P(A)] / P(B)
c) P(H|E) = [P(H|E) * P(H)] / P(E)
d) P(A|B) = [P(B|A) * P(A)] / P(B)
View Answer
Answer: d
Explanation: Here, P(A) and P(H) are the probability of the
hypothesis before observing the evidence (the prior), P(B)
and P(E) are the probability of the evidence, P(A|B) and
P(H|E) are the posterior probability, and P(B|A) and P(E|H)
are the likelihood. Bayes' theorem states that
\(P(A|B) = \frac {P(B|A) \, P(A)}{P(B)}\), i.e. the posterior
is the likelihood times the prior divided by the evidence.
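The theorem can be checked in a couple of lines of Python
(a sketch with made-up numbers, using exact fractions to
avoid rounding):

from fractions import Fraction

def posterior(likelihood, prior, evidence):
    # Bayes' theorem: P(H|E) = P(E|H) * P(H) / P(E)
    return likelihood * prior / evidence

# With P(E|H) = 0.8, P(H) = 0.3 and P(E) = 0.4, the posterior P(H|E) is 0.6.
print(posterior(Fraction(8, 10), Fraction(3, 10), Fraction(4, 10)))   # 3/5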
3. Which of the following statement is not true about Naïve
Bayes classifier algorithm?
a) It cannot be used for Binary as well as multi-class
classifications
b) It is the most popular choice for text classification
problems
c) It performs well in Multi-class prediction as compared to
other algorithms
d) It is one of the fast and easy machine learning algorithms
to predict a class of test datasets
View Answer
Answer: a
Explanation: The Naïve Bayes algorithm can be used for
binary as well as multi-class classification. It is a
parametric algorithm, which means it summarizes the data
with a fixed set of parameters under a fixed set of
simplifying assumptions, which keeps the learning process
simple.
4. What are the assumptions of the Naïve Bayes classifier?
a) It assumes that features of a data are completely
dependent on each other
b) It assumes that each input variable is dependent and the
model is not generative
c) It assumes that each input attributes are independent of
each other and the model is generative
d) It assumes that the data dimensions are dependent and
the model is generative
View Answer
Answer: c
Explanation: The Naïve Bayes classifier assumes that the
input attributes are independent of each other, which is the
naïve part, and that the model is generative, which is the
Bayes part.
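What the independence assumption buys can be shown in a few
lines (a sketch with hypothetical per-attribute likelihoods;
none of these numbers come from the questions above): the
joint likelihood of all attributes given a class is simply
the product of the per-attribute likelihoods.

from math import prod

# Hypothetical likelihoods P(attribute value | class) for a single class.
per_attribute = {"outlook=sunny": 0.4, "humidity=high": 0.7, "wind=weak": 0.6}
prior = 0.5

# Independence assumption: multiply the pieces instead of modelling the full joint.
unnormalised_posterior = prior * prod(per_attribute.values())
print(unnormalised_posterior)    # roughly 0.084; the evidence term would normalise this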
5. Which of the following is not a supervised machine
learning algorithm?
a) Decision tree
b) SVM for classification problems
c) Naïve Bayes
d) K-means
View Answer
Answer: d
Explanation: Decision tree, SVM (Support vector machines)
for classification problems and Naïve Bayes are the examples
of supervised machine learning algorithm. K-means is an
example of unsupervised machine learning algorithm.
6. Which one of the following terms is not used in the Bayes’
Theorem?
a) Prior
b) Unlikelihood
c) Posterior
d) Evidence
View Answer
Answer: b
Explanation: The terms Evidence, Prior, Likelihood and
Posterior are used in the Bayes’ Theorem. But, the term
unlikelihood is not used in the Bayes’ Theorem. Bayes
Theorem states that Posterior = (Likelihood * Prior) /
Evidence.
7. Is the assumption of the Naïve Bayes algorithm a
limitation to use it?
a) True
b) False
View Answer
Answer: a
Explanation: It is true that the assumption of the Naïve
Bayes algorithm is a limitation, since it implicitly assumes
that all the input attributes are mutually independent of
each other. In real life it is almost impossible to get a set
of input attributes that are completely independent.
8. In which of the following cases does the Naïve Bayes
algorithm not work well?
a) When faster prediction is required
b) When the Naïve assumption holds true
c) When there is the case of Zero Frequency
d) When there is a multiclass prediction
View Answer
Answer: c
Explanation: In the case of “Zero Frequency”, a categorical
value that appears in the test data was never observed for a
class in the training data, so the classifier assigns it zero
probability; the posterior for that class then collapses to
zero and no sensible prediction can be made (see the
smoothing sketch below).
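The standard remedy is Laplace (add-one) smoothing, which
scikit-learn's MultinomialNB exposes through its alpha
parameter. A hand-rolled sketch of the same idea (the counts
are hypothetical):

def smoothed_likelihood(word_count_in_class, total_words_in_class, vocab_size, alpha=1):
    # Add-one (Laplace) smoothing: no word ever gets probability zero, so a
    # single unseen word cannot zero out the posterior of an entire class.
    return (word_count_in_class + alpha) / (total_words_in_class + alpha * vocab_size)

# A word never seen in the "spam" class still gets a small non-zero probability.
print(smoothed_likelihood(0, 120, 50))    # 1/170, roughly 0.0059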
9. There are two boxes. The first box contains 3 white and 2
red balls whereas the second contains 5 white and 4 red
balls. A ball is drawn at random from one of the two boxes
and is found to be white. What is the probability that the
ball was drawn from the second box?
a) 53/50
b) 50/104
c) 54/104
d) 54/44
View Answer
Answer: b
Explanation: Let the first box be A and the second box be B.
The probability of choosing either box is P(A) = P(B) = 1/2.
As given in the question, we have to find the probability
that the white ball was drawn from the second box, i.e.
P(B/W).
Now, P(W/A) = 3/5 and P(W/B) = 5/9.
According to Bayes Theorem we know that,
P(B/W) = \(\frac {P(W/B) * P(B)}{P(W/B) * P(B) + P(W/A) *
P(A)}\)
P(B/W) = \(\frac {5/9 * 1/2}{(5/9 * 1/2) + (3/5 * 1/2)}\)
P(B/W) = \(\frac {5/18}{5/18 + 3/10}\)
P(B/W) = \(\frac {5/18}{104/180}\)
P(B/W) = 50/104
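The arithmetic can be double-checked with Python's exact
fractions (a small sketch that mirrors the computation
above):

from fractions import Fraction

p_a = p_b = Fraction(1, 2)        # equal chance of picking either box
p_w_given_a = Fraction(3, 5)      # 3 white out of 5 balls in the first box
p_w_given_b = Fraction(5, 9)      # 5 white out of 9 balls in the second box

p_b_given_w = (p_w_given_b * p_b) / (p_w_given_b * p_b + p_w_given_a * p_a)
print(p_b_given_w)                # 25/52, the reduced form of 50/104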
10. Which one of the following models is a generative model
used in machine learning?
a) Linear Regression
b) Logistic Regression
c) Naïve Bayes
d) Support vector machines
View Answer
Answer: c
Explanation: Naïve Bayes is a type of generative model
which is used in machine learning. Linear Regression,
Logistic Regression and Support vector machines are the
types of discriminative models which are used in machine
learning.
11. The number of balls in three boxes is as follows:

Box   Green   Blue   Total balls in the box
A     3       2      6
B     2       1      5
C     4       2      9

One box is chosen at random and two balls are drawn from
it. The balls are green and blue. What is the probability that
the balls chosen are from the first box?
a) 37/18
b) 15/56
c) 18/37
d) 56/15
View Answer
Answer: c
Explanation: The probability of choosing one box out of
three boxes is P(A) = P(B) = P(C) = 1/3.
Here the event (E) is choosing the green and blue balls from
the random box.
Therefore, P(E|A) = \(\frac {^3C_1*^2C_1}{^6C_2}\) = 6/15
= 2/5
P(E|B) = \(\frac {^2C_1*^1C_1}{^5C_2}\) = 2/10 = 1/5
P(E|C) = \(\frac {^4C_1*^2C_1}{^9C_2} = \frac {8}{36}\) = 2/9
According to Bayes' theorem, and because the equal priors
P(A) = P(B) = P(C) = 1/3 cancel out,
P(A|E) = P(E|A) / [P(E|A) + P(E|B) + P(E|C)]
= \(\frac {2/5}{(2/5) + (1/5) + (2/9)}\)
= 18/37
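The same computation with exact fractions (a sketch that
mirrors the explanation; the binomial coefficients come from
math.comb):

from fractions import Fraction
from math import comb

def p_green_and_blue(green, blue, total):
    # Probability of drawing one green and one blue ball from a box of `total` balls.
    return Fraction(comb(green, 1) * comb(blue, 1), comb(total, 2))

likelihoods = {"A": p_green_and_blue(3, 2, 6),
               "B": p_green_and_blue(2, 1, 5),
               "C": p_green_and_blue(4, 2, 9)}

# The equal priors of 1/3 cancel, so the posterior for box A is its likelihood
# divided by the sum of all three likelihoods.
posterior_a = likelihoods["A"] / sum(likelihoods.values())
print(posterior_a)        # 18/37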
12. Identify the parametric machine learning algorithm.
a) CNN (Convolutional neural network)
b) KNN (K-Nearest Neighbours)
c) Naïve Bayes
d) SVM (Support vector machines)
View Answer
Answer: c
Explanation: In machine learning, an algorithm that can
summarize a function with a finite, fixed set of parameters
is called a parametric machine learning algorithm. Naïve
Bayes is a parametric machine learning algorithm, whereas
CNN, KNN and SVM are non-parametric machine learning
algorithms.
13. Which one of the following applications is not an
example of Naïve Bayes algorithm?
a) Spam filtering
b) Text classification
c) Stock market forecasting
d) Sentiment analysis
View Answer
Answer: c
Explanation: Stock market forecasting is one of the most
core financial tasks of KNN (K-Nearest Neighbours). Spam
filtering, text classification and sentiment analysis is the
application of Naïve Bayes algorithm, which uses Bayes
theorem of probability for prediction of unknown classes.
14. Arrange the following steps in sequence in order to
calculate the probability of an event through Naïve Bayes
classifier.
I. Find the likelihood probability with each attribute for
each class.
II. Calculate the prior probability for given class labels.
III. Put these values in Bayes formula and calculate
posterior probability.
IV. See which class has a higher probability, given the input
belongs to the higher probability class.
a) I → II → III → IV
b) II → I → III → IV
c) III → II → I → IV
d) II → III → I → IV
View Answer
Answer: b
Explanation: The sequence in which Naïve Bayes calculates
the probability of an event is:
II. Calculate the prior probability for given class labels.
I. Find the likelihood probability with each attribute for
each class.
III. Put these values in Bayes formula and calculate
posterior probability.
IV. See which class has a higher probability, given the input
belongs to the higher probability class.
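These four steps fit in a few lines of Python. The sketch
below uses purely hypothetical counts for a single attribute
(the word "offer") and two classes; the evidence term is
omitted because it cancels when comparing classes:

# Hypothetical training counts: 6 "spam" and 4 "ham" documents, with the number
# of documents in each class that contain the word "offer".
counts = {"spam": {"offer": 4, "total": 6},
          "ham":  {"offer": 1, "total": 4}}
n_docs = 10

posteriors = {}
for label, c in counts.items():
    prior = c["total"] / n_docs              # step II: prior P(class)
    likelihood = c["offer"] / c["total"]     # step I: likelihood P(offer | class)
    posteriors[label] = prior * likelihood   # step III: Bayes numerator (evidence cancels)

print(max(posteriors, key=posteriors.get))   # step IV: the higher-probability class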
15. “It is easy and fast to predict the class of the test data set
by using Naïve Bayes algorithm”.
Which of the following statements contradicts the above
statement?
a) Because there is no iteration
b) Because there is no epoch
c) Because there is an error back propagation
d) Because there are no operations involved in solving a
matrix problem
View Answer
Answer: c
Explanation: The Naïve Bayes algorithm is easy and fast
for predicting the class of a test data set because no
iteration is involved, there are no epochs, there are no
matrix operations to solve, and there is no error
back-propagation.