Top 40 Machine Learning Questions & Answers: Which of The Following Statement Is True in The Following Case?
Q1) A feature F1 can take one of the values A, B, C, D, E, or F, which represent the grades of
students from a college.
D) Both of these
Solution: (B)
Ordinal variables are the variables that have some order in their categories. For example, grade A
A) PCA
B) K-Means clustering
C) KNN
Solution: (A)
A deterministic algorithm is one whose output does not change between runs. PCA gives the same
result every time we run it, whereas k-means clustering depends on its initial centroids and may not.
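The dependence of k-means on its starting point can be sketched in a few lines. This is a minimal 1-D illustration, not a production implementation: the helper `kmeans_1d` and the data are hypothetical, and two fixed starting centroid pairs stand in for two different random initializations.

```python
def kmeans_1d(points, centroids, iterations=20):
    """Lloyd's algorithm in one dimension, starting from the given centroids."""
    centroids = list(centroids)
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return sorted(centroids)


points = [0, 2, 10, 12, 20, 22]           # hypothetical data
run_a = kmeans_1d(points, centroids=[0, 2])    # stands in for one random start
run_b = kmeans_1d(points, centroids=[20, 22])  # stands in for another random start
print(run_a, run_b)
```

The two runs converge to different centroids on the same data, which is the sense in which k-means is not deterministic unless the initialization is fixed.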
Q3) [True or False] A Pearson correlation between two variables is zero; still, their values can be
related to each other.
A) TRUE
B) FALSE
Solution: (A)
Y = X^2. Note that they are not only associated, but one is a function of the other, and the Pearson
correlation between them is 0.
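A quick numerical check of this claim, as a sketch: the helper `pearson` and the sample values are assumptions made for illustration, and the result relies on X being symmetric around zero.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


xs = [-3, -2, -1, 0, 1, 2, 3]   # symmetric around zero (assumed sample)
ys = [x ** 2 for x in xs]       # Y is fully determined by X
r = pearson(xs, ys)
print(r)
```

The positive and negative halves of X cancel in the covariance sum, so the linear correlation vanishes even though Y is an exact function of X.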
Q4) Which of the following statement(s) is / are true for Gradient Descent (GD) and Stochastic
Gradient Descent (SGD)?
2. In SGD, you must run through all the samples in your training set
3. In GD, you either use the entire data points or a subset of training
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
E) 2 and 3
F) 1, 2 and 3
Solution: (A)
In SGD, for each iteration, you choose the batch, which generally contains a random sample of
data. But in the case of GD, each iteration contains all of the training observations.
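The difference can be sketched for a one-parameter model y = w * x. Everything here (function names, learning rate, batch size, data) is an assumption chosen for illustration: `gd` uses every observation in each step, while `sgd` draws a random batch in each step.

```python
import random


def gradient(w, batch):
    """Gradient of the mean squared error of y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def gd(data, lr=0.01, steps=200):
    w = 0.0
    for _ in range(steps):
        w -= lr * gradient(w, data)            # each step uses all observations
    return w


def sgd(data, lr=0.01, steps=200, batch_size=2, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)   # each step uses a random subset
        w -= lr * gradient(w, batch)
    return w


data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0, 5.0]]  # true slope is 3
w_gd = gd(data)
w_sgd = sgd(data)
print(w_gd, w_sgd)
```

On this noiseless toy data both variants approach the true slope; the SGD path is noisier because each step sees only part of the data.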
Q5) Which of the following hyperparameter(s), when increased, may cause the random
forest to overfit the data?
1. Number of Trees
2. Depth of Tree
3. Learning Rate
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
E) 2 and 3
F) 1, 2 and 3
Solution: (B)
Usually, increasing the depth of the trees makes overfitting more likely. The learning rate is not a
hyperparameter in a random forest. Increasing the number of trees generally does not cause
overfitting; averaging over more trees tends to reduce variance.
Q6) Suppose you want to develop a machine learning algorithm that predicts the number of
views on the articles in a blog.
Your data analysis is based on features like author name, number of articles written by the
same author, etc. Which of the following evaluation metrics would you choose in that case?
3. F1 Score
A) Only 1
B) Only 2
C) Only 3
D) 1 and 3
E) 2 and 3
F) 1 and 2
Solution: (A)
The number of views of an article is a continuous target variable, so this is a regression
problem. Mean squared error is therefore the appropriate evaluation metric.
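As a small worked example of the metric, with made-up view counts (both lists and the function name are hypothetical):

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual_views = [120, 300, 45, 980]       # hypothetical true view counts
predicted_views = [100, 320, 50, 1000]   # hypothetical model predictions
mse = mean_squared_error(actual_views, predicted_views)
print(mse)
```

The squared differences are 400, 400, 25 and 400, so the mean is 306.25. A classification metric such as the F1 score has no meaning for this kind of target.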
Q7) Given below are three images (1, 2, 3). Which of the following options is correct for these
images?
A) 1 is tanh, 2 is ReLU, and 3 is SIGMOID activation functions.
Solution: (D)
Q8) Below are the 8 actual target variable values in the training file.
[0,0,0,1,1,1,1,1]
Solution: (A)
Q9) Let’s say you are working with categorical feature(s), and you have not looked at the
distribution of the categorical variable in the test data.
You want to apply one hot encoding (OHE) on the categorical feature(s). What challenges
may you face if you have applied OHE on a categorical variable of the training dataset?
A) All categories of the categorical variables are not present in the test dataset.
B) The frequency distribution of categories is different in the train compared to the test dataset.
D) Both A and B
E) None of these
Solution: (D)
Both are true. OHE will fail to encode categories that are present in the test set but not in
the training set, so this can be one of the main challenges when applying OHE. The challenge given in
option B is also true. You need to be more careful while applying OHE if frequency distribution
A) A
B) B
C) Both A and B
D) None of these
Solution: (B)
Both models (model1 and model2) are used in the Word2vec algorithm. Model1 represents a
Q11) Let’s say you are using activation function X in hidden layers of a neural network.
At a particular neuron for any given input, you get the output as “-0.0001”. Which of the
A) ReLU
B) tanh
C) SIGMOID
D) None of these
Solution: (B)
The function is tanh because its output range is (-1, 1), so it can produce small negative values.
ReLU never outputs a negative value, and sigmoid outputs lie in (0, 1).
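The output ranges can be checked directly. This sketch evaluates the three candidate activations on a small negative input (the input value and helper names are chosen for illustration) and records which of them can return a negative number.

```python
import math


def relu(x):
    return max(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


x = -0.0001  # a small negative pre-activation (assumed for illustration)
outputs = {"relu": relu(x), "tanh": math.tanh(x), "sigmoid": sigmoid(x)}
negative_capable = [name for name, value in outputs.items() if value < 0]
print(outputs, negative_capable)
```

Only tanh yields a negative output here, which matches the reasoning behind option B.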
Q12) [True or False] LogLoss evaluation metric can have negative values.
A) TRUE
B) FALSE
Solution: (B)
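One way to see why, sketched under the usual binary log loss definition: every predicted probability lies between 0 and 1, so each logarithm is at most zero and the negated average cannot be negative. The function name, the clipping constant `eps`, and the sample predictions are assumptions for illustration.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Binary log loss: -mean(y * log(p) + (1 - y) * log(1 - p))."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


labels = [0, 1, 1, 0]
good = log_loss(labels, [0.1, 0.9, 0.8, 0.2])   # mostly confident and correct
bad = log_loss(labels, [0.9, 0.1, 0.2, 0.8])    # confident and wrong
print(good, bad)
```

Both values are non-negative, and the confidently wrong predictions receive the larger loss.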
Q13) Which of the following statements is/are true about “Type I” and “Type II”
errors?
actually true.
A) Only 1
B) Only 2
C) Only 3
D) 1 and 2
E) 1 and 3
F) 2 and 3
Solution: (E)
In statistical hypothesis testing, a type I error is the incorrect rejection of a true null hypothesis (a
“false positive”), while a type II error is incorrectly retaining a false null hypothesis (a “false
negative”).
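The two definitions can be written down as a small decision table. This is only an illustrative sketch; the function name and return strings are hypothetical.

```python
def classify_decision(null_is_true, null_rejected):
    """Label the outcome of a hypothesis test given the (normally unknown) truth."""
    if null_is_true and null_rejected:
        return "type I error"    # false positive: a true null hypothesis is rejected
    if not null_is_true and not null_rejected:
        return "type II error"   # false negative: a false null hypothesis is retained
    return "correct decision"


outcomes = {
    (truth, rejected): classify_decision(truth, rejected)
    for truth in (True, False)
    for rejected in (True, False)
}
print(outcomes)
```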
Q14) Which of the following is/are important step(s) for pre-processing the text in NLP-based
projects?
1. Stemming
3. Object Standardization
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) 1,2 and 3
Solution: (D)
Stemming is a rudimentary rule-based process of stripping the suffixes (“ing”, “ly”, “es”, “s”,
Stop words are words that are not relevant to the context of the data, for example,
is/am/are.
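A deliberately naive sketch of these two steps is shown below. The stop-word set and suffix list are tiny assumed samples, and the suffix rule is far cruder than a real stemmer such as Porter's; it only illustrates the idea.

```python
STOP_WORDS = {"is", "am", "are"}          # tiny assumed stop-word list
SUFFIXES = ("ing", "ly", "es", "s")       # suffixes to strip, checked in this order


def strip_suffix(word):
    """Remove the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, then strip suffixes."""
    tokens = text.lower().split()
    return [strip_suffix(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The players are playing quickly")
print(tokens)
```

Here "are" is dropped as a stop word, and "players", "playing" and "quickly" are reduced to "player", "play" and "quick".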
Q15) Suppose you want to project high-dimensional data into lower dimensions.
The two most famous dimensionality reduction algorithms used here are PCA and t-SNE.
Let’s say you have applied both algorithms respectively on data “X”, and you got the
Solution: (B)
The t-SNE algorithm considers nearest-neighbor points when reducing the dimensionality of the data.
So, after using t-SNE, the reduced dimensions can still be interpreted in the nearest-neighbor
space. This is not the case for PCA.
Q16) In the above images, which of the following is/are examples of multi-collinear features?
A) Features in Image 1
B) Features in Image 2
C) Features in Image 3
Solution: (D)
In Image 1, features have a high positive correlation, whereas, in Image 2, there is a high
negative correlation between the features. So in both images, the pair of features is an example of
multicollinear features.
Q17) In the previous question, suppose you have identified multi-collinear features. Which of the
following action(s) would you perform next?
variable.
A) Only 1
B) Only 2
C) Only 3
D) Either 1 or 3
E) Either 2 or 3
Solution: (E)
You cannot remove both features, because doing so loses all of the information they carry.
So you should either remove only one of the features or use a regularization technique such as L1
or L2.
Q18) Adding a non-important feature to a linear regression model may result in:
1. Increase in R-square
2. Decrease in R-square
A) Only 1 is correct
B) Only 2 is correct
C) Either 1 or 2
D) None of these
Solution: (A)
After adding a feature in the feature space, whether that feature is an important or unimportant
Q19) Suppose you are given three variables X, Y, and Z. The Pearson correlation coefficients for
(X, Y), (Y, Z), and (X, Z) are C1, C2 & C3, respectively.
Now, you add 2 to all values of X (i.e., new values become X+2), subtract 2 from
all values of Y (i.e., new values are Y-2), and leave Z unchanged. The new coefficients for
(X,Y), (Y,Z), and (X,Z) are given by D1, D2 & D3, respectively. How do the values of D1,
E) D1 = C1, D2 = C2, D3 = C3
F) Cannot be determined
Solution: (E)
The correlation between features does not change if you add a constant to, or subtract a constant
from, the features.
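This shift invariance is easy to verify numerically. The sketch below uses made-up values for X, Y and Z and a hypothetical `pearson` helper; it computes the coefficients before and after the shifts described in the question.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in xs))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ys))
    return cov / (spread_x * spread_y)


x = [1.0, 2.0, 4.0, 7.0, 11.0]    # hypothetical values
y = [3.0, 1.0, 6.0, 5.0, 9.0]
z = [10.0, 8.0, 9.0, 4.0, 2.0]

x_shifted = [v + 2 for v in x]    # X + 2
y_shifted = [v - 2 for v in y]    # Y - 2

c1, c2, c3 = pearson(x, y), pearson(y, z), pearson(x, z)
d1, d2, d3 = pearson(x_shifted, y_shifted), pearson(y_shifted, z), pearson(x_shifted, z)
print((c1, c2, c3), (d1, d2, d3))
```

Adding a constant moves the mean by the same amount, so every deviation from the mean, and therefore every coefficient, stays the same up to floating-point rounding.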
Q20) Imagine you are solving a classification problem with a highly imbalanced class.
The majority class is observed 99% of the time in the training data. Your model has 99%
accuracy after taking the predictions on the test set. Which of the following is true in such a
case?
problems.
problems.
problems.
A) 1 and 3
B) 1 and 4
C) 2 and 3
D) 2 and 4
Solution: (A)
Q21) In ensemble learning (i.e., bagging or boosting), you aggregate the predictions of weak
learners so that the ensemble gives a better prediction than the individual machine learning
models.
Which of the following statements is / are true for weak learners used in the ensemble
model?
1. They don’t usually overfit.
problems
A) 1 and 2
B) 1 and 3
C) 2 and 3
D) Only 1
E) Only 2
Solution: (A)
Weak learners are sure about a particular part of a problem. So, they usually don’t overfit, which
means that weak learners have low variance and high bias.
Q22) Which of the following options is/are true for K-fold cross-validation?
B) 2 and 3
C) 1 and 3
D) 1,2 and 3
Solution: (D)
A larger k-value means less bias towards overestimating the truly expected error (as training
folds will be closer to the total dataset) and higher running time (as you are getting closer to the
limit case: Leave-One-Out CV). We also need to consider the variance between the k folds
Cross-validation is an important step in machine learning for hyperparameter tuning. Let’s say
you are tuning the hyperparameter “max_depth” for GBM by selecting it from 10 different depth
values (all greater than 2) for a tree-based model using 5-fold cross-validation.
Time taken by an algorithm for training (on a model with max_depth 2) 4-fold is 10 seconds, and
Q23) Which of the following options is true for the overall execution time of 5-fold
cross-validation with 10 different values of “max_depth”?
F) Can’t estimate
Solution: (D)
For depth “2”, each fold in 5-fold cross-validation takes 10 seconds for training and 2 seconds
for testing, so 5 folds take 12*5 = 60 seconds. Since we are searching over 10 depth values, the
algorithm would take 60*10 = 600 seconds if every depth cost the same. But training and testing a
model at a depth greater than 2 takes more time than at depth “2”, so the overall time will be
greater than 600 seconds.
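The arithmetic behind this lower bound can be written out explicitly. The variable names are hypothetical; the numbers are the ones used in the explanation above.

```python
train_seconds = 10   # training time per fold at max_depth 2
test_seconds = 2     # testing time per fold at max_depth 2
folds = 5
depth_values = 10

seconds_per_fold = train_seconds + test_seconds     # 12
seconds_per_depth = seconds_per_fold * folds        # 60
lower_bound_seconds = seconds_per_depth * depth_values
print(lower_bound_seconds)
```

The result, 600 seconds, is only a lower bound because it prices every depth at the cost of depth 2, while all candidate depths are greater than 2.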
Q24) In the previous question, if you train the same algorithm for tuning 2 hyperparameters, say
“max_depth” and “learning_rate”.
You want to select the right value against “max_depth” (from the given 10 depth values)
and learning rate (from the given 5 different learning rates). In such cases, which of the
A) 1000-1500 seconds
B) 1500-3000 seconds
D) None of these
Solution: (D)
H    TE    VE
1    105   90
2    200   85
3    250   96
4    105   85
5    300   100
A) 1
B) 2
C) 3
D) 4
E) 5
Solution: (D)
Q26) What would you do in PCA to get the same projection as SVD?
C) Not possible
D) None of these
Solution: (A)
When the data has a zero mean vector, PCA will have the same projections as SVD; otherwise,
Assume there is a black box algorithm that takes training data with multiple observations (t1, t2,
The black box outputs the nearest neighbor of q1 (say ti) and its corresponding class label ci.
You can also think that this black box algorithm is the same as 1-NN (1-nearest neighbor).
Q27) Is it possible to construct a k-NN classification algorithm based on this black box alone?
A) TRUE
B) FALSE
Solution: (A)
In the first step, you pass an observation (q1) in the black box algorithm, so this algorithm
In the second step, you remove that nearest observation from the training data and again input the
observation (q1). The black box algorithm will again return the nearest observation and its class.
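Repeating this query-and-remove step k times yields the k nearest neighbors, whose labels can then be put to a majority vote. The sketch below assumes 1-D points and a hypothetical stand-in for the black box; the function names and data are illustrative only.

```python
from collections import Counter


def one_nn_black_box(train, query):
    """Stand-in for the black box: return the nearest (point, label) pair."""
    return min(train, key=lambda item: abs(item[0] - query))


def knn_from_black_box(train, query, k):
    """Build a k-NN prediction using only repeated calls to the 1-NN black box."""
    remaining = list(train)
    labels = []
    for _ in range(k):
        nearest = one_nn_black_box(remaining, query)
        labels.append(nearest[1])
        remaining.remove(nearest)   # drop it so the next call finds the next-nearest
    return Counter(labels).most_common(1)[0][0]


train = [(1.0, "a"), (2.0, "a"), (3.0, "b"), (8.0, "b"), (9.0, "b")]
prediction_k1 = knn_from_black_box(train, 2.6, k=1)
prediction_k3 = knn_from_black_box(train, 2.6, k=3)
print(prediction_k1, prediction_k3)
```

With k=1 the query takes the label of its single nearest point, "b"; with k=3 the neighbors are 3.0, 2.0 and 1.0, so the vote goes to "a".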
Q28) Instead of using a 1-NN black box, we want to use the j-NN (j>1) algorithm as a black box.
Which of the following options is correct for finding k-NN using j-NN?
1. J must be a proper factor of k
2. J > k
3. Not possible
A) 1
B) 2
C) 3
Solution: (A)
Q29) Suppose you are given 7 Scatter plots 1-7 (left to right), and you want to compare Pearson
correlation coefficients between variables of each scatterplot.
1. 1<2<3<4
2. 1>2>3 > 4
3. 7<6<5<4
4. 7>6>5>4
Which of the following options is/are true for the interpretation of log loss as an evaluation
metric?
A) 1 and 3
B) 2 and 3
C) 1 and 2
D) 1,2 and 3
Solution: (D)
Q31) Which of the following is leave-one-out cross-validation accuracy for 3-NN (3-nearest
neighbor)?
A) 0
B) 0.4
C) 0.8
D) 1
Solution: (C)
observation of validation. Consider each point as a cross-validation point and then find the 3
nearest points to this point. So if you repeat this procedure for all points, you will get the correct
classification for all positive classes given in the above figure, but the negative classes will be
Q32) Which of the following values of K will have the lowest leave-one-out cross-validation
accuracy?
A) 1NN
B) 3NN
C) 4NN
Solution: (A)
Each point will always be misclassified in 1-NN which means that you will get 0% accuracy.
Q33) Suppose you are given the below data, and you want to apply a logistic regression model
for classifying it into two given classes.
Which of the following option is correct when you increase the value of C from zero to a
Solution: (B)
By looking at the image, we see that even by just using x2, we can efficiently perform
Q34) Suppose we have a dataset that can be trained with 100% accuracy with the help of
a decision tree of depth 6.
Now consider the points below and choose the option based on these points.
Note: All other hyper parameters are the same, and other factors are not affected.
B) Only 2
C) Both 1 and 2
Solution: (A)
If you fit a decision tree of depth 4 to such data, it will be more likely to underfit the
data. In the case of underfitting, you will have high bias and low variance.
Q35) Which of the following options can be used to get global minima in k-Means Algorithm?
A) 2 and 3
B) 1 and 3
C) 1 and 2
D) All of above
Solution: (D)
Q36) Imagine you are working on a project which is a binary classification problem.
You trained a model on the training dataset and got the below confusion matrix on the
validation dataset.
Based on the above confusion matrix, choose which option(s) below will give you the
correct predictions.
1. Accuracy is ~0.91
A) 1 and 3
B) 2 and 4
C) 1 and 4
D) 2 and 3
Solution: (C)
Q37) For which of the following hyperparameters is a higher value better for the decision tree
algorithm?
2. Depth of tree
A) 1 and 2
B) 2 and 3
C) 1 and 3
D) 1, 2 and 3
E) Can’t say
Solution: (E)
For all three options, A, B, and C, increasing the value of the parameter does not necessarily
improve performance. For example, if we have a very high value for the depth of
the tree, the resulting tree may overfit the data and would not generalize well. On the other hand,
if we have a very low value, the tree may underfit the data. So, we can’t say for sure that “higher
is better.”
Q38) What is the dimension of the output feature map when you are using the given parameters?
Solution: (A)
Q39) What are the dimensions of the output feature map when you are using the following
parameters?
Solution: (B)
For some reason, we forgot to label the visualizations with their C values. In that case, which of
the following options best explains the C values for the images below (1, 2, 3 from left to right,
so the C values are C1 for image 1, C2 for image 2, and C3 for image 3) in the case of an RBF kernel?
A) C1 = C2 = C3
B) C1 > C2 > C3
C) C1 < C2 < C3
D) None of these
Solution: (C)
C is the penalty parameter of the error term. It also controls the trade-off between smooth decision
boundaries and classifying the training points correctly. For large values of C, the optimization
Conclusion
I hope these questions and answers helped you test your knowledge and maybe learn a thing or
two about Python, machine learning, and deep learning. I also hope you enjoyed this article and
now have a clearer understanding of these machine learning exam questions and answers.
Key Takeaways
Networks are very important to learn if you are aiming for a data
scientist job.
You must be good with data analysis skills, such as handling missing
Keep yourself updated by reading data science blogs so that you are
always up to date.
Here are some more articles and tutorials if you wish to explore machine learning further.
Deep Learning vs. Machine Learning – the essential differences you need to know!
For question 25, wouldn't Occam's Razor suggest choosing option 2? It gives the same VE, but
with a lower hyperparameter value. Considering that we should keep our hyperparameters, and
hence our model, simpler, wouldn't option 2 be a choice? Option 4 may be overfitting the training
data.
Hi Amit, what you are saying is true, but here the hyperparameter H doesn't have any
interpretation. So in such a case, you should choose the one that has lower training and validation
errors and also the closest match between them. Best! Ankit Gupta
Hi, why is the correct answer for question 28 "Not Possible"? For example, to construct a 6-NN
classifier f