Exam All Questions
SCQs [Paper-I]
Linear Regression
1. Feature engineering is an important step in any model building exercise. It is the process of creating new features
from a given data set using the domain knowledge to leverage the predictive power of a machine learning model.
Which of the following statements are correct?
Statement 1: Feature engineering techniques are applied before train test split.
Statement 2: There is no difference between standardization and normalization.
Statement 3: Mean encoding is a feature engineering technique for handling categorical features.
a. Only 1 and 2 c. Only 2 and 3
b. Only 1 d. Only 3
2. VIF is used to detect Multicollinearity. Which of the following statements is NOT true for VIF?
a. The VIF has a lower bound of 0
b. The VIF has no upper bound
c. VIF for a variable generally changes if you drop one of the predictor variables
d. If a variable is a product of two other variables, it can have a high VIF
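For reference, a minimal sketch of computing VIFs with statsmodels; the small DataFrame here is a made-up illustration, while variance_inflation_factor is the standard helper:

    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Made-up predictor matrix; x2 is nearly collinear with x1
    X = pd.DataFrame({
        "x1": [1, 2, 3, 4, 5],
        "x2": [2, 4, 6, 8, 11],
        "x3": [5, 3, 6, 2, 7],
    })
    X_const = sm.add_constant(X)  # include an intercept column

    for i in range(1, X_const.shape[1]):  # skip the intercept column
        print(X_const.columns[i], variance_inflation_factor(X_const.values, i))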
3. The distribution of error terms in a linear regression model should look like (the horizontal line represents y = 0):
a. A c. B
b. C d. D
4. For the same dependent variable Y, two models were created using the independent variables X1 and X2. The
following graphs represent the fitted lines on the scatterplots (both graphs are drawn on the same scale). Which of
the following is true about the residuals in these two models?
a. The sum of residuals in model 2 is higher than model 1
b. The sum of residuals in model 1 is higher than model 2
c. Both have the same sum of residuals
d. Nothing can be said about the sum of residuals from
the given graph
5. You built a simple linear regression model on a problem statement provided by the client. After a few days, the client
asks you to build a new model with an increased number of data points (old dataset + new data points). The number
of new data points exceeds the number of old data points by 20%.
Which of the following statements is TRUE regarding the mean of residuals?
a. Mean of residuals of old model > Mean of residuals of new model
b. Mean of residuals of old model < Mean of residuals of new model
c. Mean of residuals of old model = Mean of residuals of new model
d. Information provided is not enough to comment on the mean of residuals
6. A scatterplot was plotted for two variables – age and income to find out how the income depends on the age of a
person. It was found that as the income increases linearly with age, the variability in income also increases. This is a
violation of which of the following assumptions of linear regression?
a. Homogeneity c. Heterogeneity
b. Homoscedasticity d. Linearity
7. RFE method is used for:
a. Dummy variable creation c. Detecting multicollinearity
b. Feature selection d. Univariate regression
8. Which of the following assumptions do we make while building a simple linear regression model? (Assume X and y to
be the independent and dependent variables respectively.)
A. There is a linear relationship between X and y
B. X and y are normally distributed
C. Error terms are independent of each other
D. Error terms have constant variance
a. A, B, C and D c. A, C and D
b. A, B and C d. B, C and D
9. A client approached you with a problem statement. You decided to build a multiple linear regression model on the
dataset provided. The dataset consists of 40 features. Not all features will be significant, and selecting the relevant
features manually would be a tedious task. You can use RFE, an automated feature selection technique, to select the
relevant features. Initially, you assumed that 25 features can explain your whole data.
Which of the following commands correctly calls the RFE technique in Python? (Here, “lm” is the fitted instance of the
multiple linear regression model.)
a. from stastmodel.feature_selection import RFE
rfe=RFE(lm,25)
rfe=rfe.fit(X_train,y_train)
b. from sklearn.feature_selection import RFE
rfe=RFE(lm,25)
rfe=rfe.predict(X_train,y_train)
c. from sklearn.feature_selection import RFE
rfe=RFE(lm,25)
rfe=rfe.fit(X_train,y_train)
d. from RFE import feature_selection
rfe=RFE(lm,25)
rfe=rfe.predict(X_train,y_train)
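Note that in recent scikit-learn versions, n_features_to_select must be passed as a keyword argument, so the call would be written as in this sketch (synthetic data used as a stand-in for the client's 40-feature dataset):

    from sklearn.datasets import make_regression
    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LinearRegression

    # Synthetic stand-in for the 40-feature dataset
    X_train, y_train = make_regression(n_samples=200, n_features=40, random_state=0)

    lm = LinearRegression()
    rfe = RFE(lm, n_features_to_select=25)  # keep 25 features
    rfe = rfe.fit(X_train, y_train)

    print(rfe.support_)   # boolean mask of the selected features
    print(rfe.ranking_)   # rank 1 = selected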
10. Suppose that on adding a new predictor variable to a linear regression model (model-1), the adjusted r-squared of the
new model (model-2) decreases. Choose the correct statement:
a. The r-squared of model-2 will be less than that of model 1
b. The r-squared of model-2 increases, but the complexity of model-2 also increases
c. The r-squared of model-2 decreases, but the complexity of model-2 also increases
d. Nothing can be said about the r-squared of model-2
11. Some of the independent variables (predictors) might be interrelated, due to which the presence of a particular
independent variable in the model is redundant. This phenomenon is called Multicollinearity.
Suppose that you are building a multiple linear regression model for a given problem statement, which of the
following statements is TRUE w.r.t. multicollinearity?
a. Multicollinearity is a problem when your only goal is to predict the independent variable from the set of
dependent variables
b. Multicollinearity is a problem when your goal is to infer the effect on the dependent variable due to
independent variable.
c. Multicollinearity is not a problem if a variable is not collinear with your variable of interest
d. Multicollinearity is not a problem if there are multiple dummy(binary) variables that represent a categorical
variable with three or more categories
12. The coefficient of determination between a dependent variable and an independent variable is 0.47. This denotes
that:
a. The relationship between the two variables is not strong
b. The correlation coefficient between the two variables is also 0.47
c. 47% of the variance in the independent variable is explained by the dependent variable
d. 47% of the variance in the dependent variable is explained by the independent variable
13. In linear regression, the dependent variable is:
a. Numeric c. Categorical
b. Dummy coded d. Binary
14. Consider the following two assumptions for a simple linear regression model. (Assume X and y to be the independent
and dependent variables respectively.)
Statement 1: There is a linear relationship between X and y
Statement 2: X and y are normally distributed
a. Statement 1 is correct and statement 2 is wrong
b. Statement 2 is correct and statement 1 is wrong
c. Both the statements are correct
d. Both the statements are incorrect
15. What does standardized scaling do?
a. Bring all data points in the range 0 to 1
b. Bring all data points in the range -1 to 1
c. Bring all the data points in a normal distribution with mean 0 and standard deviation 1
d. Bring all the data points in a normal distribution with mean 1 and standard deviation 0
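A small sketch contrasting standardization with min-max normalization, using the standard scikit-learn scalers on a made-up column:

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    data = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])

    standardized = StandardScaler().fit_transform(data)  # mean 0, std 1
    normalized = MinMaxScaler().fit_transform(data)      # range 0 to 1

    print(standardized.mean(), standardized.std())  # ~0.0 and 1.0
    print(normalized.min(), normalized.max())       # 0.0 and 1.0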
16. In linear regression, the F-statistic is used to determine:
a. The significance of the individual beta coefficient
b. The variance explanation strength of the model
c. The significance of the overall model fit
d. Both A and C
17. Suppose you regress one of the feature variables, T, on all the remaining feature variables. The R-squared of this
model was found to be 0.8. What will be the VIF for the variable T?
a. 1.56 c. 2.77
b. 3.33 d. 5.00
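The VIF of a predictor follows directly from the R-squared of this auxiliary regression, VIF = 1 / (1 - R^2), as the quick check below shows:

    # VIF from the auxiliary regression's R-squared
    r_squared = 0.8
    vif = 1 / (1 - r_squared)
    print(vif)  # 5.0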
18. Which of the following is true regarding the error terms in linear regression?
a. The sum of residuals should be zero
b. The sum of residuals should be lesser than zero
c. The sum of residuals should be greater than zero
d. There is no such restriction on what the sum of residuals should be
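A quick numerical check on arbitrary synthetic data: for ordinary least squares with an intercept, the residuals sum to (numerically) zero:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 1))
    y = 3 * X[:, 0] + rng.normal(size=100)

    model = LinearRegression().fit(X, y)
    residuals = y - model.predict(X)
    print(residuals.sum())  # ~0, up to floating-point error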
Clustering
38. In hierarchical clustering, the shortest distance and the maximum distance between points in two clusters are
defined as ………. and ………….. respectively.
a. Single linkage and complete linkage c. Complete linkage and single linkage
b. Single linkage and average linkage d. Complete linkage and average linkage
39. Which of the following statements is NOT true?
a. Each time the clusters are made during the K-means algorithm, the centroid is updated.
b. The cluster centres that are computed in the K-means algorithm are given by centroid value of the cluster
points
c. Standardization of the data is not important before applying Euclidean distance as a measure of
similarity/dissimilarity
d. The centroid of a column with data points 25, 32, 34 and 23 is 28.5.
e. The Euclidean distance between two points (10,2) and (4,5) is 7.
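The arithmetic in the last two options can be checked with NumPy:

    import numpy as np

    # Centroid (mean) of the column in option d
    print(np.mean([25, 32, 34, 23]))  # 28.5

    # Euclidean distance in option e
    p, q = np.array([10, 2]), np.array([4, 5])
    print(np.linalg.norm(p - q))  # sqrt(6^2 + 3^2) = sqrt(45) ≈ 6.71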
40. Executing the following command in Python will result in which of the following?
model_clus = KMeans(n_clusters=6, max_iter=50)
a. Run maximum 6 iterations c. Run maximum 40 iterations
b. Create 6 final clusters d. Create 50 final clusters
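A sketch of what the two arguments control, on synthetic data: n_clusters fixes the number of final clusters, while max_iter only caps the number of iterations per run:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).normal(size=(300, 2))
    model_clus = KMeans(n_clusters=6, max_iter=50, n_init=10, random_state=0)
    model_clus.fit(X)

    print(model_clus.n_clusters)  # 6 final clusters
    print(model_clus.n_iter_)     # iterations actually used, at most 50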
41. Which of the following is not true for the Hopkins statistic?
a. The Hopkins statistic decides whether the data is suitable for clustering or not
b. The Hopkins statistic lies between -1 and 1
c. If the Hopkins statistic comes out to be 0, then the data is uniformly distributed
d. If the Hopkins statistic comes out to be 1, then the data is highly suitable for clustering
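For reference, a rough sketch of the Hopkins statistic under one common convention (values near 1 suggest a strong clustering tendency, values near 0.5 suggest random data); the helper name and sampling scheme are illustrative assumptions, not a library API:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def hopkins(X, m=None, seed=0):
        # Illustrative implementation; conventions for H vary across texts
        rng = np.random.default_rng(seed)
        n, d = X.shape
        m = m or max(1, n // 10)
        nn = NearestNeighbors(n_neighbors=2).fit(X)

        # w: distances from sampled real points to their nearest other real point
        sample = X[rng.choice(n, m, replace=False)]
        w = nn.kneighbors(sample, n_neighbors=2)[0][:, 1]  # column 0 is self

        # u: distances from uniform random points (in the bounding box) to the data
        uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
        u = nn.kneighbors(uniform, n_neighbors=1)[0][:, 0]

        return u.sum() / (u.sum() + w.sum())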
42. Consider the two statements-
Statement 1: In complete linkage, the distance between 2 clusters is the maximum distance between 2 points in the
two clusters.
Statement 2: Most of the time, complete linkage will produce unstructured dendrograms.
a. Statement 1 is correct and statement 2 is wrong
b. Statement 2 is correct and statement 1 is wrong
c. Both the statements are correct
d. Both the statements are incorrect
43. A client has approached you with a problem statement that requires the use of clustering. You decided to model the
problem statement with hierarchical clustering. Consider a dataset having ‘n’ data points.
Which of the following statements is true for the above problem statement?
a. An ‘n × n’ distance matrix should be calculated for the mentioned problem statement
b. Initially, ‘n’ clusters are formed for the mentioned problem statement
c. The output of the problem statement above is a dendrogram
d. All of the above
44. The Silhouette metric for any ith point is given by S(i) = (b(i) - a(i)) / max(a(i), b(i)).
Which of the following is not true about the Silhouette metric?
a. b(i) is the average distance from the nearest neighbour cluster (Separation)
b. a(i) is the average distance from own cluster (Cohesion).
c. If S(i) = 1 the data point is similar to its own cluster.
d. Silhouette metric ranges from 0 to +1
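A sketch of silhouette analysis with scikit-learn on synthetic blob data; silhouette_score averages S(i) over all points:

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    # A higher average silhouette indicates better-separated clusters
    for k in range(2, 7):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, silhouette_score(X, labels))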
45. Clustering is used to identify:
a. Data distribution c. Correlation among the data points
b. Principal components d. Subgroups in the data
46. For a K-means clustering process, the Hopkins statistic for the dataset came out to be 0.8. Hence the dataset is:
a. Suitable for clustering c. Not suitable for clustering
b. Can’t say from the given information d. None of the above
47. For a K-means clustering process, the Hopkins statistic for the dataset came out to be 0.3. Hence the dataset is:
a. Suitable for clustering c. Not suitable for clustering
b. Can’t say from the given information d. None of the above
48. You observed the following dendrogram after performing K-means clustering on a dataset. Which of the following
statements can be concluded from this dendrogram?
49. Refer to the dendrogram image below and answer the question that follows:
Find the number of clusters formed if the dendrogram is cut at 0.25. (Assume the agglomerative clustering method.)
a. 6 c. 11
b. 13 d. 15
Decision Tree
50. Which of the following is the correct sampling technique that is used by a random forest model to overcome the
problem of overfitting?
a. Random sampling c. Bootstrapping
b. Oversampling d. Stratified sampling
51. Which of the following metrics measures how often a randomly chosen element would be incorrectly identified?
a. Entropy c. Information Gain
b. Gini Index d. None of these
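For reference, a minimal sketch of the Gini impurity, i.e. the probability that a randomly chosen element is labelled incorrectly when labelled at random according to the node's class distribution:

    import numpy as np

    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    print(gini(["a", "a", "b", "b"]))  # 0.5, maximally impure for 2 classes
    print(gini(["a", "a", "a", "a"]))  # 0.0, a pure node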
52. Which of the following is true for weight of evidence (WoE) analysis?
a. It helps in finding the different predictive patterns for the different segments that might be present in the data
b. WoE helps in treating missing values for both continuous and categorical variables
c. WoE values should follow an increasing or decreasing trend across bins.
d. All of the above
53. Refer to the decision tree given below and choose the statement that is correct as per this tree.
a. The tree given above will show very good performance on the train data
b. The tree given above is an underfitting tree.
c. If the petal length is more than 2.45, then it is equally likely that the flower is either setosa or virginica.
d. Both B and C
54. Suppose you train a decision tree with the following data. Which feature should we split on at the root?
X   Y   Z   | V
T   T   F   | 1
F   F   F   | 0
T   T   T   | 0
F   T   T   | 1
a. X c. Y
b. Z d. Cannot be determined
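One way to decide is to compute the information gain of each candidate root split on the table above, as in this sketch (T/F encoded as 1/0):

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    # The table above, with T/F encoded as 1/0
    features = {"X": [1, 0, 1, 0], "Y": [1, 0, 1, 1], "Z": [0, 0, 1, 1]}
    V = np.array([1, 0, 0, 1])

    for name, col in features.items():
        col = np.array(col)
        cond = sum((col == v).mean() * entropy(V[col == v]) for v in (0, 1))
        print(name, "information gain:", entropy(V) - cond)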
55. Select the correct option based on the following decision tree.