AIL Quiz Loc
1. AI Winters happened mostly due to a lack of understanding of the theory behind neural networks.
a. True b. False
2. Most modern applications that use computer vision use models that were trained using this discipline:
a. Machine learning b. Deep learning c. Artificial Intelligence
3. In the Machine Learning Workflow, the main goal of the Data Exploration and Preprocessing step is to:
a. Identify what data is best suited to find a solution to your business problem
b. Determine how to clean your data such that you can use it to train a model
4. What is the goal of supervised learning?
a. Find an underlying structure of the dataset without any labels.
b. Predict the labels c. Find the target d. Predict the features
5. What is deep learning?
a. Deep learning is machine learning that involves deep neural networks.
b. Deep learning is another name for artificial intelligence.
c. Deep learning includes artificial intelligence and machine learning.
d. None of the above are correct
6. When is a standard machine learning algorithm usually a better choice than using deep learning to get the job done?
a. When working with small data sets c. When working with large data sets
b. When the data is steady over time d. None of the above are correct
7. What is a Turing test?
a. It tests images c. It tests and cleans the dataset
b. It tests a machine’s ability to exhibit intelligent behavior d. It tests the dataset
8. What are some of the different milestones in deep learning history?
a. Deep Blue defeats a world champion chess player, and Keras is released.
b. Geoffrey Hinton’s work, AlexNet, and TensorFlow
c. Deep Blue defeats a world champion chess player, and AlexNet is created.
d. Deep Blue defeats a world champion chess player and TensorFlow is released
9. What is artificial intelligence?
a. A subset of deep learning c. Any program that can sense, reason, act and adapt
b. A subset of machine learning d. None of the above
10. What are two spaces within AI that are going through drastic growth and innovation?
a. Computer vision and natural language processing.
b. Language processing and deep learning
c. Deep learning and machine learning
d. Computer vision and deep learning
11. Why has AI flourished so much in recent years?
a. Access to hardware for cleaning data
b. Data storage in the cloud is much more expensive
c. Stylish designed computers
d. Faster and inexpensive computers and data storage
12. How does Alexa use artificial intelligence?
a. Recognizes faces and pictures c. Recognizes our voice and answers questions
b. Suggests who a person in a photo is d. None of these
13. What are the first two steps of a typical machine learning workflow?
a. Problem statement and data cleaning c. Data collection and data transformation
b. Problem statement and data collection d. None of these
14. Which statement about the Pandas read_csv function is TRUE?
a. It allows only one argument: the name of the file
b. It reads data into a 2-dimensional Numpy array
c. It can only read comma-delimited data
d. It can read both tab-delimited and space-delimited data
15. Which of the following is a reason to use JavaScript Object Notation (JSON) files for storing data?
a. Because the data is stored in a matrix format.
b. Because they can store NA values.
c. Because they can store NULL values.
d. Because they are cross-platform compatible.
16. The data below appears in 'data.txt', and Pandas has been imported. Which Python command will read it correctly
into a Pandas DataFrame?
63.03 22.55 39.61 40.48 98.67 -0.25 AB
39.06 10.06 25.02 29 114.41 4.56 AB
68.83 2.22 50.09 46.61 105.99 -3.53 AB
a. pandas.read_csv('data.txt')
b. pandas.read_csv('data.txt', delim_whitespace = True)
c. pandas.read_csv('data.txt', header = None, sep = '')
d. pandas.read_csv('data.txt', header = 0, delim_whitespace = True)
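For reference, a minimal pandas sketch (assuming the file name 'data.txt' from the question) showing how whitespace-delimited data with no header row can be read into a DataFrame:

    import pandas as pd

    # header=None because the file has no header row;
    # delim_whitespace=True splits columns on any run of spaces or tabs
    df = pd.read_csv('data.txt', header=None, delim_whitespace=True)
    print(df.head())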
17. Outliers must be very extreme to noticeably impact the fit of a statistical model.
a. True b. False
18. Outliers should always be replaced, since they never contain useful information about the data.
a. True b. False
19. Which residual-based approach to identifying outliers compares running a model with all data to running the same
model, but dropping a single observation?
a. Standardized residuals c. Externally-studentized residuals
b. Unstandardized residuals d. Abnormally-studentized residuals
20. What is a CSV file?
a. CSV files are a standard way to store data across platforms.
b. CSV files are rows of data or values separated by commas
c. CSV is a method of JavaScript Object Notation.
d. CSV makes data readily available for analytics, dashboards, and reports
21. What are residuals?
a. Residuals are data removed from the data frame.
b. Residuals are the difference between the actual values and the values predicted by a given model.
c. Residuals are a method to standardize data.
d. Residuals are a method for handling identified outliers.
22. If the removal of rows or columns of data is not an option, why must we ensure that information is assigned for
missing data?
a. Assigning information for missing data improves the accuracy of the dataset.
b. Information must be assigned to prevent outliers.
c. Missing data may bias the dataset.
d. Most models will not accept blank values in our data.
23. What are the two main data problems companies face when getting started with artificial intelligence/machine
learning?
a. Lack of training and expertise c. Data sampling and categorization
b. Outliers and duplicated data d. Lack of relevant data and bad data
24. What does SQL stand for and what does it represent?
a. SQL stands for Structured Query Language, and it represents a set of relational databases with fixed
schemas.
b. SQL stands for Structured Query Language, and it represents databases that are not relational, they vary in
structure.
c. SQL stands for Sequential Query Language, and it represents a set of sequential databases with fixed schemas.
d. SQL stands for Sequential Query Language, and it represents a set of relational databases with fixed schemas.
25. What does NoSQL stand for and what does it represent?
a. NoSQL stands for Not-only SQL, and it represents a set of databases that are not relational, therefore,
they vary in structure.
b. NoSQL stands for Non-Structured Query Language, and it represents a set of non-relational databases with varied
schemas.
c. NoSQL stands for Non-Structured Query Language, and it represents a set of relational databases with fixed
schemas.
d. NoSQL stands for Not-only SQL, and it represents a set of databases that are relational, therefore, they have fixed
structure.
26. What is a JSON file?
a. JSON stands for JavaString Object Notation, and it is a standard way to store the data across platforms.
b. JSON stands for JavaScript Object Notation, and it is a non-standard way to store the data across platforms.
c. JSON stands for JavaString Object Notation, and they have very similar structure to Python Dictionaries.
d. JSON stands for JavaScript Object Notation, and it is a standard way to store the data across platforms.
27. What is meant by messy data?
a. Duplicated or unnecessary data c. Inconsistent text and typos
b. Missing data d. All of the above
28. What is an outlier?
a. An outlier is a data point that has the highest or lowest value in the dataset.
b. An outlier is a data point that is very close to the mean value of all observations.
c. An outlier is an observation in the dataset that is distant from most other observations.
d. An outlier is a data point that does not belong in our dataset.
29. How do we identify outliers in our dataset?
a. We can identify outliers only by calculating the minimum and maximum values in the dataset.
b. We can identify outliers both visually and with statistical calculations.
c. We can only identify outliers visually through building plots.
d. We can only identify outliers by using some statistical calculations.
30. From the options listed below, select the option that is NOT a valid exploratory data approach to visually confirm
whether your data is ready for modeling or if it needs further cleaning or data processing:
a. Create a panel plot that shows distributions for the dependent variable and scatter plots for all independent
variables
b. Train a model and identify the observations with the largest residuals
c. Create visualizations for scatter plots, histograms, box plots, and hexbin plots
d. Create a correlation heat map to confirm the sign and magnitude of correlation across your features.
31. These are two of the most common libraries for data visualization:
a. matplotlib and seaborn c. numpy and matplotlib
b. scipy and seaborn d. scipy and numpy
32. You can use the pandas library to create plots.
a. True b. False
33. Classification models require that input features be scaled.
a. True b. False
34. Feature scaling allows better interpretation of distance-based approaches.
a. True b. False
35. Feature scaling reduces distortions caused by variables with different scales.
a. True b. False
36. Which scaling approach converts features to standard normal variables?
a. Robust scaling b. Standard scaling c. MinMax scaling d. Nearest neighbor scaling
37. Which variable transformation should you use for ordinal data?
a. Standard scaling b. Ordinal encoding c. One-hot encoding d. Min-max scaling
38. What are polynomial features?
a. They are logistic regression coefficients. c. They are higher order relationships in the data.
b. They are lower order relationships in the data. d. They are represented by linear relationships in the data.
39. What does the Box-Cox transformation do?
a. It makes the data more left-skewed
b. It transforms the data distribution into a more symmetrical bell curve
c. It makes the data more right-skewed.
d. It transforms categorical variables into numerical variables.
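As a hedged illustration of the Box-Cox idea (made-up right-skewed sample; Box-Cox requires strictly positive values), SciPy can fit the transform and report the chosen lambda:

    import numpy as np
    from scipy import stats

    skewed = np.random.exponential(scale=2.0, size=1000) + 0.1  # synthetic right-skewed, strictly positive data
    transformed, fitted_lambda = stats.boxcox(skewed)           # lambda chosen to make the distribution more symmetric
    print(fitted_lambda)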
40. Select three important reasons why EDA is useful.
a. To examine correlations, to sample from dataframes, and to train models on random samples of data
b. To utilize summary statistics, to create visualizations, and to identify outliers
c. To analyze data sets, to determine the main characteristics of data sets, and to use sampling to examine data
d. To determine if the data makes sense, to determine whether further data cleaning is needed, and to help
identify patterns and trends in the data
41. What assumption does the linear regression model make about data?
a. This model assumes a linear relationship between predictor variables and outcome variables.
b. This model assumes an addition of each one of the model parameters multiplied by a coefficient.
c. This model assumes a transformation of each parameter to a linear relationship.
d. This model assumes that raw data in data sets is on the same scale.
42. What is skewed data?
a. Raw data that may not have a linear relationship.
b. Data that has a normal distribution.
c. Raw data that has undergone log transformation.
d. Data that is distorted away from normal distribution; may be positively or negatively skewed.
43. Select the two primary types of categorical feature encoding.
a. Encoding and scaling c. One-hot encoding and ordinal encoding
b. Nominal encoding and ordinal encoding d. Log and polynomial transformation
44. Which scaling approach puts values between zero and one?
a. Robust scaling b. Standard scaling c. MinMax scaling d. Nearest neighbor scaling
45. Which variable transformation should you use for nominal data with multiple different values within the feature?
a. One-hot scaling b. Standard scaling c. MinMax scaling d. Ordinal encoding
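A minimal scikit-learn sketch (toy arrays; a recent scikit-learn is assumed for the sparse_output argument) of the scalers and encoders referenced in the questions above:

    import numpy as np
    from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, OrdinalEncoder

    X = np.array([[1.0], [5.0], [10.0]])
    print(StandardScaler().fit_transform(X))  # standard scaling: mean 0, standard deviation 1
    print(MinMaxScaler().fit_transform(X))    # min-max scaling: values squeezed into [0, 1]

    colors = np.array([['red'], ['blue'], ['red']])        # nominal feature -> one-hot encoding
    sizes = np.array([['small'], ['large'], ['medium']])   # ordinal feature -> ordinal encoding
    print(OneHotEncoder(sparse_output=False).fit_transform(colors))
    print(OrdinalEncoder(categories=[['small', 'medium', 'large']]).fit_transform(sizes))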
46. In general, the population parameters are unknown.
a. True b. False
47. Parametric models have a finite number of parameters.
a. True b. False
48. The most common way of estimating parameters in a parametric model is:
a. Using the maximum likelihood estimation c. extrapolating a non-parametric model
b. Using the central limit theorem d. extrapolating Bayesian statistics
49. A p-value is:
a. The probability of the null hypothesis being true
b. The probability of the null hypothesis being false
c. The smallest significance level at which the null hypothesis would be rejected
d. The smallest significance level at which the null hypothesis is accepted
50. A Type 1 error is defined as:
a. Saying the null hypothesis is false, when it is actually true
b. Saying the null hypothesis is false, when it is actually false
51. You find through a graph that there is a strong correlation between Net Promoter Score and the time that
customers spend on a website. Select the TRUE assertion:
a. To boost the Net Promoter Score of a business, you need to increase the time that customers spend on a website.
b. There is an underlying factor that explains this correlation, but manipulating the time that customers
spend on a website may not affect the Net Promoter Score they will give to the company
52. Which one of the following is common to both machine learning and statistical inference?
a. Using sample data to make inferences about a hypothesis.
b. Using population data to model a null hypothesis.
c. Using population data to make inferences about a null sample.
d. Using sample data to infer qualities of the underlying population distribution.
53. Which one of the following describes an approach to customer churn prediction stated in terms of probability?
a. Predicting a score for individuals that estimates the probability the customer will stay.
b. Churn prediction is a data-generating process representing the actual joint distribution between our x and the y
variable.
c. Predicting a score for individuals that estimates the probability the customer will leave.
d. Data related to churn may include the target variable for whether a certain customer has left.
54. What is customer lifetime value?
a. The total churn a customer generates in the population.
b. The total purchases over the time which the person is a customer.
c. The total value that the customer receives during their life.
d. The total churn generated by a customer over their lifetime.
55. Which one of the following statements about the normalized histogram of a variable is true?
a. It provides an estimate of the variable’s probability distribution.
b. It is a parametric representation of the population distribution.
c. It is a non-parametric representation of the population variance.
d. It serves as a bar chart for the null hypothesis.
56. The outcome of rolling a fair die can be modelled as a _______ distribution.
a. Poisson b. Normal c. Log-normal d. Uniform
57. Which one of the following features best distinguishes the Bayesian approach to statistics from the Frequentist
approach?
a. Frequentist statistics incorporates the probability of the hypothesis being true.
b. Frequentist statistics requires construction of a prior distribution.
c. Bayesian statistics is better than Frequentist.
d. Bayesian statistics incorporate the probability of the hypothesis being true.
58. Which of the following best describes what a hypothesis is?
a. A hypothesis is a statement about a prior distribution.
b. A hypothesis is a statement about a sample of the population.
c. A hypothesis is a statement about a population.
d. A hypothesis is a statement about a posterior distribution.
59. A Type 2 error in hypothesis testing is _____________________:
a. Correctly rejecting the null hypothesis. c. Incorrectly accepting the alternative hypothesis.
b. Incorrectly accepting the null hypothesis. d. Correctly rejecting the alternative hypothesis.
60. Which statement best describes a consequence of a type II error in the context of a churn prediction example?
Assume that the null hypothesis is that customer churn is due to chance, and that the alternative hypothesis is that
customers enrolled for greater than two years will not churn over the next year.
a. You incorrectly conclude that there is no effect
b. You correctly conclude that a customer will eventually churn
c. You incorrectly conclude that customer churn is by chance
d. You correctly conclude that customer churn is by chance
61. Which of the following is a statistic used for hypothesis testing?
a. The standard deviation. c. The acceptance region
b. The likelihood ratio d. The rejection region
62. Which statement about Logistic Regression is TRUE?
a. Logistic Regression is a generalized linear model.
b. Logistic Regression models can only predict variables with 2 classes.
c. Logistic Regression models can be used for classification but not for regression.
d. Logistic Regression models can be used for regression but not for classification.
63. Logistic regression is similar to a linear regression, except that it uses a logistic function to estimate probabilities of
an observation belonging to a certain class or category.
a. True b. False
64. Usually the first step to fit a logistic regression model using scikit-learn is to:
a. import logistic regression from the sklearn.linear_model module
e.g. from sklearn.linear_model import LogisticRegression
b. import Logistic from the sklearn.regression module
e.g. from sklearn.regression import Logistic
c. import logistic regression from the sklearn.linearclassifer module
e.g. from sklearn.linearclassifer import LogisticRegression
65. The output of a logistic regression model applied to a data sample _____________.
a. tells you the odds of the sample belonging to a certain class.
b. is the log odds of the sample, which you can use for interpretive purposes.
c. tells you which class the sample belongs to
d. is the probability of the sample being in a certain class.
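A small sketch (synthetic data) of the import pattern from question 64 and the probability output discussed in question 65:

    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=200, n_features=5, random_state=0)
    clf = LogisticRegression().fit(X, y)
    print(clf.predict(X[:3]))        # hard class labels
    print(clf.predict_proba(X[:3]))  # probability of each class for each sample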
66. Describe how any binary classification model can be extended from its basic form on two classes, to work on
multiple classes.
a. Use the coefficients from a linear regression model to weight the classes.
b. Use a process of elimination to discard any unimportant classes.
c. Fit the binary classifier to all of the classes simultaneously.
d. Use a one-versus-all technique, where for each class you fit a binary classifier to that class versus all of
the other classes.
67. Which tool is most appropriate for measuring the performance of a classifier on unbalanced classes?
a. The true positive rate. c. The false positive rate
b. The Receiver Operating Characteristic (ROC) curve d. The precision-recall curve
68. One of the requirements of logistic regression is that you need a variable with two classes.
a. True b. False
69. The shape of ROC curves is the leading indicator of an overfitted logistic regression.
a. True b. False
70. What is the goal of supervised learning?
a. Find an underlying structure of the dataset without any labels. b. Predict the features.
c. Predict the labels. d. Find the target.
71. A simplified way to interpret K Nearest Neighbors is by thinking of the output of this method as a decision
boundary which is then used to classify new points.
a. True b. False
72. These are all characteristics of the k nearest neighbors algorithm EXCEPT:
a. It is sensitive to scaling
b. It determines decision boundaries to make predictions
c. It determines the value for k
d. It is well suited to predict variables with multiple classes
73. An advantage of k nearest neighbor methods is that they can leverage categorical data without encoding.
a. True b. False
74. Usually the first step to fit a k nearest neighbor classifier using scikit-learn is to:
a. import KNN from the sklearn.knearestneighbors module
e.g. from sklearn.knearestneighbors import KNN
b. import Classifier from the sklearn.nearestneighbors module
e.g. from sklearn.nearestneighbors import Classifier
c. import KNNClassifier from the sklearn.knearestneighbors module
e.g. from sklearn.knearestneighbors import KNNClassifier
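For reference, in current scikit-learn the k-nearest-neighbors classifier lives in sklearn.neighbors rather than in any of the module names listed above; a minimal sketch with synthetic data:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=200, n_features=4, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # fit stores the training data; prediction does the distance work
    print(knn.predict(X[:3]))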
75. Which one of the following statements is true regarding K Nearest Neighbors?
a. K Nearest Neighbors (KNN) assumes that points which are close together are similar.
b. The distance between two data points is independent of the scale of their features.
c. For high dimensional data, the best distance measure to use for KNN is the Euclidean distance.
d. The Manhattan distance between two data points is the square root of the sum of the squares of the differences
between the individual feature values of the data points.
76. Which one of the following statements is most accurate?
a. Linear regression needs to remember the entire training dataset in order to make a prediction for a new data
sample.
b. KNN determines which points are closest to a given data point, so it doesn’t take long to actually perform
prediction.
c. K nearest neighbors (KNN) need to remember the entire training dataset in order to classify a new data
sample.
d. KNN only needs to remember the hyperplane coefficients to classify a new data sample.
77. Which one of the following statements is most accurate about K Nearest Neighbors (KNN)?
a. KNN can be used for both classification and regression.
b. KNN is an unsupervised learning method.
c. KNN is a regression model.
d. KNN is a classification model.
78. K Nearest Neighbors with large k tend to be the best classifiers.
a. True b. False
79. When building a KNN classifier for a variable with 2 classes, it is advantageous to set the neighbor count k to an
odd number.
a. True b. False
80. The Euclidean distance between two points will always be shorter than the Manhattan distance:
a. True b. False
81. The main purpose of scaling features before fitting a k nearest neighbor model is to:
a. Break ties in case there is the same number of neighbors of different classes next to a given observation
b. Ensure decision boundaries have roughly the same size for all classes
c. Ensure that features have similar influence on the distance calculation
d. Help find the appropriate value of k
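A hedged sketch of the point in question 81, scaling features before the distance-based model, using a Pipeline on synthetic data:

    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=300, n_features=4, random_state=1)
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))  # scale, then classify by distance
    model.fit(X, y)
    print(model.score(X, y))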
82. These are all pros of the k nearest neighbor algorithm EXCEPT:
a. It adapts well to new training data
b. It is simple to implement as it does not require parameter estimation
c. It is easy to interpret
d. It is sensitive to the curse of dimensionality
83. (True/False) SVMs calculate predicted probabilities in the range between 0 and 1.
a. True b. False
84. All of these are characteristics of SVMs, EXCEPT:
a. Support Vector Machines do not return predicted probabilities.
b. Support Vector Machines use decision boundaries for classification.
c. The algorithm behind Support Vector Machines calculates hyperplanes that minimize misclassification error.
d. Support Vector Machine models are non-linear.
85. Any linear model can be turned into a non-linear model by applying a kernel to the model
a. True b. False
86. SVMs with kernels are recommended for large data sets with many features
a. True b. False
87. Select the TRUE statement regarding the cost function for SVMs:
a. SVMs use the same loss function as logistic regression
b. SVMs use the Hinge Loss function as a cost function
c. SVMs use a loss function that penalizes vectors prone to misclassification
d. SVMs do not use a cost function. They use regularization instead of a cost function.
88. Which statement about Support Vector Machines is TRUE?
a. Support Vector Machine models are non-linear.
b. Support Vector Machine models can be used for regression but not for classification.
c. Support Vector Machine models rarely overfit on training data.
d. Support Vector Machine models can be used for classification but not for regression.
89. A large C term will penalize the SVM coefficients more heavily.
a. True b. False
90. Regularization in the context of support vector machine (SVM) learning is meant to _________________.
a. lessen the impact that some minor misclassifications have on the cost function
b. bring all features to a common scale to ensure they have equal weight
c. smooth the input data to reduce the chance of overfitting
d. encourage the model to ignore outliers during training
91. Support vector machines can be extended to work with nonlinear classification boundaries by _______.
a. modifying the standard sigmoid function
b. incorporating polynomial regression
c. using the kernel trick
d. projecting the feature space onto a lower-dimensional space
92. Select the image that displays the line at the optimal point in the phone usage that the data can be split to create a
decision boundary.
a. b. c. d. (answer choices are images)
93. The below image shows a decision boundary with a clear margin. Such a decision boundary belongs to what type of
machine learning model?
a. Support Vector Machine
b. Machine Learning
c. Super Vector Machine
94. SVM with kernels can be very slow on large datasets. To speed up SVM training, which method can you use to map
low-dimensional data into a higher-dimensional space beforehand?
a. Regularization b. Linear SVC c. RBF Sampler d. Nystroem
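A hedged sketch of the kernel-approximation idea in question 94: map the features with Nystroem (RBFSampler is the other common choice), then fit a fast linear SVM:

    from sklearn.kernel_approximation import Nystroem
    from sklearn.svm import LinearSVC
    from sklearn.pipeline import make_pipeline
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=500, n_features=10, random_state=0)
    model = make_pipeline(Nystroem(kernel='rbf', n_components=100), LinearSVC())  # approximate kernel map + linear SVM
    model.fit(X, y)
    print(model.score(X, y))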
95. Concerning the Machine Learning workflow, what model choice would you pick if you have "Few" features and a
"Medium" amount of data?
a. LinearSVC or Kernel Approximation c. Simple, Logistic or LinearSVC
b. Add features or Logistic d. SVC with RBF
96. Select the image that best displays the line that separates the classes.
a. b. c. d. (answer choices are images)
97. Which of the following statements about Decision Tree models is TRUE?
a. Decision Tree models are non-linear.
b. Decision Tree models rarely overfit training data.
c. Decision Tree models can be used for classification but not for regression.
d. Decision Tree models can be used for regression but not for classification.
98. Decision Trees are considered a greedy algorithm.
a. True b. False
99. These are all characteristics of decision trees, EXCEPT:
a. They have well-rounded decision boundaries
b. They can be used for either classification or regression
c. They segment data based on features to predict results
d. They split nodes into leaves
100. Decision trees used as classifiers compute the value assigned to a leaf by calculating the ratio of the number of
observations of one class to the number of observations in that leaf, e.g., the number of customers younger than 50
years old divided by the total number of customers.
How are leaf values calculated for regression decision trees?
a. Median value of the predicted variable c. mode value of the predicted variable
b. Weighted average value of the predicted variable d. average value of the predicted variable
101. These are two main advantages of decision trees:
a. They are very visual and easy to interpret
b. They output both parameters and significance levels
c. They do not tend to overfit and are not sensitive to changes in data
d. They are resistant to outliers and output scaled features
102. How can you determine the split for each node of a decision tree?
a. Find the split that induces the largest entropy. c. Randomly select the split.
b. Find the split that minimizes the gini impurity. d. Use a nonlinear decision boundary to find the best split.
103. Which of the following describes a way to regularize a decision tree to address overfitting?
a. Reduce the information gain. c. Increase the number of branches.
b. Increase the max depth d. Decrease the max depth.
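A small sketch (synthetic data) of the regularization idea in question 103: limiting max_depth keeps the tree from growing until it memorizes the training set, while the split criterion from question 102 controls how candidate splits are scored:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=300, random_state=0)
    deep = DecisionTreeClassifier().fit(X, y)                                  # unconstrained: prone to overfitting
    shallow = DecisionTreeClassifier(max_depth=3, criterion='gini').fit(X, y)  # regularized by limiting depth
    print(deep.get_depth(), shallow.get_depth())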
104. What is the disadvantage of decision trees?
a. Scaling is required. c. They tend to overfit.
b. They can get too large. d. They are difficult to interpret.
105. What method can you use to minimize the overfitting of a machine-learning model?
a. Decrease the variance of your test data.
b. Choose the hyperparameters that maximize the goodness of fit on your training data.
c. Increase the variance of your training data.
d. Tune the hyperparameters of your model using cross-validation.
106. Concerning Classification algorithms, what is a characteristic of K-Nearest Neighbors?
a. Training data is the model c. The model is just parameters
b. The fitting can be slow d. Prediction is fast
107. Concerning Classification algorithms, what are the characteristics of Logistic Regression?
a. The model is just parameters, fitting can be slow, prediction is fast, and the decision boundary is simple
and less flexible
b. The training data is the model, fitting is fast, predicting class for new records can be slow, and the decision
boundary is flexible
c. The model is just parameters, fitting is fast, prediction is fast, and the decision boundary is flexible
d. The training data is the model, fitting is fast, prediction is fast, and the decision boundary is flexible
108. When evaluating all possible splits of a decision tree what can be used to find the best split regardless of what
happened in prior or future steps?
a. Logistic regression b. Greedy Search c. Classification d. Regularization
109. A model that averages the predictions of multiple models reduces the variance of a single model and has high
chances to generalize well when scoring new data.
a. True b. False
110. Bagging is a tree ensemble that combines the prediction of several trees that were trained on bootstrap samples of
the data.
a. True b. False
111. In general, a random forest can be considered a special case of bagging, and it tends to have better out-of-sample
accuracy.
a. True b. False
112. Bagging tends to have less overfitting than decision trees
a. True b. False
113. Boosting tends to be well suited for data sets with outliers and rare events.
a. True b. False
114. If you were to combine several logistic regressions using a voting ensemble, you should use a Voting Regressor.
a. True b. False
115. All of these are characteristics of boosting algorithms, EXCEPT:
a. They use the entire data set, not only bootstrapped samples
b. They use residuals from previous models
c. They create trees iteratively
d. They create trees independently
116. The term Bagging stands for bootstrap aggregating.
a. True b. False
117. This is the best way to choose the number of trees to build on a Bagging ensemble.
a. Choose a number of trees past the point of diminishing returns
b. Choose a large number of trees, typically above 100
c. Prioritize training error metrics over out-of-bag sample
d. Tune the number of trees as a hyperparameter that needs to be optimized
118. Which type of Ensemble modeling approach is NOT a special case of model averaging?
a. The Pasting method of Bootstrap aggregation c. Random Forest methods
b. The Bagging method of Bootstrap aggregation d. Boosting methods
119. What is an ensemble model that requires you to look at out-of-bag error?
a. Logistic Regression b. Stacking c. Random Forest d. Out of Bag Regression
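A minimal sketch of the out-of-bag idea referenced in question 119: a random forest fit with oob_score=True reports accuracy estimated from the samples each tree never saw (synthetic data assumed):

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=500, random_state=0)
    rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
    print(rf.oob_score_)  # accuracy estimated from out-of-bag samples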
120. What is the main condition to use stacking as an ensemble method?
a. Models need to output predicted probabilities c. Models need to be parametric
b. Models need to output residual values for each class d. Models need to be nonparametric
121. This tree ensemble method only uses a subset of the features for each tree:
a. Stacking b. Bagging c. Adaboost d. Random Forest
122. Order these tree ensembles in order of most randomness to least randomness:
a. Random Trees, Random Forest, Bagging c. Bagging, Random Forest, Random Trees
b. Random Forest, Bagging, Random Trees d. Random Forest, Random Trees, Bagging
123. This is an ensemble model that does not use bootstrapped samples to fit the base trees, takes residuals into
account, and fits the base trees iteratively:
a. Random Forest b. Bagging c. Random trees d. Boosting
124. When comparing the two ensemble methods Bagging and Boosting, what is one characteristic of Boosting?
a. Bootstrapped samples c. Fits entire data set
b. No weighting used d. Only data points are considered
125. What is the most frequently discussed loss function in boosting algorithms?
a. Gradient Loss Function c. 0-1 Loss Function
b. Gradient Boosting Loss Function d. Adaboost Loss Function
126. What type of forest is a classification algorithm that potentially contains hundreds of different decision trees?
a. The Multiple Forest b. Model Forest c. Random Forest d. Global Forest
127. When describing models what type of model will feature coefficients help to explain?
a. Global Surrogate model b. KNN c. Linear Model d. SVM
128. What type of surrogate model tries to approximate a black-box model globally on every instance in the data set?
a. Strategic Surrogate model c. Local Surrogate model
b. Global Surrogate model d. Complex Surrogate model
129. These are all methods of dealing with unbalanced classes EXCEPT:
a. Down-sampling. c. A mix of in-sample and out-of-sample
b. A mix of down-sampling and up-sampling. d. Up-sampling.
130. A best practice to build a model using unbalanced classes is to split the data first, then apply an upsample or
undersample technique.
a. True b. False
131. Which of the following statements about Downsampling is TRUE?
a. Down-sampling results in excessive focus on the more frequently-occurring class.
b. Down-sampling preserves all the original observations.
c. Down-sampling is likely to decrease Recall.
d. Down-sampling is likely to decrease Precision
132. Which of the following statements about Random Upsampling is TRUE?
a. Random Upsampling will generally lead to a higher F1 score.
b. Random Upsampling generates observations that were not part of the original data.
c. Random Upsampling preserves all original observations.
d. Random Upsampling results in excessive focus on the more frequently-occurring class.
133. Which of the following statements about Synthetic Upsampling is TRUE?
a. Synthetic Upsampling results in excessive focus on the more frequently-occurring class.
b. Synthetic Upsampling generates observations that were not part of the original data.
c. Synthetic Upsampling will generally lead to a higher F1 score.
d. Synthetic Upsampling uses fewer hyperparameters than Random Upsampling
134. What can help humans to interpret the behaviors and methods of Machine Learning models more easily?
a. Model Explanations b. Model Debug c. Model Trust d. Explanation Debug
135. What type of explanation method can be used to explain different types of Machine Learning models no matter
the model structures and complexity?
a. Model Trust Explanations c. Model Explanations
b. Local Interpretable Model-Agnostic Explanations (LIME) d. Model-Agnostic Explanations
136. For what reason might a Global Surrogate model fail?
a. Consistency between surrogate models and black-box models
b. Single data instance groups
c. Large inconsistency between surrogate models and black-box models
d. Single clusters in the data instance groups
137. When working with unbalanced sets, what should be done to the samples so the class balance remains consistent
in both the train and test sets?
a. Apply weighted observations c. Stratify the samples
b. Use oversampling d. Use a combination of oversampling and undersampling
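A small sketch of the stratified split from question 137, so the class balance of y is preserved in both sets (synthetic imbalanced data assumed):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)  # imbalanced classes
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)
    print(np.bincount(y_train) / len(y_train))  # class proportions in train...
    print(np.bincount(y_test) / len(y_test))    # ...match those in test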
138. What approach are you using when trying to increase the size of a minority class so that it is similar to the size of
the majority class?
a. Synthetic Oversampling b. Random Oversampling c. Undersampling d. Oversampling
139. What approach are you using when you create a new sample of a minority class that does not yet exist?
a. Random Oversampling b. Synthetic Oversampling c. Oversampling d. Weighting
140. What intuitive technique is used for unbalanced datasets that ensures a continuous downsample for each of the
bootstrap samples?
a. SMOTE b. Upsampling c. Downsampling d. Blagging
141. Predicting payment default, whether a transaction is fraudulent, and whether a customer will be part of the top 5%
spenders on a given year, are examples of:
a. Classification b. Regression
142. It is less concerning to treat a Machine Learning model as a black box for prediction purposes, compared to
interpretation purposes:
a. True b. False
143. Predicting total revenue, number of customers, and percentage of returning customers are examples of:
a. Classification b. Regression
144. (True/False) The Sum of Squared Errors (SSE) can be used to select the best-fitting regression model.
a. True b. False
145. (True/False) The R-squared value from estimating a linear regression model will almost always increase if more
features are added.
a. True b. False
146. (True/False) The Total Sum of Squares (TSS) can be used to select the best-fitting regression model.
a. True b. False
147. You can use supervised machine learning for all of the following examples, EXCEPT:
a. Segment customers by their demographics.
b. Predict the number of customers that will visit a store on a given week.
c. Predict the probability of a customer returning to a store.
d. Interpret the main drivers that determine if a customer will return to a store.
148. The autocorrect on your phone is an example of:
a. Unsupervised Learning c. Supervised learning
b. Semi-supervised learning d. Reinforcement learning
149. This is the type of Machine Learning that uses both data with labeled outcomes and data without labeled
outcomes:
a. Unsupervised Machine Learning c. Supervised Machine learning
b. Semi-Supervised Machine learning d. Mix-Supervised Machine learning
150. This option describes a way of turning a regression problem into a classification problem:
a. Create a new variable that flags 1 for above a certain value and 0 otherwise
b. Use outlier treatment
c. Use missing value handling
d. Create a new variable that uses autoencoding to transform a continuous outcome into categorical
151. This is the syntax you need to predict new data after you have trained a linear regression model called LR:
a. LR = predict(X_test) b. LR.predict(X_test) c. LR.predict(LR, X_test) d. predict(LR, X_test)
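A minimal sketch of the syntax in question 151, with made-up arrays standing in for the real data:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X_train, y_train = np.random.rand(100, 3), np.random.rand(100)  # placeholder training data
    X_test = np.random.rand(10, 3)                                  # placeholder new data
    LR = LinearRegression().fit(X_train, y_train)
    predictions = LR.predict(X_test)                                # predict on the new data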
152. All of these options are useful error measures to compare regressions except:
a. SSE b. R squared c. TSS d. ROC index
153. All of the listed below are part of the Machine Learning Framework, except:
a. Observations b. Features c. Parameters d. None of the above
154. Select the option that is the most INACCURATE regarding the definition of Machine Learning:
a. Machine Learning allows computers to learn from data
b. Machine Learning allows computers to infer predictions for new data
c. Machine Learning is a subset of Artificial Intelligence
d. Machine Learning is automated and requires no programming
155. In Linear Regression, which statement about model evaluation is the most accurate?
a. Model selection involves choosing a model that minimizes the cost function.
b. Model estimation involves choosing parameters that minimize the cost function.
c. Model estimation involves choosing a cost function that can be compared across models.
d. Model selection involves choosing modeling parameters that minimize in-sample validation error.
156. When learning about regression we saw the outcome as a continuous number. Given the below options what is an
example of regression?
a. A fraudulent charge
b. Under certain circumstances determine if a person is a Republican or Democrat
c. Customer churn
d. Housing prices
157. What is another term for the testing data:
a. Training data b. Unseen data c. Corroboration data d. Cross validation data
158. (True/False) The ShuffleSplit will ensure that there is no bias in your outcome variable.
a. True b. False
159. Select the option that has the syntax to obtain the data splits you will need to train a model having a test split that
is a third the size of your available data.
a. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)
b. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
c. X_train, y_test = train_test_split(X, y, test_size=0.33)
d. X_train, y_test = train_test_split(X, y, test_size=0.5)
160. What is the main goal of adding polynomial features to linear regression?
a. Remove the linearity of the regression and turn it into a polynomial model.
b. Capture the relation of the outcome with features of higher order.
c. Increase the interpretability of a black box model.
d. Ensure similar results across all folds when using K-fold cross-validation.
161. What are the most common sklearn methods to add polynomial features to your data?
Note: polyFeat = PolynomialFeatures(degree)
a. polyFeat.add and polyFeat.transform c. polyFeat.add and polyFeat.fit
b. polyFeat.fit and polyFeat.transform d. polyFeat.transform
162. How can you adjust the standard linear approach to regression when dealing with fundamental problems such as
prediction or interpretation?
a. Create a class instance
b. Add some non-linear patterns, i.e., polynomial features
c. Import the transformation method
d. By transforming the data
163. The main purpose of splitting your data into training and test sets is:
a. To improve accuracy c. To avoid overfitting
b. To improve regularization d. To improve cross-validation and overfitting
164. What term is used if your test data leaks into the training data?
a. Test leakage b. Training leakage c. Data leakage d. Historical data leakage
165. Which one of the below terms uses a linear combination of features?
a. Binomial Regression b. Linear Regression c. Multiple Regression d. Polynomial Regression
166. When splitting your data, what is the purpose of the training data?
a. Compare with the actual value c. Fit the actual model and learn the parameters
b. Predict the label with the model d. Measure errors
167. Polynomial features capture what effects?
a. Non-linear effects b. Linear effects c. Multiple effects d. Regression effects
168. Which fundamental problems are being solved by adding non-linear patterns, such as polynomial features, to a
standard linear approach?
a. Prediction b. Interpretation c. Prediction and Interpretation d. None of the above
169. Testing data can also be referred to as:
a. Training data b. Unseen data c. Corroboration data d. Cross validation data
170. What is the correct sklearn syntax to add a third degree polynomial to your model?
a. polyFeat = polyFeat.add(degree=3) c. polyFeat = polyFeat.fit(degree=3)
b. polyFeat = PolynomialFeatures(degree=3) d. polyFeat = polyFeat.transform(degree=3)
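A small sketch of questions 161 and 170 together, on a toy array: create the degree-3 PolynomialFeatures object, then fit and transform:

    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures

    X = np.arange(6).reshape(3, 2)
    polyFeat = PolynomialFeatures(degree=3)
    X_poly = polyFeat.fit_transform(X)  # equivalent to .fit(X) followed by .transform(X)
    print(X_poly.shape)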
171. Complete the following sentence: The training data is used to fit the model, while the test data is used to:
a. measure the parameters and hyperparameters of the model
b. tweak the model hyperparameters
c. tweak the model parameters
d. measure error and performance of the model
172. In a model complexity versus error diagram, the model complexity increases as the training error decreases.
a. True b. False
173. In a model complexity versus error diagram, there is an inflection point after which, as the cross-validation error
increases, so does the complexity of the model.
a. True b. False
174. In the model complexity versus error diagram, the right side of the curve is where the model is underfitted and the
left side of the curve is where the model is overfitted.
a. True b. False
175. In K-fold cross-validation, how will increasing k affect the variance (across subsamples) of estimated model
parameters?
a. Increasing k will not affect the variance of estimated parameters.
b. Increasing k will usually reduce the variance of estimated parameters.
c. Increasing k will usually increase the variance of estimated parameters.
d. Increasing k will increase the variance of estimated parameters if models are underfitting, but reduce it if
models are overfitting.
176. Which statement about K-fold cross-validation below is TRUE?
a. Each subsample in K-fold cross-validation has at least k observations.
b. Each of the k subsamples in K-fold cross-validation is used as a training set.
c. Each of the k subsamples in K-fold cross-validation is used as a test set.
d. Each subsample in K-fold cross-validation has at least k-1 observations.
177. Which of the following statements about a high-complexity model in a linear regression setting is TRUE?
a. Cross-validation with a small k will reduce or eliminate overfitting.
b. A high variance of parameter estimates across cross-validation subsamples indicates likely overfitting.
c. A low variance of parameter estimates across cross-validation subsamples indicates likely overfitting.
d. Cross-validation with a large k will reduce or eliminate overfitting.
178. Which of the following statements about cross-validation is/are True?
a. Cross-validation is an essential step in hyperparameter tuning.
b. We can manually generate folds by using KFold function.
c. GridSearchCV is commonly used in cross-validation.
d. All of the above is True.
179. Which of the following statements about GridSearchCV is/are True?
a. GridSearchCV scans over a dictionary of parameters.
b. GridSearchCV finds the hyperparameter set that has the best out-of-sample score.
c. GridSearchCV retrains on all data with the "best" hyper-parameters.
d. All of the above is True.
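A hedged sketch of questions 178-180: folds generated manually with KFold (shuffle=True randomizes which rows land in each fold) and a hyperparameter search with GridSearchCV, which refits on all the data with the best setting:

    from sklearn.model_selection import KFold, GridSearchCV
    from sklearn.linear_model import Ridge
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=200, n_features=10, random_state=0)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    grid = GridSearchCV(Ridge(), param_grid={'alpha': [0.1, 1.0, 10.0]}, cv=cv)
    grid.fit(X, y)  # scans the parameter dictionary, keeps the best out-of-sample score, then refits on all data
    print(grid.best_params_, grid.best_score_)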
180. Which of the below functions, randomly selects data to be in the train/test folds?
a. `StratifiedKFold` c. `GroupKFold`
b. `KFold` and `StratifiedKFold` d. `KFold`
181. Reviewing the below graph, what is the model considered when
associated with the right side of the cross validation error?
a. Overfitting c. Training error
b. Polynomial Regression d. Underfitting
181.1 Reviewing the below graph, what is the model considered when
associated with the left side of this curve before hitting the plateau?
a. Overfitting c. Training error
b. Polynomial Regression d. Underfitting
182. If a low-complexity model is underfitting during estimation, which of the following is MOST LIKELY true
(holding the model constant) about K-fold cross-validation?
a. K-fold cross-validation will still lead to underfitting, for any k.
b. K-cross-validation with a small k will reduce or eliminate underfitting.
c. K-fold cross-validation with a large k will reduce or eliminate underfitting.
d. None of the above.
183. Which of the following functions perform K-fold cross-validation for us, appropriately fitting and transforming at
every step of the way?
a. `cross_val` b. `cross_validation` c. `cross_validation_predict` d. `cross_val_predict`
184. (True/False) The variance of a model is determined by the degree of irreducible error.
a. True b. False
185. (True/False) As more variables are added to a model, both its complexity and its variance generally increase.
a. True b. False
186. (True/False) Model adjustments that decrease bias also decrease variance, leading to a bias-variance trade off.
a. True b. False
187. Regularization zeroes out a model’s coefficients or shrinks them toward zero, and in this way it helps prevent
overfitting.
a. True b. False
188. Scaling the features is not very important before using regularization techniques.
a. True b. False
189. (True/False) A model with high variance is characterized by sensitivity to small changes in input data.
a. True b. False
190. Which of the following statements about model complexity is TRUE?
a. Higher model complexity leads to a lower chance of overfitting
b. Higher model complexity leads to a higher chance of overfitting.
c. Reducing the number of features while adding feature interactions leads to a lower chance of overfitting.
d. Reducing the number of features while adding feature interactions leads to a higher chance of overfitting
191. Which of the following statements about model errors is TRUE?
a. Underfitting is characterized by lower errors in both training and test samples.
b. Underfitting is characterized by higher errors in both training and test samples.
c. Underfitting is characterized by higher errors in training samples and lower errors in test samples.
d. Underfitting is characterized by lower errors in training samples and higher errors in test samples.
192. Which of the following statements about regularization is TRUE?
a. Regularization always reduces the number of selected features.
b. Regularization increases the likelihood of overfitting relative to training data.
c. Regularization decreases the likelihood of overfitting relative to training data.
d. Regularization performs feature selection without a negative impact on the likelihood of overfitting relative to
the training data.
193. Which of the following statements about scaling features prior to regularization is TRUE?
a. Feature scaling is not recommended prior to regularization.
b. Features should rarely or never be scaled prior to implementing regularization.
c. The larger a feature’s scale, the more likely its estimated impact will be influenced by regularization.
d. The smaller a feature’s scale, the more likely its estimated impact will be influenced by regularization.
194. Which one of the 3 Regularization techniques: Ridge, Lasso, and Elastic Net, performs the fastest under the hood?
a. Ridge b. Lasso c. Elastic Net d. None of the above
195. Which of the following statements about Elastic Net regression is TRUE?
a. Elastic Net combines L1 and L2 regularization.
b. Elastic Net does not use L1 or L2 regularization.
c. Elastic Net uses L2 regularization, as with Ridge regression.
d. Elastic Net uses L1 regularization, as with Ridge regression.
196. BOTH Ridge regression and Lasso regression
a. Do not adjust the cost function used to estimate a model.
b. Add a term to the loss function proportional to a regularization parameter.
c. Add a term to the loss function proportional to the square of parameter coefficients.
d. Add a term to the loss function proportional to the absolute value of parameter coefficients.
197. Compared with Lasso regression (assuming a similar implementation), Ridge regression is:
a. Less likely to overfit training data. c. More likely to overfit to training data.
b. Less likely to set feature coefficients to zero. d. More likely to set feature coefficients to zero.
198. Which of the following about Ridge Regularization is TRUE?
a. It enforces the coefficients to be lower, but not 0
b. It minimizes irrelevant features
c. It penalizes the magnitude of the regression coefficients by adding a squared term
d. All of the above
199. Which of the below statements is correct?
a. Neither RidgeCV nor LassoCV uses the L1 regularization function.
b. Both RidgeCV and LassoCV use the L1 regularization function.
c. Only RidgeCV uses the L1 regularization function.
d. Only LassoCV uses the L1 regularization function.
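A minimal sketch of questions 194-199 (synthetic data, features scaled first): Ridge applies an L2 penalty, Lasso an L1 penalty that can zero out coefficients, and Elastic Net combines the two:

    from sklearn.linear_model import RidgeCV, LassoCV, ElasticNetCV
    from sklearn.preprocessing import StandardScaler
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=200, n_features=20, noise=10, random_state=0)
    X = StandardScaler().fit_transform(X)        # scale so the penalty treats features comparably
    print(RidgeCV().fit(X, y).coef_[:5])         # L2: shrinks coefficients toward zero
    print(LassoCV().fit(X, y).coef_[:5])         # L1: can set coefficients exactly to zero
    print(ElasticNetCV().fit(X, y).coef_[:5])    # mix of L1 and L2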
200. (True/False) In an Analytic View, increasing L2/L1 penalties force coefficients to be smaller, restricting their
plausible range.
a. True b. False
201. (True/False) Under the Geometric formulation, the cost function minimum is found at the intersection of the
penalty boundary and a contour of the traditional OLS cost function surface.
a. True b. False
202. (True/False) Under the Probabilistic formulation, L2 (Ridge) regularization imposes a Gaussian prior on the
coefficients, while L1 (Lasso) regularization imposes a Laplacian prior.
a. True b. False
203. When working with regularization, what is the view that illuminates the actual optimization problem and shows
why LASSO generally zeros out coefficients?
a. Analytical view b. Geometric view c. Probabilistic view d. Regression view
204. When working with regularization, what is the view that recasts our understanding of LASSO and Ridge as a
Bayesian problem, where coefficients have particular prior distributions?
a. Analytical view b. Geometric view c. Probabilistic view d. Regression view
205. When working with regularization, what is the logical view of how to achieve the goal of reducing complexity?
a. Analytical view b. Geometric view c. Probabilistic view d. Regression view
206. All of the following statements about Regularization are TRUE except:
a. Optimizing predictive models is about finding the right bias/variance tradeoff.
b. Features should rarely or never be scaled prior to implementing regularization.
c. We need models that are sufficiently complex to capture patterns in data, but not so complex that they overfit.
d. Regularization techniques have an analytical, geometric, and probabilistic interpretation.
207. When working with regularization and using the geometric formulation, what is found at the intersection of the
penalty boundary and a contour of the traditional OLS cost function surface?
a. The cost function minimum c. A smaller range of coefficients
b. The prior distribution of β d. A peaked density
208. Which statement under the Probabilistic View is correct?
a. Regularization imposes certain errors on the regression coefficients.
b. Regularization imposes certain priors on the regression coefficients.
c. Regularization uses some regression coefficients to inflate the errors.
d. Regularization coefficients do not take into consideration prior probabilities.
209. Increasing L2/L1 penalties force coefficients to be smaller, restricting their plausible range. This statement is part
of what View?
a. Analytical view b. Geometric view c. Probabilistic view d. Regression view
210. What does a higher lambda term mean in the Regularization technique?
a. Higher lambda decreases variance which means smaller coefficients.
b. Higher lambda increases variance, which means smaller coefficients.
c. Higher lambda decreases variance which means larger coefficients.
d. Higher lambda decreases the prior probability.
211. What concept/s under Probabilistic View is/are True?
a. We can derive the posterior probability by knowing the probability of the target and the prior distribution.
b. The prior distribution is derived from independent draws of a prior coefficient density function that we choose
when regularizing.
c. L2 (ridge) regularization imposes a Gaussian prior on the coefficients, while L1 (lasso) regularization imposes
a Laplacian prior.
d. All of the above
212. What statement is True?
a. We reduce the complexity of the model by minimizing the error on our training set.
b. By penalizing the cost function, we increase the complexity of the model.
c. The goal of Regularization is always going to be to optimize our complexity trade-off, so we can minimize
error on the hold-out set.
d. Introducing Regularization will increase bias and variance.
213. Which statement about unsupervised algorithms is TRUE?
a. Unsupervised algorithms are relevant when we have outcomes we are trying to predict.
b. Unsupervised algorithms are relevant when we don’t have the outcomes we are trying to predict and
when we want to break down our data set into smaller groups.
c. Unsupervised algorithms are typically used to forecast time-related patterns like stock market trends or sales
forecasts
d. Unsupervised algorithms are relevant in cases that require explainability, for example comparing parameters
from one model to another.
214. What is one of the real-world solutions to fix the problems of the curse of dimensionality?
a. Increase the size of the data set c. Use more computational power
b. Reduce the dimension of the data set. d. Balance the classes of a data set
215. Which statement is a common use of Dimension Reduction in the real world?
a. Image tracking
b. Explaining the relation between the amount of alcohol consumption and diabetes.
c. Deep Learning
d. Predicting whether a customer will return to a store to make a major purchase.
216. Which statement best describes the smarter initialization of K-means clusters?
a. “Draw a line between the data points to create 2 big clusters.”
b. “After we find our centroids, we calculate the distance between all our data points.”
c. “Pick one random point, as the initial point, and for the second point, instead of picking it randomly, we
prioritize by assigning the probability of the distance.”
d. “We start by having two centroids as far as possible between each other.”
217. Which of the following statements best describes the iterative part of the K-means algorithm?
a. The k-means algorithm assigns a number of clusters at random.
b. The k-means algorithm adjusts the centroids to the new mean of each cluster, and then it keeps repeating
this process until no example is assigned to another cluster.
c. The k-means algorithm iteratively deletes outliers.
d. The k-means algorithm iteratively calculates the distance from each point to the centroid of each cluster.
218. What happens with our second cluster centroid when we use the probability formula?
a. When we use the probability formula, we put less weight on the points that are far away. So, our second cluster
centroid is likely going to be closer.
b. When we use the probability formula, we put more weight on the points that are far away. So, our
second cluster centroid is likely going to be more distant.
c. When we use the probability formula, we put more weight on the lighter centroids, because it will take more
computational power to draw our clusters. So, the second cluster centroid is likely going to be less distant.
d. When we use the probability formula, we put less weight on the points that are far away. So, our second cluster
centroid is likely going to be more distant.
219. What is the implication of a small standard deviation of the clusters?
a. A small standard deviation of the clusters defines the size of the clusters.
b. The standard deviation of the cluster defines how tightly the points are grouped around each centroid. With a
small standard deviation, the points will be closer to the centroids.
c. The standard deviation of the cluster defines how tightly the points are grouped around each centroid. With a small
standard deviation, we cannot find any centroids.
d. A small standard deviation of the clusters means that the centroids are not close enough to each other.
220. After we plot the elbow curve and find the inflection point, what does that point indicate?
a. The ideal number of clusters. c. The data points we need to form a cluster
b. How we can reduce our number of clusters. d. Whether we need to remove outliers.
221. What is one of the most suitable ways to choose K when the number of clusters is unclear?
a. You can start by choosing a random number of clusters.
b. By evaluating Clustering performance such as Inertia and Distortion.
c. By increasing the number of clusters and calculating the square root.
d. You can start by using a k-nearest neighbor method.
222. Which statement describes correctly the use of distortion and inertia?
a. When the sum of the points equals a prime number, use inertia, and when the sum of the points equals an even
number, use distortion.
b. When we can calculate a number of clusters higher than 10, we use distortion, when we calculate a number of
clusters smaller than 10, we use inertia.
c. When outliers are a concern use inertia, otherwise use distortion.
d. When the similarity of the points in the cluster is more important, you should use distortion, and if you
are more concerned that clusters have similar numbers of points, then you should use inertia.
223. Which method is commonly used to select the right number of clusters?
a. The elbow method. c. The ROC curve.
b. The perfect Square Method d. The Sum of Square Method
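Questions 220-223 cover choosing K with the elbow method and with inertia versus distortion. A minimal sketch follows (the synthetic data and the particular distortion formula, average squared distance to the closest centroid, are assumptions for illustration): it records both measures over a range of K so the elbow can be read off a plot.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.2, random_state=0)

inertias, distortions = [], []
for k in range(1, 11):
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    # Inertia: sum of squared distances from each point to its closest centroid.
    inertias.append(km.inertia_)
    # Distortion (one common definition): average squared distance to the closest centroid.
    dists = np.min(np.linalg.norm(X[:, None, :] - km.cluster_centers_[None, :, :], axis=2), axis=1)
    distortions.append(np.mean(dists ** 2))

# Plotting k against inertia/distortion and looking for the inflection point
# ("elbow") suggests a reasonable number of clusters.
for k, (i, d) in enumerate(zip(inertias, distortions), start=1):
    print(f"k={k:2d}  inertia={i:10.2f}  distortion={d:8.3f}")
```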
224. What is another name for the L1 distance?
a. Hamming Distance b. Euclidean Distance c. Manhattan Distance d. Mahalanobis Distance
225. What is the key feature of the Jaccard Distance?
a. It takes into account the angle between the 2 points.
b. It looks at the difference and similarities between sets of values.
c. It describes distance by squaring each term, summing them, and taking the square root.
d. It is obtained by adding up the absolute value of each term.
226. What is the advantage of the L1 distance over the L2?
a. It's better for data where the location of occurrence is less important.
b. It's useful for coordinate-based measurements.
c. It shows the difference between sets of values.
d. It can better handle high-dimensional data.
227. What is another name for the L2 distance?
a. Hamming Distance b. Euclidean Distance c. Manhattan Distance d. Mahalanobis Distance
228. Which of the following statements is a business case for the use of the Manhattan distance (L1)?
a. We use it in business cases where there is very high dimensionality.
b. We use it in business cases with outliers.
c. We use it in business cases where the dimensionality is unknown.
d. We use it in business cases where there is low dimensionality.
229. What is the key feature of Cosine Distance?
a. The Cosine Distance takes into account the angle between 2 points.
b. The size of the curve.
c. It is sensitive to the size of the data set.
d. It is not sensitive to the size of the data set.
230. Which of the following statements is an example of a business case where we can use the Cosine Distance?
a. Cosine is better for data such as text where the location of occurrence is less important.
b. Cosine is useful for coordinate-based measurements.
c. Cosine distance is more sensitive to the curse of dimensionality
d. Cosine distance is less sensitive to the curse of dimensionality
231. Which distance metric is useful when we have text documents and we want to group similar topics together?
a. Euclidean b. Mahalanobis Distance c. Manhattan Distance d. Jaccard
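Questions 224-231 contrast the common distance metrics. Here is a small sketch (the two example vectors and the word sets are arbitrary assumptions) computing Manhattan (L1), Euclidean (L2), cosine, and Jaccard distances with SciPy and plain Python sets.

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.0])

print(distance.cityblock(a, b))  # L1 / Manhattan: sum of absolute differences
print(distance.euclidean(a, b))  # L2 / Euclidean: square, sum, take the square root
print(distance.cosine(a, b))     # Cosine distance: based on the angle between the vectors

# Jaccard distance compares sets of values: 1 - |intersection| / |union|
s1, s2 = {"ml", "ai", "data"}, {"ml", "data", "cloud"}
print(1 - len(s1 & s2) / len(s1 | s2))
```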
232. How is a core point defined in the DBSCAN algorithm?
a. A point that has more than n_clu neighbors in its Ɛ-neighborhood.
b. A point that has no points in its Ɛ-neighborhood.
c. A point that has the same amount of n_clu neighbors within and outside the Ɛ-neighborhood.
d. An Ɛ-neighbor point has fewer than n_clu neighbors itself.
233. Why do we need a stopping criterion when we are using the HAC?
a. The algorithm will turn our data into small clusters.
b. The algorithm will turn our data into just one cluster.
c. The algorithm will not start working if we don’t assign a number of clusters.
d. The stopping criterion ensures centroids are calculated correctly.
234. According to the DBSCAN required inputs, which statement describes the n_clu input?
a. It's the function to calculate distance.
b. It's the radius of the local neighborhood.
c. It determines the density threshold (for fixed Ɛ) (The minimum amount of points for a particular point to
be considered a core point of a cluster)
d. It's the maximum amount of observations for a particular point to be considered a core point of a cluster.
235. Which of the following statements is a characteristic of the DBSCAN algorithm?
a. Can handle tons of data and weird shapes.
b. Finds uneven cluster sizes (one is big, some are tiny).
c. It will do a great performance finding many clusters.
d. It will do a great performance finding a few clusters.
236. Which of the following statements is a characteristic of the Hierarchical Clustering (Ward) algorithm?
a. If we use a mini-batch to find our centroids and clusters this will find our clusters fairly quickly.
b. It offers a lot of distance metrics and linkage options.
c. Too small epsilon (too many clusters) is not trustworthy.
d. Too large epsilon (too few clusters) is not trustworthy.
237. Which of the following statements is a characteristic of the Mean Shift algorithm?
a. Does not require setting the number of clusters; the number of clusters is determined automatically.
b. Bad with non-spherical cluster shapes.
c. You need to decide the number of clusters on your own, choosing the numbers directly or the minimum
distance threshold.
d. Good with non-spherical cluster shapes.
238. When using DBSCAN, how does the algorithm determine that a cluster is complete and it is time to move to a
different point of the data set and potentially start a new cluster?
a. When the algorithm requires you to change the input.
b. When the algorithm forms a new cluster using the outliers.
c. When no point is left unvisited by the chain reaction.
d. When the solution converges to a single cluster.
239. Which of the following statements correctly defines the strengths of the DBSCAN algorithm?
a. No need to specify the number of clusters (cf. K-means), allows for noise and can handle arbitrary-
shaped clusters.
b. Does well with different densities, works with just one parameter, and n_clu defines itself.
c. The algorithm will find the outliers first, draw regular shapes, and works faster than other algorithms.
d. The algorithm is computationally intensive, it is sensitive to outliers, and it requires a few hyperparameters to
be tuned
240. Which of the following statements correctly defines the weaknesses of the DBSCAN algorithm?
a. The clusters it finds might not be trustworthy, it needs noisy data to work, and it can’t handle subgroups.
b. It needs two parameters as input, finding appropriate values of Ɛ and n_clu can be difficult, and it does
not do well with clusters of different densities.
c. The algorithm will find the outliers first, it draws regular shapes, and works faster than other algorithms.
d. The algorithm is computationally intensive, it is sensitive to outliers, and it requires too many hyperparameters
to be tuned.
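Questions 232-240 describe how DBSCAN grows clusters from core points via a chain reaction. A rough sketch follows; the half-moon data is an illustrative assumption, and scikit-learn's `eps` and `min_samples` play the roles of Ɛ and n_clu in the questions. Points labeled -1 are the noise/outlier points DBSCAN allows for.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Arbitrary-shaped clusters (half moons) that k-means would struggle with.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps          ~ the radius of the local neighborhood (Ɛ)
# min_samples  ~ the density threshold (n_clu): minimum neighbors for a core point
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_  # cluster index per point; -1 marks noise/outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, " noise points:", int(np.sum(labels == -1)))
```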
241. (True/False) Does complete linkage refer to the maximum pairwise distance between clusters?
a. True b. False
242. Which of the following linkage methods computes the inertia and picks the pair that will ultimately
minimize the inertia value?
a. Single linkage b. Average linkage c. Ward linkage d. Complete linkage
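Questions 241-242 refer to the linkage criteria used in hierarchical (agglomerative) clustering. A brief sketch (the synthetic data is an assumption) comparing complete linkage, which merges clusters based on the maximum pairwise distance, with Ward linkage, which merges the pair that least increases within-cluster variance (inertia):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=200, centers=3, random_state=1)

# Complete linkage: distance between clusters = maximum pairwise point distance.
complete = AgglomerativeClustering(n_clusters=3, linkage="complete").fit(X)

# Ward linkage: merge the pair of clusters that least increases total within-cluster variance.
ward = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)

print(complete.labels_[:10], ward.labels_[:10])
```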
243. What is the purpose of dimensionality reduction in enterprise datasets?
a. To improve model performance by reducing the number of features used.
b. To create clusters for grouping data points.
c. To predict the target with the best accuracy.
d. To improve model performance by providing a ranking of the features and maximizing the features used
244. (True/False) Principal Component Analysis reduces dimensions by identifying features that can be excluded.
a. True b. False
245. Let’s say that PCA found two principal components v1 and v2. v1 accounts for 0.5 of the total amount of variance
in our dataset and v2 accounts for 0.24. Which one is more important and why?
a. v1 because we will be able to maintain more of the original variance in the dataset
b. v1 because it reduces 50% of the total variance in the data set.
c. v2 because it accounts for lower variance in the dataset
d. v2 because it reduces the amount of variance in the dataset
246. Select the option that best completes the following sentence: For data with many features, principal components
analysis
247. Which option correctly lists the steps for implementing PCA in Python?
1. Fit PCA to data 2. Scale the data
3. Determine the desired number of components based on total explained variance
4. Define a PCA object
a. 4, 1, 2, 3 b. 4, 1, 3, 2 c. 2, 4, 1, 3 d. 2, 1, 3, 4
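Question 247 orders the usual PCA steps: scale the data, define a PCA object, fit it, then determine the number of components from the total explained variance. A minimal sketch of that order in scikit-learn (the random data and the 95% variance threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.RandomState(0).rand(100, 10)   # illustrative data

# Step 1: scale the data (PCA is sensitive to feature scale).
X_scaled = StandardScaler().fit_transform(X)

# Step 2: define a PCA object.
pca = PCA()

# Step 3: fit PCA to the scaled data.
pca.fit(X_scaled)

# Step 4: choose the number of components from the cumulative explained variance.
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cumulative >= 0.95)) + 1
print("components needed for 95% of the variance:", n_components)
```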
248. Given the following matrix for lengths of singular vectors, how do we rank the vectors in terms of importance?
a. v1, v2, v3, v4
b. v4, v3, v2, v1
c. v1, v4, v3, v2
d. v2, v3, v4, v1
249. (True/False) In PCA, the first principal component represents the most important feature in the dataset.
a. True b. False
250. Given two principal components v1, v2, let’s say that
Feature f1 contributed 0.15 to v1 and 0.25 to v2.
Feature f2 contributed -0.11 to v1 and 0.4 to v2
Which feature is more important according to their total contribution to the components?
a. f1 because 0.15 + 0.25 > -0.11 + 0.4 c. f2 because -0.11 + 0.4 < 0.15 + 0.25
b. f2 because | -0.11 | + | 0.4 | > | 0.15 | + | 0.25 | d. Neither
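A quick arithmetic check for question 250, where importance is judged by the sum of absolute contributions across components:

```python
f1 = abs(0.15) + abs(0.25)   # 0.40
f2 = abs(-0.11) + abs(0.4)   # 0.51
print(f2 > f1)               # True: f2 has the larger total absolute contribution
```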
251. (True/False) If the number of components is equal to the dimension of the original features, kernel PCA will
reconstruct the data, returning the original.
a. True b. False
252. How does the goal of MDS (Multidimensional Scaling) compare to PCA?
a. PCA tries to maintain geometric distances between data points, whereas MDS tries to preserve variance within
data.
b. Both MDS and PCA try to preserve variance within data.
c. MDS tries to maintain geometric distances between data points, whereas PCA tries to preserve variance
within data.
d. Both MDS and PCA try to maintain geometric distances between data points.
253. Given the data visualized below with the classes represented by different colors, should PCA or kernel PCA be
used, and why?
a. Kernel PCA because the data is not linearly separable.
b. Neither because the data cannot be projected onto a lower dimension.
c. PCA because the data is clearly separable when projected onto a lower dimension.
d. Either is fine because the two classes are clearly separable.
254. What is the main difference between kernel PCA and linear PCA?
a. The objective of linear PCA is to decrease the dimensionality of the space whereas the objective of Kernel PCA
is to increase the dimensionality of the space.
b. Kernel PCA tends to uncover non-linearity structures within the dataset by increasing the
dimensionality of the space thanks to the kernel trick.
c. Kernel PCA and Linear PCA are both linear dimensionality reduction algorithms, but they use different
optimization methods.
d. Kernel PCA tends to preserve the geometric distances between the points while reducing the dimensionality of
the space.
255. (True/False) Multi-Dimensional Scaling (MDS) focuses on maintaining the geometric distances between points.
a. True b. False
256. Which of the following data types is more suitable for Kernel PCA than PCA?
a. Data where the classes are not linearly separable.
b. Data with linearly separable classes.
c. Data that do not need to be mapped to a higher dimension to distinguish categories.
d. None; they can be used interchangeably.
257. By applying MDS, you are able to:
a. Attain higher dimensions for the features.
b. Maximize the distance between data points in a lower dimension.
c. Preserve variance within the original data.
d. Find embeddings for points so that their distance is the most similar to the original distance.
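Questions 252, 255, and 257 characterize MDS as preserving the pairwise geometric distances between points rather than the variance. A minimal sketch (the random high-dimensional points are an assumption) embedding data into 2-D with scikit-learn's MDS:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.RandomState(0)
X = rng.rand(30, 6)   # illustrative high-dimensional points

# MDS searches for a low-dimensional embedding whose pairwise distances
# match the original pairwise distances as closely as possible.
mds = MDS(n_components=2, random_state=0)
X_2d = mds.fit_transform(X)
print(X_2d.shape, round(mds.stress_, 3))   # stress_: residual mismatch in distances
```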
258. Which one of the following hyperparameters is NOT considered when using GridSearchCV for Kernel PCA?
a. n_clusters b. n_components c. gamma d. kernels
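Questions 253-258 cover kernel PCA for data that is not linearly separable and tuning it with GridSearchCV. The sketch below is only one possible setup (the pipeline with a downstream classifier, the concentric-circles data, and the parameter grid are assumptions for illustration); the hyperparameters typically searched for kernel PCA are n_components, kernel, and gamma, not n_clusters.

```python
from sklearn.datasets import make_circles
from sklearn.pipeline import Pipeline
from sklearn.decomposition import KernelPCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Concentric circles: not linearly separable, so kernel PCA is a better fit than plain PCA.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

pipe = Pipeline([
    ("kpca", KernelPCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hyperparameters commonly searched for kernel PCA.
param_grid = {
    "kpca__n_components": [2, 3],
    "kpca__kernel": ["rbf", "poly"],
    "kpca__gamma": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```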
259. In which case would you prefer using PCA over NMF?
a. When canceling out with negative values is not desired.
b. When you have a linear combination of features.
c. When the original decomposition strictly contains positive values.
d. When you want to decompose videos, music, or images.
260. Which of the following is the most suitable for NMF?
a. Reconstruct a text document with learned topics (features).
b. Analyze potential movements and relationships of multiple stocks.
c. Predict the price of a rental space based on location, facility, and average rent in the surrounding area.
d. Learn features for a dataset in which negative values are highly insightful and valuable.
261. What is the key difference between NMF and PCA?
a. NMF decomposes the original matrix, whereas PCA does not.
b. PCA finds a representation of the data in a lower dimension, whereas NMF does not.
c. The input matrix for NMF consists of only positive values.
d. NMF requires the vectors created to be orthogonal, whereas no such constraint applies to PCA.
262. (True/False) In some applications, NMF can make for more human interpretable latent features.
a. True b. False
263. Which of the following set of features is the least adapted to NMF?
a. Word Count of the different words present in a text. c. Pixel color values of an Image.
b. Spectral decomposition of an audio file. d. Monthly returns of a set of stock portfolios.
264. (True/False) The NMF can produce different outputs depending on its initialization.
a. True b. False
265. In Practice lab: Non-Negative Matrix Factorization, why did we use "pairwise_distances" from scikit-learn?
a. To calculate the pairwise distance between points of the NMF-encoded version of the original dataset.
b. To calculate the pairwise distance between the NMF encoded version of the original dataset and the
encoded query dataset.
c. To calculate the pairwise distance between data points for eliminating outliers.
d. To calculate the maximum pairwise distance between points in the dataset.
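Questions 259-265 deal with NMF on non-negative data and with comparing NMF-encoded rows. The snippet below is a rough sketch, not the lab's code: the tiny document-term matrix and the query row are made-up assumptions. It encodes the data with NMF, encodes a query in the same latent space, and uses scikit-learn's pairwise_distances to compare the encoded query against the encoded dataset.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics import pairwise_distances

# Tiny non-negative "document-term" matrix (counts); NMF requires non-negative input.
X = np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [0, 4, 3, 0],
    [0, 3, 4, 1],
], dtype=float)

nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X)          # encoded (latent topic) representation of each row

query = np.array([[1, 0, 1, 0]], dtype=float)
W_query = nmf.transform(query)    # encode the query in the same latent space

# Distance between the encoded query and every encoded row of the dataset.
d = pairwise_distances(W_query, W, metric="euclidean")
print("closest row:", int(np.argmin(d)))
```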
266. (True/False) Machine Learning consists of programming computers to learn from real-time human interactions
a. True b. False
267. Which of the following options does not conform to the best practice of modeling in Supervised Machine
Learning?
a. Use the cost function to fit the model.
b. Use loss function to fit the model.
c. Develop multiple models.
d. Compare results and choose the best one.
268. In Linear Regression, which statement is correct about Sum Squared Error?
a. The Sum Squared Error measures the distance between the truth and predicted values.
b. The Sum Squared Error measures the distance between the truth and the average values of the truth.
c. The Sum Squared Error is a measure of the explained variation of our model.
d. The Sum Squared Error measures the distance between the predicted values and the average values of the truth.
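Question 268 distinguishes the sum of squared errors (truth vs. predictions) from the total variation of the truth around its own mean. A tiny numeric check with made-up values (the arrays are assumptions):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.5, 6.0])

sse = np.sum((y_true - y_pred) ** 2)         # distance between truth and predicted values
sst = np.sum((y_true - y_true.mean()) ** 2)  # distance between truth and its own average
print(sse, sst, 1 - sse / sst)               # the last term is R^2, the explained variation
```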
269. Which of the following is the type of Machine Learning that uses only data with outcomes to build a model?
a. Supervised Machine Learning
b. Unsupervised Machine Learning
c. Mixed Machine Learning
d. Semi Supervised Learning
270. Select the correct syntax to obtain the data split that will result in a train set that is 60% of the size of your
available data.
a. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.6)
b. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)
c. X_train, y_test = train_test_split(X, y, test_size=0.40)
d. X_train, y_test = train_test_split(X, y, test_size=0.6)
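Question 270 asks for the call that leaves 60% of the data for training. A quick sketch (the toy arrays are assumptions) confirming that test_size=0.4 yields a 60/40 train/test split; random_state is added only for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 toy samples, 2 features
y = np.arange(10)

# test_size=0.4 holds out 40% for testing, leaving 60% for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
print(len(X_train), len(X_test))   # 6 4
```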