Itae006 Test 1 and 2
Chapter 1
1. Which of the following statements is true?
(c) Statistics algorithms are not as efficient or stable for small data.
(a) Predictive models require data in the form of two-dimensional data (rows and columns).
(b) Often, deployment of predictive models requires a shift in resources for an organization.
3. What is the format in which data must be available for predictive modelling?
4. Computational methods to discover and report influential patterns in data are known as
(a) Data mining (b) Data discovery (c) Data analytics (d) All of the above
(a) Data analytics (b) Predictive analytics (c) Data discovery (d) All of the above
BATC601 – Predictive Analytics
7. The technique in which inputs are analyzed and grouped/clustered based on the proximity of input values to
one another is
Chapter 2
1. Which of the following is a standard data mining methodology?
5. What is true about predictive modelling algorithms, assuming there are two customer
records in the data who are actually brother and sister?
6. Assume that we have records of each visit by a customer to a medical shop. Which of the
following will be a derived variable?
(a) Average error (b) Confusion matrix (c) Median error (d) Median absolute error
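As a rough illustration of how the error summaries named in these options differ, the sketch below computes three of them for a handful of made-up predictions (the values are hypothetical and not taken from the course material). Signed errors can cancel each other in an average, which is not possible once absolute values are taken.

```python
from statistics import mean, median

# Hypothetical actual and predicted values, for illustration only.
actual = [10.0, 12.0, 9.0, 15.0, 11.0]
predicted = [12.0, 10.0, 9.5, 13.0, 13.5]

# Signed error of each prediction (predicted minus actual).
errors = [p - a for p, a in zip(predicted, actual)]

average_error = mean(errors)            # positive and negative errors offset each other
median_error = median(errors)           # middle signed error
median_absolute_error = median(abs(e) for e in errors)  # middle error magnitude

print(average_error, median_error, median_absolute_error)
```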
8. Which of the following metrics is best suited to assess model accuracy in a continuous-valued
estimation problem?
9. According to CRISP-DM, how many phases are there in a data-mining project life cycle?
(a) PCC (b) ROC (c) AVC (d) None of the above
12. Mean, median, and mode give a clear picture of data spread and variability.
(a) True (b) False
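One way to reason about this statement is to compare two small data sets. The sketch below uses made-up values (an assumption for illustration only) that share the same mean, median, and mode, and then computes the population standard deviation of each as a measure of spread.

```python
from statistics import mean, median, mode, pstdev

# Two hypothetical samples with identical measures of central tendency.
narrow = [4, 5, 5, 5, 6]
wide = [0, 5, 5, 5, 10]

for name, data in (("narrow", narrow), ("wide", wide)):
    # Central tendency is identical; only the standard deviation differs.
    print(name, mean(data), median(data), mode(data), round(pstdev(data), 3))
```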
Chapter 3
1. What is true about a distribution measured by kurtosis?
(a) The median would not change much when there is a single large outlier.
(b) The mean would not change much when there is a single large outlier.
(c) The mean is defined as the value that is exactly 50 percent of the way from the minimum
to maximum value of the variable.
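The effect of a single large outlier on the mean and on the median can be checked directly. The following is a minimal sketch with made-up numbers (hypothetical, not from the course material).

```python
from statistics import mean, median

# Hypothetical values, then the same values plus one large outlier.
values = [10, 12, 13, 15, 20]
with_outlier = values + [1000]

# How far each statistic moves when the outlier is added.
mean_shift = mean(with_outlier) - mean(values)
median_shift = median(with_outlier) - median(values)

print("mean shift:", mean_shift)
print("median shift:", median_shift)
```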
4. If the value of a variable can range from negative infinity to positive infinity, what is the
type of this variable?
(b) The mean and the median are not the same value.
(c) The median and the mode are not the same value.
(d) The mean, median and mode are all the same value.
(a) Approximately 95% of the data will fall between the mean and +/-1 standard deviation
from the mean.
(b) Approximately 95% of the data will fall between the mean and +/-2 standard
deviations from the mean.
(c) Approximately 95% of the data will fall between the mean and +/-3 standard deviations
from the mean.
(d) Approximately 95% of the data will fall between the mean and +/-4 standard deviations
from the mean.
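Assuming these options refer to a normal distribution (the question stem is not shown here, so this is an assumption), the share of data within k standard deviations of the mean can be computed from the standard normal cumulative distribution function, as in this sketch.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist()

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3, 4):
    print(k, round(share_within(k), 4))
```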
(d) The mean and the midpoint of the distribution are not the same.
9. What is the phenomenon called when a trend is seen in individual variables, but is reversed
when variables are combined?
(a) Simpson's paradox (b) Redskin Rule (c) Anscombe's Quartet (d) Platykurtic
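A reversal of this kind can be reproduced with a few counts. The sketch below uses the success counts from the widely cited kidney-stone treatment illustration, which is not part of the course material; the helper names are hypothetical. It compares two treatments within each subgroup and then after the subgroups are pooled.

```python
# (successes, patients) per treatment and stone size, taken from the
# widely cited kidney-stone illustration; not from the course material.
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, total):
    return successes / total

def overall(treatment):
    # Pool the subgroups before computing the success rate.
    groups = counts[treatment].values()
    return rate(sum(s for s, _ in groups), sum(n for _, n in groups))

for size in ("small", "large"):
    print(size, round(rate(*counts["A"][size]), 3), round(rate(*counts["B"][size]), 3))
print("overall", round(overall("A"), 3), round(overall("B"), 3))
```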
(d) 86% of data lies between the mean and the +/-1 standard deviation
11. Generally, one-dimensional data distribution visualization can be done using __________
(a) The distribution is infinite (b) Mean and midpoint are different
(c) Distribution is symmetric about the mean (d) None of the above
Chapter 4
1. What is NOT a key step in data preparation related to the columns in the data?
2. What is NOT a key step in data preparation related to the rows in the data?
(b) Separate the outliers and create separate models just for outliers
(a) Missing Completely at Random (MCAR) implies a conditional relationship between the
missing value and other variables
(b) Missing at Random (MAR) means that there is no way to determine what the value should
have been
(c) Missing Not at Random (MNAR) means the missing value can be inferred in general
by the mere fact that the value is missing
12. Models that are accurate, on the data used to train the models, generally show
(a) Underfitting (b) Overfitting (c) Randomness (d) All of the above
Chapter 5
1 1 1 0 0 0
2 0 0 1 0 0
3 0 0 0 1 1
4 1 1 1 0 0
5 0 1 0 0 0
1. Consider an example rule for the supermarket, {butter, bread} → {milk}, meaning that if
butter and bread are bought, customers also buy milk. In this rule, which is the
antecedent?
2. Consider an example rule for the supermarket, {beer} → {diaper}, meaning that if beer
was bought, customers also buy diapers. In this rule, which is the consequent?
3. In the example data set of the supermarket given above, what is the support for {bread}
→ {milk}?
4. In the example data set of the supermarket given above, what is the Antecedent
Support for {bread} → {milk}?
5. In the example data set of the supermarket given above, what is the Confidence for
{bread} → {milk}?
6. In the example data set of the supermarket given above, what is the Lift for {milk,
bread} → {butter}?
(a) 1 (b) 2.5 (c) 4 (d) None of the above
(a) Data in standard predictive analytics format can have an extraordinarily large number of
columns
(c) Data in standard predictive analytics format will normally have fewer
rows compared to transactional format.
9. ________________ is defined as the number of times a rule occurs in data divided by the
number of transactions in the data?
10. ________________ is a measure of how many times more likely the consequent will
occur when antecedent is true compared to how often the consequent occurs on its own
11. Generally, __________ is a data format in which there are only a few columns but many
rows.
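The support, antecedent support, confidence, and lift quantities asked about in this chapter can be computed mechanically from a 0/1 transaction table. The sketch below encodes the five rows of the example table; the column headings are missing from the table as shown, so the order milk, bread, butter, beer, diapers is an assumption, and the function names are hypothetical.

```python
# ASSUMPTION: column order is milk, bread, butter, beer, diapers.
# The table in this document carries no column headings.
items = ["milk", "bread", "butter", "beer", "diapers"]
rows = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
    [0, 1, 0, 0, 0],
]

# Convert each 0/1 row into the set of items bought in that transaction.
transactions = [{item for item, flag in zip(items, row) if flag} for row in rows]

def support(itemset):
    """Share of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Rule support divided by antecedent support."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Confidence divided by the support of the consequent on its own."""
    return confidence(antecedent, consequent) / support(consequent)

print(support({"bread", "milk"}), support({"bread"}))
print(confidence({"bread"}, {"milk"}))
print(lift({"milk", "bread"}, {"butter"}))
```

Under a different column order the printed values would change, so the results should be read only alongside the stated assumption.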
Chapter 6
1. Which of the following statements is incorrect?
(a) Descriptive modelling algorithms are also called unsupervised learning methods.
(c) Descriptive modelling algorithms discover the best way to segment the data.
(d) Descriptive modelling algorithms try to find relationships that associate inputs to
one or more target variables.
(b) Kohonen Self-Organizing Maps (SOM) needs all data to be populated; there can be no
missing values.
(c) Inputs need not be numeric for Kohonen Self-Organizing Maps (SOM) algorithm.
(d) When using Principal Component Analysis (PCA), any categorical variable to be included
in the model must be converted to a number.
4. Which of the following algorithms is best suited for reducing the number of inputs for
predictive models?
5. Which of the following is NOT one of the distance metrics used in building the K-Means
clustering model?
(a) Mahalanobis distance metric (b) Milwaukee distance metric
(a) Perceptron
(a) The algorithm will determine it dynamically (b) It must be pre-specified
(a) A number of weights (b) Number of clusters (c) Value, one per unit (d) One per unit
(b) Predetermined
(d) Randomly
Chapter 7
2. Software provides useful information about clusters but fails to explain _________
3. If software does not provide summaries, then it is impossible to generate a summary from
the clusters.
(a) removal
(b) scaling
(c) inclusion
5. Table 7.2 shows that clusters 1 and 2 have a higher number of gifts than the average.
(a) without normalization (b) with normalization (c) with compression (d) none of the above
8. As a rule of thumb or guiding principle, the ANOVA method works _______ when there are ___
clusters.
10. Decision trees are not distance-based algorithms and therefore are ______ by _____ and
skewed distributions.
11. In a multivariate problem, ANOVA determines which variables have the most significant
difference in __________ values between the clusters.
(a) Cluster mean (b) Cluster average (c) Cluster centre (d) Cluster median
Chapter 9
1. The choice of the model assessment metric should be tied to ________ rather than
_______
(d) None
(a) training data (b) new data (c) test data (d) None of the above
3. Generally, a ________ built classifier always has a percent correct classification (PCC) in
the numeric range of ________
(a) well, 50 to 100 (b) badly, 50 to 100 (c) badly, 0 to 10 (d) well, 0 to 10
(a) good (b) poor (c) best (d) None of the above
6. Which of the following confusion matrix measures uses all quadrants of the confusion matrix?
(a) PCC (b) Recall (c) Precision (d) None of the above
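For reference, the usual definitions of the three named measures can be written out from the four cells of a binary confusion matrix. The counts below are made up for illustration (hypothetical, not from the course material).

```python
# Hypothetical confusion matrix cells for a binary classifier.
tp, fp, fn, tn = 40, 10, 5, 45  # true pos., false pos., false neg., true neg.

# Percent correct classification: correct decisions over all records.
pcc = (tp + tn) / (tp + fp + fn + tn)

# Precision: of the records predicted positive, the share that are positive.
precision = tp / (tp + fp)

# Recall: of the records that are actually positive, the share that were found.
recall = tp / (tp + fn)

print(pcc, precision, round(recall, 4))
```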
8. In general, the model assessment metric that most closely matches the business objectives should be used.
10. Which of the following is a commonly used metric for regression problems?
11. Which one is the target for optimizing model parameters in Linear Regression?
Chapter 10
(a) high bias, low variance. (b) high bias, high variance.
(c) low bias, high variance. (d) low bias, low variance.
4. In boosting algorithm, final predictions are made based on ________ of predictions from
all models.
(a) average (b) median (c) weighted average (d) None of the above
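To make the difference between a plain average and a weighted average of model predictions concrete, the sketch below combines three predictions both ways. The predictions and weights are hypothetical; real boosting implementations derive the model weights in their own way, which is not shown here.

```python
from statistics import mean

def weighted_average(predictions, weights):
    """Combine predictions so that models with larger weights count for more."""
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

# Hypothetical predictions from three models and hypothetical model weights.
predictions = [0.2, 0.6, 0.9]
weights = [1.0, 2.0, 1.0]

print(mean(predictions))                       # every model counts equally
print(weighted_average(predictions, weights))  # second model counts double
```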
5. At each split in the tree, rather than considering all input variables as candidates, only a
random subset of variables is considered in random forest.
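The idea of considering only a random subset of input variables at each split can be sketched in a few lines. This is an illustration of the sampling step only, not a random forest implementation; the variable names and the subset size are hypothetical.

```python
import random

def candidate_variables(all_variables, n_candidates, rng):
    """Draw the variables that one split is allowed to consider."""
    return rng.sample(all_variables, n_candidates)

# Hypothetical input variables for a model.
variables = ["age", "income", "tenure", "visits", "region", "gifts"]
rng = random.Random(0)  # seeded so repeated runs draw the same subsets

# Each split gets its own draw, so different splits see different candidates.
for split in range(3):
    print(split, candidate_variables(variables, 2, rng))
```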
6. TreeNET has proven to be an accurate predictor with the benefit that very little data
cleanup is needed for the trees before modelling.
7. Ensembles are methods that not only increase model accuracy but also
9. Ensembles are often considered black box models, meaning that what they do is not
transparent to the modeler or domain expert.
11. In general, for regression, a high-complexity model has a ___________ bias but a ______ variance
on the training data set.
(a) High, low (b) Low, high (c) Low, low (d) High, high