Itae006 Test 1 and 2

This document contains sample questions from chapters 1-4 of a textbook on predictive analytics. The questions cover topics like statistics concepts, predictive modeling challenges, CRISP-DM methodology, data preparation techniques, and data distributions. For example, one question asks about the format data must be in for predictive modeling, and the answer is two-dimensional with rows and columns.

Uploaded by

Nageshwar Singh
Copyright
© All Rights Reserved

BATC601 – Predictive Analytics

Chapter 1
1. Which of the following statements is true?

(a) Statistics is often based on non-parametric algorithms; no guaranteed optimum.

(b) In statistics, models are typically nonlinear.

(c) Statistics algorithms are not as efficient or stable for small data.

(d) In statistics, data is typically smaller, the model is important.

2. What are the challenges in using Predictive Analytics?

(a) Predictive models require data in the form of two-dimensional data (rows and columns).

(b) Often, deployment of predictive models requires a shift in resources for an organization.

(c) The models become too complex because of overfitting.

(d) All of the above.

3. What is the format in which data must be available for predictive modelling?

(a) One-dimension (b) Two-dimension (c) Three-dimension (d) n-Dimension

4. Computational methods to discover and report influential patterns in data are known as

(a) Data mining (b) Data discovery (c) Data analytics (d) All of the above

5. Predictive analytics is the process of

(a) Just cleaning data (b) Just compressing data

(c) Guessing about present output without any data

(d) Information retrieval to make useful predictions about future outcomes

6. Discovering interesting and meaningful patterns in data is known as

(a) Data analytics (b) Predictive analytics (c) Data discovery (d) All of the above
7. Analyzing and grouping/clustering inputs based on the proximity of input values to one
another is

(a) Supervised learning (b) Unsupervised learning

(c) Descriptive modelling (d) Both (b) and (c)

Chapter 2
1. Which of the following is a standard data mining methodology?

(a) CRISP-DM (b) SPSS (c) Clementine (d) Mineset

2. Which of the following is NOT a step in CRISP-DM?

(a) Business understanding (b) Customer understanding

(c) Data understanding (d) Modelling

3. Which are the experts critical for success of predictive modelling?

(a) Domain experts (b) Data or database experts

(c) Predictive modelling experts (d) All of the above

4. Which of the following should be resolved during Business Understanding stage?

(a) What data is available to quantify the business objectives.

(b) Examine key summary characteristics about the data.

(c) Begin to enumerate problems with the data.

(d) Visualize data to gain further insights.

5. What is true about predictive modelling algorithms, assuming there are two customer
records in the data who are actually brother and sister?

(a) Predictive algorithms treat the two customers' records as dependent

(b) Predictive algorithms must know they are related


(c) Predictive algorithms treat these two no differently than any other two people with
similar patterns or behaviour

(d) None of the above

6. Assume that we have records of each visit by a customer to a medical shop. Which of the
following will be a derived variable?

(a) Customer's name (b) Average spend in the last month

(c) Date of birth (d) Doctor who prescribed the medicine

7. Which of the following is a metric used to assess model accuracy in classification
problem?

(a) Average error (b) Confusion matrix (c) Median error (d) Median absolute error

8. Which of the following metrics is best suited to assess model accuracy in a continuous-valued
estimation problem?

(a) Percent correct classification (b) Area under the curve

(c) Average error (d) Confusion matrix

9. According to CRISP-DM, how many phases are there in a data-mining project life cycle?

(a) Five (b) Six (c) Four (d) Seven

10. The most frequently used metric to assess model accuracy in classification problems is

(a) PCC (b) ROC (c) AVC (d) None of the above

11. _______________ determine magnitude of error

(a) Average errors (b) Mean squared error

(c) Median error (d) Average absolute error

12. Mean, median, and mode give a clear picture of data spread and variability.
(a) True (b) False

Chapter 3
1. What is true about a distribution measured by kurtosis?

(a) Kurtosis is always negative.

(b) Normal distribution will have a kurtosis value of 2.

(c) A leptokurtic distribution is one in which the kurtosis value is more than 4.

(d) A platykurtic distribution is one in which the kurtosis value is greater than 3.

2. Which of the following statements is true?

(a) The median would not change much when there is a single large outlier.

(b) The mean would not change much when there is a single large outlier.

(c) The mean is defined as the value that is exactly 50 percent of the way from the minimum
to maximum value of the variable.

(d) The calculation of mean requires the data to be first sorted.
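
The effect of a single large outlier on the mean and the median can be checked with a small sketch. The sample values below are hypothetical and chosen only for illustration; the sketch uses Python's standard `statistics` module.

```python
import statistics

# Hypothetical sample values, chosen only for illustration.
values = [10, 12, 13, 15, 16]
with_outlier = values + [1000]  # the same data plus one large outlier

for label, data in (("without outlier", values), ("with outlier", with_outlier)):
    # The mean uses every value; the median only depends on the middle of the sorted data.
    print(label, "mean =", statistics.mean(data), "median =", statistics.median(data))
```

In this sample the median moves by one unit while the mean grows by more than a factor of ten.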

3. How many two-way combinations/interactions are possible if the number of variables
is 5?

(a) The number of possible two-way interactions is 2.

(b) The number of possible two-way interactions is 5.

(c) The number of possible two-way interactions is 10.

(d) The number of possible two-way interactions is 20.
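
Counting two-way interactions amounts to counting unordered pairs of variables. A minimal sketch, assuming five variables with the hypothetical names `x1` to `x5`:

```python
from itertools import combinations
from math import comb

# Hypothetical variable names; only the count of variables matters here.
variables = ["x1", "x2", "x3", "x4", "x5"]

# Each two-way interaction is an unordered pair of distinct variables.
pairs = list(combinations(variables, 2))

print(pairs)
print(len(pairs), comb(len(variables), 2))  # both count n * (n - 1) / 2 pairs
```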

4. If the value of a variable can range from negative infinity to positive infinity, what is the
type of this variable?

(a) Categorical variables (b) Continuous variables

(c) Binary variables (d) Numeric Variables

5. Which of the following is a property of normal distribution?


(a) The distribution is asymmetric.

(b) The mean and the median are not the same value.

(c) The median and the mode are not the same value.

(d) The mean, median and mode are all the same value.

6. Which of the following is a property of normal distribution?

(a) Approximately 95% of the data will fall between the mean and +/-1 standard deviation
from the mean.

(b) Approximately 95% of the data will fall between the mean and +/-2 standard
deviations from the mean.

(c) Approximately 95% of the data will fall between the mean and +/-3 standard deviations
from the mean.

(d) Approximately 95% of the data will fall between the mean and +/-4 standard deviations
from the mean.
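
The share of a normal distribution that lies within k standard deviations of the mean can be computed from the error function. The sketch below is an illustration only; the helper name `normal_coverage` is an assumption, not something from the course material.

```python
import math

def normal_coverage(k: float) -> float:
    """Fraction of a normal distribution within +/-k standard deviations of the mean."""
    # For a standard normal Z, P(|Z| < k) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3, 4):
    print(f"within +/-{k} standard deviations: {normal_coverage(k):.4f}")
```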

7. Which of the following is a property of Uniform Distribution?

(a) The distribution is asymmetric about the mean.

(b) The distribution is infinite.

(c) The distribution is finite, with a minimum and maximum value.

(d) The mean and the midpoint of the distribution are not the same.

8. What is the value of skew in normal distribution?

(a) Less than 1 (b) 0 (c) 1 (d) Greater than 1

9. What is the phenomenon called when a trend is seen in individual variables, but is reversed
when variables are combined?

(a) Simpson's paradox (b) Redskin Rule (c) Anscombe's Quartet (d) Platykurtic

10. Which of the following is not a property of normal distribution?


(a) It is symmetric (b) Mean, median and mode are all same

(c) It is also called bell curve

(d) 86% of data lies between the mean and the +/-1 standard deviation

11. Generally, one-dimensional data distribution visualization can be done using __________

(a) Scatter plot (b) Histogram

(c) Scatterplot matrices (d) Anscombe's quartet

12. Which of these is true about Uniform Distribution?

(a) The distribution is infinite (b) Mean and midpoint are different

(c) Distribution is symmetric about the mean (d) None of the above

13. Which of these is false about correlations between two variables?

(a) Measures the numerical relationship of one variable to another's

(b) One variable meaning is related to another's

(c) Both of these

(d) None of the above

Chapter 4

1. What is NOT a key step in data preparation related to the columns in the data?

(a) Variable naming (b) Variable cleaning

(c) Variable selection (d) Feature creation

2. What is NOT a key step in data preparation related to the rows in the data?

(a) Record Selection (b) Sampling

(c) Feature Creation (d) Record Archiving



3. What are the different approaches to handle outliers in data?

(a) Remove the outliers from the modelling data

(b) Separate the outliers and create separate models just for outliers

(c) Transform the outliers so that they are no longer outliers

(d) Bin the data

(e) All of the above

4. Which of the following statements is correct?

(a) Missing Completely at Random (MCAR) implies a conditional relationship between the
missing value and other variables

(b) Missing at Random (MAR) means that there is no way to determine what the value should
have been

(c) Missing Not at Random (MNAR) means the missing value can be inferred in general
by the mere fact that the value is missing

(d) All of the above

5. Which of the following is a typical method to correct negative skew in distribution?

(a) Log Transform Ex: log(x)

(b) Multiplicative inverse 1/x

(c) Square Root or sqrt(x)

(d) Power Transform Ex: x^n

6. If the distribution has spikes, what is a good corrective action?

(a) Binning into regions centred on spikes (b) Log10 transform

(c) Power transform (d) Flip transform

7. Which of the following is NOT a Single-Variable Selection Technique?


(a) Chi-square Test (b) Simpson's Paradox

(c) ANOVA (d) Linear regression forward selection (1 step)

8. What sampling technique do statisticians typically use to assess model stability?

(a) Cross Validation (b) Curse of dimensionality

(c) Rule of Thumb (d) Temporal Sequencing

9. MCAR stands for

(a) Missing completely at random (b) Missing conditional at random

(c) Missing convolute at random (d) None of these

10. A missing value is coded as

(a) none (b) zero (c) False (d) null

11. In general, min-max normalization changes range of a variable to

(a) -100 to 100 (b) -50 to 50 (c) -1 to 1 (d) 0 to 1
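
Min-max normalization rescales each value by the observed minimum and range of the variable. A minimal sketch, assuming the common form (x - min) / (max - min); the function name and the sample values are hypothetical.

```python
def min_max_scale(values):
    """Rescale values with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant variable has no range to rescale.
        raise ValueError("cannot scale a constant variable")
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical sample values, chosen only for illustration.
scaled = min_max_scale([20, 35, 50, 80])
print(scaled)
```

The smallest input maps to the lower end of the new range and the largest to the upper end.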

12. Models that are accurate, on the data used to train the models, generally show

(a) Underfitting (b) Overfitting (c) Randomness (d) All of the above

Chapter 5

Following is an example database/dataset of a supermarket with five transactions and five
items (milk, bread, butter, beer, diapers). A purchase is indicated by 1 in the item column.

Transaction ID  Milk  Bread  Butter  Beer  Diapers
1               1     1      0       0     0
2               0     0      1       0     0
3               0     0      0       1     1
4               1     1      1       0     0
5               0     1      0       0     0

Following questions are based on the above sample database:

1. Consider an example rule for the supermarket, {butter, bread} → {milk}, meaning that if
butter and bread are bought, customers also buy milk. In this rule, which is the
antecedent?

(a) {butter, bread} (b) {milk} (c) {butter} (d) {bread}

2. Consider an example rule for the supermarket, {beer} → {diaper}, meaning that if beer
was bought, customers also buy diaper. In this rule, which is the consequent?

(a) {beer} (b) {diaper} (c) → (d) None of the above

3. In the example data set of the supermarket given above, what is the support for {bread}
→ {milk}?

(a) 40% (b) 60% (c) 66.67% (d) 33.33%

4. In the sample dataset of the supermarket, what is the Antecedent
Support for {bread} → {milk}?

(a) 40% (b) 60% (c) 66.67% (d) 33.33%

5. In the sample dataset of the supermarket, what is the Confidence for
{bread} → {milk}?

(a) 40% (b) 60% (c) 66.67% (d) 33.33%

6. In the sample dataset of the supermarket, what is the Lift for {milk,
bread} → {butter}?

(a) 1 (b) 2.5 (c) 4 (d) None of the above

7. If the sample dataset of the supermarket is converted to Transactional
Format, how many rows of data will be present?

(a) 5 (b) 25 (c) 10 (d) 9
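
Converting the sample table to transactional format can be sketched as follows, assuming that transactional format means one row per (transaction, purchased item) pair. The variable names are hypothetical; the 0/1 values are copied from the sample table.

```python
# Sample supermarket table, one entry per transaction (1 = purchased).
wide = {
    1: {"milk": 1, "bread": 1, "butter": 0, "beer": 0, "diapers": 0},
    2: {"milk": 0, "bread": 0, "butter": 1, "beer": 0, "diapers": 0},
    3: {"milk": 0, "bread": 0, "butter": 0, "beer": 1, "diapers": 1},
    4: {"milk": 1, "bread": 1, "butter": 1, "beer": 0, "diapers": 0},
    5: {"milk": 0, "bread": 1, "butter": 0, "beer": 0, "diapers": 0},
}

# Keep one (transaction ID, item) row for every purchased item.
transactional = [
    (tid, item)
    for tid, row in wide.items()
    for item, flag in row.items()
    if flag == 1
]

for row in transactional:
    print(row)
print("rows:", len(transactional))
```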

8. Which of the following statements is true?

(a) Data in Standard predictive analytics format can have extraordinarily large number of
columns

(b) In Standard predictive analytics format, representation of data will be sparse.

(c) Data in Standard predictive analytics format will normally have a lesser number of
rows compared to transactional format.

(d) All of the above.

9. ________________ is defined as the number of times a rule occurs in data divided by the
number of transactions in the data?

(a) Antecedent support (b) Confidence (c) Accuracy (d) Support

10. ________________ is a measure of how many times more likely the consequent will
occur when antecedent is true compared to how often the consequent occurs on its own

(a) Support (b) Confidence (c) Accuracy (d) Lift
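
The rule measures described in questions 9 and 10 can be computed directly from the sample supermarket table. The sketch below assumes the usual definitions: support is the share of transactions containing both antecedent and consequent, confidence is rule support divided by antecedent support, and lift is confidence divided by the support of the consequent. The function names are hypothetical.

```python
# Items bought in each transaction of the sample supermarket table.
transactions = {
    1: {"milk", "bread"},
    2: {"butter"},
    3: {"beer", "diapers"},
    4: {"milk", "bread", "butter"},
    5: {"bread"},
}

def support(items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for basket in transactions.values() if items <= basket)
    return hits / len(transactions)

def rule_metrics(antecedent, consequent):
    """Support, antecedent support, confidence and lift for antecedent -> consequent."""
    rule_support = support(set(antecedent) | set(consequent))
    antecedent_support = support(antecedent)
    confidence = rule_support / antecedent_support
    lift = confidence / support(consequent)
    return {
        "support": rule_support,
        "antecedent_support": antecedent_support,
        "confidence": confidence,
        "lift": lift,
    }

print(rule_metrics({"bread"}, {"milk"}))
print(rule_metrics({"milk", "bread"}, {"butter"}))
```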

11. Generally, __ is a data format in which there are only a few columns but many
rows.

(a) Standard predictive modelling data format (b) Transactional format

(c) Key value (d) None of these

Chapter 6
1. Which of the following statements is incorrect?

(a) Descriptive modelling algorithms are also called unsupervised learning methods.

(b) Descriptive modelling algorithms try to find relationships between inputs.

(c) Descriptive modelling algorithms discover the best way to segment the data.

(d) Descriptive modelling algorithms try to find relationships that associate inputs to
one or more target variables.

2. Which of the following statements is incorrect?

(a) Decision Tree is a commonly used unsupervised modelling algorithm.

(b) K-Means clustering is a commonly used unsupervised modelling algorithm.

(c) Kohonen Self-Organizing Maps (SOM) is a commonly used unsupervised modelling
algorithm.

(d) Principal Component Analysis (PCA) is a commonly used unsupervised modelling
algorithm.

3. Which of the following statements is incorrect?

(a) Inputs must be numeric for K-Means clustering algorithm.

(b) Kohonen Self-Organizing Maps (SOM) needs all data to be populated, there can be no
missing values.

(c) Inputs need not be numeric for Kohonen Self-Organizing Maps (SOM) algorithm.

(d) When using Principal Component Analysis (PCA), any categorical variable to be included
in the model, must be converted to a number.

4. Which of the following algorithms is best suited for reducing the number of inputs for
predictive models?

(a) K-Means clustering (b) Kohonen Self-Organizing Maps (SOM)

(c) Principal Component Analysis (PCA) (d) All of the above

5. Which of the following is NOT one of the distance metrics used in building the K-Means
clustering model?
(a) Mahalanobis distance metric (b) Milwaukee distance metric

(c) Manhattan distance metric (d) Minkowski distance metric
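
Two of the distance metrics named above are closely related: the Minkowski distance with exponent p = 1 is the Manhattan distance, and with p = 2 it is the Euclidean distance. A minimal sketch with hypothetical points (the Mahalanobis distance is left out because it also needs a covariance matrix):

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

# Hypothetical points, chosen only for illustration.
point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

manhattan = minkowski(point_a, point_b, 1)  # p = 1: sum of absolute differences
euclidean = minkowski(point_a, point_b, 2)  # p = 2: straight-line distance

print("Manhattan:", manhattan)
print("Euclidean:", euclidean)
```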

6. Which of the following is widely used as an unsupervised learning neural network
algorithm?

(a) Perceptron

(b) Kohonen Self-Organizing Map (SOM)

(c) Both Perceptron and Kohonen Self-Organizing Map (SOM)

(d) None of the above

7. In K-Means, what is the number of clusters in the data?

(a) Algorithm will determine the same dynamically (b) It must be pre-specified

(c) It is always 2 (d) It is always 3

8. Which of these is not an unsupervised modelling algorithm?

(a) K-means clustering (b) Kohonen

(c) Self-organizing maps (SOMs) (d) Linear regression

9. In K-means, clusters model parameters are defined by

(a) A number of weights (b) Number of clusters (c) Value, one per unit (d) One per unit

10. Generally, in a Kohonen map, the number of nodes is

(a) Post determined after plotting map

(b) Predetermined

(c) Predetermined by length and width of map

(d) Randomly
Chapter 7

1. How clusters differ from one another is a _____________ problem.

(a) unsupervised learning (b) supervised learning

(c) reinforcement learning (d) hybrid

2. Software provides useful information about clusters but fails to explain _________

(a) How clusters are formed by algorithm (b) meaning of cluster

(c) Both (a) and (b) (d) None of the above

3. If software does not provide summaries, then it is impossible to generate summary from
clusters.

(a) True (b) False

4. Dummy variable __________ is helpful to reduce bias with dummy variables.

(a) removal

(b) scaling

(c) inclusion

(d) None of the above

5. Table 7-2 shows that clusters 1 and 2 have a higher number of gifts than the overall average.

Table 7-2: Cluster Centers for K-Means 3-Cluster Model

VARIABLE              Cluster 1  Cluster 2  Cluster 3  Overall
# Records in Cluster  8,538      8,511      30,656     47,705
LASTDATE              0.319      0.304      0.179      0.226
FISTDATE              0.886      0.885      0.908      0.900
RFA_2F                0.711      0.716      0.074      0.303
D_RFA_2A              0.382      0.390      0.300      0.331
E_RFA_2A              0.499      0.500      0.331      0.391
F_RFA_2A              0.369      0.366      0.568      0.496
DOMAIN3               0.449      0.300      0.368      0.370
DOMAIN2               0.300      0.700      0.489      0.493
DOMAIN1               0.515      0.300      0.427      0.420
NGIFTALL_log10        0.384      0.385      0.233      0.287
LASTGIFT_log10        0.348      0.343      0.430      0.400

(a) True (b) False

6. If data ______________ is applied to a clustering algorithm, then it is difficult to understand
the summaries after clustering.

(a) without normalization (b) with normalization (c) with compression (d) none of the above

7. Generally, interval and ratio variables are problematic to interpret.

(a) True (b) False

8. As a rule of thumb or guiding principle, the ANOVA method works _______ when there are ___
clusters.

(a) worst, small no. of (b) best, small no. of

(c) best, large no. of (d) worst, large no. of

9. Hierarchical clustering works well with large number of records.

(a) True (b) False

10. Decision trees are not distance-based algorithms and therefore are ______ by _____ and
skewed distributions.

(a) unaffected, outliers (b) affected, outliers


(c) affected, normalized (d) unaffected, normalized

11. In a multivariate problem, ANOVA determines which variables have the most significant
difference in __________ values between the clusters.

(a) Mean (b) variance (c) error (d) none

12. The mean value for the variables in each cluster is called the

(a) Cluster mean (b) Cluster average (c) Cluster centre (d) Cluster median

Chapter 9

1. The choice of the model assessment metric should be tied to ________ rather than
_______

(a) Operational considerations, algorithmic expedience

(b) Algorithmic expedience, operational considerations

(c) Algorithmic considerations, operational expedience

(d) None

2. Model assessment should be done first on __________

(a) training data (b) new data (c) test data (d) None of the above

3. Generally, a ________ built classifier always has a percent correct classification (PCC) in
the numeric range of ________

(a) well, 50 to 100 (b) badly, 50 to 100 (c) badly, 0 to 10 (d) well, 0 to 10

4. A lift of a model is a ratio of model accuracy to accuracy of a random guess.

(a) True (b) False


5. You must always be aware of the base rate to ensure models with a large baseline rate are
not perceived as _______ models.

(a) good (b) poor (c) best (d) None of the above

6. Which of the following confusion matrix measures uses all quadrants of confusion matrix?

(a) PCC (b) Recall (c) Precision (d) None of the above
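
How each measure draws on the cells of a confusion matrix can be seen from the formulas. The sketch below uses hypothetical counts for a binary classifier and assumes the standard definitions of percent correct classification (PCC), precision and recall.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, fp, fn, tn = 40, 10, 20, 30
total = tp + fp + fn + tn

pcc = (tp + tn) / total      # uses all four cells (total is their sum)
precision = tp / (tp + fp)   # uses only the predicted-positive cells
recall = tp / (tp + fn)      # uses only the actual-positive cells

print(f"PCC = {pcc:.3f}, precision = {precision:.3f}, recall = {recall:.3f}")
```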

7. For gains, the metric is the percentage of ________ found by the model.

(a) 0s (b) 1s (c) 2s (d) None of the above

8. In general, model assessment which matches business objectives closely should be used.

(a) True (b) False

9. In assessing regression models, the value of R² should be ________________

(a) Fixed (b) depends on application (c) 0.3 (d) NOTA

10. Which of the following are commonly used metrics for regression problems?

(a) Average absolute error and R²

(b) R² and average squared error

(c) Average percentage error and R²

(d) All of the above

11. Which one is the target for optimizing model parameters in Linear Regression?

(a) Minimizes cross entropy

(b) Minimizes distance between data

(c) Minimizes squared mean error

(d) Minimizes mean squared error



Chapter 10

1. Model ensembles improve model accuracy and robustness.

(a) True (b) False

2. The best models have

(a) high bias, low variance. (b) high bias, high variance.

(c) low bias, high variance. (d) low bias, low variance.

3. ____________ is an important requirement for building good bagged ensembles.

(a) Underfitting the model (b) Overfitting the model

(c) Exact fitting the model (d) None of the above

4. In boosting algorithm, final predictions are made based on ________ of predictions from
all models.

(a) average (b) median (c) weighted average (d) None of the above

5. At each split in the tree, rather than considering all input variables as candidates, only a
random subset of variables is considered in random forest.

(a) True (b) False

6. TreeNET has proven to be an accurate predictor with the benefit that very little data
cleanup is needed for the trees before modelling.

(a) True (b) False

7. Ensembles are methods that not only increase model accuracy but also

(a) They increase only model sensitivity.


(b) They reduce risk on deploying poor model.

(c) They reduce error.

(d) None of the above.

8. Sometimes the ensemble will significantly reduce the behavioural complexity

(a) True (b) False

9. Ensembles are often considered black box models, meaning that what they do is not
transparent to the modeler or domain expert.

(a) True (b) False

10. Ensembles are an appropriate solution to all problems.

(a) True (b) False

11. In general, a high-complexity regression model has a
___________ bias but a ______ variance on the training data set.

(a) High, low (b) Low, high (c) Low, low (d) High, high
