Data Mining and Predictive Analytics
Daniel T. Larose, Ph.D.
Solutions to Chapter 1
AN INTRODUCTION TO DATA MINING AND PREDICTIVE
ANALYTICS
Prepared by James Cunningham, Graduate Assistant
1. For each of the following, identify the relevant data mining task(s):
a. The Boston Celtics would like to approximate how many points their next
opponent will score against them.
Estimation
b. Classification, Clustering, Description

c. Classification, Clustering

d. Classification, Description

e. Prediction
f. A Wall Street analyst has been asked to find out the expected change in stock
price for a set of companies with similar price/earnings ratios.
Estimation
1
Data Mining and Predictive Analytics
Daniel T. Larose, Ph.D.
2. For each of the following meetings, explain which phase in the CRISP-DM process is
represented:
a. Managers want to know by next week whether deployment will take place.
Therefore, analysts meet to discuss how useful and accurate their model is.
Evaluation Phase
b. The data mining project manager meets with the data warehousing manager
to discuss how the data will be collected.
Data Understanding Phase
c. The data mining consultant meets with the Vice President for Marketing,
who says that he would like to move forward with customer relationship
management.
Business Understanding Phase
d. The data mining project manager meets with the production line supervisor,
to discuss implementation of changes and improvements.
Deployment Phase
e. The analysts meet to discuss whether the neural network or decision tree
models should be applied.
Modeling Phase
3. Discuss the need for human direction of data mining. Describe the possible
consequences of relying on completely automatic data analysis tools.
Data mining requires human direction in order to be both effective and appropriate, because
problem-solving is a human process that requires critical thinking at every step of the way.
As stated in the text, data mining without proper human direction is something that is very
easy to do badly. It is very easy to derive results that are damaging to business processes by
(1) failing to understand the business problem at hand, (2) failing to understand the data sets
at hand (and their interrelationships), (3) failing to select appropriate modeling techniques,
and (4) failing to evaluate model results correctly.
One very popular fallacy is that data mining can be completely autonomous and thus requires
little to no human direction. Applying data mining software features at random is bound to
produce the wrong answer to the wrong question with the wrong data. In fact, business
decisions based on inappropriate analyses are much more damaging and costly than
those based on no analysis at all. Also, once a model is deployed, it must be monitored
for its efficacy and will most often need to be tuned over time.
4. CRISP-DM is not the only standard process for data mining. Research an
alternative methodology (Hint: SEMMA, from the SAS Institute). Discuss the
similarities and differences with CRISP-DM.
SEMMA is a process developed by the SAS Institute for conducting a data mining project.
Each letter in the acronym SEMMA identifies a separate stage of the data mining process as
follows:
Sample – The first stage in SEMMA entails extracting a representative sample of a much
larger data set. Please note that this stage is optional and thus used at the discretion of the
analyst.
Explore – The second stage in SEMMA entails searching for unanticipated trends, patterns,
and anomalies in order to gain an understanding of the data and develop ideas.
Modify – The third stage in SEMMA entails modifying the data set through a combination of
selecting original variables and more importantly transforming variables and deriving new
ones that would be most conducive to a data modeling exercise.
Model – The fourth stage in SEMMA entails allowing the software to determine the best
combination of variables that predict a desired outcome.
Assess – The fifth and final stage in SEMMA entails evaluating model efficacy and
estimating how well it will perform if deployed.
CRISP-DM, by contrast, consists of six phases. Each phase is described below and compared
with its closest SEMMA counterpart:
Business Understanding – The first phase entails gaining an understanding of the business
problem at hand and translating this into a data mining problem to be solved and an initial
solution approach. In direct contrast, we observe that CRISP-DM prescribes
business-requirements development as an explicit activity, and the specific data mining
problem and solution approach as explicit deliverables, whereas SEMMA does not. SEMMA
prescribes delving right into the data set, which can lead to significant wasted time (most
likely proportional to the dimensionality of the data set being explored).
Data Understanding – The second phase entails determining how data will be collected and
exploratory analysis. This phase is similar in nature to SEMMA’s Explore stage, but in
contrast with SEMMA, the exploratory analysis activities of the CRISP-DM Data
Understanding phase are conducted from the perspective of solving a particular data mining
problem. Therefore, while exploration conducted in SEMMA's Explore stage seems to proceed by
pure brute force, exploration conducted in CRISP-DM's Data Understanding phase is done
from the perspective of a specific data mining problem to be solved. In other words, the
exploratory analysis in CRISP-DM's Data Understanding phase is expected to be more
effective and more efficient, focusing on correlations between predictors and on
interactions between predictors and a specific target variable.
Data Preparation – The third phase entails all of the actions (e.g., selections,
transformations, derivations, etc.) needed to develop a data set that is most conducive to a
data modeling exercise. This phase is similar to SEMMA's Modify stage, but in contrast with
SEMMA, the preparation activities conducted in the CRISP-DM Data Preparation phase are
done with a specific data mining problem and target modeling approach in mind. This is a
critical distinction between the two processes. As an example, if we have data that are highly
inter-correlated or multicollinear, we can leverage a dimensional transformation such as one
produced via Principal Components Analysis (PCA) to eliminate the multicollinearity, but
only for certain types of modeling approaches. Therefore, since the CRISP-DM Data
Preparation phase has a target modeling approach in mind when preparing data, it can
leverage advanced transformational techniques like PCA appropriately and is thus superior to
the SEMMA Modify stage.
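As a concrete sketch of the PCA transformation mentioned above (not part of the text's
solution; the data and variable names are hypothetical), scikit-learn can replace two
multicollinear predictors with uncorrelated components:

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical data: two highly correlated (multicollinear) predictors.
    rng = np.random.default_rng(0)
    x1 = rng.normal(size=200)
    x2 = 0.9 * x1 + rng.normal(scale=0.1, size=200)
    X = np.column_stack([x1, x2])
    print(np.corrcoef(X, rowvar=False)[0, 1])   # near 1: multicollinear

    # PCA replaces the correlated predictors with uncorrelated components.
    Z = PCA(n_components=2).fit_transform(X)
    print(np.corrcoef(Z, rowvar=False)[0, 1])   # near 0: multicollinearity removed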
Modeling - The fourth phase entails the human-directed application of multiple modeling
techniques in order to (1) optimize the balance between model bias and model variance and
(2) maximize the ability of these models to operate effectively on new observations. While
this is similar to SEMMA’s Model stage, the CRISP-DM Modeling phase is human-directed
whereas SEMMA’s Model stage appears to be autonomous with little or no human direction.
As stated in the text, autonomous data mining is a dangerous practice.
Evaluation – The fifth phase entails thorough evaluation of both the (1) constructed models
for their efficacy and performance as well as the (2) approach used to construct the models to
ensure that the constructed models actually solve the business problem at hand. While this
phase is similar to SEMMA’s Assess stage, the CRISP-DM Evaluation phase verifies that the
models constructed actually solve the business problem at hand. Since SEMMA does not
prescribe formal definition of the business problem to be solved, the SEMMA Assess stage
may actually result in a model that performs well but operates on the wrong target variable
and corresponding predictors and thus has little or no business value.
Deployment – The sixth and final phase entails preparing the model results so that they can be
leveraged by the business sponsor. For simpler data mining projects, this may entail
generating a report on which the sponsor may base business decisions. For more
complex projects, this may entail implementation of the final model in a commercial rules-
engine software package. In direct contrast with SEMMA, there is no corresponding stage in
the SEMMA process prescribing model deployment.
Solutions to Chapter 2
DATA PREPROCESSING
Prepared by James Cunningham, Graduate Assistant
1. Describe the possible negative effects of proceeding directly to mine data that has not been
preprocessed.
Neglecting to preprocess the data adequately before data modeling begins will likely produce data
models that are unreliable and whose results should be considered dubious at best. Performing data
cleaning and data transformation during the data preparation phase is absolutely necessary for
successful data mining efforts.
For example, suppose you are analyzing a data set that includes a person’s Age and Date_of_Birth
attributes, and you want to calculate the average Age. Now, if 5% of the records contain a value of 0
for Age, the mean value would be very misleading and inaccurate. One solution to this problem
would be to derive Age for the zero-valued records from the information contained in the
Date_of_Birth variable. Now, the mean value for Age is more representative of those persons in the
data set.
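A minimal pandas sketch of this repair (the data, column names, and the 365-day year
approximation are assumptions for illustration):

    import pandas as pd

    # Hypothetical customer data with Age mis-entered as 0 for some records.
    df = pd.DataFrame({
        "Age": [34, 0, 58, 0],
        "Date_of_Birth": pd.to_datetime(
            ["1991-03-02", "1979-07-15", "1967-01-20", "2001-11-08"]
        ),
    })

    # Re-derive Age from Date_of_Birth wherever Age is zero
    # (integer division by 365 is an approximation).
    today = pd.Timestamp.today()
    derived = (today - df["Date_of_Birth"]).dt.days // 365
    df.loc[df["Age"] == 0, "Age"] = derived

    print(df["Age"].mean())  # now representative of the actual ages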
2. Refer to the income attribute of the five customers in Table 2.1, before preprocessing.
The mean value for Income before preprocessing is

(75,000 - 40,000 + 100,000 + 50,000 + 9,999) / 5 = 194,999 / 5 = 38,999.80

and is distorted by the inclusion of the Income values -40,000 (erroneous) and 100,000
(possible outlier).

In this case the mean value has little meaning, because we are combining real data values with
erroneous values.
c. Now, calculate the mean income for the three values left after preprocessing. Does this
value have a meaning?
After preprocessing, the mean value for Income, produced by the values 75,000, 50,000, and 10,000
(9,999 rounded to the nearest 5,000), is (75,000 + 50,000 + 10,000) / 3 = 45,000. The latter value
is certainly more representative of the true mean for Income, now that the records containing
questionable values have been excluded.
3. Explain why zip codes should be considered text variables rather than numeric.
Zip codes should be considered text variables because they cannot be quantified on any numeric
scale. Even their order has no numerical significance.
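One practical way to honor this in code (a sketch; the file and column names are
hypothetical) is to load zip codes as strings so that no numeric scale is imposed on them:

    import pandas as pd

    # Read zip codes as text, not numbers.
    customers = pd.read_csv("customers.csv", dtype={"zip_code": str})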
4. What is an outlier, and why do we need to treat outliers carefully?

Consider a set of numerical observations and the center of this observation set. An outlier is an
observation that lies much farther away from the center than the majority of the other observations
in the set.

We must treat outliers carefully because they can cause us to misrepresent the true center of an
observation set if they lie significantly farther away from the other observations in the set.
5. Explain why a birthdate variable would be preferred to an age variable in a database.

A birthdate variable is preferable to an age variable in a database because (1) one can always derive
age from birthdate by taking the difference from the current date, and (2) age is relative to the
current date only and would need to be updated continuously over time in order to remain
accurate.
6. True or false: All things being equal, more information is almost always better.
The answer is true. In general, more information is almost always better. The more information we
have to work with, the more insight into the underlying relationships of a particular domain of
discourse we can glean from it.
7. Explain why it is not recommended, as a strategy for dealing with missing data, to simply omit
the records or fields with missing values from the analysis.
It is not recommended to omit records or fields from an analysis simply because they have missing
values. The rationale for this recommendation is that omitting these fields and records may cause
us to lose valuable insight into the underlying relationships that we may have gleaned from the
partial information that we do have.
8. Which of the four methods for handling missing data would tend to lead to an underestimate
of the spread (e.g., standard deviation) of the variable? What are some benefits to this
method?
Replacing a missing value with the attribute's mean value artificially reduces the measure of spread
for that particular attribute. Although the mean value is not necessarily a typical value, for some
data sets this form of substitution may work well. Specifically, the effectiveness of this technique
depends on the size of the variation of the underlying population. In other words, the technique
works well for populations having small variations, and works less effectively for populations having
larger variations.
Several benefits to leveraging this method include (1) ease of implementation (i.e., only one value to
impute), and (2) preservation of the standard error (i.e., no additional residual error is introduced).
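A minimal sketch illustrating the underestimate of spread (the values are hypothetical):

    import numpy as np

    # Hypothetical variable with some missing values.
    values = np.array([12.0, 15.0, np.nan, 20.0, 8.0, np.nan, 17.0])
    observed = values[~np.isnan(values)]

    # Replace each missing value with the mean of the observed values.
    imputed = np.where(np.isnan(values), observed.mean(), values)

    print(observed.std(ddof=1))  # spread of the observed values
    print(imputed.std(ddof=1))   # smaller: mean imputation shrinks the spread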
9. What are some of the benefits and drawbacks for the method for handling missing data that
chooses values at random from the variable distribution?
By using the data values randomly generated from the variable distribution, the measures of center
and spread are most likely to remain similar to the original; however, there is a chance that the
resulting records may not make intuitive sense.
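A minimal sketch of this method (hypothetical values), drawing replacements from the
observed empirical distribution of the variable:

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical variable with missing entries.
    values = np.array([12.0, 15.0, np.nan, 20.0, 8.0, np.nan, 17.0])
    observed = values[~np.isnan(values)]

    # Fill each missing entry with a value drawn at random from the
    # observed (empirical) distribution of the variable.
    filled = values.copy()
    missing = np.isnan(filled)
    filled[missing] = rng.choice(observed, size=missing.sum())

    print(filled)  # center and spread stay close to the original distribution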
10. Of the four methods for handling missing data, which method is preferred?
Having the analyst choose a constant to replace missing values based on specific domain knowledge
is, overall, probably the most conservative choice. If missing values are replaced with a flag such as
“missing” or “unknown”, in many situations those records would ultimately be excluded from the
modeling process; that is, all remaining valid, potentially important, values contained in those
records would not be included in the data model.
11. Make up a classification scheme which is inherently flawed, and would lead to
misclassification, as we find in Table 2.2. For example, classes of items bought in a grocery
store.
Breakfast         Count
Cold Cereals         72
Sugar Smacks          1
Cheerios              2
Hot Cereals          28
Cream of Wheat        3
Using the table above, the “Breakfast” categorical attribute contains 5 apparent classes.
However, upon further inspection the classes are discovered to be inconsistent. For example,
both “Sugar Smacks” and “Cheerios” are cold cereals, and “Cream of Wheat” is a hot cereal.
Below, the cereals are now classified according to one of two classes, “Cold Cereals” or “Hot
Cereals.”
Breakfast         Count
Cold Cereals         75
Hot Cereals          31
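A minimal pandas sketch of this reclassification (the column name and record construction
are hypothetical; the mapping mirrors the tables above):

    import pandas as pd

    # Hypothetical records with an inconsistent Breakfast classification.
    df = pd.DataFrame({"Breakfast": ["Cold Cereals"] * 72 + ["Sugar Smacks"]
                       + ["Cheerios"] * 2 + ["Hot Cereals"] * 28
                       + ["Cream of Wheat"] * 3})

    # Map each inconsistent class to one of the two consistent classes.
    consistent = {
        "Sugar Smacks": "Cold Cereals",
        "Cheerios": "Cold Cereals",
        "Cream of Wheat": "Hot Cereals",
    }
    df["Breakfast"] = df["Breakfast"].replace(consistent)

    print(df["Breakfast"].value_counts())  # Cold Cereals 75, Hot Cereals 31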
12. Make up a data set, consisting of the heights and weights of six children, in which one of the
children is an outlier with respect to one of the variables, but not the other. Then alter this
data set so that the child is an outlier with respect to both variables.
Consider a data set in which Child #1 is an outlier with respect to Weight only. All six children are
close in Height, differing by at most 9 inches. However, all children except for Child #1 are close in
Weight, differing by at most 7 pounds. Child #1 is an outlier because this child's Weight differs by 18
pounds from the second-heaviest child (Child #6), making this right-tailed difference in Weight
greater than the entire Weight range for the other five children.

Now alter the data set so that Child #1 is an outlier with respect to both Height and Weight. All
children except for Child #1 are close in Height, differing by at most 8 inches, and are close in
Weight, differing by at most 7 pounds. Child #1 is now an outlier for both Height and Weight: the
Height differs by 14 inches from the second-shortest child (Child #2), which is greater than the
entire Height range of the other five children, and the Weight differs by 18 pounds from the
second-heaviest child (Child #6), which is greater than the entire Weight range of the other five
children.
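One pair of data sets consistent with this description (a hypothetical sketch; the specific
numbers are illustrative, and any values satisfying the stated gaps would do):

    import pandas as pd

    # Heights (inches) and weights (pounds) for six children.
    # Child #1 is an outlier in Weight only: heights span 9 inches, the other
    # five weights span 7 pounds, and Child #1 weighs 18 pounds more than
    # the second-heaviest child (Child #6).
    weight_outlier = pd.DataFrame({
        "Child":  [1, 2, 3, 4, 5, 6],
        "Height": [48, 42, 45, 50, 51, 47],
        "Weight": [80, 55, 58, 60, 57, 62],
    })

    # Altered data: Child #1 is now an outlier in Height as well, standing
    # 14 inches below the second-shortest child (Child #2).
    both_outlier = weight_outlier.assign(
        Height=[28, 42, 45, 48, 50, 47]
    )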
Use the following stock price data (in dollars) for Exercises 13–18
10 7 20 12 75 15 9 18 4 12 8 14
13. Calculate the mean, median, and mode of the stock price.

The mean is calculated as the sum of the data points divided by the number of points, as follows:

Mean = (10 + 7 + 20 + 12 + 75 + 15 + 9 + 18 + 4 + 12 + 8 + 14) / 12 = 204 / 12 = $17

The median is calculated by placing the prices in order and (a) selecting the middle value if the
number of points is odd, or (b) taking the average of the two middle values if the number of points is
even. Since we have twelve points, the median is calculated as follows:

Sorted prices: 4, 7, 8, 9, 10, 12, 12, 14, 15, 18, 20, 75
Median = (12 + 12) / 2 = $12

The mode is the value that occurs most often in the set:

Mode = $12 (the only price that occurs twice)
14. Compute the standard deviation of the stock price. Interpret what this number means.
The standard deviation represents the typical distance of a point chosen at random from a data
set to the center of that set, and is calculated by taking the square root of the variance. The variance
is the average of the squared distances of each point from the data-set mean. Given that the
mean is $17 (see Exercise #13) for this set, the variance for the set of stock prices is calculated as
follows:

Variance = [(4-17)² + (7-17)² + (8-17)² + (9-17)² + (10-17)² + (12-17)² + (12-17)²
            + (14-17)² + (15-17)² + (18-17)² + (20-17)² + (75-17)²] / 12
         = [169 + 100 + 81 + 64 + 49 + 25 + 25 + 9 + 4 + 1 + 9 + 3364] / 12
         = 3900 / 12 = 325

Taking the square root of the variance, the standard deviation (SD) is calculated as follows:

SD = √325 ≈ $18.03

Since the mean is $17 and the standard deviation is $18.03, a stock price drawn at random from the
set of twelve is expected to lie (mathematically) between $17 - $18.03 = -$1.03 (treated as $0.01,
since we assume that a stock price can never be less than one penny USD) and $17 + $18.03 =
$35.03.
As we can see, each stock with the exception of the one priced at $75 is priced within this range.
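These results can be verified with Python's standard statistics module (using the population
standard deviation, to match the divide-by-12 calculation above):

    import statistics

    prices = [10, 7, 20, 12, 75, 15, 9, 18, 4, 12, 8, 14]

    print(statistics.mean(prices))    # 17
    print(statistics.median(prices))  # 12
    print(statistics.mode(prices))    # 12
    print(statistics.pstdev(prices))  # ≈ 18.03 (population SD)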
15. Find the min-max normalized stock price for the stock worth $20.
Min-max normalization scales an observation relative to the data set's range, resulting in a value
between 0 and 1 (this value has no units), and is formulated as follows:

X* = (X - min) / (max - min)

For the stock worth $20, with min = $4 and max = $75:

X* = (20 - 4) / (75 - 4) = 16 / 71 ≈ 0.23

16. Calculate the midrange of the stock prices.

The midrange stock price is the central price for the entire price range and is formulated as follows:

Midrange = (max + min) / 2 = (75 + 4) / 2 = $39.50
17. Compute the Z-score standardized stock price for the stock worth $20.
Z-score standardization rescales an observation so that the mean is 0 and the SD is 1, with most
values lying between -4 and 4 (this value has no units). It is formulated as follows:

Z = (X - mean) / SD

Given the mean of $17 (see Exercise #13) and SD of $18.03 (see Exercise #14), the Z-score for the
stock price of $20 is calculated as follows:

Z = (20 - 17) / 18.03 = 3 / 18.03 ≈ 0.17

Please note that this value makes sense, as it is slightly greater than zero, just as $20 is slightly
greater than the mean of $17.
18. Find the decimal scaling stock price for the stock worth $20.
Decimal scaling normalizes an observation to a value between -1 and 1 (this value has no units)
and is formulated as follows:

Decimal(X) = X / 10^d

where d is the number of digits in the observation in the data set having the largest absolute value.
Since the largest stock price is $75, d = 2, as there are two digits in this price. The decimal
scaling for the stock worth $20 is then calculated as follows:

Decimal(20) = 20 / 10^2 = 20 / 100 = 0.20
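A short Python sketch tying Exercises 15-18 together on the same stock-price data:

    import statistics

    prices = [10, 7, 20, 12, 75, 15, 9, 18, 4, 12, 8, 14]
    x = 20  # the stock worth $20

    # Exercise 15: min-max normalization.
    min_max = (x - min(prices)) / (max(prices) - min(prices))

    # Exercise 16: midrange of the stock prices.
    midrange = (max(prices) + min(prices)) / 2

    # Exercise 17: Z-score standardization (population SD, as above).
    z_score = (x - statistics.mean(prices)) / statistics.pstdev(prices)

    # Exercise 18: decimal scaling with d = 2 digits in the largest price, 75.
    decimal_scaled = x / 10 ** 2

    print(min_max, midrange, z_score, decimal_scaled)
    # ≈ 0.23, 39.5, ≈ 0.17, 0.20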