Chapter 2
Overview of the Data Mining Process
Predictive Analytics
Classification, prediction, and, to some extent, association rules and collabora-
tive filtering constitute the analytical methods employed in predictive analytics.
The term predictive analytics is sometimes used to also include data pattern
identification methods such as clustering.
Data mining algorithms have varying limitations on what they can handle in terms of the numbers of records and variables, limitations that may be specific to computing power and capacity as well as to the software. Even within those limits, many algorithms will execute faster with smaller samples.
Accurate models can often be built with as few as several hundred or several thousand records. Hence we will often want to sample a subset of records for model building.
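To make the sampling step concrete, here is a minimal sketch in Python with pandas; the file name and sample size are hypothetical, not taken from the text.

```python
import pandas as pd

# Load the full dataset (hypothetical file name).
df = pd.read_csv("all_records.csv")

# Draw a random sample of 1000 records for model building;
# random_state makes the draw reproducible.
sample = df.sample(n=1000, random_state=1)
```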
A rule of thumb used by Delmaster and Hancock (2001) for classification procedures is to have at least 6 × m × p records, where m is the number of outcome classes and p is the number of variables.
For example, suppose we're trying to predict the total purchase amount spent by customers, and we have a few predictor columns that are coded V1, V2, V3, ..., where we don't know what those codes mean. We might find that V1 is an excellent predictor of the total amount spent. However, if we discover that V1 is the amount spent on shipping, calculated as a percentage of the purchase amount, then obviously a model that uses shipping amount cannot be used to predict purchase amount, since the shipping amount is not known until the transaction is completed. Another example is if we are trying to predict loan default at the time a customer applies for a loan. If our dataset includes only information on approved loan applications, we will not have information about what distinguishes defaulters from nondefaulters among denied applicants. A model based on approved loans alone can therefore not be used to predict defaulting behavior at the time of loan application, but rather only once a loan is approved.
Outliers The more data we are dealing with, the greater the chance of encountering erroneous values resulting from measurement error, data-entry error, or the like. If the erroneous value is in the same range as the rest of the data, it may be harmless. If it is well outside the range of the rest of the data (e.g., a misplaced decimal), it may have a substantial effect on some of the data mining procedures we plan to use.
Values that lie far away from the bulk of the data are called outliers. The term far away is deliberately left vague because what is or is not called an outlier is basically an arbitrary decision. Analysts use rules of thumb such as "anything over 3 standard deviations away from the mean is an outlier," but no statistical rule can tell us whether such an outlier is the result of an error. In this statistical sense, an outlier is not necessarily an invalid data point; it is just a distant one.
The purpose of identifying outliers is usually to call attention to values that need further review. We might come up with an explanation by looking at the data (in the case of a misplaced decimal, this is likely). We might have no explanation but know that the value is wrong (a temperature of 178°F for a sick person). Or, we might conclude that the value is within the realm of possibility and leave it alone. All these are judgments best made by someone with domain knowledge, knowledge of the particular application being considered (direct mail, mortgage finance, and so on), as opposed to technical knowledge of statistical or data mining procedures. Statistical procedures can do little beyond identifying the record as something that needs review.
If manual review is feasible, some outliers may be identified and corrected. In any case, if the number of records with outliers is very small, they might be treated as missing data. How do we inspect for outliers? One technique in Excel is to sort the records by the first column, then review the data for very large or very small values in that column. Then repeat for each successive column.
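Outside Excel, the same inspection is easy to script. Below is a minimal sketch, assuming the data sit in a hypothetical CSV file; it applies the 3-standard-deviations rule of thumb quoted above, which flags candidates for review rather than proven errors.

```python
import pandas as pd

df = pd.read_csv("all_records.csv")  # hypothetical file name

# Standardize each numeric column, then flag rows with any value more
# than 3 standard deviations from that column's mean.
numeric = df.select_dtypes("number")
z = (numeric - numeric.mean()) / numeric.std()
outliers = df[(z.abs() > 3).any(axis=1)]
print(outliers)
```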
For example, if values are missing at random at a rate of 5% per variable in a dataset with 30 variables, the probability that any given record is complete in all variables is only 0.95^30 = 0.215.
a "0" might mean two things: (1) the value is missing, or (2) the value is actually
zero. In the credit industry, a "0" in the "past due" variable might mean a
customer who is fully paid up, or a customer with no credit history at all-two
very different situations. Human judgment may be required for individual cases
or to determine a special rule to deal with the situation.
In supervised learning, a key question presents itself: How well will our prediction or classification model perform when we apply it to new data? We are particularly interested in comparing the performance of different models so that we can choose the model we think will do best when implemented in practice. A key concept is to make sure that our chosen model generalizes beyond the dataset that we have at hand. To assure generalization, we use the concept of data partitioning and try to avoid overfitting. These two important concepts are described next.
When a model performs well on the data used to assess it, that performance can be due to:
• A superior model
• Chance aspects of the data that happen to match the chosen model better than they match other models
The latter is a particularly serious problem with techniques (e.g., trees and neural
nets) that do not impose linear or other structure on the data, and thus end up
overfitting it.
To address the overfitting problem, we simply divide (partition) our data and develop our model using only one of the partitions. After we have a model, we try it out on another partition and see how it performs, which we can measure in several ways. In a classification model, we can count the proportion of held-back records that were misclassified. In a prediction model, we can measure the residuals (prediction errors) between the predicted values and the actual values. This evaluation approach in effect mimics the deployment scenario, where our model is applied to data that it hasn't "seen."
We typically deal with two or three partitions: a training set, a validation set, and sometimes an additional test set. Partitioning the data into training, validation, and test sets is done either randomly according to predetermined proportions or by specifying which records go into which partition according to some relevant variable (e.g., in time series forecasting, the data are partitioned according to their chronological order). In most cases the partitioning should be done randomly to minimize the chance of getting a biased partition. It is also possible (although cumbersome) to divide the data into more than three partitions by successive partitioning (e.g., divide the initial data into three partitions, then take one of those partitions and partition it further), as the sketch below illustrates.
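Here is a sketch of random partitioning by successive splits (50% training, 30% validation, 20% test; the proportions and file name are illustrative only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("all_records.csv")  # hypothetical file name

# First split off 50% for training; then split the remainder 60/40
# into validation (30% of the total) and test (20% of the total).
train, rest = train_test_split(df, test_size=0.5, random_state=1)
valid, test = train_test_split(rest, test_size=0.4, random_state=1)
```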
Test Partition The test partition (sometimes called the holdout or evalu-
ation partition) is used to assess the performance of the chosen model with new
data.
Why have both a validation and a test partition? When we use the validation data to assess multiple models and then choose the model that performs best with the validation data, we again encounter another (lesser) facet of the overfitting problem: chance aspects of the validation data that happen to match the chosen model better than they match other models. In other words, by using the validation data to choose one of several models, we make the performance of the chosen model on the validation data overly optimistic.
The random features of the validation data that enhance the apparent performance of the chosen model will probably not be present in new data to which the model is applied. Therefore we may have overestimated the accuracy of our model. The more models we test, the more likely it is that one of them will be particularly effective in modeling the noise in the validation data. Applying the model to the test data, which it has not seen before, will provide an unbiased estimate of how well the model will perform with new data. Figure 2.2 shows the three data partitions and their use in the data mining process. When we are concerned mainly with finding the best model and less with exactly how well it will do, we might use only training and validation partitions. The small simulation sketched below illustrates how choosing among models on the validation data inflates their apparent performance.
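In this sketch, 50 candidate "models" are nothing but random predictions for outcomes that are themselves pure noise, so no model is genuinely better than any other; all numbers are synthetic, not from the text. The model that happens to score best on the validation data looks better than it really is, while the test data give an honest estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
y_valid = rng.normal(size=200)   # validation outcomes (pure noise)
y_test = rng.normal(size=200)    # test outcomes (pure noise)

# 50 candidate "models," each just a fixed vector of random predictions.
models = [rng.normal(size=200) for _ in range(50)]
rmses = [np.sqrt(np.mean((m - y_valid) ** 2)) for m in models]
best = int(np.argmin(rmses))

print("chosen model, validation RMSE:", round(float(rmses[best]), 3))  # optimistic
print("chosen model, test RMSE:      ",
      round(float(np.sqrt(np.mean((models[best] - y_test) ** 2))), 3))
```

Typically the chosen model's validation RMSE comes out noticeably below its test RMSE, even though all the candidate models are equally uninformative.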
Note that with some algorithms, such as nearest-neighbor algorithms, records in the validation and test partitions, and in new data, are compared to records in the training data in order to find their nearest neighbors.
FIGURE 2.2 THREE DATA PARTITIONS AND THEIR ROLE IN THE DATA MINING PROCESS (training data: build model(s); validation data: evaluate model(s); test data: re-evaluate model(s), optional; new data: predict/classify using the final model)
Such a model includes spurious effects that are specific to the 100 individuals but do not extend beyond that sample.
For example, one of the variables might be height. We have no basis in theory to suppose that tall people might contribute more or less to charity, but if there are several tall people in our sample and they just happened to contribute heavily to charity, our model might include a term for height: the taller you are, the more you will contribute. Of course, when the model is applied to additional data, it is likely that this will not turn out to be a good predictor.
If the dataset is not much larger than the number of predictor variables, it is very likely that a spurious relationship like this will creep into the model. Continuing with our charity example, with a small sample, just a few of whom are tall, whatever the contribution level of those tall people may be, the algorithm is tempted to attribute it to their being tall. If the dataset is very large relative to the number of predictors, this is less likely to occur. In such a case, each predictor must help predict the outcome for a large number of cases, so the job it does is much less dependent on just a few cases that might be flukes.
Somewhat surprisingly, even if we know for a fact that a higher-degree curve is the appropriate model, if the model-fitting dataset is not large enough, a lower-degree function (which is less likely to fit the noise) is likely to perform better, as the simulation sketch below suggests. Overfitting can also result from the application of many different models, from which the best-performing model is selected.
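In this sketch the true relationship is cubic, but only 12 noisy points are available for fitting; the coefficients and noise level are arbitrary choices, not from the text. With so little data, the lower-degree fits usually generalize better, because the degree-10 fit chases the noise.

```python
import numpy as np

rng = np.random.default_rng(1)

# Small fitting dataset from a cubic signal plus noise.
x = np.linspace(0, 1, 12)
y = 4 * x**3 - 3 * x + rng.normal(0, 0.3, size=12)

# Fresh draws from the same process stand in for "new data."
x_new = np.linspace(0, 1, 200)
y_new = 4 * x_new**3 - 3 * x_new + rng.normal(0, 0.3, size=200)

for degree in (1, 3, 10):
    # polyfit may warn about conditioning at degree 10; that is itself
    # a symptom of fitting too flexible a model to too little data.
    coefs = np.polyfit(x, y, degree)
    rmse = np.sqrt(np.mean((np.polyval(coefs, x_new) - y_new) ** 2))
    print(f"degree {degree:2d}: RMSE on new data = {rmse:.3f}")
```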
Let us go through the steps typical to many data mining tasks using a familiar procedure: multiple linear regression. This will help you understand the overall process before we begin tackling new algorithms. We illustrate the procedure using XLMiner.
1 "Zestimates may not be as right as you'd like" Washington Post Feb. 7, 2015, p. TI0, by K. Hamey.
FLOORS   ROOMS
15       8
2        10
1.5      6
1        6

FIGURE 2.5 [sorted FLOORS and ROOMS values; the FLOORS value of 15 is far outside the range of the rest of the data and is likely a data-entry error]
8. Use the algorithm to perform the task. In XLMiner, we select Multiple Linear Regression from the Prediction menu. The first dialog box is shown in Figure 2.6. The variable TOTAL VALUE is selected as the output (dependent) variable. All the other variables, except TAX and one of the REMODEL dummy variables, are selected as input (predictor) variables. We ask XLMiner to show us the fitted values on the training data as well as the predicted values (scores) on the validation data, as shown in Figure 2.7. XLMiner produces standard regression output, but for now we defer that, as well as the more advanced options displayed above (see Chapter 6 or the user documentation for XLMiner for more information). Rather, we review the predictions themselves. Figure 2.8 shows the predicted values for the first few records in the training data along with the actual values and the residuals (prediction errors). Note that the predicted values would often be called the fitted values, since they are for the records to which the model was fit. The results for the validation data are shown in Figure 2.9. The prediction errors for the training and validation data are compared in Figure 2.10.
Prediction error can be measured in several ways. Three measures produced by XLMiner are shown in Figure 2.10. On the right is the average error, simply the average of the residuals (errors). In both cases, it is quite small relative to the units of TOTAL VALUE, indicating that, on balance, predictions average about right: our predictions are "unbiased." Of course, this simply means that the positive and negative errors balance out. It tells us nothing about how large these errors are.
The total sum of squared errors on the left adds up the squared errors, so whether an error is positive or negative, it contributes just the same. However, this sum does not yield information about the size of the typical error.
The RMS error (root-mean-squared error) is perhaps the most useful term of all. It takes the square root of the average squared error, so it gives an idea of the typical error (whether positive or negative) in the same scale as that used for the original outcome variable. As we might expect, the RMS error for the validation data (45.2 thousand dollars), which the model is seeing for the first time in making these predictions, is larger than for the training data (40.9 thousand dollars), which were used in training the model.
9. Interpret the results. At this stage we would typically try other prediction algorithms (e.g., regression trees) and see how they do error-wise. We might also try different "settings" on the various models (e.g., we could use the best subsets option in multiple linear regression to choose a reduced set of variables that might perform better with the validation data). After choosing the best model (typically, the model with the lowest error on the validation data, while also recognizing that "simpler is better"), we use that model to predict the output variable in fresh data. These steps are covered in more detail in the analysis of cases.
FIGURE 2.6 [XLMiner multiple linear regression dialog: selecting the output and input variables]
FIGURE 2.7 [XLMiner dialog: requesting fitted values for the training data and predicted values for the validation data]
FIGURE 2.8 [predicted (fitted) values, actual values, and residuals for the first few training records]
FIGURE 2.9 [predicted values and residuals for the validation data]
FIGURE 2.10 [error measures for the training and validation data]
TAX    LOT SQFT   YR BUILT   GROSS AREA   LIVING AREA   FLOORS   ROOMS   BEDROOMS   FULL BATH   HALF BATH   KITCHEN   FIREPLACE   REMODEL None   REMODEL Old   REMODEL Recent
3850   6877       1963       2240         1808          1        6       3          1           1           1         0           1              0             0
5386   5965       1963       2998         1637          1.5      8       3          3           0           1         1           0              1             0
4608   5662       1961       2805         1750          2        7       4          2           0           1         1           0              0             1

FIGURE 2.11 WORKSHEET WITH THREE RECORDS TO BE SCORED
10. Deploy the model. After the best model is chosen, it is applied to new data to predict TOTAL VALUE for homes where this value is unknown. This was, of course, the original purpose. Predicting the output value for new records is called scoring. For predictive tasks, scoring produces predicted numerical values. For classification tasks, scoring produces classes and/or propensities. In XLMiner, we can score new records using one of the models we developed. To do that, we must first create a worksheet or file with the records to be predicted. For these records, we must include all the predictor values. Figure 2.11 shows an example of a worksheet with three homes to be scored using our linear regression model. Note that all the required predictor columns are present, and the output column is absent.
Figure 2.12 shows the Score dialog box. We chose "match by name" to match the predictor columns in our model with the new records' worksheet. The result is shown in Figure 2.13, where the predictions are in the first column. Note: In XLMiner, scoring new data can also be done directly from a specific prediction or classification2 method dialog box ("New Data Scoring," typically in the last step). In our example, scoring can be done in step 2 in the multiple linear regression dialog shown in Figure 2.7. A brief sketch of steps 8-10 in code, outside XLMiner, follows this list.
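For readers who want to follow steps 8-10 outside XLMiner, here is a compact sketch in Python with scikit-learn. It assumes the housing data sit in a CSV file with the column names used in the text; the file names and the name of the dropped dummy column ("REMODEL None") are assumptions, not XLMiner's.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("west_roxbury.csv")  # hypothetical file name
train, valid = train_test_split(df, test_size=0.4, random_state=1)

# Step 8: fit the model, excluding TAX and one REMODEL dummy.
outcome = "TOTAL VALUE"
excluded = {outcome, "TAX", "REMODEL None"}
predictors = [c for c in df.columns if c not in excluded]
model = LinearRegression().fit(train[predictors], train[outcome])

# Step 9: compare the typical error size on training vs. validation data.
def rmse(actual, predicted):
    return float(np.sqrt(np.mean((np.asarray(actual) - predicted) ** 2)))

print("training RMSE:  ", rmse(train[outcome], model.predict(train[predictors])))
print("validation RMSE:", rmse(valid[outcome], model.predict(valid[predictors])))

# Step 10: score new records, matching predictor columns by name.
new_records = pd.read_csv("new_homes.csv")  # hypothetical worksheet
print(model.predict(new_records[predictors]))
```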
XLMiner has a facility for drawing a sample from an external database. The sample can be drawn at random or it can be stratified. It also has a facility to score data in the external database using the model that was obtained from the training data.
2 Note: In some versions of XLMiner, propensities for new records are produced only if "New Data Scoring" is selected in the classification method dialog; they are not available in the option to score stored models.
FIGURE 2.12 [XLMiner Score dialog box]
FIGURE 2.13 [scored records, with the predictions in the first column]
Sampled judiciously, 2000 voters can give an estimate of the entire population's opinion within one or two percentage points. (See "How Many Variables and How Much Data" in Section 2.4 for further discussion.)
Therefore, in most cases, the number of records required in each partition (training, validation, and test) can be accommodated within the rows allowed by Excel. Of course, we need to get those records into Excel, and for this purpose the standard version of XLMiner provides an interface for random sampling of records from an external database.
Similarly, we may need to apply the results of our analysis to a large database,
and for this purpose the standard version of XLMiner has a facility for storing
models and scoring them to an external database. For example, XLMiner would
write an additional column (variable) to the database consisting of the predicted
purchase amount for each record.
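As an illustration of that idea in code, here is a minimal sketch with SQLite; the database, table, and column names, and the fitted `model` and `predictors` from the earlier sketch, are all assumptions.

```python
import sqlite3
import pandas as pd

# Pull the records, score them with a previously fitted model, and write
# the predictions back to the database as an additional column.
con = sqlite3.connect("customers.db")               # hypothetical database
records = pd.read_sql("SELECT * FROM customers", con)
records["predicted_purchase"] = model.predict(records[predictors])
records.to_sql("customers", con, if_exists="replace", index=False)
con.close()
```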
In most supervised data mining applications, the goal is not a static, one-time analysis of a particular dataset. Rather, we want to develop a model that can be used on an ongoing basis to predict or classify new records. Our initial analysis will be in prototype mode, while we explore and define the problem and test different models. We will follow all the steps outlined earlier in this chapter.
At the end of that process, we will typically want our chosen model to be deployed in automated fashion. For example, the US Internal Revenue Service (IRS) receives several hundred million tax returns per year; it does not want to have to pull each tax return out into an Excel sheet or other environment separate from its main database to determine the predicted probability that the return is fraudulent. Rather, it would prefer that determination to be made as part of the normal tax filing environment and process. Music streaming services, such as Pandora or Spotify, need to determine "recommendations" for next songs quickly for each of millions of users; there is no time to extract the data for manual analysis.
In practice, this is done by building the chosen algorithm into the com-
putational setting in which the rest of the process lies. A tax return is entered
directly into the IRS system by a tax preparer, a predictive algorithm is imme-
diately applied to the new data in the IRS system, and a predicted classification
is decided by the algorithm. Business rules would then determine what happens
with that classification. In the IRS case, the rule might be "if no predicted
fraud, continue routine processing; if fraud is predicted, alert an examiner for
possible audit."
This flow of the tax return from data entry, into the IRS system, through a predictive algorithm, then back out to a human user is an example of a deployed model.
Microsoft is an active player in cloud analytics with its Azure Machine Learning Studio and Stream Analytics. Azure works with Hadoop clusters as well as with traditional relational databases. Azure ML offers a broad range of algorithms such as boosted trees and support vector machines as well as supporting R scripts and Python. Azure ML also supports a workflow interface, making it more suitable for the nonprogrammer data scientist. The real-time analytics component is designed to allow streaming data from a variety of sources to be analyzed on the fly. XLMiner's cloud version is based on Microsoft Azure. Microsoft also acquired Revolution Analytics, a major player in the R analytics business, with a view to integrating Revolution's "R Enterprise" with SQL Server and Azure ML. R Enterprise includes extensions to R that eliminate memory limitations and take advantage of parallel processing.
One drawback of the cloud-based analytics tools is a relative lack of transparency and user control over the algorithms and their parameters. In some cases, the service will simply select a single model that is a black box to the user. Another drawback is that for the most part cloud-based tools are aimed at more sophisticated data scientists who are systems savvy.
Data science is playing a central role in enabling many organizations to optimize everything from production to marketing. New storage options and analytical tools promise even greater capabilities. The key is to select technology that's appropriate for an organization's unique goals and constraints. As always, human judgment is the most important component of a data mining solution.
This book's focus is on a comprehensive understanding of the different techniques and algorithms used in data mining, and less on the data management requirements of real-time deployment of data mining models. XLMiner's short learning curve and integration with Excel makes it ideal for this purpose, and for exploration, prototyping, and piloting of solutions.
Herb Edelstein is president of Two Crows Consulting (www.twocrows.com), a leading data mining consulting firm near Washington, DC. He is an internationally recognized expert in data mining and data warehousing, a widely published author on these topics, and a popular speaker.
© 2015 Herb Edelstein.
√((25 − 56)² + (49,000 − 156,000)²).