Chapter 2


Association Rules and Recommendation Systems


Large databases of customer transactions lend themselves naturally to the analysis
of associations among items purchased, or "what goes with what." Association rules, or affinity analysis, are designed to find such general association patterns between items in large databases. The rules can then be used in a variety of ways.
For example, grocery stores can use such information for product placement.
They can use the rules for weekly promotional offers or for bundling products.
Association rules derived from a hospital database on patients' symptoms during
consecutive hospitalizations can help find "which symptom is followed by what
other symptom" to help predict future symptoms for returning patients.
Online recommendation systems, such as those used on Amazon.com and
Netflix.com, use collaborative filtering, a method that uses individual users' pref-
erences and tastes given their historic purchase, rating, browsing, or any other
measurable behavior indicative of preference, as well as other users' history. In
contrast to association rules that generate rules general to an entire population,
collaborative filtering generates "what goes with what" at the individual user
level. Hence collaborative filtering is used in many recommendation systems
that aim to deliver personalized recommendations to users with a wide range of
preferences.
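To make "what goes with what" concrete, here is a minimal Python sketch (not an XLMiner procedure; the transactions, item names, and support/confidence cutoffs are invented for illustration) that counts item pairs across a handful of baskets and reports the rules that clear both thresholds.

from itertools import combinations
from collections import Counter

# Hypothetical market-basket transactions: each set is one customer's purchase.
transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(frozenset(p) for t in transactions
                      for p in combinations(sorted(t), 2))

# Report rules A -> B whose support and confidence clear the (arbitrary) cutoffs.
for pair, count in pair_counts.items():
    a, b = sorted(pair)
    support = count / n
    for lhs, rhs in ((a, b), (b, a)):
        confidence = count / item_counts[lhs]
        if support >= 0.4 and confidence >= 0.6:
            print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")

Full-scale affinity analysis runs on far larger transaction databases and uses more efficient search strategies than this brute-force pair count, but the support and confidence calculations are the same.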

Predictive Analytics
Classification, prediction, and, to some extent, association rules and collabora-
tive filtering constitute the analytical methods employed in predictive analytics.
The term predictive analytics is sometimes used to also include data pattern
identification methods such as clustering.

Data Reduction and Dimension Reduction


The performance of data mining algorithms is often improved when the number
of variables is limited, and when large numbers of records can be grouped
into homogeneous groups. For example, rather than dealing with thousands of
product types, an analyst might wish to group them into a smaller number of
groups and build separate models for each group. Or a marketer might want to
classify customers into different "personas," and must therefore group customers
into homogeneous groups to define the personas. This process of consolidating
a large number of records (or cases) into a smaller set is termed data reduction.
Methods for reducing the number of cases are often called clustering.
Reducing the number of variables is typically called dimension reduction.
Dimension reduction is a common initial step before deploying supervised
learning methods, intended to improve predictive power, manageability, and
interpretability.
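As a rough illustration of both ideas, the sketch below (synthetic data; scikit-learn assumed to be available) groups 200 records into three homogeneous clusters with k-means and consolidates six numerical variables into two principal components.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))          # 200 records, 6 numerical variables (synthetic)

# Data reduction: group the records into 3 clusters; separate models
# could then be built for each cluster.
clusters = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# Dimension reduction: consolidate the 6 variables into 2 principal components
# before applying a supervised learning method.
X_reduced = PCA(n_components=2).fit_transform(X)

print(np.bincount(clusters))           # number of records in each cluster
print(X_reduced.shape)                 # (200, 2)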

varying limitations on what they can handle in terms of the numbers of records
and variables, limitations that may be specific to computing power and capacity
as well as software limitations. Even within those limits, many algorithms will
execute faster with smaller samples.
Accurate models can often be built with as few as several hundred or thou-
sand records. Hence we will want to sample a subset of records for model
building.

Oversampling Rare Events in Classification Tasks


If the event we are interested in classifying is rare, such as customers purchasing a
product in response to a mailing, or fraudulent credit card transactions, sampling
a random subset of records may yield so few events (e.g., purchases) that we have
little information on them. We would end up with lots of data on nonpurchasers
and nonfraudulent transactions but little on which to base a model that distin-
guishes purchasers from nonpurchasers or fraudulent from nonfraudulent. In
such cases, we would want our sampling procedure to overweight the rare class
(purchasers or frauds) relative to the majority class (nonpurchasers, nonfrauds)
so that our sample would end up with a healthy complement of purchasers or
frauds.
Assuring an adequate number of responder or "success" cases to train the model is just part of the picture. A more important factor is the costs of misclassification. Whenever the response rate is extremely low, we are likely to attach more importance to identifying a responder than to identifying a nonresponder. In direct-response advertising (whether by traditional mail, email, or web advertising), we may encounter only one or two responders for every hundred records; the value of finding such a customer far outweighs the costs of reaching him or her. In trying to identify fraudulent transactions, or customers unlikely to repay debt, the costs of failing to find the fraud or the nonpaying customer are likely to exceed the cost of more detailed review of a legitimate transaction or customer.
If the costs of failing to locate responders are comparable to the costs of misidentifying responders as nonresponders, our models would usually achieve highest overall accuracy if they identified everyone (or almost everyone, if it is easy to identify a few responders without catching many nonresponders) as a nonresponder. In such a case, the misclassification rate is very low, equal to the rate of responders, but the model is of no value.
More generally, we want to train our model with the asymmetric costs in mind so that the algorithm will catch the more valuable responders, probably at the cost of "catching" and misclassifying more nonresponders as responders than would be the case if we assume equal costs. This subject is discussed in detail in Chapter 5.
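A minimal sketch of oversampling in Python follows; the 2% response rate, column names, and 50/50 target mix are hypothetical choices, and pandas is assumed to be available.

import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Hypothetical mailing data: roughly 2% of customers respond (the rare class).
df = pd.DataFrame({"x": rng.normal(size=5000),
                   "response": rng.random(5000) < 0.02})

responders = df[df["response"]]
nonresponders = df[~df["response"]]

# Oversample the responders (with replacement) to a 50/50 training mix.
boosted = responders.sample(n=len(nonresponders), replace=True, random_state=1)
train = pd.concat([nonresponders, boosted]).sample(frac=1, random_state=1)

print(train["response"].mean())        # about 0.5 after oversampling

The oversampling is applied only to the data used to fit the model; assessment of the model should use data that retains the original response rate, in line with the cost considerations taken up in Chapter 5.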

For example, suppose we're trying to predict the total purchase amount spent by customers, and we have a few predictor columns that are coded V1, V2, V3, ..., where we don't know what those codes mean. We might find that V1 is an excellent predictor of the total amount spent. However, if we discover that V1 is the amount spent on shipping, calculated as a percentage of the purchase amount, then obviously a model that uses shipping amount cannot be used to predict purchase amount, since the shipping amount is not known until the transaction is completed. Another example is if we are trying to predict loan default at the time a customer applies for a loan. If our dataset includes only information on approved loan applications, we will not have information about what distinguishes defaulters from nondefaulters among denied applicants. A model based on approved loans alone can therefore not be used to predict defaulting behavior at the time of loan application but rather only once a loan is approved.

Outliers The more data we are dealing with, the greater the chance of
encountering erroneous values resulting from measurement error, data-entry
error, or the like. If the erroneous value is in the same range as the rest of the
data, it may be harmless. If it is well outside the range of the rest of the data
(e.g., a misplaced decimal), it may have a substantial effect on some of the data
mining procedures we plan to use.
Values that lie far away from the bulk of the data are called outliers. The term far away is deliberately left vague because what is or is not called an outlier is basically an arbitrary decision. Analysts use rules of thumb such as "anything over 3 standard deviations away from the mean is an outlier," but no statistical rule can tell us whether such an outlier is the result of an error. In this statistical sense, an outlier is not necessarily an invalid data point; it is just a distant one.
The purpose of identifying outliers is usually to call attention to values that need further review. We might come up with an explanation looking at the data; in the case of a misplaced decimal, this is likely. We might have no explanation but know that the value is wrong, such as a temperature of 178°F for a sick person. Or, we might conclude that the value is within the realm of possibility and leave it alone. All these are judgments best made by someone with domain knowledge, knowledge of the particular application being considered (direct mail, mortgage finance, and so on), as opposed to technical knowledge of statistical or data mining procedures. Statistical procedures can do little beyond identifying the record as something that needs review.
If manual review is feasible, some outliers may be identified and corrected.
In any case, if the number of records with outliers is very small, they might
be treated as missing data. How do we inspect for outliers? One technique
in Excel is to sort the records by the first column, then review the data for
very large or very small values in that column. Then repeat for each successive column.
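As a complement to sorting, the rule of thumb mentioned above can be automated; the sketch below (hypothetical temperature readings, an arbitrary 3-standard-deviation cutoff) merely flags candidate outliers for human review.

import numpy as np
import pandas as pd

# Hypothetical body-temperature readings with one suspicious value (178).
rng = np.random.default_rng(3)
temps = pd.Series(np.append(rng.normal(98.6, 0.7, size=30), 178.0))

# Flag anything more than 3 standard deviations from the mean.
z = (temps - temps.mean()) / temps.std()
flagged = temps[z.abs() > 3]

# Flagged values still need domain review; the rule cannot say whether
# 178 is a data-entry error or a genuine (if extreme) measurement.
print(flagged)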
0.95^30 = 0.215.

a "0" might mean two things: (1) the value is missing, or (2) the value is actually
zero. In the credit industry, a "0" in the "past due" variable might mean a
customer who is fully paid up, or a customer with no credit history at all: two
very different situations. Human judgment may be required for individual cases
or to determine a special rule to deal with the situation.

Normalizing (Standardizing) and Rescaling Data Some algorithms


require that the data be normalized before the algorithm can be implemented
effectively. To normalize a variable, we subtract the mean from each value and
then divide by the standard deviation. This operation is also sometimes called
standardizing. In effect, we are expressing each value as the "number of standard
deviations away from the mean," also called a z-score.
Normalizing is one way to bring all variables to the same scale. Another popular approach is rescaling each variable to a [0,1] scale. This is done by subtracting the minimum value and then dividing by the range. Subtracting the minimum shifts the variable origin to zero. Dividing by the range shrinks or
expands the data to the range [0,1].
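Both operations are easy to express; the sketch below (made-up dollar and count variables, pandas assumed available) computes z-scores and a [0,1] rescaling.

import pandas as pd

# Hypothetical variables on very different scales: dollars vs. counts.
df = pd.DataFrame({"spend_dollars": [1200, 5400, 800, 15000, 2300],
                   "visits": [3, 12, 2, 25, 7]})

# z-scores: subtract the mean, then divide by the standard deviation.
z_scored = (df - df.mean()) / df.std()

# [0,1] rescaling: subtract the minimum, then divide by the range.
rescaled = (df - df.min()) / (df.max() - df.min())

print(z_scored.round(2))
print(rescaled.round(2))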
To consider why normalizing or scaling to [0,1] might be necessary, consider the case of clustering. Clustering typically involves calculating a distance measure
that reflects how far each record is from a cluster center or from other records.
With multiple variables, different units will be used: days, dollars, counts, and
so on. If the dollars are in the thousands and everything else is in the tens,
the dollar variable will come to dominate the distance measure. Moreover,
changing units from, say, days to hours or months could alter the outcome
completely.
Data mining software, including XLMiner, typically has an option to normalize the data in those algorithms where it may be required. It is an option
rather than an automatic feature of such algorithms because there are situations
where we want each variable to contribute to the distance measure in proportion
to its original scale.

2.5 PREDICTIVE POWER AND OVERFITTING

In supervised learning, a key question presents itself: How well will our pre-
diction or classification model perform when we apply it to new data? We are
particularly interested in comparing the performance of different models so that
we can choose the model that we think will do the best when it is implemented
in practice. A key concept is to make sure that our chosen model generalizes
beyond the dataset that we have at hand. To assure generalization, we use the
concept of data partitioning and try to avoid overfitting. These two important
concepts are described next.

Creation and Use of Data Partitions


At first glance, we might think it is best to choose the model that did the best
job of classifying or predicting the outcome variable of interest with the data
at hand. However, when we use the same data both to develop the model
and to assess its performance, we introduce an "optimism" bias. This is because
when we choose the model that works best with the data, this model's superior
performance comes from two sources:


A superior model

Chance aspects of the data that happen to match the chosen model better
than they match other models

The latter is a particularly serious problem with techniques (e.g., trees and neural
nets) that do not impose linear or other structure on the data, and thus end up
overfitting it.
To address the overfitting problem, we simply divide (partition) our data and
develop our model using only one of the partitions. After we have a model, we
try it out on another partition and see how it performs, which we can measure
in several ways. In a classification model, we can count the proportion of held-
back records that were misclassified. In a prediction model, we can measure the
residuals (prediction errors) between the predicted values and the actual values.
This evaluation approach in effect mimics the deployment scenario, where our
model is applied to data that it hasn't "seen."
We typically deal with two or three partitions: a training set, a validation
set, and sometimes an additional test set. Partitioning the data into training,
validation, and test sets is done either randomly according to predetermined
proportions or by specifYing which records go into which partition according
to some relevant variable (e.g., in time series forecasting, the data are partitioned
according to their chronological order). In most cases the partitioning should be
done randomly to minimize the chance of getting a biased partition. It is also
possible (although cumbersome) to divide the data into more than three parti-
tions by successive partitioning (e.g., divide the initial data into three partitions,
then take one of those partitions and partition it further).
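A random three-way split is simple to produce outside XLMiner as well; the sketch below (synthetic data, an arbitrary 50/30/20 split, scikit-learn assumed available) shows one way to do it.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset of 1000 records and four variables.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 4)), columns=list("ABCD"))

# Random 50/30/20 split into training, validation, and test partitions.
train, rest = train_test_split(df, test_size=0.5, random_state=1)
valid, test = train_test_split(rest, test_size=0.4, random_state=1)

print(len(train), len(valid), len(test))   # 500 300 200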

Training Partition The training partition, typically the largest partition,


contains the data used to build the various models we are examining. The same
training partition is generally used to develop multiple models.

Validation Partition The validation partition (sometimes called the test


partition) is used to assess the predictive performance of each model so that
you can compare models and choose the best one. In some algorithms (e.g.,
classification and regression trees, k-nearest-neighbors), the validation partition may be used in an automated fashion to tune and improve the model.

Test Partition The test partition (sometimes called the holdout or evalu-
ation partition) is used to assess the performance of the chosen model with new
data.
Why have both a validation and a test partition? When we use the validation
data to assess multiple models and then choose the model that performs best with
the validation data, we again encounter another (lesser) facet of the overfitting
problem: chance aspects of the validation data that happen to match the chosen
model better than they match other models. In other words, by using the
validation data to choose one of several models, the performance of the chosen
model on the validation data will be overly optimistic.
The random features of the validation data that enhance the apparent perfor-
mance of the chosen model will probably not be present in new data to which
the model is applied. Therefore we may have overestimated the accuracy of our
model. The more models we test, the more likely it is that one of them will be
particularly effective in modeling the noise in the validation data. Applying the
model to the test data, which it has not seen before, will provide an unbiased
estimate of how well the model will perform with new data. Figure 2.2 shows
the three data partitions and their use in the data mining process. When we are
concerned mainly with finding the best model and less with exactly how well it
will do, we might use only training and validation partitions.
Note that with some algorithms, such as nearest-neighbor algorithms, records in the validation and test partitions, and in new data, are compared to records in the training data.
FIGURE 2.2 THREE DATA PARTITIONS AND THEIR ROLE IN THE DATA MINING PROCESS
(Build model(s) with the training data; evaluate model(s) with the validation data; re-evaluate model(s), optionally, with the test data; predict/classify new data using the final model.)

includes spurious effects that are specific to the 100 individuals but not beyond
that sample.
For example, one of the variables might be height. We have no basis in
theory to suppose that tall people might contribute more or less to charity, but if
there are several tall people in our sample and they just happened to contribute
heavily to charity, our model might include a term for height: the taller you
are, the more you will contribute. Of course, when the model is applied to
additional data, it is likely that this will not turn out to be a good predictor.
If the dataset is not much larger than the number of predictor variables, it
is very likely that a spurious relationship like this will creep into the model.
Continuing with our charity example, with a small sample just a few of whom
are tall, whatever the contribution level of tall people may be, the algorithm
is tempted to attribute it to their being tall. If the dataset is very large relative
to the number of predictors, this is less likely to occur. In such a case, each
predictor must help predict the outcome for a large number of cases, so the job
it does is much less dependent on just a few cases that might be flukes.
Somewhat surprisingly, even if we know for a fact that a higher degree
curve is the appropriate model, if the model-fitting dataset is not large enough,
a lower degree function (i.e., not as likely to fit the noise) is likely to perform
better. Overfitting can also result from the application of many different models,
from which the best performing model is selected.
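The point about fitting the noise can be seen in a small experiment; the sketch below (synthetic data, arbitrary polynomial degrees) fits curves of increasing flexibility to a small training sample and compares training and validation errors.

import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: a mildly curved relationship plus noise.
x = rng.uniform(-3, 3, size=30)
y = 1.5 * x + 0.5 * x**2 + rng.normal(0, 2, size=30)
x_val = rng.uniform(-3, 3, size=200)
y_val = 1.5 * x_val + 0.5 * x_val**2 + rng.normal(0, 2, size=200)

def rmse_by_degree(deg):
    coefs = np.polyfit(x, y, deg)              # fit on the small training sample
    pred_tr = np.polyval(coefs, x)
    pred_va = np.polyval(coefs, x_val)
    return (np.sqrt(np.mean((y - pred_tr) ** 2)),
            np.sqrt(np.mean((y_val - pred_va) ** 2)))

for deg in (1, 2, 10):
    tr, va = rmse_by_degree(deg)
    print(f"degree {deg}: training RMSE={tr:.2f}, validation RMSE={va:.2f}")

# The degree-10 fit typically has the lowest training error but the worst
# validation error: it has fit the noise in the 30 training points.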

2.6 BUILDING A PREDICTIVE MODEL WITH XLMINER

Let us go through the steps typical to many data mining tasks using a familiar
procedure: multiple linear regression. This will help you understand the overall
process before we begin tackling new algorithms. We illustrate the procedure
using XLMiner.

Predicting Home Values in the West Roxbury Neighborhood


The Internet has revolutionized the real estate industry. Realtors now list
houses and their prices on the web, and estimates of house and condominium
prices have become widely available, even for units not on the market. Zillow
(www.zillow.com) is the most popular online real estate information site, 1 and
in 2014 they purchased their major rival, Trulia. By 2015 Zillow had become the
dominant platform for checking house prices and, as such, the dominant online
advertising venue for realtors. What used to be a comfortable 6% commission
structure for realtors, affording them a handsome surplus (and an oversupply of
realtors), was being rapidly eroded by an increasing need to pay for advertising

1 "Zestimates may not be as right as you'd like" Washington Post Feb. 7, 2015, p. TI0, by K. Hamey.

TABLE 2.4 OUTLIER IN WEST ROXBURY DATA

FLOORS  ROOMS
15      8
2       10
1.5     6
1       6

It is also useful to check for outliers that might be errors. For


example, suppose that the column FLOORS (number of floors) looked
like the one in Table 2.4, after sorting the data in descending order based
on floors. We can tell right away that the 15 is in error; it is unlikely that
a home has 15 floors. All other values are between 1 and 2. Probably,
the decimal was misplaced and the value should be 1.5.
Last, we create dummy variables for categorical variables. Here
we have one categorical variable: REMODEL, which has three
categories.
4. Reduce the data dimension. Our dataset has been prepared for presentation with fairly low dimension: it has only 13 variables, and the single categorical variable considered has only three categories (and hence adds two dummy variables). If we had many more variables, at this stage we might
want to apply a variable reduction technique such as condensing multi-
ple categories into a smaller number, or applying principal components
analysis to consolidate multiple similar numerical variables (e.g., LIVING
AREA, ROOMS, BEDROOMS, BATH, HALF BATH) into a smaller
number of variables.
5. Determine the data mining task. The specific task is to predict the value
of TOTAL VALUE using the predictor variables. This is a supervised
prediction task. For simplicity, we excluded several additional variables
present in the original dataset, which have many categories (BLDG
TYPE, ROOF TYPE, and EXT FIN). We therefore use all the numerical variables (except TAX) and the dummies created for the remaining categorical variables.
6. Partition the data (for supervised tasks). In XLMiner, select Partition from the Data Mining menus, and the dialog box shown in Figure 2.5 appears. Here we specify the data range to be partitioned and the variables to be
included in the partitioned dataset. The partitioning can be handled in
one of two ways:
a. The dataset can have a partition variable that governs the division into
training and validation partitions ("t" = training, "v" = validation).

FIGURE 2.5


8. Use the algorithm to perform the task. In XLMiner, we select Multiple Lin-
ear Regression from the Prediction menu. The first dialog box is shown
in Figure 2.6. The variable TOTAL VALUE is selected as the output
(dependent) variable. All the other variables, except TAX and one of
the REMODEL dummy variables, are selected as input (predictor) vari-
ables. We ask XLMiner to show us the fitted values on the training data
as well as the predicted values (scores) on the validation data, as shown in
Figure 2.7. XLMiner produces standard regression output, but for now
we defer that as well as the more advanced options displayed above (see
Chapter 6 or the user documentation for XLMiner for more information). Rather, we review the predictions themselves. Figure 2.8 shows
the predicted values for the first few records in the training data along
with the actual values and the residuals (prediction errors). Note that the
predicted values would often be called the fitted values, since they are for
the records to which the model was fit. The results for the validation
data are shown in Figure 2.9. The prediction errors for the training and
validation data are compared in Figure 2.10.
Prediction error can be measured in several ways. Three measures
produced by XLMiner are shown in Figure 2.10. On the right is the
average error, simply the average of the residuals (errors). In both cases,
it is quite small relative to the units of TOTAL VALUE, indicating
that, on balance, predictions average about right; our predictions are "unbiased." Of course, this simply means that the positive and negative errors balance out. It tells us nothing about how large these errors are.
The total sum of squared errors on the left adds up the squared errors, so whether an error is positive or negative, it contributes just the same. However, this sum does not yield information about the size of the typical error.
The RMS error (root-mean-squared error) is perhaps the most useful
term of all. It takes the square root of the average squared error, so it
gives an idea of the typical error (whether positive or negative) in the
same scale as that used for the original outcome variable. As we might
expect, the RMS error for the validation data (45.2 thousand dollars),
which the model is seeing for the first time in making these predictions,
is larger than for the training data (40.9 thousand dollars), which were used in training the model. (A brief sketch of these three error measures appears after this list.)
9. Interpret the results. At this stage we would typically try other prediction
algorithms (e.g., regression trees) and see how they do error-wise. We
might also try different "settings" on the various models (e.g., we could
use the best subsets option in multiple linear regression to choose a reduced
set of variables that might perform better with the validation data). After
choosing the best model (typically, the model with the lowest error on the validation data, while also recognizing that "simpler is better"), we

FIGURE 2.6

FIGURE 2.7

FIGURE 2.8

FIGURE 2.9

FIGURE 2.10

FIGURE 2.11 WORKSHEET WITH THREE RECORDS TO BE SCORED

TAX   LOT SQFT  YR BUILT  GROSS AREA  LIVING AREA  FLOORS  ROOMS  BEDROOMS  FULL BATH  HALF BATH  KITCHEN  FIREPLACE  REMODEL None  REMODEL Old  REMODEL Recent
3850  6877      1963      2240        1808         1       6      3         1          1          1        0          1             0            0
5386  5965      1963      2998        1637         1.5     8      3         3          0          1        1          0             1            0
4608  5662      1961      2805        1750         2       7      4         2          0          1        1          0             0            1

use that model to predict the output variable in fresh data. These steps
are covered in more detail in the analysis of cases.
10. Deploy the model. After the best model is chosen, it is applied to new data
to predict TOTAL VALUE for homes where this value is unknown. This
was, of course, the original purpose. Predicting the output value for new
records is called scoring. For predictive tasks, scoring produces predicted
numerical values. For classification tasks, scoring produces classes and/or
propensities. In XLMiner, we can score new records using one of the
models we developed. To do that, we must first create a worksheet or file
with the records to be predicted. For these records, we must include all
the predictor values. Figure 2.11 shows an example of a worksheet with
three homes to be scored using our linear regression model. Note that
all the required predictor columns are present, and the output column is
absent.
Figure 2.12 shows the Score dialog box. We chose "match by name"
to match the predictor columns in our model with the new records'
worksheet. The result is shown in Figure 2.13, where the predictions
are in the first column. Note: In XLMiner, scoring new data can also be
done directly from a specific prediction or classification2 method dialog
box ("New Data Scoring," typically in the last step). In our example,
scoring can be done in step 2 in the multiple linear regression dialog
shown in Figure 2.7.
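The three error measures discussed in step 8 are straightforward to compute from residuals; the sketch below uses hypothetical actual and predicted values (in $ thousands), not the output of the West Roxbury model.

import numpy as np

# Hypothetical actual and predicted TOTAL VALUE figures ($ thousands).
actual = np.array([392.0, 476.3, 367.4, 350.3, 348.1])
predicted = np.array([387.7, 450.1, 384.5, 360.2, 368.1])

residuals = actual - predicted
avg_error = residuals.mean()               # near zero if predictions are unbiased
sse = np.sum(residuals ** 2)               # total sum of squared errors
rmse = np.sqrt(np.mean(residuals ** 2))    # typical error, in the original units

print(f"average error = {avg_error:.2f}, SSE = {sse:.1f}, RMSE = {rmse:.2f}")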

XLMiner has a facility for drawing a sample from an external database. The sample can be drawn at random or it can be stratified. It also has a facility to score data in the external database using the model that was obtained from the training data.

2 Note: In some versions of XLMiner, propensities for new records are produced only if "New Data Scoring" is selected in the classification method dialog; they are not available in the option to score stored models.

FIGURE 2.12

FIGURE 2.13

judiciously, 2000 voters can give an estimate of the entire population's opinion
within one or two percentage points. (See "How Many Variables and How
Much Data" in Section 2.4 for further discussion.)
Therefore, in most cases, the number of records required in each partition
(training, validation, and test) can be accommodated within the rows allowed by
Excel. Of course, we need to get those records into Excel, and for this purpose
the standard version of XLMiner provides an interface for random sampling of
records from an external database.
Similarly, we may need to apply the results of our analysis to a large database,
and for this purpose the standard version of XLMiner has a facility for storing
models and scoring them to an external database. For example, XLMiner would
write an additional column (variable) to the database consisting of the predicted
purchase amount for each record.
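The scoring step itself amounts to applying the stored model to new records and writing the predictions back as an additional column; the sketch below (hypothetical predictors, with scikit-learn's linear regression standing in for the stored model) illustrates the idea.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)

# Hypothetical training data: purchase amount driven by two predictors.
train = pd.DataFrame({"visits": rng.integers(1, 20, 300),
                      "tenure_years": rng.uniform(0, 10, 300)})
train["purchase"] = (20 * train["visits"] + 5 * train["tenure_years"]
                     + rng.normal(0, 10, 300))

model = LinearRegression().fit(train[["visits", "tenure_years"]], train["purchase"])

# Score new records and append the predictions, analogous to writing a
# predicted-purchase column back to the database.
new_records = pd.DataFrame({"visits": [4, 15], "tenure_years": [1.5, 8.0]})
new_records["predicted_purchase"] = model.predict(new_records[["visits", "tenure_years"]])
print(new_records)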

2.8 AUTOMATING DATA MINING SOLUTIONS

In most supervised data mining applications, the goal is not a static, one-time
analysis of a particular dataset. Rather, we want to develop a model that can be
used on an ongoing basis to predict or classify new records. Our initial analysis will be in prototype mode, while we explore and define the problem and test
different models. We will follow all the steps outlined earlier in this chapter.
At the end of that process, we will typically want our chosen model to be
deployed in automated fashion. For example, the US Internal Revenue Service (IRS) receives several hundred million tax returns per year; it does not want to have to pull each tax return out into an Excel sheet or other environment separate from its main database to determine the predicted probability that the return is fraudulent. Rather, it would prefer that determination to be made as part of the normal tax filing environment and process. Music streaming services, such as Pandora or Spotify, need to determine "recommendations" for
next songs quickly for each of millions of users; there is no time to extract the
data for manual analysis.
In practice, this is done by building the chosen algorithm into the com-
putational setting in which the rest of the process lies. A tax return is entered
directly into the IRS system by a tax preparer, a predictive algorithm is imme-
diately applied to the new data in the IRS system, and a predicted classification
is decided by the algorithm. Business rules would then determine what happens
with that classification. In the IRS case, the rule might be "if no predicted
fraud, continue routine processing; if fraud is predicted, alert an examiner for
possible audit."
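The business-rule layer is typically a thin wrapper around the model's output; a minimal sketch (the 0.8 threshold and the routing messages are invented for illustration) might look like this.

def route_tax_return(fraud_probability, threshold=0.8):
    """Apply a simple business rule to a model's predicted fraud probability."""
    if fraud_probability >= threshold:
        return "alert an examiner for possible audit"
    return "continue routine processing"

# Hypothetical scores produced by a deployed classification model.
for score in (0.05, 0.92):
    print(score, "->", route_tax_return(score))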
This flow of the tax return from data entry, into the IRS system, through
a predictive algorithm, then back out to a human user is an example of a

DATA MINING SOFTWARE TOOLS: THE STATE OF THE MARKET

In contrast to the general-purpose suites, application-specific tools are intended for particular analytic applications such as credit scoring, customer retention, and product marketing. Their focus may be further sharpened to address the needs of specialized markets such as mortgage lending or financial services. The target user is an analyst with expertise in the application domain. Therefore the interfaces, the algorithms, and even the terminology are customized for that particular industry, application, or customer. While less flexible than general-purpose tools, they offer the advantage of already incorporating domain knowledge into the product design, and can provide very good solutions with less effort. Data mining companies including SAS, IBM, and RapidMiner offer vertical market tools, as do industry specialists such as Fair Isaac. Other companies, such as Domo, are focusing on creating dashboards with analytics and visualizations for business intelligence.
Another technological shift has occurred with the spread of open source model building tools and open core tools. A somewhat simplified view of open source software is that the source code for the tool is available at no charge to the community of users and can be modified or enhanced by them. These enhancements are submitted to the originator or copyright holder, who can add them to the base package. Open core is a more recent approach in which a core set of functionality remains open and free, but there are proprietary extensions that are not free.
The most important open source statistical analysis software is R. R is descended from a Bell Labs program called S, which was commercialized as S+. Many data mining algorithms have been added to R, along with a plethora of statistics, data management tools, and visualization tools. Because it is essentially a programming language, R has enormous flexibility but a steeper learning curve than many of the GUI-based tools. Although there are some GUIs for R, the overwhelming majority of use is through programming.
Some vendors, as well as the open source community, are adding statistical and data mining tools to Python, a popular programming language that is generally easier to use than C++ or Java, and faster than R.
As mentioned above, the cloud-computing vendors have moved into the data mining/predictive analytics business by offering AaaS (Analytics as a Service) and pricing their products on a transaction basis. These products are oriented more toward application developers than business intelligence analysts. A big part of the attraction of mining data in the cloud is the ability to store and manage enormous amounts of data without requiring the expense and complexity of building an in-house capability. This can also enable a more rapid implementation of large distributed multi-user applications. Cloud-based data can be used with non-cloud-based analytics if the vendor's analytics do not meet the user's needs.
Amazon has added Amazon Machine Learning to its Amazon Web Services (AWS), taking advantage of predictive modeling tools developed for Amazon's internal use. AWS supports both relational databases and Hadoop data management. Models cannot be exported, because they are intended to be applied to data stored on the Amazon cloud.
Google is very active in cloud analytics with its BigQuery and Prediction API. BigQuery allows the use of Google infrastructure to access large amounts of data using a SQL-like interface. The Prediction API can be accessed from a variety of languages including R and Python. It uses a variety of machine learning algorithms and automatically selects the best results. Unfortunately, this is not a transparent process. Furthermore, as with Amazon, models cannot be exported.

Microsoft is an active player in cloud analytics with its Azure Machine Learning Studio and Stream Analytics. Azure works with Hadoop clusters as well as with traditional relational databases. Azure ML offers a broad range of algorithms such as boosted trees and support vector machines as well as supporting R scripts and Python. Azure ML also supports a workflow interface, making it more suitable for the nonprogrammer data scientist. The real-time analytics component is designed to allow streaming data from a variety of sources to be analyzed on the fly. XLMiner's cloud version is based on Microsoft Azure. Microsoft also acquired Revolution Analytics, a major player in the R analytics business, with a view to integrating Revolution's "R Enterprise" with SQL Server and Azure ML. R Enterprise includes extensions to R that eliminate memory limitations and take advantage of parallel processing.
One drawback of the cloud-based analytics tools is a relative lack of transparency and user control over the algorithms and their parameters. In some cases, the service will simply select a single model that is a black box to the user. Another drawback is that for the most part cloud-based tools are aimed at more sophisticated data scientists who are systems savvy.
Data science is playing a central role in enabling many organizations to optimize everything from production to marketing. New storage options and analytical tools promise even greater capabilities. The key is to select technology that's appropriate for an organization's unique goals and constraints. As always, human judgment is the most important component of a data mining solution.
This book's focus is on a comprehensive understanding of the different techniques and algorithms used in data mining, and less on the data management requirements of real-time deployment of data mining models. XLMiner's short learning curve and integration with Excel make it ideal for this purpose, and for exploration, prototyping, and piloting of solutions.

Herb Edelstein is president of Two Crows Consulting (www.twocrows.com), a leading data mining consulting firm near Washington, DC. He is an internationally recognized expert in data mining and data warehousing, a widely published author on these topics, and a popular speaker. © 2015 Herb Edelstein.