Data Exploration in Preparation for Modeling

Michael Berry, Data Miners Inc.


Exploratory Analysis and Data Mining
What makes a good modeler (or data miner, or analyst)? This question comes up often. Managers ask because
they want to hire such a person. Students ask because they want to become such a person. The answer is
that while many skills are useful, the most important traits are curiosity, creativity and intuition for how to answer
important questions using data. With a bit of training, a person with these traits can acquire the skills to produce
great results using fairly simple tools. Without them, even a PhD statistician will struggle to produce good results.

“Curiosity and creativity are arguably inborn, but intuition for data comes from time
spent exploring it.”

Curiosity and creativity are arguably inborn, but intuition for data comes from time spent exploring it. Exploratory
analysis is a crucial part of any data mining project. This is the stage where you discover what individual variables
actually contain and how variables interact with each other and with potential modeling targets. Using JMP®, much
of this exploration can be done visually using charts, graphs and maps. This white paper illustrates the process
by exploring the house file of a catalog retailer in preparation for building response models and segmenting
customers.

Before diving into the catalog data, we present some cautionary tales in which seemingly innocuous data contains
traps for the unwary modeler.

What You Don’t Know Can Hurt You


The examples presented here are all taken from real data sets used in real predictive modeling projects. The
problems with these data sets are neither unusual nor particularly hard to spot. In fact, that is the point: If you
spend time exploring your data before creating models, you will save yourself a lot of trouble by spotting problems
quickly and easily before they lead to incorrect models and erroneous results.

Do some households really spend hundreds of thousands of dollars at the grocery store
each week?
Supermarket loyalty card data can be used to build a surprisingly rich portrait of a shopper. Every bag of corn
chips and can of cat food that goes over the scanner becomes a record in a database. Transactions with the
same store code, lane number and register open time can be grouped into “market baskets.” When these market
baskets can be linked to individual shoppers through a loyalty card number, it becomes possible to ask questions
like these:

• What time of day does this shopper habitually shop? Weekday afternoons? Late nights and weekends?
The answer suggests something about the shopper’s lifestyle.
• How adventurous is the shopper? Does she frequently try new things?
• How loyal is she to her regular brands? When a competing brand is on special, does
she switch?
• Does she have children? Pets?
• Does she spend more or less than the average shopper?
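The basket-grouping rule described above (same store code, lane number and register open time) can be sketched in a few lines of Python; the field names and values here are illustrative, not taken from any real loyalty card data:

```python
# Illustrative sketch of grouping scanner records into market baskets.
# The field names and values are hypothetical, not from real loyalty data.
from collections import defaultdict

transactions = [
    {"store": 12, "lane": 3, "open_time": "10:15", "item": "corn chips"},
    {"store": 12, "lane": 3, "open_time": "10:15", "item": "cat food"},
    {"store": 12, "lane": 5, "open_time": "10:17", "item": "milk"},
]

baskets = defaultdict(list)
for t in transactions:
    # Same store code, lane number and register open time => same basket.
    key = (t["store"], t["lane"], t["open_time"])
    baskets[key].append(t["item"])
```

Once baskets are linked to a loyalty card number, the questions above become aggregations over a shopper's baskets.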

www.jmp.com/berry 1
The last question sounds like the simplest to answer, but loyalty card data typically includes some cards that
spend way more than any family possibly could. Why? Because when you are about to miss out on a discount
because you forgot your card, the friendly cashier kindly scans hers for you. Don’t think you’re special; she does
that for everyone! Clearly, these outliers should be eliminated before calculating averages or, for that matter,
statistics of any kind, but if you don’t look for them, you might not find them.
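A minimal sketch of the screening step, assuming an illustrative spend cutoff (the $5,000 threshold is invented for the example, not taken from the source data):

```python
# Sketch of screening out implausible spenders before computing an average.
# The $5,000/week cutoff is an illustrative threshold, not from the source.
weekly_spend = [45.0, 80.0, 120.0, 62.0, 350000.0]  # last value: a shared cashier card

CUTOFF = 5000.0
plausible = [s for s in weekly_spend if s <= CUTOFF]

mean_all = sum(weekly_spend) / len(weekly_spend)  # distorted by the outlier
mean_plausible = sum(plausible) / len(plausible)  # a more honest average
```

In practice the cutoff itself should come from exploration, for example by examining the distribution's extreme percentiles rather than picking a number in advance.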

Were there really a lot of laptop computers in 1900?


Laptop computers manufactured by a particular company come with software that periodically contacts the
manufacturer and reports some statistics about their general health. These postcards from the field arrive as XML
files. One of the XML elements contains the date and time according to the PC’s clock. Since the program was put
in place in 2009, all of the dates should be from 2009 or later. A histogram of reported dates by year shows a large
number of unexpected years. 1900 and 1931 are particularly popular.

“Finding problems is not the only reason for exploring your data before modeling.
Data exploration is also a good way to spot important relationships between
variables, test hypotheses and generate ideas for new derived variables.”

This matters because the earliest reporting date is used as an estimate of when the machine was first powered
up by a user. This may be long after the date of manufacture since systems sometimes spend time in a retailer’s
warehouse before being sold. A few thousand machines that appear to have been around for over 100 years will
certainly change the calculated average age of systems in the field!

The likely explanation is that the clock stops working when the laptop’s battery is removed from the system, and
then reverts to some default time when power is restored. Apparently, some users never reset the clock after such
an event.
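The fix implied by this story can be sketched as a simple filter: discard report dates from before the program existed when estimating first power-on. The dates below are illustrative:

```python
# Sketch: discard report dates before the program existed (2009) when
# estimating when a machine was first powered up. Dates are illustrative.
from datetime import date

reported_dates = [date(1900, 1, 1), date(1931, 6, 5), date(2009, 3, 2), date(2011, 7, 9)]

valid = [d for d in reported_dates if d.year >= 2009]
first_power_on = min(valid)  # earliest credible report
```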

Is Sydney, NSW, really in the United States?


In another case, a histogram of subscribers by country showed far more customers than expected in the United
States. The US is in fact the largest market for the company in question, but the size of the discrepancy was
greater than expected. The problem turned out to be that “United States” was the default value for the country
field, so any time someone entered an address without supplying a country, the address was assumed to be in the
US. This led to addresses such as:

200 George St.


Sydney, NSW 2000
United States

145 King St. W.


Toronto, ON M5H 1J8
United States

You can probably guess the correct countries for these addresses, but that is not the same as being able to write
a program to do it. This is an example of a fairly common problem – missing values disguised as something else.
Another common mistake is to use some unusual value (such as a negative number for a value that can only be
positive) to indicate a missing value. The meaning may be clear to human readers, but to a data mining algorithm it
is just another number.
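A sketch of unmasking a disguised missing value, assuming an illustrative sentinel of -1 for an unknown income:

```python
# Sketch of unmasking disguised missing values: a sentinel such as -1 for a
# value that can only be positive becomes an explicit None. Values illustrative.
def clean_income(value):
    # -1 meant "unknown" to the humans who loaded the data; to an algorithm
    # it is just another number, so make the missingness explicit.
    return None if value < 0 else value

incomes = [52000, -1, 61000]
cleaned = [clean_income(v) for v in incomes]
```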

Looking for Relationships


Finding problems is not the only reason for exploring your data before modeling. Data exploration is also a good
way to spot important relationships between variables, test hypotheses and generate ideas for new derived
variables. Simple visualizations such as scatter plots and maps facilitate this.

For example, these two scatter plots look at the relationship of two census variables – the percentage of people
working in agriculture and the total population of a town to the percentage of homes heated by wood (the value
to be modeled). The population plot shows that when a town has a large population, it never has high penetration
of wood stoves used as the primary heating source, but when the population is small there is a very wide range
of penetrations. The agriculture plot shows that there is a strong positive correlation between the percentage of
people working in agriculture and the percentage of homes heated by wood. Together, they suggest that wood
stoves are more likely to be used in rural areas. The original census data does not include a variable that defines
rurality, but it does contain the land area and population of each town, so we can test the rural hypothesis using
population density.

Defining a new column is a simple matter in JMP, and the recursive partitioning tool provides an easy way to test
the ability of the new PopDensity field to separate the towns into a high-penetration group and a low-
penetration group.
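In JMP this is done through a column formula; the equivalent derivation is a one-line computation, sketched here with illustrative town records:

```python
# Sketch of the derived PopDensity column: population divided by land area.
# Town records are illustrative.
towns = [
    {"town": "A", "population": 1200, "land_area": 40.0},   # rural
    {"town": "B", "population": 90000, "land_area": 30.0},  # urban
]
for t in towns:
    t["PopDensity"] = t["population"] / t["land_area"]
```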

The very first split on population density yields one group of towns with average woodstove penetration of 4
percent and another with average penetration of 16 percent. Further splits lead to a tree whose leaves have average
penetrations ranging from about 0.5 percent to more than 21 percent. Splits are chosen to maximize the difference in
penetration between the two children of a split. This process continues recursively until some stopping condition is
met.

At each split, all available variables are evaluated to identify the one that does the best job of separating high-
penetration towns from low-penetration towns. As a result, the variables picked for inclusion in the decision tree
are good candidates for use in other models.
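The split-selection idea can be sketched in Python. This is a simplified stand-in for JMP's partitioning criterion, choosing the threshold that maximizes the gap in mean penetration between the two children:

```python
# Simplified sketch of choosing one split: try each candidate threshold on a
# variable and keep the one maximizing the difference in mean penetration
# between the two children. Not JMP's actual criterion; data is illustrative.
def best_split(xs, ys):
    best = (None, 0.0)
    for threshold in sorted(set(xs))[1:]:  # candidate cut points
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        gap = abs(sum(left) / len(left) - sum(right) / len(right))
        if gap > best[1]:
            best = (threshold, gap)
    return best

density = [10, 20, 500, 800]
penetration = [0.18, 0.14, 0.05, 0.03]
threshold, gap = best_split(density, penetration)
```

Repeating this search over all variables at each node, and recursing on the children, yields the tree.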

A Typical Database Marketing Scenario
The following extended example uses data from Vermont Country Store, a catalog and online retailer. The data
used here is a 10 percent sample of the company’s customers as of the end of 1998. The age of the data is visible
in, for example, the distribution of payment methods (lots of people pay by check; no one uses PayPal), but aside
from that, the data could be from yesterday. The reason for using this particular data set for the extended example
is that it is available for download so readers may explore it on their own.

For each customer there is historical data on past purchases and past campaigns. In particular, the data shows
how many times customers have made purchases in each of 27 departments and how much money they
have spent each quarter for a series of 23 consecutive quarters. There is no demographic data describing the
customers, but from their ZIP codes it is possible to look up ZIP code-level demographics such as population
density and average home values.

This data can be used to support a variety of business goals, but we will focus on increasing the response rate
and increasing the average order size.

Increasing the Response Rate


One of the fields in the catalog data set is called respond. The field has the value 1 if the customer made a
purchase from either or both of two Christmas-themed catalogs. This field would be the target variable for a
response model. Such a model would give each customer a score indicating the probability of response to a
similar catalog this year.

Increasing the Average Order Size


Another important field is dolindea, which records the average order size for each customer. This would be
the target variable for a model to estimate order size. When exploring the data set, we will pay special attention to
the variables that are potential targets for model building.

The Data
As mentioned previously, the data used for this example is available for download. The ZIP code demographics
from the 2000 census are in tables available from the companion Web page for Gordon Linoff’s book Data
Analysis Using SQL and Excel at www.data-miners.com/sql_companion.htm. The catalog data is available
at www.data-miners.com/materials.htm.

It is worth noting that in the computing environment used for this paper, the ZIP code demographic tables were in
an SQL Server database and the catalog data was in a SAS® data set. JMP provides access to these and many
other data sources, so it is not necessary to do anything special to import data into JMP. New JMP tables can
be defined using SQL queries or through the JMP graphical user interface. Here, for example, is a JMP table that
uses data from both sources summarized to the county level.

The rest of this paper uses JMP to explore these data sources in preparation for building models for response and
average order size.

Bearing in mind the eventual modeling goals, some questions come to mind: Where are the customers? Are they
concentrated in certain parts of the country? Do they tend to be urban or rural? Rich or poor? Do areas with a
high proportion of customers also exhibit higher response rates? What are response rates like, anyway? Do they
vary by time or year? And so forth.

Distribution of the Response Target


This chart shows the distribution of responders and nonresponders.

About 5.7 percent of customers who were sent the Christmas catalog made a purchase from it. One reason it is
important to know the response percentage is that some modeling techniques do not work very well when the
target classes are not in balance. In particular, decision tree algorithms base their splitting decisions on increasing
purity. Since a sample containing 94 percent nonresponders is already fairly pure, the tree won’t work very hard to
increase purity.

Distribution of the Average Order Size Target
JMP software’s standard distribution plot is a better way to look at average order size.

The long tail on the right invites further exploration. The largest average order size is $768.85, which seems very
high for this sort of catalog. Ninety percent of customers have an average order size less than $87.90. Because of
the outliers, the average order size is $10 higher than the median order size of $37.78.

Where Are the Customers?


The name of the company that supplied the data for this example is Vermont Country Store. As the name
suggests, the company is based in Vermont, and many of the products it sells have a Yankee theme. Does this
mean that customers are also concentrated in New England? A simple bar chart of customer counts by state
hints at the answer, since Massachusetts has the fourth-largest number of customers but is not in the top 10 by
population, and Connecticut, Maine and New Hampshire are all in the top 25.

JMP provides a much nicer way of looking at geographic data. It recognizes many standard geographic
designations, including the standard postal abbreviations for countries and for US states. It even recognizes the
FIPS codes used to identify US counties. Any data that includes one of these geographic fields can be displayed
on a map. Here is a map of the United States colored by the percentage of its population that is in the Vermont
Country Store sample.

Not surprisingly, Vermont has the highest customer density, followed by the other New England states and, for
some reason, Alaska. Utah is the darkest blue, indicating the lowest percentage of customers. Utah and Vermont
share a love of skiing, but not, apparently, a love of maple sugar candies and aged cheddar. Southern states are
also dark blue except for Virginia. Could this be the influence of the northern Virginia counties close to customer-
rich Washington? Let’s look at Virginia by county to find out.

This map colored by penetration does not support the “close to Washington” hypothesis. Some of the counties
with high penetration are along the border with West Virginia. Clicking on the reddest county highlights its row in
the data table. It turns out to be sparsely-populated Highland County, which owes its high penetration to just four
customers. Some counties have no customers in the sample. These appear as white holes in the map because
only FIPS codes found in the customer data are plotted.
In fact, most Virginia counties have very few customers. A Pareto chart reveals that a handful of counties account
for most of the customers. Fairfax County, just outside of Washington, is home to nearly 14 percent of Virginia
customers, so the hunch that Vermont Country Store customers would be found close to Washington was actually
correct even though they are a small percentage of the population of that northern Virginia county.

Where Are the Responders?


We have seen that customers are concentrated in New England. Are New Englanders also more likely to respond?
To answer that question, we can create another derived variable to compare the number of catalogs that have
ever been mailed to a particular geographic area to the number of orders received from that area. Before getting to
that, we must decide what level of geography is appropriate. Our choices include states, counties, ZIP codes and
ZIP code prefixes, such as the first 3 digits. The geographic areas should be small enough to capture important
variations, but large enough to keep sample sizes reasonable.

Using the JMP table summarization facility, it is easy to count the number of customers, catalogs and orders in
each of the 14,632 ZIP codes in the sample. By summarizing the resulting summary table, we can count the
number of ZIP codes that have just one customer, just two customers and so on.

ZIP codes without any customers are not shown because the chart is a summary of customer data, not ZIP code
data. There are 6,033 ZIP codes with only one customer, and 2,620 ZIP codes have only two customers. Only a
handful of ZIP codes have more than 20 customers, so clearly the proportion of responders at ZIP code level is not
meaningful. The same is true for counties: Very few have more than 30 customers.

Since the vast majority of counties contain too few customers to calculate a meaningful response rate, we look at
response rate by state. To establish a baseline, we first calculate the overall response rate by summing the catalog
count field to calculate the total number of catalogs mailed and the frequency of purchase field to calculate the
total number of orders. This calculation leads to a surprising (and not credible) result: The overall response rate
is greater than one! There were 182,100 catalogs mailed and 201,499 orders received. Not only does that seem
unlikely given our knowledge of the industry, it also contradicts another field in the table that should be closely
correlated – proportion of quarters in which the customer has an order.

One possibility is that the count of catalogs and the count of orders were made over different periods. This is a frequent
problem when data from several different sources is summarized and brought together in a single customer
signature table. Comparing summary data from different time periods can lead to incorrect conclusions. In one
case, the author characterized the tendency of customers to complain by dividing the total number of complaints
received by the customer’s tenure. The idea was to be able to compare complaint rates for customers with
different tenures. Unfortunately, records of the complaints had only been kept for two years, but some customers
had been there for decades. As a result, the calculated complaint rate was correct for customers with under
two years of tenure, but too low for everyone else. This led to a strong, but specious, correlation between low
complaint rates and long tenure. If the mistake had not been caught, it would have been easy to conclude that
happy customers last a long time and can be recognized by the infrequency of their complaints.
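The corrected calculation described in this anecdote is simply to divide by the months actually observed rather than by total tenure. A sketch, with the 24-month recorded-history window taken from the example:

```python
# Sketch of the corrected complaint-rate calculation: divide by the months
# actually observed, not total tenure, when history is truncated.
HISTORY_MONTHS = 24  # complaints were only recorded for two years

def complaint_rate(complaints, tenure_months):
    observed = min(tenure_months, HISTORY_MONTHS)
    return complaints / observed

naive_rate = 6 / 120               # a ten-year customer looks nearly complaint-free
corrected = complaint_rate(6, 120) # 6 complaints over the 24 observed months
```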

To see if something like that is going on here, we compare the results of several different ways of characterizing
response rates.

Reported Orders Divided by Tenure
This is the way JMP displays information about the distribution of the new derived variable for the order rate.
Although 50 percent of customers have an order rate under .05 orders per month, there are some outliers with
rates over one order per month. Using the lasso tool, it is possible to select just these records. The 70 customers
with order rates of more than one per month have a median tenure of one month, meaning that more than half
these customers were in their first month of tenure when the data was extracted. The mean tenure, on the other hand, is
3.6 months, so there must be a few customers who have been buying at this rate for a long time.

There is one customer with tenure 95 months (nearly eight years), and 150 orders. This customer is in ZIP code
05759, which happens to be North Clarendon, VT. A little Web research reveals that although the company
headquarters is in Manchester, VT, the call center and mail order department is in nearby North Clarendon. Given
that, it seems likely that the customer with daily orders is not a real customer, but some sort of test of the system.
Without exploration, we would not know to exclude this “customer” from the analysis.

Calculated Order Rate During the 23-Quarter Observation Period


The reported frequency of purchase is presumably a lifetime-to-date figure. For a period of 23 quarters from Q1 of
1993 through Q3 of 1998 we have details of how many orders were placed each quarter. How well do these agree
with the order rate calculated above? And how well does either correlate with the reported proportion of quarters
with a buy?

Calculating the order rate during the observation period is a bit tricky because not all customers have been on the
customer list for the entire period. Ideally, we should start the calculation for customers when they received their
first catalog, but the documentation that accompanied the data suggests that the tenure variable records months
since first order rather than months since first offer. Rather than trust the documentation, we compare the tenure
variable with the calculated months since first order. Calculating the number of months from the first purchase
to 1998-09-18, which is the maximum value for most recent purchase, yields the values recorded in the tenure
variable. Using this method, each customer’s tenure starts with a purchase rather than with a possibly extended
wait for first purchase, so the estimated purchase rate is higher than it should be. Since the effect is the same for
all customers, it is not a problem for modeling so long as we remember to use the same definition of tenure when
the model is used for scoring potential catalog recipients.

To complete the calculation, we need two derived variables: the number of purchases in the 23-quarter observation
period and the number of quarters (less than or equal to 23) included in the tenure.
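These two derived variables can be sketched in Python rather than as JMP column formulas; the per-quarter order counts below are illustrative:

```python
# Sketch of the two derived variables: purchases within the 23-quarter window,
# and quarters of tenure capped at 23. Per-quarter order counts are illustrative.
OBS_QUARTERS = 23

def obs_order_rate(orders_by_quarter, tenure_quarters):
    purchases = sum(orders_by_quarter[-OBS_QUARTERS:])  # orders inside the window
    quarters = min(tenure_quarters, OBS_QUARTERS)       # tenure capped at 23
    return purchases / quarters

rate = obs_order_rate([1, 0, 0, 2, 0, 1], tenure_quarters=6)
```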

As hoped, the order rate calculated from the 23-quarter observation period is closely correlated with the order rate
calculated from lifetime orders and tenure.

Summary of Fit
RSquare 0.902332
RSquare Adj 0.90233
Root Mean Square Error 0.02551
Mean of Response 0.069373

One might expect the order rate to be closely correlated with one of the other variables from the original data
set – namely, the proportion of quarters with a purchase, but this expectation is not borne out. JMP provides many
ways of looking at these relationships, including a scatter plot matrix, a simple correlation matrix and a 3-D scatter
plot.

The scatter plots are colored by customer tenure with red indicating long tenure. Note that the reddest areas
correspond to low order rates. Many customers with long tenure have not actually made a purchase in years, so
their number of orders per quarter is low.

The correlation matrix is another way to look at these pairwise relationships. It confirms the very strong relationship
between the two methods of calculating the order rate.

Any way you look at it, there is little correlation between the orders per month and the proportion of quarters with
a buy, and the small correlation that does exist is negative. This is unexpected. It says that people who buy more
often make their purchases in fewer quarters. This requires further investigation.

This scatter plot shows the relationship between the overall order rate, the order rate during the observation
period, and the proportion of quarters with a buy. The chart is colored by customer tenure, with red indicating long
tenure. Clearly, if customers’ orders were distributed uniformly across the year, more orders would lead to more
quarters with an order. That this relationship does not hold suggests that orders may tend to bunch together. That
would make sense if customers have a seasonal pattern, such as using the catalog to outfit the summer cottage
every spring or to buy Christmas presents every winter. To test this idea, we create new variables to record the
number and proportion of a customer’s orders in each quarter. The time series ends partway through 1998, so to
avoid favoring any quarter, we use only the full years 1993-1997 to count orders by quarter.
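The quarterly-proportion variables can be sketched as follows; the (year, quarter) order records are illustrative:

```python
# Sketch of the quarterly-proportion variables: the share of a customer's
# 1993-1997 orders falling in each calendar quarter. Orders are illustrative.
from collections import Counter

orders = [(1993, 4), (1994, 4), (1995, 1), (1996, 4), (1997, 4)]  # (year, quarter)

in_window = [q for (y, q) in orders if 1993 <= y <= 1997]
counts = Counter(in_window)
q4_prop = counts[4] / len(in_window)  # proportion of orders placed in Q4
```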

Q1 and Q4 have the most orders. Is the Q4 order proportion in 1993-1997 predictive of response in 1998? One
way to look at that is to compare the average Q4 proportion for responders and nonresponders.

The difference is dramatic, which suggests that the newly derived Q4 order proportion variable will be valuable for
modeling. The absolute number of previous Q4 orders is also predictive as can be seen by graphing the response
rate grouped by the number of Q4 orders.

Responders in a Tree
What other variables are predictive of response? The JMP recursive partitioning tool builds a tree with some
leaves rich in responders and some rich in nonresponders. To create each split, the algorithm tries every available
input and chooses the ones best able to increase the purity of the resulting children. The variables that end up
in the tree are good candidates for building nontree models as well. Before trying this, it is important to remove
from consideration variables that we know from our exploration will be problematic. For example, the ZIP code
and county might appear to be good splitters, but only because there are so many ZIP codes and counties with
a single customer who either did or did not respond. We must also remove variables whose values can only be
known after a response has already occurred: The order size variable is only nonzero for responders, so it does a
perfect job of splitting the data but can’t be used for modeling.

As noted earlier, a decision tree continues splitting until the leaves reach some level of purity. When the target
variable only takes on two values and one is much more common than the other, the algorithm may not create
many splits. More splits may mean more variables selected, so before building the tree, it is a good idea to
create a balanced sample with equal numbers of responders and nonresponders. JMP software’s table subset
commands make this very easy.
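The balancing step amounts to keeping all responders and drawing an equal-size random subset of nonresponders. A sketch (the row format and random seed are illustrative):

```python
# Sketch of a balanced modeling sample: all responders plus an equal-size
# random subset of nonresponders. Row format and seed are illustrative.
import random

def balance(rows, target="respond", seed=0):
    pos = [r for r in rows if r[target] == 1]
    neg = [r for r in rows if r[target] == 0]
    return pos + random.Random(seed).sample(neg, len(pos))

rows = [{"respond": 1}] * 3 + [{"respond": 0}] * 50
sample = balance(rows)
```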

As a variable selection tool, what matters is not the tree itself but the variables that contribute to it the most. JMP
software’s decision trees provide a measure of variable importance that can be sorted for easy discovery of the
top influencing variables.

Translated into English, the top 10 are:


1. Dollars spent in the last 24 months.
2. Number of purchases during the 23-quarter observation period.
3. Purchases in Department 25 (food).
4. Months since last purchase.
5. The order rate during the observation period.
6. The proportion of quarters with an order.
7. The historical response rate of the state.
8. The number of items previously purchased.
9. The percentage of the state’s population that is in the customer sample.
10. The number of orders in Department 3 (women’s underwear).

Of course, some of these are highly correlated. Dollars spent in the last 24 months is clearly related to the number of
purchases during the 23-quarter observation period. These correlations don’t matter if the final model is to be a decision tree, but
if it is to be a regression model, more thought (and more exploration) is still needed. The importance of the food
category in predicting Christmas catalog response suggests another hypothesis: Are the food items sold by
Vermont Country Store often purchased as gifts? A glance at the website supports this hypothesis.

At first, it seems a bit difficult to test this hypothesis because although we know how many purchases each
customer has made in Department 25, and we know how many purchases a customer has made each quarter,
we do not have a mapping between the two. The solution is to compare customers who have only made
purchases in the fourth quarter with customers who have only made purchases in other quarters. The newly
derived variables recording the proportion of orders in each quarter make that easy. We first select rows where
Q4prop=1 and sum purchases by department. Then for comparison we repeat the operation selecting rows
where one of Q1prop, Q2prop or Q3prop is equal to one.

Beauty is the most popular department in all quarters, but in the fourth quarter food is a close second. The rest of
the year, it is a distant third.

[Charts: purchase counts by department for Q4-only purchasers vs. Q1-3-only purchasers, with Food labeled in both panels.]
What Influences Average Order Size?
A similar process reveals a slightly different set of variables that are important for modeling average order size.
After removing several variables that are close proxies for order size, the following five turned up in a regression
tree:

1. Dollars spent in the past 24 months.

2. Total number of purchases.

3. Number of purchases in Department 14 (Bedding).

4. Number of catalogs received.

5. Order rate (number of orders divided by months of tenure).

Interestingly, some of the variables that have high worth in the decision tree do not show strong correlation with
the target variable. In particular, frequency of purchase has near zero correlation with average order size. The
explanation is the interaction between frequency of purchase and spending in the last 24 months. When spending
is high, but frequency is low, the average order size must be high. In effect, the tree has discovered the definition
of average order size as dollars per purchase.

The importance of bedding suggests further investigation. The average order size for customers with at least one
purchase in the bedding department is $65.80. The average order size for customers with no bedding purchases
is $43.64. Clearly the difference is significant. Is bedding a particularly expensive category? Or do people who
buy bedding have a tendency to buy lots of other things as well? We can approach this by looking at the average
spending of customers with purchases in a single category. Here the JMP name selection in column feature
comes in handy. After constructing a complex row filter, you can remember the results in a column and use them
as a grouping variable for summarization. People who buy only bedding have an even higher average order size
than that of all bedding buyers.

Unfortunately, the data set does not contain prices of individual items, but the website confirms that bedding is
one of the more expensive categories.

Using What We’ve Learned to Build Response Models
Data exploration is useful in its own right, but often the goal is to build a model that can be used to create scores.
In this case, the goal is to assign each customer a probability of response score.

The Baseline Model


Without the benefit of the foregoing exploration, we would build the baseline model using the three variables
beloved by all direct marketers: recency, frequency and monetary value, or RFM for short. The existing model is
called the champion. New candidate models are evaluated by comparing them with the champion. When a new
challenger model does better than the current champion, a new champion is crowned.

There are many metrics that could possibly be used to compare models. Here we use lift at the first decile. To
calculate lift, half of the data is used as a training set to fit the model; the other half is used to evaluate it. The
evaluation data is sorted in order of descending model score, and the percentage of responders in the first decile
is compared to the overall response rate. The baseline model achieves a lift of three, meaning that there are three
times as many responders in the top decile as would be expected in a 10 percent random sample.
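The lift calculation can be sketched directly; the scores and responses below are illustrative:

```python
# Sketch of lift at the first decile: sort the holdout by descending score,
# take the top 10 percent, and compare its response rate to the overall rate.
# Scores and responses are illustrative.
def first_decile_lift(scores, responses):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    top = order[: max(1, len(order) // 10)]
    top_rate = sum(responses[i] for i in top) / len(top)
    overall = sum(responses) / len(responses)
    return top_rate / overall

scores = [0.9, 0.8, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
responses = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
lift = first_decile_lift(scores, responses)
```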

Using the Variables Suggested by the Exploration


The first challenger model is (like the baseline model) a logistic regression model. Instead of the three RFM
variables, it uses the variables suggested by the preceding exploration:

• Dollars spent in the last 24 months.


• Months since last purchase.
• Tenure (months since first purchase).
• Number of purchases from the food department.
• Order rate (number of orders during the observation period divided by number of months of tenure in the observation period).
• The number of past purchases in Q4.
• The historical response rate of the state where the customer resides.
This new model does better than the former champion. It gets lift of 3.5 on the first decile.
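As a sketch of how such a challenger model might be fit outside JMP, here is a scikit-learn version on synthetic data. The seven columns stand in for the variables listed above, and every name, distribution and coefficient is invented for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for the seven challenger variables (invented).
X = np.column_stack([
    rng.gamma(2.0, 50.0, n),    # dollars spent in last 24 months
    rng.integers(1, 36, n),     # months since last purchase
    rng.integers(1, 120, n),    # tenure in months
    rng.poisson(0.5, n),        # food department purchases
    rng.random(n),              # order rate
    rng.poisson(1.0, n),        # past Q4 purchases
    rng.normal(0.05, 0.01, n),  # state historical response rate
])

# Make response depend on recency and food purchases in this toy setup.
true_logit = -2.0 - 0.08 * X[:, 1] + 0.8 * X[:, 3]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

# Half for training, half for evaluation, as in the lift calculation.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]

# Lift at the first decile of the evaluation set.
top = np.argsort(-scores)[: len(scores) // 10]
lift = y_te[top].mean() / y_te.mean()
print(round(float(lift), 2))
```

Because the synthetic response really does depend on two of the inputs, the fitted model concentrates responders in the top decile and the lift comes out well above one.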

Comparing the Regression Model with a Neural Network Using the Same Variables
Another very flexible model type available in JMP is the neural network. These are provided in several flavors. Here
we have used a multilayer perceptron with four units in a single hidden layer. Each of the units produces a different
S-shaped curve based on the model inputs, which are the same as were used for the logistic regression model.
By adding together several different S-shaped curves, a neural network of this type can approximate pretty much
any continuous function.

In this case, the neural network model gets lift similar to that of the logistic regression model. The JMP profiler
helps show how the models are different.
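The sum-of-S-curves idea can be demonstrated directly. The four hidden units below use hand-picked (not fitted) weights, and their weighted sum traces a bump that no single S-curve or straight line could produce:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A single-hidden-layer network with four units: each unit squashes a
# shifted, scaled copy of the input through an S-shaped curve, and the
# output layer adds the four curves together with its own weights.
def tiny_mlp(x, hidden, out_w, out_b):
    return out_b + sum(w * sigmoid(a * x + b)
                       for (a, b), w in zip(hidden, out_w))

# Hand-picked weights: four S-curves rising at x = 0.5, 1.5, 2.5, 3.5,
# added with alternating signs to produce a wavy, nonlinear shape.
hidden = [(4.0, -2.0), (4.0, -6.0), (4.0, -10.0), (4.0, -14.0)]
out_w = [1.0, -1.0, 1.0, -1.0]

ys = [tiny_mlp(x / 10.0, hidden, out_w, 0.0) for x in range(50)]
```

Evaluated over x from 0 to 5, the curve rises to a peak near x = 1 and falls again; with enough units, such sums can approximate essentially any continuous function.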

Comparing Models Using the Profiler
Data exploration yields insights that are useful in their own right. Often, though, a near-term goal is building
models to get high-quality predictions for outcomes of interest. Often there is a tradeoff between a model's complexity and how readily it can be interpreted, even when the extra complexity yields better predictions for the near-term event of interest – in this case, who is likely to respond to the next mailing. The profiler also highlights differences in the way different kinds
of models capture relationships between inputs and outputs.

The top row of this profiler output refers to the model average of the best 10 stepwise logistic regression models.
As many readers are probably aware, a logistic regression model is actually based on a linear regression model of
a nonlinear target variable, the log odds. Odds are simply the ratio of the probability of response to the probability
of nonresponse. Whereas the probability of response ranges from zero to one, the odds of response range from
zero to infinity. Taking the log of the odds yields a value that ranges from negative infinity to positive infinity – as
does the result of a linear regression model. The topmost row of the profiler output shows the relationship
between the inputs and the log odds. Notice that all the traces are straight lines with differing slopes. As the
sliders are moved to different positions to denote different values of one input variable, the other traces may move
up and down, but they do not change shape. The second row shows the results of a bootstrap forest model
where some nonlinearity in relationships is now evident.
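The chain of transformations described here (probability to odds to log odds, and back through the logistic function) is short enough to verify numerically:

```python
import math

def log_odds(p):
    # odds = p / (1 - p); taking the log maps (0, 1) onto the whole real
    # line, the natural range for a linear model's output.
    return math.log(p / (1.0 - p))

def inv_log_odds(z):
    # The logistic function inverts the transform, recovering a probability.
    return 1.0 / (1.0 + math.exp(-z))

for p in (0.1, 0.5, 0.9):
    z = log_odds(p)
    print(p, round(z, 3), round(inv_log_odds(z), 3))
```

A probability of 0.5 maps to log odds of zero, while 0.1 and 0.9 map to symmetric values of roughly -2.197 and +2.197, illustrating how the bounded probability scale stretches to an unbounded one.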

The bottom row of traces represents the neural network model using the same inputs. Neural networks do
not lend themselves to interpretation, but the profiler does call attention to the inputs (food and fourth quarter
orders) that have the most influence on the target variable under average conditions. Because of the complexity
of interactions within a neural network model, the relative importance of different inputs can vary considerably depending on the values at which the other inputs are held.

Conclusions
Exploring the Vermont Country Store data revealed the existence of outlier customers who participate in and
respond to every campaign. These are not likely to be real customers, so they were removed before modeling.
Further exploration revealed that there are geographical differences in responsiveness that could be incorporated
in the model. There were also seasonal differences in the relative popularity of departments, with food showing
increased popularity in the holiday season. The final model takes advantage of all of these findings.

These results are not particular to this data set. There are surprises lurking in even the most familiar data sets,
and finding them in advance leads to better models. The best data miners and modelers rely on intuition as well
as expertise. Visual exploration is the best way to develop intuition for what is going on in a data set. Although the
goal here was to build a better response model, a lot of what we learned along the way is useful in areas that go
beyond this specific model. The strong geographic variation in customer density provides food for thought. What
sort of campaigns might appeal to potential customers far from New England? Or should we perhaps capitalize
on regional associations by targeting snowbirds and retirees in the South and Southwest who may not miss
New England weather, but still feel an attachment to their roots? If, as it appears, there is a segment of shoppers
who only make gift purchases, should we include even more gift items in fourth quarter catalogs? Could we
save postage and increase yields by dropping such customers from mailings at other times of year? These new
questions are an important outcome of data exploration, and pursuing these new lines of research may ultimately
be more valuable than having a slightly improved champion model.

JMP Pro provides a highly interactive and visual environment that makes it easy to examine distributions, explore
interactions and spot anomalies before building models. It also includes an extensive collection of data mining
techniques, including linear and logistic regression, decision trees, neural networks and survival analysis. When
exploration and modeling are finished and it is time to present results, the visualizations offered in JMP make for very effective communication.

About the Author
Michael Berry is co-founder of Data Miners Inc., and serves as Business Intelligence Director at TripAdvisor for
Business. Together with his Data Miners colleague Gordon Linoff, Berry is author of some of the most widely read
and respected books on data mining, including Data Mining Techniques, Third Edition (2011). A data mining
educator as well as a consultant, Berry has taught marketing analytics in the MBA program at Boston College’s
Carroll School of Management. He is also in demand as a keynote speaker and seminar leader in the area of data
mining generally and the application of data mining to customer relationship management in particular.

JMP WORLD HEADQUARTERS SAS INSTITUTE INC. +1 919 677 8000 U.S. & CANADA SALES 800 727 0025 www.jmp.com
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies. Copyright © 2011 SAS Institute Inc. All rights reserved. 105333_S78896_0811
