Mastering Data Mining The Art and Science of Customer Relationship Management PDF
Part One of this book introduces data mining in the context of customer
relationship management (CRM) in companies with large numbers of
customers. This does not mean that data mining is not useful in other
fields! Data mining is used in the pharmaceuticals industry to aid in
drug design and to analyze data generated by high-throughput testing of
chemical precursors for new drugs. It is used in law enforcement to find
suspicious patterns of international funds transfer. It is used by both
public and private insurers to uncover potential cases of fraud. Even in
the marketing context, data mining is not restricted to
business-to-consumer relationships. On the World Wide Web, which offers
many opportunities for data mining, business-to-business e-commerce
transactions outweigh direct-to-consumer transactions, at least as
measured by dollar amount. Why then this focus on customer relationship
management? There are three different (though not uncorrelated) answers:
1. We have done most of our data mining work in the service of improved
marketing, sales, and customer support for large corporations with
millions of customers or prospects. We naturally find it easier to write
about data mining in the context of our own work, and by doing so we can
share the kinds of insights and lessons that are learned only through
direct experience.
2. Customer relationship management and its servants,
database marketing, customization, and individualization, have been
responsible for the sudden surge of popular interest in what were once
obscure and academic data mining techniques. These are the applications
that have grabbed the attention of both the press and the investment
community.
3. These are the data mining applications that touch each and
every one of us directly. No matter what our profession, we are all
consumers. Every time we make a telephone call, use a credit card, click
through an ad, or show a supermarket loyalty card, we are providing
fodder to data miners. This personal connection with the data being
mined makes it easier to understand the goals of the data mining
exercises we present. It also helps make sense of the data
transformations that are required to get good results.
For all these
reasons, most of the case studies in this book are somehow tied to
analyzing customer behavior. The first chapter, Data Mining in Context,
supplies a bit of historical background for the current explosion in
both data and data mining and examines the current data mining
environment from three viewpoints: business, technical, and societal. The
second chapter, Why Master the Art?, addresses the need for companies
that are serious about organizing their business around their customers
to
Page 3
1 Data Mining in Context
To say that the past century saw rapid
change is both a cliché and an understatement. Although the rapid pace
of change was felt in nearly every area, it is hard to find examples of
anything, anywhere that has changed as fast as the quantity of stored
information. This information explosion has created new opportunities
and new headaches in every field, from manufacturing to medicine to
marketing. To appreciate just how fast the world's store of information
has grown in recent years, it is instructive to compare it against some
of the standard benchmarks of the twentieth century. In 1900, the world
population (another area where growth has frequently been described as
"explosive") was 1.6 billion. A hundred years later, the population
topped 6 billion. So, the population explosion caused the number of
people living on Earth to increase by a factor of 3.75 over the course
of the century. In 1906, the Stanley twins, Francis and Freelan,
established a world land speed record of 122 miles per hour with their
Stanley Steamer. The land speed record was, of course, the only one that
counted; 15 miles per hour was a pretty good speed for a ship and
airplanes had only been in the air for three years, so they were in no
position to challenge the speed record. On their journey to the moon 63
years later, the Apollo astronauts traveled at nearly 25,000 miles per
hour, about 205 times as fast. The trip to the moon provides another yardstick.
In 1900, the longest journey one could reasonably make was 25,000
miles, the distance required to circumnavigate the globe. The round trip
distance to the moon is 19 times as far.
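The three comparisons above are simple ratios, and a quick check is easy to sketch. The figures below are the ones quoted in the text, with two assumptions labeled in the comments: "nearly 25,000 miles per hour" is treated as exactly 25,000, and the Earth-Moon distance (not given in the text) is taken as a mean of about 238,855 miles. Under those assumptions the speed ratio comes out near 205.

```python
# Figures quoted in the text, except where marked as an assumption.
population_1900 = 1.6e9   # world population in 1900
population_2000 = 6.0e9   # "topped 6 billion" a hundred years later
steamer_mph = 122         # Stanley Steamer land speed record, 1906
apollo_mph = 25_000       # assumption: "nearly 25,000" treated as 25,000
globe_miles = 25_000      # distance to circumnavigate the globe

# Assumption (not in the text): mean Earth-Moon distance of ~238,855 miles.
moon_round_trip_miles = 2 * 238_855


def growth_factor(before, after):
    """Return how many times larger `after` is than `before`."""
    return after / before


population_factor = growth_factor(population_1900, population_2000)
speed_factor = growth_factor(steamer_mph, apollo_mph)
distance_factor = growth_factor(globe_miles, moon_round_trip_miles)

print(f"population: x{population_factor:.2f}")
print(f"speed:      x{speed_factor:.0f}")
print(f"distance:   x{distance_factor:.0f}")
```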
Page 6
[...] has misled many people into believing that data mining is a product
that can be bought rather than a discipline that must be mastered. In
this book, although we discuss algorithms and techniques when
necessary, we put the focus back where it belongs: the data mining
process. But before we can discuss process, we need to establish a
common understanding of what data mining is and how it can be used. For
readers of our earlier book, this will be review.
What Can Data Mining Do?
The term data mining is often thrown around rather loosely. In this
book, we use the term for a specific set of activities, all of which
involve extracting meaningful new information from the data. The six
activities are:
• Classification
• Estimation
• Prediction
• Affinity grouping or association rules
• Clustering
• Description and visualization
The first three tasks (classification, estimation, and prediction) are all
examples of directed data mining. In directed data mining, the goal is
to use the available data to build a model that describes one particular
variable of interest in terms of the rest of the available data. The
next three tasks are examples of undirected data mining. In undirected
data mining, no variable is singled out as the target; the goal is to
establish some relationship among all the variables.
Classification
Classification consists of examining the features of a newly presented
object and assigning to it a predefined class. For our purposes, the
objects to be classified are generally represented by records in a
database. The act of classification consists of updating each record by
filling in a field with a class code. The classification task is
characterized by a clear definition of the classes and a
training set consisting of preclassified examples. The task is to build
a model that can be applied to unclassified data in order to classify
it. Examples of classification tasks include:
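As a minimal sketch of the mechanics just described (a training set of preclassified examples, a model built from it, and a class code filled in on each unclassified record), the following uses a nearest-centroid rule. That rule is only one simple stand-in chosen for illustration, not a technique the text prescribes, and the field names, class codes, and data are all hypothetical.

```python
from collections import defaultdict
from math import dist

# Hypothetical preclassified training set:
# (monthly_spend, support_calls) -> predefined class code.
training_set = [
    ((120.0, 1.0), "low_risk"),
    ((110.0, 0.0), "low_risk"),
    ((95.0, 2.0), "low_risk"),
    ((30.0, 6.0), "high_risk"),
    ((25.0, 8.0), "high_risk"),
    ((40.0, 5.0), "high_risk"),
]


def train_centroids(examples):
    """Build the model: one mean feature vector (centroid) per class."""
    grouped = defaultdict(list)
    for features, label in examples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(model, features):
    """Assign the class whose centroid is closest to the record's features."""
    return min(model, key=lambda label: dist(model[label], features))


model = train_centroids(training_set)

# Unclassified records: classification fills in the class_code field.
records = [
    {"id": 1, "features": (100.0, 1.0), "class_code": None},
    {"id": 2, "features": (35.0, 7.0), "class_code": None},
]
for record in records:
    record["class_code"] = classify(model, record["features"])

print([(r["id"], r["class_code"]) for r in records])
```

The set of classes is fixed before any record is scored, which is what distinguishes this task from clustering.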
Page 9
The field that has come to be called data mining has grown from
several antecedents. On the academic side are machine learning and
statistics. Machine learning has contributed important algorithms for
recognizing patterns in data. Machine-learning researchers are on the
bleeding edge, conjuring ideas about how to make computers think.
Statistics is another important area that provides background for data
mining. Statisticians offer mathematical rigor; not only do they
understand the algorithms, they understand the best practices in
modeling and experimental design. The final thread is decision support.
Over the past few decades, people have been gathering data into
databases to make better informed decisions. Data mining is a natural
extension of this effort.
Data Mining and Machine Learning
The machine
learning people come from the computer science and artificial
intelligence worlds. They have focused their efforts on getting
computers to display intelligence. In particular, the machine learning
community is interested in writing computer programs that are capable of
learning by example. The first kind of learning manifests itself by a
newfound ability to perform some task such as balancing a broom handle
or recognizing written characters. In other cases the new learning is
expressed as rules that have been induced from the examples. Neural
networks have proved to be very successful at the first kind of learning
and decision trees have proved to be very successful at the second. The
term data mining, in its present, nonpejorative sense, was first used by
people who took the methods of machine learning and began to apply
them to fields outside of computer science and artificial intelligence
(AI), fields such as industrial process control and direct marketing.
This search for practical applications was probably encouraged by the
collapse of funding for artificial intelligence research in the early
1980s when the over-ambitious claims from the 1960s and 1970s (machine
translation, natural language recognition) failed to materialize. The
choice of the term data mining for the new, business-oriented
applications of AI research shows how little overlap there was between
this group and the statisticians, actuaries, and economists who had long
been doing predictive modeling. For the latter group, the term "data
mining" meant searching for data to support a particular point of view
rather than letting the facts speak for themselves. The data miners were
smart people getting good results, but they were not mathematicians.
Data Mining and Statistics
Statistics has been another important thread
that has supported data mining. For centuries, people have used
statistical techniques to understand the natural world. These have
included predictive algorithms (which statisticians call
Page 16
three dimensions). Each dimension can, and usually does, have many
levels of aggregation. In fact, a single dimension may have multiple
hierarchies. So the geography dimension might be arranged into stores,
cities, states, and countries, or into stores, sales regions, and
countries, or both. Or the time dimension may be organized into days,
weeks, and years, or days, months, quarters, and years. An added
complication is that the dimensions are not static. For example, many
companies redraw their sales regions every time a new VP of sales comes
along, if not more frequently. The structure chosen for a
multidimensional database limits the kind of analysis we can perform.
Often, a dimension of particular interest to us, the customer dimension,
is entirely absent! When a multidimensional database is stored in a
relational database system, the arrangement of the tables (one central
fact table with many dimensional tables surrounding it) is called a star
schema, or dimensional model. Such an arrangement is especially
appropriate for a database that serves the interests of a particular
department such as finance or marketing since it is easier to get
agreement as to what are the central facts to be tracked and what
dimensions would be useful for tracking them within a single department
than across a larger enterprise. These specialized decision support
databases are often called data marts. Sometimes a data warehouse is
made up of a collection of data marts. In this approach to data
warehouse design, dimensional models are the basis for a style of
incremental design of an enterprise data warehouse that is inherently
distributed. The main challenge for the data warehouse teams building
these distributed data marts is to establish what are called conformed
dimensions and conformed facts so that the separate data marts will work
together. For more on this topic, see Ralph Kimball et al., The
Data Warehouse Lifecycle Toolkit (Wiley, 1998). We discuss OLAP and
multidimensional databases in more detail in Chapter 6.
Decision Support Fusion
To the business user, the distinctions between online
analytic processing, data mining, and data visualization seem pointless.
To make better decisions, management needs answers to all kinds of
questions. Today, some of those questions ("How have widget sales
changed quarter over quarter by sales rep and widget type?") are
answered using OLAP. Other questions ("How are sales varying by
geography and widget type?") are best understood through visualization;
in this case, perhaps a map with different countries shown in relief
with height representing total sales and color representing product mix.
Another family of questions ("Which customers should receive the 96-page
holiday catalog, and which should receive the 120-page catalog?") are
best addressed through data mining. The VP of Marketing wants answers to
all three types of question, and probably does not understand why the
answers have to come from three different
Page 19
2 Why Master the Art?
In its infancy, any technology requires a
great deal of specialized knowledge and experience on the part of its
users. The history of photography provides an apt metaphor for that of
data mining, which is still in its infancy. This chapter develops this
metaphor and uses it to describe four methods of putting data mining to
work for your company. The pinhole camera or camera obscura has been
known since at least the time of Leonardo da Vinci. For hundreds of
years the ephemeral images projected on the wall of a darkened room were
little more than curiosities that could be viewed only in real time. It
wasn't until the 1830s that Louis Daguerre developed a method of
capturing these fleeting images in a lasting form. The first
photographers could not simply take pictures; they had to be chemists as
well. The creation of a single daguerreotype required hours of
work. First, a copper plate was exposed to iodine to form a thin layer of
light-sensitive silver iodide. The exposed plate was then held over hot
mercury, whose highly toxic fumes combined with the silver nitrite to
produce a positive image directly on the plate. The plate was then
rinsed in salt water to fix the image. The resulting pictures were so
lifelike and so remarkably detailed that many contemporary observers
believed that the new process would replace painting and other
traditional art forms. In fact, the new process was described as a
"self-operating process of fine art." [1]
[1] See A History of Photography by Robert Leggat (www.kbnet.co.uk/rleggat/photo/).
Page 22
The daguerreotype was a huge commercial and artistic success, but like
the current crop of data mining tools, it hardly deserved to be called
self-operating. The technique could be mastered only by skilled
professionals or truly dedicated amateurs. The next 50 years saw a
series of important improvements to photographic technology including
negatives that allowed multiple prints to be made from the same exposure
and more reactive emulsions that brought exposure times down from
minutes to seconds. But it took George Eastman's introduction of the
Kodak camera in 1888 to put photography into the hands of casual users.
The Kodak camera was simple and portable. When its flexible film was
exposed, the user sent the entire camera off for processing at the Kodak
factory where it would be reloaded and returned along with the developed
negatives and prints. The advertising slogan for the Kodak camera
anticipates the message from many of today's data mining software
companies: "You press the button, we do the rest." By the twentieth
century, anyone wishing to make a record of a fishing expedition or
important family gathering could point an inexpensive, mass-produced
camera at the subject, focus the lens, cock the shutter, and take a
snapshot. After exposing a roll of film, the amateur would send it out
for developing and printing. Only at this point would his or her
mistakes become apparent. The photographer might forget to advance the
film and end up with double exposures or misjudge the light and get a
picture that was under- or over-exposed. The shutter speed might be too
slow and blur the image of a fidgety child. The aperture might be too
wide, leading to insufficient depth of field. All those ruined snapshots
put evolutionary pressure on the camera. Today, cameras have become so
smart that we do not have to think very much to take decent pictures.
The auto-focus mechanism picks some element in the scene and adjusts the
lens so that it is in perfect focus. A microprocessor determines a
suitable combination of shutter speed and aperture after automatically
sensing the light conditions and the speed of the film. Once the picture
has been snapped, the film is automatically advanced to the next frame.
Or perhaps, there is no film at all and the image is stored directly to
disk in JPEG format. In short, much of the expertise formerly required
of the photographer has now been embedded in the camera. There is no
denying that all this progress has made it much harder for amateur
photographers to completely ruin their vacation snapshots, but what
relation do these albums and shoe boxes full of 4-by-6 glossies bear to
the powerful images displayed in the photography galleries of an art
museum? Clearly, there is a trade-off between quality and automation.
This is no less true of data mining than it is of photography. Of
course, you do not have to become a full-time professional data miner to
extract useful information
Page 23
from data. Most people are content to rely on the photographic
expertise embodied in their cameras for day-to-day situations, only
calling in an expert for special occasions. That's fine so long as we
recognize that learning to change the battery in a point-and-click
camera is not the same as mastering the art of photography! Mastering
the art of photography means understanding composition, becoming
familiar with the tricks of light and shadow, understanding the uses of
different filters, films, and papers, and learning the countless other
elements of the craft that allow a photographer to tell stories with a
single picture that could never be captured in the proverbial thousand
words. Mastering data mining means learning how to get data to tell a
true and useful story. There are now many tools to assist in the process
and even some that claim to automate it. But, as in photography, study,
practice, and apprenticeship will be rewarded with better results.
Four Approaches to Data Mining
There are essentially four ways to bring data
mining expertise to bear on a company's business problems and
opportunities:
1. By purchasing scores (see later) from outside vendors
that are related to your business problem; analogous to using an
automatic Polaroid camera.
2. By purchasing software that embodies data
mining expertise directed towards a particular application such as
credit approval, fraud detection, or churn prevention; analogous to
purchasing a fully automated camera.
3. By hiring outside consultants to
perform predictive modeling for you for special projects; analogous to
hiring a wedding photographer.
4. By mastering data mining skills within
your own organization; analogous to building your own darkroom and
becoming a skilled photographer yourself.
There are plusses and minuses
to each approach and, depending on the circumstances, we may recommend
any of the four, or some combination, to our clients. The clear bias of
this book, however, is toward those who want to at least consider option
four as a long-term goal. The balance of this chapter examines each of
the four approaches and explains why companies that want to put customer
relationship management at the center of their business strategy should
make data mining one of their firm's core competencies.
Page 24
Neural Net Models for Credit Card Fraud Detection
HNC, a company based
in San Diego, California, has made a very successful business around
embedded neural network models for predicting fraud in the credit and
debit card industry. HNC's neural network models evaluate a large
percentage of all credit card transactions before the transaction is
allowed to complete. HNC's models are now also deployed in the
wireless industry to spot fraudulent calls and in the insurance industry
to spot fraudulent claims. The company has also moved beyond fraud
detection with vertical applications for predicting profitability,
attrition, and bankruptcy. HNC's flagship product, Falcon, addresses
credit card fraud. Since large issuers may lose as much as $10 million
per year to fraud, there is strong demand for the product. According to
the company, it is used by 16 of the top card issuing banks in the
world. Locating the few transactions among thousands that represent
potential fraud losses is a difficult problem. Part of the difficulty is
that false positives (transactions that are flagged as potentially
fraudulent but turn out to be innocent) can be nearly as costly as false
negatives because cardholders tend to get very annoyed when their
legitimate transactions do not go through. Falcon works by modeling each
cardholder's behavior and then comparing the current transaction with
that customer's historical spending pattern. If the current transaction
falls too far outside the customer's historical patterns, the software
will flag it before it is allowed to complete. In 1998, Falcon monitored
over 250 million payment card accounts worldwide.
Will a Vertical Application Work for You?
When evaluating a vertical application with
embedded data mining models, try to find out as much as possible about
the assumptions under which the application was built and compare those
assumptions to your own situation. Some questions you might ask include
the following:
• Was the application developed under similar assumptions
about competition? If not, response rates and retention rates are
likely to follow very different patterns. For example, in a recently
deregulated market for airline service, cable television, or telephone
service, models that work for the incumbent former monopoly will not be
germane to the new upstart providers and vice versa.
• Was the
application developed with similar assumptions about pricing practices
to those that apply in your market? For example, in the United States,
cellular subscribers pay for both incoming and outgoing calls. This
leads to very different calling patterns than one would expect to see in
a market where only the outgoing calls are charged.
Page 27
interface layer over the same core routines along with a preconfigured
approach to the particular problem. So, for example, the response
modeler could actually be used for any binary outcome classification
task for which there are training examples as long as the user is happy
to refer to one class as "responders" and the other as "nonresponders."
One of the data mining tasks that Model One attempts to automate is
experimenting with different data mining algorithms to see which one
will get the best results for a particular data set. The tool generates
a large number of models and displays the cumulative response charts for
each on a single graph so that it is easy to pick out the best one. The
graphs are updated dynamically as new models are built and tested with
the current best model highlighted. This gives the user a pleasing sense
that progress is being made. A similar automated search for models is
provided by the Model Seeker wizard that is part of the Darwin product
from Oracle (see Figure 2.2).
[Figure 2.2: chart of Cumulated Targets vs. Percent of Population, with curves for tree, net, and match models.]
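The comparison described above, where several candidate models are judged by their cumulative response curves, can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure used by Model One or Darwin: the scores, the outcomes, and the choice to compare curves at the top 30 percent of the population are all hypothetical, and the model names merely echo the curve labels in the figure.

```python
def cumulative_gains(scores, targets, steps=10):
    """Fraction of all targets captured in the top 10%, 20%, ... of the
    population when records are ranked by descending model score."""
    pairs = sorted(zip(scores, targets), key=lambda p: p[0], reverse=True)
    ranked = [target for _, target in pairs]
    total = sum(ranked)
    gains = []
    for step in range(1, steps + 1):
        cutoff = round(len(ranked) * step / steps)
        gains.append(sum(ranked[:cutoff]) / total)
    return gains


# Hypothetical outcomes (1 = responder) for ten customers.
targets = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]

# Hypothetical scores produced by three candidate models.
scores_by_model = {
    "tree": [0.9, 0.2, 0.8, 0.3, 0.1, 0.7, 0.4, 0.2, 0.1, 0.3],
    "net": [0.6, 0.7, 0.9, 0.2, 0.1, 0.3, 0.5, 0.4, 0.2, 0.1],
    "match": [0.5, 0.9, 0.4, 0.8, 0.7, 0.6, 0.3, 0.2, 0.1, 0.05],
}

gains_by_model = {
    name: cumulative_gains(scores, targets)
    for name, scores in scores_by_model.items()
}

# Assumed selection rule: most targets captured in the top 30% (index 2).
best = max(gains_by_model, key=lambda name: gains_by_model[name][2])

for name, gains in gains_by_model.items():
    print(name, [round(g, 2) for g in gains])
print("best at 30%:", best)
```

A curve that rises faster captures more responders from a smaller share of the population. Which cutoff matters depends on how many customers the business can afford to contact.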
Page 31
This affects the decision to use flash. The camera is programmed with
some set of assumptions about the user's intentions. When all goes well,
these assumptions are close enough to what the photographer had in mind
so that the resulting pictures are satisfactory. When the assumptions
are violated, the result is disappointment. A data mining tool or
vertical application with embedded expertise has the same potential to
please or disappoint depending on how well its assumptions match the
actual requirements of the business. And, like the automatic camera, it
can automate only the small portion of the data mining task that takes
place between the time when the shot is set up (a modeling data set is
assembled and ready to go) and the time when the model itself is built.
It can do little about what comes before and after. As we will see in
the next chapter, the actual building of models is only one stage in a
continuous cycle of activities that make up the data mining process.
Page 32
The automated tool does not address the activities that are analogous
to setting up the shot, nor does it address the activities that are
analogous to the work that goes on in the darkroom to enlarge, enhance,
and reproduce the results. Among the activities not addressed by an
automated tool are:
• Choosing a suitable business problem to be
addressed by data mining.
• Identifying and collecting data that is
likely to contain the information needed to answer the business
question.
• Massaging and enhancing the data so that the information
content is close to the surface where the data mining tool can make use
of it.
• Transforming the database to be scored by the model so that the
input variables needed by the model are available.
• Designing a plan of
action based on the model and implementing it in the marketplace.
• Measuring the results of the actions and feeding them back into the
database where they will be available for future mining.
The first three
items are things that need to happen before any automated modeling can
take place. The last three items are things that need to happen if the
automatically created models are to be of lasting value. This does not
mean that modeling wizards and the like are useless. Indeed, they can be
very helpful. But, it is important to understand what they can and
cannot do. The slogan "you press the button, we do the rest" is no more
accurate for today's data mining tools than it was back when George
Eastman used it to sell his cameras.
Hiring Outside Experts
If you are
in the early stages of integrating data mining into your business
processes, the judicious application of outside resources can be very
valuable. Indeed, we have seen many cases where outside consulting has
led to successful conclusions to projects that would almost certainly
have failed without it. On the other hand, we have seen many
organizations that are overly dependent on outside resources and are
failing to get the full benefit of data mining because the data, the
models, and even the insights generated by applying the latter to the
former are in the hands of outsiders. The real question is not whether
to use outside expertise, but how. The answer will depend on many
factors, including the answers to these questions:
1. Is the data mining
activity meant to address a one-time problem or an ongoing process?
Page 33
Learning and Loyalty
We have said that data mining allows a company to
do a better job of learning about its customers, but why is that so
essential? Web-based travel services provide a good example. You can
browse to any number of Web sites that all offer the same
services: airline reservations, hotel bookings, and car rentals.
Initially, there is little reason to choose one over another. At least
for frequent travelers, however, just typing in seven or eight frequent
flyer numbers and a couple of credit card numbers with their expiration
dates and billing addresses into a profile is painful enough that we
might hesitate before using a competing travel service the next time.
The second service could offer the same flights, hotel rooms, and cars,
but only after asking a bunch of questions that we have already answered
for the first one. Of course, in this example, the travel site has not
done any active learning; all it has to do to be more attractive than a
competitor is to recognize us when we return to the site (through a user
name and password, a stored cookie, or both) and then remember the
profile entered during the earlier visit. Sounds simple, but it is easy
to find examples of customer interactions that don't do even that:
• cash machines that ask what language you'd like to use even though
you've used the same card and answered the same question the same way a
thousand times before
• catalog companies that still ask for your
shipping address the twentieth time you place an order
• credit card
companies that haven't figured out that you are never going to order the
credit insurance they offer you in every bill
• long distance companies
who haven't learned which prospects do not want frequent flyer miles for
their telephone calls.
If the travel site really wants to hook us, it
will learn by watching. Each time we return to the site, a few more
blanks will be filled in with appropriate default values: the usual
departure airport, the aisle seat, the nonsmoking hotel room, the
four-door compact car, the preferred airline, the first-choice hotel
chain, the nonrefundable ticket. A really clever system (more clever
than any we have yet encountered) would even change some defaults
dynamically as we filled in the [...]
Lessons Learned
Although data mining tools
have improved greatly over the last few years, it is still not possible
to get good results from data mining without considerable
Page 37
Data Mining Methodology: The Virtuous Cycle Revisited
The great
American photographer Ansel Adams did not just step outside and "snap"
pictures of sunsets in the West. Even though the images were captured by
just pressing a button, he had to plan the photos and wait until just
the right moment. One great irony of photography is that the most
natural and unposed shots are often the most work, requiring a great
deal of planning and preparation. The same is true in data mining.
Being successful in data mining requires planning and understanding the
business problem. To continue the analogy, photography has a whole range
of technical options for developing prints: one-hour photo labs, amateur
darkrooms, professional darkrooms, and digital photography. Mastering
photography requires understanding the development process as well as
composition. The ultimate result is a combination of the aesthetic and
the technical. Similarly, mastering data mining requires combining the
business and the technical. Data mining connects business needs to data;
it is about understanding customers and prospects, understanding
products and markets, understanding suppliers and partners,
understanding processes, all by leveraging the data collected about them.
A basic understanding of the technical side of data mining is critical
for success, especially the process of transforming data into
information. Failure to follow the rules of photography can lead to
fuzzy, poorly exposed pictures, even when using the best equipment. The
same is true of data
Page 40
around the virtuous cycle of data mining, which highlights the business
aspects of data mining while recognizing the interplay between the
business and technical. The discussion on the methodology begins,
appropriately, with the two different styles of data mining. There is a
fear that as computers become more and more powerful, they will
eventually replace people in many different fields. With respect to data
mining, the day is very far off! At the technical level, data mining is
a set of tools and techniques that make people more productive.
Automated algorithms can spot patterns. People will always be needed to
know when the patterns are relevant, what problems need to be addressed,
when the results are meaningful, and so on.
Two Styles of Data Mining
There are two styles of data mining. Directed data mining is a top-down
approach, used when we know what we are looking for. This often takes
the form of predictive modeling, where we know exactly what we want to
predict. Undirected data mining is a bottom-up approach that lets the
data speak for itself. Undirected data mining finds patterns in the
Page 41
work is quite important, they have been relatively far removed from
daily business concerns; they provide little assistance on the front
line, for the typical brand manager, account manager, or product
manager. These specialists speak their own language. Consider an example
from the risk management group in one credit card company. They have a
label for customers who earn high incomes, are unlikely to go bankrupt,
and use the card infrequently. For in more detail throughout this
chapter. And, the most important insight in this chapter is this: The
best model is not the one with the highest lift when it is being built.
It is the model that performs the best on unseen, future data. Although
there is no 1-2-3 recipe for building perfect models, this chapter
covers the fundamentals needed to build effective models, and we will
see these lessons applied in the case studies in later chapters.

Building Good Predictive Models

Fortunately, the basic process for building predictive models is the
same, regardless of the data mining technique being used. Success
depends more on the process than on the technique. And this process
depends critically on the data being used to generate the model.
Garbage-in, garbage-out is an adage that applies especially well to
predictive modeling. Chapter 3 introduced the basic methodology for data
mining, as well as lift charts to measure performance. This section goes
into more depth in this area, illustrating some common problems that
can be diagnosed by examining lift charts. It also discusses why
producing effective models is challenging, regardless of the techniques
being used.

A Process for Building Predictive Models

The first challenge in building predictive models is gathering together
enough preclassified data. Since the next chapter covers this area, we
can skip the details here. In preclassified data, the outcomes are
already known. And, because these known examples will be used to teach
the model about the data, this set is called the model set. Figure 7.1
shows the basic steps in building and applying a predictive model:
1. The model is trained using preclassified data in a subset of the
model set called the training set. In this step, the data mining
algorithms find patterns of predictive value.
2. The model is refined, using another subset called the test set. Why
does the model need to be refined? To prevent the model from memorizing
the training set, so that the model is more general and will work
better on unseen data.
3. We can estimate the performance of the model, or compare the
performance of several models, by using a third set, entirely distinct
from the first two. This holdout set is called the evaluation set.
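The three subsets described above can be sketched in a few lines of
Python. This is only a minimal illustration, not a prescribed method:
the function name, the 60/20/20 proportions, and the made-up customer
records are all assumptions introduced here.

```python
import random


def split_model_set(records, train_pct=60, test_pct=20, seed=0):
    """Partition preclassified records into training, test, and evaluation sets.

    The percentages are illustrative assumptions; whatever is left after the
    training and test sets becomes the evaluation (holdout) set.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy, reproducibly
    n_train = len(shuffled) * train_pct // 100
    n_test = len(shuffled) * test_pct // 100
    training = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]
    return training, test, evaluation


# Hypothetical model set: 100 customers with an already known outcome.
model_set = [{"customer_id": i, "responded": i % 5 == 0} for i in range(100)]
training, test, evaluation = split_model_set(model_set)
print(len(training), len(test), len(evaluation))
```

The important property is that the three sets do not overlap, so the
evaluation set really is unseen by the time it is used to estimate
performance.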
Page 43
Clearly, data mining has many applications. And, although they may have
much in common, every application has its own unique characteristics.
Industries differ from each other; within a single industry, different
companies have different strategic plans and different approaches. All
of this affects the approach to data mining. The previous chapter
outlined three ways that companies can incorporate data mining solutions
into their business. Here, we focus primarily on the companies that want
to build core competencies in data mining, because data analysis
supports critical business processes. The virtuous cycle is a high-level
process, consisting of four major business processes:
• Identifying the business problem
• Transforming data into actionable results
• Acting on the results
• Measuring the results
There are no shortcuts: success in data mining requires all four
processes. Results have to be communicated and, over time, we hope that
expertise in data mining will grow. Expertise grows as organizations
focus on the right business problems, learn about data and modeling
techniques, and improve data mining processes based on the results of
previous efforts. In short, successful data mining is an example of
organizational learning.

Identifying the Right Business Problem

Defining the business problem is the trickiest part of
successful data mining because it is exclusively a communication
problem. The technical people analyzing data need to understand what the
business really needs. Even the most advanced algorithms cannot figure
out what is most important. A necessary part of every data mining
project is talking to the people who understand the business. These
people are often referred to as domain experts. Sometimes there is a
tendency to want to treat data analysis as a strictly technical
exercise. Resist this tendency! Only domain experts fully understand what really
needs to be done; and ultimately, they are likely to receive the credit
or blame for bottom-line results. While taking into account what the
domain experts have to say, it is also important not to be constrained
by their expertise. Important results often
Page 45
Identify and Obtain Data

The first step in the modeling process is
identifying and obtaining the right data. Often, the right data is
simply whatever data is available, reasonably clean, and accessible. In
general, more data is better. It is important to verify that the data
meets the requirements for solving the business problem. For instance,
if the business problem is to identify particular customers, then the
data must contain information about each individual customer. There may
be additional detail data, such as transaction-level information, but it
must also be possible to tie this data back to individual customers. In
addition, we want the data to be as complete as possible when modeling.
This can make it impractical to use survey data or other data available
only for survey respondents. If the data from the survey proves
valuable, then how will nonrespondents be scored? (In other situations,
using survey data for data mining may prove very fruitful; just don't
expect to apply the resulting model to non-respondents without some
extra work.) The purpose of the data mining effort may be to identify
customer segments, perhaps for the purpose of directing advertising or
purchasing lists of prospective customers. In this case, the data needs
to contain fields that are appropriate for purchasing advertising space
and lists. This often includes fields supplied by outside list
providers, location information, demographics, and so on. When doing
predictive modeling the data also needs to contain the desired outcome.
One brick-and-mortar retailer was trying to set up a catalog for their
identified customers (members). They were building a response model,
based on three earlier catalog mailings; this model would determine who
was likely to respond to the catalog based on previous responses to test
catalogs in the past. They had the following data, as shown in Figure
3.4:
• Marketing data about all members
• Response data for previous catalogs
• Tons of transaction-level detail about what members had purchased
The one thing they were missing was who had been sent the earlier
catalogs! Knowing who has responded without knowing who had been
contacted is almost useless. Without this information, the attempts at
predicting response were doomed.
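A small sketch shows why the list of contacted members matters. The
member numbers and the function name below are hypothetical, invented
only for illustration. Without the mailed list, a member who ignored
the catalog cannot be distinguished from a member who never received
it.

```python
def build_response_labels(member_ids, mailed_ids, responder_ids):
    """Label each member 1 (responded), 0 (mailed, no response), or None.

    None means the member was never mailed, so there is no outcome to learn
    from. Without mailed_ids, the 0s and the Nones cannot be told apart.
    """
    mailed = set(mailed_ids)
    responders = set(responder_ids)
    labels = {}
    for member in member_ids:
        if member not in mailed:
            labels[member] = None
        elif member in responders:
            labels[member] = 1
        else:
            labels[member] = 0
    return labels


# Hypothetical data: six members, four of whom were sent the catalog.
members = [101, 102, 103, 104, 105, 106]
mailed = [101, 102, 103, 104]
responded = [102, 104]

labels = build_response_labels(members, mailed, responded)
usable = {m: y for m, y in labels.items() if y is not None}
response_rate = sum(usable.values()) / len(usable)
print(labels)
print(response_rate)
```

Only the members with a 0 or 1 label belong in the model set for a
response model; the response rate is computed over contacted members,
not over all members.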
Page 50
Data mining algorithms typically require the data in the format of one
policy per row when we want to make predictions about policies. All rows
need to have the same number of columns. This requires transposing the
data and calculating new values for the columns. This is an example of
car insurance data. Every year the insurance company keeps track of
claims by policy number, year, and car on the policy.
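As a rough sketch of what this transposition involves, the following
Python collapses yearly records into one row per policy. The records,
field names, and summary columns are made up for illustration and are
not the columns of the original figure.

```python
from collections import defaultdict

# Made-up records in the "one row per policy, year, and car" layout.
yearly_rows = [
    {"policy": "A1", "year": 1998, "car": "CAR01", "claims": 0},
    {"policy": "A1", "year": 1999, "car": "CAR01", "claims": 1},
    {"policy": "A1", "year": 1999, "car": "CAR02", "claims": 2},
    {"policy": "B7", "year": 1999, "car": "CAR01", "claims": 0},
]


def one_row_per_policy(rows):
    """Collapse yearly rows into one summary row per policy."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["policy"]].append(row)

    summary = {}
    for policy, policy_rows in grouped.items():
        # Every policy gets the same set of columns, whatever its history.
        summary[policy] = {
            "num_years": len({r["year"] for r in policy_rows}),
            "num_cars": len({r["car"] for r in policy_rows}),
            "total_claims": sum(r["claims"] for r in policy_rows),
        }
    return summary


policy_table = one_row_per_policy(yearly_rows)
for policy, columns in sorted(policy_table.items()):
    print(policy, columns)
```

Each policy ends up with the same columns regardless of how many years
or cars it has, which is the shape the algorithms expect.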
Page 53
These are all examples of derived variables coming from the data within
a single row. Another type of derived variable gives information about
that row relative to all the others. For instance:
• The revenue decile for each customer. This is determined by taking
the total amount spent in a given period for all customers and
assigning 1 for the customers in the top decile, 2 for the customers in
the second decile, and so on.
• The churn rate by type of wireless phone. This is determined by
taking the most recently available churn information and determining
the rate for each type of handset.
• Profitability by demographics. This is determined by taking
historical profitability information by age and gender.
• Fraud by amount of transaction. This is determined by measuring the
amount of fraud that has historically been identified for transactions
of different sizes.
These variables are powerful
because past behavior is often a strong predictor of future behavior. In
the wireless industry, for instance, handset churn rates are an
important part of every churn model. However, these historical variables
are not enough. Obtaining better predictive results requires combining
the historical information with other types of data. This type of
derived variable is often available through an OLAP system. In fact,
these types of variables show that there are many synergies between OLAP
and data mining. Sometimes you can get the data from an OLAP system;
sometimes you want to calculate these historical variables directly from
the data used for data mining.

Prepare the Model Set

The model set is
the data that is used to actually build the data mining models. Once the
data has been cleaned, transposed, and derived variables added, what
more needs to be done? There are a few things that we still have to take
into account. If we are building a predictive model from historical
data, then what is the frequency of the rarer outcomes in the model set?
A good rule of thumb is that we want between 15 and 30 percent density
of the rarer outcomes. Consider fraud. The data may contain fewer than 1
percent cases of known fraud. Almost any model that we build on such a
model set will be 99 percent accurate, simply by predicting no fraud.
Very accurate and entirely useless.
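The arithmetic is easy to check. The sketch below uses an invented
model set of 1,000 records with 1 percent fraud. The helper
common_records_to_keep is a hypothetical illustration of one simple
way to reach a target density (keep every rare record and sample down
the common ones); it is an assumption made here, not necessarily the
approach the book describes.

```python
def accuracy(actual, predicted):
    """Fraction of records where the prediction matches the known outcome."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def common_records_to_keep(rare_count, target_density_pct):
    """Number of common-outcome records that yields the target density."""
    return rare_count * (100 - target_density_pct) // target_density_pct


# Invented model set: 10 fraud cases (1) among 1,000 transactions.
actual = [1] * 10 + [0] * 990
always_no_fraud = [0] * len(actual)

naive_accuracy = accuracy(actual, always_no_fraud)
fraud_caught = sum(
    1 for a, p in zip(actual, always_no_fraud) if a == 1 and p == 1
)
keep = common_records_to_keep(rare_count=10, target_density_pct=20)

print(naive_accuracy)  # high accuracy...
print(fraud_caught)    # ...but not a single fraud case identified
print(keep)            # non-fraud records to keep for 20 percent density
```

With 10 fraud records, keeping 40 non-fraud records gives a 20 percent
density, inside the 15 to 30 percent rule of thumb.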
There are several ways of handling such rare data. The most common is
oversampling, which we discuss in Chapter 7. At this point, we also
need to divide the model set into the training, test, and evaluation
sets. Only some of the data is used to create the model initially; other
data is held back to refine the model and to predict how well it works.
We may also decide to build different models on different segments of
data. For instance, when building a cross-sell model, we may start by
building a model for the propensity to buy each different product. In
this case, we might create a separate model set for each product as a
prelude to making a prediction about that product.
Page 55
Matrix? The name confusion matrix puts off a lot of people. In fact, on
hearing the word confusion, the concept suddenly becomes part of the
model set). Figure 3.6 shows a confusion matrix, both graphically and as
a table. This tells us how many predictions made by a predictive model
are correct and how many are incorrect. Which is the best model depends
on the business problem.
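A confusion matrix is simple to compute by counting pairs of actual
and predicted outcomes. The sketch below uses made-up outcomes for
eight records; the labels "yes" and "no" are placeholders chosen for
illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))


# Made-up outcomes for eight records.
actual    = ["yes", "yes", "no", "no",  "no", "yes", "no", "no"]
predicted = ["yes", "no",  "no", "yes", "no", "yes", "no", "no"]

matrix = confusion_matrix(actual, predicted)
correct = sum(count for (a, p), count in matrix.items() if a == p)
incorrect = sum(count for (a, p), count in matrix.items() if a != p)

# Print the matrix as a small table: rows are actual, columns predicted.
print("actual/predicted  yes  no")
for a in ("yes", "no"):
    print(f"{a:>16}  {matrix[(a, 'yes')]:>3}  {matrix[(a, 'no')]:>2}")
print("correct:", correct, "incorrect:", incorrect)
```

The two kinds of error (a "yes" predicted as "no", and a "no"
predicted as "yes") are counted separately, which matters because
their business costs are rarely equal.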
Page 56
This model time chart shows that six months of historical data is being
used to predict one month into the future. The "P" represents the month
being predicted. In the model set, these are already known, because we
are using preclassified data.
[Chart: months Jan through Nov; rows labeled Model Set AUG, Model Set
SEP, and Score Set; input months numbered 6 through 1; "P" marks the
predicted month.]
Model performance usually degrades over time. We expect the
model's predictions to be a bit less accurate on the score set than on
the model set.
Figure 3.8 A model time chart shows that the score set is usually more
recent than the model set.

What Makes Predictive Modeling Successful?

"Consistency is the hobgoblin of little minds," is a
frequent misquote of Ralph Waldo Emerson (he actually wrote "A foolish
consistency . . ."). Quite the opposite is true for predictive modeling.
Predictions are only useful because they are consistent and, especially,
because they are consistent over time. Otherwise, they would have no
predictive value. With hard work and a bit of luck, our predictive
models will not produce foolish consistencies.

Time Frames of Predictive Modeling

Although the inner workings of the data mining techniques are
interesting, it is possible to approach predictive models without
considering the details of the techniques. Models simply transform
inputs into predictions, whether using statistical regressions, neural
networks, decision trees, nearest neighbor approaches, or even some
technique waiting to be invented in the future. There are really two
things to do with predictive models, as shown in Figure 3.9:
Page 60
The models are created using data from the past to make predictions.
This process is called training the model. The models are then run on
another set of data to assign outcomes. This process is known as
scoring, and often predicts future outcomes based on the most recent
data. There are two time frames associated with developing models. The
first is when they are being trained. At this point in time, the data is
historical and the outcomes are known. These are the records used for
training. The second time frame is when the models are being scored. At
this point in time, the input data is available, but the outcomes are
not known. The role of modeling is to assign probable outcomes. When
predictive models are being created, their performance can be measured
only on the past data, because that is the only data that is available.
Often, it is possible to achieve good results in the past that do not
generalize well in the future, resulting in a model that looks very good
on the model set and fails miserably when applied in practice. The
methods given in this chapter help to reduce the likelihood of
developing poor models. Fortunately, a good predictor of the future is
the past, so predictive modeling has proven to be a good approach for
many types of problems. However, the past is by no means perfect. To
make effective use of predictive models, we need to understand not only
how to build them but also when they work well and when they don't.
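The two time frames can be made concrete with a small sketch. It
assumes, purely for illustration, that a model uses the six months
immediately before the predicted month as inputs, with no gap in
between; the function name and the choice of October as the scored
month are hypothetical.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def time_frame(predicted_month, history=6):
    """Return the input months used to predict `predicted_month`.

    Assumes the inputs are the `history` months immediately before the
    predicted month, with no gap in between.
    """
    end = MONTHS.index(predicted_month)
    if end < history:
        raise ValueError("not enough earlier months in this calendar year")
    return MONTHS[end - history:end]


# Training: the outcome for August is already known.
model_inputs = time_frame("Aug")
# Scoring: the same model is applied later, when October is still unknown.
score_inputs = time_frame("Oct")

print("train on", model_inputs, "to predict Aug")
print("score on", score_inputs, "to predict Oct")
```

The model sees the same kind of inputs in both cases, but the window
has shifted forward, which is why performance measured at training
time is only an estimate of performance at scoring time.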
Figure 3.9 Predictive models must be trained (created) before they are
used (scored). Training a predictive model is the process of creating a
model using historical data and already known instances of what you are
trying to predict. Scoring a predictive model is the process of applying
the model to unseen data to make new predictions.

Modeling Shelf-Life

Looking at time frames also brings up two critical
questions about models and their predictions:
What is the shelf-life of a model? The things being modeled, such as the
business environment, technology, and the customer base, change over
time. This means that a model created five years ago, or last year,
or last month, may no longer be valid. When this happens, you need to
train a new model on more recent data. What is the shelf-life of a
prediction? Predictions also have a shelf-life. They are valid during a
particular time frame. The classic example is predicting what will
happen during a particular month (such as churn or making a purchase)
and then using the prediction during a different month. The whole
process of predictive modeling is based on some key assumptions. These
assumptions shed light on the process of building models.

Assumption 1: The Past Is a Good Predictor of the Future

Using predictive models
assumes that the past is a good predictor of the future. If we know how
patients reacted to a drug in the past, we can be confident that similar
patients will have roughly similar reactions in the future. Or if
certain customers who are going bankrupt have behaved in a certain way,
then similar customers will behave in similar ways in the future. Or,
customers who bought widgets last month are similar to customers who
will buy widgets next month. And so on. It is important to recognize
that this is an assumption about the problem being addressed and about
the business environment. It is usually a pretty reasonable assumption,
too. However, there are some cases where the past may not be a
particularly good most common application of data mining. Its success
rests on three assumptions. The first is that the past is a good
predictor of the future. The second is that the data is available. And
the third is that the data contains what we want to predict. Its success
also requires taking into account the time frame of the model, and
acting on the results before the model and predictions expire.
Page 62
A student has just
moved to another state and started college. She applies for a credit
card, where she and her family have always banked, and the bank turns
her down. From the bank's perspective, this is a reasonable decision,
since college students are often a high-risk group. Little did they know
that this would turn into a crisis at the highest levels. The executive
vice president of sales at a large regional bank in the United States
related this story. In this case, the student happened to be the
daughter of a large real estate developer. Her father's company was one
of the bank's largest and most profitable customers. He was also a close
friend of several members of the bank's Board of Directors. The unhappy
student told her father about the rejected credit application. The
father told the board, and the board asked the bank president what was
going on. The president demanded an explanation from the EVP. A simple
business practice turned into a crisis. Two things explained the credit
rejection. First, when the young woman lived at home, she was considered
part of her parents' household. However, when she left, she was no
longer part of that household. The bank no longer knew the most
important fact about her customer relationship. Second, the systems
supporting the consumer and business sides of the bank did not talk to
each other (and sometimes there are good reasons for keeping such
information in separate places). Who is the customer? In this case, the
rather standard rejection of a single credit card application threatened
a much more profitable relationship. Should the credit card group
automatically approve applications from family members of important
customers? Probably not. However, all the divisions of the bank should
know who the important customers are so they can make more informed
decisions. Companies, for instance, often rely on distribution networks
to sell their goods. This means that the makers of the product have
little control over actual purchases. And, conversely, the people who
sell the product don't care what customers buy, so long as they buy a
lot. So, when a customer buys soda at a supermarket, the supermarket
does not care if it is Pepsi or Coke. The supermarket wants to sell both
of them, and usually makes a comparable margin on either. However, the
manufacturers do care, and care a lot. Data mining can help
manufacturers learn about how their product is being used. In Chapter
13, we will see some good examples of this from the retail grocery
industry. It is useful to realize that just obtaining the data in this
case required cooperation among several companies. The grocery store
used another company to process point-of-sale transactions, and all this
data had to get back to the packaged goods manufacturer. Cooperation
among businesses
Page 69
• Regular, gold, and platinum credit card holders
• Mass market versus corporate customers
• Customers with 12 months or more of history and more recent customers
Segments are useful because they allow business users
to add domain-specific information into the data mining effort. The
business people expect platinum cardholders to behave differently from
gold cardholders, so they develop different marketing campaigns for each
group and also build separate models for them. This separation often
produces better models, because the data mining algorithms do not have
to rediscover what is already known. On the other hand, not all customer
segmentations yield fruitful results. Often, the belief in separate
segments has driven the business in the past. However, there is
no data to support these differences. As with all assumptions made about
the data, it is important to validate assumptions made about customer
segmentation. Are the churn rates different for platinum cardholders
versus gold cardholders? Are average charges? And so on. Of course, data
mining algorithms, such as clustering, can help find segments. These
data mining segments are different from segments known to the business,
precisely because they are not driving the business. Or, they may be
driving the business, but management cannot take advantage of what it
does not know. Visualization can be a powerful tool for spotting and
understanding clusters. Figure 4.2 is a scatter graph that shows sales
of products over a period of a few weeks at a small group of pet stores
that use an affinity card to track repeated visits by the same
customer. In these pet stores, many products are associated with
particular pet types, such as birds, cats, dogs, fish, and reptiles.
Each axis of the graph corresponds to a purchase of an item associated
with a pet type. So the box at the intersection of "bird" and "reptile"
shows the number of customers who purchased a bird item and a reptile
item. The box at the intersection of "bird" and "bird" is the number
Page 72
reason, many companies want to keep track of prospects the same way
that they keep track of their own customers. Some are investing in
prospect data warehouses, while others use an outside bureau to keep
track of their prospecting campaigns. Often, when someone "applies" for
a product or service, they leave a trail of information. This may be in
the form of credit applications, customer surveys, medical forms, and so
on. Capturing this information is valuable, although it is subject to
some caveats. When someone is signing up as a customer, they usually do
not want to be bothered with filling out unnecessary forms. If the form
is required, they may take short-cuts and not fill in fully accurate
information. Or, credit checks may not be run on every potential
customer. So the data gathered at this point is often incomplete and
inaccurate. It is the behavior data generated by established customers
that is the most valuable. This data contains a wealth of information
describing customers and their segments. Response data refers to data
about responses and nonresponses to campaigns. It is important to
remember who is exposed to every marketing campaign, as well as
remembering who responded.

Customer's Lifecycle

In addition to the customer lifecycle, we must also consider the
customer's life cycle. That is, every person has life events that
affect his or her value as a customer. These life events may determine
when someone becomes an established customer, when they churn, and what
products they need. Some of these events are:
• Changing jobs
• Having a first child
• Marrying
• Divorcing
• Retiring
• Moving
• Major illness
These events often provide opportunities for enhancing customer
relationships. Unfortunately, people do not widely broadcast this type
of information; it is personal and intimate.
Page 80
many customers are not exposed to any message at all. Solving this
requires turning our thinking inside out. The strategies so far have
consisted of optimizing a single campaign; this is a product-oriented
strategy. The solution is a customer-focused strategy. Instead of
maximizing the value of each campaign, maximize the value of each
customer. That is, for each customer or prospect, what is the best next
message? The best next message refers to any type of offer, including:
• Different types of promotional offers for new customers (free weekend
long distance or airline miles, for instance)
Page 86
Part Two: The Three Pillars of Data Mining

The next four chapters focus on the more technical aspects of data
mining. Although the path is more technical, the goal remains the same:
using patterns in data for business success. In some ways, the technical
side is more challenging because it is so broad. It spans many
disciplines, including statistics, machine learning, databases, and
experimental design. Any of these could be the subject of an entire
book, and many are. A single technique in Chapter 5, neural networks, is
the subject of dozens of different books and hundreds of Ph.D. theses.
How can we hope to cover such a diverse field in just a few chapters?
The answer is quite simple. We are looking at the big picture, which is
how to extract business value from data, instead of trying to focus on
each and every detail. This is not the place to learn how to implement
neural networks, for instance, or the difference between training using
backpropagation versus conjugate-gradient. Nor is it a comprehensive
compendium of every type of decision tree. This is the place to learn
how to apply these techniques to relevant real world problems. Consider
driving a car. It is not necessary (fortunately) to be an expert on
internal combustion engines to be a good driver, or to be able to build
a muffler from scratch, or to explain the role of each additive in
gasoline. If a thorough understanding of all aspects of automobile
manufacturing were necessary, there would be far fewer drivers on the
road. Some understanding is useful, say to handle an automobile on wet roads,
or to diagnose what is going wrong when the car does not start. However,
details are better left for the experts, and even the experts benefit
from the perspective of the driver. The situation is quite similar with
data mining. For a long time, people did need to know many of the arcane
details about data analysis to be successful. These people are
statisticians. Now, the volumes of data and the business needs have
grown enormously. There are not enough statisticians around to analyze
all the data. Fortunately, the tools have become easier to use. And the
methods (as discussed in the first part of the book) are better
understood. The next few chapters contain mini-case studies and
vignettes whose purpose is to illuminate the technical concepts. These
are taken from our personal experiences in working with our clients.
Their purpose is to act as streetlights, to help data miners stay on (or
close to) a well-trodden path of best practices. This introduction to
Part Two provides an overview of the technical material in the next four
chapters. The purpose is to lay out a map that will help to keep the
focus on the bigger business problems rather than on the details of
algorithms, data layouts, or experimental design.

Three Pillars

The next three chapters discuss the three pillars
of data mining: data mining techniques, data, and the modeling of data.
The three pillars of data mining represent three core areas of
competency that are needed to be successful in data mining.
Practitioners need to have hands-on skills in these areas. On the other
hand, managers and business users should understand enough about them to
know when and where data mining can be effective.

Data Mining Techniques

The first pillar consists of the data mining techniques themselves. To
some people, the data mining techniques seem a bit mysterious: are they
evidence that computers can think like people? The answer, for better or
worse, is that data mining techniques are no more mysterious than
anything else a computer does, such as storing files or creating a
spreadsheet (and both of these are admittedly a bit mysterious to us).
The techniques are general approaches to solving problems, and there are
usually many ways to approach the technique. Each of these ways is a
different algorithm. The algorithms are like recipes with step-by-step
instructions explaining what is happening. Without requiring a
background in mathematics or statistics, we try to go into enough detail
to demystify the techniques, to convey how they actually work. It is
important to have some understanding of their inner workings to know
when to apply them, how to interpret the results, and whether or not
they are working. In the chapter on techniques, our purpose is to
explain the techniques with enough detail so you can:
• Distinguish between different techniques, knowing their advantages
and disadvantages
• Follow the techniques as they are used in the case studies
• Understand which technique is most appropriate for a given business
problem
• Become familiar with important variations
The major techniques
discussed here are the ones that are found in most comprehensive data
mining tools: decision trees, neural networks, and clustering. These
techniques are also available on a wide range of computing platforms,
from individual desktops, to departmental servers, to the most powerful
parallel computers. However, as desktops are becoming more powerful, it
is often
Page 93
However, internal data is not the only source of data. Much data
comes from external sources, such as:
• Demographics, psychographics, and webgraphics: information about
individuals and households that bureaus glean from many different
sources
• Data shared within an industry, such as credit reports, credit
scores, and catalog subscriptions
• Summary data about geographic areas, store catchment areas, and so on
• Purchased external lists that meet some particular criteria
• Data shared from strategic business partners
The chapter on
data covers how to work with data for data mining in the real world. Its
purpose is to cover the important issues that arise with data and to act
as a guide in planning and doing data mining. We do not intend for it to
teach the particulars of specific data sources, such as "how to create
dimensions on an OLAP cube" or "how to access data in SQL" or "how to do
it in JDBC." We assume that these technical skills are available, in
some form, to any company that wants to make data mining a core
competency.

Modeling Skills

The third pillar of data mining consists of
the set of modeling skills needed to build predictive models. The focus
is on predictive models (directed data mining) instead of undirected data
mining for two reasons. First, data mining is often about building
predictive models. These models find patterns in data from the past to
make predictions about unknown outcomes. Second, undirected data mining
requires noticing patterns. It is much less susceptible to a repeatable
methodology because it is about discovering new things; there is
necessarily a human element. The predictive modeling process leads to
many interesting insights, especially during the data exploration phase
or while analyzing how models are working. There is a methodology for
building effective predictive models. This methodology is based on
principles of experimental design, which is a way of saying that we need
to understand all the factors that affect the model. A model that
predicts churn in the wireless industry in the United States may not be
appropriate in other countries. Or a propensity-to-buy model developed
last year may not be appropriate this year, because the market has
changed. Or the marketing collateral may change from one campaign to the
next. The data miner needs to be aware of these factors to judge when
and whether predictive models will be effective.
The models are built on preclassified data; that is, on data where
the desired outcome (or some proxy for it) is already known. The process
of building models involves holding back some of the data for validation
and test purposes. This process is encapsulated in the creation of the
model set. Another very important factor is time. Typically, models are
built using data from the past, since this is the data that is
available. However, we often want to make predictions about the future,
using current data. So between the time that the model is developed and
the time it is used, the time frame shifts. Predictive data mining injects a bit
of the scientific method into business processes. Of course, the purpose
of most data mining is to improve business decisions, such as choosing
the right customers for a particular marketing campaign. Data miners are
not scientists. We do not have to create repeatable processes, defending
every detail to our peers. On the other hand, we often do have to
explain what we did to colleagues and have confidence in the ultimate
results, so it is worthwhile paying attention to the details.

Putting It All Together in a Data Mining Environment

Pillars hold things up: It is
fair to ask the question, "What rests on the three pillars of data
mining?" Chapter 8 explores creating a data mining environment. Such an
environment needs to take the best practices from working with data
mining techniques, transforming data, and building models, and combine
them with the business needs to deliver effective results. As data
mining becomes an increasingly important part of business, managers want
to understand how best to take advantage of the new technology. Once,
they merely wanted to know what data mining was. Then, they wanted to
know if it was relevant to their business. Those days have passed. Now,
they want to know how to implement it successfully inside their
organization (or whether they should outsource it). The tools have
matured, so success is more an organizational issue than a technical
one. We have seen many organizations with top-notch technology, where
lack of communication between different groups hinders the effectiveness
of their efforts. However, better groups learn from their experiences
and improve over time. American Express is a good example of a company
where modeling plays a strategic role in its business, and related
issues, such as customer privacy, are part of the core culture. As of
this writing, the primary group responsible for modeling has hundreds of
people and high corporate visibility. Few data mining groups will grow
to this size, but it does point to the potential strategic importance of
data mining.
Tool selection is one of the first big decisions facing data mining
groups. Unfortunately, the process of tool selection often overshadows
the fact that data mining software is one of the least important
ingredients for effective customer relationship management. Different
tools do have different strengths and are appropriate under different
circumstances. However, we have only very rarely seen data mining
efforts fail because of the choice of tool, and the software has been
improving significantly over the past few years. Organizational
commitment, access to data and data cleansing, and good modeling
techniques are all far more important. Another important consideration
is that the data mining expertise does not exist in a vacuum. This is
true from a technical perspective as well as an organizational one. The
data used for data mining comes from many different systems. This makes
data mining part of the enterprise-wide effort for data warehousing and
business intelligence. Often, the results of data mining efforts feed
back into other systems, such as campaign management or e-commerce
systems. Managing technical interdependencies, while maintaining
business relevance, is a challenge for effective data mining groups.

When to Use Cluster Detection

Use cluster detection when you suspect
that there are natural groupings that may represent groups of customers
or products that have a lot in common with each other. These may turn
out to be naturally occurring customer segments for which customized
marketing approaches are justified. More generally, clustering is often
useful when there are many competing patterns in the data, making it hard
to spot any single pattern. Creating clusters of similar records reduces
the complexity within clusters so that other data mining techniques are
more likely to succeed.
can be used to assign a class to the target field of a new record
based on the values of the other fields or independent variables. For
simplicity, assume that there are only two target classes and that each
split is a binary partitioning. The splitting criterion easily
generalizes to multiple classes, and clearly any multiway partitioning
can be achieved through repeated binary splits. We don't lose much by
addressing the simpler case in order to make the explanation easier to
follow. The first task is to decide which of the independent fields
makes the best splitter. The best split is defined as one that does the
best job of separating the records into groups where a single class
predominates. The measure used to evaluate a potential splitter is the
reduction in diversity (which is just another way of saying "the
increase in purity"). Because the concept of diversity (or conversely
purity) is at the very heart of the decision tree methods, it is worth
spending a little time on it. There are several ways of calculating the
index of diversity for a set of records. Even though it is intuitively
obvious in Figure 5.8 that the group at the root is more diverse than
either of the groups at the children nodes, how is this fact actually
measured? Ecologists, concerned about the diversity of actual
populations of plants and animals in the wild, have developed one
measure, called Simpson's diversity index. To calculate this index,
which in the data mining world is usually called the Gini index, imagine
reaching out and touching a single member of the population and then
letting it go before reaching out again. The diversity index is the
probability that the second thing touched belongs to a different class
than the first. The root of the tree contains nine triceratopses and
seven stegosauruses. The chance of touching a triceratops twice in a row
is $0.56 \times 0.56$, the chance of touching a stegosaurus twice in a row is
$0.44 \times 0.44$, and the diversity index, the chance of touching a different
kind of dinosaur each time, is what's left over, or
$1 - (0.44^2 + 0.56^2) = 0.49$, which is close to the maximum possible
diversity index of 0.5 for two classes. The limiting value, 0.5 (or,
generally, $1 - 1/n$ where $n$ is the number of categories), is reached
when each class has exactly the same number of
members and therefore exactly the same probability of being picked.
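
The dinosaur calculation above can be checked with a few lines of Python. This is only a minimal sketch: the function name `gini_index` and the list-of-labels representation are assumptions made for illustration, not anything prescribed by the text.

```python
from collections import Counter

def gini_index(labels):
    """Probability that two draws with replacement land in different classes."""
    counts = Counter(labels)
    total = sum(counts.values())
    # One minus the chance of touching the same class twice in a row.
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

# Root node from the example: nine triceratopses and seven stegosauruses.
root = ["triceratops"] * 9 + ["stegosaurus"] * 7
print(round(gini_index(root), 2))
```

A node holding a single class scores 0, and a node split evenly between two classes scores 0.5.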

Measuring Diversity

The diversity index described in the main text
has been developed several times in several different fields, and has
therefore been given several different names. To statistical biologists,
it is the Simpson diversity index. To cryptographers, it goes by yet
another name. Breiman et al. (1984) called it the Gini index, so that is
what most software tools call it. Whatever its name, its purpose is to
measure the diversity of a population. It can be interpreted as the
probability that any two elements of the population chosen at random
with replacement will belong to different classes. Since the probability
of any one class being chosen twice in a row is simply $P_i^2$, the
diversity index is simply one minus the sum of all the $P_i^2$. When
there are only two classes, things get even simpler since the
probability of one class is one minus the probability of the other:

$$
\begin{aligned}
1 - (P_1^2 + P_2^2) &= 1 - (P_1^2 + (1 - P_1)^2) \\
&= 1 - (P_1^2 + (1 - P_1)(1 - P_1)) \\
&= 1 - (P_1^2 + (1 - P_1) - P_1 + P_1^2) \\
&= 1 - P_1^2 - 1 + P_1 + P_1 - P_1^2 \\
&= 2P_1 - 2P_1^2 \\
&= 2P_1(1 - P_1)
\end{aligned}
$$
There are several other popular diversity measures, some of which
follow. All of their graphs have similar shapes going from 0 when the
population is all one class to a maximum value when the two classes are
equally represented. A high index of diversity indicates that the set
contains an even distribution of classes, whereas a low index means that
members of a single class predominate. The best splitter is the one that
decreases the diversity of the record sets by the greatest amount. Three
common diversity functions have simple formulas when there are only two
outcomes:

• $\min(P_1, P_2)$
• $2P_1(1 - P_1)$ (Gini index)
• $-(P_1 \log P_1 + P_2 \log P_2)$ (Entropy)

All
of these functions have a maximum where the probabilities of the classes
are equal and evaluate to zero when the set contains only a single
class. Between the extremes of full diversity and complete uniformity,
these functions have slightly different shapes. As a result, they
produce slightly different rankings of the proposed splits. It has been
shown that the Gini criterion tends to favor splits that isolate the
largest target class in one branch of the tree, and the entropy
criterion tends to favor balanced splits. The reason that software
packages often allow the user to choose a splitting criterion is that
there is no single best choice. The data miner must experiment to
determine which one gives the best results for the data set in hand.
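
To make the comparison of splitting criteria concrete, here is a hedged sketch of the three two-class diversity functions and of the reduction in diversity produced by a candidate split. The function names, the choice of log base 2 for entropy, and the two candidate splits of the 9-versus-7 root node are all illustrative assumptions rather than values from the text.

```python
import math

def min_diversity(p1):
    return min(p1, 1.0 - p1)

def gini(p1):
    return 2.0 * p1 * (1.0 - p1)

def entropy(p1):
    # Log base 2 is an assumption; classes with probability 0 contribute nothing.
    return -sum(p * math.log2(p) for p in (p1, 1.0 - p1) if p > 0.0)

def diversity_reduction(measure, parent, children):
    """parent and each child are (class1_count, class2_count) pairs."""
    def node_diversity(counts):
        return measure(counts[0] / sum(counts))
    total = sum(parent)
    # Weight each child's diversity by its share of the parent's records.
    weighted = sum(sum(c) / total * node_diversity(c) for c in children)
    return node_diversity(parent) - weighted

parent = (9, 7)
split_a = [(8, 1), (1, 6)]   # hypothetical split that separates the classes well
split_b = [(5, 3), (4, 4)]   # hypothetical split that barely separates them

for measure in (min_diversity, gini, entropy):
    a = diversity_reduction(measure, parent, split_a)
    b = diversity_reduction(measure, parent, split_b)
    print(measure.__name__, round(a, 3), round(b, 3))
```

All three measures prefer the first split here. On closer calls their rankings can differ, which is why tools let the user choose the criterion.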

Pruning the Tree

Pruning is the process of removing leaves and branches
to improve the performance of the decision tree. A pruned tree is, in
fact, a subset of the full decision tree. The decision tree keeps
growing as long as new splits can be found that improve the ability of
the tree to separate the records of the training set into classes. If
the training data were used for evaluation, any pruning of the tree
would only increase the error rate. Does this imply that the full tree
will also do the best job of classifying new data sets? Certainly not!
Tree building algorithms make their best split first, at the root node
where there is a large population of records. Each subsequent split has
a smaller and less representative population with which to work. Towards
the end, idiosyncrasies of the training records at a particular node
display patterns that are peculiar only to those records. The patterns
are meaningless and harmful for prediction. For example, say the
decision tree is trying to predict height and it comes to a node
containing one tall Dorian and several shorter people with other names.
It can decrease diversity at the node by a new rule saying that "people
named
Matrix? The name confusion matrix puts off a lot of people. In fact,
on hearing the word confusion, the concept suddenly becomes difficult.
To compare the results between different models, though, we want to see how
well the model performs on unseen data. This hold-out set is the
evaluation set (which is part of the model set). Figure 3.6 shows a
confusion matrix, both graphically and as a table. This tells us how
many predictions made by a predictive model are correct and how many are
incorrect. Which is the best model depends on the business problem.
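
A confusion matrix is simply a count of each (actual, predicted) pair on the evaluation set. The sketch below uses made-up labels and a hypothetical helper named `confusion_matrix`; it is meant only to show the bookkeeping, not any particular tool's output.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count how often each actual class received each predicted class."""
    return Counter(zip(actual, predicted))

# Made-up outcomes for eight records in an evaluation set.
actual    = ["yes", "yes", "no", "no", "no",  "yes", "no", "no"]
predicted = ["yes", "no",  "no", "no", "yes", "yes", "no", "no"]

matrix = confusion_matrix(actual, predicted)
correct = sum(n for (a, p), n in matrix.items() if a == p)
incorrect = len(actual) - correct

for a in ("yes", "no"):
    for p in ("yes", "no"):
        print(f"actual={a:3} predicted={p:3} count={matrix[(a, p)]}")
print("correct:", correct, "incorrect:", incorrect)
```

Whether the two kinds of error matter equally is a business question, which is why the counts are kept separate rather than collapsed into a single error rate.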
on the test set. The graph in Figure 5.9 shows how this works. With
multiple test sets, we can even more directly address the issue of model
generality by selecting the subtree that performs most consistently
across several test sets.

Consequences of Choosing Decision Trees

Now
that you understand how decision trees are built, it is easy to see some
of the consequences for the data miner. First, notice that since every
split in a decision tree is a test on a single variable, decision trees
can never discover rules that involve a relationship between variables.
This puts a responsibility on the miner to add derived variables to
express relationships that are likely to be important. For example, a
loan database is likely to have fields for the initial amount of the
loan and the remaining balance, but neither of these fields is likely to
have much predictive value in isolation. The ratio of the outstanding
balance to the initial amount carries much more helpful information, but
a decision tree will never discover a single rule based on this ratio
unless it is included as a separate variable.

Handling of Input Variables

In some situations, the way that decision trees handle numeric
input variables can cause valuable information to be lost. When a split
is chosen, only the rank order of the observations comes into play. For
the most part, this does not

Figure 5.9 Error rate on training set and test set as tree complexity
increases (horizontal axis: depth of tree).
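
Returning to the loan example earlier in this section, a derived variable such as the ratio of outstanding balance to initial amount has to be computed before the tree is built. The following is a minimal sketch; the field names `initial_amount`, `remaining_balance`, and `balance_ratio` and the sample records are hypothetical.

```python
def add_balance_ratio(records):
    """Return copies of the records with a derived balance_ratio column."""
    derived = []
    for record in records:
        row = dict(record)
        initial = row["initial_amount"]
        # Leave the ratio undefined rather than divide by zero.
        row["balance_ratio"] = row["remaining_balance"] / initial if initial else None
        derived.append(row)
    return derived

# Hypothetical loan records.
loans = [
    {"loan_id": 1, "initial_amount": 10000.0, "remaining_balance": 9500.0},
    {"loan_id": 2, "initial_amount": 200000.0, "remaining_balance": 50000.0},
    {"loan_id": 3, "initial_amount": 0.0, "remaining_balance": 0.0},
]

for row in add_balance_ratio(loans):
    print(row["loan_id"], row["balance_ratio"])
```

With the ratio present as its own column, a single-variable split can test it directly.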

When to Use Decision Trees

Decision-tree methods are a good choice
when the data mining task is classification of records or prediction of
outcomes. Use decision trees when your goal is to assign each record to
one of a few broad categories. Decision trees are also a natural choice
when your goal is to generate rules that can be easily understood,
explained, and translated into SQL or a natural language. Decision trees
are sometimes built for no other purpose
than prioritizing the independent variables. That is, using a decision
tree, it is possible to pick the most important variables for predicting
a particular outcome because these variables are chosen for splitting
high in the tree. Another useful consequence of the way that important
variables float to the top is that it becomes very easy to spot input
variables that are doing too good a job of prediction because they
encode knowledge of the outcome that is available in the training data,
but would not be available in the field. We have seen many amusing
examples of this, such as discovering that people with nonzero account
numbers were the most likely to respond to an offer of credit, which is
less than surprising since account numbers are assigned only after the
application has been processed!

Neural Networks

Neural networks are at once the most
widely known and the least understood of the major data mining
techniques. Much of the confusion stems from overreliance on the
metaphor of the brain that gives the technique its name. The people who
invented artificial neural networks were not statisticians or data
analysts. They were machine learning researchers interested in mimicking
the behavior of natural neural networks such as those found inside of
fruit flies, earthworms, and human beings. The vocabulary these machine
learning and artificial intelligence researchers used to describe their
work ("perceptrons," "neurons," "learning," and the like) led to a
romantic and anthropomorphic impression of neural networks among the
general public and to deep distrust among statisticians and analysts.
Depending on your own background, you may be either delighted or
disappointed to learn that, whatever the original intentions of the
early neural networkers, from a data mining perspective, neural networks
are just another way of fitting a model to observed historical data in
order to be able to make classifications or predictions. To illustrate
this point, and to introduce the various components of a neural network,
it is worth noting that standard linear regression models and many
other
nearly always the weighted sum of the inputs. Transfer functions
come in many more flavors. In Figure 5.12 we used a linear transfer
function because the network was drawn to represent a linear function.
More commonly, the transfer function is sigmoidal (S-shaped) or
bell-shaped. The bell-shaped transfer functions are called radial basis
functions. Common sigmoidal transfer functions are the arctangent, the
hyperbolic tangent, and the logistic. The nice thing about these
S-shaped and bell-shaped functions is that any curve, no matter how
wavy, can be created by adding together enough S-shaped or bell-shaped
curves. In fact, multilayer perceptrons with sigmoidal transfer
functions and radial basis networks are both universal approximators,
meaning that they can theoretically approximate any continuous function
to any degree of accuracy. Of course, theory does not guarantee that we
can actually find the right neural network to approximate any particular
function in a finite amount of time, but it's nice to know. (Incidentally,
decision trees are not universal approximators.) The sigmoidal transfer
functions used in the classic multilayer perceptrons have several nice
properties. The shape of the curve means that no matter how extreme the
input values, the output value is always constrained to a known range
(-1 to 1 for the hyperbolic tangent and, when suitably scaled, the
arctangent; 0 to 1 for the logistic). For moderate input values, the
slope of the curve is nearly constant. Within this range, the sigmoid
function behaves almost linearly. As the weights get larger, the response
becomes less and less linear as it takes a larger and larger change in
the input to cause a small change in the output. This behavior
corresponds to a gradual movement from a linear model to a nonlinear
model as the inputs become extreme.

Training Neural Networks

Training a
neural network is the process of setting the weights on the inputs of
each of the units in such a way that the network best approximates the
underlying function, or put in data mining terms, does the best job of
predicting the target variable. This is an optimization problem and
there are whole textbooks dedicated to optimization, but in broad
outline most software packages for building neural network models use
some variation of the technique known as backpropagation.
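
Before turning to training, it may help to see a single unit in code: a combination function (the weighted sum of the inputs) followed by a transfer function. This is a sketch under stated assumptions; the names `unit_output` and `logistic`, the bias term, and the sample weights are illustrative and do not come from the text.

```python
import math

def logistic(x):
    # Written in two branches so that very large negative inputs do not overflow.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def linear(x):
    return x

def unit_output(inputs, weights, bias, transfer):
    """Combination function (weighted sum) followed by a transfer function."""
    combined = bias + sum(w * x for w, x in zip(weights, inputs))
    return transfer(combined)

# Hypothetical inputs and weights for one unit.
inputs, weights, bias = [1.0, 2.0], [0.5, -0.25], 0.1
for transfer in (linear, logistic, math.tanh, math.atan):
    print(transfer.__name__, unit_output(inputs, weights, bias, transfer))

# However extreme the input, the sigmoidal outputs stay in a known range.
# The unscaled arctangent is bounded by plus or minus pi/2.
for x in (-1000.0, 1000.0):
    print(x, logistic(x), math.tanh(x), math.atan(x))
```

Training then amounts to adjusting `weights` and `bias` for every unit so that the network's outputs match the target values as closely as possible.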

Backpropagation

Training a backpropagation neural network has three steps:

1. The network gets a training instance and, using the existing
   weights in the network, it calculates the output or outputs for the
   instance.
One of the concerns with any neural network training technique is
the risk of falling into something called a local optimum. This happens
when the adjustments to the network weights suggested by whatever
optimization method is in use no longer improve the performance of the
network even though there is some other combination of
weights (significantly different from those in the network) that yields a
much better solution. This is analogous to trying to climb to the top of
a mountain and finding that you have only climbed to the top of a nearby
hill. There is a tension between finding the local best solution and the
global best solution. Adjusting parameters such as the learning rate and
momentum helps to find the best solution.

Consequences of Choosing Neural Networks

Neural networks can produce very good predictions, but
they are neither easy to use nor easy to understand. The difficulties
with ease of use stem mainly from the extensive data preparation
required to get good results from a neural network model. The results
are difficult to understand because a neural network is a complex
nonlinear model that does not produce rules.

Data Preparation Issues

The
inputs to a neural network must somehow be scaled to be in a particular
range, usually between -1 and 1. This requires additional transforms and
manipulations of the input data that require careful thought. Simply
dividing everything by the largest magnitude is unlikely to get good
results due to what might be termed the "Bill Gates problem." If we were
to scale a variable containing net worth information by dividing all the
values by Bill Gates' net worth, everyone else's net worth would be
clumped together near zero while Bill's would be at one. The network
would be unable to make use of "small" differences in net worth like one
or two million dollars in order to make predictions. Something
else, perhaps removing outliers or using log transformations, is required.
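
The "Bill Gates problem" is easy to reproduce. The sketch below compares dividing by the largest magnitude with one possible alternative, a log transformation followed by rescaling to the range -1 to 1. The net worth figures and the function names are made up for illustration, and the log approach assumes non-negative values that are not all identical.

```python
import math

def scale_by_max(values):
    """Divide everything by the largest magnitude."""
    largest = max(abs(v) for v in values)
    return [v / largest for v in values]

def scale_by_log(values):
    """Take log(1 + v), then rescale linearly to the range -1 to 1."""
    logs = [math.log1p(v) for v in values]
    lo, hi = min(logs), max(logs)
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in logs]

# Made-up net worth values; the last one plays the role of the outlier.
net_worth = [50_000, 250_000, 1_000_000, 3_000_000, 100_000_000_000]

by_max = scale_by_max(net_worth)
by_log = scale_by_log(net_worth)
for raw, a, b in zip(net_worth, by_max, by_log):
    print(f"{raw:>15,} {a:.6f} {b:+.3f}")
```

After dividing by the maximum, everyone except the outlier sits within a hair of zero. After the log transformation, a difference of a couple of million dollars is still visible to the network.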
Categorical variables need to be converted to numerical variables in a
manner that does not introduce spurious ordering. If states of the
United States were given numbers in alphabetical order, then Alaska and
Alabama would be close neighbors but Alabama and Mississippi would be
far apart. You may think that the ordering doesn't matter, but there is
no way to stop the neural network from trying to make use of it. Many
categorical variables do have a natural order for a given data set and
there are techniques for discovering it, but again, thought is required.
Another approach to categorical variables is to create one binary flag
variable for each value the variable might possibly take on.
Unfortunately, this can lead to an explosion in the number of input
nodes in the network, and larger networks are slower to train and more
likely to
create unstable models. Neural networks cannot deal with missing values.
If records containing missing values are simply dropped, the training
data will probably be skewed since the subset of records for which all
fields are filled in is not likely to be representative of the
population. Somehow, the missing values must be estimated, preferably
while recording in another variable the fact that the field was missing.
Predicting the best replacement for a missing value given the values of
the filled-in fields is itself a data mining problem. The requirement to
pay so much attention to the input data is not entirely a bad thing.
Since data quality is the number one issue in data mining, this
additional attention can forestall problems later in the analysis.
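
Two of the preparation steps described above, binary flag variables for a categorical field and estimating missing values while recording that they were missing, might look like the following sketch. The field names and records are hypothetical, and the median is used only as a crude stand-in estimate; as the text notes, choosing a good replacement value is itself a modeling problem.

```python
from statistics import median

def flag_columns(value, categories):
    """One 0/1 flag per category, so no ordering is implied between values."""
    return {f"state_is_{c}": int(value == c) for c in categories}

def fill_missing(records, field):
    """Replace missing values with the median and record that they were missing."""
    known = [r[field] for r in records if r[field] is not None]
    estimate = median(known)
    filled = []
    for r in records:
        row = dict(r)
        row[field + "_missing"] = int(r[field] is None)
        if r[field] is None:
            row[field] = estimate
        filled.append(row)
    return filled

# Hypothetical customer records with one missing income.
customers = [
    {"state": "AL", "income": 40000},
    {"state": "AK", "income": None},
    {"state": "MS", "income": 60000},
    {"state": "AL", "income": 50000},
]

states = sorted({c["state"] for c in customers})
prepared = []
for row in fill_missing(customers, "income"):
    state = row.pop("state")
    row.update(flag_columns(state, states))
    prepared.append(row)

for row in prepared:
    print(row)
```

With three states this adds three input columns; with fifty states it would add fifty, which is the explosion in input nodes mentioned above.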

Neural Networks Cannot Explain Results

This is the biggest drawback of
neural networks in a business decision support context. For our clients,
understanding what is going on is often as important as, if not more
important than, getting the best prediction. In situations where
explaining rules may be critical, such as denying loan applications,
neural networks are not a good choice. There are many situations,
however, when the prediction itself matters far more than the
explanation. The neural network models that can spot a potentially
fraudulent credit card transaction before it has been completed are a
good example. An analyst or data miner can study the historical data at
leisure in order to come up with a good explanation of why the
transaction was suspicious, but in the moments after the card is swiped,
the most important thing is to make a quick and accurate prediction.

When to Use Neural Networks

Neural networks are a good choice for most
classification and prediction tasks when the results of the model are
more important than understanding how the model works. Neural networks
actually represent complex mathematical equations, with lots of
summations, exponential functions, and many parameters. These equations
describe the neural network, but are quite opaque to human eyes. The
equation is the rule of the network, and it is useless for our
understanding. Neural networks do not work well when there are many
hundreds or thousands of input features. Large numbers of features make
it more difficult for the network to find patterns and can result in
long training phases that never converge to a good solution. Here,
neural networks can work well with decision-tree methods. Decision trees
are good at choosing the most important variables-and these can then be
used for training a network.

Lessons Learned

This chapter has introduced three of the most common
data mining techniques (clustering, decision trees, and neural networks)
by describing the inner workings of at least one of the algorithms used
to implement each one. We have seen that each technique is applicable to
a wide range of situations, but that each has strengths and weaknesses
that must be taken into account. The primary lesson that you should take
away from this chapter is that no one data mining technique is right for
all situations.
• The package carrier knows zip code, value of package, time stamp
  at truck, time stamp at sorting center, and so on.

Each company has the
opportunity to learn from this simple interaction and to apply what they
learn to improve their business and more profitably serve their
customers. In a nutshell, this is the goal of data mining, and we see
that it rests firmly on the foundation of data. The electronic trail
leads to some basic truths about data. Data comes in many forms, in many
types, and on many systems. It is always dirty, incomplete, and
sometimes incomprehensible. And yet, this is the raw material of data
mining. If we compare data mining to searching for oil, then
understanding the data is comparable to knowing where to drill for oil.
The most powerful machinery is unlikely to find significant oil deposits
beneath New York City or Boston, because they are not in an appropriate
geography. Similarly, the most powerful data mining techniques cannot
find interesting patterns in data without adequately preparing it and
knowing where to look. This chapter undertakes an ambitious task, to
cover the most important data issues that arise in data mining. These
include choosing the right data, understanding the structure of the
data, adding derived variables, and working with dirty data. It also
shows a practical example of work with data to derive important features
from time series that describe customer behavior.

What Should Data Look Like?

To start the discussion on data, let's begin at the end: What should the
data look like for data mining? All data mining algorithms use a very
simple view of data, illustrated in Figure 6.1. This view of data, as a
single table with rows and columns, is probably familiar to most
readers. Alas, this single-table columnar format is not the way that
data is created and stored in most environments, and for good reason.
What is good for data mining is not optimal for most other purposes.
This simple view immediately brings up two questions. What are the rows?
And, what are the columns? The answers to these questions are the most
important step in preparing data for data mining.

The Rows

What is the
process for determining what a row refers to? A row is the unit of
action, and should be determined by understanding how the data mining
results will be used. That is, data mining serves a purpose: helping
the business eventually take some action. A row is the unit of action.
It is one instance
making it more feasible for a row to be at the account level. Or, in
the case of web applications, the unit might be based on a cookie stored
on a computer. This roughly corresponds to a user, but several people
may use a single computer and a single person may use several computers.
In other applications, a row can refer to anything from a printing run
to an inventory item. Not all data is always available for all
customers, so the data is often limited to a subset of customers with
valid values in certain fields. For a direct mail campaign, rows should
refer to customers with valid addresses. For a telemarketing campaign,
rows should refer to customers with valid telephone numbers. For an
e-mail campaign, rows should refer to customers with valid e-mail
addresses. During the data mining process, this suggests limiting the
data mining to those customers with valid mail addresses, telephone
numbers, or e-mail addresses. Use of a subset of the data occurs in
other ways. Perhaps the initial campaign will be targeted to prospects
in New Jersey, so it will focus only on data from New Jersey initially.
Perhaps the initial campaign will be only for Elite Club members, so it
will focus only on them for the data mining effort. The rows are the
level of granularity and the unit of action. Next we need to know what
data the rows contain.

The Columns

The fields or columns represent the
data in each record. Each column contains values. The range of the
column refers to the allowable values for that column. Numbers typically
would have a minimum and maximum value. Categorical fields would have a
list of observed values for their range. A histogram, such as those
shown in Figure 6.2, shows how often each value or range of values
occurs. So, the vertical axis is a count of records and the horizontal
axis is the values in the column. This figure shows some common
distributions, such as the normal distribution that looks like a
bell-shaped curve (statisticians call this a Gaussian distribution) and
the uniform distribution. The distribution of the values provides some
very important insights into the columns. Statistical methods are very
concerned with distributions; fortunately, data mining algorithms are a
bit less sensitive to them. Here, we will illustrate some special cases
of distributions that are important for data mining purposes.

Columns with One Value

The most degenerate distribution is a column that has
only one value. These unary-valued columns, as they are more formally
known, do not contain any information that helps to distinguish between
different rows. Because they lack any information content, they should
be ignored for data mining purposes.

Figure 6.2 Common data distributions. The figure has three panels:

• A histogram of the month of claim for a set of insurance claims
  (horizontal axis: Jan through Dec). This is an example of a typically
  uniform distribution. That is, the number of claims is roughly the
  same for each month.
• A histogram of the number of telephone calls made for different
  durations (horizontal axis: duration in minutes). This is an example
  of an exponentially decreasing distribution.
• A histogram of a normal distribution with a mean of 50 and a standard
  deviation of 10 (horizontal axis: value). Notice that high and low
  values are very rare.

Having only one value is
sometimes a property of the data. It is not uncommon, for instance, for
a database to have fields defined in the database that are not yet
populated. The fields are only placeholders for future values, so all
the values are uniformly something such as "null" or "no" or "0."
Another cause of unary-valued columns is when the data mining effort is
focused on a subset of customers. The fields that define this subset all
may contain the same value. If we are building a model to predict the
loss-ratio (an insurance profitability measure) for automobile customers
in New Jersey, then the state field will always contain "NJ." This field
has no information content on the subset of data that we will be using.
We want to ignore it for modeling purposes.
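
Unary-valued columns are easy to find mechanically. The sketch below uses a hypothetical helper, `unary_columns`, and made-up policy records to show both causes described above: a placeholder field that is not yet populated, and a field that becomes constant once the data is restricted to a subset.

```python
def unary_columns(rows):
    """Return the names of columns that hold a single distinct value."""
    return [col for col in rows[0] if len({row[col] for row in rows}) == 1]

# Made-up policy records; future_flag is an unpopulated placeholder field.
policies = [
    {"policy_id": 101, "state": "NJ", "loss_ratio": 0.62, "future_flag": None},
    {"policy_id": 102, "state": "NY", "loss_ratio": 0.48, "future_flag": None},
    {"policy_id": 103, "state": "NJ", "loss_ratio": 0.71, "future_flag": None},
    {"policy_id": 104, "state": "PA", "loss_ratio": 0.55, "future_flag": None},
]

print(unary_columns(policies))

# Restricting the model set to New Jersey makes the state column unary too.
new_jersey = [row for row in policies if row["state"] == "NJ"]
print(unary_columns(new_jersey))
```

Running the check after any subsetting step, rather than only on the full table, catches columns like `state` that carry information in general but none on the data actually being modeled.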
recent. And so on. These are cases where the important features
(such as geography and customer recency) should be extracted from the
fields as derived variables. However, once the useful information has
been extracted, the original columns should be ignored.

Ignoring Columns Synonymous with the Target

When a column is too highly correlated with
the target column, it can mean that the column is just a synonym. Here
are two examples:

• "Account number is non-NULL" may be synonymous with response to a
  marketing campaign. Only responders who opened accounts are assigned
  account numbers.
• "Date of churn is not NULL" is synonymous with having churned.

Another danger is that the column
reflects previous business practices. For instance, we may find that all
customers with call forwarding also have call waiting. This is a result
of product bundling; call forwarding is sold in a product bundle that
always includes call waiting. Or perhaps previous prospecting campaigns
focused on a particular segment of customers, such as people with
children and under 40. Well, all responders will have these
characteristics, so the age and number of children may not be useful for
data mining purposes. This, by the way, also illustrates why the data
miner needs to know who was contacted, as well as who responded. It is
important to ignore columns that are synonymous with the target.

Roles of Columns in Data Mining

Different columns play different roles in data mining. Three fundamental
roles are:
• Input columns. Used as input into the model.
• Target column(s). Used only when building predictive models. These are
  what is interesting, such as propensity to buy a particular product,
  likelihood to respond to an offer, or probability of remaining a
  customer. When building descriptive models, there does not need to be
  a target.
• Ignored columns. Columns that are not used.
Of course, different tools have different names for these roles. Figure 6.4
illustrates the "Type" node in SPSS Clementine. This is the node that
sets the role for each column. In this case, there is a single TARGET
(or OUT direction in their terminology), which is typical for most
predictive data mining applications.
Page 137
Cost column. Specifies a cost associated with a row. For
instance, if we are building a customer retention model, then the "cost"
might include an estimate of each customer's value. Some tools can use
this information to optimize the models that they are building. These
roles are illustrated in Figure 6.5.

Data for Data Mining

In short, data for data mining needs to be in the following format:
• All data should be in a single table or database view.
• Each row should correspond to an instance that is relevant to the
  business.
• Columns with a single value should be ignored.
• Columns with a different value for every row should be ignored,
  although the information they contain may be extracted into derived
  columns.

Figure 6.5 Column roles in SAS Enterprise Miner.
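The last two rules can be checked mechanically before modeling. The following is a minimal sketch, assuming the table fits in memory as a list of dictionaries; the function name `screen_columns` and the sample rows are hypothetical and only illustrate the idea.

```python
def screen_columns(rows):
    """Flag columns that carry no usable information as they stand.

    rows: list of dicts with identical keys (a hypothetical in-memory table).
    Returns {column_name: reason} for columns that are candidates to ignore.
    """
    flagged = {}
    n_rows = len(rows)
    for col in rows[0]:
        distinct = {row[col] for row in rows}
        if len(distinct) == 1:
            # Same value in every row, e.g. state = "NJ" on an NJ-only subset.
            flagged[col] = "single value"
        elif len(distinct) == n_rows:
            # A different value in every row, e.g. an account identifier.
            flagged[col] = "different value in every row"
    return flagged


# Hypothetical sample: automobile customers restricted to New Jersey.
customers = [
    {"account_id": "A1", "state": "NJ", "loss_band": "high"},
    {"account_id": "A2", "state": "NJ", "loss_band": "low"},
    {"account_id": "A3", "state": "NJ", "loss_band": "low"},
    {"account_id": "A4", "state": "NJ", "loss_band": "high"},
]
print(screen_columns(customers))
```

A continuous numeric column can also have a different value in every row, so the result is best read as a list of columns to review rather than a list to drop automatically.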
Page 139
by the values in one or more columns for output purposes. SQL, developed
by IBM in the 1970s, has become the standard language for accessing
relational databases. A common way of representing the database
structure is to use an entity-relationship (E-R) diagram. Figure 6.7 is
a simple E-R diagram with five entities and four relationships among
them. In this case, each entity corresponds to a separate table with
columns corresponding to the attributes of the entity. In addition, keys
in the table represent the relationships.
Page 141
A single transaction occurs at exactly one vendor, but each vendor may
have multiple transactions. One account has multiple transactions, but
each transaction is associated with exactly one account. A customer may
have one or more accounts, but each account belongs to exactly one
customer. Likewise, one or more customers may be in a household.
An E-R diagram can be
used to show the tables and fields in a relational database. Each box
shows a single table and its columns. The lines between them show
relationships, such as 1-many, 1-1, and many-to-many. Because each table
corresponds to an entity, this is called a physical design. Sometimes,
the physical design of a database is very complicated. For instance, the
TRANSACTION TABLE might actually be split into a separate table for each
month of transactions. In this case, the above E-R diagram is still
useful; it represents the logical structure of the data, as business
users would understand it.

Figure 6.7 An entity-relationship diagram describes the data for a
simple credit card database.

One nice feature of relational databases is the ability to design a
database so that any given data item appears in exactly one place, with
no duplication. Such a
database is called a normalized database. Knowing exactly where each
data item is located is highly efficient in theory, since updating any
field requires modifying only one row in one table. When a normalized
database is well designed, there is no redundant, out-of-date, or
invalid data. The key idea behind normalization is creating reference
tables. Each reference table logically corresponds to an entity, and
each has a key used for looking up information about the entity. In a
relational database, the "join" operation is used to look up values in
the reference table.
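As an illustration of a reference-table lookup, here is a small sketch using Python's built-in `sqlite3` module. The table names (`vendor`, `txn`), their columns, and the sample rows are assumptions made for the example; they are not the schema of Figure 6.7.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Reference table: one row per vendor, looked up by its key.
    CREATE TABLE vendor (vendor_id INTEGER PRIMARY KEY, vendor_name TEXT);
    -- Transactions store only the vendor key, never the vendor name.
    CREATE TABLE txn (txn_id INTEGER PRIMARY KEY, vendor_id INTEGER, amount REAL);
    INSERT INTO vendor VALUES (1, 'Grocer'), (2, 'Bookshop');
    INSERT INTO txn VALUES (10, 1, 25.0), (11, 2, 12.5), (12, 1, 40.0);
""")

# The join looks up each transaction's vendor name in the reference table.
rows = conn.execute("""
    SELECT t.txn_id, v.vendor_name, t.amount
    FROM txn AS t
    JOIN vendor AS v ON v.vendor_id = t.vendor_id
    ORDER BY t.txn_id
""").fetchall()

for row in rows:
    print(row)
```

Because the vendor name is stored once, renaming a vendor means updating a single row in `vendor`; every transaction picks up the new name through the join.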
Page 142
products, or the data may come from different time
frames. In general, it is a good idea to reverify any hunches on the
data actually being used for data mining. Also, data mining predictions
can be applied to the dimensional data. For instance, in the standard
OLAP cube, perhaps we want to add a new calculated value that is a
prediction of how many products would be sold in each store on a given
future day. This prediction would then be available in the system, and
available for slicing and dicing and reporting. The cube could then also
be used for measuring the effectiveness of data mining, by easily
showing the accuracy of the predictions.

Survey and Product Registration Data

Surveys, polls, and product registrations are examples of self-reported
data. That is, customers are being asked to tell you something about
themselves. Surveys are near and dear to the hearts of marketers. One
reason for this affinity is that they often provide the only data
available about customers and their behavior that is combined with other
attributes, such as demographics. Since many surveys
are conducted on only a few hundred or few thousand people, the data
fits inside a desktop spreadsheet. And perhaps the biggest advantage is
that the surveys are conducted or outsourced directly from the marketing
department. Often, the cumbersome process of working with IT can be
circumvented. A similar data source is product registrations. These are
the postcards or online forms that ask users to check a box for
estimated household income, number of pets, where they saw the product,
and so on. These types of self-reported data should always be treated
cautiously:
• First, anytime people are asked questions, they have the opportunity
  to answer none of them. The people who respond are self-selected as a
  group of people more likely to respond to questions. This is an
  example of sample bias. The people in the survey do not necessarily
  represent the entire population, so you have to be careful about using
  the responses as representative of the whole population.
• People may not always give accurate responses, or the responses may be
  keyed in with errors. Even if only 10 percent of the responses are
  inaccurate, this can have a large effect on the analysis. There is no
  way to tell which 10 percent of the answers are inaccurate, and the
  inaccuracies may merely reflect errors that occur when typing
  responses into databases.
• Although self-reported data often gets saved for future use, past
  surveys may not be comparable to more recent surveys. Consider a
  survey of Web users in 1996. The same survey may have quite different
  results from the one taken now. Back in 1996, Web users were much more
  likely to be white, male, American, earn high incomes, and work in the
  high-tech industry. As time goes by, Web users are becoming more and
  more like the general population. The differences in survey responses
  may be due only to shifting demographics.
• It can also be challenging to match survey data to the customer
  database. That is, often the only identifying information on the
  survey is the customer's name, and this must be matched to the
  customer database if you want to tie survey responses to customer
  billing, promotion, and usage histories.
• And perhaps the biggest problem from the data mining perspective is
  that survey data is quite incomplete. So, it is not possible to
  actually use responses from surveys as inputs into models. The simple
  reason is that the model could be run only on people who have
  responded to the survey.
Of course,
having pointed out all these deficiencies, it is also important to point
out that surveys can be quite useful. It is possible to adjust the
results of a survey for sample bias, by taking into account the
responses for different subgroups of the population. Polling agencies
use these types of adjustments to determine public opinion, with a
margin of error less than 3 percent. By assuming that people lie
consistently, it is possible to reduce the impact of inaccurate data.
And, like data mining, surveys are often used to provide insight into
customers. The results may point at a new way to market a product, or a
new theme for an advertising campaign, or a new list to purchase. The
survey data itself is often valuable as the primary source of data. It
can also be valuable to try to predict survey responses using other
data. For instance, perhaps we can predict who the "soccer moms" are by
using internal data sources (that is, "soccer moms" may have particular
and distinctive patterns in their data). The predictions may not be 100
percent accurate, but they would give us the ability to identify
important demographic segments in the customer base.

External Data Sources

External providers are a very valuable source of data. Some of these
suppliers augment existing data with additional fields. Here are some
examples:
• Doing a credit check on a potential or current customer. In this case,
  additional credit history and creditworthiness data is provided by a
  credit bureau.
• Using an outside agency to add demographic, psychographic, or
  Web-graphic data. This data contains columns such as "number of
  children," "amount paid in greens fees," and "number of Web sites
  visited."
Page 148
• A customer corresponds to an individual who may have more than one
  credit card.
• A customer corresponds to a household, containing all credit cards
  held by all individuals in the household.
All of these are valid answers. The purpose here is not to choose among
them, but to illustrate that there are different levels of granularity.
An important point is that, regardless of which definition is chosen,
different data is available at different levels of granularity. Ignoring
the vendor table in the E-R diagram, this simple example has data at
four different levels:
1. Transaction data at the transaction level, with an account key
2. Account data, with a customer key
3. Customer data, with a household key
4. Household data
To get this all at the household level
takes a bit of work. Well, the HOUSEHOLD TABLE is fine; it provides a
beginning. Let's add something from the CUSTOMER TABLE, such as the
number of customers in a household or the average FICO score. This
requires aggregating the customer table at the household level, and
calculating the desired fields. Next comes the ACCOUNT TABLE. This table
can provide the number of accounts, the total amount due, the total last
payment, and so on. But we need this information at the household level.
First, look up the household ID in the CUSTOMER TABLE, using the
customer ID; then aggregate the ACCOUNT TABLE by household; then join
this into the result. This is starting to get complicated. (Processing
the TRANSACTION TABLE is left as an exercise.) Even in this simple
example, combining these disparate sources of data into a single table
(as needed for data mining purposes) is a challenge. Figure 6.9
illustrates the processing that needs to be done. First, data that is at
a too-detailed level needs to be aggregated to the right level, which
may involve looking up the key. For instance, to aggregate accounts at
the household level may first require looking up the household key
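The household-level roll-up described above can be sketched with Python's built-in `sqlite3` module. The table and column names (`household`, `customer`, `account`, `fico`, `amount_due`) and the sample rows are assumptions for illustration; the real tables would have many more fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE household (household_id INTEGER PRIMARY KEY);
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY,
                           household_id INTEGER, fico INTEGER);
    CREATE TABLE account (account_id INTEGER PRIMARY KEY,
                          customer_id INTEGER, amount_due REAL);
    INSERT INTO household VALUES (1), (2);
    INSERT INTO customer VALUES (100, 1, 700), (101, 1, 740), (102, 2, 650);
    INSERT INTO account VALUES (1000, 100, 50.0), (1001, 100, 25.0),
                               (1002, 101, 10.0), (1003, 102, 5.0);
""")

household_rows = conn.execute("""
    WITH cust AS (
        -- Customer data aggregated directly by its household key.
        SELECT household_id, COUNT(*) AS n_customers, AVG(fico) AS avg_fico
        FROM customer
        GROUP BY household_id
    ),
    acct AS (
        -- Accounts carry only a customer key, so look up the household
        -- key in the customer table first, then aggregate by household.
        SELECT c.household_id, COUNT(*) AS n_accounts,
               SUM(a.amount_due) AS total_due
        FROM account AS a
        JOIN customer AS c ON c.customer_id = a.customer_id
        GROUP BY c.household_id
    )
    SELECT h.household_id, cust.n_customers, cust.avg_fico,
           acct.n_accounts, acct.total_due
    FROM household AS h
    LEFT JOIN cust ON cust.household_id = h.household_id
    LEFT JOIN acct ON acct.household_id = h.household_id
    ORDER BY h.household_id
""").fetchall()

for row in household_rows:
    print(row)
```

The left joins keep households that have no customers or accounts in a given source, which shows up as NULL in the corresponding columns. Transaction data would need one more lookup step, from account to customer to household.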
Page 150
This affects the decision to use flash. The camera is programmed with
some set of assumptions about the user's intentions. When all goes well,
these assumptions are close enough to what the photographer had in mind
so that the resulting pictures are satisfactory. When the assumptions
are violated, the result is disappointment. A data mining tool or
vertical application with embedded expertise has the same potential to
please or disappoint depending on how well its assumptions match the
actual requirements of the business. And, like the automatic camera, it
can automate only the small portion of the data mining task that takes
place between the time when the shot is set up (a modeling data set is
assembled and ready to go) and the time when the model itself is built.
It can do little about what comes before and after. As we will
see in the next chapter, the actual building of models is only one stage
in a continuous cycle of activities that make up the data mining
process.
Page 151
fact that the codes are represented as numbers does not mean that the
values are ordered. Since some data mining tools treat all numbers as
true numeric, it is important to override the default when the numbers
represent categories. Neural networks and k-means clustering are
examples of algorithms that want all their inputs to be true numeric.
This poses a problem for categorical data. The naïve approach is to
assign a number to each value, but this introduces a spurious ordering
not present in the original data. The solution is to introduce a flag
variable for each value. Although this increases the number of input
Figure 6.10 Informing Enterprise Miner that a code, stored as a number,
is really categorical.
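The flag-variable approach can be sketched in a few lines of Python. The function name `flag_variables` and the sample codes are hypothetical; most data mining tools and libraries provide an equivalent transformation.

```python
def flag_variables(values):
    """Turn one categorical column into one 0/1 flag column per category."""
    categories = sorted(set(values))
    return [
        {f"is_{category}": int(value == category) for category in categories}
        for value in values
    ]


# Hypothetical plan codes stored as numbers; 3 is not "more" than 1.
codes = [3, 1, 2, 3]
flags = flag_variables(codes)
for code, row in zip(codes, flags):
    print(code, row)
```

Each row has exactly one flag set, so no ordering between the codes is implied. The cost is one input per distinct value, which matters when a column has many categories.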
Page 153
Here are five approaches to working with outliers.
• Do nothing. Some algorithms are robust in the presence of outliers.
  For instance, decision trees are mainly concerned with the rank of
  numeric variables, so numeric outliers do not have a big effect on
  them. On the other hand, a few outliers can seriously disrupt a neural
  network.
• Filter the rows containing them. This can be a bad idea, because it
  can introduce a sample bias into the data. However, in the case of the
  retail data, ignoring the cards with large purchases actually removes
  some "noncustomer" cards and might improve results.
• Ignore the column. This is extreme, but not as far-fetched as it may
  sound. The column can be replaced with reference information about the
  column. For instance, instead of ZIP codes, we might include some
  information about the ZIP code: number of customers, number of
  residents, average income, and so on. Some of this data is available
  from sources such as the United States Census Bureau (www.census.gov).
• Replace the outlying values. This is a very common approach. The
  replacement value might be "null" if your data mining tool handles
  null values effectively. It might be "0," or the average value, or a
  reasonable maximum/minimum value, or some other appropriate value. In
  some cases, it makes sense to predict a value, using the other input
  fields to infer a reasonable value that will not disturb the true
  distribution of the data.
• Bin values into equal height ranges. An example would be low, medium,
  and high for salaries. Binning values places them into ranges, so
  outliers fall into their appropriate range. Many data mining tools
  support binning directly in the tool.
As
just described, binning applies to true numerics. However, it can also
be applied to categoricals, when there are large numbers of them. There
are two approaches in this case. Consider the ZIP codes. What we really
might want is to use all the ZIP codes where there are more than, say,
100 customers, and then combine the rest of the ZIP codes into a
"too-small-to-care" ZIP code. That is, we are grouping together, into a
single category, the rarest ZIP codes while keeping information about
the larger ZIP codes. The other approach occurs when the categories
naturally fall into a hierarchy. For instance, ZIP codes in the United
States fall into counties, which are parts of states, which are parts of
regions. Instead of using the ZIP codes directly, we might group them
into states. Or, the ZIP code itself represents its location, so we
might use only the first three digits of the ZIP code. Hierarchies are
particularly important for certain types of data, such as product codes
and geography.
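Both kinds of binning can be sketched briefly. The function names, the three labels, the count threshold, and the sample data below are all assumptions for illustration, not the behavior of any particular tool.

```python
from collections import Counter


def equal_height_bins(values, labels=("low", "medium", "high")):
    """Assign each value a label so that each label gets about the same count.

    Bins are formed by rank, so an extreme value simply lands in the top bin.
    Tied values may be split across adjacent bins in this simple version.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [None] * len(values)
    for rank, i in enumerate(order):
        bins[i] = labels[rank * len(labels) // len(values)]
    return bins


def group_rare(categories, threshold, other="OTHER"):
    """Keep categories seen more than `threshold` times; pool the rest."""
    counts = Counter(categories)
    return [c if counts[c] > threshold else other for c in categories]


salaries = [30_000, 45_000, 52_000, 61_000, 75_000, 5_000_000]  # one outlier
print(equal_height_bins(salaries))

zips = ["07030", "07030", "07030", "10001", "94105"]
print(group_rare(zips, threshold=2))

# Hierarchy-based grouping: keep only the first three digits of the ZIP code.
print([z[:3] for z in zips])
```

The outlying salary ends up in the same "high" bin as the next-largest value, so its magnitude no longer dominates. With real data the threshold for rare ZIP codes would be something like the 100 customers mentioned in the text.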
Page 160
digits are a code for the manufacturer. The next five encode the
specific product, and are controlled by the manufacturer. For instance,
one of them often contains the "amount" of the product. The final digit
has no meaning. It is a check digit to verify the data. More information
about them is available through the Uniform Code Council
(www.uc-council.org). Scan codes outside North America have different
formats. More information about them is available through the standards
organization, EAN International (www.ean.be).
• Vehicle identification numbers are the 17-character codes inscribed on
  automobiles that describe the make, model, and year of the vehicle.
  The first character describes the country of origin. The second, the
  manufacturer. The third is the vehicle type, with characters 4-8 being
  specific features of the vehicle. The tenth is the model year; the
  eleventh is the assembly plant. The remaining six are sequential
  production numbers.
These are examples of the types of codes and strings that appear in
data. It is important to extract the component information from these
codes for data mining purposes.

Time Series

Some data occurs repeatedly at specific time intervals. Probably the
most common time series data are customer
billing records, which provide monthly snapshots of customer behavior.
Billing data has the additional advantage that it is usually quite
clean; numbers on bills typically are audited for correctness since they
feed into the company's balance sheet. When using time series data, the
series are usually normalized to the last date available for each
record. Figure 6.12 shows this normalization. Several customers leave
(or churn) at different points in time. To build a model to describe
these customers, we want to reorient their data relative to the date
they left. So, instead of a fixed month, the month is relative to each
customer's final month. This is the first step in working with time
series. However, this approach does remove information about particular
periods of time, such as seasonality. Adding derived variables back in
can recover the lost seasonality information. Examples of such variables
are:
• Proportion of a customer's yearly purchasing that occurs during the
  holiday season
• Average length of telephone calls made during weekends and during the
  week
• Total amount of interest paid in the first quarter of the year
Page 166
Figure 6.12 Reorient time series relative to the event in question. One
chart represents the data for five customers who churn in different
months; in the other, the data is reoriented relative to the month of
churn, with the axis showing the month before churn.

This information would otherwise be lost after reorienting the
time-series data. Adding new derived fields makes it possible to find
time-related patterns. Regular time series, of course, have many
features that are not related to seasonality. Within a single time
series, the following may be of interest:
• Total (or average) values in the time series, such as total amount
  spent over the period
• Rate of growth of the time series, such as the ratio of the first and
  the last values or ratios between successive values
• Number of values that exceed thresholds, such as the number of times
  that the customer used a credit card more than three times during the
  month
• Variance of the time series, which gives a measure of how wildly the
  values change
Page 167
First, what inside month (say, more than $10 in 80% of the months)
All these definitions might make sense to business users. However, to do
analysis we have to choose one of them. This is an example of why
communication between the data miners and the business users is very
important. All of these definitions have an ad hoc quality (and the
marketing group had historically made up definitions similar to these on
the fly). What about someone who pays very little interest, but does pay
interest every month? Why $10? Why 80 percent of the months? These
definitions are all arbitrary, often the result of one person's best
guess at the definition. It is worth investigating further. From the
customer perspective, what is a revolver? It is someone who makes only
the minimum payment every month. So far, so good. For comparing
customers, this definition is a bit tricky because the minimum payment
changes from month to month and from customer to customer. Figure 6.13
shows the actual and minimum payment made by three customers, all of
whom have a credit line of $2000. The revolver makes payments that are
very close to the minimum payment each month. The transactor makes
payments closer to the credit line, but these monthly charges vary more
widely, depending on the amount charged during the month. The
convenience user is somewhere in-between. Qualitatively, the shapes of
the curves help us humans understand the behavior. Manually looking at
the shapes is an inefficient way to categorize the behavior of several
million customers. Shape is a vague, qualitative notion. We want a
score. One way to create a score is by looking at the area between the
"minimum payment" curve and the "payment" curve. For our purposes, the
area is the sum of the differences between the payment and the minimum.
For the revolver, this sum is $112; for the convenience user, $559.10;
and for the transactor, a whopping $13,178.90. This score makes
intuitive sense. The lower the score, the more the customer looks like a
revolver. However, it does not work for two cardholders with different
credit lines. Consider an extreme case. If a cardholder has a credit
line of $100 and was a perfect transactor, then the score would be no
more than $1200.
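The score described above is a plain sum of monthly differences. The sketch below uses made-up payment histories, not the data behind Figure 6.13, and the name `payment_score` is hypothetical; it also reproduces the $1200 ceiling for a $100 credit line under the assumption of a zero minimum payment.

```python
def payment_score(payments, minimums):
    """Area between the payment curve and the minimum-payment curve."""
    return sum(p - m for p, m in zip(payments, minimums))


# Hypothetical revolver: pays just over the minimum every month.
revolver_score = payment_score([45.0] * 12, [40.0] * 12)

# Hypothetical perfect transactor with a $100 credit line: charges the full
# line and pays it off every month; minimum payment assumed to be zero.
small_line_score = payment_score([100.0] * 12, [0.0] * 12)

print(revolver_score, small_line_score)
```

The transactor with the small credit line scores 1200 at most, far below a transactor with a $2000 line, even though the behavior is the same. That is the comparison problem the text raises for cardholders with different credit lines.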
Page 172
The Ideal Convenience User

The measures in the previous section focused on the extremes of customer
behavior, revolvers and transactors. Convenience users were just assumed
to be somewhere in the
middle. Is there a way to develop a score that is maximized for the
ideal convenience user? The answer is "yes." First, we have to define
the ideal convenience user. This is someone who, twice a year, charges
up to his or her credit line and then pays off over four months. There
are few, if any, additional charges during the other 10 months of the
year. Table 6.3 illustrates the monthly balances for two convenience
users as a ratio of their credit lines. This table also shows a problem.
The curves describing the behavior of convenience users have no
relationship to each other in any given month. They are out of phase. In
fact, there is a fundamental difference between convenience users on the
one hand and transactors and revolvers on the other. Knowing that
someone is a transactor exactly describes their credit card behavior in
any given month: they will make many charges and pay off the balance.
Knowing someone is a convenience user is much more ambiguous. In any
given month, they may be paying nothing, everything, or making a partial
payment. Does this mean that it is not possible to develop a measure to
identify convenience users? Not at all. The solution is to sort the 12
months of data and to create the measure using the sorted monthly data.
Figure 6.15 illustrates this process. It shows the two convenience users
along with the profile of the ideal convenience user. Here, the data is
sorted, with the largest values occurring first. For the first
convenience user, month 1 refers to January. For the second, it refers
to April. Now, using the same idea of taking the area between the ideal
and the actual produces a score that measures how close a convenience
user is to the ideal. Notice that revolvers would have an outstanding
balance near the maximum for all months. They would have a high score,
indicating that they are far from the ideal convenience user. For
convenience users, the score is much smaller.

Table 6.3 Two Convenience Users and Their Pattern of Monthly Balances,
Expressed as a Percentage of Their Credit Line

Conv1  80%  60%  40%  20%  0%  0%  0%  60%  30%  15%  70%
Conv2  0%  0%  83%  ...
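The sorted-month score can be sketched as follows. The ideal profile used here is an assumption: two episodes a year in which the balance starts at the full credit line and steps down evenly over four months. The customer histories are invented for the example and are not the values from Table 6.3.

```python
# Assumed ideal: two pay-down episodes (100%, 75%, 50%, 25%), zero otherwise.
IDEAL = sorted([1.0, 0.75, 0.5, 0.25] * 2 + [0.0] * 4, reverse=True)


def convenience_score(balance_ratios, ideal=IDEAL):
    """Area between the sorted actual balances and the sorted ideal profile.

    balance_ratios: 12 monthly balances as fractions of the credit line.
    Sorting removes the phase, so only the shape of the year matters.
    """
    actual = sorted(balance_ratios, reverse=True)
    return sum(abs(a - i) for a, i in zip(actual, ideal))


# Hypothetical convenience user, and the same pattern shifted by three months.
user_a = [0.8, 0.6, 0.4, 0.2, 0.0, 0.0, 0.9, 0.7, 0.5, 0.3, 0.0, 0.0]
user_b = user_a[3:] + user_a[:3]

# Hypothetical revolver: balance near the credit line all year.
revolver = [0.95] * 12

print(round(convenience_score(user_a), 2))
print(round(convenience_score(user_b), 2))
print(round(convenience_score(revolver), 2))
```

The two out-of-phase users receive the same score because sorting discards the calendar position of each month, while the revolver scores far higher, in line with the behavior the text describes.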
Page 175
• Empty values occur when the fact that data is missing is relevant
  information. For instance, if a customer does not supply a telephone
  number, that could indicate that the customer does not want to be
  bothered with telephone calls, perhaps because the customer receives
  too many marketing calls or perhaps because the customer receives too
  many bill collection calls. This is a case where the missing value
  actually contains valuable information. In this case, it is valuable
  to add a new true-or-false field indicating whether the value is
  missing.
• Nonexistent values are caused by the nature of the problem. For
  instance, if a model wants to use 12 months of history to predict some
  future event, then recent customers will have lots of missing data,
  since they lack 12 months of history. In this case, the solution is
  often to reformulate the problem. Build a separate model for customers
  who have 12 months of history and another for those who do not.
• Incomplete data occurs when data sources cannot supply all the
  relevant data. This is a big issue with outside vendors who provide
  overlays for data. In this case, a noticeable number of customers will
  not match, and will have missing values. Or, when data is coming from
  multiple divisions, such as the demand deposit group, the credit card
  group, and the mortgage group of a bank, many customers do not have
  relationships with all the divisions. Sometimes it is helpful to build
  separate models. Sometimes it is helpful to replace the missing values
  with derived values, such as the total number of relationships with
  the bank, the total amount in deposit accounts, and so on.
• Uncollected data is missing because it is never collected. For
  instance, most telephone switches do not record when a customer turns
  off call waiting; and if they do record this as an event, they do not
  pass the data through to the billing system. The consequence is that
  there is no way to determine which customers turn off call waiting,
  without significant modifications to the operational system.
What can be done about missing data? Actually, the situation is quite
similar to handling outliers:
• Do nothing. Some algorithms (such as many decision tree
  implementations) are fairly robust when there are missing values. If
  there are few missing values, they may not materially affect the
  models.
• Filter the rows containing them. This can be a bad idea, because it
  might introduce sample bias into the data. In other words, if there is
  a systematic problem in the systems that produce the data, the rows
  may not be representative of the overall population.
Page 177
• Ignore the column. Focus on the complete data by ignoring columns with
  missing values. If only a few columns have spotty data, often it makes
  sense to ignore them, or to replace them with indicator flags that
  just say whether the data was present.
• Predict new values. Using decision trees or neural networks, it is
  possible to use other columns to predict the missing column. A less
  refined approach is to insert the average value or most common value
  into the column.
• Build separate models. Often, it is possible to eliminate much of the
  problem by segmenting the customers based on what data they have
  available. This approach is particularly useful for nonexistent
  missing values.
• Modify the operational systems. And wait until the data is collected,
  admittedly not the most practical short-term solution.
There is no simple cure-all for missing data. The curse of data mining
is that it looks at lots and lots of data and finds problems.

Fuzzy Definitions
Another class of dirty data occurs when the data does not have clean,
consistent definitions. Data mining algorithms assume that the values in
fields mean the same thing from one row to the next. That is, when the
same value appears in multiple rows it means the same thing each time.
Consider retail point-of-sale systems in North America that produce data
with a column for the UPC. Another field is the "amount" of the
transaction. Embedded in the UPC is a subfield that says what the amount
means. A "1" in the amount field could mean 1 apple, or 1 box of cereal,
1 six-pack of beer, or 1 pound of meat. In other words, the field means
different things in different records-a problem for data mining. Mergers
and acquisitions cause dirty data, too. Having to translate one set of
codes to match another introduces the possibility of error. Sometimes,
multiple codes appear in the data, so there may be several codes that
mean the same thing, such as "account open." But each has a different
nuance in its definition, so they are not exactly the same (if they
were, it would be more likely that they would have been combined). In
this situation, two records may have different codes, but they almost
always mean the same thing. Fuzzy definitions can also occur at the
record level. The definition of the population may be fuzzy. What
exactly is a current customer? What about customers who have been late
paying their bills? Or, how are separate sites of the same business
customer handled? Often, these and similar questions are answered on an
ad hoc basis. So, the available data may include
"customers" that are not really customers.
Page 178
the general patterns in the model set and of the specific patterns in
the training set. But the patterns specific to the training set are not
useful. By using the test set to refine the model, the model can
"unlearn" these extraneous patterns. The test set allows the model to
generalize better. Finally, using the evaluation set, we can
Page 183
Figure 7.2 Example of a good lift chart where performance on the test
and evaluation sets is similar. (Series: Training Set, Test Set,
Evaluation, Baseline; axis label: % Claims.)

Sometimes, the lift curves are close to the baseline, as in Figure 7.4.
In this case, the model does not provide much lift. Normally, this would
be an indication of a poor model. However, it is important to keep in
mind the theoretical maximum as well as the baseline when interpreting
these charts.

What is happening? In fact, 55 percent of the customers already
had call waiting-quite a high number. So, even if all the top scoring
customers had call waiting (the theoretically best model), then they
would account for only 18 percent (10
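The 18 percent figure is consistent with dividing the 10 percent of customers selected by the 55 percent who have the product. A minimal sketch of that ceiling follows; the function names are hypothetical, and reading the truncated sentence as "top 10 percent of scores" is an assumption.

```python
def max_cumulative_gain(fraction_selected, base_rate):
    """Largest share of all positives that the top-scored fraction can hold.

    A perfect model fills the selected fraction entirely with positives,
    so it captures fraction_selected / base_rate of them, capped at 100%.
    """
    return min(fraction_selected / base_rate, 1.0)


def max_lift(fraction_selected, base_rate):
    """Best possible lift: the maximum gain relative to random selection."""
    return max_cumulative_gain(fraction_selected, base_rate) / fraction_selected


# 55 percent of customers already have call waiting; look at the top decile.
print(round(max_cumulative_gain(0.10, 0.55), 3))
print(round(max_lift(0.10, 0.55), 2))
```

With so many positives in the population, even a perfect model can show a lift of less than 2 in the top decile, so a curve that hugs the baseline is not necessarily a sign of a poor model.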
Page 185
The cumulative gains chart for a decision tree consists of a series of
flat line segments. Each segment corresponds to one of the leaves of the
tree. The slope of the line corresponds to the lift at that leaf. The
length corresponds to the number of records that land there. The
steepest segments correspond to the leaves with the biggest lift
(highest density of the desired outcome).
[Figure: decision tree with branches AllPerils, Collision, and
Liability, then Fault (PolicyHolder, ThirdParty) and AccidentArea
(Rural, Urban), shown with its cumulative gains chart.]
Page 186
Figure 7.6 When a cumulative gains chart looks too good to be true, it
just might be. This is a cumulative gains chart used in the
cross-selling case study. One portion of the chart is suspicious because
it is entirely flat. A flat line in a cumulative gains chart can
indicate a strong correlation between the inputs into the model and the
output being predicted.

slope and gradually become more horizontal. In a cumulative gains chart,
the curve for the training set
should be above the curve for the test set, and both should be above the
evaluation set (although they should be close). If these curves are
inverted, then it is worth investigating further. When the curves look
very different from these rules, it is worth revisiting the model. Model
Stability Building a model is like test-driving a car. The car should
handle well not only on a sunny, clear day when it is brand-new and
driven for the first time, but also in all kinds of weather and road
conditions. We want the performance of the car to be stable over time
and in different situations. The same principle applies to predictive
models. A stable model is one that behaves similarly when applied to
different sets of data. Not only do we want
score set. The model set and the score set are likely to be different
for several reasons. First, the score set is usually more recent than
the model set. And, more recent data is likely to be a bit different.
There have been additional marketing campaigns, customer contacts, new
products, changes in the marketplace, and macro-economic changes. It
sounds daunting; later in this chapter we will discuss how to enable
models to slide forward in time. Other problems arise because the score
set and the model set are samples from the general population. And the
samples might be different. For instance, if we are trying to predict
the propensity of customers to purchase a product, we will eliminate
anyone who already has the product from the score set. This can
introduce skew. Or, we might decide to score a subset of the customers
for business reasons, such as those from California or those who have a
second telephone line. Once again, the model set and the score set may
be qualitatively different. What does this mean? It means that it might
be hard to make a stable model because the data we used to build the
model is different from the data being scored. Statistics provides
detailed explanations for understanding sampling and modeling. Without
diving into the mathematics, we can say that data mining, fortunately,
uses large amounts of data (tens or hundreds of thousands of rows, for
instance). And having lots of data reduces the impact of these issues.
The evaluation set must be completely different from the training
and test sets. Otherwise, the evaluation of the model's performance is
cheating. The model should do well on the data it was trained on. In
fact, one way to "cheat" when showing the results of models is to show
the results on the test set instead of the evaluation set; this makes
the results look better since the data mining techniques use the test
set to optimize the model. How should the model set be distributed among
the training, test, and evaluation sets? This is a good question. If you
really want to know the answer, try building models with different
proportions. We have found that a split of 60-30-10 percent works well
in practice (although at least one popular tool has a default split of
40-30-30 percent).

How the Size of the Model Set Affects Results

In general, bigger is better. Models trained on larger model sets tend
to do a better job of prediction because the model set has more examples
from which to learn. When working with model sets of different sizes,
there are some things to keep in mind. In general, you want at least
several thousand records in the model set, and model sets with hundreds
of thousands of records are not uncommon.

Why Bigger Is Not Always Better

There are several reasons why a larger model set may not be the
best choice. First, there is usually a fixed amount of time for doing
the modeling. During this period of time, the model set has to be
constructed, models built and tested, and then the results deployed.
Building models on larger model sets takes more time. This time could
sometimes be better spent investigating other parameters, obtaining
different types of data, or creating new derived variables. Any of
these could have a larger effect on the performance of the model than
the number of records. So when there is a trade-off with time, sometimes
it is better to work with a smaller model set, at least initially.
Similarly, the tool being used may limit the size of the model set. For
performance reasons, some tools try to keep all the data in memory. Even
if they do not, the extensive file processing required for really large
model sets may make them unattractive for performance reasons.
Sometimes, the right approach is to experiment with a sample of the data
and then build the final model on a larger model set. There are even
some technical reasons why bigger model sets may not produce better
results. Often, what we are trying to predict is a rare event, such as
Reducing the size of the model set to 100,000 records, where 90,000
are "light" and 10,000 are "dark," makes it more likely that data mining
algorithms will produce a good model. Because the original data started
out with only 10,000 "dark" examples, this oversampled data set will
have all the "dark" ones and just a small fraction (90,000/990,000 or
9.09 percent) of the "light" ones. Oversampling has its limits. Because
there are only so many "dark" examples, it is not possible, for
instance, to create a model set that has 500,000 examples of which 10
percent are "dark." To do this, there would have to be 50,000 "dark"
records, but there are only 10,000 in the original data. This is a
typical occurrence. Oversampling increases the proportion of the less
frequent outcome or outcomes. What we are really doing is taking all of
the examples of the rarer outcome and a mix of just enough of the rest
to get an appropriate density. When there are more than two outcomes,
oversampling works in the same way. All the records with a given outcome
are called a stratum of the data. Oversampling works by taking different
proportions from each stratum. For this reason, statisticians call
oversampling stratified random sampling. This name is quite informative.
It means that the sampling is done separately for each possible outcome
(each stratum of the data) to achieve the right balance in the model
set.

When to Use Oversampling

Often, the outcome that we are trying to model just does not occur very
often in the data. There are several reasons for this:
• The outcome may be quite rare, such as tracking breakdowns in
  machinery
• The outcome may have to be validated before it can be used for
  modeling, such as cases of fraud
• The outcome may be for a short amount of time, such as customer churn
  or charge-offs in a single month
Regardless of the cause, though, modeling techniques do not do a good
job when almost all the data has a single outcome. If 99 percent of the
data is "no fraud" and 1 percent is "fraud," then it is very easy to
devise a model that is accurate 99 percent of the time: just predict
that everything is "no fraud." Although this is accurate, it is not
valuable because we are really interested in the "fraud" cases, and the
model predicts none of them. The solution is oversampling.
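The recipe described above (take all of the rarer outcome and just
enough of the rest to reach the desired density) can be sketched as
follows. This is a minimal illustration, not a description of any
particular tool: the function name oversample, the record layout, and
the field name "shade" are assumptions made for the example.

```python
import random
from typing import Dict, List, Sequence


def oversample(records: Sequence[Dict], outcome_key: str, rare_value,
               target_density: float, seed: int = 0) -> List[Dict]:
    """Keep every rare record plus a random sample of the remaining records."""
    if not 0.0 < target_density < 1.0:
        raise ValueError("target_density must be strictly between 0 and 1")
    rng = random.Random(seed)
    rare = [r for r in records if r[outcome_key] == rare_value]
    common = [r for r in records if r[outcome_key] != rare_value]
    # Number of common records needed so that rare / (rare + common)
    # equals the target density.
    wanted = round(len(rare) * (1.0 - target_density) / target_density)
    # There may be fewer common records than requested; take what exists.
    sample = rare + rng.sample(common, min(wanted, len(common)))
    rng.shuffle(sample)
    return sample


# Scaled-down version of the example in the text: 1 percent "dark" records.
population = [{"id": i, "shade": "dark" if i < 100 else "light"}
              for i in range(10_000)]
model_set = oversample(population, "shade", "dark", target_density=0.10)
```

Because every rare record is always kept, the size of the result is
bounded by the number of rare records divided by the target density,
which is the limit on oversampling noted earlier.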
to choose the top scoring records. Sometimes, this is not possible, and
we really want a threshold value. It is possible to estimate, for a
particular score threshold, the number of records that will have that
score or greater. Of course, the threshold value is just like the scores
on the records and has to be adjusted for the original density.
Oversampling affects the number of records as well. Estimating the
number of records is a bit tricky. Just because a threshold value
corresponds to 1 percent of the oversampled model set does not mean that
the adjusted threshold will correspond to 1 percent of the original
data. In fact, it won't. In one example, the score threshold for the top
1 percent, based on the oversampled model set, corresponded to just 0.07
percent of the score set. Even the threshold at 10 percent yielded less
than 1 percent of the score set. The sidebar explains this as well.

Modeling Time-Dependent Data

Time frames play a critical role in building effective predictive
models. The time frame for predictive modeling has three important
components, as shown in Figure 7.9:

1. The past consists of what has already happened and data that has
already been collected and processed. At all times, data about the past
is available (assuming that it has not been deleted). The past actually
mimics the overall structure of the data. The distant past is used on
the input side of the data; the recent past determines the outputs; a
period of latency represents the present.

2. The present is the time period when the model is being built. Data
about the present is not available because it is still being generated
by operational systems.

[Sidebar] Calculating Predicted Lift when Using Oversampling

Often, when we build predictive models, the purpose is to assign a score
to new records and then to choose records whose score exceeds a certain
threshold. Often we choose the threshold either to optimize a function
(such as profit) or just to get a certain number of records. It is
tempting to believe the lift values and scores generated on the
oversampled model set. However, these are only useful at that
oversampling rate and not on the original data density. How do we
convert lift values on the model set into lift values [...]

To calculate the predicted density, we simply multiply the number of
"rare" instances by 1 and the number of non-rare instances by OSR, the
oversampling rate:

density (original set) = (# of rares / OSR) / (# of rares / OSR + # of not-rares)

(equivalently, # of rares / (# of rares + OSR x # of not-rares)). To
estimate the [...]

[Figure 7.8 annotations, partially recoverable: "A data mining algorithm
assigns its top score to 40% of the oversampled model set, giving a lift
of 2." "[...] to 13% of the data and a lift of 3." "[...] from Angoss
KnowledgeStudio"]
Figure 7.8 In an oversampled data set, some of the records represent
multiple other records.
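The adjustment from a score on the oversampled model set back to the
original density can be sketched as below. The function name
to_original_density is hypothetical, and the sketch assumes that the
oversampling rate (OSR) is the factor by which rare records are
over-represented relative to the other records. Under that assumption it
reproduces the figure quoted in this chapter: with an oversampling rate
of 10, a score of 0.5 corresponds to about 0.091.

```python
def to_original_density(score: float, osr: float) -> float:
    """Convert a density or score from the oversampled set to the original data.

    Assumes osr is the factor by which rare records are over-represented
    relative to non-rare records in the oversampled model set.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if osr <= 0:
        raise ValueError("osr must be positive")
    rare = score / osr        # each rare record stands for 1/osr of itself
    not_rare = 1.0 - score    # non-rare records keep their full weight
    return rare / (rare + not_rare)


adjusted = to_original_density(0.5, 10)   # about 0.091
```

A score threshold chosen on the oversampled model set would be passed
through the same conversion before being compared with scores on data at
the original density.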
In this data, the distant past consists of all the data available
before last year's marketing campaign was launched. The recent past
consists of the data after the campaign, including any responses. The
present is the period when we are building the model for this year's
campaign. Finally, the future represents the responses to this year's
campaign, and these have not yet occurred. All the input data must be
available prior to any of the data used to determine the outputs. In
the relatively simple example just given, it is not difficult to
separate out the inputs and the outputs. Data from before the initial
campaign date is used for the input; data from after that date is used
for the output. Also, the types of data are quite different, and they
are likely to come from different sources. The only output data used is
the response variable. Any other older data can be used as inputs
(including responses to even earlier campaigns). All data used as inputs
into a predictive model must occur earlier in time than any of the data
used to create the outputs. Violating this rule creates models that look
very, very good on the model set and yet fail to predict the future
well. Separating inputs and outputs can be tricky in other
circumstances, especially when the inputs and the outputs are coming
from the same data source. As with other data mining efforts, predictive
modeling usually requires adding in derived variables. These derived
variables must follow the same law: Any variables used in the derivation
must occur earlier in time than the output variables. Figure 7.10
illustrates an oversight that once happened to one of the authors. In
this case, we were working with the credit card group of a large
regional bank to build a model to predict customer behavior. We had 18
months of data at hand and were interested in predicting customer
behavior segments:
• Revolvers are cardholders who maintain large outstanding balances and
  pay lots of interest.
• Transactors are cardholders who pay off their bills every month, hence
  paying no interest.
• Convenience users are cardholders who charge a lot and then pay off
  the balance over several months.
We determined customer behavior
by calculating the behavior on the last six months of the data. This
left 12 months of data for the inputs into the model. Alas, one derived
variable was the total ratio of outstanding balances to the total credit
line, for the most recent 12 months in the data. This variable violated
the law that inputs come strictly before outputs. What effect did this
derived variable have? With the variable in place, the model achieved an
accuracy of over 90 percent, an exceptionally good result, considering
that the model was predicting three outcomes (so the baseline accuracy
would be about 33 percent). Without the variable, the model accuracy
fell to under 70 percent. This result, while still good, illustrates the
dramatic influence of this variable.

[Figure: timeline of months in the past, from 18 down to 1, marking the
model inputs, the model output, and the derived variable. Annotations:
"The intention was to have 12 months of history predict six months in
the future. But one derived variable, which used data from the past
twelve months, crept in and skewed the results." "This is a mistake!
Derived variables should never use data from months used to determine
the output."]
Figure 7.10 Never use the output months as input, so be careful of how
derived variables are defined.

How did we find this problem? In this case, the model results looked
suspiciously good, good enough to make us stare at all the input
variables to verify their correctness. One way to prevent this from
occurring is to keep track of the time frame used for every derived
variable. Using a careful naming convention for the variables, where the
time frame for the variables is included in the name, would have
prevented this problem.
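Tracking the time frame of every variable also makes the check easy to
automate. The sketch below is one possible way to do it, not the
authors' procedure: the Window class, the function leaking_variables,
and the variable names are all hypothetical. Months are counted
backwards from the end of the data, so a larger number means further in
the past.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Window:
    name: str
    earliest: int  # oldest month used, counted back from the end of the data
    latest: int    # most recent month used (smaller number = more recent)


def leaking_variables(inputs: List[Window], output: Window) -> List[str]:
    """Names of inputs that use any month at or after the start of the output window."""
    # An input is safe only if its most recent month is strictly older
    # than the oldest month used to determine the output.
    return [w.name for w in inputs if w.latest <= output.earliest]


# The credit card example: 18 months of data, behavior determined from
# the last six months (months 6 through 1).
output_window = Window("behavior_segment", earliest=6, latest=1)
candidate_inputs = [
    Window("monthly_balance", earliest=18, latest=7),
    Window("balance_to_credit_line_ratio", earliest=12, latest=1),
]
leaks = leaking_variables(candidate_inputs, output_window)
```

Here leaks contains only the ratio computed over the most recent 12
months, which is the variable that caused the problem in the story.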
Latency: Taking Model Deployment into Account

It takes
time to build models. It takes time to score models when they are being
built. These statements are not just truisms; they must be taken into
account when building models. In more technical jargon, we say that a
model has latency, because it takes time to run. Of course, what is
important here is not just the time for running the model itself. This
is usually insignificant on today's computers. Much more important is
the time required for getting the data, transforming the data into
inputs for the model, and deploying the output. In fact, although it may
take only a fraction of a second to assign a model score to a given
record, it can take weeks or longer to complete the entire operation,
from collecting the data from the input systems to deploying the
results. An example will help to clarify why this is important. Let us
say that we are in the business of selling something and we want to make
a special offer for customers who are likely to want to make a purchase
during August. Our goal, of course, is to make customers buy more than
they would have without the offer, and we have a database that contains
all customer purchases throughout the year, along with a bit of
demographic and marketing data about each customer. Now, it is June and
we are preparing to build the model. We have the data for last September
through May (in the real world, we would want data from at least one
year ago as well). A typical approach would be to try to predict who
made a purchase in May, using all the data from September through April,
as in Figure 7.11. This sounds reasonable, but it is not. Why not? Let
us see what happens in July, when we want to create the mailing list. To
predict who is going to make a purchase in August, we need the data from
December through July, the inputs into the model that makes a prediction
for August. July data is not available now. It is still July, and not
all the data has been collected yet. So, let's wait until the first of
August. Is July data available yet? Not likely; at most companies, it
might take one or two weeks to get such data through all the relevant
systems. So, in about mid-August, we have the July data. Then it takes a
day or two or three to process the data and apply the model. Finally, in
the middle of the month, the mailing list is ready. And if we are lucky,
it will get there before the end of the month. By then, we can hardly
affect August sales. It is too late. This is a bad thing. The problem
goes all the way back to the beginning. When designing the model, we did
not take latency into account. Instead of using September-April data to
predict May, we should have used September-March
[Figure residue from another page: a histogram of the month of claim for
a set of insurance claims, an example of a typically uniform
distribution (the number of claims is roughly the same for each month);
a second histogram shows the number of telephone calls made.]
[Figure: model time chart with a month of latency between the inputs and
the model output. Annotation: "No problems scoring the model because the
inputs are now available in mid-July."]
Figure 7.12 Including a month of latency takes model deployment and
scoring into account.

steps
in the process. First, she may decide that she's going to switch. Then
she might research other offers, and, at some later point in time, she
informs the company, most likely not until after she has already signed
up with a competitor. The company assigns a churn date sometime after
that. Presumably, she stopped using her cell phone when she first
decided to churn; that was the point in time when the decision was made.
However, her actual churn date is later. Invariably, the last month of
usage for churning customers shows a large rate of decline; however, it
is not useful because the customers have already decided to leave. To
prevent churn, we want to reach them before they make the decision, not
after. During the final month, churners will show a large decline in
usage; this is the time period when they stopped using the phone but
before the churn was formally recorded. If we look for customers with
big usage declines during the previous month, then the churn prevention
campaign may be wasted on customers who are already committed to
leaving. Including a month of latency helps avoid this problem.
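The effect of latency on the input window can be made concrete with a
small sketch. The helper input_months is hypothetical, and it works with
month names only (no years), which is enough for the purchase example in
this section.

```python
from typing import List

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def input_months(target: str, n_inputs: int, latency: int = 1) -> List[str]:
    """Months of input data for predicting `target`, skipping `latency` months."""
    target_index = MONTHS.index(target)
    # The most recent input month sits `latency` months before the target.
    last_input = target_index - latency - 1
    # Walk backwards from the oldest input month to the most recent one.
    return [MONTHS[(last_input - k) % 12] for k in range(n_inputs - 1, -1, -1)]


# Building the model: predict May from September through March,
# leaving April as the month of latency.
build_inputs = input_months("May", n_inputs=7, latency=1)

# Scoring for August uses December through June, which is available in July.
score_inputs = input_months("Aug", n_inputs=7, latency=1)

# Without latency, scoring for August would need July data.
no_latency_inputs = input_months("Aug", n_inputs=8, latency=0)
```

The same shift applies to the churn case: with one month of latency, the
month of steep usage decline just before the recorded churn date falls
in the gap rather than in the inputs.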
For example, assume that the units are in months, and the modeling
month is given a value of 0. No variable should end in 0, since data
from the modeling month is not valid input data. The data from the
previous month would have 01 appended, from two months before, 02, and
so on. Because we are counting backwards, it is much harder to
inadvertently include future data. For derived variables, it is
necessary to include the earliest and latest time units in the name. So
the names might be like "total-03-01" for the total balances for the
three months just before modeling. Notice another effect as well: the
field names are not tied to a specific month, so it is more natural to
include overlapping months of data in the model set. For instance, the
model set might have three components of data:
Jan, Feb, Mar -> May
Feb, Mar, Apr -> Jun
Mar, Apr, May -> Jul
The records for these three would all look the same. The field
balance-01 would refer to March data in the first case, to April data in
the second, and to May data in the third.
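The naming convention described above is simple enough to generate and
check mechanically. The following is a sketch under that convention; the
helper names field_name, derived_name, and calendar_month are
hypothetical.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def field_name(base: str, months_back: int) -> str:
    """Name for a monthly field, e.g. balance-01 for the month before modeling."""
    if months_back < 1:
        # Month 0 is the modeling month, which is never valid input data.
        raise ValueError("input fields must refer to months before the modeling month")
    return f"{base}-{months_back:02d}"


def derived_name(base: str, earliest: int, latest: int) -> str:
    """Name for a derived field covering months `earliest` back through `latest` back."""
    if latest < 1 or earliest < latest:
        raise ValueError("expected earliest >= latest >= 1")
    return f"{base}-{earliest:02d}-{latest:02d}"


def calendar_month(modeling_month: str, months_back: int) -> str:
    """Calendar month that a relative field refers to for a given modeling month."""
    return MONTHS[(MONTHS.index(modeling_month) - months_back) % 12]


# The three overlapping components: modeling months April, May, and June.
balance_01_months = [calendar_month(m, 1) for m in ("Apr", "May", "Jun")]
```

Because the names carry relative offsets rather than calendar months,
the three components can be stacked into one model set with identical
columns.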
Using Multiple Models

Predictive models are very useful. In fact, if one
is useful, then perhaps several are even better. The more the merrier,
right? Well, as it turns out, this is often the case. There are many
different ways to combine multiple models. Each model makes its own
prediction. All of these predictions are then compared. When all the
models agree, the confidence is usually much higher. The resulting
model, combining many smaller models, is sometimes called an ensemble
model.
3. (release) Dispense with the old model when you believe that the
new one is performing better. This is a good example of having two
models vote. During the alpha stage, the two models vote, and when they
disagree, the results from the old model are used. During the beta
stage, the two models vote, and when they disagree, the results from the
new model are used. Eventually, the newer model replaces the old model.
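The tie-breaking rule for the alpha, beta, and release stages can be
written down directly. This is a minimal sketch; the function name
staged_prediction and the stage labels as strings are assumptions for
the example.

```python
def staged_prediction(old_prediction, new_prediction, stage: str):
    """Combine the old and new model's predictions according to the rollout stage."""
    if stage == "release":
        # The old model has been retired; only the new model counts.
        return new_prediction
    if old_prediction == new_prediction:
        # Both models agree, so there is nothing to break.
        return old_prediction
    if stage == "alpha":
        # Disagreement during alpha: trust the old model.
        return old_prediction
    if stage == "beta":
        # Disagreement during beta: trust the new model.
        return new_prediction
    raise ValueError(f"unknown stage: {stage!r}")
```

Logging the disagreements during the alpha and beta stages gives a
direct view of the cases where the two models differ.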
Of course, this is the simplest scenario. Perhaps there are several
models contending to replace the old one. Then, during the alpha phase,
all the models would vote, with the vote for the older model having a
greater weight. This may weed out a few of the possible models. During
the beta phase, these models would vote while still being able to
compare them to newer models. Finally, the old model can be retired.
Often, replacing models does not need to be such a cumbersome process
because the newer models are known to be better once they have been
tested. This example, though, shows how multiple models vote and decide
tie-breakers.

Multiple Outcomes: Best Next Offer Models

Multiple model voting is quite useful for determining the best offer to
make to a customer. In Chapter 10, there is a case study about an online
bank that wants to put up a banner ad offering one of a few dozen
products: a mortgage, a certificate of deposit, a home equity line of
credit, a brokerage account, and so on. Which banner should they put up
for any
given customer? This is a good example of a problem where multiple model
voting can provide the solution. What is a good approach for building a
predictive model for this problem? Well, one approach is to train a
single model with multiple outputs. The single model would distinguish
between all the different types of outputs in one swoop. This works okay
when there are few choices (fewer than four, say). But it does not work
well at all when there are dozens of possibilities. Model voting
provides a solution. The approach is to build a separate model
indicating the propensity of a customer to purchase any given product.
This requires building a separate model for each different product. All
the product propensities are then voted on to choose the product for the
best next offer. There are many ways to combine the results. The
simplest is to choose the product with the highest propensity. It may
also be desirable to include profitability information, business rules,
and other information during voting. In fact, combining the propensities
is sometimes called conflict resolution, highlighting the fact that
different business units have different interests in the result.
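The simplest form of this voting, with an optional weighting by value,
can be sketched as follows. The function name best_next_offer, the
product names, and all the numbers are hypothetical; the sketch also
assumes the propensities from the separate models are on a comparable
scale (for instance, already adjusted for any oversampling).

```python
from typing import Mapping, Optional


def best_next_offer(propensities: Mapping[str, float],
                    value: Optional[Mapping[str, float]] = None) -> str:
    """Pick the product with the highest propensity, optionally weighted by value."""
    def weight(product: str) -> float:
        # Without value information, the weight is the propensity itself.
        return propensities[product] * (value[product] if value else 1.0)
    return max(propensities, key=weight)


# One propensity per product, each produced by its own model.
scores = {"mortgage": 0.02, "certificate_of_deposit": 0.10, "brokerage": 0.06}
profit = {"mortgage": 2000.0, "certificate_of_deposit": 50.0, "brokerage": 300.0}

by_propensity = best_next_offer(scores)
by_expected_profit = best_next_offer(scores, profit)
```

Business rules (such as never offering a product the customer already
holds) would fit naturally as a filter on the propensities before the
vote.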
even when the scores are converted into confidence levels. With an
oversampling rate of 10, a score of 0.5 is really 0.091 on the original
data.

Segmenting the Input

Another common way of combining models is to segment the inputs into the
model. For any given class of input, exactly one model is being built,
focused on that particular class. Notice that this is different from
multiple model voting. In voting, all models are applied to all the
inputs. In segmented input models, only one model is used for any given
input. One of the consequences of segmenting the input is that the
resulting models are built on smaller model sets. Regardless of the
modeling technique, smaller model sets increase the risk of overfitting
the data, so it is worth rechecking all the parameters to verify that
they make sense on smaller model sets. In particular, check the size of
the hidden layer in a neural network and the minimum leaf size for a
decision tree.

When Is Segmenting Useful?

There are two primary reasons for segmenting the input into models. The
first is to handle missing data. Data may be available for some records
but not for all of them:
• Outside data, such as demographics, is available only for the subset
  of the customer base that "matches"
• Historical data, such as billing information, is available only for
  customers who have been around for a sufficiently long time
• More detailed data, such as credit card or ATM transactions, is
  available only for customers who have those products
• Customers acquired through mergers and acquisitions may have different
  sets of data available
• And so on
Instead of working with the missing data in a single model, it is often
more efficient and effective to build different models for different
groups of customers. The second reason is to incorporate business
information into the modeling process. For instance, it may be known
that platinum cardholders behave differently from gold cardholders.
Instead of having a data mining technique figure this out, give it the
hint by building separate models for platinum and gold cardholders.
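The difference from voting is that each record is routed to exactly one
model. A minimal sketch of that routing is shown below; the function
make_segmented_model, the field names, and the two stand-in scoring
functions are hypothetical and take the place of models built separately
on each segment.

```python
from typing import Callable, Dict

Record = Dict[str, object]
Model = Callable[[Record], float]


def make_segmented_model(segment_of: Callable[[Record], str],
                         models: Dict[str, Model]) -> Model:
    """Build a scorer that applies exactly one segment model to each record."""
    def score(record: Record) -> float:
        segment = segment_of(record)
        if segment not in models:
            raise KeyError(f"no model was built for segment {segment!r}")
        return models[segment](record)
    return score


# Stand-ins for models trained separately on each cardholder segment.
def platinum_model(record: Record) -> float:
    return 0.30 if record["balance"] > 5000 else 0.10


def gold_model(record: Record) -> float:
    return 0.20 if record["balance"] > 1000 else 0.05


score = make_segmented_model(lambda r: str(r["card_type"]),
                             {"platinum": platinum_model, "gold": gold_model})

platinum_score = score({"card_type": "platinum", "balance": 8000})
gold_score = score({"card_type": "gold", "balance": 500})
```

Raising an error for an unknown segment is a design choice for the
sketch; a fallback model built on all the data would be another option.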
Segmenting Using Clustering

One way to segment the data is by using an automatic clustering
algorithm. The clustering algorithm is used to assign each record to a
cluster, and then each cluster is treated as a different segment. This
can be effective when the data falls into different clusters that behave
differently. In this case, a better model may be built for each cluster
than for all the data, because training the model does not have to
relearn the clustering. This method,
though, does not always produce better results, because the models are
being built on smaller model sets. If such obvious clusters appear in
the data, business users often know about them. In fact, information
from business users often suggests more pertinent segmentation. Less
obvious clusters may not have predictive power. When time is available,
it is often worth experimenting with clustering to see if it produces
superior results.

How to Do It

There are two ways to segment the input:
either inside the tool or outside it. To a greater or lesser extent, all
data mining tools support this functionality inside them. In some, such
as SPSS Clementine, it is easy and a natural part of the user interface.
It supports splitting the data and building different models on it. SAS
Enterprise Miner supports the functionality, but it can take some work
to find the "filter outliers" node and use it correctly. And then you
have to use it multiple times to partition the data for each segment.
One of the advantages to using the tool to segment data comes when the
model is being deployed. The more work done in the tool, the easier it
is to use that tool's methods for deployment. The alternative is to
segment the data outside the tool by building separate input files for
each segment. This may be desirable even when the tool supports
segmentation, because it allows the model developer to experiment with
each segment independently, or even for different people to build models
for different segments at the same time. Remember, though, that if the
models are going to be combined, then the scores must be adjusted to
take oversampling into account.

Other Reasons to Combine Models

Multiple model voting and segmented input models are the most common
ways of combining models. However, there are other ways that are worth
mentioning, because they may prove useful at some point.
The use of cascading models based on error results occurs in the
nonautomated world as well. Medical exams are often structured this way.
Less expensive, noninvasive tests are first run to find the likelihood
of some disease, such as AIDS or cancer. If the test is positive, then
more invasive and expensive tests follow, effectively and efficiently
screening a large population for the disease at a lower cost.

Enhancing the Data

The last way of combining models that is discussed here is simply to use
models to add new features to the input, sort of as an enhancement of
derived variables. For instance, a cluster field can be
added based on the results of clustering. This field does not have to be
used for segmentation; it can be used at the whim of the modeling
algorithm, if the algorithm finds it useful. Actually, there are some
common applications of using other models to enhance data. This is
because outside service bureaus often add (at a price) propensity scores
to the data. Probably the best known is the FICO score available for
credit card processing. This score, or similar scores, sometimes find
their way into in-house models that do a better job than the purchased
scores. Another case where enhancing the data can be useful is to
replace missing values. In some cases, missing values can be predicted
from the data that is available.

Experiment!

The good news is that there
is no one right way to build a predictive model. The bad news is the
same: there is no simple recipe for building the best model. So, the
task of model building needs to be faced as a learning process. Even
when building models for the nth time, there is still a lot to learn.
The best way to figure out what to do is to try out different things and
see how they work on your data in your environment. Of course, this
advice is not always practical because of the constraints facing any
real-world problem. This section discusses some of the different things
that you might want to test.

The Model Set

The model set is a good place
to start. Once the data fields have been chosen, there are two
interesting features of the model set: size and density (controlled by
the oversampling rate). Generally, the more data you have the better, so
you want to build models with tens or hundreds of thousands of records.
At the
Table 7.2 Comparison of Predicted Lift for Six Different Types of
Decision Tree Models
[Column headings were not recoverable; "-" marks an empty cell and
"[?]" an illegible digit.]
20K    10.3%   12.9   4.74    4.74 is followed by: 3.91   4.40   4.46   4.41   4.18
20K    17.9%   22.4   4.45    4.68   4.93   4.29   4.63   -
50K     9.9%   12.4   3.71    3.51   3.66   3.75   3.65   3.65
50K    17.8%   22.3   4.37    4.32   4.20   4.21   4.39   4.16
50K    30.5%   38.1   5.60    5.48   5.17   5.39   5.38   5.22
100K   10.0%   12.5   3.[?]1  3.53   -      3.46   3.55   3.54
100K   17.9%   22.4   4.40    4.36   4.29   4.30   4.30   4.37

month? For three
months? For the calendar year? Similarly, how long does a best next
offer last? For a specific amount of time? Until the next purchase?
Models can sometimes be improved by increasing the time frame for the
outcome. This is especially true for infrequent occurrences where there
may only be relatively few records in a short time span. For instance,
few customers charge off in any given month; but over the course of six
months, there are several times as many, providing a richer model set.

Lessons Learned

The main theme of this chapter has been building stable models.
Stability is important, because we want to have confidence that
predictions made by the model will hold on unseen data. That is the
point of making predictions, after all. Important points are the
following:
• The model set consists of three components: the training set, the test
  set, and the evaluation set. It is important to understand model
  performance on all these sets of data.
• The density of the model set has a big influence on the model.
  Oversampling lets you control the density. At the same time,
  understanding the effect of oversampling on the results is important.
• The model time chart allows us to see what time frames of data are
  used for inputs and for outputs. All inputs must occur before the
  outputs, and it is usually wise to leave a latent period as well.
  Using multiple time windows in the model set helps ensure that models
  are more stable and can shift in time.
• Naming the fields by appending their time frame can help avoid time
  frame mistakes.
• Combining models is an important part of data mining and the basis for
  models such as cross-sell models. When combining models, it is
  important to understand the effects of oversampling.
Finally, there is no one solution that works for all problems.
Experimenting with the model set, model parameters, and the time frame
can lead to better and more stable models.
Taking Control: Setting Up a Data Mining Environment

Like the paths
to Buddhist enlightenment, there is no single path to introducing data
mining into a company. In the end, we want to create an environment
conducive to data mining, both technically and business-wise. Success is
much less dependent on the choice of particular tools and algorithms
than it is on creating the right environment. Data mining is only as
powerful as the organization that implements it, and every organization
is different. This chapter tells four stories about four different data
mining environments in four different companies. The first three focus
on setting up a data mining environment for the first time. Although
each company is new to data mining, each is starting from a different
place. This is quite typical, since no two business environments are
identical; they often differ in strategy, customer focus, data
warehousing infrastructure, market positioning, analytic skills, and so
on. The last story goes into more depth and demonstrates how technology
can catalyze data mining when used well. This is the story of Eddie
Bauer, Inc., a major multichannel retailer. Eddie Bauer is not new to
data mining, especially on the catalog side of the business. With a keen
understanding of how data mining fits into their business, they have
built an environment, using Tessera's Rapid Modeling Environment, to
facilitate data mining and to automate many of the supporting functions.
(At the time of writing this chapter, Tessera has announced that iXL, a
leader in internet services, has acquired them.)
business units. The pilot needs to be part of the bigger effort to build
an internal competency in data mining.
Case 1: Building Up a Core Competency Internally
This case study looks at a property and casualty insurance
company that is building up a data mining practice. In this case, we
were working initially with the car insurance side of the business,
although the data mining group is intended to support multiple lines.
When we met them, they were in the process of selecting a vendor by
comparing proposals from about twenty different vendors.
5. They performed a pilot project with the vendor. After the pilot,
they will move on with installation, training, and other projects. The
following sections dive into this process in a bit more detail.
Choosing the Team
The first step was to identify the individuals in the
organization responsible for data mining. Although only three people
would initially be using the tool, the team responsible for getting data
mining started was a bit larger. It included the managers of the users,
the IT group responsible for providing the data and hardware, and
several marketing professionals who provide the business expertise for
initial projects. This team was responsible for advocating data mining
and customer relationship management throughout the company. The team's
focus was on building a competency: identifying likely requirements for
data mining, defining a pilot project, learning about the vendors, and
choosing a vendor with a good fit. The group was also responsible for
preparing a budget, assessing the impact of the pilot project, and
selling data mining throughout the company. The team recognized that
there was no one in the organization who had the right background for
leading the data mining effort. However, they were fortunate enough to
have an individual with strong analytic and business skills who had been
working as a consultant on special analysis projects. They hired this
individual during this initial process.
Sketching the Business Needs
They recognized several areas where data mining could add value to the
company, and determined that there was a need to develop a core
competency by creating a data mining group to serve the needs of
marketing. The group would be located with marketing groups in the
company's headquarters, as opposed to residing with the technical people
at a satellite location. To further the process, they settled on a
business problem that would become the data mining pilot project. This
example was to analyze data for automobile insurance in one state (New
Jersey) in order to build a predictive model to estimate loss ratios for
policies. Loss ratio is the insurance industry term for the ratio of
claims paid out to premiums collected. It is a key driver of
profitability. In general, the ratio is about 70 percent; that is, 70
percent of revenue from premiums is paid out as claims (although the
percentage can differ significantly from this).
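The loss ratio defined above is simple to compute once claims and premiums are rolled up to the policy. The following is a minimal sketch with made-up records; the record layout and the function name are assumptions for illustration, not the insurer's actual data model.

```python
from collections import defaultdict


def loss_ratios_by_policy(records):
    """Compute claims paid divided by premiums collected for each policy.

    records -- iterable of (policy_id, premiums_collected, claims_paid),
               possibly with several rows per policy (one per period).
    """
    premiums = defaultdict(float)
    claims = defaultdict(float)
    for policy_id, premium, claim in records:
        premiums[policy_id] += premium
        claims[policy_id] += claim
    # Skip policies with no premium to avoid dividing by zero.
    return {pid: claims[pid] / premiums[pid]
            for pid in premiums if premiums[pid] > 0}


# Made-up example rows: P2 appears twice, once per policy period.
records = [("P1", 1000.0, 700.0), ("P2", 800.0, 0.0), ("P2", 400.0, 1500.0)]
for policy_id, ratio in sorted(loss_ratios_by_policy(records).items()):
    print(policy_id, round(ratio, 2))
```

A ratio above 1.0 means the policy paid out more in claims than it brought in as premiums over the periods included.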
One of the challenges faced when approaching the business problem is
the level of action. What is the right unit for analysis? Driver? Car?
Policy? Household? This is challenging, because these change over
time. For instance, cars and drivers are added onto policies fairly
often. Households may have more than one insurance policy, covering
separate cars driven by different individuals. The unit chosen, in this
case, was policy, which implied rolling up information at the policy
level, including the number of cars, the number of drivers, the frequency
with which cars are added to and dropped from the policy, and so on. New Jersey
was chosen because it had been used for previous pilot projects, so data
was known to be available. Although rates are set statewide, and all car
insurance companies have to offer insurance to everyone, there are a few
ways to leverage customer information. The company can target marketing
efforts to particular customers who are profitable, or avoid
particular groups of customers who are not profitable. Because rates are
set statewide, profitability has little to do with risk. That is, the
least risky individuals could be very profitable, if the statewide rate
structure makes their rates a little on the high side; or they could be
quite unprofitable, if the statewide rate structure makes their rates a
little on the low side. In the parlance of the insurance industry, they
are looking for areas where the statewide rate structure is inefficient,
and they want to exploit the inefficiencies. An example from another
insurance company, Fireman's Insurance Group, illustrates how insurance
companies can learn from their data. It is well known that men who own
sports cars are generally a higher risk than other car owners, and they
pay a premium on their insurance. However, it turns out that if an
individual owns a sports car and has another car used for routine trips,
then the sports car is no higher risk than any other car. By focusing on
owners of multiple cars who happen to have a sports car, and offering
them reduced rates (where allowed), the car insurance company can grow
market share with a minimum of risk.
Developing an RFI (Request for Information)
With a sketch of the business needs, the next step is to
identify and approach vendors with an RFI. An RFI is a document sent
typically to tens of vendors that specifies what the company wants to do
and then asks vendors how they would approach the problem. Their RFI
asked for some specific details:
• What are the vendor's tools and capabilities?
• How does the vendor approach training?
• How does the vendor solve business problems?
• What is the 5-year cost-of-ownership?
• What is the financial viability of the vendor?
• Does the vendor have good customer references, preferably in the
insurance industry?
• Does the vendor support the particular hardware (in this case an
Alpha processor running UNIX)?
• How would the vendor approach the pilot project and how much would it
cost?
Finding the vendor with the lowest price is not the
purpose of the RFI. In fact, the chosen vendor turned out to be the one
with the highest five-year cost of ownership because they could
demonstrate the most value across the spectrum. Many other issues,
vaguely called "vendor fit," are more important than price. Also, the
team knew that there were no standards yet on choosing data mining
software within their company. This effort would
probably result in a company-wide commitment to the tool.
Choosing a Vendor
Choosing a vendor was a two-part process. The first part was
narrowing down the list of vendors based on the written proposals. The
list was narrowed to four: SAS Enterprise Miner, SPSS Clementine,
Thinking Machines (now Oracle) Darwin, and Unica Model 1. The second
part was having each vendor give a two-hour presentation on their
solution. Each session started with a 45-minute vendor presentation,
followed by a 30-minute demonstration of the tool. The remainder was
devoted to discussing the vendor's approach to the pilot project. The
nature of the pilot project had been worked out in more detail, and
these details were provided to each vendor as a written document. The
vendor presentations were scheduled on two consecutive days, in the
morning and afternoon. One week later, the team made its final decision
by voting. Each member of the team ranked each vendor in several
categories. The rankings were added up and the vendor with the highest
average ranking won. The team chose SAS Enterprise Miner on the first
round of voting. Had the vote been inconclusive, the team was ready to
proceed with the pilot using more than one vendor.
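The voting procedure described above amounts to summing each member's category rankings per vendor and averaging across members. The following is a minimal sketch of that arithmetic; the vendor names, categories, and numbers are invented, and it assumes that a higher ranking is better, which the text does not state explicitly.

```python
from statistics import mean


def pick_vendor(ballots):
    """Return (winner, averages) from the team's ballots.

    ballots -- {member: {vendor: {category: ranking}}}; higher is assumed better.
    """
    totals_by_vendor = {}
    for member_rankings in ballots.values():
        for vendor, by_category in member_rankings.items():
            # Add up one member's rankings across all categories.
            totals_by_vendor.setdefault(vendor, []).append(sum(by_category.values()))
    # Average the per-member totals for each vendor.
    averages = {vendor: mean(totals) for vendor, totals in totals_by_vendor.items()}
    return max(averages, key=averages.get), averages


# Hypothetical ballots from two team members.
ballots = {
    "member_1": {"Vendor A": {"tools": 4, "training": 3, "fit": 5},
                 "Vendor B": {"tools": 3, "training": 4, "fit": 3}},
    "member_2": {"Vendor A": {"tools": 5, "training": 4, "fit": 4},
                 "Vendor B": {"tools": 4, "training": 4, "fit": 4}},
}
winner, averages = pick_vendor(ballots)
print(winner, averages)
```

A real process would also need a rule for an inconclusive vote, such as the team's fallback of piloting with more than one vendor.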
Where They Are Now
At the time of writing this chapter, the company
is working on the pilot project. As expected, there were delays due to
obtaining the right data in the right format, a very typical problem at
this stage in the effort. At the same time, several members of the team
have attended SAS training, to learn more about data mining in general
and Enterprise Miner in particular. All indications are that the pilot
project will be quite successful. When it is completed, the data mining
group will formally open for business.
Case 2: Building a New Line of Business
This case study is about another insurance company, a life
insurance company. The life insurance business is quite different from
the property and casualty business because whole life insurance is
really an investment, and term life insurance often complements
investments. Life insurance companies expect their competition in the
future to be other financial services companies, such as banks and
mutual funds.
Going Online
This life insurance company recognized the
need for a direct insurance business unit that would supplement the
agent networks where most of their life insurance is sold. The personal
relationships that agents build are a very powerful way of keeping
customers; in some cases, the agents actually collect the insurance
premiums by visiting their customers every month. However, these
personal relationships are expensive. Selling insurance over the
telephone has also proven costly, because the sales agents must be
licensed insurance agents, making them considerably more expensive
than most telemarketers. The Web, though, offers a new approach.
People interested in insurance visit the Web. A licensed insurance agent
only needs to become involved when the insurance is actually purchased.
In fact, this company has an internal goal of selling 25 percent of its
policies over the Web by the end of 2003. The Web, of course, is only
one component of the direct insurance business unit. It also includes
outbound telemarketing, direct mail, and advertising. The process of
purchasing insurance requires more than a few Web clicks. Life insurance
still requires investigations into the health of the covered person and
into determining the factors that affect the premiums. So,
click-click-click will get a potential customer an initial quote on the
Web, but the insurance does not become active until the customer's
health is verified and other paperwork gets
The Next Step
At this point, the company is in the process of
launching the new line of business. They have the organization in place
to analyze data and act on the results. They have the data available
through the prospect data warehouse, and good relationships with the
warehousing group. They have a tool that they are comfortable using.
They have set up a small data mining group in the marketing department
to use the tool.
Choosing Demographic Data for Prospect Warehouse
Data
about prospects is notoriously poor, because they are not yet customers.
To make it more useful, the plan was to augment the data with additional
demographic data, purchased from an outside vendor. The augmentation
process requires:
1. Sending a list of prospects to the vendors (or, in some cases,
purchasing a list that meets specified criteria from them).
2. Having the vendor augment the list with additional fields, generally
several hundred of them, and returning the list to the company.
3. Transforming and loading the data into the prospect warehouse.
There are
several data suppliers in this area. The company decided to test
overlays from three of them: Acxiom, Experian (formerly Metromail), and
First Data. Only one vendor is really necessary. How does the company
choose which one? The most important requirement is that the data be
valid for data mining purposes. In fact, it is worth paying a premium
for better data. The company wanted to test the data for data mining, in
a way that would somehow measure the value of the data instead of the
skill of the data miners. A subset of prospects that had been targeted
for a previous campaign was chosen for testing. All three vendors were
asked to augment the data for these prospects. The idea behind the tests
was to build "naive" models on the augmented prospect data during a
period that lasted three weeks. The naive modeling test that
proved most useful was the evidence model in SGI Mineset. Figure 8.1
provides an example of the evidence model, showing the conversion rate
for customers who apply for insurance. That is, of the people who apply,
who makes it through underwriting, and who purchases a policy at the
price set by underwriting? The input variables are sorted on the left by
their importance to the outcome. Then each value of the variable (or bin
for real numbers) has a little two-part box that shows the ratio of
converters in that bin. We see in this chart that the most important
variable is ius_duration_day, the amount of time spent in underwriting.
People who spend very little time in underwriting are those that are
easily rejected.
Figure 8.1 Evidence for conversion (making it through underwriting).
On the right, we see a pie chart that shows the ratio of
converters. We can choose particular variables or values on the left
side and interactively see the conversion ratio meeting the criteria.
The fourth data variable found by the evidence model is
media_vehicle_nm. It turns out that the square at the very end
corresponds to the Internet channel, and it has a very high conversion
rate. Evidence models handle sparse data very well, making them very
appropriate for this test. And the demographic data tends to be quite
sparse, because, for instance, most households do not have any children
in the 1-4 age bracket (and so on). Figure 8.2 compares the evidence
models for the three vendors. What is most striking is that all data
performs about the same. This is all the more surprising since one of
the vendors does offer a smattering of medical information, information
that would likely affect life insurance. However, the health data is so
sparsely populated that it does not affect the overall results.
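The per-value conversion ratios that the evidence chart displays can be approximated by simple counting. The following is a rough sketch, not the algorithm used by SGI Mineset; the field name media_vehicle_nm comes from the text, while the row layout, the "converted" flag, and the values are invented for illustration. Missing values get their own bin, which is one simple way to cope with sparse overlay data.

```python
from collections import defaultdict


def conversion_by_value(rows, field, target="converted"):
    """Return {value: share of rows with that value that converted}."""
    counts = defaultdict(lambda: [0, 0])  # value -> [converters, total]
    for row in rows:
        value = row.get(field)
        key = "missing" if value is None else value  # sparse fields get a bin
        counts[key][1] += 1
        if row[target]:
            counts[key][0] += 1
    return {value: conv / total for value, (conv, total) in counts.items()}


# Made-up applicants; the real data would come from the prospect warehouse.
rows = [
    {"media_vehicle_nm": "internet", "converted": True},
    {"media_vehicle_nm": "internet", "converted": True},
    {"media_vehicle_nm": "direct mail", "converted": False},
    {"media_vehicle_nm": "direct mail", "converted": True},
    {"media_vehicle_nm": None, "converted": False},
]
print(conversion_by_value(rows, "media_vehicle_nm"))
```

Running the same count on each vendor's overlay fields gives a quick, skill-independent view of how much each field separates converters from non-converters.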
Figure 8.2 Comparison of lift of evidence models for three
data vendors. Part of this test also compared data from multiple
vendors. The vendors were all able to augment about the same proportion
of the data. And when fields could be compared, they almost always
agreed. For instance, the estimated household income field did not vary
significantly from one data vendor to another. Having determined that
the data all contained about the same information, the insurance company
turned to other criteria for choosing the vendor. Because of previous
experience and the ease of loading the data, they chose Acxiom.
The next step is to work on a marketing campaign. For this purpose,
they are testing a direct marketing campaign, using models built by
their new analysis group.
Case 3: Building Data Mining Skills on Data Warehouse Efforts
The third case study tells the story of a bank that is
building data mining expertise on top of an ongoing data warehousing
effort. In this case, the bank was very fortunate to be working with a
vendor (then Tandem Computers, now Compaq) that really understands data
mining. And, unlike the previous cases, the data mining expertise was
built within the IT group to serve all the groups within the bank.
However, once the bank built the competency in data mining and verified
the quality of the data in the data warehouse, the bank reorganized the
data mining efforts into different business units.
A Special Kind of Data Warehouse
This bank had decided to build their customer-centric
data warehouse using hardware and software from Tandem Computers. The
database, Nonstop-SQL, is a fully functional parallel relational
database. At the same time, it also incorporates some very useful
features for data mining. These features include performance
enhancements, such as enabling all operations in parallel and
compressing data stored in the database to reduce disk access times for
queries that read a lot of data. The database also has extensions that
help some data mining operations run more efficiently. Several vendors,
including SAS, Angoss, and SPSS Clementine, partnered with Tandem to
take advantage of this functionality. By moving some of the complex data
manipulations into the database, the tools can leverage the parallel
server performance. In short, Nonstop-SQL is data-mining-ready. To take
advantage of data mining, Tandem set up the Advanced Data Mining Center
(ADMC) located in Austin, Texas to work on pilot projects
(www.tandem.com/prod_des/advdmcbr/advdmcbr.htm). (The ADMC is now part of Compaq's
Database Engineering Center.) The ADMC is a good group to work with;
because of their database backgrounds, they are intimately familiar with
data.
The Plan for Data Mining
The bank identified a key resource in the
IT group to head data mining efforts. This individual had a background
in databases and was learning about statistics and data mining when he
started the effort. His background in databases was very important,
because access to the data is a key success factor for data mining,
especially when the functionality is being built on top of a data
warehouse.
As data was loaded into the data warehouse, more and more became
available for data-mining purposes. The bank embarked on a data-mining
pilot project:
1. Identify business objectives. They worked with the managers in the
credit card group to identify business needs that could be addressed
with data mining. The business was concerned with the behavior of
customers over time; in particular, whether they would be transactors
(who pay off their balances every month), revolvers (who make only the
minimum payment and pay lots of interest), or convenience users (who
pay off the balance over several months).
2. Evaluate the data. The next step was evaluating the data for the
project. In particular, to understand customer behavior in the manner
described, they needed to have multiple months of billing data, with
outstanding balances and amount paid just to define the behavior. Plus,
there were other fields from the customer information file and
historical billing records that could be used to predict future
behavior.
3. Prepare and transform the data. The data in the data warehouse was
in a very normalized format and it needed to be flattened for data
mining purposes. This required appending the billing data to each
customer record and adding derived fields.
4. Explore and interpret the data. Using two tools, Angoss
KnowledgeStudio and Clementine (now owned by SPSS, Inc.), they analyzed
the data to predict future customer behavior.
5. Deliver results. The final step was delivering the results back to
the business to illustrate how data mining can be used for predicting
behavior, and for other business problems as well.
Much of this work took place off-site at the
ADMC, especially steps 2, 3, and 4. However, identifying the business
problem required working with the credit card group inside the bank,
since this group was chosen for the pilot. The work at the ADMC included
demonstrations of SPSS Clementine and Angoss KnowledgeStudio running on
the bank's data.
Data Mining in IT
The group responsible for the data
mining pilot project was an ad hoc group. However, one individual had
been identified as a key leader for data mining. He had a
background in databases and was familiar with analytics, a good
combination for leading a new data mining group. This group is
responsible for building a core competency in data mining to take
advantage of data in their data warehouse as well as other data sources.
The group has taken a multiple tool approach, so they have since
acquired SAS Enterprise Miner as well as other tools.
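The data preparation in the bank's pilot, flattening normalized billing rows into one record per customer and then deriving a behavior label, can be sketched as follows. This is only an illustration: the field names, the thresholds, and the rule that any mix of full and partial payments marks a convenience user are assumptions, not the bank's definitions.

```python
from collections import defaultdict


def flatten_billing(billing_rows):
    """Turn one row per (customer, month) into one history per customer."""
    by_customer = defaultdict(list)
    ordered = sorted(billing_rows, key=lambda r: (r["customer_id"], r["month"]))
    for row in ordered:
        by_customer[row["customer_id"]].append((row["balance"], row["paid"]))
    return dict(by_customer)


def classify_behavior(history):
    """history -- list of (balance, amount_paid) per month, oldest first."""
    active = [(balance, paid) for balance, paid in history if balance > 0]
    if not active:
        return "no activity"  # extra label, not one of the three in the text
    paid_in_full = sum(1 for balance, paid in active if paid >= balance)
    if paid_in_full == len(active):
        return "transactor"        # pays off the balance every month
    if paid_in_full > 0:
        return "convenience user"  # pays the balance off over several months
    return "revolver"              # never pays in full, so interest accrues


# Made-up billing rows for three customers.
billing = [
    {"customer_id": "C1", "month": "1999-01", "balance": 100, "paid": 100},
    {"customer_id": "C1", "month": "1999-02", "balance": 50, "paid": 50},
    {"customer_id": "C2", "month": "1999-01", "balance": 1000, "paid": 20},
    {"customer_id": "C2", "month": "1999-02", "balance": 1010, "paid": 20},
    {"customer_id": "C3", "month": "1999-01", "balance": 900, "paid": 300},
    {"customer_id": "C3", "month": "1999-02", "balance": 600, "paid": 300},
    {"customer_id": "C3", "month": "1999-03", "balance": 300, "paid": 300},
]
for customer_id, history in flatten_billing(billing).items():
    print(customer_id, classify_behavior(history))
```

The label derived this way would serve as the target field, with the remaining customer and billing fields as inputs for prediction.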
As they move forward, the data mining group is going to other groups
in the bank offering data mining services. By centralizing the effort,
they are learning about the available data and rapidly improving their
modeling skills.
Case 4: Data Mining Using Tessera RME
With the advent
of e-commerce and affinity card programs, retailers are starting to
understand who their customers are and what they are purchasing. These
changes are revolutionizing the industry, shifting the power from the
manufacturers and distributors to the marketers. However, one corner of
the industry has always been focused on customers: the catalogs. Chapter
9 presents a case study with a catalog vendor, which describes the
business background, history, and data mining efforts in more detail.
Here we are focusing on how to build an effective environment. Eddie
Bauer, Inc. is a major multichannel retailer. As a cataloger, they send
out more than one hundred million catalogs each year. As a retailer,
they have more than 500 stores. And their Web presence is an
increasingly important part of the business. Eddie Bauer has also been a
leader in integrating the different channels of their business. Instead
of building artificial walls between the different divisions, they have
concentrated on a single brand image delivered across multiple channels.
In the area of database marketing, the catalog side has taken the lead.
However, they have been very careful to recognize that good customers
sometimes purchase from the catalog, sometimes purchase from stores, and
increasingly purchase over the Web, even though different channels offer
different products. They have been a leader in managing data and in
using it to run the business. Here, in this case study, we are going to
take a look at their approach for building a data mining environment,
using the Rapid Modeling Environment (RME) from Tessera Enterprise
Systems (www.tesent.com). This case study is going to focus on the
capabilities of the RME and how it enables effective data mining.
(Although Eddie Bauer is happy to be mentioned as a user of RME, they
are hesitant to disclose details about the business operations.) RME
provides one example, and a good example, of a data mining environment.
The experience at Eddie Bauer shows some technical functionality that is
useful, regardless of the tools being used.
Requirements for an Advanced Data Mining Environment
As the data mining group matures, it needs to
become a part of the overall business processes. This usually requires a
higher level of support, from a technical perspective, than originally
needed.
is, the RME Server and data warehouse are running on separate parts of
the same physical parallel computer.
Figure 8.4 Systems architecture of RME. (Diagram labels: RME Client,
RME Server, and DBMS Server, connected by LAN/WAN and a high-bandwidth
link; software includes SAS/Base, SAS/Connect, SAS/AF & SCL,
SAS/Access, and custom RME programs; data includes the customer
information warehouse on the DBMS, extract and sample specifications,
SAS data sets, production scoring model code, and samples, scores, and
their metadata.)
The RME Sampling Model facilitates sampling and works with RME
Extraction. Sampling is a four-step process, as shown in Figure 8.6:
1. The user specifies how to create the sample using the graphical user
interface. The customer ids and fields needed for the sample are
included in the extract.
2. The sample is created from the extract, as a list of households that
belong to the sample.
3. This list lives in the data warehouse, so the same sample can be used
later.
4. RME Extract uses this list to generate the appropriate data for an
extraction. This process automatically takes into account the nature of
the sample and the specific needs of the extraction.
There are several advantages to
this multipart sampling approach. First, the sample can be used for
multiple extracts with a minimum of effort. The sample itself is a list
of households, which is small, and it can be reused. Second, the
extraction process knows about the sample so it can optimize its
performance by working only on the appropriate data elements. Third, it
is logically very easy to replace the sample with a new one.
How RME Helps in Model Development
At this point, RME has done the first half of its
work. It has prepared the model set, which can then be used by SAS or
imported into another tool. During the course of developing models,
users will generally modify the data: They add new derived variables,
remove unneeded fields, bin numeric values, or make other data
transformations. Eventually, the data miner has produced a final
(Diagram labels: Customer Information Warehouse with Household, Promo
History, Transaction, Product, and Store/Catalog data; 1. Universe-style
extract; 2. Sampled to small set of IDs; 3. Sample IDs registered;
4. Model set using sample IDs.)
Figure 8.6 Sampling using a multistep approach.
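The multistep sampling shown in Figure 8.6 can be sketched in a few lines. This is not RME code: the in-memory dictionary stands in for the warehouse table that holds registered samples, and every name here is hypothetical. The point is that the sample is only a small list of household ids, so it can be stored once and reused for any number of extracts.

```python
import random

# Stand-in for the warehouse table of registered samples (hypothetical).
sample_registry = {}


def create_sample(universe_ids, fraction, seed=0):
    """Pick a reproducible random subset of household ids."""
    rng = random.Random(seed)
    size = max(1, round(len(universe_ids) * fraction))
    return sorted(rng.sample(sorted(universe_ids), size))


def register_sample(name, household_ids):
    """Store the list of ids so the same sample can be reused later."""
    sample_registry[name] = list(household_ids)


def extract_for_sample(name, warehouse, fields):
    """Build an extract containing only the sampled households and fields."""
    return [
        {"household_id": hid, **{f: warehouse[hid][f] for f in fields}}
        for hid in sample_registry[name]
    ]


# Made-up warehouse of 100 households.
warehouse = {hid: {"state": "WA", "orders": hid % 3} for hid in range(1, 101)}
register_sample("spring_test", create_sample(list(warehouse), 0.1, seed=42))
first = extract_for_sample("spring_test", warehouse, ["orders"])
second = extract_for_sample("spring_test", warehouse, ["state", "orders"])
print(len(first), len(second))
```

Replacing the sample is then a matter of registering a new list under the same name; the extract logic does not change.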
model, which can also be described using SAS code. This is true even
when the user is using SAS Enterprise Miner, SAS's graphical data mining
tool. RME basically stays out of the model development process itself,
letting data mining tools focus on this area. Of course, during the
course of creating a model, users may decide that they need more data or
a different model set. In these circumstances, it is a simple matter to
return to RME to iterate through the model development process. The
final step is bringing together the data transformations and model
scoring code. Together, these represent the model, which can then be
registered back into RME.
How RME Helps in Scoring and Managing Models
Once a model has been developed, RME kicks in again. Now, it must
address the needs of registering models, scoring data sets, scheduling
scoring operations, and so on.
Model Registration
RME registers models after they have been created. The model
registration is smart:
• It understands the extract associated with the model, so it can ensure
that exactly the right data is generated for scoring it.
• It recognizes when certain fields in the data are not used, and uses
this information to optimize the extract.
• It allows the model to have date indicators instead of absolute dates,
helping data miners work with time-dependent data.
This information is stored in the data warehouse as modeling
metadata (along with other information such as who created the model,
comments about it, and so on).
Customer Scoring
The scoring process then
combines all of the steps that have been described so far. The process
of scoring is depicted in Figure 8.7.
1. RME creates the appropriate extract for the model. Notice that the
production extract does not have sampling, has been optimized to contain
only the fields needed by the model, and has its date indicators
converted to actual date ranges.
2. The registered version of the scoring code is used to add scores to
each household, in essence adding one or more new fields onto each
household record.
3. The user can specify how to work with the scores. Often, the raw
score is not as interesting as knowing into which decile the score
falls, or which model produced the highest score.
4. The scores are loaded back into the data warehouse, where they can be
used by other applications and fed to downstream marketing efforts.
The
scoring process is fully automated. In production, it runs on the RME
Server and data warehouse. RME can schedule production scoring to run at
any time, to meet production schedules. For instance, if a catalog has a
deadline where it needs the list of customers on Tuesday morning at
10:00 A.M. to meet printing deadlines, then production scoring is run on
Monday at midnight (in case there is a problem, there is still time to
run production scoring again). The production scoring process uses the
latest version of the model registered for that mailing. Analysts can
register new and improved versions of the model through Monday evening.
Over time, Eddie Bauer's database accumulates many scores for each
household. After all, they mail out dozens of catalogs every year to
their household base, and every household gets a score for each
marketing effort, even if they were not chosen. This introduces yet
another management issue. Over time Eddie Bauer archives old scores,
keeping only a handful active in the ware-
Figure 8.7 The scoring process. (Diagram labels: Customer Information
Warehouse with Household, Promo History, Transaction, Product, and
Store/Catalog data; Production Extract; Model Scoring Algorithm; Load
Scores.)
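Three pieces of the scoring process in Figure 8.7 lend themselves to a short sketch: converting a relative date indicator into an actual date range, assigning deciles to raw scores, and finding which model scored a household highest. The helper names and the convention that decile 1 holds the highest scores are assumptions for illustration; they are not RME functions.

```python
from datetime import date, timedelta


def resolve_window(months_back, as_of):
    """Turn 'the N full months before as_of' into actual start and end dates."""
    month_index = as_of.year * 12 + (as_of.month - 1) - months_back
    start = date(month_index // 12, month_index % 12 + 1, 1)
    end = date(as_of.year, as_of.month, 1) - timedelta(days=1)
    return start, end


def assign_deciles(scores):
    """scores -- {household_id: score}; decile 1 holds the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    count = len(ranked)
    return {hid: (position * 10) // count + 1
            for position, hid in enumerate(ranked)}


def best_model(scores_by_model):
    """Return the name of the model with the highest score for one household."""
    return max(scores_by_model, key=scores_by_model.get)


# Made-up scoring run for 20 households.
start, end = resolve_window(3, as_of=date(2000, 3, 15))
deciles = assign_deciles({hid: hid / 20 for hid in range(20)})
print(start, end, deciles[19], deciles[0])
print(best_model({"model_a": 0.2, "model_b": 0.7}))
```

Because the model stores only the relative indicator, the same registered model can be rerun for each mailing with the window resolved against that mailing's scoring date.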
The first two parts of this book are really preparation for
the case studies. Part One explains the business imperative for data
mining. Part Two focuses on the technical aspects of data mining. Here
in Part Three, we see examples of real opportunities addressed by real
companies. In the case studies, we see data mining applied to real world
problems. Mastering data mining is best accomplished by learning from
real world experiences. Up to this point, we have seen many examples
that illustrate the business and technical sides of data mining. How
are the case studies different from these vignettes? The examples shown
so far are each intended to illustrate a single point: how to identify
the business problem, or what decision trees are, or what a data
transformation looks like. We hope that these vignettes have helped to convey
important facets of data mining. However, they remain just that: pieces
of the story. The case studies take a more comprehensive approach. They
start with background about the business and industry, and then move to
the particular business problem. They explain what data is available,
and what data is not available, and then show the data mining process.
Where possible, the case studies give actual results. Each case study
tries to show an entire cycle through the virtuous cycle of data mining.
The case studies have both positive and negative lessons. That is, we do
not pretend that everything goes right in the course of doing data
mining. Such an attitude would be disingenuous. Even worse, it might
discourage others who are in the throes of data mining challenges.
Imprecise business problems, lack of data, dirty and inaccessible data,
bugs in data mining software: these are not unexpected events. As with
any process, we learn from both successes and mistakes. That said, the
case studies clearly demonstrate the value of data mining. In every
case, data mining is providing business value, and the speed-bumps on
the road are merely inconveniences. The most important success factor is
to let what's important in the business drive the data analysis; be
open-minded and listen to the data.
Following the Customer Lifecycle
The most important application for data mining is customer relationship
management. The first three case studies follow the customer lifecycle.
They start, naturally, with the acquisition of new customers, then move
to the building of cross-sell models for existing customers. The third
is about customer retention. Although the case studies all illustrate
the application of data mining to managing some specific point in the
customer lifecycle, they differ in most other

models for each product. Another approach, less favored by us, is to
look at historical patterns of which products sold together, and then
look for customers with some but not all of these products. This second,
more market-basket approach, has a tendency to reinforce historical
marketing efforts. It tends to build a box rather than enable thinking
outside the box. The third case study addresses one of the final steps
in the customer lifecycle: customer retention. The market for mobile
telecommunications is rapidly maturing in many markets. The maturation
is actually happening more quickly outside North America than within.
This case study looks at the leading cellular provider in a country
where half the population already has mobile phones. At this point,
retaining customers becomes as important to the bottom line as acquiring
new ones. This shift from growth and customer acquisition to customer
relationship management is a big change in the wireless industry. This
case study illustrates the importance of effective model building
skills. Because the ultimate goal of the project was to build a churn
management system, the client wanted to understand the model-building
process as well as build an effective churn model. This provided the
leeway to experiment with different types of models and model sets.
Together, these three case studies span three industries that have
proven leadership in data mining: catalog retailing, banking, and
telecommunications. They also span three important parts of the customer
lifecycle. Together, they bring the lessons of Chapter 4 into the real
world.

Gaining Insight into Business Practices

The customer lifecycle
does not drive everything that is data mining. And there is more to data
mining than building predictive models for customer-related events.
Data mining is often about understanding the business and gaining
actionable insights. The next three case studies focus on this aspect of
data mining. Before describing them, it is important to emphasize that
predictive modeling and gaining insights are not mutually exclusive.
There were some very interesting insights gained during the process of
building predictive models in the first three case studies. As an
example, the work described in Chapter 11 revealed an important segment
of customers who "abused" the marketing rules to get discounts on new
cellular handsets. The first case study in this group takes on a big
challenge. The challenge is big because the data is big. Telephone
companies make a record of every telephone call. These records are
called call detail records. There are literally
billions of telephone calls made every day throughout the world,
resulting in more than a terabyte (a trillion bytes) of call detail
records every day. All the large telephone companies are dealing with
tens or hundreds of gigabytes per day. Their challenge is to figure out
what to do with them, and whether it is worth the effort. This case study
is a voyage of discovery through call detail records, or actually several
voyages from different companies, all of which wanted to remain
anonymous. The voyage takes us away from the focus on predictive
models, focusing instead on intelligent querying and visualization. It
takes us from looking at anomalies in the data, to individual consumers,
to business consumers. The analysis of such large amounts of data is
made possible only with the advent of parallel computing environments.
Part of the case study describes a technology called dataflows that has
proven to be very valuable for these types of projects. A word of advice
to nontechnical readers: don't be put off by the description of
dataflows. We have included them in the chapter because they are fairly
easy to explain, they are a useful way to look at complicated data
transformations, and most technical readers are probably not familiar
with them. However, understanding dataflows is not critical for
understanding the value of exploring detailed data that describes
customer behavior. The next case study also looks at an industry that
collects a tremendous volume of data: the supermarket industry. However,
in this case, the companies immediately summarize the data and throw away the
individual transactions. Only then do they start to ask the hard
questions. The particular question being asked in this case was: What
differences are there in the purchasing patterns of Hispanic consumers
versus non-Hispanic consumers in Texas? This is the type of business problem that
can result in big changes. The results can be used in many ways:

• Which products are distributed to predominantly Hispanic stores
• Which products are promoted, and where, and when
• The layout of stores and the placement of products on shelves
• Advertising in Spanish-language media

This case study looks at finding this information through very
informed visualizations on presummarized data. The chapter also includes
a smaller case study on using market basket analysis to discover
interesting patterns in grocery data; in this case, for perishables. The
last of the case studies is the most surprising. It moves outside the
realm of marketing entirely and shifts to the operational side of the
business. Today, manufacturers are often massive generators of data.
Many manufacturing processes
have been automated, and those that have not are often
still subject to central control. As a result, many measurements are
constantly being taken, whether the product is a computer chip, a
roll of aluminum, or, in this case, a magazine. Time, Inc., a major
publisher of magazines, spends hundreds of millions of dollars each year
on paper alone. During the process of making the magazines, about 10-15
percent of the paper is wasted. Do the math. Millions and millions of
dollars of paper are wasted. Since data is collected all through the
printing process, it should be possible to identify the causes of waste.
And, indeed, it is possible. This chapter once again shows the power of
flexible data analysis for tackling a complicated problem. Even though
this problem has nothing to do with marketing, the result is still worth
millions of dollars to Time, Inc. in savings.

Structure of the Case Studies

Each case study chapter is independent from the others. Although
you can pick and choose among them, we do encourage you to read all of
them and to learn from the combined experience. Even seemingly unrelated
case studies can help improve your data mining efforts. Even though the
chapters are independent, we have tried to follow a common organization
to help you follow them more easily. All start with an introduction that
explains the case study and why it should be interesting. Following the
introduction is usually a section called "Business Problem." This
explains the business problem that the case study is addressing. In some
cases, the business problem is quite simple. In others, it is much more
nebulous. Where useful, we have explained the process of arriving at the
right problem, showing some of the pitfalls that arise when trying to do
data analysis. The next common section is about data. Data, more than
anything else, determines the success or failure of a data mining
effort. This section talks about the sources of data, important data
transformations, and derived variables. The bulk of most of the chapters
is then devoted to explaining the modeling and data mining process.
Where possible, we have illustrated the chapters with actual results.
Even where the results needed to be tweaked to protect the identity of
the companies, they still are representative of the actual results. All
the chapters end with a section entitled "Lessons Learned." This section
provides a bulleted point-by-point summary of the important ideas
presented in the chapter. Earnest readers may be tempted to cheat by
reading just the lessons. However, the stories, as told, are a much
richer explanation of the data mining process.

Who Needs Bag Balm and Pants Stretchers?

Acquisition or response
models are undoubtedly the most common application of data mining. For
any business, no matter how large, the world population contains many
more noncustomers than customers. It is the job of the marketing
department to craft a message that will convince some of those
noncustomers to become customers. What should that message be? That
depends on whom you want to attract. Who do you want to attract? That
depends on who is likely to be a happy, loyal, and profitable customer.
Data mining, in combination with traditional market research, can shed
light on both what the message should be and to whom it should be
addressed. Every industry faces this challenge in one way or another,
but the nature of the customer interaction and the characteristics of
the sales channels determine quantity and quality of the data available
for mining and the types of communication that are practical. As an
example, compare and contrast the banking and fast food industries. In
banking, every transaction is linked to a particular customer about whom
much is already known. Bank statements, ATM screens, and home banking
Web sites all have the potential to be used for personalized messages
chosen on the basis of information gathered throughout the history of
the relationship. In the fast food industry, on the other hand, most
transactions are done anonymously with cash. There is no way to tell if
a particular milkshake and large fries went to a regular customer or to
a first-time visitor. Any communication is perforce aimed at an entire
market segment rather than an individual. The catalog industry, where
this chapter's case study takes place, is somewhere between these
extremes, but closer to banking than to fast food. No

Figure 9.1 The Voice of the Mountains, the main VCS catalog.

data by
household from 1989 (the earliest year for which tapes were available).
The primary impetus for the data warehouse was the inability to use
historical data to answer marketing questions related to past activity.
A secondary goal was support for analytic modeling. Order processing is
handled on an HP 3000 computer, which has 200 order takers entering
transactions. These transactions accumulate on the operational system
until the marketing department requests an update for the data warehouse.
This tends to happen every six to eight weeks in preparation for a
mailing. The data warehouse is implemented in Oracle on a UNIX system
from Sun Microsystems. The end users are all in the marketing
department. Marketing analysts use Cognos Impromptu or native SQL
queries for decision support. At this writing, in 1999, the mail-order
catalog accounts for most of the $60 million in sales volume for the
Vermont Country Store. Most of the company's 350 employees are employed
in the mail-order business, taking orders or working
in the distribution center. Others work in the company bakery
making the old-fashioned crackers sold in the food section of the
catalog. And of course, there would be no case study without those
all-important folks in the information technology and marketing
departments at the corporate headquarters in Manchester Center, Vermont.
Don't assume that just because your company is not measured in billions
of dollars and tens of millions of customers, it is too small to take
advantage of data mining. A smaller company may be able to get by
setting up a data mining environment on an existing computer alongside
other applications. If there are only one or two analysts, they can
share a software license and the investment in training will not be
large. The potential savings or extra profit available through data
mining will be smaller at a small company, but so will the expenses. One
thing that makes this case stand out from others in this book is the
size of the company. Although $60 million is certainly nothing to sneeze
at in a family-owned business, it is tiny in comparison with the large
banks, insurance companies, and telephone companies that we are used to
working with. The company maintains a culture that makes it feel even
smaller than it is: there is, for example, the never-yet-broken promise
of no layoffs, and the fact that the company has never felt the need for
an 800 number for taking catalog orders (here VCS may be helped by the
fact that Vermont's area code, 802, apparently looks like a free call to
many customers from out-of-state). Even the vendor who sold VCS its data
mining software was unsure whether it would be possible for such a
relatively small company to justify the expense of the software licenses
and training. They were amazed when VCS reported a return of over 1,000
percent on their investment in data mining!

Predictive Modeling at Vermont Country Store

Predictive modeling of the kind we discuss in this
book has only recently been introduced at VCS. In this section we focus
on how the company convinced itself that mastering data mining was worth
the time, effort, and money expended. VCS went about this in a very
methodical way, by applying data mining to a well-defined business
problem with measurable results.

The Business Problem

Catalogs are never
(well, rarely) simply mailed out at random. They are always mailed to a
list and the list is always chosen in some deliberate way to achieve
some particular goal, such as maximizing the number of orders or the
number of items per order or the total revenue due to the mailing.
Whatever method is used to select the mailing list is a model in the
broadest sense of the word. The most frequently used model in this industry,
and the one against which all others were compared at Vermont Country
Store, is called RFM, which stands for Recency, Frequency, Monetary
segmentation. RFM analysis is explained in the section "The Technical
Approach," later in this chapter. A model must have a target. In the
catalog world, there are several possible goals including increased
response rate, increased overall revenue, decreased mailing costs,
increased overall profit, increased reactivation of dormant customers,
higher order values, and lower returns. For this study, the goal was
increased revenue per catalog mailed. This is a function of both response rate
and order size and so would be a good candidate for a multimodel
approach that combines a response model with a revenue model in order to
come up with an expected value for the catalog mailed. The approach at
VCS, however, was to go after high-spending responders directly with a
single model.

Data

One of our first introductions to data mining
(although it wasn't called that yet) was in the airline industry, back
in the days when airlines were a heavily

Data Mining at Larger Catalogers

At the other end of the
spectrum from Vermont Country Store is Fingerhut, a wholly owned
subsidiary of Federated Department Stores, that has long been
acknowledged as a leader in database marketing and the application of
data mining to the catalog business. In addition to its flagship
Fingerhut catalog, Fingerhut owns Figi's, a specialty food and gift
catalog company, and several other women's apparel and general
merchandise catalogers. Recently, Fingerhut has leaped wholeheartedly
into e-commerce with equity positions in a variety of online retailers
such as PC Flowers and Gifts and Hand.com in addition to its own online
business. VCS is a relatively small company with revenues of $60 million
from selling practical, if eccentric, merchandise to a fairly
well-heeled customer base. Fingerhut, with 40 times as many employees,
takes in approximately 33 times the revenue (around $2 billion a year)
selling equally eccentric, but decidedly more down-market merchandise to
consumers, many of whom may be more concerned with the size of the "four
easy payments" than the actual price of the item purchased. VCS has a
list of around a million and a half people who get its regular catalog,
whereas Fingerhut mails out 400 million catalogs yearly, more than a
million a day. It is at companies like Fingerhut, where the sheer scale
of the operation means that a tiny reduction in cost per mailing or a
tiny increase in response translates into millions of dollars, that data
mining collects its best publicity. As a dedicated database marketing
firm, Fingerhut maintains multiple terabytes of transaction history and
a customer file with 1,400 (mostly null) entries for each of 30 million
households. This customer database has been mined in many interesting
ways. One application is customer segmentation based on a combination of
the customer file, the order histories, and demographic data. For
example, Fingerhut discovered that its customers triple their purchasing
in the 12 weeks after a change of address. This led to the creation of
a special movers' catalog featuring home furnishings, kitchen
appliances, telephones, and the like, but without jewelry or cosmetics.
Another application is "mailstream optimization," which means figuring
out who should receive which catalogs and when, or, more importantly
from the point of view of cost savings, which catalogs not to send to
which customers. In initial tests, mailing expenses dropped 8 percent
while revenue declined by only 1.5 percent. Fingerhut reports that it
now saves $3 million a year by not mailing catalogs it otherwise would
have. A third example is the use of a neural network model to predict
staffing needs in the call center based on catalog mailing schedules and
historical response. In addition to supporting its own catalog business,
Fingerhut's Business Systems division provides direct marketing and
information management services to other companies in the industry. The
focus on database marketing has been there since the beginning. Manny
Fingerhut, who started the company that bears his name in 1948 and
retired from it 30 years later when it was a quarter-billion dollar
company, was a pioneer of modern direct marketing. The company started
out selling plastic car-seat covers by mail. They gave buyers 30 days to
pay while somehow convincing suppliers to wait 60 days for their money.
In a very early example of database marketing, the seat covers were
marketed directly to a list of people registering new cars in the state
of Minnesota, where Fingerhut is based. The product was even offered on
an installment plan without any credit checks. The company figured that
any buyer who could afford a new car was good for the price of seat
covers as well: a rudimentary form of predictive modeling. We got this
account from one of Manny's acquaintances from the good old days, so we
aren't sure that it is entirely accurate, but we hope so because it
makes a great story! Today, about 70 percent of Fingerhut's sales come
from the catalog, and the other 30 percent come from direct mail,
outbound calls, and the Internet.

• Number of days since last order
• Number of purchases from the various different catalog types (Apothecary, Green Mountain Mercantile, Goods and Wares, Home, Voice of the Mountains)
• Dollars spent per quarter going back several years

Internal data of this kind is
augmented with outside information about a household's industry-wide
purchasing by product category and by demographic data.

The Technical Approach

Vermont Country Store wanted a controlled, scientific
comparison of advanced data mining techniques such as neural networks
and decision trees, which would require an investment in new software
and training, with the current practice of using RFM models and various
kinds of demographic segmentation. VCS decided to test the new methods
by applying data from past mailings and calculating how well they would
have done had they used a mailing list produced by the new models
instead of the one they did, in fact, use. This approach is attractive
because it makes use of historical data that is readily available in the
house file and does not require any actual test mailings. The difficulty
is that the new models will suggest sending catalogs to people who did
not actually get them and not sending catalogs to people who did. The
latter case is easily handled-any expenses incurred by sending catalogs
to people who would not have gotten them if the new model had been used
can be subtracted from the total expense of the mailing, and any
revenues due to those customers can be subtracted from total revenues.
But what about the people that the new model would have added to the
mailing list? For these people, we need to be able to estimate the
probability that they would have responded and the revenues that would
have resulted. This is a data mining problem in itself.

Choice of Software Package

VCS was new to data mining and so made a survey of
software packages that might help them incorporate this new technology
into their marketing process. After looking into several software
packages, VCS settled on Enterprise Miner from SAS Institute. Although
they found some other packages easier to use, such as Unica's Model 1,
they liked the greater degree of control they felt was afforded by
Enterprise Miner. The fact that user-written SAS code could be called
from Enterprise Miner also appealed to the analysts at VCS. In short, it
felt like a package that could not be outgrown.
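
The historical comparison described under The Technical Approach can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure VCS actually ran: the `Household` fields, the `backtest` function, and every number below are hypothetical names and values introduced here. Households that the new model drops fall out of both cost and revenue, households that it keeps contribute their observed revenue, and households that it adds contribute an estimated revenue (response probability times order value) that must itself come from a model.

```python
from dataclasses import dataclass


@dataclass
class Household:
    mailed: bool               # was this household on the list actually used?
    actual_revenue: float      # revenue observed from the mailing (0 if none)
    selected_by_model: bool    # would the new model have mailed this household?
    est_response_prob: float   # assumed estimate, needed only for additions
    est_order_value: float     # assumed revenue if the household responds


def backtest(households, cost_per_catalog):
    """Estimate revenue and cost had the new model's list been mailed."""
    revenue = 0.0
    cost = 0.0
    for h in households:
        if not h.selected_by_model:
            # Dropped by the new model: its cost and revenue are left out.
            continue
        cost += cost_per_catalog
        if h.mailed:
            # Kept by both lists: the observed revenue is known.
            revenue += h.actual_revenue
        else:
            # Added by the new model: only an expected value is available.
            revenue += h.est_response_prob * h.est_order_value
    return revenue, cost


# Hypothetical three-household file, for illustration only.
sample = [
    Household(True, 50.0, True, 0.0, 0.0),    # mailed and kept
    Household(True, 0.0, False, 0.0, 0.0),    # mailed but dropped
    Household(False, 0.0, True, 0.1, 40.0),   # not mailed but added
]
est_revenue, est_cost = backtest(sample, cost_per_catalog=0.5)
```

The quality of the comparison depends almost entirely on the estimates for the added households, which is why the text calls that step a data mining problem in itself.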

would have increased the profitability of the 120-page book at the
expense of the 96-page book by moving valuable customers from one to the
other.

The Neural Net Model

The neural network was applied to two
different mailings of the general merchandise catalog. Key input
variables were recency, first quarter orders, beauty, and number of
Voice of the Mountains catalogs received. This model achieved a
predicted increase in sales of 2.86 percent over the baseline. In this
case, the neural network model did not do as well as the other two.
Since this technique was new to VCS, this may reflect the fact that
neural network models are less tolerant of naive users than decision
tree models and can benefit from more extensive data preparation as well
as a greater degree of experimentation with model parameters.

The Regression Model

We have not discussed regression much in this book,
because it is generally considered to be "statistics" rather than "data
mining." This distinction is a bit arbitrary since data mining is first
and foremost about creating predictive models and that is exactly what
regression is for. Presumably, the reason that many data mining tools do
not include a regression module is that it is hard to market regression
as a leading edge technology. For the same reason, we have not given
much attention to regression in this book, but if you are interested,
the sidebar contains an introduction to the concept. The regression
model built for this study was applied to a catalog that VCS calls the
"Holiday Catalog" that comes out in October, and to the Christmas
catalog that comes out shortly thereafter. Inputs to the regression
model were RFM code, number of categories purchased, average items per
order, fourth quarter purchases, and food purchases. Using this model
achieved a predicted increase in sales of 3.89 percent over the baseline
for the Holiday mailing and 4.99 percent for the Christmas mailing.

The Decision Tree Model

The decision tree model did the best of all,
apparently because it was able to make good use of a wider range of
variables to find many small pockets of profitable customers. Variables
that proved useful included beauty, number of categories, months since
last order, food, bath, women's footwear, hosiery, and outdoor. This
model achieved a predicted increase in sales of a spectacular 12.83
percent.

the same score who were picked by the RFM model.

Calculating the Return on Investment

To calculate the return on investment from using the new
data mining techniques, Vermont Country Store took a very conservative
approach. They assumed that the worst-performing model was
representative and so based their figures on a 2.86 percent increase in
sales. The ROI is the ratio of the extra revenue brought in due to the
models, to the money invested in data mining. Both of those figures are
confidential, but VCS is happy to share the ratio itself: They calculate
that this project had a return on investment of 1,182 percent.

Future

Given returns like that, data mining has a bright future at VCS. Plans
for the future include models with a variety of target variables to
explore alternate corporate strategies. Models can be built based on
response rate, sales dollars, profit dollars, house file growth,
reactivation, higher order values, lower returns, or anything else the
company would like to improve. Even further out, VCS would like to
develop optimal contact strategies for each customer, offering each one
only those catalogs to which they are likely to respond at a profitable
level.
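
The return-on-investment calculation described earlier (extra revenue attributed to the models, divided by the money invested in data mining) is simple enough to write down directly. The sketch below is only an illustration: the function name `roi_percent` and the baseline and investment figures are hypothetical, since the actual VCS figures are confidential. Only the 2.86 percent lift comes from the text.

```python
def roi_percent(baseline_sales, lift, investment):
    """ROI as defined in the text: extra revenue divided by money invested.

    baseline_sales: sales under the existing mailing approach (hypothetical)
    lift:           fractional sales increase, e.g. 0.0286 for 2.86 percent
    investment:     money spent on software, training, and so on (hypothetical)
    """
    extra_revenue = baseline_sales * lift
    return 100.0 * extra_revenue / investment


# Hypothetical inputs; they are NOT the confidential VCS numbers.
example_roi = roi_percent(baseline_sales=1_000_000.0,
                          lift=0.0286,
                          investment=10_000.0)
```

With these made-up inputs the result is roughly 286 percent. Using the lift of the worst-performing model, as VCS did, keeps the estimate conservative.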

Expected Benefits

Vermont Country Store expects to see benefits from
the new approach that go beyond better response rates, including:

Fewer keys to keep track of. The data mining techniques yield a single score
no matter how many variables go into the model. That frees VCS to use
input variables that were previously left out in order to prevent an
explosion of customer "buckets."

Fresh Segmentation. The same people
tend to end up in the best RFM buckets time after time, and there is a
danger of "exhausting" these people by targeting them too often.

Reactivation. The new models are already finding large numbers of
responders among segments overlooked by the RFM models. The recency and
frequency components may be to blame. Consider the case of seasonal
buyers who buy only at Christmas, or only when it is time to open up the
summer place at the lake. They buy infrequently and so at any given time
their most recent purchase may not be very recent, and yet, if reached
at the right time, they may respond very profitably.

Lessons Learned

The
most important lesson of this chapter is that it is not only huge,
multinational corporations that benefit from data mining. Vermont
Country Store is a medium-sized, family-owned company with a talented,
but small marketing team. Despite their size, they are a showcase of
best practices in database marketing and are successfully applying
advanced data mining techniques while some much larger companies dither
on the sidelines.
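
The reactivation point made under Expected Benefits, that seasonal buyers look unattractive to an RFM scheme, can be made concrete with a small sketch. Everything here is an assumption for illustration: the cutoffs, the three-part scoring functions, and the sample customer are hypothetical and are not the RFM codes VCS uses. A score of 1 is the best bucket.

```python
from datetime import date


def score_low_is_good(value, thresholds):
    """Bucket 1 if value <= thresholds[0], bucket 2 if <= thresholds[1], ..."""
    for i, t in enumerate(thresholds, start=1):
        if value <= t:
            return i
    return len(thresholds) + 1


def score_high_is_good(value, thresholds):
    """Bucket 1 if value >= thresholds[0], bucket 2 if >= thresholds[1], ..."""
    for i, t in enumerate(thresholds, start=1):
        if value >= t:
            return i
    return len(thresholds) + 1


def rfm_cell(order_dates, order_amounts, as_of):
    """Return a (recency, frequency, monetary) bucket triple; cutoffs are made up."""
    days_since_last = (as_of - max(order_dates)).days
    r = score_low_is_good(days_since_last, (90, 180, 365))
    f = score_high_is_good(len(order_dates), (6, 3, 2))
    m = score_high_is_good(sum(order_amounts), (500.0, 200.0, 50.0))
    return r, f, m


# A hypothetical Christmas-only buyer, scored just before the holiday mailing.
christmas_buyer = rfm_cell(
    order_dates=[date(1996, 12, 10), date(1997, 12, 10), date(1998, 12, 10)],
    order_amounts=[150.0, 150.0, 150.0],
    as_of=date(1999, 10, 1),
)
```

Scored in October, this customer's last order is more than 180 days old, so the recency component lands in the third bucket even though the customer has ordered every December. A model that can use seasonal variables directly is not bound by that limitation.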

ural applications for data mining because in general, you know much
more about current customers than you could possibly find out about
external prospects. Furthermore, the information gathered on customers
in the course of normal business operations (balances, preferred
language at ATMs, average number of credits and debits per month,
payment history, home address, and the like) is much more reliable than
the data purchased on external prospects. This case study shows how one
bank built on data drawn from its customer information file to improve
its ability to cross-sell new products to existing online banking
customers by determining each existing customer's best next offer: the
offer that is most likely to elicit a positive response from that
customer.

Business Problem

Our client was one of the largest banks in
the United States with assets on the order of $100 billion and millions
of customers. The online banking division is a small part of that, with
fewer than half a million customers at the time of this project, but its
aggressive plans call for double-digit yearly growth rates. The project
had immediate, short-term, and long-term goals. The long-term goal was
to increase the bank's share of each customer's financial business by
cross-selling appropriate products. Armed with a best next offer model,
the bank would be able to greet each customer with a clickable banner ad
for the next product he or she is most likely to want. The same model
could, of course, guide other channels such as e-mail, traditional
direct mail, and outbound telemarketing as well. The short-term goal was
to support a direct e-mail campaign for four selected products
(brokerage accounts, money market accounts, home equity loans, and a
particular type of savings account). E-mail is a very cheap channel for
contacting customers. There is almost no monetary cost for sending an
e-mail; that is, sending one e-mail costs about the same as sending one
million. If the channel is so cheap, why do we care about accurately
targeting e-mail? After all, the bank could just send messages to all
online customers. It is still important to target e-mail effectively
because customers who might read one targeted e-mail message are less
likely to read 30 random messages. And, perhaps more importantly,
customers who have given their permission to be contacted by e-mail will
change their minds and withdraw their permission to be contacted if they
begin receiving too many off-target messages. The immediate goal was to
take advantage of a data mining platform on loan from SGI to demonstrate
the usefulness of data mining to the marketing of online banking
services. This demonstration would justify the purchase of data mining
software and the hardware on which to run it. The initial modeling work
was done on-site. The four models used
for the initial e-mail campaign were developed during this first phase
using the MineSet data mining package running on a two-processor SGI
Origin 200 server and an SGI O2 client workstation. Later stages of the
project were performed at the DSS Lab (www.dsslab.com) in Cambridge,
Massachusetts using a nearly identical software and hardware
configuration.

Unsolicited e-mail falls somewhere between
old-fashioned junk mail and the dreaded dinner-time phone call in terms
of its ability to annoy and alienate a customer. To avoid having your
e-mail messages regarded as "spam," be careful to use them sparingly and
only to convey information that is likely to be of interest to the
customer. By signing up for online banking, customers have implicitly
expressed a willingness to transact business on the net, but you should
be careful to get their explicit permission to be contacted
electronically with important information about their accounts. That
way, your marketing messages will not be truly unsolicited and may even
be welcomed as useful information.

Data

The initial data comprised 1,122,692 account records
extracted from the Customer Information System (CIS). The extract represents a
snapshot of all online accounts as of February 1998. The 143 variables
extracted for each account included internally generated data, such as
account opening date and account balances, along with personal data
provided by the customers themselves, such as names and addresses. The
CIS also contained a small amount of externally purchased demographic
data. The customer data was available only as snapshots in time of
account-level data. The particular snapshot used for modeling was June
1998. Although the snapshot contained some historical data such as
balances for the last several cycles, the fact that our view of the
customers was essentially frozen did hamper our ability to apply some of
the best practices for building cross-sell models described later in
this chapter. In particular, it was not possible to model current
holders of accounts of a particular type at the moment just prior to
their having obtained that particular account. Instead, we had to work
with the current account holders as they appeared in the snapshot month.
In preparation for the data mining engagement, analysts at the bank
created a SAS data set containing an enriched version of the extracted
data. The SAS data set included derived variables that the bank had
found to be useful in the past such as the lag between the date a person
first becomes a bank customer and the date they first sign up for online
services. As usual, many more
derived variables were added during the course of data mining and
some of the existing variables had to be transformed in order to
accommodate limitations of the data mining algorithms. Table 10.1 shows
the variables extracted from the customer information system.

Table 10.1 Variables Extracted from the Customer Information System
Focus on Single Accounts

Type  Variable
Char  Account Number
Char  Statement County
Char  Employee Account Flag
Num   AOL User Flag
Num   Open Date
Char  Prime Household Phone Number
Char  Acquired From
Num   Account Status
Char  Area of Dominant Influence (ADI)
Char  Estimated Household Net Worth
Char  Statement ZIP
Num   Balance #1: CURRENT CYCLE
Num   Balance #1: CYCLE-1
Num   Balance #1: CYCLE-2
Num   Balance #1: CYCLE-3
Num   Balance #1: CYCLE-4
Char  Subproduct
Char  Customer ID
Num   Debit Card use 60-day, Card 1
Num   Balance Inquiries 60-day, Card 1
Num   ATM Payments End, Card 1
Num   POS
Num   Card 1 Network ATM use 60-day

From Accounts to Customers

The data extracted from the customer
information system had one row per account. This reflects the usual
product-centric organization of a bank where managers are responsible
for the profitability of particular products rather than the
profitability of customers or households. The best next offer project
required pivoting the data to build customer-centric models. To be
useful for cross-selling, the 1.2 million account-level records
extracted from the customer information system had to be transformed
into around a

Table 10.2 (Continued)
if customer has at least one wholesale DDA product
if customer has private banking services
ce for asset management products

Note that many of the variables used as inputs to the
model-building process are derived variables that were not part of the
original extract. Some values, such as length of tenure, have been
binned into ranges. Others, such as total balance, are values calculated
from the original fields. A few are flags added to reflect groups of
people that the bank considered interesting, such as people who tried
online banking within 30 days of becoming a bank customer and people
whose total deposits with the bank were over $50,000.

Defining the Products to Be Offered

The customer information system recognized
several hundred different products, many of which are simply small
variations on a theme. This level of product differentiation is too
detailed for the kind of marketing campaign we were supporting. For
example, the bank might make someone an offer of a savings account
without trying to determine which of several variants would be most
likely to appeal. These variants offer different interest rates based on
total balances at the bank, other types of accounts, and so on. In fact,
there are business rules for determining which savings account is most
appropriate for a given customer: data mining can figure out that a
savings account is appropriate, and then business rules take over to
determine which one in particular.

The number of product codes is often dauntingly large. And,
when there are too many codes (more than a few dozen), it is difficult to
develop good cross-sell models; there are simply too few instances for
each one. Often, many of the codes refer to the same type of thing, such
as a checking account or a home mortgage, with just minor (from the
point of view of marketing) differences between them. Look for a
hierarchy that describes the products at the right level. There is a
budgeting application that rolls up account types into a hierarchy of
product category, account type, and subtype. The four major categories
are deposit account, loan, service, and investment. The marketing people
decided that, with a few modifications, the account-type level of this
preexisting hierarchy would serve well. From a marketing perspective,
some of the account types are essentially the same, such as certificates
of deposit (CDs) and time deposits (TDs). These account types were
combined into a single category. The product categories were used as the
target variables for modeling. That is, a model predicted who would have
CD/TD, or home mortgages, or whatever. The individual product types were
retained as input variables. Table 10.3 shows the 45 product types used
for the best next offer model. Of these, 25 are products that may be
offered to a customer as part of this campaign. Information on the
remaining (business-oriented) account types is used only as input
variables when building the models.

Table 10.3 Product Types Used in the Best Next Offer Model

BBK  Business Brokerage                  No
BCC  Business Credit Card                No
BCD  Business Certificate of Deposit     No
BIC  Business Interest Bearing Checking
BIL  Bill Pay                            Yes  106,949
BLD  Business Loan Division              No
BLN  Business Line of Credit             No
BMM  Business Money Market               No
BMR  Business Market Rate                No
BMS  Business Money Market               No
WDA  Wholesale DDA                       No
XTR  Linked Savings                      Yes  219,695  SAV

The
marketing campaign supported by the best next offer model was aimed at
individual consumers, not businesses, so none of the business account
types were used as target variables in the cross-sell models.
Information on business accounts was retained as input to the models. It
seems quite likely, in fact, that someone who has both business and
personal accounts may behave differently than someone who has only
personal accounts. With that in mind, one of the derived variables we
added was a flag indicating whether a person has any accounts classified
as business rather than personal. If people who own their own businesses
really do exhibit different behaviors with respect to their personal
accounts as well, a business products flag will allow that pattern to be
found.

Approach to the Problem

To accommodate both the short- and
long-term goals of the project, we wanted the best next offer model to
be built up from component models for the individual products. Our
approach was to build a propensity-to-buy model for each product
individually. The individual propensity models can be used to sort a
list of prospects for each product so that those most likely to respond
to a product offer are at the top of the list. Then, once each customer
has been given a score for every product, the scores can be combined to
yield the best next offer model: Customers are all offered the product
for which they have the highest score. Of course, we had to take special
care that the scores developed by each individual propensity-to-buy
model were comparable to the scores developed by other models. This
approach meant that we could start with the four models needed to
support the near-term e-mail campaigns for brokerage, savings, home
equity, and money market accounts, and scale up to the larger campaigns
in later stages.

STG  Mutual Funds   Yes  3,880
STU  Student Loans  Yes  7,430
TD   Time Deposit   Yes  3      CD

Comparable Scores

To be useful for the best next offer model, the
scores from the various product propensity models must be comparable.
But what does it mean to be comparable? We came up with the following
list of requirements:
1. All scores must fall into the same range: zero to one.
2. Anyone who already has a product should score zero for it.
3. The relative popularity of products should be reflected in the
   scores. That is, the average score for popular products should be
   larger than the average score for less popular products.
The first requirement is
really only that all scores should fall within the same range. The range
zero to one is nice because it is the way probabilities are normally
expressed and these scores may be thought of as the probability that a
customer in a given leaf will want the product being modeled. The second
requirement reflects the bank's wish to avoid trying to sell people
accounts that they already have. Even for products where it might make
sense for the customer to have more than one (such as one credit card
for reimbursable expenses and another for purely personal expenses), it
is likely that the bank would want to send a different message to
promote the second instance of the product. The point, after all, is to
make the bank appear to understand its customers' needs. That means
avoiding communications that might make the bank just look plain dumb!
It is the third requirement that poses the biggest problem. Many
algorithms designed to accommodate requirement one would grant the
people most likely to want a given product a score of one and the people
least likely to want it a score of zero regardless of the number of
people who might possibly want the product. To see the problem with
that, imagine two products, one that could be used by anyone, and one
that can be used only by left-handed people. The vast majority of people
are right-handed and so have no interest in the left-handed product. Any
system that gives right-handed people a higher score for the left-handed
product than for the ambidextrous product is misleading. It is
fine for the occasional left-handed person to score very high for the
left-handed product, but the average score for that product will
necessarily be much lower than the score for the ambidextrous product,
reflecting the fact that most customers are not interested in it at all.
At the bank, many more online customers are interested in the bill
paying service than in student loans. This almost certainly reflects the
true nature of the customer population, so given a set of customers who
have neither, we should expect more of them to be offered bill paying
than to be offered student loans.
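
The three requirements can be made concrete with a small sketch. It assumes, hypothetically, that each product's decision tree has already placed the customer in a leaf and that the proportion of current product holders in that leaf, measured against the true customer population, is known. The product names, the numbers, and the function names comparable_scores and best_next_offer are illustrative only and are not taken from the bank's system.

```python
from typing import Dict, Optional, Set


def comparable_scores(leaf_rates: Dict[str, float], owned: Set[str]) -> Dict[str, float]:
    """Turn per-product leaf proportions into comparable scores.

    leaf_rates: product -> share of current holders in the customer's leaf
                (a proportion between 0 and 1; requirement 1).
    owned:      products the customer already has (scored zero; requirement 2).
    """
    scores = {}
    for product, rate in leaf_rates.items():
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"rate for {product} must be between 0 and 1")
        scores[product] = 0.0 if product in owned else rate
    return scores


def best_next_offer(scores: Dict[str, float]) -> Optional[str]:
    """Return the product with the highest score, or None if every score is zero."""
    if not scores:
        return None
    product = max(scores, key=scores.get)
    return product if scores[product] > 0.0 else None


# Made-up leaf proportions for one customer who already uses bill paying.
leaf_rates = {"bill_pay": 0.23, "student_loan": 0.02, "brokerage": 0.04}
scores = comparable_scores(leaf_rates, owned={"bill_pay"})
offer = best_next_offer(scores)
print(scores, offer)
```

Because every score is a proportion of the same customer population, a popular product such as bill paying will tend to average higher than a niche product such as student loans, which is what the third requirement asks for.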

Becoming a Customer Changes the Way You Behave

The first problem is
that the propensity models score prospects based on their similarity to
current customers, but current customers may look different than they
did at the time that they signed up for the product. Certificates of
Deposit (CDs) provide a good example. CDs are essentially
interest-bearing savings accounts where the customer promises to leave a
certain sum on deposit for a predetermined period. In return for giving
up immediate access to the money, the customer earns a higher interest
rate. It is a safe guess that people who own CDs probably do not have
large balances lying around in their ordinary savings accounts, because
they would move this money into CDs. Does that mean that we should look
for CD prospects among the customers with low savings balances? Surely
not! Prior to purchasing a certificate of deposit, the CD customer must
have had the purchase price available somewhere. Of course, there is no
guarantee that this somewhere was another account at the same bank,
but in many cases it probably was. The best approach is to build models
based on the way current customers looked just before they became
customers. Unfortunately, this approach requires a fairly sophisticated
data warehouse (and a fairly complex query) to get the requisite data.
For each current holder of a particular account, this requires going
back in time months or years to get a snapshot of his or her other
accounts in the month prior to opening the particular account. For each
product, there could be a different base month for each customer.
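
One way to picture the query just described is the following sketch. It assumes a hypothetical warehouse that keeps one snapshot per customer per month; the dictionary layout, the field names, and the function pre_purchase_rows are assumptions made for illustration and do not describe the bank's actual schema.

```python
from typing import Dict, Tuple

Month = Tuple[int, int]  # (year, month)


def previous_month(month: Month) -> Month:
    """Return the calendar month before the given one."""
    year, m = month
    return (year - 1, 12) if m == 1 else (year, m - 1)


def pre_purchase_rows(
    open_month: Dict[str, Month],
    snapshots: Dict[Tuple[str, Month], dict],
) -> Dict[str, dict]:
    """Build one modeling row per product holder from the month before opening.

    open_month: customer id -> month in which the product was opened.
    snapshots:  (customer id, month) -> attributes of the customer that month.
    Holders with no snapshot for their base month are skipped.
    """
    rows = {}
    for customer_id, opened in open_month.items():
        base = previous_month(opened)  # a different base month per customer
        snapshot = snapshots.get((customer_id, base))
        if snapshot is not None:
            rows[customer_id] = {"base_month": base, **snapshot}
    return rows


# Made-up data: c1 moved savings into a CD in April 1998.
snapshots = {
    ("c1", (1998, 3)): {"savings_balance": 12000},
    ("c1", (1998, 4)): {"savings_balance": 2000},
    ("c2", (1997, 12)): {"savings_balance": 8000},
}
cd_open_month = {"c1": (1998, 4), "c2": (1998, 1), "c3": (1998, 6)}
rows = pre_purchase_rows(cd_open_month, snapshots)
print(rows)
```

In this toy data the row for c1 carries the March balance, before the money moved into the CD, rather than the depleted April balance that a single current snapshot would show.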
Furthermore, each product would be modeled on a completely different
data set because each product has a different population of customers
with different account opening dates. As is often the case in the real
world, it wasn't possible to obtain data in just the right form so we
had to make do with a current snapshot instead. To reduce the impact of
the problem, we were careful not to use variables that seemed
particularly likely to be misleading for a given product. A derived
variable, total balance, dampened the effects of transfers between
different accounts owned by the same customer. Fortunately,
decision-tree models make it easy to spot which variables are being used
to make a classification. If a model is relying on a variable that is
likely to be influenced by the very thing that the model is trying to
predict, it should be removed from the input list.

Current Customers Reflect Past Policy

Looking at the bank's data, it is clear that some
products are much more popular than others. To some extent, this
reflects naturally occurring patterns: More people have use for checking
accounts than for home equity lines because more people pay bills than
own homes.

A model that does a very good job of finding people who have a
particular product may be useless for identifying prospects for that
product if it has simply learned to recognize some behavior that only
people with the product exhibit. We have seen many examples of this
problem, including one amusing case of a model that was very successful
at classifying people with voice mail: the model determined that people
who tend to make a lot of short-duration phone calls to the voice mail
retrieval number are very likely to have voice mail. True, but not
terribly useful for identifying prospects!

In other cases, a current policy of the bank is reflected in
the data: You don't get to be a private banking client unless you are
very wealthy. Data mining will find both of these patterns: more people
will be offered checking accounts than home equity lines and only the
very wealthy will be offered private banking. But what if the current
makeup of the population reflects some policy that is no longer in
force? For example, the low number of mortgages reflects the fact that
this bank historically has steered clear of the home mortgage market. If
that policy should change, many people who might in fact be interested
in a mortgage will not be identified simply because people like them
were never offered mortgages in the past. Although nothing of the kind
was suggested in the current case, a particularly pernicious form of
this problem involves past discrimination, or "redlining." If a bank
had a past policy of not granting mortgages in certain ZIP codes, that
would show up in the model as low propensity for mortgages in those ZIP
codes which, if not caught, could lead to a perpetuation of the
discriminatory practice. Similarly, if a certain product traditionally
has been offered only to suburban men, this sort of best next offer
model will continue to pick them as the most likely prospects, ignoring
urban women.

Building the Models

At the bank's suggestion, the first model
was built for brokerage accounts. These are of particular interest
because they are highly profitable and relatively underutilized. (Of
462,799 online customers in June of 1998, only 4,685 had brokerage
accounts.)

Finding Important Variables

Before actually trying to build
models, we needed to become more familiar with the data. The first step
was to use MineSet's statistics visualizer to look at
the distributions of all the input variables. This initial data
profiling revealed a few possible anomalies such as a few accounts that
were over 100 years old. According to the bank, which was founded in the
nineteenth century, there really could be accounts that old because some
kinds of account get passed down through generations. Even so, we
decided that these outliers were unlikely to aid the modeling process.
In general, though, the data appeared clean and consistent.

Using the Column Importance Tool

The first data mining algorithm applied was
MineSet's column importance tool. The column importance tool finds a set
of variables which, taken together, do a good job of differentiating two
or more classes (in this case, people with brokerage accounts and people
without brokerage accounts). The top three variables found by column
importance are not the three with the greatest individual discriminatory
power (for that, MineSet provides an evidence visualizer). Rather, it
finds a combination of variables that work together to increase the
purity of the classification. The column importance tool is often used
to select the best variables to map to the axes of a scatter plot for
data visualization. The column importance tool discovered that the most
significant factors for determining whether someone has a brokerage
account are:
• Whether they are a private banking customer
• The length of time they have been with the bank
• The existence of and balance in a money market account
• The Microvision code
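
The idea of choosing variables that work together, rather than the variables that are individually strongest, can be illustrated with a greedy sketch. This is not MineSet's algorithm, whose internals are not described here; it is a stand-in that adds, one at a time, the column that most reduces the conditional entropy of the target given the columns already chosen. The column names and the toy rows are made up for the example.

```python
from collections import Counter, defaultdict
from math import log2
from typing import Dict, List, Sequence


def conditional_entropy(rows: Sequence[Dict], columns: List[str], target: str) -> float:
    """Entropy of the target within groups of rows that share the same
    values on `columns`, weighted by group size. Lower means purer groups."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[c] for c in columns)
        groups[key][row[target]] += 1
    total = len(rows)
    entropy = 0.0
    for counts in groups.values():
        n = sum(counts.values())
        for k in counts.values():
            entropy -= (n / total) * (k / n) * log2(k / n)
    return entropy


def greedy_columns(rows, candidates: List[str], target: str, how_many: int) -> List[str]:
    """Pick columns one at a time, each time adding the column that gives
    the purest grouping in combination with those already chosen."""
    chosen: List[str] = []
    remaining = list(candidates)
    while remaining and len(chosen) < how_many:
        best = min(remaining, key=lambda c: conditional_entropy(rows, chosen + [c], target))
        chosen.append(best)
        remaining.remove(best)
    return chosen


def make_rows(private: int, tenure: int, brokerage: int, n: int) -> List[Dict]:
    # private_banking_copy duplicates private_banking: strong alone, redundant together.
    return [
        {"private_banking": private, "private_banking_copy": private,
         "long_tenure": tenure, "brokerage": brokerage}
        for _ in range(n)
    ]


rows = (make_rows(0, 0, 0, 3) + make_rows(0, 1, 0, 3)
        + make_rows(1, 0, 0, 1) + make_rows(1, 1, 1, 3))
candidates = ["private_banking", "private_banking_copy", "long_tenure"]
picked = greedy_columns(rows, candidates, "brokerage", how_many=2)
print(picked)
```

In the toy data the duplicate column is individually more discriminating than long_tenure, yet the second column chosen is long_tenure, because the duplicate adds nothing once private_banking is already in the set.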
and levels, but at the leaves there will be groups that are either
mostly nonbrokerage (the majority of leaves) or mostly brokerage. Each
path through the tree to a leaf containing mostly brokerage customers
can be thought of as a "rule" for predicting that an unclassified
customer who meets the conditions of the rule is likely to have or be
interested in a brokerage account. Building a good decision tree for
brokerage accounts took some time and some careful experimentation with
parameters. The initial tree, obtained by accepting all the default
parameters and just pressing the button, had exactly one node labeled
"nonbrokerage." When the density of the target variable is so low (only
1.2 percent of the customers had brokerage accounts), it can be hard to
beat such a simple model. One approach is to create a cost matrix that
punishes the model for misclassifying a brokerage customer as a
nonbrokerage customer much more severely than for misclassifying a
nonbrokerage customer as a brokerage customer. Another approach is to
use oversampling to increase the percentage of brokerage customers in
the model set to give the model a better shot at learning how to
recognize a brokerage customer when it sees one. The latter approach
proved most effective. The final tree was built on a model set
containing about one quarter brokerage accounts.

Figure 10.1 The MineSet evidence visualizer.
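
Oversampling of the kind described above can be sketched as follows. The text does not say exactly how the model set was drawn, so this example makes an assumption: it keeps every rare record and draws a random sample of the common records so that the rare class makes up about one quarter of the result. The function name oversample, the record layout, and the toy numbers are illustrative.

```python
import random
from typing import Callable, List, Sequence


def oversample(
    records: Sequence[dict],
    is_target: Callable[[dict], bool],
    target_share: float = 0.25,
    seed: int = 0,
) -> List[dict]:
    """Build a model set in which target records make up about `target_share`.

    All target records are kept; non-target records are sampled at random
    without replacement to fill the rest of the model set.
    """
    if not 0.0 < target_share < 1.0:
        raise ValueError("target_share must be strictly between 0 and 1")
    rng = random.Random(seed)
    targets = [r for r in records if is_target(r)]
    others = [r for r in records if not is_target(r)]
    wanted_others = round(len(targets) * (1 - target_share) / target_share)
    sampled_others = rng.sample(others, min(wanted_others, len(others)))
    model_set = targets + sampled_others
    rng.shuffle(model_set)
    return model_set


# Toy population: 12 brokerage customers out of 1,000 (about 1.2 percent).
population = [{"id": i, "brokerage": i < 12} for i in range(1000)]
model_set = oversample(population, lambda r: r["brokerage"])
share = sum(r["brokerage"] for r in model_set) / len(model_set)
print(len(model_set), share)
```

Scores produced by a tree trained on such a model set reflect the oversampled density, not the density in the real population, so they need to be adjusted before being compared with scores for other products.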

Figure 10.7 The model finds 100 percent of the
brokerage accounts in the top 30 percent of the test set.

are not
interested in them. Most people get a score of zero or very close to
zero for brokerage account, and the actual brokerage customers are found
among the few who have higher scores. We can expect similar lift curves
whenever we are searching for something that can easily be ruled out for
most of the population: Inuit speakers in France; Macintosh users at a
bank; curling fans anywhere but Scotland. In any of these cases, the
algorithm does not have to work very hard to find the bottom 98 percent
of the population. In cases like these, it makes sense to filter out the
boring 98 percent of the population in advance. Had we restricted
ourselves to only customers with at least $50,000 in the bank, the lift
curve for the brokerage model would have had a more usual shape and told
us more.

The Problem with Private Banking Clients

So now we know that
private banking clients who don't have money market accounts are the
best prospects for brokerage accounts. Unfortunately, we are not allowed
to contact them. All communications with the private banking clients go
through their private banker; that is what it means to be a private
banking client. Such customers are too valuable to be bothered by mere
marketing campaigns.
Of course, the online bank can pass the information on to the
private banking group, but they probably already know this anyway.
Fortunately, all is not lost. Since the very first split in the tree
takes away all the private banking clients on the right-hand side, the
left-hand split remains as a perfectly good model of who, among the rest
of the population, is a good brokerage prospect. The lift curve of this
model is not as good as for the complete model, but it is much more
useful because it identifies prospects that the bank is allowed to
contact.

Brokerage Model Performance in a Controlled Test

When the
brokerage model, without the private banking clients, was put to the
test, it did a respectable job of finding brokerage candidates given the
general unpopularity of that product. To perform the test, the bank
created three groups of 10,000 people each: a model group, a control
group, and a hold-out group. The model group consisted of 10,000 people
who got relatively high scores from the brokerage model. (Recall that a
high score is any score higher than the density of brokerage account
holders in the population, not a large number.) Everyone in the model
group received an e-mail message suggesting that they open a brokerage
account. The control group consisted of another 10,000 customers chosen
at random without the help of the model. They, too, got the e-mail
solicitation. The hold-out group had the same characteristics as the
control group, but its members did not receive the brokerage
solicitation. The results were as follows:
• The response rate of the model group was 0.7 percent
• The response rate of the control group was 0.3 percent
• The response rate of the hold-out group was 0.05 percent
The improvement in response rate from the control group to the model
group means that the model generated a lift of 2.3, albeit from a low
base. The extremely low incidence of members of the hold-out group
simply wandering in and asking for a brokerage account shows that the
e-mail solicitation was effective. The improvement in response rate from
no solicitation to untargeted solicitation was much greater than the
improvement from untargeted solicitation to targeted solicitation.
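
The arithmetic behind these comparisons is simple enough to check directly. The sketch below uses the three response rates and the group size of 10,000 reported above; the function name lift and the variable names are ours.

```python
def lift(rate: float, baseline_rate: float) -> float:
    """Ratio of a response rate to the baseline it is compared against."""
    if baseline_rate <= 0:
        raise ValueError("baseline rate must be positive")
    return rate / baseline_rate


GROUP_SIZE = 10_000
response_rate = {"model": 0.007, "control": 0.003, "hold_out": 0.0005}

# Approximate number of responders in each group of 10,000.
responders = {group: round(rate * GROUP_SIZE) for group, rate in response_rate.items()}

# Targeted versus untargeted solicitation.
model_lift = lift(response_rate["model"], response_rate["control"])
# Untargeted solicitation versus no solicitation at all.
solicitation_lift = lift(response_rate["control"], response_rate["hold_out"])

print(responders)
print(f"lift from targeting: {model_lift:.1f}")
print(f"lift from soliciting at all: {solicitation_lift:.1f}")
```

The ratio of the control rate to the hold-out rate comes out at about 6, against about 2.3 for the model over the control, which is the sense in which the solicitation itself mattered more than the targeting.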
Building the Rest of the Models

The models for the rest of the product
groups were built following essentially the same procedure as for the
brokerage model. For any product that was used by 25 to 75 percent of
the total population, we did no weighting and no back-fitting, but
built the models directly from the June data. For checking accounts
(interest-bearing and noninterest-bearing combined), we used weights so
that

Figure 10.8 Combining multiple models to find the best next
offer.

More Perfect World

Now that we have seen an actual case study
involving the construction of a cross-sell model, it is time to step
back and outline the steps we would follow to build one under ideal
conditions.
1. Determine whether cross-selling makes sense for this company. For
   cross-selling to make sense there should be a variety of products (at
   least five) that might reasonably appeal to a large number of
   customers. The products should be complementary, or at least not
   mutually exclusive. People are unlikely to want a gas stove and an
   electric stove.
2. Determine whether sufficient data exists to build a good cross-sell
   model. It should be possible to recapture the state of a current
   customer at any point in his or her tenure to whatever resolution
   historical data is kept.
3. Build propensity models for each product individually, making sure
   that the models produce comparable scores. The target variables for
   each model should reflect the way customers who now have the product
   looked at the time period immediately before they made the purchase.
   Our preferred approach is to assign each prospect to a leaf node of a
   decision tree built for the product and to use the percentage of
   existing customers at that leaf to assign a score for the product.
4. Except in the case where a customer gets a zero score for all
   products, each customer will have some product or products for which
   they have a higher score than the rest. That product is the best
   next offer.

Lessons Learned

This case study illustrates many
important things about cross-sell models. A cross-sell model may be
constructed by combining several independent models each of which
predicts the propensity of a customer to buy a par-

Figure 11.1 Building a good churn modeling application is more than
just building a good model.
(Figure labels: Source Data Systems; Churn Model; Churn App Server;
Churn App Data; Churn App GUI. Figure annotations: Over time, source
data systems are modified. New data sources become available that are
desirable for modeling. The performance of the churn model (or models)
needs to be tracked. Models need to be rebuilt. Assumptions about
modeling need to be revisited as the business changes. Users need to be
able to manage models and understand their performance without having a
PhD in statistics.)

As with all the case studies, we start with some background
on the industry.

Wireless Telephone Industry

Every reader of this book
is the customer of at least one telecommunications company-and probably
several. Some of us may have different carriers for local telephone
calls and long distance; some may use a dial-around service for
international calls. Some may also carry a pager served by one company
and a mobile telephone served by another.
Our own experience as customers often provides valuable insight when
doing data mining. Why would we or our friends or colleagues change
service providers? Service plans may no longer be competitive, or our
handset may be too old. Changes in job responsibilities may change
priorities, such as increasing or decreasing the need for international
calling. It is interesting to consider how available data may reflect
these different situations. Our experience provides insight at the micro
level, but these answers are not the whole story. This industry differs
from other retailing and service industries. Even though no two
companies are exactly alike, mobile telephone companies are more similar
than different, offering similar services to similar markets using
similar technology.

A Rapidly Maturing Industry

Once upon a time,
wireless telephones were so popular that the leading service providers
did not have to worry about customer churn. Many, many more customers
would join than would leave in any given year. Such a period of
exponential growth cannot last forever, since eventually everyone will
have a mobile phone, saturating the market. Figure 11.2 illustrates the
growth in the number of customers in a typical market, such as cellular.
As we see, the number of churners and the effect of churn on the
customer base grow significantly over time. Figure 11.3 more clearly
shows the increasing effect of churn. Initially, for every customer who
churns, there are several new customers who sign up for the service. The
focus during this stage is rightly on getting more and more new
customers. Even eliminating churn entirely during this stage of rapid
growth has little effect on the number of customers. As the market
matures, the churn rate rises until each new customer merely replaces
one who is leaving.

Figure 11.2 As exponential growth levels off, churn becomes a bigger
and bigger problem.
(Chart title: Representative Growth in a Maturing Market. Series: total
customers, new customers. Horizontal axis: year. Annotations: Initially,
growth is exponential and churn is not a problem. Many more new
customers are joining than are churning. As growth flattens out, churn
becomes more and more significant.)
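
The shape of this curve can be reproduced with a toy simulation. Everything in it is an assumption made for illustration: the market size, the starting customer base, the adoption rate, and a churn rate held constant at 10 percent per period. None of these numbers come from the case study. New customers are drawn from the part of the market that has not yet signed up, so acquisition slows as the market saturates.

```python
from typing import Dict, List


def simulate_market(
    periods: int = 10,
    market_size: float = 1000.0,
    start_customers: float = 10.0,
    adoption_rate: float = 0.9,
    churn_rate: float = 0.1,
) -> List[Dict[str, float]]:
    """Simulate customer counts when growth is limited by market size.

    Each period a fixed fraction of customers churns, and new customers
    join in proportion to both the current base and the unsaturated share
    of the market.
    """
    customers = start_customers
    history = []
    for period in range(1, periods + 1):
        churners = churn_rate * customers
        new_customers = adoption_rate * customers * (1 - customers / market_size)
        customers = customers - churners + new_customers
        history.append({
            "period": period,
            "new": new_customers,
            "churners": churners,
            "new_per_churner": new_customers / churners,
            "customers": customers,
        })
    return history


history = simulate_market()
for row in history:
    print(f'{row["period"]:2d}  customers={row["customers"]:7.1f}  '
          f'new per churner={row["new_per_churner"]:.2f}')
```

Under these assumptions there are several new customers for every churner in the early periods, and the ratio falls steadily toward one as the customer base approaches the size the market can support.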

Figure 11.3 The number of churners gets larger and larger, choking off
growth.
(Horizontal axis: time period, 1 to 10. Vertical axis: 0% to 50%.
Annotation: Eventually, every new customer is merely replacing a lost
one.)

There is some limit
on the total number of customers (such as the size of the population),
so growth must stabilize at some point. That is, as more and more people
use cellular phones, the business shifts away from signing on nonusers
in the general population. In a mature market, growth comes from three
areas:
• Cross-selling and up-selling: maximizing the profit of existing
  customers
• Retention and up-selling: keeping profitable customers and getting rid
  of (or upgrading) unprofitable ones
• Poaching: stealing new customers from competitors
The cellular telephone market is in the
process of maturing. Many parts of the world already have a more
saturated market than the United States. This includes many developing
countries, where a shortage of landline telephones has led to the rapid
diffusion of wireless technologies. The case study in this chapter took
place at the leading mobile provider in one of these newly developed
countries. It contains many lessons, not only about churn modeling, but
also about building good and effective models in general.

Some Differences from Other Industries

In many respects, telephone companies
are just another example of the service industry, similar to financial
services, insurance, and utilities. In other respects, selling telephone
service is more like selling retail products. There are some important
things to keep in mind when working in the mobile telephone industry:

into in detail. However, it is worth realizing just how invasive
call detail information can be; marketing campaigns can backfire if they
exhibit too much knowledge about customers.

The Business Problem

The
largest mobile telephone company in a newly developed country had been
investing in decision support technology for several years. The mobile
telephone market in their country had recently been deregulated and
several recent entrants were growing rapidly. At the same time, the
market was maturing and they recognized the need to move from reactive
marketing to proactive customer management. They and their handful of
competitors already supplied mobile services to over one-third of the
country's population, with each of the other competitors having about
half the number of subscribers of this company. The maturing of the
market and the increasing competition were now leading the company to
focus on existing customers, how to keep them, and how to make them more
profitable.

Project Background

Churn modeling was just one of the
responsibilities of their newly formed database marketing team. Another
relevant project was an ongoing data warehousing effort, whose prototype
was the primary data source for this churn modeling effort. During the
course of this effort, the data warehouse was in the process of being
migrated to a larger platform with more functionality, more data, and
more history. The first release was scheduled for delivery several
months after the completion of this project. Another relevant project
was a decision support application based on relational OLAP
(Microstrategy's DSS Agent) in its beta testing phase. This system
allowed business users to slice and dice marketing and sales data along
a number of dimensions, such as handset type, region, and time of day.
The OLAP system proved very useful for the churn modeling effort by
allowing quick answers to queries such as "What is the churn rate in
April and May for Club members versus non-Club members?" Throughout the
churn modeling project, the client was also interested in learning how
modeling efforts in the future would interact with other systems. What
other requirements does churn modeling impose on the data warehouse and
on the data marts?

The final question about churn is "when." We know that every customer
who joins is eventually going to disconnect for some reason, so a churn
model predicting who will churn in the next hundred years is easy to
produce: everyone is going to churn. The question of "when" is directly
related to how the information will be used. All of these possibilities
suggest that a more refined definition of churn may be useful to the
business. This project did not try to differentiate between different
types of voluntary churn. The best approach for differentiating is often
to build a model of who is going to churn first and then to figure out
why (using another model). Another approach to working with churn is to
build models that predict each customer's tenure, that is, how long each
will remain a customer. Such models require a sufficient amount of
historical data, which was not available. In addition, the purpose of the effort
described here was to produce a list for interventions during an
upcoming month. As a result, a churn model was most appropriate. Why Is
Churn Modeling Useful? With a definition of chum, lots of data, and a
powerful data mining tool we can develop models to predict the
likelihood to churn. The key to successful data mining is to incorporate
the models into the business. Because this was a real project, we can
admit one of the primary business drivers was an executive who insisted
on having a churn model by the end of the year. His reasoning was simply
that churn is becoming a bigger and bigger problem and well-run cellular
companies have churn models. He wanted his company to be the best.
Fortunately, there are many good reasons for churn models besides
satisfying the whims of executive management (even when they are right).
The most obvious is to provide the lists to the marketing department for
churn prevention programs. Such programs usually consist of giving
customers discounts on air time, free incoming minutes, or other
promotions to encourage the customers to stay with the company. For the
case study, the cellular company belonged to a conglomerate, and their
promotions offered products from sister companies that were not at all
related to telephone usage. Other applications of churn scores are
perhaps less obvious. Churn is related to the length of time that
customers are estimated to remain; that is, the customer lifetime. The
idea is simple: If a group of customers have a 20 percent chance of
churning this month, then we would expect them to remain customers for
five months (one month divided by 20 percent). If the churn score
suggested a churn rate of only 1 percent, then we would expect the
customers to remain for one hundred months. The length of the customer
lifetime can then be fed into models that calculate a customer's lifetime
revenue or profitability (also called lifetime customer value). Churn
models have an ironic relationship to customer lifetimes. If the churn
model were perfect, then the scores would either be a 100 percent chance
of churning in the next month, or a 0 percent chance. The customer
lifetimes would then be either one month or forever. However, because
the churn model is not perfect, it can provide insight into the length
of customers' lifetimes as well. Quite a different application is for
prioritizing customer segments. If a segment is more likely to churn,
perhaps they should not get the fabulous new offer for a discount on a
handset, an offer that will only start making money after the tenth
month. Of course, giving them the discount might also encourage them to
stay. The issue is not clear-cut, but having a churn score helps the
business make more informed decisions.

Three Goals

The churn modeling effort had several goals. There was a near-term goal
of returning value by building a list of probable churners for a
marketing intervention. The approach
taken to build this list could then be automated into a churn management
application. And this churn management application would, in turn, be
part of a larger customer relationship management system. Working with
three such diverse goals is a challenge in any project.

Near-Term Goal: Identify a List of Probable Churners

One of the first tasks in the
project was to talk to representatives from the marketing department and
to understand how they would use a churn score. There had been attempts
to build churn models in the past. In the initial discussion with
marketing, they pointed out a disappointing experience: a previous list
of 10,000 likely churners had fewer than 3,000 Club members. For churn
interventions, the highest value customers interested them the most. The
type of intervention that they had in mind was to offer incentives to a
list of about 10,000 customers using their outbound telemarketing
center. These incentives were not related to telecommunications; they
were discounts on products from other companies in the conglomerate that
owned this company. The discussions with the marketing group narrowed
the initial focus considerably. Instead of assigning a churn score to
all the customers, the marketing
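
The customer lifetime arithmetic described earlier under "Why Is Churn Modeling Useful?" (one month divided by the monthly churn rate) can be sketched in a few lines. This is only an illustration: it assumes the churn rate stays constant from month to month, and the function name is hypothetical rather than anything used in the project.

```python
def expected_lifetime_months(monthly_churn_rate):
    """Expected lifetime in months for a constant monthly churn rate."""
    if not 0.0 < monthly_churn_rate <= 1.0:
        raise ValueError("monthly churn rate must be in (0, 1]")
    # One month divided by the chance of churning in any given month.
    return 1.0 / monthly_churn_rate


# The two rates used in the text: 20 percent and 1 percent per month.
for rate in (0.20, 0.01):
    months = expected_lifetime_months(rate)
    print(f"{rate:.0%} monthly churn -> about {months:.0f} months")
```

A lifetime estimated this way is only as good as the constant-rate assumption, which is one reason the text later calls this use of churn scores a gray area.
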
These associated activities shift the focus from merely building a
data mining model to automating model building as much as possible.
Users of CMA would not need a Ph.D. in statistics to use it. In fact, a
prototype of CMA that focused much too much on the statistics of
modeling provided too little help in maintaining, testing, and updating
models. Of course, the database marketing team never used this
prototype, since it did not meet their needs. The need to automate the
model imposed several new requirements on the data mining model
building:

• Automated model building is incompatible with changing the modeling
  technique every month, since end users are unlikely to be able to make
  educated decisions about logistic regression versus decision trees
  versus neural networks, and so on.
• Automated model building is incompatible with manually pruning
  decision trees, since users are unlikely to understand the details of
  manual pruning.
• Automated model building is incompatible with clustering, since it is
  important to understand clusters from both a business and a technical
  perspective.
• Automated model building needs to have very reasonable defaults set
  for modeling parameters; that is, the application should be the
  repository for the best practices in building models.

The need for automation also
precluded some hybrid techniques, such as building a decision tree,
taking a bunch of the most significant variables, and feeding them to a
neural network (or, to a logistic regression routine). Such techniques
are risky without being overseen by knowledgeable people. Of course, CMA
could implement interfaces so more advanced users would have access to
more enhanced functionality. The issue here is defining a basic user
interface so users do not have to understand all the details of
modeling.

Long-Term Goal: Complete Customer Relationship Management

The long-term goal of the database marketing group was to include churn
management as just another facet of a complete customer relationship
management system. This initial project provided a good backdrop for
discussing the Virtuous Cycle of Data Mining and customer lifecycle
modeling. This project focused on building a model; the business still
needs to deploy the model and to measure its effectiveness over time.

Approach to Building the Churn Model

Building a churn model is a good example of the Virtuous Cycle of Data
Mining in action. In any business,
there is always a first time for data mining efforts
churn. They would roam in the first month, get the bill in the second
month, and, finding it too expensive, they would stop roaming. Another
valuable source of information is customer service data. This sometimes
has a paradoxical relationship to churn: We have seen companies where
customers who bother to call customer service are less likely to churn
than those who don't, perhaps because customer service is actually able
to solve their problems.

Build Models

Since churn is an ongoing business problem, building the models is not a
one-time event, unlike building a response model for a single marketing
campaign. For churn modeling, it
is a good idea to experiment to determine which model provides a good
fit to the data and to the business needs. Decision trees are a good
choice for modeling, since they provide rules that business users can
understand. Other techniques, such as neural networks or boosting,
reduce the understandability of the model as the price of a bit of
incremental lift. An important part of building the models is
determining the right set of derived variables to include with the
model. We recommend including variables that explain phenomena in the
real world as opposed to including mere mathematical transformations.
For decision trees, many mathematical transformations are of little use,
since the decision tree algorithms, unlike neural networks or
regression, use only the relative ordering of values and not their
magnitudes.
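
The point about relative ordering can be checked with a toy example. The sketch below is not EM's algorithm; it is a minimal single-variable Gini split search written for illustration, and the usage figures and churn flags are made up. It shows that taking the logarithm of an input leaves the chosen partition of records unchanged.

```python
import math


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split_partition(values, labels):
    """Return the set of record indexes sent left by the best Gini split."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best_score, best_left = None, None
    for cut in range(1, len(order)):
        # Only consider cut points that fall between distinct values.
        if values[order[cut - 1]] == values[order[cut]]:
            continue
        left, right = order[:cut], order[cut:]
        score = (len(left) * gini([labels[i] for i in left])
                 + len(right) * gini([labels[i] for i in right]))
        if best_score is None or score < best_score:
            best_score, best_left = score, frozenset(left)
    return best_left


# Hypothetical monthly minutes of use and churn flags for eight customers.
minutes = [5, 12, 40, 90, 300, 800, 1500, 2500]
churned = [1, 1, 1, 0, 1, 0, 0, 0]
log_minutes = [math.log(m) for m in minutes]

same = (best_split_partition(minutes, churned)
        == best_split_partition(log_minutes, churned))
print("Same partition after log transform:", same)
```

Because the logarithm preserves the ordering of the values, every candidate cut separates the same records, so the search picks the same partition either way. A regression or neural network fed the same two inputs would generally not behave identically.
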
Important derived variables (apart from churn rates) include: growth
rates of numbers of calls over time; proportion of calls of different
types; changes in the proportions; and calls to customer service. When
detailed call data is available, there is a much richer ability to
include customer behavior.

Deploy Scores

There are several ways to deploy churn scores within an organization:

1. The most static way is to make them available in a data warehouse or
   data mart environment. This is quite useful for business people, but
   it is important that they understand what the scores mean and the
   limitations on churn models.
2. Churn scores can also be used for marketing intervention campaigns.
   In this case, the scores may be generated for only a subset of the
   customers (such as the highest value customers or segments targeted
   by a competitor).
3. The business side may also use the churn scores for ongoing
   prioritization of customers for many different campaigns. They may
   want to focus some campaigns on customers who are not likely to
   churn, to increase the value of the responses.
4. Finally, churn scores can be used to estimate customer longevity as a
   factor in computing estimated lifetime customer value. This starts to
   enter a gray area, since the estimate of the churn probability is
   only loosely related to customer longevity and directly predicting
   longevity is a better solution in the long term.

Measure the Scores Against What Really Happens

Whenever approaching a data mining or modeling effort, measuring the
results is a very important part of the process. In the case of churn
modeling, it is important to measure two things:

1. How close are the estimated churn probabilities to the actual churn
   probabilities for each group? Answering this question requires
   measuring the actual churn rates for different groups and comparing
   them with the predicted probabilities.
2. Are the churn scores "relatively" true? That is, does a higher churn
   score imply a higher probability of churn, even if the predicted
   probability is off?

The relative values of churn scores are often more important than the
absolute values, so the second measurement is more important than the
first. In some cases, such as using the churn probability in a customer
lifetime value calculation, the accuracy of the probability is very
important. As with many such projects, measuring the results takes place
over a longer time span and was not available for the case study.
However, a key success factor was approaching the modeling with an open
mind, and this revealed an interesting segment of customers, described
later in the chapter.

The Project Itself

The project itself took place over a two-month period, with a senior
data miner present for three weeks during each month. In addition, two
experienced SAS programmers were involved with the project on a
part-time basis to build data sets and to score the final model. An
additional resource was available part-time to work as a liaison to the
rest of the company and to learn about modeling. This does not count
the occasional involvement of the marketing department and specific IT
resources for obtaining data. During the first month, the modeling
effort focused on the refined business problem of returning a list of
10,000 Club members likely to churn. The second month of the project
focused on finding effective models that could be incorporated into CMA.
models, one for the Club members and one for the rest?
And how can we make this decision? First, we discussed the issue with
the business customers. Are the drivers for churn likely to be the same
for the membership would be unlikely to appear as significant (since the
churn rates were similar), although the business felt it important. A
third segment eventually provided the most insight of all. This segment
consisted of recent customers who had joined in the previous nine months
and then churned. This led to some further investigation. Perhaps
customers who join at about the same time have similar reasons for
churn, and this insight can be applied to the other customer segments.
Well, not quite. It turns out that most customers who have been around
for more than a year are already Club members.

Table 11.2 Churn Rates by Customer Segment
Nonclub 1,750,000 35% 1.5% 55%
Recent

the churn rates were similar), although the business felt it important. A
third segment eventually provided the most insight of all. This segment
consisted of recent customers who had joined in the previous eight or
nine months. They were included as
several times during this modeling effort, several variables were
indeed too useful for predicting churn. These are variables such as the
termination reason and termination date, which should not have been
included as inputs at all. Business users often have a strong sense of
what constitutes important customer segments. Incorporating this
knowledge can improve modeling efforts, and one way to incorporate this
information is to build different models for each segment. When does
this make sense?

• When there is a clear and relatively stable definition of each of the
  customer segments, and the segments are nonoverlapping.
• When there are enough records to build effective models for all the
  different segments.
• When you are prepared to take a multimodel approach and combine the
  results from the models.

Another
advantage of segmenting the data is that it is often more feasible to
partition the data and build several smaller models than to use all the
data and build one big model. Each of the models can be tweaked for data
in that segment, so the Club membership segment, for example, could
contain length of time as a Club member and any special offers used by
the member. Of course, building separate models for different segments
is only advantageous when you believe that you can segment the customers
better than the automatic data mining algorithms. The explanatory power
of decision trees is another advantage. Business users understand rules
and understand the idea that certain variables are more important than
others. These make decision trees a natural choice for applications
involving not-so-technical end users. Finally, it is pretty easy to
automate decision trees. Although pruning decision trees is always a
challenge (in a later section, we will discuss alternatives to the
pruning that we used), the basic decision tree modeling process is quite
simple: Assign a bunch of parameters to reasonable defaults, create the
model set, throw the model set at the model, and just wait for the
decision tree to come back. There is no need to choose the most
important variables, or to figure out how to bin continuous values, or
how to encode categorical variables, and so on. The algorithm is a
natural for an automatic system such as CMA.

Different Types of Decision Trees

EM supports three different types of splitting function for
generating decision trees: CHAID, Entropy, and Gini. The textbook
definition of CHAID generally does not perform as well as the other two,
so it was not considered for modeling.
One of the drawbacks to CHAID is that it can only handle categorical
variables, so continuous values have to be binned somehow. EM has an
implementation of CHAID that automatically bins continuous values. And,
in fact, during the course of testing different models, it proved that
EM's implementation of CHAID performs about as well as the other
algorithms, even providing intelligent splits on continuous values. EM
offers the option of using multiway splits. Multiway splits are
appealing, because they make it easier to isolate extreme values. Figure
11.8 illustrates what a typical EM diagram looked like during the
exploratory portion of this project where different decision trees were
tested (using Gini and entropy as splitting functions, and allowing 2-,
3-, and 4-way splits). Typically, six models were built on any
particular model set or in testing any particular combination of
parameters. One purpose of building six models at the same time was to
see which performed the best and to get familiar with the performance of
the models on the data. Figure 11.9 shows a lift graph for several
different models at the same time. The performance of the different
types of models is generally pretty similar, although it is not uncommon
for one or two to be noticeably better than the rest.

Figure 11.9 This lift chart shows six different models (lift value
plotted against percentile; series: Baseline, GINI4, GINI3, GINI2, ENT4,
ENT3, ENT2, and Exact); the line at the top represents the theoretical
best the models can do.

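
The quantity plotted in a lift chart such as Figure 11.9 can be computed directly from scored records. The following is a minimal sketch of cumulative lift at a percentile, assuming each record has a model score and a known 0/1 churn outcome; the function name and the sample data are illustrative, not taken from the project.

```python
def lift_at_percentile(scores, churned, percentile):
    """Churn rate in the top-scoring slice divided by the overall churn rate."""
    n = len(scores)
    overall_rate = sum(churned) / n
    if overall_rate == 0:
        raise ValueError("lift is undefined when there are no churners")
    top_n = max(1, round(n * percentile / 100))
    # Rank records from highest to lowest score and keep the top slice.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    top_rate = sum(churned[i] for i in ranked[:top_n]) / top_n
    return top_rate / overall_rate


# Ten hypothetical scored customers; three of them actually churned.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
churned = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

for pct in (20, 50, 100):
    print(pct, round(lift_at_percentile(scores, churned, pct), 2))
```

Computing this value for each model at the same percentiles gives the curves that are compared in the chart; at 100 percent every model has a lift of 1.
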
Another important reason for building all these models was for the
models to verify each other. Think of each model as giving a different
view on the data, and their commonality provides insight. Which
variables do the models consistently use? Which are different? Are the
points where splits are made for continuous variables consistent? When
one model is radically different from the others, the information from
other models built at the same time helps us to determine if the model
really is better (has it uncovered something in the data) or if the
results are just a fluke.

Important EM Decision Tree Parameters

In addition to the splitting function and number of children, EM offers
several other parameters for creating decision trees. We recommend that
you think about the values for these parameters, because adjusting them
can have a sig-
more likely that the results on the training set will hold for other
sets of data. This is the law of large numbers. EM also lets the user
control the maximum depth of the tree. The default value of 6 is
insufficient; for this effort, 10 was used. Because of difficulties
visualizing trees online, the trees were often printed out. Deeper trees
require much larger amounts of paper.

Pruning Decision Trees

The first
decision trees built for the churn modeling project were too bushy; that
is, they had too many leaves and were clearly overfitting the training
set. The basic decision tree algorithm works by creating very large and
very bushy trees. Lurking inside this large tree is a smaller tree that
does a much better job of prediction (see Figure 11.10). To find the
smaller tree, the decision tree algorithms measure the performance of
every node on the test set, eliminating leaves and nodes that do not
improve predictability. This part of the algorithm is known as pruning
the decision tree. The details of EM's pruning algorithm are not
appropriate for this chapter. However, it was obvious that the algorithm
failed on these trees because it consistently produced the largest
trees or trees with only one node. Arbitrarily setting the size of the
tree to a lower level always yielded better results on the test set (we
usually use the term "test set" for what EM calls the "validation set,"
and what EM calls the "test set" we call the "evaluation set"). This
problem was particularly evident when using very sparse model sets, with
less than 10 percent density of churners. To fix this problem, the team
developed SAS code to prune the trees for lift at a given percentile.
That is, the code would find the subtree that produced the best lift on
the top 1 percent of the data, or the top 10 percent of the data.
Increasing the density of churners in the model set to 30 percent
greatly improved the performance of EM's built-in pruning algorithm.
Even so, there were still some issues regarding churn. Looking at
individual trees, it was possible to find highly suspect nodes that
logically should have been pruned back. Fortunately, though, these nodes
were in the minority and did not have a big effect on tree performance.
Pruning is an issue whenever we use decision tree algorithms. Since
pruning algorithms are buried deep inside the algorithmic code, it might
seem difficult for mere humans to understand if pruning is working well.
This is not the case; it is actually quite easy. Pruning is good when a
tree is consistent, or more specifically, when the results on the test set
are similar to the results on the training set. Figure 11.11 is a
snapshot of part of a decision tree that has an example of bad pruning.
This tree has three levels: At the top is the split for AGE > 35.5,
followed by splits on ZIP_RT, and TOT CL (EM happens to give the
formula instead of the variable name). Each node presented here repeated
in the case studies.

Figure 11.10 The challenge of pruning is to find the best subtree.

Divide and Conquer: Training, Test, and Evaluation Sets

As discussed earlier, the model set needs to be split into three
components, the training set, the test set, and the evaluation set. Each
of these components should be totally separate. They should not have any
records in common. Most data mining tools provide at least some support
for splitting the model set. In particular, the test set is important
when using most flavors of decision trees and for neural networks. The
test set is actually part of these techniques. It is used to make the
model more general; that is, to prevent it from overfitting the training
set. Since both the training set and the test set are used to create the
model, they cannot be used to evaluate its performance. Support for the
evaluation set is not always available in tools. In this case, part of
the model set needs to be manually separated from the rest and held
back. Once the model has been trained, the evaluation set can be scored
to get an idea of how the model performs on unseen data.
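
When a tool offers no support for holding back an evaluation set, the three-way split described above can be done by hand. The sketch below is one simple way to do it, assuming each record has a unique identifier; the 60/20/20 proportions and the function name are assumptions for illustration, not figures from the project.

```python
import random


def split_model_set(record_ids, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle record ids and cut them into three non-overlapping sets."""
    ids = list(record_ids)
    # A fixed seed makes the split repeatable from run to run.
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_test = int(len(ids) * test_frac)
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    # Whatever remains is held back for evaluation.
    evaluation = ids[n_train + n_test:]
    return train, test, evaluation


train, test, evaluation = split_model_set(range(1000))
print(len(train), len(test), len(evaluation))

# No record appears in more than one set.
assert not set(train) & set(test)
assert not set(train) & set(evaluation)
assert not set(test) & set(evaluation)
```

Slicing one shuffled list guarantees that the three sets have no records in common, which is the property the text insists on.
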

The Size and Density of the Model Set

The size and churner density
in the model set definitely have an effect on performance. The previous
section explained that low densities produced very poorly pruned
decision trees. This section covers the efforts to find the optimal size
and density for building models. It is almost always true that using a
larger model set will produce better models. However, the larger the
model set, the longer it takes to build any given model. And longer
build times mean that there is less time for experimentation and
learning about the data. So small model sets can be useful. Table 11.3
shows the predicted churn rates for different models built on different
model sets. When measuring the effect of model set density, it is
important to convert the model scores into predicted churn rates by
taking into account the oversampling rate (explained in Chapter 7). The
predicted churn rate for the top 1 percent of the validation set (the
highlighted box is the maximum value in each row) is shown in Table
11.3. The table clearly shows that the model set with 30 percent
churners and 50k records uniformly produces a better lift for all six
types of decision trees. One advantage of building multiple trees is to
increase confidence in exactly this sort of knowledge. Also notice that
increasing the density of churners for a given size always produces
better model results. We did not extend this investigation beyond a
density of 30 percent. The table also shows that there is no single best
model. Both Entropy and Gini sometimes give the best results; 2-way,
3-way, and 4-way splits are also spread among the best models. To choose
one model over the others was hard, but necessary. The GINI-3 tree and
ENT-2 tree gave almost identical results on the

Table 11.3 One Percent Lift for Different Models

Model Set Size  Churner Density  Results for the Six Decision Tree Models
20K             10.3%             4.94   7.37   6.59   4.97   5.47   6.59
20K             17.9%             6.66   7.05   7.06   6.66   7.38   -
50K              9.9%             5.89   5.55   5.83   5.39   5.61   5.61
50K             17.8%             8.20   6.94   7.22   7.04   6.45   6.85
50K             30.5%            11.70  11.53  11.59   9.85  11.68  11.59
100K            10.0%             7.20   8.16   -      7.90   7.75   8.20
100K            17.9%             9.22  10.12  10.14   9.23   9.34  10.18
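
Converting a score from an oversampled model set into a predicted churn rate, as mentioned above, can be sketched as follows. The details are deferred to Chapter 7, so this example uses the standard prior-correction formula as an assumption about the method, not as the authors' exact procedure. The function name and the sample rates are hypothetical.

```python
def adjust_for_oversampling(score, model_set_density, population_rate):
    """Rescale a score from an oversampled model set to the population."""
    if not 0.0 < model_set_density < 1.0 or not 0.0 < population_rate < 1.0:
        raise ValueError("density and population rate must be between 0 and 1")
    # Churners were over-represented, so they are weighted back down;
    # non-churners were under-represented, so they are weighted back up.
    churn_weight = population_rate / model_set_density
    stay_weight = (1.0 - population_rate) / (1.0 - model_set_density)
    churn_part = score * churn_weight
    stay_part = (1.0 - score) * stay_weight
    return churn_part / (churn_part + stay_part)


# Hypothetical case: model set with 30 percent churners, population at 2 percent.
for score in (0.30, 0.60, 0.90):
    print(score, round(adjust_for_oversampling(score, 0.30, 0.02), 4))
```

One sanity check on the formula: a score equal to the model set density maps back to the population churn rate, and the adjustment never changes the rank order of the scores.
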
deliver the scored list, and decide upon some action? At least
another week. It is the end of July before we can apply the model to
predict July churn. Churners identified by the model have already left
if the model is any good. For this effort, we needed to leave a month of
lag time to allow for scoring the model in the real world, as shown in
the bottom part of Figure 11.12. The latency had another beneficial
effect. In the models with no latency, one of the most significant
factors of churn was when call volume dropped to zero in the most recent
month. Well, in fact, this represented customers who had already turned
off their handsets but had not yet formalized the process, or for whom
there was a delay of a few days in setting the deactivation date. In
practice, these customers are not good candidates for churn prevention.

Translating Models in Time

The churn model uses data from the past to predict the
future. This implies that the model must slide forward in time. Certain
steps help to ensure that models can move forward in time. The first
step was to make all data relative to the "present." The present is the
first day of the month of latency (which corresponds to the first day of
the month when scoring will take place for the scoring data set). For
the time series data, this implies making fields relative to the churn
date rather than absolute time. Use names like CALL_1 and CALL_2 instead
of CALL_JUN and CALL_MAY. Notice what happens with absolute names. A name
like CALL_JUN may be four months before the churn date when the model is
first deployed. It is more useful to make all time-varying behaviors
relative to scoring month (which is equivalent to making them relative
to the churn date). The second step was to remove absolute dates
wherever they occurred. Absolute dates almost always cause trouble as
models are deployed further and further into the future. Dates were
translated into days before the "present." But what about seasonality?
Some things occur because of the time of year, because of important
events such as the holiday season and back-to-school. This was handled
by adding back in variables for the year and month of important dates.
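
The first two steps, together with the seasonality fix, can be sketched as below. This is an illustration under stated assumptions: the CALL_n naming follows the text, but the helper functions, the other field names, and the sample dates are hypothetical.

```python
from datetime import date


def months_before(present, year, month):
    """Whole months from the given year and month to the month of `present`."""
    return (present.year - year) * 12 + (present.month - month)


def relative_call_fields(monthly_calls, present):
    """Rename {(year, month): calls} to CALL_n, where n is months before the present."""
    return {
        f"CALL_{months_before(present, y, m)}": calls
        for (y, m), calls in monthly_calls.items()
    }


def activation_fields(activation_date, present):
    """Replace an absolute date with days before the present, keeping year and month."""
    return {
        "DAYS_SINCE_ACTIVATION": (present - activation_date).days,
        # Year and month are added back so seasonal effects are not lost.
        "ACTIVATION_YEAR": activation_date.year,
        "ACTIVATION_MONTH": activation_date.month,
    }


# Hypothetical customer, with the "present" set to the first day of July.
present = date(1999, 7, 1)
calls = {(1999, 6): 120, (1999, 5): 95}
print(relative_call_fields(calls, present))
print(activation_fields(date(1998, 12, 15), present))
```

With the present set to July 1, June usage becomes CALL_1 and May usage becomes CALL_2. When the model is scored a month later, the same field names pick up July and June without any change to the model.
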
In this data set, the only available date was the service activation
date, but this principle applies to other dates as well. When other
dates are available, some other difference may be significant, such as
the number of days from the activation date to the date a new service
was added or additional handsets were added on the same account. The
third step was to ensure that the model set had a mixture of churn data
from several months, as shown in Figure 11.13. Notice that in this
chart, June
data for Model Set Aug and Model Set Sep was available. By the end of
the month, predictions for November could be made.

The Data

The primary
source of data for this project was a prototype data warehousing system,
using Informix on a multiprocessor Sun system. This system had a minimum
amount of data modeling associated with it. Basically files from
operational systems were downloaded into relational tables. Because much
of the data came from the billing system, it was reasonably accurate and
complete. There were certain reference tables that were not available
directly in the Informix system and needed to be added from other
systems. However, no external data was incorporated into the model. At
the time the modeling was taking place, approximately nine months of
data was available for the project.

The Basic Customer Model

Figure 11.14 is a high-level entity-relationship diagram that describes the
data in this project. The definition of customer is rarely obvious in
most business environments, and the same is true here.
The most important level is the service level. A service is the
access to the telephone network provided to a single telephone number,
usually associated with a single telephone handset. Churn at the service
level was of primary interest, since one of the most important measures
of the business is the number of active telephone numbers. As mentioned
earlier, most customers pay their telephone bills through automatic
payment from a bank or credit account. The account level
refers to this actual account. Multiple services can share an account,
for instance, when there are several telephones in a household. In fact,
one of the billing plans, the family plan, provides discounts for
multiple services that all pay from the same account. The customer level
is the least defined of all. It is used only when a customer signs up
for a new service, and is not particularly useful for churn modeling. It
is included here only to note that what is called a customer may not be
the right level for modeling.

Figure 11.14 The important entities available for churn modeling.
Much effort has gone into creating databases and creating names for
important entities. When doing modeling for customer relationship
management, it would seem natural to use the data at the customer level.
However, this may not be appropriate. And, in some cases, the model of
the customer does not even include an entity called "customer."
Understand the data; don't just use the names.

From Telephone Calls to Data

Figure 11.15 provides a high-level view of how data about a call
moves through various data systems. A receiver receives a signal from a
handset and maintains contact with the handset. It, in turn, passes the
call over to a telephone switch that records every call passing over the
switch. This record is called a call detail record and it tells who is
making the call, the number called, the duration, time of day, and so
on. The switches also record dropped calls, incomplete call attempts,
and incoming calls, although these are not often used by the business.

Figure 11.15 (diagram): A call is made; the telephone connects to a
wireless receiver to complete the call. The receiver passes the call
through to a telephone switch that routes the call to its destination.
The switch records information about the call in a call detail record,
which feeds the bill processing system and the data warehouse. Other
data sources, such as the customer information file, are also fed in.

It is also possible to use data from the
receivers to identify the exact location of the call. However,
positional data (beyond the region covered by the tower) is not yet
available for large-scale data mining efforts. Call detail records for
billable events are passed into the billing system, which categorizes
and summarizes and slices and dices them to produce bills sent to
customers. The billing system classifies calls into different
categories, such as mobile-to-mobile calls, overseas calls, calls to
value-added services (similar to 900 numbers in the U.S.), calls to
directory assistance, and so on. The data in the billing systems
provides a rough summary snapshot of customer behavior every month, and
it is generally the only customer usage that is available for modeling.
In addition, this project had access to some hourly summaries of data
that provided some value for the segments containing Club members. The
billing systems, the call detail records, and other data sources are
then fed into the decision support systems, such as the data warehouse
that contains a history of data from the customer perspective. The first
prototype of the data warehouse provided the bulk of the data used for
this project.

Historical Churn Rates

One of the best predictors of churn in the near future is the recent history of churn, along different dimensions in the data. For this model, the historical churn rate was calculated along several different dimensions:

Handset churn rate. The churn rate for the handset was measured and then included in the data. The churn rates varied from 0.03 percent for the most popular and recent handsets to over 10 percent for older handsets with few users.

Demographic churn rate. The churn rate was measured for a combination of demographics including gender, age group, and geographic area, for a total of several hundred different combinations.

Dealer churn rate. The churn rate for each dealer where the customer bought his or her handset was included in the data.

ZIP code churn rate. The churn rate for the ZIP code of the customer's billing address (actually based on the first four digits of the ZIP code).

These churn rates turned out to be very
significant in the churn models. One of the best indicators of future
behavior is past behavior. To predict churn, we want to include
historical churn rates in the data. In fact, the business is quite aware
that handsets are a major driver of churn, and the handset churn rate
almost always turned out to be the most significant variable in the
predictive models.

Data at the Customer and Account Level

The customer and account levels contain basic descriptive information, including:

• Social security number.
• ZIP code of residence. Summary demographic information for the ZIP code was not available for this project.
• Market ID. The company split its service area into different marketing regions.
• Age and gender. These fields would not normally be considered to be accurate. However, the social security number in the country where this study took place encodes the date of birth and the gender, and a valid social security number was required when signing up for service.
• Pager indicator flag. The company also offers paging services.
Typically, the information available at the customer level is rather
sparse and often inaccurate because this data is usually self-reported
at the time of purchase and is usually collected by a network of
geographically dispersed dealers. Front-line salespeople have little
incentive to collect clean data and rarely do. For this reason, we
did not use other data, such as reported income, reported occupation,
and so on.

Data at the Service Level

The data provides information at the service level that gives insight into the nature of the specific service:

• The activation date and reason for activation (and deactivation date for churned customers)
• The features ordered by the customer
• The billing plan
• Handset type, manufacturer, weight, analog versus digital, and so on
• Dealer where the service was activated

This data was judged to be quite accurate. Unfortunately, the network of
dealers included both internal dealers and independent dealers, so
consistent information about them was not possible to obtain. In
particular, we would
have liked to include information such as the size of the
dealer and the actual price paid for the handset.

Data Billing History

The billing history contained monthly summaries for nine months. It contained only a few different line items, such as:

• Total amount billed, late charges, and amount overdue
• All calls (number of calls and amount billed)
• Overseas calls (number of calls and amount billed)
• Fee-paid services, such as 900 numbers in the United States (number of calls and amount billed)
• Directory assistance charges

This data provided several time series for this modeling effort.

Rejecting Some Variables

During the course of modeling, it became obvious that certain fields were doing more harm than good, usually because the models were overfitting the training data. This case study employed several strategies for identifying and working with fields that hurt.

Variables that cheat. First, it is critical to eliminate future variables that are redundant with the target. In this data, termination reason and deactivation date were always very good predictors of deactivation! Fortunately, the decision tree made this obvious, so they could easily be removed when they were accidentally left in.

Identifiers. The second category consists of identifiers, such as customer ID, telephone number, and social security number, which are useless as inputs into models. They uniquely (or almost uniquely) identify every single row, giving the algorithm no new information about each row. Fortunately, EM does a good job of identifying these variables and eliminating them from the models.

Very high skew. At the other extreme are variables that are so highly skewed that all or almost all the values are identical. One of the parameters of the decision tree algorithm is the minimum leaf size. If all values for a variable are the same except in at most about as many rows as the minimum leaf size, then that variable can be safely ignored: the decision tree algorithm will never use it to split a node (including such variables merely increases the time it takes to build models).
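
The identifier and high-skew checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the screening that EM performs: the record layout (a list of dictionaries), the function name `screen_variables`, and the `id_threshold` cutoff are all hypothetical.

```python
from collections import Counter


def screen_variables(rows, min_leaf_size, id_threshold=0.95):
    """Flag columns that look like identifiers or are too skewed to split on.

    rows: list of dicts, one per customer record (hypothetical layout).
    min_leaf_size: the minimum leaf size the decision tree will be run with.
    id_threshold: assumed cutoff; a column whose number of distinct values
        reaches this fraction of the row count is treated as an identifier.
    """
    n = len(rows)
    flagged = {}
    for col in rows[0]:
        counts = Counter(row[col] for row in rows)
        most_common_count = counts.most_common(1)[0][1]
        if len(counts) > 1 and len(counts) >= id_threshold * n:
            # Nearly every row has its own value: an identifier.
            flagged[col] = "identifier"
        elif n - most_common_count < min_leaf_size:
            # Fewer than min_leaf_size rows differ from the dominant value,
            # so no split on this column can satisfy the leaf-size limit.
            flagged[col] = "high skew"
    return flagged
```

Calling `screen_variables(rows, min_leaf_size=50)` returns a dictionary mapping each suspect column to the reason it was flagged; columns that pass both checks are simply absent from the result.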
included for variables such as dealer number and handset model. Another
approach not used in this project is simply to gather all the less
frequently occurring values together into a single categorical value.

Absolute dates. Absolute dates represent fixed points in time that make it unlikely that the model can be applied well in the future. As discussed earlier in this chapter, there are two sets of variables used to replace absolute dates. The first is the relative information for which this project used number of days before the present. It is sometimes useful to take the relative offset from other dates. The second is seasonality information that is included by storing the year and month, and sometimes the day of the month and the day of the week as separate variables.

Untrustworthy values. Some of the variables simply do not contain trustworthy data. This was particularly true of marketing data collected when new customers sign up for service. The sales force has little or no incentive to collect good data, so information about occupation, salary, and so on could not be relied upon. Other data, such as the weight in grams of handsets, was similarly known to be unreliable.

Derived Variables

There are two approaches to adding derived
variables. One is to add only variables that make sense; that is,
including variables which, if significant, can be explained to business
users who are barely numerate (it is good to plan for the worst case).
The second approach is to add combinations of variables that seem
important, even if they do not make apparent sense. This project took
more of the second approach.
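
One family of derived variables was already described above under absolute dates: a relative offset plus separate seasonality fields. A minimal Python sketch of that replacement follows. The function name `date_features` and the output field names are assumptions made for illustration; the project itself did this work in SAS.

```python
from datetime import date


def date_features(event_date, as_of):
    """Replace an absolute date with relative and seasonal variables.

    event_date: the absolute date found in the data (for example, activation).
    as_of: the "present" for the model set or score set being built.
    """
    return {
        # Relative information: number of days before the present.
        "days_before_present": (as_of - event_date).days,
        # Seasonality information, stored as separate variables.
        "year": event_date.year,
        "month": event_date.month,
        "day_of_month": event_date.day,
        "day_of_week": event_date.weekday(),  # Monday is 0, Sunday is 6
    }
```

Because `as_of` is passed in rather than read from the clock, the same code can be applied to model sets and score sets built for different months.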
the ratio of calls between months, and so on for units and usage.
Although this situation is fairly obvious, others are subtler. The
project was not long enough to take a more intelligent approach to the
derived variables that would start to eliminate the redundancy. In fact,
finding a small set of effective derived variables is a long-term
learning task for the data mining group. Besides, decision trees do a
good job of sorting through hundreds of variables so people do not have
to. There is a caveat, though. Two decision trees may not look alike in
the variables that they choose, but they may be fundamentally similar:
each node divides the population into about the same sets for its
children. The rules look different because different variables had close
to the same effect. Some additional derived variables proved useful as
well. The age of the customer and the length of service both suggested
interesting variables. For instance, what portion of a customer's life
has he or she been a customer? This is the length of service divided by
the customer's age. What is a rough estimate of the customer's worth so
far? A rough estimate is the length of service in months times the
average amount billed in a month. A minor point about implementing the
derived variables: Almost all the derived variables were added using SAS
code instead of using the transform variables node in EM. One reason was
the availability of seasoned SAS programmers. More important, though,
the SAS code could easily be applied to data sets for the four segments
without much trouble. Also, since SAS code is strictly text, it is
easier to input when defining lots of variables. EM requires using both
the mouse and the keyboard for creating new derived variables, which can
be a cumbersome task for more than a handful. The transform variables
node was used for a few on-the-fly derivations.
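
The two derived variables described above (portion of life as a customer, and rough worth so far) are simple enough to show concretely. The project wrote them in SAS; the following is a Python sketch for illustration only, and the field names `age_years`, `tenure_months`, and `avg_monthly_bill` are hypothetical.

```python
def add_derived_variables(customer):
    """Return a copy of a customer record with two derived variables added.

    customer: dict with hypothetical fields age_years, tenure_months,
        and avg_monthly_bill.
    """
    derived = dict(customer)
    tenure_years = customer["tenure_months"] / 12.0
    # Portion of the customer's life spent as a customer:
    # length of service divided by the customer's age.
    derived["tenure_share_of_life"] = (
        tenure_years / customer["age_years"] if customer["age_years"] > 0 else None
    )
    # Rough estimate of worth so far: months of service times
    # the average amount billed in a month.
    derived["worth_to_date"] = customer["tenure_months"] * customer["avg_monthly_bill"]
    return derived
```

Keeping the derivations in plain code, as the text notes for SAS, makes it easy to apply the same definitions to each of the four segment data sets.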

Lessons about Building Churn Models

One of the ideas we have stressed throughout the book is the need to be open-minded when approaching any data mining effort. This case study provides some lessons in building churn models. Some of these lessons are about the particulars of churn models. Others are more general. It is important to listen to the data
and to let it be the guide. At the same time, technical and business
hurdles must also be overcome. Data mining requires being flexible and
listening. This churn modeling effort reinforced these data mining
lessons.

Finding the Most Significant Variables

History is the best predictor of the future, and churn is no exception. Topping the list of significant variables is the handset churn rate. It appeared at the top node in almost every tree built during the course of the project. Other historical churn rates, such as churn by demographics and ZIP code, were typically present in the trees as well. Another very important variable
was the number of different telephones in use by a customer. This
variable was added rather late in the modeling process. It turns out
that customers with multiple telephones are much, much less likely to
churn than their counterparts with one telephone. Other important
variables included the number of changes of features over time, age, and
the market serving the customer. In terms of usage, the most important
billing variable seemed to be a decline to 0 usage in the most recent
month. This was often expressed in different ways, such as very low
values for ratios that included usage for the month. Surprisingly,
billing data rarely appeared near the top of trees, although it was an
important discriminator further down.

Listening to the Business Users

Before the project began, the primary goal was to create a model that
could assign a churn score to all customers. During the first week of
the project, the marketing group explained exactly what they needed in
order to act on a model: a list of 10,000 names of Club members. The
focus changed to meet the needs of the marketing group. The next goal
was to assign a churn score to all customers, presumably by building a
single model. From discussions with the business users, though, it
was evident that Club members and non-Club members were very
different segments. This led to the decision to build two different
models, and then combine them.

Listening to the Data

Data miners have
two ears, one for listening to the business users and the other for
listening to data. The two models suggested by the business users
quickly became three out of the realization that not all customers would
have six months of billing history. Recent customers became the target
of the third model. And then a very strange thing occurred. The model
for churn for recent customers far and away outperformed the other
models, as shown in the lift diagram in Figure 11.16. The churn model
for recent customers provided lift, at the high end, of over 50, when
oversampling is taken into account. This is a rare event in data mining,
and we wish we could promise it on all engagements. It led to further
investigation.

[Figure 11.16: The lift curve for the segment of recent customers is close to the theoretical best possible. The chart plots lift value against percentile for three curves: Baseline, Recent, and Exact.]
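
For readers who want to see how a point on a lift curve is obtained, here is a minimal Python sketch. It assumes a list of model scores and a matching list of 0/1 churn outcomes with at least one churner; the function name `cumulative_lift` is hypothetical. It computes lift on whatever set it is given and does not adjust for oversampling, which the text says was taken into account when quoting the lift of over 50.

```python
def cumulative_lift(scores, churned, top_fraction):
    """Lift of the top-scoring fraction of customers over the whole set.

    scores: model churn scores, higher meaning more likely to churn.
    churned: matching list of 0/1 outcomes (must contain at least one 1).
    top_fraction: for example 0.10 for the top decile.
    """
    ranked = sorted(zip(scores, churned), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, int(round(len(ranked) * top_fraction)))
    # Churn rate among the customers the model ranks highest.
    top_rate = sum(flag for _, flag in ranked[:top_n]) / top_n
    # Churn rate in the whole set, which is the baseline.
    overall_rate = sum(churned) / len(churned)
    return top_rate / overall_rate
```

Evaluating this function at a series of percentiles gives the points of a cumulative lift curve like the one in the figure.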
Of course, one possible explanation might have been that the billing data was interfering with a good model. Could adding the billing data hurt model performance? This is not as far-fetched as it sounds; some variables, such as the ZIP code and dealer number, were known to hurt performance because the model used them to overfit the training set. This is a simple enough hypothesis to test. Happily, removing the billing data for the other segments produced worse models. Hooray for more data!

In fact, the actual explanation is quite interesting. Part of the decision tree for recent customers showed the following rules:

• Number of handsets on the account is 1
• Billing plan is family basic
• Handset has a somewhat higher churn rate

This is a curious phenomenon.
There should be multiple handsets on an account with the family plan.
What is happening is that we are finding smart customers. Existing
customers do not get any discounts on new handsets. However, if they join
the family plan, add a new handset, and then cancel their old
one, they can still get a discount. Being able to see the rule (using
decision trees) has revealed an important and interesting customer
segment.

Including Historical Churn Rates

This case study does illustrate one of the most important lessons in data mining: the past
is often the best predictor of the future. For churn, the past is often
in the form of churn rates. For each series of data, the churn rates
should be calculated for the most recent month that data is available
(see Figure 11.17), even though the month is different for different
parts of the model set and score set. This helps ensure that the model
can slide through time better. There are a number of different churn
rates that might prove useful:

• Churn rate by handset model type
• Churn rate by demographics (age, gender, etc.)
• Churn rate by area (based on ZIP code or market ID)
• Churn rate by usage patterns

This last one
implies breaking down usage into a few dimensions, such as quintiles for
total billing, total number of calls, and average duration of calls, and
determining the churn rate for each cell. In all these cases, the number
of different cells should be between about 50 and 500. Although it would
seem desirable to combine them (such as churn rate by

[Figure: Timelines running from Jan through Nov that show the months covered by each model set and by the score set.]
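
The historical churn rates discussed in this section amount to a grouped average that is then attached to each customer as a new input. A minimal Python sketch follows, under stated assumptions: the rows are dictionaries, the 0/1 field `churned` refers to the most recent month for which data is available, and the names `churn_rate_by` and `attach_rate` are hypothetical.

```python
from collections import defaultdict


def churn_rate_by(rows, key_fields):
    """Churn rate for each cell, where a cell is one combination of key_fields.

    rows: dicts holding the key fields plus a 0/1 'churned' flag
        (hypothetical layout).
    """
    totals = defaultdict(int)
    churners = defaultdict(int)
    for row in rows:
        cell = tuple(row[field] for field in key_fields)
        totals[cell] += 1
        churners[cell] += row["churned"]
    return {cell: churners[cell] / totals[cell] for cell in totals}


def attach_rate(rows, rates, key_fields, name):
    """Add each row's cell churn rate as a new variable (None if cell unseen)."""
    return [
        dict(row, **{name: rates.get(tuple(row[field] for field in key_fields))})
        for row in rows
    ]
```

For example, `churn_rate_by(history, ["handset"])` gives a rate per handset model, and `churn_rate_by(history, ["gender", "age_group", "area"])` gives a rate per demographic cell. The rates must be computed from a month earlier than the one being predicted, or they become variables that cheat.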
This also indicates one of the values of building multiple models
with different parameters. If only one model was giving that result,
then we might think it was a fluke. However, when six models yield the
same conclusion, we can make an assumption about the model set. Using a
high oversampling rate has another advantage: all of the churners could
be used for the model set. And, indeed, it was decided to build the
final model sets using all the churners from two months and then mixing
them with about twice as many nonchurners.

Understanding the Algorithm and the Tool

During the second week of the project, results were not looking as good as they should have. Particularly frustrating was the fact that the billing variables were not improving the model at all. Why could this be? Such variables almost always provide some lift. It took some investigation with EM to understand that the default algorithm for pruning decision trees almost always returned either the entire tree or the tree with a single node. In fact, manually setting the number of leaves on the tree almost always gave a better tree. Fortunately, EM offers the ability to drop into the powerful statistical language of SAS. With this knowledge, we were able to write SAS code that does a good job of pruning the trees.

Lessons Learned

This chapter has described how to build a churn model, for a churn management application, using SAS Enterprise Miner. Four critical success factors for building a churn model are:

• Defining churn, especially differentiating between interesting churn (such as customers who leave for a competitor) and uninteresting churn (customers whose service has been cut off due to nonpayment).
• Understanding how the churn results will be used. A churn model being used to estimate customer lifetime values is different from one that simply returns a list of high-valued customers for a campaign.
• Identifying data requirements for the churn model, being sure to include historical predictors of churn, such as churn by handset and churn by demographics.
these telephone calls at the most basic level (who called whom and
when) would require hundreds of gigabytes of storage every day.
Technology is making storage and processing power cheaper. Even so, the
volumes are daunting. Data exploration plays a critical role in data
mining, and, in many cases, spreadsheets, relational databases, and OLAP
systems can yield significant insights into customers and the business.
In other cases, visualization tools and geographic mapping tools are
good choices. All these tools can work well, but what happens when the
volume of data is large, very large? This chapter demonstrates the
insights gained from several data exploration projects. One of them
involved approximately 20 gigabytes of data; another used a smaller
amount of data for business-to-business data analysis; and another, over
125 gigabytes of data. The latter case was particularly interesting,
because the technical work was completed in less than two weeks, using
powerful hardware, and software that could readily take advantage of it.
The companies sponsoring these projects have declined to be identified
for this chapter. They consider their data a key competitive advantage
and are wary of highlighting the fact that they are involved in such
work. This chapter combines experiences from several projects, mixing
and altering them so that no company is recognizable while retaining the
key characteristics of any project involving large volumes of
transaction-level data. This chapter highlights the power of parallel
processing when it is successfully applied to problems with large
amounts of data. Visualization is still one of the big challenges in
data exploration. In other chapters, we demonstrate how packages, such
as SGI MineSet, do a good job giving visual insight into data. In this
case study, the visualizations are generally simple and almost all are
done with the capabilities of desktop spreadsheets. The results
presented here are all based on presentations given to business users.
This chapter introduces a powerful technology for working with large
amounts of data. This technology is based on the idea of dataflows, and
in particular on dataflows in a parallel computing environment. After
introducing dataflows, the chapter introduces the business problem and
then continues with a voyage of discovery with particular examples of
finding and extracting patterns from call detail data. Business readers
will be more interested in the results; technical readers will likely be
interested in both the results and how they have been obtained.

Dataflows

A dataflow is a way of visually representing transformations on data. It is not necessary to understand dataflows to follow the results presented in this chapter. Readers not interested in the technical detail can skip to the next section.

Basic Operations

There are some basic building blocks that make dataflows useful for data analysis. To describe them, we will be using the terminology of Ab Initio (www.abinitio.com), the leader in graphical dataflow programming tools and the tool used for several of the projects incorporated in this chapter.

Compress and Uncompress make it easy to
read and write compressed data files. Compression is a way of storing
file data in less space, and these operations handle the processing
needed to work with compressed files. (The Windows utility WinZip is a
familiar example of a program to compress and uncompress files.)
Interestingly, storing data as compressed files and doing the processing
to uncompress them is often more efficient than storing the uncompressed
data. Computer processors are many times faster than disks so it can
take less time to read the smaller compressed file and uncompress it
than to read the larger uncompressed file.

Reformat is a component that
processes each input record individually to produce an output record.
This is useful for dropping unneeded fields, adding derived variables,
parsing dates, extracting parts of strings, and so on. Reformat can also
convert text data (such as comma-delimited fields familiar to users of
Microsoft Excel) into an internal format. The internal format increases
processing speed, so it is quite common to have a reformat just after
reading a text file. The reformat component reads the description of the
transformation from the aptly named transformation file. This file has a
powerful language for expressing the many types of transformations that
prove useful, including arithmetic operations, string operations, date
functions, and more. One useful operation looks up a value in a table; we
will be seeing several examples of this. The details of the
transformation language, though, are beyond the scope of this chapter.
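
To make the idea of a reformat concrete, here is a small Python sketch of a record-at-a-time transformation. It is not Ab Initio's transformation language, whose details are outside this chapter. The input field names, the `reformat` and `reformat_stream` functions, and the `REGION_BY_PREFIX` lookup table are all hypothetical.

```python
import csv
from datetime import datetime
from io import StringIO

# Hypothetical lookup table: leading digits of a number -> region name.
REGION_BY_PREFIX = {"212": "New York", "617": "Boston"}


def reformat(record):
    """Turn one input record into one output record."""
    # Parse the date held as text in the input.
    start = datetime.strptime(record["start_time"], "%Y-%m-%d %H:%M:%S")
    return {
        "from_number": record["from_number"],
        "to_number": record["to_number"],
        "start": start,
        "hour_of_day": start.hour,                      # derived variable
        "duration_seconds": int(record["duration"]),    # text to number
        # Extract part of a string and look it up in a table.
        "from_region": REGION_BY_PREFIX.get(record["from_number"][:3], "unknown"),
        # Any input field not copied here is dropped.
    }


def reformat_stream(text):
    """Apply reformat to each record of comma-delimited text, one at a time."""
    for record in csv.DictReader(StringIO(text)):
        yield reformat(record)
```

Because each output record depends only on its own input record, the function can be applied to a stream of any length without holding the data in memory.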

Select chooses certain records, based on selection criteria. This is the "if" statement and "where" clause of dataflow graphs. Select has an optional output for those that do not meet the selection criteria, the equivalent of an "else". The selection criteria can be arbitrarily complicated.

Sort orders all the records in a dataflow according to some
key. The operations we have seen so far can just read one or a small
number of records and generate an output record. Sort is different,
because it must read all its input records before producing any output.
In Ab Initio software, the sort operator is most efficient when all the
records fit into memory.

Aggregate takes a dataflow that has already
been sorted and produces summaries for all values of a key. This is
basically the same operation as the
people is the only way to accomplish this goal. The result is a list
of interesting questions. The data exploration is the next part. This
can vary from one to several weeks, depending on a number of different
factors, such as the amount of data, the specific problems being
addressed, the quality of the data, and the power of the hardware and
software. When everything works well, even hundreds of gigabytes of data
can be analyzed in a matter of weeks. The final part is bringing the
results together into a coherent presentation, and getting the right
people in the room to listen to it. Since these are demonstration
projects, the presentation is the deliverable. As mentioned earlier,
this case study is a combination of several projects. One of the
projects involved analyzing over 100 Gbytes of data on a massively
parallel computer; another analyzed over 20 Gbytes; and a third, just a
few Gbytes. For the purposes of describing the data, we will be
referring often to the "big data" project because it best represents the
challenges of working with large amounts of data.

Important Marketing Questions

Discussions with the business users highlighted several
critical areas for analysis. These areas served as guidelines. One area
of interest was understanding the behavior of individual consumers. When
do they use the telecommunication services? Who is likely to be working
from home? What telephone numbers are forwarded to mobile phones? Who is
using ISDN to connect to a computer network? Another area was regional
differences in calling patterns. This is important for demonstrating to
regulators and pricing groups why different areas should be treated
differently, from a regulatory or pricing perspective. The business side
did not know exactly what patterns might illustrate such differences;
that was left to the analysis team. High-margin services provide another
area of great interest. International calls account for a small fraction
of all calls and a disproportionate share of profit. What could the data
tell us about international calling patterns? With the Internet being an
area with rapid growth, which customers use the Internet? Of course, one
of the prime motivations for the work was to support marketing and new
sales initiatives. What marketing opportunities lie in the data, such as
which customers need a specific product? These are typical of the types
of questions that call detail records can help answer.

The Data

The most voluminous of the data sources used for these types of projects are call detail records. At the same time, the structure of these records is usually fairly simple. For Ab Initio, they can be stored in comma-delimited text files, much like the files used to import data into spreadsheets, but much larger. Typically, the call detail comes from one of three sources:

Direct switch recordings. These are the records that are generated by the switch. Generally, these are the least clean, but the most informative.

Inputs into the billing system. Switch records eventually get transformed into billing records. These are cleaner, but not as complete. Some records, such as toll-free calls, may never make it into the billing system.

Data warehouse feeds. This is yet another source. The data will be rather clean, but will be limited by the needs of the data warehouse.

Of course, other sources of data are needed as well. Tables describing customers and other reference files are needed. We will talk about some of the more common ones.
Interestingly, some of the most important information sometimes exists
in spreadsheets on people's desktops. This is especially true of
reference data, such as lists of access numbers for Internet Service
Providers, international country codes, and the like. Technically, the
most interesting of these projects analyzed over one billion records on
a massively parallel computer. We will be using ideas from this
particular project as a guide, when talking about the data.

Call Detail Data

A call detail record is a single record for every call made over the telephone network. Because so many telephone calls are made, call detail records are a very, very large data source. For instance, there
are typically over one billion completed telephone calls every day in
the United States. If about one hundred characters of information are
kept about every call (a typical amount), then a single day's worth of
data amounts to about 100 Gbytes of data. If this were stored on floppy
disks, the stack would be about 781 feet high; two days' worth would be
higher than the Empire State Building. Often, call detail records are
used by the billing systems to generate bills for customers. It follows
that, as a data source, they include only those calls that are billable
to the caller, so they do not include incoming calls (since the called
person typically does not pay for these), toll-free calls, or calls on
certain corporate networks. Also, call detail records can contain
potentially billable

• Call detail records for telephone calls were placed into three categories: local and long-distance calls, international calls, and billing events.
• Data originating from one particular region was split into separate files for each week, based on start-time.
• The format of the files was changed from variable length text records to fixed length binary records to make processing more efficient. The format was slightly modified for each of the files. For instance, the billing events did not need duration_of_call, since the duration for all billing events is 0.
• The data was partitioned over the processors by from-number, so all the calls originating from the same telephone number were located on the same processor.

Customer Data

In addition to the
call detail records, the project needed some basic customer information.
Fortunately, telecommunications companies have made significant
investments in building and populating data models for their customers.
These data models generally describe residential and business customers
using dozens of tables. Customer data is needed to match telephone
numbers to information about customers, since customers can have
multiple telephone lines.

The Customer Model

For the purposes of this
project, there were only a few basic items of information required from
the customer model. For instance, it was important to be able to
identify all the calls from a single customer, even though that customer
might have more than one telephone line. Figure 12.3 shows a basic
customer model with several important entities. The telephone number
itself refers to a particular line with a particular telephone number
and to the services available on that line. The installation record
contains information about all the telephones installed at the same
time. These are then rolled up into a billing account, which for a
business may include dozens or hundreds of telephone numbers. Finally,
the entity represents a given business customer. Large corporations,
for instance, may have multiple billing accounts, spread throughout the
world. For residential customers, there is no entity. An additional
entity in the customer model, sales account, represents information
about the sales and marketing aspects of customers. For instance, it
contains the market segment of the customer, the size of the customer
(in employees or dollars), a code for the sales representative, and the
line of business. Although sales account is not fully populated and the
data is not as clean
tem has to send almost every record from the processor where it
lives to another-a vast amount of data movement. So we want to minimize
the amount of partitioning to make the dataflow graphs run faster. One
of the big challenges in data mining is combining data from multiple
files. In the terminology of relational databases, this is called a
join. For instance, each call detail record contains a from-number. The
customer data for that telephone number is in a different table. For
much of the analysis, we need to join the two files on a key; in this
case, the telephone number. Figure 12.5 illustrates the process for two
large tables. First the records in both files are partitioned using the
key. This means that all records with the same key are accessible on the
same processor. Then, the records in each file are sorted on the
processor. Now, it is a simple matter of matching up two sorted lists.
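
The partition, sort, and match steps just described can be sketched in Python. This is a single-process illustration in which a loop stands in for the separate processors; it is not how Ab Initio implements joins. The function names are hypothetical, and the sketch assumes the customer file has at most one record per key.

```python
from collections import defaultdict


def partition(records, key, n_partitions):
    """Send each record to a partition chosen by hashing its key."""
    parts = defaultdict(list)
    for rec in records:
        parts[hash(rec[key]) % n_partitions].append(rec)
    return parts


def merge_join(calls, customers, key):
    """Inner-join two lists that are already sorted on key.

    Assumes customers has unique keys; calls may repeat a key.
    """
    joined = []
    j = 0
    for call in calls:
        # Advance through the customer list until it catches up with the call.
        while j < len(customers) and customers[j][key] < call[key]:
            j += 1
        if j < len(customers) and customers[j][key] == call[key]:
            joined.append({**customers[j], **call})
    return joined


def partitioned_join(calls, customers, key, n_partitions=4):
    """Partition both inputs on key, then sort and merge within each partition."""
    call_parts = partition(calls, key, n_partitions)
    cust_parts = partition(customers, key, n_partitions)
    result = []
    for p in range(n_partitions):  # each iteration stands in for one processor
        left = sorted(call_parts[p], key=lambda r: r[key])
        right = sorted(cust_parts[p], key=lambda r: r[key])
        result.extend(merge_join(left, right, key))
    return result
```

Because both inputs are partitioned with the same hash function, records sharing a key always land in the same partition, so no partition ever needs data held by another.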
When one of the files is small, it is often more Ab Initio supports this
method of doing joins using the lookup tables. Lookup tables are also
used by reformat operations. Laying out the data in a parallel
environment can have a very significant effect on the performance of the
data analysis. Partitioning the data correctly can greatly speed up the
processing time.

Figure: In order to join two files on a key (such as using one as a
lookup table), all the records with the same key have to be on the same
processor. One way is to partition both files across the processors.
Another way is to partition the larger file among the processors and to
copy the smaller file to all processors.
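The two join layouts described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not Ab
Initio code: the field names from_number, telephone_number, and segment
are hypothetical, plain lists stand in for processors, and Python's
built-in hash is assumed as the partitioning function.

```python
def partition(records, key, n_parts):
    """Send each record to the partition chosen by hashing its join key."""
    parts = [[] for _ in range(n_parts)]
    for rec in records:
        parts[hash(rec[key]) % n_parts].append(rec)
    return parts


def sort_merge_join(calls, customers):
    """Sort both lists on the key, then match them in a single pass.

    Assumes at most one customer row per telephone number.
    """
    calls = sorted(calls, key=lambda r: r["from_number"])
    customers = sorted(customers, key=lambda r: r["telephone_number"])
    joined, j = [], 0
    for call in calls:
        # Advance the customer pointer until it reaches the call's key.
        while (j < len(customers)
               and customers[j]["telephone_number"] < call["from_number"]):
            j += 1
        if (j < len(customers)
                and customers[j]["telephone_number"] == call["from_number"]):
            joined.append({**call, "segment": customers[j]["segment"]})
    return joined


def partitioned_join(calls, customers, n_parts=4):
    """Layout 1: partition BOTH files on the key, then join per partition."""
    call_parts = partition(calls, "from_number", n_parts)
    cust_parts = partition(customers, "telephone_number", n_parts)
    joined = []
    for call_part, cust_part in zip(call_parts, cust_parts):
        joined.extend(sort_merge_join(call_part, cust_part))
    return joined


def broadcast_join(calls, customers, n_parts=4):
    """Layout 2: split the large file any way, copy the small one everywhere."""
    lookup = {c["telephone_number"]: c["segment"] for c in customers}
    joined = []
    for i in range(n_parts):
        local_lookup = dict(lookup)      # each "processor" gets its own copy
        for call in calls[i::n_parts]:   # round-robin split, no key needed
            segment = local_lookup.get(call["from_number"])
            if segment is not None:
                joined.append({**call, "segment": segment})
    return joined
```

Both functions return the same joined records (in a different order).
The second never moves a call record according to its key, which is why
it tends to be attractive when one file is small enough to copy.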
One of the complications that arises when trying to match telephone
numbers to customers is that the same telephone number can be assigned
to different customers at different times. So, the TELEPHONE_NUMBER
table actually has an effective date as a field. In the interest of
speed, the project did not use the effective date field, running the
risk of some inaccuracies. However, ownership of telephone numbers
typically changes rather slowly over time and, in this case, we verified
that fewer than 0.1 percent of the telephone numbers in the table were
duplicates with different effective dates. Ignoring the effective date
did not materially affect the analysis.

Auxiliary Files

Call detail analysis typically requires additional reference tables.
These generally consist of anywhere from a few dozen to a few thousand
rows, with data such as:

• ISP access numbers, a list of access numbers of Internet service
  providers
• Fax numbers, a list of known fax machines
• Wireless exchanges, a list of exchanges (the first three digits of the
  telephone number) that correspond to mobile carriers
• Exchange geography, a list of the geographic areas represented by the
  telephone number exchange
• International, a list of country codes and the names of the
  corresponding countries

A Voyage of Discovery

This section is organized as a tour through the results based on call
detail. The first few parts give us a feel for the call detail data and
for analyzing it using dataflow graphs. The tour then moves on to the
more complex dataflow graphs needed to understand customer behavior.

What Is in a Call Duration?

How long calls last is a basic facet of customer behavior. More
importantly, though, it can tell us a lot about data quality and
indicate whether the data sources are producing reasonable values. Once
upon a time, telephone switches recorded a telephone call only when the
call ended. This implied that calls that never completed never generated
records, and hence were never billed. To get around this problem, very
long telephone calls are broken down into chunks. Some switches break
the calls into 8-hour chunks; others break them into 24-hour chunks.
What do the call durations look like?
calls longer than one second and shorter than one minute. Things get
interesting when we look at calls longer than one hour: there is a
notable feature (in Figure 12.8), a peak at 24 hours. This is
the peak we would expect if calls really are broken down into 24-hour
pieces. However, some calls last longer than 24 hours. In fact, call
durations range up to 46 hours, directly contradicting the assumption
that all long calls are broken up. The data does not lie. However,
sometimes switches "forget" to break up calls during peak periods or,
under special circumstances such as testing, the switch does not break
them up.

Calls by Time of Day

A good way to get a feel for the call detail data is to break down when
different types of calls are being made. The charge-band code provides a
breakdown among local, regional, national, international, and
fixed-to-mobile telephone calls. When are different types of calls being
made?

Solution Approach

The solution is to read the call detail records and to look up the
user-specified class represented by the charge-band field. The histogram
is then produced for the six values of the charge band by hour of the
day. This result was based on calls for a single week of the regional
data (including international calls).
Figure annotations: These operations combine all the data for a given
day of the week and hour of the day. These operations sort by the
duration and place the result in an output file.

middle of the day and turned into the internal
record format. Then, another reformat component calculates the day of
the week and the hour of the day for each call. The two hash aggregate
components count the number of calls made during each hour of the week.
The first hash aggregate creates the table on each processor for each
partition of the data. The second combines these partial results into a
single result, sorts, and saves the file. This graph illustrates how to
do a group-by aggregation on a very large file, while doing some
complicated calculations as well. Results The results are in Figure
12.10. This shows the pattern of calls made throughout the day. Seeing
the calls by day-of-the-week and hour-of-the-day illustrates some
interesting patterns. Generally, calls are very low in the early morning
and increase noticeably during the day. These are only residential
calls, and they show an interesting peak in the evening at about 8:00
P.M. or 9:00 P.M.: people making telephone calls after dinner. However,
this peak does not exist on Fridays. The results in Figure 12.11 are
based on a similar dataflow that shows when, throughout a day, different
calls are being made. There are peaks through the

Figure 12.11 (x-axis: Hour of Day) Looking at the proportions is also
very informative. This is a typical pattern for telephone calls during
the day. Using percentages, we see that long distance calls are much
more likely during the early morning.

the early hours of the morning and that international calls are a bit
more likely during the night and early morning. Because international
calls are so important, this leads to further questions, such as the
average duration of international calls and where the calls are going
to.
Figure 12.12 Average duration of international calls throughout the day.

Calls by Market Segment

The market segment is a broad categorization of customers. These
categories include residential customers, government accounts, and
different gradations of business (small, medium, large, and named
accounts). Market segments are units of action as well as customer
segmentation since the sales organization is organized by market
segment. For instance, there are separate divisions focusing on
residential, small business, government, and large business accounts.
This brings up some interesting questions about market segments. Are
customers within market segments really similar to each other? What are
the calling patterns between market segments, for instance? Solution
Approach This is the first of the "hard" questions being asked about the
call detail records, because this question requires looking up customer
information about each telephone call. The solution is a table giving
the number of calls made from each sales channel to each other sales
channel. Transforming the call detail data into such a matrix requires a
lot of processing power, because there are a lot of customers, and many,
many more call detail records. The market channel is available at the
customer level, not in the call detail. Remember, the call detail only
contains telephone numbers. For each record, the from-number needs to be
replaced by its market segment and the to-number needs to be replaced
by its market segment. This requires "joining" the call
Residential       35,426   5,790   1,165   2,128     694   3,338   3,774
Small Business     3,136   2,139     560     963     246   1,008     549
Medium Business      907     482     941     358     157     448     168
Big Business       1,266     851     414     986     213     672     314
Global               504     202     134     157     627     280      78
Named Accounts     2,542     918     448     616     325   3,730     325
Government         1,870     549     168     302      78     347   3,046
Grand Total       47,555  11,883   4,301   6,194   2,554  10,528   8,568
                   3,898   5,309   2,150   9,576   6,630   8,325

Table 12.1 Representative Call Volumes (in 1000s) between Market
Segments for one Week of Data
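A matrix like Table 12.1 can be produced by looking up the segment for
both ends of each call and counting the pairs. The sketch below is a
single-machine illustration, not the dataflow graph used in the project.
The field names and the "Unknown" bucket for numbers missing from the
customer file are assumptions made for the example.

```python
from collections import Counter


def segment_matrix(calls, segment_of):
    """Count calls for every (from-segment, to-segment) pair.

    calls      -- iterable of dicts with "from_number" and "to_number"
    segment_of -- dict mapping telephone number to market segment
    """
    counts = Counter()
    for call in calls:
        # Replace each telephone number by its market segment.
        src = segment_of.get(call["from_number"], "Unknown")
        dst = segment_of.get(call["to_number"], "Unknown")
        counts[(src, dst)] += 1
    return counts
```

With millions of customers the dictionary lookup plays the role of the
lookup table or partitioned join described earlier; only the counting
step is shown here.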
Figure annotations: And then summarize the data by country. First
summarize the data on each processor. Then, move all the data to one
processor, and summarize it again.

Results

Figure 12.16 shows the average length of telephone
calls for countries in Europe. It is interesting that calls to Eastern
Europe are, on average, shorter than calls to Western Europe. However,
calls to similar countries, such as Scandinavia or to the British Isles,
are of similar length. For instance, Tuvalu, a small country in the
South Pacific, had what seemed like a disproportionately high call
volume, even a few more calls than far more populous countries such as
Mexico.
This is worth further investigation. It may suggest a spate of South
Pacific holiday travelers. It might also be indicative of a call-back
service or some other type of unusual telephone service. An interesting
use of the call detail data is for providing information to regulators,
since they often prefer uniform rules for the entire country, state, or
region. Often, regional differences can be quite striking. In one case,
the average length of an international call was almost one minute longer
in one region than in the country as a whole. This is regulatory
ammunition for differential pricing schemes in different areas.

Figure 12.15 Ab Initio graph for extracting information by country.
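The summarize-locally-then-summarize-again pattern in the graph for
Figure 12.15 can be sketched as follows. This is a minimal illustration,
assuming each partition is a list of hypothetical (country, duration)
pairs; it is not the Ab Initio component itself. The point to notice is
that the partial results carry counts and sums rather than averages, so
that the second stage can combine them correctly.

```python
from collections import defaultdict


def local_summary(records):
    """Stage 1, run on each processor: per-country [call count, seconds]."""
    partial = defaultdict(lambda: [0, 0])
    for country, duration in records:
        partial[country][0] += 1
        partial[country][1] += duration
    return dict(partial)


def combine(partials):
    """Stage 2, run on one processor: merge partials, then average."""
    total = defaultdict(lambda: [0, 0])
    for partial in partials:
        for country, (n_calls, seconds) in partial.items():
            total[country][0] += n_calls
            total[country][1] += seconds
    return {c: seconds / n_calls for c, (n_calls, seconds) in total.items()}
```

Only the small per-country tables move between processors, not the call
detail records themselves.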
Of course, this data includes only outgoing, billed telephone calls.
It does not include toll-free numbers or incoming calls. We could modify
the graph to look for incoming calls, but that does not provide a wealth
of information, since we do not know if anyone was present to answer the
call. Of course, we might also get confused by "automatic" outgoing
calls, such as those that might be made by a computer modem or a home
security system, although these are relatively rare.

Results

Figure 12.18 shows when a particular customer is making telephone calls
from home. The darker shading shows when more calls are being made. This
graph shows several things:

• This customer is rarely home during the day.
• This customer is home making telephone calls on Saturday and Sunday,
  especially on Sunday.

Figure 12.18 (axes: day of week, Monday through Sunday, by hour of day)
Figure 12.20 (panels: All Calls; Small business) The share of the small
business ISP market differs noticeably from the overall market share.

are hospitals that have satellite clinics, retailers with multiple
locations, and government offices. A property of these businesses is
that they are likely to be making large volumes of calls between their
different sites. Another category consists of businesses that must
exchange large amounts of data with other businesses, such as clinics
and hospitals that send medical records between different sites or
printers who receive images as data. There is a telephone product
designed just for this situation. Virtual private networks (VPN) are
like dedicated circuits connecting the different sites, for data, voice,
or both. And, for large volumes of telephone calls, they provide less
expensive service than pay-by-call service. VPNs are a way of
proactively responding to customer needs. Which customers are good
candidates for VPN? Solution Approach The solution is in the data. The
customer table contains a list of companies and their different sites.
What needs to be done is to determine the call volumes between the
different sites using the call detail data.
Using a relational database, this would be quite complicated. The
site information is needed for both the calling number and the called
number, and requires two very large joins. Then the result has to be
aggregated by site. The Ab Initio graph (Figure 12.21) for finding this
information is similar to graphs we have already seen. The business and
site information comes from the customer file, and the graph has to find
it for both the originating and terminating telephone numbers. The
selection criterion finds all calls within one business entity going
between different sites.

Results

The result is a list of businesses that
have multiple offices and make telephone calls between them. This list
can be acted upon immediately, especially for businesses that have large
volumes of calls between sites. These business customers are vulnerable
to competitors offering private networking services. Interestingly, for
some businesses, the average length of calls between sites is over an
hour. Calls this long probably represent data transfers, suggesting
further opportunities for offering them services.
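The selection described above (calls within one business entity that go
between different sites) can be sketched in Python as follows. This is
an illustration, not the graph in Figure 12.21. The field names and the
site_of mapping from telephone number to a (business, site) pair are
assumptions made for the example.

```python
from collections import defaultdict


def intersite_volume(calls, site_of):
    """Summarize calls placed between different sites of the same business.

    calls   -- iterable of dicts with "from_number", "to_number", "duration"
    site_of -- dict mapping telephone number to (business, site)
    """
    volume = defaultdict(lambda: [0, 0])   # business -> [calls, seconds]
    for call in calls:
        src = site_of.get(call["from_number"])
        dst = site_of.get(call["to_number"])
        if src is None or dst is None:
            continue                       # one end is not a known site
        same_business = src[0] == dst[0]
        different_site = src[1] != dst[1]
        if same_business and different_site:
            volume[src[0]][0] += 1
            volume[src[0]][1] += call["duration"]
    return {
        business: {"calls": n, "avg_duration": seconds / n}
        for business, (n, seconds) in volume.items()
    }
```

Sorting the result by call count, or filtering on a long average
duration, would give the kind of candidate list described in the
results.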
Table 12.4 Sorting These Events and Keeping a Count Gives the Answer

Xxxxxxxxxx  980501  110322  start  1
Xxxxxxxxxx  980501  110501  start  2
Xxxxxxxxxx  980501  111517  end    1
Xxxxxxxxxx  980501  111544  end    0
Xxxxxxxxxx  980501  110842  start  1
Xxxxxxxxxx  980501  110934  end    0

What this says is
that between the times of 11:03:22 and 11:05:01, there was one call in
progress. From 11:05:01 to 11:15:17, there were two calls, and so on.
This operation of adding the count is called a scan. It is not possible
to do this inside a relational database using standard SQL. What is
interesting here is not the from-number but the office from which the
call originated. So before doing the scan, all the numbers from a single
company have to be brought together. Note that this solution only takes
into account the call detail records that the phone company passes on to
the billing system. In particular, it does not include toll-free numbers
or incoming calls. These were not available in the call detail data,
because they are not billed to the calling number (which is what the
billing system cares about). Assuming that these calls are recorded
somewhere, the analysis can be extended for all types of calls. The
dataflow graph (in Figure 12.22) that implements this is the most
complicated one in this chapter. The graph reads each call record and
looks up the business entity for the from-number. The top two lines of
components create, respectively, a call-start record and a call-end
record. These are the two events that describe a telephone call. All of
the call event records for a single business entity are brought together
by the partition component. Now comes the tricky part. These are sorted
by business entity and time. Now we have an ordered list of all call
events for each business. The scan calculates the number of concurrent
calls (by adding 1 for a call-start and subtracting 1 for a call-end).
Finally, the aggregation components find the maximum value of the count.
And, the result is saved. There is no way to represent this query using
standard SQL. A stored procedure would be necessary to get the
functionality of the "scan" operator.
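The event-and-scan logic described above can be sketched in ordinary
Python. This is a single-machine illustration of the idea, not the graph
in Figure 12.22. The field names, the entity_of lookup, and the
timestamp format (a sortable "YYMMDD HHMMSS" string) are assumptions.
Putting a call-end before a call-start that carries the same timestamp
is also a choice made here, so that back-to-back calls are not counted
as concurrent.

```python
from itertools import groupby


def peak_concurrent_calls(calls, entity_of):
    """Return the peak number of simultaneous calls per business entity.

    calls     -- iterable of dicts with "from_number", "start", "end"
    entity_of -- dict mapping from-number to business entity
    """
    events = []
    for call in calls:
        entity = entity_of.get(call["from_number"])
        if entity is None:
            continue
        events.append((entity, call["start"], +1))   # call-start event
        events.append((entity, call["end"], -1))     # call-end event

    # Sort by entity, then time; at equal times -1 sorts before +1.
    events.sort()

    peaks = {}
    for entity, group in groupby(events, key=lambda e: e[0]):
        running = peak = 0
        for _, _, delta in group:      # the "scan": a running count
            running += delta
            peak = max(peak, running)  # the final aggregation: maximum
        peaks[entity] = peak
    return peaks
```

Fed the events in Table 12.4, the running count for the first number
goes 1, 2, 1, 0, so its peak is two concurrent calls.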
Results

A total of 21 sites had more than 30
concurrent outbound telephone calls. The highest number was an outbound
call center. It is easy to look at the results in different ways. For
instance, the top 10 corporate customers were almost all from the same
line of business (see Table 12.5). We would expect that the peak call
volumes would occur during the day or evening hours. It is interesting
that some sites had their peak number of concurrent calls between 10
P.M. and 7 A.M. In fact, a number of sites had five or more concurrent
calls during this period, including one with a peak of 45 calls at 3:09
A.M.

Lessons Learned

Call detail records contain a wealth of information about
residential and business customers (similarly, detailed transaction
records in other industries provide key information about customers in
those industries). The purpose of this chapter is to show some of the
compelling results:

• Customer behavior varies from one region of a country to another.
• Thousands of companies place calls to ISPs. These companies own modems
  and have the ability to respond to Web-based marketing.
• Residential customers indicate when they are at home by using the
  telephone. These patterns can be important, both for customer contact
  and for customer segmentation.
• The market share of ISPs differs by market segment.
• International calls show regional variations. In addition, the length
  of the calls varies considerably depending on the destination.
• International calls made during the evening and early morning are up
  to twice as long as international calls made during the day.
• Companies that make calls between their own sites are candidates for
  private networking (virtual or otherwise).

These results are combined from several data mining projects. Handling
these large volumes of data is possible by taking advantage of
parallelism, and thinking of the data processing as dataflows.

Table 12.5 Interesting: Nine of the Ten Corporate Accounts with the
Largest Number of Concurrent Calls Are from the Same Industry

SW  IND12  10/27  11:27  23
SW  IND12  10/28  18:30  18
SW  IND12  10/29  11:05  17
SW  IND48  10/29  11:10  16
NE  IND12  10/29   8:59  16
SE  IND12  10/27   9:48  15
SE  IND12  10/30  15:58  14
NE  IND12  10/31  15:42  14
SW  IND12  10/30  16:37  14
identical circulars and coupons. The third case shows how a
variety of directed and undirected data mining techniques including
association rules, automatic cluster detection, and decision trees can
be used to identify profitable customer segments. Although these case
studies involve supermarkets and groceries, many aspects of them can be
generalized to the wider world of retail sales, not to
mention cataloging and e-commerce.

An Industry in Transition

Retailers are only beginning to grasp the true value of the information
they collect, but we can look to other industries to see how much can be
accomplished. The credit card industry is similar to the retail
industry in that it collects data about who is buying what products,
although any one company only has the data for purchases made with their
credit cards. This information can be resold, in one form or another, to
companies that want to reach particular individuals. The sidebar has an
excellent illustration of how a credit card company can use information
gleaned from its customers' purchasing behavior to offer corporate
marketers access to finely tuned market segments, while providing a
valuable service to its cardholders. The large increase in available
data has so far brought incremental and evolutionary changes to
supermarkets, but that is changing. The data they collect has handed
retailers a rare opportunity to change the balance of power between
themselves and their suppliers who control the brands. When used in
conjunction with a frequent shopper program (or any other way of linking
individual shoppers with their purchases) the point of sale data can
answer a question that suppliers, such as Procter & Gamble, Unilever,
Coca-Cola, Pepsi, Clorox, General Mills, Kellogg's and so on, would love
to have answered: Who is actually buying all that stuff? Knowledge, it
is often remarked, is power. Knowledge of who is buying what gives
retailers the power to become information brokers as well as sellers of
merchandise. Supermarkets as Information Brokers A supermarket chain
stands in the same relationship to the packaged goods suppliers as the
credit card company does to the airlines. Suppose you are the brand
manager for a premium brand of "scoopable" kitty litter. Your target
market is the small segment of cat owners willing to pay extra for
whatever benefits your marketing efforts have been able to get them to
associate with this advance in cat waste-removal technology. Since many
cat owners have never tried any product in the category you manage, you
would like to put a
• For each customer, which products not being purchased now could he or
  she be purchasing in the store?
• Which shoppers are most likely to be open to trying new, house-brand
  products?
• How profitable has this shopper been over the course of the last year?

Fortunately (at least from a data miner's point of view), anonymous
transactions are being replaced by purchases linked to individual
shoppers, because of the increased use of loyalty card programs in
supermarkets. Loyalty card programs are not primarily designed to
provide better sources of data to mine, but that is one of their side
benefits. The primary purpose of a loyalty card is to reward the shopper
for coming back frequently and spending lots of money. In that way, they
are similar to the old S&H Green Stamps and to airline frequent flier
programs. Up until the 1970s, many supermarkets offered S&H Green Stamps
with purchases, something like ten stamps for every dollar. Housewives
(for the most part) kept track of the stamps by pasting them into little
books. And when the books were filled, they could be traded for useful
items such as a toaster or blender. A crucial difference is that the
1960s housewife kept track of her own stamps, but in the modern system,
the store keeps track of your purchase points.
Texas, the Hispanic population is mostly Mexican and the black
population is mostly non-Hispanic. A moment's thought shows that in a
population where the two traits are pretty much mutually exclusive, if a
particular neighborhood is more than 80 percent black, it can't possibly
be more than 20 percent Hispanic. So much for what had at first
appeared to be a data mining insight! The San Antonio and store size
variables that ranked next were similarly uninteresting for our
marketing purpose. The former shows that different parts of the state
have different demographics. We already knew that, and it doesn't help
us decide what products to promote in Spanish. The latter shows that the
chain has built larger stores in some kinds of neighborhoods than it has
in others. That fact is interesting in its own right, but it too sheds
little light on the problem at hand. The next three variables, product
code, segment, and subsegment, are a bit more interesting, because they
do come from the grocery data. Product codes are integers assigned to
the products to identify them in the database. MineSet bins these codes
into product code ranges, and it was able to correlate particular groups
of product codes with the target variable.

Figure 13.1 Evidence visualization for Hispanic percentage.

That seemed
very surprising until we learned that the product codes are not assigned
randomly; similar products have similar codes. In fact, the product code
ranges are expressing the same information as the segments and
subsegments. A peanut butter segment might have subsegments creamy and
chunky, and all items in the segment would have adjacent product codes.
A Failed Approach Association rules are used for market basket analysis.
The input data for market basket analysis consists of many records, each
of which contains a list of products found in a single purchase. If the
same combination of items turns up many times, it can become the basis
for a rule such as "if peanut butter, then jelly." The data we were
working with was not suitable for ordinary market basket analysis
because there is only one record for each product in each store and time
period. However, we decided to create a market basket style analysis by
using the amount of each product sold as a replication factor for each
record. Each record also included a flag indicating whether it came from
a store with high, medium, or low Hispanic population. Our hope was that
we could then find some products in which high or low Hispanic level
would show up together with certain products often enough to allow some
association rules to be generated. Unfortunately, at reasonable levels
of prevalence and predictability (the usual measures of the significance
for associations), no rules were found. Relaxing the standards for rule
formation eventually produced the single rule illustrated in Figure
13.2. The chart says that sales of a particular brand of cereal in 10
oz. packages is an indicator for a store with low Hispanic level. The
predictability of this rule is reasonably high, meaning that more than
half the time if this product is purchased it is in a non-Hispanic
store. However, the prevalence is low, meaning that there are not many
examples of this combination in the data. Most likely, this particular
product in this particular size is not stocked much anywhere, but
happens to be stocked in some store that happens to be in a non-Hispanic
area.

Just the Facts

The most exciting results came from visualizing derived hispanicity
scores for every product. The hispanicity score of a product is the
difference between its average normalized sales volume in the most
Hispanic and least Hispanic stores. Thus, a product that sells better in
Hispanic stores has positive hispanicity, whereas one that sells better
in non-Hispanic stores has negative hispanicity.
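The hispanicity score defined above can be written out as a small
calculation. The text available here does not say how sales volume was
normalized, so the sketch below assumes one plausible choice: a
product's share of all units sold in a store. That normalization, the
data layout, and the product names used to exercise it are assumptions
made for illustration, not the method used in the project.

```python
def hispanicity(sales, most_hispanic, least_hispanic):
    """Score each product by where it sells relatively better.

    sales          -- dict: store -> dict of product -> units sold
    most_hispanic  -- list of store ids with the highest Hispanic percentage
    least_hispanic -- list of store ids with the lowest Hispanic percentage
    """
    def average_share(product, stores):
        # Normalize by store size (assumed): share of the store's units.
        shares = [
            sales[store].get(product, 0) / sum(sales[store].values())
            for store in stores
        ]
        return sum(shares) / len(shares)

    products = {p for store_sales in sales.values() for p in store_sales}
    return {
        p: average_share(p, most_hispanic) - average_share(p, least_hispanic)
        for p in products
    }
```

A positive score means the product takes a larger share of sales in the
most Hispanic stores; a negative score means the opposite.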
Table 13.1 Transaction Detail Fields

FIELD          DESCRIPTION
Date           Date of transaction, YYYY-MM-DD
Store          Store of transaction, CCCSSSS, where CCC=chain, SSSS=store
Lane           Lane of transaction
Time           The time-stamp of the order start time
Customer ID    The loyalty card number presented by the customer; a
               customer ID of 0 means the customer did not present a card
Tender type    Payment type: 1=cash, 2=check, 3=American Express, 4=Visa,
               5=MasterCard, 6=Discover, 7=debit card, 8=Diners Club,
               9=food stamps
UPC            The universal product code for item purchased (see sidebar)
Quantity       The total quantity of this item (number of items or weight)
Dollar Amount  The total $ amount for the quantity of a particular UPC
               purchased

usual practice, when presented with one file that claims to
summarize another, is to try to generate the summaries ourselves from
the detail data. We are rarely successful. The problem is that whether
it is telephone calls being rolled up into monthly usage reports, print
signatures being rolled up into press runs, or scanner events being
rolled up into market baskets, all sorts of hidden business rules come
into play.

When both summary and detailed or "drill through" data is available for
the same time period, it is useful and educational to attempt to recreate
the summaries from the detailed transactions. This often proves to be
more difficult than expected and all sorts of data quality problems and
hidden business rules can be revealed.

In this case, we had no actual use for the order data since most of the
useful information had been summarized away, but we did learn a lot about
the data in the process of trying to recreate it:

• Some UPCs are coupons, not products. These UPCs do not increment the
  total items field in the summary.
• For some items, such as potatoes, the quantity field is a weight, not a
  number of items. These items are recognizable by a 2 in the initial
  digit of the UPC (see sidebar). Anything sold by weight counts as one
  item.
• The dollar total in the summary data reflects taxes and discounts that
  apply to the whole order so the individual prices do not add up to the
  total.
• There are many ways to record the fact that a shopper purchased six
  cans of low-salt, reduced-fat, chicken broth: six individual scan
  events each with a quantity of one, one scan event with a quantity of
  six, one scan event with a
Universal Product Codes

Universal product codes are the numbers,
encoded as machine-readable bar codes, that identify nearly every
product that might be sold in a grocery store. The codes actually are
not all that universal. Although they may look similar, a product coded
in France will not be recognized correctly by a scanner in the U.S. The
codes used in the United States and Canada are controlled by an
organization called the Uniform Code Council, which maintains a
fascinating web site at www.uc-council.org. A similar organization, the
European Article Numbering Association (EAN) administers product codes
in Europe and much of the rest of the world. Their Web site is at
www.ean.be. The two organizations are working together to develop
worldwide standards. In the meantime, our North American codes consist
of 12 digits. The code itself fits in 11 digits; the twelfth is a
checksum. The first digit identifies a particular encoding scheme. For
example, an initial 2 indicates the coding scheme for items sold by
weight. The next five digits are assigned by the Uniform Code Council to
identify particular manufacturers. What they do with the remaining five
digits is up to them. Some industry groups, such as the Turkey Growers
Association, have developed standards for how their members should use
the digits under their control. In other industries, it is left
completely up to the discretion of the manufacturers. To take advantage
of UPCs, a store must maintain a continually updated database that maps
UPCs into its own product codes or stock-keeping units (SKU) and thus
into a product hierarchy of departments, categories, and subcategories.
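The 12-digit layout described in this sidebar can be illustrated with a
short Python sketch. The split used here (one scheme digit, five
manufacturer digits, five item digits, one check digit) follows from the
sidebar's statement that the code fits in 11 digits plus a checksum. The
sidebar does not spell out the checksum rule; the function below uses
the usual UPC-A rule as an assumption, and the sample code in the usage
note was chosen for illustration and does not come from the project
data.

```python
def upc_check_digit(first_eleven):
    """Compute the twelfth digit from the first eleven (usual UPC-A rule)."""
    if len(first_eleven) != 11 or not first_eleven.isdigit():
        raise ValueError("expected exactly 11 digits")
    digits = [int(ch) for ch in first_eleven]
    odd_positions = sum(digits[0::2])    # 1st, 3rd, ..., 11th digits
    even_positions = sum(digits[1::2])   # 2nd, 4th, ..., 10th digits
    return (10 - (3 * odd_positions + even_positions) % 10) % 10


def parse_upc(code):
    """Split a 12-digit code into the parts named in the sidebar."""
    if len(code) != 12 or not code.isdigit():
        raise ValueError("expected exactly 12 digits")
    return {
        "scheme": code[0],                  # encoding scheme digit
        "manufacturer": code[1:6],          # assigned to the manufacturer
        "item": code[6:11],                 # chosen by the manufacturer
        "check_digit": code[11],
        "valid": upc_check_digit(code[:11]) == int(code[11]),
        "sold_by_weight": code[0] == "2",   # initial 2: sold by weight
    }
```

For example, parse_upc("036000291452") reports manufacturer "36000",
item "29145", and a valid check digit. Codes stored with leading zeros
added or stripped would need to be brought to 12 digits first.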
When working with summarized data, be sure you know the rules that
govern the summarization. These rules may reveal important information
about the underlying business process. We generally prefer to receive
data at the most detailed level available and then perform our own
aggregations as necessary. There are perils to this approach (an
imperfect understanding of the field definitions, or a too trusting
reading of the data dictionary, may lead to nonsensical aggregations
such as mixing item counts with item weights) but at least the mistakes
will be your own! The snippet of data in Table 13.2 includes information
about two market baskets. Both were purchased on 17 May 1998 at around
2:30 in the afternoon at store #405 belonging to supermarket chain #210.
The first purchase, in lane 1, consists of three occurrences of item
87300233. First, one of these items was scanned at a cost of 98 cents.
Then, a second was scanned and the cashier
keyed in a quantity of two so that the dollar amount field of the
second record contains $1.98. The third record starts a new market
basket because although the date, store, and time are the same, this
transaction happens in the next lane over. (Note that the customer ID
cannot be used as a key because (1) customers who do not present a card
will have ID 0 and (2) we may not want to roll up separate trips made by
the same customers.)

From Groceries to Customers

Multiply these transactions by all the people in all the lanes in all
the stores for every day for a year and the resulting database is vast
and unwieldy.
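Before any customer-level variables can be derived, the scan records
have to be grouped into market baskets. As noted earlier, the customer
ID cannot serve as the key, so a basket is identified by date, store,
lane, and order start time. A minimal Python sketch of that grouping
follows. The dictionary field names are hypothetical stand-ins for the
fields in Table 13.1, and the simple sum of the quantity field ignores
the complication that some quantities are weights.

```python
from collections import defaultdict


def build_baskets(rows):
    """Group transaction detail rows into market baskets.

    The key is (date, store, lane, time). Time is the order start time,
    so it is shared by every item scanned in the same order.
    """
    baskets = defaultdict(lambda: {"lines": 0, "quantity": 0, "amount": 0.0})
    for row in rows:
        key = (row["date"], row["store"], row["lane"], row["time"])
        basket = baskets[key]
        basket["lines"] += 1                  # number of scan records
        basket["quantity"] += row["quantity"]
        basket["amount"] += row["amount"]
    return dict(baskets)
```

Two records with the same date, store, and time but different lanes end
up in different baskets, which matches the reading of Table 13.2 given
above.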
Somehow, all that scanner data about cans, cartons, pounds, and ounces
had to be turned into derived data that could reveal something about
customer behavior. Answering questions about who is buying what, when
they are buying it, and what else they might like to buy in the future,
requires many more variables and far fewer records. It also requires
some auxiliary data describing the items purchased so we can tell the
difference between beer and diapers in the shopping cart, and the
similarity between butter and margarine. The auxiliary data takes the
form of a mapping from the universal product codes scanned into the item
numbers or SKUs (stock-keeping units) used by the chain and a mapping of
those item numbers into a product hierarchy. With the help of these
additional tables, upon seeing UPC 002700048918, we can say that the
category is popcorn, the subcategory is microwave, the segment is
shelf-stable, the manufacturer is Hunt & Wesson, the brand is Orville
Redenbacher, the flavor is "butter," and the product description printed
on the shopper's register receipt reads "OR MW G BT B 6CT 21 OZ." We also
know that the product is salted, not considered low-fat, and ships 36 to
the case. Appending all of this information to each record in the
transaction detail table might seem an odd way to reduce the data size,
but it is a necessary first step in order to reduce the number of
records by performing meaningful aggregations (see Table 13.3). From
transaction detail records of this form, it is possible to create
hundreds of behavioral fields for each customer (see Table 13.4). For
this project, the focus was on the time of day that people habitually
shop and how they allocate their grocery dollars across the categories.
For each shopper, we calculated the number of trips and the amount of
money spent at each time of day (morning, lunch time, after lunch,
evening, and late night) and on weekends, holidays, and week days. We
also calculated the percentage of the items purchased that carried high,
medium, and low profit margins for the store. For each of the categories
of interest, we calculated the percent of each shopper's total spending
that went to that category, the total number of trips and the total
dollar amount spent for the year along with the total number of items
purchased and the total number of distinct items purchased. These new
variables, all of which were based on simple aggregations of the
transaction detail, were used to create fur-
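
The per-shopper aggregations described above can be pictured with a small sketch. This is only an illustration, not the code used in the project: the row layout, the function names, and the hour boundaries for each time of day are assumptions, since the text names the periods without giving their limits. The one detail taken from the text is that customer ID 0 marks a shopper who did not present a card.

```python
from collections import defaultdict

# Hypothetical hour boundaries: the text names the periods but not their limits.
def time_of_day(hour):
    if 5 <= hour < 11:
        return "morning"
    if 11 <= hour < 14:
        return "lunch time"
    if 14 <= hour < 17:
        return "after lunch"
    if 17 <= hour < 22:
        return "evening"
    return "late night"

def summarize_customers(rows):
    """Roll transaction detail up to one summary per customer.

    rows: iterable of (customer_id, trip_id, hour, category, dollars),
    a simplified, assumed layout for the transaction detail table.
    """
    trips = defaultdict(lambda: defaultdict(set))
    dollars_by_period = defaultdict(lambda: defaultdict(float))
    dollars_by_category = defaultdict(lambda: defaultdict(float))
    for customer, trip, hour, category, dollars in rows:
        if customer == 0:
            continue  # ID 0 means no card was presented, so there is no one to describe
        period = time_of_day(hour)
        trips[customer][period].add(trip)
        dollars_by_period[customer][period] += dollars
        dollars_by_category[customer][category] += dollars

    summary = {}
    for customer, by_period in dollars_by_period.items():
        total = sum(by_period.values())
        summary[customer] = {
            "total_dollars": total,
            "trips_by_period": {p: len(t) for p, t in trips[customer].items()},
            "dollars_by_period": dict(by_period),
            "category_share": {
                c: (d / total if total else 0.0)
                for c, d in dollars_by_category[customer].items()
            },
        }
    return summary

# Made-up rows: customer 17 makes a morning trip and an evening trip.
rows = [
    (17, "t1", 9, "popcorn", 2.0),
    (17, "t1", 9, "dairy", 6.0),
    (17, "t2", 18, "dairy", 2.0),
    (0, "t3", 9, "popcorn", 5.0),
]
summary = summarize_customers(rows)
```

Many detail rows collapse into one record per customer with more variables, which is the "many more variables and far fewer records" trade described earlier in this section.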

The hardest thing about automatic cluster detection is making sense
of the clusters once they have been automatically detected. The MineSet
cluster visualizer consists of a set of small graphs showing statistics
on each variable within the clusters. By default, the variables are
sorted in decreasing order of their ability to differentiate the
clusters from the general population. In Figure 13.5, we can see that
the most important variables for distinguishing the clusters are the
ratio of dollars spent in the morning to total dollars spent and the
total number of morning checkouts. In the general population, people
averaged around 10 morning checkouts over the course of the study
period, accounting for about 15 percent of total spending. In cluster
one, the average number of morning checkouts is less than one and
morning checkouts account for only about 1 percent of total spending.
Cluster two, on the other hand, is full of people who like to leap out
of bed and head for the supermarket. In that cluster, the average
shopper had around 44 morning checkouts accounting for nearly 86 percent
of total spending. At the very bottom of Figure 13.5, it is just
possible to see that after three highly correlated variables having to
do with morning shopping, the next most important variable for
distinguishing the clusters is the ratio of high margin (that is to say,
very profitable) items to total items. That suggested the visualization
shown in Figure 13.6. This is output from the MineSet splat visualizer,
which is like a scatterplot for situations where there are too many
points to plot individually. Individual points get averaged into a sort
of colored haze. In the figure, the ratios of dollars spent in the
morning, afternoon, and evening have been mapped to the three axes of
the splat plot, and the ratio of high-margin items has been mapped to a
slider that allows the user to move interactively from a view of
low-margin shoppers to a view of high-margin shoppers. The figure
captures the screen as it looked with the slider all the way to the
left, so the colors represent clusters of low-margin customers by time
of day. If this were an interactive book, and if it were in color, you
could move the slider to the right and watch the colors change to
display higher and higher margin clusters by time of day. Having done
so, you would notice that cluster four is particularly rich in
high-margin customers. That cluster is worthy of further investigation!
The statistical cluster viewer shown in Figure 13.5 provides one way of
investigating the high margin cluster. By clicking on a cluster's
colored bar at the top of the column, you tell MineSet to reorder the
variables so that they are sorted by their ability to distinguish the
cluster of interest from all other clusters and from the general
population. Another approach, not shown here, is to use the MineSet
evidence visualizer to see which variables play the greatest part in
determining cluster membership. Yet another approach to understanding
clusters is to build a decision tree that classifies records by cluster
membership. This gives approximate rules for
cluster membership.
When a leaf of the tree bears a certain cluster label, the path from the
root node to that leaf can be read as a rule for inclusion in that
cluster. Of course, there may be several leaves labeled with the same
cluster and therefore, several rules describing its members. Figure 13.7
shows such a tree. The spotlight is on a leaf node where nearly every
record is from the most populous cluster, which seems to consist of
afternoon shoppers of medium profitability.

Putting the Clusters to Work

Finding clusters is rarely the final goal of a data mining project.
Clusters are only useful when they can be put to some practical use. One
way of using clusters is to identify customer segments that could
benefit from some new product or service. If there is a cluster of
customers who come into the store at lunch time to purchase ready-to-eat
items from the deli counter, but also pick up a few groceries, perhaps
they could be enticed to pick up a few more groceries by offering a
delayed home delivery service. Another way to use clusters is to feed
them back into the data as an aid to further analysis.

Figure 13.6 Visualizing clusters of shoppers.

Buys Meat at the Health Food Store?

You cannot have read this far
without realizing that we believe strongly that data mining should not
go on in a vacuum. Data mining belongs in a business context where it
can serve well-articulated goals and answer important questions. Despite
this, it sometimes happens that we are simply handed a data set and
asked to show what data mining can do with it. In one such case, the
data we were asked to investigate was transaction-level data from
loyalty card holders at a New England health food supermarket. We loaded
the data into SAS Enterprise Miner and together with Anne Milley of SAS
Institute, we discovered a few interesting things. Since there was no
actual business problem to be solved, the search for interesting
patterns in the data started by generating association rules, a common
approach to market basket analysis. Then, because we are more interested
in customer behavior than in learning which products sell well together,
we used undirected data mining to look for naturally occurring customer
segments and directed data mining to learn more about one particular
segment that seemed interesting to us: people who buy meat in a store
frequented largely by vegetarians.

Association Rules for Market Basket Analysis

We don't often get much mileage out of association rules, but
market basket analysis is what they were made for, so we put them to
work on the health food data. Association rules are an undirected data
mining technique for examining what sells well with what. The rules all
have the form

    Left-hand side implies right-hand side

Association rules go in only one direction. In a restaurant, we might
discover that
"caviar implies vodka" is a strong association, but that "vodka implies
caviar" is much weaker. Both the left-hand side and the right-hand side
may be combinations of more than one item as in the first rule in Table
13.5: "red peppers imply yellow peppers, bananas, and bakery." Of
course, any market basket is full of potential rules since any one item
in the basket may imply all of the others. How do we decide which
associations have predictive power? As it turns out, there are three
separate measures that must all be taken into consideration: support
(also called prevalence), confidence (also called predictability), and
lift.
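
As a rough sketch of how the three measures are computed, the snippet below uses the standard textbook definitions: support is the share of baskets containing both sides of the rule, confidence is the share of left-hand-side baskets that also contain the right-hand side, and lift is confidence divided by the overall frequency of the right-hand side. The function name and the toy baskets are made up for illustration, and the chapter's own definitions may be worded differently.

```python
def rule_metrics(baskets, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs -> rhs.

    baskets: list of sets of item names; lhs and rhs: iterables of item names.
    """
    lhs, rhs = set(lhs), set(rhs)
    n = len(baskets)
    n_lhs = sum(1 for b in baskets if lhs <= b)
    n_rhs = sum(1 for b in baskets if rhs <= b)
    n_both = sum(1 for b in baskets if (lhs | rhs) <= b)

    support = n_both / n                          # P(lhs and rhs)
    confidence = n_both / n_lhs if n_lhs else 0.0  # P(rhs | lhs)
    lift = confidence / (n_rhs / n) if n_rhs else 0.0  # P(rhs | lhs) / P(rhs)
    return support, confidence, lift

# Made-up restaurant orders echoing the caviar and vodka example.
baskets = [
    {"caviar", "vodka"},
    {"vodka"},
    {"vodka"},
    {"vodka", "bread"},
    {"bread"},
]
caviar_to_vodka = rule_metrics(baskets, ["caviar"], ["vodka"])
vodka_to_caviar = rule_metrics(baskets, ["vodka"], ["caviar"])
```

In this toy data the two directions share the same support and the same lift, but "caviar implies vodka" has far higher confidence than "vodka implies caviar," which is the sense in which a rule goes in only one direction.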

                   33.72  RED PEPPERS -> VINE TOMATOES
 4  4  2.16  4.28  14.14  YELLOW PEPPERS -> KITCHEN & BANANAS & BAKERY
 5  3  2.15  3.46  36.06  RED PEPPERS -> YELLOW PEPPERS & BAKERY
 6  3  2.14  3.18  10.49  YELLOW PEPPERS -> FLORAL & BANANA
 7  4  2.11  3.66  12.10  YELLOW PEPPERS -> BODY CARE & BANANAS & BAKERY
 8  3  2.06  4.93  16.30  YELLOW PEPPERS -> KITCHEN & BAKERY

Table 13.5 Association Rules for Market Basket Analysis (Continued)

54  3  1.60  4.32  14.26  YELLOW PEPPERS -> OG CANTALOUPE & BANANAS
55  3  1.59  3.64  12.04  YELLOW PEPPERS -> OG CARROTS LOOSE & BANANAS
56  3  1.59  4.60  15.19  YELLOW PEPPERS -> BAKERY & 5 CENT

Give up? Well, we thought it was interesting that
although there are many rules about red and yellow peppers, and a few
even linking them, there is only one rule in this list of pepper rules
for green peppers. In fact, in this list of pepper rules, which is
sorted by lift, you have to go all the way down to rule number 80 to
find the first rule that even mentions green peppers, and that turns out
to be one of those high-confidence, high-support rules that simply means
that nearly everyone buys bananas, and green pepper buyers are no
exception. It isn't that no one is buying green peppers; at 1.37 percent
of all baskets, the support for the rule "green peppers imply bananas"
is between the 1.17 percent for "yellow peppers imply bananas" and the
1.43 percent for "red peppers imply bananas." Since bananas are
universally popular, this means that green peppers sell in about the
same quantities as red and yellow peppers; they just aren't as
predictive. Presumably, the reason for this finding is that both red and
yellow peppers are flown in, like cut flowers, from hothouses in the
Netherlands and so are quite expensive compared to the green peppers
grown right here in the North American Free Trade Area. This suggests
that red and yellow peppers go together not just because of their pretty
colors (after all, a splash of green would also be nice in the stir-fry)
but because of their snob appeal. If they have not already done so, the
store should try putting the red and yellow peppers with the fresh
morels and other exotica rather than with the common peppers. They might
even try tripling the price of a few of the green peppers and displaying
them with their high-priced cousins to see whether all three colors
would sell together at the higher price!

People are More Interesting Than Groceries

Having exhausted our
interest in the association rules, we then turned to another undirected
data mining technique: clustering. Before building the clusters, we added
a number of derived behavioral variables to the data. One that proved
interesting was a flag showing whether the customer had ever purchased
anything from the meat department. Only a small percentage of shoppers
ever buy meat at the health food store, but those that do are an
interesting segment. The Enterprise Miner cluster visualizer allows us
to display three variables at a time to explore the way they interact
with the clusters. In Figure 13.8, we are looking at the way gender,
meat buying, and total spending vary across the population and the three
clusters found by the tool. The height of the pies corresponds to total
spending; the shaded pie slice represents the percentage of people in
the cluster who buy meat. The top row contains data on women and the
bottom row contains data on men. Cluster three is interesting because
nearly half the people, both men and women, are meat eaters. Cluster one
is even more interesting: these people all spend more money than the
people in the other clusters, and the women, but not the men, buy more
meat than the

Learned

The first case study focused, out of necessity, on the
grocery products themselves because that is what the data contained.
Except for a couple of demographic variables about the neighborhood
where the store was located, there was no information about customers
and their behavior. Nevertheless, by using demographic data on the
neighborhoods surrounding the stores we were able to make inferences
about the shopping preferences of Hispanic shoppers. In this case, one
of the strongest lessons is about the power of visualization.
Sometimes, a simple scatterplot reveals more about what the data has to
say than can be discovered with sophisticated tools. In the second case
study, we saw the value of having transactions that are linked to
individual customers. Using customer IDs, it was possible to derive
hundreds of variables describing the behavior of the shoppers
themselves. These variables were used to feed predictive models capable
of identifying good targets for a coupon or direct mail campaign. A
secondary lesson was that even models that do a good job of predicting
who is likely to respond to an offer may not paint a clear picture of
that person. This is true even of models, such as decision trees, that
can be translated into rules. The problem is that when there is a large
tree with many leaves that predict the same thing, each one has a
different rule. In the third case study, we saw how data mining can be
used to improve shelf placement decisions and to uncover a small, but
very profitable group of customers that might leave entirely if one line
of products were discontinued. Overall, this chapter has tried to convey
the range of exciting possibilities that are set to transform the
grocery business and other similar businesses including the wider world
of retail sales, cataloging, and e-commerce. Along the way, we made the
following observations:

• Anonymous transactions do not allow the data miner to learn much about
  individual customers, but they can be used to study the behavior of
  important groups.
• Visualization can sometimes provide insights that are difficult to
  glean from associations or models.
• Transaction level data must be aggregated in imaginative ways in order
  to derive variables that capture important aspects of customer
  behavior, such as favorite categories and habitual shopping days and
  times.
• Models with good predictive value may still fail to paint a
  recognizable portrait of the consumers whose behavior is being
  predicted. This is fine if the goal is targeting likely buyers, but
  disappointing if the goal is to understand who those buyers are and
  what makes them such good prospects.
• Association rules often discover trivial or uninteresting
  associations, but sometimes the absence of an expected association (as
  between green peppers and their more colorful cousins) can lead to
  insight.
• Automatically detected clusters can sometimes suggest market segments
  (like the meat buyers in the third case study) worthy of further
  study.

Deciding on the Right Attributes

The final list of input variables
was compiled through an iterative process of consulting with experts on
printing for ideas of what might be important, using them to build
trees, and keeping the ones that proved to have predictive value. The
final set of input variables (which, for our purposes, we do not need to
fully understand) included

• ESA anode distance in millimeters
• Chrome solution ratio
• ESA current density in amperes per square decimeter
• Plating tank
• Viscosity
• Humidity
• Ink temperature
• Doctor blade oscillation in inches
• Doctor blade pressure in pounds
• Basis weight of paper
• Paper type
• Solvent type
• Cylinder circumference in inches
• Press type
• Press speed in feet per second
• Electrostatic assist current in milliamps
• Electrostatic assist voltage in kilovolts

All of
these variables take on values that either fall into clearly defined
categories (paper type, solvent type) or can be measured in the
specified units. Early on in the project, the team used some variables
that were more subjective. For example, there was one variable called
chrome condition that could take on the values "cloudy" and "clear."
Unfortunately, these values turned out to depend more on the person
making the observation than on the condition of the chrome. The decision
tree algorithm used in this early work could only split on categorical
variables. Continuous variables such as temperature, humidity, and
viscosity had to be binned. The printing experts felt that for most of
the numeric variables there were naturally three ranges: favorable,
neutral, and unfavorable. For example, they regarded high humidity as
favorable and low
humidity as unfavorable. The sense of the experts was that in some
middle range, humidity would have no effect, so other input variables
would control the outcome. Unfortunately, the experts could not agree on
boundary values for the ranges.

Data Preparation for Continuous Inputs

Experiments showed that simply partitioning the continuous variables
into three arbitrary ranges did not produce good results. What was
needed was a way to pick splits that captured the notion that there is
one range that affects the outcome in one direction, a second range that
affects the outcome in the opposite direction, and a neutral range
separating the first two. Bob Evans' approach was to use the median
value for the runs with banding as one boundary and the median value for
the runs without banding as the other. Once the continuous variables had
been binned in this manner, the decision tree algorithm was able to make
good use of them. There are now automated approaches to binning
continuous variables that might have helped even more. In his book, Data
Preparation for Data Mining (1999, Morgan Kaufmann), Dorian Pyle
describes a technique called "least information loss binning" for
choosing the optimal number and size of the bins. This technique
requires too much understanding of concepts from information theory such
as entropy and mutual information to be described here, but it is nice
to know that these kinds of transformations exist and are beginning to
be incorporated into commercial data mining tools.

Defining the Target Classes

A data mining training set is not complete until the data has
been preclassified so that the data mining algorithms know what they are
looking for. Sometimes this is straightforward; a prospect either did or
did not mail in a response card. Other times it is more subtle. In
Chapter 11, we saw that in order to predict churn we first had to define
it. In addition to distinguishing between voluntary churn and involuntary
churn, this involved choosing a time frame. In the wireless customer
churn case, if the question is "who will cancel their wireless phone
contract within the next 100 years?" all customers will get high churn
scores, but if the question is "who will churn tomorrow?" all customers
will appear loyal. So, a model built to classify people as churners or
loyal customers is really predicting which customers will cancel their
subscription before a certain date. Clearly, the choice of date affects
the outcome of the classification. In manufacturing, there is a parallel
notion. A product's reliability is often measured in terms of MTBF (mean
time between failures), the average time

runs of any length that resulted in
banding, and all print runs that ran longer than the cut-off and
finished without banding. While solving one problem, this solution
introduces another. The sample skew with respect to print-run length
means that a spurious rule will easily be discovered: short print runs
lead to cylinder bands. However, the tree growing program used at
Donnelley allowed users to interact with the tool by accepting or
rejecting proposed splits.

Inducing Rules for Cylinder Bands

The goals
for the cylinder band project were different from the goals of most of
the data mining projects we have discussed. In this case, there was no
plan to use the decision trees as predictive models. Instead, the trees
were used to generate a set of practical, prescriptive rules that could
be applied on the shop floor. In this situation, the best model is not
the one that does the best job of predicting when a job will fail, it is
the one that provides a set of implementable guidelines for preventing
future failures. The word "implementable" is very important. It is of
little use knowing that cylinders are less likely to band in humid
conditions if there is no way of controlling the humidity in the plant.
R. R. Donnelley now uses commercial data mining software for its ongoing
projects, but the work described here was done before commercial data
mining software was widely available. The decision trees were developed
in a homegrown software environment called Apos. In addition to the
usual preclassified data set, Apos starts out with a set of
heuristics, rules based on received wisdom. Examples of heuristics used
are

• Lower values for anode distance increase likelihood of banding
• Lower values of chrome solution ratio increase likelihood of banding
• Lower values of humidity increase likelihood of banding
• Higher values of ink temperature increase likelihood of banding
• Higher values of blade pressure increase likelihood of banding

When Apos is choosing a split,
it uses the standard information gain criterion from the ID3 decision
tree algorithm, but it also consults the heuristics. If the
automatically chosen split would contradict expert opinion, it looks for
a different one. Apos presents the analysts with several proposed splits
along with their information gain scores. This way, the experts can see
where their beliefs are being challenged by the data. The analyst can
choose which variable will be used to create a split and may decide not
to choose the one that provides the best split. Figure 14.2 shows a
subtree from one of the decision trees developed at the Gallatin
printing plant. There is one leaf in the tree with only one banding
event and 18 successful print runs with over a million impressions. This
suggests a set of operating guidelines to try on the shop floor:

• Keep the chrome solution ratio high.
• Keep the ink temperature low.
• Keep the ink viscosity high.

Figure 14.2 Finding conditions that reduce banding. (Node labels: Before
Split on Chrome Solution Ratio; Very Low Chrome; Medium Chrome; High
Chrome; Low Humidity; Low Viscosity; Hot Ink. Each node charts No Band
against Band.)

Note that these rules do not in any way explain cylinder
banding. They shed no light on the actual cause of the problem or the
mechanism by which this combination of chrome solution ratio, ink
temperature, and ink viscosity can somehow guard against grooves
appearing in the etched plate. In that sense the results may seem
intellectually unsatisfying, but, if they save the company millions of
dollars, they are good rules just the same. No one had heard of vitamin
C when the British navy started issuing limes to its sailors to ward off
scurvy, but it was effective all the same.
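
The split selection described earlier in this section, information gain plus a check against expert heuristics, can be sketched roughly as follows. This is not the Apos code, which is not shown in the text. The data layout, the function names, and in particular the test for "contradicting" a heuristic (comparing banding rates in the bin the experts call risky and the bin they call safe) are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, variable, target="band"):
    """Entropy of the target before the split minus the weighted entropy after."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[variable]].append(row[target])
    before = entropy([row[target] for row in rows])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after

def band_rate(rows, variable, value, target="band"):
    hits = [row[target] for row in rows if row[variable] == value]
    return sum(hits) / len(hits) if hits else None

def contradicts(rows, variable, risky_bin, safe_bin, target="band"):
    """Assumed test: the data show more banding in the bin experts call safe."""
    risky = band_rate(rows, variable, risky_bin, target)
    safe = band_rate(rows, variable, safe_bin, target)
    return risky is not None and safe is not None and risky < safe

def rank_splits(rows, heuristics, target="band"):
    """Return (gain, variable, contradicts_experts) tuples, best gain first."""
    ranked = []
    for variable in rows[0]:
        if variable == target:
            continue
        flagged = False
        if variable in heuristics:
            risky_bin, safe_bin = heuristics[variable]
            flagged = contradicts(rows, variable, risky_bin, safe_bin, target)
        ranked.append((information_gain(rows, variable, target), variable, flagged))
    return sorted(ranked, reverse=True)

# Made-up binned press runs; band is 1 when the run ended in a cylinder band.
rows = [
    {"humidity": "low", "ink_temperature": "high", "band": 1},
    {"humidity": "low", "ink_temperature": "low", "band": 1},
    {"humidity": "high", "ink_temperature": "high", "band": 0},
    {"humidity": "high", "ink_temperature": "low", "band": 0},
]
# Each heuristic names the bin the experts consider risky, then the safe one.
heuristics = {"humidity": ("low", "high"), "ink_temperature": ("high", "low")}
ranked = rank_splits(rows, heuristics)
```

Returning every candidate with its gain and a flag, rather than only the winner, mirrors the idea that the analyst sees several proposed splits and may choose one that is not the best by information gain alone.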

Change on the Shop Floor

The proof of the
pudding is in the eating, it is said; and the proof of the new
guidelines is in the banding statistics. Figure 14.3 shows the incidence
of banding over time. The bars represent the count of bands each month;
the line is a moving average of the counts. It is not hard to pinpoint
the time when the new guidelines came into effect. Before the decision
tree rules were distributed in the form of guidelines, they were
simplified by printing experts to eliminate dependencies on particular
press types in use in the plant. For example, if for one press type it
was important to keep blade pressure below 30 pounds and for another
type the magic number was 20 pounds, the distributed guidelines would
say to keep pressure below 20 pounds. As might be expected, it took some
time for confidence in the new guidelines to spread through the print
crews. A few early adopters began using guidelines based on the decision
tree rules in December of 1990, but it was not until May of 1991 that
printing experts at the plant had enough confidence in the rules to
distribute a table of conditions associated with banding and the
avoidance of banding to all presses. From that time on, the frequency of
banding began decreasing as more and more printers bought into the
system.

Figure 14.3 Banding incidence over time. (Chart title: Incidence of
Bands Over Time; annotation: Data mining-based guidelines introduced.)

Although the principal use of the project was to come up with
prescriptive rules, rather than to do predictive modeling, the fact is
that any set of rules is also a model and therefore can be used to make
predictions. One clear prediction coming out of the rules that were
distributed was that banding increases in periods of low humidity. In
Tennessee, the driest time of the year is late fall and early winter.
Sure enough, banding increased when the drier weather started. Also,
when a cylinder band appeared and the data about the operating
conditions at the time of the event were compared to the guidelines,
more often than not the press was found to be operating outside the
suggested bounds. This, along with a program to give special recognition
to the crew with the fewest cylinder bands for the year, resulted in
steadily increasing acceptance of the computer-generated guidelines and
steadily decreasing banding. The success of the project depended on the
ability of the analytical team to prove the worth of new ways of
operating, some of which went against the received wisdom of both
management and press operators, by demonstrating concrete results.
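
Two steps from this section lend themselves to a small sketch: collapsing press-specific limits into one plant-wide guideline by keeping the strictest value, and comparing the operating conditions recorded at a banding event against the guidelines. The variable name, the press type labels, and the reading of 25 pounds are hypothetical; only the 30-pound and 20-pound limits come from the example in the text, and the sketch handles upper limits only.

```python
def simplify_upper_limits(limits_by_press_type):
    """Collapse per-press-type upper limits into one limit per variable,
    keeping the strictest (lowest) value so the guideline is safe everywhere."""
    plant_wide = {}
    for limits in limits_by_press_type.values():
        for variable, limit in limits.items():
            plant_wide[variable] = min(limit, plant_wide.get(variable, limit))
    return plant_wide

def violations(conditions, upper_limits):
    """List the variables whose recorded value is not below its upper limit."""
    return sorted(
        variable
        for variable, limit in upper_limits.items()
        if variable in conditions and conditions[variable] >= limit
    )

# Hypothetical press type labels; the two limits echo the example in the text.
limits_by_press_type = {
    "press type A": {"blade_pressure_lb": 30},
    "press type B": {"blade_pressure_lb": 20},
}
guidelines = simplify_upper_limits(limits_by_press_type)

# Hypothetical conditions recorded when a cylinder band appeared.
out_of_bounds = violations({"blade_pressure_lb": 25}, guidelines)
```

A guideline of the form "keep the chrome solution ratio high" would need the mirror-image treatment, keeping the largest lower limit instead of the smallest upper one.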
Long-Term Impact

The results of the banding study were dramatic. In
1989, before the data mining effort, 538 banding incidents caused more
than 800 hours of downtime. In 1995, the plant experienced only 21 such
incidents resulting in 30 hours of downtime. The success of the
anti-banding project in Gallatin helped speed the introduction of more
automated monitoring and data acquisition devices around the company.
The initial pilot project at Gallatin was replicated at other plants. It
turned out that the guidelines developed in one location could not
simply be transferred to another; too many factors were different at
each new plant. Donnelley recognizes that data mining is not a one-time
fix. Over the years they have continued to build on the success of this
initial project in keeping with the stated corporate goal of continually
improving performance and productivity.

Reducing Paper Wastage at Time Inc.

Time Inc., the magazine publishing arm of the giant media
conglomerate, Time-Warner, is the world's largest magazine publisher.
Its flagship Time magazine has been published for over three-quarters of
a century. Its 30 or so other titles include Life, People, Sports
Illustrated, Entertainment Weekly, Teen People, In Style, Southern
Living, Cooking Light, Sunset, Sports Illustrated for Women, This Old
House, People en Español, Fortune, and Martha Stewart Living. With 120
million readers worldwide, Time Inc. goes through huge quantities of
paper.

size. Less paper went into the magazines and more ended up as waste.
Gradually, as the printers shift from 22-inch presses to 21-inch
presses, the move is reducing paper costs as well. This sort of waste,
called trim waste, is only one of many ways that paper ends up in the
recycling plant instead of in the magazine. The Time Inc. database
contains records of this and all the other kinds of waste that can show
up at the printing plants. In fact, Time Inc. keeps detailed performance
data on every press run at every plant and on every roll of paper
consumed in the process. This detailed tracking reflects the close
relationship between the publisher and the printing plants.

Relationship of Time Inc. to the Printing Plants

Although Time Inc.'s goals for this
project were very similar to those of R. R. Donnelley (cost reduction and
productivity improvement), the focus of the study was somewhat different
due to differences in the business arrangements of the two companies.
Time Inc. does not own the printing plants where its magazines are
printed. Instead, the publisher contracts with more than twenty printing
plants around the country. Time Inc. purchases its own paper and has it
shipped to the printing plants, so the printing plants have no direct
incentive to be frugal with paper. The interests of Time Inc. and of the
printing plants are kept in alignment by contractual limits on the
amount of waste allowed as a percentage of the total pounds of paper
used; also many of the conditions that lead to paper waste can lead to
printer down-time, which, as we saw in the R. R. Donnelley study, is
very expensive. For these reasons, and because it is always a good idea
to look out for the interests of your biggest customer, the printing
plants work very closely with the production division at Time Inc. to
find ways to improve the manufacturing process.

Performance Variation between Plants

One of the key reasons that Time Inc. was sure that
overall paper waste could be reduced substantially was the wide range of
performance among the printing plants. As we can see in Figure 14.4,
which reports paper waste by plant and type of waste for selected
locations over the study period, the best-performing plants waste only
half as much paper as the worst-performing plant. Of course, some
plant-to-plant differences reflect differences in machinery or workload
that cannot easily be altered, but if even a small amount of variation
is due to best practices that can be codified and disseminated to all
plants, the savings potential is very large. Time Inc. estimates that
for its New York titles (the ones of concern in this study), each one
tenth of one percent improvement is worth $200,000 per year. Those are
the kinds of numbers that data miners love to hear because even a small
improvement in manufacturing efficiency will pay for a lot of data
mining.

Figure 14.4 Paper waste by plant and type of waste. (Legend: Overrun,
Bindery, Run, Makeready, Core, Strip, Wrap; vertical axis from 0.00% to
6.00%.)

millions. One way to control paper cost is to buy when the price is
right. As with all commodities, the price of paper is volatile. At Time
Inc., a system called the Statistical Paper Ordering Tool (SPOT) uses 12
months of history to calculate

In 1998, a
task force headed by Dave Trevorrow, the business manager of the
production division at Time Inc., set out to sift through the press-run
and mill-roll data from all the plants looking for ways to reduce
waste. The team included Bill Walsh, a former production director for
Money magazine and an expert in all phases of paper making and printing,
and Alan and Geoff Parker, data analysis experts from Apower Solutions.
The authors worked with Apower, using MineSet to look for predictors of
addressable waste in the press-run and mill-roll detail data.

The Data

In sharp contrast to the situation at Donnelley where data from a few
hundred press runs was laboriously collected by hand, the data at Time
Inc. had already been collected in a relational database. This database
is fed by several

Shipment: This table records information about how the paper got to
the plant and the carriers that transported it, down to which door of a
two-door box car it was unloaded from.

Summary of Mining Data Set

The summary statistics of the mining data set give a good first overview
of what we had to work with:

• Number of press runs: 30,041
• Number of rolls: 523,893
• Rolls/Press Run statistics
    Average: 17 rolls/run
    Minimum: 1 roll/run
    Maximum: 293 rolls/run
• Press run length statistics
    Average: 6.68 hours
    Minimum: 0.03 hours
    Maximum: 759 hours
• Distribution of form types in press runs
    Advance: 14,806 runs
    Cover: 4,143 runs
    Current: 11,092 runs

Together, the press runs collected in the data mining data
set accounted for 100,552,816 pounds of wasted paper with a dollar value
of close to $50 million.

Approach to the Problem

The data analysts split
into two teams. Apower, in Atlanta, began doing hypothesis
testing, looking at various kinds of waste as a function of a long list
of known or suspected causal factors. Meanwhile, we at Data Miners in
Cambridge, Massachusetts, loaded the data into SGI MineSet and began
trying to find predictive factors for high levels of avoidable waste.
Hypothesis Testing The printing experts at Time Inc. had many insights
into likely causes of waste that they wanted tested through
single-variable regressions. Some examples are
Running Waste Running waste is waste that occurs during the actual
printing process. Web breaks are a major cause of running waste. A web
break is exactly what it sounds like: the paper breaks as it is flying
through the press at 3,000 feet per minute. This makes a big mess and
wastes many pounds of paper. Much of the data mining effort went into
trying to isolate the conditions under which web breaks are most likely
to occur. Another cause of running waste is blanket washing. From time
to time, the rubber blanket that transfers the printed image to the
paper (see sidebar on web offset printing) needs to be washed. This can
be done by automatic blanket washers while the press is still in motion,
but hundreds of impressions are wasted. Core Waste Just as the last
square of toilet paper is glued to the toilet paper tube, the end of a
roll of offset paper is glued to the core. The paper must be cut off and
spliced to the start of the next roll before the last crinkled bit of
paper comes off the core. The paper left on the core, and the core
itself, become core waste. Bindery Waste After the printed pages come
off the press as signatures (folded sections of paper containing 4 to 48
magazine pages), the signatures go to the bindery to be combined with
other signatures and turned into magazines. Some get damaged in this
process, so a few extra signatures are always printed. This bindery
allowance is normally about 2 percent, and those extra signatures are
considered bindery waste whether or not they actually are needed at the
bindery. It is paper that is used without ending up in a magazine. Trim
Waste Trim waste is the paper that is trimmed off after printing to
create pages of the desired size. Overrun Overrun occurs when more
signatures are printed than were ordered. At the speed that these
presses operate, letting a press run for a few extra minutes translates
into a large overrun. Addressable Waste Some waste is inevitable. All of
the wrapper waste and trim waste, most of the core waste and makeready
waste, and even some of the running waste is simply
example. A date string can be exploded into many potentially useful
variables such as the year, month, day of the week, and time of day. A
rule or pattern based on any of these variables will be much easier to
detect if each date feature is presented independently instead of being
buried in a date string. Transformations of both kinds were performed
using scripts written in the pattern-matching language, Perl 5. Before
Import The following is a listing of a few fields from the beginning and
end of a typical record from the roll-run data before transformation.
This record contains data on a single roll of paper and a press run in
which it was used.
mill-roll-id: 95WVL2359801 H
manufacture_date: Nov 1 1995 12:00AM
mill_cd: 2W
gross-lb: 2050.0
invoice-lb: 2041.0
length_ft: 18276.0
is_length_calculated_flag: N
diameter_in: 40.0
basis-weight-lb: 120.0
basis-weight-actual-lb: 120.0
grade_cd: Z
overrun-pct: 0.5
overrun_signature_cnt: 4265
actual-bindery-waste-lb: 919
requested-bindery-waste-pct: 2.7
bindery_waste_signature_cnt: 17281
actual-bindery-waste-pct: 2.3
press_run_signature_cnt: 661612
pr_order_signature_cnt: 640066
min-trim-pct: 0.0
min-trim-lb: 0
tot-trim-pct: 0.0
tot-trim-lb: 0
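The manufacture_date field in the record above is the kind of date string that benefits from being exploded into separate variables. The transformations described here were written in Perl 5; what follows is only a minimal Python sketch of the same idea. The function name, the output keys, and the assumed input format are illustrative choices, not taken from the original scripts.

```python
from datetime import datetime

# Assumed input format, matching the sample value "Nov 1 1995 12:00AM".
DATE_FORMAT = "%b %d %Y %I:%M%p"


def explode_date(date_string):
    """Split one date string into independent features (hypothetical helper)."""
    parsed = datetime.strptime(date_string, DATE_FORMAT)
    return {
        "year": parsed.year,
        "month": parsed.month,
        "day_of_week": parsed.weekday(),  # 0 = Monday ... 6 = Sunday
        "hour_of_day": parsed.hour,
    }


features = explode_date("Nov 1 1995 12:00AM")
print(features)
```

Each key can then be loaded as its own column, so that a rule involving, say, the day of the week does not have to be discovered inside a raw date string.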
Convenience Fields In addition to format conversions and type
conversions, several convenience fields were added to the data. In the
source data, a compound key consisting of the mill-roll-id, plant_cd,
and press_run_start_time fields was required to identify a particular
press run. Before being brought into MineSet, each unique
press run was assigned an integer press-run-id and each roll-run
record was assigned an integer record-id. The press run IDs are there to
facilitate interactive analysis at the press run level. The record IDs
are there to facilitate communication about suspected data problems. It
is sometimes convenient to add fields to a data set that will take no
part in model building or rule induction, but which will make it easier
for the data miner to define samples and subsets or to identify
particular records without resorting to long compound keys. After Import
Data transformation does not end with the importation of records into
the data mining tool. Indeed, the mining process involves the continual
creation of new fields on an exploratory basis. These new fields are
derived from the original fields through binning or through
transformation expressions created with the small equation editor that
is part of the MineSet user interface. Derived Fields Two types of
derived fields were created to aid in data mining. The first type of
transformation was required in order to convert continuous variables
into categorical variables (e.g., high, medium, low) so that they could
be used with data mining techniques, such as association rules, that
cannot handle continuous variables. Fortunately, MineSet supports a wide
range of binning approaches. The second type of transformation involved
creating derived fields that contain information from two or more other
fields. The data mining algorithms supported by MineSet all examine one
field at a time. This means, for instance, that the date of manufacture
and date of use of a roll of paper will be considered individually (and
found to have no predictive value). To have the paper's age taken into
account, it was necessary to create a new variable, paper_age_months, by
subtracting the year of manufacture from the year of the press run,
multiplying by 12, and then adding the difference of the months.
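Both kinds of derived field can be sketched in a few lines of Python. This is an illustration rather than the MineSet expressions actually used: the function names and the bin cutoffs are hypothetical, and the age is taken as the press-run date minus the manufacture date so that a normal roll has a positive age (an assumption about the intended sign).

```python
def paper_age_months(manufacture_year, manufacture_month, run_year, run_month):
    """Age of a roll in whole months at the time of the press run."""
    return (run_year - manufacture_year) * 12 + (run_month - manufacture_month)


def bin_age(age_months, low_cutoff=3, high_cutoff=12):
    """Map a continuous age onto low/medium/high (cutoffs are hypothetical)."""
    if age_months < low_cutoff:
        return "low"
    if age_months < high_cutoff:
        return "medium"
    return "high"


# A roll made in November 1995 and printed in February 1996.
age = paper_age_months(1995, 11, 1996, 2)
print(age, bin_age(age))

# A roll "printed" two months before it was made yields a negative age,
# which points to a data problem rather than to a real roll.
suspect_age = paper_age_months(1995, 11, 1995, 9)
print(suspect_age)
```

The first function combines two fields into one, and the second turns the continuous result into a categorical value that a technique such as association rules can use.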
Classification Target After much experimentation, we settled on a
derived variable, total_blame-waste-flag. This field contains a "W" for
any press run where the percentage of addressable waste is in the 75th
percentile or above. The definition is
based on addressable waste rather than total
waste because our goal is to find actionable information and, by
definition, there is nothing that can be done about nonaddressable
waste. Addressable waste is a derived field defined as the difference
between total_waste_pct and min-total-waste-pct, where min-total-waste-pct
is the calculated minimum expected (nonaddressable) waste. Data
Characterization and Profiling Once the data has been imported, the next
step is to profile it. MineSet provides a "statviz" tool for this
purpose. Statviz produces a small graph for each variable. For numeric
variables, there is a panel showing the number of values, range, median,
mean, and standard deviation along with the 25th and 75th percentiles.
For categorical variables, there is an annotated bar chart showing the
prevalence of each category label. These graphs will often reveal
problems with the data. For example, although the profiles for
press-run-start-time and manufacture-date look perfectly reasonable on
their own, when the field derived from them to determine the age in
months of a paper roll was profiled, it became immediately apparent that
there is a sprinkling of negative ages in the data, meaning that some
rolls claim to have been used before being manufactured! Decision Trees
The task we set ourselves at Time Inc. was very similar to the work at
R. R. Donnelley in that we were more interested in generating rules to
explain paper waste than in classifying particular runs as wasteful or
predicting which future runs would be wasteful. As a consequence, the
decision tree that made the best predictions was not the one most useful
to us. Classification versus Explanation Taken as a whole, a decision
tree is a classifier. Any previously unseen record can be fed into the
tree. At each node it will be sent either left or right according to
some test. Eventually, it will reach a leaf node and be given the label
associated with that leaf. Using MineSet, we used 5 percent of the data
to build a decision tree which, when applied to the entire data set, was
able to predict the correct value of the blame_total_waste_flag with
better than 90 percent accuracy. This is an important result because it
means that there really are patterns in the data that can be discovered
and expressed as rules that determine whether a particular run is likely
to produce excessive waste. Classification, however, was not the goal of
this data mining exercise. We did not wish to simply classify print runs
as wasteful or not; we wanted to come to
nodes begin to handle fewer press runs, they are no longer split. We
chose the latter course because it does a better job of ensuring the
generality of the rules produced. To further simplify the rule
extraction task, we looked only at nodes classified as high waste
(defined as being in the 75th percentile or above for total addressable
waste), ignoring those that describe more typical runs. Figure 14.6 is a
screen shot of the MineSet tree visualizer. In the foreground, we see
the root node. Each node has a gray base, the height of which
corresponds to the "subtree weight," or the number of records that
passed through that node. On top of the base, the two bars represent the
two values of the total addressable waste flag. The bar on the right
represents the number of records classed as high waste; the blue bar
represents the rest of the records. At the root, 25 percent of the
records are classified as high waste by definition.
[Figure 14.6 A simpler tree for addressable waste.]
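The high-waste flag that the tree is built to explain can be sketched as follows. This is a minimal Python illustration, not the derivation actually performed in MineSet: the function name, the "N" label for the remaining runs, the choice of quartile method, and the toy data are all assumptions made for the example. Addressable waste is taken as total_waste_pct minus the minimum expected waste, and a run is flagged "W" when it falls at or above the 75th percentile.

```python
from statistics import quantiles


def flag_high_waste(runs):
    """Return "W" for runs at or above the 75th percentile of addressable waste."""
    addressable = [
        run["total_waste_pct"] - run["min_total_waste_pct"] for run in runs
    ]
    # Third quartile cut point; the "inclusive" method is an arbitrary choice here.
    cutoff = quantiles(addressable, n=4, method="inclusive")[2]
    return ["W" if value >= cutoff else "N" for value in addressable]


# Toy data: 20 made-up runs with steadily increasing total waste.
runs = [
    {"total_waste_pct": 5.0 + 0.5 * k, "min_total_waste_pct": 5.0}
    for k in range(1, 21)
]
flags = flag_high_waste(runs)
print(flags.count("W") / len(flags))
```

Because the flag is defined by a percentile of the data itself, about a quarter of the runs carry it regardless of how much waste there is overall, which is why the root node shows 25 percent high waste by definition.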
Lessons Learned In this chapter we have seen that data mining can be
profitably applied to the cost side as well as the revenue side of a
business. The two examples of data mining for process improvement in the
printing industry had many differences, but they shared key
similarities:
• The projects used data mining to find prescriptive rules that could be used to improve the production process.
• The projects succeeded due to the constant involvement of subject matter experts who understood paper and printing inside out and were willing to provide guidance to the data miners.
• Implementation of the new policies suggested by data mining requires the active cooperation of the people on the plant floor.
• Data mining does not always require huge volumes of data; the Donnelley study used only a few hundred records.
• Where huge volumes of data are available, as at Time Inc., data mining can help make sense of it.
As we said in Chapter 1, data mining is useful
wherever there are large quantities of data and something worth
learning. Manufacturing process control certainly fits the bill.
The Societal Context: Data Mining and Privacy During the
nineteenth century, photography was a new technology, and as with many
new technologies, it inspired the fear that photographic images could
steal the very souls of the people captured in the images. Eternal
damnation was a greater preoccupation than mere intrusiveness. After
all, not only did early photographs require several minutes of exposure
to the subject, they also required hours of careful treatment with rank
and dangerous chemicals. The mere fact of producing a photograph was a
wonder and an art to some, and a threat to others. Fast forward to the
end of the twentieth century, and we have the travails of Princess
Diana, caught in her dying moments for the world to see. The image of
paparazzi chasing the Princess through tunnels in Paris, leading to her
ultimate demise, is a striking image of one person's loss of privacy to
the photographer's lens. Today's technology has almost nothing in
common with the technology of yesteryear. A "photograph" today may be
made using digital technology, never developed in a darkroom, and
published over the Web, to a potential audience of hundreds of millions,
in a matter of minutes. Within hours, it can be on the front pages of
millions of newspapers and magazines. Technology has conspired to make
the world not only a smaller place, but potentially much more intimate.
And, it is not only celebrities who are concerned about privacy. The
rise of the Web is making privacy a major concern. The electronic
world is, in many ways, a reflection of the more material world where
data has been proliferating for decades. Ironically, although the Web seems
frightening in its ability to monitor, it also offers some privacy
protections. At the same time, data mining holds the promise of using
all this data for more constructive ends, for customizing the Web
experience. In this final chapter, we are taking a look at data mining
in the context of society, focusing primarily on privacy issues.
Interspersed throughout the chapter are anecdotes and news stories about
privacy, both with data mining and without. It is important to realize
that any discussion of privacy must talk about the threats and
invasions. Every day, though, there are literally billions of
transactions that take place, both on the Web and in the rest of the
world, that are not the subject of horror stories. Data mining is part
of a technological, social, and economic revolution that is making the
world smaller, more connected, and more service-driven, and providing
unprecedented levels of prosperity. At the same time, more information
is known, stored, and transmitted about us, as individuals, than ever
before. Privacy Prism The world is changing fast, especially with
regard to privacy and private spaces. Once upon a time, it was
sufficient merely to build walls high enough to keep out peering eyes.
Now there are satellites that peer down from above and can read the
license plate on a car. One dictionary suggests that privacy is "being
withdrawn from society or public interest; being alone and undisturbed."
Another suggests that it is "being not open or controlled by the
public; or for an individual person." The dictionary definition
provides little guidance for understanding threats to privacy. The idea
is very subjective, with every person having his or her own limits and
levels of tolerance. Every form of commerce leaves an electronic trail.
Acts that were once private, or at least quickly forgotten, are now
stored for future reference. Privacy acts on new technologies the way a
prism divides light. For each advance in technology, the privacy prism
splits the technology into a multitude of issues, representing potential
threats to privacy. It is possible to identify the exact location of a
cell phone when it is turned on (and the approximate location is
available through the tower routing the call). For motorists requiring
assistance, this can save lives; most would agree that the greater good
of saving lives is worth some intrusion into the privacy of the phone
owner. On the other hand, being able to identify the location of phones
brings up a range of issues. Who owns the location information, and who
has rights to know it? Does the wireless phone company have a right to
know my location? Does the government? Can Burger King send a message, saying "next restaurant at
Exit 15"? Does the employer, if they own the cell phone? Do parents, who
may have bought the cell phone for their teenage children? Can a person
opt out of the location-identifying technology when they purchase a
handset? Just as high walls protect from peering eyes, high walls also
protect from criminals. Society's need to maintain safety and security
definitely places some limits on privacy. Willingly or unwillingly, we
must accept the social contract and these types of infringements. The
government already intrudes into our private lives. What about the
private sector? Companies collect information about us, sometimes
without our consent. They can then use the information directly, or
sell it to others. What rights do people, as individuals, have over
their own data? Once, the only way to spread information about others
was through gossip and idle chatter. Then more people became literate,
and we invented printing, and newspapers and magazines. Then we invented
photography, telecommunications, and Web sites. Personal data, even our
images and voices, can be transmitted without our consent. Some people
are very sensitive about their private lives. In the United States, they
use their Social Security Number only when they have to. They lie about
their mother's maiden name. Even the occasional marketing letter
offering a product or service is too much of an intrusion. They do not
want to be a row in someone's database. But, for most people, it seems
that one of the bargains we make in living in the modern world includes
a certain lack of privacy. We must assume when using a credit card, for
instance, that the credit card company might keep track of our
purchases. Letting credit card companies have access to such information
is, for most people, a worthwhile bargain for the convenience of making
purchases using plastic. And, we still have the option of using cash or
checks-at least in the brick-and-mortar world. The seamier side of the
Web shows another extreme on the privacy scale. Once, people could
strive to "tell-all" in books, or by shouting on a street corner. A
woman named Jenni taught the world about watching the humdrum lives of
ordinary people, a lesson that has since been copied by others. These
activities may be erotic, or may be as common as eating a snack and
reading a book, or as uncommon as a woman giving birth on the Web. Jenni
lived her home life in view of a bunch of cameras connected to a Web
site-what are now called Webcams. Her example, although voluntary, shows
the possibility of having every detail of one's life recorded and
transmitted to a medium that can reach hundreds of millions of people.
It shows just how far privacy can be sacrificed. This extreme is not as
far removed from the average person as it may seem. Many communities
around the world monitor public streets with cameras to reduce crime.
Security cameras watch us at ATMs, in elevators, and in parking garages.
And, as celebrities have long known, these images belong to whoever
takes them, so they may appear in a newspaper, on the nightly news,
or on yet another Web site. Privacy is a complex issue that, because of
technology, is increasingly becoming a social issue. • It is an issue
that we must be concerned about, both as individuals and in work we do
that may intrude on the privacy of others. • The social contract already
places limits on privacy; the issue is really how much and who is in
control. Different societies will likely resolve these issues in
different ways. Some people are very reluctant to have any information
known about them. Others are willing to have the most intimate details
of their lives revealed to the world. Technology plays a role in
defining privacy, in protecting privacy, and in intruding on privacy.
Although privacy in general is a fascinating subject, a full treatment
is beyond the scope of this book. Here we are most interested in the
business world. And, of course, in data mining itself. Is Data Mining a Threat?
Data mining, as described in this book, is a business process that
enables companies to maximize the value of the data they have collected
and purchased. Data mining is a competence that addresses the strategic
need of businesses to manage their customer relationships and run more
efficiently. Many of the uses of data mining are in the area of
marketing. And the result, from the individual's perspective, is yet
another piece of unsolicited mail. Or yet another telemarketing call at
an inconvenient time. Or yet another large banner ad that delays the
downloading of more interesting information. Of course, the purpose of
data mining is to direct the communication to people who are more
interested; however, some people resent any exposure to marketing
programs. This aspect of data mining does not seem particularly
threatening to the individual. However, if we move out of the realm of
direct marketing and into other areas, the lines become more blurred.
Consider the section "How Sick Is She?" where a letter sent out by
health care providers caused a woman to panic and created a potential
legal liability. This aside points out another aspect of privacy:
Privacy violations may incur legal liability. Even when they don't, they
can be the cause of bad press, which can do considerable harm to brand
or corporate image.
lected in the course of our daily lives. And with the rise of the
Web for e-mail, e-commerce, news, and entertainment, almost no aspect of
daily life will be unrecorded. Is this a danger? On balance, we believe
that it is not. We do believe that the expectation of privacy is quite
reasonable. And maintaining privacy is a concern and requires vigilance.
The expectation is that only the individuals or organizations that need
information will use it. Of course, with demographic and psychographic
data readily available (at least in the United States), we, as a
society, have violated this expectation long before the rise of
e-commerce. The expectation of privacy leads quite readily into
government policy and regulations. The purpose of this chapter is not
to delve into these in great detail. It is worth noting, though, that
there are two policy extremes: Consumer rights. The more laissez-faire
approach is to educate the public and ensure that individuals have
control over their own information. Companies may use the information,
but only after obtaining permission. This is the prevalent philosophy in
the United States. It manifests itself by allowing consumers to "opt
out" of specific uses of information. Consumer protection. The stricter
approach is to protect all consumers by making it illegal for companies
to collect certain types of information, or to use information in
certain ways. This approach tends to be favored in Europe, which has a
more recent societal memory of atrocities directed toward minorities.
Even in the United States, banks avoid using race when making offers of
credit, because of fair credit lending laws made necessary by recent
history. Each of these approaches has good points and bad points.
Although we tend to lean more toward the consumer rights side, we
recognize that "consent statements" are often obscured in legalese
buried in the small print of standardized contracts. On the other hand,
consumer protection often makes it difficult to target individuals with
tailored marketing messages that would benefit them. The Importance of
Privacy Is all the hoopla about privacy just the result of asocial,
oversensitive individuals? Actually not. Most people would agree that
the images of Big Brother, conjured up by George Orwell, truly are a
nightmare. It is also worth considering that violating the expectation
of privacy can have serious implications for individuals.
On the other hand, people are not
perfect and society plays a protective role as well. To what extent do
we allow law enforcement to investigate suspicious activities? To what
extent do we need to sacrifice our own personal privacy to be fully
involved, economically and socially? Anonymity and Seclusion An open
society allows anonymity, at least some of the time. So, purchases made
with cash generally cannot be traced (at least not easily), and
telephone calls made from pay phones do not reveal who is making the
call. Anonymity plays a special role. For instance, witnesses to a crime
may be reluctant to provide information, unless they have some
protections (although, in the United States, the Sixth Amendment gives the accused the right to face
the accuser). People calling a substance abuse hotline might be less
likely to seek help if they must reveal their identities. Abused women
and children need safe houses, where abusive partners cannot locate
them. Parents who have given their children up for adoption may never
want to be found. Even for such a mundane task as students evaluating a
professor, anonymity encourages evaluations to be as honest as possible.
Often, people may simply be embarrassed or reluctant to reveal certain
details, and there is no reason they should have to. Is there a reason to
know that a job applicant has been divorced? Probably not, even though
this information may be readily available. The section "Marketing
Program Leads to Divorce" tells the story of how a marketing program led
to the demise of one man's marriage. Is this a violation of privacy? Or,
is BT no more culpable than a neighbor who happened to witness the
indiscretions? Legal Discrimination A pernicious problem occurs when
certain forms of discrimination are legal. For instance, in the United
States, health insurance companies can legally charge higher premiums to
people who smoke than they can to nonsmokers. Does this give the
insurance companies the right to analyze the shopping habits of
applicants to determine if they ever purchase cigarettes? Maybe we think
that health insurance is special, in some way. Consider life insurance
companies that either refuse to insure, or charge much higher premiums
for, sky divers. If a life insurance company purchases the mailing list
for a sky diving magazine, and treats everyone on that list as
uninsurable because of their propensity to engage in dangerous pastimes,
does that give them the right to deny insurance? Note that their
actuaries may have calculated that the subscribers of sky diving
magazines (as opposed to the activity itself) do pose an uninsurable
risk.

Marketing Program Leads to Divorce
Many years ago, MCI introduced
the "Friends and Family" program to encourage their customers to
encourage their friends and family to switch to MCI. MCI asked their
customers to provide names for the marketing effort. Of course, MCI did
not really have to ask, since they have the data about everyone their
customers call. Across the Atlantic (according to Wired, 29 Mar 1999),
BT is locked in a fierce battle for telephone customers. In the United
Kingdom, every telephone call has a toll charge, including local
telephone calls. To inspire loyalty, BT introduced their own version of
a "Friends and Family" program. In this version, customers would get
discounts on calls to particular numbers that they identified to BT in
advance. As a courtesy to customers, BT notices which numbers should be
on the list. That is, if a customer has a frequently called number, then
the company will send out a letter suggesting the number be added to the
customer's friends and family list. One lucky household received such a
letter. However, in this case, the wife opened the letter, but did not
recognize the most-frequently-called number. After some investigation on
her part, she uncovered her husband's infidelity, and threw him out of
the house. The risk here goes beyond wrecking one marriage. The customer
has threatened to sue BT saying that the promotion "wrecked" his 40-year
marriage. It is good to remember that violations of privacy can have
high personal costs, as well as exposing companies to legal liability.

Is this somehow different from asking people whether they
sky dive in the first place? Many people would say yes: they have lost
control over information about themselves when a third party (the
magazine) sells their name. Insurance is inherently a difficult area.
After all, the idea behind insurance is to pool people with similar
risks and charge premiums based on the calculated risk. Assigning risk
to a "market-of-one" contradicts the benefits of pooling. And, although
it may be beneficial for the vast majority of people (who are low risk),
it tends to cost the few a lot more money. The section "Violation of
Privacy = Loss of Career" tells the story of Timothy McVeigh, who was
discharged from the Navy because an illegal violation of his privacy
suggested that he might be gay. This is an example of someone's job
being at risk due to privacy violations. This may seem to be a special
case, because the American military has an expressed policy of
discriminating against gay people, and few institutions have such
explicit discriminatory policies. It does bring up the issue that in a
world of Web pages and chat rooms, much more information is potentially
available about individuals, and this information could be used to deny
employment, credit, or insurance.
people may not be informed sufficiently to take the full dose of
medicine, whether because of ignorance, limited finances, or other
reasons. Where does the greater good take precedence over the right of
individuals to make such purchases, especially where the consumer's
government has no authority? Information in the Material World Once upon a
time, if a car were illegally parked in front of your home, then you
would have difficulty tracking down the car's owner. Of course you
could, by trekking to the Department of Motor Vehicles and looking
up the license plate number in a file, discover the owner. It would
be unlikely, however, since it would be time-consuming and inconvenient.
Now such information is readily available on the Web (although many
states, including California and Michigan, have passed legislation
restricting access to motor vehicle registration data). It is possible
to type in a telephone number in the United States and not only get the
owner and the address, but a convenient map showing the location with an
option for driving directions! Curious about fathers who don't pay child
support, or who is registered to vote (and with what party)? The information
is available electronically. That the government provides such
information should not be surprising. The government has
regularly published many different types of information. Having open
lists of voters, for instance, is part of the foundation of democracy,
helping candidates for public office as well as deterring voter fraud.
Having the information available with a few clicks of a mouse, though,
introduces a new element. Access is orders of magnitude easier,
resulting in a qualitative change as well as a quantitative change. The
private sector is also a prodigious collector of information. Probably
one of the best known (and best protected) sources of information is the
credit history record. This contains information on all loans and credit
cards held by an individual, including monthly balances, late payments,
defaults, and so on. Credit history is well-protected, and such
information is available only to a company making an offer of credit to
the individual; it cannot be purchased for other purposes. Further,
customers have the right to see their credit report and to use a grievance
procedure when there are problems. Other data, though, is much less
protected. For instance, Abacus maintains a database of purchases made
through catalogs. This information is available to the hundreds of
catalog companies in the consortium that forms Abacus. Most of you are
probably not even aware that such a list exists. Subscription lists are
often freely sold. Want to reach Hispanics? Buy the mailing list to
People en Español. Want to introduce an audience to the wonders of
Olestra? Buy the mailing list to Cooking Light. Want to reach IT
managers implementing enterprise-wide solutions? Buy the mailing list to
Intelligent Enterprise. And beyond magazines, some nonprofit groups and
political organizations sell information about members. Some conferences
sell attendee lists. Some retailers sell information about purchase
patterns. In other words, information about individuals is a valuable
commodity, and many companies, such as Acxiom, Polk, Experian, and First
Data, provide such data. This has an impact on data mining as well.
These sorts of external information are often used for building models,
as we have seen in several chapters.

Privacy in the Electronic World

The
electronic world is introducing a whole new dimension to the problem of
privacy. Prior to e-commerce, computers mimicked business practices in
the material world. Magazines have existed for centuries and, with or
without computers, they have always been able to sell their subscription
lists. Computers make the process more efficient, but they do not
radically change the business processes. The Web, the world of
interconnected commerce, information, and entertainment, has some new
dimensions. For the first time, it is possible to track:
• Advertising messages that have been seen by a particular individual,
  over time
• Advertising messages that a particular individual is responding to
• The content an individual has been exposed to
• Especially, the advertising messages, content, and purchases made
  throughout the electronic world, as opposed to through a single vendor
The difference between the electronic
world and the material world is significant. In the material world, the
magazine publisher may know which magazines you subscribe to. In the
electronic world, the content provider knows which ads you have seen,
which articles you have read, and which sites you have visited.
E-commerce is still in its infancy. Content providers are still
determining how they will work together and how they will share
information among themselves. Governments have not yet fully addressed
the privacy issues arising on the Web. Web sites are still busy getting
up and running, attracting consumers, investment, and advertising
dollars; they are not yet fully in the business of using the data that
they collect. Consumers still make only a small portion of their
purchases online, so they are not yet fully sensitized to the privacy
issues.
This is all in the process of changing. Let us look at this from two
perspectives: one, despite the Web being a completely automated medium,
it is not that easy to spot the individual on the Web; and two, how is
information shared, and what direction is e-privacy likely to take?

Identifying the Customer

Surprisingly, identifying the consumer on the Web is
perhaps harder than identifying consumers in the material world. After
all, there is no Web-equivalent of a social security number or street
address. And consumers are reluctant to give such information online,
unless they are in the midst of a purchase. When talking about customers
online, there are three different streams of data that are available to
identify users: clickstream events, cookies, and registrations.

Clickstream Events

Every Web page is identified by a uniform resource
locator (URL), such as http://www.data-miners.com. This URL specifies
that the protocol for accessing the page is http. The part
following the double slashes is the name of the computer. This URL
could also have a file path at the end, although it does not in this case
(it actually defaults to www.data-miners.com/index.shtml). When a user
clicks on a Web address, the browser passes on information to the server
handling the desired page. Some of this information, such as browser
type, CPU, operating system, and screen resolution, is quite harmless
and can be used to optimize the page being returned. The most
interesting information is the address of the referring page. This makes
it possible to follow all the pages visited, when they stay within a
single site (or on cooperating sites). Using this information, it is
possible to determine things like:
• When people visit an automobile site looking for information on vans,
  they also look for information on sports utility vehicles.
• When people visit Amazon and look for technical books, they almost
  never visit the music portion of the site.
• When people shop for insurance, the site with the cheapest quote gets
  twice as many clickthroughs as the site with the second cheapest quote.
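
As a rough illustration of how the referring-page address lets a site string page requests together into one visit, here is a minimal Python sketch. It assumes a hypothetical, heavily simplified log in which each record is just a (requested URL, referring URL) pair; the URLs, the record layout, and the function names are invented for illustration and do not come from any particular Web server.

```python
from urllib.parse import urlparse

# Hypothetical, simplified hit records: (requested URL, referring URL or None).
hits = [
    ("http://www.example-autos.com/index.html", None),
    ("http://www.example-autos.com/vans.html",
     "http://www.example-autos.com/index.html"),
    ("http://www.example-autos.com/suvs.html",
     "http://www.example-autos.com/vans.html"),
]


def describe_url(url):
    """Split a URL into its protocol, computer name, and file path."""
    parts = urlparse(url)
    return parts.scheme, parts.netloc, parts.path or "/"


def follow_visit(hits):
    """Chain hits into one visit by matching each referrer to the prior page.

    Simplification: each page is assumed to lead to at most one next page.
    """
    next_page = {referrer: url for url, referrer in hits}
    path, current = [], next_page.get(None)  # entry page has no referrer
    while current is not None and current not in path:
        path.append(current)
        current = next_page.get(current)
    return path


if __name__ == "__main__":
    print(describe_url("http://www.data-miners.com"))
    for page in follow_visit(hits):
        print(page)
```

Note that nothing in these records says who the visitor is; the chain describes a single visit and nothing more.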
Clickstreams do not track individuals over time, or even when they start
bouncing among many different Web sites. They only connect Web pages
together to describe a single visit. There is really no analogy in the
brick-and-mortar world. It is as if someone followed each shopper
around a shopping mall, and jotted down which stores each shopper
visits, which departments
the shopper goes in, and which items the shopper looks at, all
without identifying the shopper or recognizing the shopper the next time
he or she goes to the mall.

Cookies

A cookie is a small amount of
information sent by a Web server to be stored by the Web browser. The
site that sent the information, and only the site that sent the
information, is authorized to read it. Cookies allow Web sites to
remember information about a user in between visits, a major advantage
over raw clickstream information. (A good source of information on
cookies is www.cookiecentral.com.) Cookies themselves are stored in a
text file that can readily be read. More recent versions of browsers
give users the ability to control which, if any, cookies they want to
store. The information itself is, more often than not, unintelligible,
because the format is whatever that particular server wants to store,
and may even be encoded. Happily, a Web server can only read cookies for
its own site. Cookies bring up privacy issues because some remote server
is placing information on the user's own hard disk. The defaults for
most browsers are to allow cookies, with no notification given to the
user. However, browsers typically do have an option for notifying users
about cookies and it is relatively easy to eliminate all cookies from
your computer (or to selectively eliminate cookies for a few sites). It
is not so much the cookies themselves that may tread on privacy; it is
how Web sites use them. A first observation about cookies is that they
are rather limited. They do not identify an individual; they identify a
browser/computer combination. This results in:
• The same person using two different browsers on the same computer has
  two different sets of cookies, one for each browser.
• Multiple people using the same computer (with the same login) and the
  same browser all share the same cookies.
• The same person using multiple computers has a different set of cookies
  on each computer.
And, [...]
• Which referring pages were visited prior to coming to this site
• Which ads have been displayed
This information is often used immediately
to choose Web ads for display, as described in the sidebar. One other
interesting note about cookies is that many sites, including most
e-commerce sites and sites that allow customization, do use cookies. This
means that the cookie file also keeps track of which Web pages someone
has been visiting, a warning to anyone who borrows a computer!
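
The point that a cookie identifies a browser/computer combination rather than a person can be sketched in a few lines of Python. This is only a toy simulation under stated assumptions: the cookie name `visitor_id`, the made-up browser keys, and the helper functions are hypothetical, and the only library used is the standard `http.cookies` module for reading and writing cookie text.

```python
from http.cookies import SimpleCookie
import itertools

_next_id = itertools.count(1)  # hypothetical source of new visitor IDs


def server_response(cookie_header):
    """Return (visitor_id, Set-Cookie text or None) for one request.

    cookie_header is the cookie text the browser sends for this site,
    or "" if the browser holds no cookie for it yet.
    """
    received = SimpleCookie()
    received.load(cookie_header)
    if "visitor_id" in received:
        return received["visitor_id"].value, None  # returning browser
    new_id = f"v{next(_next_id)}"
    reply = SimpleCookie()
    reply["visitor_id"] = new_id
    return new_id, reply["visitor_id"].OutputString()


# Each browser/computer pair keeps its own cookie text for the site.
browsers = {("home-pc", "browser-a"): "", ("home-pc", "browser-b"): ""}


def visit(browser_key):
    """Simulate one page request from the given browser/computer pair."""
    visitor_id, set_cookie = server_response(browsers[browser_key])
    if set_cookie:
        browsers[browser_key] = set_cookie  # the browser stores the cookie
    return visitor_id


if __name__ == "__main__":
    print(visit(("home-pc", "browser-a")))  # first visit: new ID issued
    print(visit(("home-pc", "browser-a")))  # same browser: same ID
    print(visit(("home-pc", "browser-b")))  # same person, other browser: new ID
```

The site sees two unrelated visitors even though one person is sitting at one computer, which is exactly the limitation described above.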

Registration

The third way that sites keep track of visitors is by
having visitors register at the site, typically using a login name and
password. Registration serves many purposes:
• The site can customize itself for the user.
• The site can offer additional services, such as e-mail, messaging, and
  stock quotes.
• The site can keep track of very confidential information, such as
  credit card numbers, for e-commerce.
• The site can identify the user, and differentiate among multiple users
  from the same machine.
Users that register at sites often use their
registration regardless of the machine they use to access the Web. A
cookie keeps information only on a single machine. A registration keeps
the information on the server computer, so it is accessible regardless
of where the user logs in from. The two capabilities are often used
together. For instance, the login/password combination is often maintained
in a cookie for the convenience of users who do not want to retype their name
and password every time they access the site. Registration offers all of
the capabilities of cookies. One of its advantages, though, is that
the user is fully aware of where he or she is registering. Users can
also be encouraged to read privacy policies (as well as other relevant
business practices) during the course of registration.

Putting It All Together

Clickstreams, cookies, and registrations are important tools
for driving e-commerce. They are responsible for much of the popularity
and potential that the Web offers as a new channel for reaching
customers.
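
To make the contrast between cookies and registration concrete, the following Python sketch keeps registration data on the server, keyed by login name, so the same profile comes back no matter which machine the user logs in from. The class name, the profile fields, and the choice of a salted PBKDF2 hash for the password are all assumptions made for this illustration; they are not a description of how any particular site works.

```python
import hashlib
import hmac
import os


class RegistrationStore:
    """Toy server-side store: profiles live on the server, keyed by login."""

    def __init__(self):
        self._users = {}  # login -> (salt, password hash, profile dict)

    def register(self, login, password, profile):
        """Store a salted password hash and a copy of the user's profile."""
        salt = os.urandom(16)
        self._users[login] = (salt, self._hash(password, salt), dict(profile))

    def log_in(self, login, password):
        """Return the stored profile if the credentials match, else None."""
        record = self._users.get(login)
        if record is None:
            return None
        salt, stored_hash, profile = record
        if hmac.compare_digest(stored_hash, self._hash(password, salt)):
            return profile
        return None

    @staticmethod
    def _hash(password, salt):
        # Salted hash so the password itself is never stored.
        return hashlib.pbkdf2_hmac(
            "sha256", password.encode("utf-8"), salt, 100_000
        )


if __name__ == "__main__":
    store = RegistrationStore()
    store.register("pat", "s3cret", {"favorite_section": "technical books"})
    # The same login works from any machine; no cookie is involved.
    print(store.log_in("pat", "s3cret"))
    print(store.log_in("pat", "wrong-password"))
```

A site that also wants the convenience described above would typically place some token in a cookie so the user does not have to log in again on that one machine.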

Directed Advertising on the Web

How does directed advertising work
on the Web? In some ways, it is quite similar to how advertising works
in the material world. Advertising agencies often purchase space in
media such as newspapers, magazines, television channels, and radio. Because
they purchase the space in bulk, they get good rates. The agencies then
sell this space to their clients, trying to match client interests to
the appropriate media. Ads for hip clothing are more likely to appear in
Teen People than in National Geographic. The Web offers the same
facility, with one major advantage. The agencies can actually keep track
of individual consumers (or at least of browser/computer pairs) by using
cookies. The largest advertising agency on the Web, Doubleclick
(www.doubleclick.com), is an example of how powerful this can be. Many
people who look at their cookie files will find a cookie for
doubleclick.net, even though they have never knowingly visited the site.
However, they have, even if inadvertently. Many sites display banner ads
on their Web pages. They sell the space for the banner ad to
Doubleclick, which in turn has many client companies that want to purchase
advertising space on the Web (and many sites both sell space and
purchase space on other sites). Doubleclick determines which ad to put
in front of a user looking at a Web page, and cookies let it do this.
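
A cookie-driven ad choice of this general kind can be sketched as follows. This is a hypothetical illustration, not a description of any agency's actual system: the ad names, the site categories, and the rule "show an unseen ad for the most-visited category, otherwise a generic banner" are all assumptions. The profile is keyed by an anonymous cookie ID, not by a name or address.

```python
from collections import Counter

# Hypothetical inventory: ad name -> the site category it targets.
ADS = {"minivan-sale": "cars", "fare-deals": "travel", "news-digest": "news"}
GENERIC_AD = "generic-banner"


class Profile:
    """What the ad network remembers about one cookie (not a named person)."""

    def __init__(self):
        self.category_visits = Counter()  # site category -> number of visits
        self.ads_seen = set()             # ads already shown to this cookie


profiles = {}  # cookie ID -> Profile


def record_visit(cookie_id, site_category):
    """Note that the browser holding this cookie visited a member site."""
    profile = profiles.setdefault(cookie_id, Profile())
    profile.category_visits[site_category] += 1


def choose_ad(cookie_id):
    """Pick an unseen ad for the most-visited category, else a generic ad."""
    profile = profiles.get(cookie_id)
    if profile is None:
        return GENERIC_AD  # no cookie history: fall back to a generic banner
    for category, _count in profile.category_visits.most_common():
        for ad, target in ADS.items():
            if target == category and ad not in profile.ads_seen:
                profile.ads_seen.add(ad)
                return ad
    return GENERIC_AD


if __name__ == "__main__":
    record_visit("cookie-123", "cars")
    record_visit("cookie-123", "cars")
    record_visit("cookie-123", "travel")
    print(choose_ad("cookie-123"))  # ad aimed at the most-visited category
    print(choose_ad("cookie-123"))  # that ad was seen, so the next category
    print(choose_ad("no-cookie"))   # unknown browser gets the generic banner
```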
The Doubleclick cookies help the agency keep track of which Doubleclick
sites a user has visited, and which ads the user has already seen. This
allows them to determine interests, at least broadly, and to choose the
most appropriate ad for that user. Many sites, but by no means all of
them, are Doubleclick clients. The advertising agency knows when a
person visits one of these sites. So they know if a person tends to
visit sites directed toward news, or toward cars, or toward travel, or
children's media, or whatever. Using this information, they can direct
the appropriate ad to that user. Of course, users who do not have a
cookie will still get a banner ad, just one that is more generic
and less likely to be of interest to them. Doubleclick has no way of
actually identifying individuals. They do not keep email addresses,
telephone numbers, names, or other identifying information. They are
building profiles of Web behavior that they can use, within their
network of clients, to direct advertising, in much the same way that
advertising agencies purchase advertising space on certain television
shows to reach target audiences.

Privacy has been an important issue
since the founding of the Web. The underlying technology, from the
network protocols to http, does not require that users identify
themselves publicly. In addition, the Web was founded on a philosophy of
self-regulation, with minimal governmental interference. The fact
that a Web server could be anywhere and still accessible everywhere
diminishes the regulatory power of any single government. This
philosophy has resulted in moves toward self-regulation. The TRUSTe
initiative (www.etrust.org) has developed a "seal of approval" for
privacy on the Internet. The basic proposition is that sites must post a
privacy policy and adhere to that policy. TRUSTe also helps sites
develop such a policy, corresponding to industry norms, and audits
compliance with it. They are developing additional seals, for instance,
to identify sites appropriate for children. Major advertisers have
always shied away from placing their ads in contexts, such as television
documentaries dealing with controversial subjects, that might lead
consumers to have a negative association with the company placing the
ad. On the Web, companies are also demanding that sites where they
advertise have posted privacy statements. These two initiatives will
likely lead to almost every site that collects any sort of information
having a posted privacy policy. Such a policy is good. However, the
example of Doubleclick shows that networks of companies also share
information. Another example is Engage.com, which collects information
about Web usage and sells the information. Engage has the ability to tie
together all the cookies left by members of their consortium, allowing
Engage to understand behavior across multiple sites. In addition, they
also collect some registration information from sites that register
users. However, like Doubleclick, they do not collect any identifying
information such as name, address, e-mail address, or telephone number.
Their interest is in being able to rent profiles so subscribing
companies can customize content, by targeting ads, personalizing Web
sites, and so on. This idea of renting profiles is quite similar to what
companies such as Acxiom, Experian, and Polk (among others) do in the
nonelectronic world. Engage has a very detailed privacy policy (and was
an early adopter of TRUSTe), and they refuse to collect information from
a variety of controversial sites, including sexually explicit sites,
medical sites, and "hate" sites. The e-world has some very interesting
traits. When users identify themselves, then a site can customize itself
to meet the needs of that user. Companies such as Primary Knowledge
(www.primaryknowledge.com) and Blue Martini (www.bluemartini.com) help
companies analyze Web data for customization purposes, to understand
their customers, and to understand investments made over the Web. This
"customized experience" is one of the very important features of the Web
as a new channel for reaching consumers. For many purposes, though, it
is not necessary to actually identify the person at the other end. You
may want to know that they are in the 8-10 age group, and offer a banner
ad for an age-specific site. Or know that they bought jeans three
months ago on the Web, and offer them another pair. In
the material world, making such direct offers requires actually having
identifying information, such as name, address, telephone number, social
security number. On the Web, cookies can be much more anonymous. The
question is simply "what do I know about this stream of binary digits?"
The answer is a customized Web experience.

Promise of Data Mining

The issue of privacy points out that data mining exists in the context of a
larger society. When we are using data mining for customer relationship
management, we are bringing the weight of technology to bear on the
challenge of understanding other people. We are trying to predict what
their actions are likely to be in the future. We are learning from what
people did in the past to predict what they need in the future. In
principle, this activity is no different from the personal relationships
that once permeated a nostalgic past of corner stores, friendly banks,
and helpful insurance agents. In that world, people learned from their
own experience and applied the results to their line of work. Data
mining is about expanding this learning culture to companies that are
also big enough to reap economies of scale. The Wal-Mart on the edge of
town, the ATM machines outside the former bank branch, insurance on the
Web: these all exist for a reason, because they offer more efficient
distribution networks, lower prices, and greater convenience. The large
impersonal corporation, despite advertising to the contrary, has in many
cases lost sight of its customers. The customers are merely numbers. And
disconnected numbers across incompatible databases, at that. The promise
of data mining is to return the focus of businesses to serving customers
and to providing efficient business processes. This is true in the
material world, where, we hope, more targeted marketing will lead to
more satisfied, more profitable customers, as well as fewer items of
wasted mail and fewer telephone interruptions during dinner. It is even
more true in the world of electronic commerce, where the entire image of
a corporation can potentially be personalized for every customer.
Throughout this book, we have shown how mastering data mining is more
than mastering a bunch of advanced algorithms. Mastering data mining
requires incorporating data analysis into the business and asking
questions. Collecting data unobtrusively and with informed consent,
recognizing important patterns, and acting on the results in a
responsible way: this is truly "The Virtuous Cycle of Data Mining."

Michael and I have been very impressed by the response
to our first book, Data Mining Techniques for Marketing, Sales, and
Customer Support (John Wiley & Sons, 1997) and the positive feedback
that we received from many readers. We succeeded in our intention to
write a comprehensive and comprehensible introduction to data mining.
Since writing Data Mining Techniques, much has changed. We have now
founded our own company, Data Miners, so we can focus exclusively on
data mining. Data Miners is dedicated to a vision of data mining that
puts as much emphasis on understanding as it does on model results, and
as much emphasis on process as it does on technology. For those readers
who have not had the experience, leaving the regularity of a paycheck to
work independently is, shall I say, a fascinating and sometimes
traumatic experience. It has given us the opportunity to learn
first-hand about business and the business problems facing our clients;
about different approaches to allocating budgets and choosing vendors;
and so on. It has also given us the opportunity to partner with the
leading data mining vendors, and to work with some of the top people in
the field. As our families and friends have asked us on more than one
occasion, why would we take time out to work on a second book? The
answer is simply that Mastering Data Mining needed to be written. The
field of data mining has been changing rapidly over the past few years,
and we want to address the needs of practitioners, on both the business
and the technical sides. To see how quickly the data mining market is
evolving, we have only to look to the business pages. During the time it
took to write this book, we witnessed a number of mergers and
acquisitions that spoke eloquently of the burgeoning role for data
mining in customer relationship management and e-commerce:
• The world's largest privately held software company, SAS Institute,
  had its most successful product introduction ever, a data mining
  package. We used to joke that "statistics is everything in the SAS
  library and data mining is everything else." By that definition, data
  mining has now ceased to exist!
• The closest competitor to SAS in the analytical software market, SPSS,
  acquired a leading data mining package, further legitimizing what just
  a few years ago was on the fringes of respectability.

Acknowledgments

In writing this book, we had all kinds of help
from all kinds of people. We would like to single out a few of them for
special thanks. Our editor, Bob Elliott, first suggested a book of case
studies and helped shape the final product. Ronny Kohavi of Blue Martini
and Ralph Kimball of Ralph Kimball Associates reviewed the manuscript
for technical accuracy in their respective fields of expertise. Had we
followed more of their advice, this book would have fewer errors. In
addition to our own work, this book builds on the work of other capable
data miners including Alan Parker and Geoff Parker of Apower Solutions,
Anne Milley of SAS Institute, and Bob Evans of R.R. Donnelley & Sons. We
also benefited from many discussions with knowledgeable colleagues
including Dorian Pyle, Herb Edelstein, Bill Inmon, and Erik Thomsen. Our
understanding of the role of data mining in society has benefited from
conversations with Esther Dyson and Nolan Bowie, among others. We are
indebted to several vendors of data mining tools who provided us with
software, training, and access to designers and developers. We would
like to thank Cliff Brunk, Brad Fiery, Peter Maysek, Marc Ondrechen,
Lorna Lusic, Manuel Hoffman, Aydin Senkut, and Mario Schkolnick of SGI
for making sure we had what we needed to use MineSet successfully for
many of the projects described in this book. At SAS Institute we want to
acknowledge Will Potts who taught us much of what we know about neural
networks as well as Herbert Kirk, Anne Milley, Austin Tripensee, Padraic
Neville, Mark Brown, Rich Rovner, and Bruce Brown who all helped us to
become successful users of Enterprise Miner. Charlie Berger of Thinking
Machines provided us with information and access to Darwin software.
Debra Daily and Ken Elliott of SPSS provided us with information and
access to Clementine software. Ken Ono and Eric Apps of Angoss provided
information and access to Knowledge Studio software. Cliff Lasser, Craig
Stanfill, Marie Campbell, Sheryl Handler and Paul Bay of Ab Initio
provided information and access to their dataflow software. Beth Maerz,
Brett McTammany, Susan Ellis and Tatyana Kofman of Tessera Enterprise
Systems provided us with information on the Rapid Modeling Environment.
Rob Utzschneider, Paul Guerin, and Katherine Engelke of Tor-

"In the 21st century, corporate survival will depend on how
well vast amounts of data are mined. Berry and Linoff lead the reader
down an enlightened path of best practices."
-Dr. Jim Goodnight, President and Co-Founder, SAS Institute Inc.

"Data mining will be essential for understanding customer behavior on
the Web and for helping the websites of the world create their
personalized responses. In a sense, data mining recently 'got the order'
to become one of the key ingredients of e-commerce. Now all of us need to
understand and use data mining. In Mastering Data Mining, Berry and
Linoff show the industry how to think about data mining: start with
natural activities of data mining, show their obvious and compelling
value, and then only talk about the component tools at the very end.
This is a great book, and it will be in my stack of four or five
essential resources for my professional work."
-Ralph Kimball, Author of The Data Warehouse Lifecycle Toolkit

"This book does two really important things. It addresses data mining at
a practical level and it bridges the gap between the business world and
the world of data mining. All too often data miners have forgotten that
if at the end of the day they do not show business relevance to the work
they are doing, then they are building pie in the sky. Berry and Linoff
do not make this most fundamental of mistakes. If you have any interest
in the topic at all, this book is a must."
-Bill Inmon, Author of Building the Data Warehouse, Second Edition
Mastering Data Mining
The Art and Science of Customer Relationship Management
Michael J. A. Berry
Gordon S. Linoff
Table Content Page 1
How the Density of the Model Set Affects Results 195
Sampling 196
What Is Oversampling? 197
Modeling Time-Dependent Data 201
Model Inputs and Model Outputs 203
Latency: Taking Model Deployment into Account 206
Time and Missing Data 209
Building Models that Easily Shift in Time 210
Naming Fields 212
Using Multiple Models 213
Multiple Model Voting 213
Segmenting the Input 218
Other Reasons to Combine Models 219
Experiment! 221
The Model Set 221
Different Types of Models and Model Parameters 223
Time Frame 223
Lessons Learned 224
Chapter 8 Taking Control: Setting Up a Data Mining Environment 227
Getting Started 228
What Is a Data Mining Environment? 228
Four Case Studies 229
What Makes a Data Mining Environment Successful? 229
Case 1: Building Up a Core Competency Internally 230
Data Mining in the Insurance Industry 231
Getting Started 231
Case 2: Building a New Line of Business 235
Going Online 235
The Environment 236
The Prospect Data Warehouse 236
The Next Step 237
Case 3: Building Data Mining Skills on Data Warehouse Efforts 240
A Special Kind of Data Warehouse 240
The Plan for Data Mining 240
Data Mining in IT 241
Case 4: Data Mining Using Tessera RME 242
Requirements for an Advanced Data Mining Environment 242
What Is RME? 243
How RME Works 244
How RME Helps Prepare Data 246
How RME Supports Sampling 247
How RME Helps in Model Development 248
How RME Helps in Scoring and Managing Models 249
Lessons Learned 25
Table Content Page 2
Types of Waste 453
Addressable Waste 455
Inducing Rules for Addressable Waste 456
Data Transformation 456
Data Characterization and Profiling 459
Decision Trees 459
Association Rules 462
Putting It All Together 463
Lessons Learned 464
Chapter 15 The Societal Context: Data Mining and Privacy 465
The Privacy Prism 466
Is Data Mining a Threat? 468
The Expectation of Privacy 470
The Importance of Privacy 471
Information in the Material World 476
Information in the Electronic World 477
Identifying the Customer 478
Putting It All Together 480
The Promise of Data Mining 483
Index 485
Table Content Page 4
What Is Churn? 318
Why Is Churn Modeling Useful? 319
Three Goals 320
Approach to Building the Churn Model 322
The Project Itself 325
Building a Churn Model: A Real-Life Application 326
The Choice of Tool 326
Segmenting the Model Set 326
The Final Four (Models) 327
Choice of Modeling Algorithm 331
The Size and Density of the Model Set 338
The Effect of Latency (or Taking Deployment into Account) 339
Translating Models in Time 340
The Data 341
The Basic Customer Model 341
From Telephone Calls to Data 343
Historical Churn Rates 344
Data at the Customer and Account Level 345
Data at the Service Level 345
Data Billing History 346
Rejecting Some Variables 346
Derived Variables 347
Lessons about Building Churn Models 349
Finding the Most Significant Variables 349
Listening to the Business Users 349
Listening to the Data 350
Including Historical Churn Rates 351
Composing the Model Set 352
Building a Model for the Churn Management Application 354
Listening to the Data to Determine Model Parameters 354
Understanding the Algorithm and the Tool 355
Lessons Learned 355
Chapter 12 Converging on the Customer: Understanding Customer Behavior in the Telecommunications Industry 357
Dataflows 358
What Is a Dataflow? 359
Basic Operations 360
Dataflows in a Parallel Environment 361
Why Are Dataflows Efficient? 363
The Business Problem 364
Project Background 364
Important Marketing Questions 365
The Data 366
Call Detail Data 366
Customer Data 368
Auxiliary Files 372
Table Content Page 5
Part Three Case Studies 253
Chapter 9 Who Needs Bag Balm and Pants Stretchers 261
The Vermont Country Store 262
How Vermont Country Store Got Where It Is Today 263
Predictive Modeling at Vermont Country Store 265
[...] 292
Pitfalls of This Approach 292
Building the Models 294
Building a Decision Tree Model for Brokerage 296
Building the Rest of the Models 306
Getting to a Cross-Sell Model 307
In a More Perfect World 308
Lessons Learned 308
Chapter 11 Please Don't Go! Churn Modeling in Wireless Communication 311
The Wireless Telephone Industry 312
A Rapidly Maturing Industry 313
Some Differences from Other Industries 314
The Business Problem 316
Project Background 316
Specifics about This Market 317
Back Cover Page 1
"Berry and Linoff lead the reader down an enlightened path of best
practices." -Dr. Jim Goodnight, President and Cofounder SAS Institute
Inc.

"This is a great book, and it will be in my stack of four or five
essential resources for my professional work." -Ralph Kimball, author of
The Data Warehouse Lifecycle Toolkit

Mastering Data Mining

In this follow-up to their successful first book, Data Mining
Techniques, Michael J. A. Berry and Gordon S. Linoff offer a case
study-based guide to best practices in commercial data mining. Their
first book acquainted you with the new generation of data mining tools
and techniques and showed you how to use them to make better business
decisions. Mastering Data Mining shifts the focus from understanding
data mining techniques to achieving business results, placing particular
emphasis on customer relationship management.

In this book, you'll learn how to apply data mining techniques to solve
practical business problems. After providing the fundamental principles
of data mining and customer relationship management, Berry and Linoff
share the lessons they have learned through a series of warts-and-all
case studies drawn from their experience in a variety of industries,
including e-commerce, banking, cataloging, retailing, and
telecommunications. Through the cases, you will learn how to formulate
the business problem, analyze the data, evaluate the results, and
utilize this information for similar business problems in different
industries.

Berry and Linoff show you how to use data mining to:
• Retain customer loyalty
• Target the right prospects
• Identify new markets for products and services
• Recognize cross-selling opportunities on and off the Web

MICHAEL J. A. BERRY (mjab@data-miners.com) and GORDON S. LINOFF
([email protected]) are the founders of Data Miners Inc., a respected
data mining consultancy. When not actively engaged in data mining
projects, they present classes and seminars that have been well received
around the world.

The companion Web site at www.data-miners.com features:
• Updated information on data mining products and service providers
• Information on data mining conferences, courses, and other sources of
information
• Full-color versions of the illustrations used in the book

Visit our Web site at www.wiley.com/compbooks/

JOHN WILEY & SONS, INC.
New York • Chichester • Weinheim • Brisbane • Singapore • Toronto
Copyright Page 1
Managing Editor: Brian Snapp Associate New Media Editor: Mike Sosa Text
Design & Composition: Benchmark Productions, Inc. Designations used by
companies to distinguish their products are often claimed as trademarks.
In all instances where John Wiley & Sons, Inc., is aware of a claim, the
product names appear in initial capital or ALL CAPITAL LETTERS. Readers,
however, should contact the appropriate companies for more complete
information regarding trademarks and registration. This book is printed
on acid-free paper. Copyright © 2000 by Michael J. A. Berry, Gordon
Linoff. All rights reserved. Published by John Wiley & Sons, Inc.
Published simultaneously in Canada. No part of this publication may be
reproduced, stored in a retrieval system or transmitted in any form or
by any means, electronic, mechanical, photocopying, recording, scanning
or otherwise, except as permitted under Sections 107 or 108 of the 1976
United States Copyright Act, without either the prior written permission
of the Publisher, or authorization through payment of the appropriate
per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive,
Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests to the
Publisher for permission should be addressed to the Permissions
Department, John Wiley & Sons, Inc., 605 Third Avenue, New York, NY
10158-0012, (212) 850-6011, fax: (212) 850-6008, e-mail:
[email protected]. This publication is designed to provide accurate and
authoritative information in regard to the subject matter covered. It is
sold with the understanding that the publisher is not engaged in
professional services. If professional advice or other expert assistance
is required, the services of a competent professional person should be
sought. Library of Congress Cataloging-in-Publication Data: ISBN
0-471-33123-6 Printed in the United States of America. 10 9 8 7 6 5 4 3