C. Case Studies
Doing marketing data science means working with clients and understanding the
business context of research. Case studies demonstrate the process. This appendix
introduces case studies in marketing data science. Case study data and programs for
analyzing these data are provided on the book’s website: http://www.ftpress/miller/.
With data from one thousand long-distance telephone customers, we can develop
models for predicting telephone customer choices. We can also examine issues of
customer retention and churn and advise management on plans for target marketing.
The original data for this case were provided by James W. Watson and distributed
as part of the S system from AT&T Bell Laboratories. S and later S-PLUS were
precursors of R. The details of the AT&T Choice Study were discussed in Chambers
and Hastie (1992).
Client characteristics include demographic factors: age, job type, marital status, and
education. The client’s previous use of banking services is also noted.
Current contact information shows the date of the telephone call and the duration of the
call. There is also information about the call immediately preceding the current call, as
well as summary information about all calls with the client.
The bank wants its clients to invest in term deposits. A term deposit is an investment
such as a certificate of deposit. The interest rate and duration of the deposit are set in
advance. A term deposit is distinct from a demand deposit.
The bank is interested in identifying factors that affect client responses to new term
deposit offerings, which are the focus of the marketing campaigns. What kinds of clients
are most likely to subscribe to new term deposits? What marketing approaches are most
effective in encouraging clients to subscribe?
Data for this case come from the University of California–Irvine Machine Learning
Repository of the Center for Machine Learning and Intelligent Systems
at http://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The original data were part of
marketing studies documented in Moro, Laureano, and Cortez (2011) and Moro,
Cortez, and Rita (2014).
The original data from the Boston Housing Study (Harrison and Rubinfeld 1978)
were published by Belsley, Kuh, and Welsch (1980) in their book about regression
diagnostics. In subsequent years, versions of these data have been used by
statisticians to introduce and evaluate regression methods, including classification
and regression trees (Breiman et al. 1984), treed regression (Alexander and
Grimshaw 1996), and monotone regression (Dole 1999). Miller (1999) used the
Boston Housing Study data to explore sample size requirements for a number of
modern data-adaptive regression methods. Data provided for this case represent an
updated version of the original data, following the suggested revisions of Gilley and
Pace (1996).
The computer choice study was a nationwide study. We identified people who expressed
an interest in buying a new personal computer within the next year. Consumers
volunteering for the study were sent questionnaire booklets, answer sheets, and
postage-paid return-mail envelopes. Each respondent received $25 for participating in
the study. The survey consisted of sixteen pages, with each page showing a choice set of
four product profiles. For each choice set, survey participants were asked first to select
the computer they most preferred, and second, to indicate whether or not they would
actually buy that computer. For some analyses it may be sufficient to focus on the initial
choice or most preferred computer in each set. Figure C.1 shows the first page of the
survey (the first choice set).
Figure C.1. Computer Choice Study: One Choice Set
Being diligent data scientists, we might want to define a training-and-test regimen. One
approach in this context would be to build predictive models on twelve choice sets and
test on four sets. We can arbitrarily select sets 3, 7, 11, and 15 as hold-out choice sets, for
example, and let the remaining item sets 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, and 16 serve as
training sets. With sixteen choice sets of four, we have 64 product profiles for each
individual in the study. The training data would include 48 rows of product profiles for
each individual and the test data would include 16 rows of product profiles for each
individual. The data for one individual are shown in table C.5.
Table C.5. Computer Choice Study: Data for One Individual
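The training-and-test regimen described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's program: the row layout and the field name setid are assumptions made for the example, and the rows are placeholders rather than survey data.

```python
# Hold-out choice sets named in the text; all other sets are training sets.
HOLDOUT_SETS = {3, 7, 11, 15}
N_SETS = 16            # sixteen survey pages (choice sets)
PROFILES_PER_SET = 4   # four product profiles per choice set


def split_choice_sets(rows):
    """Split one individual's rows into training and test lists.

    Each row is assumed to be a dict with a 'setid' key (hypothetical name).
    """
    train = [r for r in rows if r["setid"] not in HOLDOUT_SETS]
    test = [r for r in rows if r["setid"] in HOLDOUT_SETS]
    return train, test


# Placeholder rows for one individual: 16 sets x 4 profiles = 64 rows.
rows = [
    {"setid": s, "position": p}
    for s in range(1, N_SETS + 1)
    for p in range(1, PROFILES_PER_SET + 1)
]

train_rows, test_rows = split_choice_sets(rows)
print(len(train_rows), len(test_rows))  # 48 16
```

Splitting by choice set rather than by row keeps the four profiles of a set together, so each test set can be scored as a complete choice task.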
This is a retrospective study, as many of the companies involved have changed their
roles in the computer industry or have left the industry entirely. Using a study more
than ten years old has its advantages. None of the companies in question will care
what our analysis shows. Like most of the examples in the book, these are real data,
and at one time they had real meaning. The study is based on research supported
by Sharon Chamberlain.
Table C.6 provides a hypothetical example, showing how normal and overage sales
translate into business profits or losses for DriveTime. This example demonstrates the
value of using a statistical model to select vehicles for sale. Profit contributions in the
example represent gross rather than net profits. They do not account for operating costs,
overhead costs, or taxes.
The table below reflects hypothetical profits associated with DriveTime vehicle sales, given an
average total cost per vehicle of $5,000, a 20 percent markup for normal dealer sales, 10 percent
markup for overage dealer sales, and 20 percent loss for overage vehicles sold at auction. This
example assumes that, of the approximately four thousand vehicles sold each month, about 85
percent are normal dealer sales, 10 percent overage dealer sales (within the 91- to 119-day period),
and 5 percent overage auction sales.
Suppose that researchers are able to develop a model that is reasonably accurate in predicting how
long it takes to sell a vehicle. Suppose further that, using this time-to-sale model to guide inventory
decisions, DriveTime is able to increase normal dealer sales from 85 to 90 percent, with
corresponding declines in overage vehicle sales. Assuming no change in vehicle costs or prices, what
would be the effect upon profits? The following table suggests that monthly profits would increase by
$220,000. Twelve months of sales of this type would contribute more than $2.6 million in profit a
year. This demonstrates the value of using statistical models to guide business decisions.
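The arithmetic behind this example can be checked with a short Python sketch. The cost, markups, and baseline sales mix come from the text. The improved mix is an assumption: the text says only that overage sales decline, and a split of 7 percent overage dealer sales and 3 percent auction sales is one split that reproduces the $220,000 figure.

```python
COST_PER_VEHICLE = 5000      # average total cost per vehicle, in dollars
VEHICLES_PER_MONTH = 4000    # approximate monthly sales volume

# Gross profit per vehicle as a percentage of cost, by sales channel.
MARGIN_PCT = {"normal": 20, "overage_dealer": 10, "overage_auction": -20}


def monthly_profit(mix_pct):
    """Gross monthly profit in dollars for a sales mix given in percentages."""
    if sum(mix_pct.values()) != 100:
        raise ValueError("sales mix must sum to 100 percent")
    total = 0
    for channel, pct in mix_pct.items():
        vehicles = VEHICLES_PER_MONTH * pct // 100
        total += vehicles * COST_PER_VEHICLE * MARGIN_PCT[channel] // 100
    return total


baseline_mix = {"normal": 85, "overage_dealer": 10, "overage_auction": 5}
# Assumed split of the remaining 10 percent (not stated in the text).
improved_mix = {"normal": 90, "overage_dealer": 7, "overage_auction": 3}

baseline = monthly_profit(baseline_mix)   # 3,400,000
improved = monthly_profit(improved_mix)   # 3,620,000
monthly_gain = improved - baseline        # 220,000
annual_gain = 12 * monthly_gain           # 2,640,000
print(baseline, improved, monthly_gain, annual_gain)
```

Other splits of the overage share give different gains, so the result is sensitive to how many overage vehicles end up at auction.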
Table C.7 describes variables from the DriveTime vehicles database. The data, which
represent 17,506 sedans sold and financed in the second half of 2001, are divided into
three data sets for modeling work: 8,753 sedans comprise the training set, 4,377 the
validation set, and 4,376 the test set.
Table C.7. DriveTime Data for Sedans
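A partition with these counts can be produced by shuffling row indices and cutting them roughly 50/25/25. The sketch below is one way to do that in Python. The text does not say how the original partition was drawn, so random assignment and the seed value are assumptions.

```python
import random

N_SEDANS = 17506  # sedans in the vehicles database, per the text


def three_way_split(n, seed=1):
    """Shuffle row indices 0..n-1 and split them roughly 50/25/25."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed for repeatability
    n_train = n // 2
    n_valid = (n - n_train + 1) // 2      # validation gets the odd row, if any
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


train_idx, valid_idx, test_idx = three_way_split(N_SEDANS)
print(len(train_idx), len(valid_idx), len(test_idx))  # 8753 4377 4376
```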
Table C.8 shows how researchers use eight color categories to represent twenty-seven
colors in the vehicles database. Color categories are defined so that each category has a
sufficiently large frequency to warrant its use in modeling work. Gold becomes a catch-
all or other color category, including gold, tan, cream, yellow, and brown tones.
Table C.8. DriveTime Sedan Color Map with Frequency Counts
Certain variables may be useful in developing vehicle selection models. Newer, lower
mileage vehicles, for example, may be expected to sell faster than older, higher mileage
vehicles. Sales prices are not included in the vehicles database, but we can assume that
prices for vehicles sold within ninety days (normal dealer sales) are marked up, so that
the firm recovers costs associated with purchasing, repairs, operations, and interest, and
makes an appropriate profit.
DriveTime managers wonder whether it is possible to develop selection models for
sedans using data from the vehicles database. Is a single model sufficient, or should
separate models be built for the states in which DriveTime operated in 2001 (Arizona,
California, Florida, Georgia, Nevada, New Mexico, Texas, and Virginia)? What would the
models look like, and how much profit improvement would result from using the
models?
Source data for this study came from Reis and Smith (1963).
Table C.10. Cross-Classified Categorical Data for the Laundry Soap Experiment
Source data for this case come from Hensher and Johnson (1981).
Marketing touts play a similar expert role, going beyond raw sales data to provide
consumer and marketplace insights. They have formal training in measurement,
statistics, or machine learning, as well as extensive business consulting experience. The
results of their models for site selection, product positioning, segmentation, or target
marketing are of special interest to business managers.
Health and fitness, a fourth product area, involves scientists with expertise in nutrition,
physiology, and molecular mechanisms of health and disease. These touts provide
relevant information based on scientific research. They deliver personalized plans
developed from real-world models that predict future health and fitness. ToutBay
intends to make health and fitness plans available to individual consumers, personal
trainers, and medical practitioners.
ToutBay’s major public event to date has been the R User Conference, also known as
UseR!, June 30 through July 3, 2014. The conference was held on the UCLA campus in
Los Angeles, California, and attracted around 700 scientists and software engineers,
people who write programs (scripts) in the open-source language R (a widely used
language in statistics and data science). ToutBay was one of the sponsors of UseR!,
along with major software developers and publishers.
ToutBay’s goal at UseR! was to introduce itself to potential touts. The company’s
message was simple: You do the research and modeling, and we do the rest. We turn
scripts into products. The idea is that, by working with ToutBay, data scientists can
focus on data science and ToutBay will take care of marketing, communications, sales,
order processing, distribution, and customer support. The ToutBay website has a For
Touts page that provides the details.
Because ToutBay operates entirely online, its business depends on having a website that
conveys a clear message to visitors or guests. Success means converting website guests
into ToutBay account holders. And after information products become available, success
will mean converting account holders into subscribers to information products.
Revenues will come from customer subscriptions, with touts setting prices for their
information products and ToutBay charging a fee for online sales and distribution of
those products. In recruiting future touts, ToutBay has a simple message: If you were
the author of a book, you would look for a publisher, and you would hope that the
publisher would work with bookstores to sell your book. But what if you are the author
of a predictive model? Where do you go to publish your model? Where do you go to sell
the results of your model? ToutBay—that’s where.
Since opening its website in April 2014, ToutBay has been tracking user traffic with
Google Analytics. Recently, the firm has been reviewing data relating to visits, page
views, and time on the site. There may have been a slight increase in traffic around the
time of the UseR! conference. Otherwise, traffic has been limited, which is a source of
concern for the company.
The ToutBay website employs a single-page design, with extensive information on the
home page, including the two-minute video introduction to the company. A single-page
approach to website design provides better overall performance than a multi-page
approach because a single-page approach requires fewer data transmissions between the
client browser and the website server.
One difficulty in employing a single-page approach, however, is that standard page-view
statistics provide an incomplete picture of website usage. Recognizing this, ToutBay
website developers employed JavaScript code to detect how far down users were
scrolling on the home page. These scrolling data are included in user traffic information
for the site. Table C.14 shows variables and variable definitions for website data under
review.
In preparing these data, we first created an external traffic reporting segment by
filtering out traffic coming from website developers and ToutBay principals. The
variables in the data set include data gathered from Google Analytics reports
for www.toutbay.com for the period from April 12, 2014 through September 19,
2014. Also included are counts from Scroll Depth, a Google Analytics plug-in that
tracks how far users scroll down a page. Scroll Depth is especially useful for a
website that puts a lot of information on individual pages such as the home page (a
single-page approach). Documentation for Scroll Depth is available
at http://scrolldepth.parsnip.io/. When using Google Analytics, we do not have access
to the original data that have been collected. Rather, we use the variables and
reporting aggregates that Google Analytics defines. Documentation for Google
Analytics measures (dimensions and metrics) is available
at https://developers.google.com/analytics/devguides/reporting/core/dimsmets.
ToutBay’s owner hopes that a detailed analysis of website content and structure, as
well as data about website usage, will provide guidance in developing future versions of
the website, coinciding with the introduction of the company’s first products.
Color. Because diamonds are formed through heat and pressure, the presence of
various gases can cause them to take on various tints. Some diamonds are clear. Others
have a yellow or brown tint. The Gemological Institute of America (GIA) has established
a standard color scale for grading diamonds from D to Z based on tint or color. This
scale was used by all twelve of the jewelers I visited. It breaks color grades down into
categories like “colorless” and “near colorless.” Jewelers indicate that the price of a
diamond decreases as you move away from a D grade, which is considered perfectly
colorless. In most cases, however, differences in color grade can only be seen when
diamonds are compared with one another.
Clarity. The clarity of a diamond measures the purity of the stone. Carbon pockets
often form imperfections, called inclusions, in diamonds. Clarity
summarizes the number and size of inclusions. The GIA has created a scale that rates
inclusions by their visibility to the naked eye. From a flawless (FL) diamond to one that
has slight inclusions (SI1 and SI2), salespeople will tell you that the price and value of a
diamond decrease as the number of noticeable inclusions increases. But when you
shop, you will rarely see a perfectly flawless diamond, and most often you cannot
visually detect inclusions at the VVS or VS levels.
Cut. As you go from one jeweler to the next, carat, color, and clarity are defined and
measured in a generally universal way. A grade D diamond is perfectly colorless. A
diamond with I2 clarity will have plainly visible flaws. And a 1.03-carat diamond has the
same weight anywhere you shop. That leaves the type and quality of a stone’s cut to
differentiate diamond products. The type of cut determines the shape of the diamond,
but I limited my study to round-cut diamonds. Determining the quality of cut was more
problematic.
I often felt like I was being deceived when salespeople explained why their cut scale was
the only appropriate way to measure the quality of cut. A few jewelers used three criteria
that the GIA says make an ideally cut stone: depth, symmetry, and polish. Variations in
depth and symmetry can cause a diamond to lose its brilliance. In addition to these two
qualities, the overall finish or polish of the stone can have a substantial effect on how
well it shines. In the end, I simplified my definition of the cut variable based on my
shopping experiences. Regardless of what was said about cut, most jewelers would show
two levels of cut. One of the levels would be described as ideal and the other non-ideal.
The difference between ideally and non-ideally cut diamonds is not likely to be
noticeable to the naked eye, but a diamond will undoubtedly cost more if a jeweler
describes it as ideally cut. In addition to the four Cs, I wanted to see if price varied
across sales channels. I gathered data from three separate types of jewelers.
Independent Jewelers. These businesses are usually not in an enclosed mall. They
serve a single community rather than operating as chain stores. Many of the independent
jewelers I visited operated at only one location. At independent jewelers I would be
given a selection of seven to ten round-cut stones, and store personnel took a non-
pressured approach to the sales process.
Mall Jewelers. Located within enclosed malls, many of these jewelers were local
branches of national chains. I found the selection of stones to be higher in number but
lower in quality. The main factor that I did not like here was the pushy nature of the
sales force. I often felt like I was buying a used car.
Internet Jewelers. I looked for online jewelers to complete my analysis. I found two
stores with a vast selection of stones. I took a sample of more than three hundred stones
from the over four thousand round-cut diamonds available at these two stores. Although
online jewelers provided pictures of about half their stones, I would find it difficult to
buy a diamond I could not see in person.
Now that the data have been gathered and coded according to the rules summarized
in table C.15, I need to figure out which diamond to buy my girlfriend. Furthermore,
some of the jewelers are asking questions about why I am collecting this information.
One of the independent jewelers is interested in my study. He thinks he might be able to
use the results to guide his own diamond buying.
Table C.15. Diamonds Data: Variable Names and Coding Rules
The Ducks. When people talk about “The Ducks in the Dells,” they are not talking
about waterfowl. These Ducks are amphibious vehicles built by the U.S. Army during
World War II as a means of transporting soldiers over land and water. The Ducks are
used to give tours of the natural wonders of the area. Duck Tours take visitors up hills,
down into valleys, across rivers, and through lakes. Along the way, visitors see all
manner of intriguing rock formations and beautiful scenery. Duck Tours run from
March through October, weather permitting.
Circus World Museum. Wisconsin Dells is located just north of Baraboo, Wisconsin,
former home of the famous Ringling brothers, founders of the Ringling Brothers and
Barnum & Bailey Circus. Owned by the State Historical Society of Wisconsin, Circus
World Museum celebrates the history of the circus with exhibits, circus performances,
variety shows, clown shows, animal shows, and a petting menagerie. The museum is
open year-round with extended hours during the summer.
Boat Tours. The Dells area stretches along the Wisconsin River and includes several
lakes. The boat tours are an alternative to Duck Tours. They stick to the waterways and
attractions along the shorelines.
Stand Rock. The Dells has fascinating natural rock formations because the upper
layers of rock are more resistant to erosion than are the underlying layers. Stand Rock is
an unusual formation, with a large, round, table-like rock supported by a far narrower
column. This formation is near another tall rock formation with a gap in between. To
commemorate a famous leap across the gap, the tour of this site includes a dog leaping
from rock to rock. Stand Rock is accessible by boat.
Gambling. Ho-Chunk Casino is located one mile south of downtown Wisconsin Dells.
This Indian casino features slots, video poker, blackjack, and various forms of
entertainment.
Additional area attractions include a wax museum, numerous campgrounds, many
shopping opportunities, go-carts, a fifties revival show, golf courses, nature walks, a
UFO and science fiction museum, a motor speedway, fishing trips, riding stables, laser
tag facilities, movie theaters, and various other museums and shows.
Table C.16 shows visitor variables and their coding. Interviewers asked visitors
whether they had participated in or were likely to participate in any of a number of
activities around the Wisconsin Dells. Table C.17 shows variables relating to
participation in these activities.
Taking the role of a Dells business owner or a representative of the Wisconsin Dells
Visitor and Convention Bureau, we have many questions to answer. What can we learn
about the people who visit the Dells? Are there discernible patterns in visitor activities?
Is it possible to identify consumer segments among the visitors? What kinds of activities
would we recommend for visitor groups identified by demographics or type of visiting
party?
A majority of current Dells advertising takes the form of brochures and pamphlets
placed at various attractions in the Dells. Business owners would like to target
advertising to those people most likely to visit attractions. What can we learn from the
Dells data to help business owners in their advertising and marketing activities?
Table C.16. Dells Survey Data: Visitor Characteristics
Table C.17. Dells Survey Data: Visitor Activities
Substitute or competing (and legal) gaming products include bingo and slots at Indian
casinos in Wisconsin and neighboring states. We identified fourteen Wisconsin casinos
operational at the time of the study. We also learned that new casinos were planned for
Madison in 1999 and for Milwaukee in 2000. The Potawatomi Nation plan for
Milwaukee (ZIP code 53233) included a 256,000 square foot casino complex with 1,000
slot machines.
In developing models for lottery sales, we can draw upon observations of people familiar
with lottery activities. We can also draw upon our intuition and anecdotal evidence.
There are a number of hypotheses to consider:
Ticket sales are higher shortly after new lottery games are introduced with television or
radio advertising.
Higher lottery jackpots lead to higher online ticket sales. There may also be some
carry-over effect on instant lottery ticket sales.
Ticket sales are higher in those areas that are better served by online ticket retailers.
That is, higher numbers of retailers should lead to higher sales.
Ticket sales are lower in areas served by substitute gaming facilities, such as Indian
casinos.
Lower income, less educated people buy more lottery tickets per capita than higher
income, more educated people.
On average, senior citizens buy more lottery tickets than people in other age groups.
The thinking here is that senior citizens have more free time to engage in recreational
and gaming activities.
Ticket sales are higher during the first week of the month because many people get
paid or receive government support checks, such as Social Security checks, on the first
day of the month.
Although we might expect advertising to affect sales, State of Wisconsin law restricts the
use of extensive advertising by the Wisconsin Lottery. The only time that the Wisconsin
Lottery is allowed to advertise is when a new lottery game is introduced. New lottery
games are usually instant games, and only a small proportion of these games receives
television or radio advertising. For example, in the forty-week period for this study,
twenty-seven new instant lottery games were introduced. Six of these games received
television advertising, and one received radio advertising. We might assume that each
new instant lottery game that received advertising received it for one month (for the
week of new product launch and for three weeks thereafter).
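The one-month advertising assumption translates into a simple weekly indicator variable. The Python sketch below flags the launch week and the three weeks after it for each advertised game. The launch weeks shown are hypothetical placeholders, not values from the study data.

```python
N_WEEKS = 40                 # weeks in the study period
AD_WEEKS_AFTER_LAUNCH = 3    # advertising continues three weeks past launch


def advertising_indicator(launch_weeks, n_weeks=N_WEEKS):
    """Return a 0/1 list, one entry per week, flagging advertised weeks.

    launch_weeks holds 1-based week numbers of advertised game launches.
    """
    flags = [0] * n_weeks
    for launch in launch_weeks:
        for week in range(launch, launch + AD_WEEKS_AFTER_LAUNCH + 1):
            if 1 <= week <= n_weeks:   # ignore weeks outside the study period
                flags[week - 1] = 1
    return flags


# Hypothetical launch weeks for advertised games (not from the study data).
example_launches = [2, 10, 39]
ad_flags = advertising_indicator(example_launches)
print(sum(ad_flags))  # 10 advertised weeks: 2-5, 10-13, and 39-40
```

Overlapping campaigns collapse into a single flag here. A count of concurrently advertised games would be an alternative coding.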
Sales data for the Wisconsin Lottery are like the sales data of many organizations. These
are hierarchical or panel data, having both a cross-sectional and a time-series
organization. For each retail establishment selling online lottery tickets, the State has a
record of the number of lottery tickets sold, their cost, and the time of the sale. Retail
establishments fall within sales territories or areas. For the Wisconsin Lottery we might
think of ZIP codes as sales territories.
We organized Wisconsin Lottery sales data by ZIP code and time. We aggregated online
ticket sales across retail establishments within ZIP codes, and we also obtained instant
ticket sales within ZIP codes. We used weeks as our unit of aggregation across time.
Weeks began on Sundays and ended on Saturdays; we obtained data for 40 consecutive
weeks (the weeks ending April 4, 1998 through January 2, 1999). Table C.18 provides
names and descriptions for the relevant variables. Sales data were not available for
Wisconsin Indian casinos, but we did obtain measures of gaming capacity (casino size
and the number of slot machines). Table C.19 shows names and descriptions for
information fields in the casino data set.
Table C.18. Wisconsin Lottery Data
Table C.19. Wisconsin Casino Data
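The weekly time index for the study can be reconstructed from the dates given above. The following Python sketch lists the forty Saturday week-ending dates and adds a flag for weeks that contain the first day of a month. That flag is one possible coding of the first-week-of-the-month hypothesis and is an assumption, not a variable taken from the study data.

```python
from datetime import date, timedelta

FIRST_WEEK_ENDING = date(1998, 4, 4)   # first Saturday in the study period
N_WEEKS = 40


def week_endings(first=FIRST_WEEK_ENDING, n_weeks=N_WEEKS):
    """Return consecutive Saturday week-ending dates."""
    return [first + timedelta(weeks=i) for i in range(n_weeks)]


def contains_first_of_month(week_ending):
    """True when the Sunday-to-Saturday week includes the 1st of a month."""
    days = [week_ending - timedelta(days=d) for d in range(7)]
    return any(day.day == 1 for day in days)


weeks = week_endings()
first_week_flags = [contains_first_of_month(w) for w in weeks]

print(weeks[0], weeks[-1])       # 1998-04-04 1999-01-02
print(sum(first_week_flags))     # 10 weeks contain a first-of-month date
```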
We can link lottery sales data, casino data, and demographic data using ZIP codes. We
derived ZIP code demographics from 1990 United States Census data, with revised
Census estimates from 1995. We also recorded the centroid of each ZIP code region in
East-West and North-South coordinates. Table C.20 shows names and descriptions for
the ZIP code demographic and location variables.
Table C.20. Wisconsin ZIP Code Data
A geographer helped us to locate the East-West and North-South coordinates for ZIP
code centroids. He explained that ZIP regions are highly irregular polygons and that the
centroid of a ZIP code is at best an approximate center of the ZIP code region. To get
coordinates for Wisconsin ZIP code centroids, the geographer used the Wisconsin
Transverse Mercator Geo-referencing System, which measures coordinate axes in
meters, with the origin set as an arbitrary point in Iowa, southwest of all Wisconsin ZIP
codes. Centroid coordinates should not be thought of as centers of population because it
is unlikely that population would be evenly distributed across ZIP code regions.
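Because the coordinates are measured in meters on a common grid, straight-line distances follow directly from the Pythagorean formula. The Python sketch below computes the distance from a ZIP code centroid to the nearest casino, which is one way to quantify exposure to substitute gaming facilities. The coordinates are hypothetical, and locating each casino at the centroid of its ZIP code is an assumption.

```python
import math


def distance_km(point_a, point_b):
    """Straight-line distance in kilometers between two (east, north)
    points whose coordinates are in meters."""
    east = point_a[0] - point_b[0]
    north = point_a[1] - point_b[1]
    return math.hypot(east, north) / 1000.0


def nearest_casino_km(zip_centroid, casino_points):
    """Distance in kilometers from a ZIP centroid to the closest casino."""
    return min(distance_km(zip_centroid, c) for c in casino_points)


# Hypothetical (east, north) coordinates in meters, not study data.
zip_centroid = (500000.0, 300000.0)
casino_points = [(503000.0, 304000.0), (560000.0, 380000.0)]

nearest = nearest_casino_km(zip_centroid, casino_points)
print(nearest)  # 5.0, from the 3-4-5 triangle formed with the first casino
```

Distances computed this way inherit the approximation noted above: a centroid is only a rough center of an irregular ZIP code region.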
When fitting linear models to the lottery sales data, we should note that some
explanatory variables, such as the size of lottery jackpots, vary across time, but are
constant across ZIP code locations. Other explanatory variables, such as population,
vary across ZIP code locations, but are treated as constant across time (for the 40 weeks
that we are considering). Still other variables, such as lottery sales response variables,
vary across time and locations. In fitting models to these data, we need to identify
appropriate error structures, noting which variables vary with time and which vary with
location.
We do not have to make a distinction between sales dollars and sales volume because
most lottery tickets are sold for $1.00. Just the same, we need to define appropriate
response variables. In testing certain research hypotheses, we may want to use per
capita measures rather than original measures. And, given the characteristics of online
and instant sales (online being sales to consumers and instant being sales to retailers),
we may choose to develop separate models for the online and instant sales responses.
Alternatively, we could try to synchronize online and instant sales information by
shifting or lagging one sales time-series relative to the other.
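Both transformations mentioned here are easy to prototype. The Python sketch below converts a weekly sales series to a per capita measure and lags a series by a chosen number of weeks. The sales figures, the population value, and the one-week lag are hypothetical. The text does not say which series should be shifted or by how much.

```python
def per_capita(sales, population):
    """Divide each weekly sales figure by a ZIP code's population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return [s / population for s in sales]


def lag(series, k):
    """Shift a series k periods later, padding the start with None."""
    if k < 0:
        raise ValueError("lag must be non-negative")
    return ([None] * k + list(series))[:len(series)]


# Hypothetical weekly sales for one ZIP code (not study data).
online_sales = [200.0, 250.0, 300.0, 150.0]
instant_sales = [100.0, 120.0, 90.0, 110.0]
population = 1000

online_per_capita = per_capita(online_sales, population)
instant_lagged = lag(instant_sales, 1)

print(online_per_capita)  # [0.2, 0.25, 0.3, 0.15]
print(instant_lagged)     # [None, 100.0, 120.0, 90.0]
```

Weeks padded with None have no lagged value and would need to be dropped before model fitting.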
State of Wisconsin administrators want to predict online and instant ticket sales and to
identify potentially productive future locations for online ticket sales. In the process of
fitting models, we might think about providing meaningful tests of hypotheses about
what affects lottery sales. We have sufficient data to fit a variety of models, including
time-series, panel, and spatial data models. Where shall we begin?
Data showing the from-node and to-node structure of this social network are drawn
from Leskovec, Huttenlocher, and Kleinberg (2010a, 2010b) and are available as part
of the Stanford Large Network Dataset Collection
at https://snap.stanford.edu/data/wiki-Vote.html. Background information about
Wikipedia was obtained from Wikipedia (2014b).