C. Case Studies

Doing marketing data science means working with clients and understanding the
business context of research. Case studies demonstrate the process. This appendix
introduces case studies in marketing data science. Case study data and programs for
analyzing these data are provided on the book’s website: http://www.ftpress/miller/.

C.1 AT&T CHOICE STUDY


Following its breakup in 1986, AT&T wanted to identify factors relating to
customer choice of long-distance carriers. The firm collected respondent data from
telephone interviews, household service and billing information from corporate
databases, and census data linked to household addresses. Table C.1 shows
variable names and definitions.
Table C.1. Variables for the AT&T Choice Study

With data from one thousand long-distance telephone customers, we can develop
models for predicting telephone customer choices. We can also examine issues of
customer retention and churn and advise management on plans for target marketing.
The original data for this case were provided by James W. Watson and distributed
as part of the S system from AT&T Bell Laboratories. S and later SPlus were
precursors of R. The details of the AT&T Choice Study were discussed in Chambers
and Hastie (1992).

C.2 ANONYMOUS MICROSOFT WEB DATA


The data for this case come from the Microsoft website www.microsoft.com as it
existed for one week in February 1998. There are 37,711 users in the sample, and
the case is called “anonymous” because the data files contain no personally
identifiable information for these users.
While we often think of website nodes as being pages, in this study it is website areas
that are the nodes. Page-view requests of users are categorized to reflect areas of the
Microsoft website visited. There are 294 distinct areas, identified by name and number.
To provide an honest evaluation of predictive models, Microsoft user data are
partitioned into training and test sets, with 32,711 users in the training file and 5,000
users in the test file. Data rows in these files show a user identification number along
with a website area visited during the one-week time frame of the study. A separate file
shows area identification numbers, directory names for website areas, and area
descriptions.
Analysis of the Microsoft data can begin by looking at areas visited and characterizing
user behavior. Network area structure may be gleaned from user behavior by construing
joint area usage as a link between area nodes.
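The idea of treating joint area usage as a link can be sketched in a few lines. This is a minimal illustration, assuming the data rows have already been read into (user, area) pairs; the function name `covisit_edges` and the toy identifiers are hypothetical and are not taken from the Microsoft files.

```python
from collections import Counter, defaultdict
from itertools import combinations


def covisit_edges(rows):
    """Count, for each pair of areas, how many users visited both."""
    areas_by_user = defaultdict(set)
    for user_id, area_id in rows:
        areas_by_user[user_id].add(area_id)

    edges = Counter()
    for areas in areas_by_user.values():
        # Each unordered pair of areas visited by one user adds one
        # unit of weight to the link between those two area nodes.
        for a, b in combinations(sorted(areas), 2):
            edges[(a, b)] += 1
    return edges


# Toy rows with made-up identifiers (assumption, not real case data).
toy_rows = [(1, 1000), (1, 1001), (2, 1000), (2, 1001), (2, 1002), (3, 1002)]
edges = covisit_edges(toy_rows)
```

The resulting edge weights could be passed to a network analysis tool, with a minimum-weight threshold deciding which links to keep.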
A more ambitious goal, one consistent with published studies drawing on these data,
would be to predict which areas of the website a user will visit based upon other areas
visited. For predictive models of this type, we could utilize methods of association rule
analysis and recommender systems.
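As a rough illustration of the association rule idea, the sketch below computes support and confidence for a single rule of the form "users who visit area A also visit area B." It assumes each user's visits are held as a set of area labels; the function name `rule_metrics` and the toy sets are hypothetical.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent in b)
    n_ab = sum(1 for b in baskets if antecedent in b and consequent in b)
    support = n_ab / n if n else 0.0          # share of users with both areas
    confidence = n_ab / n_a if n_a else 0.0   # share of A visitors who also visit B
    return support, confidence


# Toy visit sets with made-up area labels (assumption, not real case data).
toy_baskets = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"C"}]
support, confidence = rule_metrics(toy_baskets, "A", "B")
```

Rules estimated on the training file could then be scored against the test file by checking whether the predicted areas were in fact visited.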
The data for this case come from the University of California–Irvine Machine
Learning Repository of the Center for Machine Learning and Intelligent Systems
(Bache and Lichman 2013). The data sets and documentation are available
at http://archive.ics.uci.edu/ml/datasets/Anonymous+Microsoft+Web+Data. The original data
were created by Jack S. Breese, David Heckerman, and Carl M. Kadie of Microsoft
Research in Redmond, Washington, and these data were used in testing models for
predicting areas of the website a user would visit based on data about other areas
that user had visited (Breese, Heckerman, and Kadie 1998).

C.3 BANK MARKETING STUDY


A Portuguese bank conducted seventeen telephone marketing campaigns between
May 2008 and November 2010. The bank recorded client contact information for
each telephone call. Table C.2 shows variable names and descriptions for the
study.
Table C.2. Bank Marketing Study Variables

Client characteristics include demographic factors: age, job type, marital status, and
education. The client’s previous use of banking services is also noted.
Current contact information shows the date of the telephone call and the duration of the
call. There is also information about the call immediately preceding the current call, as
well as summary information about all calls with the client.
The bank wants its clients to invest in term deposits. A term deposit is an investment
such as a certificate of deposit. The interest rate and duration of the deposit are set in
advance. A term deposit is distinct from a demand deposit.
The bank is interested in identifying factors that affect client responses to new term
deposit offerings, which are the focus of the marketing campaigns. What kinds of clients
are most likely to subscribe to new term deposits? What marketing approaches are most
effective in encouraging clients to subscribe?
Data for this case come from the University of California–Irvine Machine Learning
Repository of the Center for Machine Learning and Intelligent Systems
at http://archive.ics.uci.edu/ml/datasets/Bank+Marketing. The original data were part of
marketing studies documented in Moro, Laureano, and Cortez (2011) and Moro,
Cortez, and Rita (2014).

C.4 BOSTON HOUSING STUDY


The Boston Housing Study is a market response study of sorts, with the market
being 506 census tracts in the Boston metropolitan area. The objective of the study
was to examine the effect of air pollution on housing prices, controlling for the
effects of other explanatory variables. The response variable is the median price of
homes (in 1970 dollars) in the census tract. Table C.3 shows variables included in
the case. Short variable names correspond to those used in previously published
studies.

Table C.3. Boston Housing Study Variables

The original data from the Boston Housing Study (Harrison and Rubinfeld 1978)
were published by Belsley, Kuh, and Welsch (1980) in their book about regression
diagnostics. In subsequent years, versions of these data have been used by
statisticians to introduce and evaluate regression methods, including classification
and regression trees (Breiman et al. 1984), treed regression (Alexander and
Grimshaw 1996), and monotone regression (Dole 1999). Miller (1999) used the
Boston Housing Study data to explore sample size requirements for a number of
modern data-adaptive regression methods. Data provided for this case represent an
updated version of the original data, following the suggested revisions of Gilley and
Pace (1996).

C.5 COMPUTER CHOICE STUDY


In 1998 Microsoft introduced a new operating system. Computer manufacturers
were interested in making predictions about the personal computer marketplace.
To help manufacturers understand the market for personal computers, we conducted a
computer choice study involving eight computer brands, price, and four other attributes
of interest: compatibility, performance, reliability, and learning time. Table C.4 provides
a description of attribute levels used in the study.
Table C.4. Computer Choice Study: Product Attributes

The computer choice study was a nationwide study. We identified people who expressed
an interest in buying a new personal computer within the next year. Consumers
volunteering for the study were sent questionnaire booklets, answer sheets, and
postage-paid return-mail envelopes. Each respondent received $25 for participating in
the study. The survey consisted of sixteen pages, with each page showing a choice set of
four product profiles. For each choice set, survey participants were asked first to select
the computer they most preferred, and second, to indicate whether or not they would
actually buy that computer. For some analyses it may be sufficient to focus on the initial
choice or most preferred computer in each set. Figure C.1 shows the first page of the
survey (the first choice set).
Figure C.1. Computer Choice Study: One Choice Set

Being diligent data scientists, we might want to define a training-and-test regimen. One
approach in this context would be to build predictive models on twelve choice sets and
test on four sets. We can arbitrarily select sets 3, 7, 11, and 15 as hold-out choice sets, for
example, and let the remaining choice sets 1, 2, 4, 5, 6, 8, 9, 10, 12, 13, 14, and 16 serve as
training sets. With sixteen choice sets of four, we have 64 product profiles for each
individual in the study. The training data would include 48 rows of product profiles for
each individual and the test data would include 16 rows of product profiles for each
individual. The data for one individual are shown in table C.5.
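The training-and-test regimen just described can be sketched as follows. This is only an illustration for one individual; the field names `setid` and `position` are assumptions and may not match the column names in the case data.

```python
HOLDOUT_SETS = {3, 7, 11, 15}   # hold-out choice sets named in the text
PROFILES_PER_SET = 4


def split_profiles(rows):
    """Split one individual's product profiles into training and test rows."""
    train = [r for r in rows if r["setid"] not in HOLDOUT_SETS]
    test = [r for r in rows if r["setid"] in HOLDOUT_SETS]
    return train, test


# Placeholder rows for one individual: 16 choice sets of 4 profiles each.
rows = [
    {"setid": s, "position": p}
    for s in range(1, 17)
    for p in range(1, PROFILES_PER_SET + 1)
]
train, test = split_profiles(rows)
```

Splitting by choice set rather than by row keeps all four profiles of a set together, so a model is never tested on a choice set it has partly seen.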
Table C.5. Computer Choice Study: Data for One Individual

This is a retrospective study, as many of the companies involved have changed their
roles in the computer industry or have left the industry entirely. Using a study more
than ten years old has its advantages. None of the companies in question will care
what our analysis shows. Like most of the examples in the book, these are real data,
and at one time they had real meaning. The study is based on research supported
by Sharon Chamberlain.

C.6 DRIVETIME SEDANS


DriveTime in 2001 is an automobile dealership and financing firm with seventy-six
dealerships in eight states. In a typical month the firm sells about four thousand
used vehicles and processes about ten thousand credit applications. Virtually all
sales are financed. The firm’s stated mission is: “To be the auto dealership and
finance company for people with less than perfect credit.”
DriveTime generates traffic at its dealerships through television and radio advertising,
referrals from other dealerships, and through its website. Customers who need financing
to purchase vehicles are run through a custom credit risk scorecard, which uses both
credit bureau and application information to determine credit worthiness. A generated
risk score is used to determine the appropriate deal structure and credit policy.
DriveTime purchases most of its vehicles at auctions and from wholesalers. Vehicles
include many makes and models of cars and trucks. The firm uses an information
service known as Experian Autocheck to ensure that vehicles have correct odometer
readings, have not been previously “totaled” (that is, evaluated as having no value after
an accident), and have no other significant negative history. Vehicles that fail the
Experian check are rejected and sent back to sellers. Those that pass are sent to a
DriveTime reconditioning and inspection center, where they are put through additional
checks and repaired as necessary. Vehicles are then delivered to the dealerships for sale.
Normal dealer sales occur within ninety days of delivery to the dealership. If a vehicle
does not sell within ninety days, it is called an overage vehicle, meaning that it has been
on the lot too long to generate normal dealer profits. Each overage vehicle has its sales
price reduced in order to encourage a sale within the ensuing 91- to 119-day period.
Profits on vehicles sold within the 91- to 119-day period are much lower than profits on
vehicles sold within the normal 90-day period. Furthermore, if an overage vehicle fails
to sell within 120 days, the vehicle is taken off the lot and sold at auction. DriveTime
takes a loss on vehicles sold at auction.
Written by Thomas W. Miller and Steve Zemaitis. Data provided by DriveTime.
©2007 by Research Publishers LLC. Reprinted with permission.

Table C.6 provides a hypothetical example, showing how normal and overage sales
translate into business profits or losses for DriveTime. This example demonstrates the
value of using a statistical model to select vehicles for sale. Profit contributions in the
example represent gross rather than net profits. They do not account for operating costs,
overhead costs, or taxes.
The table below reflects hypothetical profits associated with DriveTime vehicle sales, given an
average total cost per vehicle of $5,000, a 20 percent markup for normal dealer sales, 10 percent
markup for overage dealer sales, and 20 percent loss for overage vehicles sold at auction. This
example assumes that, of the approximately four thousand vehicles sold each month, about 85
percent are normal dealer sales, 10 percent overage dealer sales (within the 91- to 119-day period),
and 5 percent overage auction sales.
Suppose that researchers are able to develop a model that is reasonably accurate in predicting how
long it takes to sell a vehicle. Suppose further that, using this time-to-sale model to guide inventory
decisions, DriveTime is able to increase normal dealer sales from 85 to 90 percent, with
corresponding declines in overage vehicle sales. Assuming no change in vehicle costs or prices, what
would be the effect upon profits? The following table suggests that monthly profits would increase by
$220,000. Twelve months of sales of this type would contribute more than $2.6 million in profit a
year. This demonstrates the value of using statistical models to guide business decisions.
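The arithmetic behind this example can be checked with a short script. The costs, markups, and baseline mix come from the text. The model-guided split of overage sales (7 percent dealer, 3 percent auction) is an assumption: the text says only that overage sales decline, and this split is one that reproduces the $220,000 figure.

```python
COST = 5000        # average total cost per vehicle, in dollars
VEHICLES = 4000    # approximate vehicles sold per month

# Profit or loss per vehicle as a percentage of cost.
MARGIN_PCT = {"normal": 20, "overage_dealer": 10, "overage_auction": -20}


def monthly_profit(mix_pct):
    """Gross monthly profit for a sales mix given in whole percentages."""
    total = 0
    for channel, pct in mix_pct.items():
        units = VEHICLES * pct // 100
        per_vehicle = COST * MARGIN_PCT[channel] // 100
        total += units * per_vehicle
    return total


baseline = monthly_profit({"normal": 85, "overage_dealer": 10, "overage_auction": 5})
# Assumed model-guided mix (the 7/3 split is not stated in the text).
guided = monthly_profit({"normal": 90, "overage_dealer": 7, "overage_auction": 3})
gain = guided - baseline
annual_gain = 12 * gain
```

Under these assumptions the baseline is $3,400,000 a month and the model-guided mix yields $3,620,000, a gain of $220,000 a month or $2,640,000 a year.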

Table C.6. Hypothetical Profits from Model-guided Vehicle Selection

Table C.7 describes variables from the DriveTime vehicles database. The data, which
represent 17,506 sedans sold and financed in the second half of 2001, are divided into
three data sets for modeling work: 8,753 sedans comprise the training set, 4,377 the
validation set, and 4,376 the test set.
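A partition with these sizes can be produced by a shuffled 50/25/25 split. The sketch below assumes random assignment with a fixed seed; the text does not say how the original partition was made, so this shows one way to reproduce the set sizes, not the actual assignment of sedans.

```python
import random


def three_way_split(n, seed=0):
    """Shuffle row indices and split them roughly 50/25/25."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n // 2
    n_valid = (n - n_train + 1) // 2   # the larger half of what remains
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


train, valid, test = three_way_split(17506)
```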
Table C.7. DriveTime Data for Sedans

Table C.8 shows how researchers use eight color categories to represent twenty-seven
colors in the vehicles database. Color categories are defined so that each category has a
sufficiently large frequency to warrant its use in modeling work. Gold becomes a catch-
all or other color category, including gold, tan, cream, yellow, and brown tones.
Table C.8. DriveTime Sedan Color Map with Frequency Counts

Certain variables may be useful in developing vehicle selection models. Newer, lower
mileage vehicles, for example, may be expected to sell faster than older, higher mileage
vehicles. Sales prices are not included in the vehicles database, but we can assume that
prices for vehicles sold within ninety days (normal dealer sales) are marked up, so that
the firm recovers costs associated with purchasing, repairs, operations, and interest, and
makes an appropriate profit.
DriveTime managers wonder whether it is possible to develop selection models for
sedans using data from the vehicles database. Is a single model sufficient, or should
separate models be built for the states in which DriveTime operated in 2001 (Arizona,
California, Florida, Georgia, Nevada, New Mexico, Texas, and Virginia)? What would the
models look like, and how much profit improvement would result from using the
models?

C.7 LYDIA E. PINKHAM MEDICINE COMPANY


Lydia E. Pinkham (1819–1883) was an advocate for women’s health and a
developer of herbal medicines for women. After her death, family members began
the mass marketing of a product known as Lydia E. Pinkham’s Vegetable
Compound. The product was heavily advertised for many years, and historical data
on sales and advertising were made available to economic researchers.
The data come in two files. The first file provides complete annual data for sales
revenue, advertising expenses, and income in thousands of dollars. The file covers the
years 1907–1960 and includes binary indicator variables for years prior to, during, and
after Prohibition. The medicine, which contained 40 percent alcohol, continued to be
sold during Prohibition, which extended from January 17, 1920 through December 5,
1933.
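Binary indicators of this kind are simple to construct from the year. The sketch below assumes that the boundary years 1920 and 1933 are both coded as Prohibition years; the coding in the case data file may differ, and the key names are hypothetical.

```python
PROHIBITION_FIRST_YEAR = 1920
PROHIBITION_LAST_YEAR = 1933


def prohibition_indicators(year):
    """Return 0/1 indicators for before, during, and after Prohibition."""
    return {
        "pre": int(year < PROHIBITION_FIRST_YEAR),
        "during": int(PROHIBITION_FIRST_YEAR <= year <= PROHIBITION_LAST_YEAR),
        "post": int(year > PROHIBITION_LAST_YEAR),
    }


# One row of indicators for each year covered by the annual file.
indicators = {year: prohibition_indicators(year) for year in range(1907, 1961)}
```

In a regression with an intercept, one of the three indicators would be dropped to avoid perfect collinearity.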
The second file contains monthly sales revenue and advertising expenses in dollars for
the period from January 1907 through December 1926 and from January 1937 through
June 1960. There are missing monthly data for the middle time period.
For more than forty years, the Lydia E. Pinkham Medicine Company case has been used
to demonstrate sales forecasting, time series, and econometric methods. Exemplary
studies include Caines, Sethi, and Brotherton (1977), Helmer and Johansson
(1977), Winer (1979), Bhattacharyya (1982), Heyse and Wei (1985), and Baghestani
(1991). More recently, Kim (2005) used bootstrap methods in a new analysis of the case.
This classic data set from the econometric literature was distributed with a textbook
by Berndt (1991). The story of Lydia E. Pinkham has been documented
by Washburn (1931) and Stage (1979) and in Lydia’s own book (Pinkham 1900). An
advertisement for Lydia E. Pinkham’s original formula appeared in the Saint John
Daily Evening News on April 17, 1883. Here is a link to an image of that
advertisement: https://news.google.com/newspapers?id=N9kIAAAAIBAJ&sjid=5TcDAAAAIBAJ&dq=montreal+hackett&pg=5642,125153&hl=en

C.8 PROCTER & GAMBLE LAUNDRY SOAPS


The Procter & Gamble Company developed a new formula for one of its laundry
soaps. Before introducing the new formula to the marketplace, the company
wanted to know whether consumers would prefer the new formula, called X, to the
original formula, called M.
Consumers in 1,008 households, some of whom were previous users of the original
formula M, were given the opportunity to try formulas X and M in blind preference
tests. At the end of the tests, consumers were asked to indicate their soap preferences by
choosing either X or M. Water temperature (cold or hot) and type (hard, medium, soft)
were noted for each household. Data were coded as shown in Table C.9. Results from the
field test represent cross-classified categorical data. See table C.10.

Table C.9. Variables for the Laundry Soap Experiment

Source data for this study came from Reis and Smith (1963).
Table C.10. Cross-Classified Categorical Data for the Laundry Soap Experiment

C.9 RETURN OF THE BOBBLEHEADS


The Dodgers are one of thirty Major League Baseball teams using promotions to
increase attendance. Reports suggest that bobblehead promotions in particular are
on the rise, with 2.27 million dolls distributed in 2012 (Broughton 2012) and an
estimated 2.7 million dolls in 2013 (Foster 2013).
We provide complete promotion and attendance data for all teams for the 2012 season
on the website for the book. These data have a format similar to the Dodgers data
in table 8.1, except that there are extra columns for the year and home team. Having
data for all teams allows us to explore alternative modeling approaches, such as building
a model for each team, aggregate models for groups of teams, or hierarchical models for
game-day observations within teams. When predicting attendance at Major League
Baseball parks, we would need to consider the fact that ballparks are often filled to
capacity. Special models may be required to accommodate this high-end censoring
(Lemke, Leonard, and Tlhokwane 2010).
Major League Baseball data for promotions and attendance were collected by Erica
Costello in December 2012. She graciously contributed these data so students could
learn from them.

C.10 STUDENMUND’S RESTAURANTS


Managers of a nationwide restaurant chain, which we will call Studenmund’s
Restaurants, want to find new restaurant locations. Gross restaurant sales and the
number of competitors within a two-mile radius are noted at existing restaurant
locations. Census data for population and income are also collected for these
locations. Table C.11 shows the variable names and definitions, and Table
C.12 shows the observed data from thirty-three restaurants.
Table C.11. Variables for Studenmund’s Restaurants
Table C.12. Data for Studenmund’s Restaurants

Researchers at Studenmund’s wonder if it is possible to define a model for predicting
restaurant sales. Could such a model be trusted to yield accurate predictions? Could the
model be used to pick future restaurant locations?
The original data for this case were given in Studenmund (1992), an econometrics
textbook now in its sixth edition (Studenmund 2010).

C.11 SYDNEY TRANSPORTATION STUDY


Residents of the north suburbs of Sydney, Australia can commute to downtown
Sydney by car or train. Their choice of transportation will be due, in part, to the
time and cost of commuting by car and train. On the day of the Journey to Work
Survey, Sydney commuters indicate their primary method of transportation (car or
train) and their best alternative method of transportation (car or train). For both the
chosen and alternative methods, 333 commuters provide time and cost estimates
for all trip components by car, train, bus, walking, and other modes of
transportation.
Table C.13 shows names and descriptions of selected variables from the Journey to
Work Survey. Time measurements reflect the total commute time by car or train,
summing across all components of the trip. Cost measurements reflect total costs by car
or train, summing across all components of the trip. Car costs are adjusted for the
number of persons in the car and include parking charges. Using these data, we can build
models for predicting the transportation choices of Sydney commuters.

Table C.13. Variables for the Sydney Transportation Study

Source data for this case come from Hensher and Johnson (1981).

C.12 TOUTBAY BEGINS AGAIN


In March 2015, ToutBay remains in start-up mode awaiting the release of its first
products. In what is becoming an increasingly data-driven world, ToutBay owner
Tom Miller sees opportunities for data science as a service (DSaaS), a term he
uses to describe ToutBay’s business model. The goal for the ToutBay division of
Research Publishers LLC is to be a market maker in the data science space,
publishing and distributing time-sensitive information and competitive
intelligence.
The ToutBay website www.toutbay.com tells the story of a company founded in December
2013 to provide access to applications developed by analysts, modelers, researchers, and
data scientists across the world. These subject matter experts—touts—work with data
and develop models that are of use to many people. As a two-minute video on the
website claims, ToutBay gets people together—people who have answers and people
who need answers. The video introduces the firm and explains why information from
ToutBay can be more valuable than information freely obtained from search engines.
ToutBay products are expected to fall under sports, finance, marketing, and health and
fitness. Sports touts go beyond raw data about players and teams to build models that
predict future performance. ToutBay works with sports touts to make their predictive
models available to players, owners, managers, and sports enthusiasts.
Finance touts help individuals and firms make informed decisions about when and
where to make investments. These touts have expertise in econometrics and time series
analysis. They understand markets and predictive models. They detect trends in the past
and make forecasts about the future.
One of ToutBay’s first products is expected to be a stock portfolio constructor. This is a
financial model designed by Dr. Ernest P. Chan, a recognized expert in the area of
quantitative finance and author of two books on the subject (Chan 2009, 2013). The
idea behind this product is to allow a stock investor to specify his/her investment
objectives and time horizon, as well as the domain of stocks being considered and the
number of stocks desired in a portfolio. Then, using current information about stock
prices and performance, as well as selected economic factors, the Stock Portfolio
Constructor creates a customized stock portfolio for the investor. It lists the selected
stocks and shows their expected future return over the investor’s time horizon,
assuming an equal level of investment in each stock. The Stock Portfolio Constructor
also shows what would have been the historical performance of that portfolio in recent
years.
This case draws from information at the ToutBay
website http://www.toutbay.com and from Google Analytics reports, including reports
summarizing Scroll Depth plug-in data.

Marketing touts play a similar expert role, going beyond raw sales data to provide
consumer and marketplace insights. They have formal training in measurement,
statistics, or machine learning, as well as extensive business consulting experience. The
results of their models for site selection, product positioning, segmentation, or target
marketing are of special interest to business managers.
Health and fitness, a fourth product area, involves scientists with expertise in nutrition,
physiology, and molecular mechanisms of health and disease. These touts provide
relevant information based on scientific research. They deliver personalized plans
developed from real-world models that predict future health and fitness. ToutBay
intends to make health and fitness plans available to individual consumers, personal
trainers, and medical practitioners.
ToutBay’s major public event to date has been the R User Conference, also known as
UseR!, June 30 through July 3, 2014. The conference was held on the UCLA campus in
Los Angeles, California, and attracted around 700 scientists and software engineers,
people who write programs (scripts) in the open-source language R (a widely used
language in statistics and data science). ToutBay was one of the sponsors of UseR!,
along with major software developers and publishers.
ToutBay’s goal at UseR! was to introduce itself to potential touts. The company’s
message was simple: You do the research and modeling, and we do the rest. We turn
scripts into products. The idea is that, by working with ToutBay, data scientists can
focus on data science and ToutBay will take care of marketing, communications, sales,
order processing, distribution, and customer support. The ToutBay website has a For
Touts page that provides the details.
Because ToutBay operates entirely online, its business depends on having a website that
conveys a clear message to visitors or guests. Success means converting website guests
into ToutBay account holders. And after information products become available, success
will mean converting account holders into subscribers to information products.
Revenues will come from customer subscriptions, with touts setting prices for their
information products and ToutBay charging a fee for online sales and distribution of
those products. In recruiting future touts, ToutBay has a simple message: If you were
the author of a book, you would look for a publisher, and you would hope that the
publisher would work with bookstores to sell your book. But what if you are the author
of a predictive model? Where do you go to publish your model? Where do you go to sell
the results of your model? ToutBay—that’s where.
Since opening its website in April 2014, ToutBay has been tracking user traffic with
Google Analytics. Recently, the firm has been reviewing data relating to visits, page
views, and time on the site. There may have been a slight increase in traffic around the
time of the UseR! conference. Otherwise, traffic has been limited, which is a source of
concern for the company.
The ToutBay website employs a single-page design, with extensive information on the
home page, including the two-minute video introduction to the company. A single-page
approach to website design provides better overall performance than a multi-page
approach because a single-page approach requires fewer data transmissions between the
client browser and the website server.
One difficulty in employing a single-page approach, however, is that standard page-view
statistics provide an incomplete picture of website usage. Recognizing this, ToutBay
website developers employed JavaScript code to detect how far down users were
scrolling on the home page. These scrolling data are included in user traffic information
for the site. Table C.14 shows variables and variable definitions for website data under
review.
In preparing these data, we first created an external traffic reporting segment by
filtering out traffic coming from website developers and ToutBay principals. The
variables in the data set include data gathered from Google Analytics reports
for www.toutbay.com for the period from April 12, 2014 through September 19,
2014. Also included are counts from Scroll Depth, a Google Analytics plug-in that
tracks how far users scroll down a page. Scroll Depth is especially useful for a
website that puts a lot of information on individual pages such as the home page (a
single-page approach). Documentation for Scroll Depth is available
at http://scrolldepth.parsnip.io/. When using Google Analytics, we do not have access
to the original data that have been collected. Rather, we use the variables and
reporting aggregates that Google Analytics defines. Documentation for Google
Analytics measures (dimensions and metrics) is available
at https://developers.google.com/analytics/devguides/reporting/core/dimsmets.

Table C.14. ToutBay Begins: Website Data

ToutBay’s owner hopes that a detailed analysis of website content and structure, as
well as data about website usage, will provide guidance in developing future versions of
the website, coinciding with the introduction of the company’s first products.

C.13 TWO MONTH’S SALARY


I never understood why giving a diamond was the social norm when proposing
marriage. As I began searching for an engagement ring, two thoughts kept racing
through my mind: “How will I be able to find the right diamond?” and “What is
this thing going to cost me?” It goes without saying that my fiancée-to-be is worth
the expense, but very seldom in our lives do we spend two month’s salary on a
product we know so little about.
Most guys are like me. They do not want to spend a lot of time talking to jewelers, doing
extensive research, and comparing prices. So for the sake of my male cohort, I took my
statistical education to the streets to find out what goes into diamond pricing and value.
I visited ten brick-and-mortar jewelers where I talked with salespeople, tracked data,
and viewed more than one hundred diamonds. Then I visited seven online jewelers,
gathering information on more than three hundred additional diamonds from two active
stores. All observations in my data set represented round-cut diamonds. Although prices
of alternative shapes or cuts might be comparable, I only looked at round-cut stones
because that shape was the most common, held the most value, and was the only one my
girlfriend wanted.
Shortly after beginning my research, I realized why a diamond is the perfect gift to
represent an engagement. A diamond symbolizes your choice in a mate because a
perfect one is very rare and all of them are unique, complete with imperfections and
positive aspects that make them sparkle.
Uniqueness in diamonds is measured using four characteristics called the four Cs: color,
clarity, carat, and cut. These traits combine to give a diamond its brilliance and fire. A
low level of any one of these attributes can significantly decrease a diamond’s value.
Here is what I learned about the four Cs.
Carat. Carat is the standard unit of weight used for gemstones (one carat equals 0.200
grams or 200 milligrams). Diamond weights are rounded to the nearest hundredth of a carat, known
as a point. A 1.27-carat diamond is said to be “one hundred and twenty-seven points.”
Typical diamond sizes vary from one-quarter to three carats. Diamonds are sized in one-
quarter-carat increments, and jewelers typically carry stock of diamonds at each one-
quarter-carat increment. According to jewelry store personnel, not only does price
increase with the weight of a stone, but, as a diamond passes each one-quarter-carat
threshold, its price jumps correspondingly.
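The point convention and the quarter-carat thresholds described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the threshold feature is one possible way to represent the price jumps that store personnel described, not a variable taken from the case data.

```python
GRAMS_PER_CARAT = 0.200  # one carat equals 0.200 grams


def carat_to_points(carat: float) -> int:
    """Convert carat weight to points (hundredths of a carat)."""
    return round(carat * 100)


def carat_to_grams(carat: float) -> float:
    """Convert carat weight to grams."""
    return carat * GRAMS_PER_CARAT


def quarter_carat_threshold(carat: float) -> float:
    """Return the largest quarter-carat threshold at or below the weight.

    Working in whole points avoids floating-point surprises near thresholds.
    """
    points = carat_to_points(carat)
    return (points // 25) * 25 / 100


if __name__ == "__main__":
    for weight in (0.99, 1.00, 1.27):
        print(weight, carat_to_points(weight), quarter_carat_threshold(weight))
```

A stone just under a threshold (0.99 carats) falls in the 0.75 bracket, while a 1.00-carat stone falls in the 1.00 bracket, which is where a price jump would be expected if the salespeople are right.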
Written by Brian A. Pope. ©2007 by Research Publishers LLC. Reprinted with
permission.

Color. Because diamonds are formed through heat and pressure, the presence of
various gases can cause them to take on various tints. Some diamonds are clear. Others
have a yellow or brown tint. The Gemological Institute of America (GIA) has established
a standard color scale for grading diamonds from D to Z based on tint or color. This
scale was used by all twelve of the jewelers I visited. It breaks color grades down into
categories like “colorless” and “near colorless.” Jewelers indicate that the price of a
diamond decreases as you move away from a D grade, which is considered perfectly
colorless. In most cases, however, differences in color grade can only be seen when
diamonds are compared with one another.
Clarity. The clarity of a diamond measures the purity of the stone. There are often
carbon pockets that form imperfections in diamonds called inclusions. Clarity
summarizes the number and size of inclusions. The GIA has created a scale that rates
inclusions by their visibility to the naked eye. From a flawless (FL) diamond to one that
has slight inclusions (SI1 and SI2), salespeople will tell you that the price and value of a
diamond decrease as the number of noticeable inclusions increases. But when you
shop, you will rarely see a perfectly flawless diamond, and most often you cannot
visually detect inclusions at the VVS or VS levels.
Cut. As you go from one jeweler to the next, carat, color, and clarity are defined and
measured in a generally universal way. A grade D diamond is perfectly colorless. A
diamond with I2 clarity will have plainly visible flaws. And a 1.03-carat diamond has the
same weight anywhere you shop. That leaves the type and quality of a stone’s cut to
differentiate diamond products. The type of cut determines the shape of the diamond,
but I limited my study to round-cut diamonds. Determining the quality of cut was more
problematic.
I often felt like I was being deceived when salespeople explained why their cut scale was
the only appropriate way to measure the quality of cut. A few jewelers used three criteria
that the GIA says make an ideally cut stone: depth, symmetry, and polish. Variations in
depth and symmetry can cause a diamond to lose its brilliance. In addition to these two
qualities, the overall finish or polish of the stone can have a substantial effect on how
well it shines. In the end, I simplified my definition of the cut variable based on my
shopping experiences. Regardless of what was said about cut, most jewelers would show
two levels of cut. One of the levels would be described as ideal and the other non-ideal.
The difference between ideally and non-ideally cut diamonds is not likely to be
noticeable to the naked eye, but a diamond will undoubtedly cost more if a jeweler
describes it as ideally cut.
In addition to the four Cs, I wanted to see if price varied
across sales channels. I gathered data from three separate types of jewelers.
Independent Jewelers. These businesses are usually not in an enclosed mall. They
serve a single community rather than operating as chain stores. Many of the independent
jewelers I visited operated at only one location. At independent jewelers I would be
given a selection of seven to ten round-cut stones, and store personnel took a non-
pressured approach to the sales process.
Mall Jewelers. Located within enclosed malls, many of these jewelers were local
branches of national chains. I found the selection of stones to be higher in number but
lower in quality. The main factor that I did not like here was the pushy nature of the
sales force. I often felt like I was buying a used car.
Internet Jewelers. I looked for online jewelers to complete my analysis. I found two
stores with a vast selection of stones. I took a sample of more than three hundred stones
from the over four thousand round-cut diamonds available at these two stores. Although
online jewelers provided pictures of about half their stones, I would find it difficult to
buy a diamond I could not see in person.
Now that the data have been gathered and coded according to the rules summarized
in Table C.15, I need to figure out which diamond to buy my girlfriend. Furthermore,
some of the jewelers are asking questions about why I am collecting this information.
One of the independent jewelers is interested in my study. He thinks he might be able to
use the results to guide his own diamond buying.
Table C.15. Diamonds Data: Variable Names and Coding Rules

C.14 WISCONSIN DELLS


Wisconsin Dells, a sprawling resort and entertainment center in south central
Wisconsin, is one of the Midwest’s favorite vacation destinations. The Dells area is
a mixture of beautiful valleys, canyons, hills, forests, and recreational businesses
nestled around an interlocking series of lakes and rivers. Wisconsin Dells is an
hour north of Madison, Wisconsin (the state capital), three to four hours from
Chicago, and four hours from Minneapolis/St. Paul. The Dells offers a wide variety
of activities. In summer, people come for its water parks and amphibious tours. In
winter, people come for cross-country skiing and snowmobiling. Indoor attractions
are open year-round.
In the summer of 1995 Wisconsin Dells business owners were developing plans for
drawing visitors to their attractions. They had many questions about their customers
and potential customers. To answer these questions, business owners, represented by
the Wisconsin Dells Visitor and Convention Bureau, enlisted the aid of Chamberlain
Research Consultants, a marketing research firm headquartered in Madison. The firm
conducted 1,698 in-person interviews with visitors to Wisconsin Dells. These interviews
took place on the main street of Wisconsin Dells and at water parks, hotels, restaurants,
and other area attractions. Interviewers obtained demographic and vacation trip
information from visitors. The Wisconsin Dells area offers many popular tourist
activities and attractions. Let us review some of the more popular attractions.
Tommy Bartlett’s Thrill Show. Started in 1952, this is one of the most famous Dells
attractions. The show is a combination of on-stage performances (including juggling,
tumbling, and music) and a water-skiing show. The water show has highly
choreographed stunts, including a three-tier human pyramid on water skis. The Thrill
Show auditorium holds five thousand people, and there are three performances daily
between Memorial Day and Labor Day.
Water Parks. The Dells area is home to several water parks, including Noah’s Ark,
which is reportedly the largest water park in the nation.
Written by Jonathan C. Harrington. Based on research supported by Sharon
Chamberlain. ©2007 by Research Publishers LLC. Reprinted with permission.

The Ducks. When people talk about “The Ducks in the Dells,” they are not talking
about waterfowl. These Ducks are amphibious vehicles built by the U.S. Army during
World War II as a means of transporting soldiers over land and water. The Ducks are
used to give tours of the natural wonders of the area. Duck Tours take visitors up hills,
down into valleys, across rivers, and through lakes. Along the way, visitors see all
manner of intriguing rock formations and beautiful scenery. Duck Tours run from
March through October, weather permitting.
Circus World Museum. Wisconsin Dells is located just north of Baraboo, Wisconsin,
former home of the famous Ringling brothers, founders of the Ringling Brothers and
Barnum & Bailey Circus. Owned by the State Historical Society of Wisconsin, Circus
World Museum celebrates the history of the circus with exhibits, circus performances,
variety shows, clown shows, animal shows, and a petting menagerie. The museum is
open year-round with extended hours during the summer.
Boat Tours. The Dells area stretches along the Wisconsin River and includes several
lakes. Boat tours, an alternative to Duck Tours, stick to the waterways and
attractions along the shorelines.
Stand Rock. The Dells has fascinating natural rock formations because the upper
layers of rock are more resistant to erosion than are the underlying layers. Stand Rock is
an unusual formation, with a large, round, table-like rock supported by a far narrower
column. This formation is near another tall rock formation with a gap in between. To
commemorate a famous leap across the gap, the tour of this site includes a dog leaping
from rock to rock. Stand Rock is accessible by boat.
Gambling. Ho-Chunk Casino is located one mile south of downtown Wisconsin Dells.
This Indian casino features slots, video poker, blackjack, and various forms of
entertainment.
Additional area attractions include a wax museum, numerous campgrounds, many
shopping opportunities, go-carts, a fifties revival show, golf courses, nature walks, a
UFO and science fiction museum, a motor speedway, fishing trips, riding stables, laser
tag facilities, movie theaters, and various other museums and shows.
Table C.16 shows visitor variables and their coding. Interviewers asked visitors
whether they had participated in or were likely to participate in any of a number of
activities around the Wisconsin Dells. Table C.17 shows variables relating to
participation in these activities.
Taking the role of a Dells business owner or a representative of the Wisconsin Dells
Visitor and Convention Bureau, we have many questions to answer. What can we learn
about the people who visit the Dells? Are there discernible patterns in visitor activities?
Is it possible to identify consumer segments among the visitors? What kinds of activities
would we recommend for visitor groups identified by demographics or type of visiting
party?
A majority of current Dells advertising takes the form of brochures and pamphlets
placed at various attractions in the Dells. Business owners would like to target
advertising to those people most likely to visit attractions. What can we learn from the
Dells data to help business owners in their advertising and marketing activities?
Table C.16. Dells Survey Data: Visitor Characteristics
Table C.17. Dells Survey Data: Visitor Activities

C.15 WISCONSIN LOTTERY SALES


It is January 1999, and Wisconsin Lottery administrators have basic questions
about the market for lottery tickets. Who are the Wisconsin Lottery’s customers,
and what makes them buy lottery tickets? Wisconsin Lottery sales contribute to
State of Wisconsin revenues. Wisconsin Lottery and Department of Revenue
administrators want to have accurate ways of predicting these sales. If they knew
who their customers were and how to find them, administrators could do a better
job in selecting new lottery ticket retailers. Administrators are concerned about
what appears to be a drop in demand for instant lottery tickets. They are also
concerned that both online and instant ticket sales could be affected by the opening
of new Indian casinos.
There are two general classes of lottery games: online and instant. Online lottery tickets,
which are sold at selected retail establishments in Wisconsin, require the buyer to pick
numbers to be entered at an online lottery terminal. These tickets are sold to individual
customers every day of the year. Some tickets are sold for $0.50, $2.00, and $5.00, but
the great majority of lottery tickets are sold for $1.00. Odds of winning are extremely
low, but jackpots can be huge. Jackpots for the online game PowerBall sometimes
exceed $100 million.
Instant lottery tickets, also called “scratch tickets,” come in many varieties. These have
smaller jackpots and better odds of winning than online lottery tickets. The Wisconsin
Lottery sells bundles of instant tickets to legitimate for-profit and nonprofit
organizations in Wisconsin, and these organizations, in turn, sell individual tickets to
consumers. Most online ticket retailers also sell instant lottery tickets. We might assume
that, on average, online ticket retailers place orders for instant lottery ticket bundles
about once every four weeks.
Competing lottery games are offered by the neighboring states of Illinois, Iowa,
Michigan, and Minnesota. States sometimes cooperate with one another. In 1998 the
popular game PowerBall, for example, derived its large jackpots by pooling ticket sales
from eighteen states and the District of Columbia.
Wisconsin State administrators provided data for lottery sales and Wisconsin Indian
casinos. And David R. Blough provided geographical measurements for Wisconsin
ZIP codes. Frees and Miller (2004) used the Wisconsin Lottery data to demonstrate
forecasting methods for panel/longitudinal data.

Substitute or competing (and legal) gaming products include bingo and slots at Indian
casinos in Wisconsin and neighboring states. We identified fourteen Wisconsin casinos
operational at the time of the study. We also learned that new casinos were planned for
Madison in 1999 and for Milwaukee in 2000. The Potawatomi Nation plan for
Milwaukee (ZIP code 53233) included a 256,000 square foot casino complex with 1,000
slot machines.
In developing models for lottery sales, we can draw upon observations of people familiar
with lottery activities. We can also draw upon our intuition and anecdotal evidence.
There are a number of hypotheses to consider:
• Ticket sales are higher shortly after new lottery games are introduced with television or
radio advertising.
• Higher lottery jackpots lead to higher online ticket sales. There may also be some
carry-over effect on instant lottery ticket sales.
• Ticket sales are higher in those areas that are better served by online ticket retailers.
That is, higher numbers of retailers should lead to higher sales.
• Ticket sales are lower in areas served by substitute gaming facilities, such as Indian
casinos.
• Lower income, less educated people buy more lottery tickets per capita than higher
income, more educated people.
• On average, senior citizens buy more lottery tickets than people in other age groups.
The thinking here is that senior citizens have more free time to engage in recreational
and gaming activities.
• Ticket sales are higher during the first week of the month because many people get
paid or receive government support checks, such as Social Security checks, on the first
day of the month.
Although we might expect advertising to affect sales, State of Wisconsin law restricts the
use of extensive advertising by the Wisconsin Lottery. The only time that the Wisconsin
Lottery is allowed to advertise is when a new lottery game is introduced. New lottery
games are usually instant games, and only a small proportion of these games receives
television or radio advertising. For example, in the forty-week period for this study,
twenty-seven new instant lottery games were introduced. Six of these games received
television advertising, and one received radio advertising. We might assume that each
new instant lottery game that received advertising received it for one month (for the
week of new product launch and for three weeks thereafter).
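The one-month advertising assumption can be turned into a weekly indicator variable. The sketch below is a minimal illustration: the function name and the example launch weeks are hypothetical, and weeks are numbered 1 through 40 in study order.

```python
def advertising_indicator(launch_weeks, n_weeks=40, duration=4):
    """Return a list of 0/1 flags, one per study week.

    A week is flagged 1 if an advertised game launched in that week or in
    one of the (duration - 1) preceding weeks. Weeks are 1-indexed.
    """
    flags = [0] * n_weeks
    for launch in launch_weeks:
        # Advertising runs for the launch week and three weeks thereafter,
        # truncated at the end of the study period.
        for week in range(launch, min(launch + duration, n_weeks + 1)):
            flags[week - 1] = 1
    return flags


# Hypothetical launch weeks for advertised games (not from the study data).
example_launches = [3, 5, 38]
example_flags = advertising_indicator(example_launches)

if __name__ == "__main__":
    print(example_flags)
```

Overlapping campaigns are treated as a single advertised period here. Counting the number of concurrently advertised games would be an alternative coding.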
Sales data for the Wisconsin Lottery are like the sales data of many organizations. These
are hierarchical or panel data, having both a cross-sectional and a time-series
organization. For each retail establishment selling online lottery tickets, the State has a
record of the number of lottery tickets sold, their cost, and the time of the sale. Retail
establishments fall within sales territories or areas. For the Wisconsin Lottery we might
think of ZIP codes as sales territories.
We organized Wisconsin Lottery sales data by ZIP code and time. We aggregated online
ticket sales across retail establishments within ZIP codes, and we also obtained instant
ticket sales within ZIP codes. We used weeks as our unit of aggregation across time.
Weeks began on Sundays and ended on Saturdays; we obtained data for 40 consecutive
weeks (the weeks ending April 4, 1998 through January 2, 1999). Table C.18 provides
names and descriptions for the relevant variables. Sales data were not available for
Wisconsin Indian casinos, but we did obtain measures of gaming capacity (casino size
and the number of slot machines). Table C.19 shows names and descriptions for
information fields in the casino data set.
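The aggregation by ZIP code and week might be sketched as follows. This is an illustration under stated assumptions: the record layout (ZIP code, sale date, dollar amount) and the function names are hypothetical, and the example records are invented for demonstration.

```python
from collections import defaultdict
from datetime import date, timedelta

FIRST_WEEK_ENDING = date(1998, 4, 4)  # first Saturday in the study period
N_WEEKS = 40

# The 40 consecutive week-ending Saturdays of the study period.
week_endings = [FIRST_WEEK_ENDING + timedelta(weeks=i) for i in range(N_WEEKS)]


def week_ending(day: date) -> date:
    """Return the Saturday closing the Sunday-to-Saturday week that holds day."""
    # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6.
    days_until_saturday = (5 - day.weekday()) % 7
    return day + timedelta(days=days_until_saturday)


def aggregate_by_zip_week(records):
    """Sum sales across retailers within each (ZIP code, week ending) cell.

    records: iterable of (zip_code, sale_date, amount) tuples.
    """
    totals = defaultdict(float)
    for zip_code, sale_date, amount in records:
        totals[(zip_code, week_ending(sale_date))] += amount
    return dict(totals)


# Hypothetical retailer-level records (invented for illustration).
example_records = [
    ("53233", date(1998, 3, 29), 10.0),  # Sunday, week ending April 4
    ("53233", date(1998, 4, 4), 5.0),    # Saturday, week ending April 4
    ("53233", date(1998, 4, 5), 7.0),    # Sunday, week ending April 11
]
example_totals = aggregate_by_zip_week(example_records)

if __name__ == "__main__":
    print(week_endings[0], week_endings[-1])
    print(example_totals)
```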
Table C.18. Wisconsin Lottery Data
Table C.19. Wisconsin Casino Data

We can link lottery sales data, casino data, and demographic data using ZIP codes. We
derived ZIP code demographics from 1990 United States Census data, with revised
Census estimates from 1995. We also recorded the centroid of each ZIP code region in
East-West and North-South coordinates. Table C.20 shows names and descriptions for
the ZIP code demographic and location variables.
Table C.20. Wisconsin ZIP Code Data

A geographer helped us to locate the East-West and North-South coordinates for ZIP
code centroids. He explained that ZIP regions are highly irregular polygons and that the
centroid of a ZIP code is at best an approximate center of the ZIP code region. To get
coordinates for Wisconsin ZIP code centroids, the geographer used the Wisconsin
Transverse Mercator Geo-referencing System, which measures coordinate axes in
meters, with the origin set as an arbitrary point in Iowa, southwest of all Wisconsin ZIP
codes. Centroid coordinates should not be thought of as centers of population because it
is unlikely that population would be evenly distributed across ZIP code regions.
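Because the centroid coordinates are measured in meters on a common grid, one way we might quantify proximity to substitute gaming facilities is the straight-line distance from each ZIP code centroid to the nearest casino. The sketch below assumes planar Euclidean distance is adequate at this scale. The function name and the example coordinates are hypothetical, and nothing in the case description says distances were computed this way.

```python
import math


def nearest_casino_km(zip_xy, casino_xys):
    """Return the distance in kilometers from a ZIP centroid to the nearest casino.

    zip_xy: (east_west, north_south) centroid coordinates in meters.
    casino_xys: non-empty list of (east_west, north_south) coordinates in meters.
    """
    if not casino_xys:
        raise ValueError("at least one casino location is required")
    zx, zy = zip_xy
    nearest_m = min(math.hypot(zx - cx, zy - cy) for cx, cy in casino_xys)
    return nearest_m / 1000.0


# Hypothetical coordinates in meters (invented for illustration).
example_zip = (3000.0, 4000.0)
example_casinos = [(0.0, 0.0), (100000.0, 100000.0)]
example_distance = nearest_casino_km(example_zip, example_casinos)

if __name__ == "__main__":
    print(example_distance)
```

Since centroids are only approximate centers and not centers of population, distances computed this way should be read as rough indicators of access.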
When fitting linear models to the lottery sales data, we should note that some
explanatory variables, such as the size of lottery jackpots, vary across time, but are
constant across ZIP code locations. Other explanatory variables, such as population,
vary across ZIP code locations, but are treated as constant across time (for the 40 weeks
that we are considering). Still other variables, such as lottery sales response variables,
vary across time and locations. In fitting models to these data, we need to identify
appropriate error structures, noting which variables vary with time and which vary with
location.
We do not have to make a distinction between sales dollars and sales volume because
most lottery tickets are sold for $1.00. Just the same, we need to define appropriate
response variables. In testing certain research hypotheses, we may want to use per
capita measures rather than original measures. And, given the characteristics of online
and instant sales (online being sales to consumers and instant being sales to retailers),
we may choose to develop separate models for the online and instant sales responses.
Alternatively, we could try to synchronize online and instant sales information by
shifting or lagging one sales time-series relative to the other.
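Per capita measures and lagged series are simple to construct once sales are organized by ZIP code and week. The following sketch uses hypothetical function names and invented numbers. The choice of lag (in weeks) is an assumption to be explored, not a value given in the case.

```python
def per_capita(sales: float, population: float) -> float:
    """Return sales per person for one ZIP code and week."""
    if population <= 0:
        raise ValueError("population must be positive")
    return sales / population


def lag(series, k: int):
    """Shift a weekly series k weeks later, padding the start with None.

    The result has the same length as the input, so lagged and unlagged
    series stay aligned week by week.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    n = len(series)
    k = min(k, n)
    return [None] * k + list(series[: n - k])


# Hypothetical weekly instant sales for one ZIP code (invented numbers).
example_instant = [10, 20, 30, 40]
example_lagged = lag(example_instant, 1)

if __name__ == "__main__":
    print(example_lagged)
    print(per_capita(500.0, 1000))
```

Weeks padded with None have no lagged value and would be dropped before model fitting.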
State of Wisconsin administrators want to predict online and instant ticket sales and to
identify future potentially productive online ticket sales locations. In the process of
fitting models, we might think about providing meaningful tests of hypotheses about
what affects lottery sales. We have sufficient data to fit a variety of models, including
time-series, panel, and spatial data models. Where shall we begin?

C.16 WIKIPEDIA VOTES


The Wikipedia online encyclopedia is a collaborative writing project open to all.
Jimmy Wales and Larry Sanger started Wikipedia on January 15, 2001, using wiki
software from Ward Cunningham.
The Wikipedia website grew slowly during its first four years. Between its fifth and sixth
years of operation, however, the website doubled in size, growing from 500 thousand
articles to more than one million articles. By September 2014, Wikipedia consisted of
more than 33 million articles in 287 languages, with more than 48 million contributors.
Wikipedia is maintained by a set of elected administrators. Votes are cast by existing
administrators and by non-administrator users. A set of votes over any selected period
of time may be used to define a social network. The act of voting defines a link in a
directed network, with a user/voter linked to another user/candidate.
Wikipedia Votes represents a network data set of 7,115 nodes (users, voters, candidates)
and 103,689 links (votes). The data span the first seven years of Wikipedia, January
2001 through January 2008, documenting the early growth of the site and user
collaboration in building the site.
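A directed network of this kind is commonly stored as an edge list of from-node and to-node pairs. The sketch below assumes a plain-text layout with one whitespace-separated pair per line and comment lines beginning with "#". The function names and the sample lines are hypothetical. It counts votes cast (out-degree) and votes received (in-degree) for each user.

```python
from collections import Counter


def read_edge_list(lines):
    """Parse 'from to' integer pairs, skipping blank and '#' comment lines."""
    edges = []
    for line in lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue
        from_node, to_node = text.split()[:2]
        edges.append((int(from_node), int(to_node)))
    return edges


def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    in_degree, out_degree = Counter(), Counter()
    for voter, candidate in edges:
        out_degree[voter] += 1     # votes cast by this user
        in_degree[candidate] += 1  # votes received by this user
    return in_degree, out_degree


# Invented sample lines for illustration only.
sample_lines = ["# FromNodeId ToNodeId", "1\t2", "3\t2", "1\t3"]
sample_edges = read_edge_list(sample_lines)
sample_in, sample_out = degree_counts(sample_edges)
sample_nodes = {node for edge in sample_edges for node in edge}

if __name__ == "__main__":
    print(len(sample_nodes), len(sample_edges))
    print(sample_in.most_common(1), sample_out.most_common(1))
```

Applied to the full data set, the node and link totals from such a pass can be checked against the counts reported above.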
Table C.21 lists the top ten websites worldwide in September 2014 according to Alexa
Internet, Inc., a subsidiary of Amazon.com. The ranking is based on page view and daily
visitor counts. Wikipedia ranks sixth on the list and is the only member not maintained
by a corporation.
Adapted from Alexa Internet (2014).

Table C.21. Top Sites on the Web, September 2014

Data showing the from-node and to-node structure of this social network are drawn
from Leskovec, Huttenlocher, and Kleinberg (2010a, 2010b) and are available as part
of the Stanford Large Network Dataset Collection
at https://snap.stanford.edu/data/wiki-Vote.html. Background information about
Wikipedia was obtained from Wikipedia (2014b).
