Chapter 4 Linear Regression
Analytics in Action
ALLIANCE DATA SYSTEMS*
DALLAS, TEXAS
Alliance Data Systems (ADS) provides transaction processing, credit services, and marketing services for clients in the rapidly growing customer relationship management (CRM) industry. ADS clients are concentrated in four industries: retail, petroleum/convenience stores, utilities, and transportation. In 1983, Alliance began offering end-to-end credit processing services to the retail, petroleum, and casual dining industries; today the company employs more than 6,500 people who provide services to clients around the world. Operating more than 140,000 point-of-sale terminals in the United States alone, ADS processes in excess of 2.5 billion transactions annually. The company ranks second in the United States in private label credit services, representing 49 private label programs with nearly 72 million cardholders. In 2001, ADS made an initial public offering and is now listed on the New York Stock Exchange.

[Photo: Alliance Data Systems analysts discuss use of a regression model to predict sales for a direct marketing campaign. © Courtesy of Alliance Data Systems.]
As one of its marketing services, ADS designs direct mail campaigns and promotions. With its database containing information on the spending habits of more than 100 million consumers, ADS can target consumers who are the most likely to benefit from a direct mail promotion. The Analytical Development Group uses regression analysis to build models that measure and predict the responsiveness of consumers to direct marketing campaigns. Some regression models predict the probability of purchase for individuals receiving a promotion, and others predict the amount spent by consumers who make purchases.

For one campaign, a retail store chain wanted to attract new customers. To predict the effect of the campaign, ADS analysts selected a sample from the consumer database, sent the sampled individuals promotional materials, and then collected transaction data on the consumers' response. Sample data were collected on the amount of purchase made by the consumers responding to the campaign, as well as on a variety of consumer-specific variables thought to be useful in predicting sales. The consumer-specific variable that contributed most to predicting the amount purchased was the total amount of credit purchases at related stores over the past 39 months. ADS analysts developed an estimated regression equation relating the amount of purchase to the amount spent at related stores:

ŷ = 26.7 + 0.00205x

where

ŷ = predicted amount of purchase
x = amount spent at related stores

Using this equation, we could predict that someone spending $10,000 over the past 39 months at related stores would spend $47.20 when responding to the direct mail promotion. In this chapter, you will learn how to develop this type of estimated regression equation. The final model developed by ADS analysts also included several other variables that increased the predictive power of the preceding equation. Among these variables were the absence or presence of a bank credit card, estimated income, and the average amount spent per trip at a selected store. In this chapter, we will also learn how such additional variables can be incorporated into a multiple regression model.

*The authors are indebted to Philip Clemance, Director of Analytical Development at Alliance Data Systems, for providing this Analytics in Action.
4.1 The Simple Linear Regression Model
Managerial decisions are often based on the relationship between two or more variables. For example, after considering the relationship between advertising expenditures and sales, a marketing manager might attempt to predict sales for a given level of advertising expenditures. In another case, a public utility might use the relationship between the daily high temperature and the demand for electricity to predict electricity usage on the basis of next month's anticipated daily high temperatures. Sometimes a manager will rely on intuition to judge how two variables are related. However, if data can be obtained, a statistical procedure called regression analysis can be used to develop an equation showing how the variables are related.

[Note: The statistical methods used in studying the relationship between two variables were first employed by Sir Francis Galton (1822–1911). Galton found that the heights of the sons of unusually tall or unusually short fathers tend to move, or "regress," toward the average height of the male population. Karl Pearson (1857–1936), a disciple of Galton, later confirmed this finding over a sample of 1078 pairs of fathers and sons.]

In regression terminology, the variable being predicted is called the dependent variable, or response, and the variables being used to predict the value of the dependent variable are called the independent variables, or predictor variables. For example, in analyzing the effect of advertising expenditures on sales, a marketing manager's desire to predict sales would suggest making sales the dependent variable. Advertising expenditure would be the independent variable used to help predict sales.
A regression analysis involving one independent variable and one dependent variable
is referred to as a simple regression, and in statistical notation y denotes the dependent
variable and x denotes the independent variable. A regression analysis for which any one
unit change in the independent variable, x, is assumed to result in the same change in the
dependent variable, y, is referred to as a linear regression. Regression analysis involving
two or more independent variables is called multiple regression; multiple regression and
cases involving curvilinear relationships are covered in later sections of this chapter.
[Figure 4.1: Possible regression lines in simple linear regression. Each panel plots E(y|x) against x with its regression line: Panel A, positive linear relationship; Panel B, negative linear relationship; Panel C, no relationship.]
The equation that describes how the expected value of y, denoted E(y), is related to x is called the regression equation. The regression equation for simple linear regression follows:

E(y|x) = β0 + β1x   (4.2)

where E(y|x) is the expected value of y for a given value of x. [Note: E(y|x) is often referred to as the conditional mean of y given the value of x.] The graph of the simple linear regression equation is a straight line; β0 is the y-intercept of the regression line, β1 is the slope, and E(y|x) is the mean or expected value of y for a given value of x.
Examples of possible regression lines are shown in Figure 4.1. The regression line
in Panel A shows that the mean value of y is related positively to x, with larger values of
E(y|x) associated with larger values of x. In Panel B, the mean value of y is related nega-
tively to x, with smaller values of E(y|x) associated with larger values of x. In Panel C, the
mean value of y is not related to x; that is, E(y|x) is the same for every value of x.
Figure 4.2 provides a summary of the estimation process for simple linear regression.
The graph of the estimated simple linear regression equation is called the estimated
regression line; b0 is the estimated y-intercept, and b1 is the estimated slope. In the next
section, we show how the least squares method can be used to compute the values of b0 and
b1 in the estimated regression equation.
4.2 Least Squares Method

In general, ŷ is the point estimator of E(y|x), the mean value of y for a given value of x. [Note: A point estimator is a single value used as an estimate of the corresponding population parameter.] Thus, to estimate the mean or expected value of travel time for a driving assignment of 75 miles, Butler Trucking would substitute the value of 75 for x in equation (4.3). In some cases, however, Butler Trucking may be more interested in predicting travel time for an upcoming driving assignment of a particular length. For example, suppose Butler Trucking would like to predict travel time for a new 75-mile driving assignment the company is considering. As it turns out, the best predictor of y for a given value of x is also provided by ŷ. Thus, to predict travel time for a new 75-mile driving assignment, Butler Trucking would also substitute the value of 75 for x in equation (4.3). The value of ŷ provides both a point estimate of E(y|x) for a given value of x and a prediction of an individual value of y for a given value of x. In most cases, we will refer to ŷ simply as the predicted value of y.
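This dual role of ŷ can be sketched in a few lines of Python. The coefficients used here, b0 = 1.2739 and b1 = 0.0678, are the Butler Trucking estimates developed later in this section; equation (4.3) has the form ŷ = b0 + b1x.

```python
# A minimal sketch of the dual role of y-hat for the Butler Trucking example.
# The coefficients are the estimates developed later in this section.

def y_hat(x, b0=1.2739, b1=0.0678):
    """Both the point estimate of E(y|x) and the prediction of a new y at x."""
    return b0 + b1 * x

# For a 75-mile assignment, the same number serves as the estimate of the
# mean travel time and as the prediction for a new individual assignment.
print(round(y_hat(75), 4))  # 6.3589 hours
```

The single function call answers both questions (mean travel time and predicted travel time for a new assignment), which is exactly the point made in the text.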
TABLE 4.1 MILES TRAVELED AND TRAVEL TIME (IN HOURS) FOR TEN BUTLER TRUCKING COMPANY DRIVING ASSIGNMENTS

Assignment   Miles Traveled (x)   Travel Time in Hours (y)
 1           100                  9.3
 2            50                  4.8
 3           100                  8.9
 4           100                  6.5
 5            50                  4.2
 6            80                  6.2
 7            75                  7.4
 8            65                  6.0
 9            90                  7.6
10            90                  6.1

[Figure 4.3: Scatter chart of travel time (y) against miles traveled (x) for the ten driving assignments. Axes: Miles Traveled - x (40 to 100); Travel Time (hours) - y (0 to 10)]
the data graphically and to draw preliminary conclusions about the possible relationship
between the variables.
What preliminary conclusions can be drawn from Figure 4.3? Longer travel times
appear to coincide with more miles traveled. In addition, for these data, the relationship
between the travel time and miles traveled appears to be approximated by a straight line;
indeed, a positive linear relationship is indicated between x and y. We therefore choose
the simple linear regression model to represent this relationship. Given that choice, our
next task is to use the sample data in Table 4.1 to determine the values of b0 and b1 in the
estimated simple linear regression equation. For the ith driving assignment, the estimated
regression equation provides
ŷi = b0 + b1xi   (4.4)

where

ŷi = predicted travel time (in hours) for the ith driving assignment
b0 = the y-intercept of the estimated regression line
b1 = the slope of the estimated regression line
xi = miles traveled for the ith driving assignment
With yi denoting the observed (actual) travel time for driving assignment i and ŷi in equation (4.4) representing the predicted travel time for driving assignment i, every driving assignment in the sample will have an observed travel time yi and a predicted travel time ŷi. For the estimated regression line to provide a good fit to the data, the differences between the observed travel times yi and the predicted travel times ŷi should be small.

The least squares method uses the sample data to provide the values of b0 and b1 that minimize the sum of the squares of the deviations between the observed values of the dependent variable yi and the predicted values of the dependent variable ŷi. The criterion for the least squares method is given by equation (4.5).
min Σi=1..n (yi − ŷi)²   (4.5)

where

yi = observed value of the dependent variable for the ith observation
ŷi = predicted value of the dependent variable for the ith observation
n = total number of observations
This is known as the least squares method for estimating the regression equation.
The error we make using the regression model to estimate the mean value of the dependent variable for the ith observation is often written as ei = yi − ŷi and is referred to as the ith residual. Using this notation, equation (4.5) can be rewritten as

min Σi=1..n ei²

and we say that we are finding the regression that minimizes the sum of squared errors.
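As a quick numerical check (not part of the original example), we can evaluate the least squares criterion on the Table 4.1 data and confirm that the estimates b0 = 1.2739 and b1 = 0.0678 reported in this section beat nearby lines with a perturbed slope or intercept:

```python
# Evaluating the least squares criterion of equation (4.5) on the Butler
# Trucking data and checking that perturbed lines all produce a larger SSE.

miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

def sse(b0, b1):
    """Sum of squared errors for the line y-hat = b0 + b1*x, equation (4.5)."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(miles, hours))

best = sse(1.2739, 0.0678)
for db0, db1 in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.005), (0.0, -0.005)]:
    # Any small shift of the intercept or slope increases the criterion.
    assert best < sse(1.2739 + db0, 0.0678 + db1)
print(round(best, 4))  # about 8.0287; the text reports SSE = 8.0288
```

The perturbations here are arbitrary illustrative choices; the point is only that the least squares estimates sit at the minimum of the criterion.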
SLOPE EQUATION

b1 = [Σi=1..n (xi − x̄)(yi − ȳ)] / [Σi=1..n (xi − x̄)²]   (4.6)

y-INTERCEPT EQUATION

b0 = ȳ − b1x̄   (4.7)
where

xi = value of the independent variable for the ith observation
yi = value of the dependent variable for the ith observation
x̄ = mean value for the independent variable
ȳ = mean value for the dependent variable
n = total number of observations
For the Butler Trucking Company data in Table 4.1, these equations yield an estimated slope of b1 = 0.0678 and a y-intercept of b0 = 1.2739. Thus, our estimated simple linear regression model is ŷ = 1.2739 + 0.0678x. Although equations (4.6) and (4.7) are not difficult to use, computer software such as Excel or XLMiner is generally used to calculate b1 and b0.
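The computation in equations (4.6) and (4.7) can be sketched in a few lines of Python using the Table 4.1 data; this mirrors what Excel or XLMiner would report:

```python
# Applying the slope and intercept equations (4.6) and (4.7) to the Butler
# Trucking sample data from Table 4.1.

miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]        # xi
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]  # yi

n = len(miles)
x_bar = sum(miles) / n   # mean miles traveled, 80.0
y_bar = sum(hours) / n   # mean travel time, 6.7

# Equation (4.6): b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, hours)) \
     / sum((x - x_bar) ** 2 for x in miles)

# Equation (4.7): b0 = y_bar - b1 * x_bar
b0 = y_bar - b1 * x_bar

print(round(b1, 4), round(b0, 4))  # 0.0678 1.2739
```

Rounding to four decimal places reproduces the estimates quoted in the text.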
We interpret b1 and b0 as we would the slope and y-intercept of any straight line. The slope b1
is the estimated change in the mean of the dependent variable y that is associated with a one unit
increase in the independent variable x. For the Butler Trucking Company model, we therefore es-
timate that, if the length of a driving assignment were 1 unit (1 mile) longer, the mean travel time
for that driving assignment would be 0.0678 units (0.0678 hours, or approximately 4 minutes)
longer. The y-intercept b0 is the estimated value of the dependent variable y when the independent
variable x is equal to 0. For the Butler Trucking Company model, we estimate that if the driving
distance for a driving assignment was 0 units (0 miles), the mean travel time would be 1.2739
units (1.2739 hours, or approximately 76 minutes). Can we find a plausible explanation for this?
Perhaps the 76 minutes represent the time needed to prepare, load, and unload the vehicle, which
is required for all trips regardless of distance and which therefore does not depend on the distance
traveled. However, we must use caution: To estimate the travel time for a driving distance of
0 miles, we have to extend the relationship we have found with simple linear regression well
beyond the range of values for driving distance in our sample. Those sample values range from
50 to 100 miles, and this range represents the only values of driving distance for which we have
empirical evidence of the relationship between driving distance and our estimated travel time.
It is important to note that the regression model is valid only over the experimental region, which is the range of values of the independent variables in the data used to estimate the model. Prediction of the value of the dependent variable outside the experimental region is called extrapolation. Because we have no empirical evidence that the relationship we have found holds true for values of x outside the range of values of x in the data used to estimate the relationship, extrapolation is risky and should be avoided if possible. [Note: The estimated value of the y-intercept often results from extrapolation.] For Butler Trucking, this means that any prediction of travel time for a driving distance of less than 50 miles or greater than 100 miles is not reliable, and so for this model the estimate of b0 is meaningless. However, if the experimental region for a regression problem includes zero, the y-intercept will have a meaningful interpretation.
We can now also use this model and our known values for miles traveled for a driving
assignment (x) to estimate mean travel time in hours. For example, the first driving assign-
ment in Table 4.1 has a value for miles traveled of x 5 100. We estimate the mean travel
time in hours for this driving assignment to be
ŷ1 = 1.2739 + 0.0678(100) = 8.0539
Since the travel time for this driving assignment was 9.3 hours, this regression estimate
would have resulted in a residual of
e1 = y1 − ŷ1 = 9.3 − 8.0539 = 1.2461
The simple linear regression model underestimated travel time for this driving assignment by 1.2461 hours (approximately 75 minutes). Table 4.2 shows the predicted mean travel times,
the residuals, and the squared residuals for all ten driving assignments in the sample data.
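The calculations behind Table 4.2 can be reproduced in Python. Note that the slope and intercept are kept at full precision here, so the total rounds to 8.0287 rather than the hand-rounded SSE = 8.0288 reported in the text:

```python
# Reproducing the residual calculations behind Table 4.2 for the ten Butler
# Trucking driving assignments.

miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

b1 = 234 / 3450        # slope from equation (4.6), approx. 0.0678
b0 = 6.7 - b1 * 80     # intercept from equation (4.7), approx. 1.2739

rows = []
for i, (x, y) in enumerate(zip(miles, hours), start=1):
    y_hat = b0 + b1 * x        # predicted travel time for assignment i
    e = y - y_hat              # residual for assignment i
    rows.append((i, x, y, round(y_hat, 4), round(e, 4), round(e * e, 4)))

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(miles, hours))
print(round(sse, 4))  # 8.0287 at full precision; the text reports 8.0288
```

Each tuple in `rows` corresponds to one row of Table 4.2: assignment number, miles, observed time, predicted time, residual, and squared residual.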
TABLE 4.2 PREDICTED TRAVEL TIME IN HOURS AND THE RESIDUALS FOR TEN BUTLER TRUCKING COMPANY DRIVING ASSIGNMENTS

[Figure 4.4: Scatter chart showing the estimated regression line ŷi = 1.2739 + 0.0678xi, with vertical dashed lines marking the residuals e3 (between y3 and ŷ3) and e5 (between y5 and ŷ5). Axes: Miles Traveled - x (40 to 100); Travel Time (hours) - y (0 to 10)]
Note that if we square a residual, we obtain the area of a square with the length of each
side equal to the absolute value of the residual. In other words, the square of the residual
for driving assignment 4 (e4), (−1.5565)² = 2.4227, is the area of a square for which the
length of each side is 1.5565. This relationship between the algebra and geometry of the
least squares method provides interesting insight into what we are achieving by using this
approach to fit a regression model.
In Figure 4.5, a vertical line is drawn from each point in the scatter chart to the linear
regression line. Each of these lines represents the difference between the actual driving
time and the driving time we predict using linear regression for one of the assignments in
our data. The length of each line is equal to the absolute value of the residual for one of the
driving assignments. When we square a residual, the resulting value is equal to the square
that is formed using the vertical dashed line representing the residual in Figure 4.4 as one
side of a square. Thus, when we find the linear regression model that minimizes the sum of
squared errors for the Butler Trucking example, we are positioning the regression line in
the manner that minimizes the sum of the areas of the ten squares in Figure 4.5.
FIGURE 4.6 EXCEL SPREADSHEET WITH SCATTER CHART, ESTIMATED REGRESSION LINE, AND
ESTIMATED REGRESSION EQUATION FOR BUTLER TRUCKING COMPANY
Assignment   Miles   Time
 1           100     9.3
 2            50     4.8
 3           100     8.9
 4           100     6.5
 5            50     4.2
 6            80     6.2
 7            75     7.4
 8            65     6.0
 9            90     7.6
10            90     6.1

[Scatter chart of Travel Time (hours) - y against Miles Traveled - x with the estimated regression line y = 0.0678x + 1.2739]
Equation (4.5) minimizes the sum of the squared deviations between the observed values of the dependent variable yi and the predicted values of the dependent variable ŷi. One alternative is to simply minimize the sum of the deviations between the observed values of the dependent variable yi and the predicted values of the dependent variable ŷi. This is not a viable option, because then negative deviations (observations for which the regression forecast exceeds the actual value) and positive deviations (observations for which the regression forecast is less than the actual value) offset each other. Another alternative is to minimize the sum of the absolute values of the deviations between the observed values of the dependent variable yi and the predicted values of the dependent variable ŷi. It is possible to compute estimated regression parameters that minimize this sum of absolute values of the deviations, but this approach is more difficult than the least squares approach.
SSE = Σi=1..n (yi − ŷi)²   (4.8)
The value of SSE is a measure of the error in using the estimated regression equation to
predict the values of the dependent variable in the sample.
We have already shown the calculations required to compute the sum of squares due to error for the Butler Trucking Company example in Table 4.2. The squared residual or error for each observation in the data is shown in the last column of that table. After computing and squaring the residuals for each driving assignment in the sample, we sum them to obtain SSE = 8.0288. Thus, SSE = 8.0288 measures the error in using the estimated regression equation ŷi = 1.2739 + 0.0678xi to predict travel time for the driving assignments in the sample.
Now suppose we are asked to predict travel time in hours without knowing the miles traveled for a driving assignment. Without knowledge of any related variables, we would use the sample mean ȳ as a predictor of travel time for any given driving assignment. To find ȳ, we divide the sum of the actual driving times yi from Table 4.2 (67) by the number of observations n in the data (10); this yields ȳ = 6.7.
Figure 4.7 provides insight on how well we would predict the values of yi in the Butler Trucking Company example using ȳ = 6.7. From this figure, which again highlights the residuals for driving assignments 3 and 5, we can see that ȳ tends to overpredict travel times for driving assignments that have relatively small values for miles traveled (such as driving assignment 5) and tends to underpredict travel times for driving assignments that have relatively large values for miles traveled (such as driving assignment 3).
In Table 4.3 we show the sum of squared deviations obtained by using the sample mean ȳ = 6.7 to predict the value of travel time in hours for each driving assignment in the sample. For the ith driving assignment in the sample, the difference yi − ȳ provides a measure of the error involved in using ȳ to predict travel time for the ith driving assignment. The corresponding sum of squares, called the total sum of squares, is denoted SST.

[Figure 4.7: Scatter chart showing the horizontal line ȳ = 6.7 and the deviations y3 − ȳ and y5 − ȳ for driving assignments 3 and 5. Axes: Miles Traveled - x (40 to 100); Travel Time (hours) - y (0 to 10)]

4.3 Assessing the Fit of the Simple Linear Regression Model

TABLE 4.3 CALCULATIONS FOR THE SUM OF SQUARES TOTAL FOR THE BUTLER TRUCKING SIMPLE LINEAR REGRESSION
SST = Σi=1..n (yi − ȳ)²   (4.9)
The sum at the bottom of the last column in Table 4.3 is the total sum of squares for Butler Trucking Company: SST = 23.9.
Now we put it all together. In Figure 4.8 we show the estimated regression line ŷi = 1.2739 + 0.0678xi and the line corresponding to ȳ = 6.7. Note that the points cluster more closely around the estimated regression line than they do about the horizontal line ȳ = 6.7. For example, for the third driving assignment in the sample, we see
[Figure 4.8: Deviations about the estimated regression line y = 0.0678x + 1.2739 and the line ȳ for driving assignment 3, showing y3 − ŷ3, ŷ3 − ȳ, and y3 − ȳ. Axes: Miles Traveled - x (40 to 100); Travel Time (hours) - y (0 to 10)]
that the error is much larger when ȳ = 6.7 is used to predict y3 than when ŷi = 1.2739 + 0.0678xi is used. We can think of SST as a measure of how well the observations cluster about the ȳ line and SSE as a measure of how well the observations cluster about the ŷ line.
To measure how much the ŷ values on the estimated regression line deviate from ȳ, another sum of squares is computed. This sum of squares, called the sum of squares due to regression, is denoted SSR.

SSR = Σi=1..n (ŷi − ȳ)²   (4.10)
From the preceding discussion, we should expect that SST, SSR, and SSE are related.
Indeed, the relationship among these three sums of squares is:
SST = SSR + SSE   (4.11)

where

SST = total sum of squares
SSR = sum of squares due to regression
SSE = sum of squares due to error
Equation (4.11) shows that the total sum of squares can be partitioned into two components, the sum of squares due to regression and the sum of squares due to error. Hence, if the values of any two of these sums of squares are known, the third sum of squares can be computed easily. For instance, in the Butler Trucking Company example, we already know that SSE = 8.0288 and SST = 23.9; therefore, solving for SSR in equation (4.11), we find that the sum of squares due to regression is

SSR = SST − SSE = 23.9 − 8.0288 = 15.8712
For the Butler Trucking Company example, the value of the coefficient of determination is

r² = SSR/SST = 15.8712/23.9 = 0.6641

[Note: In simple linear regression, the coefficient of determination r² is the square of the correlation between the independent and dependent variables.]

When we express the coefficient of determination as a percentage, r² can be interpreted as the percentage of the total sum of squares that can be explained by using the estimated regression equation. For Butler Trucking Company, we can conclude that 66.41 percent of the total sum of squares can be explained by using the estimated regression equation ŷi = 1.2739 + 0.0678xi to predict travel time. In other words, 66.41 percent of the variability in the values of travel time in our sample can be explained by the linear relationship between the miles traveled and travel time.
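The sums of squares and r² can be assembled from the sample data in a few lines. The slope and intercept are kept at full precision here, so SSE and SSR differ from the hand-rounded text values (8.0288 and 15.8712) in the last decimal place, while r² still rounds to 0.6641:

```python
# Assembling SST, SSE, SSR, and the coefficient of determination r^2 for the
# Butler Trucking example from the raw sample data.

miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

n = len(miles)
x_bar, y_bar = sum(miles) / n, sum(hours) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, hours)) \
     / sum((x - x_bar) ** 2 for x in miles)
b0 = y_bar - b1 * x_bar

sst = sum((y - y_bar) ** 2 for y in hours)                          # (4.9)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(miles, hours))   # (4.8)
ssr = sst - sse                                                     # (4.11)
r2 = ssr / sst

print(round(sst, 1), round(sse, 4), round(ssr, 4), round(r2, 4))
# 23.9 8.0287 15.8713 0.6641
```

Computing SSR as SST − SSE uses the partition in equation (4.11); summing (ŷi − ȳ)² directly, as in equation (4.10), gives the same value.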
FIGURE 4.9 EXCEL SPREADSHEET WITH ORIGINAL DATA, SCATTER CHART, ESTIMATED
REGRESSION LINE, ESTIMATED REGRESSION EQUATION, AND COEFFICIENT OF
DETERMINATION r² FOR BUTLER TRUCKING COMPANY
Assignment   Miles   Time
 1           100     9.3
 2            50     4.8
 3           100     8.9
 4           100     6.5
 5            50     4.2
 6            80     6.2
 7            75     7.4
 8            65     6.0
 9            90     7.6
10            90     6.1

[Scatter chart of Travel Time (hours) - y against Miles Traveled - x with the estimated regression line y = 0.0678x + 1.2739 and R² = 0.6641]
As a practical matter, for typical data in the social and behavioral sciences, values of r² as low as 0.25 are often considered useful. For data in the physical and life sciences, r² values of 0.60 or greater are often found; in fact, in some cases, r² values greater than 0.90 can be found. In business applications, r² values vary greatly, depending on the unique characteristics of each application.