
124 Chapter 4 Linear Regression

ANALYTICS IN ACTION

ALLIANCE DATA SYSTEMS*
DALLAS, TEXAS

Alliance Data Systems (ADS) provides transaction processing, credit services, and marketing services for clients in the rapidly growing customer relationship management (CRM) industry. ADS clients are concentrated in four industries: retail, petroleum/convenience stores, utilities, and transportation. In 1983, Alliance began offering end-to-end credit processing services to the retail, petroleum, and casual dining industries; today the company employs more than 6500 employees who provide services to clients around the world. Operating more than 140,000 point-of-sale terminals in the United States alone, ADS processes in excess of 2.5 billion transactions annually. The company ranks second in the United States in private label credit services by representing 49 private label programs with nearly 72 million cardholders. In 2001, ADS made an initial public offering and is now listed on the New York Stock Exchange.

[Photo: Alliance Data Systems analysts discuss use of a regression model to predict sales for a direct marketing campaign. © Courtesy of Alliance Data Systems.]

As one of its marketing services, ADS designs direct mail campaigns and promotions. With its database containing information on the spending habits of more than 100 million consumers, ADS can target consumers who are the most likely to benefit from a direct mail promotion. The Analytical Development Group uses regression analysis to build models that measure and predict the responsiveness of consumers to direct marketing campaigns. Some regression models predict the probability of purchase for individuals receiving a promotion, and others predict the amount spent by consumers who make purchases.

For one campaign, a retail store chain wanted to attract new customers. To predict the effect of the campaign, ADS analysts selected a sample from the consumer database, sent the sampled individuals promotional materials, and then collected transaction data on the consumers' response. Sample data were collected on the amount of purchase made by the consumers responding to the campaign, as well as on a variety of consumer-specific variables thought to be useful in predicting sales. The consumer-specific variable that contributed most to predicting the amount purchased was the total amount of credit purchases at related stores over the past 39 months. ADS analysts developed an estimated regression equation relating the amount of purchase to the amount spent at related stores:

ŷ = 26.7 + 0.00205x

where

ŷ = predicted amount of purchase
x = amount spent at related stores

Using this equation, we could predict that someone spending $10,000 over the past 39 months at related stores would spend $47.20 when responding to the direct mail promotion. In this chapter, you will learn how to develop this type of estimated regression equation. The final model developed by ADS analysts also included several other variables that increased the predictive power of the preceding equation. Among these variables was the absence or presence of a bank credit card, estimated income, and the average amount spent per trip at a selected store. In this chapter, we will also learn how such additional variables can be incorporated into a multiple regression model.

*The authors are indebted to Philip Clemance, Director of Analytical Development at Alliance Data Systems, for providing this Analytics in Action.
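The arithmetic behind the $47.20 figure is a one-liner. A quick sketch (the function and variable names are ours, not ADS's):

```python
# Estimated regression equation from the ADS example: y-hat = 26.7 + 0.00205x
def predicted_purchase(spend_at_related_stores):
    """Predicted direct-mail purchase amount, in dollars."""
    return 26.7 + 0.00205 * spend_at_related_stores

print(predicted_purchase(10_000))   # about 47.20
```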

Managerial decisions are often based on the relationship between two or more variables. For example, after considering the relationship between advertising expenditures and sales, a marketing manager might attempt to predict sales for a given level of advertising expenditures. In another case, a public utility might use the relationship between the daily high temperature and the demand for electricity to predict electricity usage on the basis of next month's anticipated daily high temperatures. Sometimes a manager will rely on intuition to judge how two variables are related. However, if data can be obtained, a statistical procedure called regression analysis can be used to develop an equation showing how the variables are related.

[Margin note: The statistical methods used in studying the relationship between two variables were first employed by Sir Francis Galton (1822–1911). Galton found that the heights of the sons of unusually tall or unusually short fathers tend to move, or "regress," toward the average height of the male population. Karl Pearson (1857–1936), a disciple of Galton, later confirmed this finding over a sample of 1078 pairs of fathers and sons.]

In regression terminology, the variable being predicted is called the dependent variable, or response, and the variables being used to predict the value of the dependent variable are called the independent variables, or predictor variables. For example, in analyzing the effect of advertising expenditures on sales, a marketing manager's desire to predict sales would suggest making sales the dependent variable. Advertising expenditure would be the independent variable used to help predict sales.
A regression analysis involving one independent variable and one dependent variable
is referred to as a simple regression, and in statistical notation y denotes the dependent
variable and x denotes the independent variable. A regression analysis for which any one
unit change in the independent variable, x, is assumed to result in the same change in the
dependent variable, y, is referred to as a linear regression. Regression analysis involving
two or more independent variables is called multiple regression; multiple regression and
cases involving curvilinear relationships are covered in later sections of this chapter.

4.1 The Simple Linear Regression Model


Butler Trucking Company is an independent trucking company in southern California. A
major portion of Butler’s business involves deliveries throughout its local area. To develop
better work schedules, the managers want to estimate the total daily travel times for their
drivers. The managers believe that the total daily travel times (denoted by y) are closely
related to the number of miles traveled in making the daily deliveries (denoted by x). Using
regression analysis, we can develop an equation showing how the dependent variable y is
related to the independent variable x.

Regression Model and Regression Equation


In the Butler Trucking Company example, the population consists of all the driving assign-
ments that can be made by the company. For every driving assignment in the population,
there is a value of x (miles traveled) and a corresponding value of y (travel time in hours).
The equation that describes how y is related to x and an error term is called the regression
model. The regression model used in simple linear regression follows:

SIMPLE LINEAR REGRESSION MODEL

y = β0 + β1x + ε        (4.1)

β0 and β1 are characteristics of the population and so are referred to as the parameters of the model, and ε (the Greek letter epsilon) is a random variable referred to as the error term. The error term accounts for the variability in y that cannot be explained by the linear relationship between x and y.

[Margin note: A random variable is the outcome of a random experiment (such as the drawing of a random sample) and so represents an uncertain outcome.]
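The role of the error term in equation (4.1) can be illustrated by simulation. A minimal sketch (the parameter values and the error standard deviation below are illustrative assumptions of ours, not values from the text):

```python
import random

# Illustrative population parameters (assumed for demonstration only)
beta0, beta1, sigma = 1.27, 0.0678, 0.9

random.seed(42)

def simulate_travel_time(miles):
    """Draw one y from the model y = beta0 + beta1*x + epsilon."""
    epsilon = random.gauss(0, sigma)   # error term with mean 0
    return beta0 + beta1 * miles + epsilon

# Two assignments with the same mileage get different travel times
# because of the error term, but both vary around E(y|x).
for x in (50, 75, 100):
    print(f"miles={x:3d}  simulated travel time={simulate_travel_time(x):.2f} hours")
```

Averaging many simulated draws at a fixed x recovers the conditional mean E(y|x) = β0 + β1x, which is exactly what the regression equation (4.2) describes.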

FIGURE 4.1  POSSIBLE REGRESSION LINES IN SIMPLE LINEAR REGRESSION

[Three panels, each plotting E(y|x) against x with the regression line shown. Panel A: Positive Linear Relationship (upward-sloping line). Panel B: Negative Linear Relationship (downward-sloping line). Panel C: No Relationship (horizontal line).]

The equation that describes how the expected value of y, denoted E(y), is related to x is called the regression equation. The regression equation for simple linear regression follows:

E(y|x) = β0 + β1x        (4.2)

where E(y|x) is the expected value of y for a given value of x. The graph of the simple linear regression equation is a straight line; β0 is the y-intercept of the regression line, β1 is the slope, and E(y|x) is the mean or expected value of y for a given value of x.

[Margin note: E(y|x) is often referred to as the conditional mean of y given the value of x.]
Examples of possible regression lines are shown in Figure 4.1. The regression line
in Panel A shows that the mean value of y is related positively to x, with larger values of
E(y|x) associated with larger values of x. In Panel B, the mean value of y is related nega-
tively to x, with smaller values of E(y|x) associated with larger values of x. In Panel C, the
mean value of y is not related to x; that is, E(y|x) is the same for every value of x.

Estimated Regression Equation


If the values of the population parameters β0 and β1 were known, we could use equation (4.2) to compute the mean value of y for a given value of x. In practice, the parameter values are not known and must be estimated using sample data. Sample statistics (denoted b0 and b1) are computed as estimates of the population parameters β0 and β1. Substituting the values of the sample statistics b0 and b1 for β0 and β1 in the regression equation, we obtain the estimated regression equation. The estimated regression equation for simple linear regression follows.

SIMPLE LINEAR REGRESSION ESTIMATED REGRESSION EQUATION


ŷ = b0 + b1x        (4.3)

Figure 4.2 provides a summary of the estimation process for simple linear regression.
The graph of the estimated simple linear regression equation is called the estimated
regression line; b0 is the estimated y-intercept, and b1 is the estimated slope. In the next
section, we show how the least squares method can be used to compute the values of b0 and
b1 in the estimated regression equation.
In general, ŷ is the point estimator of E(y|x), the mean value of y for a given value of x. Thus, to estimate the mean or expected value of travel time for a driving assignment of 75 miles, Butler Trucking would substitute the value of 75 for x in equation (4.3). In some cases, however, Butler Trucking may be more interested in predicting travel time for an upcoming driving assignment of a particular length. For example, suppose Butler Trucking would like to predict travel time for a new 75-mile driving assignment the company is considering. As it turns out, the best predictor of y for a given value of x is also provided by ŷ. Thus, to predict travel time for a new 75-mile driving assignment, Butler Trucking would also substitute the value of 75 for x in equation (4.3). The value of ŷ provides both a point estimate of E(y|x) for a given value of x and a prediction of an individual value of y for a given value of x. In most cases, we will refer to ŷ simply as the predicted value of y.

FIGURE 4.2  THE ESTIMATION PROCESS IN SIMPLE LINEAR REGRESSION

[Diagram summarizing how sample data are used to compute the sample statistics b0 and b1 that estimate the unknown parameters β0 and β1 of the regression equation E(y|x).]

[Margin note: The estimation of β0 and β1 is a statistical process much like the estimation of the population mean, μ, discussed in Chapter 2. β0 and β1 are the unknown parameters of interest, and b0 and b1 are the sample statistics used to estimate the parameters.]

[Margin note: A point estimator is a single value used as an estimate of the corresponding population parameter.]

4.2 Least Squares Method


The least squares method is a procedure for using sample data to find the estimated
regression equation. To illustrate the least squares method, suppose data were collected
from a sample of ten Butler Trucking Company driving assignments. For the ith observa-
tion or driving assignment in the sample, xi is the miles traveled and yi is the travel time
(in hours). The values of xi and yi for the ten driving assignments in the sample are summarized in Table 4.1. We see that driving assignment 1, with x1 = 100 and y1 = 9.3, is a driving assignment of 100 miles and a travel time of 9.3 hours. Driving assignment 2, with x2 = 50 and y2 = 4.8, is a driving assignment of 50 miles and a travel time of 4.8 hours.
The shortest travel time is for driving assignment 5, which requires 50 miles with a travel
time of 4.2 hours.
Figure 4.3 is a scatter chart of the data in Table 4.1. Miles traveled is shown on the
horizontal axis, and travel time (in hours) is shown on the vertical axis. Scatter charts for
regression analysis are constructed with the independent variable x on the horizontal axis
and the dependent variable y on the vertical axis. The scatter chart enables us to observe

TABLE 4.1  MILES TRAVELED AND TRAVEL TIME (IN HOURS) FOR TEN BUTLER TRUCKING COMPANY DRIVING ASSIGNMENTS

[WEB file: Butler]

Driving         x = Miles    y = Travel
Assignment i    Traveled     Time (hours)
 1              100           9.3
 2               50           4.8
 3              100           8.9
 4              100           6.5
 5               50           4.2
 6               80           6.2
 7               75           7.4
 8               65           6.0
 9               90           7.6
10               90           6.1

FIGURE 4.3  SCATTER CHART OF MILES TRAVELED AND TRAVEL TIME IN HOURS FOR SAMPLE OF TEN BUTLER TRUCKING COMPANY DRIVING ASSIGNMENTS

[Scatter chart with Miles Traveled (x) on the horizontal axis, ranging from 40 to 100, and Travel Time in hours (y) on the vertical axis, ranging from 0 to 10.]

the data graphically and to draw preliminary conclusions about the possible relationship
between the variables.
What preliminary conclusions can be drawn from Figure 4.3? Longer travel times
appear to coincide with more miles traveled. In addition, for these data, the relationship
between the travel time and miles traveled appears to be approximated by a straight line;
indeed, a positive linear relationship is indicated between x and y. We therefore choose
the simple linear regression model to represent this relationship. Given that choice, our
next task is to use the sample data in Table 4.1 to determine the values of b0 and b1 in the
estimated simple linear regression equation. For the ith driving assignment, the estimated
regression equation provides

ŷi = b0 + b1xi        (4.4)

where

ŷi = predicted travel time (in hours) for the ith driving assignment
b0 = the y-intercept of the estimated regression line
b1 = the slope of the estimated regression line
xi = miles traveled for the ith driving assignment
With yi denoting the observed (actual) travel time for driving assignment i and ŷi in equation (4.4) representing the predicted travel time for driving assignment i, every driving assignment in the sample will have an observed travel time yi and a predicted travel time ŷi. For the estimated regression line to provide a good fit to the data, the differences between the observed travel times yi and the predicted travel times ŷi should be small.

The least squares method uses the sample data to provide the values of b0 and b1 that minimize the sum of the squares of the deviations between the observed values of the dependent variable yi and the predicted values of the dependent variable ŷi. The criterion for the least squares method is given by equation (4.5).

LEAST SQUARES METHOD EQUATION

min Σ(yi − ŷi)²        (4.5)

where

yi = observed value of the dependent variable for the ith observation
ŷi = predicted value of the dependent variable for the ith observation
n = total number of observations

and the sum is taken over i = 1, …, n.

This is known as the least squares method for estimating the regression equation.

The error we make using the regression model to estimate the mean value of the dependent variable for the ith observation is often written as ei = yi − ŷi and is referred to as the ith residual. Using this notation, equation (4.5) can be rewritten as

min Σ ei²

and we say that we are finding the regression that minimizes the sum of squared errors.
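The least squares criterion can be made concrete with a short sketch. A minimal illustration using the Table 4.1 data (the alternative candidate coefficients are our own assumptions, chosen only to show that they do worse on the criterion):

```python
# Butler Trucking sample data (Table 4.1)
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

def sum_of_squared_errors(b0, b1):
    """Evaluate the least squares criterion of equation (4.5) for a candidate line."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(miles, hours))

# The least squares estimates reported in the text should beat any
# nearby alternative line on this criterion.
best = sum_of_squared_errors(1.2739, 0.0678)
worse = sum_of_squared_errors(1.5, 0.0678)   # same slope, shifted intercept
print(round(best, 4), round(worse, 4))
```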

Least Squares Estimates of the Regression Parameters


Differential calculus can be used to show that the values of b0 and b1 that minimize expres-
sion (4.5) are found by using equations (4.6) and (4.7).

SLOPE EQUATION

b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²        (4.6)

y-INTERCEPT EQUATION

b0 = ȳ − b1x̄        (4.7)

where

xi = value of the independent variable for the ith observation
yi = value of the dependent variable for the ith observation
x̄ = mean value for the independent variable
ȳ = mean value for the dependent variable
n = total number of observations

For the Butler Trucking Company data in Table 4.1, these equations yield an estimated slope of b1 = 0.0678 and a y-intercept of b0 = 1.2739. Thus, our estimated simple linear regression model is ŷ = 1.2739 + 0.0678x. Although equations (4.6) and (4.7) are not difficult to use, computer software such as Excel or XLMiner is generally used to calculate b1 and b0.
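Equations (4.6) and (4.7) are also straightforward to apply directly. A minimal sketch using only the Table 4.1 data and the Python standard library (variable names are ours):

```python
# Butler Trucking sample data (Table 4.1)
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]   # x
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]  # y

n = len(miles)
x_bar = sum(miles) / n   # mean of x: 80.0
y_bar = sum(hours) / n   # mean of y: 6.7

# Slope, equation (4.6)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, hours)) / \
     sum((x - x_bar) ** 2 for x in miles)

# Intercept, equation (4.7)
b0 = y_bar - b1 * x_bar

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}")   # b1 = 0.0678, b0 = 1.2739

# Predicted mean travel time for a 75-mile driving assignment
y_hat_75 = b0 + b1 * 75
```

The predicted value for 75 miles, about 6.3609 hours, matches the fitted value for assignment 7 in Table 4.2.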
We interpret b0 and b1 as we would the y-intercept and slope of any straight line. The slope b1
is the estimated change in the mean of the dependent variable y that is associated with a one unit
increase in the independent variable x. For the Butler Trucking Company model, we therefore es-
timate that, if the length of a driving assignment were 1 unit (1 mile) longer, the mean travel time
for that driving assignment would be 0.0678 units (0.0678 hours, or approximately 4 minutes)
longer. The y-intercept b0 is the estimated value of the dependent variable y when the independent
variable x is equal to 0. For the Butler Trucking Company model, we estimate that if the driving
distance for a driving assignment was 0 units (0 miles), the mean travel time would be 1.2739
units (1.2739 hours, or approximately 76 minutes). Can we find a plausible explanation for this?
Perhaps the 76 minutes represent the time needed to prepare, load, and unload the vehicle, which
is required for all trips regardless of distance and which therefore does not depend on the distance
traveled. However, we must use caution: To estimate the travel time for a driving distance of
0 miles, we have to extend the relationship we have found with simple linear regression well
beyond the range of values for driving distance in our sample. Those sample values range from
50 to 100 miles, and this range represents the only values of driving distance for which we have
empirical evidence of the relationship between driving distance and our estimated travel time.
It is important to note that the regression model is valid only over the experimental region, which is the range of values of the independent variables in the data used to estimate the model. Prediction of the value of the dependent variable outside the experimental region is called extrapolation. Because we have no empirical evidence that the relationship we have found holds true for values of x outside the range of values of x in the data used to estimate the relationship, extrapolation is risky and should be avoided if possible. For Butler Trucking, this means that any prediction of travel time for a driving distance less than 50 miles or greater than 100 miles is not reliable, and so for this model the estimate of b0 is meaningless. However, if the experimental region for a regression problem includes zero, the y-intercept will have a meaningful interpretation.

[Margin note: The estimated value of the y-intercept often results from extrapolation.]
We can now also use this model and our known values for miles traveled for a driving assignment (x) to estimate mean travel time in hours. For example, the first driving assignment in Table 4.1 has a value for miles traveled of x = 100. We estimate the mean travel time in hours for this driving assignment to be

ŷ1 = 1.2739 + 0.0678(100) = 8.0539

Since the travel time for this driving assignment was 9.3 hours, this regression estimate would have resulted in a residual of

e1 = y1 − ŷ1 = 9.3 − 8.0539 = 1.2461

The simple linear regression model underestimated travel time for this driving assignment by 1.2461 hours (approximately 75 minutes). Table 4.2 shows the predicted mean travel times, the residuals, and the squared residuals for all ten driving assignments in the sample data.

TABLE 4.2  PREDICTED TRAVEL TIME IN HOURS AND THE RESIDUALS FOR TEN BUTLER TRUCKING COMPANY DRIVING ASSIGNMENTS

Driving         x = Miles   y = Travel
Assignment i    Traveled    Time (hours)   ŷi = b0 + b1xi   ei = yi − ŷi    ei²
 1              100          9.3            8.0565            1.2435        1.5463
 2               50          4.8            4.6652            0.1348        0.0182
 3              100          8.9            8.0565            0.8435        0.7115
 4              100          6.5            8.0565           −1.5565        2.4227
 5               50          4.2            4.6652           −0.4652        0.2164
 6               80          6.2            6.7000           −0.5000        0.2500
 7               75          7.4            6.3609            1.0391        1.0797
 8               65          6.0            5.6826            0.3174        0.1007
 9               90          7.6            7.3783            0.2217        0.0492
10               90          6.1            7.3783           −1.2783        1.6341
Totals                      67.0           67.0000            0.0000        8.0288

Note in Table 4.2 that:

• The sum of the predicted values ŷi is equal to the sum of the values of the dependent variable y.
• The sum of the residuals ei is 0.
• The sum of the squared residuals ei² has been minimized.

These three points will always be true for a simple linear regression that is determined by equations (4.6) and (4.7). Figure 4.4 shows the simple linear regression line ŷi = 1.2739 + 0.0678xi superimposed on the scatter chart for the Butler Trucking Company data in Table 4.1. This figure, which also highlights the residuals for driving assignment 3 (e3) and driving assignment 5 (e5), shows that the regression model underpredicts travel time for some driving assignments (such as driving assignment 3) and overpredicts travel time for others (such as driving assignment 5), but in general appears to fit the data relatively well.
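The three properties above are easy to verify numerically. A minimal sketch (coefficients are recomputed from equations (4.6) and (4.7) at full precision, so the sum of squared residuals may differ from Table 4.2's total in the fourth decimal, since the table rounds each entry before summing):

```python
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

n = len(miles)
x_bar, y_bar = sum(miles) / n, sum(hours) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, hours)) / \
     sum((x - x_bar) ** 2 for x in miles)
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * x for x in miles]
residuals = [y - p for y, p in zip(hours, predicted)]

sum_y = sum(hours)                     # 67.0
sum_y_hat = sum(predicted)             # equals sum_y
sum_e = sum(residuals)                 # 0 (up to floating-point noise)
sse = sum(e ** 2 for e in residuals)   # about 8.0287

print(round(sum_y_hat, 4), round(sum_e, 4), round(sse, 4))
```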

FIGURE 4.4  SCATTER CHART OF MILES TRAVELED AND TRAVEL TIME IN HOURS FOR BUTLER TRUCKING COMPANY DRIVING ASSIGNMENTS WITH REGRESSION LINE SUPERIMPOSED

[Scatter chart with the estimated regression line ŷi = 1.2739 + 0.0678xi superimposed; the residuals e3 (above the line) and e5 (below the line) are highlighted. Miles Traveled (x) on the horizontal axis, Travel Time in hours (y) on the vertical axis.]

FIGURE 4.5  A GEOMETRIC INTERPRETATION OF THE LEAST SQUARES METHOD APPLIED TO THE BUTLER TRUCKING COMPANY EXAMPLE

[Scatter chart with the estimated regression line ŷi = 1.2739 + 0.0678xi; each residual is drawn as one side of a square whose area equals the squared residual. Miles Traveled (x) on the horizontal axis, Travel Time in hours (y) on the vertical axis.]

Note that if we square a residual, we obtain the area of a square with the length of each side equal to the absolute value of the residual. In other words, the square of the residual for driving assignment 4 (e4), (−1.5565)² = 2.4227, is the area of a square for which the length of each side is 1.5565. This relationship between the algebra and geometry of the least squares method provides interesting insight into what we are achieving by using this approach to fit a regression model.
In Figure 4.5, a vertical line is drawn from each point in the scatter chart to the linear
regression line. Each of these lines represents the difference between the actual driving
time and the driving time we predict using linear regression for one of the assignments in
our data. The length of each line is equal to the absolute value of the residual for one of the
driving assignments. When we square a residual, the resulting value is equal to the square
that is formed using the vertical dashed line representing the residual in Figure 4.4 as one
side of a square. Thus, when we find the linear regression model that minimizes the sum of
squared errors for the Butler Trucking example, we are positioning the regression line in
the manner that minimizes the sum of the areas of the ten squares in Figure 4.5.

Using Excel's Chart Tools to Compute the Estimated Regression Equation

We can use Excel's chart tools to compute the estimated regression equation on a scatter chart of the Butler Trucking Company data in Table 4.1. After constructing a scatter chart (as shown in Figure 4.3) with Excel's chart tools, the following steps describe how to compute the estimated regression equation using the data in the worksheet:

Step 1. Right-click on any data point in the scatter chart and select Add Trendline . . .
Step 2. When the Format Trendline task pane appears:
        Select Linear in the Trendline Options area
        Select Display Equation on chart in the Trendline Options area

[Margin note: In versions of Excel prior to 2013, Step 1 will open the Format Trendline dialog box where you can select Linear under Trend/Regression Type.]

The worksheet displayed in Figure 4.6 shows the original data, scatter chart, estimated regression line, and estimated regression equation. Note that Excel uses y instead of ŷ to denote the predicted value of the dependent variable and puts the regression equation into slope-intercept form, whereas we use the intercept-slope form that is standard in statistics.

FIGURE 4.6  EXCEL SPREADSHEET WITH SCATTER CHART, ESTIMATED REGRESSION LINE, AND ESTIMATED REGRESSION EQUATION FOR BUTLER TRUCKING COMPANY

[Worksheet screenshot: columns Assignment, Miles, and Time holding the Table 4.1 data, alongside the scatter chart with the fitted trendline and the displayed equation y = 0.0678x + 1.2739. Miles Traveled (x) on the horizontal axis, Travel Time in hours (y) on the vertical axis.]

NOTES AND COMMENTS

Equation (4.5) minimizes the sum of the squared deviations between the observed values of the dependent variable yi and the predicted values of the dependent variable ŷi. One alternative is to simply minimize the sum of the deviations between the observed values of the dependent variable yi and the predicted values of the dependent variable ŷi. This is not a viable option, because then negative deviations (observations for which the regression forecast exceeds the actual value) and positive deviations (observations for which the regression forecast is less than the actual value) offset each other. Another alternative is to minimize the sum of the absolute values of the deviations between the observed values of the dependent variable yi and the predicted values of the dependent variable ŷi. It is possible to compute estimated regression parameters that minimize this sum of absolute values of the deviations, but this approach is more difficult than the least squares approach.

4.3 Assessing the Fit of the Simple Linear Regression Model

For the Butler Trucking Company example, we developed the estimated regression equation ŷi = 1.2739 + 0.0678xi to approximate the linear relationship between the miles traveled x and travel time in hours y. We now wish to assess how well the estimated regression equation fits the sample data. We begin by developing the intermediate calculations, referred to as sums of squares.

The Sums of Squares


Recall that we found our estimated regression equation for the Butler Trucking Company
example by minimizing the sum of squares of the residuals. This quantity, also known as
the sum of squares due to error, is denoted by SSE.

SUM OF SQUARES DUE TO ERROR

SSE = Σ(yi − ŷi)²        (4.8)

The value of SSE is a measure of the error in using the estimated regression equation to
predict the values of the dependent variable in the sample.
We have already shown the calculations required to compute the sum of squares due to error for the Butler Trucking Company example in Table 4.2. The squared residual or error for each observation in the data is shown in the last column of that table. After computing and squaring the residuals for each driving assignment in the sample, we sum them to obtain SSE = 8.0288. Thus, SSE = 8.0288 measures the error in using the estimated regression equation ŷi = 1.2739 + 0.0678xi to predict travel time for the driving assignments in the sample.

Now suppose we are asked to predict travel time in hours without knowing the miles traveled for a driving assignment. Without knowledge of any related variables, we would use the sample mean ȳ as a predictor of travel time for any given driving assignment. To find ȳ, we divide the sum of the actual driving times yi from Table 4.2 (67) by the number of observations n in the data (10); this yields ȳ = 6.7.

Figure 4.7 provides insight on how well we would predict the values of yi in the Butler Trucking Company example using ȳ = 6.7. From this figure, which again highlights the residuals for driving assignments 3 and 5, we can see that ȳ tends to overpredict travel times for driving assignments that have relatively small values for miles traveled (such as driving assignment 5) and tends to underpredict travel times for driving assignments that have relatively large values for miles traveled (such as driving assignment 3).
FIGURE 4.7  THE SAMPLE MEAN ȳ AS A PREDICTOR OF TRAVEL TIME IN HOURS FOR BUTLER TRUCKING COMPANY

[Scatter chart with the horizontal line ȳ = 6.7; the deviations y3 − ȳ (above the line) and y5 − ȳ (below the line) are highlighted. Miles Traveled (x) on the horizontal axis, Travel Time in hours (y) on the vertical axis.]

In Table 4.3 we show the sum of squared deviations obtained by using the sample mean ȳ = 6.7 to predict the value of travel time in hours for each driving assignment in the sample. For the ith driving assignment in the sample, the difference yi − ȳ provides a

measure of the error involved in using ȳ to predict travel time for the ith driving assignment. The corresponding sum of squares, called the total sum of squares, is denoted SST.

TABLE 4.3  CALCULATIONS FOR THE SUM OF SQUARES TOTAL FOR THE BUTLER TRUCKING SIMPLE LINEAR REGRESSION

Driving         x = Miles   y = Travel
Assignment i    Traveled    Time (hours)   yi − ȳ     (yi − ȳ)²
 1              100          9.3            2.6        6.76
 2               50          4.8           −1.9        3.61
 3              100          8.9            2.2        4.84
 4              100          6.5           −0.2        0.04
 5               50          4.2           −2.5        6.25
 6               80          6.2           −0.5        0.25
 7               75          7.4            0.7        0.49
 8               65          6.0           −0.7        0.49
 9               90          7.6            0.9        0.81
10               90          6.1           −0.6        0.36
Totals                      67.0            0         23.9

TOTAL SUM OF SQUARES, SST

SST = Σ(yi − ȳ)²        (4.9)

The sum at the bottom of the last column in Table 4.3 is the total sum of squares for Butler Trucking Company: SST = 23.9.

Now we put it all together. In Figure 4.8 we show the estimated regression line ŷi = 1.2739 + 0.0678xi and the line corresponding to ȳ = 6.7. Note that the points cluster more closely around the estimated regression line than they do about the horizontal line ȳ = 6.7. For example, for the third driving assignment in the sample, we see

FIGURE 4.8  DEVIATIONS ABOUT THE ESTIMATED REGRESSION LINE AND THE LINE y = ȳ FOR THE THIRD BUTLER TRUCKING COMPANY DRIVING ASSIGNMENT

[Scatter chart showing, for the third driving assignment, the deviations y3 − ŷ3, y3 − ȳ, and ŷ3 − ȳ about the estimated regression line y = 0.0678x + 1.2739 and the horizontal line ȳ = 6.7. Miles Traveled (x) on the horizontal axis, Travel Time in hours (y) on the vertical axis.]

that the error is much larger when ȳ = 6.7 is used to predict y3 than when ŷi = 1.2739 + 0.0678xi is used. We can think of SST as a measure of how well the observations cluster about the ȳ line and SSE as a measure of how well the observations cluster about the ŷ line.

To measure how much the ŷ values on the estimated regression line deviate from ȳ, another sum of squares is computed. This sum of squares, called the sum of squares due to regression, is denoted SSR.

SUM OF SQUARES DUE TO REGRESSION, SSR

    SSR = Σi=1..n (ŷi − ȳ)²        (4.10)

From the preceding discussion, we should expect that SST, SSR, and SSE are related.
Indeed, the relationship among these three sums of squares is:

    SST = SSR + SSE        (4.11)

where
    SST = total sum of squares
    SSR = sum of squares due to regression
    SSE = sum of squares due to error

Equation (4.11) shows that the total sum of squares can be partitioned into two compo-
nents, the sum of squares due to regression and the sum of squares due to error. Hence, if
the values of any two of these sums of squares are known, the third can be computed
easily. For instance, in the Butler Trucking Company example, we already know that
SSE = 8.0288 and SST = 23.9; therefore, solving for SSR in equation (4.11), we find
that the sum of squares due to regression is

    SSR = SST − SSE = 23.9 − 8.0288 = 15.8712
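The partition in equation (4.11) can also be verified numerically. Below is a minimal Python sketch using the rounded coefficients 1.2739 and 0.0678 from the text, so the identity holds only up to rounding error in those coefficients:

```python
# Verify SST = SSR + SSE for Butler Trucking using the estimated regression
# equation y-hat = 1.2739 + 0.0678x (coefficients rounded as in the text).
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

b0, b1 = 1.2739, 0.0678
y_bar = sum(hours) / len(hours)
y_hat = [b0 + b1 * x for x in miles]

sse = sum((y - yh) ** 2 for y, yh in zip(hours, y_hat))  # sum of squares due to error
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)             # sum of squares due to regression, eq. (4.10)
sst = sum((y - y_bar) ** 2 for y in hours)               # total sum of squares, eq. (4.9)

# SSE + SSR matches SST up to the rounding of b0 and b1
print(round(sse, 2), round(ssr, 2), round(sst, 2))
```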

The Coefficient of Determination

Now let us see how the three sums of squares, SST, SSR, and SSE, can be used to provide
a measure of the goodness of fit for the estimated regression equation. The estimated re-
gression equation would provide a perfect fit if every value of the dependent variable yi
happened to lie on the estimated regression line. In this case, yi − ŷi would be zero for each
observation, resulting in SSE = 0. Because SST = SSR + SSE, we see that for a perfect
fit SSR must equal SST, and the ratio (SSR/SST) must equal one. Poorer fits will result in
larger values for SSE. Solving for SSE in equation (4.11), we see that SSE = SST − SSR.
Hence, the largest value for SSE (and hence the poorest fit) occurs when SSR = 0 and
SSE = SST. The ratio SSR/SST, which will take values between zero and one, is used to
evaluate the goodness of fit for the estimated regression equation. This ratio is called the
coefficient of determination and is denoted by r².

In simple regression, r² is often referred to as the simple coefficient of determination.

COEFFICIENT OF DETERMINATION

    r² = SSR/SST        (4.12)

For the Butler Trucking Company example, the value of the coefficient of determination is

    r² = SSR/SST = 15.8712/23.9 = 0.6641

The coefficient of determination r² is the square of the correlation between the independent
and dependent variables.

When we express the coefficient of determination as a percentage, r² can be interpreted
as the percentage of the total sum of squares that can be explained by using the estimated
regression equation. For Butler Trucking Company, we can conclude that 66.41 percent
of the total sum of squares can be explained by using the estimated regression equation
ŷi = 1.2739 + 0.0678xi to predict travel time. In other words, 66.41 percent of the
variability in the values of travel time in our sample can be explained by the linear rela-
tionship between the miles traveled and travel time.
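This interpretation can be checked numerically. The sketch below also confirms that r² equals the square of the sample correlation between x and y; the helper function is illustrative, not from the text:

```python
import math

# Butler Trucking data: miles traveled (x) and travel time in hours (y).
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

def sample_correlation(xs, ys):
    """Pearson sample correlation coefficient (illustrative helper)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

r2_from_sums = 15.8712 / 23.9                         # SSR / SST, eq. (4.12)
r2_from_corr = sample_correlation(miles, hours) ** 2  # squared correlation

print(round(r2_from_sums, 4))   # 0.6641
print(round(r2_from_corr, 4))   # 0.6641
```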

Using Excel’s Chart Tools to Compute
the Coefficient of Determination

In Section 4.1 we used Excel’s chart tools to construct a scatter chart and compute the es-
timated regression equation for the Butler Trucking Company data. We will now describe
how to compute the coefficient of determination using the scatter chart in Figure 4.3.

Step 1. Right-click on any data point in the scatter chart and select Add Trendline. . .
Step 2. When the Format Trendline task pane appears:
        Select Display R-squared value on chart in the Trendline Options area

Note that Excel uses R² to represent the coefficient of determination.

Figure 4.9 displays the scatter chart, the estimated regression equation, the graph of the
estimated regression equation, and the coefficient of determination for the Butler Trucking
Company data. We see that r² = 0.6641.
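For readers working outside Excel, the trendline that Add Trendline fits is an ordinary least-squares line, which can be reproduced with the standard formulas b1 = Sxy/Sxx and b0 = ȳ − b1·x̄. This is a sketch of the same computation, not Excel’s actual code path:

```python
# Reproduce Excel's linear trendline for the Butler Trucking data with the
# least-squares formulas: slope b1 = Sxy / Sxx, intercept b0 = y_bar - b1 * x_bar.
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

n = len(miles)
x_bar, y_bar = sum(miles) / n, sum(hours) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, hours))
sxx = sum((x - x_bar) ** 2 for x in miles)

b1 = sxy / sxx               # slope of the trendline
b0 = y_bar - b1 * x_bar      # intercept of the trendline

print(round(b1, 4), round(b0, 4))   # 0.0678 1.2739
```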

FIGURE 4.9  EXCEL SPREADSHEET WITH ORIGINAL DATA, SCATTER CHART, ESTIMATED
REGRESSION LINE, ESTIMATED REGRESSION EQUATION, AND COEFFICIENT OF
DETERMINATION r² FOR BUTLER TRUCKING COMPANY

[Spreadsheet: columns A–C list the ten assignments with Miles and Time; the embedded
scatter chart plots Travel Time (hours) against Miles Traveled (40 to 100) and displays
the trendline with y = 0.0678x + 1.2739 and R² = 0.6641]

NOTES AND COMMENTS

As a practical matter, for typical data in the social and behavioral sciences, values of r²
as low as 0.25 are often considered useful. For data in the physical and life sciences, r²
values of 0.60 or greater are often found; in fact, in some cases, r² values greater than
0.90 can be found. In business applications, r² values vary greatly, depending on the
unique characteristics of each application.
