Guide On Multiple Regression

CHAPTER 29
Multiple Regression
WHO: 250 male subjects
WHAT: Body fat and waist size
UNITS: %Body fat and inches
WHEN: 1990s
WHERE: United States
WHY: Scientific research

In Chapter 27 we tried to predict the percent body fat of male subjects from their waist size, and we did pretty well. The R2 of 67.8% says that we accounted for almost 68% of the variability in %body fat by knowing only the waist size. We completed the analysis by performing hypothesis tests on the coefficients and looking at the residuals.
But that remaining 32% of the variance has been bugging us. Couldn't we do a better job of accounting for %body fat if we weren't limited to a single predictor? In the full data set there were 15 other measurements on the 250 men. We might be able to use other predictor variables to help us account for that leftover variation that wasn't accounted for by waist size.
What about height? Does height help to predict %body fat? Men with the same
waist size can vary from short and corpulent to tall and emaciated. Knowing a
man has a 50-inch waist tells us that he’s likely to carry a lot of body fat. If we
found out that he was 7 feet tall, that might change our impression of his body
type. Knowing his height as well as his waist size might help us to make a more ac-
curate prediction.
Just Do It
Does a regression with two predictors even make sense? It does—and that’s fortu-
nate because the world is too complex a place for simple linear regression alone to
model it. A regression with two or more predictor variables is called a multiple
regression. (When we need to note the difference, a regression on a single predic-
tor is called a simple regression.) We’d never try to find a regression by hand, and
even calculators aren’t really up to the task. This is a job for a statistics program
on a computer. If you know how to find the regression of %body fat on waist size
with a statistics package, you can usually just add height to the list of predictors
without having to think hard about how to do it.
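For concreteness, here is a minimal sketch (not from the book) of what that looks like in Python with the statsmodels package. The file name and column names (bodyfat.csv, pctbf, waist, height) are made up for illustration; use whatever your own data set provides.

import pandas as pd
import statsmodels.api as sm

body = pd.read_csv("bodyfat.csv")                 # hypothetical file: 250 men
X = sm.add_constant(body[["waist", "height"]])    # add the intercept column
model = sm.OLS(body["pctbf"], X).fit()            # ordinary least squares fit
print(model.summary())                            # coefficients, SEs, t-ratios, R2

Adding another predictor is just a matter of adding another column name to the list.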
A Note on Terminology
When we have two or more predictors and fit a linear model by least squares, we are formally said to fit a least squares linear multiple regression. Most folks just call it "multiple regression." You may also see the abbreviation OLS used with this kind of analysis. It stands for "Ordinary Least Squares."

For simple regression we found the Least Squares solution, the one whose coefficients made the sum of the squared residuals as small as possible. For multiple regression, we'll do the same thing but this time with more coefficients. Remarkably enough, we can still solve this problem. Even better, a statistics package can find the coefficients of the least squares model easily.
Here's a typical example of a multiple regression table:

Dependent variable is: Pct BF
R-squared = 71.3%   R-squared (adjusted) = 71.1%
s = 4.460 with 250 - 3 = 247 degrees of freedom

Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept   -3.10088      7.686       -0.403    0.6870
Waist        1.77309      0.0716      24.8      <0.0001
Height      -0.60154      0.1099      -5.47     <0.0001
You should recognize most of the numbers in this table. Most of them mean what you expect them to.
R2 gives the fraction of the variability of %body fat accounted for by the multiple regression model. (With waist alone predicting %body fat, the R2 was 67.8%.) The multiple regression model accounts for 71.3% of the variability in %body fat. We shouldn't be surprised that R2 has gone up. It was the hope of accounting for some of that leftover variability that led us to try a second predictor.
The standard deviation of the residuals is still denoted s (or sometimes se to distinguish it from the standard deviation of y).
The degrees of freedom calculation follows our rule of thumb: the degrees of freedom is the number of observations (250) minus one for each coefficient estimated (for this model, 3).
For each predictor we have a coefficient, its standard error, a t-ratio, and the corresponding P-value. As with simple regression, the t-ratio measures how many standard errors the coefficient is away from 0. So, using a Student's t-model, we can use its P-value to test the null hypothesis that the true value of the coefficient is 0.
Using the coefficients from this table, we can write the regression model:

predicted %body fat = -3.10 + 1.77 waist - 0.60 height.
As before, we define the residuals as

residuals = %body fat - predicted %body fat.
We’ve fit this model with the same least squares principle: The sum of the
squared residuals is as small as possible for any choice of coefficients.
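As an added illustration (not part of the original text), here is a tiny sketch of that principle: numpy's least squares routine returns the coefficients that make the sum of squared residuals as small as possible. The four data points are invented just to make the code runnable.

import numpy as np

# columns: intercept, waist (in.), height (in.); the numbers are hypothetical
X = np.column_stack([np.ones(4), [36.0, 38.0, 40.0, 34.0], [70.0, 72.0, 68.0, 71.0]])
y = np.array([20.0, 24.5, 30.1, 16.3])          # %body fat for the same four men

coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef
print(coef)                    # intercept, waist, and height coefficients
print(np.sum(residuals ** 2))  # the minimized sum of squared residuals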
So what’s different? With so much of the multiple regression looking just like sim-
ple regression, why devote an entire chapter (or two) to the subject?
There are several answers to this question. First—and most important—the
meaning of the coefficients in the regression model has changed in a subtle but im-
portant way. Because that change is not obvious, multiple regression coefficients
are often misinterpreted. We'll show some examples to help make the meaning clear.
Second, multiple regression is an extraordinarily versatile calculation, underlying many widely used Statistics methods. A sound understanding of the multiple regression model will help you to understand these other applications.
Third, multiple regression offers our first glimpse into statistical models that
use more than two quantitative variables. The real world is complex. Simple mod-
els of the kind we’ve seen so far are a great start, but often they’re just not detailed
enough to be useful for understanding, predicting, and decision making. Models
that use several variables can be a big step toward realistic and useful modeling of
complex phenomena and relationships.
We said that height might be important in predicting body fat in men. What’s the
relationship between %body fat and height in men? We know how to approach this
question; we follow the three rules. Here’s the scatterplot:
[Scatterplot: %Body Fat vs. Height (in.)]
It doesn’t look like height tells us much about %body fat. You just can’t tell much
about a man’s %body fat from his height. Or can you? Remember, in the multiple
regression model, the coefficient of height was -0.60, had a t-ratio of -5.47, and
had a very small P-value. So it did contribute to the multiple regression model.
How could that be?
The answer is that the multiple regression coefficient of height takes account of
the other predictor, waist size, in the regression model.
To understand the difference, let’s think about all men whose waist size is about
37 inches—right in the middle of our sample. If we think only about these men,
what do we expect the relationship between height and %body fat to be? Now a
negative association makes sense because taller men probably have less body fat
than shorter men who have the same waist size. Let’s look at the plot:
[Scatterplot: %Body Fat vs. Height (in.), with one group of points highlighted]
Here we’ve highlighted the men with waist sizes between 36 and 38 inches.
Overall, there’s little relationship between %body fat and height, as we can see
from the full set of points. But when we focus on particular waist sizes, there is a
relationship between body fat and height. This relationship is conditional because
we’ve restricted our set to only those men within a certain range of waist sizes.
For men with that waist size, an extra inch of height is associated with a decrease
of about 0.60% in body fat. If that relationship is consistent for each waist size,
then the multiple regression coefficient will estimate it. The simple regression co-
efficient simply couldn’t see it.
We’ve picked one particular waist size to highlight. How could we look at the
relationship between %body fat and height conditioned on all waist sizes at the same
time? Once again, residuals come to the rescue.
(As their name reminds us, residuals are what's left over after we fit a model. That lets us remove the effects of some variables. The residuals are what's left.)
We plot the residuals of %body fat after a regression on waist size against the residuals of height after regressing it on waist size. This display is called a partial regression plot. It shows us just what we asked for: the relationship of %body fat to height after removing the linear effects of waist size.
[Partial regression plot: %Body Fat residuals vs. Height residuals (in.)]
A partial regression plot for a particular predictor has a slope that is the same as
the multiple regression coefficient for that predictor. Here, it’s 20.60. It also has the
same residuals as the full multiple regression, so you can spot any outliers or
influential points and tell whether they’ve affected the estimation of this particu-
lar coefficient.
Many modern statistics packages offer partial regression plots as an option for
any coefficient of a multiple regression. For the same reasons that we always look
at a scatterplot before interpreting a simple regression coefficient, it’s a good idea
to make a partial regression plot for any multiple regression coefficient that you
hope to understand or interpret.
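If your package doesn't offer them, a partial regression plot is easy to build by hand. The sketch below is an added illustration with the same hypothetical file and column names as before: regress %body fat on waist, regress height on waist, and plot residuals against residuals. The slope of the least squares line through that plot matches the multiple regression coefficient of height.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

body = pd.read_csv("bodyfat.csv")                  # hypothetical file name
y_resid = sm.OLS(body["pctbf"], sm.add_constant(body["waist"])).fit().resid
x_resid = sm.OLS(body["height"], sm.add_constant(body["waist"])).fit().resid

plt.scatter(x_resid, y_resid)
plt.xlabel("Height residuals (in.)")
plt.ylabel("%Body fat residuals")
plt.show()

print(np.polyfit(x_resid, y_resid, 1)[0])   # slope: about -0.60 for these data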
We can write a multiple regression model like this, numbering the predictors arbitrarily (we don't care which one is x1), writing β's for the model coefficients (which we will estimate from the data), and including the errors in the model:

y = β0 + β1x1 + β2x2 + ε.
Of course, the multiple regression model is not limited to two predictor vari-
ables, and regression model equations are often written to indicate summing any
number (a typical letter to use is k) of predictors. That doesn’t really change any-
thing, so we’ll often stick with the two-predictor version just for simplicity. But
don’t forget that we can have many predictors.
The assumptions and conditions for the multiple regression model sound
nearly the same as for simple regression, but with more variables in the model,
we’ll have to make a few changes.
Linearity Assumption
We are fitting a linear model.1 For that to be the right kind of model, we need an
underlying linear relationship. But now we’re thinking about several predictors.
To see whether the assumption is reasonable, we’ll check the Straight Enough
Condition for each of the predictors.
Straight Enough Condition: Scatterplots of y against each of the predictors are reasonably straight. As we have seen with height in the body fat example, the scatterplots need not show a strong (or any!) slope; we just check that there isn't a bend or other nonlinearity. For the %body fat data, the scatterplot is beautifully linear in waist as we saw in Chapter 27. For height, we saw no relationship at all, but at least there was no bend.
As we did in simple regression, it's a good idea to check the residuals for linearity after we fit the model. It's good practice to plot the residuals against the predicted values and check for patterns.

1 By linear we mean that each x appears simply multiplied by its coefficient and added to the model. No x appears in an exponent or some other more complicated function. That means that as we move along any x-variable, our prediction for y will change at a constant rate (given by the coefficient) if nothing else changes.
Independence Assumption
Check the Residual Plot (Part 2): The residuals should appear to be randomly scattered and show no patterns or clumps when plotted against the predicted values.

As with simple regression, the errors in the true underlying regression model must be independent of each other. As usual, there's no way to be sure that the Independence Assumption is true. Fortunately, even though there can be many predictor variables, there is only one response variable and only one set of errors. The Independence Assumption concerns the errors, so we check the corresponding conditions on the residuals.
Randomization Condition: The data should arise from a random sample or randomized experiment. Randomization assures us that the data are representative of some identifiable population. If you can't identify the population, you
can’t interpret the regression model or any hypothesis tests because they are
about a regression model for that population. Regression methods are often ap-
plied to data that were not collected with randomization. Regression models fit to
such data may still do a good job of modeling the data at hand, but without some
reason to believe that the data are representative of a particular population, you
should be reluctant to believe that the model generalizes to other situations.
We also check displays of the regression residuals for evidence of patterns,
trends, or clumping, any of which would suggest a failure of independence. In the
special case when one of the x-variables is related to time, be sure that the residu-
als do not have a pattern when plotted against that variable.
The %body fat data were collected on a sample of men. The men were not related
in any way, so we can be pretty sure that their measurements are independent.
[Figure 29.4: Residuals plotted against Height (in.) and against Waist (in.). Residuals plotted against each predictor show no pattern. That's a good indication that the Straight Enough Condition and the "Does the Plot Thicken?" Condition are satisfied.]
If residual plots show no pattern, if the data are plausibly independent, and if
the plots don’t thicken, we can feel good about interpreting the regression model.
Before we test hypotheses, however, we must check one final assumption.
Normality Assumption
We assume that the errors around the idealized regression model at any specified values of the x-variables follow a Normal model. We need this assumption so that we can use a Student's t-model for inference. As with other times when we've used Student's t, we'll settle for the residuals satisfying the Nearly Normal Condition.
Nearly Normal Condition: Because we have only one set of residuals, this is the same set of conditions we had for simple regression. Look at a histogram or Normal probability plot of the residuals. The histogram of residuals in the %body fat regression certainly looks nearly Normal, and the Normal probability plot is fairly straight. And, as we have said before, the Normality Assumption becomes less important as the sample size grows.

[Figure 29.5: Histogram of the residuals. Check a histogram of the residuals. The distribution of the residuals should be unimodal and symmetric. Or check a Normal probability plot to see whether it is straight.]

Let's summarize all the checks of conditions that we've made and the order that we've made them:
1. Check the Straight Enough Condition with scatterplots of the y-variable against each x-variable.
2. If the scatterplots are straight enough (that is, if it looks like the regression
model is plausible), fit a multiple regression model to the data. (Otherwise,
either stop or consider re-expressing an x- or the y-variable.)
3. Find the residuals and predicted values.
4. Make a scatterplot of the residuals against the predicted values.2 This plot
should look patternless. Check in particular for any bend (which would
suggest that the data weren’t all that straight after all) and for any thickening.
If there’s a bend and especially if the plot thickens, consider re-expressing
the y-variable and starting over.
5. Think about how the data were collected. Was suitable randomization used?
Are the data representative of some identifiable population? If the data are
measured over time, check for evidence of patterns that might suggest
they’re not independent by plotting the residuals against time to look for pat-
terns.
6. If the conditions check out this far, feel free to interpret the regression model and use it for prediction. If you want to investigate a particular coefficient, make a partial regression plot for that coefficient.
7. If you wish to test hypotheses about the coefficients or about the overall re-
gression, then make a histogram and Normal probability plot of the residuals
to check the Nearly Normal Condition.
2 In Chapter 27 we noted that a scatterplot of residuals against the predicted values looked just like the
plot of residuals against x. But for a multiple regression, there are several x’s. Now the predicted val-
ues, ŷ, are a combination of the x’s—in fact, they’re the combination given by the regression equation
we have computed. So they combine the effects of all the x’s in a way that makes sense for our partic-
ular regression model. That makes them a good choice to plot against.
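The plotting steps in the list above (finding residuals and predicted values, plotting residuals against the predicted values, and checking a histogram of the residuals) are quick to carry out with software. Here is an added sketch, reusing the hypothetical statsmodels fit called model from the earlier example.

import matplotlib.pyplot as plt

fitted = model.fittedvalues          # the predicted values, y-hat
resid = model.resid                  # the residuals

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.scatter(fitted, resid)           # look for bends and for thickening
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Predicted")
ax1.set_ylabel("Residual")

ax2.hist(resid, bins=20)             # Nearly Normal Condition: unimodal and symmetric?
ax2.set_xlabel("Residual")
plt.show()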
Multiple Regression
Let’s try finding and interpreting a multiple regression model for the body fat data.
Plan: Name the variables, report the W's, and specify the questions of interest.
I have body measurements on 250 adult males from the BYU Human Performance Research Center. I want to understand the relationship between %body fat, height, and waist size.
Model: Check the appropriate conditions.
✔ Straight Enough Condition: There is no obvious bend in the scatterplots of %body fat against either x-variable. The scatterplot of residuals against predicted values below shows no patterns that would suggest nonlinearity.
✔ Independence Assumption: These data are not collected over time, and there's no reason to think that the %body fat of one man influences that of another. I don't know whether the men measured were sampled randomly, but the data are presented as being representative of the male population of the United States.

Now you can find the regression and examine the residuals.
✔ Does the Plot Thicken? Condition: The scatterplot of residuals against predicted values shows no obvious changes in the spread about the line.

[Scatterplot: Residuals (%body fat) vs. Predicted (%body fat)]
Actually, you need the Nearly Normal Condition only if we want to do inference.
✔ Nearly Normal Condition: A histogram of the residuals is unimodal and symmetric.

[Histogram: counts of the residuals]
Choose your method.
Under these conditions a full multiple regression analysis is appropriate.

Mechanics: Here is the computer output for the regression:

Dependent variable is: %BF
R-squared = 71.3%   R-squared (adjusted) = 71.1%
s = 4.460 with 250 - 3 = 247 degrees of freedom

Source        Sum of Squares   DF    Mean Square   F-ratio   P-value
Regression    12216.6            2   6108.28       307       <0.0001
Residual       4912.26         247     19.8877
Conclusion: Interpret the regression in the proper context.
The R2 for the regression is 71.3%. Waist size and height together account for about 71% of the variation in %body fat among men. The regression equation indicates that each inch in waist size is associated with about a 1.77 increase in %body fat among men who are of a particular height. Each inch of height is associated with a decrease in %body fat of about 0.60 among men with a particular waist size.
The standard errors for the slopes of 0.07 (waist) and 0.11 (height) are both small compared with the slopes themselves, so it looks like the coefficient estimates are fairly precise. The residuals have a standard deviation of 4.46%, which gives an indication of how precisely we can predict %body fat with this model.
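As an added illustration (not part of the original example), the fitted equation can be used directly for prediction. A hypothetical man with a 38-inch waist who stands 72 inches tall would be predicted to have about 21% body fat:

waist, height = 38, 72
pct_bf = -3.10 + 1.77 * waist - 0.60 * height
print(pct_bf)   # about 21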
There are several hypothesis tests in the multiple regression output, but all of them talk about the same thing. Each is concerned with whether the underlying model parameters are actually zero.
The first of these hypotheses is one we skipped over for simple regression (for reasons that will be clear in a minute). Now that we've looked at ANOVA (in Chapter 28),3 we can recognize the ANOVA table sitting in the middle of the regression output. Where'd that come from?
The answer is that now that we have more than one predictor, there's an overall test we should consider before we do more inference on the coefficients. We ask the global question "Is this multiple regression model any good at all?" That is, would we do as well using just ȳ to model y? What would that mean in terms of the regression? Well, if all the coefficients (except the intercept) were zero, we'd have

ŷ = b0 + 0x1 + … + 0xk

and we'd just set b0 = ȳ.
To address the overall question, we'll test

H0: β1 = β2 = … = βk = 0.
(That null hypothesis looks very much like the null hypothesis we tested in the
Analysis of Variance in Chapter 28.)
We can test this hypothesis with a statistic that is labeled with the letter F (in
honor of Sir Ronald Fisher, the developer of Analysis of Variance). In our exam-
ple, the F-value is 307 on 2 and 247 degrees of freedom. The alternative hypothesis
is just that the slope coefficients aren’t all equal to zero, and the test is one-sided—
bigger F-values mean smaller P-values. If the null hypothesis were true, the F-
statistic would be near 1. The F-statistic here is quite large, so we can easily reject
the null hypothesis and conclude that the multiple regression model is better than
just using the mean.4
Why didn’t we do this for simple regression? Because the null hypothesis
would have just been that the lone model slope coefficient was zero, and we were
already testing that with the t-statistic for the slope. In fact, the square of that t-
statistic is equal to the F-statistic for the simple regression, so it really was the
identical test.
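As an added aside (not from the book), the arithmetic behind that P-value is easy to reproduce; here is a hedged sketch using scipy, with the F-value and degrees of freedom taken from the output above.

from scipy import stats

F, df_num, df_den = 307, 2, 247
p_value = stats.f.sf(F, df_num, df_den)   # upper-tail area; the test is one-sided
print(p_value)                            # far below 0.0001, so we reject H0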
Once we check the F-test and reject the null hypothesis—and, if we are being care-
ful, only if we reject that hypothesis—we can move on to checking the test statistics
3 If you skipped over Chapter 28, you can just take our word for this and read on.
4 There are F tables on the CD, and they work pretty much as you’d expect. Most regression tables in-
clude a P-value for the F-statistic, but there’s almost never a need to perform this particular test in a
multiple regression. Usually we just glance at the F-statistic to see that it’s reasonably far from 1.0, the
value it would have if the true coefficients were really all zero.
for the individual coefficients. Those tests look like what we did for the slope of a
simple regression. For each coefficient we test
H0: βj = 0
against the (two-sided) alternative that it isn’t zero. The regression table gives a
standard error for each coefficient and the ratio of the estimated coefficient to its
standard error. If the assumptions and conditions are met (and now we need the
Nearly Normal condition), these ratios follow a Student’s t-distribution.
t(n-k-1) = (bj - 0) / SE(bj)
How many degrees of freedom? We have a rule of thumb and it works here.
The degrees of freedom is the number of data values minus the number of predic-
tors (in this case, counting the intercept term). For our regression on two predic-
tors, that’s n 2 3. You shouldn’t have to look up the t-values. Almost every regres-
sion report includes the corresponding P-values.
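For example (an added sketch, with the estimate and standard error copied from the height row of the regression table), the t-ratio and its two-sided P-value can be computed directly:

from scipy import stats

b, se, df = -0.60154, 0.1099, 250 - 2 - 1   # n - k - 1 = 247
t_ratio = b / se                            # about -5.47
p_value = 2 * stats.t.sf(abs(t_ratio), df)
print(t_ratio, p_value)                     # the P-value is far below 0.0001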
We can build a confidence interval in the usual way, as an estimate 6 a margin
of error. As always, the margin of error is just the product of the standard error
and a critical value. Here the critical value comes from the t-distribution on
n - k - 1 degrees of freedom. So a confidence interval for βj is

bj ± t*(n-k-1) SE(bj).
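Here is a short added sketch of that calculation for the height coefficient, again using the values reported in the table above.

from scipy import stats

b_height, se_height = -0.60154, 0.1099
df = 250 - 2 - 1                              # n - k - 1 = 247
t_star = stats.t.ppf(0.975, df)               # critical value for 95% confidence
print(b_height - t_star * se_height, b_height + t_star * se_height)
# roughly (-0.82, -0.39): an extra inch of height is associated with
# about 0.4 to 0.8 fewer percentage points of body fat, other things equal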
The tricky parts of these tests are that the standard errors of the coefficients now
require harder calculations (so we leave it to the technology) and the meaning of a
coefficient, as we have seen, depends on all the other predictors in the multiple re-
gression model.
That last bit is important. If we fail to reject the null hypothesis for a multiple
regression coefficient, it does not mean that the corresponding predictor variable
has no linear relationship to y. It means that the corresponding predictor con-
tributes nothing to modeling y after allowing for all the other predictors.
This last point bears repeating. The multiple regression model looks so simple
and straightforward:
y = β0 + β1x1 + … + βkxk + ε.

It looks like each βj tells us the effect of its associated predictor, xj, on the
response variable, y. But that is not so. This is, without a doubt, the most common
error that people make with multiple regression:
• It is possible for there to be no simple relationship between y and xj, and yet bj in a multiple regression can be significantly different from 0. We saw this happen for the coefficient of height in our example.
• It is also possible for there to be a strong two-variable relationship between y and xj, and yet bj in a multiple regression can be almost 0 with a large P-value so that we must retain the null hypothesis that the true coefficient is zero. If
we’re trying to model the horsepower of a car, using both its weight and its en-
gine size, it may turn out that the coefficient for engine size is nearly 0. That
doesn’t mean that engine size isn’t important for understanding horsepower. It
simply means that after allowing for the weight of the car, the engine size doesn’t
give much additional information.
• It is even possible for there to be a significant linear relationship between y and xj in one direction, and yet bj can be of the opposite sign and strongly significant in a multiple regression. More expensive cars tend to be bigger, and since bigger cars have worse fuel efficiency, the price of a car has a slightly negative association with fuel efficiency. But in a multiple regression of fuel efficiency on weight and price, the coefficient of price may be positive. If so, it means that
among cars of the same weight, more expensive cars have better fuel efficiency.
The simple regression on price, though, has the opposite direction because,
overall, more expensive cars are bigger. This switch in sign may seem a little
strange at first, but it’s not really a contradiction at all. It’s due to the change in
the meaning of the coefficient of price when it is in a multiple regression rather
than a simple regression.
So we’ll say it once more: The coefficient of xj in a multiple regression depends
as much on the other predictors as it does on xj. Remember that when you inter-
pret a multiple regression model.
WHO: U.S. states
WHAT: Various measures relating to children and teens
WHEN: 1999
WHY: Research and policy

Infant mortality is often used as a general measure of the quality of healthcare for children and mothers. It is reported as the rate of deaths of newborns per 1000 live births. Data recorded for each of the 50 states of the United States may allow us to build regression models to help understand or predict infant mortality. The variables available for our model are child death rate (deaths per 100,000 children aged 1–14), percent of teens who are high school dropouts (ages 16–19), percent of low–birth weight babies (lbw), teen birth rate (births per 100,000 females ages 15–17), and teen deaths by accident, homicide, and suicide (deaths per 100,000 teens ages 15–19).5
All of these variables were displayed and found to have no outliers and nearly
Normal distributions.6 One useful way to check many of our conditions is with a
scatterplot matrix. This is an array of scatterplots set up so that the plots in each
row have the same variable on their y-axis and those in each column have the
same variable on their x-axis. This way every pair of variables is graphed. On the
diagonal, rather than plotting a variable against itself, you’ll usually find either a
Normal probability plot or a histogram of the variable to help us assess the Nearly
Normal Condition.
5 The data are available from the Kids Count section of the Annie E. Casey Foundation, and are all for
1999.
6 In the interest of complete honesty, we should point out that the original data include the District of
Columbia, but it proved to be an outlier on several of the variables, so we’ve restricted attention to the
50 states here.
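As an added sketch (the file and column names below are hypothetical), a scatterplot matrix like the one described above can be drawn with pandas.

import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

kids = pd.read_csv("kids_count_1999.csv")    # one row per state
cols = ["infant_mortality", "child_death_rate", "hs_dropout",
        "low_birth_weight", "teen_births", "teen_deaths"]
scatter_matrix(kids[cols], diagonal="hist", figsize=(9, 9))
plt.show()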
[Scatterplot matrix of the infant mortality variables (Infant Mortality, HS Dropout Rate, Teen Births, Teen Deaths, and the others listed above): a scatterplot of each pair of variables arrayed so that the vertical and horizontal axes are consistent across rows and down columns. The diagonal cells may hold Normal probability plots (as they do here) or histograms.]
The individual scatterplots show at a glance that each of the relationships is
straight enough for regression. There are no obvious bends, clumping, or outliers.
And the plots don’t thicken. So it looks like we can examine some multiple regres-
sion models with inference.
Let’s try to model infant mortality with all of the available predictors.
Plan: State what you want to know.
I wonder whether all or some of these predictors contribute to a useful model for infant mortality.

Hypotheses: Specify your hypotheses. (Hypotheses on the intercept are not particularly interesting for these data.)
First, there is an overall null hypothesis that asks whether the entire model is better than just modeling y with its mean:
H0: The model itself contributes nothing useful, and all the slope coefficients are zero: β1 = β2 = … = βk = 0.
[Scatterplot: residuals vs. predicted infant mortality]
Consider the hypothesis tests. Under the assumptions we're willing to accept, and considering the conditions we've checked, the individual coefficients follow Student's t-distributions on 44 degrees of freedom.
The F-ratio of 21.8 on 5 and 44 degrees of freedom is certainly large enough to reject the default null hypothesis that the regression model is no better than using the mean infant mortality rate. So I'll go on to examine the individual coefficients.
Conclusion: Interpret your results in the proper context.
Most of these coefficients have relatively small t-ratios, so I can't be sure that their underlying values are not zero.
Two of the coefficients, child death rate (cdr) and low birth
weight (lbw), have P-values less than 5%. So I am confident
that in this model both of these variables are unlikely to
really have zero coefficients.
Overall the R2 indicates that more than 71% of the variabil-
ity in infant mortality can be accounted for with this
regression model.
After allowing for the linear effects of the other variables in
the model, an increase in the child death rate of 1 death
per 100,000 is associated with an increase of 0.03
deaths per 1000 live births in the infant mortality rate.
And an increase of 1% in the percentage of live births that
are low birth weight is associated with an increase of 0.66
deaths per 1000 live births.
Adjusted R2
You may have noticed that the full regression tables shown in this chapter include
another statistic we haven’t discussed. It is called adjusted R2 and sometimes ap-
pears in computer output as R2(adjusted). The adjusted R2 statistic is a rough at-
tempt to adjust for the simple fact that when we add another predictor to a multi-
ple regression, the R2 can’t go down and will most likely get larger. Only if we
were to add a predictor whose coefficient turned out to be exactly zero would the
R2 remain the same. This fact makes it difficult to compare alternative regression
models that have different numbers of predictors.
We can write a formula for R2 using the sums of squares in the ANOVA table portion of the regression output table:

R2 = SSRegression / (SSRegression + SSResidual) = SSRegression / SSTotal = 1 - SSResidual / SSTotal.

Adjusted R2 substitutes the corresponding mean squares for the SS's (with MSTotal = SSTotal/(n - 1)):

R2adj = 1 - MSResidual / MSTotal.
Because the mean squares are sums of squares divided by their degrees of free-
dom, they are adjusted for the number of predictors in the model. As a result, the
adjusted R2 value won’t necessarily increase when a new predictor is added to the
multiple regression model. That’s fine. But adjusted R2 no longer tells the fraction
of variability accounted for by the model and it isn’t even bounded by 0 and
100%, so it can be awkward to interpret.
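A quick added check of these formulas, using the sums of squares from the body fat ANOVA table shown in the Step-by-Step example, reproduces the values reported in the regression output (71.3% and 71.1%).

ss_reg, ss_res, n, k = 12216.6, 4912.26, 250, 2
ss_tot = ss_reg + ss_res

r2 = ss_reg / ss_tot                                        # about 0.713
r2_adj = 1 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))    # about 0.711
print(round(r2, 3), round(r2_adj, 3))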
Comparing alternative regression models is a challenge, especially when they have different numbers of predictors, and the search for a summary statistic to help us choose among models is ongoing.
7 With several predictors we can wander beyond the data because of the combination of values even
when individual values are not extraordinary. For example, both 28-inch waists and 76-inch heights can
be found in men in the body fat study, but a single individual with both these measurements would not
be at all typical. The model we fit is probably not appropriate for predicting the % body fat for such a tall
and skinny individual.
What Can Go Wrong?
● Interpret each coefficient only after allowing for the linear effects of the other predictors. The sign of a variable
can change depending on which other predictors are in or out of the model.
For example, in the regression model for infant mortality, the coefficient of
high school dropout rate was negative and its P-value was fairly small, but the
simple association between dropout rate and infant mortality is positive.
(Check the plot matrix.)
● If a coefficient’s t-statistic is not significant, don’t interpret it at all.
You can’t be sure that the value of the corresponding parameter in the un-
derlying regression model isn’t really zero.
● Don't fit a linear regression to data that aren't straight. This is the most fundamental regression assumption. If the relationship between the x's and
y isn’t approximately linear, there’s no sense in fitting a linear model to it.
What we mean by “linear” is a model of the form we have been writing for
the regression. When we have two predictors, this is the equation of a plane,
which is linear in the sense of being flat in all directions. With more predic-
tors, the geometry is harder to visualize, but the simple structure of the
model is consistent; the predicted values change consistently with equal size
changes in any predictor.
Usually we’re satisfied when plots of y against each of the x’s are straight
enough. We’ll also check a scatterplot of the residuals against the predicted
values for signs of nonlinearity.
● Watch out for the plot thickening. The estimate of the error standard devi-
ation shows up in all the inference formulas. If se changes with x, these esti-
mates won’t make sense. The most common check is a plot of the residuals
against the predicted values. If plots of residuals against several of the pre-
dictors all show a thickening, and especially if they also show a bend, then
consider re-expressing y. If the scatterplot against only one predictor shows
thickening, consider re-expressing that predictor.
● Make sure the errors are nearly Normal. All of our inferences require
that the true errors be modeled well by a Normal model. Check the his-
togram and Normal probability plot of the residuals to see whether this as-
sumption looks reasonable.
● Watch out for high-influence points and outliers. We always have to be
on the lookout for a few points that have undue influence on our model, and
regression is certainly no exception. Partial regression plots are a good place
to look for influential points and to understand how they affect each of the
coefficients.
CONNECTIONS
We would never consider a regression analysis without first making scatterplots. The aspects of scat-
terplots that we always look for—their direction, shape, and scatter—relate directly to regression.
Regression inference is connected to just about every inference method we have seen for mea-
sured data. The assumption that the spread of data about the line is constant is essentially the same
as the assumption of equal variances required for the pooled-t methods. Our use of all the residuals
together to estimate their standard deviation is a form of pooling.
Chapter 29 • Multiple Regression 29-19
Of course, the ANOVA table in the regression output connects to our consideration of ANOVA in
Chapter 28. This, too, is not coincidental. Multiple Regression, ANOVA, pooled t-tests, and inference
for means are all part of a more general statistical model known as the General Linear Model (GLM).
● The assumptions and conditions for multiple regression are essentially the same as for simple regression, and checking them is straightforward.
TERMS
Multiple regression A linear regression with two or more predictors whose coefficients are found to minimize the
sum of the squared residuals is a least squares linear multiple regression. But it is usually just
called a multiple regression. When the distinction is needed, a least squares linear regression
with a single predictor is called a simple regression. The multiple regression model is
y = β0 + β1x1 + … + βkxk + ε.
Least squares We still fit multiple regression models by choosing the coefficients that make the sum of the
squared residuals as small as possible. This is called the method of least squares.
Partial regression The partial regression plot for a specified coefficient is a display that helps in understanding the
plot meaning of that coefficient in a multiple regression. It has a slope equal to the coefficient value
and shows the influences of each case on that value. A partial regression plot for a specified x
displays the residuals when y is regressed on the other predictors against the residuals when
the specified x is regressed on the other predictors.
Assumptions for inference in regression (and conditions to check for some of them)
● Linearity. Check that the scatterplots of y against each x are straight enough and that the scatterplot of residuals against predicted values has no obvious pattern. (If we find the relationships straight enough, we may fit the regression model to find residuals for further checking.)
● Independent errors. Think about the nature of the data. Check a residual plot. Any evident
pattern in the residuals can call the assumption of independence into question.
● Constant variance. Check that the scatterplots show consistent spread across the ranges of
the x-variables and that the residual plot has constant variance too. A common problem is
increasing spread with increasing predicted values—the plot thickens!
● Normality of the residuals. Check a histogram or a Normal probability plot of the residuals.
ANOVA The Analysis of Variance table that is ordinarily part of the multiple regression results offers an
F-test to test the null hypothesis that the overall regression is no improvement over just model-
ing y with its mean:
H0: β1 = β2 = … = βk = 0.
If this null hypothesis is not rejected, then you should not proceed to test the individual
coefficients.
t-ratios for the The t-ratios for the coefficients can be used to test the null hypotheses that the true value of
coefficients each coefficient is zero against the alternative that it is not.
Scatterplot matrix A scatterplot matrix displays scatterplots for all pairs of a collection of variables, arranged so
that all the plots in a row have the same variable displayed on their y-axis and all plots in a col-
umn have the same variable on their x-axis. Usually, the diagonal holds a display of a single
variable such as a histogram or Normal probability plot, and identifies the variable in its row and
column.
Adjusted R2 An adjustment to the R 2 statistic that attempts to allow for the number of predictors in the
model. It is sometimes used when comparing regression models with different numbers of
predictors.
• Understand that the “true” regression model is an idealized summary of the data.
• Know how to examine scatterplots of y vs. each x for violations of assumptions that
would make inference for regression unwise or invalid.
• Know how to examine displays of the residuals from a multiple regression to check that
the conditions have been satisfied. In particular, know how to judge linearity and con-
stant variance from a scatterplot of residuals against predicted values. Know how to
judge Normality from a histogram and Normal probability plot.
• Be able to use a statistics package to perform the calculations and make the displays
for multiple regression, including a scatterplot matrix of the variables, a scatterplot of
residuals vs. predicted values, and partial regression plots for each coefficient.
• Know how to use the ANOVA F-test to check that the overall regression model is better
than just using the mean of y.
Chapter 29 • Multiple Regression 29-21
• Know how to test the standard hypotheses that each regression coefficient is really
zero. Be able to state the null and alternative hypotheses. Know where to find the rele-
vant numbers in standard computer regression output.
• Be able to interpret the P-value of the t-statistics for the coefficients to test the standard
null hypotheses.
DATA DESK
• Select Y- and X-variable icons.
• From the Calc menu, choose Regression.
• Data Desk displays the regression table.
• Select plots of residuals from the Regression table's HyperView menu.

Comments
You can change the regression by dragging the icon of another variable over either the Y- or an X-variable name in the table and dropping it there. You can add a predictor by dragging its icon into that part of the table. The regression will recompute automatically.
EXCEL
• From the Tools menu, select Data Analysis.
• Select Regression from the Analysis Tools list.
• Click the OK button.
• Enter the data range holding the Y-variable in the box labeled "Y-range."
• Enter the range of cells holding the X-variables in the box labeled "X-range."
• Select the New Worksheet Ply option.
• Select Residuals options. Click the OK button.

Comments
The Y and X ranges do not need to be in the same rows of the spreadsheet, although they must cover the same number of cells. But it is a good idea to arrange your data in parallel columns as in a data table. The X-variables must be in adjacent columns. No cells in the data range may hold non-numeric values.
Although the dialog offers a Normal probability plot of the residuals, the data analysis add-in does not make a correct probability plot, so don't use this option.
JMP
• From the Analyze menu select Fit Model.
• Specify the response, Y. Assign the predictors, X, in the Construct Model Effects dialog box.
• Click on Run Model.

Comments
JMP chooses a regression analysis when the response variable is "Continuous." The predictors can be any combination of quantitative or categorical. If you get a different analysis, check the variable types.
MINITAB
• Choose Regression from the Stat menu.
• Choose Regression. . . from the Regression submenu.
• In the Regression dialog, assign the Y-variable to the Re-
sponse box and assign the X-variables to the Predictors box.
• Click the Graphs button.
• In the Regression-Graphs dialog, select Standardized residu-
als, and check Normal plot of residuals and Residuals ver-
sus fits.
• Click the OK button to return to the Regression dialog.
• To specify displays, click Graphs, and check the displays you
want.
• Click the OK button to return to the Regression dialog.
• Click the OK button to compute the regression.
SPSS
• Choose Regression from the Analyze menu.
• Choose Linear from the Regression submenu.
• When the Linear Regression dialog appears, select the Y-
variable and move it to the dependent target. Then move the X-
variables to the independent target.
• Click the Plots button.
• In the Linear Regression Plots dialog, choose to plot the
*SRESIDs against the *ZPRED values.
• Click the Continue button to return to the Linear Regression
dialog.
• Click the OK button to compute the regression.
TI-83/84 Plus
Comments
You need a special program to compute a multiple regression on
the TI-83.
TI-89
Under STAT Tests choose B:MultREg Tests Comments
• Specify the number of predictor variables, and which lists con- • The first portion of the output gives the F-statistic and its
tain the response variable and predictor variables. P-value as well as the values of R 2, AdjR 2, the standard devia-
• Press e to perform the calculations. tion of the residuals (s), and the Durbin-Watson statistic, which
measures correlation among the residuals.
• The rest of the main output gives the components of the F-test,
as well as values of the coefficients, their standard errors, and
associated t-statistics along with P-values.You can use the right
arrow to scroll through these lists (if desired).
• The calculator creates several new lists that can be used for
assessing the model and its conditions: Yhatlist, resid, sresid
(standardized residuals), leverage, and cookd, as well as lists
of the coefficients, standard errors, t’s, and P-values.
EXERCISES
1. Interpretations. A regression performed to predict selling price of houses found the equation
predicted price = 169328 + 35.3 area + 0.718 lotsize - 6543 age
where price is in dollars, area is in square feet, lotsize is in square feet, and age is in years. The R2 is 92%. One of the interpretations below is correct. Which is it? Explain what's wrong with the others.
a) Each year a house ages it is worth $6543 less.
b) Every extra square foot of area is associated with an additional $35.30 in average price, for houses with a given lotsize and age.
c) Every dollar in price means lotsize increases 0.718 square feet.
d) This model fits 92% of the data points exactly.

2. More interpretations. A household appliance manufacturer wants to analyze the relationship between total sales and the company's three primary means of advertising (television, magazines, and radio). All values were in millions of dollars. They found the regression equation
predicted sales = 250 + 6.75 TV + 3.5 radio + 2.3 magazines.
One of the interpretations below is correct. Which is it? Explain what's wrong with the others.
a) If they did no advertising, their income would be $250 million.
b) Every million dollars spent on radio makes sales increase $3.5 million, all other things being equal.
c) Every million dollars spent on magazines increases TV spending $2.3 million.
d) Sales increase on average about $6.75 million for each million spent on TV, after allowing for the effects of the other kinds of advertising.

3. Predicting final exams. How well do exams given during the semester predict performance on the final? One class had three tests during the semester. Computer output of the regression gives

Dependent variable is Final
s = 13.46   R-Sq = 77.7%   R-Sq(adj) = 74.1%

Predictor   Coeff     SE(Coeff)   t       P-value
Intercept   -6.72     14.00       -0.48   0.636
Test1        0.2560    0.2274      1.13   0.274
Test2        0.3912    0.2198      1.78   0.091
Test3        0.9015    0.2086      4.32   <0.0001

Analysis of Variance
Source       DF   SS        MS       F       P-value
Regression    3   11961.8   3987.3   22.02   <0.0001
Error        19    3440.8    181.1
Total        22   15402.6
Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept   -521.995      78.39       -6.66     <0.0001
Distance     351.879      12.25       28.7      <0.0001
Climb          0.643396    0.0409     15.7      <0.0001

a) Write the regression equation. Give a brief report on what it says about men's record times in hill races.
b) Interpret the value of R2 in this regression.
c) What does the coefficient of climb mean in this regression?

Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept   -554.015      101.7       -5.45     <0.0001
Distance     418.632       15.89      26.4      <0.0001
Climb          0.780568     0.0531    14.7      <0.0001

a) Compare the regression model for the women's records with that found for the men's records in Exercise 4.
Here's a scatterplot of the residuals for this regression:
[Scatterplot: residuals vs. Predicted (min)]
b) Discuss the residuals and what they say about the assumptions and conditions for this regression.

Information for a random sample of homes for sale in the Statesboro, GA, area was obtained from the Internet. Regression output modeling the asking price with square footage and number of bathrooms gave the following result:

Dependent Variable is: Price
s = 67013   R-Sq = 71.1%   R-Sq (adj) = 64.6%

Predictor   Coeff       SE(Coeff)   T       P-value
Intercept   -152037     85619       -1.78   0.110
Baths          9530     40826        0.23   0.821
Sq ft           139.87     46.67     3.00   0.015
7. Predicting finals II. Here are some diagnostic plots for the final exam data from Exercise 3. These were generated by a computer package and may look different from the plots generated by the packages you use. (In particular, note that the axes of the Normal probability plot are swapped relative to the plots we've made in the text. We only care about the pattern of this plot, so it shouldn't affect your interpretation.) Examine these plots and discuss whether the assumptions and conditions for the multiple regression seem reasonable.

[Plot: Residuals vs. the Fitted Values (Response is Final); Residual (points) on the y-axis]

8. Secretary performance. The AFL-CIO has undertaken a study of 30 secretaries' yearly salaries (in thousands of dollars). The organization wants to predict salaries from several other variables.
The variables considered to be potential predictors of salary are:
X1 = months of service
X2 = years of education
X3 = score on standardized test
X4 = words per minute (wpm) typing speed
X5 = ability to take dictation in words per minute
A multiple regression model with all five variables was run on a computer package, resulting in the following output:
[Normal probability plot: Normal Score vs. Residual ($), and Histogram of the Residuals (Response is Price), Frequency on the y-axis]

[Scatterplot: %Body Fat vs. Weight (lb)]

And here's the simple regression:

Dependent variable is: Pct BF
R-squared = 38.1%   R-squared (adjusted) = 37.9%
s = 6.538 with 250 - 2 = 248 degrees of freedom

Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept   -14.6931      2.760       -5.32     <0.0001
Weight        0.18937     0.0153      12.4      <0.0001
Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept   -31.4830      11.54       -2.73     0.0068
Waist         2.31848      0.1820     12.7      <0.0001
Height       -0.224932     0.1583     -1.42     0.1567
Weight       -0.100572     0.0310     -3.25     0.0013

c) Interpret the slope for weight. How can the coefficient for weight in this model be negative when its coefficient was positive in the simple regression model?
d) What does the P-value for height mean in this regression? (Perform the hypothesis test.)

T 12. Breakfast cereals. We saw in Chapter 8 that the calorie content of a breakfast cereal is linearly associated with its sugar content. Is that the whole story? Here's the output of a regression model that regresses calories for each serving on its protein(g), fat(g), fiber(g), carbohydrate(g), and sugars(g) content.

Dependent variable is: calories
R-squared = 84.5%   R-squared (adjusted) = 83.4%
s = 7.947 with 77 - 6 = 71 degrees of freedom

Source       Sum of Squares   df   Mean Square   F-ratio
Regression   24367.5           5   4873.50       77.2
Residual      4484.45         71     63.1613

Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept   20.2454       5.984        3.38     0.0012
Protein      5.69540      1.072        5.32     <0.0001
Fat          8.35958      1.033        8.09     <0.0001
Fiber       -1.02018      0.4835      -2.11     0.0384
Carbo        2.93570      0.2601      11.3      <0.0001
Sugars       3.31849      0.2501      13.3      <0.0001

Assuming that the conditions for multiple regression are met,
a) What is the regression equation?
b) Do you think this model would do a reasonably good job at predicting calories? Explain.
c) To check the conditions, what plots of the data might you want to examine?
d) What does the coefficient of fat mean in this model?

T 13. Body fat again. Chest size might be a good predictor of body fat. Here's a scatterplot of %body fat vs. chest size.

[Scatterplot: %Body Fat vs. chest size]

A regression of %body fat on chest size gives the following equation:

Dependent variable is: Pct BF
R-squared = 49.1%   R-squared (adjusted) = 48.9%
s = 5.930 with 250 - 2 = 248 degrees of freedom

Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept   -52.7122      4.654       -11.3     <0.0001
Chest         0.712720    0.0461       15.5     <0.0001

a) Is the slope of %body fat on chest size statistically distinguishable from 0? (Perform a hypothesis test.)
b) What does the answer in part a mean about the relationship between %body fat and chest size?
We saw before that the slopes of both waist size and height are statistically significant when entered into a multiple regression equation. What happens if we add chest size to that regression? Here is the output from a regression on all three variables:

Dependent variable is: Pct BF
R-squared = 72.2%   R-squared (adjusted) = 71.9%
s = 4.399 with 250 - 4 = 246 degrees of freedom

Source       Sum of Squares   df    Mean Square   F-ratio   P
Regression   12368.9           3    4122.98       213       <0.0001
Residual      4759.87        246      19.3491

Variable    Coefficient   SE(Coeff)   t-ratio   P-value
Intercept    2.07220      7.802        0.266    0.7908
Waist        2.19939      0.1675      13.1      <0.0001
Height      -0.561058     0.1094      -5.13     <0.0001
Chest       -0.233531     0.0832      -2.81     0.0054

c) Interpret the coefficient for chest.
d) Would you consider removing any of the variables from this regression model? Why or why not?

T 14. Grades. The table below shows the five scores from an introductory Statistics course. Find a model for predicting final exam score by trying all possible models with two predictor variables. Which model would you choose? Be sure to check the conditions for multiple regression.

Name         Final   Midterm 1   Midterm 2   Project   Homework
Timothy F.   117     82          30          10.5      61
Karen E.     183     96          68          11.3      72

Source       Sum of Squares   df   Mean Square   F-ratio   P-value
Regression   11211.1           3   3737.05       15.1      <0.0001
Residual     17583.5          71    247.655