Multiple Regression
Readings: Chapter 11
Introduction
• Simple Linear Regression (Chapters 2 and 10): used when there is a single quantitative
explanatory variable.
• Multiple Regression: used when there are 2 or more quantitative explanatory variables
which will be used to predict the quantitative response variable.
The Model
y = β0 + β1 x + ε (simple linear regression, one predictor)
y = β0 + β1 x1 + β2 x2 + . . . + βp xp + ε (multiple regression, p predictors)
Conditions to check:
• Independence: Responses yi ’s are independent of each other (examine the way in which sub-
jects/units were selected in the study).
• Normality: For any fixed value of x, the response y varies according to a normal distribution
(normal probability plot of the residuals).
• Linearity: The mean response has a linear relationship with x (scatter plot of y against each
predictor variable).
• Constant variability: The standard deviation of y (σ) is the same for all values of x (scatter
plots of residuals against predicted values).
Steps in a multiple regression analysis:
1. Look at each variable individually.
– Means, standard deviations, minimums, maximums, and outliers (if any), plus stem plots or
histograms, are all good ways to show what is happening with your individual variables.
– In SPSS, Analyze → Descriptive Statistics → Explore.
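(Outside SPSS, the same summaries are easy to get. Below is a minimal Python sketch, assuming the data have been loaded into a pandas DataFrame; the file name and column names are hypothetical.)

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical file and column names; substitute your own data.
    df = pd.read_csv("mydata.csv")  # columns: y, x1, x2, x3

    # Means, standard deviations, minimums, maximums, and quartiles
    print(df.describe())

    # A histogram of each variable helps spot skewness and outliers
    df.hist(figsize=(8, 6))
    plt.show()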
2. Look at the relationships between the variables using correlations and scatter plots.
– In SPSS, Analyze → Correlate → Bivariate. Put all your variables (all the x’s and y) into
the “variables” box, and hit “ok”.
– The correlations help us determine which relationships between y and each x are
strongest.
– Are there strong x-to-x relationships?
– Look at scatter plots between each pair of variables, too.
– We are only interested in keeping the variables which had strong relationships with the
response variable y.
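(A Python sketch of the same step, continuing with the hypothetical DataFrame df from the sketch above.)

    from pandas.plotting import scatter_matrix
    import matplotlib.pyplot as plt

    # Correlation of every pair of variables: y with each x, and x with x
    print(df.corr())

    # A scatter plot for each pair of variables
    scatter_matrix(df, figsize=(8, 8))
    plt.show()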
3. Do a regression using all the potential explanatory variables.
– This will include an ANOVA table and coefficients output like in Chapter 10. These
regression results will indicate/confirm which relationships are strong.
– The regression equation is
ŷ = b0 + b1 x1 + . . . + bp xp
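(In Python, the statsmodels package produces the same kind of coefficients table and ANOVA F-test that SPSS reports. A sketch, again with the hypothetical variable names used above.)

    import statsmodels.formula.api as smf

    # Fit y-hat = b0 + b1*x1 + b2*x2 + b3*x3 by least squares
    model = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

    # The summary shows R-squared, the ANOVA F statistic and its p-value,
    # and a t test and 95% confidence interval for each coefficient.
    print(model.summary())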
• We had ANOVA results for simple linear regression in Chapter 10, but since there was only
one regression coefficient β1 , we didn't need to use them.
• Coefficient of determination R2 :
R2 = SSM/SST.
– R2 measures the fraction of the variation in the values of y that is explained by the linear
regression of y on x1 , x2 , . . . , xp .
– It measures the amount of linear association between the response variable y and the
multiple explanatory variables.
– Hypotheses:
H0 : β1 = β2 = . . . = βp = 0,
Ha : At least one of the regression coefficients is not zero.
– Test statistic:
F = MSM/MSE.
When H0 is true, the F statistic follows the F (p, n − p − 1) distribution. When Ha is true,
the F statistic tends to be large.
– P-value (read from the SPSS output).
– State conclusions in terms of the problem.
– Note:
∗ The F -test is an overall test that tells us whether we want to proceed.
∗ Rejecting H0 means that we need further analysis (individual t-tests) to see which
regression coefficients are different from zero (think back: we used the Bonferroni multiple
comparison procedure when we rejected the null hypothesis in a one-way ANOVA F -
test).
∗ Even if the p-value is small, we still need to look at R2 . If the R2 is small, it means
that the model (variables) we are using does not do a very good job of explaining the
variation in y.
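(If you only have the mean squares from the ANOVA table, the p-value can be computed directly from the F distribution. A sketch; the numbers below are placeholders, not from any output in these notes.)

    from scipy import stats

    msm, mse = 2000.0, 100.0   # placeholder mean squares for model and error
    p, n = 3, 30               # number of predictors and sample size

    F = msm / mse
    p_value = stats.f.sf(F, dfn=p, dfd=n - p - 1)  # upper-tail area
    print(f"F = {F:.2f}, p-value = {p_value:.4g}")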
– Confidence intervals for the individual regression coefficients have the form
bj ± t∗ SEbj ,
∗ Note: SPSS will give us 95% confidence intervals, but you may have to use the estimates
for the coefficients and their standard errors to find other confidence intervals (use t
table and n − p − 1 degrees of freedom to get t∗ ).
– The individual t test of H0 : βj = 0 uses the test statistic
t = bj /SEbj ,
which has the t(n − p − 1) distribution when H0 is true.
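(A sketch of both calculations from a reported estimate and standard error; the numbers are placeholders, not from any output in these notes.)

    from scipy import stats

    b_j, se_bj = 3.80, 1.25   # placeholder estimate and standard error
    n, p = 30, 3              # sample size and number of predictors
    dof = n - p - 1           # degrees of freedom for the t distribution

    # 95% confidence interval: bj +/- t* SEbj
    t_star = stats.t.ppf(0.975, df=dof)
    print(f"95% CI: ({b_j - t_star * se_bj:.3f}, {b_j + t_star * se_bj:.3f})")

    # Individual t test of H0: beta_j = 0
    t = b_j / se_bj
    print(f"t = {t:.3f}, p-value = {2 * stats.t.sf(abs(t), df=dof):.4f}")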
4. Interpretation of results.
– Sometimes variables that are significant by themselves may not be significant when other
variables are included too.
– The significance tests for individual regression coefficients assess the significance of each
predictor variable assuming that all other predictors are included in the regression equation.
5. Residuals.
– Use residuals to help determine whether the multiple regression model is appropriate for
the data.
– Plot residuals versus each of the explanatory variables and versus the response variable.
– Look for outliers, influential observations, evidence of a nonlinear relation, and anything
else unusual.
– Use a normal probability plot to check whether the residuals are normally distributed
(the points should fall along a straight, increasing line).
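(A Python sketch of these checks, continuing with the hypothetical fitted model and DataFrame from the earlier sketches.)

    import matplotlib.pyplot as plt
    from scipy import stats

    resid = model.resid          # residuals from the fitted model
    fitted = model.fittedvalues  # predicted values

    # Residuals versus predicted values: look for curvature or fanning
    plt.scatter(fitted, resid)
    plt.axhline(0, color="gray")
    plt.xlabel("Predicted values"); plt.ylabel("Residuals")
    plt.show()

    # Residuals versus each explanatory variable (hypothetical names)
    for x in ["x1", "x2", "x3"]:
        plt.scatter(df[x], resid)
        plt.axhline(0, color="gray")
        plt.xlabel(x); plt.ylabel("Residuals")
        plt.show()

    # Normal probability plot: points should follow a straight line
    stats.probplot(resid, plot=plt)
    plt.show()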
6. Refine the model - We are interested in keeping only the variables with the strongest
relationships.
– Try deleting the variable with the largest p-value (the weakest relationship), and
re-run the regression. You may have to do this again and again, each time deleting a
variable with a weak relationship.
– Check to see if R2 , s, p-values from the F -test and individual t-tests change much.
∗ R2 should not drop too much when you remove a variable.
∗ The standard deviation s should be as small as possible.
∗ The test statistic from the ANOVA F -test should be as large as possible and its p-value
as small as possible.
∗ Any x variables left in the equation should have a significant p-value from their t-tests
(their coefficient confidence intervals should not contain 0), unless taking out a slightly
insignificant coefficient makes R2 and s move in the wrong direction.
How do we know which variables should be included in our model and which should
not?
• Procedure 1: Start with a model that contains all your explanatory variables with strong
correlations, run the regression, and then remove one at a time whichever variables aren’t
significant from the t-test until you find that your R2 starts to decrease too rapidly or your s
goes up too rapidly. You may end up leaving in one or more variables which are not significant
on their own. You just have to see what removing them does to the whole model. (This is the
procedure that we will follow in the lecture notes and that you should use for this
class.)
• Procedure 2: Start with a model that contains only one explanatory variable and add one
variable at a time till you find that your R2 is no longer increasing rapidly.
Sometimes there may be more than one appropriate choice for your model. The
most important thing is to be able to explain why you chose the model you did.
Not every model is as easy to define as the one in the CHEESE example below.
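(Procedure 1 can be automated; below is a rough Python sketch, not a definitive recipe. For this class you should still examine R2, s, and the F -test after each deletion rather than trust p-values alone. The variable names and the 0.05 cutoff are assumptions for illustration.)

    import statsmodels.formula.api as smf

    predictors = ["x1", "x2", "x3"]   # hypothetical variable names

    while predictors:
        model = smf.ols("y ~ " + " + ".join(predictors), data=df).fit()
        pvals = model.pvalues.drop("Intercept")  # individual t-test p-values
        worst = pvals.idxmax()
        if pvals[worst] <= 0.05:
            break  # every remaining predictor is significant; stop
        print(f"dropping {worst} (p = {pvals[worst]:.3f}); "
              f"R^2 = {model.rsquared:.3f}, s = {model.mse_resid ** 0.5:.3f}")
        predictors.remove(worst)

    if predictors:
        print("final model:", model.model.formula)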
Example 1: As cheddar cheese matures a variety of chemical processes take place. The taste
of mature cheese is related to the concentration of several chemicals in the final product. In a
study of cheddar cheese from the La Trobe Valley of Victoria, Australia, samples of cheese were
analyzed for their chemical composition and were subjected to taste tests. Data for one type of
cheese-manufacturing process appear below. The variable “Case” is used to number the
observations from 1 to 30. “Taste” is the response variable of interest. The taste scores were
obtained by combining the scores from several tasters.
Three chemicals whose concentrations were measured were acetic acid, hydrogen sulfide, and
lactic acid. For acetic acid and hydrogen sulfide, natural log transformations were taken.
Thus the explanatory variables are the transformed concentrations of acetic acid (“Acetic”) and
hydrogen sulfide (“H2S”) and the untransformed concentration of lactic acid (“Lactic”). These
data are based on experiments performed by G. T. Lloyd and E. H. Ramshaw of the CSIRO
Division of Food Research, Victoria, Australia.
Case Taste Acetic H2S Lactic
1 12.3 4.543 3.135 0.86
2 20.9 5.159 5.043 1.53
3 39 5.366 5.438 1.57
4 47.9 5.759 7.496 1.81
5 5.6 4.663 3.807 0.99
6 25.9 5.697 7.601 1.09
7 37.3 5.892 8.726 1.29
8 21.9 6.078 7.966 1.78
9 18.1 4.898 3.85 1.29
10 21 5.242 4.174 1.58
11 34.9 5.74 6.142 1.68
12 57.2 6.446 7.908 1.9
13 0.7 4.477 2.996 1.06
14 25.9 5.236 4.942 1.3
15 54.9 6.151 6.752 1.52
16 40.9 6.365 9.588 1.74
17 15.9 4.787 3.912 1.16
18 6.4 5.412 4.7 1.49
19 18 5.247 6.174 1.63
20 38.9 5.438 9.064 1.99
21 14 4.564 4.949 1.15
22 15.2 5.298 5.22 1.33
23 32 5.455 9.242 1.44
24 56.7 5.855 10.199 2.01
25 16.8 5.366 3.664 1.31
26 11.6 6.043 3.219 1.46
27 26.5 6.458 6.962 1.72
28 0.7 5.328 3.912 1.25
29 13.4 5.802 6.685 1.08
30 5.5 6.176 4.787 1.25
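(The SPSS output pages from the original handout are not reproduced here. As a sketch, the analysis in parts a through h can be run in Python on the data above; statsmodels and pandas are assumed to be available.)

    import pandas as pd
    import statsmodels.formula.api as smf
    from io import StringIO

    # The CHEESE data from the table above
    cheese = pd.read_csv(StringIO("""Case,Taste,Acetic,H2S,Lactic
    1,12.3,4.543,3.135,0.86
    2,20.9,5.159,5.043,1.53
    3,39,5.366,5.438,1.57
    4,47.9,5.759,7.496,1.81
    5,5.6,4.663,3.807,0.99
    6,25.9,5.697,7.601,1.09
    7,37.3,5.892,8.726,1.29
    8,21.9,6.078,7.966,1.78
    9,18.1,4.898,3.85,1.29
    10,21,5.242,4.174,1.58
    11,34.9,5.74,6.142,1.68
    12,57.2,6.446,7.908,1.9
    13,0.7,4.477,2.996,1.06
    14,25.9,5.236,4.942,1.3
    15,54.9,6.151,6.752,1.52
    16,40.9,6.365,9.588,1.74
    17,15.9,4.787,3.912,1.16
    18,6.4,5.412,4.7,1.49
    19,18,5.247,6.174,1.63
    20,38.9,5.438,9.064,1.99
    21,14,4.564,4.949,1.15
    22,15.2,5.298,5.22,1.33
    23,32,5.455,9.242,1.44
    24,56.7,5.855,10.199,2.01
    25,16.8,5.366,3.664,1.31
    26,11.6,6.043,3.219,1.46
    27,26.5,6.458,6.962,1.72
    28,0.7,5.328,3.912,1.25
    29,13.4,5.802,6.685,1.08
    30,5.5,6.176,4.787,1.25
    """))

    # Part a: numerical summaries for each variable
    print(cheese[["Taste", "Acetic", "H2S", "Lactic"]].describe())

    # Part c: correlation of each pair of variables
    print(cheese[["Taste", "Acetic", "H2S", "Lactic"]].corr())

    # Parts e-h: regress Taste on all three explanatory variables;
    # the summary gives the ANOVA F test, t tests, and 95% intervals
    full = smf.ols("Taste ~ Acetic + H2S + Lactic", data=cheese).fit()
    print(full.summary())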
a. For each of the variables in the data set, find the mean, median, standard deviation, and IQR.
Display each distribution with a boxplot.
b. Make a scatter plot for each pair of variables in the CHEESE data set (you will have 6
plots). Describe the relationships.
c. Which explanatory variables (x’s–Acetic, H2S, Lactic) are most strongly correlated to the
response variable (y, taste)? Calculate the correlation for each pair of variables and report
the P-value for the test of zero population correlation in each case.
d. Which variables look important at this time?
e. Perform a multiple regression using the explanatory variables which look important at this
point. Give the fitted regression equation.
f. State your hypotheses for an ANOVA F -test, give the test statistic and its p-value, and
state your conclusion.
g. Report the t test statistics and the p-values for the tests of the regression coefficients of
your explanatory variables. What conclusions do you draw from these tests?
h. Give the 95% confidence intervals for the regression coefficients of your explanatory vari-
ables. Do any of the intervals contain the point 0? (This should verify your answer to part
g).
k. One variable looks like a good candidate to be dropped. Which one is it? Try running the
multiple regression again without this variable. Look at parts e through j again.
What changed? What stayed the same or improved?

       Full model (Acetic, H2S, Lactic)   Reduced model (H2S, Lactic)
R2     65.2%                              65.2%
s      10.1307                            9.9424
l. Using the better model, predict the “taste” for an H2S=4 and Lactic=1.
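(Continuing the sketch after the data table: refit without the dropped variable and predict. The choice of H2S and Lactic as the kept variables follows part l.)

    # Part k: refit without Acetic, then predict for part l
    reduced = smf.ols("Taste ~ H2S + Lactic", data=cheese).fit()
    print(reduced.summary())

    new = pd.DataFrame({"H2S": [4], "Lactic": [1]})
    print(reduced.predict(new))   # predicted taste at H2S = 4, Lactic = 1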
m. Now look at a residual plot for each of the variables you still have in the model. Do a
normal probability plot, too.