16 Review of Part II
Topics Outline
Inference for Regression
Multiple Regression
Building Regression Models
21. Collinearity
29. Adjusted r²
30. Cp statistic
Example 1
Florida reappraises real estate every year, so the county appraiser's website lists the current
fair market value of each piece of property. Property usually sells for somewhat more than the
appraised market value. Data for the appraised market values and actual selling prices
(in thousands of dollars) of 16 condominium units sold in a beachfront building over a 19-month
period are stored in the file Condominiums.xlsx.
Condominium    Selling Price    Appraised Value
1              850              758.0
2              900              812.7
...            ...              ...
15             1325             1031.8
16             845              586.7
Excel output for a linear regression of selling price on appraised value is shown on the next page.
(a) Write the equation for the model of the population regression line.
μ_y = α + βx
(Equivalently, the data model is y = α + βx + ε.)
(c) What is the equation of the least-squares regression line for predicting selling price
from appraised value?
ŷ = 127.27 + 1.0466x
(d) What is the correlation between appraised value and selling price?
The correlation r is the square root of r².
r = √0.861 ≈ 0.93
(We take the positive square root because the sign of r must be the same as the sign of
the slope, 1.0466.)
Reminder: For simple and multiple regression, r is the correlation between the observed values of y
and the predicted values ŷ. For simple linear regression, r is also the correlation between x and y.
(e) Explain why the pattern you see on the residual plot agrees with the conditions of
linear relationship and constant standard deviation needed for regression inference.
On the residual plot, as usual a horizontal line is added at residual zero, the mean of the residuals.
This line corresponds to the regression line in the plot of selling price against appraised value.
The residuals show a random scatter about the line, with roughly equal vertical spread across
their range. This is what we expect when the conditions for regression inference hold.
(f) Does the histogram of the residuals suggest lack of normality?
The distribution of residuals has a bit of a cluster at the left, but there are no outliers or
other strong deviations from normality that would prevent regression inference.
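If you want to reproduce these diagnostic plots outside Excel, a minimal Python sketch follows. It assumes the workbook Condominiums.xlsx can be read directly and that its columns are named "Appraised Value" and "Selling Price" (the column names are an assumption).

```python
# Minimal sketch of the plots discussed in (e) and (f); column names assumed.
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

condos = pd.read_excel("Condominiums.xlsx")
X = sm.add_constant(condos["Appraised Value"])        # intercept + predictor
fit = sm.OLS(condos["Selling Price"], X).fit()

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(condos["Appraised Value"], fit.resid)     # residual plot
ax1.axhline(0)                                        # line at residual = 0
ax1.set_xlabel("Appraised Value")
ax1.set_ylabel("Residual")
ax2.hist(fit.resid, bins=6)                           # histogram of residuals
ax2.set_xlabel("Residual")
ax2.set_ylabel("Frequency")
plt.show()
```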
Regression Statistics
Multiple R           0.9277
R Square             0.8606
Adjusted R Square    0.8506
Standard Error       69.7299
Observations         16

ANOVA
              df    SS             MS             F          Significance F
Regression    1     420072.1418    420072.1418    86.3945    0.0000
Residual      14    68071.6082     4862.2577
Total         15    488143.7500

                   Coefficients    Standard Error    t Stat    P-value    Lower 95%    Upper 95%
Intercept          127.2705        79.4892           1.6011    0.1317     -43.2168     297.7578
Appraised Value    1.0466          0.1126            9.2949    0.0000     0.8051       1.2881
[Figure: Scatterplot of Selling Price versus Appraised Value with the least-squares line]
[Figure: Residual plot of the residuals versus Appraised Value]
[Figure: Histogram of the residuals]
(g) How many degrees of freedom does the t distribution used for statistical inference on these data have?
There are n = 16 data pairs, so df = n − 1 − 1 = 16 − 2 = 14.
Reminder:
df = n − k − 1, where n is the number of observations and k is the number of explanatory variables.
(h) Explain what the slope of the true regression line means in this setting.
β is the average rate of increase in selling price in a population of condominium units when
appraised value increases by $1,000.
(i) Find a 95% confidence interval for the population slope β.
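A quick way to verify the interval is to compute b ± t* SE(b) directly; the sketch below uses the slope and standard error from the Excel output and the t critical value for 14 degrees of freedom.

```python
# Minimal sketch of the interval b ± t* SE(b), using the Excel output values.
from scipy.stats import t

b, se_b, df = 1.0466, 0.1126, 14
t_star = t.ppf(0.975, df)                  # about 2.145
lower, upper = b - t_star * se_b, b + t_star * se_b
print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
# Agrees with the Lower 95% / Upper 95% columns in the output: (0.8051, 1.2881).
```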
Example 2
Demand and Cost for Electricity
The Public Service Electric Company produces different quantities of electricity each month,
depending on the demand. The file Cost_of_Power.xlsx lists the number of Units of electricity
produced and the total Cost (in dollars) of producing these units for a 36-month period.
Month    Cost     Units
1        45623    601
2        46507    738
...      ...      ...
35       45218    705
36       45357    637
(a) What does the scatterplot of Cost versus Units reveal about the relationship between Cost and Units?
[Figure: Scatterplot of Cost versus Units]
The scatterplot indicates a definite positive relationship and one that is nearly linear.
However, there is also some evidence of curvature in the plot. The points increase slightly
less rapidly as Units increases from left to right. In economic terms, there might be
economies of scale, so that the marginal cost of electricity decreases as more units of
electricity are produced.
(b) The output for a simple linear regression is shown on the next page.
Does the residual plot suggest the need for a nonlinear transformation?
The residuals to the far left and the far right are all negative, whereas the majority of the
residuals in the middle are positive. This negative-positive-negative behavior of residuals
suggests a parabola. Admittedly, the pattern is far from a perfect parabola because there are
several negative residuals in the middle. However, this plot certainly suggests nonlinear
behavior and exploring a quadratic relationship with the square of Units included in the
equation is reasonable.
Summary
Multiple R           0.8579
R-Square             0.7359
Adjusted R-Square    0.7282
StErr of Estimate    2733.7

ANOVA Table
              Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio    p-Value
Explained     1                     708085273.8       708085273.8        94.7481    < 0.0001
Unexplained   34                    254093815.2       7473347.506

Regression Table
           Coefficient    Standard Error    t-Value    p-Value     Lower      Upper
Constant   23651.5        1917.1            12.3369    < 0.0001    19755.4    27547.6
Units      30.533         3.137             9.7339     < 0.0001    24.158     36.908
[Figure: Residual plot of the residuals versus the fitted values]
(c) The regression output for estimating a quadratic relationship between Cost and Units is
shown on the next page. What is the estimated regression equation? Does it provide a better
fit than the linear equation?
The estimated regression equation is
Predicted Cost = 5792.80 + 98.350 Units − 0.0600 (Units)²
The graph of the regression equation superimposed on the scatterplot of Cost versus Units
shows a reasonably good fit, plus an obvious curvature.
The quadratic model provides a better fit, as indicated by the coefficient of determination r²,
which has increased from 73.6% to 82.2%, and the standard error of estimate s_e, which has
decreased from $2,734 to $2,281.
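For reference, a small Python sketch of this comparison, assuming Cost_of_Power.xlsx can be read directly and has columns named Units and Cost (the column names are an assumption):

```python
# Minimal sketch: fit the linear and quadratic models and compare r^2 and s_e.
import pandas as pd
import statsmodels.formula.api as smf

power = pd.read_excel("Cost_of_Power.xlsx")
linear = smf.ols("Cost ~ Units", data=power).fit()
quadratic = smf.ols("Cost ~ Units + I(Units**2)", data=power).fit()

for name, model in [("Linear", linear), ("Quadratic", quadratic)]:
    s_e = model.mse_resid ** 0.5           # standard error of estimate
    print(f"{name}: R-Square = {model.rsquared:.4f}, s_e = {s_e:.1f}")
```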
Summary
Multiple R           0.9064
R-Square             0.8216
Adjusted R-Square    0.8108
StErr of Estimate    2280.800

ANOVA Table
              Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio    p-Value
Explained     2                     790511518.3       395255759.1        75.9808    < 0.0001
Unexplained   33                    171667570.7       5202047.597

Regression Table
            Coefficient    Standard Error    t-Value    p-Value    Lower        Upper
Constant    5792.7983      4763.0585         1.2162     0.2325     -3897.7171   15483.3137
Units       98.3504        17.2369           5.7058     0.0000     63.2817      133.4191
(Units)^2   -0.0600        0.0151            -3.9806    0.0004     -0.0906      -0.0293
[Figure: Quadratic fit superimposed on the scatterplot of Cost versus Units]
H₀: β₂ = 0 (Including the quadratic term does not significantly improve the model.)
Hₐ: β₂ ≠ 0 (Including the quadratic term significantly improves the model.)
To test these hypotheses, we use the test statistic t = −3.98 with df = n − k − 1 = 36 − 2 − 1 = 33 and P-value = 0.0004.
The small P-value indicates that the quadratic effect is significant.
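The P-value can be checked directly from the t statistic and its degrees of freedom; a minimal sketch:

```python
# Two-sided P-value for the quadratic term: t = -3.9806 on 33 df.
from scipy.stats import t

p_value = 2 * t.sf(abs(-3.9806), 33)
print(f"P-value = {p_value:.4f}")          # about 0.0004, as in the output
```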
Notes:
1. The coefficient of (Units)², −0.0600, is negative, and it makes the parabola bend downward.
This produces the decreasing marginal cost behavior, where every extra unit of electricity
incurs a smaller cost. Actually, the curve described by the regression equation eventually
goes downhill for large values of Units, but this part of the curve is irrelevant because the
company evidently never produces such large quantities.
2. You should not be fooled by the small magnitude of this coefficient.
Remember that it multiplies the square of Units, which is a large quantity.
Therefore, the effect of the term −0.0600(Units)² is sizable; at Units = 700, for example, it
contributes about −0.0600(700)² = −29,400 dollars to the predicted cost.
(f) To examine the possibility for a logarithmic fit, a new variable Log(Units), the natural
logarithm of Units has been created. The output from a regression of Cost against
Log(Units) is shown on the next page. Interpret the slope of the regression line.
The estimated regression equation is
Predicted Cost = −63993 + 16654 Log(Units)
Reminder: If b is the coefficient of the log of x, then the expected change in y when x increases
by 1% is approximately 0.01 times b.
In the present case, you can interpret the slope coefficient as follows.
Suppose that Units increases by 1%, for example, from 600 to 606.
Then the regression equation implies that the expected Cost will increase by approximately
(0.01)(16654) = 166.54 dollars.
In words, every 1% increase in Units is accompanied by an expected $166.54 increase in Cost.
Note that for larger values of Units, a 1% increase represents a larger absolute increase
(from 700 to 707 instead of from 600 to 606, say). But each such 1% increase entails the same
increase in Cost. This is another way of describing the decreasing marginal cost property.
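A quick numerical check of this interpretation, using the fitted coefficient of Log(Units):

```python
# Change in predicted Cost when Units rises by 1% is b * ln(1.01) ~= 0.01 * b.
import math

b = 16654
print(b * (math.log(606) - math.log(600)))   # about 165.7
print(0.01 * b)                              # 166.54
```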
Summary
Multiple R           0.8931
R-Square             0.7977
Adjusted R-Square    0.7917
StErr of Estimate    2392.8

ANOVA Table
              Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio     p-Value
Explained     1                     767506900.9       767506900.9        134.0471    < 0.0001
Unexplained   34                    194672188.1       5725652.59

Regression Table
             Coefficient    Standard Error    t-Value    p-Value     Lower       Upper
Constant     -63993.3       9144.3            -6.9981    < 0.0001    -82576.8    -45409.8
Log(Units)   16653.6        1438.4            11.5779    < 0.0001    13730.4     19576.7
[Figure: Logarithmic fit superimposed on the scatterplot of Cost versus Units]
Summary of the quadratic and logarithmic fits:

               r²       r²_adj    s_e
Quadratic      82.2%    81.1%     $2,281
Logarithmic    79.8%    79.2%     $2,393
Example 3
Meddicorp
Meddicorp Company sells medical supplies to hospitals, clinics, and doctors' offices.
The company currently markets in three regions of the United States: the South, the West,
and the Midwest. These regions are each divided into many smaller sales territories.
Meddicorp management is concerned with the effectiveness of a new bonus program.
This program is overseen by regional sales managers and provides bonuses to salespeople based
on performance. Management wants to know if the bonuses paid in 2010 were related to sales.
(Obviously, if there is a relationship here, the managers expect it to be a direct positive one.)
In determining whether this relationship exists, they also want to take into account the effects of
advertising, market share, and competitors' sales. The variables to be used in the study include:
y = Sales: Meddicorp sales (in thousands of dollars) in each territory for 2010
x₁ = Adv: the amount Meddicorp spent on advertising in each territory (in hundreds of dollars) in 2010
x₂ = Bonus: the total amount of bonuses paid in each territory (in hundreds of dollars) in 2010
x₃ = MktShare: the percentage of the market currently held by Meddicorp in each territory
x₄ = Compet: competitors' sales in each territory in 2010
Territory    Sales (y)    Adv       Bonus     MktShare    Compet
1            ...          374.27    230.98    33          202.22
2            ...          408.50    236.28    29          252.77
...          ...          ...       ...       ...         ...
24           1583.75      583.85    289.29    27          313.44
25           1124.75      499.15    272.55    26          374.11
y = α + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄ + ε
(b) Interpret the equation of the true population surface.
The population regression equation
μ_y = α + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄
shows that the conditional mean of y given x₁, x₂, x₃, and x₄ is a point on the four-dimensional
hyperplane described by α + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄.
(c) Below are the least squares regression results. Conduct the F test for overall fit of the regression.
Regression of y on x₁ (Adv), x₂ (Bonus), x₃ (MktShare), x₄ (Compet)

Regression Statistics
Multiple R           0.9269
R Square             0.8592
Adjusted R Square    0.8310
Standard Error       93.7697
Observations         25

ANOVA
              df    SS              MS             F          Significance F
Regression    4     1073118.5420    268279.6355    30.5114    0.0000
Residual      20    175855.1980     8792.7599
Total         24    1248973.7400

            Coefficients    Standard Error    t Stat     P-value    Lower 95%    Upper 95%
Intercept   -593.5375       259.1959          -2.2899    0.0330     -1134.2105   -52.8644
Adv         2.5131          0.3143            7.9966     0.0000     1.8576       3.1687
Bonus       1.9059          0.7424            2.5673     0.0184     0.3574       3.4545
MktShare    2.6510          4.6357            0.5719     0.5738     -7.0188      12.3208
Compet      -0.1207         0.3718            -0.3247    0.7488     -0.8963      0.6549
H₀: β₁ = β₂ = β₃ = β₄ = 0
Hₐ: At least one βⱼ ≠ 0
Because of the small P-value (≈ 0) for the F statistic (= 30.51), we reject the null hypothesis
and conclude that at least one of the regression slopes (β₁, β₂, β₃, β₄) is not equal to zero.
This means that at least one of the variables ( x1 , x2 , x3 , x4 ) is important in explaining the
variation in y.
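A sketch of the same overall F test in Python, assuming the Meddicorp data are stored in a workbook (the file name Meddicorp.xlsx is hypothetical) with columns Sales, Adv, Bonus, MktShare, and Compet:

```python
# Minimal sketch of the overall F test for the four-variable model.
import pandas as pd
import statsmodels.formula.api as smf

med = pd.read_excel("Meddicorp.xlsx")      # hypothetical file name
full = smf.ols("Sales ~ Adv + Bonus + MktShare + Compet", data=med).fit()
print(f"F = {full.fvalue:.2f}, P-value = {full.f_pvalue:.6f}")   # F about 30.51
```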
(d) At the 0.05 significance level, test the significance of the relationship between y and each of
the explanatory variables.
The P-values for the four t tests are:
0.00 for Adv, 0.02 for Bonus, 0.57 for MktShare, 0.75 for Compet
Thus, the two explanatory variables x₁ (amount spent on advertising) and x₂ (amount of bonuses)
are significantly related to y (sales). The variables x₃ (market share) and x₄ (competitors' sales)
add little to the explanation of the variation in y (sales), given the other variables, and can be
excluded from the model.
(e) Below is the regression output for the model with x1 (amount spent on advertising) and
x 2 (amount of bonuses). Interpret the estimated regression equation and its slope coefficients.
Regression of y on x₁ (Adv), x₂ (Bonus)

Regression Statistics
Multiple R           0.9246
R Square             0.8549
Adjusted R Square    0.8418
Standard Error       90.7485
Observations         25

ANOVA
              df    SS              MS             F          Significance F
Regression    2     1067797.3206    533898.6603    64.8306    0.0000
Residual      22    181176.4194     8235.2918
Total         24    1248973.7400

            Coefficients    Standard Error    t Stat     P-value    Lower 95%    Upper 95%
Intercept   -516.4443       189.8757          -2.7199    0.0125     -910.2224    -122.6662
Adv         2.4732          0.2753            8.9832     0.0000     1.9022       3.0441
Bonus       1.8562          0.7157            2.5934     0.0166     0.3719       3.3405

[Figure: Residual plot of the residuals versus Predicted Sales]
[Figure: Scatterplots of Sales versus Adv and Sales versus Bonus]
After rounding, the least squares regression equation describing the relationship between sales
and the two explanatory variables may be written
Predicted Sales = −516.4 + 2.47 Adv + 1.86 Bonus
This equation can be interpreted as providing an estimate of mean sales for a given level of
advertising and bonus payment.
If bonus payment is held fixed, the equation shows that mean sales tends to rise by $2,470
(2.47 thousands of dollars) for each $100 spent on ads.
If advertising is held fixed, the equation shows that mean sales tends to rise by $1,860
(1.86 thousands of dollars) for each $100 of bonus paid.
(f) The best subsets regression procedure has been performed using all four explanatory variables.
Below is a summary of the results. Which is the best model according to the best subsets
regression technique?
Variables in the Regression        k + 1    Cp        r²       r²_adj    s_e
Adv                                2        5.90      0.811    0.802     101.42
Bonus                              2        75.19     0.323    0.293     191.76
Compet                             2        100.85    0.142    0.105     215.83
MktShare                           2        120.97    0.001    0.000     232.97
Adv, Bonus                         3        1.61      0.855    0.842     90.75
Adv, MktShare                      3        7.66      0.812    0.795     103.23
Adv, Compet                        3        7.74      0.812    0.795     103.38
Bonus, Compet                      3        68.03     0.387    0.332     186.51
Bonus, MktShare                    3        76.46     0.328    0.267     195.33
MktShare, Compet                   3        100.18    0.161    0.085     218.20
Adv, Bonus, MktShare               4        3.11      0.859    0.838     91.75
Adv, Bonus, Compet                 4        3.33      0.857    0.836     92.26
Adv, MktShare, Compet              4        9.59      0.813    0.786     105.52
Bonus, MktShare, Compet            4        66.95     0.409    0.325     187.48
Adv, Bonus, MktShare, Compet       5        5.00      0.859    0.831     93.71
Recall that small values of Cp, and values close to k + 1, are of interest in choosing good sets
of explanatory variables.
There are four competing models with relatively small Cp values:

Variables in the Regression        k + 1    Cp      r²       r²_adj    s_e
Adv, Bonus                         3        1.61    0.855    0.842     90.75
Adv, Bonus, MktShare               4        3.11    0.859    0.838     91.75
Adv, Bonus, Compet                 4        3.33    0.857    0.836     92.26
Adv, Bonus, MktShare, Compet       5        5.00    0.859    0.831     93.71
The smallest Cp value is for the regression with Adv and Bonus as explanatory variables.
It has a Cp value of 1.61 and explains 85.5% of the variation in sales.
Note that only modest increases in r² are achieved by the other three models.
The adjusted r² is highest and the standard error of estimate is smallest for the regression
with Adv and Bonus, again supporting this model as best.
Therefore, the best subsets procedure suggests using this model.
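The Cp values in the table can be reproduced from the sums of squares reported earlier. A minimal sketch for the Adv, Bonus subset, using Cp = SSE_subset/MSE_full − (n − 2(p + 1)):

```python
# Cp for the Adv, Bonus subset, from the regression outputs above.
n = 25
p = 2                          # explanatory variables in the subset
sse_subset = 181176.4194       # SSE for the Adv, Bonus model
mse_full = 8792.7599           # MSE for the four-variable model

cp = sse_subset / mse_full - (n - 2 * (p + 1))
print(f"Cp = {cp:.2f}")        # about 1.61, as in the best subsets table
```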
(g) Below is the StatTools output from forward selection, backward elimination, and stepwise
regression when applied to the Meddicorp data with P-value to enter = 0.05 and
P-value to leave = 0.10. Which model appears to be the best?
Forward selection

Summary
Multiple R           0.9246
R-Square             0.8549
Adjusted R-Square    0.8418
StErr of Estimate    90.7485

ANOVA Table
              Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio    p-Value
Explained     2                     1067797.3206      533898.6603        64.8306    < 0.0001
Unexplained   22                    181176.4194       8235.2918

Regression Table
           Coefficient    Standard Error    t-Value    p-Value    Lower       Upper
Constant   -516.4443      189.8757          -2.7199    0.0125     -910.2224   -122.6662
Adv        2.4732         0.2753            8.9832     0.0000     1.9022      3.0441
Bonus      1.8562         0.7157            2.5934     0.0166     0.3719      3.3405

Step Information
         Multiple R    R-Square    Adjusted R-Square    StErr of Estimate    Entry Number
Adv      0.9003        0.8106      0.8024               101.4173             1
Bonus    0.9246        0.8549      0.8418               90.7485              2
Backward elimination

Summary
Multiple R           0.9246
R-Square             0.8549
Adjusted R-Square    0.8418
StErr of Estimate    90.7485

ANOVA Table
              Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio    p-Value
Explained     2                     1067797.3206      533898.6603        64.8306    < 0.0001
Unexplained   22                    181176.4194       8235.2918

Regression Table
           Coefficient    Standard Error    t-Value    p-Value    Lower       Upper
Constant   -516.4443      189.8757          -2.7199    0.0125     -910.2224   -122.6662
Adv        2.4732         0.2753            8.9832     0.0000     1.9022      3.0441
Bonus      1.8562         0.7157            2.5934     0.0166     0.3719      3.3405

Step Information
                 Multiple R    R-Square    Adjusted R-Square    StErr of Estimate    Exit Number
All Variables    0.9269        0.8592      0.8310               93.7697
Compet           0.9265        0.8585      0.8382               91.7508              1
MktShare         0.9246        0.8549      0.8418               90.7485              2
Stepwise regression

Summary
Multiple R           0.9246
R-Square             0.8549
Adjusted R-Square    0.8418
StErr of Estimate    90.7485

ANOVA Table
              Degrees of Freedom    Sum of Squares    Mean of Squares    F-Ratio    p-Value
Explained     2                     1067797.3206      533898.6603        64.8306    < 0.0001
Unexplained   22                    181176.4194       8235.2918

Regression Table
           Coefficient    Standard Error    t-Value    p-Value    Lower       Upper
Constant   -516.4443      189.8757          -2.7199    0.0125     -910.2224   -122.6662
Adv        2.4732         0.2753            8.9832     0.0000     1.9022      3.0441
Bonus      1.8562         0.7157            2.5934     0.0166     0.3719      3.3405

Step Information
         Multiple R    R-Square    Adjusted R-Square    StErr of Estimate    Enter or Exit
Adv      0.9003        0.8106      0.8024               101.4173             Enter
Bonus    0.9246        0.8549      0.8418               90.7485              Enter
Regardless of the procedure used, the result is the same. The equation chosen is
Predicted Sales = −516.4 + 2.47 Adv + 1.86 Bonus
(h) Meddicorp markets in three regions of the United States: the South, the West, and the Midwest.
Management of Meddicorp believes that, in addition to advertising and bonus, the regions
it markets in may be important in explaining variation in sales.
What is the equation of the regression model that includes the region information?
Since there are three regions, two indicator variables have to be included in the model.
Let Midwest be the base category. Then the two dummy variables are:
x₃ = South = 1 if the territory is in the South, 0 otherwise
x₄ = West = 1 if the territory is in the West, 0 otherwise
The model is
y = α + β₁x₁ + β₂x₂ + β₃x₃ + β₄x₄ + ε
where the indicator values for the three regions are:

Region     x₃    x₄
South      1     0
West       0     1
Midwest    0     0
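If the data are prepared in Python rather than Excel, the two indicators can be built as below; the file name Meddicorp.xlsx, the Region column, and its category labels are assumptions.

```python
# Minimal sketch of building South/West indicators with Midwest as the base.
import pandas as pd

med = pd.read_excel("Meddicorp.xlsx")                 # hypothetical file name
med["South"] = (med["Region"] == "South").astype(int)
med["West"] = (med["Region"] == "West").astype(int)
# Midwest territories get South = 0 and West = 0, so they form the base level.
```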
(i) The regression output for this model follows. Interpret the regression equation from the least squares fit.
Regression of y on x₁ (Adv), x₂ (Bonus), x₃ (South), x₄ (West)

Regression Statistics
Multiple R           0.9730
R Square             0.9468
Adjusted R Square    0.9362
Standard Error       57.6254
Observations         25

ANOVA
              df    SS              MS             F          Significance F
Regression    4     1182559.8959    295639.9740    89.0296    0.0000
Residual      20    66413.8441      3320.6922
Total         24    1248973.7400

            Coefficients    Standard Error    t Stat     P-value    Lower 95%    Upper 95%
Intercept   435.0989        206.2342          2.1097     0.0477     4.9020       865.2958
Adv         1.3678          0.2622            5.2165     0.0000     0.8208       1.9148
Bonus       0.9752          0.4808            2.0281     0.0561     -0.0278      1.9781
South       -257.8916       48.4129           -5.3269    0.0000     -358.8792    -156.9040
West        -209.7457       37.4203           -5.6051    0.0000     -287.8032    -131.6883

[Figure: Residual plot of the residuals versus Predicted Sales]
The estimated regression equation is
Predicted Sales = 435.0989 + 1.3678 Adv + 0.9752 Bonus − 257.8916 South − 209.7457 West
For the three regions this gives:
South (x₃ = 1, x₄ = 0): Predicted Sales = 177.2073 + 1.3678 Adv + 0.9752 Bonus
West (x₃ = 0, x₄ = 1): Predicted Sales = 225.3532 + 1.3678 Adv + 0.9752 Bonus
Midwest (x₃ = 0, x₄ = 0): Predicted Sales = 435.0989 + 1.3678 Adv + 0.9752 Bonus
For given amounts spent by Meddicorp on advertising and bonuses, the estimated mean sales
in a territory that is in the South region will be $257,892 (257.8916 thousands of dollars)
below the sales in a territory that is in the Midwest region.
For given amounts spent by Meddicorp on advertising and bonuses, the estimated mean sales
in a territory that is in the West region will be $209,746 (209.7457 thousands of dollars)
below the sales in a territory that is in the Midwest region.
(j) Predict the average sales in each region when advertising expenditures equal 500 hundreds of
dollars and bonuses are 250 hundreds of dollars.
South:
ŷ = 177.2073 + 1.3678(500) + 0.9752(250) = 1104.9073
West:
ŷ = 225.3532 + 1.3678(500) + 0.9752(250) = 1153.0532
Midwest:
ŷ = 435.0989 + 1.3678(500) + 0.9752(250) = 1362.7989
The mean sales figures ($1,104,907 for South, $1,153,053 for West, and $1,362,799 for Midwest)
when advertising expenditures equal $50,000 and bonus payments equal $25,000 differ according
to the coefficients of the dummy variables: the figure for South is $257,892 smaller than the
figure for Midwest, and the figure for West is $209,746 smaller than the figure for Midwest.
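The same three predictions can be computed from the fitted coefficients; a minimal sketch:

```python
# Predictions in (j) from the fitted coefficients (Adv = 500, Bonus = 250,
# both in hundreds of dollars).
coef = {"const": 435.0989, "Adv": 1.3678, "Bonus": 0.9752,
        "South": -257.8916, "West": -209.7457}
regions = {"South": (1, 0), "West": (0, 1), "Midwest": (0, 0)}

for region, (south, west) in regions.items():
    pred = (coef["const"] + coef["Adv"] * 500 + coef["Bonus"] * 250
            + coef["South"] * south + coef["West"] * west)
    print(f"{region}: predicted sales = {pred:.4f} thousand dollars")
```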
(k) Determine whether there is a significant difference in sales for territories in different regions.
Because the location of territories is measured by the group of two indicator variables x₃ (South)
and x₄ (West), we use the partial F test to compare the full model (Adv, Bonus, South, West) with
the reduced model (Adv, Bonus):
H₀: β₃ = β₄ = 0    Hₐ: at least one of β₃, β₄ is not 0
F = [(SSE(reduced) − SSE(full)) / (k − j)] / MSE(full)
  = [(181176.4194 − 66413.8441) / 2] / 3320.6922 = 17.2799
n = 25, k = 4, j = 2; df₁ = k − j = 4 − 2 = 2, df₂ = n − k − 1 = 25 − 4 − 1 = 20
P-value = FDIST(17.2799,2,20) = 0.000044
The P-value is very small and we reject the null hypothesis.
Thus, at least one of the coefficients of the indicator variables is not zero.
This means that there are statistically significant differences in average sales levels between
the three regions in which Meddicorp does business.
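A sketch of the partial F computation using the SSE values from the two regression outputs (the P-value reproduces the FDIST result above):

```python
# Partial F test comparing the reduced (Adv, Bonus) and full (Adv, Bonus,
# South, West) models, using values from the outputs above.
from scipy.stats import f

sse_reduced = 181176.4194      # SSE of the reduced model (Adv, Bonus)
sse_full = 66413.8441          # SSE of the full model (with South and West)
mse_full = 3320.6922           # MSE of the full model
n, k, j = 25, 4, 2             # k predictors in the full model, j in the reduced

F = ((sse_reduced - sse_full) / (k - j)) / mse_full
p_value = f.sf(F, k - j, n - k - 1)
print(f"F = {F:.4f}, P-value = {p_value:.6f}")   # about 17.28 and 0.000044
```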
(l) How useful is the group of dummy variables x₃ (South) and x₄ (West)?
Do they considerably improve the explanation of the variation in sales?
Model                        r²     r²_adj    s_e (thousands of $)
Adv, Bonus                   85%    84%       91
Adv, Bonus, South, West      95%    94%       58
A comparison of r², r²_adj, and s_e for the reduced and full models shows that the indicator variables
x₃ (South) and x₄ (West) carry a lot of explanatory power. They help to explain about 10% more of
the variation in sales while reducing the standard error of estimate by about $33,000.
Therefore, the dummy variables x₃ (South) and x₄ (West) should be retained in the model.