
Slides Prepared by

JOHN S. LOUCKS
St. Edward’s University

© 2003 South-Western/Thomson Learning™


Chapter 16
Regression Analysis: Model Building
• General Linear Model
• Determining When to Add or Delete Variables
• Analysis of a Larger Problem
• Variable-Selection Procedures
• Residual Analysis
• Multiple Regression Approach to Analysis of Variance and Experimental Design

General Linear Model

Models in which the parameters (β0, β1, . . . , βp) all have exponents of one are called linear models.
• First-Order Model with One Predictor Variable

  y = β0 + β1x1 + ε

• Second-Order Model with One Predictor Variable

  y = β0 + β1x1 + β2x1² + ε

• Second-Order Model with Two Predictor Variables with Interaction

  y = β0 + β1x1 + β2x2 + β3x1² + β4x2² + β5x1x2 + ε
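These models are linear in the parameters even when they are nonlinear in x, so ordinary least squares still applies once the extra columns (x1², x1x2, and so on) are added to the design matrix. A minimal sketch with simulated data (the coefficients and noise level are made up for illustration):

```python
import numpy as np

# Simulated data roughly following y = 2 + 3x + 0.5x^2 (made-up coefficients)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2 + 3 * x + 0.5 * x**2 + rng.normal(0.0, 1.0, x.size)

# Second-order model with one predictor: nonlinear in x but linear in
# (beta0, beta1, beta2), so we add an x^2 column and solve by least squares.
X = np.column_stack([np.ones_like(x), x, x**2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # estimates of (beta0, beta1, beta2)
```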

General Linear Model

Often the problem of nonconstant variance can be corrected by transforming the dependent variable to a different scale.

• Logarithmic Transformations
  Most statistical packages provide the ability to apply logarithmic transformations using either base 10 (common log) or base e = 2.71828... (natural log).

• Reciprocal Transformation
  Use 1/y as the dependent variable instead of y.
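A small sketch of these transformations with numpy (the y values are made up; either transformed series would then replace y as the dependent variable in the regression):

```python
import numpy as np

# Made-up dependent variable whose spread grows with its level
y = np.array([12.0, 25.0, 48.0, 95.0, 210.0])

log10_y = np.log10(y)   # common (base-10) logarithm
ln_y = np.log(y)        # natural (base-e) logarithm
recip_y = 1.0 / y       # reciprocal transformation

# Predictions made on the transformed scale are back-transformed
# afterwards, e.g. y_hat = 10 ** log10_y_hat for the common log.
```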

General Linear Model

Models in which the parameters (β0, β1, . . . , βp) have exponents other than one are called nonlinear models. In some cases we can perform a transformation of variables that will enable us to use regression analysis with the general linear model.

• Exponential Model
  The exponential model involves the regression equation:

  E(y) = β0β1^x

  We can transform this nonlinear model to a linear model by taking the logarithm of both sides:

  log E(y) = log β0 + x log β1
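On the log scale the model is a straight line in x, so a first-degree fit recovers the parameters. A sketch with noise-free, made-up parameter values:

```python
import numpy as np

# Made-up exponential model E(y) = beta0 * beta1**x (no error term here)
beta0_true, beta1_true = 500.0, 1.08
x = np.arange(10, dtype=float)
y = beta0_true * beta1_true ** x

# log y = log(beta0) + x * log(beta1): a straight line, so a degree-1
# polynomial fit on the log scale recovers both parameters.
slope, intercept = np.polyfit(x, np.log(y), 1)
beta0_hat, beta1_hat = np.exp(intercept), np.exp(slope)
print(beta0_hat, beta1_hat)
```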

Variable Selection Procedures

• Stepwise Regression
• Forward Selection
• Backward Elimination
  (iterative; one independent variable at a time is added or deleted based on the F statistic)
• Best-Subsets Regression
  (different subsets of the independent variables are evaluated)

Variable Selection Procedures

• F Test
  To test whether the addition of x2 to a model involving x1 (or the deletion of x2 from a model involving x1 and x2) is statistically significant:

  F = [(SSE(reduced) - SSE(full)) / number of extra terms] / MSE(full)

  F = [(SSE(x1) - SSE(x1, x2)) / 1] / [SSE(x1, x2) / (n - p - 1)]

• The p-value corresponding to the F statistic is the criterion used to determine whether a variable should be added or deleted.
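A sketch of the computation using scipy, plugging in the SSE values that appear later in the Clarksville example (reduced model: House Size only; full model adds Bathrooms):

```python
from scipy import stats

sse_reduced = 50552.25   # SSE(x1): House Size only
sse_full = 44884.42      # SSE(x1, x2): House Size and Bathrooms
n, p, extra = 25, 2, 1   # n observations, p predictors in the full model

f_stat = ((sse_reduced - sse_full) / extra) / (sse_full / (n - p - 1))
p_value = stats.f.sf(f_stat, extra, n - p - 1)
print(f_stat, p_value)
```

The result is F ≈ 2.78 with p ≈ .11, matching the Bathrooms p-value (.10974) reported for the two-variable model, since F = t² when a single term is tested.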
Stepwise Regression

1. Compute the F statistic and p-value for each independent variable in the model.
2. If any variable has a p-value greater than alpha-to-remove, the variable with the largest p-value is removed from the model; return to step 1.
3. Otherwise, compute the F statistic and p-value for each independent variable not in the model.
4. If any variable has a p-value less than alpha-to-enter, the variable with the smallest p-value is entered into the model; return to step 1. Otherwise, stop.
Forward Selection

• This procedure is similar to stepwise regression, but does not permit a variable to be deleted.
• The forward-selection procedure starts with no independent variables.
• It adds variables one at a time as long as a significant reduction in the error sum of squares (SSE) can be achieved.

Forward Selection

1. Start with no independent variables in the model.
2. Compute the F statistic and p-value for each independent variable not in the model.
3. If any variable has a p-value less than alpha-to-enter, the variable with the smallest p-value is entered into the model; return to step 2. Otherwise, stop.

Backward Elimination

• This procedure begins with a model that includes all the independent variables the modeler wants considered.
• It then attempts to delete one variable at a time by determining whether the least significant variable currently in the model can be removed because its p-value is greater than the user-specified or default value.
• Once a variable has been removed from the model it cannot reenter at a subsequent step.
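The procedure can be sketched in a few lines of numpy/scipy. The helper below is a hypothetical implementation, not Excel's, run on the 25 Clarksville listings from the worksheets that follow:

```python
import numpy as np
from scipy import stats

def backward_elimination(X, y, names, alpha=0.05):
    """Drop the least significant predictor (largest p-value > alpha)
    one at a time; stop when every remaining p-value is <= alpha."""
    X, names = X.copy(), list(names)
    n = len(y)
    while names:
        A = np.column_stack([np.ones(n), X])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        mse = resid @ resid / (n - A.shape[1])
        se = np.sqrt(mse * np.diag(np.linalg.inv(A.T @ A)))
        pvals = 2 * stats.t.sf(np.abs(beta / se), n - A.shape[1])
        worst = 1 + int(np.argmax(pvals[1:]))   # ignore the intercept
        if pvals[worst] <= alpha:
            break
        X = np.delete(X, worst - 1, axis=1)
        del names[worst - 1]
    return names

# Clarksville data from the worksheets: price ($000), house size (00 sq ft),
# bedrooms, bathrooms, garage size (cars)
data = np.array([
    [290, 21, 4, 2, 2], [ 95, 11, 2, 1, 0], [170, 19, 3, 2, 2],
    [375, 38, 5, 4, 3], [350, 24, 4, 3, 2], [125, 10, 2, 2, 0],
    [310, 31, 4, 4, 2], [275, 25, 3, 2, 2], [340, 27, 5, 3, 3],
    [215, 22, 4, 3, 2], [295, 20, 4, 3, 2], [190, 24, 4, 3, 2],
    [385, 36, 5, 4, 3], [430, 32, 5, 4, 2], [185, 14, 3, 2, 1],
    [175, 18, 4, 2, 2], [190, 19, 4, 2, 2], [330, 29, 4, 4, 3],
    [405, 33, 5, 4, 3], [170, 23, 4, 2, 2], [365, 34, 5, 4, 3],
    [280, 25, 4, 2, 2], [135, 17, 3, 1, 1], [205, 21, 4, 3, 2],
    [260, 26, 4, 3, 2],
], dtype=float)
y, X = data[:, 0], data[:, 1:]
kept = backward_elimination(X, y, ["House Size", "Bedrooms", "Bathrooms", "Cars"])
print(kept)
```

Following the p-values on the slides, the elimination should drop Cars, then Bedrooms, then Bathrooms, leaving House Size alone.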

Backward Elimination

1. Start with all independent variables in the model.
2. Compute the F statistic and p-value for each independent variable in the model.
3. If any variable has a p-value greater than alpha-to-remove, the variable with the largest p-value is removed from the model; return to step 2. Otherwise, stop.

Example: Clarksville Homes

Tony Zamora, a real estate investor, has just moved to Clarksville and wants to learn about the city’s residential real estate market. Tony has randomly selected 25 house-for-sale listings from the Sunday newspaper and collected the data listed on the next three slides.

Develop, using the backward elimination procedure, a multiple regression model to predict the selling price of a house in Clarksville.

Using Excel to Perform the
Backward Elimination Procedure
 Worksheet (showing partial data)
Row  Segment of City  Selling Price ($000)  House Size (00 sq. ft.)  Number of Bedrms.  Number of Bathrms.  Garage Size (cars)
2 Northwest 290 21 4 2 2
3 South 95 11 2 1 0
4 Northeast 170 19 3 2 2
5 Northwest 375 38 5 4 3
6 West 350 24 4 3 2
7 South 125 10 2 2 0
8 West 310 31 4 4 2
9 West 275 25 3 2 2

Note: Rows 10-26 are not shown.

Using Excel to Perform the
Backward Elimination Procedure
 Worksheet (showing partial data)
Row  Segment of City  Selling Price ($000)  House Size (00 sq. ft.)  Number of Bedrms.  Number of Bathrms.  Garage Size (cars)
10 Northwest 340 27 5 3 3
11 Northeast 215 22 4 3 2
12 Northwest 295 20 4 3 2
13 South 190 24 4 3 2
14 Northwest 385 36 5 4 3
15 West 430 32 5 4 2
16 South 185 14 3 2 1
17 South 175 18 4 2 2

Note: Rows 2-9 are hidden and rows 18-26 not shown.

Using Excel to Perform the
Backward Elimination Procedure
 Worksheet (showing partial data)
Row  Segment of City  Selling Price ($000)  House Size (00 sq. ft.)  Number of Bedrms.  Number of Bathrms.  Garage Size (cars)
18 Northeast 190 19 4 2 2
19 Northwest 330 29 4 4 3
20 West 405 33 5 4 3
21 Northeast 170 23 4 2 2
22 West 365 34 5 4 3
23 Northwest 280 25 4 2 2
24 South 135 17 3 1 1
25 Northeast 205 21 4 3 2
26 West 260 26 4 3 2
Note: Rows 2-17 are hidden.
Using Excel to Perform the
Backward Elimination Procedure
 Value Worksheet (partial)
A B C
27
28 SUMMARY OUTPUT
29
30 Regression Statistics
31 Multiple R 0.898964443
32 R Square 0.80813707
33 Adjusted R Square 0.769764484
34 Standard Error 45.87155025
35 Observations 25
36

Using Excel to Perform the
Backward Elimination Procedure

 Value Worksheet (partial)

A B C D E F
36
37 ANOVA
38 df SS MS F Significance F
39 Regression 4 177260 44315 21.06027 6.1385E-07
40 Residual 20 42083.98 2104.199
41 Total 24 219344
42

Using Excel to Perform the
Backward Elimination Procedure

 Value Worksheet (partial)

A B C D E
42
43 Coeffic. Std. Err. t Stat P-value
44 Intercept -59.416 54.6072 -1.0881 0.28951
45 House Size 6.50587 3.24687 2.0037 0.05883
46 Bedrooms 29.1013 26.2148 1.1101 0.28012
47 Bathrooms 26.4004 18.8077 1.4037 0.17574
48 Cars -10.803 27.329 -0.3953 0.6968
49

Using Excel to Perform the
Backward Elimination Procedure

• Cars (garage size) is the independent variable with the highest p-value (.697) > .05
• Cars is removed from the model
• Multiple regression is performed again on the remaining independent variables

Using Excel to Perform the
Backward Elimination Procedure
 Value Worksheet (partial)
A B C
27
28 SUMMARY OUTPUT
29
30 Regression Statistics
31 Multiple R 0.898130279
32 R Square 0.806637998
33 Adjusted R Square 0.779014855
34 Standard Error 44.94059302
35 Observations 25
36

Using Excel to Perform the
Backward Elimination Procedure
 Value Worksheet (partial)

A B C D E F
36
37 ANOVA
38              df   SS        MS       F      Significance F
39 Regression    3   176931.2  58977.1  29.20  <.0001
40 Residual     21   42412.8   2019.7
41 Total        24   219344
42

Using Excel to Perform the
Backward Elimination Procedure
 Value Worksheet (partial)

A B C D E
42
43 Coeffic. Std. Err. t Stat P-value
44 Intercept -47.342 44.3467 -1.0675 0.29785
45 House Size 6.02021 2.94446 2.0446 0.05363
46 Bedrooms 23.0353 20.8229 1.1062 0.28113
47 Bathrooms 27.0286 18.3601 1.4721 0.15581
48
49

Using Excel to Perform the
Backward Elimination Procedure

• Bedrooms is the independent variable with the highest p-value (.281) > .05
• Bedrooms is removed from the model
• Multiple regression is performed again on the remaining independent variables

Using Excel to Perform the
Backward Elimination Procedure

 Value Worksheet (partial)


A B C
27
28 SUMMARY OUTPUT
29
30 Regression Statistics
31 Multiple R 0.891835053
32 R Square 0.795369762
33 Adjusted R Square 0.776767013
34 Standard Error 45.1685807
35 Observations 25
36

Using Excel to Perform the
Backward Elimination Procedure

 Value Worksheet (partial)

A B C D E F
36
37 ANOVA
38 df SS MS F Significance F
39 Regression 2 174459.6 87229.79 42.7555 2.63432E-08
40 Residual 22 44884.42 2040.201
41 Total 24 219344
42

Using Excel to Perform the
Backward Elimination Procedure

 Value Worksheet (partial)

A B C D E
42
43 Coeffic. Std. Err. t Stat P-value
44 Intercept -12.349 31.2392 -0.3953 0.69642
45 House Size 7.94652 2.38644 3.3299 0.00304
46 Bathrooms 30.3444 18.2056 1.6668 0.10974
47
48
49

Using Excel to Perform the
Backward Elimination Procedure

• Bathrooms is the independent variable with the highest p-value (.110) > .05
• Bathrooms is removed from the model
• Regression is performed again on the remaining independent variable

Using Excel to Perform the
Backward Elimination Procedure
 Value Worksheet (partial)
A B C
27
28 SUMMARY OUTPUT
29
30 Regression Statistics
31 Multiple R 0.877228487
32 R Square 0.769529819
33 Adjusted R Square 0.759509376
34 Standard Error 46.88202186
35 Observations 25
36

Using Excel to Perform the
Backward Elimination Procedure
 Value Worksheet (partial)

A B C D E F
36
37 ANOVA
38 df SS MS F Significance F
39 Regression 1 168791.7 168791.7 76.79599 8.67454E-09
40 Residual 23 50552.25 2197.924
41 Total 24 219344
42

Using Excel to Perform the
Backward Elimination Procedure

 Value Worksheet (partial)

A B C D E
42
43 Coeffic. Std. Err. t Stat P-value
44 Intercept -9.8669 32.3874 -0.3047 0.76337
45 House Size 11.3383 1.29384 8.7633 8.7E-09
46
47
48
49

Using Excel to Perform the
Backward Elimination Procedure
• House size is the only independent variable remaining in the model
• The estimated regression equation is:

  ŷ = -9.8669 + 11.3383(House Size)

• The Adjusted R Square value is .760
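For example, a 2,500-square-foot house (house size is coded in hundreds of square feet, so x = 25) would be predicted to sell for about $273,600:

```python
house_size = 25                 # 2,500 sq. ft. in (00 sq. ft.) units
y_hat = -9.8669 + 11.3383 * house_size
print(round(y_hat, 1))          # predicted selling price in $000
```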

Variable-Selection Procedures

• Best-Subsets Regression
  • The three preceding procedures are one-variable-at-a-time methods offering no guarantee that the best model for a given number of variables will be found.
  • Some statistical software packages include best-subsets regression that enables the user to find, given a specified number of independent variables, the best regression model.
  • Typical output identifies the two best one-variable estimated regression equations, the two best two-variable equations, and so on.
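A brute-force sketch of the idea using itertools (a hypothetical helper, not the packages' algorithm; real implementations prune the search):

```python
import numpy as np
from itertools import combinations

def best_subsets(X, y, names):
    # For each subset size r, keep the predictor subset with the
    # smallest error sum of squares (SSE).
    n, k = X.shape
    best = {}
    for r in range(1, k + 1):
        for cols in combinations(range(k), r):
            A = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            resid = y - A @ beta
            sse = float(resid @ resid)
            if r not in best or sse < best[r][1]:
                best[r] = ([names[c] for c in cols], sse)
    return best

# Demo with made-up data in which only x1 matters
t = np.arange(20.0)
X = np.column_stack([t, np.cos(t), t**2])
y = 4.0 + 2.0 * t
result = best_subsets(X, y, ["x1", "x2", "x3"])
print(result[1][0])  # best single predictor
```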

Example: PGA Tour Data

The Professional Golfers Association keeps a variety of statistics regarding performance measures. Data include the average driving distance, percentage of drives that land in the fairway, percentage of greens hit in regulation, average number of putts, percentage of sand saves, and average score.

The variable names and definitions are shown on the next slide.

Example: PGA Tour Data

 Variable Names and Definitions

Drive:  average length of a drive in yards
Fair:   percentage of drives that land in the fairway
Green:  percentage of greens hit in regulation (a par-3 green is “hit in regulation” if the player’s first shot lands on the green)
Putt:   average number of putts for greens that have been hit in regulation
Sand:   percentage of sand saves (landing in a sand trap and still scoring par or better)
Score:  average score for an 18-hole round
Example: PGA Tour Data

 Sample Data

Drive   Fair   Green   Putt    Sand   Score
277.6 .681 .667 1.768 .550 69.10
259.6 .691 .665 1.810 .536 71.09
269.1 .657 .649 1.747 .472 70.12
267.0 .689 .673 1.763 .672 69.88
267.3 .581 .637 1.781 .521 70.71
255.6 .778 .674 1.791 .455 69.76
272.9 .615 .667 1.780 .476 70.19
265.4 .718 .699 1.790 .551 69.73

Example: PGA Tour Data

 Sample Data (continued)

Drive   Fair   Green   Putt    Sand   Score
272.6 .660 .672 1.803 .431 69.97
263.9 .668 .669 1.774 .493 70.33
267.0 .686 .687 1.809 .492 70.32
266.0 .681 .670 1.765 .599 70.09
258.1 .695 .641 1.784 .500 70.46
255.6 .792 .672 1.752 .603 69.49
261.3 .740 .702 1.813 .529 69.88
262.2 .721 .662 1.754 .576 70.27

Example: PGA Tour Data

 Sample Data (continued)

Drive   Fair   Green   Putt    Sand   Score
260.5 .703 .623 1.782 .567 70.72
271.3 .671 .666 1.783 .492 70.30
263.3 .714 .687 1.796 .468 69.91
276.6 .634 .643 1.776 .541 70.69
252.1 .726 .639 1.788 .493 70.59
263.0 .687 .675 1.786 .486 70.20
263.0 .639 .647 1.760 .374 70.81
253.5 .732 .693 1.797 .518 70.26
266.2 .681 .657 1.812 .472 70.96
Example: PGA Tour Data

 Sample Correlation Coefficients

        Score   Drive   Fair    Green   Putt
Drive -.154
Fair -.427 -.679
Green -.556 -.045 .421
Putt .258 -.139 .101 .354
Sand -.278 -.024 .265 .083 -.296

Example: PGA Tour Data

 Best Subsets Regression of SCORE

Vars   R-sq   R-sq(a)   C-p    s        D F G P S
1 30.9 27.9 26.9 .39685 X
1 18.2 14.6 35.7 .43183 X
2 54.7 50.5 12.4 .32872 X X
2 54.6 50.5 12.5 .32891 X X
3 60.7 55.1 10.2 .31318 X X X
3 59.1 53.3 11.4 .31957 X X X
4 72.2 66.8 4.2 .26913 X X X X
4 60.9 53.1 12.1 .32011 X X X X
5 72.6 65.4 6.0 .27499 X X X X X
Example: PGA Tour Data

The regression equation:
Score = 74.678 - .0398(Drive) - 6.686(Fair) - 10.342(Green) + 9.858(Putt)

Predictor   Coef      Stdev    t-ratio   p
Constant    74.678    6.952    10.74     .000
Drive       -.0398    .01235   -3.22     .004
Fair        -6.686    1.939    -3.45     .003
Green       -10.342   3.561    -2.90     .009
Putt        9.858     3.180    3.10      .006

s = .2691   R-sq = 72.4%   R-sq(adj) = 66.8%

Example: PGA Tour Data

Analysis of Variance

SOURCE       DF   SS        MS       F       P
Regression    4   3.79469   .94867   13.10   .000
Error        20   1.44865   .07243
Total        24   5.24334

Residual Analysis: Autocorrelation

• Durbin-Watson Test for Autocorrelation
  • Statistic:

    d = Σ_{t=2}^{n} (e_t - e_{t-1})² / Σ_{t=1}^{n} e_t²

  • The statistic ranges in value from zero to four.
  • If successive values of the residuals are close together (positive autocorrelation), the statistic will be small.
  • If successive values are far apart (negative autocorrelation), the statistic will be large.
  • A value of two indicates no autocorrelation.
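A sketch of the statistic with numpy, applied to made-up residual series at the two extremes:

```python
import numpy as np

def durbin_watson(e):
    # d = sum_{t=2..n} (e_t - e_{t-1})^2 / sum_{t=1..n} e_t^2
    e = np.asarray(e, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

print(durbin_watson([1, 1, 1, 1]))     # identical residuals: d = 0
print(durbin_watson([1, -1] * 5))      # alternating residuals push d toward 4
```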
End of Chapter 16

© 2003 South-Western/Thomson Learning™


44

You might also like