Multiple Regression - Selecting The Best Equation: An Example

The document discusses techniques for selecting the best multiple linear regression equation, including all possible regressions, backward elimination, and forward selection. It explains that the best equation is a compromise between simplicity/interpretability and reliability. Backward elimination starts with all variables and removes the least significant ones. Forward selection starts with no variables and adds the most significant ones. The example uses cement data to demonstrate these techniques.

Uploaded by

Kishalaya Kundu
Multiple Regression - Selecting the Best Equation

When fitting a multiple linear regression model, a researcher will likely include
independent variables that are not important in predicting the dependent variable Y.
In the analysis the researcher will try to eliminate these variables from the final equation.
The objective in trying to find the “best equation” will be to find the simplest model that
adequately fits the data. This will not necessarily be the model that explains the most
variance in the dependent variable Y (the equation with the highest value of R²), since
that will always be the equation containing all of the independent variables. Our
objective will be to find the equation with the fewest variables that still explains a
percentage of the variance in the dependent variable comparable to the percentage
explained with all the variables in the equation.

An Example
The example that we will consider examines how the heat evolved in the curing of
cement is affected by the amounts of various chemicals included in the cement mixture.
The independent and dependent variables are listed below:
X1 = amount of tricalcium aluminate, 3 CaO - Al2O3
X2 = amount of tricalcium silicate, 3 CaO - SiO2
X3 = amount of tetracalcium alumino ferrite, 4 CaO - Al2O3 - Fe2O3
X4 = amount of dicalcium silicate, 2 CaO - SiO2
Y = heat evolved in calories per gram of cement.

X1 X2 X3 X4 Y
7 26 6 60 79
1 29 15 52 74
11 56 8 20 104
11 31 8 47 88
7 52 6 33 96
11 55 9 22 109
3 71 17 6 103
1 31 22 44 73
2 54 18 22 93
21 47 4 26 116
1 40 23 34 84
11 66 9 12 113
10 68 8 12 109

Techniques for Selecting the "Best" Regression Equation


The best regression equation is not necessarily the equation that explains the most
variance in Y (the highest R²).
• That equation will always be the one with all the variables included.
• The best equation should also be simple and interpretable (i.e. contain a small number
of variables).
• Simple (interpretable) and reliable are opposing criteria.
• The best equation is a compromise between these two.

I All Possible Regressions
Suppose we have the p independent variables X1, X2, ..., Xp.
- Then there are 2^p subsets of variables.

Example (k=3) X1, X2, X3

Variables in Equation    Model
no variables             Y = β0 + ε
X1                       Y = β0 + β1 X1 + ε
X2                       Y = β0 + β2 X2 + ε
X3                       Y = β0 + β3 X3 + ε
X1, X2                   Y = β0 + β1 X1 + β2 X2 + ε
X1, X3                   Y = β0 + β1 X1 + β3 X3 + ε
X2, X3                   Y = β0 + β2 X2 + β3 X3 + ε
X1, X2, X3               Y = β0 + β1 X1 + β2 X2 + β3 X3 + ε

Use of R2
1. Assume we carry out 2^p runs, one for each of the subsets.
   Divide the runs into the following sets:
   Set 0: No variables.
   Set 1: One independent variable.
   ...
   Set p: p independent variables.
2. Order the runs in each set according to R².
3. Examine the leaders in each set, looking for consistent patterns
   - take into account the correlation between independent variables.

Example (k=4) X1, X2, X3, X4

Set      Variables in leading runs    100 R² (%)
Set 1:   X4                           67.5
Set 2:   X1, X2                       97.9
         X1, X4                       97.2
Set 3:   X1, X2, X4                   98.234
Set 4:   X1, X2, X3, X4               98.237

Examination of the correlation coefficients reveals a high correlation
between X1 and X3 (r13 = -0.824) and between X2 and X4 (r24 = -0.973).

Best Equation Y = β0 + β1 X1+ β4 X4+ ε
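The all-possible-regressions procedure can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the helper name `r_squared` and the dictionary `results` are hypothetical. Because the data table lists Y as whole numbers, the percentages printed may differ slightly from the figures quoted above.

```python
from itertools import combinations

import numpy as np

# Cement data from the table above (columns X1, X2, X3, X4, Y).
DATA = np.array([
    [7, 26, 6, 60, 79], [1, 29, 15, 52, 74], [11, 56, 8, 20, 104],
    [11, 31, 8, 47, 88], [7, 52, 6, 33, 96], [11, 55, 9, 22, 109],
    [3, 71, 17, 6, 103], [1, 31, 22, 44, 73], [2, 54, 18, 22, 93],
    [21, 47, 4, 26, 116], [1, 40, 23, 34, 84], [11, 66, 9, 12, 113],
    [10, 68, 8, 12, 109],
], dtype=float)
X, y = DATA[:, :4], DATA[:, 4]


def r_squared(cols):
    """R^2 of the least-squares fit of y on an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss


# One run per subset: 2**4 = 16 runs in total, keyed by the column indices used.
results = {cols: r_squared(cols)
           for size in range(5) for cols in combinations(range(4), size)}

# Report the leader (highest R^2) within each set of equal-sized subsets.
for size in range(1, 5):
    leader = max((c for c in results if len(c) == size), key=results.get)
    names = ", ".join(f"X{j + 1}" for j in leader)
    print(f"Set {size}: {names}  100 R^2 = {100 * results[leader]:.1f}%")
```

Listing the top two or three runs per set, rather than only the leader, would match the table above more closely.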

Use of the Residual Mean Square (RMS) (s²)
When all of the variables having a non-zero effect have been included in the model, the
residual mean square is an estimate of σ².
If "significant" variables have been left out then the RMS will be biased upward.

No. of variables p    RMS s²(p)                              Average s²(p)
1                     115.06, 82.39, 176.31, 80.35           113.53
2                     5.79*, 122.71, 7.48**, 86.59, 17.57    47.00
3                     5.35, 5.33, 5.65, 8.20                 6.13
4                     5.98                                   5.98

* - run X1, X2    ** - run X1, X4    s² is approximately 6.

Use of Mallows Ck

$$ C_k = \frac{RSS_k}{s^2_{\text{complete}}} - [n - 2(k + 1)] $$

If the equation with k variables is adequate then both s²complete and RSSk/(n-k-1) will be
estimating σ². Then Ck ≈ [(n-k-1)σ²]/σ² - [n-2(k+1)] = [n-k-1] - [n-2(k+1)] = k+1. Thus if we plot,
for each run, Ck vs k and look for Ck close to k+1, we will be able to identify models giving a
reasonable fit.
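As a quick arithmetic check of the formula, the sketch below recomputes Ck for two runs of the cement example from the residual mean squares quoted earlier, using RSSk = (n - k - 1) s²(k). The function name `mallows_ck` is hypothetical.

```python
def mallows_ck(rss_k, s2_complete, n, k):
    """Mallows' C_k = RSS_k / s2_complete - [n - 2(k + 1)]."""
    return rss_k / s2_complete - (n - 2 * (k + 1))


n = 13               # observations in the cement example
s2_complete = 5.98   # residual mean square with all four variables

# RSS_k is recovered from the quoted residual mean squares: RSS_k = (n - k - 1) * s2(k).
ck_full = mallows_ck((n - 4 - 1) * 5.98, s2_complete, n, k=4)
ck_x1_x2 = mallows_ck((n - 2 - 1) * 5.79, s2_complete, n, k=2)

print(round(ck_full, 1), round(ck_x1_x2, 1))
```

The complete model always gives Ck = k + 1 exactly, so only the smaller runs carry information about adequacy.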

Run                   Ck                            k+1
no variables          443.2                         1
1, 2, 3, 4            202.5, 142.5, 315.2, 138.7    2
12, 13, 14            2.7, 198.1, 5.5               3
23, 24, 34            62.4, 138.2, 22.4             3
123, 124, 134, 234    3.0, 3.0, 3.5, 7.5            4
1234                  5.0                           5

II Backward Elimination
In this procedure the complete regression equation is determined containing all the variables - X1,
X2, ..., Xp. Then variables are checked one at a time and the least significant is dropped from the
model at each stage. The procedure is terminated when all of the variables remaining in the
equation provide a significant contribution to the prediction of the dependent variable Y. The
precise algorithm proceeds as follows:

1. Fit a regression equation containing all variables.

2. A partial F-test (F to remove) is computed for each of the independent variables still
in the equation.
• The Partial F statistic (F to remove) = [RSS2 - RSS1]/MSE1 ,where
• RSS1 = the residual sum of squares with all variables that are presently
in the equation,
• RSS2 = the residual sum of squares with one of the variables removed,
and
• MSE1 = the Mean Square for Error with all variables that are presently
in the equation.

3. The lowest partial F value (F to remove) is compared with Fα for some pre-specified
α. If FLowest ≤ Fα then remove that variable and return to step 2. If FLowest > Fα
then accept the equation as it stands.
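The three steps above can be sketched as follows, assuming NumPy. The helper names (`rss`, `backward_elimination`) and the lookup table `F_CRIT` are hypothetical; 3.46 and 3.36 are the critical values used in the worked example, and the remaining entries are upper 10% points of F(1, df) taken from standard tables. The data are the cement data listed earlier.

```python
import numpy as np

# Cement data (columns X1, X2, X3, X4, Y).
DATA = np.array([
    [7, 26, 6, 60, 79], [1, 29, 15, 52, 74], [11, 56, 8, 20, 104],
    [11, 31, 8, 47, 88], [7, 52, 6, 33, 96], [11, 55, 9, 22, 109],
    [3, 71, 17, 6, 103], [1, 31, 22, 44, 73], [2, 54, 18, 22, 93],
    [21, 47, 4, 26, 116], [1, 40, 23, 34, 84], [11, 66, 9, 12, 113],
    [10, 68, 8, 12, 109],
], dtype=float)
NAMES = ["X1", "X2", "X3", "X4"]
X, y = DATA[:, :4], DATA[:, 4]

# Critical values of F(1, df), keyed by residual degrees of freedom (assumed table values).
F_CRIT = {8: 3.46, 9: 3.36, 10: 3.29, 11: 3.23, 12: 3.18}


def rss(cols):
    """Residual sum of squares for an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta) ** 2))


def backward_elimination(cols):
    cols = list(cols)
    while cols:
        df = len(y) - len(cols) - 1          # residual df of the current equation
        rss1 = rss(cols)
        mse1 = rss1 / df
        # F to remove = [RSS2 - RSS1] / MSE1 for each variable still in the equation.
        f_remove = {j: (rss([c for c in cols if c != j]) - rss1) / mse1 for j in cols}
        weakest = min(f_remove, key=f_remove.get)
        if f_remove[weakest] > F_CRIT[df]:
            break                            # every remaining variable is significant
        cols.remove(weakest)
    return cols


selected = backward_elimination(range(4))
print([NAMES[j] for j in selected])
```

Passing the critical values in as a table keeps the sketch free of any dependency on a statistics library.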

Example (k=4) (same example as before) X1, X2, X3, X4

1. X1, X2, X3, X4 in the equation.
   The lowest partial F = 0.018 (X3) is compared with Fα(1,8) = 3.46 for α = 0.10.
   Remove X3.
2. X1, X2, X4 in the equation.
   The lowest partial F = 1.86 (X4) is compared with Fα(1,9) = 3.36 for α = 0.10.
   Remove X4.
3. X1, X2 in the equation.
   The partial F values for both X1 and X2 exceed Fα(1,10) = 3.29 for α = 0.10.
   The equation is accepted as it stands. Note: F to Remove = partial F.
   Y = 52.58 + 1.47 X1 + 0.66 X2

III Forward Selection

In this procedure we start with no variables in the equation. Then variables are checked one at a
time and the most significant is added to the model at each stage. The procedure is terminated
when all of the variables not in the equation have no significant effect on the dependent variable
Y. The precise algorithm proceeds as follows:

1. Starting with no variables in the equation, a partial F-test (F to enter) is computed for
each of the independent variables not in the equation.
• The Partial F statistic (F to enter) = [RSS2 - RSS1]/MSE1 ,where
• RSS1 = the residual sum of squares with all variables that are presently
in the equation and the variable under consideration,
• RSS2 = the residual sum of squares with all variables that are presently
in the equation .
• MSE1 = the Mean Square for Error with variables that are presently in
the equation and the variable under consideration.

2. The largest partial F value (F to enter) is compared with Fα for some pre-specified
α. If FLargest > Fα then add that variable and return to step 1. If FLargest ≤ Fα then
accept the equation as it stands.
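A minimal sketch of this algorithm, assuming NumPy and reusing the cement data. The names `rss`, `forward_selection` and `F_CRIT` are hypothetical, and the critical values are upper 10% points of F(1, df) from standard tables. Since the first two candidates have very similar F to enter, the order of entry can be sensitive to rounding in the data.

```python
import numpy as np

# Cement data (columns X1, X2, X3, X4, Y).
DATA = np.array([
    [7, 26, 6, 60, 79], [1, 29, 15, 52, 74], [11, 56, 8, 20, 104],
    [11, 31, 8, 47, 88], [7, 52, 6, 33, 96], [11, 55, 9, 22, 109],
    [3, 71, 17, 6, 103], [1, 31, 22, 44, 73], [2, 54, 18, 22, 93],
    [21, 47, 4, 26, 116], [1, 40, 23, 34, 84], [11, 66, 9, 12, 113],
    [10, 68, 8, 12, 109],
], dtype=float)
NAMES = ["X1", "X2", "X3", "X4"]
X, y = DATA[:, :4], DATA[:, 4]

# Critical values of F(1, df), keyed by residual degrees of freedom (assumed table values).
F_CRIT = {8: 3.46, 9: 3.36, 10: 3.29, 11: 3.23}


def rss(cols):
    """Residual sum of squares for an intercept plus the given columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(np.sum((y - design @ beta) ** 2))


def forward_selection():
    cols = []
    while len(cols) < X.shape[1]:
        df = len(y) - (len(cols) + 1) - 1    # residual df once a candidate is added
        rss2 = rss(cols)                     # variables presently in the equation
        f_enter = {}
        for j in range(X.shape[1]):
            if j not in cols:
                rss1 = rss(cols + [j])       # present variables plus the candidate
                f_enter[j] = (rss2 - rss1) / (rss1 / df)
        best = max(f_enter, key=f_enter.get)
        if f_enter[best] <= F_CRIT[df]:
            break                            # no remaining variable is significant
        cols.append(best)
    return cols


selected = [NAMES[j] for j in forward_selection()]
print(selected)
```

Unlike backward elimination, a variable added early is never reconsidered here, which is the gap that stepwise regression closes.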

IV Stepwise Regression
In this procedure we start with no variables in the model.
Variables are then checked one at a time using the partial correlation coefficient (equivalently F
to Enter) as a measure of importance in predicting the dependent variable Y. At each stage the
variable with the highest significant partial correlation coefficient (F to Enter) is added to the
model. Once this has been done the partial F statistic (F to Remove) is computed for all
variables now in the model to check if any of the variables previously added can
now be deleted. This procedure is continued until no further variables can be added to or deleted
from the model. The partial correlation coefficient for a given variable is the correlation
between the given variable and the response when the present independent variables in the
equation are held fixed. It is also the correlation between the residuals of the given variable and
the residuals of the response, each computed from fitting an equation with the present
independent variables in the equation.

(Partial correlation of Y with Xi, given Xi1, Xi2, ... in the equation)²
= the proportion of the variance in Y left unexplained by Xi1, Xi2, ... that is explained by Xi.
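The residual-based definition of the partial correlation, and its link to the variance explained, can be checked numerically. The sketch below assumes NumPy and uses the cement data; `residuals` and `partial_corr` are hypothetical helper names. It computes the partial correlation of Y with X1 given X4 and compares its square with the share of the residual sum of squares removed by adding X1.

```python
import numpy as np

# Cement data (columns X1, X2, X3, X4, Y).
DATA = np.array([
    [7, 26, 6, 60, 79], [1, 29, 15, 52, 74], [11, 56, 8, 20, 104],
    [11, 31, 8, 47, 88], [7, 52, 6, 33, 96], [11, 55, 9, 22, 109],
    [3, 71, 17, 6, 103], [1, 31, 22, 44, 73], [2, 54, 18, 22, 93],
    [21, 47, 4, 26, 116], [1, 40, 23, 34, 84], [11, 66, 9, 12, 113],
    [10, 68, 8, 12, 109],
], dtype=float)
X, y = DATA[:, :4], DATA[:, 4]


def residuals(target, cols):
    """Residuals from regressing target on an intercept plus the given columns of X."""
    design = np.column_stack([np.ones(len(target))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ beta


def partial_corr(j, given):
    """Correlation between the residuals of Y and of X_j after fitting the given variables."""
    return float(np.corrcoef(residuals(y, given), residuals(X[:, j], given))[0, 1])


r = partial_corr(0, [3])                       # Y with X1, given X4 (0-based columns)
rss_without = float(np.sum(residuals(y, [3]) ** 2))
rss_with = float(np.sum(residuals(y, [3, 0]) ** 2))
explained = (rss_without - rss_with) / rss_without

print(round(r ** 2, 3), round(explained, 3))   # the two quantities agree
```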

Example (k=4) (same example as before) X1, X2, X3, X4

1. With no variables in the equation, the correlation of each independent variable with
   the dependent variable Y is computed. The highest significant correlation (r = -0.821)
   is with variable X4. Thus the decision is made to include X4.
   Regress Y on X4: significant, thus we keep X4.

2. Compute the partial correlation coefficients of Y with all other independent variables
   given X4 in the equation. The highest partial correlation is with the variable X1
   ([rY1.4]² = 0.915). Thus the decision is made to include X1.
   Regress Y on X1, X4.
   R² = 0.972, F = 176.63.
   For X1 the partial F value = 108.22 (F0.10(1,10) = 3.29). Retain X1.
   For X4 the partial F value = 154.295 (F0.10(1,10) = 3.29). Retain X4.

3. Compute the partial correlation coefficients of Y with all other independent variables
   given X4 and X1 in the equation. The highest partial correlation is with the variable
   X2 ([rY2.14]² = 0.358). Thus the decision is made to include X2.
   Regress Y on X1, X2, X4.
   R² = 0.982.
   Lowest partial F value = 1.863 for X4 (F0.10(1,9) = 3.36).
   Remove X4, leaving X1 and X2.

Transformations to Linearity, Polynomial Regression models,
Response Surface Models, The Use of Dummy Variables
Many non-linear curves can be put into a linear form by appropriate transformations of
either the dependent variable Y or some (or all) of the independent variables X1, X2,
... , Xp. This leads to the wide utility of the Linear model. We have seen that through the
use of dummy variables, categorical independent variables can be incorporated into a
Linear Model. We will now see that through the technique of variable transformation
many examples of non-linear behaviour can also be converted to linear behaviour.

Intrinsically Linear (Linearizable) Curves


1. Hyperbolas
y = x/(ax - b)
Linear form: 1/y = a - b (1/x) or Y = β0 + β1 X
Transformations: Y = 1/y, X = 1/x, β0 = a, β1 = -b

[Figure: the hyperbola y = x/(ax - b), drawn with positive curvature (b > 0) and with negative curvature (b < 0); the values 1/a and b/a are marked on the axes.]

2. Exponential
y = a e^(bx) = a B^x
Linear form: ln y = ln a + b x = ln a + (ln B) x or Y = β0 + β1 X
Transformations: Y = ln y, X = x, β0 = ln a, β1 = b = ln B

[Figure: exponential curves for B > 1 and for B < 1, with the values a and aB marked on the y axis.]
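A transformation to linearity can be illustrated with the exponential curve: regress ln y on x, then back-transform the intercept. The sketch below assumes NumPy; the parameter values and the noise-free data are hypothetical, chosen only so that the recovered values can be compared with the ones used to generate the data.

```python
import numpy as np

# Hypothetical parameters for y = a * exp(b * x); the data are generated without noise.
a_true, b_true = 2.0, 0.5
x = np.linspace(0.0, 2.0, 9)
y = a_true * np.exp(b_true * x)

# Linear form: ln y = ln a + b x, i.e. Y = beta0 + beta1 X with Y = ln y and X = x.
beta1, beta0 = np.polyfit(x, np.log(y), 1)   # polyfit returns the slope first

a_hat = np.exp(beta0)      # beta0 = ln a
b_hat = beta1              # beta1 = b = ln B
B_hat = np.exp(b_hat)

print(a_hat, b_hat, B_hat)
```

With real data the least-squares fit is carried out on the log scale, so the error is assumed to be multiplicative on the original scale.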

3. Power Functions
y = a x^b
Linear form: ln y = ln a + b ln x or Y = β0 + β1 X
Transformations: Y = ln y, X = ln x, β0 = ln a, β1 = b

[Figure: power functions for b > 0 (b > 1, b = 1, 0 < b < 1) and for b < 0 (-1 < b < 0, b = -1, b < -1).]

Logarithmic Functions
y = a + b ln x
Linear form: y = a + b ln x or Y = β0 + β1 X
Transformations: Y = y, X = ln x, β0 = a, β1 = b

[Figure: logarithmic functions for b > 0 and for b < 0.]

Other special functions
y = a e^(b/x)
Linear form: ln y = ln a + b (1/x) or Y = β0 + β1 X
Transformations: Y = ln y, X = 1/x, β0 = ln a, β1 = b

[Figure: the curve y = a e^(b/x) for b > 0 and for b < 0.]

Polynomial Models
y = β0 + β1x + β2x² + β3x³
Linear form: Y = β0 + β1 X1 + β2 X2 + β3 X3
Variables: Y = y, X1 = x, X2 = x², X3 = x³

Exponential Models with a polynomial exponent
y = e^(β0 + β1x + ... + β4x⁴)
Linear form: ln y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4
Variables: Y = ln y, X1 = x, X2 = x², X3 = x³, X4 = x⁴

[Figure: an example curve for each of the two model types.]

Response Surface models

Dependent variable Y and two independent variables x1 and x2. (These ideas are easily extended to more than
two independent variables.)
The Model (A cubic response surface model)
Y = β0 + β1x1 + β2x2 + β3x1² + β4x1x2 + β5x2² + β6x1³ + β7x1²x2 + β8x1x2² + β9x2³ + ε
or
Y = β0 + β1 X1 + β2 X2 + β3 X3 + β4 X4 + β5 X5 + β6 X6 + β7 X7 + β8 X8 + β9 X9 + ε
where
X1 = x1, X2 = x2, X3 = x1², X4 = x1x2, X5 = x2², X6 = x1³, X7 = x1²x2, X8 = x1x2² and X9 = x2³

[Figure: an example response surface.]

The Box-Cox Family of Transformations

$$ x^{(\lambda)} = \text{transformed } x = \begin{cases} \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(x), & \lambda = 0 \end{cases} $$


The Transformation Staircase

[Figure: the transformed value x(λ) plotted against x for λ = 2, 1, 1/2, 0, -1/2 and -1.]
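The Box-Cox family is simple to compute directly. The following sketch uses only the standard library; the function name `box_cox` is hypothetical. It also shows numerically why ln(x) is the natural definition at λ = 0: the general formula approaches ln(x) as λ tends to 0.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a positive value x for the power lam."""
    if x <= 0:
        raise ValueError("the Box-Cox transformation requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


# The steps of the transformation staircase, applied to a single value.
for lam in (2, 1, 0.5, 0, -0.5, -1):
    print(lam, round(box_cox(3.0, lam), 4))

# For lam close to 0 the general formula is close to ln(x).
near_zero = box_cox(3.0, 1e-8)
```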

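The Bulging Rule

[Figure: the four quadrants of the bulging rule. Upper left: x down, y up. Upper right: x up, y up. Lower left: x down, y down. Lower right: x up, y down.]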

Non-Linear Growth models - many models cannot be transformed into a linear model

The Mechanistic Growth Model

Equation: $Y = \alpha\left(1 - \beta e^{-kx}\right) + \varepsilon$, or (ignoring ε)

$$ \frac{dY}{dx} = \text{“rate of increase in Y”} = k(\alpha - Y) $$

[Figure: Mechanistic Growth Model]

The Logistic Growth Model

Equation: $Y = \dfrac{\alpha}{1 + \beta e^{-kx}} + \varepsilon$, or (ignoring ε)

$$ \frac{dY}{dx} = \text{“rate of increase in Y”} = \frac{kY(\alpha - Y)}{\alpha} $$

[Figure: Logistic Growth Model, y against x for k = 1/4, 1/2, 1, 2 and 4 (α = 1, β = 1).]

The Gompertz Growth Model

Equation: $Y = \alpha e^{-\beta e^{-kx}} + \varepsilon$, or (ignoring ε)

$$ \frac{dY}{dx} = \text{“rate of increase in Y”} = kY \ln\left(\frac{\alpha}{Y}\right) $$

[Figure: Gompertz Growth Model, y against x for β = 1/8, 1, 8 and 64 (α = 1, k = 1).]
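Each growth curve above is paired with a rate equation. As a sanity check, the sketch below compares a central finite-difference derivative of each curve with the stated rate of increase. It uses only the standard library; the function names, the parameter values and the evaluation points are hypothetical choices for illustration.

```python
import math


def mechanistic(x, alpha, beta, k):
    return alpha * (1.0 - beta * math.exp(-k * x))


def logistic(x, alpha, beta, k):
    return alpha / (1.0 + beta * math.exp(-k * x))


def gompertz(x, alpha, beta, k):
    return alpha * math.exp(-beta * math.exp(-k * x))


# Rate of increase dY/dx written as a function of Y, as in the text.
RATES = {
    mechanistic: lambda Y, alpha, k: k * (alpha - Y),
    logistic: lambda Y, alpha, k: k * Y * (alpha - Y) / alpha,
    gompertz: lambda Y, alpha, k: k * Y * math.log(alpha / Y),
}


def max_rate_error(model, alpha=1.0, beta=1.0, k=2.0, h=1e-6):
    """Largest gap between a central-difference derivative and the stated rate."""
    worst = 0.0
    for x in (0.5, 1.0, 2.0, 4.0):
        numeric = (model(x + h, alpha, beta, k) - model(x - h, alpha, beta, k)) / (2 * h)
        stated = RATES[model](model(x, alpha, beta, k), alpha, k)
        worst = max(worst, abs(numeric - stated))
    return worst


errors = {model.__name__: max_rate_error(model) for model in RATES}
print(errors)
```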

The Use of Dummy Variables
Dummy variables are artificially defined variables designed to convert a model including
categorical independent variables to the standard multiple regression model.

Comparison of Slopes of k Regression Lines with Common Intercept


Situation:
- k treatments or k populations are being compared.
- For each of the k treatments we have measured both Y (the response variable)
and X (an independent variable)
- Y is assumed to be linearly related to X with the slope dependent on treatment
(population), while the intercept is the same for each treatment

The Model:
Y = β0 + β1^(i) X + ε for treatment i (i = 1, 2, ... , k)

[Figure: Graphical illustration of the above model: lines for Treat 1, Treat 2, Treat 3, ..., Treat k with different slopes and a common intercept, y plotted against x.]

This model can be artificially put into the form of the Multiple Regression model by the
use of dummy variables to handle the categorical variable Treatments. Dummy variables
are variables that are artificially defined.
In this case we define a new variable for each category of the categorical variable.
That is, we define Xi for each category of treatments as follows:

Xi = X if the subject receives treatment i
Xi = 0 otherwise

Then the model can be written as follows:

The Complete Model: (in Multiple Regression Format)

Y = β0 + β1^(1) X1 + β1^(2) X2 + ... + β1^(k) Xk + ε

Dependent Variable: Y
Independent Variables: X1, X2, ... , Xk

In the above situation we would likely be interested in testing the equality of the slopes,
namely the Null Hypothesis

H0: β1^(1) = β1^(2) = ... = β1^(k) = β1 (q = k-1)

In this situation the model would become as follows

The Reduced Model: Y = β0 + β1X + ε

Dependent Variable: Y
Independent Variables: X = X1 + X2 + ... + Xk
The Anova Table to carry out this test would take on the following form:

The Anova Table:

Source                                    df       Sum of Squares    Mean Square           F
Regression (for the reduced model)        1        SSReg             MSReg = SSReg/1       MSReg/s²
Departure from H0 (Equality of Slopes)    k-1      SSH0              MSH0 = SSH0/(k-1)     MSH0/s²
Residual (Error)                          N-k-1    SSError           s²
Total                                     N-1      SSTotal

(N = the total number of cases = n1 + n2 + ... + nk and ni = the number of cases for treatment i)

Example
In the following example we are measuring Yield Y as it depends on the amount of
pesticide X. Again we will assume that the dependence will be linear. (I should point out
that the concepts that are used in this discussion can easily be adapted to the non-linear
situation.) Suppose that the experiment is going to be repeated for three brands of
pesticides - A, B and C. The quantity, X, of pesticide in this experiment was set at 3
different levels: 2 units/hectare, 4 units/hectare and 8 units/hectare. Four test plots
were randomly assigned to each of the nine combinations of brand and level of
pesticide. Note that we would expect a common intercept for each brand of pesticide
since when the amount of pesticide, X, is zero the three brands of pesticides would be
equivalent.

The data for this experiment is given in the following table:

Brand    X = 2    X = 4    X = 8
A        29.63    28.16    28.45
         31.87    33.48    37.21
         28.02    28.13    35.06
         35.24    28.25    33.99
B        32.95    29.55    44.38
         24.74    34.97    38.78
         23.38    36.35    34.92
         32.08    38.38    27.45
C        28.68    33.79    46.26
         28.70    43.95    50.77
         22.67    36.89    50.21
         30.02    33.56    44.14

[Figure: a graph of the data, yield plotted against amount of pesticide (0 to 8) for brands A, B and C.]

The data as it would appear in a data file. The variables X1, X2 and X3 are the “dummy”
variables
Pesticide X (Amount) X1 X2 X3 Y
A 2 2 0 0 29.63
A 2 2 0 0 31.87
A 2 2 0 0 28.02
A 2 2 0 0 35.24
B 2 0 2 0 32.95
B 2 0 2 0 24.74
B 2 0 2 0 23.38
B 2 0 2 0 32.08
C 2 0 0 2 28.68
C 2 0 0 2 28.70
C 2 0 0 2 22.67
C 2 0 0 2 30.02
A 4 4 0 0 28.16
A 4 4 0 0 33.48
A 4 4 0 0 28.13
A 4 4 0 0 28.25
B 4 0 4 0 29.55
B 4 0 4 0 34.97
B 4 0 4 0 36.35
B 4 0 4 0 38.38
C 4 0 0 4 33.79
C 4 0 0 4 43.95
C 4 0 0 4 36.89
C 4 0 0 4 33.56
A 8 8 0 0 28.45
A 8 8 0 0 37.21
A 8 8 0 0 35.06
A 8 8 0 0 33.99
B 8 0 8 0 44.38
B 8 0 8 0 38.78
B 8 0 8 0 34.92
B 8 0 8 0 27.45
C 8 0 0 8 46.26
C 8 0 0 8 50.77
C 8 0 0 8 50.21
C 8 0 0 8 44.14

Fitting the complete model
ANOVA
df SS MS F Significance F
Regression 3 1095.815813 365.2719378 18.33114788 4.19538E-07
Residual 32 637.6415754 19.92629923
Total 35 1733.457389

Coefficients
Intercept 26.24166667
X1 0.981388889
X2 1.422638889
X3 2.602400794

Fitting the Reduced model


ANOVA
df SS MS F Significance F
Regression 1 623.8232508 623.8232508 19.11439978 0.000110172
Residual 34 1109.634138 32.63629818
Total 35 1733.457389

Coefficients
Intercept 26.24166667
X 1.668809524

The Anova Table for testing the equality of slopes

                     df  SS           MS           F            Significance F
Common slope zero    1   623.8232508  623.8232508  31.3065283   3.51448E-06
Slope comparison     2   471.9925627  235.9962813  11.84345766  0.000141367
Residual             32  637.6415754  19.92629923
Total                35  1733.457389
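The complete and reduced fits for the pesticide example can be reproduced with a short least-squares sketch, assuming NumPy. The variable names and the helper `fit` are hypothetical; the data are those in the table above, and the dummy variables follow the definition Xi = X for brand i and 0 otherwise.

```python
import numpy as np

amount = np.repeat([2.0, 4.0, 8.0], 4)   # four plots at each of the three levels
yields = {
    "A": [29.63, 31.87, 28.02, 35.24, 28.16, 33.48, 28.13, 28.25,
          28.45, 37.21, 35.06, 33.99],
    "B": [32.95, 24.74, 23.38, 32.08, 29.55, 34.97, 36.35, 38.38,
          44.38, 38.78, 34.92, 27.45],
    "C": [28.68, 28.70, 22.67, 30.02, 33.79, 43.95, 36.89, 33.56,
          46.26, 50.77, 50.21, 44.14],
}
brands = list(yields)
x = np.tile(amount, len(brands))
y = np.concatenate([yields[b] for b in brands])
brand = np.repeat(brands, len(amount))

# Dummy variables: X_i equals X for plots treated with brand i and 0 otherwise.
dummies = np.column_stack([np.where(brand == b, x, 0.0) for b in brands])


def fit(columns):
    """Least-squares fit with an intercept; returns the coefficients and the RSS."""
    design = np.column_stack([np.ones(len(y)), columns])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta, float(np.sum((y - design @ beta) ** 2))


beta_c, rss_c = fit(dummies)   # complete model: common intercept, one slope per brand
beta_r, rss_r = fit(x)         # reduced model: common intercept, common slope

k = len(brands)
ss_h0 = rss_r - rss_c                       # departure from H0 (equality of slopes)
s2 = rss_c / (len(y) - k - 1)               # residual mean square, complete model
f_slopes = (ss_h0 / (k - 1)) / s2

print(beta_c, beta_r, ss_h0, f_slopes)
```

The F statistic is referred to the F distribution with k - 1 and N - k - 1 degrees of freedom, as in the Anova table.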

Comparison of Intercepts of k Regression Lines with a Common Slope (One-way
Analysis of Covariance)
Situation:
- k treatments or k populations are being compared.
- For each of the k treatments we have measured both Y (the response variable)
and X (an independent variable)
- Y is assumed to be linearly related to X with the intercept dependent on treatment
(population), while the slope is the same for each treatment.
- Y is called the response variable, while X is called the covariate.

The Model: Y = β0^(i) + β1X + ε for treatment i (i = 1, 2, ... , k)

[Figure: Graphical illustration of the one-way analysis of covariance model: lines for Treat 1, Treat 2, Treat 3, ..., Treat k with common slopes, y plotted against x.]

Equivalent Forms of the Model:
1) Y = µi + β1(X - X̄) + ε (treatment i), where
   µi = the adjusted mean for treatment i
2) Y = µ + αi + β1(X - X̄) + ε (treatment i), where
   µ = the overall adjusted mean response
   αi = the adjusted effect for treatment i
   µi = µ + αi

The Complete Model: (in Multiple Regression Format)
Y = β0 + δ1X1 + δ2X2 + ... + δk-1Xk-1 + β1X + ε
where Xi = 1 if the subject receives treatment i
      Xi = 0 otherwise

Comment: β0^(i) = β0 + δi for treatment i = 1, 2, 3, ..., k-1; and β0^(k) = β0.
Dependent Variable: Y
Independent Variables: X1, X2, ... , Xk-1, X

Testing for the Equality of Intercepts (Treatments)
H0: β0^(1) = β0^(2) = ... = β0^(k) (= β0 say) (q = k-1)
(or δ1 = δ2 = ... = δk-1 = 0)

The Reduced Model:
Y = β0 + β1X + ε
Dependent Variable: Y
Independent Variables: X

The Anova Table (Analysis of Covariance Table):

Source                                                    df       Sum of Squares    Mean Square           F
Regression (for the reduced model)                        1        SSReg             MSReg = SSReg/1       MSReg/s²
Departure from H0 (Equality of Intercepts (Treatments))   k-1      SSH0              MSH0 = SSH0/(k-1)     MSH0/s²
Residual (Error)                                          N-k-1    SSError           s²
Total                                                     N-1      SSTotal

where N = the total number of cases = n1 + n2 + ... + nk
and ni = the number of cases for treatment i

An Example
In this example we are comparing four treatments for reducing blood pressure in patients
whose blood pressure is abnormally high. Ten patients are randomly assigned to each of
the four treatment groups. In addition to the drop in blood pressure (Y) during the test
period the initial blood pressure (X) prior to the test period was also recorded. It was
thought that this would be correlated with Y. The data is given below for this experiment.
Treatment case 1 2 3 4 5 6 7 8 9 10
1 X 186 185 199 167 187 168 183 176 158 190
Y 34 36 41 34 36 38 39 34 37 35
2 X 183 202 149 187 182 139 167 192 160 185
Y 29 36 27 29 27 28 22 32 26 30
3 X 182 168 175 174 183 182 181 148 205 188
Y 27 30 28 31 28 25 27 25 32 25
4 X 176 202 159 164 176 173 159 167 174 175
Y 26 26 20 18 27 20 24 22 22 25

The data as it would appear in a data file:

X Y Treatment X1 X2 X3
186 34 1 1 0 0
185 36 1 1 0 0
199 41 1 1 0 0
167 34 1 1 0 0
187 36 1 1 0 0
168 38 1 1 0 0
183 39 1 1 0 0
176 34 1 1 0 0
158 37 1 1 0 0
190 35 1 1 0 0
183 29 2 0 1 0
202 36 2 0 1 0
149 27 2 0 1 0
187 29 2 0 1 0
182 27 2 0 1 0
139 28 2 0 1 0
167 22 2 0 1 0
192 32 2 0 1 0
160 26 2 0 1 0
185 30 2 0 1 0
182 27 3 0 0 1
168 30 3 0 0 1
175 28 3 0 0 1
174 31 3 0 0 1
183 28 3 0 0 1
182 25 3 0 0 1
181 27 3 0 0 1
148 25 3 0 0 1
205 32 3 0 0 1
188 25 3 0 0 1
176 26 4 0 0 0
202 26 4 0 0 0
159 20 4 0 0 0
164 18 4 0 0 0
176 27 4 0 0 0
173 20 4 0 0 0
159 24 4 0 0 0
167 22 4 0 0 0
174 22 4 0 0 0
175 25 4 0 0 0

The Complete Model
ANOVA
df SS MS F Significance F
Regression 4 1000.862103 250.2155258 36.6366318 4.66264E-12
Residual 35 239.0378966 6.829654189
Total 39 1239.9

Coefficients
Intercept 6.360395468
X1 12.68618508
X2 5.397430901
X3 4.211584999
X 0.096461476

The Reduced Model


ANOVA
df SS MS F Significance F
Regression 1 187.7440297 187.7440297 6.78062315 0.013076205
Residual 38 1052.15597 27.68831501
Total 39 1239.9

Coefficients
Intercept 2.991349082
X 0.147157885

The Anova Table for comparing intercepts:


ANOVA
df SS MS F Significance F
Testing for slope 1 187.7440297 187.7440297 27.48953674 7.68771E-06
Comparison of intercepts 3 813.1180737 271.0393579 39.68566349 2.32981E-11
Residual 35 239.0378966 6.829654189
Total 39 1239.9
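The comparison of intercepts can be sketched in the same way as the comparison of slopes, this time with 0/1 dummy variables for treatments 1 to 3 and the covariate X entered once. The sketch assumes NumPy; the variable names and the helper `fit` are hypothetical, and the data are those of the blood pressure example.

```python
import numpy as np

# Initial blood pressure (X) and drop in blood pressure (Y), ten patients per treatment.
x = np.array([
    186, 185, 199, 167, 187, 168, 183, 176, 158, 190,   # treatment 1
    183, 202, 149, 187, 182, 139, 167, 192, 160, 185,   # treatment 2
    182, 168, 175, 174, 183, 182, 181, 148, 205, 188,   # treatment 3
    176, 202, 159, 164, 176, 173, 159, 167, 174, 175,   # treatment 4
], dtype=float)
y = np.array([
    34, 36, 41, 34, 36, 38, 39, 34, 37, 35,
    29, 36, 27, 29, 27, 28, 22, 32, 26, 30,
    27, 30, 28, 31, 28, 25, 27, 25, 32, 25,
    26, 26, 20, 18, 27, 20, 24, 22, 22, 25,
], dtype=float)
treatment = np.repeat([1, 2, 3, 4], 10)

# Dummy variables X1, X2, X3: 1 if the patient received that treatment, 0 otherwise.
dummies = np.column_stack([(treatment == t).astype(float) for t in (1, 2, 3)])


def fit(columns):
    """Least-squares fit with an intercept; returns the coefficients and the RSS."""
    design = np.column_stack([np.ones(len(y)), columns])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta, float(np.sum((y - design @ beta) ** 2))


beta_c, rss_c = fit(np.column_stack([dummies, x]))   # complete model
beta_r, rss_r = fit(x)                               # reduced model (common intercept)

k = 4
df_error = len(y) - k - 1
ss_total = float(np.sum((y - y.mean()) ** 2))
ss_h0 = rss_r - rss_c                                # departure from H0 (equal intercepts)
f_intercepts = (ss_h0 / (k - 1)) / (rss_c / df_error)

print(beta_c, beta_r, ss_h0, f_intercepts)
```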

The Examination of Residuals
Introduction

Much can be learned by observing residuals. This is true not only for linear regression
models, but also for nonlinear regression models and analysis of variance models. In fact,
this is true for any situation where a model is fitted and measures of unexplained
variation (in the form of a set of residuals) are available for examination.
Quite often models that are proposed initially for a set of data are incorrect to some
extent. An important part of the modeling process is diagnosing the flaws in these
models. Much of this can be done by carefully examining the residuals.
The residuals are defined as the n differences ei = yi - ŷi, i = 1, 2, ..., n, where yi is an
observation and ŷi is the corresponding fitted value obtained by use of the fitted model.
We can see from this definition that the residuals, ei, are the differences between what is
actually observed and what is predicted by the model, that is, the amount which the model
has not been able to explain.
Many of the statistical procedures used in linear and nonlinear regression analysis are
based on certain assumptions about the random departures from the proposed model.
Namely, the random departures are assumed

i) to have zero mean,
ii) to have a constant variance, σ²,
iii) to be independent, and
iv) to follow a normal distribution.

Thus if the fitted model is correct, the residuals should exhibit tendencies that tend to
confirm the above assumptions, or at least, should not exhibit a denial of the assumptions.
When examining the residuals one should ask, "Do the residuals make it appear that our
assumptions are wrong?"
After examination of the residuals we shall be able to conclude either:

(1) the assumptions appear to be violated (in a way that can be specified), or
(2) the assumptions do not appear to be violated.

Note that (2), in the same spirit as hypothesis testing, does not mean that we are
concluding that the assumptions are correct; it means merely that on the basis of the data
we have seen, we have no reason to say that they are incorrect.
The methods for examining the residuals are sometimes graphical and sometimes
statistical.

The principal ways of plotting the residuals ei are

1. Overall.
2. In time sequence, if the order is known.
3. Against the fitted values ŷi.
4. Against the independent variables xij for each value of j.

In addition to these basic plots, the residuals should also be plotted

5. In any way that is sensible for the particular problem under consideration.

Overall Plot
The residuals can be plotted in an overall plot in several ways.
1. The scatter plot.
2. The histogram.
3. The box-whisker plot.
4. The kernel density plot.
5. A normal plot or a half normal plot on standard probability paper.

[Figures: examples of these plots, including a box-whisker plot of the residuals and a Normal P-P plot of E (expected cumulative probability against observed cumulative probability).]

If our model is correct these residuals should (approximately) resemble observations
from a normal distribution with zero mean. Does our overall plot contradict this idea?
Does the plot appear abnormal for a sample of n observations from a normal
distribution? How can we tell? With a little practice one can develop an excellent "feel" of
how abnormal a plot should look before it can be said to appear to contradict the
normality assumption. The standard statistical tests for testing Normality are:

1. The Kolmogorov-Smirnov test.
2. The Chi-square goodness of fit test.

The Kolmogorov-Smirnov test

The Kolmogorov-Smirnov test uses the empirical cumulative distribution function as a tool
for testing the goodness of fit of a distribution. The empirical distribution function is
defined below for n random observations:
Fn(x) = the proportion of observations in the sample that are less than or equal to x.
Let F0(x) denote the hypothesized cumulative distribution function of the population
(Normal population if we were testing normality). If F0(x) truly represented the distribution of
observations in the population then Fn(x) will be close to F0(x) for all values of x.
The Kolmogorov-Smirnov test statistic is

$$ D_n = \sup_x \left| F_n(x) - F_0(x) \right| $$

= the maximum distance between Fn(x) and F0(x).
If F0(x) does not provide a good fit to the distribution of the observations, Dn will be
large. Critical values for Dn are given in many texts.
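The statistic Dn can be computed directly from its definition. The sketch below uses only the standard library; `normal_cdf` and `ks_statistic` are hypothetical names and the residuals are made-up values. Because Fn(x) is a step function, the supremum is attained at a data point, just before or just after a jump.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of the normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def ks_statistic(sample, cdf):
    """D_n = sup_x |F_n(x) - F_0(x)| for a hypothesized cdf F_0."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f0 = cdf(x)
        # F_n jumps from (i - 1)/n to i/n at x, so both sides of the jump are checked.
        d = max(d, abs(i / n - f0), abs(f0 - (i - 1) / n))
    return d


# Hypothetical residuals, compared with a normal distribution with zero mean
# and standard deviation 2 (both values are assumptions for illustration).
residuals = [-2.1, -0.4, 0.3, 1.8, -1.2, 0.9, 0.1, -0.6]
d_n = ks_statistic(residuals, lambda v: normal_cdf(v, 0.0, 2.0))
print(round(d_n, 3))
```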

The Chi-square goodness of fit test
The Chi-square test uses the histogram as a tool for testing the goodness of fit of a
distribution. Let fi denote the observed frequency in each of the class intervals of the
histogram. Let Ei denote the expected number of observations in each class interval
assuming the hypothesized distribution. The hypothesized distribution is rejected if the
statistic

$$ \chi^2 = \sum_{i=1}^{m} \frac{(f_i - E_i)^2}{E_i} $$

is large (greater than the critical value from the chi-square
distribution with m - 1 degrees of freedom, where m = the number of class intervals used for
constructing the histogram).
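The chi-square statistic is a one-line sum. In the sketch below the function name `chi_square_statistic` and the observed and expected frequencies are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def chi_square_statistic(observed, expected):
    """Sum over class intervals of (f_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))


# Hypothetical frequencies for m = 4 class intervals.
observed = [8, 12, 20, 10]
expected = [10.0, 10.0, 20.0, 10.0]

stat = chi_square_statistic(observed, expected)
print(stat)   # compared with the chi-square critical value on m - 1 = 3 df
```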

Note. In the above tests it is assumed that the residuals are independent with a
common variance of σ². This is not completely accurate for this reason: although the
theoretical random errors εi are all assumed to be independent with the same variance σ²,
the residuals are not independent and they also do not have the same variance. They will
however be approximately independent with common variance if the sample size is large
relative to the number of parameters in the model. It is important to keep this in mind
when judging residuals when the number of observations is close to the number of
parameters in the model.

Time Sequence Plot


The residuals should exhibit a pattern of independence. If the data were collected over time
there is a strong possibility that the random departures from the model are
autocorrelated, that is, the random departures for observations that were taken at
neighbouring points in time are correlated. This autocorrelation can sometimes be
seen in a time sequence plot. The following three graphs show a sequence of residuals
that are respectively i) positively autocorrelated, ii) independent and iii) negatively
autocorrelated.
Residuals that are positively autocorrelated tend to stay positive (or negative) for long
periods of time. On the other hand residuals that are negatively autocorrelated tend to
oscillate frequently about zero. The performance of independent residuals is somewhere
in between these two extremes.

[Figures: time sequence plots of i) positively autocorrelated residuals, ii) independent residuals and iii) negatively autocorrelated residuals.]

There are several statistics and statistical tests that can also pick out autocorrelation
amongst the residuals. The most common are
i) The Durbin Watson statistic
ii) The autocorrelation function
iii) The runs test

The Durbin Watson statistic

The Durbin-Watson statistic, which is used frequently to detect serial correlation, is
defined by the following formula:

$$ D = \frac{\sum_{i=1}^{n-1} (e_i - e_{i+1})^2}{\sum_{i=1}^{n} e_i^2} $$

If the residuals are positively serially correlated the differences, ei - ei+1, will be stochastically small.
Hence a small value of the Durbin-Watson statistic will indicate positive autocorrelation.
Large values of the Durbin-Watson statistic on the other hand will indicate negative
autocorrelation. Critical values for this statistic can be found in many statistical
textbooks.
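A small sketch of the Durbin-Watson statistic follows; the function name and the toy residual sequences are hypothetical. A sequence that keeps its sign for long stretches gives a small D, while a sequence that alternates in sign gives a large D.

```python
def durbin_watson(e):
    """D = sum_{i=1}^{n-1} (e_i - e_{i+1})^2 / sum_{i=1}^{n} e_i^2."""
    numerator = sum((e[i] - e[i + 1]) ** 2 for i in range(len(e) - 1))
    denominator = sum(v * v for v in e)
    return numerator / denominator


slow = [1.0, 1.0, -1.0, -1.0]        # long runs of one sign
alternating = [1.0, -1.0, 1.0, -1.0]  # frequent oscillation about zero

d_slow = durbin_watson(slow)
d_alternating = durbin_watson(alternating)
print(d_slow, d_alternating)
```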

The autocorrelation function
The autocorrelation function at lag k is defined by

$$ r_k = \frac{\frac{1}{n-k}\sum_{i=1}^{n-k} (e_i - \bar{e})(e_{i+k} - \bar{e})}{\frac{1}{n}\sum_{i=1}^{n} (e_i - \bar{e})^2} = \frac{\frac{1}{n-k}\sum_{i=1}^{n-k} e_i e_{i+k}}{\frac{1}{n}\sum_{i=1}^{n} e_i^2} $$

(the second form applies when the mean of the residuals is zero).
This statistic measures the correlation between residuals that occur a distance k apart in
time. One would expect that residuals that are close in time are more correlated than
residuals that are separated by a greater distance in time. If the residuals are independent
then rk should be close to zero for all values of k. A plot of rk versus k can be very
revealing with respect to the independence of the residuals. Some typical patterns of the
autocorrelation function are given below:

[Figure: autocorrelation pattern for independent residuals.]

[Figures: various autocorrelation patterns for serially correlated residuals.]
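The autocorrelation function at lag k can be computed directly from its definition. In the sketch below the function name `autocorrelation` and the toy residual sequence are hypothetical; a strictly alternating sequence is used because its autocorrelations are known exactly.

```python
def autocorrelation(e, k):
    """r_k: lag-k average cross-product of deviations divided by the average squared deviation."""
    n = len(e)
    mean = sum(e) / n
    numerator = sum((e[i] - mean) * (e[i + k] - mean) for i in range(n - k)) / (n - k)
    denominator = sum((v - mean) ** 2 for v in e) / n
    return numerator / denominator


# A strictly alternating sequence: r_k alternates between -1 and +1.
e = [1.0, -1.0] * 4
acf = [autocorrelation(e, k) for k in range(4)]
print(acf)
```

Plotting the returned values against k gives the kind of pattern shown in the figures above.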

The runs test


This test uses the fact that the residuals will oscillate about zero at a “normal” rate if the
random departures are independent. If the residuals oscillate slowly about zero, this is an
indication that there is a positive autocorrelation amongst the residuals. If the residuals
oscillate at a frequent rate about zero, this is an indication that there is a negative
autocorrelation amongst the residuals. In the “runs test”, one observes the time sequence
of the “sign” of the residuals:
+++--++---+++
and counts the number of runs (i.e. the number of periods that the residuals keep the same
sign). This should be low if the residuals are positively correlated and high if negatively
correlated.
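Counting runs takes only a few lines. The sketch below applies a hypothetical helper `count_runs` to the sign sequence shown above, and shows one way of obtaining such a sequence from a list of made-up residuals.

```python
def count_runs(signs):
    """Number of maximal stretches over which the sign stays the same."""
    runs = 0
    previous = None
    for s in signs:
        if s != previous:
            runs += 1
            previous = s
    return runs


n_runs = count_runs("+++--++---+++")
print(n_runs)

# Turning residuals into a sign sequence (hypothetical values).
residuals = [0.4, 1.1, -0.3, -0.8, 0.2]
signs = "".join("+" if r > 0 else "-" for r in residuals)
```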
Plot Against the Fitted Values ŷi and the Predictor Variables xij
If we "step back" from such a plot and the residuals behave in a manner consistent with
the assumptions of the model, we obtain the impression of a horizontal "band" of
residuals.

[Figure: a horizontal band of residuals.]

Individual observations lying considerably outside of this band indicate that the
observation may be an outlier. An outlier is an observation that is not following the
normal pattern of the other observations. Such an observation can have a considerable
effect on the estimation of the parameters of a model. Sometimes the outlier has occurred
because of a typographical error. If this is the case and it is detected then a correction can
be made. If the outlier occurs for other (and more natural) reasons it may be appropriate
to construct a model that incorporates the occurrence of outliers.

If our "step back" view of the residuals resembled either of the two patterns a) and b), we should
conclude that assumptions about the model are incorrect. Each pattern may indicate that a
different assumption may have to be made to explain the “abnormal” residual pattern.

[Figure: two abnormal residual patterns, a) and b).]

Pattern a) indicates that the variance of the random departures is not constant
(homogeneous) but increases as the value along the horizontal axis increases (time, or
one of the independent variables). This indicates that a weighted least squares analysis
should be used.

The second pattern, b), indicates that the mean value of the residuals is not zero. This is
usually because the model (linear or non linear) has not been correctly specified: linear
or quadratic terms that should have been included in the model have been omitted.
