Statistics 203: Introduction to Regression and Analysis of Variance

Multiple Linear Regression: Diagnostics

Jonathan Taylor

Today

Splines + other bases.
Diagnostics.

Spline models
What are the assumptions?
Problems in the regression function
Partial residual plot
Added-variable plot
Problems with the errors
Outliers & Influence
Dropping an observation
Different residuals
Crude outlier detection test
Bonferroni correction for multiple comparisons
DFFITS
Cook's distance
DFBETAS

Spline models

Splines are piecewise polynomial functions, i.e. on an interval between knots (t_i, t_{i+1}) the spline f(x) is polynomial, but the coefficients change from interval to interval.
Example: cubic spline with knots at t_1 < t_2 < \dots < t_h:

f(x) = \sum_{j=0}^{3} \beta_{0j} x^j + \sum_{i=1}^{h} \beta_i (x - t_i)_+^3

where

(x - t_i)_+ = x - t_i if x - t_i \ge 0, and 0 otherwise.

Here is an example.
Conditioning problem again: B-splines are used to keep the model subspace the same but make the design less ill-conditioned.
Other bases one might use: Fourier (sin and cos waves); wavelets (space/time localized basis for functions).
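Not from the original slides: a minimal numpy sketch of the truncated power basis above, which also shows the conditioning problem that motivates B-splines (the design matrix's condition number is large).

```python
import numpy as np

def truncated_power_basis(x, knots, degree=3):
    """Design matrix for a cubic spline in the truncated power basis:
    columns 1, x, x^2, x^3, then (x - t_i)_+^3 for each knot t_i."""
    cols = [x**j for j in range(degree + 1)]
    for t in knots:
        cols.append(np.clip(x - t, 0, None) ** degree)  # (x - t)_+^3
    return np.column_stack(cols)

x = np.linspace(0, 1, 100)
knots = [0.25, 0.5, 0.75]
X = truncated_power_basis(x, knots)
print(X.shape)            # (100, 7): 4 polynomial terms + 3 knot terms
print(np.linalg.cond(X))  # large: the basis columns are nearly collinear
```

The large condition number is why one refits the same model subspace in a B-spline basis in practice.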

What are the assumptions?

What is the full model for a given design matrix X?

Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_{p-1} X_{i,p-1} + \varepsilon_i

Errors \varepsilon \sim N(0, \sigma^2 I).

What can go wrong?
Regression function can be wrong: missing predictors, nonlinearity.
Assumptions about the errors can be wrong.
Outliers & influential observations: both in predictors and observations.
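Not from the slides: a small simulation under the full model above (design matrix with an intercept, i.i.d. normal errors), fit by least squares; all later diagnostics are computed from fits like this one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3                       # n observations, p coefficients incl. intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])   # arbitrary true coefficients for illustration
sigma = 0.1
Y = X @ beta + sigma * rng.normal(size=n)   # errors ~ N(0, sigma^2 I)

# least-squares estimate b = (X'X)^{-1} X'Y
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)   # close to the true (1, 2, -0.5)
```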

Problems in the regression function

True regression function may have higher-order nonlinear terms, e.g. X_1^2, or even interactions X_1 X_2.
How to fix? Difficult in general; we will look at two plots: added-variable plots and partial residual plots.

Partial residual plot

For 1 \le j \le p - 1, let

e_{ij} = e_i + b_j X_{ij}.

Plotting e_{ij} against X_{ij} can help to determine whether the variance depends on X_j, and to spot outliers.
If there is a nonlinear trend, it is evidence that a linear term is not sufficient.
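Not from the slides: a numpy sketch of the partial residuals e_{ij} = e_i + b_j X_{ij} on simulated data. A useful property: because the residuals e are orthogonal to the columns of X, the least-squares slope of the partial residual plot equals b_j exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ b                       # ordinary residuals
j = 1
partial_resid = e + b[j] * X[:, j]  # e_ij = e_i + b_j X_ij; plot against X[:, j]

# slope of the partial residual plot recovers b_j
A = np.column_stack([np.ones(n), X[:, j]])
coef, *_ = np.linalg.lstsq(A, partial_resid, rcond=None)
print(coef[1], b[j])   # equal (up to rounding)
```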

Added-variable plot

For 1 \le j \le p - 1, let H_{(j)} be the hat matrix with this predictor deleted. Plot

(I - H_{(j)}) Y vs. (I - H_{(j)}) X_j.

The plot should be linear and the slope should be \beta_j. Why? Writing the model as

Y = X_{(j)} \beta_{(j)} + \beta_j X_j + \varepsilon

and multiplying through by (I - H_{(j)}):

(I - H_{(j)}) Y = (I - H_{(j)}) X_{(j)} \beta_{(j)} + \beta_j (I - H_{(j)}) X_j + (I - H_{(j)}) \varepsilon
(I - H_{(j)}) Y = \beta_j (I - H_{(j)}) X_j + (I - H_{(j)}) \varepsilon

since (I - H_{(j)}) X_{(j)} = 0.

Also can be helpful for detecting outliers.
If there is a nonlinear trend, it is evidence that a linear term is not sufficient.
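Not from the slides: a numpy sketch of the added-variable plot construction. The slope of the regression of (I - H_{(j)})Y on (I - H_{(j)})X_j equals the coefficient b_j from the full fit exactly (this is the Frisch-Waugh-Lovell theorem).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.normal(size=n)
j = 2

Xj = X[:, j]
X_minus = np.delete(X, j, axis=1)                              # X_(j)
H = X_minus @ np.linalg.solve(X_minus.T @ X_minus, X_minus.T)  # H_(j)
ry = Y - H @ Y         # (I - H_(j)) Y
rx = Xj - H @ Xj       # (I - H_(j)) X_j

slope = (rx @ ry) / (rx @ rx)   # slope of the added-variable plot
b_full, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(slope, b_full[j])         # equal (up to rounding)
```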

Problems with the errors

Errors may not be normally distributed. We will look at a QQ-plot for a graphical check. This may not affect inference much in large samples.
Variance may not be constant. Transformations can sometimes help correct this. Non-constant variance affects our estimates of SE(\hat\beta), which can change t and F statistics substantially!
Graphical checks of non-constant variance: added-variable plots, partial residual plots, fitted vs. residual plots.
Errors may not be independent. This can seriously affect our estimates of SE(\hat\beta).

Outliers & Influence

Some residuals may be much larger than others, which can affect the overall fit of the model. This may be evidence of an outlier: a point where the model fits very poorly. Outliers can be caused by many factors, and such points should not be automatically deleted from the dataset.
Even if an observation does not have a large residual, it can exert a strong influence on the regression function.
General strategy to measure influence: for each observation, drop it from the model and measure how much the model changes.

Dropping an observation

A subscript (i) indicates that the i-th observation was not used in fitting the model.
For example: \hat Y_{j(i)} is the regression function evaluated at the j-th observation's predictors, BUT the coefficients (b_{0(i)}, \dots, b_{p-1,(i)}) were fit after deleting the i-th row of data.
Basic idea: if \hat Y_{j(i)} is very different from \hat Y_j (using all the data), then i is an influential point for determining \hat Y_j.
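Not from the slides: a minimal leave-one-out refit in numpy. It also checks the standard identity Y_i - \hat Y_{i(i)} = e_i / (1 - H_{ii}), which lets the deletion diagnostics below be computed without actually refitting n times.

```python
import numpy as np

def fit_without(X, Y, i):
    """Refit OLS with the i-th observation deleted; returns b_(i)."""
    Xd = np.delete(X, i, axis=0)
    Yd = np.delete(Y, i)
    b, *_ = np.linalg.lstsq(Xd, Yd, rcond=None)
    return b

rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b_all = np.linalg.lstsq(X, Y, rcond=None)[0]
b_drop0 = fit_without(X, Y, 0)
Yhat_0_loo = X[0] @ b_drop0   # \hat Y_{0(0)}: prediction at obs 0 fit without obs 0

# leave-one-out residual identity: Y_i - \hat Y_{i(i)} = e_i / (1 - H_ii)
H = X @ np.linalg.solve(X.T @ X, X.T)
e = Y - X @ b_all
print(Y[0] - Yhat_0_loo, e[0] / (1 - H[0, 0]))   # equal (up to rounding)
```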

Different residuals

Ordinary residuals: e_i = Y_i - \hat Y_i.
Standardized residuals: r_i = e_i / s(e_i) = e_i / (\hat\sigma \sqrt{1 - H_{ii}}), where H is the hat matrix. (rstandard in R)
Studentized residuals: t_i = e_i / (\hat\sigma_{(i)} \sqrt{1 - H_{ii}}) \sim t_{n-p-1}. (rstudent in R)
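Not from the slides: a numpy sketch computing all three residual types. It uses the standard shortcut (n - p - 1)\hat\sigma^2_{(i)} = (n - p)\hat\sigma^2 - e_i^2/(1 - H_{ii}) so that no refitting is needed, and checks it against an actual leave-one-out refit.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, Y, rcond=None)[0]
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = Y - X @ b                                    # ordinary residuals
sigma2 = e @ e / (n - p)                         # \hat\sigma^2
r = e / np.sqrt(sigma2 * (1 - h))                # standardized (rstandard)

# leave-one-out variance estimate \hat\sigma^2_(i), no refit needed
sigma2_i = ((n - p) * sigma2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(sigma2_i * (1 - h))              # studentized (rstudent)
```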

Crude outlier detection test

If the studentized residuals are large: the observation may be an outlier.
Problem: if n is large and we threshold at t_{1-\alpha/2, n-p-1}, we will flag many "outliers" by chance even if the model is correct.
Solution: Bonferroni correction, threshold at t_{1-\alpha/(2n), n-p-1}.
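Not from the slides: the two thresholds computed with scipy, for illustrative values of n and p. The Bonferroni cutoff is much larger, so far fewer points are flagged by chance.

```python
from scipy.stats import t

n, p, alpha = 100, 5, 0.05
df = n - p - 1
plain = t.ppf(1 - alpha / 2, df)        # per-test threshold t_{1-alpha/2, n-p-1}
bonf = t.ppf(1 - alpha / (2 * n), df)   # Bonferroni threshold t_{1-alpha/(2n), n-p-1}
print(plain, bonf)   # roughly 2.0 vs 3.6
```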

Bonferroni correction for multiple comparisons

If we are doing many t (or other) tests, say m > 1, we can control the overall false positive rate at \alpha by testing each one at level \alpha/m.
Proof:

P(at least one false positive)
= P\left( \bigcup_{i=1}^{m} \{ |T_i| \ge t_{1-\alpha/(2m), n-p-2} \} \right)
\le \sum_{i=1}^{m} P\left( |T_i| \ge t_{1-\alpha/(2m), n-p-2} \right)
= \sum_{i=1}^{m} \frac{\alpha}{m}
= \alpha.
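Not from the slides: a small Monte Carlo check of the union-bound argument above, with independent t statistics under the null (the bound holds for dependent statistics too; independence just makes the simulation easy). The family-wise error rate stays near \alpha.

```python
import numpy as np
from scipy.stats import t as tdist

rng = np.random.default_rng(5)
m, df, alpha = 20, 50, 0.05
thresh = tdist.ppf(1 - alpha / (2 * m), df)   # each test run at level alpha/m

reps = 2000
# m independent null t statistics per replication
T = rng.standard_t(df, size=(reps, m))
# a "false positive" family = any |T_i| exceeding the Bonferroni threshold
fwer = np.mean((np.abs(T) > thresh).any(axis=1))
print(fwer)   # approximately alpha = 0.05
```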

DFFITS

DFFITS_i = \frac{\hat Y_i - \hat Y_{i(i)}}{\hat\sigma_{(i)} \sqrt{H_{ii}}}

This quantity measures how much the regression function changes at the i-th observation when the i-th observation is deleted.
For small/medium datasets: a value of 1 or greater is considered suspicious. For large datasets: a value of 2\sqrt{p/n}.
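Not from the slides: DFFITS computed by brute-force leave-one-out refitting, checked against the closed form DFFITS_i = t_i \sqrt{H_{ii}/(1 - H_{ii})}, where t_i is the studentized residual.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, Y, rcond=None)[0]
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = Y - X @ b
sigma2 = e @ e / (n - p)
sigma2_i = ((n - p) * sigma2 - e**2 / (1 - h)) / (n - p - 1)  # \hat\sigma^2_(i)

dffits = np.empty(n)
for i in range(n):
    bi = np.linalg.lstsq(np.delete(X, i, 0), np.delete(Y, i), rcond=None)[0]
    dffits[i] = (X[i] @ b - X[i] @ bi) / np.sqrt(sigma2_i[i] * h[i])
```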

Cook's distance

D_i = \frac{\sum_{j=1}^{n} (\hat Y_j - \hat Y_{j(i)})^2}{p \, \hat\sigma^2}

This quantity measures how much the entire regression function changes when the i-th observation is deleted.
Should be comparable to F_{p, n-p}: if the "p-value" of D_i is 50 percent or more, then the i-th point is likely influential: investigate this point further.
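Not from the slides: Cook's distance computed directly from the definition above by leave-one-out refitting, checked against the closed form D_i = e_i^2 H_{ii} / (p \hat\sigma^2 (1 - H_{ii})^2).

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, Y, rcond=None)[0]
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = Y - X @ b
sigma2 = e @ e / (n - p)
Yhat = X @ b

D = np.empty(n)
for i in range(n):
    bi = np.linalg.lstsq(np.delete(X, i, 0), np.delete(Y, i), rcond=None)[0]
    # D_i: squared change in all n fitted values, scaled by p * sigma^2
    D[i] = np.sum((Yhat - X @ bi) ** 2) / (p * sigma2)
```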

DFBETAS

DFBETAS_{j(i)} = \frac{b_j - b_{j(i)}}{\sqrt{\hat\sigma^2_{(i)} (X^T X)^{-1}_{jj}}}

This quantity measures how much the coefficients change when the i-th observation is deleted.
For small/medium datasets: a value of 1 or greater is suspicious. For large datasets: a value of 2/\sqrt{n}.

Here is an example.
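Not from the slides: DFBETAS computed by leave-one-out refitting, checked against the Sherman-Morrison identity b - b_{(i)} = (X^T X)^{-1} x_i e_i / (1 - H_{ii}).

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, Y, rcond=None)[0]
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # diagonal of the hat matrix
e = Y - X @ b
sigma2 = e @ e / (n - p)
sigma2_i = ((n - p) * sigma2 - e**2 / (1 - h)) / (n - p - 1)  # \hat\sigma^2_(i)

dfbetas = np.empty((n, p))
for i in range(n):
    bi = np.linalg.lstsq(np.delete(X, i, 0), np.delete(Y, i), rcond=None)[0]
    dfbetas[i] = (b - bi) / np.sqrt(sigma2_i[i] * np.diag(XtX_inv))
```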
