Statistics 203: Introduction to Regression and Analysis of Variance

Multiple Linear Regression: Diagnostics

Jonathan Taylor

Today

Splines + other bases.
Diagnostics.

Spline models
What are the assumptions?
Problems in the regression function
Partial residual plot
Added-variable plot
Problems with the errors
Outliers & Influence
Dropping an observation
Different residuals
Crude outlier detection test
Bonferroni correction for multiple comparisons
DFFITS
Cook's distance
DFBETAS

Spline models

Splines are piecewise polynomial functions, i.e. on an interval between knots (t_i, t_{i+1}) the spline f(x) is polynomial, but the coefficients change from interval to interval.
Example: cubic spline with knots at t_1 < t_2 < \dots < t_h:

f(x) = \sum_{j=0}^{3} \beta_{0j} x^j + \sum_{i=1}^{h} \beta_i (x - t_i)_+^3

where

(x - t_i)_+ = x - t_i if x - t_i \ge 0, and 0 otherwise.

Here is an example.
Conditioning problem again: B-splines are used to keep the model subspace the same but make the design less ill-conditioned.
Other bases one might use: Fourier (sin and cos waves); wavelets (space/time localized basis for functions).
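Not from the original slides: a minimal numpy sketch of the truncated power basis above, which also shows the conditioning problem that motivates B-splines (the design matrix's condition number is large).

```python
import numpy as np

def truncated_power_basis(x, knots, degree=3):
    """Design matrix for a cubic spline in the truncated power basis:
    columns 1, x, x^2, x^3, then (x - t_i)_+^3 for each knot t_i."""
    cols = [x**j for j in range(degree + 1)]
    for t in knots:
        cols.append(np.clip(x - t, 0, None) ** degree)  # (x - t)_+^3
    return np.column_stack(cols)

x = np.linspace(0, 1, 100)
knots = [0.25, 0.5, 0.75]
X = truncated_power_basis(x, knots)
print(X.shape)            # (100, 7): 4 polynomial terms + 3 knot terms
print(np.linalg.cond(X))  # large: the basis columns are nearly collinear
```

The large condition number is why one refits the same model subspace in a B-spline basis in practice.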

What are the assumptions?

What is the full model for a given design matrix X?

Y_i = \beta_0 + \beta_1 X_{i1} + \dots + \beta_{p-1} X_{i,p-1} + \varepsilon_i

Errors \varepsilon \sim N(0, \sigma^2 I).

What can go wrong?
Regression function can be wrong: missing predictors, nonlinearity.
Assumptions about the errors can be wrong.
Outliers & influential observations: both in predictors and observations.
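Not from the slides: a small simulation under the full model above (design matrix with an intercept, i.i.d. normal errors), fit by least squares; all later diagnostics are computed from fits like this one.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3                       # n observations, p coefficients incl. intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])   # arbitrary true coefficients for illustration
sigma = 0.1
Y = X @ beta + sigma * rng.normal(size=n)   # errors ~ N(0, sigma^2 I)

# least-squares estimate b = (X'X)^{-1} X'Y
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta_hat)   # close to the true (1, 2, -0.5)
```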

Problems in the regression function

True regression function may have higher-order nonlinear terms, e.g. X_1^2, or even interactions X_1 X_2.
How to fix? Difficult in general; we will look at two plots: added-variable plots and partial residual plots.

Partial residual plot

For 1 \le j \le p - 1, let

e_{ij} = e_i + b_j X_{ij}.

Plotting e_{ij} against X_{ij} can help to determine whether the variance depends on X_j, and to spot outliers.
If there is a nonlinear trend, it is evidence that a linear term is not sufficient.
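Not from the slides: a numpy sketch of the partial residuals e_{ij} = e_i + b_j X_{ij} on simulated data. A useful property: because the residuals e are orthogonal to the columns of X, the least-squares slope of the partial residual plot equals b_j exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, Y, rcond=None)
e = Y - X @ b                       # ordinary residuals
j = 1
partial_resid = e + b[j] * X[:, j]  # e_ij = e_i + b_j X_ij; plot against X[:, j]

# slope of the partial residual plot recovers b_j
A = np.column_stack([np.ones(n), X[:, j]])
coef, *_ = np.linalg.lstsq(A, partial_resid, rcond=None)
print(coef[1], b[j])   # equal (up to rounding)
```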

Added-variable plot

For 1 \le j \le p - 1, let H_{(j)} be the hat matrix with this predictor deleted. Plot

(I - H_{(j)}) Y vs. (I - H_{(j)}) X_j.

The plot should be linear and the slope should be \beta_j. Why? Writing the model as

Y = X_{(j)} \beta_{(j)} + \beta_j X_j + \varepsilon

and multiplying through by (I - H_{(j)}):

(I - H_{(j)}) Y = (I - H_{(j)}) X_{(j)} \beta_{(j)} + \beta_j (I - H_{(j)}) X_j + (I - H_{(j)}) \varepsilon
(I - H_{(j)}) Y = \beta_j (I - H_{(j)}) X_j + (I - H_{(j)}) \varepsilon

since (I - H_{(j)}) X_{(j)} = 0.

Also can be helpful for detecting outliers.
If there is a nonlinear trend, it is evidence that a linear term is not sufficient.
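Not from the slides: a numpy sketch of the added-variable plot construction. The slope of the regression of (I - H_{(j)})Y on (I - H_{(j)})X_j equals the coefficient b_j from the full fit exactly (this is the Frisch-Waugh-Lovell theorem).

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0, -1.0]) + 0.5 * rng.normal(size=n)
j = 2

Xj = X[:, j]
X_minus = np.delete(X, j, axis=1)                              # X_(j)
H = X_minus @ np.linalg.solve(X_minus.T @ X_minus, X_minus.T)  # H_(j)
ry = Y - H @ Y         # (I - H_(j)) Y
rx = Xj - H @ Xj       # (I - H_(j)) X_j

slope = (rx @ ry) / (rx @ rx)   # slope of the added-variable plot
b_full, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(slope, b_full[j])         # equal (up to rounding)
```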

Problems with the errors

Errors may not be normally distributed. We will look at a QQ-plot for a graphical check. This may not affect inference much in large samples.
Variance may not be constant. Transformations can sometimes help correct this. Non-constant variance affects our estimates of SE(\hat\beta), which can change t and F statistics substantially!
Graphical checks of non-constant variance: added-variable plots, partial residual plots, fitted vs. residual plots.
Errors may not be independent. This can seriously affect our estimates of SE(\hat\beta).

Outliers & Influence

Some residuals may be much larger than others, which can affect the overall fit of the model. This may be evidence of an outlier: a point where the model fits very poorly. Outliers can be caused by many factors, and such points should not be automatically deleted from the dataset.
Even if an observation does not have a large residual, it can exert a strong influence on the regression function.
General strategy to measure influence: for each observation, drop it from the model and measure how much the model changes.

Dropping an observation

A subscript (i) indicates that the i-th observation was not used in fitting the model.
For example: \hat Y_{j(i)} is the regression function evaluated at the j-th observation's predictors, BUT the coefficients (b_{0(i)}, \dots, b_{p-1,(i)}) were fit after deleting the i-th row of data.
Basic idea: if \hat Y_{j(i)} is very different from \hat Y_j (using all the data), then i is an influential point for determining \hat Y_j.
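Not from the slides: a minimal leave-one-out refit in numpy. It also checks the standard identity Y_i - \hat Y_{i(i)} = e_i / (1 - H_{ii}), which lets the deletion diagnostics below be computed without actually refitting n times.

```python
import numpy as np

def fit_without(X, Y, i):
    """Refit OLS with the i-th observation deleted; returns b_(i)."""
    Xd = np.delete(X, i, axis=0)
    Yd = np.delete(Y, i)
    b, *_ = np.linalg.lstsq(Xd, Yd, rcond=None)
    return b

rng = np.random.default_rng(3)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b_all = np.linalg.lstsq(X, Y, rcond=None)[0]
b_drop0 = fit_without(X, Y, 0)
Yhat_0_loo = X[0] @ b_drop0   # \hat Y_{0(0)}: prediction at obs 0 fit without obs 0

# leave-one-out residual identity: Y_i - \hat Y_{i(i)} = e_i / (1 - H_ii)
H = X @ np.linalg.solve(X.T @ X, X.T)
e = Y - X @ b_all
print(Y[0] - Yhat_0_loo, e[0] / (1 - H[0, 0]))   # equal (up to rounding)
```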

Different residuals

Ordinary residuals: e_i = Y_i - \hat Y_i.
Standardized residuals: r_i = e_i / s(e_i) = e_i / (\hat\sigma \sqrt{1 - H_{ii}}), where H is the hat matrix. (rstandard in R)
Studentized residuals: t_i = e_i / (\hat\sigma_{(i)} \sqrt{1 - H_{ii}}) \sim t_{n-p-1}. (rstudent in R)
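Not from the slides: a numpy sketch computing all three residual types. It uses the standard shortcut (n - p - 1)\hat\sigma^2_{(i)} = (n - p)\hat\sigma^2 - e_i^2/(1 - H_{ii}) so that no refitting is needed, and checks it against an actual leave-one-out refit.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 50, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, Y, rcond=None)[0]
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = Y - X @ b                                    # ordinary residuals
sigma2 = e @ e / (n - p)                         # \hat\sigma^2
r = e / np.sqrt(sigma2 * (1 - h))                # standardized (rstandard)

# leave-one-out variance estimate \hat\sigma^2_(i), no refit needed
sigma2_i = ((n - p) * sigma2 - e**2 / (1 - h)) / (n - p - 1)
t = e / np.sqrt(sigma2_i * (1 - h))              # studentized (rstudent)
```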

Crude outlier detection test

If the studentized residuals are large: the observation may be an outlier.
Problem: if n is large and we threshold at t_{1-\alpha/2, n-p-1}, we will flag many "outliers" by chance even if the model is correct.
Solution: Bonferroni correction, threshold at t_{1-\alpha/(2n), n-p-1}.
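Not from the slides: the two thresholds computed with scipy, for illustrative values of n and p. The Bonferroni cutoff is much larger, so far fewer points are flagged by chance.

```python
from scipy.stats import t

n, p, alpha = 100, 5, 0.05
df = n - p - 1
plain = t.ppf(1 - alpha / 2, df)        # per-test threshold t_{1-alpha/2, n-p-1}
bonf = t.ppf(1 - alpha / (2 * n), df)   # Bonferroni threshold t_{1-alpha/(2n), n-p-1}
print(plain, bonf)   # roughly 2.0 vs 3.6
```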

Bonferroni correction for multiple comparisons

If we are doing many t (or other) tests, say m > 1, we can control the overall false positive rate at \alpha by testing each one at level \alpha/m.
Proof:

P(at least one false positive)
= P\left( \bigcup_{i=1}^{m} \{ |T_i| \ge t_{1-\alpha/(2m), n-p-2} \} \right)
\le \sum_{i=1}^{m} P\left( |T_i| \ge t_{1-\alpha/(2m), n-p-2} \right)
= \sum_{i=1}^{m} \frac{\alpha}{m}
= \alpha.
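Not from the slides: a small Monte Carlo check of the union-bound argument above, with independent t statistics under the null (the bound holds for dependent statistics too; independence just makes the simulation easy). The family-wise error rate stays near \alpha.

```python
import numpy as np
from scipy.stats import t as tdist

rng = np.random.default_rng(5)
m, df, alpha = 20, 50, 0.05
thresh = tdist.ppf(1 - alpha / (2 * m), df)   # each test run at level alpha/m

reps = 2000
# m independent null t statistics per replication
T = rng.standard_t(df, size=(reps, m))
# a "false positive" family = any |T_i| exceeding the Bonferroni threshold
fwer = np.mean((np.abs(T) > thresh).any(axis=1))
print(fwer)   # approximately alpha = 0.05
```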

DFFITS

DFFITS_i = \frac{\hat Y_i - \hat Y_{i(i)}}{\hat\sigma_{(i)} \sqrt{H_{ii}}}

This quantity measures how much the regression function changes at the i-th observation when the i-th observation is deleted.
For small/medium datasets: a value of 1 or greater is considered suspicious. For large datasets: a value of 2\sqrt{p/n}.
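Not from the slides: DFFITS computed by brute-force leave-one-out refitting, checked against the closed form DFFITS_i = t_i \sqrt{H_{ii}/(1 - H_{ii})}, where t_i is the studentized residual.

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, Y, rcond=None)[0]
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = Y - X @ b
sigma2 = e @ e / (n - p)
sigma2_i = ((n - p) * sigma2 - e**2 / (1 - h)) / (n - p - 1)  # \hat\sigma^2_(i)

dffits = np.empty(n)
for i in range(n):
    bi = np.linalg.lstsq(np.delete(X, i, 0), np.delete(Y, i), rcond=None)[0]
    dffits[i] = (X[i] @ b - X[i] @ bi) / np.sqrt(sigma2_i[i] * h[i])
```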

Cook's distance

D_i = \frac{\sum_{j=1}^{n} (\hat Y_j - \hat Y_{j(i)})^2}{p \, \hat\sigma^2}

This quantity measures how much the entire regression function changes when the i-th observation is deleted.
Should be comparable to F_{p, n-p}: if the "p-value" of D_i is 50 percent or more, then the i-th point is likely influential: investigate this point further.
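Not from the slides: Cook's distance computed directly from the definition above by leave-one-out refitting, checked against the closed form D_i = e_i^2 H_{ii} / (p \hat\sigma^2 (1 - H_{ii})^2).

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, Y, rcond=None)[0]
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = Y - X @ b
sigma2 = e @ e / (n - p)
Yhat = X @ b

D = np.empty(n)
for i in range(n):
    bi = np.linalg.lstsq(np.delete(X, i, 0), np.delete(Y, i), rcond=None)[0]
    # D_i: squared change in all n fitted values, scaled by p * sigma^2
    D[i] = np.sum((Yhat - X @ bi) ** 2) / (p * sigma2)
```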

DFBETAS

DFBETAS_{j(i)} = \frac{b_j - b_{j(i)}}{\sqrt{\hat\sigma^2_{(i)} (X^T X)^{-1}_{jj}}}

This quantity measures how much the coefficients change when the i-th observation is deleted.
For small/medium datasets: a value of 1 or greater is suspicious. For large datasets: a value of 2/\sqrt{n}.

Here is an example.
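Not from the slides: DFBETAS computed by leave-one-out refitting, checked against the Sherman-Morrison identity b - b_{(i)} = (X^T X)^{-1} x_i e_i / (1 - H_{ii}).

```python
import numpy as np

rng = np.random.default_rng(8)
n, p = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

b = np.linalg.lstsq(X, Y, rcond=None)[0]
XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum('ij,jk,ik->i', X, XtX_inv, X)   # diagonal of the hat matrix
e = Y - X @ b
sigma2 = e @ e / (n - p)
sigma2_i = ((n - p) * sigma2 - e**2 / (1 - h)) / (n - p - 1)  # \hat\sigma^2_(i)

dfbetas = np.empty((n, p))
for i in range(n):
    bi = np.linalg.lstsq(np.delete(X, i, 0), np.delete(Y, i), rcond=None)[0]
    dfbetas[i] = (b - bi) / np.sqrt(sigma2_i[i] * np.diag(XtX_inv))
```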
