
STATS 330: Lecture 11

Outliers and Independence

13.08.2013
R-hint of the day

Applying a function to a data frame


> library(R330)   # package with the course data sets (contains cherry.df)
> data(cherry.df)
> apply(cherry.df, 2, mean)
 diameter   height   volume
 13.24839 76.00000 30.17097
> round(sapply(1:31, function(k) cherry.df[k, 1]/12), 2)
 [1] 0.69 0.72 0.73 0.88 0.89 0.90 0.92 0.92 0.92 0.93
[11] 0.94 0.95 0.95 0.98 1.00 1.07 1.07 1.11 1.14 1.15
[21] 1.17 1.18 1.21 1.33 1.36 1.44 1.46 1.49 1.50 1.50
[31] 1.72
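
The same per-row calculation can also be written with apply over the rows (margin 1) rather than sapply over the row indices; a small variant, assuming cherry.df is loaded as above:

> round(apply(cherry.df, 1, function(row) row["diameter"]/12), 2)
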
Diagnostic steps

We test for <assumption> using <diagnostic>:

Planarity: Residuals vs. fitted values, Added variable plots, GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Normality of Errors: QQ plots, Weisberg-Bingham test (and many more)

Outliers: ...

Independence: ...

Aims of today's lecture

- Introduce terminology for outliers and high-leverage points.
- Introduce a broad range of diagnostic tools to identify and treat such extraordinary data points.
- Introduce diagnostic tools to check for independence of errors.

Outliers and high-leverage points

- An outlier is a point that has a larger or smaller y value than the model would suggest.
  - It can be due to a genuine large error;
  - it can be caused by typographical errors in recording the data.
- A high-leverage point is a point with extreme values of the explanatory variables.

Outliers

- The effect of an outlier depends on whether it is also a high-leverage point.
- A high-leverage outlier
  - can attract the fitted plane, distorting the fit, sometimes extremely;
  - in extreme cases may not have a big residual;
  - in extreme cases can increase R^2.
- A low-leverage outlier
  - does not distort the fit to the same extent;
  - usually has a big residual;
  - inflates standard errors and decreases R^2.

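A quick simulation makes the contrast concrete; a minimal sketch (the numbers are arbitrary illustrative choices, not the data behind the plots on the next slides):

set.seed(330)
x <- runif(30, 0, 8)
y <- 2 + 3*x + rnorm(30, sd = 3)
x.out <- c(x, 20)                 # extreme x value: high leverage
y.out <- c(y, 20)                 # y far below the trend: an outlier
coef(lm(y ~ x))                   # close to the true intercept 2 and slope 3
coef(lm(y.out ~ x.out))           # slope dragged towards the outlier
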
Outliers

[Four example scatter plots of y against x (x from 0 to 8, y from 0 to 40): no high-leverage points and no outliers; a low-leverage outlier with a big residual; a high-leverage point that is not an outlier; and a high-leverage outlier.]

Example: The education data (without urban)

[Plot of the education data (educ, percap, under18); one observation, far from the rest in its explanatory variables, is marked as a high-leverage point.]

An outlier too?

[Left: per capita income against number of residents per 1000 under 18, with observation numbers plotted; observation 50 sits apart from the others. Right: residuals(educ.lm) against predict(educ.lm); the residual for observation 50 is somewhat extreme.]


Measuring leverage

The fitted values ŷ are related to the response y by

    ŷ = H y,  where  H = X (X^T X)^(-1) X^T,

so that

    ŷ_i = h_i1 y_1 + ... + h_ii y_i + ... + h_in y_n.

- The h_ij depend on the explanatory variables X; the diagonal entries h_ii are called the hat matrix diagonals (HMDs) and measure the influence y_i has on ŷ_i.
- h_ii also reflects the distance between x_i (the explanatory variables for point i) and their average x̄, i.e. how extreme the observation is.

Interpreting the HMDs

- Each HMD lies between 0 and 1.
- The average HMD is (k + 1)/n.
- An HMD larger than 3(k + 1)/n is considered extreme.


Example: The education data (without urban)

> hatvalues(educ.lm)

[Index plot of the hat matrix values for educ.lm (ranging from about 0.05 to 0.35), with reference lines at (k + 1)/n = 3/50 and 3(k + 1)/n = 9/50; observation 50 lies well above the 9/50 cutoff.]
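
The same check can be done numerically; a minimal sketch, assuming educ.lm has been fitted as in the earlier slides:

h <- hatvalues(educ.lm)          # hat matrix diagonals
k <- length(coef(educ.lm)) - 1   # number of explanatory variables (2)
n <- length(h)                   # number of observations (50)
mean(h)                          # always equals (k + 1)/n = 3/50
which(h > 3 * (k + 1)/n)         # observations above the 9/50 cutoff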


Studentised residuals

- How can we recognise a big residual? How big is big?
- The actual size depends on the units in which the y-variable is measured, so we need to standardise the residuals.
- We can divide them by their standard deviations.
- The variance of a typical residual e_i is

      Var(e_i) = (1 - h_ii) σ²,

  where h_ii is the i-th diagonal entry of the hat matrix H.


Studentised residuals

- Internally studentised (called "standardised" in R):

      e_i / sqrt((1 - h_ii) s²),

  where s² is the usual estimate of the residual variance σ².

- Externally studentised (called "studentised" in R):

      e_i / sqrt((1 - h_ii) s_i²),

  where s_i² is the estimate of σ² after deleting the i-th data point.


Studentised residuals

- How big is big?
- The internally studentised residuals are approximately standard normal if the model is OK and there are no outliers.
- The externally studentised residuals have a t-distribution.
- Thus, studentised residuals should lie between -2 and 2 with approximately 95% probability.

Studentised residuals: Calculating it in R

# Load the MASS library
> library(MASS)
# internally studentised (standardised in R)
> stdres(educ.lm)[50]
      50
3.089699
# externally studentised (studentised in R)
> studres(educ.lm)[50]
      50
3.424107

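As a check on the formula, the internally studentised residuals can also be computed by hand; a sketch, assuming educ.lm and the MASS library as above:

h  <- hatvalues(educ.lm)                       # hat matrix diagonals
e  <- residuals(educ.lm)                       # ordinary residuals
s2 <- summary(educ.lm)$sigma^2                 # usual estimate of sigma^2
r  <- e / sqrt((1 - h) * s2)                   # internally studentised residuals
all.equal(unname(r), unname(stdres(educ.lm)))  # should be TRUE
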
What does studentised mean?

Dividing a residual by an estimate of its standard deviation is called studentising (after Student's t); "externally" means the variance estimate s_i² excludes the point itself, so an outlier cannot inflate its own denominator.

Recognising outliers

- If a point is a low-leverage outlier, the residual will usually be large, so a large residual and a low HMD indicate an outlier.
- If a point is a high-leverage outlier, then a large error will usually cause a large residual.
- However, in extreme cases, a high-leverage outlier may not have a very big residual, depending on how much the point attracts the fitted plane. Thus, if a point has a large HMD and its residual is not particularly big, we cannot always tell whether the point is an outlier or not.

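These rules can be applied directly in R; a sketch for educ.lm, using the ±2 and 3(k + 1)/n cutoffs from the earlier slides:

r <- rstandard(educ.lm)                          # internally studentised residuals
h <- hatvalues(educ.lm)                          # hat matrix diagonals
cut.h <- 3 * length(coef(educ.lm)) / length(h)   # 3(k + 1)/n = 0.18 here
which(abs(r) >  2 & h <= cut.h)                  # low-leverage outliers
which(abs(r) >  2 & h >  cut.h)                  # high-leverage outliers
which(abs(r) <= 2 & h >  cut.h)                  # potential high-leverage outliers
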
High-leverage outlier

[Left: plot of y against x with the fitted line pulled away from the true line by a high-leverage outlier. Right: the corresponding residuals-vs-fitted plot.]

Leverage-residual plots

> plot(educ.lm, which=5)

[Residuals vs Leverage plot for lm(educ ~ percap + under18): standardised residuals against leverage with Cook's distance contours at 0.5 and 1; observations 7, 45 and 50 are labelled, and observation 50 lies beyond the contour at 1.]

Interpreting LR plots

[Schematic of the leverage-residual plot: points with big standardised residuals (positive or negative) and low leverage are low-leverage outliers; points with big residuals and high leverage are high-leverage outliers; points with small residuals but high leverage are potential high-leverage outliers; points with small residuals and low leverage are OK.]

No big studentised residuals, no big HMDs

[Example 1: plot of y against x with the fitted and true lines almost coinciding, and the corresponding residuals-vs-leverage plot; no standardised residuals are extreme and no leverage exceeds the cutoff 3(k+1)/n = 0.2.]

One big studentised residual, no big HMDs

[Example 2: a low-leverage outlier (observation 31) moves the fitted line only slightly away from the true line; in the residuals-vs-leverage plot its standardised residual is large while its leverage is below the cutoff 3(k+1)/n = 0.2.]

No big studentised residuals, one big HMD

[Example 3: a high-leverage point (observation 31) that is not an outlier; the fitted and true lines agree closely, and in the residuals-vs-leverage plot its leverage exceeds the cutoff 3(k+1)/n = 0.2 while its standardised residual is small.]

One big studentised residual, one big HMD

[Example 4: a high-leverage outlier (observation 31) attracts the fitted line well away from the true line; in the residuals-vs-leverage plot it has both a large standardised residual and a leverage beyond the cutoff 3(k+1)/n = 0.2.]

Four big studentised residuals, one big HMD

[Example 5: an extreme high-leverage outlier (observation 31) attracts the fitted line so strongly that other points (such as 8, 13 and 26) end up with big standardised residuals; observation 31 itself has leverage far beyond the cutoff 3(k+1)/n = 0.2.]

HMD Summary

Hat matrix diagonals
- measure the effect of a point on its fitted value;
- measure how outlying the x-values are (how high-leverage a point is);
- are always between 0 and 1, with bigger values indicating higher leverage;
- points with HMDs of more than 3(k + 1)/n are considered high-leverage.

Influential points

- How can we tell whether a high-leverage point or outlier is affecting the regression?
- By deleting the point and refitting the regression: a large change in the coefficients means the point is affecting the regression.
- Such points are called influential points.
- We do not want our analysis to be driven by one or two points!

Leave-one-out measures

- We can calculate a variety of measures by leaving out each data point in turn and looking at the change in key regression quantities such as:
  - coefficients
  - fitted values
  - standard errors
- We discuss each in turn.


Example: Education data

            With point 50   Without point 50
  Const           557.451            298.714
  percap            0.072              0.059
  under18           1.555              0.933
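
The refit can be reproduced directly; a minimal sketch, assuming educ.lm and educ.df as in the earlier slides:

educ50.lm <- update(educ.lm, subset = -50)              # drop observation 50 and refit
cbind(with.50 = coef(educ.lm), without.50 = coef(educ50.lm))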


Coefficient measures: DFBETAS

DFBETAS: standardised difference in coefficients

    dfbetas = (b_j - b_j[i]) / se(b_j),

where b_j[i] is the estimate of the j-th coefficient after deleting the i-th data point.

Problematic: when |dfbetas| > 1. This is a criterion coded into R.

Coefficient measures: DFFITS

DFFITS: standardised difference in fitted values

    dffits = (ŷ_j - ŷ_j[i]) / se(ŷ_j).

Problematic: when

    |dffits| > 3 sqrt((k + 1) / (n - k - 1)).

Coefficient measures: Covariance Ratio and Cook's D

Cov Ratio: measures the change in the standard errors of the estimated coefficients.

Problematic: when the Cov Ratio is greater than 1 + 3(k + 1)/n or smaller than 1 - 3(k + 1)/n.

Cook's D: measures the overall change in the coefficients.

Problematic: when greater than qf(.5, k+1, n-k-1) (the median of the F-distribution), roughly 1 in most cases.

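Each of these measures has a base-R function, so the cutoffs can be checked directly; a sketch for educ.lm (n = 50 observations, k = 2 explanatory variables):

n <- nobs(educ.lm)
k <- length(coef(educ.lm)) - 1
which(apply(abs(dfbetas(educ.lm)) > 1, 1, any))              # |dfbetas| > 1
which(abs(dffits(educ.lm)) > 3 * sqrt((k + 1)/(n - k - 1)))  # DFFITS cutoff
which(abs(covratio(educ.lm) - 1) > 3 * (k + 1)/n)            # Cov Ratio outside 1 +/- 3(k+1)/n
which(cooks.distance(educ.lm) > qf(0.5, k + 1, n - k - 1))   # Cook's D above the F median
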
Coefficient measures in R

> influence.measures(educ.lm)
Influence measures of
  lm(formula = educ ~ percap + under18, data = educ.df) :

     dfb.1_  dfb.prcp dfb.un18   dffit cov.r   cook.d    hat inf
1    0.0120  -0.01794 -0.00588  0.0233 1.121 1.84e-04 0.0494
...
10   0.0638  -0.16792 -0.02222 -0.3631 0.803 4.05e-02 0.0257   *
...
44   0.0229   0.00298 -0.02948 -0.0340 1.283 3.94e-04 0.1690   *
...
50  -2.3688   1.50181  2.23393  2.4733 0.821 1.66e+00 0.3429   *

(Cutoffs: |dfbetas| 1, |dffits| 0.7579, cov.r outside 0.82-1.18, Cook's D 0.80, hat 0.18.)

Plotting influence
# There will be seven plots
par(mfrow=c(2,4))
# Plot the measures using R330
influenceplots(educ.lm)
[Seven index plots against observation number: dfb.1_, dfb.prcp, dfb.un18, DFFITS, ABS(COV RATIO - 1), Cook's D and hat values; observation 50 stands out in every panel, with observations 10 and 44 also prominent in some.]


Remedies for outliers

- Correct typographical errors in the data.
- Delete a small number of points and refit (we do not want the fitted regression to be determined by one or two influential points).
- Report the existence of outliers separately: they are often of scientific interest.
- Do not delete too many points.


Diagnostic steps

We test for <assumption> using <diagnostic>:

Planarity: Residuals vs. fitted values, Added variable plots, GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Normality of Errors: QQ plots, Weisberg-Bingham test (and many more)

Outliers: Leverage-Residual plots, influence measures

Independence: ...

Independence

- One of the regression assumptions is that the errors are independent.
- Data collected sequentially over time often have errors that are not independent.
- If the independence assumption does not hold, the standard errors will be wrong and the tests and confidence intervals will be unreliable.
- We need to be able to detect lack of independence.


Types of dependence

- If large positive errors have a tendency to follow large positive errors, and large negative errors a tendency to follow large negative errors, we say the data has positive autocorrelation.
- If large positive errors have a tendency to follow large negative errors, and large negative errors a tendency to follow large positive errors, we say the data has negative autocorrelation.

Diagnostics: Positive Autocorrelation

If the errors are positively autocorrelated:

- Plotting the residuals against time will show long runs of positive and negative residuals.
- Plotting residuals against the previous residual (i.e. e_i vs. e_{i-1}) will show a positive trend.
- A correlogram of the residuals will show positive spikes, gradually decaying.

Diagnostics: Negative Autocorrelation

If the errors are negatively autocorrelated:

- Plotting the residuals against time will show alternating positive and negative residuals.
- Plotting residuals against the previous residual (i.e. e_i vs. e_{i-1}) will show a negative trend.
- A correlogram of the residuals will show alternating positive and negative spikes, gradually decaying.

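These patterns are easy to reproduce by simulating an autocorrelated series with arima.sim; a sketch (ar = 0.9 is just an illustrative choice; use ar = -0.9 for the negative case):

set.seed(1)
e <- arima.sim(model = list(ar = 0.9), n = 100)  # positively autocorrelated "residuals"
par(mfrow = c(1, 3))
plot(1:100, e, type = "b", xlab = "time", ylab = "residual")         # long runs
plot(e[-100], e[-1], xlab = "previous residual", ylab = "residual")  # positive trend
acf(e)                                                               # slowly decaying spikes
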
Residuals against time

res <- residuals(lm.obj)
plot(1:length(res), res, xlab="time", ylab="residuals", type="b")
abline(h=0, lty=2)

Residuals against time: Plot

[Three time plots of simulated residuals (time 0 to 100): with autocorrelation 0.9 the series shows long runs above and below zero; with autocorrelation 0.0 it fluctuates randomly; with autocorrelation -0.9 it alternates rapidly in sign.]

Residuals against their predecessor

res <- residuals(lm.obj)
n <- length(res)
plot.res <- res[-1]   # element 1 has no predecessor
prev.res <- res[-n]   # the last residual has no successor
plot(prev.res, plot.res, xlab="previous residual", ylab="residual")

Residuals against their predecessor: Plot

[Three scatter plots of each residual against the previous residual: a clear positive trend when the autocorrelation is 0.9, no trend when it is 0.0, and a clear negative trend when it is -0.9.]


Correlogram

acf(residuals(lm.obj))

- The autocorrelation function (acf, also known as the correlogram) investigates the correlation between a residual and another residual k time units apart.
- This is also called the lag k autocorrelation.

Correlogram: Plot

[ACF plots of the three simulated residual series (lags 0 to 20): positive, slowly decaying spikes when the autocorrelation is 0.9; no spikes beyond lag 0 when it is 0.0; spikes alternating in sign when it is -0.9.]

Durbin-Watson test

- We can also do a formal hypothesis test for independence, the Durbin-Watson test.
- The test assumes the errors follow a model of the form

      ε_i = ρ ε_{i-1} + u_i,

  where the u_i are independent, normal and have constant variance, and ρ is the lag 1 correlation.
- This is the autoregressive model of order 1.


Durbin-Watson test

- When ρ = 0 the errors are independent.
- DW tests independence by testing ρ = 0.
- ρ is estimated by

      ρ̂ = ( Σ_{i=2..n} e_i e_{i-1} ) / ( Σ_{i=2..n} e_i² ).

Durbin-Watson test

- The DW test statistic is

      DW = ( Σ_{i=2..n} (e_i - e_{i-1})² ) / ( Σ_{i=1..n} e_i² ) ≈ 2(1 - ρ̂).

- The value of DW is between 0 and 4.
- Values of DW around 2 are consistent with independence.
- Values close to 4 indicate negative autocorrelation.
- Values close to 0 indicate positive autocorrelation.
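
Both quantities are easy to compute directly from the residuals; a minimal sketch, for any fitted model lm.obj as on the earlier slides:

e   <- residuals(lm.obj)
n   <- length(e)
rho <- sum(e[-1] * e[-n]) / sum(e[-1]^2)       # estimate of the lag 1 correlation
DW  <- sum(diff(e)^2) / sum(e^2)               # Durbin-Watson statistic
c(rho = rho, DW = DW, approx = 2 * (1 - rho))  # DW is roughly 2(1 - rho)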


Durbin-Watson test in R

> library(car)
> dwt(model, simulate=T, max.lag=1)

[Decision regions for the classical DW test on the 0-4 scale: values between 0 and dL indicate positive autocorrelation, values between dU and 4 - dU are consistent with independence, and values between 4 - dL and 4 indicate negative autocorrelation; values in (dL, dU) and (4 - dU, 4 - dL) are inconclusive.]

Does advertising increase sales? Residual plots

[Scatter plot of the residuals against the previous (lag 1) residuals for the advertising model; residuals range from roughly -5 to 5.]

Does advertising increase sales? Residual plots

[Plot of the residuals against time (index 0 to 35) for the advertising model.]

Does advertising increase sales? Residual plots

[Correlogram of the residuals for the advertising model (lags 0 to 15), with a noticeable positive spike at lag 1.]

Durbin-Watson test

> library(R330)
> library(car)
> data(ad.df)
> ad.lm <- lm(sales~spend+prev.spend,data=ad.df)
> dwt(ad.lm,max.lag=2)
lag Autocorrelation D-W Statistic p-value
1 0.44249700 1.103870 0.006
2 0.07550052 1.789442 0.612
Alternative hypothesis: rho[lag] != 0
Remedy

- If we detect serial correlation, we need to fit special time series models to the data.
- For full details see STATS 326/726.


Remedy

- Recall there was a trend in the time series plot of the residuals: they seem related to time.
- Thus, time is a "lurking variable", a variable that should be in the regression but is not.
- Try the model sales ~ spend + prev.spend + time.


Fitting the new model

> time <- 1:35
> ad.lm2 <- lm(sales~spend+prev.spend+time,data=ad.df)
> dwt(ad.lm2,max.lag=2)
 lag Autocorrelation D-W Statistic p-value
   1       0.1234779      1.619324   0.152
   2      -0.3925537      2.608298   0.070
 Alternative hypothesis: rho[lag] != 0

- The lag 1 problem is resolved, and time is a significant variable in the model.


Diagnostic steps

We test for <assumption> using <diagnostic>:

Planarity: Residuals vs. fitted values, Added variable plots, GAM plots, Box-Cox plot

Constant Variance: Funnel plots, weighted least squares

Normality of Errors: QQ plots, Weisberg-Bingham test (and many more)

Outliers: Leverage-Residual plots, influence measures

Independence: Correlogram, Durbin-Watson test


Statistical outliers

http://xkcd.com/539/
