330 Lecture 11 2014
13.08.2013
R-hint of the day
Outliers: ...
Independence: ...
Aims of today's lecture
▶ A high-leverage outlier
▶ A low-leverage outlier
[Figure: four scatter plots of y against x illustrating the possibilities: no high-leverage points and no outliers; a low-leverage outlier (big residual); a high-leverage outlier; and a high-leverage point that is not an outlier.]
Example: The education data (without urban)
[Figure: educ and under18 plotted against percap for the education data, with one point marked as a high leverage point.]
An outlier too?
[Figure: residuals(educ.lm) plotted against per capita income and against the fitted values, with points labelled by observation number; one residual is somewhat extreme.]
The fitted values $\hat{y}$ are related to the response $y$ by the equation
$$\hat{y} = Hy, \quad \text{where } H = X(X^TX)^{-1}X^T,$$
so that
$$\hat{y}_i = h_{i1}y_1 + \cdots + h_{ii}y_i + \cdots + h_{in}y_n.$$
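In R the leverages $h_{ii}$ (the diagonal of $H$) need not be computed by hand; a minimal sketch, assuming the educ.lm fit used earlier in the lecture:

h <- hatvalues(educ.lm)                  # the h_ii, extracted from the fitted model
X <- model.matrix(educ.lm)               # design matrix X
H <- X %*% solve(t(X) %*% X) %*% t(X)    # H = X (X'X)^{-1} X'
all.equal(unname(h), unname(diag(H)))    # the two routes agree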
The residuals $e_i$ have variance
$$\mathrm{Var}(e_i) = (1 - h_{ii})\sigma^2,$$
so high-leverage points tend to have small residuals.
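Because the variance depends on $h_{ii}$, residuals are usually standardised before plotting. A small sketch, again assuming the educ.lm fit:

r.std <- rstandard(educ.lm)       # standardised residuals e_i / (s sqrt(1 - h_ii))
e <- residuals(educ.lm)
s <- summary(educ.lm)$sigma       # estimate of sigma
h <- hatvalues(educ.lm)
all.equal(unname(r.std), unname(e / (s * sqrt(1 - h))))   # same thing by hand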
Residuals vs Fitted
[Figure: simulated data with the fitted line and the true line, alongside the corresponding "Residuals vs Fitted" plot with the most extreme residuals labelled.]
Leverage-residual plots
> plot(educ.lm,which=5)
[Figure: the "Residuals vs Leverage" plot for lm(educ ~ percap + under18): standardized residuals against leverage, with Cook's distance contours at 0.5 and 1 and points 7, 45 and 50 labelled.]
Interpreting LR plots
[Diagram: regions of the leverage-residual plot (standardized residuals against leverage). Low leverage with a large positive or negative standardized residual: low-leverage outlier. High leverage with a large residual: high-leverage outlier. Low leverage with a small residual: OK. High leverage with a small residual: potential high-leverage outlier.]
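One way to flag such points numerically is sketched below. The leverage cut-off 3(k+1)/n is the one used in the examples that follow; the residual cut-off of 2 is a common convention assumed here, and again educ.lm is the fit from earlier.

r <- rstudent(educ.lm)            # externally studentised residuals
h <- hatvalues(educ.lm)           # hat matrix diagonals (HMDs)
k <- length(coef(educ.lm)) - 1    # number of explanatory variables
n <- length(r)
which(abs(r) > 2)                 # candidate outliers (conventional cut-off of 2)
which(h > 3 * (k + 1) / n)        # candidate high-leverage points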
No big studentised residuals, no big HMDs
[Figure: simulated data with the fitted and true lines, and the corresponding leverage-residual plot; no studentised residual is large and no leverage exceeds 3(k+1)/n = 0.2.]
One big studentised residual, no big HMDs
[Figure: simulated data with the fitted and true lines, and the corresponding leverage-residual plot; one point has a large studentised residual but no leverage exceeds 3(k+1)/n = 0.2.]
No big studentised residuals, one big HMD
[Figure: simulated data with the fitted and true lines, and the corresponding leverage-residual plot; no studentised residual is large but one point has leverage beyond 3(k+1)/n = 0.2.]
One big studentised residual, one big HMD
[Figure: simulated data with the fitted and true lines, and the corresponding leverage-residual plot; one point has a large studentised residual and one has leverage beyond 3(k+1)/n = 0.2.]
Four big studentised residuals, one big HMD
[Figure: simulated data with the fitted and true lines, and the corresponding leverage-residual plot; four points have large studentised residuals and one has leverage beyond 3(k+1)/n = 0.2.]
HMD Summary
Deleting a point can change:
▶ Coefficients
▶ Fitted values
▶ Standard errors
The change in coefficient $b_j$ when point $i$ is deleted is measured by
$$\mathrm{dfbetas}_j = \frac{b_j - b_j[i]}{se(b_j)}.$$
Problematic when
$$|\mathrm{dffits}| > 3\sqrt{\frac{k+1}{n-k-1}}.$$
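Both quantities are built into R; a minimal sketch using dfbetas() and dffits() on the educ.lm fit, with the cut-off quoted above:

db <- dfbetas(educ.lm)    # one row per observation, one column per coefficient
round(head(db), 3)

dff <- dffits(educ.lm)
k <- length(coef(educ.lm)) - 1
n <- length(dff)
which(abs(dff) > 3 * sqrt((k + 1) / (n - k - 1)))   # points flagged by the cut-off above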
Coefficient measures: Covariance Ratio and Cook's D
Influence measures of
lm(formula = educ ~ percap + under18, data = educ.df):
[Figure: index plots of dfb.1_, dfb.prcp, dfb.un18, DFFITS, ABS(COV RATIO - 1), Cook's D and the hat values for the 50 observations, with the most extreme observations (notably 50) labelled.]
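The quantities plotted above can all be extracted in base R; a small sketch, assuming the educ.lm fit:

covratio(educ.lm)         # covariance ratios
cooks.distance(educ.lm)   # Cook's distances
hatvalues(educ.lm)        # hat matrix diagonals

# influence.measures() collects dfbetas, dffits, cov.ratio, Cook's D and the hats,
# and its summary lists the observations it flags as potentially influential
summary(influence.measures(educ.lm))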
▶ Delete a small number of points and refit (we do not want the fitted regression to be determined by one or two influential points); see the sketch below.
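A sketch of such a refit, using observation 50 purely as an illustrative choice of point to delete:

educ.lm.50 <- update(educ.lm, subset = -50)   # refit without observation 50 (illustrative)
round(cbind(full = coef(educ.lm), minus.50 = coef(educ.lm.50)), 4)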
Independence: ...
Independence
res <- residuals(lm.obj)   # residuals from the fitted model (lm.obj, as elsewhere in the slides)
plot(1:length(res), res, xlab="time", ylab="residuals",
     type="b")
abline(h=0, lty=2)
Residuals against time: Plot
[Figure: residuals plotted against time for three simulated series, with autocorrelation 0.9, 0.0 and -0.9.]
Residuals against their predecessor
n <- length(res)
prev.res <- res[-n]     # residuals 1, ..., n-1
plot.res <- res[-1]     # residuals 2, ..., n
plot(prev.res, plot.res, xlab="previous residual",
     ylab="residual")
Residuals against their predecessor: Plot
[Figure: each residual plotted against the previous residual for the three simulated series (autocorrelation 0.9, 0.0 and -0.9).]
acf(residuals(lm.obj))
[Figure: sample autocorrelation functions (ACF) of the residuals for the simulated series (autocorrelation 0.9, 0.0 and -0.9), plotted against lags 0 to 20.]
Durbin-Watson test
The errors are modelled as
$$\varepsilon_i = \rho\,\varepsilon_{i-1} + u_i$$
▶ $\rho$ is estimated by
$$\hat\rho = \frac{\sum_{i=2}^{n} e_i e_{i-1}}{\sum_{i=2}^{n} e_i^2}$$
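A direct computation of $\hat\rho$ from the residuals, as a sketch (assuming res holds the residuals, as above):

n <- length(res)
rho.hat <- sum(res[-1] * res[-n]) / sum(res[-1]^2)   # lag-1 autocorrelation estimate
rho.hat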
Durbin-Watson test
▶ The DW test statistic is
$$DW = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2} \approx 2(1 - \hat\rho)$$
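The statistic is easy to compute directly, and the $2(1-\hat\rho)$ approximation can be checked at the same time (continuing with res and rho.hat from the sketch above):

DW <- sum(diff(res)^2) / sum(res^2)      # the DW statistic
c(DW = DW, approx = 2 * (1 - rho.hat))   # compare with 2(1 - rho.hat)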
> library(car)
> dwt(model,simulate=T,max.lag=1)
[Diagram: decision scale for the DW statistic, marked at 0, dL, dU, 4 - dU, 4 - dL and 4, with inconclusive regions between the tabulated bounds.]
Does advertising increase sales? Residual plots
[Figure: residuals plotted against the previous residuals for the advertising model.]
Does advertising increase sales? Residual plots
[Figure: residuals plotted in time order, against observation index.]
Does advertising increase sales? Residual plots
[Figure: correlogram (ACF) of the residuals, plotted against lag.]
Durbin-Watson test
> library(R330)
> library(car)
> data(ad.df)
> ad.lm <- lm(sales~spend+prev.spend,data=ad.df)
> dwt(ad.lm,max.lag=2)
lag Autocorrelation D-W Statistic p-value
1 0.44249700 1.103870 0.006
2 0.07550052 1.789442 0.612
Alternative hypothesis: rho[lag] != 0
The small p-value at lag 1 (with a positive estimated autocorrelation of about 0.44) suggests the residuals are positively autocorrelated, so the independence assumption is in doubt.
Remedy
http://xkcd.com/539/