Lecture 20: Outliers and Influential Points
An outlier is a point with a large residual. An influential point is a point that has a large
impact on the regression. Surprisingly, these are not the same thing. A point can be an outlier
without being influential. A point can be influential without being an outlier. A point can be both
or neither.
Figure 1 shows four famous datasets due to Frank Anscombe. If you run least squares on each
dataset you will get the same output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0001 1.1247 2.667 0.02573 *
x 0.5001 0.1179 4.241 0.00217 **
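R happens to ship Anscombe's four data sets as the built-in data frame anscombe (columns x1-x4 and
y1-y4), so a quick sketch to verify the identical fits is:
data(anscombe)
summary(lm(y1 ~ x1, data=anscombe))  # top-left panel
summary(lm(y4 ~ x4, data=anscombe))  # bottom-right panel: same coefficients, SEs, t values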
The top left plot has no problems. The top right plot shows a non-linear pattern. The bottom
left plot has an outlier. The bottom right plot has an influential point. Imagine what would
happen if we deleted the rightmost point. If you looked at residual plots, you would see problems
in the second and third case. But the residual plot for the fourth example would look fine. You
can't see influence in the usual residual plot.
1 Modified Residuals
Let e be the vector of residuals. Recall that
$$ e = (I - H)Y, \qquad \mathbb{E}[e] = 0, \qquad \mathrm{Var}(e) = \sigma^2 (I - H). $$
Thus the standard error of $e_i$ is $\hat{\sigma}\sqrt{1 - h_{ii}}$, where $h_{ii} \equiv H_{ii}$. We then call
$$ r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}} $$
the standardized residual.
There is another type of residual $t_i$ which goes under various names: the jackknife residual, the
cross-validated residual, the externally studentized residual, or the studentized deleted residual.
Let $\hat{Y}_{i(-i)}$ be the predicted value for the $i$th data point when $(X_i, Y_i)$ is omitted from the data.
Then $t_i$ is defined by
$$ t_i = \frac{Y_i - \hat{Y}_{i(-i)}}{s_i} \qquad (1) $$
where $s_i^2$ is the estimated variance of $Y_i - \hat{Y}_{i(-i)}$. It can be shown that
$$ t_i = r_i \sqrt{\frac{n-p-2}{n-p-1-r_i^2}} = \frac{e_i}{\hat{\sigma}_{(-i)}\sqrt{1-h_{ii}}} \qquad (2) $$
where $\hat{\sigma}^2_{(-i)}$ is the estimate of $\sigma^2$ after $(X_i, Y_i)$ is omitted from the data.
[Figure 1 appears here: four scatterplot panels, y1 vs x1, y2 vs x2, y3 vs x3, and y4 vs x4.]
Figure 1: Four data sets that have the same fitted line. Top left: no problems. Top right: a non-linear
pattern. Bottom left: an outlier. Bottom right: an influential point.
The cool thing is that we can compute $t_i$ without ever having to actually delete the observation
and re-fit the model.
Everything you have done so far with residuals can also be done with standardized or jackknife
residuals.
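As a sanity check on these formulas (a small sketch on simulated data, not part of the original example),
we can compare hand computations of $r_i$ and $t_i$ with R's rstandard and rstudent functions:
set.seed(1)
x = runif(50); y = 3 + 0.5*x + rnorm(50)        # made-up data; any lm fit would do
fit = lm(y ~ x)
e = residuals(fit)
h = hatvalues(fit)                              # the leverages h_ii
sigma.hat = summary(fit)$sigma                  # the usual estimate of sigma
r = e / (sigma.hat * sqrt(1 - h))               # standardized residuals
all.equal(r, rstandard(fit))                    # TRUE
df.res = df.residual(fit)                       # n - p - 1
t.cv = r * sqrt((df.res - 1) / (df.res - r^2))  # jackknife residuals, via equation (2)
all.equal(t.cv, rstudent(fit))                  # TRUE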
2 Influence
Recall that
$$ \hat{Y} = HY $$
where $H$ is the hat matrix. This means that each $\hat{Y}_i$ is a linear combination of the elements of $Y$,
with $H_{ij}$ giving the weight on $Y_j$. In particular, $H_{ii}$ is the contribution of the $i$th data point to its
own fitted value $\hat{Y}_i$. For this reason we call $h_{ii} \equiv H_{ii}$ the leverage.
To get a better idea of how influential the ith data point is, we could ask: how much do the
fitted values change if we omit an observation? Let $\hat{Y}_{(-i)}$ be the vector of fitted values when we
remove observation $i$. Then Cook's distance is defined by
$$ D_i = \frac{(\hat{Y} - \hat{Y}_{(-i)})^T (\hat{Y} - \hat{Y}_{(-i)})}{(p+1)\,\hat{\sigma}^2}. $$
It turns out that there is a handy formula for computing $D_i$, namely
$$ D_i = \frac{r_i^2}{p+1} \cdot \frac{h_{ii}}{1 - h_{ii}}. $$
This means that the influence of a point is determined by both its residual and its leverage. Often,
people interpret $D_i > 1$ as flagging an influential point.
The leave-one-out idea can also be applied to the coefficients. Write $\hat{\beta}_{(-i)}$ for the vector of
coefficients we get when we drop the $i$th data point. One can show that
$$ \hat{\beta} - \hat{\beta}_{(-i)} = \frac{(X^T X)^{-1} X_i e_i}{1 - h_{ii}}, $$
where $X_i$ is the $i$th row of the design matrix, so these too can be computed without re-fitting.
Shortcuts like this only work for deleting one point at once, so while looking at all leave-one-out
results is feasible, looking at all leave-two- or leave-ten-out results is not.
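A small sketch (again on simulated data, not the notes' own code) checks both shortcuts: Cook's distance
from $r_i$ and $h_{ii}$, and the leave-one-out coefficients without actually re-fitting:
set.seed(1)
x = runif(50); y = 3 + 0.5*x + rnorm(50)
fit = lm(y ~ x)
e = residuals(fit); h = hatvalues(fit)
p = length(coef(fit)) - 1                         # number of predictors (here 1)
D = (rstandard(fit)^2 / (p + 1)) * (h / (1 - h))  # the handy formula
all.equal(D, cooks.distance(fit))                 # TRUE
X = model.matrix(fit)
i = 7                                             # an arbitrary observation
beta.loo = coef(lm(y[-i] ~ x[-i]))                # brute force: drop the point and re-fit
shortcut = coef(fit) - solve(t(X) %*% X) %*% X[i,] * e[i] / (1 - h[i])
all.equal(unname(beta.loo), as.vector(shortcut))  # TRUE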
3 Diagnostics in Practice
We have three ways of looking at whether points are outliers:
1. We can look at their leverage, which depends only on the value of the predictors.
2. We can look at their studentized residuals, either ordinary or cross-validated, which depend
on how far they are from the regression line.
3. We can look at their Cook’s statistics, which say how much removing each point shifts all the
fitted values; it depends on the product of leverage and residuals.
The model assumptions don’t put any limit on how big the leverage can get (just that it’s ≤ 1
at each point) or on how it's distributed across the points (just that it's got to add up to p + 1).
Having most of the leverage in a few super-influential points doesn't break the model, exactly, but
it should make us worry.
The model assumptions do say how the studentized residuals should be distributed. In partic-
ular, the cross-validated studentized residuals should follow a t distribution. This is something we
can test, either for specific points which we’re worried about (say because they showed up on our
diagnostic plots), or across all the points.
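A minimal sketch of such a test, using the mobility example fitted in the next subsection (and assuming
that data frame is already loaded): compare the most extreme cross-validated residual to a t distribution
with n - p - 2 degrees of freedom, Bonferroni-correcting because we searched over all n points.
out = lm(Mobility ~ Commute, data=mobility)
n = nobs(out); p = length(coef(out)) - 1
t.cv = rstudent(out)
i.max = which.max(abs(t.cv))                 # the most extreme point
p.val = 2 * pt(-abs(t.cv[i.max]), df=n - p - 2)
min(1, n * p.val)                            # Bonferroni-corrected p-value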
3.1 In R
Almost everything we've talked about (leverages, studentized residuals, Cook's statistics) can be
calculated using the influence function. However, there are more user-friendly functions which
call it in turn, and are probably better to use. Leverages come from the hatvalues function, or
from the hat component of what influence returns:
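For example (a small sketch, assuming the mobility data frame of the running example is loaded):
out = lm(Mobility ~ Commute, data=mobility)
head(hatvalues(out))        # the leverages h_ii
head(influence(out)$hat)    # the same values, via the hat component of influence()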
Often the most useful thing to do with these is to plot them, and look at the most extreme
points. The standardized and studentized residuals can also be put into our usual diagnostic plots,
since they should average to zero and have constant variance when plotted against the fitted values
or the predictors.
par(mfrow=c(2,2))
n = nrow(mobility)
out = lm(Mobility ~ Commute,data=mobility)
plot(hatvalues(out),ylab="Leverage")
plot(rstandard(out),ylab="Standardized Residuals")
plot(rstudent(out),ylab="Cross-Validated Residuals")
abline(h=qt(0.025,df=n-2),col="red")    # lower cut-off of a 95% t interval
abline(h=qt(1-0.025,df=n-2),col="red")  # upper cut-off
plot(cooks.distance(out),ylab="Cook's Distance")
We can now look at exactly which points have the extreme values, say the 10 most extreme
residuals, or largest Cook’s statistics:
[Figure 2 appears here: four index plots showing Leverage, Standardized residuals, Cross-validated studentized residuals, and Cook's statistic for each observation.]
Figure 2: Leverages, two sorts of standardized residuals, and Cook’s distance statistic for each point in a
basic linear model of economic mobility as a function of the fraction of workers with short commutes. The
horizontal line in the plot of leverages shows the average leverage. The lines in the studentized residual plot
show a 95% t-distribution sampling interval. Note the clustering of extreme residuals and leverage around
row 600, and another cluster of points with extreme residuals around row 400.
n = nrow(mobility)
out = lm(Mobility ~ Commute,data=mobility)
r = rstudent(out)
I = (1:n)[rank(-abs(r)) <= 10]   ## indices of the 10 largest residuals (in absolute value)
mobility[I,]
C = cooks.distance(out)
I = (1:n)[rank(-abs(C)) <= 10]   ## indices of the 10 largest Cook's distances
mobility[I,]
3.2 plot
We have not used the plot function on an lm object yet. This is because most of what it gives us
is in fact related to residuals (Figure 3).
par(mfrow=c(2,2))
plot(out)
The first plot is of residuals versus fitted values, plus a smoothing line, with extreme residuals
marked by row number. The second is a Q-Q plot of the standardized residuals, again with
extremes marked by row number. The third shows the square root of the absolute standardized
residuals against fitted values (ideally, flat); the fourth plots standardized residuals against leverage,
with contour lines showing equal values of Cook’s distance. There are many options, described in
help(plot.lm).
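One option worth knowing about is the which argument, which selects individual panels; for instance,
the residuals-versus-leverage plot on its own:
plot(out, which=5)   # standardized residuals vs. leverage, with Cook's distance contours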
4.1 Deletion
Deleting data points should never be done lightly, but it is sometimes the right thing to do.
The best case for removing a data point is when you have good reasons to think it’s just wrong
(and you have no way to fix it). Medical records which give a patient's blood pressure as 0, or their
temperature as 200 degrees, are just impossible and have to be errors. Those points aren't giving
you useful information about the process you're studying, so getting rid of them makes sense.
The next best case is if you have good reasons to think that the data point isn’t wrong, exactly,
but belongs to a different phenomenon or population from the one you’re studying. (You’re trying
to see if a new drug helps cancer patients, but you discover the hospital has included some burn
patients and influenza cases as well.) Or the data point does belong to the right population, but
also somehow to another one which isn’t what you’re interested in right now. (All of the data is on
cancer patients, but some of them were also sick with the flu.) You should be careful about that
last, though. (After all, some proportion of future cancer patients are also going to have the flu.)
The next best scenario after that is that there’s nothing quite so definitely wrong about the
data point, but it just looks really weird compared to all the others. Here you are really making
a judgment call that either the data really are mistaken, or not from the right population, but
you can't put your finger on a concrete reason why. Rules of thumb can help in identifying such points,
like "Cook's distance shouldn't be too big", or "Tukey's rule", which flags any point more than 1.5
times the inter-quartile range above the third quartile or below the first quartile, but they are only
rules of thumb. It is always more satisfying, and more reliable, if investigating how the data were
gathered lets you turn cases of this sort into one of the two previous kinds.
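As an illustration (a sketch, not code from the notes), Tukey's rule applied to the cross-validated
residuals of the running example might look like this:
flag.tukey = function(z) {
  q = quantile(z, c(0.25, 0.75))
  iqr = q[2] - q[1]
  which(z < q[1] - 1.5*iqr | z > q[2] + 1.5*iqr)   # outside the 1.5*IQR fences
}
flag.tukey(rstudent(out))   # indices of flagged points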
The least good case for getting rid of data points that aren't just bogus is that you've got a
model which almost works, and would work a lot better if you just got rid of a few stubborn points.
This is really a sub-case of the previous one, with added special pleading on behalf of your favorite
model. You are here basically trusting your model more than your data, so it had better be either
a really good model or really bad data.
[Figure 3 appears here: the four diagnostic panels produced by plot() on the fitted model (residuals vs fitted values, normal Q-Q, scale-location, and residuals vs leverage with Cook's distance contours), with extreme rows such as 376, 382, 383, 608, and 614 labeled.]
Figure 3: The basic plot function applied to our running example model.
[Figure 4 appears here: left, a scatterplot of y vs x showing the data with both linear and quadratic fits; right, an index plot of the leverages Hii.]
Figure 4: The points in the upper-right are outliers for any linear model fit through the main body of points,
but dominate the line because of their very high leverage; they’d be identified as outliers. But all points were
generated from a quadratic model.
The moral of Figure 4 is that data points can look like outliers because we're looking for the
wrong pattern. If, when we find apparent outliers, we can't convince ourselves that the data are
erroneous or irrelevant, we should consider changing our model before, or as well as, deleting
them. Another option is robust regression, which keeps the linear model but down-weights points
with large residuals; the rlm function in the MASS package implements one version of it.
library(MASS)
out = rlm(Mobility ~ Commute,data=mobility)
summary(out)
##
## Call: rlm(formula = Mobility ~ Commute, data = mobility)
## Residuals:
## Min 1Q Median 3Q Max
## -0.148719 -0.019461 -0.002341 0.021093 0.332347
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 0.0028 0.0043 0.6398
## Commute 0.2077 0.0091 22.7939
##
## Residual standard error: 0.0293 on 727 degrees of freedom
Robust linear regression is designed for the situation where it's still true that Y = Xβ + ε, but
the noise ε is not very close to Gaussian, and indeed is sometimes "contaminated" by wildly larger
values. It does nothing to deal with non-linearity, or correlated noise, or even some points having
excessive leverage because we're insisting on a linear model.
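A small simulation sketch (not from the notes) shows the kind of situation rlm is meant for:
mostly-Gaussian noise occasionally contaminated by wild values. Here the true line is y = 1 + 2x.
library(MASS)
set.seed(42)
x = runif(200)
bad = rbinom(200, 1, 0.05)                        # about 5% of points get wild noise
y = 1 + 2*x + rnorm(200, sd=0.1) + bad * rnorm(200, sd=5)
coef(lm(y ~ x))    # least squares can be pulled around by the contaminated points
coef(rlm(y ~ x))   # typically stays much closer to (1, 2)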