Lecture 20: Outliers and Influential Points
An outlier is a point with a large residual. An influential point is a point that has a large
impact on the regression. Surprisingly, these are not the same thing. A point can be an outlier
without being influential. A point can be influential without being an outlier. A point can be both
or neither.
Figure 1 shows four famous datasets due to Frank Anscombe. If you run least squares on each
dataset you will get the same output:
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0001 1.1247 2.667 0.02573 *
x 0.5001 0.1179 4.241 0.00217 **
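R happens to ship Anscombe's four data sets as the built-in data frame anscombe (columns x1-x4 and
y1-y4), so a quick sketch to verify the identical fits is:
data(anscombe)
summary(lm(y1 ~ x1, data=anscombe))  # top-left panel
summary(lm(y4 ~ x4, data=anscombe))  # bottom-right panel: same coefficients, SEs, t values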
The top left plot has no problems. The top right plot shows a non-linear pattern. The bottom
left plot has an outlier. The bottom right plot has an influential point. Imagine what would
happen if we deleted the rightmost point. If you looked at residual plots, you would see problems
in the second and third case. But the residual plot for the fourth example would look fine. You
can't see influence in the usual residual plot.
1 Modified Residuals
Let e be the vector of residuals. Recall that
$$ e = (I - H)Y, \qquad \mathbb{E}[e] = 0, \qquad \mathrm{Var}(e) = \sigma^2 (I - H). $$
Thus the standard error of $e_i$ is $\hat{\sigma}\sqrt{1 - h_{ii}}$, where $h_{ii} \equiv H_{ii}$. We then call
$$ r_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}} $$
the standardized residual.
There is another type of residual $t_i$ which goes under various names: the jackknife residual, the
cross-validated residual, the externally studentized residual, or the studentized deleted residual.
Let $\hat{Y}_{i(-i)}$ be the predicted value for the $i$th data point when $(X_i, Y_i)$ is omitted from the data.
Then $t_i$ is defined by
$$ t_i = \frac{Y_i - \hat{Y}_{i(-i)}}{s_i} \qquad (1) $$
where $s_i^2$ is the estimated variance of $Y_i - \hat{Y}_{i(-i)}$. It can be shown that
$$ t_i = r_i \sqrt{\frac{n-p-2}{n-p-1-r_i^2}} = \frac{e_i}{\hat{\sigma}_{(-i)}\sqrt{1-h_{ii}}} \qquad (2) $$
where $\hat{\sigma}^2_{(-i)}$ is the estimate of $\sigma^2$ after $(X_i, Y_i)$ is omitted from the data.
[Figure 1 appears here: four scatterplot panels, y1 vs x1, y2 vs x2, y3 vs x3, and y4 vs x4.]
Figure 1: Four data sets that have the same fitted line. Top left: no problems. Top right: a non-linear
pattern. Bottom left: an outlier. Bottom right: an influential point.
The cool thing is that we can compute $t_i$ without ever having to actually delete the observation
and re-fit the model.
Everything you have done so far with residuals can also be done with standardized or jackknife
residuals.
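As a sanity check on these formulas (a small sketch on simulated data, not part of the original example),
we can compare hand computations of $r_i$ and $t_i$ with R's rstandard and rstudent functions:
set.seed(1)
x = runif(50); y = 3 + 0.5*x + rnorm(50)        # made-up data; any lm fit would do
fit = lm(y ~ x)
e = residuals(fit)
h = hatvalues(fit)                              # the leverages h_ii
sigma.hat = summary(fit)$sigma                  # the usual estimate of sigma
r = e / (sigma.hat * sqrt(1 - h))               # standardized residuals
all.equal(r, rstandard(fit))                    # TRUE
df.res = df.residual(fit)                       # n - p - 1
t.cv = r * sqrt((df.res - 1) / (df.res - r^2))  # jackknife residuals, via equation (2)
all.equal(t.cv, rstudent(fit))                  # TRUE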
2 Influence
Recall that
$$ \hat{Y} = HY $$
where $H$ is the hat matrix. This means that each $\hat{Y}_i$ is a linear combination of the elements of $Y$,
with $H_{ij}$ giving the weight on $Y_j$. In particular, $H_{ii}$ is the contribution of the $i$th data point to its
own fitted value $\hat{Y}_i$. For this reason we call $h_{ii} \equiv H_{ii}$ the leverage.
To get a better idea of how influential the ith data point is, we could ask: how much do the
fitted values change if we omit an observation? Let $\hat{Y}_{(-i)}$ be the vector of fitted values when we
remove observation $i$. Then Cook's distance is defined by
$$ D_i = \frac{(\hat{Y} - \hat{Y}_{(-i)})^T (\hat{Y} - \hat{Y}_{(-i)})}{(p+1)\,\hat{\sigma}^2}. $$
It turns out that there is a handy formula for computing $D_i$, namely
$$ D_i = \frac{r_i^2}{p+1} \cdot \frac{h_{ii}}{1 - h_{ii}}. $$
This means that the influence of a point is determined by both its residual and its leverage. Often,
people interpret $D_i > 1$ as flagging an influential point.
The leave-one-out idea can also be applied to the coefficients. Write $\hat{\beta}_{(-i)}$ for the vector of
coefficients we get when we drop the $i$th data point. One can show that
$$ \hat{\beta} - \hat{\beta}_{(-i)} = \frac{(X^T X)^{-1} X_i e_i}{1 - h_{ii}}, $$
where $X_i$ is the $i$th row of the design matrix, so these too can be computed without re-fitting.
Shortcuts like this only work for deleting one point at once, so while looking at all leave-one-out
results is feasible, looking at all leave-two- or leave-ten-out results is not.
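A small sketch (again on simulated data, not the notes' own code) checks both shortcuts: Cook's distance
from $r_i$ and $h_{ii}$, and the leave-one-out coefficients without actually re-fitting:
set.seed(1)
x = runif(50); y = 3 + 0.5*x + rnorm(50)
fit = lm(y ~ x)
e = residuals(fit); h = hatvalues(fit)
p = length(coef(fit)) - 1                         # number of predictors (here 1)
D = (rstandard(fit)^2 / (p + 1)) * (h / (1 - h))  # the handy formula
all.equal(D, cooks.distance(fit))                 # TRUE
X = model.matrix(fit)
i = 7                                             # an arbitrary observation
beta.loo = coef(lm(y[-i] ~ x[-i]))                # brute force: drop the point and re-fit
shortcut = coef(fit) - solve(t(X) %*% X) %*% X[i,] * e[i] / (1 - h[i])
all.equal(unname(beta.loo), as.vector(shortcut))  # TRUE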
3 Diagnostics in Practice
We have three ways of looking at whether points are outliers:
1. We can look at their leverage, which depends only on the value of the predictors.
2. We can look at their studentized residuals, either ordinary or cross-validated, which depend
on how far they are from the regression line.
3. We can look at their Cook’s statistics, which say how much removing each point shifts all the
fitted values; it depends on the product of leverage and residuals.
The model assumptions don’t put any limit on how big the leverage can get (just that it’s ≤ 1
at each point) or on how it's distributed across the points (just that it's got to add up to p + 1).
Having most of the leverage in a few super-influential points doesn't break the model, exactly, but
it should make us worry.
The model assumptions do say how the studentized residuals should be distributed. In partic-
ular, the cross-validated studentized residuals should follow a t distribution. This is something we
can test, either for specific points which we’re worried about (say because they showed up on our
diagnostic plots), or across all the points.
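A minimal sketch of such a test, using the mobility example fitted in the next subsection (and assuming
that data frame is already loaded): compare the most extreme cross-validated residual to a t distribution
with n - p - 2 degrees of freedom, Bonferroni-correcting because we searched over all n points.
out = lm(Mobility ~ Commute, data=mobility)
n = nobs(out); p = length(coef(out)) - 1
t.cv = rstudent(out)
i.max = which.max(abs(t.cv))                 # the most extreme point
p.val = 2 * pt(-abs(t.cv[i.max]), df=n - p - 2)
min(1, n * p.val)                            # Bonferroni-corrected p-value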
3.1 In R
Almost everything we've talked about (leverages, studentized residuals, Cook's statistics) can be
calculated using the influence function. However, there are more user-friendly functions which
call it in turn, and are probably better to use. Leverages come from the hatvalues function, or
from the hat component of what influence returns:
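For example (a small sketch, assuming the mobility data frame of the running example is loaded):
out = lm(Mobility ~ Commute, data=mobility)
head(hatvalues(out))        # the leverages h_ii
head(influence(out)$hat)    # the same values, via the hat component of influence()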
Often the most useful thing to do with these is to plot them, and look at the most extreme
points. The standardized and studentized residuals can also be put into our usual diagnostic plots,
since they should average to zero and have constant variance when plotted against the fitted values
or the predictors.
par(mfrow=c(2,2))
n = nrow(mobility)
out = lm(Mobility ~ Commute,data=mobility)
plot(hatvalues(out),ylab="Leverage")
plot(rstandard(out),ylab="Standardized Residuals")
plot(rstudent(out),ylab="Cross-Validated Residuals")
abline(h=qt(0.025,df=n-2),col="red")    # lower cut-off of a 95% t interval
abline(h=qt(1-0.025,df=n-2),col="red")  # upper cut-off
plot(cooks.distance(out),ylab="Cook's Distance")
We can now look at exactly which points have the extreme values, say the 10 most extreme
residuals, or largest Cook’s statistics:
[Figure 2 appears here: four index plots showing Leverage, Standardized residuals, Cross-validated studentized residuals, and Cook's statistic for each observation.]
Figure 2: Leverages, two sorts of standardized residuals, and Cook’s distance statistic for each point in a
basic linear model of economic mobility as a function of the fraction of workers with short commutes. The
horizontal line in the plot of leverages shows the average leverage. The lines in the studentized residual plot
show a 95% t-distribution sampling interval. Note the clustering of extreme residuals and leverage around
row 600, and another cluster of points with extreme residuals around row 400.
n = nrow(mobility)
out = lm(Mobility ~ Commute,data=mobility)
r = rstudent(out)
I = (1:n)[rank(-abs(r)) <= 10]   ## indices of the 10 largest residuals (in absolute value)
mobility[I,]
C = cooks.distance(out)
I = (1:n)[rank(-abs(C)) <= 10]   ## indices of the 10 largest Cook's distances
mobility[I,]
3.2 plot
We have not used the plot function on an lm object yet. This is because most of what it gives us
is in fact related to residuals (Figure 3).
par(mfrow=c(2,2))
plot(out)
The first plot is of residuals versus fitted values, plus a smoothing line, with extreme residuals
marked by row number. The second is a Q-Q plot of the standardized residuals, again with
extremes marked by row number. The third shows the square root of the absolute standardized
residuals against fitted values (ideally, flat); the fourth plots standardized residuals against leverage,
with contour lines showing equal values of Cook’s distance. There are many options, described in
help(plot.lm).
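One option worth knowing about is the which argument, which selects individual panels; for instance,
the residuals-versus-leverage plot on its own:
plot(out, which=5)   # standardized residuals vs. leverage, with Cook's distance contours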
4.1 Deletion
Deleting data points should never be done lightly, but it is sometimes the right thing to do.
The best case for removing a data point is when you have good reasons to think it’s just wrong
(and you have no way to fix it). Medical records which give a patient's blood pressure as 0, or their
temperature as 200 degrees, are just impossible and have to be errors. Those points aren't giving
you useful information about the process you're studying, so getting rid of them makes sense.
The next best case is if you have good reasons to think that the data point isn’t wrong, exactly,
but belongs to a different phenomenon or population from the one you’re studying. (You’re trying
to see if a new drug helps cancer patients, but you discover the hospital has included some burn
patients and influenza cases as well.) Or the data point does belong to the right population, but
also somehow to another one which isn’t what you’re interested in right now. (All of the data is on
cancer patients, but some of them were also sick with the flu.) You should be careful about that
last, though. (After all, some proportion of future cancer patients are also going to have the flu.)
The next best scenario after that is that there’s nothing quite so definitely wrong about the
data point, but it just looks really weird compared to all the others. Here you are really making
a judgment call that either the data really are mistaken, or not from the right population, but
you can't put your finger on a concrete reason why. Rules of thumb can help in identifying such points,
like "Cook's distance shouldn't be too big", or "Tukey's rule", which flags any point more than 1.5
times the inter-quartile range above the third quartile or below the first quartile, but they are only
rules of thumb. It is always more satisfying, and more reliable, if investigating how the data were
gathered lets you turn cases of this sort into one of the two previous kinds.
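As an illustration (a sketch, not code from the notes), Tukey's rule applied to the cross-validated
residuals of the running example might look like this:
flag.tukey = function(z) {
  q = quantile(z, c(0.25, 0.75))
  iqr = q[2] - q[1]
  which(z < q[1] - 1.5*iqr | z > q[2] + 1.5*iqr)   # outside the 1.5*IQR fences
}
flag.tukey(rstudent(out))   # indices of flagged points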
The least good case for getting rid of data points that aren't just bogus is that you've got a
model which almost works, and would work a lot better if you just got rid of a few stubborn points.
This is really a sub-case of the previous one, with added special pleading on behalf of your favorite
model. You are here basically trusting your model more than your data, so it had better be either
a really good model or really bad data.
[Figure 3 appears here: the four diagnostic panels produced by plot() on the fitted model (residuals vs fitted values, normal Q-Q, scale-location, and residuals vs leverage with Cook's distance contours), with extreme rows such as 376, 382, 383, 608, and 614 labeled.]
Figure 3: The basic plot function applied to our running example model.
[Figure 4 appears here: left, a scatterplot of y vs x showing the data with both linear and quadratic fits; right, an index plot of the leverages Hii.]
Figure 4: The points in the upper-right are outliers for any linear model fit through the main body of points,
but dominate the line because of their very high leverage; they’d be identified as outliers. But all points were
generated from a quadratic model.
The moral of Figure 4 is that data points can look like outliers because we're looking for the
wrong pattern. If, when we find apparent outliers, we can't convince ourselves that the data are
erroneous or irrelevant, we should consider changing our model before, or as well as, deleting
them. Another option is robust regression, which keeps the linear model but down-weights points
with large residuals; the rlm function in the MASS package implements one version of it.
library(MASS)
out = rlm(Mobility ~ Commute,data=mobility)
summary(out)
##
## Call: rlm(formula = Mobility ~ Commute, data = mobility)
## Residuals:
## Min 1Q Median 3Q Max
## -0.148719 -0.019461 -0.002341 0.021093 0.332347
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 0.0028 0.0043 0.6398
## Commute 0.2077 0.0091 22.7939
##
## Residual standard error: 0.0293 on 727 degrees of freedom
Robust linear regression is designed for the situation where it's still true that Y = Xβ + ε, but
the noise ε is not very close to Gaussian, and indeed is sometimes "contaminated" by wildly larger
values. It does nothing to deal with non-linearity, or correlated noise, or even some points having
excessive leverage because we're insisting on a linear model.
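A small simulation sketch (not from the notes) shows the kind of situation rlm is meant for:
mostly-Gaussian noise occasionally contaminated by wild values. Here the true line is y = 1 + 2x.
library(MASS)
set.seed(42)
x = runif(200)
bad = rbinom(200, 1, 0.05)                        # about 5% of points get wild noise
y = 1 + 2*x + rnorm(200, sd=0.1) + bad * rnorm(200, sd=5)
coef(lm(y ~ x))    # least squares can be pulled around by the contaminated points
coef(rlm(y ~ x))   # typically stays much closer to (1, 2)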