Regression Outliers
1. Identification of Outliers
An outlier is an extreme observation. Typically, points more than, say, three or four
standard deviations from the mean are considered outliers. In regression, however,
the situation is somewhat more complex in the sense that some outlying points will have
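more influence on the regression than others. In JMPIN there is one diagnostic that can
be used to identify possibly influential outliers, known as Cook's Distance, or simply
Cook's D. Given a regression of $Y$ on $(x_1, \ldots, x_k)$ using data set $(y_j, x_{1j}, \ldots, x_{kj})$, $j = 1, \ldots, n$,
if
    $s$ = root mean squared error of the regression,
    $\hat{y}_j$ = predicted value of $y_j$ from the regression on all $n$ observations,
    $\hat{y}_{j(i)}$ = predicted value of $y_j$ from the regression with observation $i$ removed,
then Cook's D for observation $i$ is
$$ D_i = \frac{\sum_{j=1}^{n} \left( \hat{y}_j - \hat{y}_{j(i)} \right)^2}{(k+1)\, s^2}, \quad i = 1, \ldots, n. $$
A common rule of thumb is to treat observation $i$ as a potentially influential outlier if
$$ D_i > \frac{4}{n - (k+1)}. $$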
As with all rules of thumb, this provides only a rough guideline (and often tends to
identify too many points as potential outliers). The best strategy is to look at the
distribution of Cook's D values and see whether there are any conspicuously large values
relative to the others. If these values are roughly of the magnitude $4/(n - k - 1)$ or larger,
then they are worth investigating further.
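As a rough illustration outside JMPIN, the sketch below computes Cook's D directly: it refits the regression with each observation left out in turn and compares the predicted values with those from the full fit, then applies the 4/(n - k - 1) guideline. The function name `cooks_distance` and the small data set are hypothetical and chosen only for illustration. NumPy is assumed to be available.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D for each observation, computed by explicit leave-one-out refits.

    X: (n, k) array of predictors, without an intercept column.
    y: (n,) array of responses.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])             # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)     # fit on all n observations
    y_hat = A @ beta
    s2 = np.sum((y - y_hat) ** 2) / (n - k - 1)      # mean squared error, s^2
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                     # drop observation i
        beta_i, *_ = np.linalg.lstsq(A[keep], y[keep], rcond=None)
        y_hat_i = A @ beta_i                         # predictions for all n points
        d[i] = np.sum((y_hat - y_hat_i) ** 2) / ((k + 1) * s2)
    return d

# Hypothetical data: an exact straight line with the last point pulled off it.
x = np.arange(10, dtype=float)
y = 2.0 + 3.0 * x
y[9] += 20.0

d = cooks_distance(x.reshape(-1, 1), y)
threshold = 4.0 / (len(y) - 1 - 1)                   # 4 / (n - k - 1) with k = 1
flagged = np.flatnonzero(d > threshold)
print("Cook's D:", np.round(d, 3))
print("Flagged indices:", flagged)
```

The explicit refits mirror the definition and are fine for small data sets. For larger ones, a closed-form version based on residuals and leverages would avoid the loop.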
2. Treatment of Outliers
The key point to stress here is that the above procedure can only serve to identify points
that are suspicious from a statistical perspective. It does not mean that these points should
automatically be eliminated! The removal of data points can be dangerous. While this
will always improve the fit of your regression, it may end up destroying some of the
most important information in your data.
Hence the first question that should be asked is whether there exists some substantive
information about these points that suggests that they should be removed. Do they
involve special properties or circumstances not relevant for the situation under
investigation? Do they involve possible measurement errors? If no such distinguishing
features can be found, then there are no clear grounds for eliminating outliers.
An alternative approach is to perform the regression both with and without these outliers,
and examine their specific influence on the results. If this influence is minor, then it may
not matter whether or not they are omitted. On the other hand, if their influence is
substantial, then it is probably best to present the results of both analyses, and simply
alert the reader to the fact that these points may be questionable.
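The with-and-without comparison can be sketched as follows. This is a minimal example under stated assumptions: the data are made up, the index array `suspect` stands in for whatever points were judged questionable, and `fit_ols` is a hypothetical helper for a simple one-predictor regression.

```python
import numpy as np

def fit_ols(x, y):
    """Return (intercept, slope) of a least-squares line through (x, y)."""
    A = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# Hypothetical data: an exact straight line with the last point pulled off it.
x = np.arange(10, dtype=float)
y = 2.0 + 3.0 * x
y[9] += 20.0

suspect = np.array([9])                  # assumed: indices of questionable points
mask = np.ones(len(y), dtype=bool)
mask[suspect] = False                    # True for the points that are kept

beta_all = fit_ols(x, y)                 # regression with the suspect points
beta_trim = fit_ols(x[mask], y[mask])    # regression without them
change = beta_all - beta_trim

print("with outliers:    intercept=%.3f slope=%.3f" % tuple(beta_all))
print("without outliers: intercept=%.3f slope=%.3f" % tuple(beta_trim))
print("difference:       intercept=%.3f slope=%.3f" % tuple(change))
```

If the differences are small relative to the standard errors of the coefficients, the choice matters little. If they are large, reporting both fits side by side lets the reader judge.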