Multivariable Regression 6
Outliers
(Chapter 4.9)
Much like the univariate regression with a single independent variable, the multiple regression model rests on a number of assumptions:
(MR.1) Linear Model: the Data Generating Process (DGP), or in other words the population, is described by a model that is linear in the coefficients:
Y = Xβ + ε    (MR.1)
(MR.2) Strict Exogeneity:
E(ε|X) = 0    (MR.2)
(MR.3)-(MR.4) Conditional Homoskedasticity and Uncorrelated Errors: the error variance is constant and all error pairs are uncorrelated. For cross-sectional data, the no-autocorrelation assumption implies that there is no spatial correlation between the errors.
(MR.5) No Exact Multicollinearity: there exists no exact linear relationship between the explanatory variables. This means that:
rank(X) = k + 1
(MR.6) (Optional) Normality of the Errors:
ε|X ∼ N(0, σ²I)    (MR.6)
Outliers
An outlier is an observation which is significantly different from other values in a random sample
from a population.
Outliers are one of several data problems that can arise in a regression analysis; below we look at their causes, consequences, detection and treatment.
Outlier Causes
Outliers can be caused by:
- measurement errors;
- observations coming from a different process than the rest of the data;
- a non-representative sample (e.g. a single observation measured in a different city, when the remaining observations all come from one city).
Outlier Consequences
Outliers can lead to misleading results in parameter estimation and hypothesis testing. A single outlier can make it seem like:
- a non-linear model is better suited to the data sample than a linear one;
- the residuals are heteroskedastic, when in fact only one residual has a larger variance than the rest;
- the distribution is skewed (i.e. non-normal), because of a single observation/residual that is significantly different from the rest.
set.seed(123)
# Simulate data from a linear DGP and turn the last observation into an outlier
N <- 100
x <- rnorm(mean = 8, sd = 2, n = N)
y <- 4 + 5 * x + rnorm(mean = 0, sd = 0.5, n = N)
y[N] <- -max(y)
# Estimate the model on the full sample (used as mdl_1_fit throughout)
mdl_1_fit <- lm(y ~ 1 + x)
Outlier Detection
The broad definition of outliers means that the decision whether an observation should be
considered an outlier is left to the econometrician/statistician/data scientist.
Nevertheless, there are a number of different methods, which can be used to identify abnormal
observations.
Specifically, for regression models, outliers are also detected by comparing the true and fitted
values. Assume that our true model is the linear regression:
Y = Xβ + ε    (1)
Ŷ = Xβ̂ = X(X⊤X)⁻¹X⊤Y = HY
where H = X(X⊤X)⁻¹X⊤ is called the hat matrix (or the projection matrix): it is the orthogonal projection that maps the vector of response values, Y, to the vector of fitted/predicted values, Ŷ. It describes the influence that each response value has on each fitted value, which is why H is sometimes also referred to as the influence matrix.
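As a quick numerical check (a minimal sketch on the simulated data above; the helper objects X and H below are ours, not part of the slides), we can build the hat matrix explicitly and verify that HY reproduces the fitted values:
# Design matrix with an intercept column
X <- cbind(1, x)
# Hat / projection matrix: H = X (X'X)^(-1) X'
H <- X %*% solve(t(X) %*% X) %*% t(X)
# H %*% y should coincide with the fitted values from lm()
all.equal(as.numeric(H %*% y), as.numeric(fitted(mdl_1_fit)))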
To understand the projection matrix a bit better, do not treat the fitted values as something separate from the true values.
- Instead, assume that you have two sets of values: Y and Ŷ.
- Ideally, we would want Ŷ = Y.
- Assuming that the linear relationship Y = Xβ + ε holds, this will generally not be possible because of the random shocks ε.
However, the closest approximation would be the conditional expectation of Y, given a design
matrix X, since we know that the conditional expectation is the best predictor from the proof in
Ch. 3.7.
The Conditional Expectation is The Best Predictor (Ch. 3.7)
We begin by outlining the main properties of the conditional moments, which will be useful
(assume that X and Y are random variables):
- Law of total expectation: E[E(h(Y)|X)] = E[h(Y)];
- Conditional variance: Var(Y|X) := E[(Y − E[Y|X])²|X] = E(Y²|X) − (E[Y|X])²;
- Variance of the conditional expectation: Var(E[Y|X]) = E[(E[Y|X])²] − (E[E[Y|X]])² = E[(E[Y|X])²] − (E[Y])²;
- Expectation of the conditional variance: E[Var(Y|X)] = E[(Y − E[Y|X])²] = E[E(Y²|X)] − E[(E[Y|X])²] = E[Y²] − E[(E[Y|X])²];
- Adding the third and fourth properties together gives us:
Var(Y) = E[Y²] − (E[Y])² = Var(E[Y|X]) + E[Var(Y|X)].
For simplicity, assume that we are interested in the prediction of Y via the conditional
expectation:
Ŷ = E(Y|X)
We will show that, in general, the conditional expectation is the best predictor of Y.
Assume that the best predictor of Y (a single value), given X, is some function g(·) which minimizes the expected squared error:
argmin_{g(X)} E[(Y − g(X))²].
Using the conditional moment properties, we can rewrite E[(Y − g(X))²] as:
E[(Y − g(X))²] = E[(Y − E[Y|X])² + 2(Y − E[Y|X])(E[Y|X] − g(X)) + (E[Y|X] − g(X))²]
= E[E[(Y − E[Y|X])²|X]] + E[2(E[Y|X] − g(X)) E[Y − E[Y|X]|X]] + E[E[(E[Y|X] − g(X))²|X]]
The middle term vanishes, since E[Y − E[Y|X]|X] = E[Y|X] − E[Y|X] = 0, and the last term is non-negative. Hence, taking g(X) = E[Y|X] minimizes the above expression down to the expectation of the conditional variance of Y given X:
E[(Y − E[Y|X])²] = E[Var(Y|X)].
The projection matrix can be utilized when calculating leverage scores and Cook’s distance,
which are used to identify influential observations.
Leverage Score of Observations
Leverage measures how far away an observation of a predictor variable, X, is from the mean of
the predictor variable.
For the linear regression model, the leverage score of the i-th observation is defined as the i-th diagonal element of the projection matrix H = X(X⊤X)⁻¹X⊤, which is equivalent to taking the partial derivative of the fitted value Ŷi with respect to Yi:
hii = ∂Ŷi / ∂Yi = (H)ii
Defining the leverage score via the partial derivative allows us to interpret it as the observation's self-influence: it describes how the actual value, Yi, influences its own fitted value, Ŷi.
The leverage score hii is bounded:
0 ≤ hii ≤ 1
Proof.
Noting that H is symmetric and idempotent:
H² = HH = X(X⊤X)⁻¹X⊤ X(X⊤X)⁻¹X⊤ = X I (X⊤X)⁻¹X⊤ = H
we can examine the diagonal elements of the equality H² = H to get the following bounds on hii:
hii = hii² + Σ_{j≠i} hij² ≥ 0
Since this also implies hii ≥ hii², we have hii(1 − hii) ≥ 0, so that 0 ≤ hii ≤ 1.
The residuals can likewise be expressed via the projection matrix:
ε̂ = Y − Ŷ = (I − H)Y
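In R, the leverage scores can be extracted with hatvalues(). As a minimal sketch (the 2(k + 1)/N cut-off below is a common rule of thumb, not one stated in the slides):
# Leverage scores: the diagonal elements of the hat matrix
h <- hatvalues(mdl_1_fit)
# k = 1 regressor in our simulated example
k <- 1
# Flag observations whose leverage exceeds twice the average leverage, 2(k + 1)/N
which(h > 2 * (k + 1) / N)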
Studentized Residuals
The studentized residuals are related to the standardized residuals, as they are defined as:
ti = ε̂i / (σ̂ √(1 − hii))
where σ̂² is an estimate of the residual variance. If all of the residuals are used,
σ̂² = (1 / (N − (k + 1))) · Σ_{j=1}^{N} ε̂j²,
then ti is the internally studentized (i.e. standardized) residual.
- If we suspect the i-th residual of being improbably large (i.e. it cannot come from the same normal distribution as the remaining residuals), we exclude it from the variance estimation by calculating the externally studentized residual variance estimate:
σ̂²_(i) = (1 / (N − (k + 1) − 1)) · Σ_{j=1, j≠i}^{N} ε̂j²
If the residuals are independent and ε ∼ N(0, σ²I), then the distribution of the studentized residuals depends on which variance estimate is used:
- If the residuals are internally studentized, they follow a tau distribution:
ti ∼ (√v · t_{v−1}) / √(t²_{v−1} + v − 1),  where v = N − (k + 1)
- If the residuals are externally studentized, they follow a Student's t-distribution (we will also refer to them as ti(i)):
ti(i) ∼ t_{N − (k + 1) − 1}
Observations with studentized residual values larger than 3 in absolute value could be considered outliers.
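A quick way of applying this rule in R (a minimal sketch; rstudent() returns the externally studentized residuals):
# Externally (deletion) studentized residuals
t_ext <- rstudent(mdl_1_fit)
# Flag observations with |t| > 3 as potential outliers
which(abs(t_ext) > 3)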
We can plot the studentized and standardized residuals:
olsrr::ols_plot_resid_stud(mdl_1_fit)
[Figure: deleted studentized residuals by observation — observation 100 is flagged as an outlier; all other observations are labelled normal.]
olsrr::ols_plot_resid_stand(mdl_1_fit)
[Figure: standardized residuals by observation — observation 100 stands out with a standardized residual of around −10.]
We can examine the same plots on the model, with the outlier observation removed from the
data:
olsrr::ols_plot_resid_stud(lm(y[-N] ~ 1 + x[-N]))
[Figure: deleted studentized residuals for the model estimated without the outlier — all observations are labelled normal.]
olsrr::ols_plot_resid_stand(lm(y[-N] ~ 1 + x[-N]))
[Figure: standardized residuals for the model estimated without the outlier — observations 49, 74, 39 and 96 slightly exceed the threshold.]
While the studentized residuals appear to have no outliers, the standardized residuals indicate
that a few observations may be influential. Since we have simulated the data, we know that our
data contained only one outlier. Consequently, we should not treat all observations outside the
threshold as definite outliers.
We may also be interested in plotting the studentized residuals against the leverage points:
olsrr::ols_plot_resid_lev(mdl_1_fit)
[Figure: studentized residuals (RStudent) vs. leverage for the full model — observation 100 is flagged as an outlier with high leverage.]
[Figure: the corresponding plot for the model with the outlier removed — several observations (e.g. 64, 49, 74, 39, 96, 44) are labelled as leverage points or outliers, but none are extreme.]
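DFBETAS
DFBETAS measure by how much each estimated coefficient changes, in standard-error units, when observation i is deleted from the sample. The manual computation referred to below is not visible in the extracted slides; a minimal sketch of such a computation (X_mat, XtX_inv_diag and dfb_manual are our own helper names) is:
# Manual DFBETAS: for every observation i, re-estimate the model without it and
# scale the change in each coefficient by the corresponding deletion standard error
X_mat <- cbind(1, x)
XtX_inv_diag <- diag(solve(t(X_mat) %*% X_mat))
dfb_manual <- t(sapply(1:N, function(i) {
  fit_i <- lm(y[-i] ~ 1 + x[-i])
  (coef(mdl_1_fit) - coef(fit_i)) / (summary(fit_i)$sigma * sqrt(XtX_inv_diag))
}))
print(dfb_manual[(N - 4):N, ])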
## (Intercept) x
## [1,] 0.028743821 -0.022789554
## [2,] 0.030744687 -0.034844559
## [3,] 0.020403791 -0.024298429
## [4,] 0.006702931 -0.004242548
## [5,] -29.230784828 25.362876769
While these calculations are a bit more involved, we can use the built-in functions as well:
print(tail(dfbetas(mdl_1_fit), 5))
## (Intercept) x
## 96 0.028743821 -0.022789554
## 97 0.030744687 -0.034844559
## 98 0.020403791 -0.024298429
## 99 0.006702931 -0.004242548
## 100 -29.230784828 25.362876769
If we wanted, we could also plot these values:
olsrr::ols_plot_dfbetas(mdl_1_fit)
[Figure: DFBETAS influence diagnostics for (Intercept) and x (threshold: 0.2) — observation 100 lies far beyond the threshold in both panels.]
If we were to remove the last observation and examine the DFBETAS plot:
olsrr::ols_plot_dfbetas(lm(y[-N] ~ 1 + x[-N]))
[Figure: DFBETAS influence diagnostics for the model with the outlier removed (threshold: 0.2) — a few observations (e.g. 64, 44, 74, 25, 96, 8, 43) slightly exceed the threshold.]
We see that there are some observations, which may be worth examining. In this case, we know
that there are no more outliers because we have simulated the data ourselves. So this is a good
example that you should not blindly trust the above charts, as the influential observations are
not necessarily outliers.
DFFITS
DFFITS measures how much observation i has affected the fitted value of the regression. It is defined as the studentized difference between the fitted value from a regression estimated on all of the data and the fitted value from a regression estimated on the data with observation i deleted:
DFFITSi = (Ŷi − Ŷi(i)) / (σ̂_(i) √hii) = ti(i) · √(hii / (1 − hii))
where ti(i) is the externally studentized residual.
tmp_val <- dffits(mdl_1_fit)
print(format(tail(cbind(tmp_val), 10), scientific = FALSE))
## tmp_val
## 91 " -0.0005235787"
## 92 " 0.0031760359"
## 93 " 0.0091761236"
## 94 " 0.0199891169"
## 95 " -0.0226748095"
## 96 " 0.0376495768"
## 97 " -0.0379725909"
## 98 " -0.0287153104"
## 99 " 0.0125545090"
## 100 "-32.6910440413"
Observations with a DFFITS value larger than 2√((k + 1)/N) in absolute value could be considered influential.
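A minimal sketch of applying this cut-off in R (with k = 1 regressor, the threshold is 2√(2/100) ≈ 0.28, which matches the value reported by olsrr below):
# DFFITS cut-off: 2 * sqrt((k + 1) / N)
k <- 1
dffits_cutoff <- 2 * sqrt((k + 1) / N)
which(abs(dffits(mdl_1_fit)) > dffits_cutoff)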
olsrr::ols_plot_dffits(mdl_1_fit)
[Figure: DFFITS by observation (threshold: 0.28) — observation 100 lies far outside the threshold band, at around −32.]
olsrr::ols_plot_dffits(lm(y[-N] ~ 1 + x[-N]))
[Figure: DFFITS for the model with the outlier removed — observations 49, 74, 43, 8 and 44 slightly exceed the threshold band.]
Similarly to what we have observed with DFBETAS - we should not blindly trust that each value
outside the cutoff region is an outlier. Instead, we should treat them as influential observations,
which need additional analysis to determine whether they are acceptable.
Cook's distance
Cook's D measures the aggregate impact of each observation on the group of regression coefficients, as well as on the group of fitted values. It can be used to:
- indicate influential data points (i.e. potential outliers);
- indicate regions where more observations would be needed.
Cook's distance for observation i is defined as:
Di = Σ_{j=1}^{N} (Ŷj − Ŷj(i))² / ((k + 1) σ̂²) = (ε̂i² hii) / ((k + 1) σ̂² (1 − hii)²)
where:
- Ŷj(i) is the fitted value of Yj, obtained by excluding the i-th observation and re-estimating the same model via OLS;
- σ̂² = ε̂⊤ε̂ / (N − (k + 1)) is the mean squared error of the regression.
Note: in practical terms, it may be easier to use the leverage score expression of Di instead of
re-estimating the model for each observation case.
tmp_val <- cooks.distance(mdl_1_fit)
print(format(tail(cbind(tmp_val), 10), scientific = FALSE))
## tmp_val
## 91 "0.0000001384804"
## 92 "0.0000050955563"
## 93 "0.0000425310902"
## 94 "0.0002017917033"
## 95 "0.0002596785491"
## 96 "0.0007154000322"
## 97 "0.0007282312180"
## 98 "0.0004164378945"
## 99 "0.0000796089714"
## 100 "1.2596831219103"
Cook's distance values which are:
- larger than 4/N (the traditional cut-off);
- larger than 3 × D̄, where D̄ = (1/N) Σ_{i=1}^{N} Di is the mean Cook's distance,
could be considered highly influential.
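A minimal sketch of applying both cut-offs in R (here 4/N = 0.04, matching the threshold shown in the olsrr charts below):
D <- cooks.distance(mdl_1_fit)
# Traditional cut-off: 4 / N
which(D > 4 / N)
# Alternative cut-off: three times the mean Cook's distance
which(D > 3 * mean(D))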
We can plot the Di points:
olsrr::ols_plot_cooksd_bar(mdl_1_fit)
[Figure: Cook's distance bar plot — observation 100 is flagged as an outlier with a Cook's distance of about 1.26; all other values are near zero.]
olsrr::ols_plot_cooksd_chart(mdl_1_fit)
[Figure: Cook's D chart (threshold: 0.04) — only observation 100 exceeds the threshold.]
We can also plot the Di values for the data without the outlier observation:
olsrr::ols_plot_cooksd_bar(lm(y[-N] ~ 1 + x[-N]))
[Figure: Cook's distance bar plot for the model with the outlier removed (threshold: 0.04) — observations 64, 44, 8, 74, 43 and 49 exceed the threshold, the largest value being around 0.15.]
olsrr::ols_plot_cooksd_chart(lm(y[-N] ~ 1 + x[-N]))
[Figure: Cook's D chart for the model with the outlier removed (threshold: 0.04) — the same observations (64, 44, 8, 43, 74, 49) slightly exceed the threshold.]
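We can also look at the six standard diagnostic plots produced by base R's plot() method for lm objects (residuals vs. fitted, normal Q-Q, scale-location, Cook's distance, residuals vs. leverage, Cook's distance vs. leverage). The call for the full model is not visible in the extracted slides; it presumably mirrors the one used below for the model without the outlier:
par(mfrow = c(3, 2), mar = c(2, 2, 2, 2))
for(i in 1:6){
  plot(mdl_1_fit, which = i)
}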
[Figure: the six base R diagnostic plots for the full model — observation 100 dominates every panel, with a Cook's distance of about 1.26; observations 72, 64 and 18 are also labelled.]
par(mfrow = c(3, 2), mar = c(2, 2, 2, 2))
for(i in 1:6){
plot(lm(y[-N] ~ 1 + x[-N]), which = i)
}
[Figure: the same six diagnostic plots for the model with the outlier removed — observations such as 64, 49, 74, 44 and 8 are labelled, but none appear extreme.]
Deleting Outliers
In some cases, if we are absolutely sure that an observation is an outlier that is either extremely unlikely, or impossible, to encounter again, we can simply drop it.
Robust Regression
In addition to the methods mentioned above, we could also estimate a robust regression.
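As a brief illustration (not part of the original example), one common option in R is M-estimation via MASS::rlm(), which down-weights observations with large residuals instead of deleting them; the object name mdl_rlm is ours:
library(MASS)
# Robust (M-estimation) fit: the outlier is down-weighted rather than removed
mdl_rlm <- rlm(y ~ 1 + x)
# The outlying observation should receive a weight close to zero
round(tail(mdl_rlm$w, 3), 4)
# Compare the robust coefficients with the OLS ones
coef(mdl_rlm)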
In our example, we know that the last observation was generated differently and is thus an outlier, which we can delete.
We can compare how our model looks when estimated on the whole dataset versus with the outlier observation dropped:
plot(x, y)
lines(x, mdl_1_fit$fitted.values, col = "red")
lines(x[-N], lm(y[-N] ~ 1 + x[-N])$fitted.values, col = "blue")
points(x[N], y[N], pch = 19, col = "red")
legend("topleft", lty = 1, col = c("red", "blue"), legend = c("with outlier", "deleted outlier"))
[Figure: scatter plot of y against x with the fitted regression lines from the full sample ("with outlier", red) and from the sample with the outlier deleted (blue); the outlying observation is highlighted in red.]