
PE I: Multivariable Regression

Outliers
(Chapter 4.9)

Andrius Buteikis, [email protected]


http://web.vu.lt/mif/a.buteikis/
Multiple Regression: Model Assumptions

Much like in the case of the univariate regression with one independent variable, the multiple
regression model has a number of required assumptions:
(MR.1): Linear Model The Data Generating Process (DGP), or in other words, the
population, is described by a linear (in terms of the coefficients) model:

Y = Xβ + ε (MR.1)

(MR.2): Strict Exogeneity The conditional expectation of ε, given all observations of the
explanatory variable matrix X, is zero:

E(ε|X) = 0   (MR.2)

This assumption also implies that E(ε) = E(E(ε|X)) = 0, E(εX) = 0 and Cov(ε, X) = 0.
Furthermore, this property implies that E(Y|X) = Xβ.
(MR.3): Conditional Homoskedasticity The variance-covariance matrix of the error
term, conditional on X, is constant:

$$
\text{Var}(\boldsymbol{\varepsilon}|\mathbf{X}) =
\begin{bmatrix}
\text{Var}(\varepsilon_1) & \text{Cov}(\varepsilon_1, \varepsilon_2) & \dots & \text{Cov}(\varepsilon_1, \varepsilon_N) \\
\text{Cov}(\varepsilon_2, \varepsilon_1) & \text{Var}(\varepsilon_2) & \dots & \text{Cov}(\varepsilon_2, \varepsilon_N) \\
\vdots & \vdots & \ddots & \vdots \\
\text{Cov}(\varepsilon_N, \varepsilon_1) & \text{Cov}(\varepsilon_N, \varepsilon_2) & \dots & \text{Var}(\varepsilon_N)
\end{bmatrix}
= \sigma^2 \mathbf{I} \quad \text{(MR.3)}
$$

(MR.4): Conditionally Uncorrelated Errors The covariance between different error
term pairs, conditional on X, is zero:

Cov(ε_i, ε_j | X) = 0,  i ≠ j   (MR.4)

This assumption implies that all error pairs are uncorrelated. For cross-sectional data,
this assumption implies that there is no spatial correlation between errors.
(MR.5) There exists no exact linear relationship between the explanatory variables.
This means that:

c_1 X_{i,1} + c_2 X_{i,2} + ... + c_k X_{i,k} = 0, ∀i = 1, ..., N  ⇔  c_1 = c_2 = ... = c_k = 0   (MR.5)

This assumption is violated if there exists some c_j ≠ 0.

Alternatively, this requirement means that:

rank(X) = k + 1

or, equivalently, that:

det(X^⊤X) ≠ 0

This assumption is important, because an exact linear relationship between the independent
variables means that we cannot separately estimate the effect of a change in each variable.
(MR.6) (optional) The residuals are normally distributed:

ε|X ∼ N(0, σ²I)   (MR.6)
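
As a side note, MR.5 can be checked numerically. The sketch below is a minimal illustration on a
hypothetical design matrix (the regressors x1 and x2 are simulated here purely for illustration):

set.seed(1)
# hypothetical design matrix: an intercept plus two simulated regressors
X <- cbind(1, x1 = rnorm(100), x2 = rnorm(100))

qr(X)$rank == ncol(X)  # TRUE if rank(X) = k + 1, i.e. no exact collinearity
det(t(X) %*% X) != 0   # equivalently, X'X is invertible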
Outliers
An outlier is an observation which is significantly different from other values in a random sample
from a population.
If we collect all of the various problems that can arise, we can rank them in terms of
severity:

outliers > non-linearity > heteroskedasticity > non-normality

Outlier Causes
Outliers can be caused by:
- measurement errors;
- being from a different process, compared to the rest of the data;
- not having a representative sample (e.g. measuring a single observation from a different city,
  when the remaining observations are all from one city).
Outlier Consequences
Outliers can lead to misleading results in parameter estimation and hypothesis testing. This
means that a single outlier can make it seem like:
- a non-linear model may be better suited to the data sample, as opposed to a linear model;
- the residuals are heteroskedastic, when in fact only a single residual has a variance that is
  larger than the rest;
- the distribution is skewed (i.e. non-normal), because of a single observation/residual, which
  is significantly different from the rest.
set.seed(123)
# simulate a linear DGP and make the last observation an outlier
N <- 100
x <- rnorm(mean = 8, sd = 2, n = N)
y <- 4 + 5 * x + rnorm(mean = 0, sd = 0.5, n = N)
y[N] <- -max(y)  # replace the last value of y with an outlier
Outlier Detection
The broad definition of outliers means that the decision whether an observation should be
considered an outlier is left to the econometrician/statistician/data scientist.
Nevertheless, there are a number of different methods, which can be used to identify abnormal
observations.
Specifically, for regression models, outliers are also detected by comparing the true and fitted
values. Assume that our true model is the linear regression:

Y = Xβ + ε (1)

Then, assume that we estimate β̂ via OLS. Consequently, we can write the fitted values as:

$$
\widehat{\mathbf{Y}} = \mathbf{X}\widehat{\boldsymbol{\beta}} = \mathbf{X}\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{Y} = \mathbf{H}\mathbf{Y}
$$

where H = X(X^⊤X)^{-1}X^⊤ is called the hat matrix (or the projection matrix), which is the
orthogonal projection that maps the vector of the response values, Y, to the vector of
fitted/predicted values, Ŷ. It describes the influence that each response value has on each fitted
value, which is why H is sometimes also referred to as the influence matrix.
To understand the projection matrix a bit better, do not treat the fitted values as something that
is separate from the true values.
- Instead, assume that you have two sets of values: Y and Ŷ.
- Ideally, we would want Y = Ŷ.
- Assuming that the linear relationship, Y = Xβ + ε, holds, this will generally not be possible
  because of the random shocks ε.
However, the closest approximation would be the conditional expectation of Y, given a design
matrix X, since we know that the conditional expectation is the best predictor (see the proof in
Ch. 3.7).
The Conditional Expectation is The Best Predictor (Ch. 3.7)
We begin by outlining the main properties of the conditional moments, which will be useful
(assume that X and Y are random variables):
- Law of total expectation: E[E(h(Y)|X)] = E[h(Y)];
- Conditional variance: Var(Y|X) := E[(Y − E[Y|X])²|X] = E(Y²|X) − (E[Y|X])²;
- Variance of the conditional expectation:
  Var(E[Y|X]) = E[(E[Y|X])²] − (E[E[Y|X]])² = E[(E[Y|X])²] − (E[Y])²;
- Expectation of the conditional variance:
  E[Var(Y|X)] = E[(Y − E[Y|X])²] = E[E(Y²|X)] − E[(E[Y|X])²] = E[Y²] − E[(E[Y|X])²];
- Adding the third and fourth properties together gives us:
  Var(Y) = E[Y²] − (E[Y])² = Var(E[Y|X]) + E[Var(Y|X)].

For simplicity, assume that we are interested in the prediction of Y via the conditional
expectation E(Y|X). We will show that, in general, the conditional expectation is the best
predictor of Y.
Assume that the best predictor of Y (a single value), given X, is some function g(·), which
minimizes the expected squared error:

$$
\underset{g(X)}{\operatorname{arg\,min}}\; \mathrm{E}\left[(Y - g(X))^2\right].
$$

Using the conditional moment properties, we can rewrite E[(Y − g(X))²] as:

$$
\begin{aligned}
\mathrm{E}\left[(Y - g(X))^2\right]
&= \mathrm{E}\left[(Y + \mathrm{E}[Y|X] - \mathrm{E}[Y|X] - g(X))^2\right] \\
&= \mathrm{E}\left[(Y - \mathrm{E}[Y|X])^2 + 2(Y - \mathrm{E}[Y|X])(\mathrm{E}[Y|X] - g(X)) + (\mathrm{E}[Y|X] - g(X))^2\right] \\
&= \mathrm{E}\left[\mathrm{E}\left((Y - \mathrm{E}[Y|X])^2 \,|\, X\right)\right]
 + \mathrm{E}\left[2(\mathrm{E}[Y|X] - g(X))\,\mathrm{E}\left(Y - \mathrm{E}[Y|X] \,|\, X\right)\right]
 + \mathrm{E}\left[\mathrm{E}\left((\mathrm{E}[Y|X] - g(X))^2 \,|\, X\right)\right] \\
&= \mathrm{E}\left[\mathrm{Var}(Y|X)\right] + \mathrm{E}\left[(\mathrm{E}[Y|X] - g(X))^2\right],
\end{aligned}
$$

where the middle term vanishes because E(Y − E[Y|X] | X) = 0.
Taking g(X) = E[Y|X] minimizes the above equality to the expectation of the conditional
variance of Y given X:

$$
\mathrm{E}\left[(Y - \mathrm{E}[Y|X])^2\right] = \mathrm{E}\left[\mathrm{Var}(Y|X)\right].
$$

Thus, g(X) = E[Y|X] is the best predictor of Y.
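
As an informal check of this result (a minimal simulation sketch, not part of the original
derivation, with an assumed DGP where E[Y|X] = 2 + 3X):

set.seed(321)
# assumed DGP: Y = 2 + 3 * X + noise, so that E[Y|X] = 2 + 3 * X
X_sim <- rnorm(1e5)
Y_sim <- 2 + 3 * X_sim + rnorm(1e5)

mean((Y_sim - (2 + 3 * X_sim))^2)  # g(X) = E[Y|X]: MSE close to Var(Y|X) = 1
mean((Y_sim - (2 + 2 * X_sim))^2)  # a different g(X): noticeably larger MSE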


Going back to our projection matrix...

Using the OLS definition of β̂, the best predictor (i.e. the conditional expectation) maps the
values of Y to the values of Ŷ via the projection matrix H.
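
As a minimal sketch (using the x and y simulated earlier), we can build H explicitly and verify
that HY reproduces the OLS fitted values:

X_mat <- cbind(1, x)  # design matrix with an intercept column
H <- X_mat %*% solve(t(X_mat) %*% X_mat) %*% t(X_mat)  # hat (projection) matrix

# H maps the observed responses to the fitted values
all.equal(as.numeric(H %*% y), as.numeric(fitted(lm(y ~ 1 + x))))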

The projection matrix can be utilized when calculating leverage scores and Cook’s distance,
which are used to identify influential observations.
Leverage Score of Observations
Leverage measures how far away an observation of a predictor variable, X, is from the mean of
the predictor variable.
For the linear regression model, the leverage score for the i-th observation is defined as the i-th
diagonal element of the projection matrix H = X(X^⊤X)^{-1}X^⊤, which is equivalent to taking the
partial derivative of the fitted value Ŷ_i with respect to Y_i:

$$
h_{ii} = \frac{\partial \widehat{Y}_i}{\partial Y_i} = (\mathbf{H})_{ii}
$$

Defining the leverage score via the partial derivative allows us to interpret the leverage score as
the observation self-influence, which describes how the actual value, Y_i, influences the fitted
value, Ŷ_i.
The leverage score h_ii is bounded:

0 ≤ h_ii ≤ 1

Proof.
Noting that H is symmetric and the fact that it is an idempotent matrix:

$$
\mathbf{H}^2 = \mathbf{H}\mathbf{H} = \mathbf{X}\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top\mathbf{X}\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top = \mathbf{X}\mathbf{I}\left(\mathbf{X}^\top\mathbf{X}\right)^{-1}\mathbf{X}^\top = \mathbf{H},
$$

we can examine the diagonal elements of the equality H² = H to get the following bounds on h_ii:

$$
h_{ii} = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2 \geq 0
$$

$$
h_{ii} \geq h_{ii}^2 \implies h_{ii} \leq 1
$$


We can also relate the residuals to the leverage score:

$$
\widehat{\boldsymbol{\varepsilon}} = \mathbf{Y} - \widehat{\mathbf{Y}} = \left(\mathbf{I} - \mathbf{H}\right)\mathbf{Y}
$$

Examining the variance-covariance matrix of the residuals, we see that:

$$
\text{Var}(\widehat{\boldsymbol{\varepsilon}}) = \text{Var}\left((\mathbf{I} - \mathbf{H})\mathbf{Y}\right) = (\mathbf{I} - \mathbf{H})\,\text{Var}(\mathbf{Y})\,(\mathbf{I} - \mathbf{H})^\top = \sigma^2(\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H})^\top = \sigma^2(\mathbf{I} - \mathbf{H}),
$$

where we have used the fact that (I − H) is symmetric and idempotent and that Var(Y) = σ²I.

Since the diagonal elements of the variance-covariance matrix are the variances of each
observation, we have that Var(ε̂_i) = (1 − h_ii)σ².
Thus, we can see that a leverage score of h_ii ≈ 0 would indicate that the i-th observation has no
influence on its residual variance, which would mean that its variance is close to the true
(unobserved) variance σ².
Observations with leverage score values larger than 2(k + 1)/N are considered to be
potentially highly influential.
Assume that we estimate the model via OLS:
mdl_1_fit <- lm(y ~ 1 + x)
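
As a minimal sketch, the leverage scores (the diagonal of H) can be obtained with hatvalues()
and compared against the 2(k + 1)/N rule of thumb:

h <- hatvalues(mdl_1_fit)         # leverage scores, i.e. the diagonal of H
k <- length(coef(mdl_1_fit)) - 1  # number of explanatory variables
which(h > 2 * (k + 1) / N)        # potentially highly influential observations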

Studentized Residuals
The studentized residuals are related to the standardized residuals, as they are defined as:

$$
t_i = \frac{\widehat{\varepsilon}_i}{\widehat{\sigma}\sqrt{1 - h_{ii}}}
$$

The main distinction comes from the calculation of σ̂, which can be done in two ways:
- Standardized residuals use the internally studentized residual variance estimate:

$$
\widehat{\sigma}^2 = \frac{1}{N - (k + 1)} \sum_{j=1}^{N} \widehat{\varepsilon}_j^2
$$

- If we suspect the i-th residual of being improbably large (i.e. it cannot be from the
  same normal distribution as the remaining residuals), we exclude it from the variance
  estimation by calculating the externally studentized residual variance estimate:

$$
\widehat{\sigma}_{(i)}^2 = \frac{1}{N - (k + 1) - 1} \sum_{\substack{j=1 \\ j \neq i}}^{N} \widehat{\varepsilon}_j^2
$$
If the residuals are independent and ε ∼ N(0, σ²I), then the distribution of the studentized
residuals depends on the calculation of the variance estimate:
- If the residuals are internally studentized, they have a tau distribution:

$$
t_i \sim \frac{\sqrt{v}\, t_{v-1}}{\sqrt{t_{v-1}^2 + v - 1}}, \quad \text{where } v = N - (k + 1)
$$

- If the residuals are externally studentized, they have a Student's t-distribution (we will
  also refer to them as t_{i(i)}):

$$
t_i = t_{i(i)} \sim t_{(N - (k + 1) - 1)}
$$

Observations with studentized residual values larger than 3 in absolute value could
be considered outliers.
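
As a minimal sketch, R returns the internally studentized (standardized) residuals via
rstandard() and the externally studentized residuals via rstudent(); the former can be verified
against the formula above:

h <- hatvalues(mdl_1_fit)
# internally studentized (standardized) residuals, computed manually
all.equal(resid(mdl_1_fit) / (summary(mdl_1_fit)$sigma * sqrt(1 - h)),
          rstandard(mdl_1_fit))
# externally studentized residuals above the |3| rule of thumb
which(abs(rstudent(mdl_1_fit)) > 3)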
We can plot the studentized and standardized residuals:
olsrr::ols_plot_resid_stud(mdl_1_fit)

[Plot: Studentized Residuals Plot (threshold: abs(3)). Deleted studentized residuals vs.
observation index; observation 100 is flagged as an outlier.]
olsrr::ols_plot_resid_stand(mdl_1_fit)

[Plot: Standardized Residuals Chart (threshold: abs(2)). Standardized residuals vs. observation
index; observation 100 lies far below the threshold.]
We can examine the same plots on the model, with the outlier observation removed from the
data:
olsrr::ols_plot_resid_stud(lm(y[-N] ~ 1 + x[-N]))

[Plot: Studentized Residuals Plot for the model without observation 100 (threshold: abs(3)).
No observation exceeds the threshold.]
olsrr::ols_plot_resid_stand(lm(y[-N] ~ 1 + x[-N]))

[Plot: Standardized Residuals Chart for the model without observation 100 (threshold: abs(2)).
A few observations (39, 49, 64, 74, 96) slightly exceed the threshold.]
While the studentized residuals appear to have no outliers, the standardized residuals indicate
that a few observations may be influential. Since we have simulated the data, we know that our
data contained only one outlier. Consequently, we should not treat all observations outside the
threshold as definite outliers.
We may also be interested in plotting the studentized residuals against the leverage points:
olsrr::ols_plot_resid_lev(mdl_1_fit)

[Plot: Outlier and Leverage Diagnostics for y (leverage threshold: 0.04). RStudent vs. leverage;
observation 100 is flagged as an outlier, while several observations are flagged as
high-leverage points.]
olsrr::ols_plot_resid_lev(lm(y[-N] ~ 1 + x[-N]))

[Plot: Outlier and Leverage Diagnostics for y[-N] (leverage threshold: 0.04). RStudent vs.
leverage; a few observations are flagged as outliers and several others as high-leverage points,
but none as extreme as before.]
This plot combines the leverage scores, which flag influential explanatory variable
observations, and the studentized residuals, which flag outlying differences between the
actual and fitted values of the dependent variable.
Influential observations
Influential observations are defined as observations which have a large effect on the results of
a regression.
DFBETAS
The DFBETA_i vector measures how much an observation i has affected the estimate of the
regression coefficient vector β̂. It measures the difference between the regression coefficients,
calculated on all of the data, and the regression coefficients, calculated with observation i
deleted:

$$
\text{DFBETA}_i = \frac{\widehat{\boldsymbol{\beta}} - \widehat{\boldsymbol{\beta}}_{(i)}}{\sqrt{\widehat{\sigma}^2_{(i)}\,\text{diag}\left((\mathbf{X}^\top\mathbf{X})^{-1}\right)}}
$$

Observations with a DFBETA value larger than 2/√N in absolute value should be
carefully inspected.
The recommended general cutoff (absolute) value is 2.
We can calculate the appropriate DFBETAS for the last 5 observations as follows:
dfbetas_manual <- NULL
for (i in (N - 4):N) {
  # re-estimate the model with observation i deleted
  mdl_2_fit <- lm(y[-i] ~ 1 + x[-i])
  numerator   <- mdl_1_fit$coef - mdl_2_fit$coef
  denominator <- sqrt(summary(mdl_2_fit)$sigma^2 * diag(solve(t(cbind(1, x)) %*% cbind(1, x))))
  dfbetas_manual <- rbind(dfbetas_manual, numerator / denominator)
}
print(dfbetas_manual)

## (Intercept) x
## [1,] 0.028743821 -0.022789554
## [2,] 0.030744687 -0.034844559
## [3,] 0.020403791 -0.024298429
## [4,] 0.006702931 -0.004242548
## [5,] -29.230784828 25.362876769
While these calculations are a bit more involved, we can use the built-in functions as well:
print(tail(dfbetas(mdl_1_fit), 5))

## (Intercept) x
## 96 0.028743821 -0.022789554
## 97 0.030744687 -0.034844559
## 98 0.020403791 -0.024298429
## 99 0.006702931 -0.004242548
## 100 -29.230784828 25.362876769
If we wanted, we could also plot these values:
olsrr::ols_plot_dfbetas(mdl_1_fit)

[Plots: Influence Diagnostics for (Intercept) and for x (threshold: 0.2). DFBETAS vs.
observation index; observation 100 dominates both panels.]
If we were to remove the last observation and examine the DFBETAS plot:
olsrr::ols_plot_dfbetas(lm(y[-N] ~ 1 + x[-N]))

[Plots: Influence Diagnostics for (Intercept) and for x[-N] (threshold: 0.2). DFBETAS vs.
observation index; a few observations (e.g. 8, 43, 44, 64, 74, 96) slightly exceed the threshold.]
We see that there are some observations which may be worth examining. In this case, we know
that there are no more outliers, because we have simulated the data ourselves. This is a good
example of why you should not blindly trust the above charts: influential observations are not
necessarily outliers.
DFFITS
DFFITS measures how much an observation i has affected the fitted value of a regression. It is
defined as a studentized difference between the fitted values from a regression, estimated on all
of the data, and the fitted values from a regression, estimated on the data with observation i
deleted:

$$
\text{DFFITS}_i = \frac{\widehat{Y}_i - \widehat{Y}_{i(i)}}{\sqrt{\widehat{\sigma}^2_{(i)} h_{ii}}} = t_{i(i)}\sqrt{\frac{h_{ii}}{1 - h_{ii}}}
$$

where t_{i(i)} is the externally studentized residual.
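
As a minimal sketch, the built-in dffits() values should coincide with t_{i(i)} · sqrt(h_ii / (1 − h_ii))
computed from rstudent() and hatvalues():

h <- hatvalues(mdl_1_fit)
dffits_manual <- rstudent(mdl_1_fit) * sqrt(h / (1 - h))
all.equal(dffits_manual, dffits(mdl_1_fit))  # TRUE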
tmp_val <- dffits(mdl_1_fit)
print(format(tail(cbind(tmp_val), 10), scientific = FALSE))
## tmp_val
## 91 " -0.0005235787"
## 92 " 0.0031760359"
## 93 " 0.0091761236"
## 94 " 0.0199891169"
## 95 " -0.0226748095"
## 96 " 0.0376495768"
## 97 " -0.0379725909"
## 98 " -0.0287153104"
## 99 " 0.0125545090"
## 100 "-32.6910440413"
Observations with a DFFITS value larger than 2·√((k + 1)/N) in absolute value should be
carefully inspected.
olsrr::ols_plot_dffits(mdl_1_fit)

[Plot: Influence Diagnostics for y (threshold: 0.28). DFFITS vs. observation index; observation
100 stands out at around -32.]
olsrr::ols_plot_dffits(lm(y[-N] ~ 1 + x[-N]))

[Plot: Influence Diagnostics for y[-N] (threshold: 0.28). DFFITS vs. observation index; a few
observations (e.g. 8, 43, 44, 49, 64, 74) slightly exceed the threshold.]
Similarly to what we have observed with DFBETAS, we should not blindly trust that each value
outside the cutoff region is an outlier. Instead, we should treat them as influential observations,
which need additional analysis to determine whether they are acceptable.
Cook's distance
Cook's D measures the aggregate impact of each observation on the group of regression
coefficients, as well as on the group of fitted values. It can be used to:
- indicate influential data points (i.e. potential outliers);
- indicate regions where more observations would be needed.
Cook's distance for observation i is defined as:

$$
D_i = \frac{\sum_{j=1}^{N} (\widehat{Y}_j - \widehat{Y}_{j(i)})^2}{(k + 1)\,\widehat{\sigma}^2} = \frac{\widehat{\varepsilon}_i^2}{(k + 1)\,\widehat{\sigma}^2} \cdot \frac{h_{ii}}{(1 - h_{ii})^2}
$$

where:
- Ŷ_{j(i)} is the fitted value of Y_j, obtained by excluding the i-th observation and re-estimating
  the same model via OLS;
- σ̂² = ε̂^⊤ε̂ / (N − (k + 1)) is the mean squared error of the error term.

Note: in practical terms, it may be easier to use the leverage score expression of D_i instead of
re-estimating the model for each observation.
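
As a minimal sketch, the leverage score expression of D_i can be computed directly and compared
with the built-in cooks.distance():

h <- hatvalues(mdl_1_fit)
k <- length(coef(mdl_1_fit)) - 1
sigma2_hat <- summary(mdl_1_fit)$sigma^2  # estimate of the error variance

D_manual <- resid(mdl_1_fit)^2 / ((k + 1) * sigma2_hat) * h / (1 - h)^2
all.equal(D_manual, cooks.distance(mdl_1_fit))  # TRUE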
tmp_val <- cooks.distance(mdl_1_fit)
print(format(tail(cbind(tmp_val), 10), scientific = FALSE))

## tmp_val
## 91 "0.0000001384804"
## 92 "0.0000050955563"
## 93 "0.0000425310902"
## 94 "0.0002017917033"
## 95 "0.0002596785491"
## 96 "0.0007154000322"
## 97 "0.0007282312180"
## 98 "0.0004164378945"
## 99 "0.0000796089714"
## 100 "1.2596831219103"
Cook's distance values which are:
- larger than 4/N (the traditional cut-off);
- larger than 3 × (1/N) Σ_{i=1}^{N} D_i (three times the mean Cook's distance)
could be considered highly influential.
We can plot the D_i points:
olsrr::ols_plot_cooksd_bar(mdl_1_fit)

[Plot: Cook's D Bar Plot (threshold: 0.04). Observation 100 is flagged as an outlier with
D ≈ 1.26.]
olsrr::ols_plot_cooksd_chart(mdl_1_fit)

[Plot: Cook's D Chart (threshold: 0.04). Observation 100 again dominates.]
As well as plot the Di on the data without the outlier observation:
olsrr::ols_plot_cooksd_bar(lm(y[-N] ~ 1 + x[-N]))

[Plot: Cook's D Bar Plot for the model without observation 100 (threshold: 0.04). Observations
8, 43, 44, 49, 64 and 74 exceed the threshold, with observation 64 the largest at around 0.15.]
olsrr::ols_plot_cooksd_chart(lm(y[-N] ~ 1 + x[-N]))

[Plot: Cook's D Chart for the model without observation 100 (threshold: 0.04). The same
observations exceed the threshold.]

We again see a similar result, as with DFBETAS and DFFITS.


Also note that R has a lot of different plots for the default lm model output:
par(mfrow = c(3, 2), mar = c(2, 2, 2, 2))
for (i in 1:6) {
  plot(mdl_1_fit, which = i)  # the six default lm() diagnostic plots
}

[Plots: the six default lm() diagnostic plots for mdl_1_fit (Residuals vs Fitted, Normal Q-Q,
Scale-Location, Cook's distance, Residuals vs Leverage, Cook's dist vs Leverage). Observation
100 stands out in every panel.]
par(mfrow = c(3, 2), mar = c(2, 2, 2, 2))
for (i in 1:6) {
  plot(lm(y[-N] ~ 1 + x[-N]), which = i)
}

[Plots: the same six lm() diagnostic plots for the model without observation 100. No single
observation dominates, although observations such as 8, 44, 49, 64 and 74 are still highlighted.]


Addressing Outliers
After determining that a specific observation is indeed an outlier, we want to address it in
some way.
Capping the Outliers
If we find that the explanatory variables X_{1,i}, ..., X_{k,i} of an outlier value Y_i are similar to those of
other observations with non-outlier values of Y_i, we may cap the value of the outlier so that it
matches those non-outlier values.
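
A minimal sketch of one way to do this, capping (winsorizing) the response at its empirical
percentiles; the 1st/99th percentile limits used here are an illustrative assumption, not a
universal rule:

lims <- quantile(y, probs = c(0.01, 0.99))   # assumed capping limits
y_capped <- pmin(pmax(y, lims[1]), lims[2])  # values outside the limits are capped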

Replacing Outliers with Imputed Values

If we are certain that the outlier is due to some error in the data itself, we could try to impute
the observations by treating them as missing values and substituting them with some average
value of Y.
The Expectation-Maximization (EM) algorithm could be utilized for missing data imputation.

Deleting Outliers
In some cases, if we are absolutely sure that the observation is an outlier which is either
extremely unlikely or impossible to encounter again, we could drop it.

Robust Regression
In addition to the methods mentioned before, we could also run a robust regression.
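
A minimal sketch, assuming the MASS package is available: rlm() performs M-estimation, which
down-weights observations with large residuals instead of deleting them.

# install.packages("MASS")  # if not already installed
library(MASS)

mdl_robust <- rlm(y ~ 1 + x)  # robust (M-estimation) fit on the full data
coef(mdl_robust)              # compare with coef(mdl_1_fit)

With a single gross outlier such as ours, we would expect the robust estimates to be much closer
to the OLS fit with the outlier deleted than to the OLS fit on the full data.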
In our example, we know that the last observation was generated differently and is thus an
outlier, which we can delete.
We can compare how our model would look with the whole dataset versus with the outlier
observation dropped:
plot(x, y)
lines(x, mdl_1_fit$fitted.values, col = "red")
lines(x[-N], lm(y[-N] ~ 1 + x[-N])$fitted.values, col = "blue")
points(x[N], y[N], pch = 19, col = "red")
legend("topleft", lty = 1, col = c("red", "blue"), legend = c("with outlier", "deleted outlier"))
[Plot: scatter of y against x with two fitted lines: red for the OLS fit with the outlier included,
blue for the OLS fit with the outlier deleted. The outlier (observation 100, marked in red) pulls
the red fitted line downwards.]
