
Lecture 17: Multicollinearity

1 Why Collinearity Is a Problem


Remember our formula for the estimated coefficients in a multiple linear regression:
β̂ = (X^T X)^{-1} X^T Y

This is obviously going to lead to problems if X^T X isn't invertible. Similarly, the variance of the
estimates,

Var[β̂] = σ² (X^T X)^{-1},

will blow up when X^T X is singular. If that matrix isn't exactly singular, but is close to being
non-invertible, the variances will become huge.
There are several equivalent conditions for any square matrix U to be singular or non-invertible:
• The determinant det U (or |U|) is 0.
• At least one eigenvalue of U is 0. (This is because the determinant of a matrix is the product
of its eigenvalues.)
• U is rank deficient, meaning that one or more of its columns (or rows) is equal to a linear
combination of the others.
Since we’re not concerned with any old square matrix, but specifically with XT X, we have an
additional equivalent condition:
• X is column-rank deficient, meaning one or more of its columns is equal to a linear combi-
nation of the others.
The last condition explains why we call this problem collinearity: it looks like we have p different
predictor variables, but really some of them are linear combinations of the others, so they don't
add any information. If the exact linear relationship holds among more than two variables, we
talk about multicollinearity; collinearity can refer either to the general situation of a linear
dependence among the predictors, or, in contrast to multicollinearity, to a linear relationship among
just two of the predictors.
Again, if there isn't an exact linear relationship among the predictors, but they're close to one,
X^T X will be invertible, but (X^T X)^{-1} will be huge, and the variances of the estimated coefficients
will be enormous. This can make it very hard to say anything at all precise about the coefficients,
but that's not necessarily a problem.
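
To see the blow-up concretely, here is a small hedged sketch (the variable names and numbers are
invented for this illustration, not taken from the notes): when one predictor nearly copies another,
the diagonal entries of (X^T X)^{-1}, and hence the coefficient variances, become enormous.

# Illustration: near-collinearity inflates the diagonal of (X^T X)^{-1}
set.seed(1)
n = 100
z1 = rnorm(n)
z2.indep = rnorm(n)                  # unrelated to z1
z2.close = z1 + rnorm(n, sd = 0.01)  # nearly a copy of z1

X.indep = cbind(1, z1, z2.indep)
X.close = cbind(1, z1, z2.close)

# The diagonal of (X^T X)^{-1} drives the variances of the estimated coefficients
diag(solve(t(X.indep) %*% X.indep))
diag(solve(t(X.close) %*% X.close))  # the entries for z1 and z2.close blow up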

1.1 Dealing with Collinearity by Deleting Variables


Since not all of the p variables are actually contributing information, a natural way of dealing with
collinearity is to drop some variables from the model. If you want to do this, you should think
very carefully about which variable to delete. As a concrete example: if we try to include all of a
student’s grades as predictors, as well as their over-all GPA, we’ll have a problem with collinearity
(since GPA is a linear function of the grades). But depending on what we want to predict, it
might make more sense to use just the GPA, dropping all the individual grades, or to include the
individual grades and drop the average.

1.2 Diagnosing Collinearity Among Pairs of Variables
Linear relationships between pairs of variables are fairly easy to diagnose: we make the pairs plot
of all the variables, and we see if any of them fall on a straight line, or close to one. Unless the
number of variables is huge, this is by far the best method. If the number of variables is huge, look
at the correlation matrix, and worry about any entry off the diagonal which is (nearly) ±1.
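
If the number of variables really is huge, a small hedged sketch like the following automates that
scan of the correlation matrix (the function name and the 0.99 cutoff are illustrative choices, not
anything standard):

# Sketch: flag pairs of predictors whose correlation is close to +/- 1
# X is assumed to be a numeric matrix with one column per predictor
flag.collinear.pairs = function(X, cutoff = 0.99) {
  R = cor(X)
  R[upper.tri(R, diag = TRUE)] = NA       # look at each pair only once
  which(abs(R) > cutoff, arr.ind = TRUE)  # row/column indices of flagged pairs
}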

1.3 Why Multicollinearity Is Hard to Detect


A multicollinear relationship involving three or more variables might be totally invisible on a pairs
plot. For instance, suppose X_1 and X_2 are independent Gaussians, of equal variance σ², and X_3 is
their average, X_3 = (X_1 + X_2)/2. The correlation between X_1 and X_3 is

Cor(X_1, X_3) = Cov[X_1, X_3] / sqrt(Var[X_1] Var[X_3])                  (1)
              = Cov[X_1, (X_1 + X_2)/2] / sqrt(σ² · σ²/2)
              = (σ²/2) / (σ²/√2)
              = 1/√2.                                                    (2)

This is also the correlation between X_2 and X_3. A correlation of 1/√2 isn't trivial, but is hardly
perfect, and doesn't really distinguish itself on a pairs plot (Figure 1).

x1 = rnorm(100,70,15)
x2 = rnorm(100,70,15)
x3 = (x1 + x2)/2
X = cbind(x1,x2,x3)
pairs(X)
cor(X)

## x1 x2 x3
## x1 1.00000000 0.03788452 0.7250514
## x2 0.03788452 1.00000000 0.7156686
## x3 0.72505136 0.71566863 1.0000000

2 Variance Inflation Factors


If the predictors are correlated with each other, the standard errors of the coefficient estimates will
be bigger than if the predictors were uncorrelated. If the predictors were uncorrelated, the variance
of β̂_i would be

Var[β̂_i] = σ² / (n s²_{X_i})                                            (3)

just as it is in a simple linear regression. With correlated predictors, however, we have to use our
general formula for least squares:

Var[β̂_i] = σ² (X^T X)^{-1}_{i+1, i+1}                                   (4)

Figure 1: Pairs plot of x1, x2, and x3, illustrating that a perfect multi-collinear relationship might
not show up on a pairs plot or in a correlation matrix.

The ratio between Eqs. 4 and 3 is the variance inflation factor for the i-th coefficient, VIF_i. The
average of the variance inflation factors across all predictors is often written \overline{VIF}, or just VIF.
Folklore says that VIF_i > 10 indicates “serious” multicollinearity for the predictor. I have been
unable to discover who first proposed this threshold, or what the justification for it is. It is also
quite unclear what to do about this. Large variance inflation factors do not, after all, violate any
model assumptions.
It can be shown that VIF_i = 1/(1 − R_i²), where R_i² is the R² you get by regressing X_i on all the
other covariates.
Frankly, I don’t think many people use VIF.
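
If you do want them anyway, VIFs are easy to compute from the formula above; here is a minimal
sketch (the function name is made up, and X is assumed to be a data frame containing only the
predictors). The vif() function in the car package does essentially the same computation for a
fitted lm object.

# Sketch: variance inflation factors via the R^2 of each predictor on the others
vif.by.hand = function(X) {
  vifs = sapply(seq_along(X), function(i) {
    r2 = summary(lm(X[[i]] ~ ., data = X[-i]))$r.squared
    1 / (1 - r2)                      # VIF_i = 1 / (1 - R_i^2)
  })
  setNames(vifs, names(X))
}
# e.g. vif.by.hand(data.frame(x1, x2, x3)) returns Inf for every predictor,
# since each one is an exact linear function of the other two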

3 Matrix Perspective
Let X be the n × q design matrix. (Remember that q = p + 1.) We call G = X^T X the Gram
matrix. You should check the following facts:

1. G is q × q.

2. G is symmetric.

3. G is positive semi-definite. That means that, for any vector a, we have

   a^T G a ≥ 0.
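
These facts are easy to verify numerically. A short sketch, using an arbitrary simulated design
matrix (the dimensions are purely illustrative):

# Sketch: check that G = X^T X is square, symmetric, and positive semi-definite
set.seed(2)
X = matrix(rnorm(100 * 4), nrow = 100, ncol = 4)  # an arbitrary 100 x 4 design
G = t(X) %*% X
dim(G)            # q x q (here 4 x 4)
isSymmetric(G)    # TRUE
a = rnorm(4)      # any vector a
t(a) %*% G %*% a  # a 1 x 1 matrix whose single entry is >= 0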

Multicollinearity means that there exists a perfect linear relationship between the columns of
X. This means that there is a non-zero vector a = (a_1, ..., a_q) such that Σ_j a_j X_j = 0, where X_j is
the j-th column of X. In other words, there exists a ≠ (0, ..., 0) such that Xa = 0. Hence

a^T G a = 0.                                                             (5)

Since G is a square, symmetric, positive semi-definite matrix, it has a spectral decomposition
(or eigendecomposition). In other words, there are numbers (eigenvalues) λ_1 ≥ λ_2 ≥ · · · ≥ λ_q ≥ 0
and vectors (eigenvectors) v_1, ..., v_q such that:

1. G v_j = λ_j v_j.

2. v_j^T v_k = 0 for j ≠ k.

3. v_j^T v_j = 1 for each j.

4. G = Σ_j λ_j v_j v_j^T.

5. G = V D V^T, where the j-th column of V is v_j and D is a diagonal matrix with D_jj = λ_j.

6. The eigenvectors form a basis: any vector w can be written as w = Σ_j b_j v_j, where b_j = w^T v_j.

Now if the design matrix is collinear, then there is an a such that a^T G a = 0. Now

0 = a^T G a = a^T (Σ_j λ_j v_j v_j^T) a = Σ_j λ_j a^T v_j v_j^T a = Σ_j λ_j (a^T v_j)² ≡ U.

If λ_q > 0 then λ_j > 0 for all j. We know that (a^T v_j)² > 0 for at least one j. (We know this since
a = Σ_j (a^T v_j) v_j and a ≠ 0.) So if λ_q > 0 then we get 0 = U > 0, which is a contradiction.
We conclude that λ_q = 0. (There could be other eigenvalues that are 0 as well.)
We have shown that:

Multicollinearity ⟹ a^T G a = 0 for some a ≠ 0 ⟹ at least one eigenvalue of G is 0.

It is not hard to show that the reverse implications also hold.

3.1 Finding the Eigendecomposition


Because finding eigenvalues and eigenvectors of matrices is so useful for so many situations, mathe-
maticians and computer scientists have devoted incredible efforts over the last two hundred years to
fast, precise algorithms for computing them. This is not the place to go over how those algorithms
work; it is the place to say that much of the fruit of those centuries of effort is embodied in the
linear algebra packages R uses. Thus, when you call

eigen(A)

you get back a list, containing the eigenvalues of the matrix A (in a vector), and its eigenvectors (in a
matrix), and this is both a very fast and a very reliable calculation. If your matrix has very special
structure (e.g., it’s sparse, meaning almost all its entries are zero), there are more specialized
packages adapted to your needs, but we don’t pursue this further here; for most data-analytic
purposes, ordinary eigen will do.
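
As a quick sanity check of the properties listed above, you can run eigen on a small symmetric
matrix (the 2 × 2 matrix here is arbitrary, chosen only for illustration):

# Sketch: eigen() on a small symmetric matrix, and the reconstruction A = V D V^T
A = matrix(c(2, 1, 1, 3), nrow = 2)
e = eigen(A, symmetric = TRUE)
e$values                        # eigenvalues, largest first
e$vectors                       # eigenvectors in the columns
V = e$vectors
D = diag(e$values)
all.equal(A, V %*% D %*% t(V))  # TRUE, up to rounding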

3.2 Example
> n = 100
> x1 = rnorm(n)
> x2 = rnorm(n)
> x3 = (x1+x2)/2
> y = 5 + 2*x1 + 4*x2 + rnorm(n)
> out = lm(y ~ x1 + x2 + x3)
> summary(out)

Coefficients: (1 not defined because of singularities)


Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.14771 0.09667 53.25 <2e-16 ***
x1 1.98474 0.09544 20.80 <2e-16 ***
x2 3.97844 0.08854 44.93 <2e-16 ***
x3 NA NA NA NA

> one = rep(1,n)


> X = cbind(one,x1,x2,x3)
> G = t(X) %*% X
> tmp = eigen(G,symmetric=TRUE)

> names(tmp)
[1] "values" "vectors"
> round(tmp$values,5)
[1] 194.00958 100.09923 95.29363 0.00000
>
> out = lm(y ~ x1 + x2)
> summary(out)

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.14771 0.09667 53.25 <2e-16 ***
x1 1.98474 0.09544 20.80 <2e-16 ***
x2 3.97844 0.08854 44.93 <2e-16 ***

> X = cbind(one,x1,x2)
> G = t(X) %*% X
> tmp = eigen(G,symmetric=TRUE)
> round(tmp$values,5)
[1] 131.75456 99.98923 93.64998

4 Ridge Regression
The real problem with collinearity is that when it happens, there isn't a unique solution to the
estimating equations. Rather, there are infinitely many solutions, all of which give the minimum mean
squared error. This causes the variance of β̂ to be infinite.
One solution (which will also help us with high-dimensional regression) is called ridge regression.
Instead of minimizing
(1/n) (Y − Xb)^T (Y − Xb),

we instead minimize the penalized squared error

(1/n) (Y − Xb)^T (Y − Xb) + (λ/n) ||b||².

The penalty factor λ > 0 will lead to a solution with some bias, but it reduces the variance. In
particular, it solves the problem of non-invertibility. We'll come back later to how to pick λ. The
gradient is

∇_b [ (1/n) (Y − Xb)^T (Y − Xb) + (λ/n) b^T b ] = (2/n) ( −X^T Y + X^T X b + λ b ).

Set this to zero at the optimum, β̂_λ,

X^T Y = (X^T X + λ I) β̂_λ

and solve to get

β̂_λ = (X^T X + λ I)^{-1} X^T Y.

The inverse always exists.
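
Here is a minimal sketch of computing the ridge estimate directly from this formula (the function
name is made up; in practice you would use one of the packages in Section 4.1). Like the formula
above, it penalizes every coefficient, including the intercept if X contains a column of ones.

# Sketch: ridge coefficients beta_lambda = (X^T X + lambda I)^{-1} X^T Y
ridge.coefficients = function(X, y, lambda) {
  q = ncol(X)
  solve(t(X) %*% X + lambda * diag(q), t(X) %*% y)
}
# Even the collinear design from Section 3.2 poses no problem, e.g.
# ridge.coefficients(cbind(one, x1, x2, x3), y, lambda = 0.1)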
Let’s compute the mean and variance:
E[β̂_λ] = (X^T X + λ I)^{-1} X^T E[Y]                                    (6)
        = (X^T X + λ I)^{-1} X^T X β                                     (7)

Var[β̂_λ] = Var[ (X^T X + λ I)^{-1} X^T Y ]                              (8)
          = Var[ (X^T X + λ I)^{-1} X^T ε ]                              (9)
          = (X^T X + λ I)^{-1} X^T (σ² I) X (X^T X + λ I)^{-1}           (10)
          = σ² (X^T X + λ I)^{-1} X^T X (X^T X + λ I)^{-1}.              (11)

Notice how both of these expressions smoothly approach the corresponding formulas for ordinary
least squares as λ → 0. Indeed, under the Gaussian noise assumption, β̂_λ actually has a Gaussian
distribution with the given expectation and variance.
It can be shown that ridge regression can also be obtained by doing a constrained minimization:

minimize (Y − Xb)^T (Y − Xb) subject to ||b||² ≤ c.

You can prove this using the Lagrange multipliers you learned in calculus.
We usually choose λ using cross-validation which we will explain later in the course.

Units and Standardization. If the different predictor variables don't have physically compa-
rable units, it's a good idea to standardize them first, so they all have mean 0 and variance 1.
Otherwise, penalizing β^T β = Σ_{i=1}^p β_i² seems to be adding up apples, oranges, and the occasional
bout of regret. (Some people always pre-standardize the predictors.)
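
A minimal sketch of pre-standardizing the predictors from the earlier simulation (scale() centers
each column and divides by its standard deviation):

# Sketch: put the predictors on a common scale before applying the penalty
X = cbind(x1, x2, x3)        # predictors from the earlier simulation
X.std = scale(X)             # each column now has mean 0 and sd 1
round(colMeans(X.std), 10)   # all (numerically) 0
apply(X.std, 2, sd)          # all 1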

4.1 Ridge Regression in R


There are several R implementations of ridge regression; the MASS package contains one, lm.ridge,
which needs you to specify λ. The ridge package has linearRidge, which gives you the option to
set λ, or to select it automatically via cross-validation.

4.2 Other Penalties/Constraints


Ridge regression penalizes the mean squared error with ||b||², the squared length of the coefficient
vector. This suggests the idea of using some other measure of how big the vector is, some other
norm. A mathematically popular family of norms are the ℓ_p norms, defined as

||b||_p = ( Σ_{i=1}^p |b_i|^p )^{1/p}.

The usual Euclidean length is ℓ_2, while ℓ_1 is

||b||_1 = Σ_{i=1}^p |b_i|,

and (by continuity) ||b||_0 is just the number of non-zero entries in b. When p ≠ 2, penalizing
||b||_p does not, usually, have a nice closed-form solution like ridge regression does. Finding the
minimum of the mean squared error under an ℓ_1 penalty is called lasso regression or the lasso
estimator, or just the lasso. This has the nice property that it gives sparse solutions: it sets some
coefficients to be exactly zero (unlike ridge). There are no closed forms for the lasso, but there are
efficient numerical algorithms. The lasso is one of the most popular methods for high-dimensional
regression today. We will discuss the lasso in detail later.
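
As a preview, here is a hedged sketch of fitting the lasso with the glmnet package that appears
again in Section 4.4; setting alpha = 1 selects the ℓ_1 penalty, and the data objects are the ones
simulated earlier.

# Sketch: the lasso via glmnet (alpha = 1 gives the l1 penalty, alpha = 0 gives ridge)
library(glmnet)
X = cbind(x1, x2, x3)               # covariates must be in a matrix
cvfit = cv.glmnet(X, y, alpha = 1)  # cross-validation over a grid of lambdas
coef(cvfit, s = "lambda.min")       # sparse: some coefficients are exactly 0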
Penalizing ℓ_0, the number of non-zero coefficients, sounds like a good idea, but there is no known
efficient algorithm for searching over all possible subsets of variables. (The problem is NP-hard.)

4.3 High-Dimensional Regression


One situation where we know that we will always have multicollinearity is when n < p. After all, n
points always define a linear subspace of (at most) n − 1 dimensions. When the number of predictors
we measure for each data point is bigger than the number of data points, the predictors have to be
collinear, indeed multicollinear. We are then said to be in a high-dimensional regime.
This is an increasingly common situation in data analysis. A very large genetic study might
sequence the genes of, say, 500 people — but measure 500,000 genetic markers in each person.
If we want to predict some characteristic of the people from the genes (say their height, or blood
pressure, or how quickly they would reject a transplanted organ), there is simply no way to estimate
a model by ordinary least squares. Any approach to high-dimensional regression must involve either
reducing the number of dimensions or penalizing the estimates, as ridge regression and the lasso do.
We will discuss high-dimensional regression in more detail later. Our main tools will be the lasso,
ridge regression, and something called forward stepwise regression.
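
Here is a tiny hedged illustration of the n < p regime (the dimensions and coefficients are invented
for this sketch): ordinary least squares cannot estimate all the coefficients, while a penalized
method such as the lasso still produces a fit.

# Sketch: more predictors (p = 200) than observations (n = 50)
set.seed(3)
n = 50
p = 200
X = matrix(rnorm(n * p), n, p)
beta = c(rep(2, 5), rep(0, p - 5))      # only the first 5 predictors matter
y = drop(X %*% beta) + rnorm(n)

fit.ols = lm(y ~ X)                     # most coefficients come back NA
sum(is.na(coef(fit.ols)))

library(glmnet)
fit.lasso = cv.glmnet(X, y, alpha = 1)  # the lasso handles n < p
sum(coef(fit.lasso, s = "lambda.min") != 0)  # number of non-zero coefficients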

4.4 Example
Let's apply ridge regression to the simulated data already created, where one predictor variable
(X_3) is just the average of two others (X_1 and X_2).

library(ridge)
out = linearRidge(y ~ x1 + x2 + x3 + x4,lambda="automatic")
coefficients(out)

## (Intercept) x1 x2 x3 x4
## 3.755564075 0.419326727 0.035301803 0.495563414 0.005749006

I had trouble installing the ridge package. There are other packages you can use too. For
example

library(MASS)
out = lm.ridge(y ~ x1 + x2 + x3,lambda=.1)
> coefficients(out)
                   x1         x2         x3
 4.7644327  1.8683289 -0.5579901  1.0366070

You can also use the glmnet package:

library(glmnet)
lambdas = 10^seq(3, -2, by = -.1)
X = cbind(x1,x2,x3)
cvfit = cv.glmnet(X, y, alpha = 0, lambda = lambdas)
### Note: for this function you need to put the covariates in a matrix.
plot(cvfit) ##this shows the estimated MSE as a function of lambda
bestlambda = cvfit$lambda.min ### this is the best lambda
print(bestlambda)
## [1] 0.01258925
out = glmnet(X,y,alpha=0,lambda=bestlambda)
coefficients(out)

(Intercept) 5.368734
x1 1.468612
x2 -0.511296
x3 1.310621
