Lecture 17: Multicollinearity

1 Why Collinearity Is A Problem
Recall that the variance of the estimated coefficients, Var[β̂] = σ²(X^T X)^{-1}, will blow up when X^T X is singular. If that matrix isn't exactly singular, but is close to being non-invertible, the variances will become huge.
There are several equivalent conditions for any square matrix U to be singular, i.e., non-invertible:
• The determinant det U (or |U|) is 0.
• At least one eigenvalue of U is 0. (This is because the determinant of a matrix is the product
of its eigenvalues.)
• U is rank deficient, meaning that one or more of its columns (or rows) is equal to a linear
combination of the others.
Since we're not concerned with any old square matrix, but specifically with X^T X, we have an
additional equivalent condition:
• X is column-rank deficient, meaning one or more of its columns is equal to a linear combi-
nation of the others.
The last condition explains why we call this problem collinearity: it looks like we have p different
predictor variables, but really some of them are linear combinations of the others, so they don't
add any information. If the exact linear relationship holds among more than two variables, we
talk about multicollinearity; collinearity can refer either to the general situation of a linear
dependence among the predictors, or, by contrast with multicollinearity, to a linear relationship
between just two of the predictors.
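To make these conditions concrete, here is a small illustration (simulated for this purpose, not part of the notes' own data): a design matrix whose third column is an exact linear combination of the first two, so that its Gram matrix is singular.

set.seed(1)
X = cbind(rnorm(10), rnorm(10))
X = cbind(X, (X[,1] + X[,2])/2)   # third column = average of the first two
G = t(X) %*% X                    # the matrix X^T X
det(G)                            # numerically zero
eigen(G)$values                   # smallest eigenvalue is (numerically) zero
qr(X)$rank                        # 2 rather than 3: X is column-rank deficient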
Again, if there isn't an exact linear relationship among the predictors, but they're close to one,
X^T X will be invertible, but (X^T X)^{-1} will be huge, and the variances of the estimated coefficients
will be enormous. This can make it very hard to say anything at all precise about the coefficients,
but that's not necessarily a problem.
1.2 Diagnosing Collinearity Among Pairs of Variables
Linear relationships between pairs of variables are fairly easy to diagnose: we make the pairs plot
of all the variables, and we see if any of them fall on a straight line, or close to one. Unless the
number of variables is huge, this is by far the best method. If the number of variables is huge, look
at the correlation matrix, and worry about any entry off the diagonal which is (nearly) ±1.
Suppose, for example, that X1 and X2 are independent, each with variance σ², and that X3 = (X1 + X2)/2. Then

\[
\mathrm{Cor}(X_1, X_3) = \frac{\mathrm{Cov}[X_1, X_3]}{\sqrt{\mathrm{Var}[X_1]\,\mathrm{Var}[X_3]}} \qquad (1)
\]
\[
= \frac{\mathrm{Cov}[X_1, (X_1 + X_2)/2]}{\sqrt{\sigma^2 \cdot \sigma^2/2}} = \frac{\sigma^2/2}{\sigma^2/\sqrt{2}} = \frac{1}{\sqrt{2}}. \qquad (2)
\]
This is also the correlation between X2 and X3. A correlation of 1/√2 ≈ 0.71 isn't trivial, but is hardly
perfect, and doesn't really stand out on a pairs plot (Figure 1).
x1 = rnorm(100, 70, 15)   # two independent predictors
x2 = rnorm(100, 70, 15)
x3 = (x1 + x2)/2          # exactly collinear with x1 and x2
X = cbind(x1, x2, x3)
pairs(X)
cor(X)
## x1 x2 x3
## x1 1.00000000 0.03788452 0.7250514
## x2 0.03788452 1.00000000 0.7156686
## x3 0.72505136 0.71566863 1.0000000
[Figure 1 (the pairs plot of x1, x2, and x3 produced by pairs(X)) appears here.]
Figure 1: Illustration that a perfect multi-collinear relationship might not show up on a pairs plot or in a
correlation matrix.
The ratio between Eqs. 4 and 3 is the variance inflation factor for the ith coefficient, VIF_i. The
average of the variance inflation factors across all predictors is often written \(\overline{VIF}\), or just VIF.
Folklore says that VIF_i > 10 indicates "serious" multicollinearity for the predictor. I have been
unable to discover who first proposed this threshold, or what the justification for it is. It is also
quite unclear what to do about this. Large variance inflation factors do not, after all, violate any
model assumptions.

It can be shown that VIF_i = 1/(1 − R_i²), where R_i² is the R² you get by regressing X_i on all the
other covariates.
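To make this identity concrete, here is a minimal sketch of computing one VIF directly from 1/(1 − R_i²), on freshly simulated, nearly collinear predictors (z1, z2, z3 are made up for this illustration, not the data used elsewhere in these notes):

n = 100
z1 = rnorm(n); z2 = rnorm(n)
z3 = 0.5*z1 + 0.5*z2 + rnorm(n, sd = 0.1)   # z3 is nearly a linear combination of z1, z2
r2 = summary(lm(z3 ~ z1 + z2))$r.squared    # R^2 from regressing z3 on the other predictors
1/(1 - r2)                                  # the VIF for z3; large because R^2 is near 1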
Frankly, I don’t think many people use VIF.
3 Matrix Perspective
Let X be the n × q design matrix. (Remember that q = p + 1.) We call G = X^T X the Gram
matrix. You should check the following facts (a quick numerical check is sketched after the list):
1. G is q × q.
2. G is symmetric.
3. G is positive semi-definite: for any vector a, we have a^T G a ≥ 0.
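Here is one way to do that check numerically, on a small simulated design (the specific numbers are arbitrary):

set.seed(2)
n = 20; q = 4
X = cbind(1, matrix(rnorm(n*(q-1)), n, q-1))   # an n x q design with an intercept column
G = t(X) %*% X
dim(G)                     # q x q
isSymmetric(G)             # TRUE
a = rnorm(q)
t(a) %*% G %*% a           # non-negative, since it equals ||Xa||^2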
Multicollinearity means that there exists a perfect linear relationship between the columns of
X. This means that there is a non-zero vector a = (a_1, …, a_q) such that Σ_j a_j X_j = 0, where X_j is
the j-th column of X. In other words, there exists a ≠ (0, …, 0) such that Xa = 0. Hence

\[
a^T G a = 0. \qquad (5)
\]
Since G is symmetric and positive semi-definite, it has eigenvalues λ_1 ≥ λ_2 ≥ ⋯ ≥ λ_q ≥ 0 and corresponding eigenvectors v_1, …, v_q such that:
1. G v_j = λ_j v_j.
2. v_j^T v_k = 0 for j ≠ k.
3. v_j^T v_j = 1.
4. G = Σ_j λ_j v_j v_j^T.
Now, if the design matrix is collinear, then there is an a such that a^T G a = 0. Now

\[
0 = a^T G a = a^T \Big( \sum_j \lambda_j v_j v_j^T \Big) a = \sum_j \lambda_j (a^T v_j)(v_j^T a) = \sum_j \lambda_j (a^T v_j)^2 \equiv U.
\]
If λ_q > 0 then λ_j > 0 for all j. We know that (a^T v_j)² > 0 for at least one j. (We know this since
a = Σ_j (a^T v_j) v_j and since a ≠ 0.) So if λ_q > 0 then we get 0 = U > 0, which is a contradiction.
We conclude that λ_q = 0. (There could be other eigenvalues that are 0 as well.)
We have shown that:
Multicollinearity ⟹ a^T G a = 0 for some a ≠ 0 ⟹ at least one eigenvalue of G is 0.
It is not hard to show that the reverse implications also hold.
In R, if you run

eigen(A)

you get back a list containing the eigenvalues of the matrix A (in a vector) and its eigenvectors (in a
matrix), and this is both a very fast and a very reliable calculation. If your matrix has very special
structure (e.g., it's sparse, meaning almost all its entries are zero), there are more specialized
packages adapted to your needs, but we don't pursue this further here; for most data-analytic
purposes, ordinary eigen will do.
3.2 Example
> n = 100
> x1 = rnorm(n)
> x2 = rnorm(n)
> x3 = (x1+x2)/2
> y = 5 + 2*x1 + 4*x2 + rnorm(n)
> out = lm(y ~ x1 + x2 + x3)
> summary(out)
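Since x3 is an exact linear combination of x1 and x2, lm reports an NA coefficient for it. The eigendecomposition tmp examined next is, presumably, built from the Gram matrix of this full design, along these lines (the exact code is not shown above):

> one = rep(1,n)
> X = cbind(one,x1,x2,x3)
> G = t(X) %*% X
> tmp = eigen(G,symmetric=TRUE)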
> names(tmp)
[1] "values" "vectors"
> round(tmp$values,5)
[1] 194.00958 100.09923 95.29363 0.00000
The zero eigenvalue confirms that the full design is exactly collinear. Dropping x3 removes the problem:
> out = lm(y ~ x1 + x2)
> summary(out)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.14771 0.09667 53.25 <2e-16 ***
x1 1.98474 0.09544 20.80 <2e-16 ***
x2 3.97844 0.08854 44.93 <2e-16 ***
> X = cbind(one,x1,x2)
> G = t(X) %*% X
> tmp = eigen(G,symmetric=TRUE)
> round(tmp$values,5)
[1] 131.75456 99.98923 93.64998
4 Ridge Regression
The real problem with collinearity is that when it happens, there isn't a unique solution to the
estimating equations. Rather, there are infinitely many solutions, all of which give the minimum mean
squared error. This causes the variance of β̂ to be infinite.
One solution (which will also help us with high-dimensional regression) is called ridge regression.
Instead of minimizing

\[
\frac{1}{n} (Y - Xb)^T (Y - Xb)
\]

we minimize the penalized squared error

\[
\frac{1}{n} (Y - Xb)^T (Y - Xb) + \frac{\lambda}{n} \|b\|^2.
\]
The penalty factor λ > 0 will lead to a solution with some bias, but it reduces the variance. In
particular, it solves the problem of non-invertibility. We'll come back later to how to pick λ. The
gradient is

\[
\nabla_b \left[ \frac{1}{n} (Y - Xb)^T (Y - Xb) + \frac{\lambda}{n} b^T b \right] = \frac{2}{n} \left[ -X^T Y + X^T X b + \lambda b \right].
\]

Setting this to zero at the minimizer β̂_λ gives

\[
X^T Y = (X^T X + \lambda I) \hat{\beta}_\lambda
\]
and we solve to get

\[
\hat{\beta}_\lambda = (X^T X + \lambda I)^{-1} X^T Y.
\]

The inverse always exists when λ > 0, since X^T X + λI is then positive definite.
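As a sketch of this formula in action (with freshly simulated data; in practice the intercept is often left unpenalized, but here the formula is applied exactly as written above):

n = 100
x1 = rnorm(n); x2 = rnorm(n); x3 = (x1 + x2)/2         # x3 exactly collinear, as before
y = 5 + 2*x1 + 4*x2 + rnorm(n)
X = cbind(1, x1, x2, x3)                               # design matrix with intercept column
lambda = 0.1                                           # an arbitrary illustrative value
solve(t(X) %*% X + lambda*diag(ncol(X)), t(X) %*% y)   # (X^T X + lambda I)^{-1} X^T Y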
Let’s compute the mean and variance:
\[
E[\hat{\beta}_\lambda] = (X^T X + \lambda I)^{-1} X^T E[Y] \qquad (6)
\]
\[
= (X^T X + \lambda I)^{-1} X^T X \beta \qquad (7)
\]

\[
\mathrm{Var}[\hat{\beta}_\lambda] = \mathrm{Var}\left[ (X^T X + \lambda I)^{-1} X^T Y \right] \qquad (8)
\]
\[
= (X^T X + \lambda I)^{-1} X^T \, \mathrm{Var}[Y] \, X (X^T X + \lambda I)^{-1} \qquad (9)
\]
\[
= (X^T X + \lambda I)^{-1} X^T \sigma^2 I X (X^T X + \lambda I)^{-1} \qquad (10)
\]
\[
= \sigma^2 (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1}. \qquad (11)
\]
Notice how both of these expressions smoothly approach the corresponding formulas for ordinary
least squares as λ → 0. Indeed, under the Gaussian noise assumption, β̂_λ actually has a
Gaussian distribution with the given expectation and variance.
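As a quick numerical illustration of formula (11), evaluated on a simulated, nearly collinear design with σ² set to 1 (the design u1, u2 is made up for this purpose), the total variance drops sharply as λ grows:

set.seed(3)
n = 50
u1 = rnorm(n); u2 = u1 + rnorm(n, sd = 0.01)   # a nearly collinear pair
X = cbind(1, u1, u2)
sigma2 = 1
ridge.var = function(lambda) {
  M = solve(t(X) %*% X + lambda*diag(ncol(X)))
  sigma2 * M %*% (t(X) %*% X) %*% M            # formula (11)
}
sapply(c(0.001, 0.1, 1, 10), function(l) sum(diag(ridge.var(l))))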
It can be shown that ridge regression can also be obtained by doing a constrained minimization: minimize
\(\frac{1}{n}(Y - Xb)^T (Y - Xb)\) subject to \(\|b\|^2 \le c\), for some c > 0. You can prove this using the Lagrange multipliers you learned in calculus.

We usually choose λ using cross-validation, which we will explain later in the course.
Units and Standardization. If the different predictor variables don't have physically compa-
rable units, it's a good idea to standardize them first, so they all have mean 0 and variance 1.
Otherwise, penalizing \(\beta^T \beta = \sum_{i=1}^{p} \beta_i^2\) seems to be adding up apples, oranges, and the occasional
bout of regret. (Some people always pre-standardize the predictors.)
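A minimal sketch of pre-standardizing, assuming x1, x2, x3 and y from the simulated example above are still in the workspace (scale subtracts each column's mean and divides by its standard deviation; centering y lets us drop the intercept):

Xs = scale(cbind(x1, x2, x3))     # columns now have mean 0, sd 1
lambda = 0.1                      # illustrative value
solve(t(Xs) %*% Xs + lambda*diag(ncol(Xs)), t(Xs) %*% (y - mean(y)))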
More generally, instead of the squared ℓ_2 norm we could penalize some other measure of the size of the coefficient vector. The ℓ_p norm is

\[
\|b\|_p = \left( \sum_{i=1}^{p} |b_i|^p \right)^{1/p}.
\]
The usual Euclidean length is ℓ_2, while the ℓ_1 norm is

\[
\|b\|_1 = \sum_{i=1}^{p} |b_i|,
\]

and (by continuity) ‖b‖_0 is just the number of non-zero entries in b. When p ≠ 2, penalizing
‖b‖_p does not, usually, have a nice closed-form solution like ridge regression does. Finding the
minimum of the mean squared error under an ℓ_1 penalty is called lasso regression or the lasso
estimator, or just the lasso. This has the nice property that it gives sparse solutions: it sets some
coefficients to be exactly zero (unlike ridge). There are no closed forms for the lasso, but there are
efficient numerical algorithms. The lasso is one of the most popular methods for high-dimensional
regression today. We will discuss the lasso in detail later.
Penalizing ℓ_0, the number of non-zero coefficients, sounds like a good idea, but there is no known
way to quickly search over all possible combinations of variables. (The problem is NP-hard.)
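Though we postpone the details, here is a minimal glmnet sketch of a lasso fit (setting alpha = 1 selects the ℓ_1 penalty), again assuming x1, x2, x3 and y from the earlier simulation are in the workspace:

library(glmnet)
X = cbind(x1, x2, x3)
cvfit.l1 = cv.glmnet(X, y, alpha = 1)   # cross-validation over glmnet's default lambda grid
coef(cvfit.l1, s = "lambda.min")        # lasso sets some coefficients exactly to zero
                                        # (which ones depends on the data and lambda)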
4.4 Example
Let’s apply ridge regression to the simulated data already created, where one predictor variable
(X3 ) is just the average of two others (X1 and X2 ).
library(ridge)
out = linearRidge(y ~ x1 + x2 + x3 + x4,lambda="automatic")
coefficients(out)
## (Intercept) x1 x2 x3 x4
## 3.755564075 0.419326727 0.035301803 0.495563414 0.005749006
I had trouble installing the ridge package. There are other packages you can use too. For
example
library(MASS)
out = lm.ridge(y ~ x1 + x2 + x3,lambda=.1)
> coefficients(out)
x1 x2 x3
4.7644327 1.8683289 -0.5579901 1.0366070
Another option is the glmnet package:
library(glmnet)
lambdas = 10^seq(3, -2, by = -.1)
X = cbind(x1,x2,x3)
cvfit = cv.glmnet(X, y, alpha = 0, lambda = lambdas)
### Note: for this function you need to put the covariates in a matrix.
plot(cvfit) ##this shows the estimated MSE as a function of lambda
bestlambda = cvfit$lambda.min ### this is the best lambda
print(bestlambda)
## [1] 0.01258925
out = glmnet(X,y,alpha=0,lambda=bestlambda)
coefficients(out)
(Intercept) 5.368734
x1 1.468612
x2 -0.511296
x3 1.310621