MathModel_Lecture 8
1 Multicollinearity
A basic assumption in the multiple linear regression model is that the data matrix X in Lecture 5 has full column rank K + 1, that is, the column vectors of X are linearly independent. Let 0_n denote the vector of n zeros. If there exists a nonzero vector a = (a_0, a_1, . . . , a_K)^⊤ such that
\[
Xa = 0_n, \tag{1.1}
\]
then we say that the independent variables X_1, X_2, . . . , X_K exhibit perfect multicollinearity. In practical problems, perfect multicollinearity is rare. However, it is common that (1.1) holds approximately, that is,
\[
Xa \approx 0_n. \tag{1.2}
\]
When the independent variables X_1, X_2, . . . , X_K satisfy (1.2), we say that they exhibit collinearity.
Suppose that the regression model
\[
y = \beta_0 + \beta_1 X_1 + \cdots + \beta_K X_K + \varepsilon
\]
exhibits perfect multicollinearity. Then the column rank of the data matrix X in Lecture 5 is strictly smaller than K + 1. In this case, |X^⊤X| = 0 and the inverse of X^⊤X does not exist. When (1.2) holds, we still have rank(X) = K + 1, but the determinant of X^⊤X will be very small, that is, |X^⊤X| ≈ 0, which implies that (X^⊤X)^{-1} must have some very large diagonal entry. Note that the dispersion matrix (covariance matrix) of b = (X^⊤X)^{-1}X^⊤y is given by D(b) = σ²(X^⊤X)^{-1}, which is shown as follows:
\[
\begin{aligned}
D(b) &= D\bigl((X^\top X)^{-1}X^\top y\bigr) \\
     &= (X^\top X)^{-1}X^\top D(y)\,X(X^\top X)^{-1} \qquad (\text{use the fact } D(Ay) = A\,D(y)A^\top) \\
     &= \sigma^2 (X^\top X)^{-1}X^\top X(X^\top X)^{-1} \qquad (\text{since } D(y) = \sigma^2 I_n) \\
     &= \sigma^2 (X^\top X)^{-1}.
\end{aligned}
\]
Since the diagonal entries of D(b) are var(b_0), var(b_1), . . . , var(b_K), there exists some j ∈ {0, 1, . . . , K} such that var(b_j) is very large, which leads to low accuracy in estimating β_j.
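The effect can be illustrated numerically. Below is a minimal sketch (using numpy on simulated data; the names x1, x2 and the noise level eps are illustrative, with x2 nearly collinear with x1 when eps is small) showing that |X^⊤X| shrinks and the diagonal entries of (X^⊤X)^{-1} blow up as the columns become nearly collinear:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)

for eps in [1.0, 0.1, 0.01]:
    x2 = x1 + eps * rng.normal(size=n)            # x2 becomes nearly collinear with x1 as eps shrinks
    X = np.column_stack([np.ones(n), x1, x2])     # data matrix with an intercept column
    XtX = X.T @ X
    max_diag = np.diag(np.linalg.inv(XtX)).max()  # largest diagonal entry of (X'X)^{-1}
    print(f"eps={eps:5.2f}  det(X'X)={np.linalg.det(XtX):14.2f}  max diag of inverse={max_diag:9.4f}")
```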
We make the above claims precise in the following two propositions.

Proposition 1.1. If the independent variables satisfy (1.2), then X^⊤X has an eigenvalue close to 0 and hence |X^⊤X| ≈ 0.

Proof. Note that X^⊤X is symmetric positive semi-definite. There exists an orthogonal matrix U ∈ R^{(K+1)×(K+1)} (consisting of the eigenvectors of X^⊤X) such that X^⊤X = UΛU^⊤, where Λ = diag{λ_1, . . . , λ_{K+1}} contains the eigenvalues of X^⊤X. From (1.2), a^⊤X^⊤Xa = ‖Xa‖² ≈ 0, that is,
\[
z^\top \Lambda z \approx 0, \tag{1.5}
\]
where z := U^⊤a. Since U^⊤ is nonsingular and a ≠ 0, we have that z ≠ 0, that is, there exists some index j′ such that z_{j′} ≠ 0. From (1.5), we have that
\[
\lambda_{j'} z_{j'}^2 \le \sum_{j=1}^{K+1} \lambda_j z_j^2 = z^\top \Lambda z \approx 0,
\]
so λ_{j′} ≈ 0. Hence X^⊤X has an eigenvalue close to 0, and |X^⊤X|, being the product of the eigenvalues, is approximately 0.
Proposition 1.2. Let A ∈ R^{n×n} be a symmetric positive definite matrix. If |A| → ∞, then A has at least one very large diagonal entry.

Proof. Since A is symmetric positive definite, there exists an orthogonal matrix U = (u_{ij}) (consisting of eigenvectors of A) such that
\[
A = U\Lambda U^\top, \tag{1.6}
\]
where Λ = diag{λ_1, λ_2, . . . , λ_n} and λ_1 ≥ λ_2 ≥ · · · ≥ λ_n > 0 are the eigenvalues of A. Note that |A| = λ_1 λ_2 · · · λ_n, so λ_1 ≥ |A|^{1/n} → ∞ as |A| → ∞. It follows from (1.6) that
\[
a_{jj} = \sum_{i=1}^{n} \lambda_i u_{ji}^2 \ge \lambda_1 u_{j1}^2, \qquad j = 1, 2, \ldots, n.
\]
Since u_1 = (u_{11}, u_{21}, . . . , u_{n1})^⊤ is a unit vector, it is impossible that all components of u_1 are sufficiently small; that is, there exists some j such that u_{j1} is not too small (indeed, max_j u_{j1}^2 ≥ 1/n), which implies that
\[
a_{jj} \ge \lambda_1 u_{j1}^2 \to \infty.
\]
This completes the proof.
We next provide an example of bivariate regression showing that as the correlation between the independent variables increases, the variances of the estimators also increase. Consider linear regression of the dependent variable y on two independent variables x_1 and x_2. Assume that x_1, x_2 and y are all centered, so the intercept term in the regression equation is 0. The regression equation is given by
\[
\hat{y} = b_1 x_1 + b_2 x_2.
\]
Define
\[
L_{11} = \sum_{i=1}^{n} \bigl(x_1^{(i)}\bigr)^2, \qquad
L_{12} = \sum_{i=1}^{n} x_1^{(i)} x_2^{(i)}, \qquad
L_{22} = \sum_{i=1}^{n} \bigl(x_2^{(i)}\bigr)^2.
\]
With the centered data, the data matrix is X = (x_1, x_2) and
\[
X^\top X = \begin{pmatrix} L_{11} & L_{12} \\ L_{12} & L_{22} \end{pmatrix}.
\]
Let r_{12} = L_{12}/\sqrt{L_{11}L_{22}} denote the sample correlation coefficient of x_1 and x_2, so that L_{12}^2 = r_{12}^2 L_{11}L_{22}. This implies that
\[
(X^\top X)^{-1}
= \frac{1}{|X^\top X|}\begin{pmatrix} L_{22} & -L_{12} \\ -L_{12} & L_{11} \end{pmatrix}
= \frac{1}{L_{11}L_{22} - L_{12}^2}\begin{pmatrix} L_{22} & -L_{12} \\ -L_{12} & L_{11} \end{pmatrix}
= \frac{1}{L_{11}L_{22}\,(1 - r_{12}^2)}\begin{pmatrix} L_{22} & -L_{12} \\ -L_{12} & L_{11} \end{pmatrix}.
\]
Thus,
\[
\operatorname{var}(b_1) = \frac{\sigma^2}{(1 - r_{12}^2)L_{11}}, \qquad
\operatorname{var}(b_2) = \frac{\sigma^2}{(1 - r_{12}^2)L_{22}}.
\]
Now we see that as the correlation between x_1 and x_2 increases, the variances of b_1 and b_2 also increase. When x_1 and x_2 are perfectly correlated, that is, |r_{12}| = 1, the variances tend to +∞.
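As a quick numerical check, here is a minimal sketch (numpy, simulated data; the names x1, x2, L11, r12 mirror the notation above, and the correlation level 0.95 is arbitrary) verifying that the closed-form variance of b_1 agrees with the corresponding diagonal entry of σ²(X^⊤X)^{-1}:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 200, 1.0

x1 = rng.normal(size=n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)  # strongly correlated with x1
x1 = x1 - x1.mean()                                         # center the regressors
x2 = x2 - x2.mean()

X = np.column_stack([x1, x2])
L11, L12, L22 = (x1**2).sum(), (x1 * x2).sum(), (x2**2).sum()
r12 = L12 / np.sqrt(L11 * L22)

var_b1_formula = sigma2 / ((1 - r12**2) * L11)         # closed-form variance from the lecture
var_b1_matrix = sigma2 * np.linalg.inv(X.T @ X)[0, 0]  # diagonal entry of sigma^2 (X'X)^{-1}
print(var_b1_formula, var_b1_matrix)                   # the two values coincide
```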
2 Ridge Regression
When the independent variables exhibit multicollinearity, the smallest eigenvalue of X^⊤X tends to 0. Note that if λ is an eigenvalue of X^⊤X, then λ + κ is an eigenvalue of X^⊤X + κI. To avoid eigenvalues that are too small, we can add a matrix κI (κ > 0) to X^⊤X. Considering the different scales of the variables, we first standardize the data and still denote the data matrix by X. We call
\[
b(\kappa) = (X^\top X + \kappa I)^{-1} X^\top y
\]
the ridge estimate of β, where κ > 0 is the ridge parameter. It is easy to verify that the optimization problem for ridge regression is given by
\[
\min_{\beta}\; \|y - X\beta\|_2^2 + \kappa \|\beta\|_2^2.
\]
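One can check numerically that the closed form b(κ) solves this problem. A minimal sketch (numpy, simulated standardized data; κ = 0.5 is arbitrary) verifying that the gradient of the penalized objective vanishes at β = b(κ):

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, kappa = 50, 3, 0.5
X = rng.normal(size=(n, K))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardized data matrix, as in the lecture
y = rng.normal(size=n)

b_kappa = np.linalg.solve(X.T @ X + kappa * np.eye(K), X.T @ y)  # closed-form ridge solution b(kappa)
grad = 2 * X.T @ (X @ b_kappa - y) + 2 * kappa * b_kappa         # gradient of ||y - Xb||^2 + kappa ||b||^2
print(np.linalg.norm(grad))                                      # ~1e-13: b(kappa) is the minimizer
```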
The ridge estimator b(κ) is a biased estimator of β when κ ≠ 0, and its norm is strictly smaller than that of the least squares estimator b.

Proof. Note that
\[
b(\kappa) = (X^\top X + \kappa I)^{-1} X^\top X\, b
          = \bigl[I - \kappa (X^\top X + \kappa I)^{-1}\bigr] b,
\qquad
E\bigl(b(\kappa)\bigr) = \bigl[I - \kappa (X^\top X + \kappa I)^{-1}\bigr] \beta.
\]
From the above equation, we know that b(κ) is a biased estimator of β when κ ≠ 0. For any eigenvalue λ of X^⊤X, the corresponding eigenvalue of I − κ(X^⊤X + κI)^{-1} is given by 1 − κ/(λ + κ) < 1, which implies that ‖I − κ(X^⊤X + κI)^{-1}‖_2 < 1, and hence ‖b(κ)‖ < ‖b‖ for b ≠ 0.
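The shrinkage property ‖b(κ)‖ < ‖b‖ can also be seen numerically. A minimal sketch (numpy, simulated data with one nearly collinear column; the κ values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, K = 100, 3
X = rng.normal(size=(n, K))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=n)   # make two columns nearly collinear
X = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize the data, as in the lecture
y = X @ np.array([1.0, 2.0, 1.0]) + rng.normal(size=n)
y = y - y.mean()

b = np.linalg.solve(X.T @ X, X.T @ y)           # ordinary least squares estimate b
print(f"kappa=  0.0  ||b||        = {np.linalg.norm(b):.4f}")
for kappa in [0.1, 1.0, 10.0]:
    b_kappa = np.linalg.solve(X.T @ X + kappa * np.eye(K), X.T @ y)  # ridge estimate b(kappa)
    print(f"kappa={kappa:5.1f}  ||b(kappa)|| = {np.linalg.norm(b_kappa):.4f}")
# the norm of b(kappa) is strictly smaller than ||b|| and decreases as kappa grows
```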
3 Principal Component Analysis

Each sample x = (x_1, x_2, . . . , x_K)^⊤ can be regarded as a point in K-dimensional space. The directions of the PCA space represent the directions of maximum variance of the given data, as shown in Figure 1. The PCA space consists of K principal components (PCs). Each principal component accounts for a different amount of variance in its direction; the principal components are uncorrelated, and the first one points in the direction of maximum variance.
Performing a linear transformation on the vector x ∈ R^K, we obtain a new vector z ∈ R^K, i.e.,
\[
\begin{cases}
z_1 = a_{11} x_1 + a_{12} x_2 + \cdots + a_{1K} x_K, \\
z_2 = a_{21} x_1 + a_{22} x_2 + \cdots + a_{2K} x_K, \\
\;\;\vdots \\
z_K = a_{K1} x_1 + a_{K2} x_2 + \cdots + a_{KK} x_K,
\end{cases}
\]
that is, z = Ax, where A = (a_{ij}) ∈ R^{K×K} is the transformation matrix.
Lemma 3.1. Let a ∈ R^K be a constant vector and let Σ be the covariance matrix of the random vector x ∈ R^K. Then var(a^⊤x) = a^⊤Σa.
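A short verification of the lemma (writing μ := E x):
\[
\operatorname{var}(a^\top x)
= E\bigl[(a^\top (x - \mu))^2\bigr]
= E\bigl[a^\top (x - \mu)(x - \mu)^\top a\bigr]
= a^\top E\bigl[(x - \mu)(x - \mu)^\top\bigr] a
= a^\top \Sigma a.
\]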
According to Lemma 3.1, if we do not impose any constraint on the transformation matrix A, the variance of z_i can be arbitrarily large. We shall assume that the following three principles hold:
(1) A^⊤A = I;
(2) z_i and z_j are uncorrelated for i ≠ j;
(3) z_1 attains the maximum variance among all A satisfying principle (1), and z_i attains the ith largest variance, i ∈ N_K.
Based on the above three principles, we call z_i the ith principal component of the original variable x.
We next explain the relationship between maximum variance and the principal components. Let Σ be the covariance matrix of x ∈ R^K and let λ_1 ≥ λ_2 ≥ · · · ≥ λ_K ≥ 0 be the eigenvalues of Σ. The spectral decomposition of Σ is given by
\[
\Sigma = U \Lambda U^\top = \sum_{i=1}^{K} \lambda_i u_i u_i^\top,
\]
where u_i is the unit eigenvector associated with λ_i and U = (u_1, u_2, . . . , u_K). Since {u_1, . . . , u_K} is an orthonormal basis of R^K, the first row of A can be expanded as A_{1,:}^⊤ = Σ_{i=1}^{K} α_i u_i, and by Lemma 3.1,
\[
\operatorname{var}(z_1) = A_{1,:}\,\Sigma\,A_{1,:}^\top = \sum_{i=1}^{K} \lambda_i \alpha_i^2 \le \lambda_1 \sum_{i=1}^{K} \alpha_i^2.
\]
By the fact that ‖A_{1,:}^⊤‖ = 1, we are able to show that Σ_{i=1}^{K} α_i^2 = 1, which is presented as follows:
\[
\sum_{i=1}^{K} \alpha_i^2
= \Bigl\langle \sum_{i=1}^{K} \alpha_i u_i, \; \sum_{i=1}^{K} \alpha_i u_i \Bigr\rangle
= \|A_{1,:}^\top\|^2 = 1.
\]
Hence var(z_1) ≤ λ_1.
Moreover, it is easy to verify that var(z_1) = λ_1 when A_{1,:} = u_1^⊤. Therefore, z_1 = u_1^⊤x has the maximum variance λ_1.
In fact, we are able to prove that the ith principal component of x is given by
\[
z_i = u_i^\top x, \qquad i = 1, 2, \ldots, K,
\]
and we have
\[
\operatorname{var}(z_i) = \lambda_i, \qquad
\operatorname{cov}(z_i, z_j) = u_i^\top \Sigma u_j = 0, \quad i \ne j.
\]
Steps of PCA:
1. Given a data matrix X = (x^{(1)}, x^{(2)}, . . . , x^{(T)}), where T represents the total number of samples and x^{(i)} ∈ R^K represents the ith sample.
2. Compute the mean of all samples: μ = (1/T) Σ_{i=1}^{T} x^{(i)}.
3. Subtract the mean from every sample to obtain the centered data matrix D = (x^{(1)} − μ, x^{(2)} − μ, . . . , x^{(T)} − μ).
4. Compute the covariance matrix of the centered data, Σ = (1/T) D D^⊤, and its eigenvalues and eigenvectors; let W = (u_1, . . . , u_k) consist of the eigenvectors corresponding to the k largest eigenvalues.
5. Project the centered data onto the selected eigenvectors to obtain the principal component scores
\[
Z = W^\top D.
\]
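The steps above can be sketched in a few lines of numpy (a minimal illustration with simulated data; the names X, D, W, Z follow the notation of the steps, and k is chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(3)
K, T, k = 5, 200, 2                       # dimension, sample size, number of PCs kept

X = rng.normal(size=(K, T))               # data matrix: each column is one sample x^(i)
mu = X.mean(axis=1, keepdims=True)        # step 2: mean of all samples
D = X - mu                                # step 3: centered data matrix

Sigma = (D @ D.T) / T                     # step 4: covariance matrix of the centered data
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]         # indices of eigenvalues in descending order
W = eigvecs[:, order[:k]]                 # eigenvectors of the k largest eigenvalues

Z = W.T @ D                               # step 5: principal component scores (k x T)
print(Z.shape)                            # (2, 200)
print(np.round(np.cov(Z, bias=True), 3))  # approximately diag(lambda_1, ..., lambda_k)
```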