
Mathematical Modeling

Lecture 8. Solving the Problem of Multicollinearity


Yizun Lin
Department of Mathematics, Jinan University
May 29, 2024

1 Multicollinearity
A basic assumption in the multiple linear regression model is that the data matrix X in
Lecture 5 has full column rank (K + 1), that is, the column vectors of X are linearly
independent. Let 0_n denote the vector of n zeros. If there exists a nonzero vector
a = (a_0, a_1, . . . , a_K)^⊤ such that
        Xa = 0_T,                                                     (1.1)
then we say that the independent variables X_1, X_2, . . . , X_K exhibit perfect multicollinearity.
In practical problems, perfect multicollinearity is rare. However, it is common that (1.1) holds
approximately, that is,
        Xa ≈ 0_T.                                                     (1.2)
When the independent variables X_1, X_2, . . . , X_K satisfy (1.2), we say that they exhibit
collinearity.
Suppose that the regression model

        y = β_0 + β_1 X_1 + · · · + β_K X_K + ε

exhibits perfect multicollinearity. Then the column rank of the data matrix X in Lecture 5 is
strictly smaller than K + 1. In this case, |X^⊤X| = 0, and the inverse of X^⊤X does not exist.
For the case that (1.2) holds, although rank(X) = K + 1, the determinant of X^⊤X will be
very small, that is, |X^⊤X| ≈ 0, which implies that (X^⊤X)^{-1} must have some very large
diagonal entries. Note that the dispersion matrix (covariance matrix) of b = (X^⊤X)^{-1} X^⊤ y
is given by D(b) = σ^2 (X^⊤X)^{-1}, which is shown as follows:

        D(b) = D((X^⊤X)^{-1} X^⊤ y)
             = (X^⊤X)^{-1} X^⊤ D(y) X (X^⊤X)^{-1}        (using the fact D(Ay) = A D(y) A^⊤)
             = σ^2 (X^⊤X)^{-1} X^⊤ X (X^⊤X)^{-1}
             = σ^2 (X^⊤X)^{-1}.

Since the diagonal entries of D(b) are var(b_0), var(b_1), . . . , var(b_K), there exists some
j ∈ {0, 1, . . . , K} such that var(b_j) is very large, which leads to low accuracy in estimating β_j.
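As a quick numerical illustration (a minimal NumPy sketch with made-up data, not from the lecture), nearly collinear columns make the diagonal entries of σ^2 (X^⊤X)^{-1}, i.e. the variances of the b_j, blow up:

```python
import numpy as np

rng = np.random.default_rng(0)
T, sigma2 = 200, 1.0

x1 = rng.normal(size=T)
x2 = x1 + 0.01 * rng.normal(size=T)           # nearly a copy of x1
X = np.column_stack([np.ones(T), x1, x2])     # data matrix with intercept column

cov_b = sigma2 * np.linalg.inv(X.T @ X)       # D(b) = sigma^2 (X^T X)^{-1}
print(np.diag(cov_b))                         # var(b_0) is small; var(b_1), var(b_2) are huge
```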

Proposition 1.1. Let X ∈ R^{m×n}. Then the following are equivalent:

(i) The column vectors of X exhibit multicollinearity.
(ii) X^⊤X has an eigenvalue that is close to 0.
(iii) |X^⊤X| ≈ 0.

Proof. Note that X^⊤X is symmetric positive semi-definite. There exists an orthogonal
matrix U ∈ R^{n×n} (whose columns are eigenvectors of X^⊤X) such that

        X^⊤X = UΛU^⊤,                                                (1.3)

where Λ = diag{λ_1, λ_2, . . . , λ_n} and λ_1 ≥ λ_2 ≥ · · · ≥ λ_n ≥ 0 are the eigenvalues of X^⊤X.


(i) ⇔ (ii): Suppose that (i) holds. Then there exists a nonzero vector a ∈ R^n such that
Xa ≈ 0_m, which implies that
        a^⊤X^⊤Xa ≈ 0.                                                (1.4)
Substituting (1.3) into (1.4) yields that

        z^⊤Λz ≈ 0,                                                   (1.5)

where z := U^⊤a. Since U^⊤ is nonsingular and a ≠ 0_n, we have z ≠ 0_n, that is, there
exists some j′ ∈ N_n such that z_{j′} ≠ 0. From (1.5), we have that

        λ_{j′} z_{j′}^2 ≤ ∑_{j=1}^{n} λ_j z_j^2 = z^⊤Λz ≈ 0,

which implies that λ_{j′} ≈ 0, and hence item (ii) follows.


Conversely, suppose that λ ≈ 0 is an eigenvalue of X^⊤X, with corresponding unit eigenvector
u ∈ R^n (‖u‖ = 1). Then

        X^⊤Xu = λu ≈ 0_n,

which implies that

        ‖Xu‖^2 = u^⊤X^⊤Xu = λ ≈ 0.

Hence Xu ≈ 0_m. Since u ≠ 0_n, we obtain the multicollinearity of the column vectors of X.
(ii) ⇔ (iii): It is easy to see from (1.3) that |X^⊤X| = λ_1 λ_2 · · · λ_n. Hence item (iii)
follows from item (ii) immediately. Conversely, suppose that item (iii) holds. Then
λ_1 λ_2 · · · λ_n ≈ 0, so there must be some j″ ∈ N_n such that λ_{j″} ≈ 0, that is, item (ii) holds.
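The equivalences above can be checked numerically. The sketch below (made-up data) shows the smallest eigenvalue and the determinant of X^⊤X both collapsing toward 0 as the two columns become more nearly dependent:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 100
x1 = rng.normal(size=m)

for eps in (1.0, 0.1, 0.01):                  # smaller eps => stronger collinearity
    X = np.column_stack([x1, x1 + eps * rng.normal(size=m)])
    eigvals = np.linalg.eigvalsh(X.T @ X)     # eigenvalues of the symmetric matrix X^T X
    print(eps, eigvals.min(), np.linalg.det(X.T @ X))
```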

Proposition 1.2. Let A ∈ R^{n×n} be a symmetric positive definite matrix. If |A| → ∞, then
A has at least one very large diagonal entry.

Proof. Since A is symmetric positive definite, there exists an orthogonal matrix U (whose
columns are eigenvectors of A) such that

        A = UΛU^⊤,                                                   (1.6)

where Λ = diag{λ_1, λ_2, . . . , λ_n} and λ_1 ≥ λ_2 ≥ · · · ≥ λ_n > 0 are the eigenvalues of A.
Note that |A| = λ_1 λ_2 · · · λ_n ≤ λ_1^n, so |A| → ∞ forces λ_1 → ∞. It follows from (1.6) that

        a_{jj} = λ_1 u_{j1}^2 + λ_2 u_{j2}^2 + · · · + λ_n u_{jn}^2,   j ∈ N_n.

Since u_1 = (u_{11}, u_{21}, . . . , u_{n1})^⊤ is a unit vector, it is impossible that all of its
components are sufficiently small; indeed, there exists some j ∈ N_n with u_{j1}^2 ≥ 1/n, which
implies that

        a_{jj} ≥ λ_1 u_{j1}^2 ≥ λ_1 / n → ∞.

This completes the proof.

We next provide an example of bivariate regression showing that as the correlation between the
independent variables increases, the variances of the estimators also increase.
Consider linear regression of the dependent variable y on two independent variables x_1
and x_2. Assume that x_1, x_2 and y are all centered, so that the intercept term in the regression
equation is 0. The regression equation is given by

        ŷ = b_1 x_1 + b_2 x_2.

Define

        L_11 = ∑_{i=1}^{n} (x_1^{(i)})^2,    L_12 = ∑_{i=1}^{n} x_1^{(i)} x_2^{(i)},    L_22 = ∑_{i=1}^{n} (x_2^{(i)})^2.

Then the correlation coefficient of x_1 and x_2 is given by

        r_12 = L_12 / √(L_11 L_22).
In addition,

        X^⊤X = [ L_11  L_12 ]
               [ L_12  L_22 ],

which implies that

        (X^⊤X)^{-1} = (1/|X^⊤X|) [  L_22  -L_12 ]
                                  [ -L_12   L_11 ]

                    = (1/(L_11 L_22 - L_12^2)) [  L_22  -L_12 ]
                                               [ -L_12   L_11 ]

                    = (1/(L_11 L_22 (1 - r_12^2))) [  L_22  -L_12 ]
                                                   [ -L_12   L_11 ].

Thus,

        var(b_1) = σ^2 / ((1 - r_12^2) L_11),        var(b_2) = σ^2 / ((1 - r_12^2) L_22).
Now we see that as the correlation between x_1 and x_2 increases, the variances of b_1 and b_2
also increase. When x_1 and x_2 are perfectly correlated, that is, |r_12| = 1, the variances
tend to +∞.
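The effect is easy to reproduce by simulation. The following NumPy sketch (sample size, noise level and correlation values are chosen arbitrarily for illustration) estimates var(b_1) from repeated least squares fits as the correlation between x_1 and x_2 increases:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, reps = 100, 1.0, 2000

for rho in (0.0, 0.9, 0.99):
    b1_samples = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)   # corr(x1, x2) ≈ rho
        y = 1.0 * x1 + 1.0 * x2 + sigma * rng.normal(size=n)       # true beta1 = beta2 = 1
        X = np.column_stack([x1, x2])
        b = np.linalg.lstsq(X, y, rcond=None)[0]                   # least squares estimate
        b1_samples.append(b[0])
    print(rho, np.var(b1_samples))                                 # var(b1) grows as rho -> 1
```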

2 Ridge Regression
When the independent variables exhibit multicollinearity, the smallest eigenvalue of X^⊤X
tends to 0. Note that if λ is an eigenvalue of X^⊤X, then λ + κ is an eigenvalue of
X^⊤X + κI. To avoid eigenvalues that are too small, we can add a matrix κI (κ > 0) to X^⊤X.
Considering the different scales of the variables, we first standardize the data and still denote
the data matrix by X. We call

        b(κ) = (X^⊤X + κI)^{-1} X^⊤ y                                (2.1)

the ridge estimate of β, and κ the ridge parameter.
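A minimal NumPy sketch of the ridge estimate (2.1); the data and the value of κ below are made up for illustration, and the helper name ridge_estimate is ours, not from the lecture:

```python
import numpy as np

def ridge_estimate(X, y, kappa):
    """Ridge estimate b(kappa) = (X^T X + kappa I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + kappa * np.eye(p), X.T @ y)

# made-up data: two nearly collinear columns, standardized as in the text
rng = np.random.default_rng(3)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize the data
y = X @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=50)

print(ridge_estimate(X, y, kappa=1.0))        # numerically stabler than (X^T X)^{-1} X^T y here
```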


Note that the optimization problem for linear regression is given by

        min_{β ∈ R^{K+1}} ‖Xβ − y‖^2.

It is easy to verify that the optimization problem for ridge regression is given by

        min_{β ∈ R^{K+1}} ‖Xβ − y‖^2 + κ‖β‖^2.

Indeed, setting the gradient of the latter objective to zero gives (X^⊤X + κI)β = X^⊤y, whose
solution is exactly (2.1).
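As a numerical sanity check (again with made-up data), the closed-form estimate (2.1) makes the gradient of the penalized objective vanish:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
kappa = 0.5

b = np.linalg.solve(X.T @ X + kappa * np.eye(3), X.T @ y)   # closed form (2.1)
grad = X.T @ (X @ b - y) + kappa * b                        # gradient of 0.5*(||Xb-y||^2 + kappa*||b||^2)
print(np.allclose(grad, 0))                                 # True: b minimizes the penalized objective
```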

Ridge regression is a supplement to least squares regression. It sacrifices unbiasedness in
exchange for higher numerical stability, and thus obtains higher computational accuracy. We next
consider some properties of ridge regression.
Property 1. b(κ) is a biased estimator of β.

Proof.

        E(b(κ)) = E((X^⊤X + κI)^{-1} X^⊤ y)
                = (X^⊤X + κI)^{-1} X^⊤ E(y)
                = (X^⊤X + κI)^{-1} X^⊤ X β.

From the above equation, we know that b(κ) is a biased estimator of β when κ ≠ 0.
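A small Monte Carlo sketch of this bias (the design matrix, β and κ below are arbitrary): averaging b(κ) over many simulated responses recovers (X^⊤X + κI)^{-1}X^⊤Xβ rather than β itself:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 2))
beta = np.array([2.0, -1.0])
kappa, reps = 5.0, 5000

M = np.linalg.inv(X.T @ X + kappa * np.eye(2))
estimates = []
for _ in range(reps):
    y = X @ beta + rng.normal(size=40)          # new noise each replication
    estimates.append(M @ X.T @ y)               # ridge estimate b(kappa)

print(np.mean(estimates, axis=0))               # ≈ E(b(kappa)), noticeably different from beta
print(M @ X.T @ X @ beta)                       # theoretical expectation (X^T X + kI)^{-1} X^T X beta
```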

Property 2. b(κ) is a linear transform of the least squares estimator b.

Proof. By the definitions of b(κ) and b, we have that

        b(κ) = (X^⊤X + κI)^{-1} X^⊤ y
             = (X^⊤X + κI)^{-1} X^⊤X (X^⊤X)^{-1} X^⊤ y
             = (X^⊤X + κI)^{-1} X^⊤X b,                              (2.2)

which completes the proof.
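A quick numerical check of identity (2.2), with made-up data:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(60, 3))
y = rng.normal(size=60)
kappa = 2.0

A = np.linalg.inv(X.T @ X + kappa * np.eye(3))
b_ls = np.linalg.solve(X.T @ X, X.T @ y)        # least squares estimator b
b_ridge = A @ X.T @ y                           # definition (2.1)
print(np.allclose(b_ridge, A @ X.T @ X @ b_ls)) # True: identity (2.2)
```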

Property 3. For any κ > 0 and ‖b‖ ≠ 0, we have ‖b(κ)‖ < ‖b‖.

Proof. It follows from (2.2) that

        ‖b(κ)‖ = ‖(X^⊤X + κI)^{-1} X^⊤X b‖
               = ‖(X^⊤X + κI)^{-1} (X^⊤X + κI − κI) b‖
               = ‖(I − κ(X^⊤X + κI)^{-1}) b‖
               ≤ ‖I − κ(X^⊤X + κI)^{-1}‖_2 ‖b‖.

For any eigenvalue λ ≥ 0 of X^⊤X, the corresponding eigenvalue of I − κ(X^⊤X + κI)^{-1} is
1 − κ/(λ + κ) = λ/(λ + κ) ∈ [0, 1). Since this matrix is symmetric, its spectral norm equals its
largest eigenvalue, so ‖I − κ(X^⊤X + κI)^{-1}‖_2 < 1, and hence ‖b(κ)‖ < ‖b‖ for ‖b‖ ≠ 0.
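Numerically (made-up data again), ‖b(κ)‖ indeed stays below ‖b‖ for every κ > 0:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(80, 4))
y = rng.normal(size=80)

b_ls = np.linalg.solve(X.T @ X, X.T @ y)        # least squares estimate b
print(0.0, np.linalg.norm(b_ls))
for kappa in (0.1, 1.0, 10.0):
    b_k = np.linalg.solve(X.T @ X + kappa * np.eye(4), X.T @ y)
    print(kappa, np.linalg.norm(b_k))           # each norm is smaller than ||b_ls|| (Property 3)
```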

3 Principal Component Regression


Note that the definition of the matrix X in this section is different from that in Lecture 7. The
goal of the PCA technique is to find a lower-dimensional space, the PCA space W, that is used
to transform the data X = (x^(1), x^(2), . . . , x^(T)) from a higher-dimensional space (R^K) to a
lower-dimensional space (R^k), where T represents the total number of samples or observations
and x^(i) ∈ R^K represents the ith sample, pattern, or observation. Each sample is represented as
a point in K-dimensional space. The directions of the PCA space are the directions of maximum
variance of the given data, as shown in Figure 1. As shown in the figure, the PCA space consists
of a number of principal components (PCs); each principal component has a different robustness
according to the amount of variance in its direction. The PCA space consists of k principal
components, which are uncorrelated and represent the directions of maximum variance.

Performing a linear transformation on the vector x ∈ R^K, we obtain a new vector z ∈ R^K, i.e.,

        z_1 = a_11 x_1 + a_12 x_2 + · · · + a_1K x_K
        z_2 = a_21 x_1 + a_22 x_2 + · · · + a_2K x_K
        ...
        z_K = a_K1 x_1 + a_K2 x_2 + · · · + a_KK x_K,

that is, z = Ax with A = (a_ij) ∈ R^{K×K}.

Lemma 3.1. Let a ∈ R^K be a constant vector and let Σ be the covariance matrix of x ∈ R^K.
Then var(a^⊤x) = a^⊤Σa.

According to Lemma 3.1, if we do not impose any constraint on the transform matrix A, the
variance of z_i can be arbitrarily large. We shall assume that the following three principles
hold:
(1) A^⊤A = I;
(2) z_i and z_j are uncorrelated for i ≠ j;
(3) z_1 has the maximum variance among all A satisfying principle (1), and more generally z_i
has the ith largest variance, i ∈ N_K.
Based on the above three principles, we call z_i the ith principal component of the original
variable x.
We next explain the relationship between maximum variance and the principal components.
Let Σ be the covariance matrix of x ∈ R^K, and let λ_1 ≥ λ_2 ≥ · · · ≥ λ_K ≥ 0 be the
eigenvalues of Σ. The spectral decomposition of Σ is given by

        Σ = UΛU^⊤ = ∑_{i=1}^{K} λ_i u_i u_i^⊤,

where Λ = diag{λ_1, λ_2, . . . , λ_K} and u_1, u_2, . . . , u_K are the corresponding (orthonormal)
eigenvectors. Of course, the column vector A_{1,:}^⊤ (the first row of A, written as a column
vector) can be expressed as a linear combination of the orthonormal basis u_1, u_2, . . . , u_K,
that is,

        A_{1,:}^⊤ = ∑_{j=1}^{K} α_j u_j.

By the fact ‖A_{1,:}^⊤‖ = 1, we are able to show that ∑_{i=1}^{K} α_i^2 = 1, which is presented
as follows:

        ∑_{i=1}^{K} α_i^2 = ⟨ ∑_{i=1}^{K} α_i u_i , ∑_{i=1}^{K} α_i u_i ⟩ = ‖A_{1,:}^⊤‖^2 = 1.

Now we have that

        var(z_1) = A_{1,:} Σ A_{1,:}^⊤
                 = ∑_{i=1}^{K} λ_i A_{1,:} u_i u_i^⊤ A_{1,:}^⊤ = ∑_{i=1}^{K} λ_i (A_{1,:} u_i)^2
                 = ∑_{i=1}^{K} λ_i ( ∑_{j=1}^{K} α_j u_j^⊤ u_i )^2 = ∑_{i=1}^{K} λ_i α_i^2
                 ≤ λ_1 ∑_{i=1}^{K} α_i^2 = λ_1.

Moreover, it is easy to verify that var(z_1) = λ_1 when A_{1,:} = u_1^⊤. Therefore, z_1 = u_1^⊤ x
has maximum variance λ_1.
In fact, we are able to prove that the ith principal component of x is given by

        z_i = u_i^⊤ x,   i = 1, 2, . . . , K,

and we have

        var(z_i) = λ_i,
        cov(z_i, z_j) = u_i^⊤ Σ u_j = 0,   i ≠ j.

In order to eliminate the influence of the different scales of the variables, we usually center
and normalize the data.

Steps of PCA:

1. Given a data matrix X = (x^(1), x^(2), . . . , x^(T)), where T represents the total number of
samples and x^(i) ∈ R^K represents the ith sample.
2. Compute the mean of all samples: μ = (1/T) ∑_{i=1}^{T} x^(i).
3. Subtract the mean from all samples as follows:

        D = (d^(1), d^(2), . . . , d^(T)),   d^(i) = x^(i) − μ,   i ∈ N_T.

4. Compute the covariance matrix: C = (1/(T − 1)) D D^⊤.
5. Compute the eigenvectors V and eigenvalues λ of the covariance matrix C.
6. Sort the eigenvectors according to their corresponding eigenvalues.
7. Select the eigenvectors that have the largest eigenvalues, W = (v_1, v_2, . . . , v_k). The
selected eigenvectors W represent the projection space of PCA.
8. Project all samples onto the lower-dimensional PCA space as follows:

        Z = W^⊤ D.
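A minimal NumPy implementation of these steps (a sketch; the data, K = 5 and k = 2 below are made up, and the function name pca is ours):

```python
import numpy as np

def pca(X, k):
    """PCA following the steps above. X has shape (K, T): one column per sample."""
    mu = X.mean(axis=1, keepdims=True)            # step 2: mean of all samples
    D = X - mu                                    # step 3: subtract the mean
    C = (D @ D.T) / (X.shape[1] - 1)              # step 4: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)          # step 5: eigenvalues/eigenvectors (ascending)
    order = np.argsort(eigvals)[::-1]             # step 6: sort by decreasing eigenvalue
    W = eigvecs[:, order[:k]]                     # step 7: top-k eigenvectors
    Z = W.T @ D                                   # step 8: project onto the PCA space
    return Z, W, eigvals[order]

# illustrative use: T = 500 samples in R^5, reduced to k = 2 components
rng = np.random.default_rng(9)
X = rng.normal(size=(5, 500))
X[3] = 0.9 * X[0] + 0.1 * rng.normal(size=500)    # make two coordinates strongly dependent
Z, W, eigvals = pca(X, k=2)
print(Z.shape, eigvals.round(2))                  # (2, 500) and the sorted eigenvalues
```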
