
Principal Components Analysis

This technique reduces the dimensionality of a data set containing a large number
of (interrelated) variables. It was initially developed by Pearson (1901), although
it was not until 1933, with Hotelling's work, that it obtained its algebraic formulation.
To this end, the original variables are transformed into a new set of variables (called
principal components) which are uncorrelated, have orthogonal coefficient vectors, and
can be ordered according to the proportion of the variance of the original variables that they explain.

First Principal Component


Definition. Let $X = [x_1, \cdots, x_p]^T$ be a multidimensional stochastic variable which
has the variance-covariance (or dispersion) matrix $\Sigma$. Without loss of generality
we can assume it has mean zero, $\mu = 0$. We define the first principal axis
$\alpha_1$, with $\alpha_1^T\alpha_1 = 1$ (normed coefficients), as the coefficient vector for which the
linear combination of the original variables $\alpha_1^T X$ has the largest variance. The random
variable $Y_1 = \alpha_1^T X$ is called the first principal component.

Derivation. We want to find $\alpha_1$ that maximizes $\mathrm{Var}(\alpha_1^T X)$. We know that

$$\mathrm{Var}(\alpha_1^T X) = \alpha_1^T D(X)\,\alpha_1 = \alpha_1^T \Sigma\,\alpha_1,$$

which has no maximum, as $\alpha_1$ is not bounded; therefore we impose the normalization
constraint $\alpha_1^T\alpha_1 = 1$. So, the optimization problem becomes

$$\max_{\alpha_1}\; \alpha_1^T \Sigma\,\alpha_1, \quad \text{s.t. } \alpha_1^T\alpha_1 = 1.$$
Using Lagrange multipliers

$$\max_{\alpha_1}\; \alpha_1^T \Sigma\,\alpha_1 - \lambda(\alpha_1^T\alpha_1 - 1).$$

Differentiating with respect to $\alpha_1$ and setting the derivative to zero,

$$(\Sigma + \Sigma^T)\alpha_1 - \lambda(I + I^T)\alpha_1 = 2\Sigma\alpha_1 - 2\lambda\alpha_1 = 0 \;\Rightarrow\; \Sigma\alpha_1 = \lambda\alpha_1.$$

Then, $\alpha_1$ is an eigenvector of $\Sigma$ and $\lambda$ an eigenvalue. But which eigenvalue?

$$\max_{\alpha_1}\; \alpha_1^T \Sigma\,\alpha_1 = \max_{\alpha_1^T\alpha_1 = 1}\; \alpha_1^T \lambda\,\alpha_1 = \max\; \lambda.$$

Hence $\lambda$ is the largest eigenvalue of $\Sigma$.
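As a quick numerical check (a hypothetical NumPy sketch, not part of the original notes; the simulated data and variable names are ours), the first principal axis can be obtained as the eigenvector of the sample covariance matrix associated with its largest eigenvalue:

```python
import numpy as np

# Hypothetical example: simulated correlated data, centered so that mu = 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.2, 0.3]])
X = X - X.mean(axis=0)

Sigma = np.cov(X, rowvar=False)            # sample dispersion matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
alpha1 = eigvecs[:, -1]                    # first principal axis (largest eigenvalue)
Y1 = X @ alpha1                            # scores of the first principal component

# Var(Y1) equals the largest eigenvalue of Sigma
print(np.isclose(Y1.var(ddof=1), eigvals[-1]))
```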

The m-th principal component


Definition. The m-th principal axis $\alpha_m$, with $\alpha_m^T\alpha_m = 1$ (normed coefficients),
is defined as the coefficient vector such that the random variable $Y_m = \alpha_m^T X$
has maximum variance and $\mathrm{cov}(Y_m, Y_k) = 0$, $\forall k = 1, \cdots, m-1$. The random
variable $Y_m$ is called the m-th principal component.

Derivation. Let’s start with the second principal component:

$$\max_{\alpha_2}\; \mathrm{Var}(\alpha_2^T X), \quad \text{s.t. } \alpha_2^T\alpha_2 = 1, \;\; \mathrm{cov}(\alpha_1^T X, \alpha_2^T X) = 0,$$

which is equivalent to

$$\max_{\alpha_2}\; \alpha_2^T \Sigma\,\alpha_2, \quad \text{s.t. } \alpha_2^T\alpha_2 = 1, \;\; \alpha_1^T\alpha_2 = 0.$$
In fact, $\mathrm{cov}(\alpha_2^T X, \alpha_1^T X) = \alpha_2^T \Sigma\,\alpha_1 = \alpha_2^T \lambda\,\alpha_1 = \lambda\,\alpha_2^T\alpha_1 = 0 \;\Leftrightarrow\; \alpha_2^T\alpha_1 = 0$.
Using Lagrange multipliers,

$$\max_{\alpha_2}\; \alpha_2^T \Sigma\,\alpha_2 - \lambda(\alpha_2^T\alpha_2 - 1) - \phi(\alpha_1^T\alpha_2 - 0).$$

Now, differentiating with respect to $\alpha_2$,

$$2\Sigma\alpha_2 - 2\lambda\alpha_2 - \phi\alpha_1 = 0,$$

multiplying on the left by $\alpha_1^T$,

$$2\underbrace{\alpha_1^T\Sigma\alpha_2}_{0} - 2\lambda\underbrace{\alpha_1^T\alpha_2}_{0} - \phi\underbrace{\alpha_1^T\alpha_1}_{1} = 0 \;\Rightarrow\; \phi = 0.$$

Then $\Sigma\alpha_2 = \lambda\alpha_2$; in consequence $\alpha_2$ is an eigenvector of $\Sigma$ and $\lambda$ an eigenvalue.
As before, $\lambda = \max_{\alpha_2}\alpha_2^T\Sigma\alpha_2$, and, assuming distinct eigenvalues, it has to be
the second largest one: if $\alpha_2 = \alpha_1$, then $\alpha_2^T\alpha_1 \neq 0$, violating the orthogonality constraint.

The m-th principal component


Now we can generalize to the m-th principal component:

$$\max_{\alpha_m}\; \alpha_m^T \Sigma\,\alpha_m, \quad \text{s.t. } \alpha_m^T\alpha_m = 1, \;\; \alpha_i^T\alpha_m = 0, \;\forall i = 1, \cdots, m-1.$$
Using Lagrange multipliers,

$$\max_{\alpha_m}\; \alpha_m^T \Sigma\,\alpha_m - \lambda(\alpha_m^T\alpha_m - 1) - \sum_{i=1}^{m-1}\phi_i(\alpha_m^T\alpha_i - 0).$$

Now, differentiating with respect to $\alpha_m$,


$$2\Sigma\alpha_m - 2\lambda\alpha_m - \sum_{i=1}^{m-1}\phi_i\alpha_i = 0.$$

Multiplying on the left by $\alpha_j^T$, $j = 1, \cdots, m-1$, then

$$2\underbrace{\alpha_j^T\Sigma\alpha_m}_{0} - 2\lambda\underbrace{\alpha_j^T\alpha_m}_{0} - \underbrace{\sum_{i=1}^{m-1}\phi_i\,\alpha_j^T\alpha_i}_{\phi_j} = 0 \;\Rightarrow\; \phi_j = 0.$$

As before, $\Sigma\alpha_m = \lambda\alpha_m$, so $\lambda$ is an eigenvalue associated with the eigenvector $\alpha_m$.
Some important observations:
Let $\lambda_1, \cdots, \lambda_p$ be the eigenvalues of $\Sigma$, and $\alpha_1, \cdots, \alpha_p$ its eigenvectors; then the
principal components are

$$Y_1 = \alpha_1^T X, \quad \cdots, \quad Y_m = \alpha_m^T X, \quad \cdots, \quad Y_p = \alpha_p^T X,$$

and $\mathrm{Var}(Y_i) = \mathrm{Var}(\alpha_i^T X) = \alpha_i^T\Sigma\,\alpha_i = \lambda_i\,\alpha_i^T\alpha_i = \lambda_i$. Defining $P$ as the matrix
whose columns are the eigenvectors, $P = [\alpha_1, \cdots, \alpha_p]$, we can write

$$Y = [Y_1, \cdots, Y_p]^T = [\alpha_1^T X, \cdots, \alpha_p^T X]^T = P^T X.$$

Also, we saw that

$$\Sigma\alpha_i = \lambda_i\alpha_i, \quad \forall\, i = 1, \cdots, p,$$

so, defining $\Lambda = \mathrm{diag}(\lambda_1, \cdots, \lambda_p)$, we can write $\Sigma P = P\Lambda$, which, by
observing that $P^{-1} = P^T$, is equivalent to $P^T\Sigma P = \Lambda$. On the basis of the
above results we have that

$$D(Y) = D(P^T X) = P^T D(X)\,P = P^T\Sigma P = \Lambda.$$
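Continuing the hypothetical NumPy sketch from above (names are ours, not from the notes), one can verify numerically that the covariance matrix of the components $Y = P^T X$ is indeed $\Lambda$:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # arbitrary correlated data
X = X - X.mean(axis=0)

Sigma = np.cov(X, rowvar=False)
lam, P = np.linalg.eigh(Sigma)      # columns of P are the orthonormal alpha_i
Y = X @ P                           # each row holds the p component scores

print(np.allclose(np.cov(Y, rowvar=False), np.diag(lam)))  # D(Y) = Lambda
print(np.allclose(P.T @ Sigma @ P, np.diag(lam)))          # P^T Sigma P = Lambda
```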

Theorem. The sum of the variances of the original variables is equal to the sum
of the variances of the principal components.
Proof.

$$\sum_{i=1}^{p}\mathrm{Var}(X_i) = \mathrm{tr}(\Sigma) = \mathrm{tr}(P\Lambda P^T) = \mathrm{tr}(P^T P\Lambda) = \mathrm{tr}(\Lambda) = \sum_{i=1}^{p}\lambda_i = \sum_{i=1}^{p}\mathrm{Var}(Y_i),$$

where we use that the trace is invariant under cyclic permutations.

Consequence: The proportion of the variance explained by the i-th principal
component is given by

$$\frac{\lambda_i}{\lambda_1 + \cdots + \lambda_p},$$

and the proportion explained by the first k principal components is given by

$$\frac{\lambda_1 + \cdots + \lambda_k}{\lambda_1 + \cdots + \lambda_p}, \quad k \le p.$$
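In code these proportions are just the normalized eigenvalues; below is a hypothetical snippet with made-up eigenvalues, only to illustrate the formulas:

```python
import numpy as np

lam = np.array([4.2, 1.5, 0.8, 0.5])   # illustrative eigenvalues, sorted decreasingly

explained = lam / lam.sum()            # lambda_i / (lambda_1 + ... + lambda_p)
cumulative = np.cumsum(explained)      # proportion explained by the first k components
print(explained)
print(cumulative)                      # e.g. pick k where this exceeds a chosen threshold
```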

The above theorem establishes a very useful relationship between the variance
of the original variables and the variance of the principal components. However,
it is possible to establish a more general relationship that links the dispersion
matrix of the original variables with the principal axes.

Theorem (Spectral Decomposition).

$$\Sigma = \lambda_1\alpha_1\alpha_1^T + \cdots + \lambda_p\alpha_p\alpha_p^T.$$

Proof. It is enough to note that $\Sigma = P\Lambda P^T$.
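A short numerical check of the spectral decomposition (again a hypothetical NumPy snippet with our own names):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 3))
Sigma = A.T @ A                        # some symmetric positive semi-definite matrix

lam, P = np.linalg.eigh(Sigma)
# Rebuild Sigma as the sum of the rank-one terms lambda_i * alpha_i alpha_i^T
Sigma_rebuilt = sum(l * np.outer(a, a) for l, a in zip(lam, P.T))
print(np.allclose(Sigma, Sigma_rebuilt))
```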


Let us now determine the correlation between the original variables and the principal
components:

$$\mathrm{cov}(X, Y) = \mathrm{cov}(X, P^T X) = \mathrm{cov}(X, X)P = \Sigma P = P\Lambda P^T P = P\Lambda,$$

so $\mathrm{cov}(X_i, Y_j) = p_{ij}\lambda_j$ and

$$\mathrm{corr}(X_i, Y_j) = \frac{\mathrm{cov}(X_i, Y_j)}{\sqrt{\mathrm{Var}(X_i)}\sqrt{\mathrm{Var}(Y_j)}} = \frac{p_{ij}\lambda_j}{\sqrt{\sigma_{ii}}\sqrt{\lambda_j}} = p_{ij}\left(\frac{\lambda_j}{\sigma_{ii}}\right)^{1/2}.$$
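The correlation formula can be verified numerically as well (hypothetical sketch; data and names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))
X = X - X.mean(axis=0)

Sigma = np.cov(X, rowvar=False)
lam, P = np.linalg.eigh(Sigma)
Y = X @ P

# corr(X_i, Y_j) = p_ij * sqrt(lambda_j / sigma_ii)
corr_formula = P * np.sqrt(lam)[None, :] / np.sqrt(np.diag(Sigma))[:, None]

# empirical correlation between each original variable and each component
corr_empirical = np.array([[np.corrcoef(X[:, i], Y[:, j])[0, 1]
                            for j in range(3)] for i in range(3)])
print(np.allclose(corr_formula, corr_empirical))
```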

Calculating the eigenvectors when n ≪ p


How can we calculate the principal components of a data set in which the number of
variables $p$ is much greater than the number of observations $n$? Note that when $p$
is large, $\Sigma$ is a very large $p \times p$ matrix, and its eigenvalues and eigenvectors cannot
be calculated directly due to computational problems.
To solve this, we first observe that a matrix and its transpose have the same
eigenvalues, since the characteristic polynomial of $A^T$ is given by

$$|A^T - \lambda I| = |A^T - \lambda I^T| = |(A - \lambda I)^T| = |A - \lambda I|.$$

More importantly for our purposes, the non-zero eigenvalues of $G^TG$ and $GG^T$
also coincide, as the derivation below shows, so we can work with the much smaller
of the two matrices. To do this, let $G$ be the matrix containing our $n$ observations
with the mean subtracted, $G = [g_{ij}] = [x_{ij} - \mu_j]$. Let us define

$$\Sigma_l = \frac{1}{n-1}G^TG \quad (l = \text{long}),$$
$$\Sigma_s = \frac{1}{n-1}GG^T \quad (s = \text{short}),$$

and let $\Lambda_s$ and $\Lambda_l$ be the diagonal matrices of their non-zero eigenvalues, $\Lambda_s = \Lambda_l = \Lambda$. We have that

$$\Sigma_s\phi_s = \phi_s\Lambda \;\Rightarrow\; \frac{1}{n-1}GG^T\phi_s = \phi_s\Lambda$$
$$\Rightarrow\; \frac{1}{n-1}G^TGG^T\phi_s = G^T\phi_s\Lambda$$
$$\Rightarrow\; \Sigma_l(G^T\phi_s) = (G^T\phi_s)\Lambda,$$

so the columns of $G^T\phi_s$ are the eigenvectors of $\Sigma_l$ associated with its non-zero eigenvalues.
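A hypothetical NumPy illustration of this trick for an $n \ll p$ data matrix (names are ours; note that the transferred eigenvectors $G^T\phi_s$ must be renormalized to unit length):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 20, 1000                          # far more variables than observations
G = rng.normal(size=(n, p))
G = G - G.mean(axis=0)                   # mean-subtracted data matrix

Sigma_s = G @ G.T / (n - 1)              # small n x n matrix
lam_s, phi_s = np.linalg.eigh(Sigma_s)   # cheap eigendecomposition

keep = lam_s > 1e-10                     # keep only the non-zero eigenvalues
lam_s, phi_s = lam_s[keep], phi_s[:, keep]

V = G.T @ phi_s                          # candidate eigenvectors of Sigma_l = G^T G / (n-1)
V = V / np.linalg.norm(V, axis=0)        # renormalize each column to unit length

# Check Sigma_l v = lambda v without ever forming the p x p matrix Sigma_l
print(np.allclose(G.T @ (G @ V) / (n - 1), V * lam_s[None, :]))
```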
