Principal Component Analysis
1 INTRODUCTION
One of the main problems inherent in statistics with more than two variables is visualizing and interpreting the data. Fortunately, the problem can often be simplified by replacing a group of variables with a single new variable. The reason might be that more than one variable is measuring the same driving principle governing the behaviour of the system. One of the methods for reducing the number of variables is Principal Component Analysis (PCA).
The purpose of this report is to give a precise mathematical definition of PCA and then to present its mathematical derivation.
The method creates a new set of variables called principal components. Each of the new variables is a linear combination of the original variables. Each principal component is chosen so that it describes most of the still-remaining variance, and all principal components are orthogonal to each other; hence there is no redundant information. The first principal component has the maximum variance among all possible choices. (The MathWorks, 2010) (Jolliffe, 1986)
PCA is used for different purposes: finding interrelations between the variables in the data, interpreting and visualizing data, decreasing the number of variables to make further analysis simpler, and many other similar reasons.
The definition and derivation of principal component analysis are described. Along the way, the method of Lagrange multipliers for finding the maximum of a function subject to constraints, as well as eigenvalues and eigenvectors, are explained, because these ideas are needed in the derivation.
2.1 DEFINITION OF PRINCIPAL COMPONENTS

Suppose that $x$ is a vector of $r$ random variables,
$$x = [x_1, x_2, \ldots, x_r]^T.$$
The first step is to look at the linear function $\alpha_1^T x$ of the elements of $x$ which has maximum variance, where $\alpha_1$ is a vector of $r$ constants $\alpha_{11}, \alpha_{12}, \ldots, \alpha_{1r}$, so that
$$\alpha_1^T x = \alpha_{11} x_1 + \alpha_{12} x_2 + \cdots + \alpha_{1r} x_r = \sum_{j=1}^{r} \alpha_{1j} x_j.$$
There must be some constraints imposed, otherwise the variance is unbounded. In the current paper $\alpha_1^T \alpha_1 = 1$ is used, which means that the sum of squares of the elements of $\alpha_1$ is 1, i.e. the length of $\alpha_1$ is 1.
So, the aim is to find the linear function $\alpha_1^T x$ that transforms the random variables $x_1, \ldots, x_r$ into a new random variable so that the new variable has maximum variance.
Repeating the process, each time requiring the new linear combination to be uncorrelated with the previous ones, creates $r$ new random variables $\alpha_1^T x, \alpha_2^T x, \ldots, \alpha_r^T x$, which are called principal components. In general, the hope is that only the first few of these random variables are needed to explain the necessary amount of variability in the dataset.
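As a quick illustration of this hope, the sketch below (my own synthetic data; scikit-learn is assumed to be available, although the report itself does not rely on any particular software) shows that the first principal component of three correlated variables already carries most of the variance:

```python
# Minimal sketch (assumptions: scikit-learn is available; the data are synthetic).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Three correlated variables: the second and third are noisy copies of the first
# (the third with flipped sign).
x1 = rng.uniform(-1, 1, size=200)
X = np.column_stack([x1,
                     x1 + 0.2 * rng.normal(size=200),
                     -x1 + 0.2 * rng.normal(size=200)])

pca = PCA()   # keeps all components by default
pca.fit(X)
# The first ratio is close to 1: one component explains most of the variability.
print(pca.explained_variance_ratio_)
```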
2.1.1 EXAMPLE
In the simplest case, there is a pair of random variables $x_1$ and $x_2$ which are highly correlated. In this case one might want to reduce the two variables to only one, so that it would be easier to conduct further analysis. Figure 1 shows 30 such pairs $(x_1, x_2)$ of random variables, where $x_1$ is a random number from the interval $(-1, 1)$ and $x_2$ is $x_1$ plus 0.4 times a random number from the interval $(-1, 1)$.
Figure 1 - Scatter plot of the 30 $(x_1, x_2)$ pairs, with the directions of the first (PC 1) and second (PC 2) principal components drawn through the data; both axes run from -1 to 1.
A strong correlation is visible. Principal component analysis tries to find the first principal component, which explains most of the variance in the dataset. In this case it is clear that most of the variance is retained if the new random variable (the first principal component) lies along the direction labelled PC 1 in the figure. This new random variable explains most of the variation in the data set and could be used for further analysis instead of the original variables.
With two random variables the method looks similar to a regression model, but there is a difference: the first principal component is chosen so that the perpendicular distances from the sample points to the new axis are as small as possible, whereas in regression analysis the vertical distances are made as small as possible.
In reality, the random variables $x_1$ and $x_2$ can have some meaning as well. For example, $x_1$ might be standardized mathematics exam scores and $x_2$ might be standardized physics exam scores. In that case it would be possible to conclude that the new variable (the first principal component) might account for some general logical ability, while the remaining variation could be interpreted as some other factor.
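A minimal sketch of this example in Python (assumptions of this sketch: the random numbers are taken uniformly from the interval (-1, 1), which matches the axes of Figure 1, and numpy is available):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 30
x1 = rng.uniform(-1, 1, size=n)                # first random variable
x2 = x1 + 0.4 * rng.uniform(-1, 1, size=n)     # highly correlated second variable
X = np.column_stack([x1, x2])

# Centre the data and form the sample covariance matrix.
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (n - 1)

# The first principal component direction is the eigenvector of the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(S)           # eigh returns eigenvalues in ascending order
pc1 = eigvecs[:, -1]
# Up to sign, pc1 is roughly parallel to the diagonal direction (1, 1)/sqrt(2) in Figure 1.
print("first PC direction:", pc1)
```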
The following part shows how to find those principal components. The basic structure of the definition and derivation follows I. T. Jolliffe’s (1986) book “Principal Component Analysis”.
It is assumed that the covariance matrix of the random variables is known; it is denoted by $\Sigma$. $\Sigma$ is a non-singular symmetric matrix of dimension $r \times r$. $\Sigma$ is also positive semi-definite, which means that all its eigenvalues are non-negative. The element $(i, j)$ of the matrix shows the covariance between $x_i$ and $x_j$ in case $i \neq j$. The elements $(i, i)$ on the diagonal show the variance of the element $x_i$. So,
$$\Sigma = \begin{bmatrix}
E[(x_1 - \mu_1)(x_1 - \mu_1)] & E[(x_1 - \mu_1)(x_2 - \mu_2)] & \cdots & E[(x_1 - \mu_1)(x_r - \mu_r)] \\
E[(x_2 - \mu_2)(x_1 - \mu_1)] & E[(x_2 - \mu_2)(x_2 - \mu_2)] & \cdots & E[(x_2 - \mu_2)(x_r - \mu_r)] \\
\vdots & \vdots & \ddots & \vdots \\
E[(x_r - \mu_r)(x_1 - \mu_1)] & E[(x_r - \mu_r)(x_2 - \mu_2)] & \cdots & E[(x_r - \mu_r)(x_r - \mu_r)]
\end{bmatrix},$$
where $E[x_i]$ is the expected value of $x_i$ and $\mu_i$ is the mean of $x_i$. In this report the mean is assumed to be 0 because it can be subtracted from the data before the analysis.
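As a small sketch of this definition (numpy assumed, synthetic data of my own choosing), the covariance matrix can be estimated from centred data and compared with numpy's built-in estimator:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))          # 100 observations of r = 3 random variables

mu = X.mean(axis=0)                    # estimate of the mean of each variable
Xc = X - mu                            # subtract the mean, as assumed in the report
Sigma = Xc.T @ Xc / (X.shape[0] - 1)   # element (i, j) estimates E[(x_i - mu_i)(x_j - mu_j)]

# np.cov expects variables in columns when rowvar=False, so the two estimates agree.
assert np.allclose(Sigma, np.cov(X, rowvar=False))
```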
Finding the principal components is reduced to finding the eigenvalues and eigenvectors of the matrix $\Sigma$ such that the $k$th principal component is given by $\alpha_k^T x$. Here $\alpha_k$ is an eigenvector of $\Sigma$, which corresponds to the $k$th largest eigenvalue $\lambda_k$. In addition to this, the variance of $\alpha_k^T x$ is $\lambda_k$ because $\alpha_k$ is chosen to be of unit length. (Jolliffe, 1986)
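This central claim can be checked numerically. The following sketch (numpy assumed; the covariance structure is an arbitrary illustrative choice, not taken from the report) sorts the eigenpairs of an estimated $\Sigma$ and confirms that the variance of the $k$th principal component is approximately $\lambda_k$:

```python
import numpy as np

rng = np.random.default_rng(3)
# Draw correlated data with a known covariance structure (illustrative choice).
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[2.0, 0.8, 0.3],
                                 [0.8, 1.0, 0.2],
                                 [0.3, 0.2, 0.5]],
                            size=5000)
Sigma = np.cov(X, rowvar=False)

# Eigenvectors of Sigma, sorted by decreasing eigenvalue, give the principal components.
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = (X - X.mean(axis=0)) @ eigvecs   # k-th column is the k-th principal component
print(np.var(scores, axis=0, ddof=1))     # approximately equal to eigvals
print(eigvals)
```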
Before this result is derived, two topics must be explained. Firstly, eigenvalues and eigenvectors are described together with an example, and then the method of Lagrange multipliers is explained.
An eigenvector is a non-zero vector that stays parallel after matrix multiplication, i.e. $v$ is an eigenvector of dimension $n$ of a matrix $A$ with dimension $n \times n$ if $Av$ and $v$ are parallel. Parallel means that there exists a scalar $\lambda$ (the eigenvalue) such that $Av = \lambda v$. (Roberts, 1985)
The eigenvalues $\lambda$ of $A$ can be found as the roots of the characteristic equation $\det(A - \lambda I) = 0$, where $I$ is the $n \times n$ identity matrix. After finding all the eigenvalues $\lambda_k$, all the corresponding eigenvectors can be found by solving the standard linear matrix equation $(A - \lambda_k I)\,v = 0$. (Strang, 1999)
For example, let $A$ be a $2 \times 2$ matrix. Its characteristic equation
$$\det(A - \lambda I) = \lambda^2 - \operatorname{tr}(A)\,\lambda + \det(A) = 0$$
is a quadratic equation in $\lambda$ whose two roots are the eigenvalues $\lambda_1$ and $\lambda_2$. Substituting each eigenvalue back into
$$(A - \lambda_k I)\,v_k = 0$$
and solving the resulting linear system gives the corresponding eigenvectors $v_1$ and $v_2$. Therefore, two pairs of eigenvalues and eigenvectors have been found as required.
In practice eigenvalues are computed by better algorithms than finding the roots of
algebraic equations.
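As a concrete numerical illustration (the matrix below is my own example, not the one used in the original report), the eigenvalues of a small matrix can be obtained from the roots of the characteristic polynomial and cross-checked against a library routine:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])            # illustrative 2x2 matrix (not from the report)

# Characteristic polynomial: lambda^2 - trace(A)*lambda + det(A) = 0.
coeffs = [1.0, -np.trace(A), np.linalg.det(A)]
roots = np.roots(coeffs)
print("eigenvalues from the characteristic polynomial:", np.sort(roots))   # [2. 5.]

# A library routine gives the same eigenvalues plus the eigenvectors.
eigvals, eigvecs = np.linalg.eig(A)
print("eigenvalues from np.linalg.eig:", np.sort(eigvals))
for k in range(2):
    v = eigvecs[:, k]
    assert np.allclose(A @ v, eigvals[k] * v)   # A v = lambda v: A v stays parallel to v
```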
Sometimes it is necessary to find the maximum or minimum of a function that depends upon several variables whose values must satisfy certain equalities, i.e. constraints. In this report, it is necessary to find principal components which are linear combinations of the original
random variables, such that the length of the vector that represents the linear combination is 1 and such that the resulting new variables are uncorrelated with each other. The idea is to change the constrained optimisation problem into an unconstrained one (Gowers, 2008).
Figure 2 - Find x and y to maximize f(x,y) subject to a constraint (shown in red) g(x,y) = c. Source: (Lagrange multipliers)
Figure 3 - Contour map of Figure 2. The red line shows the constraint g(x,y) = c. The blue lines are contours of f(x,y). The point where the red line tangentially touches a blue contour is the solution. Source: (Lagrange multipliers)
A new variable $\lambda$, called a Lagrange multiplier, is introduced, and the Lagrange function is defined by
$$\Lambda(x, y, \lambda) = f(x, y) + \lambda\,\big(g(x, y) - c\big).$$
The solution of the constrained problem is found among the stationary points of $\Lambda$, i.e. the points where
$$\nabla_{x, y, \lambda}\, \Lambda(x, y, \lambda) = 0.$$
The Lagrange multiplier method gives necessary conditions for finding the maximum points of a function subject to constraints.
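A small worked sketch of the method (the function and constraint are illustrative choices of mine; sympy is assumed to be available) maximizes f(x, y) = x + y subject to x² + y² = 1 by finding the stationary points of the Lagrange function:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)

f = x + y                 # function to maximize
g = x**2 + y**2 - 1       # constraint g(x, y) = c rewritten as g(x, y) - c = 0

# Lagrange function and its stationary points.
L = f + lam * g
stationary = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)

for s in stationary:
    print(s, "f =", f.subs(s))
# Two stationary points at x = y = +-1/sqrt(2); the constrained maximum is f = sqrt(2).
```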
The quantity to be maximized for the first principal component is
$$\operatorname{var}(\alpha_1^T x) = E\big[(\alpha_1^T x)(\alpha_1^T x)^T\big] = E\big[\alpha_1^T x\, x^T \alpha_1\big] = \alpha_1^T E[x x^T]\,\alpha_1 = \alpha_1^T \Sigma \alpha_1,$$
using the assumption that the mean of $x$ is 0. The technique of Lagrange multipliers is used. So it is necessary to maximize
$$\alpha_1^T \Sigma \alpha_1 - \lambda\,(\alpha_1^T \alpha_1 - 1).$$
Differentiating with respect to $\alpha_1$ and setting the result equal to zero gives
$$\Sigma \alpha_1 - \lambda \alpha_1 = 0, \quad \text{i.e.} \quad (\Sigma - \lambda I_r)\,\alpha_1 = 0,$$
where $I_r$ is the $r \times r$ identity matrix. Thus $\lambda$ is an eigenvalue of $\Sigma$ and $\alpha_1$ is the corresponding eigenvector.
Next, it is necessary to decide which of the eigenvectors gives the maximizing value for the first principal component. It is necessary to maximize $\alpha_1^T \Sigma \alpha_1$. Let $\alpha_1$ be any eigenvector of $\Sigma$ and $\lambda$ be the corresponding eigenvalue. We have
$$\alpha_1^T \Sigma \alpha_1 = \alpha_1^T (\lambda \alpha_1) = \lambda\, \alpha_1^T \alpha_1 = \lambda.$$
As $\lambda$ must be the largest possible, $\alpha_1$ must be the eigenvector which corresponds to the largest eigenvalue $\lambda_1$.
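This can also be verified numerically: in the sketch below (numpy assumed, with an arbitrary covariance matrix chosen for illustration) no randomly drawn unit vector achieves a larger value of $\alpha^T \Sigma \alpha$ than the eigenvector belonging to the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(4)
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])          # illustrative covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)
lambda_max = eigvals[-1]
alpha_1 = eigvecs[:, -1]                     # eigenvector of the largest eigenvalue
print(alpha_1 @ Sigma @ alpha_1, lambda_max) # equal: alpha' Sigma alpha = lambda

# Compare against many random unit-length vectors satisfying a' a = 1.
for _ in range(10000):
    a = rng.normal(size=3)
    a /= np.linalg.norm(a)
    assert a @ Sigma @ a <= lambda_max + 1e-12
```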
The first principal component has now been derived. The same process can be applied to the others, such that the $k$th principal component of $x$ is $\alpha_k^T x$ and the variance of $\alpha_k^T x$ is $\lambda_k$, where $\lambda_k$ is the $k$th largest eigenvalue of $\Sigma$ with the corresponding eigenvector $\alpha_k$, for $k = 2, \ldots, r$. The derivation is shown for the second principal component $\alpha_2^T x$, which must again have unit length and must be uncorrelated with the first principal component. The latter constraint can be written as
$$\operatorname{cov}(\alpha_1^T x, \alpha_2^T x) = E\big[(\alpha_1^T x)(\alpha_2^T x)^T\big] = \alpha_1^T \Sigma \alpha_2 = \lambda_1\, \alpha_1^T \alpha_2 = 0,$$
where $\operatorname{cov}(\alpha_1^T x, \alpha_2^T x)$ denotes the covariance between $\alpha_1^T x$ and $\alpha_2^T x$, and $\Sigma \alpha_1 = \lambda_1 \alpha_1$ is already known from the derivation of the first principal component. So, $\alpha_2^T \Sigma \alpha_2$ must be maximized with the following constraints: $\alpha_2^T \alpha_2 = 1$ and $\alpha_1^T \alpha_2 = 0$. The method of Lagrange multipliers is used.
The Lagrange function for this problem is
$$\alpha_2^T \Sigma \alpha_2 - \lambda\,(\alpha_2^T \alpha_2 - 1) - \phi\, \alpha_1^T \alpha_2,$$
and differentiating with respect to $\alpha_2$ and applying the constraints gives $\phi = 0$ and
$$\Sigma \alpha_2 - \lambda \alpha_2 = 0,$$
where $\lambda$ is an eigenvalue and $\alpha_2$ the corresponding eigenvector of $\Sigma$. As before, $\alpha_2^T \Sigma \alpha_2 = \lambda$, so $\lambda = \lambda_2$, because $\lambda$ must be as big as possible and due to the correlation constraint it must not be equal to $\lambda_1$ with the eigenvector $\alpha_1$.
As said before, the method applies to all of the principal components; the proof is similar but is not given in this report.
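As a closing sketch (synthetic data, numpy assumed), the full set of principal components can be computed at once; the covariance matrix of the scores is diagonal, which confirms that the principal components are uncorrelated and that their variances are the eigenvalues of $\Sigma$ in decreasing order:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))   # correlated synthetic data
Xc = X - X.mean(axis=0)
Sigma = np.cov(Xc, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]            # sort eigenpairs by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = Xc @ eigvecs                             # all r principal components at once
C = np.cov(Z, rowvar=False)
print(np.round(C, 6))                        # diagonal matrix: the PCs are uncorrelated
print(eigvals)                               # the diagonal entries, largest first
```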
3 WORKS CITED
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2, 559-572.
Strang, G. (1999). MIT video lectures in linear algebra. Retrieved April 16, 2010, from MIT Open Courseware: http://ocw.mit.edu/OcwWeb/Mathematics/18-06Spring-2005/VideoLectures/detail/lecture21.htm