
Multivariate Data Analysis

Chapter 2 – Principal Component Analysis (PCA)

Lecturer: Eunice Carrasquinha


[email protected]

Departamento de Estatística e Investigação Operacional (DEIO)


Faculdade de Ciências da Universidade de Lisboa

2023/2024
Chapter 2: Principal Components Analysis (PCA)

• 2.1 General concepts;
• 2.2 Construction of the principal components;
• 2.3 Estimation of the principal components;
• 2.4 Dimensionality reduction;
• 2.5 Interpretation of the principal components;
• 2.6 Scores;
• 2.7 Graphical representations;
• 2.8 Use of principal components.
2.1: General concepts

• Principal component analysis (PCA) is a method for analysing multivariate data.

• It developed mainly in the second half of the twentieth century, driven by the evolution of computer science and the development of computers.

• It allows the analysis of large datasets involving a large number of variables, without requiring any complicated assumptions (e.g. concerning the type of distribution of the data).
2.1: General concepts

Main goal: dimensionality reduction

• The original variables are replaced by another set of variables, uncorrelated and of smaller dimension than the original set, with minimal loss of information.

• The new variables, called principal components (p.c.), are linear combinations of the original variables.
2.1: General concepts

Among all the possible linear combinations, the one with maximum variance is chosen at each step, because:

• the p.c.'s should reflect, as much as possible, the characteristics of the data, i.e. the differentiation between observations that the original variables allowed to establish;

• the p.c.'s must explain a large part of the variation associated with the initial variables.
2.1: General concepts
• The differentiation between elements of a population is measured by the variance:

higher variance ⟺ greater distinction

• The variance of a principal component is therefore a measure of the amount of information explained by that principal component.

• Dimensionality reduction is achieved by considering only some of the principal components – those with the highest variances.

• Those that are not analysed are those that contribute little information, given that their variances are small (so the loss of information is small).
2.1: General concepts

In addition to dimensionality reduction, another advantage of this method is that the new variables (the principal components) are uncorrelated, since:

• instead of analysing a large number of variables (the original ones) with a complex inter-relational structure (as they concern the same individual), only a few uncorrelated variables are analysed;

• the analysis can be continued, perhaps applying other statistical methodologies intended for uncorrelated variables.
2.1: General concepts

• To help interpret the results of the analysis and allow them to be used properly, whatever the nature of the data, it is advantageous to be able to assign a meaning to each of the principal components (although this is not always possible).

• PCA can thus be considered a method of exploratory data analysis that can be useful for a better understanding of the relationships between the variables under study.

• Its application extends to almost all fields of science.


2.1: General concepts

Eigenvalues and Eigenvectors

Let us consider the vector equation (or system of equations)

$A\mathbf{x} = \lambda\mathbf{x}$

where $A$ is a known $p \times p$ matrix, $\mathbf{x}$ is a vector, and $\lambda$ is a scalar.
2.1: General concepts

Eigenvalues and Eigenvectors

The eigenvalues of the matrix $A$ are the values of $\lambda$ that satisfy the equation.

An eigenvector of the matrix $A$ associated to $\lambda$ is a vector $\mathbf{x}$ that satisfies the equation obtained for that value of $\lambda$.
2.1: General concepts

Eigenvalues and Eigenvectors

How to determine the eigenvalues and eigenvectors?

The system $A\mathbf{x} = \lambda\mathbf{x}$

can be rewritten as $(A - \lambda I)\mathbf{x} = \mathbf{0}$,

which will only have a non-trivial solution if $\det(A - \lambda I) = 0$,

where $I$ is the identity matrix of order p and $\mathbf{0}$ is the null vector.


2.1: General concepts

Eigenvalues and Eigenvectors

By solving the equation $\det(A - \lambda I) = 0$ we obtain the eigenvalues $\lambda_1, \ldots, \lambda_p$.

To determine the eigenvector corresponding to each eigenvalue $\lambda_j$, solve the system $(A - \lambda_j I)\mathbf{x} = \mathbf{0}$.


2.1: General concepts

Eigenvalues and Eigenvectors

Symmetric matrices
• The eigenvalues are always real.
• The sum of all eigenvalues equals the trace of the matrix (the sum of the elements of the main diagonal).

Positive semi-definite matrices
• The eigenvalues are always non-negative (positive or null).
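
As an aside (not in the original slides), a minimal NumPy sketch of these facts, using a small symmetric matrix chosen purely for illustration:

```python
import numpy as np

# A small symmetric matrix, chosen only for this example.
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# np.linalg.eigh is intended for symmetric matrices: it returns real eigenvalues
# (in ascending order) and the corresponding orthonormal eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(A)

print(eigenvalues)                               # real eigenvalues
print(eigenvalues.sum(), np.trace(A))            # their sum equals the trace of A

# Check the defining equation A x = lambda x for the largest eigenvalue.
x = eigenvectors[:, -1]
print(np.allclose(A @ x, eigenvalues[-1] * x))   # True
```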


2.2: Construction of the principal components

Let $\mathbf{X} = (X_1, \ldots, X_p)^T$ be a random vector, with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$.

Let X be a data matrix that constitutes a random sample, of dimension n, of observations of this random vector.

The main goal is to find a new set of p uncorrelated variables, $Y_1, \ldots, Y_p$, with maximum variance:

the Principal Components.
2.2: Construction of the principal components

Principal Components (P.C.)

P.C.'s are linear combinations of the p random variables $X_1, \ldots, X_p$:

$Y_j = v_{1j}X_1 + v_{2j}X_2 + \cdots + v_{pj}X_p = \mathbf{v}_j^T\mathbf{X}, \qquad j = 1, \ldots, p,$

where the coefficients $v_{1j}, \ldots, v_{pj}$ are constants.
2.2: Construction of the principal components

The coefficients of these linear combinations are determined so as to satisfy the following conditions:

• any two principal components are uncorrelated: $\operatorname{Cov}(Y_j, Y_k) = 0$ for $j \neq k$;
• in any principal component the sum of the squares of the coefficients is 1:

$\mathbf{v}_j^T\mathbf{v}_j = \sum_{i=1}^{p} v_{ij}^2 = 1,$

that is, the vectors $\mathbf{v}_j$ are normed (have unit norm).

2.2: Construction of the principal components

• Y1 is the principal component with the highest variance.

• Y2 is the principal component with the 2nd largest variance, subject to the condition that it is uncorrelated with Y1.

• Y3 is the principal component with the 3rd largest variance, subject to the condition that it is uncorrelated with both Y1 and Y2.

• ...
2.2: Construction of the principal components

• The coefficients of Y1 are the solution of:

maximize $\operatorname{Var}(Y_1) = \mathbf{v}_1^T\boldsymbol{\Sigma}\mathbf{v}_1$ subject to $\mathbf{v}_1^T\mathbf{v}_1 = 1$.

• The coefficients of Y2 are the solution of:

maximize $\operatorname{Var}(Y_2) = \mathbf{v}_2^T\boldsymbol{\Sigma}\mathbf{v}_2$ subject to $\mathbf{v}_2^T\mathbf{v}_2 = 1$ and $\operatorname{Cov}(Y_1, Y_2) = \mathbf{v}_1^T\boldsymbol{\Sigma}\mathbf{v}_2 = 0$.

• The coefficients of Yp are the solution of:

maximize $\operatorname{Var}(Y_p) = \mathbf{v}_p^T\boldsymbol{\Sigma}\mathbf{v}_p$ subject to $\mathbf{v}_p^T\mathbf{v}_p = 1$ and $\operatorname{Cov}(Y_j, Y_p) = 0$ for $j = 1, \ldots, p-1$.
2.2: Construction of the principal components

We then obtain (using, for example, the method of Lagrange multipliers):

• $v_{11}, \ldots, v_{p1}$ are the components of the eigenvector $\mathbf{v}_1$ of the matrix $\boldsymbol{\Sigma}$ associated with its largest eigenvalue $\lambda_1$, and $\operatorname{Var}(Y_1) = \lambda_1$;

• $v_{12}, \ldots, v_{p2}$ are the components of the eigenvector $\mathbf{v}_2$ of the matrix $\boldsymbol{\Sigma}$ associated with its second largest eigenvalue $\lambda_2$, and $\operatorname{Var}(Y_2) = \lambda_2$;

• and so on for the remaining components.
2.2: Construction of the principal components

Principal Components (P.C.) are the linear combinations

$Y_j = \mathbf{v}_j^T\mathbf{X}, \qquad j = 1, \ldots, p,$

where $\mathbf{v}_1, \ldots, \mathbf{v}_p$ are, respectively, the p normed eigenvectors associated with the p largest eigenvalues of $\boldsymbol{\Sigma}$ ($\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_p$), and $\operatorname{Var}(Y_j) = \lambda_j$.
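
A compact sketch of this construction (an illustration of ours, not from the slides), using NumPy on synthetic data; the names `S`, `lambdas`, `V` and `Y` are our own:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # hypothetical data: n = 100 observations, p = 3 variables

S = np.cov(X, rowvar=False)            # empirical covariance matrix (stand-in for Sigma)

# Eigendecomposition of the symmetric matrix S; eigh returns ascending order,
# so we reorder to get lambda_1 >= lambda_2 >= ... >= lambda_p.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lambdas = eigvals[order]               # variances of the principal components
V = eigvecs[:, order]                  # columns v_1, ..., v_p (norm-1 eigenvectors)

# The j-th principal component is Y_j = v_j' X, evaluated here on centered data.
Xc = X - X.mean(axis=0)
Y = Xc @ V

print(np.allclose(np.var(Y, axis=0, ddof=1), lambdas))          # Var(Y_j) = lambda_j
print(np.allclose(np.cov(Y, rowvar=False), np.diag(lambdas)))   # p.c.'s are uncorrelated
```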
2.2: Construction of the principal components

Geometric interpretation

• An $n \times p$ data matrix can be seen as a representation of the coordinates of n points in a p-dimensional space.

• Each row of a centered data matrix contains an observed value of the random vector – it contains the p coordinates of a vector, which can be viewed as the coordinates of a point in p-dimensional space.
2.2: Construction of the principal components
Geometric interpretation

• The distance from such a point to the origin is interpreted in standard-deviation units, so that the analysis takes into account the variability inherent to random observations.

• Therefore, points with which the same degree of uncertainty – the same variability – is associated are at the same distance from the origin.

• It can be proved that these points lie on a hyper-ellipsoid (which in the case p = 2 is an ellipse) with center at the origin, whose axes have directions given respectively by the eigenvectors of the matrix $\boldsymbol{\Sigma}$ (which is symmetric and positive semi-definite) and half-lengths proportional to the square root of the corresponding eigenvalue.
2.2: Construction of the principal components
In the two-dimensional case (p = 2):

[Figure: an ellipse centered at the origin in the (X1, X2) plane; points with the same variability, such as (x11, x12) and (x21, x22), lie on the same ellipse; the new axes Y1 and Y2 follow the directions of the eigenvectors v1 and v2, with semi-axis lengths proportional to $c\sqrt{\lambda_1}$ and $c\sqrt{\lambda_2}$.]
2.2: Construction of the principal components

• The covariance between any two principal components $Y_i$ and $Y_j$ ($i \neq j$) is null, as the principal components were determined to be pairwise uncorrelated.

• Then:

$\operatorname{Cov}(Y_i, Y_j) = \mathbf{v}_i^T\boldsymbol{\Sigma}\mathbf{v}_j = \mathbf{v}_i^T(\lambda_j\mathbf{v}_j) = \lambda_j\,\mathbf{v}_i^T\mathbf{v}_j = 0,$

since $\mathbf{v}_j$ is the eigenvector of $\boldsymbol{\Sigma}$ for the eigenvalue $\lambda_j$, and therefore $\mathbf{v}_i^T\mathbf{v}_j = 0$,

which indicates that $\mathbf{v}_i$ and $\mathbf{v}_j$ (with $i \neq j$) are orthogonal vectors.

2.2: Construction of the principal components

Geometry of the P.C. for p = 2 [figure omitted]


2.2: Construction of the principal components

Until now we have assumed that the population covariance matrix, $\boldsymbol{\Sigma}$, was known.

• This is generally not the case, so we have to use its estimate, S:

the empirical (sample) covariance matrix.


2.3: Estimation of the principal components
• We determine the eigenvalues of S and the corresponding eigenvectors, which are the estimates of the eigenvalues and eigenvectors of the matrix $\boldsymbol{\Sigma}$.

• The eigenvalues of S are all non-negative and are the estimates of the variances of the corresponding principal components.

• The linear combinations constructed using the components of the eigenvectors of S as coefficients are the estimates of the principal components.

However, in practice, what is usually done is to consider that our sample actually constitutes a population, and therefore it is considered that:

• the principal components obtained from S are effectively "the" principal components, and not the estimates of the principal components obtained from $\boldsymbol{\Sigma}$.
2.3: Estimation of the principal components
• In many situations, the variables under study are not all measured in the same unit or on the same scale, or are even of a different nature, or have very different variances.

• Thus, the need arises to establish a certain uniformization, which is achieved through the standardization of the variables, that is, by dividing the centered value of each variable by the corresponding standard deviation.

This procedure leads to variables with null mean value and unit variance:

• that is, the variables under study all have the same variance.

• The influence of variables with small variance tends to be inflated, while that of variables with high variance tends to be reduced.
2.3: Estimation of the principal components

Correlation Matrix

The covariance matrix of the set of these "new" (standardized) variables is equal to the correlation matrix of the set of initial variables because, considering variables already centered, we have:

$\operatorname{Cov}\!\left(\frac{X_i}{\sigma_i}, \frac{X_j}{\sigma_j}\right) = \frac{\operatorname{Cov}(X_i, X_j)}{\sigma_i\,\sigma_j} = \rho_{ij}.$
2.3: Estimation of the principal components
• Therefore, the PCA of such a dataset is performed using the correlation matrix, P.
• That is, the principal components are determined from the eigenvalues and eigenvectors of the matrix P.
• Mathematically, everything works in the same way.
• However, it should be noted that, in general, the eigenvectors of P are not equal to those of $\boldsymbol{\Sigma}$, and therefore the principal components will not be the same either.
• When the population correlation matrix, P, is not known, its estimate, R, is used.
• The interpretation of the results when using P or R must be done in terms of the standardized variables.
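
A short sketch (ours, on synthetic data) showing that R is the covariance matrix of the standardized variables, and that the correlation-based PCA then proceeds exactly as before:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data with very different scales, to motivate standardization.
X = rng.normal(size=(100, 3)) * np.array([1.0, 10.0, 100.0])

R = np.corrcoef(X, rowvar=False)                    # sample correlation matrix (estimate of P)

# Equivalently: the covariance matrix of the standardized variables.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(np.allclose(np.cov(Z, rowvar=False), R))      # True

# P.c.'s based on the correlation matrix: eigendecomposition of R.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
lambdas, V = eigvals[order], eigvecs[:, order]
print(lambdas, lambdas.sum())                       # eigenvalues; their sum equals p
```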
2.3: Estimation of the principal components
Example 2.3.1

Let us consider the following hypothetical data concerning the observation of two characteristics.

                      Variable X1   Variable X2
n                     100           100
Mean                  101.63        50.71
Standard deviation    10.47         7.44
Variance              109.63        55.40

The covariance matrix is:


2.3: Estimation of the principal components
Example 2.3.1

The eigenvalues of S are, respectively: and


2.3: Estimation of the principal components
Example 2.3.1

The norm 1 eigenvectors associated respectively with and are:

and

The p.c. are given, respectively, by:


(designating the centered variables by and )

Note that:

and
where and (which only happens in the bivariate case)
2.3: Estimation of the principal components
Example 2.3.1
If we use the correlation matrix instead of the covariance matrix (that is, the data are
standardized), we will have:
• correlation matrix

• eigenvalues of R are, respectively: and with

• norm 1 eigenvectors associated respectively with and are:

The p.c. are given, respectively, by:


2.4: Dimensionality reduction
• One of the main objectives of this analysis is to reduce the dimensionality of the data, which is achieved by replacing the original variables with some of the principal components.
• It remains to be seen how many, and which ones, to retain.

Which p.c.'s should be retained?

• Given that the p.c.'s can be ordered in descending order of their variance, and that the larger this variance is, the more representative of the original data the corresponding principal component will be, we must retain the first p.c.'s.

How many p.c.'s should be retained?

• Some rules can be used, established on the basis of results that derive from the properties of the eigenvalues and their relationship with the variance of the principal components.
2.4: Dimensionality reduction
The variance of each p.c. is given by: $\operatorname{Var}(Y_j) = \lambda_j$.

Thus, the sum of the variances of the p.c.'s is given by: $\sum_{j=1}^{p} \operatorname{Var}(Y_j) = \sum_{j=1}^{p} \lambda_j$.

Furthermore, since in a symmetric matrix (which is the case of $\boldsymbol{\Sigma}$) the sum of the eigenvalues is equal to the trace of the matrix, we have:

$\sum_{j=1}^{p} \lambda_j = \operatorname{tr}(\boldsymbol{\Sigma}) = \sum_{i=1}^{p} \operatorname{Var}(X_i),$

i.e. $\sum_{j=1}^{p} \operatorname{Var}(Y_j) = \sum_{i=1}^{p} \operatorname{Var}(X_i)$.
2.4: Dimensionality reduction

That is, the sum of the variances of the original variables is equal to the sum of the variances of the p.c.'s (if we consider all the p.c.'s we explain all the variability).

The proportion of the total variance that is explained by the jth principal component Yj is given by:

$\dfrac{\lambda_j}{\sum_{i=1}^{p} \lambda_i}$ — a measure of the importance of this p.c.
2.4: Dimensionality reduction
If the data are standardized, that is, if we are working with the correlation matrix:

• the total variance will be equal to the number of variables (p), since the diagonal of P consists entirely of 1's:

$\sum_{j=1}^{p} \lambda_j = \operatorname{tr}(P) = p.$

The proportion of the total variance that is explained by the jth principal component Yj is then given by:

$\dfrac{\lambda_j}{p}$ — a measure of the importance of this p.c.
2.4: Dimensionality reduction
Some of the rules that can be used are:

• retain as many p.c.'s as necessary so that the percentage of variance explained by them is greater than a value fixed a priori; that is, retain the first r p.c.'s such that

$\dfrac{\sum_{j=1}^{r} \lambda_j}{\sum_{j=1}^{p} \lambda_j} \times 100\% \geq$ fixed value;

• retain only the p.c.'s which correspond to eigenvalues greater than the mean, $\bar{\lambda} = \frac{1}{p}\sum_{j=1}^{p} \lambda_j$;

• when working with the correlation matrix this rule corresponds to Kaiser's criterion: retain only the p.c.'s which correspond to eigenvalues greater than 1 (since then $\bar{\lambda} = 1$).
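
As an illustration of these rules (with made-up eigenvalues, not taken from any example in the slides), a short sketch follows.

```python
import numpy as np

# Hypothetical eigenvalues of a correlation matrix (p = 6), for illustration only.
lambdas = np.array([2.8, 1.4, 0.9, 0.5, 0.3, 0.1])

explained = lambdas / lambdas.sum()            # proportion of total variance per p.c.
cumulative = np.cumsum(explained)

# Rule 1: retain the first r p.c.'s explaining at least, say, 80% of the variance.
r_threshold = int(np.argmax(cumulative >= 0.80) + 1)

# Rule 2: retain the p.c.'s whose eigenvalue exceeds the mean eigenvalue
# (Kaiser's criterion when working with the correlation matrix, since the mean is then 1).
r_mean = int(np.sum(lambdas > lambdas.mean()))
r_kaiser = int(np.sum(lambdas > 1.0))

print(cumulative, r_threshold, r_mean, r_kaiser)
```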
2.4: Dimensionality reduction
• Use a graph (scree plot) where the points with abscissa j and ordinate equal to the jth eigenvalue, or to the percentage of variance explained by the jth p.c., are represented (coordinate points $(j, \lambda_j)$ or $(j, \lambda_j / \sum_{i=1}^{p}\lambda_i)$), so that the contributions of the various p.c.'s can be distinguished.

• The r p.c.'s that contribute the most, standing out sharply from the others, should be retained.

[Scree plot: eigenvalue versus component number for 10 components; depending on where the break in the curve is placed, 2 or 4 p.c.'s would be retained in this example.]
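
A scree plot of this kind can be drawn, for example, with matplotlib; a minimal sketch using hypothetical eigenvalues:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical eigenvalues, already ordered from largest to smallest.
lambdas = np.array([3.2, 2.1, 1.1, 0.9, 0.7, 0.5, 0.4, 0.3, 0.2, 0.1])
components = np.arange(1, lambdas.size + 1)

plt.plot(components, lambdas, marker="o")
plt.xlabel("Component Number")
plt.ylabel("Eigenvalue")
plt.title("Scree Plot")
plt.show()
```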
2.4: Dimensionality reduction

Among these criteria, the mean criterion and the scree plot are the most commonly used.

• Practice has shown that both criteria lead to credible solutions if at least one of the following conditions is met: the number of variables is less than 30, or the number of cases (individuals) is greater than 250.

• According to some authors, when the number of variables is greater than 30 (especially if it is greater than 50), a scree plot should be used instead of the mean criterion.
2.4: Dimensionality reduction
Example 2.3.1
The percentage of the total variance explained, respectively, by the 1st and 2nd p.c.'s is
given by:

Covariance matrix

Correlation matrix
2.5: Interpretation of the Principal Components
• The meaning of a p.c. is interpreted from the variables that correlate most strongly with it.
• The coefficients of the linear combinations ($v_{ij}$) and the correlations between the initial variables and the p.c.'s – the loadings – are used.

Loading of variable Xi for p.c. Yj: the correlation $r(X_i, Y_j)$.

• The covariance between the ith variable (Xi) and the jth p.c. (Yj) is:

$\operatorname{Cov}(X_i, Y_j) = \lambda_j\, v_{ij}$, since $\boldsymbol{\Sigma}\mathbf{v}_j = \lambda_j\mathbf{v}_j$.
2.5: Interpretation of the Principal Components
• The correlation coefficient between the ith variable (Xi) and the jth p.c. (Yj) is:

$r(X_i, Y_j) = \dfrac{\operatorname{Cov}(X_i, Y_j)}{\sqrt{\operatorname{Var}(X_i)\operatorname{Var}(Y_j)}} = \dfrac{\lambda_j\, v_{ij}}{\sigma_i \sqrt{\lambda_j}} = \dfrac{v_{ij}\sqrt{\lambda_j}}{\sigma_i},$

since $\operatorname{Var}(Y_j) = \lambda_j$.

• If the data are standardized, or if the correlation matrix has been used, we have $\sigma_i = 1$, and therefore $r(X_i, Y_j) = v_{ij}\sqrt{\lambda_j}$.

• Thus, if the absolute value of a coefficient of a p.c. for a given variable is high, it can be concluded that the correlation between this p.c. and the variable is high:

the p.c.'s will be interpreted through these variables.
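
A short sketch (ours, on synthetic data) of computing these variable/p.c. correlations in the covariance-matrix case and checking them empirically:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
s = X.std(axis=0, ddof=1)                        # standard deviations of the variables

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lambdas, V = eigvals[order], eigvecs[:, order]

# r(X_i, Y_j) = v_ij * sqrt(lambda_j) / s_i   (covariance-matrix case)
corr = V * np.sqrt(lambdas)[np.newaxis, :] / s[:, np.newaxis]

# Sanity check against the correlations computed directly from the scores.
Y = (X - X.mean(axis=0)) @ V
p = X.shape[1]
emp = np.array([[np.corrcoef(X[:, i], Y[:, j])[0, 1] for j in range(p)] for i in range(p)])
print(np.allclose(corr, emp))                    # True
```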


2.5: Interpretation of the Principal Components
Two possible rules consist of considering a correlation high when:

• the square of the correlation is equal to or greater than the mean of the squares of the p correlations with that p.c.;
• the absolute value of the correlation is equal to or greater than 0.5.

The variables to be used in the interpretation of the jth p.c. will be those whose coefficients satisfy one of the rules.

Correlation matrix:

1. $r^2(X_i, Y_j) \geq \frac{1}{p}\sum_{k=1}^{p} r^2(X_k, Y_j) = \frac{\lambda_j}{p}$, or equivalently $v_{ij}^2 \geq \frac{1}{p}$, since $r(X_i, Y_j) = v_{ij}\sqrt{\lambda_j}$;

2. $|r(X_i, Y_j)| = |v_{ij}|\sqrt{\lambda_j} \geq 0.5$.
2.5: Interpretation of the Principal Components

Covariance matrix:

1. $r^2(X_i, Y_j) \geq \frac{1}{p}\sum_{k=1}^{p} r^2(X_k, Y_j)$, with $r(X_i, Y_j) = \dfrac{v_{ij}\sqrt{\lambda_j}}{\sigma_i}$;

2. $|r(X_i, Y_j)| \geq 0.5$.
2.5: Interpretation of the Principal Components
The relative importance of a variable Xi for the explanation of a p.c. Yj is given by $v_{ij}^2$, since the vector $\mathbf{v}_j$ is normed: $\sum_{i=1}^{p} v_{ij}^2 = 1$.

(However, this value tells us nothing about the importance of the principal component itself.)

The meaning to give to a p.c. (useful for interpretation) will be closely associated with the variables corresponding to high values of $v_{ij}^2$.
2.5: Interpretation of the Principal Components
The variance of a variable Xi that is explained by a p.c. Yj is given by $\lambda_j v_{ij}^2$.

• The square of the correlation coefficient between Xi and Yj, $r^2(X_i, Y_j)$, can be interpreted as the proportion of the variance of the variable Xi that is explained by the p.c. Yj, because:

$\sum_{j=1}^{p} r^2(X_i, Y_j) = \frac{1}{\sigma_i^2}\sum_{j=1}^{p} \lambda_j v_{ij}^2 = 1,$ with $\sum_{j=1}^{p} \lambda_j v_{ij}^2 = \operatorname{Var}(X_i) = \sigma_i^2$

(easily proved, taking into account the definition and properties of the p.c.'s).
2.5: Interpretation of the Principal Components

• the proportion of the variance of the variable Xi that is explained by the r retained p.c.'s is: $\sum_{j=1}^{r} r^2(X_i, Y_j)$;

• the part of the variance of the variable Xi that is explained by the p.c. Yj is: $\lambda_j v_{ij}^2$;

• the part of the variance of the variable Xi that is explained by the r retained p.c.'s is: $\sum_{j=1}^{r} \lambda_j v_{ij}^2$.
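
A minimal sketch (ours, on synthetic data) of these decompositions of each variable's variance:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
S = np.cov(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
lambdas, V = eigvals[order], eigvecs[:, order]

var_X = np.diag(S)                               # Var(X_i)
r = 2                                            # number of retained p.c.'s (arbitrary here)

# Part of Var(X_i) explained by each p.c.: lambda_j * v_ij^2
explained_parts = (V ** 2) * lambdas[np.newaxis, :]

# Proportion of Var(X_i) explained by the first r p.c.'s.
prop_retained = explained_parts[:, :r].sum(axis=1) / var_X

print(np.allclose(explained_parts.sum(axis=1), var_X))   # all p p.c.'s explain everything
print(prop_retained)
```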
2.5: Interpretation of the Principal Components
Example 2.3.1 (cont.)

In the previous example we have, for the 1st p.c. (using the data from the sample and, therefore, the resulting sample measures: r = estimate of ρ):

positive and high correlations ( > 0.5 )

• The 1st p.c. is highly correlated with both variables (using criterion 1: X1 is more important for explaining Y1 than X2). Also, when X1 or X2 increases, so does Y1.
2.5: Interpretation of the Principal Components
For the 2nd p.c. we have:

low correlations ( < 0.5 )

• The 2nd p.c. is poorly correlated with both variables (X2 is more important for explaining Y2 than X1). When X1 increases, Y2 decreases; when X2 increases, Y2 increases.

• The variables are both important for the explanation of the 1st principal component, with X1 being more important than X2;

• neither is very important for the explanation of the 2nd principal component, X2 being more important than X1.
2.5: Interpretation of the Principal Components
Relative importance of the variable Xi for the explanation of p.c. Yj ($v_{ij}^2$):

        Y1       Y2
X1      0.707    0.291
X2      0.291    0.707

Proportion of the variance of the variable Xi explained by p.c. Yj ($r^2(X_i, Y_j)$):

        Y1       Y2
X1      0.951    0.047
X2      0.775    0.225
2.6: Scores

We already know that the principal components result from a transformation performed on the variables under study (a linear combination) which, for the jth principal component, can be formalized by:

$Y_j = \mathbf{v}_j^T\mathbf{X} = v_{1j}X_1 + \cdots + v_{pj}X_p.$

We can now think of applying the same transformation to the data, that is, to the observation vectors (the columns of the X data matrix) of the variables $X_1, \ldots, X_p$, respectively.
2.6: Scores

We will obtain a new data matrix, of dimension $n \times p$, given by:

Y = matrix of individual scores,

whose ijth element is $y_{ij} = \mathbf{v}_j^T\mathbf{x}_i$:

the score of the ith individual for the jth principal component.


2.6: Scores

• If we designate by A the matrix whose columns are the successive p eigenvectors, we can also say that the matrix Y is obtained by doing: $Y = XA$.

• If we are working with the correlation matrix, the data should be standardized before the scores are determined.
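
A brief sketch (ours, on synthetic data) of the score computation; centering for the covariance-matrix case and standardizing for the correlation-matrix case are the conventions assumed here:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 3))                     # hypothetical n x p data matrix

# Covariance-matrix PCA: A has the eigenvectors of S as columns (descending order).
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
A = eigvecs[:, np.argsort(eigvals)[::-1]]

Xc = X - X.mean(axis=0)                          # center the data first
Y = Xc @ A                                       # matrix of individual scores (n x p)

# Correlation-matrix PCA: standardize first, then use the eigenvectors of R.
Z = Xc / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)
A_R = np.linalg.eigh(R)[1][:, ::-1]              # eigh is ascending, so reverse the columns
Y_R = Z @ A_R

print(Y.shape, Y_R.shape)                        # both are n x p
```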
2.6: Scores
Example 2.3.1 (cont.)

In Example 2.3.1 we obtained the two p.c.'s; the scores (the elements of the matrix Y) are obtained by applying those linear combinations to each observation:

… …
2.6: Scores
In matrix form, we would have:

It would now be enough to know the initial data to be able to calculate the values of Y.
2.7: Graphical Representations

Graphical representations are a great aid in interpreting PCA results.

However, only graphical representations made in two dimensions offer good readability.

Variables representation (each point represents a variable)

• the plane is defined by two axes corresponding to a pair of p.c.'s;

• each variable is associated with a point whose coordinates are the correlations ($r(X_i, Y_j)$) of that variable with each of the two p.c.'s concerned.
2.7: Graphical Representations

Individuals representation (each point represents an individual)

• the plane is defined by two axes corresponding to a pair of p.c.'s;

• each individual is associated with a point whose coordinates are the scores ($y_{ij}$) of that individual for each of the two p.c.'s concerned.

In general, the first two p.c.'s are preferably chosen, as they are the ones that contribute most to the explanation of the data variability.
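
A possible sketch of both representations with matplotlib, on synthetic data; the layout choices and variable names are ours:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))
labels = [f"X{i + 1}" for i in range(X.shape[1])]

# PCA on the correlation matrix (standardized data).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
lambdas, V = eigvals[order], eigvecs[:, order]

scores = Z @ V                                    # individuals' scores
corr = V * np.sqrt(lambdas)[np.newaxis, :]        # variable / p.c. correlations

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Variables representation: coordinates are the correlations with p.c. 1 and p.c. 2.
ax1.scatter(corr[:, 0], corr[:, 1])
for i, name in enumerate(labels):
    ax1.annotate(name, (corr[i, 0], corr[i, 1]))
ax1.set_xlabel("p.c. 1")
ax1.set_ylabel("p.c. 2")
ax1.set_title("Variables")

# Individuals representation: coordinates are the scores on p.c. 1 and p.c. 2.
ax2.scatter(scores[:, 0], scores[:, 1], s=10)
ax2.set_xlabel("p.c. 1")
ax2.set_ylabel("p.c. 2")
ax2.set_title("Individuals")

plt.show()
```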
2.8: Use of Principal Components

Principal components can be used to test the normality of the initial variables:

• if the principal components are not normally distributed, neither will the original variables be.

They can also be used to find outliers:

• in a histogram or box-plot of each of the p.c.'s, or in the representations of individuals made in the main planes, individuals with values that are too high or too low can be identified, distinguishing themselves from the rest as a whole.
2.8: Use of Principal Components

Sometimes, in regression analysis:

• the first principal components are determined from the set of independent variables, and the regression is then applied to the selected components.

• This technique is particularly useful to overcome the multicollinearity problem, as the p.c.'s are uncorrelated.
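
A minimal sketch of this idea (principal component regression) on synthetic, deliberately collinear data; the choice of r and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 200, 5
X = rng.normal(size=(n, p))
X[:, 4] = X[:, 0] + 0.01 * rng.normal(size=n)    # two nearly collinear columns
y = X @ np.array([1.0, 0.5, 0.0, 0.0, 1.0]) + rng.normal(size=n)

# 1) Principal components of the independent variables.
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
V = eigvecs[:, np.argsort(eigvals)[::-1]]

# 2) Keep the first r components and regress y on their scores
#    (the regressors are uncorrelated, so there is no multicollinearity problem).
r = 3
T = (X - X.mean(axis=0)) @ V[:, :r]
coef, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), T]), y, rcond=None)
print(coef)                                      # intercept and r regression coefficients
```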
2.8: Use of Principal Components

To detect groups or classify objects:

• if the first two p.c.'s explain a good part of the total variability, we can represent the scores of the individuals in the plane defined by these two components and try to identify clusters among the points obtained;

• if more than two p.c.'s need to be used, the scores of the individuals for the most important p.c.'s are used instead of the initial values of the variables (which were greater in number), and the groups are built from them, using one of the classification (cluster) analysis methods.
