Principal Component Analysis Slides
COMPONENTS
In statistics, principal component analysis (PCA) is a technique used to reduce the
dimensionality of a data set.
o PCA is primarily used in exploratory data analysis and for building predictive models. PCA
involves computing the eigenvalue decomposition of the covariance matrix, usually after
centering the data at the mean of each attribute.
o It must be distinguished from factor analysis, with which it has formal similarities and for
which it can be used as an approximate method of factor extraction.
o PCA constructs a linear transformation that chooses a new coordinate system for the
original data set, in which the largest variance of the data set is captured on the first axis
(called the First Principal Component), the second largest variance is captured on the
second axis, and so on.
o To build this linear transformation, the covariance matrix or the correlation coefficient
matrix must first be built. Because this matrix is symmetric, it has a complete basis of
eigenvectors.
PCA MATHEMATICS
o Suppose there is a sample with n individuals for each of whom m (random) variables have
been measured. PCA makes it possible to find a number of underlying factors p < m that
approximately explain the value of the m variables for each individual.
There are two basic ways to apply PCA:
o Method based on the correlation matrix, when the data are not dimensionally homogeneous
or the order of magnitude of the random variables measured is not the same.
o Method based on the covariance matrix, which is used when the data are dimensionally
homogeneous and have similar mean values.
o Correlation-based method: consider the value of each of the m random variables F_j. For
each of the n individuals, take the value of these variables and write the data set in the
form of a matrix.
o From the m x n data corresponding to the m random variables, the sample correlation matrix
can be constructed, which is defined by:
$$R = [r_{ij}] \in M_{m \times m}, \qquad r_{ij} = \frac{\operatorname{cov}(F_i, F_j)}{\sqrt{\operatorname{var}(F_i)\,\operatorname{var}(F_j)}}$$
o Since the correlation matrix is symmetric, it is diagonalizable, and its eigenvalues
$\lambda_i$ verify:
$$\sum_{i=1}^{m} \lambda_i = m$$
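The property above can be checked numerically. The following is a minimal sketch on synthetic data (the data, the seed and the variable names are assumptions for illustration, not part of the slides): it builds the sample correlation matrix, diagonalizes it, and confirms that the eigenvalues add up to the number of variables m.

```python
import numpy as np

# Synthetic data (an assumption for illustration): n individuals, m variables
rng = np.random.default_rng(0)
n, m = 200, 4
data = rng.normal(size=(n, m)) * np.array([1.0, 10.0, 0.1, 5.0])
data[:, 1] += 3.0 * data[:, 0]  # make two of the variables correlated

# Sample correlation matrix: r_ij = cov(F_i, F_j) / sqrt(var(F_i) var(F_j))
R = np.corrcoef(data, rowvar=False)

# R is symmetric, so eigh applies; it returns real eigenvalues in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(R)

print("eigenvalues:", eigenvalues)
print("sum of eigenvalues:", eigenvalues.sum())  # equals m up to rounding
```

The sum equals m because the diagonal of a correlation matrix is all ones, so its trace (and therefore the sum of its eigenvalues) is m.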
o Covariance-based method
The objective is to transform a given set of data X of dimension nxm into another set of data
Y of smaller dimension nxl with the least possible loss of useful information using the
covariance matrix.
o We start from a set of n samples, each described by m variables, and the objective is to
describe each of these samples with only l variables, where l < m. Furthermore, the number
of principal components l has to be less than the smallest of the dimensions of X:
l < min{n, m}
o The data for analysis must be centered at 0 mean (by subtracting the mean of each column)
and/or autoscaled (centered at 0 mean and dividing each column by its standard
deviation).
$$X = \sum_{a=1}^{l} t_a p_a^T + E$$
o The vectors t_a are known as scores and contain the information about how the samples
are related to each other; they are mutually orthogonal. The vectors p_a are called
loadings and describe the relationship between the variables; they are orthonormal.
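o The decomposition is computed from the eigenvalue decomposition of the covariance matrix:
$$\operatorname{cov}(X) = \frac{X^T X}{n-1}$$
$$\operatorname{cov}(X)\, p_a = \lambda_a p_a$$
$$\sum_{a=1}^{m} \lambda_a = 1$$
where $\lambda_a$ is the eigenvalue associated with the eigenvector $p_a$; the last identity
holds when the eigenvalues are expressed as fractions of the total variance.
Finally,
$$t_a = X p_a$$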
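The covariance-based method can be sketched in a few lines. This is an illustrative example on synthetic data, not the procedure used for the slides: the data, the seed and the names `raw`, `P`, `T` and `E` are assumptions. It centers the data, diagonalizes the covariance matrix, keeps l components, and checks the stated properties of scores and loadings.

```python
import numpy as np

# Synthetic data (an assumption for illustration): n samples, m variables
rng = np.random.default_rng(1)
n, m, l = 50, 5, 2                      # l < min(n, m)
raw = rng.normal(size=(n, m)) @ rng.normal(size=(m, m))

# Center each column at mean 0
X = raw - raw.mean(axis=0)

# cov(X) = X^T X / (n - 1), valid because X is centered
cov = X.T @ X / (n - 1)

# eigh returns eigenvalues in ascending order; reorder to descending variance
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

P = eigvecs[:, :l]                      # loadings p_a as columns
T = X @ P                               # scores t_a = X p_a
E = X - T @ P.T                         # residual left out by the l components

# Eigenvalues as fractions of the total variance sum to 1
explained = eigvals / eigvals.sum()

print("loadings orthonormal:", np.allclose(P.T @ P, np.eye(l)))
print("scores orthogonal:", np.allclose(T.T @ T, np.diag(np.diag(T.T @ T))))
print("explained variance fractions:", explained)
```

Here `T @ P.T` is the matrix form of the sum of the products t_a p_a^T, so `X = T @ P.T + E` holds by construction.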
Multivariate Analysis:
In statistical studies of real cases, it is common to find that we have to handle not only a lot
of data, but also a lot of variables; having a large number of variables makes it difficult to
understand the problem as well as to interpret the statistical results. In the following
example we see a typical multivariate case:
EXAMPLE
o Over the last three academic years, an educational center has been experimenting with a
new pedagogical technique, which has been applied to five different groups of high school
students in different subjects, a total of 125 students. A statistical study is to be conducted
to determine to what extent the new technique has been effective, not only in terms of
improving grades but also in other variables such as active student participation in class,
improvement of attention and study skills, and overall student satisfaction in class. In
addition, it is considered important to take into account other variables that may influence
the study, such as age, social class, the subject in which the technique was used, the
educational level of the parents, and the teacher who applied it. To compare results, data
were also taken from another 125 students to whom the new technique was not applied. We
will therefore work with a sample of 250 students and 11 variables.
o The first rows of this table are shown below:
AC PA AT ED CL SO G PR
L R AND AD TO OF
TEC EST SAT ESTP
0 1 0 1 0 3 16 0 2 3 0
0 1 0 1 0 1 17 0 3 5 0
0 1 0 0 1 7 18 2 2 4 3
0 2 1 1 0 2 19 2 3 5 0
0 2 0 1 2 5 18 2 1 1 0
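The meanings of each variable are:
o TEC: 1 if the new technique was applied, 0 if not
o CAL: grade obtained
o PAR: measure of active participation in class
o ATE: measure of attention in class
o EST: measure of personal study skills
o SAT: measure of satisfaction in class
o AGE: age of the student
o CLA: social class
o ESTP: educational level of the parents
o ASIG: subject in which the technique was used
o PROF: teacher who applied it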
Component variances:
    Comp.1     Comp.2     Comp.3     Comp.4     Comp.5
2.32758465 2.01357571 1.20382700 1.04538603 1.02281750
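Importance of components:
                          Comp.1    Comp.2    Comp.3
Standard deviation     1.5256424 1.4190052 1.0971905
Proportion of Variance 0.2115986 0.1830523 0.1094388
Cumulative Proportion  0.2115986 0.3946509 0.5040898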
o We look at the Cumulative Proportion row: it gives the accumulated “representativeness”
of the new variables as a proportion of the total variance. Taking the first three
components, the variables are represented at 0.50, or 50%, so going from 11 variables to
three loses half of the information, which seems a significant loss. If we take more
principal components we lose less information, but the number of variables grows again:
with 5 components we reach 69% representativeness, with 6 we reach 77%, and with 7 we
cover about 85% of the original information, but by then the reduction in the number of
variables is small:
Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4     Comp.5     Comp.6     Comp.7
Standard deviation     1.5256424 1.4190052 1.0971905 1.02244121 1.01134440 0.94383777 0.90071050
Proportion of Variance 0.2115986 0.1830523 0.1094388 0.09503509 0.09298341 0.08098452 0.07375267
Cumulative Proportion  0.2115986 0.3946509 0.5040898 0.59912485 0.69210826 0.77309278 0.84684546
o Fig. 3: Expanding the number of components to work with
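The rows of this table are related by simple arithmetic, which the following sketch reproduces. It assumes that the analysis was run on the correlation matrix of the 11 variables, so that the total variance is 11; that assumption is not stated in the slides, but it matches the printed proportions.

```python
# Standard deviations of the first seven components, copied from the table above
std_devs = [1.5256424, 1.4190052, 1.0971905, 1.02244121,
            1.01134440, 0.94383777, 0.90071050]

# Assumption: PCA on the correlation matrix of 11 standardized variables,
# so the total variance equals the number of variables
total_variance = 11

# Proportion of variance of each component = variance / total variance
proportions = [s ** 2 / total_variance for s in std_devs]

# Cumulative proportion = running sum of the proportions
cumulative = []
running = 0.0
for p in proportions:
    running += p
    cumulative.append(running)

for k, (p, c) in enumerate(zip(proportions, cumulative), start=1):
    print(f"Comp.{k}: proportion = {p:.4f}, cumulative = {c:.4f}")
```

Reading the cumulative column from the top gives the smallest number of components that reaches a chosen threshold, which is one common way to decide how many to keep.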
o The number of principal components to work with is chosen by the experimenter. Classroom
problems are usually prepared so that a few principal components, 2 or 3, summarize the
data well, but in real problems this is not usually so obvious.
o To see how the new variables are related to the original ones, we can use the correlation
matrix between pairs of variables: in R we go to Statistics -> Summaries -> Correlation
matrix, choose all the variables, and mark the Data Pairs option. In the resulting
correlation matrix we look at the column corresponding to the principal component PC1, for
which the correlations are:
                PC1
ASIG   0.009634422
ATE   -0.690929281
CAL   -0.8508779590
CLA    0.0891672163
AGE    0.233171700
EST   -0.67173527
ESTP   0.093915413
PAR   -0.712555990
PC1    1.000000e+00
PC2    1.006389e-17
PC3   -5.316147e-17
PROF   0.006182726
SAT   -0.120799459
TEC   -0.28527228
o Let's analyze these correlations: PC1 is strongly correlated (absolute value above 0.5,
or 50%) with the variables ATE (measure of attention in class, negative value), CAL (grade
obtained, negative value, the strongest correlation), EST (measure of personal study
skills, negative value) and PAR (measure of active participation in class, negative
value); weakly correlated (between 10% and 50%) with AGE (positive value), SAT (measure of
satisfaction in class, negative value) and TEC (1: the new technique is applied, 0: it is
not; negative value); and practically uncorrelated with the others.
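The same kind of correlation column can be produced outside the menu interface. The sketch below uses synthetic data (the data, the seed and all variable names are assumptions, not the study's data set): it appends the first three principal components to the standardized variables, computes the joint correlation matrix, and shows that the components are mutually uncorrelated, just as the PC2 and PC3 entries in the column above are zero up to rounding.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 250 individuals, 4 variables
rng = np.random.default_rng(2)
n, m = 250, 4
raw = rng.normal(size=(n, m))
raw[:, 1] += 0.8 * raw[:, 0]
raw[:, 2] -= 0.5 * raw[:, 0]

# Autoscale: center at mean 0 and divide each column by its standard deviation
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# PCA on the correlation matrix, components sorted by decreasing variance
R = np.corrcoef(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Scores of the first three components, added as extra columns
scores = Z @ eigvecs[:, :3]
combined = np.hstack([Z, scores])

# Joint correlation matrix of original variables and components
C = np.corrcoef(combined, rowvar=False)
var_pc_corr = C[:m, m:]   # rows: original variables, columns: PC1..PC3
pc_pc_corr = C[m:, m:]    # correlations among the components themselves

print("correlations with PC1:", var_pc_corr[:, 0])
print("correlations among components:\n", np.round(pc_pc_corr, 12))
```

For standardized data, each correlation between a variable and a component equals the corresponding eigenvector entry multiplied by the square root of that component's eigenvalue, so the column can also be obtained without forming the joint matrix.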
o FIG 4: A scatter plot of any two principal components will not show any relationship
o We were able to create this scatter diagram by selecting the option that adds the new
variables to the original data sheet as additional columns.
o Fig. 5: R adds 3 new columns to the data sheet; they are the principal components chosen
by the user
o As a conclusion of this study with principal components we can say:
o The new teaching technique does seem to have some influence, since its associated
variable is included in the PC1 component of “good practices and good grades”,
although its effect seems smaller (29% correlation) than that of the other good
practices: attention in class, etc. On the other hand, the subject where the method
has been tested, which is component PC2, is uncorrelated with PC1. This is good: it
tells us that “good practices” have the same effects in any subject. The same can be
said of the family environment, represented by PC3.
o Reduce the number of variables: factor analysis
o Factor analysis is another technique designed to reduce the number of variables, creating
new ones, called factors, by linear combination of the original ones, which attempt to
show conditions that are not directly easily recognizable. Statistical factor analysis
software allows for so-called “rotations” of variables, a mathematical transformation that
aims to simplify the new description of variables as much as possible. The results are not
the same as using principal components, since the mathematical method is different.
o In R, we go to Statistics -> Dimensional Analysis -> Factor Analysis and choose all the
original variables of the problem. It asks for the number of factors to retain; we try 3.
The result is this summary:
Uniquenesses:
 ASIG   ATE   CAL   CLA   AGE   EST  ESTP   PAR  PROF   SAT   TEC
0.077 0.541 0.262 0.983 0.952 0.722 0.986 0.293 0.005 0.995 0.956
Loadings:
      Factor1 Factor2 Factor3
ASIG   0.961
ATE            0.672
CAL            0.769   0.381
CLA                   -0.116
AGE                   -0.198
EST            0.456   0.258
ESTP
PAR            0.289   0.789
PROF   0.997
SAT
TEC                    0.158
o The first factor considers the subject and the teacher who teaches it as an important
factor in the study.
o The second factor takes into account attention in class, grades, study skills and active
participation in class, similarly to the principal component PC1 in the previous section.
o The third factor considers the relationship between grades, social class, age, study
skills, active participation in class and the application of the new technique, the last
with a rather low weight, 0.158.
The conclusions we can draw are:
o In this analysis, the TEC variable we studied does not seem to play any role: it only
enters factor 3, with a loading of 0.158, and 95.6% of its variance remains unexplained
(Uniquenesses). The related variables with the highest weight are CAL and ATE in
factor 2, which suggests that attention in class is the variable most correlated with
the grade obtained. In factor 3 the dominant variable is PAR, active participation,
which has a rather weak relationship with the grade (0.381) and an even weaker one
with the other variables.
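The link between the Loadings and Uniquenesses outputs can be checked by hand: a variable's uniqueness is 1 minus the sum of its squared loadings. The sketch below applies this to the loadings as read from the output above. It assumes that the blank cells are loadings too small to be printed, so the results are only approximate; the dictionary names are illustrative.

```python
# Loadings as read from the factor analysis output (blank cells omitted;
# assumption: they are small values that the output does not print)
loadings = {
    "ASIG": [0.961],
    "ATE":  [0.672],
    "CAL":  [0.769, 0.381],
    "CLA":  [-0.116],
    "AGE":  [-0.198],
    "EST":  [0.456, 0.258],
    "PAR":  [0.289, 0.789],
    "PROF": [0.997],
    "TEC":  [0.158],
}

# Uniquenesses reported in the same output
reported = {
    "ASIG": 0.077, "ATE": 0.541, "CAL": 0.262, "CLA": 0.983, "AGE": 0.952,
    "EST": 0.722, "PAR": 0.293, "PROF": 0.005, "TEC": 0.956,
}

approx = {}
for name, values in loadings.items():
    communality = sum(v ** 2 for v in values)   # variance explained by the factors
    approx[name] = 1.0 - communality            # variance left unexplained
    print(f"{name}: approx uniqueness = {approx[name]:.3f}, "
          f"reported = {reported[name]:.3f}")
```

For TEC this gives roughly 0.975 against the reported 0.956, which is consistent with the conclusion that the three factors explain very little of that variable.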