Principal Component Analysis Slides

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of data sets while preserving variance, primarily utilized in exploratory data analysis and predictive modeling. PCA involves calculating the covariance matrix, transforming the original data into a new coordinate system where the first principal component captures the largest variance. The document also discusses the application of PCA in a study involving educational techniques, demonstrating how to reduce variables and analyze their relationships through statistical software like R.

PRINCIPAL COMPONENT ANALYSIS
In statistics, principal component analysis (PCA) is a technique used to reduce the dimensionality of a data set.
o PCA is primarily used in exploratory data analysis and for building predictive models. It involves computing the eigenvalue decomposition of the covariance matrix, usually after centering the data at the mean of each attribute.
o It must be distinguished from factor analysis, with which it shares formal similarities and for which it can be used as an approximate method of factor extraction.
o PCA constructs a linear transformation that chooses a new coordinate system for the original data set, in which the largest variance of the data set is captured on the first axis (called the first principal component), the second largest variance on the second axis, and so on.
o To build this linear transformation, the covariance matrix or the correlation matrix must first be built. Because this matrix is symmetric, there exists a complete basis of its eigenvectors.
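The idea that the first axis captures the largest variance can be checked by hand on two variables, where the eigenvalues of the symmetric covariance matrix have a closed form. The following is a minimal sketch in plain Python, not the procedure used later in the slides; the toy data are hypothetical and the eigenvector formula assumes the off-diagonal covariance is not zero.

```python
import math

# Hypothetical toy data: 5 individuals, 2 variables (not from the slides).
data = [(2.0, 1.0), (3.0, 3.0), (4.0, 3.0), (5.0, 6.0), (6.0, 7.0)]
n = len(data)

# Center each variable at its mean.
mean_x = sum(x for x, _ in data) / n
mean_y = sum(y for _, y in data) / n
centered = [(x - mean_x, y - mean_y) for x, y in data]

# Sample covariance matrix [[a, b], [b, d]] (symmetric by construction).
a = sum(x * x for x, _ in centered) / (n - 1)
d = sum(y * y for _, y in centered) / (n - 1)
b = sum(x * y for x, y in centered) / (n - 1)

# Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
half_trace = (a + d) / 2
radius = math.hypot((a - d) / 2, b)
eigenvalues = (half_trace + radius, half_trace - radius)

# Unit eigenvector for the largest eigenvalue (assumes b != 0):
# this is the direction of the first principal component.
vx, vy = b, eigenvalues[0] - a
norm = math.hypot(vx, vy)
pc1 = (vx / norm, vy / norm)

# Coordinates of each individual on the first axis, and their variance.
scores1 = [x * pc1[0] + y * pc1[1] for x, y in centered]
var_pc1 = sum(s * s for s in scores1) / (n - 1)

print("eigenvalues:", eigenvalues)
print("first axis:", pc1)
print("variance along first axis:", var_pc1)
```

The variance of the projected data equals the largest eigenvalue, and it is at least as large as the variance of either original variable.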
MATHEMATICS OF PCA

o Suppose there is a sample of n individuals, for each of whom m (random) variables have been measured. PCA makes it possible to find a number p < m of underlying factors that approximately explain the values of the m variables for each individual.
There are two basic ways to apply PCA:
o Method based on the correlation matrix, used when the data are not dimensionally homogeneous or the variables measured differ in order of magnitude.
o Method based on the covariance matrix, used when the data are dimensionally homogeneous and have similar mean values.
o Correlation-based method
Consider the m random variables F_1, ..., F_m. For each of the n individuals, take the value of these variables and write the data set as a matrix with one entry per individual and variable.

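o From the m × n data values corresponding to the m random variables, the sample correlation matrix can be constructed, defined by:

$$R = [r_{ij}] \in M_{m \times m}, \qquad r_{ij} = \frac{\operatorname{cov}(F_i, F_j)}{\sqrt{\operatorname{var}(F_i)\,\operatorname{var}(F_j)}}$$

o Since the correlation matrix is symmetric, it is diagonalizable, and its eigenvalues $\lambda_i$ satisfy:

$$\sum_{i=1}^{m} \lambda_i = m$$

because the sum of the eigenvalues equals the trace of R, whose m diagonal entries are all 1.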
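The definition of the sample correlation matrix can be sketched in a few lines of plain Python. The data and the helper name `correlation_matrix` are hypothetical; the point is only to show that the result is symmetric with ones on the diagonal, so its trace (and hence the sum of its eigenvalues) equals the number of variables.

```python
import math

def correlation_matrix(rows):
    """Sample correlation matrix of a data set given as a list of rows
    (one row per individual, one column per variable)."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in rows]

    def cov(i, j):
        # Sample covariance between variables i and j.
        return sum(r[i] * r[j] for r in centered) / (n - 1)

    return [[cov(i, j) / math.sqrt(cov(i, i) * cov(j, j)) for j in range(m)]
            for i in range(m)]

# Hypothetical data: 5 individuals, 3 variables (not from the slides).
rows = [
    [16.0, 1.0, 3.0],
    [17.0, 3.0, 5.0],
    [18.0, 2.0, 4.0],
    [19.0, 5.0, 5.0],
    [18.0, 4.0, 1.0],
]
R = correlation_matrix(rows)
m = len(R)
trace = sum(R[i][i] for i in range(m))

for row in R:
    print(["%.3f" % r for r in row])
print("trace:", trace)
```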
o Covariance-based method
The objective is to transform a given data set X of dimension n × m into another data set Y of smaller dimension n × l, with the least possible loss of useful information, using the covariance matrix.
o We start from a set of n samples, each described by m variables, and the objective is to describe each sample with only l variables, where l < m. Furthermore, the number of principal components l has to be less than the smaller of the dimensions of X:
$$l < \min\{n, m\}$$
o The data for analysis must be centered at 0 mean (by subtracting the mean of each column)
and/or autoscaled (centered at 0 mean and dividing each column by its standard
deviation).
o PCA then decomposes X as:

$$X = \sum_{a=1}^{l} t_a p_a^T + E$$

where E is the residual matrix.

o The vectors $t_a$ are known as scores and contain the information about how the samples are related to each other; they are mutually orthogonal. The vectors $p_a$ are called loadings and describe the relationships between the variables; they are orthonormal.
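The loadings are obtained from the eigenvector decomposition of the covariance matrix of X:

$$\operatorname{cov}(X) = \frac{X^T X}{n-1}$$
$$\operatorname{cov}(X)\, p_a = \lambda_a\, p_a$$
$$\sum_{a=1}^{m} \lambda_a = \operatorname{tr}\big(\operatorname{cov}(X)\big)$$

where $\lambda_a$ is the eigenvalue associated with the eigenvector $p_a$. The sum equals 1 only when each eigenvalue is expressed as a fraction of the total variance.
Finally,

$$t_a = X p_a$$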
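The covariance-based method can be sketched for a single component as follows. This is an illustration under stated assumptions, not the routine R uses: the data are hypothetical, the helper name `first_component` is made up for this example, and the leading eigenvector is found by power iteration (a simple substitute for a full eigendecomposition, which assumes the largest eigenvalue is well separated from the rest).

```python
import math

def first_component(X, iterations=500):
    """Return (eigenvalue, loading p, scores t, residual E) for the first
    principal component of column-centered data X (a list of rows)."""
    n, m = len(X), len(X[0])
    # cov(X) = X^T X / (n - 1), valid because the columns are centered.
    C = [[sum(X[k][i] * X[k][j] for k in range(n)) / (n - 1)
          for j in range(m)] for i in range(m)]
    # Power iteration: repeatedly apply C and renormalize to unit length.
    p = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        q = [sum(C[i][j] * p[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(v * v for v in q))
        p = [v / norm for v in q]
    # Eigenvalue estimate p^T C p, scores t = X p, residual E = X - t p^T.
    lam = sum(p[i] * sum(C[i][j] * p[j] for j in range(m)) for i in range(m))
    t = [sum(X[k][j] * p[j] for j in range(m)) for k in range(n)]
    E = [[X[k][j] - t[k] * p[j] for j in range(m)] for k in range(n)]
    return lam, p, t, E

# Hypothetical raw data: 5 samples, 3 variables (not taken from the slides).
raw = [
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.3],
    [3.0, 5.9, 0.6],
    [4.0, 8.2, 0.4],
    [5.0, 9.8, 0.5],
]
n, m = len(raw), len(raw[0])
means = [sum(row[j] for row in raw) / n for j in range(m)]
X = [[row[j] - means[j] for j in range(m)] for row in raw]  # center columns

lam, p, t, E = first_component(X)
total_ss = sum(v * v for row in X for v in row)
residual_ss = sum(v * v for row in E for v in row)
print("eigenvalue:", lam)
print("loading p_1:", p)
print("share of variation left in E:", residual_ss / total_ss)
```

The variance of the scores equals the eigenvalue, and every row of the residual E is orthogonal to the loading vector, which is what makes the decomposition X = t p^T + E lossless for the retained direction.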
Multivariate Analysis:

In statistical studies of real cases, we often have to handle not only a lot of data but also a lot of variables. A large number of variables makes it difficult both to understand the problem and to interpret the statistical results. The following example shows a typical multivariate case:
EXAMPLE
o Over the last three academic years, an educational center has been experimenting with a new pedagogical technique, applied to five different groups of high school students in different subjects, 125 students in total. A statistical study is to be conducted to determine to what extent the new technique has been effective, not only in terms of improving grades but also in other variables such as active student participation in class, improvement of attention and study skills, and overall student satisfaction in class. In addition, it is considered important to take into account other variables that may influence the study, such as age, social class, the subject in which the technique was used, the educational level of the parents, and the teacher who applied it. To compare results, data were also taken from another 125 students to whom the new technique was not applied. We will therefore work with a sample of 250 students and 11 variables.
o The first rows of this table are shown below:

TEC  CAL  PAR  ATE  EST  SAT  AGE  CLA  ASIG  PROF  ESTP
0    1    0    1    0    3    16   0    2     3     0
0    1    0    1    0    1    17   0    3     5     0
0    1    0    0    1    7    18   2    2     4     3
0    2    1    1    0    2    19   2    3     5     0
0    2    0    1    2    5    18   2    1     1     0
The meanings of each variable are:

TEC   1: the new technique is applied; 0: it is not
CAL   Grade obtained
PAR   Measure of active participation in class
ATE   Measure of attention in class
EST   Measure of personal study skills
SAT   Measure of classroom satisfaction
AGE   Student age
CLA   Social class: 0 low, 1 middle, 2 high
ASIG  Subject in which the technique was applied: 1 MAT, 2 SCIENCE, 3 HISTORY
PROF  Teacher who applied it: values 1, 2 (MAT); 3, 4 (CIENC); 5 (HIST)
ESTP  Parents' educational level: 0 no studies, 1 basic, 2 intermediate, 3 higher
o Reduce the number of variables: principal component analysis
o We will use the principal component analysis method. Once the data are loaded into the R environment, we go to Statistics -> Dimensional analysis -> Principal component analysis. We select all the variables and in Options we check “Add principal components to the data set”. When asked how many components to include, we are choosing how many variables the 11 originals will be reduced to; we enter 3 (ideally no more than 4, so that the data remain manageable) and accept. R performs the analysis and provides this report:
Component loadings:
        Comp.1         Comp.2         Comp.3
ASIG    0.006314994   -0.68598732     0.01005327
ATE    -0.452877616   -0.01202706    -0.06523847
CAL    -0.557717833    0.02853634    -0.14651121
CLA     0.058445687    0.09401670    -0.18861895
AGE     0.152835097   -0.08783368    -0.59673786
EST    -0.440296679    0.04120903    -0.04699023
ESTP    0.061557947   -0.11504869     0.62045462
PAR    -0.467053094    0.05137157     ?
PROF    0.004052539    (±)0.68918898 -0.01027994
SAT    -0.079179407    ?              ?
TEC     ?              ?              ?
[Cells marked ? are illegible in the source, as is the sign of the PROF entry under Comp.2. Three further values in the source (0.05122096, 0.34806320, 0.04168639) cannot be assigned to a cell.]
o Fig. 1: Principal components: coefficients of the combinations
o R always generates as many principal components as there are original variables, 11 in this case. Figure 1 does not show columns 4, 5, ..., 11, since we are interested in studying only 3. What R has done is create new variables Comp.1, Comp.2, ... as linear combinations of the original ones, the coefficients being those shown in Figure 1. That is, for principal component 1:
Comp.1 = 0.0063·ASIG - 0.4529·ATE - 0.5577·CAL + 0.0584·CLA + ...
o For principal component 2:
Comp.2 = -0.6860·ASIG - 0.0120·ATE + 0.0285·CAL + 0.0940·CLA + ...
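o In the same R report we find this other section:

Component variances:
    Comp.1     Comp.2     Comp.3     Comp.4     Comp.5
2.32758465 2.01357571 1.20382700 1.04538603 1.02281750

Importance of components:
                          Comp.1    Comp.2    Comp.3
Standard deviation     1.5256424 1.4190052 1.0971905
Proportion of Variance 0.2115986 0.1830523 0.1094388
Cumulative Proportion  0.2115986 0.3946509 0.5040898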
o We look at the Cumulative Proportion row: it gives the accumulated “representativeness” of the new variables as a proportion. Taking the first three components, the variables are represented at 0.50, or 50%, so going from 11 to 3 variables loses half of the information. This looks like a significant loss. If we take more principal components we lose less information, but the number of variables grows again: with 5 components we reach 69% representativeness, with 6 we reach 77%, and with 7 we cover 85% of the original information, but by then the reduction in the number of variables is small:

Importance of components:
                          Comp.1    Comp.2    Comp.3     Comp.4     Comp.5     Comp.6     Comp.7
Standard deviation     1.5256424 1.4190052 1.0971905 1.02244121 1.01134440 0.94383777 0.90071050
Proportion of Variance 0.2115986 0.1830523 0.1094388 0.09503509 0.09298341 0.08098452 0.07375267
Cumulative Proportion  0.2115986 0.3946509 0.5040898 0.59912485 0.69210826 0.77309278 0.84684546
o Fig. 3: Expanding the number of components to work with
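The three rows of this report are linked by simple arithmetic, sketched below in Python with the standard deviations copied from the report. The divisor 11 is an assumption consistent with the numbers shown: it is the total variance of 11 standardized variables, which is what a PCA on the correlation matrix would give.

```python
from itertools import accumulate

# Standard deviations of the first seven components, copied from the report.
std_devs = [1.5256424, 1.4190052, 1.0971905, 1.02244121,
            1.01134440, 0.94383777, 0.90071050]

# Assumption: total variance = number of variables (correlation-matrix PCA).
n_variables = 11

variances = [s ** 2 for s in std_devs]                # component variances
proportions = [v / n_variables for v in variances]    # Proportion of Variance
cumulative = list(accumulate(proportions))            # Cumulative Proportion

for k, (prop, cum) in enumerate(zip(proportions, cumulative), start=1):
    print(f"Comp.{k}: proportion={prop:.4f} cumulative={cum:.4f}")
```

Reading down the cumulative column until it passes a chosen threshold is one common way to pick the number of components, although, as the slides note, the final choice is left to the experimenter.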
o The choice of the number of principal components to work with is up to the experimenter. Textbook problems are usually prepared so that a few principal components, 2 or 3, summarize the data well, but in real problems this is rarely so clear-cut.
o To see how the new variables are related to the original ones we can use the correlation matrix between pairs of variables: in R we go to Statistics -> Summaries -> Correlation matrix, choose all the variables, and check the Data Pairs option. In the resulting correlation matrix we look at the column corresponding to the principal component PC1, for which the correlations are:
        PC1
ASIG    0.009634422
ATE    -0.690929281
CAL    -0.8508779590
CLA     0.0891672163
AGE     0.233171700
EST    -0.67173527
ESTP    0.093915413
PAR    -0.712555990
PC1     1.000000e+00
PC2     1.006389e-17
PC3    -5.316147e-17
PROF    0.006182726
SAT    -0.120799459
TEC    -0.28527228
o Let's analyze these correlations: PC1 is strongly correlated (more than 0.5 in absolute value, or 50%) with the variables ATE (measure of attention in class, negative), CAL (grade obtained, negative; it is the strongest correlation), EST (measure of personal study skills, negative) and PAR (measure of active participation in class, negative); weakly correlated (between 10% and 50%) with AGE (positive), SAT (measure of satisfaction in class, negative) and TEC (1: the new technique is applied, 0: it is not; negative); and practically uncorrelated with the others.
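These correlations can be cross-checked against the loadings. Assuming the analysis was run on standardized variables, the correlation between a variable and a component is the loading multiplied by the standard deviation of that component. The sketch below applies this to the Comp.1 values copied from the report; the 0.5 threshold is the one used in the text.

```python
# Standard deviation of Comp.1 and a selection of its loadings,
# both copied from the R report.
sd_comp1 = 1.5256424
loadings_comp1 = {
    "ATE": -0.452877616,
    "CAL": -0.557717833,
    "EST": -0.440296679,
    "PAR": -0.467053094,
    "AGE": 0.152835097,
    "SAT": -0.079179407,
}

# Assumption: variables were standardized, so corr = loading * sd(component).
correlations = {name: a * sd_comp1 for name, a in loadings_comp1.items()}

# Variables the text calls "strongly correlated" with PC1.
strong = sorted(name for name, r in correlations.items() if abs(r) > 0.5)

for name, r in correlations.items():
    print(f"{name}: {r:.6f}")
print("strong:", strong)
```

The products agree with the PC1 column of the correlation matrix to about six decimal places.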
o Fig. 4: A scatter plot of any two principal components will not show any relationship
o We were able to create this scatter plot because we selected the option that adds the new variables to the original data sheet as additional columns.
o Fig. 5: R adds 3 new columns to the data sheet; they are the principal components chosen by the user
o As a conclusion of this study with principal components we can say:
o The new teaching technique does seem to have some influence, since its associated variable is included in the PC1 component of “good practices and good grades”, although its effect seems smaller (29% correlation) than that of the other good practices: attention in class, etc. On the other hand, the subject in which the method was tested, which is component PC2, is uncorrelated with PC1. This is good: it tells us that “good practices” have the same effects in any subject. The same can be said of the family environment, represented by PC3.
o Reduce the number of variables: factor analysis
o Factor analysis is another technique designed to reduce the number of variables by creating new ones, called factors, as linear combinations of the original ones, which attempt to reveal conditions that are not directly recognizable. Statistical factor analysis software allows for so-called “rotations” of variables, a mathematical transformation that aims to simplify the new description of the variables as much as possible. The results are not the same as with principal components, since the mathematical method is different.
o In R, we go to Statistics -> Dimensional analysis -> Factor analysis and choose all the original variables of the problem. When asked for the number of factors to retain, we try 3. The result is this summary:

Uniquenesses:
 ASIG   ATE   CAL   CLA   AGE   EST  ESTP   PAR  PROF   SAT   TEC
0.077 0.541 0.262 0.983 0.952 0.722 0.986 0.293 0.005 0.995 0.956

Loadings:
      Factor1 Factor2 Factor3
ASIG   0.961
ATE            0.672
CAL            0.769   0.381
CLA                   -0.116
AGE                   -0.198
EST            0.456   0.258
ESTP
PAR            0.289   0.789
PROF   0.997
SAT
TEC                    0.158

               Factor1 Factor2 Factor3
SS loadings      1.947   1.356   0.925
Proportion Var   0.177   0.123   0.084
Cumulative Var   0.177   0.300   0.384
o The report gives the coefficients of the linear combinations for each factor (Loadings table), which always lie in the interval [-1, 1]; the variability explained by each factor; the accumulated variability (the three factors together explain 38.4%); and a chi-squared hypothesis test where H0: the three factors are sufficient, H1: they are not. The test gives a p-value of 0.593, so at the standard significance levels of 10%, 5% and 1% we do not reject H0 (recall that H0 is not rejected when the significance level is less than the p-value). If the null hypothesis had been rejected, we would have repeated the analysis with one more factor.
o For the conclusions we can also look at the “Uniquenesses”: for each variable they give the proportion of its variability not explained by the factors. For example, for the ASIG variable it is 0.077, so only 7.7% is not explained by the factors, meaning it is well summarized by the three factors. For CLA, however, it is more than 90%, so the factors do not provide good information on this variable.
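The link between loadings and uniquenesses can be sketched numerically: for a standardized variable, the uniqueness is 1 minus the sum of its squared loadings (its communality). The values below are copied from the report. Blank cells are taken as 0, which is an approximation, since R leaves blank any loading below its print cutoff; the results therefore match the reported uniquenesses only roughly.

```python
# Printed loadings (Factor1, Factor2, Factor3) copied from the report.
# Assumption: blank cells, which R suppresses when small, are treated as 0.
loadings = {
    "ASIG": (0.961, 0.0, 0.0),
    "ATE":  (0.0, 0.672, 0.0),
    "CAL":  (0.0, 0.769, 0.381),
    "EST":  (0.0, 0.456, 0.258),
    "PAR":  (0.0, 0.289, 0.789),
    "PROF": (0.997, 0.0, 0.0),
}

# Uniquenesses reported by R for the same variables.
reported = {"ASIG": 0.077, "ATE": 0.541, "CAL": 0.262,
            "EST": 0.722, "PAR": 0.293, "PROF": 0.005}

# Uniqueness = 1 - communality, with communality = sum of squared loadings.
approx = {name: 1.0 - sum(a * a for a in row) for name, row in loadings.items()}

for name in loadings:
    print(f"{name}: approx={approx[name]:.3f} reported={reported[name]:.3f}")
```

The same squared loadings, summed down each column instead of across each row, give the "SS loadings" line, and dividing that by the 11 variables gives "Proportion Var".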
o Thus, we summarize the 11 variables by three factors, with the following composition:

o The first factor treats the subject and the teacher who teaches it as an important factor in the study.
o The second factor takes into account attention in class, grades, study skills and active participation in class, similarly to the principal component PC1 in the previous section.
o The third factor relates grades, social class, age, study skills, active participation in class and the application of the new technique, the last with a rather low weight, 0.158.
The conclusions we can draw are:
o In this analysis, the TEC variable under study does not seem to play any role: it enters only factor 3, with a weight of 15.8%, and 95.6% of its variability remains unexplained (Uniquenesses). The variables with the highest weights are CAL and ATE in factor 2, which suggests that attention in class is the variable most correlated with the grade obtained. In factor 3 the dominant variable is PAR, active participation, which has a rather weak relationship with the grade (38.1%) and an even weaker one with the other variables.
