PCA Using R
May 2020
ACTUARIAL TECHNOLOGY TODAY
MODELING SECTION
Principal Component Analysis Using R
By Soumava Dey
In today’s Big Data world, exploratory data analysis has become a stepping stone to discovering underlying data patterns with the help of visualization. Due to the rapid growth in data volume, it has become easy to generate high-dimensional datasets with many variables. However, this growth has also made computation and visualization more tedious.

Principal component analysis (PCA) is one of the most widely used techniques for these two tasks. The purpose of this article is to provide a complete yet simplified explanation of principal component analysis and, in particular, to demonstrate how you can perform this analysis using R.

WHAT IS PCA?
In simple words, PCA is a method of extracting important variables (in the form of components) from a large set of variables available in a data set. PCA is a type of unsupervised linear transformation where we take a dataset with too many variables and untangle the original variables into a smaller set of variables, which we call “principal components.” It is especially useful when dealing with data of three or more dimensions. It enables analysts to explain the variability of the dataset using fewer variables.

WHY PERFORM PCA?
The goals of PCA are to:

1. gain an overall picture of the structure of the high-dimensional data,
2. determine key numerical variables based on their contribution to the maximum variance in the dataset,
3. compress the size of the data set by keeping only the key variables and removing redundant variables, and
4. find out the correlation among key variables and construct new components for further analysis.

Note that the PCA method is particularly useful when the variables within the data set are highly correlated and redundant.

HOW DO WE PERFORM PCA?
Before I start explaining the PCA steps, I will give you a quick rundown of the mathematical formula and description of the principal components.

What are Principal Components?
Principal components are the set of new variables that correspond to a linear combination of the original key variables. The number of principal components is less than or equal to the number of original variables.
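As a small illustration of these points, the sketch below runs base R's prcomp() on a simulated data set in which two of three variables are strongly correlated. The data and every object name here (toy, toy.pca, n_components, var_share) are assumptions made for illustration only; they are not the article's pollution data.

```r
# Minimal sketch: PCA on simulated data with two redundant variables.
# All data and names are illustrative assumptions.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)   # strongly correlated with x1
x3 <- rnorm(n)                  # unrelated to x1 and x2
toy <- data.frame(x1, x2, x3)

# Center and scale the variables, then compute the principal components
toy.pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# The number of principal components cannot exceed the number of variables
n_components <- ncol(toy.pca$rotation)

# Share of the total variance carried by each component
var_share <- toy.pca$sdev^2 / sum(toy.pca$sdev^2)
print(round(var_share, 3))
```

Because x1 and x2 carry nearly the same information, the first component should absorb most of their joint variance, which is the situation in which PCA compresses a data set well.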
Copyright © 2020 Society of Actuaries. All rights reserved. ACTUARIAL TECHNOLOGY TODAY | 11
Figure 1
Principal Components
[Scatter plots of variable 2 against variable 1, before and after PCA, with the PC1 and PC2 axes marked]
Source: ourcodingclub.github.io
In Figure 1, the PC1 axis is the first principal direction, along which the samples show the largest variation. The PC2 axis is the second most important direction, and it is orthogonal to the PC1 axis.

The first principal component of a data set X1, X2, ..., Xp is the linear combination of the features

$Z_1 = \phi_{1,1}X_1 + \phi_{2,1}X_2 + \dots + \phi_{p,1}X_p$

where $\phi_1 = (\phi_{1,1}, \phi_{2,1}, \dots, \phi_{p,1})$ is the loading vector comprising all the loadings of the first principal component.

Covariance and correlation between two variables x and y are computed from the following quantities:
• xi = a given x value in the data set
• xm = the mean, or average, of the x values
• yi = the y value in the data set that corresponds with xi
• ym = the mean, or average, of the y values
• n = the number of data points

Both covariance and correlation indicate whether variables are positively or inversely related. Correlation also tells you the degree to which the variables tend to move together.

Eigenvectors
Eigenvectors are a special set of vectors that satisfy the linear system of equations $Av = \lambda v$, where A is a square matrix, v is an eigenvector and $\lambda$ is the corresponding eigenvalue.

The functions prcomp() [“stats” package] and PCA() [“FactoMineR” package] use the singular value decomposition (SVD).
Figure 2
Computer Code for Pollution Scenarios
library(dplyr)
The output of the function PCA() is a list that includes the following components:

> pollution.pca
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 60 individuals, described by 16 variables
*The results are available in the following objects:

   name               description
1  "$eig"             "eigenvalues"
2  "$var"             "results for the variables"
3  "$var$coord"       "coord. for the variables"
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"
6  "$var$contrib"     "contributions of the variables"
7  "$ind"             "results for the individuals"
8  "$ind$coord"       "coord. for the individuals"
9  "$ind$cos2"        "cos2 for the individuals"
10 "$ind$contrib"     "contributions of the individuals"
11 "$call"            "summary statistics"
12 "$call$centre"     "mean of the variables"
13 "$call$ecart.type" "standard error of the variables"
14 "$call$row.w"      "weights for the individuals"
15 "$call$col.w"      "weights for the variables"
For better interpretation of PCA, we need to visualize the components using R functions provided in the factoextra R package:

get_eigenvalue(): Extract the eigenvalues/variances of principal components
fviz_eig(): Visualize the eigenvalues
fviz_pca_ind(), fviz_pca_var(): Visualize the results for individuals and variables, respectively.

EIGENVALUES
As described in the previous section, eigenvalues measure the variance retained by each principal component. The first principal component has the largest eigenvalue, and each subsequent PC has a smaller one. To determine the eigenvalues and the proportion of variance held by the different PCs of a given data set, we rely on the function get_eigenvalue() from the factoextra package.

library("factoextra")
eig.val <- get_eigenvalue(pollution.pca)
eig.val

> eig.val
eigenvalue variance.percent cumulative.variance.percent
Dim.1 4.878595616 30.49122260 30.49122
Dim.2 2.766574422 17.29109013 47.78231
Dim.3 2.292475683 14.32797302 62.11029
Dim.4 1.351660343 8.44787715 70.55816
Dim.5 1.223507408 7.64692130 78.20508
Dim.6 1.086738477 6.79211548 84.99720
Dim.7 0.661476260 4.13422662 89.13143
Dim.8 0.479425447 2.99640904 92.12784
Dim.9 0.407500850 2.54688031 94.67472
Dim.10 0.244819892 1.53012432 96.20484
Dim.11 0.194097702 1.21311064 97.41795
Dim.12 0.156401959 0.97751224 98.39546
Dim.13 0.116810134 0.73006334 99.12553
Dim.14 0.089284390 0.55802744 99.68355
Dim.15 0.045962000 0.28726250 99.97082
Dim.16 0.004669417 0.02918386 100.00000
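The three columns of this table can be derived by hand from the component standard deviations. The sketch below does so with base R's prcomp() on the built-in USArrests data, an assumption standing in for the pollution data; eig_table, threshold and k are hypothetical names, and the 85 percent threshold is only an illustrative choice.

```r
# PCA on standardized variables (illustrative data, not the article's)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues are the variances of the principal components
eigenvalue       <- pca$sdev^2
variance.percent <- 100 * eigenvalue / sum(eigenvalue)

eig_table <- data.frame(
  eigenvalue                  = eigenvalue,
  variance.percent            = variance.percent,
  cumulative.variance.percent = cumsum(variance.percent),
  row.names                   = paste0("Dim.", seq_along(eigenvalue))
)
print(eig_table)

# Smallest number of components whose cumulative share reaches a threshold
threshold <- 85   # illustrative choice, in percent
k <- which(eig_table$cumulative.variance.percent >= threshold)[1]
print(k)
```

With standardized variables, the eigenvalues add up to the number of variables, which is why dividing an eigenvalue by that count gives its share of the total variance.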
The sum of all the eigenvalues gives a total variance of 16.

The proportion of variance explained by each eigenvalue is shown in the second column, “variance.percent.” For example, 4.878 divided by 16 equals 0.304875, i.e., about 30.49 percent of the variance is explained by the first component/dimension. Based on the output of the eig.val object, we can see that the first six eigenvalues account for almost 85 percent of the total variance in the dataset.

As an alternative approach, we can also examine the pattern of variances using a scree plot, which shows the eigenvalues ordered from largest to smallest. To produce the scree plot (see Figure 3), we will use the function fviz_eig() available in the factoextra package:

fviz_eig(pollution.pca, addlabels = TRUE,
         hjust = -0.3, ylim = c(0, 35))

Figure 3
Scree Plot
[Bar chart; x-axis: Dimensions (1 to 10); y-axis: Percentage of explained variances]

From the scree plot above, we might consider using the first six components for the analysis because about 85 percent of the information in the dataset is retained by these principal components.

VARIABLES CONTRIBUTION GRAPH
The next step is to determine the contribution and the correlation of the variables that make up the principal components of the dataset. To extract these results from a PCA object we use the function get_pca_var(), which provides a list of matrices containing all the results for the active variables (coordinates, correlation between variables and dimensions, squared cosine and contributions).

var_pollution <- get_pca_var(pollution.pca)
var_pollution

> var_pollution
Principal Component Analysis Results for variables

  Name       Description
1 "$coord"   "Coordinates for the variables"
2 "$cor"     "Correlations between variables and dimensions"
3 "$cos2"    "Cos2 for the variables"
4 "$contrib" "contributions of the variables"

CORRELATION CIRCLE PLOT
We can apply different methods to visualize the SVD variances in a correlation plot in order to demonstrate the relationship between variables. The correlation between a variable and a principal component (PC) is used as the coordinates of the variable on the PC.

# Contributions of variables
head(var_pollution$contrib)
> head(var_pollution$contrib)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
PRECReal 11.3363777 1.2901207 2.320962e-04 10.5955205 1.6692242
JANTReal 0.3312333 21.4926762 3.234525e+00 10.8905057 0.6850515
JULTReal 10.2768749 2.7936599 2.838199e+00 0.1211819 15.3922039
OVR65Real 2.4116845 11.8740295 3.795171e+00 27.4577926 0.1384696
POPNReal 8.1452038 0.5965791 3.132326e-03 30.1085931 5.6756771
EDUCReal 8.6313043 2.4816964 1.212178e+01 2.0276052
fviz_pca_var(pollution.pca, col.var = "black")
Figure 4
Relationship Between Variables
[Correlation circle titled “Variables - PCA”; vertical axis: Dim2 (17.3%), ranging from -1.0 to 1.0; each of the 16 variables is drawn as an arrow]
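The statement that a variable's coordinate on a PC is its correlation with that PC can be verified directly. This is a minimal sketch with base R's prcomp() on the built-in USArrests data (an assumption, not the article's data); coord and cor_var_pc are hypothetical names.

```r
X   <- USArrests
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Variable coordinates: each loading column multiplied by the
# standard deviation of its principal component
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)
colnames(coord) <- paste0("Dim.", seq_len(ncol(coord)))

# The same numbers computed directly as correlations between
# the original variables and the component scores
cor_var_pc <- cor(X, pca$x)

print(round(coord[, 1:2], 3))
```

Since these coordinates are correlations, they lie between -1 and 1, which is why the variables can be drawn inside a circle of radius 1.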
QUALITY OF REPRESENTATION
This shows the quality of representation of the variables on the factor map, called cos2 (the squared cosine, which equals the squared coordinates). The previously created object var_pollution holds the cos2 values:
head(var_pollution$cos2)
> head(var_pollution$cos2)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
PRECReal 0.55305602 0.03569215 5.320748e-06 0.143215449 0.020423082
JANTReal 0.01615953 0.59461088 7.415070e-02 0.147202647 0.008381656
JULTReal 0.50136717 0.07728868 6.506502e-02 0.001637968 0.188324755
OVR65Real 0.11765633 0.32850386 8.700336e-02 0.371136093 0.001694186
POPNReal 0.39737155 0.01650481 7.180782e-05 0.406965913 0.069442330
EDUCReal 0.42108643 0.06865798 2.778888e-01 0.027406335 0.025652309
A high cos2 indicates a good representation of the variable on a particular dimension or principal component, whereas a low cos2 indicates that the variable is not well represented by that PC.
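To make the definition concrete, the sketch below computes cos2 by squaring the variable coordinates obtained from base R's prcomp(). It uses the built-in USArrests data as an assumed stand-in for the pollution data, and the names total_cos2 and cos2_plane are hypothetical.

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Variable coordinates (correlations between variables and PCs)
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2 is the squared coordinate of each variable on each dimension
cos2 <- coord^2
colnames(cos2) <- paste0("Dim.", seq_len(ncol(cos2)))
print(round(cos2, 3))

# Across all dimensions, the cos2 values of a variable add up to 1
total_cos2 <- rowSums(cos2)

# Quality of representation on the first two dimensions only
cos2_plane <- rowSums(cos2[, 1:2])
print(round(cos2_plane, 3))
```

A variable whose cos2_plane value is close to 1 sits near the circumference of the correlation circle for the first two dimensions.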
Cos2 values can be presented clearly by using a color gradient in the correlation plot. For instance, we can use three different colors to show the low, mid and high cos2 values of the variables that contribute to the principal components.
fviz_pca_var(pollution.pca, col.var = "cos2",
             gradient.cols = c("green", "blue", "red"),
             repel = TRUE) # Avoid text overlapping
Figure 5
Variables - PCA
[Correlation circle of the variables with a color gradient legend labeled “contrib”; vertical axis: Dim2 (17.3%), ranging from -1.0 to 1.0]
Variables that are close to the circumference (like NONWReal, POORReal and HCReal) are the best represented by the principal components. However, variables like HUMIDReal, DENSReal and SO@Real are only weakly represented by the principal components.

CONTRIBUTION OF VARIABLES TO PCS
After observing the quality of representation, the next step is to explore the contribution of variables to the main PCs. Variable contributions to a given principal component are expressed as percentages.

Key points to remember:

• Variables with a high contribution rate should be retained, as they are the most important in explaining the variability in the dataset.

• Variables with a low contribution rate can be excluded from the dataset in order to reduce the complexity of the data analysis.

The function fviz_contrib() [factoextra package] can be used to draw a bar plot of variable contributions. If your data contains many variables, you can decide to show only the top contributing variables. The R code below (see Code 1 and Figures 6 and 7) shows the top 10 variables contributing to the principal components:

Code 1
# Contributions of variables to PC1
fviz_contrib(pollution.pca, choice = "var",
             axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(pollution.pca, choice = "var",
             axes = 2, top = 10)

The most important (or contributing) variables can be highlighted on the correlation plot as in Code 2 and Figure 8.

Code 2
fviz_pca_var(pollution.pca, col.var = "contrib",
             gradient.cols = c("yellow", "blue", "red"))
Figures 6 and 7
Top 10 Variables Contributing to Principal Components
[Two bar charts of variable contributions, the first titled “Contribution of variables to Dim-1”; vertical axis: Contributions (%)]
Figure 8
Graphical Display of the Eigenvectors and Their Relative Contribution
[Correlation circle titled “Variables - PCA” with a color gradient legend labeled “cos2”; vertical axis: Dim2 (17.3%), ranging from -1.0 to 1.0]
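The contribution percentages discussed in this section can also be computed by hand: a variable's contribution to a component is its cos2 on that component divided by the component's eigenvalue, times 100. The sketch below assumes base R's prcomp() and the built-in USArrests data rather than the pollution data; contrib and top_pc1 are hypothetical names.

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# cos2 of each variable on each component (squared coordinates)
cos2 <- sweep(pca$rotation, 2, pca$sdev, `*`)^2

# Contribution (%): cos2 divided by the column total of cos2,
# which equals the eigenvalue of that component
contrib <- 100 * sweep(cos2, 2, colSums(cos2), `/`)
colnames(contrib) <- paste0("Dim.", seq_len(ncol(contrib)))

# Variables ranked by their contribution to the first component
top_pc1 <- sort(contrib[, "Dim.1"], decreasing = TRUE)
print(round(top_pc1, 2))
```

Within each component the contributions add up to 100 percent, so a ranking like top_pc1 is what a bar plot of contributions displays.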
BIPLOT
To make a simple biplot of individuals and variables, type this:
Code 3
fviz_pca_biplot(pollution.pca,
                col.ind = pollution$MORTReal_TYPE, palette = "jco",
                addEllipses = TRUE, label = "var",
                col.var = "black", repel = TRUE,
                legend.title = "Mortality_Range")
Figure 9
Mortality Rate Value and Corresponding Key Variables Grouped
[Biplot of individuals and variables with a legend labeled “Contrib”; horizontal axis: Dim1 (30.2%)]
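For readers without factoextra installed, a rough base R counterpart of a biplot is sketched below. It is not the article's code: it uses prcomp() and the built-in USArrests data, and the two-level grouping variable (group) is a hypothetical stand-in for the mortality range used above. Base biplot() does not color groups or draw ellipses, so the groups are summarized numerically instead.

```r
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Hypothetical grouping of individuals for this sketch only:
# split the observations at the median murder rate
group <- ifelse(USArrests$Murder > median(USArrests$Murder), "high", "low")

# Draw individuals and variable arrows on the first two components.
# A null device is used so the sketch also runs non-interactively.
pdf(NULL)
biplot(pca, scale = 0, cex = 0.6)
invisible(dev.off())

# Mean position of each group on the first two components
group_centers <- aggregate(as.data.frame(pca$x[, 1:2]),
                           by = list(group = group), FUN = mean)
print(group_centers)
```

Replacing pdf(NULL) with an interactive session or a real file name shows the plot; the group centers indicate how far apart the groups sit along each component.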
For Python Users
To implement PCA in Python, simply import PCA from the sklearn library (the sklearn.decomposition module). The interpretation of the results remains the same as explained for R users above.

Multidimensional reduction capability was used to build a wide range of PCA applications in the field of medical image processing, such as feature extraction, image fusion, image compression, image segmentation, image registration and de-noising of images.
This article does not outline the model building technique, but the six principal components can be used to construct some kind of model for prediction purposes.

Further Reading
PCA using prcomp() and princomp() (tutorial). http://www.sthda.com/english/wiki/pca-using-prcomp-and-princomp
Libin Yang. 2015. “An Application of Principal Component Analysis to Stock Portfolio Management.” https://ir.canterbury.ac.nz/bitstream/handle/10092/10293/thesis.pdf
https://www.researchgate.net/publication/272576742_Principal_Component_Analysis_in_Medical_Image_Processing_A_Study
https://rdrr.io/cran/factoextra/man/fviz_pca.html