PCA Using R

The article discusses Principal Component Analysis (PCA) as a method for simplifying large datasets by extracting key variables while removing redundancy. It explains the mathematical foundations of PCA, including eigenvalues and eigenvectors, and provides guidance on performing PCA using R, specifically through the FactoMineR package. The article also includes a practical example with a pollution dataset to demonstrate the application of PCA in data analysis.


Article from Actuarial Technology Today, May 2020 (Modeling Section)

Principal Component Analysis Using R
By Soumava Dey

In today’s Big Data world, exploratory data analysis has become a stepping stone to discovering underlying data patterns with the help of visualization. Due to the rapid growth in data volume, it has become easy to generate high-dimensional datasets with many variables. However, that growth has also made computation and visualization more tedious.

There are two ways of simplifying the description of high-dimensional datasets:

1. remove redundant dimensions or variables, and
2. retain the most important dimensions/variables.

Principal component analysis (PCA) is a widely used technique for performing these two tasks. The purpose of this article is to provide a complete yet simplified explanation of principal component analysis and, in particular, to demonstrate how you can perform this analysis using R.

WHAT IS PCA?
In simple words, PCA is a method of extracting important variables (in the form of components) from a large set of variables available in a data set. PCA is a type of unsupervised linear transformation in which we take a dataset with too many variables and untangle the original variables into a smaller set of variables, which we call “principal components.” It is especially useful when dealing with data in three or more dimensions. It enables analysts to explain the variability of the dataset using fewer variables.

WHY PERFORM PCA?
The goals of PCA are to:

1. gain an overall view of the structure of high-dimensional data,
2. determine key numerical variables based on their contribution to the maximum variance in the dataset,
3. compress the size of the data set by keeping only the key variables and removing redundant variables, and
4. find out the correlation among key variables and construct new components for further analysis.

Note that the PCA method is particularly useful when the variables within the data set are highly correlated and redundant.

HOW DO WE PERFORM PCA?
Before I start explaining the PCA steps, I will give you a quick rundown of the mathematical formulas and a description of the principal components.

What are Principal Components?
Principal components are a set of new variables, each of which is a linear combination of the original variables. The number of principal components is less than or equal to the number of original variables.

Copyright © 2020 Society of Actuaries. All rights reserved. ACTUARIAL TECHNOLOGY TODAY | 11

Figure 1
Principal Components
[Figure: two panels plotting variable 2 against variable 1, before and after PCA, with the PC1 and PC2 axes marked.]
Source: ourcodingclub.github.io

In Figure 1, the PC1 axis is the first principal direction, along which the samples show the largest variation. The PC2 axis is the second most important direction, and it is orthogonal to the PC1 axis.

The first principal component of a data set X1, X2, ..., Xp is the linear combination of the features

$$Z_1 = \phi_{1,1} X_1 + \phi_{2,1} X_2 + \dots + \phi_{p,1} X_p$$

that has the largest variance. Here $\phi_1 = (\phi_{1,1}, \phi_{2,1}, \dots, \phi_{p,1})$ is the loading vector, comprising all the loadings of the first principal component.

The second principal component is the linear combination of X1, ..., Xp that has maximal variance out of all linear combinations that are uncorrelated with Z1. The second principal component scores $z_{1,2}, z_{2,2}, \dots, z_{n,2}$ take the form

$$Z_2 = \phi_{1,2} X_1 + \phi_{2,2} X_2 + \dots + \phi_{p,2} X_p$$

It is necessary to understand the meaning of covariance and eigenvectors before we go further into principal component analysis.

Covariance
Covariance measures how much two dimensions vary from their means with respect to each other. For example, the covariance between two random variables X and Y can be calculated using the following formula (for a sample, hence the n - 1 divisor):

$$\mathrm{Cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - x_m)(y_i - y_m)}{n - 1}$$

• xi = a given x value in the data set
• xm = the mean, or average, of the x values
• yi = the y value in the data set that corresponds with xi
• ym = the mean, or average, of the y values
• n = the number of data points

Both covariance and correlation indicate whether variables are positively or inversely related. Correlation also tells you the degree to which the variables tend to move together.

Eigenvectors
Eigenvectors are a special set of vectors that satisfy the linear system of equations

$$A v = \lambda v$$

where A is an (n x n) square matrix, v is the eigenvector, and λ is the eigenvalue. Eigenvalues measure the amount of variance retained by the principal components. For instance, eigenvalues tend to be large for the first component and smaller for the subsequent principal components. The number of eigenvalues and eigenvectors of a given dataset is equal to the number of dimensions that dataset has. Depending upon the variances explained by the eigenvalues, we can determine the most important principal components that can be used for further analysis.

GENERAL METHODS FOR PRINCIPAL COMPONENT ANALYSIS USING R
Singular value decomposition (SVD) is considered to be a general method for PCA. This method examines the correlations between individuals.

The functions prcomp() [“stats” package] and PCA() [“FactoMineR” package] use the SVD.
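To make the link between covariance, eigenvectors and principal components more concrete, here is a minimal base-R sketch that computes a PCA "by hand" and compares it with prcomp(). It uses the built-in USArrests data instead of the pollution data, purely as an assumption so that the example is self-contained; the variable names are illustrative.

```r
# Base-R sketch: PCA "by hand" from the covariance matrix of standardized data.
# USArrests is a built-in dataset, used here only for illustration.
X <- scale(USArrests)              # center each column and scale it to unit variance

# Sample covariance of two columns, written out term by term (n - 1 divisor)
x <- X[, "Murder"]
y <- X[, "Assault"]
cov_manual <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

S <- cov(X)                        # covariance matrix (equals the correlation matrix here)
e <- eigen(S)                      # solves S v = lambda v

# Scores: each component is a linear combination of the standardized variables
scores_manual <- X %*% e$vectors

pca <- prcomp(USArrests, scale. = TRUE)   # SVD-based PCA from the stats package

print(all.equal(cov_manual, cov(x, y)))
print(all.equal(e$values, pca$sdev^2))    # eigenvalues are the variances of the components
# Eigenvectors are only defined up to sign, so compare absolute values
print(all.equal(abs(scores_manual), abs(pca$x), check.attributes = FALSE))
```

The comparison uses absolute values because the sign of an eigenvector is arbitrary: two routines can return components that point in opposite directions and still describe the same axes.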


The PCA() function comes from FactoMineR. So, install this package along with another package called factoextra, which will be used to visualize the results of PCA.

In this article, I will demonstrate the SVD method using the PCA() function and visualize the variance results.

Dataset Description
I will explore the principal components of a dataset extracted from the KEEL-dataset repository.

This dataset was proposed in McDonald, G.C. and Schwing, R.C. (1973) “Instabilities of Regression Estimates Relating Air Pollution to Mortality,” Technometrics, vol. 15, 463-482. It contains 16 attributes describing 60 different pollution scenarios. The attributes are the following:

1. PRECReal: Average annual precipitation in inches
2. JANTReal: Average January temperature in degrees F
3. JULTReal: Same for July
4. OVR65Real: % of 1960 SMSA population aged 65 or older
5. POPNReal: Average household size
6. EDUCReal: Median school years completed by those over 22
7. HOUSReal: % of housing units which are sound and with all facilities
8. DENSReal: Population per sq. mile in urbanized areas, 1960
9. NONWReal: % non-white population in urbanized areas, 1960
10. WWDRKReal: % employed in white collar occupations
11. POORReal: % of families with income less than $3000
12. HCReal: Relative hydrocarbon pollution potential
13. NOXReal: Same for nitric oxides
14. SO@Real: Same for sulphur dioxide
15. HUMIDReal: Annual average % relative humidity at 1pm
16. MORTReal: Total age-adjusted mortality rate per 100,000

Figure 2
Computer Code for Pollution Scenarios

# Read the KEEL file, skipping its 19 header lines
pollution <- read.delim("pollution.dat",
                        header = FALSE, skip = 19, sep = ",")

colnames(pollution) <- c("PRECReal", "JANTReal", "JULTReal", "OVR65Real",
                         "POPNReal", "EDUCReal", "HOUSReal", "DENSReal",
                         "NONWReal", "WWDRKReal", "POORReal", "HCReal",
                         "NOXReal", "SO@Real", "HUMIDReal", "MORTReal")

library(dplyr)

# Band the mortality rate into three ranges; >= keeps the boundary values
# 900 and 1000 from falling through to NA
pollution <- mutate(pollution,
                    MORTReal_TYPE = case_when(
                      MORTReal < 900.0 ~ "Low Mortality",
                      MORTReal >= 900.0 & MORTReal < 1000.0 ~ "Medium Mortality",
                      MORTReal >= 1000.0 ~ "High Mortality"))

The code in Figure 2 loads the dataset into an R data frame and names all 16 variables. In order to define different ranges of mortality rate, one extra column named “MORTReal_TYPE” has been created in the R data frame. This extra column will be useful for creating data visualizations based on mortality rates.

Compute Principal Components Using PCA()
The PCA() function [FactoMineR package] is very useful for identifying the principal components and the contributing variables associated with those PCs. A simplified format is:

library("FactoMineR")

# Column 17 (MORTReal_TYPE) is not numeric, so it is left out of the analysis
pollution.PCA <- PCA(pollution[, -17],
                     scale.unit = TRUE, graph = FALSE)

• pollution: a data frame. Rows are individuals and columns are numeric variables.

• scale.unit: a logical value. If TRUE, the data are scaled to unit variance before the analysis. This standardization to the same scale prevents some variables from becoming dominant just because of their large measurement units. It makes the variables comparable.

• graph: a logical value. If TRUE, a graph is displayed.
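The effect of the scale.unit argument can be illustrated with a small base-R sketch. It relies on prcomp() and the built-in USArrests data as stand-ins for PCA() and the pollution data, which is an assumption made only to keep the example runnable without the data file; the scale. argument of prcomp() plays a role similar to scale.unit.

```r
# Base-R sketch of why scaling matters (prcomp() stands in for PCA() here).
# In USArrests, Assault has a much larger variance than the other three columns.
raw    <- prcomp(USArrests, scale. = FALSE)  # analysis on the original measurement scales
scaled <- prcomp(USArrests, scale. = TRUE)   # analysis on variables scaled to unit variance

# Absolute loadings of each variable on the first component
load_raw    <- abs(raw$rotation[, 1])
load_scaled <- abs(scaled$rotation[, 1])

# Share of the total variance carried by the first component
share_raw    <- raw$sdev[1]^2 / sum(raw$sdev^2)
share_scaled <- scaled$sdev[1]^2 / sum(scaled$sdev^2)

print(apply(USArrests, 2, var))                     # variances before scaling
print(round(cbind(load_raw, load_scaled), 3))       # loadings with and without scaling
print(round(c(raw = share_raw, scaled = share_scaled), 3))
```

Without scaling, the first component is driven almost entirely by the variable with the largest variance; with scaling, the loadings are spread across several variables.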


The output of the function PCA() is a list that includes the following components:

> pollution.PCA
**Results for the Principal Component Analysis (PCA)**
The analysis was performed on 60 individuals, described by 16 variables
*The results are available in the following objects:

   name               description
1  "$eig"             "eigenvalues"
2  "$var"             "results for the variables"
3  "$var$coord"       "coord. for the variables"
4  "$var$cor"         "correlations variables - dimensions"
5  "$var$cos2"        "cos2 for the variables"
6  "$var$contrib"     "contributions of the variables"
7  "$ind"             "results for the individuals"
8  "$ind$coord"       "coord. for the individuals"
9  "$ind$cos2"        "cos2 for the individuals"
10 "$ind$contrib"     "contributions of the individuals"
11 "$call"            "summary statistics"
12 "$call$centre"     "mean of the variables"
13 "$call$ecart.type" "standard error of the variables"
14 "$call$row.w"      "weights for the individuals"
15 "$call$col.w"      "weights for the variables"

For better interpretation of PCA, we need to visualize the components using R functions provided in the factoextra R package:

get_eigenvalue(): Extract the eigenvalues/variances of principal components
fviz_eig(): Visualize the eigenvalues
fviz_pca_ind(), fviz_pca_var(): Visualize the results for individuals and variables, respectively.

EIGENVALUES
As described in the previous section, eigenvalues are used to measure the variances retained by the principal components. The first principal component has the largest eigenvalue and the subsequent PCs have smaller values. To determine the eigenvalues and the proportion of variance held by the different PCs of a given data set, we rely on the R function get_eigenvalue() from the factoextra package.

library("factoextra")
eig.val <- get_eigenvalue(pollution.PCA)
eig.val

“eig.val”
eigenvalue variance.percent cumulative.variance.percent
Dim.1 4.878595616 30.49122260 30.49122
Dim.2 2.766574422 17.29109013 47.78231
Dim.3 2.292475683 14.32797302 62.11029
Dim.4 1.351660343 8.44787715 70.55816
Dim.5 1.223507408 7.64692130 78.20508
Dim.6 1.086738477 6.79211548 84.99720
Dim.7 0.661476260 4.13422662 89.13143
Dim.8 0.479425447 2.99640904 92.12784
Dim.9 0.407500850 2.54688031 94.67472
Dim.10 0.244819892 1.53012432 96.20484
Dim.11 0.194097702 1.21311064 97.41795
Dim.12 0.156401959 0.97751224 98.39546
Dim.13 0.116810134 0.73006334 99.12553
Dim.14 0.089284390 0.55802744 99.68355
Dim.15 0.045962000 0.28726250 99.97082
Dim.16 0.004669417 0.02918386 100.00000
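The percentage columns in this output can be reproduced from the eigenvalue column alone. The sketch below copies the eigenvalues from the table and recomputes the shares in base R; the 80 percent threshold used to pick a number of components is an illustrative choice, not a rule taken from the article.

```r
# Eigenvalues copied from the eig.val output above
eigenvalue <- c(4.878595616, 2.766574422, 2.292475683, 1.351660343,
                1.223507408, 1.086738477, 0.661476260, 0.479425447,
                0.407500850, 0.244819892, 0.194097702, 0.156401959,
                0.116810134, 0.089284390, 0.045962000, 0.004669417)

total_variance   <- sum(eigenvalue)                  # one unit per standardized variable
variance_percent <- 100 * eigenvalue / total_variance
cumulative       <- cumsum(variance_percent)

# Smallest number of components whose cumulative share reaches the threshold
# (80 percent is an assumption made for this example)
k <- which(cumulative >= 80)[1]

print(round(total_variance, 6))
print(round(head(cbind(eigenvalue, variance_percent, cumulative)), 5))
print(k)
```

Because the variables were scaled to unit variance, each of the 16 variables contributes one unit to the total, which is why the eigenvalues sum to 16.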


The sum of all the eigenvalues gives a total variance of 16, because each of the 16 standardized variables contributes one unit of variance.

The proportion of variance explained by each eigenvalue is shown in the second column, “variance.percent.” For example, 4.878 divided by 16 equals 0.304875, i.e., about 30.49 percent of the variance is explained by the first component/dimension. Based on the output of the eig.val object, we can see that the first six eigenvalues retain about 85 percent of the total variance in the dataset.

As an alternative approach, we can also examine the pattern of variances using a scree plot, which shows the eigenvalues ordered from largest to smallest. In order to produce the scree plot (see Figure 3), we will use the function fviz_eig() available in the factoextra package:

fviz_eig(pollution.PCA, addlabels = TRUE,
         hjust = -0.3, ylim = c(0, 35))

From the scree plot in Figure 3, we might consider using the first six components for the analysis because about 85 percent of the information in the whole dataset is retained by these principal components.

VARIABLES CONTRIBUTION GRAPH
The next step is to determine the contribution and the correlation of the variables with respect to the principal components of the dataset. In order to extract the results for the variables from a PCA object, we use the function get_pca_var(), which provides a list of matrices containing all the results for the active variables (coordinates, correlation between variables and dimensions, squared cosine and contributions).

var_pollution <- get_pca_var(pollution.PCA)
var_pollution

> var_pollution
Principal Component Analysis Results for variables

  Name       Description
1 "$coord"   "Coordinates for the variables"
2 "$cor"     "Correlations between variables and dimensions"
3 "$cos2"    "Cos2 for the variables"
4 "$contrib" "contributions of the variables"

CORRELATION CIRCLE PLOT
We can apply different methods to visualize the SVD variances in a correlation plot in order to demonstrate the relationship between variables. The correlation between a variable and a principal component (PC) is used as the coordinate of the variable on the PC.

# Contributions of the variables to the dimensions
head(var_pollution$contrib)

Figure 3
Scree Plot
[Figure: scree plot; x-axis: Dimensions (1 to 10), y-axis: Percentage of explained variances.]
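For readers who want to see what goes into a scree plot, the following base-R sketch draws one without factoextra. It uses prcomp() on the built-in USArrests data as a stand-in for the PCA() object, which is an assumption made so that the example runs on its own; the value labels imitate what addlabels = TRUE is used for in the article.

```r
# Base-R scree plot sketch; prcomp() on built-in data stands in for the PCA() object
pca <- prcomp(USArrests, scale. = TRUE)
variance_percent <- 100 * pca$sdev^2 / sum(pca$sdev^2)

# barplot() returns the x positions of the bar centers
mids <- barplot(variance_percent,
                names.arg = seq_along(variance_percent),
                xlab = "Dimensions",
                ylab = "Percentage of explained variances",
                main = "Scree plot",
                ylim = c(0, 1.15 * max(variance_percent)))
lines(mids, variance_percent, type = "b")                    # line through the bar tops
text(mids, variance_percent,
     labels = sprintf("%.1f%%", variance_percent), pos = 3)  # value labels above the bars
```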


> head(var_pollution$contrib)
               Dim.1      Dim.2        Dim.3      Dim.4      Dim.5
PRECReal  11.3363777  1.2901207 2.320962e-04 10.5955205  1.6692242
JANTReal   0.3312333 21.4926762 3.234525e+00 10.8905057  0.6850515
JULTReal  10.2768749  2.7936599 2.838199e+00  0.1211819 15.3922039
OVR65Real  2.4116845 11.8740295 3.795171e+00 27.4577926  0.1384696
POPNReal   8.1452038  0.5965791 3.132326e-03 30.1085931  5.6756771
EDUCReal   8.6313043  2.4816964 1.212178e+01  2.0276052

To plot all the variables we can use fviz_pca_var():

fviz_pca_var(pollution.PCA, col.var = "black")

Figure 4
Relationship Between Variables
[Figure: correlation circle titled "Variables - PCA" showing the 16 variables; x-axis: Dim1 (30.5%), y-axis: Dim2 (17.3%), both running from -1.0 to 1.0.]

Figure 4 shows the relationship between variables in three different ways:

• Positively correlated variables are grouped together.
• Negatively correlated variables are located on opposite sides of the plot origin.
• The distance between a variable and the origin measures the quality of the variable on the factor map. Variables that are far from the origin are well represented on the factor map.


QUALITY OF REPRESENTATION
The quality of representation of the variables on the factor map is called cos2 (squared cosine). In a standardized PCA such as this one, the cos2 of a variable on a dimension equals its squared coordinate on that dimension. The previously created object var_pollution holds the cos2 values:

head(var_pollution$cos2)

> head(var_pollution$cos2)
Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
PRECReal 0.55305602 0.03569215 5.320748e-06 0.143215449 0.020423082
JANTReal 0.01615953 0.59461088 7.415070e-02 0.147202647 0.008381656
JULTReal 0.50136717 0.07728868 6.506502e-02 0.001637968 0.188324755
OVR65Real 0.11765633 0.32850386 8.700336e-02 0.371136093 0.001694186
POPNReal 0.39737155 0.01650481 7.180782e-05 0.406965913 0.069442330
EDUCReal 0.42108643 0.06865798 2.778888e-01 0.027406335 0.025652309

A high cos2 indicates a good representation of the variable on a particular dimension or principal component, whereas a low cos2 indicates that the variable is not well represented by that principal component.
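As a rough illustration of where cos2 values come from, the base-R sketch below rebuilds them from a prcomp() fit. USArrests and prcomp() are stand-ins for the pollution data and the PCA() object, an assumption made only so the example is self-contained.

```r
# Base-R sketch of cos2; prcomp() on built-in data stands in for the PCA() object
pca <- prcomp(USArrests, scale. = TRUE)

# Coordinates of the variables: loadings multiplied by the component standard deviations.
# For standardized data these equal the variable-component correlations.
coord <- sweep(pca$rotation, 2, pca$sdev, "*")
cos2  <- coord^2

print(round(cos2, 3))
print(rowSums(cos2))   # summed over all components, each variable is fully represented
print(all.equal(coord, cor(USArrests, pca$x), check.attributes = FALSE))
```

Summed over all components, the cos2 values of a variable add up to 1, so a high value on the first two dimensions means that little of the variable is left for the remaining ones.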

Cos2 values can be presented clearly by using a color gradient in a correlation plot. For instance, we can use three different colors to present the low, mid and high cos2 values of the variables that contribute to the principal components.

fviz_pca_var(pollution.PCA, col.var = "cos2",
             gradient.cols = c("green", "blue", "red"),
             repel = TRUE) # Avoid text overlapping

Figure 5
Variables - PCA
[Figure: correlation circle titled "Variables - PCA" showing the 16 variables; x-axis: Dim1 (30.5%), y-axis: Dim2 (17.3%); color legend labeled "contrib" with levels 2.5 to 10.0.]


Variables that are close to the circumference (like NONWReal, POORReal and HCReal) are the best represented by the principal components. However, variables like HUMIDReal, DENSReal and SO@Real show weak representation by the principal components.

CONTRIBUTION OF VARIABLES TO PCS
After observing the quality of representation, the next step is to explore the contribution of variables to the main PCs. Variable contributions to a given principal component are expressed as percentages.

Key points to remember:

• Variables with a high contribution rate should be retained, as those are the most important components that can explain the variability in the dataset.

• Variables with a low contribution rate can be excluded from the dataset in order to reduce the complexity of the data analysis.

The function fviz_contrib() [factoextra package] can be used to draw a bar plot of variable contributions. If your data contains many variables, you can decide to show only the top contributing variables. The R code below (see Code 1 and Figures 6 and 7) shows the top 10 variables contributing to the principal components:

Code 1
# Contributions of variables to PC1
fviz_contrib(pollution.PCA, choice = "var",
             axes = 1, top = 10)
# Contributions of variables to PC2
fviz_contrib(pollution.PCA, choice = "var",
             axes = 2, top = 10)

The most important (or contributing) variables can be highlighted on the correlation plot as in Code 2 and Figure 8.

Code 2
fviz_pca_var(pollution.PCA, col.var = "contrib",
             gradient.cols = c("yellow", "blue", "red"))
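The contribution percentages are tied to cos2 and the eigenvalues: a variable's contribution to a component is its cos2 on that component divided by the component's eigenvalue, times 100. With the tables in this article, PRECReal has a cos2 of 0.553 on Dim.1 and the first eigenvalue is 4.879, which gives roughly 11.34 percent. The base-R sketch below shows the same calculation on the built-in USArrests data with prcomp(), both used as stand-ins; the equal-contribution reference line is an illustrative convention.

```r
# Base-R sketch of variable contributions; prcomp() stands in for the PCA() object
pca  <- prcomp(USArrests, scale. = TRUE)
cos2 <- sweep(pca$rotation, 2, pca$sdev, "*")^2

# Contribution (%) of each variable to each component:
# its cos2 divided by the component's eigenvalue (the column total of cos2)
contrib <- 100 * sweep(cos2, 2, colSums(cos2), "/")

# Reference level: what each variable would contribute if all contributed equally
expected <- 100 / nrow(contrib)

pc1 <- sort(contrib[, "PC1"], decreasing = TRUE)
barplot(pc1, ylab = "Contributions (%)",
        main = "Contribution of variables to PC1", las = 2)
abline(h = expected, lty = 2)

print(round(contrib, 2))
print(names(pc1)[pc1 > expected])   # variables above the equal-contribution level
```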

Figures 6 and 7
Top 10 Variables Contributing to Principal Components
[Figures: two bar plots titled "Contribution of variables to Dim-1" and "Contribution of variables to Dim-2"; y-axis: Contributions (%); x-axis: the top 10 contributing variables.]

Figure 8
Graphical Display of the Eigen Vector and Their Relative Contribution
[Figure: correlation circle titled "Variables - PCA" showing the 16 variables; x-axis: Dim1 (30.5%), y-axis: Dim2 (17.3%); color legend labeled "cos2" with levels 0.2 to 0.8.]

BIPLOT
To make a simple biplot of individuals and variables, type this:
Code 3
fviz_pca_biplot(pollution.PCA,
                col.ind = as.factor(pollution$MORTReal_TYPE), palette = "jco",
                addEllipses = TRUE, label = "var",
                col.var = "black", repel = TRUE,
                legend.title = "Mortality_Range")
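The idea behind a grouped biplot can also be sketched in base R. The example below bands one variable of the built-in USArrests data into three groups, similar in spirit to the mortality ranges, and overlays the variables as arrows on the individuals. The dataset, the tercile bands and the arrow_scale factor are all assumptions chosen for illustration, and prcomp() stands in for the PCA() object.

```r
# Base-R sketch of a grouped biplot; USArrests and the three bands are illustrative only
pca <- prcomp(USArrests, scale. = TRUE)

# Band one variable into three labeled groups, similar in spirit to the mortality ranges.
# cut() assigns every value to exactly one band, including values on a boundary.
breaks <- quantile(USArrests$Murder, probs = c(0, 1/3, 2/3, 1))
group  <- cut(USArrests$Murder, breaks = breaks,
              labels = c("Low", "Medium", "High"), include.lowest = TRUE)

# Individuals on the first two components, colored by group
plot(pca$x[, 1], pca$x[, 2], col = as.integer(group), pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Individuals colored by group")
legend("topright", legend = levels(group),
       col = seq_along(levels(group)), pch = 19)

# Variables as arrows, stretched by an arbitrary factor to fit the range of the scores
arrow_scale <- 2
arrows(0, 0, arrow_scale * pca$rotation[, 1], arrow_scale * pca$rotation[, 2],
       length = 0.08)
text(arrow_scale * pca$rotation[, 1], arrow_scale * pca$rotation[, 2],
     labels = rownames(pca$rotation), pos = 3, cex = 0.8)
```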


Figure 9
Mortality Rate Value and Corresponding Key Variables Grouped
[Figure: biplot titled "2D Bi-plot from the pollution dataset"; x-axis: Dim1 (30.2%), y-axis: Dim2 (18.4%); legends: Contrib (2.5 to 10.0) and Mortality Type (High Mortality, Low Mortality, Medium Mortality).]

In Figure 9, the column “MORTReal_TYPE” has been used to group the mortality rate values and the corresponding key variables.

SUMMARY
PCA is unsupervised, so this analysis does not make predictions about the pollution rate; rather, it simply shows the variability of the dataset using fewer variables. Key observations derived from the sample PCA described in this article are:

1. Six dimensions account for about 85 percent of the variance of the whole data set.

2. The following variables are the key contributors to the variability of the data set: NONWReal, POORReal, HCReal, NOXReal, HOUSReal and MORTReal.

3. Correlation plots and the biplot help to identify and interpret correlation among the key variables.

For Python Users
To implement PCA in Python, import PCA from the sklearn library (the sklearn.decomposition module). The interpretation of the results remains the same as explained for R users above.

INDUSTRY APPLICATION USE
PCA is a very common mathematical technique for dimension reduction that is applicable across industries related to STEM (science, technology, engineering and mathematics). In particular, this technique has become widely popular in areas of quantitative finance. For instance, fund portfolio managers often use PCA to identify the main mathematical factors that drive the movement of all stocks. That, in turn, helps in forecasting portfolio returns, analyzing the risk of large institutional portfolios and developing asset allocation algorithms for equity portfolios.

PCA is also regarded as a multivariate statistical tool that is useful in computer network analysis for identifying hacking or intrusion activities. Network traffic data is typically high-dimensional, making it difficult to analyze and visualize. Dimension reduction techniques and biplots help in understanding the network activity and provide a summary of possible intrusion statistics. In a study from UC Davis (Labib and Vemuri), PCA is applied to selected network attacks from the DARPA 1998 intrusion detection datasets, namely Denial-of-Service and Network Probe attacks.

The dimension reduction capability of PCA has been used to build a wide range of applications in the field of medical image processing, such as feature extraction, image fusion, image compression, image segmentation, image registration and de-noising of images.


Because it is a multivariate technique, PCA can identify patterns in high-dimensional data and can serve applications in pattern recognition problems. For example, one variant of PCA is kernel principal component analysis (KPCA), which can be used for analyzing ultrasound medical images of liver cancer (Hu and Gui, 2008). Compared with the experiments using wavelets, the KPCA experiment showed that KPCA is more effective than wavelets, especially in the application to ultrasound medical images.

CONCLUSION
This tutorial gets you started with using PCA. Many statistical techniques, including regression, classification, and clustering, can be easily adapted to use principal components.

PCA helps to produce better visualizations of high-dimensional data. The sample analysis only helps to identify the key variables that can be used as predictors for building a regression model for estimating the relation of air pollution to mortality. My article does not outline the model building technique, but the six principal components can be used to construct some kind of model for prediction purposes.

Further Reading
PCA using prcomp() and princomp() (tutorial). http://www.sthda.com/english/wiki/pca-using-prcomp-and-princomp

PCA using ade4 and factoextra (tutorial). http://www.sthda.com/english/wiki/pca-using-ade4-and-factoextra

Soumava Dey is an actuarial systems analyst at AIG. He can be contacted at [email protected].

REFERENCE
Husson, Francois, Sebastien Le, and Jérôme Pagès. 2017. Exploratory Multivariate Analysis by Example Using R. 2nd ed. Boca Raton, Florida: Chapman; Hall/CRC. http://factominer.free.fr/bookV2/index.html.

Abdi, Hervé, and Lynne J. Williams. 2010. “Principal Component Analysis.” John Wiley and Sons, Inc. WIREs Comp Stat 2: 433–59. http://staff.ustc.edu.cn/~zwp/teach/MVA/abdi-awPCA2010.pdf.

KEEL-dataset citation paper: J. Alcalá-Fdez, A. Fernandez, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera. “KEEL Data-Mining Software Tool: Data Set Repository, Integration of Algorithms and Experimental Analysis Framework.” Journal of Multiple-Valued Logic and Soft Computing 17:2-3 (2011) 255-287.

Khaled Labib and V. Rao Vemuri. “An Application of Principal Component Analysis to the Detection and Visualization of Computer Network Attacks.” https://web.cs.ucdavis.edu/~vemuri/papers/pcaVisualization.pdf

Libin Yang. 2015. “An Application of Principal Component Analysis to Stock Portfolio Management.” https://ir.canterbury.ac.nz/bitstream/handle/10092/10293/thesis.pdf

https://www.researchgate.net/publication/272576742_Principal_Component_Analysis_in_Medical_Image_Processing_A_Study

https://rdrr.io/cran/factoextra/man/fviz_pca.html
