
A step by step explanation of Principal Component Analysis

Zakaria Jaadi
Feb 28 · 9 min read

The purpose of this post is to provide a complete and simplified explanation of Principal Component Analysis, and especially to answer how it works step by step, so that everyone can understand it and make use of it, without necessarily having a strong mathematical background.

PCA is actually a widely covered method on the web, and there are some great articles about it, but only a few of them go straight to the point and explain how it works without diving too much into the technicalities and the ‘why’ of things. That’s why I decided to write my own post and present it in a simplified way.

To that end, this post provides a logical explanation of what PCA is doing in each step and simplifies the mathematical concepts behind it, such as standardization, covariance, eigenvectors and eigenvalues, without focusing on how to compute them.

So what is Principal Component Analysis?
Principal Component Analysis, or PCA, is a dimensionality-reduction
method that is often used to reduce the dimensionality of large data sets, by
transforming a large set of variables into a smaller one that still contains
most of the information in the large set.

Reducing the number of variables of a data set naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Smaller data sets are easier to explore and visualize, and they make analyzing data much easier and faster for machine learning algorithms, since there are no extraneous variables to process.

So to sum up, the idea of PCA is simple — reduce the number of variables of
a data set, while preserving as much information as possible.

Step by step explanation


Step 1: Standardization
The aim of this step is to standardize the range of the continuous initial
variables so that each one of them contributes equally to the analysis.

More specifically, the reason why it is critical to perform standardization prior to PCA is that the latter is quite sensitive to the variances of the initial variables. That is, if there are large differences between the ranges of the initial variables, the variables with larger ranges will dominate over those with small ranges (for example, a variable that ranges between 0 and 100 will dominate over a variable that ranges between 0 and 1), which will lead to biased results. So, transforming the data to comparable scales can prevent this problem.

Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for each value of each variable.

Once the standardization is done, all the variables will be transformed to the same scale.
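To make this step concrete, here is a minimal sketch of that standardization in Python with NumPy (the data below is made up purely for illustration):

import numpy as np

# Made-up data set: 5 observations of 2 variables with very different ranges.
X = np.array([[ 10.0, 0.2],
              [ 40.0, 0.5],
              [ 65.0, 0.1],
              [ 80.0, 0.9],
              [100.0, 0.4]])

# z = (value - mean) / standard deviation, applied per variable (per column).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0))  # approximately 0 for each variable
print(X_std.std(axis=0))   # exactly 1 for each variable

If you use scikit-learn, StandardScaler from sklearn.preprocessing performs the same per-variable transformation.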

. . .

If you want to get an in-depth understanding of standardization, I invite you to read this short article I wrote about it:

When and why to standardize your data?
A simple guide on when to standardize your data and when not to.
towardsdatascience.com

Step 2: Covariance Matrix computation


The aim of this step is to understand how the variables of the input data set are varying from the mean with respect to each other, or in other words, to see if there is any relationship between them. Sometimes, variables are highly correlated in such a way that they contain redundant information. So, in order to identify these correlations, we compute the covariance matrix.

The covariance matrix is a p × p symmetric matrix (where p is the number of dimensions) that has as entries the covariances associated with all possible pairs of the initial variables. For example, for a 3-dimensional data set with 3 variables x, y, and z, the covariance matrix is a 3×3 matrix of this form:
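Written out with the three variables x, y and z, its entries are:

Cov(x,x)  Cov(x,y)  Cov(x,z)
Cov(y,x)  Cov(y,y)  Cov(y,z)
Cov(z,x)  Cov(z,y)  Cov(z,z)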

Covariance matrix for 3-dimensional data

Since the covariance of a variable with itself is its variance (Cov(a,a)=Var(a)), in the main diagonal (top left to bottom right) we actually have the variances of each initial variable. And since the covariance is commutative (Cov(a,b)=Cov(b,a)), the entries of the covariance matrix are symmetric with respect to the main diagonal, which means that the upper and the lower triangular portions are equal.
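In practice you rarely compute these entries by hand; as a small sketch, NumPy's cov function builds the whole matrix at once (again with made-up data, and rowvar=False telling NumPy that variables are in columns):

import numpy as np

# Made-up standardized data: 5 observations of 3 variables x, y, z (one per column).
X_std = np.array([[ 0.5, -1.2,  0.3],
                  [-0.7,  0.4, -0.1],
                  [ 1.3,  0.9, -1.0],
                  [-0.2, -0.5,  1.1],
                  [-0.9,  0.4, -0.3]])

# 3 x 3 covariance matrix; rowvar=False means each column is one variable.
cov_matrix = np.cov(X_std, rowvar=False)

print(cov_matrix.shape)                       # (3, 3)
print(np.allclose(cov_matrix, cov_matrix.T))  # True: symmetric, as described above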

What do the covariances that we have as entries of the matrix tell us about the correlations between the variables?

It’s actually the sign of the covariance that matters:

If positive then: the two variables increase or decrease together (correlated).

If negative then: one increases when the other decreases (inversely correlated).

Now that we know that the covariance matrix is no more than a table that summarizes the correlations between all the possible pairs of variables, let’s move to the next step.

Step 3: Compute the eigenvectors and eigenvalues of the covariance matrix to identify the principal components
Eigenvectors and eigenvalues are the linear algebra concepts that we need to compute from the covariance matrix in order to determine the principal components of the data. Before getting to the explanation of these concepts, let’s first understand what we mean by principal components.

Principal components are new variables that are constructed as linear combinations or mixtures of the initial variables. These combinations are done in such a way that the new variables (i.e., principal components) are uncorrelated and most of the information within the initial variables is squeezed or compressed into the first components. So, the idea is that 10-dimensional data gives you 10 principal components, but PCA tries to put the maximum possible information in the first component, then the maximum remaining information in the second, and so on, until having something like what is shown in the scree plot below.

Percentage of variance (information) for each PC

Organizing information in principal components this way will allow you to reduce dimensionality without losing much information, by discarding the components with low information and considering the remaining components as your new variables.

An important thing to realize here is that the principal components are less interpretable and don’t have any real meaning, since they are constructed as linear combinations of the initial variables.

Geometrically speaking, principal components represent the directions of the data that explain a maximal amount of variance, that is to say, the lines that capture most of the information in the data. The relationship between variance and information here is that the larger the variance carried by a line, the larger the dispersion of the data points along it, and the larger the dispersion along a line, the more information it has. To put all this simply, just think of principal components as new axes that provide the best angle to see and evaluate the data, so that the differences between the observations are better visible.

How does PCA construct the principal components?
As there are as many principal components as there are variables in the data, principal components are constructed in such a manner that the first principal component accounts for the largest possible variance in the data set. For example, let’s assume that the scatter plot of our data set is as shown below: can we guess the first principal component? Yes, it’s approximately the line that matches the purple marks, because it goes through the origin and it’s the line in which the projection of the points (red dots) is the most spread out. Or, mathematically speaking, it’s the line that maximizes the variance (the average of the squared distances from the projected points (red dots) to the origin).
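A compact way to state the same thing with the covariance matrix C from step 2: for a direction u of unit length, the variance of the standardized points projected onto u is u^T C u, and the first principal component is the direction u that maximizes this quantity under the constraint ||u|| = 1.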

The second principal component is calculated in the same way, with the
condition that it is uncorrelated with (i.e., perpendicular to) the first
principal component and that it accounts for the next highest variance.

This continues until a total of p principal components have been calculated, equal to the original number of variables.

Now that we understand what we mean by principal components, let’s go back to eigenvectors and eigenvalues. What you first need to know about them is that they always come in pairs, so that every eigenvector has an eigenvalue. And their number is equal to the number of dimensions of the data. For example, for a 3-dimensional data set, there are 3 variables, therefore there are 3 eigenvectors with 3 corresponding eigenvalues.

Without further ado, it is eigenvectors and eigenvalues that are behind all the magic explained above, because the eigenvectors of the covariance matrix are actually the directions of the axes where there is the most variance (most information), and those are what we call Principal Components. And eigenvalues are simply the coefficients attached to eigenvectors, which give the amount of variance carried in each Principal Component.

By ranking your eigenvectors in order of their eigenvalues, highest to lowest, you get the principal components in order of significance.
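Here is a minimal sketch of this step with NumPy, assuming cov_matrix is the covariance matrix from step 2 (eigh is used because that matrix is symmetric; the numbers are placeholders):

import numpy as np

# Placeholder symmetric covariance matrix for a 3-variable data set.
cov_matrix = np.array([[1.00, 0.80, 0.30],
                       [0.80, 1.00, 0.25],
                       [0.30, 0.25, 1.00]])

# eigh handles symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Rank from highest to lowest eigenvalue; the columns of 'eigenvectors' are the eigenvectors.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

print(eigenvalues)         # largest first: the variance carried by each principal component
print(eigenvectors[:, 0])  # direction of the first principal component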

Example:

Let’s suppose that our data set is 2-dimensional with 2 variables x, y and that the eigenvectors and eigenvalues of the covariance matrix are as follows:


If we rank the eigenvalues in descending order, we get λ1>λ2, which means that the eigenvector that corresponds to the first principal component (PC1) is v1 and the one that corresponds to the second component (PC2) is v2.

After having the principal components, to compute the percentage of variance (information) accounted for by each component, we divide the eigenvalue of each component by the sum of the eigenvalues. If we apply this to the example above, we find that PC1 and PC2 carry respectively 96% and 4% of the variance of the data.
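As a tiny sketch of that division (the eigenvalues below are placeholder numbers, chosen only so that the split comes out near the 96%/4% of the example):

import numpy as np

# Hypothetical eigenvalues, already sorted from largest to smallest.
eigenvalues = np.array([1.28, 0.05])

# Percentage of variance (information) carried by each principal component.
explained_pct = 100 * eigenvalues / eigenvalues.sum()

print(explained_pct)  # approximately [96.2, 3.8]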

Step 4: Feature vector


As we saw in the previous step, computing the eigenvectors and ordering them by their eigenvalues in descending order allows us to find the principal components in order of significance. In this step, what we do is choose whether to keep all these components or discard those of lesser significance (of low eigenvalues), and form with the remaining ones a matrix of vectors that we call the feature vector.

So, the feature vector is simply a matrix that has as columns the eigenvectors of the components that we decide to keep. This makes it the first step towards dimensionality reduction, because if we choose to keep only k eigenvectors (components) out of p, the final data set will have only k dimensions.
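In code, building the feature vector is just a column selection on the ranked eigenvectors from step 3. A minimal sketch, assuming the columns of eigenvectors are already sorted from most to least significant:

import numpy as np

def make_feature_vector(eigenvectors: np.ndarray, k: int) -> np.ndarray:
    # Keep the first k columns (the k most significant eigenvectors): a p x k matrix.
    return eigenvectors[:, :k]

# Placeholder 2 x 2 matrix of eigenvectors (columns), as in the 2-dimensional example.
eigenvectors = np.array([[0.6, -0.8],
                         [0.8,  0.6]])
feature_vector = make_feature_vector(eigenvectors, k=1)

print(feature_vector.shape)  # (2, 1): only v1 is kept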

Example:

Continuing with the example from the previous step, we can either form a
feature vector with both of the eigenvectors v1 and v2:

Or discard the eigenvector v2, which is the one of lesser significance, and
form a feature vector with v1 only:


Discarding the eigenvector v2 will reduce dimensionality by 1, and will consequently cause a loss of information in the final data set. But given that v2 was carrying only 4% of the information, the loss will not be important and we will still have the 96% of the information that is carried by v1.

. . .

So, as we saw in the example, it’s up to you to choose whether to keep all the components or discard the ones of lesser significance, depending on what you are looking for. If you just want to describe your data in terms of new variables (principal components) that are uncorrelated, without seeking to reduce dimensionality, leaving out the less significant components is not needed.

Last step: Recast the data along the principal components axes
In the previous steps, apart from standardization, you do not make any changes to the data; you just select the principal components and form the feature vector, but the input data set always remains in terms of the original axes (i.e., in terms of the initial variables).

In this step, which is the last one, the aim is to use the feature vector formed using the eigenvectors of the covariance matrix to reorient the data from the original axes to the ones represented by the principal components (hence the name Principal Component Analysis). This can be done by multiplying the transpose of the feature vector by the transpose of the standardized original data set.
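Putting all the steps together, here is a minimal end-to-end sketch of that final multiplication in NumPy (the data is a placeholder, not the article's example):

import numpy as np

# Placeholder data set: 6 observations of 3 variables.
X = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 1.9],
              [2.2, 2.9, 0.8],
              [1.9, 2.2, 1.1],
              [3.1, 3.0, 0.3],
              [2.3, 2.7, 0.7]])

# Step 1: standardize.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Steps 2 to 4: covariance matrix, eigendecomposition, keep the top k eigenvectors.
cov_matrix = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
feature_vector = eigenvectors[:, order][:, :2]   # keep k = 2 components

# Last step: FinalDataSet = FeatureVector^T * StandardizedDataSet^T,
# i.e. each standardized observation is projected onto the kept eigenvectors.
final_data = (feature_vector.T @ X_std.T).T

print(final_data.shape)  # (6, 2): 6 observations described by 2 principal components

For real projects, scikit-learn's PCA class wraps the centering, decomposition and projection for you; standardization still has to be done beforehand, for example with StandardScaler.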

. . .

If you enjoyed this story, please click the button as many times
as you think it deserves. And share to help others find it! Feel free to
leave a comment below.
