

Principal Component Analysis with Python (A Deep Dive)

Francesco Franco

Training a Supervised Machine Learning model — whether it is a traditional one or a Deep Learning model — involves a few steps. The first is feeding forward the data through the model, generating predictions. The second is comparing those predictions with the actual values, which are also called ground truth. The third is to optimize the model based on the minimization of some objective function.


In this iterative process, the model gets better and better, and sometimes it even
gets really good.

But what data will you feed forward?

Sometimes, your input sample will contain many columns, also known as features. It is common knowledge (especially with traditional models) that using every column in your Machine Learning model will mean trouble: the so-called curse of dimensionality. In that case, you'll have to handle the features you are working with selectively. In this article, we'll cover Principal Component Analysis (PCA), which is one such way. The article provides a gentle but extensive introduction to feature extraction for your Machine Learning model with PCA.

The article is structured as follows. First of all, we’ll take a look at what PCA is. We
do this through the lens of the Curse of Dimensionality, which explains why we need
to reduce dimensionality especially with traditional Machine Learning algorithms.
This also involves the explanation of the differences between Feature Selection and
Feature Extraction techniques, which have different goals. PCA, which is part of the
Feature Extraction branch of techniques, is then introduced.

When we have an adequate understanding of PCA conceptually, we’ll take a look at it from a Python point of view. For a sample dataset, we’re going to perform PCA in a
step-by-step fashion. We’ll take a look at all the individual components. Firstly, we’ll
compute the covariance matrix for the variables. Then, we compute the
eigenvectors and eigenvalues, and select which ones are best. Subsequently, we
compose the PCA projection matrix for mapping the data onto the axes of the
principal components. This allows us to create entirely new dimensions which
capture most of the variance from the original dataset, at a fraction of the
dimensions. Note that SVD can also be used instead of eigenvector decomposition;
we’ll also take a look at that.

Once we clearly understand how PCA happens by means of the Python example,
we’ll show you how you don’t have to reinvent the wheel if you’re using PCA. If you
understand what’s going on, it’s often better to use a well-established library for
computing the PCA. Using Scikit-learn’s sklearn.decomposition.PCA API, we will
finally show you how to compute principal components and apply them to perform
dimensionality reduction for your dataset.

All right. Enough introduction. Let’s get to work!



What is Principal Component Analysis?


Before we dive into the specifics of PCA, I think we should first take a look at why it
can be really useful for Machine Learning projects. For this reason, we will first take
a look at Machine Learning projects and the Curse of Dimensionality, which is
especially present when using older Machine Learning algorithms (Support Vector
Machines, Logistic Regression, …).

Then, we’ll discuss what can be done against it — dimensionality reduction — and
explain the difference between Feature Selection and Feature Extraction. Finally,
we’ll get to PCA — and provide a high-level introduction.

Machine Learning and the Curse of Dimensionality


If you are training a Supervised Machine Learning model, at a high level, you are
following a three-step, iterative process:

Since Supervised Learning means that you have a dataset at your disposal, the first
step in training a model is feeding the samples to the model. For every sample, a
prediction is generated. Note that at the first iteration, the model has just been
initialized. The predictions therefore likely make no sense at all.

This becomes especially evident from what happens in the second step, where predictions and ground truth (= actual targets) are compared. This comparison produces an error or loss value which illustrates how bad the model performs.

The third step is then really simple: you improve the model. Depending on the
Machine Learning algorithm, optimization happens in different ways. In the case of
Neural networks, gradients are computed with backpropagation, and subsequently
optimizers (topic for a future blog) are used for changing the model internals.

Weights can also be changed by minimizing one function only; it just depends on
the algorithm.

You then start again. Likely, because you have optimized the model, the predictions
are a little bit better now. You simply keep iterating until you are satisfied with the
results, and then you stop the training process.

Underfitting and overfitting a model

When you are performing this iterative process, you are effectively moving from a
model that is underfit to a model that demonstrates a good fit. Let’s briefly take a look
at them here.

In the first stages of the training process, your model is likely not able to capture the
patterns in your dataset. This is visible in the top figure below. The solution is
simple: just keep training until you achieve the right fit for the dataset (that’s the
bottom figure). Now, you can’t keep training forever. If you do, the model will learn to
focus too much on patterns hidden within your training dataset — patterns that may
not be present in other real-world data at all; patterns truly specific to the sample
with which you are training.

The result: a model tailored to your specific dataset, visible in the second figure.

In other words, training a Machine Learning model involves finding a good balance
between a model that is underfit and a model that is overfit. Fortunately, many
techniques are available to help you with this, but it’s one of the most common
problems in Supervised ML today.

On the top, a model that is underfit with respect to the data. In the middle: a model that is overfit with respect
to the data. On the bottom: the fit that we were looking for.

Having a higher-dimensional feature vector

I think the odds are that I can read your mind at this point.

Overfitting, underfitting, and training a Machine Learning model — how are they
related to Principal Component Analysis?

That’s a fair question. What I want to do is to illustrate why a large dataset — in terms of the number of columns — can significantly increase the odds that your
model will overfit.

Suppose that you have the following feature vector:

x = [1.23, -3.00, 45.2, 9.3, 0.1, 12.3, 8.999, 1.02, -2.45, -0.26, 1.24]

This feature vector is 11-dimensional.

Now suppose that you have 200 samples.


Will a Machine Learning model be able to generalize across all eleven dimensions?
In other words, do we have sufficient samples to cover large parts of the domains
for all features in the vector (i.e., all the axes in the 11-dimensional space)? Or does
it look like Swiss cheese with (massive) holes?

I think it’s the latter. Welcome to the Curse of Dimensionality.

The Curse of Dimensionality

Quoting from Wikipedia:

In machine learning problems that involve learning a “state-of-nature” from a finite number of data samples in a high-dimensional feature space with each feature having a range of possible values, typically an enormous amount of training data is required to ensure that there are several samples with each combination of values.

Wikipedia (n.d.)

That’s what we just described in different words.

The point with “ensuring that there are several samples with each combination of values” is that when this is done well, you will likely be able to train a model that (1) performs well and (2) generalizes well across many settings. With only 200 samples in an 11-dimensional space, however, you almost certainly don’t meet this requirement. The effect is simple: your model will overfit to the data at hand, and it will become worthless when it is used with data from the real world.

Since every added dimension rapidly increases the amount of data needed, the only way out of this curse is to reduce the number of dimensions in our dataset. This is called Dimensionality Reduction, and we’ll now take a look at two approaches — Feature Selection and Feature Extraction.

Dimensionality Reduction: Feature Selection vs Feature Extraction

We saw that if we want to decrease the odds of overfitting, we must reduce the
dimensionality of our data. While this can easily be done in theory (we can simply
cut off a few dimensions, who cares?), this gets slightly difficult in practice (which
dimension to cut… because, how do I know which one contributes most?).


And what if each dimension contributes an equal amount to the predictive power of the
model? What then?

In the field of Dimensionality Reduction, there are two main approaches that you
can use: Feature Selection and Feature Extraction.

Feature Selection involves “the process of selecting a subset of relevant features (variables, predictors) for use in model construction” (Wikipedia, 2004). In other words, Feature Selection approaches attempt to measure the contribution of each feature, so that you can keep the ones that contribute most. What must be clear is that your model will be trained with the original variables — only with fewer of them.

Feature Selection can be a good idea if you already think that most variance
within your dataset can be explained by a few variables. If the others are truly
unimportant, then you can easily discard them without losing too much
information.

Feature Extraction, on the other hand, “starts from an initial set of measured
data and builds derived values (features) intended to be informative and non-
redundant” (Wikipedia, 2003). In other words, a derived dataset will be built that
can be used for training your Machine Learning model. It is lower-dimensional
compared to the original dataset. It will be as informative as possible (i.e., as
much information from the original dataset is pushed into the new variables)
while non-redundant (i.e., we want to avoid that information present in one
variable in the new dataset is also present in another variable in the new
dataset). In other words, we get a lower-dimensional dataset that explains most
variance in the dataset, while keeping things relatively simple.

Especially in the case where each dimension contributes an equal amount, Feature Extraction can be preferred over Feature Selection. The same is true if you have no clue about the contribution of each variable to the model’s predictive power.

Introducing Principal Component Analysis (PCA)


Now that we are aware of the two approaches, it’s time to get to the point. We’ll now
introduce PCA, a Feature Extraction technique, for dimensionality reduction.

Principal Component Analysis is defined as follows:


Principal component analysis (PCA) is the process of computing the principal components
and using them to perform a change of basis on the data, sometimes using only the first
few principal components and ignoring the rest.

Wikipedia (2002)

Well, that’s quite a technical description, isn’t it? And what are “principal components”?

The principal components of a collection of points in a real p-space are a sequence of p direction vectors, where the ith vector is the direction of a line that best fits the data while being orthogonal to the first i − 1 vectors.

Wikipedia (2002)

I can perfectly understand it if you still have no idea what PCA is after reading those
two quotes. I felt the same. For this reason, let’s break down stuff step-by-step.

The goal of PCA: finding a set of vectors (principal components) that best describe
the spread and direction of your data across its many dimensions, allowing you to
subsequently pick the top-n best-describing ones for reducing the dimensionality of
your feature space.

The steps of PCA:

1. If you have a dataset, its spread can be expressed in orthonormal vectors — the
principal directions of the dataset. Orthonormal, here, means that the vectors
are orthogonal to each other (i.e. they have an angle of 90°) and are of size 1.

2. By sorting these vectors in order of importance (by looking at their relative contribution to the spread of the data as a whole), we can find the dimensions of the data which explain most variance.

3. We can then reduce the number of dimensions to the most important ones only.

4. And finally, we can project our dataset onto these new dimensions, called the
principal components, performing dimensionality reduction without losing
much of the information present in the dataset.

The how:


Although we will explain how later in this article, we’ll now visually walk through
performing PCA at a high level. This allows you to understand what happens first,
before we dive into how it happens.

Another important note is that for step (1), decomposing your dataset into vectors
can be done in two different ways — by means of (a) eigenvector decomposition of
the covariance matrix, or (b) Singular Value Decomposition. Later in this article,
we’ll walk through both approaches step-by-step.
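To make the four steps concrete before we dive into the details, here is a minimal NumPy sketch of the whole procedure. It is only a sketch, under the assumption that X is an already standardized (n_samples, n_features) matrix; pca_sketch is just an illustrative helper name, and every step is worked out in full later in this article.

import numpy as np

def pca_sketch(X, n_components=2):
    # Step 1: express the spread of the data in orthonormal eigenpairs
    cov = np.cov(X.T)
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # Step 2: sort the eigenpairs in order of decreasing importance
    order = np.argsort(eig_vals)[::-1]
    eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]
    # Step 3: keep only the n most important directions
    projection_matrix = eig_vecs[:, :n_components]
    # Step 4: project the data onto these principal components
    return X.dot(projection_matrix)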

Expressing the spread of your dataset in vectors

Suppose that we generate a dataset based on two overlapping blobs which we consider to be part of just one dataset:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt
import numpy as np

# Configuration options
num_samples_total = 1000
cluster_centers = [(1,1), (1.25,1.5)]
num_classes = len(cluster_centers)

# Generate data (two overlapping blobs; a small cluster_std keeps them overlapping)
X, y = make_blobs(n_samples = num_samples_total, centers = cluster_centers, n_features = num_classes, cluster_std = 0.2)

# Make plot
plt.scatter(X[:, 0], X[:, 1])
axes = plt.gca()
axes.set_xlim([0, 2])
axes.set_ylim([0, 2])
plt.show()

…which looks as follows:


If you look closely at the dataset, you can see that it primarily spreads into two directions: from the upper right corner to the lower left corner, and from the lower right middle to the upper left middle. Those directions differ from the axis directions, which are orthogonal to each other: the x and y axes have an angle of 90 degrees.

No other set of directions explains as much of the variance as the pair we just mentioned.

After standardization (a topic for a future article), we can visualize the directions as a pair of vectors. These vectors are called the principal directions of the data (StackExchange, n.d.). There are as many principal directions as there are dimensions; in our case, there are two.
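As a quick sketch of how those two directions could be computed for the blob data generated above (assuming X and the imports from that snippet are still in scope; the details follow later in this article):

# Standardize the blob data, then take the eigenvectors of its covariance
# matrix: each column is one principal direction of the dataset.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eig_vals, eig_vecs = np.linalg.eig(np.cov(X_std.T))
print(eig_vecs)  # two orthonormal direction vectors, one per column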


We call these vectors eigenvectors. Their length is represented by what is known as an eigenvalue. They play a big role in PCA for the following reason:

[The eigenvectors and related] eigenvalues explain the variance of the data along the new
feature axes.

Raschka (2015)

In other words, they allow us to capture both the (1) direction and (2) the magnitude
of the spread in your dataset.

Notice that the vectors are orthogonal to each other. Also recall that our axes are
orthogonal to each other. You can perhaps now imagine that it becomes possible to
perform a transformation to your dataset, so that the directions of the axes are equal
to the directions of the eigenvectors. In other words, we change the “viewpoint” of
our data, so that the axes and vectors have equal directions.

This is the core of PCA: projecting the data to our principal directions, which are
then called principal components.


The benefit here is that while the eigenvectors tell us something about the directions
of our projection, the corresponding eigenvalues tell us something about the
importance of that particular principal direction in explaining the variance of the
dataset. It allows us to easily discard the directions that don’t contribute sufficiently.
That’s why before projecting the dataset onto the principal components, we must
first sort the vectors and reduce the number of dimensions.

Sorting the vectors in order of importance

Once we know the eigenvectors and eigenvalues that explain the spread of our
dataset, we must sort them in order of descending importance. This allows us to
perform dimensionality reduction, as we can keep the principal directions which
contribute most significantly to the spread in our dataset.

Sorting is simple: we sort the list with eigenvalues in descending order and ensure
that our list with eigenvectors is sorted in the same way. In other words, the pairs of
eigenvectors and eigenvalues are jointly sorted in a descending order based on the
eigenvalue. As the largest eigenvalues indicate the biggest explanation for spread in
your dataset, they must be on top of the list.

For the example above, we can see that the eigenvalue for the downward-oriented eigenvector exceeds the one for the upward-oriented vector. If we draw a line through the dataset that overlaps with that vector, we can also see that the variance along that line (where variance is defined as the average squared distance of each point to the mean of the line) is biggest: there is simply no other line we could draw along which the variance is larger.

In fact, the total (relative) contribution of the eigenvectors to the spread for our
example is as follows:

[0.76318124 0.23681876]

(We’ll look at how we can determine this later.)

So, for our example above, we now have a sorted list with eigenpairs.

Reducing the number of dimensions


As we saw above, the first eigenpair explains 76.3% of the spread in our dataset,
whereas the second one explains only 23.7%. Jointly, they explain 100% of the
spread, which makes sense.

Using PCA for dimensionality reduction now allows us to take the biggest-
contributing vectors (if your original feature space was say 10-dimensional, it is
likely that you can find a smaller set of vectors which explains most of the variance)
and only move forward with them.

If our goal was to reduce dimensionality to one, we would now move forward with the eigenvector that contributes 76.3% and use it for data projection. Note that this implies that we lose the 23.7% of the spread explained by the other direction, but in return we get a lower number of dimensions.

Clearly, this example with only two dimensions makes little practical sense — two dimensions can easily be handled by Machine Learning algorithms — but the approach is incredibly useful if you have many dimensions to work with.

Projecting the dataset

Once we have chosen the number of eigenvectors that we will use for
dimensionality reduction (i.e. our target number of dimensions), we can project the
data onto the principal components — or component, in our case.

This means that we will be changing the axes so that they are now equal to the
eigenvectors.

In the example below, we can project our data to one eigenvector. We can see that
only the x axis has values after projecting, and that hence our feature space has
been reduced to one dimension.

We have thus used PCA for dimensionality reduction.
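A short sketch of this projection in code, assuming X still holds the two-dimensional blob samples generated earlier and the imports from that snippet:

# Standardize, decompose the covariance matrix, and keep only the direction
# with the largest eigenvalue; projecting onto it leaves a single dimension.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
eig_vals, eig_vecs = np.linalg.eig(np.cov(X_std.T))
top_direction = eig_vecs[:, np.argmax(eig_vals)].reshape(2, 1)
X_projected = X_std.dot(top_direction)

plt.scatter(X_projected[:, 0], np.zeros(X_projected.shape[0]))
plt.title('Blob data projected onto its first principal component')
plt.show()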


The how of generating eigenpairs: Eigenvector Decomposition or Singular Value Decomposition

Above, we covered the general steps of performing Principal Component Analysis. Recall that they are as follows:

1. Decomposing the dataset into a set of eigenpairs.

2. Sorting the eigenpairs in descending order of importance.

3. Selecting the n most important eigenpairs, where n is the desired number of dimensions.

4. Projecting the data to the n eigenpairs so that their directions equal the ones of
our axes.

In step (1), we simply mentioned that we can express the spread of our data by
means of eigenpairs. On purpose, we didn’t explain how this can be done, for the
sake of simplicity.

https://medium.com/@francescofranco_39234/principal-component-analysis-with-python-a-deep-dive-0c5195bff087#id_token=eyJhbGciOiJSU… 15/36
10/28/24, 9:39 AM Principal Component Analysis with Python (A Deep Dive) | by Francesco Franco | Oct, 2024 | Medium

In fact, there are two methods that are being used for this purpose today:
Eigenvector Decomposition (often called “EIG”) and Singular Value Decomposition
(“SVD”). Using different approaches, they can be used to obtain the same end result:
expressing the spread of your dataset in eigenpairs, the principal directions of your
data, which can subsequently be used to reduce the number of dimensions by
projecting your dataset to the most important ones, the principal components.

While mathematically and hence formally you can obtain the same result with both,
in practice PCA-SVD is numerically more stable (StackExchange, n.d.). For this
reason, you will find that most libraries and frameworks favor a PCA-SVD
implementation over a PCA-EIG one. Nevertheless, you can still achieve the same
result with both approaches!

In the next section, we will take a look at a clear and step-by-step example of PCA
with EIG and, in a future post, we’ll look at PCA with SVD, allowing you to
understand the differences intuitively. We will then look at
sklearn.decomposition.PCA , Scikit-learn's implementation of Principal Component
Analysis based on PCA-SVD. There is no need to perform PCA manually if there are
great tools out there, after all!
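As a small preview of that Scikit-learn route, a minimal sketch could look as follows (the full treatment is deferred to that later discussion). The PCA estimator is SVD-based under the hood, and explained_variance_ratio_ reports each component's relative contribution:

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the Iris data, then reduce it from four to two dimensions.
X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)
print(pca.explained_variance_ratio_)  # relative contribution per component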

PCA-EIG: Eigenvector Decomposition with Python Step-by-Step


One of the ways in which PCA can be performed is by means of Eigenvector
Decomposition (EIG). More specifically, we can use the covariance matrix of our N-
dimensional dataset and decompose it into N eigenpairs. We can do this as follows:

1. Standardizing the dataset: EIG-based PCA only works well if the dataset is centered, i.e. has a mean of zero (μ = 0.0). We will use standardization for this purpose, which also scales the data to a standard deviation of one (σ = 1.0).

2. Computing the covariance matrix of the variables: a covariance matrix indicates how much variance each individual variable has, and how much they ‘covary’ — in other words, how much certain variables move together.

3. Decomposing the covariance matrix into eigenpairs: mathematically, we can rewrite the covariance matrix so that we get a set of eigenvectors and eigenvalues, or eigenpairs.

4. Sorting the eigenpairs in decreasing order of importance, to find the principal directions in your dataset which contribute to the spread most significantly.


5. Selecting the variance contribution of your principal directions and selecting n principal components: if we know the relative contribution of each principal direction to the spread, we can perform dimensionality reduction by selecting only the n most contributing principal components.

6. Building the projection matrix for projecting our original dataset onto the
principal components.

We can see that steps (1), (4), (5) and (6) are general — we also saw them above. Steps
(2) and (3) are specific to PCA-EIG and represent the core of what makes eigenvector
decomposition based PCA unique. We will now cover each step in more detail,
including step-by-step examples with Python. Note that the example in this section
makes use of native / vanilla Python deliberately, and that Scikit-learn based
implementations of, e.g., standardization and PCA will be used in another section.

Using the multidimensional Iris dataset

If we want to show how PCA works, we must use a dataset where the number of dimensions > 2. Fortunately, Scikit-learn provides the Iris dataset, which can be used to classify Iris flowers into three groups based on four characteristics (and hence features, or dimensions): petal length, petal width, sepal length and sepal width.

This code can be used for visualizing two dimensions at a time:

from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np

# Configuration options
dimension_one = 1
dimension_two = 3

# Load Iris dataset


iris = datasets.load_iris()
X = iris.data
y = iris.target

# Shape
print(X.shape)
print(y.shape)

# Dimension definitions
dimensions = {
0: 'Sepal Length',
1: 'Sepal Width',
2: 'Petal Length',
3: 'Petal Width'
}

# Color definitions
colors = {
0: '#b40426',
1: '#3b4cc0',
2: '#f2da0a',
}

# Legend definition
legend = ['Iris Setosa', 'Iris Versicolour', 'Iris Virginica']

# Make plot
colors = list(map(lambda x: colors[x], y))
plt.scatter(X[:, dimension_one], X[:, dimension_two], c=colors)
plt.title(f'Visualizing dimensions {dimension_one} and {dimension_two}')
plt.xlabel(dimensions[dimension_one])
plt.ylabel(dimensions[dimension_two])
plt.show()

This yields the following plots, if we play with the dimensions:


The plots illustrate that two of the Iris species cannot be linearly separated from each other, but that together they can be separated from the third species. Printing the shape yields the following:

(150, 4)
(150,)

…indicating that we have only 150 samples, but that our feature space is four-
dimensional. Clearly a case where feature extraction could be beneficial for training
our Machine Learning model.

Performing standardization

We first add Python code for standardization, which brings our data to μ = 0.0 and σ = 1.0 by computing x′ = (x − μ) / σ for each dimension.

# Perform standardization
for dim in range(0, X.shape[1]):
  print(f'Old mean/std for dim={dim}: {np.average(X[:, dim])}/{np.std(X[:, dim])}')
  X[:, dim] = (X[:, dim] - np.average(X[:, dim])) / np.std(X[:, dim])
  print(f'New mean/std for dim={dim}: {np.abs(np.round(np.average(X[:, dim])))}/{np.std(X[:, dim])}')

# Make plot (colors was already mapped to a list of per-sample colors above)
plt.scatter(X[:, dimension_one], X[:, dimension_two], c=colors)
plt.title(f'Visualizing dimensions {dimension_one} and {dimension_two}')
plt.xlabel(dimensions[dimension_one])
plt.ylabel(dimensions[dimension_two])
plt.show()

And indeed:

Old mean/std for dim=0: 5.843333333333334/0.8253012917851409
New mean/std for dim=0: 0.0/1.0
Old mean/std for dim=1: 3.0573333333333337/0.4344109677354946
New mean/std for dim=1: 0.0/0.9999999999999999
Old mean/std for dim=2: 3.7580000000000005/1.759404065775303
New mean/std for dim=2: 0.0/1.0
Old mean/std for dim=3: 1.1993333333333336/0.7596926279021594
New mean/std for dim=3: 0.0/1.0
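The explicit loop above can also be written in one vectorized step; a sketch of the equivalent, assuming the un-standardized X:

# Vectorized equivalent of the loop: subtract the per-column mean and divide
# by the per-column standard deviation for all four dimensions at once.
X = (X - np.average(X, axis=0)) / np.std(X, axis=0)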

Computing the covariance matrix of your variables

The next step is computing the covariance matrix for our dataset.

In probability theory and statistics, a covariance matrix (…) is a square matrix giving the
covariance between each pair of elements of a given random vector.

Wikipedia (2003)

If you’re not into mathematics, I can understand that you don’t know what this is yet.
Let’s therefore briefly take a look at a few aspects related to a covariance matrix
before we move on, based on Lambers (n.d.).

A variable, such as X: a mathematical representation of one dimension of the dataset. For example, if X represents petal width, then numbers such as 1.19, 1.20, 1.21, 1.18, 1.16, … which represent the petal widths of individual flowers, can all be described by the variable X.

Variable mean: the average value for the variable. Computed as the sum of all
available values divided by the number of values summed together. As petal width
represents dim=3 in the visualization above, with a mean of ≈ 1.1993, we can see
how the numbers above fit.

Variance: describing the “spread” of data around the variable’s mean. Computed as the average of the squared differences between each number and the mean, i.e. the mean of (x − μ)² over all numbers.

Covariance: describing the joint variability (or joint spread) of two variables. For two variables X and Y with n paired observations, it is computed as Cov(X, Y) = (1/n) Σᵢ (xᵢ − μ_X)(yᵢ − μ_Y).

Covariance matrix for n variables: a matrix representing covariances for each pair
of variables from some set of variables (dimensions) V = [X, Y, Z, ….].


A covariance matrix for two dimensions X and Y looks as follows:

    [ Cov(X, X)  Cov(X, Y) ]
    [ Cov(Y, X)  Cov(Y, Y) ]

Fortunately, there are some properties which make covariance matrices interesting for PCA (Lambers, n.d.):

Cov(X, X) = Var(X)

Cov(X, Y) = Cov(Y, X)

As a consequence, our covariance matrix is a symmetric, square n x n matrix and can hence also be written as follows:

    [ Var(X)     Cov(X, Y) ]
    [ Cov(Y, X)  Var(Y)    ]

We can compute the covariance matrix by generating an (n x n) matrix and then filling it by iterating over its rows and columns, setting each entry to the average product of deviations from the mean for the corresponding pair of variables:

# Compute covariance matrix
cov_matrix = np.empty((X.shape[1], X.shape[1]))  # 4 x 4 matrix
for row in range(0, X.shape[1]):
  for col in range(0, X.shape[1]):
    cov_matrix[row][col] = np.round(np.average(
      [(X[i, row] - np.average(X[:, row])) * (X[i, col] - np.average(X[:, col]))
       for i in range(0, X.shape[0])]), 2)

If we compare our self-computed covariance matrix with one generated with NumPy’s np.cov , we can see the similarities:

# Compare the matrices


print('Self-computed:')
print(cov_matrix)
print('NumPy-computed:')
print(np.round(np.cov(X.T), 2))

> Self-computed:
> [[ 1. -0.12 0.87 0.82]
> [-0.12 1. -0.43 -0.37]
> [ 0.87 -0.43 1. 0.96]
> [ 0.82 -0.37 0.96 1. ]]
> NumPy-computed:
> [[ 1.01 -0.12 0.88 0.82]
> [-0.12 1.01 -0.43 -0.37]
> [ 0.88 -0.43 1.01 0.97]
> [ 0.82 -0.37 0.97 1.01]]
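The small differences (e.g. 1.01 versus 1.00 on the diagonal) appear because np.cov divides by n − 1 by default (the sample covariance), while our loop divides by n. Passing bias=True makes np.cov match the manual computation; a quick check:

# np.cov with bias=True divides by n instead of n - 1, matching our own loop.
print(np.round(np.cov(X.T, bias=True), 2))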

Decomposing the covariance matrix into eigenvectors and eigenvalues


Above, we have expressed the spread of our dataset across the dimensions in our
covariance matrix. Recall that PCA works by expressing this spread in terms of
vectors, called eigenvectors, which together with their corresponding eigenvalues
tell us something about the direction and magnitude of the spread.

The great thing about PCA-EIG is that we can decompose the covariance matrix into eigenvectors and eigenvalues.

We can do this as follows:

    C = V L Vᵀ

Here, C is the covariance matrix, V is a matrix of eigenvectors where each column is an eigenvector, L is a diagonal matrix with the eigenvalues on its diagonal, and Vᵀ is the transpose of V.

We can use NumPy’s numpy.linalg.eig to compute the eigenvectors for this square
array:

# Compute the eigenpairs


eig_vals, eig_vect = np.linalg.eig(cov_matrix)
print(eig_vect)
print(eig_vals)

This yields the following:


[[ 0.52103086 -0.37921152 -0.71988993 0.25784482]
 [-0.27132907 -0.92251432 0.24581197 -0.12216523]
 [ 0.57953987 -0.02547068 0.14583347 -0.80138466]
 [ 0.56483707 -0.06721014 0.63250894 0.52571316]]
[2.91912926 0.91184362 0.144265 0.02476212]

If we compute how much each principal dimension contributes to variance explanation, we get the following:

# Compute variance contribution of each vector


contrib_func = np.vectorize(lambda x: x / np.sum(eig_vals))
var_contrib = contrib_func(eig_vals)
print(var_contrib)
print(np.sum(var_contrib))
# > [0.72978232 0.2279609 0.03606625 0.00619053]
# > 1.0


In other words, the first principal dimension contributes 73% of the variance explanation; the second one 23%. If we therefore reduce the dimensionality to two, we get to keep approximately 73 + 23 = 96% of the variance explanation.
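A common follow-up is to pick the smallest number of components whose cumulative contribution reaches some threshold; a sketch using the contributions computed above, with a 95% threshold as an arbitrary example:

# Sort contributions in descending order, accumulate them, and find the first
# index at which the cumulative contribution reaches the chosen threshold.
cumulative = np.cumsum(sorted(var_contrib, reverse=True))
n_components = int(np.argmax(cumulative >= 0.95) + 1)
print(n_components)  # 2 for the Iris example above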

Sorting the eigenpairs in decreasing order of importance


Even though the eigenpairs above already come out sorted, sorting is a step we must always perform explicitly — especially when you obtain the eigenpairs in a different way.

Sorting the eigenpairs happens by eigenvalue: the eigenvalues must be sorted in descending order, and the corresponding eigenvectors must be sorted along with them.

# Sort eigenpairs
eigenpairs = [(np.abs(eig_vals[x]), eig_vect[:,x]) for x in range(0, len(eig_vals))]
eigenpairs.sort(key=lambda pair: pair[0], reverse=True)
eig_vals = [eigenpairs[x][0] for x in range(0, len(eigenpairs))]
eig_vect = [eigenpairs[x][1] for x in range(0, len(eigenpairs))]
print(eig_vals)

This yields sorted eigenpairs, as we can see from the eigenvalues:

[2.919129264835876, 0.9118436180017795, 0.14426499504958146, 0.0247621221127632]

Selecting n principal components

Above, we saw that 96% of the variance can be explained by only two of the
dimensions. We can therefore reduce the dimensionality of our feature space from
n = 4 to n = 2 without losing much of the information.

Building the projection matrix

The final thing we must do is generate the projection matrix and project our
original data onto the (two) principal components (Raschka, 2015):


# Build the projection matrix
proj_matrix = np.hstack((eig_vect[0].reshape(4,1), eig_vect[1].reshape(4,1)))
print(proj_matrix)

# Project onto the principal components
X_proj = X.dot(proj_matrix)
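As an optional sanity check (a sketch), the projected data should have 150 rows and two columns, and the variance along each new axis should be close to the two largest eigenvalues, up to the rounding we applied to the covariance matrix:

print(X_proj.shape)                         # (150, 2)
print(np.round(np.var(X_proj, axis=0), 2))  # roughly the two largest eigenvalues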

Voilà, you’ve performed PCA


If we now plot the projected data, we get the following plot:

# Make plot of projection


plt.scatter(X_proj[:, 0], X_proj[:, 1], c=colors)
plt.title(f'Visualizing the principal components')
plt.xlabel('Principal component 1')
plt.ylabel('Principal component 2')
plt.show()


That’s it! You just performed Principal Component Analysis using Eigenvector
Decomposition and have reduced dimensionality to two without losing much of the
information in the dataset.

Summary
In this article, we read about performing Principal Component Analysis on the
dimensions of your dataset for the purpose of dimensionality reduction. Some
datasets have many features and few samples, meaning that many Machine
Learning algorithms will be struck by the curse of dimensionality. Feature
extraction approaches like PCA, which attempt to construct a lower-dimensional
feature space based on the original dataset, can help reduce this curse. Using PCA,
we can attempt to recreate our feature space with fewer dimensions and with
minimum information loss.

After defining the context for applying PCA, we looked at it from a high-level
perspective. We saw that we can compute eigenvectors and eigenvalues and sort
those to find the principal directions in your dataset. After generating a projection
matrix for these directions, we can map our dataset onto these directions, which are
then called the principal components. How these eigenvectors can be derived was explained only afterwards, because there are two methods for doing so: Eigenvector Decomposition (EIG) and the more generalized Singular Value Decomposition (SVD).

In two step-by-step examples, we saw how we can apply both PCA-EIG (this post)
and PCA-SVD (next post) for performing a Principal Component Analysis. In the
first case, we saw that we can compute a covariance matrix for the standardized
dataset which illustrates the variances and covariances of its variables. This matrix
can then be decomposed into eigenvectors and eigenvalues, which illustrate the
direction and magnitude of the spread expressed by the covariance matrix. Sorting
the eigenpairs, we can select the principal directions that contribute most to
variance, generate the projection matrix and project our data.

While PCA-EIG works well with symmetric and square matrices (and hence with our
covariance matrix), it can be numerically unstable. That’s why PCA-SVD is very
common in today’s Machine Learning libraries. In another step-by-step example, we
will look at how the SVD can be used directly on the standardized data matrix for
deriving the eigenvectors we also found with PCA-EIG. They can be used for generating a projection matrix which allows us to arrive at the same end result as when performing PCA-EIG.
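As a small teaser for that post, a minimal sketch of the SVD route on the standardized Iris matrix X: the rows of Vᵀ are the principal directions (up to sign), and the squared singular values divided by the number of samples are close to the eigenvalues found above.

# SVD of the standardized data matrix itself, without forming the covariance
# matrix explicitly; the right singular vectors are the principal directions.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(Vt[:2])                            # first two principal directions (up to sign)
print(np.round(s ** 2 / X.shape[0], 2))  # close to the eigenvalues computed earlier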

It’s been a thorough read, that’s for sure. Still, I hope that you have learned something. Any comments, questions or suggestions are welcome. Thank you for reading!

References

Wikipedia. (n.d.). Curse of dimensionality. Wikipedia, the free encyclopedia. Retrieved December 3, 2020, from https://en.wikipedia.org/wiki/Curse_of_dimensionality

Wikipedia. (2004, November 17). Feature selection. Wikipedia, the free encyclopedia. Retrieved December 3, 2020, from https://en.wikipedia.org/wiki/Feature_selection

Wikipedia. (2003, June 8). Feature extraction. Wikipedia, the free encyclopedia. Retrieved December 3, 2020, from https://en.wikipedia.org/wiki/Feature_extraction

Wikipedia. (2002, August 26). Principal component analysis. Wikipedia, the free encyclopedia. Retrieved December 7, 2020, from https://en.wikipedia.org/wiki/Principal_component_analysis

StackExchange. (n.d.). Relationship between SVD and PCA. How to use SVD to perform PCA? Cross Validated. https://stats.stackexchange.com/questions/134282/relationship-between-svd-and-pca-how-to-use-svd-to-perform-pca

Wikipedia. (n.d.). Eigenvalues and eigenvectors. Wikipedia, the free encyclopedia. Retrieved December 7, 2020, from https://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors

Lambers, J. V. (n.d.). PCA — Mathematical Background. https://www.math.usm.edu/lambers/cos702/cos702_files/docs/PCA.pdf

Raschka, S. (2015, January 27). Principal component analysis. Dr. Sebastian Raschka. https://sebastianraschka.com/Articles/2015_pca_in_3_steps.html
