Dimensionality Reduction - Principal Component Analysis (.ipynb) - llSourcell - GitHub
CPython 3.5.1
IPython 4.2.0
scikit-learn 0.17.1
matplotlib 1.5.1
numpy 1.11.0
pandas 0.18.1
This article just got a complete overhaul; the original version is still available at principal_component_analysis_old.ipynb (http://nbviewer.ipython.org/github/rasbt/pattern_classification/blob/master/dimensionality_reduction/projection
Sections
Introduction
PCA Vs. LDA
PCA and Dimensionality Reduction
A Summary of the PCA Approach
Preparing the Iris Dataset
About Iris
Loading the Dataset
Exploratory Visualization
Standardizing
1 - Eigendecomposition - Computing Eigenvectors and Eigenvalues
Covariance Matrix
Correlation Matrix
Singular Value Decomposition
2 - Selecting Principal Components
Sorting Eigenpairs
Explained Variance
Projection Matrix
3 - Projection Onto the New Feature Space
Shortcut - PCA in scikit-learn
Introduction
[back to top]
The sheer size of data in the modern age is not only a challenge for computer hardware but also a main
bottleneck for the performance of many machine learning algorithms. The main goal of a PCA analysis
is to identify patterns in data; PCA aims to detect the correlation between variables. Attempting to
reduce the dimensionality only makes sense if the variables are strongly correlated. In a
nutshell, this is what PCA is all about: finding the directions of maximum variance in high-dimensional
data and projecting the data onto a smaller-dimensional subspace while retaining most of the information.
PCA Vs. LDA
[back to top]
Both Linear Discriminant Analysis (LDA) and PCA are linear transformation methods. PCA yields the
directions (principal components) that maximize the variance of the data, whereas LDA also aims to
find the directions that maximize the separation (or discrimination) between different classes, which
can be useful in pattern classification problems (PCA "ignores" class labels).
In other words, PCA projects the entire dataset onto a different feature (sub)space, and LDA tries to
determine a suitable feature (sub)space in order to distinguish between patterns that belong to different
classes.
PCA and Dimensionality Reduction
[back to top]
Often, the desired goal is to reduce the dimensions of a d-dimensional dataset by projecting it onto a
k-dimensional subspace (where k < d) in order to increase the computational efficiency while
retaining most of the information. An important question is "what is the size of k that represents the
data 'well'?"
Later, we will compute the eigenvectors (the principal components) of the dataset and collect them in a
projection matrix. Each of those eigenvectors is associated with an eigenvalue, which can be
interpreted as the "length" or "magnitude" of the corresponding eigenvector. If some eigenvalues have a
significantly larger magnitude than others, then reducing the dataset via PCA onto a smaller-dimensional
subspace by dropping the "less informative" eigenpairs is reasonable.
A Summary of the PCA Approach
[back to top]
1. Standardize the data.
2. Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix, or perform a Singular Value Decomposition.
3. Sort the eigenvalues in descending order and choose the k eigenvectors that correspond to the k largest eigenvalues, where k is the number of dimensions of the new feature subspace (k <= d).
4. Construct the projection matrix W from the selected k eigenvectors.
5. Transform the original dataset X via W to obtain the k-dimensional feature subspace Y.
Preparing the Iris Dataset
[back to top]
About Iris
[back to top]
For the following tutorial, we will be working with the famous "Iris" dataset that has been deposited on
the UCI machine learning repository
(https://archive.ics.uci.edu/ml/datasets/Iris).
The iris dataset contains measurements for 150 iris flowers from three different species:
1. Iris-setosa (n=50)
2. Iris-versicolor (n=50)
3. Iris-virginica (n=50)
And the four features measured for each sample are:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
(Image: Iris)
Loading the Dataset
[back to top]
In order to load the Iris data directly from the UCI repository, we are going to use the superb pandas
(http://pandas.pydata.org) library. If you haven't used pandas yet, I want to encourage you to check out the
pandas tutorials (http://pandas.pydata.org/pandas-docs/stable/tutorials.html). If I had to name one
Python library that makes working with data a wonderfully simple task, it would definitely be pandas!
import pandas as pd

df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    header=None,
    sep=',')

df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how='all', inplace=True)  # drops the empty line at the end of the file

df.tail()
Out[4]:
(table: the last five rows of the DataFrame, with columns sepal_len, sepal_wid, petal_len, petal_wid, class)
X = df.iloc[:, 0:4].values  # the four feature columns
y = df.iloc[:, 4].values    # the class labels
Our iris dataset is now stored in the form of a 150 x 4 matrix where the columns are the different features,
and every row represents a separate flower sample. Each sample row can be pictured as a 4-dimensional
vector x = (sepal length, sepal width, petal length, petal width).
Exploratory Visualization
[back to top]
To get a feeling for how the 3 different flower classes are distributed along the 4 different features, let
us visualize them via histograms.
import matplotlib.pyplot as plt

# Map feature indices to axis labels
feature_dict = {0: 'sepal length [cm]',
                1: 'sepal width [cm]',
                2: 'petal length [cm]',
                3: 'petal width [cm]'}

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(8, 6))
    for cnt in range(4):
        plt.subplot(2, 2, cnt+1)
        for lab in ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'):
            plt.hist(X[y==lab, cnt],
                     label=lab,
                     bins=10,
                     alpha=0.3)
        plt.xlabel(feature_dict[cnt])
    plt.legend(loc='upper right', fancybox=True, fontsize=8)

    plt.tight_layout()
    plt.show()
Standardizing
[back to top]
Whether to standardize the data prior to a PCA on the covariance matrix depends on the measurement
scales of the original features. Since PCA yields a feature subspace that maximizes the variance along
the axes, it makes sense to standardize the data, especially if it was measured on different scales.
Although all features in the Iris dataset were measured in centimeters, let us continue with the
transformation of the data onto unit scale (mean=0 and variance=1), which is a requirement for the
optimal performance of many machine learning algorithms.
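The standardization step itself is not shown in this rendering. A minimal sketch using scikit-learn's StandardScaler, assuming the result is stored in X_std (the array the later cells operate on):

from sklearn.preprocessing import StandardScaler

# Standardize the features to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)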
1 - Eigendecomposition - Computing Eigenvectors and Eigenvalues
[back to top]
The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the "core" of a PCA:
The eigenvectors (principal components) determine the directions of the new feature space, and the
eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data
along the new feature axes.
Covariance Matrix
[back to top]
The classic approach to PCA is to perform the eigendecomposition on the covariance matrix $\Sigma$, which
is a $d \times d$ matrix where each element represents the covariance between two features. The covariance
between two features $x_j$ and $x_k$ is calculated as follows:

$$\sigma_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right)$$

We can summarize the calculation of the covariance matrix via the following matrix equation:

$$\Sigma = \frac{1}{n-1}\left(\mathbf{X} - \mathbf{\bar{x}}\right)^T\left(\mathbf{X} - \mathbf{\bar{x}}\right)$$

where $\mathbf{\bar{x}}$ is the mean vector $\mathbf{\bar{x}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i$.
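The code cell that produces the output below is missing from this rendering; a sketch of the "manual" computation, assuming the standardized data is stored in X_std:

import numpy as np

# Covariance matrix computed "by hand" from the standardized data
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot(X_std - mean_vec) / (X_std.shape[0] - 1)
print('Covariance matrix \n%s' % cov_mat)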
Covariance matrix
[[ 1.00671141 -0.11010327 0.87760486 0.82344326]
[-0.11010327 1.00671141 -0.42333835 -0.358937 ]
[ 0.87760486 -0.42333835 1.00671141 0.96921855]
[ 0.82344326 -0.358937 0.96921855 1.00671141]]
The more verbose way above was simply used for demonstration purposes; equivalently, we could have
used the NumPy cov function:
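A sketch of the equivalent NumPy call, followed by the eigendecomposition that defines the eig_vals and eig_vecs printed below:

# Equivalent one-liner using NumPy's cov function
cov_mat = np.cov(X_std.T)
print('NumPy covariance matrix: \n%s' % cov_mat)

# Eigendecomposition of the covariance matrix
eig_vals, eig_vecs = np.linalg.eig(cov_mat)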
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
Eigenvectors
[[ 0.52237162 -0.37231836 -0.72101681 0.26199559]
[-0.26335492 -0.92555649 0.24203288 -0.12413481]
[ 0.58125401 -0.02109478 0.14089226 -0.80115427]
[ 0.56561105 -0.06541577 0.6338014 0.52354627]]
Eigenvalues
[ 2.93035378 0.92740362 0.14834223 0.02074601]
Correlation Matrix
[back to top]
Especially in the field of finance, the correlation matrix is typically used instead of the covariance
matrix. However, the eigendecomposition of the covariance matrix (if the input data was standardized)
yields the same results as an eigendecomposition of the correlation matrix, since the correlation matrix
can be understood as the normalized covariance matrix.
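The corresponding code cell is not shown here; a sketch of the eigendecomposition of the correlation matrix of the standardized data, which produces the output below:

# Correlation matrix of the standardized data
cor_mat1 = np.corrcoef(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cor_mat1)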
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
Eigenvectors
[[ 0.52237162 -0.37231836 -0.72101681 0.26199559]
[-0.26335492 -0.92555649 0.24203288 -0.12413481]
[ 0.58125401 -0.02109478 0.14089226 -0.80115427]
[ 0.56561105 -0.06541577 0.6338014 0.52354627]]
Eigenvalues
[ 2.91081808 0.92122093 0.14735328 0.02060771]
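The third set of print statements below repeats the check on the correlation matrix of the raw (non-standardized) data, which yields the same result since correlation is scale-invariant; a sketch of the missing cell:

# Correlation matrix of the raw (non-standardized) data
cor_mat2 = np.corrcoef(X.T)
eig_vals, eig_vecs = np.linalg.eig(cor_mat2)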
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
Eigenvectors
[[ 0.52237162 -0.37231836 -0.72101681 0.26199559]
[-0.26335492 -0.92555649 0.24203288 -0.12413481]
[ 0.58125401 -0.02109478 0.14089226 -0.80115427]
[ 0.56561105 -0.06541577 0.6338014 0.52354627]]
Eigenvalues
[ 2.91081808 0.92122093 0.14735328 0.02060771]
We can clearly see that all three approaches (the covariance matrix of the standardized data, the correlation matrix of the standardized data, and the correlation matrix of the raw data) yield the same eigenvectors and eigenvalue pairs.
Singular Value Decomposition
[back to top]
While the eigendecomposition of the covariance or correlation matrix may be more intuitive, most
PCA implementations perform a Singular Value Decomposition (SVD) to improve the computational
efficiency. So, let us perform an SVD to confirm that the results are indeed the same:
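The SVD cell itself is not shown in this rendering; a minimal sketch that produces the "Vectors U" output below:

# Singular Value Decomposition of the standardized data (transposed)
u, s, v = np.linalg.svd(X_std.T)
print('Vectors U:\n%s' % u)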
Vectors U:
[[-0.52237162 -0.37231836 0.72101681 0.26199559]
[ 0.26335492 -0.92555649 -0.24203288 -0.12413481]
[-0.58125401 -0.02109478 -0.14089226 -0.80115427]
[-0.56561105 -0.06541577 -0.6338014 0.52354627]]
2 - Selecting Principal Components
[back to top]
Sorting Eigenpairs
[back to top]
The typical goal of a PCA is to reduce the dimensionality of the original feature space by projecting it
onto a smaller subspace, where the eigenvectors will form the axes. However, the eigenvectors only
define the directions of the new axes, since they all have the same unit length 1, which can be confirmed by
the following two lines of code:
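The check is not included in this rendering; a minimal sketch that produces the "Everything ok!" output below:

# Confirm that every eigenvector has unit length
for ev in eig_vecs.T:
    np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))
print('Everything ok!')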
Everything ok!
In order to decide which eigenvector(s) can be dropped without losing too much information for the
construction of the lower-dimensional subspace, we need to inspect the corresponding eigenvalues: the
eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data;
those are the ones that can be dropped.
In order to do so, the common approach is to rank the eigenvalues from highest to lowest and
choose the top k eigenvectors.
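The sorting step itself is not shown here; a sketch of how the eigenpairs can be ranked (the variable name eig_pairs is an assumption carried through the later sketches):

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for pair in eig_pairs:
    print(pair[0])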
Explained Variance
[back to top]
After sorting the eigenpairs, the next question is "how many principal components are we going to
choose for our new feature subspace?" A useful measure is the so-called "explained variance," which
can be calculated from the eigenvalues. The explained variance tells us how much information
(variance) can be attributed to each of the principal components.
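The explained-variance computation and the plot referenced below are not shown in this rendering; a sketch of how they can be produced from the sorted eigenvalues:

# Percentage of variance explained by each component, plus the running total
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    plt.bar(range(4), var_exp, alpha=0.5, align='center',
            label='individual explained variance')
    plt.step(range(4), cum_var_exp, where='mid',
             label='cumulative explained variance')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()
    plt.show()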
The plot above clearly shows that most of the variance (72.77% of the variance to be precise) can be
explained by the first principal component alone. The second principal component still bears some
information (23.03%), while the third and fourth principal components can safely be dropped without
losing too much information. Together, the first two principal components contain 95.8% of the
information.
Projection Matrix
[back to top]
It's about time to get to the really interesting part: the construction of the projection matrix that will be
used to transform the Iris data onto the new feature subspace. Although the name "projection matrix"
has a nice ring to it, it is basically just a matrix of our concatenated top k eigenvectors.
Here, we are reducing the 4-dimensional feature space to a 2-dimensional feature subspace by
choosing the "top 2" eigenvectors with the highest eigenvalues to construct our 4x2-dimensional
eigenvector matrix W.
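The construction of the matrix (not shown in this rendering) can be sketched by horizontally stacking the two leading eigenvectors from the sorted eig_pairs list:

# Stack the top-2 eigenvectors column-wise into a 4x2 projection matrix
matrix_w = np.hstack((eig_pairs[0][1].reshape(4, 1),
                      eig_pairs[1][1].reshape(4, 1)))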
print('Matrix W:\n', matrix_w)
Matrix W:
[[ 0.52237162 -0.37231836]
[-0.26335492 -0.92555649]
[ 0.58125401 -0.02109478]
[ 0.56561105 -0.06541577]]
3 - Projection Onto the New Feature Space
[back to top]
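The projection step itself is the final matrix multiplication. A minimal sketch, assuming X_std and matrix_w from the previous steps:

# Project the standardized samples onto the new 2-dimensional subspace
Y = X_std.dot(matrix_w)   # Y has shape (150, 2)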