Dimensionality Reduction - Principal Component Analysis (.ipynb) - llSourcell - GitHub
CPython 3.5.1
IPython 4.2.0
scikit-learn 0.17.1
matplotlib 1.5.1
numpy 1.11.0
pandas 0.18.1
This article just got a complete overhaul; the original version is still available at principal_component_analysis_old.ipynb (http://nbviewer.ipython.org/github/rasbt/pattern_classification/blob/master/dimensionality_reduction/projection
Sections
Introduction
PCA Vs. LDA
PCA and Dimensionality Reduction
A Summary of the PCA Approach
Preparing the Iris Dataset
About Iris
Loading the Dataset
Exploratory Visualization
Standardizing
1 - Eigendecomposition - Computing Eigenvectors and Eigenvalues
Covariance Matrix
Correlation Matrix
Singular Value Decomposition
2 - Selecting Principal Components
Sorting Eigenpairs
Explained Variance
Projection Matrix
3 - Projection Onto the New Feature Space
Shortcut - PCA in scikit-learn
Introduction
[back to top]
The sheer size of data in the modern age is not only a challenge for computer hardware but also a main
bottleneck for the performance of many machine learning algorithms. The main goal of a PCA analysis
is to identify patterns in data; PCA aims to detect the correlation between variables. Attempting to
reduce the dimensionality only makes sense if the variables are strongly correlated. In a
nutshell, this is what PCA is all about: finding the directions of maximum variance in high-dimensional
data and projecting the data onto a smaller-dimensional subspace while retaining most of the information.
PCA Vs. LDA
[back to top]
Both Linear Discriminant Analysis (LDA) and PCA are linear transformation methods. PCA yields the
directions (principal components) that maximize the variance of the data, whereas LDA also aims to
find the directions that maximize the separation (or discrimination) between different classes, which
can be useful in pattern classification problems (PCA "ignores" class labels).
In other words, PCA projects the entire dataset onto a different feature (sub)space, and LDA tries to
determine a suitable feature (sub)space in order to distinguish between patterns that belong to different
classes.
PCA and Dimensionality Reduction
[back to top]
Often, the desired goal is to reduce the dimensions of a d-dimensional dataset by projecting it onto a
k-dimensional subspace (where k < d) in order to increase the computational efficiency while
retaining most of the information. An important question is "what is the size of k that represents the
data 'well'?"
Later, we will compute the eigenvectors (the principal components) of the dataset and collect them in a
projection matrix. Each of those eigenvectors is associated with an eigenvalue, which can be
interpreted as the "length" or "magnitude" of the corresponding eigenvector. If some eigenvalues have a
significantly larger magnitude than others, then reducing the dataset via PCA onto a smaller-dimensional
subspace by dropping the "less informative" eigenpairs is reasonable.
A Summary of the PCA Approach
[back to top]
1. Standardize the data.
2. Obtain the eigenvectors and eigenvalues from the covariance matrix or correlation matrix, or perform a Singular Value Decomposition.
3. Sort the eigenvalues in descending order and choose the k eigenvectors that correspond to the k largest eigenvalues, where k is the number of dimensions of the new feature subspace (k <= d).
4. Construct the projection matrix W from the selected k eigenvectors.
5. Transform the original dataset X via W to obtain the k-dimensional feature subspace Y.
Preparing the Iris Dataset
[back to top]
About Iris
[back to top]
For the following tutorial, we will be working with the famous "Iris" dataset that has been deposited on
the UCI machine learning repository
(https://archive.ics.uci.edu/ml/datasets/Iris).
The iris dataset contains measurements for 150 iris flowers from three different species:
1. Iris-setosa (n=50)
2. Iris-versicolor (n=50)
3. Iris-virginica (n=50)
And the four features measured for each sample are:
1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
(Image: Iris)
Loading the Dataset
[back to top]
In order to load the Iris data directly from the UCI repository, we are going to use the superb pandas
(http://pandas.pydata.org) library. If you haven't used pandas yet, I want to encourage you to check out the
pandas tutorials (http://pandas.pydata.org/pandas-docs/stable/tutorials.html). If I had to name one
Python library that makes working with data a wonderfully simple task, it would definitely be pandas!
import pandas as pd

df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
    header=None,
    sep=',')

df.columns = ['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how='all', inplace=True)  # drops the empty line at the end of the file

df.tail()
Out[4]:
(table: the last five rows of the DataFrame, with columns sepal_len, sepal_wid, petal_len, petal_wid, class)
X = df.iloc[:, 0:4].values  # the four feature columns
y = df.iloc[:, 4].values    # the class labels
Our iris dataset is now stored in the form of a 150 x 4 matrix where the columns are the different features,
and every row represents a separate flower sample. Each sample row can be pictured as a 4-dimensional
vector x = (sepal length, sepal width, petal length, petal width).
Exploratory Visualization
[back to top]
To get a feeling for how the 3 different flower classes are distributed along the 4 different features, let
us visualize them via histograms.
import matplotlib.pyplot as plt

# Map feature indices to axis labels
feature_dict = {0: 'sepal length [cm]',
                1: 'sepal width [cm]',
                2: 'petal length [cm]',
                3: 'petal width [cm]'}

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(8, 6))
    for cnt in range(4):
        plt.subplot(2, 2, cnt+1)
        for lab in ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica'):
            plt.hist(X[y==lab, cnt],
                     label=lab,
                     bins=10,
                     alpha=0.3)
        plt.xlabel(feature_dict[cnt])
    plt.legend(loc='upper right', fancybox=True, fontsize=8)

    plt.tight_layout()
    plt.show()
Standardizing
[back to top]
Whether to standardize the data prior to a PCA on the covariance matrix depends on the measurement
scales of the original features. Since PCA yields a feature subspace that maximizes the variance along
the axes, it makes sense to standardize the data, especially if it was measured on different scales.
Although all features in the Iris dataset were measured in centimeters, let us continue with the
transformation of the data onto unit scale (mean=0 and variance=1), which is a requirement for the
optimal performance of many machine learning algorithms.
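The standardization step itself is not shown in this rendering. A minimal sketch using scikit-learn's StandardScaler, assuming the result is stored in X_std (the array the later cells operate on):

from sklearn.preprocessing import StandardScaler

# Standardize the features to zero mean and unit variance
X_std = StandardScaler().fit_transform(X)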
1 - Eigendecomposition - Computing Eigenvectors and Eigenvalues
[back to top]
The eigenvectors and eigenvalues of a covariance (or correlation) matrix represent the "core" of a PCA:
The eigenvectors (principal components) determine the directions of the new feature space, and the
eigenvalues determine their magnitude. In other words, the eigenvalues explain the variance of the data
along the new feature axes.
Covariance Matrix
[back to top]
The classic approach to PCA is to perform the eigendecomposition on the covariance matrix $\Sigma$, which
is a $d \times d$ matrix where each element represents the covariance between two features. The covariance
between two features $x_j$ and $x_k$ is calculated as follows:

$$\sigma_{jk} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{ij} - \bar{x}_j\right)\left(x_{ik} - \bar{x}_k\right)$$

We can summarize the calculation of the covariance matrix via the following matrix equation:

$$\Sigma = \frac{1}{n-1}\left(\mathbf{X} - \mathbf{\bar{x}}\right)^T\left(\mathbf{X} - \mathbf{\bar{x}}\right)$$

where $\mathbf{\bar{x}}$ is the mean vector $\mathbf{\bar{x}} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{x}_i$.
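The code cell that produces the output below is missing from this rendering; a sketch of the "manual" computation, assuming the standardized data is stored in X_std:

import numpy as np

# Covariance matrix computed "by hand" from the standardized data
mean_vec = np.mean(X_std, axis=0)
cov_mat = (X_std - mean_vec).T.dot(X_std - mean_vec) / (X_std.shape[0] - 1)
print('Covariance matrix \n%s' % cov_mat)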
Covariance matrix
[[ 1.00671141 -0.11010327 0.87760486 0.82344326]
[-0.11010327 1.00671141 -0.42333835 -0.358937 ]
[ 0.87760486 -0.42333835 1.00671141 0.96921855]
[ 0.82344326 -0.358937 0.96921855 1.00671141]]
The more verbose way above was simply used for demonstration purposes; equivalently, we could have
used the NumPy cov function:
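A sketch of the equivalent NumPy call, followed by the eigendecomposition that defines the eig_vals and eig_vecs printed below:

# Equivalent one-liner using NumPy's cov function
cov_mat = np.cov(X_std.T)
print('NumPy covariance matrix: \n%s' % cov_mat)

# Eigendecomposition of the covariance matrix
eig_vals, eig_vecs = np.linalg.eig(cov_mat)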
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
Eigenvectors
[[ 0.52237162 -0.37231836 -0.72101681 0.26199559]
[-0.26335492 -0.92555649 0.24203288 -0.12413481]
[ 0.58125401 -0.02109478 0.14089226 -0.80115427]
[ 0.56561105 -0.06541577 0.6338014 0.52354627]]
Eigenvalues
[ 2.93035378 0.92740362 0.14834223 0.02074601]
Correlation Matrix
[back to top]
Especially in the field of finance, the correlation matrix is typically used instead of the covariance
matrix. However, the eigendecomposition of the covariance matrix (if the input data was standardized)
yields the same results as an eigendecomposition of the correlation matrix, since the correlation matrix
can be understood as the normalized covariance matrix.
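The corresponding code cell is not shown here; a sketch of the eigendecomposition of the correlation matrix of the standardized data, which produces the output below:

# Correlation matrix of the standardized data
cor_mat1 = np.corrcoef(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cor_mat1)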
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
Eigenvectors
[[ 0.52237162 -0.37231836 -0.72101681 0.26199559]
[-0.26335492 -0.92555649 0.24203288 -0.12413481]
[ 0.58125401 -0.02109478 0.14089226 -0.80115427]
[ 0.56561105 -0.06541577 0.6338014 0.52354627]]
Eigenvalues
[ 2.91081808 0.92122093 0.14735328 0.02060771]
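The third set of print statements below repeats the check on the correlation matrix of the raw (non-standardized) data, which yields the same result since correlation is scale-invariant; a sketch of the missing cell:

# Correlation matrix of the raw (non-standardized) data
cor_mat2 = np.corrcoef(X.T)
eig_vals, eig_vecs = np.linalg.eig(cor_mat2)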
print('Eigenvectors \n%s' % eig_vecs)
print('\nEigenvalues \n%s' % eig_vals)
Eigenvectors
[[ 0.52237162 -0.37231836 -0.72101681 0.26199559]
[-0.26335492 -0.92555649 0.24203288 -0.12413481]
[ 0.58125401 -0.02109478 0.14089226 -0.80115427]
[ 0.56561105 -0.06541577 0.6338014 0.52354627]]
Eigenvalues
[ 2.91081808 0.92122093 0.14735328 0.02060771]
We can clearly see that all three approaches (the covariance matrix of the standardized data, the correlation matrix of the standardized data, and the correlation matrix of the raw data) yield the same eigenvectors and eigenvalue pairs.
Singular Value Decomposition
[back to top]
While the eigendecomposition of the covariance or correlation matrix may be more intuitive, most
PCA implementations perform a Singular Value Decomposition (SVD) to improve the computational
efficiency. So, let us perform an SVD to confirm that the results are indeed the same:
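The SVD cell itself is not shown in this rendering; a minimal sketch that produces the "Vectors U" output below:

# Singular Value Decomposition of the standardized data (transposed)
u, s, v = np.linalg.svd(X_std.T)
print('Vectors U:\n%s' % u)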
Vectors U:
[[-0.52237162 -0.37231836 0.72101681 0.26199559]
[ 0.26335492 -0.92555649 -0.24203288 -0.12413481]
[-0.58125401 -0.02109478 -0.14089226 -0.80115427]
[-0.56561105 -0.06541577 -0.6338014 0.52354627]]
2 - Selecting Principal Components
[back to top]
Sorting Eigenpairs
[back to top]
The typical goal of a PCA is to reduce the dimensionality of the original feature space by projecting it
onto a smaller subspace, where the eigenvectors will form the axes. However, the eigenvectors only
define the directions of the new axes, since they all have the same unit length 1, which can be confirmed by
the following two lines of code:
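The check is not included in this rendering; a minimal sketch that produces the "Everything ok!" output below:

# Confirm that every eigenvector has unit length
for ev in eig_vecs.T:
    np.testing.assert_array_almost_equal(1.0, np.linalg.norm(ev))
print('Everything ok!')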
Everything ok!
In order to decide which eigenvector(s) can be dropped without losing too much information for the
construction of the lower-dimensional subspace, we need to inspect the corresponding eigenvalues: the
eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data;
those are the ones that can be dropped.
In order to do so, the common approach is to rank the eigenvalues from highest to lowest and
choose the top k eigenvectors.
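The sorting step itself is not shown here; a sketch of how the eigenpairs can be ranked (the variable name eig_pairs is an assumption carried through the later sketches):

# Make a list of (eigenvalue, eigenvector) tuples
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:, i]) for i in range(len(eig_vals))]

# Sort the (eigenvalue, eigenvector) tuples from high to low
eig_pairs.sort(key=lambda x: x[0], reverse=True)

# Visually confirm that the list is correctly sorted by decreasing eigenvalues
print('Eigenvalues in descending order:')
for pair in eig_pairs:
    print(pair[0])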
Explained Variance
[back to top]
After sorting the eigenpairs, the next question is "how many principal components are we going to
choose for our new feature subspace?" A useful measure is the so-called "explained variance," which
can be calculated from the eigenvalues. The explained variance tells us how much information
(variance) can be attributed to each of the principal components.
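The explained-variance computation and the plot referenced below are not shown in this rendering; a sketch of how they can be produced from the sorted eigenvalues:

# Percentage of variance explained by each component, plus the running total
tot = sum(eig_vals)
var_exp = [(i / tot) * 100 for i in sorted(eig_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)

with plt.style.context('seaborn-whitegrid'):
    plt.figure(figsize=(6, 4))
    plt.bar(range(4), var_exp, alpha=0.5, align='center',
            label='individual explained variance')
    plt.step(range(4), cum_var_exp, where='mid',
             label='cumulative explained variance')
    plt.ylabel('Explained variance ratio')
    plt.xlabel('Principal components')
    plt.legend(loc='best')
    plt.tight_layout()
    plt.show()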
The plot above clearly shows that most of the variance (72.77% of the variance to be precise) can be
explained by the first principal component alone. The second principal component still bears some
information (23.03%), while the third and fourth principal components can safely be dropped without
losing too much information. Together, the first two principal components contain 95.8% of the
information.
Projection Matrix
[back to top]
It's about time to get to the really interesting part: the construction of the projection matrix that will be
used to transform the Iris data onto the new feature subspace. Although the name "projection matrix"
has a nice ring to it, it is basically just a matrix of our concatenated top k eigenvectors.
Here, we are reducing the 4-dimensional feature space to a 2-dimensional feature subspace by
choosing the "top 2" eigenvectors with the highest eigenvalues to construct our 4x2-dimensional
eigenvector matrix W.
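The construction of the matrix (not shown in this rendering) can be sketched by horizontally stacking the two leading eigenvectors from the sorted eig_pairs list:

# Stack the top-2 eigenvectors column-wise into a 4x2 projection matrix
matrix_w = np.hstack((eig_pairs[0][1].reshape(4, 1),
                      eig_pairs[1][1].reshape(4, 1)))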
print('Matrix W:\n', matrix_w)
Matrix W:
[[ 0.52237162 -0.37231836]
[-0.26335492 -0.92555649]
[ 0.58125401 -0.02109478]
[ 0.56561105 -0.06541577]]
3 - Projection Onto the New Feature Space
[back to top]
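The projection step itself is the final matrix multiplication. A minimal sketch, assuming X_std and matrix_w from the previous steps:

# Project the standardized samples onto the new 2-dimensional subspace
Y = X_std.dot(matrix_w)   # Y has shape (150, 2)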