Principal Component Analysis using
Python
Tushar B. Kute,
http://tusharkute.com
Dimensionality Reduction
• Dimensionality reduction or dimension reduction is the
process of reducing the number of random variables
under consideration by obtaining a set of principal
variables.
• It can be divided into feature selection and feature
extraction.
– Feature selection approaches try to find a subset of the original variables (also called features or attributes).
– Feature projection or feature extraction transforms the data from the high-dimensional space to a space of fewer dimensions. Both approaches are sketched below.
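As a minimal sketch of the difference, assuming scikit-learn and its bundled Iris data (a stand-in, not a dataset named on this slide): feature selection keeps some of the original columns, while feature extraction derives new ones.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.decomposition import PCA

    X, y = load_iris(return_X_y=True)

    # Feature selection: keep the 2 original features that score best
    X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

    # Feature extraction: project the data onto 2 new derived dimensions
    X_extracted = PCA(n_components=2).fit_transform(X)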
Large Dimensions
• A large number of features in a dataset is one of the factors that affects both the training time and the accuracy of machine learning models. There are several options for dealing with a huge number of features in a dataset.
– Try to train the models on the original set of features, which may take days or weeks if the number of features is very high.
– Reduce the number of variables by merging correlated variables.
– Extract the most important features from the dataset, i.e. those responsible for maximum variance in the output. Different statistical techniques are used for this purpose, e.g. linear discriminant analysis, factor analysis, and principal component analysis.
Principal Component Analysis
• Principal component analysis, or PCA, is a statistical technique for converting high-dimensional data to low-dimensional data by selecting the most important features, those that capture maximum information about the dataset.
• The features are selected on the basis of the variance they cause in the output.
• The feature that causes the highest variance is the first principal component. The feature responsible for the second-highest variance is considered the second principal component, and so on.
• It is important to mention that principal components do not have any correlation with each other, as the quick check sketched below illustrates.
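One way to see this, sketched here with numpy, scikit-learn, and the bundled Iris data as stand-ins: the correlation matrix of the PCA-transformed data is, up to rounding, the identity.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)
    components = PCA().fit_transform(X)

    # Off-diagonal correlations between components are ~0
    print(np.round(np.corrcoef(components, rowvar=False), 6))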
Advantages of PCA
• The training time of the algorithms reduces significantly with a smaller number of features.
• It is not always possible to analyze data in high dimensions. For instance, if there are 100 features in a dataset, the total number of scatter plots required to visualize all pairwise relationships would be 100(100−1)/2 = 4950. It is practically impossible to analyze data this way.
Normalization of features
• It is imperative that a feature set is normalized before applying PCA. For instance, if a feature set has data expressed in units of kilograms, light years, or millions, the variance scale in the training set is huge. If PCA is applied to such a feature set, the resultant loadings for features with high variance will also be large, and the principal components will be biased towards features with high variance, leading to false results.
• Finally, the last point to remember before we start coding is that PCA is a statistical technique and can only be applied to numeric data. Therefore, categorical features must be converted into numerical features before PCA can be applied. Both preparation steps are sketched below.
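A minimal sketch of both steps, assuming pandas and scikit-learn and a small hypothetical DataFrame with a categorical 'colour' column:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Hypothetical DataFrame with one categorical column, 'colour'
    df = pd.DataFrame({'weight_kg': [61.0, 72.5, 80.1],
                       'distance_ly': [4.2, 8.6, 11.9],
                       'colour': ['red', 'green', 'red']})

    # Step 1: convert categorical features to numeric one-hot columns
    df = pd.get_dummies(df, columns=['colour'])

    # Step 2: standardize every feature to zero mean and unit variance
    X = StandardScaler().fit_transform(df)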
Example:
Reading the dataset
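The slides do not reproduce the code, so the sketch below is one plausible version of this step: reading the Iris dataset from the UCI repository (listed under Useful resources) with pandas and separating features from labels.

    import pandas as pd

    # Assumed example dataset: Iris, from the UCI repository
    url = "http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = pd.read_csv(url, names=names)

    X = dataset.drop('class', axis=1)   # features
    y = dataset['class']                # labels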
Normalize
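Continuing the sketch: hold out a test set, then standardize the features with scikit-learn's StandardScaler, fitted on the training data only.

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)   # fit on training data only
    X_test = sc.transform(X_test)         # reuse the same scaling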
Apply PCA
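Applying PCA with scikit-learn; leaving n_components unset keeps every component, so their variance can be inspected before deciding how many to retain.

    from sklearn.decomposition import PCA

    pca = PCA()                              # keep all components for now
    X_train_pca = pca.fit_transform(X_train)
    X_test_pca = pca.transform(X_test)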
Calculate variance
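The fitted PCA object reports the fraction of the total variance captured by each component:

    explained_variance = pca.explained_variance_ratio_
    print(explained_variance)   # one fraction per component; they sum to 1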
Variance plot
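A sketch of the variance plot with matplotlib, showing cumulative explained variance against the number of components:

    import numpy as np
    import matplotlib.pyplot as plt

    plt.plot(np.cumsum(pca.explained_variance_ratio_), marker='o')
    plt.xlabel('Number of principal components')
    plt.ylabel('Cumulative explained variance')
    plt.show()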
Variance Ratio
• The PCA class exposes explained_variance_ratio_, which returns the fraction of the total variance explained by each of the principal components.
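Besides inspecting the ratios directly, scikit-learn's PCA also accepts a float n_components and keeps just enough components to reach that variance fraction. For example, continuing from the scaled training data above:

    from sklearn.decomposition import PCA

    pca = PCA(n_components=0.95)   # enough components for 95% of the variance
    X_reduced = pca.fit_transform(X_train)
    print(pca.n_components_)             # how many components were kept
    print(pca.explained_variance_ratio_)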
Principal Components = 1
Principal Components = 2
Principal Components = 3
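The three slides above compare results for one, two, and three retained components. A plausible sketch of that comparison, assuming a random forest classifier (the slides do not name the model) on the scaled train/test split from the earlier steps:

    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    for n in (1, 2, 3):
        pca = PCA(n_components=n)
        Xtr = pca.fit_transform(X_train)   # scaled training features
        Xte = pca.transform(X_test)

        clf = RandomForestClassifier(random_state=0)
        clf.fit(Xtr, y_train)
        acc = accuracy_score(y_test, clf.predict(Xte))
        print(n, 'component(s): accuracy =', acc)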
Useful resources
• https://stackabuse.com
• http://archive.ics.uci.edu/ml/index.php
• https://scikit-learn.org
• https://en.wikipedia.org
• www.towardsdatascience.com
• www.analyticsvidhya.com
• www.kaggle.com
• www.github.com
Thank you
This presentation was created using LibreOffice Impress 5.1.6.2 and can be used freely as per the GNU General Public License.
/mITuSkillologies @mitu_group /company/mitu-skillologies /c/MITUSkillologies
Web Resources
http://mitu.co.in
http://tusharkute.com
[email protected]
[email protected]