Principal Component Analysis

Principal Component Analysis (PCA) is a statistical technique used for dimensionality reduction, transforming correlated variables into uncorrelated variables while retaining maximum variance. It helps simplify data analysis, improve performance, and visualize high-dimensional data effectively. However, PCA has limitations, including potential information loss, sensitivity to data scaling, and challenges in interpreting principal components.

Uploaded by

bca2m2

Principal Component Analysis


• Principal component analysis, or PCA, is a
statistical procedure that allows you to
summarize the information content in large
data tables by means of a smaller set of
“summary indices” that can be more easily
visualized and analyzed.
• As the number of features or dimensions in a
dataset increases, the amount of data required to
obtain a statistically significant result increases
exponentially.
• This can lead to issues such as overfitting,
increased computation time, and reduced
accuracy of machine learning models. These
problems, which arise when working with
high-dimensional data, are known as the
curse of dimensionality.
• As the number of dimensions increases, the number
of possible combinations of features increases
exponentially, which makes it computationally
difficult to obtain a representative sample of the
data and it becomes expensive to perform tasks
such as clustering or classification.
• Additionally, some machine learning algorithms can
be sensitive to the number of dimensions, requiring
more data to achieve the same level of accuracy as
they would on lower-dimensional data.
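The exponential growth described above can be made concrete with a small sketch. It assumes, purely for illustration, that each feature's range is split into 10 bins; the function name `grid_cells` is hypothetical and not from the slides.

```python
# Hypothetical illustration: split every feature into 10 bins and count
# how many cells the resulting grid contains as features are added.
BINS_PER_FEATURE = 10  # assumed value, chosen only for the example


def grid_cells(num_features: int, bins: int = BINS_PER_FEATURE) -> int:
    """Number of cells in a grid with `bins` bins along each feature."""
    return bins ** num_features


for d in (1, 2, 3, 10):
    # Each extra feature multiplies the number of cells by `bins`.
    print(f"{d} feature(s): {grid_cells(d)} cells")
```

To keep even one sample per cell, the dataset would have to grow at the same rate, which is why a representative sample becomes hard to obtain.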
• To address the curse of dimensionality,
feature engineering techniques are used,
which include feature selection and feature
extraction.
• Dimensionality reduction is a type of feature
extraction technique that aims to reduce the
number of input features while retaining as
much of the original information as possible.
What is Principal Component
Analysis (PCA)?
• The Principal Component Analysis (PCA) technique
was introduced by the mathematician Karl
Pearson in 1901.
• It works on the principle that when data in a
higher-dimensional space is mapped to a
lower-dimensional space, the variance of the
data in the lower-dimensional space should be
as large as possible.
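The variance condition above can be written as an optimization problem. The notation here is an assumption, not taken from the slides: X is the centered data matrix, S its covariance matrix, and w a unit-length direction vector.

```latex
% First principal direction: the unit vector along which the
% projected data has the largest variance.
w_1 = \arg\max_{\|w\| = 1} \operatorname{Var}(Xw)
    = \arg\max_{\|w\| = 1} w^{\top} S w
```

The maximizer is the eigenvector of S with the largest eigenvalue, and that eigenvalue equals the variance captured along w_1.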
• Principal Component Analysis (PCA) is a
statistical procedure that uses an orthogonal
transformation that converts a set of
correlated variables to a set of uncorrelated
variables.
• PCA is one of the most widely used tools in
exploratory data analysis and in machine
learning for predictive models.
• Principal Component Analysis (PCA) is an
unsupervised learning technique used to
examine the interrelations among a set of variables.
• It is closely related to factor analysis. Where
regression determines a line of best fit for predicting
one variable, PCA finds the line that best fits the
data across all variables at once.
• The main goal of Principal Component Analysis (PCA)
is to reduce the dimensionality of a dataset while
preserving the most important patterns or
relationships between the variables without any
prior knowledge of the target variables.
• Principal Component Analysis (PCA) is used to
reduce the dimensionality of a data set by
finding a new set of variables, smaller than the
original set of variables, retaining most of the
sample’s information, and useful for the
regression and classification of data.
• Principal Component Analysis (PCA) is a technique
for dimensionality reduction that identifies a set
of orthogonal axes, called principal components,
that capture the maximum variance in the data.
• The principal components are linear combinations
of the original variables in the dataset and are
ordered in decreasing order of importance.
• The total variance captured by all the principal
components is equal to the total variance in the
original dataset.
• The first principal component captures the
most variation in the data, the second
principal component captures the maximum
remaining variance in a direction orthogonal
to the first principal component, and so on.
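The properties listed above (orthogonal axes, decreasing order of variance, total variance preserved) can be checked numerically. This is a minimal sketch on synthetic data generated only for the example; it obtains the components from the eigendecomposition of the covariance matrix with NumPy, which is one common way to compute PCA.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): 200 samples, 3 features,
# where the first two features are strongly correlated.
base = rng.normal(size=(200, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2 * base + 0.5 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

cov = np.cov(X, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in
# ascending order, so reorder them from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # columns are principal components

# The components form an orthonormal set.
orthogonal = np.allclose(eigenvectors.T @ eigenvectors, np.eye(3))

# The variances along the components add up to the total variance
# (the trace of the covariance matrix).
total_variance_matches = np.isclose(eigenvalues.sum(), np.trace(cov))

print(orthogonal, total_variance_matches)
```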
• Principal Component Analysis can be used for a
variety of purposes, including data visualization,
feature selection, and data compression.
• In data visualization, PCA can be used to plot high-
dimensional data in two or three dimensions, making
it easier to interpret.
• In feature selection, PCA can be used to identify the
variables that contribute most to the leading components.
• In data compression, PCA can be used to reduce the
size of a dataset while losing little important information.
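For the visualization use mentioned above, a short sketch of projecting data onto its first two principal components follows. The helper name `project_to_2d` and the random 5-dimensional data are assumptions made for the example; the projection uses the singular value decomposition of the centered data.

```python
import numpy as np


def project_to_2d(X: np.ndarray) -> np.ndarray:
    """Project the rows of X onto the first two principal components."""
    Xc = X - X.mean(axis=0)  # center each feature
    # Rows of Vt are the principal directions, ordered by decreasing
    # singular value (and therefore decreasing variance).
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T


rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))  # 50 samples in 5 dimensions (synthetic)
points_2d = project_to_2d(X)
print(points_2d.shape)
```

The two columns of `points_2d` can then be passed to any plotting library as x and y coordinates.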
• In Principal Component Analysis, it is assumed
that the information is carried in the variance
of the features, that is, the higher the
variation in a feature, the more information
that feature carries.
• Overall, PCA is a powerful tool for data
analysis and can help to simplify complex
datasets, making them easier to understand
and work with.
Step-By-Step Explanation of PCA (Principal
Component Analysis)
• Step 1: Standardization
– First, we need to standardize our dataset to
ensure that each variable has a mean of 0 and a
standard deviation of 1:
$$ Z = \frac{X - \mu}{\sigma} $$
• Here, µ is the mean of the independent features.
• σ is the standard deviation of the independent
features.
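The standardization step can be sketched in a few lines of NumPy. The function name `standardize` and the small sample matrix are assumptions for illustration; the code uses the population standard deviation (NumPy's default, `ddof=0`).

```python
import numpy as np


def standardize(X: np.ndarray) -> np.ndarray:
    """Rescale each column to mean 0 and standard deviation 1."""
    mu = X.mean(axis=0)     # per-feature mean
    sigma = X.std(axis=0)   # per-feature standard deviation (ddof=0)
    return (X - mu) / sigma


# Two features on very different scales (made-up values).
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0],
              [4.0, 500.0]])
Z = standardize(X)
print(Z.mean(axis=0), Z.std(axis=0))
```

A constant column has a standard deviation of 0 and would cause a division by zero, so such columns would need to be dropped or handled separately before this step.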
Step 2: Covariance Matrix Computation
• Covariance measures the joint variability of
two variables, indicating how much they
change in relation to each other. To find the
covariance we can use the formula:
$$ \operatorname{cov}(x_1, x_2) = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{n - 1} $$
• The value of covariance can be
positive, negative, or zero.
– Positive: As x1 increases, x2 also
increases.
– Negative: As x1 increases, x2
decreases.
– Zero: No linear relationship.
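A minimal sketch of the covariance matrix computation follows, assuming the sample covariance with an n − 1 denominator (the same convention as NumPy's `np.cov` default). The function name `covariance_matrix` and the toy variables are hypothetical.

```python
import numpy as np


def covariance_matrix(X: np.ndarray) -> np.ndarray:
    """Sample covariance matrix of the columns of X (n - 1 denominator)."""
    n = X.shape[0]
    centered = X - X.mean(axis=0)
    return centered.T @ centered / (n - 1)


# Toy variables: x2 rises with x1, x3 falls as x1 rises.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 4, 6, 8, 10]
x3 = [5, 4, 3, 2, 1]
data = np.column_stack([x1, x2, x3]).astype(float)

C = covariance_matrix(data)
print(C)
```

The entry `C[0, 1]` is positive and `C[0, 2]` is negative, matching the sign interpretation given above. The diagonal entries are the variances of the individual variables.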
Step 3: Compute Eigenvalues and Eigenvectors of
Covariance Matrix to Identify Principal Components

• Let A be a square n×n matrix and X be a
non-zero vector for which
$$ AX = \lambda X $$
for some scalar value λ.
• Then λ is known as an eigenvalue of matrix A
and X is known as an eigenvector of matrix A
for the corresponding eigenvalue.
• It can also be written as:
$$ AX - \lambda X = 0 $$
$$ (A - \lambda I)X = 0 $$
• where I is the identity matrix of the same
shape as matrix A.
• For a non-zero X, this condition holds only if
(A − λI) is non-invertible (i.e. a singular
matrix). That means,
$$ |A - \lambda I| = 0 $$
• From the above equation, we can find the
eigenvalues λ, and the corresponding
eigenvectors can then be found using the
equation AX = λX.
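The eigenvalue step can be sketched on a small symmetric matrix. The 2×2 matrix `A` is a made-up stand-in for a covariance matrix, and `np.linalg.eigh` is used on the assumption that the input is symmetric, as covariance matrices are.

```python
import numpy as np

# Made-up symmetric matrix standing in for a covariance matrix.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns eigenvalues in ascending order; columns of the second
# result are the matching eigenvectors.
eigenvalues, eigenvectors = np.linalg.eigh(A)

# Sort from largest to smallest so the first column is the first
# principal component.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Check A x = lambda x and det(A - lambda I) = 0 for each pair.
satisfies_definition = all(
    np.allclose(A @ x, lam * x) for lam, x in zip(eigenvalues, eigenvectors.T)
)
singular = all(
    np.isclose(np.linalg.det(A - lam * np.eye(2)), 0.0) for lam in eigenvalues
)

# Share of total variance attributed to each component.
explained = eigenvalues / eigenvalues.sum()
print(eigenvalues, explained)
```

For this matrix the eigenvalues are 3 and 1, so the first component accounts for 75% of the total variance.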
Advantages of Principal Component Analysis
• Dimensionality Reduction
– Principal Component Analysis is a popular technique used for
dimensionality reduction, which is the process of reducing the
number of variables in a dataset.
– By reducing the number of variables, PCA simplifies data
analysis, improves performance, and makes it easier to
visualize data.
• Feature Selection
– Principal Component Analysis can be used for feature selection,
which is the process of selecting the most important variables
in a dataset.
– This is useful in machine learning, where the number of
variables can be very large, and it is difficult to identify the
most important variables.
• Data Visualization
– Principal Component Analysis can be used for data
visualization.
– By reducing the number of variables, PCA can plot high-
dimensional data in two or three dimensions, making it easier
to interpret.
• Multicollinearity
– Principal Component Analysis can be used to deal with
multicollinearity, which is a common problem in regression
analysis where two or more independent variables are highly
correlated.
– PCA can help identify the underlying structure in the data and
create new, uncorrelated variables that can be used in the
regression model.
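The multicollinearity point can be illustrated with two nearly collinear variables. This is a sketch on synthetic data. The variable names are hypothetical, and the principal component scores are obtained by projecting the centered data onto the eigenvectors of its covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two nearly collinear predictors (synthetic, for illustration only).
x1 = rng.normal(size=300)
x2 = x1 + 0.05 * rng.normal(size=300)
X = np.column_stack([x1, x2])

Xc = X - X.mean(axis=0)
_, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Principal component scores: the data expressed in the new axes.
scores = Xc @ eigenvectors

corr_original = np.corrcoef(X, rowvar=False)[0, 1]
corr_scores = np.corrcoef(scores, rowvar=False)[0, 1]
print(corr_original, corr_scores)
```

The original variables are almost perfectly correlated, while the correlation between the two score columns is zero up to floating-point error, so the scores can be used as regressors without the collinearity problem.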
• Noise Reduction
– Principal Component Analysis can be used to reduce the noise in
data.
– By removing the principal components with low variance, which
are assumed to represent noise, Principal Component Analysis
can improve the signal-to-noise ratio and make it easier to
identify the underlying structure in the data.
• Data Compression
– Principal Component Analysis can be used for data compression.
– By representing the data using a smaller number of principal
components, which capture most of the variation in the data,
PCA can reduce the storage requirements and speed up
processing.
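Noise reduction and compression both come down to keeping only the leading components and reconstructing from them. The sketch below assumes a synthetic rank-1 signal with added noise; the function name `compress_and_reconstruct` is hypothetical.

```python
import numpy as np


def compress_and_reconstruct(X: np.ndarray, k: int):
    """Keep the top-k principal components and rebuild an approximation."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                  # (k, n_features)
    scores = Xc @ components.T           # compressed form, (n_samples, k)
    reconstructed = scores @ components + mean
    return scores, reconstructed


rng = np.random.default_rng(3)
# Synthetic signal that lies along a single direction in 4 dimensions.
signal = rng.normal(size=(100, 1)) @ np.array([[1.0, 2.0, -1.0, 0.5]])
noisy = signal + 0.05 * rng.normal(size=signal.shape)

scores, denoised = compress_and_reconstruct(noisy, k=1)

error_noisy = np.mean((noisy - signal) ** 2)
error_denoised = np.mean((denoised - signal) ** 2)
print(scores.shape, error_noisy, error_denoised)
```

Here 4 values per sample are stored as 1 score (plus the shared mean and component vector), and the reconstruction is closer to the clean signal than the noisy input because the discarded low-variance components held mostly noise. On real data the split between signal and noise is rarely this clean.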
• Outlier Detection
– Principal Component Analysis can be used for
outlier detection.
– Outliers are data points that are significantly
different from the other data points in the dataset.
– Principal Component Analysis can identify these
outliers by looking for data points that are far from
the other points in the principal component space.
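One possible way to measure "far from the other points in the principal component space" is sketched below. Dividing each score by the variance of its component is an assumption of this example rather than something the slides specify, and the outlier is planted by hand in synthetic data.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
X[0] = [15.0, 15.0, 15.0]  # planted outlier (assumption for the demo)

Xc = X - X.mean(axis=0)
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
scores = Xc @ eigenvectors  # coordinates in principal component space

# Distance from the center, with each component scaled by its variance
# so that high-variance directions do not dominate the score.
distances = np.sqrt((scores ** 2 / eigenvalues).sum(axis=1))

outlier_index = int(np.argmax(distances))
print(outlier_index, distances[outlier_index])
```

In practice a threshold on `distances` would flag candidate outliers for inspection; the choice of threshold depends on the data.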
Disadvantages of Principal Component Analysis
• Interpretation of Principal Components
– The principal components created by Principal
Component Analysis are linear combinations of the
original variables, and it is often difficult to interpret
them in terms of the original variables.
– This can make it difficult to explain the results of PCA to
others.
• Data Scaling
– Principal Component Analysis is sensitive to the scale of
the data. If the variables are not on comparable scales,
those with the largest numeric ranges dominate the
principal components.
– Therefore, it is important to scale the data before
applying Principal Component Analysis.
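The effect of scale can be seen by comparing the first principal component before and after standardization. The two variables below (a height in metres and an income) are hypothetical and generated at random; `first_component` is an illustrative helper, not a library function.

```python
import numpy as np


def first_component(X: np.ndarray) -> np.ndarray:
    """Eigenvector of the covariance matrix with the largest eigenvalue."""
    Xc = X - X.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return eigenvectors[:, np.argmax(eigenvalues)]


rng = np.random.default_rng(5)
height_m = rng.normal(1.7, 0.1, size=200)       # small numeric range
income = rng.normal(50_000, 10_000, size=200)   # large numeric range
X = np.column_stack([height_m, income])

pc1_raw = first_component(X)

Z = (X - X.mean(axis=0)) / X.std(axis=0)        # standardized copy
pc1_scaled = first_component(Z)

print(pc1_raw, pc1_scaled)
```

On the raw data the first component points almost entirely along the income axis, simply because its numbers are larger. After standardization both variables carry equal weight in the first component.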
• Information Loss
– Principal Component Analysis reduces the number of
variables, but discarding components also discards
some information.
– The degree of information loss depends on the number of
principal components selected.
– Therefore, it is important to carefully select the number of
principal components to retain.
• Non-linear Relationships
– Principal Component Analysis assumes that the relationships
between variables are linear.
– However, if there are non-linear relationships between
variables, Principal Component Analysis may not work well.
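For the information-loss point above, one common heuristic is to keep the smallest number of components whose cumulative share of variance reaches a chosen threshold. The function name `components_for_variance`, the thresholds, and the eigenvalues below are assumptions for illustration.

```python
import numpy as np


def components_for_variance(eigenvalues, threshold: float = 0.95) -> int:
    """Smallest number of components whose cumulative variance share
    reaches `threshold` (a fraction between 0 and 1)."""
    values = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    cumulative = np.cumsum(values / values.sum())
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding pushing the result past the last component.
    return min(k, len(values))


# Made-up eigenvalues of a covariance matrix.
eigenvalues = np.array([5.0, 3.0, 1.5, 0.5])

k_90 = components_for_variance(eigenvalues, threshold=0.90)
k_75 = components_for_variance(eigenvalues, threshold=0.75)
print(k_90, k_75)
```

With these eigenvalues the cumulative shares are 50%, 80%, 95%, and 100%, so three components are needed for 90% and two for 75%. The share that is left out is the information given up.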
• Computational Complexity
– Computing Principal Component Analysis can be
computationally expensive for large datasets.
– This is especially true if the number of variables in the
dataset is large.
• Overfitting
– A model built on principal components can sometimes
overfit, which is when the model fits the training
data too well and performs poorly on new data.
– This can happen if too many principal components
are used or if the model is trained on a small dataset.