
Advanced Section #4:

Methods of Dimensionality Reduction:


Principal Component Analysis (PCA)

Cedric Flamant

CS109A Introduction to Data Science


Pavlos Protopapas, Kevin Rader, and Chris Tanner

Outline

1. Introduction:
a. Why Dimensionality Reduction?
b. Linear Algebra (Recap).
c. Statistics (Recap).

2. Principal Component Analysis:


a. Foundation.
b. Assumptions & Limitations.
c. Kernel PCA for nonlinear dimensionality reduction.

Dimensionality Reduction, why?

A process of reducing the number of predictor variables under consideration.

To find a more meaningful basis in which to express our data, filtering out the noise and revealing the hidden structure.

C. Bishop, Pattern Recognition and Machine Learning, Springer (2008).
A simple example taken from Physics
Consider an ideal spring-mass system oscillating along x.
We seek the pressure Y that the spring exerts on the wall.

LASSO regression model: $\hat{\beta} = \arg\min_{\beta} \, \lVert Y - X\beta \rVert_2^2 + \lambda \sum_{j} \lvert \beta_j \rvert$

LASSO variable selection: the $\ell_1$ penalty shrinks coefficients exactly to zero, so among several redundant measurements of the same displacement LASSO tends to retain only one.
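As an illustration of this selection behavior (not from the original slides), here is a minimal synthetic sketch: several nearly identical measurements of the same displacement are offered to LASSO, which typically keeps only one of them. The data and the `alpha` value are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
t = rng.normal(size=200)                         # true 1-D displacement of the mass
# three noisy, redundant measurements of the same displacement
X = np.column_stack([t + 0.01 * rng.normal(size=200) for _ in range(3)])
y = 2.0 * t + 0.1 * rng.normal(size=200)         # response driven by the displacement only

model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)   # typically only one of the three coefficients is clearly nonzero
```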

J. Shlens, A Tutorial on Principal Component Analysis (2003).
Principal Component Analysis versus LASSO

LASSO:
✗ LASSO simply selects one of the (arbitrary) redundant directions, which is scientifically unsatisfactory.
✗ We want to use all the measurements to situate the position of the mass.
✗ We want to find a lower-dimensional manifold of predictors on which the data lie.

✓ Principal Component Analysis (PCA):
A powerful statistical tool for analyzing data sets, formulated in the context of Linear Algebra.
Linear Algebra (Recap)

Symmetric matrices
Consider a design (or data) matrix $X \in \mathbb{R}^{n \times p}$ consisting of $n$ observations and $p$ predictors.

Then $X^\top X$ is a $p \times p$ symmetric matrix.

Symmetric: $(X^\top X)^\top = X^\top (X^\top)^\top = X^\top X$,
using that $(AB)^\top = B^\top A^\top$.

Similarly for $X X^\top$.

Eigenvalues and Eigenvectors
For a real and symmetric matrix $A = A^\top \in \mathbb{R}^{p \times p}$:
there exists a set of real eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_p$
and associated eigenvectors $v_1, v_2, \dots, v_p$

such that:
$A v_i = \lambda_i v_i$
$v_i^\top v_j = 0$ for $i \neq j$ (orthogonal)
$v_i^\top v_i = 1$ (normalized)
➢ Hence, they form an orthonormal basis.

Spectrum and Eigen-decomposition

Spectrum: $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_p)$

Orthogonal matrix: $V = [\, v_1 \; v_2 \; \cdots \; v_p \,]$, with $V^\top V = V V^\top = I$

Eigen-decomposition: $A = V \Lambda V^\top$
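These identities are easy to verify numerically. A minimal sketch, assuming NumPy and an arbitrary random matrix as the example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
A = X.T @ X                              # real, symmetric matrix

eigvals, V = np.linalg.eigh(A)           # eigh is specialized for symmetric matrices
Lam = np.diag(eigvals)                   # the spectrum as a diagonal matrix

print(np.allclose(V.T @ V, np.eye(4)))   # columns of V form an orthonormal basis
print(np.allclose(A, V @ Lam @ V.T))     # eigen-decomposition A = V Lambda V^T
```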

Real & Positive Eigenvalues: Gram Matrix
● The eigenvalues of $X^\top X$ are non-negative real numbers:
for a normalized eigenvector $v_i$, $\lambda_i = v_i^\top X^\top X v_i = \lVert X v_i \rVert^2 \geq 0$

Similarly for $X X^\top$.

● Hence, $X^\top X$ and $X X^\top$ are positive semi-definite.

Same eigenvalues

● $X^\top X$ and $X X^\top$ share the same nonzero eigenvalues:
if $X^\top X \, v = \lambda v$, then $X X^\top (X v) = \lambda \, (X v)$.

Same eigenvalues $\lambda$; transformed eigenvectors $u = X v$.

The sum of the eigenvalues of $X^\top X$ is equal to its trace

● Cyclic property of the trace: $\mathrm{tr}(ABC) = \mathrm{tr}(BCA) = \mathrm{tr}(CAB)$ for matrices of compatible dimensions.

Using the eigen-decomposition $X^\top X = V \Lambda V^\top$:
$\mathrm{tr}(X^\top X) = \mathrm{tr}(V \Lambda V^\top) = \mathrm{tr}(\Lambda V^\top V) = \mathrm{tr}(\Lambda) = \sum_{i=1}^{p} \lambda_i$

● The trace of a Gram matrix is the sum of its eigenvalues.
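A small numerical check of the last three facts (non-negative eigenvalues, shared eigenvalues, trace equal to their sum), assuming NumPy and arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))              # n = 6 observations, p = 3 predictors

G_p = X.T @ X                            # p x p Gram matrix
G_n = X @ X.T                            # n x n Gram matrix

lam_p = np.linalg.eigvalsh(G_p)          # eigenvalues in ascending order
lam_n = np.linalg.eigvalsh(G_n)

print(np.all(lam_p >= -1e-10))                    # non-negative (positive semi-definite)
print(np.allclose(lam_n[-3:], lam_p))             # the p nonzero eigenvalues are shared
print(np.isclose(np.trace(G_p), lam_p.sum()))     # trace equals the sum of the eigenvalues
```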

Statistics (Recap)

Centered Model Matrix

Consider the model (data) matrix $X \in \mathbb{R}^{n \times p}$.

We center the predictors (so that each column has zero sample mean) by subtracting the column means $\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} X_{ij}$.

Centered model matrix: $\tilde{X}_{ij} = X_{ij} - \bar{x}_j$

Sample Covariance Matrix
Consider the sample covariance matrix: $S = \frac{1}{n-1} \tilde{X}^\top \tilde{X}$

Inspecting the terms:

➢ The diagonal terms are the sample variances: $S_{jj} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{x}_j)^2$

➢ The off-diagonal terms are the sample covariances: $S_{jk} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{x}_j)(X_{ik} - \bar{x}_k)$
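A minimal sketch of centering and forming the sample covariance matrix with NumPy; the synthetic data are only for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3)) * np.array([1.0, 2.0, 0.5])   # n = 50, p = 3, unequal scales

X_mean = X.mean(axis=0)
Xc = X - X_mean                          # centered model matrix: each column has zero mean
S = Xc.T @ Xc / (X.shape[0] - 1)         # sample covariance matrix

print(np.allclose(S, np.cov(X, rowvar=False)))   # agrees with NumPy's estimator
print(np.diag(S))                                # diagonal entries: sample variances
```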

Principal Components Analysis (PCA)

PCA

PCA tries to fit an ellipsoid to the data.

PCA is a linear transformation that maps the data to a new coordinate system.

The direction of greatest variance becomes the first axis (the first principal component), the direction of next-greatest variance the second, and so on.

PCA reduces the dimension by discarding the low-variance principal components.
J. Jauregui (2012).
PCA foundation

Note that the sample covariance matrix $S$ is symmetric, so it admits an orthonormal eigenbasis:
$S v_i = \lambda_i v_i$, with $v_i^\top v_j = \delta_{ij}$.

The eigenvalues can be sorted as $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p \geq 0$.

The eigenvector $v_i$ is called the $i$th principal component of $\tilde{X}$.

Measure the importance of the principal components

The total sample variance of the predictors: $\mathrm{tr}(S) = \sum_{i=1}^{p} \lambda_i$

The fraction of the total sample variance that corresponds to $\lambda_i$: $f_i = \lambda_i \big/ \sum_{j=1}^{p} \lambda_j$,

so $f_i$ indicates the “importance” of the $i$th principal component.
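Putting the last two slides together, a hedged sketch of PCA "from scratch" via the eigen-decomposition of the sample covariance matrix (synthetic data, NumPy only):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # correlated predictors

Xc = X - X.mean(axis=0)                  # center the predictors
S = Xc.T @ Xc / (X.shape[0] - 1)         # sample covariance matrix

eigvals, V = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]        # sort so that lambda_1 >= lambda_2 >= ...
eigvals, V = eigvals[order], V[:, order]

fractions = eigvals / eigvals.sum()      # importance f_i of each principal component
scores = Xc @ V                          # the data expressed in the principal-component basis
print(fractions)
```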

Back to spring-mass example

PCA finds a single dominant eigenvalue, $\lambda_1 \gg \lambda_2, \dots, \lambda_p$, revealing the one degree of freedom (the oscillation along x).

Hence, PCA indicates that there may be fewer variables that are essentially responsible for the variability of the response.

PCA Dimensionality Reduction
The spectrum (the sorted eigenvalues $\lambda_1 \geq \dots \geq \lambda_p$) shows how much variance each principal component explains, and therefore how far PCA can reduce the dimension.

PCA Dimensionality Reduction
There is no fixed rule for how many eigenvalues to keep; the cutoff is usually clear from the spectrum and is left to the analyst's discretion.

C. Bishop, Pattern Recognition and Machine Learning, Springer (2008).
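In practice a common (but still discretionary) heuristic is to keep enough components to reach some target fraction of the total variance. A minimal sketch using scikit-learn's `PCA`; the 95% threshold is an arbitrary choice, not a rule from the slides:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10)) @ rng.normal(size=(10, 10))   # synthetic correlated data

pca = PCA().fit(X)                                   # fit all 10 components
cum_var = np.cumsum(pca.explained_variance_ratio_)   # cumulative fraction of variance
k = int(np.searchsorted(cum_var, 0.95)) + 1          # smallest k reaching 95% of the variance

print(cum_var)
print(f"keep k = {k} components")
```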

PCA Dimensionality Reduction
An example on leaves (thanks to Chris Rycroft, AM205)

PCA Dimensionality Reduction

The average leaf

(Why do we need this? Recall that PCA is applied to centered data, so the average leaf is subtracted from every image before the components are computed.)

PCA Dimensionality Reduction
First three principal components (the figure shows the positive and negative regions of each component).

PCA Dimensionality Reduction – Keeping up to k Components
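The slide's figure (reconstructions of the leaves with increasing numbers of components) is not reproduced here; the following hedged sketch shows the underlying operation of projecting onto the first k principal components and reconstructing, using scikit-learn on synthetic data as a stand-in for the leaf images:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 20)) @ rng.normal(size=(20, 20))   # stand-in for flattened leaf images

k = 3
pca = PCA(n_components=k).fit(X)
Z = pca.transform(X)                     # coordinates in the first k principal components
X_hat = pca.inverse_transform(Z)         # reconstruction from only k components (plus the mean)

rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(f"relative reconstruction error with k = {k}: {rel_err:.3f}")
```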

Assumptions of PCA

Although PCA is a powerful tool for dimension reduction, it is based on some strong assumptions.

The assumptions are reasonable, but they must be checked in practice before drawing conclusions from PCA.

When the PCA assumptions fail, we need to use other linear or nonlinear dimension reduction methods.

Mean/Variance are sufficient
In applying PCA, we assume that the means and the covariance matrix are sufficient for describing the distributions of the predictors.
This is exactly true only if the predictors are drawn from a multivariate Normal distribution, but it works approximately in many situations.

When a predictor deviates heavily from being Normally distributed, an appropriate nonlinear transformation may solve this problem.
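For example (a minimal sketch, not from the slides), a log transform is a common choice for a heavily right-skewed, strictly positive predictor:

```python
import numpy as np

rng = np.random.default_rng(6)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # strictly positive, right-skewed predictor

transformed = np.log(skewed)     # exactly Normal here; approximately so for many real predictors
print(skewed.mean(), skewed.std())
print(transformed.mean(), transformed.std())
```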

Figure: multivariate normal distribution (Wikipedia).
High Variance indicates importance

Assumption: the eigenvalue $\lambda_i$ measures the “importance” of the $i$th principal component.

It is intuitively reasonable that lower-variability components describe the data less well, but this is not always true.

Principal Components are orthogonal

PCA assumes that the intrinsic dimensions are orthogonal.

When this assumption fails, we need to allow non-orthogonal components, which are not compatible with PCA.

Figure: Balaji Pitchai Kannu (on Quora).


Linear Change of Basis

PCA assumes that data lie on a lower dimensional linear manifold.

Figures (linear vs. nonlinear manifold): projectrhea.org; Alexsei Tiulpin.

When the data lie on a nonlinear manifold in the predictor space, linear methods are likely to be ineffective.
Kernel PCA for Nonlinear Dimensionality Reduction

Applying a nonlinear map $\Phi$ (called the feature map) to the data yields the PCA kernel:
$K_{ij} = \Phi(x_i)^\top \Phi(x_j) = k(x_i, x_j)$

Centered nonlinear representation:
$\tilde{K} = K - \mathbf{1}_n K - K \mathbf{1}_n + \mathbf{1}_n K \mathbf{1}_n$, where $\mathbf{1}_n$ is the $n \times n$ matrix with every entry equal to $1/n$.

Apply PCA to the modified (centered) kernel $\tilde{K}$.
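A hedged sketch of these steps with an RBF kernel (NumPy plus scikit-learn's `rbf_kernel` helper; the data and `gamma` are arbitrary choices). In practice `sklearn.decomposition.KernelPCA` performs the same computation:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 2))

K = rbf_kernel(X, gamma=1.0)                  # K_ij = k(x_i, x_j) = Phi(x_i) . Phi(x_j)

n = K.shape[0]
one_n = np.full((n, n), 1.0 / n)              # the n x n matrix with entries 1/n
K_tilde = K - one_n @ K - K @ one_n + one_n @ K @ one_n   # centering in feature space

eigvals, A = np.linalg.eigh(K_tilde)          # eigenvectors of the centered kernel
order = np.argsort(eigvals)[::-1]
eigvals, A = eigvals[order], A[:, order]

# projections of the training points onto the first two nonlinear principal components
Z = A[:, :2] * np.sqrt(np.maximum(eigvals[:2], 0.0))
print(Z.shape)
```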

Figure: Alexsei Tiulpin.
Summary
• Dimensionality Reduction Methods
1. A process of reducing the number of predictor variables under consideration.
2. To find a more meaningful basis in which to express our data, filtering out the noise and revealing the hidden structure.

• Principal Component Analysis


1. A powerful statistical tool for analyzing data sets, formulated in the context of Linear Algebra.
2. Spectral decomposition: we reduce the dimension of the predictors by keeping only the principal components with the largest eigenvalues.
3. PCA is based on strong assumptions that we need to check.
4. Kernel PCA for nonlinear dimensionality reduction.

Advanced Section 4: Dimensionality Reduction, PCA

Thank you

