Lecture 3

The document discusses Principal Component Analysis (PCA) as a method for dimension reduction in high-dimensional data analysis, highlighting its mathematical principles and applications. It covers the fundamentals of linear algebra necessary for understanding PCA, including eigenvalues, eigenvectors, and matrix transformations. Additionally, it addresses the limitations of PCA, particularly in handling nonlinear relationships in complex datasets.


Principal component analysis

Outline: Motivation, Dimension reduction, Refresher on Linear Algebra, Principal component analysis, PCA with Python, Limits of PCA

High dimensional data in data analysis?

Word embeddings in NLP
Brain activity
Challenges?
Visualize
Group in relevant clusters
Difficult with high dimensional data!
A classical dimension reduction approach: Principal Component Analysis

Dimension reduction

Dimension reduction without loss of information?



Dimension reduction

Scientific questions
How can we reduce dimension to separate observations?
Possible answer: Principal Component Analysis

Dimension reduction

Main features of Principal Component Analysis (PCA)


Preserves the global structure of the data
Maps all the clusters as a whole
Potential applications: noise filtering, feature extraction, stock market prediction, and gene data analysis

Refresher on Linear Algebra

Vectors of $\mathbb{R}^p$
$\mathbb{R}^p$ is the set of vectors with $p$ components.
For example, $X = \begin{pmatrix} 1 \\ -3 \\ 4 \end{pmatrix}$ is a 3-component vector.
We can also say that $X$ belongs to $\mathbb{R}^3$.

Concept of basis
The family $(X_1, \cdots, X_p)$ is a basis of $\mathbb{R}^p$ if each vector of $\mathbb{R}^p$ can be expressed in a unique way as a linear combination of $X_1, \cdots, X_p$.

Refresher on Linear Algebra

Example 1
$\left( \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right)$ is a basis of $\mathbb{R}^2$.
Indeed, every $X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ can be expressed in a unique way as
$$X = x_1 \cdot \begin{pmatrix} 1 \\ 0 \end{pmatrix} + x_2 \cdot \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$
Example with $X = \begin{pmatrix} 2 \\ 3 \end{pmatrix}$:
$$X = 2 \cdot \begin{pmatrix} 1 \\ 0 \end{pmatrix} + 3 \cdot \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$

Refresher on Linear Algebra

Example 2
$\left( \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ -1 \end{pmatrix} \right)$ is a basis of $\mathbb{R}^2$.
Indeed, every $X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ can be expressed in a unique way as
$$X = \frac{x_1 + x_2}{2} \cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix} + \frac{x_1 - x_2}{2} \cdot \begin{pmatrix} 1 \\ -1 \end{pmatrix}$$
Example with $X = \begin{pmatrix} 3 \\ 2 \end{pmatrix}$:
$$X = 2.5 \cdot \begin{pmatrix} 1 \\ 1 \end{pmatrix} + 0.5 \cdot \begin{pmatrix} 1 \\ -1 \end{pmatrix}$$
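As a quick numerical check of Example 2, a minimal NumPy sketch: finding the coordinates in the new basis amounts to solving a small linear system.

```python
import numpy as np

# Basis vectors of Example 2, stacked as the columns of B.
B = np.array([[1.0,  1.0],
              [1.0, -1.0]])
X = np.array([3.0, 2.0])

# The coordinates c of X in this basis solve B @ c = X.
c = np.linalg.solve(B, X)
print(c)  # [2.5 0.5], matching X = 2.5*(1,1) + 0.5*(1,-1)
```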

Refresher on Linear Algebra

Matrices
A square matrix with $p$ rows and $p$ columns is an array of real numbers.
Such a matrix maps vectors of $\mathbb{R}^p$ to vectors of $\mathbb{R}^p$.
It can be interpreted as a linear transformation of the plane in the case $p = 2$.

Refresher on Linear Algebra

Matrices
If we are given a matrix $M$, the transformation may not be so simple to identify!
What is the transformation associated to $M = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$?
$Y = M \cdot X$ with $X = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ and $Y = \begin{pmatrix} y_1 \\ y_2 \end{pmatrix}$ means
$$\begin{cases} y_1 = 2x_1 + x_2 \\ y_2 = x_1 + 2x_2 \end{cases}$$
The transformation is explicit but the geometric interpretation is not so clear.
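To make the transformation concrete, a minimal NumPy sketch applying $M$ to a few test vectors (the vectors chosen are illustrative):

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Apply the transformation Y = M X to a few test vectors.
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    print(x, "->", M @ np.array(x))
# (1,1) is mapped to (3,3): that direction is simply stretched by 3,
# a first hint at the eigenvectors introduced below.
```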

Refresher on Linear Algebra

Simpler with a smart change of coordinates?

Refresher on Linear Algebra

Eigenvalues and eigenvectors

Let $A \in M_d(\mathbb{R})$.
The vector $X \in \mathbb{R}^d \setminus \{0\}$ is said to be an eigenvector of the matrix $A$ associated to the eigenvalue $\lambda$ if $AX = \lambda X$.

Refresher on Linear Algebra

Example 3
Let $A = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}$ and $X = \begin{pmatrix} 1 \\ 1 \end{pmatrix}$.
Since $AX = \begin{pmatrix} 3 \\ 3 \end{pmatrix} = 3X$, $X$ is an eigenvector of $A$ with associated eigenvalue 3.
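We can check Example 3 numerically; a minimal sketch with NumPy:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# np.linalg.eig returns the eigenvalues and the (normalized)
# eigenvectors as the columns of the second output.
vals, vecs = np.linalg.eig(A)
print(vals)  # 3 and 1 (order not guaranteed)
print(vecs)  # columns proportional to (1, 1) and (1, -1)
```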

Refresher on Linear Algebra

Diagonalizable matrices
The square matrix $A$ with $p$ columns and $p$ rows is said to be diagonalizable if there exists a family $(X_1, \cdots, X_p)$ such that
Condition 1: $(X_1, \cdots, X_p)$ is a basis of $\mathbb{R}^p$
Condition 2: for each $i$, $X_i$ is an eigenvector of $A$
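For the matrix $A$ of Example 3 the eigenvectors do form a basis of $\mathbb{R}^2$, so $A$ is diagonalizable; a minimal sketch checking that $P^{-1} A P$ is diagonal:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

vals, P = np.linalg.eig(A)     # eigenvectors as the columns of P
D = np.linalg.inv(P) @ A @ P   # change of basis to the eigenvector basis
print(np.round(D, 10))         # diagonal matrix with eigenvalues 3 and 1
```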

Principal Component Analysis


Principle

How can we perform dimension reduction in a linear way?

Mathematical tool: linear projection onto a low-dimensional space

Principal Component Analysis


Principle

How can we find the low-dimensional space $H$?

Principal Component Analysis


Principle

Principal component analysis: how does it work?

The $k$-dimensional space $H$ that we are looking for is spanned by the $k$ eigenvectors $u_\alpha$ associated to the $k$ largest eigenvalues $\lambda_\alpha$ of the matrix $X^T X$.
We have several possible choices for the matrix $X$:
General PCA: $X$ is the raw data matrix $R$
Centered PCA: $X$ is the centered data matrix; $X^T X$ is then the matrix of empirical covariances
Normed PCA: $X$ is the normed and centered data matrix; $X^T X$ is then the matrix of empirical correlations
Projection of the observation $o_i$ on the axis $\alpha$: its coordinate is the scalar product $\langle o_i, u_\alpha \rangle$
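A minimal NumPy sketch of centered PCA following this recipe; the data matrix here is random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # n = 100 observations, d = 5 variables

Xc = X - X.mean(axis=0)              # centered PCA: subtract the column means
C = Xc.T @ Xc                        # d x d matrix (proportional to covariances)

vals, vecs = np.linalg.eigh(C)       # eigh: symmetric matrices, ascending order
order = np.argsort(vals)[::-1]       # reorder, largest eigenvalues first
vals, vecs = vals[order], vecs[:, order]

k = 2
U = vecs[:, :k]                      # the k leading eigenvectors u_alpha span H
scores = Xc @ U                      # coordinate of each o_i on each axis alpha
print(scores.shape)                  # (100, 2)
```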

Principal Component Analysis


Principle

Principal component analysis: how does it work?

In general $n \gg d$ (number of observations $\gg$ number of initial variables).
This is the reason why we deal with the matrix $X^T X$, of dimension $d \times d$, rather than $XX^T$, of dimension $n \times n$.
The two analyses are linked: $X^T X$ and $XX^T$ share the same nonzero eigenvalues.
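A minimal sketch illustrating the link on a toy matrix: the nonzero eigenvalues of $X^T X$ and $XX^T$ coincide.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))                            # toy case: n = 8, d = 3

print(np.round(np.linalg.eigvalsh(X.T @ X), 6))        # the d = 3 eigenvalues
print(np.round(np.linalg.eigvalsh(X @ X.T)[-3:], 6))   # same values; the other
# n - d eigenvalues of X X^T are (numerically) zero
```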

PCA with Python


An example

To illustrate PCA we consider a dataset containing gene expression profiles for 105 breast tumour samples measured using Swegene Human 27K RAP UniGene188 arrays.
Within the population of cells, one can focus on the expression of GATA3 and XBP1, whose expression was known to correlate with estrogen receptor status¹.

¹ Breast cancer cells may be estrogen receptor positive, ER+, or negative, ER−, indicating the capacity to respond to estrogen signalling, which can therefore influence treatment.

PCA with Python


An example

We plot the expression levels of GATA3 and XBP1 against one another to visualise the data in the two-dimensional space.
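A minimal plotting sketch; the file name and column names are hypothetical stand-ins for the actual expression data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file: one row per tumour sample, one column per gene.
expr = pd.read_csv("breast_tumour_expression.csv")

plt.scatter(expr["GATA3"], expr["XBP1"])
plt.xlabel("GATA3 expression")
plt.ylabel("XBP1 expression")
plt.show()
```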

PCA with Python


An example

We perform PCA and visualise the result by plotting the original data side-by-side with the transformed data.

[Figure: original data versus transformed data]
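Continuing the previous sketch, one possible way to do this with scikit-learn (`expr` is the hypothetical table loaded above):

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

X = expr[["GATA3", "XBP1"]].to_numpy()   # (n_samples, 2)

pca = PCA(n_components=2)
Z = pca.fit_transform(X)                 # centred and rotated coordinates

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(9, 4))
ax0.scatter(X[:, 0], X[:, 1])
ax0.set_title("Original data")
ax1.scatter(Z[:, 0], Z[:, 1])
ax1.set_title("PCA-transformed data")
plt.show()
```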

PCA with Python


An example

We have simply rotated the original data, so that the greatest variance aligns along the x-axis and so forth.
We can find out how much of the variance each of the principal components explains.
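With scikit-learn this is one attribute away (continuing the sketch above; the printed numbers are illustrative):

```python
# Fraction of the total variance explained by each principal component.
print(pca.explained_variance_ratio_)
# e.g. [0.95, 0.05]: PC1 would carry ~95% of the variance
```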

PCA with Python


An example

PC1 explains the vast majority of the variance in the observations.
The dimensionality reduction step of PCA occurs when we choose to discard the later PCs.
We visualise the data using only PC1.
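A sketch of the reduction step, keeping only PC1 (again on the hypothetical `X` above):

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca1 = PCA(n_components=1)
z = pca1.fit_transform(X).ravel()    # one coordinate per sample

# One-dimensional strip plot of the samples along PC1.
plt.scatter(z, np.zeros_like(z))
plt.xlabel("PC1")
plt.yticks([])
plt.show()
```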

Limits of PCA
An example

Principal component analysis is not always appropriate for complex datasets, particularly when dealing with nonlinearities.
To illustrate this, let's consider a simulated expression set containing 8 genes, with 10 timepoints/conditions.

Limits of PCA
An example

The data can be separated out by a single direction.
The data from time/condition 1 through to time/condition 10 can be ordered.
Intuitively, the data can be represented by a single dimension.
We run PCA as we would normally and visualise the result, plotting the first two PCs (see the sketch below).
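The simulated expression set is not reproduced here, so the sketch below builds a hypothetical stand-in with the same flavour: 8 genes whose activation peaks move smoothly across 10 timepoints, a classic setting for the horseshoe:

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

t = np.linspace(0.0, 1.0, 10)             # 10 timepoints/conditions
centres = np.linspace(0.0, 1.0, 8)        # one activation peak per gene

# Rows = conditions, columns = genes; each gene peaks at its own time.
X = np.exp(-(t[:, None] - centres[None, :]) ** 2 / 0.02)

Z = PCA(n_components=2).fit_transform(X)
plt.plot(Z[:, 0], Z[:, 1], "o-")
for i, (a, b) in enumerate(Z, start=1):   # label conditions 1..10
    plt.annotate(str(i), (a, b))
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```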

[Figure: the first two PCs of the simulated expression data]

Limits of PCA
An example

We see that the PCA plot has placed the datapoints in a horseshoe shape, with condition/time point 1 very close to condition/time point 10.
From the earlier plots of gene expression profiles we can see that the relationships between the various genes are not entirely straightforward.
For example, gene 1 is initially correlated with gene 2, then negatively correlated, and finally uncorrelated, whilst no correlation exists between gene 1 and genes 5-8.
These nonlinearities make it difficult for PCA which, in general, attempts to preserve large pairwise distances, leading to the well-known horseshoe effect.

Limits of PCA
Pros and cons of PCA

Main advantages of PCA


Simple to implement, no tuning
Highly interpretable: we can decide how much variance to preserve using the eigenvalues

Main drawbacks of PCA


It is a global transform which may not preserve local structure (clusters)
It is sensitive to outliers
