Intro To PCA

PCA (Principal Component Analysis) is a technique used to simplify complex datasets. It works by transforming the data to a new coordinate system where the greatest variance in the data lies on the first axis (the first principal component), second greatest variance on the second axis, and so on. This allows reducing dimensionality by removing less significant components. The key steps are: 1) centering the data, 2) calculating the covariance matrix, 3) finding eigenvalues and eigenvectors of the covariance matrix which identify the principal components, and 4) reducing dimensions by keeping only the most significant components.

Uploaded by

avant_ganji

Intro to PCA

Adapted from G. Piatetsky-Shapiro, Biologically Inspired Intelligent Systems (Lecture 7) and R. Gutierrez-Osuna's Lecture

Field Reduction Improves Classification

Most mining algorithms look for non-linear combinations of fields, and can easily find many spurious combinations given a small # of records and a large # of fields.
Classification accuracy improves if we first reduce the number of fields.
Multi-class heuristic: select an equal # of fields from each class.

Selecting Most Relevant Fields

If there are too many fields, select a subset that is most relevant.
Can select top N fields using 1-field predictive accuracy as computed earlier.
What is a good N?
Rule of thumb: keep top 50 fields

Other techniques exist

Attribute Construction
Better to have a fair modeling method and good variables than to have the best modeling method and poor variables.
Examples:
People are eligible for pension withdrawal at age 59½. Create it as a separate Boolean variable!
Household income as the sum of spouses' incomes in loan underwriting

Advanced methods exist for automatically examining variable combinations, but they are very computationally expensive!

Variance
A measure of the spread of the data in a data set

$$ s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} $$

Variance is claimed to be the original statistical measure of spread of data.
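As a minimal sketch of the variance formula, the function below uses the same n - 1 denominator. The name `sample_variance` and the `heights` values are hypothetical, chosen only for illustration.

```python
def sample_variance(values):
    """Sample variance: sum of squared deviations from the mean over n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


# Hypothetical one-dimensional data (not from the slides).
heights = [2.0, 4.0, 4.0, 6.0]
variance = sample_variance(heights)
print(variance)
```

The mean here is 4, the squared deviations sum to 8, and dividing by n - 1 = 3 gives 8/3.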

Covariance
Variance: a measure of the deviation from the mean for points in one dimension, e.g., heights.
Covariance: a measure of how much each of the dimensions varies from the mean with respect to each other.
Covariance is measured between 2 dimensions to see if there is a relationship between the 2 dimensions, e.g., number of hours studied & grade obtained.
The covariance between one dimension and itself is the variance.

Covariance
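$$ \mathrm{var}(X) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})}{n - 1} $$

$$ \mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1} $$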

So, if you had a 3-dimensional data set (x,y,z), then you could measure the covariance between the x and y dimensions, the y and z dimensions, and the x and z dimensions.
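The covariance formula can be sketched in a few lines of plain Python. The `hours` and `marks` values below are hypothetical and serve only to show a positive covariance, and that cov(X, X) equals var(X).

```python
def covariance(xs, ys):
    """Sample covariance between two equally long sequences (n - 1 denominator)."""
    if len(xs) != len(ys):
        raise ValueError("both dimensions need the same number of items")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: hours studied and marks obtained.
hours = [1.0, 2.0, 3.0, 4.0]
marks = [50.0, 60.0, 65.0, 85.0]

cov_hours_marks = covariance(hours, marks)  # positive: they rise together
var_hours = covariance(hours, hours)        # covariance with itself = variance
print(cov_hours_marks, var_hours)
```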

Covariance
What is the interpretation of covariance calculations?
Say you have a 2-dimensional data set:
X: number of hours studied for a subject
Y: marks obtained in that subject

And assume the covariance value (between X and Y) is: 104.53
What does this value mean?

Covariance
Exact value is not as important as its sign.
A positive value of covariance indicates that both dimensions increase or decrease together, e.g., as the number of hours studied increases, the grades in that subject also increase.
A negative value indicates that while one increases the other decreases, or vice-versa, e.g., active social life at BYU vs. performance in CS Dept.
If covariance is zero, the two dimensions are uncorrelated, i.e., there is no linear relationship between them, e.g., heights of students vs. grades obtained in a subject. (Independent dimensions always have zero covariance, but zero covariance alone does not guarantee independence.)

Covariance
Why bother with calculating (expensive) covariance when we could just plot the 2 values to see their relationship?
Covariance calculations are used to find relationships between dimensions in high-dimensional data sets (usually more than 3 dimensions) where visualization is difficult.

Covariance Matrix
Representing covariance among dimensions as a matrix, e.g., for 3 dimensions:

$$ C = \begin{pmatrix} \mathrm{cov}(X,X) & \mathrm{cov}(X,Y) & \mathrm{cov}(X,Z) \\ \mathrm{cov}(Y,X) & \mathrm{cov}(Y,Y) & \mathrm{cov}(Y,Z) \\ \mathrm{cov}(Z,X) & \mathrm{cov}(Z,Y) & \mathrm{cov}(Z,Z) \end{pmatrix} $$

Properties:
Diagonal: variances of the variables
cov(X,Y) = cov(Y,X), hence the matrix is symmetrical about the diagonal (the upper triangle alone determines it)
n-dimensional data will result in an n x n covariance matrix
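The listed properties can be checked with a small sketch that builds the covariance matrix for any number of dimensions. The function name `covariance_matrix` and the three data columns are hypothetical, made up for illustration.

```python
def covariance_matrix(columns):
    """Build the d x d sample covariance matrix from d equally long columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    centered = [[v - m for v in col] for col, m in zip(columns, means)]
    d = len(columns)
    return [
        [sum(a * b for a, b in zip(centered[i], centered[j])) / (n - 1)
         for j in range(d)]
        for i in range(d)
    ]


# Hypothetical 3-dimensional data set (x, y, z), four items.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]
z = [4.0, 3.0, 2.0, 1.0]

C = covariance_matrix([x, y, z])
for row in C:
    print(row)
```

The diagonal entries are the three variances, and each off-diagonal entry equals its mirror image across the diagonal.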

Transformation Matrices
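Consider the following:

$$ \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix} $$

The square (transformation) matrix scales (3,2) by 4. Now assume we take a multiple of (3,2):

$$ 2 \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 6 \\ 4 \end{pmatrix} $$

$$ \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 6 \\ 4 \end{pmatrix} = \begin{pmatrix} 24 \\ 16 \end{pmatrix} = 4 \begin{pmatrix} 6 \\ 4 \end{pmatrix} $$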

Transformation Matrices
Scale vector (3,2) by a value 2 to get (6,4).
Multiply by the square transformation matrix, and we see that the result is still scaled by 4. WHY?
A vector consists of both length and direction. Scaling a vector only changes its length and not its direction.
This is an important observation about matrix transformations, and it leads to the notion of eigenvectors and eigenvalues.
Irrespective of how much we scale (3,2) by, the result (under the given transformation matrix) is always 4 times the scaled vector.
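A quick way to convince yourself of this is to try several scale factors. The sketch below uses the matrix and vector from the example; the helper `mat_vec` and the particular scale factors are illustrative choices.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a column vector (list)."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


A = [[2, 3],
     [2, 1]]
v = [3, 2]

for k in (1, 2, -5, 10):
    scaled = [k * x for x in v]
    image = mat_vec(A, scaled)
    # The transformed vector is always 4 times the scaled input.
    assert image == [4 * x for x in scaled]
    print(k, scaled, image)
```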

Eigenvalue Problem
The eigenvalue problem is any problem having the following form:
$$ A v = \lambda v $$
A: n x n matrix
v: n x 1 non-zero vector
$\lambda$: scalar
Any value of $\lambda$ for which this equation has a solution is called an eigenvalue of A, and the vector v which corresponds to this value is called an eigenvector of A.

Eigenvalue Problem
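Going back to our example:

$$ \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix} $$

$$ A v = \lambda v $$

Therefore, (3,2) is an eigenvector of the square matrix A and 4 is an eigenvalue of A.
The question is: given matrix A, how can we calculate the eigenvectors and eigenvalues of A?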

Calculating Eigenvectors & Eigenvalues

Simple matrix algebra shows that:
$$ A v = \lambda v $$
$$ A v - \lambda I v = 0 $$
$$ (A - \lambda I) v = 0 $$
Finding the roots of $|A - \lambda I|$ will give the eigenvalues, and for each of these eigenvalues there will be an eigenvector.
Example:

Calculating Eigenvectors & Eigenvalues

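Let

$$ A = \begin{pmatrix} 0 & 1 \\ -2 & -3 \end{pmatrix} $$

Then:

$$ A - \lambda I = \begin{pmatrix} 0 & 1 \\ -2 & -3 \end{pmatrix} - \lambda \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} -\lambda & 1 \\ -2 & -3 - \lambda \end{pmatrix} $$

$$ |A - \lambda I| = (-\lambda)(-3 - \lambda) - (-2)(1) = \lambda^2 + 3\lambda + 2 $$

And setting the determinant to 0, we obtain 2 eigenvalues: $\lambda_1 = -1$ and $\lambda_2 = -2$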
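For a 2 x 2 matrix the determinant condition is a quadratic in the eigenvalue, so the roots can be sketched with the quadratic formula. The function name `eigenvalues_2x2` is hypothetical, and the sketch only handles the case of real roots.

```python
import math


def eigenvalues_2x2(m):
    """Real eigenvalues of [[a, b], [c, d]], largest first.

    Solves lambda^2 - (a + d) * lambda + (a * d - b * c) = 0.
    """
    (a, b), (c, d) = m
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("complex eigenvalues: no real eigenvectors")
    root = math.sqrt(disc)
    return ((trace + root) / 2, (trace - root) / 2)


A = [[0, 1],
     [-2, -3]]
lam1, lam2 = eigenvalues_2x2(A)
print(lam1, lam2)
```

For the example matrix the trace is -3 and the determinant is 2, which gives the same polynomial as the hand calculation and the roots -1 and -2.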

Calculating Eigenvectors & Eigenvalues

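For $\lambda_1$ the eigenvector is:

$$ (A - \lambda_1 I)\, v_1 = 0 $$

$$ \begin{pmatrix} 1 & 1 \\ -2 & -2 \end{pmatrix} \begin{pmatrix} v_{1:1} \\ v_{1:2} \end{pmatrix} = 0 $$

$$ v_{1:1} + v_{1:2} = 0 \quad \text{and} \quad -2 v_{1:1} - 2 v_{1:2} = 0 $$

$$ v_{1:1} = -v_{1:2} $$

Therefore the first eigenvector is any column vector in which the two elements have equal magnitude and opposite sign.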

Calculating Eigenvectors & Eigenvalues

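Therefore eigenvector $v_1$ is

$$ v_1 = k_1 \begin{pmatrix} +1 \\ -1 \end{pmatrix} $$

where $k_1$ is some constant. Similarly we find that eigenvector $v_2$ is

$$ v_2 = k_2 \begin{pmatrix} +1 \\ -2 \end{pmatrix} $$

where $k_2$ is some constant.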
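The step for the second eigenvector can be written out in the same way as the first. This is a worked sketch for the eigenvalue -2 of A = [[0, 1], [-2, -3]]; the component names $v_{2:1}$ and $v_{2:2}$ are assumed by analogy with the notation used for the first eigenvector.

```latex
\begin{align*}
(A - \lambda_2 I)\, v_2 &= 0, \qquad \lambda_2 = -2 \\
\begin{pmatrix} 2 & 1 \\ -2 & -1 \end{pmatrix}
\begin{pmatrix} v_{2:1} \\ v_{2:2} \end{pmatrix} &= 0 \\
2 v_{2:1} + v_{2:2} = 0 \quad &\text{and} \quad -2 v_{2:1} - v_{2:2} = 0 \\
v_{2:2} &= -2\, v_{2:1}
\end{align*}
```

So the second eigenvector is any column vector whose second element is -2 times the first.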

Properties of Eigenvectors and Eigenvalues

Eigenvectors can only be found for square matrices, and not every square matrix has (real) eigenvectors.
Given an n x n matrix (with eigenvectors), we can find at most n linearly independent eigenvectors; a symmetric matrix always has n.
Eigenvectors of a symmetric* matrix that belong to different eigenvalues are perpendicular to each other, no matter how many dimensions we have.
In practice eigenvectors are normalized to have unit length.
*Note: covariance matrices are symmetric!

PCA
Principal components analysis (PCA) is a technique that can be used to simplify a dataset.
It is a linear transformation that chooses a new coordinate system for the data set such that:
The greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component)
The second greatest variance on the second axis
Etc.

PCA can be used for reducing dimensionality by eliminating the later principal components.

PCA
By finding the eigenvalues and eigenvectors of the covariance matrix, we find that the eigenvectors with the largest eigenvalues correspond to the directions along which the data varies the most. These are the principal components.
PCA is a useful statistical technique that has found application in:
Fields such as face recognition and image compression
Finding patterns in data of high dimension.

PCA Process STEP 1

Subtract the mean from each of the dimensions.
This produces a data set whose mean is zero.
Subtracting the mean makes variance and covariance calculation easier by simplifying their equations.
The variance and covariance values are not affected by the mean value.

PCA Process STEP 1

Mean of X = 1.81, mean of Y = 1.91 (X' and Y' are the mean-adjusted values)

    X      Y       X'       Y'
  2.5    2.4     0.69     0.49
  0.5    0.7    -1.31    -1.21
  2.2    2.9     0.39     0.99
  1.9    2.2     0.09     0.29
  3.1    3.0     1.29     1.09
  2.3    2.7     0.49     0.79
  2.0    1.6     0.19    -0.31
  1.0    1.1    -0.81    -0.81
  1.5    1.6    -0.31    -0.31
  1.1    0.9    -0.71    -1.01

http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
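Step 1 can be sketched as follows for the ten-item example. The helper name `center` is hypothetical. The last X value is taken to be 1.1, the value that the stated mean of 1.81 and the mean-adjusted value 0.71 (in magnitude) require.

```python
X = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
Y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]


def center(values):
    """Return the mean and the mean-adjusted copy of one dimension."""
    mean = sum(values) / len(values)
    return mean, [v - mean for v in values]


x_mean, x_adj = center(X)
y_mean, y_adj = center(Y)

print(round(x_mean, 2), round(y_mean, 2))
print([round(v, 2) for v in x_adj])
print([round(v, 2) for v in y_adj])
```

After centering, each dimension sums to zero (up to floating-point rounding), which is what "a data set whose mean is zero" means in practice.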

PCA Process STEP 2

Calculate the covariance matrix

$$ \mathrm{cov} = \begin{pmatrix} 0.616555556 & 0.615444444 \\ 0.615444444 & 0.716555556 \end{pmatrix} $$

Since the non-diagonal elements in this covariance matrix are positive, we should expect that both the X and Y variables increase together. Since it is symmetric, we expect the eigenvectors to be orthogonal.
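The covariance matrix for the example can be reproduced with a short sketch. The mean-adjusted values are typed in with their signs, assuming the Step 1 data set with means 1.81 and 1.91; the function name `cov` is hypothetical.

```python
# Mean-adjusted example data (X - 1.81, Y - 1.91).
X_ADJ = [0.69, -1.31, 0.39, 0.09, 1.29, 0.49, 0.19, -0.81, -0.31, -0.71]
Y_ADJ = [0.49, -1.21, 0.99, 0.29, 1.09, 0.79, -0.31, -0.81, -0.31, -1.01]


def cov(a, b):
    """Sample covariance of two zero-mean sequences (n - 1 denominator)."""
    return sum(p * q for p, q in zip(a, b)) / (len(a) - 1)


C = [
    [cov(X_ADJ, X_ADJ), cov(X_ADJ, Y_ADJ)],
    [cov(Y_ADJ, X_ADJ), cov(Y_ADJ, Y_ADJ)],
]
for row in C:
    print([round(v, 9) for v in row])
```

Because the data is already centered, the covariance reduces to a sum of products divided by n - 1, which is the simplification that Step 1 buys.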

PCA Process STEP 3

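Calculate the eigenvectors and eigenvalues of the covariance matrix

$$ \text{eigenvalues} = \begin{pmatrix} 0.0490833989 \\ 1.28402771 \end{pmatrix} $$

$$ \text{eigenvectors} = \begin{pmatrix} -0.735178656 & -0.677873399 \\ 0.677873399 & -0.735178656 \end{pmatrix} $$

Each column of the eigenvector matrix is the unit eigenvector for the eigenvalue in the same position.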
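For a symmetric 2 x 2 matrix the eigenvalues and unit eigenvectors have a closed form, sketched below. The function name `eigen_symmetric_2x2` is hypothetical and the sketch assumes a non-zero off-diagonal entry. The sign of each eigenvector is arbitrary, so the output may differ from the slide's values by a factor of -1. Note that the two eigenvalues must sum to the trace of the covariance matrix (about 1.3331), which is why the smaller one is 0.0490833989.

```python
import math

COV = [[0.616555556, 0.615444444],
       [0.615444444, 0.716555556]]


def eigen_symmetric_2x2(c):
    """Return [(eigenvalue, unit eigenvector), ...], largest eigenvalue first.

    Assumes c = [[a, b], [b, d]] is symmetric with b != 0.
    """
    (a, b), (_, d) = c
    if b == 0:
        raise ValueError("matrix is already diagonal")
    centre = (a + d) / 2
    half_gap = math.hypot((a - d) / 2, b)
    pairs = []
    for lam in (centre + half_gap, centre - half_gap):
        vx, vy = b, lam - a  # solves (a - lam) * vx + b * vy = 0
        norm = math.hypot(vx, vy)
        pairs.append((lam, (vx / norm, vy / norm)))
    return pairs


pairs = eigen_symmetric_2x2(COV)
for lam, vec in pairs:
    print(round(lam, 9), [round(v, 9) for v in vec])
```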

PCA Process STEP 3

Eigenvectors are plotted as diagonal dotted lines on the plot. (note: they are perpendicular to each other). One of the eigenvectors goes through the middle of the points, like drawing a line of best fit. The second eigenvector gives us the other, less important, pattern in the data, that all the points follow the main line, but are off to the side of the main line by some amount.

PCA Process STEP 4

Reduce dimensionality and form feature vector
The eigenvector with the highest eigenvalue is the principal component of the data set.
In our example, the eigenvector with the largest eigenvalue is the one that points down the middle of the data.
Once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives the components in order of significance.

PCA Process STEP 4

Now, if you'd like, you can decide to ignore the components of lesser significance.
You do lose some information, but if the eigenvalues are small, you don't lose much.

n dimensions in your data
calculate n eigenvectors and eigenvalues
choose only the first p eigenvectors
final data set has only p dimensions.

PCA Process STEP 4

When the $\lambda_i$ are sorted in descending order, the proportion of variance explained by the first p principal components is:

$$ \frac{\sum_{i=1}^{p} \lambda_i}{\sum_{i=1}^{n} \lambda_i} = \frac{\lambda_1 + \lambda_2 + \dots + \lambda_p}{\lambda_1 + \lambda_2 + \dots + \lambda_p + \dots + \lambda_n} $$

If the dimensions are highly correlated, there will be a small number of eigenvectors with large eigenvalues and p will be much smaller than n.
If the dimensions are not correlated, p will be as large as n and PCA does not help.
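The proportion can be sketched in a few lines. The function name `explained_variance_ratio` is hypothetical. The eigenvalues are those of the running example, taking the smaller one as 0.0490833989, the value consistent with the trace of the covariance matrix.

```python
def explained_variance_ratio(eigenvalues, p):
    """Share of total variance kept by the p largest eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    if not 1 <= p <= len(ordered):
        raise ValueError("p must be between 1 and the number of eigenvalues")
    return sum(ordered[:p]) / sum(ordered)


eigenvalues = [0.0490833989, 1.28402771]
ratio_one = explained_variance_ratio(eigenvalues, 1)
ratio_all = explained_variance_ratio(eigenvalues, 2)
print(round(ratio_one, 4), ratio_all)
```

In the example a single component keeps roughly 96% of the variance, which is why dropping the second component loses little.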

PCA Process STEP 4

Feature Vector

$$ \text{FeatureVector} = (v_1 \; v_2 \; v_3 \; \dots \; v_p) $$

(take the eigenvectors to keep from the ordered list of eigenvectors, and form a matrix with these eigenvectors in the columns)

We can either form a feature vector with both of the eigenvectors:

$$ \begin{pmatrix} -0.677873399 & -0.735178656 \\ -0.735178656 & 0.677873399 \end{pmatrix} $$

or, we can choose to leave out the smaller, less significant component and only have a single column:

$$ \begin{pmatrix} -0.677873399 \\ -0.735178656 \end{pmatrix} $$

PCA Process STEP 5

Derive the new data
FinalData = RowFeatureVector x RowZeroMeanData

RowFeatureVector is the matrix with the eigenvectors in the columns transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top. RowZeroMeanData is the mean-adjusted data transposed, i.e., the data items are in each column, with each row holding a separate dimension.

PCA Process STEP 5

FinalData is the final data set, with data items in columns, and dimensions along rows. What does this give us? The original data solely in terms of the vectors we chose. We have changed our data from being in terms of the axes X and Y, to now be in terms of our 2 eigenvectors.

PCA Process STEP 5

FinalData (transpose: dimensions along columns)

          newX             newY
  -0.827970186     -0.175115307
   1.77758033       0.142857227
  -0.992197494      0.384374989
  -0.274210416      0.130417207
  -1.67580142      -0.209498461
  -0.912949103      0.175282444
   0.0991094375    -0.349824698
   1.14457216       0.0464172582
   0.438046137      0.0177646297
   1.22382056      -0.162675287
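The projection FinalData = RowFeatureVector x RowZeroMeanData can be sketched with a plain matrix product. The helper `mat_mul` is hypothetical. The eigenvector signs and mean-adjusted values are those of the running example, with the most significant eigenvector in the first row.

```python
# Eigenvectors as rows, most significant first.
ROW_FEATURE_VECTOR = [
    [-0.677873399, -0.735178656],
    [-0.735178656, 0.677873399],
]

# Mean-adjusted data: one row per dimension, one column per data item.
ROW_ZERO_MEAN_DATA = [
    [0.69, -1.31, 0.39, 0.09, 1.29, 0.49, 0.19, -0.81, -0.31, -0.71],
    [0.49, -1.21, 0.99, 0.29, 1.09, 0.79, -0.31, -0.81, -0.31, -1.01],
]


def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


final_data = mat_mul(ROW_FEATURE_VECTOR, ROW_ZERO_MEAN_DATA)
new_x, new_y = final_data

for px, py in zip(new_x, new_y):
    print(round(px, 9), round(py, 9))
```

Keeping only the first row of `ROW_FEATURE_VECTOR` would produce `new_x` alone, i.e. the one-dimensional version of the data.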

PCA Process STEP 5

Reconstruction of Original Data

Recall that:
FinalData = RowFeatureVector x RowZeroMeanData

Then:
RowZeroMeanData = RowFeatureVector^{-1} x FinalData

And thus:
RowOriginalData = (RowFeatureVector^{-1} x FinalData) + OriginalMean

If we use unit eigenvectors, the inverse is the same as the transpose (hence, easier).
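A minimal sketch of the reconstruction, using the transpose in place of the inverse. The helper names are hypothetical, and only the first two data items of the example are used to keep it short. With one eigenvector kept, the transpose is no longer a true inverse, so the result is only an approximation of the original data.

```python
MEAN = (1.81, 1.91)
ROW_FEATURE_VECTOR = [
    [-0.677873399, -0.735178656],
    [-0.735178656, 0.677873399],
]
# First two transformed items: row 0 is newX, row 1 is newY.
FINAL_DATA = [
    [-0.827970186, 1.77758033],
    [-0.175115307, 0.142857227],
]


def transpose(m):
    return [list(col) for col in zip(*m)]


def mat_mul(a, b):
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def reconstruct(row_feature_vector, final_data, mean):
    """RowOriginalData = (RowFeatureVector^T x FinalData) + OriginalMean."""
    centered = mat_mul(transpose(row_feature_vector), final_data)
    return [[v + m for v in row] for row, m in zip(centered, mean)]


full = reconstruct(ROW_FEATURE_VECTOR, FINAL_DATA, MEAN)
reduced = reconstruct(ROW_FEATURE_VECTOR[:1], FINAL_DATA[:1], MEAN)

print([[round(v, 4) for v in row] for row in full])
print([[round(v, 4) for v in row] for row in reduced])
```

With both eigenvectors the first two original items, (2.5, 2.4) and (0.5, 0.7), come back up to rounding. With only the first eigenvector the reconstructed points are close to the originals but not equal to them.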

Reconstruction of Original Data

If we reduce the dimensionality (i.e., p<n), obviously, when reconstructing the data we lose those dimensions we chose to discard. In our example let us assume that we considered only a single eigenvector. The final data is newX only and the reconstruction yields

Reconstruction of original Data

The variation along the principal component is preserved. The variation along the other component has been lost.

A Word on Factor Analysis

The reciprocal of PCA(?)
PCA generates new variables (zi) that are linear combinations of the original input variables (xi).
FA assumes that there are factors (zi) that, when linearly combined, generate the input variables (xi).

A Word On Linear Discriminant Analysis

Both PCA and FA are unsupervised.
LDA seeks to find a dimension such that when the data is projected onto it, the two classes* are well separated (i.e., the class means are as far apart as possible and the examples of each class are as tightly clustered as possible)

*This generalizes naturally to K classes yielding K-1 dimensions

References
PCA tutorial: http://kybele.psych.cornell.edu/~edelman/Psych-465-Spring-2003/PCA-tutorial.pdf
Wikipedia: http://en.wikipedia.org/wiki/Principal_component_analysis
http://en.wikipedia.org/wiki/Eigenface
