Principal Component Analysis
$$\mathrm{Cov} = \begin{pmatrix} 0.616555556 & 0.615444444 \\ 0.615444444 & 0.716555556 \end{pmatrix}$$
Since the off-diagonal elements in this covariance matrix are positive, the x and y variables have a positive correlation; that is, x and y tend to increase together.
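As a quick check, the covariance matrix above can be reproduced with a few lines of NumPy. The (x, y) values below are assumed to be the ten example points used in the earlier steps; with them, np.cov returns the matrix shown above.

```python
import numpy as np

# Assumed example data: the ten (x, y) points used in the earlier steps.
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

# Subtract the mean of each variable (the earlier mean-adjustment step).
x_adj = x - x.mean()
y_adj = y - y.mean()

# Sample covariance matrix (np.cov divides by n - 1).
cov = np.cov(np.vstack([x_adj, y_adj]))
print(cov)
# [[0.61655556 0.61544444]
#  [0.61544444 0.71655556]]
```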
• Step 4: Calculate the eigenvectors and eigenvalues of the covariance
matrix.
From the covariance matrix, it is possible to calculate the eigenvectors and
eigenvalues. These are very important because they represent useful
information about our data.
The eigenvectors and eigenvalues of our covariance matrix are as follows:
$$\text{eigenvalues} = \begin{pmatrix} 0.0490833989 \\ 1.28402771 \end{pmatrix}, \qquad \text{eigenvectors} = \begin{pmatrix} -0.735178656 & -0.677873399 \\ 0.677873399 & -0.735178656 \end{pmatrix}$$
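A minimal sketch of how these values could be reproduced with NumPy (the covariance matrix is the one given above; np.linalg.eigh may return the eigenvectors with flipped signs, which does not change the directions they describe):

```python
import numpy as np

cov = np.array([[0.616555556, 0.615444444],
                [0.615444444, 0.716555556]])

# eigh is meant for symmetric matrices such as a covariance matrix; it returns
# the eigenvalues in ascending order and the unit-length eigenvectors as the
# columns of `vecs`.
vals, vecs = np.linalg.eigh(cov)
print(vals)   # approximately [0.0490834, 1.2840277]
print(vecs)   # columns are the unit eigenvectors (signs may be flipped)
```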
• These eigenvectors are both unit eigenvectors.
• We can plot these eigenvectors on top of the data we have (a short plotting sketch follows this list).
• They appear as diagonal lines on the plot.
• They are perpendicular to each other.
• They provide information about patterns in data.
• One of the eigenvectors goes through the middle of the points, like drawing a line of best fit.
• That eigenvector shows the relationship between x and y along that line (an approximation of the data points).
• The second eigenvector is less significant, but it reveals the other pattern in the data.
• All the points follow the main line but are offset from it by some amount; the second eigenvector captures that direction.
• So, by this process of taking the eigenvectors of the covariance matrix, we
have been able to extract lines that characterize the data.
• It is possible to transform the given data in such a way that it is expressed in
terms of these lines.
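As a sketch of the plot described above (assuming the same ten (x, y) points as before; the line extents are purely illustrative), the two eigenvector directions can be drawn over the mean-adjusted data with matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt

# Assumed example data: the same ten (x, y) points used earlier.
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
adjusted = data - data.mean(axis=0)            # mean-adjusted data
vals, vecs = np.linalg.eigh(np.cov(adjusted.T))

plt.scatter(adjusted[:, 0], adjusted[:, 1])
for v in vecs.T:                               # each column of `vecs` is a unit eigenvector
    plt.plot([-2 * v[0], 2 * v[0]], [-2 * v[1], 2 * v[1]], linestyle="--")
plt.gca().set_aspect("equal")                  # perpendicular eigenvectors look perpendicular
plt.xlabel("x (mean adjusted)")
plt.ylabel("y (mean adjusted)")
plt.show()
```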
• Step 5: Choosing components and forming a feature vector.
Here the idea of data compression and dimensionality reduction comes into the picture.
The eigenvector with the highest eigenvalue is the principal component of the dataset.
Once the eigenvectors are found from the covariance matrix, the next step is to rank them by eigenvalue, from highest to lowest.
This gives the components in order of significance.
Based on this order, the less significant components can be ignored.
• If we leave out some components, the final dataset will have fewer dimensions than the original.
• To be precise, if our data originally has n dimensions, we calculate n eigenvectors and eigenvalues, and we choose to keep only the first p eigenvectors, then the final dataset has only p dimensions.
So, to form the feature vector, we take the eigenvectors we want to keep from the list of eigenvectors and place them as the columns of a matrix.
So, $\mathrm{FeatureVector} = (\mathrm{eig}_1, \mathrm{eig}_2, \mathrm{eig}_3, \ldots, \mathrm{eig}_n)$
In our example data, since we have two eigenvectors, we have two choices.
We can form a feature vector with both of the eigenvectors, or we can choose to leave out the smaller, less significant component and keep only a single column.
In this example, taking both eigenvectors in decreasing order of eigenvalue gives
$$\mathrm{FeatureVector} = \begin{pmatrix} -0.677873399 & -0.735178656 \\ -0.735178656 & 0.677873399 \end{pmatrix}$$
If we leave out the less significant eigenvector from the list, the reduced feature vector is
$$\mathrm{FeatureVector} = \begin{pmatrix} -0.677873399 \\ -0.735178656 \end{pmatrix}$$
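A sketch of Step 5 under the same assumptions (the variable names are illustrative, not from any particular library): rank the eigenvectors by decreasing eigenvalue and keep the first p of them as the columns of the feature vector.

```python
import numpy as np

cov = np.array([[0.616555556, 0.615444444],
                [0.615444444, 0.716555556]])
vals, vecs = np.linalg.eigh(cov)

# Order the components from most to least significant (largest eigenvalue first).
order = np.argsort(vals)[::-1]
vecs_sorted = vecs[:, order]          # eigenvectors as columns, most significant first

# Keep the first p components; p = 1 leaves out the less significant eigenvector.
p = 1
feature_vector = vecs_sorted[:, :p]   # shape (2, p): one column per kept eigenvector
print(feature_vector)
```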
• Step 6: Deriving the new dataset.
This is the final step in PCA. Once we have chosen the components (eigenvectors) that we wish to keep in our data and formed a feature vector, we simply take the transpose of this feature vector and multiply it on the left of the transposed, mean-adjusted original dataset.
FinalData = RowFeatureVector × RowDataAdjust
where RowFeatureVector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows with the most significant vector at the top. RowDataAdjust is the mean-adjusted data, transposed; that is, the data items are in the columns and each row holds a separate dimension. FinalData is the final dataset, with data items in columns and dimensions along rows.
The final data is only in terms of the vectors that we decided to keep.
To bring the data back to the same table-like format, take the transpose of the result.
When we consider a transformation that keeps only the eigenvector with the largest eigenvalue, the result has only a single dimension. This dataset is nothing but the first column of the dataset obtained by transforming with both eigenvectors (in the table-like format). If we plot this data, it is one-dimensional and is actually the projection of the mean-adjusted data points onto the direction of the principal eigenvector. We have effectively thrown away the other axis (the second eigenvector).
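Putting Step 6 together as a sketch (again assuming the same example data; the signs of the transformed values may differ from those quoted here, depending on the eigenvector signs returned by the solver):

```python
import numpy as np

# Assumed example data: the same ten (x, y) points used earlier.
data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
data_adjust = data - data.mean(axis=0)          # subtract the mean of each dimension

vals, vecs = np.linalg.eigh(np.cov(data_adjust.T))
order = np.argsort(vals)[::-1]                  # most significant component first
feature_vector = vecs[:, order]                 # kept eigenvectors as columns

row_feature_vector = feature_vector.T           # eigenvectors as rows, most significant on top
row_data_adjust = data_adjust.T                 # one data item per column, one row per dimension

# FinalData = RowFeatureVector x RowDataAdjust
final_data = row_feature_vector @ row_data_adjust
print(final_data.T)                             # transpose back to the table-like format

# Keeping only the principal eigenvector (p = 1) gives a one-dimensional result,
# which is exactly the first row of final_data: each point's coordinate along
# the main "line of best fit" direction.
final_data_1d = row_feature_vector[:1, :] @ row_data_adjust
print(final_data_1d)
```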
Getting the old data back