Principal Component Analysis
Step 1: Standardization
Step 2: Covariance Matrix Computation
Step 3: Compute The Eigenvectors And Eigenvalues of The Covariance Matrix To
Identify The Principal Components
Step 4: Feature Vector
Step 5: Recast The Data Along The Principal Components Axes
STEP 1: STANDARDIZATION
1. The aim of this step is to standardize the range of the continuous initial variables so
that each one of them contributes equally to the analysis.
2. Mathematically, this can be done by subtracting the mean and dividing by the
standard deviation for each value of each variable.
3. Once the standardization is done, all the variables will be transformed to the same
scale.
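A minimal sketch of the standardization step, assuming NumPy and a data matrix with one sample per row and one variable per column. The function name `standardize` and the toy data are illustrative assumptions, not part of the original material.

```python
import numpy as np

def standardize(X):
    """Return a copy of X with each column shifted to mean 0 and scaled to std 1."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    # Guard against constant columns (std == 0) to avoid division by zero.
    std = np.where(std == 0, 1.0, std)
    return (X - mean) / std

# Hypothetical toy data: 4 samples, 2 variables on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])
Z = standardize(X)
```

After this step both columns of `Z` are on the same scale, so neither variable dominates the analysis just because of its units.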
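STEP 2: COVARIANCE MATRIX COMPUTATION
1. The covariance matrix is a symmetric p x p matrix (p being the number of variables)
that holds the covariance of every pair of standardized variables, showing how the
variables vary together.
STEP 3: COMPUTE THE EIGENVECTORS AND EIGENVALUES OF THE COVARIANCE MATRIX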
1. Eigenvectors and eigenvalues are the mathematical constructs that must be computed
from the covariance matrix in order to determine the principal components of the data
set.
2. Principal components are new variables constructed as linear combinations of the
initial variables.
3. The principal components are computed so that they are uncorrelated with each other
and ordered by how much of the data's variance each one captures.
4. The first few principal components concentrate most of the information (variance)
that was spread across the initial variables.
5. If your data set has 5 dimensions, then 5 principal components are computed, such
that the first principal component captures the largest possible variance, the second
captures the largest remaining variance, and so on.
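The covariance matrix and its eigenvectors and eigenvalues can be sketched with NumPy as follows. The random data is a hypothetical stand-in, and `np.linalg.eigh` is one reasonable choice here because a covariance matrix is symmetric.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 200 samples, 3 variables, the second correlated with the first.
X = rng.normal(size=(200, 3))
X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix of the standardized variables (columns are variables).
C = np.cov(Z, rowvar=False)

# eigh is intended for symmetric matrices; eigenvalues come back in ascending order.
# Column i of `eigenvectors` is the eigenvector paired with eigenvalues[i].
eigenvalues, eigenvectors = np.linalg.eigh(C)
```

Each eigenvector gives the direction of one principal component, and its eigenvalue is the variance of the data along that direction.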
HOW PCA CONSTRUCTS THE PRINCIPAL COMPONENTS
1. Once the eigenvectors and eigenvalues are computed, we sort the eigenvectors in
descending order of their eigenvalues. The eigenvector with the highest eigenvalue is
the most significant and forms the first principal component.
2. The principal components of lesser significance (those with small eigenvalues) can
then be dropped in order to reduce the dimensions of the data.
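The ordering and pruning described above might look like this in NumPy. The eigenvalues, the identity eigenvectors, and the 90% variance threshold are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical eigenvalues and eigenvectors (columns) of a 3x3 covariance matrix.
eigenvalues = np.array([0.2, 2.5, 0.3])
eigenvectors = np.eye(3)

# Sort in descending order of eigenvalue, reordering the eigenvector columns to match.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance carried by each component.
explained_ratio = eigenvalues / eigenvalues.sum()

# Keep the smallest number of components whose cumulative share reaches
# an assumed 90% threshold.
k = int(np.searchsorted(np.cumsum(explained_ratio), 0.90) + 1)
```

With these example values the first component alone carries about 83% of the variance, so two components are needed to pass the 90% threshold.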
STEP 4: FEATURE VECTOR
1. The final step in computing the principal components is to form a matrix known as
the feature vector.
2. Its columns are the eigenvectors of the principal components we decide to keep, that
is, the directions that carry the most information about the data.
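One way to build this matrix, sketched with NumPy. The helper name `feature_vector` and the example covariance matrix are hypothetical.

```python
import numpy as np

def feature_vector(cov, k):
    """Return a (p, k) matrix whose columns are the top-k eigenvectors of cov."""
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1][:k]          # indices of the k largest
    return eigenvectors[:, order]

# Hypothetical 3x3 covariance matrix with one dominant direction.
cov = np.array([[2.0, 0.0, 0.0],
                [0.0, 0.5, 0.0],
                [0.0, 0.0, 1.0]])
W = feature_vector(cov, k=2)
```

Here `W` keeps the directions with variances 2.0 and 1.0 and discards the one with variance 0.5. The sign of each column is arbitrary, since an eigenvector multiplied by -1 is still an eigenvector.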
STEP 5: RECAST THE DATA ALONG THE PRINCIPAL COMPONENTS AXES
1. The last step in performing PCA is to reorient the data from the original axes to
the axes defined by the retained principal components, which represent the most
significant information of the data set.
2. In order to replace the original data axes with the newly formed principal
components, you multiply the transpose of the feature vector by the transpose of the
standardized data set.
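The projection can be written as the product of the transposed feature vector and the transposed standardized data. The sketch below, on hypothetical random data, also shows the equivalent samples-in-rows form that is common in library code.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical standardized data: 100 samples (rows) by 4 variables (columns).
X = rng.normal(size=(100, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Feature vector: the 2 eigenvectors with the largest eigenvalues, as columns.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(Z, rowvar=False))
W = eigenvectors[:, np.argsort(eigenvalues)[::-1][:2]]

# Transposed form: components in rows, samples in columns.
final_T = W.T @ Z.T          # shape (2, 100)

# Equivalent form with samples in rows.
final = Z @ W                # shape (100, 2)
```

The two results are transposes of each other, and the columns of `final` are uncorrelated, which is the defining property of the principal components.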
Case Study
Problem Statement: To perform step-by-step
Principal Component Analysis in order to reduce the
dimension of the data set.
Data set Description: A movie ratings data set that
contains ratings from 700+ users for approximately
9000 movies (features).
Logic: Perform PCA by finding the most significant
features in the data. PCA will be performed by
following these steps.
Step 1: Import Required Packages
Import data set
Formatting the data
Step 2: Standardization
Step 3: Compute Covariance Matrix
Step 4: Calculate Eigenvectors and Eigenvalues
Step 5: Compute the feature vector
Step 6: Use the PCA() Function to Reduce the Dimensionality
of the Dataset
Step 7: Projecting the Variance w.r.t the Principal Components
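A rough sketch of Steps 2, 6 and 7, assuming the PCA() function refers to scikit-learn's `sklearn.decomposition.PCA`. The ratings matrix below is synthetic (50 users by 20 movies) and only stands in for the real data set, which is not reproduced here. The choice of 2 components is likewise an assumption.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for the ratings matrix: 50 users (rows) by 20 movies (columns).
# Ratings are driven by 2 hidden taste factors plus noise, so few components suffice.
tastes = rng.normal(size=(50, 2))
loadings = rng.normal(size=(2, 20))
ratings = tastes @ loadings + 0.1 * rng.normal(size=(50, 20))

scaled = StandardScaler().fit_transform(ratings)   # Step 2: standardization
pca = PCA(n_components=2)                          # Step 6: reduce to 2 components
reduced = pca.fit_transform(scaled)

# Step 7: share of the variance captured by each principal component.
print(pca.explained_variance_ratio_)
```

`reduced` has one row per user and one column per retained component. Plotting the cumulative sum of `explained_variance_ratio_` against the number of components is a common way to decide how many to keep.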