How Do You Do A Principal Component Analysis?
2. Calculate Mean: The next step is to calculate the mean (average) of all data points. Note that if the data is 3D, the mean is also a 3D point with x, y and z coordinates. Similarly, if the data is m-dimensional, the mean is also m-dimensional. The mean is calculated as

μ = (1/n) Σᵢ dᵢ

where n is the number of data points and dᵢ is the i-th data point of D.
3. Subtract Mean from data matrix: We next create another matrix M by subtracting the mean μ from every data point of D:

mᵢ = dᵢ − μ
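Steps 2 and 3 can be sketched in NumPy. The small dataset D below is purely illustrative, not taken from the text:

```python
import numpy as np

# Hypothetical 3D dataset: each row is one data point (n rows, m = 3 columns).
D = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.0],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4]])

# Step 2: the mean is an m-dimensional point (one value per coordinate).
mean = D.mean(axis=0)

# Step 3: subtract the mean from every row to get the centered matrix M.
M = D - mean

# Each column of M now has zero mean (up to floating-point error).
print(M.mean(axis=0))
```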
4. Calculate the Covariance matrix: Remember, we want to find the direction of maximum variance. The covariance matrix captures the information about the spread of the data. The diagonal elements of a covariance matrix are the variances along the X, Y and Z axes. The off-diagonal elements represent the covariance between two dimensions (X and Y, Y and Z, Z and X). The covariance matrix, C, is calculated using the following product:

C = MᵀM / (n − 1)

where T represents the transpose operation and n is the number of data points. The matrix C is of size m × m, where m is the number of dimensions (which is 3 in our example). The figure shows how the covariance matrix changes depending on the spread of data in different directions.
Figure: Left: When the data is evenly spread in all directions, the covariance matrix has equal diagonal elements and zero off-diagonal elements. Center: When the data spread is elongated along one of the axes, the diagonal elements are unequal, but the off-diagonal elements are zero. Right: In general, the covariance matrix has both diagonal and off-diagonal elements.
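Step 4 can be sketched as follows. The random dataset here is only an illustration; the built-in np.cov is used to check the hand-rolled product:

```python
import numpy as np

# Illustrative centered data matrix M (n x m); rows are mean-subtracted points.
rng = np.random.default_rng(0)
D = rng.normal(size=(100, 3))
M = D - D.mean(axis=0)

# Sample covariance: C = M^T M / (n - 1), an m x m symmetric matrix.
n = M.shape[0]
C = M.T @ M / (n - 1)

# Matches NumPy's built-in (rowvar=False treats each row as one observation).
print(np.allclose(C, np.cov(D, rowvar=False)))
```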
Variance alone can only explain the spread of the data in the directions parallel to the axes of the feature space.
For this data, we could calculate the variance in the x-direction and the variance in the y-direction. However, the horizontal and vertical spreads of the data do not explain the clear diagonal correlation. The figure clearly shows that, on average, if the x-value of a data point increases, the y-value also increases, resulting in a positive correlation. This correlation can be captured by extending the notion of variance to what is called the 'covariance' of the data:

cov(x, y) = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / (n − 1)
These four values (var(x), cov(x, y), cov(y, x) and var(y)) can be summarized in a matrix, called the covariance matrix:

C = | var(x)    cov(x, y) |
    | cov(y, x) var(y)    |

The covariance matrix is always a symmetric matrix with the variances on its diagonal and the covariances off-diagonal.
So, the covariance matrix defines both the spread (variance) and the orientation (covariance) of our data. If we want to represent the covariance matrix with a single vector and its magnitude, we should find the vector that points in the direction of the largest spread of the data, and whose magnitude equals the spread (variance) in that direction.
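The diagonal correlation described above can be made concrete with a small synthetic 2D example (the data and coefficients are invented for illustration): when y tends to rise with x, the off-diagonal entries of the covariance matrix come out positive.

```python
import numpy as np

# Synthetic diagonally elongated cloud: y rises with x, plus small noise.
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 0.8 * x + 0.2 * rng.normal(size=500)
data = np.column_stack([x, y])

C = np.cov(data, rowvar=False)
# C[0, 0] = var(x), C[1, 1] = var(y),
# C[0, 1] = C[1, 0] = cov(x, y), positive for this positively correlated data.
print(C[0, 1] > 0)
```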
5. Calculate the eigenvectors and eigenvalues of the covariance matrix: The principal components are the eigenvectors of the covariance matrix. The first principal component is the eigenvector corresponding to the largest eigenvalue, the second principal component is the eigenvector corresponding to the second largest eigenvalue, and so on.
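Step 5 might look like this in NumPy (the dataset is again illustrative; np.linalg.eigh is appropriate here because a covariance matrix is symmetric):

```python
import numpy as np

# Illustrative 3D data with deliberately unequal spread along the axes.
rng = np.random.default_rng(2)
D = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.3])
M = D - D.mean(axis=0)
C = M.T @ M / (M.shape[0] - 1)

# eigh handles symmetric matrices and returns eigenvalues in ascending
# order, so reverse the order to put the largest eigenvalue first.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]   # column i is the i-th principal component
```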
This is the final step, where we actually form the principal components using all the math we did till here. For this, we take the transpose of the feature vector and left-multiply it with the transpose of the scaled version of the original dataset:

NewData = FeatureVectorᵀ × ScaledDataᵀ
NewData is the matrix consisting of the principal components,
FeatureVector is the matrix we formed using the eigenvectors we chose to keep, and
ScaledData is the scaled version of the original dataset.
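The final projection can be sketched end to end as below. The variable names mirror the formula above; the data, the choice of k = 2 components, and the use of mean-centering as the "scaling" are all assumptions for illustration:

```python
import numpy as np

# Illustrative dataset, mean-centered as a stand-in for "scaled" data.
rng = np.random.default_rng(3)
scaled_data = rng.normal(size=(50, 3))
scaled_data -= scaled_data.mean(axis=0)

# Covariance and eigendecomposition, largest eigenvalue first.
C = scaled_data.T @ scaled_data / (scaled_data.shape[0] - 1)
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]

k = 2                                   # keep the top-2 principal components
feature_vector = eigvecs[:, order[:k]]  # m x k matrix of chosen eigenvectors

# NewData = FeatureVector^T x ScaledData^T, a k x n matrix of projections.
new_data = feature_vector.T @ scaled_data.T
print(new_data.shape)
```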