Principal Component Analysis
PCA is commonly used as one step in a series of analyses. One can use principal components
analysis to reduce the number of variables and avoid multicollinearity, or when you have too
many predictors relative to the number of observations.
How PCA works:
For example, a consumer products company wants to analyze customer responses to several
characteristics of a new shampoo: color, smell, texture, cleanliness, shine, volume, amount
needed to lather, and price. They conduct a principal components analysis to see if they can
form a smaller number of uncorrelated variables that are easier to interpret and analyze.
In PCA, one first finds the set of orthogonal eigenvectors of the correlation or covariance
matrix of the variables. The matrix of principal component scores is the product of the
(centered) data matrix with the eigenvector matrix. The first principal component accounts for
the largest percentage of the total data variation, the second principal component accounts for
the second largest percentage, and so on. The goal of principal components analysis is to
explain the maximum amount of variance with the fewest components.
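The procedure just described can be sketched in a few lines of NumPy; the data here is random and purely illustrative, and all variable names are assumptions, not from the text:

```python
# PCA via eigendecomposition of the covariance matrix (illustrative sketch).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))          # 100 observations, 4 variables
Xc = X - X.mean(axis=0)                # center each variable

cov = np.cov(Xc, rowvar=False)         # 4 x 4 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov) # orthogonal eigenvectors

# Sort components by descending eigenvalue so PC1 explains the most variance.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Xc @ eigvecs                  # principal component scores
explained = eigvals / eigvals.sum()    # fraction of variance per component
```

The score columns are uncorrelated, and the variance of each score column equals the corresponding eigenvalue, which is exactly the "largest percent, second largest percent" ordering described above.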
Eigenvectors
Eigenvectors consist of the coefficients corresponding to each variable; they are the
weights used to calculate the principal component scores.
Scores
The linear combinations of the original variables that account for the variance in the data.
Eigenvalue
The amount of variance explained by each principal component; the eigenvalue of a
component equals the variance of its scores.
The method involves decomposing a data matrix X into a structure part and
a noise part. The PC model is the matrix product TPᵀ (the structure):
Scores: T
The scores are the structure part of the PCA: a summary of the original variables
in X that describes how the different rows of X (observations) relate to each
other. Column 1 of the T-matrix (t1) contains the scores of the first PC, the
second column contains the scores of the second PC, and so on.
Loadings: P
The loadings are the structure part of the PCA that gives the weights (influence)
of the variables in X on the scores T. From the loadings we can see which
variables are responsible for the patterns found in the scores T, using the
loadings plot. This plot is simply the loadings of one PC plotted against the
loadings of another PC. It shows how the scores and loadings relate, which is
the key value of this plot; the loadings plot can be thought of as a map of
the variables.
Residuals: E
The residuals (the E-matrix) are the noise part of the PCA, an n × p matrix.
E is not part of the model; it is the part of X that is not explained by the
model TPᵀ, i.e. E = X − TPᵀ.
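The split of X into the structure TPᵀ and the noise E can be illustrated with a short NumPy sketch. SVD is used here only as a convenient way to estimate T and P for the sketch, and all names are illustrative:

```python
# Structure/noise decomposition X = T P^T + E for a truncated PCA model.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 6))            # 50 observations, 6 variables
Xc = X - X.mean(axis=0)                 # center each variable

k = 2                                   # number of components kept in the model
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
T = U[:, :k] * s[:k]                    # scores (n x k), structure part
P = Vt[:k].T                            # loadings (p x k), structure part

E = Xc - T @ P.T                        # residuals: the part of X not explained
```

Note that the residual matrix E is orthogonal to the loading space: projecting E onto P gives (numerically) zero, which is what "not explained by the model TPᵀ" means.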
NIPALS algorithm
The NIPALS ("Nonlinear Iterative Partial Least Squares") algorithm is employed for
estimating the parameters of the PCA model. The steps of the algorithm are listed below:
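In code, the iteration looks roughly like this; a minimal sketch of the standard NIPALS formulation, where the function name, tolerance, and starting-vector choice are illustrative assumptions:

```python
# NIPALS for PCA: extract one component at a time, then deflate.
import numpy as np

def nipals_pca(X, n_components, tol=1e-11, max_iter=1000):
    """Estimate scores T and loadings P one component at a time."""
    X = X - X.mean(axis=0)                # work on centered data
    n, p = X.shape
    T = np.zeros((n, n_components))
    P = np.zeros((p, n_components))
    for a in range(n_components):
        t = X[:, 0].copy()                # starting score vector (illustrative choice)
        for _ in range(max_iter):
            p_vec = X.T @ t / (t @ t)     # regress X on t -> candidate loadings
            p_vec /= np.linalg.norm(p_vec)
            t_new = X @ p_vec             # regress X on p -> new scores
            if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                t = t_new
                break                     # scores stopped changing: converged
            t = t_new
        T[:, a], P[:, a] = t, p_vec
        X = X - np.outer(t, p_vec)        # deflate: subtract this component
    return T, P
```

Each pass is a power-method-style iteration that converges to the dominant component of the current (deflated) X, so the components come out in order of explained variance, matching the T and P matrices described above.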