Principal Component Analysis
Key Ideas:
1. Dimensionality Reduction:
o Think of PCA like organizing a messy room. You have too many things (data points),
and it’s hard to find what’s important.
o PCA helps by "cleaning up" and showing you the few key items (important patterns)
that matter the most.
2. Principal Components:
o These are the new, simpler pieces of information PCA gives you.
o The first principal component shows the most important pattern or trend in your
data.
o The second one shows the next most important pattern, and so on.
3. Why Use PCA:
o Simplifying Data: If you have too many features or measurements, PCA helps by
summarizing them into just a few that still capture the essence of your data.
o Making Patterns Clearer: It helps reveal patterns in the data that might not be
obvious at first glance.
4. How It Works:
o Standardize Your Data: First, you make sure all your measurements are on the same
scale.
o Find Patterns: PCA finds the main directions (patterns) where your data varies the
most.
o Create New Variables: These main directions become new variables, called principal
components, that you can use instead of your original data.
5. When to Use PCA:
o Too Much Data: When you have too many features and want to focus on the most
important ones.
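The "How It Works" steps above can be sketched with NumPy (a minimal illustration; the toy data and variable names are mine, not from the text):

```python
import numpy as np

# Toy data: 5 samples, 3 features (hypothetical measurements)
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.9],
    [2.2, 2.9, 0.4],
    [1.9, 2.2, 0.8],
    [3.1, 3.0, 0.2],
])

# Step 1: standardize (zero mean, unit variance per feature)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: find the main directions of variation via the
# eigendecomposition of the covariance matrix
cov = np.cov(X_std, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order; sort descending so the
# first principal component explains the most variance
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Step 3: create new variables by projecting onto the top k directions
k = 2
principal_components = X_std @ eigenvectors[:, :k]
print(principal_components.shape)  # (5, 2)
```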
Simple Example:
Imagine you’re looking at a lot of different types of fruit. Each fruit has different
measurements: size, weight, color, etc. PCA would help you figure out which measurements
are the most important to identify the type of fruit, like "big and heavy" might be a key
pattern. It then lets you focus on just those key patterns, making it easier to categorize or
visualize your fruit.
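The fruit example can be sketched with scikit-learn (the measurements below are made up for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical fruit measurements: size (cm), weight (g), color score
fruit = np.array([
    [7.0, 150.0, 0.80],    # apple-like
    [7.5, 160.0, 0.70],
    [12.0, 1200.0, 0.30],  # melon-like
    [11.5, 1100.0, 0.40],
    [2.0, 5.0, 0.90],      # grape-like
    [2.2, 6.0, 0.85],
])

# Standardize first so weight (large range) does not dominate
scaled = StandardScaler().fit_transform(fruit)

pca = PCA(n_components=2)
reduced = pca.fit_transform(scaled)

# Each fruit is now described by two "key patterns" instead of three
# measurements; PC1 might capture something like "big and heavy"
print(reduced.shape)                  # (6, 2)
print(pca.explained_variance_ratio_)  # share of variance per component
```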
PCA is like taking a big, complicated puzzle and finding the main pieces that give you a clear
picture of what’s going on.
More specifically, standardization is critical prior to PCA because PCA is quite sensitive to the
variances of the initial variables. If there are large differences between the ranges of the initial
variables, the variables with larger ranges will dominate over those with smaller ranges (for
example, a variable that ranges between 0 and 100 will dominate over a variable that ranges
between 0 and 1), which will lead to biased results. Transforming the data to comparable scales
prevents this problem.
Mathematically, this can be done by subtracting the mean and dividing by the standard deviation for
each value of each variable.
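A quick illustration of this scaling step (the values are arbitrary):

```python
import numpy as np

# Two variables with very different ranges: one in [0, 100], one in [0, 1]
data = np.array([
    [90.0, 0.2],
    [10.0, 0.9],
    [50.0, 0.5],
    [70.0, 0.1],
])

print(data.var(axis=0))  # the first column's variance dwarfs the second's

# z = (value - mean) / standard deviation, applied per variable
z = (data - data.mean(axis=0)) / data.std(axis=0)

# After standardization each variable has mean 0 and variance 1,
# so neither dominates the covariance matrix
print(z.mean(axis=0))  # approximately zero for both columns
print(z.var(axis=0))   # both equal 1 (up to floating-point error)
```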
Let’s suppose that our data set is 2-dimensional with two variables, x and y, and that
the eigenvectors and eigenvalues of the covariance matrix are as follows: