Principal Component Analysis with R Programming

Last Updated : 12 Jul, 2025

Principal Component Analysis (PCA) is a machine learning technique used to reduce the dimensionality of large datasets while retaining as much variance as possible. It transforms the data into a new coordinate system in which the greatest variance lies along the first principal component, the second greatest along the second principal component, and so on. PCA is commonly used in Exploratory Data Analysis (EDA) to simplify datasets with many variables and to make them easier to visualize.

Understanding Principal Component Analysis

The first principal component (PC1) captures the largest share of the variance in the dataset and represents the direction of greatest variability. The second principal component (PC2) captures the largest share of the remaining variance while being orthogonal (uncorrelated) to PC1. Subsequent components (PC3, PC4, etc.) each capture as much of the remaining variance as possible, with every component orthogonal to the ones before it.
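For intuition, when the variables are standardized, the principal component directions are the eigenvectors of the correlation matrix and the variance captured by each component is the corresponding eigenvalue. The following is a minimal sketch of this idea on the built-in mtcars dataset, using only base R functions:

R
# Eigen-decomposition of the correlation matrix of mtcars
eig <- eigen(cor(mtcars))

# Eigenvalues: variance captured by each principal component
eig$values

# Eigenvectors: principal component directions; they are orthonormal,
# so this product is (approximately) the identity matrix
round(t(eig$vectors) %*% eig$vectors, 10)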

Implementation of PCA in R

We will perform Principal Component Analysis (PCA) on the mtcars dataset to reduce dimensionality, visualize the variance and explore the relationships between different car attributes.

1. Installing and Loading the Required Packages

We will install and load the necessary packages.

  • install.packages(): Installs the package.
  • library(): Loads the package.
R
install.packages("dplyr")
library(dplyr)

2. Loading the Dataset

The mtcars dataset is a built-in dataset in R. It contains data on fuel consumption and various performance and design aspects of 32 cars. The dataset has 11 variables, including miles per gallon (mpg), horsepower (hp) and weight (wt).

  • str(): Gives the structure of the dataset.
R
str(mtcars)

Output:

[Structure of the mtcars dataset: 32 observations of 11 variables]

3. Performing PCA

To perform PCA, we use the prcomp() function. We scale and center the data before applying PCA, since PCA is driven by variance and scaling ensures that all variables are treated equally regardless of their original units.

  • prcomp(): Performs Principal Component Analysis.
  • scale. = TRUE: Scales the data before applying PCA.
  • center = TRUE: Centers the data before applying PCA (subtracts the mean).
  • retx = TRUE: Returns the transformed data (principal components).
R
my_pca <- prcomp(mtcars, scale. = TRUE, center = TRUE, retx = TRUE)
names(my_pca)

Output:

'sdev' 'rotation' 'center' 'scale' 'x'
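Besides the loadings (rotation) and scores (x) explored in the next steps, the returned object also stores the column means and standard deviations used for centering and scaling. A quick look, assuming my_pca from above:

R
# Column means used for centering
my_pca$center

# Column standard deviations used for scaling
my_pca$scale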

4. Summary of PCA Results

We will summarize the PCA results to understand how much variance is captured by each principal component.

  • summary(): Summarizes the PCA results, including the proportion of variance explained by each principal component.
R
summary(my_pca)

Output:

[Summary of the PCA results: standard deviation, proportion of variance and cumulative proportion for each principal component]
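The same numbers can also be extracted programmatically from the importance matrix returned by summary(), which is convenient when the proportions are needed in later calculations:

R
# Importance matrix: standard deviation, proportion of variance and cumulative proportion
summary(my_pca)$importance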

5. Principal Component Loadings

We will see the weights (loadings) of each variable in the principal components.

  • my_pca$rotation: Shows the loadings (coefficients) of the principal components.
R
my_pca$rotation[1:5, 1:4]

Output:

[Loadings of the first five variables on the first four principal components]
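Because the loadings define orthonormal directions, the full rotation matrix multiplied by its transpose gives (up to rounding) the identity matrix. A quick sanity check:

R
# Loadings are orthonormal: this should print (approximately) the identity matrix
round(t(my_pca$rotation) %*% my_pca$rotation, 10)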

6. Principal Components Scores

We will now inspect the scores (values of the observations on each principal component).

  • my_pca$x: Provides the transformed data in terms of principal components (scores).
  • head(): Displays the first few rows.
R
head(my_pca$x)

Output:

[First six rows of the principal component scores]
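The scores are simply the centered and scaled data projected onto the loadings. A small sketch verifying this relationship:

R
# Scores = (centered and scaled data) %*% loadings
manual_scores <- scale(mtcars) %*% my_pca$rotation

# Should be TRUE (up to floating-point tolerance)
all.equal(unname(manual_scores), unname(my_pca$x))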

7. Visualizing the Principal Components

We will use a biplot to visualize the principal components and their contributions to the overall variance.

  • biplot(): Plots the principal components and their relationships.
  • scale = 0: Ensures that arrows are scaled to represent loadings.
R
biplot(my_pca, main = "Biplot of Principal Components", scale = 0)

Output:

[Biplot of the principal components]
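If the biplot looks crowded, a plain scatter plot of the first two scores with the car names as labels is often easier to read. A minimal base-R alternative:

R
# Scatter plot of the first two principal component scores
plot(my_pca$x[, 1], my_pca$x[, 2],
     xlab = "PC1", ylab = "PC2",
     main = "Cars in the PC1-PC2 plane")

# Label each point with its car name
text(my_pca$x[, 1], my_pca$x[, 2], labels = rownames(my_pca$x),
     pos = 3, cex = 0.7)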

8. Computing Standard Deviation and Variance

We will now compute the standard deviation and variance of each principal component.

  • my_pca$sdev: Displays the standard deviation of each principal component.
  • my_pca.var <- my_pca$sdev^2: Computes the variance of each component.
R
my_pca.var <- my_pca$sdev^2

cat("Standard Deviation :",my_pca$sdev,"\n")
cat("Variance :",my_pca.var,"\n")

Output:

Standard Deviation : 2.570681 1.628026 0.7919579 0.5192277 0.4727061 0.4599958 0.3677798 0.350573 0.2775728 0.2281128 0.1484736
Variance : 6.6084 2.650468 0.6271973 0.2695974 0.2234511 0.2115961 0.135262 0.1229014 0.07704665 0.05203544 0.02204441
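Since each of the 11 standardized variables contributes a variance of 1, the component variances should add up to 11. A quick check:

R
# Total variance across all principal components equals the number of scaled variables
sum(my_pca.var)   # approximately 11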

9. Proportion of Variance Explained

We calculate the proportion of variance for each component and visualize it using a scree plot to see how much variance each principal component explains.

  • plot(): Plots the proportion of variance explained by each principal component.
  • xlab: Label for the x-axis.
  • ylab: Label for the y-axis.
  • ylim = c(0, 1): Sets the y-axis limits from 0 to 1.
  • type = "b": Plots both points and lines.
R
propve <- my_pca.var / sum(my_pca.var)

plot(propve, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     ylim = c(0, 1), type = "b", main = "Scree Plot")

Output:

[Scree plot: proportion of variance explained by each principal component]

10. Cumulative Proportion of Variance

We can now look at the cumulative proportion of variance explained as more components are included.

  • cumsum(): Calculates the cumulative sum of variance explained by the components.
  • plot(): Plots the cumulative proportion of variance.
R
plot(cumsum(propve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")

Output:

[Plot of the cumulative proportion of variance explained]
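To relate this plot to the 90% threshold used in the next step, a dashed reference line can be added. A small sketch that redraws the plot with the threshold marked:

R
# Cumulative variance plot with a dashed line at the 90% threshold
plot(cumsum(propve), xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     ylim = c(0, 1), type = "b")
abline(h = 0.9, lty = 2, col = "red")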

11. Choosing Top Principal Components

We can identify the smallest number of principal components that explain at least 90% of the variance.

  • which(cumsum(propve) >= 0.9)[1]: Finds the smallest number of principal components that explain at least 90% of the variance.
R
which(cumsum(propve) >= 0.9)[1]

Output:

4
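We can confirm this by printing the cumulative proportions directly; the third component falls just short of 90%, so four components are needed:

R
# Cumulative proportion of variance explained by the first k components
round(cumsum(propve), 3)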

12. Predicting with Principal Components

We can now use the first few principal components to predict another variable. For example, predicting disp (displacement) from the top 4 principal components.

  • data.frame(): Creates a new data frame, combining the original variable (disp) with the first 4 principal components.
R
train.data <- data.frame(disp = mtcars$disp, my_pca$x[, 1:4])

head(train.data)

Output:

[First six rows of train.data: disp together with PC1-PC4]

13. Building a Decision Tree

Next, we can use the rpart package to build a decision tree model to predict disp using the first four principal components.

  • rpart(): Fits a decision tree model.
  • disp ~ .: Formula for predicting disp using all other variables (principal components in this case).
  • data = train.data: Specifies the data to use for fitting the model.
  • method = "anova": Specifies that the model is for regression.
  • rpart.plot(): Plots the decision tree.
R
install.packages("rpart")
install.packages("rpart.plot")

library(rpart)
library(rpart.plot)

rpart.model <- rpart(disp ~ ., data = train.data, method = "anova")

rpart.plot(rpart.model)

Output:

[Decision tree predicting disp from the first four principal components]
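To check how well the tree recovers displacement from the four principal components, we can generate fitted values with predict() and compute a simple error measure. A minimal sketch using the same training data:

R
# Fitted values from the decision tree
pred_disp <- predict(rpart.model, newdata = train.data)

# Compare the first few predictions with the actual values
head(data.frame(actual = train.data$disp, predicted = pred_disp))

# Root mean squared error on the training data
sqrt(mean((train.data$disp - pred_disp)^2))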

In this article, we applied PCA to the mtcars dataset, visualized the principal components and used them to predict car displacement with a decision tree. This helped us simplify the data and reveal patterns that would have been difficult to see in the original high-dimensional space.

