Principal Component Analysis with R Programming
Principal Component Analysis (PCA) is an unsupervised machine learning technique used to reduce the dimensionality of large datasets while retaining as much variance as possible. It transforms the data into a new coordinate system in which the greatest variance lies along the first principal component, the second greatest along the second principal component, and so on. PCA is commonly used in Exploratory Data Analysis (EDA) to simplify datasets with many variables and to make them easier to visualize.
Understanding Principal Component Analysis
The first principal component (PC1) captures the highest variance in the dataset and represents the direction of greatest variability. The second principal component (PC2) captures the largest share of the remaining variance while being uncorrelated with PC1. Similarly, each subsequent component (PC3, PC4, etc.) captures as much of the remaining variance as possible while staying orthogonal (uncorrelated) to all previous components.
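A quick way to see this orthogonality in practice is to check the correlations between the component scores. The sketch below is illustrative only; it uses the built-in mtcars data, which the rest of this article also works with, and the variable name pca is arbitrary.
R
# Run PCA on standardized data and inspect the score correlations
pca <- prcomp(mtcars, scale. = TRUE)

# Off-diagonal correlations are zero (up to floating-point error),
# confirming the components are mutually uncorrelated
round(cor(pca$x), 3)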
Implementation of PCA in R
We will perform Principal Component Analysis (PCA) on the mtcars dataset to reduce dimensionality, visualize the variance and explore the relationships between different car attributes.
1. Installing and Loading the Required Packages
We will install and load the necessary packages.
- install.packages(): Installs the package.
- library(): Loads the package.
R
install.packages("dplyr")
library(dplyr)
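As an optional aside (not part of the original workflow), a common pattern is to install a package only when it is missing, so the script can be re-run without reinstalling:
R
# Install dplyr only if it is not already available, then load it
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)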
2. Loading the Dataset
The mtcars dataset is a built-in dataset in R. It contains data on fuel consumption and various performance and design aspects of 32 cars. The dataset has 11 variables, including miles per gallon (mpg), horsepower (hp) and weight (wt).
- str(): Gives the structure of the dataset.
R
str(mtcars)
Output:
[Image: structure of the mtcars dataset]
3. Performing PCA
To perform PCA, we use the prcomp() function. We also ask it to center and scale the data first, since PCA is based on distance measures and scaling ensures that all variables are treated equally.
- prcomp(): Performs Principal Component Analysis.
- scale. = TRUE: Scales the data before applying PCA.
- center = TRUE: Centers the data before applying PCA (subtracts the mean).
- retx = TRUE: Returns the transformed data (principal components).
R
my_pca <- prcomp(mtcars, scale. = TRUE, center = TRUE, retx = TRUE)
names(my_pca)
Output:
'sdev' 'rotation' 'center' 'scale' 'x'
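These elements can be inspected directly. The short, illustrative snippet below shows the dimensions of the loadings and score matrices:
R
# sdev: standard deviations of the principal components
my_pca$sdev

# rotation: the loadings matrix (one column per principal component)
dim(my_pca$rotation)   # 11 variables x 11 components

# x: the scores, i.e. the data expressed in the new coordinate system
dim(my_pca$x)          # 32 cars x 11 components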
4. Summary of PCA Results
We will summarize the PCA model to understand how much variance is captured by each principal component.
- summary(): Summarizes the PCA results, including the proportion of variance explained by each principal component.
R
summary(my_pca)
Output:
[Image: summary of the PCA results]
5. Principal Component Loadings
We will see the weights (loadings) of each variable in the principal components.
- my_pca$rotation: Shows the loadings (coefficients) of the principal components.
R
my_pca$rotation[1:5, 1:4]
Output:
[Image: principal component loadings]
6. Principal Component Scores
We will now inspect the scores (values of the observations on each principal component).
- my_pca$x: Provides the transformed data in terms of principal components (scores).
- head(): Displays the first few rows.
R
head(my_pca$x)
Output:
[Image: the first few principal component scores]
7. Visualizing the Principal Components
We will use a biplot to visualize the principal components and their contributions to the overall variance.
- biplot(): Plots the principal components and their relationships.
- scale = 0: Ensures that arrows are scaled to represent loadings.
R
biplot(my_pca, main = "Biplot of Principal Components", scale = 0)
Output:
[Image: biplot of the principal components]
8. Computing Standard Deviation and Variance
We will now compute the standard deviation and variance of each principal component.
- my_pca$sdev: Displays the standard deviation of each principal component.
- my_pca.var <- my_pca$sdev^2: Computes the variance of each component.
R
my_pca.var <- my_pca$sdev^2
cat("Standard Deviation :",my_pca$sdev,"\n")
cat("Variance :",my_pca.var,"\n")
Output:
Standard Deviation : 2.570681 1.628026 0.7919579 0.5192277 0.4727061 0.4599958 0.3677798 0.350573 0.2775728 0.2281128 0.1484736
Variance : 6.6084 2.650468 0.6271973 0.2695974 0.2234511 0.2115961 0.135262 0.1229014 0.07704665 0.05203544 0.02204441
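Because the variables were standardized before the PCA, each contributes a variance of 1, so the component variances should sum to the number of variables (11 for mtcars). A quick sanity check, included here as an illustrative aside:
R
# For standardized data, total variance equals the number of variables
sum(my_pca.var)   # approximately 11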
9. Proportion of Variance Explained
We calculate the proportion of variance for each component and visualize it using a scree plot to see how much variance each principal component explains.
- plot(): Plots the proportion of variance explained by each principal component.
- xlab: Label for the x-axis.
- ylab: Label for the y-axis.
- ylim = c(0, 1): Sets the y-axis limits from 0 to 1.
- type = "b": Plots both points and lines.
R
propve <- my_pca.var / sum(my_pca.var)
plot(propve, xlab = "Principal Component",
ylab = "Proportion of Variance Explained",
ylim = c(0, 1), type = "b", main = "Scree Plot")
Output:
[Image: scree plot of the proportion of variance explained]
10. Cumulative Proportion of Variance
We next plot the cumulative proportion of variance explained by the components.
- cumsum(): Calculates the cumulative sum of variance explained by the components.
- plot(): Plots the cumulative proportion of variance.
R
plot(cumsum(propve), xlab = "Principal Component",
ylab = "Cumulative Proportion of Variance Explained",
ylim = c(0, 1), type = "b")
Output:
[Image: cumulative proportion of variance explained]
11. Choosing Top Principal Components
We can identify the smallest number of principal components that explain at least 90% of the variance.
- which(cumsum(propve) >= 0.9)[1]: Finds the smallest number of principal components that explain at least 90% of the variance.
R
which(cumsum(propve) >= 0.9)[1]
Output:
4
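Here the first four components together explain roughly 92% of the total variance. As an illustrative aside (the variable names approx_scaled and approx_data are arbitrary), the top components can also be used to approximately reconstruct the original data:
R
# Approximate the standardized data from the top 4 components:
# scores (32 x 4) times transposed loadings (4 x 11)
approx_scaled <- my_pca$x[, 1:4] %*% t(my_pca$rotation[, 1:4])

# Undo the scaling and centering to return to the original units
approx_data <- sweep(approx_scaled, 2, my_pca$scale, "*")
approx_data <- sweep(approx_data, 2, my_pca$center, "+")

head(approx_data[, 1:4])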
12. Predicting with Principal Components
We can now use the first few principal components to predict another variable, for example predicting disp (displacement) from the top 4 principal components. (Note that disp itself was included in the PCA, so this example is purely illustrative.)
- data.frame(): Creates a new data frame, combining the original variable (disp) with the first 4 principal components.
R
train.data <- data.frame(disp = mtcars$disp, my_pca$x[, 1:4])
head(train.data)
Output:
[Image: the first few rows of the training data]
13. Building a Decision Tree
Next, we can use the rpart package to build a decision tree model to predict disp using the first four principal components.
- rpart(): Fits a decision tree model.
- disp ~ .: Formula for predicting disp using all other variables (principal components in this case).
- data = train.data: Specifies the data to use for fitting the model.
- method = "anova": Specifies that the model is for regression.
- rpart.plot(): Plots the decision tree.
R
install.packages("rpart")
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
rpart.model <- rpart(disp ~ ., data = train.data, method = "anova")
rpart.plot(rpart.model)
Output:
[Image: decision tree predicting disp from the principal components]
In this article, we applied PCA to the mtcars dataset, visualized the principal components and used them to predict car displacement with a decision tree. This helped us to simplify the data and reveal patterns that would have been difficult to see in the original high-dimensional space.