Implementing PCA in Python with scikit-learn
Last Updated: 18 May, 2025
Principal Component Analysis (PCA) is a dimensionality reduction technique. It transforms high-dimensional data into a smaller number of dimensions, called principal components, while keeping as much of the variation in the data as possible. In this article, we will implement PCA in Python using scikit-learn. Here are the steps:
Step 1: Import necessary libraries
We import the libraries we need: NumPy, pandas, Matplotlib, seaborn and scikit-learn.
Python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
Step 2: Load the Data
We will use the breast cancer dataset. It has 569 data items with 30 input attributes and two output classes: benign and malignant. The code below reads the dataset file, assumed here to be saved as data.csv, and displays the first 5 rows.
Python
df = pd.read_csv('data.csv')
df.head()
Output:
[Image: Dataset]
Step 3: Data Cleaning and Preprocessing
This step drops the unnecessary columns id and Unnamed: 32, then converts the diagnosis column to numbers: Malignant (M) becomes 1 and Benign (B) becomes 0.
Python
df.drop(['id', 'Unnamed: 32'], axis=1, inplace=True)
df['diagnosis'] = df['diagnosis'].map({'M': 1, 'B': 0})
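One detail worth checking after this step: Series.map returns NaN for any value that is missing from the mapping dictionary, so an unexpected label would silently become a missing value. Below is a minimal sketch of a sanity check on a small made-up DataFrame; the toy values and the 'X' label are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the diagnosis column; 'X' is a deliberately invalid label.
toy = pd.DataFrame({'diagnosis': ['M', 'B', 'B', 'X']})

encoded = toy['diagnosis'].map({'M': 1, 'B': 0})

# Labels missing from the mapping become NaN instead of raising an error.
n_unmapped = int(encoded.isna().sum())
print("Unmapped labels:", n_unmapped)

# Class counts for the rows that were mapped successfully (NaN is excluded).
class_counts = encoded.value_counts().to_dict()
print(class_counts)
```

Running the same isna().sum() check on df['diagnosis'] after the mapping would confirm that every row received a 0 or 1.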
Step 4: Separate Features and Target
Here we separate the data: X contains the input features (30 columns) and y contains the target labels (0 or 1).
Python
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']
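If data.csv is not at hand, scikit-learn bundles the Breast Cancer Wisconsin (Diagnostic) dataset, which has the same 569 rows and 30 features described above. The sketch below is an alternative way to obtain the features and labels, not the article's loading code, and it assumes the bundled copy matches data.csv (the column names differ). The bundled target encodes malignant as 0 and benign as 1, so it is flipped here to match the mapping from Step 3. The names X_alt and y_alt are illustrative.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True returns pandas objects instead of NumPy arrays.
data = load_breast_cancer(as_frame=True)

X_alt = data.data          # 569 rows x 30 feature columns

# Bundled target: 0 = malignant, 1 = benign. Flip it so malignant = 1.
y_alt = 1 - data.target

print(X_alt.shape)
print(y_alt.value_counts())
```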
Step 5: Standardize the Data
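StandardScaler transforms each feature to have mean 0 and standard deviation 1. PCA is sensitive to feature scale, so without this step features with large numeric ranges would dominate the components; standardizing lets PCA treat all features equally.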
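To make the transformation concrete, here is a small sketch on a made-up array (the values are assumptions for illustration). It computes the z-score by hand and compares it with StandardScaler, which divides by the population standard deviation (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: two features on very different scales.
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 700.0]])

# Manual z-score: subtract the column mean, divide by the column std.
# np.std defaults to ddof=0, the same convention StandardScaler uses.
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

The same transformation applied to the real dataset is shown next.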
Python
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled[:2])
Output:
[Image: Standard Scaler]
Step 6: Apply PCA Algorithm
This step reduces the data to 2 principal components. PCA finds linear combinations of the original features that explain the most variation in the data.
Python
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(X_pca[:2])
Output:
[[ 9.19283683 1.94858307]
[ 2.3878018 -3.76817174]]
We have reduced 30 features to 2 components. Each row now has 2 values (PC1, PC2) instead of 30, and these components capture the directions of greatest variation in the original data.
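For readers curious about what fit_transform computes, the sketch below reproduces the projection with a plain singular value decomposition in NumPy and compares it with scikit-learn. It uses synthetic random data rather than the breast cancer dataset, and svd_solver="full" is set explicitly so that scikit-learn uses an exact decomposition; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 100 samples, 5 correlated features (illustrative only).
data = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))

# 1. Center each feature; PCA always works on centered data.
centered = data - data.mean(axis=0)

# 2. SVD of the centered matrix; the rows of Vt are the principal directions.
U, S, Vt = np.linalg.svd(centered, full_matrices=False)

# 3. Project the centered data onto the first two directions.
manual_scores = centered @ Vt[:2].T

sklearn_scores = PCA(n_components=2, svd_solver="full").fit_transform(data)

# The sign of each component is arbitrary, so compare absolute values.
same = np.allclose(np.abs(manual_scores), np.abs(sklearn_scores))
print(same)
```

The sign ambiguity is also why the PC1 and PC2 values of a given row may appear with flipped signs across libraries or versions without changing the analysis.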
Step 7: Explained Variance
The explained variance ratio tells us what fraction of the total variance each principal component captures.
Python
print("Explained variance:", pca.explained_variance_ratio_)
print("Cumulative:", np.cumsum(pca.explained_variance_ratio_))
Output:
Explained variance: [0.44272026 0.18971182]
Cumulative: [0.44272026 0.63243208]
PC1 explains about 44% of the variance and PC2 about 19%. Combined, these 2 components explain about 63% of the total variation in the data.
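The cumulative ratio can also guide the choice of how many components to keep. The sketch below is one possible approach: it finds the smallest number of components that reaches a 95% threshold, then lets PCA pick the number itself by passing a float between 0 and 1 as n_components. The 95% threshold is an arbitrary choice for illustration, and the code assumes scikit-learn's bundled copy of the dataset matches data.csv.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Bundled copy of the breast cancer data (assumed equivalent to data.csv).
features = load_breast_cancer().data
features_scaled = StandardScaler().fit_transform(features)

# Fit with all components to inspect the full variance profile.
pca_full = PCA().fit(features_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.95
n_needed = int(np.searchsorted(cumulative, threshold) + 1)
print("Components needed:", n_needed)

# A float in (0, 1) tells PCA to keep enough components for that share.
pca_95 = PCA(n_components=threshold).fit(features_scaled)
print("Chosen by PCA:", pca_95.n_components_)
```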
Step 8: Visualization Before vs After PCA
The first plot shows the original scaled data using only its first 2 features, and the second plot shows the reduced data using PCA's 2 components. Colors represent the diagnosis: benign (0) or malignant (1).
Python
plt.figure(figsize=(8,6))
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y, cmap='coolwarm', edgecolor='k')
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Original Data (First Two Features)")
plt.colorbar(label="Diagnosis")
plt.show()
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='coolwarm', edgecolor='k')
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA Transformed Data")
plt.colorbar(label="Diagnosis")
plt.show()
Output:
[Image: Original Data]
[Image: PCA Transformed Data]
Step 9: Train a Model on PCA Data
This step splits the PCA data into training and test sets, trains a Logistic Regression model to classify tumors, then predicts on the test set and evaluates the model.
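One caveat about this workflow: the scaler in Step 5 and the PCA in Step 6 were fitted on all rows before the split, so the test rows influenced the transformation. A possible variation, sketched here under the assumption that scikit-learn's bundled copy of the dataset matches data.csv, splits the raw features first and wraps scaling, PCA and the classifier in a pipeline so that both transformers are fitted on the training rows only. The variable names (features, labels, clf) are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Bundled copy of the data (assumed equivalent to data.csv).
# Its target uses 0 = malignant, 1 = benign; flip it to match Step 3.
raw = load_breast_cancer()
features, labels = raw.data, 1 - raw.target

# Split the raw features first, so the test rows stay unseen.
F_train, F_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# Scaler and PCA are fitted on the training rows only, inside the pipeline.
clf = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())
clf.fit(F_train, l_train)

accuracy = clf.score(F_test, l_test)
print(f"Test accuracy: {accuracy:.3f}")
```

The original code for this step, which trains on the already-transformed X_pca, follows.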
Python
X_train, X_test, y_train, y_test = train_test_split(X_pca, y, test_size=0.2, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Output:
[Image: Classification report]
Step 10: Confusion Matrix
The confusion matrix shows how many predictions were correct or incorrect for each class, which helps to visualize true vs. false predictions.
Python
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Benign', 'Malignant'],
            yticklabels=['Benign', 'Malignant'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
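Output:
[Image: Confusion matrix]
Step 11: Reconstruction Loss
PCA reduces the size of the data, but some information is lost. This step converts the reduced data back to its original shape and measures how much was lost in the reduction, as the mean squared error between the scaled data and its reconstruction.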
Python
X_reconstructed = pca.inverse_transform(X_pca)
reconstruction_loss = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"Reconstruction Loss: {reconstruction_loss:.4f}")
Output:
Reconstruction Loss: 0.3676
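This value shows how much information was lost during PCA. Because the data is standardized, the loss of 0.3676 equals the share of variance the two components do not capture (1 - 0.6324), so PCA with 2 components keeps most, but not all, of the structure in the data.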
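The link between reconstruction loss and explained variance can be checked for several component counts. The sketch below assumes scikit-learn's bundled copy of the dataset matches data.csv; the component counts (2, 5, 10) are arbitrary, and svd_solver="full" is set so that the decomposition is exact.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Bundled copy of the data (assumed equivalent to data.csv).
Z = StandardScaler().fit_transform(load_breast_cancer().data)

results = []
for k in (2, 5, 10):
    pca_k = PCA(n_components=k, svd_solver="full").fit(Z)
    Z_back = pca_k.inverse_transform(pca_k.transform(Z))
    loss = np.mean((Z - Z_back) ** 2)
    # On standardized data every feature has variance 1, so the loss
    # should equal the fraction of variance the k components leave out.
    unexplained = 1 - pca_k.explained_variance_ratio_.sum()
    results.append((k, loss, unexplained))
    print(f"k={k:2d}  loss={loss:.4f}  unexplained={unexplained:.4f}")
```

Keeping more components lowers the loss, at the cost of a less compact representation.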