
Clustering and dimensionality reduction techniques

(PCA, t-SNE, K-means)

Course: Machine Learning in Healthcare


Instructor: Md. Maruf Hossain
Session: 2022-23

Department of Biomedical Engineering


Islamic University

October 30, 2024


Outline

1. Introduction to clustering and dimensionality reduction

2. Types of Clustering

3. Dimensionality Reduction

4. Types of Dimensionality Reduction

5. Principal Component Analysis

6. How does PCA work?

7. Steps of PCA

8. Advantages and Disadvantages of PCA



Introduction to clustering and dimensionality reduction


In data science and machine learning, clustering and dimensionality reduction are
techniques used to simplify, organize, and interpret complex datasets.
Both approaches aim to make data analysis easier by revealing patterns and
structures within data, often used in exploratory data analysis and preprocessing for
machine learning.
What is Clustering?
Clustering is an unsupervised learning technique that groups data points into clusters
based on similarity.
It’s particularly useful for finding patterns or structures in unlabeled data.
In clustering, clusters are groups in which data points are more similar to each other
than to points in other groups.
Applications include image segmentation and anomaly detection.
e.g., K-means clustering
Types of Clustering
K-means clustering
K-means is a simple and fast clustering algorithm that partitions data into K clusters.
Here’s how it works (a short scikit-learn sketch is given at the end of this slide):
Initialization: Choose K initial cluster centroids (center points) randomly.
Assignment: Assign each data point to the nearest centroid based on distance (often Euclidean).
Update: Recalculate the centroids of the clusters by finding the mean of points within each cluster.
Repeat: Repeat the assignment and update steps until the centroids no longer change significantly, or a maximum number of iterations is reached.
Advantages: Simple and computationally efficient.
Works well with large datasets.
Limitations:
Sensitive to the initial choice of centroids.
Assumes clusters are spherical and equally sized, which may not be true for complex data
structures.
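As an illustration (not part of the original slides; the six 2-D points, K = 2, and the random_state value are made-up choices), a minimal K-means run with scikit-learn could look like this:

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two rough groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])

# n_clusters is K; multiple n_init restarts reduce sensitivity to the initial centroids
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)
print("Centroids:", kmeans.cluster_centers_)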
Dimensionality Reduction
Dimensionality reduction techniques reduce the number of features (dimensions) in a
dataset while preserving as much information as possible.

Dimensionality reduction can help with:


- Reducing computational cost by lowering the number of features.
- Visualizing high-dimensional data in 2D or 3D.
- Reducing noise and improving model performance.

Figure 1 – Dimensionality Reduction.
Types of Dimensionality Reduction
There are two popular dimensionality reduction techniques.
1. Principal Component Analysis (PCA), and
2. t-Distributed Stochastic Neighbor Embedding (t-SNE).
t-Distributed Stochastic Neighbor Embedding (t-SNE)
t-SNE is a non-linear dimensionality reduction technique that’s especially useful for
visualizing high-dimensional data in 2D or 3D.
Calculate Pairwise Similarities: Calculate probabilities of similarity between points
in high-dimensional space.
Map to Low Dimensions: Place points in 2D or 3D so that these similarities are
preserved as closely as possible.
Visualization: The result is a 2D or 3D scatter plot showing clusters or patterns in
the data.
Advantages: Great for visualizing complex, non-linear relationships in data; it captures
local structure especially well.
Limitations: Computationally intensive, and not ideal for general-purpose dimensionality
reduction beyond visualization.
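As a hedged sketch (not from the slides; the Iris dataset and the perplexity value are illustrative assumptions), a 2-D t-SNE visualization with scikit-learn could look like:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

X, y = load_iris(return_X_y=True)

# perplexity roughly sets the effective neighborhood size considered per point
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.title("t-SNE embedding of the Iris dataset")
plt.show()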
Principal Component Analysis
Principal component analysis (PCA) is a dimensionality reduction and machine
learning method used to simplify a large data set into a smaller one.
It is often used to reduce the dimensionality of large data sets by transforming a large
set of variables into a smaller one that still contains most of the information in the
original set.
What Are Principal Components? Principal components are new variables that are
constructed as linear combinations or mixtures of the initial variables.

Figure 2 – Principal components.


Principal Component Analysis


How PCA Constructs the Principal Components
There are as many principal components as there are variables in the data. They are
constructed so that the first principal component accounts for the largest possible
variance in the data set, and each subsequent component captures the largest remaining
variance while being uncorrelated with the previous ones.

How does PCA work?

Start with dataset (m observations, n features)

Calculate covariance matrix

Find eigenvectors and eigenvalues

Identify principal components (directions of maximum variance)

Sort by eigenvalues and select significant principal components


Steps of PCA

1. Standardization: Scale the features to have a mean of 0 and a standard deviation of 1 (optional but recommended).
2. Compute Covariance Matrix: This captures how the features vary together.
3. Compute Eigenvectors and Eigenvalues: Eigenvectors represent the directions of
greatest variance, and eigenvalues tell you how much variance each eigenvector explains.
4. Select Principal Components: Choose the top k eigenvectors based on their
corresponding eigenvalues to form the new feature subspace.
5. Projection: Project the original data onto the new feature subspace formed by the
selected eigenvectors. Or project the original data points onto the new principal
component axes.

Step 1: Standardization
First, we need to standardize our dataset to ensure that each variable has a mean of 0
and a standard deviation of 1.
Z = (X − μ) / σ
Here,
• μ is the mean of the independent features, μ = {μ1, μ2, ..., μm}
• σ is the standard deviation of the independent features, σ = {σ1, σ2, ..., σm}
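A small NumPy sketch of this step (illustrative only; it reuses the 5 × 2 sample dataset from the implementation slide at the end of the deck):

import numpy as np

# Example 5 x 2 dataset (5 observations, 2 features)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# Z = (X - mu) / sigma, applied feature-wise (column-wise)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0))  # approximately 0 for each feature
print(Z.std(axis=0))   # 1 for each feature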

Step 2: Covariance Matrix Computation


Covariance measures the strength of joint variability between two or more variables,
indicating how much they change in relation to each other. To find the covariance we
can use the formula:
cov(x1, x2) = Σᵢ (x1i − x̄1)(x2i − x̄2) / (n − 1), where the sum runs over i = 1, ..., n
The value of covariance can be positive, negative, or zero.
• Positive: As x1 increases, x2 also increases.
• Negative: As x1 increases, x2 decreases.
• Zero: No direct relation.
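Continuing the illustrative example (self-contained here; the same sample data as above, not from the slides), the covariance matrix can be computed as follows. Note that np.cov uses the n − 1 divisor from the formula above:

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardized data from Step 1

# rowvar=False treats each column as one variable (feature)
cov_matrix = np.cov(Z, rowvar=False)
print(cov_matrix)  # 2 x 2 symmetric matrix; off-diagonal entries are the covariances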
Steps 3 & 4 - Compute Eigenvectors and Eigenvalues of the Covariance Matrix to Identify
Principal Components:
• Let A be a square matrix, ν a vector, and λ a scalar that satisfies
Aν = λν
then λ is called the eigenvalue associated with the eigenvector ν of A.
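A sketch of this computation in NumPy (illustrative continuation of the same example; np.linalg.eigh is used because the covariance matrix is symmetric):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov_matrix = np.cov(Z, rowvar=False)

# Eigendecomposition of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# Sort in descending order of eigenvalue (variance explained)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

print("Eigenvalues:", eigenvalues)
print("Eigenvectors (as columns):", eigenvectors)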

Step 5 - Projection: Transform the samples onto the new subspace:


• In the last step, we use the projection matrix W that we just computed (its columns
are the selected eigenvectors) to transform our samples onto the new subspace via the
equation y = Wᵀ × x, where Wᵀ is the transpose of the matrix W.
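A sketch of the projection step, continuing the same illustrative example (choosing k = 1 here is arbitrary, purely for demonstration):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
cov_matrix = np.cov(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]

# Keep the top k eigenvectors as the projection matrix W (n_features x k)
k = 1
W = eigenvectors[:, :k]

# Project every sample at once: Y = Z W (per sample, y = W^T x)
Y = Z @ W
print(Y)  # samples expressed in the new k-dimensional subspace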
Advantages and Disadvantages of PCA


Advantages of PCA
1. Dimensionality Reduction: Reduces dataset complexity by lowering the number of
variables, improving analysis and performance.
2. Feature Selection: Helps in identifying the most important features, enhancing
machine learning model efficiency.
3. Data Visualization: Projects high-dimensional data into 2D or 3D for easier
interpretation.
4. Multicollinearity: Addresses issues of correlated features by creating uncorrelated
variables, useful for regression.
5. Noise Reduction: Improves data quality by removing components with low variance,
reducing noise.
6. Data Compression: Lowers storage needs by representing data with fewer principal
components.
7. Outlier Detection: Identifies outliers as points that deviate significantly in the
principal component space.

Disadvantages of PCA
1. Interpretation: Principal components are linear combinations and can be hard to
interpret.
2. Data Scaling: Sensitive to data scaling; requires proper normalization for accurate
results.
3. Information Loss: May lead to information loss if too few components are retained.
4. Non-linear Relationships: Assumes linearity, limiting its effectiveness for non-linear
data.
5. Computational Complexity: Computationally expensive for large datasets with
many variables.
6. Overfitting: Risk of overfitting if too many components are retained or if applied to
small datasets.




Implementation of PCA in ML

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Sample dataset
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])

# Step 1: Standardize the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

# Print results
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Principal components:", pca.components_)
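Running this prints the fraction of total variance captured by each principal component (with two components kept on two standardized features, the ratios sum to 1) and the component directions as the rows of pca.components_.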

