Week 8 DS Practical

The document provides an overview of unsupervised learning techniques, focusing on clustering and dimensionality reduction methods. It details popular clustering algorithms such as K-Means, Hierarchical Clustering, DBSCAN, and Gaussian Mixture Models, along with example Python code for implementation. Additionally, it covers dimensionality reduction techniques like PCA, t-SNE, UMAP, and Autoencoders, including their applications and example code.


Unsupervised Learning - Clustering & Dimensionality Reduction

Introduction to Clustering

Clustering is the process of grouping similar data points based on their features. It is commonly used for customer segmentation, anomaly detection, and data organization.

Popular Clustering Algorithms

1. K-Means

○ Partitions data into K clusters.

○ Each cluster has a centroid, and data points are assigned to the nearest
centroid.

○ Works well with spherical clusters but requires specifying the number of
clusters in advance.

Example Python code for K-Means

import numpy as np

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

from sklearn.datasets import make_blobs

# Generate sample data

X, y = make_blobs(n_samples=300, centers=4, random_state=42)

# Apply K-Means

kmeans = KMeans(n_clusters=4, random_state=42)

y_kmeans = kmeans.fit_predict(X)

# Plot the clusters

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', alpha=0.7)


plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', label='Centroids', marker='X')

plt.legend()

plt.title("K-Means Clustering")

plt.show()

2. Hierarchical Clustering

○ Builds a hierarchy of clusters using either:

■ Agglomerative (Bottom-up): Starts with individual points and merges them into clusters.

■ Divisive (Top-down): Starts with all data points in one cluster and
splits them iteratively.

○ Does not require the number of clusters to be predefined.

Example code for Hierarchical Clustering

from sklearn.cluster import AgglomerativeClustering

import scipy.cluster.hierarchy as sch

# Create dendrogram

plt.figure(figsize=(8, 5))

dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))

plt.title("Dendrogram")

plt.show()

# Apply Agglomerative Clustering

# 'metric' replaces the deprecated 'affinity' argument in recent scikit-learn versions
hc = AgglomerativeClustering(n_clusters=4, metric='euclidean', linkage='ward')

y_hc = hc.fit_predict(X)

# Plot clusters

plt.scatter(X[:, 0], X[:, 1], c=y_hc, cmap='rainbow', alpha=0.7)


plt.title("Hierarchical Clustering")

plt.show()
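Because hierarchical clustering does not need a predefined number of clusters, you can alternatively cut the tree at a distance threshold and let the algorithm decide how many clusters emerge. A minimal sketch, assuming the same X as above (the threshold value is illustrative, not tuned):

# Cut the hierarchy at a chosen linkage distance instead of fixing n_clusters
hc_auto = AgglomerativeClustering(n_clusters=None, distance_threshold=10, linkage='ward')
y_hc_auto = hc_auto.fit_predict(X)
print("Clusters found:", hc_auto.n_clusters_)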

3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

○ Groups data points based on density.

○ Identifies clusters of arbitrary shapes and handles noise well.

○ Does not require the number of clusters to be specified in advance, but its results depend on the eps (neighborhood radius) and min_samples (minimum points per dense region) parameters.

Example code for DBSCAN

from sklearn.cluster import DBSCAN

# Apply DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5)

y_dbscan = dbscan.fit_predict(X)

# Plot clusters

plt.scatter(X[:, 0], X[:, 1], c=y_dbscan, cmap='plasma', alpha=0.7)

plt.title("DBSCAN Clustering")

plt.show()
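DBSCAN labels points it cannot assign to any dense region as -1 (noise). A short sketch, assuming y_dbscan from the code above, for checking how many clusters and noise points were found:

# Count clusters (excluding the noise label -1) and noise points
n_noise = np.sum(y_dbscan == -1)
n_clusters = len(set(y_dbscan)) - (1 if -1 in y_dbscan else 0)
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")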

4. Gaussian Mixture Model (GMM)

○ Assumes that data is generated from multiple Gaussian distributions.

○ Uses probabilistic assignment of points to clusters.

○ More flexible than K-Means but computationally intensive.

Example code for GMM

from sklearn.mixture import GaussianMixture

# Apply GMM

gmm = GaussianMixture(n_components=4, random_state=42)


y_gmm = gmm.fit_predict(X)

# Plot clusters

plt.scatter(X[:, 0], X[:, 1], c=y_gmm, cmap='coolwarm', alpha=0.7)

plt.title("Gaussian Mixture Model Clustering")

plt.show()
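Since GMM assigns points to clusters probabilistically, you can also inspect the soft memberships rather than only the hard labels. A minimal sketch using the fitted gmm from above:

# Probability of each point belonging to each of the 4 Gaussian components
probs = gmm.predict_proba(X)
print(probs[:5].round(3))  # first five points, one column per component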

Dimensionality Reduction
Dimensionality reduction techniques help reduce the number of features in a
dataset while retaining as much important information as possible. This is useful for
visualization, noise reduction, and improving model performance.

Popular Dimensionality Reduction Techniques

1. Principal Component Analysis (PCA)

○ Identifies new feature axes (principal components) that capture the most
variance.

○ Reduces redundancy and computational complexity.

○ Commonly used for feature extraction and visualization.

Example Code

import numpy as np

import matplotlib.pyplot as plt

from sklearn.decomposition import PCA

from sklearn.datasets import load_digits

# Load dataset (Handwritten digits)

digits = load_digits()

X = digits.data

y = digits.target

# Apply PCA (Reduce to 2 dimensions)


pca = PCA(n_components=2)

X_pca = pca.fit_transform(X)

# Scatter plot of PCA results

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7)

plt.colorbar(label="Digit Label")

plt.title("PCA on Handwritten Digits")

plt.show()

2. t-Distributed Stochastic Neighbor Embedding (t-SNE)

○ Non-linear technique that preserves local relationships.

○ Excellent for visualizing high-dimensional data in 2D or 3D.

○ Computationally expensive and not ideal for large datasets.

Example Code

from sklearn.manifold import TSNE

# Apply t-SNE (Reduce to 2D)

tsne = TSNE(n_components=2, random_state=42)

X_tsne = tsne.fit_transform(X)

# Scatter plot of t-SNE results

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='Spectral', alpha=0.7)

plt.colorbar(label="Digit Label")

plt.title("t-SNE on Handwritten Digits")

plt.show()
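Because t-SNE is computationally expensive, a common workaround (shown here as a sketch, not part of the original example) is to pre-reduce the data with PCA before running t-SNE; the number of intermediate components (30) is an illustrative choice:

from sklearn.decomposition import PCA

# Pre-reduce the 64 features to 30 with PCA, then embed the smaller matrix with t-SNE
X_pca30 = PCA(n_components=30, random_state=42).fit_transform(X)
X_tsne_fast = TSNE(n_components=2, random_state=42).fit_transform(X_pca30)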
3. Uniform Manifold Approximation and Projection (UMAP)

○ Similar to t-SNE but faster and better at preserving both global and local
structures.

○ Works well for high-dimensional data visualization.

Example Code

import umap  # provided by the umap-learn package (pip install umap-learn)

# Apply UMAP (Reduce to 2D)

umap_reducer = umap.UMAP(n_components=2, random_state=42)

X_umap = umap_reducer.fit_transform(X)

# Scatter plot of UMAP results

plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='coolwarm', alpha=0.7)

plt.colorbar(label="Digit Label")

plt.title("UMAP on Handwritten Digits")

plt.show()
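UMAP's trade-off between local and global structure is controlled mainly by n_neighbors and min_dist. A brief sketch of adjusting them (the values shown are illustrative, not recommendations):

# Larger n_neighbors emphasizes global structure; smaller min_dist packs similar points tighter
umap_tuned = umap.UMAP(n_components=2, n_neighbors=30, min_dist=0.1, random_state=42)
X_umap_tuned = umap_tuned.fit_transform(X)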

4. Autoencoders (Deep Learning-Based Approach)

○ Neural networks that learn to encode and decode data.

○ Can capture non-linear relationships in data.

○ Useful for anomaly detection and representation learning.

Example Code

import tensorflow as tf

from tensorflow import keras

# Define Autoencoder Model

# Scale pixel values (0-16) to [0, 1] so the sigmoid output layer can reconstruct them
X_norm = X / 16.0

input_dim = X_norm.shape[1]

encoding_dim = 32 # Reduced dimension


# Encoder

input_layer = keras.layers.Input(shape=(input_dim,))

encoded = keras.layers.Dense(encoding_dim, activation='relu')(input_layer)

# Decoder

decoded = keras.layers.Dense(input_dim, activation='sigmoid')(encoded)

# Compile Autoencoder

autoencoder = keras.models.Model(input_layer, decoded)

autoencoder.compile(optimizer='adam', loss='mse')

# Train Autoencoder

autoencoder.fit(X_norm, X_norm, epochs=20, batch_size=256, shuffle=True, verbose=1)

# Get encoded representation

encoder = keras.models.Model(input_layer, encoded)

X_autoencoded = encoder.predict(X_norm)

# Scatter plot of Autoencoder results (first 2 dimensions)

plt.scatter(X_autoencoded[:, 0], X_autoencoded[:, 1], c=y, cmap='plasma', alpha=0.7)

plt.colorbar(label="Digit Label")

plt.title("Autoencoder Representation of Digits")

plt.show()
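For the anomaly-detection use case mentioned above, the usual approach is to flag the samples the autoencoder reconstructs worst. A minimal sketch using the trained autoencoder and the scaled X_norm (the 99th-percentile cutoff is an illustrative assumption):

# Reconstruction error per sample; unusually high error suggests an anomaly
X_reconstructed = autoencoder.predict(X_norm)
reconstruction_error = np.mean((X_norm - X_reconstructed) ** 2, axis=1)
threshold = np.percentile(reconstruction_error, 99)  # illustrative cutoff
anomalies = np.where(reconstruction_error > threshold)[0]
print(f"Flagged {len(anomalies)} potential anomalies")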

Implement K-Means

1. Import Required Libraries

import numpy as np

import matplotlib.pyplot as plt


from sklearn.cluster import KMeans

from sklearn.datasets import make_blobs

2. Generate Sample Data

We will create a synthetic dataset with four clusters using make_blobs().

# Generate synthetic data (300 samples, 4 clusters)

X, y = make_blobs(n_samples=300, centers=4, random_state=42)

3. Apply K-Means Clustering

We use KMeans from Scikit-Learn.

# Apply K-Means clustering

kmeans = KMeans(n_clusters=4, random_state=42)

y_kmeans = kmeans.fit_predict(X)

4. Visualize the Clusters

We plot the clusters and their centroids.

# Plot clusters

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', alpha=0.7)

# Plot centroids

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', marker='X', label="Centroids")

plt.legend()

plt.title("K-Means Clustering")

plt.show()

5. Finding the Optimal Number of Clusters (Elbow Method)


To determine the best value of K, we use the Elbow Method.

wcss = [] # Within-cluster sum of squares

for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot the elbow graph

plt.plot(range(1, 11), wcss, marker='o', linestyle='--')

plt.xlabel("Number of Clusters (K)")

plt.ylabel("WCSS (Within-Cluster Sum of Squares)")

plt.title("Elbow Method for Optimal K")

plt.show()

Implementing Hierarchical Clustering

1. Import Required Libraries

import numpy as np

import matplotlib.pyplot as plt

import scipy.cluster.hierarchy as sch

from sklearn.cluster import AgglomerativeClustering

from sklearn.datasets import make_blobs

2. Generate Sample Data

We create a dataset with four clusters.

# Generate synthetic data

X, y = make_blobs(n_samples=300, centers=4, random_state=42)


3. Create a Dendrogram

A dendrogram helps determine the optimal number of clusters.

# Plot the dendrogram

plt.figure(figsize=(8, 5))

dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))

plt.title("Dendrogram")

plt.xlabel("Data Points")

plt.ylabel("Euclidean Distance")

plt.show()

4. Apply Agglomerative Clustering

We use 4 clusters based on the dendrogram.

# Apply Agglomerative Clustering

hc = AgglomerativeClustering(n_clusters=4, metric='euclidean', linkage='ward')

y_hc = hc.fit_predict(X)

5. Visualize the Clusters

# Plot clusters

plt.scatter(X[:, 0], X[:, 1], c=y_hc, cmap='rainbow', alpha=0.7)

plt.title("Hierarchical Clustering")

plt.show()

Perform PCA (Principal Component Analysis) for dimensionality reduction:

1. Import Required Libraries

import numpy as np

import matplotlib.pyplot as plt


from sklearn.decomposition import PCA

from sklearn.datasets import load_digits

from sklearn.preprocessing import StandardScaler

2. Load the Dataset

We use the handwritten digits dataset (each image has 64 features = 8×8 pixels).

# Load the digits dataset

digits = load_digits()

X = digits.data # Features (64-dimensional)

y = digits.target # Labels (digits 0-9)

3. Standardize the Data

PCA is sensitive to feature scaling, so we standardize the dataset.

# Standardize the features

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

4. Apply PCA (Reduce to 2 Dimensions)

We reduce the 64-dimensional dataset to 2 dimensions for visualization.

# Apply PCA

pca = PCA(n_components=2)

X_pca = pca.fit_transform(X_scaled)

5. Visualize the Reduced Data

# Scatter plot of PCA results


plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7)

plt.colorbar(label="Digit Label")

plt.title("PCA: Handwritten Digits (2D Projection)")

plt.xlabel("Principal Component 1")

plt.ylabel("Principal Component 2")

plt.show()

6. Explained Variance (How Much Information is Retained?)

To check how much information is retained in lower dimensions:

# Explained variance ratio

explained_variance = pca.explained_variance_ratio_

print(f"Explained Variance (PC1 + PC2): {sum(explained_variance) * 100:.2f}


%")

7. Finding the Optimal Number of Components

If we want to retain 95% variance, we determine the optimal number of components.

# Compute cumulative explained variance

pca_full = PCA().fit(X_scaled)

cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Plot cumulative variance

plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--')

plt.xlabel("Number of Principal Components")

plt.ylabel("Cumulative Explained Variance")

plt.title("PCA: Choosing the Optimal Number of Components")

plt.axhline(y=0.95, color='r', linestyle='--') # 95% threshold

plt.show()

To find the smallest number of components that retains at least 95% of the variance:

optimal_components = np.argmax(cumulative_variance >= 0.95) + 1

print(f"Optimal Number of Components for 95% Variance:


{optimal_components}")
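As a shortcut (an aside, not part of the walkthrough above), scikit-learn also accepts a variance fraction directly as n_components, in which case PCA chooses the number of components for you:

# Keep enough components to explain 95% of the variance automatically
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X_scaled)
print("Components selected:", pca_95.n_components_)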
