Week 8 DS Practical

The document provides an overview of unsupervised learning techniques, focusing on clustering and dimensionality reduction methods. It details popular clustering algorithms such as K-Means, Hierarchical Clustering, DBSCAN, and Gaussian Mixture Models, along with example Python code for implementation. Additionally, it covers dimensionality reduction techniques like PCA, t-SNE, UMAP, and Autoencoders, including their applications and example code.


Unsupervised Learning - Clustering & Dimensionality Reduction

Introduction to Clustering

Clustering is the process of grouping similar data points based on their features. It is commonly used for customer segmentation, anomaly detection, and data organization.

Popular Clustering Algorithms

1. K-Means

○ Partitions data into K clusters.

○ Each cluster has a centroid, and data points are assigned to the nearest
centroid.

○ Works well with spherical clusters but requires specifying the number of
clusters in advance.

Example Python code for K-Means

import numpy as np

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

from sklearn.datasets import make_blobs

# Generate sample data

X, y = make_blobs(n_samples=300, centers=4, random_state=42)

# Apply K-Means

kmeans = KMeans(n_clusters=4, random_state=42)

y_kmeans = kmeans.fit_predict(X)

# Plot the clusters

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', alpha=0.7)


plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', label='Centroids', marker='X')

plt.legend()

plt.title("K-Means Clustering")

plt.show()

2. Hierarchical Clustering

○ Builds a hierarchy of clusters using either:

■ Agglomerative (Bottom-up): Starts with individual points and merges them into clusters.

■ Divisive (Top-down): Starts with all data points in one cluster and
splits them iteratively.

○ Does not require the number of clusters to be predefined.

Example code for Hierarchical Clustering

from sklearn.cluster import AgglomerativeClustering

import scipy.cluster.hierarchy as sch

# Create dendrogram

plt.figure(figsize=(8, 5))

dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))

plt.title("Dendrogram")

plt.show()

# Apply Agglomerative Clustering

# 'metric' replaces the deprecated 'affinity' argument in recent scikit-learn versions
hc = AgglomerativeClustering(n_clusters=4, metric='euclidean', linkage='ward')

y_hc = hc.fit_predict(X)

# Plot clusters

plt.scatter(X[:, 0], X[:, 1], c=y_hc, cmap='rainbow', alpha=0.7)


plt.title("Hierarchical Clustering")

plt.show()
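Because hierarchical clustering does not need a predefined number of clusters, you can alternatively cut the tree at a distance threshold and let the algorithm decide how many clusters emerge. A minimal sketch, assuming the same X as above (the threshold value is illustrative, not tuned):

# Cut the hierarchy at a chosen linkage distance instead of fixing n_clusters
hc_auto = AgglomerativeClustering(n_clusters=None, distance_threshold=10, linkage='ward')
y_hc_auto = hc_auto.fit_predict(X)
print("Clusters found:", hc_auto.n_clusters_)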

3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

○ Groups data points based on density.

○ Identifies clusters of arbitrary shapes and handles noise well.

○ Does not require the number of clusters to be specified in advance, but its results depend on the eps (neighborhood radius) and min_samples (minimum points per dense region) parameters.

Example code for DBSCAN

from sklearn.cluster import DBSCAN

# Apply DBSCAN

dbscan = DBSCAN(eps=0.5, min_samples=5)

y_dbscan = dbscan.fit_predict(X)

# Plot clusters

plt.scatter(X[:, 0], X[:, 1], c=y_dbscan, cmap='plasma', alpha=0.7)

plt.title("DBSCAN Clustering")

plt.show()
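DBSCAN labels points it cannot assign to any dense region as -1 (noise). A short sketch, assuming y_dbscan from the code above, for checking how many clusters and noise points were found:

# Count clusters (excluding the noise label -1) and noise points
n_noise = np.sum(y_dbscan == -1)
n_clusters = len(set(y_dbscan)) - (1 if -1 in y_dbscan else 0)
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")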

4. Gaussian Mixture Model (GMM)

○ Assumes that data is generated from multiple Gaussian distributions.

○ Uses probabilistic assignment of points to clusters.

○ More flexible than K-Means but computationally intensive.

Example code for GMM

from sklearn.mixture import GaussianMixture

# Apply GMM

gmm = GaussianMixture(n_components=4, random_state=42)


y_gmm = gmm.fit_predict(X)

# Plot clusters

plt.scatter(X[:, 0], X[:, 1], c=y_gmm, cmap='coolwarm', alpha=0.7)

plt.title("Gaussian Mixture Model Clustering")

plt.show()
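Since GMM assigns points to clusters probabilistically, you can also inspect the soft memberships rather than only the hard labels. A minimal sketch using the fitted gmm from above:

# Probability of each point belonging to each of the 4 Gaussian components
probs = gmm.predict_proba(X)
print(probs[:5].round(3))  # first five points, one column per component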

Dimensionality Reduction
Dimensionality reduction techniques help reduce the number of features in a
dataset while retaining as much important information as possible. This is useful for
visualization, noise reduction, and improving model performance.

Popular Dimensionality Reduction Techniques

1. Principal Component Analysis (PCA)

○ Identifies new feature axes (principal components) that capture the most
variance.

○ Reduces redundancy and computational complexity.

○ Commonly used for feature extraction and visualization.

Example Code

import numpy as np

import matplotlib.pyplot as plt

from sklearn.decomposition import PCA

from sklearn.datasets import load_digits

# Load dataset (Handwritten digits)

digits = load_digits()

X = digits.data

y = digits.target

# Apply PCA (Reduce to 2 dimensions)


pca = PCA(n_components=2)

X_pca = pca.fit_transform(X)

# Scatter plot of PCA results

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7)

plt.colorbar(label="Digit Label")

plt.title("PCA on Handwritten Digits")

plt.show()

2. t-Distributed Stochastic Neighbor Embedding (t-SNE)

○ Non-linear technique that preserves local relationships.

○ Excellent for visualizing high-dimensional data in 2D or 3D.

○ Computationally expensive and not ideal for large datasets.

Example Code

from sklearn.manifold import TSNE

# Apply t-SNE (Reduce to 2D)

tsne = TSNE(n_components=2, random_state=42)

X_tsne = tsne.fit_transform(X)

# Scatter plot of t-SNE results

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='Spectral', alpha=0.7)

plt.colorbar(label="Digit Label")

plt.title("t-SNE on Handwritten Digits")

plt.show()
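Because t-SNE is computationally expensive, a common workaround (shown here as a sketch, not part of the original example) is to pre-reduce the data with PCA before running t-SNE; the number of intermediate components (30) is an illustrative choice:

from sklearn.decomposition import PCA

# Pre-reduce the 64 features to 30 with PCA, then embed the smaller matrix with t-SNE
X_pca30 = PCA(n_components=30, random_state=42).fit_transform(X)
X_tsne_fast = TSNE(n_components=2, random_state=42).fit_transform(X_pca30)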
3. Uniform Manifold Approximation and Projection (UMAP)

○ Similar to t-SNE but faster and better at preserving both global and local
structures.

○ Works well for high-dimensional data visualization.

Example Code

import umap  # provided by the umap-learn package (pip install umap-learn)

# Apply UMAP (Reduce to 2D)

umap_reducer = umap.UMAP(n_components=2, random_state=42)

X_umap = umap_reducer.fit_transform(X)

# Scatter plot of UMAP results

plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='coolwarm', alpha=0.7)

plt.colorbar(label="Digit Label")

plt.title("UMAP on Handwritten Digits")

plt.show()
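UMAP's trade-off between local and global structure is controlled mainly by n_neighbors and min_dist. A brief sketch of adjusting them (the values shown are illustrative, not recommendations):

# Larger n_neighbors emphasizes global structure; smaller min_dist packs similar points tighter
umap_tuned = umap.UMAP(n_components=2, n_neighbors=30, min_dist=0.1, random_state=42)
X_umap_tuned = umap_tuned.fit_transform(X)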

4. Autoencoders (Deep Learning-Based Approach)

○ Neural networks that learn to encode and decode data.

○ Can capture non-linear relationships in data.

○ Useful for anomaly detection and representation learning.

Example Code

import tensorflow as tf

from tensorflow import keras

# Define Autoencoder Model

# Scale pixel values (0-16) to [0, 1] so the sigmoid output layer can reconstruct them
X_norm = X / 16.0

input_dim = X_norm.shape[1]

encoding_dim = 32 # Reduced dimension


# Encoder

input_layer = keras.layers.Input(shape=(input_dim,))

encoded = keras.layers.Dense(encoding_dim, activation='relu')(input_layer)

# Decoder

decoded = keras.layers.Dense(input_dim, activation='sigmoid')(encoded)

# Compile Autoencoder

autoencoder = keras.models.Model(input_layer, decoded)

autoencoder.compile(optimizer='adam', loss='mse')

# Train Autoencoder

autoencoder.fit(X_norm, X_norm, epochs=20, batch_size=256, shuffle=True, verbose=1)

# Get encoded representation

encoder = keras.models.Model(input_layer, encoded)

X_autoencoded = encoder.predict(X_norm)

# Scatter plot of Autoencoder results (first 2 dimensions)

plt.scatter(X_autoencoded[:, 0], X_autoencoded[:, 1], c=y, cmap='plasma', alpha=0.7)

plt.colorbar(label="Digit Label")

plt.title("Autoencoder Representation of Digits")

plt.show()
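For the anomaly-detection use case mentioned above, the usual approach is to flag the samples the autoencoder reconstructs worst. A minimal sketch using the trained autoencoder and the scaled X_norm (the 99th-percentile cutoff is an illustrative assumption):

# Reconstruction error per sample; unusually high error suggests an anomaly
X_reconstructed = autoencoder.predict(X_norm)
reconstruction_error = np.mean((X_norm - X_reconstructed) ** 2, axis=1)
threshold = np.percentile(reconstruction_error, 99)  # illustrative cutoff
anomalies = np.where(reconstruction_error > threshold)[0]
print(f"Flagged {len(anomalies)} potential anomalies")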

Implement K-Means

1. Import Required Libraries

import numpy as np

import matplotlib.pyplot as plt


from sklearn.cluster import KMeans

from sklearn.datasets import make_blobs

2. Generate Sample Data

We will create a synthetic dataset with four clusters using make_blobs().

# Generate synthetic data (300 samples, 4 clusters)

X, y = make_blobs(n_samples=300, centers=4, random_state=42)

3. Apply K-Means Clustering

We use KMeans from Scikit-Learn.

# Apply K-Means clustering

kmeans = KMeans(n_clusters=4, random_state=42)

y_kmeans = kmeans.fit_predict(X)

4. Visualize the Clusters

We plot the clusters and their centroids.

# Plot clusters

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, cmap='viridis', alpha=0.7)

# Plot centroids

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', marker='X', label="Centroids")

plt.legend()

plt.title("K-Means Clustering")

plt.show()

5. Finding the Optimal Number of Clusters (Elbow Method)


To determine the best value of K, we use the Elbow Method.

wcss = [] # Within-cluster sum of squares

for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

# Plot the elbow graph

plt.plot(range(1, 11), wcss, marker='o', linestyle='--')

plt.xlabel("Number of Clusters (K)")

plt.ylabel("WCSS (Within-Cluster Sum of Squares)")

plt.title("Elbow Method for Optimal K")

plt.show()

Implementing Hierarchical Clustering

1. Import Required Libraries

import numpy as np

import matplotlib.pyplot as plt

import scipy.cluster.hierarchy as sch

from sklearn.cluster import AgglomerativeClustering

from sklearn.datasets import make_blobs

2. Generate Sample Data

We create a dataset with four clusters.

# Generate synthetic data

X, y = make_blobs(n_samples=300, centers=4, random_state=42)


3. Create a Dendrogram

A dendrogram helps determine the optimal number of clusters.

# Plot the dendrogram

plt.figure(figsize=(8, 5))

dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))

plt.title("Dendrogram")

plt.xlabel("Data Points")

plt.ylabel("Euclidean Distance")

plt.show()

4. Apply Agglomerative Clustering

We use 4 clusters based on the dendrogram.

# Apply Agglomerative Clustering

hc = AgglomerativeClustering(n_clusters=4, metric='euclidean', linkage='ward')

y_hc = hc.fit_predict(X)

5. Visualize the Clusters

# Plot clusters

plt.scatter(X[:, 0], X[:, 1], c=y_hc, cmap='rainbow', alpha=0.7)

plt.title("Hierarchical Clustering")

plt.show()

Perform PCA (Principal Component Analysis) for dimensionality reduction:

1. Import Required Libraries

import numpy as np

import matplotlib.pyplot as plt


from sklearn.decomposition import PCA

from sklearn.datasets import load_digits

from sklearn.preprocessing import StandardScaler

2. Load the Dataset

We use the handwritten digits dataset (each image has 64 features = 8×8 pixels).

# Load the digits dataset

digits = load_digits()

X = digits.data # Features (64-dimensional)

y = digits.target # Labels (digits 0-9)

3. Standardize the Data

PCA is sensitive to feature scaling, so we standardize the dataset.

# Standardize the features

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)

4. Apply PCA (Reduce to 2 Dimensions)

We reduce the 64-dimensional dataset to 2 dimensions for visualization.

# Apply PCA

pca = PCA(n_components=2)

X_pca = pca.fit_transform(X_scaled)

5. Visualize the Reduced Data

# Scatter plot of PCA results


plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', alpha=0.7)

plt.colorbar(label="Digit Label")

plt.title("PCA: Handwritten Digits (2D Projection)")

plt.xlabel("Principal Component 1")

plt.ylabel("Principal Component 2")

plt.show()

6. Explained Variance (How Much Information is Retained?)

To check how much information is retained in lower dimensions:

# Explained variance ratio

explained_variance = pca.explained_variance_ratio_

print(f"Explained Variance (PC1 + PC2): {sum(explained_variance) * 100:.2f}


%")

7. Finding the Optimal Number of Components

If we want to retain 95% variance, we determine the optimal number of components.

# Compute cumulative explained variance

pca_full = PCA().fit(X_scaled)

cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)

# Plot cumulative variance

plt.plot(range(1, len(cumulative_variance) + 1), cumulative_variance, marker='o', linestyle='--')

plt.xlabel("Number of Principal Components")

plt.ylabel("Cumulative Explained Variance")

plt.title("PCA: Choosing the Optimal Number of Components")

plt.axhline(y=0.95, color='r', linestyle='--') # 95% threshold

plt.show()

To find the smallest number of components that retains at least 95% of the variance:

optimal_components = np.argmax(cumulative_variance >= 0.95) + 1

print(f"Optimal Number of Components for 95% Variance:


{optimal_components}")
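As a shortcut (an aside, not part of the walkthrough above), scikit-learn also accepts a variance fraction directly as n_components, in which case PCA chooses the number of components for you:

# Keep enough components to explain 95% of the variance automatically
pca_95 = PCA(n_components=0.95)
X_pca_95 = pca_95.fit_transform(X_scaled)
print("Components selected:", pca_95.n_components_)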
