Week 8 DS Practical
Introduction to Clustering
Clustering is the process of grouping similar data points based on their features. It
is commonly used for customer segmentation, anomaly detection, and data organisation.
1. K-Means
○ Each cluster has a centroid, and data points are assigned to the nearest
centroid.
○ Works well with spherical clusters but requires specifying the number of
clusters in advance.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # sample data (assumed for illustration)

# Apply K-Means
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Plot clusters coloured by assigned label
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, label='Clusters')
plt.legend()
plt.title("K-Means Clustering")
plt.show()
2. Hierarchical Clustering
○ Builds a hierarchy of clusters, either bottom-up or top-down:
■ Agglomerative (Bottom-up): Starts with each data point as its own cluster and merges the closest clusters iteratively.
■ Divisive (Top-down): Starts with all data points in one cluster and splits them iteratively.
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering

# Create dendrogram (X reused from the K-Means example above)
plt.figure(figsize=(8, 5))
sch.dendrogram(sch.linkage(X, method='ward'))
plt.title("Dendrogram")
plt.show()

# Apply Agglomerative Clustering
hc = AgglomerativeClustering(n_clusters=4, linkage='ward')
y_hc = hc.fit_predict(X)

# Plot clusters
plt.scatter(X[:, 0], X[:, 1], c=y_hc)
plt.title("Hierarchical Clustering")
plt.show()
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
○ Does not require the number of clusters in advance, but depends on the neighbourhood radius (eps) and minimum-points (min_samples) parameters.
from sklearn.cluster import DBSCAN

# Apply DBSCAN (eps and min_samples are illustrative values to tune)
dbscan = DBSCAN(eps=0.5, min_samples=5)
y_dbscan = dbscan.fit_predict(X)

# Plot clusters (noise points are labelled -1)
plt.scatter(X[:, 0], X[:, 1], c=y_dbscan)
plt.title("DBSCAN Clustering")
plt.show()
4. Gaussian Mixture Models (GMM)
○ Fits a mixture of Gaussian distributions and assigns points probabilistically (soft clustering).
from sklearn.mixture import GaussianMixture

# Apply GMM
y_gmm = GaussianMixture(n_components=4, random_state=42).fit_predict(X)

# Plot clusters
plt.scatter(X[:, 0], X[:, 1], c=y_gmm)
plt.show()
Dimensionality Reduction
Dimensionality reduction techniques help reduce the number of features in a
dataset while retaining as much important information as possible. This is useful for
visualization, noise reduction, and improving model performance.
1. Principal Component Analysis (PCA)
○ Identifies new feature axes (principal components) that capture the most variance.
Example Code
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

digits = load_digits()
X = digits.data
y = digits.target

# Apply PCA to project the 64 features onto 2 components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=10)
plt.colorbar(label="Digit Label")
plt.show()
2. t-Distributed Stochastic Neighbor Embedding (t-SNE)
○ Non-linear technique that preserves local neighbourhood structure, widely used for 2-D visualisation.
Example Code
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=42)
X_tsne = tsne.fit_transform(X)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, cmap='tab10', s=10)
plt.colorbar(label="Digit Label")
plt.show()
3. Uniform Manifold Approximation and Projection (UMAP)
○ Similar to t-SNE but faster and better at preserving both global and local
structures.
Example Code
import umap

umap_reducer = umap.UMAP(n_components=2, random_state=42)
X_umap = umap_reducer.fit_transform(X)
plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y, cmap='tab10', s=10)
plt.colorbar(label="Digit Label")
plt.show()
4. Autoencoders
○ Neural networks trained to reconstruct their input; the low-dimensional bottleneck layer provides the reduced representation.
Example Code
import tensorflow as tf
from tensorflow import keras

input_dim = X.shape[1]

# Encoder: compress the input down to 2 dimensions (layer sizes assumed for illustration)
input_layer = keras.layers.Input(shape=(input_dim,))
encoded = keras.layers.Dense(32, activation='relu')(input_layer)
encoded = keras.layers.Dense(2, activation='linear')(encoded)

# Decoder: reconstruct the input from the 2-D code
decoded = keras.layers.Dense(32, activation='relu')(encoded)
decoded = keras.layers.Dense(input_dim, activation='linear')(decoded)

# Compile Autoencoder
autoencoder = keras.Model(input_layer, decoded)
encoder = keras.Model(input_layer, encoded)
autoencoder.compile(optimizer='adam', loss='mse')

# Train Autoencoder to reproduce its own input
autoencoder.fit(X, X, epochs=20, batch_size=64, verbose=0)

X_autoencoded = encoder.predict(X)
plt.scatter(X_autoencoded[:, 0], X_autoencoded[:, 1], c=y, cmap='tab10', s=10)
plt.colorbar(label="Digit Label")
plt.show()
Implement K-Means
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)  # sample data (assumed)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(X)

# Plot clusters
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans)
# Plot centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], c='red', marker='X', s=200, label='Centroids')
plt.legend()
plt.title("K-Means Clustering")
plt.show()

Elbow method to choose the number of clusters (plot WCSS against K):
wcss = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss, marker='o')
plt.show()
Implement Hierarchical Clustering
import numpy as np
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering

# Dendrogram of the hierarchy (X from the K-Means exercise above)
plt.figure(figsize=(8, 5))
sch.dendrogram(sch.linkage(X, method='ward'))
plt.title("Dendrogram")
plt.xlabel("Data Points")
plt.ylabel("Euclidean Distance")
plt.show()

# Apply Agglomerative Clustering (Euclidean distance is the default metric for ward linkage)
hc = AgglomerativeClustering(n_clusters=4, linkage='ward')
y_hc = hc.fit_predict(X)

# Plot clusters
plt.scatter(X[:, 0], X[:, 1], c=y_hc)
plt.title("Hierarchical Clustering")
plt.show()
Implement PCA
We use the handwritten digits dataset (each image has 64 features = 8×8 pixels).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X = digits.data
y = digits.target

# Standardise the features before PCA
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='tab10', s=10)
plt.colorbar(label="Digit Label")
plt.show()

# Variance explained by the two retained components
explained_variance = pca.explained_variance_ratio_

# Cumulative explained variance across all components
pca_full = PCA().fit(X_scaled)
cumulative_variance = np.cumsum(pca_full.explained_variance_ratio_)
plt.plot(np.arange(1, len(cumulative_variance) + 1), cumulative_variance, marker='.')
plt.show()
To select K components that retain 95% variance:
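One way to do this is a minimal sketch using the cumulative_variance array computed above; the names k_95 and pca_95 are illustrative, not part of the practical:

# Smallest number of components whose cumulative explained variance reaches 95%
k_95 = int(np.argmax(cumulative_variance >= 0.95)) + 1
print(f"Components needed to retain 95% of the variance: {k_95}")

# Refit PCA keeping exactly that many components
pca_95 = PCA(n_components=k_95)
X_reduced = pca_95.fit_transform(X_scaled)

Alternatively, scikit-learn's PCA accepts a float for n_components (e.g. PCA(n_components=0.95)), which selects the smallest number of components explaining at least that fraction of the variance.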