Building a K-Means Clustering Algorithm from Scratch in Python

Anshuman Jha

Table of Contents
1. Introduction
2. K-Means Clustering Algorithm's Logic
3. The Structure of a K-Means Clustering Algorithm
4. Implementation in Python
a. Initialize Centroids
b. Compute Distances
c. Assign Clusters
d. Update Centroids
e. Check Convergence
f. K-Means Clustering Algorithm
g. Plot Clusters
5. Conclusion


1. Introduction to K-Means Clustering Algorithm


In the world of unsupervised machine learning, where data lacks predefined labels, K-Means clustering shines as a
beacon for uncovering hidden structures. It's like sorting your sock drawer without knowing what patterns exist
beforehand – the algorithm helps you discover them! K-Means aims to partition a dataset containing 'n' observations
into 'k' distinct clusters. Think of these clusters as groups where members share similar characteristics. The magic lies in
ensuring that each data point finds its home in the cluster whose center (the "mean") is closest to it.

This post takes you beyond pre-built libraries like scikit-learn. We'll build a K-Means clustering algorithm from
scratch in Python to gain a deeper understanding of its inner workings.

2. K-Means Clustering Algorithm's Logic


Imagine you're trying to organize a group of people into teams based on their interests. K-Means operates in a
surprisingly similar way, employing a simple yet powerful iterative process:

• Initialization: Planting the Seeds


The first step is like choosing initial team captains at random. We select 'k' random points from our data to act as initial
cluster centroids. These centroids are like placeholders, representing the heart of each cluster.

• Assignment Step: Finding Your Tribe


Now, imagine each person gravitating towards the captain who shares the most common interests with them. Similarly,
in K-Means, every data point is assigned to the cluster whose centroid is nearest to it. This proximity is usually
determined using distance metrics like Euclidean distance.

• Update Step: Re-evaluating the Leaders


Once everyone has found their preliminary group, it's time to re-evaluate the captains. The position of each centroid is
recalculated, taking into account the average location of all data points now belonging to its cluster. Imagine the
captains adjusting their positions slightly to be more centered within their teams.

• Convergence: Reaching a Stable State


This process of assigning data points and updating centroids repeats. With each iteration, the centroids move closer to
a stable (locally optimal) position, and the clusters become more refined. The algorithm terminates when the centroids
move by less than a predefined tolerance between iterations, or when a maximum number of iterations is reached. At that
point we've reached a stable state where our "teams" are well-defined.
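
The four steps above can be traced by hand on a tiny dataset. The following is a minimal sketch, assuming six made-up 1-D points and two deliberately poor starting centroids; the data and variable names are illustrative and not part of the implementation later in this post.

```python
import numpy as np

# Hypothetical toy data: six 1-D points forming two obvious groups
points = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
centroids = np.array([1.0, 2.0])  # deliberately poor starting guesses (k = 2)

for step in range(10):
    # Assignment step: index of the nearest centroid for every point
    labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)

    # Update step: each centroid moves to the mean of its assigned points
    new_centroids = np.array([points[labels == j].mean() for j in range(2)])

    # Convergence: stop once the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("Centroids:", centroids)
print("Labels:", labels)
```

On this data the centroids move from (1, 2) to (1, 7.6) and then to (2, 11), after which nothing changes and the loop stops with the first three points in one cluster and the last three in the other.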

Example:
Let's say we want to cluster customer data based on their purchase history. Using K-Means, we can group customers
who exhibit similar buying patterns, even without knowing those patterns beforehand. The algorithm will automatically
identify these groups based on the proximity of their purchase data points.
By building K-Means from the ground up, we gain valuable insights into its strengths and limitations, empowering us to
fine-tune its application for specific tasks. We'll delve into the code implementation in the next section, bringing this
powerful algorithm to life.


3. The Structure of a K-Means Clustering Algorithm


This structure comprises the steps and sub-steps of the algorithm, with labels and connections between them. Each step
corresponds to a function or a key part of the implementation that follows.

4. Implementation in Python
Let's implement a simple K-Means Clustering Algorithm in Python to cluster data. We'll use NumPy for
numerical computations.

Step 1: Initialize Centroids


• This function randomly selects k data points from the dataset X to serve as the initial centroids.
• np.random.seed(42) ensures reproducibility by fixing the random seed.
• np.random.permutation(len(X)) shuffles the indices of the data points.
• centroids = X[random_indices[:k]] selects the first k points as the initial centroids.

import numpy as np

def initialize_centroids(X, k):
    """
    Randomly initialize centroids from the dataset.

    Parameters:
    X (numpy.ndarray): The input dataset (n_samples, n_features)
    k (int): The number of clusters

    Returns:
    numpy.ndarray: Initialized centroids (k, n_features)
    """
    # Ensure reproducibility by setting a random seed
    np.random.seed(42)

    # Randomly shuffle the indices of the dataset
    random_indices = np.random.permutation(len(X))

    # Select the first k points as the initial centroids
    centroids = X[random_indices[:k]]

    return centroids

# Example usage
X_example = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
k_example = 2
print("Initial Centroids:\n", initialize_centroids(X_example, k_example))
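
Because initialize_centroids calls np.random.seed(42) internally, every call returns the same centroids and also resets NumPy's global random state. One possible alternative, sketched below, takes the seed as an argument and uses a local generator instead. The name initialize_centroids_rng and its seed parameter are assumptions made for illustration; they are not part of the original implementation.

```python
import numpy as np

def initialize_centroids_rng(X, k, seed=None):
    """Pick k distinct rows of X as initial centroids (hypothetical variant)."""
    if k > len(X):
        raise ValueError("k cannot exceed the number of data points")

    # A local generator leaves NumPy's global random state untouched
    rng = np.random.default_rng(seed)

    # Sample k distinct row indices without replacement
    indices = rng.choice(len(X), size=k, replace=False)
    return X[indices]

# Example usage: the same seed gives the same centroids, a new seed may not
X_demo = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
print(initialize_centroids_rng(X_demo, 2, seed=0))
print(initialize_centroids_rng(X_demo, 2, seed=1))
```

Passing different seeds makes it easy to rerun the clustering from several starting points and compare the outcomes.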


Step 2: Compute Distances


• This function computes the Euclidean distance from each data point in X to each centroid.
• distances[:, i] = np.linalg.norm(X - centroid, axis=1) calculates the Euclidean distance from each point to the
i-th centroid.

def compute_distances(X, centroids):
    """
    Compute the distance from each point to each centroid.

    Parameters:
    X (numpy.ndarray): The input dataset (n_samples, n_features)
    centroids (numpy.ndarray): The current centroids (k, n_features)

    Returns:
    numpy.ndarray: Distances from each point to each centroid (n_samples, k)
    """
    # Initialize a distance matrix to store distances from each point to each centroid
    distances = np.zeros((X.shape[0], len(centroids)))

    # Compute the Euclidean distance from each point to each centroid
    for i, centroid in enumerate(centroids):
        distances[:, i] = np.linalg.norm(X - centroid, axis=1)

    return distances

# Example usage
centroids_example = np.array([[1, 2], [9, 10]])
print("Distances:\n", compute_distances(X_example, centroids_example))
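
The loop over centroids can also be expressed with NumPy broadcasting. The sketch below is one possible equivalent, assuming the same array shapes as compute_distances; the name compute_distances_vectorized is hypothetical. It builds an intermediate array of shape (n_samples, k, n_features), so it trades memory for fewer Python-level iterations.

```python
import numpy as np

def compute_distances_vectorized(X, centroids):
    """Distances from each point to each centroid, without a Python loop."""
    # X[:, None, :] has shape (n_samples, 1, n_features) and
    # centroids[None, :, :] has shape (1, k, n_features), so the
    # difference broadcasts to (n_samples, k, n_features).
    diff = X[:, None, :] - centroids[None, :, :]

    # Taking the norm over the feature axis leaves an (n_samples, k) matrix
    return np.linalg.norm(diff, axis=2)

# Example usage with the same values as the example above
X_demo = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], dtype=float)
centroids_demo = np.array([[1, 2], [9, 10]], dtype=float)
print(compute_distances_vectorized(X_demo, centroids_demo))
```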

Step 3: Assign Clusters


• This function assigns each data point to the nearest centroid based on the computed distances.
• np.argmin(distances, axis=1) returns the index of the smallest distance for each data point, effectively
assigning it to the nearest cluster.

def assign_clusters(distances):
    """
    Assign each point to the closest centroid.

    Parameters:
    distances (numpy.ndarray): Distances from each point to each centroid (n_samples, k)

    Returns:
    numpy.ndarray: Cluster labels for each point (n_samples,)
    """
    # Assign each point to the closest centroid
    return np.argmin(distances, axis=1)

# Example usage
distances_example = compute_distances(X_example, centroids_example)
print("Cluster Assignments:\n", assign_clusters(distances_example))

Step 4: Update Centroids

• This function recalculates the centroids by computing the mean of all data points assigned to each
cluster.
• X[labels == i].mean(axis=0) calculates the mean of all points assigned to the i-th cluster.

def update_centroids(X, labels, k):
    """
    Calculate new centroids as the mean of points in each cluster.

    Parameters:
    X (numpy.ndarray): The input dataset (n_samples, n_features)
    labels (numpy.ndarray): Cluster labels for each point (n_samples,)
    k (int): The number of clusters

    Returns:
    numpy.ndarray: Updated centroids (k, n_features)
    """
    # Initialize an array to store the updated centroids
    centroids = np.zeros((k, X.shape[1]))

    # Compute the mean of all points assigned to each cluster
    for i in range(k):
        centroids[i] = X[labels == i].mean(axis=0)

    return centroids

# Example usage
labels_example = assign_clusters(distances_example)
print("Updated Centroids:\n", update_centroids(X_example, labels_example, k_example))
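
One edge case is worth noting: if no point is assigned to cluster i, X[labels == i] is empty and its mean is NaN (NumPy also emits a RuntimeWarning). A minimal sketch of one common workaround, keeping the previous centroid for an empty cluster, is shown below. The function update_centroids_safe and its old_centroids parameter are assumptions for illustration; other strategies, such as re-seeding the empty cluster from a random data point, are equally plausible.

```python
import numpy as np

def update_centroids_safe(X, labels, old_centroids):
    """Like update_centroids, but an empty cluster keeps its previous centroid."""
    # astype returns a copy, so the caller's array is not modified
    new_centroids = old_centroids.astype(float)

    for i in range(len(old_centroids)):
        members = X[labels == i]
        # Only move the centroid if at least one point belongs to the cluster
        if len(members) > 0:
            new_centroids[i] = members.mean(axis=0)

    return new_centroids

# Example usage: every point is assigned to cluster 0, so cluster 1 is empty
X_demo = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], dtype=float)
labels_demo = np.zeros(len(X_demo), dtype=int)
old_demo = np.array([[1.0, 2.0], [100.0, 100.0]])
print(update_centroids_safe(X_demo, labels_demo, old_demo))
```

Here cluster 0 moves to the mean of all points, (5, 6), while the empty cluster 1 stays at (100, 100) instead of becoming NaN.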

Step 5: Check Convergence


• This function checks if the centroids have converged, i.e., if the change in centroids is below a specified
tolerance tol.
• np.linalg.norm(new_centroids - old_centroids, axis=1) computes the Euclidean distance between old and new
centroids.
def has_converged(old_centroids, new_centroids, tol=1e-4):
    """
    Check if the centroids have converged.

    Parameters:
    old_centroids (numpy.ndarray): Previous centroids (k, n_features)
    new_centroids (numpy.ndarray): Updated centroids (k, n_features)
    tol (float): Tolerance for convergence

    Returns:
    bool: True if centroids have converged, False otherwise
    """
    # Compute the Euclidean distance between old and new centroids
    distances = np.linalg.norm(new_centroids - old_centroids, axis=1)

    # Check if the changes in centroids are below the tolerance level
    return np.all(distances < tol)

# Example usage
old_centroids_example = centroids_example
new_centroids_example = update_centroids(X_example, labels_example, k_example)
print("Has Converged:", has_converged(old_centroids_example, new_centroids_example))


Step 6: K-Means Clustering Algorithm


• This function performs the K-Means clustering algorithm.
• It initializes the centroids, iteratively assigns clusters, updates centroids, and checks for convergence.
• The process repeats until convergence or the maximum number of iterations is reached.

def k_means(X, k, max_iters=100):
    """
    Perform K-Means clustering.

    Parameters:
    X (numpy.ndarray): The input dataset (n_samples, n_features)
    k (int): The number of clusters
    max_iters (int): Maximum number of iterations

    Returns:
    tuple: Final centroids (k, n_features) and cluster labels (n_samples,)
    """
    # Step 1: Initialize centroids
    centroids = initialize_centroids(X, k)

    for i in range(max_iters):
        # Step 2: Compute distances
        distances = compute_distances(X, centroids)

        # Step 3: Assign clusters
        labels = assign_clusters(distances)

        # Step 4: Update centroids
        new_centroids = update_centroids(X, labels, k)

        # Step 5: Check for convergence
        if has_converged(centroids, new_centroids):
            break

        # Update centroids for the next iteration
        centroids = new_centroids

    return centroids, labels

# Example usage with synthetic data


from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

k = 4
centroids, labels = k_means(X, k)
print("Final Centroids:\n", centroids)
print("Final Labels:\n", labels)
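
To compare two runs numerically rather than by eye, a common measure is the within-cluster sum of squared distances, often called inertia: lower values mean points sit closer to their assigned centroids. The helper below is a small sketch under that definition. The function name inertia and the toy data are illustrative assumptions; it accepts the centroids and labels arrays in the shapes returned by k_means.

```python
import numpy as np

def inertia(X, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    # centroids[labels] picks, for every point, the centroid of its cluster
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

# Example usage on hypothetical data: two tight pairs of points
X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels_demo = np.array([0, 0, 1, 1])

good_centroids = np.array([[0.0, 1.0], [10.0, 1.0]])  # cluster means
poor_centroids = np.array([[5.0, 1.0], [5.0, 1.0]])   # both in the middle

print("Inertia (cluster means):", inertia(X_demo, good_centroids, labels_demo))
print("Inertia (poor centroids):", inertia(X_demo, poor_centroids, labels_demo))
```

With the cluster means as centroids every point is exactly 1 unit away, giving an inertia of 4.0; the poorly placed centroids give 104.0.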


Step 7: Plot Clusters


• This function visualizes the resulting clusters and centroids.
• plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o', edgecolor='k') plots the data points colored by
their cluster labels.
• plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x') plots the centroids.

import matplotlib.pyplot as plt

def plot_clusters(X, centroids, labels):
    """
    Visualize the clusters and centroids.

    Parameters:
    X (numpy.ndarray): The input dataset (n_samples, n_features)
    centroids (numpy.ndarray): Final centroids (k, n_features)
    labels (numpy.ndarray): Cluster labels for each point (n_samples,)
    """
    # Plot the data points with cluster labels
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o', edgecolor='k', s=50)

    # Plot the centroids
    plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=200)

    plt.title('K-Means Clustering')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.show()

# Plotting the result


plot_clusters(X, centroids, labels)

5. Conclusion
By breaking down the K-Means clustering algorithm into distinct steps and functions, we gain a better
understanding of each component's role and functionality. The initialization of centroids, computation of
distances, assignment of clusters, updating of centroids, and checking for convergence are all crucial steps in
the algorithm. Implementing these steps in Python provides valuable insights into the mechanics of clustering
algorithms.

Feel free to experiment with different datasets and parameters to see how the K-Means algorithm performs and
how different initialization methods can impact the clustering results.
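
As one example of an alternative initialization method, the sketch below follows the k-means++ idea: the first centroid is picked uniformly at random, and each later centroid is sampled with probability proportional to its squared distance from the nearest centroid chosen so far, which tends to spread the starting points out. This is not part of the implementation above; the function name, the seed parameter, and the toy data are assumptions, and the sketch assumes the rows of X are not all identical.

```python
import numpy as np

def initialize_centroids_plus_plus(X, k, seed=None):
    """k-means++ style seeding (illustrative sketch, not the post's method)."""
    rng = np.random.default_rng(seed)
    n_samples = len(X)

    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(n_samples)]]

    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)

        # Sample the next centroid with probability proportional to d2
        # (assumes d2.sum() > 0, i.e. some point differs from the chosen ones)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(n_samples, p=probs)])

    return np.array(centroids)

# Example usage on hypothetical data with two well-separated groups
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [100.0, 100.0], [100.0, 101.0]])
print(initialize_centroids_plus_plus(X_demo, 2, seed=0))
```

Points that are already centroids have zero squared distance and therefore zero probability of being picked again, so with distinct data points the k starting centroids are distinct.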


Constructive comments and feedback are welcomed
