Building K-Means Clustering Algorithm From Scratch
Building K-Means Clustering Algorithm From Scratch
1 ANSHUMAN JHA
Building a K-Means Clustering Algorithm from Scratch in Python
Table of Contents
1. Introduction
2. K-Means Clustering Algorithm's Logic
3.The Structure of a K-Means Clustering Algorithm
4. Implementation in Python
a. Initialize Centroids
b. Compute Distances
c. Assign Clusters
d. Update Centroids
e. Check Convergence
f. K-Means Clustering Algorithm
g. Plot Clusters
5. Conclusion
2 ANSHUMAN JHA
Building a K-Means Clustering Algorithm from Scratch in Python
This post takes you beyond pre-built libraries like Scikit-learn. We'll embark on a journey to build a K-Means clustering
algorithm from scratch in Python, gaining a deeper understanding of its inner workings.
Example:
Let's say we want to cluster customer data based on their purchase history. Using K-Means, we can group customers
who exhibit similar buying patterns, even without knowing those patterns beforehand. The algorithm will automatically
identify these groups based on the proximity of their purchase data points.
By building K-Means from the ground up, we gain valuable insights into its strengths and limitations, empowering us to
fine-tune its application for specific tasks. We'll delve into the code implementation in the next section, bringing this
powerful algorithm to life.
3 ANSHUMAN JHA
Building a K-Means Clustering Algorithm from Scratch in Python
4 ANSHUMAN JHA
Building a K-Means Clustering Algorithm from Scratch in Python
5. Implementation in Python
Let's implement a simple K-Means Clustering Algorithm in Python to cluster data. We'll use NumPy for
numerical computations.
import numpy as np
Parameters:
X (numpy.ndarray): The input dataset (n_samples, n_features)
k (int): The number of clusters
Returns:
numpy.ndarray: Initialized centroids (k, n_features)
"""
# Ensure reproducibility by setting a random seed
np.random.seed(42)
return centroids
# Example usage
X_example = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
k_example = 2
print("Initial Centroids:\n", initialize_centroids(X_example, k_example))
5 ANSHUMAN JHA
Building a K-Means Clustering Algorithm from Scratch in Python
Parameters:
X (numpy.ndarray): The input dataset (n_samples, n_features)
centroids (numpy.ndarray): The current centroids (k, n_features)
Returns:
numpy.ndarray: Distances from each point to each centroid (n_samples, k)
"""
# Initialize a distance matrix to store distances from each point to each centroid
distances = np.zeros((X.shape[0], len(centroids)))
return distances
# Example usage
centroids_example = np.array([[1, 2], [9, 10]])
print("Distances:\n", compute_distances(X_example, centroids_example))
def assign_clusters(distances):
"""
Assign each point to the closest centroid.
Parameters:
distances (numpy.ndarray): Distances from each point to each centroid (n_samples, k)
Returns:
numpy.ndarray: Cluster labels for each point (n_samples,)
"""
# Assign each point to the closest centroid
return np.argmin(distances, axis=1)
# Example usage
distances_example = compute_distances(X_example, centroids_example)
print("Cluster Assignments:\n", assign_clusters(distances_example))
6 ANSHUMAN JHA
Building a K-Means Clustering Algorithm from Scratch in Python
Step 4: Update Centroids
• This function recalculates the centroids by computing the mean of all data points assigned to each
cluster.
• X[labels == i].mean(axis=0) calculates the mean of all points assigned to the i-th cluster.
def update_centroids(X, labels, k):
"""
Calculate new centroids as the mean of points in each cluster.
Parameters:
X (numpy.ndarray): The input dataset (n_samples, n_features)
labels (numpy.ndarray): Cluster labels for each point (n_samples,)
k (int): The number of clusters
Returns:
numpy.ndarray: Updated centroids (k, n_features)
"""
# Initialize an array to store the updated centroids
centroids = np.zeros((k, X.shape[1]))
return centroids
# Example usage
labels_example = assign_clusters(distances_example)
print("Updated Centroids:\n", update_centroids(X_example, labels_example, k_example))
Parameters:
old_centroids (numpy.ndarray): Previous centroids (k, n_features)
new_centroids (numpy.ndarray): Updated centroids (k, n_features)
tol (float): Tolerance for convergence
Returns:
bool: True if centroids have converged, False otherwise
"""
# Compute the Euclidean distance between old and new centroids
distances = np.linalg.norm(new_centroids - old_centroids, axis=1)
# Example usage
old_centroids_example = centroids_example
new_centroids_example = update_centroids(X_example, labels_example, k_example)
print("Has Converged:", has_converged(old_centroids_example, new_centroids_example))
7 ANSHUMAN JHA
Building a K-Means Clustering Algorithm from Scratch in Python
Parameters:
X (numpy.ndarray): The input dataset (n_samples, n_features)
k (int): The number of clusters
max_iters (int): Maximum number of iterations
Returns:
tuple: Final centroids (k, n_features) and cluster labels (n_samples,)
"""
# Step 1: Initialize centroids
centroids = initialize_centroids(X, k)
for i in range(max_iters):
# Step 2: Compute distances
distances = compute_distances(X, centroids)
k = 4
centroids, labels = k_means(X, k)
print("Final Centroids:\n", centroids)
print("Final Labels:\n", labels)
8 ANSHUMAN JHA
Building a K-Means Clustering Algorithm from Scratch in Python
Parameters:
X (numpy.ndarray): The input dataset (n_samples, n_features)
centroids (numpy.ndarray): Final centroids (k, n_features)
labels (numpy.ndarray): Cluster labels for each point (n_samples,)
"""
# Plot the data points with cluster labels
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o', edgecolor='k', s=50)
plt.title('K-Means Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
5. Conclusion
By breaking down the K-Means clustering algorithm into distinct steps and functions, we gain a better
understanding of each component's role and functionality. The initialization of centroids, computation of
distances, assignment of clusters, updating of centroids, and checking for convergence are all crucial steps in
the algorithm. Implementing these steps in Python provides valuable insights into the mechanics of clustering
algorithms.
Feel free to experiment with different datasets and parameters to see how the K-Means algorithm performs and
how different initialization methods can impact the clustering results.
9 ANSHUMAN JHA
Building a K-Means Clustering Algorithm from Scratch in Python
10 ANSHUMAN JHA