Building a K-Means Clustering Algorithm from Scratch in Python

ANSHUMAN JHA

Table of Contents
1. Introduction
2. K-Means Clustering Algorithm's Logic
3. The Structure of a K-Means Clustering Algorithm
4. Implementation in Python
a. Initialize Centroids
b. Compute Distances
c. Assign Clusters
d. Update Centroids
e. Check Convergence
f. K-Means Clustering Algorithm
g. Plot Clusters
5. Conclusion

1. Introduction to K-Means Clustering Algorithm

In the world of unsupervised machine learning, where data lacks predefined labels, K-Means clustering shines as a
beacon for uncovering hidden structures. It's like sorting your sock drawer without knowing what patterns exist
beforehand – the algorithm helps you discover them! K-Means aims to partition a dataset containing 'n' observations
into 'k' distinct clusters. Think of these clusters as groups where members share similar characteristics. The magic lies in
ensuring that each data point finds its home in the cluster whose center (the "mean") is closest to it.
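
One common way to state this goal more precisely, which the text above does not spell out, is as minimizing the total squared distance between points and the mean of their cluster. The symbols are notation assumed here for illustration: x_i is a data point, C_j is the j-th cluster, and \mu_j is its mean.

```latex
\min_{C_1,\dots,C_k} \; \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```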

This post takes you beyond pre-built libraries like Scikit-learn. We'll embark on a journey to build a K-Means clustering
algorithm from scratch in Python, gaining a deeper understanding of its inner workings.

2. K-Means Clustering Algorithm's Logic

Imagine you're trying to organize a group of people into teams based on their interests. K-Means operates in a
surprisingly similar way, employing a simple yet powerful iterative process:

• Initialization: Planting the Seeds

The first step is like choosing initial team captains at random. We select 'k' random points from our data to act as initial
cluster centroids. These centroids are like placeholders, representing the heart of each cluster.

• Assignment Step: Finding Your Tribe

Now, imagine each person gravitating towards the captain who shares the most common interests with them. Similarly,
in K-Means, every data point is assigned to the cluster whose centroid is nearest to it. This proximity is usually
determined using distance metrics like Euclidean distance.

• Update Step: Re-evaluating the Leaders

Once everyone has found their preliminary group, it's time to re-evaluate the captains. The position of each centroid is
recalculated, taking into account the average location of all data points now belonging to its cluster. Imagine the
captains adjusting their positions slightly to be more centered within their teams.

• Convergence: Reaching a Stable State

This process of assigning data points and updating centroids repeats. With each iteration, the centroids inch closer to
stable positions, and the clusters become more refined. The algorithm terminates when the centroids move less than a
predefined tolerance between iterations, or when a maximum number of iterations is reached. In the first case, we've
reached a stable state where our "teams" are well-defined.
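
The four steps above can be traced on a tiny example. The following is a minimal sketch, not the implementation built later in this post: the five one-dimensional points and the choice of the first two points as starting centroids are hypothetical values picked for illustration.

```python
import numpy as np

# Hypothetical toy data: five points on a line, shaped (n_samples, 1)
points = np.array([1.0, 2.0, 3.0, 10.0, 11.0]).reshape(-1, 1)

# Initialization: pretend the random draw picked the first two points
centroids = points[:2].copy()

for step in range(10):
    # Assignment step: distance from every point to every centroid, then pick the nearest
    dists = np.abs(points - centroids.T)          # shape (5, 2)
    labels = dists.argmin(axis=1)

    # Update step: each centroid moves to the mean of its assigned points
    new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(2)])

    # Convergence: stop once the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("stopped at step", step)
print("centroids:", centroids.ravel())
print("labels:", labels)
```

With these five points the loop settles on centroids 2.0 and 10.5, with the first three points in one cluster and the last two in the other.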

Example:
Let's say we want to cluster customer data based on their purchase history. Using K-Means, we can group customers
who exhibit similar buying patterns, even without knowing those patterns beforehand. The algorithm will automatically
identify these groups based on the proximity of their purchase data points.

By building K-Means from the ground up, we gain valuable insights into its strengths and limitations, empowering us to
fine-tune its application for specific tasks. We'll delve into the code in the implementation section below, bringing
this powerful algorithm to life.

3. The Structure of a K-Means Clustering Algorithm

The structure consists of the algorithm's steps and sub-steps, with labels and the connections between them. Each step
corresponds to a function or a key part of the implementation that follows.

4. Implementation in Python
Let's implement a simple K-Means Clustering Algorithm in Python to cluster data. We'll use NumPy for
numerical computations.

Step 1: Initialize Centroids

• This function randomly selects k data points from the dataset X to serve as the initial centroids.
• np.random.seed(42) ensures reproducibility by fixing the random seed.
• np.random.permutation(len(X)) shuffles the indices of the data points.
• centroids = X[random_indices[:k]] selects the first k points as the initial centroids.

import numpy as np

def initialize_centroids(X, k):
    """
    Randomly initialize centroids from the dataset.

    Parameters:
    X (numpy.ndarray): The input dataset (n_samples, n_features)
    k (int): The number of clusters

    Returns:
    numpy.ndarray: Initialized centroids (k, n_features)
    """
    # Ensure reproducibility by setting a random seed
    np.random.seed(42)

    # Randomly shuffle the indices of the dataset
    random_indices = np.random.permutation(len(X))

    # Select the first k points as the initial centroids
    centroids = X[random_indices[:k]]

    return centroids

# Example usage
X_example = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
k_example = 2
print("Initial Centroids:\n", initialize_centroids(X_example, k_example))
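
Calling np.random.seed inside the function resets NumPy's global random state on every call. As a possible alternative, the sketch below uses a local generator instead. The function name initialize_centroids_rng and its seed parameter are hypothetical and not part of the original implementation.

```python
import numpy as np

def initialize_centroids_rng(X, k, seed=None):
    """Pick k distinct rows of X as initial centroids using a local generator."""
    if not 1 <= k <= len(X):
        raise ValueError("k must be between 1 and the number of samples")
    # Local generator: NumPy's global random state is left untouched
    rng = np.random.default_rng(seed)
    # Sample k distinct row indices
    indices = rng.choice(len(X), size=k, replace=False)
    return X[indices].astype(float)

# Example usage on the same toy data as above
X_demo = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
print("Initial Centroids:\n", initialize_centroids_rng(X_demo, 2, seed=0))
```

Passing the same seed gives the same centroids, while passing different seeds makes it easy to try several starting points.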

Step 2: Compute Distances

• This function computes the Euclidean distance from each data point in X to each centroid.
• distances[:, i] = np.linalg.norm(X - centroid, axis=1) calculates the Euclidean distance from each point to the i-th centroid.

def compute_distances(X, centroids):
    """
    Compute the distance from each point to each centroid.

    Parameters:
    X (numpy.ndarray): The input dataset (n_samples, n_features)
    centroids (numpy.ndarray): The current centroids (k, n_features)

    Returns:
    numpy.ndarray: Distances from each point to each centroid (n_samples, k)
    """
    # Initialize a distance matrix to store distances from each point to each centroid
    distances = np.zeros((X.shape[0], len(centroids)))

    # Compute the Euclidean distance from each point to each centroid
    for i, centroid in enumerate(centroids):
        distances[:, i] = np.linalg.norm(X - centroid, axis=1)

    return distances

# Example usage
centroids_example = np.array([[1, 2], [9, 10]])
print("Distances:\n", compute_distances(X_example, centroids_example))
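
The same distance matrix can also be computed without the Python loop by using NumPy broadcasting. This is a sketch of one alternative, not a replacement the original text prescribes. The name compute_distances_vectorized is hypothetical.

```python
import numpy as np

def compute_distances_vectorized(X, centroids):
    """Return an (n_samples, k) matrix of Euclidean distances via broadcasting."""
    # X[:, None, :] has shape (n, 1, d); centroids[None, :, :] has shape (1, k, d).
    # Their difference broadcasts to shape (n, k, d).
    diff = X[:, None, :] - centroids[None, :, :]
    # Take the norm over the feature axis
    return np.linalg.norm(diff, axis=2)

# Example usage
X_demo = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]], dtype=float)
centroids_demo = np.array([[1, 2], [9, 10]], dtype=float)
D = compute_distances_vectorized(X_demo, centroids_demo)
print("Distances:\n", D)
```

The trade-off is memory: the intermediate array holds n × k × d values, so the loop version may be preferable for very large datasets.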

Step 3: Assign Clusters

• This function assigns each data point to the nearest centroid based on the computed distances.
• np.argmin(distances, axis=1) returns the index of the smallest distance for each data point, effectively
assigning it to the nearest cluster.

def assign_clusters(distances):
    """
    Assign each point to the closest centroid.

    Parameters:
    distances (numpy.ndarray): Distances from each point to each centroid (n_samples, k)

    Returns:
    numpy.ndarray: Cluster labels for each point (n_samples,)
    """
    # Assign each point to the closest centroid
    return np.argmin(distances, axis=1)

# Example usage
distances_example = compute_distances(X_example, centroids_example)
print("Cluster Assignments:\n", assign_clusters(distances_example))
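
Once every point has a label, the distance matrix also gives a simple quality score: the sum of squared distances from each point to its assigned centroid, often called inertia. The helper below is a hypothetical addition for illustration and is not part of the original implementation. The small distance matrix is made-up data.

```python
import numpy as np

def inertia(distances, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    # Pick, for each row, the distance in the column given by that row's label
    nearest = distances[np.arange(len(labels)), labels]
    return float(np.sum(nearest ** 2))

# Example usage with a made-up (3 points, 2 centroids) distance matrix
distances_demo = np.array([[0.0, 4.0], [1.0, 3.0], [5.0, 2.0]])
labels_demo = np.argmin(distances_demo, axis=1)
print("Labels:", labels_demo)
print("Inertia:", inertia(distances_demo, labels_demo))
```

A lower value means points sit closer to their centroids, which makes the score useful for comparing runs that use the same k.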

Step 4: Update Centroids

• This function recalculates the centroids by computing the mean of all data points assigned to each cluster.
• X[labels == i].mean(axis=0) calculates the mean of all points assigned to the i-th cluster.

def update_centroids(X, labels, k):
    """
    Calculate new centroids as the mean of points in each cluster.

    Parameters:
    X (numpy.ndarray): The input dataset (n_samples, n_features)
    labels (numpy.ndarray): Cluster labels for each point (n_samples,)
    k (int): The number of clusters

    Returns:
    numpy.ndarray: Updated centroids (k, n_features)
    """
    # Initialize an array to store the updated centroids
    centroids = np.zeros((k, X.shape[1]))

    # Compute the mean of all points assigned to each cluster
    for i in range(k):
        centroids[i] = X[labels == i].mean(axis=0)

    return centroids

# Example usage
labels_example = assign_clusters(distances_example)
print("Updated Centroids:\n", update_centroids(X_example, labels_example, k_example))
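
If a cluster ends up with no assigned points, taking the mean of an empty selection produces NaN values. One possible safeguard, sketched below, is to leave such a centroid where it was. The function name update_centroids_safe and its old_centroids argument are hypothetical. Other strategies, such as re-seeding the empty cluster, are equally valid.

```python
import numpy as np

def update_centroids_safe(X, labels, old_centroids):
    """Mean of each cluster; a cluster with no points keeps its old centroid."""
    new_centroids = old_centroids.astype(float)   # astype returns a fresh copy
    for i in range(len(old_centroids)):
        members = X[labels == i]
        if len(members) > 0:                      # skip empty clusters
            new_centroids[i] = members.mean(axis=0)
    return new_centroids

# Example usage: cluster 1 received no points
X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
labels_demo = np.array([0, 0, 0])
old_demo = np.array([[0.0, 0.0], [100.0, 100.0]])
updated = update_centroids_safe(X_demo, labels_demo, old_demo)
print("Updated Centroids:\n", updated)
```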

Step 5: Check Convergence

• This function checks if the centroids have converged, i.e., if the change in centroids is below a specified
tolerance tol.
• np.linalg.norm(new_centroids - old_centroids, axis=1) computes the Euclidean distance between old and new centroids.

def has_converged(old_centroids, new_centroids, tol=1e-4):
    """
    Check if the centroids have converged.

    Parameters:
    old_centroids (numpy.ndarray): Previous centroids (k, n_features)
    new_centroids (numpy.ndarray): Updated centroids (k, n_features)
    tol (float): Tolerance for convergence

    Returns:
    bool: True if centroids have converged, False otherwise
    """
    # Compute the Euclidean distance between old and new centroids
    distances = np.linalg.norm(new_centroids - old_centroids, axis=1)

    # Check if the changes in centroids are below the tolerance level
    return np.all(distances < tol)

# Example usage
old_centroids_example = centroids_example
new_centroids_example = update_centroids(X_example, labels_example, k_example)
print("Has Converged:", has_converged(old_centroids_example, new_centroids_example))
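
A tolerance on centroid movement is one stopping rule. Another common one is to stop when no point changes cluster between two iterations. The helper below sketches that alternative. The name labels_unchanged is hypothetical, and the original implementation does not use this rule.

```python
import numpy as np

def labels_unchanged(old_labels, new_labels):
    """True when no point switched cluster between two iterations."""
    # On the first iteration there are no previous labels to compare against
    return old_labels is not None and bool(np.array_equal(old_labels, new_labels))

# Example usage
first = np.array([0, 0, 1, 1])
second = np.array([0, 1, 1, 1])
print(labels_unchanged(None, first))     # first iteration
print(labels_unchanged(first, second))   # one point moved
print(labels_unchanged(second, second))  # nothing moved
```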

Step 6: K-Means Clustering Algorithm

• This function performs the K-Means clustering algorithm.
• It initializes the centroids, iteratively assigns clusters, updates centroids, and checks for convergence.
• The process repeats until convergence or the maximum number of iterations is reached.

def k_means(X, k, max_iters=100):
    """
    Perform K-Means clustering.

    Parameters:
    X (numpy.ndarray): The input dataset (n_samples, n_features)
    k (int): The number of clusters
    max_iters (int): Maximum number of iterations

    Returns:
    tuple: Final centroids (k, n_features) and cluster labels (n_samples,)
    """
    # Step 1: Initialize centroids
    centroids = initialize_centroids(X, k)

    for i in range(max_iters):
        # Step 2: Compute distances
        distances = compute_distances(X, centroids)

        # Step 3: Assign clusters
        labels = assign_clusters(distances)

        # Step 4: Update centroids
        new_centroids = update_centroids(X, labels, k)

        # Step 5: Check for convergence before overwriting the old centroids
        converged = has_converged(centroids, new_centroids)

        # Keep the most recent centroids, including on the final iteration
        centroids = new_centroids

        if converged:
            break

    return centroids, labels

# Example usage with synthetic data
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

k = 4
centroids, labels = k_means(X, k)
print("Final Centroids:\n", centroids)
print("Final Labels:\n", labels)
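
Because the result depends on where the centroids start, a common practice is to run the algorithm from several random starts and keep the run with the lowest sum of squared distances. The sketch below is self-contained and does not reuse the functions above. The names k_means_once and k_means_restarts, the seed handling, and the NumPy-generated blobs are all hypothetical choices made for illustration.

```python
import numpy as np

def k_means_once(X, k, seed, max_iters=100, tol=1e-4):
    """One K-Means run from a seeded random start; returns centroids, labels, inertia."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            if np.any(labels == j):               # empty clusters keep their position
                new_centroids[j] = X[labels == j].mean(axis=0)
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment and score against the final centroids
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    inertia = float((dists[np.arange(len(X)), labels] ** 2).sum())
    return centroids, labels, inertia

def k_means_restarts(X, k, n_restarts=5):
    """Run several seeded starts and keep the run with the lowest inertia."""
    runs = [k_means_once(X, k, seed) for seed in range(n_restarts)]
    return min(runs, key=lambda run: run[2])

# Hypothetical data: three 2-D blobs built with NumPy only
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

best_centroids, best_labels, best_inertia = k_means_restarts(X_demo, 3)
print("Best Centroids:\n", best_centroids.round(2))
print("Best Inertia:", round(best_inertia, 2))
```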

Step 7: Plot Clusters

• This function visualizes the resulting clusters and centroids.
• plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o', edgecolor='k') plots the data points colored by
their cluster labels.
• plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x') plots the centroids.

import matplotlib.pyplot as plt

def plot_clusters(X, centroids, labels):
    """
    Visualize the clusters and centroids.

    Parameters:
    X (numpy.ndarray): The input dataset (n_samples, n_features)
    centroids (numpy.ndarray): Final centroids (k, n_features)
    labels (numpy.ndarray): Cluster labels for each point (n_samples,)
    """
    # Plot the data points with cluster labels
    plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', marker='o', edgecolor='k', s=50)

    # Plot the centroids
    plt.scatter(centroids[:, 0], centroids[:, 1], c='red', marker='x', s=200)

    plt.title('K-Means Clustering')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.show()

# Plotting the result
plot_clusters(X, centroids, labels)

5. Conclusion
By breaking down the K-Means clustering algorithm into distinct steps and functions, we gain a better
understanding of each component's role and functionality. The initialization of centroids, computation of
distances, assignment of clusters, updating of centroids, and checking for convergence are all crucial steps in
the algorithm. Implementing these steps in Python provides valuable insights into the mechanics of clustering
algorithms.

Feel free to experiment with different datasets and parameters to see how the K-Means algorithm performs and
how different initialization methods can impact the clustering results.
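
As one starting point for such experiments, the sketch below shows a k-means++ style initialization, which spreads the initial centroids out by sampling each new one with probability proportional to its squared distance from the centroids already chosen. This is a different technique from the uniform random selection used in this post. The function name kmeans_plus_plus_init is hypothetical, and the sketch assumes the data contains at least k distinct points.

```python
import numpy as np

def kmeans_plus_plus_init(X, k, seed=None):
    """Choose k initial centroids, favouring points far from those already chosen."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Squared distance from each point to its nearest already-chosen centroid
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid proportionally to that squared distance
        # (assumes d2.sum() > 0, i.e. some point differs from the chosen centroids)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

# Example usage
X_demo = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]])
init = kmeans_plus_plus_init(X_demo, 3, seed=0)
print("k-means++ style initial centroids:\n", init)
```

Points that were already chosen have zero distance and therefore zero probability of being picked again, so the initial centroids are distinct whenever the data points are.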

Constructive comments and feedback are welcomed
