
K-Mode Clustering in Python

Last Updated : 26 Jun, 2025

K-Modes clustering is an unsupervised machine-learning technique used to group categorical data into k clusters (groups). It partitions the data into k mutually exclusive groups. Unlike K-Means, which uses distances between numbers, K-Modes uses the number of mismatches between categorical values to decide how similar two data points are. For example:

  • Data point 1: ["red", "small", "round"]
  • Data point 2: ["blue", "small", "square"]

Here there are 2 mismatches (color and shape) so these two are not very similar.
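The mismatch count for the two data points above can be sketched in a few lines of plain Python. The helper name count_mismatches is hypothetical and used only for illustration.

```python
def count_mismatches(a, b):
    """Return the number of positions where two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


point_1 = ["red", "small", "round"]
point_2 = ["blue", "small", "square"]

# Color and shape differ, size matches
print(count_mismatches(point_1, point_2))  # 2
```

A count of 0 means the two records are identical on every attribute; the maximum possible count is the number of attributes.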

When Should You Use K-Modes?

Use K-Modes when:

  • Your dataset contains categorical variables such as gender, color or brand.
  • You want to group customers by product preferences.
  • You're analyzing survey responses (Yes/No, Male/Female, etc.).

How Does K-Modes Clustering Work?

Unlike hierarchical clustering, K-Modes requires us to decide the number of clusters (K) in advance. Here's how it works step by step:

  • Start by picking cluster centers: Randomly select K data points from the dataset to act as the starting cluster centers. These are called "modes".
  • Assign data to clusters: Measure how similar each data point is to each mode using the total number of mismatches, and assign each data point to the cluster whose mode it matches most closely (fewest mismatches).
  • Update the clusters: For each cluster, find the most common value of every attribute among its members and use these values as the new mode.
  • Repeat the process: Keep repeating steps 2 and 3 until no data points are reassigned to different clusters.

Let X be a set of n categorical data objects, each described by m attributes:

  X = \begin{bmatrix} x_{11} & \cdots & x_{1m}\\ \vdots & \ddots & \vdots\\ x_{n1} & \cdots & x_{nm} \end{bmatrix}

The mode of X is a vector Q = [q_{1},q_{2},...,q_{m}] that minimizes

  D(X,Q) = \sum_{i=1}^{n}d(X_{i},Q)

Applying the mismatch-based dissimilarity measure to each attribute gives

  D(X,Q) = \sum_{i=1}^{n}\sum_{j=1}^{m}\delta(x_{ij},q_{j})

where \delta(x_{ij},q_{j}) = 0 if x_{ij} = q_{j} and 1 otherwise.

Suppose we want K clusters. Each cluster C_{k} then has its own mode Q_{k} = [q_{k1},q_{k2},...,q_{km}], and the total cost sums the mismatches between every data object and the mode of the cluster it belongs to:

  C(Q) = \sum_{k=1}^{K}\sum_{i \in C_{k}}\sum_{j=1}^{m}\delta(x_{ij},q_{kj})

Overall the goal of K-modes clustering is to minimize the dissimilarities between the data objects and the centroids (modes) of the clusters using a measure of categorical similarity such as the Hamming distance.
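To make the cost concrete, here is a minimal sketch that counts the mismatches between each row and the mode of its assigned cluster. The tiny dataset, the labels, the modes and the function name clustering_cost are all hypothetical values chosen for illustration.

```python
import numpy as np


def clustering_cost(data, labels, modes):
    """Total number of attribute mismatches between rows and their cluster's mode."""
    assigned_modes = modes[labels]  # shape (n, m): the mode paired with each row
    return int(np.sum(data != assigned_modes))


# Hypothetical example: three records, two attributes, two clusters
data = np.array([
    ["red", "small"],
    ["red", "large"],
    ["blue", "large"],
])
labels = np.array([0, 0, 1])        # cluster index of each row
modes = np.array([
    ["red", "small"],               # mode of cluster 0
    ["blue", "large"],              # mode of cluster 1
])

# Only the second record differs from its mode, and only in one attribute
print(clustering_cost(data, labels, modes))  # 1
```

A lower value means the rows agree more closely with the modes of their clusters.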

Implementation of the k-mode clustering algorithm

K-Modes is a way to group categorical data into clusters. Here's how you can do it step-by-step in Python using just NumPy and Pandas.

Step 1: Prepare Your Data

Start by defining your dataset. Each row is a data point and each column contains categorical values like letters or labels.

Python
import numpy as np
import pandas as pd

data = np.array([
    ['A', 'B', 'C'],
    ['B', 'C', 'A'],
    ['C', 'A', 'B'],
    ['A', 'C', 'B'],
    ['A', 'A', 'B']
])

Step 2: Set Number of Clusters

Decide how many groups you want to divide your data into.

Python
k = 2

Step 3: Pick Starting Points (Modes)

Randomly choose k rows from the data to be the starting cluster centers.

Python
np.random.seed(0)
modes = data[np.random.choice(data.shape[0], k, replace=False)]

Step 4: Assign Data to Clusters

For each data point, count how many features are different from each mode. Assign the point to the most similar cluster.

Python
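# Cluster index for every row, filled in below
clusters = np.zeros(data.shape[0], dtype=int)

# Run a fixed number of assign/update rounds
for _ in range(10):
    for i, point in enumerate(data):
        # Number of mismatching features between this point and each mode
        distances = [np.sum(point != mode) for mode in modes]
        # Pick the mode with the fewest mismatches (first one wins on ties)
        clusters[i] = np.argmin(distances)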
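As an alternative to the inner Python loop, the mismatch counts for all points and all modes can be computed in one step with NumPy broadcasting. This is a sketch, not part of the original walkthrough: it reuses the dataset from Step 1, and the two starting modes (rows 2 and 0) are picked by hand here purely for illustration.

```python
import numpy as np

data = np.array([
    ['A', 'B', 'C'],
    ['B', 'C', 'A'],
    ['C', 'A', 'B'],
    ['A', 'C', 'B'],
    ['A', 'A', 'B']
])

# Assumption: rows 2 and 0 act as the starting modes (hand-picked for this sketch)
modes = data[[2, 0]]

# Shapes (n, 1, m) and (1, k, m) broadcast to (n, k, m);
# summing over the last axis counts mismatches per (point, mode) pair
distances = (data[:, None, :] != modes[None, :, :]).sum(axis=2)

# Index of the closest mode for every point (first one wins on ties)
clusters = distances.argmin(axis=1)

print(distances)
print(clusters)
```

Each row of distances holds one point's mismatch count against every mode, so argmin along axis 1 performs the same assignment as the loop.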

Step 5: Update Cluster Modes

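After assigning all points, update each cluster's mode to the most common values in that cluster. This code is indented because it runs inside the same iteration loop started in Step 4, right after the assignment pass. If two values are equally common, pandas returns the modes in sorted order, so .iloc[0] picks the first one alphabetically.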

Python
    for j in range(k):
        if np.any(clusters == j):
            modes[j] = pd.DataFrame(data[clusters == j]).mode().iloc[0].values

Step 6: View Final Results

Print out which cluster each data point belongs to and what the final cluster centers (modes) are.

Python
print("Cluster assignments:", clusters)
print("Cluster modes:", modes)

Output:

[Figure: K-Mode Clustering output]

The output shows that the first data point belongs to cluster 1 and the rest belong to cluster 0. Each cluster has a common pattern: cluster 0 has mode values ['A', 'A', 'B'] and cluster 1 has ['A', 'B', 'C']. These modes represent the most frequent values in each cluster and are used to group similar rows together.
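The walkthrough runs a fixed 10 iterations, while the algorithm description stops once no point changes cluster. The sketch below wraps the same steps in a single function with that stopping rule. The function names, the max_iter and seed parameters, and the use of np.random.default_rng are assumptions made for this example, so the exact labels it produces may differ from the output shown above.

```python
import numpy as np


def column_mode(values):
    """Most frequent value in a 1-D array; ties go to the smallest value."""
    uniques, counts = np.unique(values, return_counts=True)
    return uniques[np.argmax(counts)]


def k_modes(data, k, max_iter=100, seed=0):
    """Cluster categorical rows; returns (labels, modes)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random
    modes = data[rng.choice(len(data), size=k, replace=False)].copy()
    labels = np.full(len(data), -1)

    for _ in range(max_iter):
        # Mismatch count between every point and every mode
        distances = (data[:, None, :] != modes[None, :, :]).sum(axis=2)
        new_labels = distances.argmin(axis=1)

        # Stop when no point moved to a different cluster
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels

        # Recompute each non-empty cluster's mode column by column
        for j in range(k):
            members = data[labels == j]
            if len(members):
                modes[j] = [column_mode(col) for col in members.T]

    return labels, modes


data = np.array([
    ['A', 'B', 'C'],
    ['B', 'C', 'A'],
    ['C', 'A', 'B'],
    ['A', 'C', 'B'],
    ['A', 'A', 'B']
])

labels, modes = k_modes(data, k=2)
print("Cluster assignments:", labels)
print("Cluster modes:", modes)
```

Breaking ties toward the smallest value mirrors the behaviour of taking the first row of the pandas mode, which is returned in sorted order.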

Cluster with kmodes Library

The kmodes package provides a ready-made implementation of the algorithm. Install it with:

pip install kmodes

Optimal number of clusters in the K-Mode algorithm

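The elbow method can be used to choose the number of clusters: fit the model for several values of K, plot the cost for each, and look for the point where the cost stops dropping sharply.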

Python
import pandas as pd
import numpy as np
# !pip install kmodes
from kmodes.kmodes import KModes
import matplotlib.pyplot as plt
# %matplotlib inline  (Jupyter only; uncomment when running in a notebook)

# `data` is the array defined in Step 1
cost = []
K = range(1, 5)
for n_clusters in K:
    kmode = KModes(n_clusters=n_clusters, init="random", n_init=5, verbose=1)
    kmode.fit_predict(data)
    # cost_ is the total number of mismatches between points and their modes
    cost.append(kmode.cost_)

plt.plot(K, cost, 'x-')
plt.xlabel('No. of clusters')
plt.ylabel('Cost')
plt.title('Elbow Curve')
plt.show()

Output:

[Figure: Elbow Method (cost vs. number of clusters)]

As the graph shows, the curve bends at K = 2 and K = 3, so either value is a reasonable choice. Let's use 2 clusters.

Python
kmode = KModes(n_clusters=2, init = "random", n_init = 5, verbose=1)
clusters = kmode.fit_predict(data)
clusters

Output:

array([1, 0, 1, 1, 1], dtype=uint16)

Here the second data point is alone in cluster 0, while the first, third, fourth and fifth data points are in cluster 1. This is not the same partition as the from-scratch result, where the first data point was the one on its own. Cluster numbers are arbitrary labels, and with random initialization on such a tiny dataset, different runs can settle on different groupings.

To find the best number of groups we use the Elbow Method, which shows when adding more groups stops making a big difference. K-Modes is a simple and effective way to group similar data when working with categories.

