K-Mode Clustering in Python
Last Updated: 26 Jun, 2025
K-Modes clustering is an unsupervised machine-learning technique used to group categorical data into k clusters (groups). It partitions the data into k mutually exclusive groups. Unlike K-Means, which uses distances between numeric values, K-Modes counts the number of mismatches between categorical values to decide how similar two data points are. For example:
- Data point 1: ["red", "small", "round"]
- Data point 2: ["blue", "small", "square"]
Here there are 2 mismatches out of 3 features (color and shape), so these two points are not very similar.
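The mismatch count for the two example points can be reproduced in a few lines. This is only an illustrative sketch; the function name mismatches is an assumption made for this example and is not part of any library.

```python
def mismatches(a, b):
    """Count the positions where two equal-length categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of features")
    return sum(x != y for x, y in zip(a, b))


point_1 = ["red", "small", "round"]
point_2 = ["blue", "small", "square"]

print(mismatches(point_1, point_2))  # 2 (color and shape differ, size matches)
```

A count of 0 means the two records are identical, and a count equal to the number of features means they share no value at all.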
When Should You Use K-Modes?
Use K-Modes when:
- Your dataset contains categorical variables like gender, color, brand etc.
- You want to group customers by product preferences
- You're analyzing survey responses such as Yes/No or Male/Female.
How does K-Modes clustering work?
Unlike hierarchical clustering, K-Modes requires us to decide the number of clusters (K) in advance. Here's how it works step by step:
- Start by picking cluster centers: Randomly select K data points from the dataset to act as the starting cluster centers. These are called "modes".
- Assign data to clusters: Compare each data point with every mode by counting the total number of mismatches, and assign the point to the cluster whose mode has the fewest mismatches.
- Update the clusters: For each cluster, find the most common value of every feature among its members and use these values as the new mode.
- Repeat the process: Keep repeating steps 2 and 3 until no data points are reassigned to different clusters.
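Let X be a set of n categorical data objects, each described by m attributes:
X = \begin{bmatrix} x_{11} & \cdots & x_{1m}\\ \vdots & \ddots & \vdots\\ x_{n1} & \cdots & x_{nm} \end{bmatrix}
A mode of X is a vector Q = [q_{1},q_{2},...,q_{m}] that minimizes
D(X,Q) = \sum_{i=1}^{n}d(X_{i},Q)
where d(X_{i},Q) is the number of attributes on which X_{i} and Q disagree. Writing \delta(x_{ij},q_{j}) = 0 if x_{ij} = q_{j} and 1 otherwise, the dissimilarity becomes
D(X,Q) = \sum_{i=1}^{n}\sum_{j=1}^{m}\delta(x_{ij},q_{j})
Suppose we want K clusters. Then each cluster k has its own mode Q_{k} = [q_{k1},q_{k2},...,q_{km}], and we set w_{ik} = 1 if object i is assigned to cluster k and 0 otherwise. The total cost to minimize is
C(Q) = \sum_{k=1}^{K}\sum_{i=1}^{n}\sum_{j=1}^{m}w_{ik}\,\delta(x_{ij},q_{kj})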
Overall, the goal of K-Modes clustering is to minimize the total dissimilarity between the data objects and the modes of the clusters they belong to (modes play the role that centroids play in K-Means), using a categorical dissimilarity measure such as the Hamming distance.
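To make the objective concrete, the sketch below adds up the mismatches between each row and the mode of the cluster it is assigned to. The function name kmodes_cost and the small three-row dataset are assumptions made for illustration only.

```python
import numpy as np


def kmodes_cost(data, labels, modes):
    """Total number of mismatches between each row and its own cluster's mode."""
    data = np.asarray(data)
    labels = np.asarray(labels)
    modes = np.asarray(modes)
    # modes[labels] selects, for every row, the mode of the cluster it belongs to
    return int(np.sum(data != modes[labels]))


toy_data = np.array([
    ["red", "small", "round"],
    ["red", "small", "square"],
    ["blue", "large", "square"],
])
toy_labels = np.array([0, 0, 1])      # rows 0 and 1 in cluster 0, row 2 in cluster 1
toy_modes = np.array([
    ["red", "small", "round"],        # mode of cluster 0
    ["blue", "large", "square"],      # mode of cluster 1
])

print(kmodes_cost(toy_data, toy_labels, toy_modes))  # 1 (only row 1 differs, on shape)
```

A lower cost means the rows agree more closely with the modes of their clusters, which is what each iteration of the algorithm tries to achieve.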
Implementation of the k-mode clustering algorithm
K-Modes is a way to group categorical data into clusters. Here's how you can do it step-by-step in Python using just NumPy and Pandas.
Step 1: Prepare Your Data
Start by defining your dataset. Each row is a data point and each column contains categorical values like letters or labels.
Python
import numpy as np
import pandas as pd
data = np.array([
    ['A', 'B', 'C'],
    ['B', 'C', 'A'],
    ['C', 'A', 'B'],
    ['A', 'C', 'B'],
    ['A', 'A', 'B']
])
Step 2: Set Number of Clusters
Decide how many groups you want to divide your data into.
Python
k = 2  # number of clusters
Step 3: Pick Starting Points (Modes)
Randomly choose k rows from the data to be the starting cluster centers.
Python
np.random.seed(0)
modes = data[np.random.choice(data.shape[0], k, replace=False)]
Step 4: Assign Data to Clusters
For each data point, count how many features are different from each mode. Assign the point to the most similar cluster.
Python
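clusters = np.zeros(data.shape[0], dtype=int)
# Repeat the assignment (this step) and the update (Step 5) up to 10 times
for _ in range(10):
    for i, point in enumerate(data):
        # number of mismatching features between this point and each mode
        distances = [np.sum(point != mode) for mode in modes]
        clusters[i] = np.argmin(distances)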
Step 5: Update Cluster Modes
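After assigning all points, update each cluster’s mode to the most common value of each feature in that cluster. This code continues inside the for _ in range(10) loop from Step 4.
Python
    # still inside the outer loop from Step 4
    for j in range(k):
        if np.any(clusters == j):  # skip clusters that received no points
            modes[j] = pd.DataFrame(data[clusters == j]).mode().iloc[0].values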
Step 6: View Final Results
Print out which cluster each data point belongs to and what the final cluster centers (modes) are.
Python
print("Cluster assignments:", clusters)
print("Cluster modes:", modes)
Output:
[Image: K-Mode Clustering output]
The output shows that the first data point belongs to cluster 1 and the rest belong to cluster 0. Each cluster has a common pattern: cluster 0 has mode values ['A', 'A', 'B'] and cluster 1 has ['A', 'B', 'C']. These modes represent the most frequent values in each cluster and are used to group similar rows together.
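The steps above can also be collected into a single function that stops as soon as no point changes cluster, instead of always running a fixed number of iterations. This is a minimal sketch under stated assumptions: the function name k_modes and its max_iter and seed parameters are hypothetical, and it draws the starting rows with np.random.default_rng rather than np.random.seed, so the starting modes (and possibly the final clusters) may differ from the run shown above.

```python
import numpy as np
import pandas as pd


def k_modes(data, k, max_iter=10, seed=0):
    """Cluster the rows of a 2-D categorical array; returns (labels, modes)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random (copied so data is untouched)
    modes = data[rng.choice(data.shape[0], size=k, replace=False)].copy()
    labels = np.full(data.shape[0], -1)

    for _ in range(max_iter):
        # Assignment step: mismatch count from every row to every mode
        distances = np.array(
            [[np.sum(row != mode) for mode in modes] for row in data]
        )
        new_labels = distances.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no row changed cluster, so the modes would not change either
        labels = new_labels
        # Update step: most frequent value per column within each cluster
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # keep the old mode if a cluster is empty
                modes[j] = pd.DataFrame(members).mode().iloc[0].to_numpy()
    return labels, modes


example = np.array([
    ['A', 'B', 'C'],
    ['B', 'C', 'A'],
    ['C', 'A', 'B'],
    ['A', 'C', 'B'],
    ['A', 'A', 'B'],
])
labels, modes = k_modes(example, k=2)
print("Cluster assignments:", labels)
print("Cluster modes:", modes)
```

When several values are tied for most frequent in a column, pandas returns them in sorted order, so taking the first row of mode() breaks ties by picking the smallest value.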
Cluster with kmodes Library
pip install kmodes
Optimal number of clusters in the K-Mode algorithm
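The elbow method is used to choose the number of clusters: plot the clustering cost for several values of k and look for the point where the cost stops dropping sharply.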
Python
import pandas as pd
import numpy as np
# !pip install kmodes
from kmodes.kmodes import KModes
import matplotlib.pyplot as plt
# %matplotlib inline  (Jupyter only; uncomment when running in a notebook)

# Fit K-Modes for k = 1..4 on the data array from Step 1 and record each cost
cost = []
K = range(1, 5)
for k in K:
    kmode = KModes(n_clusters=k, init="random", n_init=5, verbose=1)
    kmode.fit_predict(data)
    cost.append(kmode.cost_)

plt.plot(K, cost, 'x-')
plt.xlabel('No. of clusters')
plt.ylabel('Cost')
plt.title('Elbow Curve')
plt.show()
Output:
[Image: Elbow Method plot of cost against number of clusters]
As we can see from the graph, there is an elbow-like bend at 2 and at 3 clusters, so either 2 or 3 clusters is a reasonable choice. Let's use 2 clusters.
Python
kmode = KModes(n_clusters=2, init = "random", n_init = 5, verbose=1)
clusters = kmode.fit_predict(data)
clusters
Output:
array([1, 0, 1, 1, 1], dtype=uint16)
The labels show that the first, third, fourth and fifth data points share one cluster (label 1), while the second data point forms a cluster of its own (label 0). This grouping is not the same as the from-scratch result, where the first data point was the one set apart. K-Modes with random initialization converges to a local optimum, so different runs and implementations can return different partitions, and the label numbers themselves are arbitrary. To find the best number of groups we use the Elbow Method, which helps us see when adding more groups no longer makes a big difference. K-Modes is a simple and effective way to group similar data when working with categories.
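Because cluster label numbers are arbitrary, two clustering results are best compared by the groups they form rather than by the raw label values. The sketch below is one simple way to do that; the helper name same_partition is an assumption made for this example, and the scratch labels are written out from the description of the from-scratch output (first data point in cluster 1, the rest in cluster 0).

```python
from itertools import combinations


def same_partition(labels_a, labels_b):
    """True if both labelings group the same rows together, ignoring label names."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same rows")
    # Two labelings agree when every pair of rows is either together in both
    # or apart in both
    return all(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in combinations(range(len(labels_a)), 2)
    )


scratch = [1, 0, 0, 0, 0]  # from-scratch run: first row on its own
library = [1, 0, 1, 1, 1]  # kmodes run shown above: second row on its own

print(same_partition(scratch, library))          # False
print(same_partition(scratch, [0, 1, 1, 1, 1]))  # True: same groups, labels swapped
```

A pairwise check like this is fine for a handful of rows; for larger datasets a standard agreement score is usually a better fit.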