
How to Use Custom Distance Functions for Clustering?

Last Updated : 23 Jul, 2025

When working with clustering algorithms, especially K-Means, you may encounter scenarios where the default Euclidean distance metric does not fit your data. Perhaps you want to use Manhattan distance, or even a more complex custom similarity function. However, scikit-learn’s K-Means supports only Euclidean distance by design.

In this article, we will explore ways to work around this limitation, alternatives to K-Means, and strategies to implement a custom clustering solution.

Limitations of K-Means in Scikit-learn

The KMeans algorithm in scikit-learn offers efficient and straightforward clustering, but it is restricted to Euclidean distance (L2 norm). This limitation can hinder use cases where other distance metrics, such as Manhattan, Cosine, or Custom distance functions, are required.

If you're new to K-Means, think of it as grouping cities by how far apart they are. But K-Means only measures the straight-line distance between two points, which may not always make sense. If you wanted to use "driving distance" (Manhattan) instead, K-Means cannot accommodate that directly.
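To make the difference concrete, here is a quick comparison of the two metrics on a single pair of points (a minimal sketch; the point values are arbitrary):

Python
import numpy as np

# Two arbitrary points, used only to illustrate the two metrics
a = np.array([1, 2])
b = np.array([4, 6])

# Euclidean distance: the straight line between the points
euclidean = np.sqrt(np.sum((a - b) ** 2))   # 5.0

# Manhattan distance: sum of absolute coordinate differences ("driving distance")
manhattan = np.sum(np.abs(a - b))           # 7

print(euclidean, manhattan)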

Below are some practical solutions for overcoming this limitation while maintaining flexibility in clustering tasks.

Alternatives and Solutions for Custom Distance Metrics

1. Use Other Clustering Algorithms in Scikit-Learn

Scikit-learn offers a range of clustering algorithms besides K-Means that support alternative distance metrics. Here are two great options:

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

  • Supports custom distance metrics, including Manhattan distance.
  • Best for clusters of varying shapes or when dealing with noisy data.
Python
from sklearn.cluster import DBSCAN

# Example using Manhattan distance for DBSCAN
X = [[1, 2], [2, 3], [5, 6], [8, 9]]
dbscan = DBSCAN(metric='manhattan')
dbscan.fit(X)
print(dbscan.labels_)

Output:

[-1 -1 -1 -1]
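With the default parameters (eps=0.5, min_samples=5), none of these four points has enough neighbours to form a cluster, so every point is labelled -1 (noise). DBSCAN also accepts any Python callable as its metric, which is the most direct way to plug in a fully custom distance function. Below is a minimal sketch with a hand-written Manhattan function and looser parameters; the eps and min_samples values are chosen only to suit this tiny example:

Python
import numpy as np
from sklearn.cluster import DBSCAN

# Any callable that takes two 1-D arrays and returns a float can serve as the metric
def manhattan(p, q):
    return np.sum(np.abs(p - q))

X = np.array([[1, 2], [2, 3], [5, 6], [8, 9]])

# Looser eps and min_samples so this tiny dataset actually forms a cluster
dbscan = DBSCAN(eps=3, min_samples=2, metric=manhattan)
print(dbscan.fit(X).labels_)

Here the first two points form one cluster, while the two isolated points remain labelled -1.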

Agglomerative Clustering

  • Allows precomputed distance matrices or specific metrics like cosine similarity.
  • Useful for hierarchical clustering when relationships between clusters matter.
Python
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances
import numpy as np

# Sample data
X = np.array([[1, 2], [2, 3], [5, 6], [8, 9]])

# Compute the distance matrix using Manhattan distance
distance_matrix = pairwise_distances(X, metric='manhattan')

# Apply Agglomerative Clustering with 'complete' linkage on the precomputed matrix
agglomerative = AgglomerativeClustering(n_clusters=2, metric='precomputed', linkage='complete')
labels = agglomerative.fit_predict(distance_matrix)

print("Cluster Labels:", labels)

Output:

Cluster Labels: [1 1 0 0]
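Since pairwise_distances also accepts any Python callable, the same precomputed-matrix pattern works with a fully custom distance function. Below is a minimal sketch; the weighted Manhattan distance and its weights are purely illustrative:

Python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

# Hypothetical custom distance: weight the second feature twice as heavily
def weighted_manhattan(p, q):
    weights = np.array([1.0, 2.0])  # illustrative weights
    return np.sum(weights * np.abs(p - q))

X = np.array([[1, 2], [2, 3], [5, 6], [8, 9]])

# Build the full distance matrix with the custom callable
distance_matrix = pairwise_distances(X, metric=weighted_manhattan)

# 'precomputed' tells the estimator the input is already a distance matrix
agglomerative = AgglomerativeClustering(n_clusters=2, metric='precomputed', linkage='complete')
print(agglomerative.fit_predict(distance_matrix))

As before, the first two points end up in one cluster and the last two in the other.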

2. Implement K-Means from Scratch with Custom Distance

If you need full control over your clustering process, a custom K-Means implementation may be the way to go. This way, you can define your own distance function. Below is a basic implementation using Manhattan distance as an example.

Custom K-Means Implementation Example:

Python
import numpy as np

# Custom Manhattan distance function
def custom_distance(p1, p2):
    return np.sum(np.abs(p1 - p2))

# Assign clusters based on custom distance function
def assign_clusters(X, centroids):
    clusters = []
    for x in X:
        distances = [custom_distance(x, c) for c in centroids]
        clusters.append(np.argmin(distances))
    return clusters

# Compute new centroids as mean of assigned points
def compute_centroids(X, labels, k):
    centroids = []
    labels = np.array(labels)
    for i in range(k):
        points = X[labels == i]
        if len(points) == 0:
            # Keep a random data point as the centroid if a cluster ends up empty
            centroids.append(X[np.random.choice(len(X))])
        else:
            centroids.append(points.mean(axis=0))
    return np.array(centroids)

# Main function to perform custom K-Means clustering
def k_means_custom(X, k, max_iter=100):
    centroids = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        labels = assign_clusters(X, centroids)
        new_centroids = compute_centroids(X, labels, k)
        if np.all(centroids == new_centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example usage
X = np.array([[1, 2], [1, 4], [3, 4], [5, 6], [8, 9]])
labels, centroids = k_means_custom(X, k=2)
print("Labels:", labels)
print("Centroids:", centroids)

Output:

Labels: [1, 1, 1, 0, 0]
Centroids: [[6.5 7.5 ]
[1.66666667 3.33333333]]

This implementation lets you plug in any custom distance function that fits your problem; you could replace the Manhattan function with more complex logic as needed. Two caveats are worth keeping in mind: the centroid update uses the mean, which is strictly correct only for squared Euclidean distance (for Manhattan distance the component-wise median, as in k-medians, is the principled update), and because the initial centroids are chosen at random without a fixed seed, the exact labels and centroid order can vary between runs.
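Because assign_clusters looks up custom_distance by name, redefining that one function is enough to switch the whole routine to a different measure. Below is a minimal sketch swapping in cosine distance; it assumes it runs in the same session as the code above and reuses the same example data:

Python
import numpy as np

# Hypothetical replacement: cosine distance instead of Manhattan
def custom_distance(p1, p2):
    return 1 - np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2))

X = np.array([[1, 2], [1, 4], [3, 4], [5, 6], [8, 9]])
labels, centroids = k_means_custom(X, k=2)
print("Labels:", labels)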

3. Explore External Libraries for Custom Clustering Needs

If your use case involves mixed data types (numerical + categorical), consider using the K-Prototypes algorithm from the kmodes library. It supports different similarity measures for categorical data, making it ideal for clustering problems like customer segmentation.

pip install kmodes
Python
from kmodes.kprototypes import KPrototypes
import numpy as np

# Sample data with mixed numerical and categorical features
X = np.array([[1, 'red'], [2, 'blue'], [3, 'red'], [4, 'blue']])
kproto = KPrototypes(n_clusters=2, init='Cao')

# Fit and predict cluster labels
clusters = kproto.fit_predict(X, categorical=[1])
print("Cluster labels:", clusters)
print("Cluster centroids:")
for i, centroid in enumerate(kproto.cluster_centroids_):
    print(f"Cluster {i}: {centroid}")

Output:

Cluster labels: [1 1 0 0]
Cluster centroids:
Cluster 0: ['3.5' 'blue']
Cluster 1: ['1.5' 'blue']
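K-Prototypes combines a numerical cost with a categorical matching cost, and the kmodes implementation exposes a gamma parameter that weights the categorical part against the numerical one (by default the library estimates it from the data). A minimal sketch; the gamma value here is arbitrary and would normally be tuned:

Python
from kmodes.kprototypes import KPrototypes
import numpy as np

X = np.array([[1, 'red'], [2, 'blue'], [3, 'red'], [4, 'blue']])

# Increase the influence of categorical mismatches; gamma=2.0 is an arbitrary choice
kproto_weighted = KPrototypes(n_clusters=2, init='Cao', gamma=2.0)
clusters = kproto_weighted.fit_predict(X, categorical=[1])
print("Cluster labels:", clusters)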

4. Transform Data to Use Euclidean Distance

Sometimes you can transform your data so that Euclidean distance in K-Means gives you the grouping you actually want. For example, scaling each sample to unit length (L2 normalization) makes standard K-Means behave like clustering by cosine similarity, because Euclidean and cosine distances agree in ordering for unit-length vectors.

Python
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

X = [[1, 2], [2, 3], [5, 6], [8, 9]]
X_normalized = normalize(X)
kmeans = KMeans(n_clusters=2).fit(X_normalized)
print(kmeans.labels_)

Output:

[1 1 0 0]

This method leverages the built-in KMeans algorithm while adjusting the data to fit your similarity needs.
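The reason this works is that, for unit-length vectors, squared Euclidean distance and cosine distance are directly related: ||u - v||^2 = 2(1 - cos(u, v)). A quick numerical check of that identity (a minimal sketch; the two vectors are arbitrary):

Python
import numpy as np
from sklearn.preprocessing import normalize

# Two arbitrary vectors, scaled to unit length
u, v = normalize([[1, 2], [8, 9]])

squared_euclidean = np.sum((u - v) ** 2)
cosine_distance = 1 - np.dot(u, v)

# For unit vectors, squared Euclidean distance equals twice the cosine distance
print(squared_euclidean, 2 * cosine_distance)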

Conclusion

While K-Means in scikit-learn is limited to Euclidean distance, you have several options to work around this limitation:

  1. Use DBSCAN or Agglomerative Clustering for built-in support of different metrics.
  2. Implement K-Means from scratch to use your own custom distance function.
  3. Explore external libraries like kmodes for mixed data clustering.
  4. Transform your data to fit Euclidean distance requirements.

These strategies ensure you can still achieve meaningful clustering results, even when custom distance measures are required. Choose the approach that best aligns with your data and clustering objectives!
