We have some data, and we want to cluster it. How exactly do we do that,
and what do the results look like? If you are very familiar with sklearn
and its API, particularly for clustering, then you can probably skip
this tutorial -- hdbscan
implements exactly this API, so you can use
it just as you would any other sklearn clustering algorithm. If, on the
other hand, you aren't that familiar with sklearn, fear not, and read
on. Let's start with the simplest case first -- we have data in a nice
tidy dataframe format.
Let's generate some data with, say, 2000 samples and 10 features. We can put it in a dataframe for a nice clean table view of it.
from sklearn.datasets import make_blobs
import pandas as pd
blobs, labels = make_blobs(n_samples=2000, n_features=10)
pd.DataFrame(blobs).head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -3.370804 | 8.487688 | 4.631243 | -10.181475 | 9.146487 | -8.070935 | -1.612017 | -2.418106 | -8.975390 | -1.769952 |
| 1 | -4.092931 | 8.409841 | 3.362516 | -9.748945 | 9.556615 | -9.240307 | -2.038291 | -3.129068 | -7.109673 | -0.993827 |
| 2 | -4.604753 | 9.616391 | 4.631508 | -11.166361 | 10.888212 | -8.427564 | -3.929517 | -4.563951 | -8.886373 | -1.995063 |
| 3 | -6.889866 | -7.801482 | -6.974958 | -8.570025 | 5.438101 | -5.097457 | -4.941206 | -5.926394 | -10.145152 | 0.219269 |
| 4 | 5.339728 | 2.791309 | 0.611464 | -2.929875 | -7.694973 | 7.776050 | -1.218101 | 0.408141 | -4.563975 | -1.309128 |
So now we need to import the hdbscan library.
import hdbscan
Now, to cluster we need to generate a clustering object.
clusterer = hdbscan.HDBSCAN()
We can then use this clustering object and fit it to the data we have. This will return the clusterer object back to you -- just in case you want to do some method chaining.
clusterer.fit(blobs)
HDBSCAN(algorithm='best', alpha=1.0, approx_min_span_tree=True, gen_min_span_tree=False, leaf_size=40, memory=Memory(None), metric='euclidean', min_cluster_size=5, min_samples=None, p=None)
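Since fit returns the clusterer, you can also chain the call; and because hdbscan follows the sklearn clustering API, the usual fit_predict shortcut should be available as well. A quick sketch:
# fit and read off the labels in one expression via method chaining
labels = hdbscan.HDBSCAN().fit(blobs).labels_
# or use the sklearn-style fit_predict, which fits and returns the labels
labels = hdbscan.HDBSCAN().fit_predict(blobs)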
At this point we are actually done! We've done the clustering! But where
are the results? How do I get the clusters? The clusterer object knows,
and stores the result in an attribute labels_.
clusterer.labels_
array([2, 2, 2, ..., 2, 2, 0])
So it is an array of integers. What are we to make of that? It is an array with an integer for each data sample; samples that are in the same cluster get assigned the same number. The cluster labels start at 0 and count up, so we can determine the number of clusters found by finding the largest cluster label (and adding one).
clusterer.labels_.max()
2
So we have a total of three clusters, with labels 0, 1, and 2.
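If you want that count in code, it is just the largest label plus one; a small sketch (the second line also guards against noise points labelled -1, which are discussed next):
import numpy as np
labels = clusterer.labels_
n_clusters = labels.max() + 1                      # labels count up from 0
n_clusters = len(np.unique(labels[labels >= 0]))   # same count, but ignores any -1 noise labels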
Importantly, HDBSCAN is noise aware -- it has a notion of data samples
that are not assigned to any cluster. This is handled by assigning these
samples the label -1. But wait, there's more. The hdbscan
library
implements soft clustering, where each data point is assigned a cluster
membership score ranging from 0.0 to 1.0. A score of 0.0 represents a
sample that is not in the cluster at all (all noise points will get this
score) while a score of 1.0 represents a sample that is at the heart of
the cluster (note that this is not the spatial centroid notion of core).
You can access these scores via the probabilities_
attribute.
clusterer.probabilities_
array([ 0.83890858, 1. , 0.72629904, ..., 0.79456452, 0.65311137, 0.76382928])
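One convenient way to look at labels and membership scores together is to drop them into a dataframe; this is only an illustrative sketch, and the 0.5 cut-off below is an arbitrary choice, not something the library prescribes:
import pandas as pd
cluster_df = pd.DataFrame({
    'cluster': clusterer.labels_,
    'membership_strength': clusterer.probabilities_,
})
# e.g. inspect points that are only weakly attached to their cluster
weak_members = cluster_df[cluster_df['membership_strength'] < 0.5]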
That is all well and good, but even for data embedded in a vector space
you may not want to treat the distance between data points as pure
Euclidean distance. What can we do in that case? We are still in good
shape, since hdbscan
supports a wide variety of metrics, which you
can set when creating the clusterer object. For example we can do the
following:
clusterer = hdbscan.HDBSCAN(metric='manhattan')
clusterer.fit(blobs)
clusterer.labels_
array([1, 1, 1, ..., 1, 1, 0])
What metrics are supported? Because we simply steal metric computations from sklearn, we get a large number of metrics readily available.
hdbscan.dist_metrics.METRIC_MAPPING
{'braycurtis': hdbscan.dist_metrics.BrayCurtisDistance,
 'canberra': hdbscan.dist_metrics.CanberraDistance,
 'chebyshev': hdbscan.dist_metrics.ChebyshevDistance,
 'cityblock': hdbscan.dist_metrics.ManhattanDistance,
 'dice': hdbscan.dist_metrics.DiceDistance,
 'euclidean': hdbscan.dist_metrics.EuclideanDistance,
 'hamming': hdbscan.dist_metrics.HammingDistance,
 'haversine': hdbscan.dist_metrics.HaversineDistance,
 'infinity': hdbscan.dist_metrics.ChebyshevDistance,
 'jaccard': hdbscan.dist_metrics.JaccardDistance,
 'kulsinski': hdbscan.dist_metrics.KulsinskiDistance,
 'l1': hdbscan.dist_metrics.ManhattanDistance,
 'l2': hdbscan.dist_metrics.EuclideanDistance,
 'mahalanobis': hdbscan.dist_metrics.MahalanobisDistance,
 'manhattan': hdbscan.dist_metrics.ManhattanDistance,
 'matching': hdbscan.dist_metrics.MatchingDistance,
 'minkowski': hdbscan.dist_metrics.MinkowskiDistance,
 'p': hdbscan.dist_metrics.MinkowskiDistance,
 'pyfunc': hdbscan.dist_metrics.PyFuncDistance,
 'rogerstanimoto': hdbscan.dist_metrics.RogersTanimotoDistance,
 'russellrao': hdbscan.dist_metrics.RussellRaoDistance,
 'seuclidean': hdbscan.dist_metrics.SEuclideanDistance,
 'sokalmichener': hdbscan.dist_metrics.SokalMichenerDistance,
 'sokalsneath': hdbscan.dist_metrics.SokalSneathDistance,
 'wminkowski': hdbscan.dist_metrics.WMinkowskiDistance}
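Some of these metrics take parameters. For instance minkowski needs an exponent p, which (as the p=None entry in the repr above suggests) can be passed to the constructor; a hedged sketch:
# Minkowski distance with exponent p=3
clusterer = hdbscan.HDBSCAN(metric='minkowski', p=3)
clusterer.fit(blobs)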
What if you don't have a nice set of points in a vector space, but only
have a pairwise distance matrix providing the distance between each pair
of points? This is a common situation. Perhaps you have a complex custom
distance measure; perhaps you have strings and are using Levenshtein
distance, etc. Again, this is all fine as hdbscan supports a special
metric called precomputed. If you create the clusterer with the metric
set to precomputed then the clusterer will assume that,
rather than being handed a vector of points in a vector space, it is
receiving an all-pairs distance matrix. Missing distances can be
indicated by numpy.inf, which leads HDBSCAN to ignore these pairwise
relationships as long as there exists a path between two points that
contains defined distances (i.e. if there are too many distances
missing, the clustering is going to fail).
NOTE: The input vector _must_ contain numerical data. If you have a distance matrix for non-numerical vectors, you will need to map your input vectors to numerical vectors first (e.g. use the map ['A', 'G', 'C', 'T'] -> [1, 2, 3, 4] to replace the input vector ['A', 'A', 'A', 'C', 'G'] with [1, 1, 1, 3, 2]).
from sklearn.metrics.pairwise import pairwise_distances
distance_matrix = pairwise_distances(blobs)
clusterer = hdbscan.HDBSCAN(metric='precomputed')
clusterer.fit(distance_matrix)
clusterer.labels_
array([1, 1, 1, ..., 1, 1, 2])
Note that this result only appears different due to a different labelling order for the clusters.
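Finally, to illustrate the point above about missing distances: entries of the precomputed matrix can be set to numpy.inf and HDBSCAN will work around them, as long as enough defined distances remain to connect the points. A sketch (which entries are blanked out here is purely illustrative):
import numpy as np
sparse_distance_matrix = distance_matrix.copy()
# pretend the distance between points 0 and 1 was never measured
sparse_distance_matrix[0, 1] = np.inf
sparse_distance_matrix[1, 0] = np.inf  # keep the matrix symmetric
clusterer = hdbscan.HDBSCAN(metric='precomputed')
clusterer.fit(sparse_distance_matrix)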