DB Scan Clustering
DB Scan Clustering
This is your last free member-only story this month. Sign up for Medium and get an extra one
Save
272 4
https://towardsdatascience.com/dbscan-clustering-explained-97556a2ad556 1/11
31/03/2023, 17:51 DBSCAN Clustering — Explained. Detailed theorotical explanation and… | by Soner Yıldırım | Towards Data Science
There are different approaches and algorithms to perform clustering tasks which
can be divided into three sub-categories:
In this post, I will try to explain DBSCAN algorithm in detail. If you would like to
read about other type of clustering algorithms, you can also visit the following posts:
Density-based clustering
Partition-based and hierarchical clustering techniques are highly efficient with
normal shaped clusters. However, when it comes to arbitrary shaped clusters or
detecting outliers, density-based techniques are more efficient.
For example, the dataset in the figure below can easily be divided into three clusters
using k-means algoritm.
k-means clustering
https://towardsdatascience.com/dbscan-clustering-explained-97556a2ad556 3/11
31/03/2023, 17:51 DBSCAN Clustering — Explained. Detailed theorotical explanation and… | by Soner Yıldırım | Towards Data Science
The data points in these figures are grouped in arbitrary shapes or include outliers.
Density-based clustering algorithms are very effienct at finding high-density regions
and outliers. It is very important to detect outliers for some task, e.g. anomaly
detection.
https://towardsdatascience.com/dbscan-clustering-explained-97556a2ad556 4/11
31/03/2023, 17:51 DBSCAN Clustering — Explained. Detailed theorotical explanation and… | by Soner Yıldırım | Towards Data Science
DBSCAN algorithm
DBSCAN stands for density-based spatial clustering of applications with noise. It is
able to find arbitrary shaped clusters and clusters with noise (i.e. outliers).
The main idea behind DBSCAN is that a point belongs to a cluster if it is close to
many points from that cluster.
eps: The distance that specifies the neighborhoods. Two points are considered to
be neighbors if the distance between them are less than or equal to eps.
Based on these two parameters, points are classified as core point, border point, or
outlier:
Core point: A point is a core point if there are at least minPts number of points
(including the point itself) in its surrounding area with radius eps.
Border point: A point is a border point if it is reachable from a core point and
there are less than minPts number of points within its surrounding area.
Outlier: A point is an outlier if it is not a core point and not reachable from any
core points.
These points may be better explained with visualizations. The following figure is
taken from Wikipedia:
https://towardsdatascience.com/dbscan-clustering-explained-97556a2ad556 5/11
31/03/2023, 17:51 DBSCAN Clustering — Explained. Detailed theorotical explanation and… | by Soner Yıldırım | Towards Data Science
Figure source
In this case, minPts is 4. Red points are core points because there are at least 4
points within their surrounding area with radius eps. This area is shown with the
circles in the figure. The yellow points are border points because they are reachable
from a core point and have less than 4 points within their neighborhood. Reachable
means being in the surrounding area of a core point. The points B and C have two
points (including the point itself) within their neigborhood (i.e. the surrounding
area with a radius of eps). Finally N is an outlier because it is not a core point and
cannot be reached from a core point.
We have learned the definitions of parameters and different type points. Now we
can talk about how the algoritm works. It is actually quite simple:
Note: A point that is marked as noise may be revisited and be part of a cluster.
Next step is to randomly choose another point among the points that have not
been visited in the previous steps. Then same procedure applies.
https://towardsdatascience.com/dbscan-clustering-explained-97556a2ad556 6/11
31/03/2023, 17:51 DBSCAN Clustering — Explained. Detailed theorotical explanation and… | by Soner Yıldırım | Towards Data Science
By applying these steps, DBSCAN algorithm is able to find high density regions and
separate them from low density regions.
A cluster includes core points that are neighbors (i.e. reachable from one another)
and all the border points of these core points. The required condition to form a
cluster is to have at least one core point. Although very unlikely, we may have a
cluster with only one core point and its border points.
Scikit-learn implementation
We start with creating a sample dataset using datasets module of scikit-learn. After
creating the sample data points, we will normalize the values using StandardScaler
class from preprocessing module of scikit-learn.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
We will create a dataset with 3 clusters with 0.5 standard deviation for each cluster.
Number of samples is 400 and we also choose three points as centroids (center of
clusters). You can adjust these parameters however you like using make_blobs
function.
https://towardsdatascience.com/dbscan-clustering-explained-97556a2ad556 7/11
31/03/2023, 17:51 DBSCAN Clustering — Explained. Detailed theorotical explanation and… | by Soner Yıldırım | Towards Data Science
#Determine centroids
centers = [[0.5, 2], [-1, -1], [1.5, -1]]
#Create dataset
X, y = make_blobs(n_samples=400, centers=centers,
cluster_std=0.5, random_state=0)
#Normalize the values
X = StandardScaler().fit_transform(X)
We can also plot the dataset to see how each cluster looks:
plt.figure(figsize=(10,6))
plt.scatter(X[:,0], X[:,1], c=y, cmap='Paired')
https://towardsdatascience.com/dbscan-clustering-explained-97556a2ad556 8/11
31/03/2023, 17:51 DBSCAN Clustering — Explained. Detailed theorotical explanation and… | by Soner Yıldırım | Towards Data Science
We just need to define eps and minPts values using eps and min_samples
parameters.
Note: We do not have to specify the number of clusters for DBSCAN which is a great
advantage of DBSCAN over k-means clustering.
y_pred = db.fit_predict(X)
plt.figure(figsize=(10,6))
plt.scatter(X[:,0], X[:,1],c=y_pred, cmap='Paired')
plt.title("Clusters determined by DBSCAN")
It was able to detect the outliers (marked with blue). We can access the labels of data
points using labels_ attribute. Noise (or outliers) are given -1 label. Let’s check the
number of outliers:
db.labels_[db.labels_ == -1].size
18
https://towardsdatascience.com/dbscan-clustering-explained-97556a2ad556 9/11
31/03/2023, 17:51 DBSCAN Clustering — Explained. Detailed theorotical explanation and… | by Soner Yıldırım | Towards Data Science
The clusters in this sample dataset do not have actually have arbitrary shapes. But
DBSCAN performed really good at detecting outliers which would not be easy with
partition-based (e.g. k-means) or hierarchical (e.g. agglomerative) clustering
techniques. If you also apply DBSCAN to a dataset with arbitrary shaped clusters,
you will see the success of DBSCAN as well.
Cons:
If clusters are very different in terms of in-cluster densities, DBSCAN is not well
suited to define clusters. The characteristics of clusters are defined by the
combination of eps-minPts parameters. Since we pass in one eps-minPts
combination to the algorithm, it cannot generalize well to clusters with much
different densities.
Thanks for reading. Please let me know if you have any feedback.
References
https://en.wikipedia.org/wiki/DBSCAN
https://towardsdatascience.com/dbscan-clustering-explained-97556a2ad556 10/11
31/03/2023, 17:51 DBSCAN Clustering — Explained. Detailed theorotical explanation and… | by Soner Yıldırım | Towards Data Science
Give a tip
Every Thursday, the Variable delivers the very best of Towards Data Science: from hands-on tutorials and cutting-
edge research to original features you don't want to miss. Take a look.
By signing up, you will create a Medium account if you don’t already have one. Review
our Privacy Policy for more information about our privacy practices.
https://towardsdatascience.com/dbscan-clustering-explained-97556a2ad556 11/11