
Published in Towards Data Science

Soner Yıldırım

Apr 22, 2020 · 7 min read

DBSCAN Clustering — Explained


Detailed theoretical explanation and scikit-learn implementation
Clustering is a way of grouping a set of data points so that similar points end up in the same group. Clustering algorithms therefore look for similarities or dissimilarities among data points. Clustering is an unsupervised learning method, so there are no labels associated with the data points; the algorithm tries to find the underlying structure of the data.


Photo by Jan Meeus on Unsplash

There are different approaches and algorithms for clustering tasks, which can be divided into three sub-categories:

Partition-based clustering: E.g. k-means, k-median

Hierarchical clustering: E.g. Agglomerative, Divisive

Density-based clustering: E.g. DBSCAN

In this post, I will try to explain the DBSCAN algorithm in detail. If you would like to read about other types of clustering algorithms, you can also visit the following posts:

K-Means Clustering — Explained
Detailed theoretical explanation and scikit-learn implementation
towardsdatascience.com

Hierarchical Clustering — Explained
Theoretical explanation and scikit-learn example
towardsdatascience.com

Density-based clustering
Partition-based and hierarchical clustering techniques are highly efficient with normally shaped (compact, roughly convex) clusters. However, when it comes to arbitrarily shaped clusters or detecting outliers, density-based techniques are more effective.

For example, the dataset in the figure below can easily be divided into three clusters using the k-means algorithm.

k-means clustering

Consider the following figures:


The data points in these figures are grouped in arbitrary shapes or include outliers. Density-based clustering algorithms are very efficient at finding high-density regions and outliers. Detecting outliers is very important for some tasks, e.g. anomaly detection.


DBSCAN algorithm
DBSCAN stands for density-based spatial clustering of applications with noise. It is able to find arbitrarily shaped clusters as well as noise (i.e. outliers).

The main idea behind DBSCAN is that a point belongs to a cluster if it is close to
many points from that cluster.

There are two key parameters of DBSCAN:

eps: The distance that defines the neighborhoods. Two points are considered neighbors if the distance between them is less than or equal to eps.

minPts: The minimum number of points (including the point itself) required within a point's eps-neighborhood for that point to be a core point, and hence to seed a cluster.

Based on these two parameters, points are classified as core point, border point, or
outlier:

Core point: A point is a core point if there are at least minPts points (including the point itself) within its surrounding area with radius eps.

Border point: A point is a border point if it is reachable from a core point but has fewer than minPts points within its surrounding area.

Outlier: A point is an outlier if it is not a core point and is not reachable from any core point.

These points may be better explained with visualizations. The following figure is
taken from Wikipedia:


Figure source

In this case, minPts is 4. Red points are core points because there are at least 4 points within their surrounding area with radius eps. This area is shown with the circles in the figure. The yellow points are border points because they are reachable from a core point but have fewer than 4 points within their neighborhood. Reachable means being in the surrounding area of a core point. Points B and C have two points (including the point itself) within their neighborhood (i.e. the surrounding area with a radius of eps). Finally, N is an outlier because it is not a core point and cannot be reached from any core point.
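These classification rules can also be sketched directly in code. The following is a toy example, not part of the original article: the dataset, the eps and minPts values, and the helper name classify_points are all illustrative.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'outlier'."""
    n = len(X)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Neighborhood counts include the point itself (distance 0 <= eps)
    neighbor_counts = (dists <= eps).sum(axis=1)
    is_core = neighbor_counts >= min_pts
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif np.any(is_core & (dists[i] <= eps)):
            labels.append("border")   # reachable from some core point
        else:
            labels.append("outlier")  # not core, not reachable
    return labels

# Toy data: a dense blob, a point on its edge, and a far-away point
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
              [0.45, 0.45], [5, 5]])
print(classify_points(X, eps=0.5, min_pts=4))
# ['core', 'core', 'core', 'core', 'border', 'outlier']
```

The four blob points each have at least 4 neighbors within eps, the edge point only reaches one core point, and the distant point reaches none.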

We have learned the definitions of the parameters and the different types of points. Now we can talk about how the algorithm works. It is actually quite simple:

minPts and eps are determined.

A starting point is selected at random and its neighborhood is determined using radius eps. If there are at least minPts points in the neighborhood, the point is marked as a core point and a cluster formation starts. If not, the point is marked as noise. Once a cluster formation starts (let's say cluster A), all the points within the neighborhood of the initial point become part of cluster A. If these new points are also core points, the points in their neighborhoods are added to cluster A as well.

Note: A point initially marked as noise may be revisited later and become part of a cluster (as a border point).

The next step is to randomly choose another point among the points that have not been visited in the previous steps. The same procedure then applies.

This process is finished when all points are visited.

The distance between points is determined using a distance measurement method, as in the k-means algorithm. The most commonly used method is Euclidean distance.
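For instance, the Euclidean distance between two points is the square root of the sum of squared coordinate differences, which NumPy computes directly:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: sqrt((4-1)^2 + (6-2)^2) = sqrt(9 + 16)
dist = np.linalg.norm(a - b)
print(dist)  # 5.0
```

scikit-learn's DBSCAN exposes this choice through its metric parameter, which defaults to 'euclidean'.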

By applying these steps, the DBSCAN algorithm is able to find high-density regions and separate them from low-density regions.

A cluster includes core points that are neighbors (i.e. reachable from one another) together with all the border points of these core points. The required condition to form a cluster is to have at least one core point. Although very unlikely, we may have a cluster with only one core point and its border points.
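The steps above can be condensed into a minimal from-scratch sketch. This is my own illustrative implementation, not the article's or scikit-learn's: the function name, the queue-based expansion, and the toy data are all assumptions, and the optimized scikit-learn version shown below should be preferred in practice.

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point (-1 = noise)."""
    n = len(X)
    # Pairwise Euclidean distances and eps-neighborhood index lists
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue  # already clustered, or cannot seed a cluster
        labels[i] = cluster_id  # start a new cluster from core point i
        queue = [i]
        while queue:
            p = queue.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    if core[q]:  # only core points keep expanding
                        queue.append(q)
        cluster_id += 1
    return labels

# Two dense blobs and one isolated point
X = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1],
              [5, 5], [5.1, 5], [5, 5.1], [5.1, 5.1],
              [10, 0]])
print(dbscan(X, eps=0.5, min_pts=4))  # clusters 0 and 1, plus one noise point
```

Border points are absorbed into a cluster when a core point reaches them, but they do not expand the cluster further, which matches the definitions above.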

Scikit-learn implementation
We start by creating a sample dataset using the datasets module of scikit-learn. After creating the sample data points, we will normalize the values using the StandardScaler class from the preprocessing module of scikit-learn.

Note: It is important to normalize the values because it makes it easier to find an appropriate distance for the neighborhood radius (eps).

Let’s first import the libraries:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

%matplotlib inline

We will create a dataset with 3 clusters, each with a standard deviation of 0.5. The number of samples is 400, and we also choose three points as centroids (cluster centers). You can adjust these parameters however you like using the make_blobs function.


# Determine centroids
centers = [[0.5, 2], [-1, -1], [1.5, -1]]

# Create dataset
X, y = make_blobs(n_samples=400, centers=centers,
                  cluster_std=0.5, random_state=0)

# Normalize the values
X = StandardScaler().fit_transform(X)

We can also plot the dataset to see how each cluster looks:

plt.figure(figsize=(10,6))
plt.scatter(X[:,0], X[:,1], c=y, cmap='Paired')

We can now create a DBSCAN object and fit the data:

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.4, min_samples=20)
db.fit(X)


We just need to set the eps and minPts values using the eps and min_samples parameters.

Note: We do not have to specify the number of clusters for DBSCAN which is a great
advantage of DBSCAN over k-means clustering.
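As an illustration, the number of clusters DBSCAN found can be recovered from the fitted labels after the fact. This sketch repeats the dataset setup so it runs standalone; the counting idiom is a common convention, not part of the original article.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same synthetic dataset as above
centers = [[0.5, 2], [-1, -1], [1.5, -1]]
X, _ = make_blobs(n_samples=400, centers=centers,
                  cluster_std=0.5, random_state=0)
X = StandardScaler().fit_transform(X)

labels = DBSCAN(eps=0.4, min_samples=20).fit_predict(X)

# Count distinct labels, excluding the noise label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)  # 3
```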

Let’s visualize the clusters determined by DBSCAN:

y_pred = db.fit_predict(X)

plt.figure(figsize=(10,6))
plt.scatter(X[:,0], X[:,1],c=y_pred, cmap='Paired')
plt.title("Clusters determined by DBSCAN")

DBSCAN was able to detect the outliers (marked in blue). We can access the labels of the data points using the labels_ attribute. Noise points (outliers) are given the label -1. Let's check the number of outliers:

db.labels_[db.labels_ == -1].size
18
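All label sizes can be inspected at once with np.unique. This is a small add-on sketch (the dataset setup is repeated so it runs standalone); the exact counts depend on the generated data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[0.5, 2], [-1, -1], [1.5, -1]]
X, _ = make_blobs(n_samples=400, centers=centers,
                  cluster_std=0.5, random_state=0)
X = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.4, min_samples=20).fit_predict(X)

# Label values and their sizes; -1 collects the noise points
values, counts = np.unique(labels, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```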


The clusters in this sample dataset do not actually have arbitrary shapes, but DBSCAN performed really well at detecting the outliers, which would not be easy with partition-based (e.g. k-means) or hierarchical (e.g. agglomerative) clustering techniques. If you apply DBSCAN to a dataset with arbitrarily shaped clusters, you will see its success there as well.

Pros and Cons of DBSCAN


Pros:

Does not require the number of clusters to be specified beforehand.

Performs well with arbitrarily shaped clusters.

Robust to outliers and able to detect them.

Cons:

In some cases, determining an appropriate neighborhood distance (eps) is not easy and requires domain knowledge.

If clusters differ greatly in within-cluster density, DBSCAN is not well suited to defining them. The characteristics of the clusters are defined by the combination of the eps and minPts parameters. Since we pass a single eps-minPts combination to the algorithm, it cannot generalize well to clusters with very different densities.
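This sensitivity can be illustrated by sweeping eps and watching the number of clusters and noise points change. The sweep below is my own sketch on synthetic data; the eps values are arbitrary.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.5, random_state=0)
X = StandardScaler().fit_transform(X)

results = {}
for eps in [0.1, 0.2, 0.4, 0.8]:
    labels = DBSCAN(eps=eps, min_samples=20).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

A too-small eps marks most points as noise, while a too-large eps merges distinct clusters, which is why choosing eps often requires domain knowledge.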

Thanks for reading. Please let me know if you have any feedback.

References

https://en.wikipedia.org/wiki/DBSCAN

