Unit 8 DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering algorithm that identifies clusters of arbitrary shapes and detects outliers based on point density. It requires two parameters: ε (epsilon) for neighborhood radius and MinPts for minimum points to form a dense region. The algorithm classifies points into core, border, and noise points, expanding clusters from core points and connecting density-reachable points.

Uploaded by

Juee Jamsandekar

DBSCAN

Density-Based Spatial Clustering of Applications with Noise


• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised machine learning algorithm used for clustering.

• Unlike k-means, which requires specifying the number of clusters beforehand, DBSCAN needs no such number, can discover clusters of arbitrary shape, and can identify outliers (noise points).

• DBSCAN is based on density, where density is the number of points located within a given area.
• DBSCAN groups points based on their density. It requires two main parameters:

ε (epsilon) – The radius within which points are considered neighbors.

MinPts – The minimum number of points required to form a dense region.
• Key Concepts in DBSCAN

1. Core Points – A point is a core point if it has at least MinPts neighbors within ε.

2. Border Points – A point that has fewer than MinPts neighbors but is within ε of a core point.

3. Noise Points – A point that is neither a core point nor a border point (an outlier).
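The three point types can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `classify_points` and the toy coordinates are hypothetical, and a point's neighbors are counted excluding the point itself, which is an assumption (some implementations count the point as well).

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise'.

    Assumption: a point's neighbors are the OTHER points within eps
    (the point itself is not counted).
    """
    n = len(points)
    # neighbors[i] = indices of all other points within eps of point i
    neighbors = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= min_pts}
    labels = []
    for i in range(n):
        if i in core:
            labels.append("core")
        elif any(j in core for j in neighbors[i]):
            labels.append("border")  # not dense itself, but next to a core point
        else:
            labels.append("noise")
    return labels

# Hypothetical toy data: a tight group of four points, one point on its
# edge, and one far-away point.
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
print(classify_points(toy, eps=1.0, min_pts=2))
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point (2, 1) has only one neighbor, so it is not a core point, but that neighbor (1, 1) is a core point, which makes it a border point.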
Steps of DBSCAN Algorithm
1. Randomly select an unvisited point.
2. If the point is a core point, form a cluster with all density-reachable
points.
3. If the point is not a core point but lies within ε of a core point, it is a
border point and is added to that core point's cluster.
4. If the point is neither, it is marked as noise.
5. Repeat until all points are visited.
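The steps above can be sketched as a small pure-Python routine. It is an illustrative sketch under stated assumptions, not a production implementation: the name `dbscan` is hypothetical, points are visited in list order instead of randomly, neighbor counts exclude the point itself, and a non-core point is first marked as noise and relabelled as a border point if a later cluster expansion reaches it.

```python
import math
from collections import deque

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point (0, 1, ...) or NOISE (-1).

    Assumptions: points are visited in list order rather than randomly,
    and a point's neighbors exclude the point itself.
    """
    n = len(points)

    def region(i):
        # Indices of the other points within eps of point i.
        return [j for j in range(n)
                if j != i and math.dist(points[i], points[j]) <= eps]

    labels = [None] * n          # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE    # provisional: may become a border point later
            continue
        cluster += 1             # i is a core point: start a new cluster
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster      # former noise becomes a border point
            if labels[j] is not None:
                continue                 # already assigned to a cluster
            labels[j] = cluster
            nbrs = region(j)
            if len(nbrs) >= min_pts:     # j is also core: keep expanding
                queue.extend(nbrs)
    return labels

# The eight points P1..P8 used in a later numerical example (eps = 1.5, MinPts = 3).
pts = [(1, 1), (2, 1), (2, 2), (3, 2), (5, 5), (6, 5), (6, 6), (7, 6)]
print(dbscan(pts, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Each call to `region` scans every point, so this sketch does quadratic work overall; it favors readability over speed.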
Advantages of DBSCAN
• No need to specify the number of clusters.
• Identifies clusters of arbitrary shape.
• Detects outliers as noise.

Disadvantages of DBSCAN
• Struggles with clusters of varying density.
• Sensitive to ε and MinPts values.
• Works less well on high-dimensional data, where distances become less informative and a suitable ε is hard to choose.
Reachability and Connectivity

These are the two concepts that you need to understand before
moving further. Reachability states whether a data point can be reached
from another data point directly or indirectly, whereas connectivity
states whether two data points belong to the same cluster. In terms of
reachability and connectivity, a pair of points in DBSCAN can be:
• Directly Density-Reachable
• Density-Reachable
• Density-Connected
A point X is directly density-reachable from a point Y w.r.t. ε and
MinPts if:
• 1. X belongs to the neighborhood of Y, i.e., dist(X, Y) <= ε
• 2. Y is a core point

• Here, X is directly density-reachable from Y, but the converse holds only if X is also a core point.


A point X is density-reachable from a point Y w.r.t. ε and MinPts if
there is a chain of points p1, p2, p3, …, pn with p1 = Y and pn = X such that
pi+1 is directly density-reachable from pi.

• Here, X is density-reachable from Y, with X being directly density-reachable from P2, P2 from P3, and P3 from Y. The converse does not hold in general, because X need not be a core point.
A point X is density-connected to a point Y w.r.t. ε and MinPts if a point O exists such that both X and Y are density-reachable from O w.r.t. ε and MinPts.

Here, both X and Y are density-reachable from O; therefore, X is density-connected to Y. Unlike density-reachability, density-connectivity is symmetric.
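The three relations can be written as small predicates. The sketch below is illustrative only: all function names and the toy data are hypothetical, neighbors exclude the point itself (an assumption), and every point is treated as trivially density-reachable from itself.

```python
import math
from collections import deque

def neighbors(points, i, eps):
    """Indices of the other points within eps of point i (self excluded: an assumption)."""
    return [j for j in range(len(points))
            if j != i and math.dist(points[i], points[j]) <= eps]

def is_core(points, i, eps, min_pts):
    return len(neighbors(points, i, eps)) >= min_pts

def directly_reachable(points, x, y, eps, min_pts):
    """True if x is directly density-reachable from y."""
    return x in neighbors(points, y, eps) and is_core(points, y, eps, min_pts)

def reachable_set(points, y, eps, min_pts):
    """All points density-reachable from y, including y itself (one-point chain)."""
    seen, queue = {y}, deque([y])
    while queue:
        p = queue.popleft()
        if not is_core(points, p, eps, min_pts):
            continue                 # only core points can extend the chain
        for q in neighbors(points, p, eps):
            if q not in seen:
                seen.add(q)
                queue.append(q)
    return seen

def density_connected(points, x, y, eps, min_pts):
    """True if some point o has both x and y density-reachable from it."""
    return any({x, y} <= reachable_set(points, o, eps, min_pts)
               for o in range(len(points)))

# Hypothetical toy data: four points in a row plus one far-away point.
line = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(directly_reachable(line, 0, 1, eps=1.0, min_pts=2))  # True: point 1 is core
print(directly_reachable(line, 1, 0, eps=1.0, min_pts=2))  # False: point 0 is not core
print(density_connected(line, 0, 3, eps=1.0, min_pts=2))   # True: both reachable from point 1
```

The first two calls show the asymmetry of direct density-reachability, while the third shows two non-core end points being density-connected through the core points between them.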
How Does DBSCAN Work?
DBSCAN works by categorizing data points into three types:
1. core points, which have a sufficient number of neighbors within
a specified radius (ε)
2. border points, which are near core points but lack enough
neighbors to be core points themselves
3. noise points, which do not belong to any cluster.
By iteratively expanding clusters from core points and connecting
density-reachable points, DBSCAN forms clusters without relying on
rigid assumptions about their shape or size.
Steps in the DBSCAN Algorithm
1. Identify Core Points: For each point in the dataset, count the
number of points within its ε neighborhood. If the count meets or
exceeds MinPts, mark the point as a core point.
2. Form Clusters: For each core point that is not already assigned to a
cluster, create a new cluster. Recursively find all points that are
density-reachable from it (the points within its ε radius, then the points
within the ε radius of every core point found along the way) and add them to
the cluster.
3. Density Connectivity: Two points, a and b, are density-connected if
there exists a core point o from which both a and b are density-reachable,
that is, each can be reached from o through a chain of core points whose
consecutive members lie within the ε radius of one another. This
chaining process ensures that all points in a cluster are connected
through a series of dense regions.
4. Label Noise Points: After processing all points, any point that does
not belong to a cluster is labeled as noise.
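The four steps map onto a two-phase sketch: mark the core points first, then grow clusters from them. The code is a minimal illustration under stated assumptions: the name `dbscan_two_phase` and the toy data are hypothetical, neighbor counts exclude the point itself, and a border point joins whichever cluster reaches it first.

```python
import math

def dbscan_two_phase(points, eps, min_pts):
    """Cluster label (0, 1, ...) per point, or -1 for noise.

    Assumption: neighbor counts exclude the point itself.
    """
    n = len(points)
    nbrs = [[j for j in range(n)
             if j != i and math.dist(points[i], points[j]) <= eps]
            for i in range(n)]

    # Step 1: identify core points.
    core = [len(nbrs[i]) >= min_pts for i in range(n)]

    # Step 2: grow one cluster from each core point not yet assigned.
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]                          # holds core points only
        while stack:
            p = stack.pop()
            for q in nbrs[p]:
                if labels[q] == -1:
                    labels[q] = cluster      # core or border point joins
                    if core[q]:
                        stack.append(q)      # Step 3: chain through core points only
        cluster += 1

    # Step 4: anything still labeled -1 was never reached, so it is noise.
    return labels

# Hypothetical toy data: two square groups and one isolated point.
toy = [(0, 0), (0, 1), (1, 0), (1, 1),
       (5, 5), (5, 6), (6, 5), (6, 6),
       (3, 3)]
print(dbscan_two_phase(toy, eps=1.0, min_pts=2))
# [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Because only core points are pushed onto the stack, a border point can join a cluster but never extends it.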
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
and K-Means are both clustering algorithms that group together data
points with similar characteristics. However, they work on different
principles and suit different types of data. DBSCAN is preferred when
the clusters are not spherical in shape or the number of clusters is
not known beforehand.
DBSCAN | K-Means
In DBSCAN we need not specify the number of clusters. | K-Means is very sensitive to the number of clusters, so it needs to be specified.
Clusters formed in DBSCAN can be of any arbitrary shape. | Clusters formed in K-Means are spherical or convex in shape.
DBSCAN can work well with datasets having noise and outliers. | K-Means does not work well with outlier data. Outliers can skew the clusters in K-Means to a very large extent.
In DBSCAN two parameters (ε and MinPts) are required for training the model. | In K-Means only one parameter (the number of clusters) is required for training the model.
Numerical Example
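Q. Given the points A(3, 7), B(4, 6), C(5, 5), D(6, 4), E(7, 3), F(6, 2), G(7, 2) and H(8, 4),
find the core points and outliers using DBSCAN. Take ε = 2.5 and MinPts = 3.
Solution:
Given, Epsilon (Eps) = 2.5
Minimum Points (MinPts) = 3
Let’s represent the given data points in tabular form:

Data Point   X   Y
A            3   7
B            4   6
C            5   5
D            6   4
E            7   3
F            6   2
G            7   2
H            8   4

• Step 1: To find the core points, outliers and clusters using DBSCAN,
we first need to calculate the distance between all pairs of the given data
points. Let us use the Euclidean distance measure for the distance calculation.
The final distance matrix is shown below:

Proximity matrix

     A      B      C      D      E      F      G      H
A  0.00   1.41*  2.83   4.24   5.66   5.83   6.40   5.83
B  1.41*  0.00   1.41*  2.83   4.24   4.47   5.00   4.47
C  2.83   1.41*  0.00   1.41*  2.83   3.16   3.61   3.16
D  4.24   2.83   1.41*  0.00   1.41*  2.00*  2.24*  2.00*
E  5.66   4.24   2.83   1.41*  0.00   1.41*  1.00*  1.41*
F  5.83   4.47   3.16   2.00*  1.41*  0.00   1.00*  2.83
G  6.40   5.00   3.61   2.24*  1.00*  1.00*  0.00   2.24*
H  5.83   4.47   3.16   2.00*  1.41*  2.83   2.24*  0.00

The diagonal elements of this matrix are always 0, as the distance of a point from itself is
always 0. In the above table, distances ≤ Epsilon (i.e. 2.5) are marked with an asterisk (*).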
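The distance calculation in Step 1 can be reproduced with a short script. This is a sketch for checking the arithmetic: the variable names are illustrative, and entries within Eps are flagged with an asterisk as a plain-text marker.

```python
import math

points = {"A": (3, 7), "B": (4, 6), "C": (5, 5), "D": (6, 4),
          "E": (7, 3), "F": (6, 2), "G": (7, 2), "H": (8, 4)}
EPS = 2.5

names = list(points)
# dist[p][q] = Euclidean distance between points p and q
dist = {p: {q: math.dist(points[p], points[q]) for q in names} for p in names}

# Print the proximity matrix; '*' flags off-diagonal entries within EPS.
print("   " + "".join(f"{q:>7}" for q in names))
for p in names:
    cells = []
    for q in names:
        mark = "*" if p != q and dist[p][q] <= EPS else " "
        cells.append(f"{dist[p][q]:6.2f}{mark}")
    print(f"{p:<3}" + "".join(cells))
```

For example, `dist["A"]["B"]` is √2 ≈ 1.41 and `dist["D"]["F"]` is exactly 2.0, both within Eps = 2.5.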
Step 2: Now find all the data points that lie in the Eps-neighborhood of each data
point. That is, put into the neighborhood set of each data point all the points whose
distance from it is <= 2.5.
• N(A) = {B} → because the distance of B from A is <= 2.5
• N(B) = {A, C} → because the distance of A and C from B is <= 2.5
• N(C) = {B, D} → because the distance of B and D from C is <= 2.5
• N(D) = {C, E, F, G, H} → because the distance of C, E, F, G and H from D is <= 2.5
• N(E) = {D, F, G, H} → because the distance of D, F, G and H from E is <= 2.5
• N(F) = {D, E, G} → because the distance of D, E and G from F is <= 2.5
• N(G) = {D, E, F, H} → because the distance of D, E, F and H from G is <= 2.5
• N(H) = {D, E, G} → because the distance of D, E and G from H is <= 2.5
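Data points D, E, F, G and H have at least MinPts (i.e. 3) neighbors and hence are the core data
points.
Data points A, B and C have fewer than MinPts (i.e. 3) neighbors, so they can’t be considered
core points. C lies within Eps of the core point D, so it is a border point. A and B have no core
point in their neighborhoods, so under this counting convention (neighbors exclude the point
itself) they are outliers. If a point is instead counted in its own neighborhood, B and C also
reach MinPts, A becomes a border point of B, and no outliers remain.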
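Step 2 can be checked with a short script that builds each Eps-neighborhood and lists the points reaching MinPts. It is a sketch with illustrative variable names. The second count, in which a point also counts itself toward MinPts, is an assumption added for comparison, since implementations differ on this convention.

```python
import math

points = {"A": (3, 7), "B": (4, 6), "C": (5, 5), "D": (6, 4),
          "E": (7, 3), "F": (6, 2), "G": (7, 2), "H": (8, 4)}
EPS, MIN_PTS = 2.5, 3

# Eps-neighborhood of each point, excluding the point itself
hood = {p: [q for q in points
            if q != p and math.dist(points[p], points[q]) <= EPS]
        for p in points}

for p, qs in hood.items():
    print("N(" + p + ") = {" + ", ".join(qs) + "}")

# Convention used in the worked example: count only the other points.
core_excl = [p for p in points if len(hood[p]) >= MIN_PTS]
# Alternative convention (assumption): the point counts toward MinPts too.
core_incl = [p for p in points if len(hood[p]) + 1 >= MIN_PTS]

print(core_excl)  # ['D', 'E', 'F', 'G', 'H']
print(core_incl)  # ['B', 'C', 'D', 'E', 'F', 'G', 'H']
```

The choice of convention changes which points are core points, so it is worth stating it explicitly whenever MinPts is quoted.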
Numerical Example of DBSCAN in Machine Learning
Let’s go through a numerical example of DBSCAN to understand how it works.

Given Data Points:

We have the following 8 points in a 2D space:

P1(1,1),P2(2,1),P3(2,2),P4(3,2),P5(5,5),P6(6,5),P7(6,6),P8(7,6)

Step 1: Define Parameters


ε (epsilon) = 1.5 (Neighborhood radius)
MinPts = 3 (Minimum points to form a cluster)

To find the core, border and outlier points using the DBSCAN algorithm, we first calculate the distance
between all pairs of the given data points. Let us use the Euclidean distance measure for the distance calculation.
• Step 2: Find Core, Border, and Noise Points

• We calculate the ε-neighborhood for each point (points within distance ≤ 1.5).

Point      ε-Neighborhood    # of Neighbors   Type
P₁ (1,1)   {P₂, P₃}          2                Border
P₂ (2,1)   {P₁, P₃, P₄}      3                Core
P₃ (2,2)   {P₁, P₂, P₄}      3                Core
P₄ (3,2)   {P₂, P₃}          2                Border
P₅ (5,5)   {P₆, P₇}          2                Border
P₆ (6,5)   {P₅, P₇, P₈}      3                Core
P₇ (6,6)   {P₅, P₆, P₈}      3                Core
P₈ (7,6)   {P₆, P₇}          2                Border

P₁ and P₅ have fewer than MinPts neighbors, but each lies within ε of a core point (P₂ and P₆ respectively), so both are border points rather than noise.
Step 3: Form Clusters
• Start with P₂ (Core Point) → Expand cluster with P₃, P₄, P₁ → Cluster C1 =
{P₂, P₃, P₄, P₁}.
• Move to P₆ (Core Point) → Expand with P₅, P₇, P₈ → Cluster C2 = {P₅, P₆, P₇,
P₈}.
• Any remaining point that is not in a cluster is marked as Noise (in this case,
none).

Final Clusters
Cluster 1 (C1): {P₁, P₂, P₃, P₄}
Cluster 2 (C2): {P₅, P₆, P₇, P₈}
Noise Points: None
This example shows how DBSCAN groups points based on density without requiring a predefined number of
clusters.
• Perform DBSCAN on the given problem with Epsilon = 2 and
minimum points = 2.
• What are the core, border and outlier points?

Data Points X Y
A1 2 10
A2 2 5
A3 8 4
A4 5 8
A5 7 5
A6 6 4
A7 1 2
A8 4 9
• Perform DBSCAN on the given problem with Epsilon = 2 and
minimum points = 3.
• Apply the DBSCAN algorithm with a similarity threshold of >= 0.8 and
MinPts >= 2 to the given data points.
• What are the core, border and outlier points?

Data Point P1 P2 P3 P4 P5

P1 1.00 0.10 0.41 0.55 0.35

P2 0.10 1.00 0.64 0.47 0.98

P3 0.41 0.64 1.00 0.44 0.85

P4 0.55 0.47 0.44 1.00 0.76

P5 0.35 0.98 0.85 0.76 1.00


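Neighbors of each point (similarity >= 0.8):
• P1 - (none)
• P2 - P5
• P3 - P5
• P4 - (none)
• P5 - P2, P3

Data Point   Status
P1           Noise
P2           Core
P3           Core
P4           Noise
P5           Core

• Here each point counts itself toward MinPts, so P2 and P3 (one neighbor each, P5) reach MinPts = 2.
• No border point in the given data set.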
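The same classification can be sketched directly from the similarity matrix. Here a neighbor is any other point whose similarity is at least the threshold (a larger value means closer, the opposite of a distance). The variable names are illustrative, and counting the point itself toward MinPts is an assumption, chosen because it reproduces the statuses listed above.

```python
names = ["P1", "P2", "P3", "P4", "P5"]
sim = [
    [1.00, 0.10, 0.41, 0.55, 0.35],
    [0.10, 1.00, 0.64, 0.47, 0.98],
    [0.41, 0.64, 1.00, 0.44, 0.85],
    [0.55, 0.47, 0.44, 1.00, 0.76],
    [0.35, 0.98, 0.85, 0.76, 1.00],
]
THRESHOLD, MIN_PTS = 0.8, 2

n = len(names)
# Neighbors are the other points whose similarity meets the threshold.
nbrs = [[j for j in range(n) if j != i and sim[i][j] >= THRESHOLD]
        for i in range(n)]
# Assumption: a point counts itself toward MinPts.
core = [len(nbrs[i]) + 1 >= MIN_PTS for i in range(n)]

status = {}
for i, name in enumerate(names):
    if core[i]:
        status[name] = "Core"
    elif any(core[j] for j in nbrs[i]):
        status[name] = "Border"   # not core, but similar enough to a core point
    else:
        status[name] = "Noise"

print(status)
# {'P1': 'Noise', 'P2': 'Core', 'P3': 'Core', 'P4': 'Noise', 'P5': 'Core'}
```

P4 stays noise because its highest similarity to another point, 0.76 with P5, falls below the 0.8 threshold.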


Numerical Example of DBSCAN in Machine Learning
Let’s go through a numerical example of DBSCAN to understand how it works.

Given Data Points:

We have the following 8 points in a 2D space:


A(3,7), B(4,6),C(5,5),D(6,4),E(7,3),F(6,2),G(7,2),H(8,4)

Step 1: Define Parameters


ε (epsilon) = 2.5 (Neighborhood radius)
MinPts = 3 (Minimum points to form a cluster)

To find the core, border and outlier points using the DBSCAN algorithm, we first calculate the distance
between all pairs of the given data points. Let us use the Euclidean distance measure for the distance calculation.
Data Point   X   Y
A            3   7
B            4   6
C            5   5
D            6   4
E            7   3
F            6   2
G            7   2
H            8   4

Consider two points $(x_1, y_1)$ and $(x_2, y_2)$ in a 2-dimensional space; the Euclidean distance between them is given by the formula:

$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
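As a worked instance of the Euclidean distance formula, take the first two points in the table, A(3, 7) and B(4, 6):

```latex
d(A, B) = \sqrt{(4 - 3)^2 + (6 - 7)^2} = \sqrt{1 + 1} = \sqrt{2} \approx 1.41
```

Since 1.41 ≤ ε = 2.5, B lies in the ε-neighborhood of A.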
• Thank you
