Unit 8 DBSCAN
2. Border Point – A point that has fewer than MinPts neighbors but is
within ε of a core point.
3. Noise Point – A point that is neither a core point nor a border point
(an outlier).
Steps of DBSCAN Algorithm
1. Randomly select an unvisited point.
2. If the point is a core point, form a cluster with all density-reachable
points.
3. If the point is not a core point but a border point, it may be added
to an existing cluster.
4. If the point is neither, it is marked as noise.
5. Repeat until all points are visited.
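The steps above can be sketched as a short point-by-point loop. This is a minimal illustration, not a reference implementation: the names `region_query` and `dbscan` are hypothetical, points are visited in index order rather than randomly, and a point is assumed not to count as its own neighbor (the convention used in the worked example of this unit).

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all other points within eps of points[i]."""
    return [j for j, q in enumerate(points)
            if j != i and math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)            # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):             # step 1 (index order, an assumption)
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:         # step 4: provisionally noise
            labels[i] = NOISE
            continue
        cluster_id += 1                      # step 2: i is a core point
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:           # step 3: noise turns out to be a border point
                labels[j] = cluster_id
            if labels[j] is not None:        # already assigned to a cluster
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_neighbors)
    return labels

sample = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (5, 5)]
print(dbscan(sample, eps=1.5, min_pts=2))
```

In this made-up sample the two tight groups become clusters 0 and 1, and the isolated point (5, 5) keeps the noise label.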
Advantages of DBSCAN
• No need to specify the number of clusters.
• Identifies clusters of arbitrary shape.
• Detects outliers as noise.
Disadvantages of DBSCAN
• Struggles with clusters of varying density.
• Sensitive to ε and MinPts values.
• In high-dimensional data, distances become less discriminative, which
makes a suitable ε hard to choose and degrades cluster quality.
Reachability and Connectivity
These are the two concepts that you need to understand before
moving further. Reachability states if a data point can be accessed from
another data point directly or indirectly, whereas Connectivity states
whether two data points belong to the same cluster or not. In terms of
reachability and connectivity, two points in DBSCAN can be referred to
as:
• Directly Density-Reachable
• Density-Reachable
• Density-Connected
A point X is directly density-reachable from point Y w.r.t. ε and
MinPts if:
1. X belongs to the neighborhood of Y, i.e., dist(X, Y) ≤ ε
2. Y is a core point
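A point X is density-reachable from point Y w.r.t. ε and MinPts if there is a chain of
points from Y to X in which each point is directly density-reachable from the previous one.
A point X is density-connected to point Y w.r.t. ε and MinPts if a point O
exists such that both X and Y are density-reachable from O w.r.t. ε and
MinPts.
Here, both X and Y are density-reachable from O; therefore, we can say that X is
density-connected to Y.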
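The three relations can be written as small predicates to show how they differ. This is a sketch under assumptions: the function names are hypothetical, a point is not counted as its own neighbor, and the five collinear points in `line` are made up for illustration.

```python
import math

def neighbors(points, i, eps):
    """Indices of all other points within eps of points[i]."""
    return {j for j in range(len(points))
            if j != i and math.dist(points[i], points[j]) <= eps}

def is_core(points, i, eps, min_pts):
    return len(neighbors(points, i, eps)) >= min_pts

def directly_density_reachable(points, x, y, eps, min_pts):
    """True if x is directly density-reachable from y."""
    return x in neighbors(points, y, eps) and is_core(points, y, eps, min_pts)

def density_reachable_set(points, y, eps, min_pts):
    """All points density-reachable from y (empty if y is not a core point)."""
    reached, frontier = set(), [y]
    while frontier:
        p = frontier.pop()
        if not is_core(points, p, eps, min_pts):
            continue                       # only core points extend the chain
        for q in neighbors(points, p, eps):
            if q not in reached:
                reached.add(q)
                frontier.append(q)
    return reached

def density_connected(points, x, y, eps, min_pts):
    """True if some core point o density-reaches both x and y."""
    return any(
        is_core(points, o, eps, min_pts)
        and {x, y} <= density_reachable_set(points, o, eps, min_pts) | {o}
        for o in range(len(points))
    )

line = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
print(directly_density_reachable(line, 0, 1, 1.0, 2))  # from core point 1 to point 0
print(directly_density_reachable(line, 1, 0, 1.0, 2))  # from non-core point 0 to point 1
print(density_connected(line, 0, 3, 1.0, 2))
```

The example shows that direct density-reachability is not symmetric: the end point 0 is reachable from the core point 1, but not the other way round. The two end points 0 and 3 are still density-connected through the core points between them.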
How Does DBSCAN Work?
DBSCAN works by categorizing data points into three types:
1. core points, which have a sufficient number of neighbors within
a specified radius (ε)
2. border points, which are near core points but lack enough
neighbors to be core points themselves
3. noise points, which do not belong to any cluster.
By iteratively expanding clusters from core points and connecting
density-reachable points, DBSCAN forms clusters without relying on
rigid assumptions about their shape or size.
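As a rough illustration of the three point types, the sketch below labels every point as core, border or noise. The function name `classify` and the sample coordinates are assumptions made for this example, and a point is not counted as its own neighbor.

```python
import math

def classify(points, eps, min_pts):
    """Map each point index to 'core', 'border' or 'noise'."""
    n = len(points)
    nbrs = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(nbrs[i]) >= min_pts}
    kinds = {}
    for i in range(n):
        if i in core:
            kinds[i] = "core"
        elif any(j in core for j in nbrs[i]):
            kinds[i] = "border"   # too few neighbors, but next to a core point
        else:
            kinds[i] = "noise"    # not core and no core point within eps
    return kinds

sample = [(0, 0), (0, 1), (1, 0), (1, 1), (2.5, 1), (6, 6)]
print(classify(sample, eps=1.5, min_pts=3))
```

Here the four corner points of the unit square are core points, (2.5, 1) is a border point because its only neighbor (1, 1) is a core point, and (6, 6) is noise.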
Steps in the DBSCAN Algorithm
1. Identify Core Points: For each point in the dataset, count the
number of points within its ε neighborhood. If the count meets or
exceeds MinPts, mark the point as a core point.
2. Form Clusters: For each core point that is not already assigned to a
cluster, create a new cluster. Recursively find all points that are
density-reachable from it (the points within its ε radius, then the points
within the ε radius of every core point found this way) and add them to
the cluster.
3. Density Connectivity: Two points, a and b, are density-connected if
there exists a core point o from which both are density-reachable, that
is, each is linked to o by a chain in which every point is within the ε
radius of the next and every point except possibly the last is a core
point. This chaining process ensures that all points in a cluster are
connected through a series of dense regions.
4. Label Noise Points: After processing all points, any point that does
not belong to a cluster is labeled as noise.
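The four steps above can also be sketched as a two-pass procedure: first mark all core points, then grow clusters outward from them. This is an illustrative sketch, not a tuned implementation. The name `dbscan_two_pass` and the sample data are hypothetical, neighborhoods are computed by brute force, and a point is not counted as its own neighbor.

```python
import math

def dbscan_two_pass(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    n = len(points)
    nbrs = [
        [j for j in range(n)
         if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    # Step 1: identify core points.
    is_core = [len(nbrs[i]) >= min_pts for i in range(n)]

    labels = [-1] * n          # Step 4: anything never claimed stays noise
    cluster_id = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Steps 2 and 3: grow a new cluster from this unassigned core point.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            p = stack.pop()            # p is always a core point
            for q in nbrs[p]:
                if labels[q] == -1:
                    labels[q] = cluster_id
                    if is_core[q]:     # border points join but do not expand
                        stack.append(q)
        cluster_id += 1
    return labels

sample = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0), (12, 0), (20, 5)]
print(dbscan_two_pass(sample, eps=1.0, min_pts=2))
```

Only core points are pushed onto the stack, so a border point can join a cluster but never pulls further points into it. A border point within ε of core points from two clusters goes to whichever cluster reaches it first.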
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
and K-Means are both clustering algorithms that group together data
points with similar characteristics. However, they work on different
principles and are suitable for different types of data. We prefer to use
DBSCAN when the clusters are not spherical in shape or the number of
clusters is not known beforehand.
DBSCAN vs. K-Means
• DBSCAN: the number of clusters does not need to be specified.
• K-Means: the number of clusters (K) must be specified in advance, and
the result is very sensitive to it.
Proximity matrix
The diagonal elements of this matrix are always 0, as the distance from a point to itself is
always 0. In the above table, each distance ≤ ε (i.e. 2.5) is marked red.
Step 2: Now, find all the data points that lie in the ε-neighborhood of each data
point. That is, put into the neighborhood set of each data point all the points whose
distance from it is ≤ 2.5.
• N(A) = {B}, because the distance from A to B is ≤ 2.5
• N(B) = {A, C}, because the distances from B to A and C are ≤ 2.5
• N(C) = {B, D}, because the distances from C to B and D are ≤ 2.5
• N(D) = {C, E, F, G, H}, because the distances from D to C, E, F, G and H are ≤ 2.5
• N(E) = {D, F, G, H}, because the distances from E to D, F, G and H are ≤ 2.5
• N(F) = {D, E, G}, because the distances from F to D, E and G are ≤ 2.5
• N(G) = {D, E, F, H}, because the distances from G to D, E, F and H are ≤ 2.5
• N(H) = {D, E, G}, because the distances from H to D, E and G are ≤ 2.5
Data points D, E, F, G and H each have at least MinPts (i.e. 3) neighbors and hence are the
core data points.
Data points A, B and C have fewer than MinPts neighbors, so they can't be considered
core points. C lies in the neighborhood of core point D, so it is a border point. A and B
have no core point in their neighborhoods, so under the border-point definition they are
noise points (outliers).
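The classification can be checked mechanically from the neighborhood sets listed above. The sketch below copies those sets and uses MinPts = 3, not counting a point as its own neighbor, as in the text. It applies the border-point definition strictly (a border point must have a core point in its neighborhood), so it reports A and B as noise, because neither has a core point among its neighbors.

```python
# Neighborhood sets N(.) copied from the list above (eps = 2.5).
neighborhoods = {
    "A": {"B"},
    "B": {"A", "C"},
    "C": {"B", "D"},
    "D": {"C", "E", "F", "G", "H"},
    "E": {"D", "F", "G", "H"},
    "F": {"D", "E", "G"},
    "G": {"D", "E", "F", "H"},
    "H": {"D", "E", "G"},
}
MIN_PTS = 3

# Core: at least MIN_PTS neighbors.
core = sorted(p for p, nb in neighborhoods.items() if len(nb) >= MIN_PTS)
# Border: not core, but at least one neighbor is a core point.
border = sorted(p for p, nb in neighborhoods.items()
                if p not in core and nb & set(core))
# Noise: everything else.
noise = sorted(set(neighborhoods) - set(core) - set(border))

print("core:  ", core)
print("border:", border)
print("noise: ", noise)
```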
Numerical Example of DBSCAN in Machine Learning
Let’s go through a numerical example of DBSCAN to understand how it works.
P1(1,1), P2(2,1), P3(2,2), P4(3,2), P5(5,5), P6(6,5), P7(6,6), P8(7,6)
To find the core, border and outlier points using the DBSCAN algorithm, we first need to calculate the distance
between all pairs of the given data points. Let us use the Euclidean distance as the distance measure.
• Step 2: Find Core, Border, and Noise Points
• We calculate the ε-neighborhood for each point (points within distance ≤ 1.5).
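The ε-neighborhoods for this example can be computed directly from the coordinates. This is a small sketch using ε = 1.5 as stated above. The variable names are illustrative, and a point is not listed as its own neighbor.

```python
import math

points = {
    "P1": (1, 1), "P2": (2, 1), "P3": (2, 2), "P4": (3, 2),
    "P5": (5, 5), "P6": (6, 5), "P7": (6, 6), "P8": (7, 6),
}
EPS = 1.5

# For every point, collect the other points within Euclidean distance EPS.
neighborhoods = {
    name: [other for other, q in points.items()
           if other != name and math.dist(p, q) <= EPS]
    for name, p in points.items()
}

for name, nb in neighborhoods.items():
    print(f"N({name}) = {nb}")
```

No neighborhood links a point of P1 to P4 with a point of P5 to P8, so the two groups cannot merge into one cluster.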
Final Clusters
Cluster 1 (C1): {P₁, P₂, P₃, P₄}
Cluster 2 (C2): {P₅, P₆, P₇, P₈}
Noise Points: None
This example shows how DBSCAN groups points based on density without requiring a predefined number of
clusters.
• Perform DBSCAN on the given data with ε = 2 and MinPts = 2.
• Which are the core, border and outlier points?
Data Points X Y
A1 2 10
A2 2 5
A3 8 4
A4 5 8
A5 7 5
A6 6 4
A7 1 2
A8 4 9
• Perform DBSCAN on the given data with ε = 2 and MinPts = 3.
• Apply the DBSCAN algorithm to the given data points with a similarity
threshold ≥ 0.8 and MinPts ≥ 2.
• Which are the core, border and outlier points?
Data Point P1 P2 P3 P4 P5
To find the core, border and outlier points using the DBSCAN algorithm, we first need to calculate the distance
between all pairs of the given data points. Let us use the Euclidean distance as the distance measure.
Data Point X Y
A 3 7
B 4 6
C 5 5
D 6 4
E 7 3
F 6 2
G 7 2
H 8 4
Consider two points $(x_1, y_1)$ and $(x_2, y_2)$ in a 2-dimensional space; the Euclidean distance between
them is given by the formula:
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
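The proximity matrix for points A to H can be computed with the Euclidean distance formula. The sketch below uses the coordinates from the table and marks every off-diagonal distance ≤ 2.5 with an asterisk, standing in for the red marking. The names `data` and `matrix` are illustrative.

```python
import math

data = {
    "A": (3, 7), "B": (4, 6), "C": (5, 5), "D": (6, 4),
    "E": (7, 3), "F": (6, 2), "G": (7, 2), "H": (8, 4),
}
EPS = 2.5
names = list(data)

# matrix[a][b] is the Euclidean distance between points a and b.
matrix = {a: {b: math.dist(data[a], data[b]) for b in names} for a in names}

print("    " + "".join(f"{b:>7}" for b in names))
for a in names:
    cells = "".join(
        f"{matrix[a][b]:6.2f}" + ("*" if a != b and matrix[a][b] <= EPS else " ")
        for b in names
    )
    print(f"{a:<4}{cells}")
```

Reading the asterisks along a row gives the ε-neighborhood of that point.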