DBSCAN Clustering
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
is a popular clustering algorithm that groups together closely packed
points in a dataset and marks points in low-density regions as outliers
or noise. Unlike traditional clustering methods like K-means, which
require the number of clusters to be pre-defined, DBSCAN uses
density to determine the number of clusters and the size of each
cluster.
Key Concepts:
1. Core Points: Points that have at least a specified number of
neighboring points within a given radius (ε). These points form the
"core" of a cluster.
2. Border Points: Points that are within the ε radius of a core point but
do not have enough neighbors to be core points themselves.
3. Noise Points (Outliers): Points that do not belong to any cluster.
These points do not meet the requirements for being core or border
points.
Parameters:
• ε (epsilon): The maximum distance between two points to be
considered neighbors.
• MinPts (minimum points): The minimum number of points required
to form a dense region or a core point.
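In scikit-learn, ε and MinPts correspond to the eps and min_samples arguments of sklearn.cluster.DBSCAN (min_samples counts the point itself). A minimal sketch, assuming scikit-learn is installed and using illustrative values:

from sklearn.cluster import DBSCAN

# eps plays the role of ε, min_samples the role of MinPts
db = DBSCAN(eps=2.0, min_samples=2)
labels = db.fit_predict([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
print(labels)  # one cluster index per point; -1 marks noise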
How DBSCAN Works:
1. Initialize: Start with an arbitrary unvisited point.
2. Find neighbors: Identify all points within the ε radius.
3. Core point identification: If the number of points in its ε-neighborhood (including the point itself) is at least MinPts, the point becomes a core point and starts a new cluster.
4. Expand clusters: All points that are directly reachable from core points
(including border points) are added to the cluster.
5. Mark outliers: Points that are not reachable from any core points (i.e.,
neither core nor border points) are classified as outliers or noise.
6. Repeat: Continue the process for all points until all points are either
assigned to a cluster or marked as noise.
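The steps above can be condensed into a compact from-scratch sketch (illustrative only; the function dbscan and the helper region_query are named for this example, and neighborhood counts include the point itself):

import numpy as np

def dbscan(points, eps, min_pts):
    points = np.asarray(points, dtype=float)
    n = len(points)
    labels = [None] * n              # None = unvisited, -1 = noise, 0..k = cluster id

    def region_query(i):
        # Step 2: indices of all points within ε of point i (includes i itself)
        return list(np.where(np.linalg.norm(points - points[i], axis=1) <= eps)[0])

    cluster_id = -1
    for i in range(n):                        # Steps 1 and 6: visit every point once
        if labels[i] is not None:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:          # Step 5: not a core point, tentatively noise
            labels[i] = -1
            continue
        cluster_id += 1                       # Step 3: i is a core point, start a cluster
        labels[i] = cluster_id
        seeds = [j for j in neighbors if j != i]
        while seeds:                          # Step 4: expand the cluster
            j = seeds.pop()
            if labels[j] == -1:               # previously noise, now a border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:   # j is also a core point, keep expanding
                seeds.extend(j_neighbors)
    return labels

print(dbscan([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], eps=2.0, min_pts=2))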
Advantages
• No need to pre-define the number of clusters: Unlike algorithms like
K-means, DBSCAN automatically finds clusters based on data density.
• Can identify arbitrarily shaped clusters: It works well for clusters of different shapes, unlike K-means, which assumes roughly spherical clusters (see the sketch after this list).
• Handles noise well: DBSCAN naturally identifies noise points, making
it robust to outliers.
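A quick sketch of the "arbitrarily shaped clusters" point above, using scikit-learn's make_moons toy data (parameter values are illustrative, not tuned):

from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler

# Two interleaving half-moons: K-means is forced to cut them with a straight
# boundary, while DBSCAN can follow the curved, dense shapes.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = StandardScaler().fit_transform(X)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
print("K-means cluster sizes:", sorted((kmeans_labels == c).sum() for c in set(kmeans_labels)))
print("DBSCAN cluster sizes: ", sorted((dbscan_labels == c).sum() for c in set(dbscan_labels) if c != -1))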
Disadvantages
• Sensitive to ε and MinPts values: The algorithm’s results depend strongly on the choice of ε and MinPts; setting them poorly can lead to bad clustering results (see the sketch after this list).
• Doesn't perform well with clusters of varying densities: If the dataset
contains clusters with varying densities, DBSCAN may not perform
well, as a single set of parameters may not work for all clusters.
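A small sketch of the sensitivity point above: the same toy data clustered with different ε values (chosen only for illustration) can swing from "everything is noise" to "everything is one cluster":

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
for eps in (0.5, 2.0, 100.0):
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X)
    n_clusters = len(set(labels) - {-1})
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps:>5}: {n_clusters} cluster(s), {n_noise} noise point(s)")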
How to Find ε (Epsilon)? Step 1
• Finding an optimal ε can be challenging because it depends on the dataset's structure. A typical way to determine an appropriate ε is the k-distance graph approach.
• Step 1: Calculate the Distance Between Each Point and Its Nearest Neighbors
• For the example dataset {(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)}, we compute the distance between each point and its k-th nearest neighbor, where k = MinPts − 1 = 1 in our case.
• We calculate the Euclidean distance between each pair of points, as in the sketch below.
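Step 1 as code, for the example dataset above (a minimal numpy sketch):

import numpy as np

points = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
# dist[i, j] is the Euclidean distance between point i and point j
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
print(np.round(dist, 2))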
How to Find ε (Epsilon)? Step 2
• Step 2: Find the k-th Nearest Neighbor Distances (k = 1)
• Now, for each point, we find the distance to its nearest neighbor (the smallest distance, since k = 1) and sort these distances.
• (1, 2): Nearest neighbor is (2, 2) with a distance of 1.0.
• (2, 2): Nearest neighbor is (1, 2) with a distance of 1.0.
• (2, 3): Nearest neighbor is (2, 2) with a distance of 1.0.
• (8, 7): Nearest neighbor is (8, 8) with a distance of 1.0.
• (8, 8): Nearest neighbor is (8, 7) with a distance of 1.0.
• (25, 80): Nearest neighbor is (8, 8) with a distance of about 73.98.
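Step 2 as code (a self-contained sketch that repeats the distance computation and then sorts the nearest-neighbor distances):

import numpy as np

points = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
dist = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)            # ignore each point's zero distance to itself
k_distances = np.sort(dist.min(axis=1))   # nearest-neighbor (k = 1) distance per point, sorted
print(np.round(k_distances, 2))           # -> [ 1.  1.  1.  1.  1.  73.98]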
How to Find ε (Epsilon)? Steps 3 and 4
• Step 3: Plot the k-Distance Graph
• Sort the k-th nearest neighbor distances in ascending order and plot them: 1.0, 1.0, 1.0, 1.0, 1.0, 73.98.
• Step 4: Choose ε from the Elbow Point
• On the k-distance graph, the elbow point sits just before the large jump in distance. Thus, we choose ε = 1.5 as the optimal value.
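Steps 3 and 4 as code: plot the sorted k-distances and read ε off the elbow (matplotlib is assumed to be available; the dashed line marks the ε = 1.5 chosen above):

import numpy as np
import matplotlib.pyplot as plt

points = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
dist = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1))
np.fill_diagonal(dist, np.inf)
k_distances = np.sort(dist.min(axis=1))

plt.plot(range(1, len(k_distances) + 1), k_distances, marker="o")
plt.axhline(1.5, linestyle="--", label="chosen ε = 1.5")
plt.xlabel("points, sorted by 1-NN distance")
plt.ylabel("1-NN distance")
plt.legend()
plt.show()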
Final Example of Using DBSCAN on Data:
Chosen Parameters:
• ε = 2 (comfortably above the elbow of the k-distance graph; ε = 1.5 gives the same result here)
• MinPts = 2
DBSCAN Algorithm Steps:
• Point (1, 2): Neighbors within ε = 2 are (2, 2) and (2, 3) → 3 points in its neighborhood (including itself) ≥ MinPts → Core point.
• Point (2, 2): Neighbors within ε = 2 are (1, 2) and (2, 3) → Core point.
• Point (2, 3): Neighbors within ε = 2 are (1, 2) and (2, 2) → Core point.
• Point (8, 7): The only neighbor within ε = 2 is (8, 8) → 2 points in its neighborhood ≥ MinPts → Core point.
• Point (8, 8): The only neighbor within ε = 2 is (8, 7) → Core point.
• Point (25, 80): No other points within ε = 2 → Noise.
Result: two clusters, {(1, 2), (2, 2), (2, 3)} and {(8, 7), (8, 8)}, with (25, 80) marked as noise; see the run below.
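The worked example end-to-end with scikit-learn (assuming it is installed); with eps = 2 and min_samples = 2 the output matches the step-by-step analysis above, with noise labeled -1:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)
db = DBSCAN(eps=2, min_samples=2).fit(X)
print("labels:     ", db.labels_)                     # e.g. [ 0  0  0  1  1 -1]
print("core points:", X[db.core_sample_indices_])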