0% found this document useful (0 votes)
9 views14 pages

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that identifies clusters based on the density of data points, marking low-density points as outliers. It operates using parameters ε (epsilon) for neighbor distance and MinPts for minimum points to form a cluster, allowing it to find arbitrarily shaped clusters without pre-defining the number of clusters. However, its performance can be sensitive to the choice of ε and MinPts, and it may struggle with clusters of varying densities.

Uploaded by

tiya Abid
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views14 pages

DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that identifies clusters based on the density of data points, marking low-density points as outliers. It operates using parameters ε (epsilon) for neighbor distance and MinPts for minimum points to form a cluster, allowing it to find arbitrarily shaped clusters without pre-defining the number of clusters. However, its performance can be sensitive to the choice of ε and MinPts, and it may struggle with clusters of varying densities.

Uploaded by

tiya Abid
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

DBSCAN

Clustering
DBSCAN
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
is a popular clustering algorithm that groups together closely packed
points in a dataset and marks points in low-density regions as outliers
or noise. Unlike traditional clustering methods like K-means, which
require the number of clusters to be pre-defined, DBSCAN uses
density to determine the number of clusters and the size of each
cluster.
Key Concepts:
1. Core Points: Points that have at least a specified number of
neighboring points within a given radius (ε). These points form the
"core" of a cluster.
2. Border Points: Points that are within the ε radius of a core point but
do not have enough neighbors to be core points themselves.
3. Noise Points (Outliers): Points that do not belong to any cluster.
These points do not meet the requirements for being core or border
points.
Parameters:
• ε (epsilon): The maximum distance between two points to be
considered neighbors.
• MinPts (minimum points): The minimum number of points required
to form a dense region or a core point.
How DBSCAN Works:
1. Initialize: Start with an arbitrary unvisited point.
2. Find neighbors: Identify all points within the ε radius.
3. Core point identification: If the number of neighbors (including the point
itself) is greater than or equal to MinPts, the point becomes a core point,
and a cluster is formed.
4. Expand clusters: All points that are directly reachable from core points
(including border points) are added to the cluster.
5. Mark outliers: Points that are not reachable from any core points (i.e.,
neither core nor border points) are classified as outliers or noise.
6. Repeat: Continue the process for all points until all points are either
assigned to a cluster or marked as noise.
Advantages
• No need to pre-define the number of clusters: Unlike algorithms like
K-means, DBSCAN automatically finds clusters based on data density.
• Can identify arbitrarily shaped clusters: It works well for clusters of
different shapes, unlike K-means which assumes clusters are
spherical.
• Handles noise well: DBSCAN naturally identifies noise points, making
it robust to outliers.
Disadvantage
• Sensitive to ε and MinPts values: The algorithm’s performance can be
sensitive to the choice of ε and MinPts. Setting them incorrectly can
lead to poor clustering results.
• Doesn't perform well with clusters of varying densities: If the dataset
contains clusters with varying densities, DBSCAN may not perform
well, as a single set of parameters may not work for all clusters.
How to Find ε (Epsilon)? Step 1
• Finding an optimal ε can be
challenging because it depends on
the dataset's structure. A typical
method to determine an
appropriate ε is the k-distance graph
approach.
• Step 1: Calculate the Distance
Between Each Point and Its Nearest
Neighbors
• We need to compute the distance
between each point and its k-th
nearest neighbor, where k = MinPts -
1 = 1 in our case.
• We’ll calculate the Euclidean distance
between each pair of points.
How to Find ε (Epsilon)? Step 2
• Step 2: Find the k-th Nearest Neighbor Distances (k = 1)
• Now, for each point, we will find the distance to the
nearest neighbor (the smallest distance, since k = 1). We
will sort these distances.
• (1, 2): Nearest neighbor is (2, 2) with a distance of 1.0.
• (2, 2): Nearest neighbor is (1, 2) with a distance of 1.0.
• (2, 3): Nearest neighbor is (2, 2) with a distance of 1.0.
• (8, 7): Nearest neighbor is (8, 8) with a distance of 1.41.
• (8, 8): Nearest neighbor is (8, 7) with a distance of 1.41.
• (25, 80): Nearest neighbor is (8, 8) with a distance of
72.35.
How to Find ε (Epsilon)? Step 3
How to Find ε (Epsilon)? Step 3
• Step 4: Choose ε from the Elbow Point
• Based on the k-distance graph, the elbow point is just before the
large jump in distance. Thus, we choose ε = 1.5 as the optimal value.
Final Example of Using DBSCAN on Data:
Chosen Parameters:
• ε = 2 (chosen from the k-distance graph method)
• MinPts = 2
DBSCAN Algorithm Steps:
•Point (1, 2):
•Distance to nearest neighbors: (2, 2), (2, 3) within ε = 2 → Core point.
•Point (2, 2):
•Distance to nearest neighbors: (1, 2), (2, 3) within ε = 2 → Core point.
•Point (2, 3):
•Distance to nearest neighbors: (1, 2), (2, 2) within ε = 2 → Core point.
•Point (8, 7):
•Distance to nearest neighbors: (8, 8) within ε = 2 → Noise.
•Point (8, 8):
•Distance to nearest neighbors: (8, 7) within ε = 2 → Noise.
•Point (25, 80):
•Distance to nearest neighbors: No points within ε = 2 → Noise.

You might also like