DBSCAN Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups points
that are closely packed together and marks outliers as noise based on their
density in the feature space. It identifies clusters as dense regions in the data
space, separated by areas of lower density.
Unlike K-Means or hierarchical clustering, which tend to favor compact,
roughly spherical clusters, DBSCAN handles real-world data
irregularities such as:
Arbitrary-Shaped Clusters: Clusters can take any shape, not just circular
or convex.
Noise and Outliers: It effectively identifies and handles noise points
without assigning them to any cluster.
Key Parameters in DBSCAN
1. eps: This defines the radius of the neighborhood around a data point.
If the distance between two points is less than or equal to eps, they are considered
neighbors. Choosing the right eps is crucial:
If eps is too small, most points will be classified as noise.
If eps is too large, clusters may merge, and the algorithm may fail to
distinguish between them.
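A common way to choose eps is to plot each point's distance to its k-th nearest
neighbor in sorted order (the k-distance graph) and pick the value at the elbow,
where the curve bends sharply.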
2. MinPts: This is the minimum number of points required within
the eps radius to form a dense region.
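The k-distance idea can be sketched in a few lines of plain Python. This is a
minimal illustration, not a reference implementation: the function name
k_distances and the sample points are hypothetical, Euclidean distance is
assumed, and tying k to MinPts is a common heuristic rather than a rule.

```python
import math

def k_distances(points, k):
    """Return every point's distance to its k-th nearest neighbor, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        # Distances from p to every other point (the point itself is excluded).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical data: four points on a unit square plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
kd = k_distances(points, k=2)

# The first four values are 1.0; the last jumps to roughly 13.45.
for rank, d in enumerate(kd):
    print(rank, round(d, 2))
```

Plotting these sorted values gives the k-distance graph. A value of eps just
below the sharp jump keeps the dense points together and leaves the isolated
point as noise.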
How Does DBSCAN Work?
DBSCAN works by categorizing data points into three types:
1. Core points, which have at least MinPts neighbors within the eps radius.
2. Border points, which lie within eps of a core point but have too few
neighbors to be core points themselves.
3. Noise points, which are neither core nor border points and do not belong to
any cluster.
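The three point types can be illustrated with a small brute-force sketch. The
function name classify_points and the sample data are hypothetical, Euclidean
distance is assumed, and the neighbor count here includes the point itself,
which is one common convention but an assumption of this example.

```python
import math

def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise'."""
    # eps-neighborhood of every point; a point counts as its own neighbor (assumption).
    neighbors = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}

    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            # Not dense enough itself, but within eps of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical data: a unit square, one point hanging off its corner, one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (10.0, 10.0)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
# → ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point (2.0, 1.0) has only two points in its neighborhood, so it is not a
core point, but it sits within eps of the core point (1.0, 1.0) and is therefore
a border point.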
Steps in the DBSCAN Algorithm
1. Identify Core Points: For each point in the dataset, count the number of
points within its eps neighborhood. If the count meets or exceeds MinPts,
mark the point as a core point.
2. Form Clusters: For each core point that is not already assigned to a cluster,
create a new cluster. Starting from that core point, add every point within its
eps radius, then repeat the expansion from each newly added core point until no
more points can be reached. Border points join the cluster but are not expanded
further.
3. Density Connectivity: A point b is density-reachable from a core point a if
there is a chain of points from a to b in which each point is within the eps
radius of the previous one and every point except possibly b is a core point.
Two points are density-connected if both are density-reachable from a common
core point. This chaining ensures that all points in a cluster are connected
through a series of dense regions.
4. Label Noise Points: After processing all points, any point that does not
belong to a cluster is labeled as noise.
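The steps above can be put together in a short, brute-force sketch. It is meant
for illustration only: the function name dbscan and the sample data are
hypothetical, Euclidean distance is assumed, a point counts as its own neighbor,
and the expansion uses an explicit queue rather than recursion. Marking noise as
-1 follows the convention used by scikit-learn.

```python
import math
from collections import deque

NOISE = -1  # label for points that end up in no cluster

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; noise points get NOISE."""
    n = len(points)
    # Step 1: eps-neighborhoods (O(n^2) brute force) and core points.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [None] * n
    cluster_id = 0
    for i in range(n):
        # Step 2: only an unassigned core point can start a new cluster.
        if labels[i] is not None or not is_core[i]:
            continue
        labels[i] = cluster_id
        queue = deque([i])  # holds core points whose neighborhoods still need expanding
        while queue:
            current = queue.popleft()
            for j in neighbors[current]:
                if labels[j] is None:
                    labels[j] = cluster_id
                    # Step 3: only core points extend the chain; border points stop it.
                    if is_core[j]:
                        queue.append(j)
        cluster_id += 1

    # Step 4: anything still unassigned is noise.
    return [NOISE if lab is None else lab for lab in labels]

# Hypothetical data: two dense groups and one isolated point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0),
    (8.0, 8.0), (8.0, 9.0), (9.0, 8.0),
    (20.0, 20.0),
]
print(dbscan(points, eps=1.0, min_pts=3))
# → [0, 0, 0, 0, 0, 1, 1, 1, -1]
```

A border point that lies within eps of core points from two different clusters
is assigned to whichever cluster reaches it first, so the result for such points
can depend on the order of the input.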
DBSCAN K-Means