DBSCAN

Introduction
• Clustering analysis, or simply clustering, is an
unsupervised learning method that divides data
points into a number of groups such that points in
the same group have similar properties and points
in different groups have different properties in
some sense.
• Partitioning methods (e.g., k-means) and hierarchical
clustering are suited to finding spherical-shaped or
convex clusters. In other words, they work well
only for compact, well-separated clusters.
K-Means vs DBSCAN
• K-means clusters all of the objects; DBSCAN discards objects that it labels as noise.
• K-means relies on a prototype-based notion of a cluster; DBSCAN relies on a density-based notion.
• K-means has difficulty with non-globular clusters and clusters of varying size; DBSCAN handles clusters of different sizes and shapes and is not strongly influenced by noise or outliers.
• K-means suits data that has a meaningful centroid, such as a mean or median; DBSCAN requires that its notion of density, based on the traditional Euclidean concept, be meaningful for the data.
• K-means can be applied to sparse, high-dimensional data such as document data; DBSCAN generally performs poorly on such data because the traditional Euclidean definition of density does not work well in high dimensions.
• The basic k-means algorithm is similar to a statistical clustering approach (mixture models) that assumes all clusters come from spherical Gaussian distributions with different means but the same covariance matrix; DBSCAN makes no such assumption about the data distribution.
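The contrast above can be seen on a toy dataset. The sketch below, assuming scikit-learn is available (the eps and min_samples values are illustrative choices, not part of the comparison itself), runs both algorithms on two interleaved half-moons:

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-moons: non-convex clusters with known labels.
X, y_true = make_moons(n_samples=200, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# k-means cuts the moons with a straight boundary; DBSCAN follows their shape.
print("k-means ARI:", adjusted_rand_score(y_true, km_labels))
print("DBSCAN  ARI:", adjusted_rand_score(y_true, db_labels))
```

The adjusted Rand index for DBSCAN should come out much higher, since k-means cannot follow the non-globular shape of the moons.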
• Real-life data may contain irregularities, such as:
• Clusters of arbitrary shape, like those shown in the figure below.
• Noise.
• The figure shows a data set containing non-convex clusters and
outliers. Given such data, the k-means algorithm has difficulty
identifying these arbitrarily shaped clusters.
Parameters Required For DBSCAN Algorithm

• Epsilon (ε): the maximum distance between two points
for them to be considered neighbors.
• MinPts: the minimum number of points required to
form a dense region (core points) — that is, the
minimum number of neighbors (data points) within the
eps radius. The larger the dataset, the larger the value
of MinPts should be. As a general rule, MinPts can be
derived from the number of dimensions D in the
dataset as MinPts ≥ D + 1, and it should be chosen to
be at least 3.
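The rules of thumb above can be captured in a small helper (a sketch; the function name is mine, and both D + 1 and the floor of 3 are heuristics for a starting value, not part of DBSCAN itself):

```python
def choose_min_pts(n_dims):
    """Heuristic starting value for MinPts: at least D + 1, never below 3."""
    return max(3, n_dims + 1)

print(choose_min_pts(2))  # 2-D data -> 3
```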
• Let's set ε = 1.5 and MinPts = 3.
• We start by randomly selecting a point from
the dataset and checking its ε-neighborhood. If
the ε-neighborhood contains at least MinPts
points, we consider it a core point and expand
its cluster. We repeat this process until every
point is either assigned to a cluster or identified
as noise.
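The ε-neighborhood check described above is just a distance comparison. A minimal sketch (the helper names are mine), with MinPts counting the point itself:

```python
from math import dist  # Euclidean distance, Python 3.8+

def region_query(points, p, eps):
    """All points within eps of p, including p itself."""
    return [q for q in points if dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    """A core point has at least min_pts points in its eps-neighborhood."""
    return len(region_query(points, p, eps)) >= min_pts
```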
Steps Used In DBSCAN Algorithm
• Find all the neighbor points within eps of each point, and identify the
core points as those with at least MinPts neighbors.
• For each core point if it is not already assigned to a cluster, create a new
cluster.
• Find recursively all its density-connected points and assign them to the
same cluster as the core point.
Points a and b are said to be density-connected if there exists a
point c which has a sufficient number of points in its neighborhood (a
core point) and both a and b are within the eps distance of it. This is a
chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a
neighbor of e, which in turn is a neighbor of a, then b is density-connected to a.
• Iterate through the remaining unvisited points in the dataset. Those
points that do not belong to any cluster are noise.
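The steps above can be sketched in plain Python (an illustrative implementation with brute-force neighbor search, not an optimized one; here MinPts counts the point itself toward the density threshold, as scikit-learn does):

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    labels = {p: None for p in points}  # None = unvisited
    # Brute-force neighborhood lookup; each point neighbors itself.
    neighbors = {p: [q for q in points if dist(p, q) <= eps] for p in points}
    cluster_id = -1
    for p in points:
        if labels[p] is not None:
            continue
        if len(neighbors[p]) < min_pts:   # not a core point
            labels[p] = -1                # tentatively noise
            continue
        cluster_id += 1                   # new cluster seeded at core point p
        labels[p] = cluster_id
        queue = list(neighbors[p])
        while queue:
            q = queue.pop()
            if labels[q] == -1:           # former noise becomes a border point
                labels[q] = cluster_id
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            if len(neighbors[q]) >= min_pts:  # q is core: keep expanding
                queue.extend(neighbors[q])
    return labels
```

In practice a spatial index such as a k-d tree would replace the brute-force neighbor search.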
(Figure: in the illustrated dataset, point P9 is identified as noise.)
Here's a step-by-step breakdown of the DBSCAN algorithm for this
example:

• Suppose we have the following dataset of 2D points:
• Let's set ε = 1.5 and MinPts = 3.
• Points: [(2, 3), (2, 5), (3, 4), (5, 3), (5, 5), (6, 4),
(8, 2), (8, 4), (9, 5)]
• Start with the first point (2, 3). Counting the point
itself, its ε-neighborhood contains only (2, 3) and
(3, 4) (at distance √2 ≈ 1.41); (2, 5) lies at distance 2
and (5, 3) at distance 3, both beyond ε = 1.5. With
fewer than MinPts (3) points, (2, 3) is provisionally
marked as noise, and the same happens to (2, 5),
whose only neighbor is (3, 4).
• The point (3, 4) has exactly MinPts points in its
ε-neighborhood: itself, (2, 3), and (2, 5). It is a core
point, so we create a new cluster and pull in its
neighbors as border points.
• Cluster 1: [(2, 3), (2, 5), (3, 4)]
• Next, (5, 3) and (5, 5) each have only (6, 4) as a
neighbor (distance √2), so neither is a core point on
its own. The point (6, 4), however, has three points in
its ε-neighborhood: itself, (5, 3), and (5, 5). It is a
core point and forms a second cluster with those two
as border points.
• Cluster 2: [(5, 3), (5, 5), (6, 4)]
• The remaining points have too few neighbors: (8, 2)
has none within ε, while (8, 4) and (9, 5) have only
each other. None of them is a core point or
density-reachable from one, so all three are noise.
• Noise: [(8, 2), (8, 4), (9, 5)]
• All points have now been visited: with ε = 1.5 and
MinPts = 3 (a point counting toward its own
neighborhood), the DBSCAN algorithm identified two
clusters and three noise points in the dataset.
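As a cross-check, the same nine points can be run through scikit-learn's DBSCAN, if that library is available; note that scikit-learn's min_samples likewise counts the point itself toward the density threshold:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([(2, 3), (2, 5), (3, 4), (5, 3), (5, 5), (6, 4),
              (8, 2), (8, 4), (9, 5)])

labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(X)
print(labels)  # -1 marks noise points
# → [ 0  0  0  1  1  1 -1 -1 -1]
```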
