DB SCAN Unit 4
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm used for data analysis and pattern recognition. It groups data points based on their density, identifying clusters as high-density regions and classifying outliers as noise. The goal is to identify distinctive groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density. DBSCAN is effective in discovering arbitrary-shaped clusters of different sizes and is widely used for density-based clustering. It takes two parameters:
minPts: The minimum number of points (a threshold) clustered together for a region
to be considered dense.
eps (ε): A distance measure that is used to locate the points in the neighbourhood of any point.
If epsilon is too small: the sparser clusters are defined as noise, i.e., a large part of the data will not be clustered.
If epsilon is too large: the denser clusters may be merged together, producing incorrect clusters.
The following reachability and connectivity concepts are used to determine whether two points, say p and q, are located in the same cluster:
Directly density reachable: A point is called directly density reachable if it has a core point in its ε-neighbourhood.
Density reachable: A point is density reachable from another point if they are connected through a chain of directly density reachable points.
Density connected: Two points are called density connected if there is a core point from which both points are density reachable.
There are three types of points after the DBSCAN clustering is complete:
Core — This is a point that has at least minPts points within distance ε from itself.
Border — This is a point that has at least one core point within distance ε, but fewer than minPts points in its own ε-neighbourhood.
Noise — This is a point that is neither a core nor a border: it has fewer than minPts points within distance ε and no core point in its neighbourhood.
Step-1: For each data point x, compute its distance from all the other data points. If the distance is less than or equal to eps, then mark that point as a neighbour of x.
Step-2: If a data point x gets a neighbour count greater than or equal to minPts, then mark it as a core point.
Step-3: For each core point, if it is not already assigned to a cluster, then create a new cluster. Further, all the neighbouring points are recursively determined and are assigned the same cluster as the core point.
Step-4: Repeat the above steps until all the points are visited.
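The steps above can be sketched in plain Python. This is a minimal illustration, not an optimized implementation; the function and variable names are chosen for this sketch, and a point is counted as its own neighbour, as in the standard formulation.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    # Step-1: find the eps-neighbourhood of every point (self included).
    neighbours = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    # Step-2: core points have at least min_pts neighbours.
    core = [len(n) >= min_pts for n in neighbours]

    labels = [None] * len(points)
    cluster = -1
    # Steps 3-4: grow a cluster from each not-yet-assigned core point.
    for i in range(len(points)):
        if not core[i] or labels[i] is not None:
            continue
        cluster += 1
        stack = [i]
        while stack:
            j = stack.pop()
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if core[j]:                 # only core points expand the cluster
                stack.extend(neighbours[j])
    # Points never reached from a core point are noise.
    return [l if l is not None else -1 for l in labels]
```

For example, two tight squares of four points each plus one isolated point, with eps = 1.5 and min_pts = 3, yield two clusters and one noise point.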
DBSCAN is very sensitive to the values of epsilon and minPoints. Therefore, it is important to
understand how to select the values of epsilon and minPoints. A slight variation in these
values can significantly change the results produced by the DBSCAN algorithm.
minPoints(n):
As a starting point, a minimum n can be derived from the number of dimensions D in the
data set, as n ≥ D + 1. For data sets with noise, larger values are usually better and will yield
more significant clusters. Hence, n = 2·D can be evaluated, but it may even be necessary to
choose larger values for very large data.
Epsilon(ε):
If a small epsilon is chosen, a large part of the data will not be clustered. Whereas, for a too
high value of ε, clusters will merge and the majority of objects will be in the same cluster.
Hence, the value for ε can be chosen by using a k-distance graph: plot the distance to the k = minPoints−1 nearest neighbour of every point, ordered from the largest to the smallest value. Good values of ε are where this plot shows an “elbow”.
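Computing the sorted k-distance curve for such a plot can be sketched as follows (a minimal illustration; the actual plotting is left out so the sketch stays self-contained, and the function name is chosen for this sketch):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour,
    sorted from largest to smallest (the k-distance plot curve)."""
    ds = []
    for i, p in enumerate(points):
        # distances to all other points, ascending
        others = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        ds.append(others[k - 1])  # k-th nearest neighbour distance
    return sorted(ds, reverse=True)
```

An abrupt drop (the “elbow”) in the returned sequence suggests a good value of ε: points in dense regions sit below it, outliers above it.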
Distance Function:
By default, DBSCAN uses Euclidean distance, although other methods can also be used (like
great circle distance for geographical data). The choice of distance function is tightly linked
to the choice of epsilon (ε) value and has a major impact on the outcomes. Hence, the
distance function needs to be chosen appropriately based on the nature of the data set.
Need for DBSCAN:
Partitioning methods like K-means, PAM clustering, etc., and hierarchical clustering
work for finding spherical-shaped or convex clusters, i.e., they are suitable only for
compact and well-separated clusters, and they are critically affected by the presence of noise
and outliers in the data. Real-life data, however, often contain irregularities such as:
Clusters can be of arbitrary shape.
Data may contain noisy points.
To overcome such problems, DBSCAN is used, as it produces more reasonable results than K-means across a variety of different distributions.
Advantages of the DBSCAN algorithm:
1. Unlike K-means, it does not need a predefined number of clusters, i.e., the user does not give the number of clusters to be generated as input to the algorithm.
2. Clusters can be of arbitrary shape and size, including non-spherical ones.
3. It is able to identify noise data, popularly known as outliers.
Disadvantages of the DBSCAN algorithm:
1. DBSCAN clustering fails when there are no density drops between clusters.
2. It is difficult to detect outliers or noisy points when the density varies across clusters.
3. It is sensitive to its parameters, i.e., it is hard to determine the correct set of parameters.
4. The distance metric also plays a vital role in the quality of DBSCAN results.
5. With high-dimensional data, it does not give effective clusters.
Example – 1
Apply the DBSCAN algorithm to the given data points:
P1: (3, 7), P2: (4, 6), P3: (5, 5), P4: (6, 4), P5: (7, 3), P6: (6, 2), P7: (7, 2), P8: (8, 4), P9: (3, 3), P10: (2, 6), P11: (3, 5), P12: (2, 4)
Create the clusters with minPts = 4 and epsilon (ε) = 1.9.
Sol:
Use Euclidean distance and calculate the distance between each pair of points:
Distance(A(x1, y1), B(x2, y2)) = sqrt((x2 − x1)² + (y2 − y1)²)
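The example can be worked through with a short script that computes each point's ε-neighbourhood (counting the point itself, as in the standard DBSCAN formulation) and classifies every point as core, border, or noise. This is a sketch for checking the computation, not part of the original solution:

```python
from math import dist  # Euclidean distance (Python 3.8+)

points = {
    "P1": (3, 7), "P2": (4, 6), "P3": (5, 5), "P4": (6, 4),
    "P5": (7, 3), "P6": (6, 2), "P7": (7, 2), "P8": (8, 4),
    "P9": (3, 3), "P10": (2, 6), "P11": (3, 5), "P12": (2, 4),
}
eps, min_pts = 1.9, 4

# eps-neighbourhood of each point (the point itself is included)
nbrs = {
    name: [m for m, q in points.items() if dist(p, q) <= eps]
    for name, p in points.items()
}
# core: at least min_pts points in the neighbourhood
cores = {name for name, n in nbrs.items() if len(n) >= min_pts}
# border: not core, but has a core point within eps
borders = {
    name for name, n in nbrs.items()
    if name not in cores and any(m in cores for m in n)
}
# noise: neither core nor border
noise = set(points) - cores - borders

print("cores:  ", sorted(cores))
print("borders:", sorted(borders))
print("noise:  ", sorted(noise))
```

Under this counting convention, P2, P5, and P11 come out as core points and P9 as the only noise point; since P2 and P11 are within ε of each other, the result is two clusters, {P1, P2, P3, P10, P11, P12} and {P4, P5, P6, P7, P8}.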
Linkage Methods:
Single Linkage: It is the shortest distance between the closest points of the two clusters.
Complete Linkage: It is the farthest distance between the two points of two different
clusters. It is one of the popular linkage methods as it forms tighter clusters than single-
linkage.
Average Linkage: It is the linkage method in which the distance between each pair of
datasets is added up and then divided by the total number of datasets to calculate the
average distance between two clusters. It is also one of the most popular linkage methods.
Centroid Linkage: It is the linkage method in which the distance between the centroids of the two clusters is calculated.
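The four linkage definitions above can be sketched directly as functions over two clusters of points (a minimal illustration; the function names are chosen for this sketch):

```python
from math import dist  # Euclidean distance (Python 3.8+)

def single_linkage(a, b):
    """Shortest distance between any point of cluster a and any point of b."""
    return min(dist(p, q) for p in a for q in b)

def complete_linkage(a, b):
    """Farthest distance between any point of a and any point of b."""
    return max(dist(p, q) for p in a for q in b)

def average_linkage(a, b):
    """Mean of all pairwise distances between the two clusters."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid_linkage(a, b):
    """Distance between the centroids (coordinate-wise means) of a and b."""
    ca = tuple(sum(c) / len(a) for c in zip(*a))
    cb = tuple(sum(c) / len(b) for c in zip(*b))
    return dist(ca, cb)
```

For the same pair of clusters, single linkage always gives the smallest value and complete linkage the largest, with average and centroid linkage in between.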