DB SCAN Unit 4

Density-Based Clustering identifies clusters in data based on high point density regions, with DBSCAN being a prominent algorithm that can handle noise and outliers. It uses parameters minPts and eps to define cluster density and connectivity, categorizing points into core, border, and noise. While DBSCAN effectively finds arbitrary-shaped clusters, it is sensitive to parameter selection and may struggle with varying densities and high-dimensional data.

Density-Based Clustering Algorithms

Density-Based Clustering refers to unsupervised learning methods that identify distinctive groups/clusters in the data, based on the idea that a cluster in data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density.

Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is the base algorithm for density-based clustering. It can discover clusters of different shapes and sizes in large amounts of data that contain noise and outliers. DBSCAN groups data points based on their density, identifying clusters as high-density regions and classifying outliers as noise. Because it is effective at discovering arbitrary-shaped clusters, it is widely used in data mining, spatial data analysis, pattern recognition, and machine learning applications.

The DBSCAN algorithm uses two parameters:

 minPts: The minimum number of points (a threshold) clustered together for a region to be considered dense. As a rule of thumb, a minimum bound for min_pts can be computed from the number of dimensions D in the data set, as min_pts ≥ D + 1.

 eps (ε): A distance measure that is used to locate the points in the neighbourhood of any point.

If epsilon is too small, sparser clusters are treated as noise, i.e. sparse clusters are eliminated as outliers. If epsilon is too large, denser clusters may be merged together, which gives incorrect clusters.
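The effect of eps can be illustrated in a few lines of plain Python (the toy points below are hypothetical): with a small radius no region is dense enough to form a cluster, while a larger radius makes the tight group dense.

```python
from math import dist  # Euclidean distance (Python 3.8+)

# Hypothetical toy data: one tight group plus a distant point
points = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8), (5.0, 5.0)]
min_pts = 4

def neighbours(p, pts, eps):
    """All points (including p itself) within distance eps of p."""
    return [q for q in pts if dist(p, q) <= eps]

for eps in (0.2, 0.6):
    cores = [p for p in points if len(neighbours(p, points, eps)) >= min_pts]
    print(f"eps={eps}: {len(cores)} core point(s)")
```

With eps = 0.2 nothing is dense enough, so every point would be discarded as noise; with eps = 0.6 the four grouped points all become core points while the distant point stays isolated.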


Reachability, in terms of density, establishes that a point is reachable from another if it lies within a particular distance (eps) of it.

Connectivity, on the other hand, involves a transitivity-based chaining approach to determine whether points are located in a particular cluster. For example, points p and q could be connected if p -> r -> s -> t -> q, where a -> b means b is in the neighbourhood of a.

Direct density reachable: A point is called directly density reachable from a core point if it lies in that core point's neighbourhood.

Density reachable: A point is density reachable from another point if they are connected through a chain of core points.

Density connected: Two points are called density connected if there is a core point from which both points are density reachable.
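The chaining idea can be made concrete with a small breadth-first search over core points (a plain-Python sketch; the chain of points below is hypothetical):

```python
from math import dist

def density_reachable(points, src, dst, eps, min_pts):
    """True if points[dst] is density reachable from points[src]
    through a chain of core points (BFS over eps-neighbourhoods)."""
    def nbrs(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]
    def is_core(i):
        return len(nbrs(i)) >= min_pts      # counts the point itself
    if not is_core(src):
        return False                        # chains must start at a core point
    seen, frontier = {src}, [src]
    while frontier:
        i = frontier.pop()
        for j in nbrs(i):
            if j == dst:
                return True
            if j not in seen and is_core(j):   # only core points extend the chain
                seen.add(j)
                frontier.append(j)
    return False

# Hypothetical chain of points spaced 1 apart, plus one far outlier
chain = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0)]
print(density_reachable(chain, 1, 4, eps=1.2, min_pts=3))  # reachable via the chain
print(density_reachable(chain, 1, 5, eps=1.2, min_pts=3))  # outlier is unreachable
```

Point (4, 0) is not in the neighbourhood of (1, 0), but it is reached transitively through the core points between them; the outlier at (10, 0) is not.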

There are three types of points after the DBSCAN clustering is complete:

 Core — a point that has at least minPts points (including itself) within distance eps of itself.

 Border — a point that is not core but has at least one core point within distance eps.

 Noise — a point that is neither core nor border: it has fewer than minPts points within distance eps and no core point in its neighbourhood.


The major steps followed during the DBSCAN algorithm are as follows:

Step-1: Decide the values of the parameters eps and min_pts.

Step-2: For each data point (x) present in the dataset:

 Compute its distance from all the other data points. If the distance is less than or equal to epsilon (eps), consider that point a neighbour of x.

 If the neighbour count of x is greater than or equal to min_pts, mark x as a core point and as visited.

Step-3: For each core point, if it is not already assigned to a cluster, create a new cluster. Then recursively determine all its neighbouring points and assign them to the same cluster as the core point.

Step-4: Repeat the above steps until all the points are visited.
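The steps above can be sketched as a short, unoptimized Python implementation (a teaching sketch, not production code; the demo points at the end are hypothetical):

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point
    (0, 1, ... for clusters, -1 for noise)."""
    n_eps = [[j for j, q in enumerate(points) if dist(p, q) <= eps]
             for p in points]               # precomputed eps-neighbourhoods
    labels = [None] * len(points)           # None = unvisited
    cluster = -1
    for i in range(len(points)):            # Step-2: visit every point
        if labels[i] is not None:
            continue
        if len(n_eps[i]) < min_pts:         # not a core point: mark as noise
            labels[i] = -1                  # (may later become a border point)
            continue
        cluster += 1                        # Step-3: new cluster from a core point
        labels[i] = cluster
        seeds = list(n_eps[i])
        while seeds:                        # grow the cluster transitively
            j = seeds.pop()
            if labels[j] == -1:             # noise in reach of a core = border
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(n_eps[j]) >= min_pts:    # j is also core: keep expanding
                seeds.extend(n_eps[j])
    return labels

# Hypothetical demo: two small blobs and one outlier
demo = [(1.0, 1.0), (1.2, 1.1), (0.9, 1.3), (1.1, 0.8),
        (5.0, 5.0), (5.1, 5.2), (4.9, 5.1), (5.2, 4.9),
        (9.0, 9.0)]
print(dbscan(demo, eps=0.6, min_pts=4))    # two clusters plus one noise label
```

Note that noise labels are provisional: a point first marked -1 is upgraded to a border point if a later cluster expansion reaches it.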

DBSCAN Parameter Selection

DBSCAN is very sensitive to the values of epsilon and minPoints; a slight variation in these values can significantly change the results produced by the algorithm. Therefore, it is important to understand how to select them.

minPoints(n):
As a starting point, a minimum n can be derived from the number of dimensions D in the
data set, as n ≥ D + 1. For data sets with noise, larger values are usually better and will yield
more significant clusters. Hence, n = 2·D can be evaluated, but it may even be necessary to
choose larger values for very large data.

Epsilon(ε):
If a small epsilon is chosen, a large part of the data will not be clustered, whereas for too high a value of ε, clusters will merge and the majority of objects will end up in the same cluster. Hence, the value of ε can be chosen using a k-distance graph: plot the distance to the k = minPoints − 1 nearest neighbour, ordered from the largest to the smallest value. Good values of ε are where this plot shows an "elbow".
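The k-distance values can be computed without any plotting library; the sketch below (plain Python, hypothetical data) sorts each point's distance to its k-th nearest neighbour so the "elbow" can be read off:

```python
from math import dist

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbour (self excluded),
    sorted in descending order -- the usual k-distance plot."""
    out = []
    for p in points:
        d = sorted(dist(p, q) for q in points if q is not p)
        out.append(d[k - 1])
    return sorted(out, reverse=True)

# Hypothetical data: a dense blob plus one far-away point
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (3.0, 3.0)]
print(k_distances(pts, k=3))  # for minPoints = 4, plot k = 4 - 1 = 3
```

The flat region of the curve suggests a good ε, while the sharp jump on the left corresponds to outliers that should stay outside every neighbourhood.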
Distance Function:

By default, DBSCAN uses Euclidean distance, although other methods can also be used (like
great circle distance for geographical data). The choice of distance function is tightly linked
to the choice of epsilon (ε) value and has a major impact on the outcomes. Hence, the
distance function needs to be chosen appropriately based on the nature of the data set.
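For instance, for geographical (lat, lon) data a great-circle distance can replace the Euclidean default. A plain-Python haversine sketch (the function name and the mean Earth radius of 6371 km are my assumptions, not from the notes):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(h))   # 6371 km: assumed mean Earth radius

# One degree of latitude is roughly 111 km anywhere on the globe
print(haversine_km((0.0, 0.0), (1.0, 0.0)))
```

With such a metric, ε is given in kilometres rather than in raw coordinate units, which illustrates how tightly the distance function and the choice of ε are linked.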

DBSCAN vs K-means Clustering:

Need for DBSCAN:

Partitioning methods like K-means, PAM clustering, etc., and hierarchical clustering work for finding spherical-shaped or convex clusters, i.e. they are suitable only for compact and well-separated clusters, and they are also critically affected by the presence of noise and outliers in the data. Real-life data, however, often contain irregularities such as:
 Clusters of arbitrary shape.
 Noisy points.
To overcome such problems, DBSCAN is used, as it produces more reasonable results than k-means across a variety of different distributions.
Advantages of the DBSCAN algorithm:
1. It does not need a predefined number of clusters, i.e. unlike K-means, the user does not give the number of clusters to be generated as input to the algorithm.
2. Clusters can be of any shape and size, including non-spherical ones.
3. It is able to identify noise data, popularly known as outliers.
Disadvantages of the DBSCAN algorithm:
1. DBSCAN clustering will fail when there are no density drops between clusters.
2. It is difficult to detect outliers or noisy points if the density varies across clusters.
3. It is sensitive to its parameters, i.e. it is hard to determine the correct set of parameters.
4. The distance metric also plays a vital role in the quality of the DBSCAN results.
5. With high-dimensional data, it does not give effective clusters.
Example – 1
Apply the DBSCAN algorithm to the given data points:
P1: (3, 7), P2: (4, 6), P3: (5, 5), P4: (6, 4), P5: (7, 3), P6: (6, 2), P7: (7, 2), P8: (8, 4), P9: (3, 3), P10: (2, 6), P11: (3, 5), P12: (2, 4)
Create the clusters with minPts = 4 and epsilon (ε) = 1.9.
Sol:
Use Euclidean distance and calculate the distance between each pair of points:
Distance(A(x1, y1), B(x2, y2)) = sqrt((x2 − x1)² + (y2 − y1)²)
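The neighbourhood counts for this example can be checked with a short Python sketch (counting each point's ε-neighbourhood including the point itself, the convention used above):

```python
from math import dist

pts = {"P1": (3, 7), "P2": (4, 6), "P3": (5, 5), "P4": (6, 4),
       "P5": (7, 3), "P6": (6, 2), "P7": (7, 2), "P8": (8, 4),
       "P9": (3, 3), "P10": (2, 6), "P11": (3, 5), "P12": (2, 4)}
eps, min_pts = 1.9, 4

# eps-neighbourhood of each point, counting the point itself
nbrs = {a: [b for b in pts if dist(pts[a], pts[b]) <= eps] for a in pts}
core = {a for a in pts if len(nbrs[a]) >= min_pts}
border = {a for a in pts if a not in core
          and any(c in core for c in nbrs[a])}
noise = set(pts) - core - border

print("core:", sorted(core))
print("border:", sorted(border))
print("noise:", sorted(noise))
```

P2, P5 and P11 come out as core points. Since P2 and P11 lie in each other's neighbourhoods (distance √2 ≈ 1.41 ≤ 1.9), their regions merge into one cluster: cluster 1 = {P1, P2, P3, P10, P11, P12}, cluster 2 = {P4, P5, P6, P7, P8} around core P5, and P9 is noise.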
Linkage Methods in Hierarchical Clustering:

Single Linkage: the shortest distance between the closest points of the two clusters.

Complete Linkage: the farthest distance between two points of two different clusters. It is one of the popular linkage methods, as it forms tighter clusters than single linkage.

Average Linkage: the distances between all pairs of points from the two clusters are added up and divided by the number of pairs, giving the average distance between the two clusters. It is also one of the most popular linkage methods.

Centroid Linkage: the distance between the centroids of the clusters is calculated.
