ML Unit 4
Introduction
DBSCAN is the abbreviation for Density-Based Spatial Clustering of Applications with Noise. It is an
unsupervised clustering algorithm. DBSCAN can find clusters of any size and shape in huge amounts of
data and can work with datasets containing a significant amount of noise. It is based on the criterion of a
minimum number of points within a region.
What is DBSCAN Algorithm?
The DBSCAN algorithm can efficiently group densely packed points into one cluster by identifying regions
of high local density in large datasets. DBSCAN can very effectively handle outliers. An advantage of
DBSCAN over the K-means algorithm is that the number of clusters (centroids) need not be known
beforehand in the case of DBSCAN.
The DBSCAN algorithm depends upon two parameters, epsilon and minPoints.
Epsilon is the radius around each data point within which the density is considered.
minPoints is the minimum number of points required within that radius for the data point to become a core point.
The circle of radius epsilon generalizes to a sphere (hypersphere) in higher dimensions.
Working of DBSCAN Algorithm
In the DBSCAN algorithm, a circle with radius epsilon is drawn around each data point, and the data point
is classified as a Core Point, Border Point, or Noise Point. A data point is classified as a core point if it
has at least minPoints data points within its epsilon radius. If it has fewer than minPoints points (but lies
within the epsilon radius of a core point), it is known as a Border Point, and if there are no points inside
its epsilon radius it is considered a Noise Point.
In the above figure, point A has no points inside its epsilon (e) radius, hence it is a Noise Point.
Point B has minPoints (= 4) points within its epsilon radius, thus it is a Core Point, while the remaining
point has only 1 point (fewer than minPoints) within its radius, hence it is a Border Point.
Steps Involved in the DBSCAN Algorithm
First, all the points within the epsilon radius of each point are found, and the core points are identified
as those with a number of neighbours greater than or equal to minPoints.
Next, for each core point that is not already assigned to a cluster, a new cluster is created.
All the density-connected points related to the core point are found and assigned to the same cluster.
Two points are called density-connected if there is a core point that has both of them within epsilon
distance, either directly or through a chain of such core points.
Then all the points in the data are iterated over, and the points that do not belong to any cluster are marked
as noise.
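These steps correspond directly to scikit-learn's DBSCAN implementation. Below is a minimal sketch; the dataset and the eps / min_samples values are illustrative assumptions, not values from these notes:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Illustrative dataset: two interleaving half-moons with a little noise
X, _ = make_moons(n_samples=300, noise=0.08, random_state=42)

# eps -> epsilon radius, min_samples -> minPoints (values chosen for illustration)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_                        # cluster index per point, -1 means noise
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True  # True for core points

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")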
K-Means Clustering Algorithm
Output:
A dataset of K clusters
Method:
1. Randomly assign K objects from the dataset (D) as cluster centres (C).
2. (Re)assign each object to the cluster whose centre it is most similar to, based on the mean values.
3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated assignments.
4. Repeat Step 2 until no change occurs.
The working of the K-means algorithm can also be explained in the following steps:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (these need not be points from the input dataset).
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, which means reassign each data point to the new closest centroid of each
cluster.
Step-6: If any reassignment occurs, then go to Step-4, else go to FINISH.
Step-7: The model is ready.
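These steps are what a library implementation carries out internally. A minimal sketch using scikit-learn's KMeans follows; the toy data and K = 2 are illustrative assumptions:

import numpy as np
from sklearn.cluster import KMeans

# Illustrative 2-D data with two visible groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.7], [7.8, 8.3]])

# n_clusters corresponds to K; n_init controls how many random initializations are tried
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("cluster labels:", km.labels_)        # cluster index for each point
print("centroids:\n", km.cluster_centers_)  # final cluster means
print("WCSS (inertia):", km.inertia_)       # within-cluster sum of squares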
Let's understand the above steps by considering the visual plots:
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is given below:
o Let's take the number of clusters K = 2, i.e., we will try to group the dataset into two different
clusters.
o We need to choose some random K points or centroids to form the clusters. These points can be either
points from the dataset or any other points. Here we select two points that are not part of our dataset
as the initial centroids. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We compute
this by calculating the distance between points (for example, Euclidean distance). So, we draw a
median line (perpendicular bisector) between the two centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue
centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and
yellow for clear visualization.
o As we need to find the closest cluster, we repeat the process by choosing new centroids. To choose
the new centroids, we compute the center of gravity (mean) of the points currently assigned to each
cluster and move each centroid there, as shown below:
o Next, we will reassign each data point to its new closest centroid. For this, we repeat the same process
of drawing a median line between the updated centroids. The new line will be as in the below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points
are to the right of the line. So, these three points will be reassigned to the other centroid.
As reassignment has taken place, we again go to Step-4, which is finding new centroids or K-points.
o We repeat the process of finding the center of gravity of each cluster, so the new centroids will be
as shown in the below image:
o As we got the new centroids, we again draw the median line and reassign the data points. So, the
image will be:
o We can see in the above image that no data point changes sides of the line, which means no further
reassignment occurs and our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be
as shown in the below image:
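The walkthrough above can be reproduced with a few lines of NumPy. The following is only a sketch of the assign-then-update loop; the data points and the two starting centroids are made-up values for illustration:

import numpy as np

# Illustrative data points (the "M1, M2" scatter) and two assumed starting centroids
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])

for iteration in range(10):
    # Assignment step: each point goes to its nearest centroid (Euclidean distance)
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)

    # Update step: each centroid moves to the center of gravity (mean) of its points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

    # Stop when no centroid moves any more (no reassignment changes the means)
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print("final centroids:\n", centroids)
print("cluster of each point:", labels)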
How to choose the value of "K number of clusters" in K-means Clustering?
The performance of the K-means clustering algorithm depends on the quality of the clusters it forms, but
choosing the optimal number of clusters is a difficult task. There are several ways to find the optimal
number of clusters; here we discuss the most appropriate method to find the number of clusters, i.e., the
value of K. The method is given below:
Elbow Method
The Elbow method is one of the most popular ways to find the optimal number of clusters. This method
uses the concept of the WCSS value. WCSS stands for Within-Cluster Sum of Squares, which measures the
total variation within the clusters. The formula to calculate the value of WCSS (for 3 clusters) is given below:
WCSS = Σ(Pi in Cluster1) distance(Pi, C1)² + Σ(Pi in Cluster2) distance(Pi, C2)² + Σ(Pi in Cluster3) distance(Pi, C3)²
In the above formula of WCSS,
Σ(Pi in Cluster1) distance(Pi, C1)² is the sum of the squared distances between each data point Pi in
Cluster1 and its centroid C1; the other two terms are defined in the same way for Cluster2 and Cluster3.
To measure the distance between the data points and the centroid, we can use any method such as Euclidean
distance or Manhattan distance.
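As a quick sanity check of the formula, WCSS can be computed directly with NumPy. This is a small illustrative sketch; the points, cluster assignments and centroids are made up:

import numpy as np

# Made-up points, their cluster assignments, and the cluster centroids
points = np.array([[1.0, 1.0], [2.0, 1.0], [8.0, 8.0], [9.0, 8.0], [4.0, 5.0]])
labels = np.array([0, 0, 1, 1, 2])
centroids = np.array([[1.5, 1.0], [8.5, 8.0], [4.0, 5.0]])

# WCSS = sum over all points of the squared Euclidean distance to their own centroid
wcss = np.sum(np.sum((points - centroids[labels]) ** 2, axis=1))
print("WCSS:", wcss)  # here: 0.25 + 0.25 + 0.25 + 0.25 + 0.0 = 1.0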
To find the optimal number of clusters, the elbow method follows the below steps:
o It executes K-means clustering on a given dataset for different K values (typically ranging from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve of the calculated WCSS values against the number of clusters K.
o The sharp point of bend, where the plot looks like an arm, is considered the best value of K.
Since the graph shows a sharp bend that looks like an elbow, the approach is known as the elbow method.
The graph for the elbow method looks like the below image:
Note: We can choose the number of clusters to be as large as the number of data points. If we choose the
number of clusters equal to the number of data points, then the value of WCSS becomes zero, and that will
be the endpoint of the plot.
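A typical way to produce such a plot is to run KMeans for a range of K values and record the WCSS, which scikit-learn exposes as inertia_. A minimal sketch follows; the make_blobs dataset is only a stand-in for your own feature matrix X:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Illustrative dataset; replace with your own feature matrix X
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

wcss = []
k_values = range(1, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)  # inertia_ is the WCSS for this value of K

# The "elbow" of this curve suggests the best K
plt.plot(list(k_values), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()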
STING: Statistical Information Grid
Each cell at a high level is partitioned into several smaller cells at the next lower level.
The statistical information of each cell is calculated and stored beforehand and is used to answer queries.
The parameters of higher-level cells can easily be calculated from the parameters of the lower-level cells:
Count, mean, standard deviation (s), min, max
Type of distribution—normal, uniform, etc.
Spatial data queries are then answered using a top-down approach:
For each cell in the current level, compute the confidence interval.
Remove the irrelevant cells from further consideration.
When finished examining the current layer, proceed to the next lower level.
Advantages:
It is query-independent, easy to parallelize, and supports incremental updates.
Query processing takes O(K) time, where K is the number of grid cells at the lowest level.
Disadvantages:
All the cluster boundaries are either horizontal or vertical; no diagonal boundary is detected.
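This grid approach is not available in common ML libraries, so the following is only a rough sketch of the pre-computation idea under assumed parameters: partition 2-D points into a grid and store per-cell statistics (count, mean, standard deviation, min, max) that queries could later consult.

import numpy as np

# Illustrative 2-D spatial data
rng = np.random.default_rng(0)
points = rng.uniform(0.0, 10.0, size=(500, 2))

def cell_statistics(points, grid_size=4, low=0.0, high=10.0):
    """Pre-compute count, mean, std, min, max for each cell of a grid_size x grid_size grid."""
    edges = np.linspace(low, high, grid_size + 1)
    stats = {}
    for i in range(grid_size):
        for j in range(grid_size):
            in_cell = ((points[:, 0] >= edges[i]) & (points[:, 0] < edges[i + 1]) &
                       (points[:, 1] >= edges[j]) & (points[:, 1] < edges[j + 1]))
            cell = points[in_cell]
            if len(cell):
                stats[(i, j)] = {"count": len(cell),
                                 "mean": cell.mean(axis=0),
                                 "std": cell.std(axis=0),
                                 "min": cell.min(axis=0),
                                 "max": cell.max(axis=0)}
    return stats

stats = cell_statistics(points)
first_cell, first_stats = next(iter(stats.items()))
print(len(stats), "non-empty cells; e.g. cell", first_cell, "->", first_stats)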
WaveCluster
It was proposed by Sheikholeslami, Chatterjee, and Zhang (VLDB’98).
It is a multi-resolution clustering approach which applies a wavelet transform to the feature space.
A wavelet transform is a signal-processing technique that decomposes a signal into different
frequency sub-bands.
It can be considered both a grid-based and a density-based method.
Input parameters:
Number of grid cells for each dimension
The wavelet, and the number of applications of the wavelet transform
Major features:
The time complexity of this method is O(N).
It detects arbitrarily shaped clusters at different scales.
It is not sensitive to noise and not sensitive to input order.
It is only applicable to low-dimensional data.
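WaveCluster also has no standard library implementation, so the following is only an illustrative sketch of the core idea under stated assumptions: quantize the points onto a grid, apply a 2-D wavelet transform (here via the PyWavelets package), and treat connected high-density regions of the low-frequency sub-band as clusters.

import numpy as np
import pywt
from scipy import ndimage

# Illustrative 2-D feature-space data: two dense blobs
rng = np.random.default_rng(1)
points = np.vstack([rng.normal([2, 2], 0.3, size=(200, 2)),
                    rng.normal([7, 7], 0.3, size=(200, 2))])

# Step 1: quantize the feature space into a grid and count points per cell
grid, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=64, range=[[0, 10], [0, 10]])

# Step 2: apply a 2-D discrete wavelet transform; the approximation sub-band (cA)
# is a smoothed, lower-resolution view of the density
cA, (cH, cV, cD) = pywt.dwt2(grid, "haar")

# Step 3: threshold the smoothed density and label connected dense regions as clusters
dense = cA > cA.mean() + cA.std()
labels, n_clusters = ndimage.label(dense)
print("clusters found:", n_clusters)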
CLIQUE (Clustering In QUEst)
Identify the subspaces that contain clusters using the Apriori principle.
Identify the clusters:
Determine dense units in all subspaces of interest.
Determine connected dense units in all subspaces of interest.
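There is no standard library implementation of this subspace approach either, so here is only a rough sketch, under simplifying assumptions, of finding dense units in 1-D subspaces and combining them into 2-D candidate units following the Apriori principle (a unit can only be dense in a subspace if all of its projections are dense):

import numpy as np
from itertools import combinations

# Illustrative data: 300 points in 3 dimensions, forming two dense groups
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([1, 1, 5], 0.2, size=(150, 3)),
               rng.normal([4, 4, 5], 0.2, size=(150, 3))])

n_intervals = 10          # number of intervals per dimension (assumed)
density_threshold = 30    # minimum number of points for a unit to be dense (assumed)

# Discretize every dimension into equal-width intervals (interval id per point)
mins, maxs = X.min(axis=0), X.max(axis=0)
bins = np.clip(((X - mins) / (maxs - mins + 1e-9) * n_intervals).astype(int), 0, n_intervals - 1)

# Dense 1-D units: intervals in a single dimension holding enough points
dense_1d = {d: {u for u in range(n_intervals)
                if np.sum(bins[:, d] == u) >= density_threshold}
            for d in range(X.shape[1])}

# Apriori step: candidate 2-D units are built only from dense 1-D units,
# then checked against the density threshold
dense_2d = {}
for d1, d2 in combinations(range(X.shape[1]), 2):
    for u1 in dense_1d[d1]:
        for u2 in dense_1d[d2]:
            count = np.sum((bins[:, d1] == u1) & (bins[:, d2] == u2))
            if count >= density_threshold:
                dense_2d[(d1, d2, u1, u2)] = count

print("dense 2-D units found:", len(dense_2d))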
Advantages
It automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those
subspaces.
It is insensitive to the order of records in the input and does not presume any canonical data distribution.
It scales linearly with the size of input and has good scalability as the number of dimensions in the data
increases.
Disadvantages
The accuracy of the clustering result may be degraded at the expense of the simplicity of the method.
Summary
Grid-Based Clustering -> It is one of the methods of cluster analysis which uses a multi-resolution grid data
structure.