Clustering_notes
The K-Means algorithm allows us to cluster the data into different groups and
provides a convenient way to discover the categories of groups in an unlabeled
dataset on its own, without the need for any training.
The algorithm takes the unlabeled dataset as input, divides the dataset into
k clusters, and repeats the process until it finds the best clusters. The
value of k must be predetermined in this algorithm.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as centroids. (They need not be points from the
input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the
predefined K clusters.
Step-4: Calculate the variance and place a new centroid in each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new
closest centroid of its cluster.
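As a rough illustration of these steps, here is a minimal sketch using scikit-learn's KMeans, assuming NumPy and scikit-learn are available; the data points are invented for demonstration only.

# Minimal K-Means sketch; the data below is made up for illustration.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D data (two loose groups).
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [10, 9]], dtype=float)

# The value of k (n_clusters) must be chosen in advance.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)      # cluster index for every point
print(labels)                      # e.g. [0 0 0 1 1 1] (cluster ids may swap)
print(model.cluster_centers_)      # final centroids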
Suppose we have two variables M1 and M2. The x-y axis scatter plot of
these two variables is given below:
o Let's take the number of clusters as K=2, to identify the dataset and to
put the points into different clusters. It means here we will try to group
this dataset into two different clusters.
o We need to choose k random points or centroids to form the clusters.
These points can either be points from the dataset or any other points.
So, here we are selecting the two points shown below as the k points,
which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest
K-point or centroid. We will compute this by applying the mathematics we
have studied for calculating the distance between two points. So, we will
draw a median line between both centroids. Consider the below image:
From the above image, it is clear that points on the left side of the line are
nearer to the K1 or blue centroid, and points on the right of the line are
closer to the yellow centroid. Let's color them blue and yellow for clear
visualization.
o As we need to find the closest cluster, we will repeat the process by
choosing new centroids. To choose the new centroids, we will compute the
center of gravity (the mean) of the points in each cluster, and will find
the new centroids as below:
o Next, we will reassign each data point to its new closest centroid. For
this, we will repeat the same process of finding a median line. The median
will be as in the below image:
From the above image, we can see that one yellow point is on the left side of
the line, and two blue points are to the right of the line. So, these three
points will be assigned to new centroids.
As reassignment has taken place, we will again go to step-4, which is
finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of the points
in each cluster, so the new centroids will be as shown in the below image:
o As we have got the new centroids, we will again draw the median line and
reassign the data points. So, the image will be:
o We can see in the above image that no data points lie on the wrong side of
the line, which means no reassignment is needed and our model is formed.
Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two
final clusters will be as shown in the below image:
1. Choose k random points (data points from the data set or some other
points). These points are also called "Centroids" or "Means".
2. Assign all the data points in the data set to the closest centroid by
applying any distance formula such as Euclidean distance, Manhattan
distance, etc.
3. Now, choose new centroids by calculating the mean of all the data
points in each cluster and go to step 2.
4. Continue step 3 until no data point changes classification between
two iterations.
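A compact from-scratch sketch of these four steps in Python/NumPy; the function name and defaults are my own choices, Euclidean distance is used, and empty clusters are not handled.

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign every point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: new centroids are the means of the points in each cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the centroids stop moving, i.e. no point changes cluster.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(np.array([[1., 2.], [1., 4.], [8., 8.], [9., 10.]]), k=2)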
The problem with the K-Means algorithm is that it cannot handle outlier data
well. An outlier is a point that lies far away from the rest of the points.
Because a centroid is the mean of its cluster, an outlier drags the centroid a
long way toward itself, distorting the cluster it belongs to and sometimes
pulling nearby points into the wrong cluster. Hence, K-Means clustering is
highly affected by outlier data.
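A tiny sketch of this effect (the numbers are made up): adding one far-away point drags the cluster mean a long way.

import numpy as np

cluster = np.array([[1.0, 2.0], [2.0, 1.0], [2.0, 3.0]])
print(cluster.mean(axis=0))            # centroid without the outlier: [1.67 2.0]

with_outlier = np.vstack([cluster, [[50.0, 60.0]]])
print(with_outlier.mean(axis=0))       # centroid dragged to [13.75 16.5]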
K-Medoids:
Medoid: A Medoid is a point in the cluster from which the sum of distances
to other data points is minimal.
(or)
A Medoid is a point in the cluster whose dissimilarity with all the other
points in the cluster is minimal.
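As a minimal sketch of this definition (Manhattan distance and the sample points are my own choices): the medoid is simply the member of the cluster with the smallest total distance to all the other members.

import numpy as np

def medoid(points):
    # Pairwise Manhattan distances between all points in the cluster.
    diffs = np.abs(points[:, None, :] - points[None, :, :]).sum(axis=2)
    # The medoid is the point whose total distance to the others is minimal.
    return points[diffs.sum(axis=1).argmin()]

cluster = np.array([[1, 1], [2, 1], [6, 5]])
print(medoid(cluster))   # [2 1] has the smallest total distance to the others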
PAM is the most powerful of the three algorithms but has the disadvantage of
time complexity. The following K-Medoids example is performed using PAM. In
the later parts, we'll see what CLARA and CLARANS are.
Algorithm:
1. Choose k random points from the data and assign these k points to k
clusters. These are the initial medoids.
2. For all the remaining data points, calculate the distance from each
medoid and assign each point to the cluster with the nearest medoid.
3. Calculate the total cost (the sum of the distances from all the data
points to their medoids).
4. Select a random non-medoid point as the new medoid and swap it with a
previous medoid. Repeat steps 2 and 3.
5. If the total cost with the new medoid is less than that with the previous
medoid, make the new medoid permanent and repeat step 4.
6. If the total cost with the new medoid is greater than the cost with the
previous medoid, undo the swap and repeat step 4.
7. The repetitions continue until no change is encountered, i.e., no swap of
medoids improves the classification of the data points.
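A rough Python sketch of this procedure. It is a simplified PAM loop that, instead of picking one random swap at a time, tries every possible swap and keeps any swap that lowers the total cost; the names and the use of Manhattan distance are my own choices.

import numpy as np

def pam(X, k, max_rounds=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    medoids = list(rng.choice(n, size=k, replace=False))   # step 1: random medoids

    def total_cost(meds):
        # steps 2-3: assign each point to its nearest medoid and sum the distances
        d = np.abs(X[:, None, :] - X[meds][None, :, :]).sum(axis=2)   # Manhattan
        return d.min(axis=1).sum(), d.argmin(axis=1)

    best_cost, labels = total_cost(medoids)
    for _ in range(max_rounds):
        improved = False
        for m in range(k):                      # step 4: try swapping each medoid
            for p in range(n):
                if p in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = p
                cost, cand_labels = total_cost(candidate)
                if cost < best_cost:            # step 5: keep the cheaper swap
                    medoids, best_cost, labels = candidate, cost, cand_labels
                    improved = True
                # step 6: otherwise the swap is simply discarded
        if not improved:                        # step 7: stop when nothing changes
            break
    return medoids, labels, best_cost

# Example: two small groups of 2-D points (values invented).
demo = np.array([[1., 1.], [2., 1.], [1., 2.], [8., 8.], [9., 8.]])
print(pam(demo, k=2))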
Data set:
Point   x   y
0       5   4
1       7   7
2       1   3
3       8   6
4       4   9
Scatter plot:
If k is given as 2, we need to break down the data points into 2 clusters.
Initially, points 2 (1, 3) and 4 (4, 9) are taken as the medoids, and the
Manhattan distance from every other point to each medoid is computed:
Point   x   y   Distance to medoid 2 (1, 3)   Distance to medoid 4 (4, 9)
0       5   4   5                             6
1       7   7   10                            5
2       1   3   -                             -
3       8   6   10                            7
4       4   9   -                             -
Cluster 1: 0
Cluster 2: 1, 3
Total cost = 5 + 5 + 7 = 17
Next, medoid 2 is swapped with point 0, so the medoids are 0 (5, 4) and 4 (4, 9):
Point   x   y   Distance to medoid 0 (5, 4)   Distance to medoid 4 (4, 9)
0       5   4   -                             -
1       7   7   5                             5
2       1   3   5                             9
3       8   6   5                             7
4       4   9   -                             -
Cluster 1: 2, 3
Cluster 2: 1
Total cost = 5 + 5 + 5 = 15, which is less than 17, so the swap is kept.
Then medoid 4 is swapped with point 1, so the medoids are 0 (5, 4) and 1 (7, 7):
Point   x   y   Distance to medoid 0 (5, 4)   Distance to medoid 1 (7, 7)
0       5   4   -                             -
1       7   7   -                             -
2       1   3   5                             10
3       8   6   5                             2
4       4   9   6                             5
Cluster 1: 2
Cluster 2: 3, 4
Total cost = 5 + 2 + 5 = 12, which is less than 15, so this swap is also kept.
Finally, medoid 0 is swapped with point 3, so the medoids are 1 (7, 7) and 3 (8, 6):
Point   x   y   Distance to medoid 1 (7, 7)   Distance to medoid 3 (8, 6)
0       5   4   5                             5
1       7   7   -                             -
2       1   3   10                            10
3       8   6   -                             -
4       4   9   5                             7
Cluster 1: 4
Cluster 2: 0, 2
Total cost = 5 + 10 + 5 = 20, which is greater than 12, so the swap is undone
and the medoids stay at points 0 and 1.
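The numbers above can be checked with a few lines of Python; this sketch recomputes the Manhattan distances and the total cost for each pair of medoids tried in the example, printing 17, 15, 12 and 20.

import numpy as np

X = np.array([[5, 4], [7, 7], [1, 3], [8, 6], [4, 9]])   # points 0..4

def cost(medoid_ids):
    d = np.abs(X[:, None, :] - X[medoid_ids][None, :, :]).sum(axis=2)  # Manhattan
    return d.min(axis=1).sum()

for meds in ([2, 4], [0, 4], [0, 1], [1, 3]):
    print(meds, cost(meds))   # total cost for each pair of medoids tried above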
Since every possible swap of a medoid with a non-medoid point has to be
evaluated, the amount of computation grows quickly with the number of points.
Hence, PAM is suitable and recommended only for small data sets.
CLARA:
CLARA (Clustering LARge Applications) reduces the cost of PAM by drawing
several random samples from the data set, running PAM on each sample, and
keeping the best set of medoids found.
CLARANS:
CLARANS (Clustering Large Applications based upon RANdomized Search) improves
on this by repeatedly swapping a medoid with a randomly chosen non-medoid
point and examining only a limited number of such neighbours, which makes it
scale better than PAM while being less restricted than CLARA's sampling.
Disadvantages:
1. Not suitable for clustering arbitrarily shaped groups of data points.
2. As the initial medoids are chosen randomly, the results might vary
based on the choice in different runs.
K-Means vs. K-Medoids: K-Means can't cope with outlier data, whereas
K-Medoids can manage outlier data too.
The working of the AHC algorithm can be explained using the below steps:
o Step-1: Create each data point as a single cluster. Let's say there are
N data points, so the number of clusters will also be N.
o Step-2: Take two closest data points or clusters and merge them to
form one cluster. So, there will now be N-1 clusters.
o Step-3: Again, take the two closest clusters and merge them together
to form one cluster. There will be N-2 clusters.
o Step-4: Repeat Step 3 until only one cluster is left. So, we will get the
following clusters. Consider the below images:
o Step-5: Once all the clusters are combined into one big cluster,
develop the dendrogram to divide the clusters as per the problem.
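As a minimal sketch of this bottom-up procedure, assuming scikit-learn is available (the data points are invented and single linkage is chosen arbitrarily):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 8], [8, 9], [9, 8]], dtype=float)

# Bottom-up merging: each point starts as its own cluster, and the two
# closest clusters are merged repeatedly until n_clusters remain.
agg = AgglomerativeClustering(n_clusters=2, linkage="single")
print(agg.fit_predict(X))   # e.g. [0 0 0 1 1 1] (label ids may differ)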
As we have seen, the distance between the two closest clusters is crucial for
hierarchical clustering. There are various ways to calculate the distance
between two clusters, and these ways decide the rule for clustering. These
measures are called Linkage methods. Some of the popular linkage methods are
given below:
o Single linkage: the shortest distance between the closest points of the
two clusters.
o Complete linkage: the farthest distance between two points of the two
clusters.
o Average linkage: the average of all pairwise distances between the points
of the two clusters.
o Centroid linkage: the distance between the centroids of the two clusters.
The working of the dendrogram can be explained using the below diagram:
In the above diagram, the left part is showing how clusters are created in
agglomerative clustering, and the right part is showing the corresponding
dendrogram.
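A short sketch of how such a dendrogram can be produced, assuming SciPy and Matplotlib are available; the data and the single-linkage choice are just for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.array([[1, 2], [2, 2], [2, 3], [8, 8], [8, 9], [9, 8]], dtype=float)

# 'method' is the linkage method: 'single', 'complete', 'average', 'ward', ...
Z = linkage(X, method="single")

dendrogram(Z)                 # heights on the y-axis are the merge distances
plt.xlabel("data points")
plt.ylabel("distance")
plt.show()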