Module-5-Cluster Analysis-Part1
(Help: https://www.javatpoint.com/hierarchical-clustering-in-machine-learning)
What is Clustering?
Clustering is a technique that groups similar objects so that the objects in the same
group are more similar to each other than to the objects in other groups. A group of
similar objects is called a cluster.
There are five popular clustering algorithms that data scientists need to know:
1. K-Means Clustering
2. Hierarchical Clustering
3. Mean-Shift Clustering
4. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
5. Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
Hierarchical Clustering Algorithm
Also called hierarchical cluster analysis (HCA), this is an unsupervised clustering
algorithm that creates clusters with a predominant ordering from top to bottom.
For example, all files and folders on our hard disk are organized in a hierarchy.
The algorithm groups similar objects into groups called clusters. The endpoint is a set
of clusters or groups, where each cluster is distinct from each other cluster, and the
objects within each cluster are broadly similar to each other.
This clustering technique is divided into two types:
1. Agglomerative Hierarchical Clustering
2. Divisive Hierarchical Clustering
There are several ways to measure the distance between clusters in order to decide the
rules for clustering, and they are often called Linkage Methods. Some of the common
linkage methods are:
Complete-linkage: the distance between two clusters is defined as the longest distance
between a point in one cluster and a point in the other cluster.
Single-linkage: the distance between two clusters is defined as the shortest distance
between a point in one cluster and a point in the other cluster. This linkage can be used to
detect outliers in your dataset, as they will be merged last.
Average-linkage: the distance between two clusters is defined as the average distance
from each point in one cluster to every point in the other cluster.
Centroid-linkage: finds the centroid of cluster 1 and centroid of cluster 2, and then
calculates the distance between the two before merging.
The choice of linkage method is entirely up to you, and there is no hard and fast rule
that will always give you good results. Different linkage methods lead to
different clusters.
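To see this concretely, here is a rough sketch (assuming SciPy and NumPy are installed) that builds the merge history of the same small, made-up 2-D data set with four different linkage methods; the data values are invented purely for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage

# Small made-up 2-D data set (illustrative values only).
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.2],
              [5.2, 4.8], [9.0, 9.1], [8.8, 9.4]])

# Build the merge history with different linkage methods.
# Each row of the result Z is (cluster_i, cluster_j, merge distance, cluster size).
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)
    print(method, "merge distances:", np.round(Z[:, 2], 2))

The merge distances differ from method to method, which is exactly why the resulting clusters can differ.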
The point of doing all this is to demonstrate the way hierarchical clustering works: it
maintains a memory of how we went through the process, and that memory is stored
in a dendrogram.
What is a Dendrogram?
A Dendrogram is a type of tree diagram showing hierarchical relationships between
different sets of data.
As already said, a dendrogram contains the memory of the hierarchical clustering algorithm,
so just by looking at the dendrogram you can tell how each cluster was formed.
[Figure: dendrogram of points P0–P6]
Note:-
1. Distance between data points represents dissimilarities.
2. Height of the blocks represents the distance between clusters.
You can observe from the above figure that initially P5 and P6, which are closer to
each other than to any other point, are combined into one cluster, followed by P4 getting
merged into the same cluster (C2). Then P1 and P2 get combined into one cluster,
followed by P0 getting merged into the same cluster (C4). Now P3 gets merged into
cluster C2 and, finally, both clusters get merged into one.
Parts of a Dendrogram
A dendrogram can be a column graph (as in the image below) or a row graph. Some
dendrograms are circular or have a fluid shape, but the software will usually produce a
row or column graph. No matter what the shape, the basic graph comprises the same
parts:
The clades are the branches, and they are arranged according to how similar (or dissimilar)
they are. Clades that are close to the same height are similar to each other; clades with different
heights are dissimilar: the greater the difference in height, the more dissimilarity.
Each clade has one or more leaves.
Leaves A, B, and C are more similar to each other than they are to leaves D, E, or F.
Leaves D and E are more similar to each other than they are to leaves A, B, C, or F.
Leaf F is substantially different from all of the other leaves.
A clade can theoretically have an infinite number of leaves. However, the more leaves
you have, the harder the graph will be to read with the naked eye.
One question that might have intrigued you by now is: how do you decide when to
stop merging the clusters?
You cut the dendrogram tree with a horizontal line at a height where the line can
traverse the maximum distance up and down without intersecting a merging point.
For example, in the below figure the line L3 can traverse the maximum distance up and down
without intersecting the merging points. So we draw a horizontal line there, and the number of
vertical lines it intersects is the optimal number of clusters.
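As a rough illustration (again assuming SciPy and Matplotlib are available), the sketch below draws a dendrogram and then cuts it at a chosen height; the cut height of 3.0 and the data values are arbitrary choices for demonstration, not a rule.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Same kind of made-up 2-D data as before (illustrative only).
X = np.array([[1.0, 1.0], [1.5, 1.2], [5.0, 5.2],
              [5.2, 4.8], [9.0, 9.1], [8.8, 9.4]])

Z = linkage(X, method="complete")

# Draw the dendrogram; the y-axis is the merge distance.
dendrogram(Z, labels=["P0", "P1", "P2", "P3", "P4", "P5"])
plt.ylabel("merge distance")
plt.show()

# "Cutting" the tree with a horizontal line at height t gives the clusters:
# every vertical line crossed by the cut becomes one cluster.
labels = fcluster(Z, t=3.0, criterion="distance")
print(labels)   # cluster label for each point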
There is evidence that divisive algorithms produce more accurate hierarchies than
agglomerative algorithms in some circumstances, but they are conceptually more complex.
In both agglomerative and divisive hierarchical clustering, users need to specify the
desired number of clusters as a termination condition (when to stop merging or splitting).
Well, there are many measures for deciding this; perhaps the most popular one is Dunn's
index. Dunn's index is the ratio of the minimum inter-cluster distance to the
maximum intra-cluster diameter. The diameter of a cluster is the distance between its
two furthest points. In order to have well-separated and compact clusters, you should
aim for a higher Dunn's index.
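There is no single standard library call for Dunn's index, so the sketch below computes it straight from the definition above; the tiny data set, the labels, and the choice of minimum pairwise distance as the inter-cluster distance are assumptions made only for illustration.

import numpy as np
from itertools import combinations

def dunn_index(points, labels):
    """Dunn's index = min inter-cluster distance / max intra-cluster diameter."""
    clusters = [points[labels == k] for k in np.unique(labels)]

    # Diameter of a cluster: distance between its two furthest points.
    def diameter(c):
        return max((np.linalg.norm(a - b) for a, b in combinations(c, 2)), default=0.0)

    # Inter-cluster distance (here): smallest distance between a point of one
    # cluster and a point of the other.
    def inter(c1, c2):
        return min(np.linalg.norm(a - b) for a in c1 for b in c2)

    min_inter = min(inter(c1, c2) for c1, c2 in combinations(clusters, 2))
    max_diam = max(diameter(c) for c in clusters)
    return min_inter / max_diam

# Tiny made-up example: two compact, well-separated clusters -> high Dunn's index.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
lbl = np.array([0, 0, 1, 1])
print(dunn_index(pts, lbl))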
K-Means Clustering
(Help: https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning)
The K-Means algorithm works as follows:
1. Choose the number of clusters, K.
2. Select K points at random as the initial means (centroids).
3. Assign each data point to the cluster whose mean is nearest to it.
4. For each cluster, calculate the new mean based on the data points
in the cluster.
5. Repeat steps 3 and 4 until the cluster assignments (and hence the means) no longer change.
Worked example: cluster the data points {2, 3, 4, 10, 11, 12, 20, 25, 30} into K = 2 clusters.
Iteration 1
Initial means: M1 = 4, M2 = 11
Distance measure: D(x, a) = √((x − a)²) = |x − a|
Assigning each point to the nearest mean gives
C1 = {2, 3, 4}
C2 = {10, 11, 12, 20, 25, 30}
Iteration 2
Therefore
M1 = (2 + 3 + 4)/3 = 3
M2 = (10 + 12 + 20 + 30 + 11 + 25)/6 = 18
Calculating the distance of each point from the new means and assigning it to the nearer one:
Point  D(x, M1=3)  D(x, M2=18)  Cluster
2      1           16           C1
3      0           15           C1
4      1           14           C1
10     7           8            C1
11     8           7            C2
12     9           6            C2
20     17          2            C2
25     22          7            C2
30     27          12           C2
New clusters: C1 = {2, 3, 4, 10}, C2 = {11, 12, 20, 25, 30}
Iteration 3
Therefore
M1 = (2 + 3 + 4 + 10)/4 = 4.75
M2 = (11 + 12 + 20 + 25 + 30)/5 = 19.6
Calculating the distance of each point from the new means and assigning it to the nearer one:
Point  D(x, M1=4.75)  D(x, M2=19.6)  Cluster
2      2.75           17.6           C1
3      1.75           16.6           C1
4      0.75           15.6           C1
10     5.25           9.6            C1
11     6.25           8.6            C1
12     7.25           7.6            C1
20     15.25          0.4            C2
25     20.25          5.4            C2
30     25.25          10.4           C2
New clusters: C1 = {2, 3, 4, 10, 11, 12}, C2 = {20, 25, 30}
Iteration 4
Therefore
M1 = (2 + 3 + 4 + 10 + 12 + 11)/6 = 7
M2 = (20 + 30 + 25)/3 = 25
Calculating the distance of each point from the new means and assigning it to the nearer one:
Point  D(x, M1=7)  D(x, M2=25)  Cluster
2      5           23           C1
3      4           22           C1
4      3           21           C1
10     3           15           C1
11     4           14           C1
12     5           13           C1
20     13          5            C2
25     18          0            C2
30     23          5            C2
New clusters: C1 = {2, 3, 4, 10, 11, 12}, C2 = {20, 25, 30} (unchanged from Iteration 3)
This means that none of the data points has moved to the other cluster.
Also, the means/centroids of these clusters are now constant. So this
becomes the stopping condition for our algorithm.
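The iterations above can be checked with a few lines of plain NumPy; this is only a sketch of the update loop, using the same data and the same starting means as the example.

import numpy as np

data = np.array([2, 3, 4, 10, 11, 12, 20, 25, 30], dtype=float)
means = np.array([4.0, 11.0])          # initial means M1, M2 from the example

while True:
    # Assign each point to the nearest mean (distance in 1-D is |x - m|).
    labels = np.argmin(np.abs(data[:, None] - means[None, :]), axis=1)
    new_means = np.array([data[labels == k].mean() for k in range(len(means))])

    # Stop when the means no longer change, i.e. no point switches cluster.
    if np.allclose(new_means, means):
        break
    means = new_means

print("final means:", means)                       # expected: [7.0, 25.0]
print("C1:", data[labels == 0], "C2:", data[labels == 1])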
Elbow Method
How do we choose K? The elbow method runs K-Means for a range of K values, plots the
sum of squared errors (SSE) between each point and its cluster centroid against K, and
picks the K at the "elbow", the point after which the SSE decreases much more slowly.
SSE = 0 if K = number of data points, which means that each data point
has its own cluster.
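As a rough sketch (assuming scikit-learn and Matplotlib are available), the code below plots SSE (scikit-learn calls it inertia) against K for the data from the worked example, so the elbow can be read off the curve.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# 1-D data from the worked example, reshaped to the (n_samples, n_features)
# shape that scikit-learn expects.
X = np.array([2, 3, 4, 10, 11, 12, 20, 25, 30], dtype=float).reshape(-1, 1)

ks = range(1, 9)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)     # SSE: sum of squared distances to the closest centroid

plt.plot(list(ks), sse, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("SSE")
plt.show()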
There are also many other techniques which are used to determine
the value of K.
So this was all about K-Means. I hope now you have a better
understanding of how K-Means actually works. There are many
other algorithms that are used for clustering in the industry.
K-Means Clustering Example:
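As an illustration (assuming scikit-learn is available), the sketch below clusters a small made-up 2-D data set with K = 2; the data values are invented for demonstration only.

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two loose groups (illustrative values only).
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.6], [7.8, 8.3]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("labels:   ", km.labels_)            # cluster index for each point
print("centroids:", km.cluster_centers_)   # mean of the points in each cluster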
Hierarchical Clustering: Agglomerative Clustering
Example: partway through the agglomerative algorithm, points D and F have already been merged into one cluster; the current distance matrix between the remaining clusters is:
Dist A B C D,F E
A 0 0.71 5.66 3.20 4.24
B 0.71 0 4.95 2.50 3.54
C 5.66 4.95 0 2.24 1.41
D,F 3.20 2.50 2.24 0 1.00
E 4.24 3.54 1.41 1.00 0
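Feeding this matrix to SciPy confirms the next step: with single linkage, the smallest entry (1.00, between the {D, F} cluster and E) drives the next merge. This is only a sketch; the labels list and the use of squareform to convert the matrix are implementation details assumed here.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

labels = ["A", "B", "C", "D,F", "E"]

# The distance matrix from the table above (symmetric, zero diagonal).
D = np.array([
    [0.00, 0.71, 5.66, 3.20, 4.24],
    [0.71, 0.00, 4.95, 2.50, 3.54],
    [5.66, 4.95, 0.00, 2.24, 1.41],
    [3.20, 2.50, 2.24, 0.00, 1.00],
    [4.24, 3.54, 1.41, 1.00, 0.00],
])

# linkage() expects a condensed distance vector, hence squareform().
Z = linkage(squareform(D), method="single")

# First row of Z: indices of the two items merged first and their distance.
i, j, dist, _ = Z[0]
print(f"next merge: {labels[int(i)]} and {labels[int(j)]} at distance {dist:.2f}")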