Module 5: Cluster Analysis - Part 1

Hierarchical clustering is an unsupervised machine learning algorithm that groups similar data points into clusters. It creates a hierarchy of clusters organized as a tree structure or dendrogram. Agglomerative hierarchical clustering is a bottom-up approach where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. Divisive hierarchical clustering takes a top-down approach, initially assigning all observations to a single cluster which is then partitioned recursively into smaller clusters. The optimal number of clusters is determined by cutting the dendrogram tree at a level where the horizontal line can traverse maximum distance without intersecting cluster merges.

What is Hierarchical Clustering?

(Help: https://www.javatpoint.com/hierarchical-clustering-in-machine-learning)

What is Clustering?
Clustering is a technique that groups similar objects such that the objects in the same
group are more similar to each other than to the objects in other groups. The group of
similar objects is called a cluster.

Clustered data points

There are 5 popular clustering algorithms that data scientists need to know:
1. K-Means Clustering
2. Hierarchical Clustering
3. Mean-Shift Clustering
4. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
5. Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
Hierarchical Clustering Algorithm
Also called hierarchical cluster analysis or HCA, it is an unsupervised clustering
algorithm which builds clusters that have a predominant ordering from top to bottom.
For example: all files and folders on our hard disk are organized in a hierarchy.
The algorithm groups similar objects into groups called clusters. The endpoint is a set
of clusters or groups, where each cluster is distinct from every other cluster, and the
objects within each cluster are broadly similar to each other.
This clustering technique is divided into two types:
1. Agglomerative Hierarchical Clustering
2. Divisive Hierarchical Clustering

Agglomerative Hierarchical Clustering

Agglomerative hierarchical clustering is the most common type of hierarchical
clustering, used to group objects into clusters based on their similarity. It is also known
as AGNES (Agglomerative Nesting). It is a "bottom-up" approach: each observation
starts in its own cluster, and pairs of clusters are merged as one moves up the
hierarchy.
How does it work? (see the code sketch after the figure below)
1. Make each data point a single-point cluster → this forms N clusters.
2. Take the two closest data points and make them one cluster → this forms N-1 clusters.
3. Take the two closest clusters and make them one cluster → this forms N-2 clusters.
4. Repeat step 3 until you are left with only one cluster.
Have a look at the visual representation of Agglomerative Hierarchical Clustering for
better understanding:
Agglomerative Hierarchical Clustering
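As a minimal sketch of the four steps above, assuming scikit-learn and NumPy are installed (the toy 2-D points and the choice of two clusters are made up for illustration):

```python
# Minimal sketch of bottom-up (agglomerative) clustering with scikit-learn.
# The toy 2-D points and n_clusters=2 are illustrative assumptions.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Each point starts as its own cluster; the closest pair of clusters is
# merged repeatedly until only n_clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # two groups: the points near x=1 and the points near x=10
```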

There are several ways to measure the distance between clusters in order to decide the
rules for clustering, and they are often called Linkage Methods. Some of the common
linkage methods are:
 Complete-linkage: the distance between two clusters is defined as the longest distance
between a point in one cluster and a point in the other.
 Single-linkage: the distance between two clusters is defined as the shortest distance
between a point in one cluster and a point in the other. This linkage may be used to detect
outliers in your data set, since outlying points tend to be merged only at the end.
 Average-linkage: the distance between two clusters is defined as the average distance
between each point in one cluster and every point in the other cluster.
 Centroid-linkage: finds the centroid of cluster 1 and the centroid of cluster 2, and then
calculates the distance between the two before merging.
The choice of linkage method is up to you; there is no hard and fast rule that will always
give good results, and different linkage methods lead to different clusters.
The point of doing all this is to show how hierarchical clustering works: it maintains a
memory of how we went through this process, and that memory is stored in the
dendrogram.
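As a sketch, assuming SciPy and Matplotlib are available, that merge history can be computed as a linkage matrix and drawn as a dendrogram; the method argument selects one of the linkage rules listed above:

```python
# Compute the merge history (linkage matrix) and draw it as a dendrogram.
# The random toy data is illustrative; try "single", "complete", "average"
# or "centroid" as the method to see how the tree changes.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.random((8, 2))                     # 8 toy points in 2-D
Z = linkage(X, method="complete")          # each row of Z records one merge
dendrogram(Z, labels=[f"P{i}" for i in range(len(X))])
plt.show()
```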
What is a Dendrogram?
A Dendrogram is a type of tree diagram showing hierarchical relationships between
different sets of data.
As already said, a dendrogram contains the memory of the hierarchical clustering algorithm,
so just by looking at the dendrogram you can tell how each cluster was formed.

Dendrogram

Note:-
1. Distance between data points represents dissimilarities.
2. Height of the blocks represents the distance between clusters.
So you can observe from the above figure that initially P5 and P6, which are closer to
each other than to any other point, are combined into one cluster, followed by P4 being
merged into the same cluster (C2). Then P1 and P2 are combined into one cluster,
followed by P0 being merged into the same cluster (C4). Next P3 is merged into
cluster C2 and finally both clusters are merged into one.
Parts of a Dendrogram

A dendrogram can be a column graph (as in the image below) or a row graph. Some
dendrograms are circular or have a fluid-shape, but the software will usually produce a
row or column graph. No matter what the shape, the basic graph comprises the same
parts:
 The clades are the branches, and they are arranged according to how similar (or dissimilar)
they are. Clades that are close to the same height are similar to each other; clades with different
heights are dissimilar, and the greater the difference in height, the greater the dissimilarity.
 Each clade has one or more leaves.
 Leaves A, B, and C are more similar to each other than they are to leaves D, E, or F.
 Leaves D and E are more similar to each other than they are to leaves A, B, C, or F.
 Leaf F is substantially different from all of the other leaves.
A clade can theoretically have an infinite number of leaves. However, the more leaves
you have, the harder the graph is to read with the naked eye.
One question that might have intrigued you by now is: how do you decide when to
stop merging the clusters?
You cut the dendrogram with a horizontal line at a height where the line can
traverse the maximum distance up and down without intersecting a merging point.
For example, in the figure below, line L3 can traverse the maximum distance up and down
without intersecting the merging points. So we draw a horizontal line there, and the number of
vertical lines it intersects is the optimal number of clusters.

Choosing the optimal number of clusters.

Number of Clusters in this case = 3.
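As a minimal sketch of this cut with SciPy's fcluster (the 1-D toy data and the cut height of 1.0 are made-up assumptions):

```python
# "Cut" the dendrogram at a chosen height to obtain flat cluster labels.
# The 1-D toy data and the cut height of 1.0 are illustrative assumptions.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0], [1.2], [5.0], [5.3], [9.0], [9.4]])
Z = linkage(X, method="single")
labels = fcluster(Z, t=1.0, criterion="distance")  # horizontal cut at height 1.0
print(labels)  # three clusters for this toy data
```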

Divisive Hierarchical Clustering

Divisive clustering, or DIANA (DIvisive ANAlysis), is a top-down clustering method
where we assign all of the observations to a single cluster and then partition that cluster
into the two least similar clusters. We then proceed recursively on each cluster until there is
one cluster for each observation. So this clustering approach is exactly the opposite of
agglomerative clustering.

There is evidence that divisive algorithms produce more accurate hierarchies than
agglomerative algorithms in some circumstances, but they are conceptually more complex.
In both agglomerative and divisive hierarchical clustering, users need to specify the
desired number of clusters as a termination condition (when to stop merging or splitting).
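As a rough illustration only (not the actual DIANA procedure), here is a top-down sketch that starts with one cluster holding all points and recursively splits each cluster in two using 2-means; the toy 1-D data is made up:

```python
# Simplified top-down sketch: start with one cluster holding all points and
# recursively split each cluster in two until every point is on its own.
# This uses 2-means for the split and is NOT the exact DIANA algorithm.
import numpy as np
from sklearn.cluster import KMeans

def divisive(points, depth=0):
    print("  " * depth, points.ravel().tolist())
    if len(points) <= 1:
        return
    split = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    divisive(points[split == 0], depth + 1)
    divisive(points[split == 1], depth + 1)

divisive(np.array([[2.0], [3.0], [4.0], [10.0], [11.0], [25.0]]))
```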

Measuring the goodness of Clusters

Well, there are many measures to do this; perhaps the most popular one is Dunn's
index. Dunn's index is the ratio of the minimum inter-cluster distance to the
maximum intra-cluster diameter. The diameter of a cluster is the distance between its
two furthermost points. In order to have well separated and compact clusters, you should
aim for a higher Dunn's index.
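A minimal sketch of Dunn's index under these definitions, assuming the data points sit in a NumPy array X and the cluster assignments in an integer array labels, and taking the inter-cluster distance as the minimum point-to-point distance between two clusters:

```python
# Dunn's index = (minimum inter-cluster distance) / (maximum cluster diameter).
# Higher is better: clusters are far apart and individually compact.
import numpy as np

def pairwise(a, b):
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Diameter: distance between the two furthermost points of a cluster.
    max_diam = max(pairwise(c, c).max() for c in clusters)
    # Minimum distance between a point of one cluster and a point of another.
    min_inter = min(pairwise(clusters[i], clusters[j]).min()
                    for i in range(len(clusters))
                    for j in range(i + 1, len(clusters)))
    return min_inter / max_diam
```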
K-Means Clustering

(Help: https://www.javatpoint.com/k-means-clustering-algorithm-in-machine-learning)

K-Means Clustering Statement

K-means tries to partition n data points into a set of k clusters
where each data point is assigned to its closest cluster. The
method is defined by an objective function which tries to
minimize the sum of all squared distances within a cluster, for all
clusters.

The objective function is defined as:

J = Σ_{i=1}^{k} Σ_{xj ∈ Si} ||xj - μi||²

where xj is a data point in the data set, Si is a cluster (a set of data
points), and μi is the cluster mean (the center of cluster Si).
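As a small sketch, assuming X holds the data points, labels the cluster assignments, and means the cluster centers (all three names are illustrative), the objective J can be computed as:

```python
# Sum of squared distances of every point to the mean of its own cluster.
import numpy as np

def kmeans_objective(X, labels, means):
    return sum(np.sum((X[labels == i] - mu) ** 2) for i, mu in enumerate(means))
```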

K-Means Clustering Algorithm:

1. Choose a value of k, the number of clusters to be formed.

2. Randomly select k data points from the data set as the initial
cluster centroids/centers.

3. For each data point:

a. Compute the distance between the data point and each cluster
centroid.

b. Assign the data point to the closest centroid.

4. For each cluster, calculate the new mean based on the data points
in the cluster.

5. Repeat steps 3 and 4 until the means of the clusters stop changing or
the maximum number of iterations is reached.
Flowchart of K-Means Clustering
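Below is a minimal NumPy sketch of these five steps; the random initialization and k = 2 are illustrative, and the data set is the one used in the worked example that follows:

```python
# Minimal NumPy sketch of k-means: random init, assign, update, repeat.
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # steps 1-2
    for _ in range(max_iters):
        # Step 3: assign every point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 5: stop once the means no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[2.0], [4.0], [10.0], [12.0], [3.0], [20.0], [30.0], [11.0], [25.0]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids.ravel())
```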

Understanding with a simple example


We will apply k-means to the following 1-dimensional data set for
K = 2.

Data set {2, 4, 10, 12, 3, 20, 30, 11, 25}

Iteration 1

M1, M2 are the two randomly selected centroids/means where

M1= 4, M2=11

and the initial clusters are

C1= {4}, C2= {11}

Calculate the Euclidean distance as

D(x, a) = √((x - a)²) = |x - a|

D1 is the distance from M1

D2 is the distance from M2


Iteration 1

x     D1 = |x - 4|    D2 = |x - 11|    Cluster
2           2               9             C1
4           0               7             C1
10          6               1             C2
12          8               1             C2
3           1               8             C1
20         16               9             C2
30         26              19             C2
11          7               0             C2
25         21              14             C2

As we can see in the table above, two data points (2 and 3) are added to cluster
C1 and the other data points are added to cluster C2.

Therefore

C1= {2, 4, 3}

C2= {10, 12, 20, 30, 11, 25}

Iteration 2

Calculate new mean of datapoints in C1 and C2.

Therefore

M1= (2+3+4)/3= 3

M2= (10+12+20+30+11+25)/6= 18
Calculate the distances and update the clusters as shown in the table below.

Iteration 2

New Clusters

C1= {2, 3, 4, 10}

C2= {12, 20, 30, 11, 25}

Iteration 3

Calculate new mean of datapoints in C1 and C2.

Therefore

M1= (2+3+4+10)/4= 4.75

M2= (12+20+30+11+25)/5= 19.6


Calculate the distances and update the clusters as shown in the table below.

Iteration 3

New Clusters

C1= {2, 3, 4, 10, 12, 11}

C2= {20, 30, 25}

Iteration 4

Calculate new mean of datapoints in C1 and C2.

Therefore

M1= (2+3+4+10+12+11)/6=7

M2= (20+30+25)/3= 25
Calculate the distances and update the clusters as shown in the table below.

Iteration 4

New Clusters

C1= {2, 3, 4, 10, 12, 11}

C2= {20, 30, 25}

As we can see, the data points in clusters C1 and C2 in iteration 4 are the same
as the data points in clusters C1 and C2 of iteration 3.

This means that none of the data points has moved to another cluster.
Also, the means/centroids of these clusters are unchanged. So this
becomes the stopping condition for our algorithm.
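The whole trace can be reproduced with a short plain-Python script that fixes the same initial means M1 = 4 and M2 = 11 and stops when the assignments no longer change:

```python
# Reproduce the worked example: 1-D data, fixed initial means 4 and 11,
# absolute difference as the distance, stop when the clusters stop changing.
data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
m1, m2 = 4.0, 11.0
prev = None
for it in range(1, 10):
    c1 = [x for x in data if abs(x - m1) <= abs(x - m2)]
    c2 = [x for x in data if abs(x - m1) > abs(x - m2)]
    print(f"Iteration {it}: M1={m1}, M2={m2}, C1={c1}, C2={c2}")
    if (c1, c2) == prev:          # stopping condition: nothing moved
        break
    prev = (c1, c2)
    m1, m2 = sum(c1) / len(c1), sum(c2) / len(c2)
```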

How many clusters?


Selecting a proper value of K is very difficult unless we have good
knowledge of our data set.

Therefore we need some method to determine and validate
whether we are using the right number of clusters. The
fundamental aim of partitioning a data set is to minimize the intra-
cluster variation, or SSE. SSE is calculated by first taking the
difference between each data point and its centroid, and then adding
up the squares of all of these differences.

So, to find the optimal number of clusters:

Run k-means for different values of K, for example K varying
from 1 to 10, and for each value of K compute the SSE.

Plot a line chart with the K values on the x-axis and the corresponding values of
SSE on the y-axis, as shown below.

Elbow Method
SSE = 0 when K equals the number of data points, which means that each data point
forms its own cluster.

As we can see in the graph, there is a rapid drop in SSE as we move
from K = 2 to K = 3, and the curve becomes almost constant as the value of K is
increased further.

Because of this sudden drop we see an elbow in the graph, so the
value to be chosen for K is 3. This method is known as the elbow
method.
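A sketch of the elbow method with scikit-learn and Matplotlib, using the 1-D example data from above (so K only runs up to 9 here); scikit-learn's KMeans exposes the SSE as its inertia_ attribute:

```python
# Elbow method: run k-means for several K, record the SSE (inertia_),
# and plot SSE against K; look for the "elbow" in the curve.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.array([[2], [4], [10], [12], [3], [20], [30], [11], [25]], dtype=float)

ks = list(range(1, 10))           # only 9 points, so K goes up to 9 here
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("SSE")
plt.show()
```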

There are also many other techniques that are used to determine
the value of K.

K-means is the 'go-to' clustering algorithm because it is fast and
easy to understand.

Listing some drawbacks of K-Means

1. The result might not be globally optimal: we cannot guarantee that
this algorithm will lead to the best global solution; selecting
different random seeds at the beginning affects the final results.

2. The value of K needs to be specified beforehand: we can only guess this
value if we have a good idea about our data set, and when working
with a new data set the elbow method can be used to determine the
value of K.

3. Works only for linear boundaries: k-means assumes that cluster
boundaries are linear, so it fails when it comes to complicated
(non-linear) boundaries.

4. Slow for a large number of samples: as the algorithm accesses every
point of the data set in each iteration, it becomes slow when the sample size grows.

So this was all about k-means. I hope you now have a better
understanding of how k-means actually works. There are many
other algorithms that are used for clustering in industry.
K-Means Clustering Example:
Hierarchical Clustering: Agglomerative Clustering

Example:

Dist   A      B      C      D,F    E
A      0      0.71   5.66   3.20   4.24
B      0.71   0      4.95   2.50   3.54
C      5.66   4.95   0      2.24   1.41
D,F    3.20   2.50   2.24   0      1.00
E      4.24   3.54   1.41   1.00   0
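As a sketch, the table above can be fed to SciPy as a condensed (upper-triangle) distance vector; single linkage is chosen here only for illustration, since the example does not state which linkage rule is used:

```python
# Agglomerative clustering from a precomputed distance matrix.
# The condensed vector lists the upper triangle of the table row by row:
# A-B, A-C, A-"D,F", A-E, B-C, B-"D,F", B-E, C-"D,F", C-E, "D,F"-E.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

condensed = [0.71, 5.66, 3.20, 4.24,
             4.95, 2.50, 3.54,
             2.24, 1.41,
             1.00]
Z = linkage(condensed, method="single")   # linkage choice is an assumption
dendrogram(Z, labels=["A", "B", "C", "D,F", "E"])
plt.show()
```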
