Unit 3: Clustering
Unit – 3 Syllabus
• Clustering: Introduction
• Hierarchical Clustering:
– Agglomerative Clustering Algorithm
– The Single Linkage Algorithm
– The Complete Linkage Algorithm
– The Average Linkage Algorithm
• Partitional Clustering:
– Forgy’s Algorithm
– The K-Means Algorithm
Introduction
• In the earlier chapters, we saw how samples may be classified when a training set is available to use in the design of a classifier.
• However, in many situations the classes themselves are initially undefined.
• Given a set of feature vectors sampled from some population, we would like to know whether the data set consists of a number of relatively distinct subsets; if it does, we can define those subsets to be classes.
• This is sometimes called class discovery or unsupervised classification.
When the goal is to group similar data points in a dataset, we use cluster analysis.
Clustering refers to the process of grouping samples so that the samples are similar within each group. The groups are called clusters.
Clustering falls under the branch of unsupervised learning, which aims at gaining insights from unlabelled data points; that is, unlike supervised learning, we do not have a target variable.
• A good clustering will have high intra-class similarity and low inter-
class similarity
Applications of Clustering
• Recommendation engines
• Market segmentation
• Social network analysis
• Search result grouping
• Medical imaging
• Image segmentation
• Anomaly detection
Types of clustering:
• Hierarchical Clustering:
– Agglomerative Clustering Algorithm
• The Single Linkage Algorithm
• The Complete Linkage Algorithm
• The Average Linkage Algorithm
– Divisive approach
• Polythetic: the division is based on more than one feature.
• Monothetic: only one feature is considered at a time.
• Partitional Clustering:
– Forgy’s Algorithm
– The K-Means Algorithm
– The ISODATA Algorithm.
Hierarchical clustering
• Hierarchical clustering refers to a clustering process that
organizes the data into large groups, which contain smaller
groups and so on.
• A hierarchical clustering may be drawn as a tree or
dendrogram.
• The finest grouping is at the bottom of the dendrogram, where each sample by itself forms a cluster.
• At the top of the dendrogram, all samples are grouped into one cluster.
Hierarchical clustering
• The figure shown here illustrates hierarchical clustering.
• At the top level we have Animals, followed by subgroups.
• We do not have to assume any particular number of clusters.
• The representation is called dendrogram.
• Any desired number of clusters can be
obtained by ‘cutting’ the dendrogram at the
proper level.
We develop the hierarchy of clusters in the form of a tree, and this tree-shaped structure is known as the dendrogram.
Example: Agglomerative
• 100 students from India join an MS program at a particular university in the USA.
• Initially, each of them looks like a single cluster.
• After some time, 2 students from SJCE, Mysuru form a cluster.
• Similarly, another cluster of 3 students (patterns/samples) from RVCE meets the SJCE students.
• Now these two clusters form another, bigger cluster of Karnataka students.
• Later, a South Indian student cluster forms, and so on.
Agglomerative Clustering: It uses a bottom-up approach. It starts with each object forming its own cluster and then iteratively merges the clusters according to their similarity to form larger clusters.
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point
(or there are k clusters)
• Traditional hierarchical algorithms use a similarity or distance matrix
– Merge or split one cluster at a time
Agglomerative Clustering Algorithm
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains
The key operation is the computation of the proximity of two clusters.
– Different approaches to defining the distance between clusters distinguish the different algorithms.
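As an illustration (not part of the original notes), the sketch below runs agglomerative clustering with SciPy on a small set of made-up 2-D points; the method argument selects the single, complete, or average linkage criterion described next.

# A minimal sketch of agglomerative clustering using SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Made-up 2-D sample points, for illustration only.
X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9], [9.0, 9.0]])

# method can be 'single', 'complete', or 'average', matching the
# three linkage criteria discussed in this unit.
Z = linkage(X, method='single', metric='euclidean')

# Cut the dendrogram to obtain, for example, two clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)

# Draw the dendrogram; the height of each merge is the linkage distance.
dendrogram(Z)
plt.show()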
Some commonly used criteria in Agglomerative clustering Algorithms
(The most popular distance measure used is Euclidean distance)
Single Linkage:
The distance between two clusters is the smallest pairwise distance between a sample in one cluster and a sample in the other cluster:
D(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }
• In the next step, these two clusters are merged into a single cluster.
• The dendrogram is as shown here.
• The height of each merge in the dendrogram is the distance at which the merger takes place.
  For example, samples 1 and 2 are merged at the least distance, 4; hence the height of that merge is 4.
The complete linkage Algorithm
• It is also called the maximum method or the farthest neighbor
method.
• It is obtained by defining the distance between two clusters to be the largest distance between a sample in one cluster and a sample in the other cluster.
• If Ci and Cj are clusters, we define:
  D(Ci, Cj) = max { d(x, y) : x ∈ Ci, y ∈ Cj }
Example : Complete linkage algorithm
• Consider the same samples used in single linkage:
• Apply Euclidean distance and compute the distance.
• Algorithm starts with 5 clusters.
• As earlier samples 1 and 2 are the closest, they are merged first.
• While merging, the maximum distance is used to replace the distance/cost value.
• For example, the distance between 1 & 3 is 11.7 and between 2 & 3 is 8.1; the algorithm therefore records 11.7 as the distance between {1,2} and 3.
• In complete linkage hierarchical clustering, the distance between two clusters is defined as the longest distance between a point in one cluster and a point in the other.
• In the next level, the smallest distance in the matrix is 8.0
between 4 and 5. Now merge 4 and 5.
• In the next step, the smallest distance is 9.8, between 3 and {4,5}, so they are merged.
• At this stage we will have two clusters {1,2} and {3,4,5}.
• Notice that these clusters are different from those obtained from
single linkage algorithm.
• At the next step, the two remaining clusters will be merged.
• The hierarchical clustering will be complete.
• The dendrogram is as shown in the figure.
The Average Linkage Algorithm
• The average linkage algorithm is an attempt to compromise between the extremes of the single and complete linkage algorithms.
• It is also known as the unweighted pair group method using
arithmetic averages.
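Written out in the same style as the single and complete linkage definitions above (the notation ni, nj for the cluster sizes is mine), the average linkage distance between clusters Ci and Cj is the mean of all pairwise distances:

D_avg(Ci, Cj) = (1 / (ni · nj)) Σ_{x ∈ Ci} Σ_{y ∈ Cj} d(x, y)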
Example: Average linkage clustering algorithm
• Consider the same samples; compute the Euclidean distance between the samples.
• In the next step, cluster 1 and 2 are merged, as the distance
between them is the least.
• The distance values are computed based on the average
values.
• For example, the distance between 1 & 3 is 11.7 and between 2 & 3 is 8.1, so the average is 9.9. This value replaces the entry between {1,2} and 3 in the matrix.
• In the next stage 4 and 5 are merged:
Example 2: Single Linkage
Then the updated distance matrix becomes:
Example 3: Single linkage
As we are using single linkage, we choose the minimum distance; therefore, we choose 4.97 and consider it as the distance between D1 and {D4, D5}. If we were using complete linkage, then the maximum value, 6.09, would have been selected as the distance between D1 and {D4, D5}. If we were using average linkage, then the average of these two distances would have been taken; here the distance between D1 and {D4, D5} would have come out to be 5.53 ((4.97 + 6.09) / 2).
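As a small illustration, the three update rules can be computed directly; the two distances are taken from the example above, and the variable names are mine:

# Distance from D1 to the merged cluster {D4, D5} under the three
# linkage rules. The two pairwise distances come from the example.
d1 = 4.97   # distance from D1 to one member of {D4, D5}
d2 = 6.09   # distance from D1 to the other member

print(min(d1, d2))     # single linkage   -> 4.97
print(max(d1, d2))     # complete linkage -> 6.09
print((d1 + d2) / 2)   # average linkage  -> 5.53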
From now on we simply repeat Step 2 and Step 3 until we are left with one cluster. We again look for the minimum value, which comes out to be 1.78, indicating that the next cluster is formed by merging the data points D1 and D2. Similar to what we did in Step 3, we again recalculate the distances, this time for the cluster {D1, D2}, and come up with the updated distance matrix.
Ward's Algorithm
• The squared error for sample x_i is its squared Euclidean distance from the mean (its contribution to the variance):
  Σ_{j=1}^{d} (x_ij − μ_j)²
• where μ_j is the mean of feature j for the values in the cluster, given by:
  μ_j = (1/m) Σ_{i=1}^{m} x_ij
Ward’s Algorithm… Continued
• The squared error E for the entire cluster is the sum of the squared errors of its samples:
  E = Σ_{i=1}^{m} Σ_{j=1}^{d} (x_ij − μ_j)² = m σ²
• The vector composed of the means of each feature, μ = (μ_1, …, μ_d), is called the mean vector, or centroid, of the cluster.
• The squared error is thus the total variance of the cluster, σ², times the number of samples, m.
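A quick numerical check of the identity E = m·σ² on a made-up one-dimensional cluster (the values are illustrative only):

# Verifying that the cluster squared error equals m times the variance.
import numpy as np

cluster = np.array([2.0, 4.0, 6.0, 8.0])   # illustrative values
m = len(cluster)
mu = cluster.mean()

E = np.sum((cluster - mu) ** 2)   # squared error of the cluster
variance = np.var(cluster)        # population variance, sigma^2

print(E, m * variance)            # both equal 20.0 here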
One Hot Encoding
• Popularly used in classification problems.
• One hot encoding creates new (binary) columns, indicating the presence of each possible value from the original data.
• It works well only when the number of categories is small.
• A typical dataset in any data science project consists of numerical and categorical features. While numerical features can contain only numbers, i.e., integers or decimals, categorical features take values from a limited set of categories, for example:
  – A colour variable with values red, blue, and green
  – A country variable with values India, USA, and Germany
One Hot Encoding can be defined as a process of transforming
categorical variables into numerical format before fitting and training a
Machine Learning algorithm.
For each categorical variable, One Hot Encoding produces a numeric
vector with a length equal to the number of categories present in the
feature.
One Hot Encoding is a technique that is used to convert categorical
variables into numerical format. It maps a categorical variable to a
binary vector with a length equal to the number of categories present
in the variable.
Ex: the colour variable above is mapped to red → (1,0,0), blue → (0,1,0), green → (0,0,1).
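A minimal sketch of one hot encoding using pandas; the DataFrame and column name simply mirror the colour example above:

# One hot encoding of a small categorical column using pandas.
import pandas as pd

df = pd.DataFrame({"colour": ["red", "blue", "green", "red"]})

# get_dummies creates one binary column per category.
encoded = pd.get_dummies(df, columns=["colour"])
print(encoded)
# Columns produced: colour_blue, colour_green, colour_red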
Forgy's Algorithm
Apart from the data, the input to the algorithm is 'k', the number of clusters to be constructed.
Data Point   X    Y
1            4    4
2            8    4
3            15   8
4            24   4
5            24   12
With k = 2 and initial centroids (4,4) and (8,4), each sample is assigned to the nearest centroid:

Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (8,4)
(15,8)     (8,4)
(24,4)     (8,4)
(24,12)    (8,4)
The clusters {(4,4)} and {(8,4),(15,8),(24,4),(24,12)} are formed.
Now re-compute the cluster centroids
New centroids:
The first cluster centroid remains (4,4).
The second cluster centroid is x = (8+15+24+24)/4 = 17.75 and y = (4+8+4+12)/4 = 7, i.e. (17.75, 7).
Sample     Nearest Cluster Centroid
(4,4)      (4,4)
(8,4)      (4,4)
(15,8)     (17.75,7)
(24,4)     (17.75,7)
(24,12)    (17.75,7)
The clusters {(4,4),(8,4)} and {(15,8),(24,4),(24,12)} are formed.
Now re-compute the cluster centroids
The first cluster centroid x = (4+8)/2 = 6 and y = (4+4)/2 = 4
The second cluster centroid is x = (15+24+24)/3 = 21 and y = (8+4+12)/3 = 8.
In the next step, notice that the cluster assignments, and hence the centroids, do not change:

Sample     Nearest Cluster Centroid
(4,4)      (6,4)
(8,4)      (6,4)
(15,8)     (21,8)
(24,4)     (21,8)
(24,12)    (21,8)

The algorithm therefore terminates with the clusters {(4,4),(8,4)} and {(15,8),(24,4),(24,12)}.
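Below is a minimal sketch of Forgy's algorithm in Python. The data points and the initial centroids (4,4) and (8,4) are taken from the example above; the function name and structure are my own, and the sketch assumes no cluster ever becomes empty.

# A minimal sketch of Forgy's algorithm: assign every sample to its
# nearest centroid, recompute the centroids, and repeat until the
# assignments stop changing.
import numpy as np

def forgy(samples, centroids):
    samples = np.asarray(samples, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    labels = None
    while True:
        # Assign each sample to the nearest centroid (Euclidean distance).
        dists = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            return labels, centroids
        labels = new_labels
        # Recompute each centroid as the mean of its assigned samples
        # (assumes no cluster becomes empty).
        centroids = np.array([samples[labels == k].mean(axis=0)
                              for k in range(len(centroids))])

points = [(4, 4), (8, 4), (15, 8), (24, 4), (24, 12)]
labels, centroids = forgy(points, centroids=[(4, 4), (8, 4)])
print(labels)      # cluster index of each point
print(centroids)   # final centroids: (6, 4) and (21, 8) for this data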
Example 2: Illustration of Forgy's clustering algorithm
A1     A2
6.8    12.6
0.8    9.8
1.2    11.6
2.8    9.6
3.8    9.9
4.4    6.5
4.8    1.1
6.0    19.9
6.2    18.5
7.6    17.4
7.8    12.2
6.6    7.7
8.2    4.5
8.4    6.9
9.0    3.4
9.6    11.1

[Scatter plot of the 16 data points, with A1 on the x-axis and A2 on the y-axis]
Example 2: Forgy’s clustering algorithms
• Suppose k = 3. Three objects are chosen at random (shown circled in the plot) as the initial centroids, listed below.

Initial centroids chosen randomly:

Centroid   A1    A2
c1         3.8   9.9
c2         7.8   12.2
c3         6.2   18.5
The new centroids of the three clusters, calculated as the mean of the A1 and A2 values of their members, are shown in the table below. The clusters with the new centroids are shown in the figure.

Centroid   A1    A2
c1         4.6   7.1
c2         8.2   10.7
c3         6.6   18.6
Example 2: Forgy's clustering algorithm (continued)
• The centroids obtained after the second iteration are given in the table below. Note that the centroid c3 remains unchanged, while c1 and c2 change a little.
• With respect to the newly obtained cluster centres, the 16 points are reassigned again. These form the same clusters as before; hence their centroids also remain unchanged.
• Taking this as the termination criterion, the algorithm stops here.
Revised centroids:

Centroid   A1    A2
c1         5.0   7.1
c2         8.1   12.0
c3         6.6   18.6
Apply Forgy’s algorithm for the following dataset with K = 2
Sample X Y
1 0.0 0.5
2 0.5 0.0
3 1.0 0.5
4 2.0 2.0
5 3.5 8.0
6 5.0 3.0
7 7.0 3.0
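One possible way to check an answer is with scikit-learn's KMeans, which performs the same assign-to-nearest-centroid and recompute-centroids iteration as Forgy's algorithm. Taking the first two samples as the initial centroids is an assumption; Forgy's algorithm normally picks the k seeds at random.

# Checking the exercise with scikit-learn (initial centroids assumed).
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.5], [0.5, 0.0], [1.0, 0.5], [2.0, 2.0],
              [3.5, 8.0], [5.0, 3.0], [7.0, 3.0]])
init = X[:2]                  # assumed initial centroids

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)             # cluster index of each sample
print(km.cluster_centers_)    # final centroids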
K-Means Algorithm
It is similar to Forgy's algorithm.
The k-means algorithm differs from Forgy's algorithm in that the centroids of the clusters are recomputed as soon as a sample joins a cluster.
Also, unlike Forgy's algorithm, which is iterative in nature, the k-means algorithm makes only two passes through the data set.
The K-Means Algorithm
1. The input for this algorithm is K (the number of clusters) and the n samples x1, x2, …, xn. Begin with K clusters, each consisting of one of the first K samples; these samples serve as the initial centroids.
2. For each of the remaining (n − K) samples, find the centroid nearest to it. Put the sample in the cluster identified with this nearest centroid. After each sample is assigned, re-compute the centroid of the altered cluster.
3. Go through the data a second time. For each sample, find the centroid nearest to it and put the sample in the cluster identified with that centroid. (During this step, do not recompute any centroid.)
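A minimal sketch of the two-pass procedure described above; the function name and structure are my own, and the usage lines seed the clusters with (8,4) and (24,4) as in the worked example that follows.

# A minimal sketch of the two-pass k-means variant: the first k samples
# seed the clusters, centroids are updated as soon as a sample joins a
# cluster, and a second pass reassigns samples without further updates.
import numpy as np

def two_pass_kmeans(samples, k):
    samples = np.asarray(samples, dtype=float)
    centroids = samples[:k].copy()      # the first k samples seed the clusters
    members = [[i] for i in range(k)]   # indices of the samples in each cluster

    # Pass 1: place each remaining sample in the nearest cluster and
    # immediately recompute that cluster's centroid.
    for i in range(k, len(samples)):
        nearest = int(np.linalg.norm(centroids - samples[i], axis=1).argmin())
        members[nearest].append(i)
        centroids[nearest] = samples[members[nearest]].mean(axis=0)

    # Pass 2: reassign every sample to its nearest centroid,
    # without recomputing any centroid.
    labels = np.array([int(np.linalg.norm(centroids - x, axis=1).argmin())
                       for x in samples])
    return labels, centroids

# Clusters seeded with (8,4) and (24,4), as in the worked example below.
pts = [(8, 4), (24, 4), (15, 8), (4, 4), (24, 12)]
labels, centroids = two_pass_kmeans(pts, k=2)
print(labels)
print(centroids)   # approximately (9, 5.33) and (24, 8), as in the example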
Apply the k-means algorithm to the same five sample points used in the earlier example: (4,4), (8,4), (15,8), (24,4), (24,12).
Begin with two clusters {(8,4)} and {(24,4)} with the centroids
(8,4) and (24,4)
For each remaining samples, find the nearest centroid and put it in that
cluster.
Then re-compute the centroid of the cluster.
The next sample (15,8) is closer to (8,4) so it joins the cluster {(8,4)}.
The centroid of the first cluster is updated to (11.5,6).
(8+15)/2 = 11.5 and (4+8)/2 = 6.
The next sample, (4,4), is nearest to the centroid (11.5,6), so it joins the cluster, which becomes {(8,4),(15,8),(4,4)}.
Now the new centroid of the cluster is (9, 5.3).
The next sample (24,12) is closer to centroid (24,4) and joins the cluster {(24,4),(24,12)}.
Now the new centroid of the second cluster is updated to (24,8).
At this point, Step 2 (the first pass) is complete.
For Step 3 (the second pass), examine the samples one by one and put each sample in the cluster identified with the nearest centroid.