
Lecture 8

Hierarchical Clustering Methods


➢ Linkage methods are suitable for clustering items, as well as variables.
➢ Types of linkage:
✓ single linkage (minimum distance or nearest neighbor),
✓ complete linkage (maximum distance or farthest neighbor), and
✓ average linkage (average distance).

➢ From the figure, we see that


✓ single linkage results when groups are fused according to the distance
between their nearest members.
✓ Complete linkage occurs when groups are fused according to the
distance between their farthest members.
✓ For average linkage, groups are fused according to the average
distance between pairs of members in the respective sets.
The agglomerative hierarchical clustering algorithm for grouping N objects (items or
variables) consists of the following steps:
1. Start with N clusters, each containing a single entity and an 𝑵 × 𝑵 symmetric
matrix of distances (or similarities) 𝐷 = {𝑑𝑖𝑘 }.
2. Search the distance matrix for the nearest (most similar) pair of clusters. Let
the distance between “most similar” clusters U and V be 𝑑𝑈𝑉 .
3. Merge clusters U and V. Label the newly formed cluster (UV). Update the
entries in the distance matrix by (a) deleting the rows and columns
corresponding to clusters U and V and (b) adding a row and column giving
the distances between cluster (UV) and the remaining clusters.
4. Repeat Steps 2 and 3 a total of N - 1 times. (All objects will be in a single
cluster after the algorithm terminates.) Record the identity of clusters that are
merged and the levels (distances or similarities) at which the mergers take
place.
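The following is a minimal Python sketch of this general algorithm (not part of the original notes; the function and variable names are illustrative). Passing min as the combining rule gives single linkage and max gives complete linkage; average linkage would also need the cluster sizes (see the formula given later).

import numpy as np

def agglomerate(D, combine=min):
    """Sketch of the general agglomerative algorithm (Steps 1-4).

    D       : N x N symmetric distance matrix (zero diagonal)
    combine : min -> single linkage, max -> complete linkage
    Returns one (cluster_U, cluster_V, level) record per merger.
    """
    D = np.asarray(D, float)
    n = len(D)
    clusters = [(i,) for i in range(n)]                      # Step 1: N singleton clusters
    dist = {(clusters[i], clusters[j]): D[i, j]
            for i in range(n) for j in range(i + 1, n)}
    merges = []
    while len(clusters) > 1:
        (u, v), level = min(dist.items(), key=lambda kv: kv[1])   # Step 2: nearest pair
        merges.append((u, v, level))
        uv = u + v                                                # Step 3: merge U and V
        clusters = [c for c in clusters if c not in (u, v)]
        new = {(uv, c): combine(dist.get((u, c), dist.get((c, u))),
                                dist.get((v, c), dist.get((c, v))))
               for c in clusters}
        dist = {k: d for k, d in dist.items() if u not in k and v not in k}
        dist.update(new)                                          # add the (UV) row/column
        clusters.append(uv)
    return merges                                                 # Step 4: N - 1 mergers

Run with combine=min on the distance matrix of the single linkage example below, this reproduces mergers at levels 2, 3, 5, and 6.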

Single Linkage
➢ The inputs to a single linkage algorithm can be distances or similarities
between pairs of objects.
➢ Groups are formed from the individual entities by merging nearest
neighbors, where the term nearest neighbor connotes the smallest distance
or largest similarity.

➢ Initially, we must find the smallest distance in 𝐷 = {𝑑𝑖𝑘 } and merge the
corresponding objects, say, U and V, to get the cluster (UV). For Step 3 of the
general algorithm, the distances between (UV) and any other cluster W are
computed by
𝑑(𝑈𝑉)𝑊 = 𝑚𝑖𝑛{𝑑𝑈𝑊, 𝑑𝑉𝑊 }
➢ Here the quantities 𝑑𝑈𝑊 and 𝑑𝑉𝑊 are the distances between the nearest
neighbors of clusters U and W and clusters V and W, respectively.

➢ The results of single linkage clustering can be graphically displayed in the
form of a dendrogram, or tree diagram.
Example: (Clustering using single linkage) To illustrate the single linkage
algorithm, we consider the hypothetical distances between pairs of five objects as
follows:
𝐷 = {𝑑𝑖𝑘 } =
         1    2    3    4    5
    1    0
    2    9    0
    3    3    7    0
    4    6    5    9    0
    5   11   10    2    8    0

➢ Treating each object as a cluster, we commence clustering by merging the two
closest items. Since
min_{i,k} (𝑑𝑖𝑘 ) = 𝑑53 = 2
objects 5 and 3 are merged to form the cluster (35).
➢ To implement the next level of clustering, we need the distances between the
cluster (35) and the remaining objects 1,2, and 4.
➢ The nearest neighbor distances are
𝑑(35)1 = 𝑚𝑖𝑛{𝑑31 , 𝑑51 } = 𝑚𝑖𝑛{3, 11} = 3
𝑑(35)2 = 𝑚𝑖𝑛{𝑑32 , 𝑑52 } = 𝑚𝑖𝑛{7, 10} = 7
𝑑(35)4 = 𝑚𝑖𝑛{𝑑34 , 𝑑54 } = 𝑚𝑖𝑛{9, 8} = 8
➢ Deleting the rows and columns of D corresponding to objects 3 and 5, and
adding a row and column for the cluster (35), we obtain the new distance
matrix
        (35)    1    2    4
 (35)     0
   1      3     0
   2      7     9    0
   4      8     6    5    0
➢ The smallest distance between pairs of clusters is now 𝑑(35)1 = 3, and we
merge cluster (1) with cluster (35) to get the next cluster, (135).
➢ Calculating
𝑑(135)2 = 𝑚𝑖𝑛{𝑑(35)2 , 𝑑12 } = 𝑚𝑖𝑛{7, 9} = 7

𝑑(135)4 = 𝑚𝑖𝑛{𝑑(35)4 , 𝑑14 } = 𝑚𝑖𝑛{8, 6} = 6


➢ We find that the distance matrix for the next level of clustering is
        (135)    2    4
 (135)     0
   2       7     0
   4       6     5    0
➢ The minimum nearest neighbor distance between pairs of clusters is 𝑑42 =
5, and we merge objects 4 and 2 to get the cluster (24).

➢ At this point we have two distinct clusters (135) and (24). Their nearest
neighbor distance is
𝑑(135)(24) = 𝑚𝑖𝑛{𝑑(135)2 , 𝑑(135)4 } = 𝑚𝑖𝑛{7, 6} = 6
➢ The final distance matrix becomes
        (135)   (24)
 (135)     0
 (24)      6      0
➢ Consequently, clusters (135) and (24) are merged to form a single cluster of
all five objects (12345), when the nearest neighbor distance reaches 6.

➢ The dendrogram picturing this hierarchical clustering is shown in Figure 12.3.
The intermediate results, where the objects are sorted into a moderate number of
clusters, are of chief interest.

[Figure: dendrogram with distance on the vertical axis and the five objects on the horizontal axis.]
Figure 12.3 Single linkage dendrogram for distances between five objects.
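The same result can be checked in software. Here is a minimal sketch (not part of the original notes) using SciPy; the merge levels it reports, 2, 3, 5 and 6, match the hand calculation above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Distance matrix for the five objects of the example.
D = np.array([
    [ 0,  9,  3,  6, 11],
    [ 9,  0,  7,  5, 10],
    [ 3,  7,  0,  9,  2],
    [ 6,  5,  9,  0,  8],
    [11, 10,  2,  8,  0],
], dtype=float)

Z = linkage(squareform(D), method='single')   # linkage expects the condensed form
print(Z[:, 2])                                # merge levels: [2. 3. 5. 6.]
dendrogram(Z, labels=['1', '2', '3', '4', '5'])
plt.show()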

Complete linkage
➢ Complete linkage clustering proceeds in much the same manner as single
linkage clustering, with one important exception: At each stage, the
distance (similarity) between clusters is determined by the distance
(similarity) between the two elements, one from each cluster, that are most
distant.
➢ Thus, complete linkage ensures that all items in a cluster are within some
maximum distance (or minimum similarity) of each other.

➢ The general agglomerative algorithm again starts by finding the minimum
entry in 𝐷 = {𝑑𝑖𝑘 } and merging the corresponding objects, such as U and
V, to get cluster (UV).
➢ For Step 3 of the general algorithm, the distances between (UV) and any
other cluster W are computed by
𝑑(𝑈𝑉)𝑊 = 𝑚𝑎𝑥{𝑑𝑈𝑊, 𝑑𝑉𝑊 }
➢ Here 𝑑𝑈𝑊 and 𝑑𝑉𝑊 are the distances between the most distant members of
clusters U and W and clusters V and W, respectively.
Example: (Clustering using complete linkage) Let the distance matrix be:
𝐷 = {𝑑𝑖𝑘 } =
         1    2    3    4    5
    1    0
    2    9    0
    3    3    7    0
    4    6    5    9    0
    5   11   10    2    8    0
➢ The smallest distance is 2, and it is the distance between objects 3 and 5.
At the first stage, objects 3 and 5 are therefore merged, since they are most
similar. This gives the cluster (35).
➢ At stage 2, we compute
𝑑(35)1 = 𝑚𝑎𝑥{𝑑31 , 𝑑51 } = 𝑚𝑎𝑥{3, 11} = 11
𝑑(35)2 = 𝑚𝑎𝑥{𝑑32 , 𝑑52 } = 𝑚𝑎𝑥{7, 10} = 10
𝑑(35)4 = 𝑚𝑎𝑥{𝑑34 , 𝑑54 } = 𝑚𝑎𝑥{9, 8} = 9
➢ Deleting the rows and columns of D corresponding to objects 3 and 5, and
adding a row and column for the cluster (35), we obtain the new distance
matrix
        (35)    1    2    4
 (35)     0
   1     11     0
   2     10     9    0
   4      9     6    5    0
➢ The smallest distance is now 5, and it is the distance between objects 2 and 4.
The next merger therefore occurs between the most similar groups, 2 and 4,
to give the cluster (24).
➢ At stage 3, we have
𝑑(24)(35) = 𝑚𝑎𝑥{𝑑2(35) , 𝑑4(35) } = 𝑚𝑎𝑥{10, 9} = 10
𝑑(24)1 = 𝑚𝑎𝑥{𝑑21 , 𝑑41 } = 𝑚𝑎𝑥{9, 6} = 9
➢ We find that the distance matrix for the next level of clustering is
        (35)   (24)    1
 (35)     0
 (24)    10      0
   1     11      9     0
➢ The next merger joins object 1 with the cluster (24) at level 𝑑(24)1 = 9, producing
the cluster (124). At the final stage, the groups (35) and (124) are merged as the
single cluster (12345) at level
𝑑(124)(35) = 𝑚𝑎𝑥{𝑑1(35) , 𝑑(24)(35) } = 𝑚𝑎𝑥{11, 10} = 11
➢ The dendrogram is given in the following Figure.
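As a quick check (not in the original notes), the same SciPy call with method='complete' reproduces the merge levels 2, 5, 9 and 11 found above.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Same distance matrix as in the single linkage example.
D = np.array([
    [ 0,  9,  3,  6, 11],
    [ 9,  0,  7,  5, 10],
    [ 3,  7,  0,  9,  2],
    [ 6,  5,  9,  0,  8],
    [11, 10,  2,  8,  0],
], dtype=float)

Z = linkage(squareform(D), method='complete')
print(Z[:, 2])   # merge levels: [ 2.  5.  9. 11.]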
Example: (Clustering variables using complete linkage). Data collected on 22 U.S.
public utility companies for the year 1975 are listed in Table 12.4. Although it is
more interesting to group companies, we shall see here how the complete linkage
algorithm can be used to cluster variables.

➢ We measure the similarity between pairs of variables by the product-moment
correlation coefficient.
➢ The correlation matrix is given in the following Table.
Table: Correlations Between Pairs of Variables.
          X1      X2      X3      X4      X5      X6      X7      X8
  X1    1.000
  X2     .643   1.000
  X3    -.103   -.348   1.000
  X4    -.082   -.086    .100   1.000
  X5    -.259   -.260    .435    .034   1.000
  X6    -.152   -.010    .028   -.286    .176   1.000
  X7     .045    .211    .115   -.164   -.019   -.374   1.000
  X8    -.013   -.328    .005    .486   -.007   -.561   -.185   1.000
➢ When the sample correlations are used as similarity measures, variables
with large negative correlations are regarded as very dissimilar; variables
with large positive correlations are regarded as very similar.
➢ In this case, the "distance" between clusters is measured as the smallest
similarity between members of the corresponding clusters.
➢ The complete linkage algorithm, applied to the foregoing similarity matrix,
yields the dendrogram in Figure 12.8.
➢ We see that
✓ variables 1 and 2 (fixed-charge coverage ratio and rate of return on
capital),
✓ variables 4 and 8 (annual load factor and total fuel costs), and
✓ variables 3 and 5 (cost per kilowatt capacity in place and peak
kilowatt-hour demand growth) cluster at intermediate "similarity"
levels.
✓ Variables 7 (percent nuclear) and 6 (sales) remain by themselves until
the final stages.
✓ The final merger brings together the (12478) group and the (356) group.
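A hedged sketch of this analysis in SciPy (not part of the original notes): since SciPy's linkage routine works with dissimilarities rather than similarities, one common choice, assumed here, is to convert each correlation r to the dissimilarity 1 - r. For complete linkage this preserves the merge order, because maximising the smallest within-cluster correlation is the same as minimising the largest within-cluster value of 1 - r; only the scale of the dendrogram axis changes.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Correlation matrix from the table above (symmetric, ones on the diagonal).
R = np.array([
    [1.000,  .643, -.103, -.082, -.259, -.152,  .045, -.013],
    [ .643, 1.000, -.348, -.086, -.260, -.010,  .211, -.328],
    [-.103, -.348, 1.000,  .100,  .435,  .028,  .115,  .005],
    [-.082, -.086,  .100, 1.000,  .034, -.286, -.164,  .486],
    [-.259, -.260,  .435,  .034, 1.000,  .176, -.019, -.007],
    [-.152, -.010,  .028, -.286,  .176, 1.000, -.374, -.561],
    [ .045,  .211,  .115, -.164, -.019, -.374, 1.000, -.185],
    [-.013, -.328,  .005,  .486, -.007, -.561, -.185, 1.000],
])

D = 1.0 - R                  # dissimilarity: small when variables are highly correlated
np.fill_diagonal(D, 0.0)

Z = linkage(squareform(D), method='complete')
dendrogram(Z, labels=[f'X{i}' for i in range(1, 9)])
plt.ylabel('1 - correlation')
plt.show()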

Average Linkage
➢ Average linkage treats the distance between two clusters as the average
distance between all pairs of items where one member of a pair belongs to
each cluster.
➢ Again, the input to the average linkage algorithm may be distances or
similarities, and the method can be used to group objects or variables.
➢ The average linkage algorithm proceeds in the manner of the general
algorithm.
➢ We begin by searching the distance matrix
𝐷 = {𝑑𝑖𝑘 }
to find the nearest (most similar) objects—for example, U and V.
➢ These objects are merged to form the cluster (UV).
➢ For Step 3 of the general agglomerative algorithm, the distances between
(UV) and the other cluster W are determined by
𝑑(𝑈𝑉)𝑊 = (∑𝑖 ∑𝑘 𝑑𝑖𝑘 ) / (𝑁(𝑈𝑉) 𝑁𝑊 )
where 𝑑𝑖𝑘 is the distance between object i in the cluster (UV) and object k in
cluster W, and 𝑁(𝑈𝑉) and 𝑁𝑊 are the number of items in clusters (UV) and W,
respectively.
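A direct transcription of this formula into Python (a sketch; the names are illustrative, not from the notes):

def average_linkage_distance(cluster_uv, cluster_w, D):
    """d_(UV)W = sum of all between-cluster distances / (N_(UV) * N_W).

    cluster_uv, cluster_w : lists of object indices
    D                     : full symmetric matrix of object-to-object distances
    """
    total = sum(D[i][k] for i in cluster_uv for k in cluster_w)
    return total / (len(cluster_uv) * len(cluster_w))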
Problem: For the given set of data points, find the clusters using the Average Linkage
technique. Use the Euclidean distance and draw the dendrogram.
         X      Y
 𝑃1     .40    .53
 𝑃2     .22    .38
 𝑃3     .35    .32
 𝑃4     .26    .19
 𝑃5     .08    .41
 𝑃6     .45    .30

Solution:
The Euclidean distance between two points (𝑥1 , 𝑦1 ) and (𝑥2 , 𝑦2 ) is
√((𝑥1 − 𝑥2 )² + (𝑦1 − 𝑦2 )²)
The Euclidean distance between 𝑃1 and 𝑃2 is
√((.40 − .22)² + (.53 − .38)²) = .23
The Euclidean distance between 𝑃1 and 𝑃3 is
√((.40 − .35)² + (.53 − .32)²) = .22
The Euclidean distance between 𝑃1 and 𝑃4 is
√((.40 − .26)² + (.53 − .19)²) = .37
Similarly, compute the Euclidean distances for all remaining pairs of points.
Now, let us create the distance matrix as
         𝑃1     𝑃2     𝑃3     𝑃4     𝑃5     𝑃6
 𝑃1      0
 𝑃2     .23     0
 𝑃3     .22    .14     0
 𝑃4     .37    .19    .16     0
 𝑃5     .34    .14    .28    .28     0
 𝑃6     .24    .24    .10    .22    .39     0
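This distance matrix can also be produced in software. Here is a minimal sketch (not part of the original problem) using SciPy's pdist, which reproduces the values above when rounded to two decimals.

import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([
    [.40, .53],   # P1
    [.22, .38],   # P2
    [.35, .32],   # P3
    [.26, .19],   # P4
    [.08, .41],   # P5
    [.45, .30],   # P6
])

D = squareform(pdist(points, metric='euclidean'))   # full 6 x 6 distance matrix
print(np.round(D, 2))                               # matches the matrix above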
Find the smallest distance. This is .10. This is the distance between 𝑃3 and 𝑃6 . Merge
𝑃3 and 𝑃6 to form the first cluster. Let us draw the corresponding dendrogram.

[Dendrogram: 𝑃3 and 𝑃6 joined at height .10.]
Now we need to create a new distance matrix.
Calculate the distance between (𝑃3 , 𝑃6 ) and 𝑃1 using Average Linkage
= AVG[dist(𝑃3 , 𝑃1 ), dist(𝑃6 , 𝑃1 )] = (.22 + .24)/2 = .23
Calculate the distance between (𝑃3 , 𝑃6 ) and 𝑃2 using Average Linkage
= AVG[dist(𝑃3 , 𝑃2 ), dist(𝑃6 , 𝑃2 )] = (.14 + .24)/2 = .19
Calculate the distances between (𝑃3 , 𝑃6 ) and 𝑃4 , and between (𝑃3 , 𝑃6 ) and 𝑃5 ,
using Average Linkage in the same way.
Now let us create the new distance matrix as
          𝑃1     𝑃2     𝑃4     𝑃5    𝑃3 𝑃6
 𝑃1       0
 𝑃2      .23     0
 𝑃4      .37    .19     0
 𝑃5      .34    .14    .28     0
 𝑃3 𝑃6   .23    .19    .19    .34     0

Find the smallest distance. It is .14, the distance between 𝑃2 and 𝑃5 . Merge 𝑃2
and 𝑃5 to form the second cluster.
Let us draw the dendrogram between 𝑃2 and 𝑃5 .

[Dendrogram: (𝑃3 , 𝑃6 ) joined at height .10 and (𝑃2 , 𝑃5 ) joined at height .14.]
Let us calculate the distances between (𝑃2 𝑃5 ) and the other points and clusters.
Distance between (𝑃2 𝑃5 ) and 𝑃1 using Average Linkage
= AVG[dist(𝑃2 , 𝑃1 ), dist(𝑃5 , 𝑃1 )] = (.23 + .34)/2 = .29

Distance between (𝑃2 , 𝑃5 ) and 𝑃4 using Average Linkage
= AVG[dist(𝑃2 , 𝑃4 ), dist(𝑃5 , 𝑃4 )] = (.19 + .28)/2 = .24

Distance between (𝑃2 𝑃5 ) and (𝑃3 𝑃6 ) using Average Linkage
= AVG[dist(𝑃2 , (𝑃3 , 𝑃6 )), dist(𝑃5 , (𝑃3 , 𝑃6 ))] = (.19 + .34)/2 ≈ .27

Now the updated distance matrix is

           𝑃1     𝑃4    𝑃3 𝑃6   𝑃2 𝑃5
 𝑃1        0
 𝑃4       .37     0
 𝑃3 𝑃6    .23    .19     0
 𝑃2 𝑃5    .29    .24    .27     0

Find the smallest distance. It is .19, the distance between 𝑃4 and the cluster (𝑃3 𝑃6 ).
So 𝑃3 , 𝑃6 , and 𝑃4 form the next cluster.
Update the dendrogram.

[Dendrogram: 𝑃4 joins the branch (𝑃3 , 𝑃6 ) at height .19; (𝑃2 , 𝑃5 ) remains a separate branch.]
Recalculate the distance matrix. Replace 𝑃3 , 𝑃6 , and 𝑃4 with a single entry for (𝑃3 𝑃6 𝑃4 ),
and calculate the distances from the other points and clusters to the new cluster.
Distance between (𝑃3 𝑃6 𝑃4 ) and 𝑃1 using Average Linkage
= AVG[dist((𝑃3 , 𝑃6 ), 𝑃1 ), dist(𝑃4 , 𝑃1 )] = (.23 + .37)/2 = .30
Distance between (𝑃3 𝑃6 𝑃4 ) and (𝑃2 , 𝑃5 ) using Average Linkage
= AVG[dist((𝑃3 , 𝑃6 ), (𝑃2 , 𝑃5 )), dist(𝑃4 , (𝑃2 , 𝑃5 ))] = (.27 + .24)/2 ≈ .26

The updated distance matrix is

              𝑃1    𝑃2 𝑃5   𝑃3 𝑃6 𝑃4
 𝑃1           0
 𝑃2 𝑃5       .29      0
 𝑃3 𝑃6 𝑃4    .30     .26       0

The smallest distance is .26. It is the distance between the clusters (𝑃2 𝑃5 ) and (𝑃3 𝑃6 𝑃4 ).
Merge them to form the new cluster (𝑃3 𝑃6 𝑃4 𝑃2 𝑃5 ).
Update the dendrogram as

[Dendrogram: the branches (𝑃3 , 𝑃6 , 𝑃4 ) and (𝑃2 , 𝑃5 ) joined at height .26.]
Recalculate the distance matrix.
Replace the entries for the clusters (𝑃2 𝑃5 ) and (𝑃3 𝑃6 𝑃4 ) with a single entry for the
new cluster, and calculate its distance to 𝑃1 .
Distance between (𝑃2 𝑃5 𝑃3 𝑃6 𝑃4 ) and 𝑃1 using Average Linkage
= AVG[dist((𝑃2 , 𝑃5 ), 𝑃1 ), dist((𝑃3 , 𝑃6 , 𝑃4 ), 𝑃1 )] = (.29 + .30)/2 ≈ .30

Here is the updated distance matrix.

                     𝑃1    𝑃2 𝑃5 𝑃3 𝑃6 𝑃4
 𝑃1                  0
 𝑃2 𝑃5 𝑃3 𝑃6 𝑃4     .30         0

Join the two clusters at level .30 to form the final cluster containing all six points, and
update the dendrogram. This is the result of the hierarchical average linkage clustering.
[Final dendrogram: leaves in the order 𝑃3 , 𝑃6 , 𝑃4 , 𝑃2 , 𝑃5 , 𝑃1 ; the two branches join at .26 and 𝑃1 joins last at .30.]
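For reference, a minimal SciPy sketch (not part of the original problem) that repeats the whole analysis. Note that SciPy's method='average' averages over all pairs of points (the formula given earlier) and works from unrounded distances, whereas the hand calculation above averaged the previously computed cluster-to-cluster distances and rounded to two decimals, so the merge heights differ slightly; the merge order, however, is the same.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

points = np.array([[.40, .53],   # P1
                   [.22, .38],   # P2
                   [.35, .32],   # P3
                   [.26, .19],   # P4
                   [.08, .41],   # P5
                   [.45, .30]])  # P6

Z = linkage(pdist(points), method='average')
print(np.round(Z, 2))            # merge order: (P3,P6), (P2,P5), +P4, the two groups, then P1
dendrogram(Z, labels=['P1', 'P2', 'P3', 'P4', 'P5', 'P6'])
plt.show()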
