
Lecture 7: Clustering

Example: Suppose five individuals possess the following characteristics:


              Height   Weight   Eye color   Hair color   Handedness   Gender
Individual 1  68 in    140 lb   Green       Blond        Right        Female
Individual 2  73 in    185 lb   Brown       Brown        Right        Male
Individual 3  67 in    165 lb   Blue        Blond        Right        Male
Individual 4  64 in    120 lb   Brown       Brown        Right        Female
Individual 5  76 in    210 lb   Brown       Brown        Left         Male

Define six binary variables $X_1, X_2, X_3, X_4, X_5, X_6$ as

$X_1 = \begin{cases} 1 & \text{height} \geq 72 \text{ in} \\ 0 & \text{height} < 72 \text{ in} \end{cases}$

$X_2 = \begin{cases} 1 & \text{weight} \geq 150 \text{ lb} \\ 0 & \text{weight} < 150 \text{ lb} \end{cases}$

$X_3 = \begin{cases} 1 & \text{brown eyes} \\ 0 & \text{otherwise} \end{cases}$

$X_4 = \begin{cases} 1 & \text{blond hair} \\ 0 & \text{not blond hair} \end{cases}$

$X_5 = \begin{cases} 1 & \text{right handed} \\ 0 & \text{left handed} \end{cases}$

$X_6 = \begin{cases} 1 & \text{female} \\ 0 & \text{male} \end{cases}$
The scores for individuals 1 and 2 on the p = 6 binary variables are
                X1   X2   X3   X4   X5   X6
Individual 1     0    0    0    1    1    1
Individual 2     1    1    1    0    1    0

and the numbers of matches and mismatches are indicated in the two-way array

                        Individual 2
                       1     0    Total
Individual 1     1     1     2      3
                 0     3     0      3
             Total     4     2      6

Employing similarity coefficient 1, which gives equal weight to matches, we compute

$\frac{a+d}{p} = \frac{1+0}{6} = \frac{1}{6}$
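As a quick check, the tally above can be reproduced with a short Python sketch (the variable names are illustrative, not part of the lecture):

```python
# Binary scores (X1..X6) for individuals 1 and 2, read off the table above.
ind1 = [0, 0, 0, 1, 1, 1]
ind2 = [1, 1, 1, 0, 1, 0]

a = sum(x == 1 and y == 1 for x, y in zip(ind1, ind2))  # 1-1 matches
b = sum(x == 1 and y == 0 for x, y in zip(ind1, ind2))  # 1-0 mismatches
c = sum(x == 0 and y == 1 for x, y in zip(ind1, ind2))  # 0-1 mismatches
d = sum(x == 0 and y == 0 for x, y in zip(ind1, ind2))  # 0-0 matches

p = len(ind1)
print(a, b, c, d)      # 1 2 3 0
print((a + d) / p)     # 0.1666..., i.e. 1/6
```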
The scores for individuals 1 and 3 on the p = 6 binary variables are
                X1   X2   X3   X4   X5   X6
Individual 1     0    0    0    1    1    1
Individual 3     0    1    0    1    1    0

and the numbers of matches and mismatches are indicated in the two-way array

                        Individual 3
                       1     0    Total
Individual 1     1     2     1      3
                 0     1     2      3
             Total     3     3      6

Employing similarity coefficient 1, which gives equal weight to matches, we compute

$\frac{a+d}{p} = \frac{2+2}{6} = \frac{4}{6}$
Continuing with similarity coefficient 1, we calculate the remaining similarity
numbers for pairs of individuals.
These are displayed in the 5 × 5 symmetric matrix

                          Individual
                     1     2     3     4     5
                1    1
                2   1/6    1
Individual      3   4/6   3/6    1
                4   4/6   3/6   2/6    1
                5    0    5/6   2/6   2/6    1
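The whole matrix can be generated the same way; a minimal Python sketch, assuming the binary score vectors derived above, is:

```python
from fractions import Fraction

# Binary score vectors (X1..X6) for the five individuals, read off the table above.
scores = {
    1: (0, 0, 0, 1, 1, 1),
    2: (1, 1, 1, 0, 1, 0),
    3: (0, 1, 0, 1, 1, 0),
    4: (0, 0, 1, 0, 1, 1),
    5: (1, 1, 1, 0, 0, 0),
}

def coefficient_1(x, y):
    """Similarity coefficient 1: (a + d) / p, the proportion of matching entries."""
    matches = sum(1 for xi, yi in zip(x, y) if xi == yi)
    return Fraction(matches, len(x))

# Lower triangle of the 5 x 5 similarity matrix (fractions print in lowest
# terms, so 4/6 appears as 2/3).
for i in scores:
    for k in scores:
        if k < i:
            print(f"s({i},{k}) = {coefficient_1(scores[i], scores[k])}")
```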

➢ Based on the magnitudes of the similarity coefficient, we should conclude
that individuals 2 and 5 are most similar and individuals 1 and 5 are least
similar.
➢ Other pairs fall between these extremes.
➢ If we were to divide the individuals into two relatively homogeneous
subgroups on the basis of the similarity numbers, we might form the
subgroups (1 3 4) and (2 5).

➢ Note that X3 = 0 implies an absence of brown eyes, so that two people, one
with blue eyes and one with green eyes, will yield a 0-0 match.
➢ Consequently, it may be inappropriate to use similarity coefficient 1, 2, or 3
because these coefficients give the same weights to 1-1 and 0-0 matches.

➢ We have described the construction of distances and similarities. It is always possible to construct similarities from distances. For example, we might set

$\tilde{s}_{ik} = \frac{1}{1 + d_{ik}}$

where $0 < \tilde{s}_{ik} \leq 1$ is the similarity between items i and k and $d_{ik}$ is the corresponding distance.

➢ Gower has shown that, provided the similarity matrix is nonnegative definite and the maximum similarity is scaled so that $\tilde{s}_{ii} = 1$,

$d_{ik} = \sqrt{2(1 - \tilde{s}_{ik})}$

has the properties of a distance.
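A small Python sketch of the two constructions (the function names are illustrative):

```python
import math

def similarity_from_distance(d_ik):
    """Construct a similarity from a distance: s = 1 / (1 + d), so 0 < s <= 1."""
    return 1.0 / (1.0 + d_ik)

def distance_from_similarity(s_ik):
    """Gower's construction d = sqrt(2(1 - s)); valid when the similarity matrix
    is nonnegative definite and scaled so that s_ii = 1."""
    return math.sqrt(2.0 * (1.0 - s_ik))

print(similarity_from_distance(3.0))     # 0.25
print(distance_from_similarity(5 / 6))   # about 0.577
```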

Similarities and Association Measures for Pairs of Variables


➢ Thus far, we have discussed similarity measures for items.
➢ In some applications, it is the variables, rather than the items, that must be
grouped.
➢ Similarity measures for variables often take the form of sample correlation
coefficients.

➢ Moreover, in some clustering applications, negative correlations are replaced
by their absolute values.

➢ When the variables are binary, the data can again be arranged in the form of a
contingency table.
➢ This time, however, the variables, rather than the items, delineate the
categories.
➢ For each pair of variables, there are n items categorized in the table.
➢ With the usual 0 and 1 coding, the table becomes as follows:
                         Variable k
                       1       0        Total
Variable i     1       a       b        a + b
               0       c       d        c + d
           Total     a + c   b + d   a + b + c + d

➢ For instance, variable i equals 1 and variable k equals 0 for b of the n items.

➢ The usual product moment correlation formula applied to the binary variables in the contingency table gives

$r = \frac{ad - bc}{[(a+b)(c+d)(a+c)(b+d)]^{1/2}} \qquad (2)$

➢ This number can be taken as a measure of the similarity between the two
variables.

➢ The correlation coefficient in (2) is related to the chi-square statistic ($r^2 = \chi^2/n$) for testing the independence of two categorical variables.
➢ For n fixed, a large similarity (or correlation) is consistent with the presence of dependence.
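A short Python sketch of formula (2) on a hypothetical 2 × 2 table of counts also illustrates the relation $r^2 = \chi^2/n$ (the counts below are made up for illustration):

```python
import math

def binary_correlation(a, b, c, d):
    """Formula (2): product moment correlation for a 2 x 2 table of two binary variables."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom

a, b, c, d = 10, 6, 4, 12          # hypothetical counts for n = 32 items
n = a + b + c + d
r = binary_correlation(a, b, c, d)
print(r)          # about 0.378
print(n * r**2)   # equals the chi-square statistic for testing independence
```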

Hierarchical Clustering Methods
➢ We can rarely examine all grouping possibilities, even with the largest and
fastest computers.
➢ Because of this problem, a wide variety of clustering algorithms have emerged
that find "reasonable" clusters without having to look at all configurations.

➢ Hierarchical clustering techniques proceed by either a series of successive


mergers or a series of successive divisions.
➢ Agglomerative hierarchical methods start with the individual objects.
➢ Thus, there are initially as many clusters as objects.
✓ The most similar objects are first grouped, and these initial groups are
merged according to their similarities.
✓ Eventually, as the similarity decreases, all subgroups are fused into a single cluster.

➢ Divisive hierarchical methods work in the opposite direction.


✓ An initial single group of objects is divided into two subgroups such
that the objects in one subgroup are "far from" the objects in the other.
✓ These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects, that is, until each object forms a group.

➢ The results of both agglomerative and divisive methods may be displayed in


the form of a two-dimensional diagram known as a dendrogram.
➢ We first concentrate on agglomerative hierarchical procedures, in particular,
linkage methods.

➢ Linkage methods are suitable for clustering items, as well as variables.


➢ Types of linkage:
✓ single linkage (minimum distance or nearest neighbor),
✓ complete linkage (maximum distance or farthest neighbor), and
✓ average linkage (average distance).
➢ The merging of clusters under the three linkage criteria is illustrated schematically in Figure 12.2.
➢ From the figure, we see that
✓ single linkage results when groups are fused according to the distance
between their nearest members.
✓ Complete linkage occurs when groups are fused according to the
distance between their farthest members.
✓ For average linkage, groups are fused according to the average
distance between pairs of members in the respective sets.

Figure 12.2 Intercluster distance (dissimilarity) for (a) single linkage, (b) complete linkage, and (c) average linkage.

The steps in the agglomerative hierarchical clustering algorithm for grouping N objects (items or variables) are as follows:
1. Start with N clusters, each containing a single entity, and an $N \times N$ symmetric matrix of distances (or similarities) $D = \{d_{ik}\}$.
2. Search the distance matrix for the nearest (most similar) pair of clusters. Let
the distance between “most similar” clusters U and V be 𝑑𝑈𝑉 .
3. Merge clusters U and V. Label the newly formed cluster (UV). Update the
entries in the distance matrix by (a) deleting the rows and columns
corresponding to clusters U and V and (b) adding a row and column giving
the distances between cluster (UV) and the remaining clusters.
4. Repeat Steps 2 and 3 a total of N - 1 times. (All objects will be in a single
cluster after the algorithm terminates.) Record the identity of clusters that are
merged and the levels (distances or similarities) at which the mergers take
place.
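A minimal Python sketch of Steps 1-4 (the data structure and names are illustrative; the link argument selects single, complete, or average linkage via min, max, or mean):

```python
from statistics import mean  # pass mean as `link` for average linkage

def agglomerate(D, labels, link=min):
    """Steps 1-4 on a distance dictionary D: frozenset({i, k}) -> d_ik.
    link is applied to all between-cluster member distances: min -> single,
    max -> complete, mean -> average linkage. Returns the merge history."""
    clusters = [(lab,) for lab in labels]              # Step 1: N singleton clusters
    dist = {frozenset({(i,), (k,)}): D[frozenset({i, k})]
            for i in labels for k in labels if i != k}
    history = []
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)                 # Step 2: nearest pair of clusters
        u, v = tuple(pair)
        level = dist[pair]
        uv = tuple(sorted(u + v))                      # Step 3: merge U and V into (UV)
        history.append((u, v, level))
        clusters = [c for c in clusters if c not in (u, v)]
        dist = {p: d for p, d in dist.items() if u not in p and v not in p}
        for w in clusters:                             # distances from (UV) to the rest
            dist[frozenset({uv, w})] = link(
                [D[frozenset({i, k})] for i in uv for k in w])
        clusters.append(uv)
    return history                                     # Step 4: N - 1 recorded mergers

# Tiny illustration with three objects a, b, c (single linkage by default):
toy = {frozenset({"a", "b"}): 2.0,
       frozenset({"a", "c"}): 6.0,
       frozenset({"b", "c"}): 5.0}
print(agglomerate(toy, ["a", "b", "c"]))   # two mergers, at levels 2.0 and 5.0
```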
Single Linkage
➢ The inputs to a single linkage algorithm can be distances or similarities
between pairs of objects.
➢ Groups are formed from the individual entities by merging nearest neighbors,
where the term nearest neighbor connotes the smallest distance or largest
similarity.

➢ Initially, we must find the smallest distance in $D = \{d_{ik}\}$ and merge the corresponding objects, say, U and V, to get the cluster (UV). For Step 3 of the general algorithm above, the distances between (UV) and any other cluster W are computed by

$d_{(UV)W} = \min\{d_{UW}, d_{VW}\}$

➢ Here the quantities $d_{UW}$ and $d_{VW}$ are the distances between the nearest neighbors of clusters U and W and clusters V and W, respectively.

➢ The results of single linkage clustering can be graphically displayed in the


form of a dendrogram, or tree diagram.
➢ The branches in the tree represent clusters.
➢ The branches come together (merge) at nodes whose positions along a
distance (or similarity) axis indicate the level at which the fusions occur.
➢ Dendrograms for some specific cases are considered in the following
examples.

Example: (Clustering using single linkage) To illustrate the single linkage
algorithm, we consider the hypothetical distances between pairs of five objects as
follows:
                    1    2    3    4    5
               1    0
               2    9    0
D = {d_ik} =   3    3    7    0
               4    6    5    9    0
               5   11   10    2    8    0

Treating each object as a cluster, we commence clustering by merging the two closest items. Since

$\min_{i,k}(d_{ik}) = d_{53} = 2,$

objects 5 and 3 are merged to form the cluster (35). To implement the next level of clustering, we need the distances between the cluster (35) and the remaining objects 1, 2, and 4. The nearest neighbor distances are

$d_{(35)1} = \min\{d_{31}, d_{51}\} = \min\{3, 11\} = 3$
$d_{(35)2} = \min\{d_{32}, d_{52}\} = \min\{7, 10\} = 7$
$d_{(35)4} = \min\{d_{34}, d_{54}\} = \min\{9, 8\} = 8$
Deleting the rows and columns of D corresponding to objects 3 and 5, and adding a
row and column for the cluster (35), we obtain the new distance matrix
          (35)    1    2    4
  (35)     0
    1      3      0
    2      7      9    0
    4      8      6    5    0

The smallest distance between pairs of clusters is now $d_{(35)1} = 3$, and we merge cluster (1) with cluster (35) to get the next cluster, (135). Calculating

$d_{(135)2} = \min\{d_{(35)2}, d_{12}\} = \min\{7, 9\} = 7$
$d_{(135)4} = \min\{d_{(35)4}, d_{14}\} = \min\{8, 6\} = 6$


we find that the distance matrix for the next level of clustering is

          (135)    2    4
  (135)     0
    2       7      0
    4       6      5    0
The minimum nearest neighbor distance between pairs of clusters is 𝑑42 = 5, and
we merge objects 4 and 2 to get the cluster (24).

At this point we have two distinct clusters (135) and (24). Their nearest neighbor
distance is
$d_{(135)(24)} = \min\{d_{(135)2}, d_{(135)4}\} = \min\{7, 6\} = 6$
The final distance matrix becomes
          (135)  (24)
  (135)     0
  (24)      6     0
Consequently, clusters (135) and (24) are merged to form a single cluster of all five
objects (12345), when the nearest neighbor distance reaches 6.
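The hand computation can be checked with SciPy's hierarchical clustering routines; a small sketch, assuming scipy and matplotlib are available, is:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Symmetric distance matrix for the five objects in the example.
D = np.array([[ 0,  9,  3,  6, 11],
              [ 9,  0,  7,  5, 10],
              [ 3,  7,  0,  9,  2],
              [ 6,  5,  9,  0,  8],
              [11, 10,  2,  8,  0]], dtype=float)

# Single linkage on the condensed (upper-triangular) form of D.
Z = linkage(squareform(D), method="single")
print(Z[:, 2])   # merge levels [2. 3. 5. 6.], matching the hand computation

# Dendrogram as in Figure 12.3 (requires matplotlib).
import matplotlib.pyplot as plt
dendrogram(Z, labels=[1, 2, 3, 4, 5])
plt.ylabel("Distance")
plt.show()
```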

The dendrogram picturing this hierarchical clustering is shown in Figure 12.3. The intermediate results, where the objects are sorted into a moderate number of clusters, are of chief interest.

Figure 12.3 Single linkage dendrogram for distances between five objects.

