Lecture 7 Clustering
and the numbers of matches and mismatches are indicated in the two-way array

                              Individual 2
                           1      0      Total
Individual 1       1       1      2        3
                   0       3      0        3
                   Total   4      2        6
Employing similarity coefficient 1, which gives equal weight to 1-1 and 0-0 matches, we
compute

(a + d)/p = (1 + 0)/6 = 1/6
The scores for individuals 1 and 3 on the p = 6 binary variables are

                      X1   X2   X3   X4   X5   X6
Individual 1           0    0    0    1    1    1
Individual 3           0    1    0    1    1    0
and the numbers of matches and mismatches are indicated in the two-way array

                              Individual 3
                           1      0      Total
Individual 1       1       2      1        3
                   0       1      2        3
                   Total   3      3        6
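The counts in these two-way arrays are easy to reproduce in code. Below is a minimal Python sketch (the function names and layout are our own, not from the lecture) that tallies the 1-1, 1-0, 0-1, and 0-0 pairs for two binary score vectors and evaluates similarity coefficient 1; applied to individuals 1 and 3 above, it returns a = 2, b = 1, c = 1, d = 2 and (a + d)/p = 4/6.

```python
# Minimal sketch: counting matches/mismatches for two binary score vectors
# and computing similarity coefficient 1, (a + d)/p.

def match_counts(x, y):
    """Return (a, b, c, d): the 1-1, 1-0, 0-1, and 0-0 counts for two 0/1 vectors."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return a, b, c, d

def simple_matching(x, y):
    """Similarity coefficient 1: (a + d)/p, equal weight to 1-1 and 0-0 matches."""
    a, b, c, d = match_counts(x, y)
    p = a + b + c + d
    return (a + d) / p

ind1 = [0, 0, 0, 1, 1, 1]   # scores for individual 1 given above
ind3 = [0, 1, 0, 1, 1, 0]   # scores for individual 3 given above
print(match_counts(ind1, ind3))     # (2, 1, 1, 2)
print(simple_matching(ind1, ind3))  # 0.666... = 4/6
```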
Continuing these calculations for the remaining pairs, we obtain the following matrix of
similarities for the five individuals:

                               Individual
                       1      2      3      4      5
                 1     1
                 2    1/6     1
Individual       3    4/6    3/6     1
                 4    4/6    3/6    2/6     1
                 5     0     5/6    2/6    2/6     1
➢ Based on the magnitudes of the similarity coefficient, we should conclude
that individuals 2 and 5 are most similar and individuals 1 and 5 are least
similar.
➢ Other pairs fall between these extremes.
➢ If we were to divide the individuals into two relatively homogeneous
subgroups on the basis of the similarity numbers, we might form the
subgroups (1 3 4) and (2 5).
➢ Note that X3 = 0 implies an absence of brown eyes, so that two people, one
with blue eyes and one with green eyes, will yield a 0-0 match.
➢ Consequently, it may be inappropriate to use similarity coefficient 1, 2, or 3
because these coefficients give the same weights to 1-1 and 0-0 matches.
➢ Gower has shown that, provided the similarity matrix is nonnegative definite and the
maximum similarity is scaled so that s̃ik = 1, the quantity √(2(1 − s̃ik)) has the
properties of a distance.
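As a quick numerical illustration of this similarity-to-distance conversion (a sketch of our own, not from the lecture), the Python snippet below applies dik = √(2(1 − s̃ik)) to the similarity matrix for the five individuals obtained earlier; the variable names S and D are assumptions.

```python
import math

# Sketch: converting the similarity matrix for the five individuals (computed
# above with coefficient 1) into distances via the transform stated in the
# bullet above, d_ik = sqrt(2 * (1 - s_ik)). Assumes the similarities are
# already scaled so that the maximum (self-similarity) is 1.

S = [
    [1,   1/6, 4/6, 4/6, 0  ],
    [1/6, 1,   3/6, 3/6, 5/6],
    [4/6, 3/6, 1,   2/6, 2/6],
    [4/6, 3/6, 2/6, 1,   2/6],
    [0,   5/6, 2/6, 2/6, 1  ],
]

D = [[math.sqrt(2 * (1 - s)) for s in row] for row in S]

# The most similar pair (2, 5) now has the smallest distance,
# and the least similar pair (1, 5) the largest.
print(round(D[1][4], 3))  # 0.577, distance between individuals 2 and 5
print(round(D[0][4], 3))  # 1.414, distance between individuals 1 and 5
```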
➢ Moreover, in some clustering applications, negative correlations are replaced
by their absolute values.
➢ When the variables are binary, the data can again be arranged in the form of a
contingency table.
➢ This time, however, the variables, rather than the items, delineate the
categories.
➢ For each pair of variables, there are n items categorized in the table.
➢ With the usual 0 and 1 coding, the table becomes as follows:
                              Variable k
                           1        0        Total
Variable i         1       a        b        a + b
                   0       c        d        c + d
                   Totals  a + c    b + d    a + b + c + d
➢ For instance, variable i equals 1 and variable k equals 0 for b of the n items.
➢ The product moment correlation coefficient computed from this table is

r = (ad − bc) / [(a + b)(c + d)(a + c)(b + d)]^(1/2) ..................(2)
➢ This number can be taken as a measure of the similarity between the two
variables.
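To see that equation (2) is just the ordinary correlation applied to 0-1 coded data, here is a small check in Python (our own sketch; the two binary variables, and the counts they produce, are hypothetical and chosen purely for illustration).

```python
import numpy as np

# Sketch: the correlation in equation (2) computed from the 2x2 counts,
# checked against the ordinary Pearson correlation of the underlying
# 0/1 vectors. The vectors below are illustrative only (not lecture data).

def r_from_counts(a, b, c, d):
    """r = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))."""
    num = a * d - b * c
    den = np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return num / den

# Two hypothetical binary variables observed on n = 10 items.
var_i = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 0])
var_k = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])

a = np.sum((var_i == 1) & (var_k == 1))  # 4
b = np.sum((var_i == 1) & (var_k == 0))  # 1
c = np.sum((var_i == 0) & (var_k == 1))  # 1
d = np.sum((var_i == 0) & (var_k == 0))  # 4

print(r_from_counts(a, b, c, d))         # 0.6
print(np.corrcoef(var_i, var_k)[0, 1])   # 0.6, the same value
```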
Hierarchical Clustering Methods
➢ We can rarely examine all grouping possibilities, even with the largest and
fastest computers.
➢ Because of this problem, a wide variety of clustering algorithms have emerged
that find "reasonable" clusters without having to look at all configurations.
Figure 12.2 Intercluster distance (dissimilarity) for (a) single linkage, (b) complete
linkage, and (c) average linkage.
➢ Initially, we must find the smallest distance in 𝐷 = {𝑑𝑖𝑘 } and merge the
corresponding objects, say, U and V, to get the cluster (UV). For Step 3 of the
general algorithm of (2), the distances between (UV) and any other cluster W
are computed by
𝑑(𝑈𝑉)𝑊 = 𝑚𝑖𝑛{𝑑𝑈𝑊, 𝑑𝑉𝑊 }
➢ Here the quantities 𝑑𝑈𝑊 and 𝑑𝑉𝑊 are the distances between the nearest
neighbors of clusters U and W and clusters V and W, respectively.
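The nearest neighbor update above is the only cluster-level computation the algorithm needs. The following is a minimal sketch of the full agglomerative loop (our own illustration, not the lecture's code; the function name single_linkage and the list-of-lists storage of D are assumptions).

```python
# Sketch of agglomerative clustering with the single linkage update
# d_(UV)W = min(d_UW, d_VW). Clusters are tuples of object labels; `dist`
# maps a frozenset of two clusters to the distance between them.

def single_linkage(D, labels):
    """D: full symmetric distance matrix (list of lists). Returns the merges."""
    clusters = [(lab,) for lab in labels]
    dist = {frozenset((clusters[i], clusters[j])): D[i][j]
            for i in range(len(labels)) for j in range(i + 1, len(labels))}
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters (U, V) with the smallest current distance.
        U, V = min(((a, b) for i, a in enumerate(clusters)
                    for b in clusters[i + 1:]),
                   key=lambda pair: dist[frozenset(pair)])
        merges.append((U, V, dist[frozenset((U, V))]))
        UV = U + V                                   # the merged cluster (UV)
        clusters = [W for W in clusters if W not in (U, V)]
        for W in clusters:
            # Single linkage: (UV) is as close to W as its nearer member.
            dist[frozenset((UV, W))] = min(dist[frozenset((U, W))],
                                           dist[frozenset((V, W))])
        clusters.append(UV)
    return merges

# Tiny hypothetical check with three objects (distances chosen for illustration):
print(single_linkage([[0, 2, 6], [2, 0, 5], [6, 5, 0]], ["a", "b", "c"]))
# [(('a',), ('b',), 2), (('c',), ('a', 'b'), 5)]
```

Applied to the distance matrix in the example that follows, this sketch reproduces the merge sequence described there (merges at distances 2, 3, 5, and 6).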
Example: (Clustering using single linkage) To illustrate the single linkage
algorithm, we consider the hypothetical distances between pairs of five objects as
follows:
                      1     2     3     4     5
                1     0
                2     9     0
D = {dik} =     3     3     7     0
                4     6     5     9     0
                5    11    10     2     8     0
Treating each object as its own cluster, the smallest entry in D is d35 = 2, so objects 3
and 5 are merged to form the cluster (35), with
d(35)1 = min{d31, d51} = min{3, 11} = 3
d(35)2 = min{d32, d52} = min{7, 10} = 7
d(35)4 = min{d34, d54} = min{9, 8} = 8
The smallest distance between pairs of clusters is now d(35)1 = 3, and we merge
cluster (1) with cluster (35) to get the next cluster, (135). Calculating
d(135)2 = min{d(35)2, d12} = min{7, 9} = 7
d(135)4 = min{d(35)4, d14} = min{8, 6} = 6
and noting that the smallest remaining distance is d24 = 5, we next merge objects 2 and 4
to form the cluster (24).
At this point we have two distinct clusters (135) and (24). Their nearest neighbor
distance is
𝑑(135)(24) = 𝑚𝑖𝑛{𝑑(135)2 , 𝑑(135)4 } = 𝑚𝑖𝑛{7, 6} = 6
The final distance matrix becomes
                 (135)   (24)
      (135)        0
      (24)         6       0
Consequently, clusters (135) and (24) are merged to form a single cluster of all five
objects (12345), when the nearest neighbor distance reaches 6.
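As a cross-check of this merge sequence, the example can be run through SciPy's single linkage routine; the short sketch below is our own (not part of the lecture), and note that SciPy labels the five objects 0 through 4.

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# Full symmetric form of the distance matrix D = {d_ik} given above.
D = np.array([
    [ 0,  9,  3,  6, 11],
    [ 9,  0,  7,  5, 10],
    [ 3,  7,  0,  9,  2],
    [ 6,  5,  9,  0,  8],
    [11, 10,  2,  8,  0],
], dtype=float)

Z = linkage(squareform(D), method="single")
print(Z[:, 2])  # merge heights: [2. 3. 5. 6.], matching (35), (135), (24), (12345)

# scipy.cluster.hierarchy.dendrogram(Z) would draw a tree like Figure 12.3.
```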
Figure 12.3 Single linkage dendrogram for distances between five objects.