
Lecture 7: Clustering

Example: Suppose five individuals possess the following characteristics:


              Height   Weight   Eye color   Hair color   Handedness   Gender
Individual 1  68 in    140 lb   Green       Blond        Right        Female
Individual 2  73 in    185 lb   Brown       Brown        Right        Male
Individual 3  67 in    165 lb   Blue        Blond        Right        Male
Individual 4  64 in    120 lb   Brown       Brown        Right        Female
Individual 5  76 in    210 lb   Brown       Brown        Left         Male

Define six binary variables $X_1, X_2, X_3, X_4, X_5, X_6$ as

$X_1 = \begin{cases} 1 & \text{height} \geq 72 \text{ in} \\ 0 & \text{height} < 72 \text{ in} \end{cases}$

$X_2 = \begin{cases} 1 & \text{weight} \geq 150 \text{ lb} \\ 0 & \text{weight} < 150 \text{ lb} \end{cases}$

$X_3 = \begin{cases} 1 & \text{brown eyes} \\ 0 & \text{otherwise} \end{cases}$

$X_4 = \begin{cases} 1 & \text{blond hair} \\ 0 & \text{not blond hair} \end{cases}$

$X_5 = \begin{cases} 1 & \text{right handed} \\ 0 & \text{left handed} \end{cases}$

$X_6 = \begin{cases} 1 & \text{female} \\ 0 & \text{male} \end{cases}$
The scores for individuals 1 and 2 on the p = 6 binary variables are
                X1   X2   X3   X4   X5   X6
Individual 1     0    0    0    1    1    1
Individual 2     1    1    1    0    1    0

and the numbers of matches and mismatches are indicated in the two-way array

                        Individual 2
                       1     0    Total
Individual 1     1     1     2      3
                 0     3     0      3
             Total     4     2      6

Employing similarity coefficient 1, which gives equal weight to matches, we compute

$\frac{a+d}{p} = \frac{1+0}{6} = \frac{1}{6}$
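As a quick check, the tally above can be reproduced with a short Python sketch (the variable names are illustrative, not part of the lecture):

```python
# Binary scores (X1..X6) for individuals 1 and 2, read off the table above.
ind1 = [0, 0, 0, 1, 1, 1]
ind2 = [1, 1, 1, 0, 1, 0]

a = sum(x == 1 and y == 1 for x, y in zip(ind1, ind2))  # 1-1 matches
b = sum(x == 1 and y == 0 for x, y in zip(ind1, ind2))  # 1-0 mismatches
c = sum(x == 0 and y == 1 for x, y in zip(ind1, ind2))  # 0-1 mismatches
d = sum(x == 0 and y == 0 for x, y in zip(ind1, ind2))  # 0-0 matches

p = len(ind1)
print(a, b, c, d)      # 1 2 3 0
print((a + d) / p)     # 0.1666..., i.e. 1/6
```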
The scores for individuals 1 and 3 on the p = 6 binary variables are
                X1   X2   X3   X4   X5   X6
Individual 1     0    0    0    1    1    1
Individual 3     0    1    0    1    1    0

and the numbers of matches and mismatches are indicated in the two-way array

                        Individual 3
                       1     0    Total
Individual 1     1     2     1      3
                 0     1     2      3
             Total     3     3      6

Employing similarity coefficient 1, which gives equal weight to matches, we compute

$\frac{a+d}{p} = \frac{2+2}{6} = \frac{4}{6}$
Continuing with similarity coefficient 1, we calculate the remaining similarity
numbers for pairs of individuals.
These are displayed in the 5 × 5 symmetric matrix

                          Individual
                     1     2     3     4     5
                1    1
                2   1/6    1
Individual      3   4/6   3/6    1
                4   4/6   3/6   2/6    1
                5    0    5/6   2/6   2/6    1
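The whole matrix can be generated the same way; a minimal Python sketch, assuming the binary score vectors derived above, is:

```python
from fractions import Fraction

# Binary score vectors (X1..X6) for the five individuals, read off the table above.
scores = {
    1: (0, 0, 0, 1, 1, 1),
    2: (1, 1, 1, 0, 1, 0),
    3: (0, 1, 0, 1, 1, 0),
    4: (0, 0, 1, 0, 1, 1),
    5: (1, 1, 1, 0, 0, 0),
}

def coefficient_1(x, y):
    """Similarity coefficient 1: (a + d) / p, the proportion of matching entries."""
    matches = sum(1 for xi, yi in zip(x, y) if xi == yi)
    return Fraction(matches, len(x))

# Lower triangle of the 5 x 5 similarity matrix (fractions print in lowest
# terms, so 4/6 appears as 2/3).
for i in scores:
    for k in scores:
        if k < i:
            print(f"s({i},{k}) = {coefficient_1(scores[i], scores[k])}")
```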

➢ Based on the magnitudes of the similarity coefficient, we should conclude
that individuals 2 and 5 are most similar and individuals 1 and 5 are least
similar.
➢ Other pairs fall between these extremes.
➢ If we were to divide the individuals into two relatively homogeneous
subgroups on the basis of the similarity numbers, we might form the
subgroups (1 3 4) and (2 5).

➢ Note that X3 = 0 implies an absence of brown eyes, so that two people, one
with blue eyes and one with green eyes, will yield a 0-0 match.
➢ Consequently, it may be inappropriate to use similarity coefficient 1, 2, or 3
because these coefficients give the same weights to 1-1 and 0-0 matches.

➢ We have described the construction of distances and similarities. It is always possible to construct similarities from distances. For example, we might set

$\tilde{s}_{ik} = \frac{1}{1 + d_{ik}}$

where $0 < \tilde{s}_{ik} \leq 1$ is the similarity between items i and k and $d_{ik}$ is the corresponding distance.

➢ Gower has shown that, provided the similarity matrix is nonnegative definite and the maximum similarity is scaled so that $\tilde{s}_{ii} = 1$,

$d_{ik} = \sqrt{2(1 - \tilde{s}_{ik})}$

has the properties of a distance.
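A small Python sketch of the two constructions (the function names are illustrative):

```python
import math

def similarity_from_distance(d_ik):
    """Construct a similarity from a distance: s = 1 / (1 + d), so 0 < s <= 1."""
    return 1.0 / (1.0 + d_ik)

def distance_from_similarity(s_ik):
    """Gower's construction d = sqrt(2(1 - s)); valid when the similarity matrix
    is nonnegative definite and scaled so that s_ii = 1."""
    return math.sqrt(2.0 * (1.0 - s_ik))

print(similarity_from_distance(3.0))     # 0.25
print(distance_from_similarity(5 / 6))   # about 0.577
```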

Similarities and Association Measures for Pairs of Variables


➢ Thus far, we have discussed similarity measures for items.
➢ In some applications, it is the variables, rather than the items, that must be
grouped.
➢ Similarity measures for variables often take the form of sample correlation
coefficients.

➢ Moreover, in some clustering applications, negative correlations are replaced
by their absolute values.

➢ When the variables are binary, the data can again be arranged in the form of a
contingency table.
➢ This time, however, the variables, rather than the items, delineate the
categories.
➢ For each pair of variables, there are n items categorized in the table.
➢ With the usual 0 and 1 coding, the table becomes as follows:
                         Variable k
                       1       0        Total
Variable i     1       a       b        a + b
               0       c       d        c + d
           Total     a + c   b + d   a + b + c + d

➢ For instance, variable i equals 1 and variable k equals 0 for b of the n items.

➢ The usual product moment correlation formula applied to the binary variables in the contingency table gives

$r = \frac{ad - bc}{[(a+b)(c+d)(a+c)(b+d)]^{1/2}} \qquad (2)$

➢ This number can be taken as a measure of the similarity between the two
variables.

➢ The correlation coefficient in (2) is related to the chi-square statistic ($r^2 = \chi^2/n$) for testing the independence of two categorical variables.
➢ For n fixed, a large similarity (or correlation) is consistent with the presence of dependence.
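A short Python sketch of formula (2) on a hypothetical 2 × 2 table of counts also illustrates the relation $r^2 = \chi^2/n$ (the counts below are made up for illustration):

```python
import math

def binary_correlation(a, b, c, d):
    """Formula (2): product moment correlation for a 2 x 2 table of two binary variables."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom

a, b, c, d = 10, 6, 4, 12          # hypothetical counts for n = 32 items
n = a + b + c + d
r = binary_correlation(a, b, c, d)
print(r)          # about 0.378
print(n * r**2)   # equals the chi-square statistic for testing independence
```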

Hierarchical Clustering Methods
➢ We can rarely examine all grouping possibilities, even with the largest and
fastest computers.
➢ Because of this problem, a wide variety of clustering algorithms have emerged
that find "reasonable" clusters without having to look at all configurations.

➢ Hierarchical clustering techniques proceed by either a series of successive


mergers or a series of successive divisions.
➢ Agglomerative hierarchical methods start with the individual objects.
➢ Thus, there are initially as many clusters as objects.
✓ The most similar objects are first grouped, and these initial groups are
merged according to their similarities.
✓ Eventually, as the similarity decreases, all subgroups are fused into a single cluster.

➢ Divisive hierarchical methods work in the opposite direction.


✓ An initial single group of objects is divided into two subgroups such
that the objects in one subgroup are "far from" the objects in the other.
✓ These subgroups are then further divided into dissimilar subgroups; the process continues until there are as many subgroups as objects, that is, until each object forms a group.

➢ The results of both agglomerative and divisive methods may be displayed in


the form of a two-dimensional diagram known as a dendrogram.
➢ We first concentrate on agglomerative hierarchical procedures, in particular,
linkage methods.

➢ Linkage methods are suitable for clustering items, as well as variables.


➢ Types of linkage:
✓ single linkage (minimum distance or nearest neighbor),
✓ complete linkage (maximum distance or farthest neighbor), and
✓ average linkage (average distance).
➢ The merging of clusters under the three linkage criteria is illustrated schematically in Figure 12.2.
➢ From the figure, we see that
✓ single linkage results when groups are fused according to the distance
between their nearest members.
✓ Complete linkage occurs when groups are fused according to the
distance between their farthest members.
✓ For average linkage, groups are fused according to the average
distance between pairs of members in the respective sets.

Figure 12.2 Intercluster distance (dissimilarity) for (a) single linkage, (b) complete linkage, and (c) average linkage.

The steps in the agglomerative hierarchical clustering algorithm for grouping N objects (items or variables) are as follows:
1. Start with N clusters, each containing a single entity, and an $N \times N$ symmetric matrix of distances (or similarities) $D = \{d_{ik}\}$.
2. Search the distance matrix for the nearest (most similar) pair of clusters. Let
the distance between “most similar” clusters U and V be 𝑑𝑈𝑉 .
3. Merge clusters U and V. Label the newly formed cluster (UV). Update the
entries in the distance matrix by (a) deleting the rows and columns
corresponding to clusters U and V and (b) adding a row and column giving
the distances between cluster (UV) and the remaining clusters.
4. Repeat Steps 2 and 3 a total of N - 1 times. (All objects will be in a single
cluster after the algorithm terminates.) Record the identity of clusters that are
merged and the levels (distances or similarities) at which the mergers take
place.
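A minimal Python sketch of Steps 1-4 (the data structure and names are illustrative; the link argument selects single, complete, or average linkage via min, max, or mean):

```python
from statistics import mean  # pass mean as `link` for average linkage

def agglomerate(D, labels, link=min):
    """Steps 1-4 on a distance dictionary D: frozenset({i, k}) -> d_ik.
    link is applied to all between-cluster member distances: min -> single,
    max -> complete, mean -> average linkage. Returns the merge history."""
    clusters = [(lab,) for lab in labels]              # Step 1: N singleton clusters
    dist = {frozenset({(i,), (k,)}): D[frozenset({i, k})]
            for i in labels for k in labels if i != k}
    history = []
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)                 # Step 2: nearest pair of clusters
        u, v = tuple(pair)
        level = dist[pair]
        uv = tuple(sorted(u + v))                      # Step 3: merge U and V into (UV)
        history.append((u, v, level))
        clusters = [c for c in clusters if c not in (u, v)]
        dist = {p: d for p, d in dist.items() if u not in p and v not in p}
        for w in clusters:                             # distances from (UV) to the rest
            dist[frozenset({uv, w})] = link(
                [D[frozenset({i, k})] for i in uv for k in w])
        clusters.append(uv)
    return history                                     # Step 4: N - 1 recorded mergers

# Tiny illustration with three objects a, b, c (single linkage by default):
toy = {frozenset({"a", "b"}): 2.0,
       frozenset({"a", "c"}): 6.0,
       frozenset({"b", "c"}): 5.0}
print(agglomerate(toy, ["a", "b", "c"]))   # two mergers, at levels 2.0 and 5.0
```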
Single Linkage
➢ The inputs to a single linkage algorithm can be distances or similarities
between pairs of objects.
➢ Groups are formed from the individual entities by merging nearest neighbors,
where the term nearest neighbor connotes the smallest distance or largest
similarity.

➢ Initially, we must find the smallest distance in $D = \{d_{ik}\}$ and merge the corresponding objects, say, U and V, to get the cluster (UV). For Step 3 of the general algorithm above, the distances between (UV) and any other cluster W are computed by

$d_{(UV)W} = \min\{d_{UW}, d_{VW}\}$

➢ Here the quantities $d_{UW}$ and $d_{VW}$ are the distances between the nearest neighbors of clusters U and W and clusters V and W, respectively.

➢ The results of single linkage clustering can be graphically displayed in the


form of a dendrogram, or tree diagram.
➢ The branches in the tree represent clusters.
➢ The branches come together (merge) at nodes whose positions along a
distance (or similarity) axis indicate the level at which the fusions occur.
➢ Dendrograms for some specific cases are considered in the following
examples.

Example: (Clustering using single linkage) To illustrate the single linkage
algorithm, we consider the hypothetical distances between pairs of five objects as
follows:
                    1    2    3    4    5
               1    0
               2    9    0
D = {d_ik} =   3    3    7    0
               4    6    5    9    0
               5   11   10    2    8    0

Treating each object as a cluster, we commence clustering by merging the two closest items. Since

$\min_{i,k}(d_{ik}) = d_{53} = 2,$

objects 5 and 3 are merged to form the cluster (35). To implement the next level of clustering, we need the distances between the cluster (35) and the remaining objects 1, 2, and 4. The nearest neighbor distances are

$d_{(35)1} = \min\{d_{31}, d_{51}\} = \min\{3, 11\} = 3$
$d_{(35)2} = \min\{d_{32}, d_{52}\} = \min\{7, 10\} = 7$
$d_{(35)4} = \min\{d_{34}, d_{54}\} = \min\{9, 8\} = 8$
Deleting the rows and columns of D corresponding to objects 3 and 5, and adding a
row and column for the cluster (35), we obtain the new distance matrix
          (35)    1    2    4
  (35)     0
    1      3      0
    2      7      9    0
    4      8      6    5    0

The smallest distance between pairs of clusters is now $d_{(35)1} = 3$, and we merge cluster (1) with cluster (35) to get the next cluster, (135). Calculating

$d_{(135)2} = \min\{d_{(35)2}, d_{12}\} = \min\{7, 9\} = 7$
$d_{(135)4} = \min\{d_{(35)4}, d_{14}\} = \min\{8, 6\} = 6$


we find that the distance matrix for the next level of clustering is

          (135)    2    4
  (135)     0
    2       7      0
    4       6      5    0
The minimum nearest neighbor distance between pairs of clusters is 𝑑42 = 5, and
we merge objects 4 and 2 to get the cluster (24).

At this point we have two distinct clusters (135) and (24). Their nearest neighbor
distance is
$d_{(135)(24)} = \min\{d_{(135)2}, d_{(135)4}\} = \min\{7, 6\} = 6$
The final distance matrix becomes
          (135)  (24)
  (135)     0
  (24)      6     0
Consequently, clusters (135) and (24) are merged to form a single cluster of all five
objects (12345), when the nearest neighbor distance reaches 6.
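The hand computation can be checked with SciPy's hierarchical clustering routines; a small sketch, assuming scipy and matplotlib are available, is:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Symmetric distance matrix for the five objects in the example.
D = np.array([[ 0,  9,  3,  6, 11],
              [ 9,  0,  7,  5, 10],
              [ 3,  7,  0,  9,  2],
              [ 6,  5,  9,  0,  8],
              [11, 10,  2,  8,  0]], dtype=float)

# Single linkage on the condensed (upper-triangular) form of D.
Z = linkage(squareform(D), method="single")
print(Z[:, 2])   # merge levels [2. 3. 5. 6.], matching the hand computation

# Dendrogram as in Figure 12.3 (requires matplotlib).
import matplotlib.pyplot as plt
dendrogram(Z, labels=[1, 2, 3, 4, 5])
plt.ylabel("Distance")
plt.show()
```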

The dendrogram picturing this hierarchical clustering is shown in Figure 12.3. The intermediate results, where the objects are sorted into a moderate number of clusters, are of chief interest.

Figure 12.3 Single linkage dendrogram for distances between five objects.

