Lecture-11 Cluster Analysis-1
Cluster Analysis
Background
Clustering:
Finding groups of objects (e.g. patients) such that the
objects in a group are similar (or related) to one another
and different from the objects in other groups.
[Figure: two groups of points — intra-cluster distances are minimized, inter-cluster distances are maximized]
Background
Cluster Analysis:
▪ In statistics, the search for relatively homogeneous
groups of objects is called cluster analysis.
In general,
▪ It is a class of techniques used to classify observations
into groups or clusters such that:
✓ Each group or cluster is homogeneous (or compact)
with respect to certain characteristics.
✓ Each group should be different from other groups
with respect to the same characteristics.
Application
▪ In Psychiatry
✓ Clustering is frequently used on patients to form
homogeneous sub-groups using variables (e.g. cognitions)
that help identify the disease severity.
▪ In the field of Biology
✓ To classify animals into classes, orders, and families.
▪ In Agriculture
✓ The fertility of land in a region may not be
homogeneous for every type of crop.
✓ Pieces of land sharing similar fertility for a
particular crop may then be grouped together.
▪ In Economics
✓ People of a city center may be grouped together according
to their socio-economic condition.
Data Reduction Techniques
▪ Cluster Analysis
✓ Reduces the number of sample observations.
✓ A data-reduction technique that operates on the rows of the data matrix.
✓ Identifies homogeneous groups or clusters.
▪ Discriminant Analysis
✓ Similar in that it also classifies observations into groups.
✓ But it derives a rule for allocating an object to its proper
population based on prior information about the group
membership of objects.
Basic Steps of Cluster Analysis
Clustering depends on
➢ Choice of clustering algorithms
– Hierarchical clustering
– K-means clustering
➢ Choice of distance
– Euclidean distance
– Minkowski distance
– Manhattan distance
– etc.
➢ Choice of variables
➢ Standardization
➢ The number of clusters
Distances or similarity measures
➢ Distances
▪ Minkowski distance
If x = (x₁, x₂, …, xₙ) and y = (y₁, y₂, …, yₙ) are two points
in Euclidean n-space, then the Minkowski distance of order p from x to y is

$d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$

When p = 2, it is the Euclidean distance, and
when p = 1, it is the Manhattan distance.
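The three distances above can be computed with a single function. A minimal sketch in plain Python; the points and values are made-up for illustration:

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between points x and y."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(x, y, 2))  # p = 2, Euclidean: sqrt(3^2 + 4^2) = 5.0
print(minkowski(x, y, 1))  # p = 1, Manhattan: 3 + 4 = 7.0
```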
Distances
▪ Euclidean distance
$d_{xy} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}$
Ex: The body length (in mm) and body weight (in gm) of 5 randomly
selected slugs are following:
Subject 1 2 3 4 5
Length 35 35 38 35 39
Weight 1.3 4.0 3.2 1.0 1.4
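The pairwise Euclidean distances for the slug data can be tabulated directly. A sketch in plain Python (`math.dist` requires Python 3.8+); note that length, measured in mm, dominates weight, measured in g, which is one motivation for the standardization step listed earlier:

```python
import math

length = [35, 35, 38, 35, 39]       # body length (mm)
weight = [1.3, 4.0, 3.2, 1.0, 1.4]  # body weight (g)
slugs = list(zip(length, weight))

# Pairwise Euclidean distance matrix for the five slugs
for s in slugs:
    print([round(math.dist(s, t), 2) for t in slugs])
```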
▪ Pearson’s correlation
$S_{ij} = \dfrac{\mathrm{Cov}(x_i, y_j)}{\sqrt{\mathrm{Var}(x_i)\,\mathrm{Var}(y_j)}}$
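Pearson's correlation is a similarity rather than a distance (values near 1 mean similar profiles); one common conversion to a distance is 1 − r. A minimal sketch in plain Python with made-up vectors:

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return cov / math.sqrt(vx * vy)

r = pearson([1, 2, 3, 4], [2, 4, 6, 8])
print(r)      # proportional profiles: r = 1.0
print(1 - r)  # corresponding distance: 0.0
```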
Hierarchical Clustering
➢ Agglomerative algorithm:
– Starts with each object as a separate cluster
– Combines objects into clusters that are closest
– Ends with one cluster with all objects
– Once a cluster is formed, it cannot be split
➢ Divisive algorithm:
– The opposite of the agglomerative method: starts with a
single cluster containing all objects and successively splits it
Complete Linkage or Furthest Neighbor Method
▪ The distance between two clusters is the largest distance
between a subject in one cluster and a subject in the other.
Average Linkage
▪ The distance between two clusters is obtained by taking
the average distance between all pairs of subjects in the
two clusters.
$d_{AB} = \dfrac{1}{n_A n_B} \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} d(u_i, v_j)$

where $u_i \in A$ and $v_j \in B$ for all $i = 1, \dots, n_A$ and $j = 1, \dots, n_B$.
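The average-linkage formula translates directly into code. A sketch in plain Python on a made-up one-dimensional example:

```python
def average_linkage(dist, A, B):
    """Mean of all pairwise distances between clusters A and B (lists of indices)."""
    return sum(dist[i][j] for i in A for j in B) / (len(A) * len(B))

# Toy 1-D points 0, 4, and 10; clusters A = {0, 4}, B = {10}
points = [0, 4, 10]
dist = [[abs(p - q) for q in points] for p in points]
print(average_linkage(dist, [0, 1], [2]))  # (10 + 6) / 2 = 8.0
```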
Ward’s Minimum Variance
▪ At each step, the two clusters whose merger produces the
smallest increase in the total within-cluster sum of squares
are joined.
Dendrogram
[Figure: dendrogram with the objects along the horizontal axis]
Example of forming Hierarchical Clustering
Example 7.1: The following data represent the number of children ever born
(x1) and the number of children desired (x2) of 5 mothers residing in some parts
of north-eastern Libya:

Variables/Mothers   M1   M2   M3   M4   M5
x1                   7    8    6    3   11
x2                  10   10    5    2   10

Find the Euclidean distance matrix d and represent the clustering by a
dendrogram.
Solution:
The Euclidean distance matrix is

           M1      M2      M3      M4      M5
    M1    0.00    1.00    5.10    8.94    4.00
    M2    1.00    0.00    5.39    9.43    3.00
d = M3    5.10    5.39    0.00    4.24    7.07
    M4    8.94    9.43    4.24    0.00   11.31
    M5    4.00    3.00    7.07   11.31    0.00
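The matrix can be reproduced in a few lines, as a quick check. A sketch in plain Python using `math.dist` (Python 3.8+):

```python
import math

x1 = [7, 8, 6, 3, 11]    # children ever born
x2 = [10, 10, 5, 2, 10]  # children desired
mothers = list(zip(x1, x2))

d = [[round(math.dist(p, q), 2) for q in mothers] for p in mothers]
for row in d:
    print(row)
# First row (distances from M1): [0.0, 1.0, 5.1, 8.94, 4.0]
```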
Solution (1)
Step 1: The smallest entry of d is d₁₂ = 1.00, so mothers M1 and M2 are merged
first. The distances from cluster (1,2) to the remaining subjects are then
recomputed by average linkage, e.g. d₍₁,₂₎₃ = (5.10 + 5.39)/2 = 5.245.
The matrix d₂ is

            (1,2)     3       4       5
     (1,2)  0.00    5.245   9.185    3.50
d₂ =     3  5.245   0.00    4.24     7.07
         4  9.185   4.24    0.00    11.31
         5  3.50    7.07   11.31     0.00
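Each entry of d₂ involving the merged cluster is an average of two entries of d. A quick check in plain Python:

```python
d13, d23 = 5.10, 5.39
d14, d24 = 8.94, 9.43
d15, d25 = 4.00, 3.00

print(round((d13 + d23) / 2, 3))  # 5.245
print(round((d14 + d24) / 2, 3))  # 9.185
print(round((d15 + d25) / 2, 3))  # 3.5
```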
Solution (2)
Final Step-4: The two clusters are linked to form a single cluster of all objects.
The distance between the two clusters is
$d_{(1,2,5)(3,4)} = \dfrac{1}{3 \times 2}\left(d_{13} + d_{23} + d_{53} + d_{14} + d_{24} + d_{54}\right) = 7.87$
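The whole agglomeration can be replayed from the raw data. A minimal from-scratch sketch in plain Python, assuming average linkage as in the steps above; the last merge reproduces the 7.87:

```python
import math
from itertools import combinations

x1 = [7, 8, 6, 3, 11]
x2 = [10, 10, 5, 2, 10]
mothers = list(zip(x1, x2))

def avg_link(A, B):
    """Average-linkage distance between clusters A and B (lists of indices)."""
    total = sum(math.dist(mothers[i], mothers[j]) for i in A for j in B)
    return total / (len(A) * len(B))

clusters = [[i] for i in range(len(mothers))]
heights = []  # fusion level of each merge
while len(clusters) > 1:
    # Merge the pair of clusters with the smallest average-linkage distance
    A, B = min(combinations(clusters, 2), key=lambda ab: avg_link(*ab))
    heights.append(avg_link(A, B))
    print(f"merge {A} + {B} at distance {heights[-1]:.2f}")
    clusters = [c for c in clusters if c is not A and c is not B] + [A + B]
```

The four printed fusion levels (1.00, 3.50, 4.24, 7.87) are the heights at which the branches of the dendrogram join.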
Solution (3)
K-Means Clustering
▪ A non-hierarchical method that partitions the observations
into a pre-specified number K of clusters:
✓ Choose K initial cluster centroids.
✓ Assign each observation to the cluster with the nearest centroid.
✓ Recompute each centroid as the mean of the observations assigned to it.
✓ Repeat the last two steps until the cluster assignments no longer change.
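The alternating assign/update loop of K-means can be sketched from scratch in plain Python (the 2-D points here are made-up; a real analysis would normally use a library implementation):

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K initial centroids
    for _ in range(iters):
        # Assignment: each point joins the cluster of its nearest centroid
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[i].append(p)
        # Update: recompute each centroid as the mean of its group
        centroids = [
            tuple(sum(v) / len(g) for v in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids, groups

pts = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, groups = kmeans(pts, k=2)
print(sorted(len(g) for g in groups))  # two well-separated groups of 3
```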
Thank You!!