
Lecture-11

M.S. 1st Semester


Session: 2020-21

Cluster Analysis
Reference books

1. Bhuyan, K.C. (2005). Multivariate Analysis and its Application, 1st Edition, New Central Book Agency (P) Ltd, India.
2. Everitt, B.S., Landau, S., Leese, M., & Stahl, D. (2012). Cluster Analysis, 5th Edition, John Wiley & Sons Ltd, UK.

Background
Clustering:
Finding groups of objects (e.g. patients) such that the
objects in a group are similar (or related) to one another
and different from the objects in other groups.

In a good clustering, intra-cluster distances are minimized while inter-cluster distances are maximized.

Background
Cluster Analysis:
▪ In statistics, the search for relatively homogeneous groups of objects is called cluster analysis.

In general,
▪ It is a class of techniques used to classify observations
into groups or clusters such that:
✓ Each group or cluster is homogeneous (or compact)
with respect to certain characteristics.
✓ Each group should be different from other groups
with respect to the same characteristics.

Application
▪ In Psychiatry
✓ Clustering is frequently applied to patients to form homogeneous sub-groups, using variables (e.g., cognitions) that help characterize disease severity.
▪ In the field of Biology
✓ To classify animals into classes, orders, and families.
▪ In Agriculture
✓ The fertility of land in a particular region may not be homogeneous for every type of crop.
✓ Pieces of land with similar fertility for a particular crop may then be grouped together.
▪ In Economics
✓ People of a city center may be grouped together according
to their socio-economic condition.
Data Reduction Techniques
▪ Cluster Analysis
✓ Reduces the number of sample observations, i.e., it is a data reduction technique that works on the rows of the data matrix.
✓ Identifies homogeneous groups or clusters.

▪ Principal Component Analysis (PCA)


✓ Reduces the data in columns, i.e., it reduces the number of variables.

▪ Discriminant Analysis
✓ Similar in that it also classifies observations into groups.
✓ But it derives a rule for allocating an object to its proper population, based on prior information about the group membership of the objects.
Basic Steps of Cluster Analysis
Clustering depends on
➢ Choice of clustering algorithms
– Hierarchical clustering
– K-means clustering
➢ Choice of distance
– Euclidean distance
– Minkowski distance
– Manhattan distance
– etc.
➢ Choice of variables
➢ Standardization
➢ The number of clusters
Distances or similarity measures
➢ Distances
▪ Euclidean distance

If x = (x1, x2,..., xn) and y = (y1, y2,..., yn) are two points
in Euclidean n-space, then the distance from x to y is

d_{xy} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}


▪ Minkowski distance

d_{xy} = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}

When p = 2 it reduces to the Euclidean distance, and when p = 1 to the Manhattan distance.
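As a quick illustration, here is a minimal sketch of the three distances using SciPy (assuming `scipy` is installed; the two sample points are made up):

```python
# Minimal sketch of the Euclidean, Manhattan, and Minkowski distances.
# The points x and y are illustrative, not taken from the lecture.
from scipy.spatial import distance

x = [7, 10]
y = [3, 2]

print(distance.euclidean(x, y))       # Minkowski with p = 2
print(distance.cityblock(x, y))       # Manhattan, i.e. Minkowski with p = 1
print(distance.minkowski(x, y, p=3))  # general Minkowski distance
```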
Distances
▪ Euclidean distance
d_{xy} = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_n - y_n)^2}

Ex: The body length (in mm) and body weight (in gm) of 5 randomly selected slugs are as follows:

Subject 1 2 3 4 5
Length 35 35 38 35 39
Weight 1.3 4.0 3.2 1.0 1.4

The Euclidean distance matrix is

        0.00  2.70  3.55  0.30  4.00
        2.70  0.00  3.10  3.00  4.77
d =     3.55  3.10  0.00  3.72  2.06
        0.30  3.00  3.72  0.00  4.02
        4.00  4.77  2.06  4.02  0.00
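This matrix can be reproduced with a short sketch (assuming `numpy` and `scipy` are available):

```python
# Sketch reproducing the slug distance matrix; rows are (length, weight).
import numpy as np
from scipy.spatial.distance import pdist, squareform

slugs = np.array([[35, 1.3],
                  [35, 4.0],
                  [38, 3.2],
                  [35, 1.0],
                  [39, 1.4]])

d = squareform(pdist(slugs, metric="euclidean"))
print(np.round(d, 2))  # matches the 5 x 5 matrix above
```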
Similarity measure
➢ Similarity measure

▪ Pearson’s correlation
S_{ij} = \frac{\mathrm{Cov}(x_i, y_j)}{\sqrt{\mathrm{Var}(x_i)\,\mathrm{Var}(y_j)}}

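As a sketch of how such a similarity matrix can be computed (assuming `numpy`; the three subject profiles are hypothetical):

```python
# Correlation-based similarity matrix for three hypothetical subjects
# measured on the same four variables.
import numpy as np

profiles = np.array([[1.0, 2.0, 3.0, 4.0],
                     [2.1, 3.9, 6.2, 8.0],
                     [4.0, 3.0, 2.0, 1.0]])

S = np.corrcoef(profiles)  # S[i, j] = Pearson correlation of rows i and j
print(np.round(S, 2))
```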
Hierarchical Clustering
➢ Agglomerative algorithm:
– Starts with each object as a separate cluster
– Combines objects into clusters that are closest
– Ends with one cluster with all objects
– Once a cluster is formed, it cannot be split

➢ Divisive algorithm:
– The opposite of the agglomerative method: starts with a single cluster containing all objects and successively splits clusters until each object stands alone

➢ Different distance measures can be used to form the clusters
Methods for hierarchical clustering
➢ Criterion to merge an object into a cluster for hierarchical
clustering
▪ Linkage methods
– Single Linkage (minimum distance)
– Complete Linkage (maximum distance)
– Average Linkage
▪ Ward’s method
– Compute the sum of squared distances within clusters
– Merge the pair of clusters giving the minimum increase in the overall sum of squares
▪ Centroid method
– The distance between two clusters is defined as the distance between their centroids (cluster means/medians)
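All of these criteria are available in SciPy's hierarchical clustering routines; a minimal sketch (assuming `scipy`/`numpy`, reusing the slug data from the earlier example as the data matrix):

```python
# Each row of the linkage matrix Z records one merge:
# (cluster index a, cluster index b, merge distance, new cluster size).
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[35, 1.3], [35, 4.0], [38, 3.2], [35, 1.0], [39, 1.4]])

for method in ("single", "complete", "average", "ward", "centroid"):
    Z = linkage(X, method=method)
    print(method, np.round(Z, 2), sep="\n")
```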
Single Linkage or Nearest Neighbor Method

▪ The distance between two clusters is the minimum of the distances between all possible pairs of subjects in the two clusters.

d_{AB} = \min \{ d(u_i, v_j) \}

where u_i ∈ A and v_j ∈ B, for all i = 1, ..., n_A and j = 1, ..., n_B

Complete Linkage or Furthest Neighbor Method

▪ The distance between two clusters is defined as the maximum of the distances between all possible pairs of observations in the two clusters.

d_{AB} = \max \{ d(u_i, v_j) \}

where u_i ∈ A and v_j ∈ B, for all i = 1, ..., n_A and j = 1, ..., n_B

Average Linkage
▪ The distance between two clusters is obtained by taking
the average distance between all pairs of subjects in the
two clusters.
d_{AB} = \frac{1}{n_A n_B} \sum_{i=1}^{n_A} \sum_{j=1}^{n_B} d(u_i, v_j)

where u_i ∈ A and v_j ∈ B, for all i = 1, ..., n_A and j = 1, ..., n_B

Ward’s Minimum Variance

• The within-group (i.e., within-cluster) sum of squares is used as the measure of homogeneity.

• At each step, compute the sum of squared distances within the groups.

• Ward's method tries to minimize the total within-group (within-cluster) sum of squares.

• Clusters are merged at each step so that the resulting cluster solution has the smallest increase in the total within-group sum of squares.

Dendrogram

▪ After clustering, the objects in the clusters can be represented by a diagram. This diagram is known as a dendrogram.

[Figure: a dendrogram, with distance on the vertical axis and the objects on the horizontal axis]
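A sketch of how such a diagram can be drawn (assuming `scipy` and `matplotlib`; the data matrix is again the slug data from the earlier example):

```python
# Draw a dendrogram from an average-linkage clustering of the slug data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

X = np.array([[35, 1.3], [35, 4.0], [38, 3.2], [35, 1.0], [39, 1.4]])

Z = linkage(X, method="average")
dendrogram(Z, labels=["S1", "S2", "S3", "S4", "S5"])
plt.ylabel("Distance")
plt.xlabel("Objects")
plt.show()
```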
Example of Forming Hierarchical Clusters

Example 7.1: The following data represent the number of children ever born (x1) and the number of desired children (x2) of 5 mothers residing in some parts of north-eastern Libya:

Variables/Mothers   M1   M2   M3   M4   M5
x1                   7    8    6    3   11
x2                  10   10    5    2   10

Find the Euclidean distance matrix d and, after clustering, represent the mothers by a dendrogram.
Solution:
The Euclidean distance matrix is

           M1     M2     M3     M4     M5
    M1    0.00   1.00   5.10   8.94   4.00
    M2    1.00   0.00   5.39   9.43   3.00
d = M3    5.10   5.39   0.00   4.24   7.07
    M4    8.94   9.43   4.24   0.00  11.31
    M5    4.00   3.00   7.07  11.31   0.00
Solution (1)

Using the average linkage method:

Step 1: Form the first cluster with M1 and M2, since the distance between these two objects (1.00) is the minimum.

Step 2: Calculate the average distance between the objects in the first cluster and each object outside it:

d_{(1,2)3} = \frac{1}{n_1 n_2} \sum_i \sum_j d_{ij} = \frac{1}{2 \times 1} (d_{13} + d_{23}) = 5.245

Here n_1 = 2 is the number of objects in the first cluster and n_2 = 1 the number of objects in the other cluster.

Similarly, d_{(1,2)4} = 9.185 and d_{(1,2)5} = 3.50.

The matrix d_2 is

             (1,2)     3      4       5
     (1,2)   0.00    5.245  9.185   3.50
d2 =   3     5.245   0.00   4.24    7.07
       4     9.185   4.24   0.00   11.31
       5     3.50    7.07  11.31    0.00

Solution (2)

Now M5 is fused to the first cluster, because its distance to the cluster (3.50) is the smallest entry in d_2.

Step 3: The distances between the remaining objects and the newly formed cluster are

d_{(1,2,5)3} = \frac{1}{3 \times 1} (d_{13} + d_{23} + d_{53}) = 5.85

d_{(1,2,5)4} = \frac{1}{3 \times 1} (d_{14} + d_{24} + d_{54}) = 9.89

The matrix d_3 is

              (1,2,5)    3      4
     (1,2,5)   0.00    5.85   9.89
d3 =    3      5.85    0.00   4.24
        4      9.89    4.24   0.00
Step 4: Once again, M3 and M4 form a new cluster, since the distance between these two objects (4.24) is the smallest entry in d_3.

Step 5 (final): The two clusters are linked to form a single cluster of all objects.
The distance between the two clusters is

d_{(1,2,5)(3,4)} = \frac{1}{3 \times 2} (d_{13} + d_{23} + d_{53} + d_{14} + d_{24} + d_{54}) = 7.87
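The whole calculation can be checked with a short sketch (assuming `scipy`/`numpy`; the data are the five mothers of Example 7.1):

```python
# Verify the merge order and merge heights derived above.
import numpy as np
from scipy.cluster.hierarchy import linkage

mothers = np.array([[7, 10], [8, 10], [6, 5], [3, 2], [11, 10]])  # (x1, x2)

Z = linkage(mothers, method="average")
print(np.round(Z, 2))
# Expected merges: (M1, M2) at 1.00, then M5 joins at 3.50,
# (M3, M4) at 4.24, and the two clusters join at 7.87.
```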
Solution (3)

▪ The dendrogram of all five mothers is shown in the following figure:

[Figure: dendrogram of M1–M5; M1 and M2 merge at height 1.00, M5 joins them at 3.50, M3 and M4 merge at 4.24, and the two clusters join at 7.87]

K-Means Clustering

1) The number of clusters is set at K.
2) Select K initial points (centroids) in the space of the variables.
3) Compute the distance of each object to every centroid.
4) Assign each object to the group with the nearest centroid.
5) When all objects are assigned, recompute the centroid of each of the K groups as the group mean (or median).
6) Repeat steps 3–5 until the group assignments no longer change (a code sketch follows below).

▪ The selection of the initial centroids is important, since a poor choice can lead to incorrect clusters.
▪ Hierarchical clustering can be used to obtain appropriate initial centroids.
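A minimal NumPy sketch of the steps listed above (Lloyd's algorithm; the `k_means` helper and the data are illustrative, and the sketch assumes no cluster ever becomes empty):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]      # step 2
    for _ in range(n_iter):
        # steps 3-4: distance of each object to each centroid, then assign
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 5: recompute each centroid as the mean of its group
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):                           # step 6
            break
        centroids = new
    return labels, centroids

X = np.array([[1.0, 2.0], [1.5, 1.8], [0.9, 2.2], [8.0, 8.0], [8.5, 8.2]])
labels, centroids = k_means(X, k=2)
print(labels)     # group membership of each object
print(centroids)  # final group means
```

In practice, scikit-learn's `KMeans` implements the same idea with better initialization (k-means++).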
K-Means Clustering

[Figures: step-by-step illustrations of K-means clustering]
Thank You!!

