AI20 - Hierarchical-Clustering

Hierarchical clustering generates a hierarchy of partitions from a dataset, allowing users to identify sub-populations. It includes two main methods: agglomerative, which merges clusters, and divisive, which splits them, with various distance measures to determine cluster similarity. While hierarchical clustering is easy to implement and does not require prior knowledge of the number of clusters, it can be sensitive to outliers and is not suitable for large datasets.


Hierarchical Clustering
• Hierarchical methods generate a hierarchy of partitions, i.e.
• a partition P1 into 1 cluster (the entire collection)
• a partition P2 into 2 clusters
– …
• a partition Pn into n clusters (each object forms its own cluster)

• It is then up to the user to decide which of the partitions reflects
  actual sub-populations in the data.
• Representing data objects in the form of a hierarchy is useful for
  data summarization and visualization.

Note: A sequence of partitions is called "hierarchical" if each cluster
in a given partition is the union of clusters in the next larger partition.

[Figure: partitions P4, P3, P2, P1. Top: a hierarchical sequence of partitions; bottom: a non-hierarchical sequence.]
HC methods come in two varieties: agglomerative and divisive.

Agglomerative methods [AGNES (AGglomerative NESting)]:
• Start with partition Pn, where each object forms its own cluster.
• Merge the two closest clusters, obtaining Pn-1.
• Repeat merging until only one cluster is left.

Divisive methods [DIANA (DIvisive ANAlysis)]:
• Start with P1.
• Split the collection into two clusters that are as homogeneous (and
  as different from each other) as possible.
• Apply the splitting procedure recursively to the resulting clusters.

• Agglomerative methods require a rule to decide which clusters to merge.
  Typically one defines a distance between clusters and then merges the
  two clusters that are closest.
• Divisive methods require a rule for splitting a cluster.
Hierarchical Agglomerative Clustering
• Define a distance between clusters
• Initialize: every example is its own cluster
• Iterate:
  – Compute distances between all clusters (store for efficiency)
  – Merge the two closest clusters
• Save both the clustering and the sequence of cluster operations
  (the “dendrogram”)

[Figure: iterations 1–3 of the merging process; initially, every datum is its own cluster.]

• Builds up a sequence of clusters (“hierarchical”)
• Because two clusters are merged into one at each iteration, and each
  cluster contains at least one object, an agglomerative method requires
  at most n iterations.
• Algorithm complexity: O(N²)
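
As a concrete illustration of the loop above, here is a minimal sketch in Python/NumPy (function and variable names are my own, not from the slides); it uses single linkage as the cluster distance and records every merge:

```python
import numpy as np

def agglomerative_sketch(X):
    """Naive HAC sketch: repeatedly merge the two closest clusters.

    X: (n, d) array of observations. Uses single linkage as the
    cluster distance. Returns the list of merges performed.
    """
    # Partition P_n: every observation starts in its own cluster.
    clusters = [[i] for i in range(len(X))]
    merges = []

    def single_link(a, b):
        # Minimum pairwise distance between members of the two clusters.
        d = np.linalg.norm(X[a][:, None, :] - X[b][None, :, :], axis=-1)
        return d.min()

    # n - 1 merges take us from P_n down to P_1.
    while len(clusters) > 1:
        # Find the two closest clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = single_link(clusters[i], clusters[j])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        dij, i, j = best
        merges.append((clusters[i], clusters[j], dij))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```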
Dendrograms
The result of hierarchical clustering can be represented as a binary tree:
• The root of the tree represents the entire collection
• Terminal nodes represent observations
• Each interior node represents a cluster
• Each subtree represents a partition

Note: For HAC methods, the merge order defines a sequence of n subtrees
of the full tree. For HDC methods, a sequence of subtrees can be defined
if there is a figure of merit for each split.

A clustering is obtained by cutting the dendrogram at a desired level:
each connected component then forms a cluster.
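
For reference, this tree-and-cut view maps directly onto SciPy's hierarchy routines; a minimal sketch (the data array is made up for illustration, and dendrogram() needs matplotlib to actually draw):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(20, 2)            # illustrative data, 20 observations
Z = linkage(X, method="average")     # (n-1) x 4 merge table = the binary tree

# Each interior node of the tree is one row of Z:
# [left child id, right child id, merge distance, size of new cluster].
dendrogram(Z)                        # draws the tree (requires matplotlib)

# Cutting the dendrogram at height t: every connected component below
# the cut becomes one cluster.
labels = fcluster(Z, t=0.4, criterion="distance")
```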
Hierarchical agglomerative clustering
Need to define a distance d(P,Q) between groups, given a distance
measure d(x,y) between observations.
Commonly used distance measures:
1. d1(P,Q) = min d(x,y), for x in P, y in Q  (single linkage)
2. d2(P,Q) = ave d(x,y), for x in P, y in Q  (average linkage)
3. d3(P,Q) = max d(x,y), for x in P, y in Q  (complete linkage)
4. d4(P,Q) = ||x̄P − x̄Q||²  (centroid method)
5. d5(P,Q) = (|P|·|Q| / (|P| + |Q|)) · ||x̄P − x̄Q||²  (Ward’s method)

Here x̄P denotes the mean (centroid) of cluster P and |P| the number of
objects in P. d5 is called Ward’s distance.
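
As a concrete reference, the five measures can be written directly in NumPy; a minimal sketch with helper names of my own (P and Q are arrays whose rows are the observations in each cluster):

```python
import numpy as np

def pairwise(P, Q):
    # All pairwise distances d(x, y) for x in P, y in Q.
    return np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)

def d1(P, Q):  # single linkage: minimum pairwise distance
    return pairwise(P, Q).min()

def d2(P, Q):  # average linkage: average pairwise distance
    return pairwise(P, Q).mean()

def d3(P, Q):  # complete linkage: maximum pairwise distance
    return pairwise(P, Q).max()

def d4(P, Q):  # centroid method: squared distance between cluster means
    return np.sum((P.mean(axis=0) - Q.mean(axis=0)) ** 2)

def d5(P, Q):  # Ward's method: |P||Q| / (|P| + |Q|) times d4
    return len(P) * len(Q) / (len(P) + len(Q)) * d4(P, Q)
```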


Motivation for Ward’s distance:
• Let Pk = {P1, …, Pk} be a partition of the observations into k groups.
• Measure the goodness of a partition by the sum of squared distances
  of observations from their cluster means:

  RSS(Pk) = Σ_{i=1..k} Σ_{j ∈ Pi} ||xj − x̄Pi||²

• Consider all possible (k−1)-partitions obtainable from Pk by a merge.
• Merging the two clusters with the smallest Ward’s distance gives the
  (k−1)-partition with the smallest RSS, i.e. it optimizes the goodness
  of the new partition.
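
This claim can be checked numerically: the increase in RSS caused by merging two clusters equals their Ward distance d5, so merging the pair with the smallest d5 gives the best (k−1)-partition reachable by a single merge. A small sketch with made-up random clusters:

```python
import numpy as np

def rss(clusters):
    # Sum of squared distances of observations from their cluster means.
    return sum(np.sum((C - C.mean(axis=0)) ** 2) for C in clusters)

def ward(P, Q):
    # d5(P, Q) = |P||Q| / (|P| + |Q|) * ||mean(P) - mean(Q)||^2
    return len(P) * len(Q) / (len(P) + len(Q)) * np.sum(
        (P.mean(axis=0) - Q.mean(axis=0)) ** 2)

rng = np.random.default_rng(0)
P = rng.normal(size=(5, 2))
Q = rng.normal(size=(8, 2))
R = rng.normal(size=(4, 2))

# Merging P and Q increases RSS by exactly d5(P, Q).
delta = rss([np.vstack([P, Q]), R]) - rss([P, Q, R])
print(np.isclose(delta, ward(P, Q)))   # True
```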
Cluster Distances
• Single linkage produces a minimal spanning tree.
• Complete linkage avoids elongated clusters.

Single Link
• Use the maximum similarity of pairs (the most similar pair):
  sim(ci, cj) = max_{x ∈ ci, y ∈ cj} sim(x, y)
• Can result in “straggly” (long and thin) clusters due to the
  chaining effect.
• After merging ci and cj, the similarity of the resulting cluster to
  another cluster ck is:
  sim((ci ∪ cj), ck) = max(sim(ci, ck), sim(cj, ck))


Single Link Example



Complete Link Agglomerative Clustering
• Use the minimum similarity of pairs (the least similar pair):
  sim(ci, cj) = min_{x ∈ ci, y ∈ cj} sim(x, y)
• Makes “tighter,” more spherical clusters that are typically preferable.
• After merging ci and cj, the similarity of the resulting cluster to
  another cluster ck is:
  sim((ci ∪ cj), ck) = min(sim(ci, ck), sim(cj, ck))
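
Both update rules are one-liners; a minimal sketch with made-up similarity values, showing how the similarity to a third cluster ck is carried forward after ci and cj are merged:

```python
# Illustrative similarities of clusters ci and cj to a third cluster ck.
sim_ci_ck, sim_cj_ck = 0.8, 0.3

# Single link: the merged cluster is as similar to ck as its most
# similar part.
sim_single = max(sim_ci_ck, sim_cj_ck)    # 0.8

# Complete link: the merged cluster is only as similar to ck as its
# least similar part.
sim_complete = min(sim_ci_ck, sim_cj_ck)  # 0.3
```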



Complete Link Example


Comparing the distance measures
• The minimum and maximum measures represent two extremes in measuring
  the distance between clusters. They tend to be overly sensitive to
  outliers or noisy data.
• The use of mean or average distance is a compromise between the
  minimum and maximum distances and overcomes the outlier sensitivity
  problem.
• Whereas the mean distance is the simplest to compute, the average
  distance is advantageous in that it can handle categorical as well as
  numeric data, because the mean vector can be difficult or impossible
  to define for categorical data.
Solved example: single-linkage clustering of the 1-D points 18, 22, 25, 27, 42, 43 (distance = absolute difference)
Step 1: initial distance matrix

18 22 25 27 42 43

18 0 4 7 9 24 25

22 4 0 3 5 20 21

25 7 3 0 2 17 18

27 9 5 2 0 15 16

42 24 20 17 15 0 1

43 25 21 18 16 1 0
Step 2: merge 42 and 43 (closest pair, distance 1)

18 22 25 27 42, 43

18 0 4 7 9 24

22 4 0 3 5 20

25 7 3 0 2 17

27 9 5 2 0 15

42, 43 24 20 17 15 0
Step 3: merge 25 and 27 (distance 2)

18 22 25, 27 42, 43

18 0 4 7 24

22 4 0 3 20

25, 27 7 3 0 15

42, 43 24 20 15 0
Step 4: merge 22 with {25, 27} (distance 3)

18 22, 25, 27 42, 43

18 0 4 24

22, 25, 27 4 0 15

42, 43 24 15 0
Step 5: merge 18 with {22, 25, 27} (distance 4)

18, 22, 25, 27 42, 43

18, 22, 25, 27 0 15

42, 43 15 0
Step 6: merge {18, 22, 25, 27} with {42, 43} (distance 15); one cluster remains

18, 22, 25, 27, 42, 43

18, 22, 25, 27, 42, 43 0
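
The merge sequence above can be reproduced with SciPy's single-linkage implementation (a sketch; the points are the 1-D values from the example):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[18], [22], [25], [27], [42], [43]], dtype=float)
Z = linkage(points, method="single")
print(Z)
# Merge order and distances match the steps above:
# 42 with 43 at distance 1, 25 with 27 at 2, 22 with {25, 27} at 3,
# 18 with {22, 25, 27} at 4, and the two remaining clusters at 15.
```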


Exit criteria
• Can work with a pre-determined number of clusters
• Can set a threshold on the dissimilarity of clusters:
  • In HAC, stop merging once the distance between the nearest clusters
    exceeds the threshold
  • In HDC, stop splitting once the distances between members of a
    cluster fall below the threshold
• Can be decided from the dendrogram
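
With SciPy, both exit criteria correspond to how the linkage matrix is cut; a brief sketch reusing the points from the solved example (the threshold value is illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[18], [22], [25], [27], [42], [43]], dtype=float)
Z = linkage(points, method="single")

# Exit with a pre-determined number of clusters:
labels_k = fcluster(Z, t=2, criterion="maxclust")    # e.g. [1 1 1 1 2 2]

# Exit with a dissimilarity threshold: stop merging once the nearest
# clusters are farther apart than t.
labels_t = fcluster(Z, t=10, criterion="distance")   # same two clusters here
```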
Advantages
• Easy to implement and understand
• No prior information about the number of clusters is required
• Outliers can be detected with the help of the dendrogram
• Deterministic and predictable
Disadvantages
• Not suitable for large datasets
• Difficulty handling clusters of different sizes
• Sensitive to outliers and noise in the dataset
Challenge with divisive methods
• How to partition a large cluster into several smaller ones?
• For example, there are 2^(n−1) − 1 possible ways to partition a set of n
  objects into two non-empty, exclusive subsets.
• When n is large, it is computationally prohibitive to examine all
possibilities.
• A divisive method typically uses heuristics in partitioning, which can
lead to inaccurate results. For the sake of efficiency, divisive methods
typically do not backtrack on partitioning decisions that have been made.
Once a cluster is partitioned, any alternative partitioning of this cluster
will not be considered again. Due to the challenges in divisive methods,
there are many more agglomerative methods than divisive methods.
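
The count 2^(n−1) − 1 is easy to verify by brute force for small n; a quick sketch (exhaustive enumeration is only feasible for tiny n, which is exactly the point):

```python
from itertools import combinations

def count_bisections(n):
    # Enumerate every split of {0, ..., n-1} into two non-empty subsets.
    items = set(range(n))
    splits = set()
    for r in range(1, n):
        for subset in combinations(sorted(items), r):
            rest = tuple(sorted(items - set(subset)))
            splits.add(frozenset([subset, rest]))
    return len(splits)

for n in range(2, 8):
    print(n, count_bisections(n), 2 ** (n - 1) - 1)   # the two counts agree
```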
