Agglomerative Clustering

The document provides an overview of hierarchical clustering, a method for organizing data into clusters with high intra-cluster similarity and low inter-cluster similarity. It discusses the principles of clustering, distance measures, and the advantages and disadvantages of hierarchical versus partitional clustering methods. Additionally, it highlights the importance of determining the appropriate number of clusters and introduces techniques like elbow finding and cross-validation for this purpose.


10601 Machine Learning
Hierarchical clustering
Reading: Bishop 9-9.2


Second half: Overview
• Clustering
- Hierarchical, semi-supervised learning
• Graphical models
- Bayesian networks, HMMs, Reasoning under uncertainty
• Putting it together
- Model / feature selection, Boosting, dimensionality reduction
• Advanced classification
- SVM
What is Clustering?
• Organizing data into clusters such that there is
  - high intra-cluster similarity
  - low inter-cluster similarity
• Informally: finding natural groupings among objects
• Why do we want to do that?
• Any REAL application?
Example: clusty (a web search engine that clustered its search results)
Example: clustering genes
• Microarrays measure the activities of all genes in different conditions
• Clustering genes can help determine new functions for unknown genes
• An early "killer application" in this area
  - The most cited (12,309) paper in PNAS!
Unsupervised learning
• Clustering methods are unsupervised learning techniques
  - We do not have a teacher that provides examples with their labels
• We will also discuss dimensionality reduction, another unsupervised learning method, later in the course
Outline
• Distance functions
• Hierarchical clustering
• Number of clusters
What is Similarity?
"The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary)

Similarity is hard to define, but "we know it when we see it."

The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2).

gene1, gene2  ->  [black box]  ->  D(gene1, gene2)      (example outputs: 0.23, 3, 342.7)

Inside these black boxes: some function on two variables (might be simple or very complex). For example, edit distance:

d('', '') = 0
d(s, '') = d('', s) = |s|    -- i.e. the length of s
d(s1+ch1, s2+ch2) = min( d(s1, s2) + (if ch1 = ch2 then 0 else 1),
                         d(s1+ch1, s2) + 1,
                         d(s1, s2+ch2) + 1 )

A few examples:
• Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
• Correlation coefficient: $s(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y}$
  - A similarity rather than a distance
  - Can determine similar trends
Outline
• Distance measure
• Hierarchical clustering
• Number of clusters
Desirable Properties of a Clustering Algorithm
• Scalability (in terms of both time and space)
• Ability to deal with different data types
• Minimal requirements for domain knowledge to determine input parameters
• Interpretability and usability
Optional:
- Incorporation of user-specified constraints
Two Types of Clustering
• Partitional algorithms: Construct various partitions and then evaluate them by some criterion
• Hierarchical algorithms: Create a hierarchical decomposition of the set of objects using some criterion (focus of this class)
[Figure: hierarchical clustering (bottom up or top down) vs. partitional clustering (top down)]
(How-to) Hierarchical Clustering
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

The number of dendrograms with n leaves = (2n - 3)! / [2^(n-2) (n - 2)!]

Number of Leaves    Number of Possible Dendrograms
2                   1
3                   3
4                   15
5                   105
...                 ...
10                  34,459,425
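As a quick check of the table above, this small snippet (with a hypothetical helper name) evaluates the formula (2n - 3)! / [2^(n-2) (n - 2)!] directly:

```python
from math import factorial

def num_dendrograms(n):
    # Number of possible dendrograms with n labeled leaves.
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

for n in (2, 3, 4, 5, 10):
    print(n, num_dendrograms(n))
# 2 1
# 3 3
# 4 15
# 5 105
# 10 34459425
```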
We begin with a distance matrix which contains the distances between every pair of objects in our database.

    0   8   8   7   7
        0   2   4   4
            0   3   3
                0   1
                    0

(For the two pairs of objects pictured on the slide, D( , ) = 8 and D( , ) = 1.)
Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

At each step: consider all possible merges and choose the best one. (The slides repeat this consider-all-merges / choose-the-best step on the example data until all clusters are fused.)

But how do we compute distances between clusters rather than between individual objects?
Computing distance between clusters: Single Link
• cluster distance = distance of the two closest members, one from each cluster
  - potentially long and skinny clusters
Computing distance between clusters: Complete Link
• cluster distance = distance of the two farthest members
  + tight clusters
Computing distance between clusters: Average Link
• cluster distance = average distance over all cross-cluster pairs
  + the most widely used measure
  + robust against noise
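A minimal sketch of these three linkage criteria, assuming NumPy and a precomputed pairwise distance matrix D; the function name and signature are illustrative, not from the slides.

```python
import numpy as np

def cluster_distance(D, cluster_a, cluster_b, linkage="single"):
    """Distance between two clusters, given the full pairwise distance
    matrix D and the clusters as lists of point indices."""
    pair_dists = D[np.ix_(cluster_a, cluster_b)]   # all cross-cluster pairs
    if linkage == "single":      # closest pair -> long, skinny clusters
        return pair_dists.min()
    if linkage == "complete":    # farthest pair -> tight clusters
        return pair_dists.max()
    if linkage == "average":     # mean over all pairs -> robust to noise
        return pair_dists.mean()
    raise ValueError(linkage)

# Using the 5-object distance matrix from the single-link example that follows
# (objects 1..5 are indices 0..4 here):
D = np.array([[ 0,  2,  6, 10,  9],
              [ 2,  0,  3,  9,  8],
              [ 6,  3,  0,  7,  5],
              [10,  9,  7,  0,  4],
              [ 9,  8,  5,  4,  0]], dtype=float)
print(cluster_distance(D, [0, 1], [2], "single"))   # 3.0, i.e. d((1,2),3) = 3
```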
Example: single link

     1   2   3   4   5
1    0
2    2   0
3    6   3   0
4   10   9   7   0
5    9   8   5   4   0

[Dendrogram under construction: leaves 1-5, no merges yet]
Example: single link

After merging 1 and 2:

         (1,2)   3   4   5
(1,2)      0
3          3     0
4          9     7   0
5          8     5   4   0

d((1,2),3) = min{d(1,3), d(2,3)} = min{6, 3} = 3
d((1,2),4) = min{d(1,4), d(2,4)} = min{10, 9} = 9
d((1,2),5) = min{d(1,5), d(2,5)} = min{9, 8} = 8

[Dendrogram: 1 and 2 joined at height 2]
Example: single link

After merging (1,2) with 3:

           (1,2,3)   4   5
(1,2,3)       0
4             7      0
5             5      4   0

d((1,2,3),4) = min{d((1,2),4), d(3,4)} = min{9, 7} = 7
d((1,2,3),5) = min{d((1,2),5), d(3,5)} = min{8, 5} = 5

[Dendrogram: 1 and 2 joined at height 2, 3 joined at height 3]
Example: single link

Next, 4 and 5 are merged (their distance, 4, is now the smallest), and finally the two remaining clusters are fused:

d((1,2,3),(4,5)) = min{d((1,2,3),4), d((1,2,3),5)} = 5

[Dendrogram: 1 and 2 joined at height 2, 3 at height 3, 4 and 5 at height 4, and the final merge at height 5]
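Assuming SciPy is available, the same worked example can be reproduced with scipy.cluster.hierarchy.linkage; this sketch is not part of the original slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Distance matrix from the worked example (objects 1..5).
D = np.array([[ 0,  2,  6, 10,  9],
              [ 2,  0,  3,  9,  8],
              [ 6,  3,  0,  7,  5],
              [10,  9,  7,  0,  4],
              [ 9,  8,  5,  4,  0]], dtype=float)

# linkage() expects a condensed (upper-triangular) distance matrix.
Z = linkage(squareform(D), method="single")
print(Z)
# Each row of Z records one merge: [cluster_i, cluster_j, distance, size].
# The merges happen at heights 2 ({1,2}), 3 ({1,2,3}), 4 ({4,5}) and 5 (all),
# matching the hand computation above.
```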
Single linkage vs. average linkage
[Figure: two dendrograms over the same 30 objects, one built with single linkage and one with average linkage; height represents the distance between objects / clusters]
Summary of Hierarchical Clustering Methods
• No need to specify the number of clusters in advance.
• The hierarchical structure maps nicely onto human intuition for some domains.
• They do not scale well: time complexity of at least O(n²), where n is the total number of objects.
• Like any heuristic search algorithm, local optima are a problem.
• Interpretation of results is (very) subjective.
But what are the clusters?
In some cases we can determine the "correct" number of clusters; however, things are rarely this clear cut, unfortunately.

One potential use of a dendrogram is to detect outliers: a single isolated branch is suggestive of a data point that is very different from all the others.
[Figure: dendrogram in which one isolated branch is labeled "Outlier"]
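To make the outlier idea concrete, here is a SciPy sketch on a hypothetical data set with one far-away point; cutting the dendrogram just below its final merge and looking for singleton clusters is one simple heuristic, not the slides' prescribed method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),   # one tight cluster
               [[8.0, 8.0]]])                         # one isolated point

Z = linkage(pdist(X), method="single")

# Cut the dendrogram just below the very last merge; any cluster of size 1
# that remains corresponds to an isolated branch joining only at the top.
labels = fcluster(Z, t=Z[-1, 2] - 1e-9, criterion="distance")
sizes = np.bincount(labels)
outliers = np.where(sizes[labels] == 1)[0]
print(outliers)   # -> [20], the isolated point
```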
Example: clustering genes
• Microarrays measure the activities of all genes in different conditions
• Clustering genes can help determine new functions for unknown genes
Partitional Clustering
• Nonhierarchical: each instance is placed in exactly one of K non-overlapping clusters.
• Since the output is only a single set of clusters, the user has to specify the desired number of clusters K.
K-means Clustering: Finished!
Re-assign points and move centers until no objects change membership.
[Figure: scatter plot of expression in condition 1 (x-axis) vs. expression in condition 2 (y-axis) with the converged cluster centers k1, k2, k3]
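For contrast with the hierarchical approach, here is a brief scikit-learn sketch of partitional (K-means) clustering on hypothetical 2-D "expression" data; the blob locations and the choice K = 3 are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical "expression in condition 1 / condition 2" data: three blobs.
X = np.vstack([rng.normal(loc, 0.3, size=(30, 2))
               for loc in ([1.0, 1.0], [4.0, 1.0], [2.5, 4.0])])

# The user must choose K up front; here we ask for exactly K = 3 clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the three converged centers (k1, k2, k3)
print(km.labels_[:10])       # each point belongs to exactly one cluster
```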
Gaussian mixture clustering
Clustering methods: Comparison

                    Hierarchical               K-means                   GMM
Running time        naively, O(N³)             fastest (each             fast (each
                                               iteration is linear)      iteration is linear)
Assumptions         requires a similarity /    strong assumptions        strongest assumptions
                    distance measure
Input parameters    none                       K (number of clusters)    K (number of clusters)
Clusters            subjective (only a         exactly K clusters        exactly K clusters
                    tree is returned)
Outline
• Distance measure
• Hierarchical clustering
• Number of clusters
How can we tell the right number of clusters?

In general, this is an unsolved problem. However, there are many approximate methods. In the next few slides we will see an example.
[Figure: scatter plot of the example data on a 1-10 by 1-10 grid]
When k = 1, the objective function is 873.0.
When k = 2, the objective function is 173.1.
When k = 3, the objective function is 133.6.
[Figure: the example data clustered with k = 1, 2, and 3]
We can plot the objective function values for k = 1 to 6. The abrupt change at k = 2 is highly suggestive of two clusters in the data. This technique for determining the number of clusters is known as "knee finding" or "elbow finding".
[Figure: objective function value (roughly 1.0E+03 down to 1.0E+02) plotted against k = 1 to 6, with a sharp drop (the elbow) at k = 2]
Note that the results are not always as clear cut as in this toy example
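A sketch of elbow finding with scikit-learn's KMeans, on hypothetical data containing two well-separated blobs; inertia_ plays the role of the objective function plotted above.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical data: two well-separated blobs.
X = np.vstack([rng.normal([0.0, 0.0], 0.5, size=(50, 2)),
               rng.normal([6.0, 6.0], 0.5, size=(50, 2))])

for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the K-means objective: sum of squared distances to centers.
    print(k, round(km.inertia_, 1))
# The objective drops sharply from k = 1 to k = 2 and only slowly afterwards;
# the elbow at k = 2 suggests two clusters.
```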
Cross validation
• We can also use cross validation to determine the correct number of clusters.
• Recall that a GMM is a generative model: we can compute the likelihood of the held-out data to determine which model (number of clusters) is more accurate:
$p(x_1 \ldots x_n \mid \Theta) = \prod_{j=1}^{n} \left[ \sum_{i=1}^{k} p(x_j \mid C = i)\, w_i \right]$

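A sketch of this idea using scikit-learn's GaussianMixture on hypothetical data drawn from two Gaussians; the train / held-out split and the range of k are illustrative choices, not from the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
# Hypothetical data generated from two Gaussian components.
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(200, 2)),
               rng.normal([5.0, 5.0], 1.0, size=(200, 2))])

X_train, X_held_out = train_test_split(X, test_size=0.3, random_state=0)

for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X_train)
    # score() is the average held-out log-likelihood log p(x | theta)
    # under the fitted mixture of k components.
    print(k, round(gmm.score(X_held_out), 3))
# The held-out likelihood typically peaks near the true number of components
# (here k = 2) and flattens or degrades as k grows further.
```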
What you should know
• Why is clustering useful
• What are the different types of clustering
algorithms
• What are the assumptions we are making
for each, and what can we get from them
• Unsolved issues: number of clusters,
initialization, etc.
