
Cluster Analysis I

Presidency University

September, 2016
Classification

- The aim of any multivariate study is to learn about the nature of some object.

- One way of learning about that nature is to find out which class the object belongs to.

- For example, a candidate is asked many questions in an examination in order to classify him as either pass or fail.

- Hence classifying an object is one of the most important issues in multivariate statistics.
Types of classification

Classification is of two types:

- Supervised
- Unsupervised
Supervised vs Unsupervised Learning

- A kid is trying to learn the alphabet.

- The English letters are only pictures to him.

- The teacher tells him the name of each letter.

- This is supervised classification.


Supervised vs Unsupervised Learning

- Suppose the same kid grows up and visits China for the first time.

- At some place he sees some Chinese writing (shown as an image on the slide).

- Even now, he cannot read the characters.

- But he can at least identify that some of them are repeated.

- He can pick out the unique ones.

- This is unsupervised classification.


Supervised vs Unsupervised Learning

- Given data x1, x2, ..., xn with labels (i.e. group tags) y1, y2, ..., yn, we learn to predict the label associated with a new input xnew.

(The slide shows labelled pictures, e.g. "Chair" and "Duck", and a new picture captioned "What is this?")
Supervised vs Unsupervised Learning

- Given only x1, x2, ..., xn, we try to infer some underlying structure (i.e. find similarities and identify groups).

(The slide shows a collection of unlabelled items captioned "Group these items according to common features".)
Supervised vs Unsupervised Learning

- Discriminant Analysis is supervised classification.

- Clustering is unsupervised classification.


What is clustering? And why?

- Clustering is the task of dividing data into groups (clusters), so that points in any one group are more similar to each other than to points outside the group.

Two main uses are:

- Summary: deriving a reduced representation of the full data set.

- Discovery: looking for new insights into the structure of the data, e.g. finding groups of students that commit similar mistakes, or groups of 80's songs that sound alike.
Example

- Let's have a look at a satellite image (shown on the slide).

- The image shows a blue ocean and two land masses with green vegetation.

- The human eye can detect large regions of similar colour in an image.

- Our brain is performing a simple clustering:

- It groups the pixels into two clusters, land and sea (a toy computational sketch follows below).
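As a toy illustration of this idea (not from the slides), the sketch below groups synthetic pixel colours into two clusters with hierarchical clustering; the colour values and the use of SciPy are assumptions made for the example.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Synthetic stand-in for a satellite image: RGB values for "sea" and
    # "land" pixels (invented for illustration; a real image would be
    # loaded and reshaped into an (n_pixels, 3) array).
    rng = np.random.default_rng(1)
    sea = rng.normal(loc=[0.1, 0.2, 0.8], scale=0.05, size=(50, 3))
    land = rng.normal(loc=[0.2, 0.7, 0.2], scale=0.05, size=(50, 3))
    pixels = np.vstack([sea, land])

    # Group the pixel colours into two clusters, mimicking what the eye does.
    Z = linkage(pixels, method="average")
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(np.bincount(labels)[1:])   # roughly 50 pixels in each cluster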


Proximity Measures

- Cluster analysis is the art of grouping similar items.

- How do we define the similarity of items?

- Usually we define a dissimilarity measure dij or a similarity measure sij between xi and xj.

- Generally dij is taken to be a distance measure, e.g.

  dij = ||xi − xj||2

  (a small computational sketch follows below).

- This holds good if we are working with numerical variables.

- For categorical data, we will discuss suitable proximity measures later.
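A minimal computational sketch of this dissimilarity (the five toy points are invented for the example):

    import numpy as np

    # Toy data: five points in two dimensions (invented for illustration).
    X = np.array([[0.0, 0.0],
                  [0.1, 0.2],
                  [3.0, 3.1],
                  [3.2, 2.9],
                  [6.0, 0.5]])

    # Pairwise Euclidean dissimilarities dij = ||xi - xj||2, computed by
    # broadcasting the difference of every pair of rows.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))

    print(np.round(D, 2))   # symmetric n x n matrix with zeros on the diagonal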
Types of Clustering

Clustering splits into two branches:

- Hierarchical Clustering, which is either Agglomerative or Divisive
- Non-Hierarchical Clustering
Hierarchical Clustering

Consider four shapes: a sphere, a cone, a small cube and a large cube (pictured on the slide). Suppose we want to group the similar objects together.
Hierarchical Clustering

- We might put the two cubes in one cluster, and the sphere and the cone in the other, as both of the latter have round surfaces.

- So the two clusters are now (sphere, cone) and (small cube, large cube).
Hierarchical Clustering

- Suppose we further split these clusters.

- Then we shall split the (sphere, cone) cluster into the (sphere) and (cone) clusters.
Hierarchical Clustering

- If we are to increase the number of clusters further, we will split the second cluster.
Hierarchical Clustering

This step-by-step clustering process can be expressed using a tree diagram (shown on the slide).
Agglomerative vs divisive

Agglomerative (i.e., bottom-up):

- Start with every point in its own group.

- Until there is only one cluster, repeatedly merge the two groups that have the smallest dissimilarity.

Divisive (i.e., top-down):

- Start with all points in one cluster.

- Until every point is in its own cluster, repeatedly split a group into the two parts that have the biggest dissimilarity.
Simple Example

The agglomerative process on seven points (pictured on the slides) proceeds step by step:

Step 1: {1}, {2}, {3}, {4}, {5}, {6}, {7}
Step 2: {1}, {2, 3}, {4}, {5}, {6}, {7}
Step 3: {1, 7}, {2, 3}, {4}, {5}, {6}
Step 4: {1, 7}, {2, 3}, {4, 5}, {6}
Step 5: {1, 7}, {2, 3, 6}, {4, 5}
Step 6: {1, 7}, {2, 3, 4, 5, 6}
Step 7: {1, 2, 3, 4, 5, 6, 7}

We can represent this sequence of clustering assignments by a tree called a dendrogram.
What's a dendrogram?

- A dendrogram is a convenient graphic for displaying a hierarchical sequence of clustering assignments.

It is simply a tree where:

- Each node represents a group.

- Each leaf node is a singleton (i.e., a group containing a single data point).

- The root node is the group containing the whole data set.

- Each internal node has two daughter nodes (children), representing the two groups that were merged to form it.

- If we fix the leaf nodes at height zero, then each internal node is drawn at a height proportional to the dissimilarity between its two daughter nodes.
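As a hedged sketch of how such a dendrogram could be drawn in practice, here is SciPy's hierarchical clustering applied to invented toy data (the data and the choice of single linkage are assumptions for the example):

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Toy data: seven points in the plane (made up for illustration).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(7, 2))

    # Z encodes the hierarchical merge history (here with single linkage).
    Z = linkage(X, method="single", metric="euclidean")

    # Leaves sit at height zero; each internal node is drawn at the
    # dissimilarity at which its two daughter groups were merged.
    dendrogram(Z, labels=[str(i + 1) for i in range(len(X))])
    plt.ylabel("dissimilarity at merge")
    plt.show()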
Cutting a dendrogram

- The dendrogram represents clusterings at every possible level of granularity.

- In order to use it, we have to cut it at a particular height.

- The height at which we cut the dendrogram represents the distance beyond which we choose to call points "dissimilar"; in practice it depends on the number of clusters we want.

- Depending on the linkage we use, this cut has its own interpretation, which we will study (a small sketch of cutting a tree follows below).
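A small sketch of cutting a tree with SciPy (the toy data, the cut height 0.9 and the cluster count 3 are arbitrary choices for illustration):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Toy data and its merge history (invented for illustration).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(7, 2))
    Z = linkage(X, method="single")

    # Cut the tree at height 0.9: points sharing a label form one cluster
    # at that dissimilarity level.
    labels_by_height = fcluster(Z, t=0.9, criterion="distance")

    # Alternatively, ask directly for a fixed number of clusters, say 3.
    labels_by_count = fcluster(Z, t=3, criterion="maxclust")

    print(labels_by_height)
    print(labels_by_count)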
Linkage

- We have studied distances between two data points xi and xj.

- While performing hierarchical clustering we need to merge the clusters which are most similar.

- This requires a notion of distance between clusters, which is known as linkage.
Linkage

- We are given the data points x1, x2, ..., xn and the distance between any pair xi and xj as dij.

- At any level, a clustering assignment can be expressed by sets G = {i1, i2, ..., inG} giving the indices ir of the points belonging to that group.

- At the bottom level the groups are singletons, i.e. G = {i}.

- At the topmost level there is only one group, G = {1, 2, 3, ..., n}.

- A linkage is a function d(G, H) that takes two groups G, H and returns a dissimilarity score between them.

- Remember: the choice of linkage determines how we measure dissimilarity between groups of points.
Agglomerative Clustering

Given the linkage, the agglomerative clustering algorithm consists of the following steps:

- Start with every point in its own group.

- Until there is only one cluster, repeatedly merge the two groups G, H such that d(G, H) is smallest.

A naive computational sketch of this loop is given below.
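The following is a naive, unoptimized sketch of this loop (the function names and toy data are my own; the example linkage is the smallest cross-group distance, i.e. single linkage, defined on the next slides):

    import numpy as np

    def agglomerate(D, linkage_fn):
        """Naive agglomerative clustering on a dissimilarity matrix D.

        linkage_fn(D, G, H) returns the linkage score d(G, H) for two groups
        given as lists of point indices. Returns the merge history as a list
        of (G, H, d(G, H)) triples.
        """
        groups = [[i] for i in range(len(D))]   # every point starts in its own group
        merges = []
        while len(groups) > 1:
            # Find the pair of groups with the smallest linkage score.
            best = None
            for a in range(len(groups)):
                for b in range(a + 1, len(groups)):
                    d = linkage_fn(D, groups[a], groups[b])
                    if best is None or d < best[2]:
                        best = (a, b, d)
            a, b, d = best
            merges.append((list(groups[a]), list(groups[b]), d))
            groups[a] = groups[a] + groups[b]   # merge the two groups
            del groups[b]
        return merges

    # Example linkage: smallest cross-group distance (single linkage).
    def single_linkage(D, G, H):
        return min(D[i, j] for i in G for j in H)

    X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.1], [3.2, 2.9]])
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    for G, H, d in agglomerate(D, single_linkage):
        print(G, "+", H, "merged at", round(d, 2))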
Single Linkage

- In single linkage (i.e., nearest-neighbor linkage), the dissimilarity between G and H is the smallest dissimilarity between two points in opposite groups:

  dsingle(G, H) = min_{i ∈ G, j ∈ H} dij
Single Linkage

- The single linkage score dsingle(G, H) is the distance of the closest pair.

- Cut interpretation: if we cut the dendrogram at height 0.9, then we can say that, for each point xi, there is another point xj in its cluster with dij ≤ 0.9.
Complete Linkage

- In complete linkage (i.e., farthest-neighbor linkage), the dissimilarity between G and H is the largest dissimilarity between two points in opposite groups:

  dcomplete(G, H) = max_{i ∈ G, j ∈ H} dij
Complete Linkage

- The complete linkage score dcomplete(G, H) is the distance of the farthest pair.

- Cut interpretation: if we cut the dendrogram at height 0.9, then for each point xi, every other point xj in its cluster satisfies dij ≤ 0.9.
Average Linkage

- In average linkage, the dissimilarity between G and H is the average dissimilarity over all pairs of points in opposite groups:

  daverage(G, H) = (1 / (nG nH)) Σ_{i ∈ G, j ∈ H} dij

- The average linkage score daverage(G, H) is the average distance across all pairs.

- There is not really any good cut interpretation.
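As a sketch, the three linkages can be written as small functions of the dissimilarity matrix D (these could be plugged into the naive agglomerative loop sketched earlier; the function names and toy data are my own):

    import numpy as np

    def d_single(D, G, H):
        """Smallest dissimilarity between points in opposite groups."""
        return D[np.ix_(G, H)].min()

    def d_complete(D, G, H):
        """Largest dissimilarity between points in opposite groups."""
        return D[np.ix_(G, H)].max()

    def d_average(D, G, H):
        """Average dissimilarity over all pairs in opposite groups."""
        return D[np.ix_(G, H)].mean()

    # Tiny example on a made-up dissimilarity matrix.
    X = np.array([[0.0, 0.0], [0.1, 0.2], [3.0, 3.1], [3.2, 2.9]])
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    G, H = [0, 1], [2, 3]
    print(d_single(D, G, H), d_complete(D, G, H), d_average(D, G, H))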


Common Properties

Single, complete, and average linkage share the following properties:

- These linkages operate on the dissimilarities dij, and don't need the points x1, x2, ..., xn to lie in Euclidean space.

- Running agglomerative clustering with any of these linkages produces a dendrogram with no inversions (a quick check is sketched below).

- The second property can be restated as: the dissimilarity scores between merged clusters only increase as we run the algorithm.

- This means that we can draw a proper dendrogram, where the height of a parent node is always greater than the heights of its daughters.
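A quick empirical check of the no-inversion property on made-up data (in SciPy's encoding, column 2 of the linkage output holds the merge heights):

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 2))          # toy data, invented for illustration

    for method in ["single", "complete", "average"]:
        heights = linkage(X, method=method)[:, 2]
        # Merge heights should only increase, i.e. no inversions.
        print(method, bool(np.all(np.diff(heights) >= 0)))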
Shortcomings of single and complete linkage

Single and complete linkage can have some practical problems:

- Single linkage suffers from chaining. In order to merge two groups, we only need one pair of points to be close, irrespective of all the others. Therefore clusters can be too spread out, and not compact enough.

- Complete linkage avoids chaining, but suffers from crowding. Because its score is based on the worst-case dissimilarity between pairs, a point can end up closer to points in other clusters than to points in its own cluster. Clusters are compact, but not far enough apart.

- Average linkage tries to strike a balance. It uses the average pairwise dissimilarity, so clusters tend to be relatively compact and relatively far apart.
Shortcomings of average linkage

Average linkage isn't perfect; it has its own problems:

- It is not clear what properties the resulting clusters have when we cut an average linkage tree at a given height h. Single and complete linkage trees each had simple cut interpretations.

- The results of average linkage clustering can change under a monotone increasing transformation of the dissimilarities dij: if g is such that g(x) ≤ g(y) whenever x ≤ y, and we use the dissimilarities g(dij) instead of dij, then we may get different answers.
Centroid Linkage

- Centroid linkage is commonly used. Let x̄G, x̄H denote the group averages (centroids) of G and H. Then

  dcentroid(G, H) = d(x̄G, x̄H)


Centroid Linkage

- The centroid linkage score dcentroid(G, H) is the distance between the group centroids (i.e., the group averages).

- There is not really any good cut interpretation.
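A minimal sketch of centroid linkage (the function name is my own); note that, unlike the previous three linkages, it needs the raw points rather than just the dissimilarities:

    import numpy as np

    def d_centroid(X, G, H):
        """Distance between the group averages (centroids) of G and H."""
        return float(np.linalg.norm(X[G].mean(axis=0) - X[H].mean(axis=0)))

    # Tiny example on made-up points.
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.9], [3.0, 3.0]])
    print(round(d_centroid(X, [0, 1], [2, 3]), 2))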


Shortcomings of centroid linkage

- It can produce dendrograms with inversions, which really messes up the visualization (a tiny demonstration follows below).

- Even if we were lucky enough to have no inversions, there is still no interpretation for the clusters resulting from cutting the tree.

- The answers change with a monotone transformation of the dissimilarity measure.
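A tiny demonstration of such an inversion (the three toy points form a tall, nearly equilateral triangle, chosen so that the centroid of the first merged pair is closer to the third point than the pair were to each other):

    import numpy as np
    from scipy.cluster.hierarchy import linkage

    # Three made-up points: the closest pair is at distance 1.0, but the
    # centroid of that pair is only 0.9 away from the remaining point.
    X = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 0.9]])

    Z = linkage(X, method="centroid")
    print(Z[:, 2])   # merge heights [1.0, 0.9]: the second merge is lower
                     # than the first, i.e. an inversion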
Linkages in a nutshell

Linkage   | No inversions? | Unchanged by monotone transformations? | Cut interpretation? | Note
Single    | yes            | yes                                    | yes                 | chaining
Complete  | yes            | yes                                    | yes                 | crowding
Average   | yes            | no                                     | no                  |
Centroid  | no             | no                                     | no                  | simple

- This doesn't tell us what the best linkage is.

- Choosing a linkage can be very situation dependent.