Unit 4 Clustering
Clustering
• Partition unlabeled examples into disjoint subsets (clusters), such that:
– Objects within a cluster are very similar
– Objects in different clusters are very different
Desirable Properties
• Find Structure
• Scalable
• Deal with different types of attributes
• Discover clusters with arbitrary shape
• Minimal domain knowledge
• Robust to noise
• Handle high dimensionality; produce interpretable and usable results
• Clustering quality
– Inter-cluster distances maximized
– Intra-cluster distances minimized
Clustering Algorithms
• Exclusive Clustering
• Overlapping clustering
• Hierarchical clustering
Partitioning Algorithms
• Given k, construct a partition of m objects x_1, …, x_m, where each x_i is a vector in a real-valued space R^n and n is the number of attributes, into a set of k clusters C_1, …, C_k.
• The cluster mean μ_j serves as a prototype of the cluster C_j.
• Find the k clusters that optimize a chosen criterion
– E.g., the within-cluster sum of squares (WCSS): the sum of squared distances from each point in a cluster to its cluster mean
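Written out, the WCSS criterion is:

WCSS = Σ_{j=1}^{k} Σ_{x ∈ C_j} ‖x − μ_j‖²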
1. K-means algorithm
Given k
1. Randomly choose k data points to be the initial cluster
centres
2. Assign each data point to the closest cluster centre
3. Re-compute the cluster centres using the current
cluster memberships.
4. If a convergence criterion is not met, go to 2.
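A minimal NumPy sketch of these four steps, assuming Euclidean distance and stopping when the centres barely move; the function name and defaults are illustrative:

```python
import numpy as np

def k_means(X, k, max_iter=100, tol=1e-6, seed=0):
    """Naive k-means on an (n, d) array X; returns (centres, labels)."""
    rng = np.random.default_rng(seed)
    # 1. Randomly choose k data points as the initial cluster centres.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Assign each data point to the closest cluster centre.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Re-compute the centres from the current cluster memberships
        #    (an empty cluster keeps its old centre).
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        # 4. Stop once the centres move less than tol; otherwise repeat step 2.
        if np.linalg.norm(new_centres - centres) < tol:
            centres = new_centres
            break
        centres = new_centres
    return centres, labels
```

For example, k_means(X, k=3) on a 2-D point set returns three centres plus one cluster label per point.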
Stopping / Convergence criterion
• Common choices:
– No (or only minimal) change in cluster assignments between iterations
– Total movement of the cluster centres falls below a small threshold
– A fixed maximum number of iterations is reached
K-means illustrated
Similarity / Distance measures
• Distance metric (scale-dependent)
– Minkowski family of distance measures
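The Minkowski distance of order p between two n-dimensional points x and y (p = 1 gives the Manhattan distance, p = 2 the Euclidean distance) is:

d_p(x, y) = ( Σ_{i=1}^{n} |x_i − y_i|^p )^{1/p}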
Similarity / Distance measures
• Correlation coefficients (scale-invariant)
• Mahalanobis distance
• Pearson correlation
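For reference, the standard definitions, where S is the sample covariance matrix and x̄, ȳ are the means of x and y:

d_M(x, y) = √( (x − y)ᵀ S⁻¹ (x − y) )

r = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / ( √(Σ_{i}(x_i − x̄)²) · √(Σ_{i}(y_i − ȳ)²) )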
Time Complexity
• Assume computing distance between two instances
is O(m) where m is the dimensionality of the vectors.
• Computing centroids: each instance vector gets added once to some centroid: O(nm), where n is the number of samples.
• Reassigning clusters: O(kn) distance computations,
or O(knm).
• Assume these two steps are each done once for I
iterations: O(Iknm).
Advantages & Disadvantages
• Fast, robust, and easy to understand.
• Relatively efficient: O(Iknm).
• Gives the best results when the clusters are distinct or well separated from each other.
• Disadvantages: k must be specified in advance; results are sensitive to the initial choice of centres and to outliers; tends to find only compact, roughly spherical clusters.
2. Fuzzy C-Means Clustering
One data point may belong to two or more clusters, with different degrees of membership.
Objective function:
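In the standard formulation, with fuzzifier m > 1, membership u_ij of point x_i in cluster j, and cluster centre c_j:

J_m = Σ_{i=1}^{N} Σ_{j=1}^{c} u_ij^m ‖x_i − c_j‖², subject to Σ_{j=1}^{c} u_ij = 1 for each i.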
Example: One-Dimensional Data
Fuzzy c-means
• Advantages:
– Allows a data point to be in multiple clusters
– A more natural representation of the behaviour of genes, since genes are usually involved in multiple functions
• Limitations:
– Need to define c (the number of clusters, like k in K-means)
– Need to determine a membership cutoff value
– Clusters are sensitive to the initial assignment of centroids, so fuzzy c-means is not a deterministic algorithm
3. Hierarchical Clustering
• Build a tree-based hierarchical taxonomy
(dendrogram) from a set of unlabeled examples.
(Example dendrogram: animal splits into vertebrate and invertebrate)
Agglomerative vs. Divisive Clustering
• Agglomerative (bottom-up) methods start with each example in its own cluster and iteratively combine them to form larger and larger clusters.
• Divisive (top-down) methods start with all examples in a single cluster and recursively split clusters into smaller ones.
Hierarchical Agglomerative Clustering (HAC)
Hierarchical Clustering Algorithm
Start with each instance in its own cluster.
Until there is only one cluster:
• Among the current clusters, determine the two clusters, c_i and c_j, that are most similar.
• Replace c_i and c_j with a single cluster c_i ∪ c_j.
Cluster Similarity
• Assume a similarity function that determines the
similarity of two instances: sim(x,y).
– Cosine similarity of document vectors.
• How to compute similarity of two clusters each
possibly containing multiple instances?
– Single Link: Similarity of two most similar members.
– Complete Link: Similarity of two least similar members.
– Average link: Average similarity between members.
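A naive sketch combining the algorithm above with these three linkage options; similarity here is taken as negative Euclidean distance, and all names are illustrative:

```python
import numpy as np

def hac(X, linkage="single"):
    """Naive agglomerative clustering on an (n, d) array X.
    Returns the sequence of merges as (cluster_a, cluster_b, similarity)."""
    # Start with each instance in its own cluster.
    clusters = {i: [i] for i in range(len(X))}

    # Pairwise instance similarity: negative Euclidean distance (illustrative).
    def sim(p, q):
        return -np.linalg.norm(X[p] - X[q])

    merges = []
    while len(clusters) > 1:
        best, best_pair = -np.inf, None
        ids = list(clusters)
        # Among the current clusters, find the two most similar ones.
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                sims = [sim(p, q)
                        for p in clusters[ids[a]] for q in clusters[ids[b]]]
                if linkage == "single":      # two most similar members
                    s = max(sims)
                elif linkage == "complete":  # two least similar members
                    s = min(sims)
                else:                        # "average": mean over all pairs
                    s = sum(sims) / len(sims)
                if s > best:
                    best, best_pair = s, (ids[a], ids[b])
        i, j = best_pair
        # Replace c_i and c_j with the single cluster c_i ∪ c_j.
        clusters[i].extend(clusters.pop(j))
        merges.append((i, j, best))
    return merges
```

This version recomputes pairwise similarities on every merge; the update rules a few slides below avoid that recomputation.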
Single Link Agglomerative Clustering
Single Link Example
Complete Link Agglomerative Clustering
Complete Link Example
Computational Complexity
• In the first iteration, compute the similarity of all pairs of the n individual instances: O(n²).
• In each of the subsequent n − 2 merging iterations, compute the similarity between the most recently created cluster and all other existing clusters.
• Overall performance is O(n²), varying slightly with how cluster similarity is computed.
Computing Cluster Similarity
• After merging ci and cj, the similarity of the
resulting cluster to any other cluster, ck, can be
computed by:
– Single Link:
sim((c_i ∪ c_j), c_k) = max(sim(c_i, c_k), sim(c_j, c_k))
– Complete Link:
sim((c_i ∪ c_j), c_k) = min(sim(c_i, c_k), sim(c_j, c_k))
Average Link Agglomerative Clustering
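With average link, the similarity of two clusters is the mean pairwise similarity over all cross-cluster pairs:

sim(c_a, c_b) = (1 / (|c_a| · |c_b|)) Σ_{x ∈ c_a} Σ_{y ∈ c_b} sim(x, y)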
Clustering Applications
• Biology: classification of the plant and animal kingdoms given their features
• Marketing: customer segmentation based on a database of customer data containing their properties and past buying records
• Web: mining weblog data to discover groups of similar access patterns
• Social networks: recognizing communities in social networks
Assignment 6
1. Explain various clustering techniques.
2. Identify one real-world application where clustering is applied. Describe the application in detail, justify why clustering is applied there, and discuss the clustering method used, in your own words.
Thank You.