L07 - Advanced Analytical Theory and Methods - Clustering
Madava Viranjan
What is Clustering?
• Clustering is the process of grouping a set of data objects into multiple
groups, or clusters, so that objects within a cluster are highly similar to each other
but very dissimilar to objects in other clusters.
• Mostly used in
– Business intelligence
– Image pattern recognition
– Web search
– Biology
– Security
– Etc.
Clustering is a form of unsupervised learning,
where class labels are not present
Clustering Methods
• Partitioning Methods
– Divide the data set into partitions so that each partition contains at least one object.
Most methods are distance-based
• Hierarchical Methods
– Create a hierarchy, either bottom-up or top-down
Partitioning Methods
K-Means
• How does it work?
– Arbitrarily select ‘k’ objects from the data set ‘D’ as the initial cluster centers.
– (Re)assign each object to the cluster whose center it is most similar to.
– Recompute the mean of each cluster and iterate until the assignments no longer change.
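The K-Means steps (pick 'k' initial centers, assign each object to the nearest center, recompute the means, repeat until nothing changes) can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `k_means`, the parameters `max_iter` and `seed`, the use of Euclidean distance as the similarity measure, and the sample data are all assumptions made for the example.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` (tuples of floats) into k groups; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: arbitrarily pick k objects as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: (re)assign each object to the nearest center (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Stop when no assignment changed since the previous pass.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels

# Hypothetical sample data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(data, k=2)
print(labels)
```

On data this well separated, the first three points should share one label and the last three the other. Which group gets label 0 depends on the random initial centers.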
Partitioning Methods
K-Medoid
• How does it work?
– Select the initial representative objects randomly
– Assign each object to the cluster with the nearest representative object
– Randomly select a non-representative object, Orandom
– Compute the total cost of swapping a representative object with Orandom.
If the swap lowers the total cost, perform it; objects may then be reassigned to a
different cluster
• When a representative object is replaced, each non-representative object either
stays in its current cluster or is reassigned to the cluster of whichever representative
object is now nearest
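The K-Medoid swap procedure can be sketched as follows. This is a simplified, randomized variant written for illustration only: it tries one random swap per iteration instead of evaluating every possible swap, and the names `k_medoids`, `total_cost`, `max_tries`, and `seed`, along with the sample data, are assumptions rather than anything taken from the slides.

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from every object to its nearest representative object."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def k_medoids(points, k, max_tries=200, seed=0):
    """Randomized swap-based K-Medoid sketch; returns (medoids, labels)."""
    rng = random.Random(seed)
    # Select the initial representative objects randomly.
    medoids = rng.sample(points, k)
    cost = total_cost(points, medoids)
    for _ in range(max_tries):
        non_medoids = [p for p in points if p not in medoids]
        if not non_medoids:
            break
        # Randomly select a non-representative object and a representative to replace.
        o_random = rng.choice(non_medoids)
        j = rng.randrange(k)
        candidate = medoids[:j] + [o_random] + medoids[j + 1:]
        # Keep the swap only if it lowers the total cost.
        new_cost = total_cost(points, candidate)
        if new_cost < cost:
            medoids, cost = candidate, new_cost
    # Assign each object to the cluster of its nearest representative object.
    labels = [min(range(k), key=lambda c: math.dist(p, medoids[c])) for p in points]
    return medoids, labels

# Hypothetical sample data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
medoids, labels = k_medoids(data, k=2)
print(medoids, labels)
```

Unlike K-Means centers, the representatives returned here are always actual objects from the data set, which makes the method less sensitive to outliers.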
Hierarchical Clustering
• Group data objects to form a hierarchy or tree of clusters
• The top cluster contains all the data points, while the bottom layer contains
singleton objects
• The agglomerative (bottom-up) method places each object into a cluster of its own.
Clusters are then merged step by step according to some criterion
– Clusters ‘C1’ and ‘C2’ are merged if an object in ‘C1’ and an object in ‘C2’ have the
minimum Euclidean distance among all pairs of objects from different clusters.
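Bottom-up merging with the minimum-Euclidean-distance criterion can be sketched as below. This is a small, unoptimized illustration: the function names `single_link` and `agglomerative`, the stopping parameter `n_clusters`, and the sample data are assumptions made for the example.

```python
import math

def single_link(c1, c2):
    """Minimum Euclidean distance between any object in c1 and any object in c2."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    # Start with every object in a cluster of its own.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = [
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        # Pick the pair of clusters with the smallest single-link distance.
        i, j = min(pairs, key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected by the deletion
    return clusters

# Hypothetical sample data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
result = agglomerative(data, n_clusters=2)
print(result)
```

Running the loop down to `n_clusters=1` and recording each merge would give the full hierarchy; stopping early simply cuts the tree at the chosen level.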
Hierarchical Clustering
Divisive method
• Uses a top-down approach
• All the objects form one initial cluster, which is then split according to
some principle.
– For example, maximum Euclidean distance
Hierarchical Clustering
Distance Measurement
• Minimum distance
– Minimum of all pairwise distances between objects in Cluster1 and objects in Cluster2
• Mean distance
– Average of the distances between objects in Cluster1 and objects in Cluster2
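The two measures can be computed directly from their descriptions. In this sketch the function names and the sample clusters are hypothetical, and Euclidean distance is assumed. The second function follows the wording here (the average over all cross-cluster pairs); note that some textbooks use "mean distance" for the distance between the two cluster means instead.

```python
import math

def minimum_distance(c1, c2):
    """Smallest pairwise distance between an object in c1 and an object in c2."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def average_distance(c1, c2):
    """Average of all pairwise distances between objects in c1 and objects in c2."""
    total = sum(math.dist(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))

# Hypothetical clusters on a line, so the distances are easy to check by hand:
# the four cross-cluster distances are 3, 5, 2 and 4.
cluster1 = [(0.0, 0.0), (1.0, 0.0)]
cluster2 = [(3.0, 0.0), (5.0, 0.0)]

print(minimum_distance(cluster1, cluster2))  # 2.0
print(average_distance(cluster1, cluster2))  # 3.5
```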
Density-based Clustering
• Partitioning and hierarchical methods tend to find spherical-shaped clusters
• Density reachable
– ‘p’ is density reachable from ‘q’ if there is a chain of objects ‘p1, …, pn’
such that ‘p1 = q’, ‘pn = p’, and ‘pi+1’ is directly density reachable from
‘pi’ for each i.
Density-based Clustering
DBSCAN
[Figure: DBSCAN example with MinPts = 3]
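A compact sketch of DBSCAN with MinPts = 3 is shown below. It is illustrative only: the function names `region_query` and `dbscan`, the `NOISE` marker, the radius value `eps=1.5`, and the sample data are all assumptions, and the neighbourhood search is a plain linear scan rather than an indexed lookup.

```python
import math

NOISE = -1  # label for objects that belong to no cluster

def region_query(points, i, eps):
    """Indices of all objects within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per object; NOISE marks outliers."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border object
            continue
        # points[i] is a core object: start a new cluster and expand it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border object: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core object
    return labels

# Hypothetical sample data: two dense groups plus one isolated object.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
        (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
        (20.0, 20.0)]
labels = dbscan(data, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Every object reached through the expansion queue is density reachable from the starting core object, which is how the chain definition turns into a cluster.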
Density-based Clustering
OPTICS
• It is difficult to define the density parameters in advance