L07 - Advanced Analytical Theory and Methods - Clustering


Data Mining - Clustering

Madava Viranjan
What is Clustering?
• Clustering is the process of grouping a set of data objects into multiple
groups, or clusters, so that objects within a cluster are highly similar to one
another but very dissimilar to objects in other clusters.

• Commonly used in
– Business intelligence
– Image pattern recognition
– Web search
– Biology
– Security
– Etc.
Clustering comes under unsupervised learning,
where class labels are not present
Clustering Methods
• Partitioning Methods
– Partition the data set into groups such that each partition contains at least one object.
Most methods are distance-based

• Hierarchical Methods
– Create hierarchy either bottom-up or top-down

• Density based Methods


– Continue growing the cluster as long as the density in the
neighborhood exceeds some threshold.

• Grid based Methods


– The object space is quantized into a finite number of cells that form a grid.
Partitioning Methods
K-Means
• K-Means is a partitioning method that distributes the objects of a data set ‘D’ into ‘K’
clusters

• Centroid based technique


• The cluster center can be defined using the mean or a medoid

• How it works?
– Arbitrarily select ‘k’ objects from ‘D’ as the initial cluster centers
– (Re)assign each object to the cluster whose center it is most similar to
– Recompute each cluster mean and iterate until assignments no longer change (see the sketch below)
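As an illustration (an addition, not from the slides), Lloyd's iteration for these three steps can be sketched in a few lines of NumPy; the function name, example data, and parameter defaults are assumptions made here:

```python
import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Arbitrarily select k objects from D as the initial centers
    centers = D[rng.choice(len(D), size=k, replace=False)]
    for _ in range(max_iter):
        # (Re)assign each object to its nearest center
        dists = np.linalg.norm(D[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each cluster mean (keep the old center if a cluster empties)
        new_centers = np.array([D[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):  # stop when nothing changes
            break
        centers = new_centers
    return labels, centers

# 1-D example data (the values used on the next slide), shaped (n, 1)
D = np.array([[1.], [2.], [3.], [8.], [9.], [10.], [25.]])
print(kmeans(D, k=2)[0])
```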
Partitioning Methods
K-Means

• In K-Means we compute the within-cluster variation, which is the sum of
squared errors (SSE): E = Σᵢ Σ_{p∈Cᵢ} dist(p, cᵢ)²

• So how does it partition 1, 2, 3, 8, 9, 10 and 25? (worked out below)
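A quick check, added here for illustration: computing the SSE for two candidate 2-way partitions of these values, following the definition above.

```python
def sse(*clusters):
    # Within-cluster variation: squared distance of each point to its cluster mean
    return sum(sum((x - sum(c) / len(c)) ** 2 for x in c) for c in clusters)

print(sse([1, 2, 3], [8, 9, 10, 25]))   # 2.0 + 194.0 = 196.0
print(sse([1, 2, 3, 8, 9, 10], [25]))   # 77.5 + 0.0  = 77.5  (lower SSE)
```

With k = 2, the SSE criterion prefers isolating the outlier 25 in its own cluster, which previews the sensitivity to outliers that motivates K-Medoid.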
Problems in K-Means
• Initial cluster centers are selected arbitrarily

• Can generate only globular (roughly spherical) clusters


Partitioning Methods
K-Medoid
• Uses an actual object (a medoid) as the representative of each cluster

• How it works?
– Initial representative objects are selected randomly
– Assign each object to the cluster with the nearest representative object
– Randomly select a non-representative object, Orandom
– Compute the total cost incurred by swapping a representative object with the
non-representative object, and decide whether the representative should be
replaced (a small sketch of this cost test follows below)
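A minimal sketch of the swap test, added for illustration and not from the slides: the total cost is the sum of distances from each object to its nearest medoid, and a swap is kept only if it lowers that cost. The data and distance function are assumptions.

```python
def total_cost(D, medoids, dist):
    # Sum of each object's distance to its nearest medoid
    return sum(min(dist(p, m) for m in medoids) for p in D)

def try_swap(D, medoids, m_old, o_random, dist):
    candidate = [o_random if m == m_old else m for m in medoids]
    # Keep the swap only if it reduces the total cost
    if total_cost(D, candidate, dist) < total_cost(D, medoids, dist):
        return candidate
    return medoids

# Usage with 1-D points and absolute difference as the distance
D = [1, 2, 3, 8, 9, 10, 25]
medoids = try_swap(D, [2, 25], 25, 9, dist=lambda a, b: abs(a - b))
print(medoids)  # [2, 9] -- swapping 25 for 9 lowers the total cost
```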
Partitioning Methods
K-Medoid
• For each non-representative object p, one of the following can happen when a
representative object oj is replaced by Orandom:
– p currently belongs to oj and is reassigned to another representative object oi
– p currently belongs to oj and is reassigned to Orandom
– p currently belongs to some other representative oi and stays with oi
– p currently belongs to some other representative oi but is reassigned to Orandom
Hierarchical Clustering
• Group data objects to form a hierarchy or tree of clusters

• The top cluster contains all the data points, while the bottom layer contains
singleton clusters

• The hierarchy can be built using two strategies


– Top-down: start with a single cluster containing all the data points and split it into
two or more clusters using some strategy
– Bottom-up: start with a cluster for each data point and merge the nearest
clusters
Hierarchical Clustering
Agglomerative method
• Uses the bottom-up approach

• Places each object into a cluster of its own. Clusters are then merged step by
step according to some criterion (a SciPy sketch follows the figure below)
– Clusters ‘C1’ and ‘C2’ can be merged if an object in ‘C1’ and an object in ‘C2’ form
the minimum Euclidean distance among all pairs of clusters
[Figure: agglomerative merging example — objects a–n labeled with pairwise Euclidean distances, merged step by step starting from the nearest pairs]
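For concreteness (an addition, not from the slides), single-link agglomerative clustering can be run with SciPy; the 2-D points here are arbitrary stand-ins:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Arbitrary 2-D points standing in for a handful of objects
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])

# 'single' linkage merges the pair of clusters whose closest members
# have the minimum Euclidean distance
Z = linkage(X, method='single', metric='euclidean')

# Cut the resulting tree into 2 flat clusters
print(fcluster(Z, t=2, criterion='maxclust'))  # e.g. [1 1 1 2 2 2]
```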
Hierarchical Clustering
Divisive method
• Uses the top-down approach

• All the objects are used to form one initial cluster, which is then split according to
some principle
– e.g., splitting at the maximum Euclidean distance
Hierarchical Clustering
Distance Measurement
• Minimum distance
– The minimum value over all pairwise distances between objects in Cluster1 and Cluster2

– Used by the nearest-neighbor clustering algorithm


Hierarchical Clustering
Distance Measurement
• Maximum distance
– The maximum value over all pairwise distances between objects in Cluster1 and Cluster2
– Used by the farthest-neighbor clustering algorithm

• Mean distance
– The average of all pairwise distances between objects in Cluster1 and Cluster2
(the three measures are written out below)
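Written out for precision (an addition; here ‖p − p′‖ denotes Euclidean distance and nᵢ, nⱼ the cluster sizes):

```latex
d_{\min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} \lVert p - p' \rVert
\qquad
d_{\max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} \lVert p - p' \rVert
\qquad
d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} \lVert p - p' \rVert
```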
Density-based Clustering
• Partitioning and hierarchical methods find spherical-shaped clusters

• But in the real world we often need to find clusters of arbitrary shape

• Density-based clustering can discover clusters of non-spherical shapes
Density-based Clustering
DBSCAN
• Two parameters
ε (Epsilon) – the maximum radius of the neighborhood
MinPts – the minimum number of points in the ε-neighborhood required to define a cluster

• Data points are categorized into 3 types

– Core point: has at least MinPts points within its ε-neighborhood,
including itself
– Border point: has fewer than MinPts points in its ε-neighborhood, but
lies within the neighborhood of a core point
– Outlier: a point that is neither a core point nor a border point, and so
cannot be reached by any cluster
Density-based Clustering
DBSCAN
• Directly density reachable
– For a core object ‘q’ and an object ‘p’, we say ‘p’ is directly density-
reachable from ‘q’ if ‘p’ is within the ε-neighborhood of ‘q’

• Density reachable
– ‘p’ is density-reachable from ‘q’ if there is a chain of objects p1, …, pn
such that p1 = q, pn = p, and each pi+1 is directly density-reachable
from pi
Density-based Clustering
DBSCAN
[Figure: DBSCAN example with MinPts = 3, showing core, border, and outlier points]
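As a usage sketch (an addition to the slides; scikit-learn's DBSCAN is assumed, and the data and parameter values are arbitrary):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one far-away point (an outlier)
X = np.array([[1, 1], [1, 2], [2, 1],
              [8, 8], [8, 9], [9, 8],
              [25, 25]], dtype=float)

# eps is the neighborhood radius, min_samples is MinPts
labels = DBSCAN(eps=2.0, min_samples=3).fit_predict(X)
print(labels)  # e.g. [0 0 0 1 1 1 -1]; -1 marks noise/outliers
```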
Density-based Clustering
OPTICS
• It is often difficult to define these parameters in advance

• OPTICS instead outputs a cluster ordering of the objects

• For each object, it computes two values (written out in symbols below)

– core-distance: the smallest ε′ value such that the object's ε′-neighborhood
contains at least MinPts objects
– reachability-distance (of p from a core object o): the maximum of the
core-distance of o and the Euclidean distance between o and p
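In symbols (added for precision; dist denotes Euclidean distance and N_ε(o) the ε-neighborhood of o):

```latex
\text{core-dist}_{\varepsilon,\,\mathrm{MinPts}}(o) =
  \begin{cases}
    \text{undefined} & \text{if } \lvert N_\varepsilon(o) \rvert < \mathrm{MinPts}\\[2pt]
    \mathrm{dist}(o,\ \mathrm{MinPts}\text{-th nearest neighbor of } o) & \text{otherwise}
  \end{cases}
\qquad
\text{reachability-dist}(p, o) = \max\bigl(\text{core-dist}(o),\ \mathrm{dist}(o, p)\bigr)
```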
Density-based Clustering
OPTICS
[Figure: OPTICS output illustration (cluster ordering / reachability plot)]
Density-based Clustering
Advantages
• Clusters can be of any shape
• Does not require a predefined number of clusters
• Able to identify noise
