Data Mining-Partitioning Methods
Partitioning approach:
Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
Agglomerative vs. divisive
Model-based:
A model is hypothesized for each of the clusters,
and the method tries to find the best fit of the data
to that model
Typical methods: EM, COBWEB
Frequent pattern-based:
Based on the analysis of frequent patterns
Typical methods: pCluster
User-guided or constraint-based:
Clustering by considering user-specified or
application-specific constraints
Typical methods: COD, constrained clustering
Weaknesses of k-means
Applicable only when the mean is defined, which
is a problem for categorical data
Need to specify k, the number of clusters, in
advance
Unable to handle noisy data and outliers
Not suitable for discovering clusters with non-convex
shapes
K-Medoids: Instead of taking the mean value of the
objects in a cluster as a reference point, a medoid can
be used, which is the most centrally located object
in the cluster.
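The difference between a mean and a medoid can be illustrated with a small sketch. The helper names mean_point and medoid, the Euclidean distance, and the sample cluster are assumptions made for this example only; they are not taken from the slides.

```python
import math

def mean_point(points):
    """Coordinate-wise mean; the result need not be an actual data object."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

def medoid(points):
    """The data object with the smallest total distance to all other objects."""
    def total_distance(candidate):
        return sum(math.dist(candidate, p) for p in points)
    return min(points, key=total_distance)

# Hypothetical cluster: four nearby objects and one outlier.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (50.0, 50.0)]

print(mean_point(cluster))  # pulled toward the outlier
print(medoid(cluster))      # stays on one of the four nearby objects
```

Because the medoid must be one of the objects, a single extreme value shifts it far less than it shifts the mean.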
K-medoids
After reassignment, the difference in squared error E is
calculated. Total cost of swapping = sum of the costs
incurred by all non-medoid objects
If the total cost is negative, o_j is replaced with o_random,
since E will be reduced
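The swap test above can be sketched as follows. This is a minimal illustration, not the slides' own code: the function names, the use of squared Euclidean distance for E, and the sample data are all assumptions. The total cost is computed as E after the swap minus E before it, which equals the sum of the per-object cost changes.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def squared_error(points, medoids):
    """E: each object contributes its squared distance to the nearest medoid."""
    return sum(min(sq_dist(p, m) for m in medoids) for p in points)

def swap_cost(points, medoids, o_j, o_random):
    """Change in E if medoid o_j is replaced by non-medoid o_random."""
    new_medoids = [o_random if m == o_j else m for m in medoids]
    return squared_error(points, new_medoids) - squared_error(points, medoids)

# Hypothetical data: two groups, but both initial medoids sit in the first one.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
medoids = [(1, 1), (2, 1)]

cost = swap_cost(points, medoids, o_j=(2, 1), o_random=(8, 8))
if cost < 0:
    # A negative total cost means E decreases, so the swap is accepted.
    medoids = [(8, 8) if m == (2, 1) else m for m in medoids]
print(cost, medoids)
```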
K-medoids Algorithm
Problem with PAM
PAM is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced
by outliers or other extreme values than a mean
PAM works efficiently for small data sets but does
not scale well for large data sets.
CLARA
Clustering LARge Applications
Choose a representative sample of the data
Choose medoids from this sample
Cluster the data around these medoids
Draw multiple such samples and apply PAM on each
Returns the best clustering
Effectiveness depends on the sample size
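The CLARA steps above can be sketched as follows. This is a simplified illustration under stated assumptions: pam here is a plain greedy swap search rather than a tuned implementation, the parameter names n_samples and sample_size are hypothetical, Euclidean distance is assumed, and each candidate set of medoids is scored against the full data set to pick the best clustering.

```python
import math
import random

def cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k, rng):
    """Simplified PAM: accept any medoid/non-medoid swap that lowers the cost."""
    medoids = rng.sample(points, k)
    best = cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in points:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                c = cost(points, trial)
                if c < best:
                    medoids, best, improved = trial, c, True
    return medoids

def clara(points, k, n_samples=5, sample_size=20, seed=0):
    """Run PAM on several random samples; keep the medoids that fit all data best."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, math.inf
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, rng)
        c = cost(points, medoids)  # evaluated on the whole data set
        if c < best_cost:
            best_medoids, best_cost = medoids, c
    return best_medoids, best_cost

# Hypothetical data: two well-separated 5x5 grids of points.
data = [(float(x), float(y)) for x in range(5) for y in range(5)]
data += [(x + 100.0, y + 100.0) for x in range(5) for y in range(5)]

medoids, total = clara(data, k=2)
print(medoids, total)
```

If a sample happens to miss part of the data, the medoids chosen from it cannot represent that part well, which is why the result depends on the sample size.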
CLARANS
Clustering Large Applications based on RANdomized Search
Uses sampling and PAM
Does not restrict itself to any particular sample
Performs a graph search in which each node is a
potential solution (a set of k medoids)
Neighbor: the clustering obtained after replacing one medoid
The number of neighbors to be tried is limited
Moves to a better neighbor
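The randomized search described above can be sketched as follows. This is a minimal illustration under stated assumptions: the parameter names numlocal (number of restarts) and maxneighbor (limit on neighbors tried) follow common descriptions of CLARANS, Euclidean distance is assumed, and the data set is hypothetical.

```python
import math
import random

def cost(points, medoids):
    """Sum of distances from every object to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def clarans(points, k, numlocal=3, maxneighbor=30, seed=0):
    """Randomized search over nodes, where each node is a set of k medoids."""
    if k >= len(points):
        raise ValueError("k must be smaller than the number of objects")
    rng = random.Random(seed)
    best, best_cost = None, math.inf
    for _ in range(numlocal):
        current = rng.sample(points, k)      # random starting node
        current_cost = cost(points, current)
        tried = 0
        while tried < maxneighbor:
            # A neighbor differs from the current node in exactly one medoid.
            i = rng.randrange(k)
            non_medoids = [p for p in points if p not in current]
            cand = rng.choice(non_medoids)
            neighbor = current[:i] + [cand] + current[i + 1:]
            c = cost(points, neighbor)
            if c < current_cost:
                # Move to the better neighbor and restart the count.
                current, current_cost, tried = neighbor, c, 0
            else:
                tried += 1
        if current_cost < best_cost:
            best, best_cost = current, current_cost
    return best, best_cost

# Hypothetical data: two well-separated 5x5 grids of points.
data = [(float(x), float(y)) for x in range(5) for y in range(5)]
data += [(x + 100.0, y + 100.0) for x in range(5) for y in range(5)]

medoids, total = clarans(data, k=2)
print(medoids, total)
```

Unlike CLARA, no fixed sample is drawn up front: every neighbor is picked at random from the whole data set, so the search is not confined to one subset.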
Silhouette Coefficient
Complexity: O(n^2)
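For reference, the silhouette of an object i is commonly defined as s(i) = (b - a) / max(a, b), where a is the mean distance from i to the other objects in its own cluster and b is the smallest mean distance from i to the objects of any other cluster. The sketch below is an illustration under assumptions: Euclidean distance, a score of 0 for singleton clusters, and hypothetical sample data. The all-pairs distance computation is where the O(n^2) cost comes from.

```python
import math

def silhouette(points, labels):
    """Mean silhouette over all objects; compares every pair, hence O(n^2)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Hypothetical data: two tight pairs far apart.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(pts, [0, 0, 1, 1])  # sensible grouping, score close to 1
bad = silhouette(pts, [0, 1, 0, 1])   # mixed grouping, negative score
print(good, bad)
```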