Clustering
Cluster Analysis
Types of Data in Cluster Analysis
Methods
Cluster Analysis
A cluster is a collection of data objects:
Similar to one another within the same cluster,
Dissimilar to the objects in other clusters.
Typical requirements of clustering in data mining:
Scalability.
Ability to deal with different types of attributes.
Discovery of clusters with arbitrary shape.
Ability to deal with noisy data.
Minimal requirements for domain knowledge to determine input parameters.
Insensitivity to the order of input records.
High dimensionality.
Constraint-based clustering.
Interpretability and usability.
Partitioning Methods
Partitioning Algorithms:
K-means
K-medoids
K-Means Partitioning
Each cluster is represented by the mean value of the objects
in the cluster. Hence, it is known as Centroid-Based
technique.
Working method:
First, it randomly selects k of the objects, each of which initially
represents a cluster mean. Each remaining object is assigned to the
cluster whose mean it is most similar to, based on distance. The mean of
each cluster is then recomputed, and the process iterates until the
assignments no longer change.
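The iterative procedure just described can be sketched in plain Python. This is a minimal illustration, not the slides' own implementation; the function and helper names are our own.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch over 2-D points given as tuples."""
    rng = random.Random(seed)
    # Step 1: arbitrarily select k objects as the initial cluster means.
    means = rng.sample(points, k)
    for _ in range(iters):
        # Step 2: assign each object to the cluster with the most similar mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, means[i]))
            clusters[j].append(p)
        # Step 3: update each cluster mean (the centroid of its members).
        new_means = [
            tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else means[i]
            for i, pts in enumerate(clusters)
        ]
        # Step 4: stop once the means (and hence the assignments) stabilize.
        if new_means == means:
            break
        means = new_means
    return means, clusters
```

Because the initial means are chosen at random, different seeds can converge to different local optima; this is the usual caveat for k-means.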
K-Means Clustering
K-Means Clustering Method
[Figure: K-means with K = 2 — arbitrarily choose K objects as initial
cluster means; assign each object to the most similar center; update the
cluster means; reassign; repeat until no object changes cluster.]
Variations of the K-Means Method
A few variants of the k-means method differ in:
Selection of the initial k means
Dissimilarity calculations
Strategies to calculate cluster means
Handling categorical data: k-modes
Replacing means of clusters with modes
Using new dissimilarity measures to deal with categorical objects
A mixture of categorical and numerical data: k-prototype method
Expectation Maximization
Assigns objects to clusters based on the probability of membership
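The two ingredients k-modes swaps in for categorical data can be sketched as follows. These helper names are our own, chosen for illustration:

```python
from collections import Counter

def matching_dissimilarity(x, y):
    """Count the attributes on which two categorical objects differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def mode_of(cluster):
    """Per-attribute mode, replacing the per-attribute mean of k-means."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))
```

With these two substitutions the k-means loop itself is unchanged, which is why k-modes is presented as a variant rather than a new algorithm.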
Scalability of k-means:
Objects can be identified as compressible, discardable, or to be
maintained in main memory; compressible objects are summarized by their
clustering features.
Problem of the K-Means Method
The k-means algorithm is sensitive to outliers, since an object with an
extremely large value may substantially distort the distribution of the
data.
K-Medoids Clustering Method
PAM (Partitioning Around Medoids)
starts from an initial set of medoids and iteratively replaces one of the
medoids with one of the non-medoids if doing so improves the total
distance of the resulting clustering.
All pairs are analyzed for replacement
PAM works effectively for small data sets, but does not scale well for
large data sets
CLARA (Clustering LARge Applications): applies PAM to multiple random
samples of the data set and returns the best resulting clustering.
CLARANS (Clustering Large Applications based upon RANdomized Search):
examines a random sample of neighboring medoid sets at each step of the
search.
K-Medoids
Input: k, and a database of n objects
Output: A set of k clusters
Method:
Arbitrarily choose k objects as the initial medoids
Repeat
Assign each remaining object to the cluster with the nearest medoid
Randomly select a non-medoid object o_random
Compute the cost S of swapping a medoid o_j with o_random
If S < 0, swap o_j with o_random to form a new set of k medoids
Until no change
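The pseudocode above can be sketched in Python. This is a minimal, unoptimized illustration (it greedily tries every medoid/non-medoid swap rather than sampling one at random); the names are our own.

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from each object to its nearest medoid (absolute error E)."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k, seed=0):
    """Minimal PAM-style k-medoids sketch over 2-D points given as tuples."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)        # arbitrarily choose k initial medoids
    improved = True
    while improved:                        # repeat until no change
        improved = False
        for m in list(medoids):
            for o in points:
                if o in medoids:
                    continue               # only non-medoids are swap candidates
                candidate = [o if x == m else x for x in medoids]
                # A negative swap cost (S < 0) means total cost drops: accept.
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids = candidate
                    improved = True
    return medoids
```

Because every medoid stays an actual data object, an extreme outlier cannot drag a cluster representative away the way it drags a mean.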
Working Principle: Minimize sum of the dissimilarities between each object and its
corresponding reference point. That is, an absolute-error criterion is used
E = Σ_{j=1}^{k} Σ_{p ∈ C_j} |p − o_j|
where o_j is the medoid of cluster C_j.
K-Medoids
Case 1: p currently belongs to medoid o_j. If o_j is replaced by o_random
as a medoid and p is closest to one of the other medoids o_i, i ≠ j, then
p is reassigned to o_i.
Case 2: p currently belongs to medoid o_j. If o_j is replaced by o_random
as a medoid and p is closest to o_random, then p is reassigned to
o_random.
K-Medoids
After reassignment, the difference in squared error E is calculated. The
total cost of swapping is the sum of the costs incurred by all non-medoid
objects. If the total cost is negative, o_j is replaced with o_random,
since E will be reduced.
K-Medoids Algorithm
Hierarchical Methods
Agglomerative hierarchical clustering: each object initially represents a
cluster of its own. Clusters are then successively merged until the
desired cluster structure is obtained.
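The merge loop can be sketched with single linkage (merging the pair of clusters whose closest members are nearest). This is one common linkage choice, not the only one; the function name is our own.

```python
import math

def single_link(points, target_k):
    """Tiny agglomerative sketch: merge closest clusters until target_k remain."""
    clusters = [[p] for p in points]       # each object starts as its own cluster
    while len(clusters) > target_k:
        # Find the pair of clusters with the smallest inter-point distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)     # merge the closest pair
    return clusters
```

Other linkage criteria (complete, average) differ only in how the inter-cluster distance on the inner line is computed.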