DS3 Partitional Clustering
3.1 Overview
3.2 K-means
3.3 K-medoids
3.4 Fuzzy C means (FCM)
3.5 Gaussian Mixture Clustering (GMC)
3.6 Agglomerative Hierarchical Clustering
3.7 DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
3.8 Cluster Evaluation
Goal of clustering:
– Intra-cluster distances are minimized
– Inter-cluster distances are maximized
Clustering for understanding
– Group genes and proteins that have similar functionality, or group stocks with similar price fluctuations, e.g., clusters found in stock price data:
  Cluster 2 (Technology2-DOWN): ADV-Micro-Device-DOWN, Andrew-Corp-DOWN, Computer-Assoc-DOWN, Circuit-City-DOWN, Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN, Motorola-DOWN, Microsoft-DOWN, Scientific-Atl-DOWN
  Cluster 3 (Financial-DOWN): MBNA-Corp-DOWN, Morgan-Stanley-DOWN
  Cluster 4 (Oil-UP): Baker-Hughes-UP, Dresser-Inds-UP, Halliburton-HLD-UP, Louisiana-Land-UP, Phillips-Petro-UP, Unocal-UP, Schlumberger-UP
Clustering for summarization
Supervised classification
– Have class label information
Simple segmentation
– Dividing students into different registration groups alphabetically by last name
Results of a query
– Groupings are a result of an external specification
Graph partitioning
– Some mutual relevance and synergy, but areas are not identical
Partitional Clustering
– A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
[Figure: four points p1–p4 — Traditional Hierarchical Clustering and the corresponding Traditional Dendrogram]
[Figure: the same points — Non-traditional Hierarchical Clustering and the corresponding Non-traditional Dendrogram]
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Property or conceptual clusters
Well-Separated Clusters:
– A cluster is a set of points such that any point in a cluster is closer (or
more similar) to every other point in the cluster than to any point not
in the cluster.
3 well-separated clusters
Center-based
– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the “center” of that cluster than to the center of any other cluster
– The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative” point of a
cluster
4 center-based clusters
Contiguous clusters (nearest neighbor)
– A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster
8 contiguous clusters
Density-based
– A cluster is a dense region of points, separated from other regions of high density by regions of low density
– Used when the clusters are irregular or intertwined, and when noise and outliers are present
6 density-based clusters
[Figure: two overlapping circular clusters — original points, then the K-means result at Iterations 1-6, illustrating how the outcome depends on the initial centroids]
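The per-iteration behavior shown in these figures is Lloyd's algorithm: alternate nearest-centroid assignment with centroid recomputation until the centroids stop moving. A minimal NumPy sketch (the function name and toy data are my own, not from the slides):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic Lloyd's algorithm: alternate assignment and update steps."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its old centroid).
        new = centroids.copy()
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                new[j] = pts.mean(axis=0)
        if np.allclose(new, centroids):  # converged
            break
        centroids = new
    return centroids, labels

# Two well-separated blobs; K-means should recover them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
centroids, labels = kmeans(X, k=2)
```

With well-separated blobs any reasonable initialization converges in a few iterations; the figures above show that poor initializations on harder data can converge to a worse local minimum of the SSE.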
[Figure: another K-means run on the same data, shown at Iterations 1-5, converging to a different clustering]
[Figure: K-means on a data set with 10 clusters arranged in pairs]
Starting with two initial centroids in one cluster of each pair of clusters
Department of IEM, NYCU, Hsinchu, Taiwan
10 Clusters Example
[Figure: 10-clusters example — K-means assignments at Iterations 1-4]
Starting with two initial centroids in one cluster of each pair of clusters
[Figure: 10-clusters example — K-means result at Iteration 4]
Starting with some pairs of clusters having three initial centroids, while others have only one.
[Figure: 10-clusters example with uneven initial centroids — assignments at Iterations 1-4]
Starting with some pairs of clusters having three initial centroids, while others have only one.
Solutions to the initial centroids problem:
Multiple runs
– Helps, but the probability of drawing a good initialization is not on your side
Sample and use hierarchical clustering to determine
initial centroids (two-step clustering)
Select more than k initial centroids and then select
among these initial centroids
– Select most widely separated
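One way to “select the most widely separated” among the candidate centroids is a greedy farthest-first sweep; a sketch (the function name and candidate data are illustrative, not from the slides):

```python
import numpy as np

def widely_separated(candidates, k):
    """Greedily keep k candidates, each as far as possible from those kept."""
    chosen = [0]  # start from the first candidate (arbitrary)
    while len(chosen) < k:
        # Distance from every candidate to its nearest already-chosen one.
        d = np.min(np.linalg.norm(
            candidates[:, None, :] - candidates[chosen][None, :, :],
            axis=2), axis=1)
        chosen.append(int(d.argmax()))  # take the farthest candidate
    return candidates[chosen]

# Five candidates: two near the origin, two near (5, 5), one at (0, 5).
cands = np.array([[0., 0.], [0.1, 0.], [5., 5.], [5.1, 5.], [0., 5.]])
picks = widely_separated(cands, k=3)
```

The sweep discards near-duplicate candidates ([0.1, 0] and [5, 5] here) and keeps one representative per well-separated region, which is the intended effect when thinning an over-generated set of initial centroids.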
Postprocessing
Bisecting K-means
– Not as susceptible to initialization issues
Pre-processing
– Normalize the data
– Eliminate outliers
Post-processing
– Eliminate small clusters that may represent outliers
– Split ‘loose’ clusters, i.e., clusters with relatively high SSE
– Merge clusters that are ‘close’ and that have relatively low
SSE
– Can use these steps during the clustering process
ISODATA (Iterative Self-Organizing Data Analysis) is an example that splits and merges clusters during the run
FCM can be viewed as a fuzzy version of K-means. It minimizes the error function

E = sum_{i=1..k} sum_{j=1..n} (w_ij)^m * ||x_j - c_i||^2

where w_ij is the degree of membership of data point j in cluster i. m is usually set to 2; the larger the distance between data point j and center c_i, the smaller the membership of j in cluster i.
Details of FCM:
1. Initialize the membership matrix w_ij
2. Calculate the centroids c_i
3. Update the memberships w_ij
4. Repeat steps 2 and 3 until convergence
5. Output the final results
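The five steps above can be sketched directly in NumPy; m = 2, the tolerance, and all names are my choices, not from the slides:

```python
import numpy as np

def fcm(X, k, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Fuzzy C-means: soft memberships w[i, j] of point j in cluster i."""
    rng = np.random.default_rng(seed)
    # Step 1: random membership matrix whose columns sum to 1.
    w = rng.random((k, len(X)))
    w /= w.sum(axis=0)
    for _ in range(n_iter):
        wm = w ** m
        # Step 2: centroids are membership-weighted means.
        c = wm @ X / wm.sum(axis=1, keepdims=True)
        # Step 3: update memberships from distances; a larger distance
        # to centroid i gives a smaller membership in cluster i.
        d = np.linalg.norm(X[None, :, :] - c[:, None, :], axis=2) + 1e-12
        w_new = d ** (-2.0 / (m - 1.0))
        w_new /= w_new.sum(axis=0)
        # Step 4: repeat until the memberships stop changing.
        if np.abs(w_new - w).max() < tol:
            w = w_new
            break
        w = w_new
    return c, w  # Step 5: final centroids and membership matrix

# Two tight blobs; memberships should be near-crisp for each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (15, 2)), rng.normal(5, 0.1, (15, 2))])
c, w = fcm(X, k=2)
hard = w.argmax(axis=0)  # hardened labels, for inspection
```

Hardening the memberships with argmax recovers a K-means-style partition; the extra information FCM provides is the graded membership of borderline points.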
Posterior
Prior
Mahalanobis distance
3.5 GMC (steps)
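The three quantities listed above fit together in the E-step of Gaussian mixture clustering: the posterior probability of a cluster given a point is proportional to the prior times a Gaussian likelihood, and the Gaussian's exponent is the squared Mahalanobis distance to the component mean. A sketch for one point and two components (all parameter values are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density; the exponent is the squared
    Mahalanobis distance between x and the component mean."""
    d = len(mean)
    diff = x - mean
    maha2 = diff @ np.linalg.inv(cov) @ diff  # squared Mahalanobis distance
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha2) / norm

# Two components with equal priors (illustrative parameters).
priors = np.array([0.5, 0.5])
means = [np.zeros(2), np.array([4.0, 4.0])]
covs = [np.eye(2), np.eye(2)]

x = np.array([0.5, 0.2])  # a point near the first component
likelihoods = np.array([gaussian_pdf(x, m, c) for m, c in zip(means, covs)])
# Bayes' rule: posterior is proportional to prior times likelihood.
posteriors = priors * likelihoods
posteriors /= posteriors.sum()
```

The M-step (not shown) would then re-estimate each component's prior, mean, and covariance from these posterior weights, and the two steps alternate until the log-likelihood converges.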
DB index (Davies–Bouldin)
FS index (Fukuyama and Sugeno)
XB index (Xie and Beni)
PC (partition coefficient)
PE (partition entropy)
These indices trade off cohesion within each group against the separation between groups i and j.
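The DB index, for example, combines per-cluster scatter (cohesion) with between-centroid separation; a small NumPy sketch of the standard formula (my own code, not from the slides):

```python
import numpy as np

def davies_bouldin(X, labels):
    """DB index: average over clusters of the worst
    (scatter_i + scatter_j) / separation(i, j) ratio.
    Lower values indicate a better clustering."""
    ks = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in ks])
    # s[i]: mean distance of cluster i's points to its centroid (cohesion).
    s = np.array([np.linalg.norm(X[labels == k] - centroids[i], axis=1).mean()
                  for i, k in enumerate(ks)])
    db = 0.0
    for i in range(len(ks)):
        ratios = [(s[i] + s[j]) / np.linalg.norm(centroids[i] - centroids[j])
                  for j in range(len(ks)) if j != i]
        db += max(ratios)
    return db / len(ks)

# Tight, well-separated clusters should score lower (better) than loose ones.
tight = np.array([[0., 0.], [0., .1], [5., 5.], [5., 5.1]])
loose = np.array([[0., 0.], [0., 2.], [5., 5.], [5., 7.]])
labels = np.array([0, 0, 1, 1])
db_tight = davies_bouldin(tight, labels)
db_loose = davies_bouldin(loose, labels)
```

The other indices listed above differ mainly in how they weight and combine these same cohesion and separation terms, and in whether they use crisp or fuzzy memberships (PC and PE use only the membership matrix).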