Data Mining - Clustering
Requirements and Challenges
Scalability
○ Clustering all the data instead of only samples
Ability to deal with different types of attributes
○ Numerical, binary, categorical, ordinal, linked, and mixtures of these
Constraint-based clustering
○ User may give inputs on constraints
○ Use domain knowledge to determine input parameters
Interpretability and usability
Others
○ Discovery of clusters with arbitrary shape
○ Ability to deal with noisy data
○ Incremental clustering and insensitivity to input order
○ High dimensionality
Major Clustering Approaches (I)
Partitioning approach:
○ Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
○ Typical methods: k-means, k-medoids, CLARANS
Hierarchical approach:
○ Create a hierarchical decomposition of the set of data (or objects) using some criterion
○ A set of nested clusters organized as a hierarchical tree
○ Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
Density-based approach:
○ Based on connectivity and density functions
○ Typical methods: DBSCAN, OPTICS, DenClue
Partitional Clustering
(Figure: Original Points and A Partitional Clustering of them)
Hierarchical Clustering
(Figure: a Traditional Hierarchical Clustering of points p1-p4 and the corresponding Traditional Dendrogram)
$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$
K-means Clustering
Working Example (k = 3; initial cluster centers A1, A4, A7; Manhattan distance)

Point  Coordinates  Dist. to (2, 10)  Dist. to (5, 8)  Dist. to (1, 2)  Cluster
A1     (2, 10)      0                 5                9                C1
A2     (2, 5)       5                 6                4                C3
A3     (8, 4)       12                7                9                C2
A4     (5, 8)       5                 0                10               C2
A5     (7, 5)       10                5                9                C2
A6     (6, 4)       10                5                7                C2
A7     (1, 2)       9                 10               0                C3
A8     (4, 9)       3                 2                10               C2
New clusters are as follows:
Point  Coordinates  Dist. to (2, 10)  Dist. to (6, 6)  Dist. to (1.5, 3.5)  Cluster
A1     (2, 10)      0                 8                7                    C1
A2     (2, 5)       5                 5                2                    C3
A3     (8, 4)       12                4                7                    C2
A4     (5, 8)       5                 3                8                    C2
A5     (7, 5)       10                2                7                    C2
A6     (6, 4)       10                2                5                    C2
A7     (1, 2)       9                 9                2                    C3
A8     (4, 9)       3                 5                8                    C1
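A minimal sketch of these two iterations in Python/NumPy, using the Manhattan distance and the initial centers A1, A4, A7 implied by the tables above (a sketch, not code from the slides):

```python
import numpy as np

points = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
centers = points[[0, 3, 6]].copy()        # initial centers: A1, A4, A7

for it in range(2):                        # the two iterations worked above
    # Assignment step: nearest center under Manhattan (city-block) distance
    dists = np.abs(points[:, None, :] - centers[None, :, :]).sum(axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each center becomes the mean of its assigned points
    centers = np.array([points[labels == k].mean(axis=0) for k in range(3)])
    print(f"iteration {it + 1}: assignments {labels + 1}, centers\n{centers}")
```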
New cluster centers: C1 = (3, 9.5), C2 = (6.5, 5.25), C3 = (1.5, 3.5)

The k-means method is sensitive to outliers. For a one-dimensional data set such as {1, 2, 3, 8, 9, 10, 25}, by visual inspection we may imagine the points partitioned into the clusters {1, 2, 3} and {8, 9, 10}, where point 25 is excluded because it appears to be an outlier.
PAM: A Typical K-Medoids Algorithm
(Figure: PAM on a sample two-dimensional data set with k = 2; Total Cost = 20)
1. Arbitrarily choose k objects as the initial medoids
2. Assign each remaining object to the nearest medoid
3. Randomly select a non-medoid object O_random and compute the total cost of swapping a medoid with O_random
4. If the quality is improved, perform the swap
5. Repeat (do loop) until no change
The K-Medoid Clustering Method
○ PAM works effectively for small data sets, but does not scale well for large data sets
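A minimal sketch of the PAM swap loop in Python/NumPy, reusing the eight points from the k-means example above with Manhattan distance; the function names and the random initialization are illustrative, not from the slides:

```python
import numpy as np

def pam(points, k, rng=np.random.default_rng(0)):
    """Minimal PAM-style k-medoids: greedy medoid / non-medoid swaps."""
    n = len(points)
    dist = np.abs(points[:, None] - points[None, :]).sum(axis=2)   # Manhattan distance matrix
    medoids = list(rng.choice(n, size=k, replace=False))           # arbitrary initial medoids

    def total_cost(meds):
        return dist[:, meds].min(axis=1).sum()      # each object charged to its nearest medoid

    improved = True
    while improved:                                 # "until no change"
        improved = False
        for m in list(medoids):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if total_cost(candidate) < total_cost(medoids):    # swap only if quality improves
                    medoids = candidate
                    improved = True
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

pts = np.array([[2, 10], [2, 5], [8, 4], [5, 8], [7, 5], [6, 4], [1, 2], [4, 9]], dtype=float)
print(pam(pts, k=3))
```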
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree
Can be visualized as a dendrogram
○ A tree-like diagram that records the sequences of merges or splits
(Figure: nested clusters of six points and the corresponding dendrogram, with merge distances on the vertical axis)
Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
○ Any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level
Divisive:
○ Start with one, all-inclusive cluster
○ At each step, split a cluster until each cluster contains a single point (or there are k clusters)
Dendrogram: Shows How Clusters are Merged
Agglomerative Clustering Algorithm
Basic algorithm:
1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat: merge the two closest clusters and update the proximity matrix
4. Until only a single cluster remains
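A brief sketch of this loop using SciPy's hierarchical-clustering routines (an assumed library choice); the twelve random points stand in for p1 ... p12:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points standing in for p1 ... p12 (illustrative values, not from the slides)
rng = np.random.default_rng(0)
points = rng.random((12, 2))

# linkage() runs the agglomerative loop: start with singleton clusters,
# repeatedly merge the two closest clusters, and record each merge.
Z = linkage(points, method="single")   # 'single' = MIN; 'complete', 'average', 'ward' also work

# Cut the hierarchy to obtain, e.g., 3 flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```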
Intermediate Situation
After some merging steps, we have some clusters
(Figure: five current clusters C1-C5 over points p1 ... p12, together with their proximity matrix)
Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and
update the proximity matrix.
(Figure: clusters C1-C5 with C2 and C5 highlighted, and the current proximity matrix)
After Merging
The question is “How do we update the proximity matrix?”
(Figure: after merging C2 and C5 into "C2 U C5", the proximity-matrix entries involving the new cluster are marked "?")
How to Define Inter-Cluster Similarity
(Figure: proximity matrix over points p1 ... p5; which entries define the similarity between two clusters?)
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function
– Ward's Method uses squared error
Distance Between Clusters
Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(dist(tip, tjq))
Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(dist(tip, tjq))
Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(dist(tip, tjq))
Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)
Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)
○ Medoid: a chosen, centrally located object in the cluster
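A small sketch of these inter-cluster distance definitions in Python/NumPy (Euclidean element distances assumed for illustration):

```python
import numpy as np

def single_link(Ki, Kj):
    """Smallest pairwise distance between the two clusters."""
    return min(np.linalg.norm(p - q) for p in Ki for q in Kj)

def complete_link(Ki, Kj):
    """Largest pairwise distance between the two clusters."""
    return max(np.linalg.norm(p - q) for p in Ki for q in Kj)

def average_link(Ki, Kj):
    """Average pairwise distance between the two clusters."""
    return np.mean([np.linalg.norm(p - q) for p in Ki for q in Kj])

def centroid_dist(Ki, Kj):
    """Distance between the cluster centroids."""
    return np.linalg.norm(np.mean(Ki, axis=0) - np.mean(Kj, axis=0))

Ki = np.array([[1.0, 1.0], [2.0, 1.0]])
Kj = np.array([[6.0, 5.0], [7.0, 6.0]])
print(single_link(Ki, Kj), complete_link(Ki, Kj), average_link(Ki, Kj), centroid_dist(Ki, Kj))
```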
Cluster Similarity: MIN or Single Link
Similarity of two clusters is based on the two most similar (closest) points in the different clusters
○ Determined by one pair of points, i.e., by one link in the proximity graph
     I1    I2    I3    I4    I5
I1   1.00  0.90  0.10  0.65  0.20
I2   0.90  1.00  0.70  0.60  0.50
I3   0.10  0.70  1.00  0.40  0.30
I4   0.65  0.60  0.40  1.00  0.80
I5   0.20  0.50  0.30  0.80  1.00
Hierarchical Clustering: MIN
(Figure: a six-point example clustered with MIN/single link, shown as nested clusters and as a dendrogram; a second dendrogram over the same points is shown for comparison)
Working Example
Point  X1  X2
A      10  5
B      1   4
C      5   8
D      9   2
E      12  10
F      15  8
G      7   7
Step 1: Calculate the distances between all data points using the Euclidean distance function. The shortest distance is between data points C and G.

     A     B      C      D     E     F
B    9.06
C    5.83  5.66
D    3.16  8.25   7.21
E    5.39  12.53  7.28   8.54
F    5.83  14.56  10.00  8.49  3.61
G    3.61  6.71   2.24   5.39  5.83  8.06

Step 2: Merge C and G into one cluster and recompute the distances to {C, G}:

       A      B      C,G    D     E
B      9.06
C,G    4.72   6.10
D      3.16   8.25   6.26
E      5.39   12.53  6.50   8.54
F      5.83   14.56  9.01   8.49  3.61
Working Example contd..
The closest pair is now A and D (3.16), so merge them and recompute:

       A,D    B      C,G    E
B      8.51
C,G    5.32   6.10
E      6.96   12.53  6.50
F      7.11   14.56  9.01   3.61
Working Example contd..
The closest pair is now E and F (3.61), so merge them and recompute:

       A,D    B      C,G
B      8.51
C,G    5.32   6.10
E,F    6.80   13.46  7.65
Working Example contd..
The closest pair is now {A, D} and {C, G} (5.32), so merge them and recompute:

         A,D,C,G  B
B        6.91
E,F      6.73     13.46
Working Example contd..
Merging {A, D, C, G} with {E, F} (6.73) leaves only two clusters, and B joins in the final merge:

     A,D,C,G,E,F
B    9.07
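The merged-cluster distances in these tables are consistent with centroid linkage (the distance between cluster centroids). A minimal sketch reproducing the merge sequence with SciPy, under that assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

# Points A..G from the working example
names = list("ABCDEFG")
X = np.array([[10, 5], [1, 4], [5, 8], [9, 2], [12, 10], [15, 8], [7, 7]], dtype=float)

print(np.round(squareform(pdist(X)), 2))       # Step 1: full Euclidean distance matrix

Z = linkage(X, method="centroid")              # assumed linkage; see lead-in above
for left, right, d, size in Z:
    print(f"merge {int(left)} + {int(right)} at distance {d:.2f} (new cluster size {int(size)})")
```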
Working Example contd..
Exercise
Consider the following distance matrix:
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
$D_m = \sqrt{\dfrac{\sum_{i=1}^{N}\sum_{j=1}^{N} (t_i - t_j)^2}{N(N-1)}}$
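A small sketch computing these statistics in Python/NumPy. Only the diameter formula survives above; the centroid and radius here follow the usual textbook definitions (mean point, and root-mean-square distance of members to the centroid), which are assumptions:

```python
import numpy as np

def centroid(points):
    """Centroid: the mean point of the cluster."""
    return points.mean(axis=0)

def radius(points):
    """Radius: sqrt of the average squared distance of members to the centroid (assumed definition)."""
    c = centroid(points)
    return np.sqrt(((points - c) ** 2).sum(axis=1).mean())

def diameter(points):
    """Diameter Dm: sqrt of the average squared pairwise distance, as in the formula above."""
    n = len(points)
    diffs = points[:, None, :] - points[None, :, :]
    sq = (diffs ** 2).sum(axis=-1)            # all N x N squared distances (diagonal is 0)
    return np.sqrt(sq.sum() / (n * (n - 1)))

cluster = np.array([[2.0, 10.0], [4.0, 9.0], [5.0, 8.0]])
print(centroid(cluster), radius(cluster), diameter(cluster))
```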
Density-Based Clustering Methods
Clustering based on density (a local cluster criterion), such as density-connected points
Major features:
○ Discover clusters of arbitrary shape
○ Handle noise
○ One scan
○ Need density parameters as termination condition
Several interesting studies:
○ DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Ester et al. (KDD'96)
○ OPTICS: Ankerst et al. (SIGMOD'99)
○ DENCLUE: Hinneburg & Keim (KDD'98)
○ CLIQUE: Agrawal et al. (SIGMOD'98) (more grid-based)
Density-Based Clustering: Basic Definitions
Two parameters:
○ Eps: maximum radius of the neighborhood
○ MinPts: minimum number of points in an Eps-neighborhood of that point
Density-Based Clustering: Basic Definitions
The density of a neighborhood can be measured simply by the number of objects in the neighborhood.
Example
Consider the following figure, where MinPts = 3:
(Figure: points labeled Core, Border, and Outlier; Eps = 1 cm, MinPts = 5)
DBSCAN: The Algorithm
○ Arbitrarily select a point p
○ Retrieve all points density-reachable from p w.r.t. Eps and MinPts
○ If p is a core point, a cluster is formed
○ If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
○ Continue the process until all of the points have been processed
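A minimal sketch of running DBSCAN via scikit-learn (an assumed library choice; the slides do not prescribe one), with illustrative eps and MinPts values:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Small 2-D data set: two dense groups plus one isolated point (illustrative values)
X = np.array([
    [1.0, 2.0], [1.0, 2.5], [1.2, 2.5], [1.1, 2.2],
    [5.0, 6.0], [5.2, 6.1], [5.1, 5.9],
    [9.0, 1.0],                               # likely labeled as noise
])

db = DBSCAN(eps=0.6, min_samples=3).fit(X)    # eps ~ neighborhood radius, min_samples ~ MinPts
print(db.labels_)                             # -1 marks noise/outlier points
```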
Working Example
Let's take a dataset of 13 points as shown and plotted below:
As evident from the above table, the point (1, 2) has only two other points, (1, 2.5) and (1.2, 2.5), in its neighborhood for the assumed value of eps.
Let's repeat the above process for every point in the dataset and find the neighborhood of each.
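A small sketch of the neighborhood check described above; only the three named points are taken from the example, while the fourth point and eps = 0.6 are illustrative assumptions:

```python
import numpy as np

def eps_neighborhood(points, idx, eps):
    """Return the indices of all other points within distance eps of points[idx]."""
    dists = np.linalg.norm(points - points[idx], axis=1)
    return [j for j in range(len(points)) if j != idx and dists[j] <= eps]

pts = np.array([[1.0, 2.0], [1.0, 2.5], [1.2, 2.5], [3.0, 4.0]])
print(eps_neighborhood(pts, 0, eps=0.6))   # -> [1, 2]: (1, 2.5) and (1.2, 2.5)
```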
Cluster Validity
For supervised classification we have a variety of measures to evaluate how good our model is
○ Accuracy, precision, recall
Using Similarity Matrix for Cluster Validation
(Figure: points-by-points similarity matrix alongside the corresponding x-y scatter plot for the clusters found by DBSCAN)
Internal Measures: SSE
Clusters in more complicated figures aren't well separated
Internal Index: used to measure the goodness of a clustering structure without respect to external information
○ SSE
SSE is good for comparing two clusterings or two clusters (average SSE)
Can also be used to estimate the number of clusters
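A brief sketch of using SSE to estimate the number of clusters; scikit-learn's KMeans.inertia_ is the SSE of the fitted clustering, and the data set here is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative data: three loose groups in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in [(0, 0), (5, 5), (9, 0)]])

# Plotting SSE against K and looking for an "elbow" suggests the number of clusters
for k in range(1, 8):
    sse = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(sse, 1))
```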
(Figure: a sample data set and the corresponding curve of SSE against the number of clusters K)
Internal Measures: Cohesion and Separation
Cluster cohesion is measured by the within-cluster sum of squares
$WSS = \sum_{i} \sum_{x \in C_i} (x - m_i)^2$
Separation is measured by the between-cluster sum of squares
$BSS = \sum_{i} |C_i| \, (m - m_i)^2$
where $|C_i|$ is the size of cluster $i$, $m_i$ is its centroid, and $m$ is the overall mean
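A minimal sketch computing both quantities; the four one-dimensional points match the small example that follows:

```python
import numpy as np

def wss_bss(points, labels):
    """Within-cluster (cohesion) and between-cluster (separation) sums of squares."""
    m = points.mean(axis=0)                      # overall mean
    wss = bss = 0.0
    for c in np.unique(labels):
        cluster = points[labels == c]
        mi = cluster.mean(axis=0)                # cluster centroid
        wss += ((cluster - mi) ** 2).sum()
        bss += len(cluster) * ((m - mi) ** 2).sum()
    return wss, bss

pts = np.array([[1.0], [2.0], [4.0], [5.0]])
print(wss_bss(pts, np.array([0, 0, 0, 0])))      # K = 1: (10.0, 0.0)
print(wss_bss(pts, np.array([0, 0, 1, 1])))      # K = 2: (1.0, 9.0) -- WSS + BSS stays 10
```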
Internal Measures: Cohesion and Separation
Example: SSE, illustrating that BSS + WSS = constant
(Figure: points 1, 2, 4, 5 on a line with overall mean m = 3; cohesion is measured within the clusters around m1 and m2, separation between the cluster centroids and m)
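A worked version of this example, assuming K = 1 uses the overall mean m = 3 and K = 2 uses cluster centroids m1 = 1.5 and m2 = 4.5 (the centroid values are inferred from the figure, not stated in the source):

```latex
K{=}1:\; WSS = (1-3)^2 + (2-3)^2 + (4-3)^2 + (5-3)^2 = 10, \quad BSS = 4\,(3-3)^2 = 0, \quad \text{Total} = 10
K{=}2:\; WSS = (1-1.5)^2 + (2-1.5)^2 + (4-4.5)^2 + (5-4.5)^2 = 1, \quad BSS = 2\,(3-1.5)^2 + 2\,(4.5-3)^2 = 9, \quad \text{Total} = 10
```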
Internal Measures: Silhouette Coefficient