Clustering: Other Methods and Evaluation
Density-Based Clustering: Basic Concepts
Two parameters:
  Eps: maximum radius of the neighbourhood
  MinPts: minimum number of points required in the Eps-neighbourhood of a point
N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5]
Density-Based Clustering: Basic Concepts
Two parameters:
  Eps: maximum radius of the neighbourhood
  MinPts: minimum number of points required in the Eps-neighbourhood of a point
N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  p ∈ N_Eps(q), and
  q satisfies the core point condition |N_Eps(q)| ≥ MinPts
[Figure: p lies in the Eps = 1 cm neighbourhood of core point q, MinPts = 5]
Density-Reachable and Density-Connected
Density-reachable:
  A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, ..., pn with p1 = q and pn = p such that each p_{i+1} is directly density-reachable from p_i
Density-connected:
  A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts
[Figure: a chain q, p1, ..., p illustrating density-reachability, and a point o from which both p and q are density-reachable]
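To make the neighbourhood and reachability definitions concrete, here is a minimal Python sketch; the helper names (eps_neighborhood, directly_density_reachable) are illustrative and not from the slides.

```python
# Minimal sketch of the Eps-neighbourhood and the directly-density-reachable test.
# Points are plain coordinate tuples; names are illustrative.
from math import dist  # Euclidean distance (Python 3.8+)

def eps_neighborhood(D, p, eps):
    """N_Eps(p) = {q in D | dist(p, q) <= Eps} (includes p itself)."""
    return [q for q in D if dist(p, q) <= eps]

def directly_density_reachable(D, p, q, eps, min_pts):
    """p is directly density-reachable from q iff p lies in N_Eps(q)
    and q satisfies the core point condition |N_Eps(q)| >= MinPts."""
    n_q = eps_neighborhood(D, q, eps)
    return p in n_q and len(n_q) >= min_pts
```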
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
Continue the process until all of the points have been processed
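A compact, unoptimized sketch of the procedure just described (an O(n²) teaching version; function names are illustrative):

```python
# A straightforward O(n^2) DBSCAN sketch following the steps above.
# Cluster ids start at 0; NOISE marks outliers; points are coordinate tuples.
from math import dist

NOISE, UNVISITED = -1, None

def region_query(D, i, eps):
    """Indices of all points within Eps of point i (including i itself)."""
    return [j for j in range(len(D)) if dist(D[i], D[j]) <= eps]

def dbscan(D, eps, min_pts):
    labels = [UNVISITED] * len(D)
    cluster_id = -1
    for i in range(len(D)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(D, i, eps)
        if len(neighbors) < min_pts:          # not a core point: noise (for now)
            labels[i] = NOISE
            continue
        cluster_id += 1                       # i is a core point: start a new cluster
        labels[i] = cluster_id
        seeds = list(neighbors)
        while seeds:                          # expand the cluster via density-reachability
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id        # noise becomes a border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(D, j, eps)
            if len(j_neighbors) >= min_pts:   # j is also a core point: keep expanding
                seeds.extend(j_neighbors)
    return labels
```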
An Example
[Figure: DBSCAN grows cluster C1 step by step with MinPts = 4 and Eps = ε]
DBSCAN: Determining Eps and MinPts
Idea: for points in a cluster, their k-th nearest neighbors are at roughly the same distance
Noise points have their k-th nearest neighbor at a farther distance
So, plot the sorted distance of every point to its k-th nearest neighbor; the distance at the knee of this curve is a reasonable choice for Eps
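A small sketch of the k-distance heuristic, assuming numpy and matplotlib are available; the choice k = 4 mirrors a common MinPts setting and is only an assumption:

```python
# Sort every point's distance to its k-th nearest neighbor and look for the knee;
# the distance at the knee is a candidate Eps. Names are illustrative.
import numpy as np
import matplotlib.pyplot as plt

def k_distance_plot(X, k=4):
    X = np.asarray(X, dtype=float)
    # pairwise Euclidean distances (O(n^2) memory; fine for small teaching examples)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    kth = np.sort(dists, axis=1)[:, k]      # column 0 is the point itself (distance 0)
    plt.plot(np.sort(kth)[::-1])
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.show()
```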
CLIQUE (Clustering In QUEst)
CLIQUE: The Major Steps
Partition the data space and find the number of points that lie inside each cell of the partition
Identify the subspaces that contain clusters using the Apriori principle
Identify clusters
  Determine dense units in all subspaces of interest
  Determine connected dense units in all subspaces of interest
Generate minimal descriptions for the clusters
  Determine maximal regions that cover a cluster of connected dense units
Example
• Start at 1-D space and discretize the numerical intervals in each axis into a grid

  Point  A1   A2
  X1     0.1  0.2
  X2     0.5  0.5
  X3     0.2  1.2
  X4     0.7  1.1
  X5     0.5  1.8
  X6     0.4  4.7
  X7     1.5  3.5
  X8     2.2  2.2
  X9     2.3  3.4
  X10    2.5  2.1
  X11    2.8  2.6

[Figure: the 11 points plotted on the A1 (0–3) by A2 (0–5) grid]
• Start at 1-D space and discretize the numerical intervals in each axis into a grid, counting the points in each interval

  A1 interval  count      A2 interval  count
  0~1          6          0~1          2
  1~2          1          1~2          3
  2~3          4          2~3          3
                          3~4          2
                          4~5          1

[Figure: the same 11 points on the A1–A2 grid]
• Use the dense regions to find promising candidates in 2-D space based on the Apriori principle

Dense regions (intervals containing at least 2 points):
  A1: 0~1, 2~3
  A2: 0~1, 1~2, 2~3, 3~4

Counts in the candidate 2-D cells:
            A2 0~1   A2 1~2   A2 2~3   A2 3~4
  A1 0~1    2        3        0        0
  A1 2~3    0        0        3        1

[Figure: the candidate 2-D cells highlighted on the A1–A2 grid]
Dense 2-D regions:
  (A1, A2): (0~1, 0~1), (0~1, 1~2), (2~3, 2~3)
Minimal descriptions:
  (A1, A2): (0~1, 0~2), (2~3, 2~3)

[Figure: the resulting clusters on the A1–A2 grid]
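The counting steps of the example can be reproduced with a short sketch; the unit-width grid and the density threshold of 2 points per cell are assumptions read off the example, and all names are illustrative:

```python
# 1-D discretization, dense-unit detection, and Apriori-style 2-D candidate
# counting for the 11-point example above.
from collections import Counter
from itertools import product

points = [(0.1, 0.2), (0.5, 0.5), (0.2, 1.2), (0.7, 1.1), (0.5, 1.8), (0.4, 4.7),
          (1.5, 3.5), (2.2, 2.2), (2.3, 3.4), (2.5, 2.1), (2.8, 2.6)]  # (A1, A2)
TAU = 2                                      # assumed density threshold (points per cell)

def cell(value, width=1.0):
    return int(value // width)               # 1-D grid cell index, unit-width cells

# Step 1: count points per 1-D interval in each dimension
counts_1d = [Counter(cell(p[d]) for p in points) for d in range(2)]
dense_1d = [{c for c, n in counts_1d[d].items() if n >= TAU} for d in range(2)]
# dense A1 cells: {0, 2}; dense A2 cells: {0, 1, 2, 3}

# Step 2 (Apriori): only combinations of dense 1-D units can form dense 2-D units
counts_2d = Counter((cell(p[0]), cell(p[1])) for p in points)
dense_2d = {c for c in product(dense_1d[0], dense_1d[1]) if counts_2d[c] >= TAU}
print(sorted(dense_2d))                      # -> [(0, 0), (0, 1), (2, 2)]
```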
Strength and Weakness of CLIQUE
Strength
  Automatically finds subspaces of the highest dimensionality in which high-density clusters exist
Major Clustering Approaches
Partitioning approach:
  Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
Hierarchical approach:
  Create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based approach:
  Based on connectivity and density functions
Grid-based approach:
  Based on a multiple-level granularity structure
Quality: What Is Good Clustering?
Measuring Clustering Quality
Purity example (Sec. 16.3)
[Figure: a clustering of n = 17 points into three clusters; cluster 1 contains 5 ×'s and 1 ○, cluster 2 contains 1 ×, 4 ○'s, and 1 ◇, cluster 3 contains 2 ×'s and 3 ◇'s]
Purity = (5 + 4 + 3) / 17 ≈ 0.71
Rand Index example (Sec. 16.3)
With n = 17 points there are C(17, 2) = 136 point pairs in total. 40 of those pairs fall in the same cluster of the clustering, so FN + TN = 136 − 40 = 96 pairs fall in different clusters.

  Number of point pairs               Same cluster in clustering   Different clusters in clustering
  Same class in ground truth          TP = 20                      FN = 24
  Different classes in ground truth   FP = 20                      TN = 72

  FN = (5·3 + 1·2) + (1·4) + (1·3) = 24
  TN = 96 − 24 = 72
Rand Index, Precision, and Recall (Sec. 16.3)
RI = (TP + TN) / (TP + FP + FN + TN)
For the example above: RI = (20 + 72) / 136 ≈ 0.68
Compare with standard precision and recall:
  P = TP / (TP + FP)        R = TP / (TP + FN)
People also define and use a cluster F-measure, which is probably a better measure.
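A small sketch that recovers the pair counts and the measures above by brute force; the 17-point labels are reconstructed from the FN computation and are an assumption:

```python
# Count TP, FP, FN, TN over all point pairs, then compute RI, P, and R.
from itertools import combinations

def pair_counts(classes, clusters):
    tp = fp = fn = tn = 0
    for (c1, k1), (c2, k2) in combinations(zip(classes, clusters), 2):
        same_class, same_cluster = (c1 == c2), (k1 == k2)
        if same_class and same_cluster:
            tp += 1
        elif not same_class and same_cluster:
            fp += 1
        elif same_class and not same_cluster:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# 17-point example, reconstructed from the FN computation above (assumption):
# cluster 1: 5 x's + 1 o; cluster 2: 1 x + 4 o's + 1 d; cluster 3: 2 x's + 3 d's
classes  = ['x'] * 5 + ['o'] + ['x'] + ['o'] * 4 + ['d'] + ['x'] * 2 + ['d'] * 3
clusters = [1] * 6 + [2] * 6 + [3] * 5

tp, fp, fn, tn = pair_counts(classes, clusters)
ri = (tp + tn) / (tp + fp + fn + tn)
p, r = tp / (tp + fp), tp / (tp + fn)
print(tp, fp, fn, tn, round(ri, 2), round(p, 2), round(r, 2))
# -> 20 20 24 72 0.68 0.5 0.45
```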
Exercise (Sec. 16.3)
n = 13, two clusters: Cluster I and Cluster II
[Figure: the 13 points with their ground-truth classes and the two clusters]
RI = ?
Intrinsic Methods
Exercise (Sec. 16.3)
[Figure: Cluster I and a neighbouring cluster, with the distances 2, 10, 12, 16, 14, 7, 7, 6, 10 marked between points]
Cluster I's fitness = ?
Determine the Number of Clusters
Empirical method
  # of clusters ≈ √(n/2) for a dataset of n points
Elbow method
  Use the turning point (knee) in the curve of the sum of within-cluster variance with respect to the number of clusters
Example
[Figures: the same dataset clustered with k = 2, then k = 3, then k = 4, used to locate the elbow]
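A minimal sketch of the elbow heuristic, assuming scikit-learn and matplotlib are available; X stands for whatever dataset is being examined:

```python
# Plot the within-cluster sum of squares (inertia) against k and look for the elbow.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_max=10):
    ks = range(1, k_max + 1)
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]
    plt.plot(list(ks), sse, marker="o")
    plt.xlabel("number of clusters k")
    plt.ylabel("sum of within-cluster squared error")
    plt.show()

# Empirical rule of thumb from the slide: k ≈ sqrt(n / 2)
# k_guess = int(np.sqrt(len(X) / 2))
```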
Determine the Number of Clusters
Cross-validation method
  Divide the data set into m parts; use m − 1 parts to build a clustering model and the remaining part to test its quality
  E.g., for each point in the test set, find the closest centroid and sum the squared distances between the test points and their closest centroids to measure how well the model fits
  Repeat for different values of k and choose the k that fits the data best
Clusters found in Random Data
[Figure: four scatter plots of uniformly random points in the unit square (x, y ∈ [0, 1]); each panel shows the clusters an algorithm imposes on purely random data]
Assess whether non-random structure exists in the data by measuring the probability that the data was generated by a uniform distribution
Test spatial randomness with a statistical test: the Hopkins Statistic
This statistic examines whether objects in a data set differ significantly from uniformly distributed data in the data space
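One common formulation of the Hopkins Statistic as a sketch, assuming numpy and scipy are available; conventions differ between sources, so treat the exact normalization as an assumption:

```python
# Hopkins statistic sketch: compare nearest-neighbor distances of uniformly random
# probe points against those of sampled data points.
import numpy as np
from scipy.spatial import cKDTree

def hopkins(X, m=None, seed=0):
    """Estimate the Hopkins statistic of data matrix X (n_samples x n_features)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    m = m or max(1, n // 10)                  # number of probe points
    tree = cKDTree(X)

    # u_i: distance from m uniform random points (in the bounding box) to the data
    lo, hi = X.min(axis=0), X.max(axis=0)
    U = rng.uniform(lo, hi, size=(m, X.shape[1]))
    u, _ = tree.query(U, k=1)

    # w_i: distance from m sampled data points to their nearest *other* data point
    idx = rng.choice(n, size=m, replace=False)
    w, _ = tree.query(X[idx], k=2)            # k=2 because the first hit is the point itself
    w = w[:, 1]

    # H near 0.5 -> data looks uniform (random); H near 1 -> data is clustered
    return u.sum() / (u.sum() + w.sum())
```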
[Figure: an artificial data set and a real data set, side by side]
SCAN: Density-Based Clustering of Networks
How many clusters?
What size should they be?
What is the best partitioning?
Should some points be segregated?
Neighborhood of a vertex: Γ(v) = {w ∈ V | (v, w) ∈ E} ∪ {v}
[Figure: an example network]
Structure Similarity
The desired features tend to be captured by a measure we call structural similarity:
  σ(v, w) = |Γ(v) ∩ Γ(w)| / √(|Γ(v)| · |Γ(w)|)
Example: Γ(13) = {9, 13} and Γ(9) = {8, 9, 10, 12, 13}, so
  σ(9, 13) = |{9, 13}| / √(2 · 5) = 2 / √10 ≈ 0.63
Structural similarity is large for members of a clique and small for hubs and outliers
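A tiny check of the σ(9, 13) computation, using the two neighbourhoods given above:

```python
# Structural similarity sigma(v, w) = |Γ(v) ∩ Γ(w)| / sqrt(|Γ(v)| * |Γ(w)|),
# checked on the neighbourhoods from the example.
from math import sqrt

def structural_similarity(gamma_v, gamma_w):
    return len(gamma_v & gamma_w) / sqrt(len(gamma_v) * len(gamma_w))

gamma = {13: {9, 13}, 9: {8, 9, 10, 12, 13}}
print(round(structural_similarity(gamma[9], gamma[13]), 2))   # -> 0.63
```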
Structure Similarity (Exercise)
The desired features tend to be captured by a measure we call structural similarity:
  σ(v, w) = |Γ(v) ∩ Γ(w)| / √(|Γ(v)| · |Γ(w)|)
Structural Connectivity [1]
ε-neighborhood: N_ε(v) = {w ∈ Γ(v) | σ(v, w) ≥ ε}
Core: CORE_{ε,μ}(v) ⇔ |N_ε(v)| ≥ μ   (μ acts as a "popularity" threshold)
Direct structure reachable: DirREACH_{ε,μ}(v, w) ⇔ CORE_{ε,μ}(v) ∧ w ∈ N_ε(v)
Structure reachable: transitive closure of direct structure reachability
Structure connected: CONNECT_{ε,μ}(v, w) ⇔ ∃u ∈ V : REACH_{ε,μ}(u, v) ∧ REACH_{ε,μ}(u, w)
Structure-connected cluster C
  Connectivity: ∀v, w ∈ C : CONNECT_{ε,μ}(v, w)
  Maximality: ∀v, w ∈ V : v ∈ C ∧ REACH_{ε,μ}(v, w) ⇒ w ∈ C
Hubs:
  Do not belong to any cluster
  Bridge to many clusters
Outliers:
  Do not belong to any cluster
  Connect to few clusters
[Figure: the example network (vertices 0–13) with μ = 2 and ε = 0.7, showing its structure-connected clusters, a hub, and an outlier]
Algorithm
[Figures: a step-by-step run of SCAN on the example network with μ = 2 and ε = 0.7. The algorithm visits each vertex in turn, computes the structural similarity to its neighbors (e.g., σ(9, 13) = 2/√10 ≈ 0.63 for the pair 9–13, and values such as 0.67, 0.82, 0.75, 0.73, 0.68, and 0.51 on other edges), grows structure-connected clusters from core vertices, and finally labels the unassigned vertices as hubs or outliers.]
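A compact sketch of the SCAN procedure walked through above, assuming the graph is given as an adjacency dict mapping each vertex to the set of its neighbours; the function names and the simple hub test are illustrative:

```python
# SCAN sketch: epsilon-neighborhoods, cores, cluster growth, hubs, and outliers.
from math import sqrt
from collections import deque

def gamma(adj, v):
    return adj[v] | {v}                      # structural neighborhood includes v itself

def sigma(adj, v, w):
    gv, gw = gamma(adj, v), gamma(adj, w)
    return len(gv & gw) / sqrt(len(gv) * len(gw))

def scan(adj, eps=0.7, mu=2):
    eps_nbrs = {v: {w for w in gamma(adj, v) if sigma(adj, v, w) >= eps} for v in adj}
    core = {v for v in adj if len(eps_nbrs[v]) >= mu}
    label, cid = {}, 0                       # vertex -> cluster id
    for v in adj:
        if v not in core or v in label:
            continue
        cid += 1
        queue = deque([v])
        while queue:                         # grow one structure-connected cluster
            u = queue.popleft()
            if u in label:
                continue
            label[u] = cid
            if u in core:                    # only core vertices extend the cluster
                queue.extend(w for w in eps_nbrs[u] if w not in label)
    unclassified = set(adj) - set(label)
    # hubs bridge at least two clusters; the remaining vertices are outliers
    hubs = {v for v in unclassified
            if len({label[w] for w in adj[v] if w in label}) >= 2}
    outliers = unclassified - hubs
    return label, hubs, outliers
```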
Summary
Cluster analysis groups objects based on their similarity and has
wide applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
Hierarchical clustering algorithms
Density-based algorithms
Quality of clustering results can be evaluated in various ways
References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points
to identify the clustering structure, SIGMOD’99.
Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02
M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based
Local Outliers. SIGMOD 2000.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB’98.
V. Ganti, J. Gehrke, and R. Ramakrishnan. CACTUS: Clustering Categorical Data
Using Summaries. KDD'99.
References (2)
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March
1999.
A. Hinneburg and D. A. Keim. An Efficient Approach to Clustering in Large
Multimedia Databases with Noise. KDD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering
Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999.
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.
References (3)
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to
Clustering. John Wiley and Sons, 1988.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A
Review, SIGKDD Explorations, 6(1), June 2004
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering
in Large Databases, ICDT'01.
A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles,
ICDE'01
H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data
sets, SIGMOD’02
W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial
Data Mining. VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method
for very large databases. SIGMOD'96.
X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous Semantic
Links”, VLDB'06