Clustering Others Evaluation

Density-based clustering methods group together points that are closely packed and can discover clusters of arbitrary shape. DBSCAN is a popular density-based algorithm that finds clusters of arbitrary shape and handles noise while requiring only one scan of the data. CLIQUE is a grid-based method that identifies dense subspaces to allow better clustering, partitioning the data into rectangular units and taking maximal sets of connected dense units as clusters. It works in several steps: partitioning the data, identifying dense subspaces and the clusters within them, and generating minimal descriptions of the clusters.


Density-Based Clustering Methods

 Clustering based on density (a local cluster criterion), such as density-connected points
 Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters as termination condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)
 OPTICS: Ankerst, et al. (SIGMOD’99)
 DENCLUE: Hinneburg & Keim (KDD’98)
 CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)

1
Density-Based Clustering: Basic Concepts
 Two parameters:
 Eps: maximum radius of the neighbourhood
 MinPts: minimum number of points in an Eps-neighbourhood of that point
 N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
 Core point condition: |N_Eps(q)| ≥ MinPts

2
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
 Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases
with noise

[Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5]

3
Density-Based Clustering: Basic Concepts
 Two parameters:
 Eps: maximum radius of the neighbourhood
 MinPts: minimum number of points in an Eps-neighbourhood of that point
 N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
 Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
 p belongs to N_Eps(q)
 core point condition: |N_Eps(q)| ≥ MinPts

[Figure: p lies in the Eps-neighbourhood of core point q; MinPts = 5, Eps = 1 cm]

4
Density-Reachable and Density-Connected
 Density-reachable:
 A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
 Density-connected:
 A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

[Figures: a chain of points from q to p illustrating density-reachability; points p and q both density-reachable from o]
5
DBSCAN: The Algorithm
 Arbitrarily select a point p
 Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
 If p is a core point, a cluster is formed
 If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database
 Continue the process until all of the points have been
processed

6
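To make the steps above concrete, here is a minimal Python sketch of DBSCAN (not the original Ester et al. implementation); the brute-force neighborhood search, the point coordinates, and the parameter values are illustrative assumptions.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (the Eps-neighborhood)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:          # not a core point
            labels[i] = NOISE                 # may later be claimed as a border point
            continue
        labels[i] = cluster_id                # start a new cluster from this core point
        seeds = list(neighbors)
        while seeds:                          # expand via density-reachability
            j = seeds.pop()
            if labels[j] == NOISE:            # border point: reachable but not core
                labels[j] = cluster_id
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is itself a core point: keep expanding
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels

# Tiny usage example (coordinates are made up): two clusters and one noise point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (20, 20)]
print(dbscan(pts, eps=2.0, min_pts=3))   # -> [0, 0, 0, 0, 1, 1, 1, -1]
```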
An Example (MinPts = 4, Eps = ε)

[Figure: a cluster C1 discovered from density-reachable points]
DBSCAN: Determining EPS and MinPts
 The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
 Noise points have their kth nearest neighbor at a farther distance
 So, plot the sorted distance of every point to its kth nearest neighbor
 A sharp change (knee) in this k-dist curve corresponds to a suitable value of Eps, with the chosen k used as MinPts
DBSCAN: Determining EPS and MinPts
 If k is too large => small clusters (of size less than k) are likely to be labeled as noise
 If k is too small => even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters
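A small sketch of the k-dist heuristic just described, assuming numpy and matplotlib are available; the choice k = 4 and the synthetic data are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def k_dist_plot(X, k=4):
    """Plot each point's distance to its k-th nearest neighbor, sorted descending.
    The 'knee' (sharp change) suggests a value for Eps, with MinPts = k."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (brute force; fine for small data sets).
    diffs = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1))
    d.sort(axis=1)                      # row i: distances from point i, ascending
    k_dist = d[:, k]                    # column 0 is the point itself (distance 0)
    plt.plot(np.sort(k_dist)[::-1])
    plt.xlabel("points sorted by k-dist")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.show()

# Usage with synthetic data: two dense blobs plus sparse noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2)),
               rng.uniform(-2, 7, (10, 2))])
k_dist_plot(X, k=4)
```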
Grid-Based Clustering Method

 Using a multi-resolution grid data structure
 Several interesting methods
 STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
 WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
 A multi-resolution clustering approach using the wavelet method
 CLIQUE: Agrawal, et al. (SIGMOD’98)
 Both grid-based and subspace clustering

10
CLIQUE (Clustering In QUEst)

 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
 Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
 CLIQUE can be considered both density-based and grid-based
 It partitions each dimension into the same number of equal-length intervals
 It partitions an m-dimensional data space into non-overlapping rectangular units
 A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter
 A cluster is a maximal set of connected dense units within a subspace

11
CLIQUE: The Major Steps

 Partition the data space and find the number of points that lie inside each cell of the partition
 Identify the subspaces that contain clusters using the Apriori principle
 Identify clusters
 Determine dense units in all subspaces of interest
 Determine connected dense units in all subspaces of interest
 Generate minimal descriptions for the clusters
 Determine maximal regions that cover a cluster of connected dense units for each cluster
 Determine a minimal cover for each cluster

12
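A minimal sketch of the first two steps (counting points per grid unit and Apriori-style pruning of 2-D candidates). The data points are those of the worked example that follows; the unit width of 1 and the density threshold of 2 points per unit are assumptions chosen to reproduce that example, not CLIQUE's actual fraction-based parameter.

```python
from collections import Counter

def dense_units_1d(points, dim, width=1.0, min_count=2):
    """Count points per unit interval along one dimension; keep the dense units."""
    counts = Counter(int(p[dim] // width) for p in points)
    return {u for u, c in counts.items() if c >= min_count}

def dense_units_2d(points, dense_a, dense_b, width=1.0, min_count=2):
    """Apriori step: a 2-D unit can only be dense if both 1-D projections are dense."""
    candidates = {(ua, ub) for ua in dense_a for ub in dense_b}
    counts = Counter((int(p[0] // width), int(p[1] // width)) for p in points)
    return {u for u in candidates if counts.get(u, 0) >= min_count}

# The 11 points (attributes A1, A2) from the worked example that follows.
data = [(0.1, 0.2), (0.5, 0.5), (0.2, 1.2), (0.7, 1.1), (0.5, 1.8), (0.4, 4.7),
        (1.5, 3.5), (2.2, 2.2), (2.3, 3.4), (2.5, 2.1), (2.8, 2.6)]

d1 = dense_units_1d(data, dim=0)     # dense A1 intervals: {0, 2}   (0~1 and 2~3)
d2 = dense_units_1d(data, dim=1)     # dense A2 intervals: {0, 1, 2, 3}
print(dense_units_2d(data, d1, d2))  # dense 2-D units: (0, 0), (0, 1), (2, 2)
```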
Example
• Start at 1-D space and discretize numerical intervals in each axis into a grid

     A1    A2
X1   0.1   0.2
X2   0.5   0.5
X3   0.2   1.2
X4   0.7   1.1
X5   0.5   1.8
X6   0.4   4.7
X7   1.5   3.5
X8   2.2   2.2
X9   2.3   3.4
X10  2.5   2.1
X11  2.8   2.6

[Figure: the 11 points plotted on the A1 (0–3) × A2 (0–5) grid]
13
• Start at 1-D space and discretize numerical intervals in each axis into a grid

[Figure: the data table (X1–X11) and the points plotted on the A1 × A2 grid, as on the previous slide]

Counts per 1-D interval:

A1 interval  count     A2 interval  count
0~1          6         0~1          2
1~2          1         1~2          3
2~3          4         2~3          3
                       3~4          2
                       4~5          1
14
A1 interval  count     A2 interval  count
0~1          6         0~1          2
1~2          1         1~2          3
2~3          4         2~3          3
                       3~4          2
                       4~5          1

• Find dense regions (clusters) in each subspace and generate their minimal descriptions

Dense regions                   Minimal descriptions
A1 : 0~1, 2~3                   A1 : 0~1, 2~3
A2 : 0~1, 1~2, 2~3, 3~4         A2 : 0~4

15
• Use the dense regions to find promising candidates in 2-D space based on the Apriori principle

Dense regions
A1 : 0~1, 2~3
A2 : 0~1, 1~2, 2~3, 3~4

[Figure: the points plotted on the A1 × A2 grid]

Counts in the candidate 2-D units:

            A2: 0~1   1~2   2~3   3~4
A1: 0~1         2      3     0     0
A1: 2~3         0      0     3     1

16
            A2: 0~1   1~2   2~3   3~4
A1: 0~1         2      3     0     0
A1: 2~3         0      0     3     1

[Figure: the dense 2-D units highlighted on the A1 × A2 grid]

Dense regions
(A1, A2) : (0~1, 0~1), (0~1, 1~2), (2~3, 2~3)
Minimal descriptions
(A1, A2) : (0~1, 0~2), (2~3, 2~3)

• Repeat the above in a level-wise manner in higher-dimensional subspaces
17
Example

• Start at 1-D space and discretize numerical intervals in each axis into a grid
• Find dense regions (clusters) in each subspace and generate their minimal descriptions
• Use the dense regions to find promising candidates in 2-D space based on the Apriori principle
18
• Repeat the above in a level-wise manner in higher-dimensional subspaces

19
Strength and Weakness of CLIQUE

 Strength
 automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
 insensitive to the order of records in input and does not presume some canonical data distribution
 scales linearly with the size of input and has good scalability as the number of dimensions in the data increases
 Weakness
 the accuracy of the clustering result may be degraded, the price paid for the simplicity of the method

20
Major Clustering Approaches

 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using some criterion
 Typical methods: Diana, Agnes, BIRCH, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
 Grid-based approach:
 Based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE

21
Quality: What Is Good Clustering?

 A good clustering method will produce high-quality clusters:
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters

22
Measuring Clustering Quality

 Two methods: extrinsic vs. intrinsic
 Extrinsic: supervised, i.e., the ground truth is available
 Compare a clustering against the ground truth using some clustering quality measure
 Intrinsic: unsupervised, i.e., the ground truth is unavailable
 Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact they are
 Ex. Silhouette coefficient

23
Sec. 16.3

External criteria for clustering quality


 Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold-standard data
 Assesses a clustering with respect to ground truth … requires labeled data
 Assume documents with C gold-standard classes, while our clustering algorithm produces K clusters, ω1, ω2, …, ωK, where cluster ωi has ni members
Sec. 16.3

External Evaluation of Cluster Quality


 Simple measure: purity, the ratio between the number of members of the dominant class in cluster ωi and the size of cluster ωi:

    Purity(ωi) = (1/ni) · max_j (nij),   j ∈ C

 Biased because having n clusters (one point per cluster) maximizes purity
 Other measures: the entropy of classes in clusters (or the mutual information between classes and clusters)
Sec. 16.3

Purity example

[Figure: n = 17 points in three clusters. Cluster I: 5 ×, 1 ○; Cluster II: 1 ×, 4 ○, 1 ◇; Cluster III: 2 ×, 3 ◇]

Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6

Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6

Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5
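A small sketch that recomputes the purities above; the per-cluster class labels are read off the example figure (x, o, and d standing for the three classes).

```python
from collections import Counter

def purity(class_labels):
    """Purity of one cluster: fraction of its members belonging to the dominant class."""
    counts = Counter(class_labels)
    return max(counts.values()) / len(class_labels)

# Gold-standard class of each point, per cluster, as in the example above.
clusters = {
    "I":   ["x"] * 5 + ["o"],
    "II":  ["o"] * 4 + ["x", "d"],
    "III": ["d"] * 3 + ["x"] * 2,
}
for name, labels in clusters.items():
    print(name, purity(labels))   # 5/6, 4/6, 3/5 as on the slide
```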


Sec. 16.3

The Rand Index measures agreement between pairwise decisions. Here RI = 0.68.

Number of point pairs              Same cluster in clustering    Different clusters in clustering
Same class in ground truth         20 (TP)                       24 (FN)
Different classes in ground truth  20 (FP)                       72 (TN)
Sec. 16.3
Example
Number of point pairs              Same cluster in clustering    Different clusters in clustering    Total
Same class in ground truth
Different classes in ground truth
Total                              40                                                                136

[Figure: the n = 17 points in Clusters I, II, and III, as in the purity example]

Sec. 16.3
Number of point pairs              Same cluster in clustering    Different clusters in clustering    Total
Same class in ground truth         20                            24
Different classes in ground truth  20                            72
Total                              40                            96                                  136

[Figure: the n = 17 points in Clusters I, II, and III]

FN + TN = 136 − 40 = 96
FN = (5·3 + 1·2) + (1·4) + (1·3) = 24
TN = 96 − 24 = 72
Sec. 16.3

The Rand Index measures agreement between pairwise decisions. Here RI = 0.68.

Number of point pairs              Same cluster in clustering    Different clusters in clustering
Same class in ground truth         20  TP (A)                    24  FN (C)
Different classes in ground truth  20  FP (B)                    72  TN (D)
Sec. 16.3

Rand index and Cluster F-measure

A D
RI 
A B C  D
Compare with standard Precision and Recall:
A A
P R
A B AC
People also define and use a cluster F-measure, which is
probably a better measure.
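A minimal sketch that counts A, B, C, D over all point pairs and computes the Rand index for the 17-point example above; the label encoding is an illustrative assumption.

```python
from itertools import combinations

def rand_index(cluster_ids, class_ids):
    """Rand index over all point pairs: (A + D) / (A + B + C + D)."""
    a = b = c = d = 0
    for (c1, g1), (c2, g2) in combinations(zip(cluster_ids, class_ids), 2):
        same_cluster, same_class = (c1 == c2), (g1 == g2)
        if same_cluster and same_class:       a += 1   # TP
        elif same_cluster and not same_class: b += 1   # FP
        elif not same_cluster and same_class: c += 1   # FN
        else:                                 d += 1   # TN
    return (a + d) / (a + b + c + d)

# The 17-point example: cluster assignment and gold-standard class per point.
clusters = ["I"] * 6 + ["II"] * 6 + ["III"] * 5
classes  = ["x"] * 5 + ["o"] + ["o"] * 4 + ["x", "d"] + ["d"] * 3 + ["x"] * 2
print(rand_index(clusters, classes))   # (20 + 72) / 136 ≈ 0.68
```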
Sec. 16.3

Exercise

[Figure: n = 13 points grouped into Cluster I and Cluster II]

RI = ?
Intrinsic Methods

[Slide content garbled in the source; intrinsic measures such as the silhouette coefficient evaluate how compact and well separated the clusters are]
34
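The silhouette coefficient, named earlier as the example intrinsic measure, can be sketched as follows, using the standard definition s(i) = (b(i) − a(i)) / max(a(i), b(i)) and Euclidean distances; the brute-force pairwise computation and the toy data are illustrative assumptions.

```python
import numpy as np

def silhouette_coefficient(X, labels):
    """Mean silhouette over all points, where a(i) is the mean distance from point i
    to its own cluster and b(i) is the mean distance to the nearest other cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # pairwise distances
    scores = []
    for i in range(len(X)):
        own = (labels == labels[i])
        own[i] = False                                # exclude the point itself
        a = d[i, own].mean() if own.any() else 0.0
        b = min(d[i, labels == other].mean()
                for other in set(labels.tolist()) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Usage: two well-separated blobs should score close to 1.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(silhouette_coefficient(X, [0, 0, 0, 1, 1, 1]))
```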
Sec. 16.3

Exercise

[Figure: Cluster I with pairwise distances between its members (values shown include 2, 10, 12, 16, 14, 7, 7, 6, 10)]

Cluster I's fitness = ?
Determine the Number of Clusters

 Empirical method
 # of clusters ≈ √(n/2) for a dataset of n points
 Elbow method
 Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters

36
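A sketch of the elbow heuristic: run k-means for increasing k and watch the within-cluster sum of squared errors. The basic Lloyd's k-means, the synthetic three-blob data, and the range of k are illustrative assumptions.

```python
import numpy as np

def kmeans_sse(X, k, n_iter=100, seed=0):
    """Run basic Lloyd's k-means and return the sum of within-cluster squared error."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[assign == j].mean(0) if (assign == j).any() else centers[j]
                            for j in range(k)])
    assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    return float(((X - centers[assign]) ** 2).sum())

# Elbow curve: print the SSE for k = 1..6 and look for the turning point.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in [(0, 0), (5, 5), (0, 5)]])
for k in range(1, 7):
    print(k, round(kmeans_sse(X, k), 1))   # the SSE curve flattens after k = 3
```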
Example

Start with k=1

37
Example
Now try k=2

38
Example
Now try k=3

39
Example
Now try k=4

40
41
Determine the Number of Clusters

 Cross-validation method
 Divide a given data set into m parts
 Use m – 1 parts to obtain a clustering model
 Use the remaining part to test the quality of the clustering
 E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set
 For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k’s, and find the # of clusters that fits the data best
42
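A sketch of the cross-validation procedure described above, again using a simple Lloyd's k-means as the clustering model; the fold count m = 5 and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_kmeans(X, k, n_iter=50, seed=0):
    """Basic Lloyd's k-means; returns the k cluster centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[assign == j].mean(0) if (assign == j).any() else centers[j]
                            for j in range(k)])
    return centers

def cv_score(X, k, m=5, seed=0):
    """m-fold cross-validation: fit centroids on m - 1 folds, then sum the squared
    distance of each held-out point to its closest centroid (lower = better fit)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), m)
    total = 0.0
    for f in range(m):
        test = X[folds[f]]
        train = X[np.concatenate([folds[g] for g in range(m) if g != f])]
        centers = fit_kmeans(train, k)
        total += ((test[:, None] - centers[None]) ** 2).sum(-1).min(1).sum()
    return float(total)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in [(0, 0), (4, 4), (0, 4)]])
for k in range(1, 7):
    print(k, round(cv_score(X, k), 1))   # quality stops improving much past k = 3
```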
Assessing Clustering Tendency
 Clustering requires a nonuniform distribution of data!

43
Clusters found in Random Data
[Figure: four scatter plots of the same uniformly random points (x, y in [0, 1]): the raw random points and the clusters reported by DBSCAN, k-means, and complete link; each algorithm reports clusters even though the data has no real structure]

44
 Assess whether non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution
 Test spatial randomness by a statistical test: the Hopkins statistic
 This statistic examines whether objects in a data set differ significantly from the assumption that they are uniformly distributed in the multidimensional space
 It compares the distances wi between the real objects and their nearest neighbors to the distances qi between artificial objects, uniformly generated over the data space, and their nearest real neighbors
 The process is repeated several times for a fraction of the total population. After that, the Hind statistic is computed (one common form is sketched below)

45
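The slide does not reproduce the formula. A common form consistent with the description above (about 0.5 for uniformly distributed data, rising toward 1 when clusters are present) is Hind = Σ qi / (Σ qi + Σ wi); the Python sketch below assumes that form, with the sampling fraction, the bounding-box generation of artificial points, and the toy data as further illustrative assumptions.

```python
import numpy as np

def hopkins(X, m=None, seed=0):
    """Hopkins statistic H = sum(q_i) / (sum(q_i) + sum(w_i)), where w_i are
    nearest-real-neighbor distances for m sampled real points and q_i are
    nearest-real-neighbor distances for m artificial points drawn uniformly
    over the bounding box of the data. ~0.5 for uniform data, -> 1 for clustered."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    m = m or max(1, n // 10)                       # sample a fraction of the data
    sample = X[rng.choice(n, m, replace=False)]
    artificial = rng.uniform(X.min(0), X.max(0), size=(m, X.shape[1]))

    def nn_dist(points, exclude_self):
        d = np.sqrt(((points[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        if exclude_self:
            d[d == 0] = np.inf                     # ignore the zero distance to itself
        return d.min(axis=1)

    w = nn_dist(sample, exclude_self=True)         # real points -> nearest real neighbor
    q = nn_dist(artificial, exclude_self=False)    # artificial points -> nearest real point
    return q.sum() / (q.sum() + w.sum())

rng = np.random.default_rng(3)
uniform = rng.uniform(0, 1, (200, 2))
clustered = np.vstack([rng.normal(c, 0.03, (100, 2)) for c in [(0.2, 0.2), (0.8, 0.8)]])
print(round(hopkins(uniform), 2), round(hopkins(clustered), 2))   # ~0.5 vs close to 1
```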
[Figure: nearest-neighbor distances for artificial vs. real objects]

 If objects are uniformly distributed, qi and wi will be similar, and the statistic will be close to 0.5
 If clusters are present, the distances for artificial objects will be larger than those for the real ones
 the artificial objects are homogeneously distributed whereas the real ones are grouped together, so the value of Hind will increase
 A value of Hind higher than 0.75 indicates a clustering tendency at the 90% confidence level
46
Hopkins statistic applied to two different data sets
 Open circles represent real objects, closed circles selected real objects, and asterisks represent artificial objects generated over the data space

[Figure: two scatter plots; left: H value = 0.49, right: H value = 0.73]

47
SCAN: Density-Based Clustering of
Networks
 How many clusters?
 What size should they be?
 What is the best partitioning?
 Should some points be segregated?

Search the graph to find well-connected components as clusters

An Example Network
 Application: Given only information about who associates with whom, could one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?
48
A Social Network Model
 Cliques, hubs and outliers
 Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group
 Individuals who are hubs know many people in different groups but belong to no single group. Politicians, for example, bridge multiple groups
 Individuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no group
 The Neighborhood of a Vertex
 Define Γ(v) as the immediate neighborhood of a vertex v (its adjacent vertices, including v itself)

[Figure: a vertex v and its neighborhood Γ(v)]
49
Structure Similarity
 The desired features tend to be captured by a measure we call structural similarity:

    σ(v, w) = |Γ(v) ∩ Γ(w)| / √(|Γ(v)| · |Γ(w)|)

 Example: Γ(13) = {9, 13}, Γ(9) = {8, 9, 10, 12, 13}, so
    σ(9, 13) = |{9, 13}| / √(2 · 5) = 2 / √10 ≈ 0.63

 Structural similarity is large for members of a clique and small for hubs and outliers
50
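A small sketch that evaluates σ(v, w) from adjacency sets and reproduces the σ(9, 13) ≈ 0.63 computation above; only the two neighborhoods shown on the slide are encoded.

```python
from math import sqrt

def structural_similarity(adj, v, w):
    """sigma(v, w) = |G(v) & G(w)| / sqrt(|G(v)| * |G(w)|), where G(x) is the
    closed neighborhood of x (its adjacent vertices plus x itself)."""
    gv = adj[v] | {v}
    gw = adj[w] | {w}
    return len(gv & gw) / sqrt(len(gv) * len(gw))

# Neighborhoods from the slide: vertex 9 is adjacent to 8, 10, 12, 13; vertex 13 only to 9.
adj = {9: {8, 10, 12, 13}, 13: {9}}
print(round(structural_similarity(adj, 9, 13), 2))   # 2 / sqrt(2 * 5) = 0.63
```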
Structure Similarity (Exercise)
 The desired features tend to be captured by a measure we call structural similarity:

    σ(v, w) = |Γ(v) ∩ Γ(w)| / √(|Γ(v)| · |Γ(w)|)
51
Structural Connectivity [1]
 ε-Neighborhood: N_ε(v) = {w ∈ Γ(v) | σ(v, w) ≥ ε}
 Core: CORE_ε,μ(v) ⇔ |N_ε(v)| ≥ μ   (μ is the popularity threshold)
 Direct structure reachable: DirREACH_ε,μ(v, w) ⇔ CORE_ε,μ(v) ∧ w ∈ N_ε(v)
 Structure reachable: transitive closure of direct structure reachability
 Structure connected: CONNECT_ε,μ(v, w) ⇔ ∃ u ∈ V : REACH_ε,μ(u, v) ∧ REACH_ε,μ(u, w)

[1] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu (KDD'96), “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases”
52
Structure-Connected Clusters

 Structure-connected cluster C
 Connectivity: ∀ v, w ∈ C : CONNECT_ε,μ(v, w)
 Maximality: ∀ v, w ∈ V : v ∈ C ∧ REACH_ε,μ(v, w) ⇒ w ∈ C
 Hubs:
 Do not belong to any cluster
 Bridge to many clusters
 Outliers:
 Do not belong to any cluster
 Connect to fewer clusters

[Figure: a network with structure-connected clusters, a hub bridging them, and an outlier]
53
Algorithm

[Figures, slides 54–66: a step-by-step run of SCAN on an example network of vertices 0–13 with μ = 2 and ε = 0.7. At each step the structural similarity between the current vertex and its neighbors is computed (values such as 0.63, 0.82, 0.75, 0.73, 0.67, 0.68, 0.51 appear on the edges); vertices whose ε-neighborhood contains at least μ vertices become cores, structure-connected clusters are grown from them, and the remaining vertices are classified as hubs or outliers.]
Summary
 Cluster analysis groups objects based on their similarity and has
wide applications
 Measure of similarity can be computed for various types of data
 Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
 K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
 Hierarchical clustering algorithms
 Density-based algorithms
 Quality of clustering results can be evaluated in various ways

67
References (1)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points
to identify the clustering structure, SIGMOD’99.
 Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02
 M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based
Local Outliers. SIGMOD 2000.
 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
 M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB’98.
 V. Ganti, J. Gehrke, R. Ramakrishnan. CACTUS: Clustering Categorical Data Using Summaries. KDD'99.

68
References (2)
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. In Proc. VLDB’98.
 S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
 S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March
1999.
 A. Hinneburg, D. A. Keim: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD’98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
 G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering
Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999.
 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.

69
References (3)
 G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
 R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
 L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A
Review, SIGKDD Explorations, 6(1), June 2004
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
 A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering
in Large Databases, ICDT'01.
 A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles,
ICDE'01
 H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data
sets,  SIGMOD’02
 W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB’97.
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering method
for very large databases. SIGMOD'96
 X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous Semantic
Links”, VLDB'06

70
