Clustering Others Evaluation

Density-based clustering methods group together points that are closely packed and can discover clusters of arbitrary shape. DBSCAN is a popular density-based algorithm that finds clusters of arbitrary shape and handles noise while requiring only one scan of the data. CLIQUE is a grid-based method that identifies dense subspaces to allow better clustering, partitioning the data into rectangular units and taking maximal sets of connected dense units as clusters. It works in several steps: partitioning the data, identifying dense subspaces and the clusters within them, and generating minimal descriptions of the clusters.


Density-Based Clustering Methods

 Clustering based on density (a local cluster criterion), such as density-connected points
 Major features:
 Discover clusters of arbitrary shape
 Handle noise
 One scan
 Need density parameters as termination condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)
 OPTICS: Ankerst, et al. (SIGMOD’99)
 DENCLUE: Hinneburg & Keim (KDD’98)
 CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)

1
Density-Based Clustering: Basic Concepts
 Two parameters:
 Eps: maximum radius of the neighbourhood
 MinPts: minimum number of points in an Eps-neighbourhood of that point
 N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
 Core point condition: |N_Eps(q)| ≥ MinPts

2
DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
 Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases
with noise

[Figure: core, border, and outlier points for Eps = 1 cm, MinPts = 5]

3
Density-Based Clustering: Basic Concepts
 Two parameters:
 Eps: maximum radius of the neighbourhood
 MinPts: minimum number of points in an Eps-neighbourhood of that point
 N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}
 Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
 p belongs to N_Eps(q)
 core point condition: |N_Eps(q)| ≥ MinPts

[Figure: p lies in the Eps-neighbourhood of core point q; MinPts = 5, Eps = 1 cm]

4
Density-Reachable and Density-Connected
 Density-reachable:
 A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, with p1 = q and pn = p, such that pi+1 is directly density-reachable from pi
 Density-connected:
 A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

[Figures: a chain of points from q to p illustrating density-reachability; points p and q both density-reachable from o]
5
DBSCAN: The Algorithm
 Arbitrarily select a point p
 Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
 If p is a core point, a cluster is formed
 If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database
 Continue the process until all of the points have been
processed

6
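To make the steps above concrete, here is a minimal Python sketch of DBSCAN (not the original Ester et al. implementation); the brute-force neighborhood search, the point coordinates, and the parameter values are illustrative assumptions.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (the Eps-neighborhood)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id, or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:          # not a core point
            labels[i] = NOISE                 # may later be claimed as a border point
            continue
        labels[i] = cluster_id                # start a new cluster from this core point
        seeds = list(neighbors)
        while seeds:                          # expand via density-reachability
            j = seeds.pop()
            if labels[j] == NOISE:            # border point: reachable but not core
                labels[j] = cluster_id
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is itself a core point: keep expanding
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels

# Tiny usage example (coordinates are made up): two clusters and one noise point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (20, 20)]
print(dbscan(pts, eps=2.0, min_pts=3))   # -> [0, 0, 0, 0, 1, 1, 1, -1]
```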
An Example (MinPts = 4, Eps = ε)

[Figure: a cluster C1 discovered from density-reachable points]
DBSCAN: Determining EPS and MinPts
 The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance
 Noise points have their kth nearest neighbor at a farther distance
 So, plot the sorted distance of every point to its kth nearest neighbor
 A sharp change (knee) in this k-dist curve corresponds to a suitable value of Eps, with the chosen k used as MinPts
DBSCAN: Determining EPS and MinPts
 If k is too large => small clusters (of size less than k) are likely to be labeled as noise
 If k is too small => even a small number of closely spaced points that are noise or outliers will be incorrectly labeled as clusters
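A small sketch of the k-dist heuristic just described, assuming numpy and matplotlib are available; the choice k = 4 and the synthetic data are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def k_dist_plot(X, k=4):
    """Plot each point's distance to its k-th nearest neighbor, sorted descending.
    The 'knee' (sharp change) suggests a value for Eps, with MinPts = k."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (brute force; fine for small data sets).
    diffs = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diffs ** 2).sum(axis=-1))
    d.sort(axis=1)                      # row i: distances from point i, ascending
    k_dist = d[:, k]                    # column 0 is the point itself (distance 0)
    plt.plot(np.sort(k_dist)[::-1])
    plt.xlabel("points sorted by k-dist")
    plt.ylabel(f"distance to {k}-th nearest neighbor")
    plt.show()

# Usage with synthetic data: two dense blobs plus sparse noise.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2)),
               rng.uniform(-2, 7, (10, 2))])
k_dist_plot(X, k=4)
```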
Grid-Based Clustering Method

 Using a multi-resolution grid data structure
 Several interesting methods
 STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
 WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB’98)
 A multi-resolution clustering approach using the wavelet method
 CLIQUE: Agrawal, et al. (SIGMOD’98)
 Both grid-based and subspace clustering

10
CLIQUE (Clustering In QUEst)

 Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD’98)
 Automatically identifies subspaces of a high-dimensional data space that allow better clustering than the original space
 CLIQUE can be considered both density-based and grid-based
 It partitions each dimension into the same number of equal-length intervals
 It partitions an m-dimensional data space into non-overlapping rectangular units
 A unit is dense if the fraction of total data points contained in the unit exceeds an input model parameter
 A cluster is a maximal set of connected dense units within a subspace

11
CLIQUE: The Major Steps

 Partition the data space and find the number of points that lie inside each cell of the partition
 Identify the subspaces that contain clusters using the Apriori principle
 Identify clusters
 Determine dense units in all subspaces of interest
 Determine connected dense units in all subspaces of interest
 Generate minimal descriptions for the clusters
 Determine maximal regions that cover a cluster of connected dense units for each cluster
 Determine a minimal cover for each cluster

12
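A minimal sketch of the first two steps (counting points per grid unit and Apriori-style pruning of 2-D candidates). The data points are those of the worked example that follows; the unit width of 1 and the density threshold of 2 points per unit are assumptions chosen to reproduce that example, not CLIQUE's actual fraction-based parameter.

```python
from collections import Counter

def dense_units_1d(points, dim, width=1.0, min_count=2):
    """Count points per unit interval along one dimension; keep the dense units."""
    counts = Counter(int(p[dim] // width) for p in points)
    return {u for u, c in counts.items() if c >= min_count}

def dense_units_2d(points, dense_a, dense_b, width=1.0, min_count=2):
    """Apriori step: a 2-D unit can only be dense if both 1-D projections are dense."""
    candidates = {(ua, ub) for ua in dense_a for ub in dense_b}
    counts = Counter((int(p[0] // width), int(p[1] // width)) for p in points)
    return {u for u in candidates if counts.get(u, 0) >= min_count}

# The 11 points (attributes A1, A2) from the worked example that follows.
data = [(0.1, 0.2), (0.5, 0.5), (0.2, 1.2), (0.7, 1.1), (0.5, 1.8), (0.4, 4.7),
        (1.5, 3.5), (2.2, 2.2), (2.3, 3.4), (2.5, 2.1), (2.8, 2.6)]

d1 = dense_units_1d(data, dim=0)     # dense A1 intervals: {0, 2}   (0~1 and 2~3)
d2 = dense_units_1d(data, dim=1)     # dense A2 intervals: {0, 1, 2, 3}
print(dense_units_2d(data, d1, d2))  # dense 2-D units: (0, 0), (0, 1), (2, 2)
```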
Example
• Start at 1-D space and discretize numerical intervals in each axis into a grid

     A1    A2
X1   0.1   0.2
X2   0.5   0.5
X3   0.2   1.2
X4   0.7   1.1
X5   0.5   1.8
X6   0.4   4.7
X7   1.5   3.5
X8   2.2   2.2
X9   2.3   3.4
X10  2.5   2.1
X11  2.8   2.6

[Figure: the 11 points plotted on the A1 (0–3) × A2 (0–5) grid]
13
• Start at 1-D space and discretize numerical intervals in each axis into a grid

[Figure: the data table (X1–X11) and the points plotted on the A1 × A2 grid, as on the previous slide]

Counts per 1-D interval:

A1 interval  count     A2 interval  count
0~1          6         0~1          2
1~2          1         1~2          3
2~3          4         2~3          3
                       3~4          2
                       4~5          1
14
A1 interval  count     A2 interval  count
0~1          6         0~1          2
1~2          1         1~2          3
2~3          4         2~3          3
                       3~4          2
                       4~5          1

• Find dense regions (clusters) in each subspace and generate their minimal descriptions

Dense regions                   Minimal descriptions
A1 : 0~1, 2~3                   A1 : 0~1, 2~3
A2 : 0~1, 1~2, 2~3, 3~4         A2 : 0~4

15
• Use the dense regions to find promising candidates in 2-D space based on the Apriori principle

Dense regions
A1 : 0~1, 2~3
A2 : 0~1, 1~2, 2~3, 3~4

[Figure: the points plotted on the A1 × A2 grid]

Counts in the candidate 2-D units:

            A2: 0~1   1~2   2~3   3~4
A1: 0~1         2      3     0     0
A1: 2~3         0      0     3     1

16
            A2: 0~1   1~2   2~3   3~4
A1: 0~1         2      3     0     0
A1: 2~3         0      0     3     1

[Figure: the dense 2-D units highlighted on the A1 × A2 grid]

Dense regions
(A1, A2) : (0~1, 0~1), (0~1, 1~2), (2~3, 2~3)
Minimal descriptions
(A1, A2) : (0~1, 0~2), (2~3, 2~3)

• Repeat the above in a level-wise manner in higher-dimensional subspaces
17
Example

• Start at 1-D space and discretize numerical intervals in each axis into a grid
• Find dense regions (clusters) in each subspace and generate their minimal descriptions
• Use the dense regions to find promising candidates in 2-D space based on the Apriori principle
18
• Repeat the above in a level-wise manner in higher-dimensional subspaces

19
Strength and Weakness of CLIQUE

 Strength
 automatically finds subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
 insensitive to the order of records in input and does not presume some canonical data distribution
 scales linearly with the size of input and has good scalability as the number of dimensions in the data increases
 Weakness
 the accuracy of the clustering result may be degraded, the price paid for the simplicity of the method

20
Major Clustering Approaches

 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using some criterion
 Typical methods: Diana, Agnes, BIRCH, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
 Grid-based approach:
 Based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE

21
Quality: What Is Good Clustering?

 A good clustering method will produce high-quality clusters:
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters

22
Measuring Clustering Quality

 Two methods: extrinsic vs. intrinsic
 Extrinsic: supervised, i.e., the ground truth is available
 Compare a clustering against the ground truth using some clustering quality measure
 Intrinsic: unsupervised, i.e., the ground truth is unavailable
 Evaluate the goodness of a clustering by considering how well the clusters are separated and how compact they are
 Ex. Silhouette coefficient

23
Sec. 16.3

External criteria for clustering quality


 Quality measured by its ability to discover some or all of the hidden patterns or latent classes in gold-standard data
 Assesses a clustering with respect to ground truth … requires labeled data
 Assume documents with C gold-standard classes, while our clustering algorithm produces K clusters, ω1, ω2, …, ωK, where cluster ωi has ni members
Sec. 16.3

External Evaluation of Cluster Quality


 Simple measure: purity, the ratio between the number of members of the dominant class in cluster ωi and the size of cluster ωi:

    Purity(ωi) = (1/ni) · max_j (nij),   j ∈ C

 Biased because having n clusters (one point per cluster) maximizes purity
 Other measures: the entropy of classes in clusters (or the mutual information between classes and clusters)
Sec. 16.3

Purity example

[Figure: n = 17 points in three clusters. Cluster I: 5 ×, 1 ○; Cluster II: 1 ×, 4 ○, 1 ◇; Cluster III: 2 ×, 3 ◇]

Cluster I: Purity = 1/6 (max(5, 1, 0)) = 5/6

Cluster II: Purity = 1/6 (max(1, 4, 1)) = 4/6

Cluster III: Purity = 1/5 (max(2, 0, 3)) = 3/5
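A small sketch that recomputes the purities above; the per-cluster class labels are read off the example figure (x, o, and d standing for the three classes).

```python
from collections import Counter

def purity(class_labels):
    """Purity of one cluster: fraction of its members belonging to the dominant class."""
    counts = Counter(class_labels)
    return max(counts.values()) / len(class_labels)

# Gold-standard class of each point, per cluster, as in the example above.
clusters = {
    "I":   ["x"] * 5 + ["o"],
    "II":  ["o"] * 4 + ["x", "d"],
    "III": ["d"] * 3 + ["x"] * 2,
}
for name, labels in clusters.items():
    print(name, purity(labels))   # 5/6, 4/6, 3/5 as on the slide
```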


Sec. 16.3

The Rand Index measures agreement between pairwise decisions. Here RI = 0.68.

Number of point pairs              Same cluster in clustering    Different clusters in clustering
Same class in ground truth         20 (TP)                       24 (FN)
Different classes in ground truth  20 (FP)                       72 (TN)
Sec. 16.3
Example
Number of point pairs              Same cluster in clustering    Different clusters in clustering    Total
Same class in ground truth
Different classes in ground truth
Total                              40                                                                136

[Figure: the n = 17 points in Clusters I, II, and III, as in the purity example]

Sec. 16.3
Number of point pairs              Same cluster in clustering    Different clusters in clustering    Total
Same class in ground truth         20                            24
Different classes in ground truth  20                            72
Total                              40                            96                                  136

[Figure: the n = 17 points in Clusters I, II, and III]

FN + TN = 136 − 40 = 96
FN = (5·3 + 1·2) + (1·4) + (1·3) = 24
TN = 96 − 24 = 72
Sec. 16.3

The Rand Index measures agreement between pairwise decisions. Here RI = 0.68.

Number of point pairs              Same cluster in clustering    Different clusters in clustering
Same class in ground truth         20  TP (A)                    24  FN (C)
Different classes in ground truth  20  FP (B)                    72  TN (D)
Sec. 16.3

Rand index and Cluster F-measure

A D
RI 
A B C  D
Compare with standard Precision and Recall:
A A
P R
A B AC
People also define and use a cluster F-measure, which is
probably a better measure.
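A minimal sketch that counts A, B, C, D over all point pairs and computes the Rand index for the 17-point example above; the label encoding is an illustrative assumption.

```python
from itertools import combinations

def rand_index(cluster_ids, class_ids):
    """Rand index over all point pairs: (A + D) / (A + B + C + D)."""
    a = b = c = d = 0
    for (c1, g1), (c2, g2) in combinations(zip(cluster_ids, class_ids), 2):
        same_cluster, same_class = (c1 == c2), (g1 == g2)
        if same_cluster and same_class:       a += 1   # TP
        elif same_cluster and not same_class: b += 1   # FP
        elif not same_cluster and same_class: c += 1   # FN
        else:                                 d += 1   # TN
    return (a + d) / (a + b + c + d)

# The 17-point example: cluster assignment and gold-standard class per point.
clusters = ["I"] * 6 + ["II"] * 6 + ["III"] * 5
classes  = ["x"] * 5 + ["o"] + ["o"] * 4 + ["x", "d"] + ["d"] * 3 + ["x"] * 2
print(rand_index(clusters, classes))   # (20 + 72) / 136 ≈ 0.68
```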
Sec. 16.3

Exercise

[Figure: n = 13 points grouped into Cluster I and Cluster II]

RI = ?
Intrinsic Methods

[Slide content garbled in the source; intrinsic measures such as the silhouette coefficient evaluate how compact and well separated the clusters are]
34
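The silhouette coefficient, named earlier as the example intrinsic measure, can be sketched as follows, using the standard definition s(i) = (b(i) − a(i)) / max(a(i), b(i)) and Euclidean distances; the brute-force pairwise computation and the toy data are illustrative assumptions.

```python
import numpy as np

def silhouette_coefficient(X, labels):
    """Mean silhouette over all points, where a(i) is the mean distance from point i
    to its own cluster and b(i) is the mean distance to the nearest other cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # pairwise distances
    scores = []
    for i in range(len(X)):
        own = (labels == labels[i])
        own[i] = False                                # exclude the point itself
        a = d[i, own].mean() if own.any() else 0.0
        b = min(d[i, labels == other].mean()
                for other in set(labels.tolist()) if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Usage: two well-separated blobs should score close to 1.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(silhouette_coefficient(X, [0, 0, 0, 1, 1, 1]))
```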
Sec. 16.3

Exercise

[Figure: Cluster I with pairwise distances between its members (values shown include 2, 10, 12, 16, 14, 7, 7, 6, 10)]

Cluster I's fitness = ?
Determine the Number of Clusters

 Empirical method
 # of clusters ≈ √(n/2) for a dataset of n points
 Elbow method
 Use the turning point in the curve of the sum of within-cluster variance w.r.t. the # of clusters

36
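A sketch of the elbow heuristic: run k-means for increasing k and watch the within-cluster sum of squared errors. The basic Lloyd's k-means, the synthetic three-blob data, and the range of k are illustrative assumptions.

```python
import numpy as np

def kmeans_sse(X, k, n_iter=100, seed=0):
    """Run basic Lloyd's k-means and return the sum of within-cluster squared error."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[assign == j].mean(0) if (assign == j).any() else centers[j]
                            for j in range(k)])
    assign = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
    return float(((X - centers[assign]) ** 2).sum())

# Elbow curve: print the SSE for k = 1..6 and look for the turning point.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in [(0, 0), (5, 5), (0, 5)]])
for k in range(1, 7):
    print(k, round(kmeans_sse(X, k), 1))   # the SSE curve flattens after k = 3
```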
Example

Start with k=1

37
Example
Now try k=2

38
Example
Now try k=3

39
Example
Now try k=4

40
41
Determine the Number of Clusters

 Cross-validation method
 Divide a given data set into m parts
 Use m – 1 parts to obtain a clustering model
 Use the remaining part to test the quality of the clustering
 E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances between all points in the test set and their closest centroids to measure how well the model fits the test set
 For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k’s, and find the # of clusters that fits the data best
42
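A sketch of the cross-validation procedure described above, again using a simple Lloyd's k-means as the clustering model; the fold count m = 5 and the synthetic data are illustrative assumptions.

```python
import numpy as np

def fit_kmeans(X, k, n_iter=50, seed=0):
    """Basic Lloyd's k-means; returns the k cluster centroids."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        assign = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        centers = np.array([X[assign == j].mean(0) if (assign == j).any() else centers[j]
                            for j in range(k)])
    return centers

def cv_score(X, k, m=5, seed=0):
    """m-fold cross-validation: fit centroids on m - 1 folds, then sum the squared
    distance of each held-out point to its closest centroid (lower = better fit)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), m)
    total = 0.0
    for f in range(m):
        test = X[folds[f]]
        train = X[np.concatenate([folds[g] for g in range(m) if g != f])]
        centers = fit_kmeans(train, k)
        total += ((test[:, None] - centers[None]) ** 2).sum(-1).min(1).sum()
    return float(total)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in [(0, 0), (4, 4), (0, 4)]])
for k in range(1, 7):
    print(k, round(cv_score(X, k), 1))   # quality stops improving much past k = 3
```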
Assessing Clustering Tendency
 Clustering requires a nonuniform distribution of data!

43
Clusters found in Random Data
[Figure: four scatter plots of the same uniformly random points (x, y in [0, 1]): the raw random points and the clusters reported by DBSCAN, k-means, and complete link; each algorithm reports clusters even though the data has no real structure]

44
 Assess whether non-random structure exists in the data by measuring the probability that the data is generated by a uniform data distribution
 Test spatial randomness by a statistical test: the Hopkins statistic
 This statistic examines whether objects in a data set differ significantly from the assumption that they are uniformly distributed in the multidimensional space
 It compares the distances wi between the real objects and their nearest neighbors to the distances qi between artificial objects, uniformly generated over the data space, and their nearest real neighbors
 The process is repeated several times for a fraction of the total population. After that, the Hind statistic is computed (one common form is sketched below)

45
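The slide does not reproduce the formula. A common form consistent with the description above (about 0.5 for uniformly distributed data, rising toward 1 when clusters are present) is Hind = Σ qi / (Σ qi + Σ wi); the Python sketch below assumes that form, with the sampling fraction, the bounding-box generation of artificial points, and the toy data as further illustrative assumptions.

```python
import numpy as np

def hopkins(X, m=None, seed=0):
    """Hopkins statistic H = sum(q_i) / (sum(q_i) + sum(w_i)), where w_i are
    nearest-real-neighbor distances for m sampled real points and q_i are
    nearest-real-neighbor distances for m artificial points drawn uniformly
    over the bounding box of the data. ~0.5 for uniform data, -> 1 for clustered."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    m = m or max(1, n // 10)                       # sample a fraction of the data
    sample = X[rng.choice(n, m, replace=False)]
    artificial = rng.uniform(X.min(0), X.max(0), size=(m, X.shape[1]))

    def nn_dist(points, exclude_self):
        d = np.sqrt(((points[:, None, :] - X[None, :, :]) ** 2).sum(-1))
        if exclude_self:
            d[d == 0] = np.inf                     # ignore the zero distance to itself
        return d.min(axis=1)

    w = nn_dist(sample, exclude_self=True)         # real points -> nearest real neighbor
    q = nn_dist(artificial, exclude_self=False)    # artificial points -> nearest real point
    return q.sum() / (q.sum() + w.sum())

rng = np.random.default_rng(3)
uniform = rng.uniform(0, 1, (200, 2))
clustered = np.vstack([rng.normal(c, 0.03, (100, 2)) for c in [(0.2, 0.2), (0.8, 0.8)]])
print(round(hopkins(uniform), 2), round(hopkins(clustered), 2))   # ~0.5 vs close to 1
```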
[Figure: nearest-neighbor distances for artificial vs. real objects]

 If objects are uniformly distributed, qi and wi will be similar, and the statistic will be close to 0.5
 If clusters are present, the distances for artificial objects will be larger than those for the real ones
 the artificial objects are homogeneously distributed whereas the real ones are grouped together, so the value of Hind will increase
 A value of Hind higher than 0.75 indicates a clustering tendency at the 90% confidence level
46
Hopkins statistic applied to two different data sets
 Open circles represent real objects, closed circles selected real objects, and asterisks represent artificial objects generated over the data space

[Figure: two scatter plots; left: H value = 0.49, right: H value = 0.73]

47
SCAN: Density-Based Clustering of
Networks
 How many clusters?
 What size should they be?
 What is the best partitioning?
 Should some points be segregated?

Search the graph to find well-connected components as clusters

An Example Network
 Application: Given only information about who associates with whom, could one identify clusters of individuals with common interests or special relationships (families, cliques, terrorist cells)?
48
A Social Network Model
 Cliques, hubs and outliers
 Individuals in a tight social group, or clique, know many of the same people, regardless of the size of the group
 Individuals who are hubs know many people in different groups but belong to no single group. Politicians, for example, bridge multiple groups
 Individuals who are outliers reside at the margins of society. Hermits, for example, know few people and belong to no group
 The Neighborhood of a Vertex
 Define Γ(v) as the immediate neighborhood of a vertex v (its adjacent vertices, including v itself)

[Figure: a vertex v and its neighborhood Γ(v)]
49
Structure Similarity
 The desired features tend to be captured by a measure we call structural similarity:

    σ(v, w) = |Γ(v) ∩ Γ(w)| / √(|Γ(v)| · |Γ(w)|)

 Example: Γ(13) = {9, 13}, Γ(9) = {8, 9, 10, 12, 13}, so
    σ(9, 13) = |{9, 13}| / √(2 · 5) = 2 / √10 ≈ 0.63

 Structural similarity is large for members of a clique and small for hubs and outliers
50
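A small sketch that evaluates σ(v, w) from adjacency sets and reproduces the σ(9, 13) ≈ 0.63 computation above; only the two neighborhoods shown on the slide are encoded.

```python
from math import sqrt

def structural_similarity(adj, v, w):
    """sigma(v, w) = |G(v) & G(w)| / sqrt(|G(v)| * |G(w)|), where G(x) is the
    closed neighborhood of x (its adjacent vertices plus x itself)."""
    gv = adj[v] | {v}
    gw = adj[w] | {w}
    return len(gv & gw) / sqrt(len(gv) * len(gw))

# Neighborhoods from the slide: vertex 9 is adjacent to 8, 10, 12, 13; vertex 13 only to 9.
adj = {9: {8, 10, 12, 13}, 13: {9}}
print(round(structural_similarity(adj, 9, 13), 2))   # 2 / sqrt(2 * 5) = 0.63
```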
Structure Similarity (Exercise)
 The desired features tend to be captured by a measure we call structural similarity:

    σ(v, w) = |Γ(v) ∩ Γ(w)| / √(|Γ(v)| · |Γ(w)|)
51
Structural Connectivity [1]
 ε-Neighborhood: N_ε(v) = {w ∈ Γ(v) | σ(v, w) ≥ ε}
 Core: CORE_ε,μ(v) ⇔ |N_ε(v)| ≥ μ   (μ is the popularity threshold)
 Direct structure reachable: DirREACH_ε,μ(v, w) ⇔ CORE_ε,μ(v) ∧ w ∈ N_ε(v)
 Structure reachable: transitive closure of direct structure reachability
 Structure connected: CONNECT_ε,μ(v, w) ⇔ ∃ u ∈ V : REACH_ε,μ(u, v) ∧ REACH_ε,μ(u, w)

[1] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu (KDD'96), “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases”
52
Structure-Connected Clusters

 Structure-connected cluster C
 Connectivity: ∀ v, w ∈ C : CONNECT_ε,μ(v, w)
 Maximality: ∀ v, w ∈ V : v ∈ C ∧ REACH_ε,μ(v, w) ⇒ w ∈ C
 Hubs:
 Do not belong to any cluster
 Bridge to many clusters
 Outliers:
 Do not belong to any cluster
 Connect to fewer clusters

[Figure: a network with structure-connected clusters, a hub bridging them, and an outlier]
53
Algorithm

[Figures, slides 54–66: a step-by-step run of SCAN on an example network of vertices 0–13 with μ = 2 and ε = 0.7. At each step the structural similarity between the current vertex and its neighbors is computed (values such as 0.63, 0.82, 0.75, 0.73, 0.67, 0.68, 0.51 appear on the edges); vertices whose ε-neighborhood contains at least μ vertices become cores, structure-connected clusters are grown from them, and the remaining vertices are classified as hubs or outliers.]
Summary
 Cluster analysis groups objects based on their similarity and has
wide applications
 Measure of similarity can be computed for various types of data
 Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
 K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
 Hierarchical clustering algorithms
 Density-based algorithms
 Quality of clustering results can be evaluated in various ways

67
References (1)
 R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace
clustering of high dimensional data for data mining applications. SIGMOD'98
 M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
 M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points
to identify the clustering structure, SIGMOD’99.
 Beil F., Ester M., Xu X.: "Frequent Term-Based Text Clustering", KDD'02
 M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying Density-Based
Local Outliers. SIGMOD 2000.
 M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for
discovering clusters in large spatial databases. KDD'96.
 M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial
databases: Focusing techniques for efficient class identification. SSD'95.
 D. Fisher. Knowledge acquisition via incremental conceptual clustering.
Machine Learning, 2:139-172, 1987.
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. VLDB’98.
 V. Ganti, J. Gehrke, R. Ramakrishnan. CACTUS: Clustering Categorical Data Using Summaries. KDD'99.

68
References (2)
 D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An
approach based on dynamic systems. In Proc. VLDB’98.
 S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for
large databases. SIGMOD'98.
 S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for
categorical attributes. In ICDE'99, pp. 512-521, Sydney, Australia, March
1999.
 A. Hinneburg, D. A. Keim: An Efficient Approach to Clustering in Large Multimedia Databases with Noise. KDD’98.
 A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
 G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A Hierarchical Clustering
Algorithm Using Dynamic Modeling. COMPUTER, 32(8): 68-75, 1999.
 L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to
Cluster Analysis. John Wiley & Sons, 1990.
 E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large
datasets. VLDB’98.

69
References (3)
 G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
 R. Ng and J. Han. Efficient and effective clustering method for spatial data mining.
VLDB'94.
 L. Parsons, E. Haque and H. Liu, Subspace Clustering for High Dimensional Data: A
Review, SIGKDD Explorations, 6(1), June 2004
 E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large
data sets. Proc. 1996 Int. Conf. on Pattern Recognition
 G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution
clustering approach for very large spatial databases. VLDB’98.
 A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-Based Clustering
in Large Databases, ICDT'01.
 A. K. H. Tung, J. Hou, and J. Han. Spatial Clustering in the Presence of Obstacles,
ICDE'01
 H. Wang, W. Wang, J. Yang, and P.S. Yu. Clustering by pattern similarity in large data
sets,  SIGMOD’02
 W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB’97.
 T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH : An efficient data clustering method
for very large databases. SIGMOD'96
 X. Yin, J. Han, and P. S. Yu, “LinkClus: Efficient Clustering via Heterogeneous Semantic
Links”, VLDB'06

70
