8 - Clustering

This document provides an overview of cluster analysis techniques. It defines cluster analysis as the process of grouping similar data objects into clusters. The document outlines different clustering methods including partitioning methods, hierarchical methods, density-based methods, and grid-based methods. It also discusses evaluating clustering quality and considering factors like similarity measures, clustering space, and scalability. Key partitioning methods mentioned are k-means and k-medoids, which assign data points to clusters to minimize distances between points and assigned cluster centroids or medoids.

Uploaded by

MH Polash


8. Cluster Analysis

Knowledge Discovery in Databases

Dominik Probst
Chair of Computer Science 6 (Data Management), Friedrich-Alexander-University Erlangen-Nürnberg
Summer semester 2023
Outline

1. Basic Concepts

2. Partitioning Methods

3. Hierarchical Methods

4. Density-based Methods

5. Grid-based Methods

6. Evaluation of clustering

7. Summary
Basic Concepts
What is Cluster Analysis?
• Cluster: A collection of data objects within a larger set that are:
• Similar (or related) to one another within the same group and,
• dissimilar (or unrelated) to the objects outside the group.
• Cluster analysis (or clustering, data segmentation, . . .):
• Define similarities among data based on the characteristics found in the data (input from user!).
• Group similar data objects into clusters.
• Unsupervised learning:
• No predefined classes.
• I.e., learning by observation (vs. learning by examples: supervised).
• Typical applications:
• As a stand-alone tool to get insight into data distribution.
• As a preprocessing step for other algorithms.

D. Probst | CS6 | KDD 8. Cluster Analysis | Recording of this lecture is prohibited. SS2023 2
Clustering for Data Understanding and Applications
• Biology:
• Taxonomy of living things: kingdom, phylum, class, order, family, genus, and species.
• Land use:
• Identification of areas of similar land use in an earth-observation database.
• Marketing:
• Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop
targeted marketing programs.
• City planning:
• Identifying groups of houses according to their house type, value, and geographical location.
• Earthquake studies:
• Observed earthquake epicenters should be clustered along continental faults.
• Climate:
• Understanding Earth's climate; finding patterns in atmosphere and ocean data.

Quality: What is Good Clustering?
• A good clustering method will produce high-quality clusters.
• High intra-class similarity:
• Cohesive within clusters.
• Low inter-class similarity:
• Distinctive between clusters.
• The quality of a clustering method depends on:
• the similarity measure used by the method,
• its implementation, and
• its ability to discover some or all of the hidden patterns.

Measure the Quality of Clustering
• Dissimilarity/similarity metric:
• Similarity is expressed in terms of a distance function, typically a metric: d(x, y).
• The definitions of distance functions are usually rather different for interval-scaled, boolean, categorical,
ordinal, ratio, and vector variables.
• Weights should be associated with different variables
based on applications and data semantics.
• Quality of clustering:
• There is usually a separate "quality" function that measures the "goodness" of a cluster.
• It is hard to define "similar enough" or "good enough."
• The answer is typically highly case dependent.
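
The role of variable weights in a distance function can be illustrated with a minimal sketch. It assumes purely numeric (interval-scaled) attributes and a weighted Euclidean distance; the function name `weighted_euclidean` and the sample records are hypothetical and not part of the lecture.

```python
import math

def weighted_euclidean(x, y, weights):
    """Weighted Euclidean distance between two equal-length numeric vectors."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))

# Hypothetical customer records: (age in years, yearly income in 1000 EUR).
a = (30, 50)
b = (33, 54)

print(weighted_euclidean(a, b, (1.0, 1.0)))  # both attributes count equally
print(weighted_euclidean(a, b, (1.0, 0.0)))  # income is ignored entirely
```

Changing the weights changes which objects count as similar, which is why the weights have to come from the application and the data semantics.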

Considerations for Cluster Analysis
• Partitioning criteria:
• Single level vs. hierarchical partitioning.
• Often, multi-level hierarchical partitioning is desirable.
• Separation of clusters:
• Exclusive (e.g., one customer belongs to only one region) vs.
• Non-exclusive (e.g., one document may belong to more than one class).
• Similarity measure:
• Distance-based (e.g., Euclidean, road network, vector) vs.
• Connectivity-based (e.g., density or contiguity).
• Clustering space:
• Full space (often when low-dimensional) vs.
• Subspaces (often in high-dimensional clustering).

Requirements and Challenges
• Scalability:
• Clustering all the data instead of only the samples.
• Ability to deal with different types of attributes:
• Numerical, binary, categorical, ordinal, linked, and mixture of these.
• Constraint-based clustering:
• User may give inputs on constraints.
• Use domain knowledge to determine input parameters.
• Interpretability and usability.
• Others:
• Discovery of clusters with arbitrary shape.
• Ability to deal with noisy data.
• Incremental clustering and insensitivity to input order.
• High dimensionality.

Major Clustering Approaches (Part of this Lecture)
• Partitioning approach:
• Construct various partitions and then evaluate them by some criterion.
• E.g., minimizing the sum of square errors.
• Typical methods: k-means, k-medoids, CLARA, CLARANS.
• Hierarchical approach:
• Create a hierarchical decomposition of the set of data (or objects) using some criterion.
• Typical methods: AGNES, DIANA, BIRCH, CHAMELEON.
• Density-based approach:
• Based on connectivity and density functions.
• Typical methods: DBSCAN, OPTICS, DENCLUE.
• Grid-based approach:
• Based on a multiple-level granularity structure.
• Typical methods: STING, WaveCluster, CLIQUE.

Major Clustering Approaches (Not Part of this Lecture)
• Model-based approach:
• A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to that model.
• Typical methods: EM, SOM, COBWEB.
• Frequent-pattern-based approach:
• Based on the analysis of frequent patterns.
• Typical methods: p-Cluster.
• User-guided or constraint-based approach:
• Clustering by considering user-specified or application-specific constraints.
• Typical methods: COD (obstacles), constrained clustering.
• Link-based clustering:
• Objects are often linked together in various ways.
• Massive links can be used to cluster objects: SimRank, LinkClus.

Partitioning Methods
Partitioning Algorithms: Basic Concept
• Partitioning method:
• Partition a database $D$ of $n$ objects $o_j$, $j \in \{1, \dots, n\}$, into a set of $k$ clusters $C_i$, $1 \le i \le k$, such that
the sum of squared distances to $c_i$ is minimized (where $c_i$ is the centroid or medoid of cluster $C_i$):

$$\min \sum_{i=1}^{k} \sum_{o_j \in C_i} d(o_j, c_i)^2 .$$

• Given k , find a partition of k clusters that optimizes the chosen partitioning criterion.
• Globally optimal: exhaustively enumerate all partitions.
• Heuristic methods: k-means and k-medoids algorithms.
• k-means [1]:
• Each cluster is represented by the center of the cluster.
• k-medoids or PAM (Partitioning Around Medoids) [2]:
• Each cluster is represented by one of the objects in the cluster.

[1] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” in 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281–297.
[2] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990, ISBN: 978-0-471-87876-6. DOI: 10.1002/9780470316801.
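
The partitioning criterion can be made concrete with a small sketch that evaluates the sum of squared errors for a given partition. It assumes Euclidean distance and centroids as cluster representatives; the helper names and the toy data are illustrative only.

```python
def centroid(points):
    """Mean point of a non-empty list of equal-length numeric tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def squared_distance(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sum_of_squared_errors(clusters):
    """Sum over all clusters of the squared distances to the cluster centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(squared_distance(o, c) for o in cluster)
    return total

# Two candidate partitions of the same four objects into k = 2 clusters.
good = [[(1, 1), (1, 3)], [(5, 5), (7, 5)]]
bad = [[(1, 1), (5, 5)], [(1, 3), (7, 5)]]
print(sum_of_squared_errors(good))  # 4.0
print(sum_of_squared_errors(bad))   # 36.0
```

A partitioning method prefers the first partition because its criterion value is smaller.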

The k-means Clustering Method
• Given k, the k-means algorithm is implemented in four steps:
1. Partition the database into k non-empty subsets.
• E.g., the first n/k objects, then the next n/k objects, ...
2. Compute the centroids of the clusters of the current partitioning.
• The centroid is the center, i.e., the mean point, of the cluster.
• For each attribute (or dimension), calculate the average value.
3. Assign each object to the cluster with the nearest centroid.
• That is, for each object, calculate the distance to each of the k
centroids and pick the one with the smallest distance.
4. If any object has changed its cluster, go back to step 2. Otherwise stop.
• Variant:
• Start with arbitrarily chosen k objects as initial centroids in step 1.
• Continue with step 3.
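
The four steps can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes numeric tuples, squared Euclidean distance, an initial partition into consecutive blocks of roughly n/k objects, and (as an extra assumption not stated in the slides) that a cluster which becomes empty keeps its previous centroid.

```python
def mean_point(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(objects, k, max_iter=100):
    n = len(objects)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of objects")
    # Step 1: consecutive blocks of about n/k objects form the initial partition.
    assignment = [j * k // n for j in range(n)]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the centroid of every cluster.
        for i in range(k):
            members = [o for o, a in zip(objects, assignment) if a == i]
            if members:  # assumption: an emptied cluster keeps its old centroid
                centroids[i] = mean_point(members)
        # Step 3: assign each object to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda i: sq_dist(o, centroids[i])) for o in objects
        ]
        # Step 4: stop as soon as no object changes its cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment, centroids

points = [(1, 1), (8, 8), (1, 2), (9, 8), (2, 1), (8, 9)]
labels, centers = k_means(points, k=2)
print(labels)   # [0, 1, 0, 1, 0, 1]
print(centers)
```

The `max_iter` cap is only a safeguard; on this toy input the loop stops after the second pass because no object changes its cluster.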

An Example of k-means Clustering
[Figure: k-means example for k = 2. Panels: (1) arbitrarily partition objects into k groups; (2) calculate the cluster centroids; (3) check if objects have to be reassigned; (4) reassign objects.]

Comments on the k-means Method
• Strength:
• Efficient: $O(tkn)$, where n is the number of objects, k is the number of clusters, and t is the number of iterations.
Normally: $k, t \ll n$.
• Comparing: PAM: $O(k(n-k)^2)$, CLARA: $O(ks^2 + k(n-k))$, where s is the sample size.
• Comment: Often terminates at a local optimum.
• Weakness:
• Applicable only to objects in a continuous n-dimensional space.
• Use the k-modes method for categorical data.
• In comparison, k-medoids can be applied to a wide range of data.
• Need to specify k, the number of clusters, in advance.
• There are ways to automatically determine the best k [3].
• Sensitive to noisy data and outliers.
• Not suitable to discover clusters with non-convex shapes.

[3] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer Series in Statistics). Springer New York, 2009, ISBN: 978-0-387-84857-0.

Variations of the k-means Method
• Most of the variants of k-means differ in:
• Selection of the initial k subsets (or centroids).
• Dissimilarity calculations.
• Strategies to calculate cluster centroids.
• Handling categorical data: k-modes:
• Replacing centroids with modes.
• See Chapter 2: mode = value that occurs most frequently in the data.
• Using new dissimilarity measures to deal with categorical objects.
• Using a frequency-based method to update modes of clusters.
• A mixture of categorical and numerical data: k-prototype method.

What is the Problem of the k-means Method?
• The k-means algorithm is sensitive to outliers!
• An object with an extremely large value may substantially
distort the distribution of the data.
• k-medoids:
• Instead of taking the mean value of the objects in a cluster as a reference point,
a medoid can be used, which is the most centrally located object in a cluster.
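
A small one-dimensional sketch shows the effect. It assumes absolute difference as the distance and defines the medoid as the object with the smallest total distance to all other objects; the data values are made up for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def medoid(values):
    """Object of the cluster with the smallest summed distance to all others."""
    return min(values, key=lambda c: sum(abs(c - v) for v in values))

cluster = [1, 2, 3, 4, 100]  # 100 is an outlier

print(mean(cluster))    # 22.0, pulled far away from the bulk of the data
print(medoid(cluster))  # 3, still a central, actually existing object
```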
The k-medoids Clustering Method
• k-medoids clustering:
• Find representative objects (medoids) in clusters.
• PAM:
• Starts from an initial set of k medoids and iteratively replaces one of the medoids
by one of the non-medoids, if it improves the total distance of the resulting clustering.
• PAM works effectively for small data sets, but does not scale well for large
data sets (due to the computational complexity).
• Efficiency improvement on PAM:
• CLARA [4]: PAM on samples.
• CLARANS [5]: Randomized re-sampling.

[4] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990, ISBN: 978-0-471-87876-6. DOI: 10.1002/9780470316801.
[5] R. T. Ng and J. Han, “Efficient and effective clustering methods for spatial data mining,” in VLDB’94, Proceedings of 20th International Conference on Very Large Data Bases, September 12-15, 1994, Santiago de Chile, Chile, J. B. Bocca, M. Jarke, and C. Zaniolo, Eds., Morgan Kaufmann, 1994, pp. 144–155. [Online]. Available: http://www.vldb.org/conf/1994/P144.PDF.

PAM: A Typical k-medoids Algorithm
[Figure: PAM example for k = 2. Panels: (1) arbitrarily choose k objects as initial medoids; (2) assign each remaining object to the nearest medoid; (3) randomly select a non-medoid object o_random; (4) compute the total cost of swapping; (5) swap o and o_random if quality is improved. Annotated costs: total cost = 21 (top row), total cost = 25 (bottom row).]

PAM (Partitioning Around Medoids)
• Use real objects to represent the clusters.
• Algorithm:
1. Arbitrarily choose k objects as the initial medoids.
2. Repeat:
3. Assign each remaining object to the cluster with the nearest medoid $o_i$.
4. Randomly select a non-medoid object $o_h$.
5. Compute the total cost $TC_{ih}$ of swapping $o_i$ with $o_h$.
6. If $TC_{ih} < 0$, then swap $o_i$ with $o_h$ to form the new set of k medoids.
7. Until no change.
• $TC_{ih} = \sum_j C_{jih}$
• with $C_{jih}$ as the cost for object $o_j$ if $o_i$ is swapped with $o_h$.
• That is, distance to new medoid minus distance to old medoid.
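
The swap cost can be sketched in code. Two deliberate simplifications are assumed here: $TC_{ih}$ is computed as the difference between the total clustering cost after and before the swap (which equals the sum of the per-object costs), and instead of picking $o_h$ at random, every candidate swap is examined and the best improving one is applied. Distinct objects and Manhattan distance are assumed; all function names are hypothetical.

```python
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(objects, medoids, dist):
    """Sum of the distances of all objects to their nearest medoid."""
    return sum(min(dist(o, m) for m in medoids) for o in objects)

def swap_cost(objects, medoids, o_i, o_h, dist):
    """TC_ih: change in total cost if medoid o_i is replaced by non-medoid o_h."""
    swapped = [o_h if m == o_i else m for m in medoids]
    return total_cost(objects, swapped, dist) - total_cost(objects, medoids, dist)

def pam(objects, k, dist):
    medoids = list(objects[:k])  # arbitrary initial medoids
    while True:
        best = None
        for o_i in medoids:
            for o_h in objects:
                if o_h in medoids:
                    continue
                tc = swap_cost(objects, medoids, o_i, o_h, dist)
                if tc < 0 and (best is None or tc < best[0]):
                    best = (tc, o_i, o_h)
        if best is None:  # no swap improves the clustering: stop
            return medoids
        _, o_i, o_h = best
        medoids[medoids.index(o_i)] = o_h

data = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
print(swap_cost(data, [(1, 1), (2, 1)], (2, 1), (8, 8), manhattan))  # -38
print(pam(data, 2, manhattan))  # [(1, 1), (8, 8)]
```

Every accepted swap strictly lowers the total cost, so the loop terminates; the result is a local optimum of the swap neighborhood.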

PAM Clustering - Cases (I)
• Case 1:
• $o_j$ currently belongs to medoid $o_i$. If $o_i$ is replaced with $o_h$ as a medoid, and $o_j$ is closest to $o_h$, then $o_j$ is
reassigned to $o_h$ (same cluster, different distance).

$$C_{jih} = d(o_j, o_h) - d(o_j, o_i).$$

PAM Clustering - Cases (II)
• Case 2:
• $o_j$ currently belongs to medoid $o_t$, $t \ne i$. If $o_i$ is replaced with $o_h$ as a medoid, and $o_j$ is still closest to $o_t$,
then the assignment does not change.

$$C_{jih} = 0, \quad \text{since } d(o_j, o_h) - d(o_j, o_t) \ge 0.$$

PAM Clustering - Cases (III)
• Case 3:
• $o_j$ currently belongs to medoid $o_i$. If $o_i$ is replaced with $o_h$ as a medoid, and $o_j$ is closest to medoid $o_t$ of
one of the other clusters, then $o_j$ is reassigned to $o_t$ (new cluster, different distance).

$$C_{jih} = d(o_j, o_t) - d(o_j, o_i) \ge 0, \quad \text{where } d(o_j, o_t) - d(o_j, o_h) < 0.$$

PAM Clustering - Cases (IV)
• Case 4:
• $o_j$ currently belongs to medoid $o_t$, $t \ne i$. If $o_i$ is replaced with $o_h$ as a medoid (from a different cluster!),
and $o_j$ is closest to $o_h$, then $o_j$ is reassigned to $o_h$ (new cluster, different distance).

$$C_{jih} = d(o_j, o_h) - d(o_j, o_t) < 0.$$

CLARA (Clustering Large Applications)
• Draws multiple samples of the data set, applies PAM on each sample,
and gives the best clustering as the output.
• Strength:
• Deals with larger data sets than PAM.
• Weakness:
• Efficiency depends on the sample size.
• A good clustering based on samples will not necessarily represent
a good clustering of the whole data set if the sample is biased.

CLARANS ("Randomized" CLARA)
• A Clustering Algorithm based on Randomized Search
• Samples:
• Drawn dynamically with some randomness in each step of the search.
• Clustering process:
• Can be presented as searching a graph where each node is a potential solution,
that is, a set of k medoids.
• If local optimum found,
• start with new randomly selected node in search for a new local optimum.
• More efficient and scalable than both PAM and CLARA.
• Focusing techniques and spatial access structures may further improve its performance [6].

[6] M. Ester, H. Kriegel, and X. Xu, “Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification,” in Advances in Spatial Databases, 4th International Symposium, SSD’95, Portland, Maine, USA, August 6-9, 1995, Proceedings, M. J. Egenhofer and J. R. Herring, Eds., ser. Lecture Notes in Computer Science, vol. 951, Springer, 1995, pp. 67–82.

Hierarchical Methods
Hierarchical Clustering
• Does not require the number of clusters k as an input,
but needs a termination condition.

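[Figure: agglomerative (AGNES) vs. divisive (DIANA) clustering of the objects a, b, c, d, e. AGNES runs from Step 0 to Step 4, merging a and b into ab, d and e into de, c and de into cde, and finally ab and cde into abcde. DIANA runs through the same hierarchy in the opposite direction (its Step 0 is the single cluster abcde, its Step 4 the individual objects).]
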
AGNES (Agglomerative Nesting)
• Introduced by Kaufman et al. [7]
• Uses the single-link method (see below).
• Merge nodes that have the least dissimilarity.
• Go on in a non-descending fashion.
• Eventually all nodes belong to the same cluster.

[7] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990, ISBN: 978-0-471-87876-6. DOI: 10.1002/9780470316801.

Dendrogram: Shows how Clusters are Merged
• Decompose data objects into several levels of nested partitioning (tree of clusters),
called a dendrogram.
• A clustering of the data objects is obtained by cutting the dendrogram at the desired level;
each connected component then forms a cluster.

[Figure: dendrogram over the objects a, b, c, d, e; the vertical axis shows distance.]

DIANA (Divisive Analysis)
• Introduced by Kaufman et al. [8]
• Inverse order of AGNES.
• Eventually each node forms a cluster of its own.

[8] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley, 1990, ISBN: 978-0-471-87876-6. DOI: 10.1002/9780470316801.

Distance Between Clusters (I)
• Minimum distance:
• Smallest distance between an object in one cluster and an object in the other, i.e.,
$dist_{min}(C_i, C_j) = \min_{o_{ip} \in C_i,\, o_{jq} \in C_j} d(o_{ip}, o_{jq})$.
• Maximum distance:
• Largest distance between an object in one cluster and an object in the other, i.e.,
$dist_{max}(C_i, C_j) = \max_{o_{ip} \in C_i,\, o_{jq} \in C_j} d(o_{ip}, o_{jq})$.
• Average distance:
• Average distance between an object in one cluster and an object in the other, i.e.,
$dist_{avg}(C_i, C_j) = \frac{1}{n_i \cdot n_j} \sum_{o_{ip} \in C_i,\, o_{jq} \in C_j} d(o_{ip}, o_{jq})$,
where $n_i$ and $n_j$ are the numbers of objects in $C_i$ and $C_j$.

• Mean distance:
• Distance between the centroids of two clusters, i.e., $dist_{mean}(C_i, C_j) = d(c_i, c_j)$.
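
The four measures can be written down directly. The sketch below assumes Euclidean distance between numeric tuples (via `math.dist`, available from Python 3.8); the function names mirror the slide notation but are otherwise hypothetical.

```python
import math
from itertools import product

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def dist_min(ci, cj):
    return min(math.dist(p, q) for p, q in product(ci, cj))

def dist_max(ci, cj):
    return max(math.dist(p, q) for p, q in product(ci, cj))

def dist_avg(ci, cj):
    return sum(math.dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))

def dist_mean(ci, cj):
    return math.dist(centroid(ci), centroid(cj))

c1 = [(0, 0), (0, 2)]
c2 = [(3, 0), (3, 2)]
print(dist_min(c1, c2))   # 3.0
print(dist_max(c1, c2))   # sqrt(13), about 3.606
print(dist_avg(c1, c2))   # about 3.303
print(dist_mean(c1, c2))  # 3.0
```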

Distance between Clusters (II)
• Nearest-neighbor clustering algorithm:
• Uses minimum distance to measure distance between clusters.
• Single-linkage algorithm:
• Terminates if distance between nearest clusters exceeds user-defined threshold.
• Minimal spanning-tree algorithm:
• View objects (data points) as nodes of a graph.
• Edges form a path between nodes in a cluster.
• Merging of two clusters corresponds to adding an edge between the nearest pair of nodes.
• Because edges linking clusters always go between distinct clusters,
resulting graph will be a tree.
• Thus, agglomerative hierarchical clustering that uses minimum distance produces a minimal spanning
tree.

Distance between Clusters (III)
• Farthest-neighbor clustering algorithm:
• Uses maximum distance to measure distance between clusters.
• Complete-linkage algorithm:
• Terminates if maximum distance between nearest clusters exceeds user-defined threshold.
• Good if true clusters are rather compact and approx. equal in size.
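
The difference between the single-linkage and complete-linkage termination rules can be sketched with a naive agglomerative loop. This is an illustration only: it recomputes all pairwise cluster distances in every round, assumes Euclidean distance, and stops as soon as the distance between the two nearest clusters exceeds the user-defined threshold. The names are hypothetical.

```python
import math
from itertools import product

def single_link(ci, cj):
    """Minimum distance (nearest neighbor)."""
    return min(math.dist(p, q) for p, q in product(ci, cj))

def complete_link(ci, cj):
    """Maximum distance (farthest neighbor)."""
    return max(math.dist(p, q) for p, q in product(ci, cj))

def agglomerative(objects, cluster_distance, threshold):
    clusters = [[o] for o in objects]  # every object starts as its own cluster
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dij = cluster_distance(clusters[i], clusters[j])
                if best is None or dij < best[0]:
                    best = (dij, i, j)
        dij, i, j = best
        if dij > threshold:  # nearest clusters are too far apart: terminate
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0.0,), (1.0,), (2.5,), (6.0,), (7.0,)]
print(agglomerative(points, single_link, threshold=1.5))    # 2 clusters
print(agglomerative(points, complete_link, threshold=1.5))  # 3 clusters
```

With the same threshold, complete linkage stops earlier here because the farthest pair between {0, 1} and {2.5} is already 2.5 apart.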

Extensions to Hierarchical Clustering
• Major weakness of agglomerative clustering methods:
• Can never undo what was done previously.
• Do not scale well: Time complexity of at least $O(n^2)$, where n is the number of objects.
• Integration of hierarchical and distance-based clustering:
• BIRCH: Uses CF-tree and incrementally adjusts the quality of sub-clusters.
• CHAMELEON: Hierarchical clustering using dynamic modeling.

BIRCH (Balanced Iterative Reducing and Clustering Using Hierarchies) [9]
Incrementally construct a CF (Clustering Feature) tree:
• A hierarchical data structure for multiphase clustering.
• Phase 1: Scan DB to build an initial in-memory CF-tree.
• A multi-level compression of the data that tries to preserve the inherent clustering structure of the data.
• Phase 2: Use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree.
Scales linearly:
• Finds a good clustering with a single scan and improves the quality with a few additional scans.
Weakness:
• Handles only numerical data, and sensitive to the order of the data records.

[9] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: an efficient data clustering method for very large databases,” in Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, June 4-6, 1996, H. V. Jagadish and I. S. Mumick, Eds., ACM Press, 1996, pp. 103–114.

Clustering Feature in BIRCH (I)
Clustering Feature $CF = (n, LS, SS)$:
• 3D vector summarizing statistics about clusters:
• $n$: number of data points.
• $LS$: linear sum of the $n$ points, $\sum_{i=1}^{n} x_i$.
• $SS$: square sum of the $n$ points, $\sum_{i=1}^{n} x_i^2$.

Example: CF = (5, (16, 30), (54, 190)), as the data points contained are
(3,4), (2,6), (4,5), (4,7), (3,8).

Clustering Feature in BIRCH (II)
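• Allows many useful statistics of a cluster to be derived. E.g.:
• centroid: $c = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{LS}{n}$,
• radius: $R = \sqrt{\frac{\sum_{i=1}^{n} (x_i - c)^2}{n}} = \sqrt{\frac{n\,SS - LS^2}{n^2}}$,
• diameter: $D = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (x_i - x_j)^2}{n(n-1)}} = \sqrt{\frac{2n\,SS - 2LS^2}{n(n-1)}}$.
• Here $LS^2$ denotes the dot product $LS \cdot LS$, and $SS$ is summed over all dimensions.
• The radius follows from $\sum_{i} (x_i - c)^2 = SS - 2c \cdot LS + n c^2 = SS - LS^2 / n$.

• Additive:
• For two disjoint clusters $C_1$ and $C_2$ with clustering features $CF_1 = (n_1, LS_1, SS_1)$ and
$CF_2 = (n_2, LS_2, SS_2)$, the clustering feature of the cluster that is formed by merging $C_1$ and $C_2$ is
simply: $CF_1 + CF_2 = (n_1 + n_2, LS_1 + LS_2, SS_1 + SS_2)$.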
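
A short sketch makes the clustering feature and its additivity tangible. It stores $LS$ and $SS$ per dimension (as in the slide example) and computes the radius and diameter from the expansions $R^2 = (n \cdot SS - LS \cdot LS)/n^2$ and $D^2 = (2n \cdot SS - 2\,LS \cdot LS)/(n(n-1))$, with $SS$ summed over all dimensions. Function names are hypothetical.

```python
import math

def clustering_feature(points):
    """CF = (n, LS, SS) with per-dimension linear and square sums."""
    dims = range(len(points[0]))
    ls = tuple(sum(p[d] for p in points) for d in dims)
    ss = tuple(sum(p[d] ** 2 for p in points) for d in dims)
    return (len(points), ls, ss)

def merge(cf1, cf2):
    """Additivity: the CF of the union of two disjoint clusters."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return (n1 + n2,
            tuple(a + b for a, b in zip(ls1, ls2)),
            tuple(a + b for a, b in zip(ss1, ss2)))

def centroid(cf):
    n, ls, _ = cf
    return tuple(l / n for l in ls)

def radius(cf):
    n, ls, ss = cf
    return math.sqrt((n * sum(ss) - sum(l * l for l in ls)) / n ** 2)

def diameter(cf):
    n, ls, ss = cf
    return math.sqrt((2 * n * sum(ss) - 2 * sum(l * l for l in ls)) / (n * (n - 1)))

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print(cf)            # (5, (16, 30), (54, 190))
print(centroid(cf))  # (3.2, 6.0)
print(radius(cf))    # about 1.6
print(diameter(cf))  # about 2.53

# Merging the CFs of two parts gives the CF of the whole cluster.
print(merge(clustering_feature(points[:2]), clustering_feature(points[2:])) == cf)
```

Because only sums are stored, none of these statistics requires access to the original points.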

CF-Tree in BIRCH
• Height-balanced tree.
• Stores the clustering features for a hierarchical clustering.
• Non-leaf nodes store sums of the CFs of their children.
• Two parameters:
• Branching factor B: max # of children.
• Threshold T : maximum diameter of sub-clusters stored at leaf nodes.
• Diameter D: average pairwise distance within a cluster,
reflects the tightness of the cluster around the centroid:

$$D = \sqrt{\frac{\sum_{i=1}^{n} \sum_{j=1}^{n} (o_i - o_j)^2}{n(n-1)}} = \sqrt{\frac{2n\,SS - 2LS^2}{n(n-1)}} .$$

CF-Tree structure
[Figure: CF-tree with B = 7 and T = 6. The root holds the entries CF1, CF2, CF3, ..., CF6 with child pointers child1, ..., child6. A non-leaf node holds the entries CF11, CF12, CF13, ..., CF15 with child pointers child11, ..., child15. Leaf nodes hold entries such as CF111, CF112, ..., CF116 and CF121, CF122, ..., CF124 and are chained through prev and next pointers.]

The BIRCH algorithm
Phase 1:
• For each point in the input:
• Find closest leaf-node entry.
• Add point to leaf-node entry and update CF.
• If entry_diameter > max_diameter, then split leaf node, and possibly parents.
• Information about new point is passed toward the root of the tree.
• Algorithm is O(n) and incremental.
• Concerns:
• Sensitive to insertion order of data points.
• Since we fix the size of leaf nodes, clusters may not be so natural.
• Clusters tend to be spherical given the radius and diameter measures.
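
The insertion logic of Phase 1 can be hinted at with a heavily simplified sketch. It is not the BIRCH algorithm itself: the tree is replaced by a flat list of leaf entries (no branching factor, no node splits), the closest entry is found by centroid distance, and a point that would push an entry's diameter above the threshold simply opens a new entry. All names are hypothetical.

```python
import math

def cf_of(point):
    return (1, tuple(point), tuple(x * x for x in point))

def cf_add(a, b):
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            tuple(x + y for x, y in zip(a[2], b[2])))

def cf_centroid(cf):
    n, ls, _ = cf
    return tuple(l / n for l in ls)

def cf_diameter(cf):
    n, ls, ss = cf
    if n < 2:
        return 0.0
    value = (2 * n * sum(ss) - 2 * sum(l * l for l in ls)) / (n * (n - 1))
    return math.sqrt(max(0.0, value))  # guard against rounding below zero

def insert_points(points, max_diameter):
    """Single scan: absorb each point into the closest entry if it stays tight."""
    entries = []
    for point in points:
        new_cf = cf_of(point)
        if entries:
            idx = min(range(len(entries)),
                      key=lambda i: math.dist(cf_centroid(entries[i]), point))
            candidate = cf_add(entries[idx], new_cf)
            if cf_diameter(candidate) <= max_diameter:
                entries[idx] = candidate
                continue
        entries.append(new_cf)  # flat-list stand-in for a leaf split
    return entries

data = [(1, 1), (1.2, 1), (5, 5), (5.1, 5.2), (1, 1.1)]
for n, ls, ss in insert_points(data, max_diameter=1.0):
    print(n, ls)
```

Feeding the same points in a different order can produce different entries, which is the order sensitivity listed under the concerns.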

CHAMELEON: Hierarchical Clustering using Dynamic Modeling [10]
• Measures the similarity based on a dynamic model:
• Two clusters are merged only if the interconnectivity and closeness (proximity) between two clusters
are high relative to the intraconnectivity of the clusters and the closeness of items within the clusters.
• Graph-based, and a two-phase algorithm.
• Use a graph-partitioning algorithm:
• Cluster objects into a large number of relatively small sub-clusters.
• Use an agglomerative hierarchical clustering algorithm:
• Find the genuine clusters by repeatedly combining these sub-clusters.

[10] G. Karypis, E. Han, and V. Kumar, “Chameleon: Hierarchical clustering using dynamic modeling,” IEEE Computer, vol. 32, no. 8, pp. 68–75, 1999. DOI: 10.1109/2.781637. [Online]. Available: https://doi.org/10.1109/2.781637.

Overall Framework of CHAMELEON
[Figure: CHAMELEON pipeline: data input, construct a sparse k-NN graph, partition the graph, merge partitions, final clusters.]
• k-NN graph: p and q are connected if q is among the top k closest neighbors of p.
• Relative interconnectivity: connectivity of C1 and C2 over internal connectivity.
• Relative closeness: closeness of C1 and C2 over internal closeness.

Probabilistic Hierarchical Clustering
Algorithmic hierarchical clustering.
• Nontrivial to choose a good distance measure.
• Hard to handle missing attribute values.
• Optimization goal not clear: heuristic, local search.
Probabilistic hierarchical clustering.
• Use probabilistic models to measure distances between clusters.
• Generative model:
• Regard the set of data objects to be clustered as a sample of the underlying data-generation
mechanism to be analyzed.
• Easy to understand, same efficiency as algorithmic agglomerative clustering method, can handle
partially observed data.
• In practice, assume the generative models adopt common distribution functions, e.g., Gaussian
distribution or Bernoulli distribution, governed by parameters.
Generative Model (I)
• Given a set of 1-D points $X = \{x_1, \dots, x_n\}$ for clustering analysis and assuming they are generated
by a Gaussian distribution:

$$\mathcal{N}(\mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right).$$

• The probability that a point $x_i \in X$ is generated by the model:

$$P(x_i \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right).$$

Generative Model (II)
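• The likelihood that $X$ is generated by the model:

$$L(\mathcal{N}(\mu, \sigma) \mid X) := P(X \mid \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right).$$

• The task of learning the generative model: find the parameters $\mu$ and $\sigma$, such that

$$\mathcal{N}(\mu_0, \sigma_0) = \arg\max_{\mu, \sigma} L(\mathcal{N}(\mu, \sigma) \mid X).$$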
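
For the 1-D Gaussian, the maximizing parameters have a well-known closed form: the sample mean and the (biased, divide-by-n) sample standard deviation. The sketch below uses that closed form and works with the log-likelihood for numerical convenience; the function names and sample data are made up.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of the 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def log_likelihood(xs, mu, sigma):
    """Logarithm of the product of the point densities."""
    return sum(math.log(gaussian_pdf(x, mu, sigma)) for x in xs)

def fit_gaussian(xs):
    """Maximum-likelihood estimates (mu_0, sigma_0) for a 1-D Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return mu, sigma

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
mu0, sigma0 = fit_gaussian(xs)
print(mu0, sigma0)                           # 3.0 and sqrt(2)
print(log_likelihood(xs, mu0, sigma0))       # the maximum
print(log_likelihood(xs, mu0 + 0.5, sigma0)) # any other parameters score lower
```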

A Probabilistic Hierarchical Clustering Algorithm (I)
• For a set of objects partitioned into $m$ clusters $C_1, \dots, C_m$, the quality can be measured by:

$$Q(\{C_1, \dots, C_m\}) = \prod_{i=1}^{m} P(C_i),$$

where $P(C_i)$ is the maximum likelihood of $C_i$.

• Distance between clusters $C_i$ and $C_j$ can be computed as a proximity measure:

$$d(C_i, C_j) = -\log \frac{P(C_i \cup C_j)}{P(C_i) P(C_j)}.$$

A Probabilistic Hierarchical Clustering Algorithm (II)
Algorithm: Progressively merge points and clusters
• Input: $D = \{o_1, \dots, o_n\}$: a dataset containing $n$ objects.
• Output: A hierarchy of clusters.
• Method:
• Create a cluster for each object: $C_i = \{o_i\}$, for $1 \le i \le n$.
• For $i = 1$ to $n$:
• Find a pair of clusters $C_i$ and $C_j$, such that

$$(C_i, C_j) = \arg\max_{i \ne j} \log \frac{P(C_i \cup C_j)}{P(C_i) P(C_j)}.$$

• If $\log \frac{P(C_i \cup C_j)}{P(C_i) P(C_j)} > 0$, then merge $C_i$ and $C_j$,
• Else stop.

Density-based Methods
Density-based Clustering
• Clustering based on density (local cluster criterion), such as density-connected points.
• Major features:
• Discover clusters of arbitrary shape.
• Handle noise.
• One scan.
• Need density parameters as termination condition.
• Several interesting studies:
• DBSCAN [11].
• OPTICS [12].
• DENCLUE [13].
• CLIQUE [14] (more grid-based).

[11] M. Ester, H. Kriegel, J. Sander, et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), Portland, Oregon, USA, E. Simoudis, J. Han, and U. M. Fayyad, Eds., AAAI Press, 1996, pp. 226–231.
[12] M. Ankerst, M. M. Breunig, H. Kriegel, et al., “OPTICS: ordering points to identify the clustering structure,” in SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania, USA, A. Delis, C. Faloutsos, and S. Ghandeharizadeh, Eds., ACM Press, 1999, pp. 49–60.
[13] A. Hinneburg and D. A. Keim, “An efficient approach to clustering in large multimedia databases with noise,” in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York City, New York, USA, August 27-31, 1998, R. Agrawal, P. E. Stolorz, and G. Piatetsky-Shapiro, Eds., AAAI Press, 1998, pp. 58–65.
[14] R. Agrawal, J. Gehrke, D. Gunopulos, et al., “Automatic subspace clustering of high dimensional data for data mining applications,” in SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA, L. M. Haas and A. Tiwary, Eds., ACM Press, 1998, pp. 94–105.

Density-based Clustering: Basic Concepts
• Two parameters:
• ϵ: Maximum radius of a neighborhood.
• Defines the neighborhood of a point p: Nϵ (p) := {q ∈ D | d (p, q ) ≤ ϵ}.
• Distance function is still needed.
• MinPts: Minimum number of points in an ϵ-neighborhood of a point p.
• Core-point condition: |Nϵ (p)| ≥ MinPts.
• Only these neighborhoods are considered.
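
The two parameters can be illustrated with a minimal sketch. It assumes Euclidean distance and a brute-force scan over a small list of points; the function names and the toy data are hypothetical and not part of the lecture.

```python
import math

def eps_neighborhood(points, p, eps):
    """Indices q with d(p, q) <= eps; the point p itself is included."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]

def is_core_point(points, p, eps, min_pts):
    """Core-point condition |N_eps(p)| >= MinPts."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

# Hypothetical toy data: a small cross around the origin plus one far point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5), (3.0, 3.0)]
print(eps_neighborhood(points, 0, eps=1.0))
print(is_core_point(points, 0, eps=1.0, min_pts=5))
print(is_core_point(points, 5, eps=1.0, min_pts=5))
```

Because p belongs to its own neighborhood, MinPts counts p as well in this sketch. Any other distance function could replace `math.dist`.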

[Figure: points p and q with an ϵ-neighborhood, MinPts = 5, ϵ = 1 cm.]

Directly Density-reachable
• A point q is said to be directly density-reachable from a point p
w.r.t. ϵ, MinPts, if q belongs to Nϵ (p) and p satisfies the core-point condition |Nϵ (p)| ≥ MinPts.

[Figure: q inside the ϵ-neighborhood of p, MinPts = 5, ϵ = 1 cm.]

Density-reachable and Density-connected
• Density-reachable:
• A point q is density-reachable from a point p w.r.t. ϵ, MinPts, if there is a chain of points p1 , . . . , pn
with p1 = p and pn = q such that pi+1 is directly density-reachable from pi for every i.

[Figure: chain of points with p1 and p2 labeled.]

• Density-connected:
• A point p is density-connected to a point q w.r.t. ϵ, MinPts, if there is a point o such that both p and q are
density-reachable from o.
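
Both relations can be checked with a breadth-first search. The sketch below assumes Euclidean distance, brute-force neighborhood queries, and the usual DBSCAN convention that a chain can only be extended through core points; all names and the 1-D toy data are hypothetical.

```python
import math
from collections import deque

def neighbors(points, i, eps):
    return [j for j in range(len(points))
            if math.dist(points[i], points[j]) <= eps]

def density_reachable_from(points, start, eps, min_pts):
    """Set of indices density-reachable from points[start] (start included)."""
    reached = {start}
    queue = deque([start])
    while queue:
        i = queue.popleft()
        nbrs = neighbors(points, i, eps)
        if len(nbrs) < min_pts:
            continue  # i is not a core point, so no chain continues through it
        for j in nbrs:
            if j not in reached:
                reached.add(j)
                queue.append(j)
    return reached

def density_connected(points, p, q, eps, min_pts):
    """True if some point o exists from which both p and q are density-reachable."""
    for o in range(len(points)):
        reached = density_reachable_from(points, o, eps, min_pts)
        if p in reached and q in reached:
            return True
    return False

# 1-D toy data: indices 1 and 2 are core points, 0 and 3 are border points.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
print(density_reachable_from(points, 1, eps=1.0, min_pts=3))
print(density_reachable_from(points, 3, eps=1.0, min_pts=3))
print(density_connected(points, 0, 3, eps=1.0, min_pts=3))
```

The example shows that density-reachability is not symmetric (index 3 is reachable from index 1, but not the other way round), while density-connectivity is.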

[Figure: points p and q, both density-reachable from a point o.]

DBSCAN: Density-based Spatial Clustering of Applications with Noise
• Relies on a density-based notion of cluster:
• A cluster is defined as a maximal set of density-connected points.
• Discovers clusters of arbitrary shape in spatial databases with noise.

[Figure: core, border, and outlier points for ϵ = 1 cm and MinPts = 5.]

DBSCAN: The Algorithm
• All objects in database D are marked unvisited.
• Randomly select unvisited object p and mark as visited.
• If the ϵ-neighborhood of p contains fewer than MinPts objects:
• Mark p as noise.
• Otherwise (that is, p is core point):
• Create new cluster C for point p.
• Add all objects in ϵ-neighborhood of p to candidate set N.
• For each p′ in N that does not yet belong to a cluster:
• Add p′ to C.
• If p′ is unvisited, mark as visited.
• If p′ is core point, add all objects in its ϵ-neighborhood to N.

• Ends when N is empty, that is, C can no longer be expanded.

• Continue the process (randomly select next unvisited object in D) until all points have been visited.
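
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: it uses Euclidean distance, brute-force neighborhood queries, and a fixed scan order instead of random selection; the function name and toy data are hypothetical.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    n = len(points)
    labels = [None] * n       # None: not yet assigned to any cluster
    visited = [False] * n

    def region(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for p in range(n):        # fixed order instead of random selection
        if visited[p]:
            continue
        visited[p] = True
        nbrs = region(p)
        if len(nbrs) < min_pts:
            labels[p] = NOISE  # may later be relabeled as a border point
            continue
        labels[p] = cluster_id
        candidates = list(nbrs)            # candidate set N
        k = 0
        while k < len(candidates):         # ends when N is exhausted
            q = candidates[k]
            k += 1
            if labels[q] is None or labels[q] == NOISE:
                labels[q] = cluster_id     # q does not yet belong to a cluster
            if not visited[q]:
                visited[q] = True
                q_nbrs = region(q)
                if len(q_nbrs) >= min_pts:  # q is a core point: expand N
                    candidates.extend(q_nbrs)
        cluster_id += 1
    return labels

# Two small squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5),
          (10, 10), (10, 11), (11, 10), (11, 11)]
print(dbscan(points, eps=1.5, min_pts=4))
```

The brute-force `region` query makes the sketch quadratic in the number of points; a spatial index would be the natural replacement for larger data.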

DBSCAN: Sensitive to Parameters

[Figure: DBSCAN results for ϵ = 1, MinPts = 5 and for ϵ = 0.7, MinPts = 10.]

OPTICS: A Cluster-ordering Method
• OPTICS: Ordering Points To Identify the Clustering Structure [15].
• Avoid difficulty of finding the right parameters:
• Only MinPts must be given, ϵ remains open.
• Produces a special order of the database w.r.t. its density-based clustering structure.
• This cluster-ordering contains info equivalent to the density-based clusterings
corresponding to a broad range of parameter settings.
• Good for both automatic and interactive cluster analysis, including finding
intrinsic clustering structure.
• Can be represented graphically or using visualization techniques.
• Process points in order: always select the point that is density-reachable w.r.t. the lowest ϵ value, so that clusters
with higher density (lower ϵ) will be finished first.

[15] M. Ankerst, M. M. Breunig, H. Kriegel, et al., “OPTICS: ordering points to identify the clustering structure,” in SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadelphia, Pennsylvania, USA, A. Delis, C. Faloutsos, and S. Ghandeharizadeh, Eds., ACM Press, 1999, pp. 49–60.

OPTICS: Some Extension of DBSCAN
• Core distance:
• of a point p.
• Smallest ϵ such that p is a core point.
• Reachability distance r :
• of a point p from q.
• Minimum radius value that makes p directly density-reachable from q.
• That is, r (q , p) := max{core-distance(q ), d (q , p)}.
• Example:
• MinPts = 5, core-distance(q ) = 3 cm.
• d (q , p1 ) = 2.8 cm =⇒ r (q , p1 ) = 3 cm.
• d (q , p2 ) = 4 cm =⇒ r (q , p2 ) = 4 cm.
[Figure: point q with neighbors p1 and p2.]
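
The two distances can be written down directly. The sketch below assumes Euclidean distance and counts a point as a member of its own neighborhood, so the core distance is the distance to the MinPts-th closest point including the point itself; the function names are hypothetical.

```python
import math

def core_distance(points, p, min_pts):
    """Smallest eps such that points[p] is a core point (inf if impossible)."""
    dists = sorted(math.dist(points[p], x) for x in points)  # includes d(p, p) = 0
    if len(dists) < min_pts:
        return math.inf
    return dists[min_pts - 1]

def reachability_distance(core_dist_q, d_qp):
    """r(q, p) = max{core-distance(q), d(q, p)}."""
    return max(core_dist_q, d_qp)

# The example from the slide: core-distance(q) = 3 cm.
print(reachability_distance(3.0, 2.8))   # p1
print(reachability_distance(3.0, 4.0))   # p2

# Core distance on hypothetical 1-D data.
print(core_distance([(0.0,), (1.0,), (2.0,), (4.0,)], 0, min_pts=3))
```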

OPTICS: Algorithm
• Maintains a list of OrderSeeds:
• Points sorted by reachability-distance from their resp. closest core points.
• Begin with arbitrary point from input DB.
• For each point p under consideration:
• Retrieve the closest MinPts points of p, determine core-distance, set reachability-distance to undefined.
• Write p to output.
• If p is core point, for each point q in the ϵ-neighborhood of p:
• Update reachability-distance from p.
• Insert q into OrderSeeds (if q has not yet been processed).
• Move to next object in OrderSeeds list with smallest reachability-distance (or input DB if OrderSeeds is
empty).
• Continue until input DB fully consumed (and OrderSeeds empty).
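
A compact sketch of this loop is given below. It is an illustration under assumptions, not the published algorithm in full: OrderSeeds is modeled as a binary heap with lazy deletion of outdated entries, neighborhoods are found by brute force, ϵ defaults to infinity ("remains open"), and an undefined reachability-distance is represented by `math.inf`. All names are hypothetical.

```python
import heapq
import math

def optics(points, min_pts, eps=math.inf):
    n = len(points)
    reach = [math.inf] * n        # inf stands for "undefined"
    processed = [False] * n
    order = []

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    def core_distance(i, nbrs):
        if len(nbrs) < min_pts:
            return None           # i is not a core point within eps
        return sorted(math.dist(points[i], points[j]) for j in nbrs)[min_pts - 1]

    for start in range(n):        # next unprocessed object of the input DB
        if processed[start]:
            continue
        seeds = [(reach[start], start)]
        while seeds:
            _, p = heapq.heappop(seeds)   # smallest reachability-distance first
            if processed[p]:
                continue          # outdated heap entry
            processed[p] = True
            order.append(p)       # write p to output
            nbrs = neighbors(p)
            cd = core_distance(p, nbrs)
            if cd is None:
                continue
            for q in nbrs:
                if processed[q]:
                    continue
                new_reach = max(cd, math.dist(points[p], points[q]))
                if new_reach < reach[q]:
                    reach[q] = new_reach
                    heapq.heappush(seeds, (new_reach, q))
    return order, reach

# Two groups on a line; the jump in reachability marks the gap between them.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
order, reach = optics(points, min_pts=2)
print(order)
print(reach)
```

Plotting `reach` in the order given by `order` yields the reachability plot: valleys correspond to clusters, and high values separate them.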

OPTICS: Visualization

DENCLUE: Using Statistical Density Functions
• DENsity-based CLUstEring [16].
• Using statistical density functions:
• $f_{\text{Gaussian}}(x, y) = \exp\left(-\frac{d(x, y)^2}{2\sigma^2}\right)$, influence of y on x.
• $f^{D}_{\text{Gaussian}}(x, x_i) = \sum_{i=1}^{N} \exp\left(-\frac{d(x, x_i)^2}{2\sigma^2}\right)$, total influence on x.
• $\nabla_{x_i} f^{D}_{\text{Gaussian}}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot \exp\left(-\frac{d(x, x_i)^2}{2\sigma^2}\right)$, gradient of x in direction xi .

• Major features:
• Solid mathematical foundation.
• Good for data sets with large amounts of noise.
• Allows a compact mathematical description of arbitrarily shaped
clusters in high-dimensional data sets.
• Significantly faster than existing algorithms (e.g., DBSCAN).
• But needs a large number of parameters.
[16] A. Hinneburg and D. A. Keim, “An efficient approach to clustering in large multimedia databases with noise,” in Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York City, New York, USA, August 27-31, 1998, R. Agrawal, P. E. Stolorz, and G. Piatetsky-Shapiro, Eds., AAAI Press, 1998, pp. 58–65.

DENCLUE: Technical Essence (I)
• Density estimation:
• Estimation of an unobservable underlying probability density function based on a set of observed data.
• Regarded as a sample.
• Kernel density estimation:
• Treat an observed object as an indicator of high-probability density in the surrounding region.
• Probability density at a point depends on the distances from this point to the observed objects.
• Density attractor:
• Local maximum of the estimated density function.
• Must be above threshold ξ .

DENCLUE: Technical Essence (II)
• Clusters:
• Can be determined mathematically by identifying density attractors.
• That is, local maxima of the overall density function.
• Center-defined clusters:
• Assign to each density attractor the points density-attracted to it.
• Arbitrarily shaped clusters:
• Merge density attractors that are connected through paths of high density (> threshold).
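
Center-defined clusters can be sketched by hill-climbing from every data point to its density attractor. The sketch below makes several assumptions that are not taken from the lecture: it uses a Gaussian kernel, replaces fixed-step gradient ascent by a step-size-free weighted-mean update (which moves along the gradient direction), treats attractors closer than `merge_tol` as identical, and labels points whose attractor has density below the threshold ξ as noise. Merging attractors into arbitrarily shaped clusters is left out.

```python
import math

def kernel(x, y, sigma):
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma):
    return sum(kernel(x, xi, sigma) for xi in data)

def climb(x, data, sigma, tol=1e-9, max_iter=500):
    """Follow the density uphill from x until the position stops changing."""
    x = tuple(x)
    for _ in range(max_iter):
        weights = [kernel(x, xi, sigma) for xi in data]
        total = sum(weights)
        new_x = tuple(sum(w * xi[k] for w, xi in zip(weights, data)) / total
                      for k in range(len(x)))
        if math.dist(new_x, x) < tol:
            return new_x
        x = new_x
    return x

def center_defined_clusters(data, sigma, xi_threshold, merge_tol=1e-3):
    attractors, labels = [], []
    for p in data:
        a = climb(p, data, sigma)
        if density(a, data, sigma) < xi_threshold:
            labels.append(-1)             # attractor not significant: noise
            continue
        for idx, b in enumerate(attractors):
            if math.dist(a, b) < merge_tol:
                labels.append(idx)        # same attractor as an earlier point
                break
        else:
            attractors.append(a)
            labels.append(len(attractors) - 1)
    return labels, attractors

data = [(0.0,), (0.2,), (-0.2,), (5.0,), (5.2,), (4.8,), (20.0,)]
labels, attractors = center_defined_clusters(data, sigma=0.5, xi_threshold=2.0)
print(labels)
print(attractors)
```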

Grid-based Methods
Grid-based Clustering Method
• Using multi-resolution grid data structure.
• Several interesting methods:
• STING (a STatistical INformation Grid approach) [17].
• WaveCluster [18]:
• A multi-resolution clustering approach using wavelet method.
• Not part of this lecture.
• CLIQUE [19]:
• Both grid-based and subspace clustering.

[17] W. Wang, J. Yang, and R. R. Muntz, “STING: A statistical information grid approach to spatial data mining,” in VLDB’97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, M. Jarke, M. J. Carey, K. R. Dittrich, et al., Eds., Morgan Kaufmann, 1997, pp. 186–195. [Online]. Available: http://www.vldb.org/conf/1997/P186.PDF.
[18] G. Sheikholeslami, S. Chatterjee, and A. Zhang, “Wavecluster: A multi-resolution clustering approach for very large spatial databases,” in VLDB’98, Proceedings of 24rd International Conference on Very Large Data Bases, August 24-27, 1998, New York City, New York, USA, A. Gupta, O. Shmueli, and J. Widom, Eds., Morgan Kaufmann, 1998, pp. 428–439. [Online]. Available: http://www.vldb.org/conf/1998/p428.pdf.
[19] R. Agrawal, J. Gehrke, D. Gunopulos, et al., “Automatic subspace clustering of high dimensional data for data mining applications,” in SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA, L. M. Haas and A. Tiwary, Eds., ACM Press, 1998, pp. 94–105.

STING: A Statistical Information Grid Approach
• The spatial area is divided into rectangular cells.
• There are several levels of cells corresponding to different levels of resolution.

The STING Clustering Method (I)
• Each cell at a higher level:
• Partitioned into a number of smaller cells at next lower level.
• Statistical info of each cell:
• Calculated and stored beforehand and used to answer queries.
• Count, plus for each attribute: mean, standard dev, min, max
and type of distribution: normal, uniform, etc.
• Parameters of higher-level cells:
• Can easily be calculated from parameters of lower-level cells.
• Use a top-down approach:
• To answer spatial data queries (or: cluster definitions).
• Start from a pre-selected layer:
• Typically with a small number of cells.
• For each cell at the current level:
• Compute the confidence interval.
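
That the parameters of a higher-level cell follow from those of its children can be seen in a small sketch for a single attribute. It assumes population (not sample) standard deviations and omits the distribution type; the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class CellStats:
    count: int
    mean: float
    std: float
    minimum: float
    maximum: float

def leaf_stats(values):
    """Statistics of a bottom-level cell, computed directly from its values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # population variance
    return CellStats(n, mean, math.sqrt(var), min(values), max(values))

def parent_stats(children):
    """Combine child-cell statistics without touching the raw data."""
    children = [c for c in children if c.count > 0]
    n = sum(c.count for c in children)
    mean = sum(c.count * c.mean for c in children) / n
    second_moment = sum(c.count * (c.std ** 2 + c.mean ** 2) for c in children) / n
    var = max(0.0, second_moment - mean ** 2)        # guard against rounding
    return CellStats(n, mean, math.sqrt(var),
                     min(c.minimum for c in children),
                     max(c.maximum for c in children))

a = leaf_stats([1, 2, 3, 4])
b = leaf_stats([10, 20])
print(parent_stats([a, b]))
print(leaf_stats([1, 2, 3, 4, 10, 20]))   # same numbers, computed from raw data
```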
The STING Clustering Method (II)
• Cells labeled relevant or not relevant.
• At the specified confidence level.
• Irrelevant cells removed from further consideration.
• When finished examining the current layer, proceed to the next lower level.
• Only look at cells that are children of relevant cells.
• Repeat this process until bottom layer is reached.
• Find regions (clusters) that satisfy the density specified.
• Breadth-first search, at bottom layer.
• Examine cells within a certain distance from center of current cell (often just the neighbors).
• If average density within this small area is greater than density specified, mark area and put relevant
cells just examined into queue.
• Examine next cell from queue and repeat procedure, until end of queue.
• Then one region has been identified.
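
The breadth-first search at the bottom layer can be sketched as follows. This is a simplification under assumptions: cells live on a 2-D grid, only the four direct neighbors are examined, and the density test is applied per cell rather than to the average over a small area. The function name and the toy grid are hypothetical.

```python
from collections import deque

def find_regions(density, relevant, min_density):
    """Group relevant bottom-layer cells into connected dense regions.

    density:  dict mapping (row, col) -> density of that cell
    relevant: set of cells that survived the top-down filtering
    """
    seen = set()
    regions = []
    for start in sorted(relevant):
        if start in seen or density[start] < min_density:
            continue
        region = []
        queue = deque([start])
        seen.add(start)
        while queue:                       # until end of queue
            r, c = queue.popleft()
            region.append((r, c))
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if nb in relevant and nb not in seen and density[nb] >= min_density:
                    seen.add(nb)
                    queue.append(nb)
        regions.append(sorted(region))     # one region has been identified
    return regions

density = {(0, 0): 5, (0, 1): 6, (1, 1): 7, (2, 0): 1, (3, 3): 9}
print(find_regions(density, set(density), min_density=4))
```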

STING Algorithm: Analysis
• Advantages:
• Query-independent, easy to parallelize, incremental update.
• Query processing takes O(K ) time, where K is the number of grid cells at the lowest level.
• Disadvantages:
• All the cluster boundaries are either horizontal or vertical,
and no diagonal boundary is detected.

CLIQUE (CLustering In QUEst)
• Uses lower-dimensional subspaces of a high-dimensional data space that allow better
clustering than the original space.
• CLIQUE can be considered as both density-based and grid-based.
• Partitions each dimension into non-overlapping intervals,
thereby partitioning the entire data space into cells.
• Uses density threshold to identify dense cells.
• A cell is dense, if the number of data points mapped to it exceeds the density threshold.
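
Identifying dense cells in a given subspace amounts to counting points per grid cell. The sketch assumes equal-width intervals of a common width in every dimension and follows the wording above, i.e., a cell is dense when its count exceeds the threshold; the function name and toy data are hypothetical.

```python
from collections import Counter

def dense_cells(points, dims, interval_width, threshold):
    """Return {cell: count} for cells in subspace `dims` whose count exceeds threshold.

    A cell is identified by the tuple of interval indices in the chosen dimensions.
    """
    counts = Counter(
        tuple(int(p[d] // interval_width) for d in dims) for p in points
    )
    return {cell: c for cell, c in counts.items() if c > threshold}

points = [(0.1, 0.2), (0.3, 0.4), (0.2, 0.9),
          (1.5, 1.5), (1.6, 1.7), (1.7, 1.2), (5.0, 0.1)]
print(dense_cells(points, dims=(0, 1), interval_width=1.0, threshold=2))
print(dense_cells(points, dims=(0,), interval_width=1.0, threshold=2))
```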

CLIQUE: The Major Steps (I)
• Monotonicity of dense cells w.r.t. dimensionality:
• Based on the Apriori property used in frequent-pattern and association-rule mining.
• A k -dimensional cell c can contain at least l points only if every (k − 1)-dimensional projection of c (which is
a cell in a (k − 1)-dimensional subspace) contains at least l points, too.
• Clustering step 1:
• Partition each dimension into intervals, identify intervals containing at least l points.
• Iteratively join k -dimensional dense cells c1 and c2 in subspaces (Di1 , . . . , Dik ) and (Dj1 , . . . , Djk ) with
Di1 = Dj1 , Di2 = Dj2 , . . . , Di (k −1) = Dj (k −1) , where c1 and c2 share the same intervals in those
dimensions.
• New (k + 1)-dimensional candidate cell c in space (Di1 , . . . , Dik , Djk ) tested for density.
• Iteration terminates when no more candidate cells can be generated or no candidate cells are dense.
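
The join step can be sketched in the style of Apriori candidate generation. The representation is an assumption made for illustration: a cell is a tuple of (dimension, interval index) pairs sorted by dimension. Candidates produced here still have to be tested for density against the data.

```python
from itertools import combinations

def generate_candidates(dense_k):
    """Join k-dimensional dense cells into (k+1)-dimensional candidate cells."""
    candidates = set()
    for a, b in combinations(sorted(dense_k), 2):
        # Same intervals in the first k-1 dimensions, different last dimension.
        if a[:-1] == b[:-1] and a[-1][0] < b[-1][0]:
            cand = a + (b[-1],)
            # Monotonicity: every k-dimensional projection must be dense.
            if all(cand[:i] + cand[i + 1:] in dense_k for i in range(len(cand))):
                candidates.add(cand)
    return candidates

# Dense 2-D cells: (dimension, interval) pairs.
dense_2 = {((0, 1), (1, 1)), ((0, 1), (2, 3)), ((1, 1), (2, 3))}
print(generate_candidates(dense_2))

# Without the projection onto dimensions (1, 2), the candidate is pruned.
print(generate_candidates(dense_2 - {((1, 1), (2, 3))}))
```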

CLIQUE: The Major Steps (II)
• Clustering step 2:
• Use dense cells in each subspace to assemble clusters.
• Apply Minimum Description Length (MDL) principle to use the maximal regions to cover connected
dense cells.
• Maximal region: hyper rectangle where every cell falling into the regions is dense, and region cannot be
extended further in any dimension.
• Simple greedy approach:
• Start with arbitrary dense cell.
• Find a maximal region covering that cell.
• Work on remaining dense cells that have not yet been covered.

CLIQUE: Example

Strengths and Weaknesses of CLIQUE
• Strengths:
• Automatically finds subspaces of the highest dimensionality
such that high-density clusters exist in those subspaces.
• Insensitive to the order of records in input and does not presume
any canonical data distribution.
• Scales linearly with the size of input and has good scalability
as the number of dimensions in the data is increased.
• Weaknesses:
• Dependent on proper grid size and density threshold.
• Accuracy of the clustering result may be degraded as the price
for the simplicity of the method.

Evaluation of clustering
Assessing Clustering Tendency
• Assess if non-random structure exists in the data by measuring the probability that the data
is generated by a uniform data distribution.
• Data with random structure:
• Points uniformly distributed in data space.
• Clustering may return clusters, but:
• Artificial partitioning.
• Meaningless.

Hopkins Statistic (I)
• Data set:
• Let X = {xi |i = 1, . . . , n} be a collection of n patterns in a d-dimensional space such that
xi = (xi1 , xi2 , . . . , xid ).
• Random sample of data space:
• Let Y = {yj | j = 1, . . . , m} be m sampling points placed at random in the d-dimensional space, with
m ≪ n.
• Two types of distances defined:
• uj as minimum distance from yj to its nearest pattern in X and
• wj as minimum distance from a randomly selected pattern in X to its nearest neighbor in X .
• The Hopkins statistic in d dimensions is defined as:
$H = \frac{\sum_{j=1}^{m} u_j}{\sum_{j=1}^{m} u_j + \sum_{j=1}^{m} w_j}.$
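
A minimal sketch of the computation follows. It rests on assumptions that the slide leaves open: sampling points are drawn uniformly from the bounding box of X, the m patterns for wj are selected without replacement, and distances are Euclidean. The function name and the generated data are hypothetical.

```python
import math
import random

def hopkins(X, m, rng):
    d = len(X[0])
    lows = [min(x[k] for x in X) for k in range(d)]
    highs = [max(x[k] for x in X) for k in range(d)]

    # u_j: distance from a random location y_j to its nearest pattern in X.
    u = []
    for _ in range(m):
        y = tuple(rng.uniform(lows[k], highs[k]) for k in range(d))
        u.append(min(math.dist(y, x) for x in X))

    # w_j: distance from a randomly selected pattern to its nearest neighbor in X.
    w = []
    for i in rng.sample(range(len(X)), m):
        w.append(min(math.dist(X[i], X[j]) for j in range(len(X)) if j != i))

    return sum(u) / (sum(u) + sum(w))

# Two tight blobs: strongly clustered data.
data_rng = random.Random(0)
X = ([(data_rng.gauss(0.0, 0.1), data_rng.gauss(0.0, 0.1)) for _ in range(50)]
     + [(data_rng.gauss(10.0, 0.1), data_rng.gauss(10.0, 0.1)) for _ in range(50)])
print(hopkins(X, m=10, rng=random.Random(1)))
```

Since the result depends on the random sample, it is common to repeat the computation several times and look at the average.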

Hopkins Statistic (II)
• Compares nearest-neighbor distribution of randomly selected locations (points) to that for
randomly selected patterns.
• Under the null hypothesis, H0 , of uniform distribution:
• Distances from sampling points to nearest patterns should, on the average,
be the same as the interpattern nearest-neighbor distances, implying randomness.
• H should be about 0.5.
• When patterns are aggregated or clustered:
• Distances from sampling points to nearest patterns should, on the average,
be larger than the interpattern nearest-neighbor distances.
• H should be larger than 0.5.
• Almost equal to 1.0 for very well clustered data.

Determine the Number of Clusters
• Empirical method:
• # of clusters ≈ $\sqrt{n/2}$ for a dataset of n points.
• Elbow method:
• Use the turning point in the curve of sum of within-cluster variance w.r.t. the # of clusters.
• Cross-validation method:
• Divide a given data set into m parts.
• Use m − 1 parts to obtain a clustering model.
• Use the remaining part to test the quality of the clustering.
E.g., for each point in the test set, find the closest centroid, and use the sum of squared distances
between all points in the test set and the closest centroids to measure how well the model fits the test
set.
• For any k > 0, repeat it m times, compare the overall quality measure w.r.t. different k ’s, and find # of
clusters that fits the data the best.
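
Two of these ingredients are simple enough to sketch: the empirical rule and the quality measure used on the held-out part in the cross-validation method. The sketch assumes Euclidean distance and that the centroids were obtained beforehand from the other m − 1 parts by some clustering algorithm; both function names are hypothetical.

```python
import math

def rule_of_thumb_k(n):
    """Empirical method: about sqrt(n / 2) clusters for n points."""
    return round(math.sqrt(n / 2))

def holdout_sse(test_points, centroids):
    """Sum of squared distances from each test point to its closest centroid."""
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in test_points)

print(rule_of_thumb_k(200))
print(holdout_sse([(0, 0), (10, 10)], centroids=[(0, 1), (10, 10)]))
```

For a given k, `holdout_sse` would be averaged over the m folds, and the averages compared across different values of k.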

Measuring Clustering Quality
Two methods:
1. Extrinsic: supervised, i.e., the ground truth is available.
• Compare a clustering against the ground truth using certain clustering quality measure.
• Ex. BCubed precision and recall metrics.
2. Intrinsic: unsupervised, i.e., the ground truth is unavailable.
• Evaluate the goodness of a clustering by considering how well the clusters are separated, and how
compact the clusters are.
• Ex. silhouette coefficient.
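
As an illustration of an intrinsic measure, the silhouette coefficient can be sketched as follows. The definition used is the common one and is stated here as an assumption, since the slide only names the measure: for an object o, a(o) is the mean distance to the other objects of its own cluster (compactness), b(o) is the smallest mean distance to the objects of any other cluster (separation), and s(o) = (b(o) − a(o)) / max{a(o), b(o)}. Singleton clusters get the score 0 by convention, and at least two clusters are required.

```python
import math

def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)
            continue
        # d(p, p) = 0, so summing over the whole cluster and dividing by
        # |cluster| - 1 gives the mean distance to the other members.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != label)
        scores.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(silhouette(points, [0, 0, 1, 1]))   # compact, well separated
print(silhouette(points, [0, 1, 0, 1]))   # poor assignment
```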

Measuring Clustering Quality: Extrinsic Methods
• Clustering-quality measure: Q (C , C∗ ).
• For a clustering C given the ground truth C∗ .
• Q is good, if it satisfies the following four essential criteria:
• Cluster homogeneity:
• The purer, the better.
• Cluster completeness:
• Should assign objects that belong to the same category
in the ground truth to the same cluster.
• Rag bag:
• Putting a heterogeneous object into a pure cluster should be penalized more than putting it into a rag bag
(i.e., "miscellaneous" or "other" category).
• Small cluster preservation:
• Splitting a small category into pieces is more harmful than
splitting a large category into pieces.
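
BCubed precision and recall, mentioned earlier as an example of an extrinsic measure, can be sketched as follows. The variant shown is an assumption: each object counts as a member of its own cluster and category, which avoids division by zero for singletons; some formulations exclude the object itself. Names and toy data are hypothetical.

```python
def bcubed(clusters, categories):
    """Return (precision, recall) averaged over all objects.

    clusters[i]:   cluster assigned to object i by the clustering C
    categories[i]: category of object i in the ground truth C*
    """
    n = len(clusters)
    precision = recall = 0.0
    for i in range(n):
        same_cluster = [j for j in range(n) if clusters[j] == clusters[i]]
        same_category = [j for j in range(n) if categories[j] == categories[i]]
        correct = sum(1 for j in same_cluster if categories[j] == categories[i])
        precision += correct / len(same_cluster)    # purity around object i
        recall += correct / len(same_category)      # completeness around object i
    return precision / n, recall / n

print(bcubed([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"]))
print(bcubed([0, 0, 1, 1, 1], ["a", "a", "b", "b", "b"]))   # perfect clustering
```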

Summary
• Cluster analysis:
• Groups objects based on their similarity and has wide applications.
• Measure of similarity:
• Can be computed for various types of data.
• Clustering algorithms can be categorized into:
• Partitioning methods (k -means and k -medoids).
• Hierarchical methods (BIRCH and CHAMELEON; probabilistic hierarchical clustering).
• Density-based methods (DBSCAN, OPTICS, and DENCLUE).
• Grid-based methods (STING, CLIQUE).
• Model-based methods.
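• Quality of clustering results:
• Can be assessed with extrinsic methods (ground truth available) or intrinsic methods (no ground truth).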

Any questions about this chapter?

Ask them now or ask them later in our forum:

StudOn Forum
https://www.studon.fau.de/frm5045379.html

