Clustering Algorithms
Concepts and Techniques (3rd ed.) — Chapter 10
3
Clustering for Data Understanding
and Applications
Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth
observation database
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earthquake studies: observed earthquake epicenters should be
clustered along continental faults
Climate: understanding Earth's climate, finding patterns in the
atmosphere and ocean
Economic Science: market research
4
Clustering as a Preprocessing Tool
(Utility)
Summarization:
Preprocessing for regression, PCA, classification, and
association analysis
Compression:
Image processing: vector quantization
Finding K-nearest Neighbors
Localizing search to one or a small number of clusters
Outlier detection
Outliers are often viewed as those “far away” from any
cluster
5
Quality: What Is Good Clustering?
A good clustering method will produce high quality
clusters
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method,
its implementation, and
its ability to discover some or all of the hidden patterns
6
Measure the Quality of
Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal, ratio, and vector variables
Weights should be associated with different variables
based on applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that
measures the “goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
7
Considerations for Cluster
Analysis
Partitioning criteria
Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
Separation of clusters
Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than one
class)
Similarity measure
Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
Clustering space
Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering) – dimension refers to the attributes in
the dataset
8
Requirements and Challenges
Scalability
Clustering all the data instead of only samples
Constraint-based clustering
User may give inputs on constraints
High dimensionality
9
Major Clustering Approaches
(I)
Partitioning approach:
Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of squared errors
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
Density-based approach:
Based on connectivity and density functions
Grid-based approach:
based on a multiple-level granularity structure
10
Major Clustering Approaches
(II)
Model-based:
A model is hypothesized for each of the clusters, and the method tries to find the best fit of the data to the model
Frequent pattern-based:
Based on the analysis of frequent patterns
User-guided or constraint-based:
Clustering by considering user-specified or application-specific
constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways
11
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Sum of squared errors criterion: $E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$
Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
Global optimal: exhaustively enumerate all partitions
Heuristic methods: k-means and k-medoids algorithms
k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
13
The K-Means Clustering Method
14
An Example of K-Means Clustering (K = 2)
[Figure: the initial data set is arbitrarily partitioned into k groups; cluster centroids are updated and objects reassigned to their nearest centroid, looping as needed]

The k-means procedure:
Partition the objects into k nonempty subsets
Repeat
  Compute the centroid (i.e., mean point) of each partition
  Assign each object to the cluster of its nearest centroid
Until no change
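A minimal sketch of this loop in Python, assuming NumPy and Euclidean distance; the random initial partition, the function name k_means, and the toy data set at the end are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def k_means(points, k, max_iter=100, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then
    recompute centroids as cluster means, until assignments stop changing."""
    rng = np.random.default_rng(seed)
    n = len(points)
    # Arbitrarily partition the objects into k groups (random initial labels).
    labels = rng.integers(0, k, size=n)
    for _ in range(max_iter):
        # Compute the centroid (i.e., mean point) of each partition;
        # reseed an empty cluster with a random point.
        centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                              else points[rng.integers(n)] for j in range(k)])
        # Assign each object to the cluster of its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # until no change
            break
        labels = new_labels
    return labels, centroids

# Usage on a toy one-dimensional data set (values chosen for illustration).
data = np.array([[1.0], [2.0], [3.0], [8.0], [9.0], [10.0], [25.0]])
labels, centroids = k_means(data, k=2)
```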
15
Comments on the K-Means Method
17
What Is the Problem of the K-Means
Method?
[Figure: two scatter plots (0-10 by 0-10) contrasting two possible clusterings of the same data]
18
PAM: A Typical K-Medoids Algorithm
K = 2, Total Cost = 20
[Figure: PAM on a 10 x 10 data set]
Arbitrarily choose k objects as the initial medoids
Assign each remaining object to its nearest medoid
Do loop (until no change):
  Randomly select a non-medoid object, O_random
  Compute the total cost of swapping a medoid O with O_random (Total Cost = 26 for the swap shown)
  Swap O and O_random if the quality is improved
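A compact sketch of the PAM idea, assuming NumPy and Euclidean distance; the cost of a configuration is the total distance of every object to its nearest medoid. Unlike the randomized selection in the figure, this sketch simply tries every medoid/non-medoid swap; names such as pam and total_cost are illustrative.

```python
import numpy as np

def total_cost(points, medoid_idx):
    """Sum over all objects of the distance to their nearest medoid."""
    d = np.linalg.norm(points[:, None, :] - points[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(points, k, seed=0):
    rng = np.random.default_rng(seed)
    n = len(points)
    medoids = list(rng.choice(n, size=k, replace=False))  # arbitrarily choose k initial medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:                                        # do-loop until no change
        improved = False
        for m_pos in range(k):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m_pos] = o                       # swap a medoid with non-medoid o
                cost = total_cost(points, candidate)       # compute total cost of swapping
                if cost < best:                            # keep the swap only if quality improves
                    best, medoids, improved = cost, candidate, True
    return medoids, best
```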
19
The K-Medoid Clustering Method
20
Example
Consider the following data set consisting of the
scores of two variables on each of seven
individuals:
21
Example
22
Example
Input table
23
Example
24
Example
25
Example
26
Example
27
Example
The latter partitioning has the lowest within-
cluster variation; therefore, the k-means method
assigns the value 8 to a cluster different from that
containing 9 and 10 due to the outlier point 25.
Moreover, the center of the second cluster, 14.67,
is substantially far from all the members in the
cluster.
Such sensitivity to outliers can be diminished by using the
k-medoids algorithm
28
K-Medoids Algorithm
29
Example
30
Example
Let us choose (3, 4) and (7, 4) as the medoids
31
Example
33
Example
34
BFR Algorithm
BFR - Bradley-Fayyad-Reina
BFR is a scalable clustering algorithm particularly
useful for handling large datasets that do not fit
into memory.
It is designed to work with data that is normally
distributed and uses the k-means clustering
approach as a base
But it is more efficient on large datasets because it keeps
only compact summary statistics and processes the data
chunk by chunk.
35
BFR Algorithm
Key Concepts of BFR Algorithm
Assumptions:
• The data is normally distributed (Gaussian
distribution).
• The algorithm works well in high-dimensional spaces.
• The dataset is too large to fit into main memory at
once, so it works incrementally, processing
chunks of the data that can fit in memory.
36
BFR Algorithm
Algorithm Process: The BFR algorithm breaks
down the clustering process into several stages.
These are repeated iteratively as new chunks of
data are processed.
• Initialization (Phase 1): The process begins by
selecting a random sample of the data that fits
into memory and running k-means clustering on
it. This creates initial clusters, which are
represented using statistical summaries instead
of storing all the data points in each cluster.
37
BFR Algorithm
Summarizing Clusters (Phase 2): Each cluster
is represented by the following three types of
statistics (these are maintained to avoid
recalculating distances for every data point):
• N: The number of points in the cluster.
• SUM: The sum of the coordinates of the points in
the cluster.
• SUMSQ: The sum of the squares of the
coordinates of the points in the cluster. From
these statistics, you can compute the centroid
(mean) and the variance (spread) of the clusters.
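The centroid and per-dimension variance can be recovered from (N, SUM, SUMSQ) alone, which is what lets BFR discard the raw points. A small sketch of such a summary object, assuming NumPy; the class and field names are illustrative, not from the slides.

```python
import numpy as np

class ClusterSummary:
    """BFR-style summary of one cluster: N, SUM and SUMSQ per dimension."""
    def __init__(self, first_point):
        p = np.asarray(first_point, dtype=float)
        self.n = 1
        self.sum = p.copy()
        self.sumsq = p * p

    def add(self, point):
        """Fold a new point into the statistics without storing it."""
        p = np.asarray(point, dtype=float)
        self.n += 1
        self.sum += p
        self.sumsq += p * p

    @property
    def centroid(self):
        return self.sum / self.n                 # mean = SUM / N

    @property
    def variance(self):
        # per-dimension variance = SUMSQ/N - (SUM/N)^2
        return self.sumsq / self.n - (self.sum / self.n) ** 2
```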
38
BFR Algorithm
Processing New Data (Phase 3): New data is
processed in batches or chunks that fit into
memory. For each point:
1. If the point is close enough to the centroid of an
existing cluster (based on a threshold distance), it
is added to that cluster, and the cluster’s statistics
are updated.
2. If the point is not close enough to any existing
cluster, it is stored temporarily for future analysis
or becomes its own cluster if it meets certain
criteria.
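A sketch of the per-point decision in Phase 3, reusing the ClusterSummary sketch above. The "close enough" test here uses a normalized (Mahalanobis-like) distance against a threshold, which is one common choice rather than a rule fixed by the slides; the threshold value and function name are assumptions.

```python
import numpy as np

def assign_point(point, summaries, threshold=3.0):
    """Add `point` to the nearest sufficiently close cluster summary,
    or return it so the caller can keep it in the retained set."""
    p = np.asarray(point, dtype=float)
    best, best_dist = None, float("inf")
    for s in summaries:
        std = np.sqrt(np.maximum(s.variance, 1e-12))          # avoid division by zero
        d = np.sqrt((((p - s.centroid) / std) ** 2).sum())     # normalized distance to centroid
        if d < best_dist:
            best, best_dist = s, d
    if best is not None and best_dist < threshold:
        best.add(p)            # update the cluster's statistics
        return None
    return p                   # not close enough: keep for later analysis
```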
39
BFR Algorithm
Discard Set, Compression Set, and Retained Set:
• Discard Set (DS): These are the clusters that are
sufficiently large and are represented by the cluster
statistics (N, SUM, SUMSQ). These clusters are
"discarded" from memory as only their statistics are kept.
• Compression Set (CS): These are clusters that are too
small to be considered finalized but still compressible.
They have some intermediate points but don’t yet meet
the size threshold.
• Retained Set (RS): These are points that cannot be
assigned to any current cluster or compressed group.
They are retained for later re-evaluation.
40
BFR Algorithm
Merging and Output (Phase 4): The algorithm
periodically attempts to merge clusters from the
compression set into the discard set if they grow
large enough. Once the algorithm finishes, it
outputs the cluster centroids along with the
variance or spread.
41
BFR Algorithm
Key Strengths:
• Efficient for Large Datasets: The BFR
algorithm’s ability to summarize clusters with
intermediate statistics allows it to handle much
larger datasets than regular k-means.
• Handles High-Dimensional Data: It is
particularly useful for high-dimensional datasets
where exact distances between points may be
less meaningful due to the curse of
dimensionality.
42
BFR Algorithm
Limitations:
• Assumption of Normality: BFR assumes that
clusters follow a Gaussian distribution, so it may
not perform well on data that doesn’t conform to
this assumption.
• Fixed Number of Clusters (k): Like k-means,
the number of clusters must be pre-specified,
which may not be ideal in cases where the exact
number of clusters is unknown.
43
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Methods
Evaluation of Clustering
44
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
[Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) runs from step 0 to step 4: a and b merge into ab, d and e merge into de, c joins to form cde, and finally ab and cde merge into abcde. Divisive clustering (DIANA) runs the same steps in reverse, from step 4 (abcde) back to step 0 (singletons).]
45
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical packages, e.g., Splus
Use the single-link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
[Figure: three 10 x 10 scatter plots showing successive AGNES merging steps]
46
Dendrogram: Shows How Clusters Are Merged
Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level: each connected component
then forms a cluster
47
DIANA (Divisive Analysis)
[Figure: three 10 x 10 scatter plots showing successive DIANA splitting steps]
48
Distance between Clusters
Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = min dist(tip, tjq)
Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = max dist(tip, tjq)
Average: average distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = avg dist(tip, tjq)
Centroid: distance between the centroids of the two clusters, i.e.,
dist(Ki, Kj) = dist(Ci, Cj)
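A minimal sketch of these four inter-cluster distances, assuming each cluster is a NumPy array of points; the helper and function names are illustrative.

```python
import numpy as np

def _pairwise(a, b):
    """All Euclidean distances between points of cluster a and cluster b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)

def single_link(a, b):      # smallest pairwise distance
    return _pairwise(a, b).min()

def complete_link(a, b):    # largest pairwise distance
    return _pairwise(a, b).max()

def average_link(a, b):     # average pairwise distance
    return _pairwise(a, b).mean()

def centroid_link(a, b):    # distance between the two centroids
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
```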
50
Example
Calculate the pairwise distance measure
51
Example
Consider each object as a cluster in the beginning
Then find the closest pair of clusters and merge them
into a single cluster
Update the distance matrix using the linkage rule
Distances between the untouched clusters do not change
To update the distance from the merged cluster to every other
cluster, apply the linkage rule to the two old distances
(e.g., take their minimum under single link)
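A sketch of this merge loop using the single-link update rule (complete or average link would differ only in the min/max/mean used); the function name and the stopping criterion (a target number of clusters) are illustrative assumptions.

```python
import numpy as np

def agglomerative(points, num_clusters=1):
    """Start with each object as its own cluster and repeatedly merge the
    closest pair, updating the distance matrix with the single-link rule."""
    clusters = [[i] for i in range(len(points))]
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a cluster is never merged with itself
    while len(clusters) > num_clusters:
        i, j = np.unravel_index(dist.argmin(), dist.shape)   # closest pair of clusters
        if i > j:
            i, j = j, i
        clusters[i] = clusters[i] + clusters[j]              # merge cluster j into cluster i
        del clusters[j]
        # Single link: distance to the merged cluster is the min of the old distances.
        new_row = np.minimum(dist[i], dist[j])
        dist[i, :] = new_row
        dist[:, i] = new_row
        dist = np.delete(np.delete(dist, j, axis=0), j, axis=1)
        np.fill_diagonal(dist, np.inf)
    return clusters
```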
52
Example
53
Example
54
Example
55
Example
56
Divisive Hierarchical Clustering
57
Example
Find the clusters using complete linkage. Use Euclidean distance.
Step 1: distances to the centroid of the full cluster {P1, ..., P6} (centroid = (0.2933, 0.3550)); the point with the smallest distance is split off at each step.

     X     Y     dist(XY, XY-bar)
P1   0.40  0.53  0.205
P2   0.22  0.38  0.077
P3   0.35  0.32  0.067
P4   0.26  0.19  0.168
P5   0.08  0.41  0.220
P6   0.45  0.30  0.166

Split P3 from the list and repeat for the remaining cluster.

Step 2: remaining cluster {P1, P2, P4, P5, P6} (centroid = (0.282, 0.362)):

     X     Y     dist(XY, XY-bar)
P1   0.40  0.53  0.205
P2   0.22  0.38  0.065
P4   0.26  0.19  0.173
P5   0.08  0.41  0.208
P6   0.45  0.30  0.179

Split P2 from the list and repeat for the remaining cluster.

Step 3: remaining cluster {P1, P4, P5, P6} (centroid = (0.2975, 0.3575)):

     X     Y     dist(XY, XY-bar)
P1   0.40  0.53  0.201
P4   0.26  0.19  0.172
P5   0.08  0.41  0.224
P6   0.45  0.30  0.163

Split P6 from the list and repeat for the remaining cluster.

Step 4: remaining cluster {P1, P4, P5} (centroid = (0.2467, 0.3767)):

     X     Y     dist(XY, XY-bar)
P1   0.40  0.53  0.217
P4   0.26  0.19  0.187
P5   0.08  0.41  0.170

Split P5 from the list and repeat for the remaining cluster.
59
Example - Solved
Step 5: remaining cluster {P1, P4} (centroid = (0.33, 0.36)):

     X     Y     dist(XY, XY-bar)
P1   0.40  0.53  0.184
P4   0.26  0.19  0.183
Final Clustering
((((((P1)P4)P5)P6)P2)P3)
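The dist(XY, XY-bar) column in the tables above is just the Euclidean distance of each point to the mean of the cluster it currently belongs to. A small check of the first table (rounded as in the slides); the variable names are illustrative.

```python
import numpy as np

points = {"P1": (0.40, 0.53), "P2": (0.22, 0.38), "P3": (0.35, 0.32),
          "P4": (0.26, 0.19), "P5": (0.08, 0.41), "P6": (0.45, 0.30)}

coords = np.array(list(points.values()))
centroid = coords.mean(axis=0)                      # (0.2933, 0.3550) for the full cluster
dists = np.linalg.norm(coords - centroid, axis=1)   # distance of each point to the centroid
for name, d in zip(points, dists):
    print(name, round(float(d), 3))                 # P3 has the smallest distance (~0.067)
```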
61
CURE algorithm
Hierarchical Clustering: CURE is based on
hierarchical clustering, where clusters are formed
by merging smaller clusters.
The process starts by treating each data point as
its own cluster and progressively merges clusters
based on their proximity, until a desired number
of clusters is obtained.
62
CURE algorithm
Representative Points:
• Instead of using just the centroid (mean) of each cluster to
represent the entire cluster (as in k-means), CURE
selects multiple representative points that are scattered
across the cluster.
• These representative points capture the shape and
structure of the cluster more accurately than just a single
centroid, making the algorithm better suited for detecting
clusters of irregular shapes.
• After selecting representative points, they are shrunk
toward the centroid by a fixed fraction, allowing the
algorithm to moderate the influence of outliers and noisy
points.
63
CURE algorithm
Merge Criterion:
• In traditional hierarchical clustering, clusters are
merged based on the distance between their
centroids. In CURE, clusters are merged based
on the minimum distance between the
representative points of two clusters.
• This method reduces the influence of outliers and
ensures that the clusters merged are truly close
to one another.
64
CURE algorithm
65
CURE algorithm
Representative Points:
• For each cluster, a fixed number of
representative points (e.g., 10 points) are
chosen. These points are selected to be far from
each other to represent the shape and boundary
of the cluster.
• After selection, these points are shrunk toward
the centroid of the cluster by a shrink factor (e.g.,
20%). This shrinkage helps reduce the effect of
noise or outliers.
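A sketch of selecting scattered representative points and shrinking them toward the centroid, with the number of representatives and the shrink factor as the tunable parameters mentioned above. The farthest-point selection used here is a common heuristic, and the function names are illustrative.

```python
import numpy as np

def representatives(cluster, num_rep=10, shrink=0.2):
    """Pick up to num_rep well-scattered points, then shrink them toward the centroid."""
    cluster = np.asarray(cluster, dtype=float)
    centroid = cluster.mean(axis=0)
    # Start from the point farthest from the centroid, then repeatedly add the
    # point farthest from the representatives chosen so far.
    reps = [cluster[np.linalg.norm(cluster - centroid, axis=1).argmax()]]
    while len(reps) < min(num_rep, len(cluster)):
        d = np.min(np.linalg.norm(cluster[:, None, :] - np.array(reps)[None, :, :], axis=2), axis=1)
        reps.append(cluster[d.argmax()])
    reps = np.array(reps)
    # Shrink each representative toward the centroid by the shrink factor (e.g., 20%).
    return reps + shrink * (centroid - reps)

def merge_distance(reps_a, reps_b):
    """CURE merges the pair of clusters with the smallest distance between representatives."""
    return np.linalg.norm(reps_a[:, None, :] - reps_b[None, :, :], axis=2).min()
```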
66
CURE algorithm
Merging Clusters:
• Clusters are merged based on the minimum
distance between any pair of representative
points in the two clusters being considered for
merging.
• The algorithm continues to merge clusters until
the desired number of clusters k is reached.
Outlier Detection:
• CURE can also handle outliers by discarding
clusters that are too small or too isolated.
67
CURE algorithm
Advantages
Handles Arbitrary Shapes and Sizes: by using multiple scattered
representative points instead of a single centroid, CURE can find
non-spherical clusters and clusters of widely varying sizes
69
Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such
as density-connected points
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters as termination condition
Several interesting studies:
DBSCAN: Ester, et al. (KDD’96)
70
Density-Based Clustering: Basic
Concepts
Two parameters:
Eps (epsilon): Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-
neighbourhood of that point
N_Eps(p): {q in D | dist(p, q) ≤ Eps}
Directly density-reachable: a point p is directly density-reachable
from a point q w.r.t. Eps, MinPts if
  p belongs to N_Eps(q), and
  q satisfies the core point condition: |N_Eps(q)| ≥ MinPts
[Figure: p inside the Eps-neighborhood of q, with MinPts = 5 and Eps = 1 cm]
71
Density-Based Clustering: Background (II)
Density-reachable:
A point p is density-reachable from a point q w.r.t. Eps, MinPts
if there is a chain of points p1, ..., pn with p1 = q and pn = p
such that p_{i+1} is directly density-reachable from p_i
Density-connected:
A point p is density-connected to a point q w.r.t. Eps, MinPts
if there is a point o such that both p and q are density-reachable
from o w.r.t. Eps and MinPts
72
DBSCAN: Density-Based Spatial
Clustering of Applications with Noise
Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
Discovers clusters of arbitrary shape in spatial databases
with noise
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
73
DBSCAN: The Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable
from p, and DBSCAN visits the next point of the database
(a border point has low density itself but lies in the
neighborhood of a core point)
Continue the process until all of the points have been processed
74
DBSCAN algorithm
Algorithmic steps for DBSCAN clustering
Let X = {x1, x2, x3, ..., xn} be the set of data points. DBSCAN requires two
parameters: ε (eps) and the minimum number of points required to form a cluster
(minPts).
1) Start with an arbitrary point that has not been visited.
2) Extract the ε-neighborhood of this point (all points within distance ε are neighbors).
3) If there are sufficiently many points in the neighborhood (at least minPts), the clustering
process starts and the point is marked as visited; otherwise the point is labeled as noise
(it may later become part of a cluster).
4) If a point is found to be part of a cluster, its ε-neighborhood is also part of the cluster,
and the procedure from step 2 is repeated for every point in that neighborhood. This is repeated
until all points in the cluster are determined.
5) A new unvisited point is retrieved and processed, leading to the discovery of a further
cluster or of noise.
6) This process continues until all points are marked as visited.
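A compact sketch of these steps, assuming Euclidean distance; points labeled -1 are noise. The brute-force neighborhood search, the label convention, and the parameter defaults are illustrative choices, not prescribed by the slides.

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=3):
    """Label each point with a cluster id, or -1 for noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    labels = np.full(n, -1)           # -1 = noise / not yet assigned
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def neighbors(i):
        # Step 2: all points within eps of point i (brute force).
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        seeds = list(neighbors(i))
        if len(seeds) < min_pts:       # step 3: too few neighbors -> noise (for now)
            continue
        labels[i] = cluster_id         # step 3: start a new cluster from this core point
        while seeds:                   # step 4: expand through the eps-neighborhoods
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id     # border or previously-noise point joins the cluster
            if not visited[j]:
                visited[j] = True
                nb = neighbors(j)
                if len(nb) >= min_pts:     # j is itself a core point: keep expanding
                    seeds.extend(nb)
        cluster_id += 1                # steps 5-6: move on to the next unvisited point
    return labels
```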
75
DBSCAN algorithm
Advantages
1) Does not require a-priori specification of number of
clusters.
2) Able to identify noise data while clustering.
3) DBSCAN is able to find clusters of arbitrary size and shape.
Disadvantages
1) DBSCAN fails in the case of clusters with varying densities.
2) Fails on neck-type datasets, where clusters are joined by thin
bridges of points.
3) Does not work well with high-dimensional data.
76
DBSCAN-Exercise
Consider the Figure and assume the use of Euclidean distance
between points. Also assume radius = 2 and minpts = 3.
Find any 2 core points using DBSCAN algorithm.
77
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Quantizes the object space into a finite number of cells that
form a grid structure
Then performs the clustering operation on the grid structure
Faster processing than other methods: the processing time depends
on the number of grid cells rather than the number of data objects
79
Steps of Grid-based Clustering
Algorithms
80
Advantages of Grid-based Clustering
Algorithms
Fast:
No distance computations
Easy to determine which cells are neighboring
Shapes are limited to unions of grid cells
81
Grid-Based Clustering Methods
82
STING: A Statistical Information
Grid Approach
Wang, Yang and Muntz (VLDB’97)
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to different
levels of resolution
83
STING: A Statistical
Information Grid Approach (2)
Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
Statistical info of each cell is calculated and stored beforehand
and is used to answer queries
Parameters of higher level cells can be easily calculated from
the parameters of the lower level cells:
count, mean, standard deviation (s), min, max
type of distribution—normal, uniform, etc.
Use a top-down approach to answer spatial data queries
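A sketch of how a parent cell's parameters follow from its children for one attribute, assuming each child carries n, mean, s (population standard deviation), min, max; min and max aggregate trivially. The dictionary layout and function name are illustrative.

```python
import math

def aggregate_cells(children):
    """Compute (n, mean, s, min, max) of a parent cell from its child cells."""
    n = sum(c["n"] for c in children)
    mean = sum(c["n"] * c["mean"] for c in children) / n
    # Each child's sum of squares is n * (s^2 + mean^2); combine them,
    # then subtract the parent mean^2 to get the parent variance.
    ex2 = sum(c["n"] * (c["s"] ** 2 + c["mean"] ** 2) for c in children) / n
    s = math.sqrt(max(ex2 - mean ** 2, 0.0))
    return {
        "n": n,
        "mean": mean,
        "s": s,
        "min": min(c["min"] for c in children),
        "max": max(c["max"] for c in children),
    }
```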
84
STING: A Statistical
Information Grid Approach (3)
Advantages:
Query-independent, easy to parallelize, incremental
update
O(K), where K is the number of grid cells at the
lowest level
Disadvantages:
All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected
86
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Elbow method
Use the turning point in the curve of the sum of within-cluster
variance with respect to the number of clusters k
Cross-validation method
Divide the data set into m parts; use m - 1 parts to build a
clustering model and the remaining part to test its quality
E.g., for each point in the test set, find the closest centroid, and
use the sum of squared distances between all points in the test
set and their closest centroids to measure how well the model fits
the test set
For any k > 0, repeat it m times, compare the overall quality measure
w.r.t. different k's, and find the number of clusters that fits the data best
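A sketch of the elbow idea: compute the total within-cluster sum of squares for a range of k and look for the turning point. Any routine with the signature cluster_fn(points, k) -> labels can be plugged in, for example a wrapper around the k_means sketch shown earlier; the function names are illustrative.

```python
import numpy as np

def within_cluster_ss(points, labels):
    """Sum of squared distances of every point to its cluster centroid."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

def elbow_curve(points, cluster_fn, k_values):
    """Map each k to its within-cluster sum of squares."""
    return {k: within_cluster_ss(points, cluster_fn(points, k)) for k in k_values}

# Plotting the returned dictionary (k on the x-axis, WSS on the y-axis) and
# looking for the turning point is the usual way to read off the elbow.
```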
89
Measuring Clustering Quality
90
Measuring Clustering Quality: Intrinsic Method (Silhouette Coefficient)
For each data point, compute the average distance to all other points
in its own cluster (ai).
Compute the average distance from the point to all data points in the
closest other cluster (bi).
Compute the coefficient: si = (bi - ai) / max(ai, bi); values close to 1
indicate a well-clustered point.
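A direct sketch of this coefficient, assuming NumPy, Euclidean distance, and at least two clusters; the function name and the convention of scoring singleton clusters as 0 are illustrative choices.

```python
import numpy as np

def silhouette(points, labels):
    """Mean silhouette coefficient s_i = (b_i - a_i) / max(a_i, b_i) over all points."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    scores = []
    for i in range(len(points)):
        same = (labels == labels[i])
        same[i] = False
        if not same.any():                  # singleton cluster: score it as 0 here
            scores.append(0.0)
            continue
        d = np.linalg.norm(points - points[i], axis=1)
        a_i = d[same].mean()                # average distance within the point's own cluster
        b_i = min(d[labels == c].mean()     # average distance to the closest other cluster
                  for c in np.unique(labels) if c != labels[i])
        scores.append((b_i - a_i) / max(a_i, b_i))
    return float(np.mean(scores))
```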
91
Measuring Clustering Quality: Extrinsic
Methods
Clustering quality measure: Q(C, Cg), for a clustering C
given the ground truth Cg.
Q is good if it satisfies the following 4 essential criteria
Cluster homogeneity: the purer, the better