Clustering Algorithms

Chapter 10 of 'Data Mining: Concepts and Techniques' discusses cluster analysis, covering basic concepts, various clustering methods (partitioning, hierarchical, density-based, grid-based, and model-based), and evaluation criteria for clustering quality. It highlights the applications of clustering in fields like biology, marketing, and city planning, and addresses challenges such as scalability and the handling of different data types. The chapter also details specific algorithms like k-means and k-medoids, including their strengths and weaknesses.


Data Mining:

Concepts and
Techniques
(3rd ed.)

— Chapter 10 —

Jiawei Han, Micheline Kamber, and Jian Pei


University of Illinois at Urbana-Champaign &
Simon Fraser University
©2011 Han, Kamber & Pei. All rights reserved.
1
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts


 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
2
What is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group

 dissimilar (or unrelated) to the objects in other groups

 Cluster analysis (or clustering, data segmentation, …)
 Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
 Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
 Typical applications
 As a stand-alone tool to get insight into data distribution

 As a preprocessing step for other algorithms

3
Clustering for Data Understanding
and Applications
 Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
 Information retrieval: document clustering
 Land use: Identification of areas of similar land use in an earth
observation database
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
 Climate: understanding earth climate, finding patterns in atmospheric
and ocean data
 Economic Science: market research
4
Clustering as a Preprocessing Tool
(Utility)
 Summarization:
 Preprocessing for regression, PCA, classification, and
association analysis
 Compression:
 Image processing: vector quantization
 Finding K-nearest Neighbors
 Localizing search to one or a small number of clusters
 Outlier detection
 Outliers are often viewed as those “far away” from any
cluster

5
Quality: What Is Good
Clustering?
 A good clustering method will produce high quality
clusters
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters
 The quality of a clustering method depends on
 the similarity measure used by the method
 its implementation, and
 Its ability to discover some or all of the hidden patterns

6
Measure the Quality of
Clustering
 Dissimilarity/Similarity metric
 Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
 The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal, ratio, and vector variables
 Weights should be associated with different variables
based on applications and data semantics
 Quality of clustering:
 There is usually a separate “quality” function that
measures the “goodness” of a cluster.
 It is hard to define “similar enough” or “good enough”

The answer is typically highly subjective
7
Considerations for Cluster
Analysis
 Partitioning criteria
 Single level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than one
class)
 Similarity measure
 Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
 Clustering space
 Full space (often when low dimensional) vs. subspaces (often in
high-dimensional clustering) – dimension refers to the attributes in
the dataset
8
Requirements and Challenges
 Scalability
 Clustering all the data instead of only on samples

 Ability to deal with different types of attributes


 Numerical, binary, categorical, ordinal, linked, and mixture of

these
 Constraint-based clustering
 User may give inputs on constraints

 Use domain knowledge to determine input parameters

 Interpretability and usability


 Others
 Discovery of clusters with arbitrary shape

 Ability to deal with noisy data

 Incremental clustering and insensitivity to input order

 High dimensionality

9
Major Clustering Approaches
(I)
 Partitioning approach:
 Construct various partitions and then evaluate them by some

criterion, e.g., minimizing the sum of square errors


 Typical methods: k-means, k-medoids, CLARANS

 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects)

using some criterion


 Typical methods: Diana, Agnes, BIRCH, CAMELEON

 Density-based approach:
 Based on connectivity and density functions

 Typical methods: DBSCAN, OPTICS, DenClue

 Grid-based approach:
 based on a multiple-level granularity structure

 Typical methods: STING, WaveCluster, CLIQUE

10
Major Clustering Approaches
(II)
 Model-based:
 A model is hypothesized for each of the clusters, and the idea is to find
the best fit of the data to the given model


 Typical methods: EM, SOM, COBWEB

 Frequent pattern-based:
 Based on the analysis of frequent patterns

 Typical methods: p-Cluster

 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific

constraints
 Typical methods: COD (obstacles), constrained clustering

 Link-based clustering:
 Objects are often linked together in various ways

 Massive links can be used to cluster objects: SimRank, LinkClus

11
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts


 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods
 Evaluation of Clustering
12
Partitioning Algorithms: Basic
Concept
 Partitioning method: Partitioning a database D of n objects into a set of
k clusters, such that the sum of squared distances is minimized (where
ci is the centroid or medoid of cluster Ci)

E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2
 Given k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented
by the center of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
13
The K-Means Clustering Method

14
An Example of K-Means Clustering

K=2

(Figure: starting from the initial data set, arbitrarily partition the objects into k groups, compute the cluster centroids, reassign objects to the nearest centroid, update the centroids, and loop if needed.)

 Partition objects into k nonempty subsets
 Repeat
 Compute the centroid (i.e., mean point) of each partition
 Assign each object to the cluster of its nearest centroid
 Until no change
15
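
The loop above can be sketched in a few lines of Python; this is a minimal illustration (the toy data, k, and the random initialization are arbitrary choices), not a reference implementation:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: pick k arbitrary initial centroids, then
    alternate (1) assigning each object to its nearest centroid and
    (2) recomputing each centroid as the mean of its members."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged -> converged
        labels = new_labels
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:          # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
print(k_means(X, k=2))
```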
Comments on the K-Means Method

 Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is
# iterations. Normally, k, t << n.
Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k)), where s is the sample size
 Comment: Often terminates at a local optimal.
 Weakness
 Applicable only to objects in a continuous n-dimensional space

Using the k-modes method for categorical data

In comparison, k-medoids can be applied to a wide range of
data
 Need to specify k, the number of clusters, in advance (there are
ways to automatically determine the best k; see Hastie et al., 2009)
 Sensitive to noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
16
Variations of the K-Means Method

 Most of the variants of k-means differ in


 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: k-modes
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical objects
 Using a frequency-based method to update modes of clusters
 A mixture of categorical and numerical data: k-prototype method

17
What Is the Problem of the K-Means
Method?

 The k-means algorithm is sensitive to outliers !


 Since an object with an extremely large value may substantially
distort the distribution of the data
 K-Medoids: Instead of taking the mean value of the objects in a cluster
as the reference point, a medoid can be used, which is the most
centrally located object in the cluster

(Figure: two scatter plots of example data illustrating the effect of an outlier on the choice of cluster representative.)

18
PAM: A Typical K-Medoids Algorithm
K = 2, Total Cost = 20

(Figure: PAM on a small 2-D data set.)
 Arbitrarily choose k objects as initial medoids
 Assign each remaining object to the nearest medoid
 Randomly select a non-medoid object, O_random
 Compute the total cost of swapping (here, Total Cost = 26)
 Swap a medoid with O_random if the quality is improved
 Do loop until no change
19
The K-Medoid Clustering Method

 K-Medoids Clustering: Find representative objects (medoids) in clusters


 PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)

Starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering

PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
 Efficiency improvement on PAM
 CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
 CLARANS (Ng & Han, 1994): Randomized re-sampling

20
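
A minimal sketch of the PAM swap loop described above, assuming Euclidean distance and an exhaustive search over medoid/non-medoid swaps (function and parameter names are illustrative):

```python
import numpy as np
from itertools import product

def total_cost(X, medoid_idx):
    # sum of distances from each object to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    """PAM sketch: start from arbitrary medoids, then keep swapping a medoid
    with a non-medoid whenever the swap lowers the total cost."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    best = total_cost(X, medoids)
    improved = True
    while improved:
        improved = False
        for m_pos, o in product(range(k), range(len(X))):
            if o in medoids:
                continue
            candidate = medoids.copy()
            candidate[m_pos] = o
            cost = total_cost(X, candidate)
            if cost < best:               # accept the swap only if quality improves
                medoids, best = candidate, cost
                improved = True
    labels = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2).argmin(axis=1)
    return medoids, labels, best

X = np.array([[1.0, 1.0], [1.2, 0.9], [8.0, 8.0], [8.2, 8.1], [20.0, 20.0]])
print(pam(X, k=2))
```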
Example
 Consider the following data set consisting of the
scores of two variables on each of seven
individuals:

21
Example

22
Example

Input table

23
Example

24
Example

25
Example

26
Example

27
Example
 The latter partitioning has the lowest within-
cluster variation; therefore, the k-means method
assigns the value 8 to a cluster different from that
containing 9 and 10 due to the outlier point 25.
 Moreover, the center of the second cluster, 14.67,
is substantially far from all the members in the
cluster.
 Such sensitivity to outliers can be diminished
using k-medoids algorithm

28
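
A small numeric check of this behaviour, assuming the one-dimensional data set {1, 2, 3, 8, 9, 10, 25} commonly used with this example (the exact data shown in the preceding slide figures is an assumption here):

```python
# hypothetical 1-D data; the cluster mean 14.67 matches the mean of {9, 10, 25}
points = [1, 2, 3, 8, 9, 10, 25]

def sse(cluster):
    """Within-cluster sum of squared errors around the cluster mean."""
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)

a = [[1, 2, 3], [8, 9, 10, 25]]     # keep 8 with 9 and 10
b = [[1, 2, 3, 8], [9, 10, 25]]     # pull 8 into the first cluster

print(sum(sse(c) for c in a))       # 196.0
print(sum(sse(c) for c in b))       # ~189.67 -> lower, so k-means prefers it
print(sum(b[1]) / len(b[1]))        # 14.67, far from both 9 and 10
```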
K medoids algorithm

29
Example

30
Example
 Let us choose (3, 4) and (7, 4) as the
medoids

31
Example

 Now choose some other point to be a medoid


instead of (7, 4).
 Let us randomly choose (7, 3).
 Now the new medoid set is: (3, 4) and (7, 3). Now
repeating the same task as earlier:
32
Example

33
Example

 Hence the clusters obtained finally are:


{(3,4), (2,6), (3,8), (4,7)} and
{(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}.

34
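
The cost comparison behind this outcome can be reproduced with a short script, assuming the Manhattan (city-block) distance typically used with this example; the swap to (7, 3) raises the total cost from 20 to 22 and is rejected, so the medoids (3, 4) and (7, 4) are kept:

```python
# the ten points of the example; Manhattan distance is an assumption here
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def total_cost(medoids):
    # sum of distances from every non-medoid point to its nearest medoid
    return sum(min(manhattan(p, m) for m in medoids)
               for p in points if p not in medoids)

print(total_cost([(3, 4), (7, 4)]))   # 20 -> current configuration
print(total_cost([(3, 4), (7, 3)]))   # 22 -> swap increases cost, reject it
```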
BFR Algorithm
 BFR - Bradley-Fayyad-Reina
 BFR is a scalable clustering algorithm particularly
useful for handling large datasets that do not fit
into memory.
 It is designed to work with data that is normally
distributed and uses the k-means clustering
approach as a base
 But it is more efficient in handling large datasets
because it summarizes clusters with intermediate
statistics and processes the data in memory-sized chunks.

35
BFR Algorithm
 Key Concepts of BFR Algorithm
 Assumptions:
• The data is normally distributed (Gaussian
distribution).
• The algorithm works well with a high-dimensional
space.
• The dataset is too large to fit into main memory at
once, so it works incrementally, processing
chunks of the data that can fit in memory.

36
BFR Algorithm
 Algorithm Process: The BFR algorithm breaks
down the clustering process into several stages.
These are repeated iteratively as new chunks of
data are processed.
• Initialization (Phase 1): The process begins by
selecting a random sample of the data that fits
into memory and running k-means clustering on
it. This creates initial clusters, which are
represented using statistical summaries instead
of storing all the data points in each cluster.

37
BFR Algorithm
 Summarizing Clusters (Phase 2): Each cluster
is represented by the following three types of
statistics (these are maintained to avoid
recalculating distances for every data point):
• N: The number of points in the cluster.
• SUM: The sum of the coordinates of the points in
the cluster.
• SUMSQ: The sum of the squares of the
coordinates of the points in the cluster. From
these statistics, you can compute the centroid
(mean) and the variance (spread) of the clusters.
38
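
A sketch of how such a summary can be maintained and queried; the class and method names are illustrative, and only N, SUM and SUMSQ are stored per cluster:

```python
import numpy as np

class ClusterSummary:
    """BFR-style cluster summary: keep only N, SUM and SUMSQ."""
    def __init__(self, dim):
        self.n = 0
        self.sum = np.zeros(dim)
        self.sumsq = np.zeros(dim)

    def add(self, point):
        self.n += 1
        self.sum += point
        self.sumsq += point ** 2

    def centroid(self):
        return self.sum / self.n

    def variance(self):
        # per-dimension variance: E[x^2] - (E[x])^2
        return self.sumsq / self.n - (self.sum / self.n) ** 2

cs = ClusterSummary(dim=2)
for p in np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]):
    cs.add(p)
print(cs.centroid())   # [2. 4.]
print(cs.variance())   # [0.667 2.667]
```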
BFR Algorithm
 Processing New Data (Phase 3): New data is
processed in batches or chunks that fit into
memory. For each point:
1. If the point is close enough to the centroid of an
existing cluster (based on a threshold distance), it
is added to that cluster, and the cluster’s statistics
are updated.
2. If the point is not close enough to any existing
cluster, it is stored temporarily for future analysis
or becomes its own cluster if it meets certain
criteria.
39
BFR Algorithm
 Discard Set, Compression Set, and Retained Set:
• Discard Set (DS): These are the clusters that are
sufficiently large and are represented by the cluster
statistics (N, SUM, SUMSQ). These clusters are
"discarded" from memory as only their statistics are kept.
• Compression Set (CS): These are clusters that are too
small to be considered finalized but still compressible.
They have some intermediate points but don’t yet meet
the size threshold.
• Retained Set (RS): These are points that cannot be
assigned to any current cluster or compressed group.
They are retained for later re-evaluation.

40
BFR Algorithm
 Merging and Output (Phase 4): The algorithm
periodically attempts to merge clusters from the
compression set into the discard set if they grow
large enough. Once the algorithm finishes, it
outputs the cluster centroids along with the
variance or spread.

41
BFR Algorithm
 Key Strengths:
• Efficient for Large Datasets: The BFR
algorithm’s ability to summarize clusters with
intermediate statistics allows it to handle much
larger datasets than regular k-means.
• Handles High-Dimensional Data: It is
particularly useful for high-dimensional datasets
where exact distances between points may be
less meaningful due to the curse of
dimensionality.

42
BFR Algorithm
 Limitations:
• Assumption of Normality: BFR assumes that
clusters follow a Gaussian distribution, so it may
not perform well on data that doesn’t conform to
this assumption.
• Fixed Number of Clusters (k): Like k-means,
the number of clusters must be pre-specified,
which may not be ideal in cases where the exact
number of clusters is unknown.

43
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods
 Evaluation of Clustering

44
Hierarchical Clustering
 Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input, but
needs a termination condition
Step 0 Step 1 Step 2 Step 3 Step 4
agglomerative
(AGNES)
a ab
b abcde
c
cde
d
de
e
divisive
Step 4 Step 3 Step 2 Step 1 Step 0 (DIANA)
45
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., Splus
 Use the single-link method and the dissimilarity matrix
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster

(Figure: three scatter plots showing the objects being merged step by step into larger clusters.)

46
Dendrogram: Shows How Clusters are
Merged
Decompose data objects into a several levels of nested
partitioning (tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting


the dendrogram at the desired level, then each
connected component forms a cluster

47
DIANA (Divisive Analysis)

 Introduced in Kaufmann and Rousseeuw (1990)


 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own

(Figure: three scatter plots showing one cluster being split step by step into smaller clusters.)

48
Distance between X X

Clusters
 Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
 Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)
 Average: avg distance between an element in one cluster and an
element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e.,
dist(Ki, Kj) = dist(Ci, Cj)

 Medoid: distance between the medoids of two clusters, i.e., dist(Ki,


Kj) = dist(Mi, Mj)
 Medoid: a chosen, centrally located object in the cluster
49
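
These inter-cluster distances can be computed directly from the two point sets; a small sketch assuming Euclidean distance between individual objects:

```python
import numpy as np
from itertools import product

def inter_cluster_distances(Ki, Kj):
    """Single, complete, average and centroid distances between two clusters."""
    Ki, Kj = np.asarray(Ki, float), np.asarray(Kj, float)
    pair_d = [np.linalg.norm(p - q) for p, q in product(Ki, Kj)]
    return {
        "single":   min(pair_d),                     # smallest pairwise distance
        "complete": max(pair_d),                     # largest pairwise distance
        "average":  sum(pair_d) / len(pair_d),       # mean pairwise distance
        "centroid": np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0)),
    }

print(inter_cluster_distances([(0, 0), (1, 0)], [(4, 0), (6, 0)]))
```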
Example
 Consider the dataset with 6 objects and each
object with two measured features X1 and X2.
Cluster the data using agglomerative clustering

50
Example
 Calculate the pairwise distance measure

51
Example
 Consider each object as a cluster in the beginning
 Then find the closest pair clusters and merge them
into one single cluster
 Update the distance matrix using the linkage rule
 Distance between ungrouped clusters will not
change
 To update the distance between grouped clusters,

find the distance between every pair and update


the minimum distance as the new distance
 Repeat the above steps until all the clusters are
merged

52
Example

53
Example

(Figure: merge steps of the agglomerative clustering, with merge distances 1.00, 2.24, 2.50, and 3.20.)

54
Example

55
Example

56
Divisive Hierarchical Clustering

Error Sum of Squares:


Compare individual observations for each variable against the cluster
means for that variable. Note that when the Error Sum of Squares is
small, it suggests that the data are close to their cluster means

57
Example
Find the clusters using complete linkage. Use
Euclidean distance to solve

 Solve using agglomerative and divisive


approaches 58
Example – Solved (Single
linkage)

Step 1: distances of each point from the mean of all six points
 X Y dist(XY, XY-bar)
 P1 0.40 0.53 0.205
 P2 0.22 0.38 0.077
 P3 0.35 0.32 0.067
 P4 0.26 0.19 0.168
 P5 0.08 0.41 0.220
 P6 0.45 0.30 0.166
 mean 0.293 0.355 min = 0.067
split P3 from the list and repeat for the remaining cluster

Step 2: remaining points {P1, P2, P4, P5, P6}
 P1 0.40 0.53 0.205
 P2 0.22 0.38 0.065
 P4 0.26 0.19 0.173
 P5 0.08 0.41 0.208
 P6 0.45 0.30 0.179
 mean 0.282 0.362 min = 0.065
split P2 from the list and repeat for the remaining cluster

Step 3: remaining points {P1, P4, P5, P6}
 P1 0.40 0.53 0.201
 P4 0.26 0.19 0.172
 P5 0.08 0.41 0.224
 P6 0.45 0.30 0.163
 mean 0.2975 0.3575 min = 0.163
split P6 from the list and repeat for the remaining cluster

Step 4: remaining points {P1, P4, P5}
 P1 0.40 0.53 0.217
 P4 0.26 0.19 0.187
 P5 0.08 0.41 0.170
 mean 0.247 0.377 min = 0.170
split P5 from the list and repeat for the remaining cluster

59
Example - Solved

 X Y dist(XY, XY-bar)
 P1 0.40 0.53 0.184
 P4 0.26 0.19 0.183
 mean 0.33 0.36 min = 0.183

 Since both distances are nearly the same, either point can be
split

Final Clustering
((((((P1)P4)P5)P6)P2)P3)

How will you solve when using


complete linkage?
60
CURE algorithm
 CURE (Clustering Using Representatives) is a
hierarchical clustering algorithm that aims to
address some limitations of traditional clustering
methods, particularly in handling clusters of
arbitrary shapes and varying sizes.
 Unlike standard hierarchical algorithms, CURE
uses a set of representative points to more
accurately represent the shape and size of each
cluster, allowing it to better capture complex
structures in data.

61
CURE algorithm
 Hierarchical Clustering: CURE is based on
hierarchical clustering, where clusters are formed
by merging smaller clusters.
 The process starts by treating each data point as
its own cluster and progressively merges clusters
based on their proximity, until a desired number
of clusters is obtained.

62
CURE algorithm
 Representative Points:
• Instead of using just the centroid (mean) of each cluster to
represent the entire cluster (as in k-means), CURE
selects multiple representative points that are scattered
across the cluster.
• These representative points capture the shape and
structure of the cluster more accurately than just a single
centroid, making the algorithm better suited for detecting
clusters of irregular shapes.
• After selecting representative points, they are shrunk
toward the centroid by a fixed fraction, allowing the
algorithm to moderate the influence of outliers and noisy
points.
63
CURE algorithm
 Merge Criterion:
• In traditional hierarchical clustering, clusters are
merged based on the distance between their
centroids. In CURE, clusters are merged based
on the minimum distance between the
representative points of two clusters.
• This method reduces the influence of outliers and
ensures that the clusters merged are truly close
to one another.

64
CURE algorithm

Steps in the implementation


 Sample Selection:

• If the dataset is too large to fit in memory, CURE

can work on a random sample of the data. This


sampling is critical in large-scale applications.
 Initial Clustering:

• Each data point is treated as a separate cluster

initially. Then, it progressively merges the closest


clusters, using a hierarchical clustering approach.

65
CURE algorithm
 Representative Points:
• For each cluster, a fixed number of
representative points (e.g., 10 points) are
chosen. These points are selected to be far from
each other to represent the shape and boundary
of the cluster.
• After selection, these points are shrunk toward
the centroid of the cluster by a shrink factor (e.g.,
20%). This shrinkage helps reduce the effect of
noise or outliers.

66
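
A sketch of this step under the stated assumptions: c "well scattered" points are picked greedily (farthest-point heuristic) and then shrunk toward the centroid by a factor alpha; the parameter values are illustrative:

```python
import numpy as np

def representatives(cluster, c=4, alpha=0.2):
    """CURE-style representatives: pick c points that are far apart, then
    shrink each toward the cluster centroid by the factor alpha."""
    cluster = np.asarray(cluster, float)
    centroid = cluster.mean(axis=0)
    # start from the point farthest from the centroid, then greedily add
    # the point farthest from the representatives chosen so far
    reps = [cluster[np.argmax(np.linalg.norm(cluster - centroid, axis=1))]]
    while len(reps) < min(c, len(cluster)):
        d = np.min([np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
        reps.append(cluster[np.argmax(d)])
    # shrinking dampens the influence of outliers and noisy points
    return [r + alpha * (centroid - r) for r in reps]

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(representatives(pts, c=3, alpha=0.2))
```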
CURE algorithm
 Merging Clusters:
• Clusters are merged based on the minimum
distance between any pair of representative
points in the two clusters being considered for
merging.
• The algorithm continues to merge clusters until
the desired number of clusters k is reached.
 Outlier Detection:
• CURE can also handle outliers by discarding
clusters that are too small or too isolated.

67
CURE algorithm
Advantages
 Handles Arbitrary Shapes and Sizes: By using

multiple representative points, CURE can capture


clusters of irregular shapes and varying sizes more
effectively than traditional centroid-based methods
like k-means.
 Resistant to Outliers: The shrinkage of
representative points toward the cluster's centroid
helps reduce the impact of outliers.
 Efficient for Large Datasets: CURE can handle

large datasets by first working on a sample and then


using the whole dataset to refine the clusters.
68
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods
 Evaluation of Clustering

69
Density-Based Clustering Methods
 Clustering based on density (local cluster criterion), such
as density-connected points

Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters as termination condition
 Several interesting studies:
 DBSCAN: Ester, et al. (KDD’96)

 OPTICS: Ankerst, et al (SIGMOD’99).

 DENCLUE: Hinneburg & D. Keim (KDD’98)

 CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-

based)
70
Density-Based Clustering: Basic
Concepts
 Two parameters:
 Eps (epsilon): Maximum radius of the neighbourhood
 MinPts: Minimum number of points in an Eps-
neighbourhood of that point
 NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
 Directly density-reachable: A point p is directly density-
reachable from a point q w.r.t. Eps, MinPts if

 p belongs to NEps(q)
 core point condition: |NEps(q)| ≥ MinPts
(Figure: a core point q with p in its neighbourhood; MinPts = 5, Eps = 1 cm)

71
Density-Based Clustering: Background (II)

 Density-reachable:
 A point p is density-reachable from a point q w.r.t. Eps, MinPts if
there is a chain of points p1, …, pn, with p1 = q and pn = p, such
that pi+1 is directly density-reachable from pi
 Density-connected:
 A point p is density-connected to a point q w.r.t. Eps, MinPts if
there is a point o such that both p and q are density-reachable
from o w.r.t. Eps and MinPts
72
DBSCAN: Density-Based Spatial
Clustering of Applications with Noise
 Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
 Discovers clusters of arbitrary shape in spatial databases
with noise

(Figure: core, border, and outlier (noise) points; Eps = 1 cm, MinPts = 5)

73
DBSCAN: The Algorithm
 Arbitrary select a point p
 Retrieve all points density-reachable from p w.r.t. Eps and
MinPts
 If p is a core point, a cluster is formed
 If p is a border point, no points are density-reachable
from p and DBSCAN visits the next point of the database
(border point: density is low but in the neighborhood

of core point)
 Continue the process until all of the points have been processed
74
DBSCAN algorithm
Algorithmic steps for DBSCAN clustering
Let X = {x1, x2, x3, ..., xn} be the set of data points. DBSCAN requires two
parameters: ε (eps) and the minimum number of points required to form a cluster
(minPts).
1) Start with an arbitrary starting point that has not been visited.
2) Extract the neighborhood of this point using ε (All points which are within the ε
distance are neighborhood).
3) If there are sufficient neighborhood points around this point, the clustering process
starts and the point is marked as visited; otherwise the point is labeled as noise (later this
point can become part of a cluster).
4) If a point is found to be part of a cluster, then its ε neighborhood is also part of that
cluster and the above procedure from step 2 is repeated for all ε neighborhood points.
This is repeated until all points in the cluster are determined.
5) A new unvisited point is retrieved and processed, leading to the discovery of a
further cluster or noise.
6) This process continues until all points are marked as visited.

75
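
A compact sketch of these steps (NumPy for the distance computations; the eps and minPts values in the toy call are arbitrary):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """DBSCAN sketch: expand a cluster from each unvisited core point;
    points with too few neighbours stay labeled as noise (-1)."""
    X = np.asarray(X, float)
    labels = np.full(len(X), -1)            # -1 = noise / unassigned
    visited = np.zeros(len(X), bool)
    cluster_id = 0

    def region(i):                          # indices within the eps-neighbourhood of i
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for i in range(len(X)):
        if visited[i]:
            continue
        visited[i] = True
        neighbours = list(region(i))
        if len(neighbours) < min_pts:
            continue                        # noise for now; may later join a cluster
        labels[i] = cluster_id
        queue = neighbours
        while queue:
            j = queue.pop()
            if not visited[j]:
                visited[j] = True
                nbrs_j = region(j)
                if len(nbrs_j) >= min_pts:  # j is a core point: expand further
                    queue.extend(nbrs_j)
            if labels[j] == -1:
                labels[j] = cluster_id      # border point or newly reached point
        cluster_id += 1
    return labels

X = [(1, 1), (1.2, 1.1), (0.9, 1.0), (8, 8), (8.1, 8.2), (8.2, 7.9), (5, 0)]
print(dbscan(X, eps=0.5, min_pts=3))        # two clusters plus one noise point
```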
DBSCAN algorithm
Advantages
1) Does not require a-priori specification of number of
clusters.
2) Able to identify noise data while clustering.
3) DBSCAN algorithm is able to find arbitrarily sized and
arbitrarily shaped clusters.

Disadvantages
1) DBSCAN algorithm fails in case of varying density
clusters.
2) Fails for neck-type datasets (clusters connected by a thin bridge of points).
3) Does not work well in case of high dimensional data.

76
DBSCAN-Exercise
 Consider the Figure and assume the use of Euclidean distance
between points. Also assume radius = 2 and minpts = 3.
 Find any 2 core points using DBSCAN algorithm.

 Identify any one border point.

 Is the point a directly density reachable from point d ?

77
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts


 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods
 Evaluation of Clustering
78
Grid Based Methods
 Methods discussed so far are data driven, i.e.,
they partition the set of objects and adapt to the
distribution of the objects
 Grid-based methods take a space-driven
approach
 Partitions the space into cells to form a grid

structure
 Then performs the clustering operation on the

grid structure
 Fast processing than other methods

79
Steps of Grid-based Clustering
Algorithms

Basic Grid-based Algorithm


1. Define a set of grid-cells
2. Assign objects to the appropriate grid cell and
compute the density of each cell.
3. Eliminate cells, whose density is below a certain
threshold t.
4. Form clusters from contiguous (adjacent) groups
of dense cells (usually minimizing a given
objective function)

80
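
A sketch of this basic algorithm for 2-D points, with illustrative cell size and density threshold; dense cells that share an edge or a corner are merged by a simple flood fill:

```python
import numpy as np
from collections import defaultdict

def grid_clusters(X, cell_size, density_threshold):
    """Basic grid-based sketch: bin 2-D points into cells, drop sparse cells,
    and merge contiguous dense cells into clusters of point indices."""
    X = np.asarray(X, float)
    cells = defaultdict(list)
    for idx, p in enumerate(X):
        cells[tuple((p // cell_size).astype(int))].append(idx)
    dense = {c for c, members in cells.items() if len(members) >= density_threshold}

    clusters, seen = [], set()
    for start in dense:                      # flood fill over adjacent dense cells
        if start in seen:
            continue
        stack, component = [start], []
        while stack:
            c = stack.pop()
            if c in seen:
                continue
            seen.add(c)
            component.extend(cells[c])
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (c[0] + dx, c[1] + dy)
                    if nb in dense and nb not in seen:
                        stack.append(nb)
        clusters.append(sorted(component))
    return clusters

pts = [(0.1, 0.1), (0.2, 0.3), (0.4, 0.2), (2.1, 2.2), (2.3, 2.1), (2.2, 2.4), (5.0, 5.0)]
print(grid_clusters(pts, cell_size=1.0, density_threshold=2))
```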
Advantages of Grid-based Clustering
Algorithms

 fast:
 No distance computations

 Clustering is performed on summaries and not
individual objects; complexity is usually
O(#populated-grid-cells) and not O(#objects)
 Easy to determine which clusters are

neighboring
 Shapes are limited to union of grid-cells

81
Grid-Based Clustering Methods

 Using multi-resolution grid data structure


 Clustering complexity depends on the number of
populated grid cells and not on the number of objects in
the dataset
 Several interesting methods (in addition to the basic grid-
based algorithm)
 STING (a STatistical INformation Grid approach) by
Wang, Yang and Muntz (1997)
 CLIQUE: Agrawal, et al. (SIGMOD’98)

82
STING: A Statistical Information
Grid Approach
 Wang, Yang and Muntz (VLDB’97)
 The spatial area is divided into rectangular cells
 There are several levels of cells corresponding to different
levels of resolution

83
STING: A Statistical
Information Grid Approach (2)
 Each cell at a high level is partitioned into a number of smaller
cells in the next lower level
 Statistical info of each cell is calculated and stored beforehand
and is used to answer queries
 Parameters of higher level cells can be easily calculated from
parameters of lower level cell

 count, mean, s (standard deviation), min, max

type of distribution—normal, uniform, etc.
 Use a top-down approach to answer spatial data queries

84
STING: A Statistical
Information Grid Approach (3)
 Advantages:

Query-independent, easy to parallelize, incremental
update

O(K), where K is the number of grid cells at the
lowest level
 Disadvantages:

All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected

86
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts


 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Model-Based Methods (not covered in this course)
 Evaluation of Clustering
87
Assessing Clustering Tendency
 Assess if non-random structure exists in the data by measuring the
probability that the data is generated by a uniform data distribution
 Test spatial randomness by a statistical test: the Hopkins Statistic
 Given a dataset D regarded as a sample of a random variable o,

determine how far away o is from being uniformly distributed in


the data space
 Sample n points, p1, …, pn, uniformly from the data space of D. For
each pi, find its nearest neighbor in D: xi = min{dist(pi, v)} where v in D
 Sample n points, q1, …, qn, uniformly from D. For each qi, find its
nearest neighbor in D – {qi}: yi = min{dist(qi, v)} where v in D and
v ≠ qi
 Calculate the Hopkins Statistic: H = ∑ yi / (∑ xi + ∑ yi)

 If D is uniformly distributed, ∑ xi and ∑ yi will be close to each


other and H is close to 0.5. If D is highly skewed, H is close to 0
88
Determine the Number of Clusters
 Empirical method
 # of clusters ≈ √(n/2) for a dataset of n points

 Elbow method
 Use the turning point in the curve of sum of within cluster variance

w.r.t the # of cluster.


 Cross validation method
 Divide a given data set into m parts

 Use m – 1 parts to obtain a clustering model

 Use the remaining part to test the quality of the clustering


E.g., For each point in the test set, find the closest centroid, and
use the sum of squared distance between all points in the test
set and the closest centroids to measure how well the model fits
the test set
 For any k > 0, repeat it m times, compare the overall quality measure

w.r.t. different k’s, and find # of clusters that fits the data the best
89
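
A sketch of the elbow method, assuming scikit-learn is available; the synthetic data below has three groups, so the within-cluster SSE (inertia) should drop sharply up to k = 3 and then flatten:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three well-separated Gaussian blobs (arbitrary toy data)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (4, 4), (0, 4)]])

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # look for the "elbow" in this curve
```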
Measuring Clustering Quality

 Two methods: extrinsic vs. intrinsic


 Extrinsic: supervised, i.e., the ground truth is available
 Compare a clustering against the ground truth using
certain clustering quality measure
 Ex. BCubed precision and recall metrics
 Intrinsic: unsupervised, i.e., the ground truth is unavailable
 Evaluate the goodness of a clustering by considering
how well the clusters are separated, and how compact
the clusters are
 Ex. Silhouette coefficient

90
Measuring Clustering Quality: Intrinsic
Method

 Silhouette analysis can be used to determine the degree of


separation between clusters. For each sample:
 Compute the average distance from all data points in the same

cluster (ai).
 Compute the average distance from all data points in the closest

cluster (bi).
 Compute the coefficient: s = (bi − ai) / max(ai, bi)

 The coefficient can take values in the interval [-1, 1].


 If it is 0 –> the sample is very close to the neighboring clusters.

 If it is 1 –> the sample is far away from the neighboring clusters.

 If it is -1 –> the sample is assigned to the wrong clusters.

 Therefore, we want the coefficients to be as big as possible and close
to 1 to have good clusters.

91
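
A sketch of the computation, comparing a hand-rolled silhouette with scikit-learn's silhouette_score on toy data (assumes scikit-learn is available; with only two clusters, the closest other cluster is simply the other cluster):

```python
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], float)
labels = np.array([0, 0, 0, 1, 1, 1])

def silhouette(i):
    # a = mean distance to the other points of the same cluster
    # b = mean distance to the points of the closest other cluster
    same = X[labels == labels[i]]
    other = X[labels != labels[i]]
    a = np.mean([np.linalg.norm(X[i] - p) for p in same if not np.array_equal(p, X[i])])
    b = np.mean([np.linalg.norm(X[i] - p) for p in other])
    return (b - a) / max(a, b)

print(np.mean([silhouette(i) for i in range(len(X))]))  # hand-rolled mean silhouette
print(silhouette_score(X, labels))                      # library value, should match
```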
Measuring Clustering Quality: Extrinsic
Methods
 Clustering quality measure: Q(C, Cg), for a clustering C
given the ground truth Cg.
 Q is good if it satisfies the following 4 essential criteria
 Cluster homogeneity: the purer, the better

 Cluster completeness: should assign objects belong to

the same category in the ground truth to the same


cluster
 Rag bag: putting a heterogeneous object into a pure

cluster should be penalized more than putting it into a


rag bag (i.e., “miscellaneous” or “other” category)
 Small cluster preservation: splitting a small category

into pieces is more harmful than splitting a large


category into pieces
92
Chapter 10. Cluster Analysis: Basic
Concepts and Methods

 Cluster Analysis: Basic Concepts


 Partitioning Methods
 Hierarchical Methods
 Density-Based Methods
 Grid-Based Methods
 Evaluation of Clustering
 Summary
93
Summary
 Cluster analysis groups objects based on their similarity and has
wide applications
 Measure of similarity can be computed for various types of data
 Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
 K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
 Birch and Chameleon are interesting hierarchical clustering
algorithms, and there are also probabilistic hierarchical clustering
algorithms
 DBSCAN, OPTICS, and DENCLUE are interesting density-based
algorithms
 STING and CLIQUE are grid-based methods, where CLIQUE is also
a subspace clustering algorithm
 Quality of clustering results can be evaluated in various ways
94
