10ClusBasic Editted v1
10ClusBasic Editted v1
— Chapter 10 —
1
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Evaluation of Clustering
Summary
2
What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
3
Applications of Cluster Analysis
Data reduction
Summarization: Preprocessing for regression, PCA,
4
Clustering: Application Examples
Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth
observation database
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
Climate: understanding earth climate, find patterns of atmospheric
and ocean
Economic Science: market resarch
5
Basic Steps to Develop a Clustering Task
Feature selection
Select info concerning the task of interest
Proximity measure
Similarity of two feature vectors
Clustering criterion
Expressed via a cost function or some rules
Clustering algorithms
Choice of algorithms
6
Quality: What Is Good Clustering?
7
Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical,
ordinal ratio, and vector variables
Weights should be associated with different variables
based on applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that
measures the “goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
8
Major Clustering Approaches (I)
Partitioning approach:
Construct various partitions and then evaluate them by some
Hierarchical approach:
Create a hierarchical decomposition of the set of data (or objects)
Density-based approach:
Based on connectivity and density functions
Grid-based approach:
based on a multiple-level granularity structure
9
Major Clustering Approaches (II)
Model-based:
A model is hypothesized for each of the clusters and tries to find
Frequent pattern-based:
Based on the analysis of frequent patterns
User-guided or constraint-based:
Clustering by considering user-specified or application-specific
constraints
Typical methods: COD (obstacles), constrained clustering
Link-based clustering:
Objects are often linked together in various ways
10
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Evaluation of Clustering
Summary
11
Partitioning Algorithms: Basic Concept
E ik1 pCi (d ( p, ci )) 2
13
An Example of K-Means Clustering
K=2
Dissimilarity calculations
16
What Is the Problem of the K-Means Method?
10 10
9 9
8 8
7 7
6 6
5 5
4 4
3 3
2 2
1 1
0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
17
PAM: A Typical K-Medoids Algorithm
Total Cost = 20
10 10 10
9 9 9
8 8 8
Arbitrary Assign
7 7 7
6 6 6
5
choose k 5 each 5
4 object as 4 remainin 4
3
initial 3
g object 3
2
medoids 2
to 2
nearest
1 1 1
0 0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
medoids 0 1 2 3 4 5 6 7 8 9 10
Do loop 9
Compute
9
Swapping O
8 8
total cost of
Until no
7 7
and Oramdom 6
swapping 6
change
5 5
If quality is 4 4
improved. 3 3
2 2
1 1
0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
18
The K-Medoid Clustering Method
PAM works effectively for small data sets, but does not scale
well for large data sets (due to the computational complexity)
19
Chapter 10. Cluster Analysis: Basic
Concepts and Methods
Cluster Analysis: Basic Concepts
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
20
Hierarchical Clustering
Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
Step 0 Step 1 Step 2 Step 3 Step 4 agglomerative
(AGNES)
a
ab
b
abcde
c
cde
d
de
e
divisive
(DIANA)
Step 4 Step 3 Step 2 Step 1 Step 0
21
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990)
Implemented in statistical packages, e.g., Splus
Use the single-link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
10 10 10
9 9 9
8 8 8
7 7 7
6 6 6
5 5 5
4 4 4
3 3 3
2 2 2
1 1 1
0 0 0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
22
Dendrogram: Shows How Clusters are Merged
23
DIANA (Divisive Analysis)
10 10
10
9 9
9
8 8
8
7 7
7
6 6
6
5 5
5
4 4
4
3 3
3
2 2
2
1 1
1
0 0
0
0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10
0 1 2 3 4 5 6 7 8 9 10
24
Distance between X X
Clusters
Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)
N N (t t ) 2
Dm i 1 i 1 ip iq
N ( N 1)
26
Extensions to Hierarchical Clustering
Major weakness of agglomerative clustering methods
28
Density-Based Clustering Methods
Handle noise
One scan
29
Density-Based Clustering: Basic
Concepts
Two parameters:
Eps: Maximum radius of the neighbourhood
MinPts: Minimum number of points in an Eps-
neighbourhood of that point
NEps(q): {p belongs to D | dist(p,q) ≤ Eps}
Directly density-reachable: A point p is directly
density-reachable from a point q w.r.t. Eps, MinPts if
Density-reachable:
A point p is density-reachable from p
a point q w.r.t. Eps, MinPts if there p1
is a chain of points p1, …, pn, p1 = q
q, pn = p such that pi+1 is directly
density-reachable from pi
Density-connected
A point p is density-connected to a p q
Outlier
Border
Eps = 1cm
Core MinPts = 5
32
DBSCAN: The Algorithm
Arbitrary select a point p
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Evaluation of Clustering
Summary
35
Assessing Clustering Tendency
Assess if non-random structure exists in the data by measuring the
probability that the data is generated by a uniform data distribution
Test spatial randomness by statistic test: Hopkins Static
Given a dataset D regarded as a sample of a random variable o,
Elbow method
Use the turning point in the curve of sum of within cluster variance
E.g., For each point in the test set, find the closest centroid, and
use the sum of squared distance between all points in the test
set and the closest centroids to measure how well the model fits
the test set
For any k > 0, repeat it m times, compare the overall quality
measure w.r.t. different k’s, and find # of clusters that fits the data
the best 37
Measuring Clustering Quality
38
Measuring Clustering Quality: Extrinsic Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Evaluation of Clustering
Summary
40
Summary
Cluster analysis groups objects based on their similarity and has
wide applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods,
hierarchical methods, density-based methods, grid-based methods,
and model-based methods
K-means and K-medoids algorithms are popular partitioning-based
clustering algorithms
Birch and Chameleon are interesting hierarchical clustering algorithms,
and there are also probabilistic hierarchical clustering algorithms
DBSCAN, OPTICS, and DENCLU are interesting density-based
algorithms
STING and CLIQUE are grid-based methods, where CLIQUE is also a
subspace clustering algorithm
Quality of clustering results can be evaluated in various ways
41