Clustering
What is Cluster Analysis?
● Cluster: A collection of data objects
○ similar (or related) to one another within the same group
○ dissimilar (or unrelated) to the objects in other groups
● Cluster analysis (or clustering, data segmentation, …)
○ Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
What is Cluster Analysis?
● Unsupervised learning: no predefined classes (i.e., learning by
observations)
● Typical applications
○ As a stand-alone tool to get insight into data distribution
○ As a preprocessing step for other algorithms
Clustering for Data Understanding and Applications
● Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
● City-planning: Identifying groups of houses according to their house type,
value, and geographical location
● Biology
● Information retrieval: document clustering
● Land use: Identification of areas of similar land use in an earth observation
database
● Climate: understanding Earth's climate by finding patterns in atmospheric and ocean data
Clustering as a Preprocessing Tool (Utility)
● Summarization & Compression
● Finding K-nearest Neighbors
● Outlier detection
○ Outliers are often viewed as those “far away” from any cluster
Quality: What Is Good Clustering?
● A good clustering method produces high-quality clusters: objects within a cluster are highly similar to one another (high intra-cluster similarity), while objects in different clusters are dissimilar (low inter-cluster similarity)
Major Clustering Approaches (I)
● Partitioning approach:
○ Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
○ Typical methods: k-means, k-medoids, CLARANS
● Hierarchical approach:
○ Create a hierarchical decomposition of the set of data (or objects)
using some criterion
○ Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
● Density-based approach:
○ Based on connectivity and density functions
○ Typical methods: DBSCAN, OPTICS, DenClue
Cluster Analysis: Basic Concepts and Methods
● Cluster Analysis: Basic Concepts
● Partitioning Methods
● Hierarchical Methods
● Density-Based Methods
Partitioning Algorithms: Basic Concept
● Given k, find a partition of k clusters that optimizes the chosen partitioning criterion, e.g., minimizing the sum of squared errors:

E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2

where c_i is the representative (e.g., the centroid) of cluster C_i
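To make the criterion concrete, here is a minimal sketch (not from the slides; the names sse, points, labels, and centroids are my own) that evaluates E for a fixed assignment of 2-D objects to clusters:

```python
import numpy as np

def sse(points, labels, centroids):
    """Sum of squared errors: E = sum_i sum_{p in C_i} ||p - c_i||^2."""
    total = 0.0
    for i, c in enumerate(centroids):
        members = points[labels == i]          # objects assigned to cluster C_i
        total += np.sum((members - c) ** 2)    # squared distances to the representative c_i
    return total

# Toy example: four 2-D objects split into two clusters.
points = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([points[labels == i].mean(axis=0) for i in range(2)])
print(sse(points, labels, centroids))
```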
The K-Means Clustering Method
An Example of K-Means Clustering
(Figure: iterative clustering with K = 2; objects are assigned to the nearest centroid based on dissimilarity calculations, the centroids are updated, and the steps repeat until the assignment stabilizes.)
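The iteration behind this example can be sketched as follows. This is an illustrative implementation written for this summary, not the slides' own code; it assumes Euclidean distance as the dissimilarity measure and uses NumPy, with K = 2 as above.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Minimal k-means: assign each object to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]  # arbitrary initial centroids
    for _ in range(n_iter):
        # Assignment step: dissimilarity (Euclidean distance) to every centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of the objects assigned to it.
        new_centroids = np.array([points[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):   # assignments have stabilized
            break
        centroids = new_centroids
    return labels, centroids

points = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
                   [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```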
What Is the Problem of the K-Means Method?
● The k-means algorithm is sensitive to outliers!
○ An object with an extremely large value can substantially distort the distribution of the data and pull a cluster mean toward it
● K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, i.e., the most centrally located object in the cluster
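A quick numeric illustration of this sensitivity (a toy example of my own, not from the slides): a single extreme value drags the mean far away from the bulk of the objects, while the medoid remains a representative object.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # 100.0 is an outlier

mean = values.mean()                              # pulled toward the outlier
# Medoid: the actual object that minimizes total distance to all other objects.
medoid = values[np.argmin([np.abs(values - v).sum() for v in values])]

print(mean)    # 22.0 -> far from the four "normal" objects
print(medoid)  # 3.0  -> still a central, representative object
```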
The K-Medoid Clustering Method
● K-Medoids Clustering: Find representative objects (medoids) in clusters
○ Starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if the swap improves the total distance of the resulting clustering
○ PAM (Partitioning Around Medoids) works effectively for small data sets, but does not scale well to large data sets (due to its computational complexity)
PAM: A Typical K-Medoids Algorithm
(Figure, total cost = 20: arbitrarily choose k objects as initial medoids; assign each remaining object to its nearest medoid; then, in a loop, randomly select a non-medoid object O_random, compute the total cost of swapping it with a current medoid, and perform the swap if it improves the quality; repeat until no change.)
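The swap-based procedure above can be sketched in a few lines of Python. This is a simplified PAM written for this summary (function and variable names are my own); the cost is the total distance from each object to its nearest medoid.

```python
import numpy as np

def total_cost(points, medoid_idx):
    """Total distance from each object to its nearest medoid."""
    dists = np.linalg.norm(points[:, None, :] - points[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def pam(points, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(points), size=k, replace=False))  # arbitrary initial medoids
    best = total_cost(points, medoids)
    improved = True
    while improved:                                   # "until no change"
        improved = False
        for m in range(k):                            # each current medoid
            for o in range(len(points)):              # each non-medoid object
                if o in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = o                      # tentative swap
                cost = total_cost(points, candidate)
                if cost < best:                       # keep the swap only if quality improves
                    medoids, best, improved = candidate, cost, True
    return medoids, best

points = np.array([[2.0, 6.0], [3.0, 4.0], [3.0, 8.0], [4.0, 7.0],
                   [6.0, 2.0], [6.0, 4.0], [7.0, 3.0], [7.0, 4.0], [8.0, 5.0]])
print(pam(points, k=2))
```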
Cluster Analysis: Basic Concepts and Methods
● Cluster Analysis: Basic Concepts
● Partitioning Methods
● Hierarchical Methods
● Density-Based Methods
Hierarchical Clustering
● This method does not require the number of clusters k as an input, but
needs a termination condition
● AGNES (agglomerative): bottom-up strategy, starting with each object in its own cluster and merging clusters step by step
● DIANA (divisive): top-down strategy, starting with all objects in one cluster and splitting it step by step
(Figure: objects a, b, c, d, e; over steps 0 to 4, AGNES merges them into ab, de, cde, and finally abcde, while DIANA performs the same steps in reverse.)
AGNES
● Introduced in Kaufman and Rousseeuw (1990)
AGNES (Agglomerative Nesting)
● Use the single-link method and the dissimilarity matrix
● Merge nodes that have the least dissimilarity
● Eventually all nodes belong to the same cluster
(Figure: three scatter plots showing the objects being progressively merged into larger clusters.)
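In practice this behavior can be reproduced with SciPy's hierarchical clustering routines; the following sketch (my own, not from the slides) builds the single-link hierarchy and then cuts it into two flat clusters with fcluster:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D objects (hypothetical data).
X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0],
              [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])

# Single-link (minimum distance) agglomerative clustering, as in AGNES.
Z = linkage(X, method='single', metric='euclidean')

# Merging continues until all objects are in one cluster; here we cut at 2 clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)
```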
Dendrogram: Shows How Clusters are Merged
● A dendrogram is commonly used to represent the process of hierarchical clustering: it shows how objects are grouped together step by step
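Continuing the SciPy sketch above (again my own illustration, not from the slides), the linkage matrix can be rendered as a dendrogram with Matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0],
              [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])
Z = linkage(X, method='single')          # same single-link hierarchy as before

# Each U-shaped link in the plot is one merge; its height is the merge distance.
dendrogram(Z, labels=['a', 'b', 'c', 'd', 'e', 'f'])
plt.xlabel('objects')
plt.ylabel('merge distance')
plt.show()
```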
Distance between Clusters
● Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(C_i, C_j) = min { d(p, q) : p ∈ C_i, q ∈ C_j }
● Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(C_i, C_j) = max { d(p, q) : p ∈ C_i, q ∈ C_j }
● Average link: average distance between an element in one cluster and an element in the other, i.e., dist(C_i, C_j) = avg { d(p, q) : p ∈ C_i, q ∈ C_j }
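These three inter-cluster distances can be written out directly; the sketch below (my own helper names, Euclidean distance assumed) compares them on two small clusters:

```python
import numpy as np

def pairwise(A, B):
    """All distances d(p, q) for p in cluster A and q in cluster B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)

def single_link(A, B):    return pairwise(A, B).min()    # smallest pairwise distance
def complete_link(A, B):  return pairwise(A, B).max()    # largest pairwise distance
def average_link(A, B):   return pairwise(A, B).mean()   # average pairwise distance

A = np.array([[1.0, 1.0], [2.0, 1.0]])
B = np.array([[5.0, 5.0], [6.0, 7.0]])
print(single_link(A, B), complete_link(A, B), average_link(A, B))
```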
Extensions to Hierarchical Clustering
● Major weakness of agglomerative clustering methods
○ Do not scale well: time complexity of at least O(n²), where n is the total number of objects