5 - Clustering
Clustering – “classification” of objects into groups created by the model (unsupervised learning).
General applications:
Preprocessing / Data Reduction (instead of Sampling)
Pattern Recognition and Image Processing
Spatial Data Analysis
Business Intelligence (especially market research)
WWW: Documents (Web Content Mining), Web-logs (Web Usage Mining)
Biology (clustering of gene expression data)
Partitioning:
Goal: Construct a partitioning of a database D of n objects into a set of k (k < n) clusters C_1, …, C_k, each disjoint from the others, minimizing an objective function.
A Voronoi Diagram is a diagram which partitions the data space into convex Voronoi cells, one cell per
point (these points are the centers of clusters that we calculate in the following methods).
PARTITIONING CLUSTERING:
K-Means Clustering finds a clustering such that the within-cluster variation of each cluster is small and uses the centroid of a cluster as its representative.
Within-cluster variation (sum of squared errors) of a cluster $C_i$:
$SSE(C_i) = \sum_{p \in C_i} dist(p, \mu_{C_i})^2$
Measure for the compactness of a clustering:
$SSE(\mathcal{C}) = \sum_{C_i \in \mathcal{C}} SSE(C_i)$
The optimal partitioning is the one with the minimal value of SSE(C) among all partitionings into k clusters.
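A minimal NumPy sketch of how these two quantities could be computed for a given partitioning (the function and variable names are my own; `points`, `labels` and `centroids` are assumed inputs):

```python
import numpy as np

def sse_of_clustering(points, labels, centroids):
    """Sum of squared distances of every object to the centroid of its cluster.

    points:    (n, d) array of objects
    labels:    (n,)   array with the cluster index of each object
    centroids: (k, d) array with the mean vector of each cluster
    """
    total = 0.0
    for i, mu in enumerate(centroids):
        members = points[labels == i]          # objects of cluster C_i
        total += np.sum((members - mu) ** 2)   # SSE(C_i): squared distances to mu_i
    return total
```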
Algorithm:
Initialization: Choose k arbitrary representatives
Repeat until representatives do not change or the improvement is smaller than a certain threshold:
1. Assign each object to the cluster with the nearest representative
2. Compute the centroids of the clusters of the current partitioning
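A compact sketch of this algorithm in Python/NumPy (a naive Lloyd-style implementation under my own naming, not an optimized library version):

```python
import numpy as np

def k_means(points, k, max_iter=100, tol=1e-6, seed=None):
    """k-Means: alternate between assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialization: choose k arbitrary objects as the first representatives
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # 1. Assign each object to the cluster with the nearest representative
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. Compute the centroids of the clusters of the current partitioning
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Stop once the representatives (almost) no longer change
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```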
Example:
Strengths:
o Rather time-efficient: 𝑂(𝑡𝑘𝑛), where n = # objects, k = # clusters, and t = # iterations
o Typically: k, t << n; in practice the number of iterations t is small or capped in advance
o Easy implementation: uses vectors in order to determine the distances of objects to the “means”
Weaknesses:
o Different runs can assign the same objects to different clusters, e.g. when the objects of a cluster are spread out so evenly that the assignment becomes unstable
o Applicable only when mean is defined
o Need to specify k, the number of clusters, in advance
o Sensitive to noisy data and outliers: their influence is intensified by the use of the squared error
o Clusters are forced to convex space partitions (Voronoi Cells)
o Result and runtime strongly depend on the initial partition; often terminates at a local optimum –
however: methods for a good initialization exist
K-Medoid Clustering: picks representative objects from the data set to be the cluster centers.
Example:
K-Mode Clustering: chooses the most often occurring value for each dimension, and then constructs
artificial objects out of them to be cluster centers.
Example:
K-Median Clustering:
o Determines the median values for each dimension separately
o Then, like K-Mode-Clustering, creates artificial objects out of these values
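To make the difference between the variants concrete, here is a small sketch (my own helper, assuming one cluster given as a numeric NumPy array; k-Mode is really intended for categorical data) of how each variant would pick its representative:

```python
import numpy as np
from scipy import stats

def representatives(cluster):
    """How the four variants pick a representative for one cluster (2-D numeric array)."""
    mean = cluster.mean(axis=0)                       # k-Means: centroid (may be an artificial object)
    # k-Medoid: the actual object with the smallest total distance to all others
    pairwise = np.linalg.norm(cluster[:, None] - cluster[None, :], axis=2)
    medoid = cluster[pairwise.sum(axis=1).argmin()]
    mode = stats.mode(cluster, axis=0).mode           # k-Mode: most frequent value per dimension
    median = np.median(cluster, axis=0)               # k-Median: median per dimension (artificial object)
    return mean, medoid, mode, median
```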
Example:
Summary:
o Running these algorithms multiple times doesn’t return the same results every time, since the
initialization of the algorithms is still random
o Strength: easy implementation (→ many variations and optimizations in the literature)
o Weaknesses:
Need to specify k, the number of clusters, in advance (often we run the clustering algorithm
multiple times for different k, of which we then choose the best clustering)
Clusters are forced to convex space partitions (Voronoi Cells), which does not always match the true cluster shapes
Result and runtime strongly depend on the initial partition; often terminates at a local optimum
The algorithms do not take the actual sizes of the clusters into account (risk of incorrect clusterings)
THE SILHOUETTE COEFFICIENT:
The bigger the k, the “better” the clustering becomes...at least according to the sum of square errors.
However, the appropriateness of that clustering obviously decreases if k becomes too big.
We therefore need a measure that rewards compact, well-separated clusters without rewarding an ever larger k. The Silhouette Coefficient fulfils these conditions, and we use it to help determine a suitable number of clusters k.
a(o): average distance between o and the objects of its own cluster A = C(o):
$a(o) = \frac{1}{|C(o)|} \sum_{p \in C(o)} dist(o, p)$
b(o): smallest average distance between o and the objects of another cluster (the "second-best" cluster B):
$b(o) = \min_{C_i \neq C(o)} \frac{1}{|C_i|} \sum_{p \in C_i} dist(o, p)$
$s(o) = \begin{cases} 0 & \text{if } a(o) = 0, \text{ i.e. if } |C(o)| = 1 \\ \frac{b(o) - a(o)}{\max\{a(o), b(o)\}} & \text{otherwise} \end{cases}$
s(o) = 0 if the cluster of o has size 1, because we do not want to reward singleton clusters (we want to keep the number of clusters small).
$silh(\mathcal{C}) = \frac{1}{|\mathcal{C}|} \sum_{C_i \in \mathcal{C}} \frac{1}{|C_i|} \sum_{o \in C_i} s(o) \;\in [-1, 1]$
“Reading” the silhouette coefficient of an object (how good is the assignment of o to its cluster):
s(o) ≈ −1: bad, on average closer to the members of B
s(o) ≈ 0: in-between A and B
s(o) ≈ 1: good assignment of o to its cluster A
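A short sketch of how the silhouette coefficient is typically used to pick k, here with scikit-learn (note that `silhouette_score` averages s(o) over all objects, a slight variant of the per-cluster averaging above; the helper name `choose_k` and the range of k values are my own choices):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(points, k_values=range(2, 11), seed=0):
    """Run k-Means for several k and keep the k with the highest silhouette coefficient."""
    best_k, best_silh = None, -1.0
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(points)
        silh = silhouette_score(points, labels)   # average s(o) over all objects, in [-1, 1]
        if silh > best_silh:
            best_k, best_silh = k, silh
    return best_k, best_silh
```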
EM CLUSTERING:
Statistical approach for finding maximum likelihood estimates of the parameters of a probabilistic model.
Underlying assumption: observations are drawn from one of several components of a mixture distribution.
EM obtains a soft clustering (each object belongs to each cluster with a certain probability) reflecting the
uncertainty of the most appropriate assignment:
Exemplary application of EM clustering:
You can use the EM algorithm in order to calculate the k-Means clusters via:
setting Σ (the covariance matrix of every cluster) to the identity matrix
setting the weights of all clusters to the same value 1/k
EM clustering is superior to k-Means if the clusters have varying sizes (which is most often the case).
However, this superiority results in longer runtime.
$O(t \cdot N \cdot K \cdot D)$, where t = # iterations, N = # objects, K = # clusters, and D = cost of evaluating a single density (depends on the # dimensions).
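A brief sketch of EM-based soft clustering with scikit-learn's `GaussianMixture` (the data set and all parameter values below are made up for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Made-up data set; replace with your own (n x d) array.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 2, (50, 2))])

# Restricting the covariances (e.g. covariance_type='spherical') moves the model
# closer to the k-Means setting described above.
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(points)
soft = gmm.predict_proba(points)   # soft clustering: P(cluster j | object i) for every object
hard = soft.argmax(axis=1)         # "hardened" assignment, if a flat partition is needed
```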
DBSCAN:
Types of objects:
o Core Object (q) – the center of a neighborhood with radius ε which contains at least MinPts objects (including q itself)
Neighborhood N_ε(q) – the circle (ball) around q with radius ε
o Border Object – an object that is included in the cluster due to another core object, yet doesn’t have
MinPts neighbors
o Noise – an object that is not density-reachable from any cluster (i.e. any core or border object)
Clustering-related definitions:
o Directly density-reachable – p directly density-reachable from q with
regards to 𝜀, MinPts if:
1. p ∈ N_ε(q) and
2. 𝑞 is a core object w.r.t. 𝜀, MinPts
o Density-reachable – transitive closure of objects that are directly density-
reachable from q
o Density-connected – p is density-connected to a point q w.r.t. 𝜀, MinPts if
there is a point o such that both p and q are density-reachable from o
w.r.t. 𝜀, MinPts.
DBSCAN Algorithm:
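A compact Python sketch of the DBSCAN procedure (naive O(n²) neighborhood computation; the function name and conventions are my own, with label -1 marking noise):

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Compact DBSCAN sketch: returns a cluster label per object, -1 = noise."""
    n = len(points)
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]  # ε-neighborhoods (incl. the point itself)
    labels = np.full(n, -1)        # -1 = noise / not yet assigned
    visited = np.zeros(n, bool)
    cluster_id = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue                # skip visited objects and non-core objects
        # i is an unvisited core object: start a new cluster and expand it
        stack = [i]
        while stack:
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster_id        # core or border object joins the cluster
            if visited[j]:
                continue
            visited[j] = True
            if len(neighbors[j]) >= min_pts:  # only core objects expand the cluster further
                stack.extend(neighbors[j])
        cluster_id += 1
    return labels
```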
How to determine 𝜀 and MinPts?
Idea: look at the distances to the k-nearest neighbors.
(The 4-nearest-neighbor distance means the radius of a circle which includes 4 objects, including the
center object. Not to be confused with the k-NN distance in classification!)
Then:
1. Fix a value for MinPts (default: 2 ∗ (#𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛𝑠 𝑜𝑓 𝑡ℎ𝑒 𝑑𝑎𝑡𝑎 𝑠𝑝𝑎𝑐𝑒) − 1)
2. The user selects a “border object” o from the MinPts-distance plot
3. 𝜀 is set to MinPts-distance(o)
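A small sketch of the MinPts-distance plot using scikit-learn's `NearestNeighbors` (the helper name and plot styling are my own; the point itself is counted among the neighbors, matching the convention above):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def min_pts_distance_plot(points, min_pts):
    """Sorted MinPts-distance plot; ε is read off at the chosen 'border object'."""
    nn = NearestNeighbors(n_neighbors=min_pts).fit(points)
    dists, _ = nn.kneighbors(points)          # shape (n, min_pts); column 0 is the point itself
    k_dists = np.sort(dists[:, -1])[::-1]     # MinPts-distance of every object, descending
    plt.plot(k_dists)
    plt.xlabel("objects sorted by MinPts-distance")
    plt.ylabel("MinPts-distance")
    plt.show()
    return k_dists
```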
Disadvantages:
o Input parameters may be difficult to determine and are not as intuitively determinable as in k-
Means (especially since they are codependent)
o In some situations very sensitive to input parameters
o Hierarchical clustering impossible with DBSCAN: the clustering of a dataset with a hierarchical
structure is very dependent on the used value of 𝜀
HIERARCHICAL CLUSTERING:
Properties:
Hierarchical decomposition of the data set into a set of nested clusters
The results are represented by a so-called dendrogram:
Bottom-up construction is called agglomerative (more local perspective on the data set)
Top-down construction is called divisive (global perspective)
Advantages:
1. Does not require the number of clusters in advance
2. No parameters (standard methods) or very robust parameters (OPTICS)
3. Computes a complete hierarchy of clusters
4. Good result visualizations integrated into the methods
5. A “flat” partition can be derived afterwards (e.g. via a cut through the dendrogram or the
reachability plot)
Disadvantages:
1. May not scale well (at least polynomial time complexity)
2. User has to choose the final clustering
Agglomerative Hierarchical Clustering:
1. Initially, each object is its own cluster
2. In a loop:
a. We compute all pairwise distances between the current clusters
b. Merge the closest pair
c. Remove the 2 merged clusters from the set of clusters, insert the new one
3. Stop once only one cluster (containing all objects) is left
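A naive sketch of this merge loop (quadratic-and-worse runtime, purely for illustration; the `dist_between` parameter is a placeholder for one of the distance functions discussed next):

```python
import numpy as np

def agglomerative(points, dist_between, num_clusters=1):
    """Naive agglomerative clustering: repeatedly merge the closest pair of clusters."""
    clusters = [[i] for i in range(len(points))]     # 1. each object starts as its own cluster
    while len(clusters) > num_clusters:              # 3. stop when one cluster (or a chosen cut) remains
        best = None
        for a in range(len(clusters)):               # a. all pairwise cluster distances
            for b in range(a + 1, len(clusters)):
                d = dist_between(points, clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best                               # b. merge the closest pair
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)                      # c. replace the two clusters by the merger
    return clusters

# Example with a single-link distance (smallest pairwise member distance):
single_link = lambda pts, X, Y: min(np.linalg.norm(pts[x] - pts[y]) for x in X for y in Y)
```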
3 common distance functions for agglomerative clustering (none are “the best”, depends on use-case):
Single-Link: we merge the two clusters whose two closest members have the smallest distance (or:
the two clusters with the smallest minimum pairwise distance).
In single-link clustering, the similarity of two clusters is the similarity of their most similar members:
Complete-Link: we merge the two clusters whose merger has the smallest diameter (or: the two
clusters with the smallest maximum pairwise distance).
In complete-link clustering, the similarity of two clusters is the similarity of their most dissimilar
members:
Average-Link: the distance between two clusters is the average of all pairwise distances between their members:
$dist_{avg}(X, Y) = \frac{1}{|X| \cdot |Y|} \sum_{x \in X} \sum_{y \in Y} dist(x, y)$
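In practice these criteria are usually not implemented by hand; a short SciPy sketch (made-up data) that builds and visualizes all three linkages as dendrograms:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
points = rng.normal(size=(30, 2))        # made-up data set

# 'single', 'complete' and 'average' correspond to the three criteria above
for method in ("single", "complete", "average"):
    Z = linkage(points, method=method)   # (n-1) x 4 merge history
    plt.figure()
    plt.title(f"{method}-link dendrogram")
    dendrogram(Z)
plt.show()
```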
The idea is to process the objects in the "right" order, decomposing bigger clusters into smaller ones by reducing the value of ε:
OPTICS:
OPTICS – Ordering Points to Identify the Clustering Structure (part of hierarchical clustering, as the end user has to choose the final clustering themselves: OPTICS simply provides a good visualization of the cluster structure).
General approach:
o We visit the objects one after another; around each object we consider the smallest radius that makes it a core object
o We jump from one object to the next, always choosing the object closest to the set of objects considered so far (including the current one)
o The output is: the order of points and their core + reachability distances
Core Distance:
cdist_ε,MinPts(o) = "smallest distance such that o is a core object w.r.t. ε and MinPts" (Note: just like in DBSCAN, the point itself is counted!)
Reachability Distance:
rdist_ε,MinPts(p, o) = "smallest distance such that p is directly density-reachable from o", i.e. max{cdist_ε,MinPts(o), dist(o, p)}
The most important input parameter of the OPTICS algorithm is MinPts (𝜀 is more for efficiency)
o It is important to set a good value for MinPts: results are good if the parameter is just "large enough"
o If MinPts is too small, the data is split into far too many small clusters; it should therefore be chosen large enough
As seen here:
The OPTICS algorithm (specifically the resulting reachability plot) gives us an idea as to how to define the
value of 𝜀:
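A short sketch of producing such a reachability plot with scikit-learn's `OPTICS` implementation (the data set and the `min_samples` value are made up; `min_samples` plays the role of MinPts):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.5, (100, 2)), rng.normal(4, 1.5, (100, 2))])  # made-up data

optics = OPTICS(min_samples=5).fit(points)          # min_samples plays the role of MinPts
reach = optics.reachability_[optics.ordering_]      # reachability distances in visiting order
reach[np.isinf(reach)] = np.nan                     # the first visited point has no predecessor
plt.plot(reach)                                     # the reachability plot: valleys are clusters
plt.xlabel("objects in OPTICS order")
plt.ylabel("reachability distance")
plt.show()
```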
SUMMARY:
Partitioning (k-Means / k-Medoid / k-Mode / k-Median)
Probabilistic (EM Clustering)
Density-Based (DBSCAN)
Hierarchical (OPTICS / Agglomerative / Divisive)