5 - Clustering

The document provides an overview of various clustering techniques, including k-Means, k-Medoids, DBSCAN, and hierarchical clustering, detailing their algorithms, strengths, weaknesses, and applications. It emphasizes the importance of initialization, the silhouette coefficient for determining the optimal number of clusters, and the Expectation Maximization method for probabilistic clustering. The document also discusses the properties and challenges of density-based and hierarchical clustering, highlighting their advantages and disadvantages.


Exercises:

7.1 – k-Means: Initialization Values, Proof of the Global Optimum


7.2 – k-Means: Number of initialization values for different sampling methods
7.3 – k-Medoids (example of one iteration of PAM)
8.1 – DBSCAN
8.2 – Agglomerative Hierarchical Clustering
8.3 – OPTICS
9.1 – Expectation Maximization: cluster responsibility & probability

Clustering – “classification” of objects into groups created by the model (unsupervised learning).

General applications:
 Preprocessing / Data Reduction (instead of Sampling)
 Pattern Recognition and Image Processing
 Spatial Data Analysis
 Business Intelligence (especially market research)
 WWW: Documents (Web Content Mining), Web-logs (Web Usage Mining)
 Biology (clustering of gene expression data)

Partitioning:
Goal: Construct a partitioning of a database $D$ of $n$ objects into a set of $k$ ($k < n$) clusters $C_1, \dots, C_k$, pairwise disjoint, minimizing an objective function.

Popular heuristic methods use the following approach:

o Choose k representatives for clusters (for example randomly)


o Improve these initial representatives iteratively:
 Assign each object to the cluster it “fits best” in the current clustering
 Compute new cluster representatives based on these assignments
 Repeat until change of objective function is smaller than a threshold

A Voronoi Diagram is a diagram which partitions the data space into convex Voronoi cells, one cell per
point (these points are the centers of clusters that we calculate in the following methods).
PARTITIONING CLUSTERING:

K-Means Clustering finds a clustering such that the within-cluster variation of each cluster is small, and uses the centroid of a cluster as its representative.

Measure for the compactness of a cluster $C_j$ (sum of squared errors):

$$SSE(C_j) = \sum_{p \in C_j} dist(p, \mu_{C_j})^2$$

Measure for the compactness of a clustering:

$$SSE(\mathcal{C}) = \sum_{C_j \in \mathcal{C}} SSE(C_j)$$

The optimal partitioning is the one with the minimal value of $SSE(\mathcal{C})$ among all possible partitionings.
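As a small illustration of these two objectives, here is a hedged sketch assuming Euclidean distance and numpy arrays (the function names are made up for illustration):

```python
import numpy as np

def sse_cluster(points, centroid):
    # SSE of one cluster: sum of squared Euclidean distances to its centroid
    return float(np.sum(np.linalg.norm(points - centroid, axis=1) ** 2))

def sse_clustering(clusters):
    # SSE of a clustering: sum of the per-cluster SSEs, each measured
    # against that cluster's own centroid (mean vector)
    return sum(sse_cluster(c, c.mean(axis=0)) for c in clusters)
```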

Algorithm:
 Initialization: Choose k arbitrary representatives
 Repeat until representatives do not change or the improvement is smaller than a certain threshold:
1. Assign each object to the cluster with the nearest representative
2. Compute the centroids of the clusters of the current partitioning
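A minimal numpy sketch of this loop (random initialization by sampling k objects, Euclidean distance; an illustration under these assumptions, not an optimized implementation):

```python
import numpy as np

def k_means(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: choose k arbitrary objects as representatives
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 1. Assign each object to the cluster with the nearest representative
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. Compute the centroids of the clusters of the current partitioning
        #    (an empty cluster keeps its previous representative)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the representatives (almost) no longer change
        if np.allclose(new_centroids, centroids, atol=tol):
            break
        centroids = new_centroids
    return labels, centroids
```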

Strengths:
o Rather time-efficient: 𝑂(𝑡𝑘𝑛), where n = # objects, k = # clusters, and t = # iterations
o Typically k, t ≪ n; the maximum number of iterations t is often fixed in advance
o Easy implementation: uses vectors in order to determine the distances of objects to the “means”

Weaknesses:
o Non-deterministic: different runs can assign objects to different clusters, especially when a cluster's objects are very evenly spread out (the result depends on the random initialization)
o Applicable only when mean is defined
o Need to specify k, the number of clusters, in advance
o Sensitive to noisy data and outliers: their influence is intensified by the use of the squared error
o Clusters are forced to convex space partitions (Voronoi Cells)
o Result and runtime strongly depend on the initial partition; often terminates at a local optimum –
however: methods for a good initialization exist
K-Medoid Clustering: picks representative objects from the data set to be the cluster centers.
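As a sketch of the core difference to k-Means, the following computes the medoid of a single cluster (this is only the representative-selection step, not the full PAM swap procedure of exercise 7.3; Euclidean distance assumed):

```python
import numpy as np

def medoid(cluster):
    # The medoid is the actual data object whose summed distance
    # to all other objects of the cluster is minimal
    dists = np.linalg.norm(cluster[:, None, :] - cluster[None, :, :], axis=2)
    return cluster[np.argmin(dists.sum(axis=1))]
```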

K-Mode Clustering: chooses the most often occurring value for each dimension, and then constructs
artificial objects out of them to be cluster centers.

K-Median Clustering:
o Determines the median values for each dimension separately
o Then, like K-Mode-Clustering, creates artificial objects out of these values
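A toy illustration (made-up data) of how these artificial, dimension-wise representatives are built:

```python
import numpy as np

data = np.array([[1, 5, 2],
                 [1, 7, 4],
                 [3, 7, 8]])

def column_mode(col):
    # most frequently occurring value in one dimension
    values, counts = np.unique(col, return_counts=True)
    return values[np.argmax(counts)]

# k-Mode representative: most frequent value per dimension -> [1, 7, 2]
mode_center = np.array([column_mode(data[:, d]) for d in range(data.shape[1])])

# k-Median representative: median value per dimension -> [1., 7., 4.]
median_center = np.median(data, axis=0)
```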

Summary:
o Running these algorithms multiple times doesn’t return the same results every time, since the
initialization of the algorithms is still random
o Strength: easy implementation (→ many variations and optimizations in the literature)
o Weaknesses:
 Need to specify k, the number of clusters, in advance (often we run the clustering algorithm
multiple times for different k, of which we then choose the best clustering)
 Clusters are forced to convex space partitions (Voronoi cells), which doesn't always fit the actual cluster shapes
 Result and runtime strongly depend on the initial partition; often terminates at a local optimum
 The algorithms do not care about the sizes of the actual clusters (possibility of false clustering)
THE SILHOUETTE COEFFICIENT:

The bigger the k, the "better" the clustering becomes... at least according to the sum of squared errors.
However, the appropriateness of that clustering obviously decreases if k becomes too big.

As such, there are 2 conditions:


1. The elements inside one cluster should be as similar as possible
2. The elements from different clusters should be as dissimilar as possible

These conditions are captured by the Silhouette Coefficient, which we use to help determine an appropriate number of clusters k.

First, the notations:


 $C(o)$ − the cluster the object o is in
 $|C(o)|$ − the size of the above cluster
 $C_i$ − the i-th cluster
 𝑎(𝑜) − the average distance between the object o and the objects in its cluster A (the smaller, the
better)
 𝑏(𝑜) − the average distance between the object o and the objects in the “second closest” cluster B
(the bigger, the better)

Then the math:


$$a(o) = \frac{1}{|C(o)|} \sum_{p \in C(o)} dist(o, p)$$

$$b(o) = \min_{C_i \neq C(o)} \frac{1}{|C_i|} \sum_{p \in C_i} dist(o, p)$$

$$s(o) = \begin{cases} 0 & \text{if } a(o) = 0, \text{ i.e. if } |C(o)| = 1 \\[4pt] \dfrac{b(o) - a(o)}{\max\{a(o), b(o)\}} & \text{otherwise} \end{cases}$$

$s(o) = 0$ if the size of the cluster is 1, because we want to minimize the number of clusters.

$$silh(\mathcal{C}) = \frac{1}{|\mathcal{C}|} \sum_{C_i \in \mathcal{C}} \frac{1}{|C_i|} \sum_{o \in C_i} s(o) \quad (\in [-1, 1])$$

“Reading” the silhouette coefficient of an object (how good is the assignment of o to its cluster):
 𝑠(𝑜) = −1: bad, on average closer to members of B
 𝑠(𝑜) = 0: in-between A and B
 𝑠(𝑜) = 1: good assignment of o to its cluster

The Silhouette Coefficient $s = silh(\mathcal{C})$ of a clustering $\mathcal{C}$ (the average silhouette over all objects) is read as follows:


 0.7 < 𝑠 ≤ 1.0: strong structure
 0.5 < 𝑠 ≤ 0.7: medium structure
 0.25 < 𝑠 ≤ 0.5: weak structure
 𝑠 ≤ 0.25: no structure
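A sketch that follows the formulas above literally (Euclidean distance, quadratic runtime; library routines such as sklearn.metrics.silhouette_score normalize a(o) slightly differently):

```python
import numpy as np

def silhouette(X, labels):
    labels = np.asarray(labels)
    clusters = {c: X[labels == c] for c in np.unique(labels)}
    s = np.zeros(len(X))
    for i, o in enumerate(X):
        own = labels[i]
        if len(clusters[own]) == 1:      # singleton cluster -> s(o) = 0
            continue
        # a(o): average distance to the objects of o's own cluster
        a = np.linalg.norm(clusters[own] - o, axis=1).sum() / len(clusters[own])
        # b(o): smallest average distance to any other ("second closest") cluster
        b = min(np.linalg.norm(pts - o, axis=1).mean()
                for c, pts in clusters.items() if c != own)
        s[i] = (b - a) / max(a, b)
    # silhouette of the clustering: average of the per-cluster average silhouettes
    return float(np.mean([s[labels == c].mean() for c in np.unique(labels)]))
```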
EXPECTATION MAXIMIZATION:

 Statistical approach for finding maximum likelihood estimates of parameters in probabilistic models
 Underlying assumption: Observations are drawn from one of several components of a mixture
distribution

The main idea is:


 To define clusters as probability distributions
 Iteratively improve the parameters of each distribution (for a Gaussian: the center/mean, the "width"/covariance, and the "height"/mixing coefficient) until some quality threshold is reached
 To obtain a model that most accurately describes the process by which our input data was generated

The following properties are to be noted:


 Optimizing the Gaussian of cluster j depends on all other Gaussians
 There is no closed-form solution
 Approximation through iterative optimization procedure necessary

EM obtains a soft clustering (each object belongs to each cluster with a certain probability), reflecting the uncertainty of the most appropriate assignment.
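A sketch of the E-step only (the "cluster responsibilities" of exercise 9.1), using SciPy's Gaussian density; the M-step would then re-estimate weights, means and covariances from these responsibilities (function name and interface are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, weights, means, covariances):
    # gamma(j | x) = w_j * N(x | mu_j, Sigma_j) / sum_l w_l * N(x | mu_l, Sigma_l)
    weighted_densities = np.column_stack([
        w * multivariate_normal.pdf(X, mean=m, cov=S)
        for w, m, S in zip(weights, means, covariances)
    ])
    # each row sums to 1: soft assignment of one object to all K clusters
    return weighted_densities / weighted_densities.sum(axis=1, keepdims=True)
```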

You can use the EM algorithm in order to obtain the k-Means clustering by:
 setting $\Sigma_j$ to the identity matrix for every cluster
 setting the weights of all clusters to the same value $\frac{1}{K}$

EM clustering is superior to k-Means if the clusters have varying sizes (which is most often the case). However, this comes at the price of a longer runtime.

K-Means can, as such, be used for initializing an EM algorithm in order to speed it up.

Both the result and the runtime strongly depend on:


 The initial assignment
 A proper choice of parameter K (= desired number of clusters)
 If K is too high, the mixture may overfit the data
 If K is too low, the mixture may not be flexible enough to approximate the data

Nevertheless, the total computational effort is still rather big:

$$O\bigl(\underbrace{t}_{\#\text{iterations}} \cdot \underbrace{N}_{\#\text{objects}} \cdot \underbrace{K}_{\#\text{clusters}} \cdot \underbrace{D}_{\text{cost of evaluating one density}}\bigr)$$
DBSCAN:

DBSCAN – Density Based Spatial Clustering of Applications with Noise.

Why Density-Based Clustering?

Because clusters may have arbitrary, non-convex shapes that the partitioning methods above cannot represent (see the advantages listed further below). Also: outliers shouldn't be forced into a cluster.

Types of objects:
o Core Object (q) – an object whose ε-neighborhood contains at least MinPts objects (including q itself)
 Neighborhood $N_\varepsilon(q)$ − the set of all objects within distance $\varepsilon$ of q
o Border Object – an object that belongs to a cluster because it lies in the ε-neighborhood of a core object, but is not a core object itself (fewer than MinPts neighbors)
o Noise – an object that is neither a core object nor a border object, i.e. not density-reachable from any core object

Local point density at a point q defined by two parameters:


o $\varepsilon$ − the radius of the neighborhood $N_\varepsilon(q)$ around an object q
o MinPts – the minimum number of objects required inside $N_\varepsilon(q)$ for q to be a core object

Clustering-related definitions:
o Directly density-reachable – p directly density-reachable from q with
regards to 𝜀, MinPts if:
1. $p \in N_\varepsilon(q)$ and
2. 𝑞 is a core object w.r.t. 𝜀, MinPts
o Density-reachable – transitive closure of objects that are directly density-
reachable from q
o Density-connected – p is density-connected to a point q w.r.t. 𝜀, MinPts if
there is a point o such that both p and q are density-reachable from o
w.r.t. 𝜀, MinPts.

DBSCAN Algorithm:
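The pseudocode figure from the original slides is not reproduced here; the following is a minimal, unoptimized sketch of the procedure described above (Euclidean distance, no spatial index; names are illustrative):

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per object (-1 = noise, 0..k-1 = cluster id)."""
    n = len(X)
    labels = np.full(n, -1)            # start: everything is noise
    visited = np.zeros(n, dtype=bool)
    cluster_id = 0

    def region_query(i):
        # epsilon-neighborhood of object i (including i itself)
        return np.where(np.linalg.norm(X - X[i], axis=1) <= eps)[0]

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        neighbors = region_query(i)
        if len(neighbors) < min_pts:   # not a core object: stays noise for now
            continue                   # (may later become a border object)
        labels[i] = cluster_id         # i is a core object: grow a new cluster
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:        # unassigned or noise -> joins this cluster
                labels[j] = cluster_id
            if not visited[j]:
                visited[j] = True
                j_neighbors = region_query(j)
                if len(j_neighbors) >= min_pts:   # j is a core object: expand further
                    seeds.extend(j_neighbors)
        cluster_id += 1
    return labels
```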
How to determine 𝜀 and MinPts?
 Idea: look at the distances to the k-nearest neighbors.

Example: 4-nearest neighbor:

(The 4-nearest-neighbor distance means the radius of a circle which includes 4 objects, including the
center object. Not to be confused with the k-NN distance in classification!)

Then:
1. Fix a value for MinPts (default: 2 ∗ (#𝑑𝑖𝑚𝑒𝑛𝑠𝑖𝑜𝑛𝑠 𝑜𝑓 𝑡ℎ𝑒 𝑑𝑎𝑡𝑎 𝑠𝑝𝑎𝑐𝑒) − 1)
2. The user selects a “border object” o from the MinPts-distance plot
3. 𝜀 is set to MinPts-distance(o)
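A sketch of how such a sorted MinPts-distance plot can be produced (the "border object" at the knee of the curve is then picked by eye; Euclidean distance, brute-force distances):

```python
import numpy as np
import matplotlib.pyplot as plt

def sorted_kdist_plot(X, min_pts):
    # For each object: distance to its min_pts-nearest neighbor,
    # counting the object itself as its own 1-nearest neighbor (as above).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    kdist = np.sort(D, axis=1)[:, min_pts - 1]
    # Plot the distances in descending order; epsilon is then chosen as the
    # kdist value of the "border object" at the knee of this curve.
    plt.plot(np.sort(kdist)[::-1])
    plt.xlabel("objects sorted by MinPts-distance")
    plt.ylabel("MinPts-distance")
    plt.show()
```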

Advantages of density-based clustering:


o Clusters can have arbitrary shape and size, i.e. clusters are not restricted to have convex shapes
o Number of clusters is determined automatically – no need to give the number in advance
o Can separate clusters from surrounding noise
o Can be supported by spatial index structures
o Can build a specific cluster immediately after finding a core object
o Complexity:
 $N_\varepsilon$-query: $O(n)$
 DBSCAN: $O(n^2)$

Disadvantages:
o Input parameters may be difficult to determine and are not as intuitively determinable as in k-
Means (especially since they are codependent)
o In some situations very sensitive to input parameters
o Hierarchical clustering impossible with DBSCAN: the clustering of a dataset with a hierarchical
structure is very dependent on the used value of 𝜀
HIERARCHICAL CLUSTERING:

Properties:
 Hierarchical decomposition of the data set into a set of nested clusters
 The results are represented by a so-called dendrogram:

 The root represents the whole data set


 A leaf represents a single object of the data set
 The height of an internal node represents the distance between its 2 child nodes

 Bottom-up construction is called agglomerative (more local perspective on the data set)
 Top-down construction is called divisive (global perspective)

Advantages:
1. Does not require the number of clusters in advance
2. No parameters (standard methods) or very robust parameters (OPTICS)
3. Computes a complete hierarchy of clusters
4. Good result visualizations integrated into the methods
5. A “flat” partition can be derived afterwards (e.g. via a cut through the dendrogram or the
reachability plot)

Disadvantages:
1. May not scale well (at least polynomial time complexity)
2. User has to choose the final clustering
Agglomerative Hierarchical Clustering:
1. Initially, each object is its own cluster
2. In a loop:
a. We compute all pairwise distances between the current clusters
b. Merge the closest pair
c. Remove the 2 merged clusters from the set of clusters, insert the new one
3. Stop once the set of current clusters has a size of 1 (all objects)

3 common distance functions for agglomerative clustering (none are “the best”, depends on use-case):

 Single-Link: we merge the two clusters whose two closest members have the smallest distance (or:
the two clusters with the smallest minimum pairwise distance).
In single-link clustering, the similarity of two clusters is the similarity of their most similar members:

$$dist_{sl}(X, Y) = \min_{x \in X,\, y \in Y} dist(x, y)$$

 Complete-Link: we merge the two clusters whose merger has the smallest diameter (or: the two
clusters with the smallest maximum pairwise distance).
In complete-link clustering, the similarity of two clusters is the similarity of their most dissimilar
members:

$$dist_{cl}(X, Y) = \max_{x \in X,\, y \in Y} dist(x, y)$$

 Average-Link: the average of all pairwise distance calculations:

$$dist_{al}(X, Y) = \frac{1}{|X| \cdot |Y|} \sum_{x \in X,\, y \in Y} dist(x, y)$$
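For reference, SciPy's hierarchical clustering routines implement these three linkage criteria; a minimal usage sketch (toy data, parameter values illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(20, 2)                       # toy data set

# 'single', 'complete' and 'average' correspond to the three distance functions above
Z = linkage(X, method='single')

dendrogram(Z)                                   # visualize the cluster hierarchy
flat = fcluster(Z, t=3, criterion='maxclust')   # derive a "flat" partition with 3 clusters
```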

Divisive (Density-Based) Hierarchical Clustering:


 The intuition is: we do the clustering in steps, from bigger distances to smaller and smaller distances
 Initially, all objects form one big cluster, which is subsequently broken up until all clusters are
singletons
 Conceptually more complex than agglomerative clustering, since a second, "flat" clustering algorithm is needed as a subroutine for splitting clusters

The idea is to process the objects in the "right" order, decomposing bigger clusters into smaller ones by reducing the value of $\varepsilon$.
OPTICS:

OPTICS – Ordering Points to Identify the Clustering Structure (part of hierarchical clustering, as the end
user has to choose the end clustering for themselves: OPTICS simply allows for good visualization).

General approach:
o We visit the objects one after another; for each object we record the smallest radius that would make it a core object (its core distance)
o We always jump next to the object that is closest to the set of objects processed so far (including the current one)
o The output is: the order of points and their core + reachability distances

Core Distance:
$cdist_{\varepsilon,MinPts}(o)$ = "smallest distance such that o is a core object with regards to a certain MinPts value" (Note: Just like in DBSCAN, we include the point itself in the circle!)

Reachability Distance:
$rdist_{\varepsilon,MinPts}(p, o)$ = "smallest distance such that p is directly density-reachable from o", i.e. $\max\{cdist_{\varepsilon,MinPts}(o),\ dist(o, p)\}$

“Final” Reachability Distance:


𝑟𝑑𝑖𝑠𝑡(𝑝) = The smallest distance between all previous objects (to the left in the reachability plot) and
the current object
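A sketch of obtaining the ordering and the reachability plot with scikit-learn's OPTICS implementation (parameter values here are illustrative, not a recommendation):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS

X = np.random.rand(200, 2)                      # toy data set

optics = OPTICS(min_samples=5, max_eps=np.inf)  # MinPts = 5, epsilon = "infinity"
optics.fit(X)

# Reachability plot: reachability distances in the computed cluster ordering
plt.plot(optics.reachability_[optics.ordering_])
plt.xlabel("objects in OPTICS ordering")
plt.ylabel("reachability distance")
plt.show()
```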

The most important input parameter of the OPTICS algorithm is MinPts ($\varepsilon$ mainly affects efficiency):
o It is very important to set a good value for MinPts – results are good if the parameter is just "large enough"
o If MinPts is too small, the clusters may become far too numerous; it should therefore be chosen large enough

The value of 𝜀 is important as well:


o If we make 𝜀 too small, we lose information.
o If it's too large, it's harmful for the runtime.
Why?
We calculate the reachability distance only for objects in the epsilon-neighborhood of o.
As such, if 𝜀 is assigned to ∞, we will always be calculating the distances to all other points
unnecessarily.

The OPTICS algorithm (specifically the resulting reachability plot) also gives us an idea of how to choose the value of $\varepsilon$.

Approximate runtime: $O(n \cdot \text{runtime}(\varepsilon\text{-neighborhood query}))$

Worst-case runtime (without spatial index support): $O(n^2)$
Best-case runtime (with tree-based spatial indexing): $O(n \cdot \log n)$

SUMMARY:

Examples of applicable clustering methods:

 Partitioning: k-Means / k-Medoid / k-Mode / k-Median
 Probabilistic: EM Clustering
 Density-Based: DBSCAN
 Hierarchical: OPTICS / Agglomerative / Divisive
