4 Clustering
Miguel Rodrigo
Department of Electronic Engineering, School of Engineering
Universitat de València, Avgda. Universitat s/n
46100 Burjassot (Valencia)
[email protected]
Clustering: definition
Proximity measures
K-means
Fuzzy C-Means
Based on fuzzy logic, in which a given measurement may have a degree of membership in several categories. Applied to clustering, any pattern may belong to several clusters with a fuzzy membership, thus introducing overlapping, a common phenomenon in practice, in a natural way and with a robust mathematical background.
Similar to K-means, but instead of using only the distance from a given pattern to the closest cluster, all distances are taken into account, weighted by the associated fuzzy memberships (the fuzzifier m > 1 controls how fuzzy the clustering is).
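As a concrete illustration, here is a minimal Fuzzy C-Means sketch in plain NumPy; the synthetic two-blob data, the fuzzifier m = 2 and the iteration count are arbitrary choices for this example, not part of the original slides:

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Minimal Fuzzy C-Means sketch: returns cluster centers and the
    fuzzy membership matrix U (n_samples x c), whose rows sum to 1."""
    rng = np.random.default_rng(seed)
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)            # random initial memberships
    for _ in range(n_iter):
        Um = U ** m                              # fuzzified memberships
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                    # avoid division by zero
        # u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1)): every distance counts
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

# two well-separated synthetic blobs
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
centers, U = fuzzy_c_means(X, c=2)
```

With m close to 1 the memberships approach hard, K-means-like assignments; larger m makes them fuzzier.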
K-means and Fuzzy C-Means
Spectral clustering
Spectral clustering can deal with clusters that are not compact or do not lie within convex boundaries, cases where most of the classical clustering algorithms fail. Note that ISODATA, by contrast, is a variant of K-means, not of spectral clustering.
[Figure: K-means vs. spectral clustering results on the same datasets]
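The contrast can be reproduced with scikit-learn (assuming it is available): the two-moons dataset has non-convex clusters that K-means cannot separate but spectral clustering can:

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# two interleaved half-moons: non-compact, non-convex clusters
X, y = make_moons(n_samples=300, noise=0.05, random_state=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
sc = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                        n_neighbors=10, random_state=0).fit_predict(X)

# agreement with the true moons (1.0 = perfect, up to label permutation)
ari_km = adjusted_rand_score(y, km)
ari_sc = adjusted_rand_score(y, sc)
```

The nearest-neighbors affinity builds a similarity graph whose spectrum separates the two arcs even though their centroids overlap.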
Hierarchical clustering
Clustering method based on an iterative process, which may be agglomerative or divisive, in which the different clusters are created using a hierarchy built from proximity/dissimilarity measures. Depending on the level of the hierarchy that is selected, a different number of clusters turns up. Its main advantage and its main drawback are, though it seems paradoxical, the same: selecting the correct hierarchy level is a challenge, but the hierarchical visualization (the dendrogram) is useful in itself.
[Figure: dendrogram, ranging from one large single cluster at the root down to the smallest, separated clusters at the leaves]
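A sketch with SciPy on three synthetic blobs (data invented for this example): `linkage` builds the full merge hierarchy, and cutting it at different levels with `fcluster` yields different numbers of clusters, exactly as described above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.3, (20, 2)),
               rng.normal([4, 0], 0.3, (20, 2)),
               rng.normal([0, 4], 0.3, (20, 2))])

Z = linkage(X, method="average")   # agglomerative merge tree (the hierarchy)
# cutting the hierarchy at different levels gives different cluster counts
labels3 = fcluster(Z, t=3, criterion="maxclust")
labels2 = fcluster(Z, t=2, criterion="maxclust")
# scipy.cluster.hierarchy.dendrogram(Z) would plot the hierarchy itself
```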
Hierarchical clustering: Agglomerative approach
[Figure: step-by-step agglomerative merging of points A-G]
The algorithm stops when a single cluster contains the whole data set. Sometimes the whole hierarchy is not computed, only the part that seems reasonable for the problem at hand.
Not all distances must be recalculated at each iteration, only the distances between the new cluster Cq = Ci ∪ Cj and the remaining clusters; algorithms based on the Lance–Williams update formula are normally used:
d(Cq, Cs) = a_i d(Ci, Cs) + a_j d(Cj, Cs) + b d(Ci, Cj) + c |d(Ci, Cs) - d(Cj, Cs)|
The classical algorithms correspond to particular choices of the coefficients (n_i, n_j are the sizes of the merged clusters):
• Single-link algorithm: a_i = a_j = 1/2, b = 0, c = -1/2, i.e. d(Cq, Cs) = min{d(Ci, Cs), d(Cj, Cs)}
• Complete-link algorithm: a_i = a_j = 1/2, b = 0, c = 1/2, i.e. d(Cq, Cs) = max{d(Ci, Cs), d(Cj, Cs)}
• Unweighted average algorithm (UPGMA): a_i = n_i/(n_i + n_j), a_j = n_j/(n_i + n_j), b = 0, c = 0
• Weighted average algorithm (WPGMA): a_i = a_j = 1/2, b = 0, c = 0
• Unweighted centroid algorithm (UPGMC): a_i = n_i/(n_i + n_j), a_j = n_j/(n_i + n_j), b = -n_i n_j/(n_i + n_j)^2, c = 0
• Weighted centroid algorithm (median, WPGMC): a_i = a_j = 1/2, b = -1/4, c = 0
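The difference between the linkage definitions can be checked directly on two tiny hand-picked clusters (an illustrative example, not part of the original slides):

```python
import numpy as np
from scipy.spatial.distance import cdist

# two small clusters A and B with known coordinates
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [5.0, 0.0]])
D = cdist(A, B)          # all pairwise distances between A and B

single = D.min()         # single-link: distance of the closest pair
complete = D.max()       # complete-link: distance of the farthest pair
upgma = D.mean()         # unweighted average: mean over all pairs
```

Single-link therefore tends to chain clusters together through close pairs, while complete-link favors compact clusters.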
Cluster validity
What is the suitable number of clusters for a given distribution of samples (patients)? Usually, there is not a unique and definitive answer. Some indices that can help find the correct number of clusters are based on compactness and isolation:
• Dunn index (M is the number of clusters): the ratio between the smallest inter-cluster distance and the largest intra-cluster diameter, D = min_{i≠j} d(Ci, Cj) / max_{k=1..M} diam(Ck); larger values indicate compact, well-separated clusters.
• Silhouette coefficient: score = (b - a) / max(a, b), where a is the average intra-cluster distance of a sample and b is its average distance to the nearest other cluster; the score ranges from -1 to 1.
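A typical use of the silhouette coefficient is to sweep the number of clusters and keep the value with the highest average score; a scikit-learn sketch on synthetic blobs (the three-blob dataset is an assumption of this example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)   # mean (b - a) / max(a, b)

best_k = max(scores, key=scores.get)          # highest average silhouette
```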
Self Organizing Maps (SOM)
Once trained, a SOM projects the information onto a uniform 2D map. The map shows the same 2D arrangement for all the features used in the clustering: each sample (patient) always falls at the same position of the 2D map.
[Figure: SOM component planes; colors show the average feature value per neuron]
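A minimal SOM training loop in NumPy (the grid size, decay schedules and iteration count are arbitrary choices of this sketch); once the weights are fixed, every sample always maps to the same best-matching unit, i.e. the same grid position:

```python
import numpy as np

def train_som(X, rows=5, cols=5, n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Minimal SOM sketch: rectangular grid, Gaussian neighborhood,
    exponentially decaying learning rate and neighborhood radius."""
    rng = np.random.default_rng(seed)
    W = rng.random((rows * cols, X.shape[1]))        # one weight vector per neuron
    grid = np.array([[r, c] for r in range(rows) for c in range(cols)], float)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]                  # pick a random sample
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))  # best-matching unit
        lr = lr0 * np.exp(-t / n_iter)
        sigma = sigma0 * np.exp(-t / n_iter)
        # Gaussian neighborhood on the 2D grid, centered at the BMU
        h = np.exp(-((grid - grid[bmu]) ** 2).sum(axis=1) / (2 * sigma ** 2))
        W += lr * h[:, None] * (x - W)
    return W

def bmu_position(W, x, cols=5):
    """Grid position (row, col) a sample maps to: fixed once W is trained."""
    return divmod(int(np.argmin(((W - x) ** 2).sum(axis=1))), cols)

# two synthetic patient groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 0.3, (50, 2)), rng.normal([5, 5], 0.3, (50, 2))])
W = train_som(X)
```

Averaging one feature over the samples mapped to each neuron yields the colored component planes described above.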
Other clustering methods: Density-based
Data points lying in the low-density region that separates two clusters are considered noise. The surroundings within a radius ε of a given object are known as the ε-neighborhood of the object. If the ε-neighborhood of an object contains at least a minimum number of objects, the object is called a core object.
• DBSCAN (Density-Based Spatial Clustering of Applications with Noise) relies on this density-based notion of cluster. It identifies clusters of arbitrary shape in the spatial database while flagging outliers as noise.
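A short scikit-learn sketch (ε and min_samples are hand-picked for this synthetic example): DBSCAN labels points in sparse regions with -1 (noise) and recovers the two non-convex moons without being told the number of clusters:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = np.vstack([X, [[3.0, 3.0]]])          # add one far-away outlier

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
# label -1 marks noise; the rest are density-connected clusters
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
```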
Other clustering methods: Manifold learning
Class of unsupervised estimators that seek to describe datasets as low-dimensional manifolds embedded in high-dimensional spaces (e.g. a rolled-up piece of paper in 3D).
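A sketch with scikit-learn: the swiss roll is literally a 2D sheet rolled up in 3D, and Isomap recovers a 2D embedding in which one axis follows the roll parameter (the dataset and parameters are choices of this example):

```python
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

# 3D swiss roll: an intrinsically 2D "piece of paper" rolled up in 3D
X, t = make_swiss_roll(n_samples=600, noise=0.05, random_state=0)

emb = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

# one embedding axis should track the roll parameter t
c0 = abs(np.corrcoef(t, emb[:, 0])[0, 1])
c1 = abs(np.corrcoef(t, emb[:, 1])[0, 1])
```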
Clustering with categorical features
Tips:
• Remember to use some kind of data normalization when comparing categorical (Boolean) with continuous features (e.g. age vs sex).
• Consider running the clustering only on the continuous features if possible.
• Some clustering methods and metrics deal better with binary features: hierarchical clustering, DBSCAN, etc.
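The normalization tip can be checked numerically on hypothetical age/sex data invented for this example: on raw features, Euclidean distances are dominated by age, because a full change of sex (1.0) weighs the same as a single year of age; standardization puts both on a comparable scale:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
age = rng.normal(50.0, 15.0, 200)              # continuous, in years
sex = rng.integers(0, 2, 200).astype(float)    # Boolean, coded 0/1
X = np.column_stack([age, sex])

raw_std = X.std(axis=0)                        # very different scales
Xs = StandardScaler().fit_transform(X)         # unit variance per feature
scaled_std = Xs.std(axis=0)
```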
Clustering methods for supervised learning