Lecture Notes On Clustering
Clustering
Laurenz Wiskott
Institut für Neuroinformatik
Ruhr-Universität Bochum, Germany, EU
14 December 2016
Contents
1 Introduction
5 Applications
© 2009, 2011, 2014 Laurenz Wiskott (homepage https://www.ini.rub.de/PEOPLE/wiskott/). This work (except for all
figures from other sources, if present) is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License.
To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/. Figures from other sources have their
own copyright, which is generally indicated. Do not distribute parts of these lecture notes showing figures with non-free
copyrights (here usually figures I have the rights to publish but you don't, like my own published figures). Figures I do not
have the rights to publish are grayed out, but the word Figure, Image, or the like in the reference is often linked to a pdf.
More teaching material is available at https://www.ini.rub.de/PEOPLE/wiskott/Teaching/Material/.
1 Introduction
Data¹ are often given as points (or vectors) x_n in a Euclidean vector space and often form groups of points that
are close to each other, so-called clusters (D: Cluster). In data analysis one is, of course, interested in
discovering such a structure, a process called clustering.

Clustering algorithms can be classified into hard or crisp clustering, where each point is assigned to
exactly one cluster, and soft or fuzzy clustering, where each point can be assigned to several clusters with
certain probabilities that add up to 1. Another distinction can be made between partitional clustering,
where all clusters are on the same level, and hierarchical clustering, where the clustering is done from fine
to coarse by successively merging points into larger and larger clusters (agglomerative hierarchical clustering),
or from coarse to fine, where the points are successively split into smaller and smaller clusters (divisive
hierarchical clustering). I will discuss clustering algorithms of different types in turn.
2 K-means

In the K-means algorithm each data point x_n is assigned to exactly one of K clusters C_k, each represented by a
center point c_k, such that the error

    E := \sum_{k=1}^{K} \sum_{n \in C_k} \| x_n - c_k \|^2                                (1)

is minimized. This can be interpreted, for instance, in terms of a reconstruction error. Imagine we replace
each data point by its associated center point. This will lead to an error, which could be quantified by (1). The
task is to minimize this error. There is actually a close link to vector quantization (D: Vektorquantisierung)
here.
To achieve the minimization in practice we split the problem into two phases. First we keep the assignment
fixed and optimize the position of the center points; then we keep the center points fixed and optimize the
assignment.
If the assignment is fixed, it is easy to show that the optimal choice of the center positions is given
by

    c_k = \frac{1}{N_k} \sum_{n \in C_k} x_n ,                                            (2)

which is simply the center of gravity of the N_k points assigned to cluster C_k.
If the center points are fixed, it is obvious that each point should be assigned to the nearest center
position. Thus, a Voronoi tessellation (D: Dirichlet-Zerlegung) is optimal.
The K-means algorithm now consists of applying these two optimizations in turn until convergence.
The initial center locations could be chosen randomly from the data points. A drawback of this and
many other clustering algorithms is that the number of clusters is not determined. One has to decide
on a proper K in advance, or one simply runs the algorithm with several different values of K and picks the
best according to some criterion.
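For illustration, here is a minimal sketch of the algorithm in Python/NumPy (the function name, the restart loop, and the convergence test on the assignment are implementation choices, not prescribed by the algorithm):

```python
import numpy as np

def kmeans(X, K, n_restarts=10, seed=0):
    """Minimal K-means sketch. X: (N, d) array of data points, K: number of clusters."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    best_error, best_centers, best_assignment = np.inf, None, None
    for _ in range(n_restarts):                      # several runs with different initial centers
        centers = X[rng.choice(len(X), size=K, replace=False)]
        assignment = None
        while True:
            # assignment step: assign each point to the nearest center (Voronoi tessellation)
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)   # (N, K)
            new_assignment = dists.argmin(axis=1)
            if assignment is not None and np.array_equal(new_assignment, assignment):
                break                                # converged: assignment no longer changes
            assignment = new_assignment
            # update step: move each center to the center of gravity of its points, eq. (2)
            for k in range(K):
                if np.any(assignment == k):
                    centers[k] = X[assignment == k].mean(axis=0)
        error = np.sum((X - centers[assignment]) ** 2)   # reconstruction error, eq. (1)
        if error < best_error:
            best_error, best_centers, best_assignment = error, centers, assignment
    return best_centers, best_assignment, best_error
```

The returned error is the value of (1) for the best of the restarts, which directly reflects the advice below to run the algorithm several times and keep the best result.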
Also note that the result of the algorithm is not necessarily a global optimum of the error func-
tion (1). For instance, imagine two distinct clusters of equal size and K = 4. If in such a situation three
center points are initialized to lie in one cluster and only one lies in the other, the algorithm will optimize this
only locally, with three center points in one cluster and one in the other, and it will not find the better solution
where two center points are in each cluster. It is therefore advisable to run the algorithm several times
with different initial center locations and pick the best result.
¹ Important text (but not inline formulas) is set bold face; important formulas worth remembering and less
important formulas, which I also discuss in the lecture, are marked with special symbols; + marks sections that I typically skip during my lectures.
Figure 1: Examples of a converged K-means algorithm, once with 5 (yellow) center points (left), and
two different runs with 10 center points (middle and right). The data points are drawn in black and the
Voronoi tessellation in red. (Created with DemoGNG 1.5 written by Hartmut Loos and Bernd Fritzke, see
http://www.demogng.de/js/demogng.html for a more recent version, with kind transfer of copyrights.)
For each cluster C_k with center c_k define the dispersion

    \sigma_k := \sqrt{ \frac{1}{N_k} \sum_{n \in C_k} \| x_n - c_k \|^2 } ,               (3)

which can be interpreted as a generalized standard deviation. Then define the cluster similarity of
two clusters as

    S_{kl} := \frac{\sigma_k + \sigma_l}{\| c_k - c_l \|} .                               (4)
Thus, two clusters are considered similar if they have large dispersion relative to their distance.
A good clustering should be characterized by clusters being as dissimilar as possible. This should apply in
particular to neighboring clusters, because it is clear that distant clusters are dissimilar in any case. Thus,
an overall validation of the clustering can be done by the DB index

    V_{DB} := \frac{1}{K} \sum_{k=1}^{K} \max_{l \neq k} S_{kl} .                         (5)
The DB index does not systematically depend on K and is therefore suitable for finding the
optimal number of clusters, e.g. by plotting V_DB over K and picking a pronounced minimum.
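As a sketch (assuming the root-mean-square dispersion (3) and reusing the kmeans sketch from above), the validation could be computed as follows:

```python
import numpy as np

def db_index(X, centers, assignment):
    """Davies-Bouldin index (5) for a hard clustering, e.g. the output of the kmeans sketch."""
    X, centers = np.asarray(X, dtype=float), np.asarray(centers, dtype=float)
    K = len(centers)
    # dispersion of each cluster: generalized standard deviation around its center, eq. (3)
    sigma = np.array([np.sqrt(np.mean(np.sum((X[assignment == k] - centers[k]) ** 2, axis=1)))
                      for k in range(K)])
    v = 0.0
    for k in range(K):
        # similarity (4) to all other clusters; take the most similar one, as in eq. (5)
        s = [(sigma[k] + sigma[l]) / np.linalg.norm(centers[k] - centers[l])
             for l in range(K) if l != k]
        v += max(s)
    return v / K
```

One would run kmeans for a range of K values, compute the index for each result, and pick the K at a pronounced minimum.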
3.1.1 Introduction
The K-means algorithm is a very simple method with sharp boundaries between the clusters and no particular
characterization of the shape of individual clusters. In a more refined algorithm, one might want to
model each cluster with a Gaussian, capturing the shape of the clusters. This leads naturally to
a probabilistic interpretation of the data as a superposition of Gaussian probability distributions. For
simplicity we first assume that the Gaussians are isotropic, i.e. spherical.
If we assume that the Gaussians are isotropic, the probability density function (pdf) of cluster k can
be written as

    p(x|k) := \frac{1}{(2\pi\sigma_k^2)^{d/2}} \exp\left( -\frac{\| x - c_k \|^2}{2\sigma_k^2} \right) ,    (6)

where \sigma_k controls the width of the Gaussian. There is also a prior probability P(k) that a data
point belongs to a particular cluster k. The overall pdf for the data is then given by the total probability

    p(x) = \sum_{k=1}^{K} p(x|k) P(k) ,                                                   (7)

and, assuming the data points are drawn independently, the probability density of the data given the model (7) is simply

    p(\{x_n\}) = \prod_n p(x_n) .                                                         (8)
The problem now is that we do not know the parameters of the model, i.e. the values of the centers c_k
and the widths \sigma_k of the Gaussians and the probabilities P(k) for the clusters. How could we estimate or
optimize them?
The simple idea is to choose the parameters such that the probability density of the data is
maximized. In other words, we want to choose the model such that the data becomes most probable. This
is referred to as maximum likelihood estimation (D: Maximum-Likelihood-Schätzung), and p({x_n}) as
a function of the model parameters is referred to as the likelihood function.
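As a small illustration, the likelihood (8) of an isotropic Gaussian mixture can be evaluated as follows (the log-domain computation with logsumexp is a purely numerical precaution and the function name an implementation choice, not part of the model):

```python
import numpy as np
from scipy.special import logsumexp

def log_likelihood(X, centers, sigma2, priors):
    """Log of the likelihood (8) under the isotropic mixture (6)-(7).

    X: (N, d) data, centers: (K, d), sigma2: (K,) variances sigma_k^2, priors: (K,) summing to 1.
    """
    N, d = X.shape
    sq_dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)   # ||x_n - c_k||^2, (N, K)
    log_p_x_given_k = -0.5 * sq_dists / sigma2 - 0.5 * d * np.log(2 * np.pi * sigma2)  # log of (6)
    log_p_x = logsumexp(log_p_x_given_k + np.log(priors), axis=1)           # log of (7) per point
    return np.sum(log_p_x)                                                  # log of (8)
```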
A standard method of optimizing a function analytically is to calculate the gradient and set it to zero. I do
not want to work this out here but only state that at a (local) optimum the following equations hold.
    c_k = \frac{\sum_n P(k|x_n)\, x_n}{\sum_m P(k|x_m)} ,                                 (9)

    \sigma_k^2 = \frac{1}{d} \, \frac{\sum_n P(k|x_n) \| x_n - c_k \|^2}{\sum_m P(k|x_m)} ,    (10)

    P(k) = \frac{1}{N} \sum_n P(k|x_n) ,                                                  (11)
where all sums go over all N data points. These equations are perfectly reasonable, as one can see if one
realizes that P(k|x_n) / \sum_m P(k|x_m) can be interpreted as a weighting factor for how much data point x_n
contributes to cluster k. The key function in these equations is P(k|x_n), which according to Bayes' theorem is

    P(k|x) = \frac{p(x|k) P(k)}{p(x)}                                                     (12)

           = \frac{p(x|k) P(k)}{\sum_l p(x|l) P(l)} .                                     (13)
3.1.5 EM algorithm
The problem with equations (9-11) is that the parameters on the left-hand side also occur im-
plicitly on the right-hand side, through P(k|x_n). Thus we cannot use these equations directly to calculate the
parameters. However, one can start with some initial parameter values and then iterate through these
equations to improve the estimate. One can actually show that the likelihood increases with each iteration,
if a change occurs. This iterative scheme is referred to as the expectation-maximization algorithm, or
simply EM algorithm. Notice that this is completely different from a gradient ascent method.
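A minimal sketch of one EM iteration for the isotropic model, directly implementing (9)-(13), might look like this (the function name and the array layout are implementation choices):

```python
import numpy as np

def em_step(X, centers, sigma2, priors):
    """One EM iteration for the isotropic Gaussian mixture, eqs. (9)-(13)."""
    N, d = X.shape
    # E-step: responsibilities P(k|x_n) via Bayes' theorem, eqs. (12)-(13)
    sq_dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)            # (N, K)
    p_x_given_k = np.exp(-0.5 * sq_dists / sigma2) / (2 * np.pi * sigma2) ** (d / 2) # eq. (6)
    joint = p_x_given_k * priors                                                     # p(x|k) P(k)
    resp = joint / joint.sum(axis=1, keepdims=True)                                  # P(k|x_n)
    # M-step: re-estimate the parameters, eqs. (9)-(11)
    Nk = resp.sum(axis=0)                                                            # sum_n P(k|x_n)
    centers = (resp.T @ X) / Nk[:, None]                                             # eq. (9)
    sq_dists = np.sum((X[:, None, :] - centers[None, :, :]) ** 2, axis=2)
    sigma2 = np.sum(resp * sq_dists, axis=0) / (d * Nk)                              # eq. (10)
    priors = Nk / N                                                                  # eq. (11)
    return centers, sigma2, priors
```

Iterating em_step from some initial guess until the log-likelihood from the sketch above stops increasing yields the full EM algorithm.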
Two problems might occur during optimization. Firstly, one of the Gaussians might focus on just one
data point and become infinitely narrow and infinitely high, leading to a divergence of the likelihood.
Secondly, the method can get stuck in a local optimum and miss the globally optimal solution. In
either case it helps to run the algorithm several times and discard inappropriate solutions.
Another general problem is again that the number of clusters is not determined by the algorithm but
must be chosen in advance. Again, running the algorithm several times with different values of K helps.
The Gaussian mixture model can be generalized to anisotropic Gaussians, which may be elongated
or compressed in certain directions in space. One can think of a cigar-shaped or a UFO-shaped Gaussian.
In that case one would generalize (6) to

    p(x|k) := \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left( -\frac{1}{2} (x - c_k)^T \Sigma_k^{-1} (x - c_k) \right) ,    (14)

with the covariance matrix \Sigma_k playing the role of the width parameter \sigma_k^2 in (6). Note that \Sigma_k is symmetric
and positive semi-definite.
Equations (9) and (11) would stay the same, only (10) would change. Taken together we get the equations

    c_k = \frac{\sum_n P(k|x_n)\, x_n}{\sum_m P(k|x_m)} ,                                 (15)

    \Sigma_k = \frac{\sum_n P(k|x_n)\, (x_n - c_k)(x_n - c_k)^T}{\sum_m P(k|x_m)} ,       (16)

    P(k) = \frac{1}{N} \sum_n P(k|x_n) .                                                  (17)
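The corresponding EM iteration is a straightforward generalization of the isotropic sketch above; in the following sketch the density (14) is evaluated with scipy.stats.multivariate_normal, which is a convenience choice rather than a necessity:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step_full(X, centers, covs, priors):
    """One EM iteration for the general Gaussian mixture, eqs. (14)-(17)."""
    N, d = X.shape
    K = len(priors)
    # E-step: responsibilities P(k|x_n) using the full-covariance density (14)
    p_x_given_k = np.column_stack([multivariate_normal.pdf(X, mean=centers[k], cov=covs[k])
                                   for k in range(K)])
    joint = p_x_given_k * priors
    resp = joint / joint.sum(axis=1, keepdims=True)
    # M-step, eqs. (15)-(17)
    Nk = resp.sum(axis=0)
    centers = (resp.T @ X) / Nk[:, None]                                       # eq. (15)
    covs = np.stack([(resp[:, k, None] * (X - centers[k])).T @ (X - centers[k]) / Nk[k]
                     for k in range(K)])                                       # eq. (16)
    priors = Nk / N                                                            # eq. (17)
    return centers, covs, priors
```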
The partition coefficient index is defined as

    V_{PC} := \frac{1}{N} \sum_{k,n} P(k|x_n)^2 .                                         (22)
While (20), the corresponding sum without the square, equals N in any case, irrespective of how the data points are assigned to the clusters, the
partition coefficient index, due to the square, lies between 1/K, if all points are assigned with equal
probability to all clusters, and 1, if each point is assigned to exactly one cluster. Thus, V_PC = 1
would be optimal and indicate clearly separated clusters.
Notice that the spatial information is taken into account only implicitly, which works only for clustering
models, such as the Gaussian mixture model, that have soft tails. For K-means, this index would always
be one by construction, regardless of whether the clustering is good or not.
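Given the responsibilities P(k|x_n), e.g. as computed in the EM sketches above, the index (22) is essentially a one-liner:

```python
import numpy as np

def partition_coefficient(resp):
    """Partition coefficient index (22); resp[n, k] = P(k|x_n), each row sums to 1."""
    return np.sum(resp ** 2) / resp.shape[0]   # between 1/K (maximally fuzzy) and 1 (crisp)
```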
4.1 Dendrograms
In agglomerative hierarchical clustering one starts by considering each single data point as a separate
cluster. Then one merges points that are near to each other into clusters, and finally merges
clusters that are near to each other into larger clusters. In the end all points form one big cluster.
Documenting the hierarchical merging process results in a tree-like structure that represents the cluster
structure of the data on all levels from fine (cluster distance slightly greater than 0) to coarse (cluster distance
→ ∞). It can be visualized with a dendrogram. Many algorithms can be viewed in this scheme and differ
only in the definition of what "near to each other" means for clusters.
Let d(x_n, x_m) be the distance between two points and C_k indicate a cluster of (possibly only one) points x_n.
If we define the distance D(C_k, C_l) between two clusters C_k and C_l as

    D_s(C_k, C_l) := \min_{x_n \in C_k,\; x_m \in C_l} d(x_n, x_m) ,

then it depends on the distance between the nearest two points of the two clusters. If we define

    D_c(C_k, C_l) := \max_{x_n \in C_k,\; x_m \in C_l} d(x_n, x_m) ,

then the distance depends on the farthest two points of the two clusters, see figure 2. The former distance
measure gives rise to the single-link method, the latter to the complete-link method. These names
come from the idea that you introduce links between all the points in the order of their distance. In the
single-link method, two clusters become merged as soon as they are connected by a single link, which then
naturally has length D_s(C_k, C_l). In the complete-link method, two clusters become merged only if all points
in one cluster have a link to all points in the other cluster. The last link added before the clusters are merged
then naturally has length D_c(C_k, C_l).
Figure 3 illustrates agglomerative hierarchical clustering with the single- and the complete-link method.
Notice that the resulting dendrograms are qualitatively different and that the distances are naturally larger
in the complete-link method.
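As a small sketch, with each cluster stored as an array of its points and the Euclidean distance taken as d(x_n, x_m), the two cluster distances could be computed as follows:

```python
import numpy as np

def d_single(Ck, Cl):
    """Single-link distance D_s: distance between the two nearest points of the clusters."""
    return np.linalg.norm(Ck[:, None, :] - Cl[None, :, :], axis=2).min()

def d_complete(Ck, Cl):
    """Complete-link distance D_c: distance between the two farthest points of the clusters."""
    return np.linalg.norm(Ck[:, None, :] - Cl[None, :, :], axis=2).max()
```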
1. Define each data point as a cluster, C_k := {x_k}. Represent each one-point cluster as a point on
the abscissa of a graph, the ordinate of which represents cluster distance.
Figure 2: Two different measures of cluster distance. D_s(C_k, C_l) measures the distance between the two
nearest points of the two clusters and D_c(C_k, C_l) measures the distance between the two farthest points of
the two clusters. Notice that with D_s cluster C_2 would first be merged with C_1, while with D_c it would first
be merged with C_3.
CC BY-SA 4.0
Figure 3: An example of the single-link and the complete-link method on a data distribution of 5 data
points a-e. The numbers at the links indicate the order in which the clusters are linked. On the left they are
linked by the minimal smallest distance between points of two clusters; on the right they are linked by the
minimal largest distance between points of two clusters. In this example the dendrograms are qualitatively
different. Also the distances are generally larger in the complete-link method.
CC BY-SA 4.0
2. Find the two clusters C_{k'} and C_{l'} that are closest to each other, i.e.

    D(C_{k'}, C_{l'}) = \min_{k \neq l} D(C_k, C_l) .

Draw vertical lines in the graph on top of each cluster up to the distance of these two closest
clusters, i.e. up to D(C_{k'}, C_{l'}).
3. Merge the two closest clusters into one, i.e. define a new cluster C_{q'} := C_{k'} ∪ C_{l'} and discard C_{k'}
and C_{l'}. Rearrange the clusters on the abscissa such that the two closest ones become neighbors
(and already connected clusters remain neighbors). Draw a vertical line between the two closest
clusters, representing the new cluster C_{q'}. Repeat from step 2 until only one cluster is left.
Depending on how the distance measure D(C_k, C_l) is defined, this algorithm results in different dendro-
grams and has different intuitive interpretations.
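In practice one rarely implements the merging and bookkeeping by hand; scipy, for instance, provides both linkage methods and the dendrogram plot. A minimal usage sketch on toy data (the data here are random and purely illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.default_rng(0).normal(size=(30, 2))     # toy data: 30 points in 2 dimensions

for method in ("single", "complete"):                  # single-link vs. complete-link
    Z = linkage(X, method=method)                      # agglomerative merging with merge distances
    dendrogram(Z)                                      # abscissa: points, ordinate: cluster distance
    plt.title(method + "-link dendrogram")
    plt.show()
```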
5 Applications
(D) signaling and angiogenesis, and
(E) wound healing and tissue remodeling.
Figure: (Eisen et al., 1998, Fig. 1)¹ non-free.
Semantics of words: Words can be clustered based on a large text corpus by defining similarity between
words depending on common context, i.e. if two words co-occur with the same words they are considered
similar, otherwise they are not. Clustering can then reveal semantic similarities.
Figure: (Gries and Stefanowitsch, 2010, Fig. 3)² non-free.
Table: (Mokhtarian and Ory, 2007, Tab. 3)³ unclear.
References

Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press, Oxford, UK.

Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. (1998). Cluster analysis and display of
genome-wide expression patterns. Proc. Natl. Acad. Sci. U.S.A., 95:14863-14868.

Gan, G., Ma, C., and Wu, J. (2007). Data Clustering: Theory, Algorithms, and Applications. ASA-SIAM
Series on Statistics and Applied Probability. SIAM, Philadelphia, PA, USA.

Gries, S. T. and Stefanowitsch, A. (2010). Cluster analysis and the identification of collexeme classes. In
Rice, S. and Newman, J., editors, Empirical and Experimental Methods in Cognitive/Functional Research.
CSLI Publications.

Mokhtarian, P. L. and Ory, D. T. (2007). Shopping-related attitudes: A factor and cluster analysis of
northern California shoppers. Manuscript downloaded 2016-12-14 from http://www.wctrs-society.com/wp/wp-content/uploads/abstracts/berkeley/D5/149/shoppingAttitudes.bergenfinalwctrsubmit.070417.doc.
Notes
¹ Eisen et al., 1998, Proc. Natl. Acad. Sci. U.S.A. 95:14863-8, Fig. 1, non-free, http://gene-quantification.org/eisen-et-al-cluster-1998.pdf

² Gries & Stefanowitsch, 2010, Fig. 3, non-free, http://www.linguistics.ucsb.edu/faculty/stgries/research/2010_STG-AS_ClusteringCollexemes_EmpExpMeth.pdf

³ Mokhtarian & Ory, 2007, Tab. 3, unclear, http://www.wctrs-society.com/wp/wp-content/uploads/abstracts/berkeley/D5/149/shoppingAttitudes.bergenfinalwctrsubmit.070417.doc