Clustering in R Tutorial
Jari Oksanen
January 26, 2014
Contents

1 Introduction
2 Hierarchic Clustering
  2.1 Description of Classes
  2.2 Numbers of Classes
  2.3 Clustering and Ordination
  2.4 Reordering a Dendrogram
  2.5 Minimum Spanning Tree
  2.6 Cophenetic Distance
3 Interpretation of Classes
  3.1 Environmental Interpretation
  3.2 Community Summaries
4 Optimized Clustering at a Given Level
  4.1 Optimum Number of Classes
5 Fuzzy Clustering
1 Introduction
In this tutorial we look at classification. Classification and ordination are alternative strategies of simplifying data. Ordination tries to simplify the data into a map showing the similarities among points. Classification simplifies the data by putting similar points into the same class. The task of describing a large number of points is reduced to the easier task of describing a small number of classes.
2 Hierarchic Clustering
[Figure: two clusters of points, labelled A and B, illustrating how the distance between two clusters is defined in the different linkage methods.]

In single linkage (nearest neighbour) clustering the distance between two clusters is the shortest possible distance between their members, and in complete linkage (furthest neighbour) clustering it is the longest possible distance. In average linkage clustering the distance between two clusters is the distance between the cluster centroids. There are several alternative ways of defining the average and defining the closeness, and hence a large number of average linkage methods. We use only one of these methods, commonly known as UPGMA. The lecture slides discuss the methods in more detail.
In the following we will compare three different clustering strategies. If you
want to plot three graphs side by side, you can divide the screen into three
panels by
R> par(mfrow=c(1,3))
This defines three panels side by side. You probably want to stretch the plotting
window if you are using this option. Alternatively, you can have three panels
above each other with
R> par(mfrow=c(3,1))
You can get back to the single panel mode with
R> par(mfrow=c(1,1))
You may also wish to use narrower empty margins for the panels:
R> par(mar=c(3,4,1,1)+.1)
The mar parameter defines the plot margins in the order bottom, left, top, right, using text line height as the unit.
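The examples below use a dissimilarity matrix d. If it is not yet defined, it can be computed as Bray-Curtis dissimilarities of the dune data (an assumption consistent with the rest of this tutorial; vegdist uses Bray-Curtis by default):
R> library(vegan)     # clustering input: vegdist dissimilarities
R> data(dune)         # dune meadow community data used throughout
R> d <- vegdist(dune) # Bray-Curtis dissimilarity by default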
The single linkage clustering can be found with:
R> csin <- hclust(d, method="single")
R> csin
The dendrogram can be plotted with:
R> plot(csin)
The default is to plot an inverted tree with the root at the top and the branches hanging down. You can force the branches down to the base line by giving the hang argument:
R> plot(csin, hang=-1)
If you plotted the csin tree twice, you consumed two of your three panels, and there will not be space for the next two trees in the same plot. In that case you can start a new plot by issuing the mfrow command again and then drawing csin again.
The complete linkage and average linkage methods are found in the same
way:
R> ccom <- hclust(d, method="complete")
R> plot(ccom, hang=-1)
R> caver <- hclust(d, method="average")
R> plot(caver, hang=-1)
The vertical axes of the cluster dendrograms show the fusion level. The two
most similar observations are combined first, and they are at the same level
in all dendrograms. At the upper fusion levels, the scales diverge: they are
the shortest dissimilarities among cluster members in single linkage, the longest
possible dissimilarities in complete linkage, and the distances among cluster
centroids in average linkage (Fig. 1).
2.1 Description of Classes
One problem with hierarchic clustering is that it gives a classification of observations (plots, sampling units), but it does not tell how these classes differ from each other. For community data, there is no information on how the species composition differs between classes (we return to this subject in Chapter 3.2).
The vegan package has function vegemite (Fig. 2) that can produce compact community tables ordered by a dendrogram, ordination or environmental variables. With the help of these tables it is possible to see how the species composition differs between the classes:
R> vegemite(dune, caver)
The vegemite command always uses one-character columns. If the observed values do not fit one character, vegemite refuses to work. With the argument scale you can recode the values to one-character width. vegemite has a graphical sister function tabasco that is described in section 2.4.
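For example, the values could be recoded to Hult's scale (one of several class scales that vegemite accepts; the choice here is only for illustration):
R> vegemite(dune, caver, scale = "Hult")  # recode covers to one-character Hult classes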
2.2 Numbers of Classes
The dendrogram can be cut at a desired number of classes, and the resulting classes highlighted in the plot with rect.hclust:
R> plot(csin, hang=-1)
R> rect.hclust(csin, 3)
R> plot(ccom, hang=-1)
R> rect.hclust(ccom, 3)
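The class membership of each observation can be extracted with cutree; here three classes from the average linkage clustering, matching the later examples (the choice of tree and number of classes is for illustration):
R> cl <- cutree(caver, 3)  # numeric class membership vector
R> table(cl)               # class sizes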
2.3 Clustering and Ordination
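In this section the classifications are displayed in an ordination diagram. The ordination ord is assumed here to be metric scaling (principal coordinates) of the dissimilarities d; this definition is only a sketch, and any ordination of the same data would do:
R> ord <- cmdscale(d)  # principal coordinates of the Bray-Curtis dissimilarities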
The classes and fusions can be displayed in the ordination diagram, for instance as convex hulls of the average linkage classes and as line segments of the single linkage fusions:
R> ordiplot(ord, dis="si")
R> ordihull(ord, cutree(caver, 3))
R> ordiplot(ord, dis="si")
R> ordicluster(ord, csin)
We set here explicitly the display argument to display = "sites" to avoid annoying and useless warnings. The contrasting clustering strategies (nearest vs. furthest vs. average neighbour) are evident in the shapes of the clusters: single linkage clusters are chained, complete linkage clusters are compact, and average linkage clusters are between these two.
The vegan package has a special function to display the cluster fusions in ordination. The ordicluster function combines sites and cluster centroids in the same way as the average linkage method:
R> ordiplot(ord, dis="si")
R> ordicluster(ord, caver)
We can prune the top level fusions to highlight the clustering:
R> ordiplot(ord, dis="si")
R> ordicluster(ord, caver, prune=2)
2.4 Reordering a Dendrogram
The leaves of a dendrogram do not have a natural order: you can take a branch and turn it around at its root, and the tree remains the same (see Fig. 3).
R has two alternative dendrogram presentations: the hclust result object and the more general dendrogram object. An hclust tree can be converted to a dendrogram with:
R> den <- as.dendrogram(caver)
The dendrograms are more general, and several methods are available for their
manipulation and analysis. It is possible to re-order the leaves of a dendrogram
so that they match as closely as possible an external variable.
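For example, the leaves could be reordered by the site scores on the first ordination axis (a sketch; the choice of external variable here is an assumption, and oden is used in the examples below):
R> wa <- scores(ord, choices = 1, display = "sites")  # first-axis scores as weights
R> oden <- reorder(den, wa)                           # reorder leaves by the weights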
The original and reordered dendrograms can be compared in two panels:
R> par(mfrow=c(2,1))
R> plot(den)
R> plot(oden)
R> par(mfrow=c(1,1))
The reordered dendrogram may also give a more regularly structured community table:
R> vegemite(dune, oden)
The vegemite function has a graphical sister function tabasco that can also display the dendrogram. Moreover, it defaults to reordering the dendrogram by the first axis of Correspondence Analysis:
R> tabasco(dune, caver)
Correspondence Analysis packs similar species next to each other, and similar
sites next to each other and gives a good diagonal representation of the data. If
you want to see the original ordering of the sample plots, you must set Rowv =
FALSE:
R> tabasco(dune, caver, Rowv = FALSE)
R> tabasco(dune, oden, Rowv = FALSE)
The tabasco function also defaults to ordering the species to match the ordering of the sites unless you set Colv = FALSE.
2.5 Minimum Spanning Tree
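The minimum spanning tree connects all points so that the total length of the connecting segments is minimized. In vegan it can be found with spantree and overlaid on the ordination (a minimal sketch, assuming the dissimilarities d and ordination ord used above):
R> mst <- spantree(d)       # minimum spanning tree of the dissimilarities
R> ordiplot(ord, dis="si")
R> lines(mst, ord)          # overlay the tree on the ordination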
Alternatively, the tree can be displayed with the plot command, which tries to find a locally optimal configuration of points that reproduces the distances among points along the tree. Internally the function uses Sammon scaling (the sammon function in the MASS package) to find the configuration. Sammon scaling is a variant of metric scaling that tries to reproduce the relative distances among points, and it is optimal for showing the local (rather than global) structure of points.
R> plot(mst, type="t")
2.6 Cophenetic Distance
The distance between two points estimated from a dendrogram is the level at which the points are first fused, or the height of their common root. A good clustering method correctly reproduces the actual dissimilarities. The distance estimated from a dendrogram is called the cophenetic distance. The name echoes the origins of hierarchic clustering in old-fashioned numerical taxonomy. The standard R function cophenetic estimates the distances among all points from a dendrogram.
We can visually inspect the cophenetic distances against the observed dissimilarities. In the following, abline adds a line with zero intercept and unit slope, the equivalence line. We also set equal scaling on the x and y axes (asp = 1):
R> plot(d, cophenetic(csin), asp=1)
R> abline(0, 1)
R> plot(d, cophenetic(ccom), asp=1)
R> abline(0, 1)
R> plot(d, cophenetic(caver), asp=1)
R> abline(0, 1)
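The agreement can also be condensed into a single number, the correlation between observed and cophenetic distances (the cophenetic correlation; added here as a quick sketch for the average linkage tree):
R> cor(d, cophenetic(caver))  # close to 1 means good reproduction of dissimilarities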
3 Interpretation of Classes
3.1 Environmental Interpretation
For interpreting the classes we take the environmental data into use:
R> data(dune.env)
The cutree function produced a numerical classification vector, but we need a
factor of classes:
R> cl <- factor(cl)
We can visually display the differences in environmental conditions between classes using boxplots. The only continuous variable we have is the thickness of the A1 horizon, but we can change the Moisture class to a numeric variable:
R> Moist <- with(dune.env, as.numeric(as.character(Moisture)))
Compare the result of this to a simpler looking alternative:
R> with(dune.env, as.numeric(Moisture))
The latter uses the internal representation (1, 2, 3, 4) of the factor instead of its nominal values (1, 2, 4, 5), which we get with the added as.character call.
The boxplot is produced with:
R> boxplot(Moist ~ cl, notch=TRUE)
The boxplot shows the data extremes (whiskers and points for extreme cases), the lower and upper quartiles, and the median. The notches show the approximate 95 % confidence limits of the median: if the notches do not overlap, the medians probably differ at level P < 0.05. For small data sets like this, the confidence limits are wide, and you get a warning of the waist going over the shoulders.
You can use anova for more rigorous testing of differences among classes. In R, anova is an analysis method for linear models (lm):
R> anova(lm(Moist ~ cl))
This performs a parametric anova test. If we want to use a non-parametric permutation test with the same test statistic, we can use redundancy analysis:
R> anova(rda(Moist ~ cl))
This works nicely in this case because the y-variate (Moist) is in the current workspace: the command would not work for variates in data frames unless we attach the data frame or wrap the command as with(dune.env, ...).
We can use a confusion matrix to compare factor variables and the classification:
R> with(dune.env, table(cl, Management))
3.2 Community Summaries
4 Optimized Clustering at a Given Level
Hierarchic clustering has a history: it starts from the bottom and combines observations into clusters. At higher levels it can only fuse existing clusters to others, but it cannot break old groups to build better ones. Often we are most interested in classifications with a low number of clusters, and these carry a long historical burden. Once we have decided upon the number of classes, we can often
improve the classification at a given number of clusters. A tool for this task is
K-means clustering that produces an optimal solution for a given number of
classes.
K-means clustering has one huge limitation for community data: it works in the Euclidean metric, which is rarely meaningful for community data. In hierarchic clustering we could use the non-Euclidean Bray-Curtis dissimilarity, but we do not have that choice here. However, we can work around some of the problems with appropriate standardization. The Hellinger transformation has been suggested to be very useful with the Euclidean metric and community data: take the square root of the data after dividing by the row (site) totals. The Hellinger transformation can be performed with the vegan function decostand, which has many other useful standardizations and transformations:
R> ckm <- kmeans(decostand(dune, "hell"), 3)
R> ckm$cluster
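The definition given above can be checked directly against decostand (a quick sketch; the difference should be practically zero):
R> max(abs(as.matrix(decostand(dune, "hell")) - sqrt(as.matrix(dune)/rowSums(dune))))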
We can display the results in the usual way:
R> ordiplot(ord, dis="si")
R> ordihull(ord, ckm$cluster, col="red")
4.1 Optimum Number of Classes
The K-means clustering optimizes for the given number of classes, but what is
the correct number of classes? Arguably there is no correct number, because
the World is not made of distinct classes, but all classifications are patriarchal
artifacts. However, some classifications may be less useful (and worse) than
others.
Several criteria have been suggested for selecting the number of classes, and function cascadeKM in vegan computes some of them for a sequence of K-means clusterings. The default is the Calinski-Harabasz criterion, which is the F ratio of an anova for the predictor variables (species in community data). The test uses the Euclidean metric, and we again use the Hellinger transformation (if you try without transformation, or with different transformations, you will see that transformations with decostand have capricious effects: obviously the World is not classy in this case).
R> ccas <- cascadeKM(decostand(dune, "hell"), 2, 15)
R> plot(ccas, sortq=TRUE)
We gave the smallest (2) and largest (15) numbers of inspected classes in the call, and sorted the plot by classification. Each class is shown in a different colour, and bends in the vertical colour bars show non-hierarchical clustering, where the upper levels are not simple fusions of lower levels. The highest value of the criterion shows the optimal number of classes. This kind of graph is called
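The numbers behind the plot can also be inspected directly: cascadeKM returns, among other things, the criterion values and the class memberships for each number of classes:
R> ccas$results          # criterion values for each number of classes
R> head(ccas$partition)  # class memberships of the sites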
5 Fuzzy Clustering
We have so far worked with classification methods which implicitly assume that
there are distinct classes. The World is not made like that. If there are classes,
they are vague and have intermediate and untypical cases. In one word, they are fuzzy.
Fuzzy classification means that each observation has a certain probability of
belonging to a class. In the crisp case, it has probability 1 of belonging to one
class, and probability 0 of belonging to any other class. In a fuzzy case, it has
probability < 1 for the best class, and probabilities > 0 for several other classes
(and the sum of these probabilities is 1).
Fuzzy classification is similar to K-means clustering in finding the optimal
classification for a given number of classes, but the produced classification is
fuzzy: the result is a probability profile of class membership.
Fuzzy clustering is provided by function fanny in the cluster package. Fuzzy clustering defaults to the Euclidean metric, but current versions can accept any dissimilarities. In the following we also need to use a lower membership exponent (memb.exp): the default 2 gives complete fuzziness and fails here, and lower values give crisper classifications.
R> library(cluster)
R> cfuz <- fanny(d, 3, memb.exp=1.7)
The function returns an object with the following items:
R> names(cfuz)
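For instance, the membership probabilities and the nearest crisp classification can be examined (the component names are those returned by fanny):
R> head(cfuz$membership)  # probability profile of class membership
R> cfuz$clustering        # nearest crisp classification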
Notes:
6. A similar plot for cross classifications is called a Marimekko plot: Google to see what it looks like and how it is constructed.
7. All functions in the cluster package have stupid names.
8. Old versions of fanny may not have all these options and can fail; you should upgrade R to get a new version of the cluster package.
The fuzzy classification can be displayed in the ordination diagram:
R> ordiplot(ord, dis="si", type="n")
R> stars(cfuz$membership, locations=ord, draw.segments=TRUE, add=TRUE, scale=FALSE, len=0.1)
R> ordihull(ord, cfuz$clustering, col="blue")
This uses the stars function (with many optional parameters) to show the probability profiles, and draws a convex hull of the crisp classification. The size of each sector shows the probability of class membership, and in clear cases one of the segments is dominant.