Hierarchical Clustering
Hierarchical clustering
Figure 19.1: Examples of hierarchical clustering. Left panel: hierarchical clustering of living organisms, indicating evolutionary relations (image source: Wikipedia). Right panel: hierarchical clustering of gene expression data (Mulvey and Gingold, Online Computational Biology Textbook).
A hierarchical clustering can be represented as a dendrogram (a tree) by first joining the most similar examples and then gradually joining the most similar clusters until all links are formed, as shown in Figure 19.2. This means that we need to define not only how to measure the similarity, or dissimilarity, between examples in our dataset, but also how to measure similarity between clusters of examples, because we need to decide how to join clusters into larger clusters.
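For illustration, a dendrogram like the one in Figure 19.2 can be built and plotted with SciPy's hierarchy module. This is only a minimal sketch on toy data; the data matrix X, the random seed and the choice of average linkage are assumptions for the example.

# Minimal sketch: build and plot a dendrogram from a small data matrix.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))              # toy data: 20 examples, 2 features

Z = linkage(X, method='average', metric='euclidean')   # successive merges, most similar first
dendrogram(Z)                             # draw the tree of merges
plt.xlabel('example index')
plt.ylabel('merge distance')
plt.show()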
There are several ways of thinking about this problem. We can think of proximity between examples as a generic term for “likeness”, without any precise definition. Similarity is better defined, generally as a number between 0 and 1 indicating how alike two examples are. Dissimilarity is also a quantitative measure, in this case of the difference between examples, and distance is a special case of a dissimilarity measure that respects the algebraic properties of a metric. Namely, a distance is non-negative, symmetric and respects the triangle inequality:
$d(x, y) \geq 0, \qquad d(x, y) = d(y, x), \qquad d(x, z) \leq d(x, y) + d(y, z)$
There are many possible distance measures. Some of the most used are the Euclidean, squared Euclidean, Manhattan and Mahalanobis distances.
• Euclidean: $\|x - y\|_2 = \sqrt{\sum_d (x_d - y_d)^2}$ (the squared Euclidean distance simply omits the square root)
• Manhattan: $\|x - y\|_1 = \sum_d |x_d - y_d|$
• Mahalanobis (normalized by the covariance of the data): $\sqrt{(x - y)^T \mathrm{Cov}^{-1} (x - y)}$
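As a hedged illustration (the vectors x and y and the toy data used to estimate the covariance are assumptions for the example), these distances can be computed directly with NumPy and SciPy:

# Sketch: computing the distance measures above for two feature vectors.
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, mahalanobis

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])

d_euclidean = np.sqrt(np.sum((x - y) ** 2))   # Euclidean distance
d_manhattan = np.sum(np.abs(x - y))           # Manhattan distance
assert np.isclose(d_euclidean, euclidean(x, y))
assert np.isclose(d_manhattan, cityblock(x, y))

# The Mahalanobis distance needs the inverse covariance of the data the vectors come from.
data = np.random.default_rng(0).normal(size=(100, 3))
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
d_mahalanobis = mahalanobis(x, y, inv_cov)    # sqrt((x - y)^T Cov^-1 (x - y))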
For strings and sequences in general, some useful measures are the Hamming distance, which is the
count of differences between the strings, or the Levenshtein distance, or edit distance, counting the
number of single-character edits (insertions, deletions or substitutions) needed to transform one string
into the other.
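As a sketch, the Hamming distance is a direct count of mismatched positions, and the Levenshtein distance can be computed with the usual dynamic programming recurrence (the example strings are arbitrary):

# Sketch: Hamming and Levenshtein (edit) distances between two strings.
def hamming(s, t):
    # Number of positions where two equal-length strings differ.
    assert len(s) == len(t)
    return sum(a != b for a, b in zip(s, t))

def levenshtein(s, t):
    # Minimum number of single-character insertions, deletions or
    # substitutions needed to transform s into t.
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(hamming("karolin", "kathrin"))      # 3
print(levenshtein("kitten", "sitting"))   # 3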
Apart from a way to measure similarity or distance between examples, we must also measure the distance between clusters. The method for evaluating the distance between clusters is called the linkage, and there are several ways of defining it.
• Single linkage: distance between clusters is the distance between the closest points.
• Average linkage: average distance between all pairs of points from the different clusters.
• Median linkage: median distance between all pairs of points from the different clusters.
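These linkage criteria differ only in how the pairwise distances between the points of the two clusters are aggregated, as in the following sketch (the two clusters A and B are toy examples; complete linkage, used later in Figure 19.6, takes the farthest pair):

# Sketch: distance between two clusters under different linkage criteria.
import numpy as np
from scipy.spatial.distance import cdist

A = np.array([[0.0, 0.0], [0.0, 1.0]])   # points in cluster A
B = np.array([[3.0, 0.0], [4.0, 1.0]])   # points in cluster B

pairwise = cdist(A, B)                   # all distances between points of A and points of B

single_link = pairwise.min()             # closest pair of points
complete_link = pairwise.max()           # farthest pair of points
average_link = pairwise.mean()           # average of all pairwise distances
median_link = np.median(pairwise)        # median of all pairwise distances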
The obvious advantages of hierarchical clustering are that it avoids the need to specify the number of clusters beforehand, since the tree can be cut at any level after clustering, and that it can reveal hierarchical structure in the data. The disadvantages are that hierarchical clustering must generally be done in a single pass, with a greedy algorithm, which may introduce errors, and that, if the hierarchical structure assumed by this type of clustering does not exist in the data, the result may be confusing or misleading.
Agglomerative clustering is a bottom-up approach that begins with singleton clusters and repeatedly
joins the best two clusters, according to the linkage method used, into a higher level cluster until all
elements are joined. The time complexity of agglomerative clustering is generally $O(n^3)$, but can be
improved with linkage constraints.
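As a minimal sketch of this bottom-up procedure, the following naive implementation with single linkage stops when k clusters remain; it is meant only to illustrate the $O(n^3)$ loop, not to replace the optimized library routines (the data matrix and the value of k are assumptions for the example).

# Sketch: naive agglomerative clustering with single linkage, stopping at k clusters.
import numpy as np
from scipy.spatial.distance import cdist

def agglomerative_single_linkage(X, k):
    clusters = [[i] for i in range(len(X))]          # start with singleton clusters
    while len(clusters) > k:
        best = (None, None, np.inf)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between the closest points of the two clusters
                d = cdist(X[clusters[i]], X[clusters[j]]).min()
                if d < best[2]:
                    best = (i, j, d)
        i, j, _ = best
        clusters[i] = clusters[i] + clusters[j]      # merge the two closest clusters
        del clusters[j]
    return clusters

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
print(agglomerative_single_linkage(X, k=3))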
Divisive clustering is a top-down approach that begins with a single cluster containing all examples
and iteratively picks a cluster to split and separates it into smaller clusters until some number of clusters
is reached. The theoretical time complexity for divisive clustering is $O(2^n)$ for an exhaustive search
and this approach needs an additional clustering algorithm for splitting each cluster. However, the time
complexity in practice can be lower, depending on the clustering algorithm used, and it may be better
than agglomerative clustering if we only want a few levels of hierarchical clustering.
Figure 19.4: Partitioning a hierarchical clustering by cutting the tree at the desired level (shown for two clusters and for five clusters).
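As a sketch of this partitioning step (the toy data and the use of Ward linkage are assumptions for the example), SciPy can cut a linkage tree into any desired number of flat clusters:

# Sketch: cutting a dendrogram into flat partitions, as in Figure 19.4.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(50, 2))   # toy data
Z = linkage(X, method='ward')

two_clusters = fcluster(Z, t=2, criterion='maxclust')    # labels for a cut into 2 clusters
five_clusters = fcluster(Z, t=5, criterion='maxclust')   # labels for a cut into 5 clusters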
Figure 19.5: Agglomerative clustering with Ward linkage, without connectivity constraints (left panel) and with connectivity constraints connecting only the 10 nearest neighbours of each example (right panel).
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import kneighbors_graph

# X is the data matrix of examples (rows) by features (columns)
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
ward = AgglomerativeClustering(n_clusters=6, connectivity=connectivity,
                               linkage='ward').fit(X)
Figure 19.6: Agglomerative clustering of the same data set with (left to right) complete, average and
Ward linkage.
In outline, divisive clustering repeats the following steps until the desired number of clusters is reached:
1. Start with a single cluster containing all examples.
2. Choose the best cluster for splitting (e.g. the largest or the one with the lowest score).
3. Split the chosen cluster with another clustering algorithm, such as k-means with k = 2.
Although the time complexity for an exhaustive search in divisive clustering is $O(2^n)$, using the
k-means algorithm reduces the complexity (although at the cost of having a more greedy divisive
clustering) and the possibility of stopping at the desired level may make this algorithm preferable to
agglomerative clustering in some cases, since agglomerative clustering must run until the complete
tree is generated.
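A hedged sketch of this idea follows, bisecting a cluster with k-means at each step; always splitting the largest cluster and the toy data are assumptions for the example, not a prescribed strategy.

# Sketch: divisive clustering by repeatedly bisecting the largest cluster with k-means.
import numpy as np
from sklearn.cluster import KMeans

def divisive_bisecting_kmeans(X, n_clusters):
    clusters = [np.arange(len(X))]                   # start with one cluster of all examples
    while len(clusters) < n_clusters:
        # choose the best cluster to split: here, simply the largest one
        largest = max(range(len(clusters)), key=lambda c: len(clusters[c]))
        idx = clusters.pop(largest)
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2))
print([len(c) for c in divisive_bisecting_kmeans(X, n_clusters=4)])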
Lines 5-7 are for loading the data and converting the image matrices into a matrix of examples (rows) and features (columns). Line 8 is for creating the connectivity matrix with the neighbours of each pixel in the 8 × 8 image grid (a 64 × 64 matrix, with one entry per pair of pixel features). Lines 9 and 10 create the agglomerative clusterer and fit the data, and the last two lines compute the reduced dataset, with only 16 features, and a reconstructed 64-feature dataset with the feature values aggregated by averaging the features in the same cluster. Figure 19.8 shows the result. Although the digits in the reduced dataset are no longer recognizable as digits, it is easy to see that the patterns differ from digit to digit, so this process reduced the number of features without losing much information.
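A minimal sketch of these steps, assuming scikit-learn's 8 × 8 digits dataset and its grid_to_graph and FeatureAgglomeration utilities, could look like this:

# Sketch: clustering the 64 pixel features of the digits images into 16 clusters.
import numpy as np
from sklearn import datasets
from sklearn.cluster import FeatureAgglomeration
from sklearn.feature_extraction.image import grid_to_graph

digits = datasets.load_digits()
images = digits.images                          # shape (n_samples, 8, 8)
X = images.reshape(len(images), -1)             # examples (rows) by 64 pixel features (columns)

connectivity = grid_to_graph(*images[0].shape)  # 64 x 64 adjacency of neighbouring pixels
agglo = FeatureAgglomeration(connectivity=connectivity, n_clusters=16).fit(X)

X_reduced = agglo.transform(X)                  # 16 features per example
X_approx = agglo.inverse_transform(X_reduced)   # back to 64 features, averaged within each cluster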
Figure 19.8: Feature clustering. The original handwritten digits features were clustered as shown in the
left panel. Using only these 16 clusters as 16 features, the reduced data set is illustrated on the right
panel.