4.3 Topic Notes
In the first topic of Module 4, you learnt that unsupervised learning uses unlabelled datasets and reveals
the structure of data. One of the tasks of unsupervised machine learning is clustering. It is a process
of dividing an available dataset (data objects represented by feature vectors) into subsets that share
some similarities. These subsets of data are called clusters. For example, an enterprise can group its clients by income level (high, average, and low income), a university can group students by average mark (e.g., marks in the ranges 10-8, 7-6, and 5-4), and a librarian can group books by theme (romance, science fiction, etc.). Each cluster is therefore made up of one or more data objects and is characterised by two aspects:
• the similarity of data objects within the same cluster;
• the dissimilarity of data objects belonging to different clusters.
The concept of similarity was considered in detail in the second topic of the module with regard to the kNN algorithm. It is also used in unsupervised machine learning, so it is worth recalling that the concept of similarity depends on the feature types (a short code sketch after the list below illustrates both cases):
• for categorical feature values, the Hamming distance is usually used; it counts the number of features in which two data objects differ: the fewer the differences in features, the greater the similarity of the data objects;
• for continuous feature values, the geometric distance between any pair of data objects is calculated using the Euclidean or Manhattan distance: the closer the data objects, the greater their mutual similarity.
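To make these two cases concrete, here is a minimal Python sketch of the three distance measures; the client and student feature vectors in it are invented purely for illustration.

```python
# A hypothetical illustration of the distance measures described above
# (the feature vectors are invented for this example).

def hamming_distance(a, b):
    """Count the number of features in which two categorical vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def euclidean_distance(a, b):
    """Straight-line geometric distance between two continuous feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan_distance(a, b):
    """City-block distance: the sum of absolute feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Categorical features: two clients described by (income level, region, segment)
print(hamming_distance(["high", "north", "retail"], ["high", "south", "retail"]))  # 1

# Continuous features: two students described by (average mark, attendance rate)
print(euclidean_distance([8.5, 0.9], [6.0, 0.7]))  # about 2.51
print(manhattan_distance([8.5, 0.9], [6.0, 0.7]))  # 2.7
```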
In many cases, clustering is used in the data exploration step (see Topic 4.1) to understand the structure
of data.
K-Means clustering
The K-Means algorithm is one of the popular algorithms of unsupervised machine learning. According
to (Jones, 2009), “the algorithm is popular primarily because it works relatively well and is extremely
simple both to understand and to implement”. The algorithm is based on two central concepts:
• the concept of distance;
• the concept of a centroid.
The K-Means algorithm proceeds through the following steps (Jones, 2009; Kubat, 2017; Tyugu, 2007):
1. Specify the number of clusters, K, to be generated by the algorithm. K is thus a hyperparameter of the algorithm; usually, it is selected by a trial-and-error approach.
2. Randomly select K data objects from the available dataset, which will represent the initial
centroids.
3. For each data object in the dataset:
a) compute the distance between the data object and each of the centroids;
b) find the smallest distance and assign the data object to the corresponding cluster;
4. Re-calculate each centroid by averaging the feature values of all data objects assigned to the cluster that the centroid represents.
5. Repeat Steps 3-4 until the centroid values no longer change.
The mentioned steps are represented in Figure 1 and sketched in code below. The typical termination criterion of the clustering process in the K-Means algorithm is an iteration in which the data objects no longer change their cluster membership (the centroid values do not change) (Jones, 2009; Kubat, 2017; Tyugu, 2007). Sometimes, it is also possible to terminate the clustering process after a pre-defined number of iterations.
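As an illustration only, the following NumPy sketch implements Steps 1-5; the choice of K, the iteration limit and the example points are assumptions made for this example, not part of the topic material.

```python
import numpy as np

def k_means(X, k, max_iterations=100, random_state=0):
    """Didactic K-Means sketch following Steps 1-5 (assumes no cluster becomes empty)."""
    rng = np.random.default_rng(random_state)

    # Step 2: randomly select K data objects from the dataset as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iterations):  # optional safeguard: a pre-defined number of iterations
        # Step 3: compute the distance to each centroid and assign each data object
        # to the cluster of the nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 4: re-calculate each centroid as the mean of the objects in its cluster.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

        # Step 5: terminate when the centroid values no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids

# Two clearly separated groups of two-dimensional data objects
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels, centroids = k_means(X, k=2)
print(labels)     # e.g. [0 0 0 1 1 1] (cluster numbering may differ)
print(centroids)
```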
Hierarchical clustering is another unsupervised machine learning algorithm; unlike K-Means, it does not require the developer to make any assumption about the number of clusters in advance.
Hierarchical clustering
Hierarchical clustering is about building a hierarchy of clusters that is characterised by the following
aspects (Hastie, 2017):
• the clusters at each level of the hierarchy are created by merging clusters at the next lower level;
• at the lowest level, each cluster contains a single data object;
• at the highest level, there is only one cluster containing all of the data objects.
There are several methods to measure the similarity (distance) between clusters and thereby decide the rules for merging them; these are often called linkage methods. The most widely used methods are (Hastie, 2017):
• Single-linkage: in deciding about the cluster similarity, the distance between the closest elements of the two clusters (the shortest distance) is calculated:

$d_{SL}(G, H) = \min_{i \in G,\; i' \in H} d_{ii'}$

where G, H – clusters;
d – distance;
i – a data object from the cluster G;
i' – a data object from the cluster H.
• Average-linkage: the distance between two clusters is defined as the average distance from each data object in one cluster to every data object in the other cluster:

$d_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}$

where G, H – clusters;
N_G – the number of data objects in the cluster G;
N_H – the number of data objects in the cluster H;
d – distance;
i – a data object from the cluster G;
i' – a data object from the cluster H.
Different linkage methods lead to different clusters, so the choice of the method is left to the developer.
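For instance, the short sketch below computes the single-linkage and average-linkage distances between two small, made-up clusters, using the Euclidean distance for $d_{ii'}$.

```python
import numpy as np

def single_linkage(G, H):
    """Shortest pairwise distance between a data object in G and a data object in H."""
    return min(np.linalg.norm(i - i2) for i in G for i2 in H)

def average_linkage(G, H):
    """Average of all N_G * N_H pairwise distances between the objects of G and H."""
    pairwise = [np.linalg.norm(i - i2) for i in G for i2 in H]
    return sum(pairwise) / (len(G) * len(H))

G = np.array([[0.0, 0.0], [0.0, 1.0]])  # cluster G, N_G = 2
H = np.array([[3.0, 0.0], [4.0, 0.0]])  # cluster H, N_H = 2
print(single_linkage(G, H))   # 3.0 (the closest pair of objects)
print(average_linkage(G, H))  # about 3.57 (mean of the four pairwise distances)
```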
Hierarchical clustering can be performed bottom-up, by repeatedly merging clusters (agglomerative clustering), or top-down, by repeatedly splitting clusters (divisive clustering). According to (Hastie, 2017), divisive hierarchical clustering has not been studied nearly as extensively as agglomerative clustering in the clustering literature. However, Pai (2021) indicates that one way to implement this type of hierarchical clustering is to run the K-Means procedure recursively on each intermediate cluster, until every data object forms its own cluster or until each cluster reaches the minimum number of data objects the developer desires.
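A possible sketch of this recursive approach is shown below; it reuses the hypothetical k_means function and the example array X from the K-Means sketch above, and the stopping rule (min_cluster_size) is an assumption made for illustration.

```python
def divisive_clustering(X, min_cluster_size=1):
    """Sketch of divisive hierarchical clustering via recursive two-way K-Means splits."""
    if len(X) <= min_cluster_size:
        return [X]                         # this branch is small enough: stop splitting
    labels, _ = k_means(X, k=2)            # split the current cluster into two
    left, right = X[labels == 0], X[labels == 1]
    if len(left) == 0 or len(right) == 0:  # degenerate split: keep the cluster as it is
        return [X]
    return (divisive_clustering(left, min_cluster_size)
            + divisive_clustering(right, min_cluster_size))

clusters = divisive_clustering(X, min_cluster_size=3)
print([len(c) for c in clusters])          # e.g. [3, 3] for the example points above
```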
The key to interpreting a dendrogram is to focus on the height of the clades, that is, the height at which two data objects or clusters are merged. The height of the clades serves as an indicator of cluster similarity:
• two data objects (leaves) joined by a low clade are more similar to each other than two data objects joined by a higher clade;
• clades that merge at close to the same height are similar to each other;
• clades that merge at different heights are dissimilar.
Thus, the greater the difference in the height of the clades, the more dissimilar the clusters are. Figure 5 shows five data objects – A, B, C, D, and E – located in the feature space. From the dendrogram, we can conclude that the data objects A and B are the most similar in terms of the distance between them, as the height of the clade that connects them is the smallest. The data objects C and D are the next most similar pair. The data objects A, B, C and D are all more similar to each other than they are to the data object E.
Fig. 6. Effect of granularity and cluster size while traversing the dendrogram (adapted from (Pai, 2021))
A horizontal cut-off line is usually drawn through the dendrogram to decide the number of clusters produced by the hierarchical clustering algorithm. The number of vertical lines intersected by the horizontal cut-off line equals the number of clusters. Cut-offs can be made at different levels of the hierarchy, leading to different numbers of clusters. For example, in Figure 7, Cut-off 1 intersects two vertical lines, and we obtain two clusters with the following data objects: (A, B, C, D) and (E). Cut-off 2 intersects three vertical lines, so we obtain three clusters with the following data objects: (A, B), (C, D) and (E). As a result, to choose the right cut-off, developers usually need to use evaluation metrics that assess the algorithm's performance for different numbers of clusters.
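As an illustration, the sketch below uses SciPy's agglomerative clustering routines (scipy.cluster.hierarchy) on five made-up points standing in for A-E; requesting two or three clusters with fcluster corresponds to the two cut-off levels described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Five two-dimensional data objects playing the role of A, B, C, D and E
points = np.array([[1.0, 1.0],   # A
                   [1.1, 1.0],   # B
                   [3.0, 3.0],   # C
                   [3.1, 3.2],   # D
                   [8.0, 0.5]])  # E

# Agglomerative clustering; other linkage methods include 'complete' and 'average'
Z = linkage(points, method='single')

# Cut-off 1: ask for two clusters -> (A, B, C, D) and (E)
print(fcluster(Z, t=2, criterion='maxclust'))  # e.g. [1 1 1 1 2]

# Cut-off 2: ask for three clusters -> (A, B), (C, D) and (E)
print(fcluster(Z, t=3, criterion='maxclust'))  # e.g. [1 1 2 2 3]

# dendrogram(Z) draws the tree itself (plotting requires matplotlib)
```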
Information sources
Hastie, T., Tibshirani, R., Friedman, J. (2017). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics.
Jones, T. M. (2009). Artificial Intelligence: A Systems Approach. Jones & Bartlett Learning.
Kubat, M. (2017). An Introduction to Machine Learning. Springer International Publishing.
Tyugu, E. (2007). Algorithms and Architectures of Artificial Intelligence. IOS Press.
Pai, P. (2021). Hierarchical clustering explained. Available at: https://towardsdatascience.com/hierarchical-clustering-explained-e59b13846da8