
TOPIC 4.3: UNSUPERVISED MACHINE LEARNING

In the first topic of Module 4, you learnt that unsupervised learning uses unlabelled datasets and reveals
the structure of data. One of the tasks of unsupervised machine learning is clustering. It is a process
of dividing an available dataset (data objects represented by feature vectors) into subsets that share
some similarities. These subsets of data are called clusters; for example, an enterprise can group its
clients based on the level of income (clients with high, average, and low income), a university can group
students based on the average mark (students with marks in the range of 10-8, 7-6, and 5-4), and a librarian
can group books based on a theme (romance, science fiction, etc.). Each cluster is therefore
made up of one or more data objects and is characterised by two aspects:
• the similarity of data objects within the cluster;
• the dissimilarity of data objects between clusters.

The concept of similarity was considered in detail in the second topic of the module with regard to the kNN
algorithm. It is also used in unsupervised machine learning. Therefore, it is worth recalling that the
concept of similarity depends on the feature types (a small code sketch is given after this list):
• for categorical feature values, the Hamming distance is usually used; it counts the number of
features in which two data objects differ: the fewer the differences in features, the greater the
similarity of the data objects;
• for continuous feature values, the geometric distance between any pair of data objects is calculated
based on the Euclidean or Manhattan distance: the closer the data objects, the greater their mutual
similarity.
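As a small illustrative sketch (not taken from the cited sources), the three distance measures can be computed for a pair of data objects as follows:

```python
import numpy as np

def hamming(a, b):
    """Number of features in which two (categorical) data objects differ."""
    return sum(x != y for x, y in zip(a, b))

def euclidean(a, b):
    """Straight-line distance between two continuous feature vectors."""
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def manhattan(a, b):
    """Sum of absolute feature-wise differences."""
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

print(hamming(["red", "small", "round"], ["red", "large", "round"]))  # 1
print(euclidean([1.0, 2.0], [4.0, 6.0]))                              # 5.0
print(manhattan([1.0, 2.0], [4.0, 6.0]))                              # 7.0
```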
In many cases, clustering is used in the data exploration step (see Topic 4.1) to understand the structure
of data.

K-Means clustering
The K-Means algorithm is one of the most popular unsupervised machine learning algorithms. According
to (Jones, 2009), “the algorithm is popular primarily because it works relatively well and is extremely
simple both to understand and to implement”. The algorithm is based on two central concepts:
• the concept of distance;
• the concept of a centroid.

A distance represents a similarity measure used to group data objects in clusters (see above). A
centroid is a centre of a cluster around which data objects are grouped based on their distance to the
centroid. It is the average of the current set of feature vectors within the cluster (Jones, 2009). In other
words, the data objects that are closest to a centroid in terms of distance form a cluster. The task of K-
Means is to minimise the sum of distances between the data objects belonging to a cluster and the cluster
centroid. The algorithm maps each data object to exactly one cluster.

The K-Means algorithm is based on the following steps (Jones, 2009; Kubat, 2017; Tyugu, 2007):
1. Specify the number of clusters, K, that the algorithm should generate. Thus, K is the
hyperparameter of this algorithm. Usually, it is selected by a trial-and-error approach.
2. Randomly select K data objects from the available dataset to serve as the initial
centroids.
3. For each data object in the dataset:
a) compute the distance between the data object and each of the centroids;
b) find the smallest distance and assign the data object to the corresponding cluster.
4. Re-calculate each centroid by averaging the feature values of all data objects
belonging to the cluster represented by that centroid.
5. Repeat Steps 3-4 until the centroid values no longer change.
These steps are represented in Figure 1, and a minimal code sketch of them is given after the figure. The typical termination criterion of the clustering
process in the K-Means algorithm is an iteration in which the data objects do not change their cluster
membership (i.e., the centroid values do not change) (Jones, 2009; Kubat, 2017; Tyugu, 2007). Sometimes,
the clustering process is instead terminated after a pre-defined number of iterations.

Fig.1. Steps of the K-Means algorithm
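The following minimal NumPy sketch (illustrative only; the function name k_means and its defaults are assumptions, not taken from the cited sources) implements the assign-and-update loop described above using the Euclidean distance:

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Cluster the rows of X into k clusters; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick K data objects as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each data object to the nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of the data objects assigned to it.
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 5: stop when the centroid values no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: 60 two-dimensional points forming three groups.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(20, 2)) for c in ([0, 0], [5, 5], [0, 5])])
labels, centroids = k_means(X, k=3)
```

In practice, a library implementation such as sklearn.cluster.KMeans would normally be used instead of hand-written code.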
The algorithm has two main drawbacks (Jones, 2009; Kubat, 2017; Tyugu, 2007):
• it is necessary to define the number of clusters before the clustering process. This calls for careful
exploration of the data before clustering and for experimentation to check the performance of
the algorithm with different numbers of clusters (see the sketch after this list);
• the initialisation of the centroids can also be problematic: a poor choice of initial centroids may lead to
a poor final clustering.
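To address the first drawback, a common heuristic (not taken from the cited sources) is the so-called elbow method: run K-Means for several values of K and inspect the within-cluster sum of squared distances. A minimal sketch with scikit-learn, reusing the X array from the earlier example, might look like this:

```python
from sklearn.cluster import KMeans

# Try several values of K and record the within-cluster sum of squared distances
# (inertia); a pronounced "elbow" in this curve is a common heuristic for choosing K.
inertias = {}
for k in range(1, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
print(inertias)
```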

Hierarchical clustering is another unsupervised machine learning algorithm; unlike K-Means, it does not
require the developer to specify the number of clusters in advance.

Hierarchical clustering
Hierarchical clustering is about building a hierarchy of clusters that is characterised by the following
aspects (Hastie, 2017):
• the clusters at each level of the hierarchy are created by merging clusters at the next lower level;
• at the lowest level, each cluster contains a single data object;
• at the highest level, there is only one cluster containing all of the data objects.

There are two types of hierarchical clustering (Hastie, 2017):


• agglomerative hierarchical clustering;
• divisive hierarchical clustering.

Agglomerative hierarchical clustering is a bottom-up approach to building a hierarchy of clusters.
Each data object is therefore initially assigned to its own cluster. Then, at each level, a selected pair
of clusters is recursively merged into a single cluster. This produces a grouping at the next higher level
with one less cluster. The pair chosen for merging consists of the two clusters with the smallest intergroup
dissimilarity (Hastie, 2017). Thus, a pair of clusters is merged at each hierarchy level until a single
cluster containing all data objects is obtained at the highest level. The steps of
agglomerative hierarchical clustering can be described as follows:
1. assign each data object to its own cluster. This step results in N clusters, where N is the
number of data objects in the dataset;
2. take the two data objects with the smallest dissimilarity and merge them into one cluster. This step
results in N-1 clusters;
3. take the two clusters with the smallest dissimilarity and merge them into one cluster. This step
reduces the number of clusters by 1;
4. repeat Step 3 until only one cluster remains.
Figure 2 represents the process of agglomerative hierarchical clustering, in which the clusters having the
smallest distance between them are merged at each step of clustering; a small code sketch of this merge loop is given after the figure.

Fig.2. Agglomerative hierarchical clustering
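To make the merge loop concrete, the following naive sketch (the function name single_linkage_merges is illustrative and not from the cited sources) uses the shortest distance between members of two clusters as the dissimilarity, i.e. the single-linkage rule described in the next subsection; library implementations such as SciPy are far more efficient:

```python
import numpy as np

def single_linkage_merges(X):
    """Naive single-linkage agglomerative clustering; returns the list of merges."""
    # Step 1: every data object starts in its own cluster.
    clusters = [[i] for i in range(len(X))]
    merges = []
    # Steps 2-4: repeatedly merge the two clusters with the smallest dissimilarity.
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: the shortest distance between members of the two clusters.
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```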

There are several methods for measuring the similarity (distance) between clusters, and thus for deciding
which clusters to merge; they are often called linkage methods. The most commonly used methods are (Hastie,
2017):

• Complete-linkage: in deciding about cluster similarity, the distance between the clusters’ most
distant elements (the longest distance) is calculated:

$$d_{CL}(G, H) = \max_{i \in G,\; i' \in H} d_{ii'}$$

where G, H – clusters;
d – distance;
i – a data object from the cluster G;
i’ – a data object from the cluster H.

• Single-linkage: in deciding about cluster similarity, the distance between the closest elements
of the two clusters (the shortest distance) is calculated:

$$d_{SL}(G, H) = \min_{i \in G,\; i' \in H} d_{ii'}$$

where G, H – clusters;
d – distance;
i – a data object from the cluster G;
i’ – a data object from the cluster H.
• Average-linkage: the distance between two clusters is defined as the average distance between
each data object in one cluster and every data object in the other cluster:

$$d_{GA}(G, H) = \frac{1}{N_G N_H} \sum_{i \in G} \sum_{i' \in H} d_{ii'}$$

where G, H – clusters;
N_G – the number of data objects in the cluster G;
N_H – the number of data objects in the cluster H;
d – distance;
i – a data object from the cluster G;
i’ – a data object from the cluster H.

Different linkage methods lead to different clusters, and thus the choice of the method depends on the
developer; a small sketch comparing the three rules is given below.
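To make the three rules concrete, the short sketch below (illustrative only, not from the cited sources) computes the complete-, single- and average-linkage distances between two small example clusters using the Euclidean distance:

```python
import numpy as np

def pairwise_distances(G, H):
    """All Euclidean distances d(i, i') between data objects i in G and i' in H."""
    return np.array([[np.linalg.norm(i - j) for j in H] for i in G])

G = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster G
H = np.array([[4.0, 0.0], [5.0, 1.0]])   # cluster H

D = pairwise_distances(G, H)
print("complete linkage:", D.max())    # longest distance between the clusters
print("single linkage:  ", D.min())    # shortest distance between the clusters
print("average linkage: ", D.mean())   # average over all N_G * N_H pairs
```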

Divisive hierarchical clustering applies a top-down clustering approach. It therefore starts with a single
cluster containing all data objects. Then, at each hierarchy level, it recursively splits one of the existing
clusters into two new clusters. The split is chosen to produce two new groups with the largest between-
group dissimilarity (Hastie, 2017). Finally, the clustering stops when each data object forms its own
cluster. Figure 3 represents the simplified idea of divisive hierarchical clustering: the data objects having
the largest distance from the other data objects in a cluster are split off into a separate cluster
at each step of clustering.

Fig.3. Divisive hierarchical clustering

According to (Hastie, 2017), divisive hierarchical clustering has not been studied nearly as extensively
as agglomerative clustering in the clustering literature. However, Pai (2021) indicates that one way to
implement this type of hierarchical clustering is to perform K-Means recursively on each intermediate
cluster until every cluster contains a single data object or the minimum number of data objects desired
in a cluster.
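A minimal sketch of this idea, assuming each intermediate cluster is split into two with K-Means (a common variant often called bisecting K-Means; the function name divisive and the min_size parameter are illustrative, not from the cited sources), could look as follows:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, min_size=1):
    """Recursively split clusters in two with K-Means until they reach min_size."""
    splits = []

    def split(indices):
        if len(indices) <= min_size:
            return
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[indices])
        left, right = indices[labels == 0], indices[labels == 1]
        if len(left) == 0 or len(right) == 0:   # K-Means failed to separate the points
            return
        splits.append((indices, left, right))
        split(left)
        split(right)

    split(np.arange(len(X)))
    return splits
```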

A dendrogram is a tree-like diagram representing hierarchical relationships between data objects in
the dataset. It is an output of the hierarchical clustering algorithm. According to (Hastie, 2017), “a
dendrogram provides a highly interpretable complete description of the hierarchical clustering in a
graphical format”. Thus, it provides insight into the way in which the clusters were formed.

A dendrogram consists of (Figure 4):
• clades, which represent branches and can have one or more leaves;
• leaves, which are the terminal ends of clades and correspond to the data objects used in the clustering
process.

Fig.4. Constituent parts of a dendrogram

The key to interpreting a dendrogram is to focus on the height of the clades, that is, the height at which two
data objects (or clusters) are merged. The height of the clades serves as an indicator of cluster similarity:
• two data objects (leaves) in the same clade are more similar to each other than to data objects (leaves) in
another clade;
• clades that are close to the same height are similar to each other;
• clades with different heights are dissimilar.
Thus, the greater the difference in the height of the clades, the more dissimilar the clusters. Figure
5 represents five data objects – A, B, C, D, and E – located in the feature space. From the dendrogram, we can
conclude that the data objects A and B are the most similar in terms of the distance between them, as the height
of the clade that connects them is the smallest. The data objects C and D are the next most similar pair.
The data objects A, B, C and D are more similar to each other than to the data object E.

Fig.5. Interpreting a dendrogram

When traversing a dendrogram, it is necessary to consider the effect of granularity and cluster size
(Figure 6): the lower in the hierarchy the tree is cut, the more numerous and the smaller the resulting clusters are.

Fig. 6. Effect of granularity and cluster size while traversing the dendrogram (adapted from (Pai,
2021))

A horizontal cut-off line is usually drawn through the dendrogram to decide the number of clusters produced
by the hierarchical clustering algorithm. The number of vertical lines intersected by the horizontal cut-off
line represents the number of clusters. Cut-offs can be performed at different levels of the hierarchy,
leading to different numbers of clusters. For example, in Figure 7, Cut-off 1 intersects two vertical lines,
and we receive two clusters with the following data objects: (A, B, C, D) and (E). Cut-off 2 intersects three
vertical lines, so we receive three clusters with the following data objects: (A, B), (C, D) and (E). As a
result, to make the right cut-off, developers usually need to use evaluation metrics assessing the
clustering quality for different numbers of clusters (a code sketch of cutting a dendrogram is given after Figure 7).

Fig.7. Cutting-off the dendrogram
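In code, building the hierarchy, drawing the dendrogram, and cutting it are commonly done with SciPy. The sketch below is an illustration (the chosen linkage method and thresholds are assumptions, not from the cited sources) and reuses the X array from the earlier K-Means example:

```python
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Build the merge hierarchy with average linkage (method can also be 'single' or 'complete').
Z = linkage(X, method='average')

# Draw the dendrogram; the vertical axis shows the height at which clusters are merged.
dendrogram(Z)
plt.show()

# "Cut" the dendrogram: either at a chosen height (criterion='distance')
# or so that a desired number of clusters is obtained (criterion='maxclust').
labels_by_height = fcluster(Z, t=3.0, criterion='distance')
labels_by_count = fcluster(Z, t=3, criterion='maxclust')
```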

Summarising the information about hierarchical clustering given in this topic, it is worth mentioning
that the entire cluster hierarchy represents an ordered sequence of cluster merges (Hastie, 2017). The
algorithm does not require the number of clusters to be specified before it runs. Still, the
developer needs to decide a) where to cut the dendrogram to obtain the final number of clusters and
b) which linkage method to use to measure the similarity between clusters.

Information sources
Hastie, T., Tibshirani, R., Friedman, J. (2017). The Elements of Statistical Learning: Data Mining,
Inference, and Prediction. Springer Series in Statistics. Springer.
Jones, T. M. (2009). Artificial Intelligence: A Systems Approach. Jones & Bartlett Learning.
Kubat, M. (2017). An Introduction to Machine Learning. Springer International Publishing.
Tyugu, E. (2007). Algorithms and Architectures of Artificial Intelligence. IOS Press.
Pai, P. (2021). Hierarchical Clustering Explained. Available at
https://towardsdatascience.com/hierarchical-clustering-explained-e59b13846da8

Dr.sc.ing., Dr.paed., assoc. professor Alla Anohina-Naumeca


Department of Artificial Intelligence and Systems Engineering
Faculty of Computer Science and Information Technology
Riga Technical University
E-mail: [email protected]
