
13 Introduction to Unsupervised Learning Algorithms

LEARNING OBJECTIVES
• To understand the basics of clustering.
• To understand and appreciate the requirements of clustering.
• To introduce the concept of partitioning-based clustering (k-means and k-medoids).
• To learn how to create dendrograms using agglomerative-based clustering.

LEARNING OUTCOMES
• Students will be able to understand and appreciate clustering as an unsupervised learning method.
• Students will be able to solve numericals on partitioning-based clustering techniques.
• Students will be able to solve numericals on agglomerative-based clustering using single, complete, and average linkages.

13.1 Introduction to Clustering


Clustering is the process of grouping together data objects into multiple sets or clusters, so that objects
within a cluster have high similarity as compared to objects outside of it. The similarity is assessed based on
the attributes that describe the objects. Similarity is measured by distance metrics. The partitioning of clus-
ters is not done by humans. It is done with the help of algorithms. These algorithms allow us to derive some
useful information from the data which was previously unknown. Clustering is also called data segmentation
because it partitions large datasets into groups according to their similarity.
Clustering can also be used for outlier detection. Outliers are objects that do not fall into any cluster because they are too dissimilar to the other objects. They are useful in special applications such as credit card fraud detection: very expensive and infrequent purchases may be signs of fraudulent transactions, and an extra level of security can be applied to such transactions.
Clustering is known as unsupervised learning because class label information is not present. You have already seen that in supervised learning every input has a corresponding output, which helps in designing a model. That is why supervised learning is called learning by example, while unsupervised learning is called learning by observation.


13.1.1 Applications of Clustering


Cluster analysis has been widely used in many applications such as business intelligence, pattern recogni-
tion, image processing, bioinformatics, web technology, search engines, and text mining.
1. Business intelligence: Cluster analysis helps in target marketing, where marketers discover groups and
categorize them based on the purchasing patterns. The information retrieved can be used in market
segmentation, product positioning (i.e., allocating products to specific areas), new product develop-
ment, grouping of shopping items, and selecting test markets.
2. Pattern recognition: Here, the clustering methods group similar patterns into clusters whose mem-
bers are more similar to each other. In other words, the similarity of members within a cluster is much
higher when compared to the similarity of members outside of it. There is no prior knowledge of pat-
terns of clusters or even how many clusters are appropriate.
3. Image processing: Extracting and understanding information from images is very important in image
processing. The images are initially segmented and the different objects of interest in them are then
identified. This involves division of an image into areas of similar attributes. It is one of the important
and most challenging tasks in image processing where clustering can be applied. Image processing has
applications in many areas such as analysis of remotely sensed images, traffic system monitoring, and
fingerprint recognition.
4. Bioinformatics: This is a growing field in terms of research activities and is a part of biotechnology
and genetic engineering. In this case, clustering techniques are required to derive plant and animal
taxonomies, categorize genes with similar functionalities, and gain insight into structures inherent to
populations. Biological systematics is another field which involves study of the diversification of living
forms and the relationships among living things through time. The scientific classification of species
can be done on the basis of similar characteristics using clustering. This field can give more informa-
tion about both extinct and extant organisms.
5. Web technology: Clustering helps in classifying documents on the web for information delivery.
6. Search engines: The success of Google as a search engine is largely due to its powerful search capabilities. Whenever a user fires a query, the search engine returns results according to the objects most similar to the searched data, which are clustered around it. The speed and accuracy of the retrieved results depend on the clustering algorithm used: the better the clustering algorithm, the better the chances of getting the required result first. Hence the definition of a similar object plays a crucial role in obtaining good search results.
7. Text mining: Text mining involves the process of extracting high-quality information from text. High quality here refers to the relevance, novelty, and interestingness of the extracted information. Text mining can be used for sentiment analysis and document summarization.

13.1.2 Requirements of Clustering


The requirements of clustering can be enumerated and explained as follows:
1. Scalability: A clustering algorithm is considered to be highly scalable if it gives similar results inde-
pendent of the size of the database. Generally, clustering on a sample dataset may give different results
compared to a larger dataset. Poor scalability of clustering algorithms leads to distributed clustering for
partitioning large datasets. Some algorithms cluster large-scale datasets without considering the entire
dataset at a time. Data can be randomly divided into equal-sized disjoint subsets and clustered using
a standard algorithm. The centroids of subsets form an ensemble which can be solved by a centroid
correspondence algorithm. The centroids are combined to form a global set of centroids.
2. Dealing with different types of attributes: Many algorithms are designed to cluster numeric data. However, applications may require clustering other data types such as nominal, binary, and ordinal. Nominal data takes named (categorical) values rather than integer values. A binary attribute is of two types: symmetric binary and asymmetric binary.


[Figure 13.1 Clusters of arbitrary shapes: a scatter plot showing three chain-like clusters C1, C2, and C3.]

In a symmetric binary attribute, both values are equally important; for example, male and female in a gender attribute. In an asymmetric binary attribute, the two values are not equally important; for example, pass and fail in a result attribute. A clustering algorithm should also work for complex data types such as graphs, sequences, images, and documents.
3. Discovery of clusters with arbitrary shape: Clustering algorithms are generally designed to determine spherical clusters. Due to the characteristics and diverse nature of the data used, clusters may be of arbitrary shape and can be nested within one another. For example, the cluster pattern for active and inactive volcanoes has a chain-like shape, as shown in Fig. 13.1.
Traditional clustering algorithms, such as k-means and k-medoids, fail to detect non-spherical shapes. Thus, it is important to have clustering algorithms that can detect clusters of any arbitrary shape.
4. Avoiding domain knowledge to determine input parameters: Many algorithms require domain knowledge, such as the desired number of clusters, in the form of input parameters. The clustering results may therefore be sensitive to these parameters, which are often hard to determine for high-dimensional data. This requirement for domain knowledge affects the quality of clustering and burdens the user.
For example, in the k-means algorithm, the metric used to compare results for different values of k is the mean distance between data points and their cluster centroid. Increasing the number of clusters always reduces this distance, down to zero when k equals the number of data points, so the metric alone cannot be used to choose k. Instead, to roughly determine k, the mean distance to the centroid is plotted as a function of k and the "elbow point", where the rate of decrease sharply shifts, is chosen. This is shown in Fig. 13.2, and a short code sketch of the procedure is given at the end of this list.
5. Handling noisy data: Real-world data, which is the input of clustering algorithms, are mostly affected
by noise. This results in poor-quality clusters. Noise is an unavoidable problem, which affects the data
collection and data preparation processes. Therefore, the algorithms we use should be able to deal with
noise. There are two types of noise:
• Attribute noise includes implicit errors introduced by measurement tools. They are induced by
different types of sensors.


[Figure 13.2 Elbow method: the average within-cluster distance to the centroid is plotted against K (number of clusters, 1 to 8), and the elbow point marks the value of K after which the curve flattens.]

• Random errors are introduced by batch processes or by experts when the data is gathered; they can arise, for example, in the document digitization process.
6. Incremental clustering: The database used for clustering needs to be updated by adding new data
(incremental updates). Some clustering algorithms cannot incorporate incremental updates but have
to recompute a new clustering from scratch. The algorithms which can accommodate new data with-
out reconstructing the clusters are called incremental clustering algorithms. It is more effective to use
incremental clustering algorithms.
7. Insensitivity to input order: Some clustering algorithms are sensitive to the order in which data objects are entered. Such algorithms are not ideal because we usually have little control over the order in which data objects are presented. Clustering algorithms should be insensitive to the input order of data objects.
8. Handling high-dimensional data: A dataset can contain numerous dimensions or attributes.
Generally, clustering algorithms are good at handling low-dimensional data such as datasets involving
only two or three dimensions. Clustering algorithms which can handle high-dimensional space are
more effective.
9. Handling constraints: Constrained clustering can be considered to contain a set of must-link con-
straints, cannot-link constraints, or both. In a must-link constraint, two instances in the must-link
relation should be included in the same cluster. On the other hand, a cannot-link constraint speci-
fies that the two instances cannot be in the same cluster. These sets of constraints act as guidelines
to cluster the entire dataset. Some constrained clustering algorithms cancel the clustering process
if they cannot form clusters which satisfy the specified constraints. Others try to minimize the
amount of constraint violation if it is impossible to find a clustering which satisfies the constraints.
Constraints can be used to select a clustering model to follow among different clustering meth-
ods. A challenging task is to find data groups with good clustering behavior that satisfy specified
constraints.
10. Interpretability and usability: Users require the clustering results to be interpretable, usable, and
include all the elements. Clustering is always tied with specific semantic interpretations and applica-
tions. The applications should be able to use the information retrieved after clustering in a useful
manner.
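As noted in requirement 4, the elbow method gives a rough way to choose k. The following is a minimal sketch of that procedure, assuming NumPy and scikit-learn are available; the two-dimensional dataset is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative 2-D data: three loose groups (assumed, not from the text)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

# Mean distance of points to their cluster centroid for k = 1..8
mean_dists = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    d = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    mean_dists.append(d.mean())

# The "elbow" is the k after which mean_dists stops dropping sharply.
for k, m in enumerate(mean_dists, start=1):
    print(k, round(m, 3))
```

Plotting mean_dists against k reproduces the kind of curve shown in Fig. 13.2.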


13.2 Types of Clustering


Clustering algorithms can be classified into two main subgroups:
1. Hard clustering: Each data point either belongs to a cluster completely or not.
2. Soft clustering: Instead of assigning each data point to exactly one cluster, a probability or likelihood of that data point belonging to each cluster is assigned.
Clustering algorithms can also be classified as follows:
1. Partitioning method.
2. Hierarchical method.
3. Density-based method.
4. Grid-based method.
However, the focus in this chapter is on partitioning method and hierarchical-based methods.

13.2.1 Partitioning Method


Partitioning means division. Suppose we are given a database of n objects and we need to partition this data
into k partitions of data. Within a partition there exists some similarity among the items. So each partition
will represent a cluster and k ≤ n. It means that it will classify the data into k groups, each group contains at
least one object and each object must belong to exactly one group. Although this is the general requirement,
in soft clustering an object can belong to two clusters also. Most partitioning methods are distance-based.
For a given number of partitions (say k), the partitioning method will create an initial partitioning.
Then it uses the iterative relocation technique to improve the partitioning by moving objects from one
group to another. The general criterion of a good partitioning is that objects in the same cluster are close to
each other, whereas objects in different clusters are far from each other. Some other criteria can be used for
judging the quality of partitions.
Partition-based clustering is often computationally expensive, and hence most methods apply heuristics, such as greedy approaches that iteratively improve the quality of the clusters but may only reach a local optimum.
These heuristic clustering methods work well for spherical clusters. For complex-shaped clusters and for very large datasets, partition-based methods need to be extended.

13.2.2 Hierarchical Method


Hierarchical clustering is an alternative approach to partitioning clustering for identifying groups in a data-
set. It does not require prespecifying the number of clusters to be generated. The result of hierarchical
clustering is a tree-based representation of the objects, which is also known as dendrogram. Observations can
be subdivided into groups by cutting the dendrogram at a desired similarity level. We classify hierarchical
methods on the basis of how the hierarchical decomposition is formed. There are two approaches:
1. Agglomerative approach: This approach is also known as the bottom-up approach. In this approach,
we start with each object forming a separate group. It keeps on merging the objects or groups that are
close to one another. It keeps on doing so until all of the groups are merged into one or until the ter-
mination condition holds.
2. Divisive approach: This approach is also known as the top-down approach. In this approach, we start with all of the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters, until each object is in its own cluster or the termination condition holds. Hierarchical methods are rigid, that is, once a merging or splitting is done, it can never be undone.


13.2.3 Density-Based Methods


The density-based clustering approach finds clusters of nonlinear (arbitrary) shape based on density. Density-based spatial clustering of applications with noise (DBSCAN) is the most widely used density-based algorithm. It uses the concepts of density reachability and density connectivity.
1. Density reachability: A point "p" is said to be density reachable from a point "q" if "p" is within distance ε of "q" and "q" has a sufficient number of points within distance ε in its neighborhood.
2. Density connectivity: Points "p" and "q" are said to be density-connected if there exists a chain of points, each with a sufficient number of points in its ε-neighborhood, linking "p" to "q". This is called the chaining process: if "q" is a neighbor of "r", "r" is a neighbor of "s", "s" is a neighbor of "t", and "t" is a neighbor of "p", then "q" is density-connected to "p".

13.2.4 Grid-Based Methods


The grid-based clustering approach differs from the conventional clustering algorithms in that it is con-
cerned not with data points but with the value space that surrounds the data points. In general, a typical
grid-based clustering algorithm consists of the following five basic steps:
1. Creating the grid structure, i.e., partitioning the data space into a finite number of cells.
2. Calculating the cell density for each cell.
3. Sorting the cells according to their densities.
4. Identifying cluster centers.
5. Traversing neighbor cells.
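The first three steps can be sketched with NumPy's two-dimensional histogram; the grid size and data below are assumptions made purely for illustration.

```python
import numpy as np

# Illustrative 2-D data points (assumed)
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 2))

# Step 1: partition the value space into a 5 x 5 grid of cells
# Step 2: cell density = number of points falling in each cell
density, x_edges, y_edges = np.histogram2d(X[:, 0], X[:, 1], bins=5,
                                           range=[[0, 10], [0, 10]])

# Step 3: order the cells by density (highest first)
order = np.argsort(density, axis=None)[::-1]
print(density.astype(int))
print("densest cell index (flattened):", order[0])
```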

13.3 Partitioning Methods of Clustering


The most fundamental clustering method is the partitioning method, which organizes the objects of a set into several exclusive groups or clusters and assumes that the number of clusters to be formed is known in advance. If k is the number of clusters to be formed for a given dataset D of n objects, the partitioning algorithm organizes the objects into k partitions (k ≤ n), where each partition represents a cluster. The objective of this type of partitioning is that the similarity among data items within a cluster is higher than their similarity to items in other clusters; in other words, intra-cluster similarity is high and inter-cluster similarity is low.

13.3.1 k-Means Algorithm


The most well-known clustering algorithm is probably k-means. It is taught in a lot of introductory data
science and machine learning classes. It is easy to understand and implement.
The main concept is to define k cluster centers. These centers should be placed so that together they cover the data points of the entire dataset; a good way to do this is to place them as far away from each other as possible. Each data point is then associated with the nearest cluster center. The initial grouping of the data is complete when no data point remains unassigned. Once the grouping is done, new centroids are computed and the data points are reassigned to clusters based on the new centers. The process is repeated until no more changes occur, that is, until the cluster centers no longer change. This algorithm aims at minimizing an objective function known as the squared error function, which is given by

J(V) = \sum_{i=1}^{c} \sum_{j=1}^{c_i} ( \lVert x_i - v_j \rVert )^2     (13.1)

where
||xi – vj|| is the Euclidean distance between xi and vj .


ci is the number of data points in ith cluster.


c is the number of cluster centers.
The objective function aims for high intra-cluster similarity and low inter-cluster similarity. This function
tries to make the resulting k clusters as compact and as separate as possible. Optimizing the within-cluster
variation is computationally challenging since the problem is NP-hard in general even for two clusters (that
is, k = 2). To overcome the prohibitive computational cost for the exact solution, greedy approaches are
often used in practice.
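As a small illustration of Eq. (13.1), the following NumPy sketch computes the squared-error objective for a given assignment of points to cluster centers; the toy values are assumed purely for illustration.

```python
import numpy as np

def squared_error(X, centers, labels):
    """J(V): sum of squared Euclidean distances of each point to its center."""
    diffs = X - centers[labels]       # each data point minus its own center
    return float(np.sum(diffs ** 2))

# Toy example (illustrative values): two centers, four points
X = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]])
centers = np.array([[1.25, 1.5], [8.5, 8.5]])
labels = np.array([0, 0, 1, 1])       # cluster index of each point
print(squared_error(X, centers, labels))   # 1.625
```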

13.3.1.1 Steps in k-Means Clustering Algorithm


Let us study the steps in the k-means clustering algorithm. Let X = {x1, x2, x3, …, xn} be the set of data points and V = {v1, v2, …, vc} be the set of cluster centers. The steps are:
1. Randomly select c cluster centers.
2. Calculate the distance between each data point and each cluster center.
3. Assign each data point to the cluster whose center is at minimum distance from it.
4. Recalculate the new cluster centers using Eq. (13.2).

v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j     (13.2)

where ci represents the number of data points in the ith cluster.


5. Recalculate the distance between each data point and the newly obtained cluster centers.
6. If no data point was reassigned then stop, otherwise repeat steps 3 to 5.
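A minimal NumPy sketch of steps 1-6 above (an illustrative implementation that assumes every cluster keeps at least one point; it is not an optimized library routine):

```python
import numpy as np

def kmeans(X, c, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: randomly select c data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=c, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: distance of every point to every center, assign nearest
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Steps 5-6: stop if no data point was reassigned
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: recompute each center as the mean of its points, Eq. (13.2)
        centers = np.array([X[labels == i].mean(axis=0) for i in range(c)])
    return centers, labels
```

For one-dimensional data, such as the solved problems that follow, X can simply be reshaped into a single-column array before calling the function.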

The advantages of k-means clustering algorithm are:

1. Fast, robust, and easier to understand.


2. Relatively efficient: The computational complexity of the algorithm is O(tknd), where n is the number of data objects, k is the number of clusters, d is the number of attributes in each data object, and t is the number of iterations. Normally, k, t, d << n; that is, the number of clusters, attributes, and iterations is very small compared to the number of data objects in the dataset.
3. Gives best results when the clusters in the dataset are distinct or well separated from each other.

The disadvantages of k-means clustering algorithm are:

1. It requires prior specification of number of clusters.


2. It is not able to cluster highly overlapping data well.
3. It is not invariant to nonlinear transformations; that is, different representations of the data give different results (data represented in Cartesian coordinates and in polar coordinates will give different clusters).
4. It provides only a local optimum of the squared error function.
5. A poor random choice of the initial cluster centers may not lead to a good result.
6. Applicable only when the mean is defined, that is, fails for categorical data.
7. Unable to handle noisy data and outliers.


13.3.1.1.1 k-Means Solved Examples in One-Dimensional Data


Solved Problem 13.1
Apply k-means algorithm in given data for k = 3 (that is, 3 clusters). Use C1(2), C2(16), and C3(38) as
initial cluster centers. Data: 2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30.

Solution:
The initial cluster centers are given as C1(2), C2(16), and C3(38). Calculating the distance between each
data point and cluster centers, we get the following table.

Data Points    Distance from C1(2)    Distance from C2(16)    Distance from C3(38)

2     (2 - 2)^2 = 0        (2 - 16)^2 = 196     (2 - 38)^2 = 1296
4     (4 - 2)^2 = 4        (4 - 16)^2 = 144     (4 - 38)^2 = 1156
6     (6 - 2)^2 = 16       (6 - 16)^2 = 100     (6 - 38)^2 = 1024
3     (3 - 2)^2 = 1        (3 - 16)^2 = 169     (3 - 38)^2 = 1225
31    (31 - 2)^2 = 841     (31 - 16)^2 = 225    (31 - 38)^2 = 49
12    (12 - 2)^2 = 100     (12 - 16)^2 = 16     (12 - 38)^2 = 676
15    (15 - 2)^2 = 169     (15 - 16)^2 = 1      (15 - 38)^2 = 529
16    (16 - 2)^2 = 196     (16 - 16)^2 = 0      (16 - 38)^2 = 484
38    (38 - 2)^2 = 1296    (38 - 16)^2 = 484    (38 - 38)^2 = 0
35    (35 - 2)^2 = 1089    (35 - 16)^2 = 361    (35 - 38)^2 = 9
14    (14 - 2)^2 = 144     (14 - 16)^2 = 4      (14 - 38)^2 = 576
21    (21 - 2)^2 = 361     (21 - 16)^2 = 25     (21 - 38)^2 = 289
23    (23 - 2)^2 = 441     (23 - 16)^2 = 49     (23 - 38)^2 = 225
25    (25 - 2)^2 = 529     (25 - 16)^2 = 81     (25 - 38)^2 = 169
30    (30 - 2)^2 = 784     (30 - 16)^2 = 196    (30 - 38)^2 = 64

Assigning each data point to the cluster center at minimum distance from it, we get the following table.

C1(2)           C2(16)                          C3(38)
m1 = 2          m2 = 16                         m3 = 38
{2, 3, 4, 6}    {12, 14, 15, 16, 21, 23, 25}    {30, 31, 35, 38}
New cluster centers
m1 = 3.75       m2 = 18                         m3 = 33.5

Similarly, using the new cluster centers we can calculate the distance from it and allocate clusters based
on minimum distance. It is found that there is no difference in the cluster formed and hence we stop this
procedure. The final clustering result is given in the following table.


C1(3.75)        C2(18)                          C3(33.5)
m1 = 3.75       m2 = 18                         m3 = 33.5
{2, 3, 4, 6}    {12, 14, 15, 16, 21, 23, 25}    {30, 31, 35, 38}
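This result can be cross-checked with scikit-learn by passing the given initial centers explicitly (a sketch, assuming scikit-learn is installed):

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30],
                dtype=float).reshape(-1, 1)
init = np.array([[2.0], [16.0], [38.0]])     # C1, C2, C3 from the problem

km = KMeans(n_clusters=3, init=init, n_init=1).fit(data)
print(km.labels_)                     # cluster index assigned to each value
print(km.cluster_centers_.ravel())    # approximately [3.75, 18.0, 33.5]
```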

Solved Problem 13.2


Apply k-means algorithm in given data for k = 2. Use C1(80) and C2(250) as initial cluster centers. Data:
234, 123, 456, 23, 34, 56, 78, 90, 150, 116, 117, 118, 199.

Solution:
We solve the numerical by following the calculations carried out in solved problem 13.1. The result is
presented in the following table.

C1(80)                                           C2(250)

m1 = 80                                          m2 = 250
{23, 34, 56, 78, 90, 116, 117, 118, 123, 150}    {199, 234, 456}
m1 = 90.5                                        m2 = 296.3
{23, 34, 56, 78, 90, 116, 117, 118, 123, 150}    {199, 234, 456}

Since the clusters do not change when the centers are recalculated, the algorithm terminates with these two clusters.

13.3.1.1.2 k-Means Solved Examples in Two-Dimensional Data


Solved Problem 13.3
Apply k-means clustering for the datasets given in Table 13.1 for two clusters. Tabulate all the assignments.
Table 13.1 Sample dataset for k-means clustering
Sample No. X Y

1 185 72
2 170 56
3 168 60
4 179 68
5 182 72
6 188 77


Solution:
Sample No. X Y Assignment

1 185 72 C1
2 170 56 C2

Centroid: C1 = (185, 72) and C2 = (170, 56)


First Iteration:
Distance from C1 is Euclidean distance between (185, 72) and (168, 60) = 20.808
Distance from C2 is Euclidean distance between (170, 56) and (168, 60) = 4.472
Since C2 is closer to (168, 60), the sample belongs to C2.

Sample No. X Y Assignment

1 185 72 C1
2 170 56 C2
3 168 60 C2
4 179 68
5 182 72
6 188 77

Similarly,
1. Distance from C1 for (179, 68) = 7.21
Distance from C2 for (179, 68) = 15
Since C1 is closer to (179, 68), the sample belongs to C1.
2. Distance from C1 for (182, 72) = 3
Distance from C2 for (182, 72) = 20
Since C1 is closer to (182, 72), the sample belongs to C1.
3. Distance from C1 for (188, 77) = 5.83
Distance from C2 for (188, 77) = 27.66
Since C1 is closer to (188, 77), the sample belongs to C1.

Sample No. X Y Assignment

1 185 72 C1
2 170 56 C2
3 168 60 C2
4 179 68 C1
5 182 72 C1
6 188 77 C1


The new centroid for C1 is

((185 + 179 + 182 + 188)/4, (72 + 68 + 72 + 77)/4) = (183.5, 72.25)

The new centroid for C2 is

((170 + 168)/2, (56 + 60)/2) = (169, 58)
Second Iteration:
Distance from C1 is the Euclidean distance between (183.5, 72.25) and (168, 60) = 19.76
Distance from C2 is the Euclidean distance between (169, 58) and (168, 60) = 2.24
Since C2 is closer to (168, 60), the sample belongs to C2.
Similarly,
1. Distance from C1 for (179, 68) = 6.19
Distance from C2 for (179, 68) = 14.14
Since C1 is closer to (179, 68), the sample belongs to C1.
2. Distance from C1 for (182, 72) = 1.52
Distance from C2 for (182, 72) = 19.10
Since C1 is closer to (182, 72), the sample belongs to C1.
3. Distance from C1 for (188, 77) = 6.54
Distance from C2 for (188, 77) = 26.87
Since C1 is closer to (188, 77), the sample belongs to C1.

Sample No. X Y Assignment

1 185 72 C1
2 170 56 C2
3 168 60 C2
4 179 68 C1
5 182 72 C1
6 188 77 C1

After the second iteration, the assignment has not changed and hence the algorithm is stopped and the
points are clustered.

13.3.2 k-Medoids
The k-medoids algorithm is a clustering algorithm very similar to the k-means algorithm. Both k-means and k-medoids are partitional and try to minimize the distance between points and their cluster center. In contrast to the k-means algorithm, k-medoids chooses actual data points as centers and uses the Manhattan distance to measure the distance between cluster centers and data points. This technique clusters the dataset of n objects into k clusters, where the number of clusters k is known in advance. It is more robust to noise and outliers than k-means because it minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances. A medoid is defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal.


The Manhattan distance between two vectors in an n-dimensional real vector space is given by
Eq. (13.2). It is used in computing the distance between a data point and its cluster center.
d_1(p, q) = \lVert p - q \rVert_1 = \sum_{i=1}^{n} |p_i - q_i|     (13.2)

The most common algorithm for k-medoid clustering is the Partitioning Around Medoids (PAM) algorithm. PAM uses a greedy search, which is faster than an exhaustive search but may not find the optimum solution. It works as follows:
1. Initialize: select k of the n data points as the medoids.
2. Associate each data point to the closest medoid.
3. While the cost of the configuration decreases: For each medoid m and for each non-medoid data point o:
• Swap m and o, recompute the cost (sum of distances of points to their medoid).
• If the total cost of the configuration increased in the previous step, undo the swap.
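A compact sketch of this procedure using the Manhattan distance of Eq. (13.2); this is an illustrative implementation, not a library routine. The helper evaluates the total cost of a candidate set of medoids, and the loop tries every medoid/non-medoid swap, keeping a swap only when it lowers the cost.

```python
import numpy as np
from itertools import product

def config_cost(X, medoid_idx):
    """Total Manhattan distance of every point to its nearest medoid."""
    d = np.abs(X[:, None, :] - X[medoid_idx][None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # step 1
    best = config_cost(X, medoids)
    improved = True
    while improved:                                             # step 3
        improved = False
        for m_pos, o in product(range(k), range(len(X))):
            if o in medoids:
                continue
            trial = medoids.copy()
            trial[m_pos] = o                   # swap medoid m with point o
            cost = config_cost(X, trial)
            if cost < best:                    # keep only improving swaps
                medoids, best, improved = trial, cost, True
    return medoids, best
```

The swap evaluations carried out by hand in Solved Problem 13.4 below are exactly the cost comparisons performed inside this loop.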

Solved Problem 13.4


Cluster the following dataset of 6 objects into two clusters, that is, k = 2.

X1 2 6
X2 3 4
X3 3 8
X4 4 2
X5 6 2
X6 6 4

Solution:
Step 1: Two observations c1 = X2 = (3, 4) and c2 = X6 = (6, 4) are randomly selected as medoids (cluster
centers).
Step 2: Manhattan distances are calculated to each center to associate each data object to its nearest medoid.

Data Object Distance To


Sample Point c1 = (3, 4 ) c2 = (6, 4 )

X1 (2, 6) 3 6
X2 (3, 4) 0 3
X3 (3, 8) 4 7
X4 (4, 2) 3 4
X5 (6, 2) 5 2
X6 (6, 4) 3 0
Cost 10 2


Step 3: We select one of the non-medoids, O′. Let us assume O′ = (6, 2), so that the candidate medoids are c1(3, 4) and O′(6, 2). Treating c1 and O′ as the new medoids, we calculate the total cost involved.

Data Object Distance To


Sample Point c1 = (3, 4 ) c2 = (6, 2 )

X1 (2, 6) 3 8
X2 (3, 4) 0 5
X3 (3, 8) 4 9
X4 (4, 2) 3 2
X5 (6, 2) 5 0
X6 (6, 4) 3 2
Cost 7 4

The total cost of this configuration is 7 + 4 = 11, which is less than the previous cost of 10 + 2 = 12. Since the cost decreases, this is considered a better cluster assignment and the swap is accepted; the medoids are now c1(3, 4) and c2(6, 2).
Step 4: We select another non-medoid, O′. Let us assume O′ = (4, 2), so that the candidate medoids are c1(3, 4) and O′(4, 2). Treating c1 and O′ as the new medoids, we calculate the total cost involved.

Data Object Distance To


Sample Point c1 = (3, 4 ) c2 = (4, 2 )

X1 (2, 6) 3 6
X2 (3, 4) 0 3
X3 (3, 8) 4 7
X4 (4, 2) 3 0
X5 (6, 2) 5 2
X6 (6, 4) 3 4
Cost 10 2

The total cost of this configuration is 10 + 2 = 12, which is more than the current best cost of 11. Since the cost increases, this cluster assignment is not considered and the swap is not done.
Thus, we try other non-medoid points to obtain the minimum cost; the assignment with minimum cost is considered the best. For some applications, k-medoids gives better results than k-means. The most time-consuming part of the k-medoids algorithm is the calculation of the distances between objects; the distance matrix can be computed in advance to speed up the process.
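The three cost comparisons above can be reproduced with a few lines of NumPy (a sketch; the candidate medoid pairs are the ones considered in the problem):

```python
import numpy as np

X = np.array([[2, 6], [3, 4], [3, 8], [4, 2], [6, 2], [6, 4]])   # X1..X6

def total_cost(medoids):
    """Sum of Manhattan distances of every point to its nearest medoid."""
    d = np.abs(X[:, None, :] - np.array(medoids)[None, :, :]).sum(axis=2)
    return int(d.min(axis=1).sum())

print(total_cost([(3, 4), (6, 4)]))   # initial medoids c1, c2    -> 12
print(total_cost([(3, 4), (6, 2)]))   # swap c2 with O' = (6, 2)  -> 11, accepted
print(total_cost([(3, 4), (4, 2)]))   # swap c2 with O' = (4, 2)  -> 12, rejected
```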


13.4 Hierarchical Methods


The hierarchical agglomerative clustering methods are most commonly used. The construction of a hierar-
chical agglomerative classification can be achieved by the following general algorithm.
1. Find the two closest objects and merge them into a cluster.
2. Find and merge the next two closest points, where a point is either an individual object or a cluster of
objects.
3. If more than one cluster remains, return to step 2.

13.4.1 Agglomerative Algorithms


Agglomerative algorithms follow a bottom-up strategy, treating each object initially as its own cluster and iteratively merging clusters until a single cluster is formed or a termination condition is satisfied. The merging is done by choosing the closest clusters first, according to some similarity measure. A dendrogram, which is a tree-like structure, is used to represent hierarchical clustering: individual objects are represented by leaf nodes, clusters by internal nodes, and the root represents the single cluster containing all objects. A representation of a dendrogram is shown in Fig. 13.3.

[Figure 13.3 Dendrogram: objects a to e are shown as leaves; merges occur at levels l = 0 to l = 4, plotted against a similarity scale running from 1.0 down to 0.0.]

13.4.1.1 Distance Measures


One of the major factors in clustering is the metric used to measure the distance between two clusters, where each cluster is generally a set of objects. The distance between two objects or points p and p′ is computed using Eqs. (13.3) to (13.6). Let Ci denote a cluster, ni the number of objects in Ci, and mi the mean of the objects in Ci. These measures are also known as linkage measures.
Minimum distance:

dist_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} \lVert p - p' \rVert     (13.3)

Maximum distance:

dist_{max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} \lVert p - p' \rVert     (13.4)

Mean distance:

dist_{mean}(C_i, C_j) = \lVert m_i - m_j \rVert     (13.5)


Average distance:

dist_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i,\, p' \in C_j} \lVert p - p' \rVert     (13.6)

When an algorithm uses the minimum distance, dist_min(Ci, Cj), to measure the distance between clusters, it is called a nearest-neighbor clustering algorithm. If the clustering process is terminated when the distance between the nearest clusters exceeds a user-defined threshold, it is called a single-linkage algorithm. An agglomerative hierarchical clustering algorithm that uses the minimum distance measure is also called a minimal spanning tree algorithm, since a spanning tree of a graph is a tree that connects all vertices and a minimal spanning tree is the one with the least sum of edge weights.
An algorithm that uses the maximum distance, dist_max(Ci, Cj), to measure the distance between clusters is called a farthest-neighbor clustering algorithm. If clustering is terminated when the maximum distance between the nearest clusters exceeds a user-defined threshold, it is called a complete-linkage algorithm.
The minimum and maximum measures tend to be sensitive to outliers or noisy data. Using the mean or average distance is a compromise that reduces this sensitivity. A further advantage of the average distance is that it can also handle categorical data.
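The four linkage measures of Eqs. (13.3)-(13.6) can be illustrated with a short NumPy sketch for two small clusters; the point values below are assumed purely for illustration.

```python
import numpy as np

Ci = np.array([[1.0, 1.0], [1.5, 1.5]])                 # cluster Ci
Cj = np.array([[4.0, 4.0], [5.0, 5.0], [4.5, 3.5]])     # cluster Cj

# Pairwise Euclidean distances between every p in Ci and p' in Cj
pairwise = np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=2)

dist_min  = pairwise.min()        # Eq. (13.3), used by single linkage
dist_max  = pairwise.max()        # Eq. (13.4), used by complete linkage
dist_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))   # Eq. (13.5)
dist_avg  = pairwise.mean()       # Eq. (13.6), used by average linkage
print(dist_min, dist_max, dist_mean, dist_avg)
```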
Algorithm: The agglomerative algorithm is carried out in three steps and the flowchart is shown in Fig. 13.4.
1. Convert object attributes to distance matrix.
2. Set each object as a cluster (thus, if we have N objects, we will have N clusters at the beginning).
3. Repeat until number of clusters is one.
• Merge two closest clusters.
• Update distance matrix.

[Figure 13.4 Flowchart of the agglomerative algorithm: select the objects and their measured features, compute the distance matrix, set each object as a cluster, then repeatedly merge the two closest clusters and update the distance matrix until only one cluster remains.]


13.4.1.2 Agglomerative Algorithm: Single Link


Single-nearest distance or single linkage is the agglomerative method that uses the distance between the
closest members of the two clusters.
Solved Problem 13.5
Find the clusters using single link technique. Use Euclidean distance and draw the dendrogram.

Sample No. X Y

P1 0.40 0.53
P2 0.22 0.38
P3 0.35 0.32
P4 0.26 0.19
P5 0.08 0.41
P6 0.45 0.30

Solution:
To compute the distance matrix:

d[(x, y), (a, b)] = \sqrt{(x - a)^2 + (y - b)^2}

Euclidean distance:

d(P1, P2) = \sqrt{(0.4 - 0.22)^2 + (0.53 - 0.38)^2}
          = \sqrt{(0.18)^2 + (0.15)^2}
          = \sqrt{0.0324 + 0.0225} = 0.23

The distance matrix is:


      P1    P2    P3    P4    P5    P6
P1    0
P2    0.23  0
P3                0
P4                      0
P5                            0
P6                                  0
Similarly,

d(P1, P3) = \sqrt{(0.4 - 0.35)^2 + (0.53 - 0.32)^2}
          = \sqrt{(0.05)^2 + (0.21)^2}
          = \sqrt{0.0025 + 0.0441} = 0.22


d(P1, P4) = \sqrt{(0.4 - 0.26)^2 + (0.53 - 0.19)^2} = \sqrt{(0.14)^2 + (0.34)^2} = \sqrt{0.0196 + 0.1156} = 0.37

d(P1, P5) = \sqrt{(0.4 - 0.08)^2 + (0.53 - 0.41)^2} = \sqrt{(0.32)^2 + (0.12)^2} = \sqrt{0.1024 + 0.0144} = 0.34

d(P1, P6) = \sqrt{(0.4 - 0.45)^2 + (0.53 - 0.30)^2} = \sqrt{(-0.05)^2 + (0.23)^2} = \sqrt{0.0025 + 0.0529} = 0.24

d(P2, P3) = \sqrt{(0.22 - 0.35)^2 + (0.38 - 0.32)^2} = \sqrt{(-0.13)^2 + (0.06)^2} = \sqrt{0.0169 + 0.0036} = 0.14

d(P2, P4) = \sqrt{(0.22 - 0.26)^2 + (0.38 - 0.19)^2} = \sqrt{(-0.04)^2 + (0.19)^2} = \sqrt{0.0016 + 0.0361} = 0.19

d(P2, P5) = \sqrt{(0.22 - 0.08)^2 + (0.38 - 0.41)^2} = \sqrt{(0.14)^2 + (-0.03)^2} = \sqrt{0.0196 + 0.0009} = 0.14

d(P2, P6) = \sqrt{(0.22 - 0.45)^2 + (0.38 - 0.30)^2} = \sqrt{(-0.23)^2 + (0.08)^2} = \sqrt{0.0529 + 0.0064} = 0.24

d(P3, P4) = \sqrt{(0.03)^2 + (0.13)^2} = \sqrt{0.0009 + 0.0169} = 0.13

d(P3, P5) = \sqrt{(0.35 - 0.08)^2 + (0.32 - 0.41)^2} = \sqrt{(0.27)^2 + (-0.09)^2} = \sqrt{0.0729 + 0.0081} = 0.28

d(P3, P6) = \sqrt{(0.35 - 0.45)^2 + (0.32 - 0.30)^2} = \sqrt{(-0.10)^2 + (0.02)^2} = \sqrt{0.0100 + 0.0004} = 0.10

d(P4, P5) = \sqrt{(0.07)^2 + (-0.22)^2} = \sqrt{0.0049 + 0.0484} = 0.23

d(P4, P6) = \sqrt{(0.26 - 0.45)^2 + (0.19 - 0.30)^2} = \sqrt{(-0.19)^2 + (-0.11)^2} = \sqrt{0.0361 + 0.0121} = 0.22

d(P5, P6) = \sqrt{(0.08 - 0.45)^2 + (0.41 - 0.30)^2} = \sqrt{(-0.37)^2 + (0.11)^2} = \sqrt{0.1369 + 0.0121} = 0.39

The distance matrix is:

      P1    P2    P3    P4    P5    P6
P1    0
P2    0.23  0
P3    0.22  0.14  0
P4    0.37  0.19  0.13  0
P5    0.34  0.14  0.28  0.23  0
P6    0.24  0.24  0.10  0.22  0.39  0


Finding the minimum element in the distance matrix, we see that the smallest distance is 0.10, between P3 and P6. Hence we merge P3 and P6 into one cluster and update the distance matrix using the single-link (minimum distance) rule:

min ((P3, P6), P1) = min((P3, P1), (P6, P1)) = min(0.22, 0.24) = 0.22
min ((P3, P6), P2) = min((P3, P2), (P6, P2)) = min(0.14, 0.24) = 0.14
min ((P3, P6), P4) = min((P3, P4), (P6, P4)) = min(0.13, 0.22) = 0.13
min ((P3, P6), P5) = min((P3, P5), (P6, P5)) = min(0.28, 0.39) = 0.28

          P1    P2    P3,P6  P4    P5
P1        0
P2        0.23  0
P3,P6     0.22  0.14  0
P4        0.37  0.19  0.13   0
P5        0.34  0.14  0.28   0.23  0
The minimum element in the updated matrix is 0.13, between the cluster (P3, P6) and P4. Hence we merge them into the cluster (P3, P6, P4) and update the distance matrix:

min (((P3, P6), P4), P1) = min(((P3, P6), P1), (P4, P1)) = min(0.22, 0.37) = 0.22
min (((P3, P6), P4), P2) = min(((P3, P6), P2), (P4, P2)) = min(0.14, 0.19) = 0.14
min (((P3, P6), P4), P5) = min(((P3, P6), P5), (P4, P5)) = min(0.28, 0.23) = 0.23


             P1    P2    P3,P6,P4  P5
P1           0
P2           0.23  0
P3,P6,P4     0.22  0.14  0
P5           0.34  0.14  0.23      0

The minimum element in the updated matrix is 0.14, between P2 and P5. Hence we merge P2 and P5 into one cluster and update the distance matrix:

min ((P2, P5), P1) = min((P2, P1), (P5, P1)) = min(0.23, 0.34) = 0.23
min ((P2, P5),(P3, P6, P4)) = min((P2, (P3, P6, P4)), (P5, (P3, P6, P4))) = min(0.14, 0.23) = 0.14

             P1    P2,P5  P3,P6,P4
P1           0
P2,P5        0.23  0
P3,P6,P4     0.22  0.14   0
The minimum element is now 0.14, between the clusters (P2, P5) and (P3, P6, P4). Hence we merge them into the single cluster (P2, P5, P3, P6, P4) and update the distance matrix:

min ((P2, P5, P3, P6, P4), P1) = min((P2, P5), P1), ((P3, P6, P4), P1)) = min(0.23, 0.22) = 0.22

                      P1    P2,P5,P3,P6,P4
P1                    0
P2,P5,P3,P6,P4        0.22  0


The dendrogram can now be drawn as shown in Fig. 13.5.

[Figure 13.5 Dendrogram of the clusters formed, with leaves in the order P3, P6, P4, P2, P5, P1: P3 and P6 merge first (at 0.10), P4 joins them (at 0.13), P2 and P5 merge (at 0.14), the two groups merge (at 0.14), and finally P1 joins (at 0.22).]
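The merge sequence above can be cross-checked with SciPy (a sketch, assuming SciPy is installed). The condensed form of the distance matrix computed above is fed directly to linkage with method='single':

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Condensed distance matrix from the table above, in the order
# (P1,P2), (P1,P3), ..., (P1,P6), (P2,P3), ..., (P5,P6)
d = [0.23, 0.22, 0.37, 0.34, 0.24,
     0.14, 0.19, 0.14, 0.24,
     0.13, 0.28, 0.10,
     0.23, 0.22,
     0.39]

Z = linkage(d, method='single')   # single-link agglomerative clustering
print(Z)   # first merge is P3 with P6; merge distances 0.10, 0.13, 0.14, 0.14, 0.22

# To draw the dendrogram of Fig. 13.5 (requires matplotlib):
# dendrogram(Z, labels=['P1', 'P2', 'P3', 'P4', 'P5', 'P6'])
```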

13.4.1.3 Agglomerative Algorithm: Complete Link


Complete farthest distance or complete linkage is the agglomerative method that uses the distance between
the members that are farthest apart.

Solved Problem 13.6


For the given set of points (the same six two-dimensional points P1 to P6 used in Solved Problem 13.7), identify clusters using complete link agglomerative clustering.

Solution:
To compute the distance matrix:

d[(x, y), (a, b)] = \sqrt{(x - a)^2 + (y - b)^2}

The Euclidean distance is:

d(P1, P2) = \sqrt{(1.0 - 1.5)^2 + (1.0 - 1.5)^2} = \sqrt{0.25 + 0.25} = \sqrt{0.5} = 0.71

The distance matrix is:

      P1    P2    P3    P4    P5    P6
P1    0
P2    0.71  0
P3    5.66  4.95  0
P4    3.6   2.92  2.24  0
P5    4.24  3.53  1.41  1.0   0
P6    3.20  2.5   2.5   0.5   1.12  0

The minimum element in the distance matrix is 0.5, between P4 and P6, and hence we combine P4 and P6. The distance matrix is updated using the complete-link (maximum distance) rule, for example

dist((P4, P6), P1) = max(d(P4, P1), d(P6, P1)) = max(3.6, 3.2) = 3.6

          P1    P2    P3    P4,P6  P5
P1        0
P2        0.71  0
P3        5.66  4.95  0
P4,P6     3.6   2.92  2.5   0
P5        4.24  3.53  1.41  1.12   0

The next minimum is 0.71, between P1 and P2, so we merge them and update the matrix:

          P1,P2  P3    P4,P6  P5
P1,P2     0
P3        5.66   0
P4,P6     3.6    2.5   0
P5        4.24   1.41  1.12   0

The next minimum is 1.12, between the cluster (P4, P6) and P5, so we merge them and update the matrix:

             P1,P2  P3    P4,P6,P5
P1,P2        0
P3           5.66   0
P4,P6,P5     4.24   2.5   0

The next minimum is 2.5, between the cluster (P4, P6, P5) and P3, so we merge them, leaving two clusters:

                  P1,P2  P4,P6,P5,P3
P1,P2             0
P4,P6,P5,P3       5.66   0


The final cluster formed can now be drawn as shown in Fig. 13.6.

[Figure 13.6 Clusters formed after merging all data points: the two final clusters are (P1, P2) and (P3, P4, P5, P6).]

13.4.1.4 Agglomerative Algorithm: Average Link


Average distance or average linkage is the method that looks at the distances between all pairs of points in the two clusters and averages these distances. This is also called Unweighted Pair Group Mean Averaging (UPGMA).

Solved Problem 13.7


For the given set of points, identify clusters using average link agglomerative clustering.

A B

P1 1 1
P2 1.5 1.5
P3 5 5
P4 3 4
P5 4 4
P6 3 3.5

Solution:
The distance matrix is:

      P1    P2    P3    P4    P5    P6
P1    0
P2    0.71  0
P3    5.66  4.95  0
P4    3.6   2.92  2.24  0
P5    4.24  3.53  1.41  1.0   0
P6    3.20  2.5   2.5   0.5   1.12  0


The minimum element in the distance matrix is 0.5, between P4 and P6, and hence we combine P4 and P6. The distance matrix is updated using the average-link rule of Eq. (13.6), for example

dist((P4, P6), P1) = average(d(P4, P1), d(P6, P1)) = average(3.6, 3.2) = 3.4

          P1    P2    P3    P4,P6  P5
P1        0
P2        0.71  0
P3        5.66  4.95  0
P4,P6     3.4   2.71  2.37  0
P5        4.24  3.53  1.41  1.06   0

The next minimum is 0.71, between P1 and P2, so we merge them and update the matrix with average distances:

          P1,P2  P3    P4,P6  P5
P1,P2     0
P3        5.31   0
P4,P6     3.06   2.37  0
P5        3.89   1.41  1.06   0
The next minimum is 1.06, between the cluster (P4, P6) and P5, so we merge them and update the matrix:

             P1,P2  P3    P4,P6,P5
P1,P2        0
P3           5.31   0
P4,P6,P5     3.33   2.05  0
The next minimum is 2.05, between the cluster (P4, P6, P5) and P3, so we merge them, leaving two clusters:

                  P1,P2  P4,P6,P5,P3
P1,P2             0
P4,P6,P5,P3       3.83   0


The final cluster formed can now be drawn as shown in Fig. 13.7.

[Figure 13.7 The final clusters formed after merging all data points: (P1, P2) and (P3, P4, P5, P6).]
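Both the complete-link result of Solved Problem 13.6 and the average-link result above can be cross-checked with SciPy (a sketch, assuming SciPy is installed), using the six points from the table in Solved Problem 13.7:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# P1..P6 from the table in Solved Problem 13.7
points = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0],
                   [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])

for method in ('complete', 'average'):
    Z = linkage(points, method=method)        # pairwise Euclidean by default
    # Cut the tree into two clusters, as in the worked examples
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(method, labels)   # P1, P2 fall in one cluster; P3..P6 in the other
```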

Case Study
There are diverse applications of clustering. As discussed in previous sections, clustering is a setting in which the output is not known; in other words, the training dataset has only input data samples. Based on some metric of similarity, similar data items are grouped together in one cluster. Let us now see some of the major areas in which this concept is extensively used.

Case Study 1: Grouping of Similar Companies Based on Wikipedia Articles
The k-means clustering algorithm is used to segment S&P (500 index) listed companies based on the text
of Wikipedia articles about each one. Initially, the data is taken and preprocessed. This data is nothing
but articles from Wikipedia about the companies. In the preprocessing phase, the Wikipedia formatting is
removed, all texts are converted to lowercase, and non-alphanumeric characters are removed.
Around 500 companies were taken as input data. From this input data, a feature hashing module tokenizes each text string and transforms the data into a series of numbers based on the hash value of each token. No linguistic analysis is performed in this step. Internally, the feature hashing module creates a dictionary of n-grams. After the dictionary has been built, the module converts the dictionary terms into hash values and then computes whether each feature was used in each case. For each row of text data, the module outputs a set of columns, one column for each hashed feature. The dimensionality is then reduced using PCA: based on the highest variance, the first one or two columns of the transformed matrix are selected, and clusters are built using the transformed dataset.
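A rough scikit-learn sketch of this kind of pipeline; the placeholder texts, component choices, and parameter values here are illustrative assumptions, not the exact setup described above.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Placeholder article texts; in the case study these would be the
# preprocessed Wikipedia articles about the S&P 500 companies.
articles = [
    "bank financial services loans", "bank credit cards payments",
    "oil gas exploration drilling", "oil refining petroleum energy",
    "software cloud computing services", "software enterprise applications",
]

pipeline = make_pipeline(
    HashingVectorizer(n_features=2**12, ngram_range=(1, 2)),  # feature hashing
    TruncatedSVD(n_components=2),    # keep the directions of highest variance
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(articles)
print(labels)    # companies with similar article text share a cluster label
```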

Case Study 2: Telecom Cluster Analysis


We know that a number of telecom companies attract customers with a variety of packages. In the telecom sector, not every customer has similar needs, and the company needs to strategize accordingly to attract all of them. With customer segmentation, both the company and the customers can end up in a win-win scenario. Taking a sample of customers, and based on their national and international call durations, clustering techniques are used to group the customers into two main categories.


Let us take a small sample size of eight customers. Based on their duration of national and international
calls, a scatter plot is drawn as shown below.

[Scatter plot: average local call duration (vertical axis, 0 to 5) versus average international call duration (horizontal axis, 0 to 7) for the eight sample customers.]

Using the Euclidean distance metric to compute the centroids, the final clusters formed are shown in the figure below.

[Scatter plot: the same eight customers grouped into two clusters on the axes of average local call duration versus average international call duration.]

Based on the proximity of each customer group to the centroid of its cluster, customer plans can be designed so that both the company and the customers benefit from the chosen plan.
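A small sketch of this segmentation, assuming eight customers with illustrative (made-up) average call durations in minutes and scikit-learn available:

```python
import numpy as np
from sklearn.cluster import KMeans

# Columns: average international call duration, average local call duration
# (illustrative values for eight customers, not taken from the figures)
calls = np.array([[0.5, 4.5], [0.8, 4.0], [1.0, 4.8], [1.2, 3.9],
                  [5.5, 1.0], [6.0, 1.5], [6.5, 0.8], [5.8, 1.2]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(calls)
print(km.labels_)            # 0/1 segment for each customer
print(km.cluster_centers_)   # centroids used to position the two plans
```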

Summary

• Clustering is the process of grouping together data objects into multiple sets or clusters, so that objects within a cluster have high similarity when compared to objects outside of it.
• Similarity is measured by distance metrics, the most common among them being the Euclidean distance metric.
• Clustering is also called data segmentation because clustering partitions large datasets into groups according to their similarity.
• Clustering is known as unsupervised learning because the class label information is not present.


• The applications of clustering are varied and include business intelligence, pattern recognition, image processing, biometrics, web technology, search engines, and text mining.
• The requirements of clustering depend on its scalability, the number and types of attributes to be clustered, the shape of the clusters to be identified, efficiency in handling noisy data and incremental data points added to existing clusters, handling high-dimensional data, and data with constraints.
• The basic types of clustering are hard clustering and soft clustering, depending on whether the data points belong to only one cluster or whether they can be shared among clusters.
• Clustering algorithms are classified into partitioning, hierarchical, density-based, and grid-based clustering.
• Partitioning-based clustering algorithms are distance based. k-means and k-medoids are popular partition-based clustering algorithms. The number of clusters to be formed is initially specified.
• The result of hierarchical clustering is a tree-based representation of the objects, which is also known as a dendrogram.
• Density-based clustering algorithms find nonlinear shaped clusters based on density. Density-based spatial clustering of applications with noise (DBSCAN) is the most widely used density-based algorithm. It uses the concepts of density reachability and density connectivity.
• The grid-based clustering approach differs from conventional clustering algorithms in that it is concerned not with the data points but with the value space that surrounds the data points.

Multiple-Choice Questions

1. Which of the following statements is true?
(a) Assignment of observations to clusters does not change between successive iterations in k-means.
(b) Assignment of observations to clusters changes between successive iterations in k-means.
(c) Assignment of observations to clusters always decreases between successive iterations in k-means.
(d) Assignment of observations to clusters always increases between successive iterations in k-means.

2. Which of the following can act as possible termination conditions in k-means?
i. For a fixed number of iterations.
ii. Assignment of observations to clusters does not change between iterations, except for cases with a bad local minimum.
iii. Centroids do not change between successive iterations.
iv. Terminate when RSS falls below a threshold.
(a) i, iii, and iv
(b) i, ii, and iii
(c) i, ii, and iv
(d) All of the above

3. Which of the following algorithms is most sensitive to outliers?
(a) k-means clustering algorithm
(b) k-medians clustering algorithm
(c) k-modes clustering algorithm
(d) k-medoids clustering algorithm

4. Which of the following is finally produced by hierarchical clustering?
(a) Final estimate of cluster centroids
(b) Tree showing how close things are to each other
(c) Assignment of each point to clusters
(d) All of the above


5. What is the best choice for the number of clusters based on the following graph?
[Graph: within-groups sum of squares (roughly 1000 to 4000) plotted against the number of clusters (2 to 14), showing an elbow.]
(a) 5
(b) 6
(c) 14
(d) None of the above

Very Short Answer Questions

1. Give an example of an application for the k-means clustering algorithm. Explain in brief.
2. Explain the different distance measures used for clustering.
3. Using k-means clustering, cluster the following data into two clusters. Show each step.
   {2, 4, 10, 12, 3, 20, 30, 11, 25}
4. Compare between single link, complete link, and average link based on distance formula.
5. Draw the flowchart of the k-means algorithm.

Short Answer Questions

1. Compute the distance matrix for the x-y coordinates given in the following table.

Point    x coordinate    y coordinate
p1       0.4005          0.5306
p2       0.2148          0.3854
p3       0.3457          0.3156
p4       0.2652          0.1875
p5       0.0789          0.4139
p6       0.4548          0.3022

2. How will you define the number of clusters in the k-means clustering algorithm?
3. Use the k-means algorithm to cluster the following dataset consisting of the scores of two variables on each of seven individuals.

Subject    A      B
1          1.0    1.0
2          1.5    2.0
3          3.0    4.0
4          5.0    7.0
5          3.5    5.0
6          4.5    5.0
7          3.5    4.5


4. Use the k-means algorithm and Euclidean distance to cluster the following eight examples into three clusters: A1 = (2, 10), A2 = (2, 5), A3 = (8, 4), A4 = (5, 8), A5 = (7, 5), A6 = (6, 4), A7 = (1, 2), A8 = (4, 9). The distance matrix based on the Euclidean distance is given in the following table.

      A1   A2   A3   A4   A5   A6   A7   A8
A1    0    25   36   13   50   52   65   5
A2         0    37   18   25   17   10   20
A3              0    25   2    2    53   41
A4                   0    13   17   52   2
A5                        0    2    45   25
A6                             0    29   29
A7                                  0    58
A8                                       0

Suppose the initial seeds (centers of each cluster) are A1, A4, and A7. Run the k-means algorithm for 1 epoch only. At the end of this epoch show:
(a) The new clusters (that is, the examples belonging to each cluster).
(b) The centers of the new clusters.

5. Use single and complete link agglomerative clustering to group the data given in the following distance matrix. Show the dendrograms.

     A    B    C    D
A    0    1    4    5
B         0    2    6
C              0    3
D                   0

Review Questions

1. Use the k-means algorithm to create three clusters for the given set of values: {2, 3, 6, 8, 9, 12, 15, 18, 22}.
2. Apply the agglomerative clustering algorithm on the given data and draw the dendrogram. Show three clusters with their allocated points by using the single link method.

     a    b    c    d    e    f
a    0    2    10   17   5    20
b    2    0    8    3    1    18
c    10   8    0    5    5    2
d    17   3    5    0    2    3
e    5    1    5    2    0    13
f    20   18   2    3    13   0

3. Apply complete link agglomerative clustering techniques on the given data to find the prominent clusters.

     P1    P2    P3    P4    P5    P6
P1   0     0.23  0.22  0.37  0.34  0.24
P2   0.23  0     0.14  0.19  0.14  0.24
P3   0.22  0.14  0     0.13  0.28  0.10
P4   0.37  0.19  0.13  0     0.23  0.22
P5   0.34  0.14  0.28  0.23  0     0.39
P6   0.24  0.24  0.10  0.22  0.39  0

4. Explain the expectation-maximization algorithm.
5. What are the requirements for clustering?
6. What are the applications of clustering?


Answers

Multiple-Choice Answers
1. (a) 2. (d) 3. (a) 4. (b) 5. (b)
