7 Clustering examples
7.1 Introduction
Clustering is one of the most commonly used exploratory data analysis methods. Across disciplines, from the social sciences to biology to computer science, people try to obtain an initial sense of their data by defining meaningful categories among the data points. For instance, retailers cluster customers on the basis of their customer profiles for targeted marketing; computational biologists cluster genes on the basis of similarities in their expression across different experiments; and astronomers cluster stars on the basis of their spatial proximity. The first question to be answered is, of course, what is clustering? Clustering is the process of grouping a collection of objects so that similar objects end up in the same group and dissimilar objects are separated into different groups. This definition is obviously rather imprecise and perhaps vague. Yet, it is not easy
to find a more accurate definition. There are several reasons for this. One fundamental prob-
lem is that in many cases the two objectives stated in the previous statement contradict one
another. Mathematically speaking, similarity (or proximity) is not a transitive relationship,
whereas cluster sharing is a relationship of equivalence, and particularly a transitive relation-
ship. More specifically, there can be a long series of objects, x1, . . . , xm, where each xi is very
similar to its two neighbors, xi–1 and xi+1, but x1 and xm are very dissimilar. If we want to make
sure that two elements share the same cluster whenever they are similar, then we have to place all the sequence elements in the same cluster. In that case, however, we end up placing dissimilar elements (x1 and xm) in the same cluster, violating the second criterion. Consider, for instance, an input consisting of points arranged along two parallel lines. A clustering algorithm that emphasizes not separating nearby points clusters this input by dividing it horizontally into the two lines, whereas a clustering approach that emphasizes keeping distant points apart clusters the same input by dividing it vertically (Shalev-Shwartz & Ben-David, 2014).
Another fundamental problem for clustering is the lack of “ground truth,” which is a com-
mon problem with unsupervised learning. We’ve been dealing primarily with supervised
learning in the book so far (e.g., the problem of learning a classifier from labeled training data). The purpose of supervised learning is simple: we want to train a classifier to predict the labels of future examples as accurately as possible. In addition, by estimating the empirical loss on the labeled training data, a supervised learner can estimate the success, or the risk, of its hypotheses. Clustering, on the other hand, is an unsupervised learning problem;
namely, we are not trying to predict any labels. Rather we want some practical way to orga-
nize the data. Hence there is no straightforward clustering performance assessment method.
In addition, it is not clear what the “correct” clustering for that data is or how to assess a
proposed clustering, even on the basis of full knowledge of the underlying data distribution
(Shalev-Shwartz & Ben-David, 2014).
7.2 Clustering
Clustering is the process of grouping related objects together. There are two types of inputs we can use. In similarity-based clustering, the input to the algorithm is an N × N dissimilarity (or distance) matrix D. In feature-based clustering, the input to the algorithm is an N × D feature matrix (or design matrix) X. Similarity-based clustering has
the advantage of allowing domain-specific similarity or kernel functions to be conveniently
included. The benefit of feature-based clustering is that it applies to “raw” data, which is
potentially noisy. Besides the two input types, there are two potential output types: flat clus-
tering, also called partition clustering, where we divide the objects into disjoint sets, and
hierarchical clustering, where a nested partition tree is formed (Murphy, 2012).
A dissimilarity matrix D is a matrix in which di,i = 0 and di,j ≥ 0 is a "distance" measure between objects i and j. In the strict sense, subjectively determined dissimilarities are seldom true distances, as the triangle inequality, di,j ≤ di,k + dj,k, often does not hold. Some algorithms require D to be a true distance matrix, but others do not. If we have a similarity matrix S, we can convert it to a dissimilarity matrix by applying any monotonically decreasing function, for example, D = max(S) − S. The most common way of describing object dissimilarity is in terms of the dissimilarity of their attributes. The squared (Euclidean) distance, city block distance, correlation coefficient, and Hamming distance are some common attribute dissimilarity functions (Murphy, 2012).
In the k-means clustering algorithm, k initial points are selected to represent the initial cluster centers, all data points are allocated to the closest center, the mean value of the points in each cluster is calculated to form its new cluster center, and the iteration continues until the cluster assignments no longer change. This procedure only works when the number of clusters is known beforehand, so this section also describes what can be done when it is not. First, we
look at strategies for “agglomeration” to construct a hierarchical clustering structure—that
is, beginning with individual instances and merging them successively into clusters. Then we look at a method that works incrementally; that is, each new instance is processed as it arrives. Finally, we investigate a statistical method of clustering based on a mixture model with several probability distributions, one for each cluster. It does not separate instances
into disjoint clusters, as does k-means, but rather assigns instances probabilistically to classes
(Witten, Frank, Hall, & Pal, 2016).
Clustering is one of human beings’ most rudimentary mental practices, used to accommo-
date the enormous amount of information we obtain each day. It would be difficult to handle
each piece of information as a single entity. Therefore, human beings appear to categorize entities (i.e., objects, individuals, events) into clusters. Each cluster is then characterized by the common attributes of the entities it contains. We must presume, as in the case of supervised learning, that all patterns are described in terms of features that form multidimensional feature vectors. The basic steps to be taken by an expert to establish a clustering task are as follows:
• Feature selection: Features should be chosen properly so as to encode as much information as possible about the task of interest. Once again, a major goal is parsimony and thus minimal redundancy of information among the features. As in supervised classification, preprocessing of the features may be needed before they are used in subsequent stages.
• Proximity measure: This measure quantifies how similar or dissimilar two feature vectors are. It is necessary to ensure that all selected features contribute equally to the proximity measure calculation and that no feature dominates the others. This should be taken care of during preprocessing.
• Clustering criterion: This criterion depends on the interpretation the expert gives to a "sensible" clustering on the basis of the type of clusters expected to underlie the dataset. The clustering criterion may be expressed through a cost function or some other type of rules.
• Clustering algorithms: This step refers to the selection of a particular algorithmic scheme that unravels the clustering structure of the dataset, having adopted a proximity measure and a clustering criterion.
• Validation of the results: Once the results have been obtained from the clustering algorithm, their correctness is checked, usually with suitable quantitative measures.
• Interpretation of the results: In many cases, the application expert must combine the clustering findings with other experimental evidence and analysis in order to draw the correct conclusions.
In a number of cases, a stage known as clustering tendency assessment should also be involved. It includes various tests that determine whether or not the available data possess a clustering structure. For instance, the dataset may be entirely random in nature, so it would be pointless to
try to unravel clusters. Different feature choices, proximity measures, clustering criteria, and
clustering algorithms might result in completely different clustering results (Theodoridis, Pi-
krakis, Koutroumbas, & Cavouras, 2010).
The major categories of clustering algorithms include the following:
• Hierarchical clustering algorithms: These methods are further divided into two subcategories.
• Agglomerative algorithms: These algorithms produce a sequence of clusterings with a decreasing number of clusters at each stage; the clustering at each stage results from the previous one by merging two clusters into one. Single-link and complete-link algorithms are the key representatives of the agglomerative algorithms and are suited to recovering elongated and compact clusters, respectively.
• Divisive algorithms: These algorithms work in the opposite direction; that is, they produce a sequence of clusterings with an increasing number of clusters at each stage. The clustering at each stage results from the previous one by splitting a single cluster into two.
• Clustering algorithms based on cost function optimization: This group includes algorithms in which a cost function, J, quantifies how "sensible" a given clustering is. The number of clusters, m, is usually kept fixed. Most of these algorithms use differential calculus concepts to optimize J and terminate when a local optimum of J is reached. Algorithms of this category are also called iterative function optimization schemes. The following subcategories are included in this category:
• Hard or crisp clustering algorithms are when a vector belongs exclusively to a particular cluster. The assignment of the vectors to individual clusters is carried out optimally according to the adopted optimality criterion. The Isodata (or Lloyd) algorithm is the most popular algorithm in this group.
• Probabilistic clustering algorithms are a special type of hard clustering algorithms that adopt Bayesian classification arguments; each vector x is assigned to the cluster Ci for which the a posteriori probability P(Ci|x) is maximum. These probabilities are estimated through an appropriately defined optimization process.
• Fuzzy clustering algorithms are when a vector belongs to a particular cluster up to a certain degree.
• Possibilistic clustering algorithms are when we measure the possibility that a vector x belongs to a cluster Ci.
• Boundary detection algorithms are when, instead of determining the clusters themselves by means of the feature vectors, the algorithms iteratively update the boundaries of the regions where clusters lie. Although these algorithms stem from cost function optimization, they differ from the algorithms described previously (Theodoridis et al., 2010).
Apart from these clustering algorithms, branch and bound clustering algorithms, genetic
clustering algorithms, stochastic relaxation methods, valley-seeking clustering algorithms,
competitive learning algorithms, morphological transformation technique–based algorithms,
density-based algorithms, subspace clustering algorithms, and kernel-based methods are
also types of clustering algorithms (Theodoridis et al., 2010).
7.3 The k-means clustering algorithm
K-means clustering begins with the definition of a cost function over a parameterized set of possible clusterings, and the objective of the clustering algorithm is to find a partitioning (clustering) of minimum cost. Under this model, the clustering task is turned into an optimization problem. The objective function is a function from pairs of an input, (X, d),
and a suggested clustering solution C = (C1, . . .,Ck) to positive real numbers. The target of a
clustering algorithm is described as finding, for a given input (X, d), a clustering C so that G((X, d), C) is minimized, where G denotes such an objective function. To achieve this goal, a suitable search algorithm must be utilized. In this sense, the term k-means clustering usually refers to a specific, commonly used approximation algorithm rather than to the cost function or to the exact solution of the corresponding minimization problem. Most common objective functions include as
a parameter the number of clusters, k. In practice, it is often up to the clustering algorithm
user to choose the parameter k that is best suited to the clustering problem. Some of the most
common objective functions are defined in the following. The k-means objective function is one of the most common objectives in clustering. It measures the sum of squared distances from each point in X to the centroid of its cluster. The k-means objective function is relevant, for instance, in digital communication tasks, where the members of X can be interpreted as a collection of signals to be transmitted. In practical clustering applications, the k-means objective function is quite common. But it turns out that it is often computationally infeasible to find the optimal solution for the k-means objective. Instead, a simple iterative algorithm is often
used, so the term k-means clustering in many cases refers to the outcome of this algorithm
rather than the clustering that minimizes the objective cost of k-means (Shalev-Shwartz &
Ben-David, 2014).
Example 7.1
The following Python code utilizes k-means clustering to find the center of the clusters of breast
cancer data by using the scikit-learn library APIs. In this example, the breast cancer dataset that
exists in sklearn.datasets is utilized. Scatter plot is presented to show the effectiveness of the al-
gorithm. The cluster centers are plotted in a scatter plot. Note that this example is adapted from
scikit-learn.
# ======================================================================
# K-means clustering example
# ======================================================================
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
#%%
# ######################################################################
# Import some data to play with
Breast_Cancer = datasets.load_breast_cancer()
X = Breast_Cancer.data
y = Breast_Cancer.target
#%%
# ######################################################################
# Fit k-means with two clusters (malignant/benign) and plot the centers
kmeans = KMeans(n_clusters = 2, random_state = 0).fit(X)
plt.scatter(X[:, 0], X[:, 1], c = kmeans.labels_, s = 20, cmap = 'viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c = 'red', s = 200, marker = 'x', label = 'Cluster centers')
plt.xlabel(Breast_Cancer.feature_names[0])
plt.ylabel(Breast_Cancer.feature_names[1])
plt.legend()
plt.show()
7.4 The k-medoids clustering algorithm
Each cluster is represented by the mean of its vectors in the k-means algorithm, but the
cluster is represented by a vector selected among the elements of X in the k-medoids meth-
ods, and we will refer to it as the medoid. In addition to their medoid, each cluster includes
all vectors in X that (1) are not employed as medoids in other clusters and (2) are closer to
their medoid than those representing the other clusters. There are two benefits over the k-
means algorithm to represent clusters using medoids. First, it can be utilized for datasets
originating from either continuous or discrete domains, while k-means is only suitable for
continuous domains since the mean of a subset of data vectors is not essentially a point ly-
ing in the domain for a discrete domain context. Second, k-medoids algorithms appear to be
less sensitive than k-means algorithms to outliers. It should be remembered, however, that
a cluster’s mean has a strong geometric and statistical meaning that is not necessarily true
with medoids. Moreover, the algorithms for determining the best set of medoids require more computational power than the k-means algorithm. PAM (partitioning around
medoids), CLARA (clustering large applications), and CLARANS (clustering large applica-
tions based on randomized search) are the best-known k-medoids algorithms. Note that the last two algorithms are inspired by PAM but are more effective than PAM in
handling large datasets (Theodoridis et al., 2010).
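As a rough illustration of the medoid idea (a sketch of the medoid computation for a single cluster only, not the PAM algorithm itself), the following lines pick the cluster member that minimizes the total distance to all other members of that cluster.
# ======================================================================
# Computing the medoid of a cluster (illustrative sketch)
# ======================================================================
import numpy as np
rng = np.random.RandomState(1)
cluster = rng.randn(20, 2) + [3, 3]     # points assigned to one cluster
# Pairwise Euclidean distances between the cluster members
dists = np.sqrt(((cluster[:, None, :] - cluster[None, :, :])**2).sum(axis = -1))
# The medoid is the member with the smallest total distance to the others
medoid_index = dists.sum(axis = 1).argmin()
medoid = cluster[medoid_index]
print(medoid_index, medoid)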
Example 7.2
The following Python code utilizes k-medoids clustering to find the center of the clusters of
synthetic data and Mall_Customers data (https://www.kaggle.com/akram24/mall-customers)
by using the KMedoids clustering function. A scatter plot is presented to show the effectiveness of the algorithm, and the cluster centers are plotted in the scatter plot.
# ======================================================================
# K-medoids clustering example
# ======================================================================
from k_medoids import KMedoids
import numpy as np
import matplotlib.pyplot as plt
#Define a distance utility function
def example_distance_func(data1, data2):
    """Euclidean distance between two data points"""
    return np.sqrt(np.sum((data1 - data2)**2))
#%%
# K-Medoids Clustering using synthetic data with 3 clusters
from sklearn.datasets import make_blobs
# ######################################################################
# Generate sample data
np.random.seed(0)
batch_size = 45
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X, labels_true = make_blobs(n_samples = 300, centers = centers,
                            cluster_std = 0.7)
7.5 Hierarchical clustering
Hierarchical clustering algorithms produce a hierarchy of nested clusterings, which can be visualized by means of a proximity (similarity) dendrogram. This dendrogram can be utilized at any stage as an indicator of natural or forced cluster formation. Similarly, a suitable level for cutting the dendrogram related to the resulting hierarchy must be determined (Theodoridis et al., 2010).
Example 7.3
The following Python code utilizes agglomerative clustering to cluster the customers as Careful,
Standard, Target, Careless, and Sensible using Mall_Customers data (https://www.kaggle.com/
akram24/mall-customers) and standard scikit-learn library APIs. Customers’ dendrogram is plot-
ted against the Euclidean distance. In addition, the clusters are plotted in a scatter plot to show five
different customer groups. Note that this example is adapted from the web page (https://www.
kdnuggets.com/2019/09/hierarchical-clustering.html).
# ======================================================================
# Agglomerative clustering example
# ======================================================================
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Mall_Customers.csv')
#%%
" " "Out of all the features, CustomerID and Genre are irrelevant fields
and can be dropped and create a matrix of independent variables by select
only Age and Annual Income." " "
X = dataset.iloc[:, [3, 4]].values
import scipy.cluster.hierarchy as sch
dendrogrm = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distance')
plt.show()
#%%
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters = 5, affinity = 'euclidean',
linkage = 'ward')
y_hc = hc.fit_predict(X)
# Visualising the clusters
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 50, c = 'red', label =
'Careful')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 50, c = 'blue', label =
'Standard')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 50, c = 'green', label =
'Target')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s = 50, c = 'cyan', label = 'Careless')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s = 50, c = 'magenta', label = 'Sensible')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
Example 7.4
The following Python code utilizes agglomerative clustering to group the customers using Mall_
Customers data (https://www.kaggle.com/akram24/mall-customers) and standard scikit-learn li-
brary APIs. Scatter plot is presented to show the effectiveness of the algorithm. In this example, we
present the effect of imposing a connectivity graph to capture local structure in the customer data. It
is possible to see two implications of imposing a connectivity. First, clustering is much quicker with a connectivity matrix. Second, when using a connectivity matrix, single, average, and complete linkage are unstable and tend to create a few clusters that grow very fast. Without the constraint, average and complete linkage counteract this percolation behavior by considering all the distances between two clusters when merging them (whereas single linkage exaggerates the behavior by considering only the shortest distance between clusters). The connectivity graph breaks this mechanism for average and complete linkage, making them resemble the more fragile single linkage. Having a very small number of neighbors in the graph imposes a geometry similar to that of single linkage, which is well known for having this percolation instability. This is presented in this example. Note that this example is adapted
from scikit-learn.
# Create a graph capturing local connectivity. A larger number of neighbors
# gives more evenly distributed cluster sizes, but may not impose the local
# manifold structure of the data
from sklearn.neighbors import kneighbors_graph
knn_graph = kneighbors_graph(X, 30, include_self = False)
In divisive hierarchical clustering it is computationally prohibitive to consider all possible cluster partitions, so the burden is reduced by ruling out, under a preset criterion, several partitions as not reasonable. In the algorithms discussed so far, the division of a cluster is based on all the features (coordinates) of the feature vectors; these are known as polythetic algorithms. On the other hand, there are divisive algorithms that at each stage achieve a cluster division based on a single feature; these are known as monothetic algorithms (Theodoridis et al., 2010).
Divisive clustering begins with all the data in a single cluster and then, in a top-down
manner, splits each cluster into two daughter clusters. Since there are 2^(N−1) − 1 ways to divide a group of N items into two groups, it is hard to compute the optimal split, hence several
heuristics are utilized. One approach is to pick the largest diameter cluster and divide it into
two using the k-means or k-medoids algorithm with K = 2. This is known as the bisecting k-
means algorithm (Steinbach, Karypis, & Kumar, 2000). We can repeat this until we have any
number of clusters desired. This can be utilized as an alternative to standard k-means, but a
hierarchical clustering is also induced. Another strategy is to construct from the dissimilar-
ity graph a minimum spanning tree and then make new clusters by breaking the connection
related to the largest dissimilarity. Divisive clustering is less common than agglomerative clustering, but it has two benefits. First, it can be quicker, because it takes only O(N) time if we split for a constant number of levels. Second, the splitting decisions are made in view of all the data, while the bottom-up methods make myopic merge decisions (Murphy, 2012).
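The bisecting strategy just described can be sketched in a few lines (illustrative only; synthetic data from make_blobs and scikit-learn's KMeans are assumed): starting from a single cluster, the cluster with the largest within-cluster scatter is repeatedly split into two with k-means.
# ======================================================================
# Bisecting k-means sketch (divisive clustering)
# ======================================================================
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples = 300, centers = 4, random_state = 0)
clusters = [X]                        # start with all the data in one cluster
desired_k = 4
while len(clusters) < desired_k:
    # Pick the cluster with the largest within-cluster sum of squares
    scatters = [((c - c.mean(axis = 0))**2).sum() for c in clusters]
    to_split = clusters.pop(int(np.argmax(scatters)))
    # Split it into two daughter clusters with k-means (K = 2)
    labels = KMeans(n_clusters = 2, random_state = 0).fit_predict(to_split)
    clusters.append(to_split[labels == 0])
    clusters.append(to_split[labels == 1])
print([len(c) for c in clusters])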
Example 7.5
The following Python code utilizes divisive clustering to plot a dendrogram of amino acid se-
quence of human genes. In this example, amino acid sequence of human genes is utilized. The
dendrogram plot presents the effectiveness of the algorithm. Note that this example is adapted
from github (https://github.com/ronak-07/Divisive-Hierarchical-Clustering). A phylogenetic tree
or evolutionary tree is a branching diagram or “tree” displaying the implied evolutionary relations
between different biological species based upon similarities and differences in their physical or ge-
netic characteristics. The goal of this example is to build the phylogenetic tree based on DNA/pro-
tein sequences of species given in the dataset employing divisive (top-down) hierarchical clustering.
# ======================================================================
# Divisive clustering
# ======================================================================
import numpy as np
import scipy
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
global g
import time
# ======================================================================
# Define utility functions
# ======================================================================
def subtract(indices,splinter):
    # return the members of indices that are not in the splinter group
    return [i for i in indices if i not in splinter]
def divisive(a, indices, splinter, sub):
    if(len(indices) == 1):
        return
    avg = []
    flag = 0
    for i in indices:
        if(i not in splinter):
            sum = 0
            for j in indices:
                if(j not in splinter):
                    sum = sum + a[i][j]
            if((len(indices)-len(splinter)-1) == 0):
                avg.append(sum)
            else:
                avg.append(sum/(len(indices)-len(splinter)-1))
    if(splinter):
        # subtract the average distance to the splinter group; a positive
        # value means the object is closer to the splinter group than to
        # the rest of its cluster
        k = 0
        for i in sub:
            total = 0
            for j in splinter:
                total = total + a[i][j]
            avg[k] = avg[k] - (total/(len(splinter)))
            k += 1
        positive = []
        for i in range(0, len(avg)):
            if(avg[i] > 0):
                positive.append(avg[i])
                flag = 1
        if(flag == 1):
            splinter.append(sub[avg.index(max(positive))])
            sub.remove(sub[avg.index(max(positive))])
            divisive(a, indices, splinter, sub)
    else:
        # the object with the largest average dissimilarity starts the splinter group
        splinter.append(indices[avg.index(max(avg))])
        sub[:] = subtract(indices, splinter)
        divisive(a, indices, splinter, sub)
def original_subset(indices):
    # extract the sub-matrix of pairwise distances for the given indices
    sp = np.zeros(shape = (len(indices), len(indices)))
    for i in range(0, len(indices)):
        for j in range(0, len(indices)):
            sp[i][j] = a[indices[i]][indices[j]]
    return sp

def original_max(x):
    new = original_subset(x)
    return new.max()

def diameter(l):
    # the diameter of a cluster is its largest pairwise distance
    return original_max(l)
def recursive(a, indices, u, v, clusters, g):
    clus_s.append(len(indices))
    d.append(diameter(indices))
    parents[g] = indices
    g -= 1
    divisive(a, indices, u, v)
    clusters.append(u)
    clusters.append(v)
    new = []
    for i in range(len(clusters)):
        new.append(clusters[i])
    final.append(new)
    x = []
    y = []
    store_list = []
    max = -1
    f = 0
    # pick the cluster with the largest diameter for the next split
    for list in clusters:
        if(diameter(list) > max):
            if(len(list) != 1):
                f = 1
                max = diameter(list)
                store_list = (list)
    if(f == 0):
        return
    else:
        clusters.remove(store_list)
        recursive(a, store_list, x, y, clusters, g)
# ======================================================================
# Main program
# ======================================================================
a = np.load('distance_matrix.npy')
size = len(a)
g = (size-1)*2
parents = {}
final = []
clusters = []
indices = []
clus_s = []
d = []
Z = np.zeros(shape = (size-1,4))
p = []
q = []
ans = []
for i in range(0, len(a)):
    indices.append(i)
for i in range(0, size):
    list = []
    list.append(i)
    parents[i] = list
start = time.time()
recursive(a, indices, p, q, clusters, g)
print("Clustering done\t" + str(time.time()-start))
for i in range(0, len(d)):
    Z[size-i-2][2] = d[i]
    Z[size-i-2][3] = clus_s[i]
for i in range(len(final)-1, 0, -1):
    for j in range(0, len(final[i-1])):
        if final[i-1][j] not in final[i]:
            ans.append(final[i-1][j])
ans.append(indices)
for i in range(0, len(ans)):
    if(len(ans[i]) <= 2):
        Z[i][0] = ans[i][0]
        Z[i][1] = ans[i][1]
    else:
        s = 0
        add = []
        common = []
        # find the largest previously formed cluster contained in ans[i]
        for j in range(len(ans)-1, -1, -1):
            if(set(ans[j]) < set(ans[i])):
                common = ans[j]
                break
        x = (subtract(ans[i], common))
        for key in parents.keys():
            if(parents[key] == common):
                Z[i][0] = key
                break
        for key in parents.keys():
            if(set(parents[key]) == set(x)):
                Z[i][1] = key
                s = 1
                break
        if(s == 0):
            print(Z[i][0], Z[i][1], x)
names = [i for i in range(0,size)]
#%%
# ======================================================================
# Plot dendrogram of divisive clustering
# ======================================================================
plt.figure(figsize = (15, 15))
plt.title('Hierarchical Clustering Dendrogram (Divisive)')
plt.xlabel('Sequence No.')
plt.ylabel('Distance')
# the original repository uses an augmented_dendrogram helper; SciPy's
# standard dendrogram is used here so that the listing is self-contained
hierarchy.dendrogram(Z, labels = names, show_leaf_counts = True,
                     p = 25, truncate_mode = 'lastp')
plt.show()
7.6 The fuzzy c-means clustering algorithm
One of the challenges related to probabilistic algorithms is the choice of the pdfs, for which a suitable model must be assumed. In addition, it is not easy to handle cases where the clusters are not compact but are shell shaped. Fuzzy clustering algorithms are a family of clustering algorithms that free themselves from such constraints. Over the past three
decades, these methods have been the focus of intensive research. The main point differen-
tiating the two methods is that a vector belongs to more than one cluster simultaneously in
the fuzzy schemes, whereas each vector belongs exclusively to one cluster in the probabilistic
schemes. The number of clusters and their shape are presumed to be known a priori. The clus-
ter shape is defined by the set of parameters adopted. The majority of the well-known fuzzy
clustering algorithms are developed by minimizing a cost function (Theodoridis et al., 2010).
The extensively studied and implemented fuzzy c-means (FCM) clustering algorithm re-
quires a priori knowledge of the number of clusters. FCM expects the desired number of clusters and an initial guess of the position of each cluster center, and the output strongly depends on the selection of these initial values. The FCM algorithm iteratively generates an appropriate cluster pattern by minimizing an objective function that depends on the cluster locations. It is also possible to determine the number and initial positions of the cluster centers automatically through search techniques such as the mountain clustering method. By evaluating a search measure called the mountain function at each grid point, this approach considers every grid point as a potential cluster center. Subtractive clustering is a related method with reduced computational effort, in which the data points themselves, rather than grid points, are viewed as candidates for cluster centers; with this approach the computation is simply proportional to the number of data points and independent of the dimension of the problem. In this process, a data point with high potential, which is a function of its distances to the other points, is selected as a cluster center, and data points close to a new cluster center are penalized in order to control the emergence of further cluster centers. The points lying between cluster centers can be considered to have a gradual membership in both clusters; this is accommodated, of course, by softening the crisp distinction between belonging and not belonging to a cluster. The fuzzified c-means algorithm enables each
data point to belong to a cluster to a degree defined by a membership grade, thereby allow-
ing each point to belong to several clusters. The fuzzy c-means algorithm partitions a set of
K data points identified as m-dimensional vectors into c fuzzy clusters and finds a cluster
center in each cluster to minimize an objective function. Fuzzy c-means is different from hard
c-means, mostly as it uses fuzzy partitioning, where a point can belong to numerous clusters
with membership degrees. The membership matrix M is allowed to have elements in the
range [0, 1] to satisfy the fuzzy partitioning. Nonetheless, to maintain the properties of the M
matrix, the total membership of all clusters of a point must always be equal to unity (Sumathi
& Paneerselvam, 2010).
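The following illustrative sketch (plain NumPy, synthetic two-dimensional data, fuzzifier m = 2 chosen arbitrarily) spells out the two fuzzy c-means update equations: the cluster centers are membership-weighted means of all points, and the memberships are recomputed from the distances so that each row of the membership matrix sums to one.
# ======================================================================
# Fuzzy c-means update equations (illustrative sketch)
# ======================================================================
import numpy as np
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [4, 4]])
c, m = 2, 2.0                                   # number of clusters, fuzzifier
U = rng.dirichlet(np.ones(c), size = len(X))    # memberships, each row sums to 1
for _ in range(20):
    # Center update: membership-weighted means of all data points
    Um = U**m
    centers = (Um.T @ X) / Um.sum(axis = 0)[:, None]
    # Membership update from the distances to the centers
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis = -1) + 1e-12
    U = 1.0 / (d**(2/(m-1)) * (1.0/d**(2/(m-1))).sum(axis = 1, keepdims = True))
print(U[:5])   # fuzzy memberships of the first five points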
Example 7.6
The following Python code utilizes fuzzy c-means clustering algorithm to find the center of the
clusters of Iris dataset. In this example, the Iris dataset that exists in sklearn.datasets is utilized. Scat-
ter plot is presented to show the effectiveness of the algorithm, and the cluster centers are plotted in the scatter plot. Note that this example is adapted from the web page (https://github.com/omadson/fuzzy-c-means). In order to use the library you should run "pip install fuzzy-c-means" or download it from the web page (https://pypi.org/project/fuzzy-c-means/).
# ======================================================================
# Fuzzy c-means clustering example
# ======================================================================
7.7 Density-based clustering algorithms
Clusters are viewed in this context as regions of the feature space that are "dense" in
points of X. Many density-based algorithms do not place any restrictions on the form of the
resulting clusters. Therefore, these algorithms are capable of recovering arbitrarily shaped
clusters. They can also handle outliers effectively. In addition, these algorithms' time complexity is relatively low, making them suitable for processing large datasets. DBSCAN, DBCLASD, DENCLUE,
and OPTICS are the popular density-based algorithms. While these algorithms share the same
basic philosophy, they differ in the quantification of the density (Theodoridis et al., 2010).
Example 7.7
The following Python code utilizes DBSCAN clustering algorithm to find the clusters by using
the scikit-learn library APIs. In this example, synthetic data are utilized. A scatter plot is presented
to show the effectiveness of the algorithm. Different measures are also calculated. Note that this
example is adapted from scikit-learn.
# ======================================================================
# DBSCAN clustering
# ======================================================================
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
# ######################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples = 750, centers = centers,
                            cluster_std = 0.4, random_state = 0)
X = StandardScaler().fit_transform(X)
# ######################################################################
# Compute DBSCAN
db = DBSCAN(eps = 0.3, min_samples = 10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype = bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
# ######################################################################
# Plot the result: core samples as large dots, noise points in black
import matplotlib.pyplot as plt
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor = tuple(col), markersize = 10)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor = tuple(col), markersize = 4)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Example 7.8
The following Python code utilizes OPTICS clustering algorithm to find the clusters by using the
scikit-learn library APIs. This example employs synthetic data, which is generated so that the clus-
ters have different densities. The class sklearn.cluster.OPTICS is first utilized with its Xi cluster de-
tection method and then we set specific thresholds on the reachability that is related to class sklearn.
cluster.DBSCAN. We can see that the different clusters of OPTICS’s Xi method can be recovered
with different choices of thresholds in DBSCAN. Reachability plot and scatter plot are presented to
show the effectiveness of the algorithm. Note that this example is adapted from scikit-learn.
# ======================================================================
# Optics clustering example
# ======================================================================
# Authors: Shane Grigsby <[email protected]>
# Adrin Jalali <[email protected]>
# License: BSD 3 clause
from sklearn.cluster import OPTICS, cluster_optics_dbscan
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np
# Generate sample data with clusters of different densities
np.random.seed(0)
n_points_per_cluster = 250
C1 = [-5, -2] + .8 * np.random.randn(n_points_per_cluster, 2)
C2 = [4, -1] + .1 * np.random.randn(n_points_per_cluster, 2)
C3 = [1, -2] + .2 * np.random.randn(n_points_per_cluster, 2)
C4 = [-2, 3] + .3 * np.random.randn(n_points_per_cluster, 2)
C5 = [3, -2] + 1.6 * np.random.randn(n_points_per_cluster, 2)
C6 = [5, 6] + 2 * np.random.randn(n_points_per_cluster, 2)
X = np.vstack((C1, C2, C3, C4, C5, C6))
# Fit OPTICS and extract DBSCAN-style clusterings at two epsilon cuts
clust = OPTICS(min_samples = 50, xi = .05, min_cluster_size = .05)
clust.fit(X)
labels_050 = cluster_optics_dbscan(reachability = clust.reachability_,
                                   core_distances = clust.core_distances_,
                                   ordering = clust.ordering_, eps = 0.5)
labels_200 = cluster_optics_dbscan(reachability = clust.reachability_,
                                   core_distances = clust.core_distances_,
                                   ordering = clust.ordering_, eps = 2)
plt.figure(figsize = (10, 7))
G = gridspec.GridSpec(2, 3)
ax1 = plt.subplot(G[0, :])
ax2 = plt.subplot(G[1, 0])
ax3 = plt.subplot(G[1, 1])
ax4 = plt.subplot(G[1, 2])
space = np.arange(len(X))
reachability = clust.reachability_[clust.ordering_]
labels = clust.labels_[clust.ordering_]
# ======================================================================
# Reachability plot
# ======================================================================
colors = ['g.', 'r.', 'b.', 'y.', 'c.']
for klass, color in zip(range(0, 5), colors):
    Xk = space[labels == klass]
    Rk = reachability[labels == klass]
    ax1.plot(Xk, Rk, color, alpha = 0.3)
ax1.plot(space[labels == -1], reachability[labels == -1], 'k.', alpha = 0.3)
ax1.plot(space, np.full_like(space, 2., dtype = float), 'k-', alpha = 0.5)
ax1.plot(space, np.full_like(space, 0.5, dtype = float), 'k-.', alpha = 0.5)
ax1.set_ylabel('Reachability (epsilon distance)')
ax1.set_title('Reachability Plot')
# ======================================================================
# Plot OPTICS clustering results
# ======================================================================
colors = ['g.', 'r.', 'b.', 'y.', 'm.']
for klass, color in zip(range(0, 5), colors):
    Xk = X[clust.labels_ == klass]
    ax2.plot(Xk[:, 0], Xk[:, 1], color, alpha = 0.3)
ax2.plot(X[clust.labels_ == -1, 0], X[clust.labels_ == -1, 1], 'k+',
         alpha = 0.1)
ax2.set_title('Automatic Clustering\nOPTICS')
# ======================================================================
# Plot DBSCAN at 0.5 clustering results
# ======================================================================
colors = ['r', 'greenyellow', 'olive', 'g', 'b', 'c']
for klass, color in zip(range(0, 6), colors):
    Xk = X[labels_050 == klass]
    ax3.plot(Xk[:, 0], Xk[:, 1], color, alpha = 0.3, marker = '.')
ax3.plot(X[labels_050 == -1, 0], X[labels_050 == -1, 1], 'k+',
         alpha = 0.1)
ax3.set_title('Clustering at 0.5 epsilon cut\nDBSCAN')
# ======================================================================
# Plot DBSCAN at 2. clustering results
# ======================================================================
colors = ['r.', 'm.', 'y.', 'c.']
for klass, color in zip(range(0, 4), colors):
    Xk = X[labels_200 == klass]
    ax4.plot(Xk[:, 0], Xk[:, 1], color, alpha = 0.3)
ax4.plot(X[labels_200 == -1, 0], X[labels_200 == -1, 1], 'k+',
         alpha = 0.1)
ax4.set_title('Clustering at 2.0 epsilon cut\nDBSCAN')
plt.tight_layout()
plt.show()
7.8 The expectation maximization for Gaussian mixture model clustering
The problem is that we do not know either of the following: the distribution from which
each training instance came or the five parameters of the mixture model. We are therefore
adopting the technique used for the k-means clustering algorithm and iterating. Beginning with initial guesses for the five parameters, we utilize them to compute each instance's cluster probabilities, we utilize these probabilities to reestimate the parameters, and then we repeat. This is called the expectation maximization (EM) algorithm. The first step—calcu-
lating the probabilities of the cluster, which are the “expected” class values—is “expectation";
the second step, calculating the distribution parameters, is “maximizing” the probability of
the distributions given the available data (Witten et al., 2016).
Now that we have seen the Gaussian mixture model for two distributions, let us consider how to apply it to more realistic situations. It is quite straightforward to extend the al-
gorithm from two-class problems to multiclass problems as long as the number k of normal
distributions is given in advance. The model can easily be extended to multiple attributes
from a single numeric attribute per instance as long as it is assumed that attributes are inde-
pendent. The probabilities are multiplied for each attribute to obtain the joint probability for
the instance. The independence assumption no longer holds when the dataset is known in advance to contain correlated attributes. Instead, a bivariate normal distribution can model
two attributes together, each having its own mean value, but the two standard deviations are
replaced by a “covariance matrix” with four numeric parameters. Standard statistical tech-
niques are available to estimate instance class probabilities and to estimate the mean and
covariance matrix, provided the instances and their class probabilities. A multivariate distri-
bution can accommodate multiple correlated attributes. The number of parameters increases
with the square of the number of jointly varying attributes. Expectation—calculating the clus-
ter to which each instance belongs, provided the parameters of the distribution—is just like
evaluating an unknown instance’s class. Maximization—estimating the parameters from the
classified instances—is just like evaluating the probabilities of the attribute-value from the
training instances, with the minor distinction being allocated probabilistically rather than
categorically to classes in the EM algorithm instances (Witten et al., 2016).
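A bare-bones illustrative sketch of these two steps for a one-dimensional mixture of two Gaussians is given below (synthetic data and simple initial guesses are assumed; a library implementation, as in the next example, would normally be used).
# ======================================================================
# EM for a two-component 1D Gaussian mixture (illustrative sketch)
# ======================================================================
import numpy as np
from scipy.stats import norm
rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1.5, 200)])
# Initial guesses for the five parameters: two means, two std devs, one weight
mu = np.array([x.min(), x.max()])
sigma = np.array([1.0, 1.0])
pi = 0.5                                   # weight of the first component
for _ in range(50):
    # E step: expected cluster probabilities (responsibilities) per instance
    p1 = pi * norm.pdf(x, mu[0], sigma[0])
    p2 = (1 - pi) * norm.pdf(x, mu[1], sigma[1])
    r = p1 / (p1 + p2)
    # M step: re-estimate the parameters from the weighted instances
    mu = np.array([(r * x).sum() / r.sum(), ((1 - r) * x).sum() / (1 - r).sum()])
    sigma = np.array([np.sqrt((r * (x - mu[0])**2).sum() / r.sum()),
                      np.sqrt(((1 - r) * (x - mu[1])**2).sum() / (1 - r).sum())])
    pi = r.mean()
print(mu, sigma, pi)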
Example 7.9
The following Python code utilizes Gaussian mixture models (GMM) clustering algorithm to
find the clusters by using the scikit-learn library APIs. This example employs Iris dataset. Although
GMMs are generally employed for clustering, we can compare the found clusters with the actual classes from the dataset. Predicted labels are plotted on both training and held-out test data using a variety of GMM covariance types on the Iris dataset. GMMs with spherical, diagonal, full, and tied covariance matrices are compared in increasing order of performance. Although the full covariance is expected to achieve the best performance in general, it is prone to overfitting on small
datasets and does not generalize well to held-out test data. On the plots, training data is shown as
dots, while test data is shown as crosses. Although the Iris dataset is four-dimensional, just the first
two dimensions are shown here. Note that this example is adapted from scikit-learn.
# ======================================================================
# GMM clustering
# ======================================================================
# Author: Ron Weiss <[email protected] > , Gael Varoquaux
# Modified by Thierry Guillemot <[email protected]>
# License: BSD 3 clause
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold
iris = datasets.load_iris()
# Break up the dataset into non-overlapping training (75%) and testing
# (25%) sets.
skf = StratifiedKFold(n_splits = 4)
# Only take the first fold.
train_index, test_index = next(iter(skf.split(iris.data, iris.target)))
X_train = iris.data[train_index]
y_train = iris.target[train_index]
X_test = iris.data[test_index]
y_test = iris.target[test_index]
n_classes = len(np.unique(y_train))
There’s a problem, though, in GMM—overfitting. If we are not sure which attributes de-
pend on each other, why not be on the safe side and decide that all the attributes are covari-
ant? The reason is that the more parameters there are, the greater the probability of overfitting
the resulting structure to the training data, and covariance significantly increases the number
of parameters. Throughout machine learning, the problem of overfitting arises, and probabi-
listic clustering is no exception. There are two ways it can occur: by defining too many clus-
ters and by specifying too many parameters for distributions. The extreme case of too many
clusters happens when there is one for each data point—then it is clear that the training data
will be overfitted. In addition, problems will arise in the EM-based GMM whenever any of the normal distributions becomes so narrow that the cluster is centered on just one data point. Implementations therefore usually insist that clusters contain at least two different data values. The problem of overfitting also occurs when there are too many parameters. If you are not sure which attributes are covariant, you may try out different possibilities and select the one that maximizes the overall likelihood of the data given the clustering that is found. However, the more parameters there are, the larger the overall likelihood will tend to be, not necessarily because of better clustering but because of overfitting. The more parameters to play with, the simpler it
is to find a seemingly good clustering. It would be good to penalize the model for introducing
new parameters. Recently, complete Bayesian hierarchical clustering techniques have been
developed that generate a distribution of probability over possible hierarchical structures
representing a dataset as output. One of the main ways to do this is to follow a Bayesian
approach where each parameter has a prior distribution of probability. Therefore whenever
a new parameter is added, it is important to integrate its prior probability into the overall
probability figure. Since this involves multiplying the overall likelihood by a number less than 1 (the prior probability), it will automatically penalize the addition of new parameters.
The added parameters will have to yield a gain that outweighs this cost in order to enhance the overall probability. AutoClass is a comprehensive Bayesian clustering scheme that uses the finite mixture model with prior distributions on all the parameters. It enables both numeric
and nominal attributes and uses the EM algorithm to estimate the probability distribution
parameters in order to best fit the data. Since there is no guarantee that the EM algorithm will
converge to the optimum global, the process will be repeated for several different initial value
sets. AutoClass considers various cluster numbers and can consider the various covariance
quantities and different types of underlying distribution of probability for numeric attributes
(Witten et al., 2016).
Example 7.10
The following Python code compares two Gaussian mixture model (GMM) clustering algo-
rithms. It plots the confidence ellipsoids of a mixture of two Gaussians generated by expectation
maximization (“GaussianMixture”) and variational inference (“BayesianGaussianMixture” with a
Dirichlet process prior). This example employs synthetic data, which is generated so that the clus-
ters have different densities. Both models have access to five components with which to fit the data.
Note that the expectation maximization model necessarily uses all five components, while the variational inference model effectively uses only as many as are needed for a good fit. It can also be seen that the expectation maximization model splits some clusters across components arbitrarily, since it is trying to fit too many components, whereas the Dirichlet process model adjusts the number of components automatically. Note that this example is adapted from scikit-learn.
# ======================================================================
# Comparison of Gaussian mixture models with EM and Bayesian
# ======================================================================
import itertools
import numpy as np
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import mixture
plt.xlim(-9., 5.)
plt.ylim(-3., 6.)
plt.xticks(())
plt.yticks(())
plt.title(title)
7.10 Silhouette analysis
How can we detect poor-quality output from a clustering algorithm? The silhouette is a useful technique. A silhouette plot sorts and plots the silhouette value s(x) of each example, grouped by cluster. Squared Euclidean distance is often utilized in the construction of the silhouette, but the approach can be extended to other distance metrics. Comparing silhouette plots of two clusterings of the same data makes it clear when one clustering is much stronger than the other. In addition to the graphical representation, we can compute the average silhouette values per cluster and over the entire dataset (Flach, 2012). Silhouette analysis can be utilized to examine the amount of separation be-
tween the clusters. The silhouette plot shows how close each point in a cluster is to points in
the neighboring clusters and thus provides a way to visually determine parameters such as
the number of clusters. This measure has a range of [-1, 1]. Silhouette coefficients close to + 1
imply that the sample is far from neighboring clusters. A value of 0 means that the sample is
on or very close to the decision boundary between two neighboring clusters, and negative
values suggest that the samples may have been allocated to the wrong cluster.
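For a single example x, the silhouette value is s(x) = (b(x) − a(x)) / max(a(x), b(x)), where a(x) is the mean distance from x to the other members of its own cluster and b(x) is the mean distance to the members of the nearest other cluster. The short sketch below (synthetic blobs and an arbitrary choice of three clusters) computes these values with scikit-learn.
# ======================================================================
# Computing silhouette values (illustrative sketch)
# ======================================================================
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score
X, _ = make_blobs(n_samples = 300, centers = 3, random_state = 0)
labels = KMeans(n_clusters = 3, random_state = 0).fit_predict(X)
s = silhouette_samples(X, labels)          # s(x) for every example
print("average silhouette:", silhouette_score(X, labels))
for k in range(3):                          # per-cluster average silhouette
    print("cluster", k, ":", s[labels == k].mean())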
Example 7.11
The following Python code utilizes k-means clustering to find the silhouettes of marketing data
by using the scikit-learn library APIs. This example employs marketing data to see the silhouettes
and cluster centers. The silhouettes and cluster centers are plotted for different cases. Note that this
example is taken from scikit-learn.
In this example the silhouette analysis is utilized to select an optimal value for “n_clusters.” The sil-
houette plot shows that the "n_clusters" values of 2, 3, and 6 are bad picks for the given data due to the
presence of clusters with below average silhouette scores and also due to wide fluctuations in the size
of the silhouette plots. Silhouette analysis is more ambivalent in deciding between 4 and 5. Also from
the thickness of the silhouette plot the cluster size can be visualized. The silhouette plot for cluster 0
when “n_clusters” is equal to 2 is bigger in size due to the grouping of the 3 subclusters into one big
cluster. Nevertheless, when the “n_clusters” are equal to 4 or 5, all the plots are more or less of similar
thickness and hence are of similar sizes as can be confirmed from the labeled scatter plot on the right.
# ======================================================================
# Selecting the number of clusters with silhouette analysis on k-means clustering
# ======================================================================
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import pandas as pd
# Import the Mall Customers dataset by pandas
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3,4]].values
range_n_clusters = [2, 3, 4, 5, 6]
for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    # Cluster the data and compute the silhouette values
    clusterer = KMeans(n_clusters = n_clusters, random_state = 10)
    cluster_labels = clusterer.fit_predict(X)
    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the
    # formed clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    sample_silhouette_values = silhouette_samples(X, cluster_labels)
    # Label the silhouette plots with their cluster numbers at the middle
    ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    # Compute the new y_lower for next plot
    y_lower = y_upper + 10  # 10 for the 0 samples
    # The vertical line for average silhouette score of all the values
    ax1.axvline(x = silhouette_avg, color = "red", linestyle = "--")
7.11 Image segmentation with clustering
Images are known as one of the most significant ways of transmitting information. A cru-
cial aspect of machine learning is to understand images and extract the information from
them so that the knowledge can be used for certain tasks. The use of images for robotic navi-
gation would be an example. Other applications such as extracting malignant tissues from
body scans and so on are an integral part of medical diagnosis. One of the first steps toward
recognizing images is to segment them and find the various objects in them. Features such as histogram plots and frequency-domain transforms can be used to do this (Tatiraju & Mehta, 2008).
In image recognition and computer vision, image segmentation is an important prepro-
cessing procedure. Image segmentation corresponds to the decomposition of an image into a number of nonoverlapping meaningful regions with homogeneous attributes. Image segmentation is
a crucial technique in digital image processing, and segmentation accuracy directly affects
follow-up tasks’ effectiveness. The current segmentation techniques have achieved many
successes to varying degrees, considering their complexity and difficulty, but research on
this dimension still faces many problems. Cluster analysis algorithms split a dataset into different groups according to a certain criterion, so they have broad application in the segmentation of images. Image segmentation, as one of the main digital image processing techniques, coupled with relevant domain knowledge, is commonly used for machine vision, facial recognition, fingerprint recognition, traffic control systems, object tracking in satellite images (roads, woods, etc.), pedestrian detection, medical imaging, and many other areas and is
worth exploring in-depth (Zheng, Lei, Yao, Gong, & Yin, 2018).
Since image segmentation plays a crucial role in many applications of image processing, several algorithms for image segmentation have been developed in recent decades. But better algorithms are continuously being pursued, as image segmentation is a challenging problem whose solution strongly affects the subsequent image-processing steps. Clustering algorithms were not originally developed for image processing; the computer vision community adopted them for image segmentation. For example, the k-means algorithm needs a priori knowledge of the number of clusters (k) into which the pixels are to be grouped. Each pixel of the image is repeatedly and iteratively allocated to the cluster whose centroid is nearest to it, and the centroid of each cluster is then recomputed from the pixels allocated to it. Both the choice of pixel membership in the clusters and the computation of the centroids are based
on calculating distances. The Euclidean distance is most commonly utilized since it is simple
to calculate. The problem is that using the Euclidean distance may lead to errors in the final seg-
mentation of the image (Gaura, Sojka, & Krumnikl, 2011).
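Before the spectral clustering example that follows, the short sketch below illustrates the pixel-level k-means segmentation described above (illustrative only; it uses the coins image from skimage.data and scikit-learn's KMeans with an arbitrary choice of three clusters): the image is reshaped into a list of pixel intensity values, the pixels are clustered, and each pixel is replaced by its cluster centroid.
# ======================================================================
# Image segmentation with k-means on pixel values (illustrative sketch)
# ======================================================================
import matplotlib.pyplot as plt
from skimage.data import coins
from sklearn.cluster import KMeans
img = coins()                                   # grayscale image
pixels = img.reshape(-1, 1).astype(float)       # one row per pixel
kmeans = KMeans(n_clusters = 3, random_state = 0).fit(pixels)
segmented = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)
plt.figure(figsize = (8, 4))
plt.subplot(1, 2, 1); plt.imshow(img, cmap = 'gray'); plt.title('Original')
plt.subplot(1, 2, 2); plt.imshow(segmented, cmap = 'gray'); plt.title('k-means segments')
plt.show()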
Example 7.12
The following Python code is utilized for segmenting the images of Greek coins in regions by
using the scikit-learn library APIs. In this example the coins dataset, which exists in skimage.data,
is utilized. This example utilizes “spectral_clustering” on a graph created from voxel-to-voxel dif-
ference on an image to divide this image into multiple, partly homogeneous regions. This process
(spectral clustering on an image) is an effective approximate solution for finding normalized graph
cuts. There are two options to assign labels:
• "K-means" spectral clustering will cluster samples in the embedding space employing a k-means algorithm.
• "Discrete" will iteratively search for the closest partition space to the embedding space.
# ======================================================================
# Image segmentation with clustering
# ======================================================================
# Author: Gael Varoquaux <[email protected] > , Brian Cheung
# License: BSD 3 clause
import time
import numpy as np
from distutils.version import LooseVersion
from scipy.ndimage.filters import gaussian_filter
import matplotlib.pyplot as plt
import skimage
from skimage.data import coins
from skimage.transform import rescale
from sklearn.feature_extraction import image
from sklearn.cluster import spectral_clustering
# Load the coins image, smooth it, and rescale it to speed up processing
orig_coins = coins()
smoothened_coins = gaussian_filter(orig_coins, sigma = 2)
rescaled_coins = rescale(smoothened_coins, 0.2, mode = "reflect")
# Convert the image into a graph with the value of the gradient on the
# edges.
graph = image.img_to_graph(rescaled_coins)
# Take a decreasing function of the gradient, leading to a segmentation
# that is close to a Voronoi partition
beta = 10
eps = 1e-6
graph.data = np.exp(-beta * graph.data / graph.data.std()) + eps
# Apply spectral clustering (this step goes much faster if you have
# pyamg installed)
N_REGIONS = 25
# ######################################################################
# Visualize the resulting regions
for assign_labels in ('kmeans', 'discretize'):
    t0 = time.time()
    labels = spectral_clustering(graph, n_clusters = N_REGIONS,
                                 assign_labels = assign_labels,
                                 random_state = 42)
    t1 = time.time()
    labels = labels.reshape(rescaled_coins.shape)
    plt.figure(figsize = (5, 5))
    plt.imshow(rescaled_coins, cmap = plt.cm.gray)
    for l in range(N_REGIONS):
        plt.contour(labels == l,
                    colors = [plt.cm.nipy_spectral(l / float(N_REGIONS))])
    plt.title('Spectral clustering: %s, %.2fs' % (assign_labels, (t1 - t0)))
plt.show()
7.12 Feature extraction with clustering
The k-means procedure can also be viewed as a matrix factorization, which is the basis for using clustering as a feature extractor. Denote by ck the centroid of the kth cluster and by Sk the set of indices of the points assigned to that cluster, so that each data point is approximated by the centroid of its cluster,

ck ≈ xp for all p ∈ Sk. (7.1)

Stacking the centroids column-wise gives the centroid matrix C = [c1 c2 ··· cK]. Then, designating by ek the kth standard basis vector (that is, a K × 1 vector with a 1 in the kth slot and zeros elsewhere), we may write Cek = ck, and hence the relations in Eq. (7.1) can be represented for each k as

Cek ≈ xp for all p ∈ Sk. (7.2)
Next, to write these equations even more compactly, we stack the data column-wise into the data matrix X = [x1 x2 ··· xP] and introduce a K × P assignment matrix W. The pth column
of this matrix, represented as wp, is the standard basis vector related to the cluster to which
the pth point belongs, that is, wp = ek if p ∈ Sk. With this wp notation we can represent each
equation in Eq. (7.2) as Cwp ≈ xp for all p ∈ Sk, or using matrix notation all K such relations
simultaneously as
CW ≈ X. (7.3)
We can now drop the assumption that we know the locations of the cluster centroids and which points are assigned to them, that is, that we know the centroid matrix C and the assignment matrix W. We want to learn the correct values for these two matrices. In particular, we know that the ideal C and W fulfill the compact relationship depicted in Eq. (7.3), that is, that CW ≈ X or, in other words, that ||CW − X||F^2 is small, while W is composed of appropriately selected standard basis vectors that associate the data points with their respective centroids. Note that the objective is nonconvex, and since we cannot minimize over both C and W at the same time, it is solved via alternating minimization, that is, by alternately minimizing the objective function over one of the variables (C or W) while keeping the other variable fixed (Watt, Borhani, & Katsaggelos, 2016).
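A small illustrative sketch of this alternating scheme in the matrix notation of Eqs. (7.1)–(7.3) is given below (synthetic two-dimensional data; K = 2 is an arbitrary choice): with W fixed, each column of W selects the nearest centroid, and with W fixed in turn, each centroid (column of C) becomes the mean of the points assigned to it.
# ======================================================================
# Alternating minimization of ||CW - X||_F^2 (illustrative sketch)
# ======================================================================
import numpy as np
rng = np.random.RandomState(0)
dim, P, K = 2, 80, 2
# Data matrix X whose columns are the data points x_p
X = np.hstack([rng.randn(dim, 40), rng.randn(dim, 40) + 5.0])
C = X[:, rng.choice(P, K, replace = False)]     # centroid matrix, columns c_k
for _ in range(10):
    # Minimize over W with C fixed: assign each point to its nearest centroid
    dists = ((X[:, None, :] - C[:, :, None])**2).sum(axis = 0)   # K x P
    assign = dists.argmin(axis = 0)
    W = np.zeros((K, P))
    W[assign, np.arange(P)] = 1.0               # columns are standard basis vectors
    # Minimize over C with W fixed: each centroid is the mean of its points
    C = (X @ W.T) / np.maximum(W.sum(axis = 1), 1.0)
print("reconstruction error:", ((C @ W - X)**2).sum())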
Example 7.13
The following Python code presents the usage of k-means and GMM clustering algorithms as
a feature extractor. We will utilize the Iris dataset, which includes three types (class) of Iris flowers
(Setosa, Versicolour, and Virginica) with four attributes: sepal length, sepal width, petal length, and
petal width. In this example we utilize sklearn.cluster.KMeans and sklearn.mixture.GaussianMixture, the scikit-learn implementations of k-means and GMM, to extract features from the Iris dataset; the extracted features (distances to the cluster centers and cluster membership probabilities, respectively) are then fed to different classifiers. Note that this example is adapted from scikit-learn.
# ======================================================================
# Feature extraction with k-means and GMM clustering
# ======================================================================
" " "
Created on Mon Dec 23 11:35:28 2019
@author: absubasi
" " "
from sklearn import metrics
from sklearn.metrics import confusion_matrix
import seaborn as sns
def Performance_Metrics(y_test, y_pred):
    print('Test Accuracy:', np.round(metrics.accuracy_score(y_test, y_pred), 4))
    print('Precision:', np.round(metrics.precision_score(y_test, y_pred, average='weighted'), 4))
    print('Recall:', np.round(metrics.recall_score(y_test, y_pred, average='weighted'), 4))
    print('F1 Score:', np.round(metrics.f1_score(y_test, y_pred, average='weighted'), 4))
    print('Cohen Kappa Score:', np.round(metrics.cohen_kappa_score(y_test, y_pred), 4))
    print('Matthews Corrcoef:', np.round(metrics.matthews_corrcoef(y_test, y_pred), 4))
    print('\t\tClassification Report:\n', metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
# ======================================================================
# Random forest classifier with k-means for feature extraction
# ======================================================================
#load Data
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X = iris['data']
y = iris['target']
#Extract Features
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 6).fit(X)
distances = np.column_stack([np.sum((X - center)**2, axis = 1)**0.5 for
center in kmeans.cluster_centers_])
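# A possible completion of this block (not part of the original listing): train a
# random forest on the cluster-distance features; the train/test split and the
# hyperparameters below are illustrative assumptions.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
Xtrain, Xtest, ytrain, ytest = train_test_split(distances, y,
                                                test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
Performance_Metrics(ytest, ypred)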
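#%%
# ======================================================================
# GMM posterior probabilities as extracted features
# ======================================================================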
iris = load_iris()
X = iris['data']
y = iris['target']
#Extract Features
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components = 8).fit(X)
proba = gmm.predict_proba(X)
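#%%
# ======================================================================
# k-means cluster-center distances as extracted features
# ======================================================================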
y = iris['target']
#Extract Features
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 6).fit(X)
distances = np.column_stack([np.sum((X - center)**2, axis = 1)**0.5 for
center in kmeans.cluster_centers_])
#%%
# ======================================================================
# k-NN classifier with GMM for feature extraction
# ======================================================================
#load Data
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X = iris['data']
y = iris['target']
#Extract Features
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components = 4).fit(X)
proba = gmm.predict_proba(X)
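# A possible completion (illustrative assumptions): train a k-NN classifier on the
# GMM posterior-probability features and report the metrics.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
Xtrain, Xtest, ytrain, ytest = train_test_split(proba, y,
                                                test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xtrain, ytrain)
ypred = knn.predict(Xtest)
Performance_Metrics(ytest, ypred)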
#%%
# ======================================================================
# MLP classifier with k-means for feature extraction
# ======================================================================
#load Data
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X = iris['data']
y = iris['target']
#Extract Features
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 6).fit(X)
distances = np.column_stack([np.sum((X - center)**2, axis = 1)**0.5 for
center in kmeans.cluster_centers_])
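# A possible completion (illustrative assumptions): train an MLP on the
# cluster-distance features so that ytest and ypred exist for the call below;
# print_confusion_matrix_and_save is assumed to be defined as in Example 7.14.
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
Xtrain, Xtest, ytrain, ytest = train_test_split(distances, y,
                                                test_size=0.3, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0)
mlp.fit(Xtrain, ytrain)
ypred = mlp.predict(Xtest)
Performance_Metrics(ytest, ypred)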
print_confusion_matrix_and_save(ytest, ypred)
#%%
# ======================================================================
# MLP classifier with GMM for feature extraction
# ======================================================================
#load Data
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X = iris['data']
y = iris['target']
#Extract Features
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components = 8).fit(X)
proba = gmm.predict_proba(X)
7.13 Clustering for classification
How can unlabeled data be utilized for classification? The idea is to combine naïve Bayes with
the iterative EM clustering algorithm: learn classes from a small labeled dataset and then extend
them to a large unlabeled dataset. In the first step, use the labeled data to train a classifier. In the
second step, apply it to the unlabeled data to label it with class probabilities (the “expectation”
step). In the third step, use the labels of all the data to train a new classifier (the “maximization”
step). Finally, iterate until convergence. The EM method guarantees that, at each iteration, model
parameters are found whose likelihood is equal to or greater than that of the previous iteration.
Whether these higher-likelihood parameters actually improve classification performance is a
question that can only be answered empirically. Intuitively, the scheme should help: the EM
method uses the unlabeled data, which never appear in the labeled dataset, to iteratively
generalize the learned model so that it classifies such data correctly. This could work with any
combination of a classification algorithm and an iterative clustering algorithm. But it is essentially
a bootstrapping technique, and care is needed to make sure that the feedback loop is positive.
Using probabilities rather than hard decisions seems better, since it helps the process converge
slowly instead of jumping to incorrect conclusions.
Together with the standard probabilistic EM technique, naïve Bayes is a particularly suitable
partner, since both share the same basic assumption: independence between attributes or, more
precisely, conditional independence of the attributes given the class. Coupling naïve Bayes and
EM in this way works well for document classification. In a particular classification task it can
achieve the performance of a traditional learner while employing less than one-third of the
labeled training instances, together with five times as many unlabeled ones. If labeled instances
are costly but unlabeled ones are essentially free, this is a good tradeoff. With a small number of
labeled documents, classification accuracy can be improved dramatically by adding further
unlabeled documents (Witten et al., 2016).
Two methodological refinements have been shown to enhance the performance. The first is
motivated by experimental evidence showing that, when there are many labeled data, the
inclusion of unlabeled data may decrease rather than improve performance: hand-labeled data
are inherently less noisy than automatically labeled data. The remedy is to introduce a weighting
parameter that reduces the contribution of the unlabeled data; it can be incorporated into the
maximization step of EM by maximizing the weighted likelihood of the labeled and unlabeled
instances. The second refinement is to allow several clusters per class. The EM clustering
algorithm assumes that the data are generated randomly from a mixture of different probability
distributions, one per cluster. With several clusters per class, each labeled document is initially
assigned to each of the components of its class in a probabilistic fashion. The maximization step
of the EM algorithm remains as before, but the expectation step is adjusted not only to label each
example probabilistically with the classes but also to assign it probabilistically to the components
within its class (Witten et al., 2016).
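A compact sketch of this scheme in scikit-learn is shown below. It is not the exact procedure described by Witten et al. (2016): it uses GaussianNB rather than a multinomial document model, approximates the soft E-step by weighting the most probable label of each unlabeled point by its posterior probability, and exposes the down-weighting of unlabeled data through a hypothetical unlabeled_weight parameter.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def em_naive_bayes(X_labeled, y_labeled, X_unlabeled,
                   n_iter=10, unlabeled_weight=0.5):
    # Step 1: train an initial classifier on the labeled data only.
    clf = GaussianNB()
    clf.fit(X_labeled, y_labeled)
    X_all = np.vstack([X_labeled, X_unlabeled])
    for _ in range(n_iter):
        # Expectation step: probabilistically label the unlabeled data.
        proba = clf.predict_proba(X_unlabeled)
        y_unlabeled = clf.classes_[np.argmax(proba, axis=1)]
        # Maximization step: retrain on all data; unlabeled points are
        # down-weighted by their posterior confidence times unlabeled_weight.
        y_all = np.concatenate([y_labeled, y_unlabeled])
        weights = np.concatenate([np.ones(len(y_labeled)),
                                  unlabeled_weight * proba.max(axis=1)])
        clf = GaussianNB()
        clf.fit(X_all, y_all, sample_weight=weights)
    return clf
For document classification one would typically substitute MultinomialNB with bag-of-words features, and the loop could also be stopped early once the predicted labels of the unlabeled data stop changing.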
Example 7.14
The following Python code utilizes k-means clustering for classification of handwritten digits
by using the scikit-learn library APIs. In this example, the handwritten digits dataset that exists
in sklearn.datasets is utilized. The classification accuracy, precision, recall, F1 score, Cohen kappa
score, and Matthews correlation coefficient are calculated. The classification report and confusion
matrix are also given. Note that this example is adapted from scikit-learn.
# ======================================================================
# Clustering as a classifier
# ======================================================================
#k-means on digits
import seaborn as sns
import matplotlib.pyplot as plt
from io import BytesIO #needed for plot
from sklearn.metrics import confusion_matrix
from sklearn import metrics
import numpy as np
# ======================================================================
# Define utility functions
# ======================================================================
def print_confusion_matrix_and_save(y_test, y_pred):
    # Print the confusion matrix
    matrix = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6, 4))
    sns.heatmap(matrix, square=True, annot=True, fmt='d', cbar=False)
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    # Save the confusion matrix before showing it (plt.show() clears the current figure)
    plt.savefig("Confusion.jpg")
    # Save SVG in a fake file object
    f = BytesIO()
    plt.savefig(f, format="svg")
    plt.show()
def Performance_Metrics(y_test, y_pred):
    print('Test Accuracy:', np.round(metrics.accuracy_score(y_test, y_pred), 4))
    print('Precision:', np.round(metrics.precision_score(y_test, y_pred, average='weighted'), 4))
    print('Recall:', np.round(metrics.recall_score(y_test, y_pred, average='weighted'), 4))
    print('F1 Score:', np.round(metrics.f1_score(y_test, y_pred, average='weighted'), 4))
    print('Cohen Kappa Score:', np.round(metrics.cohen_kappa_score(y_test, y_pred), 4))
    print('Matthews Corrcoef:', np.round(metrics.matthews_corrcoef(y_test, y_pred), 4))
    print('\t\tClassification Report:\n', metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
# ======================================================================
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
#%%
# Cluster the digit images into 10 groups and map each cluster to its most
# common true label so the clustering can be scored as a classifier.
kmeans = KMeans(n_clusters=10)
clusters = kmeans.fit_predict(digits.data)
from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]
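# Score the cluster-derived labels as classifier predictions, using the utility
# functions defined above, to obtain the metrics listed in the example description.
Performance_Metrics(digits.target, labels)
print_confusion_matrix_and_save(digits.target, labels)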
#%%
from sklearn.manifold import TSNE
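# A possible continuation suggested by the TSNE import (settings are illustrative):
# project the digits to two dimensions with t-SNE, then cluster the projection.
tsne = TSNE(n_components=2, init='random', random_state=0)
digits_proj = tsne.fit_transform(digits.data)
clusters = KMeans(n_clusters=10, random_state=0).fit_predict(digits_proj)
# Map each cluster to its most common true label, as above.
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]
print('Accuracy after t-SNE projection:',
      np.round(metrics.accuracy_score(digits.target, labels), 4))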
7.14 Summary
In this chapter we present many examples related to clustering problems, which involve
unsupervised learning techniques. Since clustering is a popular task with many applications in
machine learning, this chapter is dedicated to explaining and studying it. Clustering is the process
of automatically grouping a collection of objects in such a way that similar objects end up in the
same group and dissimilar objects are placed in different groups. For instance, retailers cluster
customers on the basis of their customer profiles for the purpose of targeted marketing;
computational biologists cluster genes on the basis of similarities in their expression across
diverse studies; and astronomers cluster stars on the basis of their spatial proximity. Whereas
the learning tasks in the previous chapters focused primarily on supervised learning problems,
in this chapter we present several unsupervised machine learning algorithms for clustering.
Besides grouping unlabeled data, clustering algorithms can also be used for image segmentation,
feature extraction, and classification. We discussed the application of these algorithms in
different fields in detail.
References
Ankerst, M., Breunig, M. M., Kriegel, H.-P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2), 49–60.
Flach, P. (2012). Machine learning: The art and science of algorithms that make sense of data. Cambridge, United Kingdom: Cambridge University Press.
Gaura, J., Sojka, E., & Krumnikl, M. (2011). Image segmentation based on k-means clustering and energy-transfer proximity. Berlin: Springer, 567–577.
Murphy, K. P. (2012). Machine learning: A probabilistic perspective. Cambridge, MA: MIT Press.
Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge, United Kingdom: Cambridge University Press.
Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. KDD Workshop on Text Mining, 400, 525–526. Boston.
Sumathi, S., & Paneerselvam, S. (2010). Computational intelligence paradigms: Theory & applications using MATLAB. Boca Raton, FL: CRC Press.
Tatiraju, S., & Mehta, A. (2008). Image segmentation using k-means clustering, EM and normalized cuts. Department of EECS, 1, 1–7.
Theodoridis, S., Pikrakis, A., Koutroumbas, K., & Cavouras, D. (2010). Introduction to pattern recognition: A MATLAB approach. Cambridge, MA: Academic Press.
Watt, J., Borhani, R., & Katsaggelos, A. K. (2016). Machine learning refined: Foundations, algorithms, and applications. Cambridge, United Kingdom: Cambridge University Press.
Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data mining: Practical machine learning tools and techniques. Burlington, MA: Morgan Kaufmann.
Zheng, X., Lei, Q., Yao, R., Gong, Y., & Yin, Q. (2018). Image segmentation based on adaptive K-means algorithm. EURASIP Journal on Image and Video Processing, 2018(1), 68.