7 Clustering examples
7.1 Introduction
Clustering is one of the most commonly used exploratory data analysis methods. Across disciplines, from the social sciences to biology to computer science, people try to obtain an initial sense of their data by defining meaningful categories among the data points. For instance, retailers cluster customers on the basis of their customer profiles for targeted marketing; computational biologists cluster genes on the basis of similarities in their expression across different experiments; and astronomers cluster stars on the basis of their spatial proximity. The first question to be answered is, of course, what is clustering? Clustering is the process of grouping a collection of objects so that similar objects end up in the same group and dissimilar objects are separated into different groups. This definition is obviously rather imprecise and perhaps vague. Yet, it is not easy
to find a more accurate definition. There are several reasons for this. One fundamental prob-
lem is that in many cases the two objectives stated in the previous statement contradict one
another. Mathematically speaking, similarity (or proximity) is not a transitive relationship,
whereas cluster sharing is a relationship of equivalence, and particularly a transitive relation-
ship. More specifically, there can be a long series of objects, x1, . . . , xm, where each xi is very
similar to its two neighbors, xi–1 and xi+1, but x1 and xm are very dissimilar. If we want to make
sure that two elements share the same cluster whenever they are similar, then we have to place all the sequence elements in the same cluster. In that case, however, we end up placing dissimilar elements (x1 and xm) in the same cluster, violating the second criterion. Consider, for instance, an input consisting of points arranged along two parallel lines. A clustering algorithm that emphasizes not separating nearby points clusters this input by dividing it horizontally into the two lines, whereas a clustering approach that emphasizes keeping distant points apart clusters the same input by dividing it vertically (Shalev-Shwartz & Ben-David, 2014).
Another fundamental problem for clustering is the lack of “ground truth,” which is a com-
mon problem with unsupervised learning. We’ve been dealing primarily with supervised
learning in the book so far (e.g., the problem of learning a classifier from labeled training data). The purpose of supervised learning is simple: we want to train a classifier to predict the labels of future examples as accurately as possible. In addition, by estimating the empirical loss on the labeled training data, a supervised learner can estimate the success, or the risk, of its hypotheses. Clustering, on the other hand, is an unsupervised learning problem;
namely, we are not trying to predict any labels. Rather we want some practical way to orga-
nize the data. Hence there is no straightforward clustering performance assessment method.
In addition, it is not clear what the “correct” clustering for that data is or how to assess a
proposed clustering, even on the basis of full knowledge of the underlying data distribution
(Shalev-Shwartz & Ben-David, 2014).
7.2 Clustering
Clustering is the process of grouping related objects together. There are two types of inputs we can use. In similarity-based clustering, the input to the algorithm is an N × N dissimilarity (or distance) matrix D. In feature-based clustering, the input to the algorithm is an N × D feature matrix (or design matrix) X. Similarity-based clustering has
the advantage of allowing domain-specific similarity or kernel functions to be conveniently
included. The benefit of feature-based clustering is that it applies to “raw” data, which is
potentially noisy. Besides the two input types, there are two potential output types: flat clus-
tering, also called partition clustering, where we divide the objects into disjoint sets, and
hierarchical clustering, where a nested partition tree is formed (Murphy, 2012).
A dissimilarity matrix D is a matrix in which di,i = 0 and di,j ≥ 0 is a "distance" measure between objects i and j. In the strict sense, subjectively determined dissimilarities are seldom true distances, as the triangle inequality, di,j ≤ di,k + dj,k, often does not hold. Some algorithms require D to be a true distance matrix, but others do not. If we have a similarity matrix S, we can convert it to a dissimilarity matrix by applying any monotonically decreasing function, for example, D = max(S) − S. The most common way of describing object dissimilarity is in terms of the dissimilarity of their attributes. The squared (Euclidean) distance, city block distance, correlation coefficient, and Hamming distance are some common attribute dissimilarity functions (Murphy, 2012).
In the k-means clustering algorithm, k initial points are selected to represent the initial cluster centers, all data points are allocated to the closest center, the mean value of the points in each cluster is calculated to form its new cluster center, and the iteration continues until the cluster assignments no longer change. This procedure only works when the number of clusters is known beforehand, so this section also describes what can be done when it is not. First, we
look at strategies for “agglomeration” to construct a hierarchical clustering structure—that
is, beginning with individual instances and merging them successively into clusters. Then we look at a method that works incrementally; that is, each new instance is processed as it arrives. Finally, we investigate a statistical method of clustering based on a mixture model with several probability distributions, one for each cluster. It does not separate instances
into disjoint clusters, as does k-means, but rather assigns instances probabilistically to classes
(Witten, Frank, Hall, & Pal, 2016).
Clustering is one of human beings’ most rudimentary mental practices, used to accommo-
date the enormous amount of information we obtain each day. It would be difficult to handle
each piece of information as a single entity. Therefore, human beings appear to categorize entities (i.e., objects, individuals, events) into clusters. Each cluster is then characterized by the common attributes of the entities it contains. We must presume, as in the case of supervised learning, that all patterns are described in terms of features that form multidimensional feature vectors. The basic steps to be taken by an expert to establish a clustering task are as follows:
• Feature selection: Features should be chosen properly so as to encode as much information as possible about the task of interest. Once again, a major goal is parsimony and thus minimal redundancy of information among the features. As in supervised classification, preprocessing of the features may be needed before they are used in subsequent stages.
• Proximity measure: This measure quantifies how similar or dissimilar two feature vectors are. It is necessary to ensure that all selected features contribute equally to the proximity measure calculation and that no feature dominates the others. This should be taken care of during preprocessing.
• Clustering criterion: This criterion depends on the interpretation the expert gives to a "sensible" clustering on the basis of the type of clusters expected to underlie the dataset. The clustering criterion may be expressed through a cost function or some other type of rules.
• Clustering algorithms: This step refers to the selection of a particular algorithmic scheme that unravels the clustering structure of the dataset, having adopted a proximity measure and a clustering criterion.
• Validation of the results: Once the results have been obtained from the clustering algorithm, their correctness is checked, usually with suitable quantitative measures.
• Interpretation of the results: In many cases, the application expert must combine the clustering findings with other experimental evidence and analysis in order to draw the correct conclusions.
In a number of cases, a stage known as clustering tendency assessment should also be involved. It includes various tests that determine whether or not the available data possess a clustering structure. For instance, the dataset may be entirely random in nature, so it would be pointless to
try to unravel clusters. Different feature choices, proximity measures, clustering criteria, and
clustering algorithms might result in completely different clustering results (Theodoridis, Pi-
krakis, Koutroumbas, & Cavouras, 2010).
The major categories of clustering algorithms include the following:
• Hierarchical clustering algorithms: These methods are further divided into two subcategories.
• Agglomerative algorithms: These algorithms produce a sequence of clusterings with a decreasing number of clusters at each stage; the clustering at each stage results from the previous one by merging two clusters into one. Single-link and complete-link algorithms are the key representatives of the agglomerative algorithms and are suited to recovering elongated and compact clusters, respectively.
• Divisive algorithms: These algorithms work in the opposite direction; that is, they produce a sequence of clusterings with an increasing number of clusters at each stage. The clustering at each stage results from the previous one by splitting a single cluster into two.
• Clustering algorithms based on cost function optimization: This group includes algorithms in which a cost function, J, quantifies how "sensible" a given clustering is. The number of clusters, m, is usually kept fixed. Most of these algorithms use differential calculus concepts to optimize J and terminate when a local optimum of J is reached. Algorithms of this category are also called iterative function optimization schemes. The following subcategories are included in this category:
• Hard or crisp clustering algorithms are when a vector belongs exclusively to a particular cluster. The assignment of the vectors to individual clusters is carried out optimally according to the adopted optimality criterion. The Isodata (or Lloyd) algorithm is the most popular algorithm in this group.
• Probabilistic clustering algorithms are a special type of hard clustering algorithms that adopt Bayesian classification arguments; each vector x is assigned to the cluster Ci for which the a posteriori probability P(Ci|x) is maximum. These probabilities are estimated through an appropriately defined optimization process.
• Fuzzy clustering algorithms are when a vector belongs to a particular cluster up to a certain degree.
• Possibilistic clustering algorithms are when we measure the possibility that a vector x belongs to a cluster Ci.
• Boundary detection algorithms are when, instead of determining the clusters themselves by means of the feature vectors, the algorithms iteratively update the boundaries of the regions where clusters lie. Although these algorithms stem from cost function optimization, they differ from the algorithms described previously (Theodoridis et al., 2010).
Apart from these clustering algorithms, branch and bound clustering algorithms, genetic
clustering algorithms, stochastic relaxation methods, valley-seeking clustering algorithms,
competitive learning algorithms, morphological transformation technique–based algorithms,
density-based algorithms, subspace clustering algorithms, and kernel-based methods are
also types of clustering algorithms (Theodoridis et al., 2010).
7.3 The k-means clustering algorithm
K-means clustering begins with the definition of a cost function over a parameterized set of possible clusterings, and the objective of the clustering algorithm is to find a partitioning (clustering) of minimum cost. Under this model, the clustering task is turned into an optimization problem. The objective function is a function from pairs of an input, (X, d),
and a suggested clustering solution C = (C1, . . .,Ck) to positive real numbers. The target of a
clustering algorithm is described as finding, for a given input (X, d), a clustering C so that G((X, d), C) is minimized, where G denotes such an objective function. To achieve this goal, a suitable search algorithm must be utilized. In this sense, the term k-means clustering usually refers to a specific, commonly used approximation algorithm rather than to the cost function or to the exact solution of the corresponding minimization problem. Most common objective functions include as
a parameter the number of clusters, k. In practice, it is often up to the clustering algorithm
user to choose the parameter k that is best suited to the clustering problem. Some of the most
common objective functions are defined in the following. The k-means objective function is one of the most common objectives in clustering. It measures the sum of squared distances from each point in X to the centroid of its cluster. The k-means objective function is relevant, for instance, in digital communication tasks, where the members of X can be interpreted as a collection of signals to be transmitted. In practical clustering applications, the k-means objective function is quite common. But it turns out that it is often computationally infeasible to find the optimal solution for the k-means objective. Instead, a simple iterative algorithm is often
used, so the term k-means clustering in many cases refers to the outcome of this algorithm
rather than the clustering that minimizes the objective cost of k-means (Shalev-Shwartz &
Ben-David, 2014).
Example 7.1
The following Python code utilizes k-means clustering to find the center of the clusters of breast
cancer data by using the scikit-learn library APIs. In this example, the breast cancer dataset that
exists in sklearn.datasets is utilized. Scatter plot is presented to show the effectiveness of the al-
gorithm. The cluster centers are plotted in a scatter plot. Note that this example is adapted from
scikit-learn.
# ======================================================================
# K-means clustering example
# ======================================================================
from sklearn import datasets
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans
#%%
# ######################################################################
# Import some data to play with
Breast_Cancer = datasets.load_breast_cancer()
X = Breast_Cancer.data
y = Breast_Cancer.target
#%%
# ######################################################################
# Fit k-means with two clusters (malignant/benign) and plot the centers
kmeans = KMeans(n_clusters = 2, random_state = 0).fit(X)
plt.scatter(X[:, 0], X[:, 1], c = kmeans.labels_, s = 20, cmap = 'viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c = 'red', s = 200, marker = 'x', label = 'Cluster centers')
plt.xlabel(Breast_Cancer.feature_names[0])
plt.ylabel(Breast_Cancer.feature_names[1])
plt.legend()
plt.show()
7.4 The k-medoids clustering algorithm
Each cluster is represented by the mean of its vectors in the k-means algorithm, but the
cluster is represented by a vector selected among the elements of X in the k-medoids meth-
ods, and we will refer to it as the medoid. In addition to their medoid, each cluster includes
all vectors in X that (1) are not employed as medoids in other clusters and (2) are closer to
their medoid than those representing the other clusters. There are two benefits over the k-
means algorithm to represent clusters using medoids. First, it can be utilized for datasets
originating from either continuous or discrete domains, while k-means is only suitable for
continuous domains since the mean of a subset of data vectors is not essentially a point ly-
ing in the domain for a discrete domain context. Second, k-medoids algorithms appear to be
less sensitive than k-means algorithms to outliers. It should be remembered, however, that
a cluster’s mean has a strong geometric and statistical meaning that is not necessarily true
with medoids. Moreover, the algorithms for determining the best set of medoids require more computational power than the k-means algorithm. PAM (partitioning around
medoids), CLARA (clustering large applications), and CLARANS (clustering large applica-
tions based on randomized search) are the best-known k-medoids algorithms. Note that the last two algorithms are inspired by PAM but are more effective than PAM in
handling large datasets (Theodoridis et al., 2010).
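As a rough illustration of the medoid idea (a sketch of the medoid computation for a single cluster only, not the PAM algorithm itself), the following lines pick the cluster member that minimizes the total distance to all other members of that cluster.
# ======================================================================
# Computing the medoid of a cluster (illustrative sketch)
# ======================================================================
import numpy as np
rng = np.random.RandomState(1)
cluster = rng.randn(20, 2) + [3, 3]     # points assigned to one cluster
# Pairwise Euclidean distances between the cluster members
dists = np.sqrt(((cluster[:, None, :] - cluster[None, :, :])**2).sum(axis = -1))
# The medoid is the member with the smallest total distance to the others
medoid_index = dists.sum(axis = 1).argmin()
medoid = cluster[medoid_index]
print(medoid_index, medoid)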
Example 7.2
The following Python code utilizes k-medoids clustering to find the center of the clusters of
synthetic data and Mall_Customers data (https://www.kaggle.com/akram24/mall-customers)
by using the KMedoids clustering function. A scatter plot is presented to show the effectiveness of the algorithm, and the cluster centers are plotted in the scatter plot.
# ======================================================================
# K-medoids clustering example
# ======================================================================
from k_medoids import KMedoids
import numpy as np
import matplotlib.pyplot as plt
#Define a distance utility function
def example_distance_func(data1, data2):
    """Euclidean distance between two data points"""
    return np.sqrt(np.sum((data1 - data2)**2))
#%%
# K-Medoids Clustering using synthetic data with 3 clusters
from sklearn.datasets import make_blobs
# ######################################################################
# Generate sample data
np.random.seed(0)
batch_size = 45
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X, labels_true = make_blobs(n_samples = 300, centers = centers,
                            cluster_std = 0.7)
7.5 Hierarchical clustering
Hierarchical clustering algorithms produce a hierarchy of nested clusterings, which can be visualized by means of a proximity (similarity) dendrogram. This dendrogram can be utilized at any stage as an indicator of natural or forced cluster formation. Similarly, a suitable level for cutting the dendrogram related to the resulting hierarchy must be determined (Theodoridis et al., 2010).
Example 7.3
The following Python code utilizes agglomerative clustering to cluster the customers as Careful,
Standard, Target, Careless, and Sensible using Mall_Customers data (https://www.kaggle.com/
akram24/mall-customers) and standard scikit-learn library APIs. Customers’ dendrogram is plot-
ted against the Euclidean distance. In addition, the clusters are plotted in a scatter plot to show five
different customer groups. Note that this example is adapted from the web page (https://www.
kdnuggets.com/2019/09/hierarchical-clustering.html).
# ======================================================================
# Agglomerative clustering example
# ======================================================================
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Mall_Customers.csv')
#%%
" " "Out of all the features, CustomerID and Genre are irrelevant fields
and can be dropped and create a matrix of independent variables by select
only Age and Annual Income." " "
X = dataset.iloc[:, [3, 4]].values
import scipy.cluster.hierarchy as sch
dendrogrm = sch.dendrogram(sch.linkage(X, method = 'ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distance')
plt.show()
#%%
from sklearn.cluster import AgglomerativeClustering
hc = AgglomerativeClustering(n_clusters = 5, affinity = 'euclidean',
linkage = 'ward')
y_hc = hc.fit_predict(X)
# Visualising the clusters
plt.scatter(X[y_hc == 0, 0], X[y_hc == 0, 1], s = 50, c = 'red', label =
'Careful')
plt.scatter(X[y_hc == 1, 0], X[y_hc == 1, 1], s = 50, c = 'blue', label =
'Standard')
plt.scatter(X[y_hc == 2, 0], X[y_hc == 2, 1], s = 50, c = 'green', label =
'Target')
plt.scatter(X[y_hc == 3, 0], X[y_hc == 3, 1], s = 50, c = 'cyan', label = 'Careless')
plt.scatter(X[y_hc == 4, 0], X[y_hc == 4, 1], s = 50, c = 'magenta', label = 'Sensible')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
Example 7.4
The following Python code utilizes agglomerative clustering to group the customers using Mall_
Customers data (https://www.kaggle.com/akram24/mall-customers) and standard scikit-learn li-
brary APIs. Scatter plot is presented to show the effectiveness of the algorithm. In this example, we
present the effect of imposing a connectivity graph to capture local structure in the customer data. It
is possible to see two implications of imposing a connectivity. First, clustering is much quicker with a connectivity matrix. Second, when using a connectivity matrix, single, average, and complete linkage are unstable and tend to create a few clusters that grow very fast. Without the constraint, average and complete linkage counteract this percolation behavior by considering all the distances between two clusters when merging them (whereas single linkage exaggerates the behavior by considering only the shortest distance between clusters). The connectivity graph breaks this mechanism for average and complete linkage, making them resemble the more fragile single linkage. Having a very small number of neighbors in the graph imposes a geometry similar to that of single linkage, which is well known for having this percolation instability. This is presented in this example. Note that this example is adapted
from scikit-learn.
# Create a graph capturing local connectivity. A larger number of neighbors
# gives more evenly distributed cluster sizes, but may not impose the local
# manifold structure of the data
from sklearn.neighbors import kneighbors_graph
knn_graph = kneighbors_graph(X, 30, include_self = False)
In divisive hierarchical clustering it is computationally prohibitive to consider all possible cluster partitions, so the burden is reduced by ruling out, under a preset criterion, several partitions as not reasonable. In the algorithms discussed so far, the division of a cluster is based on all the features (coordinates) of the feature vectors; these are known as polythetic algorithms. On the other hand, there are divisive algorithms that at each stage achieve a cluster division based on a single feature; these are known as monothetic algorithms (Theodoridis et al., 2010).
Divisive clustering begins with all the data in a single cluster and then, in a top-down
manner, splits each cluster into two daughter clusters. Since there are 2^(N−1) − 1 ways to divide a group of N items into two groups, it is hard to compute the optimal split, hence several
heuristics are utilized. One approach is to pick the largest diameter cluster and divide it into
two using the k-means or k-medoids algorithm with K = 2. This is known as the bisecting k-
means algorithm (Steinbach, Karypis, & Kumar, 2000). We can repeat this until we have any
number of clusters desired. This can be utilized as an alternative to standard k-means, but a
hierarchical clustering is also induced. Another strategy is to construct from the dissimilar-
ity graph a minimum spanning tree and then make new clusters by breaking the connection
related to the largest dissimilarity. Divisive clustering is less common than agglomerative clustering, but it has two benefits. First, it can be quicker, because it takes only O(N) time if we split for a constant number of levels. Second, the splitting decisions are made in view of all the data, while the bottom-up methods make myopic merge decisions (Murphy, 2012).
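The bisecting strategy just described can be sketched in a few lines (illustrative only; synthetic data from make_blobs and scikit-learn's KMeans are assumed): starting from a single cluster, the cluster with the largest within-cluster scatter is repeatedly split into two with k-means.
# ======================================================================
# Bisecting k-means sketch (divisive clustering)
# ======================================================================
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples = 300, centers = 4, random_state = 0)
clusters = [X]                        # start with all the data in one cluster
desired_k = 4
while len(clusters) < desired_k:
    # Pick the cluster with the largest within-cluster sum of squares
    scatters = [((c - c.mean(axis = 0))**2).sum() for c in clusters]
    to_split = clusters.pop(int(np.argmax(scatters)))
    # Split it into two daughter clusters with k-means (K = 2)
    labels = KMeans(n_clusters = 2, random_state = 0).fit_predict(to_split)
    clusters.append(to_split[labels == 0])
    clusters.append(to_split[labels == 1])
print([len(c) for c in clusters])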
Example 7.5
The following Python code utilizes divisive clustering to plot a dendrogram of amino acid se-
quence of human genes. In this example, amino acid sequence of human genes is utilized. The
dendrogram plot presents the effectiveness of the algorithm. Note that this example is adapted
from github (https://github.com/ronak-07/Divisive-Hierarchical-Clustering). A phylogenetic tree
or evolutionary tree is a branching diagram or “tree” displaying the implied evolutionary relations
between different biological species based upon similarities and differences in their physical or ge-
netic characteristics. The goal of this example is to build the phylogenetic tree based on DNA/pro-
tein sequences of species given in the dataset employing divisive (top-down) hierarchical clustering.
# ======================================================================
# Divisive clustering
# ======================================================================
import numpy as np
import scipy
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
global g
import time
# ======================================================================
# Define utility functions
# ======================================================================
def subtract(indices,splinter):
    # return the members of indices that are not in the splinter group
    return [i for i in indices if i not in splinter]
def divisive(a, indices, splinter, sub):
    if(len(indices) == 1):
        return
    avg = []
    flag = 0
    for i in indices:
        if(i not in splinter):
            sum = 0
            for j in indices:
                if(j not in splinter):
                    sum = sum + a[i][j]
            if((len(indices)-len(splinter)-1) == 0):
                avg.append(sum)
            else:
                avg.append(sum/(len(indices)-len(splinter)-1))
    if(splinter):
        # subtract the average distance to the splinter group; a positive
        # value means the object is closer to the splinter group than to
        # the rest of its cluster
        k = 0
        for i in sub:
            total = 0
            for j in splinter:
                total = total + a[i][j]
            avg[k] = avg[k] - (total/(len(splinter)))
            k += 1
        positive = []
        for i in range(0, len(avg)):
            if(avg[i] > 0):
                positive.append(avg[i])
                flag = 1
        if(flag == 1):
            splinter.append(sub[avg.index(max(positive))])
            sub.remove(sub[avg.index(max(positive))])
            divisive(a, indices, splinter, sub)
    else:
        # the object with the largest average dissimilarity starts the splinter group
        splinter.append(indices[avg.index(max(avg))])
        sub[:] = subtract(indices, splinter)
        divisive(a, indices, splinter, sub)
def original_subset(indices):
    # extract the sub-matrix of pairwise distances for the given indices
    sp = np.zeros(shape = (len(indices), len(indices)))
    for i in range(0, len(indices)):
        for j in range(0, len(indices)):
            sp[i][j] = a[indices[i]][indices[j]]
    return sp

def original_max(x):
    new = original_subset(x)
    return new.max()

def diameter(l):
    # the diameter of a cluster is its largest pairwise distance
    return original_max(l)
def recursive(a, indices, u, v, clusters, g):
    clus_s.append(len(indices))
    d.append(diameter(indices))
    parents[g] = indices
    g -= 1
    divisive(a, indices, u, v)
    clusters.append(u)
    clusters.append(v)
    new = []
    for i in range(len(clusters)):
        new.append(clusters[i])
    final.append(new)
    x = []
    y = []
    store_list = []
    max = -1
    f = 0
    # pick the cluster with the largest diameter for the next split
    for list in clusters:
        if(diameter(list) > max):
            if(len(list) != 1):
                f = 1
                max = diameter(list)
                store_list = (list)
    if(f == 0):
        return
    else:
        clusters.remove(store_list)
        recursive(a, store_list, x, y, clusters, g)
# ======================================================================
# Main program
# ======================================================================
a = np.load('distance_matrix.npy')
size = len(a)
g = (size-1)*2
parents = {}
final = []
clusters = []
indices = []
clus_s = []
d = []
Z = np.zeros(shape = (size-1,4))
p = []
q = []
ans = []
for i in range(0, len(a)):
    indices.append(i)
for i in range(0, size):
    list = []
    list.append(i)
    parents[i] = list
start = time.time()
recursive(a, indices, p, q, clusters, g)
print("Clustering done\t" + str(time.time()-start))
for i in range(0, len(d)):
    Z[size-i-2][2] = d[i]
    Z[size-i-2][3] = clus_s[i]
for i in range(len(final)-1, 0, -1):
    for j in range(0, len(final[i-1])):
        if final[i-1][j] not in final[i]:
            ans.append(final[i-1][j])
ans.append(indices)
for i in range(0, len(ans)):
    if(len(ans[i]) <= 2):
        Z[i][0] = ans[i][0]
        Z[i][1] = ans[i][1]
    else:
        s = 0
        add = []
        common = []
        # find the largest previously formed cluster contained in ans[i]
        for j in range(len(ans)-1, -1, -1):
            if(set(ans[j]) < set(ans[i])):
                common = ans[j]
                break
        x = (subtract(ans[i], common))
        for key in parents.keys():
            if(parents[key] == common):
                Z[i][0] = key
                break
        for key in parents.keys():
            if(set(parents[key]) == set(x)):
                Z[i][1] = key
                s = 1
                break
        if(s == 0):
            print(Z[i][0], Z[i][1], x)
names = [i for i in range(0,size)]
#%%
# ======================================================================
# Plot dendrogram of divisive clustering
# ======================================================================
plt.figure(figsize = (15, 15))
plt.title('Hierarchical Clustering Dendrogram (Divisive)')
plt.xlabel('Sequence No.')
plt.ylabel('Distance')
# the original repository uses an augmented_dendrogram helper; SciPy's
# standard dendrogram is used here so that the listing is self-contained
hierarchy.dendrogram(Z, labels = names, show_leaf_counts = True,
                     p = 25, truncate_mode = 'lastp')
plt.show()
7.6 The fuzzy c-means clustering algorithm
One of the challenges related to probabilistic algorithms is the choice of the pdfs, for which a suitable model must be assumed. In addition, it is not easy to handle cases where the clusters are not compact but are shell shaped. Fuzzy clustering algorithms are a family of clustering algorithms that free themselves from such constraints. Over the past three
decades, these methods have been the focus of intensive research. The main point differen-
tiating the two methods is that a vector belongs to more than one cluster simultaneously in
the fuzzy schemes, whereas each vector belongs exclusively to one cluster in the probabilistic
schemes. The number of clusters and their shape are presumed to be known a priori. The clus-
ter shape is defined by the set of parameters adopted. The majority of the well-known fuzzy
clustering algorithms are developed by minimizing a cost function (Theodoridis et al., 2010).
The extensively studied and implemented fuzzy c-means (FCM) clustering algorithm re-
quires a priori knowledge of the number of clusters. FCM expects the desired number of clusters and an initial guess of the position of each cluster center, and the output strongly depends on the selection of these initial values. The FCM algorithm iteratively generates an appropriate cluster pattern by minimizing an objective function that depends on the cluster locations. It is also possible to determine the number and initial positions of the cluster centers automatically through search techniques such as the mountain clustering method. By evaluating a search measure called the mountain function at each grid point, this approach considers every grid point as a potential cluster center. Subtractive clustering is a related method with reduced computational effort, in which the data points themselves, rather than grid points, are viewed as candidates for cluster centers; with this approach the computation is simply proportional to the number of data points and independent of the dimension of the problem. In this process, a data point with high potential, which is a function of its distances to the other points, is selected as a cluster center, and data points close to a new cluster center are penalized in order to control the emergence of further cluster centers. The points lying between cluster centers can be considered to have a gradual membership in both clusters; this is accommodated, of course, by softening the crisp distinction between belonging and not belonging to a cluster. The fuzzified c-means algorithm enables each
data point to belong to a cluster to a degree defined by a membership grade, thereby allow-
ing each point to belong to several clusters. The fuzzy c-means algorithm partitions a set of
K data points identified as m-dimensional vectors into c fuzzy clusters and finds a cluster
center in each cluster to minimize an objective function. Fuzzy c-means is different from hard
c-means, mostly as it uses fuzzy partitioning, where a point can belong to numerous clusters
with membership degrees. The membership matrix M is allowed to have elements in the
range [0, 1] to satisfy the fuzzy partitioning. Nonetheless, to maintain the properties of the M
matrix, the total membership of all clusters of a point must always be equal to unity (Sumathi
& Paneerselvam, 2010).
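The following illustrative sketch (plain NumPy, synthetic two-dimensional data, fuzzifier m = 2 chosen arbitrarily) spells out the two fuzzy c-means update equations: the cluster centers are membership-weighted means of all points, and the memberships are recomputed from the distances so that each row of the membership matrix sums to one.
# ======================================================================
# Fuzzy c-means update equations (illustrative sketch)
# ======================================================================
import numpy as np
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [4, 4]])
c, m = 2, 2.0                                   # number of clusters, fuzzifier
U = rng.dirichlet(np.ones(c), size = len(X))    # memberships, each row sums to 1
for _ in range(20):
    # Center update: membership-weighted means of all data points
    Um = U**m
    centers = (Um.T @ X) / Um.sum(axis = 0)[:, None]
    # Membership update from the distances to the centers
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis = -1) + 1e-12
    U = 1.0 / (d**(2/(m-1)) * (1.0/d**(2/(m-1))).sum(axis = 1, keepdims = True))
print(U[:5])   # fuzzy memberships of the first five points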
Example 7.6
The following Python code utilizes fuzzy c-means clustering algorithm to find the center of the
clusters of Iris dataset. In this example, the Iris dataset that exists in sklearn.datasets is utilized. Scat-
ter plot is presented to show the effectiveness of the algorithm, and the cluster centers are plotted in the scatter plot. Note that this example is adapted from the web page (https://github.com/omadson/fuzzy-c-means). In order to use the library you should run "pip install fuzzy-c-means" or download it from the web page (https://pypi.org/project/fuzzy-c-means/).
# ======================================================================
# Fuzzy c-means clustering example
# ======================================================================
7.7 Density-based clustering algorithms
Clusters are viewed in this context as regions of the feature space that are "dense" in
points of X. Many density-based algorithms do not place any restrictions on the form of the
resulting clusters. Therefore, these algorithms are capable of recovering arbitrarily shaped
clusters. They can also handle outliers effectively. In addition, these algorithms' time complexity is relatively low, making them suitable for processing large datasets. DBSCAN, DBCLASD, DENCLUE,
and OPTICS are the popular density-based algorithms. While these algorithms share the same
basic philosophy, they differ in the quantification of the density (Theodoridis et al., 2010).
Example 7.7
The following Python code utilizes DBSCAN clustering algorithm to find the clusters by using
the scikit-learn library APIs. In this example, synthetic data are utilized. A scatter plot is presented
to show the effectiveness of the algorithm. Different measures are also calculated. Note that this
example is adapted from scikit-learn.
# ======================================================================
# DBSCAN clustering
# ======================================================================
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
# ######################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples = 750, centers = centers,
                            cluster_std = 0.4, random_state = 0)
X = StandardScaler().fit_transform(X)
# ######################################################################
# Compute DBSCAN
db = DBSCAN(eps = 0.3, min_samples = 10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype = bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
# ######################################################################
# Plot the result: core samples as large dots, noise points in black
import matplotlib.pyplot as plt
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor = tuple(col), markersize = 10)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor = tuple(col), markersize = 4)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
Example 7.8
The following Python code utilizes OPTICS clustering algorithm to find the clusters by using the
scikit-learn library APIs. This example employs synthetic data, which is generated so that the clus-
ters have different densities. The class sklearn.cluster.OPTICS is first utilized with its Xi cluster de-
tection method and then we set specific thresholds on the reachability that is related to class sklearn.
cluster.DBSCAN. We can see that the different clusters of OPTICS’s Xi method can be recovered
with different choices of thresholds in DBSCAN. Reachability plot and scatter plot are presented to
show the effectiveness of the algorithm. Note that this example is adapted from scikit-learn.
# ======================================================================
# Optics clustering example
# ======================================================================
# Authors: Shane Grigsby <[email protected]>
# Adrin Jalali <[email protected]>
# License: BSD 3 clause
from sklearn.cluster import OPTICS, cluster_optics_dbscan
import matplotlib.gridspec as gridspec
import matplotlib.pyplot as plt
import numpy as np
# Generate sample data with clusters of different densities
np.random.seed(0)
n_points_per_cluster = 250
C1 = [-5, -2] + .8 * np.random.randn(n_points_per_cluster, 2)
C2 = [4, -1] + .1 * np.random.randn(n_points_per_cluster, 2)
C3 = [1, -2] + .2 * np.random.randn(n_points_per_cluster, 2)
C4 = [-2, 3] + .3 * np.random.randn(n_points_per_cluster, 2)
C5 = [3, -2] + 1.6 * np.random.randn(n_points_per_cluster, 2)
C6 = [5, 6] + 2 * np.random.randn(n_points_per_cluster, 2)
X = np.vstack((C1, C2, C3, C4, C5, C6))
# Fit OPTICS and extract DBSCAN-style clusterings at two epsilon cuts
clust = OPTICS(min_samples = 50, xi = .05, min_cluster_size = .05)
clust.fit(X)
labels_050 = cluster_optics_dbscan(reachability = clust.reachability_,
                                   core_distances = clust.core_distances_,
                                   ordering = clust.ordering_, eps = 0.5)
labels_200 = cluster_optics_dbscan(reachability = clust.reachability_,
                                   core_distances = clust.core_distances_,
                                   ordering = clust.ordering_, eps = 2)
plt.figure(figsize = (10, 7))
G = gridspec.GridSpec(2, 3)
ax1 = plt.subplot(G[0, :])
ax2 = plt.subplot(G[1, 0])
ax3 = plt.subplot(G[1, 1])
ax4 = plt.subplot(G[1, 2])
space = np.arange(len(X))
reachability = clust.reachability_[clust.ordering_]
labels = clust.labels_[clust.ordering_]
# ======================================================================
# Reachability plot
# ======================================================================
colors = ['g.', 'r.', 'b.', 'y.', 'c.']
for klass, color in zip(range(0, 5), colors):
    Xk = space[labels == klass]
    Rk = reachability[labels == klass]
    ax1.plot(Xk, Rk, color, alpha = 0.3)
ax1.plot(space[labels == -1], reachability[labels == -1], 'k.', alpha = 0.3)
ax1.plot(space, np.full_like(space, 2., dtype = float), 'k-', alpha = 0.5)
ax1.plot(space, np.full_like(space, 0.5, dtype = float), 'k-.', alpha = 0.5)
ax1.set_ylabel('Reachability (epsilon distance)')
ax1.set_title('Reachability Plot')
# ======================================================================
# Plot OPTICS clustering results
# ======================================================================
colors = ['g.', 'r.', 'b.', 'y.', 'm.']
for klass, color in zip(range(0, 5), colors):
    Xk = X[clust.labels_ == klass]
    ax2.plot(Xk[:, 0], Xk[:, 1], color, alpha = 0.3)
ax2.plot(X[clust.labels_ == -1, 0], X[clust.labels_ == -1, 1], 'k+',
         alpha = 0.1)
ax2.set_title('Automatic Clustering\nOPTICS')
# ======================================================================
# Plot DBSCAN at 0.5 clustering results
# ======================================================================
colors = ['r', 'greenyellow', 'olive', 'g', 'b', 'c']
for klass, color in zip(range(0, 6), colors):
    Xk = X[labels_050 == klass]
    ax3.plot(Xk[:, 0], Xk[:, 1], color, alpha = 0.3, marker = '.')
ax3.plot(X[labels_050 == -1, 0], X[labels_050 == -1, 1], 'k+',
         alpha = 0.1)
ax3.set_title('Clustering at 0.5 epsilon cut\nDBSCAN')
# ======================================================================
# Plot DBSCAN at 2. clustering results
# ======================================================================
colors = ['r.', 'm.', 'y.', 'c.']
for klass, color in zip(range(0, 4), colors):
    Xk = X[labels_200 == klass]
    ax4.plot(Xk[:, 0], Xk[:, 1], color, alpha = 0.3)
ax4.plot(X[labels_200 == -1, 0], X[labels_200 == -1, 1], 'k+',
         alpha = 0.1)
ax4.set_title('Clustering at 2.0 epsilon cut\nDBSCAN')
plt.tight_layout()
plt.show()
7.8 The expectation maximization for Gaussian mixture model clustering
The problem is that we do not know either of the following: the distribution from which
each training instance came or the five parameters of the mixture model. We are therefore
adopting the technique used for the k-means clustering algorithm and iterating. Beginning with initial guesses for the five parameters, we utilize them to compute each instance's cluster probabilities, we utilize these probabilities to reestimate the parameters, and then we repeat. This is called the expectation maximization (EM) algorithm. The first step—calcu-
lating the probabilities of the cluster, which are the “expected” class values—is “expectation";
the second step, calculating the distribution parameters, is “maximizing” the probability of
the distributions given the available data (Witten et al., 2016).
Now that we have seen the Gaussian mixture model for two distributions, let us consider how to apply it to more realistic situations. It is quite straightforward to extend the al-
gorithm from two-class problems to multiclass problems as long as the number k of normal
distributions is given in advance. The model can easily be extended to multiple attributes
from a single numeric attribute per instance as long as it is assumed that attributes are inde-
pendent. The probabilities are multiplied for each attribute to obtain the joint probability for
the instance. The independence assumption no longer holds when the dataset is known in advance to contain correlated attributes. Instead, a bivariate normal distribution can model
two attributes together, each having its own mean value, but the two standard deviations are
replaced by a “covariance matrix” with four numeric parameters. Standard statistical tech-
niques are available to estimate instance class probabilities and to estimate the mean and
covariance matrix, provided the instances and their class probabilities. A multivariate distri-
bution can accommodate multiple correlated attributes. The number of parameters increases
with the square of the number of jointly varying attributes. Expectation—calculating the clus-
ter to which each instance belongs, provided the parameters of the distribution—is just like
evaluating an unknown instance’s class. Maximization—estimating the parameters from the
classified instances—is just like evaluating the probabilities of the attribute-value from the
training instances, with the minor distinction being allocated probabilistically rather than
categorically to classes in the EM algorithm instances (Witten et al., 2016).
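A bare-bones illustrative sketch of these two steps for a one-dimensional mixture of two Gaussians is given below (synthetic data and simple initial guesses are assumed; a library implementation, as in the next example, would normally be used).
# ======================================================================
# EM for a two-component 1D Gaussian mixture (illustrative sketch)
# ======================================================================
import numpy as np
from scipy.stats import norm
rng = np.random.RandomState(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1.5, 200)])
# Initial guesses for the five parameters: two means, two std devs, one weight
mu = np.array([x.min(), x.max()])
sigma = np.array([1.0, 1.0])
pi = 0.5                                   # weight of the first component
for _ in range(50):
    # E step: expected cluster probabilities (responsibilities) per instance
    p1 = pi * norm.pdf(x, mu[0], sigma[0])
    p2 = (1 - pi) * norm.pdf(x, mu[1], sigma[1])
    r = p1 / (p1 + p2)
    # M step: re-estimate the parameters from the weighted instances
    mu = np.array([(r * x).sum() / r.sum(), ((1 - r) * x).sum() / (1 - r).sum()])
    sigma = np.array([np.sqrt((r * (x - mu[0])**2).sum() / r.sum()),
                      np.sqrt(((1 - r) * (x - mu[1])**2).sum() / (1 - r).sum())])
    pi = r.mean()
print(mu, sigma, pi)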
Example 7.9
The following Python code utilizes Gaussian mixture models (GMM) clustering algorithm to
find the clusters by using the scikit-learn library APIs. This example employs Iris dataset. Although
GMMs are generally employed for clustering, we can compare the found clusters with the actual classes from the dataset. Predicted labels are plotted on both training and held-out test data using a variety of GMM covariance types on the Iris dataset. GMMs with spherical, diagonal, full, and tied covariance matrices are compared in increasing order of performance. Although the full covariance is expected to achieve the best performance in general, it is prone to overfitting on small
datasets and does not generalize well to held-out test data. On the plots, training data is shown as
dots, while test data is shown as crosses. Although the Iris dataset is four-dimensional, just the first
two dimensions are shown here. Note that this example is adapted from scikit-learn.
# ======================================================================
# GMM clustering
# ======================================================================
# Author: Ron Weiss <[email protected] > , Gael Varoquaux
# Modified by Thierry Guillemot <[email protected]>
# License: BSD 3 clause
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedKFold
iris = datasets.load_iris()
# Break up the dataset into non-overlapping training (75%) and testing
# (25%) sets.
skf = StratifiedKFold(n_splits = 4)
# Only take the first fold.
train_index, test_index = next(iter(skf.split(iris.data, iris.target)))
X_train = iris.data[train_index]
y_train = iris.target[train_index]
X_test = iris.data[test_index]
y_test = iris.target[test_index]
n_classes = len(np.unique(y_train))
There’s a problem, though, in GMM—overfitting. If we are not sure which attributes de-
pend on each other, why not be on the safe side and decide that all the attributes are covari-
ant? The reason is that the more parameters there are, the greater the probability of overfitting
the resulting structure to the training data, and covariance significantly increases the number
of parameters. Throughout machine learning, the problem of overfitting arises, and probabi-
listic clustering is no exception. There are two ways it can occur: by defining too many clus-
ters and by specifying too many parameters for distributions. The extreme case of too many
clusters happens when there is one for each data point—then it is clear that the training data
will be overfitted. In addition, problems will arise in the EM-based GMM whenever any of the normal distributions becomes so narrow that the cluster is centered on just one data point. Implementations therefore usually insist that clusters contain at least two different data values. The problem of overfitting also occurs when there are too many parameters. If you are not sure which attributes are covariant, you may try out different possibilities and select the one that maximizes the overall likelihood of the data given the clustering that is found. However, the more parameters there are, the larger the overall likelihood will tend to be, not necessarily because of better clustering but because of overfitting. The more parameters to play with, the simpler it
is to find a seemingly good clustering. It would be good to penalize the model for introducing
new parameters. Recently, complete Bayesian hierarchical clustering techniques have been
developed that generate a distribution of probability over possible hierarchical structures
representing a dataset as output. One of the main ways to do this is to follow a Bayesian
approach where each parameter has a prior distribution of probability. Therefore whenever
a new parameter is added, it is important to integrate its prior probability into the overall
probability figure. Since this involves multiplying the overall likelihood by a number less than 1 (the prior probability), it will automatically penalize the addition of new parameters.
The added parameters will have to yield a gain that outweighs this cost in order to enhance the overall probability. AutoClass is a comprehensive Bayesian clustering scheme that uses the finite mixture model with prior distributions on all the parameters. It enables both numeric
and nominal attributes and uses the EM algorithm to estimate the probability distribution
parameters in order to best fit the data. Since there is no guarantee that the EM algorithm will
converge to the optimum global, the process will be repeated for several different initial value
sets. AutoClass considers various cluster numbers and can consider the various covariance
quantities and different types of underlying distribution of probability for numeric attributes
(Witten et al., 2016).
Example 7.10
The following Python code compares two Gaussian mixture model (GMM) clustering algo-
rithms. It plots the confidence ellipsoids of a mixture of two Gaussians generated by expectation
maximization (“GaussianMixture”) and variational inference (“BayesianGaussianMixture” with a
Dirichlet process prior). This example employs synthetic data, which is generated so that the clus-
ters have different densities. Both models have access to five components with which to fit the data.
Note that the expectation maximization model necessarily uses all five components, while the variational inference model effectively uses only as many as are needed for a good fit. It can also be seen that the expectation maximization model splits some clusters across components arbitrarily, since it is trying to fit too many components, whereas the Dirichlet process model adjusts the number of components automatically. Note that this example is adapted from scikit-learn.
# ======================================================================
# Comparison of Gaussian mixture models with EM and Bayesian
# ======================================================================
import itertools
import numpy as np
from scipy import linalg
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn import mixture
plt.xlim(-9., 5.)
plt.ylim(-3., 6.)
plt.xticks(())
plt.yticks(())
plt.title(title)
7.10 Silhouette analysis
How can we detect poor-quality output from a clustering algorithm? The silhouette is a useful technique. A silhouette plot sorts and plots the silhouette value s(x) of each example, grouped by cluster. Squared Euclidean distance is often utilized in the construction of the silhouette, but the approach can be extended to other distance metrics. Comparing silhouette plots of two clusterings of the same data makes it clear when one clustering is much stronger than the other. In addition to the graphical representation, we can compute the average silhouette values per cluster and over the entire dataset (Flach, 2012). Silhouette analysis can be utilized to examine the amount of separation be-
tween the clusters. The silhouette plot shows how close each point in a cluster is to points in
the neighboring clusters and thus provides a way to visually determine parameters such as
the number of clusters. This measure has a range of [-1, 1]. Silhouette coefficients close to + 1
imply that the sample is far from neighboring clusters. A value of 0 means that the sample is
on or very close to the decision boundary between two neighboring clusters, and negative
values suggest that the samples may have been allocated to the wrong cluster.
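For a single example x, the silhouette value is s(x) = (b(x) − a(x)) / max(a(x), b(x)), where a(x) is the mean distance from x to the other members of its own cluster and b(x) is the mean distance to the members of the nearest other cluster. The short sketch below (synthetic blobs and an arbitrary choice of three clusters) computes these values with scikit-learn.
# ======================================================================
# Computing silhouette values (illustrative sketch)
# ======================================================================
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score
X, _ = make_blobs(n_samples = 300, centers = 3, random_state = 0)
labels = KMeans(n_clusters = 3, random_state = 0).fit_predict(X)
s = silhouette_samples(X, labels)          # s(x) for every example
print("average silhouette:", silhouette_score(X, labels))
for k in range(3):                          # per-cluster average silhouette
    print("cluster", k, ":", s[labels == k].mean())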
Example 7.11
The following Python code utilizes k-means clustering to find the silhouettes of marketing data
by using the scikit-learn library APIs. This example employs marketing data to see the silhouettes
and cluster centers. The silhouettes and cluster centers are plotted for different cases. Note that this
example is taken from scikit-learn.
In this example the silhouette analysis is utilized to select an optimal value for “n_clusters.” The sil-
houette plot shows that the "n_clusters" values of 2, 3, and 6 are bad picks for the given data due to the
presence of clusters with below average silhouette scores and also due to wide fluctuations in the size
of the silhouette plots. Silhouette analysis is more ambivalent in deciding between 4 and 5. Also from
the thickness of the silhouette plot the cluster size can be visualized. The silhouette plot for cluster 0
when “n_clusters” is equal to 2 is bigger in size due to the grouping of the 3 subclusters into one big
cluster. Nevertheless, when the “n_clusters” are equal to 4 or 5, all the plots are more or less of similar
thickness and hence are of similar sizes as can be confirmed from the labeled scatter plot on the right.
# ======================================================================
# Selecting the number of clusters with silhouette analysis on k-means clustering
# ======================================================================
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np
import pandas as pd
# Import the Mall Customers dataset by pandas
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3,4]].values
range_n_clusters = [2, 3, 4, 5, 6]
for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    # Cluster the data and compute the silhouette values
    clusterer = KMeans(n_clusters = n_clusters, random_state = 10)
    cluster_labels = clusterer.fit_predict(X)
    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the
    # formed clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
    sample_silhouette_values = silhouette_samples(X, cluster_labels)
    # Label the silhouette plots with their cluster numbers at the middle
    ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    # Compute the new y_lower for next plot
    y_lower = y_upper + 10  # 10 for the 0 samples
    # The vertical line for average silhouette score of all the values
    ax1.axvline(x = silhouette_avg, color = "red", linestyle = "--")
7.11 Image segmentation with clustering
Images are known as one of the most significant ways of transmitting information. A cru-
cial aspect of machine learning is to understand images and extract the information from
them so that the knowledge can be used for certain tasks. The use of images for robotic navi-
gation would be an example. Other applications such as extracting malignant tissues from
body scans and so on are an integral part of medical diagnosis. One of the first steps toward
recognizing images is to segment them and find the various objects in them. Features such as histogram plots and frequency-domain transforms can be used to do this (Tatiraju & Mehta, 2008).
In image recognition and computer vision, image segmentation is an important prepro-
cessing procedure. Image segmentation corresponds to the decomposition of an image into a number of nonoverlapping meaningful regions with homogeneous attributes. Image segmentation is
a crucial technique in digital image processing, and segmentation accuracy directly affects
follow-up tasks’ effectiveness. The current segmentation techniques have achieved many
successes to varying degrees, considering their complexity and difficulty, but research on
this dimension still faces many problems. Cluster analysis algorithms split a dataset into different groups according to a certain criterion, so they have broad application in the segmentation of images. Image segmentation, as one of the main digital image processing techniques, coupled with relevant domain knowledge, is commonly used for machine vision, facial recognition, fingerprint recognition, traffic control systems, object tracking in satellite images (roads, woods, etc.), pedestrian detection, medical imaging, and many other areas and is
worth exploring in-depth (Zheng, Lei, Yao, Gong, & Yin, 2018).
Since image segmentation plays a crucial role in many applications of image processing, several algorithms for image segmentation have been developed in recent decades. But better algorithms are continuously being pursued, as image segmentation is a challenging problem whose solution strongly affects the subsequent image-processing steps. Clustering algorithms were not originally developed for image processing; the computer vision community adopted them for image segmentation. For example, the k-means algorithm needs a priori knowledge of the number of clusters (k) into which the pixels are to be grouped. Each pixel of the image is repeatedly and iteratively allocated to the cluster whose centroid is nearest to it, and the centroid of each cluster is then recomputed from the pixels allocated to it. Both the choice of pixel membership in the clusters and the computation of the centroids are based
on calculating distances. The Euclidean distance is most commonly utilized since it is simple
to calculate. The problem is that using the Euclidean distance may lead to errors in the final seg-
mentation of the image (Gaura, Sojka, & Krumnikl, 2011).
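Before the spectral clustering example that follows, the short sketch below illustrates the pixel-level k-means segmentation described above (illustrative only; it uses the coins image from skimage.data and scikit-learn's KMeans with an arbitrary choice of three clusters): the image is reshaped into a list of pixel intensity values, the pixels are clustered, and each pixel is replaced by its cluster centroid.
# ======================================================================
# Image segmentation with k-means on pixel values (illustrative sketch)
# ======================================================================
import matplotlib.pyplot as plt
from skimage.data import coins
from sklearn.cluster import KMeans
img = coins()                                   # grayscale image
pixels = img.reshape(-1, 1).astype(float)       # one row per pixel
kmeans = KMeans(n_clusters = 3, random_state = 0).fit(pixels)
segmented = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)
plt.figure(figsize = (8, 4))
plt.subplot(1, 2, 1); plt.imshow(img, cmap = 'gray'); plt.title('Original')
plt.subplot(1, 2, 2); plt.imshow(segmented, cmap = 'gray'); plt.title('k-means segments')
plt.show()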
Example 7.12
The following Python code is utilized for segmenting the images of Greek coins in regions by
using the scikit-learn library APIs. In this example the coins dataset, which exists in skimage.data,
is utilized. This example utilizes “spectral_clustering” on a graph created from voxel-to-voxel dif-
ference on an image to divide this image into multiple, partly homogeneous regions. This process
(spectral clustering on an image) is an effective approximate solution for finding normalized graph
cuts. There are two options to assign labels:
• "K-means" spectral clustering will cluster samples in the embedding space employing a k-means algorithm.
• "Discrete" will iteratively search for the closest partition space to the embedding space.
# ======================================================================
# Image segmentation with clustering
# ======================================================================
# Author: Gael Varoquaux <[email protected] > , Brian Cheung
# License: BSD 3 clause
import time
import numpy as np
from distutils.version import LooseVersion
from scipy.ndimage.filters import gaussian_filter
import matplotlib.pyplot as plt
import skimage
from skimage.data import coins
from skimage.transform import rescale
from sklearn.feature_extraction import image
from sklearn.cluster import spectral_clustering
# Load the coins image, smooth it, and rescale it to speed up processing
orig_coins = coins()
smoothened_coins = gaussian_filter(orig_coins, sigma = 2)
rescaled_coins = rescale(smoothened_coins, 0.2, mode = "reflect")
# Convert the image into a graph with the value of the gradient on the
# edges.
graph = image.img_to_graph(rescaled_coins)
# Take a decreasing function of the gradient, leading to a segmentation
# that is close to a Voronoi partition
beta = 10
eps = 1e-6
graph.data = np.exp(-beta * graph.data / graph.data.std()) + eps
# Apply spectral clustering (this step goes much faster if you have
# pyamg installed)
N_REGIONS = 25
# ######################################################################
# Visualize the resulting regions
for assign_labels in ('kmeans', 'discretize'):
    t0 = time.time()
    labels = spectral_clustering(graph, n_clusters = N_REGIONS,
                                 assign_labels = assign_labels,
                                 random_state = 42)
    t1 = time.time()
    labels = labels.reshape(rescaled_coins.shape)
    plt.figure(figsize = (5, 5))
    plt.imshow(rescaled_coins, cmap = plt.cm.gray)
    for l in range(N_REGIONS):
        plt.contour(labels == l,
                    colors = [plt.cm.nipy_spectral(l / float(N_REGIONS))])
    plt.title('Spectral clustering: %s, %.2fs' % (assign_labels, (t1 - t0)))
plt.show()
7.12 Feature extraction with clustering
The k-means procedure can also be viewed as a matrix factorization, which is the basis for using clustering as a feature extractor. Denote by ck the centroid of the kth cluster and by Sk the set of indices of the points assigned to that cluster, so that each data point is approximated by the centroid of its cluster,

ck ≈ xp for all p ∈ Sk. (7.1)

Stacking the centroids column-wise gives the centroid matrix C = [c1 c2 ··· cK]. Then, designating by ek the kth standard basis vector (that is, a K × 1 vector with a 1 in the kth slot and zeros elsewhere), we may write Cek = ck, and hence the relations in Eq. (7.1) can be represented for each k as

Cek ≈ xp for all p ∈ Sk. (7.2)
Next, to write these equations even more compactly, we stack the data column-wise into the data matrix X = [x1 x2 ··· xP] and introduce a K × P assignment matrix W. The pth column
of this matrix, represented as wp, is the standard basis vector related to the cluster to which
the pth point belongs, that is, wp = ek if p ∈ Sk. With this wp notation we can represent each
equation in Eq. (7.2) as Cwp ≈ xp for all p ∈ Sk, or using matrix notation all K such relations
simultaneously as
CW ≈ X. (7.3)
We can now drop the assumption that we know the locations of the cluster centroids and which points are assigned to them, that is, that we know the centroid matrix C and the assignment matrix W. We want to learn the correct values for these two matrices. In particular, we know that the ideal C and W fulfill the compact relationship depicted in Eq. (7.3), that is, that CW ≈ X or, in other words, that ||CW − X||F^2 is small, while W is composed of appropriately selected standard basis vectors that associate the data points with their respective centroids. Note that the objective is nonconvex, and since we cannot minimize over both C and W at the same time, it is solved via alternating minimization, that is, by alternately minimizing the objective function over one of the variables (C or W) while keeping the other variable fixed (Watt, Borhani, & Katsaggelos, 2016).
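A small illustrative sketch of this alternating scheme in the matrix notation of Eqs. (7.1)–(7.3) is given below (synthetic two-dimensional data; K = 2 is an arbitrary choice): with W fixed, each column of W selects the nearest centroid, and with W fixed in turn, each centroid (column of C) becomes the mean of the points assigned to it.
# ======================================================================
# Alternating minimization of ||CW - X||_F^2 (illustrative sketch)
# ======================================================================
import numpy as np
rng = np.random.RandomState(0)
dim, P, K = 2, 80, 2
# Data matrix X whose columns are the data points x_p
X = np.hstack([rng.randn(dim, 40), rng.randn(dim, 40) + 5.0])
C = X[:, rng.choice(P, K, replace = False)]     # centroid matrix, columns c_k
for _ in range(10):
    # Minimize over W with C fixed: assign each point to its nearest centroid
    dists = ((X[:, None, :] - C[:, :, None])**2).sum(axis = 0)   # K x P
    assign = dists.argmin(axis = 0)
    W = np.zeros((K, P))
    W[assign, np.arange(P)] = 1.0               # columns are standard basis vectors
    # Minimize over C with W fixed: each centroid is the mean of its points
    C = (X @ W.T) / np.maximum(W.sum(axis = 1), 1.0)
print("reconstruction error:", ((C @ W - X)**2).sum())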
Example 7.13
The following Python code presents the usage of k-means and GMM clustering algorithms as
a feature extractor. We will utilize the Iris dataset, which includes three types (class) of Iris flowers
(Setosa, Versicolour, and Virginica) with four attributes: sepal length, sepal width, petal length, and
petal width. In this example we utilize sklearn.cluster.KMeans and sklearn.mixture.GaussianMixture, the scikit-learn implementations of k-means and GMM, to extract features from the Iris dataset; the extracted features (distances to the cluster centers and cluster membership probabilities, respectively) are then fed to different classifiers. Note that this example is adapted from scikit-learn.
# ======================================================================
# Feature extraction with k-means and GMM clustering
# ======================================================================
" " "
Created on Mon Dec 23 11:35:28 2019
@author: absubasi
" " "
from sklearn import metrics
from sklearn.metrics import confusion_matrix
import seaborn as sns
def Performance_Metrics(y_test, y_pred):
    print('Test Accuracy:', np.round(metrics.accuracy_score(y_test, y_pred), 4))
    print('Precision:', np.round(metrics.precision_score(y_test, y_pred, average='weighted'), 4))
    print('Recall:', np.round(metrics.recall_score(y_test, y_pred, average='weighted'), 4))
    print('F1 Score:', np.round(metrics.f1_score(y_test, y_pred, average='weighted'), 4))
    print('Cohen Kappa Score:', np.round(metrics.cohen_kappa_score(y_test, y_pred), 4))
    print('Matthews Corrcoef:', np.round(metrics.matthews_corrcoef(y_test, y_pred), 4))
    print('\t\tClassification Report:\n', metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
# ======================================================================
# Random forest classifier with k-means for feature extraction
# ======================================================================
#load Data
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X = iris['data']
y = iris['target']
#Extract Features
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 6).fit(X)
distances = np.column_stack([np.sum((X - center)**2, axis = 1)**0.5 for
center in kmeans.cluster_centers_])
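# A possible completion of this block (not part of the original listing): train a
# random forest on the cluster-distance features; the train/test split and the
# hyperparameters below are illustrative assumptions.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
Xtrain, Xtest, ytrain, ytest = train_test_split(distances, y,
                                                test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
Performance_Metrics(ytest, ypred)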
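#%%
# ======================================================================
# GMM posterior probabilities as extracted features
# ======================================================================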
iris = load_iris()
X = iris['data']
y = iris['target']
#Extract Features
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components = 8).fit(X)
proba = gmm.predict_proba(X)
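#%%
# ======================================================================
# k-means cluster-center distances as extracted features
# ======================================================================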
y = iris['target']
#Extract Features
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 6).fit(X)
distances = np.column_stack([np.sum((X - center)**2, axis = 1)**0.5 for
center in kmeans.cluster_centers_])
#%%
# ======================================================================
# k-NN classifier with GMM for feature extraction
# ======================================================================
#load Data
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X = iris['data']
y = iris['target']
#Extract Features
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components = 4).fit(X)
proba = gmm.predict_proba(X)
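# A possible completion (illustrative assumptions): train a k-NN classifier on the
# GMM posterior-probability features and report the metrics.
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
Xtrain, Xtest, ytrain, ytest = train_test_split(proba, y,
                                                test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(Xtrain, ytrain)
ypred = knn.predict(Xtest)
Performance_Metrics(ytest, ypred)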
#%%
# ======================================================================
# MLP classifier with k-means for feature extraction
# ======================================================================
#load Data
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X = iris['data']
y = iris['target']
#Extract Features
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters = 6).fit(X)
distances = np.column_stack([np.sum((X - center)**2, axis = 1)**0.5 for
center in kmeans.cluster_centers_])
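# A possible completion (illustrative assumptions): train an MLP on the
# cluster-distance features so that ytest and ypred exist for the call below;
# print_confusion_matrix_and_save is assumed to be defined as in Example 7.14.
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
Xtrain, Xtest, ytrain, ytest = train_test_split(distances, y,
                                                test_size=0.3, random_state=0)
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0)
mlp.fit(Xtrain, ytrain)
ypred = mlp.predict(Xtest)
Performance_Metrics(ytest, ypred)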
print_confusion_matrix_and_save(ytest, ypred)
#%%
# ======================================================================
# MLP classifier with GMM for feature extraction
# ======================================================================
#load Data
from sklearn.datasets import load_iris
import numpy as np
iris = load_iris()
X = iris['data']
y = iris['target']
#Extract Features
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components = 8).fit(X)
proba = gmm.predict_proba(X)
7.13 Clustering for classification
How can unlabeled data be utilized for classification? The idea is to combine naïve Bayes with
the iterative EM clustering algorithm: learn classes from a small labeled dataset and then extend
them to a large unlabeled dataset. In the first step, use the labeled data to train a classifier. In the
second step, apply it to the unlabeled data to label it with class probabilities (the “expectation”
step). In the third step, use the labels of all the data to train a new classifier (the “maximization”
step). Finally, iterate until convergence. The EM method guarantees that, at each iteration, model
parameters are found whose likelihood is equal to or greater than that of the previous iteration.
Whether these higher-likelihood parameters actually improve classification performance is a
question that can only be answered empirically. Intuitively, the scheme should help: the EM
method uses the unlabeled data, which never appear in the labeled dataset, to iteratively
generalize the learned model so that it classifies such data correctly. This could work with any
combination of a classification algorithm and an iterative clustering algorithm. But it is essentially
a bootstrapping technique, and care is needed to make sure that the feedback loop is positive.
Using probabilities rather than hard decisions seems better, since it helps the process converge
slowly instead of jumping to incorrect conclusions.
Together with the standard probabilistic EM technique, naïve Bayes is a particularly suitable
partner, since both share the same basic assumption: independence between attributes or, more
precisely, conditional independence of the attributes given the class. Coupling naïve Bayes and
EM in this way works well for document classification. In a particular classification task it can
achieve the performance of a traditional learner while employing less than one-third of the
labeled training instances, together with five times as many unlabeled ones. If labeled instances
are costly but unlabeled ones are essentially free, this is a good tradeoff. With a small number of
labeled documents, classification accuracy can be improved dramatically by adding further
unlabeled documents (Witten et al., 2016).
Two methodological refinements have been shown to enhance the performance. The first is
motivated by experimental evidence showing that, when there are many labeled data, the
inclusion of unlabeled data may decrease rather than improve performance: hand-labeled data
are inherently less noisy than automatically labeled data. The remedy is to introduce a weighting
parameter that reduces the contribution of the unlabeled data; it can be incorporated into the
maximization step of EM by maximizing the weighted likelihood of the labeled and unlabeled
instances. The second refinement is to allow several clusters per class. The EM clustering
algorithm assumes that the data are generated randomly from a mixture of different probability
distributions, one per cluster. With several clusters per class, each labeled document is initially
assigned to each of the components of its class in a probabilistic fashion. The maximization step
of the EM algorithm remains as before, but the expectation step is adjusted not only to label each
example probabilistically with the classes but also to assign it probabilistically to the components
within its class (Witten et al., 2016).
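A compact sketch of this scheme in scikit-learn is shown below. It is not the exact procedure described by Witten et al. (2016): it uses GaussianNB rather than a multinomial document model, approximates the soft E-step by weighting the most probable label of each unlabeled point by its posterior probability, and exposes the down-weighting of unlabeled data through a hypothetical unlabeled_weight parameter.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def em_naive_bayes(X_labeled, y_labeled, X_unlabeled,
                   n_iter=10, unlabeled_weight=0.5):
    # Step 1: train an initial classifier on the labeled data only.
    clf = GaussianNB()
    clf.fit(X_labeled, y_labeled)
    X_all = np.vstack([X_labeled, X_unlabeled])
    for _ in range(n_iter):
        # Expectation step: probabilistically label the unlabeled data.
        proba = clf.predict_proba(X_unlabeled)
        y_unlabeled = clf.classes_[np.argmax(proba, axis=1)]
        # Maximization step: retrain on all data; unlabeled points are
        # down-weighted by their posterior confidence times unlabeled_weight.
        y_all = np.concatenate([y_labeled, y_unlabeled])
        weights = np.concatenate([np.ones(len(y_labeled)),
                                  unlabeled_weight * proba.max(axis=1)])
        clf = GaussianNB()
        clf.fit(X_all, y_all, sample_weight=weights)
    return clf
For document classification one would typically substitute MultinomialNB with bag-of-words features, and the loop could also be stopped early once the predicted labels of the unlabeled data stop changing.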
Example 7.14
The following Python code utilizes k-means clustering for classification of handwritten digits
by using the scikit-learn library APIs. In this example, the handwritten digits dataset that exists
in sklearn.datasets is utilized. The classification accuracy, precision, recall, F1 score, Cohen kappa
score, and Matthews correlation coefficient are calculated. The classification report and confusion
matrix are also given. Note that this example is adapted from scikit-learn.
# ======================================================================
# Clustering as a classifier
# ======================================================================
#k-means on digits
import seaborn as sns
import matplotlib.pyplot as plt
from io import BytesIO #needed for plot
from sklearn.metrics import confusion_matrix
from sklearn import metrics
import numpy as np
# ======================================================================
# Define utility functions
# ======================================================================
def print_confusion_matrix_and_save(y_test, y_pred):
    # Print the confusion matrix
    matrix = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6, 4))
    sns.heatmap(matrix, square=True, annot=True, fmt='d', cbar=False)
    plt.title('Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    # Save the confusion matrix before showing it (plt.show() clears the current figure)
    plt.savefig("Confusion.jpg")
    # Save SVG in a fake file object
    f = BytesIO()
    plt.savefig(f, format="svg")
    plt.show()
def Performance_Metrics(y_test, y_pred):
    print('Test Accuracy:', np.round(metrics.accuracy_score(y_test, y_pred), 4))
    print('Precision:', np.round(metrics.precision_score(y_test, y_pred, average='weighted'), 4))
    print('Recall:', np.round(metrics.recall_score(y_test, y_pred, average='weighted'), 4))
    print('F1 Score:', np.round(metrics.f1_score(y_test, y_pred, average='weighted'), 4))
    print('Cohen Kappa Score:', np.round(metrics.cohen_kappa_score(y_test, y_pred), 4))
    print('Matthews Corrcoef:', np.round(metrics.matthews_corrcoef(y_test, y_pred), 4))
    print('\t\tClassification Report:\n', metrics.classification_report(y_test, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
# ======================================================================
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
digits = load_digits()
digits.data.shape
#%%
# Cluster the digit images into 10 groups and map each cluster to its most
# common true label so the clustering can be scored as a classifier.
kmeans = KMeans(n_clusters=10)
clusters = kmeans.fit_predict(digits.data)
from scipy.stats import mode
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]
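# Score the cluster-derived labels as classifier predictions, using the utility
# functions defined above, to obtain the metrics listed in the example description.
Performance_Metrics(digits.target, labels)
print_confusion_matrix_and_save(digits.target, labels)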
#%%
from sklearn.manifold import TSNE
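# A possible continuation suggested by the TSNE import (settings are illustrative):
# project the digits to two dimensions with t-SNE, then cluster the projection.
tsne = TSNE(n_components=2, init='random', random_state=0)
digits_proj = tsne.fit_transform(digits.data)
clusters = KMeans(n_clusters=10, random_state=0).fit_predict(digits_proj)
# Map each cluster to its most common true label, as above.
labels = np.zeros_like(clusters)
for i in range(10):
    mask = (clusters == i)
    labels[mask] = mode(digits.target[mask])[0]
print('Accuracy after t-SNE projection:',
      np.round(metrics.accuracy_score(digits.target, labels), 4))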
7.14 Summary
In this chapter we present many examples related to clustering problems, which involve
unsupervised learning techniques. Since clustering is a popular task with many applications in
machine learning, this chapter is dedicated to explaining and studying it. Clustering is the process
of automatically grouping a collection of objects in such a way that similar objects end up in the
same group and dissimilar objects are placed in different groups. For instance, retailers cluster
customers on the basis of their customer profiles for the purpose of targeted marketing;
computational biologists cluster genes on the basis of similarities in their expression across
diverse studies; and astronomers cluster stars on the basis of their spatial proximity. Whereas
the learning tasks in the previous chapters focused primarily on supervised learning problems,
in this chapter we present several unsupervised machine learning algorithms for clustering.
Besides grouping unlabeled data, clustering algorithms can also be used for image segmentation,
feature extraction, and classification. We discussed the application of these algorithms in
different fields in detail.
References
Ankerst, M., Breunig, M. M., Kriegel, H.-P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2), 49–60.
Flach, P. (2012). Machine learning: The art and science of algorithms that make sense of data. Cambridge, United Kingdom: Cambridge University Press.
Gaura, J., Sojka, E., & Krumnikl, M. (2011). Image segmentation based on k-means clustering and energy-transfer proximity. Berlin: Springer, 567–577.
Murphy, K. P. (2012). Machine learning: A probabilistic perspective. Cambridge, MA: MIT Press.
Shalev-Shwartz, S., & Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge, United Kingdom: Cambridge University Press.
Steinbach, M., Karypis, G., & Kumar, V. (2000). A comparison of document clustering techniques. KDD Workshop on Text Mining, 400, 525–526. Boston.
Sumathi, S., & Paneerselvam, S. (2010). Computational intelligence paradigms: Theory & applications using MATLAB. Boca Raton, FL: CRC Press.
Tatiraju, S., & Mehta, A. (2008). Image segmentation using k-means clustering, EM and normalized cuts. Department of EECS, 1, 1–7.
Theodoridis, S., Pikrakis, A., Koutroumbas, K., & Cavouras, D. (2010). Introduction to pattern recognition: A MATLAB approach. Cambridge, MA: Academic Press.
Watt, J., Borhani, R., & Katsaggelos, A. K. (2016). Machine learning refined: Foundations, algorithms, and applications. Cambridge, United Kingdom: Cambridge University Press.
Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data mining: Practical machine learning tools and techniques. Burlington, MA: Morgan Kaufmann.
Zheng, X., Lei, Q., Yao, R., Gong, Y., & Yin, Q. (2018). Image segmentation based on adaptive K-means algorithm. EURASIP Journal on Image and Video Processing, 2018(1), 68.