Assignment 4


Kalinga University

Faculty of Information Technology

Course- BSC(CS)
Subject- Data Mining
Subject Code – BCS604A Sem- VI

Unit 4
Cluster Analysis

Clustering in Data Mining

Clustering is an unsupervised machine learning technique that groups data points into clusters so that objects in the same cluster are similar to one another.

Clustering helps to split data into several subsets. Each of these subsets contains data similar to each other, and these subsets are called clusters.

Let's understand this with an example. Suppose we are a market manager, and we have a new tempting product to sell. We are sure that the product would bring enormous profit, as long as it is sold to the right people. So, how can we tell who is best suited for the product from our company's huge customer base? Once the data from our customer base is divided into clusters, we can make an informed decision about who we think is best suited for this product.

Clustering falls under the category of unsupervised machine learning and is one of the problems that machine learning algorithms solve.


Clustering uses only the input data to determine patterns, anomalies, or similarities within it.

A good clustering algorithm aims to obtain clusters in which (a small numeric sketch follows this list):

o The intra-cluster similarity is high, meaning the data points inside a cluster are similar to one another.
o The inter-cluster similarity is low, meaning each cluster holds data that is dissimilar to the data in other clusters.
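As a rough numeric illustration (not from the original text; the two small point sets below are made up), the sketch computes an average intra-cluster distance and an average inter-cluster distance; a good clustering keeps the first value small relative to the second.

# Illustrative sketch: compare intra-cluster vs. inter-cluster distances for two made-up clusters.
import numpy as np

cluster_a = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]])   # hypothetical cluster A
cluster_b = np.array([[5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])   # hypothetical cluster B

def mean_pairwise_distance(points_x, points_y):
    # Average Euclidean distance between every point in points_x and every point in points_y.
    diffs = points_x[:, None, :] - points_y[None, :, :]
    return np.linalg.norm(diffs, axis=2).mean()

intra_a = mean_pairwise_distance(cluster_a, cluster_a)    # small value -> high intra-cluster similarity
inter_ab = mean_pairwise_distance(cluster_a, cluster_b)   # large value -> low inter-cluster similarity
print(intra_a, inter_ab)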

What is a Cluster?

o A cluster is a subset of similar objects

o A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object not located inside it.
o A connected region of a multidimensional space with a comparatively high
density of objects.

What is clustering in Data Mining?

o Clustering is the method of converting a group of abstract objects into classes of similar objects.
o Clustering is a method of partitioning a set of data or objects into a set of significant subclasses called clusters.
o It helps users understand the structure or natural grouping in a data set and is used either as a stand-alone tool to get better insight into the data distribution or as a pre-processing step for other algorithms.

Important points:

o Data objects of a cluster can be considered as one group.
o While doing cluster analysis, we first partition the data set into groups based on data similarity and then assign labels to the groups.
o The main advantage of clustering over classification is that it is adaptable to changes and helps single out important characteristics that differentiate distinct groups.

Applications of cluster analysis in data mining:

o Cluster analysis is widely used in many applications, such as data analysis, market research, pattern recognition, and image processing.
o It helps marketers find different groups in their client base and characterize their customer groups based on purchasing patterns.
o It helps in categorizing documents on the internet for information discovery.

o Clustering is also used in outlier detection applications such as the detection of credit card fraud.
o As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to analyze the characteristics of each cluster.
o In biology, it can be used to determine plant and animal taxonomies, to categorize genes with similar functionality, and to gain insight into structures inherent to populations.
o It helps in the identification of areas of similar land use in an earth observation database, and in the identification of groups of houses in a city according to house type, value, and geographical location.

Why is clustering used in data mining?

Clustering analysis has been an evolving problem in data mining due to its variety of applications. The advent of various data clustering tools in the last few years and their comprehensive use in a broad range of applications, including image processing, computational biology, mobile communication, medicine, and economics, has contributed to the popularity of these algorithms. The main issue with data clustering algorithms is that they cannot be standardized. An algorithm may give the best results with one type of data set but may fail or perform poorly with other kinds of data sets. Although many efforts have been made to design algorithms that perform well in all situations, no significant success has been achieved so far. Many clustering tools have been proposed, but each algorithm has its advantages and disadvantages and cannot work in all real situations.

1. Scalability:

Scalability in clustering implies that as we increase the number of data objects, the time to perform clustering should grow approximately with the complexity order of the algorithm. For example, K-means clustering is roughly O(n), where n is the number of objects in the data. If we increase the number of data objects 10-fold, the time taken to cluster them should also increase approximately 10 times, i.e., there should be a linear relationship. If that is not the case, then there is some error in our implementation.

The algorithm should be scalable; if it is not, we cannot get appropriate results on large data sets. The figure illustrates a graphical example where a non-scalable method may lead to the wrong result.

2. Interpretability:

The outcomes of clustering should be interpretable, comprehensible, and usable.

3. Discovery of clusters with attribute shape:

The clustering algorithm should be able to find clusters of arbitrary shape. It should not be limited to distance measures that tend to discover only small, spherical clusters.

4. Ability to deal with different types of attributes:

Algorithms should be capable of being applied to any data such as data based on
intervals (numeric), binary data, and categorical data.

5. Ability to deal with noisy data:

Databases contain data that is noisy, missing, or incorrect. Some algorithms are sensitive to such data and may produce poor-quality clusters.

6. High dimensionality:
The clustering algorithm should be able to handle not only low-dimensional data but also high-dimensional data spaces.

Different types of Clustering

A whole collection of clusters is usually referred to as a Clustering. Here, we distinguish different kinds of Clustering, such as Hierarchical (nested) vs. Partitional (unnested), Exclusive vs. Overlapping vs. Fuzzy, and Complete vs. Partial.

o Hierarchical versus Partitional

The most frequently discussed distinction among various types of Clustering is whether the set of clusters is nested or unnested, or in more conventional terminology, hierarchical or partitional. A partitional Clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.

If we allow clusters to have subclusters, then we get a hierarchical Clustering, which is a set of nested clusters organized as a tree. Each node (cluster) in the tree (except the leaf nodes) is the union of its subclusters, and the root of the tree is the cluster containing all the objects. Usually, the leaves of the tree are singleton clusters of individual data objects. If we allow clusters to be nested, then one interpretation of figure 1(a) is that it has two subclusters (figure 1(b)), each of which has three subclusters (figure 1(d)). The clusters shown in figures 1(a-d), when taken in that order, also form a hierarchical (nested) Clustering with 1, 2, 4, and 6 clusters on each level. Finally, a hierarchical Clustering can be viewed as a sequence of partitional Clusterings, and a partitional Clustering can be obtained by taking any member of that sequence, i.e., by cutting the hierarchical tree at a particular level.

o Exclusive versus Overlapping versus Fuzzy

The Clusterings shown in the figure are all exclusive, as they assign each object to a single cluster. There are numerous circumstances in which a point could reasonably be placed in more than one cluster, and these circumstances are better addressed by non-exclusive Clustering. In general terms, an overlapping or non-exclusive Clustering is used to reflect the fact that an object can simultaneously belong to more than one group (class). For example, a person at a company can be both a trainee student and an employee of the company. A non-exclusive Clustering is also often used when an object is "between" two or more clusters and could reasonably be assigned to any of them. Consider a point somewhere between two of the clusters: rather than making an essentially arbitrary assignment of the object to a single cluster, it is placed in all of the clusters to which it is "equally" close.

In fuzzy Clustering, each object belongs to each cluster with a membership weight that is between 0 and 1. In other words, clusters are treated as fuzzy sets. Mathematically, a fuzzy set is one in which an object belongs to the set with a weight that ranges between 0 and 1. In fuzzy Clustering, we usually impose the additional constraint that the weights for each object must sum to 1. Similarly, probabilistic Clustering techniques compute the probability with which each point belongs to each cluster, and these probabilities must also sum to 1. Since the membership weights or probabilities for any object sum to 1, a fuzzy or probabilistic Clustering does not address true multiclass situations.
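As a purely illustrative sketch (the point, the centers, and the inverse-distance rule below are assumptions, not from the text), the snippet converts one point's distances to two cluster centers into membership weights that sum to 1, as fuzzy clustering requires.

# Illustrative sketch: turn distances to cluster centers into fuzzy membership weights that sum to 1.
import numpy as np

point = np.array([2.0, 3.0])                  # hypothetical data point
centers = np.array([[1.0, 1.0], [6.0, 5.0]])  # hypothetical cluster centers

distances = np.linalg.norm(centers - point, axis=1)
weights = 1.0 / distances                     # closer center -> larger weight
weights = weights / weights.sum()             # normalize so the memberships sum to 1
print(weights, weights.sum())                 # roughly [0.67 0.33], and the sum is 1.0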

o Complete versus Partial

A complete Clustering assigns every object to a cluster, whereas a partial Clustering does not. The motivation for a partial Clustering is that some objects in a data set may not belong to well-defined groups; often, objects in the data set represent noise, outliers, or "uninteresting background." For example, some newspaper stories may share a common theme, such as "Industrial production shrinks globally by 1.1 percent," while other stories are more generic or one-of-a-kind. Consequently, to find the important topics in last month's stories, we might want to search only for clusters of documents that are tightly related by a common theme. In other cases, a complete Clustering of the objects is desired. For example, an application that uses Clustering to organize documents for browsing needs to guarantee that all documents can be browsed.

Different types of Clusters

Clustering aims to discover useful groups of objects (clusters), where usefulness is defined by the goals of the data analysis. Of course, there are several different notions of a cluster that prove useful in practice. In order to visually illustrate the differences between these kinds of clusters, we use two-dimensional points, as shown in the figure, although the types of clusters described here are equally valid for other sorts of data.

o Well-separated cluster

A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close or similar to one another. This definition of a cluster is satisfied only when the data contains natural clusters that are quite far from one another. The figure gives an example of well-separated clusters consisting of groups of points in a two-dimensional space. Well-separated clusters do not need to be spherical but can have any shape.

o Prototype-Based cluster

A cluster is a set of objects in which each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster. For data with continuous attributes, the prototype of a cluster is usually a centroid, i.e., the average (mean) of all the points in the cluster. When a centroid is not meaningful, for example when the data has categorical attributes, the prototype is usually a medoid, i.e., the most representative point of a cluster. For many sorts of data, the prototype can be regarded as the most central point, and in such instances we commonly refer to prototype-based clusters as center-based clusters. As one might expect, such clusters tend to be spherical. The figure illustrates an example of center-based clusters.
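To make the centroid/medoid distinction concrete, here is a small sketch with made-up points; it computes the centroid as the mean of the points and the medoid as the actual point with the smallest total distance to the others.

# Sketch: centroid (mean point) vs. medoid (most representative actual point) of one cluster.
import numpy as np

points = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [8.0, 8.0]])  # hypothetical cluster with one far point

centroid = points.mean(axis=0)  # average of all points; need not coincide with any data point

# Medoid: the data point whose summed distance to all other points is smallest.
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
medoid = points[pairwise.sum(axis=1).argmin()]

print("centroid:", centroid)  # pulled toward the far point
print("medoid:", medoid)      # always one of the original points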

o Graph-Based cluster

If the data is represented as a graph, where the nodes are the objects, then a cluster can be defined as a connected component: a group of objects that are connected to one another but have no connection to objects outside the group. An important example of graph-based clusters is contiguity-based clusters, where two objects are connected only if they are within a specified distance of each other. This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster. The figure demonstrates an example of such clusters for two-dimensional points. This definition of a cluster is useful when clusters are irregular or intertwined, but it can have trouble when noise is present, since, as shown by the two circular clusters in the figure, a small bridge of points can merge two distinct clusters.

Other kinds of graph-based clusters are also possible. One such approach defines a cluster as a clique, i.e., a set of nodes in a graph that are completely connected to each other. Specifically, if we add connections between objects in order of their distance from one another, a cluster is formed when a set of objects forms a clique. Like prototype-based clusters, such clusters tend to be spherical.

o Density-Based Cluster

A cluster is a dense region of objects that is surrounded by a region of low density. The two circular clusters in the figure are not merged, because the bridge between them fades into the noise; similarly, the curve present in the figure also fades into the noise and does not form a cluster. A density-based definition of a cluster is often employed when the clusters are irregular or intertwined, and when noise and outliers are present. A contiguity-based definition of a cluster, on the other hand, would not work well for such data, since the noise would tend to form bridges between clusters.

o Shared- property or Conceptual Clusters

We can define a cluster as a set of objects that share some property. The objects in a center-based cluster share the property that they are all closest to the same centroid or medoid. However, the shared-property approach also encompasses new types of clusters. Consider the clusters shown in the figure: a triangular area (cluster) is adjacent to a rectangular one, and there are two intertwined circles (clusters). In both cases, a Clustering algorithm would need a very specific concept of a cluster to successfully detect these clusters. The process of discovering such clusters is called conceptual Clustering.

Types of Clustering Methods

The clustering methods are broadly divided into Hard clustering (each data point belongs to only one group) and Soft Clustering (data points can belong to more than one group). But various other approaches to Clustering also exist. Below are the main clustering methods used in Machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

Partitioning Clustering

It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.

In this type, the dataset is divided into a set of k groups, where K defines the number of pre-defined groups. The cluster centers are created in such a way that the distance between the data points and their own cluster centroid is minimal compared with the distance to any other cluster centroid.

Density-Based Clustering

The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped clusters are formed as long as the dense regions can be connected. The algorithm does this by identifying different dense regions in the dataset and connecting areas of high density into clusters. The dense areas in the data space are separated from each other by sparser areas.

These algorithms can have difficulty clustering the data points if the dataset has varying densities or high dimensionality.
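A minimal density-based example, assuming scikit-learn is available; the tiny dataset and the eps/min_samples values are illustrative only.

# Minimal density-based clustering sketch using scikit-learn's DBSCAN.
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]], dtype=float)

db = DBSCAN(eps=3, min_samples=2).fit(X)  # eps = neighborhood radius, min_samples = density threshold
print(db.labels_)                         # dense groups get labels 0, 1, ...; sparse points get -1 (noise)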

Distribution Model-Based Clustering

In the distribution model-based clustering method, the data is divided based on the probability of each data point belonging to a particular distribution. The grouping is done by assuming some distribution, most commonly the Gaussian distribution.

An example of this type is the Expectation-Maximization Clustering algorithm, which uses Gaussian Mixture Models (GMM).
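A short sketch of distribution model-based clustering with a Gaussian mixture, assuming scikit-learn; the synthetic two-blob data and the choice of two components are assumptions for illustration.

# Sketch: Expectation-Maximization clustering with a Gaussian Mixture Model (scikit-learn).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),   # one synthetic Gaussian blob
               rng.normal(6, 1, size=(50, 2))])  # a second, shifted blob

gmm = GaussianMixture(n_components=2, random_state=42).fit(X)
labels = gmm.predict(X)       # hard cluster assignments
probs = gmm.predict_proba(X)  # soft assignments: probability of each point under each component
print(labels[:5], probs[0])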

Hierarchical Clustering

Hierarchical clustering can be used as an alternative to partitional clustering, as there is no requirement to pre-specify the number of clusters to be created. In this technique, the dataset is divided into clusters so as to create a tree-like structure, which is also called a dendrogram. The observations, or any desired number of clusters, can be selected by cutting the tree at the appropriate level. The most common example of this method is the Agglomerative Hierarchical algorithm.
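A brief sketch of agglomerative hierarchical clustering, assuming SciPy is available; the made-up points, the 'ward' linkage, and the cut into 3 clusters are illustrative choices.

# Sketch: bottom-up hierarchical clustering with SciPy; cutting the tree yields the clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.2], [9, 9]], dtype=float)

Z = linkage(X, method='ward')                    # successively merges the closest clusters (the dendrogram)
labels = fcluster(Z, t=3, criterion='maxclust')  # "cut the tree" so that at most 3 clusters remain
print(labels)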

Fuzzy Clustering

Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients, which express its degree of membership in each cluster. The Fuzzy C-means algorithm is an example of this type of clustering; it is sometimes also known as the Fuzzy K-means algorithm.
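The rough from-scratch sketch below illustrates the Fuzzy C-means idea only (made-up data, fixed fuzzifier m=2); a real project would normally rely on a tested library implementation.

# Rough illustrative sketch of Fuzzy C-means: every point gets a membership weight in every cluster.
import numpy as np

def fuzzy_c_means(X, c=2, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))                 # random initial membership matrix (points x clusters)
    U = U / U.sum(axis=1, keepdims=True)        # each row sums to 1
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]                    # membership-weighted centers
        dist = np.linalg.norm(X[:, None, :] - centers[None], axis=2) + 1e-10
        inv = dist ** (-2.0 / (m - 1))
        U = inv / inv.sum(axis=1, keepdims=True)                          # closer centers get larger weights
    return centers, U

X = np.array([[1, 1], [1.2, 0.8], [5, 5], [5.3, 4.9]], dtype=float)
centers, U = fuzzy_c_means(X)
print(np.round(U, 2))  # each row holds one point's membership weights, which sum to 1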

Clustering Algorithms

Clustering algorithms can be divided based on the models explained above. Many different clustering algorithms have been published, but only a few are commonly used. The choice of clustering algorithm depends on the kind of data we are using: some algorithms need the number of clusters in the given dataset to be specified, whereas others work by finding the minimum distance between observations of the dataset.

Here we discuss the most popular Clustering algorithms that are widely used in machine learning (a short code sketch for two of them follows the list):

1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It partitions the dataset by dividing the samples into different clusters of equal variance. The number of clusters must be specified in this algorithm. It is fast, requires relatively few computations, and has linear complexity of O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a smooth density of data points. It is an example of a centroid-based model that works by updating the candidates for centroids to be the centers of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model similar
to the mean-shift, but with some remarkable advantages. In this algorithm,
the areas of high density are separated by the areas of low density. Because
of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm, or for cases where K-means fails. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical
algorithm performs the bottom-up hierarchical clustering. In this, each data
point is treated as a single cluster at the outset and then successively merged.
The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It differs from the other clustering algorithms in that it does not require the number of clusters to be specified. In this algorithm, each pair of data points exchanges messages until convergence. Its O(N²T) time complexity is the main drawback of this algorithm.
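As a hedged illustration of two algorithms from the list that do not need the number of clusters as input, the sketch below runs scikit-learn's MeanShift and AffinityPropagation on a tiny made-up dataset.

# Sketch: Mean-shift and Affinity Propagation discover the number of clusters on their own (scikit-learn).
import numpy as np
from sklearn.cluster import MeanShift, AffinityPropagation

X = np.array([[1, 1], [1.5, 1.2], [1.2, 0.8],
              [8, 8], [8.4, 7.9], [7.8, 8.3]], dtype=float)

ms_labels = MeanShift().fit_predict(X)                          # shifts candidate centroids toward dense areas
ap_labels = AffinityPropagation(random_state=5).fit_predict(X)  # points exchange messages until convergence
print(ms_labels, ap_labels)                                     # neither call was given the number of clusters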

Applications of Clustering

Below are some commonly known applications of the clustering technique in Machine Learning:

o In Identification of Cancer Cells: Clustering algorithms are widely used for the identification of cancerous cells. They divide the cancerous and non-cancerous data points into different groups.
o In Search Engines: Search engines also work on the clustering technique. The search results appear based on the objects closest to the search query. The engine does this by grouping similar data objects in one group that is far from the other, dissimilar objects. The accuracy of a query's result depends on the quality of the clustering algorithm used.
o Customer Segmentation: It is used in market research to segment the
customers based on their choice and preferences.
o In Biology: It is used in the biology stream to classify different species of
plants and animals using the image recognition technique.
o In Land Use: The clustering technique is used to identify areas of similar land use in a GIS database. This can be very useful for determining the purpose for which a particular piece of land is most suitable.

What is the K-Means Algorithm? / K-Means and K-Medoids

K-Means Clustering is an Unsupervised Learning algorithm, which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters, for K=3 there will be three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group with similar properties.

It allows us to cluster the data into different groups and is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, in which each cluster is associated with a centroid. The main aim of the algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best values for the K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. The data points that are nearest to a particular k-center form a cluster.

Hence each cluster has data points with some commonalities and is distinct from the other clusters.

The below diagram explains the working of the K-means Clustering Algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the steps below (a compact code sketch of the same loop follows the steps):

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points or centroids. (They may be points other than those in the input dataset.)

Step-3: Assign each data point to their closest centroid, which will form the
predefined K clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.
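The compact sketch below mirrors the steps above on made-up data (the points and K=2 are assumptions); it is a minimal from-scratch version, not the scikit-learn implementation used later in this unit.

# Minimal from-scratch sketch of the K-means loop described in the steps above.
import numpy as np

def kmeans(X, k=2, n_iter=100, seed=42):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # Step-2: pick K random points as centroids
    for _ in range(n_iter):
        # Step-3/5: assign every point to its closest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step-4: move each centroid to the mean of its assigned points (keep it if the cluster is empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):  # Step-6: stop when no reassignment changes the centroids
            break
        centroids = new_centroids
    return labels, centroids                       # Step-7: the model is ready

X = np.array([[1, 2], [1.5, 1.8], [5, 8], [8, 8], [1, 0.6], [9, 11]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels, centroids)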

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:

o Let's take the number of clusters k, i.e., K=2, to identify the dataset and put the points into different clusters. It means that here we will try to group these data points into two different clusters.

o We need to choose some random k points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the two points below as k points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or centroid. We will compute this by applying the mathematics that we have studied for calculating the distance between two points. So, we will draw a median line between both the centroids. Consider the below image:

From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.

o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of the points assigned to each cluster, and will find the new centroids as below:

o Next, we will reassign each data point to the new centroids. For this, we will repeat the same process of finding a median line. The median will be as in the image below:

From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to new centroids.

As reassignment has taken place, we will again go to step-4, which is finding new centroids or K-points.

o We will repeat the process by finding the center of gravity of each cluster's points, so the new centroids will be as shown in the below image:

o As we have got the new centroids, we will again draw the median line and reassign the data points. So, the image will be:

o We can see in the above image that there are no data points left to reassign on either side of the line, which means our model is formed. Consider the below image:

As our model is ready, so we can now remove the assumed centroids, and the two
final clusters will be as shown in the below image:

Python Implementation of K-means Clustering Algorithm

In the above section, we have discussed the K-means algorithm; now let's see how it can be implemented using Python.

Before implementation, let's understand what type of problem we will solve here. We have a dataset of Mall_Customers, which is data about customers who visit a mall and spend there.

In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (a calculated value of how much a customer has spent at the mall: the higher the value, the more the customer has spent). From this dataset, we need to find some patterns; as this is an unsupervised method, we don't know exactly what to look for.

The steps to be followed for the implementation are given below:

o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the training dataset
o Visualizing the clusters

Step-1: Data pre-processing Step

The first step will be the data pre-processing, as we did in our earlier topics of
Regression and Classification. But for the clustering problem, it will be different
from other models. Let's discuss it:

o Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our
model, which is part of data pre-processing. The code is given below:

# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

In the above code, numpy is imported for performing mathematical calculations, matplotlib for plotting the graph, and pandas for managing the dataset.
o Importing the Dataset:
Next, we will import the dataset that we need to use. So here, we are using the Mall_Customers_data.csv dataset. It can be imported using the below code:

# Importing the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')

By executing the above lines of code, we will get our dataset in the Spyder IDE.
The dataset looks like the below image:

From the above dataset, we need to find some patterns in it.

o Extracting Independent Variables

Here we don't need a dependent variable for the data pre-processing step, as it is a clustering problem and we have no idea what to determine. So we will just add a line of code for the matrix of features.

x = dataset.iloc[:, [3, 4]].values

As we can see, we are extracting only two features, Annual Income and Spending Score (column indices 3 and 4). This is because we need a 2D plot to visualize the model, and some features, such as customer_id, are not required.

Step-2: Finding the optimal number of clusters using the elbow method

In the second step, we will try to find the optimal number of clusters for our
clustering problem. So, as discussed above, here we are going to use the elbow
method for this purpose.

As we know, the elbow method uses the WCSS (Within Cluster Sum of Squares) concept to draw the plot, with WCSS values on the Y-axis and the number of clusters on the X-axis. So we are going to calculate the WCSS values for different k values ranging from 1 to 10. Below is the code for it:

#finding optimal number of clusters using the elbow method
from sklearn.cluster import KMeans
wcss_list = []  #Initializing the list for the values of WCSS

#Using for loop for iterations from 1 to 10.
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)
mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters(k)')
mtp.ylabel('wcss_list')
mtp.show()

As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.

Next, we have created the wcss_list variable to initialize an empty list, which is
used to contain the value of wcss computed for different values of k ranging from 1
to 10.

After that, we have initialized the for loop to iterate over different values of k ranging from 1 to 10; since a for loop in Python excludes the upper bound, the range is written as (1, 11) to include the value 10.

The rest part of the code is similar as we did in earlier topics, as we have fitted the
model on a matrix of features and then plotted the graph between the number of
clusters and WCSS.

Output: After executing the above code, we will get the below output:

From the above plot, we can see the elbow point is at 5. So the number of clusters
here will be 5.

Step- 3: Training the K-means algorithm on the training dataset

As we have got the number of clusters, we can now train the model on the dataset.

To train the model, we will use the same two lines of code as we have used in the
above section, but here instead of using i, we will use 5, as we know there are 5
clusters that need to be formed. The code is given below:

#training the K-means model on a dataset
kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)

The first line is the same as above for creating the object of KMeans class.
In the second line of code, we have created the variable y_predict, which holds the cluster label predicted for each observation while training the model.

By executing the above lines of code, we will get the y_predict variable. We can
check it under the variable explorer option in the Spyder IDE. We can now
compare the values of y_predict with our original dataset. Consider the below
image:

From the above image, we can now see that CustomerID 1 belongs to cluster 3 (since the index starts from 0, the label 2 corresponds to cluster 3), CustomerID 2 belongs to cluster 4, and so on.

Step-4: Visualizing the Clusters

The last step is to visualize the clusters. As we have 5 clusters for our model, we will visualize each cluster one by one.

To visualize the clusters, we will use a scatter plot drawn with the mtp.scatter() function of matplotlib.

#visualizing the clusters
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1') #for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2') #for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3') #for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4') #for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5') #for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()

In the above lines of code, we have written one scatter call for each of the 5 clusters. The first argument of mtp.scatter, i.e., x[y_predict == 0, 0], selects the x values (first feature) of the points assigned to cluster 0, and the second argument selects the corresponding y values; y_predict ranges from 0 to 4.

Output:

The output image clearly shows the five different clusters in different colors. The clusters are formed between two attributes of the dataset: the customer's Annual Income and Spending Score. We can change the colors and labels as per requirement or choice. We can also observe some points from the above patterns, which are given below:

o Cluster1 shows the customers with average income and average spending, so we can categorize these customers as standard.
o Cluster2 shows the customers with high income but low spending, so we can categorize them as careful.
o Cluster3 shows low income and also low spending, so these customers can be categorized as sensible.
o Cluster4 shows the customers with low income but very high spending, so they can be categorized as careless.
o Cluster5 shows the customers with high income and high spending, so they can be categorized as target; these customers can be the most profitable customers for the mall owner.

K-Medoids to CLARANS

CLARANS (Clustering Large Applications based on RANdomized Search) is a data mining algorithm designed to cluster spatial data. We have already covered the K-Means and K-Medoids clustering algorithms above. This section discusses another clustering technique called CLARANS, along with a Python-style demo sketch.

CLARANS is a partitioning method of clustering that is particularly useful in spatial data mining. By spatial data mining, we mean recognizing patterns and relationships that exist in spatial data (such as distance-related, direction-related, or topological data, e.g. data plotted on a road map).

Why CLARANS algorithm?

As mentioned in the discussion of the K-Medoids algorithm, the K-Medoids clustering technique resolves the limitation of the K-Means algorithm of being adversely affected by noise/outliers in the input data. But K-Medoids proves to be a computationally costly method for considerably large values of ‘k’ (the number of clusters) and large datasets.

The CLARA algorithm was introduced as an extension of K-Medoids. It uses only random samples of the input data (instead of the entire dataset) and computes the best medoids within those samples. It thus scales better than K-Medoids for large datasets. However, the algorithm may give poor clustering results if one or more of the sampled medoids are far from the actual best medoids.

The CLARANS algorithm addresses the drawbacks of both the K-Medoids and CLARA algorithms, besides dealing with difficult-to-handle data, i.e. spatial data. It maintains a balance between the computational cost and the influence of data sampling on cluster formation.

Steps of CLARANS algorithm

1. Select ‘k’ random data points and label them as medoids for the time being.
2. Select a random point say ‘a’ from the points picked in step (1), and another
point say ‘b’ which is not included in those points.
3. We would already have the sum of distances of point ‘a’ from all other points, since that computation is required for selecting the points in step (1). Perform a similar computation for point ‘b’.

4. If the sum of distances from all other points for point ‘b’ turns out to be less
than that for point ‘a’, replace ‘a’ by ‘b’.
5. The algorithm performs such a randomized search of medoids ‘x’ times, where ‘x’ denotes the number of local minima to be computed, i.e. the number of iterations to be performed, which we specify as a parameter. The set of medoids obtained after each such search is termed a ‘local optimum’.
6. A counter is incremented every time a neighbor is examined without yielding a replacement. The process of examining points for possible replacement is repeated as long as the counter does not exceed the maximum number of neighbors to be examined (specified as a parameter).
7. The set of medoids obtained when the algorithm stops is the best local optimum choice of medoids. (A simplified code sketch of this procedure is given below.)
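The following is a simplified, from-scratch illustration of the randomized medoid-swap idea in the steps above (made-up data; the parameter names numlocal and maxneighbor mirror the steps). It is a sketch of the procedure, not a complete or optimized CLARANS implementation.

# Simplified sketch of the CLARANS-style randomized search for good medoids.
import numpy as np

def total_cost(X, medoid_idx):
    # Sum over all points of the distance to their nearest medoid.
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def clarans_sketch(X, k=2, numlocal=3, maxneighbor=20, seed=0):
    rng = np.random.default_rng(seed)
    best_medoids, best_cost = None, np.inf
    for _ in range(numlocal):                       # Step-5: repeat the randomized search several times
        current = list(rng.choice(len(X), size=k, replace=False))  # Step-1: k random medoids
        cost = total_cost(X, current)
        examined = 0
        while examined < maxneighbor:               # Step-6: bound on neighbors examined without improvement
            i = rng.integers(k)                     # Step-2: pick a current medoid 'a'...
            b = int(rng.integers(len(X)))           # ...and a random non-medoid point 'b'
            if b in current:
                continue
            candidate = current.copy()
            candidate[i] = b
            new_cost = total_cost(X, candidate)     # Step-3/4: swap 'a' for 'b' if it lowers the total cost
            if new_cost < cost:
                current, cost, examined = candidate, new_cost, 0
            else:
                examined += 1
        if cost < best_cost:                        # Step-7: keep the best local optimum found
            best_medoids, best_cost = current, cost
    return best_medoids, best_cost

X = np.array([[1, 1], [1.2, 0.9], [0.8, 1.1], [6, 6], [6.2, 5.8], [5.9, 6.1]], dtype=float)
print(clarans_sketch(X))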
