Assignment 4
Course- BSC(CS)
Subject- Data Mining
Subject Code – BCS604A Sem- VI
Unit 4
Cluster Analysis
Let's understand this with an example. Suppose we are a marketing manager and we have a tempting new product to sell. We are sure the product would bring enormous profit, as long as it is sold to the right people. So, how can we tell who is best suited for the product from our company's huge customer base?

Clustering splits the data into several subsets. Each of these subsets contains data objects that are similar to one another, and these subsets are called clusters. Once the data from our customer base is divided into clusters, we can make an informed decision about who we think is best suited for this product.
Clustering, falling under the category of unsupervised machine learning, is one
of the problems that machine learning algorithms solve.
In a good clustering:
o The intra-cluster similarity is high, meaning the data points inside a cluster are similar to one another.
o The inter-cluster similarity is low, meaning each cluster holds data that is not similar to the data in other clusters.
What is a Cluster?
o A subset of objects such that the distance between any two objects in the cluster is less than the distance between any object in the cluster and any object not located inside it.
o A connected region of a multidimensional space with a comparatively high density of objects.
Important points:
o Clustering is also used in tracking applications such as the detection of credit card fraud.
o As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to analyze the characteristics of each cluster.
o In biology, it can be used to determine plant and animal taxonomies, to categorize genes with similar functionalities, and to gain insight into structures inherent in populations.
o It helps in the identification of areas of similar land use in an earth observation database, and in the identification of groups of houses in a city according to house type, value, and geographical location.
Clustering analysis has been an evolving problem in data mining due to its variety of applications. The advent of various data clustering tools in the last few years, and their extensive use in a broad range of applications including image processing, computational biology, mobile communication, medicine, and economics, has contributed to the popularity of these algorithms. The main issue with data clustering algorithms is that they cannot be standardized: an advanced algorithm may give the best results with one type of data set but may fail or perform poorly with other kinds of data sets. Although many efforts have been made to design algorithms that perform well in all situations, no significant success has been achieved so far. Many clustering tools have been proposed, but each algorithm has its advantages and disadvantages and cannot work in all real situations. The following are some of the main requirements that a clustering algorithm should satisfy:
1. Scalability:
Scalability in clustering implies that as we increase the number of data objects, the time to perform clustering should grow roughly in line with the complexity order of the algorithm. For example, if K-means clustering is roughly O(n), where n is the number of objects in the data, then raising the number of data objects 10-fold should increase the clustering time by roughly 10 times; in other words, there should be an approximately linear relationship. If that is not the case, there is likely some error in our implementation.
The approach should be scalable; if it is not, we cannot get an appropriate result. The figure illustrates a graphical example of where a non-scalable approach may lead to a wrong result.
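To make this concrete, here is a minimal sketch (not from the original notes; the synthetic data, number of clusters, and parameters are illustrative assumptions) that times scikit-learn's KMeans on a dataset and then on a ten-times larger one:

# Rough timing check of the (approximately) linear scaling described above.
import time
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
for n in (10_000, 100_000):                      # second run has 10x more objects
    X = rng.random((n, 2))                       # synthetic 2-D data
    start = time.perf_counter()
    KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
    print(f"{n} objects: {time.perf_counter() - start:.3f} s")

If the second run takes far more than roughly ten times as long, the implementation (or the chosen algorithm) is not scaling as expected.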
2. Interpretability:
The outcomes of clustering should be interpretable, comprehensible, and usable.
3. Discovery of clusters with arbitrary shape:
The clustering algorithm should be able to find clusters of arbitrary shape. It should not be limited to distance measures that tend to discover only small, spherical clusters.
4. Ability to deal with different types of attributes:
Algorithms should be capable of being applied to any kind of data, such as interval-based (numeric) data, binary data, and categorical data.
5. Ability to deal with noisy data:
Databases contain data that is noisy, missing, or erroneous. Some algorithms are sensitive to such data and may produce poor-quality clusters.
6. High dimensionality:
The clustering tools should be able to handle not only low-dimensional data but also high-dimensional data spaces.
The most frequently discussed distinction among various types of clustering is whether the set of clusters is nested or unnested, or, in more conventional terminology, hierarchical or partitional. A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset.
The clusterings shown in the figure are all exclusive, as they assign each object to a single cluster. There are numerous circumstances in which a point could reasonably be placed in more than one cluster, and these circumstances are better addressed by non-exclusive clustering. In general terms, an overlapping or non-exclusive clustering is used to reflect the fact that an object can simultaneously belong to more than one group (class). For example, a person at a company can be both a trainee student and an employee of the company. A non-exclusive clustering is also often used when an object is "between" two or more clusters and could reasonably be assigned to any of them. Rather than making an entirely arbitrary assignment of such an object to a single cluster, it is placed in all of the "equally good" clusters.
In fuzzy clustering, every object belongs to every cluster with a membership weight that is between 0 and 1; in other words, clusters are treated as fuzzy sets. Mathematically, a fuzzy set is one in which an object belongs to the set with a weight that ranges between 0 and 1. In fuzzy clustering, we usually impose the additional constraint that the sum of the weights for each object must equal 1. Similarly, probabilistic clustering techniques compute the probability with which each point belongs to each cluster, and these probabilities must also sum to 1. Since the membership weights or probabilities for any object sum to 1, a fuzzy or probabilistic clustering does not address true multiclass situations.
Types of Clusters
As shown in the figure, the types of clusters described here are equally valid for different sorts of data.
o Well-separated cluster
A cluster is a set of objects in which each object is closer (or more similar) to every other object in the cluster than to any object not in the cluster. Sometimes a threshold is used to specify that all the objects in a cluster must be sufficiently close (or similar) to one another. This definition of a cluster is satisfied only when the data contains natural clusters that are quite far from one another. The figure illustrates an example of well-separated clusters consisting of two groups of points in a two-dimensional space. Well-separated clusters do not need to be spherical; they can have any shape.
o Prototype-Based cluster
A cluster is a set of objects in which each object is closer (more similar) to the prototype that defines the cluster than to the prototype of any other cluster. For data with continuous attributes, the prototype of a cluster is usually a centroid, i.e., the average (mean) of all the points in the cluster. When a centroid is not meaningful, for example when the data has categorical attributes, the prototype is usually a medoid, i.e., the most representative point of the cluster. For many sorts of data, the prototype can be regarded as the most central point, and in such cases we commonly refer to prototype-based clusters as center-based clusters. As one might expect, such clusters tend to be spherical. The figure illustrates an example of center-based clusters.
o Graph-Based cluster
If the data is represented as a graph, where the nodes are objects and the links represent connections among objects, then a cluster can be defined as a connected component: a group of objects that are connected to one another but have no connection to objects outside the group. An important example of graph-based clusters is contiguity-based clusters, where two objects are connected only if they are within a specified distance of each other. This implies that each object in a contiguity-based cluster is closer to some other object in the cluster than to any point in a different cluster. The figures demonstrate an example of such clusters for two-dimensional points. This definition of a cluster is useful when clusters are irregular or intertwined, but it can have trouble when noise is present; as shown by the two circular clusters in the figure, a small bridge of points can merge two distinct clusters.

Other kinds of graph-based clusters are also possible. One such approach defines a cluster as a clique, i.e., a set of nodes in a graph that are completely connected to one another. Specifically, if we add connections between objects in order of their distance from one another, a cluster is formed when a set of objects forms a clique. Like prototype-based clusters, such clusters tend to be spherical.
o Density-Based Cluster
A cluster is a dense region of objects surrounded by a region of low density. This definition is often used when the clusters are irregular or intertwined and when noise and outliers are present.
o Shared-property or Conceptual Clusters
We can also define a cluster as a set of objects that share some property. For example, the objects in a center-based cluster share the property that they are all closest to the same centroid or medoid. However, the shared-property approach also encompasses new types of clusters. Consider the clusters shown in the figure: a triangular region (cluster) is adjacent to a rectangular one, and there are two intertwined circles (clusters). In both cases, a clustering algorithm would need a very specific concept of a cluster to detect these clusters successfully. The process of discovering such clusters is called conceptual clustering.
Types of Clustering Methods
Clustering methods are broadly divided into Hard clustering (each data point belongs to only one group) and Soft clustering (data points can belong to more than one group). Besides these, there are several other approaches to clustering. Below are the main clustering methods used in machine learning:
1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering
Partitioning Clustering
It is a type of clustering that divides the data into non-hierarchical groups. It is also
known as the centroid-based method. The most common example of partitioning
clustering is the K-Means Clustering algorithm.
In this type, the dataset is divided into a set of k groups, where K defines the number of pre-defined groups. The cluster centers are created in such a way that the distance between the data points of a cluster and its own centroid is minimal compared with their distance to other cluster centroids.
Density-Based Clustering
The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped distributions are formed as long as the dense regions can be connected. The algorithm does this by identifying different dense regions in the dataset and connecting the areas of high density into clusters. The dense areas in the data space are separated from each other by sparser areas.

These algorithms can have difficulty clustering the data points if the dataset has varying densities and high dimensionality.
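As an illustration (not part of the original notes; the two-moons data and the eps / min_samples values are assumptions chosen for the example), density-based clustering with DBSCAN in scikit-learn looks like this:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped (arbitrarily shaped) clusters that spherical,
# distance-based methods struggle with.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps sets the neighbourhood radius, min_samples how many points make a
# region "dense"; points belonging to no dense region are labelled -1 (noise).
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
print(sorted(set(labels)))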
Distribution Model-Based Clustering
In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution. The grouping is done by assuming some distribution, most commonly the Gaussian distribution.
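A minimal sketch of this idea, assuming synthetic blob data and scikit-learn's GaussianMixture (the parameters are illustrative, not from the text):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
labels = gmm.predict(X)        # hard assignment to the most probable Gaussian
probs = gmm.predict_proba(X)   # soft assignment: each row of probabilities sums to 1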
Hierarchical Clustering
Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters. In this technique, the dataset is divided into clusters to create a tree-like structure, also called a dendrogram; the desired number of clusters is then obtained by cutting the tree at the appropriate level. The most common example of this method is the Agglomerative Hierarchical algorithm.
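A small sketch of hierarchical clustering, assuming synthetic data and the standard scikit-learn / SciPy APIs (the parameter choices are illustrative):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)

# Bottom-up merging of points into a hierarchy, then cutting it at 3 clusters.
labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)

# The dendrogram is the tree-like structure mentioned above.
dendrogram(linkage(X, method='ward'))
plt.title('Dendrogram')
plt.show()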
Fuzzy Clustering
Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster. Each data point has a set of membership coefficients that reflect its degree of membership in each cluster. The Fuzzy C-means algorithm is the main example of this type of clustering; it is sometimes also known as the Fuzzy k-means algorithm.
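Since the Fuzzy C-means algorithm is named here, a compact from-scratch sketch of it is given below (the data, the fuzzifier m = 2, and the iteration count are illustrative assumptions, not part of the original notes):

import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0):
    """Return cluster centres and the (n_points x c) membership matrix."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(X), c))
    u /= u.sum(axis=1, keepdims=True)            # memberships of each point sum to 1
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ X) / um.sum(axis=0)[:, None]           # weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        u = 1.0 / d ** (2.0 / (m - 1.0))         # standard FCM membership update
        u /= u.sum(axis=1, keepdims=True)        # re-normalise rows to sum to 1
    return centers, u

X = np.random.default_rng(1).random((200, 2))    # synthetic 2-D data
centers, memberships = fuzzy_c_means(X)
print(memberships.sum(axis=1)[:5])               # every value is 1.0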
Clustering Algorithms
Clustering algorithms can be categorized based on the models explained above. Many different clustering algorithms have been published, but only a few are commonly used. The choice of algorithm depends on the kind of data we are using: some algorithms need the number of clusters in the given dataset to be guessed, whereas others work by finding the minimum distance between observations of the dataset.

Here we discuss the popular clustering algorithms that are widely used in machine learning:
1. K-Means algorithm: The k-means algorithm is one of the most popular clustering algorithms. It classifies the dataset by dividing the samples into different clusters of equal variances. The number of clusters must be specified in this algorithm. It is fast, with fewer computations required, and has linear complexity O(n).
2. Mean-shift algorithm: The mean-shift algorithm tries to find dense areas in a smooth density of data points. It is an example of a centroid-based model that works by updating the candidate centroids to be the mean of the points within a given region.
3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of
Applications with Noise. It is an example of a density-based model similar
to the mean-shift, but with some remarkable advantages. In this algorithm,
the areas of high density are separated by the areas of low density. Because
of this, the clusters can be found in any arbitrary shape.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an alternative to the k-means algorithm, or for those cases where k-means may fail. In GMM, it is assumed that the data points are Gaussian distributed.
5. Agglomerative Hierarchical algorithm: The Agglomerative hierarchical
algorithm performs the bottom-up hierarchical clustering. In this, each data
point is treated as a single cluster at the outset and then successively merged.
The cluster hierarchy can be represented as a tree-structure.
6. Affinity Propagation: It is different from other clustering algorithms in that it does not require the number of clusters to be specified. In it, each pair of data points exchanges messages until convergence. It has O(N²T) time complexity, which is the main drawback of this algorithm. A short usage sketch of this algorithm and the mean-shift algorithm follows this list.
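The sketch below uses synthetic blob data (the data and parameters are illustrative assumptions; only standard scikit-learn calls are used) to show how two of these algorithms are invoked:

from sklearn.cluster import AffinityPropagation, MeanShift
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

ms_labels = MeanShift().fit_predict(X)                           # centroid candidates shift to dense areas
ap_labels = AffinityPropagation(random_state=0).fit_predict(X)   # message passing; no k required

print("Mean-shift clusters:", len(set(ms_labels)))
print("Affinity propagation clusters:", len(set(ap_labels)))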
Applications of Clustering
It allows us to cluster the data into different groups, and it is a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any training.
K-Means Clustering Algorithm
The K-means algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and repeats the process until it finds the best clusters. The value of k should be predetermined in this algorithm. Hence each cluster contains data points with some commonalities and is far away from the other clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They may be points other than those in the input dataset.)
Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of each cluster.
Step-6: If any reassignment occurs, go to Step-4 again; otherwise, the model is ready.
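Before the worked example that follows, here is a compact numpy sketch of these steps (the random data, K = 2, and the iteration limit are illustrative assumptions, not part of the original example):

import numpy as np

def k_means(X, k=2, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]     # Step-1/2: pick K centroids
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)                                # Step-3: nearest centroid
        # Step-4: new centroid of each cluster (assumes no cluster becomes empty)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):                # Step-6: nothing changed, stop
            break
        centroids = new_centroids                                # Step-5: repeat with new centroids
    return centroids, labels

X = np.random.default_rng(1).random((100, 2))                    # two variables, e.g. M1 and M2
centroids, labels = k_means(X, k=2)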
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two
variables is given below:
o Let's take the number of clusters k, i.e., K = 2, to identify the dataset and to put the points into different clusters. It means here we will try to group these data points into two different clusters.
o We need to choose some random K points or centroids to form the clusters. These points can be either points from the dataset or any other points. So, here we are selecting the below two points as K points, which are not part of our dataset. Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point
or centroid. We will compute it by applying some mathematics that we have
studied to calculate the distance between two points. So, we will draw a
median between both the centroids.
o Consider the below image:
From the above image, it is clear that the points on the left side of the line are nearer to the K1 or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new centroids. To choose the new centroids, we will compute the center of gravity of the points in each cluster and find the new centroids as below:
o Next, we will reassign each data point to the new centroid. For this, we will repeat the same process of finding a median line. The median will be like the below image:
From the above image, we can see that one yellow point is on the left side of the line, and two blue points are to the right of the line. So, these three points will be assigned to the new centroids.
As reassignment has taken place, we will again go to Step-4, which is finding new centroids or K-points.
o We will repeat the process by finding the center of gravity of the points in each cluster, so the new centroids will be as shown in the below image:
o As we have got the new centroids, we will again draw the median line and reassign the data points. So, the image will be:
o We can see in the above image that there are no dissimilar data points on either side of the line, which means our model is formed. Consider the below image:
As our model is ready, we can now remove the assumed centroids, and the two final clusters will be as shown in the below image:
In the above section, we have discussed the K-means algorithm, now let's see how
it can be implemented using Python.
Before implementation, let's understand what type of problem we will solve here.
So, we have a dataset of Mall_Customers, which is the data of customers who visit the mall and spend there.

In the given dataset, we have Customer_Id, Gender, Age, Annual Income ($), and Spending Score (a calculated value of how much a customer has spent in the mall; the higher the value, the more the customer has spent). From this dataset, we need to find some patterns; as it is an unsupervised method, we don't know exactly what to find. The steps to be followed for the implementation are given below:
o Data Pre-processing
o Finding the optimal number of clusters using the elbow method
o Training the K-means algorithm on the training dataset
o Visualizing the clusters
The first step will be the data pre-processing, as we did in our earlier topics of
Regression and Classification. But for the clustering problem, it will be different
from other models. Let's discuss it:
o Importing Libraries
As we did in previous topics, firstly, we will import the libraries for our
model, which is part of data pre-processing. The code is given below:
# importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd
In the above code, numpy is imported for performing mathematical calculations, matplotlib.pyplot for plotting graphs, and pandas for managing the dataset.
o Importing the Dataset:
Next, we will import the dataset that we need to use. So here, we are using
the Mall_Customer_data.csv dataset. It can be imported using the below
code:
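The code listing itself is missing from this handout; a likely form of it, using the file name mentioned above and the pandas alias imported earlier, is:

# Importing the dataset (reconstructed; adjust the file name/path as needed)
dataset = pd.read_csv('Mall_Customer_data.csv')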
By executing the above lines of code, we will get our dataset in the Spyder IDE.
The dataset looks like the below image:
Here we don't need any dependent variable for the data pre-processing step, as it is a clustering problem and we have no idea what to determine. So we will just add a line of code for the matrix of features, as shown below.
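That line is not reproduced in the handout; a likely form, consistent with the explanation that follows (3rd and 4th columns only), is:

x = dataset.iloc[:, [3, 4]].values   # Annual Income and Spending Score columns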
As we can see, we are extracting only the 3rd and 4th features. This is because we need a 2D plot to visualize the model, and some features, such as customer_id, are not required.
Step-2: Finding the optimal number of clusters using the elbow method
In the second step, we will try to find the optimal number of clusters for our
clustering problem. So, as discussed above, here we are going to use the elbow
method for this purpose.
As we know, the elbow method uses the WCSS (Within-Cluster Sum of Squares) concept to draw the plot, plotting WCSS values on the Y-axis and the number of clusters on the X-axis. So we are going to calculate the WCSS values for different k values ranging from 1 to 10. Below is the code for it:
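The code listing is missing here; a reconstruction that matches the description below (KMeans from sklearn.cluster, a wcss_list variable, and a loop over k = 1 to 10) is given next. The exact parameters (init='k-means++', random_state=42) are assumptions:

from sklearn.cluster import KMeans

wcss_list = []                                    # empty list for the WCSS values
for i in range(1, 11):                            # 11 so that k = 10 is included
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(x)
    wcss_list.append(kmeans.inertia_)             # inertia_ is the WCSS for this k

mtp.plot(range(1, 11), wcss_list)
mtp.title('The Elbow Method Graph')
mtp.xlabel('Number of clusters (k)')
mtp.ylabel('WCSS')
mtp.show()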
As we can see in the above code, we have used the KMeans class of the sklearn.cluster library to form the clusters.

Next, we have created the wcss_list variable, initialized as an empty list, which is used to hold the WCSS value computed for different values of k ranging from 1 to 10.

After that, we have initialized the for loop to iterate over different values of k from 1 to 10; since a for loop over range() in Python excludes the upper bound, the limit is taken as 11 so that the 10th value is included.

The rest of the code is similar to what we did in earlier topics: we have fitted the model on the matrix of features and then plotted the graph between the number of clusters and WCSS.
Output: After executing the above code, we will get the below output:
From the above plot, we can see the elbow point is at 5. So the number of clusters
here will be 5.
Step- 3: Training the K-means algorithm on the training dataset
As we have got the number of clusters, we can now train the model on the dataset.
To train the model, we will use the same two lines of code as we have used in the
above section, but here instead of using i, we will use 5, as we know there are 5
clusters that need to be formed. The code is given below:
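The two lines are not reproduced in the handout; a reconstruction consistent with the description below (5 clusters, with the same KMeans settings assumed as above) is:

kmeans = KMeans(n_clusters=5, init='k-means++', random_state=42)
y_predict = kmeans.fit_predict(x)   # cluster index (0-4) predicted for each customer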
The first line is the same as above for creating the object of KMeans class.
In the second line of code, we have created the variable y_predict, which stores the cluster predicted for each observation while the model is trained.
By executing the above lines of code, we will get the y_predict variable. We can
check it under the variable explorer option in the Spyder IDE. We can now
compare the values of y_predict with our original dataset. Consider the below
image:
From the above image, we can now see that the customer with CustomerID 1 belongs to cluster 3 (as the index starts from 0, the value 2 corresponds to cluster 3), CustomerID 2 belongs to cluster 4, and so on.
Step-4: Visualizing the clusters
The last step is to visualize the clusters. As we have 5 clusters for our model, we will visualize each cluster one by one.

To visualize the clusters, we will use a scatter plot drawn with the mtp.scatter() function of matplotlib.
mtp.scatter(x[y_predict == 0, 0], x[y_predict == 0, 1], s = 100, c = 'blue', label = 'Cluster 1')     # for first cluster
mtp.scatter(x[y_predict == 1, 0], x[y_predict == 1, 1], s = 100, c = 'green', label = 'Cluster 2')    # for second cluster
mtp.scatter(x[y_predict == 2, 0], x[y_predict == 2, 1], s = 100, c = 'red', label = 'Cluster 3')      # for third cluster
mtp.scatter(x[y_predict == 3, 0], x[y_predict == 3, 1], s = 100, c = 'cyan', label = 'Cluster 4')     # for fourth cluster
mtp.scatter(x[y_predict == 4, 0], x[y_predict == 4, 1], s = 100, c = 'magenta', label = 'Cluster 5')  # for fifth cluster
mtp.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 300, c = 'yellow', label = 'Centroid')
mtp.title('Clusters of customers')
mtp.xlabel('Annual Income (k$)')
mtp.ylabel('Spending Score (1-100)')
mtp.legend()
mtp.show()
In the above lines of code, we have written one scatter call for each cluster, from 1 to 5. The first argument of mtp.scatter, i.e., x[y_predict == 0, 0], selects the x values (the first feature) of the points whose predicted cluster is 0; y_predict takes values from 0 to 4, one for each of the five clusters.
Output:
The output image clearly shows the five different clusters in different colors. The clusters are formed between two parameters of the dataset: the annual income of the customer and the spending score. We can change the colors and labels as per requirement or choice. We can also observe some points from the above patterns, which are given below:
o Cluster1 shows the customers with average salary and average spending so
we can categorize these customers as
o Cluster2 shows the customer has a high income but low spending, so we can
categorize them as careful.
o Cluster3 shows the low income and also low spending so they can be
categorized as sensible.
o Cluster4 shows the customers with low income with very high spending so
they can be categorized as careless.
o Cluster5 shows the customers with high income and high spending so they
can be categorized as target, and these customers can be the most profitable
customers for the mall owner.
From k-medoids to CLARANS
The CLARANS algorithm addresses the drawbacks of both the K-Medoids and CLARA algorithms, besides dealing with data that is difficult to handle in data mining, i.e., spatial data. It maintains a balance between the computational cost and the influence of data sampling on cluster formation. Its steps are as follows:
1. Select ‘k’ random data points and label them as medoids for the time being.
2. Select a random point say ‘a’ from the points picked in step (1), and another
point say ‘b’ which is not included in those points.
3. We would already have the sum of distances of point ‘a’ from all other
points since that computation is required for selecting the points in step (1).
Perform similar computation for point ‘b’.
4. If the sum of distances from all other points for point ‘b’ turns out to be less
than that for point ‘a’, replace ‘a’ by ‘b’.
5. The algorithm performs such a randomized search of medoids ‘x’ times
where ‘x’ denotes the number of local minima computed, i.e. number of
iterations to be performed, which we specify as a parameter. The set of
medoids obtained after such ‘x’ number of steps is termed as ‘local
optimum’.
6. A counter is incremented every time a replacement of points is made. The process of examining the points for possible replacement is repeated until the counter exceeds the maximum number of neighbours to be examined (specified as a parameter).
7. The set of medoids obtained when the algorithm stops is the best local optimum choice of medoids. A minimal sketch of this randomized search is given after these steps.
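The steps above can be sketched in a few lines of Python. This is an illustrative, from-scratch sketch only: the data, numlocal, and maxneighbor values are assumptions, the cost of a swap is evaluated as the total clustering cost of the new configuration (the usual CLARANS formulation), and, following common CLARANS descriptions, the neighbour counter is reset whenever an improving swap is found.

import numpy as np

def total_cost(X, medoids):
    """Sum of distances from every point to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def clarans(X, k, numlocal=2, maxneighbor=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X)
    best_medoids, best_cost = None, np.inf
    for _ in range(numlocal):                                    # number of local optima to compute
        current = list(rng.choice(n, size=k, replace=False))     # Step 1: random medoids
        current_cost = total_cost(X, current)
        examined = 0
        while examined < maxneighbor:                            # Step 6: bounded neighbour search
            a = int(rng.integers(k))                             # Step 2: a current medoid ...
            b = int(rng.integers(n))                             # ... and a candidate replacement
            if b in current:
                examined += 1
                continue
            neighbour = current.copy()
            neighbour[a] = b                                     # Step 4: try replacing 'a' by 'b'
            neighbour_cost = total_cost(X, neighbour)            # Step 3: cost of the swapped set
            if neighbour_cost < current_cost:
                current, current_cost = neighbour, neighbour_cost
                examined = 0                                     # improvement found: restart the count
            else:
                examined += 1
        if current_cost < best_cost:                             # Step 7: keep the best local optimum
            best_medoids, best_cost = current, current_cost
    return best_medoids, best_cost

# Example usage on synthetic 2-D data:
X = np.random.default_rng(1).random((200, 2))
medoids, cost = clarans(X, k=3)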