Clustering

Clustering is the task of dividing a population or set of data points into groups such that
data points in the same group are more similar to one another than to data points in other
groups. In other words, it is a way of collecting objects into groups on the basis of the
similarity and dissimilarity between them.

For example, the data points in the graph below that cluster together can be classified into
a single group. We can distinguish the clusters and identify that there are 3 clusters in the
picture below.

Why Clustering?
Clustering is important because it determines the intrinsic grouping among unlabelled data.
There are no universal criteria for a good clustering; it depends on the user and on which
criteria satisfy their need. For instance, we could be interested in finding
representatives for homogeneous groups (data reduction), in finding “natural clusters” and
describing their unknown properties (“natural” data types), in finding useful and suitable groupings
(“useful” data classes), or in finding unusual data objects (outlier detection). A clustering
algorithm must make some assumptions about what constitutes the similarity of points, and
different assumptions yield different, and equally valid, clusterings.

The diagram below illustrates the working of a clustering algorithm: the different
fruits are divided into several groups with similar properties.
Types of Clustering Methods

Clustering methods are broadly divided into hard clustering (each data point belongs to only one
group) and soft clustering (a data point can also belong to other groups). Beyond this
distinction, various other approaches to clustering exist. Below are the main clustering methods
used in machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

What is K-Means Algorithm?

K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created
in the process; for example, if K=2 there will be two clusters, for K=3 there will be three clusters,
and so on.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for the K center points, or centroids, by an iterative process.
o Assigns each data point to its closest k-center. The data points near a particular
k-center form a cluster.

The diagram below illustrates the working of the K-means clustering algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as centroids. (They may be points other than those in the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of
each cluster.

Step-6: If any reassignment occurs, go to step-4; else go to FINISH.

Step-7: The model is ready.
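As a concrete illustration of these steps, here is a minimal K-means sketch in Python (NumPy assumed; the function name kmeans, the random initialization, and the toy data are illustrative choices, not part of the original text):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random points from the dataset as initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Step-3 / Step-5: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: place each new centroid at the mean of its cluster
        # (this sketch assumes no cluster ever becomes empty)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step-6: if the centroids did not move, no reassignment occurs, so stop
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # toy data
labels, centroids = kmeans(X, K=2)
```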

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below:
o Let's take the number of clusters K=2, i.e., we will try to group this dataset into two
different clusters.
o We need to choose K random points or centroids to form the clusters. These points can
be either points from the dataset or any other points. Here we are selecting the below
two points as the k points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest centroid. We
compute this using the familiar mathematics for calculating the distance between two points,
and then draw a median line between the two centroids. Consider the below image:

From the above image, it is clear that the points on the left side of the line are nearer to the K1
or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's
color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we repeat the process by choosing new
centroids. To choose the new centroids, we compute the center of gravity of each current
cluster and place the new centroids there, as below:
o Next, we reassign each data point to its new closest centroid. For this, we repeat the
process of finding a median line. The median will be as in the below image:

From the above image we can see that one yellow point is on the left side of the line, and two blue
points are to the right of the line. So these three points will be assigned to new centroids.

As reassignment has taken place, we again go to step-4, which is finding new centroids
or K-points.

o We repeat the process of finding the center of gravity of each cluster, so the new
centroids will be as shown in the below image:
o As we have the new centroids, we again draw the median line and reassign the data points.
So, the image will be:

o We can see in the above image that no data points switch sides of the line,
which means our model has converged. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters
will be as shown in the below image:

How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends on the quality of the clusters it
forms, and choosing the optimal number of clusters is itself a hard task. There are different ways
to find the optimal number of clusters; here we discuss the most widely used method for finding
the number of clusters, or value of K. The method is given below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of clusters. This
method uses the concept of the WCSS value. WCSS stands for Within-Cluster Sum of Squares,
which measures the total variation within the clusters. The formula to calculate the value of WCSS
(for 3 clusters) is given below:

$$\mathrm{WCSS} = \sum_{P_i \in \text{Cluster}_1} \text{distance}(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster}_2} \text{distance}(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster}_3} \text{distance}(P_i, C_3)^2$$

In the above formula of WCSS, the first term $\sum_{P_i \in \text{Cluster}_1} \text{distance}(P_i, C_1)^2$
is the sum of the squared distances between each data point in cluster 1 and its centroid $C_1$,
and likewise for the other two terms.

To measure the distance between data points and a centroid, we can use any distance measure,
such as Euclidean distance or Manhattan distance.
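For instance, a quick sketch of both measures in Python (NumPy assumed; the points are arbitrary):

```python
import numpy as np

p = np.array([1.0, 3.0])   # data point
c = np.array([2.0, 5.0])   # centroid
euclidean = np.linalg.norm(p - c)  # sqrt((1-2)^2 + (3-5)^2) ≈ 2.236
manhattan = np.abs(p - c).sum()    # |1-2| + |3-5| = 3.0
```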
To find the optimal number of clusters, the elbow method follows the below steps:

o It executes K-means clustering on a given dataset for different values of K (e.g., ranging
from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve of the calculated WCSS values against the number of clusters K.
o If the plot looks like an arm, the sharp point of bend (the "elbow") is considered
the best value of K.
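As an illustration, here is a minimal sketch of the elbow method using scikit-learn (the library choice, the toy data, and the 1–10 range for K are assumptions for the example; a fitted KMeans model exposes its WCSS as inertia_):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(50, 2) + c for c in (0, 5, 10)])  # toy data

wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)  # inertia_ is the within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()  # the best K is at the sharp bend (the "elbow") of the curve
```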

Soft Clustering:

In soft clustering, instead of putting each data point into exactly one cluster, we assign a
probability of that point belonging to each cluster. That is, in soft clustering or fuzzy
clustering, each data point can belong to multiple clusters, each with a probability score or
likelihood.

One of the widely used soft clustering algorithms is the Fuzzy C-means clustering (FCM)
Algorithm.

Fuzzy C-Means Clustering:

Fuzzy C-Means clustering is a soft clustering approach, where each data point is assigned a
likelihood or probability score of belonging to each cluster. The step-wise approach of the Fuzzy
C-means clustering algorithm is:

• Fix the value of c (number of clusters), select a value of m (generally 1.25 < m < 2), and
initialize the partition matrix U.

• Calculate the cluster centers (centroids). The standard form of this update, consistent with
the worked example below, is

$$v_j = \frac{\sum_{k=1}^{n} (\mu_{jk})^m\, x_k}{\sum_{k=1}^{n} (\mu_{jk})^m}$$

Here,
µ: fuzzy membership value
m: fuzziness parameter

• Update the partition matrix:

$$\mu_{jk} = \left[ \sum_{i=1}^{c} \left( \frac{d_{jk}}{d_{ik}} \right)^{\frac{2}{m-1}} \right]^{-1}$$

where $d_{jk}$ is the distance of data point $x_k$ from centroid $v_j$.

• Repeat the above steps until convergence.

Suppose the given data points are {(1, 3), (2, 5), (4, 8), (7, 9)}.
The steps to perform the algorithm are:

Step 1: Randomly initialize the membership of the data points in the desired number of clusters.
Let's assume the data is to be divided into 2 clusters. Each data point lies in both clusters
with some membership value, which can be assumed to be anything in the initial state.
The table below shows the data points along with their membership (gamma) in each of the
clusters:

Cluster  (1, 3)  (2, 5)  (4, 8)  (7, 9)
1)       0.8     0.7     0.2     0.1
2)       0.2     0.3     0.8     0.9

Step 2: Find the centroids.

The formula for finding the centroid V is:

$$v_j = \frac{\sum_{k} (\mu_{k})^{m}\, x_{k}}{\sum_{k} (\mu_{k})^{m}}$$

where µ is the fuzzy membership value of the data point, m is the fuzziness parameter (generally
taken as 2), and $x_k$ is the data point.

Here,
V11 = (0.8² × 1 + 0.7² × 2 + 0.2² × 4 + 0.1² × 7) / (0.8² + 0.7² + 0.2² + 0.1²) = 1.568
V12 = (0.8² × 3 + 0.7² × 5 + 0.2² × 8 + 0.1² × 9) / (0.8² + 0.7² + 0.2² + 0.1²) = 4.051
V21 = (0.2² × 1 + 0.3² × 2 + 0.8² × 4 + 0.9² × 7) / (0.2² + 0.3² + 0.8² + 0.9²) = 5.35
V22 = (0.2² × 3 + 0.3² × 5 + 0.8² × 8 + 0.9² × 9) / (0.2² + 0.3² + 0.8² + 0.9²) = 8.215

The centroids are (1.568, 4.051) and (5.35, 8.215).
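As a quick check of this arithmetic, a small NumPy snippet (the variable names are illustrative):

```python
import numpy as np

g1 = np.array([0.8, 0.7, 0.2, 0.1]) ** 2  # cluster-1 memberships raised to m = 2
x = np.array([1, 2, 4, 7])
y = np.array([3, 5, 8, 9])
print((g1 * x).sum() / g1.sum(), (g1 * y).sum() / g1.sum())  # 1.568..., 4.050...
```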

Step 3: Find the distance of each point from each centroid.

D11 = ((1 − 1.568)² + (3 − 4.051)²)^0.5 = 1.2
D12 = ((1 − 5.35)² + (3 − 8.215)²)^0.5 = 6.79

Similarly, the distances of all other points from both centroids are computed.
Step 4: Update the membership values.

For point 1 the new membership values are:

γ₁₁ = [ { (1.2)²/(1.2)² + (1.2)²/(6.79)² } ^ (1/(2 − 1)) ]⁻¹ = 0.96
γ₂₁ = [ { (6.79)²/(6.79)² + (6.79)²/(1.2)² } ^ (1/(2 − 1)) ]⁻¹ = 0.04

Alternatively, since the memberships of a point across the clusters sum to 1,
γ₂₁ = 1 − γ₁₁ = 0.04.

Similarly, compute all other membership values and update the matrix.

Step 5: Repeat steps 2–4 until the membership values become constant, or until the difference
between two consecutive updates is less than the tolerance value (a small value up to which a
difference in values between two consecutive updates is accepted).

Step 6: Defuzzify the obtained membership values (e.g., assign each point to the cluster in
which its membership is highest).
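To make the loop concrete, here is a minimal fuzzy C-means sketch in Python (NumPy assumed; the function name fcm, m = 2, and the tolerance are illustrative choices), applied to the four example points:

```python
import numpy as np

def fcm(X, c, m=2.0, tol=1e-5, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random membership matrix U (c x n); each column sums to 1
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)
    for _ in range(max_iters):
        # Step 2: centroids as membership^m weighted means of the points
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)
        # Step 3: distance of each point from each centroid
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        D = np.fmax(D, 1e-12)  # avoid division by zero
        # Step 4: membership update mu_jk = 1 / sum_i (d_jk / d_ik)^(2/(m-1))
        U_new = 1.0 / ((D[:, None, :] / D[None, :, :]) ** (2 / (m - 1))).sum(axis=1)
        # Step 5: stop when memberships stabilize within the tolerance
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V

X = np.array([[1, 3], [2, 5], [4, 8], [7, 9]], dtype=float)
U, V = fcm(X, c=2)
labels = U.argmax(axis=0)  # Step 6: defuzzify to hard cluster labels
```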

Entropy
Data clustering involves solving two main problems. The first is defining exactly what
makes a good clustering of the data. The second is finding an effective technique to
search through all possible combinations of clusters for the best clustering. Entropy
addresses the first problem. Entropy is a metric that measures the amount of disorder in a
vector. There are several variations of entropy; the most common is Shannon's entropy.
Expressed mathematically, Shannon's entropy is:

$$H(X) = -\sum_{i} P(x_i)\, \log_2 P(x_i)$$

Here H is the symbol for entropy, X is a vector of zero-indexed symbols, and P means "probability
of." The log2 function (log to base 2) is used with the convention that log2(0) = 0.0 rather than
the true value of negative infinity. Entropy is best explained by example. Suppose you have a
vector X = { red, red, blue, green, green, green }. Then x0 = red, x1 = blue, and x2 = green. The
probability of red is P(x0) = 2/6 = 0.33. Similarly, P(x1) = 1/6 = 0.17 and P(x2) = 3/6 = 0.50.
Putting these values into the equation gives:

H(X) = − [ 0.33 × log2(0.33) + 0.17 × log2(0.17) + 0.50 × log2(0.50) ]

= − [ (0.33 × −1.58) + (0.17 × −2.58) + (0.50 × −1.00) ]

= − [ −0.53 + −0.43 + −0.50 ]

= 1.46

The smallest possible value for entropy is 0.0, which occurs when all symbols in a vector are the
same. In other words, there's no disorder in the vector. The larger the value of entropy, the more
disorder there is in the associated vector.
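As a small sketch of this calculation in Python (standard library only; the function name is an illustrative choice):

```python
import math
from collections import Counter

def shannon_entropy(values):
    # H(X) = -sum_i P(x_i) * log2(P(x_i)); symbols with count 0 never appear here
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

x = ["red", "red", "blue", "green", "green", "green"]
print(shannon_entropy(x))  # about 1.46
```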
