Clustering

Clustering is the task of dividing a population or set of data points into groups such that
data points in the same group are more similar to one another than to data points in other
groups. In other words, it is a way of collecting objects into groups on the basis of the
similarity and dissimilarity between them.

For example, the data points in the graph below that cluster together can be classified into
a single group. We can distinguish the clusters and identify that there are 3 clusters in the
picture below.

Why Clustering?
Clustering is important because it determines the intrinsic grouping among unlabelled data.
There are no universal criteria for a good clustering; it depends on the user and on which
criteria satisfy their need. For instance, we could be interested in finding
representatives for homogeneous groups (data reduction), in finding “natural clusters” and
describing their unknown properties (“natural” data types), in finding useful and suitable groupings
(“useful” data classes), or in finding unusual data objects (outlier detection). A clustering
algorithm must make some assumptions about what constitutes the similarity of points, and
different assumptions yield different, and equally valid, clusterings.

The diagram below illustrates the working of a clustering algorithm: the different
fruits are divided into several groups with similar properties.
Types of Clustering Methods

Clustering methods are broadly divided into hard clustering (each data point belongs to only one
group) and soft clustering (a data point can also belong to other groups). Beyond this
distinction, various other approaches to clustering exist. Below are the main clustering methods
used in machine learning:

1. Partitioning Clustering
2. Density-Based Clustering
3. Distribution Model-Based Clustering
4. Hierarchical Clustering
5. Fuzzy Clustering

What is K-Means Algorithm?

K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created
in the process; for example, if K=2 there will be two clusters, for K=3 there will be three clusters,
and so on.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for the K center points, or centroids, by an iterative process.
o Assigns each data point to its closest k-center. The data points near a particular
k-center form a cluster.

The diagram below illustrates the working of the K-means clustering algorithm:

How does the K-Means Algorithm Work?

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select K random points as centroids. (They may be points other than those in the input dataset.)

Step-3: Assign each data point to its closest centroid, which will form the predefined K clusters.

Step-4: Calculate the variance and place a new centroid for each cluster.

Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of
each cluster.

Step-6: If any reassignment occurs, go to step-4; else go to FINISH.

Step-7: The model is ready.
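As a concrete illustration of these steps, here is a minimal K-means sketch in Python (NumPy assumed; the function name kmeans, the random initialization, and the toy data are illustrative choices, not part of the original text):

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step-2: pick K random points from the dataset as initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iters):
        # Step-3 / Step-5: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step-4: place each new centroid at the mean of its cluster
        # (this sketch assumes no cluster ever becomes empty)
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # Step-6: if the centroids did not move, no reassignment occurs, so stop
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # toy data
labels, centroids = kmeans(X, K=2)
```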

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below:
o Let's take the number of clusters K=2, i.e., we will try to group this dataset into two
different clusters.
o We need to choose K random points or centroids to form the clusters. These points can
be either points from the dataset or any other points. Here we are selecting the below
two points as the k points, which are not part of our dataset. Consider the below image:

o Now we will assign each data point of the scatter plot to its closest centroid. We
compute this using the familiar mathematics for calculating the distance between two points,
and then draw a median line between the two centroids. Consider the below image:

From the above image, it is clear that the points on the left side of the line are nearer to the K1
or blue centroid, and the points to the right of the line are closer to the yellow centroid. Let's
color them blue and yellow for clear visualization.
o As we need to find the closest cluster, we repeat the process by choosing new
centroids. To choose the new centroids, we compute the center of gravity of each current
cluster and place the new centroids there, as below:
o Next, we reassign each data point to its new closest centroid. For this, we repeat the
process of finding a median line. The median will be as in the below image:

From the above image we can see that one yellow point is on the left side of the line, and two blue
points are to the right of the line. So these three points will be assigned to new centroids.

As reassignment has taken place, we again go to step-4, which is finding new centroids
or K-points.

o We repeat the process of finding the center of gravity of each cluster, so the new
centroids will be as shown in the below image:
o As we have the new centroids, we again draw the median line and reassign the data points.
So, the image will be:

o We can see in the above image that no data points switch sides of the line,
which means our model has converged. Consider the below image:

As our model is ready, we can now remove the assumed centroids, and the two final clusters
will be as shown in the below image:

How to choose the value of "K number of clusters" in K-means Clustering?

The performance of the K-means clustering algorithm depends on the quality of the clusters it
forms, and choosing the optimal number of clusters is itself a hard task. There are different ways
to find the optimal number of clusters; here we discuss the most widely used method for finding
the number of clusters, or value of K. The method is given below:

Elbow Method

The Elbow method is one of the most popular ways to find the optimal number of clusters. This
method uses the concept of the WCSS value. WCSS stands for Within-Cluster Sum of Squares,
which measures the total variation within the clusters. The formula to calculate the value of WCSS
(for 3 clusters) is given below:

$$\mathrm{WCSS} = \sum_{P_i \in \text{Cluster}_1} \text{distance}(P_i, C_1)^2 + \sum_{P_i \in \text{Cluster}_2} \text{distance}(P_i, C_2)^2 + \sum_{P_i \in \text{Cluster}_3} \text{distance}(P_i, C_3)^2$$

In the above formula of WCSS, the first term $\sum_{P_i \in \text{Cluster}_1} \text{distance}(P_i, C_1)^2$
is the sum of the squared distances between each data point in cluster 1 and its centroid $C_1$,
and likewise for the other two terms.

To measure the distance between data points and a centroid, we can use any distance measure,
such as Euclidean distance or Manhattan distance.
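For instance, a quick sketch of both measures in Python (NumPy assumed; the points are arbitrary):

```python
import numpy as np

p = np.array([1.0, 3.0])   # data point
c = np.array([2.0, 5.0])   # centroid
euclidean = np.linalg.norm(p - c)  # sqrt((1-2)^2 + (3-5)^2) ≈ 2.236
manhattan = np.abs(p - c).sum()    # |1-2| + |3-5| = 3.0
```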
To find the optimal number of clusters, the elbow method follows the below steps:

o It executes K-means clustering on a given dataset for different values of K (e.g., ranging
from 1 to 10).
o For each value of K, it calculates the WCSS value.
o It plots a curve of the calculated WCSS values against the number of clusters K.
o If the plot looks like an arm, the sharp point of bend (the "elbow") is considered
the best value of K.
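As an illustration, here is a minimal sketch of the elbow method using scikit-learn (the library choice, the toy data, and the 1–10 range for K are assumptions for the example; a fitted KMeans model exposes its WCSS as inertia_):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(50, 2) + c for c in (0, 5, 10)])  # toy data

wcss = []
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(model.inertia_)  # inertia_ is the within-cluster sum of squares

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters K")
plt.ylabel("WCSS")
plt.show()  # the best K is at the sharp bend (the "elbow") of the curve
```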

Soft Clustering:

In soft clustering, instead of putting each data point into exactly one cluster, we assign a
probability of that point belonging to each cluster. That is, in soft clustering or fuzzy
clustering, each data point can belong to multiple clusters, each with a probability score or
likelihood.

One of the widely used soft clustering algorithms is the Fuzzy C-means clustering (FCM)
Algorithm.

Fuzzy C-Means Clustering:

Fuzzy C-Means clustering is a soft clustering approach, where each data point is assigned a
likelihood or probability score of belonging to each cluster. The step-wise approach of the Fuzzy
C-means clustering algorithm is:

• Fix the value of c (number of clusters), select a value of m (generally 1.25 < m < 2), and
initialize the partition matrix U.

• Calculate the cluster centers (centroids). The standard form of this update, consistent with
the worked example below, is

$$v_j = \frac{\sum_{k=1}^{n} (\mu_{jk})^m\, x_k}{\sum_{k=1}^{n} (\mu_{jk})^m}$$

Here,
µ: fuzzy membership value
m: fuzziness parameter

• Update the partition matrix:

$$\mu_{jk} = \left[ \sum_{i=1}^{c} \left( \frac{d_{jk}}{d_{ik}} \right)^{\frac{2}{m-1}} \right]^{-1}$$

where $d_{jk}$ is the distance of data point $x_k$ from centroid $v_j$.

• Repeat the above steps until convergence.

Suppose the given data points are {(1, 3), (2, 5), (4, 8), (7, 9)}.
The steps to perform the algorithm are:

Step 1: Randomly initialize the membership of the data points in the desired number of clusters.
Let's assume the data is to be divided into 2 clusters. Each data point lies in both clusters
with some membership value, which can be assumed to be anything in the initial state.
The table below shows the data points along with their membership (gamma) in each of the
clusters:

Cluster  (1, 3)  (2, 5)  (4, 8)  (7, 9)
1)       0.8     0.7     0.2     0.1
2)       0.2     0.3     0.8     0.9

Step 2: Find the centroids.

The formula for finding the centroid V is:

$$v_j = \frac{\sum_{k} (\mu_{k})^{m}\, x_{k}}{\sum_{k} (\mu_{k})^{m}}$$

where µ is the fuzzy membership value of the data point, m is the fuzziness parameter (generally
taken as 2), and $x_k$ is the data point.

Here,
V11 = (0.8² × 1 + 0.7² × 2 + 0.2² × 4 + 0.1² × 7) / (0.8² + 0.7² + 0.2² + 0.1²) = 1.568
V12 = (0.8² × 3 + 0.7² × 5 + 0.2² × 8 + 0.1² × 9) / (0.8² + 0.7² + 0.2² + 0.1²) = 4.051
V21 = (0.2² × 1 + 0.3² × 2 + 0.8² × 4 + 0.9² × 7) / (0.2² + 0.3² + 0.8² + 0.9²) = 5.35
V22 = (0.2² × 3 + 0.3² × 5 + 0.8² × 8 + 0.9² × 9) / (0.2² + 0.3² + 0.8² + 0.9²) = 8.215

The centroids are (1.568, 4.051) and (5.35, 8.215).
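As a quick check of this arithmetic, a small NumPy snippet (the variable names are illustrative):

```python
import numpy as np

g1 = np.array([0.8, 0.7, 0.2, 0.1]) ** 2  # cluster-1 memberships raised to m = 2
x = np.array([1, 2, 4, 7])
y = np.array([3, 5, 8, 9])
print((g1 * x).sum() / g1.sum(), (g1 * y).sum() / g1.sum())  # 1.568..., 4.050...
```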

Step 3: Find the distance of each point from each centroid.

D11 = ((1 − 1.568)² + (3 − 4.051)²)^0.5 = 1.2
D12 = ((1 − 5.35)² + (3 − 8.215)²)^0.5 = 6.79

Similarly, the distances of all other points from both centroids are computed.
Step 4: Update the membership values.

For point 1 the new membership values are:

γ₁₁ = [ { (1.2)²/(1.2)² + (1.2)²/(6.79)² } ^ (1/(2 − 1)) ]⁻¹ = 0.96
γ₂₁ = [ { (6.79)²/(6.79)² + (6.79)²/(1.2)² } ^ (1/(2 − 1)) ]⁻¹ = 0.04

Alternatively, since the memberships of a point across the clusters sum to 1,
γ₂₁ = 1 − γ₁₁ = 0.04.

Similarly, compute all other membership values and update the matrix.

Step 5: Repeat steps 2–4 until the membership values become constant, or until the difference
between two consecutive updates is less than the tolerance value (a small value up to which a
difference in values between two consecutive updates is accepted).

Step 6: Defuzzify the obtained membership values (e.g., assign each point to the cluster in
which its membership is highest).
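To make the loop concrete, here is a minimal fuzzy C-means sketch in Python (NumPy assumed; the function name fcm, m = 2, and the tolerance are illustrative choices), applied to the four example points:

```python
import numpy as np

def fcm(X, c, m=2.0, tol=1e-5, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: random membership matrix U (c x n); each column sums to 1
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)
    for _ in range(max_iters):
        # Step 2: centroids as membership^m weighted means of the points
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)
        # Step 3: distance of each point from each centroid
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        D = np.fmax(D, 1e-12)  # avoid division by zero
        # Step 4: membership update mu_jk = 1 / sum_i (d_jk / d_ik)^(2/(m-1))
        U_new = 1.0 / ((D[:, None, :] / D[None, :, :]) ** (2 / (m - 1))).sum(axis=1)
        # Step 5: stop when memberships stabilize within the tolerance
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V

X = np.array([[1, 3], [2, 5], [4, 8], [7, 9]], dtype=float)
U, V = fcm(X, c=2)
labels = U.argmax(axis=0)  # Step 6: defuzzify to hard cluster labels
```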

Entropy
Data clustering involves solving two main problems. The first is defining exactly what
makes a good clustering of the data. The second is finding an effective technique to
search through all possible combinations of clusters for the best clustering. Entropy
addresses the first problem. Entropy is a metric that measures the amount of disorder in a
vector. There are several variations of entropy; the most common is Shannon's entropy.
Expressed mathematically, Shannon's entropy is:

$$H(X) = -\sum_{i} P(x_i)\, \log_2 P(x_i)$$

Here H is the symbol for entropy, X is a vector of zero-indexed symbols, and P means "probability
of." The log2 function (log to base 2) is used with the convention that log2(0) = 0.0 rather than
the true value of negative infinity. Entropy is best explained by example. Suppose you have a
vector X = { red, red, blue, green, green, green }. Then x0 = red, x1 = blue, and x2 = green. The
probability of red is P(x0) = 2/6 = 0.33. Similarly, P(x1) = 1/6 = 0.17 and P(x2) = 3/6 = 0.50.
Putting these values into the equation gives:

H(X) = − [ 0.33 × log2(0.33) + 0.17 × log2(0.17) + 0.50 × log2(0.50) ]

= − [ (0.33 × −1.58) + (0.17 × −2.58) + (0.50 × −1.00) ]

= − [ −0.53 + −0.43 + −0.50 ]

= 1.46

The smallest possible value for entropy is 0.0, which occurs when all symbols in a vector are the
same. In other words, there's no disorder in the vector. The larger the value of entropy, the more
disorder there is in the associated vector.
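As a small sketch of this calculation in Python (standard library only; the function name is an illustrative choice):

```python
import math
from collections import Counter

def shannon_entropy(values):
    # H(X) = -sum_i P(x_i) * log2(P(x_i)); symbols with count 0 never appear here
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

x = ["red", "red", "blue", "green", "green", "green"]
print(shannon_entropy(x))  # about 1.46
```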
