Unit IV

K-means Clustering

K-means:
• K-means is an algorithm that clusters n objects, based on their attributes, into k partitions, where k < n.
• K-means clustering is an unsupervised clustering technique.
• It is a partition-based clustering algorithm.
• A cluster is defined as a group of objects that belong to the same class.
K-Means Clustering Algorithm
K-Means Clustering Algorithm involves the following
steps-
Step-01:
• Choose the number of clusters K.
Step-02:
• Randomly select any K data points as cluster centers.
• Select the cluster centers so that they are as far apart from each other as possible.
Step-03:
• Calculate the distance between each data point and each cluster center.
• The distance may be calculated using, for example, the Euclidean distance or the Manhattan distance.
Step-04:
• Assign each data point to some cluster.
A data point is assigned to the cluster whose center is nearest to it.
Step-05:
• Re-compute the center of each newly formed cluster.
The center of a cluster is computed by taking the mean of all the data points contained in that cluster.
Step-06:
• Keep repeating Step-03 to Step-05 until any of the following stopping criteria is met:
• The centers of the newly formed clusters do not change
• The data points remain in the same clusters
• The maximum number of iterations is reached
Squared Error criteria
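The squared-error (SSE) criterion that K-means minimizes can be written in the standard form below, where \mu_i denotes the center of cluster C_i:

\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

Each assignment step (Step-04) and each center update (Step-05) can only decrease this sum, which is why the algorithm is guaranteed to converge to a (local) optimum.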
Flowchart
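In place of the flowchart, the procedure can also be sketched directly in code. The following is a minimal NumPy sketch of Steps 01-06, not part of the original slides; names such as kmeans, max_iters, and seed are illustrative choices:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means sketch following Steps 01-06 above (empty clusters not handled)."""
    rng = np.random.default_rng(seed)
    # Step-02: randomly pick k data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):                               # Step-06: iterate
        # Step-03: Euclidean distance from every point to every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step-04: assign each point to the cluster with the nearest center
        labels = dists.argmin(axis=1)
        # Step-05: recompute each center as the mean of the points assigned to it
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```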
Example
Use the K-Means algorithm to create two clusters from the following data points:

Point   X     Y
A       2     2
B       3     2
C       1     1
D       3     1
E       1.5   0.5
Solution-
• We follow the K-Means clustering algorithm discussed above.
• Assume A(2, 2) and C(1, 1) are the initial centers of the two clusters.
Iteration-01:
• We calculate the distance of each point from each of the two cluster centers.
• The distance is calculated using the Euclidean distance formula.
The following illustration shows the calculation of the distance between point A(2, 2) and each of the two cluster centers:
Calculating the distance between A(2, 2) and C1(2, 2):
ρ(A, C1)
= sqrt[ (x2 – x1)^2 + (y2 – y1)^2 ]
= sqrt[ (2 – 2)^2 + (2 – 2)^2 ]
= sqrt[ 0 + 0 ]
= 0
Calculating the distance between A(2, 2) and C2(1, 1):
ρ(A, C2)
= sqrt[ (x2 – x1)^2 + (y2 – y1)^2 ]
= sqrt[ (1 – 2)^2 + (1 – 2)^2 ]
= sqrt[ 1 + 1 ]
= sqrt[ 2 ] ≈ 1.41
• In a similar manner, we calculate the distance of the other points from each of the two cluster centers:

Given point   Distance from C1(2, 2)   Distance from C2(1, 1)   Belongs to cluster
A(2, 2)       0.00                     1.41                     C1
B(3, 2)       1.00                     2.24                     C1
C(1, 1)       1.41                     0.00                     C2
D(3, 1)       1.41                     2.00                     C1
E(1.5, 0.5)   1.58                     0.71                     C2

From here, the new clusters are:

Cluster-01:
• The first cluster contains the points A(2, 2), B(3, 2), D(3, 1)
Cluster-02:
• The second cluster contains the points C(1, 1), E(1.5, 0.5)
Now,
• We re-compute the new cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
• Center of Cluster-01
• = ((2 + 3 + 3)/3, (2 + 2 + 1)/3)
• = (2.67, 1.67)

For Cluster-02:
• Center of Cluster-02
• = ((1 + 1.5)/2, (1 + 0.5)/2)
• = (1.25, 0.75)
This completes Iteration-01.
Next,
• We go to Iteration-02, Iteration-03, and so on until the centers do not change any more.
Iteration-02:
Given point   Distance from C1(2.67, 1.67)   Distance from C2(1.25, 0.75)   Belongs to cluster
A(2, 2)       0.73                           1.45                           C1
B(3, 2)       0.44                           2.14                           C1
C(1, 1)       1.79                           0.34                           C2
D(3, 1)       0.54                           1.76                           C1
E(1.5, 0.5)   1.45                           0.34                           C2

From here, the new clusters are:
Cluster-01:
The first cluster contains the points A(2, 2), B(3, 2), D(3, 1)
Cluster-02:
The second cluster contains the points C(1, 1), E(1.5, 0.5)
Here, the cluster elements are the same as in the previous iteration, so we stop the process.
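The result of this worked example can be reproduced with scikit-learn, assuming it is installed; the initial centers are seeded with the same points A(2, 2) and C(1, 1) used above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Example data points: A, B, C, D, E
X = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])

# Start from the same initial centers as the worked example: A(2, 2) and C(1, 1)
km = KMeans(n_clusters=2, init=np.array([[2.0, 2.0], [1.0, 1.0]]), n_init=1).fit(X)

print(km.labels_)           # expected grouping: {A, B, D} and {C, E}
print(km.cluster_centers_)  # expected centers: (2.67, 1.67) and (1.25, 0.75)
```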
K-means Advantages
• Relatively simple to implement.
• Scales to large data sets.
• Guarantees convergence.
• Easily adapts to new examples.
• Generalizes to clusters of different
shapes and sizes, such as elliptical
clusters.
K-means Disadvantages
• It requires the number of clusters (k) to be specified in advance.
• It cannot handle noisy data and outliers well.
• It is not suitable for identifying clusters with non-convex shapes.
Exercise Problem
Challenges in Unsupervised Learning
• The number of clusters is normally not known a priori.
• For clustering algorithms such as K-means, different initial centers may lead to different clustering results; moreover, K is unknown.
• Time complexity: partitional clustering algorithms are O(N), whereas hierarchical algorithms are O(N²).
• The similarity criterion is not clear: should we use Euclidean, Manhattan, or Hamming distance?
• In hierarchical clustering, at what stage should we stop merging, i.e., where should the dendrogram be cut?
Hierarchical clustering
• Hierarchical clustering methods are used to group the data into a hierarchy or tree-like structure.
• For example, in a machine learning problem of organizing the employees of a university into different departments, the employees are first grouped under the different departments of the university, and then, within each department, the employees can be grouped according to their roles, such as professors, assistant professors, supervisors, lab assistants, etc. This creates a hierarchical structure of the employee data and eases visualization and analysis.
Types of Hierarchical Clustering
There are two types of hierarchical clustering:
1. Agglomerative clustering
2. Divisive clustering
Types of Hierarchical Clustering
• Agglomerative clustering is a type of hierarchical clustering algorithm. It is an unsupervised machine learning technique that divides the population into several clusters such that data points in the same cluster are more similar and data points in different clusters are dissimilar.
• Points in the same cluster are closer to each other.
• Points in different clusters are far apart.
• On the other hand, the divisive method starts with one cluster containing all the given objects and then splits it iteratively to form smaller clusters.
• The agglomerative hierarchical clustering method uses the bottom-up strategy. It starts with each object forming its own cluster and then iteratively merges the clusters according to their similarity to form larger clusters. It terminates either when a certain clustering condition imposed by the user is achieved or when all the clusters merge into a single cluster.
Some pros and cons of
Hierarchical Clustering
Pros
• No assumption of a particular number of clusters (unlike k-means)
• May correspond to meaningful taxonomies
Cons
• Once a decision is made to combine two clusters, it cannot be undone
• Too slow for large data sets: O(n² log n)
Agglomerative Clustering: It uses a bottom-up approach. It starts with each object forming its own cluster and then iteratively merges the clusters according to their similarity to form larger clusters. It terminates either
• when a certain clustering condition imposed by the user is achieved, or
• when all clusters merge into a single cluster.
Variants of agglomerative methods:
1. Agglomerative Algorithm: Single Link
• Single-nearest distance or single linkage is the
agglomerative method that uses the distance
between the closest members of the two
clusters.
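In symbols, the single-link distance between two clusters C_i and C_j is the smallest pairwise distance between their members:

d_{\text{single}}(C_i, C_j) = \min_{x \in C_i,\; y \in C_j} \lVert x - y \rVert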
Question. Find the clusters using a single link
technique. Use Euclidean distance and draw
the dendrogram.
Sample No.   X      Y
P1           0.40   0.53
P2           0.22   0.38
P3           0.35   0.32
P4           0.26   0.19
P5           0.08   0.41
P6           0.45   0.30
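Step 1 (shown as a distance-matrix figure in the original slides) computes the pairwise Euclidean distances between these six points. A minimal SciPy sketch of the whole single-link procedure, assuming SciPy and Matplotlib are available, could look like this:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Points P1..P6 from the table above
points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                   [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

# Step 1: pairwise Euclidean distance matrix
dist_matrix = squareform(pdist(points, metric='euclidean'))
print(np.round(dist_matrix, 2))

# Single-linkage agglomerative clustering and its dendrogram
Z = linkage(points, method='single', metric='euclidean')
dendrogram(Z, labels=['P1', 'P2', 'P3', 'P4', 'P5', 'P6'])
plt.show()
```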
Step 2: Merge the two closest members of the two clusters by finding the minimum element in the distance matrix. Here the minimum value is 0.10, and hence we combine P3 and P6 (as 0.10 appears in the P6 row and P3 column).
Now, form a cluster of the elements corresponding to the minimum value and update the distance matrix. To update the distance matrix:
min ((P3,P6), P1) = min ((P3,P1), (P6,P1)) = min (0.22,0.24) = 0.22
min ((P3,P6), P2) = min ((P3,P2), (P6,P2)) = min (0.14,0.24) = 0.14
min ((P3,P6), P4) = min ((P3,P4), (P6,P4)) = min (0.13,0.22) = 0.13
min ((P3,P6), P5) = min ((P3,P5), (P6,P5)) = min (0.28,0.39) = 0.28
Now we repeat the same process: merge the two closest members of the two clusters and find the minimum element in the distance matrix. The minimum value is 0.13, and hence we combine (P3, P6) and P4. Now, form the cluster of elements corresponding to the minimum value and update the distance matrix:
min (((P3,P6),P4), P1) = min (((P3,P6), P1), (P4,P1)) = min (0.22, 0.37) = 0.22
min (((P3,P6),P4), P2) = min (((P3,P6), P2), (P4,P2)) = min (0.14, 0.19) = 0.14
min (((P3,P6),P4), P5) = min (((P3,P6), P5), (P4,P5)) = min (0.28, 0.23) = 0.23
Again repeating the same process: the minimum value is 0.14, and hence we combine P2 and P5. Now, form the cluster of elements corresponding to the minimum value and update the distance matrix:
min ((P2,P5), P1) = min ((P2,P1), (P5,P1)) = min (0.23, 0.34) = 0.23
min ((P2,P5), (P3,P6,P4)) = min ((P2,(P3,P6,P4)), (P5,(P3,P6,P4)))
= min (0.14, 0.23) = 0.14
Again repeating the same process: the minimum value is 0.14, and hence we combine (P2,P5) and (P3,P6,P4). Now, form the cluster of elements corresponding to the minimum value and update the distance matrix:
min ((P2,P5,P3,P6,P4), P1) = min (((P2,P5), P1), ((P3,P6,P4), P1))
= min (0.23, 0.22) = 0.22
We have now reached the final solution, and the dendrogram for this question is as follows:
DBSCAN Clustering
• There are different approaches and
algorithms to perform clustering tasks
which can be divided into three sub-
categories:
1. Partition-based clustering: E.g. k-
means, k-median
2. Hierarchical clustering: E.g.
Agglomerative, Divisive
3. Density-based clustering: E.g. DBSCAN
Density-based clustering
• Partition-based and hierarchical clustering
techniques are highly efficient with normal
shaped clusters. However, when it comes to
arbitrary shaped clusters or detecting outliers,
density-based techniques are more efficient.
• For example, the dataset in the figure below can easily be divided into three clusters using the k-means algorithm.

(Figure: k-means)
Consider the following figures:

The data points in these figures are grouped in arbitrary shapes or include outliers. Density-based clustering algorithms are very efficient at finding high-density regions and outliers, and detecting such outliers is very important for some tasks.
DBSCAN Algorithm
• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is able to find arbitrarily shaped clusters and clusters with noise (i.e. outliers).
• In DBSCAN, instead of guessing the number of clusters, we define two hyperparameters, epsilon and minPoints, to arrive at clusters.
• Epsilon (ε): The distance that specifies the neighborhoods. Two points are considered to be neighbors if the distance between them is less than or equal to epsilon.
• minPoints (n): The minimum number of data points required to define a cluster.
DBSCAN Algorithm
Based on Epsilon (ε) and minPoints(n)
parameters, points are classified as core, border,
and outlier or noise points:
• Core point: A point is a core point if there are at least minPoints points (including the point itself) within its surrounding area of radius epsilon.
• Border point: A point is a border point if it is reachable from a core point and there are fewer than minPoints points within its surrounding area.
• Noise (outlier) point: A point is a noise point if it is neither a core point nor a border point.
DBSCAN Algorithm
• These points may be better explained with
visualizations.
Density connected
Three terms are necessary in order to understand DBSCAN:
• Directly density reachable: A point is called directly density reachable if it has a core point in its neighborhood.
• Density connected: Two points are called density connected if there is a core point which is density reachable from both of the points.
• Density reachable: A point is called density reachable from another point if they are connected through a series of core points.
Evaluation Metrics of DBSCAN
• We will use the Silhouette score and the Adjusted Rand score for evaluating clustering algorithms.
• The Silhouette score is in the range of -1 to 1. A score near 1 is best, meaning that the data point i is very compact within the cluster to which it belongs and far away from the other clusters. Values near 0 denote overlapping clusters.
• The Adjusted Rand score has a maximum of 1 (random labelings score near 0). More than 0.9 denotes excellent cluster recovery, above 0.8 is good recovery, and less than 0.5 is considered poor recovery.
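As a sketch of how DBSCAN and these metrics fit together, assuming scikit-learn is available (the make_moons dataset and the parameter values eps=0.2 and min_samples=5 are illustrative choices, not taken from the slides):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Illustrative arbitrary-shaped data with some noise
X, y_true = make_moons(n_samples=300, noise=0.08, random_state=42)

# eps = epsilon neighborhood radius, min_samples = minPoints
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                      # label -1 marks noise/outlier points

print("Adjusted Rand score:", adjusted_rand_score(y_true, labels))

# Silhouette score computed on the non-noise points only
mask = labels != -1
if len(set(labels[mask])) > 1:
    print("Silhouette score:", silhouette_score(X[mask], labels[mask]))
```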
DBSCAN
Pros
• DBSCAN is better than other clustering algorithms in that it does not require a pre-set number of clusters.
• It identifies outliers as noise, unlike the Mean-Shift method, which forces such points into a cluster in spite of their different characteristics.
• It finds arbitrarily shaped and sized clusters quite well.
Cons
• It is not very effective when you have clusters of varying densities.
• If you have high-dimensional data, determining a suitable value for epsilon (ε) becomes difficult.
DBSCAN Algorithm
Step 1: Label core points and noise points
▪ Select a random starting point, say x
▪ Identify the neighborhood of point x using the radius ε
▪ Count the number of points, say k, in this neighborhood, including point x itself
▪ If k >= minPoints, then mark x as a core point; else mark x as a noise point
▪ Select a new unvisited point and repeat the above steps
Step 2: Check whether a noise point can become a border point
▪ If a noise point is directly density reachable (that is, within radius ε of a core point), mark it as a border point; it will form part of the cluster
▪ A point which is neither a core point nor a border point is confirmed as a noise point (outlier)
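A minimal NumPy sketch of these two steps (labeling each point as core, border, or noise) is given below; the function name label_points and its interface are illustrative, not part of the slides:

```python
import numpy as np

def label_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' following Steps 1 and 2."""
    n = len(X)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 1: a point is a core point if its eps-neighborhood
    # (including the point itself) contains at least min_pts points
    is_core = (dists <= eps).sum(axis=1) >= min_pts
    labels = np.where(is_core, "core", "noise").astype(object)
    # Step 2: a non-core point within eps of some core point becomes a border point
    for i in range(n):
        if not is_core[i] and np.any(is_core & (dists[i] <= eps)):
            labels[i] = "border"
    return labels
```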
Thank you
