Unit IV
K-means:
• K-means is an algorithm that clusters n objects into k partitions based on their attributes, where k < n.
• K-Means clustering is an unsupervised clustering technique.
• It is a partition-based clustering algorithm.
• A cluster is a group of objects that are more similar to one another than to objects in other clusters.
K-Means Clustering Algorithm
K-Means Clustering Algorithm involves the following
steps-
Step-01:
• Choose the number of clusters K.
Step-02:
• Randomly select any K data points as the initial cluster centers.
• Select the cluster centers in such a way that they are as far apart from each other as possible.
Step-03:
• Calculate the distance between each data point and each cluster center.
• The distance may be calculated using the Euclidean distance formula (other measures, such as Manhattan distance, may also be used).
Step-04:
• Assign each data point to a cluster.
A data point is assigned to the cluster whose center is nearest to it.
Step-05:
• Re-compute the center of newly formed clusters.
The center of a cluster is computed by taking the mean of all the data points contained in that cluster.
Step-06:
• Keep repeating the procedure from Step-03 to Step-05 until any of the following stopping criteria is met-
• Centers of newly formed clusters do not change
• Data points remain in the same cluster
• The maximum number of iterations is reached
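The six steps above translate almost line for line into code. Below is a minimal NumPy sketch (the function name k_means and its defaults are illustrative, not from the slides), assuming Euclidean distance and that no cluster becomes empty:

```python
import numpy as np

def k_means(points, k, max_iters=100, seed=0):
    """Minimal sketch of Steps 01-06; assumes no cluster becomes empty."""
    rng = np.random.default_rng(seed)
    # Step-02: randomly pick K of the data points as initial centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iters):  # Step-06: cap on the number of iterations
        # Step-03: Euclidean distance from every point to every center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Step-04: assign each point to the cluster with the nearest center.
        labels = dists.argmin(axis=1)
        # Step-05: re-compute each center as the mean of its points.
        new_centers = np.array([points[labels == i].mean(axis=0)
                                for i in range(k)])
        # Step-06: stop when the newly formed centers do not change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```

It can be applied directly to the five points of the worked example that follows.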
Squared Error Criterion
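K-means can be viewed as minimizing the sum of squared errors (SSE) between each point and the center of its assigned cluster; a standard way to write this criterion is:

```latex
E = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^{2}
```

where C_i is the i-th cluster and μ_i is its center (the mean of its points). Each reassignment (Step-04) and re-centering (Step-05) can only decrease E, which is why the iterations converge.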
Flowchart
Example
Use K-Means Algorithm to create two clusters for the points A(2, 2), B(3, 2), C(1, 1), D(3, 1) and E(1.5, 0.5)-
Solution-
• We follow the K-Means clustering algorithm discussed above.
• Assume A(2, 2) and C(1, 1) are the initial centers of the two clusters.
Iteration-01:
• We calculate the distance of each point from each of the two cluster centers.
• The distance is calculated using the Euclidean distance formula.
The following illustration shows the calculation of the distance between point A(2, 2) and each of the two cluster centers-
Calculating Distance Between A(2, 2) and C1(2, 2)-
ρ(A, C1)
= sqrt [ (x2 – x1)² + (y2 – y1)² ]
= sqrt [ (2 – 2)² + (2 – 2)² ]
= sqrt [ 0 + 0 ]
= 0
Calculating Distance Between A(2, 2) and C2(1, 1)-
ρ(A, C2)
= sqrt [ (x2 – x1)² + (y2 – y1)² ]
= sqrt [ (1 – 2)² + (1 – 2)² ]
= sqrt [ 1 + 1 ]
= sqrt [ 2 ]
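These two distances can be verified with a few lines of Python (math.dist computes the Euclidean distance between two points):

```python
import math

A, C1, C2 = (2, 2), (2, 2), (1, 1)
print(math.dist(A, C1))  # 0.0, so A stays with center C1
print(math.dist(A, C2))  # 1.4142..., i.e. sqrt(2)
```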
• In a similar manner, we calculate the distance of the other points from each of the two cluster centers.
• Comparing the distances, points A, B and D are nearest to center C1(2, 2) and form Cluster-01, while points C and E are nearest to center C2(1, 1) and form Cluster-02.
For Cluster-01:
• Center of Cluster-01
• = ((2 + 3 + 3)/3, (2 + 2 + 1)/3)
• = (2.67, 1.67)
For Cluster-02:
• Center of Cluster-02
• = ((1 + 1.5)/2, (1 + 0.5)/2)
• = (1.25, 0.75)
This completes Iteration-01.
Next, we go to Iteration-02, Iteration-03 and so on until the centers no longer change.
Iteration-02:
We calculate the distance of each point from the new centers (2.67, 1.67) and (1.25, 0.75) and assign each point to the nearer center:

Given point | Distance from center (2.67, 1.67) | Distance from center (1.25, 0.75) | Point belongs to cluster
A(2, 2)     | 0.75                              | 1.46                              | Cluster-01
B(3, 2)     | 0.47                              | 2.15                              | Cluster-01
C(1, 1)     | 1.80                              | 0.35                              | Cluster-02
D(3, 1)     | 0.75                              | 1.77                              | Cluster-01
E(1.5, 0.5) | 1.65                              | 0.35                              | Cluster-02

The assignments are identical to Iteration-01, so the re-computed centers stay at (2.67, 1.67) and (1.25, 0.75) and the algorithm stops. The final clusters are Cluster-01 = {A, B, D} and Cluster-02 = {C, E}.
Hierarchical (Agglomerative) Clustering Example:
Agglomerative clustering groups the data points into clusters such that data points in the same cluster are more similar to each other than to points in other clusters. In the example below, single linkage (minimum distance) is used: points P3 and P6 have already been merged and then combined with P4, and the distance matrix is updated after each merge:
min (((P3,P6), P4), P2) = min (((P3,P6), P2), (P4,P2)) = min (0.14, 0.19) = 0.14
min (((P3,P6), P4), P5) = min (((P3,P6), P5), (P4,P5)) = min (0.28, 0.23) = 0.23
Again repeating the same process: the minimum value in the matrix is 0.14, so we combine P2 and P5. We form a cluster of the elements corresponding to this minimum value and update the distance matrix. To update the distance matrix:
min ((P2,P5), P1) = min ((P2,P1), (P5,P1)) = min (0.23, 0.34) = 0.23
min ((P2,P5), (P3,P6,P4)) = min ((P2, (P3,P6,P4)), (P5, (P3,P6,P4)))
= min (0.14, 0.23) = 0.14
Again repeating the same process: the minimum value is 0.14, so we combine (P2,P5) and (P3,P6,P4). We form a cluster of the corresponding elements and update the distance matrix. To update the distance matrix:
min ((P2,P5,P3,P6,P4), P1) = min (((P2,P5), P1), ((P3,P6,P4), P1))
= min (0.23, 0.22) = 0.22
We have now reached the solution. The dendrogram for this question merges P3 and P6 first, then joins P4 to them, then merges P2 and P5, then combines the two clusters, and finally joins P1 at distance 0.22.
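The single-linkage update rule used throughout this example is easy to state in code: the distance from a merged cluster to any other cluster is the minimum of the two parts' distances. A tiny sketch, using the numbers from the worked example above (SciPy's scipy.cluster.hierarchy.linkage with method='single' applies the same rule to a full dataset):

```python
def single_link(d_xz, d_yz):
    # Distance from the merged cluster (X,Y) to cluster Z under
    # single linkage: the smaller of the two original distances.
    return min(d_xz, d_yz)

# Values taken from the worked example:
print(single_link(0.14, 0.19))  # d((P3,P6,P4), P2)       -> 0.14
print(single_link(0.23, 0.34))  # d((P2,P5), P1)          -> 0.23
print(single_link(0.23, 0.22))  # d((P2,P5,P3,P6,P4), P1) -> 0.22
```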
DBSCAN Clustering
• There are different approaches and algorithms for performing clustering tasks, which can be divided into three sub-categories:
1. Partition-based clustering: e.g. k-means, k-median
2. Hierarchical clustering: e.g. agglomerative, divisive
3. Density-based clustering: e.g. DBSCAN
Density-based clustering
• Partition-based and hierarchical clustering techniques are highly efficient with normal-shaped (compact, roughly spherical) clusters. However, when it comes to arbitrary-shaped clusters or detecting outliers, density-based techniques are more efficient.
• For example, a dataset consisting of three compact, well-separated groups of points can easily be divided into three clusters using the k-means algorithm.
Consider, by contrast, datasets with arbitrary-shaped clusters, such as nested rings or crescents, where k-means fails but DBSCAN succeeds.
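A short scikit-learn sketch of this contrast (the make_moons dataset and the eps and min_samples values are illustrative choices, not taken from the slides):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two crescent-shaped (arbitrary-shaped) clusters with slight noise.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-means assumes compact, roughly spherical clusters, so it tends
# to cut each crescent in half rather than separate them.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN groups points that lie in dense regions; points in no dense
# region are labelled -1 (noise/outliers).
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

print(np.unique(km_labels))  # [0 1]
print(np.unique(db_labels))  # e.g. [0 1], with -1 added if noise is found
```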