Unit IV

K-means Clustering

K-means:
• K-means is an algorithm that clusters n objects, based on their attributes, into k partitions, where k < n.
• K-means clustering is an unsupervised clustering technique.
• It is a partition-based clustering algorithm.
• A cluster is defined as a group of objects that belong to the same class.
K-Means Clustering Algorithm
K-Means Clustering Algorithm involves the following
steps-
Step-01:
• Choose the number of clusters K.
Step-02:
• Randomly select any K data points as cluster centers.
• Select the cluster centers so that they are as far apart from each other as possible.
Step-03:
• Calculate the distance between each data point and each cluster center.
• The distance may be calculated using, for example, the Euclidean distance or the Manhattan distance.
Step-04:
• Assign each data point to some cluster.
A data point is assigned to the cluster whose center is nearest to it.
Step-05:
• Re-compute the center of each newly formed cluster.
The center of a cluster is computed by taking the mean of all the data points contained in that cluster.
Step-06:
• Keep repeating Step-03 to Step-05 until any of the following stopping criteria is met:
• The centers of the newly formed clusters do not change
• The data points remain in the same clusters
• The maximum number of iterations is reached
Squared Error criteria
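The squared-error (SSE) criterion that K-means minimizes can be written in the standard form below, where \mu_i denotes the center of cluster C_i:

\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2

Each assignment step (Step-04) and each center update (Step-05) can only decrease this sum, which is why the algorithm is guaranteed to converge to a (local) optimum.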
Flowchart
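In place of the flowchart, the procedure can also be sketched directly in code. The following is a minimal NumPy sketch of Steps 01-06, not part of the original slides; names such as kmeans, max_iters, and seed are illustrative choices:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means sketch following Steps 01-06 above (empty clusters not handled)."""
    rng = np.random.default_rng(seed)
    # Step-02: randomly pick k data points as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):                               # Step-06: iterate
        # Step-03: Euclidean distance from every point to every center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        # Step-04: assign each point to the cluster with the nearest center
        labels = dists.argmin(axis=1)
        # Step-05: recompute each center as the mean of the points assigned to it
        new_centers = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Stop when the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```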
Example
Use the K-Means algorithm to create two clusters from the following data points:

Point   X     Y
A       2     2
B       3     2
C       1     1
D       3     1
E       1.5   0.5
Solution-
• We follow the K-Means clustering algorithm discussed above.
• Assume A(2, 2) and C(1, 1) are the initial centers of the two clusters.
Iteration-01:
• We calculate the distance of each point from each of the two cluster centers.
• The distance is calculated using the Euclidean distance formula.
The following illustration shows the calculation of the distance between point A(2, 2) and each of the two cluster centers:
Calculating the distance between A(2, 2) and C1(2, 2):
ρ(A, C1)
= sqrt[ (x2 – x1)^2 + (y2 – y1)^2 ]
= sqrt[ (2 – 2)^2 + (2 – 2)^2 ]
= sqrt[ 0 + 0 ]
= 0
Calculating the distance between A(2, 2) and C2(1, 1):
ρ(A, C2)
= sqrt[ (x2 – x1)^2 + (y2 – y1)^2 ]
= sqrt[ (1 – 2)^2 + (1 – 2)^2 ]
= sqrt[ 1 + 1 ]
= sqrt[ 2 ] ≈ 1.41
• In a similar manner, we calculate the distance of the other points from each of the two cluster centers:

Given point   Distance from C1(2, 2)   Distance from C2(1, 1)   Belongs to cluster
A(2, 2)       0.00                     1.41                     C1
B(3, 2)       1.00                     2.24                     C1
C(1, 1)       1.41                     0.00                     C2
D(3, 1)       1.41                     2.00                     C1
E(1.5, 0.5)   1.58                     0.71                     C2

From here, the new clusters are:

Cluster-01:
• The first cluster contains the points A(2, 2), B(3, 2), D(3, 1)
Cluster-02:
• The second cluster contains the points C(1, 1), E(1.5, 0.5)
Now,
• We re-compute the new cluster centers.
• The new cluster center is computed by taking the mean of all the points contained in that cluster.
For Cluster-01:
• Center of Cluster-01
• = ((2 + 3 + 3)/3, (2 + 2 + 1)/3)
• = (2.67, 1.67)

For Cluster-02:
• Center of Cluster-02
• = ((1 + 1.5)/2, (1 + 0.5)/2)
• = (1.25, 0.75)
This completes Iteration-01.
Next,
• We go to Iteration-02, Iteration-03, and so on until the centers do not change any more.
Iteration-02:
Given point   Distance from C1(2.67, 1.67)   Distance from C2(1.25, 0.75)   Belongs to cluster
A(2, 2)       0.73                           1.45                           C1
B(3, 2)       0.44                           2.14                           C1
C(1, 1)       1.79                           0.34                           C2
D(3, 1)       0.54                           1.76                           C1
E(1.5, 0.5)   1.45                           0.34                           C2

From here, the new clusters are:
Cluster-01:
The first cluster contains the points A(2, 2), B(3, 2), D(3, 1)
Cluster-02:
The second cluster contains the points C(1, 1), E(1.5, 0.5)
Here, the cluster elements are the same as in the previous iteration, so we stop the process.
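The result of this worked example can be reproduced with scikit-learn, assuming it is installed; the initial centers are seeded with the same points A(2, 2) and C(1, 1) used above:

```python
import numpy as np
from sklearn.cluster import KMeans

# Example data points: A, B, C, D, E
X = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])

# Start from the same initial centers as the worked example: A(2, 2) and C(1, 1)
km = KMeans(n_clusters=2, init=np.array([[2.0, 2.0], [1.0, 1.0]]), n_init=1).fit(X)

print(km.labels_)           # expected grouping: {A, B, D} and {C, E}
print(km.cluster_centers_)  # expected centers: (2.67, 1.67) and (1.25, 0.75)
```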
K-means Advantages
• Relatively simple to implement.
• Scales to large data sets.
• Guarantees convergence.
• Easily adapts to new examples.
• Generalizes to clusters of different
shapes and sizes, such as elliptical
clusters.
K-means Disadvantages
• It requires the number of clusters (k) to be specified in advance.
• It cannot handle noisy data and outliers well.
• It is not suitable for identifying clusters with non-convex shapes.
Exercise Problem
Challenges in Unsupervised Learning
• The number of clusters is normally not known a priori.
• For clustering algorithms such as K-means, different initial centers may lead to different clustering results; moreover, K is unknown.
• Time complexity: partitional clustering algorithms are O(N), whereas hierarchical algorithms are O(N²).
• The similarity criterion is not clear: should we use Euclidean, Manhattan, or Hamming distance?
• In hierarchical clustering, at what stage should we stop merging, i.e., where should the dendrogram be cut?
Hierarchical clustering
• Hierarchical clustering methods are used to group the data into a hierarchy or tree-like structure.
• For example, in a machine learning problem of organizing the employees of a university into different departments, the employees are first grouped under the different departments of the university, and then, within each department, the employees can be grouped according to their roles, such as professors, assistant professors, supervisors, lab assistants, etc. This creates a hierarchical structure of the employee data and eases visualization and analysis.
Types of Hierarchical Clustering
There are two types of hierarchical clustering:
1. Agglomerative clustering
2. Divisive clustering
Types of Hierarchical Clustering
• Agglomerative clustering is a type of hierarchical clustering algorithm. It is an unsupervised machine learning technique that divides the population into several clusters such that data points in the same cluster are more similar and data points in different clusters are dissimilar.
• Points in the same cluster are closer to each other.
• Points in different clusters are far apart.
• On the other hand, the divisive method starts with one cluster containing all the given objects and then splits it iteratively to form smaller clusters.
• The agglomerative hierarchical clustering method uses the bottom-up strategy. It starts with each object forming its own cluster and then iteratively merges the clusters according to their similarity to form larger clusters. It terminates either when a certain clustering condition imposed by the user is achieved or when all the clusters merge into a single cluster.
Some pros and cons of
Hierarchical Clustering
Pros
• No assumption of a particular number of clusters (unlike k-means)
• May correspond to meaningful taxonomies
Cons
• Once a decision is made to combine two clusters, it cannot be undone
• Too slow for large data sets: O(n² log n)
Agglomerative Clustering: It uses a bottom-up approach. It starts with each object forming its own cluster and then iteratively merges the clusters according to their similarity to form larger clusters. It terminates either
• when a certain clustering condition imposed by the user is achieved, or
• when all clusters merge into a single cluster.
Variants of agglomerative methods:
1. Agglomerative Algorithm: Single Link
• Single-nearest distance or single linkage is the
agglomerative method that uses the distance
between the closest members of the two
clusters.
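In symbols, the single-link distance between two clusters C_i and C_j is the smallest pairwise distance between their members:

d_{\text{single}}(C_i, C_j) = \min_{x \in C_i,\; y \in C_j} \lVert x - y \rVert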
Question. Find the clusters using a single link
technique. Use Euclidean distance and draw
the dendrogram.
Sample No.   X      Y
P1           0.40   0.53
P2           0.22   0.38
P3           0.35   0.32
P4           0.26   0.19
P5           0.08   0.41
P6           0.45   0.30
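Step 1 (shown as a distance-matrix figure in the original slides) computes the pairwise Euclidean distances between these six points. A minimal SciPy sketch of the whole single-link procedure, assuming SciPy and Matplotlib are available, could look like this:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Points P1..P6 from the table above
points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                   [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

# Step 1: pairwise Euclidean distance matrix
dist_matrix = squareform(pdist(points, metric='euclidean'))
print(np.round(dist_matrix, 2))

# Single-linkage agglomerative clustering and its dendrogram
Z = linkage(points, method='single', metric='euclidean')
dendrogram(Z, labels=['P1', 'P2', 'P3', 'P4', 'P5', 'P6'])
plt.show()
```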
Step 2: Merge the two closest members of the two clusters by finding the minimum element in the distance matrix. Here the minimum value is 0.10, and hence we combine P3 and P6 (as 0.10 appears in the P6 row and P3 column).
Now, form a cluster of the elements corresponding to the minimum value and update the distance matrix. To update the distance matrix:
min ((P3,P6), P1) = min ((P3,P1), (P6,P1)) = min (0.22,0.24) = 0.22
min ((P3,P6), P2) = min ((P3,P2), (P6,P2)) = min (0.14,0.24) = 0.14
min ((P3,P6), P4) = min ((P3,P4), (P6,P4)) = min (0.13,0.22) = 0.13
min ((P3,P6), P5) = min ((P3,P5), (P6,P5)) = min (0.28,0.39) = 0.28
Now we repeat the same process: merge the two closest members of the two clusters and find the minimum element in the distance matrix. The minimum value is 0.13, and hence we combine (P3, P6) and P4. Now, form the cluster of elements corresponding to the minimum value and update the distance matrix:
min (((P3,P6),P4), P1) = min (((P3,P6), P1), (P4,P1)) = min (0.22, 0.37) = 0.22
min (((P3,P6),P4), P2) = min (((P3,P6), P2), (P4,P2)) = min (0.14, 0.19) = 0.14
min (((P3,P6),P4), P5) = min (((P3,P6), P5), (P4,P5)) = min (0.28, 0.23) = 0.23
Again repeating the same process: the minimum value is 0.14, and hence we combine P2 and P5. Now, form the cluster of elements corresponding to the minimum value and update the distance matrix:
min ((P2,P5), P1) = min ((P2,P1), (P5,P1)) = min (0.23, 0.34) = 0.23
min ((P2,P5), (P3,P6,P4)) = min ((P2,(P3,P6,P4)), (P5,(P3,P6,P4)))
= min (0.14, 0.23) = 0.14
Again repeating the same process: the minimum value is 0.14, and hence we combine (P2,P5) and (P3,P6,P4). Now, form the cluster of elements corresponding to the minimum value and update the distance matrix:
min ((P2,P5,P3,P6,P4), P1) = min (((P2,P5), P1), ((P3,P6,P4), P1))
= min (0.23, 0.22) = 0.22
We have now reached the final solution, and the dendrogram for this question is as follows:
DBSCAN Clustering
• There are different approaches and
algorithms to perform clustering tasks
which can be divided into three sub-
categories:
1. Partition-based clustering: E.g. k-
means, k-median
2. Hierarchical clustering: E.g.
Agglomerative, Divisive
3. Density-based clustering: E.g. DBSCAN
Density-based clustering
• Partition-based and hierarchical clustering
techniques are highly efficient with normal
shaped clusters. However, when it comes to
arbitrary shaped clusters or detecting outliers,
density-based techniques are more efficient.
• For example, the dataset in the figure below can easily be divided into three clusters using the k-means algorithm.

(Figure: k-means)
Consider the following figures:

The data points in these figures are grouped in arbitrary shapes or include outliers. Density-based clustering algorithms are very efficient at finding high-density regions and outliers, and detecting such outliers is very important for some tasks.
DBSCAN Algorithm
• DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is able to find arbitrarily shaped clusters and clusters with noise (i.e. outliers).
• In DBSCAN, instead of guessing the number of clusters, we define two hyperparameters, epsilon and minPoints, to arrive at clusters.
• Epsilon (ε): The distance that specifies the neighborhoods. Two points are considered to be neighbors if the distance between them is less than or equal to epsilon.
• minPoints (n): The minimum number of data points required to define a cluster.
DBSCAN Algorithm
Based on Epsilon (ε) and minPoints(n)
parameters, points are classified as core, border,
and outlier or noise points:
• Core point: A point is a core point if there are at least minPoints points (including the point itself) within its surrounding area of radius epsilon.
• Border point: A point is a border point if it is reachable from a core point and there are fewer than minPoints points within its surrounding area.
• Noise (outlier) point: A point is a noise point if it is neither a core point nor a border point.
DBSCAN Algorithm
• These points may be better explained with
visualizations.
Density connected
Three terms are necessary in order to understand DBSCAN:
• Directly density reachable: A point is called directly density reachable if it has a core point in its neighborhood.
• Density connected: Two points are called density connected if there is a core point which is density reachable from both of the points.
• Density reachable: A point is called density reachable from another point if they are connected through a series of core points.
Evaluation Metrics of DBSCAN
• We will use the Silhouette score and the Adjusted Rand score for evaluating clustering algorithms.
• The Silhouette score is in the range of -1 to 1. A score near 1 is best, meaning that the data point i is very compact within the cluster to which it belongs and far away from the other clusters. Values near 0 denote overlapping clusters.
• The Adjusted Rand score has a maximum of 1 (random labelings score near 0). More than 0.9 denotes excellent cluster recovery, above 0.8 is good recovery, and less than 0.5 is considered poor recovery.
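As a sketch of how DBSCAN and these metrics fit together, assuming scikit-learn is available (the make_moons dataset and the parameter values eps=0.2 and min_samples=5 are illustrative choices, not taken from the slides):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Illustrative arbitrary-shaped data with some noise
X, y_true = make_moons(n_samples=300, noise=0.08, random_state=42)

# eps = epsilon neighborhood radius, min_samples = minPoints
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_                      # label -1 marks noise/outlier points

print("Adjusted Rand score:", adjusted_rand_score(y_true, labels))

# Silhouette score computed on the non-noise points only
mask = labels != -1
if len(set(labels[mask])) > 1:
    print("Silhouette score:", silhouette_score(X[mask], labels[mask]))
```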
DBSCAN
Pros
• DBSCAN is better than other clustering algorithms in that it does not require a pre-set number of clusters.
• It identifies outliers as noise, unlike the Mean-Shift method, which forces such points into a cluster in spite of their different characteristics.
• It finds arbitrarily shaped and sized clusters quite well.
Cons
• It is not very effective when you have clusters of varying densities.
• If you have high-dimensional data, determining a suitable value for epsilon (ε) becomes difficult.
DBSCAN Algorithm
Step 1: Label core points and noise points
▪ Select a random starting point, say x
▪ Identify the neighborhood of point x using the radius ε
▪ Count the number of points, say k, in this neighborhood, including point x itself
▪ If k >= minPoints, then mark x as a core point; else mark x as a noise point
▪ Select a new unvisited point and repeat the above steps
Step 2: Check whether a noise point can become a border point
▪ If a noise point is directly density reachable (that is, within radius ε of a core point), mark it as a border point; it will form part of the cluster
▪ A point which is neither a core point nor a border point is confirmed as a noise point (outlier)
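A minimal NumPy sketch of these two steps (labeling each point as core, border, or noise) is given below; the function name label_points and its interface are illustrative, not part of the slides:

```python
import numpy as np

def label_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' following Steps 1 and 2."""
    n = len(X)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Step 1: a point is a core point if its eps-neighborhood
    # (including the point itself) contains at least min_pts points
    is_core = (dists <= eps).sum(axis=1) >= min_pts
    labels = np.where(is_core, "core", "noise").astype(object)
    # Step 2: a non-core point within eps of some core point becomes a border point
    for i in range(n):
        if not is_core[i] and np.any(is_core & (dists[i] <= eps)):
            labels[i] = "border"
    return labels
```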
Thank you
