Unit 3

The document explains the K-Means, Hierarchical, and DBSCAN clustering algorithms used in unsupervised machine learning for grouping data points into clusters based on similarity, and introduces the Silhouette Score for evaluating clustering quality. K-Means assigns data points to the nearest centroids and updates those centroids iteratively until convergence, while Hierarchical clustering builds a hierarchy of clusters without needing to specify the number of clusters in advance. The steps of each algorithm are detailed, with worked examples and the distance calculations involved.


K-Means Algorithm: Concept

K-Means is a clustering algorithm used in unsupervised machine learning to group data points into K clusters based on similarity. The algorithm first picks K central points, called centroids, at random; each data point is then assigned to its closest centroid, forming a cluster. After all points have been assigned, each centroid is updated to the average position of the points in its cluster. This process repeats until the centroids stop changing. The goal of clustering is to divide the data points into clusters so that similar data points belong to the same group.
The cluster mean of $K_i = \{t_{i1}, t_{i2}, \ldots, t_{in}\}$ is defined as $m_i = \frac{1}{n} \sum_{j=1}^{n} t_{ij}$.

Euclidean distance for 1-dimensional data, that is, between two points $x_1$ and $x_2$, is defined as $d(x_1, x_2) = |x_1 - x_2|$.

For 2-dimensional data $t_1 = (x_1, y_1)$ and $t_2 = (x_2, y_2)$, Euclidean distance is defined as $d(t_1, t_2) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$. This approach can be extended to higher dimensions as well.

We are given a dataset $D = \{t_1, t_2, \ldots, t_n\}$ and the desired number of clusters $k$; the output is the set of clusters, that is, each element of $D$ assigned to a cluster.

K-Means Algorithm

1. Assign a value to the number of clusters $k$.

2. Assign initial values for the means (centroids) $m_1, m_2, \ldots, m_k$ of the $k$ clusters.

3. Calculate the distance of each data point from each centroid, that is, $d(t_i, m_j)$ for each data point $t_i$ $(1 \le i \le n)$ and centroid $m_j$ $(1 \le j \le k)$.

4. Assign each data point to the cluster whose centroid is at minimum distance among all the distances.

5. Calculate the new mean of each cluster.

6. Repeat steps 3, 4 and 5 until the centroids stop changing or a convergence criterion is achieved.
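The following short Python sketch (not part of the original notes; it assumes NumPy is installed) mirrors steps 3-6 for one-dimensional data:

```python
import numpy as np

def kmeans_1d(data, initial_means, max_iter=100):
    """Minimal 1-D K-Means following steps 3-6 above."""
    data = np.asarray(data, dtype=float)
    means = np.asarray(initial_means, dtype=float)
    for _ in range(max_iter):
        # Steps 3-4: distance of every point to every centroid, assign the nearest
        labels = np.abs(data[:, None] - means[None, :]).argmin(axis=1)
        # Step 5: new mean of each cluster (keep the old mean if a cluster is empty)
        new_means = np.array([data[labels == j].mean() if np.any(labels == j) else means[j]
                              for j in range(len(means))])
        # Step 6: stop once the centroids no longer change
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means

labels, means = kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], initial_means=[2, 4])
print(labels, means)  # converges to means 7 and 25, as in Example 1 below
```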
Example 1. Consider the data {2, 4, 10, 12, 3, 20, 30, 11, 25}. Assuming 2 clusters, apply the K-Means algorithm to assign each data point to a cluster.

Solution: Here $D = \{2, 4, 10, 12, 3, 20, 30, 11, 25\}$ and $k = 2$. Let $m_1 = 2$ and $m_2 = 4$ be the two initial cluster means.

Calculating the distance of each data point from each cluster mean, we get the following results in the first iteration:

$d(2, 2) = |2 - 2| = 0$ and $d(2, 4) = |2 - 4| = 2$. Since 0 is the minimum, data point 2 is assigned to cluster 1.

$d(4, 2) = |4 - 2| = 2$ and $d(4, 4) = |4 - 4| = 0$. Since 0 is the minimum, data point 4 is assigned to cluster 2.

$d(10, 2) = |10 - 2| = 8$ and $d(10, 4) = |10 - 4| = 6$. Since 6 is the minimum, data point 10 is assigned to cluster 2.

$d(12, 2) = |12 - 2| = 10$ and $d(12, 4) = |12 - 4| = 8$. Since 8 is the minimum, data point 12 is assigned to cluster 2.

$d(3, 2) = |3 - 2| = 1$ and $d(3, 4) = |3 - 4| = 1$. Since both distances are 1, either cluster can be chosen arbitrarily; let data point 3 be assigned to cluster 1.

$d(20, 2) = |20 - 2| = 18$ and $d(20, 4) = |20 - 4| = 16$. Since 16 is the minimum, data point 20 is assigned to cluster 2.

$d(30, 2) = |30 - 2| = 28$ and $d(30, 4) = |30 - 4| = 26$. Since 26 is the minimum, data point 30 is assigned to cluster 2.

$d(11, 2) = |11 - 2| = 9$ and $d(11, 4) = |11 - 4| = 7$. Since 7 is the minimum, data point 11 is assigned to cluster 2.

$d(25, 2) = |25 - 2| = 23$ and $d(25, 4) = |25 - 4| = 21$. Since 21 is the minimum, data point 25 is assigned to cluster 2.

The clusters are Cluster 1 = {2, 3} and Cluster 2 = {4, 10, 12, 20, 30, 11, 25}, with cluster means 2.5 and 16.

Repeating the same process with the new means 2.5 and 16, we get Cluster 1 = {2, 3, 4} and Cluster 2 = {10, 12, 20, 30, 11, 25}, with cluster means 3 and 18.

The complete results are shown in the following table:


Iteration No.   m1     m2     Cluster 1                Cluster 2
1               2      4      {2, 3}                   {4, 10, 12, 20, 30, 11, 25}
2               2.5    16     {2, 3, 4}                {10, 12, 20, 30, 11, 25}
3               3      18     {2, 3, 4, 10}            {12, 20, 30, 11, 25}
4               4.75   19.6   {2, 3, 4, 10, 11, 12}    {20, 30, 25}
5               7      25     {2, 3, 4, 10, 11, 12}    {20, 30, 25}

The clusters (and hence the centroids) do not change after iteration 4, so the algorithm has converged.

Hierarchical Clustering

Hierarchical clustering is an unsupervised machine learning algorithm used to group similar data points into clusters without specifying the number of clusters beforehand. Unlike K-Means, which requires K clusters as input, hierarchical clustering builds a hierarchy of clusters and allows flexibility in selecting the number of clusters.

Types of Hierarchical Clustering

1. Agglomerative Hierarchical Clustering (Bottom-Up Approach) – Most Common

o Each data point starts as its own cluster.

o Merge the two closest clusters iteratively until all points belong to a single cluster.

2. Divisive Hierarchical Clustering (Top-Down Approach) – Less Common

o Start with one large cluster containing all points.

o Recursively split clusters until each point is its own cluster.

Agglomerative Hierarchical Clustering

1. Compute the distance matrix (distance between each pair of data points).

2. Treat each data point as a separate cluster.

3. Find the two closest clusters and merge them into a new cluster.

4. Update the distance matrix to reflect the new cluster.

5. Repeat steps 3 and 4 until only one cluster remains.

6. Use a dendrogram to determine the optimal number of clusters.
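As an illustration (not from the original notes, and assuming SciPy is installed), these steps can be reproduced in Python; Example 2 below works through the same data by hand:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

data = np.array([3, 5, 8, 10, 15, 18, 23], dtype=float).reshape(-1, 1)

# Steps 1-5: build the full merge hierarchy using single linkage
Z = linkage(data, method='single')  # each row records one merge and its distance

# Step 6: cut the hierarchy to obtain, for example, three clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)

# dendrogram(Z) would draw the hierarchy if matplotlib is available
```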

Mathematical Basis: Linkage Methods

To decide which clusters to merge, different linkage criteria are used:

1. Single Linkage (Minimum Distance)

$d(C_i, C_j) = \min\{\, d(x, y) : x \in C_i,\ y \in C_j \,\}$

 Distance between the closest points of two clusters.

 Results in long, chain-like clusters (sensitive to noise).

2. Complete Linkage (Maximum Distance)

$d(C_i, C_j) = \max\{\, d(x, y) : x \in C_i,\ y \in C_j \,\}$

 Distance between the farthest points of two clusters.

 Produces compact, spherical clusters.

3. Average Linkage (Mean Distance)

$d(C_i, C_j) = \dfrac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} d(x, y)$

 Distance is the average of all pairwise distances between clusters.

 Balances between single and complete linkage.

4. Centroid Linkage (Cluster Mean)

$d(C_i, C_j) = |\mu_i - \mu_j|$

 Distance between cluster centroids.

 Used when clusters are assumed to be spherical.
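To make the four criteria concrete, here is a small sketch (not from the original notes) that computes each linkage distance between two 1-dimensional clusters:

```python
def single_linkage(Ci, Cj):
    # distance between the closest pair of points
    return min(abs(x - y) for x in Ci for y in Cj)

def complete_linkage(Ci, Cj):
    # distance between the farthest pair of points
    return max(abs(x - y) for x in Ci for y in Cj)

def average_linkage(Ci, Cj):
    # average of all pairwise distances
    return sum(abs(x - y) for x in Ci for y in Cj) / (len(Ci) * len(Cj))

def centroid_linkage(Ci, Cj):
    # distance between the cluster means
    return abs(sum(Ci) / len(Ci) - sum(Cj) / len(Cj))

# The clusters {3, 5} and {8, 10} from Example 2 below
print(single_linkage([3, 5], [8, 10]))    # 3
print(complete_linkage([3, 5], [8, 10]))  # 7
```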


Example 2. Consider the data {3, 5, 8, 10, 15, 18, 23}; apply hierarchical clustering with the single linkage method to find the clusters. Draw the dendrogram.

Solution: Assuming each single point is a cluster, the initial distance matrix is

       {3}   {5}   {8}   {10}  {15}  {18}  {23}
{3}     0     2     5     7    12    15    20
{5}     2     0     3     5    10    13    18
{8}     5     3     0     2     7    10    15
{10}    7     5     2     0     5     8    13
{15}   12    10     7     5     0     3     8
{18}   15    13    10     8     3     0     5
{23}   20    18    15    13     8     5     0

The minimum distance is 2; the closest cluster pairs are {3}, {5} and {8}, {10}. Merging them, the clusters become {{3, 5}, {8, 10}, {15}, {18}, {23}}.

The new distance matrix for new clusters is

         {3,5}  {8,10}  {15}  {18}  {23}
{3,5}      0      3      10    13    18
{8,10}     3      0       5     8    13
{15}      10      5       0     3     8
{18}      13      8       3     0     5
{23}      18     13       8     5     0

The minimum distance is 3; the closest cluster pairs are {3, 5}, {8, 10} and {15}, {18}. Merging them, the clusters become {{3, 5, 8, 10}, {15, 18}, {23}}.
The new distance matrix for new clusters is

              {3,5,8,10}  {15,18}  {23}
{3,5,8,10}        0          5      13
{15,18}           5          0       5
{23}             13          5       0

The minimum distance is 5, attained by both pairs ({3, 5, 8, 10}, {15, 18}) and ({15, 18}, {23}). Merging at distance 5 therefore combines the remaining clusters into the single cluster {3, 5, 8, 10, 15, 18, 23}, completing the hierarchy.

Dendrogram for Hierarchical clustering

Example 3. Consider the data {3, 5, 8, 10, 15, 18, 23}; apply hierarchical clustering with the complete linkage method to find the clusters. Draw the dendrogram.

Solution: Assuming each single point is a cluster, the initial distance matrix remains the same as in the previous case.

The minimum distance is 2; the closest cluster pairs are {3}, {5} and {8}, {10}. Merging them, the clusters become {{3, 5}, {8, 10}, {15}, {18}, {23}}.
The new distance matrix (using maximum distance) for new clusters is

Cluster Pair       Max Distance
{3,5}, {8,10}            7
{3,5}, {15}             12
{3,5}, {18}             15
{3,5}, {23}             20
{8,10}, {15}             7
{8,10}, {18}            10
{8,10}, {23}            15
{15}, {18}               3
{15}, {23}               8
{18}, {23}               5

The minimum distance is 3; the closest clusters are {15} and {18}. Merging them, the clusters become {{3, 5}, {8, 10}, {15, 18}, {23}}.

The new distance matrix (using maximum distance) for new clusters is

Cluster Pair        Max Distance
{3,5}, {8,10}             7
{3,5}, {15,18}           15
{3,5}, {23}              20
{8,10}, {15,18}          10
{8,10}, {23}             15
{15,18}, {23}             8

The minimum distance is 7; the closest clusters are {3, 5} and {8, 10}. Merging them, the clusters become {{3, 5, 8, 10}, {15, 18}, {23}}.

The new distance matrix (using maximum distance) for new clusters is

Cluster Pair             Max Distance
{3,5,8,10}, {15,18}           15
{3,5,8,10}, {23}              20
{15,18}, {23}                  8

The minimum distance is 8; the closest clusters are {15, 18} and {23}. Merging them, the clusters become {{3, 5, 8, 10}, {15, 18, 23}}.

Only two clusters remain; they merge into one cluster at distance 20 (the maximum pairwise distance, |3 − 23|).
Dendrogram for Hierarchical clustering

Divisive Hierarchical Clustering (Top-Down Approach)

Divisive Hierarchical Clustering is the opposite of Agglomerative Clustering:

 Start with one large cluster containing all data points.

 Recursively split it into smaller clusters until each data point is its own
cluster.

 The splitting is based on maximizing dissimilarity.

Divisive Clustering Algorithm

1. Consider all data points as a single cluster.

2. Find the most dissimilar data points (using Euclidean distance, variance, or other methods).

3. Split the cluster into two subclusters based on dissimilarity.

4. Repeat the process for each subcluster until all points are separated.

5. Create a dendrogram showing the hierarchy of splits.
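A minimal sketch (an illustration, not the original author's code) of one split step, seeding with the two farthest points and assigning every other point to the nearer seed, as in the example that follows:

```python
def split_cluster(points):
    """Split a cluster in two: seed with the most dissimilar pair of points,
    then assign every point to the nearer seed (ties go to the first seed)."""
    a, b = max(((x, y) for x in points for y in points), key=lambda p: abs(p[0] - p[1]))
    left = [p for p in points if abs(p - a) <= abs(p - b)]
    right = [p for p in points if abs(p - a) > abs(p - b)]
    return left, right

print(split_cluster([3, 5, 8, 10, 15, 18, 23]))  # ([3, 5, 8, 10], [15, 18, 23])
```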

Example 4. Consider the data {3, 5, 8, 10, 15, 18, 23}; apply divisive hierarchical clustering with the single linkage method to find the clusters. Draw the dendrogram.
Solution: The largest distance is between 3 and 23. Taking 3 and 23 as the center points and assigning each remaining point to the nearer of the two, we obtain the two clusters {3, 5, 8, 10} and {15, 18, 23}.

In the first cluster {3, 5, 8, 10}, the distance between 3 and 10 is the maximum; taking 3 and 10 as the center points and assigning each remaining point to the nearer of the two, we obtain the two clusters {3, 5} and {8, 10}. Similarly, the cluster {15, 18, 23} is divided into the two clusters {15, 18} and {23}.

Finally, further breaking the clusters, we get {3}, {5}, {8}, {10}, {15},
{18}, {23}.

Dendrogram for Hierarchical clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) Algorithm

DBSCAN is a density-based clustering algorithm that groups data points into clusters based on density and can identify outliers (noise points).

Advantages over K-Means & Hierarchical Clustering:

 It does not require specifying the number of clusters (K is not needed).

 It can find arbitrary-shaped clusters (unlike K-Means, which assumes spherical clusters).

 It identifies outliers (noise points) naturally.

DBSCAN uses two parameters to determine clusters:

1. ε (epsilon): The maximum distance between two points for them to be considered neighbors.

2. MinPts: The minimum number of points required to form a dense region (core point).

Each point in DBSCAN is classified as one of the following:

1. Core Point: Has at least MinPts points (including the point itself) within its ε radius.

2. Border Point: Has fewer than MinPts neighbors but is reachable from
a core point.

3. Noise Point (Outlier): Not a core or border point (isolated).

DBSCAN Algorithm

1. Choose an unvisited point and check whether it is a core point (i.e., has at least MinPts points within ε).

2. If it is a core point, form a new cluster and expand it by including all density-reachable points.

3. If it is a border point, add it to an existing cluster.

4. If it is a noise point, label it as an outlier.

5. Repeat until all points are visited.

Example 5. Consider the data {3, 5, 8, 10, 15, 18, 23}; apply the DBSCAN algorithm to find the clusters.

Solution: Let ε (epsilon) = 3 (maximum distance for two points to be neighbors) and MinPts = 2 (minimum number of points, including the point itself, required to form a dense region). The distance matrix is

       {3}   {5}   {8}   {10}  {15}  {18}  {23}
{3}     0     2     5     7    12    15    20
{5}     2     0     3     5    10    13    18
{8}     5     3     0     2     7    10    15
{10}    7     5     2     0     5     8    13
{15}   12    10     7     5     0     3     8
{18}   15    13    10     8     3     0     5
{23}   20    18    15    13     8     5     0

Identifying Core, Border, and Noise Points:

A core point has at least MinPts = 2 points (including itself) within ε = 3.

Point   Points within ε = 3 (incl. itself)   MinPts ≥ 2?   Classification
3       {3, 5}                               Yes           Core (cluster 0)
5       {3, 5, 8}                            Yes           Core (cluster 0)
8       {5, 8, 10}                           Yes           Core (cluster 0)
10      {8, 10}                              Yes           Core (cluster 0)
15      {15, 18}                             Yes           Core (cluster 1)
18      {15, 18}                             Yes           Core (cluster 1)
23      {23}                                 No            Noise (-1)

Cluster 0: {3, 5, 8, 10}
Cluster 1: {15, 18}
Noise points: {23} (does not meet MinPts). In this example every clustered point happens to be a core point; with a larger MinPts some points would instead be border points.
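As a cross-check (not part of the original notes, and assuming scikit-learn is installed), sklearn's DBSCAN with the same parameters reproduces this grouping; note that its min_samples counts the point itself, matching the convention used above:

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([3, 5, 8, 10, 15, 18, 23], dtype=float).reshape(-1, 1)

# eps = 3, min_samples = 2 (the point itself plus at least one neighbor)
labels = DBSCAN(eps=3, min_samples=2).fit_predict(X)
print(labels)  # e.g. [0 0 0 0 1 1 -1]: clusters {3,5,8,10} and {15,18}; 23 is noise (-1)
```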

Silhouette Score: A Measure of Clustering Quality

The Silhouette Score is a metric used to evaluate the quality of clustering in unsupervised learning. It measures how well each data point fits within its assigned cluster compared to other clusters.

 Higher scores indicate well-separated, compact clusters.

 Lower scores suggest overlapping or poorly formed clusters.

For each data point $i$, the Silhouette Score $S(i)$ is defined as:

$S(i) = \dfrac{b(i) - a(i)}{\max\{a(i),\ b(i)\}}$
where:

$a(i)$ = average distance from point $i$ to all other points in its own cluster (intra-cluster distance).

$b(i)$ = average distance from point $i$ to all points in the nearest neighboring cluster (inter-cluster distance).
Silhouette Score   Interpretation
S(i) = 1           Well-clustered (point is far from other clusters)
S(i) = 0           On the boundary between clusters
S(i) = −1          Misclassified (point is closer to another cluster)
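A short usage sketch (not from the original notes; it assumes scikit-learn is installed) computing the average Silhouette Score for the final K-Means clustering of Example 1:

```python
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float).reshape(-1, 1)
labels = np.array([0, 0, 0, 0, 0, 1, 1, 0, 1])  # clusters {2,3,4,10,11,12} and {20,25,30}

score = silhouette_score(X, labels)  # mean of S(i) over all points
print(round(score, 3))
```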
