Unit 3
Euclidean distance for 1-dimensional data, that is, between two points x₁ and x₂, is defined as d(x₁, x₂) = |x₁ − x₂|.
For 2-dimensional data t₁ = (x₁, y₁) and t₂ = (x₂, y₂), Euclidean distance is defined as d(t₁, t₂) = √((x₁ − x₂)² + (y₁ − y₂)²). This approach can also be extended to higher dimensions.
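To make this concrete, here is a minimal Python sketch (the function name euclidean_distance is illustrative, not from the text) that computes the distance for points of any dimension:

import math

def euclidean_distance(p, q):
    # Sum the squared coordinate differences, then take the square root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance([2], [4]))        # 1-dimensional: |2 - 4| = 2.0
print(euclidean_distance([1, 2], [4, 6]))  # 2-dimensional: sqrt(9 + 16) = 5.0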
K-Means Algorithm
1. Select k initial cluster means (centroids) m₁, m₂, …, m_k.
2. Repeat the following two steps until the cluster means no longer change:
3. Calculate the distance of each data point from each centroid, that is, d(tᵢ, mⱼ) for each data point tᵢ (1 ≤ i ≤ n) and centroid mⱼ (1 ≤ j ≤ k).
4. Assign each data point to the cluster with the nearest centroid, and recompute each cluster mean as the average of the points assigned to it.
Solution: Here D = {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2. Let m₁ = 2 and m₂ = 4 be the two initial cluster means.
Calculating the distance of each data point from each cluster mean in the first iteration, we get, for example, d(11, 2) = |11 − 2| = 9 and d(11, 4) = |11 − 4| = 7. Since 7 is the minimum, data point 11 is assigned to cluster 2.
The clusters are Cluster 1 = {2, 3} and Cluster 2 = {4, 10, 12, 20, 30, 11, 25}, and the cluster means are 2.5 and 16.
Repeating the same process with the new means 2.5 and 16, we get the clusters Cluster 1 = {2, 3, 4} and Cluster 2 = {10, 12, 20, 30, 11, 25}, with cluster means 3 and 18.
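As an illustration, the following Python sketch implements this 1-D K-Means procedure (the function name k_means_1d is an assumption, not from the text). Note that the worked example stops after two iterations, while the sketch repeats until the means stop changing:

def k_means_1d(data, means, max_iter=100):
    # Alternate between assigning points to the nearest mean
    # and recomputing each mean, until the means stabilize.
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in data:
            distances = [abs(x - m) for m in means]
            clusters[distances.index(min(distances))].append(x)
        new_means = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, means)]
        if new_means == means:   # converged
            break
        means = new_means
    return clusters, means

D = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = k_means_1d(D, [2.0, 4.0])
print(clusters, means)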
Hierarchical Clustering
o Start with each data point as its own cluster.
o Merge the two closest clusters iteratively until all points belong to a single cluster.
The agglomerative algorithm proceeds as follows:
1. Start with each data point in its own cluster.
2. Compute the distance between every pair of clusters.
3. Find the two closest clusters and merge them into a new cluster.
4. Repeat steps 2 and 3 until all points belong to a single cluster.
The distance between two clusters Cᵢ and Cⱼ can be measured in several ways:
Single linkage (minimum distance): d(Cᵢ, Cⱼ) = min{ d(x, y) : x ∈ Cᵢ, y ∈ Cⱼ }
Complete linkage (maximum distance): d(Cᵢ, Cⱼ) = max{ d(x, y) : x ∈ Cᵢ, y ∈ Cⱼ }
Average linkage: d(Cᵢ, Cⱼ) = (1 / (|Cᵢ| |Cⱼ|)) Σ_{x ∈ Cᵢ} Σ_{y ∈ Cⱼ} d(x, y)
Centroid linkage: d(Cᵢ, Cⱼ) = |μᵢ − μⱼ|, where μᵢ and μⱼ are the means of Cᵢ and Cⱼ.
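These four linkage measures can be written directly in Python; the sketch below assumes 1-D data, as in the examples that follow, so d(x, y) = |x − y|:

def single_link(Ci, Cj):
    # minimum pairwise distance between the two clusters
    return min(abs(x - y) for x in Ci for y in Cj)

def complete_link(Ci, Cj):
    # maximum pairwise distance between the two clusters
    return max(abs(x - y) for x in Ci for y in Cj)

def average_link(Ci, Cj):
    # average of all pairwise distances
    return sum(abs(x - y) for x in Ci for y in Cj) / (len(Ci) * len(Cj))

def centroid_link(Ci, Cj):
    # distance between the cluster means
    return abs(sum(Ci) / len(Ci) - sum(Cj) / len(Cj))

print(single_link([3, 5], [8, 10]))    # 3
print(complete_link([3, 5], [8, 10]))  # 7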
As an example, consider the data {3, 5, 8, 10, 15, 18, 23} and apply hierarchical clustering with the single linkage method. The distance matrix for the data points is

       {3}  {5}  {8} {10} {15} {18} {23}
{3}     0    2    5    7   12   15   20
{5}     2    0    3    5   10   13   18
{8}     5    3    0    2    7   10   15
{10}    7    5    2    0    5    8   13
{15}   12   10    7    5    0    3    8
{18}   15   13   10    8    3    0    5
{23}   20   18   15   13    8    5    0
The minimum distance is 2; thus the closest cluster pairs are {3}, {5} and {8}, {10}. Merging them, the clusters are {{3, 5}, {8, 10}, {15}, {18}, {23}}.
The new distance matrix (using minimum distance) for the new clusters is

        {3,5} {8,10} {15} {18} {23}
{3,5}     0     3     10   13   18
{8,10}    3     0      5    8   13
{15}     10     5      0    3    8
{18}     13     8      3    0    5
{23}     18    13      8    5    0
The minimum distance is 3; thus the closest pairs are {3, 5}, {8, 10} and {15}, {18}. Merging them, the new clusters are {{3, 5, 8, 10}, {15, 18}, {23}}.
The new distance matrix for the new clusters is

            {3,5,8,10} {15,18} {23}
{3,5,8,10}      0         5     13
{15,18}         5         0      5
{23}           13         5      0
The minimum distance is 5; the closest pairs are {3, 5, 8, 10}, {15, 18} and {15, 18}, {23} (a tie). Merging {3, 5, 8, 10} with {15, 18} gives the clusters {{3, 5, 8, 10, 15, 18}, {23}}. Only two clusters now remain; these are merged into the single cluster {3, 5, 8, 10, 15, 18, 23} at distance 5.
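This result can be checked programmatically; the sketch below assumes SciPy is available and uses its linkage and fcluster functions:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([3, 5, 8, 10, 15, 18, 23], dtype=float).reshape(-1, 1)
Z = linkage(data, method='single')   # single linkage (minimum distance)
# Cutting the tree at distance 4 recovers the intermediate clustering
print(fcluster(Z, t=4, criterion='distance'))
# 3, 5, 8, 10 share one label; 15, 18 another; 23 is alone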
Example 3. Consider the data {3, 5, 8, 10, 15, 18, 23}; use hierarchical clustering with the complete linkage method to find the clusters, and draw the dendrogram.
Solution: The first distance matrix is the same as computed above, since each point starts in its own cluster. The minimum distance is 2; thus the closest pairs are {3}, {5} and {8}, {10}, giving the clusters {{3, 5}, {8, 10}, {15}, {18}, {23}}.
The new distance matrix (using maximum distance) for the new clusters is

        {3,5} {8,10} {15} {18} {23}
{3,5}     0     7     12   15   20
{8,10}    7     0      7   10   15
{15}     12     7      0    3    8
{18}     15    10      3    0    5
{23}     20    15      8    5    0
The minimum distance is 3; thus the closest clusters are {15} and {18}, and the clusters become {{3, 5}, {8, 10}, {15, 18}, {23}}.
The new distance matrix (using maximum distance) for the new clusters is

         {3,5} {8,10} {15,18} {23}
{3,5}      0     7      15    20
{8,10}     7     0      10    15
{15,18}   15    10       0     8
{23}      20    15       8     0
The minimum distance is 7; thus the closest clusters are {3, 5} and {8, 10}, and the clusters become {{3, 5, 8, 10}, {15, 18}, {23}}.
The new distance matrix (using maximum distance) for the new clusters is

             {3,5,8,10} {15,18} {23}
{3,5,8,10}       0        15    20
{15,18}         15         0     8
{23}            20         8     0
The minimum distance is 8; thus the closest clusters are {15, 18} and {23}, and the clusters become {{3, 5, 8, 10}, {15, 18, 23}}.
Only two clusters remain; these can be merged to form one cluster at distance 20.
Dendrogram for Hierarchical clustering
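A sketch to reproduce such a dendrogram, assuming SciPy and Matplotlib are available:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

data = np.array([3, 5, 8, 10, 15, 18, 23], dtype=float).reshape(-1, 1)
Z = linkage(data, method='complete')   # complete linkage (maximum distance)
dendrogram(Z, labels=[3, 5, 8, 10, 15, 18, 23])
plt.ylabel('merge distance')           # merges at heights 2, 2, 3, 7, 8, 20
plt.show()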
o Start with all data points in a single cluster.
o Recursively split it into smaller clusters until each data point is its own cluster.
The divisive algorithm proceeds as follows:
1. Start with all points in one cluster.
2. Find the two points in the cluster that are farthest apart and take them as centers.
3. Split the cluster by assigning each remaining point to the nearer center.
4. Repeat the process for each subcluster until all points are separated.
Example 3. Consider the data {3, 5, 8, 10, 15, 18, 23}; use divisive hierarchical clustering with the single linkage method to find the clusters, and draw the dendrogram.
Solution: The largest distance is between 3 and 23. Thus, taking 3 and 23 as center points and assigning each remaining point to the nearer center, we obtain the two clusters {3, 5, 8, 10} and {15, 18, 23}.
In the first cluster {3, 5, 8, 10}, the distance between 3 and 10 is the largest; thus, taking 3 and 10 as center points and assigning each remaining point to the nearer center, we obtain the two clusters {3, 5} and {8, 10}. Similarly, the cluster {15, 18, 23} is divided into the two clusters {15, 18} and {23}.
Finally, breaking the clusters further, we get {3}, {5}, {8}, {10}, {15}, {18}, {23}.
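A minimal sketch of this divisive procedure, assuming 1-D data (so the two farthest points in a cluster are simply its minimum and maximum):

def split(cluster):
    # Use the two farthest points as centers and assign each
    # point to the nearer one; ties go to the lower center.
    lo, hi = min(cluster), max(cluster)
    left = [x for x in cluster if abs(x - lo) <= abs(x - hi)]
    right = [x for x in cluster if abs(x - lo) > abs(x - hi)]
    return [left, right]

clusters = [[3, 5, 8, 10, 15, 18, 23]]
while any(len(c) > 1 for c in clusters):
    nxt = []
    for c in clusters:
        nxt.extend(split(c) if len(c) > 1 else [c])
    clusters = nxt
    print(clusters)   # shows each level of the divisive hierarchy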
DBSCAN classifies each point into one of three types:
1. Core Point: Has at least MinPts neighbors within distance ε.
2. Border Point: Has fewer than MinPts neighbors but is reachable from a core point.
3. Noise Point: Is neither a core point nor a border point.
DBSCAN Algorithm
1. For each point, count the points within distance ε of it and mark it as a core point if the count is at least MinPts.
2. If it is a core point, form a new cluster and expand it by including all density-reachable points.
3. Label points that are not density-reachable from any core point as noise.
Example 4. Consider the data {3, 5, 8, 10, 15, 18, 23}; use the DBSCAN algorithm to find the clusters.
Solution: The distance matrix for the data points (the same as computed earlier) is

       {3}  {5}  {8} {10} {15} {18} {23}
{3}     0    2    5    7   12   15   20
{5}     2    0    3    5   10   13   18
{8}     5    3    0    2    7   10   15
{10}    7    5    2    0    5    8   13
{15}   12   10    7    5    0    3    8
{18}   15   13   10    8    3    0    5
{23}   20   18   15   13    8    5    0
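Since the example does not state the DBSCAN parameters, the sketch below assumes ε = 3 and MinPts = 2 (note that scikit-learn's min_samples counts the point itself):

import numpy as np
from sklearn.cluster import DBSCAN

data = np.array([3, 5, 8, 10, 15, 18, 23], dtype=float).reshape(-1, 1)
db = DBSCAN(eps=3, min_samples=2).fit(data)
print(db.labels_)
# With these assumed parameters, {3, 5, 8, 10} and {15, 18} form
# clusters, while 23 (label -1) is noise.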
Silhouette Score
For each data point i, the Silhouette Score S(i) is defined as:
S(i) = (b(i) − a(i)) / max(a(i), b(i))
where:
a (i) = Average distance from point i to all other points in its own cluster
(intra-cluster distance).