
Unit 4

Clustering
- DBSCAN
- Hierarchical Clustering
- Fuzzy Clustering
- Graph-Based Clustering

MDS 602 (Advanced Data Mining)


Objective

Clustering Algorithms
– Centroid-based
– Density-based
– Hierarchical
– Fuzzy
– Graph-based

Evaluation of Clustering

Evaluating clustering is challenging because:
– In many real-world datasets, the boundaries between clusters are not clear-cut. Some data points sit at the boundary of two clusters and could reasonably be assigned to either.
– Different applications prioritize different aspects of clustering. For example, one application might require tight, well-separated clusters, while another might care more about capturing the overall structure of the data.

Types of Evaluation

Two types of clustering evaluation measures (or metrics)


Internal measures do not require any ground truth to assess
the quality of clusters. They are based solely on the data and
the clustering results.


External measures compare the clustering results to ground
truth labels.
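
As a concrete illustration, here is a minimal sketch, assuming scikit-learn is available (the blob dataset and the k-means labels are stand-ins, not part of the original slides):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Synthetic data with known ground truth, plus k-means labels to evaluate.
X, y_true = make_blobs(n_samples=300, centers=3, random_state=42)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Internal measure: needs only the data X and the predicted labels.
print("silhouette:", silhouette_score(X, y_pred))

# External measure: additionally needs the ground-truth labels y_true.
print("adjusted Rand index:", adjusted_rand_score(y_true, y_pred))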

Internal Evaluation Measures

Most internal validation measures are based on the following two criteria:


Compactness measures how closely related the objects in the same cluster are.
– Compactness can be measured in different ways, such as the variance of the points within each cluster or the average pairwise distance between them.

Separation measures how distinct or well-separated a cluster is from the other clusters.
– Examples of separation measures include pairwise distances between cluster centers and the minimum pairwise distance between objects in different clusters.
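
A minimal sketch of both criteria in plain NumPy (X and labels are hypothetical inputs: an (n, d) data matrix and a cluster assignment per point):

import numpy as np

def compactness_and_separation(X, labels):
    clusters = np.unique(labels)
    centers = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # Compactness: average squared distance of points to their own cluster center.
    compactness = np.mean([
        np.mean(np.sum((X[labels == c] - centers[i]) ** 2, axis=1))
        for i, c in enumerate(clusters)
    ])
    # Separation: minimum pairwise distance between cluster centers
    # (requires at least two clusters).
    center_dists = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    separation = center_dists[np.triu_indices(len(clusters), k=1)].min()
    return compactness, separation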
External Evaluation Measures

External evaluation measures are used when the true labels of the data points are known.

These measures compare the results of the clustering algorithm against the ground-truth labels.

Other Clustering Algorithms

Hierarchical Clustering

Hierarchical clustering organizes data into a hierarchy of clusters, represented as a tree-like structure known as a dendrogram.
– The algorithm builds this hierarchy by iteratively merging or splitting clusters based on their similarity.

Hierarchical Clustering

Types:
– There are two main types of hierarchical clustering:

agglomerative (bottom-up) and

divisive (top-down).

Strengths
– Hierarchical clustering can discover clusters of arbitrary shapes and
sizes, and it provides a visual representation of the hierarchical
relationships between clusters.

Weaknesses
– Hierarchical clustering can be computationally expensive, especially for
large datasets. It is also sensitive to the initial ordering of the data points
and the choice of the distance metric.
Hierarchical Clustering - Types

Agglomerative Clustering:
– Bottom-up approach merging similar clusters sequentially.


Divisive Clustering:
– Top-down approach dividing clusters iteratively.

Basic Agglomerative Hierarchical Algo.
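
The algorithm itself appears as a figure in the original slides; its standard formulation (as in Tan, Steinbach & Kumar) is:

1. Compute the proximity matrix.
2. Let each data point be its own cluster.
3. Repeat:
4.   Merge the two closest clusters.
5.   Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
6. Until only one cluster remains.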

Hierarchical Agglomerative Clustering (HAC)

Types
– single link

For the single link or MIN version of hierarchical clustering,
the proximity of two clusters is defined as the minimum of
the distance (maximum of the similarity) between any two
points in the two different clusters
– complete link

For the complete link or MAX version of hierarchical
clustering, the proximity of two clusters is defined as the
maximum of the distance (minimum of the similarity)
between any two points in the two different clusters
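
In symbols, for clusters Ci and Cj and a point-to-point distance d:

dist_MIN(Ci, Cj) = min { d(x, y) : x ∈ Ci, y ∈ Cj }
dist_MAX(Ci, Cj) = max { d(x, y) : x ∈ Ci, y ∈ Cj }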
Hierarchical Agglomerative Clustering (HAC)

Use the distance matrix in the table below (shown as a figure in the original slides) to perform single link and complete link hierarchical clustering. Show your results by drawing a dendrogram.

The dendrogram should clearly show the order in which the points are merged.

HAC

Calculate Euclidean distance, create the distance matrix.
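
A minimal sketch of this step with SciPy (the random points are stand-ins; the example's actual coordinates appear only as a figure in the original slides):

import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
points = rng.random((6, 2))  # six 2-D points standing in for p1..p6

# pdist gives the condensed pairwise Euclidean distances;
# squareform expands them to the full symmetric matrix.
D = squareform(pdist(points, metric="euclidean"))
print(np.round(D, 2))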

HAC

Find the minimum value element in the distance matrix.
– The minimum value element is (p3,p6) and its value is 0.11.
– i.e., our 1st cluster is (p3,p6).

HAC

Recalculate (update) the distance matrix for the cluster (p3,p6). With single link, the distance from the new cluster to any other point p is
dist((p3,p6), p) = min(dist(p3, p), dist(p6, p))

The same formula can be used for p2, p4, and p5.

Updated distance matrix (shown as a table in the original slides).

HAC

Repeat to find the next closest pair.
– The minimum value element is (p2,p5) and its value is 0.14.
– i.e., our 2nd cluster is (p2,p5).

Recalculate (update) the distance matrix for the cluster (p2,p5).
– The same formula can be used for (p3,p6) and p4.
– Updated distance matrix (shown as a table in the original slides).

HAC

Repeat until all points are merged into a single cluster:
– 1st cluster: (p3,p6)
– 2nd cluster: (p2,p5)
– 3rd cluster: (p2,p5,p3,p6)
– 4th cluster: (p2,p5,p3,p6,p4)
– 5th cluster: (p2,p5,p3,p6,p4,p1)
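
The whole worked example can be reproduced with SciPy. A sketch follows; the distance matrix values are an assumption, taken from the standard textbook example (Tan, Steinbach & Kumar) that these slides appear to follow, consistent with the merge distances 0.11 and 0.14 quoted above:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Assumed distance matrix for p1..p6 (see the caveat above).
D = np.array([
    [0.00, 0.24, 0.22, 0.37, 0.34, 0.23],
    [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],
    [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
    [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
    [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
    [0.23, 0.25, 0.11, 0.22, 0.39, 0.00],
])

for method in ("single", "complete"):          # MIN and MAX versions
    Z = linkage(squareform(D), method=method)  # linkage wants the condensed form
    dendrogram(Z, labels=[f"p{i}" for i in range(1, 7)])
    plt.title(f"{method}-link dendrogram")
    plt.show()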

MAX version of HAC

Find the pair with the minimum distance and merge them.

Update the distance matrix: with complete link (MAX), the distance from the merged cluster to any other point p is the maximum of the individual distances, e.g. dist((p3,p6), p) = max(dist(p3, p), dist(p6, p)).

Numericals

https://aryabhattacollege.ac.in/samplepaper/637227449508725497DataMining(Chapter8).pdf

https://www.analyticsvidhya.com/blog/2021/06/single-link-hierarchical-clustering-clearly-explained/

https://medium.com/@rohanjoseph_91119/learn-with-an-example-hierarchical-clustering-873b5b50890c

Density Based Clustering

Clustering based on density (local cluster criterion), such as
density-connected points or based on an explicitly constructed
density function

Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters

Density Based Clustering

Algorithms
– DBSCAN: Ester, et al. (KDD 96)
– DENCLUE: Hinneburg & D. Keim (KDD 98/2006)
– OPTICS: Ankerst, et al (SIGMOD 99).
– CLIQUE: Agrawal, et al. (SIGMOD 98)

DBSCAN

DBSCAN is a density-based algorithm.
– Density = the number of points within a specified radius Ɛ (Eps).
– A point is a core point if it has more than a specified number of points (MinPts) within Ɛ.
  These are points in the interior of a cluster.
– A border point has fewer than MinPts within Ɛ, but lies in the neighborhood of a core point.
– A noise point is any point that is neither a core point nor a border point.

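A minimal sketch, assuming scikit-learn (the moons dataset and the parameter values are stand-ins), that recovers the three point types from a fitted model:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)  # eps = Ɛ, min_samples = MinPts

core = np.zeros(len(X), dtype=bool)
core[db.core_sample_indices_] = True    # core points
noise = db.labels_ == -1                # DBSCAN labels noise as -1
border = ~core & ~noise                 # in a cluster, but not a core point
print(core.sum(), "core,", border.sum(), "border,", noise.sum(), "noise")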

DBSCAN - core, border, and noise points

(Figure in the original slides: example points labeled as core, border, and noise for a given Ɛ and MinPts.)
Simplified view of DBSCAN

1. Create a graph whose nodes are the points to be clustered.
2. For each core point c, create an edge from c to every point p in the Ɛ-neighborhood of c.
3. Set N to the nodes of the graph.
4. If N does not contain any core points, terminate.
5. Pick a core point c in N.
6. Let X be the set of nodes that can be reached from c by going forward:
   – create a cluster containing X ∪ {c}
   – N = N \ (X ∪ {c})
7. Continue with step 4.
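
A from-scratch sketch of this graph view in plain NumPy (the "more than MinPts" core test follows the previous slide; everything else is an assumption):

import numpy as np

def simple_dbscan(X, eps, min_pts):
    n = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]  # Ɛ-neighborhoods
    core = {i for i in range(n) if len(neighbors[i]) > min_pts}     # core points
    remaining = set(range(n))            # N in the steps above
    clusters = []
    while core & remaining:              # step 4: stop when no core points remain
        c = next(iter(core & remaining)) # step 5: pick a core point
        cluster, frontier = {c}, [c]
        while frontier:                  # step 6: forward reachability from c
            p = frontier.pop()
            if p in core:                # only core points extend the cluster
                for q in neighbors[p]:
                    if q in remaining and q not in cluster:
                        cluster.add(q)
                        frontier.append(q)
        clusters.append(cluster)
        remaining -= cluster             # N = N \ (X ∪ {c})
    return clusters                      # points never reached are noise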

Estimating Ɛ and MinPts

The idea is that for points inside a cluster, the kth nearest neighbor is at roughly the same distance.
– Noise points have their kth nearest neighbor at a farther distance.
– So, plot the sorted distance of every point to its kth nearest neighbor and look for a sharp bend (elbow).

In the k-distance plot shown in the original slides, the elbow sits at about 10; thus Eps = 10.
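
A sketch of the k-distance plot, assuming scikit-learn and matplotlib (here k plays the role of MinPts):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

def k_distance_plot(X, k=4):
    # +1 because each point is its own nearest neighbor at distance 0.
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    kth = np.sort(dists[:, -1])          # distance to the kth nearest neighbor
    plt.plot(kth)
    plt.xlabel("points sorted by k-distance")
    plt.ylabel(f"distance to {k}th nearest neighbor")
    plt.show()                           # read Eps off the elbow of this curve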

Where DBSCAN Fails

Varying densities

High-dimensional data

(Figures in the original slides: the original points, a clustering with larger Eps and MinPts = 4, and a clustering with smaller Eps and MinPts = 4.)
Resources

https://cdn.aaai.org/KDD/1996/KDD96-037.pdf

Hard partition

Hard partition
– Each data point is strictly assigned to exactly one cluster and is not a member of any other cluster, assuming that the number of clusters is known.
– k-means is one of the algorithms that uses a hard partition.

Soft Partition

Every data point is given a probability of closeness in [0, 1] for each of the existing clusters, assuming that the number of clusters is known.

One of the algorithms that uses a fuzzy partition is Fuzzy c-means, which we will discuss in depth.

Fuzzy Clustering

Traditional clustering algorithms assign data points to
exclusive clusters, where each point belongs to one and only
one cluster.

Fuzzy clustering, on the other hand, acknowledges the
inherent ambiguity in real-world data and allows for a more
flexible approach.
– With fuzzy clustering, a data point can belong to multiple clusters to
varying degrees, providing a richer representation of complex
relationships.

Fuzzy Clustering

The key idea behind fuzzy clustering lies in the concept of
membership functions.
– Instead of assigning binary membership values (0 or 1) as in
traditional clustering, fuzzy clustering employs membership values
ranging from 0 to 1.
– These values represent the degree of belongingness of each data
point to each cluster.

A higher membership value indicates a stronger association
with a particular cluster, while a lower value signifies a weaker
association.

Application of fuzzy clustering

Image Segmentation
– In computer vision, fuzzy clustering can be used for image
segmentation tasks.
– By assigning membership values to pixels, it allows for a more flexible
and granular delineation of image regions, enabling better object
recognition and scene understanding.

Pattern Recognition
– Fuzzy clustering is useful in pattern recognition tasks, such as
character recognition or speech recognition.
– By capturing the uncertainty and variability inherent in patterns, it
allows for more robust and accurate recognition algorithms.

Application of fuzzy clustering

Customer Segmentation
– Fuzzy clustering is valuable in customer segmentation for marketing
purposes.
– By considering the degree of membership to different segments,
businesses can tailor personalized marketing strategies to effectively
target customers based on their varying preferences and behaviors.

Document clustering
– Fuzzy clustering can be used to cluster documents based on their
content, such as keywords, topics, and themes.
– This can help in organizing and retrieving documents efficiently.

Fuzzy c-means (FCM)

FCM is the most well-known and widely used fuzzy clustering
technique.
– It is an iterative algorithm that minimizes the sum of the weighted
squared distances between each data point and the centers of the
clusters.
– The degree of membership of each data point to each cluster is
calculated using a membership function, which assigns a probability
value between 0 and 1 for each cluster.
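
A minimal FCM sketch in plain NumPy (m is the fuzzifier, commonly 2; the fixed iteration count stands in for a proper convergence test):

import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                   # memberships of each point sum to 1
    for _ in range(n_iter):
        W = U ** m
        centers = (W @ X) / W.sum(axis=1, keepdims=True)  # weighted cluster means
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=-1) + 1e-10
        U = d ** (-2.0 / (m - 1))        # standard FCM membership update
        U /= U.sum(axis=0)
    return centers, U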

Graph based Clustering

Connected components clustering (CCC); a sketch follows this list.

HCS:
– stands for highly connected subgraphs clustering
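
A minimal CCC sketch, assuming SciPy: connect every pair of points closer than a threshold, then read off the connected components as clusters (the threshold is a hypothetical parameter):

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def ccc(X, threshold):
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    adjacency = csr_matrix((dist <= threshold) & (dist > 0))  # edges between nearby points
    n_clusters, labels = connected_components(adjacency, directed=False)
    return n_clusters, labels  # isolated points become singleton clusters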

HCS

It uses the notion of min-cuts.

There are many heuristic algorithms to find min-cuts or
approximations

HCS starts with the entire graph and checks if it is highly
connected.
– We can define highly connected however we like so long as we can
test for this condition efficiently.
– If the graph is not highly connected, it partitions the graph into two
non-empty subgraphs in such a way that the number of edges that
cross the partition is minimized.
– There are algorithms to find min-cuts under additional constraints.
HCS

We then recurse on the two subgraphs.
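
A sketch of the HCS recursion, assuming NetworkX and a connected input graph; "highly connected" is taken as edge connectivity greater than n/2, following Hartuv and Shamir:

import networkx as nx

def highly_connected(G):
    # Edge connectivity above half the number of nodes.
    return nx.edge_connectivity(G) > len(G) / 2

def hcs(G):
    if len(G) <= 2 or highly_connected(G):
        return [set(G.nodes)]                 # this subgraph is one cluster
    cut = nx.minimum_edge_cut(G)              # min-cut that splits the graph
    H = G.copy()
    H.remove_edges_from(cut)
    return [cluster
            for comp in nx.connected_components(H)
            for cluster in hcs(G.subgraph(comp).copy())]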

Use of HCS

Network-based summarization

Mapping applications

Use cases
– Online delivery
– E-commerce store setup
– Bus route planning, etc.

Thank you
