
Divya S R, Assistant Professor, Department of Computer Science, AES National Degree College, Gauribidanur.

Unit 5

Clustering
Introduction
Clustering is a fundamental technique in unsupervised machine learning used to group similar
objects or data points together based on certain features or characteristics. The main goal of
clustering is to partition a dataset into groups, or clusters, such that data points within the same
cluster are more similar to each other than to those in other clusters.
Clustering algorithms are widely used across various domains such as data analysis, pattern
recognition, image processing, and customer segmentation. They are valuable for exploratory data
analysis, identifying hidden patterns, and understanding the structure of data.

Definition of clustering
Clustering is a technique in machine learning where similar data points are grouped together
into clusters, based on their features or characteristics, without needing predefined labels.
Or
A cluster is a group of similar objects, and the objects in one group are different from the objects in another group.
Cluster Analysis

Definition of Cluster Analysis:


Cluster analysis is the process of finding similar groups of objects in order to form clusters. It
is an unsupervised machine learning technique that acts on unlabeled data (data which does not
contain output, i.e. class labels).
Example:


Properties of Cluster Analysis


Cluster analysis is a statistical technique used to group similar objects or data points into clusters, where
objects within the same cluster are more similar to each other compared to those in other clusters. Its main
properties include:
1. Unsupervised Learning: Cluster analysis does not require predefined labels or categories for
the data. It autonomously identifies patterns and structures within the data.
2. Clustering Scalability: (capability or ability)
Today's applications deal with huge amounts of data and huge databases. To handle huge
databases, the clustering algorithm should be scalable; if it is not scalable, we cannot get
appropriate results and may be led to wrong conclusions.
3. Dealing with unstructured data: Some databases contain errors, missing values, and noisy data.
The algorithm should be able to handle such unstructured data by giving it some structure and
organizing the data into groups of similar data objects.

4. Algorithm usability with multiple data types: Different kinds of data (binary data, numerical data,
categorical data, and so on) can be used with clustering algorithms, so an algorithm should be capable
of dealing with different types of data.
5. High Dimensionality: The algorithm should be able to handle high-dimensional data.
6. Interpretability: The outcomes of clustering should be usable, and the results must be easy to
understand.


Applications of Cluster Analysis


Cluster analysis finds applications in various fields:
1. Market Segmentation: Grouping customers based on similarities for targeted marketing.
2. Image Segmentation: Dividing images into regions for object recognition.
3. Anomaly Detection: Identifying unusual patterns in data for fraud detection.
4. Text Mining: Clustering documents for topic modeling.
5. Social Network Analysis: Identifying communities within networks.
6. Ecology: Classifying ecosystems based on species composition.
7. Customer Behavior Analysis: Grouping customers for personalized recommendations.
8. Medical Diagnosis: Classifying patients based on clinical characteristics.
9. Supply Chain Management: Optimizing supply chain networks by grouping products or suppliers.
10.Genetics: Classifying organisms based on genetic similarities.

Advantages of Cluster Analysis


Cluster analysis offers concise advantages:
1. Pattern Discovery: Reveals inherent data patterns.
2. Unsupervised Learning: Doesn't require labeled data.
3. Insight Generation: Generates hypotheses about data structure.
4. Data Reduction: Condenses large datasets.
5. Segmentation: Tailors strategies to different groups.
6. Anomaly Detection: Identifies outliers for fraud detection.
7. Visualization: Provides intuitive insights.
8. Flexibility: Various algorithms suit different needs.
9. Interpretability: Clusters offer actionable insights.
10.Predictive Modeling: Enhances performance in machine learning.
Disadvantages of Cluster Analysis
1. Subjectivity: Cluster interpretation can be subjective.
2. Sensitivity to Parameters: Results may vary with parameter settings.
3. Scalability: Computationally intensive for large datasets.
4. Assumption of Homogeneity: Assumes clusters have similar variance and density.
5. Curse of Dimensionality: Less effective in high-dimensional spaces.
6. Initialization Sensitivity: Results may vary based on initial conditions.
7. Evaluation Challenges: No definitive criteria for evaluating cluster quality.
8. Noise Sensitivity: Susceptible to noise, outliers, and irrelevant features.
9. Inter-cluster Variability: Difficulty in handling clusters with varying densities or shapes.
10.Validity Concerns: Difficulty in determining the optimal number of clusters.


Typical Requirements of Clustering in Data Mining


1. Data Preparation: Clean, preprocessed data without missing values or outliers is essential
for accurate clustering results.
2. Feature Selection: Choosing relevant features or variables that capture the essential characteristics
of the data while minimizing noise and irrelevant information.
3. Distance Metric: Selection of an appropriate distance or similarity measure to quantify the distance
between data points, considering the nature of the data and the clustering algorithm being
used.
4. Clustering Algorithm: Selection of a suitable clustering algorithm based on the characteristics of
the data, such as K-means, hierarchical clustering, DBSCAN, or density-based clustering.
5. Parameter Selection: Determining the optimal parameters for the chosen clustering algorithm,
such as the number of clusters (K), linkage method, or distance threshold.
6. Evaluation Metrics: Choosing appropriate metrics to evaluate the quality and effectiveness of the
clustering results, such as silhouette score, Davies-Bouldin index, or within-cluster sum of
squares (WCSS).
7. Interpretation: Developing methods to interpret and understand the clusters generated by the
algorithm, such as visualizations, cluster profiling, or cluster comparison techniques.
8. Computational Resources: Ensuring that sufficient computational resources are available to handle
the computational complexity of clustering algorithms, especially for large datasets.
9. Scalability: Assessing the scalability of the clustering algorithm to handle large volumes of data
efficiently, considering factors such as memory usage and runtime performance.
10.Application-Specific Considerations: Taking into account domain-specific requirements and
constraints when applying clustering to real-world problems, such as business objectives,
domain knowledge, and regulatory constraints.


Major clustering Methods


1. K-Means Clustering: A partitioning method that divides data into K clusters by iteratively assigning
data points to the nearest cluster centroid and updating centroids based on the mean of data points
in each cluster.

2. Hierarchical Clustering: Builds a hierarchy of clusters by either agglomerative (bottom-up) or
divisive (top-down) approaches, where clusters are merged or split based on their proximity to each
other.


3. Density-Based Spatial Clustering of Applications with Noise (DBSCAN): Groups together
closely packed data points into clusters based on density, where a cluster is formed by a set of data
points that are densely connected to each other, separated by areas of low density.

4. Mean Shift Clustering: A non-parametric method that identifies clusters by shifting each data point
towards the mode of its local density distribution, converging to cluster centroids where density is
highest.
5. Expectation-Maximization (EM) Clustering (Gaussian Mixture Models): Models data as a
mixture of several Gaussian distributions and uses the EM algorithm to estimate parameters such as
cluster means, covariances, and mixing coefficients.
6. Agglomerative Clustering: A hierarchical clustering approach that starts with each data point as
a separate cluster and iteratively merges the closest clusters based on a chosen distance metric until
a desired number of clusters is obtained.
7. Fuzzy Clustering (Fuzzy C-Means): Assigns each data point to multiple clusters with varying
degrees of membership, allowing data points to belong to more than one cluster simultaneously.
8. Self-Organizing Maps (SOM): Organizes data points on a two-dimensional grid, where nearby grid
cells represent similar data points, allowing for visualization and understanding of high-dimensional
data.
9. Density-Based Clustering Algorithms (OPTICS, HDBSCAN): Similar to DBSCAN, these methods
identify clusters based on density, but provide additional features such as hierarchical clustering,
robustness to parameter settings, and handling of varying density clusters.
10.Model-Based Clustering (BIRCH, CLARA): Utilizes statistical models or prototypes to represent
clusters and assigns data points to clusters based on their fit to these models or prototypes.


Major clustering methods


Partitioning Methods (K-Means)
Definition:
K-means clustering is one of the most popular partitioning methods in data mining and machine
learning. It's an unsupervised learning algorithm used for clustering data into groups or clusters
based on similarity.

Here is a basic overview of how the K-means algorithm works:


1. Initialization: Choose K initial cluster centers randomly or based on some heuristic. These
cluster centers are often called centroids.
2. Assignment: Assign each data point to the nearest centroid based on some distance metric,
typically Euclidean distance.
3. Update: Update the centroids by computing the mean of all data points assigned to each
centroid.
4. Repeat: Iterate steps 2 and 3 until convergence, i.e., until the centroids no longer change
significantly or a maximum number of iterations is reached.
5. Termination: Once the centroids stabilize or the maximum number of iterations is reached,
the algorithm terminates, and each data point belongs to one of the K clusters.
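The steps above can be run in a few lines of Python. The sketch below is an illustration only, using scikit-learn's KMeans on a small made-up dataset; the data values and parameter choices are purely hypothetical.

# Illustrative sketch only: K-means via scikit-learn on made-up 2-D data
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
              [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])   # hypothetical sample points

# n_clusters is K; n_init runs several random initializations and keeps the best
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # repeats the assignment and update steps until convergence

print("Cluster labels :", labels)
print("Final centroids:", kmeans.cluster_centers_)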

Here are some important points to consider:


 Choosing K: The number of clusters K needs to be specified before running the algorithm. Selecting
an appropriate K value is crucial and often involves domain knowledge or using techniques like the
elbow method (sketched after this list) or silhouette analysis.
 Initialization: The final clusters can be sensitive to the initial choice of centroids. Therefore, it's
common to run the algorithm multiple times with different initializations and choose the one with the
lowest within-cluster variation.
 Convergence: K-means may converge to a local optimum, which depends on the initial centroids.
Hence, multiple initializations are beneficial.
 Scalability: K-means is computationally efficient and can handle large datasets. However, it may
struggle with clusters of varying sizes, non-linear cluster boundaries, and outliers.
 Cluster Interpretation: After clustering, it's important to interpret the clusters to understand the
characteristics of each group. This may involve analyzing cluster centroids, visualizing clusters, or
using domain knowledge.
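The elbow method mentioned above can be sketched as follows. The blob data and the range of K values are hypothetical; inertia_ is scikit-learn's name for the within-cluster sum of squares (WCSS).

# Illustrative sketch of the elbow method: WCSS (inertia) for K = 1..6
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(6, 1, (50, 2)),
               rng.normal(12, 1, (50, 2))])            # three made-up blobs

for k in range(1, 7):
    wcss = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"K = {k}: WCSS = {wcss:.1f}")
# The WCSS drops sharply up to K = 3 and then flattens: the "elbow" suggests K = 3 here.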


K-means clustering is widely used in various applications such as customer segmentation, image
segmentation, document clustering, and anomaly detection. Despite its simplicity, it remains a powerful and
versatile algorithm in the field of data mining and machine learning.

Flowchart of K-means Clustering

K-Means Clustering Algorithm


Step 1: Compute the mean of each cluster.
Step 2: Assign each data point to the cluster whose mean is nearest.
Step 3: Repeat Step 1 and Step 2 until the means no longer change.

Example:
Data points: 2, 4, 6, 9, 12, 16, 20, 24, 26
Number of clusters K = 2; chosen initial centroids: 4 and 12.

Assigning each value to the nearest chosen centroid: K1 = {2, 4, 6}, K2 = {9, 12, 16, 20, 24, 26}

Step 1:
K1 = (2+4+6)/3 = 4, K2 = (9+12+16+20+24+26)/6 ≈ 18


Step 2:
K1 = {2, 4, 6, 9} K2 = {12, 16, 20, 24, 26}

Note: with centroid 4, the distances are |2−4| = 2, |4−4| = 0, |6−4| = 2, |9−4| = 5, so these points stay
in K1; for 12, |12−4| = 8 while |12−18| = 6, so 12 is nearer to K2's centroid and remains in K2.
Now find the new mean values:
K1 = (2+4+6+9)/4 = 21/4 = 5.25, rounded off to 5.
K2 = (12+16+20+24+26)/5 = 19.6, rounded off to 20.

We repeat Step 1 and Step 2 in this way until the means no longer change.
Next step:
Compare each data point with the new centroids 5 and 20 and assign it to the nearer one.
K1 = {2, 4, 6, 9, 12} K2 = {16, 20, 24, 26}

Find the new mean values:

K1 = (2+4+6+9+12)/5 = 33/5 = 6.6 ≈ 7 K2 = (16+20+24+26)/4 = 21.5 ≈ 22

Compare each data point with the centroids 7 and 22 and assign it to the nearer one.
K1 = {2, 4, 6, 9, 12} K2 = {16, 20, 24, 26}

Find the new mean values:

K1 = (2+4+6+9+12)/5 = 33/5 = 6.6 ≈ 7 K2 = (16+20+24+26)/4 = 21.5 ≈ 22

Here we stop the algorithm because the means are the same as in the previous iteration.
We have obtained two different clusters, K1 and K2.
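The iterations above can be checked with a small Python sketch (purely illustrative; it simply repeats the assign-and-update steps on the same one-dimensional data, rounding the means as in the hand calculation):

# Sketch that repeats the assign/update steps of the worked example above
points = [2, 4, 6, 9, 12, 16, 20, 24, 26]
c1, c2 = 4, 12                                        # chosen initial centroids

while True:
    k1 = [p for p in points if abs(p - c1) <= abs(p - c2)]   # assign to nearest centroid
    k2 = [p for p in points if abs(p - c1) >  abs(p - c2)]
    new_c1 = round(sum(k1) / len(k1))                 # update: rounded mean, as in the example
    new_c2 = round(sum(k2) / len(k2))
    if (new_c1, new_c2) == (c1, c2):                  # stop when the means no longer change
        break
    c1, c2 = new_c1, new_c2

print("K1 =", k1, "with mean", c1)   # expected: [2, 4, 6, 9, 12] with mean 7
print("K2 =", k2, "with mean", c2)   # expected: [16, 20, 24, 26] with mean 22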


Hierarchical Methods
Definition:
Hierarchical clustering is a method in data mining for grouping similar data points into clusters in a
hierarchical structure.

There are two types of Hierarchical clustering


1. Agglomerative hierarchical clustering
2. Divisive Clustering

1. Agglomerative hierarchical clustering


Hierarchical clustering can be agglomerative, where data points start as individual clusters and
are iteratively merged, or divisive, where all data points begin in one cluster and are recursively split.
1. Consider each data point as an individual cluster.
2. Determine the similarity (distance) between each cluster and all other clusters (find the proximity matrix).
3. Combine the most similar clusters.
4. Recalculate the proximity matrix for the new set of clusters.
5. Repeat step 3 and step 4 until a single cluster remains.

Definition: Agglomerative clustering is a bottom-up approach, in which the algorithm starts by taking all data
points as single clusters and keeps merging them until one cluster is left.

Key points:
 First consider the given matrix.
 In the matrix, each value represents the distance between two objects (A, B, C, D, E, and F).
 All diagonal values of the matrix are 0, because the distance from an object to itself is 0
(e.g., A to A = 0, B to B = 0, etc.). The remaining values differ.
 We consider only the lower (or upper) triangular part of the matrix.
 Find the smallest element in the matrix; its two objects form the first cluster.
 Here the elements D and E are paired, and the distance is 3.


Example:
A B C D E F
A 0
B 5 0
C 14 9 0
D 11 20 13 0
E 18 15 6 3 0
F 10 16 8 10 11 0

Explanation
Step 1:
A B C D E F
A 0
B 5 0
C 14 9 0
D 11 20 13 0
E 18 15 6 3 0

F 10 16 8 10 11 0

In the above matrix, the smallest value is 3.

Pair the two items (D, E) at distance 3.


     A    B    C    DE   F
A    0
B    5    0
C    14   9    0
DE   ?    ?    ?    0
F    10   16   8    ?    0

 The diagonal elements are always zero.
 Keep the previously known values as they are.
 Since D and E are merged into the new cluster DE, we must compute the new distances for the
DE row and column.
 Using complete linkage, the new distance is the maximum of the old distances (Di denotes distance):


Di(DE, A) = MAX(Di(D,A), Di(E,A)) = MAX(11, 18) = 18
Di(DE, B) = MAX(Di(D,B), Di(E,B)) = MAX(20, 15) = 20
Di(DE, C) = MAX(Di(D,C), Di(E,C)) = MAX(13, 6) = 13
Di(DE, F) = MAX(Di(D,F), Di(E,F)) = MAX(10, 11) = 11

This gives the updated lower triangular matrix:


A B C DE F
A 0
B 5 0

C 14 9 0
DE 18 20 13 0
F 10 16 8 11 0

Step 2:
Again consider the smallest value in the matrix, i.e. 5, and pair those items.
A B C DE F
A 0
B 5 0

C 14 9 0
DE 18 20 13 0
F 10 16 8 11 0

Pair (A, B) at distance 5.
The distances not involving A or B are already known, so keep them as they are.
     AB   C    DE   F
AB   0
C    ?    0
DE   ?    13   0
F    ?    8    11   0


Calculate:
AB C DE F
AB 0
C 14 0

DE 20 13 0

F 16 8 11 0

Di(AB, C) = MAX(Di(A,C), Di(B,C)) = MAX(14, 9) = 14
Di(AB, DE) = MAX(Di(A,DE), Di(B,DE)) = MAX(18, 20) = 20
Di(AB, F) = MAX(Di(A,F), Di(B,F)) = MAX(10, 16) = 16

Step 3:
     AB   C    DE   F
AB   0
C    14   0
DE   20   13   0
F    16   8    11   0

Again consider the smallest value in the matrix, i.e. 8.


     AB   CF   DE
AB   0
CF   ?    0
DE   20   ?    0

Calculate:
AB CF DE
AB 0
CF 16 0
DE 20 13 0

Pair (C, F) at distance 8.

Di(AB, CF) = MAX(Di(AB,C), Di(AB,F)) = MAX(14, 16) = 16
Di(CF, DE) = MAX(Di(CF,D), Di(CF,E)) = MAX(13, 11) = 13


Step 4:
AB CF DE
AB 0
CF 16 0
DE 20 13 0

Again consider the smallest value in the matrix, i.e. 13.

Pair (CF, DE) at distance 13.
       AB    CFDE
AB     0
CFDE   ?     0

Calculate:
       AB    CFDE
AB     0
CFDE   20    0

Pair (AB, CFDE) at distance 20.


Di(AB, CFDE) = MAX(Di(AB,CF), Di(AB,DE)) = MAX(16, 20) = 20

The pairs (merges) are: (D,E) at 3, (A,B) at 5, (C,F) at 8, (DE,CF) at 13, (AB,CFDE) at 20.
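For reference, the same complete-linkage merges can be reproduced with SciPy. This is only a sketch; the distance matrix is the one given in the example above.

# Sketch: complete-linkage agglomerative clustering of the matrix above with SciPy
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Full symmetric distance matrix for A, B, C, D, E, F (values from the example)
D = np.array([
    [ 0,  5, 14, 11, 18, 10],
    [ 5,  0,  9, 20, 15, 16],
    [14,  9,  0, 13,  6,  8],
    [11, 20, 13,  0,  3, 10],
    [18, 15,  6,  3,  0, 11],
    [10, 16,  8, 10, 11,  0],
], dtype=float)

Z = linkage(squareform(D), method="complete")   # complete linkage = MAX distance
print(Z)
# Each row of Z records one merge and its distance; the merge distances come out
# as 3, 5, 8, 13, 20, matching (D,E), (A,B), (C,F), (DE,CF), (AB,CFDE) above.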


Cluster representation (dendrogram)

2. Divisive Hierarchical Clustering

Divisive hierarchical clustering, also known as top-down hierarchical clustering, is an approach
to clustering where all data points begin in one cluster, and this cluster is recursively divided into
smaller clusters until each data point is in its own individual cluster.
Or
The divisive algorithm is the reverse of the agglomerative algorithm, as it is a top-down approach:
"Here the data is split, not merged."

Key Concepts of Divisive Hierarchical Clustering:


1. Top-Down Approach:
o DHC (Divisive Hierarchical Clustering) starts with all data points in a single cluster.
o It iteratively splits the clusters into smaller clusters until each data point is in its own
singleton cluster or a stopping criterion is met (e.g., a desired number of clusters is
achieved).
2. Cluster Splitting:
o At each step, the cluster to be split is chosen based on some criterion, such as the cluster
with the largest diameter or variance.


o The chosen cluster is split into two sub-clusters using a clustering algorithm
(commonly k-means or another partitioning method).
3. Tree Structure:
o The process creates a binary tree where the root represents the entire dataset, and
each node represents a cluster that is split into two child nodes.
o The leaves of the tree represent individual data points.

Problem
Divisive clustering using the complete linkage method; the minimum spanning tree is built with Prim's algorithm.
        x      y
P1      0.40   0.53
P2      0.22   0.38
P3      0.35   0.32
P4      0.26   0.19
P5      0.08   0.41
P6      0.45   0.30


Step 1:
As in the agglomerative method, we use the Euclidean distance to measure the distance between two
points, together with the complete linkage method.
First calculate the distance between P1 and P2:

D(P1, P2) = sqrt((0.40 − 0.22)^2 + (0.53 − 0.38)^2) = sqrt(0.0324 + 0.0225) ≈ 0.23

Similarly we calculate the distance between every pair of data points, which can be represented in the
distance matrix below. In hierarchical clustering this distance matrix is also called the proximity matrix.
P1 P2 P3 P4 P5 P6

P1 0

P2 0.23 0

P3 0.22 0.15 0

P4 0.37 0.20 0.15 0

P5 0.34 0.14 0.28 0.29 0

P6 0.23 0.25 0.11 0.22 0.39 0

Step 2:
Now compute the minimum spanning tree (MST) from the above proximity matrix using either
Kruskal's or Prim's algorithm.
We use Prim's algorithm here for the sake of simplicity. Since the MST is built greedily, we first
arrange all the distances in ascending order.

Edge Cost

(P3,P6) 0.11

(P2,P5) 0.14

(P2,P3) 0.15


(P3,P4) 0.15

(P2,P4) 0.20

(P1,P3) 0.22

(P4,P6) 0.22

(P1,P2) 0.23

(P1,P6) 0.23

(P2,P6) 0.25

(P3,P5) 0.28

(P4,P5) 0.29

(P1,P5) 0.34

(P1,P4) 0.37

(P5,P6) 0.39

Now the final MST can be constructed by connecting the edges, starting from the minimum distance,
until all the points participate and no closed loop (circuit) is formed.
Following the greedy approach, we first connect P3 and P6 with the lowest distance 0.11, then
connect P2 and P5 with the second lowest distance 0.14, then connect P2 with P3 with the third lowest
distance 0.15, and so on.

Note that we cannot add an edge that forms a circuit.

Example: the edge (P2, P4) cannot be added because it would create a circuit.


Step 3:
There fore we apply complete linkage method (MAX) to break the edges of the final MST according to
the maximum cost of distance gradually and we are able create new clusters by considering those
largest distance.

i) At first we break the edge whose cost is 0.22 ie (P1,P3), So two clusters get formed
Cluster 1: Consists of { P1 }.
Cluster 2: Consists of {P2, P3, P4, P5, P6}

ii) Now, we will seek for the next maximum distance gradually and we will choose (P3, P4) as the next
highest cost i.e 0.15 and break the edge to create the tree.
Cluster 1: Consists of { P1 }.
Cluster 2: Consists of {P2, P3, P5, P6}
Cluster 3: {P4}
Continue this splitting iteration until each new cluster containing only a single object.


iii) Next we break the edge (P2, P3) with the next highest cost, 0.15:
Cluster 1: {P1}
Cluster 2: {P2, P5}
Cluster 3: {P4}
Cluster 4: {P3, P6}

iv) Next we break the edge (P2, P5) with the next highest cost, 0.14:
Cluster 1: {P1}
Cluster 2: {P2}
Cluster 3: {P4}
Cluster 4: {P3, P6}
Cluster 5: {P5}


v) Finally we break the edge (P3, P6) with the remaining cost, 0.11:
Cluster 1: {P1}
Cluster 2: {P2}
Cluster 3: {P4}
Cluster 4: {P3}
Cluster 5: {P5}
Cluster 6: {P6}

Dendrogram

The dendrogram for divisive clustering is read from the top down (the reverse of the bottom-up
agglomerative approach).
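As a sketch, the proximity matrix and the MST used in this example can also be computed with SciPy. The coordinates are the P1–P6 values from the table; breaking the MST edges from the largest cost downward gives the same splits.

# Sketch: proximity matrix and MST for points P1..P6 using SciPy
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                   [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])   # P1..P6 from the table

D = squareform(pdist(points, metric="euclidean"))   # full 6x6 proximity matrix
print(np.round(D, 2))

mst = minimum_spanning_tree(D).toarray()            # MST edge weights
rows, cols = np.nonzero(mst)
for i, j in zip(rows, cols):
    print(f"MST edge (P{i+1}, P{j+1}) with cost {mst[i, j]:.2f}")
# Removing the MST edges one by one, from the largest cost downwards,
# reproduces the divisive splits shown in steps (i) to (v) above.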


Advantages of Divisive Hierarchical Clustering:


1. Granular Control: Allows detailed control over the clustering process, enabling specific clusters to
be split based on detailed criteria.
2. Intuitive Structure: Offers an intuitive approach for scenarios where starting with a single cluster
and progressively refining it is logical.
3. Potential for Better Clusters: Can sometimes yield more accurate or meaningful clusters
compared to bottom-up approaches, as it considers global information from the start.

Disadvantages of Divisive Hierarchical Clustering:


1. Computational Complexity: More computationally intensive, especially for large datasets, due to
the need for multiple re-clustering operations.
2. Choice of Split Criteria: Highly dependent on the effectiveness of the chosen criteria and
clustering methods for splitting clusters.

3. Less Common: Less widely used and understood compared to agglomerative methods, which can
make implementation and interpretation more challenging.
Advantages of Hierarchical clustering
 It is simple to implement and gives the best output in some cases.
 It is easy and results in a hierarchy, a structure that contains more information.
 It does not need us to pre-specify the number of clusters.

Disadvantages of hierarchical clustering


 It breaks the large clusters.
 It is difficult to handle clusters of different sizes and convex shapes.
 It is sensitive to noise and outliers.
 A merge or split performed earlier can never be changed or undone.


Density-Based Methods
Definition:
Density-based clustering methods aim to identify dense regions of data points in the feature
space, forming clusters based on areas of high data density.
One of the most well-known density-based clustering algorithms is DBSCAN (Density-Based
Spatial Clustering of Applications with Noise).

DBSCAN is advantageous because it can identify clusters of arbitrary shapes and handle noise
effectively. However, it requires setting two parameters: epsilon (the radius of the neighbourhood) and
MinPts (the minimum number of points required to form a dense region). Choosing appropriate values for
these parameters can be challenging and may require domain knowledge or experimentation.

Another density-based clustering algorithm worth mentioning is OPTICS (Ordering Points to Identify
the Clustering Structure), which extends DBSCAN by providing a hierarchical clustering result and does not
require setting the epsilon parameter explicitly.

Density-based clustering methods are suitable for datasets with complex structures and varying
densities, making them particularly useful in spatial data analysis, anomaly detection, and identifying
clusters of irregular shapes.

In short

 In density-based clustering we must provide two inputs: MinPts and ε (epsilon).

 ε (epsilon) is the radius of the circle (neighbourhood) formed with a data object as its centre.

 MinPts is the minimum number of data points required inside that circle.

Example: MinPts = 3
In the example below, a point is dense if its circle contains 3 or more (>= 3) data points.


In this example, 'a' is the data object; with 'a' as the centre we form a circle, and the radius of that
circle is called epsilon.

In this algorithm, there are 3 types of data points:

Core point (centre data item): a point whose circle satisfies the MinPts condition.
Border (boundary) point: a point that is not a core point but is a neighbour of a core point.
Noise (outlier): a point that is neither a core point nor a border point.

In the figure, 'q' (a neighbour of core point 'p') and 's' (a neighbour of core point 'r') are border points,
while 't' is neither a core point nor a border point, so it is noise.
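A minimal sketch of running DBSCAN with scikit-learn is shown below; the epsilon and MinPts values (eps, min_samples) and the sample points are illustrative only.

# Illustrative sketch: DBSCAN with eps (epsilon radius) and min_samples (MinPts)
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # one dense region
              [8.0, 8.1], [8.2, 7.9], [7.9, 8.0],   # another dense region
              [4.5, 4.5]])                           # an isolated point

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # two clusters labelled 0 and 1; -1 marks the isolated point as noise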

Grid-Based Methods
Definition:
Grid-based clustering methods partition the data space into a finite number of cells (forming a grid)
and then perform clustering on these cells instead of the individual data points. This approach reduces the
computational complexity and is particularly useful for handling large datasets. Grid-based methods are
efficient because they aggregate data points into manageable grids and operate on these grids.


Key Concepts in Grid-Based Clustering


1. Grid Structure:
o The data space is divided into a grid structure composed of multiple cells (also called bins).
o Each cell covers a region of the data space, and data points falling within a cell are treated
as a single unit.
2. Density Calculation:
o The density of each cell is calculated, usually as the number of data points within the cell.
o Cells with a high density are considered potential cluster centers.
3. Cluster Formation:
o Clusters are formed by merging adjacent high-density cells.
o Cells with densities below a certain threshold may be considered noise or outliers.

Steps in Grid-Based Clustering


1. Grid Partitioning:
o Divide the data space into a grid of cells. The size and shape of the cells can be uniform or
adaptive, depending on the algorithm.
2. Density Calculation:
o Calculate the density of each cell by counting the number of data points it contains.
3. Identification of Dense Cells:
o Identify cells that have densities higher than a predefined threshold.
4. Cluster Formation:
o Merge adjacent dense cells to form clusters.
o Apply additional criteria to refine the clusters, such as connectivity and proximity.
5. Cluster Refinement (Optional):
o Post-processing steps to refine the clusters, such as removing small clusters or merging
closely situated clusters.
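A toy sketch of these steps using NumPy is shown below; the grid size and density threshold are illustrative choices rather than values from any specific algorithm.

# Toy sketch of grid-based clustering: bin points into cells and keep the dense cells
import numpy as np

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(2, 0.3, size=(100, 2)),
                  rng.normal(7, 0.3, size=(100, 2))])     # two made-up dense regions

# Steps 1-2: grid partitioning and density calculation (points per cell)
counts, xedges, yedges = np.histogram2d(data[:, 0], data[:, 1], bins=10)

# Step 3: identify dense cells with a simple threshold
threshold = 5
dense_cells = np.argwhere(counts >= threshold)
print("Dense cells (grid indices):")
print(dense_cells)
# Step 4 would merge adjacent dense cells (e.g. via connected components) to form clusters.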

Examples of Grid-Based Clustering Algorithms


1. STING (Statistical Information Grid):
o Divides the data space into hierarchical grid cells and performs statistical analysis on each
cell.
o Uses a multi-resolution approach to analyze data at different levels of granularity.
2. CLIQUE (Clustering In QUEst):
o Combines grid-based and density-based approaches.
o Finds dense regions in subspaces of high-dimensional data, making it suitable for
high-dimensional clustering.


3. WaveCluster:
o Uses wavelet transformation to transform the data space and identify dense regions.
o Efficiently handles noise and captures clusters of arbitrary shapes.

Advantages:
1. Simplicity: Grids offer a straightforward representation of space, making them easy to implement
and understand.
2. Efficiency: Calculations and algorithms on grid-based data structures are often computationally
efficient.
3. Spatial reasoning: Grids facilitate spatial reasoning and enable easy manipulation of spatial
relationships.
4. Regularity: Grids provide a regular and uniform structure, which can simplify certain types of
analysis and processing tasks.

Disadvantages:
1. Discretization errors: Grid-based approaches introduce discretization errors due to the finite
resolution of the grid, potentially leading to inaccuracies.
2. Memory consumption: Storing a grid representation of large environments or complex spaces can
require significant memory resources.
3. Grid size limitations: Grids may struggle to represent continuous or highly detailed environments
effectively, especially when finer granularity is needed.
4. Boundary effects: Algorithms operating on grid-based representations may encounter challenges
near grid boundaries, leading to edge effects or artifacts.

Applications
1. Robotics: Navigation and mapping using grid-based SLAM.
2. GIS: Analyzing geographic data like land use and elevation.
3. Computer Graphics: Image rendering and texture mapping.
4. Finite Element Analysis: Simulating structural behavior.
5. Image Processing: Segmentation and filtering.
6. Molecular Modelling: Analyzing molecular structures.
7. Weather Forecasting: Numerical weather prediction models.
8. Video Games: Pathfinding and collision detection.


Evaluation of Clustering
Evaluating the quality of clustering results is crucial to assess the performance of clustering algorithms and
determine their effectiveness in grouping data points. Several methods are used for evaluating clustering,
including:
1. Internal Evaluation Metrics: These metrics assess the quality of clustering based solely on the
input data and the clustering results. Examples include:
o Silhouette Score: Measures how similar an object is to its own cluster compared to other
clusters. Values range from -1 to 1, with higher values indicating better clustering.
o Davies–Bouldin Index: Computes the average similarity between each cluster and its most
similar cluster, where lower values indicate better clustering.
o Calinski-Harabasz Index: Ratio of the between-cluster dispersion and within-cluster
dispersion, with higher values indicating better-defined clusters.
2. External Evaluation Metrics: These metrics compare the clustering results with a ground truth
clustering, if available. Examples include:
o Adjusted Rand Index (ARI): Measures the similarity between two clusterings, where a
score of 1 indicates perfect clustering agreement.
o Normalized Mutual Information (NMI): Measures the mutual information between the
true and predicted clusterings, normalized to range between 0 and 1.
o Fowlkes-Mallows Index: Computes the geometric mean of the pairwise precision and
recall, where higher values indicate better clustering performance.
3. Visual Inspection: Visualization techniques, such as scatter plots, dendrograms, or t-SNE
embeddings, allow analysts to visually inspect clustering results and assess their quality.
4. Domain-Specific Measures: In some cases, domain-specific measures may be used to evaluate
clustering results. For example, in customer segmentation, metrics like customer retention or
revenue increase may be used to assess the effectiveness of clustering.
It's important to select appropriate evaluation metrics based on the characteristics of the data and
the goals of the clustering task. Additionally, using a combination of multiple evaluation methods provides
a more comprehensive understanding of clustering performance.
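All of the metrics listed above are available in scikit-learn; the sketch below computes them for a hypothetical K-means clustering of a small made-up dataset.

# Sketch: computing internal and external clustering metrics with scikit-learn
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score, adjusted_rand_score,
                             normalized_mutual_info_score)

X = np.array([[1, 2], [1, 3], [2, 2], [8, 8], [8, 9], [9, 8]], dtype=float)
true_labels = [0, 0, 0, 1, 1, 1]                     # ground truth, if available

pred_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal metrics: need only the data and the predicted labels
print("Silhouette        :", silhouette_score(X, pred_labels))
print("Davies-Bouldin    :", davies_bouldin_score(X, pred_labels))
print("Calinski-Harabasz :", calinski_harabasz_score(X, pred_labels))

# External metrics: compare predicted labels against the ground truth
print("Adjusted Rand     :", adjusted_rand_score(true_labels, pred_labels))
print("NMI               :", normalized_mutual_info_score(true_labels, pred_labels))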
Advantages:
1. Quantitative assessment aids algorithm selection.
2. Provides interpretability and insight into data grouping.
3. Guides optimization and fosters research development.
Disadvantages:
1. Dependency on data characteristics may lead to metric bias.
2. Sensitivity to noise and outliers can affect results.
3. Subjectivity in interpretation may arise from evaluation measures.


Applications
Evaluation of clustering is applied across various domains:
1. Marketing: Segmentation for targeted campaigns.
2. Biology: Grouping genes or proteins for analysis.
3. Image Analysis: Segmenting objects in images.
4. Anomaly Detection: Identifying outliers in data.
5. Text Mining: Clustering documents for topic modeling.
6. Customer Management: Segmentation for personalized services.
7. Social Network Analysis: Identifying communities in networks.
8. Healthcare: Stratifying patients for personalized treatment.
9. Supply Chain Management: Grouping products or suppliers for optimization.
10.Climate Science: Identifying weather patterns and climate regions.
