Unit 5: Clustering-2
Introduction
Clustering is a fundamental technique in unsupervised machine learning used to group similar
objects or data points together based on certain features or characteristics. The main goal of
clustering is to partition a dataset into groups, or clusters, such that data points within the same
cluster are more similar to each other than to those in other clusters.
Clustering algorithms are widely used across various domains such as data analysis, pattern
recognition, image processing, and customer segmentation. They are valuable for exploratory data
analysis, identifying hidden patterns, and understanding the structure of data.
Definition of clustering
Clustering is a technique in machine learning where similar data points are grouped together
into clusters, based on their features or characteristics, without needing predefined labels.
Or
A group of similar objects, where the objects in one group are different from the objects in other groups.
Cluster Analysis
4. Algorithm usability with multiple data types: different kinds of data (binary, categorical, numerical, and so on) can be used with clustering algorithms, so an algorithm should be capable of dealing with all of them.
5. High dimensionality: the algorithm should be able to handle high-dimensional data.
6. Interpretability: the clustering results should be usable and easy to understand.
4. Mean Shift Clustering: A non-parametric method that identifies clusters by shifting each data point
towards the mode of its local density distribution, converging to cluster centroids where density is
highest.
5. Expectation-Maximization (EM) Clustering (Gaussian Mixture Models): Models data as a
mixture of several Gaussian distributions and uses the EM algorithm to estimate parameters such as
cluster means, covariances, and mixing coefficients.
6. Agglomerative Clustering: A hierarchical clustering approach that starts with each data point as
a separate cluster and iteratively merges the closest clusters based on a chosen distance metric until
a desired number of clusters is obtained.
7. Fuzzy Clustering (Fuzzy C-Means): Assigns each data point to multiple clusters with varying
degrees of membership, allowing data points to belong to more than one cluster simultaneously.
8. Self-Organizing Maps (SOM): Organizes data points on a two-dimensional grid, where nearby grid
cells represent similar data points, allowing for visualization and understanding of high-dimensional
data.
9. Density-Based Clustering Algorithms (OPTICS, HDBSCAN): Similar to DBSCAN, these methods
identify clusters based on density, but provide additional features such as hierarchical clustering,
robustness to parameter settings, and handling of varying density clusters.
10. Model-Based Clustering (BIRCH, CLARA): Utilizes statistical models or prototypes to represent
clusters and assigns data points to clusters based on their fit to these models or prototypes.
K-means clustering is widely used in various applications such as customer segmentation, image
segmentation, document clustering, and anomaly detection. Despite its simplicity, it remains a powerful and
versatile algorithm in the field of data mining and machine learning.
Example:
Data points: 2, 4, 6, 9, 12, 16, 20, 24, 26
Number of clusters = 2; chosen initial centroids: 4 and 12.
Initial assignment (each point goes to its nearest centroid):
K1 = {2, 4, 6}, K2 = {9, 12, 16, 20, 24, 26}
Step 1: Recompute each centroid as the mean of its cluster.
K1 = (2+4+6)/3 = 4, K2 = (9+12+16+20+24+26)/6 ≈ 18
Step 2: Reassign each point to the nearer of the new centroids 4 and 18.
K1 = {2, 4, 6, 9}, K2 = {12, 16, 20, 24, 26}
Note: for 9, |9 - 4| = 5 and |9 - 18| = 9, so 9 joins K1; for 12, |12 - 4| = 8 and |12 - 18| = 6, so 12 joins K2.
Now recompute the means:
K1 = (2+4+6+9)/4 = 21/4 = 5.25, rounded to 5.
K2 = (12+16+20+24+26)/5 = 19.6, rounded to 20.
Repeat Steps 1 and 2 until the cluster assignments no longer change.
Next step:
Reassign each point to the nearer of the centroids 5 and 20:
K1 = {2, 4, 6, 9, 12}, K2 = {16, 20, 24, 26}
New means: K1 = (2+4+6+9+12)/5 = 6.6, rounded to 7; K2 = (16+20+24+26)/4 = 21.5, rounded to 22.
Reassigning the points with centroids 7 and 22 again gives:
K1 = {2, 4, 6, 9, 12}, K2 = {16, 20, 24, 26}
Since the assignments are the same as in the previous step, the algorithm stops. The two final clusters are K1 and K2.
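A minimal Python sketch of the same iterations is given below. It uses exact means instead of the rounded values above, but it converges to the same two clusters; the data points and initial centroids are taken from the example.

import numpy as np

points = np.array([2, 4, 6, 9, 12, 16, 20, 24, 26], dtype=float)
centroids = np.array([4.0, 12.0])        # the chosen initial centroids

while True:
    # Assign each point to the nearest centroid.
    labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
    # Recompute each centroid as the mean of its assigned points.
    new_centroids = np.array([points[labels == k].mean() for k in range(len(centroids))])
    if np.allclose(new_centroids, centroids):   # stop when the centroids no longer change
        break
    centroids = new_centroids

print("centroids:", centroids)
print("K1:", points[labels == 0], "K2:", points[labels == 1])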
Hierarchical Methods
Definition:
Hierarchical clustering is a method in data mining for grouping similar data points into clusters in a
hierarchical structure.
Definition: Agglomerative is a bottom-up approach, in which the algorithm starts with taking all data
points as single clusters and merging them until one cluster is left.
Key points:
First, consider the given distance matrix.
Each value in the matrix represents the distance between two points (A, B, C, D, E, and F).
All diagonal values are 0, because the distance from a point to itself is 0 (e.g., A to A = 0, B to B = 0). The remaining values are the pairwise distances.
Since the matrix is symmetric, we consider only the lower (or upper) triangle.
At each step, find the smallest value in the matrix and merge the corresponding points into one cluster.
Here the smallest value is 3, so D and E are paired first.
Example:
A B C D E F
A 0
B 5 0
C 14 9 0
D 11 20 13 0
E 18 15 6 3 0
F 10 16 8 10 11 0
Explanation
Step 1:
A B C D E F
A 0
B 5 0
C 14 9 0
D 11 20 13 0
E 18 15 6 3 0
F 10 16 8 10 11 0
In the above matrix, the smallest value is 3, so D and E are paired into the cluster DE. The distance from DE to each remaining point is taken as the larger of the distances from D and from E (complete linkage):
DE-A = max(11, 18) = 18, DE-B = max(20, 15) = 20, DE-C = max(13, 6) = 13, DE-F = max(10, 11) = 11.
Step 2:
Again consider the smallest value in the updated matrix, i.e. 5, and pair the corresponding points.
A B C DE F
A 0
B 5 0
C 14 9 0
DE 18 20 13 0
F 10 16 8 11 0
Pair (A, B) at distance 5, forming the cluster AB.
The distances among C, DE, and F are already known, so they are kept as they are:
AB C DE F
AB 0
C 0
DE 13 0
F 8 11 0
Calculate the new distances: AB-C = max(14, 9) = 14, AB-DE = max(18, 20) = 20, AB-F = max(10, 16) = 16.
AB C DE F
AB 0
C 14 0
DE 20 13 0
F 16 8 11 0
Step 3:
AB C DE F
AB 0
C 14 0
DE 20 13 0
F 16 8 11 0
The smallest value is now 8, so pair (C, F) into the cluster CF.
Calculate the new distances: CF-AB = max(14, 16) = 16, CF-DE = max(13, 11) = 13.
AB CF DE
AB 0
CF 16 0
DE 20 13 0
Step 4:
AB CF DE
AB 0
CF 16 0
DE 20 13 0
The smallest value is 13, so pair (CF, DE), forming the cluster CFDE.
AB CFDE
AB 0
CFDE 0
Calculate the new distance: CFDE-AB = max(16, 20) = 20.
AB CFDE
AB 0
CFDE 20 0
Finally, AB and CFDE are merged at distance 20, leaving a single cluster that contains all the points.
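The same agglomerative example can be reproduced with SciPy's hierarchical clustering routines. The sketch below is a rough illustration, assuming numpy, scipy and (for the dendrogram) matplotlib are available; it feeds the distance matrix above into complete-linkage clustering, and the merge order matches the steps worked out by hand.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

names = ["A", "B", "C", "D", "E", "F"]
# Symmetric distance matrix from the example above.
D = np.array([[0, 5, 14, 11, 18, 10],
              [5, 0, 9, 20, 15, 16],
              [14, 9, 0, 13, 6, 8],
              [11, 20, 13, 0, 3, 10],
              [18, 15, 6, 3, 0, 11],
              [10, 16, 8, 10, 11, 0]], dtype=float)

# linkage() expects a condensed distance vector, not a square matrix.
Z = linkage(squareform(D), method="complete")
print(Z)                      # each row: the two clusters merged, the merge distance, the new cluster size
dendrogram(Z, labels=names)   # draws the dendrogram when matplotlib is available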
Clusters representation
o The chosen cluster is split into two sub-clusters using a clustering algorithm
(commonly k-means or another partitioning method).
3. Tree Structure:
o The process creates a binary tree where the root represents the entire dataset, and
each node represents a cluster that is split into two child nodes.
o The leaves of the tree represent individual data points.
Problem
Divisive clustering using a minimum spanning tree (built with Prim's algorithm) and the complete linkage method.
Point x y
P1 0.40 0.53
P2 0.22 0.38
P3 0.35 0.32
P4 0.26 0.19
P5 0.08 0.41
P6 0.45 0.30
Step 1:
As in the agglomerative method, we use the Euclidean distance to measure the distance between two points, together with the complete linkage method.
First we calculate the distance between P1 and P2:
D(P1, P2) = √((0.40 - 0.22)² + (0.53 - 0.38)²) = √(0.0324 + 0.0225) = √0.0549 ≈ 0.23
Similarly, we calculate the distance between every pair of data points and arrange the results in the distance matrix below. In hierarchical clustering this distance matrix is also called the proximity matrix.
P1 P2 P3 P4 P5 P6
P1 0
P2 0.23 0
P3 0.22 0.15 0
P4 0.37 0.20 0.15 0
P5 0.34 0.14 0.28 0.29 0
P6 0.23 0.25 0.11 0.22 0.39 0
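As a quick check, the proximity matrix can be computed in a few lines with SciPy. This is a small sketch assuming numpy and scipy are installed; the coordinates are the six points listed above, and a few entries may differ from the table in the second decimal because of rounding.

import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                   [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])
# Pairwise Euclidean distances, rounded to two decimals like the matrix above.
proximity = np.round(squareform(pdist(points)), 2)
print(proximity)   # entry [0, 1] is D(P1, P2) ≈ 0.23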
Step 2:
Now compute the minimum spanning tree (MST) from the above proximity matrix by using either
Kruskal’s or Prim’s algorithm.
For simplicity, we use Prim's algorithm here. The MST is built greedily, so we first arrange all the pairwise distances in ascending order:
Edge Cost
(P3,P6) 0.11
(P2,P5) 0.14
(P2,P3) 0.15
(P3,P4) 0.15
(P2,P4) 0.20
(P1,P3) 0.22
(P4,P6) 0.22
(P1,P2) 0.23
(P1,P6) 0.23
(P2,P6) 0.25
(P3,P5) 0.28
(P4,P5) 0.29
(P1,P5) 0.34
(P1,P4) 0.37
(P5,P6) 0.39
Now the final MST is constructed by adding edges, starting from the minimum distance, until every point is included and no cycle (closed loop) is formed.
Following this greedy approach, we first connect P3 and P6 with the lowest distance 0.11, then P2 and P5 with the second lowest distance 0.14, then P2 and P3 with the third lowest distance 0.15, and so on, skipping any edge that would form a cycle.
Step 3:
We now apply the complete linkage (MAX) idea in reverse: we repeatedly break the MST edge with the largest remaining distance, and each break splits an existing cluster into two new clusters.
i) First we break the edge with the largest cost, 0.22, i.e. (P1, P3), so two clusters are formed:
Cluster 1: Consists of { P1 }.
Cluster 2: Consists of {P2, P3, P4, P5, P6}
ii) Next, we break the edge with the next highest cost, (P3, P4) at 0.15:
Cluster 1: Consists of { P1 }.
Cluster 2: Consists of {P2, P3, P5, P6}
Cluster 3: {P4}
Continue this splitting until each cluster contains only a single object.
iii) Next, we break the edge with the next highest cost, (P2, P3) at 0.15:
Cluster 1: Consists of { P1 }.
Cluster 2: Consists of {P2, P5}
Cluster 3: {P4}
Cluster 4: {P3, P6}
iv) Next, we break the edge with the next highest cost, (P2, P5) at 0.14:
Cluster 1: Consists of { P1 }.
Cluster 2: Consists of {P2}
Cluster 3: {P4}
Cluster 4: {P3, P6}
Cluster 5: {P5}
v) Finally, we break the last remaining edge, (P3, P6) at 0.11, so every point ends up in its own cluster:
Cluster 1: Consists of { P1 }.
Cluster 2: Consists of {P2}
Cluster 3: {P4}
Cluster 4: {P3}
Cluster 5: {P5}
Cluster 6: {P6}
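A compact sketch of this MST-based divisive procedure is shown below, assuming numpy and scipy are available; the point coordinates are those from the table above. It builds the proximity matrix, extracts the minimum spanning tree, and then repeatedly removes the heaviest remaining edge, printing the clusters after each split.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

points = np.array([[0.40, 0.53], [0.22, 0.38], [0.35, 0.32],
                   [0.26, 0.19], [0.08, 0.41], [0.45, 0.30]])

D = squareform(pdist(points))               # proximity matrix
mst = minimum_spanning_tree(D).toarray()    # nonzero entries are the MST edges

# Divisive step: cut the heaviest remaining MST edge and report the clusters.
while mst.any():
    i, j = np.unravel_index(np.argmax(mst), mst.shape)
    mst[i, j] = 0                           # break the heaviest edge
    n_clusters, labels = connected_components(mst, directed=False)
    print(f"cut edge (P{i+1}, P{j+1}) -> {n_clusters} clusters:", labels)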
Dendrogram
3. Less Common: Less widely used and understood compared to agglomerative methods, which can
make implementation and interpretation more challenging.
Advantages of Hierarchical clustering
It is simple to implement and gives good results in some cases.
It produces a hierarchy, a structure that carries more information than a flat partitioning.
It does not require us to pre-specify the number of clusters.
Density-Based Methods
Definition:
Density-based clustering methods aim to identify dense regions of data points in the feature
space, forming clusters based on areas of high data density.
One of the most well-known density-based clustering algorithms is DBSCAN (Density-Based
Spatial Clustering of Applications with Noise).
DBSCAN is advantageous because it can identify clusters of arbitrary shapes and handle noise
effectively. However, it requires setting two parameters: epsilon (the radius of the neighbourhood) and
MinPts (the minimum number of points required to form a dense region). Choosing appropriate values for
these parameters can be challenging and may require domain knowledge or experimentation.
Another density-based clustering algorithm worth mentioning is OPTICS (Ordering Points to Identify
the Clustering Structure), which extends DBSCAN by providing a hierarchical clustering result and does not
require setting the epsilon parameter explicitly.
Density-based clustering methods are suitable for datasets with complex structures and varying
densities, making them particularly useful in spatial data analysis, anomaly detection, and identifying
clusters of irregular shapes.
In short
Example: MinPts = 3
In the example, a point is a core point if the circle of radius epsilon drawn around it contains at least MinPts = 3 points.
Here 'a' is a core point: taking 'a' as the centre, the radius of the circle drawn around it is epsilon.
'q' lies in the neighbourhood of the core point 'p' and is a border point; similarly 's' lies in the neighbourhood of 'r' and is a border point; but 't' is neither a core point nor a border point, so it is treated as noise.
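A minimal DBSCAN sketch with scikit-learn is shown below; the sample points and parameter values are illustrative and not taken from these notes.

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],   # one dense group
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],   # another dense group
              [9.0, 0.5]])                          # an isolated point

# eps is the neighbourhood radius; min_samples corresponds to MinPts.
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)   # one cluster index per point; -1 marks noise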
Grid-Based Methods
Definition:
Grid-based clustering methods partition the data space into a finite number of cells (forming a grid)
and then perform clustering on these cells instead of the individual data points. This approach reduces the
computational complexity and is particularly useful for handling large datasets. Grid-based methods are
efficient because they aggregate data points into manageable grids and operate on these grids.
3. WaveCluster:
o Uses wavelet transformation to transform the data space and identify dense regions.
o Efficiently handles noise and captures clusters of arbitrary shapes.
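To make the grid-based idea concrete, the rough sketch below (not any specific algorithm such as STING, CLIQUE, or WaveCluster; the function name, cell size, and threshold are illustrative) bins points into cells, keeps the cells whose point count exceeds a density threshold, and joins neighbouring dense cells into clusters.

import numpy as np
from scipy.sparse.csgraph import connected_components

def grid_cluster(points, cell_size=1.0, density_threshold=5):
    # Map each point to its grid cell and count points per cell.
    cells = np.floor(points / cell_size).astype(int)
    uniq, counts = np.unique(cells, axis=0, return_counts=True)
    dense = uniq[counts >= density_threshold]        # keep only the dense cells
    # Two dense cells are neighbours if they differ by at most 1 in every coordinate.
    n = len(dense)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if np.max(np.abs(dense[i] - dense[j])) <= 1:
                adj[i, j] = 1
    n_clusters, labels = connected_components(adj, directed=False)
    return dense, labels

rng = np.random.default_rng(0)
# Two synthetic dense regions of points.
pts = np.vstack([rng.normal([2, 2], 0.5, (300, 2)),
                 rng.normal([7, 7], 0.5, (300, 2))])
dense_cells, labels = grid_cluster(pts)
print(len(dense_cells), "dense cells grouped into", len(set(labels)), "clusters")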
Advantages:
1. Simplicity: Grids offer a straightforward representation of space, making them easy to implement
and understand.
2. Efficiency: Calculations and algorithms on grid-based data structures are often computationally
efficient.
3. Spatial reasoning: Grids facilitate spatial reasoning and enable easy manipulation of spatial
relationships.
4. Regularity: Grids provide a regular and uniform structure, which can simplify certain types of
analysis and processing tasks.
Disadvantages:
1. Discretization errors: Grid-based approaches introduce discretization errors due to the finite
resolution of the grid, potentially leading to inaccuracies.
2. Memory consumption: Storing a grid representation of large environments or complex spaces can
require significant memory resources.
3. Grid size limitations: Grids may struggle to represent continuous or highly detailed environments
effectively, especially when finer granularity is needed.
4. Boundary effects: Algorithms operating on grid-based representations may encounter challenges
near grid boundaries, leading to edge effects or artifacts.
Applications
1. Robotics: Navigation and mapping using grid-based SLAM.
2. GIS: Analyzing geographic data like land use and elevation.
3. Computer Graphics: Image rendering and texture mapping.
4. Finite Element Analysis: Simulating structural behavior.
5. Image Processing: Segmentation and filtering.
6. Molecular Modelling: Analyzing molecular structures.
7. Weather Forecasting: Numerical weather prediction models.
8. Video Games: Pathfinding and collision detection.
Evaluation of Clustering
Evaluating the quality of clustering results is crucial to assess the performance of clustering algorithms and
determine their effectiveness in grouping data points. Several methods are used for evaluating clustering,
including:
1. Internal Evaluation Metrics: These metrics assess the quality of clustering based solely on the
input data and the clustering results. Examples include:
o Silhouette Score: Measures how similar an object is to its own cluster compared to other
clusters. Values range from -1 to 1, with higher values indicating better clustering.
o Davies–Bouldin Index: Computes the average similarity between each cluster and its most
similar cluster, where lower values indicate better clustering.
o Calinski-Harabasz Index: Ratio of the between-cluster dispersion and within-cluster
dispersion, with higher values indicating better-defined clusters.
2. External Evaluation Metrics: These metrics compare the clustering results with a ground truth
clustering, if available. Examples include:
o Adjusted Rand Index (ARI): Measures the similarity between two clusterings, where a
score of 1 indicates perfect clustering agreement.
o Normalized Mutual Information (NMI): Measures the mutual information between the
true and predicted clusterings, normalized to range between 0 and 1.
o Fowlkes-Mallows Index: Computes the geometric mean of the pairwise precision and
recall, where higher values indicate better clustering performance.
3. Visual Inspection: Visualization techniques, such as scatter plots, dendrograms, or t-SNE
embeddings, allow analysts to visually inspect clustering results and assess their quality.
4. Domain-Specific Measures: In some cases, domain-specific measures may be used to evaluate
clustering results. For example, in customer segmentation, metrics like customer retention or
revenue increase may be used to assess the effectiveness of clustering.
It's important to select appropriate evaluation metrics based on the characteristics of the data and
the goals of the clustering task. Additionally, using a combination of multiple evaluation methods provides
a more comprehensive understanding of clustering performance.
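The internal and external metrics listed above are all available in scikit-learn. The sketch below is illustrative only: it generates synthetic labelled data with make_blobs so that a ground truth exists, clusters it with K-means, and computes each metric.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score, adjusted_rand_score,
                             normalized_mutual_info_score, fowlkes_mallows_score)

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Internal metrics: use only the data and the predicted clustering.
print("Silhouette:        ", silhouette_score(X, y_pred))
print("Davies-Bouldin:    ", davies_bouldin_score(X, y_pred))
print("Calinski-Harabasz: ", calinski_harabasz_score(X, y_pred))

# External metrics: compare the predicted clustering with the ground truth.
print("ARI:               ", adjusted_rand_score(y_true, y_pred))
print("NMI:               ", normalized_mutual_info_score(y_true, y_pred))
print("Fowlkes-Mallows:   ", fowlkes_mallows_score(y_true, y_pred))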
Advantages:
1. Quantitative assessment aids algorithm selection.
2. Provides interpretability and insight into data grouping.
3. Guides optimization and fosters research development.
Disadvantages:
1. Dependency on data characteristics may lead to metric bias.
2. Sensitivity to noise and outliers can affect results.
3. Subjectivity in interpretation may arise from evaluation measures.
Applications
Evaluation of clustering is applied across various domains:
1. Marketing: Segmentation for targeted campaigns.
2. Biology: Grouping genes or proteins for analysis.
3. Image Analysis: Segmenting objects in images.
4. Anomaly Detection: Identifying outliers in data.
5. Text Mining: Clustering documents for topic modeling.
6. Customer Management: Segmentation for personalized services.
7. Social Network Analysis: Identifying communities in networks.
8. Healthcare: Stratifying patients for personalized treatment.
9. Supply Chain Management: Grouping products or suppliers for optimization.
10.Climate Science: Identifying weather patterns and climate regions.