L07 Clustering algorithms
What is unsupervised learning?
● An area of machine learning that deals with methods for analysing and clustering datasets without explicit classifications.
● Operates on unlabeled data, independently discovering underlying patterns and insights without the need for human intervention (IBM, 2023).
[Diagram: unsupervised learning techniques — K-means clustering, Density-Based Spatial Clustering of Applications with Noise (DBSCAN), OPTICS, Gaussian Mixture Model (GMM), Hierarchical Clustering, Latent Dirichlet Allocation (LDA)]
Unsupervised Learning Techniques
I. K-Means Clustering - Popular for grouping based on similarities.
II. DBSCAN - Groups closely packed points; effective for various shapes and sizes.
III. Gaussian Mixture Model - Assumes data come from a finite mixture of Gaussian distributions.
IV. Mean Shift - Iteratively shifts points towards dense regions.
V. Spectral Clustering - Effective for identifying non-globular clusters.
VI. BIRCH - Summarises large datasets hierarchically before clustering.
K-MEANS CLUSTERING PROCESSES
1) Determine K, the number of clusters.
2) Randomly choose K data points (seeds) to be the initial centroids (cluster centers).
3) Assign each data point to the closest centroid.
4) Re-compute the centroids using the current cluster memberships.
5) If a convergence criterion is not met, go to step 3. Otherwise, stop when the centroids don't change.
Steps of K-Means Clustering
Initialization: Start by deciding the number (k) of clusters to create. Initialize k centroids randomly.
Assignment: Assign each data point to the nearest centroid, forming k clusters.
Update: Recalculate the centroids as the center of the clusters.
Repeat: Continue the assignment and update steps until the centroids do not change significantly, indicating that the clusters are stable.
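The steps above translate almost line-for-line into code. Below is a minimal NumPy sketch of this loop (the `kmeans` function name, toy data, and tolerance are illustrative assumptions, not from a particular library):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: decide K and pick k random data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its closest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: re-compute centroids from the current memberships
        # (an empty cluster keeps its old centroid)
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        # Step 5: stop once the centroids no longer change significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            return labels, new_centroids
        centroids = new_centroids
    return labels, centroids

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
```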
K-MEANS CLUSTERING EVALUATION METRICS
Elbow Method:
Plots explained variation as a function of the number of clusters.
Identifies the 'elbow point' in the curve as the optimal number of clusters.
Adds clusters until additional clusters do not significantly improve the model.
Silhouette Coefficient:
Measures how similar an object is to its own cluster compared to other clusters.
Calculated as: s = (b − a) / max(a, b), where:
a: mean intra-cluster distance.
b: mean nearest-cluster distance.
Ranges from −1 to 1, where higher values indicate better-matched objects within clusters.
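As a hedged illustration, both metrics can be computed with scikit-learn (assuming it is available; the blob dataset and the range of k are arbitrary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares: plotted against k it
    # gives the elbow curve; silhouette_score averages (b - a) / max(a, b)
    print(k, round(km.inertia_, 1), round(silhouette_score(X, km.labels_), 3))
```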
K-MEANS CLUSTERING APPLICATIONS
Human Resource Information System:
Study on a K-means clustering algorithm based on the Spark platform.
Clusters employees by their characteristics for efficient human resources recommendations.
Enables personalized talent management and better understanding of employee behavior
and preferences.
Ref: https://shorturl.at/Q57uS
K-means Clustering
Pros:
Simple and easy to implement.
Can efficiently handle large datasets with many variables and observations.
Cons:
The user must specify the number of clusters (K) in advance, which can be challenging when the optimal number of clusters is unknown.
Outliers can significantly affect the centroids and cluster assignments, potentially leading to inaccurate clustering results.
DBSCAN
Definition
Overview of DBSCAN:
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a prominent clustering algorithm.
Notable for identifying clusters of arbitrary shapes and sizes.
Effectively handles noise and outliers.
Operational Parameters:
Eps (Epsilon): Defines the radius of the neighborhood around a point.
MinPts (Minimum Points): Specifies the minimum number of points required to form a dense region.
Key Features:
Does not require the number of clusters to be specified in advance.
Capable of discovering clusters with varied shapes and densities.
Points in sparse areas are classified as noise.
DBSCAN
Process
DBSCAN Clustering Steps:
Identification of Core Points:
Determines if each point has a minimum number of neighbors within a given distance (Eps).
Core points are identified as the starting points for cluster formation.
Cluster Expansion:
Recursively connects all directly reachable points from each core point.
Continues to expand the cluster by aggregating all connected points.
Handling Noise:
Points that are not reachable from any core point are labeled as noise.
Effectively separates outliers from main clusters.
Iteration and Result:
Iterates the process until all points are either assigned to clusters or labeled as noise.
Results in distinct clusters that capture the dense regions of the dataset.
Step of DBSCAN
Core Points: In a given dataset, classify each point as a core point, border point, or noise point, based on the number of points within a given radius (ε).
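A small scikit-learn sketch of this classification (the eps/min_samples values and the two-moons toy data are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # Eps and MinPts from the slides
labels = db.labels_                          # a label of -1 marks noise points
core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True    # core points vs. border/noise points
```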
Steps of the Gaussian Mixture Model (EM algorithm)
Expectation (E-step): For each point, compute the probability that it belongs to each cluster (Gaussian distribution).
Maximization (M-step): Update the parameters of the Gaussians to maximize the likelihood of the data points.
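In scikit-learn both steps run inside `fit()`; a minimal sketch (toy data assumed) exposing the E-step's soft memberships and the M-step's fitted parameters:

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)  # EM iterates inside fit()
probs = gmm.predict_proba(X)   # E-step style output: soft membership per Gaussian
print(gmm.means_)              # M-step output: fitted component means
```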
GAUSSIAN MIXTURE MODEL
Evaluation Metrics
Log-Likelihood:
Higher values suggest better data fit by the model.
Bayesian Information Criterion (BIC) and Akaike Information Criterion (AIC):
Lower values indicate superior model fit while considering model complexity.
Classification Accuracy:
Measures accuracy of cluster assignments if true labels are available.
Flexibility and Capabilities of GMM:
Flexible cluster shapes based on Gaussian distributions.
Capable of identifying complex data structures with appropriate cluster number and initialization.
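A hedged sketch of these metrics in scikit-learn (toy data; the candidate range of components is arbitrary):

```python
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=400, centers=4, random_state=0)
for n in range(1, 8):
    g = GaussianMixture(n_components=n, random_state=0).fit(X)
    # lower BIC/AIC is better; score() is the average log-likelihood per sample
    print(n, round(g.bic(X), 1), round(g.aic(X), 1), round(g.score(X), 3))
```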
GAUSSIAN MIXTURE MODEL
Application
GMM in Speech Emotion Recognition:
Utilized to extract emotional states from speech signal datasets.
Aims for high accuracy in detecting emotions like anger, calmness, fear, happiness, and sadness.
Significant implications for improving human-machine interactions.
Enhances applications in healthcare, education, marketing, and advertising.
GMM in Personality Traits and Physiological Responses:
Employed to automatically cluster individuals based on personality traits and electrocardiogram responses during stress recovery.
Revealed associations between personality traits (e.g., neuroticism, extraversion) and physiological responses (e.g., electrocardiogram, salivary cortisol).
Highlights the utility of GMM in understanding the relationship between personality and physiological stress manifestations.
Gaussian Mixture Models
Pros:
Can capture complex data distributions by modeling them as a combination of multiple Gaussian distributions.
Soft clustering: data points are assigned probabilities of belonging to each cluster.
Accommodates different cluster shapes and sizes.
Cons:
Training GMMs involves estimating parameters such as means, covariances, and mixture weights, which can be computationally expensive, especially for high-dimensional data or large datasets.
Prone to overfitting, requiring careful regularization.
Performance can be sensitive to the initialization of parameters, leading to suboptimal solutions or convergence to local optima.
MEANSHIFT CLUSTERING
Definition
Non-parametric clustering technique for data analysis and image processing.
Identifies dense areas of data points iteratively.
Shifts each point towards the densest area in its vicinity.
Iteratively updates point locations by computing means within a specified region (bandwidth or radius).
Continues shifting until convergence, where points cease significant movement, defining cluster centers.
MEANSHIFT CLUSTERING
Process
Steps of Mean Shift Clustering:
Initialization:
Set a bandwidth parameter to determine neighborhood size.
Mean Computation:
For each data point, compute the mean within its neighborhood. Shift the point to this mean.
Iteration:
Repeat the process iteratively until convergence.
Convergence occurs when points stop moving significantly or meet a predefined threshold.
Cluster Formation:
Form clusters based on the proximity of points to each other after convergence.
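These four steps map onto scikit-learn's MeanShift; a minimal sketch (the bandwidth quantile and toy blobs are illustrative assumptions):

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
bw = estimate_bandwidth(X, quantile=0.2)   # initialization: choose neighborhood size
ms = MeanShift(bandwidth=bw).fit(X)        # mean computation + iteration to convergence
print(ms.cluster_centers_)                 # cluster formation: the converged modes
```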
MEANSHIFT CLUSTERING
Evaluation Metrics
Convergence Time:
Reflects how swiftly the algorithm converges.
Depends on initial data distribution and chosen bandwidth.
Cluster Cohesion and Separation:
Evaluates effectiveness in forming distinct and coherent clusters.
Assessed using metrics like silhouette score.
Robustness to Noise:
Mean Shift is robust to noise and outliers.
Naturally gravitates towards high-density regions, largely ignoring sparse outliers.
MEANSHIFT CLUSTERING
Application
Online Personality Traits Mining:
Mean Shift Clustering applied for constructing a 14-cluster personality traits model.
Utilizes online user text features and behavioral characteristics.
Offers a scalable and objective method for mining personality traits online.
Applicable in various domains, including online learning.
Social Media Analysis:
Mean Shift Clustering used for content clustering and classification in social media analysis.
Can be extended to understand personality traits by analyzing social media posts.
Clusters based on language, sentiment, or behavior.
Enables identification of patterns and relationships between personality traits and online behaviors.
Facilitates more effective marketing, social influence, and mental health monitoring.
SPECTRAL CLUSTERING
Definition
Technique utilizing eigenvalues of the similarity matrix for dimensionality reduction before clustering.
Treats clustering as a graph-partitioning problem.
Effective in identifying non-globular clusters.
Capable of discovering clusters with complex shapes.
SPECTRAL CLUSTERING
Process
Steps of Spectral Clustering:
Represent data points as nodes in a graph.
Establish edge weights based on similarity between nodes.
Construct an adjacency matrix reflecting these weights.
Formulate a Laplacian matrix from the adjacency matrix.
Perform eigenvalue decomposition of the Laplacian matrix.
Obtain eigenvectors defining a reduced space.
Cluster data in this reduced space using a conventional algorithm like k-means.
Steps of Spectral Clustering
1. Similarity Graph: Build a similarity graph among all data points, typically using a measure like Gaussian (radial basis function) similarity.
2. Laplacian Matrix: Compute the Laplacian matrix from the similarity graph.
3. Eigenvalue Decomposition: Compute the eigenvalues and eigenvectors of the Laplacian matrix.
4. K-means on Eigenvectors: Use the eigenvectors corresponding to the k smallest non-zero eigenvalues to embed the data points into a lower-dimensional space, and then apply K-means clustering to cluster these points.
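A compact scikit-learn sketch of this pipeline (the gamma value and two-moons data are illustrative assumptions):

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
# affinity='rbf' builds the Gaussian similarity graph (step 1); the Laplacian,
# its eigen-decomposition, and k-means on the embedding (steps 2-4) all run
# inside fit_predict()
sc = SpectralClustering(n_clusters=2, affinity='rbf', gamma=20, random_state=0)
labels = sc.fit_predict(X)
```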
SPECTRAL CLUSTERING
Evaluation Metrics
Normalized Mutual Information (NMI):
Evaluates clustering quality by comparing predicted clusters to ground truth.
Accounts for potential information gain or loss through clustering.
Running Time:
Measures algorithm efficiency, particularly important due to computational intensity.
Eigenvalue decomposition and similarity matrix construction can be resource-intensive.
• Affinity Matrix (A), Degree Matrix (D), Graph Laplacian Matrix (L = D − A in its unnormalized form)
SPECTRAL CLUSTERING
Application
Personality Traits and Body Mass Index (BMI):
Utilized Spectral Clustering to identify personality trait clusters associated with BMI.
Revealed 14 trait clusters demonstrating well-established associations between personality traits and BMI.
Personality Types Revisited:
Algorithmic approach applied to Big Five traits dataset.
Resulted in a five-cluster solution: resilient, overcontroller, undercontroller, reserved, and vulnerable-resilient.
Provides insights into various personality prototypes based on Big Five traits.
Spectral Clustering
Pros:
Captures complex cluster structures.
Handles non-linear decision boundaries.
Works well for data with irregular shapes or clusters of varying densities.
Cons:
Sensitivity to parameter choices.
Computational complexity.
Scalability issues for large datasets.
BIRCH
● Aka “Balanced Iterative Reducing and Clustering using Hierarchies”
● Handles large datasets
a. Creates a condensed summary of the dataset
b. Clusters the summary.
● 4 Phases:
a. Loading
b. Optional Condensing
c. Global Clustering
d. Optional Refining
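A brief scikit-learn sketch of these phases (parameter values are illustrative assumptions):

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)
# threshold and branching_factor shape the condensed CF-tree summary
# (loading/condensing phases); n_clusters runs the global clustering phase
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=5).fit(X)
labels = birch.predict(X)
```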
Affinity Propagation
Introduction
● Clusters are automatically identified without specifying the number of clusters in advance.
● Utilizes message passing between data points.
Key Steps
1. Similarity Calculation
2. Responsibility Calculation
3. Availability Calculation
4. Iterative Update
5. Net Responsibility Calculation
6. Exemplar Selection
7. Cluster Assignment
Matrices
A. Similarity Matrix (S)
B. Responsibility Matrix (R)
C. Availability Matrix (A)
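A minimal scikit-learn sketch covering the key steps above (toy data assumed; the message-passing iterations happen inside `fit()`):

```python
from sklearn.cluster import AffinityPropagation
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
ap = AffinityPropagation(random_state=0).fit(X)  # responsibility/availability updates
exemplars = ap.cluster_centers_indices_          # exemplar selection
labels = ap.labels_                              # cluster assignment
```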
Hierarchical Clustering
Hierarchical clustering builds a tree-like hierarchy of clusters by recursively merging or splitting clusters based on their similarities or dissimilarities.
1. Initialisation
2. Distance Matrix Calculation
3. Merge Clusters
4. Update Distance Matrix
5. Repeat Steps 3-4
6. Dendrogram Construction
7. Cluster Selection
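A short SciPy sketch of these steps (random toy data; Ward linkage and the cut into 3 clusters are illustrative assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.default_rng(0).random((30, 2))
Z = linkage(X, method='ward')                    # steps 2-5: distances + iterative merging
labels = fcluster(Z, t=3, criterion='maxclust')  # step 7: cut the tree into 3 clusters
# dendrogram(Z) draws the tree from step 6 (needs matplotlib)
```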
Hierarchical Clustering
Pros:
Produces dendrogram trees that visually represent the clustering process, making it easier to interpret and understand the relationships between clusters.
No need to specify the number of clusters in advance.
It can identify nested clusters, which is useful when the data has a hierarchical structure or when there are meaningful subgroups within larger clusters.
Cons:
Computationally expensive, especially for large datasets, as the algorithm's time complexity is O(n² log n).
Sensitive to noise and outliers.
Once the clustering process is completed, it's challenging to modify the hierarchy without rerunning the algorithm from scratch.
Fuzzy C-means (FCM)
Is a variant of the traditional K-means clustering algorithm.
Is soft clustering: it assigns each data point a membership degree between 0 and 1 for each cluster. (K-means is hard clustering, which assigns each data point to a single cluster.)
1. Initialisation
2. Membership Degree Calculation
3. Cluster Center Update
4. Convergence Check
5. Iteration
6. Result Interpretation
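Since FCM is not part of scikit-learn's core, here is a minimal NumPy sketch of the loop above (the `fuzzy_cmeans` name, random-membership initialisation, and fuzzifier m=2 are illustrative assumptions):

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, max_iter=100, tol=1e-5, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialisation: random membership degrees, each row sums to 1
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # 3. Cluster center update, weighted by memberships raised to m
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # 2. Membership degree calculation from distances to the centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        # 4-5. Convergence check; otherwise keep iterating
        if np.abs(U_new - U).max() < tol:
            return U_new, centers
        U = U_new
    return U, centers
```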
Fuzzy C-means (FCM)
Pros:
Ability to handle overlapping clusters: FCM allows data points to belong to multiple clusters simultaneously, providing more flexibility in representing complex data patterns.
Can handle noisy data and outliers better than traditional hard clustering algorithms.
Cons:
Sensitive to the initial selection of cluster centroids.
Difficulty in determining the number of clusters.
Speed: FCM is slower than K-means, as each point is evaluated against each cluster and more operations are involved in each evaluation, whereas K-means only calculates distances.
Latent Dirichlet Allocation (LDA)
● Kretinin, M., & Nguyen, G. (2022). Topic Modeling on News Articles using Latent Dirichlet Allocation. In 2022 IEEE 26th International Conference on Intelligent Engineering Systems (INES). IEEE. https://doi.org/10.1109/ines56734.2022.9922609
● Zhao, L., Zhao, Q., & Wang, Y. (2020). Research on Chinese Movie Reviews Based on Latent Dirichlet Allocation Topic Model. In 2020 2nd International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI). IEEE. https://doi.org/10.1109/mlbdbi51377.2020.00016
● Karmakar, S., Sivakumar, N., & Pillai, A. S. (2023). Exploring Satisfaction Level of Customers in Restaurants by Using Latent Dirichlet Allocation (LDA) Algorithm. In 2023 International Conference on Inventive Computation Technologies (ICICT). IEEE. https://doi.org/10.1109/icict57646.2023.10134169
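To connect the cited topic-modeling applications to code, here is a hedged scikit-learn sketch of LDA (the toy documents and the two-topic choice are assumptions for illustration only):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie plot was thrilling and the acting superb",
        "great food and friendly service at this restaurant",
        "the film score and cinematography were stunning",
        "the restaurant menu was varied and the food fresh"]
counts = CountVectorizer(stop_words='english').fit_transform(docs)  # bag-of-words
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
doc_topics = lda.transform(counts)   # per-document topic mixtures
```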
Latent Semantic Analysis (LSA)
Pipeline: Representation in Semantic Space (Matrix) + Dimensionality Reduction (SVD) + Clustering (sketched in code after the references below).
Applications: Topic Modelling on User Reviews in an E-Commerce Platform; Topic Modelling on Perception towards Government.
● Chehal, D., Gupta, P., & Gulati, P. (2020). RETRACTED ARTICLE: Implementation and comparison of topic modeling techniques based on user reviews in e-commerce recommendations. Journal of Ambient Intelligence and Humanized Computing, 12(5), 5055–5070. Springer Science and Business Media LLC. https://doi.org/10.1007/s12652-020-01956-6
● Qomariyah, S., Iriawan, N., & Fithriasari, K. (2019). Topic modeling Twitter data using Latent Dirichlet Allocation and Latent Semantic Analysis. In AIP Conference Proceedings: The 2nd International Conference on Science, Mathematics, Environment, and Education. AIP Publishing. https://doi.org/10.1063/1.5139825a
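The Matrix + SVD + Clustering pipeline above can be sketched with scikit-learn as follows (the toy review documents and the two-component embedding are illustrative assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = ["fast shipping and great packaging",
        "delivery was slow and the package arrived late",
        "excellent product quality, highly recommended",
        "poor quality, the item broke after a week"]
X = TfidfVectorizer().fit_transform(docs)           # representation in semantic space
Z = TruncatedSVD(n_components=2).fit_transform(X)   # dimensionality reduction (SVD)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)  # clustering
```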