
Introduction to clustering methods

Epid 814 - Marisa Eisenberg


Cluster Analysis

• What is a cluster?

• A set of objects/data points, such that the objects in the set are more similar to one another than they are to the objects outside the set/other clusters.
Cluster Analysis

• Broadly used in data analysis, including machine learning

• Clustering (unsupervised) vs. classification (supervised)

• Hard clustering (every element belongs to only one cluster) vs. fuzzy clustering (every element has a probability of belonging to each cluster)

• Some methods find the number of clusters; others use a predefined number of clusters
Cluster Analysis

• Wide range of methods; which is best depends on the data to be clustered. Not really one ‘best’ method across all settings.

• In general, we want:

• High intra-cluster similarity, low inter-cluster similarity (how to determine similarity?)

• Potential to discover hidden features (especially in high-dimensional data)
Some general classes (or clusters haha) of clustering methods:

• Partitioning methods (e.g. k-means clustering & other centroid methods)

• Hierarchical clustering methods

• Density-based methods

• Model or distribution-based methods (e.g. Gaussian mixture models, latent class analysis)

• Network clustering methods (community detection methods)

• & many others!


Partitioning methods

• General idea is often:

• Construct a partition of the data into k clusters

• Evaluate the resulting clusters and improve the partition

• Repeat until optimal partition/clusters found

• Examples: k-means, k-medoids, k-modes (among many others)
K-means clustering

• Select k centroids (means); each data point is assigned to the nearest centroid

• This partitions the space into Voronoi cells, which are our clusters

• For each cluster, calculate the centroid of all points

• These become the new cluster centroids

• Reassign points to the nearest centroid and repeat (a minimal code sketch follows below)
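Here is a minimal sketch of this loop in Python using scikit-learn's KMeans (assuming scikit-learn and NumPy are installed; the toy blobs below are made up purely for illustration, and k is set to 3 to match the example that follows):

# Minimal k-means sketch (one common library implementation of the idea above,
# not the code used for these slides).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy data: three blobs in 2D, invented for illustration
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# k must be specified up front; n_init restarts from several random centroid
# choices to reduce the chance of converging to a poor local optimum
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)     # cluster assignment for each data point
centers = km.cluster_centers_  # final centroids
print(centers)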


K-means clustering example
[Sequence of figures walking through k-means on 2D data:]

• Randomly choose 3 cluster centers to start

• The cluster centers partition the space based on which center is nearest; these are our starting clusters (a red, a blue, and a yellow cluster, each point assigned to its nearest center)

• Compute the means of the data points in each cluster; these are the new centers

• Reassign each data point to whichever center is now closest and redefine the clusters

• And repeat! Keep calculating the centers and redefining the clusters until they stop changing

• The results once the clusters and centers are fixed are your final k-means clusters
K-means clustering

• Relatively efficient

• Can converge to local optima (e.g. depending on starting points)

• Have to specify k (number of clusters)

• Cannot make clusters with non-convex shapes

• Tends toward equal-sized clusters

[Figure: By Chire - Own work, Public Domain, https://commons.wikimedia.org/w/index.php?curid=11765684]

• How to handle categorical data? (e.g. can use k-modes)


Hierarchical clustering methods

• Agglomerative approach to clustering

• Starts with small clusters (e.g. individual points) and then merges based on distance

• Divisive approach does the reverse (all one cluster, then split into smaller ones)

• Many different approaches with different distance measurements, etc.
Hierarchical clustering example

[Figure: five points A-E in the plane, with the corresponding dendrogram over leaves A B C D E]

• Start with all single-point clusters

• Merge the two nearest clusters; this forms a new cluster

• Merge the next two nearest clusters, etc.

• How to decide cluster distances? (What metric? Do we use nearest point distance, furthest, centroid?)

• Capture clusters as a dendrogram; can choose the resolution of clusters as desired (see the code sketch below)
https://cran.r-project.org/web/packages/dendextend/vignettes/Cluster_Analysis.html
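A minimal agglomerative clustering sketch in Python using SciPy (assuming SciPy and matplotlib are installed; the coordinates for points A-E are invented for illustration):

# Agglomerative hierarchical clustering sketch (one possible implementation of
# the idea on this slide; the five labeled points are made up).
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[0.0, 0.0],   # A
              [0.5, 0.2],   # B
              [3.0, 3.0],   # C
              [5.0, 0.0],   # D
              [5.5, 0.4]])  # E
labels = ["A", "B", "C", "D", "E"]

# 'single' (nearest point), 'complete' (furthest), 'centroid', and 'ward'
# correspond to different choices of cluster-to-cluster distance
Z = linkage(X, method="single")

dendrogram(Z, labels=labels)  # visualize the merge tree
plt.show()

# Cut the dendrogram at a chosen resolution, e.g. 2 clusters
assignments = fcluster(Z, t=2, criterion="maxclust")
print(dict(zip(labels, assignments)))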
Hierarchical clustering

• Slow for larger data sets

• Useful for finding substructures/subclusters in data

• Assumes every data point is relevant/part of the clusters

• How to choose level of granularity?


Density-based clustering

• Decides clusters based on density of points

• Not every point need be assigned to a cluster; some can be considered noise or outliers

• One of the most commonly used algorithms is DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN

• Choose a radius r and a minimum number of points m

• Classify each point as a:

• Core point - has at least m other points within radius r

• Border point - does not have m points within radius r, but is reachable from a core point p, i.e. can be connected to data point p by a chain of core points, each within radius r of the next

• Outlier - neither core nor border (see the code sketch below)
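A minimal sketch using scikit-learn's DBSCAN (assuming scikit-learn is installed; the toy data are made up, eps plays the role of the radius r, and min_samples plays the role of m, though scikit-learn counts the point itself toward min_samples):

# DBSCAN sketch on invented data: two dense blobs plus scattered outliers.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(40, 2)),
    rng.normal(loc=(4, 4), scale=0.3, size=(40, 2)),
    rng.uniform(low=-2, high=6, size=(5, 2)),
])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = db.labels_  # cluster label per point; -1 marks noise/outliers
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters, "clusters,", int(np.sum(labels == -1)), "outliers")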


DBSCAN example

[Sequence of figures stepping through DBSCAN on a 2D example with m = 3 and a fixed radius: each point with at least m neighbors within the radius is marked as a core point; points directly reachable from a core point are added to its cluster; chains of core points extend each cluster out to its border points; points that are neither core nor border are left as outliers. The final result is two clusters (Cluster 1 and Cluster 2) plus an outlier point.]
DBSCAN

https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
DBSCAN

• Can find non-convex clusters

• Automatically determines number of clusters needed

• Not every point goes into a cluster (handles outliers/noise; however, this can be a drawback if you want to assign all points to a cluster)

• Tends to find/work best with clusters of similar density

• How to choose the radius & min points? There are rules of thumb, but it can be tricky! Often use min points = 2 × dim; for the radius, one can look for the elbow of a k-distance plot, but that is harder to pin down (see the sketch below)
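One way to sketch the k-distance heuristic from the last bullet (a common rule of thumb rather than anything prescribed in these slides; the data and parameter choices below are illustrative only):

# k-distance "elbow" plot for choosing the DBSCAN radius (eps).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=(0, 0), scale=0.3, size=(40, 2)),
               rng.normal(loc=(4, 4), scale=0.3, size=(40, 2))])

dim = X.shape[1]
min_pts = 2 * dim  # rule-of-thumb minimum number of points

# Distance from each point to its min_pts-th nearest neighbor
# (n_neighbors includes the query point itself, hence the +1)
nn = NearestNeighbors(n_neighbors=min_pts + 1).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# The "elbow" of the sorted curve suggests a candidate radius
plt.plot(k_dist)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {min_pts}th nearest neighbor")
plt.show()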
Model based methods: Gaussian Mixture Models

• Assumes the data points come from a combination of multivariate gaussians

• This seems restrictive but is often no more so than other methods (e.g. k-means in some sense assumes a centroid and resulting Voronoi diagram govern the data)

• Each data point has a probability of belonging to each cluster

• Often fit via expectation maximization (a type of maximum likelihood approach)
Model based methods: Gaussian Mixture Models

• Select number of clusters (number of gaussians to fit)

• Randomly initialize them (or better yet, use a method to pick a good starting guess)

• Compute the probability that each data point is in each cluster (based on the value of the gaussian at that point)

• Compute new parameters (µ, σ) for each gaussian that maximize this probability

• Repeat the last two steps until convergence (a minimal sketch follows below)
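As one concrete option, scikit-learn's GaussianMixture implements this EM loop; a minimal sketch on made-up data (assuming scikit-learn and NumPy are installed) might look like:

# Gaussian mixture model fit by expectation maximization (library version of
# the loop above, on invented data).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(loc=(0, 0), scale=0.5, size=(60, 2)),
               rng.normal(loc=(4, 4), scale=1.0, size=(60, 2))])

# n_components = number of gaussians to fit; n_init restarts EM from several
# initializations to avoid poor local optima
gmm = GaussianMixture(n_components=2, n_init=5, random_state=0).fit(X)

hard_labels = gmm.predict(X)        # most likely cluster for each point
soft_labels = gmm.predict_proba(X)  # probability of belonging to each cluster
print(gmm.means_)                   # fitted means (µ) of each gaussian
print(gmm.covariances_.shape)       # fitted covariance matrices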


Model based methods: Gaussian Mixture Models

https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
Network methods: modularity maximization

• Community (cluster) detection approach for networks

• Looks for groups of nodes that have more within-group edges than would be expected from a random graph with the same degree for each node (see the sketch below)

Modularity and community structure in networks. M. E. J. Newman. PNAS 2006, 103 (23) 8577-8582; DOI: 10.1073/pnas.0601602103
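A minimal sketch with networkx (assuming it is installed), using greedy modularity maximization on a built-in example network (the karate club graph) rather than data from these slides:

# Modularity-based community detection: one of several algorithms in this family.
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()  # classic small social network example

communities = community.greedy_modularity_communities(G)
for i, c in enumerate(communities):
    print(f"community {i}: {sorted(c)}")

# Modularity score of the detected partition
print("modularity:", community.modularity(G, communities))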
Network methods: assortativity

• Not really for cluster (community) detection, so much as for evaluating how clustered a given property is on the network

• Often look at clustering of degree, but can be other properties (e.g. how is the network clustered by gender, vaccination, smoking behaviors, etc.)

• For degree, the assortativity coefficient is the Pearson correlation coefficient of the degrees of pairs of connected nodes, taken over the whole network

• For attribute assortativity, the assortativity coefficient can be interpreted as similar to an intraclass correlation coefficient (see the sketch below)
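Both coefficients are available directly in networkx; a minimal sketch on a built-in example graph, where the node attribute 'club' stands in for a property such as drinking status (this is not the dissertation data shown in the figure below):

# Degree and attribute assortativity with networkx.
import networkx as nx

G = nx.karate_club_graph()

# Degree assortativity: correlation of the degrees of connected node pairs
print("degree assortativity:", nx.degree_assortativity_coefficient(G))

# Attribute assortativity for a categorical node attribute ('club' here)
print("attribute assortativity:", nx.attribute_assortativity_coefficient(G, "club"))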
[Figure: two network panels. A) Drinker (N=155), Non-drinker (N=268), No data (N=167); B) Drinker (N=155), Non-drinker (N=268). Ali Walsh Dissertation, 2019 (assortativity 0.2)]


Clustering methods

• Many different approaches! These are just a few examples

• Different methods behave better/worse on different data sets

• Testing how well a clustering method behaves can be difficult, especially in high dimensions and/or without ground truth information
https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
Resources
• https://en.wikipedia.org/wiki/Cluster_analysis

• https://en.wikipedia.org/wiki/DBSCAN

• https://medium.com/predict/three-popular-clustering-methods-and-when-to-use-each-4227c80ba2b6

• https://blog.dominodatalab.com/topology-and-density-based-clustering/

• https://shapeofdata.wordpress.com/2014/03/04/k-modes/

• https://towardsdatascience.com/the-5-clustering-algorithms-data-scientists-need-to-know-a36d136ef68
