ML 07 Clustering
Machine Learning
Clustering
Some material borrowed from course materials of Andrew Ng and Jing Gao
Unsupervised learning
• Given a set of unlabeled data points / items
• Find patterns or structure in the data
• Clustering: automatically group the data points / items
into groups of ‘similar’ or ‘related’ points
• Main challenges
– How to measure similarity?
– What is the ideal number of clusters? A few larger clusters, or
a larger number of smaller clusters?
Motivations for Clustering
• Understanding the data better
– Grouping Web search results into clusters, each of which
captures a particular aspect of the query
– Segment the market or customers of a service
• As a precursor to some other application
– Summarization and data compression
– Recommendation
Different types of clustering
• Partitional
– Divide set of items into non-overlapping subsets
– Each item will be member of one subset
• Overlapping
– Divide set of items into potentially overlapping subsets
– Each item can simultaneously belong to multiple subsets
Different types of clustering
• Fuzzy
– Every item belongs to every cluster with a membership
weight between 0 (absolutely does not belong) and 1
(absolutely belongs)
– Usual constraint: sum of weights for each individual item
should be 1
– Convert to partitional clustering: assign every item to that
cluster for which its membership weight is highest
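This conversion is just a row-wise arg-max over the membership weights. A minimal sketch (the membership values below are made up purely for illustration):
```python
import numpy as np

# Hypothetical fuzzy membership matrix: one row per item, one column per
# cluster; each row sums to 1 (the usual constraint).
memberships = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.4, 0.5, 0.1],
])

# Partitional version: assign each item to the cluster with the highest weight
hard_labels = memberships.argmax(axis=1)
print(hard_labels)   # [0 2 1]
```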
Different types of clustering
• Hierarchical
– Set of nested clusters, where one larger cluster can contain
smaller clusters
– Organized as a tree (dendrogram): leaf nodes are singleton
clusters containing individual items, each intermediate
node is union of its children sub-clusters
– A sequence of partitional clusterings – cut the dendrogram
at a certain level to get a partitional clustering
An example dendrogram
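As an illustration of cutting a dendrogram to obtain a partitional clustering, here is a small sketch using SciPy's hierarchical clustering utilities (the data is random and only meant to show the idea):
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 2))            # 20 items with 2 features

# Build the hierarchy (nested clusters); leaves are the individual items
Z = linkage(X, method='average')

# "Cut" the dendrogram so that we get a flat partition into 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)                           # cluster index (1..3) for each item
```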
Different types of clustering
• Complete vs. partial
– A complete clustering assigns every item to one or more
clusters
– A partial clustering may not assign some items to any
cluster (e.g., outliers, items that are not sufficiently similar
to any other item)
Types of clustering methods
• Prototype-based
– Each cluster defined by a prototype (centroid or medoid),
i.e., the most representative point in the cluster
– A cluster is a set of items such that each item is closer (more
similar) to the prototype of its own cluster than to the
prototype of any other cluster
– Example method: K-means
Types of clustering methods
• Density-based
– Assumes items distributed in a space where ‘similar’ items
are placed close to each other (e.g., feature space)
– A cluster is a dense region of items that is surrounded by a
region of low density
– Example method: DBSCAN
Types of clustering methods
• Graph-based
– Assumes items represented as a graph/network where
items are nodes, and ‘similar’ items are linked via edges
– A cluster is a group of nodes having more and / or better
connections among its members, than between its
members and the rest of the network
– Also called ‘community structure’ in networks
– Example method: Algorithm by Girvan and Newman
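For instance, the Girvan–Newman algorithm is available in NetworkX; a small sketch on a standard toy network (the choice of network is only for illustration):
```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()              # a classic small social network

# Girvan–Newman repeatedly removes the edge with the highest betweenness;
# each item from the iterator is one level of the resulting hierarchy
splits = girvan_newman(G)
first_level = next(splits)              # first split into communities
print([sorted(c) for c in first_level])
```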
We are applying clustering
in this lecture itself.
How?
K-means clustering
K-means
• Prototype-based, partitioning technique
• Finds a user-specified number of clusters (K)
• Each cluster is represented by its centroid (the mean of its points)
K-means algorithm
Given: K (number of clusters), data points x^(1), ..., x^(m)
Randomly initialize K cluster centroids μ_1, ..., μ_K
Repeat {
    // Cluster assignment step
    for i = 1 to m
        c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i)
    // Move centroid step
    for k = 1 to K
        μ_k := average (mean) of the points assigned to cluster k
}
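A minimal NumPy sketch of these two alternating steps (the function name, stopping rule, and defaults are illustrative, not part of the slides):
```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Basic K-means: X is an (m, n) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Randomly initialize centroids as K distinct data points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment step: index of the closest centroid for each point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move centroid step: each centroid becomes the mean of its points
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break                          # assignments have stabilized
        centroids = new_centroids
    return labels, centroids
```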
Optimization in K-means
• Consider data points in Euclidean space
• A measure of cluster quality: Sum of Squared Error (SSE)
– Error of each data point: Euclidean distance of the point to its
closest centroid
– SSE: total sum of the squared error for each point
– Will be minimized if the centroid of a cluster is the mean of all
data points in that cluster
• The two steps of K-means each reduce the SSE, so the algorithm converges to a local minimum of the SSE (not necessarily the global minimum)
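In symbols, SSE = Σ_k Σ_{x in C_k} dist(x, μ_k)^2, where μ_k is the centroid of cluster C_k. A small sketch of the same quantity, reusing the labels and centroids produced by the kmeans() sketch above:
```python
import numpy as np

def sse(X, labels, centroids):
    # Squared Euclidean distance of every point to its assigned centroid, summed
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

# Usage (illustrative):
# labels, centroids = kmeans(X, K=3)
# print(sse(X, labels, centroids))
```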
Choosing value of K
• Based on domain knowledge about suitable number of
clusters for a particular problem domain
DBSCAN clustering
ε-Neighborhood and density
• ε-Neighborhood of a point p: the set of points within distance ε of p
• Example with two points p and q (MinPts = 4):
– Density of p is “high”: its ε-Neighborhood contains at least MinPts points
– Density of q is “low”: its ε-Neighborhood contains fewer than MinPts points
Divide points into three types
• Core point: a point that has more than a specified number of
points (MinPts) within its ε-Neighborhood (core points lie in
the interior of a cluster)
• Border point: a point that has fewer than MinPts points within its
ε-Neighborhood (so it is not a core point), but falls within the
ε-Neighborhood of some core point
• Outlier (noise) point: any point that is neither a core point nor a
border point
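A small sketch that labels points with these three types from pairwise distances (here a point is counted in its own ε-Neighborhood, which is one common convention; the function name is illustrative):
```python
import numpy as np

def classify_points(X, eps, min_pts):
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    in_neighborhood = dists <= eps                  # ε-Neighborhood membership
    is_core = in_neighborhood.sum(axis=1) >= min_pts

    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append('core')
        elif np.any(in_neighborhood[i] & is_core):  # within ε of some core point
            labels.append('border')
        else:
            labels.append('outlier')
    return labels
```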
Density-Reachability
• Directly density-reachable: a point q is directly density-reachable
from a point p if p is a core point and q is in p’s ε-Neighborhood
– Example (MinPts = 4): q is directly density-reachable from p, but p is
not directly density-reachable from q, because q is not a core point
– So density-reachability is not symmetric
Density-Reachability
• Density-reachability can be direct or indirect
– Point p is directly density-reachable from p2
– p2 is directly density-reachable from p1
– p1 is directly density-reachable from q
– p ← p2 ← p1 ← q form a chain
– Example (MinPts = 7): p is (indirectly) density-reachable from q,
but q is not density-reachable from p
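A sketch of collecting all points density-reachable from a given point by following chains of directly density-reachable points (the function name and index-based interface are illustrative; it assumes the starting point is a core point):
```python
import numpy as np
from collections import deque

def density_reachable_set(X, p_idx, eps, min_pts):
    # Pairwise distances, ε-Neighborhoods, and core-point flags
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    in_neighborhood = dists <= eps
    is_core = in_neighborhood.sum(axis=1) >= min_pts

    reached = {p_idx}
    queue = deque([p_idx])
    while queue:
        i = queue.popleft()
        if not is_core[i]:
            continue                       # only core points extend the chain
        for j in np.flatnonzero(in_neighborhood[i]):
            if j not in reached:           # j is directly density-reachable from i
                reached.add(j)
                queue.append(j)
    return reached
```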
DBSCAN algorithm
Input: The data set D
Parameters: ε, MinPts
for each point p in D
    if p is a core point and not processed then
        C = {all points density-reachable from p}
        mark all points in C as processed
        report C as a cluster
    else
        mark p as outlier
    end if
end for
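For comparison, the same idea is available off the shelf; a small sketch using scikit-learn's DBSCAN on made-up data (eps corresponds to ε and min_samples to MinPts):
```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered points (illustrative data only)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=3.0, scale=0.3, size=(50, 2)),
    rng.uniform(low=-2, high=5, size=(5, 2)),
])

db = DBSCAN(eps=0.5, min_samples=4).fit(X)
print(db.labels_)   # cluster index per point; -1 marks points left as outliers
```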
Understanding the algorithm
• Arbitrarily select a point p
• Continue the process until all points have been processed
(each point gets marked as core, border, or outlier)
When DBSCAN works well