ML 07 Clustering
Machine Learning
Clustering
Some material borrowed from course materials of Andrew Ng and Jing Gao
Unsupervised learning
• Given a set of unlabeled data points / items
• Find patterns or structure in the data
• Clustering: automatically group the data points / items
into groups of ‘similar’ or ‘related’ points
• Main challenges
– How to measure similarity?
– What is the ideal number of clusters? A few larger clusters, or
a larger number of smaller clusters?
Motivations for Clustering
• Understanding the data better
– Grouping Web search results into clusters, each of which
captures a particular aspect of the query
– Segment the market or customers of a service
• As a precursor to some other application
– Summarization and data compression
– Recommendation
Different types of clustering
• Partitional
– Divide set of items into non-overlapping subsets
– Each item will be member of one subset
• Overlapping
– Divide set of items into potentially overlapping subsets
– Each item can simultaneously belong to multiple subsets
Different types of clustering
• Fuzzy
– Every item belongs to every cluster with a membership
weight between 0 (absolutely does not belong) and 1
(absolutely belongs)
– Usual constraint: sum of weights for each individual item
should be 1
– Convert to partitional clustering: assign every item to that
cluster for which its membership weight is highest
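This conversion is just a row-wise arg-max over the membership weights. A minimal sketch (the membership values below are made up purely for illustration):
```python
import numpy as np

# Hypothetical fuzzy membership matrix: one row per item, one column per
# cluster; each row sums to 1 (the usual constraint).
memberships = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.4, 0.5, 0.1],
])

# Partitional version: assign each item to the cluster with the highest weight
hard_labels = memberships.argmax(axis=1)
print(hard_labels)   # [0 2 1]
```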
Different types of clustering
• Hierarchical
– Set of nested clusters, where one larger cluster can contain
smaller clusters
– Organized as a tree (dendrogram): leaf nodes are singleton
clusters containing individual items, each intermediate
node is union of its children sub-clusters
– A sequence of partitional clusterings – cut the dendrogram
at a certain level to get a partitional clustering
An example dendrogram
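As an illustration of cutting a dendrogram to obtain a partitional clustering, here is a small sketch using SciPy's hierarchical clustering utilities (the data is random and only meant to show the idea):
```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
X = rng.normal(size=(20, 2))            # 20 items with 2 features

# Build the hierarchy (nested clusters); leaves are the individual items
Z = linkage(X, method='average')

# "Cut" the dendrogram so that we get a flat partition into 3 clusters
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)                           # cluster index (1..3) for each item
```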
Different types of clustering
• Complete vs. partial
– A complete clustering assigns every item to one or more
clusters
– A partial clustering may not assign some items to any
cluster (e.g., outliers, items that are not sufficiently similar
to any other item)
Types of clustering methods
• Prototype-based
– Each cluster defined by a prototype (centroid or medoid),
i.e., the most representative point in the cluster
– A cluster is a set of items such that each item is closer (more
similar) to the prototype of its own cluster than to the
prototype of any other cluster
– Example method: K-means
Types of clustering methods
• Density-based
– Assumes items distributed in a space where ‘similar’ items
are placed close to each other (e.g., feature space)
– A cluster is a dense region of items that is surrounded by a
region of low density
– Example method: DBSCAN
Types of clustering methods
• Graph-based
– Assumes items represented as a graph/network where
items are nodes, and ‘similar’ items are linked via edges
– A cluster is a group of nodes having more and / or better
connections among its members, than between its
members and the rest of the network
– Also called ‘community structure’ in networks
– Example method: Algorithm by Girvan and Newman
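For instance, the Girvan–Newman algorithm is available in NetworkX; a small sketch on a standard toy network (the choice of network is only for illustration):
```python
import networkx as nx
from networkx.algorithms.community import girvan_newman

G = nx.karate_club_graph()              # a classic small social network

# Girvan–Newman repeatedly removes the edge with the highest betweenness;
# each item from the iterator is one level of the resulting hierarchy
splits = girvan_newman(G)
first_level = next(splits)              # first split into communities
print([sorted(c) for c in first_level])
```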
We are applying clustering
in this lecture itself.
How?
K-means clustering
K-means
• Prototype-based, partitioning technique
• Finds a user-specified number of clusters (K)
• Each cluster is represented by its centroid (the mean of its points)
K-means algorithm
Given: K (number of clusters), data points x^(1), ..., x^(m)
Randomly initialize K cluster centroids μ_1, ..., μ_K
Repeat {
    // Cluster assignment step
    for i = 1 to m
        c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i)
    // Move centroid step
    for k = 1 to K
        μ_k := average (mean) of the points assigned to cluster k
}
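A minimal NumPy sketch of these two alternating steps (the function name, stopping rule, and defaults are illustrative, not part of the slides):
```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Basic K-means: X is an (m, n) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Randomly initialize centroids as K distinct data points
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # Cluster assignment step: index of the closest centroid for each point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move centroid step: each centroid becomes the mean of its points
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break                          # assignments have stabilized
        centroids = new_centroids
    return labels, centroids
```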
Optimization in K-means
• Consider data points in Euclidean space
• A measure of cluster quality: Sum of Squared Error (SSE)
– Error of each data point: Euclidean distance of the point to its
closest centroid
– SSE: total sum of the squared error for each point
– Will be minimized if the centroid of a cluster is the mean of all
data points in that cluster
• The two steps of K-means each reduce the SSE, so the algorithm converges to a local minimum of the SSE (not necessarily the global minimum)
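In symbols, SSE = Σ_k Σ_{x in C_k} dist(x, μ_k)^2, where μ_k is the centroid of cluster C_k. A small sketch of the same quantity, reusing the labels and centroids produced by the kmeans() sketch above:
```python
import numpy as np

def sse(X, labels, centroids):
    # Squared Euclidean distance of every point to its assigned centroid, summed
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

# Usage (illustrative):
# labels, centroids = kmeans(X, K=3)
# print(sse(X, labels, centroids))
```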
Choosing value of K
• Based on domain knowledge about suitable number of
clusters for a particular problem domain
DBSCAN clustering
ε-Neighborhood and density
• ε-Neighborhood of a point p: the set of points within distance ε of p
• Example with two points p and q (MinPts = 4):
– Density of p is “high”: its ε-Neighborhood contains at least MinPts points
– Density of q is “low”: its ε-Neighborhood contains fewer than MinPts points
Divide points into three types
• Core point: a point that has more than a specified number of
points (MinPts) within its ε-Neighborhood (core points lie in
the interior of a cluster)
• Border point: a point that has fewer than MinPts points within its
ε-Neighborhood (so it is not a core point), but falls within the
ε-Neighborhood of some core point
• Outlier (noise) point: any point that is neither a core point nor a
border point
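A small sketch that labels points with these three types from pairwise distances (here a point is counted in its own ε-Neighborhood, which is one common convention; the function name is illustrative):
```python
import numpy as np

def classify_points(X, eps, min_pts):
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    in_neighborhood = dists <= eps                  # ε-Neighborhood membership
    is_core = in_neighborhood.sum(axis=1) >= min_pts

    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append('core')
        elif np.any(in_neighborhood[i] & is_core):  # within ε of some core point
            labels.append('border')
        else:
            labels.append('outlier')
    return labels
```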
Density-Reachability
• Directly density-reachable: a point q is directly density-reachable
from a point p if p is a core point and q is in p’s ε-Neighborhood
– Example (MinPts = 4): q is directly density-reachable from p, but p is
not directly density-reachable from q, because q is not a core point
– So density-reachability is not symmetric
Density-Reachability
• Density-reachability can be direct or indirect
– Point p is directly density-reachable from p2
– p2 is directly density-reachable from p1
– p1 is directly density-reachable from q
– p ← p2 ← p1 ← q form a chain
– Example (MinPts = 7): p is (indirectly) density-reachable from q,
but q is not density-reachable from p
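A sketch of collecting all points density-reachable from a given point by following chains of directly density-reachable points (the function name and index-based interface are illustrative; it assumes the starting point is a core point):
```python
import numpy as np
from collections import deque

def density_reachable_set(X, p_idx, eps, min_pts):
    # Pairwise distances, ε-Neighborhoods, and core-point flags
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    in_neighborhood = dists <= eps
    is_core = in_neighborhood.sum(axis=1) >= min_pts

    reached = {p_idx}
    queue = deque([p_idx])
    while queue:
        i = queue.popleft()
        if not is_core[i]:
            continue                       # only core points extend the chain
        for j in np.flatnonzero(in_neighborhood[i]):
            if j not in reached:           # j is directly density-reachable from i
                reached.add(j)
                queue.append(j)
    return reached
```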
DBSCAN algorithm
Input: The data set D
Parameters: ε, MinPts
for each point p in D
    if p is a core point and not processed then
        C = {all points density-reachable from p}
        mark all points in C as processed
        report C as a cluster
    else
        mark p as outlier
    end if
end for
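For comparison, the same idea is available off the shelf; a small sketch using scikit-learn's DBSCAN on made-up data (eps corresponds to ε and min_samples to MinPts):
```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus a few scattered points (illustrative data only)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
    rng.normal(loc=3.0, scale=0.3, size=(50, 2)),
    rng.uniform(low=-2, high=5, size=(5, 2)),
])

db = DBSCAN(eps=0.5, min_samples=4).fit(X)
print(db.labels_)   # cluster index per point; -1 marks points left as outliers
```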
Understanding the algorithm
• Arbitrarily select a point p
• Continue the process until all points have been processed
(each point gets marked as core, border, or outlier)
When DBSCAN works well