ML4 Unsupervised Learning
K-Means Clustering
Hierarchical Clustering
Density-based Clustering
Supervised learning vs. unsupervised learning
Supervised learning: discover patterns in the data that relate data attributes to a target (class) attribute (labelled data).
These patterns are then used to predict the values of the target attribute in future data instances.
Unsupervised learning: the data have no target attribute (unlabelled data).
We want to explore the data to find intrinsic structures in them.
Introduction
Data clustering concerns how to group a set of objects based on the similarity of their attributes and/or their proximity in the vector space.
A good clustering method will produce high-quality clusters with
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
Clustering is often called an unsupervised learning task and is often considered synonymous with unsupervised learning. In fact, association rule mining is also unsupervised.
An illustration
The data set has three natural groups of data points,
i.e., 3 natural clusters.
Stages in clustering
Aspects of clustering
A clustering algorithm
Partitional clustering
Hierarchical clustering
…
A distance (similarity, or dissimilarity) function
Clustering quality
Inter-cluster distance maximized
Intra-cluster distance minimized
The quality of a clustering result depends on the
algorithm, the distance function, and the
application.
Similarity and Distance Measurement
Key to clustering: "similarity" and "dissimilarity" are also commonly used terms.
There are numerous distance functions for different types of data:
Numeric data
Nominal data
Distance functions for numeric attributes
Pearson Correlation
Trend Similarity (Pearson Correlation)
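The formula itself appears only as a figure in the original slides; for reference, the standard Pearson correlation between two r-dimensional vectors x and y is

$$ r_{xy} = \frac{\sum_{h=1}^{r} (x_h - \bar{x})(y_h - \bar{y})}{\sqrt{\sum_{h=1}^{r} (x_h - \bar{x})^2}\,\sqrt{\sum_{h=1}^{r} (y_h - \bar{y})^2}} $$

Values close to 1 mean the two vectors follow the same trend even if their magnitudes differ, which is why it is used as a trend-similarity measure.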
Similarity Measurements
Euclidean distance and Manhattan distance
Euclidean distance:
$$ \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2} $$
Manhattan distance:
$$ \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ir} - x_{jr}| $$
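A minimal sketch of these two distance functions in Python (NumPy is an assumption; the slides do not prescribe any library):

import numpy as np

def euclidean(xi, xj):
    # Square root of the sum of squared per-attribute differences.
    return np.sqrt(np.sum((np.asarray(xi) - np.asarray(xj)) ** 2))

def manhattan(xi, xj):
    # Sum of absolute per-attribute differences.
    return np.sum(np.abs(np.asarray(xi) - np.asarray(xj)))

xi, xj = [0.0, 3.0], [4.0, 0.0]
print(euclidean(xi, xj))  # 5.0
print(manhattan(xi, xj))  # 7.0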
Squared distance and Chebychev distance
Squared Euclidean distance: to place
progressively greater weight on data points
that are further apart.
$$ \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j) = (x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \dots + (x_{ir} - x_{jr})^2 $$
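The slide title also names the Chebychev distance, whose formula does not survive in the extracted text; its standard definition is the largest per-attribute difference:

$$ \mathrm{dist}(\mathbf{x}_i, \mathbf{x}_j) = \max_{h = 1, \dots, r} |x_{ih} - x_{jh}| $$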
Distance function for text documents
A text document consists of a sequence of
sentences and each sentence consists of a
sequence of words.
To simplify: a document is usually considered a
“bag” of words in document clustering.
Sequence and position of words are ignored.
A document is represented with a vector just like a
normal data point.
It is common to use similarity to compare two
documents rather than distance.
The most commonly used similarity function is the cosine
similarity. We will study this later.
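For reference (the slides defer the details), the cosine similarity between two document vectors d1 and d2 is

$$ \cos(\mathbf{d}_1, \mathbf{d}_2) = \frac{\mathbf{d}_1 \cdot \mathbf{d}_2}{\lVert \mathbf{d}_1 \rVert \, \lVert \mathbf{d}_2 \rVert} $$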
Data standardization
In the Euclidean space, standardization of attributes
is recommended so that all attributes can have
equal impact on the computation of distances.
Consider the following pair of data points: xi: (0.1, 20) and xj: (0.9, 720). Without standardization, the distance between them is dominated almost entirely by the second attribute, simply because its range is much larger than that of the first; rescaling the attributes fixes this, as sketched below.
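A minimal sketch of min-max standardization (rescaling each attribute to [0, 1]); NumPy is an assumption, and the choice of min-max over, say, z-score standardization is illustrative rather than something the slides prescribe:

import numpy as np

X = np.array([[0.1, 20.0],
              [0.9, 720.0]])

# Rescale each attribute (column) to the [0, 1] range.
mins, maxs = X.min(axis=0), X.max(axis=0)
X_scaled = (X - mins) / (maxs - mins)

print(X_scaled)
# [[0. 0.]
#  [1. 1.]]
# After scaling, both attributes contribute equally to the distance.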
K-means clustering
K-means is a partitional clustering algorithm
Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.
The k-means algorithm partitions the given data into k clusters.
Each cluster has a cluster center, called the centroid.
k is specified by the user.
K-means algorithm
Given k, the k-means algorithm works as follows (a minimal code sketch follows the steps):
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
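A minimal sketch of these four steps, assuming NumPy and Euclidean distance (the slides do not fix an implementation; empty clusters are not handled here):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1) Randomly choose k data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2) Assign each data point to the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3) Re-compute the centroids from the current memberships.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 4) Stop when the centroids no longer move (convergence criterion).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

On the three-cluster data illustrated earlier, kmeans(X, k=3) would return a cluster label for each point together with the three centroids.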
An example
An example (cont …)
Weaknesses of k-means: problems with outliers
Weaknesses of k-means: dealing with outliers
One method is to remove some data points in the
clustering process that are much further away from
the centroids than other data points.
To be safe, we may want to monitor these possible outliers
over a few iterations and then decide to remove them.
Another method is to perform random sampling.
Since in sampling we only choose a small subset of
the data points, the chance of selecting an outlier is
very small.
We then assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.
Weaknesses of k-means (cont …)
The algorithm is sensitive to initial seeds.
Weaknesses of k-means (cont …)
If we use different seeds, we can get good results. There are methods to help choose good seeds, for example running the algorithm several times with different random seeds and keeping the best result, or using a seeding scheme such as k-means++.
Using frequent values to represent a cluster
This method is mainly used for clustering categorical data (e.g., k-modes clustering).
It is also the main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster.
Clusters of arbitrary shapes
Hyper-elliptical and hyper-spherical clusters are usually easy to represent, using their centroid together with spreads.
Irregularly shaped clusters are hard to represent. They may not be useful in some applications.
Using centroids is not suitable for them in general (upper figure).
K-means clusters may be more useful (lower figure), e.g., for making T-shirts in two sizes.
Hierarchical Clustering
Produces a nested sequence of clusters, i.e., a tree (dendrogram).
Types of hierarchical clustering
Agglomerative (bottom-up) clustering: builds the tree from the bottom level; it
merges the most similar (or nearest) pair of clusters, and
stops when all the data points are merged into a single cluster (i.e., the root cluster).
Divisive (top-down) clustering: starts with all data points in one cluster, the root; it
splits the root into a set of child clusters, each of which is recursively divided further, and
stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.
Agglomerative clustering
Agglomerative clustering algorithm
An example: working of the algorithm
Measuring the distance of two clusters
There are a few ways to measure the distance between two clusters; each results in a different variation of the algorithm (a code sketch using these options follows the list):
Single link
Complete link
Average link
Centroids
…
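A minimal sketch of agglomerative clustering with these linkage options, assuming SciPy is available (the slides do not name a library):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],   # one natural group
              [5.0, 5.0], [5.2, 5.1], [4.9, 5.3]])  # another natural group

# 'single', 'complete', 'average', and 'centroid' correspond to the
# cluster-distance definitions listed above.
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)                     # the nested merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)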
Single link method
The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster.
It can find arbitrarily shaped clusters, but
it may cause the undesirable "chain effect" due to noisy points.
(Figure: two natural clusters are split into two.)
Complete link method
The distance between two clusters is the distance between the two furthest data points in the two clusters, one from each cluster.
It is sensitive to outliers because they are far away from the rest of the data.
Average link and centroid methods
Average link: a compromise between
the sensitivity of complete-link clustering to outliers, and
the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects.
In this method, the distance between two clusters is the average of all pairwise distances between the data points in the two clusters.
Centroid method: in this method, the distance between two clusters is the distance between their centroids.
Density-based Approaches
(Figure: results of a k-medoid algorithm for k = 4.)
Density Based Clustering: Basic Concept
Intuition for the formalization of the basic idea:
For any point in a cluster, the local point density around that point has to exceed some threshold.
The set of points from one cluster is spatially connected.
The local point density at a point p is defined by two parameters:
ε: the radius of the neighborhood of point p, giving the ε-neighborhood Nε(p)
MinPts: the minimum number of objects required within that neighborhood
ε-Neighborhood
ε-Neighborhood: the objects within a radius of ε from an object:
$$ N_{\varepsilon}(p) := \{\, q \mid d(p, q) \le \varepsilon \,\} $$
"High density": the ε-neighborhood of an object contains at least MinPts objects.
(Figure: the ε-neighborhood of p contains at least MinPts = 4 objects, so the density of p is "high"; the ε-neighborhood of q contains fewer than MinPts = 4, so the density of q is "low".)
Core, Border & Outlier
Given ε and MinPts, categorize the objects into three exclusive groups (a small code sketch follows):
A point is a core point if it has more than a specified number of points (MinPts) within ε. These are points in the interior of a cluster.
A border point has fewer than MinPts within ε, but is in the neighborhood of a core point.
An outlier (noise) point is any point that is neither a core point nor a border point.
(Figure: ε = 1 unit, MinPts = 5.)
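A minimal sketch of this core/border/noise labelling, assuming NumPy and Euclidean distance (the function and parameter names are illustrative):

import numpy as np

def label_points(X, eps, min_pts):
    # Pairwise Euclidean distances between all points.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighborhood counts (here including the point itself; conventions vary).
    neighbor_counts = (d <= eps).sum(axis=1)
    core = neighbor_counts >= min_pts
    # A border point is not core but has a core point within eps.
    border = ~core & ((d <= eps) & core[None, :]).any(axis=1)
    noise = ~core & ~border
    return core, border, noise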
DBSCAN: The Algorithm
Process each point in turn, as in the pseudocode below, and continue until all of the points have been processed (a Python sketch follows the example).
DBSCAN Algorithm: Example
Parameters: ε = 2 cm, MinPts = 3
for each o ∈ D do
    if o is not yet classified then
        if o is a core-object then
            collect all objects density-reachable from o
            and assign them to a new cluster
        else
            assign o to NOISE
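A minimal sketch of this loop in Python, reusing the core-point test above; NumPy is an assumption, and the queue-based cluster expansion is an illustration rather than a reference implementation:

import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = [np.where(d[i] <= eps)[0] for i in range(n)]
    core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -2)              # -2 = not yet classified, -1 = NOISE
    cluster_id = 0
    for o in range(n):                   # for each o in D
        if labels[o] != -2:
            continue                     # already classified
        if not core[o]:
            labels[o] = -1               # NOISE (may later become a border point)
            continue
        # Collect all objects density-reachable from o into a new cluster.
        labels[o] = cluster_id
        queue = list(neighbors[o])
        while queue:
            q = queue.pop()
            if labels[q] in (-2, -1):    # unclassified or provisional noise
                labels[q] = cluster_id
                if core[q]:              # expand only through core objects
                    queue.extend(neighbors[q])
        cluster_id += 1
    return labels

For the example above, dbscan(X, eps=2.0, min_pts=3) returns one cluster id per point, with -1 marking noise.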
Example
• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
(Figure: original points; MinPts = 4, Eps = 9.92.)
Advantages
Clusters can have arbitrary shape and size
Number of clusters is determined automatically
Can separate clusters from surrounding noise
Can be supported by spatial index structures
Disadvantages
Input parameters may be difficult to determine
In some situations very sensitive to input
parameter setting
Denclue: Technical Essence
Gaussian influence function:
$$ f_{\text{Gaussian}}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}} $$
Density function:
$$ f^{D}_{\text{Gaussian}}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}} $$
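A minimal sketch of evaluating this density function at a point x, assuming NumPy (the data and sigma value are illustrative):

import numpy as np

def gaussian_density(x, data, sigma):
    # Sum of the Gaussian influences of all data points x_i at location x.
    sq_dists = np.sum((data - x) ** 2, axis=1)    # d(x, x_i)^2
    return np.sum(np.exp(-sq_dists / (2 * sigma ** 2)))

data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0]])
print(gaussian_density(np.array([0.0, 0.1]), data, sigma=0.5))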