
Unsupervised Learning

Dr. Dinh Dong Luong


Contents
 • Introduction
 • Distance and Similarity Measurement
 • K-Means Clustering
 • Hierarchical Clustering
 • Density-based Clustering

2
Supervised learning vs. unsupervised learning
 • Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute (labelled data).
   • These patterns are then used to predict the values of the target attribute in future data instances.
 • Unsupervised learning: the data have no target attribute (unlabelled data).
   • We want to explore the data to find intrinsic structure in them.
3
Introduction
 • Data clustering concerns how to group a set of objects based on their similarity of attributes and/or their proximity in the vector space.
 • A good clustering method will produce high quality clusters with
   • high intra-class similarity: cohesive within clusters
   • low inter-class similarity: distinctive between clusters
 • Clustering is often called an unsupervised learning task.
 • Clustering is often considered synonymous with unsupervised learning. In fact, association rule mining is also unsupervised.

4
An illustration
 • The data set has three natural groups of data points, i.e., 3 natural clusters.

5
Stages in clustering

6
Aspects of clustering
 • A clustering algorithm
   • Partitional clustering
   • Hierarchical clustering
   • …
 • A distance (similarity, or dissimilarity) function
 • Clustering quality
   • Inter-cluster distance → maximized
   • Intra-cluster distance → minimized
 • The quality of a clustering result depends on the algorithm, the distance function, and the application.
7
Similarity and Distance Measurement
 • Key to clustering. "Similarity" and "dissimilarity" are also commonly used terms.
 • There are numerous distance functions for different types of data:
   • Numeric data
   • Nominal data
8
Distance functions for numeric attributes

9
Pearson Correlation
 • Measures the linear correlation between two vectors x and y:

   corr(x, y) = Σ_k (x_k − x̄)(y_k − ȳ) / [ √(Σ_k (x_k − x̄)²) · √(Σ_k (y_k − ȳ)²) ]

10
Trend Similarity (Pearson Correlation)
 • Two vectors that rise and fall together have a correlation close to 1 even when their absolute values differ, so Pearson correlation measures similarity of trend rather than of magnitude.
11
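As a concrete check of the formula, here is a minimal NumPy sketch; the two series are made up for the example, and np.corrcoef gives the same value as the hand-computed one.

    import numpy as np

    # Two made-up series that share the same rising/falling trend
    # but differ in scale.
    x = np.array([1.0, 2.0, 3.0, 2.5, 4.0])
    y = np.array([10.0, 20.0, 30.0, 26.0, 41.0])

    # Pearson correlation computed from the definition.
    xc, yc = x - x.mean(), y - y.mean()
    r_manual = (xc @ yc) / (np.sqrt((xc ** 2).sum()) * np.sqrt((yc ** 2).sum()))

    # The same value via NumPy's built-in correlation matrix.
    r_numpy = np.corrcoef(x, y)[0, 1]

    print(r_manual, r_numpy)  # both close to 1: the trends are similar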
Similarity Measurements

17
Euclidean distance and Manhattan distance
 • Euclidean distance

   dist(x_i, x_j) = √[(x_i1 − x_j1)² + (x_i2 − x_j2)² + … + (x_ir − x_jr)²]

 • Manhattan distance

   dist(x_i, x_j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ir − x_jr|

 • Weighted Euclidean distance

   dist(x_i, x_j) = √[w_1(x_i1 − x_j1)² + w_2(x_i2 − x_j2)² + … + w_r(x_ir − x_jr)²]
18
Squared distance and Chebychev distance
 • Squared Euclidean distance: places progressively greater weight on data points that are further apart.

   dist(x_i, x_j) = (x_i1 − x_j1)² + (x_i2 − x_j2)² + … + (x_ir − x_jr)²

 • Chebychev distance: used when one wants to regard two data points as "different" if they differ on any one of the attributes.

   dist(x_i, x_j) = max(|x_i1 − x_j1|, |x_i2 − x_j2|, …, |x_ir − x_jr|)
19
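These distance functions translate directly into NumPy; a minimal sketch with made-up vectors (the attribute weights w are illustrative):

    import numpy as np

    xi = np.array([0.0, 3.0, 1.0])
    xj = np.array([4.0, 0.0, 1.0])
    w = np.array([1.0, 0.5, 2.0])   # example attribute weights

    euclidean = np.sqrt(np.sum((xi - xj) ** 2))            # 5.0
    manhattan = np.sum(np.abs(xi - xj))                    # 7.0
    weighted_euclidean = np.sqrt(np.sum(w * (xi - xj) ** 2))
    squared_euclidean = np.sum((xi - xj) ** 2)             # 25.0
    chebychev = np.max(np.abs(xi - xj))                    # 4.0

    print(euclidean, manhattan, weighted_euclidean, squared_euclidean, chebychev)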
Distance function for text documents
 • A text document consists of a sequence of sentences and each sentence consists of a sequence of words.
 • To simplify: a document is usually treated as a "bag" of words in document clustering.
   • Sequence and position of words are ignored.
 • A document is represented with a vector just like a normal data point.
 • It is common to use similarity rather than distance to compare two documents.
   • The most commonly used similarity function is the cosine similarity. We will study this later.
20
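As a quick sketch of the cosine similarity mentioned above, using two made-up bag-of-words count vectors over a shared vocabulary:

    import numpy as np

    # Term-count vectors for two short documents (vocabulary and counts
    # are invented for the example).
    d1 = np.array([2, 1, 0, 3, 0], dtype=float)
    d2 = np.array([1, 1, 1, 2, 0], dtype=float)

    cosine = (d1 @ d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
    print(cosine)  # close to 1 means very similar direction; 0 means no shared terms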
Data standardization
 • In the Euclidean space, standardization of attributes is recommended so that all attributes can have an equal impact on the computation of distances.
 • Consider the following pair of data points
   • x_i: (0.1, 20) and x_j: (0.9, 720).

   dist(x_i, x_j) = √[(0.9 − 0.1)² + (720 − 20)²] = 700.000457

 • The distance is almost completely dominated by (720 − 20) = 700.
 • Standardize attributes: force the attributes into a common value range.
21
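A minimal sketch of one possible standardization (min-max scaling to [0, 1]) applied to the two points above; the choice of scaling scheme is an assumption, not something the slide prescribes:

    import numpy as np

    X = np.array([[0.1,  20.0],
                  [0.9, 720.0]])   # the two data points xi and xj

    # Range (min-max) standardization: map every attribute to [0, 1].
    X_min, X_max = X.min(axis=0), X.max(axis=0)
    X_scaled = (X - X_min) / (X_max - X_min)

    # After scaling, both attributes contribute equally to the distance.
    dist = np.sqrt(np.sum((X_scaled[0] - X_scaled[1]) ** 2))
    print(X_scaled)
    print(dist)   # sqrt(1^2 + 1^2) ~ 1.414, no longer dominated by one attribute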
How to choose a clustering algorithm
 • Choosing the "best" algorithm is a challenge.
   • Every algorithm has limitations and works well only with certain data distributions.
   • It is very hard, if not impossible, to know what distribution the application data follow. The data may not fully follow any "ideal" structure or distribution required by the algorithms.
   • One also needs to decide how to standardize the data, how to choose a suitable distance function, and how to select other parameter values.

22
K-means clustering
 • K-means is a partitional clustering algorithm.
 • Let the set of data points (or instances) D be {x_1, x_2, …, x_n}, where x_i = (x_i1, x_i2, …, x_ir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.
 • The k-means algorithm partitions the given data into k clusters.
   • Each cluster has a cluster center, called the centroid.
   • k is specified by the user.
23
K-means algorithm
 • Given k, the k-means algorithm works as follows:
   1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
   2) Assign each data point to the closest centroid.
   3) Re-compute the centroids using the current cluster memberships.
   4) If a convergence criterion is not met, go to 2).
24
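A minimal NumPy sketch of steps 1)–4); the function name, the data, and the convergence test are illustrative choices, not part of the slides:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        # 1) Randomly choose k data points as the initial centroids.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # 2) Assign each data point to the closest centroid.
            dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # 3) Re-compute the centroids from the current memberships
            #    (assumes no cluster becomes empty, fine for this small example).
            new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # 4) Stop when the centroids no longer move (convergence criterion).
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return labels, centroids

    # Tiny synthetic data set with two obvious groups.
    X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
    labels, centroids = kmeans(X, k=2)
    print(labels, centroids)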
An example


25
An example (cont …)

26
Weaknesses of k-means: Problems with outliers

27
Weaknesses of k-means: To deal with outliers
 • One method is to remove, during the clustering process, data points that are much further away from the centroids than the other data points.
   • To be safe, we may want to monitor these possible outliers over a few iterations and only then decide whether to remove them.
 • Another method is to perform random sampling. Since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
   • Assign the rest of the data points to the clusters by distance or similarity comparison, or by classification.
28
Weaknesses of k-means (cont …)
 • The algorithm is sensitive to the initial seeds.

29
Weaknesses of k-means (cont …)
 • If we use different seeds: good results.
 • There are some methods to help choose good seeds.
30
Use frequent values to represent a cluster
 • This method is mainly for clustering of categorical data (e.g., k-modes clustering).
 • It is the main method used in text clustering, where a small set of frequent words in each cluster is selected to represent the cluster.

31
Clusters of arbitrary shapes
 • Hyper-elliptical and hyper-spherical clusters are usually easy to represent, using their centroids together with spreads.
 • Irregularly shaped clusters are hard to represent. They may not be useful in some applications.
   • Using centroids is not suitable in general (upper figure).
   • K-means clusters may be more useful (lower figure), e.g., for making T-shirts in two sizes.
32
Hierarchical Clustering
 • Produces a nested sequence of clusters, a tree.

33
Types of hierarchical clustering
 • Agglomerative (bottom-up) clustering: builds the tree from the bottom level, and
   • merges the most similar (or nearest) pair of clusters
   • stops when all the data points are merged into a single cluster (i.e., the root cluster).
 • Divisive (top-down) clustering: starts with all data points in one cluster, the root.
   • Splits the root into a set of child clusters; each child cluster is recursively divided further.
   • Stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.

34
Agglomerative clustering
 • It is more popular than divisive methods.
 • At the beginning, each data point forms its own cluster (also called a node).
 • Merge the nodes/clusters that have the least distance between them.
 • Go on merging.
 • Eventually all nodes belong to one cluster.
35
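A minimal sketch of the merge loop just described, written in plain Python/NumPy. As the cluster-to-cluster distance it uses the minimum pairwise point distance (single link, one of the options discussed on the later slides); stopping at k clusters instead of merging all the way to the root, and the data, are illustrative choices:

    import numpy as np

    def agglomerative(X, k):
        """Merge clusters until only k remain, using the smallest
        pairwise point distance between clusters (single link)."""
        clusters = [[i] for i in range(len(X))]   # each point starts as its own cluster
        while len(clusters) > k:
            best = None
            # Find the pair of clusters with the least distance.
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = min(np.linalg.norm(X[i] - X[j])
                            for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            _, a, b = best
            # Merge the two closest clusters and go on merging.
            clusters[a].extend(clusters[b])
            del clusters[b]
        return clusters

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]])
    print(agglomerative(X, k=3))   # e.g. [[0, 1], [2, 3], [4]]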
Agglomerative clustering algorithm

36
An example: working of the algorithm

37
Measuring the distance of two clusters
 • There are a few ways to measure the distance between two clusters.
 • They result in different variations of the algorithm:
   • Single link
   • Complete link
   • Average link
   • Centroids
   • …
38
Single link method
 • The distance between two clusters is the distance between the two closest data points in the two clusters, one data point from each cluster.
 • It can find arbitrarily shaped clusters, but
   • it may cause the undesirable "chain effect" due to noisy points.
 • (Figure: two natural clusters are split into two.)
39
Complete link method
 • The distance between two clusters is the distance between the two furthest data points in the two clusters.
 • It is sensitive to outliers because they are far away.

40
Average link and centroid methods
 • Average link: a compromise between
   • the sensitivity of complete-link clustering to outliers, and
   • the tendency of single-link clustering to form long chains that do not correspond to the intuitive notion of clusters as compact, spherical objects.
 • In this method, the distance between two clusters is the average of all pairwise distances between the data points in the two clusters.
41
Average link and centroid methods
 • Centroid method: the distance between two clusters is the distance between their centroids.

42
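Assuming SciPy is available, these cluster-distance definitions correspond to the method argument of scipy.cluster.hierarchy.linkage; a minimal sketch with made-up data:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Small made-up data set.
    X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9], [8.0, 0.0]])

    # 'single', 'complete', 'average' and 'centroid' correspond to the
    # linkage methods discussed on the previous slides.
    for method in ["single", "complete", "average", "centroid"]:
        Z = linkage(X, method=method)                     # the merge tree (dendrogram data)
        labels = fcluster(Z, t=3, criterion="maxclust")   # cut the tree into 3 clusters
        print(method, labels)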
Density-based Approaches
 • Why density-based clustering methods?
   • They discover clusters of arbitrary shape.
   • Clusters are dense regions of objects separated by regions of low density.
 • DBSCAN – the first density-based clustering algorithm
 • OPTICS – density-based cluster ordering
 • DENCLUE – a general density-based description of clusters and clustering
Density-Based Clustering
 • Basic idea: clusters are dense regions in the data space, separated by regions of lower object density.
 • Why density-based clustering?
   • (Figure: results of a k-medoid algorithm for k = 4.)
Density-Based Clustering: Basic Concept
 • Intuition for the formalization of the basic idea:
   • For any point in a cluster, the local point density around that point has to exceed some threshold.
   • The set of points from one cluster is spatially connected.
 • The local point density at a point p is defined by two parameters:
   • ε – radius for the neighborhood of point p:

     N_ε(p) := {q in data set D | dist(p, q) ≤ ε}

   • MinPts – minimum number of points in the given neighborhood N_ε(p).
ε-Neighborhood
 • ε-Neighborhood – objects within a radius of ε from an object:

   N_ε(p) := {q | d(p, q) ≤ ε}

 • "High density" – the ε-neighborhood of an object contains at least MinPts objects.
 • (Figure: ε-neighborhoods of p and q; the density of p is "high" and the density of q is "low" for MinPts = 4.)
Core, Border & Outlier
 • Given ε and MinPts, categorize the objects into three exclusive groups:
   • A point is a core point if it has more than a specified number of points (MinPts) within ε. These are points in the interior of a cluster.
   • A border point has fewer than MinPts within ε, but is in the neighborhood of a core point.
   • A noise point (outlier) is any point that is neither a core point nor a border point.
 • (Figure: core, border, and outlier points for ε = 1 unit, MinPts = 5.)
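A minimal NumPy sketch of this categorization; the helper name point_types, the data, and the parameter values are illustrative (and the ε-neighborhood here includes the point itself):

    import numpy as np

    def point_types(X, eps, min_pts):
        """Label each point as 'core', 'border', or 'noise'."""
        n = len(X)
        # Pairwise distances; a point's eps-neighborhood includes itself here.
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        neighbors = [np.where(D[i] <= eps)[0] for i in range(n)]
        core = np.array([len(nb) >= min_pts for nb in neighbors])

        types = []
        for i in range(n):
            if core[i]:
                types.append("core")
            elif any(core[j] for j in neighbors[i]):
                types.append("border")   # not core, but near a core point
            else:
                types.append("noise")
        return types

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5], [8, 8]], dtype=float)
    print(point_types(X, eps=1.5, min_pts=4))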
DBSCAN Algorithm
 • Input: the data set D
 • Parameters: ε, MinPts

   for each object p in D
     if p is a core object and not processed then
       C = retrieve all objects density-reachable from p
       mark all objects in C as processed
       report C as a cluster
     else mark p as outlier
     end if
   end for

DBSCAN: The Algorithm
 • Arbitrarily select a point p.
 • Retrieve all points density-reachable from p w.r.t. ε and MinPts.
 • If p is a core point, a cluster is formed.
 • If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
 • Continue the process until all of the points have been processed.
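Assuming scikit-learn is available, the whole procedure can be run directly; eps and min_samples play the roles of ε and MinPts, and the data and parameter values below are made up:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # Two dense groups plus one isolated point (made-up data).
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [8, 8], [8, 9], [9, 8], [9, 9],
                  [20, 20]], dtype=float)

    # eps corresponds to the radius ε, min_samples to MinPts.
    labels = DBSCAN(eps=1.5, min_samples=3).fit_predict(X)
    print(labels)   # cluster ids for the two groups; -1 marks noise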
DBSCAN Algorithm: Example
 • Parameters
   • ε = 2 cm
   • MinPts = 3

   for each o ∈ D do
     if o is not yet classified then
       if o is a core object then
         collect all objects density-reachable from o
         and assign them to a new cluster
       else
         assign o to NOISE
Example
 • (Figure: original points and their point types: core, border and outlier; ε = 10, MinPts = 4.)
When DBSCAN Works Well
 • (Figure: original points and the resulting clusters.)
 • Resistant to noise.
 • Can handle clusters of different shapes and sizes.
When DBSCAN Does NOT Work Well
 • (Figures: original points and the clusterings for MinPts = 4 with Eps = 9.92 and Eps = 9.75.)
 • Cannot handle varying densities.
 • Sensitive to parameters.
DBSCAN: Sensitive to Parameters
Density-Based Clustering: Discussion
 • Advantages
   • Clusters can have arbitrary shape and size.
   • The number of clusters is determined automatically.
   • Can separate clusters from surrounding noise.
   • Can be supported by spatial index structures.
 • Disadvantages
   • Input parameters may be difficult to determine.
   • In some situations, very sensitive to the input parameter setting.
DENCLUE: Technical Essence
 • Influence functions (influence of y on x; σ is a user-given constant):
   • Square: f_Square^y(x) = 0 if d(x, y) > σ, 1 otherwise
   • Gaussian: f_Gaussian^y(x) = e^(−d(x, y)² / (2σ²))
Density Function
 • The density at a point x is defined as the sum of the influence functions of all data points:

   f_Gaussian^D(x) = Σ_{i=1..N} e^(−d(x, x_i)² / (2σ²))
60
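A minimal NumPy sketch of evaluating this Gaussian density function at a query point; the helper name, the data, and the value of σ are illustrative:

    import numpy as np

    def gaussian_density(x, data, sigma):
        """Sum of the Gaussian influences of all data points on x."""
        d2 = np.sum((data - x) ** 2, axis=1)          # squared distances d(x, xi)^2
        return np.sum(np.exp(-d2 / (2.0 * sigma ** 2)))

    data = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.4], [5.0, 5.0]])
    print(gaussian_density(np.array([0.3, 0.2]), data, sigma=1.0))  # high: near the dense group
    print(gaussian_density(np.array([3.0, 3.0]), data, sigma=1.0))  # low: between regions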
