Unsupervised Learning: Density-based Methods
CS 822 Data Mining
Anis ur Rahman
Department of Computing
NUST-SEECS
Islamabad
December 17, 2018
Roadmap
Introduction
DBSCAN Algorithm
OPTICS Algorithm
DENCLUE Algorithm
The Principle
Regard clusters as dense regions in the data space separated by
regions of low density
Major features
Discover clusters of arbitrary shape
Handle noise (regions of low density)
One scan
Need density parameters as a termination condition
Several interesting studies
DBSCAN. Ester et al. (KDD’96)
OPTICS. Ankerst et al. (SIGMOD’99)
DENCLUE. Hinneburg & Keim (KDD’98)
CLIQUE. Agrawal et al. (SIGMOD’98) (more grid-based)
Introduction
Density-based clustering algorithms
1 DBSCAN. Grows clusters according to a density-based
connectivity analysis
2 OPTICS. Extends DBSCAN to produce a cluster ordering obtained
from a wide range of parameter settings
3 DENCLUE. Clusters objects based on a set of density distribution
functions
DBSCAN Algorithm
Stands for Density-Based Spatial Clustering of Applications with
Noise
It is a density-based clustering algorithm
The algorithm grows regions with sufficiently high density into
clusters and discovers clusters of arbitrary shape in spatial
databases with noise
It defines a cluster as a maximal set of density-connected points
Definitions
ε-neighborhood of an object
The neighborhood within a radius ε of a given object
Core object
An object whose ε-neighborhood contains at least a minimum
number, MinPts, of objects
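The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `eps_neighborhood` and `is_core_object` are hypothetical, points are assumed to be tuples of coordinates, the distance is assumed to be Euclidean, and the neighborhood is assumed to include the object itself.

```python
import math

def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (assumed to include i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def is_core_object(points, i, eps, min_pts):
    """A point is a core object if its eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

# Three points close together and one isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (3.0, 3.0)]
print(eps_neighborhood(points, 0, 1.0))   # neighbors of the first point
print(is_core_object(points, 0, 1.0, 3))  # dense region
print(is_core_object(points, 3, 1.0, 3))  # isolated point
```

The linear scan in `eps_neighborhood` is the simplest possible region query; a spatial index would replace it in practice.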
Definitions
Directly density-reachable objects
Given a set of objects, D, we say that an object p is directly
density-reachable from object q if p is within the ε-neighborhood
of q, and q is a core object
Example.
q is directly density-reachable from m
m is directly density-reachable from p and vice versa
Definitions
Indirectly density-reachable objects
An object p is indirectly density-reachable from object q
if there is a chain of objects p_1, ..., p_n, where p_1 = q and p_n = p,
such that p_{i+1} is directly density-reachable from p_i, for 1 ≤ i < n
Example
q is density-reachable from p because q is directly
density-reachable from m and m is directly density-reachable from
p
p is not density-reachable from q because q is not a core object
Definitions
Indirectly density-connected objects
An object p is indirectly density-connected to object q,
if there is an object o such that both p and q are density-reachable
from o
Example
p, q and m are all density-connected
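Density-reachability and density-connectivity can be checked mechanically. The sketch below is an illustration under assumptions, not the slides' own code: the function names are hypothetical, distances are Euclidean, and the four collinear points only mimic the roles of p, m and q in the example (they are not the points of the figure). Reachability is a breadth-first search that only continues through core objects.

```python
import math
from collections import deque

def neighbors(points, i, eps):
    """Indices within distance eps of points[i], including i itself."""
    return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

def density_reachable(points, src, dst, eps, min_pts):
    """True if points[dst] can be reached from points[src] via a chain of core objects."""
    seen = {src}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        nbrs = neighbors(points, cur, eps)
        if len(nbrs) < min_pts:
            continue  # cur is not a core object, so no chain continues from it
        for j in nbrs:
            if j == dst:
                return True
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return False

def density_connected(points, a, b, eps, min_pts):
    """True if some object o exists from which both a and b are density-reachable."""
    return any(
        density_reachable(points, o, a, eps, min_pts)
        and density_reachable(points, o, b, eps, min_pts)
        for o in range(len(points))
    )

# Hypothetical layout: index 1 plays p, index 2 plays m, index 3 plays q.
points = [(-1.0, 0.0), (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
print(density_reachable(points, 1, 3, eps=1.1, min_pts=3))  # q from p
print(density_reachable(points, 3, 1, eps=1.1, min_pts=3))  # p from q
print(density_connected(points, 0, 3, eps=1.1, min_pts=3))
```

The asymmetry of reachability shows up directly: the search from a non-core object stops at once, while connectivity is symmetric by construction.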
Example: Density-reachability and density-connectivity
ε is given by the radius of the circles, and MinPts = 3.
1 Core objects
m, p, o, and r are core objects because the ε-neighborhood of
each contains at least three points
2 Directly density-reachable objects
q is directly density-reachable from m
m is directly density-reachable from p and vice versa
3 Indirectly density-reachable objects
q is indirectly density-reachable from p because q is directly
density-reachable from m and m is directly density-reachable from p
However, p is not indirectly density-reachable from q because q is
not a core object
Similarly, r and s are indirectly density-reachable from o, and o is
indirectly density-reachable from r
4 Indirectly Density-connected objects
o, r, and s are all indirectly density-connected
Definitions
A density-based cluster
A set of density-connected objects that is maximal with respect to
density-reachability
Every object not contained in any cluster is considered to be noise
DBSCAN
Main steps:
1 Search for clusters by checking the ε-neighborhood of each
point in the database
2 If the ε-neighborhood of a point p contains at least MinPts points, a new
cluster with p as a core object is created
3 Iteratively collect directly density-reachable objects from these
core objects, which may involve the merge of a few
density-reachable clusters
4 Terminate process when no new point can be added to any
cluster
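The main steps above can be sketched as follows. This is a simplified illustration rather than the published algorithm verbatim: the function name `dbscan` and the `NOISE` label are assumptions, the distance is Euclidean, and region queries are plain linear scans. Because each cluster is expanded fully before the next one starts, no explicit merge step is needed in this variant.

```python
import math

NOISE = -1  # assumed label for objects not contained in any cluster

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def region_query(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        nbrs = region_query(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster_id  # i is a core object: start a new cluster
        seeds = list(nbrs)
        k = 0
        while k < len(seeds):
            j = seeds[k]
            k += 1
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core object
            if labels[j] is not None:
                continue  # already assigned
            labels[j] = cluster_id
            j_nbrs = region_query(j)
            if len(j_nbrs) >= min_pts:
                seeds.extend(j_nbrs)  # j is core too: keep growing the cluster
        cluster_id += 1  # no new point can be added to this cluster
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
```

In this toy data set the two squares form two clusters and the isolated point is labeled as noise.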
[Figure: step-by-step illustration of a DBSCAN run, panels Step 1 through Step 8]
DBSCAN Algorithm
The computational complexity is O(n^2), where n is the number of
database objects; with a spatial index it is commonly cited as O(n log n)
With appropriate settings of the user-defined parameters ε and
MinPts, the algorithm is effective at finding arbitrary-shaped
clusters
[Figure: panels "Original points", "Point types", and "Clusters"]
Why is DBSCAN not enough?
Very different local densities may be needed to reveal clusters in
different regions
Clusters A , B , C1 , C2 , and C3 cannot be detected using one global
density parameter
A global density parameter can detect either A , B , C or C1 , C2 , C3
Option: use hierarchical clustering, but
Single-link effect
Hard to interpret
Alternative: use OPTICS
[Figure: panels "Original points", "MinPts=4, ε=9.75", and "MinPts=4, ε=9.92"]
OPTICS Principle
Produce a special order of the database
with respect to its density-based clustering structure
containing information about every clustering level of the data set
(up to a generating distance ε)
Which information to use?
Definitions
Core-distance of an object
The core-distance of an object p is the smallest ε′ that makes p a
core object
If p is not a core object, the core-distance of p is undefined
Example (ε, MinPts=5)
ε′ is the core-distance of p
It is the distance between p and the fourth closest object, so that,
counting p itself, the ε′-neighborhood of p holds MinPts = 5 objects
Definitions
Reachability-distance of an object
The reachability-distance of an object q with respect to
object p is:
max(core-distance(p), Euclidean(p, q))
Example
reachability-distance(q1, p) = core-distance(p) = ε′
reachability-distance(q2, p) = Euclidean(q2, p)
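Both OPTICS distances can be computed directly from their definitions. The sketch below is illustrative only: the function names are hypothetical, the neighborhood is assumed to count p itself (consistent with the "fourth closest object" for MinPts = 5), and an undefined value is represented by `None`, which is an assumption of this sketch.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Smallest radius that makes points[p] a core object, or None if undefined."""
    dists = sorted(math.dist(points[p], x) for x in points)  # dists[0] is p itself
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return None  # p is not a core object within the generating distance eps
    return dists[min_pts - 1]

def reachability_distance(points, q, p, eps, min_pts):
    """max(core-distance(p), Euclidean(p, q)), or None if p is not a core object."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[p], points[q]))

# Hypothetical collinear data; index 0 plays the role of p.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (6, 0)]
print(core_distance(points, 0, eps=10, min_pts=5))             # fourth closest object
print(reachability_distance(points, 1, 0, eps=10, min_pts=5))  # inside the core-distance
print(reachability_distance(points, 5, 0, eps=10, min_pts=5))  # outside the core-distance
```

An object closer to p than the core-distance gets the core-distance itself, which mirrors the q1 case in the example; a farther object gets its plain Euclidean distance, as for q2.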
OPTICS Algorithm
Stands for Ordering Points to Identify the Clustering Structure
Produces an ordering of the objects rather than a single set of density-based clusters
Constructs different clusterings simultaneously
The objects should be processed in a specific order
This order selects an object that is density-reachable with respect
to the lowest ε value
so that clusters with higher density (lower ε) will be finished first
Based on this idea, two values need to be stored for each
object: core-distance and reachability-distance
This information is sufficient for the extraction of all density-based
clusterings with respect to any distance ε′ that is smaller than the
distance ε used in generating the order
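The ordering idea can be sketched as below. This is a simplified illustration and not the published OPTICS procedure in full: the function name `optics_order` is hypothetical, the seed list is a plain dictionary scanned for its minimum (a priority queue would be used in practice), distances are Euclidean, and undefined values are stored as `None`.

```python
import math

def optics_order(points, eps, min_pts):
    """Return (ordering, reachability, core_dist) for a simplified OPTICS pass."""
    n = len(points)

    def dist(a, b):
        return math.dist(points[a], points[b])

    def neighbors(i):
        return [j for j in range(n) if dist(i, j) <= eps]

    def core_distance(i, nbrs):
        if len(nbrs) < min_pts:
            return None
        return sorted(dist(i, j) for j in nbrs)[min_pts - 1]

    processed = [False] * n
    reach = [None] * n  # reachability-distance, None = undefined
    core = [None] * n   # core-distance, None = undefined
    ordering = []

    def update(i, nbrs, seeds):
        # Offer each unprocessed neighbor its reachability-distance from i.
        for j in nbrs:
            if processed[j]:
                continue
            new_reach = max(core[i], dist(i, j))
            if j not in seeds or new_reach < seeds[j]:
                seeds[j] = new_reach

    for start in range(n):
        if processed[start]:
            continue
        nbrs = neighbors(start)
        processed[start] = True
        core[start] = core_distance(start, nbrs)
        ordering.append(start)
        if core[start] is None:
            continue
        seeds = {}
        update(start, nbrs, seeds)
        while seeds:
            j = min(seeds, key=seeds.get)  # lowest reachability-distance first
            reach[j] = seeds.pop(j)
            nbrs_j = neighbors(j)
            processed[j] = True
            core[j] = core_distance(j, nbrs_j)
            ordering.append(j)
            if core[j] is not None:
                update(j, nbrs_j, seeds)
    return ordering, reach, core

points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
ordering, reach, core = optics_order(points, eps=20, min_pts=2)
print(ordering)
print(reach)
```

In the printed reachability values, a large jump marks the transition from one dense group to the next; cutting the ordering at such jumps for a chosen ε′ yields the corresponding clustering.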
DENCLUE Algorithm
DENCLUE stands for DENsity-based CLUstEring
It is a clustering method based on density distribution functions
DENCLUE is built on the following ideas:
1 the influence of each data point can be formally modeled using a
mathematical function, called an influence function
2 the overall density of the data space is the sum of the influence
functions of all data points
3 clusters can then be determined mathematically by identifying
density attractors, where density attractors are local maxima of
the overall density function
DENCLUE Algorithm
Influence function
Let x and y be objects or points in F^d, a d-dimensional input space
The influence function of data object y on x is a function:
\[ f_B^y(x) = f_B(x, y) \]
It can be used to compute a square wave influence function,
\[ f_{Square}(x, y) = \begin{cases} 0 & \text{if } d(x, y) > \sigma \\ 1 & \text{otherwise} \end{cases} \]
or a Gaussian influence function,
\[ f_{Gauss}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}} \]
σ is a threshold parameter
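The two influence functions translate directly into code. In this minimal sketch the function names are hypothetical, and d(x, y) is assumed to be the Euclidean distance between coordinate tuples.

```python
import math

def square_influence(x, y, sigma):
    """Square wave influence: 0 if d(x, y) > sigma, 1 otherwise."""
    return 0.0 if math.dist(x, y) > sigma else 1.0

def gaussian_influence(x, y, sigma):
    """Gaussian influence: exp(-d(x, y)^2 / (2 * sigma^2))."""
    d = math.dist(x, y)
    return math.exp(-(d * d) / (2.0 * sigma * sigma))

print(square_influence((0, 0), (3, 4), sigma=5))    # distance 5 is not greater than sigma
print(square_influence((0, 0), (3, 4), sigma=4.9))  # distance 5 exceeds sigma
print(gaussian_influence((0, 0), (1, 0), sigma=1))  # exp(-0.5)
```

The square wave gives a hard cut-off at σ, while the Gaussian decays smoothly and never reaches exactly zero.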
DENCLUE Algorithm
Density function
The density function at an object or point x is defined as the sum of
influence functions of all data points
That is, it is the total influence on x of all of the data points
Given n data objects, D = {x_1, ..., x_n}, the density function at x is defined as
\[ f_B^D(x) = \sum_{i=1}^{n} f_B^{x_i}(x) = f_B^{x_1}(x) + f_B^{x_2}(x) + \cdots + f_B^{x_n}(x) \]
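The density function is simply a sum over the data set. The sketch below assumes a Gaussian influence function and Euclidean distance; the names `gaussian_influence` and `density` are hypothetical.

```python
import math

def gaussian_influence(x, y, sigma):
    """Gaussian influence of data point y on location x."""
    return math.exp(-math.dist(x, y) ** 2 / (2.0 * sigma ** 2))

def density(x, data, sigma, influence=gaussian_influence):
    """Total influence on x of all data points: the sum of their influence functions."""
    return sum(influence(x, xi, sigma) for xi in data)

data = [(0.0, 0.0), (2.0, 0.0)]
# The midpoint is at distance 1 from both points, so the density is 2 * exp(-0.5).
print(density((1.0, 0.0), data, sigma=1.0))
```

Passing a different `influence` callable (for example a square wave) changes the shape of the resulting density surface without touching the summation.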
DENCLUE Algorithm
[Figure: possible density functions for a 2-D data set; panels "Dataset", "Square", and "Gaussian"]
DENCLUE Algorithm
From the density function, we can define the density attractors, the
local maxima of the overall density function
A hill-climbing algorithm guided by the gradient can be used to
determine the density attractor of a set of data points
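A gradient-guided hill climb toward a density attractor can be sketched as follows. This is one possible formulation under stated assumptions, not necessarily the exact procedure of the DENCLUE paper: it uses a Gaussian density, a fixed step length along the normalized gradient, and stops as soon as a step no longer increases the density. The function names, the step size and the iteration cap are all hypothetical choices.

```python
import math

def density(x, data, sigma):
    """Gaussian density at x: sum of Gaussian influences of all data points."""
    return sum(math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2)) for xi in data)

def gradient(x, data, sigma):
    """Vector proportional to the gradient of the Gaussian density at x."""
    grad = [0.0] * len(x)
    for xi in data:
        w = math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2))
        for k in range(len(x)):
            grad[k] += (xi[k] - x[k]) * w
    return grad

def find_attractor(x, data, sigma, step=0.05, max_iter=1000):
    """Climb from x along the gradient until the density stops increasing."""
    x = list(x)
    for _ in range(max_iter):
        g = gradient(x, data, sigma)
        norm = math.sqrt(sum(c * c for c in g))
        if norm == 0.0:
            break  # flat spot: nothing to follow
        candidate = [xk + step * gk / norm for xk, gk in zip(x, g)]
        if density(candidate, data, sigma) <= density(x, data, sigma):
            break  # the step no longer improves the density
        x = candidate
    return x

# Two small, well separated groups around (0, 0) and (5, 5).
data = [(0, 0), (0.2, 0), (-0.2, 0), (0, 0.2), (0, -0.2),
        (5, 5), (5.2, 5), (4.8, 5), (5, 5.2), (5, 4.8)]
print(find_attractor((0.6, 0.4), data, sigma=0.5))
print(find_attractor((4.5, 5.3), data, sigma=0.5))
```

Points whose climbs end at the same attractor would be assigned to the same cluster; the fixed step length limits how precisely the attractor is located.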
References
J. Han, M. Kamber, Data Mining: Concepts and Techniques, Elsevier
Inc. (2006). (Chapter 7)