density-based-clustering-technique
Results of a k-medoid
ε-Neighborhood of p
ε-Neighborhood of q
Density of p is “high” (MinPts = 4)
Density of q is “low” (MinPts = 4)
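The density test above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the toy data are hypothetical, and it assumes Euclidean distance and that a point counts as a member of its own ε-neighborhood.

```python
import math

def eps_neighborhood(points, p, eps):
    """Return all points within distance eps of p (p itself included, by assumption)."""
    return [x for x in points if math.dist(p, x) <= eps]

def is_dense(points, p, eps, min_pts):
    """Density at p is 'high' when its eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

# Hypothetical toy data: four points packed together and one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (5.0, 5.0)]
print(is_dense(points, (0.0, 0.0), eps=1.0, min_pts=4))  # True
print(is_dense(points, (5.0, 5.0), eps=1.0, min_pts=4))  # False
```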
Core, Border & Outlier
Given ε and MinPts, categorize the objects into three exclusive groups: core, border, and outlier.
(Figure: MinPts = 3; Eps = radius of the circles.)
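A possible way to compute the three groups is sketched below. The helper name `classify` and the sample points are assumptions made for illustration; the sketch uses Euclidean distance, counts a point as its own neighbor, and treats a non-core point as a border point when at least one core point lies in its ε-neighborhood.

```python
import math

def classify(points, eps, min_pts):
    """Label each point 'core', 'border', or 'outlier'."""
    neighbors = {p: [q for q in points if math.dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbors[p]) >= min_pts}
    labels = {}
    for p in points:
        if p in core:
            labels[p] = "core"
        elif any(q in core for q in neighbors[p]):
            labels[p] = "border"      # not dense itself, but next to a core point
        else:
            labels[p] = "outlier"
    return labels

# Hypothetical toy data.
points = [(0, 0), (1, 0), (0, 1), (2, 0), (10, 10)]
labels = classify(points, eps=1.0, min_pts=3)
for p in points:
    print(p, labels[p])
```

With these values, (0, 0) and (1, 0) come out as core, (0, 1) and (2, 0) as border, and (10, 10) as an outlier.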
Density-Reachability
Directly density-reachable
An object q is directly density-reachable from object p if p is a core object and q is in p’s ε-neighborhood.
MinPts = 4
Density-reachability
Density-reachable (directly and indirectly):
A point p is directly density-reachable from p2;
p2 is directly density-reachable from p1;
p1 is directly density-reachable from q;
p ← p2 ← p1 ← q form a chain.
p is (indirectly) density-reachable from q.
q is not density-reachable from p?
(Figure: chain q, p1, p2, p; MinPts = 7)
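The chain idea can be made concrete with a breadth-first search that only continues through core objects. The sketch below is an illustration under stated assumptions (Euclidean distance, a point is its own neighbor, hypothetical names and data), not the slides' own code.

```python
import math
from collections import deque

def density_reachable(points, source, eps, min_pts):
    """Return the set of points density-reachable from source."""
    def neighborhood(p):
        return [x for x in points if math.dist(p, x) <= eps]

    reached = set()
    queue = deque([source])
    while queue:
        p = queue.popleft()
        nbrs = neighborhood(p)
        if len(nbrs) < min_pts:
            continue  # p is not a core object, so no chain can continue through it
        for x in nbrs:
            if x not in reached:
                reached.add(x)
                queue.append(x)
    return reached

# Hypothetical data: four points on a line; only the two middle ones are core.
points = [(0, 0), (1, 0), (2, 0), (3, 0)]
q, p = (1, 0), (3, 0)
print(p in density_reachable(points, q, eps=1.0, min_pts=3))  # True
print(q in density_reachable(points, p, eps=1.0, min_pts=3))  # False
```

Here p is reached from q through the core object (2, 0), while nothing is reachable from p because p is not a core object.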
Density-Connectivity
Density-reachability is not symmetric,
so it is not good enough to describe clusters.
Density-connected:
A pair of points p and q are density-connected if they are both density-reachable from a common point o.
Density-connectivity is symmetric.
Formal Description of Cluster
P is a core object.
Review of Concepts
Is an object o in a cluster or an outlier?
Are objects p and q in the same cluster?
DBSCAN Algorithm
DBSCAN: The Algorithm
Arbitrarily select a point p.
for each o ∈ D do
    if o is not yet classified then
        if o is a core object then
            collect all objects density-reachable from o
            and assign them to a new cluster.
        else
            assign o to NOISE
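The pseudocode can be turned into a small runnable sketch. The version below is one possible reading, assuming Euclidean distance and a brute-force neighborhood query; the names `dbscan`, `region_query`, and `NOISE`, and the sample data, are illustrative choices. A point first marked as noise is relabeled as a border point if a later core object reaches it.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster id per point (NOISE for outliers)."""
    def region_query(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)          # None = not yet classified
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region_query(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE              # may become a border point later
            continue
        labels[i] = cluster_id             # i is a core object: start a new cluster
        queue = list(seeds)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id     # border point reached from a core object
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nbrs = region_query(j)
            if len(nbrs) >= min_pts:       # j is a core object: keep expanding
                queue.extend(nbrs)
        cluster_id += 1
    return labels

# Hypothetical data: two small squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The brute-force `region_query` makes the whole run quadratic in the number of points; a spatial index would replace that function without changing the rest.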
DBSCAN Algorithm: Example
Parameters
ε = 2 cm
MinPts = 3
DBSCAN Algorithm
Eliminate noise points
Perform clustering on the remaining points
(Figure: points P and P1 with cluster labels C1; MinPts = 5)
Example
ε = 10, MinPts = 4
When DBSCAN Works Well
• Resistant to Noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well
(Figure panels: Original Points; MinPts=4, Eps=9.92; MinPts=4, Eps=9.75)
DBSCAN: Sensitive to Parameters
Determining the Parameters ε and MinPts
Cluster: point density higher than specified by ε and MinPts
Idea: use the point density of the least dense cluster in the data set as parameters, but how to determine this?
Heuristic: look at the distances to the k-nearest neighbors
(Figure: 3-distance(p) and 3-distance(q) for two example points p and q)
Thus, eps=10
Determining the Parameters ε and MinPts
Example k-distance plot
(Figure: 3-distance plotted over the objects; the “border object” sits at the first “valley”)
Heuristic method:
Fix a value for MinPts (default: 2·d − 1, where d is the dimensionality of the data)
User selects “border object” o from the MinPts-distance plot;
ε is set to MinPts-distance(o)
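The values behind a k-distance plot can be computed as sketched below. This is an illustrative sketch with hypothetical names and data; it assumes Euclidean distance and does not count a point as its own neighbor, a convention that differs between presentations. The values are sorted in descending order, which is the usual way to draw the plot.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result, reverse=True)

# Hypothetical data: one tight square and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kd = k_distances(points, k=3)
print([round(v, 2) for v in kd])  # [13.45, 1.41, 1.41, 1.41, 1.41]
```

The sharp drop after the first value is the kind of "valley" a user would look for; picking ε just above 1.41 would keep the square as one cluster and leave the isolated point as noise.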
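Determining the Parameters ε and MinPts
Problematic example
(Figure: clusters A to G of differing density, including B’, D’, D1, D2, G1, G2, G3, and their 3-distance plot over the objects, with plateaus labeled A, B, C; B, D, E; B’, D’, F, G; D1, D2, G1, G2, G3)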
Density Based Clustering:
Discussion
Advantages
Clusters can have arbitrary shape and size
Number of clusters is determined automatically
Can separate clusters from surrounding noise
Can be supported by spatial index structures
Disadvantages
Input parameters may be difficult to determine
In some situations very sensitive to input parameter setting
OPTICS: Ordering Points To Identify the Clustering Structure
DBSCAN
Input parameters are hard to determine.
Algorithm is very sensitive to input parameters.
OPTICS: Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
Based on DBSCAN.
Does not produce clusters explicitly.
Rather generates an ordering of data objects representing the density-based clustering structure.
OPTICS (cont’d)
Produces a special order of the database with respect to its density-based clustering structure
This cluster-ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings
Good for both automatic and interactive cluster analysis, including finding intrinsic clustering structure
Can be represented graphically or using visualization techniques
Density-Based Hierarchical Clustering
Observation: Dense clusters are completely contained by less dense clusters
(Figure: clusters C and D, with denser subclusters C1 and C2)
Idea: Process objects in the “right” order and keep track of point density in their neighborhood
(Figure: cluster C with subclusters C1 and C2; MinPts = 3)
Core- and Reachability Distance
Parameters: “generating” distance ε, fixed value MinPts
core-distance_{ε,MinPts}(o)
“smallest distance such that o is a core object”
(if that distance is ≤ ε; “?” otherwise)
reachability-distance_{ε,MinPts}(p, o)
“smallest distance such that p is directly density-reachable from o”
(if that distance is ≤ ε; “?” otherwise)
(Figure: MinPts = 5; object o with core-distance(o), and points p and q with reachability-distance(p, o) and reachability-distance(q, o))
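The two distances can be sketched as follows. This is an illustration, not the slides' code: the function names and data are hypothetical, `math.inf` stands in for the “?” value, Euclidean distance is assumed, and o counts as a member of its own neighborhood. The reachability distance is computed as the larger of core-distance(o) and dist(o, p), which is the smallest radius at which o is a core object and p lies inside its neighborhood.

```python
import math

UNDEFINED = math.inf  # stands in for the "?" value (an assumption of this sketch)

def core_distance(points, o, eps, min_pts):
    """Distance from o to its min_pts-th nearest point (o included), if <= eps."""
    dists = sorted(math.dist(o, x) for x in points)
    if len(dists) < min_pts or dists[min_pts - 1] > eps:
        return UNDEFINED
    return dists[min_pts - 1]

def reachability_distance(points, p, o, eps, min_pts):
    """Smallest radius at which p is directly density-reachable from o, if <= eps."""
    c = core_distance(points, o, eps, min_pts)
    r = max(c, math.dist(o, p))
    return r if r <= eps else UNDEFINED

# Hypothetical data: p lies inside the core distance of o, q lies outside it.
points = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0), (0.0, 2.0), (4.0, 0.0)]
o, p, q = points[0], points[1], points[4]
print(core_distance(points, o, eps=5.0, min_pts=3))             # 2.0
print(reachability_distance(points, p, o, eps=5.0, min_pts=3))  # 2.0
print(reachability_distance(points, q, o, eps=5.0, min_pts=3))  # 4.0
```

For p the result is clamped up to the core distance, while for q it is simply the distance to o.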
OPTICS: Extension of
DBSCAN
Order points by shortest reachability distance to guarantee that clusters w.r.t. higher density are finished first
(for a constant MinPts, higher density requires lower ε)
The Algorithm OPTICS
Basic data structure: controlList
Memorize shortest reachability distances seen so far
(“distance of a jump to that point”)
Visit each point
Always make the shortest jump
Output:
order of points
core-distance of points
reachability-distance of points
The Algorithm OPTICS
ControlList ordered by reachability-distance.
(Figure: database → ControlList → cluster-ordered file)
foreach o ∈ Database
    // initially, o.processed = false for all objects o
    if o.processed = false
        insert (o, “?”) into ControlList;
    while ControlList is not empty
        select first element (o, r_dist) from ControlList;
        retrieve N_ε(o) and determine c_dist = core-distance(o);
        set o.processed = true;
        write (o, r_dist, c_dist) to file;
        if o is a core object at any distance ≤ ε
            foreach p ∈ N_ε(o) not yet processed
                determine r_dist_p = reachability-distance(p, o);
                if (p, _) ∉ ControlList
                    insert (p, r_dist_p) in ControlList;
                else if (p, old_r_dist) ∈ ControlList and r_dist_p < old_r_dist
                    update (p, r_dist_p) in ControlList;
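A compact Python sketch of this loop is given below. It is an illustration under stated assumptions rather than a reference implementation: the ControlList is a plain dictionary scanned for its minimum (a priority queue would be the usual optimization), `math.inf` stands in for “?”, distances are Euclidean, and all names and data are hypothetical.

```python
import math

UNDEFINED = math.inf  # stands in for the "?" value

def optics(points, eps, min_pts):
    """Return the cluster ordering as (index, r_dist, c_dist) triples."""
    n = len(points)

    def neighbors(i):
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    def core_distance(i, nbrs):
        if len(nbrs) < min_pts:
            return UNDEFINED
        return sorted(math.dist(points[i], points[j]) for j in nbrs)[min_pts - 1]

    processed = [False] * n
    ordering = []
    for start in range(n):
        if processed[start]:
            continue
        control = {start: UNDEFINED}           # ControlList: index -> best r_dist so far
        while control:
            o = min(control, key=control.get)  # always make the shortest jump
            r_dist = control.pop(o)
            nbrs = neighbors(o)
            c_dist = core_distance(o, nbrs)
            processed[o] = True
            ordering.append((o, r_dist, c_dist))
            if c_dist == UNDEFINED:
                continue                       # o is not a core object within eps
            for p in nbrs:
                if processed[p]:
                    continue
                new_r = max(c_dist, math.dist(points[o], points[p]))
                if new_r < control.get(p, UNDEFINED):
                    control[p] = new_r         # insert, or update with a shorter jump
    return ordering

# Hypothetical data: two small squares far apart.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
for idx, r_dist, c_dist in optics(points, eps=20.0, min_pts=3):
    print(idx, round(r_dist, 2), round(c_dist, 2))
```

In the printed ordering, the first square is visited completely before the second; the single large reachability value marks the jump between the two.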
OPTICS: Properties
“Flat” density-based clusters w.r.t. ε* ≤ ε and MinPts afterwards:
Starts with an object o where c-dist(o) ≤ ε* and r-dist(o) > ε*
Continues while r-dist ≤ ε*
(Figure: numbered objects in cluster order with their core-distance and reachability-distance)
Performance: approx. runtime(DBSCAN(ε, MinPts))
O(n * runtime(ε-neighborhood-query))
without spatial index support (worst case): O(n²)
e.g. tree-based spatial index support: O(n log(n))
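The rule for reading flat clusters off the ordering can be sketched as a single pass. The sketch below assumes the ordering is a list of (object, r-dist, c-dist) triples in output order, with `math.inf` for undefined values; the function name and the sample ordering are hypothetical.

```python
import math

NOISE = -1

def extract_flat_clusters(ordering, eps_star):
    """Assign a cluster id (or NOISE) to each object of an OPTICS ordering."""
    labels = {}
    cluster_id = -1
    for obj, r_dist, c_dist in ordering:
        if r_dist > eps_star:
            if c_dist <= eps_star:
                cluster_id += 1            # obj starts a new cluster
                labels[obj] = cluster_id
            else:
                labels[obj] = NOISE
        else:
            labels[obj] = cluster_id       # r_dist <= eps_star: continue the cluster
    return labels

# Hypothetical ordering: (object, r_dist, c_dist).
ordering = [("a", math.inf, 1.0), ("b", 1.0, 1.0), ("c", 1.0, 1.0),
            ("d", 12.7, 1.0), ("e", 1.0, 1.0), ("f", 30.0, math.inf)]
print(extract_flat_clusters(ordering, eps_star=2.0))
# {'a': 0, 'b': 0, 'c': 0, 'd': 1, 'e': 1, 'f': -1}
```

Running the same pass with a different `eps_star` gives another flat clustering without recomputing the ordering.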
OPTICS: The Reachability Plot
represents the density-based clustering structure
easy to analyze
independent of the dimension of the data
(Figure: reachability plots, reachability distance over the cluster ordering, with annotations “large enough” and “undefined”)
                     DBSCAN                      OPTICS
Density              Boolean value (high/low)    Numerical value (core distance)
Density-connected    Boolean value (yes/no)      Numerical value (reachability distance)
Searching strategy   random                      greedy
When OPTICS Works Well
Gaussian:
$f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$
Density Function
$f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
Gradient: The steepness of a
slope
Example
$f_{Gaussian}(x, y) = e^{-\frac{d(x, y)^2}{2\sigma^2}}$
$f^{D}_{Gaussian}(x) = \sum_{i=1}^{N} e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
$\nabla f^{D}_{Gaussian}(x, x_i) = \sum_{i=1}^{N} (x_i - x) \cdot e^{-\frac{d(x, x_i)^2}{2\sigma^2}}$
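The Gaussian influence, the density function, and its gradient can be evaluated directly. The sketch below is an illustration with hypothetical function names and data; it assumes d is the Euclidean distance and that points are coordinate tuples.

```python
import math

def gaussian_influence(x, y, sigma):
    """Influence of y on x: exp(-d(x, y)^2 / (2 sigma^2))."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))

def density(x, data, sigma):
    """Sum of the Gaussian influences of all data points on x."""
    return sum(gaussian_influence(x, xi, sigma) for xi in data)

def gradient(x, data, sigma):
    """Sum over i of (x_i - x) * exp(-d(x, x_i)^2 / (2 sigma^2))."""
    grad = [0.0] * len(x)
    for xi in data:
        w = gaussian_influence(x, xi, sigma)
        for k in range(len(x)):
            grad[k] += (xi[k] - x[k]) * w
    return grad

# Hypothetical data: two points on the x-axis.
data = [(0.0, 0.0), (2.0, 0.0)]
print(density((1.0, 0.0), data, sigma=1.0))   # 2 * exp(-0.5)
print(gradient((1.0, 0.0), data, sigma=1.0))  # [0.0, 0.0]: the pulls cancel at the midpoint
print(gradient((0.0, 0.0), data, sigma=1.0))  # first component positive: pulled toward (2, 0)
```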
Denclue: Technical Essence
Clusters can be determined mathematically
by identifying density attractors.
Density attractors are local maxima of the
overall density function.
Density Attractor
Cluster Definition
Center-defined cluster
A subset of objects attracted by an attractor x with density(x) ≥ ξ
Arbitrary-shape cluster
A group of center-defined clusters which are connected by a path P
For each object x on P, density(x) ≥ ξ.
Center-Defined and Arbitrary
DENCLUE: How to find the clusters
Divide the space into grids, with size 2σ
Consider only grids that are highly populated
For each object, calculate its density attractor using hill climbing technique
Tricks can be applied to avoid calculating density attractor of all points
Density attractors form basis of all clusters
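The hill-climbing step can be sketched as a fixed-step gradient ascent on the Gaussian density. This is a simplified illustration, not DENCLUE's full procedure: it skips the grid and the shortcuts mentioned above, assumes Euclidean distance, and uses hypothetical names, a hypothetical step size `delta`, and toy data. The climb stops as soon as a step no longer increases the density.

```python
import math

def density(x, data, sigma):
    return sum(math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2)) for xi in data)

def gradient(x, data, sigma):
    grad = [0.0] * len(x)
    for xi in data:
        w = math.exp(-math.dist(x, xi) ** 2 / (2 * sigma ** 2))
        for k in range(len(x)):
            grad[k] += (xi[k] - x[k]) * w
    return grad

def find_attractor(x, data, sigma, delta=0.1, max_steps=1000):
    """Climb the density surface from x in steps of length delta."""
    x = list(x)
    current = density(x, data, sigma)
    for _ in range(max_steps):
        g = gradient(x, data, sigma)
        norm = math.hypot(*g)
        if norm == 0.0:
            break                                  # flat spot: nowhere to climb
        nxt = [xk + delta * gk / norm for xk, gk in zip(x, g)]
        nxt_density = density(nxt, data, sigma)
        if nxt_density <= current:
            break                                  # density stopped increasing
        x, current = nxt, nxt_density
    return x

# Hypothetical data: two groups of points.
data = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0)]
a1 = find_attractor((1.0, 1.0), data, sigma=0.5)
a2 = find_attractor((4.0, 4.0), data, sigma=0.5)
print([round(v, 2) for v in a1], [round(v, 2) for v in a2])
```

Objects whose climbs end near the same attractor would be grouped into the same center-defined cluster, provided the density at that attractor reaches the threshold.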
Features of DENCLUE
Major features
Solid mathematical foundation
Compact definition for density and cluster
Flexible for both center-defined clusters and arbitrary-shape clusters
But needs a large number of parameters
σ: parameter to calculate density
ξ: density threshold
δ: parameter to calculate attractor