
Density-Based Methods

Rubén Sánchez Corcuera


[email protected]

Clustering

■ Partitioning and hierarchical methods are designed to find spherical-shaped
clusters.
■ They have difficulty finding clusters of arbitrary shape, such as "S"-shaped
and oval clusters.
○ Given such data, they would likely inaccurately identify convex regions,
where noise or outliers are included in the clusters.
■ To find clusters of arbitrary shape, we can instead model clusters as dense
regions in the data space, separated by sparse regions.
■ This is the main strategy behind density-based clustering methods, which
can discover clusters of nonspherical shape.

DBSCAN: Density-Based Spatial Clustering of Applications with Noise

■ How can we find dense regions in density-based clustering?
○ The density of an object o can be measured by the number of objects close
to o.
○ DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds
core objects, that is, objects that have dense neighborhoods.
○ It connects core objects and their neighborhoods to form dense regions as
clusters.
■ How does DBSCAN quantify the neighborhood of an object?
○ A user-specified parameter ϵ > 0 specifies the radius of the neighborhood
we consider for every object.
○ The ϵ-neighborhood of an object o is the space within radius ϵ centered
at o.
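The ϵ-neighborhood can be computed directly from this definition. A minimal Python sketch, assuming 2-D points and Euclidean distance (the function name and toy data are illustrative, not from the slides):

```python
from math import dist

def eps_neighborhood(points, o, eps):
    # All objects within distance eps of o, including o itself.
    # Euclidean distance is used here; any metric could be substituted.
    return [p for p in points if dist(o, p) <= eps]

print(eps_neighborhood([(0, 0), (0, 2), (3, 3)], (0, 0), eps=2.0))
# → [(0, 0), (0, 2)]
```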
DBSCAN

■ Due to the fixed neighborhood size parameterized by ϵ, the density of a
neighborhood can be measured simply by the number of objects in the
neighborhood.
■ To determine whether a neighborhood is dense or not, DBSCAN uses another
user-specified parameter, MinPts, which specifies the density threshold of
dense regions.
■ An object is a core object if the ϵ-neighborhood of the object contains at
least MinPts objects.
■ Core objects are the pillars of dense regions.

DBSCAN

■ Given a set, D, of objects, we can identify all core objects with respect
to the given parameters, ϵ and MinPts.
■ The clustering task is thereby reduced to using core objects and their
neighborhoods to form dense regions, where the dense regions are clusters.
■ For a core object q and an object p, we say that p is directly
density-reachable from q (with respect to ϵ and MinPts) if p is within the
ϵ-neighborhood of q.
■ Clearly, an object p is directly density-reachable from another object q if
and only if q is a core object and p is in the ϵ-neighborhood of q.
■ Using the directly density-reachable relation, a core object can "bring"
all objects from its ϵ-neighborhood into a dense region.
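Following the definition above, core objects can be identified in one pass over the data. A minimal sketch, assuming Euclidean distance and counting the object itself as part of its own ϵ-neighborhood (the function name and toy data are illustrative):

```python
from math import dist

def core_objects(points, eps, min_pts):
    # An object is a core object if its eps-neighborhood contains
    # at least min_pts objects (the object itself included).
    return [p for p in points
            if sum(1 for q in points if dist(p, q) <= eps) >= min_pts]

# A dense clump of four points plus one isolated point.
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
print(core_objects(pts, eps=1.5, min_pts=3))
# → [(0, 0), (0, 1), (1, 0), (1, 1)]; (8, 8) has no dense neighborhood
```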

DBSCAN

■ How can we assemble a large dense region using small dense regions centered
by core objects?
■ In DBSCAN, p is density-reachable from q (with respect to ϵ and MinPts in D)
if there is a chain of objects p1, …, pn such that p1 = q, pn = p, and pi+1
is directly density-reachable from pi with respect to ϵ and MinPts, for
1 ≤ i < n, where each pi ∈ D.
■ Note that density-reachability is not an equivalence relation because it is
not symmetric.
■ If both o1 and o2 are core objects and o1 is density-reachable from o2,
then o2 is density-reachable from o1. However, if o2 is a core object but o1
is not, then o1 may be density-reachable from o2, but not vice versa.

DBSCAN

■ To connect core objects as well as their neighbors in a dense region,
DBSCAN uses the notion of density-connectedness. Two objects p1, p2 ∈ D are
density-connected with respect to ϵ and MinPts if there is an object q ∈ D
such that both p1 and p2 are density-reachable from q with respect to ϵ and
MinPts.
■ Unlike density-reachability, density-connectedness is an equivalence
relation. It is easy to show that, for objects o1, o2, and o3, if o1 and o2
are density-connected, and o2 and o3 are density-connected, then so are o1
and o3.
Density-reachability and density-connectivity example

■ Consider the figure for a given ϵ, represented by the radius of the
circles, and, say, let MinPts = 3.
■ Of the labeled points, m, p, o, and r are core objects because each is in
an ϵ-neighborhood containing at least three points.
■ Object q is directly density-reachable from m. Object m is directly
density-reachable from p and vice versa.
■ Object q is (indirectly) density-reachable from p because q is directly
density-reachable from m and m is directly density-reachable from p.
However, p is not density-reachable from q because q is not a core object.

■ Similarly, r and s are density-reachable from o, and o is density-reachable
from r. Thus, o, r, and s are all density-connected.

DBSCAN

■ We can use the closure of density-connectedness to find connected dense
regions as clusters.
■ Each closed set is a density-based cluster. A subset C ⊆ D is a cluster if:
1. for any two objects o1, o2 ∈ C, o1 and o2 are density-connected;
2. there does not exist an object o ∈ C and another object o’ ∈ (D − C)
such that o and o’ are density-connected.
DBSCAN: How does it find clusters?

■ Initially, all objects in a given data set D are marked as "unvisited."
■ DBSCAN randomly selects an unvisited object p, marks p as "visited," and
checks whether the ϵ-neighborhood of p contains at least MinPts objects. If
not, p is marked as a noise point.
■ Otherwise, a new cluster C is created for p, and all the objects in the
ϵ-neighborhood of p are added to a candidate set, N. DBSCAN iteratively
adds to C those objects in N that do not belong to any cluster.
■ In this process, for an object p’ in N that carries the label "unvisited,"
DBSCAN marks it as "visited" and checks its ϵ-neighborhood.
■ If the ϵ-neighborhood of p’ has at least MinPts objects, those objects in
the ϵ-neighborhood of p’ are added to N.
■ DBSCAN continues adding objects to C until C can no longer be expanded,
that is, until N is empty. At this time, cluster C is completed and is
output.
■ To find the next cluster, DBSCAN randomly selects an unvisited object from
the remaining ones. The clustering process continues until all objects are
visited.
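The procedure above can be sketched in a few dozen lines of Python. This is an illustrative implementation, not an optimized one: it scans all points for every neighborhood query (the quadratic behavior noted later in the deck), and it uses a fixed visit order rather than random selection; all names and toy data are assumptions for the sketch.

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (its eps-neighborhood)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return a label per point: a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)            # None means "unvisited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                          # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1                    # noise (may later join a cluster as a border point)
            continue
        cluster += 1                          # start a new cluster C
        labels[i] = cluster
        seeds = list(neighbors)               # the candidate set N
        while seeds:                          # expand C until N is empty
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster           # border point reclaimed from noise
            if labels[j] is not None:
                continue
            labels[j] = cluster               # previously unvisited point joins C
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:   # j is also a core object: grow N
                seeds.extend(j_neighbors)
    return labels

# Two dense clumps and one isolated point (illustrative data).
pts = [(0, 0), (0, 1), (1, 0), (1, 1),
       (8, 8), (8, 9), (9, 8), (9, 9),
       (4, 20)]
print(dbscan(pts, eps=1.5, min_pts=3))
# → [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Note the noise re-labeling step: a point first marked -1 can still become a border point of a later cluster, exactly as in the description above, but it is never expanded because it is not a core object.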

DBSCAN

■ If a spatial index is used, the computational complexity of DBSCAN is
O(n log n), where n is the number of database objects. Otherwise, the
complexity is O(n²).
■ With appropriate settings of the user-defined parameters, ϵ and MinPts, the
algorithm is effective in finding arbitrary-shaped clusters.
DBSCAN: Advantages

■ Handles irregularly shaped and sized clusters. One of the main advantages
of DBSCAN is its ability to detect clusters that are irregularly shaped. Of
the common clustering algorithms, DBSCAN makes among the fewest assumptions
about the shape of your clusters, so it can detect clusters that are oddly
or irregularly shaped, such as ring-shaped clusters.
■ Robust to outliers. Another big advantage of DBSCAN is that it can detect
outliers and exclude them from the clusters entirely, which makes it well
suited to datasets with many outliers.
■ Does not require the number of clusters to be specified. DBSCAN does not
require the user to specify the number of clusters; instead, it
automatically detects the number of clusters in the data. This is useful
when you have little intuition about how many clusters there should be.
■ Less sensitive to initialization conditions. DBSCAN is less sensitive to
initialization conditions, such as the order of the observations in the
dataset and the random seed, than some other clustering algorithms. Some
points on the borders between clusters may shift when initialization
conditions change, but the majority of observations should remain in the
same cluster.
■ Relatively fast. While DBSCAN is not the fastest clustering algorithm, it
is certainly not the slowest either, and there are multiple implementations
that aim to optimize its time complexity. DBSCAN is generally slower than
k-means clustering but faster than hierarchical clustering and spectral
clustering.

DBSCAN: Disadvantages

■ Difficult to incorporate categorical features. One of the main
disadvantages of DBSCAN is that it does not perform well on datasets with
categorical features, so it is best used when most of your features are
numeric.
■ Requires a drop in density to detect cluster borders. With DBSCAN, there
must be a drop in the density of data points between clusters for the
algorithm to detect the boundaries between them. If multiple clusters
overlap without a drop in density between them, they may be grouped into a
single cluster.
■ Struggles with clusters of varying density. DBSCAN also has difficulty
detecting clusters of varying density, because it determines where clusters
start and stop by looking at places where the density of data points drops
below a certain threshold. It may be difficult to find a threshold that
captures all of the points in a less dense cluster without including too
many extraneous outliers from a denser cluster.
■ Sensitive to scale. Like many other clustering algorithms, DBSCAN is
sensitive to the scale of your variables, so you may need to rescale
variables that are on very different scales.
■ Struggles with high-dimensional data. Like many clustering algorithms,
DBSCAN tends to degrade when there are many features. In general, you are
better off using dimensionality reduction or feature selection techniques
to reduce the number of features if you have a high-dimensional dataset.
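As a concrete illustration of the scale-sensitivity point, features on very different scales can be z-score standardized before clustering so that a single ϵ is meaningful across all dimensions. A minimal stdlib-only sketch (the function name is illustrative; in practice a library scaler such as scikit-learn's StandardScaler would typically be used):

```python
def standardize(points):
    # Column-wise z-score standardization: subtract the mean and divide
    # by the (population) standard deviation of each feature.
    cols = list(zip(*points))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5
            for c, m in zip(cols, means)]
    return [tuple((x - m) / s for x, m, s in zip(p, means, stds))
            for p in points]

# One feature spanning ~2 units, the other ~1000: without standardization,
# Euclidean distance (and hence eps) is dominated by the second feature.
print(standardize([(0, 0), (2, 1000)]))
# → [(-1.0, -1.0), (1.0, 1.0)]
```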
DBSCAN: When to use it?

■ You suspect there may be irregularly shaped clusters. If you have reason to
expect that the clusters in your dataset may be irregularly shaped, DBSCAN
is a great option. It can identify clusters that are spherical or
ellipsoidal as well as clusters with more irregular shapes.
■ Data has outliers. DBSCAN is also a great option when there are many
outliers in your dataset. It can detect outlying data points that do not
belong to any cluster and exclude them from the clusters.
■ Anomaly detection. Since DBSCAN automatically detects outliers and excludes
them from all clusters, it is also a good option when you want to detect
outliers in your dataset.

DBSCAN: When NOT to use it?

■ No drop in density between clusters. In general, DBSCAN requires a drop in
the density of data points to detect boundaries between clusters, so you
should not use it if you do not expect much of a drop in density between
clusters. For example, if many of your clusters overlap, multiple clusters
might get grouped together into one large cluster.
■ Many categorical features. DBSCAN is generally intended for scenarios where
the majority of your features are numeric, so you should avoid it when you
have many categorical features. In these scenarios, you may be better off
using hierarchical clustering with an appropriate distance metric, or an
extension of k-means clustering such as k-modes or k-prototypes.

OPTICS (Ordering Points to Identify the Clustering Structure)

■ OPTICS is an extension of DBSCAN that performs better on datasets that have
clusters of varying densities.
■ It is also implemented in scikit-learn:
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html

Further reading

■ Section 10.4 in [Han & Kamber, 2016]
