14 Dbscan
DBSCAN
spherical-shaped clusters.
■ They have difficulty finding clusters of arbitrary shape such as the “S” shape
and oval clusters.
○ Given such data, they would likely inaccurately identify convex regions,
where noise or outliers are included in the clusters.
■ To find clusters of arbitrary shape, we can instead model clusters as
dense regions in the data space, separated by sparse regions.
■ This is the main strategy behind density-based clustering methods, which
can discover clusters of nonspherical shape.
■ Due to the fixed neighborhood size parameterized by ϵ, the density of a neighborhood can be measured simply by the number of objects in the neighborhood.
■ To determine whether a neighborhood is dense or not, DBSCAN uses another user-specified parameter, MinPts, which specifies the density threshold of dense regions.
■ An object is a core object if the ϵ-neighborhood of the object contains at least MinPts objects.
■ Core objects are the pillars of dense regions.
■ Given a set, D, of objects, we can identify all core objects with respect to the given parameters, ϵ and MinPts.
■ The clustering task is therein reduced to using core objects and their neighborhoods to form dense regions, where the dense regions are clusters.
■ For a core object q and an object p, we say that p is directly density-reachable from q (with respect to ϵ and MinPts) if p is within the ϵ-neighborhood of q.
■ Clearly, an object p is directly density-reachable from another object q if and only if q is a core object and p is in the ϵ-neighborhood of q.
■ Using the directly density-reachable relation, a core object can “bring” all objects from its ϵ-neighborhood into a dense region.
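The definitions of ϵ-neighborhood, core object, and directly density-reachable can be illustrated with a minimal sketch. The function names, the toy points, and the values eps = 1.1 and min_pts = 3 are assumptions chosen for illustration, and the sketch assumes that an object counts as a member of its own ϵ-neighborhood.

```python
from math import dist

def eps_neighborhood(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, p in enumerate(points) if dist(points[i], p) <= eps]

def core_objects(points, eps, min_pts):
    """Indices of points whose eps-neighborhood contains at least min_pts points."""
    return [i for i in range(len(points))
            if len(eps_neighborhood(points, i, eps)) >= min_pts]

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p is directly density-reachable from q:
    q must be a core object and p must lie in q's eps-neighborhood."""
    nbrs = eps_neighborhood(points, q, eps)
    return len(nbrs) >= min_pts and p in nbrs

# Hypothetical toy data: a unit square, one point on its edge, one far away
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (8.0, 8.0)]
cores = core_objects(points, eps=1.1, min_pts=3)
print(cores)                                                # [0, 1, 2, 3]
print(directly_density_reachable(points, 4, 3, 1.1, 3))     # True: 3 is core
print(directly_density_reachable(points, 3, 4, 1.1, 3))     # False: 4 is not core
```

Point 4 has only two objects in its neighborhood, so it is not a core object, yet it is still pulled into the dense region by core object 3.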
■ How can we assemble a large dense region using small dense regions centered on core objects?
■ In DBSCAN, p is density-reachable from q (with respect to ϵ and MinPts in D) if there is a chain of objects p1,…,pn in D, such that p1 = q, pn = p, and pi+1 is directly density-reachable from pi with respect to ϵ and MinPts, for 1 <= i < n.
■ Note that density-reachability is not an equivalence relation because it is not symmetric.
■ If both o1 and o2 are core objects and o1 is density-reachable from o2, then o2 is density-reachable from o1. However, if o2 is a core object but o1 is not, then o1 may be density-reachable from o2, but not vice versa.
■ To connect core objects as well as their neighbors in a dense region, DBSCAN uses the notion of density-connectedness. Two objects p1, p2 ∈ D are density-connected with respect to ϵ and MinPts if there is an object q ∈ D such that both p1 and p2 are density-reachable from q with respect to ϵ and MinPts.
■ Unlike density-reachability, density-connectedness is an equivalence relation. It is easy to show that, for objects o1, o2, and o3, if o1 and o2 are density-connected, and o2 and o3 are density-connected, then so are o1 and o3.
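Density-reachability and density-connectedness can be sketched as a breadth-first search that only expands through core objects. The helper names, the five collinear toy points, and the parameter values are assumptions for illustration. The sketch also assumes that an object belongs to its own ϵ-neighborhood and that nothing is density-reachable from a non-core object.

```python
from collections import deque
from math import dist

def neighbors(points, i, eps):
    """Indices of all points within distance eps of points[i] (i itself included)."""
    return [j for j, p in enumerate(points) if dist(points[i], p) <= eps]

def reachable_from(points, q, eps, min_pts):
    """Set of indices density-reachable from q (empty if q is not a core object)."""
    if len(neighbors(points, q, eps)) < min_pts:
        return set()
    seen, queue = {q}, deque([q])
    while queue:
        cur = queue.popleft()
        nbrs = neighbors(points, cur, eps)
        if len(nbrs) < min_pts:
            continue  # non-core object: it was reached, but cannot extend the chain
        for j in nbrs:
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return seen

def density_connected(points, a, b, eps, min_pts):
    """True if some object q has both a and b density-reachable from it."""
    return any(a in r and b in r
               for r in (reachable_from(points, q, eps, min_pts)
                         for q in range(len(points))))

# Hypothetical toy data: five points on a line, one unit apart
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
eps, min_pts = 1.1, 3   # the end points 0 and 4 are not core objects

print(0 in reachable_from(points, 3, eps, min_pts))   # True
print(3 in reachable_from(points, 0, eps, min_pts))   # False: not symmetric
print(density_connected(points, 0, 4, eps, min_pts))  # True, e.g. via q = 2
```

The two end points are not core objects, so neither is density-reachable from the other, yet they are density-connected because both are reachable from the core objects between them.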
Density-reachability and density-connectivity example
■ Consider the figure for a given ϵ represented by the radius of the circles, and, say, let MinPts = 3.
■ Of the labeled points, m, p, o, r are core objects because each is in an ϵ-neighborhood containing at least three points.
■ Object q is (indirectly) density-reachable from p because q is directly density-reachable from m and m is directly density-reachable from p.
■ However, p is not density-reachable from q because q is not a core object.
DBSCAN: How does it find clusters?
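One way to turn the definitions of core objects and density-reachability into a clustering procedure is sketched below: pick an unvisited core object, start a new cluster, and grow it through every object that is density-reachable from it. Objects that are never reached end up as noise. This is a minimal illustration, not a reference implementation. The function name, the label convention (-1 for noise), the toy points, and the parameter values are all assumptions.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    def neighbors(i):
        return [j for j, p in enumerate(points) if dist(points[i], p) <= eps]

    labels = [None] * len(points)    # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE        # provisional: may later become a border object
            continue
        cluster += 1                 # i is an unvisited core object: new cluster
        labels[i] = cluster
        frontier = list(nbrs)
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border object reached from a core object
            if labels[j] is not None:
                continue             # already assigned, do not expand again
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_pts:   # j is itself a core object: keep growing
                frontier.extend(j_nbrs)
    return labels

# Hypothetical toy data: two dense groups and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)   # [0, 0, 0, 0, 1, 1, 1, -1]
```

The neighbor search here is a brute-force scan over all points, which keeps the sketch short. Practical implementations typically use a spatial index to find ϵ-neighborhoods faster.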
DBSCAN: Advantages
■ Handles irregularly shaped and sized clusters. One of the main advantages of DBSCAN is its ability to detect clusters that are irregularly shaped. Of all the common clustering algorithms out there, DBSCAN is one of the algorithms that makes the fewest assumptions about the shape of your clusters. That means that DBSCAN can be used to detect clusters that are oddly or irregularly shaped, such as clusters that are ring-shaped.
■ Robust to outliers. Another big advantage of DBSCAN is that it is able to detect outliers and exclude them from the clusters entirely. That means that DBSCAN is very robust to outliers and great for datasets with multiple outliers.
■ Does not require the number of clusters to be specified. Yet another advantage of DBSCAN is that it does not require the user to specify the number of clusters. Instead, DBSCAN can automatically detect the number of clusters that exist in the data. This is great for cases where you do not have much intuition on how many clusters there should be.
■ Less sensitive to initialization conditions. DBSCAN is less sensitive to initialization conditions like the order of the observations in the dataset and the seed that is used than some other clustering algorithms. Some points that are on the borders between clusters may shift around when initialization conditions change, but the majority of the observations should remain in the same cluster.
■ Relatively fast. While DBSCAN is not the fastest clustering algorithm out there, it is certainly not the slowest either. There are multiple implementations of DBSCAN that aim to optimize the time complexity of the algorithm. DBSCAN is generally slower than k-means clustering but faster than hierarchical clustering and spectral clustering.
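The claims about ring-shaped clusters, outliers, and not fixing the number of clusters in advance can be checked on a small synthetic example. The sketch below is an illustration under stated assumptions: the helper names, the two concentric rings, the single outlier, and the values eps = 0.5 and min_pts = 3 are all hypothetical, and the clustering routine is a compact density-based region-growing pass rather than any particular library's implementation.

```python
from math import cos, dist, pi, sin

def ring(radius, n):
    """n evenly spaced points on a circle of the given radius around the origin."""
    return [(radius * cos(2 * pi * k / n), radius * sin(2 * pi * k / n))
            for k in range(n)]

def density_clusters(points, eps, min_pts):
    """Grow clusters outward from core objects; -1 marks noise."""
    nbrs = [[j for j, p in enumerate(points) if dist(points[i], p) <= eps]
            for i in range(len(points))]
    core = [len(n) >= min_pts for n in nbrs]
    labels = [-1] * len(points)
    cluster = 0
    for start in range(len(points)):
        if not core[start] or labels[start] != -1:
            continue                      # only unassigned core objects seed clusters
        labels[start] = cluster
        stack = [start]
        while stack:
            cur = stack.pop()             # cur is always a core object
            for j in nbrs[cur]:
                if labels[j] == -1:
                    labels[j] = cluster
                    if core[j]:
                        stack.append(j)   # border objects are labeled, not expanded
        cluster += 1
    return labels

# Hypothetical data: two concentric rings plus one far-away outlier
points = ring(1.0, 20) + ring(3.0, 60) + [(6.0, 6.0)]
labels = density_clusters(points, eps=0.5, min_pts=3)

print(sorted(set(labels)))   # [-1, 0, 1]: two clusters found, plus noise
print(labels[-1])            # -1: the outlier is excluded from both rings
```

The number of clusters is never passed in. The two rings are separated only because the gap between them is sparse, and the outlier is left unassigned instead of being forced into the nearest ring.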
■ You suspect there may be irregularly shaped clusters. If you have reason to expect that the clusters in your dataset may be irregularly shaped, DBSCAN is a great option. DBSCAN will be able to identify clusters that are spherical or ellipsoidal as well as clusters that have more irregular shapes.
■ Data has outliers. DBSCAN is also a great option for cases where there are many outliers in your dataset. DBSCAN is able to detect outlying data points that do not belong to any cluster and exclude those data points from the clusters.
■ Anomaly detection. Since DBSCAN automatically detects outliers and excludes them from all clusters, DBSCAN is also a good option in cases where you want to be able to detect outliers in your dataset.
■ No drop in density between clusters. In general, DBSCAN requires there to be a drop in the density of data points in order to detect boundaries between clusters. That means that you should not use DBSCAN if you do not expect there to be much of a drop in density between different clusters. For example, if you expect many of your clusters to overlap, multiple clusters might get grouped together into one large cluster.
■ Many categorical features. DBSCAN is generally intended to be used in scenarios where the majority of your features are numeric. That means that you should avoid using DBSCAN in cases where you have many categorical features. In these scenarios, you may be better off using hierarchical clustering with an appropriate distance metric or an extension of k-means clustering like k-modes or k-prototypes.