DM Lect 8 - Clustering - DBSCAN
Clustering algorithms
➢ Connectivity-based Clustering
➢ Centroid-based Clustering
➢ Distribution-based Clustering
➢ Density-based Clustering
➢ Graph-based Clustering
DBSCAN
• K-Means is suitable for finding spherical-shaped clusters
or convex clusters.
• In other words, it works well for compact and well separated
clusters.
• Moreover, it is severely affected by the presence of noise
and outliers in the data.
• Unfortunately, real-life data may contain:
• Clusters of arbitrary shape (oval, linear, or “S”-shaped).
• Noise and outliers.
• The plot contains 5 clusters plus outliers.
• The clusters are:
• 2 oval clusters.
• 2 linear clusters.
• 1 compact cluster.
DBSCAN
•Given such data, the k-means algorithm
has difficulty identifying these
clusters with arbitrary shapes.
•We know there are 5 clusters in the
data, but the k-means method fails to
identify them accurately.
DBSCAN
▪DBSCAN performs better on this data
set: unlike k-means, it identifies the
correct set of clusters.
DBSCAN
▪DBSCAN, a density-based clustering algorithm, can be
used to identify clusters of any shape in a data set containing
noise and outliers.
Algorithm of DBSCAN
▪The goal is to identify dense regions, where density is
measured by the number of objects close to a given point
(that is, within distance eps of it).
Algorithm of DBSCAN
▪Any point x in the data set whose neighbor count (the number
of points within distance eps of x) is greater than or equal to
MinPts is marked as a core point.
▪x is a border point if its neighbor count is less than MinPts
but it belongs to the epsilon-neighborhood of some core point.
▪Finally, a point that is neither a core nor a border point
is called a noise point or an outlier.
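The three point types can be checked with a short Python sketch. It assumes the neighbor count includes the point itself (the convention stated in the core-point definition later in the lecture). The function names and the toy data are hypothetical, chosen only for illustration.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    neighborhoods = [region_query(points, i, eps) for i in range(len(points))]
    is_core = [len(nb) >= min_pts for nb in neighborhoods]
    labels = []
    for i, nb in enumerate(neighborhoods):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in nb):
            # Too few neighbors, but inside the eps-neighborhood of a core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical toy data: a tight group of four, one point on its edge, one far away
toy = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
labels = classify_points(toy, eps=1.0, min_pts=3)
print(labels)
```

Here (2, 1) has only two points in its neighborhood, but one of them is the core point (1, 1), so it is a border point; (8, 8) has no neighbors at all and is noise.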
Algorithm of DBSCAN
▪The figure below shows the different types of points (core,
border and outlier points) using MinPts = 6.
▪ x is a core point because neighbours_epsilon(x) = 6 ≥ MinPts.
▪ y is a border point because neighbours_epsilon(y) < MinPts,
but it belongs to the ϵ-neighborhood of the core point x.
▪ z is a noise point: it is neither a core point nor in the
ϵ-neighborhood of a core point.
Algorithm of DBSCAN
The points are classified as follows:
▪A point p is a core point if at least MinPts points
are within distance eps of it (including p). Those
points are said to be directly reachable from p.
Algorithm of DBSCAN
o The DBSCAN algorithm works as follows:
1. For each point xi, compute the distance between xi and the
other points.
• Find all neighbor points within distance eps of the starting
point (xi).
• Each point with a neighbor count greater than or equal to
MinPts is marked as a core point or visited.
2. For each core point that is not already assigned to a cluster,
create a new cluster. Recursively find all its density-connected
points and assign them to the same cluster as the core point.
3. Iterate through the remaining unvisited points in the data set.
Points that do not belong to any cluster are treated as outliers
(noise).
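The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance, counts a point as its own neighbor, and replaces the recursion of step 2 with a queue. The names `dbscan`, `NOISE`, and the toy data are hypothetical.

```python
import math
from collections import deque

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one cluster label per point (0, 1, ...) or NOISE (-1)."""
    n = len(points)
    # Step 1: eps-neighborhood of every point (the point itself is included)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbors]

    labels = [NOISE] * n
    cluster_id = 0
    for i in range(n):
        # Step 2: every core point not yet in a cluster starts a new one
        if not core[i] or labels[i] != NOISE:
            continue
        labels[i] = cluster_id
        queue = deque([i])
        while queue:
            p = queue.popleft()          # p is always a core point
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster_id
                    if core[q]:          # border points join but do not expand
                        queue.append(q)
        cluster_id += 1
    # Step 3: points still labelled NOISE were reached from no core point
    return labels

# Hypothetical toy data: two compact groups and one isolated point
toy = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (10, 0)]
result = dbscan(toy, eps=1.5, min_pts=3)
print(result)
```

Using a queue instead of recursion avoids deep call stacks on large clusters; the set of points reached is the same.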
DBSCAN Example
Given 8 data points:
A1 = (2, 10), A2 = (2, 5), A3 = (8, 4), A4 = (5, 8), A5 = (7, 5),
A6 = (6, 4), A7 = (1, 2), A8 = (4, 9).
Apply the DBSCAN algorithm to find the final clusters and
identify outlier points in the given data points.
1. Use epsilon (eps) = 2 and MinPts = 2, with the Euclidean
distance as the distance measure.
2. What if eps = √10?
3. Draw a 10 × 10 grid to illustrate your answer, showing the
discovered clusters and the outliers for each value of eps
in 1 and 2.
DBSCAN Example (eps = 2, MinPts = 2)
Step 1: Construct distance matrix
A1 = (2, 10)
A2 = (2, 5)
A3 = (8, 4)
A4 = (5, 8)
A5 = (7, 5)
A6 = (6, 4)
A7 = (1, 2)
A8 = (4, 9)
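The distance matrix for Step 1 can be computed with a short Python sketch using the Euclidean distance named in the exercise. The variable names are illustrative only.

```python
import math

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
names = list(points)

# dist[a][b] holds the Euclidean distance between points a and b
dist = {a: {b: math.dist(points[a], points[b]) for b in names} for a in names}

# Print the symmetric matrix, rounded to two decimals
print("    " + " ".join(f"{b:>5}" for b in names))
for a in names:
    print(f"{a:<4}" + " ".join(f"{dist[a][b]:5.2f}" for b in names))
```

For example, d(A3, A6) = 2 exactly and d(A4, A8) = √2 ≈ 1.41, which is why these pairs end up as neighbors when eps = 2.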
DBSCAN Example (eps = 2, MinPts = 2)
Step 2: Find the epsilon-neighborhood of each data point
eps = 2, MinPts = 2
N (A1) = {}
N (A2) = {}
N (A3) = {A5, A6}
N (A4) = {A8}
N (A5) = {A3, A6}
N (A6) = {A3, A5}
N (A7) = {}
N (A8) = {A4}
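The neighborhoods listed above can be reproduced with a small sketch. It assumes that a point lying at distance exactly eps counts as a neighbor (which is what puts A6 in N(A3)) and, as in the list above, leaves the point itself out of its own neighborhood. The helper name `eps_neighborhood` is hypothetical.

```python
import math

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}

def eps_neighborhood(name, eps):
    """Names of all other points within distance eps of the named point."""
    return [other for other in points
            if other != name and math.dist(points[name], points[other]) <= eps]

neighborhoods = {name: eps_neighborhood(name, eps=2) for name in points}
for name, neigh in neighborhoods.items():
    print(f"N({name}) = {{{', '.join(neigh)}}}")
```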
DBSCAN Example (eps = 2, MinPts = 2)
Step 3: Identify the final clusters and outliers
Cluster (1) = {A3, A5, A6}
Cluster (2) = {A4, A8}
Outliers = {A1, A2, A7}
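Step 3 can be derived mechanically from the Step 2 neighborhoods. The sketch below assumes a point is a core point when it and its neighbors together number at least MinPts, then grows each cluster outward from its core points. The variable names are illustrative.

```python
# eps-neighborhoods from Step 2 (eps = 2); the point itself is left out
neighborhoods = {
    "A1": [], "A2": [], "A3": ["A5", "A6"], "A4": ["A8"],
    "A5": ["A3", "A6"], "A6": ["A3", "A5"], "A7": [], "A8": ["A4"],
}
MIN_PTS = 2

# A point is core when it plus its neighbors number at least MinPts
core = {p for p, nb in neighborhoods.items() if len(nb) + 1 >= MIN_PTS}

clusters, assigned = [], set()
for start in neighborhoods:
    if start not in core or start in assigned:
        continue
    cluster, stack = [], [start]
    assigned.add(start)
    while stack:
        p = stack.pop()
        cluster.append(p)
        if p in core:  # only core points extend the cluster
            for q in neighborhoods[p]:
                if q not in assigned:
                    assigned.add(q)
                    stack.append(q)
    clusters.append(sorted(cluster))

# Points never reached from a core point are outliers
outliers = [p for p in neighborhoods if p not in assigned]
print(clusters)
print(outliers)
```

The result agrees with the clusters and outliers listed above.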
DBSCAN Example (eps = √10, MinPts = 2)
Step 1: Construct distance matrix
A1 = (2, 10)
A2 = (2, 5)
A3 = (8, 4)
A4 = (5, 8)
A5 = (7, 5)
A6 = (6, 4)
A7 = (1, 2)
A8 = (4, 9)
DBSCAN Example (eps = √10, MinPts = 2)
Step 2: Find the epsilon-neighborhood of each data point
eps = √10 ≈ 3.16, MinPts = 2
N (A1) = {A8}
N (A2) = {A7}
N (A3) = {A5, A6}
N (A4) = {A8}
N (A5) = {A3, A6}
N (A6) = {A3, A5}
N (A7) = {A2}
N (A8) = {A1, A4}
DBSCAN Example (eps = √10, MinPts = 2)
Step 3: Identify the final clusters and outliers
Cluster (1) = {A1, A4, A8}
Cluster (2) = {A3, A5, A6}
Cluster (3) = {A2, A7}
No Outliers
Parameter Estimation of DBSCAN
▪ The DBSCAN algorithm requires the user to choose suitable
values for eps and MinPts.
▪ MinPts: As a general rule, a minimum MinPts can be derived
from the number of dimensions D in the data set, as MinPts
≥ D + 1.
▪ Larger values are usually better for data sets with noise
and will yield more significant clusters.
▪ As a rule of thumb, MinPts should be at least 3, and it may be
necessary to choose larger values for very large data sets.
▪ eps:
▪ If it is too small, a large part of the data will not be
clustered; it will be considered outliers.
▪ On the other hand, if it is too high, clusters will merge
and the majority of objects will end up in the same cluster.
▪ In general, small values of eps are preferable.
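The effect of eps can be seen on the eight points of the worked example. The sketch below is an illustration under assumptions: it uses a basic DBSCAN pass with MinPts = 2, and the eps values 1 and 10 are picked only to show the two extremes. The function name `summarize` is hypothetical.

```python
import math

points = [(2, 10), (2, 5), (8, 4), (5, 8), (7, 5), (6, 4), (1, 2), (4, 9)]

def summarize(eps, min_pts=2):
    """Return (number of clusters, number of outliers) for a basic DBSCAN pass."""
    n = len(points)
    # eps-neighborhoods, each point counted as its own neighbor
    neighbors = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]
    label = [None] * n
    n_clusters = 0
    for i in range(n):
        if not core[i] or label[i] is not None:
            continue
        label[i] = n_clusters
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if label[q] is None:
                    label[q] = n_clusters
                    if core[q]:  # only core points keep expanding the cluster
                        stack.append(q)
        n_clusters += 1
    return n_clusters, label.count(None)

for eps in (1, 2, 10):
    print(eps, summarize(eps))
```

With eps = 1 no point has a neighbor, so all eight points are outliers; with eps = 2 there are two clusters and three outliers; with eps = 10 every point is within reach of every other and everything merges into a single cluster.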