
DBSCAN

The document discusses DBSCAN, a density-based clustering algorithm that identifies clusters without requiring a predefined number of clusters. It explains key concepts such as core points, border points, and noise points, as well as the hyperparameters minPts and eps. Additionally, it covers the algorithm's strengths and weaknesses, internal measures for evaluating clustering quality, and the silhouette coefficient for assessing individual points.

Uploaded by

Istiak Utsab

Data Mining and Machine Learning
Topic Contents

DBSCAN
Recommended Reading

• “Introduction to Data Mining,” Pang-Ning Tan, Michael Steinbach and Vipin Kumar, Addison Wesley, 2006.
– Chapter 8 (Cluster Analysis: Basic Concepts and Algorithms)
Clustering

Problem description

• Given: a data set of N data items, each a d-dimensional feature vector.
• Task: determine a natural, useful partitioning of the data set into a number of clusters (k) and noise.
DBSCAN

• Unlike k-means, the desired number of clusters is not given as input. Instead, DBSCAN finds dense clusters directly from the data points.
• Density is defined by a minimum number of points lying within a certain distance of one another.
• It handles outliers easily and efficiently: outliers are not dense, so they cannot form a cluster.
DBSCAN

• Minimum points and threshold distance.
• minPts: the minimum number of points (a threshold) that must be clustered together for a region to be considered dense, i.e. the minimum number of data points that can form a cluster.
• eps (ε): the distance threshold used to locate the points in the neighborhood of any point.

These two are the hyperparameters that need to be tuned to use this algorithm.
DBSCAN

Core point, border point, noise point.

1. Core data point: a data point that has at least ‘minPts’ points within distance ‘ε’ of it.
2. Border data point: a data point that lies within distance ‘ε’ of a core data point but is not itself a core point.
3. Noise data point: a data point that is neither a core nor a border data point.
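
The three definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `region_query` and `classify_points` and the sample data are made up for the example, Euclidean distance is assumed, and a point is assumed to count as a member of its own ε-neighborhood.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i].

    Assumption: the point itself is included in its own neighborhood.
    """
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def classify_points(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise'."""
    neighbors = [region_query(points, i, eps) for i in range(len(points))]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            labels.append("border")   # not core, but within eps of a core point
        else:
            labels.append("noise")    # neither core nor border
    return labels

# Hypothetical toy data: a small square, one point hanging off it, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
point_types = classify_points(points, eps=1.0, min_pts=3)
print(point_types)
# → ['core', 'core', 'core', 'core', 'border', 'noise']
```

The point (2, 1) has only two points in its neighborhood, so it is not core, but it lies within ε of the core point (1, 1) and is therefore a border point.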
DBSCAN: Determining Eps and MinPts

• The idea is that, for points in a cluster, the kth nearest neighbors are at roughly the same distance.
• Noise points have their kth nearest neighbor at a farther distance.
• So, plot the sorted distance of every point to its kth nearest neighbor; a sharp rise in this curve marks a suitable value of Eps, with MinPts set to k.
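
The sorted k-distance values behind such a plot can be computed as in the sketch below. The function name `k_distances` and the toy data are assumptions for illustration; Euclidean distance is assumed, and a point is not counted as its own neighbor. Plotting the returned list against its index gives the curve to inspect.

```python
import math

def k_distances(points, k):
    """Sorted distance from every point to its k-th nearest neighbor.

    Assumption: the point itself is excluded from its neighbor list.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical toy data: four points of a tight square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kd = k_distances(points, k=2)
print(kd)
```

For this data the first four values are all 1.0 (the clustered points) and the last value is much larger (the isolated point), which is the jump the plot is meant to reveal.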
DBSCAN Algorithm

• Eliminate noise points
• Perform clustering on the remaining points
DBSCAN

• Simplified DBSCAN Algorithm

Step 1: Label every point as a core point, border point or noise point.
Step 2: For each core point that is not yet in a cluster:
Step 2a: Create a new cluster.
Step 2b: Add to this cluster all unclustered points that are density-connected to the current point.
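
The steps above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not an optimized or canonical implementation: the name `dbscan`, the `NOISE = -1` label and the sample data are choices made for the example, Euclidean distance is assumed, a point counts as its own neighbor, and neighborhoods are found by brute force. A border point within reach of two clusters is given to whichever cluster reaches it first.

```python
import math

NOISE = -1  # assumed label for points that end up in no cluster

def dbscan(points, eps, min_pts):
    """Return one cluster id per point (0, 1, ...) or NOISE."""
    n = len(points)
    # Step 1: find the eps-neighborhood of every point and mark core points.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]

    labels = [NOISE] * n
    cluster_id = 0
    # Step 2: visit every core point that is not yet in a cluster.
    for i in range(n):
        if not is_core[i] or labels[i] != NOISE:
            continue
        # Step 2a: start a new cluster from this core point.
        labels[i] = cluster_id
        frontier = [i]
        # Step 2b: add every unclustered point reachable through core points.
        while frontier:
            p = frontier.pop()
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster_id
                    if is_core[q]:      # only core points keep expanding
                        frontier.append(q)
        cluster_id += 1
    return labels

# Hypothetical toy data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
          (8, 8), (8, 9), (9, 8), (5, 5)]
cluster_labels = dbscan(points, eps=1.0, min_pts=3)
print(cluster_labels)
# → [0, 0, 0, 0, 0, 1, 1, 1, -1]
```

Points that are never reached from a core point keep the `NOISE` label, so noise elimination falls out of the clustering loop rather than needing a separate pass.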
DBSCAN

• DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius (Eps)
– A point is a core point if it has at least a specified number of points (MinPts) within Eps
  • These are points in the interior of a cluster
– A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point
– A noise point is any point that is neither a core point nor a border point.
DBSCAN: Core, Border and Noise Points

[Figure: two panels, "Original Points" and "Point types: core, border and noise" (Eps = 10, MinPts = 4)]

When DBSCAN Works Well

[Figure: two panels, "Original Points" and "Clusters"]

• Resistant to noise
• Can handle clusters of different shapes and sizes
When DBSCAN Does NOT Work Well

[Figure: "Original Points" and the clusterings found with (MinPts=4, Eps=9.75) and (MinPts=4, Eps=9.92)]

• Varying densities
• High-dimensional data
Statistical Framework for Correlation

• Correlation of incidence and proximity matrices for the K-means clusterings of the following two data sets.

[Figure: two scatter plots, x and y axes from 0 to 1; left panel Corr = -0.9235, right panel Corr = -0.5810]
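
A correlation of this kind can be computed as in the sketch below. The assumptions are mine, not the slide's: the incidence matrix is taken to hold 1 for a pair of points in the same cluster and 0 otherwise, the proximity matrix is taken to hold Euclidean distances, and only the pairs above the diagonal are used because both matrices are symmetric. The function names and the data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def incidence_proximity_correlation(points, labels):
    """Correlate 'same cluster?' (0/1) with distance over all point pairs."""
    incidence, proximity = [], []
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):   # upper triangle only
            incidence.append(1.0 if labels[i] == labels[j] else 0.0)
            proximity.append(math.dist(points[i], points[j]))
    return pearson(incidence, proximity)

# Hypothetical toy data: two well-separated groups in the unit square.
points = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2),
          (0.8, 0.9), (0.9, 0.8), (0.9, 0.9)]
labels = [0, 0, 0, 1, 1, 1]
corr = incidence_proximity_correlation(points, labels)
print(round(corr, 4))
```

Because distance is used as the proximity, points in the same cluster (incidence 1) have small distances, so a good clustering gives a strongly negative correlation, which matches the sign of the values reported above.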


Internal Measures: Cohesion and Separation

• Cluster Cohesion: measures how closely related the objects in a cluster are
– Example: SSE
• Cluster Separation: measures how distinct or well-separated a cluster is from other clusters
• Example: Squared Error
– Cohesion is measured by the within-cluster sum of squares (SSE)

$$\mathrm{WSS} = \sum_i \sum_{x \in C_i} (x - m_i)^2$$

– Separation is measured by the between-cluster sum of squares

$$\mathrm{BSS} = \sum_i |C_i| \, (m - m_i)^2$$

– Where $|C_i|$ is the size of cluster i, $m_i$ is the centroid of cluster i, and $m$ is the mean of all the data
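
A short derivation shows how the two measures are tied together. Taking $m$ as the overall mean and $m_i$ as the centroid of cluster $C_i$, and writing TSS (a name introduced here for illustration) for the total sum of squares about $m$:

```latex
\begin{align*}
\mathrm{TSS} &= \sum_i \sum_{x \in C_i} (x - m)^2 \\
  &= \sum_i \sum_{x \in C_i} \bigl( (x - m_i) + (m_i - m) \bigr)^2 \\
  &= \sum_i \sum_{x \in C_i} (x - m_i)^2
     + \sum_i |C_i| \, (m - m_i)^2
     + 2 \sum_i (m_i - m) \sum_{x \in C_i} (x - m_i) \\
  &= \mathrm{WSS} + \mathrm{BSS}
\end{align*}
```

The cross term vanishes because $\sum_{x \in C_i} (x - m_i) = 0$ when $m_i$ is the centroid of $C_i$. Since TSS depends only on the data and not on the clustering, any decrease in WSS is matched by an equal increase in BSS.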
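Internal Measures: Cohesion and Separation

• Example: SSE
– BSS + WSS = constant
– Data points 1, 2, 4 and 5 on a line; overall mean $m = 3$; for K=2, cluster centroids $m_1 = 1.5$ and $m_2 = 4.5$

K=1 cluster:

$$\mathrm{WSS} = (1 - 3)^2 + (2 - 3)^2 + (4 - 3)^2 + (5 - 3)^2 = 10$$
$$\mathrm{BSS} = 4 \times (3 - 3)^2 = 0$$
$$\mathrm{Total} = 10 + 0 = 10$$

K=2 clusters:

$$\mathrm{WSS} = (1 - 1.5)^2 + (2 - 1.5)^2 + (4 - 4.5)^2 + (5 - 4.5)^2 = 1$$
$$\mathrm{BSS} = 2 \times (3 - 1.5)^2 + 2 \times (4.5 - 3)^2 = 9$$
$$\mathrm{Total} = 1 + 9 = 10$$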
Internal Measures: Cohesion and Separation

• A proximity-graph-based approach can also be used for cohesion and separation.
– Cluster cohesion is the sum of the weights of all links within a cluster.
– Cluster separation is the sum of the weights of the links between nodes in the cluster and nodes outside the cluster.

[Figure: two graph diagrams labeled "cohesion" and "separation"]
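
The graph-based definitions can be illustrated with a small weighted edge list. Everything in this sketch is hypothetical: the function names, the node names, and the integer weights, which stand in for whatever proximity values the graph carries.

```python
def graph_cohesion(edges, labels, cluster):
    """Sum of the weights of links whose two endpoints are both in `cluster`."""
    return sum(w for u, v, w in edges
               if labels[u] == cluster and labels[v] == cluster)

def graph_separation(edges, labels, cluster):
    """Sum of the weights of links with exactly one endpoint in `cluster`."""
    return sum(w for u, v, w in edges
               if (labels[u] == cluster) != (labels[v] == cluster))

# Hypothetical proximity graph as (node, node, weight) triples.
edges = [("a", "b", 5), ("b", "c", 4), ("a", "c", 3),
         ("c", "d", 1), ("d", "e", 5)]
labels = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}

print(graph_cohesion(edges, labels, 0), graph_separation(edges, labels, 0))
# → 12 1
print(graph_cohesion(edges, labels, 1), graph_separation(edges, labels, 1))
# → 5 1
```

Only the link c-d crosses the cluster boundary, so it is the single contribution to the separation of both clusters.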
Internal Measures: Silhouette Coefficient

• The silhouette coefficient combines ideas of both cohesion and separation, but for individual points as well as for clusters and clusterings.
• For an individual point, i:
– Calculate a = average distance of i to the points in its cluster
– Calculate b = min (average distance of i to the points in another cluster)
– The silhouette coefficient for the point is then given by

$$s = \begin{cases} 1 - a/b & \text{if } a < b \\ b/a - 1 & \text{if } a \ge b \text{ (not the usual case)} \end{cases}$$

– Typically between 0 and 1.
– The closer to 1 the better.

• Can calculate the average silhouette width for a cluster or a clustering
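
The per-point calculation and the average width can be sketched as follows. The function name `silhouette` and the data are illustrative. Three conventions are assumed rather than taken from the slide: distances are Euclidean, the point itself is left out when averaging over its own cluster, and a point with no cluster-mates (or a clustering with a single cluster) is given a score of 0.

```python
import math

def silhouette(points, labels, i):
    """Silhouette coefficient of points[i] under the clustering `labels`."""
    by_cluster = {}
    for j, p in enumerate(points):
        if j != i:
            by_cluster.setdefault(labels[j], []).append(math.dist(points[i], p))
    own = by_cluster.pop(labels[i], [])
    if not own or not by_cluster:
        return 0.0  # assumed convention for singletons / single-cluster data
    a = sum(own) / len(own)                                 # cohesion
    b = min(sum(d) / len(d) for d in by_cluster.values())   # separation
    if a < b:
        return 1 - a / b
    if a > b:
        return b / a - 1
    return 0.0

# Hypothetical 1-D data: the points 1, 2, 4, 5 split into two clusters.
points = [(1,), (2,), (4,), (5,)]
labels = [0, 0, 1, 1]
scores = [silhouette(points, labels, i) for i in range(len(points))]
average_width = sum(scores) / len(scores)
print([round(s, 3) for s in scores], round(average_width, 3))
```

For the point at 1, a = 1 and b = (3 + 4) / 2 = 3.5, so s = 1 - 1/3.5, about 0.714; for the point at 2, a = 1 and b = 2.5, so s = 0.6.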
