ML Exp 9

Program to implement DBSCAN in machine learning.

Uploaded by

ananyahc12
Copyright
© All Rights Reserved

Experiment No.

Objective:
Program to implement DBSCAN in machine learning.

Apparatus required:
PC with Jupyter Notebook or Google Colab.

Theory:
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a
powerful density-based clustering algorithm that is widely used for its ability to handle noise,
detect clusters of arbitrary shapes, and require minimal input parameters.

DBSCAN is particularly useful when dealing with datasets that contain noise and clusters of
arbitrary shapes and sizes.

What is DBSCAN?

DBSCAN is a data clustering algorithm that groups points lying in high-density regions into
clusters and labels points lying in low-density regions as noise. It assigns every data point
to one of three types:

• Core Points: A data point is considered a core point if it has at least a minimum
number of points (MinPts) within a certain radius (Eps). The region within this radius is
called the Eps-neighborhood of the point.
• Border Points: A data point is considered a border point if it falls within the
Eps-neighborhood of a core point but does not satisfy the minimum number of points
(MinPts) criterion to be a core point itself.
• Noise Points: Data points that are neither core points nor border points are classified
as noise points.
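
The three point types can be illustrated with a short sketch. The function name
classify_points and the sample points are hypothetical, chosen only for illustration. The
sketch assumes Euclidean distance and counts a point as a member of its own Eps-neighborhood.

```python
from math import dist


def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border', or 'noise'.

    Assumption: a point belongs to its own Eps-neighborhood, so MinPts
    includes the point itself.
    """
    # Eps-neighborhood of every point, stored as lists of indices.
    neighborhoods = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(nbrs) >= min_pts for nbrs in neighborhoods]

    labels = []
    for i, nbrs in enumerate(neighborhoods):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in nbrs):
            # Not dense enough itself, but inside a core point's neighborhood.
            labels.append("border")
        else:
            labels.append("noise")
    return labels


# Hypothetical sample data: a small square, one point beside it, one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (5.0, 5.0)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
```

With these values the four corners of the square come out as core points, the point beside
the square as a border point, and the distant point as noise.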
Core Concepts of How DBSCAN Works

1. Density: DBSCAN defines clusters as areas of high density separated by areas of low
density. It distinguishes between core points, border points, and noise points based on the
density of points within a specified radius.

2. Epsilon (ε): This parameter defines the radius around each point within which to search
for neighboring points. Two points are neighbors if the distance between them is at most ε,
so ε sets the scale at which density is measured.

3. MinPts: This parameter specifies the minimum number of points required within the ε-
neighborhood of a point for it to be considered a core point.
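
In scikit-learn, ε and MinPts correspond to the eps and min_samples arguments of
sklearn.cluster.DBSCAN, and min_samples counts the point itself. The following is a minimal
sketch on a small made-up array, not the dataset used in this experiment.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up sample data: two tight groups and one distant point.
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# eps is the neighborhood radius, min_samples is MinPts (point itself included).
model = DBSCAN(eps=3, min_samples=2).fit(X)

labels = model.labels_.tolist()                     # -1 marks noise
core_indices = model.core_sample_indices_.tolist()  # indices of the core points

print(labels)        # [0, 0, 0, 1, 1, -1]
print(core_indices)  # [0, 1, 2, 3, 4]
```

The last point has no neighbor within ε = 3, so it is not a core point and receives the
noise label -1.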

Diagrammatic Representation

Imagine a scatter plot of your data points. Here's how DBSCAN would interpret it:

• Red Circles: Represent ε-neighborhoods around data points.
• Blue Points: Core points with enough neighbors within ε-distance.
• Green Points: Border points, close to core points but not dense enough themselves.
• Gray Points: Noise points, isolated from any dense region.

The blue core points and their connected neighbors (including green border points) form two
separate clusters based on their density.

Algorithm Steps

1. Initialization (Identify Core Points): DBSCAN starts by selecting an arbitrary point
from the dataset that has not been visited and finds all points in its ε-neighborhood. If the
point has at least MinPts neighbors within the ε-neighborhood, it is a core point. Otherwise
it is provisionally marked as noise and may later be relabeled as a border point.

2. Cluster Formation (Expand Cluster): If the selected point is a core point, a new cluster
is formed and all points in its ε-neighborhood are assigned to this cluster. DBSCAN then
expands the cluster by repeating the neighborhood search for every newly added core point,
until no more points can be added. A point joins the cluster if it is density-reachable from
the starting core point, that is, if it can be reached through a chain of core points in
which each point lies within the ε-neighborhood of the previous one. Border points are added
to the cluster but are not expanded further, because they are not core points.

3. Classify Remaining Points (Border Points and Noise): If a point is not a core point but
lies within the ε-neighborhood of a core point, it is a border point and is assigned to the
cluster of that core point. Points that are neither core points nor border points are noise
points and do not belong to any cluster.
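
The steps above can be sketched in plain Python. This is an illustrative implementation
under stated assumptions, not the code used by scikit-learn: the names dbscan and
region_query are hypothetical, distance is Euclidean, a point counts toward its own MinPts,
and the neighborhood search is a simple linear scan.

```python
from collections import deque
from math import dist

NOISE = -1
UNVISITED = None


def region_query(points, i, eps):
    """Return the indices of all points within eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; noise points get the label -1."""
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            # Step 1: not a core point; provisionally noise.
            labels[i] = NOISE
            continue
        # Step 2: start a new cluster and expand it from this core point.
        cluster_id += 1
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                # Step 3: earlier marked as noise, now reachable: border point.
                labels[j] = cluster_id
                continue
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                # Only core points extend the cluster further.
                queue.extend(j_neighbors)
    return labels


# Made-up sample data: two tight groups and one distant point.
points = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80)]
labels = dbscan(points, eps=3, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, -1]
```

The queue replaces explicit recursion, so large clusters do not run into Python's recursion
limit.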

Advantages of DBSCAN

• Robust to Noise: DBSCAN can identify noise points and does not assign them to any
cluster.
• Arbitrary Cluster Shapes: It can identify clusters of arbitrary shapes and sizes.
• Few Parameters: DBSCAN needs only two input parameters (ε and MinPts).
• Doesn't require knowing the number of clusters beforehand: Unlike K-Means
clustering, DBSCAN does not require you to specify the number of clusters to be
formed.
• Can identify outliers: DBSCAN can effectively identify outliers or noise points in
the data.
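
The advantage on arbitrary cluster shapes can be checked on synthetic data. The sketch below
compares DBSCAN with K-Means on scikit-learn's two-moons generator. The sample size, noise
level, eps and min_samples values are assumptions chosen for this illustration, and agreement
with the true grouping is measured with the adjusted Rand index.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: clusters that are not round blobs.
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)

# DBSCAN is not told the number of clusters; K-Means is given k=2.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping.
ari_dbscan = adjusted_rand_score(y_true, db_labels)
ari_kmeans = adjusted_rand_score(y_true, km_labels)

print("DBSCAN  ARI:", round(ari_dbscan, 3))
print("K-Means ARI:", round(ari_kmeans, 3))
```

K-Means splits the data with a straight boundary, so it tends to cut each moon in two, while
DBSCAN follows the dense curve of each moon.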

Disadvantages

• Sensitivity to Parameters: The performance of DBSCAN can be sensitive to the
choice of ε and MinPts parameters.
• Difficulty with Varying Density: DBSCAN may struggle with datasets containing
clusters with varying densities.
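
The sensitivity to ε can be seen on a small made-up array. In this sketch the helper name
summarize and the three eps values are hypothetical choices for illustration; min_samples is
held fixed at 2.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up sample data: two tight groups and one distant point.
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])


def summarize(eps, min_samples=2):
    """Return (number of clusters, number of noise points) for one eps value."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise


results = {eps: summarize(eps) for eps in (0.5, 3, 100)}
for eps, (n_clusters, n_noise) in results.items():
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
# eps=0.5: 0 clusters, 6 noise points
# eps=3: 2 clusters, 1 noise points
# eps=100: 1 clusters, 0 noise points
```

A radius that is too small labels every point as noise, while a radius that is too large
merges everything into a single cluster.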

Applications

• Anomaly Detection: DBSCAN can identify outliers as noise points, making it useful
for anomaly detection.
• Spatial Data Analysis: It's commonly used in geographic information systems (GIS)
for spatial data clustering.
• Customer Segmentation: DBSCAN can be applied to cluster customers based on
their purchasing behavior or demographic information.

Implementation Using Code:

## DBSCAN

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import DBSCAN

# Load the customer data and shorten the column names.
data = pd.read_csv('customers.csv')
data.rename(columns={'CustomerID': 'customer_id', 'Gender': 'gender', 'Age': 'age',
                     'Annual Income (k$)': 'income', 'Spending Score (1-100)': 'score'},
            inplace=True)

# Cluster on the three numeric features.
features = ['age', 'income', 'score']
train_x = data[features]
cls = DBSCAN(eps=12.5, min_samples=4).fit(train_x)

# Attach the cluster labels; DBSCAN marks noise points with the label -1.
datasetDBSCAN = train_x.copy()
datasetDBSCAN['cluster'] = cls.labels_
print(datasetDBSCAN['cluster'].value_counts().to_frame())

# Separate the noise points (outliers) from the clustered points.
outliers = datasetDBSCAN[datasetDBSCAN['cluster'] == -1]
clustered = datasetDBSCAN[datasetDBSCAN['cluster'] != -1]

fig, ax = plt.subplots(1, 2, figsize=(10, 6))

# Clustered points, colored by cluster label.
sns.scatterplot(x='income', y='score', data=clustered,
                hue='cluster', ax=ax[0], palette='Set3', legend='full', s=180)
sns.scatterplot(x='age', y='score', data=clustered,
                hue='cluster', ax=ax[1], palette='Set3', legend='full', s=180)

# Outliers drawn as small black dots on top of the clusters.
ax[0].scatter(outliers['income'], outliers['score'], s=9, label='outliers', c="k")
ax[1].scatter(outliers['age'], outliers['score'], s=9, label='outliers', c="k")

ax[0].legend()
ax[1].legend()

plt.setp(ax[0].get_legend().get_texts(), fontsize='11')
plt.setp(ax[1].get_legend().get_texts(), fontsize='11')
plt.show()

OUTPUT:

Result:
DBSCAN clustering was implemented in Python using scikit-learn.
