ML Exp 9
Objective:
To write a program that implements the DBSCAN clustering algorithm.
Apparatus required:
PC with Jupyter Notebook or Google Colab.
Theory:
DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a
density-based clustering algorithm that is widely used because it handles noise, detects
clusters of arbitrary shapes and sizes, and needs only two input parameters (ε and MinPts).
What is DBSCAN?
DBSCAN is a data clustering algorithm that identifies clusters as regions of high density
separated by regions of low density. Points in low-density regions that belong to no cluster
are treated as noise. Here are the key points about DBSCAN:
Core Points: A data point is considered a core point if at least a minimum number of
points (MinPts), counting the point itself, lie within a certain radius (Eps) of it. The
region within this radius is called the Eps-neighborhood of the point.
Border Points: A data point is considered a border point if it falls within the
Eps-neighborhood of a core point but does not itself have enough neighbors (MinPts)
to be a core point.
Noise Points: Data points that are not core points or border points are classified as
noise points.
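The three point types can be illustrated with a minimal NumPy sketch. The function name
classify_points and the toy coordinates are hypothetical, and the sketch assumes that a point
is core when at least min_pts points, itself included, lie within eps of it (the convention
scikit-learn uses for min_samples).

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border' or 'noise'."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # neighbors[i, j] is True when point j lies in the Eps-neighborhood of point i
    neighbors = dists <= eps
    # Core points have at least min_pts neighbors (the point itself included)
    is_core = neighbors.sum(axis=1) >= min_pts
    # Border points are non-core points with at least one core point nearby
    has_core_neighbor = (neighbors & is_core[None, :]).any(axis=1)
    is_border = ~is_core & has_core_neighbor
    labels = np.full(len(X), 'noise', dtype=object)
    labels[is_border] = 'border'
    labels[is_core] = 'core'
    return labels

# Toy data: a tight group of four points, one point on its edge, one far away
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0], [10.0, 10.0]]
kinds = classify_points(points, eps=1.1, min_pts=3)
print(list(kinds))
```

In this toy data the four grouped points come out as core, the point at (2, 1) as border
because its only neighbor is the core point at (1, 1), and the point at (10, 10) as noise.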
Core Concepts of How DBSCAN Works
1. Density: DBSCAN defines clusters as areas of high density separated by areas of low
density. It distinguishes between core points, border points, and noise points based on the
density of points within a specified radius.
2. Epsilon (ε): This parameter defines the radius around each point within which to search
for neighboring points. Two points are treated as neighbors only if the distance between
them is at most ε.
3. MinPts: This parameter specifies the minimum number of points required within the ε-
neighborhood of a point for it to be considered a core point.
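The effect of the two parameters can be seen in a small scikit-learn sketch. The data below
(two 5x5 grids of points placed 20 units apart) and the helper name summarize are
hypothetical and chosen only so that the outcome is easy to reason about.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two hypothetical 5x5 grids of points with unit spacing, 20 units apart
grid = np.array([[x, y] for x in range(5) for y in range(5)], dtype=float)
X = np.vstack([grid, grid + [20.0, 0.0]])

def summarize(eps, min_samples):
    """Return (number of clusters, number of noise points) for one setting."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise

# Vary the radius while keeping MinPts fixed
for eps in (0.5, 1.5, 30.0):
    print("eps =", eps, "->", summarize(eps, min_samples=4))

# Vary MinPts while keeping the radius fixed
print("min_samples = 10 ->", summarize(1.5, min_samples=10))
```

With a radius smaller than the grid spacing no point has enough neighbors, so everything is
noise; a moderate radius recovers the two grids; a radius larger than the whole data set
merges them into one cluster. Raising MinPts above the largest neighborhood size (9 points at
ε = 1.5) again leaves only noise.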
Diagrammatic Representation
Imagine a scatter plot of your data points. Here's how DBSCAN would interpret it:
The blue core points and their connected neighbors (including green border points) form two
separate clusters based on their density.
Algorithm Steps
1. Select a Point and Find Its Neighbors: Pick an unvisited point and retrieve all points
within its ε-neighborhood. If the neighborhood contains at least MinPts points, the point is
a core point.
2. Cluster Formation (Expand Cluster): If the selected point is a core point, a new cluster
is formed and all points within its ε-neighborhood are assigned to this cluster. DBSCAN then
expands the cluster by repeating the neighborhood search for each newly added point that is
itself a core point, until no more points can be added. A point is density-reachable from a
core point if it can be reached through a chain of core points in which each point lies
within the ε-neighborhood of the previous one. The core point and all points that are
density-reachable from it form one cluster.
3. Classify Remaining Points (Border Points and Noise): If a point is not a core point but
is within the ε-neighborhood of a core point, it is a border point and is assigned to the
cluster of that core point. Points that are neither core points nor border points are noise
points and do not belong to any cluster.
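The steps above can be sketched in plain Python. This is a minimal illustration, not the
scikit-learn implementation: the names region_query and dbscan and the sample points are
hypothetical, the expansion uses an explicit work list instead of recursion, and each
neighborhood search is a simple scan over all points.

```python
from math import dist

NOISE = -1
UNVISITED = None

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    labels = [UNVISITED] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may be relabeled as a border point later
            continue
        # points[i] is a core point: start a new cluster and expand it
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Two small groups and one isolated point
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
print(dbscan(sample, eps=1.1, min_pts=3))
```

For the sample points the first five points form cluster 0 (with (2, 1) joining as a border
point), the next four form cluster 1, and (50, 50) is labeled -1 as noise.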
Advantages of DBSCAN
Robust to Noise: DBSCAN can identify noise points and does not assign them to any
cluster.
Arbitrary Cluster Shapes: It can identify clusters of arbitrary shapes and sizes.
Few Parameters: DBSCAN requires only two input parameters (ε and MinPts).
Doesn't require knowing the number of clusters beforehand: Unlike K-Means
clustering, DBSCAN does not require you to specify the number of clusters to be
formed.
Can identify outliers: DBSCAN can effectively identify outliers or noise points in
the data.
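The point about arbitrary cluster shapes can be checked with a short sketch that compares
DBSCAN and K-Means on two concentric rings. The ring data, the helper name ring and the
parameter values are assumptions chosen for illustration; the adjusted Rand index from
scikit-learn is used here only as a convenient agreement score.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score

def ring(radius, n_points):
    """n_points evenly spaced on a circle of the given radius."""
    angles = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    return np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])

# Inner ring (20 points) surrounded by an outer ring (100 points)
X = np.vstack([ring(1.0, 20), ring(5.0, 100)])
true_labels = np.array([0] * 20 + [1] * 100)

# DBSCAN is not told the number of clusters; K-Means is given k=2
db_labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(X)
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# A score of 1.0 means the clustering matches the two rings exactly
db_score = adjusted_rand_score(true_labels, db_labels)
km_score = adjusted_rand_score(true_labels, km_labels)
print("DBSCAN:", db_score, "K-Means:", km_score)
```

K-Means assigns each point to the nearest of two centroids, which amounts to cutting the
plane with a straight line, so it cannot separate a ring from the ring that encloses it.
DBSCAN follows the chain of neighboring points along each ring instead.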
Disadvantages
Applications
Anomaly Detection: DBSCAN can identify outliers as noise points, making it useful
for anomaly detection.
Spatial Data Analysis: It's commonly used in geographic information systems (GIS)
for spatial data clustering.
Customer Segmentation: DBSCAN can be applied to cluster customers based on
their purchasing behavior or demographic information.
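As a rough sketch of the anomaly detection use case, the rows that DBSCAN labels -1 can be
read off as candidate anomalies. The readings below are made-up values (a dense grid plus two
isolated points), and the parameter values are assumptions that would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical readings: a dense 6x6 grid of normal values plus two isolated ones
normal = np.array([[x, y] for x in range(6) for y in range(6)], dtype=float)
unusual = np.array([[30.0, 30.0], [-25.0, 12.0]])
X = np.vstack([normal, unusual])

labels = DBSCAN(eps=1.5, min_samples=4).fit_predict(X)

# DBSCAN marks noise with the label -1; here those rows are treated as anomalies
anomaly_idx = np.flatnonzero(labels == -1)
anomalies = X[anomaly_idx]
print(anomalies)
```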
Implementation Using Code:
## DBSCAN
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
import seaborn as sns
data = pd.read_csv('customers.csv')
data.rename(columns={'CustomerID':'customer_id','Gender':'gender','Age':'age',
                     'Annual Income (k$)':'income','Spending Score (1-100)':'score'},inplace=True)
features = ['age', 'income', 'score']
train_x = data[features]
cls = DBSCAN(eps=12.5, min_samples=4).fit(train_x)
datasetDBSCAN = train_x.copy()
datasetDBSCAN.loc[:,'cluster'] = cls.labels_
datasetDBSCAN.cluster.value_counts().to_frame()
outliers = datasetDBSCAN[datasetDBSCAN['cluster']==-1]
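# Plot the clustered points (noise excluded); the 1x2 layout and figsize are assumed
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
clustered = datasetDBSCAN[datasetDBSCAN['cluster']!=-1]
sns.scatterplot(x='income', y='score', data=clustered,
                hue='cluster', ax=ax[0], palette='Set3', legend='full', s=180)
# The original call was cut off; its styling is assumed to mirror the first plot
sns.scatterplot(x='age', y='score', data=clustered,
                hue='cluster', ax=ax[1], palette='Set3', legend='full', s=180)
plt.setp(ax[0].get_legend().get_texts(), fontsize='11')
plt.setp(ax[1].get_legend().get_texts(), fontsize='11')
plt.show()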
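To relate a fitted model back to the theory, the core, border and noise points can be
separated using the core_sample_indices_ and labels_ attributes of scikit-learn's DBSCAN.
The sketch below uses small stand-in data rather than customers.csv, and its parameter
values are assumptions chosen so that each point type appears.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in data (not customers.csv): a 4x4 grid, one point on its edge, one far away
grid = np.array([[x, y] for x in range(4) for y in range(4)], dtype=float)
X = np.vstack([grid, [[4.2, 1.5]], [[20.0, 20.0]]])

model = DBSCAN(eps=1.5, min_samples=4).fit(X)

# Core points are listed by index; noise points carry the label -1
core_mask = np.zeros(len(X), dtype=bool)
core_mask[model.core_sample_indices_] = True
noise_mask = model.labels_ == -1
# Border points belong to a cluster but are not core points
border_mask = ~core_mask & ~noise_mask

print("core:", core_mask.sum(), "border:", border_mask.sum(), "noise:", noise_mask.sum())
```

The same three masks could be computed for the customer data by replacing X with train_x
and model with cls.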
OUTPUT:
Result:
The DBSCAN clustering algorithm was implemented successfully.