ML Assignment 1

Uploaded by

Aadrika

Customer Segmentation Using K-Means & DBSCAN (E-Commerce)
Efrem Joseph Charls 221210037
Gautham Binoy Dev 221210039
Immanuel Thomas Francis 221210050
Table of contents
I Methodology: Statistical methods used to explore & model data.
II Data Overview: Brief view of subject matter - the dataset.
III Preprocessing: Data cleaning, normalization, etc.
IV Feature Engineering: Creating new features & refining those that exist.
V Clustering Analysis: Identifying distinct groups & data patterns.
VI Conclusions: Key insights, trends & patterns derived.
I

Methodology
Statistical methods used to explore & model data.
Methodology
Data Collection + EDA: Loading the dataset & performing Exploratory Data Analysis to understand data characteristics.

Preprocessing & Feature Engineering: Preparing data for clustering and creating new features to better capture underlying patterns within data.

Clustering + Model Evaluation: Segmenting data via K-Means & DBSCAN while assessing cluster quality using various metrics.

Insight Derivation & Solutioning: Developing actionable insights from identified key segments, trends & anomalies in data.
II

Data Overview
Brief view of subject matter - the dataset.
Data Overview
CustomerID A unique identifier assigned to each customer. This field is critical for grouping transactions by customer and performing customer-level analysis.

Country The country where the customer resides, which can give insight into geographical purchasing patterns.

InvoiceNo A unique number assigned to each transaction. This field helps to identify individual purchase events and track customer orders.

StockCode A product identifier, representing the items purchased in each transaction.

Description A brief description of each product or stock item.

Quantity The number of units of each product sold in a particular transaction.

UnitPrice The price of a single unit of the product sold.

InvoiceDate The date and time at which the transaction occurred.


406,829 rows
Records & Transactions within the Dataset
III

Preprocessing
Data cleaning, normalization & data transformation.
Dataset Challenges & Considerations

Missing CustomerIDs: Rows without CustomerID are incomplete and cannot be used for customer segmentation.

Anomalies & Outliers: Unusual transactions, such as refunds recorded as negative values, should not distort analysis.

Skewed Data: Data may be highly skewed, with a small number of customers contributing to a large % of revenue.
Preprocessing Steps

Handling Missing Data


Rows with missing values were removed as they could not contribute to
customer-level analysis. Retaining such rows would only introduce ambiguity.

Data Type Conversion


Correct data types are required for efficient processing & analysis. As such,
features from the dataset were mapped to their appropriate data types.

Data Cleaning & Integrity Checks


Ensuring that the data is free from inconsistencies & errors by removing
duplicates, addressing anomalies & detecting outliers.
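
The three steps above can be sketched on a handful of made-up rows. This is only an illustration using the Python standard library: the sample rows, the `clean_transactions` helper and the rule of dropping non-positive quantities and prices are assumptions, since the slides do not say which tooling was used or how anomalies were treated. Column names follow the Data Overview.

```python
from datetime import datetime

# Hypothetical raw rows (all values are strings, as read from a CSV file).
RAW_ROWS = [
    {"InvoiceNo": "536365", "StockCode": "85123A", "Quantity": "6",
     "UnitPrice": "2.55", "InvoiceDate": "2010-12-01 08:26", "CustomerID": "17850"},
    # Exact duplicate of the first row.
    {"InvoiceNo": "536365", "StockCode": "85123A", "Quantity": "6",
     "UnitPrice": "2.55", "InvoiceDate": "2010-12-01 08:26", "CustomerID": "17850"},
    # Missing CustomerID.
    {"InvoiceNo": "536366", "StockCode": "22633", "Quantity": "3",
     "UnitPrice": "1.85", "InvoiceDate": "2010-12-01 08:28", "CustomerID": ""},
    # Refund recorded as a negative quantity.
    {"InvoiceNo": "C536379", "StockCode": "22556", "Quantity": "-1",
     "UnitPrice": "27.50", "InvoiceDate": "2010-12-01 09:41", "CustomerID": "14527"},
    {"InvoiceNo": "536367", "StockCode": "84879", "Quantity": "32",
     "UnitPrice": "1.69", "InvoiceDate": "2010-12-01 08:34", "CustomerID": "13047"},
]


def clean_transactions(rows):
    """Drop incomplete rows, convert types, then remove duplicates and refunds."""
    cleaned, seen = [], set()
    for row in rows:
        # 1. Handling missing data: skip rows with any empty field.
        if any(value in ("", None) for value in row.values()):
            continue
        # 2. Data type conversion: map each field to an appropriate type.
        record = {
            "InvoiceNo": row["InvoiceNo"],
            "StockCode": row["StockCode"],
            "Quantity": int(row["Quantity"]),
            "UnitPrice": float(row["UnitPrice"]),
            "InvoiceDate": datetime.strptime(row["InvoiceDate"], "%Y-%m-%d %H:%M"),
            "CustomerID": int(row["CustomerID"]),
        }
        # 3. Integrity checks: drop exact duplicates ...
        key = tuple(record.values())
        if key in seen:
            continue
        seen.add(key)
        # ... and rows with non-positive quantity or price (assumed rule).
        if record["Quantity"] <= 0 or record["UnitPrice"] <= 0:
            continue
        cleaned.append(record)
    return cleaned


CLEAN_ROWS = clean_transactions(RAW_ROWS)
```

Of the five sample rows, only the first and the last survive: one is a duplicate, one has no CustomerID and one is a refund.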
Preprocessing Steps - Contd

Feature Scaling
Clustering algorithms are sensitive to the scale / range of data. Features with a
larger range can dominate the distance between points over features with a smaller
range. (We used Standardization)
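
Standardization rescales each feature to zero mean and unit standard deviation. A minimal sketch with the standard library follows; the `standardize` helper and the sample values are made up for illustration and are not taken from the dataset.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature has no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical features on very different scales.
monetary = [120.0, 80.0, 4000.0, 250.0]
frequency = [2, 1, 30, 4]

monetary_z = standardize(monetary)
frequency_z = standardize(frequency)
```

After this step both features have the same spread, so neither dominates a Euclidean distance simply because of its units.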

Final Prepared Dataset


The final dataset, post preprocessing, contains unique customer rows and is now
ready for further statistical analysis.
IV

Feature Engineering
Creating new features & refining those that exist.
Key Features Engineered
Recency - R
Number of days since the customer’s last order, represents how recently they’ve
interacted with the platform.
Recency = Current Date - Last Order Date

Frequency - F
Number of orders a customer placed within a specific time period, indicates
how frequent their purchases are.
Frequency = Count <Order ID>

Monetary - M
The total monetary value of the purchases made by the customer,
reflects their contribution to overall revenue.
Monetary = ∑ Order Value
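
The three formulas can be sketched per customer as below. The sample line items, the `compute_rfm` helper and the reference date are hypothetical. The sketch also assumes that an order's value is Quantity × UnitPrice summed over its line items and that Frequency counts distinct InvoiceNo values; the slides do not spell out either detail.

```python
from collections import defaultdict
from datetime import date

# Hypothetical line items: (CustomerID, InvoiceNo, InvoiceDate, Quantity, UnitPrice).
TRANSACTIONS = [
    (17850, "536365", date(2011, 11, 1), 6, 2.50),
    (17850, "536365", date(2011, 11, 1), 2, 5.00),
    (17850, "536400", date(2011, 12, 5), 1, 10.00),
    (13047, "536367", date(2011, 10, 10), 4, 3.00),
]


def compute_rfm(transactions, current_date):
    """Return {CustomerID: (recency_days, frequency, monetary)}."""
    last_order = {}
    invoices = defaultdict(set)
    monetary = defaultdict(float)
    for customer, invoice, day, quantity, unit_price in transactions:
        # Track the most recent order date per customer.
        if customer not in last_order or day > last_order[customer]:
            last_order[customer] = day
        invoices[customer].add(invoice)              # distinct orders
        monetary[customer] += quantity * unit_price  # assumed order value
    return {
        customer: (
            (current_date - last_order[customer]).days,  # Recency
            len(invoices[customer]),                     # Frequency
            monetary[customer],                          # Monetary
        )
        for customer in last_order
    }


RFM = compute_rfm(TRANSACTIONS, current_date=date(2011, 12, 10))
```

Each customer ends up as one row of three numbers, which is the shape the clustering step expects.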
V

Clustering Analysis
Identifying distinct groups & data patterns.
Clustering
Clustering in ML is an unsupervised
learning technique used to group
similar data points into clusters.
Unlike supervised learning,
clustering does not require labelled
data. It automatically discovers
inherent groupings based on the
characteristics of the data.

Common clustering algorithms include K-Means, DBSCAN, Hierarchical, Spectral, etc.,
each with its own set of unique strengths and weaknesses.
K-Means & DBSCAN
Theoretical Framework - Similarities & Differences
K-Means
● Best with spherical / circular clusters.
● Struggles with irregular shapes.
● Sensitive to noise & outliers. Every point is assigned to a cluster.
● Requires number of clusters to be specified.

Similarities
● Unsupervised Learning: Both are unsupervised, i.e. they don’t rely on labelled data to group data points.
● Distance Based: Both rely on distance metrics (usually euclidean distance) for grouping data.

DBSCAN
● Can handle arbitrary-shaped clusters.
● Does not struggle with irregular shapes.
● Identifies outliers and labels them. Does not force data points into clusters.
● Automatically determines number of clusters based on data density.
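
To make the DBSCAN side of the comparison concrete, here is a compact pure-Python sketch of the algorithm. The names `eps` (neighbourhood radius) and `min_pts` (minimum neighbours for a core point), the use of -1 for noise and the toy points are illustrative assumptions. Note that no cluster count is passed in and the isolated point is left unassigned.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbours(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1           # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = neighbours(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is core: keep expanding
    return labels


# Two dense groups and one isolated point (made-up data).
POINTS = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]
LABELS = dbscan(POINTS, eps=1.5, min_pts=3)
```

On this toy data the two groups become clusters 0 and 1 and the isolated point is labelled as noise, whereas K-Means would have forced it into one of the clusters.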
Optimal Number of Clusters (K-Means)
[Figures: Elbow Method, Silhouette Method]
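
The elbow method runs K-Means for several values of k and looks for the k after which inertia stops dropping sharply. The sketch below is a plain-Python version of Lloyd's algorithm on made-up points. The deterministic farthest-point initialisation in `init_centroids` is an assumption chosen to keep the example reproducible and is not taken from the slides.

```python
from math import dist


def init_centroids(points, k):
    """Deterministic farthest-point initialisation (illustrative assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point that is farthest from its nearest existing centroid.
        centroids.append(max(points, key=lambda p: min(dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm. Returns (labels, centroids, inertia)."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: every point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    # Inertia: sum of squared distances from each point to its centroid.
    inertia = sum(dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))
    return labels, centroids, inertia


# Three well-separated groups of four points each (made-up data).
POINTS = [(1, 1), (1, 2), (2, 1), (2, 2),
          (8, 8), (8, 9), (9, 8), (9, 9),
          (1, 8), (1, 9), (2, 8), (2, 9)]

INERTIA = {k: kmeans(POINTS, k)[2] for k in range(1, 5)}
```

Plotting `INERTIA` against k gives the elbow curve; for this toy data the bend sits at k = 3, the number of groups that were planted.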
Clusters Formed / Customer Segments
Results
Evaluation Metrics

01 Silhouette Score: Evaluates the quality of clusters by measuring how similar a data point is to its own cluster compared to other clusters.

02 Inertia: Measures the compactness of clusters via SSD (sum of squared distances) between data points & cluster centroids.

03 Davies-Bouldin Index: Quantifies the average similarity ratio of each cluster with its most similar cluster. Lower value -> better; higher value -> worse.
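
As an illustration of the first metric, the silhouette coefficient of a point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the members of the nearest other cluster. The sketch below averages this over all points. The `silhouette_score` helper and the four sample points are made up, and scoring singleton clusters as 0 is a common convention assumed here.

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(mean(dist(p, q) for q in other)
                for other_label, other in clusters.items() if other_label != label)
        scores.append((b - a) / max(a, b))
    return mean(scores)


# Two tight pairs of points, far apart from each other (made-up data).
POINTS = [(0, 0), (0, 1), (10, 0), (10, 1)]
GOOD = silhouette_score(POINTS, [0, 0, 1, 1])  # labels match the pairs
BAD = silhouette_score(POINTS, [0, 1, 0, 1])   # labels cut across the pairs
```

Scores close to 1 indicate compact, well-separated clusters, while scores below 0 suggest that points sit closer to a neighbouring cluster than to their own, as with the second labelling here.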

Metrics
VI

Conclusions
Key insights, trends & patterns derived.
Cluster & Result Interpretations
K-Means: The K-Means algorithm produced clusters that were relatively well-defined, as indicated by an average silhouette score of 0.61. The clusters formed showed clear separation based on the frequency and monetary value of customer purchases.

DBSCAN: The DBSCAN algorithm achieved a higher silhouette score of 0.66, indicating that the clusters formed were comparatively better defined. The density-based nature of DBSCAN allowed it to identify clusters that may not have been as apparent with K-Means.

Noise: Customers not classified into any cluster, which could represent outliers or less engaged customers.

Clusters Formed: Low, Medium, High Value


THANK YOU
Feel free to ask Questions.
