ML Assignment 1
Segmentation Using
K-Means & DBSCAN
(E-Commerce)
Efrem Joseph Charls 221210037
Gautham Binoy Dev 221210039
Immanuel Thomas Francis 221210050
Table of contents
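I. Methodology: Statistical methods used to explore & model data.
II. Data Overview: Brief view of subject matter - the dataset.
III. Preprocessing: Data cleaning, normalization & data transformation.
IV. Feature Engineering: Creating new ones & refining those that exist.
V. Clustering Analysis: Identifying distinct groups & data patterns.
VI. Conclusions: Key insights, trends & patterns derived.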
I
Methodology
Statistical methods used to explore & model data.
Methodology
Data Collection + EDA: Loading the dataset & performing Exploratory Data Analysis to understand data characteristics.
Data Overview
Brief view of subject matter - the dataset.
Data Overview
CustomerID: A unique identifier assigned to each customer. This field is critical for grouping transactions by customer and performing customer-level analysis.
Country: The country where the customer resides, which can give insight into geographical purchasing patterns.
InvoiceNo: A unique number assigned to each transaction. This field helps to identify individual purchase events and track customer orders.
Preprocessing
Data cleaning, normalization & data transformation.
Dataset Challenges & Considerations
Feature Scaling
Clustering algorithms are sensitive to the scale (range) of the data: features with larger ranges can dominate features with smaller ranges in distance calculations. (We used standardization.)
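The effect of standardization can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the `standardize` helper and the sample rows are hypothetical, and the population standard deviation is used as one common convention.

```python
import numpy as np

def standardize(X):
    """Return a copy of X with each column rescaled to mean 0 and unit variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)      # population standard deviation (ddof=0)
    std[std == 0] = 1.0      # leave constant columns unscaled instead of dividing by zero
    return (X - mean) / std

# Hypothetical rows: [recency in days, order count, total spend]
X = np.array([
    [5.0, 12.0, 4300.0],
    [40.0, 3.0, 250.0],
    [200.0, 1.0, 35.0],
    [15.0, 7.0, 1800.0],
])
Z = standardize(X)
```

After scaling, every column of `Z` has mean 0 and standard deviation 1, so the spend column no longer dominates distance computations simply because its raw values are larger.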
Feature Engineering
Creating new ones & refining those that exist.
Key Features Engineered
Recency - R
Number of days since the customer’s last order, represents how recently they’ve
interacted with the platform.
Recency = Current Date - Last Order Date
Frequency - F
Number of orders a customer placed within a specific time period, indicates
how frequent their purchases are.
Frequency = Count(Order ID)
Monetary - M
The total monetary value of the purchases made by the customer,
reflects their contribution to overall revenue.
Monetary = ∑ Order Value
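The three formulas above could be computed per customer roughly as shown below. This is a sketch under assumptions: `CustomerID` and `InvoiceNo` come from the data overview, while `InvoiceDate`, `Quantity`, `UnitPrice`, the toy rows, and the choice of "current date" (one day after the last transaction) are hypothetical.

```python
import pandas as pd

# Toy transactions; InvoiceDate, Quantity and UnitPrice are assumed column names.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2, 3],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2024-01-05", "2024-01-05", "2024-03-01",
        "2024-02-10", "2024-02-20", "2023-12-15",
    ]),
    "Quantity": [2, 1, 5, 10, 1, 3],
    "UnitPrice": [10.0, 5.0, 2.0, 1.5, 20.0, 4.0],
})
tx["OrderValue"] = tx["Quantity"] * tx["UnitPrice"]

# Assumed reference date standing in for "Current Date".
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("CustomerID").agg(
    # Recency = Current Date - Last Order Date, in days
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),
    # Frequency = number of distinct orders (invoices)
    Frequency=("InvoiceNo", "nunique"),
    # Monetary = sum of order values
    Monetary=("OrderValue", "sum"),
)
```

Counting distinct invoice numbers rather than rows keeps a multi-line invoice from being counted as several orders.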
V
Clustering Analysis
Identifying distinct groups & data patterns.
Clustering
Clustering in ML is an unsupervised learning technique used to group similar data points into clusters. Unlike supervised learning, clustering does not require labelled data. It automatically discovers inherent groupings based on the characteristics of the data.
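As a small illustration of grouping without labels, the sketch below runs scikit-learn's `KMeans` on two synthetic groups of points. The data, the choice of two clusters, and the random seeds are assumptions made for the example, not values from the assignment.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two synthetic groups of points; no labels are given to the algorithm.
group_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
group_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
X = np.vstack([group_a, group_b])

# n_clusters=2 is an assumption that matches how the toy data was built.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # one cluster index per row of X
```

The algorithm only sees the coordinates in `X`; the cluster indices in `labels` are discovered from the structure of the data, and which group receives index 0 or 1 is arbitrary.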
Metrics
VI
Conclusions
Key insights, trends & patterns derived.
Cluster & Result Interpretations
K-Means: The K-Means algorithm produced clusters that were relatively well-defined, as indicated by an average silhouette score (0.61). The clusters formed showed clear separation based on the frequency and monetary value of customer purchases.
DBSCAN: The DBSCAN algorithm, with a higher silhouette score of 0.66, indicated that the clusters formed were comparatively better defined. The density-based nature of DBSCAN allowed it to identify clusters that may not have been as apparent with K-Means.
Noise: Customers not classified into any cluster, which could represent outliers or less engaged customers.
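A comparison of this kind can be sketched on synthetic data as below. The data, the `eps` and `min_samples` values, and the choice to drop noise points before scoring DBSCAN are assumptions for illustration. The slides do not state how the reported scores of 0.61 and 0.66 were computed, and this example does not reproduce them.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Two dense synthetic groups plus three isolated points acting as outliers.
dense_a = rng.normal([0.0, 0.0], 0.2, size=(60, 2))
dense_b = rng.normal([4.0, 4.0], 0.2, size=(60, 2))
outliers = np.array([[10.0, -5.0], [-6.0, 8.0], [12.0, 12.0]])
X = np.vstack([dense_a, dense_b, outliers])

# K-Means must assign every point, including the outliers, to a cluster.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
km_score = silhouette_score(X, km_labels)

# DBSCAN marks points in low-density regions as noise with the label -1.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
noise = db_labels == -1

# Assumed convention: score only the points DBSCAN placed in a cluster.
db_score = silhouette_score(X[~noise], db_labels[~noise])

print(f"K-Means silhouette: {km_score:.2f}")
print(f"DBSCAN silhouette (noise excluded): {db_score:.2f}")
print(f"Noise points: {int(noise.sum())}")
```

Because noise points are left out of the DBSCAN score, the two numbers are not computed over the same set of customers, which is worth keeping in mind when comparing them.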