UNSUPERVISED MACHINE
LEARNING
(CUSTOMER SEGMENTATION)
ONLINE RETAIL
INTRODUCTION
1. The main goal is to identify the most profitable customers and the customers who have churned, so that further customer loss can be prevented by redefining company policies.
2. CLUSTER ANALYSIS: Statistically segment customers into groups using the features described below.
Data Description

Attribute   | Data Type | Description
InvoiceNo   | Nominal   | 6-digit unique number assigned to each transaction; a code starting with the letter 'C' indicates a cancellation
StockCode   | Nominal   | 5-digit unique number assigned to each distinct product
Description | Nominal   | Product (item) name
Quantity    | Numeric   | Quantity of each product (item) per transaction
InvoiceDate | Datetime  | Date and time when each transaction was generated
UnitPrice   | Numeric   | Product price per unit, in sterling
CustomerID  | Nominal   | 5-digit unique number assigned to each customer
Country     | Nominal   | Name of the country where each customer resides
IMPORTING AND INSPECTING DATASET
Dataset name: Online Retail
Number of observations: 541,908 (shape = 541,908 rows × 8 columns)
dtypes: datetime64 (1), float64 (2), int64 (1), object (4) → 1 + 2 + 1 + 4 = 8 columns
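A minimal sketch of the import/inspection step. The file name, CSV format, and encoding in the commented loading line are assumptions (the Online Retail dump is also commonly distributed as an Excel file); `summarize` is a hypothetical helper that reproduces the shape and dtype counts quoted above.

```python
import pandas as pd

def summarize(df: pd.DataFrame) -> dict:
    """Report the row count, column count, and dtype breakdown of a dataframe."""
    return {
        "n_rows": df.shape[0],
        "n_cols": df.shape[1],
        "dtype_counts": df.dtypes.astype(str).value_counts().to_dict(),
    }

# Hypothetical loading step (file name and encoding assumed):
# df = pd.read_csv("OnlineRetail.csv", encoding="ISO-8859-1",
#                  parse_dates=["InvoiceDate"])
# summarize(df)  # expected: 541,908 rows, 8 columns
```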
Data Cleaning
Checking missing data
1. CustomerID — 135,080 missing values (25%)
2. Description — 1,454 missing values (0.27%)
These rows are of no use for customer-level analysis and can be dropped.
Checking duplicates
5,268 data points were duplicated; the duplicates were dropped.
Total data points left
Number of observations left: 401,604 (shape = 401,604 rows × 8 columns)
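The two cleaning steps above can be sketched as one function, a minimal version assuming the column names from the data description:

```python
import pandas as pd

def clean_retail(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing CustomerID or Description, then drop exact duplicates."""
    out = df.dropna(subset=["CustomerID", "Description"])
    out = out.drop_duplicates()
    return out
```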
FEATURE ENGINEERING
Extracted Year, Date, and Month from InvoiceDate.
Added feature 'TotalAmount' by multiplying the Quantity and UnitPrice columns (in sterling).
Added feature 'TimeType' based on the invoice hour, labelling each transaction as Morning, Afternoon, or Evening.
Dropped invoices whose InvoiceNo starts with 'C', which represent cancellations.
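A sketch of these feature-engineering steps. The exact hour cut-offs for Morning/Afternoon/Evening (12:00 and 17:00) are our assumption, since the deck does not state them:

```python
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["Year"] = out["InvoiceDate"].dt.year
    out["Month"] = out["InvoiceDate"].dt.month
    out["TotalAmount"] = out["Quantity"] * out["UnitPrice"]
    # Bin the invoice hour into a coarse time-of-day label
    # (cut-off hours 12 and 17 are assumed)
    out["TimeType"] = pd.cut(out["InvoiceDate"].dt.hour,
                             bins=[0, 12, 17, 24],
                             labels=["Morning", "Afternoon", "Evening"],
                             right=False)
    # Drop cancellations: invoice numbers starting with 'C'
    out = out[~out["InvoiceNo"].astype(str).str.startswith("C")]
    return out
```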
MOST FREQUENT VALUES
Observations/Hypothesis
1. Most customers are from the United Kingdom. A considerable number of customers are also from Germany, France, EIRE, and Spain.
2. There are no orders placed on Saturdays; it appears to be a non-working day for the retailer.
3. Most customers purchased gifts in November, October, December, and September; fewer customers purchased gifts in April, January, and February.
4. Most customers purchased items in the afternoon, a moderate number in the morning, and the fewest in the evening.
5. WHITE HANGING HEART T-LIGHT HOLDER, REGENCY CAKESTAND 3 TIER, and JUMBO BAG RED RETRO SPOT are the most-ordered products.
LESS FREQUENT VALUES
Observations/Hypothesis
1. Saudi Arabia, Bahrain, the Czech Republic, Brazil, and Lithuania have the fewest customers.
2. GREEN WIT METAL BAG CHARM, WHITE WITH METAL BAG CHARM, BLUE/NAT SELL NECLACE W PENDENT, PINK EASTER ENS FLOWER, and PAPER CRAFT LITTLE BIRDIE are some of the least-sold products.
COUNTRY WISE ORDERS
COUNTRY WISE CUSTOMERS
COUNTRY WISE PURCHASE QUANTITY
PRODUCT WISE PURCHASE QUANTITY
PRODUCT WISE REVENUE
PRODUCT WISE CUSTOMERS
CUSTOMER WISE CANCELLATIONS
COUNTRY WISE CANCELLATIONS
VISUALIZING DISTRIBUTIONS
1. Visualizing the distributions of the Quantity, UnitPrice, and TotalAmount columns.
2. Each shows a positively skewed distribution: most of the values are clustered on the left side while the right tail is longer, which means mean > median > mode.
3. For a symmetric distribution, mean = median = mode.
LOG TRANSFORMATION
1. After applying a log transformation, the distribution plots look considerably less skewed.
2. We use a log transformation when continuous data does not follow a bell curve; log-transforming such data makes it as close to "normal" as possible, so that analysis results based on it become more valid.
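The transformation can be sketched as below. Using `log1p` (i.e. log(1 + x)) rather than a plain log is our choice, so that zero values do not map to negative infinity; the columns are assumed to be non-negative:

```python
import numpy as np
import pandas as pd

def log_transform(df: pd.DataFrame, cols) -> pd.DataFrame:
    """Add a log-transformed copy of each given column (suffix '_log')."""
    out = df.copy()
    for c in cols:
        # log1p handles zeros gracefully; assumes non-negative values
        out[c + "_log"] = np.log1p(out[c])
    return out
```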
RFM ANALYSIS
RECENCY: how recently the customer last visited
FREQUENCY: how frequently the customer visits
MONETARY: money spent by the customer
RFM MODELLING
RFM TABLE

Customer Name | Recency | Frequency | Monetary
Anthony       | 326     | 15        | 7183
Rahul         | 2       | 182       | 4310
Syed          | 75      | 31        | 1765
CONCLUSIONS:
Anthony: visited 326 days (approx. 1 year) ago, visited 15 times, and spent around 7,183 sterling → Lost Potential Customer
Rahul: visited 2 days ago, visited 182 times, and spent around 4,310 sterling → Recently Visited Potential Customer
Syed: visited 75 days (approx. 2.5 months) ago, visited 31 times, and spent around 1,765 sterling → About-to-Lose Average Customer
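The RFM metrics can be computed per customer with a pandas groupby — a minimal sketch, assuming the cleaned dataframe carries InvoiceDate, InvoiceNo, and the engineered TotalAmount column, and taking the snapshot date (our assumption) as the day after the last invoice in the data:

```python
import pandas as pd

def rfm_table(df: pd.DataFrame, snapshot_date=None) -> pd.DataFrame:
    """Recency = days since last purchase, Frequency = number of invoices,
    Monetary = total amount spent, per customer."""
    if snapshot_date is None:
        # Assumed reference point: one day after the latest transaction
        snapshot_date = df["InvoiceDate"].max() + pd.Timedelta(days=1)
    return df.groupby("CustomerID").agg(
        Recency=("InvoiceDate", lambda s: (snapshot_date - s.max()).days),
        Frequency=("InvoiceNo", "nunique"),
        Monetary=("TotalAmount", "sum"),
    )
```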
RFM MODELLING
1. Earlier, the distributions of the Recency, Frequency, and Monetary columns were positively skewed; after applying a log transformation, the distributions appear symmetrical and approximately normal.
2. The transformed features are more suitable for better visualization of the clusters.
RFM CORRELATION HEATMAP
1. We can see that Recency is highly correlated with the RFM value.
2. Frequency and Monetary are moderately correlated with the RFM value.
SCALING FOR CLUSTERING ANALYSIS
1. Log transformation of the Recency, Frequency, and Monetary features.
2. StandardScaler on the X variables (mean 0, standard deviation 1), followed by clustering analysis and modelling.
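The scaling step (log transform followed by StandardScaler) can be sketched as:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def scale_rfm(X: np.ndarray) -> np.ndarray:
    """Log-transform, then standardize each column to mean 0 and std 1."""
    X_log = np.log1p(X)                      # tame the positive skew first
    return StandardScaler().fit_transform(X_log)
```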
Pipeline
EXTRACTING DATA → DATA CLEANING → DATA VISUALIZATION → RFM ANALYSIS
Extracting data: Online Retail, 541,908 observations (shape = 541,908 × 8).
Data cleaning: checking missing data — CustomerID 135,080 (25%), Description 1,454; checking duplicates — 5,268 duplicated data points dropped; 401,604 data points left.
RFM analysis, condition for best customers: RECENCY must be LESS, FREQUENCY must be MORE, MONETARY must be MORE.
MODELLING → CUSTOMER SEGMENTATION → CONCLUSION
Modelling methods:
1. Binning (RFM score)
2. Binning (RFM combination)
3. K-Means
4. Hierarchical clustering
5. DBSCAN clustering
BINNING RFM SCORES
(3-D scatter of the clusters on the Recency, Frequency, and Monetary axes)
GROUP 1: LOST POOR CUSTOMERS
GROUP 2: AVERAGE CUSTOMERS
GROUP 3: GOOD CUSTOMERS
GROUP 4: BEST CUSTOMERS
QUANTILE CUT
(3-D scatter of the clusters on the Recency, Frequency, and Monetary axes)
GROUP 1: LOST POOR CUSTOMERS
GROUP 2: LOSING LOYAL CUSTOMERS
GROUP 3: GOOD CUSTOMERS
GROUP 4: BEST CUSTOMERS
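The quantile-based scoring can be sketched with pandas' `qcut`. The helper name, the use of quartiles (q=4), and summing R+F+M into one score are our assumptions; note that Recency is reversed, since a lower recency is better:

```python
import pandas as pd

def rfm_scores(rfm: pd.DataFrame, q: int = 4) -> pd.DataFrame:
    """Assign 1..q quantile scores per RFM column and a combined RFM score."""
    out = rfm.copy()
    # Lower recency is better, so label quantiles in reverse order
    out["R"] = pd.qcut(out["Recency"], q, labels=list(range(q, 0, -1))).astype(int)
    # rank(method="first") breaks ties so qcut gets distinct bin edges
    out["F"] = pd.qcut(out["Frequency"].rank(method="first"), q,
                       labels=list(range(1, q + 1))).astype(int)
    out["M"] = pd.qcut(out["Monetary"], q, labels=list(range(1, q + 1))).astype(int)
    out["RFMScore"] = out[["R", "F", "M"]].sum(axis=1)
    return out
```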
K-MEANS CLUSTERING
1. From the elbow curve, 5 appears to sit at the elbow and can be considered as the number of clusters; n_clusters = 4 or 6 could also be considered.
2. If we use the maximum silhouette score as the criterion for selecting the optimal number of clusters, then n_clusters = 2 would be chosen.
3. Looking at both graphs together, 4 appears to be a good choice: it has a decent silhouette score and lies near the elbow of the elbow curve.
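Both diagnostics can be computed in one loop — a minimal sketch with scikit-learn, where the k range and seed are our assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def kmeans_diagnostics(X, k_range=range(2, 9), seed=42):
    """Inertia (for the elbow curve) and silhouette score for each k."""
    results = {}
    for k in k_range:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        results[k] = {
            "inertia": km.inertia_,                       # elbow criterion
            "silhouette": silhouette_score(X, km.labels_),
        }
    return results
```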
K-MEANS | 2 CLUSTERS
(3-D scatter of the clusters on the Recency, Frequency, and Monetary axes)
GROUP 0: BEST CUSTOMERS
GROUP 1: LOST POOR CUSTOMERS
K-MEANS | 5 CLUSTERS
(3-D scatter of the clusters on the Recency, Frequency, and Monetary axes)
GROUP 0: LOST POOR CUSTOMERS
GROUP 1: BEST CUSTOMERS
GROUP 2: RECENTLY VISITED AVERAGE CUSTOMERS
GROUP 3: LOSING LOYAL CUSTOMERS
GROUP 4: AVERAGE CUSTOMERS
K-MEANS | 4 CLUSTERS
(3-D scatter of the clusters on the Recency, Frequency, and Monetary axes)
GROUP 0: LOSING LOYAL CUSTOMERS
GROUP 1: BEST CUSTOMERS
GROUP 2: LOST POOR CUSTOMERS
GROUP 3: RECENTLY VISITED AVERAGE CUSTOMERS
HIERARCHICAL CLUSTERING
In K-means clustering there is the challenge of predetermining the number of clusters, and the algorithm tends to create clusters of similar size. To address these two challenges, we can opt for hierarchical clustering, because this algorithm does not require a predefined number of clusters.
Hierarchical clustering is based on two techniques:
a. Agglomerative: a bottom-up approach, in which the algorithm starts by taking every data point as a single cluster and keeps merging clusters until only one is left.
b. Divisive: the reverse of the agglomerative algorithm; a top-down approach.
We chose the optimal number of clusters based on the dendrogram shown here.
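The agglomerative approach can be sketched with SciPy: build the merge tree with `linkage`, inspect it as a dendrogram, then cut it at the chosen number of clusters. The helper name and Ward linkage are our assumptions, since the deck does not name the linkage method:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_labels(X: np.ndarray, n_clusters: int = 2,
                        method: str = "ward") -> np.ndarray:
    """Agglomerative (bottom-up) clustering: build the merge tree, then cut it."""
    Z = linkage(X, method=method)
    # scipy.cluster.hierarchy.dendrogram(Z) would draw the tree used to
    # choose n_clusters by eye, as described above.
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```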
HIERARCHICAL | 2 CLUSTERS
(3-D scatter of the clusters on the Recency, Frequency, and Monetary axes)
GROUP 0: AVERAGE CUSTOMERS
GROUP 1: BEST CUSTOMERS
HIERARCHICAL | 3 CLUSTERS
(3-D scatter of the clusters on the Recency, Frequency, and Monetary axes)
GROUP 0: BEST CUSTOMERS
GROUP 1: LOSING LOYAL CUSTOMERS
GROUP 2: LOST POOR CUSTOMERS
DBSCAN
(3-D scatter of the clusters on the Recency, Frequency, and Monetary axes)
GROUP -1: AVERAGE CUSTOMERS (DBSCAN's noise label)
GROUP 0: LOST POOR CUSTOMERS
GROUP 1: GOOD CUSTOMERS
GROUP 2: LOSING LOYAL CUSTOMERS
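A minimal DBSCAN sketch with scikit-learn; the `eps` and `min_samples` values are our assumptions, since the deck does not report the ones used. Points DBSCAN cannot assign to any dense region get the label -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_labels(X: np.ndarray, eps: float = 0.5,
                  min_samples: int = 5) -> np.ndarray:
    """Density-based clustering; label -1 marks noise points."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
```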
SUMMARY
▪ We started with simple binning- and quantile-based segmentation models, then moved to more complex models, because a simple implementation gives a first glance at the data and shows where and how to exploit it better.
▪ We then moved to k-means clustering and visualized the results with different numbers of clusters. Since there is no assurance that k-means will find the globally best solution, we also tried hierarchical clustering and DBSCAN.
▪ We created several useful clusters of customers, using different metrics and methods, to categorize customers by their behavioural attributes and define their value, loyalty, profitability, etc. for the business. Although clearly separated clusters are not visible in the plots, the clusters obtained are fairly valid and useful according to the algorithms and the statistics extracted from the data.
▪ The final segments depend on how the business plans to use the results and on the level of granularity it wants in the clusters. Keeping these points in view, we clustered the major segments, based on our understanding, according to the different criteria shown in the summary dataframe.
FINAL CONCLUSION
CUSTOMER SEGMENTS OBTAINED FROM CLUSTERING ANALYSIS