Clustering Analysis: Reading The Data
CLUSTERING ANALYSIS
A leading bank wants to develop a customer segmentation to give promotional offers to its
customers. They collected a sample that summarizes the activities of users during the past
few months. You are given the task to identify the segments based on credit card usage.
Data Dictionary for Market Segmentation:
1. spending: Amount spent by the customer per month (in 1000s)
2. advance payments: Amount paid by the customer in advance by cash (in 100s)
3. probability_of_full_payment: Probability of payment done in full by the customer to
the bank
4. current balance: Balance amount left in the account to make purchases (in 1000s)
5. credit limit: Limit of the amount in credit card (in 10000s)
6. min_payment_amt: Minimum amount paid by the customer while making payments for
purchases made monthly (in 100s)
7. max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)
1.1 Read the data and do exploratory data analysis. Describe the data briefly.
We begin by importing the libraries needed for cluster analysis:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
Reading the data:
We have 7 variables.
No null values are present in any of the variables.
The mean and median values seem to be almost equal.
The standard deviation for spending is high compared to the other variables.
There are no duplicates in the dataset.
The box plot of the probability of full payment variable shows a few outliers.
Probability of full payment is negatively skewed (skewness = -0.537954).
The dist plot shows the distribution of data from 0.80 to 0.92.
The dist plot shows the distribution of data from 5.0 to 6.5.
The dist plot shows the distribution of data from 2.5 to 4.0
The box plot of the min payment amount variable shows a few outliers.
Min payment amount is positively skewed (skewness = 0.401667).
The box plot of the max spent in single shopping variable shows no outliers.
Max spent in single shopping is positively skewed (skewness = 0.561897).
The dist plot shows the distribution of data from 4.5 to 6.5.
No outlier treatment is applied: only 3 to 4 values are observed as outliers, so we are not treating them.
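The checks described above (null values, duplicates, skewness and box-plot outliers) can be sketched as follows. This is a minimal illustration on synthetic data, since the report does not name the input file; the column values and the 1.5 * IQR box-plot rule are assumptions, not the original notebook code.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bank sample (assumption: the real file is not named here).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "spending": rng.normal(15, 3, 200),
    "probability_of_full_payment": rng.uniform(0.80, 0.92, 200),
    "min_payment_amt": rng.gamma(4.0, 1.0, 200),
})

null_counts = df.isnull().sum()            # missing values per column
n_duplicates = int(df.duplicated().sum())  # number of fully duplicated rows
skewness = df.skew()                       # sign gives the direction of the skew

# Box-plot rule: values beyond 1.5 * IQR from the quartiles are flagged as outliers.
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outlier_counts = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()

print(null_counts, n_duplicates, skewness, outlier_counts, sep="\n")
```

A small outlier count per column is what would support leaving the values untreated.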
Multivariate analysis
Check for multicollinearity
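One common way to check for multicollinearity is to look at the pairwise correlation matrix. The sketch below uses synthetic data in which advance_payments is deliberately built from spending; the column names follow the data dictionary with underscores, and the 0.8 cut-off is an illustrative assumption rather than a value taken from this report.

```python
import numpy as np
import pandas as pd

# Synthetic data: advance_payments is constructed to track spending closely.
rng = np.random.default_rng(1)
spending = rng.normal(15, 3, 200)
df = pd.DataFrame({
    "spending": spending,
    "advance_payments": 0.4 * spending + rng.normal(8, 0.3, 200),
    "min_payment_amt": rng.normal(3.7, 1.5, 200),
})

corr = df.corr()  # Pearson correlation between every pair of variables

# List each pair once (upper triangle) whose absolute correlation exceeds the cut-off.
threshold = 0.8  # hypothetical cut-off for "highly correlated"
high_pairs = [
    (a, b, round(float(corr.loc[a, b]), 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(corr.round(2))
print(high_pairs)
```

The same matrix can be passed to a seaborn heatmap when a visual check is preferred.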
1.2 Do you think scaling is necessary for clustering in this case? Justify
Yes, scaling is necessary because clustering relies on distance-based computations.
The variables are measured on different scales (1000s, 100s, 10000s and a probability).
Without scaling, variables with larger values, such as spending and advance payments, would get more weightage in the distance calculation.
Scaling brings all the variables into a comparable range.
I have used StandardScaler for scaling.
Below is the snapshot of scaled data.
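A minimal sketch of the scaling step with scikit-learn's StandardScaler, which centres each column to mean 0 and rescales it to unit standard deviation. The data frame is synthetic and its column values are assumptions used only to show variables on different scales.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic columns on deliberately different scales (values are illustrative).
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "spending": rng.normal(15, 3, 100),
    "probability_of_full_payment": rng.uniform(0.80, 0.92, 100),
    "credit_limit": rng.normal(3.2, 0.4, 100),
})

scaler = StandardScaler()
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

print(scaled_df.head())
# After scaling, every column has mean ~0 and (population) standard deviation ~1.
print(scaled_df.mean().round(3))
print(scaled_df.std(ddof=0).round(3))
```

Because every column ends up with the same spread, no single variable dominates the Euclidean distances used by the clustering methods.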
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum
clusters using Dendrogram and briefly describe them
Hierarchical clustering
The above dendrogram shows how all the data points merge into clusters using Ward's method.
To find the optimal number of clusters for our business objective, we redraw the dendrogram with
truncate_mode='lastp', which condenses the tree to the last p merged clusters.
Now we can see that the data points have grouped into 3 clusters.
The above dendrogram shows how all the data points merge into clusters using the average
linkage method.
To find the optimal number of clusters for our business objective, we again use
truncate_mode='lastp'.
Now we can see that the data points have again grouped into 3 clusters.
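The two linkage runs described above can be sketched with SciPy as follows. The data are synthetic blobs standing in for the scaled dataset, and p=10 is an assumed value for the truncated dendrogram; no_plot=True is used only so the sketch runs without a plotting backend.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic stand-in for the scaled data: three well-separated groups of 30 points.
rng = np.random.default_rng(3)
centers = np.array([[-3.0, -3.0], [0.0, 3.0], [3.0, -3.0]])
X = np.vstack([c + rng.normal(0, 0.5, size=(30, 2)) for c in centers])

results = {}
for method in ("ward", "average"):
    Z = linkage(X, method=method)  # agglomerative merge history
    # truncate_mode="lastp" keeps only the last p merged clusters in the tree;
    # drop no_plot=True to draw the dendrogram with matplotlib.
    tree = dendrogram(Z, truncate_mode="lastp", p=10, no_plot=True)
    # Cut the tree so that at most 3 flat clusters remain.
    labels = fcluster(Z, t=3, criterion="maxclust")
    results[method] = labels
    print(method, np.bincount(labels)[1:])
```

Comparing the cluster sizes from the two methods is a quick way to confirm that they agree on the grouping.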
Observation
Both methods give almost similar results, with only the minor variation that is expected between linkage methods.
Based on the dendrogram, a grouping of 3 or 4 clusters looks good. After further analysis of the
dataset, I went with the 3-cluster solution.
The three-cluster solution gives a pattern of high/medium/low spending, together with
max_spent_in_single_shopping (high-value items) and probability_of_full_payment (payments
made).
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply
elbow curve and silhouette score.
K-means clustering
As a starting point we set n_clusters = 3 and look at how the data points are distributed
across the clusters.
We apply the K-means technique to the scaled data.
To find the optimal number of clusters, I used a for loop to compute the inertia value
(within-cluster sum of squares) for each number of clusters from 1 to 11.
The elbow curve also shows that there is no large drop in inertia after 3 clusters, so we
select 3 clusters.
We then add the cluster results to our dataset to address our business objective.
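The inertia loop behind the elbow curve can be sketched as below. The data are synthetic blobs standing in for the scaled dataset, and n_init=10 and random_state=1 are assumed settings, not values taken from the original analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled data: three well-separated groups.
rng = np.random.default_rng(4)
centers = np.array([[-3.0, -3.0], [0.0, 3.0], [3.0, -3.0]])
X = np.vstack([c + rng.normal(0, 0.5, size=(30, 2)) for c in centers])

ks = list(range(1, 12))  # 1 to 11 clusters inclusive
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=1)
    km.fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# Drop in inertia gained by adding one more cluster; the elbow is where it flattens.
drops = -np.diff(inertias)
for k, w in zip(ks, inertias):
    print(k, round(w, 2))
```

Plotting ks against inertias with matplotlib gives the elbow curve; the bend shows up as the point after which the drops become small.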
This table shows the cluster assigned to each record in the dataset, together with its individual sil_width score.
Cluster frequency
Cluster 0 Medium
Cluster 1 Low
Cluster 2 High
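One way to arrive at labels such as Medium, Low and High is to count the records in each cluster and rank the clusters by their mean spending. The sketch below uses synthetic data with pre-assigned cluster ids; the values, and the choice to rank on spending alone, are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic records with cluster ids already attached (values are illustrative).
rng = np.random.default_rng(6)
df = pd.DataFrame({
    "spending": np.concatenate([
        rng.normal(14, 0.5, 40),   # cluster 0
        rng.normal(11, 0.5, 40),   # cluster 1
        rng.normal(18, 0.5, 40),   # cluster 2
    ]),
    "cluster": np.repeat([0, 1, 2], 40),
})

freq = df["cluster"].value_counts().sort_index()  # cluster frequency
profile = df.groupby("cluster").mean()            # mean of each variable per cluster

# Rank clusters by mean spending, lowest first, and attach a segment name.
order = profile["spending"].sort_values().index
segment = {int(c): name for c, name in zip(order, ["Low", "Medium", "High"])}

print(freq)
print(profile.round(2))
print(segment)
```

With all seven variables in the data frame, the same groupby gives the full cluster profile used for the recommendations.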
Observation
By K-
values. Also the elbow curve seems to show similar results.
The minimum silhouette width from K-means is small in value; provided it is positive, this indicates
that all the data points are properly assigned to their clusters, with no mismatch between data
points and clusters.
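A minimal sketch of how the silhouette score and the per-record sil_width values can be computed with scikit-learn. The data are synthetic blobs standing in for the scaled dataset, and the feature names, n_init and random_state are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic stand-in for the scaled data: three well-separated groups.
rng = np.random.default_rng(5)
centers = np.array([[-3.0, -3.0], [0.0, 3.0], [3.0, -3.0]])
X = np.vstack([c + rng.normal(0, 0.5, size=(30, 2)) for c in centers])

km = KMeans(n_clusters=3, n_init=10, random_state=1)
labels = km.fit_predict(X)

avg_score = silhouette_score(X, labels)    # mean silhouette over all records
sil_width = silhouette_samples(X, labels)  # one silhouette width per record

# Attach the cluster id and silhouette width to each record.
result = pd.DataFrame(X, columns=["feature_1", "feature_2"])  # hypothetical names
result["cluster"] = labels
result["sil_width"] = sil_width

print(round(avg_score, 3), round(sil_width.min(), 3))
```

Silhouette widths lie between -1 and 1. A negative value means a record is, on average, closer to a neighbouring cluster than to its own, so a positive minimum is the check that no record is mismatched.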
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional
strategies for different clusters.
Group 3: Medium Spending Group - They are potential target customers who pay their
bills, make purchases and maintain a comparatively good credit score.
- Increase their credit limit or lower the interest rate.
- Promote premium cards/loyalty cards to increase transactions.
- Encourage them to spend more by tying up with premium e-commerce sites, travel portals
and airlines/hotels.