UL Coded Project Report - KC
Submitted By
Karthik Chandrasekaran
Contents
Problem Statement – Coded Project
Context
Objective
Data Description
Data Dictionary
Data Overview
Data Preprocessing
Dataset Summary Statistics
Exploratory Data Analysis
Univariate Analysis
Outlier Treatment
Scaling the dataset before clustering
K-means clustering algorithms
Apply K-means - Elbow curve
Silhouette Score
Cluster Profiling
Hierarchical clustering
Compare cluster K-means clusters and Hierarchical clusters
Actionable Recommendations
Using PCA to reduce the number of variables
Actionable Recommendations
List of Figures
Sl. No  Figure
1   Fig 1: Avg Credit Limit
2   Fig 2: Total Credit Cards
3   Fig 3: Total Bank Visits
4   Fig 4: Total Online Visits
5   Fig 5: Total Calls made
6   Fig 6: Histogram and Boxplot for Avg_Credit_Limit
7   Fig 7: Histogram and Boxplot for Total_Credit_Cards
8   Fig 8: Histogram and Boxplot for Total_Visits_Bank
9   Fig 9: Histogram and Boxplot for Total_Visits_Online
10  Fig 10: Histogram and Boxplot for Total_Calls_Made
11  Fig 11: Correlation Matrix
12  Fig 12: PairPlot
13  Fig 13: Outliers
14  Fig 13: Removal of Outliers
15  Fig 14: Scaled Dataset
16  Fig 15: Clusters and Average Distortions
17  Fig 16: Selecting k with the Elbow Method
18  Fig 17: Silhouette Score
19  Fig 18: Silhouette score for 3 is the highest.
20  Fig 19: Finding optimal no. of clusters with silhouette coefficients
21  Fig 20: Cluster Profiling
22  Fig 21: Checking the groups for Avg_Credit_Limit
23  Fig 21: Checking the groups for the remaining features
24  Fig 22: Boxplot of numerical variables for each cluster: K_means_segments
25  Fig 23: Dendrograms for each linkage method
26  Fig 24: Cophenetic correlation for each linkage method
27  Fig 25: Cophenetic correlation for each linkage method
28  Fig 26: Creating 3 HC clusters
29  Fig 27: Checking the groups for Avg_Credit_Limit
30  Fig 28: Checking the groups for the remaining features
31  Fig 29: Boxplot of numerical variables for each cluster: HC_Clusters
32  Fig 30: Boxplot of numerical variables for each cluster: K_means_segments
33  Fig 31: Boxplot of numerical variables for each cluster: HC_Clusters
34  Fig 32: Scaled Dataset
35  Fig 33: Cumulative Explained Variance by Components1
36  Fig 34: Cumulative Explained Variance by Components2
37  Fig 35: Dendrograms with Linkage Methods
38  Fig 36: PCA_HC_Clusters
39  Fig 37: Boxplot of numerical variables for each cluster: PCA_HC_Clusters
Problem Statement – Coded Project
Context
AllLife Bank wants to focus on its credit card customer base in the next financial year. Its marketing research
team has advised that market penetration can be improved. Based on this input, the Marketing team proposes
to run personalized campaigns to target new customers as well as to upsell to existing customers. Another
insight from the market research was that customers perceive the bank's support services poorly. Based on
this, the Operations team wants to upgrade the service delivery model to ensure that customer queries are
resolved faster. The Head of Marketing and the Head of Delivery have both decided to reach out to the Data
Science team for help.
Objective
To identify different segments in the existing customers, based on their spending patterns as well as past
interaction with the bank, using clustering algorithms, and provide recommendations to the bank on how to
better market to and service these customers.
Data Description
The data provided is of various customers of a bank and their financial attributes like credit limit, the total
number of credit cards the customer has, and different channels through which customers have contacted the
bank for any queries (including visiting the bank, online, and through a call center).
Data Dictionary
Variable                Description
Sl_No                   Primary key of the records
Customer Key            Customer identification number
Average Credit Limit    Average credit limit of each customer for all credit cards
Total credit cards      Total number of credit cards possessed by the customer
Total visits bank       Total number of visits that the customer made (yearly) personally to the bank
Total visits online     Total number of visits or online logins made by the customer (yearly)
Total calls made        Total number of calls made by the customer to the bank or its customer service department (yearly)
Data Overview
Table 4: Unique values in each column
Observation
Customer key, which is an identifier, has repeated values.
Data Pre-processing
Drop the records with duplicate customer keys, then drop the columns that are not needed for the analysis:
'Sl_No' and 'Customer_Key'.
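The preprocessing step above can be sketched with pandas as follows. This is a minimal illustration on a made-up four-row frame, not the report's actual notebook code; keeping the first record for each repeated key is an assumption, since the report does not say which duplicate is retained.

```python
import pandas as pd

# Toy stand-in for the raw data; the values are invented for illustration.
raw = pd.DataFrame({
    "Sl_No": [1, 2, 3, 4],
    "Customer_Key": [101, 102, 101, 103],
    "Avg_Credit_Limit": [100000, 50000, 30000, 50000],
    "Total_Credit_Cards": [2, 3, 7, 3],
})

# Keep only the first record for each repeated customer key (assumed policy).
deduped = raw.drop_duplicates(subset="Customer_Key", keep="first")

# Drop the identifier columns, which carry no information for clustering.
features = deduped.drop(columns=["Sl_No", "Customer_Key"])

# Remove rows that are exact duplicates once the identifiers are gone.
features = features.drop_duplicates().reset_index(drop=True)

print(features.shape)
```

The customer-key deduplication has to run before the 'Customer_Key' column is dropped, which is why the two steps are ordered this way.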
Observation
After removing duplicated keys, duplicated rows, and unnecessary columns, there are 644 unique observations
and 5 columns in our data.
The average credit limit is around 35K, while 50% of customers have a credit limit below 18K. A mean this
far above the median implies strong positive skewness.
The standard deviation shows considerably high variation in credit limits as well.
On average, each customer owns about 5 credit cards. Some customers have 10.
On average, most customer interactions are through calls, followed by online visits. Also, some customers
have never contacted or visited the bank.
There are 644 rows and 5 columns.
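The reasoning behind the skewness observation (a mean well above the median) can be checked directly from the summary statistics. The sketch below uses a small invented series in place of the real Avg_Credit_Limit column, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Invented credit limits with a long right tail; not the report's data.
credit_limit = pd.Series(
    [8000, 12000, 15000, 18000, 20000, 60000, 150000],
    name="Avg_Credit_Limit",
)

summary = credit_limit.describe()   # count, mean, std, min, quartiles, max
mean = summary["mean"]
median = summary["50%"]
skewness = credit_limit.skew()      # sample skewness; positive means a right tail

print(f"mean={mean:.0f}, median={median:.0f}, skew={skewness:.2f}")
```

A mean pulled above the median by a few very large values, together with a positive skew statistic, is the pattern the observation describes.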
Fig 3: Total Bank Visits
23.9% of customers have visited the bank 2 times. 15.1% have never visited the bank.
Fig 5: Total Calls made
16.1% of customers have called up the bank 4 times. 14.6% of them have never made any calls to the bank.
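Percentage labels such as the ones quoted for bank visits and calls can be derived from normalized value counts. The following is a small sketch on invented visit counts; the series name mirrors the dataset column, but the values are assumptions.

```python
import pandas as pd

# Invented yearly bank-visit counts for eight customers.
visits = pd.Series([2, 2, 0, 1, 3, 2, 5, 0], name="Total_visits_bank")

# Share of customers at each visit count, expressed as a percentage.
share = visits.value_counts(normalize=True).mul(100).round(1)

print(share.sort_index())
```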
Fig 8: Histogram and Boxplot for Total_Visits_Bank
The average credit limit has many outliers. Customers with high credit limits are causing the skewness.
Online visits are mostly between 1 and 4, with some outliers at more than 8.
Multivariate analysis
Observations
Fig 12: PairPlot
Outlier Treatment
Visually checking distributions
An appropriate k seems to be 2 or 3.
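The scaling and elbow steps that lead to this choice of k can be sketched as below. This is an illustration on three synthetic blobs rather than the bank data, and it assumes the average distortion is the mean Euclidean distance from each point to its nearest centroid, which is one common definition; the report does not state which one it used.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Standardize so that every feature contributes equally to the distances.
X_scaled = StandardScaler().fit_transform(X)

distortions = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_scaled)
    # Distance from every point to its closest centroid, averaged over points.
    nearest = cdist(X_scaled, model.cluster_centers_, "euclidean").min(axis=1)
    distortions[k] = nearest.mean()

for k, d in distortions.items():
    print(k, round(d, 3))
```

The elbow is the value of k after which the distortion stops dropping sharply; on this synthetic data the bend sits at k = 3.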
Silhouette Score
Observations
The silhouette score is highest for k = 3 (Fig 18).
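A silhouette comparison across candidate values of k can be sketched as follows. The data here is synthetic (three blobs), so the scores are not the report's; the loop only illustrates how the best k is picked.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# The k with the highest mean silhouette coefficient is the preferred choice.
best_k = max(scores, key=scores.get)
print(scores, best_k)
```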
Cluster Profiling
Boxplot of numerical variables for each cluster: K_means_segments
Insights K-means
Cluster 0:
Avg_Credit_Limit: Mid-range.
Total_Credit_Cards: Mid-range.
Total_visits_bank: Highest; these clients visit the bank the most.
Total_visits_online: Low; these clients rarely use online banking.
Total_calls_made: Low; these clients call less than expected.
Cluster 1:
Avg_Credit_Limit: High.
Total_Credit_Cards: High.
Total_visits_bank: Low.
Total_visits_online: High.
Total_calls_made: Low.
Cluster 2:
Avg_Credit_Limit: Low.
Total_Credit_Cards: Low.
Total_visits_bank: Low; these clients rarely visit the bank.
Total_visits_online: Mid-range.
Total_calls_made: High.
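A profile like the one above is typically read off per-cluster means. The sketch below shows one way to build such a table with pandas; the six rows are invented, and only the K_means_segments column name is taken from the report.

```python
import pandas as pd

# Invented customers with already-assigned K-means labels.
df = pd.DataFrame({
    "Avg_Credit_Limit": [30000, 36000, 140000, 150000, 10000, 14000],
    "Total_visits_bank": [4, 5, 1, 0, 1, 1],
    "Total_calls_made": [2, 1, 1, 0, 7, 8],
    "K_means_segments": [0, 0, 1, 1, 2, 2],
})

# Mean of every feature within each segment.
profile = df.groupby("K_means_segments").mean()

# Segment with the highest mean for each feature.
highest = profile.idxmax()

print(profile)
print(highest)
```

Ranking the segment means per feature gives the low, mid-range, and high labels used in the insights.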
Hierarchical clustering
Apply hierarchical clustering with different linkage methods and plot a dendrogram for each linkage method.
Fig 23: Dendrograms for each linkage method
The dendrograms for the weighted, centroid, and average linkage methods show distinct, well-separated
clusters, which is consistent with their high cophenetic correlation scores.
The cophenetic correlation measures the correlation between the pairwise distances of points in feature
space and their distances on the dendrogram. The closer it is to 1, the more faithfully the dendrogram
preserves the original distances.
The highest cophenetic correlation is approximately 0.892, obtained with the Euclidean distance metric and
the average linkage method.
Create and print dataframe to compare Cophenetic Coefficient for each linkage
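One way to build such a comparison table is sketched below with SciPy and pandas. The data is synthetic, the list of linkage methods is an assumption (the report names weighted, centroid, average, and ward), and the column names of the resulting frame are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

# Synthetic data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Original pairwise distances, reused for every linkage method.
pairwise = pdist(X, metric="euclidean")

rows = []
for method in ["single", "complete", "average", "weighted", "centroid", "ward"]:
    Z = linkage(X, method=method, metric="euclidean")
    coeff, _ = cophenet(Z, pairwise)  # correlation of dendrogram vs. original distances
    rows.append({"linkage": method, "cophenetic_coeff": coeff})

comparison = (
    pd.DataFrame(rows)
    .sort_values("cophenetic_coeff", ascending=False)
    .reset_index(drop=True)
)
print(comparison)
```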
Creating 3 HC clusters
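Cutting a dendrogram into three flat clusters can be sketched as follows. The example uses SciPy's `fcluster` with average linkage on synthetic data; the report may have used a different routine (for instance scikit-learn's agglomerative clustering), so treat this as one possible approach.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Build the hierarchy with Euclidean distance and average linkage.
Z = linkage(X, method="average", metric="euclidean")

# Cut the tree so that at most 3 flat clusters remain (labels start at 1).
hc_labels = fcluster(Z, t=3, criterion="maxclust")

# Number of points per cluster, skipping the unused label 0.
sizes = np.bincount(hc_labels)[1:]
print(sizes)
```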
Checking the groups for the remaining features
Cluster 0:
Avg_Credit_Limit: Low.
Total_Credit_Cards: Mid-range.
Total_visits_bank: High.
Total_visits_online: Mid-range.
Total_calls_made: High.
Cluster 1:
Avg_Credit_Limit: High.
Total_Credit_Cards: High.
Total_visits_bank: Low.
Total_visits_online: High.
Total_calls_made: Mid-range.
Cluster 2:
Avg_Credit_Limit: Mid-range.
Total_Credit_Cards: Low.
Total_visits_bank: Low.
Total_visits_online: Low.
Total_calls_made: Low.
Conclusions K-means
Cluster 0: Clients with a mid-range credit limit who are more willing to visit the bank.
Cluster 1: High-end clients with more credit cards and frequent online visits.
Cluster 2: Clients with the lowest credit limit who are more willing to call the bank.
Boxplot of numerical variables for each cluster: HC_Clusters
Cluster 0: Clients with the lowest credit limit, who prefer visiting the bank.
Cluster 1: Clients with the highest credit limit, who prefer online contact.
Cluster 2: Clients with a mid-range credit limit, who neither visit the bank nor use online banking or the
calling facility much.
Actionable Recommendations
One cluster represents customers with high average credit limits and high average balances.
Recommendation: This cluster could be targeted with personalized offers for high-value products and
services. The bank could consider offering these customers higher credit limits or lower interest rates.
Another cluster represents customers with low average credit limits and low average balances.
Recommendation: This cluster could be targeted with offers for basic banking products and services. The
bank could consider offering these customers lower fees or higher interest rates on savings accounts.
The remaining cluster represents customers with average credit limits and average balances.
Recommendation: This cluster could be targeted with offers for a variety of banking products and services.
The bank could consider offering these customers personalized offers based on their individual needs and
preferences.
The bank should reach out to in-person and phone customers to promote online banking.
Using PCA to reduce the number of variables
Let us use PCA to reduce the number of dimensions while retaining at least 80% of the variance.
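A minimal sketch of this step with scikit-learn is shown below. Passing a fraction as `n_components` asks PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The five-feature data is synthetic (two hidden factors plus noise), so the component count it produces says nothing about the bank data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 5 observed features driven by 2 hidden factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 5))

# PCA is scale-sensitive, so standardize the features first.
X_scaled = StandardScaler().fit_transform(X)

# Keep just enough components to explain more than 80% of the variance.
pca = PCA(n_components=0.80, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print(pca.n_components_, cumulative)
```

The reduced matrix `X_reduced` is what would then be passed to the clustering step.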
Fig 34: Cumulative Explained Variance by Components2
Perform Clustering
Fig 35: Dendrograms with Linkage Methods
The dendrogram for the Ward linkage method suggests 4 clusters.
Boxplot of numerical variables for each cluster: PCA_HC_Clusters
Insights PCA_HC_Clusters
Cluster 0
Lowest Avg_Credit_Limit.
Lowest Total_Credit_Cards.
Second-lowest Total_visits_bank.
Second-highest Total_visits_online.
Average Total_calls_made of 7.
Clients prefer calls.
Cluster 1
Highest Avg_Credit_Limit.
Highest Total_Credit_Cards.
Lowest Total_visits_bank.
Highest Total_visits_online.
Average Total_calls_made of 1.
Clients prefer online visits.
Actionable Recommendations
Cluster 0
This type of customer has a low Avg_Credit_Limit and likes to call the bank. It is important to identify whether
they are the type of customer the bank wants to invest in, mainly because developing a better call center
experience can be expensive and customers in this cluster prefer phone calls.
Cluster 1
This type of customer has a good Avg_Credit_Limit and likes to visit the bank in person. It is important to
identify their visiting patterns and improve their experience.
Cluster 2
This type of customer has a good Avg_Credit_Limit and likes to use online banking. It is important to identify
their patterns of online visits and improve their experience by tracking their online activity and showing
them new products and services.
Cluster 3
This type of customer has a decent Avg_Credit_Limit and likes to visit the bank in person. It is important to
identify their visiting patterns and improve their experience.