
Data Mining Project

Problem 1: Clustering

Problem Statement:
A leading bank wants to develop a customer segmentation to give promotional offers to its customers. They
collected a sample that summarizes the activities of users during the past few months. You are given the task
to identify the segments based on credit card usage.

1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and
multivariate analysis).
Exploratory Data Analysis:
Exploratory data analysis is a way of examining data sets in order to summarise their key features, which
typically involves the use of statistical graphics and other data visualisation techniques.

Samples of Data:
• The data contains 210 rows and 7 columns. None of the columns has Null values.
• All the columns have numeric values.
• There are no duplicate rows.
• There are no major outliers observed; only a few outliers are seen in “min_payment_amt”.

Statistical Summary Table

• We observe from the statistical summary above that “spending” and “advance_payments” have large
values compared to the other columns.

• The means of similarly scaled columns are broadly comparable.

• The “probability_of_full_payment” column has probability values ranging between 0.808 and 0.918.

• Because the columns are on different scales, as explained in the two points above, we will need to scale
the data before any clustering analysis.

Correlation Heatmap:

A correlation heatmap displays the correlation matrix between pairs of variables using coloured cells.
Correlation heatmaps are great for comparing the strength of association between variables.
• We observe that there are many highly positively correlated columns.

• Values higher than 0.80 denote high correlation, meaning that when one factor goes up the other tends
to go up with it.

• We see high correlation between spending and advance_payments, credit_limit and current_balance,
which is natural: people with high spending will naturally have high balances, credit limits and advance
payments.

• Similarly, we see good correlation between spending and max_spent_in_single_day, which indicates that
people with high spending power also have a high maximum spend in a single day.

Univariate/Bivariate Analysis:

• When we plot the univariate and bivariate graphs, we observe the same correlations depicted in the
heatmap.

• Wherever there is high correlation, such as spending with advance_payments, credit_limit, current_balance
and max_spent_in_single_day, the plots trend upward or downward together with spending.

• The values in the dataset are not normalised. (A short EDA sketch follows below.)
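The following is a minimal EDA sketch for the steps described above, assuming the bank data sits in a CSV file named "bank_marketing_part1_Data.csv" (the file name is an assumption) with the seven columns referred to in this report.

```python
# Minimal EDA sketch: load, summarise, and visualise the credit-card usage data.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("bank_marketing_part1_Data.csv")   # assumed file name

print(df.info())                 # 210 rows, 7 numeric columns, no nulls
print(df.describe().T)           # statistical summary table
print(df.duplicated().sum())     # duplicate-row check

# Univariate: distribution and outlier check per column
for col in df.columns:
    sns.boxplot(x=df[col])
    plt.title(col)
    plt.show()

# Bivariate/multivariate: pairplot and correlation heatmap
sns.pairplot(df, diag_kind="kde")
plt.show()

sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Heatmap")
plt.show()
```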


1.2 Do you think scaling is necessary for clustering in this case? Justify
Yes, scaling will be required. From the statistical summary table, we have already observed that columns like
“spending” and “advance_payments” contain large values, as they represent the amounts debited/credited on
the card, whereas the “probability_of_full_payment” column contains probability values ranging between
0.808 and 0.918.
As these are on very different scales, we need to scale the data before fitting any clustering model. The reason
is that all clustering algorithms employ some form of distance metric to assess whether a value is more likely
to belong to the same cluster as object ‘m’ than object ‘n’, so the magnitude of the variables directly affects
these distance measurements.
Without scaling, virtually all of the distance between two objects would be driven by the large-valued columns,
which would not be correct. We will use StandardScaler, which removes the mean and scales each
feature/variable to unit variance. This procedure is carried out feature by feature, as sketched below.
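A short sketch of this scaling step, assuming df is the 7-column frame from the EDA sketch above.

```python
# Standardise every feature to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
import pandas as pd

scaler = StandardScaler()
scaled = scaler.fit_transform(df)                     # df from the EDA sketch above
scaled_df = pd.DataFrame(scaled, columns=df.columns)

print(scaled_df.describe().T[["mean", "std"]])        # means ~0, standard deviations ~1
```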
1.3 Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and
briefly describe them.
Scaling (StandardScaler)

(a) Dendrogram using “Average” method (b) Dendrogram using “Ward” method

• We generated Dendrograms using the “Average” and “Ward” linkage methods and observe that there is not
much difference between them.
• From the Dendrograms we see that 3 (three) clusters are optimum for segregating the customers.
• Clusters C1, C2 and C3 have 70, 67 and 73 customers respectively using the Ward linkage method.

- Let us look at the cluster groups created by the Ward-linkage Dendrogram for further analysis.

• The three groups C1, C2 and C3 are primarily segregated by spending power: we can call them the High,
Medium and Low spending power groups.
• ‘spending’, ‘max_spent_in_single_day’ and ‘advance_payments’ are the columns that separate the High,
Medium and Low spending groups most clearly.
• C1 = High Spending Power, C2 = Low Spending Power and C3 = Medium Spending Power.
• Another observation is that “min_payment_amt” is highest in the C2 group. (A clustering sketch follows below.)
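A sketch of the hierarchical clustering step, assuming scaled_df from the scaling sketch above; Ward linkage is used here, as in the report.

```python
# Hierarchical clustering: build the linkage matrix, plot the dendrogram,
# and cut the tree into the 3 clusters identified from it.
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

Z = linkage(scaled_df, method="ward")      # try method="average" for comparison

dendrogram(Z)                              # inspect to choose the number of clusters
plt.title("Dendrogram (Ward linkage)")
plt.show()

df["Clus_ward3"] = fcluster(Z, t=3, criterion="maxclust")
print(df["Clus_ward3"].value_counts())     # cluster sizes (report: 70, 67, 73)
print(df.groupby("Clus_ward3").mean())     # cluster profiles on the original scale
```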
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette
score. Explain the results properly. Interpret and write inferences on the finalized clusters.

(a) Elbow-curve for K-means (b) Silhouette scores for 3 and 4 clusters

• We observe from the elbow curve above that 3 clusters are sufficient, as the curve bends after 3.

• Also, comparing silhouette scores between 3 and 4 clusters, we see a very insignificant difference, so 3
clusters is the optimum number. (A sketch of the elbow and silhouette computation follows below.)

Last column “Clus_kmeans3” shows the cluster values for each row.

Cluster Profile for KMeans clustering

• Here we observe that 3 clusters (0, 1 and 2) are formed. The customers are segregated as 72, 67 and 71
respectively in each cluster.

• The clusters are again formed based on spending power, just like in hierarchical clustering
(Dendrograms).

• Again, ‘spending’, ‘max_spent_in_single_day’ and ‘advance_payments’ separate the High, Medium and
Low spending groups.

• It is just that the order is as follows (a profiling sketch follows this list):

o Cluster ‘0’ – Medium spending power customers
o Cluster ‘1’ – High spending power customers
o Cluster ‘2’ – Low spending power customers
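A small sketch of the cluster-profile table, assuming df and scaled_df from the earlier sketches: the K-Means labels are attached to the original (unscaled) data and the column means are compared per cluster.

```python
# Fit K-Means with 3 clusters and profile each cluster on the original scale.
from sklearn.cluster import KMeans

km3 = KMeans(n_clusters=3, random_state=1).fit(scaled_df)
df["Clus_kmeans3"] = km3.labels_

profile = df.groupby("Clus_kmeans3").mean()
profile["freq"] = df["Clus_kmeans3"].value_counts()   # 72, 67, 71 in the report
print(profile)                                        # compare spending levels per cluster
```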

1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different
clusters.
As the clusters are profiled with respect to their spending power, the marketing strategy for promotional
offers should be based on the same. Below are some recommendations based on the cluster profile analysis.
High Spending Power:

• These are big, premium customers who spend more and spend lavishly.

• Credit Limit: Increase their credit limit so they can spend more.

• Loyalty Rewards: Offer loyalty reward points to increase their spending and make them more loyal.

• Premium Products: Promote offers on luxury/lavish lifestyle items to them.

• Cross-Sell: As they have high spending and good repayment, cross-sell personal loans.

• Subscription Tie-ups: Promote subscription-based offers and regular monthly subscriptions to engage
them more and increase spending.

• Fewer Discounts: These people do not require many discounts, but offer them some discount on full
payment, or a discount if they spend ‘x’ amount in a month; this ‘x’ amount should be at least 20%
more than their current monthly spend.
Medium Spending Power:

• These are medium, regular customers who regularly spend on bills or purchases.

• These are prospective customers who can be motivated into the High spending group.

• Credit Limit: Increase their credit limits a little so that they can spend more.

• Discount Offers: Study their spending habits and promote discounts on items which they regularly buy.

• Loyalty Rewards: Offer loyalty reward points to increase their spending.

• Surveys/Reviews: Drive more engagement by asking them to fill in surveys, to learn where they need
more discount/services support.

• Premium Offers: Slowly motivate them to move to premium product stores/websites.


Low Spending Power:

• These are low spending customers, making only occasional purchases.

• Discount Offers: These people require major discounts and offers on the stores/bills they generally pay.

• Reminders: These people need regular payment reminders.

• Payment Offers: Offer discounts related to payments, such as early payment or full payment.

• Incentives: Offer incentives tied to spending thresholds.


Problem 2: CART-RF-ANN – Claim Prediction

Problem Statement:
An Insurance firm providing tour insurance is facing higher claim frequency. The management decides to
collect data from the past few years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF & ANN and compare the models'
performances in train and test sets.

2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and
multivariate analysis).
Sample of the data

• There is data for 3000 customers.

• It has 10 columns describing information about customers’ travel, including the target field ‘Claimed’.
• There are categorical and numerical values. Categorical values have to be converted to numerical codes.
• There are no Null values.
Statistical Summary and Duplicate Check

• We can see issues with the Duration column. There are 2 wrong values in it:
o We observe a minimum value of -1, which is not possible. We update it to 1, assuming the ‘-’ sign
is a typo (the intended value could be 0 or 1; we keep it as 1).
o We see a maximum value of 4580, which is extremely high compared to the rest. It looks like
another typo with an extra ‘0’, so we update it to 458.

• The rest of the values in the statistical summary look fine.

• We see there are 139 duplicate records.

• But we are NOT removing these duplicate records, as we do not have any unique identifier.

• After correcting the Duration values “-1” and “4580”, we see the corrected statistical summary above.
• There are no major changes in the statistical metrics, so we are fine to move ahead. (A sketch of these
corrections follows below.)
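A short sketch of the Duration corrections described above, assuming the insurance data sits in a CSV file named "insurance_part2_data.csv" (the file name is an assumption).

```python
# Load the insurance data, inspect the summary, and fix the two suspect Duration values.
import pandas as pd

ins = pd.read_csv("insurance_part2_data.csv")   # assumed file name

print(ins.describe().T)                  # Duration shows min = -1 and max = 4580
print(ins.duplicated().sum())            # 139 duplicate rows, kept (no unique identifier)

# Treat the two suspect Duration values as typos, as discussed above
ins.loc[ins["Duration"] == -1, "Duration"] = 1
ins.loc[ins["Duration"] == 4580, "Duration"] = 458

print(ins["Duration"].describe())        # re-check the corrected summary
```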
Univariate, Bi-variate, and multivariate analysis:

Numerical Columns

Categorical Columns
• Besides ‘Age’, all other numerical columns are right-skewed.
• The ‘EPX’ Agency has the maximum bookings, while ‘JZI’ has the least; the difference between them is high.
• The ‘Customized Plan’ is the most opted plan and the ‘Gold Plan’ the least; the difference between them is high.
• ‘Asia’ is the most preferred destination, while ‘Europe’ is the least.
• The ‘C2B’ Agency has the highest insurance claim percentage, while the ‘JZI’ and ‘EPX’ agencies have the
lowest claim percentages.
• The ‘Silver Plan’ and ‘Gold Plan’ have the highest claim percentages, while the ‘Cancellation Plan’ has the lowest.
• Almost all bookings are through the ‘Online’ channel; only 46 rows are from the ‘Offline’ channel.

(a) Insurance Claimed % by Agency – Claimed vs Not Claimed (C2B 61% claimed, CWT 30%, EPX 14%, JZI 13%)

(b) Insurance Claimed % by Plan – Claimed vs Not Claimed across the Bronze, Cancellation, Customised, Gold and Silver Plans

- Online/Offline Channel distribution


• Not much correlation is observed between the columns.
• Though a positive correlation is observed between ‘Sales’ and ‘Commission’: the higher the sales, the higher
the commission, which is a natural association.
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial Neural
Network

Sample of data after converting all object values to Categorical codes and dropping target column “Claimed”

• We are not scaling the features, as these are tree-based models and are therefore not much impacted by
unscaled data.
• We observe that the “Claimed” target has roughly a 30/70 Yes/No split. We divide the data into Train and
Test sets in a 70/30 ratio, which is a standard division in classification problems.
• After dividing the dataset into Train and Test sets, we have 2100 rows in the Training set and 900 in the
Test set. (A sketch of this step follows below.)
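A sketch of the encoding and 70/30 split described above, assuming the corrected frame ins from the previous sketch; object columns are converted to categorical codes and the target "Claimed" is separated out.

```python
# Encode object columns as categorical codes and split 70/30, stratified on the target.
from sklearn.model_selection import train_test_split

for col in ins.select_dtypes(include="object").columns:
    ins[col] = ins[col].astype("category").cat.codes

X = ins.drop("Claimed", axis=1)
y = ins["Claimed"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

print(X_train.shape, X_test.shape)       # (2100, 9) and (900, 9)
```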

2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model.

(i) Performance Metric of Decision Tree Classifier (CART) -


We obtained the best results using the Grid Search method. Below are the best parameters for Decision Tree
Important Parameters as per Decision Tree Classifier - CART

• We see that Agency_Code and Sales are the two most significant
contributors in predicting Claimed vs Not Claimed
• Product_Name is the third, with ~12% importance
• Destination, Channel, Type, Age and Commission do not play any
role in the classification

Decision Tree Classifier – With Best Grid Search Params:


(a) ROC curve of Training Set AUC= 0.824 (b) ROC curve of Test Set AUC= 0.805

(a) Confusion Matrix (Training Set) (b) Confusion Matrix (Test Set)

(a) Classification Report (Training Set) (b) Classification Report (Test set)
Summary - CART

• The overall accuracy of the Decision Tree is 79% in the Training phase and 78.6% in the Testing phase

• Precision, which measures how correctly the model predicts positives, is 0.70 and 0.68 over Train/Test

• Recall, which measures how many actual positives the model captures, is 0.56 and 0.57 over Train/Test.
This score is low

• So overall the model performs very similarly over the Train and Test datasets, which suggests it is not
overfitting.

• False Negatives are higher than False Positives.

• Major contributing features are Agency, Sales and Product. (A sketch of this model follows below.)

(ii) Performance Metric of Random Forest -
We obtained the best results using Grid Search method. Below are the best parameters for Random Forest

Important Parameters as per Random Forest


• We observe from the Random Forest classifier that the four major
features are Agency_Code, Product_Name, Sales and Commission

(a) ROC curve of Training set AUC= 0.849 (b) ROC curve of Test set AUC=0.817

(a) Confusion Matrix (Training set) (b) Confusion Matrix (Testing set)

(a) Classification Report (Training set) (b) Classification Report (Test set)
Summary – Random Forest

• The overall accuracy of the Random Forest is 80.0% in the Training phase and 78.8% in the Testing phase

• Precision, which measures how correctly the model predicts positives, is 0.71 and 0.67 over Train/Test.
This score is marginally higher than the CART classifier’s

• Recall, which measures how many actual positives the model captures, is 0.60 and 0.62 over Train/Test.

• So overall the model performs well over Train with only a slight decrease over Test, so it does not look
like overfitting or underfitting: a stable model

• The difference between False Positives and False Negatives is smaller, indicating a more robust model

• Major contributing features are Agency, Sales, Product and Commission. (A sketch follows below.)
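A sketch of the Random Forest grid search, assuming X, X_train, y_train from the earlier sketches; again the grid is illustrative rather than the exact one from the report.

```python
# Grid-searched Random Forest and its feature importances.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf_grid = GridSearchCV(
    RandomForestClassifier(random_state=1),
    {"n_estimators": [100, 200, 300],
     "max_depth": [5, 6, 7],
     "min_samples_leaf": [10, 20]},
    cv=5)
rf_grid.fit(X_train, y_train)
rf = rf_grid.best_estimator_

print(rf_grid.best_params_)
# Top features in the report: Agency_Code, Product_Name, Sales, Commission
print(sorted(zip(rf.feature_importances_, X.columns), reverse=True))
```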


(iii) Performance Metric of Neural Network (ANN) –
We obtained the best results using the Grid Search method. Below are the best parameters for Multi-Layer
Perceptron Classifier (Neural Network Classifier)

(a) ROC curve for Training set AUC=0.806 (b) ROC curve for Test set AUC= 0.796

(a) Confusion Matrix for Training set (b) Confusion Matrix for Test set
(a) Classification Report for Training set (b) Classification Report for Test set

Summary – ANN

• The overall accuracy of the ANN is 77.8% in the Training phase and 77.6% in the Testing phase

• Precision, which measures how correctly the model predicts positives, is 0.69 and 0.70 over Train/Test

• Recall, which measures how many actual positives the model captures, is 0.51 and 0.53 over Train/Test

• So overall the model performs very similarly over the Train and Test datasets, so it does not look like
overfitting

• Feature contributions cannot be determined directly from an ANN model; it works as a black box.
(A sketch follows below.)
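A sketch of the neural-network (MLPClassifier) model, assuming X_train, X_test, y_train, y_test from the earlier sketches. The standardisation step and the parameter grid here are assumptions: scaling is common practice for MLP inputs even though the tree models above were trained on unscaled data.

```python
# Grid-searched multi-layer perceptron on standardised features.
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

sc = StandardScaler().fit(X_train)
X_train_s, X_test_s = sc.transform(X_train), sc.transform(X_test)

ann_grid = GridSearchCV(
    MLPClassifier(max_iter=2000, random_state=1),
    {"hidden_layer_sizes": [(50,), (100,), (100, 50)],
     "activation": ["relu", "tanh"]},
    cv=5)
ann_grid.fit(X_train_s, y_train)
ann = ann_grid.best_estimator_

print(ann_grid.best_params_)
print(ann.score(X_train_s, y_train), ann.score(X_test_s, y_test))  # ~0.78 / ~0.78 in the report
```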

2.4 Final Model: Compare all the models and write an inference about which model is best/optimized.
Let us compare the Classification Metrics and ROC curves for all 3 models and then decide.

(a) ROC curve comparison for all three models – Train (b) ROC curve comparison for all three models – Test

Model / Metric   CART Train   CART Test   RF Train   RF Test   ANN Train   ANN Test
Accuracy         0.79         0.77        0.80       0.78      0.78        0.77
AUC              0.82         0.80        0.86       0.82      0.82        0.80
Recall           0.53         0.51        0.61       0.56      0.51        0.50
Precision        0.70         0.67        0.72       0.68      0.68        0.67
F1 Score         0.60         0.58        0.66       0.62      0.59        0.57
Summary – Comparison of all 3 Models:

• All 3 models show broadly similar performance on their Training and Test datasets.

• Random Forest has the maximum accuracy of 80% on Train and 78.8% on Test.

• We will choose Random Forest for claim prediction for the following reasons (a comparison sketch follows this list):
❖ It has the maximum accuracy on Train and Test (how accurately the model classifies).
❖ It provides information about the essential features used in classification.
❖ These critical features will support further analysis and bring more clarity to the solution of
this business problem.
❖ It has the highest Precision (0.72 Train / 0.68 Test), which means it predicts positives most
correctly.
❖ It also has the highest Recall (0.61 Train / 0.56 Test); the other models are around 0.51, so this
is a clear, if modest, improvement.
❖ It also has the highest F1 Score, the harmonic mean of Precision and Recall.
❖ Highest AUC – AUROC is an aggregate measure of performance across all possible
classification thresholds: the higher the value, the better the model distinguishes between the
positive and negative classes. If AUC = 1, the model perfectly separates all positive and
negative class points, so RF’s AUC of 0.86 is the best of the three.
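A sketch of the test-set ROC comparison plot, assuming cart, rf, ann, X_test, X_test_s and y_test from the earlier sketches.

```python
# Overlay the ROC curves of the three fitted models on the test set.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

models = {"CART": (cart, X_test), "RF": (rf, X_test), "ANN": (ann, X_test_s)}
for name, (model, X_) in models.items():
    prob = model.predict_proba(X_)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, prob)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {roc_auc_score(y_test, prob):.2f})")

plt.plot([0, 1], [0, 1], linestyle="--", color="grey")   # chance line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```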

2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations
We can get insights and recommendations from the EDA and Important Feature set identified by the
Classifiers.

- From the classifiers, we see that Agency and Product Plan play a significant role; we also see this in the
illustration of the Decision Tree diagram.

- The dataset is highly imbalanced with respect to “Online”/“Offline” mode; we need more data for Offline
bookings. We should also collect more data for other destinations, as Asia dominates and there is
significantly less information about Europe and the Americas.
- Agency ‘C2B’: We need to analyse this agency further; it has the highest claim percentage and the
second-highest bookings, so it is important.
o Surprisingly, there is no ‘Cancellation Plan’ offered by C2B. They should add the
‘Cancellation Plan’ to their offering, as it has the fewest claim cases.
o There is no “Airlines” mode offered, and no destination other than ‘Asia’. Asia has the
highest claim percentage, so they should expand their bookings to other regions.

- Agency ‘JZI’: We need to analyse sales for this agency.


o It has significantly fewer sales and sells only online; we need to look at the offline “Agency” mode
as well, and include training of resources.
o It also sells only the ‘Bronze Plan’; it should broaden its product offering to include other plans,
and better promotional campaigns should be considered.
o If it still does not perform well, looking for a new agency can also be considered.

- Gold and Silver Products:


o The features/offerings of these plans need to be updated, as they attract the most claims.
o Surveys/reviews/interactions with customers are required to gather more information on these
plans and assess customer needs.
o Update the plan features and include a ‘cancellation’ facility with a fee option.

- Airlines vs. Travel Agency:


o Airlines’ average sales (83) are nearly double the average sales of the ‘Travel Agency’ mode (46).
o Surprisingly, Airlines have 50% claim cases while ‘Travel Agency’ has just 18%.
o We need to relook at the commission and product offerings for the ‘Travel Agency’ type, train
resources to communicate better with customers and support more sales than Airlines, and
improve the claim ratios.

- Predictive Analysis Indicator:


o We now have a model with ~80% accuracy, so we can, to a reasonable extent, predict whether a
claim will be made.
o CRM activities should be checked further for the cases flagged by the model.
o CRM interaction is needed to understand other aspects such as claim time, services and
customer satisfaction, to get more information on the predicted cases.
o More diverse data with additional features would provide more insight into other issues.
Currently only two main features, Agency_Code and Product_Name, drive the predictions,
which gives limited pointers to customer problems.

Images:
- Images in the Introduction sections for both problems are taken from:
https://thenextweb.com/news/6-ways-to-save-big-on-business-travel
https://www.roamright.com/travel-insurance-blog/filing-claim/
https://towardsdatascience.com/segmenting-credit-card-customers-with-machine-learning-ed4dbcea009c
- Graphs and ROC curves are from attached Jupyter notebooks
- Some graphs have been re-created in MS-Excel for better clarity
