Data Mining Project
Problem 1: Clustering
Problem Statement:
A leading bank wants to develop a customer segmentation to give promotional offers to its customers. They
have collected a sample that summarizes the activities of users during the past few months. You are given the
task of identifying the segments based on credit card usage.
1.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and
multivariate analysis).
Exploratory Data Analysis:
Exploratory data analysis is a way of examining data sets in order to summarise their key features, which
typically involves the use of statistical graphics and other data visualisation techniques.
Samples of Data:
• The data contains 210 rows and 7 columns. None of the columns has null values.
• All the columns have integer values.
• There are no duplicate rows.
• There are no major outliers; only a few are seen in “min_payment_amt”.
• We observe from the summary statistics above that “spending” and “advance_payments” have large
values compared to the other columns.
• For clustering analysis we will need to scale the data, because of the differences in column scales
explained in the two points above.
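A minimal sketch of the loading and scaling step, assuming the seven columns described above (the file name credit_card_data.csv is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("credit_card_data.csv")   # hypothetical file name; 210 rows x 7 columns
scaler = StandardScaler()                  # z-score scaling: (x - mean) / std
scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(scaled.describe().round(2))          # every column now has mean ~0 and std ~1
```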
Correlation Heatmap:
A correlation heatmap displays the correlation matrix of the variables using coloured cells. Correlation
heatmaps are great for comparing the associations between pairs of variables.
• We observe that many columns are highly positively correlated.
• Values higher than 0.80 denote high correlation, meaning that when one factor goes up the other
goes up with it.
• We see high correlation between spending and advance_payments, credit_limit and current_balance,
which is natural: people with high spending will naturally have high balances, credit limits and advance
payments.
• Similarly, we see good correlation between spending and max_spent_in_single_day, which indicates
that people with high spending power also have a high maximum spend in a single day.
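A minimal sketch of how such a heatmap can be produced, assuming df is the DataFrame loaded above:

```python
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.corr()                           # pairwise Pearson correlations
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation Heatmap")
plt.show()
```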
Univariate/Bivariate Analysis:
• When we draw the univariate and bivariate graphs, we observe the same correlations depicted in the
heatmap.
• Wherever there is high correlation, as with spending versus advance_payments, credit_limit,
current_balance and max_spent_in_single_day, the plots trend upward and downward together with spending.
(a) Dendrogram using “Average” method (b) Dendrogram using “Ward” method
• We generated dendrograms using the “Average” and “Ward” linkage methods, and observe there is not much
difference between them.
• From the dendrograms we see that 3 (three) clusters optimally segregate the customers.
• Using the Ward linkage method, clusters C1, C2 and C3 have 70, 67 and 73 customers respectively.
• Let us examine the cluster groups created by the Ward dendrogram for further analysis (a code sketch
follows the observations below).
• The three groups C1, C2 and C3 are primarily segregated by spending power, giving three groups we can
call High, Medium and Low spending power.
• The separation is clearest on ‘spending’, ‘max_spent_in_single_day’ and ‘advance_payments’, which define
the High, Medium and Low spending groups.
• C1 = High spending power, C2 = Low spending power and C3 = Medium spending power.
• Another observation is that “min_payment_amt” is highest in the C2 group.
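A minimal sketch of the dendrogram and the three-cluster cut, assuming scaled is the standardized data from earlier (the plotting options are assumptions):

```python
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage
import matplotlib.pyplot as plt

Z = linkage(scaled, method="ward")          # swap in method="average" to compare
dendrogram(Z, truncate_mode="lastp", p=25)  # condensed view of the tree
plt.show()

df["Cluster_ward"] = fcluster(Z, t=3, criterion="maxclust")  # cut into 3 clusters
print(df["Cluster_ward"].value_counts())    # sizes reported above: 70 / 67 / 73
print(df.groupby("Cluster_ward").mean().round(2))  # profile each cluster
```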
1.4 Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette
score. Explain the results properly. Interpret and write inferences on the finalized clusters.
(a) Elbow-curve for K-means (b) Silhouette scores for 3 and 4 clusters
• We observe from the elbow curve above that 3 clusters are sufficient, as the curve bends (the “elbow”) at 3.
• Also, comparing the silhouette scores for 3 and 4 clusters, the difference is insignificant, so 3 is the
optimum number of clusters.
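A minimal sketch of the elbow curve, silhouette comparison and final labelling, assuming scaled is the standardized data (random_state is an assumption):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

wss = [KMeans(n_clusters=k, random_state=1).fit(scaled).inertia_
       for k in range(1, 11)]               # within-cluster sum of squares
plt.plot(range(1, 11), wss, marker="o")     # the bend ("elbow") appears at k = 3
plt.show()

for k in (3, 4):
    labels = KMeans(n_clusters=k, random_state=1).fit_predict(scaled)
    print(k, round(silhouette_score(scaled, labels), 3))

df["Clus_kmeans3"] = KMeans(n_clusters=3, random_state=1).fit_predict(scaled)
```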
The last column, “Clus_kmeans3”, shows the assigned cluster for each row.
• Here we observe that 3 clusters, labelled 0, 1 and 2, are formed. The customers are segregated into these
clusters as 72, 67 and 71 respectively.
• The clusters are again formed based on spending power, just as in hierarchical clustering
(dendrograms).
• As before, the separation is clearest on ‘spending’, ‘max_spent_in_single_day’ and ‘advance_payments’,
which define the High, Medium and Low spending groups.
1.5 Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different
clusters.
As the clusters are profiled with respect to their spending power, the marketing strategy for promotional
offers should be based on the same. Below are some recommendations based on the cluster profile analysis.
High Spending Power:
• These are big, premium customers who spend more and spend lavishly.
• Credit limit: Raise their credit limit so they can spend more
• Loyalty rewards: Offer loyalty reward points to increase their spending and make them more loyal
• Cross-sell: As they have high spending and good repayment, cross-sell personal loans
• Subscription tie-ups: Promote subscription-based offers and regular monthly subscriptions to increase
engagement and spending
• Fewer discounts: These customers do not need many discounts, but offer some, e.g. a discount on
full payment, or a discount if they spend ‘x’ amount in a month, where ‘x’ is at least 20% more
than their current monthly spend
Medium Spending Power:
• These are medium, regular customers who routinely spend on bills or purchases.
• These are prospective customers who can be motivated into the High-spending group
• Credit limit: Raise their credit limits a little so that they can spend more
• Discount offers: Study their spending habits and promote discounts on items they regularly buy
• Loyalty rewards: Offer loyalty reward points to increase their spending
• Surveys/Reviews: Increase engagement by asking them to fill in surveys, to learn where they need
more discount/service support
• Discount offers: These customers respond to major discounts and offers on the stores/bills they generally pay
• Payment offers: Offers/discounts related to payments, such as early payment or full payment
Problem 2: CART, RF & ANN
Problem Statement:
An Insurance firm providing tour insurance is facing higher claim frequency. The management decides to
collect data from the past few years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF & ANN and compare the models'
performances in train and test sets.
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and
multivariate analysis).
Sample of the data
• We can see issues in the Duration column; there are 2 wrong values:
o We observe a minimum value of -1, which is not possible. Let’s update it to 1, assuming the
‘-’ sign is a typo (the value could be 0 or 1; we keep it as 1)
o We see a maximum value of 4580, which is extremely high compared to the others. It looks like
another typo with an extra ‘0’, so let’s update it to 458
• After correcting the Duration values “-1” and “4580”, we see the corrected statistical summary above.
• There are no major changes in the statistical metrics, so we are fine to move ahead.
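A minimal sketch of the two Duration corrections described above, assuming the insurance data is loaded into a pandas DataFrame df:

```python
df.loc[df["Duration"] == -1, "Duration"] = 1      # assume the '-' sign was a typo
df.loc[df["Duration"] == 4580, "Duration"] = 458  # assume an extra trailing '0'
print(df["Duration"].describe())                  # re-check the summary statistics
```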
Univariate, Bi-variate, and multivariate analysis:
Numerical Columns
Categorical Columns
• Besides ‘Age’, all the other numerical columns are right-skewed.
• The ‘EPX’ agency has the maximum bookings, while ‘JZI’ has the least; the difference between them is high
• The ‘Customised Plan’ is the most opted plan and the ‘Gold Plan’ the least; the difference between them is high
• ‘Asia’ is the most preferred destination, while ‘Europe’ is the least
• The ‘C2B’ agency has the highest insurance claim percentage, while the ‘JZI’ and ‘EPX’ agencies have
the least
• The ‘Silver Plan’ and ‘Gold Plan’ have the maximum claim percentages, while the ‘Cancellation Plan’ has the least
• Almost all bookings are through the ‘Online’ channel; only 46 rows are from the ‘Offline’ channel
Figure: Insurance Claimed % (by Agency)
Agency   Claimed   Not Claimed
C2B      61%       39%
CWT      30%       70%
EPX      14%       86%
JZI      13%       87%
Figure: Insurance Claimed % (by Plan)
Plan               Claimed   Not Claimed
Bronze Plan        39%       61%
Cancellation Plan  6%        94%
Customised Plan    22%       78%
Gold Plan          64%       36%
Silver Plan        72%       28%
Sample of data after converting all object values to Categorical codes and dropping target column “Claimed”
• We are not scaling the data, as tree-based models are not much impacted by unscaled features.
• We observe that “Claimed” is split roughly 30/70 between Yes and No. We will divide the data into Train
and Test sets in a 70/30 ratio, which is also a standard split for classification problems.
• After dividing the dataset into Train and Test sets, we have 2100 rows in the Training set and 900 in
the Test set
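A minimal sketch of the encoding and the 70/30 split described above; the random_state and the use of stratification are assumptions:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Object columns -> integer category codes (tree models handle these directly).
for col in df.select_dtypes(include="object").columns:
    df[col] = pd.Categorical(df[col]).codes

X = df.drop("Claimed", axis=1)             # features
y = df["Claimed"]                          # target, roughly 30% Yes / 70% No
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
print(X_train.shape, X_test.shape)         # (2100, ...) and (900, ...)
```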
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model.
• We see that Agency_Code and Sales are the two most significant
contributors in predicting Claimed vs Not Claimed
• Product_Name is the third, with ~12% importance
• Destination, Channel, Type, Age and Commission play almost no
role in the classification
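A minimal sketch of the CART model and its feature importances; the hyperparameters shown are assumptions (the notebook may use tuned values):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=5, random_state=1).fit(X_train, y_train)
imp = pd.Series(dt.feature_importances_, index=X_train.columns)
print(imp.sort_values(ascending=False))    # Agency_Code and Sales dominate
```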
(a) Confusion Matrix (Training Set) (b) Confusion Matrix (Test Set)
(a) Classification Report (Training Set) (b) Classification Report (Test set)
Summary – CART
• The overall accuracy of the Decision Tree is 79% in the Training phase and 78.6% in the Testing phase
• Precision (how many of the predicted positives are correct) is 0.70 and 0.68 on Train/Test
• Recall (how many of the actual positives the model captures) is 0.56 and 0.57 on Train/Test. This score is low
• So overall, the model performs very similarly on the Train and Test datasets, which indicates it is not overfitting.
(a) ROC curve of Training set AUC= 0.849 (b) ROC curve of Test set AUC=0.817
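A minimal sketch of how these Train/Test metrics can be produced, shown for the fitted CART model dt; the same calls apply to the RF and ANN models below:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

for name, X_, y_ in [("Train", X_train, y_train), ("Test", X_test, y_test)]:
    pred = dt.predict(X_)
    proba = dt.predict_proba(X_)[:, 1]     # probability of the positive class
    print(name, "accuracy:", round(accuracy_score(y_, pred), 3))
    print(confusion_matrix(y_, pred))
    print(classification_report(y_, pred))
    print("AUC:", round(roc_auc_score(y_, proba), 3))
```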
(a) Confusion Matrix (Training set) (b) Confusion Matrix (Testing set)
(a) Classification Report (Training set) (b) Classification Report (Test set)
Summary – Random Forest
• The overall accuracy of the Random Forest is 80.0% in the Training phase and 78.8% in the Testing phase
• Precision (how many of the predicted positives are correct) is 0.71 and 0.67 on Train/Test. This
score is higher than the CART classifier
• Recall (how many of the actual positives the model captures) is 0.60 and 0.62 on Train/Test
• So overall, the model performs well on Train with only a slight decrease on Test, so it does not look
like overfitting or underfitting: a stable model
• The difference between the FP and FN counts is negligible, indicating a robust model
(a) ROC curve for Training set AUC=0.806 (b) ROC curve for Test set AUC= 0.796
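A minimal sketch of the Random Forest model; n_estimators and max_depth are assumptions and may differ from the tuned values in the notebook:

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=300, max_depth=6, random_state=1)
rf.fit(X_train, y_train)
print(rf.score(X_train, y_train), rf.score(X_test, y_test))  # ~0.80 / ~0.79
```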
(a) Confusion Matrix for Training set (b) Confusion Matrix for Test set
(a) Classification Report for Training set (b) Classification Report for Test set
Summary – ANN
• The overall accuracy of the ANN is 77.8% in the Training phase and 77.6% in the Testing phase
• Precision (how many of the predicted positives are correct) is 0.69 and 0.70 on Train/Test
• Recall (how many of the actual positives the model captures) is 0.51 and 0.53 on Train/Test
• So overall, the model performs very similarly on the Train and Test datasets, so it is not overfitting
• The major contributing features cannot be determined in an ANN model; it works as a black box.
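A minimal sketch of the ANN, using scikit-learn's MLPClassifier as an assumed stand-in for the notebook's network; all hyperparameters are assumptions:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Unlike the tree models, an MLP does benefit from scaled inputs,
# so a scaler is included in the pipeline here.
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=2000, random_state=1))
ann.fit(X_train, y_train)
print(ann.score(X_train, y_train), ann.score(X_test, y_test))
```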
2.4 Final Model: Compare all the models and write an inference about which model is best/optimized.
Let us compare the Classification Metrics and ROC curves for all 3 models and then decide.
(a) ROC curve comparison for all three models – Train (b) ROC curve comparison for all three models – Test
• Random Forest has the maximum accuracy: 80% on Train and 78.8% on Test
• We will choose Random Forest for claim prediction for the following reasons
❖ It has the maximum accuracy on Train and Test (how accurately the model classifies)
❖ It provides information about the essential features used in classification.
❖ These critical features will support further analysis and bring more clarity to the
solution of this business problem
❖ It has high precision (0.61), meaning the share of predicted positives that are correct is high, so
we will predict true positives more accurately
❖ Its recall (0.56) is also slightly higher than the other models’ (0.51), though the difference is small
❖ It also has the highest F1-score; the F1-score is the harmonic mean of precision and recall.
❖ Highest AUC – AUROC denotes the total measure of performance across all possible
classification thresholds, so the higher the value, the more accurately the model
distinguishes between the positive and negative classes. If AUC = 1, the model can
perfectly separate all the positive and negative class points. So a high AUC of 0.86 is
good compared to the others.
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations
We can get insights and recommendations from the EDA and Important Feature set identified by the
Classifiers.
- From the classifiers, we see that Agency and Product Plan play a significant role. We also see the same
in the illustrated decision tree diagram.
- The dataset is highly imbalanced between the “Online” and “Offline” modes; we need to collect more data
for the Offline mode. Similarly, ‘Asia’ dominates the destinations, with significantly less information
about Europe and the Americas, so more data should be collected there too.
- Agency ‘C2B’: We need to analyse this agency further. It has the highest claim percentage and
the second-highest bookings, so it is important.
o Surprisingly, no ‘Cancellation Plan’ is offered by C2B. They should add the
‘Cancellation Plan’ to their offering, as it has the fewest claim cases.
o There is no “Airlines” mode offered, and no destination other than ‘Asia’. Since Asia has the
highest claim percentage, they should expand their bookings to other regions.
Images:
- The images in the Introduction sections of both problems are taken from:
https://thenextweb.com/news/6-ways-to-save-big-on-business-travel
https://www.roamright.com/travel-insurance-blog/filing-claim/
https://towardsdatascience.com/segmenting-credit-card-customers-with-machine-learning-ed4dbcea009c
- Graphs and ROC curves are from the attached Jupyter notebooks
- Some graphs have been re-created in MS Excel for better clarity