Data Mining Project Ragunathan
INSURANCE DOMAIN
DATA MINING AND ANALYSIS
10/23/2021
Contents
Executive Summary
Introduction
Clustering Analysis
Q 1.1
Fig.1 – Datatypes
Fig.2 – Null values count in each column
Fig.4 – Duplicate count
Fig.5 – Mean, mode, standard deviation analysis
Fig.6 – Outliers in each column
Fig.7 – Pairplot
Q 1.2
Q 1.3
Fig.8 – Dendrogram
Q 1.4
Fig.9 – Three customer segmentation
Fig.10 – Two customer segmentation
Q 1.5
CART-RF-ANN
Q 2.1
Fig.2.1 – Column info
Fig.2.2 – Column null count info
Fig.2.3 – Metadata of all columns
Fig.2.4 – Duplicate count
Fig.2.5 – Pairplot
Fig.2.6 – Distribution plot
Q 2.2
Q 2.3
Fig.2.7 – Performance of CART
Fig.2.8 – Performance of Random Forest - Training
Fig.2.9 – Performance of Random Forest - Test
Fig.2.10 – ROC of Random Forest - Test
Fig.2.10 – Performance of Neural Network - Training
Fig.2.11 – ROC of Neural Network - Training
Fig.2.11 – Performance of Neural Network - Test data
Fig.2.12 – ROC of Neural Network - Test
Q 2.4
Fig.2.13 – Performance comparison of various models
Fig.2.14 – Training data - ROC_AUC curve of various models
Fig.2.15 – Test data - ROC_AUC curve of various models
Q 2.5
Executive Summary
This report identifies customer segments based on credit card usage, using data
collected by a leading bank that wants to develop a customer segmentation in order to
give promotional offers to its customers.
Introduction
The purpose of the exercise is to identify customer segments based on factors
such as spending, advance payments, probability of full payment, current balance,
credit limit, minimum payment amount, and maximum amount spent in a single
shopping trip.
Clustering Analysis
Q 1.1
Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis)
Result of Analysis:
Fig.1 – Datatypes
Fig.7 – Pairplot
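The initial checks listed in the figures above (datatypes, null counts, duplicate count, summary statistics, outliers) can be sketched with pandas as follows. This is a minimal illustration, not the code used for the report: the data frame is made up, and only `current_balance` and `min_payment_amt` are column names taken from this report; `spending` is an assumed name.

```python
import pandas as pd

# Tiny made-up frame standing in for the bank dataset (values are illustrative only).
df = pd.DataFrame({
    "spending": [19.9, 16.0, 18.9, 11.0, 16.0, 55.0],      # assumed column name
    "current_balance": [6.7, 5.4, 6.2, 5.0, 5.4, 5.8],
    "min_payment_amt": [3.8, 2.7, None, 5.2, 2.7, 3.1],
})

dtypes = df.dtypes                             # datatype of each column
null_counts = df.isnull().sum()                # null values per column
duplicate_count = int(df.duplicated().sum())   # fully duplicated rows
summary = df.describe()                        # mean, std, quartiles per column

# IQR rule for one column: flag values more than 1.5 * IQR outside the quartiles.
q1 = df["spending"].quantile(0.25)
q3 = df["spending"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["spending"] < q1 - 1.5 * iqr) | (df["spending"] > q3 + 1.5 * iqr)]

print(null_counts)
print("duplicates:", duplicate_count)
print(outliers)
```

The IQR rule is only one common way to flag outliers; the report does not state which rule was used for Fig.6.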
Q 1.2
Do you think scaling is necessary for clustering in this case? Justify.
Result of Analysis:
The given columns are measured in different units. Because we are dealing
with distance-based algorithms, scaling is needed so that columns with larger
magnitudes do not dominate the others. For example:
current_balance: balance amount left in the account to make purchases (in 1000s)
min_payment_amt: minimum amount paid by the customer while making payments for
purchases made monthly (in 100s)
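As a rough illustration of the scaling step, the sketch below standardizes two columns with scikit-learn's `StandardScaler`. The numbers are invented, and z-score standardization is an assumption here; the report does not say which scaler was used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: [current_balance, min_payment_amt] on deliberately different scales.
X = np.array([
    [6.7, 380.0],
    [5.4, 270.0],
    [6.2, 450.0],
    [5.0, 520.0],
])

# Standardize each column to zero mean and unit variance so that
# Euclidean distances are not dominated by the larger-valued column.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```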
Q1.3
Apply hierarchical clustering to scaled data. Identify the number of optimum
clusters using Dendrogram and briefly describe them
Hierarchical clustering was performed on the scaled data. The optimal number of
clusters for the given bank dataset is 2. The following figure shows the clusters.
Fig.8 – Dendrogram
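A minimal sketch of hierarchical clustering with SciPy is given below. It runs on synthetic, already-scaled data, and Ward linkage is an assumption, since the report does not name the linkage method. The dendrogram is computed with `no_plot=True` so the example runs without a plotting backend.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Synthetic stand-in for the scaled data: two well-separated groups of 20 points.
rng = np.random.default_rng(0)
X_scaled = np.vstack([
    rng.normal(-2.0, 0.5, size=(20, 3)),
    rng.normal(2.0, 0.5, size=(20, 3)),
])

# Build the merge tree (Ward linkage is an assumed choice).
Z = linkage(X_scaled, method="ward")

# Compute the dendrogram layout without drawing it.
tree = dendrogram(Z, no_plot=True)

# Cut the tree into at most two flat clusters, matching the report's choice of 2.
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])  # cluster sizes
```

In practice the number of clusters is read off the dendrogram by looking for the largest vertical gap between successive merges.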
Q1.4
Apply K-Means clustering to scaled data. Identify the number of optimum
clusters using the elbow curve and silhouette score.
K-Means clustering was applied to the scaled data. While the elbow curve method
suggests that the given data can be segmented into three segments, the silhouette score
suggests that it can be segmented into two.
In the three-segment profile, there is not much difference in the means of the
various features/columns across segments.
In the two-segment profile, there is a clear separation in the means of the
various features/columns, as shown below.
Fig.10 – Two customer segmentation
The silhouette score with two clusters is 0.46577247686580914, and with three
clusters it is 0.40072705527512986.
Both the elbow and silhouette methods are used to find the optimal number of clusters.
The elbow method can be ambiguous when picking the value of k. Silhouette analysis
measures the separation distance between the resulting clusters and can be considered
a better method than the elbow method.
Since the silhouette score is higher with two clusters, we opt for a two-cluster
segmentation.
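The comparison between the elbow method and the silhouette score can be sketched as below. The data is synthetic, so the printed scores are not the report's values, and the `n_init` and `random_state` settings are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data: two well-separated groups of 50 points.
rng = np.random.default_rng(42)
X_scaled = np.vstack([
    rng.normal(-2.0, 0.6, size=(50, 2)),
    rng.normal(2.0, 0.6, size=(50, 2)),
])

inertias, silhouettes = {}, {}
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_scaled)
    inertias[k] = km.inertia_          # within-cluster sum of squares (elbow curve)
    if k >= 2:                         # silhouette needs at least two clusters
        silhouettes[k] = silhouette_score(X_scaled, km.labels_)

# Pick the k with the highest silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print("inertia by k:", inertias)
print("silhouette by k:", silhouettes)
print("best k by silhouette:", best_k)
```

Inertia always falls as k grows, which is why the elbow has to be judged by eye, whereas the silhouette score gives a single value per k that can be compared directly.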
Q 1.5
Describe cluster profiles for the clusters defined. Recommend different
promotional strategies for different clusters.
Result of Analysis:
Fig.2.5 – Pairplot
Fig.2.6 – Distribution plot
Q 2.2
Data Split: Split the data into test and train sets, and build classification models
using CART, Random Forest, and Artificial Neural Network.
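One way to sketch the split and the three models with scikit-learn is shown below. The data comes from `make_classification` rather than the insurance dataset, and the 70/30 split, tree depth, forest size, and hidden-layer size are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the insurance dataset.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)

# Assumed 70/30 split, stratified so both sets keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

models = {
    "CART": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # The neural network is sensitive to feature scale, so it is wrapped with a scaler.
    "ANN": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    ),
}

for model in models.values():
    model.fit(X_train, y_train)

test_accuracy = {name: model.score(X_test, y_test) for name, model in models.items()}
print(test_accuracy)
```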
Q.2.3
Performance Metrics: Comment and Check the performance of Predictions on Train
and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score,
classification reports for each model.
Performance of CART:
Fig.2.7 – Performance of CART
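The metrics named in this question can be computed for one model as in the sketch below. It uses synthetic data and an assumed tree depth, so the numbers it prints are not the report's results. The same loop would apply to the Random Forest and Neural Network models.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data and an assumed 70/30 split.
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
cart = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

results = {}
for split, (X_s, y_s) in {"train": (X_train, y_train), "test": (X_test, y_test)}.items():
    pred = cart.predict(X_s)
    proba = cart.predict_proba(X_s)[:, 1]     # probability of the positive class
    fpr, tpr, _ = roc_curve(y_s, proba)       # points for plotting the ROC curve
    results[split] = {
        "accuracy": accuracy_score(y_s, pred),
        "confusion_matrix": confusion_matrix(y_s, pred),
        "roc_auc": roc_auc_score(y_s, proba),
        "roc_points": (fpr, tpr),
        "report": classification_report(y_s, pred),
    }

for split, r in results.items():
    print(split, "accuracy:", round(r["accuracy"], 3), "AUC:", round(r["roc_auc"], 3))
    print(r["confusion_matrix"])
```

A large gap between train and test scores is the usual sign of overfitting, which is why both sets are evaluated.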
Q.2.4
Final Model: Compare all the models and write an inference on which model is
best/optimized.
Q.2.5
Inference: Based on the whole analysis, what are the business insights and
recommendations?
The Claimed column in the given dataset contains both Yes and No values. The
business expectation is to reduce claims in order to maximize profit, so we
need to concentrate on the cases that have Yes in the Claimed column.
We need to find out which channels, products, agencies, and age groups
have the most claims, and concentrate on actions to reduce claims in
those areas. In addition, we need to increase the number of policies in
the areas with no claims.
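Finding the groups with the most claims could be sketched with a pandas group-by, as below. The rows are made up, and `Channel` and `Agency_Code` are hypothetical column names; only `Claimed` with Yes/No values comes from this report.

```python
import pandas as pd

# Made-up policy rows; "Channel" and "Agency_Code" are hypothetical column names.
df = pd.DataFrame({
    "Channel": ["Online", "Online", "Offline", "Online", "Offline", "Online"],
    "Agency_Code": ["A", "B", "A", "A", "B", "B"],
    "Claimed": ["Yes", "No", "No", "Yes", "No", "No"],
})

# Turn Yes/No into 1/0 so that the group mean is the claim rate.
df["claimed_flag"] = (df["Claimed"] == "Yes").astype(int)

# Claim rate and policy count per agency, highest claim rate first.
claim_rate_by_agency = (
    df.groupby("Agency_Code")["claimed_flag"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(claim_rate_by_agency)
```

The same group-by can be repeated for channel, product, or binned age to see where claims concentrate. Reporting the count alongside the rate helps avoid acting on groups with very few policies.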