Data Mining Project Ragunathan


BANKING & INSURANCE DOMAIN
DATA MINING AND ANALYSIS
10/23/2021
Contents

Executive Summary
Introduction
Clustering Analysis
Q 1.1
Fig.1 – Datatypes
Fig.2 – null values count in each column
Fig.4 – Duplicate count
Fig.5 – Mean, mode, standard deviation analysis
Fig.6 – Outliers in each column
Fig.7 – Pairplot
Q 1.2
Q1.3
Fig.8 – Dendrogram
Q1.4
Fig.9 – Three customer segmentation
Fig.10 – Two customer segmentation
Q 1.5
CART-RF-ANN
Q 2.1
Fig.2.1 – Column info
Fig.2.2 – Column null count info
Fig.2.3 – Metadata of all columns
Fig.2.4 – Duplicate count
Fig.2.5 – Pairplot
Fig.2.6 – Distribution plot
Q 2.2
Q.2.3
Fig.2.7 – Performance of CART
Fig.2.8 – Performance of Random Forest - Training
Fig.2.9 – Performance of Random Forest - Test
Fig.2.10 – ROC of Random Forest - Test
Fig.2.10 – Performance of Neural Network - Training
Fig.2.11 ROC of Neural Network - Training
Fig.2.11 – Performance of Neural Network - Test data
Fig.2.12 ROC of Neural Network - Test
Q.2.4
Fig.2.13 Performance comparison of various models
Fig.2.14 Training data - ROC_AUC curve of various models
Fig.2.15 Test data - ROC_AUC curve of various models
Q.2.5

Executive Summary
This report identifies customer segments based on credit card usage, using data
collected by a leading bank that wants to develop a customer segmentation in order to
give promotional offers to its customers.

Introduction
The purpose of the exercise is to identify customer segments based on factors
such as spending, advance payments, probability of full payment, current balance,
credit limit, minimum payment amount, and maximum amount spent in a single
purchase.
Clustering Analysis
Q 1.1
Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis)

Result of Analysis:

Descriptive statistics help us understand the nature of the data: the data
type of each feature/column, the number of rows, the presence of null values,
outliers and duplicates, the mean, mode and standard deviation, and the
distribution of data across features. The following results summarize these
properties.
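
The checks listed above can be sketched in pandas as follows. This is only an illustration, not the code behind the figures: the bank's data file is not part of this report, so the frame below is synthetic (random values, with one missing value and one duplicate row injected on purpose), and the 1.5 * IQR outlier rule is an assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bank data (assumption: the real file is not shown here)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "spending": rng.normal(15, 3, 200),
    "advance_payments": rng.normal(14, 1.3, 200),
    "credit_limit": rng.normal(3.2, 0.4, 200),
})
df.loc[5, "spending"] = np.nan                          # inject one missing value
df = pd.concat([df, df.iloc[[0]]], ignore_index=True)   # inject one duplicate row

print(df.dtypes)               # data type of each column
print(df.isnull().sum())       # null count per column
print(df.duplicated().sum())   # number of duplicate rows
print(df.describe().T)         # count, mean, std, min, quartiles, max

# Outlier count per column using the 1.5 * IQR rule (assumed rule)
q1, q3 = df.quantile(0.25), df.quantile(0.75)
iqr = q3 - q1
outliers = ((df < q1 - 1.5 * iqr) | (df > q3 + 1.5 * iqr)).sum()
print(outliers)
```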

Fig.1 – Datatypes

Fig.2 – null values count in each column


Fig.3 – not null count

Fig.4 – Duplicate count

Fig.5 – Mean, mode, standard deviation analysis


Fig.6 – Outliers in each column

Fig.7 – Pairplot
Q 1.2
Do you think scaling is necessary for clustering in this case? Justify.

Result of Analysis:

The columns are measured in different units. Because clustering here relies on
distance-based algorithms, columns with larger numeric ranges would dominate the distance
calculation, so scaling is needed to keep any one column from outweighing the others.
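
As a small illustration of this point, the sketch below applies z-score scaling to a few columns. The report does not say which scaler was used, so z-scoring is an assumption, and the values are synthetic.

```python
import numpy as np
import pandas as pd

# Synthetic columns on deliberately different scales (illustrative values only)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "spending": rng.normal(15, 3, 100),
    "advance_payments": rng.normal(14, 1.3, 100),
    "probability_of_full_payment": rng.uniform(0.80, 0.92, 100),
})

# z-score: subtract the column mean, divide by the column standard deviation
scaled = (df - df.mean()) / df.std(ddof=0)

print(df.std(ddof=0).round(3))      # spreads differ widely before scaling
print(scaled.std(ddof=0).round(3))  # every column has unit spread afterwards
```

After scaling, each column contributes on a comparable scale to a Euclidean distance.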

Data Dictionary for Market Segmentation:

spending: Amount spent by the customer per month (in 1000s)

advance_payments: Amount paid by the customer in advance by cash (in 100s)

probability_of_full_payment: Probability of payment done in full by the customer to the
bank

current_balance: Balance amount left in the account to make purchases (in 1000s)

credit_limit: Limit of the amount in credit card (in 10000s)

min_payment_amt: Minimum paid by the customer while making payments for purchases
made monthly (in 100s)

max_spent_in_single_shopping: Maximum amount spent in one purchase (in 1000s)

Q1.3
Apply hierarchical clustering to scaled data. Identify the number of optimum
clusters using Dendrogram and briefly describe them

Hierarchical clustering was performed on the scaled data. The optimal number of
clusters for the given bank dataset is 2, as the dendrogram below shows.
Fig.8 – Dendrogram
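
A minimal sketch of this step with SciPy is shown here. The linkage method is not stated in the report, so Ward linkage is an assumption, and the array X is synthetic data standing in for the scaled bank features.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Two synthetic, well-separated groups standing in for the scaled data
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-2, 0.5, (50, 3)),
               rng.normal(2, 0.5, (50, 3))])

Z = linkage(X, method="ward")        # Ward linkage is an assumption
tree = dendrogram(Z, no_plot=True)   # use no_plot=False with matplotlib to draw it
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters

print(np.bincount(labels)[1:])       # cluster sizes (labels start at 1)
```

The number of clusters is usually read off the dendrogram by looking for the largest vertical gap between successive merges and cutting the tree there.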

Q1.4
Apply K-Means clustering to scaled data. Identify the number of optimum
clusters using the elbow curve and silhouette score and briefly describe them.

K-Means clustering was applied to the scaled data. The elbow curve suggests that
the data can be split into three segments, whereas the silhouette score suggests
two.

In the three-segment profile below, the means of the features/columns differ
little between segments.

Fig.9 – Three customer segmentation

In the two-segment profile below, the means of the features/columns are clearly
separated.
Fig.10 – Two customer segmentation

The silhouette score is 0.46577247686580914 with two clusters and
0.40072705527512986 with three clusters.

The elbow and silhouette methods are both used to find the optimal number of clusters.
The elbow method can be ambiguous when picking the value of k. Silhouette analysis
measures the separation distance between the resulting clusters and can be considered
a better method than the elbow method here.

Since the silhouette score is higher with two clusters, we opt for a two-cluster
segmentation.
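
The comparison described above can be sketched with scikit-learn as follows. The data is synthetic (two well-separated groups), so the numbers it prints are not the scores reported above. The range of k and the KMeans settings are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data: two well-separated groups
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 0.6, (60, 4)),
               rng.normal(2, 0.6, (60, 4))])

wss, sil = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = km.inertia_                      # within-cluster sum of squares (elbow curve)
    sil[k] = silhouette_score(X, km.labels_)  # mean silhouette coefficient, in [-1, 1]

best_k = max(sil, key=sil.get)                # k with the highest silhouette score
print(best_k)
```

Plotting wss against k gives the elbow curve. Picking the k with the highest silhouette score, as done here, is the rule applied in the answer above.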

Q 1.5
Describe cluster profiles for the clusters defined. Recommend different
promotional strategies for different clusters.

There is a significant difference in spending and advance payments between the two
customer segments (see Fig.10 in the previous question). Offers and promotions
targeted at Cluster 0 can attract those customers and grow the business; at the same
time, we need to retain the Cluster 1 customers.
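
A cluster profile of the kind referred to above is a table of per-cluster feature means. A minimal pandas sketch follows. The six rows are made-up numbers for illustration only, not the bank's data, and the column name `cluster` is a hypothetical label column.

```python
import pandas as pd

# Made-up rows for illustration; "cluster" is a hypothetical label column
df = pd.DataFrame({
    "spending":         [12.1, 11.8, 18.9, 19.4, 12.5, 18.2],
    "advance_payments": [13.3, 13.1, 16.2, 16.5, 13.4, 16.0],
    "cluster":          [0, 0, 1, 1, 0, 1],
})

profile = df.groupby("cluster").mean()                        # mean of each feature per cluster
profile["size"] = df["cluster"].value_counts().sort_index()   # customers per cluster
print(profile.round(2))
```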
CART-RF-ANN
Q 2.1
Read the data, do the necessary initial steps, and exploratory data analysis
(Univariate, Bi-variate, and multivariate analysis).

Result of Analysis:

Descriptive statistics help us understand the nature of the data: the data
type of each feature/column, the number of rows, the presence of null values,
outliers and duplicates, the mean, mode and standard deviation, and the
distribution of data across features. The following results summarize these
properties.

Fig.2.1 – Column info


Fig.2.2 – Column null count info

Fig.2.3 – Metadata of all columns


Fig.2.4 – Duplicate count

Fig.2.5 – Pairplot
Fig.2.6 – Distribution plot

Q 2.2
Data Split: Split the data into test and train, build classification model CART,
Random Forest, Artificial Neural Network

The data was split into train and test sets, and classification models
were built using CART, Random Forest, and Artificial Neural Network. Please
refer to the attached source code for details.
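
Since the attached source code is not reproduced in this report, the following is only a sketch of how such a split and the three models might look in scikit-learn. The data is generated with `make_classification`, and the 70/30 split and all hyperparameters are assumptions, not the settings used for the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Synthetic binary classification data standing in for the insurance dataset
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)

# Hyperparameters below are illustrative assumptions
models = {
    "CART": DecisionTreeClassifier(max_depth=5, random_state=1),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "ANN": make_pipeline(  # neural networks usually need scaled inputs
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(50,), max_iter=2000, random_state=1)),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, round(model.score(X_test, y_test), 3))  # test accuracy
```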

Q.2.3
Performance Metrics: Comment and Check the performance of Predictions on Train
and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score,
classification reports for each model.

The performance of the CART, Random Forest, and Artificial Neural
Network models is as follows.

Performance of CART:
Fig.2.7 – Performance of CART

Performance of Random Forest:

Fig.2.8 – Performance of Random Forest - Training


Fig.2.9 – Performance of Random Forest - Test
Fig.2.10 – ROC of Random Forest - Test

Performance of Neural Network:

Fig.2.10 – Performance of Neural Network -training


Fig.2.11 ROC of Neural Network -training

Fig.2.11 – Performance of Neural Network -Test data


Fig.2.12 ROC of Neural Network -test
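
The metrics named in this question (accuracy, confusion matrix, ROC curve, ROC_AUC score, classification report) can be computed as sketched here for a single model. The data is synthetic and the model settings are assumptions, so the printed values are not the ones in the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_auc_score, roc_curve)

# Synthetic data and an illustrative model (assumptions, not the report's setup)
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
model = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_train, y_train)

for split, Xs, ys in [("train", X_train, y_train), ("test", X_test, y_test)]:
    pred = model.predict(Xs)               # hard class labels
    prob = model.predict_proba(Xs)[:, 1]   # probability of the positive class
    print(split, "accuracy:", round(accuracy_score(ys, pred), 3))
    print(split, "ROC_AUC :", round(roc_auc_score(ys, prob), 3))
    print(confusion_matrix(ys, pred))      # rows: actual, columns: predicted
    print(classification_report(ys, pred))

# Points of the test ROC curve, ready to plot as tpr against fpr
fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
```

A large gap between train and test scores is the usual sign of overfitting, which is why both sets are reported.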

Q.2.4
Final Model: Compare all the models and write an inference which model is
best/optimized.

The performance of all three models (CART, RF, and ANN) and their
ROC_AUC values are given below. Based on this performance, the best model
for the given data set is Random Forest.

Fig.2.13 Performance comparison of various models


Fig.2.14 Training data - ROC_AUC curve of various models

Fig.2.15 Test data - ROC_AUC curve of various models

Q.2.5
Inference: Based on the whole Analysis, what are the business insights and
recommendations

The Claimed column in the given dataset contains both Yes and No
values. The business expectation is to reduce claims in order to maximize
profit, so we need to concentrate on the cases where Claimed is Yes. We
should find out which channels, products, agencies, and age groups have the
most claims and focus on actions to reduce claims in those areas. In
addition, we should increase the number of policies in the areas with no
claims.
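
One way to find the groups with the most claims is to compute the claim rate per category, as sketched below. Only the `Claimed` column name comes from the text above. `Channel`, `Agency`, and all the rows are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical rows; only the "Claimed" column name is taken from the report
df = pd.DataFrame({
    "Channel": ["Online", "Online", "Offline", "Online", "Offline", "Online"],
    "Agency":  ["A", "B", "A", "A", "B", "B"],
    "Claimed": ["Yes", "No", "No", "Yes", "No", "No"],
})
df["claimed_flag"] = (df["Claimed"] == "Yes").astype(int)  # 1 for Yes, 0 for No

for col in ["Channel", "Agency"]:
    # mean of the 0/1 flag is the claim rate; count is the number of policies
    rate = df.groupby(col)["claimed_flag"].agg(["mean", "count"])
    print(rate.sort_values("mean", ascending=False))
```

Groups at the top of each table are candidates for claim-reduction actions, and groups at the bottom are candidates for selling more policies.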
