Data Mining Project Report - Reshma

The document discusses analyzing customer data from a bank to perform customer segmentation using clustering algorithms. It includes exploratory data analysis of the data, which finds outliers and positive skewness across most variables. Scaling is justified for clustering since there is high correlation between variables. Hierarchical and K-means clustering are applied to identify optimal customer segments. Models like CART, random forest and neural network are also used for classification and their performance is evaluated. Key insights and recommendations are provided based on the full analysis.


DATA MINING

PROJECT
REPORT
DSBA

Reshma A
PGP-DSBA Online May-2022
Date:02/10/22
Contents
Problem 1: Clustering
1.1. Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
1.2. Do you think scaling is necessary for clustering in this case? Justify
1.3. Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them
1.4. Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters
1.5. Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters
Problem 2: CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial Neural Network
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model
2.4 Final Model: Compare all the models and write an inference which model is best/optimized
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations
List of Figures
Figure 1: Heatmap
Figure 2: Dendrogram
Figure 3: Dendrogram 2
Figure 4: Cluster count
Figure 5: Elbow Curve
Figure 6: Heatmap
Figure 7: CART ROC for Train data
Figure 8: CART ROC for Test data
Figure 9: RF ROC for Train data
Figure 10: RF ROC for Test data
Figure 11: ANN ROC for Train data
Figure 12: ANN ROC for Test data
Figure 13: ROC comparison for Test data
Figure 14: ROC comparison for Train data

List of Tables
Table 1: Sample of dataset
Table 2: Statistical data
Table 3: Variable information
Table 4: Skewness table
Table 5: Scaled data
Table 6: Cluster frequencies
Table 7: Sample dataset
Table 8: Variable information
Table 9: Statistical information
Table 10: Classification Report for Train CART data
Table 11: Classification Report for Test CART data
Table 12: Classification Report for Train RF data
Table 13: Classification Report for Test RF data
Table 14: Classification Report for Train ANN data
Table 15: Classification Report for Test ANN data
Problem 1: Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its
customers. They collected a sample that summarizes the activities of users during the past few
months. Please note that it is a summarized data that contains the average values in all the
columns considering all the months, and not for any particular month. You are given the task to
identify the segments based on credit card usage.

1.1. Read the data, do the necessary initial steps, and exploratory
data analysis (Univariate, Bi-variate, and multivariate
analysis).

Sample of dataset:

Table 1: Sample of dataset

The following is the statistical data associated with the dataset and the information:

Table 2:Statistical data


Table 3:Variable information
The shape of the data is (210,7).
The data dictionary:
1. spending: Amount spent by the customer using the credit card per month (in 1000s). For
example, if the spending is 19.94, then the customer has actually spent (19.94 * 1000 =
19940) 19940 Rs per month on an average. 
2. advance_payments: Amount paid by the customer in advance by cash even before the
credit card bill got generated for any particular month (in 100s). For example, if the
advance_payments is 16.92, then the customer has paid (16.92*100 = 1692) 1692 Rs on an
average per month. 
3. probability_of_full_payment: Probability that the customer pays the credit card bill in
full. For example, a value of 0.8752 means the customer has an 87.52% chance of paying
the entire credit card bill in an average month.
4. current_balance: The balance amount left in the credit card account for future
purchases (in 1000s). For example, if the current_balance is 6.675, the customer is left
with a credit card balance of 6.675*1000 = 6675 Rs which he can use for future
purchases.
5. credit_limit: Limit of the amount in credit card (10000s) sanctioned by the bank to the
customer. For example, if the credit_limit is 3.763, it means that the customer has been
sanctioned a credit card limit of (3.763*10000 = 37,630) 37630 Rs.
6. min_payment_amt : The average minimum amount paid by the customer while making
payments for the credit card bill purchases made monthly (in 100s). For example, if the
min_payment_amt is 3.252, it means that the customer has paid only (3.252*100 = 325.2)
325.2 Rs as the minimum payment instead of paying the entire credit card bill amount on
an average per month. 
7. max_spent_in_single_shopping: Maximum amount spent by the customer for a single
transaction using the credit card (in 1000s). For example, if the
max_spent_in_single_shopping is 6.55, it means that the customer has spent a maximum
of (6.55*1000=6550) 6550 Rs for a single transaction using credit card on an average per
month. 
From the data information obtained above, the dataset contains 210 rows and 7 columns.

Univariate Analysis
Univariate analysis is the simplest form of analyzing data: "uni" means one, so the data is
examined one variable at a time. Each variable is summarized and plotted separately in order
to answer a specific question about its distribution.
From the univariate analysis, we can conclude that outliers are present only in
probability_of_full_payment and min_payment_amt. Only probability_of_full_payment is
negatively skewed; all other variables are positively skewed.
Table 4:Skewness Table

Multivariate Analysis

Figure 1:Heatmap

From the heatmap we can see that there is high correlation between all the variables.

1.2. Do you think scaling is necessary for clustering in this case? Justify

Scaling is necessary for clustering in this case because the variables are measured on
different scales (for example, spending is in 1000s while probability_of_full_payment is a
proportion). After scaling, all values lie in comparable ranges, so no single variable
dominates the distance computations and the resulting clusters are more reliable.
Given below is a sample of the dataset after scaling.
Table 5:Scaled data
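The scaling step can be sketched with scikit-learn's StandardScaler (a z-score transform). The first row below reuses the example values from the data dictionary; the other two rows are purely illustrative, since the full dataset is not reproduced here.

```python
# Sketch: z-score scaling of the seven usage variables.
import pandas as pd
from sklearn.preprocessing import StandardScaler

cols = ["spending", "advance_payments", "probability_of_full_payment",
        "current_balance", "credit_limit", "min_payment_amt",
        "max_spent_in_single_shopping"]
df = pd.DataFrame([[19.94, 16.92, 0.8752, 6.675, 3.763, 3.252, 6.550],
                   [15.99, 14.89, 0.9064, 5.363, 3.582, 3.336, 5.144],
                   [18.95, 16.42, 0.8829, 6.248, 3.755, 3.368, 6.148]],
                  columns=cols)
scaled = pd.DataFrame(StandardScaler().fit_transform(df), columns=cols)
# Each column now has mean ~0 and unit variance, so no single variable
# dominates the Euclidean distances used by the clustering algorithms.
```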

1.3. Apply hierarchical clustering to scaled data. Identify the number of
optimum clusters using Dendrogram and briefly describe them

A hierarchical clustering method works by grouping data into a tree of clusters.
Agglomerative hierarchical clustering begins by treating every data point as a separate
cluster. Then it repeatedly executes two steps:
1. Identify the two clusters that are closest together, and
2. Merge these two most similar clusters.
These steps continue until all the clusters are merged together.
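The two steps above can be sketched with scipy's hierarchy module. The report does not state which linkage criterion was used, so Ward linkage is an assumption here, and three synthetic two-dimensional blobs stand in for the scaled bank data.

```python
# Sketch: agglomerative clustering on stand-in data with Ward linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),   # three well-separated blobs
               rng.normal(3, 0.3, (20, 2)),   # stand in for the 210 scaled
               rng.normal(6, 0.3, (20, 2))])  # customer records

Z = linkage(X, method="ward")   # repeatedly merges the two closest clusters
# Cut the tree where the dendrogram suggests, here at 3 clusters:
labels = fcluster(Z, t=3, criterion="maxclust")
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the tree shown in Figures 2 and 3.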

Given below is the constructed dendrogram.

Figure 2:Dendrogram
Figure 3:Dendrogram2

The table below shows a sample of the clustering output, with the column 'Clusters'
indicating cluster membership. There are 70 records in cluster 1, 67 in cluster 2 and 73 in
cluster 3, so all clusters have almost equal frequencies.

Figure 4: Cluster count

1.4. Apply K-Means clustering on scaled data and determine optimum
clusters. Apply elbow curve and silhouette score. Explain the results
properly. Interpret and write inferences on the finalized clusters.
Figure 5:Elbow Curve

The optimal number of clusters was identified as 3.

The within-cluster sum of squares (inertia) values for k = 1 to 10 were:

k = 1: 1470.00
k = 2: 659.17
k = 3: 430.66
k = 4: 371.29
k = 5: 327.13
k = 6: 289.47
k = 7: 263.04
k = 8: 239.83
k = 9: 224.11
k = 10: 207.74

Silhouette Score
The silhouette width was calculated for all the records; the minimum score was
-0.0669, which is very close to zero and suggests that few records lie on the wrong side of
a cluster boundary.
From the clusters created, we can see that the high-spending group spends the most and is
the segment most receptive to promotional offers.
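A sketch of how the elbow inertias and silhouette score above could be computed with scikit-learn; synthetic blobs stand in for the scaled bank data, so the numbers here will not match the report's values.

```python
# Sketch: elbow (inertia) curve and silhouette score for K-Means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.4, (30, 2)),
               rng.normal(4, 0.4, (30, 2)),
               rng.normal(8, 0.4, (30, 2))])

inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)   # plotted against k, the "elbow" marks k = 3

km3 = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
score = silhouette_score(X, km3.labels_)  # near 1 => well-separated clusters
```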

1.5. Describe cluster profiles for the clusters defined. Recommend different
promotional strategies for different clusters.

Table 6: Cluster frequencies

From the dendrograms and further analysis, we can confirm that 3 clusters are optimal.
There are three groups based on spending:
Cluster 1: High spending group
Cluster 2: Low spending group
Cluster 3: Medium spending group

High Spending Group:
Increase loan limits, as these customers appear financially more capable than the rest
Tie up with premium brands
Increase the credit limit
Medium Spending Group:
Increasing the credit limit may motivate the user to spend more
Use loyalty cards to encourage repeat spending
Low Spending Group:
Offer attractive discounts for early payments
Send reminders for on-time payments
Tie up with hypermarkets and low-cost shops

Problem 2: CART-RF-ANN

An insurance firm providing tour insurance is facing a higher claim frequency. The
management decides to collect data from the past few years. You are assigned the task of
building a model which predicts the claim status and providing recommendations to
management. Use CART, RF & ANN and compare the models' performances on the train and test sets.

2.1 Read the data, do the necessary initial steps, and exploratory data
analysis (Univariate, Bi-variate, and multivariate analysis).
Attribute information:

1. Target: Claim Status (Claimed)
2. Code of tour firm (Agency_Code)
3. Type of tour insurance firms (Type)
4. Distribution channel of tour insurance agencies (Channel)
5. Name of the tour insurance products (Product)
6. Duration of the tour (Duration in days)
7. Destination of the tour (Destination)
8. Amount worth of sales per customer in procuring tour insurance policies in rupees (in 100s)
9. The commission received by the tour insurance firm (Commission, as a percentage of sales)
10. Age of insured (Age)

Given below is a sample of the dataset. There are 3000 records and 10 columns. There are
139 duplicate rows among the records, but since this is a small fraction of the 3000
records, they were not deleted.
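The duplicate check mentioned above can be sketched with pandas (the three toy rows below are illustrative stand-ins, not actual records from the insurance dataset):

```python
# Sketch: counting fully identical rows with pandas.
import pandas as pd

df = pd.DataFrame({"Agency_Code": ["A1", "A1", "A2"],   # toy stand-in rows
                   "Claimed":     ["Yes", "Yes", "No"]})
n_dup = df.duplicated().sum()   # rows identical in every column
# On the actual data this check returned 139 duplicates out of 3000 rows,
# which were retained in the analysis.
```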

Table 7:Sample dataset

Table 8:Variable information


Table 9:Statistical information

Univariate Analysis

From the heatmap we created, it is evident that there is very little correlation
between the variables.
Figure 6:Heatmap

2.2 Data Split: Split the data into test and train, build classification model
CART, Random Forest, Artificial Neural Network

Classification Model Cart

CART stands for Classification and Regression Trees. The parameter settings were
identified as:

We created variable importance based on the decision tree classifier


Random Forest

The random forest is a classification algorithm consisting of many decision trees. It uses
bagging and feature randomness when building each individual tree, in order to create an
uncorrelated forest of trees whose prediction by committee is more accurate than that of any
individual tree.

The best parameters were identified as :

The variable importance are given below:
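The random-forest fit with hyperparameter search can be sketched as below. The grid values are illustrative assumptions, not the exact grid behind the "best parameters" screenshot, and synthetic data again stands in for the insurance records.

```python
# Sketch: random forest tuned with a small grid search.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [100, 300],     # illustrative grid
                     "max_features": [2, 3]},
                    cv=3).fit(X_tr, y_tr)
rf = grid.best_estimator_
importances = rf.feature_importances_   # averaged over the bagged trees
```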

Artificial Neural Network


An artificial neural network is an interconnected group of nodes, inspired by a simplification
of neurons in a brain.
The best parameters were identified as:
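The ANN fit can be sketched with scikit-learn's MLPClassifier. Feature scaling is included because neural networks are sensitive to input scale; the hidden-layer size and iteration budget are illustrative assumptions, not the tuned values from the report.

```python
# Sketch: multilayer perceptron on scaled features.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ann = make_pipeline(StandardScaler(),               # scale inputs first
                    MLPClassifier(hidden_layer_sizes=(100,),  # illustrative
                                  max_iter=1000,
                                  random_state=0)).fit(X_tr, y_tr)
acc = ann.score(X_te, y_te)   # accuracy on the held-out test split
```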

2.3 Performance Metrics: Comment and Check the performance of
Predictions on Train and Test sets using Accuracy, Confusion Matrix,
Plot ROC curve and get ROC_AUC score, classification reports for each
model.
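The metrics reported below for each model can be sketched with scikit-learn's metrics module. A logistic regression on synthetic data acts as a stand-in classifier here; in the report these calls were applied to the CART, RF and ANN models.

```python
# Sketch: accuracy, confusion matrix, ROC-AUC and classification report.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression   # stand-in classifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = clf.predict(X_te)

acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred)           # rows: actual, columns: predicted
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])  # needs probabilities
report = classification_report(y_te, pred)  # precision, recall, F1 per class
```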
CART MODEL:

For training data,

Figure 7:CART ROC for Train data

The AUC is 0.825


The confusion matrix is as given:

The classification report for train data is as follows

Table 10:Classification Report for Train CART data

For test data,


The AUC is calculated as 0.792
Figure 8:CART ROC for Test Data

The following is the confusion matrix calculated :

Table 11:Classification Report for Test CART data

RANDOM FOREST MODEL:

For training data,


Figure 9:ROC for Train RF data

The following is the confusion matrix calculated :

The classification report is given below:

Table 12:Classification Report for Train RF data

For test data,


Figure 10: ROC for Test RF data

The following is the confusion matrix calculated :

The classification report is given below:

Table 13:Classification Report for Test RF data

ARTIFICIAL NEURAL NETWORK:

For training data,


Figure 11:ROC for Train ANN data

The classification report for train data is as follows:

Table 14:Classification Report for Train ANN data

For test data,

Figure 12:ROC for Test ANN data


The AUC is calculated as 0.687
The classification for test data is:

Table 15:Classification Report for Test ANN data

2.4 Final Model: Compare all the models and write an inference which
model is best/optimized.

Figure 13: ROC for Test Data

CART model: Red
Random Forest: Green
Artificial Neural Network: Black

Train Data Comparison:

Metric      CART    Random Forest   Neural Network
AUC         0.825   0.834           0.687
Precision   0.67    0.71            0.44
F1 score    0.78    0.62            0.53
Accuracy    0.79    0.80            0.65

Test Data Comparison:

Metric      CART    Random Forest   Neural Network
AUC         0.792   0.815           0.696
Precision   0.71    0.73            0.47
F1 score    0.58    0.55            0.53
Accuracy    0.77    0.76            0.65

From the comparison data, it is evident that the Random Forest model is the optimal model
for this scenario: its AUC and precision on both train and test data are better than those
of the CART model and the Neural Network.

Figure 14:ROC for Train data

CART model: Red
Random Forest: Green
Artificial Neural Network: Black

2.5 Inference: Based on the whole Analysis, what are the business insights
and recommendations
Based on the whole analysis, the deductions drawn were:

Using these models, the insurance firm can prioritize customers based on their predicted
claim frequency.
Customers with a more favourable (lower) predicted claim frequency should be targeted over
the rest.
Higher commission rates can be recommended for customers whose claim status is 'No'.
