Data Mining Project Report - Reshma
PROJECT REPORT
DSBA
Reshma A
PGP-DSBA Online May-2022
Date:02/10/22
Contents
Problem 1: Clustering
1.1. Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
1.2. Do you think scaling is necessary for clustering in this case? Justify
1.3. Apply hierarchical clustering to scaled data. Identify the number of optimum clusters using Dendrogram and briefly describe them
1.4. Apply K-Means clustering on scaled data and determine optimum clusters. Apply elbow curve and silhouette score. Explain the results properly. Interpret and write inferences on the finalized clusters
1.5. Describe cluster profiles for the clusters defined. Recommend different promotional strategies for different clusters
Problem 2: CART-RF-ANN
2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate, Bi-variate, and multivariate analysis)
2.2 Data Split: Split the data into test and train, build classification model CART, Random Forest, Artificial Neural Network
2.3 Performance Metrics: Comment and Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score, classification reports for each model
2.4 Final Model: Compare all the models and write an inference which model is best/optimized
2.5 Inference: Based on the whole Analysis, what are the business insights and recommendations
List of Figures
Figure 1: Heatmap
Figure 2: Dendrogram
Figure 3: Dendrogram2
Figure 4: Cluster count
Figure 5: Elbow Curve
Figure 6: Heatmap
Figure 7: CART ROC for Train data
Figure 8: CART ROC for Test Data
Figure 9: RF ROC for Train data
Figure 10: Classification Report for RF
Figure 11: ROC for Test Data
Figure 12: ROC for Train data
List of Tables
Table 1: Sample of dataset
Table 2: Statistical data
Table 3: Variable information
Table 4: Skewness Table
Table 5: Scaled data
Table 6: Cluster frequencies
Table 7: Sample dataset
Table 8: Variable information
Table 9: Statistical information
Problem 1:Clustering
A leading bank wants to develop a customer segmentation to give promotional offers to its
customers. They collected a sample that summarizes the activities of users during the past few
months. Please note that this is summarized data that contains the average values in all the
columns across all the months, and not for any particular month. You are given the task to
identify the segments based on credit card usage.
1.1. Read the data, do the necessary initial steps, and exploratory
data analysis (Univariate, Bi-variate, and multivariate
analysis).
Sample of dataset:
The following is the statistical data associated with the dataset and the information:
Univariate Analysis
Univariate analysis is the simplest form of data analysis. "Uni" means one, so the analysis
considers only one variable at a time, and each variable is examined separately. Data is
gathered for the purpose of answering a question, or more specifically, a research question.
From the univariate analysis, we can conclude that outliers are present only in the probability of
full payment and the minimum payment amount. Only the probability of full payment is negatively
skewed; all the other variables are positively skewed.
Table 4:Skewness Table
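The skewness check described above can be sketched as follows. This is only a minimal illustration, assuming the data sits in a pandas DataFrame; the column names and values are hypothetical and are not taken from the bank dataset.

```python
import pandas as pd

# Hypothetical toy data; the real dataset's columns and values are not reproduced here.
df = pd.DataFrame({
    "right_tailed": [1, 2, 2, 3, 3, 3, 20],        # long tail to the right
    "left_tailed": [-20, -3, -3, -3, -2, -2, -1],  # long tail to the left
})

# DataFrame.skew() returns the sample skewness of each numeric column.
skewness = df.skew()

# A positive value indicates positive (right) skew, a negative value negative (left) skew.
direction = skewness.apply(lambda s: "positive" if s > 0 else "negative")
print(pd.DataFrame({"skewness": skewness, "direction": direction}))
```

Applied to the real data, the same call would give one skewness value per variable, which is the kind of summary a skewness table reports.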
Multivariate Analysis
Figure 1:Heatmap
From the heatmap, we can see that the variables are highly correlated with one another.
1.2. Do you think scaling is necessary for clustering in this case? Justify
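Scaling is necessary for clustering in this case because the variables are in different ranges.
Clustering relies on distances between records, so without scaling the variables with the largest
ranges would dominate the result. After scaling, all the values are in similar ranges, which leads
to more meaningful clusters.
Given below is a sample of the dataset after scaling.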
Table 5:Scaled data
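A minimal sketch of one common way to scale the data is shown here. It assumes z-score standardization with scikit-learn's StandardScaler; the report does not state which scaler was used, and the values below are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: one feature in the thousands, one between 0 and 1.
X = np.array([
    [12000.0, 0.87],
    [15500.0, 0.91],
    [9800.0, 0.84],
    [18200.0, 0.89],
])

# StandardScaler subtracts each column's mean and divides by its standard deviation.
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # each column is centred near 0
print(X_scaled.std(axis=0))   # each column has a standard deviation of 1
```

After this step both features contribute on a comparable scale to any distance calculation.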
1.3. Apply hierarchical clustering to scaled data. Identify the number of
optimum clusters using Dendrogram and briefly describe them.
A hierarchical clustering method works by grouping data into a tree of clusters. Hierarchical
clustering begins by treating every data point as a separate cluster. Then, it repeatedly executes the
following steps:
1. Identify the two clusters that are closest together, and
2. Merge these two most similar clusters. These steps are repeated until all the
clusters are merged together.
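The merge procedure described above can be sketched with SciPy. This is an illustration only: the data is synthetic, and Ward linkage is an assumption, since the report does not state which linkage method was used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical scaled data: three well-separated groups of 2-D points.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centres])

# linkage() carries out the repeated "find the closest pair, merge it" steps.
# Ward linkage is an assumption; other methods (average, complete) also exist.
Z = linkage(X, method="ward")

# Each row of Z records one merge, so n points produce n - 1 merges.
print(Z.shape)

# Cut the tree so that at most 3 clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])  # number of points in each cluster
```

Passing Z to scipy.cluster.hierarchy.dendrogram would draw the tree, and the heights of the final merges are what is usually inspected when choosing the number of clusters.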
Figure 2:Dendrogram
Figure 3:Dendrogram2
The table below shows a sample of the clustering, with the column 'Clusters' indicating the cluster
assigned to each record. There are 70 records in cluster1, 67 in cluster2 and 73 in cluster3, so the
three clusters are almost equal in size.
Silhouette Score
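The silhouette width was calculated for all the records, and the minimum width recorded was
-0.06687035371860134. A slightly negative minimum means that at least one record lies marginally
closer to a neighbouring cluster than to its own.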
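A minimal sketch of how per-record silhouette widths and their minimum could be obtained with scikit-learn is given here. The data is synthetic, and the K-Means settings (3 clusters, n_init, random_state) are illustrative assumptions rather than the settings used in the report.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical data: three overlapping groups of 2-D points.
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [3.0, 3.0], [-3.0, 3.0]])
X = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centres])

# Assign each point to one of 3 clusters (illustrative settings).
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

# silhouette_samples() gives one width per record, each between -1 and 1.
widths = silhouette_samples(X, labels)
print("minimum silhouette width:", widths.min())

# silhouette_score() is the mean of the per-record widths.
print("average silhouette score:", silhouette_score(X, labels))
```

Records with a width below zero are the ones worth inspecting, since they sit nearer to another cluster than to the one they were assigned to.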
From the clusters created, we can see that the high spending group spends more than the other
groups, and these customers are likely to be the most receptive to promotional offers.
1.5. Describe cluster profiles for the clusters defined. Recommend different
promotional strategies for different clusters.
From the dendrograms and further analysis, we can confirm that 3 clusters are optimal.
The customers fall into three groups based on spending.
Cluster1: High spending group
Cluster2: Low spending group
Cluster3: Medium spending group
Problem 2:CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The management
decides to collect data from the past few years. You are assigned the task to make a model
which predicts the claim status and provide recommendations to management. Use CART,
RF & ANN and compare the models' performances in train and test sets.
2.1 Read the data, do the necessary initial steps, and exploratory data
analysis (Univariate, Bi-variate, and multivariate analysis).
Attribute information:
Given below is the sample dataset. There are 3000 records and 10 columns. There are 139
duplicate rows among the records, but since this is a small share (about 4.6% of the 3000
records), they have not been deleted.
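As a rough sketch, the duplicate count mentioned above could be obtained with pandas as shown here. The DataFrame and its column names are hypothetical placeholders, not the insurance dataset.

```python
import pandas as pd

# Hypothetical records; the column names are placeholders, not the real dataset's.
df = pd.DataFrame({
    "age": [36, 36, 48, 29, 48],
    "claimed": ["No", "No", "Yes", "No", "Yes"],
})

# duplicated() flags every row that exactly repeats an earlier row.
n_duplicates = int(df.duplicated().sum())
share = n_duplicates / len(df)
print(f"{n_duplicates} duplicate rows ({share:.1%} of {len(df)} records)")

# If the duplicates were to be removed instead, drop_duplicates() would do it.
deduplicated = df.drop_duplicates()
print(len(deduplicated), "records remain after removing duplicates")
```

Reporting the share alongside the raw count makes the decision to keep or drop the rows easier to justify.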
Univariate Analysis
From the heatmap we created, it is evident that there is very little correlation
between the variables.
Figure 6:Heatmap
2.2 Data Split: Split the data into test and train, build classification model
CART, Random Forest, Artificial Neural Network
CART stands for Classification and Regression Trees. The parameter settings were identified
as
The random forest is a classification algorithm consisting of many decision trees. It uses
bagging and feature randomness when building each individual tree to try to create an
uncorrelated forest of trees whose prediction by committee is more accurate than that of any
individual tree.
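The sketch below shows a train/test split followed by a single CART-style tree and a random forest in scikit-learn. It runs on synthetic data, and every hyperparameter value (test size, depth, number of trees) is an illustrative assumption, not the setting used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the insurance data (hypothetical features and labels).
X, y = make_classification(
    n_samples=3000, n_features=9, n_informative=5, random_state=0
)

# Hold out 30% of the records for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# A single CART-style decision tree.
cart = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

# bootstrap=True gives bagging; max_features="sqrt" adds feature randomness per split.
forest = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", bootstrap=True, random_state=0
).fit(X_train, y_train)

print("CART test accuracy:", cart.score(X_test, y_test))
print("RF test accuracy:  ", forest.score(X_test, y_test))
```

Comparing train and test scores for each model in the same way is a simple check for overfitting.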
2.4 Final Model: Compare all the models and write an inference which
model is best/optimized.
From the comparison data, it is evident that the RF model is the optimal model for this case.
The accuracy, precision and F1 score of the RF model are better than those of both the CART
model and the neural network.
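For reference, the metrics used in this comparison can be computed as in the sketch below. The labels and predictions are made-up values for illustration only and are not results from any of the three models.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
)

# Made-up true labels and predictions (1 = claim, 0 = no claim).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1]

# Rows are true classes and columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)
print(cm)

accuracy = accuracy_score(y_true, y_pred)    # share of correct predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
print(accuracy, precision, f1)
```

Running the same calls on each model's train and test predictions gives a like-for-like table for comparing the models.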
2.5 Inference: Based on the whole Analysis, what are the business insights
and recommendations
Based on the whole analysis, the following deductions were drawn:
Using these models, the insurance firm can prioritize its customers
based on their predicted claim frequency.
Customers with a more favourable (lower) predicted claim frequency should be targeted ahead of
the rest.
Higher commission rates can be recommended for customers who have a claim status of No.