Data Mining Problem 2 Report
Executive Summary
An insurance firm providing tour insurance is facing a higher claim frequency. The management has decided to
collect data from the past few years. The task is to build a model that predicts claim status and to provide
recommendations to management, using CART, RF & ANN, and to compare the models'
performance on the train and test sets.
Introduction
The purpose of this exercise is to explore the dataset and interpret the inferences drawn from it: perform
exploratory data analysis, split the dataset into train and test sets, build CART, Random
Forest and Artificial Neural Network classification models, and give a logical reason for the values
chosen for the parameters of each model. For this insurance dataset we have mainly covered three aspects:
data quality assurance, data insights and visualization, and business strategies.
Data Description
This dataset contains data on 3000 customers across 10 variables. The attributes are described
below:
1. Agency Code: name of the agency
2. Agency Type: type of travel insurance agency (airline or travel)
3. Distribution Channel: online or offline
4. Product Name: Basic, Premium, Bronze, 1-way comprehensive, 2-way comprehensive,
cancellation, ticket protector, 24 protect, etc.
5. Claim: yes or no
6. Duration: duration of travel in days
7. Destination: destination country of travel
8. Net Sales: amount of sales of travel insurance policies
9. Commission: commission received by the travel insurance agency
10. Age: age of the insured
There are a total of 3000 rows and 10 columns in the dataset. Of the 10 columns, 6 are of object type and
the remaining 4 are of integer or float type.
From the above results we can see that there are no missing values in the dataset.
The analysis found 139 duplicated rows. On inspecting these rows, however, the nature of the data
suggests that such duplication can occur legitimately: duplicated() returns 139 duplicated rows, but since
the dataset contains customer records, two different customers may well share identical feature values.
Hence we will not drop these duplicates.
Scaling the Dataset with Z-scores and Treating Object-type Columns
A new dataframe is drafted with only three columns, Commission, Sales and Duration, for applying the
z-score. Assuming the distributions to be approximately normal, we check how many records lie beyond
3 standard deviations and treat those records as outliers: the z-score is capped at -3 wherever it falls
below -3, and at +3 wherever it exceeds +3.
Later, the remaining columns from the original dataset are imported into the new dataframe.
Similarly, the columns whose datatype is stored as object are treated: they are converted to int by
assigning category codes. A minimal sketch of these two steps follows the column listing below.
The resulting column data types are:
Commision       float64
Sales           float64
Duration        float64
Type            int8
Claimed         int8
Channel         int8
Product Name    int8
Destination     int8
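A minimal sketch of the outlier capping and the categorical encoding, assuming the working dataframe is named df (a hypothetical name) with the column names listed above:

import pandas as pd
from scipy.stats import zscore

num_cols = ['Commision', 'Sales', 'Duration']  # column names as stored in the data

# Replace the numeric columns with z-scores capped at +/-3 standard deviations
df[num_cols] = df[num_cols].apply(zscore).clip(lower=-3, upper=3)

# Convert object-type columns to integer category codes
for col in df.select_dtypes(include='object').columns:
    df[col] = pd.Categorical(df[col]).codes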
Univariate Analysis
Histograms are plotted for all numerical variables using the sns.displot() function from the seaborn
package.
Bar plots are plotted for all categorical variables using the sns.countplot() function from the seaborn package, as sketched below.
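A sketch of these univariate plots, again assuming the dataframe is named df:

import seaborn as sns
import matplotlib.pyplot as plt

# Histogram for each numeric variable (displot is a figure-level seaborn function)
for col in ['Commision', 'Sales', 'Duration']:
    sns.displot(df[col])
    plt.show()

# Bar plot of category frequencies for each categorical variable
for col in ['Type', 'Claimed', 'Channel', 'Product Name', 'Destination']:
    sns.countplot(x=col, data=df)
    plt.show()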
Figure 2.1 - Pairplot
Multivariate Analysis
We will now plot a heat map of the correlation matrix to evaluate the relationships between the numeric
variables in our dataset. This graph can help us check for any correlations between different variables.
From the heatmap below, we can see a small correlation between Commission and Sales, and between
Duration and Sales.
Figure 2.2 – heatmap
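A sketch of how such a heatmap can be produced, assuming the dataframe is named df:

import seaborn as sns
import matplotlib.pyplot as plt

# Annotated heatmap of the correlation matrix for the numeric columns
corr = df[['Commision', 'Sales', 'Duration']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()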
2.2 Data Split: Split the data into test and train, build classification model CART, Random
Forest, Artificial Neural Network
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model
Splitting the Dataset into Train and Test Data (70:30)
The target variable is "Claimed"; this column is extracted to form the labels for the train and test datasets.
For building the models we split the dataset into training and testing data in the ratio 70:30.
The two predictor sets are stored in X_train and X_test with their corresponding dimensions, as sketched below.
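A minimal sketch of the split; the random_state value is an assumption made here for reproducibility:

from sklearn.model_selection import train_test_split

# Separate the predictors from the target variable "Claimed"
X = df.drop('Claimed', axis=1)
y = df['Claimed']

# 70:30 split of predictors and labels
X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1)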
CART Model
Classification and Regression Trees (CART) are a type of decision tree used in data mining. It is a
supervised learning technique where the predicted outcome is either a discrete class (classification)
or a continuous numerical value (regression).
Using the train dataset (X_train) we create a CART model and then test the model on the
test dataset (X_test).
For creating the CART model two packages were imported from sklearn, namely "DecisionTreeClassifier" and
"tree".
With the help of DecisionTreeClassifier we create a decision tree model, dt_model, and fit the train data
using the "gini" criterion. After this, using the tree package, we create a dot file, claim_tree.dot,
to help visualise the tree, as sketched below.
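A sketch of this step; the random_state value is an assumption:

from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Fit a decision tree on the training data using the Gini criterion
dt_model = DecisionTreeClassifier(criterion='gini', random_state=1)
dt_model.fit(X_train, train_labels)

# Write the tree out as a dot file for visualisation with Graphviz
with open('claim_tree.dot', 'w') as f:
    tree.export_graphviz(dt_model, out_file=f,
                         feature_names=list(X_train.columns))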
Below are the variable importance values (feature importances) used to build the tree:
Feature         Imp
Commision       0.074662
Sales           0.186782
Duration        0.041505
Type            0.000000
Channel         0.016576
Product Name    0.680475
Destination     0.000000
Type, Channel and Destination are the least important features for the outcome; the most dominant factor
is Product Name.
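These values can be read off the fitted model, for example:

import pandas as pd

# Feature importances of the fitted tree, highest first
imp = pd.DataFrame(dt_model.feature_importances_,
                   index=X_train.columns, columns=['Imp'])
print(imp.sort_values('Imp', ascending=False))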
Using the GridSearchCV package from sklearn.model_selection we identify the best parameters for
building a regularised decision tree. After a few iterations over candidate values, the best parameters
for the decision tree were found to be:
{'max_depth': 5, 'min_samples_leaf': 5, 'min_samples_split': 15}
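A sketch of such a search; the candidate grid values shown here are assumptions, only the reported best parameters come from the analysis above:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate values; the exact grid searched is an assumption
param_grid = {'max_depth': [4, 5, 6, 7],
              'min_samples_leaf': [3, 5, 10],
              'min_samples_split': [10, 15, 20]}

grid = GridSearchCV(DecisionTreeClassifier(criterion='gini', random_state=1),
                    param_grid, cv=5)
grid.fit(X_train, train_labels)
print(grid.best_params_)
best_dt = grid.best_estimator_  # the regularised tree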
Figure 2.6 – ROC curve (train labels)    Figure 2.7 – ROC curve (test labels)
Random Forest Classification
Random Forest is another supervised learning technique used in machine learning. It consists of
many decision trees and makes predictions by combining the outputs of the individual trees and selecting the best result.
Using the train dataset (X_train) we create a Random Forest model and then test the
model on the test dataset (X_test).
For creating the Random Forest, the "RandomForestClassifier" package is imported from sklearn.ensemble.
Using the GridSearchCV package from sklearn.model_selection we identify the best parameters to
build a Random Forest, rfcl. After a few iterations over candidate values we obtained the best
parameters for the RF model, as sketched below.
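A sketch of the search; the candidate grid values and random_state are assumptions:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values; the exact grid searched is an assumption
param_grid = {'n_estimators': [100, 200, 300],
              'max_depth': [5, 7, 9],
              'min_samples_leaf': [5, 10]}

grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    param_grid, cv=5)
grid.fit(X_train, train_labels)
rfcl = grid.best_estimator_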
Channel and Destination are the least important features for the outcome. The most dominant factors are
Product Name, Commission, Sales, Duration and Type. We can confirm this very well by looking at the plot. Unlike
CART, here many of the features contribute to the final outcome.
Figure – ROC curve (test labels)
Neural Networks
An Artificial Neural Network (ANN) is a computational model that consists of several processing elements
that receive inputs and deliver outputs based on their predefined activation functions.
Using the train dataset (X_train) and test dataset (X_test) we create a neural network using
MLPClassifier from sklearn.neural_network.
First, we scale the two datasets using the StandardScaler package from sklearn.preprocessing.
Using the GridSearchCV package from sklearn.model_selection we identify the best parameters to
build an artificial neural network model, mlp. After a few iterations over candidate values we
obtained the best parameters for the ANN model, as sketched below.
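A sketch of the scaling and the search; the candidate grid values and random_state are assumptions:

from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

# Scale the features; neural networks are sensitive to feature magnitude
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Candidate values; the exact grid searched is an assumption
param_grid = {'hidden_layer_sizes': [(50,), (100,), (200,)],
              'activation': ['relu', 'logistic'],
              'max_iter': [2500]}

grid = GridSearchCV(MLPClassifier(random_state=1), param_grid, cv=5)
grid.fit(X_train_s, train_labels)
mlp = grid.best_estimator_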
After training, we compute the confusion matrix and the ROC curve, and evaluate the model's performance on its prediction score, as sketched below.
Training confusion matrix & score –
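A minimal sketch of how these figures are produced, reusing the scaled training data and the fitted mlp model; evaluation on the test set is analogous:

from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_curve, roc_auc_score)
import matplotlib.pyplot as plt

# Metrics on the training set; repeat with X_test_s / test_labels
pred = mlp.predict(X_train_s)
print(confusion_matrix(train_labels, pred))
print(classification_report(train_labels, pred))

# ROC curve and AUC from the predicted probability of a claim
probs = mlp.predict_proba(X_train_s)[:, 1]
fpr, tpr, _ = roc_curve(train_labels, probs)
print('AUC:', roc_auc_score(train_labels, probs))
plt.plot(fpr, tpr)
plt.show()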
2.4 Final Model: Compare all the models and write an inference on which model is
best/optimized.
Comparing the model evaluation parameters across the different models, RANDOM FOREST is performing
best: its AUC score and accuracy on both the training and testing datasets are better than those of the other
models.
Moreover, for this case the costly mistake is the false negative (Type II error): we do not want our model to
predict that a customer will not claim when in fact they will. Hence we should look for greater recall as
compared to precision.
The second best model is the Neural Network; its evaluation parameters are quite close to those of the RF.
2.5 Inference: Based on these predictions, what are the business insights and
recommendations?
Looking at the feature importances from the Random Forest, we can see that Channel and Destination are
the least important features for the outcome.
The most dominant factors are Product Name, Commission, Sales, Duration and Type. We can confirm
this very well by looking at the plot. Unlike CART, here many of the features contribute to the final
outcome.
Since Product Name is the most dominant factor, the insurance company can introduce a few more
attractive products to draw in customers.
As commission and sales are correlated with each other, the firm should increase the commission margin,
which in return can increase sales as well.
Duration is also an important factor, so the firm should concentrate on providing long-duration schemes.
We strongly recommend collecting more real-time unstructured data, and past data where possible.
This can be understood by looking at the insurance data and drawing relations between variables such
as the day of the incident, time and age group, and associating them with external information such as location,
behaviour patterns, weather information, airline/vehicle types, etc.
• Streamlining online experiences benefits customers, leading to an increase in conversions, which
subsequently raises profits.
• As per the data, 90% of insurance is sold through the online channel.
• Another interesting fact: almost all offline business has a claim associated with it; we need to find out why.
• The JZI agency's resources need training to pick up sales, as they are at the bottom; we should run a
promotional marketing campaign or evaluate whether to tie up with an alternate agency.
• Based on the model we are getting 80% accuracy, so when a customer books airline tickets or travel plans
we can cross-sell insurance based on the claim data pattern.
• Another interesting fact is that more sales happen via agencies than airlines, yet the trend shows claims are
processed more at airlines. We may need to deep-dive into the process to understand the workflow and why
this happens.
Key Performance Indicators (KPIs)
The KPIs for insurance claims are:
• Reduce claims cycle time
• Increase customer satisfaction
• Combat fraud
• Optimize claims recovery
• Reduce claim handling costs
Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend
existing products, and give rise to new risk transfer solutions in areas like non-damage business
interruption and reputational damage.