
Problem 2 - CART-RF-ANN

Executive Summary
An insurance firm providing tour insurance is facing a higher claim frequency. Management has decided
to collect data from the past few years. The task is to build a model that predicts claim status and to
provide recommendations to management, using CART, RF and ANN models and comparing their
performance on the train and test sets.

Introduction
The purpose of this exercise is to explore the dataset and interpret the inferences drawn from it: perform
exploratory data analysis; split the dataset into train and test sets; build CART, Random Forest, and
Artificial Neural Network classification models; and give the logical reasoning behind the parameter
values chosen for each model. We have covered three main aspects relevant to any insurance dataset:
data quality assurance, data insights and visualization, and business strategies.

Data Description
This dataset contains data on 3000 customers across 10 variables. The attributes are described
below:
1. Agency Code: Name of the agency
2. Agency Type: Type of travel insurance agency (Airlines or Travel Agency)
3. Distribution Channel: Online or offline
4. Product Name: Basic, Premium, Bronze, 1-way comprehensive, 2-way comprehensive,
cancellation, ticket protector, 24 protect, etc.
5. Claimed: Yes or No
6. Duration: Duration of travel in days
7. Destination: Country of travel
8. Net Sales: Amount of sales of travel insurance policies
9. Commission: Commission received by the travel insurance agency
10. Age: Age of the insured

Sample of the dataset:

Table 2.1 Data set Sample


2.1 Read the data, do the necessary initial steps, and exploratory data analysis (Univariate,
Bi-variate, and multivariate analysis).

Exploratory Data Analysis:


Let us check the types of variables in the data frame.
We will have to explore the object-type columns and convert them into integer or float types to make
sure we are providing correct input to our ML models.
Age int64
Agency_Code object
Type object
Claimed object
Commision float64
Channel object
Duration int64
Sales float64
Product Name object
Destination object
dtypes: float64(2), int64(2), object(6)

There are a total of 3000 rows and 10 columns in the dataset. Of the 10 columns, 6 are of object type
and the remaining 4 are of integer or float type.

Check for missing values in the dataset:


Missing ("NA") values need to be checked for and dropped from the dataset, since null values can
cause errors or distort results. Missing values can be counted using the .isnull().sum()
function.
Age 0
Agency_Code 0
Type 0
Claimed 0
Commision 0
Channel 0
Duration 0
Sales 0
Product Name 0
Destination 0
dtype: int64

From the above results we can see that there are no missing values in the dataset.
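
A minimal sketch of these initial checks in pandas; the CSV file name is an assumption:

import pandas as pd

df = pd.read_csv("insurance_part2_data.csv")  # file name assumed

print(df.shape)           # (3000, 10)
print(df.dtypes)          # variable types, as listed above
print(df.isnull().sum())  # per-column count of missing values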

Check for Duplicated values in the dataset:


We need to check the dataset for duplicated rows to assess their significance for the analysis. The
duplicated().sum() function identifies the number of duplicated rows in the dataset.

The analysis found 139 duplicated rows. Exploring those rows, however, shows that the dataset has no
unique customer identifier, so records belonging to two different customers can legitimately have
identical feature values. Hence we will not be dropping these duplicates.
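
A sketch of the duplicate check, continuing with df from the sketch above:

print(df.duplicated().sum())                 # 139 duplicated rows
print(df[df.duplicated(keep=False)].head())  # inspect the repeated rows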

Dropping the non-important columns


In this dataset, “Agency_Code” and “Age” are columns that will not be used in our analysis. Hence,
we will drop these columns using the .drop() function.

Scaling the Dataset with Z-score and treating Object type columns
After drafting a new dataset with only the three numeric columns, Commission, Sales and Duration, we
apply the z-score. Assuming the distributions to be approximately normal, we check how many records
lie beyond 3 standard deviations and treat those records by clipping: wherever the z-score is below -3 it
is set to -3, and wherever it is above +3 it is set to +3.

Later, the remaining columns from the original dataset are imported into the new dataframe.

Similarly, we treat the columns whose datatype is stored as object, converting them to integers by
assigning category codes. A sketch of these treatment steps is shown below.
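
A sketch of the treatment steps, continuing with df from earlier; note the dataset spells the column
"Commision":

import pandas as pd
from scipy import stats

# Drop the columns excluded from the analysis
df = df.drop(columns=["Agency_Code", "Age"])

# Replace the numeric columns by their z-scores, clipped at +/-3 SD
num_cols = ["Commision", "Sales", "Duration"]
df[num_cols] = df[num_cols].apply(stats.zscore).clip(lower=-3, upper=3)

# Convert object-type columns to integer category codes
for col in df.select_dtypes(include="object").columns:
    df[col] = pd.Categorical(df[col]).codes  # int8 codes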

After treatment (Dataset view) -

Table 2.2 – After treatment Data set view

Columns -
Commision float64
Sales float64
Duration float64
Type int8
Claimed int8
Channel int8
Product Name int8
Destination int8

Univariate Analysis
Histograms are plotted for all the numerical variables using the sns.displot() function from the seaborn
package.
Bar plots are plotted for all categorical variables using the sns.countplot() function from the seaborn
package. A sketch of these plots follows.
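
A sketch of the univariate plots and the pairplot (Figure 2.1):

import seaborn as sns
import matplotlib.pyplot as plt

for col in ["Commision", "Sales", "Duration"]:        # numeric variables
    sns.displot(df, x=col)
    plt.show()

for col in ["Type", "Claimed", "Channel", "Product Name", "Destination"]:
    sns.countplot(data=df, x=col)                     # categorical variables
    plt.show()

sns.pairplot(df[["Commision", "Sales", "Duration"]])  # Figure 2.1
plt.show()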
Figure 2.1 - Pairplot

There's a small correlation between Commission and Sales

Multivariate Analysis
We will now plot a heat map, or correlation matrix, to evaluate the relationships between different
variables in our dataset. This graph helps us check for any correlations between variables.
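
A sketch of the heatmap below, assuming the numeric columns from earlier:

import seaborn as sns
import matplotlib.pyplot as plt

corr = df[["Commision", "Sales", "Duration"]].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")  # Figure 2.2
plt.show()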

From the heatmap below, we can see a small correlation between Commission and Sales, and between
Duration and Sales.
Figure 2.2 – heatmap

2.2 Data Split: Split the data into test and train, build classification models CART, Random
Forest, and Artificial Neural Network
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model
Splitting Dataset in Train and Test Data (70:30)
Our target variable is “Claimed”; this column is extracted as the label for the train and test datasets.
To build the models we split the dataset into training and testing data in a 70:30 ratio.
The two predictor sets are stored in X_train and X_test with their corresponding dimensions.
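
A sketch of the split; the random seed is an assumption:

from sklearn.model_selection import train_test_split

X = df.drop(columns=["Claimed"])
y = df["Claimed"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)  # 70:30 split, seed assumed
print(X_train.shape, X_test.shape)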

Figure 2.3 – Extraction

Splitting and training the set -


Figure 2.4 – Splitting & Training set

CART Model
Classification and Regression Trees (CART) is a type of decision tree used in data mining. It is a
supervised learning technique in which the predicted outcome is either a discrete class (classification)
or a continuous numerical value (regression).

Using the train dataset (X_train) we create a CART model and then test it on the test dataset (X_test).

For creating the CART model two packages were imported, namely DecisionTreeClassifier and tree
from sklearn.

With the help of DecisionTreeClassifier we create a decision tree model, dt_model, and fit the training
data using the “gini” criterion. We then use the tree package to create a dot file, claim_tree.dot, to help
visualise the tree.
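
A sketch of these steps; the random seed is an assumption:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

dt_model = DecisionTreeClassifier(criterion="gini", random_state=1)
dt_model.fit(X_train, y_train)

# Dot file for visualising the tree (render with Graphviz)
export_graphviz(dt_model, out_file="claim_tree.dot",
                feature_names=list(X_train.columns))

# Variable importance values
print(pd.DataFrame(dt_model.feature_importances_,
                   index=X_train.columns, columns=["Imp"]))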

Below are the variable (feature) importance values used to build the tree.

Imp
Commision 0.074662
Sales 0.186782
Duration 0.041505
Type 0.000000
Channel 0.016576
Product Name 0.680475
Destination 0.000000

Type, Channel, and Destination are the least important predictors of the outcome; the most dominant
factor is Product Name.
Using the GridSearchCV package from sklearn.model_selection we identify the best parameters to
build a regularised decision tree. After a few iterations over candidate values, the best parameters were:
{'max_depth': 5, 'min_samples_leaf': 5, 'min_samples_split': 15}
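
A sketch of the grid search; the candidate grids are assumptions, chosen so the reported best parameters
fall inside them:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    "max_depth": [4, 5, 6],            # assumed candidate values
    "min_samples_leaf": [5, 10, 15],
    "min_samples_split": [10, 15, 20],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=1), param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)  # {'max_depth': 5, 'min_samples_leaf': 5, 'min_samples_split': 15}
best_dt = grid.best_estimator_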

Training the model & evaluation


After training the CART model, the confusion matrix and classification report for the training dataset are shown below.

Figure 2.5 – Confusion matrix & classification report – Train set


Confusion matrix & classification report for the test dataset:
Figure 2.6 – Confusion matrix & classification report – Test set
ROC curves for the train & test labels:

Figure 2.7 – ROC train labels Figure 2.8 – ROC test labels
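
A sketch of the evaluation used for each model in this report (accuracy, confusion matrix, classification
report, ROC/AUC); best_dt is the regularised tree from the grid search sketch:

import matplotlib.pyplot as plt
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             classification_report, roc_curve, roc_auc_score)

def evaluate(model, X, y, label):
    pred = model.predict(X)
    print(label, "accuracy:", accuracy_score(y, pred))
    print(confusion_matrix(y, pred))
    print(classification_report(y, pred))
    probs = model.predict_proba(X)[:, 1]  # predicted probability of a claim
    fpr, tpr, _ = roc_curve(y, probs)
    plt.plot(fpr, tpr, label=label)       # ROC curve
    print(label, "AUC:", roc_auc_score(y, probs))

evaluate(best_dt, X_train, y_train, "CART train")
evaluate(best_dt, X_test, y_test, "CART test")
plt.legend(); plt.show()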
Random Forest Classification
Random Forest is another supervised learning technique used in machine learning. It consists of many
decision trees; it makes predictions using the individual trees and selects the best output from them.
Using the train dataset (X_train) we create a Random Forest model and then test it on the test dataset
(X_test).
For creating the Random Forest, the RandomForestClassifier package is imported from sklearn.ensemble.
Using the GridSearchCV package from sklearn.model_selection we identify the best parameters to
build the Random Forest model, rfcl. After a few iterations over candidate values, we obtained the best
parameters for the RF model.
Channel and Destination are the least important predictors of the outcome. The dominant factors are
Product Name, Commission, Sales, Duration, and Type, which the feature importance plot confirms.
Unlike CART, here many of the features contribute to the final outcome.
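
A sketch of the Random Forest build; the grid values are assumptions:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 300],  # assumed candidate values
    "max_depth": [5, 7, 9],
    "min_samples_leaf": [5, 10],
}
grid = GridSearchCV(RandomForestClassifier(random_state=1), param_grid, cv=3)
grid.fit(X_train, y_train)
rfcl = grid.best_estimator_

# Feature importances behind Figure 2.9
print(pd.Series(rfcl.feature_importances_,
                index=X_train.columns).sort_values(ascending=False))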

Figure 2.9 – Feature importance (Random Forest)


After training, we have plotted the confusion matrix & prediction score for both the test & train datasets.
Train labels

Test labels
ROC –

Neural Networks
Artificial Neural Network (ANN) is a computational model consisting of several processing elements
that receive inputs and deliver outputs based on their predefined activation functions.

Using the train dataset (X_train) and test dataset (X_test) we create a neural network using
MLPClassifier from sklearn.neural_network.

Firstly, we scale the two datasets using the StandardScaler package.

Using the GridSearchCV package from sklearn.model_selection we identify the best parameters to
build the ANN model, mlp. After a few iterations over candidate values, we obtained the best
parameters for the ANN model.
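
A sketch of the ANN build; the grid values are assumptions:

from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)  # fit the scaler on train only
X_test_s = scaler.transform(X_test)

param_grid = {
    "hidden_layer_sizes": [(50,), (100,), (200,)],  # assumed candidates
    "max_iter": [2500, 5000],
    "tol": [0.01, 0.001],
}
grid = GridSearchCV(MLPClassifier(random_state=1), param_grid, cv=3)
grid.fit(X_train_s, y_train)
mlp = grid.best_estimator_
print(grid.best_params_)  # Figure 2.10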

Figure 2.10 – Best parameters

After training, we can calculate the confusion matrix, ROC, and model performance (prediction score) –
Training confusion matrix & score –

Test confusion matrix & score –

ROC for the test and training sets –

2.4 Final Model: Compare all the models and write an inference on which model is
best/optimized.
Comparing the evaluation parameters across models, RANDOM FOREST performs best: its AUC score
and accuracy on both the training and testing datasets are better than those of the other models.
Moreover, for this case it is the Type II error (false negatives) that must be minimized: we do not want
the model to predict "no claim" for customers who will actually file a claim.
Hence we should value higher recall over precision.
The second best model is the Neural Network; its evaluation parameters are quite close to those of RF.
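
A sketch of how the comparison can be assembled from the fitted models (no metric values are
hard-coded here):

import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

models = {"CART": best_dt, "Random Forest": rfcl, "ANN": mlp}
rows = {}
for name, m in models.items():
    # ANN was trained on the scaled copies of the data
    Xtr, Xte = (X_train_s, X_test_s) if name == "ANN" else (X_train, X_test)
    rows[name] = {
        "Train accuracy": accuracy_score(y_train, m.predict(Xtr)),
        "Test accuracy": accuracy_score(y_test, m.predict(Xte)),
        "Train AUC": roc_auc_score(y_train, m.predict_proba(Xtr)[:, 1]),
        "Test AUC": roc_auc_score(y_test, m.predict_proba(Xte)[:, 1]),
    }
print(pd.DataFrame(rows).T)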

2.5 Inference: Based on these predictions, what are the business insights and
recommendations?
Looking at the feature importance from the Random Forest, we can see that Channel and Destination
are the least important predictors of the outcome.
The dominant factors are Product Name, Commission, Sales, Duration, and Type, which the feature
importance plot confirms. Unlike CART, here many of the features contribute to the final outcome.
The insurance company can introduce a few more attractive products, since Product Name is the most
dominant factor.
As commission and sales are correlated with each other, the firm could increase the commission
margin, which in turn can increase sales as well.
Duration is also an important factor, hence the firm should concentrate on providing long-duration schemes.
We strongly recommend collecting more real-time unstructured data, and more historical data if possible.
This can be done by looking at the insurance data and drawing relations between variables such as day
of the incident, time, and age group, and associating them with external information such as location,
behavior patterns, weather information, airline/vehicle types, etc.
• Streamlining online experiences benefits customers, leading to an increase in conversions, which
subsequently raises profits.
• As per the data, 90% of insurance is sold through the online channel.
• Another interesting fact: almost all the offline business has a claim associated with it; we need to find
out why.
• The JZI agency's resources need training to pick up sales, as they are at the bottom; we should run a
promotional marketing campaign or evaluate whether to tie up with an alternate agency.
• Based on the model we are getting roughly 80% accuracy, so when a customer books airline tickets or
plans a trip, we can cross-sell insurance based on the claim data pattern.
• Another interesting fact: more sales happen via Agency than via Airlines, yet the trend shows claims
are processed more at Airlines. We may need to deep-dive into the process to understand the workflow
and why.
Key performance indicators (KPIs) of insurance claims are:
• Reduce claims cycle time
• Increase customer satisfaction
• Combat fraud
• Optimize claims recovery
• Reduce claim handling costs
Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend
existing products, and give rise to new risk transfer solutions in areas like non-damage business
interruption and reputational damage.
