Data Mining Problem 2 Report
Executive Summary
An insurance firm providing tour insurance is facing a higher claim frequency. The management has decided to
collect data from the past few years. The task is to build a model that predicts claim status and to provide
recommendations to management, using CART, RF & ANN, and to compare the models'
performance on the train and test sets.
Introduction
The purpose of this exercise is to explore the dataset and interpret the inferences drawn from it: perform
exploratory data analysis, split the dataset into train and test sets, build CART, Random
Forest and Artificial Neural Network classification models, and give a logical reason for the values
chosen for the parameters of each model. For this insurance dataset we have mainly covered three aspects:
data quality assurance, data insights and visualization, and business strategies.
Data Description
This dataset contains data on 3000 customers across 10 variables. The attributes are described
below:
1. Agency Code: name of the agency
2. Agency Type: type of travel insurance agency (airline or travel)
3. Distribution Channel: online or offline
4. Product Name: Basic, Premium, Bronze, 1-way comprehensive, 2-way comprehensive,
cancellation, ticket protector, 24 protect, etc.
5. Claim: yes or no
6. Duration: duration of travel in days
7. Destination: destination country of travel
8. Net Sales: amount of sales of travel insurance policies
9. Commission: commission received by the travel insurance agency
10. Age: age of the insured
There are a total of 3000 rows and 10 columns in the dataset. Of the 10 columns, 6 are of object type and
the remaining 4 are of integer or float type.
From the above results we can see that there are no missing values in the dataset.
The analysis found 139 duplicated rows. On inspecting these rows, however, the nature of the data
suggests that such duplication can occur legitimately: duplicated() returns 139 duplicated rows, but since
the dataset contains customer records, two different customers may well share identical feature values.
Hence we will not drop these duplicates.
Scaling the Dataset with Z-scores and Treating Object-type Columns
A new dataframe is drafted with only three columns, Commission, Sales and Duration, for applying the
z-score. Assuming the distributions to be approximately normal, we check how many records lie beyond
3 standard deviations and treat those records as outliers: the z-score is capped at -3 wherever it falls
below -3, and at +3 wherever it exceeds +3.
Later, the remaining columns from the original dataset are imported into the new dataframe.
Similarly, the columns whose datatype is stored as object are treated: they are converted to int by
assigning category codes. A minimal sketch of these two steps follows the column listing below.
The resulting column data types are:
Commision       float64
Sales           float64
Duration        float64
Type            int8
Claimed         int8
Channel         int8
Product Name    int8
Destination     int8
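A minimal sketch of the outlier capping and the categorical encoding, assuming the working dataframe is named df (a hypothetical name) with the column names listed above:

import pandas as pd
from scipy.stats import zscore

num_cols = ['Commision', 'Sales', 'Duration']  # column names as stored in the data

# Replace the numeric columns with z-scores capped at +/-3 standard deviations
df[num_cols] = df[num_cols].apply(zscore).clip(lower=-3, upper=3)

# Convert object-type columns to integer category codes
for col in df.select_dtypes(include='object').columns:
    df[col] = pd.Categorical(df[col]).codes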
Univariate Analysis
Histograms are plotted for all numerical variables using the sns.displot() function from the seaborn
package.
Bar plots are plotted for all categorical variables using the sns.countplot() function from the seaborn package, as sketched below.
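A sketch of these univariate plots, again assuming the dataframe is named df:

import seaborn as sns
import matplotlib.pyplot as plt

# Histogram for each numeric variable (displot is a figure-level seaborn function)
for col in ['Commision', 'Sales', 'Duration']:
    sns.displot(df[col])
    plt.show()

# Bar plot of category frequencies for each categorical variable
for col in ['Type', 'Claimed', 'Channel', 'Product Name', 'Destination']:
    sns.countplot(x=col, data=df)
    plt.show()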
Figure 2.1 - Pairplot
Multivariate Analysis
We will now plot a heat map of the correlation matrix to evaluate the relationships between the numeric
variables in our dataset. This graph can help us check for any correlations between different variables.
From the heatmap below, we can see a small correlation between Commission and Sales, and between
Duration and Sales.
Figure 2.2 – heatmap
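A sketch of how such a heatmap can be produced, assuming the dataframe is named df:

import seaborn as sns
import matplotlib.pyplot as plt

# Annotated heatmap of the correlation matrix for the numeric columns
corr = df[['Commision', 'Sales', 'Duration']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()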
2.2 Data Split: Split the data into test and train, build classification model CART, Random
Forest, Artificial Neural Network
2.3 Performance Metrics: Check the performance of Predictions on Train and Test sets using
Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model
Splitting the Dataset into Train and Test Data (70:30)
The target variable is "Claimed"; this column is extracted to form the labels for the train and test datasets.
For building the models we split the dataset into training and testing data in the ratio 70:30.
The two predictor sets are stored in X_train and X_test with their corresponding dimensions, as sketched below.
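A minimal sketch of the split; the random_state value is an assumption made here for reproducibility:

from sklearn.model_selection import train_test_split

# Separate the predictors from the target variable "Claimed"
X = df.drop('Claimed', axis=1)
y = df['Claimed']

# 70:30 split of predictors and labels
X_train, X_test, train_labels, test_labels = train_test_split(
    X, y, test_size=0.30, random_state=1)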
CART Model
Classification and Regression Trees (CART) are a type of decision tree used in data mining. It is a
supervised learning technique where the predicted outcome is either a discrete class (classification)
or a continuous numerical value (regression).
Using the train dataset (X_train) we create a CART model and then test the model on the
test dataset (X_test).
For creating the CART model two packages were imported from sklearn, namely "DecisionTreeClassifier" and
"tree".
With the help of DecisionTreeClassifier we create a decision tree model, dt_model, and fit the train data
using the "gini" criterion. After this, using the tree package, we create a dot file, claim_tree.dot,
to help visualise the tree, as sketched below.
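A sketch of this step; the random_state value is an assumption:

from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# Fit a decision tree on the training data using the Gini criterion
dt_model = DecisionTreeClassifier(criterion='gini', random_state=1)
dt_model.fit(X_train, train_labels)

# Write the tree out as a dot file for visualisation with Graphviz
with open('claim_tree.dot', 'w') as f:
    tree.export_graphviz(dt_model, out_file=f,
                         feature_names=list(X_train.columns))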
Below are the variable importance values (feature importances) used to build the tree:
Feature         Imp
Commision       0.074662
Sales           0.186782
Duration        0.041505
Type            0.000000
Channel         0.016576
Product Name    0.680475
Destination     0.000000
Type, Channel and Destination are the least important features for the outcome; the most dominant factor
is Product Name.
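These values can be read off the fitted model, for example:

import pandas as pd

# Feature importances of the fitted tree, highest first
imp = pd.DataFrame(dt_model.feature_importances_,
                   index=X_train.columns, columns=['Imp'])
print(imp.sort_values('Imp', ascending=False))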
Using the GridSearchCV package from sklearn.model_selection we identify the best parameters for
building a regularised decision tree. After a few iterations over candidate values, the best parameters
for the decision tree were found to be:
{'max_depth': 5, 'min_samples_leaf': 5, 'min_samples_split': 15}
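A sketch of such a search; the candidate grid values shown here are assumptions, only the reported best parameters come from the analysis above:

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Candidate values; the exact grid searched is an assumption
param_grid = {'max_depth': [4, 5, 6, 7],
              'min_samples_leaf': [3, 5, 10],
              'min_samples_split': [10, 15, 20]}

grid = GridSearchCV(DecisionTreeClassifier(criterion='gini', random_state=1),
                    param_grid, cv=5)
grid.fit(X_train, train_labels)
print(grid.best_params_)
best_dt = grid.best_estimator_  # the regularised tree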
Figure 2.6 – ROC curve (train labels)    Figure 2.7 – ROC curve (test labels)
Random Forest Classification
Random Forest is another supervised learning technique used in machine learning. It consists of
many decision trees and makes predictions by combining the outputs of the individual trees and selecting the best result.
Using the train dataset (X_train) we create a Random Forest model and then test the
model on the test dataset (X_test).
For creating the Random Forest, the "RandomForestClassifier" package is imported from sklearn.ensemble.
Using the GridSearchCV package from sklearn.model_selection we identify the best parameters to
build a Random Forest, rfcl. After a few iterations over candidate values we obtained the best
parameters for the RF model, as sketched below.
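A sketch of the search; the candidate grid values and random_state are assumptions:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Candidate values; the exact grid searched is an assumption
param_grid = {'n_estimators': [100, 200, 300],
              'max_depth': [5, 7, 9],
              'min_samples_leaf': [5, 10]}

grid = GridSearchCV(RandomForestClassifier(random_state=1),
                    param_grid, cv=5)
grid.fit(X_train, train_labels)
rfcl = grid.best_estimator_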
Channel and Destination are the least important features for the outcome. The most dominant factors are
Product Name, Commission, Sales, Duration and Type. We can confirm this very well by looking at the plot. Unlike
CART, here many of the features contribute to the final outcome.
Figure – ROC curve (test labels)
Neural Networks
An Artificial Neural Network (ANN) is a computational model that consists of several processing elements
that receive inputs and deliver outputs based on their predefined activation functions.
Using the train dataset (X_train) and test dataset (X_test) we create a neural network using
MLPClassifier from sklearn.neural_network.
First, we scale the two datasets using the StandardScaler package from sklearn.preprocessing.
Using the GridSearchCV package from sklearn.model_selection we identify the best parameters to
build an artificial neural network model, mlp. After a few iterations over candidate values we
obtained the best parameters for the ANN model, as sketched below.
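A sketch of the scaling and the search; the candidate grid values and random_state are assumptions:

from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

# Scale the features; neural networks are sensitive to feature magnitude
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Candidate values; the exact grid searched is an assumption
param_grid = {'hidden_layer_sizes': [(50,), (100,), (200,)],
              'activation': ['relu', 'logistic'],
              'max_iter': [2500]}

grid = GridSearchCV(MLPClassifier(random_state=1), param_grid, cv=5)
grid.fit(X_train_s, train_labels)
mlp = grid.best_estimator_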
After training, we compute the confusion matrix and the ROC curve, and evaluate the model's performance on its prediction score, as sketched below.
Training confusion matrix & score –
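A minimal sketch of how these figures are produced, reusing the scaled training data and the fitted mlp model; evaluation on the test set is analogous:

from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_curve, roc_auc_score)
import matplotlib.pyplot as plt

# Metrics on the training set; repeat with X_test_s / test_labels
pred = mlp.predict(X_train_s)
print(confusion_matrix(train_labels, pred))
print(classification_report(train_labels, pred))

# ROC curve and AUC from the predicted probability of a claim
probs = mlp.predict_proba(X_train_s)[:, 1]
fpr, tpr, _ = roc_curve(train_labels, probs)
print('AUC:', roc_auc_score(train_labels, probs))
plt.plot(fpr, tpr)
plt.show()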
2.4 Final Model: Compare all the models and write an inference on which model is
best/optimized.
Comparing the model evaluation parameters across the different models, RANDOM FOREST is performing
best: its AUC score and accuracy on both the training and testing datasets are better than those of the other
models.
Moreover, for this case the costly mistake is the false negative (Type II error): we do not want our model to
predict that a customer will not claim when in fact they will. Hence we should look for greater recall as
compared to precision.
The second best model is the Neural Network; its evaluation parameters are quite close to those of the RF.
2.5 Inference: Based on these predictions, what are the business insights and
recommendations?
Looking at the feature importances from the Random Forest, we can see that Channel and Destination are
the least important features for the outcome.
The most dominant factors are Product Name, Commission, Sales, Duration and Type. We can confirm
this very well by looking at the plot. Unlike CART, here many of the features contribute to the final
outcome.
Since Product Name is the most dominant factor, the insurance company can introduce a few more
attractive products to draw in customers.
As commission and sales are correlated with each other, the firm should increase the commission margin,
which in return can increase sales as well.
Duration is also an important factor, so the firm should concentrate on providing long-duration schemes.
We strongly recommend collecting more real-time unstructured data, and past data where possible.
This can be understood by looking at the insurance data and drawing relations between variables such
as the day of the incident, time and age group, and associating them with external information such as location,
behaviour patterns, weather information, airline/vehicle types, etc.
• Streamlining online experiences benefits customers, leading to an increase in conversions, which
subsequently raises profits.
• As per the data, 90% of insurance is sold through the online channel.
• Another interesting fact: almost all offline business has a claim associated with it; we need to find out why.
• The JZI agency's resources need training to pick up sales, as they are at the bottom; we should run a
promotional marketing campaign or evaluate whether to tie up with an alternate agency.
• Based on the model we are getting 80% accuracy, so when a customer books airline tickets or travel plans
we can cross-sell insurance based on the claim data pattern.
• Another interesting fact is that more sales happen via agencies than airlines, yet the trend shows claims are
processed more at airlines. We may need to deep-dive into the process to understand the workflow and why
this happens.
Key Performance Indicators (KPIs)
The KPIs for insurance claims are:
• Reduce claims cycle time
• Increase customer satisfaction
• Combat fraud
• Optimize claims recovery
• Reduce claim handling costs
Insights gained from data and AI-powered analytics could expand the boundaries of insurability, extend
existing products, and give rise to new risk transfer solutions in areas like non-damage business
interruption and reputational damage.