Data Analysis Using R
Group Project
Group 10
Introduction
Just like medical insurance, there is vehicle insurance: every year the customer pays a premium of a certain amount to an insurance provider so that, in case of an unfortunate accident involving the vehicle, the provider pays a compensation (called the 'sum assured') to the customer.
To predict whether a customer would be interested in vehicle insurance, we have information about demographics (gender, age, region code), the vehicle (vehicle age, damage) and the policy (premium, sourcing channel).
List of variables
Unique ID
Gender
Age
Driving_License
Region_Code
Previously_Insured
Vehicle_Age
Vehicle_Damage
Annual_Premium
Policy_Sales_Channel
Vintage
Response
Descriptive analysis
Policy sales channel and region code are just indicators used to differentiate between sales channels and regions. The problem is that they are represented by numbers, and we cannot use them as numbers to build a model.
We could have converted them into factors, but we did not, for the following reasons:
1) There are almost 150 distinct codes for each of them. Including them all would make the model cluttered and difficult to interpret without adding much predictive power.
2) Some codes have very few observations. We had planned to club them all into the same category, which would also reduce the number of levels, but the challenge is deciding the threshold and identifying which codes fall below it with code rather than by manual inspection; a sketch follows this list.
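A minimal sketch of the threshold approach, using the data frame d loaded in the appendix. The cut-off of 50 mirrors the one explored there, and Region_Code_grp is an illustrative new column, not part of the original script:
code_counts=table(d$Region_Code)
rare_codes=names(code_counts)[code_counts<50]   # codes below the threshold
d$Region_Code_grp=ifelse(d$Region_Code %in% rare_codes,"Other",as.character(d$Region_Code))
d$Region_Code_grp=as.factor(d$Region_Code_grp)  # clubbed factor with far fewer levels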
Age, vehicle age, annual premium, gender, previous insurance status and vehicle damage all appear to have a significant impact on vehicle insurance selection: for each of these variables, the mean differs markedly between respondents and non-respondents.
Vintage, in contrast, does not seem to have a significant impact, as it shows no meaningful difference between the two groups.
Response and whether a customer holds a driving licence also appear to be significantly related.
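These statements come from comparing group means; a minimal sketch of the checks (d is the data frame loaded in the appendix; the t.test call is an illustrative addition, not part of the original script):
tapply(d$Age,d$Response,mean)      # mean age for non-respondents vs respondents
tapply(d$Response,d$Gender,mean)   # response rate by gender
t.test(Age~Response,data=d)        # formal test of the age difference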
Model Building
We used logistic regression and classification tree algorithms to build the models. We then selected the logistic regression model to predict the test data, as it had the lower error rate.
We initially built an all-inclusive model and then used the stepwise method to identify the best model, which retained the following variables: Gender, Age, Driving licence, Previously insured, Vehicle age, Vehicle damage, Annual premium and Policy sales channel.
We then removed the policy sales channel variable and built a new model; the reason for removing it is as discussed above. A sketch of the selection step follows.
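A minimal sketch, mirroring the appendix code (stepAIC comes from the MASS package; the identifier column is assumed to be named id, as in the appendix):
library(MASS)
mod_all=glm(Response~.-id,data=d,subset=train,family=binomial)  # all-inclusive model
stepAIC(mod_all,trace=F)   # reports the lowest-AIC model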
Classification Tree
We converted the response variable to a factor so that we could build a classification tree. In the classification tree we observed three to four terminal nodes, and all of them point to the same conclusion, namely 0 (no response). This happens because most of the response values are 0 (No), so the model is skewed towards that class, as the sketch below illustrates.
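The tree-fitting code does not appear in the appendix, so the sketch below is an assumption: it uses the rpart package (the report does not name the package used) together with the train indices defined in the appendix:
library(rpart)
tree_mod=rpart(as.factor(Response)~Gender+Age+Driving_License+Previously_Insured+Vehicle_Age+Vehicle_Damage+Annual_Premium,data=d[train,],method="class")
print(tree_mod)   # with the large majority of responses being 0, every leaf tends to predict 0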
We then used the logistic regression model with the following variables to predict values for the test data: Gender, Age, Driving licence, Previously insured, Vehicle age, Vehicle damage and Annual premium.
Conclusion
The logistic regression model gave us the best results, with a total error rate of 36%. The classification tree could not provide a condition for classifying a customer as buying vehicle insurance.
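For reference, the error rate quoted above is computed from the confusion table built in the appendix; a minimal sketch:
tab=table(mod_1.pred,Response.train)
1-sum(diag(tab))/sum(tab)   # total error rate: misclassified / total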
Appendix
d=read.csv("D:\\Study Material\\Term IV\\Data Analytics Using
R\\Assignment\\train.csv",header=T,na.strings="?")
str(d)
colSums(is.na(d))
#checking number of respondents with driving licence
sum(d$Driving_License)
length(d$Driving_License)
table(d$Vehicle_Damage)
#number of respondents who have been previously insured
table(d$Previously_Insured)
t=table(d$Policy_Sales_Channel)
prop.table(t)
length(t)
# trying to identify which policy channels have less than 50 entries
x=c()   # initialise the collection vector before appending
for (i in 1:length(t)) {
  if (t[i]<50) {
    x=append(x,t[i])
  }
}
length(x)
#Descriptive analytics
tapply(d$Response,d$Gender,"mean")
tapply(d$Response,d$Driving_License,"mean")
tapply(d$Response,d$Previously_Insured,"mean")
tapply(d$Response,d$Vehicle_Age,"mean")
tapply(d$Response,d$Vehicle_Damage,"mean")
tapply(d$Age,d$Response,"mean")
tapply(d$Annual_Premium,d$Response,"mean")
library(ISLR)
# Creating the Training Data Set (80%)
set.seed(4)
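# draw 304887 row indices (80% of the 381109 rows) for the training split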
train=sample(381109,304887)
d.train=d[train,]
dim(d.train)
Response=d$Response
d.test=d[-train,]
dim(d.test)
#Creating an array of the "Response" variable for Training Data
Response.train=Response[train]
Response.train=as.factor(Response.train)
# Creating an array of the "Response" variable for Test Data
Response.test=Response[-train]
# building an all inclusive model
mod_1=glm(Response~.,data = d,subset = train,family = binomial)
summary(mod_1)
library(MASS)
# Trying stepwise methods to determine best model
stepAIC(mod_1,trace = F)
# calculating associated probabilities
mod_1.probs=predict(mod_1,type = "response")
length(mod_1.probs)
head(mod_1.probs)
# Using ROC curve to determine best threshold
library(pROC)
R=roc(Response.train,mod_1.probs)
plot(roc(Response.train,mod_1.probs),col="blue",legacy.axes = TRUE)
coords(R, "best", ret = "threshold")
R$auc
# Checking which value represents what
contrasts(Response.train)
# Converting probabilities using threshold
mod_1.pred=rep(0,length(Response.train))
mod_1.pred[mod_1.probs>0.0865663]=1
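# 0.0865663 is the best threshold returned by coords() above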
# Plotting the table to calculate error rate
table(mod_1.pred,Response.train)
sum(mod_1.pred==Response.train)
# Checking goodness of fit of the model
library(ResourceSelection)
# several articles suggest that large datasets (>25000) will always show poor fit, so the H-L test may not be a good criterion
hoslem.test(Response[train],fitted(mod_1))
t=read.csv("D:\\Study Material\\Term IV\\Data Analytics Using R\\Assignment\\test.csv",header=T,na.strings="?")
colSums(is.na(t))
names(t)
# Building the model suggested by step and our knowledge, fitted on the training rows
mod_f=glm(Response~Gender+Age+Driving_License+Previously_Insured+Vehicle_Age+Vehicle_Damage+Annual_Premium,data = d,subset = train,family = binomial)
# predicting probabilities for the held-out split (for the ROC) and for test.csv
mod_f.test=predict(mod_f,newdata=d.test,type = "response")
mod_f.probs=predict(mod_f,newdata=t,type = "response")
# Checking if the predictions are for correct data set
length(mod_f.probs)
length(t$id)
head(mod_f.probs)
# calculating best threshold based on the train/test split
R=roc(Response.test,mod_f.test)
plot(R,col="blue",legacy.axes = TRUE)
coords(R, "best", ret = "threshold")
R$auc
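# Sketch of the final step (not in the original script): converting the test.csv
# probabilities to 0/1 with the threshold found above; [[1]] extracts the numeric
# value from whatever structure this version of coords() returns
thr=coords(R, "best", ret = "threshold")[[1]]
mod_f.pred=rep(0,length(mod_f.probs))
mod_f.pred[mod_f.probs>thr]=1
table(mod_f.pred)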