Data Analysis Using R

This document summarizes a group project using R to predict whether customers would be interested in vehicle insurance. The group analyzed demographic, vehicle, and policy data using descriptive statistics and built logistic regression and classification tree models. The logistic regression model with gender, age, driving license, prior insurance, vehicle age, damage, and annual premium as predictors performed best with an error rate of 36%. The classification tree was skewed since most responses were negative.
Data Analytics Using R

Group Project

Group 10

Aveek Dutta (PGP/23/008)


Mihir Bade (PGP/23/010)
Shubham Meena (PGP/23/051)
Shubham Wadhwa (PGP/23/177)
Diksha (PGP/23/204)

Problem Statement: Healthcare Cross-Sell Prediction

Introduction

Just like medical insurance, vehicle insurance requires the customer to pay an annual premium to the insurance provider so that, in the event of an accident involving the vehicle, the provider pays compensation (called the 'sum assured') to the customer.

Building a model to predict whether a customer would be interested in vehicle insurance is extremely helpful for the company: it can plan its communication strategy accordingly, reach out to those customers, and optimise its business model and revenue.

To predict whether a customer would be interested in vehicle insurance, we have information about demographics (gender, age, region code), the vehicle (vehicle age, damage), the policy (premium, sourcing channel), etc.

We sourced the data from Kaggle.com.

List of variables

 Unique ID

 Gender

 Age

 Driving_License (1 if the customer has a driving licence, 0 if he/she doesn't)

 Region_Code

 Previously_Insured
 Vehicle_Age

 Vehicle_Damage (1 if the vehicle has been damaged, 0 if it has not)

 Annual_Premium

 Policy_Sales_Channel

 Vintage (number of days the customer has been associated with the company)

 Response (1 if the customer is interested in vehicle insurance, 0 otherwise)

Checking number of respondents with driving license

Descriptive analysis

The data does not have any missing values.

Id is just a serial number and hence is not used in the model.

Policy sales channel and region code are likewise only identifiers that distinguish sales channels and regions. The problem is that they are represented by numbers, and a model would wrongly treat them as numeric quantities.

We could have converted them into factors, but did not, for the following reasons:

1) There are almost 150 different codes for each of them, which would clutter the model and make it difficult to interpret without adding much predictive value.
2) Some codes have very few observations. We had planned to club them all into a single category, which would also reduce the number of levels, but the challenge is deciding the threshold and identifying which codes fall below it programmatically rather than by manual inspection.
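The clubbing step can in fact be automated. A minimal base-R sketch, using an illustrative threshold of 50 and toy codes rather than the actual dataset:

```r
# Collapse rare codes (fewer than `threshold` occurrences) into "Other".
# The threshold of 50 is an illustrative choice, not a recommendation.
lump_rare <- function(codes, threshold = 50) {
  counts <- table(codes)
  rare <- names(counts)[counts < threshold]
  out <- as.character(codes)
  out[out %in% rare] <- "Other"
  factor(out)
}

# Toy data: codes 1 and 2 are frequent, codes 3 and 4 are rare
codes <- c(rep(1, 100), rep(2, 80), rep(3, 2), rep(4, 1))
lumped <- lump_rare(codes, threshold = 50)
table(lumped)
```

Applied to `Region_Code` or `Policy_Sales_Channel`, this would reduce roughly 150 levels to only the frequent ones plus "Other".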

Age, vehicle age, annual premium, gender, previous insurance, and vehicle damage all appear to have a significant impact on interest in vehicle insurance: for each of these variables, the mean differs noticeably between respondents and non-respondents.

Vintage, by contrast, does not show any significant difference between the two groups.

Response and whether a customer holds a driving licence also appear to be significantly related.
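Each of these comparisons follows the same pattern. A small illustration of comparing group means and formally testing the difference; the `age` and `response` vectors here are simulated, not our data:

```r
# Compare the mean of a numeric variable between responders (1)
# and non-responders (0), using simulated data for illustration.
set.seed(1)
response <- rbinom(200, 1, 0.3)
age <- ifelse(response == 1, rnorm(200, 45, 8), rnorm(200, 35, 8))

# Group means, as in the tapply() calls in the appendix
group_means <- tapply(age, response, mean)
group_means

# A two-sample t-test quantifies whether the difference is significant
t.test(age ~ response)$p.value
```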

Model Building

We used logistic regression and classification tree algorithms to build the models, and then selected the logistic regression model to predict the test data, as it had the lower error rate.
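The error rate used for this comparison is simply the proportion of misclassified observations from the confusion table. A minimal sketch with toy predictions:

```r
# Error rate = proportion of misclassified observations,
# read off the confusion table of predictions vs. actuals.
error_rate <- function(pred, actual) {
  mean(pred != actual)
}

# Toy example: 2 misclassifications out of 8 observations
actual <- c(0, 0, 0, 0, 1, 1, 1, 1)
pred   <- c(0, 0, 0, 1, 1, 1, 1, 0)
table(pred, actual)       # confusion matrix
error_rate(pred, actual)  # 2/8 = 0.25
```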

Logistic Regression Model

We initially built an all-inclusive model and then used the stepwise method to identify the
best model.

Building an all-inclusive model


Trying stepwise methods to determine best model

ROC curve to determine best threshold

Plotting the table to calculate error rate


According to the stepwise procedure, the best model included the following variables: Gender, Age, Driving Licence, Previously Insured, Vehicle Age, Vehicle Damage, Annual Premium, and Policy Sales Channel.

We removed the policy sales channel variable and built a new model, for the reasons discussed above.

Checking goodness of fit of the model

Note that the Hosmer-Lemeshow (H-L) test generally shows poor fit for large samples (n > 25,000), so its result here should be interpreted with caution.
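The reason is that the H-L chi-square statistic grows with sample size, so tiny calibration deviations become "significant" at n > 25,000. A base-R sketch of the statistic (not the `ResourceSelection::hoslem.test` call used in the appendix) on well-calibrated simulated data:

```r
# Hosmer-Lemeshow style check in base R: bin fitted probabilities
# into g groups and compare observed vs expected event counts.
hl_stat <- function(y, p, g = 10) {
  bins <- cut(p, breaks = quantile(p, probs = seq(0, 1, length.out = g + 1)),
              include.lowest = TRUE)
  obs <- tapply(y, bins, sum)      # observed events per bin
  exp <- tapply(p, bins, sum)      # expected events per bin
  n   <- tapply(p, bins, length)   # bin sizes
  sum((obs - exp)^2 / (exp * (1 - exp / n)))  # ~ chi-square with g-2 df
}

# Well-calibrated toy example: y is drawn from its own fitted probability
set.seed(2)
p <- runif(5000, 0.05, 0.6)
y <- rbinom(5000, 1, p)
stat <- hl_stat(y, p)
pchisq(stat, df = 8, lower.tail = FALSE)  # large p-value => no misfit detected
```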

Classification Tree

We converted the response variable to a factor so that we could build a classification tree.

The tree had 3-4 terminal nodes, and all of them pointed to the same conclusion: 0, or no response. This happened because most of the response values are 0 (No), so the model is skewed towards that class.
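One standard remedy for such skew, which we did not apply here, is to penalise missed positives through a loss matrix. A sketch using the `rpart` package on simulated imbalanced data:

```r
library(rpart)

# Imbalanced toy data: roughly 10% positives, like the insurance response
set.seed(3)
n <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- factor(rbinom(n, 1, plogis(-3 + 2 * x1)))

# A default tree on skewed data often predicts all 0. A loss matrix
# L[i, j] = cost of predicting class j when the truth is class i;
# here missing a positive (truth 1, predicted 0) costs 5x as much,
# which pushes the splits toward the minority class.
fit <- rpart(y ~ x1 + x2,
             parms = list(loss = matrix(c(0, 5, 1, 0), nrow = 2)))
pred <- predict(fit, type = "class")
table(pred, y)
```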

Plotting ROC curve


We also tried building random forest and bagging models on the same dataset, but ran into an error related to the memory allocated to the R console. We searched online for a fix but did not understand the underlying issue well enough to resolve it.
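The usual workaround for such memory limits is to train on a manageable subsample instead of all 381,109 rows. A base-R bagging sketch on simulated data illustrates the idea (this is not the `randomForest` call that failed):

```r
# Bagging sketch in base R: bootstrap-aggregate simple logistic models
# on a subsample, to sidestep memory limits on the full data.
set.seed(4)
n <- 5000                      # subsample size, in place of the full data
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x))
dat <- data.frame(x, y)

bag_predict <- function(dat, newdata, B = 25) {
  probs <- replicate(B, {
    boot <- dat[sample(nrow(dat), replace = TRUE), ]   # bootstrap resample
    fit <- glm(y ~ x, data = boot, family = binomial)  # base learner
    predict(fit, newdata = newdata, type = "response")
  })
  rowMeans(probs)              # average the bootstrap predictions
}

p <- bag_predict(dat, data.frame(x = c(-2, 0, 2)))
round(p, 2)                    # probabilities should increase with x
```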

We then used the logistic regression model with the following variables to predict values for the test data: Gender, Age, Driving Licence, Previously Insured, Vehicle Age, Vehicle Damage, and Annual Premium.

Conclusion

The logistic regression model gave us the best results, with a total error rate of 36%. The classification tree could not provide a condition for classifying a customer as likely to buy vehicle insurance.

Appendix
d=read.csv("D:\\Study Material\\Term IV\\Data Analytics Using R\\Assignment\\train.csv",header=T,na.strings="?")
str(d)
colSums(is.na(d))
#checking number of respondents with driving licence
sum(d$Driving_License)
length(d$Driving_License)
table(d$Vehicle_Damage)
#number of respondents who have been previously insured
table(d$Previously_Insured)
t=table(d$Policy_Sales_Channel)
prop.table(t)
length(t)
# trying to identify which policy channels have less than 50 entries
x=c()  # initialise the accumulator before the loop
for (i in 1:length(t)) {
  if (t[i]<50) {
    x=append(x,t[i])
  }
}
length(x)
#Descriptive analytics
tapply(d$Response,d$Gender,"mean")
tapply(d$Response,d$Driving_License,"mean")
tapply(d$Response,d$Previously_Insured,"mean")
tapply(d$Response,d$Vehicle_Age,"mean")
tapply(d$Response,d$Vehicle_Damage,"mean")
tapply(d$Age,d$Response,"mean")
tapply(d$Annual_Premium,d$Response,"mean")
library(ISLR)
# Creating the Training Data Set (80%)
set.seed(4)
train=sample(381109,304887)
d.train=d[train,]
dim(d.train)
Response=d$Response
d.test=d[-train,]
dim(d.test)
# Creating an array of the "Response" variable for Training Data
Response.train=Response[train]
Response.train=as.factor(Response.train)
# Creating an array of the "Response" variable for Test Data
Response.test=Response[-train]
# building an all inclusive model
mod_1=glm(Response~.,data = d,subset = train,family = binomial)
summary(mod_1)
library(MASS)
# Trying stepwise methods to determine best model
stepAIC(mod_1,trace = F)
# calculating associated probabilities
mod_1.probs=predict(mod_1,type = "response")
length(mod_1.probs)
head(mod_1.probs)
# Using ROC curve to determine best threshold
library(pROC)
R=roc(Response.train,mod_1.probs)
plot(roc(Response.train,mod_1.probs),col="blue",legacy.axes = TRUE)
coords(R, "best", ret = "threshold")
R$auc
# Checking which value represents what
contrasts(Response.train)
# Converting probabilities using threshold
mod_1.pred=rep(0,length(Response.train))
mod_1.pred[mod_1.probs>0.0865663]=1
# Plotting the table to calculate error rate
table(mod_1.pred,Response.train)
sum(mod_1.pred==Response.train)
# Checking goodness of fit of the model
library(ResourceSelection)
# several articles suggest that large datasets (>25,000) will always show
# poor fit, so H-L may not be a good criterion here
hoslem.test(as.numeric(as.character(Response.train)),fitted(mod_1))
t=read.csv("D:\\Study Material\\Term IV\\Data Analytics Using R\\Assignment\\test.csv",header=T,na.strings="?")
colSums(is.na(t))
names(t)
# Building the model suggested by step and our knowledge
# (fitted on the training data; the test file has no Response column)
mod_f=glm(Response~Gender+Age+Driving_License+Previously_Insured+Vehicle_Age+Vehicle_Damage+Annual_Premium,data = d,subset = train,family = binomial)
# predicting probability values for the test data (predict.glm needs newdata=)
mod_f.probs=predict(mod_f,newdata = t,type = "response")
# Checking if the predictions are for the correct data set
length(mod_f.probs)
length(t$id)
head(mod_f.probs)
# calculating the best threshold based on the train/test split of train.csv
mod_f.test.probs=predict(mod_f,newdata = d.test,type = "response")
R=roc(Response.test,mod_f.test.probs)
plot(R,col="blue",legacy.axes = TRUE)
coords(R, "best", ret = "threshold")
R$auc
