Data Analysis Using R
Group Project
Group 10
Introduction
Just like medical insurance, there is vehicle insurance: every year the customer pays a premium of a certain amount to an insurance provider so that, in case of an unfortunate accident involving the vehicle, the provider pays a compensation (called the 'sum assured') to the customer.
To predict whether a customer would be interested in vehicle insurance, we have information about demographics (gender, age, region code), the vehicle (vehicle age, damage) and the policy (premium, sourcing channel).
List of variables
Unique ID
Gender
Age
Driving_License
Region_Code
Previously_Insured
Vehicle_Age
Vehicle_Damage
Annual_Premium
Policy_Sales_Channel
Vintage
Response
Descriptive analysis
Policy sales channel and region code are just indicators used to differentiate between sales channels and regions. The problem is that they are represented by numbers, and we cannot use them as numbers to build a model.
We could have converted them into factors, but we did not, for the following reasons:
1) There are almost 150 distinct codes for each of them. Including them all would make the model cluttered and difficult to interpret without adding much predictive power.
2) Some codes have very few observations. We had planned to club them all into the same category, which would also reduce the number of levels, but the challenge is deciding the threshold and identifying which codes fall below it with code rather than by manual inspection; a sketch follows this list.
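A minimal sketch of the threshold approach, using the data frame d loaded in the appendix. The cut-off of 50 mirrors the one explored there, and Region_Code_grp is an illustrative new column, not part of the original script:
code_counts=table(d$Region_Code)
rare_codes=names(code_counts)[code_counts<50]   # codes below the threshold
d$Region_Code_grp=ifelse(d$Region_Code %in% rare_codes,"Other",as.character(d$Region_Code))
d$Region_Code_grp=as.factor(d$Region_Code_grp)  # clubbed factor with far fewer levels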
Age, vehicle age, annual premium, gender, previous insurance status and vehicle damage all appear to have a significant impact on vehicle insurance selection: for each of these variables, the mean differs markedly between respondents and non-respondents.
Vintage, in contrast, does not seem to have a significant impact, as it shows no meaningful difference between the two groups.
Response and whether a customer holds a driving licence also appear to be significantly related.
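These statements come from comparing group means; a minimal sketch of the checks (d is the data frame loaded in the appendix; the t.test call is an illustrative addition, not part of the original script):
tapply(d$Age,d$Response,mean)      # mean age for non-respondents vs respondents
tapply(d$Response,d$Gender,mean)   # response rate by gender
t.test(Age~Response,data=d)        # formal test of the age difference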
Model Building
We used logistic regression and classification tree algorithms to build the models. We then selected the logistic regression model to predict the test data, as it had the lower error rate.
We initially built an all-inclusive model and then used the stepwise method to identify the best model, which retained the following variables: Gender, Age, Driving licence, Previously insured, Vehicle age, Vehicle damage, Annual premium and Policy sales channel.
We then removed the policy sales channel variable and built a new model; the reason for removing it is as discussed above. A sketch of the selection step follows.
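A minimal sketch, mirroring the appendix code (stepAIC comes from the MASS package; the identifier column is assumed to be named id, as in the appendix):
library(MASS)
mod_all=glm(Response~.-id,data=d,subset=train,family=binomial)  # all-inclusive model
stepAIC(mod_all,trace=F)   # reports the lowest-AIC model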
Classification Tree
We converted the response variable to a factor so that we could build a classification tree. In the classification tree we observed three to four terminal nodes, and all of them point to the same conclusion, namely 0 (no response). This happens because most of the response values are 0 (No), so the model is skewed towards that class, as the sketch below illustrates.
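The tree-fitting code does not appear in the appendix, so the sketch below is an assumption: it uses the rpart package (the report does not name the package used) together with the train indices defined in the appendix:
library(rpart)
tree_mod=rpart(as.factor(Response)~Gender+Age+Driving_License+Previously_Insured+Vehicle_Age+Vehicle_Damage+Annual_Premium,data=d[train,],method="class")
print(tree_mod)   # with the large majority of responses being 0, every leaf tends to predict 0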
We then used the logistic regression model with the following variables to predict values for the test data: Gender, Age, Driving licence, Previously insured, Vehicle age, Vehicle damage and Annual premium.
Conclusion
The logistic regression model gave us the best results, with a total error rate of 36%. The classification tree could not provide a condition for classifying a customer as buying vehicle insurance.
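For reference, the error rate quoted above is computed from the confusion table built in the appendix; a minimal sketch:
tab=table(mod_1.pred,Response.train)
1-sum(diag(tab))/sum(tab)   # total error rate: misclassified / total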
Appendix
d=read.csv("D:\\Study Material\\Term IV\\Data Analytics Using
R\\Assignment\\train.csv",header=T,na.strings="?")
str(d)
colSums(is.na(d))
#checking number of respondents with driving licence
sum(d$Driving_License)
length(d$Driving_License)
table(d$Vehicle_Damage)
#number of respondents who have been previously insured
table(d$Previously_Insured)
t=table(d$Policy_Sales_Channel)
prop.table(t)
length(t)
# trying to identify which policy channels have less than 50 entries
x=c()   # initialise the collection vector before appending
for (i in 1:length(t)) {
  if (t[i]<50) {
    x=append(x,t[i])
  }
}
length(x)
#Descriptive analytics
tapply(d$Response,d$Gender,"mean")
tapply(d$Response,d$Driving_License,"mean")
tapply(d$Response,d$Previously_Insured,"mean")
tapply(d$Response,d$Vehicle_Age,"mean")
tapply(d$Response,d$Vehicle_Damage,"mean")
tapply(d$Age,d$Response,"mean")
tapply(d$Annual_Premium,d$Response,"mean")
library(ISLR)
# Creating the Training Data Set (80%)
set.seed(4)
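# draw 304887 row indices (80% of the 381109 rows) for the training split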
train=sample(381109,304887)
d.train=d[train,]
dim(d.train)
Response=d$Response
d.test=d[-train,]
dim(d.test)
#Creating an array of the "Response" variable for Training Data
Response.train=Response[train]
Response.train=as.factor(Response.train)
# Creating an array of the "Response" variable for Test Data
Response.test=Response[-train]
# building an all inclusive model
mod_1=glm(Response~.,data = d,subset = train,family = binomial)
summary(mod_1)
library(MASS)
# Trying stepwise methods to determine best model
stepAIC(mod_1,trace = F)
# calculating associated probabilities
mod_1.probs=predict(mod_1,type = "response")
length(mod_1.probs)
head(mod_1.probs)
# Using ROC curve to determine best threshold
library(pROC)
R=roc(Response.train,mod_1.probs)
plot(roc(Response.train,mod_1.probs),col="blue",legacy.axes = TRUE)
coords(R, "best", ret = "threshold")
R$auc
# Checking which value represents what
contrasts(Response.train)
# Converting probabilities using threshold
mod_1.pred=rep(0,length(Response.train))
mod_1.pred[mod_1.probs>0.0865663]=1
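# 0.0865663 is the best threshold returned by coords() above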
# Plotting the table to calculate error rate
table(mod_1.pred,Response.train)
sum(mod_1.pred==Response.train)
# Checking goodness of fit of the model
library(ResourceSelection)
# several articles suggest that large datasets (>25000) will always show poor fit, so the H-L test may not be a good criterion
hoslem.test(Response[train],fitted(mod_1))
t=read.csv("D:\\Study Material\\Term IV\\Data Analytics Using R\\Assignment\\test.csv",header=T,na.strings="?")
colSums(is.na(t))
names(t)
# Building the model suggested by step and our knowledge, fitted on the training rows
mod_f=glm(Response~Gender+Age+Driving_License+Previously_Insured+Vehicle_Age+Vehicle_Damage+Annual_Premium,data = d,subset = train,family = binomial)
# predicting probabilities for the held-out split (for the ROC) and for test.csv
mod_f.test=predict(mod_f,newdata=d.test,type = "response")
mod_f.probs=predict(mod_f,newdata=t,type = "response")
# Checking if the predictions are for correct data set
length(mod_f.probs)
length(t$id)
head(mod_f.probs)
# calculating best threshold based on the train/test split
R=roc(Response.test,mod_f.test)
plot(R,col="blue",legacy.axes = TRUE)
coords(R, "best", ret = "threshold")
R$auc
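# Sketch of the final step (not in the original script): converting the test.csv
# probabilities to 0/1 with the threshold found above; [[1]] extracts the numeric
# value from whatever structure this version of coords() returns
thr=coords(R, "best", ret = "threshold")[[1]]
mod_f.pred=rep(0,length(mod_f.probs))
mod_f.pred[mod_f.probs>thr]=1
table(mod_f.pred)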