Business Report On Data Mining: By: Aditya Janardan Hajare Batch: PGPDSBA Mar'C21 Group 1
DATA MINING
By: Aditya Janardan Hajare
Batch: PGPDSBA Mar’C21 Group 1
Index
1. Problem 2: CART-RF-ANN
1.1 Importing Data
1.2 Understanding the data
2. Question 1
2.1 Univariate Analysis for Non-Categorical variables
2.2 Univariate Analysis for Categorical variables
2.3 Multivariate Analysis for non-categorical variables
3. Question 2
3.1 Building a CART classifier
3.2 Building a Random Forest classifier
3.3 Building a Neural Network classifier
4. Question 3
4.1 CART – AUC and ROC for the training data.
4.2 Random Forest model evaluation on training data
4.3 Neural Network model evaluation on training data
5. Question 4
6. Question 5
1) Problem 2: CART-RF-ANN
An Insurance firm providing tour insurance is facing higher claim frequency. The management decides to
collect data from the past few years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF & ANN and compare the models'
performances in train and test sets.
Attribute Information:
1. Target: Claim Status (Claimed)
2. Code of tour firm (Agency_Code)
3. Type of tour insurance firms (Type)
4. Distribution channel of tour insurance agencies (Channel)
5. Name of the tour insurance products (Product)
6. Duration of the tour (Duration)
7. Destination of the tour (Destination)
8. Amount of sales of tour insurance policies (Sales)
9. The commission received for tour insurance firm (Commission)
10. Age of insured (Age)
The summary statistics above were obtained using df.describe().T. The following points can be
interpreted from the output.
1) The mean age of the insured is 38 years and the median age is 36 years. The minimum and maximum
ages are 8 and 84 respectively.
2) The standard deviation is high for the Duration of the tour.
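As a minimal sketch of how such a summary is read, the snippet below runs describe().T on a small made-up table. The rows and values are hypothetical and are not taken from the insurance dataset; only the column names follow the attribute list. Note that describe() reports the median under the "50%" label.

```python
import pandas as pd

# Hypothetical toy rows for illustration only (not the report's data).
df = pd.DataFrame({
    "Age": [25, 36, 36, 48, 60],
    "Duration": [5, 12, 30, 365, 7],
    "Sales": [20.0, 45.5, 9.0, 250.0, 18.0],
})

# Transposing gives one row per numeric column, which is easier to scan.
summary = df.describe().T

# "50%" is the median; comparing it with the mean hints at skew.
print(summary[["mean", "50%", "std", "min", "max"]])
```

A standard deviation that is large relative to the mean, as for Duration in this toy table, usually points to a few extreme values.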
There are 139 duplicate rows in total. These are not removed: they account for only approx. 4% of the
entries in the dataset, and different customers may legitimately have identical values. Hence these
duplicate entries are left untreated.
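A duplicate count of this kind can be obtained with pandas, as in the sketch below. The rows are hypothetical and only illustrate the calls; the report's own figure of 139 comes from the full dataset.

```python
import pandas as pd

# Hypothetical toy rows for illustration only.
df = pd.DataFrame({
    "Age": [36, 36, 48, 36],
    "Agency_Code": ["EPX", "EPX", "C2B", "EPX"],
    "Sales": [20.0, 20.0, 45.0, 20.0],
})

# duplicated() marks every row that repeats an earlier row in all columns.
n_dupes = int(df.duplicated().sum())
share = n_dupes / len(df)

print(f"{n_dupes} duplicate rows ({share:.1%} of the data)")
```

Since the attribute list has no customer identifier, rows flagged this way are not necessarily the same customer, which is the reason given above for keeping them.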
2. Question 1
The following points can be interpreted from the univariate analysis of the categorical variables.
1) EPX (Agency_Code) has the highest number of policies, and its claim rate is relatively low.
2) The claim rate is high for insurance provided by Airlines, and the online channel is the most
preferred channel for policies that were claimed.
3) Customers prefer the customized plan rather than the standard ones. Claims are high for the
Gold Plan even though it is the least preferred plan, followed by the Silver Plan.
4) The highest number of customers travel to Asian countries, and the number of claims there is also
high. However, when claims are compared as a ratio of the number of customers, the claim rate is
highest for customers travelling to America, followed by Europe and Asia.
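The difference between raw claim counts and claim rates in point 4 can be sketched as follows. The rows and category labels are hypothetical; only the column names Destination and Claimed follow the attribute list.

```python
import pandas as pd

# Hypothetical toy rows for illustration only.
df = pd.DataFrame({
    "Destination": ["ASIA", "ASIA", "ASIA", "ASIA",
                    "Americas", "Americas", "EUROPE", "EUROPE"],
    "Claimed": ["No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes"],
})

# Count policies and claims per destination, then express claims as a rate.
by_dest = (
    df.assign(is_claim=df["Claimed"].eq("Yes"))
      .groupby("Destination")["is_claim"]
      .agg(policies="count", claims="sum", claim_rate="mean")
      .sort_values("claim_rate", ascending=False)
)

print(by_dest)
```

In this toy table the destination with the most policies has the lowest claim rate, which is why counts and ratios need to be read separately.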
The following points can be interpreted from the above pairplot of Age, Commission, Duration and Sales.
3. Question 2
3.1 Building a CART classifier
A Decision Tree classifier was built, and the tree was plotted using http://webgraphviz.com/
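One possible way to build such a tree and obtain text that can be pasted into a Graphviz viewer is sketched below with scikit-learn. The data are synthetic, and the hyperparameters (max_depth, min_samples_leaf) and feature names are assumptions chosen for illustration, not the values used in this report.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in data; the report uses the insurance dataset instead.
X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Assumed hyperparameters; a shallow tree keeps the plot readable.
cart = DecisionTreeClassifier(
    criterion="gini", max_depth=3, min_samples_leaf=10, random_state=1
)
cart.fit(X_train, y_train)
pred = cart.predict(X_test)

# With out_file=None the DOT source is returned as a string,
# which can be pasted into a Graphviz viewer.
dot_text = export_graphviz(
    cart,
    out_file=None,
    feature_names=[f"x{i}" for i in range(5)],
    class_names=["No", "Yes"],
    filled=True,
)
print(dot_text[:200])
```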
3.2 Building a Random Forest classifier
Small values were used for n_estimators because the kernel failed multiple times.
Shown above are the predicted classes and probabilities from the Random Forest classifier.
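Predicted classes and probabilities of this kind can be produced as in the sketch below. The data are synthetic, and n_estimators=50 and max_depth=5 are assumed values, not the settings used in this report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# A modest number of trees keeps memory and run time low (assumed values).
rf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=1)
rf.fit(X_train, y_train)

pred_class = rf.predict(X_test)        # hard class labels
pred_prob = rf.predict_proba(X_test)   # one probability column per class

print(pred_class[:5])
print(pred_prob[:5])
```

Each row of pred_prob sums to 1, and the predicted class is the one with the larger probability.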
3.3 Building a Neural Network classifier
Shown above are the predicted classes and probabilities from the Neural Network classifier.
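A neural network classifier can be sketched with scikit-learn's MLPClassifier as below. The data are synthetic. Scaling the inputs first is a common practice assumed here rather than something stated in this report, and the layer size and max_iter are illustrative values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=300, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Assumption: inputs are standardised before the network, because
# gradient-based training is sensitive to feature scale.
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(20,), max_iter=1000, random_state=1),
)
ann.fit(X_train, y_train)

ann_class = ann.predict(X_test)
ann_prob = ann.predict_proba(X_test)

print(ann_class[:5])
print(ann_prob[:5])
```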
4. Question 3
4.1.2 CART - Confusion Matrix and Classification report for the training data.
1) Accuracy shows no major difference between the train and test data.
2) Precision is higher for the train data than for the test data.
3) The f1-score also shows no major difference.
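Accuracy, precision and f1-score are all derived from the confusion matrix, so the train and test comparison can be sketched as follows. The counts below are hypothetical and are not this report's results; the helper name metrics_from_confusion is introduced only for illustration.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return accuracy, precision, recall and f1 for the positive class."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical confusion-matrix counts, not the report's actual results.
train = metrics_from_confusion(tn=80, fp=10, fn=20, tp=40)
test = metrics_from_confusion(tn=34, fp=6, fn=9, tp=16)

# A small train-minus-test gap suggests the model is not overfitting.
for name in train:
    gap = train[name] - test[name]
    print(f"{name:9s} train={train[name]:.3f} test={test[name]:.3f} gap={gap:+.3f}")
```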
4.2.3 Random Forest - Confusion Matrix and Classification report for the training data.
Random Forest - Confusion Matrix and Classification report for the testing data.
4.2.4 Inferences from the Random Forest Training and Testing data.
1) Accuracy shows no major difference between the train and test data.
2) Precision is higher for the train data than for the test data.
3) The f1-score also shows no major difference.
4.3.3 Neural Network - Confusion Matrix and Classification report for the training data.
4.3.5 Inferences from the Neural Network Training and Testing data.
1) Accuracy shows no major difference between the train and test data.
2) Precision is the same for the train and test data.
3) The f1-score is the same for the train and test data.
5. Question 4
Combined ROC curve for train data using CART, Random Forest and Neural Network models
Combined ROC curve for test data using CART, Random Forest and Neural Network models
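A combined ROC comparison of this kind can be sketched by collecting one curve and one AUC value per model. The data are synthetic and all hyperparameters are assumptions, so the numbers printed here say nothing about the models in this report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration only.
X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

# Assumed hyperparameters, not the ones tuned in the report.
models = {
    "CART": DecisionTreeClassifier(max_depth=3, random_state=1),
    "Random Forest": RandomForestClassifier(
        n_estimators=50, max_depth=4, random_state=1
    ),
    "Neural Network": MLPClassifier(
        hidden_layer_sizes=(20,), max_iter=2000, random_state=1
    ),
}

curves = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    prob_yes = model.predict_proba(X_test)[:, 1]  # probability of class 1
    fpr, tpr, _ = roc_curve(y_test, prob_yes)
    curves[name] = (fpr, tpr, roc_auc_score(y_test, prob_yes))
    print(f"{name}: test AUC = {curves[name][2]:.3f}")

# Each (fpr, tpr) pair can then be drawn on one set of axes,
# e.g. with matplotlib, to obtain the combined ROC plot.
```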
Summary: The Neural Network model has better accuracy, precision, recall and f1-score than the CART
and Random Forest models. Hence I am selecting the Neural Network model.
6. Question 5
Based on the dataset available for the analysis, more data related to age group, time, incident, location,
airline names etc. would help to find more correlations in the data and to analyze it in a more
detailed manner.
The online channel is the most preferred for buying insurance. Hence the company should give discounts
to customers enrolling online. Doing so will reduce offline registrations, which in turn will reduce
offline overheads and mistakes.
The JZI agency has the lowest sales, which is hurting the business. The company should either help the
agency grow with a market penetration plan to reach the maximum possible number of customers, or find
an alternative to JZI.
Most sales are made through the Agency type, but claims are mostly associated with Airlines. This needs
a deeper analysis to understand the reason.