Business Report On Data Mining: By: Aditya Janardan Hajare Batch: PGPDSBA Mar'C21 Group 1

This business report analyzes insurance claim data using machine learning models. It imports claim data and cleans outliers. Univariate analysis finds relationships between variables. CART, random forest, and neural network classifiers are built on train and test splits. Neural networks have the best performance. Recommendations include collecting more demographic data, offering online discounts, improving underperforming agencies, and analyzing claim processing.

BUSINESS REPORT ON

DATA MINING
By: Aditya Janardan Hajare
Batch: PGPDSBA Mar’C21 Group 1
Index

1. Problem 2: CART-RF-ANN
1.1 Importing Data
1.2 Understanding the data

2. Question 1
2.1 Univariate Analysis for Non-Categorical variables
2.2 Univariate Analysis for Categorical variables
2.3 Multivariate Analysis for non-categorical variables

3. Question 2
3.1 Building a CART classifier
3.2 Building a Random Forest classifier
3.3 Building a Neural Network classifier

4. Question 3
4.1 CART – AUC and ROC for the training data.
4.2 Random Forest model evaluation on training data
4.3 Neural Network model evaluation on training data

5. Question 4

6. Question 5
1. Problem 2: CART-RF-ANN

An Insurance firm providing tour insurance is facing higher claim frequency. The management decides to
collect data from the past few years. You are assigned the task to make a model which predicts the claim
status and provide recommendations to management. Use CART, RF & ANN and compare the models'
performances in train and test sets.

Dataset for Problem 2: insurance_part2_data-1.csv

Attribute Information:
1. Target: Claim Status (Claimed)
2. Code of tour firm (Agency_Code)
3. Type of tour insurance firms (Type)
4. Distribution channel of tour insurance agencies (Channel)
5. Name of the tour insurance products (Product)
6. Duration of the tour (Duration)
7. Destination of the tour (Destination)
8. Amount of sales of tour insurance policies (Sales)
9. The commission received for tour insurance firm (Commission)
10. Age of insured (Age)

1.1 Importing the data

1.2 Understanding the data


Using df.info(), we can interpret the following points about the dataset:

1) There are 3000 rows and 10 columns in total.
2) The data has mixed data types: int64, object and float64.
3) All variables except Age, Commission, Duration and Sales need to be converted to the
categorical type for building and interpreting the model.
4) There are no null values present.
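
The import and the checks listed above can be sketched as follows. This is a minimal illustration, not the report's original code: the three sample rows are hypothetical stand-ins for the real file, and the column names follow the attribute list, so they may differ from the headers in the actual CSV.

```python
import io
import pandas as pd

# Hypothetical sample rows standing in for insurance_part2_data-1.csv.
# Column names are assumed from the attribute list in this report.
sample_csv = io.StringIO(
    "Age,Agency_Code,Type,Claimed,Commission,Channel,Duration,Sales,Product,Destination\n"
    "48,C2B,Airlines,No,0.70,Online,7,2.51,Customised Plan,ASIA\n"
    "36,EPX,Travel Agency,No,0.00,Online,34,20.00,Customised Plan,ASIA\n"
    "39,CWT,Travel Agency,Yes,5.94,Online,3,9.90,Customised Plan,Americas\n"
)
# With the real data this would be: pd.read_csv("insurance_part2_data-1.csv")
df = pd.read_csv(sample_csv)

df.info()                        # row count, column dtypes, non-null counts
null_counts = df.isnull().sum()  # number of nulls in each column

# Non-numeric columns are the candidates for conversion to categorical
non_numeric_cols = df.select_dtypes(exclude="number").columns.tolist()
print(int(null_counts.sum()), non_numeric_cols)
```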

The summary statistics above are obtained using df.describe().T. The following points can be
interpreted from the output:

1) The mean age of the insured customers is 38 years and the median age is 36 years. The minimum
and maximum ages are 8 and 84 respectively.
2) The standard deviation is high for the Duration of the tour.

The above table shows the unique values of each variable.

There are 139 duplicate rows in total. They account for approximately 4.6% of the dataset (139 of
3000 rows), and because the data contains no customer identifier, identical records may belong to
different customers. Hence these duplicate entries are left untreated.
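
A duplicate count of this kind can be obtained as sketched below. The small frame is hypothetical; only the 139 and 3000 figures come from this report.

```python
import pandas as pd

# Hypothetical frame: rows 1 and 3 repeat row 0 exactly
df = pd.DataFrame({"Age": [36, 36, 40, 36], "Sales": [20.0, 20.0, 9.9, 20.0]})

# duplicated() flags every row that is identical to an earlier row
n_duplicates = int(df.duplicated().sum())
share = n_duplicates / len(df)
print(n_duplicates, share)

# Share of duplicates reported for the full dataset (139 of 3000 rows)
reported_share_pct = 139 / 3000 * 100
print(round(reported_share_pct, 1))
```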
2. Question 1

2.1 Univariate Analysis for Non-Categorical variables


Below points can be interpreted from the univariate analysis

1) Heavy outliers are present in Age, Commission, Duration and Sales.


2) The outliers need to be treated for better results.
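
The report does not state which outlier treatment was used. One common option, shown here purely as an assumption, is to cap values at 1.5 times the interquartile range beyond the quartiles. The helper name cap_outliers_iqr and the sample values are hypothetical.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values lying more than k * IQR below Q1 or above Q3."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical Duration values with one extreme observation
duration = pd.Series([3, 7, 10, 12, 15, 20, 34, 400])
capped = cap_outliers_iqr(duration)
print(capped.tolist())
```

Capping keeps every row in the dataset, which matters when the sample is only 3000 records; dropping the outlying rows would be the alternative.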

2.2 Univariate Analysis for Categorical variables

The following points can be interpreted from the univariate analysis of the categorical variables:

1) EPX (Agency_Code) has the highest number of insurance policies, and its claim rate is
relatively low.
2) The claim rate is high for insurance provided by Airlines, and the Online channel is the most
preferred channel for claiming insurance.
3) Customers prefer the customized plan over the standard plans. The claim rate is high for the
Gold Plan, even though it is the least preferred plan, followed by the Silver Plan.
4) The highest number of customers travel to Asian countries, and the number of claims there is
also high. However, when claims are compared as a ratio of the number of customers, the claim rate
is highest for customers travelling to America, followed by Europe and Asia.
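
The difference between raw counts and claim ratios in point 4 can be illustrated with a small sketch. The rows below are hypothetical, and the column names and category labels are assumed from the attribute list.

```python
import pandas as pd

# Hypothetical records; labels are illustrative only
df = pd.DataFrame({
    "Destination": ["ASIA", "ASIA", "ASIA", "ASIA",
                    "Americas", "Americas", "EUROPE", "EUROPE"],
    "Claimed": ["No", "No", "No", "Yes", "Yes", "No", "No", "Yes"],
})

# Number of customers per destination
counts = df["Destination"].value_counts()

# Share of customers who claimed, per destination
claim_rate = df["Claimed"].eq("Yes").groupby(df["Destination"]).mean()
print(counts)
print(claim_rate)
```

In this toy data ASIA has the most customers but the lowest claim rate, which is the same kind of contrast the point above describes.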

2.3 Multivariate Analysis for non-categorical variables

The following points can be interpreted from the pairplot and heat map of Age, Commission, Duration
and Sales:

1) The pairplot shows no strong relationship between most pairs of the non-categorical variables.
2) The heat map quantifies these relationships and shows that the one strong relationship is
between Sales and Commission.
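
The correlation matrix behind a heat map of this kind can be computed as sketched below. The values are hypothetical and were chosen so that Sales and Commission move together; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical numeric columns, named after the attribute list
df = pd.DataFrame({
    "Age": [48, 36, 39, 33, 45, 61],
    "Commission": [0.7, 0.0, 5.9, 6.3, 15.8, 41.6],
    "Duration": [7, 34, 3, 53, 8, 30],
    "Sales": [2.5, 20.0, 9.9, 18.0, 45.0, 166.5],
})

corr = df.corr()  # Pearson correlation between every pair of columns

# Collect each unordered pair once and pick the strongest absolute correlation
cols = list(corr.columns)
pairs = {(a, b): corr.loc[a, b] for i, a in enumerate(cols) for b in cols[i + 1:]}
strongest = max(pairs, key=lambda p: abs(pairs[p]))
print(strongest, round(pairs[strongest], 3))
```

The matrix in corr is what a plotting call such as seaborn's heatmap would take as input.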
The above table is the output after changing the data type to categorical for Agency_Code, Type,
Claimed, Channel, Product Name and Destination.
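
A conversion of this kind can be sketched as follows. Replacing each category with its integer code is an assumption here, made because scikit-learn estimators expect numeric inputs; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical subset of the columns listed above
df = pd.DataFrame({
    "Agency_Code": ["C2B", "EPX", "CWT", "JZI", "EPX"],
    "Claimed": ["No", "No", "Yes", "No", "Yes"],
    "Age": [48, 36, 39, 33, 45],
})

for col in ["Agency_Code", "Claimed"]:
    # Categories are sorted alphabetically, then replaced by integer codes
    df[col] = pd.Categorical(df[col]).codes

print(df)
```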
3. Question 2

Below is the Train and Test data split
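
A split of this kind can be sketched with scikit-learn as follows. The 70/30 ratio, the random_state value and the stratification are assumptions for illustration; the report does not state the settings used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and a balanced binary target
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Assumed settings: 30% held out for testing, class balance preserved
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```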

A Decision Tree classifier is built, and the tree is plotted using http://webgraphviz.com/

3.1 Building a CART classifier

Shown above are the predicted classes and probabilities from the CART classifier.
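
The CART step can be sketched as below. The data is synthetic and the hyperparameters are assumed values, not the ones used in this report. The DOT text returned by export_graphviz is the kind of input that webgraphviz renders.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Synthetic stand-in for the encoded insurance data
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# Assumed hyperparameters, chosen only to keep the tree small
cart = DecisionTreeClassifier(
    criterion="gini", max_depth=4, min_samples_leaf=10, random_state=1
)
cart.fit(X, y)

pred_class = cart.predict(X)       # predicted class labels
pred_prob = cart.predict_proba(X)  # one probability column per class

# DOT description of the fitted tree, returned as a string
dot_text = export_graphviz(cart, out_file=None, class_names=["No", "Yes"], filled=True)
print(dot_text[:60])
```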

3.2 Building a Random Forest classifier

# n_estimators is set to small values because the kernel failed multiple times

Shown above are the predicted classes and probabilities from the Random Forest classifier.
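
A Random Forest fit with a deliberately small n_estimators can be sketched as follows. The data is synthetic and every hyperparameter value is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the encoded insurance data
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# Assumed settings; a small forest keeps memory and run time low
rf = RandomForestClassifier(
    n_estimators=50, max_depth=6, min_samples_leaf=10, random_state=1
)
rf.fit(X, y)

rf_class = rf.predict(X)            # predicted class labels
rf_prob = rf.predict_proba(X)       # one probability column per class
importances = rf.feature_importances_  # relative importance of each feature
print(importances.round(3))
```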
3.3 Building a Neural Network classifier

Shown above are the predicted classes and probabilities from the Neural Network classifier.
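
A neural network classifier of this kind can be sketched with scikit-learn's MLPClassifier. Scaling the inputs first is an assumption, made because neural networks are sensitive to feature scale; the layer size and iteration limit are also assumed values.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded insurance data
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# Scale the features, then fit a single-hidden-layer network (assumed settings)
ann = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000, random_state=1),
)
ann.fit(X, y)

ann_class = ann.predict(X)       # predicted class labels
ann_prob = ann.predict_proba(X)  # one probability column per class
print(ann_prob[:3].round(3))
```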

4. Question 3

4.1 CART – AUC and ROC for the training data.

AUC Score – 0.845


4.1.1 CART – AUC and ROC for the test data

AUC Score – 0.798
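
AUC scores and ROC curves like those above can be computed as sketched below. The labels and scores are a hypothetical four-row example, not the report's predictions.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the share of (positive, negative) pairs ranked correctly: 3 of 4 here
auc = roc_auc_score(y_true, y_score)

# Points of the ROC curve: false positive rate against true positive rate
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc, fpr, tpr)
```

For a fitted classifier, y_score would be the second column of predict_proba, computed separately for the train and test sets.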

4.1.2 CART - Confusion Matrix and Classification report for the training data.

Accuracy of the train data – 0.8

Classification of the CART training data


4.1.3 CART - Confusion Matrix and Classification report for the testing data.

Accuracy of the test data – 0.7555

Classification of the CART testing data
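
The accuracy, confusion matrix and classification report can be produced as sketched below. The labels are hypothetical and serve only to show how the three outputs relate.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical true and predicted labels (1 = claimed)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)

# Accuracy is the diagonal of the confusion matrix divided by the total
accuracy = accuracy_score(y_true, y_pred)

# Precision, recall and f1-score for each class
report = classification_report(y_true, y_pred)
print(cm)
print(accuracy)
print(report)
```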

4.1.4 Inferences from the CART Training and Testing data.

1) There is no major difference between the train and test accuracy (0.8 vs 0.7555).
2) The precision value for train data is higher than for test data.
3) There is also no major difference in the f1-score.

4.2 Random Forest model evaluation on training data


4.2.1 Random Forest – AUC and ROC for the training data.

Area under the curve is 0.8665081


4.2.2 Random Forest – AUC and ROC for the testing data.

Area under the curve is 0.818548

4.2.3 Random Forest - Confusion Matrix and Classification report for the training data.

Accuracy of the train data – 0.77888

Classification of the Random Forest training data

4.2.4 Random Forest - Confusion Matrix and Classification report for the testing data.

Accuracy of the test data – 0.788571


Classification of the Random Forest testing data

4.2.5 Inferences from the Random Forest Training and Testing data.

1) There is no major difference between the train and test accuracy.
2) The precision value for train data is higher than for test data.
3) There is also no major difference in the f1-score.

4.3 Neural Network model evaluation on training data


4.3.1 Neural Network – AUC and ROC for the training data.

Area under the curve is 0.815826


4.3.2 Neural Network – AUC and ROC for the testing data.

Area under the curve is 0.782790

4.3.3 Neural Network - Confusion Matrix and Classification report for the training data.

Accuracy of the training data – 0.788571

Classification of the Neural Network training data


4.3.4 Neural Network - Confusion Matrix and Classification report for the testing data.

Accuracy of the testing data – 0.764444

Classification of the Neural Network testing data

4.3.5 Inferences from the Neural Network Training and Testing data.

1) There is no major difference between the train and test accuracy (0.788571 vs 0.764444).
2) The precision value is the same for train and test data.
3) The f1-score is the same for train and test data.
5. Question 4

Combined ROC curve for train data using CART, Random Forest and Neural Network models

Combined ROC curve for test data using CART, Random Forest and Neural Network models

Summary: The Neural Network model has better accuracy, precision, recall and f1-score than the CART
and Random Forest models. Hence I am selecting the Neural Network model.
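
For a side-by-side view, the AUC and accuracy figures reported in sections 4.1 to 4.3 can be collected into one table, as sketched below. Precision, recall and f1-score are left out because their values appear only in the classification reports. The "AUC gap" column is an added illustration of the train-to-test drop, not a metric used in the report.

```python
import pandas as pd

# Figures copied from sections 4.1 to 4.3 of this report
reported = pd.DataFrame(
    {
        "AUC (train)": [0.845, 0.8665081, 0.815826],
        "AUC (test)": [0.798, 0.818548, 0.782790],
        "Accuracy (train)": [0.8, 0.77888, 0.788571],
        "Accuracy (test)": [0.7555, 0.788571, 0.764444],
    },
    index=["CART", "Random Forest", "Neural Network"],
)

# Drop in AUC from train to test, a rough indicator of overfitting
reported["AUC gap"] = reported["AUC (train)"] - reported["AUC (test)"]
print(reported.round(3))
```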
6. Question 5

The dataset available for this analysis is limited. Additional data on age group, time, incident,
location, airline names, etc. would help uncover more correlations and allow a more detailed
analysis.

The Online channel is the most preferred for buying insurance. Hence the company should give
discounts to customers enrolling online. Doing so will reduce offline registrations, which in turn
will reduce offline overheads and mistakes.

The JZI agency has the lowest sales, which is hurting the business. The company should either help
the agency grow with a market penetration plan to reach the maximum possible number of customers, or
find an alternative to JZI.

Most sales are made by Agencies, but most claims are processed by Airlines. A deeper analysis is
needed to understand why.
