Final Report
In the insurance industry, accurately predicting which clients will file claims is crucial. For
car insurance, numerous variables describe each policy case, including car features (make,
model, airbags, fuel type, sensors, age of the car, torque, and power) and policyholder
demographics (age, city, population density).
Predicting if a policy will be claimed involves two types of errors: false negatives (predicting
that the policy won't be claimed when it actually is) and false positives (predicting that the
policy will be claimed when it isn't). Each type of error incurs a cost. In car insurance, false
negatives carry a significantly higher cost than false positives, typically 5 to 50 times
higher, due to the potential for large financial losses and poor risk management. Thus, the
goal is to develop a model that minimizes a cost function based on these error costs.
To achieve this, I performed data wrangling, exploratory data analysis, preprocessing,
modeling, model evaluation, and optimization.
Data Wrangling
In this step, I assessed the data types of each column, identifying integer, float, and
categorical types. Additionally, I encoded all Yes-No values to True-False, as this format is
preferred by scikit-learn. I split the max power and max torque columns, since each combined
two quantities: Nm and rpm for max torque, and bhp and rpm for max power.
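That split can be sketched with a regular-expression extract. The sample values below (e.g. `113Nm@4400rpm`) are hypothetical stand-ins; the exact separator in the real data may differ:

```python
import pandas as pd

# Hypothetical rows mirroring the described value@rpm format.
df = pd.DataFrame({
    "max_torque": ["113Nm@4400rpm", "170Nm@3500rpm"],
    "max_power": ["88.77bhp@6000rpm", "97.89bhp@3600rpm"],
})

# Split each combined column into its two numeric components.
df[["torque_nm", "torque_rpm"]] = (
    df["max_torque"].str.extract(r"([\d.]+)Nm@([\d.]+)rpm").astype(float)
)
df[["power_bhp", "power_rpm"]] = (
    df["max_power"].str.extract(r"([\d.]+)bhp@([\d.]+)rpm").astype(float)
)
print(df[["torque_nm", "torque_rpm", "power_bhp", "power_rpm"]])
```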
I checked for missing values and found none. I also verified the uniqueness of IDs and
checked for duplicates. Fortunately, the dataset was very clean and healthy.
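These sanity checks are one-liners in pandas; a minimal sketch on a toy frame (column names are illustrative, not the dataset's):

```python
import pandas as pd

# Toy frame standing in for the policy data.
df = pd.DataFrame({"policy_id": ["P1", "P2", "P3"], "is_claim": [0, 1, 0]})

assert df.isna().sum().sum() == 0   # no missing values anywhere
assert df["policy_id"].is_unique    # IDs are unique
assert not df.duplicated().any()    # no duplicate rows
print("all checks passed")
```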
I also examined the relationship between the target variable and all the predictors. For
numeric variables, I used boxplots to compare their distributions based on the target variable.
For categorical variables, I used data frames to compare the categories in relation to the target
variable:
I conducted null hypothesis significance tests to evaluate if there was a statistically
significant difference in the proportions when considering the target variable:
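One common form of such a proportion test is a chi-squared test of independence on a contingency table; a sketch with made-up counts (the report's actual tables and p-values are not reproduced here):

```python
import numpy as np
from scipy import stats

# Illustrative 2x2 contingency table: a categorical feature vs. the claim flag.
table = np.array([[4500, 500],    # category A: not claimed / claimed
                  [3800, 700]])   # category B: not claimed / claimed

chi2, p, dof, _ = stats.chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
```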
I also conducted hypothesis tests for the means. I used parametric tests in cases where
variance homogeneity was proven and non-parametric (permutations) tests where the
distributions didn't pass a Levene’s Variance Homogeneity test:
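That decision rule can be sketched with scipy on synthetic groups (in the report, the two groups come from splitting a numeric feature by the target variable):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
claimed = rng.normal(10, 2, 200)        # synthetic "claimed" group
not_claimed = rng.normal(9.5, 2, 800)   # synthetic "not claimed" group

# Levene's test for variance homogeneity decides which mean test to run.
_, p_levene = stats.levene(claimed, not_claimed)
if p_levene > 0.05:
    # Variances look homogeneous: classic two-sample t-test.
    _, p_mean = stats.ttest_ind(claimed, not_claimed)
else:
    # Otherwise fall back to a permutation test on the difference of means.
    res = stats.permutation_test(
        (claimed, not_claimed),
        lambda a, b: np.mean(a) - np.mean(b),
        n_resamples=2000,
    )
    p_mean = res.pvalue
print(f"Levene p={p_levene:.3f}, mean-difference p={p_mean:.3f}")
```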
The EDA revealed influential predictors and correlations among the variables.
Preprocessing
As the dataset was clean, preprocessing was straightforward. I scaled the features, created
dummy variables for the categorical columns with pandas.get_dummies, and split the data
into train and test sets.
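Under those steps, the pipeline looks roughly like this (toy data with illustrative column names; the scaler is fit on the training split only, to avoid leakage):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the policy features.
df = pd.DataFrame({
    "age": [25, 40, 33, 51, 29, 45, 38, 60],
    "fuel": ["Petrol", "Diesel", "CNG", "Petrol", "Diesel", "Petrol", "CNG", "Diesel"],
    "is_claim": [0, 1, 0, 1, 0, 1, 0, 1],
})

X = pd.get_dummies(df.drop(columns="is_claim"))   # one-hot encode categoricals
y = df["is_claim"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```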
Model Evaluation
I evaluated the models using the Classification Report, which includes metrics such as
precision, recall, and F1 score, as well as the Confusion Matrix, which shows the number of
correctly and incorrectly predicted cases for both classes.
Random Forest Classifier (default hyperparameters)
Ensemble algorithms often outperform simple classifiers. Thus, I proceeded with a Random
Forest Classifier, an ensemble method that fits multiple decision trees on various subsets of
the dataset and averages their predictions to enhance performance. I began with a Random
Forest Classifier using default parameters and evaluated its performance using the
Classification Report and Confusion Matrix.
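A minimal sketch of that baseline on synthetic imbalanced data (the real features, class ratio, and scores differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data in place of the real policy features.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

clf = RandomForestClassifier(random_state=42)   # default hyperparameters
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```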
Another useful evaluation tool was the Receiver Operating Characteristic (ROC)
Curve, which illustrates the model's performance across all classification thresholds. The
ROC curve is summarized with the Area Under the Curve (AUC), where a higher AUC
indicates better performance, with 1 being the ideal score. The ROC curve and AUC for the
baseline Random Forest Classifier are as follows:
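Computing the curve and its AUC requires probability scores rather than hard class labels; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# ROC needs the predicted probability of the positive class.
probs = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)
print(f"AUC = {auc:.3f}")
```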
Random Forest Classifier (hyperparameter tuning with Random Search)
The best hyperparameters identified through the Random Search process, along with the
classification report, confusion matrix, ROC curve, and AUC for the best estimator, are
presented below:
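A Random Search of this kind can be sketched with scikit-learn's RandomizedSearchCV; the distributions below are illustrative, not the ones actually used in the report:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

# Illustrative search space; the report's real distributions may differ.
param_dist = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(3, 20),
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist, n_iter=5, scoring="f1", cv=3, random_state=0, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```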
Extreme Gradient Boosting (default hyperparameters)
Extreme Gradient Boosting (XGBoost) is a state-of-the-art ensemble method known for its
superior performance in various regression and classification tasks. Typically, it uses trees as
base estimators and employs boosting, an iterative approach to optimize performance. The
XGBoost library offers an API compatible with the scikit-learn environment. I trained an
XGBClassifier with the scale_pos_weight parameter to address class imbalance in the
dataset. The evaluation metrics for this model are as follows:
Extreme Gradient Boosting (hyperparameter tuning with Random Search)
Similar to the Random Forest Classifier, the XGBoost Classifier can significantly benefit
from hyperparameter tuning. To optimize the performance of the XGBoost model, I
conducted a Random Search for hyperparameter tuning. The scoring metric used was the F1
score. The following outlines the parameter distributions and results:
XGB + Synthetic Minority Over-sampling Technique (SMOTE)
SMOTE is a powerful technique used to address class imbalance in machine learning datasets
by generating synthetic samples for the minority class. This helps to balance the dataset and
can lead to better model performance, especially in cases where the target variable is heavily
skewed. In this step, I applied SMOTE to resample the data and then trained an XGBoost
Classifier on the resampled dataset:
The metrics for this base XGB with SMOTE are the following:
XGB + SMOTE (hyperparameter tuning with Random Search)
The initial XGB + SMOTE model resulted in an extreme reduction of false negatives,
achieving zero false negatives. However, this also led to a significant increase in false
positives. To address this, I performed hyperparameter tuning using Random Search to find
the optimal parameters that balance both false positives and false negatives more effectively.
In the context of insurance claims, the cost associated with false negatives (CFN) is typically
significantly higher than that of false positives (CFP). To assess the impact of varying these
costs, I explored different ratios of CFN to CFP, ranging from 5 to 50. The results of the cost
function, represented as a heatmap, illustrate how different ratios influence the overall cost in
this scenario:
• When the CFN is 5 times the CFP, the best models are: (1) Base Logistic Regression, (2)
Base KNN, (3) Base Random Forest Classifier.
• When the CFN is 10 times the CFP, the best models are: (1) XGB (Random Search), (2)
Base KNN, (3) Base Logistic Regression.
• When the CFN is 20 times the CFP, the best models are: (1) XGB (Random Search), (2)
XGB (SMOTE + Random Search 2), (3) XGB (SMOTE).
• When the CFN is 50 times the CFP, the best models are: (1) XGB (SMOTE), (2) XGB
(SMOTE + Random Search 1), (3) XGB (Random Search).
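The cost comparison itself reduces to CFN x FN + CFP x FP per model. A sketch with entirely made-up confusion-matrix counts, chosen only to illustrate how the ranking can flip as the ratio grows:

```python
# Hypothetical (FN, FP) counts per model; not the report's actual numbers.
models = {
    "Base LogReg": (120, 900),
    "XGB (Random Search)": (60, 1500),
    "XGB (SMOTE)": (5, 4000),
}
ratios = [5, 10, 20, 50]   # CFN expressed as a multiple of CFP (CFP = 1)

# Total cost = CFN * FN + CFP * FP, evaluated for each ratio.
for r in ratios:
    costs = {name: r * fn + fp for name, (fn, fp) in models.items()}
    best = min(costs, key=costs.get)
    print(f"CFN/CFP = {r:2d}: best = {best}")
```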
Based on these findings, the most probable scenarios, with CFN values between 10 and 20
times the CFP, suggest XGB (Random Search) as the optimal model. This model
consistently demonstrates strong performance, particularly highlighted by its highest Area
Under the Curve (AUC) in the Receiver Operating Characteristic (ROC) analysis,
showcasing robustness across various decision thresholds.
Conclusion
The goal of this project was to develop a model for predicting car insurance claims based on
policy, car, and demographic features. Given the nature of insurance claims, where false
negatives incur significantly higher costs than false positives, the focus was on optimizing for
this scenario.
Using a dataset comprising 44 features and 58,592 policy records, I conducted thorough data
wrangling and exploratory data analysis. This process uncovered meaningful patterns and
relationships within the data, guiding subsequent preprocessing steps to prepare for modeling.
I trained and evaluated several models, beginning with three baseline models and progressing
to ensemble methods such as Random Forest and XGBoost, which were further refined
through hyperparameter optimization using Random Search. Additionally, I employed the
SMOTE technique to address class imbalance and explored the performance of a Multilayer
Perceptron Classifier.
Evaluation metrics, particularly ROC and AUC, were used to compare model performance.
The XGBoost Classifier, optimized through Random Search, emerged as the superior model.
It consistently demonstrated the highest Area Under the Curve (AUC) in the ROC analysis,
showcasing robustness across various decision thresholds and effectively minimizing our
targeted cost function.
As a final recommendation, improving the dataset's class imbalance would likely enhance
model performance further. Additionally, with increased computational resources, conducting
a Complete Grid Search for hyperparameter tuning could yield even more refined results.
Lastly, establishing a precise business cost function with exact ratios between false positives
and false negatives would provide deeper insights for decision-making.