
Final Report – Car Insurance Claims Data Science Project

In the insurance industry, accurately predicting which clients will file claims is crucial. For
car insurance, numerous variables describe each policy case, including car features (make,
model, airbags, fuel type, sensors, age of the car, torque, and power) and policyholder
demographics (age, city, population density).
Predicting whether a policy will be claimed involves two types of errors: false negatives (predicting that the policy won't be claimed when it actually is) and false positives (predicting that the policy will be claimed when it isn't). Each type of error incurs a cost. In car insurance, false negatives carry a significantly higher cost than false positives, typically 5 to 50 times higher, because a missed claim means direct financial loss and poor risk management. Thus, the goal is to develop a model that minimizes a cost function based on these error costs.
To achieve this, I performed data wrangling, exploratory data analysis, preprocessing,
modeling, model evaluation, and optimization.

Data Wrangling


The dataset used for this project is available on Kaggle and contains 44 columns and 58,592
rows. Each row represents a policy record. The dataset includes a policy ID, policy tenure,
car features (such as age, make, segment, fuel type, and transmission type), demographic
features (such as the age of the policyholder, city, and population density), and a target
variable indicating whether the policy was claimed or not.

In this step, I assessed the data types of each column, identifying integer, float, and categorical types. I encoded all Yes-No values as True-False booleans, the format preferred by scikit-learn. I also split the max torque and max power columns, since each packed two quantities into a single string: Nm and rpm for max torque, and bhp and rpm for max power. Finally, I checked for missing values and found none, verified the uniqueness of the policy IDs, and checked for duplicates. The dataset turned out to be very clean.
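The exact string format of these columns isn't reproduced in this report, so the following is only a minimal sketch of the wrangling steps, assuming values like "250Nm@2750rpm" and hypothetical column names (max_torque, max_power, policy_id):

    import pandas as pd

    df = pd.read_csv("car_insurance_claims.csv")  # file path is an assumption

    # Encode Yes/No strings as booleans
    df = df.replace({"Yes": True, "No": False})

    # Split "250Nm@2750rpm" into two numeric columns (format assumed)
    df[["max_torque_nm", "max_torque_rpm"]] = (
        df["max_torque"].str.extract(r"([\d.]+)Nm@([\d.]+)rpm").astype(float)
    )
    df[["max_power_bhp", "max_power_rpm"]] = (
        df["max_power"].str.extract(r"([\d.]+)bhp@([\d.]+)rpm").astype(float)
    )

    # Sanity checks described above
    assert df.isna().sum().sum() == 0   # no missing values
    assert df["policy_id"].is_unique    # IDs are unique
    assert not df.duplicated().any()    # no duplicate rows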

Exploratory Data Analysis


My Exploratory Data Analysis (EDA) consisted of three parts. First, I explored each variable separately, creating numerous plots to examine the data visually: histograms and box plots for the numerical columns, and count plots for the categorical columns, including the several Boolean (True-False) columns and the target variable. My goal was to understand the distributions, identify outliers, check for normality, and assess imbalance. Notably, the target variable (is_claim) exhibited considerable class imbalance, as the majority of policies were not claimed.
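The original plots are not reproduced here; this is a minimal sketch of the kind of univariate plots described, with hypothetical column names:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Histogram and box plot for a numeric column
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    sns.histplot(df["age_of_car"], ax=axes[0])
    sns.boxplot(x=df["age_of_car"], ax=axes[1])

    # Count plot for the (heavily imbalanced) target
    sns.countplot(x=df["is_claim"])
    plt.show()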
Second, I explored the relationships between the numeric variables using a correlation heatmap, which revealed several direct and inverse correlations among the variables.
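A minimal sketch of the heatmap step, computing pairwise correlations over the numeric columns only:

    import matplotlib.pyplot as plt
    import seaborn as sns

    corr = df.select_dtypes("number").corr()
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr, cmap="coolwarm", center=0)
    plt.show()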

Third, I examined the relationship between the target variable and all the predictors. For numeric variables, I used box plots to compare their distributions across the two target classes; for categorical variables, I built summary tables comparing each category against the target variable.
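A sketch of both comparisons; pd.crosstab is one way to build the category tables, and the column names are assumptions:

    import pandas as pd
    import seaborn as sns

    # Numeric predictor vs. target: compare distributions with box plots
    sns.boxplot(x="is_claim", y="policy_tenure", data=df)

    # Categorical predictor vs. target: claim proportion per category
    print(pd.crosstab(df["segment"], df["is_claim"], normalize="index"))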
I conducted null hypothesis significance tests to evaluate whether there was a statistically significant difference in claim proportions across the categories of each predictor.
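A minimal sketch of such a proportion test with statsmodels, assuming a Boolean predictor column (is_esc is a hypothetical name):

    from statsmodels.stats.proportion import proportions_ztest

    # Claim counts and group sizes for each level of a Boolean predictor
    grouped = df.groupby("is_esc")["is_claim"]
    counts = grouped.sum().to_numpy()   # claimed policies per group
    nobs = grouped.count().to_numpy()   # group sizes

    stat, p_value = proportions_ztest(counts, nobs)
    print(f"z = {stat:.3f}, p = {p_value:.4f}")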
I also conducted hypothesis tests for the means, using parametric tests where variance homogeneity held and non-parametric permutation tests where the groups failed Levene's test for homogeneity of variance.
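A sketch of that testing logic with SciPy (scipy.stats.permutation_test needs SciPy 1.8+; the 0.05 cut-off and the column name are assumptions):

    import numpy as np
    from scipy.stats import levene, ttest_ind, permutation_test

    claimed = df.loc[df["is_claim"] == 1, "age_of_policyholder"]
    not_claimed = df.loc[df["is_claim"] == 0, "age_of_policyholder"]

    # Levene's test for homogeneity of variance
    _, p_levene = levene(claimed, not_claimed)

    if p_levene > 0.05:
        # Variances look homogeneous: parametric t-test on the means
        stat, p = ttest_ind(claimed, not_claimed)
    else:
        # Otherwise: permutation test on the difference in means
        def mean_diff(x, y):
            return np.mean(x) - np.mean(y)

        res = permutation_test((claimed, not_claimed), mean_diff,
                               n_resamples=10_000)
        stat, p = res.statistic, res.pvalue

    print(f"p = {p:.4f}")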

The EDA revealed influential predictors and correlations among the variables.

Preprocessing
As the dataset was clean, preprocessing was straightforward: I created dummy variables for the categorical columns with pandas.get_dummies, scaled the numeric columns with StandardScaler, and split the data into training and test sets.
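A minimal sketch of these three steps. The report doesn't specify whether scaling happened before or after the split; the sketch fits the scaler on the training split only, a common way to avoid leakage:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # One-hot encode the categorical columns
    X = pd.get_dummies(df.drop(columns=["policy_id", "is_claim"]))
    y = df["is_claim"]

    # Stratified split preserves the class imbalance in both sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Scale the numeric columns
    numeric_cols = X_train.select_dtypes("number").columns
    scaler = StandardScaler()
    X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
    X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])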

Modelling, Evaluation and Optimization


Baseline Models
Since this is a classification task, I trained the following baseline models: K-Nearest
Neighbors Classifier, Decision Tree Classifier, and Logistic Regression.

I evaluated the models using the Classification Report, which includes metrics such as
precision, recall, and F1 score, as well as the Confusion Matrix, which shows the number of
correctly and incorrectly predicted cases for both classes.
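A sketch of the baseline loop, reusing the split from the preprocessing sketch (all hyperparameters left at scikit-learn defaults):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    baselines = {
        "KNN": KNeighborsClassifier(),
        "Decision Tree": DecisionTreeClassifier(random_state=42),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }

    for name, model in baselines.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        print(name)
        print(classification_report(y_test, y_pred))
        print(confusion_matrix(y_test, y_pred))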
Random Forest Classifier (default hyperparameters)
Ensemble algorithms often outperform simple classifiers. Thus, I proceeded with a Random
Forest Classifier, an ensemble method that fits multiple decision trees on various subsets of
the dataset and averages their predictions to enhance performance. I began with a Random
Forest Classifier using default parameters and evaluated its performance using the
Classification Report and Confusion Matrix.

Another useful classification metric I used was the Receiver Operating Characteristic (ROC)
Curve, which illustrates the model's performance across all classification thresholds. The
ROC curve is summarized with the Area Under the Curve (AUC), where a higher AUC
indicates better performance, with 1 being the ideal score. I computed the ROC curve and AUC for the baseline Random Forest Classifier on the test set.
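A sketch of the default Random Forest and its ROC/AUC evaluation:

    import matplotlib.pyplot as plt
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score, roc_curve

    rf = RandomForestClassifier(random_state=42)
    rf.fit(X_train, y_train)

    # ROC needs scores, not hard labels: use the positive-class probability
    y_proba = rf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, y_proba)
    auc = roc_auc_score(y_test, y_proba)

    plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
    plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()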

Random Forest Classifier (hyperparameter tuning with Random Search)


Hyperparameters define how a machine learning model learns, making it crucial to find the
optimal settings to steer the model's learning in the desired direction. Since the F1 score
balances attention between false negatives and false positives, it serves as the scoring metric
for Random Search. Random Search is a hyperparameter tuning algorithm that defines a
search space as a bounded domain of hyperparameter values and randomly samples points
within that domain.
In this process, Random Search performs cross-validation to identify the best hyperparameters. For this task, I set up a 5-fold cross-validation with a Random Forest Classifier as the estimator and the F1 score as the scoring metric.
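The actual search space appears only as an image in the source, so these distributions are assumptions; the sketch shows the RandomizedSearchCV setup described above (5-fold CV, F1 scoring):

    from scipy.stats import randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    param_dist = {
        "n_estimators": randint(100, 1000),
        "max_depth": randint(3, 30),
        "min_samples_split": randint(2, 20),
        "min_samples_leaf": randint(1, 10),
        "max_features": ["sqrt", "log2"],
    }

    rf_search = RandomizedSearchCV(
        RandomForestClassifier(random_state=42),
        param_distributions=param_dist,
        n_iter=50,           # number of sampled configurations (assumed)
        scoring="f1",
        cv=5,
        n_jobs=-1,
        random_state=42,
    )
    rf_search.fit(X_train, y_train)
    print(rf_search.best_params_)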

The Random Search identified a best set of hyperparameters, and I evaluated the best estimator with the classification report, confusion matrix, ROC curve, and AUC.
Extreme Gradient Boosting (default hyperparameters)
Extreme Gradient Boosting (XGBoost) is a state-of-the-art ensemble method known for its
superior performance in various regression and classification tasks. Typically, it uses trees as
base estimators and employs boosting, an iterative approach to optimize performance. The
XGBoost library offers an API compatible with the scikit-learn ecosystem. I trained an XGBClassifier with the scale_pos_weight parameter to address the class imbalance in the dataset and evaluated it with the same metrics as before.
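A minimal sketch; setting scale_pos_weight to the negative/positive ratio is the usual heuristic, though the exact value used in the report isn't shown:

    from sklearn.metrics import classification_report
    from xgboost import XGBClassifier

    # Usual heuristic: ratio of negative to positive training examples
    ratio = (y_train == 0).sum() / (y_train == 1).sum()

    xgb = XGBClassifier(scale_pos_weight=ratio, eval_metric="logloss",
                        random_state=42)
    xgb.fit(X_train, y_train)
    print(classification_report(y_test, xgb.predict(X_test)))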
Extreme Gradient Boosting (hyperparameter tuning with Random Search)
Similar to the Random Forest Classifier, the XGBoost Classifier can benefit significantly from hyperparameter tuning. To optimize the XGBoost model, I conducted another Random Search, again with the F1 score as the scoring metric.
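Again, the actual distributions appear only as an image in the source; a plausible sketch:

    from scipy.stats import randint, uniform
    from sklearn.model_selection import RandomizedSearchCV
    from xgboost import XGBClassifier

    xgb_param_dist = {
        "n_estimators": randint(100, 1000),
        "max_depth": randint(3, 10),
        "learning_rate": uniform(0.01, 0.3),   # samples from [0.01, 0.31]
        "subsample": uniform(0.6, 0.4),        # samples from [0.6, 1.0]
        "colsample_bytree": uniform(0.6, 0.4),
    }

    xgb_search = RandomizedSearchCV(
        XGBClassifier(scale_pos_weight=ratio, eval_metric="logloss"),
        param_distributions=xgb_param_dist,
        n_iter=50, scoring="f1", cv=5, n_jobs=-1, random_state=42,
    )
    xgb_search.fit(X_train, y_train)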
XGB + Synthetic Minority Over-sampling Technique (SMOTE)
SMOTE is a powerful technique used to address class imbalance in machine learning datasets
by generating synthetic samples for the minority class. This helps to balance the dataset and
can lead to better model performance, especially in cases where the target variable is heavily
skewed. In this step, I applied SMOTE to resample the data and then trained an XGBoost Classifier on the resampled dataset.
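A minimal sketch with imbalanced-learn; note that SMOTE is applied to the training split only, so the test set keeps its natural class balance:

    from imblearn.over_sampling import SMOTE
    from sklearn.metrics import confusion_matrix
    from xgboost import XGBClassifier

    smote = SMOTE(random_state=42)
    X_res, y_res = smote.fit_resample(X_train, y_train)

    xgb_smote = XGBClassifier(eval_metric="logloss", random_state=42)
    xgb_smote.fit(X_res, y_res)
    print(confusion_matrix(y_test, xgb_smote.predict(X_test)))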

The evaluation metrics for this base XGB + SMOTE model revealed a striking trade-off, discussed next.
XGB + SMOTE (hyperparameter tuning with Random Search)
The initial XGB + SMOTE model eliminated false negatives entirely, but at the cost of a significant increase in false positives. To address this, I performed hyperparameter tuning with Random Search to find parameters that balance the two error types more effectively.

To optimize the classification performance of the model, I explored different classification thresholds. Initially, I identified the optimal threshold using the precision-recall curve. This threshold maximizes the model's ability to balance precision and recall, which is crucial for keeping both false positives and false negatives low.
To refine performance further, I also explored the optimal classification threshold using the Receiver Operating Characteristic (ROC) curve. This threshold maximizes the true positive rate while minimizing the false positive rate, improving the model's ability to discriminate between the classes.
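A sketch of both threshold searches; the report doesn't state the selection criteria, so maximizing F1 on the precision-recall curve and Youden's J (TPR - FPR) on the ROC curve are assumptions:

    import numpy as np
    from sklearn.metrics import precision_recall_curve, roc_curve

    y_proba = xgb_smote.predict_proba(X_test)[:, 1]

    # Precision-recall curve: threshold that maximizes F1
    prec, rec, pr_thresholds = precision_recall_curve(y_test, y_proba)
    f1 = 2 * prec * rec / (prec + rec + 1e-12)
    best_pr = pr_thresholds[np.argmax(f1[:-1])]

    # ROC curve: threshold that maximizes Youden's J = TPR - FPR
    fpr, tpr, roc_thresholds = roc_curve(y_test, y_proba)
    best_roc = roc_thresholds[np.argmax(tpr - fpr)]

    # Apply a chosen threshold instead of the default 0.5
    y_pred = (y_proba >= best_pr).astype(int)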

Neural Network: Multilayer Perceptron Classifier (MLP)


The Multilayer Perceptron (MLP) Classifier is a fundamental neural network architecture
capable of learning complex relationships in data, particularly useful for solving non-linearly
separable problems. In this project, I employed the MLP Classifier provided by scikit-learn to
explore its performance on the insurance claim prediction task.
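A minimal sketch; the hidden-layer sizes are assumptions, since the report doesn't state the architecture:

    from sklearn.metrics import classification_report
    from sklearn.neural_network import MLPClassifier

    mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300,
                        random_state=42)
    mlp.fit(X_train, y_train)
    print(classification_report(y_test, mlp.predict(X_test)))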
Business Modelling
To determine the optimal model for a car insurance company, it is crucial to consider the financial implications of prediction errors. In this context, false positives and false negatives carry different costs for the company, so a cost function was designed to quantify them:

Cost = CFP × FP + CFN × FN

where FP and FN are the false positive and false negative counts on the test set, and CFP and CFN are their respective unit costs. I then compiled the FP and FN counts for each developed model.

In the context of insurance claims, the cost associated with false negatives (CFN) is typically far higher than that of false positives (CFP). To assess the impact of varying these costs, I explored different ratios of CFN to CFP, ranging from 5 to 50, and plotted the resulting cost function values as a heatmap showing how the ratio influences the overall cost.
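A sketch of the cost computation for one model, expressed in units of CFP (so the ratio CFN/CFP is the only free parameter); y_pred is the thresholded prediction from the earlier sketch:

    from sklearn.metrics import confusion_matrix

    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    for ratio in (5, 10, 20, 50):
        cost = fp + ratio * fn   # Cost = CFP*FP + CFN*FN with CFP = 1
        print(f"CFN/CFP = {ratio}: cost = {cost}")

Repeating this over every model and every ratio produces the heatmap described above.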

• When the CFN is 5 times the CFP, the best models are: (1) Base Logistic Regression, (2) Base KNN, (3) Base Random Forest Classifier.
• When the CFN is 10 times the CFP, the best models are: (1) XGB (Random Search), (2) Base KNN, (3) Base Logistic Regression.
• When the CFN is 20 times the CFP, the best models are: (1) XGB (Random Search), (2) XGB (SMOTE + Random Search 2), (3) XGB (SMOTE).
• When the CFN is 50 times the CFP, the best models are: (1) XGB (SMOTE), (2) XGB (SMOTE + Random Search 1), (3) XGB (Random Search).
Based on these findings, the most probable scenarios, with CFN between 10 and 20 times CFP, point to XGB (Random Search) as the optimal model. This model consistently demonstrates strong performance, most notably the highest Area Under the Curve (AUC) in the Receiver Operating Characteristic (ROC) analysis, showing robustness across various decision thresholds.

Conclusion
The goal of this project was to develop a model for predicting car insurance claims based on
policy, car, and demographic features. Given the nature of insurance claims, where false
negatives incur significantly higher costs than false positives, the focus was on optimizing for
this scenario.
Using a dataset comprising 44 features and 58,592 policy records, I conducted thorough data
wrangling and exploratory data analysis. This process uncovered meaningful patterns and
relationships within the data, guiding subsequent preprocessing steps to prepare for modeling.
I trained and evaluated several models, beginning with three baseline models and progressing
to ensemble methods such as Random Forest and XGBoost, which were further refined
through hyperparameter optimization using Random Search. Additionally, I employed the
SMOTE technique to address class imbalance and explored the performance of a Multilayer
Perceptron Classifier.
Evaluation metrics, particularly ROC and AUC, were used to compare model performance.
The XGBoost Classifier, optimized through Random Search, emerged as the superior model.
It consistently demonstrated the highest Area Under the Curve (AUC) in the ROC analysis,
showcasing robustness across various decision thresholds and effectively minimizing our
targeted cost function.
As a final recommendation, addressing the dataset's class imbalance at the data level would likely enhance model performance further. Additionally, with more computational resources, an exhaustive Grid Search for hyperparameter tuning could yield even more refined results.
Lastly, establishing a precise business cost function with exact ratios between false positives
and false negatives would provide deeper insights for decision-making.
