Final Report
In the insurance industry, accurately predicting which clients will file claims is crucial. For
car insurance, numerous variables describe each policy case, including car features (make,
model, airbags, fuel type, sensors, age of the car, torque, and power) and policyholder
demographics (age, city, population density).
Predicting if a policy will be claimed involves two types of errors: false negatives (predicting
that the policy won't be claimed when it actually is) and false positives (predicting that the
policy will be claimed when it isn't). Each type of error incurs a cost. In car insurance, false
negatives carry a significantly higher cost than false positives, typically 5 to 50 times
higher, due to the potential for large financial losses and poor risk management. Thus, the
goal is to develop a model that minimizes a cost function based on these error costs.
To achieve this, I performed data wrangling, exploratory data analysis, preprocessing,
modeling, model evaluation, and optimization.
Data Wrangling
In this step, I assessed the data types of each column, identifying integer, float, and
categorical types. Additionally, I encoded all Yes-No values to True-False, as this format is
preferred by scikit-learn. I split the max power and max torque columns, since each combined
two quantities: Nm and rpm for max torque, and bhp and rpm for max power.
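That split can be sketched with a regular-expression extract. The sample values below (e.g. `113Nm@4400rpm`) are hypothetical stand-ins; the exact separator in the real data may differ:

```python
import pandas as pd

# Hypothetical rows mirroring the described value@rpm format.
df = pd.DataFrame({
    "max_torque": ["113Nm@4400rpm", "170Nm@3500rpm"],
    "max_power": ["88.77bhp@6000rpm", "97.89bhp@3600rpm"],
})

# Split each combined column into its two numeric components.
df[["torque_nm", "torque_rpm"]] = (
    df["max_torque"].str.extract(r"([\d.]+)Nm@([\d.]+)rpm").astype(float)
)
df[["power_bhp", "power_rpm"]] = (
    df["max_power"].str.extract(r"([\d.]+)bhp@([\d.]+)rpm").astype(float)
)
print(df[["torque_nm", "torque_rpm", "power_bhp", "power_rpm"]])
```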
I checked for missing values and found none. I also verified the uniqueness of IDs and
checked for duplicates. Fortunately, the dataset was very clean and healthy.
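These sanity checks are one-liners in pandas; a minimal sketch on a toy frame (column names are illustrative, not the dataset's):

```python
import pandas as pd

# Toy frame standing in for the policy data.
df = pd.DataFrame({"policy_id": ["P1", "P2", "P3"], "is_claim": [0, 1, 0]})

assert df.isna().sum().sum() == 0   # no missing values anywhere
assert df["policy_id"].is_unique    # IDs are unique
assert not df.duplicated().any()    # no duplicate rows
print("all checks passed")
```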
I also examined the relationship between the target variable and all the predictors. For
numeric variables, I used boxplots to compare their distributions based on the target variable.
For categorical variables, I used data frames to compare the categories in relation to the target
variable:
I conducted null hypothesis significance tests to evaluate if there was a statistically
significant difference in the proportions when considering the target variable:
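One common form of such a proportion test is a chi-squared test of independence on a contingency table; a sketch with made-up counts (the report's actual tables and p-values are not reproduced here):

```python
import numpy as np
from scipy import stats

# Illustrative 2x2 contingency table: a categorical feature vs. the claim flag.
table = np.array([[4500, 500],    # category A: not claimed / claimed
                  [3800, 700]])   # category B: not claimed / claimed

chi2, p, dof, _ = stats.chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
```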
I also conducted hypothesis tests for the means. I used parametric tests in cases where
variance homogeneity was proven and non-parametric (permutations) tests where the
distributions didn't pass a Levene’s Variance Homogeneity test:
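That decision rule can be sketched with scipy on synthetic groups (in the report, the two groups come from splitting a numeric feature by the target variable):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
claimed = rng.normal(10, 2, 200)        # synthetic "claimed" group
not_claimed = rng.normal(9.5, 2, 800)   # synthetic "not claimed" group

# Levene's test for variance homogeneity decides which mean test to run.
_, p_levene = stats.levene(claimed, not_claimed)
if p_levene > 0.05:
    # Variances look homogeneous: classic two-sample t-test.
    _, p_mean = stats.ttest_ind(claimed, not_claimed)
else:
    # Otherwise fall back to a permutation test on the difference of means.
    res = stats.permutation_test(
        (claimed, not_claimed),
        lambda a, b: np.mean(a) - np.mean(b),
        n_resamples=2000,
    )
    p_mean = res.pvalue
print(f"Levene p={p_levene:.3f}, mean-difference p={p_mean:.3f}")
```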
The EDA revealed influential predictors and correlations among the variables.
Preprocessing
As the dataset was clean, preprocessing was straightforward. I scaled the features, created
dummy variables for the categorical columns with pandas.get_dummies, and split the data
into train and test sets.
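Under those steps, the pipeline looks roughly like this (toy data with illustrative column names; the scaler is fit on the training split only, to avoid leakage):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the policy features.
df = pd.DataFrame({
    "age": [25, 40, 33, 51, 29, 45, 38, 60],
    "fuel": ["Petrol", "Diesel", "CNG", "Petrol", "Diesel", "Petrol", "CNG", "Diesel"],
    "is_claim": [0, 1, 0, 1, 0, 1, 0, 1],
})

X = pd.get_dummies(df.drop(columns="is_claim"))   # one-hot encode categoricals
y = df["is_claim"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```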
Model Evaluation
I evaluated the models using the Classification Report, which includes metrics such as
precision, recall, and F1 score, as well as the Confusion Matrix, which shows the number of
correctly and incorrectly predicted cases for both classes.
Random Forest Classifier (default hyperparameters)
Ensemble algorithms often outperform simple classifiers. Thus, I proceeded with a Random
Forest Classifier, an ensemble method that fits multiple decision trees on various subsets of
the dataset and averages their predictions to enhance performance. I began with a Random
Forest Classifier using default parameters and evaluated its performance using the
Classification Report and Confusion Matrix.
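A minimal sketch of that baseline on synthetic imbalanced data (the real features, class ratio, and scores differ):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data in place of the real policy features.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

clf = RandomForestClassifier(random_state=42)   # default hyperparameters
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```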
Another useful evaluation tool was the Receiver Operating Characteristic (ROC)
Curve, which illustrates the model's performance across all classification thresholds. The
ROC curve is summarized with the Area Under the Curve (AUC), where a higher AUC
indicates better performance, with 1 being the ideal score. The ROC curve and AUC for the
baseline Random Forest Classifier are as follows:
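Computing the curve and its AUC requires probability scores rather than hard class labels; a sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# ROC needs the predicted probability of the positive class.
probs = clf.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)
print(f"AUC = {auc:.3f}")
```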
Random Forest Classifier (hyperparameter tuning with Random Search)
The best hyperparameters identified through the Random Search process, along with the
classification report, confusion matrix, ROC curve, and AUC for the best estimator, are
presented below:
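A Random Search of this kind can be sketched with scikit-learn's RandomizedSearchCV; the distributions below are illustrative, not the ones actually used in the report:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

# Illustrative search space; the report's real distributions may differ.
param_dist = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(3, 20),
    "min_samples_leaf": randint(1, 10),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_dist, n_iter=5, scoring="f1", cv=3, random_state=0, n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_)
```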
Extreme Gradient Boosting (default hyperparameters)
Extreme Gradient Boosting (XGBoost) is a state-of-the-art ensemble method known for its
superior performance in various regression and classification tasks. Typically, it uses trees as
base estimators and employs boosting, an iterative approach to optimize performance. The
XGBoost library offers an API compatible with the scikit-learn environment. I trained an
XGBClassifier with the scale_pos_weight parameter to address class imbalance in the
dataset. The evaluation metrics for this model are as follows:
Extreme Gradient Boosting (hyperparameter tuning with Random Search)
Similar to the Random Forest Classifier, the XGBoost Classifier can significantly benefit
from hyperparameter tuning. To optimize the performance of the XGBoost model, I
conducted a Random Search for hyperparameter tuning. The scoring metric used was the F1
score. The following outlines the parameter distributions and results:
XGB + Synthetic Minority Over-sampling Technique (SMOTE)
SMOTE is a powerful technique used to address class imbalance in machine learning datasets
by generating synthetic samples for the minority class. This helps to balance the dataset and
can lead to better model performance, especially in cases where the target variable is heavily
skewed. In this step, I applied SMOTE to resample the data and then trained an XGBoost
Classifier on the resampled dataset:
The metrics for this base XGB with SMOTE are the following:
XGB + SMOTE (hyperparameter tuning with Random Search)
The initial XGB + SMOTE model resulted in an extreme reduction of false negatives,
achieving zero false negatives. However, this also led to a significant increase in false
positives. To address this, I performed hyperparameter tuning using Random Search to find
the optimal parameters that balance both false positives and false negatives more effectively.
In the context of insurance claims, the cost associated with false negatives (CFN) is typically
significantly higher than that of false positives (CFP). To assess the impact of varying these
costs, I explored different ratios of CFN to CFP, ranging from 5 to 50. The results of the cost
function, represented as a heatmap, illustrate how different ratios influence the overall cost in
this scenario:
• When the CFN is 5 times the CFP, the best models are: (1) Base Logistic Regression, (2)
Base KNN, (3) Base Random Forest Classifier.
• When the CFN is 10 times the CFP, the best models are: (1) XGB (Random Search), (2)
Base KNN, (3) Base Logistic Regression.
• When the CFN is 20 times the CFP, the best models are: (1) XGB (Random Search), (2)
XGB (SMOTE + Random Search 2), (3) XGB (SMOTE).
• When the CFN is 50 times the CFP, the best models are: (1) XGB (SMOTE), (2) XGB
(SMOTE + Random Search 1), (3) XGB (Random Search).
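The cost comparison itself reduces to CFN x FN + CFP x FP per model. A sketch with entirely made-up confusion-matrix counts, chosen only to illustrate how the ranking can flip as the ratio grows:

```python
# Hypothetical (FN, FP) counts per model; not the report's actual numbers.
models = {
    "Base LogReg": (120, 900),
    "XGB (Random Search)": (60, 1500),
    "XGB (SMOTE)": (5, 4000),
}
ratios = [5, 10, 20, 50]   # CFN expressed as a multiple of CFP (CFP = 1)

# Total cost = CFN * FN + CFP * FP, evaluated for each ratio.
for r in ratios:
    costs = {name: r * fn + fp for name, (fn, fp) in models.items()}
    best = min(costs, key=costs.get)
    print(f"CFN/CFP = {r:2d}: best = {best}")
```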
Based on these findings, the most probable scenarios, with CFN values between 10 and 20
times the CFP, suggest XGB (Random Search) as the optimal model. This model
consistently demonstrates strong performance, particularly highlighted by its highest Area
Under the Curve (AUC) in the Receiver Operating Characteristic (ROC) analysis,
showcasing robustness across various decision thresholds.
Conclusion
The goal of this project was to develop a model for predicting car insurance claims based on
policy, car, and demographic features. Given the nature of insurance claims, where false
negatives incur significantly higher costs than false positives, the focus was on optimizing for
this scenario.
Using a dataset comprising 44 features and 58,592 policy records, I conducted thorough data
wrangling and exploratory data analysis. This process uncovered meaningful patterns and
relationships within the data, guiding subsequent preprocessing steps to prepare for modeling.
I trained and evaluated several models, beginning with three baseline models and progressing
to ensemble methods such as Random Forest and XGBoost, which were further refined
through hyperparameter optimization using Random Search. Additionally, I employed the
SMOTE technique to address class imbalance and explored the performance of a Multilayer
Perceptron Classifier.
Evaluation metrics, particularly ROC and AUC, were used to compare model performance.
The XGBoost Classifier, optimized through Random Search, emerged as the superior model.
It consistently demonstrated the highest Area Under the Curve (AUC) in the ROC analysis,
showcasing robustness across various decision thresholds and effectively minimizing our
targeted cost function.
As a final recommendation, improving the dataset's class imbalance would likely enhance
model performance further. Additionally, with increased computational resources, conducting
a Complete Grid Search for hyperparameter tuning could yield even more refined results.
Lastly, establishing a precise business cost function with exact ratios between false positives
and false negatives would provide deeper insights for decision-making.