Vehicle Insurance Prediction Part-2

raheem nasirudeen

4 min read · Nov 25, 2022

In this part, I'll walk through how I reached fifth place on the leaderboard.

Read Part-1

Methodology

Understand the data
Adversarial validation
Data Transformation
Feature Engineering
Building Machine Learning Model
Interpretable Machine Learning

Understand the data

In this section, we examine the data to determine the type of each feature.

We have 7 categorical and 11 numeric features.

Adversarial validation

Adversarial validation is a simple way to measure how different the training and test data are. A classifier is trained to tell training rows from test rows; if its ROC-AUC score is around 0.5, it cannot distinguish them, which indicates that the two sets follow the same distribution. That is good news for the data scientist, because the local validation score is then more likely to carry over to the leaderboard and the risk of overfitting to the leaderboard is reduced.
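The idea can be sketched as follows. This is a minimal illustration on synthetic data, assuming scikit-learn is available; the column names, the random forest, and the helper `adversarial_auc` are illustrative choices and not the contents of the linked notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def adversarial_auc(train_df, test_df, seed=0):
    """ROC-AUC of a classifier that tries to tell train rows from test rows."""
    combined = pd.concat([train_df, test_df], ignore_index=True)
    # Label 0 = row came from the training set, 1 = row came from the test set.
    is_test = np.r_[np.zeros(len(train_df), dtype=int),
                    np.ones(len(test_df), dtype=int)]
    clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                                 random_state=seed)
    # Out-of-fold probabilities, so the score is not inflated by memorisation.
    proba = cross_val_predict(clf, combined, is_test, cv=5,
                              method="predict_proba")[:, 1]
    return roc_auc_score(is_test, proba)


# Synthetic stand-ins for the real train/test tables (hypothetical columns).
rng = np.random.default_rng(0)
cols = ["feat_a", "feat_b", "feat_c"]
train = pd.DataFrame(rng.normal(size=(500, 3)), columns=cols)
test_same = pd.DataFrame(rng.normal(size=(500, 3)), columns=cols)
test_shifted = test_same.copy()
test_shifted["feat_a"] += 3.0  # simulate a distribution shift in one feature

auc_same = adversarial_auc(train, test_same)
auc_shifted = adversarial_auc(train, test_shifted)
print(f"same distribution: {auc_same:.3f}")
print(f"shifted test set:  {auc_shifted:.3f}")
```

A score close to 0.5 on the first pair and close to 1.0 on the shifted pair is the pattern to look for; with the real data, the target column would be dropped from the training table before combining.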

You can follow the adversarial validation notebook here

Data Transformation

In this section, I explain how I cleaned the features. The dataset contains a large number of categorical features. Ordinal features have a meaningful ranking; nominal features do not.

Ordinal features are transformed using map encoding, in which each category is mapped to an integer that reflects its rank.

Nominal features are transformed using label encoding.

Note: There is no free lunch in transforming categorical data; no single encoding method works best in every case.
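The two encodings described above might look like this in pandas and scikit-learn. The column names and category values here are made up for illustration and are assumptions about the dataset, not taken from it.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with one ordinal and one nominal column (hypothetical values).
df = pd.DataFrame({
    "DRIVING_EXPERIENCE": ["0-9y", "10-19y", "20-29y", "0-9y"],
    "TYPE_OF_VEHICLE": ["Sedan", "SUV", "Sedan", "Truck"],
})

# Map encoding: the integers are chosen by hand so they follow the ranking.
experience_order = {"0-9y": 0, "10-19y": 1, "20-29y": 2}
df["DRIVING_EXPERIENCE_ENC"] = df["DRIVING_EXPERIENCE"].map(experience_order)

# Label encoding: integers are assigned in sorted order of the category
# names, so they carry no ranking information.
df["TYPE_OF_VEHICLE_ENC"] = LabelEncoder().fit_transform(df["TYPE_OF_VEHICLE"])

print(df)
```

Any category missing from the mapping dictionary becomes NaN after `.map`, which is a useful check that the dictionary covers every value.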

Feature Engineering

Feature engineering is both an art and a science when it comes to improving machine learning model performance. Three new features were added to the existing ones: count encodings of the postal code and annual mileage features, and a binned version of the annual mileage feature.
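A rough sketch of these three features on a toy frame is shown below. The column names `POSTAL_CODE` and `ANNUAL_MILEAGE`, the sample values, and the choice of three equal-width bins are assumptions; only the name `am_bin` comes from the feature importance results later in this post.

```python
import pandas as pd

# Toy data standing in for the real table (hypothetical values).
df = pd.DataFrame({
    "POSTAL_CODE": [10238, 10238, 32765, 92101, 10238],
    "ANNUAL_MILEAGE": [12000.0, 16000.0, 11000.0, 11000.0, 9000.0],
})

# Count encoding: replace each value by how often it occurs in the column.
for col in ["POSTAL_CODE", "ANNUAL_MILEAGE"]:
    df[f"{col}_count"] = df[col].map(df[col].value_counts())

# Binning: cut annual mileage into equal-width intervals and keep the
# integer bin index (the number of bins is an assumption).
df["am_bin"] = pd.cut(df["ANNUAL_MILEAGE"], bins=3, labels=False)

print(df)
```

Whether the counts are computed on the training set alone or on the training and test sets combined is a design choice the post does not state.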

Machine Learning

In this section, a machine learning algorithm is used to predict the likelihood of the target feature (OUTCOME), with log_loss as the evaluation metric. The algorithm used is CatBoost, a member of the gradient boosting family of models, like XGBoost and LightGBM.

X = independent features

y = dependent feature (target variable)

where 0 means the insurance claim has to be rejected and 1 means the insurance claim has to be accepted.

To create a more generalized machine learning model, 5-fold stratified K-fold cross-validation is used, and the final prediction is the average of the predictions from the 5 folds.
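The fold-averaging scheme can be sketched as below. To keep the example self-contained, it uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for CatBoost; in the real pipeline a `catboost.CatBoostClassifier` would take its place. The hyperparameters and seeds are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the competition data.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, y_train, X_test = X[:500], y[:500], X[500:]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(X_train))        # out-of-fold predictions for local CV
test_pred = np.zeros(len(X_test))   # running average over the 5 fold models

for train_idx, valid_idx in skf.split(X_train, y_train):
    # Stand-in model; swap in CatBoost here for the actual approach.
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_train[train_idx], y_train[train_idx])
    # Each training row is predicted by the one model that never saw it.
    oof[valid_idx] = model.predict_proba(X_train[valid_idx])[:, 1]
    # Each fold model contributes one fifth of the final test prediction.
    test_pred += model.predict_proba(X_test)[:, 1] / skf.get_n_splits()

cv_score = log_loss(y_train, oof)
print(f"local CV log loss: {cv_score:.4f}")
```

Stratification keeps the share of each OUTCOME class roughly equal across folds, which makes the per-fold scores easier to compare.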

The log loss score on local cross-validation is 0.6810, and the score on the private leaderboard is 0.68076.
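For reference, the standard definition of binary log loss is given below, where $N$ is the number of rows, $y_i$ is the true label (0 or 1), and $p_i$ is the predicted probability of class 1. Lower is better; a model that always predicts 0.5 scores $\ln 2 \approx 0.693$.

```latex
\mathrm{LogLoss} = -\frac{1}{N} \sum_{i=1}^{N}
  \left[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\right]
```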

Feature Importance of the Catboost Model

The most important features are type of vehicle, driving experience, postal code, ID, and a generated feature, am_bin.

Interpretable Machine Learning

In this section, we open up the black-box model using Shapley values in order to communicate our findings to business stakeholders.
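To show what a Shapley value measures, here is a small exact computation on a toy scoring function. Everything in it is hypothetical: the function `toy_model`, its coefficients, and the feature values are invented for illustration, and features outside a coalition are replaced by a baseline row, which is one common convention. In practice a library such as `shap` would typically be used on the trained model instead of this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline row.

    Features that are not in a coalition take their baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_model(row):
    """Hypothetical acceptance score, not the trained CatBoost model."""
    experience, mileage, is_female = row
    return 0.2 + 0.03 * experience - 0.00001 * mileage + 0.1 * is_female


x = [20, 8000, 1]          # client being explained
baseline = [10, 12000, 0]  # reference client
phi = shapley_values(toy_model, x, baseline)
print(dict(zip(["experience", "mileage", "is_female"], phi)))
```

The contributions add up to the difference between the score for `x` and the score for `baseline`, which is the property that makes Shapley values convenient for explaining a single prediction to stakeholders.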

Decision making

A higher level of driving experience pushes the model toward accepting the claim.
When gender is female, the model leans toward accepting the claim; when it is male, toward rejecting it. (This gender bias in the algorithm must be addressed.)
Lower annual mileage pushes the model toward accepting the claim.

Link — Notebook

References

Contact
