Vehicle Insurance Prediction Part-2

raheem nasirudeen
4 min read · Nov 25, 2022
Top 5 on the Leaderboard

In this part, I’ll walk through how I reached fifth place on the leaderboard.

Read Part-1


  1. Understand the data
  2. Adversarial Validation
  3. Data Transformation
  4. Feature Engineering
  5. Building Machine Learning Model
  6. Interpretable Machine Learning

Understand the data

In this section, we examine the dataset to determine the data type of each feature.

We have 7 categorical and 11 numeric features.

Adversarial validation

This method makes it simple to measure differences between the training and test data. A classifier is trained to tell training rows apart from test rows. If its ROC-AUC score is around 0.5, the classifier cannot separate them, which indicates that the training and test sets follow the same distribution. That is good news for the data scientist, because the risk of overfitting to the leaderboard is reduced.

The score is approximately 0.5
Happy mood

You can follow the adversarial validation notebook here
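
As a rough illustration of the idea (separate from the linked notebook), here is a minimal sketch of adversarial validation. The data is synthetic, the column names `f1` to `f3` are made up, and scikit-learn's `GradientBoostingClassifier` is only a stand-in for whatever classifier you prefer.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


def adversarial_auc(train_df, test_df, seed=0):
    """Cross-validated ROC-AUC of a classifier separating train rows from test rows."""
    combined = pd.concat([train_df, test_df], ignore_index=True)
    # Label 0 for training rows, 1 for test rows.
    is_test = np.r_[np.zeros(len(train_df), dtype=int),
                    np.ones(len(test_df), dtype=int)]
    clf = GradientBoostingClassifier(random_state=seed)
    scores = cross_val_score(clf, combined, is_test, cv=5, scoring="roc_auc")
    return scores.mean()


# Synthetic stand-in data: both frames come from the same distribution.
rng = np.random.default_rng(0)
cols = ["f1", "f2", "f3"]  # hypothetical feature names
train_df = pd.DataFrame(rng.normal(size=(500, 3)), columns=cols)
test_df = pd.DataFrame(rng.normal(size=(500, 3)), columns=cols)

auc = adversarial_auc(train_df, test_df)
print(f"adversarial ROC-AUC: {auc:.3f}")
```

A score well above 0.5 would mean the classifier can spot test rows, so the two sets differ and local validation may not reflect the leaderboard.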

Data Transformation

Data Transformation Reality

In this section, I will explain how I cleaned the features. The dataset contains many categorical features. Ordinal features have a meaningful ranking; nominal features do not.

Ordinal features are transformed using map encoding techniques.

Map encoding techniques
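
A minimal sketch of map encoding with pandas follows. The column name `DRIVING_EXPERIENCE` and the category labels are assumptions made for illustration; they are not taken from the competition data.

```python
import pandas as pd

# Hypothetical column name and category labels, for illustration only.
df = pd.DataFrame({"DRIVING_EXPERIENCE": ["0-9y", "20-29y", "10-19y", "30y+"]})

# The dictionary spells out the ranking explicitly, lowest to highest.
experience_order = {"0-9y": 0, "10-19y": 1, "20-29y": 2, "30y+": 3}
df["DRIVING_EXPERIENCE"] = df["DRIVING_EXPERIENCE"].map(experience_order)

print(df)
```

Writing the mapping by hand keeps the order under your control, which is the point of treating a feature as ordinal.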

Nominal features are transformed using label encoding.

Label Encoding Techniques
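
Label encoding can be sketched with scikit-learn's `LabelEncoder`. The column name `VEHICLE_TYPE` and its values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical column name and values, for illustration only.
df = pd.DataFrame({"VEHICLE_TYPE": ["sedan", "sports car", "sedan"]})

encoder = LabelEncoder()
# Each distinct category gets an integer code; the codes carry no ranking.
df["VEHICLE_TYPE"] = encoder.fit_transform(df["VEHICLE_TYPE"])

print(df)
print(list(encoder.classes_))
```

The fitted `classes_` attribute records which category each integer stands for, so the encoding can be reversed later with `inverse_transform`.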

Note: There is no free lunch when transforming categorical data; no single encoding method works best in every case.

Feature Engineering

Feature engineering is both an art and a science when it comes to improving machine learning model performance. Three new features were added to the existing ones: count encodings of the postal code and annual mileage features, and a binned version of the annual mileage feature.

Feature engineering
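
The three features could be built along these lines. This is a sketch on made-up rows: the column names `POSTAL_CODE` and `ANNUAL_MILEAGE`, the names of the two count features, and the choice of three equal-width bins are all assumptions; only the name `am_bin` comes from this article.

```python
import pandas as pd

# Made-up rows with assumed column names.
df = pd.DataFrame({
    "POSTAL_CODE": [10238, 10238, 32765, 10238, 92101],
    "ANNUAL_MILEAGE": [12000.0, 16000.0, 11000.0, 11000.0, 12000.0],
})

# Count encoding: replace each value with how often it occurs in the column.
df["postal_code_count"] = df["POSTAL_CODE"].map(df["POSTAL_CODE"].value_counts())
df["annual_mileage_count"] = df["ANNUAL_MILEAGE"].map(
    df["ANNUAL_MILEAGE"].value_counts()
)

# Binning: cut annual mileage into equal-width intervals (3 bins is an assumption).
df["am_bin"] = pd.cut(df["ANNUAL_MILEAGE"], bins=3, labels=False)

print(df)
```

`pd.qcut` is an alternative when bins with roughly equal numbers of rows are preferred over equal-width bins.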

Machine Learning

In this section, a machine learning algorithm is trained to predict the probability of the target feature (OUTCOME), with log loss as the evaluation metric. The algorithm used is CatBoost, a member of the gradient boosting family, like XGBoost and LightGBM.
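
To make the metric concrete, here is a small sketch that computes log loss by hand and checks it against scikit-learn's `log_loss`. The labels and probabilities are made up.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of class 1.
y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.8, 0.6, 0.3])

# Binary log loss: the mean negative log-likelihood of the true labels.
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(manual, log_loss(y_true, y_prob))

# Predicting 0.5 for every row gives a log loss of ln(2), a useful reference point.
baseline = log_loss(y_true, np.full(len(y_true), 0.5))
print(baseline, np.log(2))
```

Lower is better, and confident wrong predictions are penalized heavily.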

X = independent features

y = dependent feature (target variable)

Here, 0 means the insurance claim is rejected and 1 means the insurance claim is accepted.

To build a model that generalizes better, 5-fold stratified K-fold cross-validation is used, and the final prediction is the average of the predictions from the 5 folds.

Cross validation implementation
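
The fold-averaging scheme can be sketched as below. To keep the example self-contained it uses synthetic data and scikit-learn's `GradientBoostingClassifier` as a stand-in for CatBoost; the seeds and variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the real X, y and test features.
X_all, y_all = make_classification(n_samples=600, n_features=8, random_state=0)
X, y, X_test = X_all[:500], y_all[:500], X_all[500:]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(X))             # out-of-fold predictions for local scoring
test_pred = np.zeros(len(X_test))  # running average of the fold predictions

for train_idx, valid_idx in skf.split(X, y):
    # Stand-in model; a CatBoost classifier would slot in here instead.
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])
    oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    # Each fold contributes one fifth of the final test prediction.
    test_pred += model.predict_proba(X_test)[:, 1] / skf.get_n_splits()

cv_score = log_loss(y, oof)
print(f"local CV log loss: {cv_score:.4f}")
```

Stratification keeps the share of each class roughly equal across folds, so every fold is scored on a comparable mix of outcomes.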

The log loss on local cross-validation is 0.6810, and the score on the private leaderboard is 0.68076.

Feature Importance of the Catboost Model

The most important features are Type of vehicle, Driving experience, Postal code, ID, and the generated feature am_bin.

Submission file generated

Interpretable Machine Learning

In this section, we open up the black box model using Shapley values in order to communicate our findings to the business stakeholders.

Shapley values
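
To show what a Shapley value measures, here is a brute-force sketch that follows the definition directly on a toy linear model. It is only an illustration: the model, weights, and baseline are made up, and enumerating every feature subset is feasible only for a handful of features, so dedicated libraries are the practical choice for real models.

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shapley(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline row."""
    n = len(x)
    phi = np.zeros(n)

    def value(subset):
        # Features in `subset` take their value from x, the rest from baseline.
        z = baseline.copy()
        for j in subset:
            z[j] = x[j]
        return predict(z)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi


# Toy linear model with made-up weights.
weights = np.array([2.0, -1.0, 0.5])


def predict(z):
    return float(weights @ z + 3.0)


x = np.array([1.0, 4.0, -2.0])
baseline = np.zeros(3)
phi = exact_shapley(predict, x, baseline)
print(phi)
```

The contributions always add up to the difference between the prediction for `x` and the prediction for the baseline, which is what makes them convenient for explaining a single prediction.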

Decision making

  1. A higher level of driving experience pushes the prediction toward accepting the claim.
  2. When the gender is female, the model leans toward accepting the claim; when it is male, it leans toward rejecting it. (This gender bias in the algorithm must be addressed.)
  3. Lower annual mileage pushes the prediction toward accepting the claim.

Link — Notebook



