Vehicle Insurance Prediction Part-2
In this part, I’ll walk through how I reached fifth place on the leaderboard. The steps are:
- Understand the data
- Adversarial validation
- Data Transformation
- Feature Engineering
- Building Machine Learning Model
- Interpretable Machine Learning
Understand the data
In this section, we examine the data to determine the type of each feature.
We have 7 categorical and 11 numeric features.
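As a rough illustration of that check, here is a minimal pandas sketch. The toy frame and its column names are my own stand-ins, not the competition data; the idea is only to show how features can be split by dtype.

```python
import pandas as pd

# Toy frame standing in for the real data (columns and values are hypothetical).
df = pd.DataFrame({
    "AGE": ["16-25", "26-39", "40-64"],
    "GENDER": ["female", "male", "female"],
    "CREDIT_SCORE": [0.63, 0.36, 0.49],
    "ANNUAL_MILEAGE": [12000.0, 16000.0, 11000.0],
})

# Numeric columns are ints/floats; everything else is treated as categorical here.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()

print(df.dtypes)
print(len(categorical_cols), "categorical:", categorical_cols)
print(len(numeric_cols), "numeric:", numeric_cols)
```

Note that a numerically stored code (a postal code, for example) would land in the numeric group with this split, so the result usually needs a manual review.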
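Adversarial validation
Adversarial validation is a simple way to measure how different the training and test data are: label each row by whether it comes from the training or the test set, then train a classifier to tell the two apart. If the ROC-AUC score is around 0.5, the classifier cannot distinguish them, which indicates that the training and test sets follow the same distribution. That is good news for the data scientist, because the local validation score is then more likely to carry over to the leaderboard, and the risk of overfitting to it is reduced.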
You can follow the adversarial validation notebook here
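For readers who want the idea in code, below is a minimal sketch on synthetic data. It is not the notebook's implementation: the random forest, the helper name `adversarial_auc`, and the toy frames are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def adversarial_auc(train_df, test_df, seed=0):
    """ROC-AUC of a classifier trained to tell training rows from test rows."""
    combined = pd.concat([train_df, test_df], ignore_index=True)
    # Adversarial label: 0 = row came from train, 1 = row came from test.
    is_test = np.r_[np.zeros(len(train_df), dtype=int),
                    np.ones(len(test_df), dtype=int)]
    clf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5,
                                 random_state=seed)
    # Out-of-fold probabilities, so the score is not inflated by memorisation.
    proba = cross_val_predict(clf, combined, is_test, cv=5,
                              method="predict_proba")[:, 1]
    return roc_auc_score(is_test, proba)


rng = np.random.default_rng(0)
cols = list("abcd")
train = pd.DataFrame(rng.normal(size=(500, 4)), columns=cols)
test_same = pd.DataFrame(rng.normal(size=(500, 4)), columns=cols)
test_shifted = test_same + 1.0  # deliberately shifted distribution

auc_same = adversarial_auc(train, test_same)
auc_shifted = adversarial_auc(train, test_shifted)
print(f"same distribution:    AUC = {auc_same:.3f}")
print(f"shifted distribution: AUC = {auc_shifted:.3f}")
```

The first score should sit near 0.5 and the second well above it, which is the signal to look for on real data.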
Data Transformation
In this section, I explain how I cleaned the features. The dataset contains a large number of categorical features. Ordinal features have a meaningful ranking; nominal features do not.
Ordinal features are transformed using map encoding, an explicit mapping from each category to its rank.
Nominal features are transformed using label encoding.
Note: There is no free lunch in transforming categorical data; no single encoding works best for every dataset.
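The two encodings can be sketched as follows. The column names, category values, and their assumed order are hypothetical placeholders; pandas category codes stand in for label encoding and assign integers in sorted category order.

```python
import pandas as pd

# Hypothetical columns and values, for illustration only.
df = pd.DataFrame({
    "DRIVING_EXPERIENCE": ["0-9y", "10-19y", "30y+", "20-29y"],
    "VEHICLE_TYPE": ["sedan", "sports car", "sedan", "sedan"],
})

# Ordinal feature: an explicit map keeps the ranking (assumed order, low to high).
experience_order = {"0-9y": 0, "10-19y": 1, "20-29y": 2, "30y+": 3}
df["DRIVING_EXPERIENCE_ENC"] = df["DRIVING_EXPERIENCE"].map(experience_order)

# Nominal feature: label encoding, where the integers are just arbitrary IDs.
df["VEHICLE_TYPE_ENC"] = df["VEHICLE_TYPE"].astype("category").cat.codes

print(df)
```

A category missing from the map becomes NaN after `.map`, which is a handy way to catch unexpected values.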
Feature Engineering
Feature engineering is both an art and a science when it comes to improving machine learning model performance. Three new features were added to the existing ones: count encodings of the postal code and annual mileage features, and a binned version of annual mileage.
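Here is a small sketch of the three features on toy data. The column names, the bin edges, and the use of `am_bin` as the name of the binned column are assumptions on my part; the actual values used in the competition may differ.

```python
import pandas as pd

# Toy data; column names and values are hypothetical.
df = pd.DataFrame({
    "POSTAL_CODE": [10238, 10238, 32765, 10238, 92101],
    "ANNUAL_MILEAGE": [12000.0, 16000.0, 11000.0, 12000.0, 9000.0],
})

# Count encoding: replace each value with how often it occurs in the column.
for col in ["POSTAL_CODE", "ANNUAL_MILEAGE"]:
    df[col + "_count"] = df[col].map(df[col].value_counts())

# Binning: bucket annual mileage into ranges (illustrative edges, right-inclusive).
# labels=False returns the integer index of the bin instead of an interval.
df["am_bin"] = pd.cut(df["ANNUAL_MILEAGE"],
                      bins=[0, 10000, 14000, float("inf")],
                      labels=False)

print(df)
```

In a train/test setting, the counts are often computed on the concatenated data so that both sets share the same encoding.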
Building Machine Learning Model
In this section, a machine learning algorithm is trained to predict the probability of the target feature (OUTCOME), with log_loss as the evaluation metric. The algorithm used is CatBoost, a member of the gradient boosting family of models, like XGBoost and LightGBM.
X = independent features
y = dependent feature (target variable)
Here, 0 means the insurance claim is to be rejected and 1 means it is to be accepted.
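Since log loss is the metric, it helps to see how it behaves. The sketch below is a plain-Python version of binary log loss on made-up labels and probabilities; the function name and the toy numbers are mine, not from the competition.

```python
import math


def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true 0/1 labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


y_true = [0, 1, 1, 0]
confident = binary_log_loss(y_true, [0.1, 0.9, 0.8, 0.2])  # mostly right
uninformed = binary_log_loss(y_true, [0.5] * 4)            # always predicts 0.5

print(f"confident predictions: {confident:.4f}")
print(f"constant 0.5:          {uninformed:.4f}")  # equals ln(2)
```

Lower is better, and a model that always predicts 0.5 scores ln(2), roughly 0.693, which is a useful reference point when reading log loss values.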
To build a model that generalizes better, 5-fold stratified cross-validation is used, and the final prediction is the average of the predictions from the 5 folds.
The log loss on local cross-validation is 0.6810, and the score on the private leaderboard is 0.68076.
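The fold-averaging scheme can be sketched like this. To keep the example self-contained, I use scikit-learn's `GradientBoostingClassifier` on synthetic data as a stand-in for CatBoost; the data, hyperparameters, and seeds are assumptions. `CatBoostClassifier` exposes the same `fit` and `predict_proba` methods, so it could be swapped in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the competition data.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, y_train, X_test = X[:500], y[:500], X[500:]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
oof = np.zeros(len(X_train))        # out-of-fold predictions for local CV
test_pred = np.zeros(len(X_test))   # running average of the fold predictions

for train_idx, valid_idx in skf.split(X_train, y_train):
    # Stand-in model; the post itself uses CatBoost.
    model = GradientBoostingClassifier(random_state=42)
    model.fit(X_train[train_idx], y_train[train_idx])
    oof[valid_idx] = model.predict_proba(X_train[valid_idx])[:, 1]
    # Each fold contributes 1/5 of the final test prediction.
    test_pred += model.predict_proba(X_test)[:, 1] / skf.get_n_splits()

cv_score = log_loss(y_train, oof)
print(f"local CV log loss: {cv_score:.4f}")
```

Stratification keeps the class ratio of the target roughly equal in every fold, which makes the per-fold scores more comparable.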
Feature Importance of the Catboost Model
The most important features are type of vehicle, driving experience, postal code, ID, and the generated feature am_bin.
Interpretable Machine Learning
In this section, we open up the black-box model using Shapley values in order to communicate our findings to business stakeholders.
- A higher level of driving experience pushes the prediction toward accepting the claim.
- When the gender is female, the model leans toward accepting the claim; when it is male, it leans toward rejecting it. (This gender bias in the algorithm must be addressed.)
- Lower annual mileage pushes the prediction toward accepting the claim.
Link — Notebook