Vehicle Insurance Prediction (Part 1)

Raheem Nasirudeen
Nov 15, 2022


In this article, I will share my top-5 solution approach for predicting vehicle insurance claims. This part focuses on data and statistical analysis.

Competition platform image

About the dataset and problem:

The Vehicle Insurance business is a multi-billion dollar industry. Every year millions and millions of premiums are paid, and a huge amount of claims also pile up.

You have to step into the shoes of a data scientist who is building models to help an insurance company understand which claims are worth rejecting and which should be accepted for reimbursement.

You are given a rich dataset consisting of thousands of rows of past records, which you can use to learn more about your customers’ behaviours.

Link —

This article is divided into 2 parts.


In this project, I built a machine learning model that reached a log loss of 0.68076, using the CatBoost algorithm with 5-fold stratified cross-validation, to predict which claims are worth rejecting and which should be accepted for reimbursement by an insurance company. My approach draws on the 4 types of data analytics and follows these steps:


  1. Data Investigation
  2. Exploratory data analysis
  3. Statistical Analysis
  4. Data Transformation
  5. Adversarial validation
  6. Feature Engineering
  7. Machine Learning model

This part covers steps 1–3.

Data Investigation

I started the project by investigating the dataset: understanding its rows and columns, checking for missing values, and checking the data type of each feature. Data types in an incorrect format were changed.

The rows and columns of the dataset.
There are no missing values in the dataset.
The data types of the features.
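
The checks above can be sketched in pandas. This is a minimal illustration on a tiny made-up table, not the competition data: every column name except OUTCOME is hypothetical, and storing CREDIT_SCORE as text is an assumption made only to show a type fix.

```python
import pandas as pd

# Tiny stand-in for the competition data.
# Every column name except OUTCOME is hypothetical.
df = pd.DataFrame({
    "AGE": ["16-25", "26-39", "40-64", "65+"],
    "GENDER": ["male", "female", "male", "female"],
    "CREDIT_SCORE": ["0.62", "0.35", "0.51", "0.74"],  # numbers stored as text (assumed)
    "OUTCOME": [0, 1, 0, 1],
})

n_rows, n_cols = df.shape      # rows and columns of the dataset
missing = df.isnull().sum()    # number of missing values per column
print(n_rows, n_cols)
print(missing)
print(df.dtypes)               # data type of each feature

# Convert a column whose values were stored in the wrong format
df["CREDIT_SCORE"] = pd.to_numeric(df["CREDIT_SCORE"])
```

On the real data, the same three calls (`shape`, `isnull().sum()`, `dtypes`) give the overview described in this section.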

Descriptive Analysis (Exploratory Data Analysis)

In this section, we will use data visualization to explore our dataset. The data will be explored using uni-variate, bi-variate, and multi-variate analysis. Each question follows the same structure: question, visual, observation.

Uni-variate Analysis (Comparing 1 variable at a time)

Question 1

What is the percentage of OUTCOME?


About 57% of the claims are rejected and 42.3% are accepted.
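
Class percentages like these can be computed with `value_counts`. The sketch below uses toy labels, and the 0 = rejected, 1 = accepted encoding is an assumption.

```python
import pandas as pd

# Toy labels only; the real shares come from the competition data.
# Assumed encoding: 0 = rejected, 1 = accepted.
outcome = pd.Series([0, 0, 0, 1, 1], name="OUTCOME")

# normalize=True returns proportions instead of raw counts
outcome_pct = outcome.value_counts(normalize=True).mul(100).round(1)
print(outcome_pct)
```

The same call works for any categorical column, so it also covers the gender, education, and income ratios in the questions that follow.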

Question 2

What is the gender ratio of the clients?


Males account for 62.2% of clients, while females account for 37.8%.

Question 3

Which age group do the majority of the clients belong to?


The majority of the clients are over the age of 40.

Question 4

Driving Experience of Clients


The majority of our clients have 20–29 years of driving experience, while the fewest have 30+ years.

Question 5

Clients’ educational level ratio


44.4% of the clients have a high school diploma, 29.7% have a university diploma, and 25.9% have no education.

Question 6

Clients’ income level ratio.


48.8% of clients are upper class, 23.3% are working class, 14.0% are middle class, and 13.8% are poor.

Question 7

The Most Common Vehicle Types


Sports car > Sedan > HatchBack > SUV

Question 8

The Distribution of our Numeric Features


The distributions appear to be approximately normal.
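
A histogram can be backed up with a quick numeric check. The sketch below is only an illustration: the data is synthetic and CREDIT_SCORE is a hypothetical column name. It uses skewness as a rough symmetry indicator, where values near 0 are consistent with a bell-shaped distribution.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for one numeric feature; CREDIT_SCORE is a hypothetical name.
df = pd.DataFrame({"CREDIT_SCORE": rng.normal(loc=0.5, scale=0.1, size=1000)})

summary = df["CREDIT_SCORE"].describe()  # count, mean, std, quartiles
skewness = df["CREDIT_SCORE"].skew()     # near 0 for a symmetric distribution
print(summary)
print(f"skewness: {skewness:.3f}")
```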

Bi-variate (Comparing 2 variables at a time)

Question 9

What is the gender ratio in terms of outcome?


As a share of all claims, female clients account for 16% (accepted) and 22% (rejected), while male clients account for 26% (accepted) and 36% (rejected).
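
A two-way breakdown like this one can be produced with `pd.crosstab`. The sketch below runs on toy records; GENDER is a hypothetical column name and the OUTCOME labels are assumed.

```python
import pandas as pd

# Toy records; GENDER is a hypothetical column name and the labels are assumed.
df = pd.DataFrame({
    "GENDER": ["female", "female", "male", "male", "male"],
    "OUTCOME": ["accepted", "rejected", "accepted", "rejected", "rejected"],
})

# normalize="all" divides every cell by the grand total,
# so the four cells together sum to 100%
share = pd.crosstab(df["GENDER"], df["OUTCOME"], normalize="all") * 100
print(share.round(1))
```

Passing `normalize="index"` instead gives the acceptance rate within each gender, which is often easier to compare across groups.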

Question 10

What is the age group ratio in relation to the outcome?


The outcome appears to be balanced across the age groups.

Question 11

Outcome based on social and economic status.


Each socioeconomic status has a similar distribution of rejected and accepted claims.

Question 12

Is there a correlation between credit score and annual mileage?


We can see a weak negative correlation between them.
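
The strength of such a relationship can be quantified with the Pearson correlation coefficient. The sketch below uses synthetic data with a weak negative relationship built in. The column names are hypothetical and the numbers are not the competition's.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 500
# Synthetic data with a built-in weak negative relationship.
# Column names are hypothetical.
credit_score = rng.normal(0.5, 0.1, n)
annual_mileage = 12000 - 5000 * credit_score + rng.normal(0, 2500, n)
df = pd.DataFrame({"CREDIT_SCORE": credit_score, "ANNUAL_MILEAGE": annual_mileage})

# Pearson correlation coefficient, always between -1 and 1
r = df["CREDIT_SCORE"].corr(df["ANNUAL_MILEAGE"])
print(f"Pearson r = {r:.2f}")
```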

Statistical Analysis

In this section, we’ll use the chi-square test to examine how the different categorical variables relate to the OUTCOME. The data was transformed first; the transformation is explained in more detail in part 2 of the article.

Note: Using a correlation coefficient to find relationships between categorical variables is a common pitfall in data science.

Create our hypothesis

H0: The feature is independent of the OUTCOME class (no significant association).

H1: The feature is significantly associated with the OUTCOME class.

The data is transformed, and a chi-square test from statsmodels is used.

The Chi-squared decision.


According to our findings, only Gender and Driving Experience are significantly associated with the OUTCOME class, so they should be good categorical predictors.
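
A chi-square test of independence along these lines can be sketched as follows. Note the swap: this sketch uses `scipy.stats.chi2_contingency` rather than the statsmodels routine used in the notebook. The toy data, the column names other than OUTCOME, and the 0.05 significance level are all assumptions.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data: GENDER is constructed to be associated with OUTCOME,
# VEHICLE_TYPE is constructed to be (nearly) independent of it.
# Column names other than OUTCOME are hypothetical.
df = pd.DataFrame({
    "GENDER": ["male"] * 60 + ["female"] * 40,
    "VEHICLE_TYPE": ["sedan", "suv"] * 50,
    "OUTCOME": [1] * 45 + [0] * 15 + [1] * 10 + [0] * 30,
})

alpha = 0.05  # assumed significance level
decisions = {}
for feature in ["GENDER", "VEHICLE_TYPE"]:
    # Contingency table of observed counts: feature levels x outcome classes
    table = pd.crosstab(df[feature], df["OUTCOME"])
    chi2, p_value, dof, expected = chi2_contingency(table)
    decisions[feature] = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"{feature}: chi2={chi2:.2f}, p={p_value:.4f} -> {decisions[feature]}")
```

Rejecting H0 for a feature means its levels are distributed differently across the OUTCOME classes, which is what makes it a candidate categorical predictor.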

You can find the notebook link — Notebook

Ending note

Stay tuned for the 2nd article, where I will show my approach to building the machine learning model, which ranked 5th out of 1008 registered participants. It will cover predictive and prescriptive analysis.




  1. Linkedin
  2. Twitter