1st Position
Data Science Nigeria 2019 Challenge #1: Insurance Prediction

raheem nasirudeen
5 min readNov 25, 2019
Celebration with my mentor Mr Ahmed Olanrewaju and the AI Saturday facilitators and members.

A 24-hour data hackathon: the Data Science Nigeria AXAMansard Insurance Prediction Hackathon, hosted on the zindi.africa platform.

My first submission placed me in 1st position among 40 participants on the leaderboard, which gave me a good baseline for the hackathon, and I was able to hold that position from start to finish (The Invisible).

Before going into my 1st-place winning solution approach, a little about me: I'm an active competitive data scientist who has taken part in 25 previous data hackathons across platforms like zindi.africa, kaggle.com, https://www.analyticsvidhya.com/ and machinehack.com. This has let me learn from a wide variety of datasets and stay hands-on for 10 consecutive months.

The dataset is very dirty, which raises questions about the right approach to cleaning it. I had already adapted to such data on the machinehack.com platform, where the datasets come dirty in most of the hackathons they release.

Getting to my approach

The objective of the hackathon is to predict the probability of having at least one claim over the insured period of a building. The dataset consists of 7,160 training rows with 14 columns and 3,069 test rows with 13 columns. As most winners will tell you, do your exploratory data analysis (EDA) well before diving into the features; it exposed me more to the dataset. The key point is that we are predicting probabilities, not hard (0, 1) labels.

Data Cleaning

Most categorical features, like Building_Painted, Building_Fenced, Garden and Settlement, are converted to 0 and 1 using map encoding.
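A minimal sketch of that map encoding; the single-letter category labels in the dictionaries are assumptions about the competition data, so adjust them to what you see in your copy:

```python
import pandas as pd

# Sketch of the 0/1 map encoding described above.
# The category labels ('N', 'V', 'O', 'R', 'U') are assumed, not confirmed.
binary_maps = {
    'Building_Painted': {'N': 1, 'V': 0},
    'Building_Fenced':  {'N': 1, 'V': 0},
    'Garden':           {'V': 1, 'O': 0},
    'Settlement':       {'R': 1, 'U': 0},
}

def map_encode(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col, mapping in binary_maps.items():
        df[col] = df[col].map(mapping)  # unseen labels become NaN
    return df
```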

For the missing values: I filled Building Dimension with the median of its distribution and the other missing values with -1.
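A sketch of that imputation step; the column name is written here as 'Building Dimension' and the median is learnt from the training set only (match the name to your copy of the data):

```python
import pandas as pd

# Median imputation for Building Dimension, sentinel -1 for everything else.
def fill_missing(train: pd.DataFrame, test: pd.DataFrame):
    median_dim = train['Building Dimension'].median()  # learn on train only
    for df in (train, test):
        df['Building Dimension'] = df['Building Dimension'].fillna(median_dim)
        df.fillna(-1, inplace=True)  # sentinel value for the remaining gaps
    return train, test
```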

Concerning NumberOfWindows, I looked at the '.' value, which seems dirty. In a real-world sense the number of windows does not carry a meaningful ranking; it is nominal data (like the numbers on players' shirts). So I decided not to touch the '.' value and used one-hot encoding (dummies).

The Geocode feature should come as numerical values, but some string values make it an object feature, which is confusing, so the various encoding techniques come into play. With one-hot encoding the columns would become very large, and many values appear in the train set but not in the test set. Label encoding can be just as confusing, since it assigns codes in order of appearance, which might seriously distort the distribution. So I used my favourite encoding technique for such problems, frequency encoding, which replaces each Geocode value with its value count (how many times it appears in the rows).
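A sketch of both encodings, assuming the Geocode column is named Geo_Code in your copy of the data:

```python
import pandas as pd

# One-hot encoding for NumberOfWindows (the '.' value stays as its own level)
# and frequency encoding for Geo_Code, computed over train + test combined.
def encode_windows_and_geocode(train: pd.DataFrame, test: pd.DataFrame):
    full = pd.concat([train, test], ignore_index=True, sort=False)

    # Dummy columns treat '.' as just another nominal category.
    windows = pd.get_dummies(full['NumberOfWindows'], prefix='windows')

    # Replace each geocode with how many times it appears in the rows.
    freq = full['Geo_Code'].value_counts()
    full['Geo_Code'] = full['Geo_Code'].map(freq)

    full = pd.concat([full.drop(columns=['NumberOfWindows']), windows], axis=1)
    return full.iloc[:len(train)].copy(), full.iloc[len(train):].copy()
```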

Feature Engineering

During a class at the boot camp, Dr Emmanuel Doro, chief data scientist at Walmart, kept telling those who attended that what differentiates a struggling data scientist from a successful one is proper feature engineering. Feature engineering is more of an art than a science, and it mostly requires domain knowledge of the dataset. One intuition I have seen: imagine having 0/1 features (driving licence, national identification card, voter's card), with yes == 1 and no == 0. The three columns can be summed into a total-credentials feature, which I learnt from a previous hackathon. The same goes for Building_Painted and Building_Fenced: some buildings are painted but not fenced, some the reverse, and some are both painted and fenced. Basically, this is what we call feature interaction: adding related columns together (see the sketch below).
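A sketch of that feature-interaction idea; the credentials example is hypothetical (not in this dataset), and the painted-plus-fenced sum assumes the two flags have already been map-encoded to 0/1:

```python
import pandas as pd

# Feature interaction by summing related 0/1 columns.
def add_interactions(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Hypothetical illustration of the credentials example:
    # df['total_credentials'] = df[['driving_license', 'national_id', 'voters_card']].sum(axis=1)

    # Interaction used here: 0 = neither, 1 = one of the two, 2 = both.
    df['painted_plus_fenced'] = df['Building_Painted'] + df['Building_Fenced']
    return df
```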

Lastly, I googled the meaning of the insured period, which comes as a floating-point number, and from my little research I came up with a feature called month.
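The post does not give the exact formula, but a plausible sketch, assuming Insured_Period is the insured fraction of a year, is:

```python
# Hedged sketch of the 'month' feature: treating Insured_Period as a fraction
# of a year, multiplying by 12 gives an approximate number of covered months.
def add_month_feature(df):
    df = df.copy()
    df['month'] = (df['Insured_Period'] * 12).round().astype(int)
    return df
```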

Model Building

CatBoost is my favourite algorithm, and it is what I used.
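A minimal CatBoost setup in that spirit; the hyperparameters are illustrative defaults, not the winning configuration:

```python
from catboost import CatBoostClassifier

# Fit a CatBoost classifier and return validation-set probabilities for AUC.
def fit_catboost(X_train, y_train, X_valid, y_valid):
    model = CatBoostClassifier(
        iterations=1000,
        learning_rate=0.05,
        eval_metric='AUC',
        random_seed=42,
        verbose=200,
    )
    model.fit(X_train, y_train, eval_set=(X_valid, y_valid), use_best_model=True)
    return model, model.predict_proba(X_valid)[:, 1]  # probabilities, not labels
```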

Validation Strategy

In many data science hackathons the leaderboard shakes up on the final day (fallen angels). I was careful even though there was no split of the test set on the leaderboard. I tried different validation strategies that perform better than a hold-out split (a normal train/test split), such as K-Fold and Stratified K-Fold. I chose to rely on Stratified K-Fold because its validation score was closest to the leaderboard score.
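A sketch of that Stratified K-Fold scheme: the out-of-fold AUC is the score compared against the leaderboard, and test predictions are averaged across folds (fold count and model settings are illustrative; X and y are assumed to be a pandas DataFrame and Series):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier

def stratified_cv(X, y, X_test, n_splits=5, seed=42):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    oof = np.zeros(len(X))            # out-of-fold probabilities
    test_preds = np.zeros(len(X_test))
    for train_idx, valid_idx in skf.split(X, y):
        model = CatBoostClassifier(iterations=1000, learning_rate=0.05,
                                   eval_metric='AUC', random_seed=seed, verbose=0)
        model.fit(X.iloc[train_idx], y.iloc[train_idx],
                  eval_set=(X.iloc[valid_idx], y.iloc[valid_idx]),
                  use_best_model=True)
        oof[valid_idx] = model.predict_proba(X.iloc[valid_idx])[:, 1]
        test_preds += model.predict_proba(X_test)[:, 1] / n_splits
    print('OOF AUC:', roc_auc_score(y, oof))  # compare this to the leaderboard
    return oof, test_preds
```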

Ensemble tricks

I used a weighted-average ensemble because it improved my score a little. Note: you must have a single model with a good score before you ensemble it.
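A sketch of the weighted average; the 0.7/0.3 weights are illustrative and should be tuned against your validation score:

```python
import numpy as np

# Blend two sets of predicted probabilities with fixed weights.
def weighted_average(preds_a, preds_b, w_a=0.7, w_b=0.3):
    assert abs(w_a + w_b - 1.0) < 1e-9, 'weights should sum to 1'
    return w_a * np.asarray(preds_a) + w_b * np.asarray(preds_b)
```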

Some Tips to Get Started

  1. Perform exploratory data analysis (EDA)
  2. The Area Under Curve (AUC) metric scores predicted probabilities (model.predict_proba()[:,1]); see the sketch after this list
  3. The F1-score metric is for hard (0, 1) predictions
  4. Always find a good baseline
  5. Never stick to one encoding method (always try a few to see which works best)
  6. Feature engineering and feature selection are very important
  7. Domain expertise can go a long way towards building better features
  8. Always start with a small set of features
  9. Ensemble tricks can also improve your score enough to win
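To illustrate tips 2 and 3, a small sketch (model, X_valid and y_valid are assumed to come from the earlier CatBoost sketch):

```python
from sklearn.metrics import roc_auc_score, f1_score

# AUC is computed on predicted probabilities; F1-score needs hard 0/1 labels.
def score_predictions(model, X_valid, y_valid, threshold=0.5):
    probs = model.predict_proba(X_valid)[:, 1]      # probabilities -> AUC
    labels = (probs >= threshold).astype(int)       # hard labels -> F1
    return roc_auc_score(y_valid, probs), f1_score(y_valid, labels)
```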

Note: My 24-hour Invisible 1st position is dedicated to Allah. It came from many failed experiments in past data hackathons and from dedication. Consistency is the key.

Raising One Million AI Talent in 10 years

Many communities have grown a lot in the last few years thanks to their members' selfless contributions. If the top winners in the Kaggle community did not share their solution approaches and code from previously hosted competitions, many data scientists would not have reached their potential.

We have to be selfless in order to grow rapidly and to see the dream and vision of Raising One Million AI Talent in 10 years come true.

My GitHub repository link: github.com/nasirudeenraheem/-Data-Science-Nigeria-2019-Challenge-1-Insurance-Prediction-AXAMansard-Insurance-Hackathon

Note: Different conda and CatBoost versions might not give exactly the same score on the leaderboard, but the result will be very close.

Thanks to our Father in AI, Dr Bayo Adekanmbi, the entire Data Science Nigeria staff, and our experts, who have put in so much selfless effort to make Boot Camp 2019 a successful and impactful one for us all. Looking forward to Boot Camp 2020.

Thanks to Data Science Nigeria, AXAMansard and zindi.africa.

#1MillionAiTalentin10years
