3rd Place Solution (Predict the Employee Burnout Rate by HackerEarth)
Introduction to the challenge
The challenge is hosted on HackerEarth
World Mental Health Day is celebrated on October 10 each year. The objective of this day is to raise awareness about mental health issues around the world and mobilize efforts in support of mental health. According to an anonymous survey, about 450 million people live with mental disorders, which are among the primary causes of poor health and disability worldwide.
You are a Machine Learning engineer in a company. You are given a task to understand and observe the mental health of all the employees in your company. Therefore, you are required to predict the burnout rate of employees based on the provided features thus helping the company to take appropriate measures for their employees.
The problem is formulated as a regression task, and the evaluation metric is the R² score. The training set has shape (22750 × 9) and the test set has shape (12250 × 8); the target to predict is the burnout rate (Burn Rate).
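The evaluation metric can be sketched with scikit-learn's `r2_score`; the values below are hypothetical, purely to illustrate the call.

```python
# Sketch of the competition's evaluation metric: R^2 (coefficient of
# determination). The target values here are made up for illustration.
from sklearn.metrics import r2_score

y_true = [0.45, 0.62, 0.30, 0.71]   # hypothetical burnout rates
y_pred = [0.50, 0.60, 0.35, 0.68]   # hypothetical model predictions

score = r2_score(y_true, y_pred)    # 1 - SS_res / SS_tot
```

An R² of 1.0 means perfect predictions; 0.0 means no better than predicting the mean.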
EDA (Exploratory Data Analysis)
The data was explored to inspect the observations (rows) and features (columns). The dataset contains both categorical and numerical features, and most of the categorical features have low cardinality. I found missing values in the dataset, and, to my surprise, even in the target variable (Burn Rate).
Data Preprocessing and Feature Engineering
I dropped rows with missing values, using as the subset the columns that have missing values in the training data but not in the test data, i.e. I removed the observations with missing values row-wise across the selected columns.
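A minimal sketch of this step with pandas, assuming column names from the dataset description; the exact subset of columns and the toy values are assumptions.

```python
# Drop rows that have missing values only within a chosen subset of
# columns. Column names follow the dataset description; the subset and
# sample values are assumptions for illustration.
import numpy as np
import pandas as pd

train = pd.DataFrame({
    "Designation":          [2.0, np.nan, 3.0, 1.0],
    "Resource Allocation":  [4.0, 5.0, np.nan, 6.0],
    "Mental Fatigue Score": [5.1, 6.3, 7.0, np.nan],
    "Burn Rate":            [0.3, 0.5, np.nan, 0.4],
})

# Only these columns are considered when deciding which rows to drop;
# a missing "Designation" alone would not remove a row.
subset_cols = ["Resource Allocation", "Mental Fatigue Score", "Burn Rate"]
train_clean = train.dropna(subset=subset_cols).reset_index(drop=True)
```

`dropna(subset=...)` keeps rows whose missing values fall outside the selected columns, which matches dropping observations "by rows through the columns selected".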
I transformed the categorical features using one-hot encoding for the features with no meaningful rank (Gender and Company Type) and used ordinal encoding for the feature with a meaningful ranking (WFH Setup Available).
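The encoding step can be sketched as follows; the category values ("Male"/"Female", "Service"/"Product", "Yes"/"No") are assumptions based on the feature names.

```python
# One-hot encode the nominal features and ordinal-encode the ranked one.
# Category values are assumptions for illustration.
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Company Type": ["Service", "Product", "Service"],
    "WFH Setup Available": ["No", "Yes", "Yes"],
})

# One-hot encoding for features with no meaningful rank.
df = pd.get_dummies(df, columns=["Gender", "Company Type"])

# Ordinal encoding for the feature with a meaningful order: No < Yes.
enc = OrdinalEncoder(categories=[["No", "Yes"]])
df["WFH Setup Available"] = enc.fit_transform(df[["WFH Setup Available"]])
```

Fixing the category order in `OrdinalEncoder` makes the No → 0, Yes → 1 mapping explicit instead of relying on alphabetical order.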
I noticed outliers in some numeric variables and experimented with handling them, but it did not improve my public score, so I left them as-is (the private score might have improved, though). The numeric features are used without additional preprocessing.
A datetime feature is also present, which I removed without creating any datetime-derived features. I did this because of the variable description: from my analysis, it would be a very bad idea to use the date an employee joined the company to predict their burnout rate. This is one of the benefits of a data scientist's analysis and curiosity: knowing when to add or remove a feature. Feature engineering is not just about creating more features; it is more of an art than a science.
From my baseline model I observed the feature importances and regrouped the columns based on the plot (this gave a slight improvement to my model).
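A sketch of inspecting baseline feature importances before deciding what to regroup; the model choice, feature names, and synthetic data here are assumptions.

```python
# Fit a quick baseline tree model and rank features by importance.
# Feature names and data are assumptions for illustration.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, random_state=0)
names = ["Designation", "Resource Allocation", "Mental Fatigue Score",
         "Gender_Male", "Company Type_Service"]

baseline = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Rank features by importance, highest first, to guide regrouping.
ranked = sorted(zip(names, baseline.feature_importances_),
                key=lambda t: t[1], reverse=True)
```

Tree-based importances sum to 1, so the ranking gives a quick relative view of which columns drive the baseline's predictions.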
The validation strategy is crucial in all data-driven tasks, and especially in competitions. In this scenario, scores from a simple k-fold validation might give overestimated assumptions about the real score and the expected position on the leaderboard.
Trained Model(Ensemble Method)
I kept my solution simple: one CatBoost model (scoring 93%) and one XGBoost model (93%), each trained with 10-split k-fold cross-validation using different random seeds to boost the model score a little. The two models are combined using a weighted-average ensemble, 65% CatBoost and 35% XGBoost, which improved my score a little further.
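The cross-validated weighted-average ensemble can be sketched as below. To keep the example self-contained and runnable, scikit-learn regressors stand in for CatBoost and XGBoost (the actual solution used `CatBoostRegressor` and `XGBRegressor`); the synthetic data, seeds, and hyperparameters are assumptions.

```python
# 10-fold CV per model, averaging the test predictions over folds, then a
# 65/35 weighted average of the two models. Scikit-learn regressors stand
# in for CatBoost/XGBoost; data and seeds are assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import KFold, train_test_split

X, y = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

def cv_predict(make_model, X, y, X_test, n_splits=10, seed=42):
    """Train a fresh model on each KFold split and average its test predictions."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    test_pred = np.zeros(len(X_test))
    for train_idx, _ in kf.split(X):
        model = make_model().fit(X[train_idx], y[train_idx])
        test_pred += model.predict(X_test) / n_splits
    return test_pred

pred_a = cv_predict(lambda: GradientBoostingRegressor(random_state=0),
                    X_tr, y_tr, X_te)
pred_b = cv_predict(lambda: RandomForestRegressor(n_estimators=50, random_state=0),
                    X_tr, y_tr, X_te)

# Weighted-average ensemble: 65% model A, 35% model B.
ensemble = 0.65 * pred_a + 0.35 * pred_b
```

Averaging across folds smooths out fold-to-fold variance, and the weighted blend lets the stronger model dominate while the second model still corrects some of its errors.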
Competitive data science is one of the best ways to learn data science for both newbies and experts; it provides opportunities for experimentation, which is one of the core skills of a data scientist.
The major goal of participating in competitive data science should be learning and building your portfolio; winning a prize is an added advantage (this hackathon did not award prizes).
Info and Contact
I’m a data scientist at AXA Nigeria.
Twitter — @Nasereliver
Linkedin — Nasirudeen
Code Github Link — Solution