Data Science Nigeria Challenge #1: Loan Default Prediction

raheem nasirudeen
5 min read · Oct 30, 2018

Welcome to the world of Data Science, Machine Learning and Artificial Intelligence!

The Sexiest Job of the 21st Century.

To be honest, I hate writing. I am a statistician, a student of The Polytechnic Ibadan and an AISaturdays (Ibadan) member, courtesy of Mr Ahmed Olanrewaju (@abono2000) and Dr Sekinat Folurunsho (@sakinatTijani).

Introduction.

Here I cover a little about the organizer of the competition (Data Science Nigeria) and the platform hosting it (Zindi).

Data Science Nigeria is a non-profit run and managed by the Data Scientists Network Foundation. Our vision is to accelerate Nigeria’s development through a solution-oriented application of machine learning in solving social/business problems and to galvanize data science knowledge revolution, which can position Nigeria to become the outsourcing hub for international Data Science/Advanced Analytics/Big Data projects, with opportunity to access at least 1% share of the global big data and analytics market, valued at $150b in 2017 ($203b in 2020).

www.datasciencenigeria.org

Convened by Dr Bayo Adekanbi.

Zindi is the first data science competition platform in Africa. Zindi hosts an entire data science ecosystem of scientists, engineers, academics, companies, NGOs, governments and institutions focused on solving Africa’s most pressing problems.

zindi.africa

Background of the study.

Data Science Nigeria organized a five-day boot camp, held on October 10–14, 2018: a free, intensive program covering the Inter-Campus Machine Learning Competition and Deep Learning for the Future of the Nation.

The data was made available after the boot camp for learning purposes.

Although I missed the 2018 boot camp, I am already preparing for the 2019 edition.

Understanding the Problem to be solved.

The Loan Default Prediction problem is to determine whether a loan is good or bad, that is, whether the customer is likely to default on it.

Understanding the Data.

The task is to predict whether a loan is good or bad. This is a binary classification problem, and a supervised learning one, because the target variable is known for the training data.

The dataset consists of three tables, each provided as a training set and a test set.

Demographic data: information on the social well-being of the customer.

Performance data: this is the repeat loan that the customer has taken and whose performance we need to predict. Basically, we need to predict whether this loan will default, given all previous loans and the demographics of the customer.

Previous loans data: this dataset contains all the loans the customer had before the loan whose performance we want to predict. Each loan has a different systemloanid, but the same customerid for a given customer.

Solving the problem.

The tool used to solve the problem is the Python programming language, in a Jupyter notebook running in an Anaconda environment.

There are many other tools for machine learning; to mention just a few, AzureML and R.

The libraries used to analyze the data are pandas and NumPy (for loading and manipulating the data and drawing inferences from it), matplotlib and seaborn (for data visualization) and scikit-learn (for building the machine learning model).

The dataset consists of three files. In order to analyse it, the files have to be merged on the primary key given in the description, which is customerid. Primary keys are most often used in relational databases to join tables.

Importing the necessary libraries for data analysis and visualization, and for feature engineering to get a better representation of the data.

Importing the libraries.

Screenshot of the libraries.

Checking the format of the data.

Training data

Reading each of the three training sets.

Merging the data with the primary key “customerid”.

Merging the training data.
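As a minimal sketch of the merge step, assuming each file has already been loaded into a pandas DataFrame: the three tiny tables below are made-up stand-ins for the real files, and the variable names (demographics, performance, previous_loans) and cell values are hypothetical. Only customerid, loannumber, bank_account_type and good_bad_flag are names taken from the column summary in this post.

```python
import pandas as pd

# Tiny made-up stand-ins for the three training files (values are hypothetical).
demographics = pd.DataFrame({
    "customerid": ["c1", "c2"],
    "bank_account_type": ["Savings", "Current"],
})
performance = pd.DataFrame({
    "customerid": ["c1", "c2"],
    "loannumber": [3, 2],
    "good_bad_flag": ["Good", "Bad"],
})
previous_loans = pd.DataFrame({
    "customerid": ["c1", "c1", "c2"],
    "loannumber": [1, 2, 1],
})

# Inner joins on the shared key. A column that exists in both loan tables
# (here loannumber) gets pandas' default _x / _y suffixes.
merged = (
    demographics
    .merge(performance, on="customerid")
    .merge(previous_loans, on="customerid")
)

print(merged.columns.tolist())
print(len(merged))  # one row per previous loan of each customer
```

Because a customer can have several previous loans, the merged table has one row per previous loan, so the same target value is repeated across those rows.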

Data Type.

customerid                   13693 non-null object
birthdate                    13693 non-null object
bank_account_type            13693 non-null object
longitude_gps                13693 non-null float64
latitude_gps                 13693 non-null float64
bank_name_clients            13693 non-null object
employment_status_clients    12330 non-null object
loannumber_x                 13693 non-null int64
approveddate_x               13693 non-null object
creationdate_x               13693 non-null object
totaldue_x                   13693 non-null float64
termdays_x                   13693 non-null int64
good_bad_flag                13693 non-null object
loannumber_y                 13693 non-null int64
approveddate_y               13693 non-null object
creationdate_y               13693 non-null object
totaldue_y                   13693 non-null float64
termdays_y                   13693 non-null int64
closeddate                   13693 non-null object
firstduedate                 13693 non-null object
firstrepaiddate              13693 non-null object
LoanAmount                   13693 non-null float64
dtypes: float64(5), int64(4), object(13)

We need to know the data types before building the machine learning model, because the model needs a better representation of the data. Columns in object format have to be dropped, or converted to an ordinal or a nominal encoding, depending on what the analysis calls for.

Ordinal data have a meaningful ranking, while nominal data have none. This is where feature engineering comes in: finding better data to feed to the model and cleaning up the data.
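To make the ordinal versus nominal distinction concrete, here is a small sketch of the two encodings in pandas. It is an illustration only, not the exact encoding I used: education_level is a hypothetical ordinal column, and the category values for bank_account_type are assumed.

```python
import pandas as pd

df = pd.DataFrame({
    # Hypothetical ordinal column: the categories have a natural order.
    "education_level": ["Primary", "Graduate", "Secondary"],
    # Nominal column: the categories have no order (values are assumed).
    "bank_account_type": ["Savings", "Current", "Savings"],
})

# Ordinal: map each category to an integer that preserves the ranking.
education_order = {"Primary": 0, "Secondary": 1, "Graduate": 2}
df["education_level"] = df["education_level"].map(education_order)

# Nominal: one-hot encode so that no artificial ranking is introduced.
df = pd.get_dummies(df, columns=["bank_account_type"])

print(df)
```

Mapping a nominal column to 0, 1, 2 would suggest to the model that one bank account type is "greater" than another, which is why one-hot encoding is the safer default there.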

Data cleaning is really important when working with real-world datasets.
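One concrete cleaning task shows up in the column summary above: employment_status_clients has 12330 non-null values out of 13693 rows, so 1363 values are missing. The sketch below shows one possible way to handle that, by filling the gaps with an explicit "Unknown" category. The example values and the choice of "Unknown" are assumptions for illustration, not necessarily what I did in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up sample of the column with missing entries (values are assumed).
df = pd.DataFrame({
    "employment_status_clients": ["Permanent", np.nan, "Self-Employed", np.nan],
})

# Count the missing values before cleaning.
missing_before = int(df["employment_status_clients"].isna().sum())

# Treat "missing" as its own category instead of dropping the rows.
df["employment_status_clients"] = df["employment_status_clients"].fillna("Unknown")

missing_after = int(df["employment_status_clients"].isna().sum())
print(missing_before, missing_after)
```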

It is often said that a data scientist spends about 80% of the time cleaning data and 20% building the model.

The data fed into the model determines how well the chosen algorithm can fit it.

Note: the target variable is the good_bad_flag column, which is in object format. It has to be converted to 1 for good and 0 for bad.

Converting the target variable to 1 and 0.
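A minimal sketch of this conversion, assuming the labels in good_bad_flag are spelled "Good" and "Bad" (check the actual spelling in the data with value_counts() first):

```python
import pandas as pd

# Made-up sample of the target column; the label spelling is an assumption.
df = pd.DataFrame({"good_bad_flag": ["Good", "Bad", "Good"]})

# Map the text labels to integers: 1 for a good loan, 0 for a bad one.
df["good_bad_flag"] = df["good_bad_flag"].map({"Good": 1, "Bad": 0})

print(df["good_bad_flag"].tolist())
```

Any label that is not in the mapping becomes NaN, so a quick isna() check after the conversion catches unexpected spellings.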

Exploratory data analysis is really important for getting to know the data.

After feature engineering, I split my data into a training set and a test set.

Splitting the data.
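The split can be sketched with scikit-learn's train_test_split. The data below is randomly generated and only borrows two column names from the summary above. The 80/20 ratio, the stratify option and the random_state value are assumptions for illustration, not the settings from my notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Randomly generated stand-in features and target (not the real data).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "totaldue_x": rng.uniform(10_000, 50_000, size=100),
    "termdays_x": rng.choice([15, 30, 60], size=100),
})
y = pd.Series(rng.integers(0, 2, size=100), name="good_bad_flag")

# Hold out 20% of the rows for testing; stratify keeps the share of
# good and bad loans similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
```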

I tried different algorithms, such as Logistic Regression, Decision Tree and others, but the one that gave me the best predictions was Random Forest.

Using random forest with hyperparameter tuning.

Using scikit-learn.
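Here is a rough sketch of random forest with hyperparameter tuning in scikit-learn, using GridSearchCV as one possible tuning method. The data comes from make_classification rather than the loan dataset, and the parameter grid is an arbitrary small example, not the grid I actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic binary classification data standing in for the loan features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Small, arbitrary grid of hyperparameters to try.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

# 3-fold cross-validated grid search, scored on accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

print(search.best_params_)
print(search.score(X_test, y_test))  # accuracy of the best model on held-out data
```

After fitting, GridSearchCV refits the best combination on the whole training set, so search can be used directly for prediction.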

I did my feature engineering, fit a Random Forest Classifier on the data and got an accuracy of 89%.

There are many metrics for evaluating a binary classification model, such as the classification report, the ROC curve, accuracy and others.
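These metrics are all available in sklearn.metrics. The sketch below computes three of them on a small hand-made set of labels and predictions. The numbers are invented for illustration and are not my competition results.

```python
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

# Invented true labels, predicted labels and predicted probabilities of class 1.
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 1]
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.7, 0.6, 0.95]

# Share of predictions that match the true label (6 of 8 here).
accuracy = accuracy_score(y_true, y_pred)

# Area under the ROC curve, computed from the probabilities.
auc = roc_auc_score(y_true, y_score)

print(accuracy)
print(auc)
# Precision, recall and F1 for each class.
print(classification_report(y_true, y_pred))
```

When good loans heavily outnumber bad ones, accuracy alone can look high even for a weak model, so it helps to look at the per-class report and the ROC curve as well.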

Accuracy score

With this score, my position was number 62 on the competition leaderboard.

#GodBlessDSN

#GodBlessNigeria

#GodBlessAISaturday(Ibadan)
