Global Health Hackathon Winner Solution Approach

raheem nasirudeen
5 min read · Apr 22, 2022


This hackathon was hosted virtually by madirohacks on March 19–20 to solve global health problems.

My team was called Team Data.

Team Lead — Raheem Nasirudeen

Team Member — Emmanuel Imonmion

I will be sharing the approach we used in solving problems in the data science category.

Medical Diagnostic Test Prediction

Determine the likelihood of a patient's diagnostic test being negative or positive using machine learning and explainable AI.


Abstract

Data analytics has been used in healthcare systems for many years. In this tutorial, four types of data analytics (descriptive, diagnostic, predictive and prescriptive) are used to predict the likelihood of a patient's diagnostic test result. The machine learning model is XGBoost, a gradient boosting model, which scores about 90% ROC AUC, and SHAP, an interpretable machine learning library, is used to explain the model.

Introduction

We must find innovative ways that are easy and safe to get testing and treatment done and prevent new infections, or we will see another generation with millions forced to live with HIV.

Methodology

1. Identify the business problem

2. Organize the data set

3. Data Exploration and Data Visualization

4. Data Cleaning and Transformation

5. Statistical Inference

6. Feature selection

7. Creating our train and test data

8. Evaluation of the model

9. Making predictions on our test data

10. Making decisions from the model to solve the business problem

Exploratory Data Analysis

We use results_response as our target variable, so before starting the analysis we drop the rows where the target is missing.
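Dropping the rows with a missing target can be sketched with pandas as follows. This is a minimal illustration, not the notebook's code: the tiny DataFrame, the `age` column and the "Negative"/"Positive" labels are made up, and only the `results_response` column name comes from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the hackathon data; only the target column
# name is taken from the article.
df = pd.DataFrame({
    "results_response": ["Negative", "Positive", np.nan, "Negative", np.nan],
    "age": [25, 31, 40, 22, 35],
})

# Keep only the rows where the target is known.
df_clean = df.dropna(subset=["results_response"]).reset_index(drop=True)

print(df.shape, "->", df_clean.shape)
```

Restricting `dropna` to the target column keeps rows that have gaps in other features, which matters later because the model is expected to handle those missing values itself.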

The dataset has 21029 rows and 58 columns.
The columns have different data types.

Descriptive Analysis

In this section we look at some features using univariate and bivariate analysis.

We have 92.9% negative and 7.1% positive results.
Most patients were last tested **6 weeks to 3 months** ago (38.2%).
**Female** patients visited the clinic more than **Male** patients.
The most spoken language is **Zulu** (71.1%) and the least spoken is **Xhosa** (0.6%).
The top 10 facilities for patients.
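Percentages like the ones above are typically obtained from normalized value counts. The sketch below assumes a `gender` column and uses invented values, so the numbers it prints are not the hackathon's.

```python
import pandas as pd

# Hypothetical sample; the column name `gender` and the values are assumptions.
gender = pd.Series(
    ["Female", "Female", "Male", "Female", "Male", "Other", "Female", "Male"],
    name="gender",
)

# Share of each category as a percentage, rounded to one decimal place.
gender_pct = gender.value_counts(normalize=True).mul(100).round(1)

print(gender_pct)
```

The same call on the target column would give the class balance, and `.head(10)` on the counts of a facility column would give a top 10 list.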

Bivariate Analysis

9.2% of results for male patients are positive, compared with 6.4% for female patients and 2.9% for other.
Patients tested within a short time are mostly negative, while those last tested more than 6 months ago are more often positive.
Positive tests are most common among Sesotho and Xhosa speakers.
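A bivariate breakdown such as the positive rate per gender can be sketched with a row-normalized cross-tabulation. The data and the `gender` column name are hypothetical; only `results_response` is named in the article.

```python
import pandas as pd

# Hypothetical data: 4 male and 5 female patients with one positive each.
df = pd.DataFrame({
    "gender": ["Male"] * 4 + ["Female"] * 5,
    "results_response": (
        ["Positive", "Negative", "Negative", "Negative"]
        + ["Positive", "Negative", "Negative", "Negative", "Negative"]
    ),
})

# normalize="index" makes each row sum to 100%, so every cell reads as
# "share of this gender with this result".
rates = (
    pd.crosstab(df["gender"], df["results_response"], normalize="index")
    .mul(100)
    .round(1)
)

print(rates)
```

Normalizing by row rather than over the whole table is what allows groups of very different sizes to be compared fairly.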

Diagnostic Analysis

In this section we check whether each categorical feature is significantly associated with our target variable (result_response) using a chi-square test of independence.

Step 1: we transform all our categorical features.

Step 2: we apply a chi-square test.

Formulate our hypotheses

H0: the feature checked has no significant effect on result_response.

H1: the feature checked has a significant effect on result_response.

If the p-value is greater than 0.05, we fail to reject H0 and conclude that the feature has no significant effect on result_response; otherwise we reject H0 and treat the feature as significant.
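The decision rule can be illustrated with SciPy's `chi2_contingency` applied to a contingency table of one feature against the target. This is one common way to run the test and not necessarily what the notebook used. The counts and category labels are invented, and `last_tested` is used only because it is a feature named in this article.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: 50 recently tested patients (5 positive) and
# 50 patients last tested long ago (20 positive).
df = pd.DataFrame({
    "last_tested": ["<3 months"] * 50 + [">6 months"] * 50,
    "result_response": (
        ["Negative"] * 45 + ["Positive"] * 5
        + ["Negative"] * 30 + ["Positive"] * 20
    ),
})

# Observed counts for every (feature value, target value) pair.
table = pd.crosstab(df["last_tested"], df["result_response"])

# Chi-square test of independence between the feature and the target.
chi2, p_value, dof, expected = chi2_contingency(table)

alpha = 0.05
significant = p_value <= alpha  # True means we reject H0

print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}, significant={significant}")
```

Repeating this for every categorical column and sorting by p-value gives a ranking of which features are most strongly associated with the test result.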

We are going to observe the most important categorical features using chi-square.

The impact of the categorical features on the target feature.

Machine Learning

1. Feature selection by dropping redundant features

2. Split the data into train and test data

3. Missing values will be handled by XGBoost, a gradient boosting model

Drop the redundant column.
Select the features that are numeric.
Split the data into train and test sets.
Initialize our XGBoost model.

Model performance

In this section, we choose a metric that is suitable for validating performance on an imbalanced dataset.

Our model performance metric is roc_auc_score from scikit-learn.

We have a score of ~90%.
Image of the roc_auc_score.
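A small made-up example shows why ROC AUC suits an imbalanced target better than accuracy. The labels and scores below are invented; only `roc_auc_score` from scikit-learn is taken from the article.

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced labels: 8 negatives and 2 positives.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

# A "model" that always predicts negative looks good on accuracy...
always_negative = [0] * 10
acc = accuracy_score(y_true, always_negative)

# ...but its constant score cannot rank patients, so its ROC AUC is 0.5.
auc_constant = roc_auc_score(y_true, [0.0] * 10)

# Scores that rank most positives above most negatives get a high AUC.
scores = [0.1, 0.2, 0.15, 0.3, 0.05, 0.4, 0.25, 0.8, 0.7, 0.9]
auc_model = roc_auc_score(y_true, scores)

print(acc, auc_constant, auc_model)
```

ROC AUC is the fraction of positive/negative pairs in which the positive case receives the higher score. Here 15 of the 16 pairs are ordered correctly, which gives 0.9375.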

Feature importance of the model

The top 15 features by importance in the model.

Explainable AI

In this section, we use Shapley values to make decisions from the black box model.

Shapley values are a game-theoretic approach to explaining the output of any machine learning model.
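To make the game-theoretic idea concrete, here is a toy exact Shapley computation in plain Python. It does not use the SHAP library, and the `value` function is an invented stand-in for a model. Each feature's Shapley value is its average marginal contribution over every order in which the features could be revealed.

```python
from itertools import permutations

features = ["last_tested", "appointment", "age"]


def value(coalition):
    """Hypothetical model output when only the features in `coalition` are known."""
    score = 0.0
    if "last_tested" in coalition:
        score += 2.0
    if "appointment" in coalition:
        score += 1.0
    if "last_tested" in coalition and "appointment" in coalition:
        score += 1.0  # interaction between the two features
    return score


def shapley_values(players, value_fn):
    """Average each player's marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        seen = set()
        for p in order:
            before = value_fn(seen)
            seen.add(p)
            totals[p] += value_fn(seen) - before
    return {p: t / len(orderings) for p, t in totals.items()}


phi = shapley_values(features, value)
print(phi)
```

The interaction bonus is split equally between the two features that create it, the unused `age` feature gets zero, and the values sum to the difference between the full output and the empty-coalition output. SHAP approximates the same quantity efficiently for real models, where enumerating every ordering would be far too slow.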

The base value of the model is -3.2963135.

The expected value of -3.2963135 is used as the base value throughout the visualizations below. Features that push a prediction above this base value push it toward class 1 (result is positive), whereas features that push it below push it toward class 0 (result is negative).
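A negative base value is easier to read once it is converted to a probability. The sketch below assumes the base value is on the log-odds scale, which is the usual raw output for a binary XGBoost classifier but is not confirmed in the article. The per-feature contributions are invented for illustration.

```python
import math


def sigmoid(log_odds):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


base_value = -3.2963135  # from the article; assumed to be in log-odds
base_probability = sigmoid(base_value)

# Hypothetical SHAP contributions for one patient (not real results).
contributions = {"last_tested": 1.8, "appointment": 0.9, "age": -0.2}

# SHAP values are additive: the prediction is the base value plus the sum
# of the feature contributions.
patient_log_odds = base_value + sum(contributions.values())
patient_probability = sigmoid(patient_log_odds)

print(round(base_probability, 4), round(patient_probability, 4))
```

Under that assumption the base value corresponds to a probability of roughly 3.6%, and positive contributions raise an individual patient's predicted probability above that starting point.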

Feature Importance with SHAP

The most important features using SHAP.

The top 5 features for determining whether a test response will be positive or negative are:

1. Confirmatory test

2. last_tested

3. engage time difference

4. appointment

5. how easy the tool is to use
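A ranking like the one above is commonly derived by averaging the absolute SHAP value of each feature across all patients, which is how SHAP's bar-style importance plots order features. The matrix below is hypothetical and only three features are shown.

```python
import numpy as np

feature_names = ["last_tested", "appointment", "age"]

# Hypothetical SHAP values: one row per patient, one column per feature.
shap_matrix = np.array([
    [1.2, -0.3, 0.1],
    [-0.8, 0.5, 0.0],
    [1.0, -0.4, -0.1],
    [-1.0, 0.4, 0.2],
])

# Global importance = mean absolute contribution of each feature.
mean_abs = np.abs(shap_matrix).mean(axis=0)

# Sort features from most to least important.
ranking = [feature_names[i] for i in np.argsort(mean_abs)[::-1]]

print(dict(zip(feature_names, mean_abs.round(2))), ranking)
```

Taking the absolute value first matters: a feature that pushes some patients strongly toward positive and others strongly toward negative would otherwise average out to near zero and look unimportant.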

Judges' criteria for winning the hackathon

The judges noted that this solution used a sound statistical approach and used explainable AI rather than treating the model as a black box.

References

  1. Notebook
  2. https://shap.readthedocs.io/en/latest/index.html
  3. https://nasere4567.medium.com/autism-class-prediction-and-making-informed-decisions-with-interpretable-machine-learning-5fed89b663c0


raheem nasirudeen

Data scientist | BBA graduate | Statistician | Causal Inference