Global Health Hackathon Winner Solution Approach
This hackathon was hosted virtually by madirohacks on March 19–20 to solve global health problems.
My team was called Team Data:
Team Lead — Raheem Nasirudeen
Team Member — Emmanuel Imonmion
I will be sharing the approach we used in solving problems in the data science category.
Medical Diagnosis Test prediction
Determine the likelihood of a patient's diagnostic test being negative or positive using machine learning and explainable AI.
Abstract
Data analytics has been used in healthcare systems for many years. In this tutorial, four types of data analytics (descriptive, diagnostic, predictive, and prescriptive) are used to predict the likelihood of a patient's diagnostic test result. The machine learning model is XGBoost, a gradient boosting model that reaches a roc_auc of 90%, and the interpretable machine learning library SHAP is used to explain the model.
Introduction
We must find innovative ways that are easy and safe to get testing and treatment done and prevent new infections, or we will see another generation with millions forced to live with HIV.
Methodology
1. Identify the business problem
2. Organize the data set
3. Data Exploration and Data Visualization
4. Data Cleaning and Transformation
5. Statistical Inference
6. Feature selection
7. Creating our train and test data
8. Evaluation of the model
9. Making predictions on our test data
10. Making decisions from the model to solve the business problem
Exploratory Data Analysis
Before we begin our analysis, note that we use results_response as our target variable. We therefore drop the rows where the target variable is missing before any further analysis and investigation.
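As a minimal sketch of this step, assuming the data sits in a pandas DataFrame, the rows with a missing target can be dropped as shown here. The toy rows and the age column are hypothetical; only the target column name comes from the text above.

```python
import numpy as np
import pandas as pd

# Target column name as written in this section; adjust it to match your data.
TARGET = "results_response"

# Hypothetical toy data standing in for the hackathon dataset.
df = pd.DataFrame({
    "age": [25, 31, 40, 22],
    TARGET: ["Negative", np.nan, "Positive", None],
})

# Keep only the rows where the target is present, then renumber the index.
df_clean = df.dropna(subset=[TARGET]).reset_index(drop=True)

print(f"rows before: {len(df)}, rows after: {len(df_clean)}")
```

Only the target is filtered here. Missing values in the other columns are left in place on purpose, since the model used later can handle them.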
Descriptive Analysis
In this section we look at some features using univariate and bivariate analysis.
Bivariate Analysis
Diagnostic Analysis
In this section we check whether each categorical feature is significantly associated with our target variable (result_response) using the chi-square test of independence.
Step 1: transform (encode) all our categorical features.
Step 2: apply the chi-square test.
Formulate our hypotheses
H0: the feature checked has no significant association with result_response.
H1: the feature checked has a significant association with result_response.
If the p-value is greater than 0.05, we fail to reject H0 and treat the feature as having no significant effect on result_response. If the p-value is 0.05 or lower, we reject H0 and treat the feature as significant.
We then use the chi-square results to identify the most important categorical features.
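The test described above can be sketched as follows. This is an illustration under assumptions, not necessarily the exact routine we ran: it uses pandas.crosstab with scipy.stats.chi2_contingency, and the toy data and the gender column are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

ALPHA = 0.05               # significance level used in the text
TARGET = "result_response"

# Hypothetical toy data: one feature tied to the target, one unrelated to it.
df = pd.DataFrame({
    "confirmatory_test": ["yes"] * 40 + ["no"] * 10 + ["yes"] * 10 + ["no"] * 40,
    "gender": ["male", "female"] * 50,
    TARGET: ["positive"] * 50 + ["negative"] * 50,
})


def chi_square_p_value(data, feature, target):
    """Return the p-value of a chi-square test of independence."""
    contingency_table = pd.crosstab(data[feature], data[target])
    _, p_value, _, _ = chi2_contingency(contingency_table)
    return p_value


features = ["confirmatory_test", "gender"]
p_values = {f: chi_square_p_value(df, f, TARGET) for f in features}

# Reject H0 (feature is significant) when the p-value is at or below ALPHA.
significant = [f for f, p in p_values.items() if p <= ALPHA]

for feature, p in p_values.items():
    verdict = "significant" if p <= ALPHA else "not significant"
    print(f"{feature}: p-value = {p:.4g} ({verdict})")
```

Sorting the features by p-value gives a simple ranking of the categorical features, with the smallest p-values first.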
Machine Learning
1. Select features by dropping redundant ones
2. Split the data into train and test data
3. Leave missing values to XGBoost, a gradient boosting model that handles them natively
Model performance
In this section, we choose a metric that is suitable for validating model performance on an imbalanced dataset.
Our model performance metric is ROC AUC, computed with roc_auc_score from scikit-learn.
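To see why accuracy alone can mislead on imbalanced data, here is a small sketch with made-up labels and scores (none of these numbers come from our dataset). A model that gives every patient the same score looks accurate, yet its ROC AUC shows it cannot rank patients at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced labels: 95 negative tests and 5 positive tests.
y_true = np.array([0] * 95 + [1] * 5)

# A useless model: every patient gets the same score and the same label.
constant_scores = np.zeros(100)
constant_labels = np.zeros(100, dtype=int)

# A model that scores every positive patient above every negative one.
informative_scores = np.where(y_true == 1, 0.9, 0.1)

accuracy_constant = accuracy_score(y_true, constant_labels)
auc_constant = roc_auc_score(y_true, constant_scores)
auc_informative = roc_auc_score(y_true, informative_scores)

print(f"constant model accuracy: {accuracy_constant:.2f}")
print(f"constant model ROC AUC:  {auc_constant:.2f}")
print(f"informative ROC AUC:     {auc_informative:.2f}")
```

Note that roc_auc_score expects scores or probabilities for the positive class, not hard 0/1 predictions.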
Feature importance of the model
Explainable AI
In this section, we use Shapley values to make decisions from the black box model.
SHAP (SHapley Additive exPlanations) is a game-theoretic approach, based on Shapley values, to explain the output of any machine learning model.
Also, the expected value reported by the SHAP explainer, -3.2963135, is used as the base value throughout the visualizations below. Features that push a prediction above this base value move it toward class 1 (result is positive), whereas features that push it below the base value move it toward class 0 (result is negative).
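To make the idea of a base value plus feature contributions concrete, here is a small pure-Python sketch. It computes exact Shapley values for a toy three-feature "game" and adds them to the base value. Everything except the base value is an assumption: the toy contribution function and its numbers are invented, and the base value is assumed to be on the log-odds scale, which is the usual raw output of a gradient boosting classifier.

```python
import math
from itertools import permutations

# Base value quoted in the text, assumed here to be in log-odds.
BASE_VALUE = -3.2963135


def toy_log_odds_shift(present):
    """Hypothetical shift in log-odds when only `present` features are known."""
    shift = 0.0
    if "confirmatory_test" in present:
        shift += 2.0
    if "last_tested" in present:
        shift += 1.0
    # Invented interaction between two of the features.
    if {"confirmatory_test", "appointment"} <= present:
        shift += 1.0
    return shift


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {player: 0.0 for player in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for player in order:
            with_player = coalition | {player}
            totals[player] += value(with_player) - value(coalition)
            coalition = with_player
    return {player: total / len(orderings) for player, total in totals.items()}


def sigmoid(log_odds):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


features = ["confirmatory_test", "last_tested", "appointment"]
phi = shapley_values(features, toy_log_odds_shift)

# Additivity: the prediction is the base value plus all contributions.
patient_log_odds = BASE_VALUE + sum(phi.values())
base_probability = sigmoid(BASE_VALUE)
patient_probability = sigmoid(patient_log_odds)

print(phi)
print(f"base probability:    {base_probability:.4f}")
print(f"patient probability: {patient_probability:.4f}")
```

The interaction bonus is split equally between the two features that create it, which is the fairness property that makes Shapley values attractive for explanations. Enumerating every ordering only works for a handful of features; the SHAP library relies on much faster algorithms for tree models.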
Feature Importance with shap
The top 5 features for determining whether a test response will be positive or negative are:
1. Confirmatory test
2. last_tested
3. engage time difference
4. appointment
5. how easy the tool is to use
Judges' criteria for winning the hackathon
The judges specified that this solution stood out for its sound statistical approach and its use of Explainable AI rather than treating the model as a black box.