Liver Disease Prediction
Abstract
The liver is an organ about the size of a football. It sits just under the rib cage on the right side of the abdomen and is essential for digesting food and ridding the body of toxic substances (Read_more). In this article, we build a gradient boosting model to predict the 4 stages of liver disease in the dataset. We follow the usual methodology for solving a data science project and work through 3 stages of data analytics: Descriptive Analytics, Diagnostic Analytics, and Predictive Analytics. The evaluation metric for this problem is the F1 score, and we obtain an average score of 50.29%.
Problem Statement
This Machine Learning Challenge requires participants to create predictive models that predict the stage of liver cirrhosis using 18 clinical features. Cirrhosis is damage to the liver from a variety of causes, leading to scarring and liver failure.
Hepatitis and chronic alcohol abuse are frequent causes of the disease. Liver damage caused by cirrhosis can’t be undone, but further damage can be limited. Treatments focus on the underlying cause. In advanced cases, a liver transplant may be required. Predicting the stage of cirrhosis and beginning treatment before it’s too late can prevent the fatal consequences of the disease.
The article is organized as follows. In section 1, we explore the dataset. In section 2, we perform descriptive analysis, comprising univariate and bivariate analysis of some categorical and numeric features. In section 3, we perform diagnostic analysis using a pair plot and Spearman correlation on the features. In section 4, we fill the missing values in the dataset. In section 5, we split the dataset into train and validation data. In section 6, we make predictions on the validation data and find the most important features.
The Dataset
1) ID: Unique Identifier
2) N_Days: number of days between registration and the earlier of death, transplantation, or study analysis time.
3) Status: status of the patient C (censored), CL (censored due to liver tx), or D (death)
4) Drug: type of drug. D-penicillamine or placebo
5) Age: age in [days]
6) Sex: M (male) or F (female)
7) Ascites: presence of ascites N (No) or Y (Yes)
8) Hepatomegaly: the presence of hepatomegaly N (No) or Y (Yes)
9) Spiders: the presence of spiders N (No) or Y (Yes)
10) Edema: the presence of edema N (no edema and no diuretic therapy for edema), S (edema present without diuretics, or edema resolved by diuretics), or Y (edema despite diuretic therapy)
11) Bilirubin: serum bilirubin in [mg/dl]
12) Cholesterol: serum cholesterol in [mg/dl]
13) Albumin: albumin in [gm/dl]
14) Copper: urine copper in [ug/day]
15) Alk_Phos: alkaline phosphatase in [U/liter]
16) SGOT: SGOT in [U/ml]
17) Triglycerides: triglycerides in [mg/dl]
18) Platelets: platelets per cubic [ml/1000]
19) Prothrombin: prothrombin time in seconds [s]
20) Stage: histologic stage of disease (1, 2, 3, or 4)
Data Exploration
We start the notebook by importing the major data science libraries and the helper functions used to analyze the categorical features in our dataset.
We load the train and test data using pandas read_csv.
The training set has 6800 rows and 21 columns, and the test set has 3200 rows and 20 columns. Note: this is a supervised, multi-class classification problem. The training data includes a label, the target variable (y, the dependent variable), while the test data comes without the label. We also observe some missing values in the dataset.
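As a minimal sketch of this loading and inspection step, the snippet below reads a tiny in-memory CSV in place of the real training file. The rows are made up for illustration, and the file name `train.csv` in the comment is an assumption, not the actual name used in the notebook.

```python
import io

import pandas as pd

# Tiny stand-in for the training CSV; these values are invented for illustration.
csv_text = """ID,Sex,Bilirubin,Cholesterol,Stage
1,F,1.1,302,3
2,M,,261,4
3,F,0.8,,2
4,F,3.2,244,4
"""

# With a real file this would be something like pd.read_csv("train.csv")
# (hypothetical file name).
train = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = train.shape
missing_per_column = train.isna().sum()

print(f"{n_rows} rows, {n_cols} columns")
# Show only the columns that actually contain missing values.
print(missing_per_column[missing_per_column > 0])
```

Checking `shape` and `isna().sum()` right after loading is a quick way to confirm the row and column counts and to see which features will need imputation later.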
Descriptive Analysis (Uni-Variate Analysis)
Uni-variate analysis is the part of the data analysis stage that involves analyzing data one feature at a time. In this section, we explore some categorical features (Stage, Status, Sex) using bar charts and explore our numeric features (Bilirubin, Cholesterol, Albumin, Copper, Alk_Phos, SGOT, Triglycerides, Platelets, Prothrombin) using summary statistics (mean, median, and standard deviation).
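The summary statistics described here can be sketched with pandas as follows. The small DataFrame is invented for illustration and only borrows column names from the dataset. The counts in `stage_counts` are the numbers a bar chart of Stage would display.

```python
import pandas as pd

# Made-up sample rows; only the column names come from the dataset description.
df = pd.DataFrame({
    "Sex": ["F", "M", "F", "F", "M", "F"],
    "Stage": [3, 4, 2, 4, 1, 3],
    "Bilirubin": [1.0, 2.0, 3.0, 4.0, 5.0, 15.0],
    "Albumin": [3.5, 3.0, 4.0, 3.5, 2.5, 4.5],
})

numeric_cols = ["Bilirubin", "Albumin"]

# One row per statistic, one column per numeric feature.
summary = df[numeric_cols].agg(["mean", "median", "std"])

# Frequency of each category, ordered by stage number.
stage_counts = df["Stage"].value_counts().sort_index()

print(summary)
print(stage_counts)
```

Comparing the mean and the median is a cheap skewness check: in this toy data the single large Bilirubin value pulls the mean well above the median.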
Uni-Variate Analysis on the Numerical data
Bi-Variate Analysis on the Categorical data
In this section, we explore how our categorical features relate to the target feature (Stage). Bi-variate analysis means analyzing data based on 2 features at a time.
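One way to look at a categorical feature against Stage is a cross-tabulation. The sketch below uses invented rows and is not necessarily how the notebook's helper functions do it.

```python
import pandas as pd

# Made-up sample rows for illustration.
df = pd.DataFrame({
    "Sex": ["F", "M", "F", "F", "M", "F"],
    "Status": ["C", "D", "C", "D", "D", "CL"],
    "Stage": [3, 4, 2, 4, 1, 3],
})

# Raw counts of each Sex value within each Stage.
sex_by_stage = pd.crosstab(df["Sex"], df["Stage"])

# Row-normalized table: for each Status, the share of patients in each Stage.
status_by_stage_pct = pd.crosstab(df["Status"], df["Stage"], normalize="index")

print(sex_by_stage)
print(status_by_stage_pct.round(2))
```

The row-normalized version is often easier to read when the categories have very different sizes, since each row sums to 1.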
Diagnostics Analysis
In this section, we use correlation to find relationships between the features and the target variable (Stage), and a scatter plot of 2 features to see where each Stage falls in the dataset.
We observe from the data that most patients, whatever their stage, fall within a cholesterol range of 120 to 500, regardless of Age.
Major observations: there is an extreme outlier with Age 40 and Cholesterol 1750 that belongs to stage 1,
and another extreme outlier with Age 65 and Cholesterol 1650 that belongs to stage 4.
We use Spearman correlation to find relationships between the features and the target feature in the dataset. Spearman correlation is computed on the ranks of the values, so it captures monotonic relationships and is less sensitive to outliers like the ones above.
Correlation shows the statistical relationship between 2 variables (read_more).
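A minimal sketch of the Spearman step with pandas, using invented values. In this toy data Bilirubin rises with Stage and Albumin falls, which is an assumption made for the example and not a result from the article.

```python
import pandas as pd

# Made-up values chosen so the direction of each relationship is obvious.
df = pd.DataFrame({
    "Bilirubin": [0.5, 0.9, 1.4, 3.0, 12.0],
    "Albumin": [4.2, 3.9, 3.6, 3.1, 2.4],
    "Stage": [1, 2, 3, 4, 4],
})

# Rank-based correlation of every feature with the target, sorted ascending.
corr_with_stage = (
    df.corr(method="spearman")["Stage"]
    .drop("Stage")
    .sort_values()
)

print(corr_with_stage)
```

Because the correlation uses ranks, the very large Bilirubin value of 12.0 has no more influence than any other largest value would.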
Dealing with Missing Values
One of the major tasks in data cleaning is filling the missing values in the dataset. Missing values occur in real-world data for many reasons, known or unknown, and as a data scientist it is always necessary to find the best approach for handling them. In this dataset, based on some research into the missingness in the data, missing values in most numerical features are filled using a group-by on Sex and Age, and missing values in the categorical features are filled using the ffill (forward fill) method. Note: there is no single best way to fill missing values; proper analysis and extensive experimentation are needed.
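The imputation strategy can be sketched as below. Two details are assumptions that the text does not specify: the group statistic (the median is used here) and how Age is grouped (it is bucketed into decades here, since Age in days would otherwise give almost one group per patient). The `Age_band` column is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Made-up sample rows for illustration.
df = pd.DataFrame({
    "Sex": ["F", "F", "F", "M", "M", "M"],
    "Age": [15000, 16000, 17000, 20000, 21000, 21500],  # age in days
    "Cholesterol": [200.0, np.nan, 300.0, 250.0, 270.0, np.nan],
    "Ascites": ["N", None, "Y", "N", None, "N"],
})

# Assumption: bucket age into decades so each (Sex, age band) group has several rows.
df["Age_band"] = ((df["Age"] / 365.25) // 10).astype(int)

# Numeric feature: fill with the median of the patient's (Sex, Age_band) group.
group_median = df.groupby(["Sex", "Age_band"])["Cholesterol"].transform("median")
df["Cholesterol"] = df["Cholesterol"].fillna(group_median)

# Fallback in case an entire group had no observed value.
df["Cholesterol"] = df["Cholesterol"].fillna(df["Cholesterol"].median())

# Categorical feature: forward fill, i.e. copy the previous row's value.
df["Ascites"] = df["Ascites"].ffill()

print(df)
```

Note that forward fill depends on row order and leaves a gap if the very first row is missing, so it is worth checking `isna().sum()` again afterwards.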
Data Transformation
Data transformation is the part of the data cleaning process that converts our data into a form suitable for modelling, by applying encoding techniques before fitting our machine learning models. Most of the categorical features are transformed using ordinal encoding when they have a meaningful ranking (yes, no), and nominal (one-hot) encoding with pd.get_dummies when they have no meaningful ranking (Sex).
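A short sketch of both encodings on invented rows. The integer codes chosen for the ordinal mappings, including the N < S < Y order for Edema, are assumptions made for illustration.

```python
import pandas as pd

# Made-up sample rows for illustration.
df = pd.DataFrame({
    "Ascites": ["N", "Y", "N"],
    "Edema": ["N", "S", "Y"],
    "Sex": ["F", "M", "F"],
})

# Ordinal encoding: categories with a natural order map to increasing integers.
yes_no = {"N": 0, "Y": 1}
edema_order = {"N": 0, "S": 1, "Y": 2}  # assumed severity order
df["Ascites"] = df["Ascites"].map(yes_no)
df["Edema"] = df["Edema"].map(edema_order)

# Nominal encoding: one indicator column per category, with no implied order.
encoded = pd.get_dummies(df, columns=["Sex"], dtype=int)

print(encoded)
```

Tree-based models can usually work with either encoding, but one-hot columns avoid implying that one Sex value is "greater" than the other.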
Predictive Analysis (Modelling Phase)
In this section, we split our data into train and validation sets, stratifying on the target variable so that both sets have the same percentage of each class. We use CatBoost, a tree-based gradient boosting model, to build our machine learning model and evaluate the predictions using the F1 score; the closer the score is to 1.0, the better the model performs.
The data is split into 70% training data and 30% validation data using stratified hold-out validation, and the resulting F1 score is 50.29%.
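The split-and-evaluate workflow can be sketched as follows. To keep the example dependent on scikit-learn only, it swaps in `GradientBoostingClassifier` for CatBoost and uses synthetic 4-class data, so the score it prints says nothing about the 50.29% reported here. Macro averaging of the F1 score is also an assumption, since the article does not state which average was used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic 4-class data standing in for the encoded clinical features.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, n_redundant=0,
    n_classes=4, random_state=42,
)

# 70/30 hold-out split; stratify=y keeps class proportions equal in both parts.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42,
)

# Stand-in gradient boosting model (the article itself uses CatBoost).
model = GradientBoostingClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_valid)
score = f1_score(y_valid, pred, average="macro")  # assumed averaging scheme

train_share = np.bincount(y_train) / len(y_train)
valid_share = np.bincount(y_valid) / len(y_valid)
print("class shares (train):", train_share.round(3))
print("class shares (valid):", valid_share.round(3))
print(f"macro F1 on validation: {score:.4f}")
```

Printing the class shares for both parts is a simple check that the stratification worked as intended.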
The most important features of the model are shown in decreasing order of importance.
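A ranking like this can be produced from a fitted tree-based model as sketched below. The example again uses scikit-learn's `GradientBoostingClassifier` as a stand-in for CatBoost, and the data is synthetic: the label is built from the Bilirubin column alone, so that column is expected to rank first. This illustrates the mechanics only and does not reproduce the article's ranking.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic features; only the column names are borrowed from the dataset.
X = pd.DataFrame({
    "Bilirubin": rng.normal(size=300),
    "Albumin": rng.normal(size=300),
    "Platelets": rng.normal(size=300),
})

# Synthetic label that depends on the first column only.
y = (X["Bilirubin"] > 0).astype(int)

model = GradientBoostingClassifier(n_estimators=30, random_state=0).fit(X, y)

# Pair each importance with its feature name and sort from most to least important.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(importances)
```

Impurity-based importances like these are normalized to sum to 1, which makes them easy to read as shares but does not by itself show the direction of an effect.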
Summary
This blog post explains an approach to liver disease stage prediction using 3 types of data analytics (Descriptive, Diagnostic, and Predictive Analysis).
Tip for Beginners: There is no free lunch in machine learning; no single model is the best choice for every problem.
GitHub: Notebook
Twitter: Raheem Nasirudeen
LinkedIn: Raheem Nasirudeen
Expect more health-related and data science hackathon approach articles.
Thanks.
References