
Predicting Hospital Length of Stay

Model Overview


Problem Statement:


During the recent COVID-19 situation, managing logistics and bed availability became a hectic task for hospital authorities, as many patients arrive to get admitted into their care. Prior knowledge of how long a patient will stay would therefore help them increase their healthcare management efficiency.


So, predicting the patient's Length of Stay at the time of admission will help the management staff prepare an optimized treatment plan for patients with a high Length of Stay. Prior knowledge of the Length of Stay can also aid logistics such as room and bed allocation planning.


Who can use this?


This use case belongs to the healthcare domain, so the medical industry can use this model to predict the Length of Stay of patients. The model will help hospitals optimize their bed allocation and treatment planning, which in turn lets them manage their time and work better.


Model Solution:


The problem is about classifying the Length of Stay of a patient, so a classification algorithm is suitable for solving it. More about the model is discussed below.


Dataset Source/Description:


The dataset is from a GitHub repository:


https://raw.githubusercontent.com/microsoft/r-server-hospital-length-of-stay/master/Data/LengthOfStay.csv


The dataset consists of 100,000 rows describing patients and the different types of tests performed on them; based on these, we have to predict each patient's Length of Stay. The target column has a Length of Stay ranging from 1 to 17 days.
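
As a minimal sketch (assuming only pandas), the dataset can be loaded directly from the URL above:

import pandas as pd

# Load the Length of Stay dataset straight from the GitHub raw URL
URL = ("https://raw.githubusercontent.com/microsoft/"
       "r-server-hospital-length-of-stay/master/Data/LengthOfStay.csv")
df = pd.read_csv(URL)

print(df.shape)                                           # (100000, 28)
print(df['lengthofstay'].min(), df['lengthofstay'].max()) # 1 17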


Columns Description of Dataset:



  • Eid: Unique ID of the hospital admission

  • Vdate: Visit date

  • Rcount: Number of readmissions within the last 180 days

  • Gender: Gender of the patient [M or F]

  • Dialysisrenalendstage: Flag for renal disease during encounter

  • Asthma: Flag for asthma during encounter

  • Irondef: Flag for iron deficiency during encounter

  • Pneum: Flag for pneumonia during encounter

  • Substancedependence: Flag for substance dependence during encounter

  • Psychologicaldisordermajor: Flag for major psychological disorder during encounter

  • Depress: Flag for depression during encounter

  • Psychother: Flag for other psychological disorder during encounter

  • Fibrosisandother: Flag for fibrosis during encounter

  • Malnutrition: Flag for malnutrition during encounter

  • Hemo: Flag for blood disorder during encounter

  • Hematocrit: Average hematocrit value during encounter (g/dL)

  • Neutrophils: Average neutrophils value during encounter (cells/microL)

  • Sodium: Average sodium value during encounter (mmol/L)

  • Glucose: Average glucose value during encounter (mmol/L)

  • Bloodureanitro: Average blood urea nitrogen value during encounter (mg/dL)

  • Creatinine: Average creatinine value during encounter (mg/dL)

  • Bmi: Average BMI during encounter (kg/m2)

  • Pulse: Average pulse during encounter (beats/min)

  • Respiration: Average respiration during encounter (breaths/min)

  • secondarydiagnosisnonicd9: Flag for whether a non-ICD-9-formatted diagnosis was coded as a secondary diagnosis

  • discharged: Date of discharge

  • facid: Facility ID at which the encounter occurred

  • lengthofstay: Length of stay for the encounter


 


Facid Description:


Capacity	Facid	Unit
90	C	General Medicine 3 South
95	E	Behavioural 1 East
75	A	General Medicine 3 West
80	B	Pulmonary 2 West
100	D	Geriatrics 2 East





Data Pre-processing:


We will drop the eid, vdate, and discharged columns, as they do not provide any useful information for predicting our target value.


Now we will decide on ranges for the Length of Stay: since we cannot be specific about an exact day, a range will be more suitable. Looking at the distribution of Length of Stay, there are very few counts for values greater than 10.


df['lengthofstay'].value_counts()

1     17979
3     16068
4     14822
2     12825
5     12116
6     10362
7      7263
8      4652
9      2184
10     1000
11      460
12      137
13       75
14       31
15       16
16        6
17        4
Name: lengthofstay, dtype: int64



So, after trying different ranges, we decided on the final binning: 1-2, 3-4, and 4+ will be the classes for the first model.


print("1-2: ",len(df.loc[df['lengthofstay'] <= 2])) 
print("3-4 : ", len(df.loc[(df['lengthofstay'] >= 3) & (df['lengthofstay'] <= 4)]))
print("4+ : ", len(df.loc[df['lengthofstay'] > 4]))

1-2:  30804
3-4 : 30890
4+ : 35000


So, whenever the first model predicts 4+, we jump to our second model, whose classes will be 5-6 and 7-10:


print("5-6 : ", len(df.loc[(df['lengthofstay'] >= 5) & (df['lengthofstay'] <= 6)]))
print("7-10 : ", len(df.loc[(df['lengthofstay'] >= 7) & (df['lengthofstay'] <= 10)]))
print("10+ : ", len(df.loc[(df['lengthofstay'] >10)]))

5-6 :  19172
7-10 : 15099
10+ : 729


The reason for doing this is that we were getting a highly imbalanced distribution, and the results would have been biased. So we decided to go with two models: one model predicts the classes [1-2, 3-4, 4+] and the other predicts [5-6, 7-10]. We drop the values greater than 10, as there are very few of them and we would not be able to predict them properly.
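
As a sketch of this two-stage labelling (the helper names, the los_class column, and df_stage2 are illustrative; the bin edges follow the ranges above):

# Stage 1: 1-2, 3-4, and everything above 4 collapses into '4+'
def stage1_label(los):
    if los <= 2:
        return '1-2'
    if los <= 4:
        return '3-4'
    return '4+'

# Stage 2 (stays > 4 only): 5-6 vs 7-10; stays above 10 are dropped
def stage2_label(los):
    return '5-6' if los <= 6 else '7-10'

df['los_class'] = df['lengthofstay'].apply(stage1_label)
df_stage2 = df.loc[(df['lengthofstay'] > 4) & (df['lengthofstay'] <= 10)].copy()
df_stage2['los_class'] = df_stage2['lengthofstay'].apply(stage2_label)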


As there were no NaNs present, we moved on to encoding the categorical columns. There were only three: rcount, gender, and facid. rcount had one non-numeric value, 5+, which we replaced with 5 before converting the column to integers; for gender we replaced M and F with 1 and 0; and for facid we were given the capacity of each facility, so we replaced the codes with those capacities.


# 'rcount' contains the string '5+'; replace it so the column can be cast to int
df['rcount'] = df['rcount'].replace('5+', '5')
df['rcount'] = df['rcount'].astype('int64')

# Encode gender as a binary feature
df['gender'] = df['gender'].replace({'F': 0, 'M': 1})

# Replace each facility code with its capacity
mapfacid = {
    'A': 75,
    'B': 80,
    'C': 30,
    'D': 100,
    'E': 95
}
df['facid'] = df['facid'].map(mapfacid)
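
A quick sanity check (purely illustrative) confirms the three columns are now numeric:

# All three encoded columns should now be integer-typed
print(df[['rcount', 'gender', 'facid']].dtypes)
print(df['facid'].unique())   # capacities only, no letter codes left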


We also create a new dataset containing only the values greater than 4, on which we will train our second model.


df2 = df.loc[df.lengthofstay > 4]

We then split the data into training and testing sets, standardize it, and pass it to our models.
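
A minimal sketch of this step with scikit-learn (the los_class label column comes from the illustrative binning sketch above; the split ratio is an assumption):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=['lengthofstay', 'los_class'])
y = df['los_class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=132, stratify=y)

# Fit the scaler on the training split only, to avoid test-set leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)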



Models Evaluated:


I have evaluated the dataset with 6 models:
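
Each model below was trained and scored the same way. A sketch of such a shared evaluation loop follows (the constructor arguments and the macro-averaged F1 are assumptions, not the exact settings used):

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Integer-encode the class labels so every library accepts them
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(criterion='entropy'),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'XGBoost': XGBClassifier(),
    'LGBM': LGBMClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train_enc)
    pred = model.predict(X_test)
    print(f"{name}: accuracy {accuracy_score(y_test_enc, pred):.2%}, "
          f"macro F1 {f1_score(y_test_enc, pred, average='macro'):.2%}")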



  1. Logistic Regression:


Logistic Regression makes use of the sigmoid function, which maps any real value to a value between 0 and 1. It is used to model the relationship between the dependent variable and the independent variables, where the dependent variable is binary in nature, e.g. Success/Failure, Yes/No, True/False. (For multi-class targets such as ours, a multinomial or one-vs-rest extension is used.)


The sigmoid function is:


sigmoid(x) = 1 / (1 + e^(-x))


Accuracy of Logistic Regression: 76%, F1-score: 48.47%
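
A tiny illustration of the sigmoid squashing any real input into (0, 1):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # approx. [0.0067 0.5 0.9933]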


 



  2. Decision Tree Classifier


The general motive of using a Decision Tree is to create a model that can predict the class or value of the target variable by learning decision rules inferred from prior training data. In this model we have used entropy as the splitting criterion, which uses Information Gain as its metric. ID3 is the Decision Tree algorithm that uses entropy and information gain:


Entropy(S) = - Σ p_i · log2(p_i),  for i = 1 … n

where n = number of classes and p_i = the proportion of samples belonging to class i.


Accuracy of Decision Tree Classifier: 85%, F1-score: 57.54%
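
A quick numeric illustration of the entropy formula (the class proportions are made up):

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # classes with zero probability contribute nothing
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))       # 1.0 -> maximum uncertainty for two classes
print(entropy([1.0, 0.0]))       # 0.0 -> a perfectly pure node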


 



  3. Random Forest Classifier


Random Forest is a type of ensemble learning in which one combines different algorithms, or the same algorithm many times, to get a more powerful prediction model. A random forest combines hundreds of decision trees and trains each decision tree on a different bootstrap sample of the observations. This concept is known as "bagging" and is very popular for its ability to reduce variance and overfitting.


 


 


Accuracy of Random Forest Classifier: 91%, F1-score: 69.12%
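
A compact from-scratch sketch of the bagging idea (reusing the arrays from the split and label-encoding sketches above; 100 trees is an arbitrary choice):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
trees = []
for _ in range(100):
    # Bootstrap: sample rows with replacement, train one tree per sample
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train_enc[idx]))

# Majority vote across the ensemble reduces the variance of a single tree
votes = np.stack([t.predict(X_test) for t in trees])
bagged = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print('Bagged accuracy:', (bagged == y_test_enc).mean())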


 



  4. Gradient Boosting Classifier


The term gradient boosting consists of two sub-terms: gradient and boosting. Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. Gradient boosting frames learning as a numerical optimization problem where the objective is to minimize the loss function of the model by adding weak learners using gradient descent. Gradient boosting does not modify the sample distribution: the weak learners train on the remaining residual errors of the strong learner (i.e., the pseudo-residuals). Training on the residuals of the model is an alternative way of giving more importance to misclassified observations.


 


 


Accuracy of Gradient Boosting Classifier: 81.86%, F1-score: 72.06%
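
One boosting step at a time on a synthetic regression target shows the residual-fitting idea (data and hyperparameters are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy).ravel()

pred = np.full_like(y_toy, y_toy.mean())       # start from a constant model
for step in range(3):
    residuals = y_toy - pred                   # pseudo-residuals for squared loss
    stump = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals)
    pred += 0.1 * stump.predict(X_toy)         # shrink each step by the learning rate
    print(f'step {step + 1}, MSE: {np.mean((y_toy - pred) ** 2):.4f}')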


 



  5. XGBoost Classifier


XGBoost stands for eXtreme Gradient Boosting. It is a decision-tree-based ensemble ML algorithm that uses a gradient boosting framework. Its objective function is the sum of a specific loss function evaluated over all predictions plus a regularization term summed over all predictors (the K trees).


Accuracy of XGBoost Classifier: 80.34%, F1-score: 68.84%
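
The regularized objective shows up directly in the constructor; a sketch with illustrative (untuned) values, reusing the encoded labels from the evaluation loop above:

from xgboost import XGBClassifier

clf = XGBClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.1,
    reg_lambda=1.0,   # L2 penalty on leaf weights (the regularization term)
    reg_alpha=0.0,    # optional L1 penalty on leaf weights
)
clf.fit(X_train, y_train_enc)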


 



  6. LGBM Classifier:


LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient compared to other boosting algorithms. LightGBM grows trees leaf-wise while other tree-based learning algorithms grow them level-wise: it chooses the leaf with the maximum delta loss to grow. When growing the same leaf, a leaf-wise algorithm can reduce more loss than a level-wise algorithm.


The key difference in speed is that XGBoost splits the tree nodes one level at a time while LightGBM does so one leaf at a time.


 


(Figures omitted: LightGBM's leaf-wise tree growth vs. the level-wise growth of other algorithms.)


 


 


Model Used:


I have chosen the LGBM Classifier as the final model for prediction. After training it on the training dataset, I checked the model's efficiency: compared to all the other models it performed best, and it did so on both datasets.



  • LGBM Classifier Model:


LGBMClassifier(max_depth=10, random_state=132, learning_rate=0.1, n_estimators=500)


  • LGBM Classifier 1 Model:


LGBMClassifier(max_depth=10, random_state=132, learning_rate=0.1, n_estimators=1000)

Where,


max_depth (int, optional (default=-1)) – Maximum tree depth for base learners; this parameter controls the maximum depth of each trained tree.


n_estimators (int, optional (default=100)) – Number of boosted trees to fit; this parameter captures the number of trees we add to the model.


learning_rate (float, optional (default=0.1)) – Boosting learning rate; this parameter has an impact on training accuracy.
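
Putting the two models together at prediction time, a sketch of the cascade described earlier (model_a predicts [1-2, 3-4, 4+] and model_b predicts [5-6, 7-10]; the variable names, including the second dataset's split X2_train/y2_train, are illustrative):

from lightgbm import LGBMClassifier

model_a = LGBMClassifier(max_depth=10, random_state=132,
                         learning_rate=0.1, n_estimators=500)
model_b = LGBMClassifier(max_depth=10, random_state=132,
                         learning_rate=0.1, n_estimators=1000)
model_a.fit(X_train, y_train)        # labels: '1-2', '3-4', '4+'
model_b.fit(X2_train, y2_train)      # labels: '5-6', '7-10' (stays > 4 only)

def predict_los(X):
    # Stage 1 gives the coarse range; stage 2 refines the '4+' predictions
    pred = model_a.predict(X).astype(object)
    mask = pred == '4+'
    if mask.any():
        pred[mask] = model_b.predict(X[mask])
    return pred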


 


Solution Efficiency:


Models	Accuracy	F1-Score
LGBM Classifier	94%	93.76%
LGBM Classifier 1	94%	94.34%
