
Predicting Hospital Length of Stay

Model Overview


Problem Statement:


During the recent COVID-19 situation, managing logistics and bed availability became a hectic task for hospital authorities, as many patients arrive to get admitted into their care. Prior knowledge of how long a patient will stay would therefore help them increase their healthcare management efficiency.


So, predicting the patient's Length of Stay at the time of admission will help the management staff prepare an optimized treatment plan for patients with a high Length of Stay. Prior knowledge of the Length of Stay can also aid logistics such as room and bed allocation planning.


Who can use this?


This use case belongs to the healthcare domain, so the medical industry can use this model to predict the Length of Stay of patients. The model will help hospitals optimize their bed allocation and treatment planning, which in turn lets them manage their time and work better.


Model Solution:


The problem is about classifying the Length of Stay of a patient, so a classification algorithm is suitable for solving it. More about the model is discussed below.


Dataset Source/Description:


The dataset is from a GitHub repository:


https://raw.githubusercontent.com/microsoft/r-server-hospital-length-of-stay/master/Data/LengthOfStay.csv


The dataset consists of 100,000 rows describing patients and the different types of tests performed on them; based on these, we have to predict each patient's Length of Stay. The target column has a Length of Stay ranging from 1 to 17 days.
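
As a minimal sketch (assuming only pandas), the dataset can be loaded directly from the URL above:

import pandas as pd

# Load the Length of Stay dataset straight from the GitHub raw URL
URL = ("https://raw.githubusercontent.com/microsoft/"
       "r-server-hospital-length-of-stay/master/Data/LengthOfStay.csv")
df = pd.read_csv(URL)

print(df.shape)                                           # (100000, 28)
print(df['lengthofstay'].min(), df['lengthofstay'].max()) # 1 17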


Columns Description of Dataset:



  • Eid: Unique ID of the hospital admission

  • Vdate: Visit date

  • Rcount: Number of readmissions within the last 180 days

  • Gender: Gender of the patient [M or F]

  • Dialysisrenalendstage: Flag for renal disease during encounter

  • Asthma: Flag for asthma during encounter

  • Irondef: Flag for iron deficiency during encounter

  • Pneum: Flag for pneumonia during encounter

  • Substancedependence: Flag for substance dependence during encounter

  • Psychologicaldisordermajor: Flag for major psychological disorder during encounter

  • Depress: Flag for depression during encounter

  • Psychother: Flag for other psychological disorder during encounter

  • Fibrosisandother: Flag for fibrosis during encounter

  • Malnutrition: Flag for malnutrition during encounter

  • Hemo: Flag for blood disorder during encounter

  • Hematocrit: Average hematocrit value during encounter (g/dL)

  • Neutrophils: Average neutrophils value during encounter (cells/microL)

  • Sodium: Average sodium value during encounter (mmol/L)

  • Glucose: Average glucose value during encounter (mmol/L)

  • Bloodureanitro: Average blood urea nitrogen value during encounter (mg/dL)

  • Creatinine: Average creatinine value during encounter (mg/dL)

  • Bmi: Average BMI during encounter (kg/m2)

  • Pulse: Average pulse during encounter (beats/min)

  • Respiration: Average respiration during encounter (breaths/min)

  • secondarydiagnosisnonicd9: Flag for whether a non-ICD-9-formatted diagnosis was coded as a secondary diagnosis

  • discharged: Date of discharge

  • facid: Facility ID at which the encounter occurred

  • lengthofstay: Length of stay for the encounter


 


Facid Description:


Capacity	Facid	Unit
90	C	General Medicine 3 South
95	E	Behavioural 1 East
75	A	General Medicine 3 West
80	B	Pulmonary 2 West
100	D	Geriatrics 2 East





Data Pre-processing:


We will drop the eid, vdate, and discharged columns, as they do not provide any useful information for predicting our target value.


Now we will decide on ranges for the Length of Stay: since we cannot be specific about an exact day, a range will be more suitable. Looking at the distribution of Length of Stay, there are very few counts for values greater than 10.


df['lengthofstay'].value_counts()

1     17979
3     16068
4     14822
2     12825
5     12116
6     10362
7      7263
8      4652
9      2184
10     1000
11      460
12      137
13       75
14       31
15       16
16        6
17        4
Name: lengthofstay, dtype: int64



So, after trying different ranges, we decided on the final binning: 1-2, 3-4, and 4+ will be the classes for the first model.


print("1-2: ",len(df.loc[df['lengthofstay'] <= 2])) 
print("3-4 : ", len(df.loc[(df['lengthofstay'] >= 3) & (df['lengthofstay'] <= 4)]))
print("4+ : ", len(df.loc[df['lengthofstay'] > 4]))

1-2:  30804
3-4 : 30890
4+ : 35000


So, whenever the first model predicts 4+, we jump to our second model, whose classes will be 5-6 and 7-10:


print("5-6 : ", len(df.loc[(df['lengthofstay'] >= 5) & (df['lengthofstay'] <= 6)]))
print("7-10 : ", len(df.loc[(df['lengthofstay'] >= 7) & (df['lengthofstay'] <= 10)]))
print("10+ : ", len(df.loc[(df['lengthofstay'] >10)]))

5-6 :  19172
7-10 : 15099
10+ : 729


The reason for doing this is that we were getting a highly imbalanced distribution, and the results would have been biased. So we decided to go with two models: one model predicts the classes [1-2, 3-4, 4+] and the other predicts [5-6, 7-10]. We drop the values greater than 10, as there are very few of them and we would not be able to predict them properly.
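
As a sketch of this two-stage labelling (the helper names, the los_class column, and df_stage2 are illustrative; the bin edges follow the ranges above):

# Stage 1: 1-2, 3-4, and everything above 4 collapses into '4+'
def stage1_label(los):
    if los <= 2:
        return '1-2'
    if los <= 4:
        return '3-4'
    return '4+'

# Stage 2 (stays > 4 only): 5-6 vs 7-10; stays above 10 are dropped
def stage2_label(los):
    return '5-6' if los <= 6 else '7-10'

df['los_class'] = df['lengthofstay'].apply(stage1_label)
df_stage2 = df.loc[(df['lengthofstay'] > 4) & (df['lengthofstay'] <= 10)].copy()
df_stage2['los_class'] = df_stage2['lengthofstay'].apply(stage2_label)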


As there were no NaNs present, we moved on to encoding the categorical columns. There were only three: rcount, gender, and facid. rcount had one non-numeric value, 5+, which we replaced with 5 before converting the column to integers; for gender we replaced M and F with 1 and 0; and for facid we were given the capacity of each facility, so we replaced the codes with those capacities.


# 'rcount' contains the string '5+'; replace it so the column can be cast to int
df['rcount'] = df['rcount'].replace('5+', '5')
df['rcount'] = df['rcount'].astype('int64')

# Encode gender as a binary feature
df['gender'] = df['gender'].replace({'F': 0, 'M': 1})

# Replace each facility code with its capacity
mapfacid = {
    'A': 75,
    'B': 80,
    'C': 30,
    'D': 100,
    'E': 95
}
df['facid'] = df['facid'].map(mapfacid)
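
A quick sanity check (purely illustrative) confirms the three columns are now numeric:

# All three encoded columns should now be integer-typed
print(df[['rcount', 'gender', 'facid']].dtypes)
print(df['facid'].unique())   # capacities only, no letter codes left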


We also create a new dataset containing only the values greater than 4, on which we will train our second model.


df2 = df.loc[df.lengthofstay > 4]

We then split the data into training and testing sets, standardize it, and pass it to our models.
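
A minimal sketch of this step with scikit-learn (the los_class label column comes from the illustrative binning sketch above; the split ratio is an assumption):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = df.drop(columns=['lengthofstay', 'los_class'])
y = df['los_class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=132, stratify=y)

# Fit the scaler on the training split only, to avoid test-set leakage
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)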



Models Evaluated:


I have evaluated the dataset with 6 models:
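
Each model below was trained and scored the same way. A sketch of such a shared evaluation loop follows (the constructor arguments and the macro-averaged F1 are assumptions, not the exact settings used):

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Integer-encode the class labels so every library accepts them
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(criterion='entropy'),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'XGBoost': XGBClassifier(),
    'LGBM': LGBMClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train_enc)
    pred = model.predict(X_test)
    print(f"{name}: accuracy {accuracy_score(y_test_enc, pred):.2%}, "
          f"macro F1 {f1_score(y_test_enc, pred, average='macro'):.2%}")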



  1. Logistic Regression:


Logistic Regression makes use of the sigmoid function, which maps any real value to a value between 0 and 1. It is used to model the relationship between the dependent variable and the independent variables, where the dependent variable is binary in nature, e.g. Success/Failure, Yes/No, True/False. (For multi-class targets such as ours, a multinomial or one-vs-rest extension is used.)


The sigmoid function is:


sigmoid(x) = 1 / (1 + e^(-x))


Accuracy of Logistic Regression: 76%, F1-score: 48.47%
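
A tiny illustration of the sigmoid squashing any real input into (0, 1):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # approx. [0.0067 0.5 0.9933]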


 



  2. Decision Tree Classifier


The general motive of using a Decision Tree is to create a model that can predict the class or value of the target variable by learning decision rules inferred from prior training data. In this model we have used entropy as the splitting criterion, which uses Information Gain as its metric. ID3 is the Decision Tree algorithm that uses entropy and information gain:


Entropy(S) = - Σ p_i · log2(p_i),  for i = 1 … n

where n = number of classes and p_i = the proportion of samples belonging to class i.


Accuracy of Decision Tree Classifier: 85%, F1-score: 57.54%
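
A quick numeric illustration of the entropy formula (the class proportions are made up):

import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                 # classes with zero probability contribute nothing
    return -np.sum(p * np.log2(p))

print(entropy([0.5, 0.5]))       # 1.0 -> maximum uncertainty for two classes
print(entropy([1.0, 0.0]))       # 0.0 -> a perfectly pure node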


 



  3. Random Forest Classifier


Random Forest is a type of ensemble learning in which one combines different algorithms, or the same algorithm many times, to get a more powerful prediction model. A random forest combines hundreds of decision trees and trains each decision tree on a different bootstrap sample of the observations. This concept is known as "bagging" and is very popular for its ability to reduce variance and overfitting.


 


 


Accuracy of Random Forest Classifier: 91%, F1-score: 69.12%
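
A compact from-scratch sketch of the bagging idea (reusing the arrays from the split and label-encoding sketches above; 100 trees is an arbitrary choice):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
trees = []
for _ in range(100):
    # Bootstrap: sample rows with replacement, train one tree per sample
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier().fit(X_train[idx], y_train_enc[idx]))

# Majority vote across the ensemble reduces the variance of a single tree
votes = np.stack([t.predict(X_test) for t in trees])
bagged = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print('Bagged accuracy:', (bagged == y_test_enc).mean())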


 



  4. Gradient Boosting Classifier


The term gradient boosting consists of two sub-terms: gradient and boosting. Gradient descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. Gradient boosting frames learning as a numerical optimization problem where the objective is to minimize the loss function of the model by adding weak learners using gradient descent. Gradient boosting does not modify the sample distribution: the weak learners train on the remaining residual errors of the strong learner (i.e., the pseudo-residuals). Training on the residuals of the model is an alternative way of giving more importance to misclassified observations.


 


 


Accuracy of Gradient Boosting Classifier: 81.86%, F1-score: 72.06%
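
One boosting step at a time on a synthetic regression target shows the residual-fitting idea (data and hyperparameters are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(200, 1))
y_toy = np.sin(X_toy).ravel()

pred = np.full_like(y_toy, y_toy.mean())       # start from a constant model
for step in range(3):
    residuals = y_toy - pred                   # pseudo-residuals for squared loss
    stump = DecisionTreeRegressor(max_depth=2).fit(X_toy, residuals)
    pred += 0.1 * stump.predict(X_toy)         # shrink each step by the learning rate
    print(f'step {step + 1}, MSE: {np.mean((y_toy - pred) ** 2):.4f}')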


 



  5. XGBoost Classifier


XGBoost stands for eXtreme Gradient Boosting. It is a decision-tree-based ensemble ML algorithm that uses a gradient boosting framework. Its objective function is the sum of a specific loss function evaluated over all predictions plus a regularization term summed over all predictors (the K trees).


Accuracy of XGBoost Classifier: 80.34%, F1-score: 68.84%
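
The regularized objective shows up directly in the constructor; a sketch with illustrative (untuned) values, reusing the encoded labels from the evaluation loop above:

from xgboost import XGBClassifier

clf = XGBClassifier(
    n_estimators=500,
    max_depth=6,
    learning_rate=0.1,
    reg_lambda=1.0,   # L2 penalty on leaf weights (the regularization term)
    reg_alpha=0.0,    # optional L1 penalty on leaf weights
)
clf.fit(X_train, y_train_enc)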


 



  6. LGBM Classifier:


LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient compared to other boosting algorithms. LightGBM grows trees leaf-wise while other tree-based learning algorithms grow them level-wise: it chooses the leaf with the maximum delta loss to grow. When growing the same leaf, a leaf-wise algorithm can reduce more loss than a level-wise algorithm.


The key difference in speed is that XGBoost splits the tree nodes one level at a time while LightGBM does so one leaf at a time.


 


(Figures omitted: LightGBM's leaf-wise tree growth vs. the level-wise growth of other algorithms.)


 


 


Model Used:


I have chosen the LGBM Classifier as the final model for prediction. After training it on the training dataset, I checked the model's efficiency: compared to all the other models it performed best, and it did so on both datasets.



  • LGBM Classifier Model:


LGBMClassifier(max_depth=10, random_state=132, learning_rate=0.1, n_estimators=500)


  • LGBM Classifier 1 Model:


LGBMClassifier(max_depth=10, random_state=132, learning_rate=0.1, n_estimators=1000)

Where,


max_depth (int, optional (default=-1)) – Maximum tree depth for base learners; this parameter controls the maximum depth of each trained tree.


n_estimators (int, optional (default=100)) – Number of boosted trees to fit; this parameter captures the number of trees we add to the model.


learning_rate (float, optional (default=0.1)) – Boosting learning rate; this parameter has an impact on training accuracy.
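
Putting the two models together at prediction time, a sketch of the cascade described earlier (model_a predicts [1-2, 3-4, 4+] and model_b predicts [5-6, 7-10]; the variable names, including the second dataset's split X2_train/y2_train, are illustrative):

from lightgbm import LGBMClassifier

model_a = LGBMClassifier(max_depth=10, random_state=132,
                         learning_rate=0.1, n_estimators=500)
model_b = LGBMClassifier(max_depth=10, random_state=132,
                         learning_rate=0.1, n_estimators=1000)
model_a.fit(X_train, y_train)        # labels: '1-2', '3-4', '4+'
model_b.fit(X2_train, y2_train)      # labels: '5-6', '7-10' (stays > 4 only)

def predict_los(X):
    # Stage 1 gives the coarse range; stage 2 refines the '4+' predictions
    pred = model_a.predict(X).astype(object)
    mask = pred == '4+'
    if mask.any():
        pred[mask] = model_b.predict(X[mask])
    return pred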


 


Solution Efficiency:


Models	Accuracy	F1-Score
LGBM Classifier	94%	93.76%
LGBM Classifier 1	94%	94.34%
