
Prasad Chaskar's other Models Reports


Brain Stroke Prediction

Models Status

Model Overview

Use Case Summary

Problem Statement

Visualize the relationships between various healthy and unhealthy habits and strokes, and thereby predict whether a person has had a brain stroke, using the best model with tuned hyperparameters. We have seen various apps and websites that claim to act as a doctor on the basis of their models; we are trying to build something similar here. This model can also be a first step towards awareness of this killer at an early stage.

 Attribute Information
1) gender: Male(1), Female(0),Other(2)
2) age: age of the patient
3) hypertension: 0 if the patient doesn't have hypertension, 1 if the patient has hypertension
4) heart_disease: 0 if the patient doesn't have any heart diseases, 1 if the patient has a heart disease
5) ever_married: Yes(1), No(0)
6) work_type: Private(2), Self-employed(3), Govt_job(0), children(4), Never_worked(1)
7) Residence_type: Rural(0), Urban(1)
8) avg_glucose_level: average glucose level in blood
9) bmi: body mass index
10) smoking_status: formerly smoked(1), never smoked(2), smokes(3) and Unknown(0)

Output Description
Given input parameters such as gender, age, existing diseases, and smoking status, the model predicts whether the person has had a brain stroke.
The output is 1 if the patient had a stroke and 0 if not.

How to Use model?
1) Choose inputs with correct values.
2) Click on Predict.
3) Your output is ready.

Let's look at the code...

What is Brain Stroke?
A stroke occurs when the blood supply to part of your brain is interrupted or reduced, preventing brain tissue from getting oxygen and nutrients. Brain cells begin to die in minutes. A stroke is a medical emergency, and prompt treatment is crucial. Early action can reduce brain damage and other complications.

Symptoms: If you or someone you're with may be having a stroke, pay particular attention to the time the symptoms began. Some treatment options are most effective when given soon after a stroke begins. Signs and symptoms of stroke include:

  • Trouble speaking and understanding what others are saying. You may experience confusion, slur your words or have difficulty understanding speech.

  • Paralysis or numbness of the face, arm or leg. You may develop sudden numbness, weakness or paralysis in your face, arm or leg. This often affects just one side of your body. Try to raise both your arms over your head at the same time. If one arm begins to fall, you may be having a stroke. Also, one side of your mouth may droop when you try to smile.

  • Problems seeing in one or both eyes. You may suddenly have blurred or blackened vision in one or both eyes, or you may see double.

  • A sudden, severe headache, which may be accompanied by vomiting, dizziness or altered consciousness, may indicate that you're having a stroke.

  • Trouble walking. You may stumble or lose your balance. You may also have sudden dizziness or a loss of coordination.

    import pandas as pd

    import numpy as np

    import matplotlib.pyplot as plt

    import seaborn as sns

    from sklearn.metrics import f1_score
    from sklearn.preprocessing import LabelEncoder,OneHotEncoder
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.metrics import roc_auc_score

Read Data From CSV.

stroke_df = pd.read_csv('healthcare-dataset-stroke-data.csv')

Identify Categorical and Numerical Features

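categorical_vars = list()
numerical_vars = list()

for i in stroke_df.columns:
    # Object-dtype columns hold strings, so treat them as categorical.
    if stroke_df[i].dtype == 'object':
        categorical_vars.append(i)
    else:
        numerical_vars.append(i)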

Inference :

The id column is not needed for prediction, so we drop it.
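The code for this step is not shown above, so the following is only a minimal sketch of one way to do it. It assumes the identifier column is literally named id; the toy rows are made up for illustration and stand in for the real stroke_df.

```python
import pandas as pd

# Toy frame standing in for stroke_df; the values are illustrative only.
stroke_df = pd.DataFrame({
    "id": [101, 102, 103],
    "age": [67.0, 61.0, 80.0],
    "stroke": [1, 1, 0],
})

# The id column is only a row identifier, so it is removed before modelling.
stroke_df = stroke_df.drop(columns=["id"])
print(list(stroke_df.columns))
```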



Check for NULL Values


Inference :

No column in our dataset has null values except bmi.


print("Total Rows In BMI column :",len(stroke_df.bmi))
print("Total null values present in bmi column :",stroke_df.bmi.isnull().sum())

Handling Missing Values
There are many ways to handle missing values.
One is to delete the rows that contain null values.
But this can lose a lot of information.
Another way is to replace null values with the mean or median of the column.
The second method is effective when the column is numeric and continuous, and the good news is that our bmi column fits this condition perfectly.
So we use the second method.
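The imputation code itself is not shown here, so this is a hedged sketch of the second method. It assumes the mean is used (the median would work the same way through .median()), and the toy bmi values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy bmi column with one missing value; the numbers are illustrative only.
df = pd.DataFrame({"bmi": [20.0, np.nan, 30.0, 25.0]})

# Mean of the observed values; pandas skips NaN entries by default.
bmi_mean = df["bmi"].mean()

# Replace every missing entry with that mean.
df["bmi"] = df["bmi"].fillna(bmi_mean)

print("Null values left in bmi:", df["bmi"].isnull().sum())
```

On the real data the same two lines would be applied to stroke_df.bmi.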

EDA : 

sns.countplot(x=stroke_df['stroke'])
plt.title("Countplot for Stroke",{'fontsize':20});

Inference :
Based on the distribution of the stroke feature, we can say that the dataset is imbalanced.
We have far more records of patients who had no stroke than of patients who had a stroke.
We handle the imbalance later.
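As a rough illustration of how the imbalance can be quantified rather than read off a plot, the sketch below computes the share of each class. The 9:1 toy target is made up; on the real data the same call would be made on stroke_df.stroke.

```python
import pandas as pd

# Toy target column; the 9:1 ratio is made up for illustration.
stroke = pd.Series([0] * 9 + [1], name="stroke")

# Fraction of rows in each class; a heavily skewed split signals imbalance.
class_share = stroke.value_counts(normalize=True)
print(class_share)
```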

fig, axes = plt.subplots(2, 2, figsize=(12, 8), sharey=True)
fig.suptitle('Distribution CountPlot Some Features')

sns.countplot(ax=axes[0][0], x=stroke_df['hypertension'],palette="viridis")

sns.countplot(ax=axes[0][1], x=stroke_df['work_type'],palette="rocket");

sns.countplot(ax=axes[1][0], x=stroke_df['ever_married'],palette="husl");

sns.countplot(ax=axes[1][1], x=stroke_df['work_type'],palette="husl");

Distribution based on Stroke Patients

counts = stroke_df.smoking_status[stroke_df.stroke == 1].value_counts()
plt.pie(x=counts,
        # explode = (0, 0, 0, 0.2),
        shadow=True, colors=['plum','lightpink','lawngreen','cyan']);
# Take the legend labels from the counts so they match the slice order.
plt.legend(counts.index.tolist(),bbox_to_anchor=(1.05,1.025), loc="upper left");
plt.title("Stroke patients by smoking status",{'fontsize':20});


counts = stroke_df.work_type[stroke_df.stroke == 1].value_counts()
plt.pie(x=counts,
        explode=[0] * (len(counts) - 1) + [0.2],  # pull out the smallest slice
        shadow=True, colors=['royalblue','darkorange','springgreen','lightcyan','lavender']);
# Take the legend labels from the counts so they match the slice order.
plt.legend(counts.index.tolist(),bbox_to_anchor=(1.05,1.025), loc="upper left");
plt.title("Stroke patients by work type",{'fontsize':20});

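Inference : Based on the distribution, more stroke patients work in the private sector than in government jobs. Note that the chart shows raw counts of stroke patients, not the stroke rate within each work type.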

X = stroke_df.drop('stroke',axis=1)
y = stroke_df.stroke

X.age = round(X.age)

Convert Categorical Variables into numeric using Label Encoder.

encoder = LabelEncoder()

objList = X.select_dtypes(include = "object").columns
for feat in objList:
    X[feat] = encoder.fit_transform(X[feat])
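To see where the integer codes in the attribute list come from, here is a small sketch using the work_type categories listed earlier. LabelEncoder sorts the labels and numbers them in that order; the variable names work_type and mapping are illustrative, not part of the original code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The five work_type categories from the attribute list.
work_type = pd.Series(
    ["Private", "Self-employed", "Govt_job", "children", "Never_worked"]
)

encoder = LabelEncoder()
codes = encoder.fit_transform(work_type)

# classes_ holds the sorted labels; a label's position is its integer code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Because the loop above reuses a single encoder, only the mapping of the last column remains in encoder.classes_ afterwards; keeping one encoder per column would preserve every mapping for decoding later.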

Handling Imbalance Data
SMOTE (Synthetic Minority Over-sampling Technique) works in 4 simple steps:

  • Choose a minority-class sample as the input vector.

  • Find its k nearest neighbors (k_neighbors is specified as an argument in the SMOTE() function).

  • Choose one of these neighbors and place a synthetic point anywhere on the line joining the point under consideration and its chosen neighbor.

  • Repeat the steps until data is balanced.
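The four steps can be sketched from scratch with NumPy. This is a simplified illustration, not the imblearn implementation: the function name smote_sketch, its parameters, and the toy points are all assumptions made for this example.

```python
import numpy as np

def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        # Step 1: pick a minority-class sample at random.
        point = minority[rng.integers(len(minority))]
        # Step 2: find its k nearest minority neighbors (index 0 is the point itself).
        dists = np.linalg.norm(minority - point, axis=1)
        neighbors = np.argsort(dists)[1:k + 1]
        # Step 3: place a new point on the segment between the sample and one neighbor.
        neighbor = minority[rng.choice(neighbors)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(point + gap * (neighbor - point))
    # Step 4: the loop repeats until the requested number of rows is generated.
    return np.array(synthetic)

# Toy minority class: the four corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sketch(minority, n_new=6)
print(new_points.shape)
```

Every synthetic point lies between two real minority samples, which is why SMOTE adds variety instead of duplicating rows.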

from imblearn.over_sampling import SMOTE
smote = SMOTE()

x_smote, y_smote = smote.fit_resample(X, y)

Splitting Data
To estimate how well a model generalizes, divide the data into a training set and a testing set: the model is fit on the training set and then evaluated on the held-out testing set, which it has not seen during training.

X_train,X_test,y_train,y_test = train_test_split(x_smote,y_smote,test_size=0.28)

Feature Scaling : 

scalar = StandardScaler()
X_train_scaled = scalar.fit_transform(X_train)
# Reuse the statistics learned from the training set; do not refit on the test set.
X_test_scaled = scalar.transform(X_test)

Logistic Regression

log_reg = LogisticRegression().fit(X_train_scaled, y_train)


Random Forest

rf = RandomForestClassifier().fit(X_train_scaled, y_train)

Output : 0.9309585016525891

Classification Report 

rf_pred = rf.predict(X_test_scaled)
log_pred = log_reg.predict(X_test_scaled)

print("Classification Report for Random Forest")
print(classification_report(y_test, rf_pred))
print("Classification Report for Logistic Regression")
print(classification_report(y_test, log_pred))

Confusion Matrix

Random Forest

class_names = [0,1]
fig,ax = plt.subplots()
tick_marks = np.arange(len(class_names))

cnf_matrix = confusion_matrix(y_test,rf_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot = True, cmap="Blues",
            fmt = 'g')
plt.title('Heat Map for Random Forest', {'fontsize':20})
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Logistic Regression

class_names = [0,1]
fig,ax = plt.subplots()
tick_marks = np.arange(len(class_names))

cnf_matrix = confusion_matrix(y_test,log_pred)
sns.heatmap(pd.DataFrame(cnf_matrix), annot = True, cmap = 'Blues',
            fmt = 'g')
plt.title('Heat Map for Logistic Regression', {'fontsize':20})
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

ROC Curves

pred_prob1 = log_reg.predict_proba(X_test_scaled)
pred_prob2 = rf.predict_proba(X_test_scaled)
from sklearn.metrics import roc_curve

# roc curve for models
fpr1, tpr1, thresh1 = roc_curve(y_test, pred_prob1[:,1], pos_label=1)
fpr2, tpr2, thresh2 = roc_curve(y_test, pred_prob2[:,1], pos_label=1)

# roc curve for tpr = fpr
random_probs = [0 for i in range(len(y_test))]
p_fpr, p_tpr, _ = roc_curve(y_test, random_probs, pos_label=1)
plt.style.use('seaborn-v0_8')  # this style was named 'seaborn' before Matplotlib 3.6
# plot roc curves
plt.plot(fpr1, tpr1, linestyle='--',color='orange', label='Logistic Regression')
plt.plot(fpr2, tpr2, linestyle='--',color='green', label='Random Forest')
plt.plot(p_fpr, p_tpr, linestyle='--', color='blue')
# title
plt.title('ROC curve')
# x label
plt.xlabel('False Positive Rate')
# y label
plt.ylabel('True Positive rate')
# legend for the labelled curves
plt.legend(loc='best')


(The F1 score is 0.80 for Logistic Regression and 0.92 for Random Forest.)
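The code that produced these scores is not shown, so the following is only a sketch of how F1 and ROC AUC can be computed with the f1_score and roc_auc_score functions imported at the top. The toy labels and scores are made up; on the real data, y_test, the predictions, and the positive-class probabilities would be used instead.

```python
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

# Toy ground truth, hard predictions, and positive-class scores (illustrative only).
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_score = [0.1, 0.6, 0.8, 0.9, 0.4, 0.2]

# F1 by hand from the confusion matrix: 2*TP / (2*TP + FP + FN).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
f1_manual = 2 * tp / (2 * tp + fp + fn)

# The same metric from scikit-learn, plus the area under the ROC curve.
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)

print(f"F1: {f1:.2f}, ROC AUC: {auc:.2f}")
```

F1 depends on the hard 0/1 predictions, while ROC AUC is computed from the predicted probabilities, so the two metrics can rank models differently.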

Feature Importance For Random Forest Model

feature_imp1 = rf.feature_importances_
sns.barplot(x=feature_imp1, y=X.columns)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.title("Visualizing Important Features For Random Forest ",{'fontsize':25});
feature_dict = {k:v for (k,v) in zip(X.columns,feature_imp1)}

Conclusion :
We started by reading the data and separating the categorical features from the numerical ones. After that, we dealt with the missing values in the bmi feature.
Then we performed EDA on the features and concluded that the data is imbalanced, i.e. negative-class examples far outnumber positive-class examples.
After visualization, we handled the imbalance with SMOTE.
Then we moved to the most important part, model building. Before training, we split the data into training data (for fitting the models) and test data (for validation) and performed feature scaling.
Random Forest and Logistic Regression models were tried.
To check which model performs best, we plotted ROC curves along with classification reports and confusion matrices.
Random Forest wins the race, with an F1 score of 0.92 against 0.80 for Logistic Regression.