Prasad Chaskar's other Models Reports


Thyroid Disease Prediction

Model Overview



Introduction :

Thyroid disease affects a large part of the world's population, and the number of cases is increasing. Because medical reports show serious imbalances related to thyroid disease, this study deals with classifying thyroid disease, which includes hyperthyroidism and hypothyroidism.
The disease was classified using several machine learning algorithms, which gave good results, and the work was built in the form of two models.

Required Libraries :
1. pandas
2. numpy
3. matplotlib
4. sklearn
5. seaborn
6. imblearn (used below for SMOTE)

Problem Statement :

To predict whether a person has thyroid disease based on various biological and physical parameters of the body.
To build a model with high accuracy and precision that can predict the results with greater confidence.

Data Description:

The dataset contains 3772 training instances and 3 classes.
Dataset Link - https://archive.ics.uci.edu/ml/datasets/thyroid+disease


Features :

- age: The person's age in years
- sex: The person's sex
- TSH (Thyroid-stimulating hormone): level of this hormone, measured by a blood test.
- T3: level of triiodothyronine, a thyroid hormone.
- thyroid surgery: whether the person has had thyroid surgery.
- TT4: total thyroxine (T4), the main form of thyroid hormone made by the thyroid gland.
- T4U: T4 (thyroxine) uptake.
- FTI: Free thyroxine index.

Import Libraries:

#pandas
import pandas as pd

#numpy
import numpy as np

#matplotlib
import matplotlib.pyplot as plt

#seaborn
import seaborn as sns

#sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import classification_report,f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

#imblearn
from imblearn.over_sampling import SMOTE

#standard library
import pickle
import warnings
warnings.filterwarnings('ignore')

Read Data from csv :

thyroid_df = pd.read_csv('hypothyroid.csv')
thyroid_df.head()


Handling Missing Values :

thyroid_df.isnull().sum()



# fill missing values with mean of that column.
miss_cols = ['FTI','TSH','T3','TT4','T4U']
for i in miss_cols:
    thyroid_df[i] = thyroid_df[i].fillna(thyroid_df[i].mean())

# drop remaining null values
thyroid_df.dropna(inplace=True)

# change dtype of columns which are initially objects
thyroid_df.TT4 = thyroid_df.TT4.astype(int)
thyroid_df.FTI = thyroid_df.FTI.astype(int)
thyroid_df.age = thyroid_df.age.astype(int)
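
One thing worth checking before the mean fill: if missing measurements arrive as a placeholder string such as '?' instead of NaN (an assumption here, so check your own copy of the file), isnull() will not count them and mean() cannot be computed on the column. The sketch below uses a small made-up frame, not the thyroid data, to show the conversion followed by the same mean imputation.

```python
import pandas as pd

# Toy frame (made-up values): '?' stands in for a missing measurement.
toy_df = pd.DataFrame({
    "TSH": ["1.3", "?", "4.1"],
    "TT4": ["125", "102", "?"],
})

# Before conversion, isnull() sees no missing values at all.
nulls_before = int(toy_df.isnull().sum().sum())

# Make the columns numeric; anything that is not a number becomes NaN.
cleaned = toy_df.apply(pd.to_numeric, errors="coerce")
nulls_after = int(cleaned.isnull().sum().sum())

# Mean imputation, column by column, as in the loop above.
imputed = cleaned.fillna(cleaned.mean())

print(nulls_before, nulls_after)
print(imputed)
```

The mean is computed from the observed values only, so each filled cell gets the average of the other rows in that column.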

EDA :
Count plot for Target Column :

sns.countplot(x='Label',data=thyroid_df)
plt.title("Countplot for Target variable");
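
The count plot shows the class balance visually; the same information can be read as numbers. This is a minimal sketch on made-up labels. The 'N' code for the negative class is an assumption, since only 'P' appears in the code here.

```python
import pandas as pd

# Toy labels (made up): 'P' = positive, 'N' = negative (assumed code).
labels = pd.Series(["N"] * 9 + ["P"] * 1, name="Label")

counts = labels.value_counts()                 # absolute count per class
shares = labels.value_counts(normalize=True)   # fraction per class
imbalance_ratio = counts.max() / counts.min()  # majority size / minority size

print(counts.to_dict())
print(shares.round(2).to_dict())
print(imbalance_ratio)
```

A large ratio is a sign that plain accuracy can look good even when the minority class is predicted poorly.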


Perform EDA for Positive class : 

positive_df = thyroid_df[thyroid_df.Label=='P']

plt.figure(figsize=(9,6))
sns.histplot(x='age',data=positive_df,color='blue')
plt.title("Distribution of Positive Class Based on Age",{'fontsize':20});


Inference :



Most patients who suffer from thyroid disease belong to the 50-70 age group.
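
An inference like this can also be checked with numbers by binning the ages. The sketch below uses made-up ages standing in for positive_df.age, and the bin edges are a choice made for this example only.

```python
import pandas as pd

# Toy ages (made up) standing in for positive_df.age.
ages = pd.Series([23, 35, 52, 58, 61, 64, 67, 69, 72, 80])

# Bin edges are a choice made for this sketch; right=False gives [50, 70).
bins = [0, 30, 50, 70, 120]
names = ["<30", "30-49", "50-69", "70+"]
age_group = pd.cut(ages, bins=bins, labels=names, right=False)

# Share of patients in each age group, kept in bin order.
share = age_group.value_counts(normalize=True, sort=False)
print(share)
```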

Pie chart for Sex feature :

plt.figure(figsize=(10,8))
# labels are matched to value_counts() order (most frequent value first)
plt.pie(x=positive_df.sex.value_counts(),
        labels=['Female','Male'],
        startangle = 90,
        colors=['springgreen','orange'],
        autopct='%.2f'
       );
plt.legend();




Inference :


More female patients than male patients have the disease.

plt.figure(figsize=(8,8))
plt.pie(x=positive_df.sick.value_counts(),
        labels=['Sick','Well'],
        startangle = 20,
        colors=['deepskyblue','red'],
        autopct='%.2f',
        explode=[0,0.2]
       );
plt.legend();



Transform non-numerical labels to numerical labels :

# Map the target label to 1 (positive) or 0 (anything else)
def func(label):
    if label == 'P':
        return 1
    else:
        return 0

# Separate features and target.
# Assumption: the notebook's own definition of X and y is not shown here;
# X is taken as every column except 'Label', and y as 'Label' mapped to 1/0.
X = thyroid_df.drop('Label', axis=1)
y = thyroid_df.Label.apply(func)

# One encoder per categorical column
s_encoder = LabelEncoder()
si_encoder = LabelEncoder()
preg_encoder = LabelEncoder()
th_encoder = LabelEncoder()
treat_encoder = LabelEncoder()
lith_encoder = LabelEncoder()
g_encoder= LabelEncoder()
tu_encoder = LabelEncoder()

X['sex'] = s_encoder.fit_transform(X.sex)
X['I131 treatment'] = treat_encoder.fit_transform(X['I131 treatment'])
X['sick'] = si_encoder.fit_transform(X.sick)
X['pregnant'] = preg_encoder.fit_transform(X.pregnant)
X['thyroid surgery'] = th_encoder.fit_transform(X['thyroid surgery'])
X['lithium'] = lith_encoder.fit_transform(X['lithium'])
X['goitre'] = g_encoder.fit_transform(X['goitre'])
X['tumor'] = tu_encoder.fit_transform(X['tumor'])
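
To see what LabelEncoder does to a column, here is a minimal sketch on a made-up list. The 'F'/'M' codes are an assumption about how sex is stored in the data. The encoder sorts the distinct values and uses each value's position as its numerical label.

```python
from sklearn.preprocessing import LabelEncoder

# Toy column (made up); the 'F'/'M' codes are an assumption about the data.
sex = ["F", "M", "F", "F", "M"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ lists the original values; a value's position is its code.
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}

# inverse_transform recovers the original labels from the codes.
restored = encoder.inverse_transform(codes)

print(mapping)
print(codes.tolist())
print(restored.tolist())
```

Keeping one fitted encoder per column is useful because the same mapping has to be applied to new inputs at prediction time.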

Split Data :

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=11)
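
With an imbalanced target, a plain random split can leave the test set with very few positive rows. One possible variation, not used in the split above, is to pass stratify so both parts keep the same class ratio. The sketch below shows it on made-up data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (made up): 100 rows, 10% positive.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 10 + [0] * 90)

# stratify=y_toy keeps the 10% positive rate in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=11, stratify=y_toy
)

print(len(y_tr), len(y_te))
print(y_tr.mean(), y_te.mean())
```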





Handle Imbalanced Data :

smote = SMOTE(random_state=11)

x_smote, y_smote = smote.fit_resample(X_train, y_train)

print("Shape before the Oversampling : ",X_train.shape)
print("Shape after the Oversampling : ",x_smote.shape)

#Output
# Shape before the Oversampling : (2896, 14)
# Shape after the Oversampling : (5340, 14)
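
In the output above, 5340 = 2 x 2670, which is what you would expect if both classes end up with 2670 rows. Assuming a binary target and SMOTE's default settings, that would mean 2896 - 2670 = 226 minority rows before oversampling. SMOTE creates the extra rows by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified numpy illustration of that idea on made-up points; smote_like is a hypothetical helper, not imblearn's implementation.

```python
import numpy as np

def smote_like(minority, n_new, k=2, seed=0):
    """Simplified SMOTE-style oversampling (illustration only, not imblearn's code)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))      # pick a random minority sample
        nb = neighbours[base, rng.integers(k)]  # pick one of its k nearest neighbours
        gap = rng.random()                      # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[nb] - minority[base])
    return synthetic

# Toy minority class (made up): 4 points in 2 dimensions.
minority = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]])
new_points = smote_like(minority, n_new=6)
print(new_points.shape)
print(new_points.round(2))
```

Every synthetic point lies on a line segment between two real minority points, so oversampling is applied to the training data only and the test data stays untouched.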

Feature Scaling :

scalr = MinMaxScaler()
scale_cols = ['TT4','age','FTI']

#for training data: fit once on all three columns so each column keeps its own min and max
x_smote[scale_cols] = scalr.fit_transform(x_smote[scale_cols])

#for testing data: reuse the min and max learned from the training data
X_test[scale_cols] = scalr.transform(X_test[scale_cols])
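
MinMaxScaler maps each column with x' = (x - min) / (max - min), where min and max come from the data it was fitted on. The sketch below, on made-up numbers, shows that a fitted scaler stores one min and max per column and that test values are scaled with the training range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up): two columns with very different ranges.
train = np.array([[0.0, 10.0],
                  [10.0, 30.0]])
test = np.array([[5.0, 20.0],
                 [20.0, 10.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min and max per column
test_scaled = scaler.transform(test)        # reuses the training min and max

print(scaler.data_min_, scaler.data_max_)
print(test_scaled)
```

A test value outside the training range (20 in the first column) is mapped outside [0, 1], which is expected because the scaler never sees the test data while fitting.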

Build Models :

models = {
    LogisticRegression(max_iter=500):'Logistic Regression',
    SVC():"Support Vector Machine",
    RandomForestClassifier():'Random Forest'
}
for m in models.keys():
    m.fit(x_smote,y_smote)
for model,name in models.items():
    print(f"Accuracy Score for {name} is : ",model.score(X_test,y_test)*100,"%")

#Output
#Accuracy Score for Logistic Regression is : 98.20441988950276 %
#Accuracy Score for Support Vector Machine is : 98.20441988950276 %
#Accuracy Score for Random Forest is : 98.89502762430939 %

Classification Report :

for model,name in models.items():
    y_pred = model.predict(X_test)
    print(f"Classification Report for {name}")
    print("----------------------------------------------------------")
    print(classification_report(y_test,y_pred))
    print("----------------------------------------------------------")
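
The report lists precision, recall and F1 for each class. To make those columns easier to read, here is a small worked example on made-up labels that computes them by hand for the positive class, alongside accuracy from sklearn.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy labels (made up): 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# Count the confusion-matrix cells for the positive class by hand.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

precision = tp / (tp + fp)                          # of predicted positives, how many are right
recall = tp / (tp + fn)                             # of real positives, how many are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy_score(y_true, y_pred))
print(precision, recall, f1)
```

Here accuracy is 0.7 while recall for the positive class is only 0.5, which is why the report is worth reading next to the accuracy score on imbalanced data.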


The Random Forest algorithm gives higher accuracy than the others, so we choose it for prediction.

Feature Importance For Random Forest :

plt.figure(figsize=(9,7))
# rf is the fitted Random Forest taken from the models dict
rf = next(m for m in models if isinstance(m, RandomForestClassifier))
feature_imp1 = rf.feature_importances_
sns.barplot(x=feature_imp1, y=X.columns)
# Add labels to your graph
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features For Random Forest ",{'fontsize':25})
plt.show();
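
The bar plot can be paired with a sorted list of scores, which makes it easier to decide which features contribute little. This is a sketch on synthetic data with hypothetical feature names, not the thyroid features; only the first column determines the label, so it should come out on top.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (made up): only the first column determines the label.
rng = np.random.default_rng(0)
X_demo = rng.random((300, 3))
y_demo = (X_demo[:, 0] > 0.5).astype(int)
names = ["informative", "noise_a", "noise_b"]  # hypothetical feature names

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Importances are non-negative and sum to 1; sort them from high to low.
order = np.argsort(forest.feature_importances_)[::-1]
ranked = [(names[i], float(forest.feature_importances_[i])) for i in order]

for name, score in ranked:
    print(f"{name}: {score:.3f}")
```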




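# Drop some features
drop_cols = ['sick', 'pregnant', 'I131 treatment', 'lithium', 'goitre', 'tumor']
x_smote.drop(drop_cols, axis=1, inplace=True)
X_test.drop(drop_cols, axis=1, inplace=True)

# Retrain Random Forest on the remaining features.
# Assumption: the definition of new_rf is not shown here; default settings are used.
new_rf = RandomForestClassifier()
new_rf.fit(x_smote, y_smote)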

Save Model : 

with open('thyroid.pkl','wb') as f:
    pickle.dump(new_rf,f)
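
Loading works the same way in reverse. The sketch below does a save and load round trip with a small stand-in model on made-up data (not the thyroid model) and checks that the loaded model predicts the same as the original. The file name demo_model.pkl is only for this example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model on made-up data (not the thyroid model itself).
rng = np.random.default_rng(0)
X_demo = rng.random((50, 3))
y_demo = (X_demo[:, 0] > 0.5).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save to a temporary file, then load it back.
path = os.path.join(tempfile.mkdtemp(), "demo_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    loaded = pickle.load(f)

# The loaded model should predict exactly like the original.
same = bool((loaded.predict(X_demo) == model.predict(X_demo)).all())
print(same)
```

At prediction time, new inputs need the same encoding, scaling and column order that the model was trained with. Pickle files should only be loaded from a trusted source.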


Thank you for reading.
Other details and code are available in the notebook.
